The End of U-Net in Visual Generation: Diffusion Transformers (DiT) and Flow Matching

Introduction: The Bottleneck of Convolutional Networks (CNN)

Between 2022 and 2024, the U-Net architecture (Stable Diffusion 1.5, XL) dominated the world of visual generation. This architecture is built mainly on Convolutional Neural Networks (CNN), which denoise an image that a VAE has compressed into latent space. However, the "inductive bias" of CNNs (a focus on local features) limited their understanding of global context and complex compositions.

As of 2025, the Diffusion Transformer (DiT) architecture has shattered this paradigm. Patches of the latent image are now treated as "tokens," just like words. This article examines how DiT brought scalability to the visual world and the mathematics of Flow Matching.

1. Heart of the Architecture: Diffusion Transformer (DiT)

While traditional Diffusion models used U-Net to predict noise, DiT delegates this task to a standard Transformer (ViT - Vision Transformer) block.

"Patchify" Process and Latent Tokenization

DiT does not process the image pixels directly. First, a VAE (Variational Autoencoder) compresses the image into a smaller "Latent Space" representation. Then:

Patching: If the latent image is, for example, $32 \times 32 \times 4$, it is divided into non-overlapping squares of size $p \times p$. With $p = 2$ this gives $16 \times 16 = 256$ squares.
Linear Projection: Each square is flattened and linearly projected to a vector. This is now a "Visual Token".
Positional Embedding: Just like word order in language, the position of each square is added. The original DiT uses fixed sine-cosine embeddings; newer models such as Flux use RoPE (Rotary Positional Embeddings).
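The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular model: the model width `hidden_dim`, the random projection `w_proj`, and the random stand-in for the positional embedding `pos` are all assumptions made for the example.

```python
import numpy as np

def patchify(latent, p):
    # latent: (H, W, C) array; returns (num_patches, p * p * C).
    h, w, c = latent.shape
    assert h % p == 0 and w % p == 0, "patch size must divide height and width"
    x = latent.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/p, w/p, p, p, c)
    return x.reshape((h // p) * (w // p), p * p * c)

rng = np.random.default_rng(0)
latent = rng.normal(size=(32, 32, 4))  # the 32 x 32 x 4 latent from the text
patches = patchify(latent, p=2)        # one flattened row per square

hidden_dim = 64  # hypothetical model width
# Linear projection (random weights here; learned in a real model).
w_proj = rng.normal(0.0, 0.02, size=(patches.shape[1], hidden_dim))
# Stand-in positional embedding, one vector per patch position.
pos = rng.normal(0.0, 0.02, size=(patches.shape[0], hidden_dim))
tokens = patches @ w_proj + pos        # the sequence fed to the Transformer
```

With $p = 2$ the latent yields 256 tokens, each built from $2 \times 2 \times 4 = 16$ latent values.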

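This process allows the model to handle visual data the way a language model (LLM) handles text. Mathematically, a Transformer block performs the following operation, with a residual connection around each sub-layer:

$h = x + \alpha_1(c) \odot \text{SelfAttention}(\text{AdaLN}(x, c))$

$\text{DiTBlock}(x, c) = h + \alpha_2(c) \odot \text{MLP}(\text{AdaLN}(h, c))$

Here, $c$ is the condition information (the timestep, plus a class label or pooled text embedding). It is injected into each layer of the network via the AdaLN (Adaptive Layer Norm) mechanism: a small network maps $c$ to the scale and shift applied after normalization and to the gates $\alpha_1$, $\alpha_2$ on each residual branch. This ensures the model remembers the command "draw a red car" at every processing step.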
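To make the conditioning mechanism concrete, here is a minimal NumPy sketch of one block, with the residual connections and gates written out explicitly in the style of the adaLN-Zero design. It is a simplification under stated assumptions: single-head attention, ReLU in place of GELU, one linear map from $c$ to the modulation vectors, and hypothetical sizes and parameter names throughout.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its features, with no learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head attention over all tokens: (N, D) -> (N, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v

def dit_block(x, c, params):
    # The condition vector is mapped to six modulation vectors.
    mod = c @ params["w_mod"] + params["b_mod"]  # shape (6 * D,)
    shift1, scale1, gate1, shift2, scale2, gate2 = np.split(mod, 6)
    # Attention branch: modulate the normalized tokens, then a gated residual.
    h = layer_norm(x) * (1 + scale1) + shift1
    x = x + gate1 * self_attention(h, params["wq"], params["wk"], params["wv"])
    # MLP branch follows the same pattern.
    h = layer_norm(x) * (1 + scale2) + shift2
    h = np.maximum(h @ params["w1"], 0.0) @ params["w2"]  # ReLU stands in for GELU
    return x + gate2 * h

def init_params(dim, cond_dim, seed=0):
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, size=shape)
    return {
        # Zero-initialized modulation: all gates are 0, so the block starts as identity.
        "w_mod": np.zeros((cond_dim, 6 * dim)),
        "b_mod": np.zeros(6 * dim),
        "wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
        "w1": w(dim, 4 * dim), "w2": w(4 * dim, dim),
    }

rng = np.random.default_rng(1)
tokens = rng.normal(size=(256, 64))  # 256 visual tokens of width 64
cond = rng.normal(size=(32,))        # hypothetical condition vector
params = init_params(dim=64, cond_dim=32)
out = dit_block(tokens, cond, params)
```

Because the modulation weights start at zero, the freshly initialized block returns its input unchanged; training then moves the gates away from zero so the condition can steer each layer.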

2. From Noise to Route: Flow Matching

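Older diffusion models (DDPM) gradually drowned data in noise and then learned to reverse the process. Sampling was slow: the original DDPM sampler is stochastic and typically runs for hundreds to a thousand steps, and even the deterministic DDIM sampler usually needs dozens. The approach adopted by recent models such as SD3 and Flux is Rectified Flow, a form of Flow Matching.

Flow Matching aims to find the "straightest path" between the noise distribution ($\mathcal{N}(0, I)$) and the data distribution ($p_{\text{data}}$).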

Principle of Optimal Transport (OT)

Mathematically, generating an image from noise means solving an ordinary differential equation (ODE):

$\frac{dX_t}{dt} = v_t(X_t)$

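Here, $v_t$ is the velocity field, which the network learns. Flow Matching trains this vector field on straight-line interpolations between noise samples and data samples, so that the path from noise to image is close to linear rather than curved and complex.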
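As a worked step, assume the common rectified-flow convention in which $X_0$ is a noise sample, $X_1$ is a data sample, and $t$ runs from 0 to 1 (the text above does not fix a convention, so this choice is an assumption). The straight path and its velocity are then:

$X_t = (1 - t)\,X_0 + t\,X_1, \quad t \in [0, 1]$

$\frac{dX_t}{dt} = X_1 - X_0$

The velocity along each path is constant, so a network $v_\theta$ approximating $v_t$ can be trained by simple regression onto that target:

$\mathcal{L}(\theta) = \mathbb{E}_{t,\,X_0,\,X_1} \left\| v_\theta(X_t, t) - (X_1 - X_0) \right\|^2$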

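Advantage 1 (Speed): Since the path is nearly straight, the ODE can be integrated in far fewer steps (e.g., high quality in 4-8 steps, typically with additional distillation).
Advantage 2 (Consistency): Sampling the ODE is deterministic once the starting noise is fixed, so randomness decreases. Models built this way also report better fidelity to commands (prompt adherence), although the text encoder and model scale contribute to that as well.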
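The speed argument can be checked on a toy problem. The sketch below assumes the data distribution is a single point, for which the exact velocity field is known in closed form; `toy_velocity` stands in for a trained network, and all names are illustrative. It also shows how a training pair (a point on the path and its velocity target) would be formed.

```python
import random

def training_pair(x0, x1, t):
    # Point on the straight path, and the velocity target a network would regress onto.
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def toy_velocity(x, t, x1):
    # Exact velocity field when the data distribution is the single point x1.
    # A trained network would replace this function.
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]

def euler_sample(x0, x1, num_steps):
    # Integrate dX/dt = v_t(X) from t = 0 (noise) to t = 1 (data) with Euler steps.
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        v = toy_velocity(x, k * dt, x1)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]
data = [3.0, -1.0, 0.5, 2.0]
one_step = euler_sample(noise, data, num_steps=1)
eight_steps = euler_sample(noise, data, num_steps=8)
```

Because the path is a straight line, Euler integration is exact here: one step and eight steps both land on the data point up to floating-point error. Real data distributions give paths that are only approximately straight, which is why a handful of steps is still needed in practice.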

3. Hardware and Scaling Laws

The biggest advantage of DiT is that the scaling laws we see in LLMs apply here as well. While U-Net models tend to show diminishing returns as parameter count increases, DiT models (Flux, SD3) continue to improve predictably as parameters, training data, and compute increase. The trend is a smooth power law rather than a strictly linear one.

On-Premise Hardware Requirements (e.g., Flux.1 Dev/Pro)

The VRAM bottleneck our users face when running these models on local systems (RTX 3090/4090/5090) has two sources: the sheer size of the weights, and the "Attention" mechanism of DiT, whose cost grows as $O(N^2)$ in the number of tokens $N$. Two techniques help:

Model Offloading: Moving layers not currently being computed to system RAM to conserve VRAM, at the cost of extra transfer time.
Quantization (FP8/NF4): Storing weights at reduced precision. DiT models are reported to tolerate quantization well and can operate with little visible quality loss using 8-bit weights instead of 16-bit.
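A back-of-the-envelope calculation shows why these techniques matter. The figures below are assumptions for illustration: roughly 12 billion parameters (the publicly stated size of Flux.1), a VAE that downsamples by a factor of 8, and a patch size of 2. Activations, text encoders, and memory-efficient attention kernels are ignored.

```python
def weight_memory_gb(num_params, bits_per_weight):
    # Memory for the weights alone, ignoring activations and text encoders.
    return num_params * bits_per_weight / 8 / 1e9

def num_tokens(image_size, vae_factor=8, patch_size=2):
    # Number of tokens the transformer attends over for a square image.
    side = image_size // (vae_factor * patch_size)
    return side * side

params = 12e9  # assumed: roughly 12 billion parameters
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(params, bits):.0f} GB")

for size in (512, 1024, 2048):
    n = num_tokens(size)
    print(f"{size}px image: {n} tokens, {n * n:,} attention scores per head")
```

Under these assumptions the weights alone take 24 GB at 16-bit and 12 GB at 8-bit, which is the difference between overflowing and fitting on a 24 GB card. Doubling the image side quadruples the token count and multiplies the number of attention scores by 16.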

4. The Role of Text Encoders: T5 vs CLIP

Text-to-image DiT architectures are typically "Multimodal". Just using CLIP (Contrastive Language-Image Pre-training) is not enough to understand a detailed prompt.

Modern architectures (e.g., Flux) use massive language models like T5-XXL as a "Text Encoder", usually alongside CLIP. This allows the model to understand complex commands requiring typography ("a sign saying 'Hello'") and logical relationships. DiT combines this dense text information (embedding) with visual tokens through attention: some models use a *Cross-Attention* mechanism, while SD3 and Flux concatenate text and image tokens and run joint self-attention over both.
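One way to combine text and visual tokens, the joint-attention variant used in MM-DiT-style models, can be sketched as follows. This is a simplified illustration: it uses a single head and shares one set of projection weights across both modalities, whereas real models typically keep separate weights per modality. The token counts and width are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(text_tokens, image_tokens, wq, wk, wv):
    # Concatenate both modalities into one sequence and attend over all of it.
    x = np.concatenate([text_tokens, image_tokens], axis=0)  # (T + N, D)
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(x.shape[-1]))        # (T + N, T + N)
    out = weights @ v
    n_text = text_tokens.shape[0]
    # Split the result back into its text and image parts.
    return out[:n_text], out[n_text:], weights

rng = np.random.default_rng(0)
dim = 64
text = rng.normal(size=(77, dim))    # hypothetical: 77 projected text-encoder tokens
image = rng.normal(size=(256, dim))  # 256 visual tokens
wq, wk, wv = (rng.normal(0.0, 0.1, size=(dim, dim)) for _ in range(3))
text_out, image_out, weights = joint_attention(text, image, wq, wk, wv)
```

Every image token receives attention weight over every text token (the block `weights[77:, :77]`) and vice versa, so information flows in both directions within a single attention call.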

Conclusion

Diffusion Transformers and Flow Matching have transformed visual generation from "random art" into "controlled and scalable engineering".

This architectural shift lays the foundation not only for better images but also for video models capable of simulating the physical world and 3D asset generation. As BRIQ Mind, we are ready to integrate these next-generation models into your workflows.
