Not All Tokens Are Created Equal: Mixture-of-Depths (MoD) and Dynamic Compute Allocation

Introduction: The Waste of Static Compute

In a traditional dense Transformer model (like Llama-3), the architecture is static. Every token passes through the same number of layers and consumes the same number of FLOPs (floating-point operations), whether the model is processing the word "and" or the concept of "Quantum Entanglement."

This is a massive waste of computation. Many tokens in a sentence are grammatical filler and arguably do not need the full depth of the network.

The Mixture-of-Depths (MoD) architecture, proposed by Google DeepMind in 2024 and integrated into open-source models in 2025, challenges this paradigm. Unlike Mixture-of-Experts (MoE), which routes tokens across parallel experts (the model's width), MoD dynamically changes how many layers each token passes through (the model's depth).

MoD Architecture and the Router Mechanism

The core principle of MoD is that, in certain layers of the model, only a small fraction of tokens (e.g., 12.5%) pass through the self-attention and MLP blocks. The remaining tokens skip the layer and flow unchanged through the residual connection.

The mechanism that makes this choice is called the Router.

Mathematical Decision Mechanism

For each token $x_t$, the router calculates a scalar weight $r_t$. This value indicates how much the token "needs" processing in that layer.

$r_t = x_t^T \cdot w_r$

Here, $w_r$ is the learnable weight vector of the router. The system then ranks all tokens in the sequence by their $r_t$ scores and processes only the top-k tokens.
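To make the scoring step concrete, here is a minimal sketch in plain Python. The toy embeddings, the router vector `w_r`, and the helper names `router_scores` and `top_k_indices` are illustrative assumptions, not taken from any MoD implementation.

```python
def router_scores(tokens, w_r):
    """Return r_t = x_t . w_r for every token embedding x_t."""
    return [sum(x_i * w_i for x_i, w_i in zip(x, w_r)) for x in tokens]

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

# Four toy token embeddings of dimension 2 (hypothetical values).
tokens = [
    [0.1, 0.0],   # t = 0
    [2.0, 1.0],   # t = 1
    [0.3, -0.2],  # t = 2
    [1.5, 0.5],   # t = 3
]
w_r = [1.0, 1.0]  # router weight vector (hypothetical values)

scores = router_scores(tokens, w_r)
selected = top_k_indices(scores, k=2)
print(selected)
```

With these values, tokens 1 and 3 have the largest scores and are the ones selected for processing.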

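Selected tokens (set $P$) pass through the block, while unselected ones (set $S$) are routed directly to the output. In the DeepMind formulation, the block output is also scaled by the router weight $r_t$, which places the router on the gradient path so that it can be trained:

$y_t = \begin{cases} r_t \, f(x_t) + x_t & \text{if } t \in P \text{ (Processed)} \\ x_t & \text{if } t \in S \text{ (Skipped)} \end{cases}$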
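The case split above can be sketched as a single routed block. This is a simplified illustration under stated assumptions: `toy_block` is a hypothetical stand-in for attention plus MLP that acts on each token independently (real attention mixes the selected tokens), and the block output is scaled by the router score $r_t$, as in the original Mixture-of-Depths formulation, so that the router would receive a gradient during training.

```python
def mod_block(tokens, w_r, f, k):
    """Apply f to the top-k tokens by router score; pass the rest through."""
    scores = [sum(a * b for a, b in zip(x, w_r)) for x in tokens]
    ranked = sorted(range(len(tokens)), key=lambda t: scores[t], reverse=True)
    processed = set(ranked[:k])  # set P; every other index belongs to set S
    out = []
    for t, x in enumerate(tokens):
        if t in processed:
            fx = f(x)
            # y_t = r_t * f(x_t) + x_t  (block output plus residual)
            out.append([scores[t] * u + v for u, v in zip(fx, x)])
        else:
            # y_t = x_t  (residual path only)
            out.append(list(x))
    return out, sorted(processed)

def toy_block(x):
    """Hypothetical stand-in for attention + MLP: doubles each feature."""
    return [2.0 * v for v in x]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, processed = mod_block(tokens, w_r=[1.0, 0.5], f=toy_block, k=1)
print(processed, out)
```

Only the highest-scoring token is transformed here; the other two leave the block exactly as they entered it.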

Thus, in an 80-layer model, a token representing a difficult concept might be processed in all 80 layers, while a simple punctuation mark might reach the output after being processed in only 10 of them.

Capacity Constraints and "Iso-FLOP" Performance

The biggest engineering challenge of MoD is maintaining hardware efficiency. Processing a variable number of tokens per layer disrupts the GPU's parallel processing structure (the dynamic shape problem).

To solve this, MoD uses a capacity cap. The number of tokens to be processed in each block is fixed in advance (e.g., 12.5% of total tokens), so tensor shapes are known before the forward pass runs.
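The difference between a variable and a fixed token count can be shown with a small sketch. Both selection functions and the score lists are hypothetical; the point is only that thresholding yields an input-dependent count, while a capacity cap always yields the same k.

```python
def select_by_threshold(scores, threshold):
    """Naive routing: the number of selected tokens varies with the input."""
    return [t for t, s in enumerate(scores) if s > threshold]

def select_by_capacity(scores, capacity):
    """Capacity-capped routing: always selects exactly k tokens."""
    k = max(1, int(len(scores) * capacity))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

# Two sequences of 8 router scores each (hypothetical values).
seq_a = [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.1, 0.1]
seq_b = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

variable = [len(select_by_threshold(s, 0.5)) for s in (seq_a, seq_b)]
fixed = [len(select_by_capacity(s, 0.25)) for s in (seq_a, seq_b)]
print(variable, fixed)
```

The threshold version selects a different number of tokens for each sequence, whereas the capacity version selects two tokens in both cases, which is what lets the routed block run with a static shape.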

Iso-FLOP Analysis

The original DeepMind paper reports that, when compared at the same training budget (iso-FLOP), MoD models can match or improve on the loss of standard dense models while needing fewer FLOPs per forward pass. In other words, the model learns to be "smarter by doing less" because it focuses its compute only where necessary.
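A rough FLOP count shows where the savings come from. The sketch below uses a common approximation that is an assumption here, not a figure from the source: about 24·d² FLOPs per token for the attention projections and a 4x-wide MLP, plus 4·n²·d for the attention score and value products. The sequence length and model width are hypothetical.

```python
def block_flops(n_tokens, d_model):
    """Approximate forward FLOPs of one Transformer block over n_tokens."""
    linear = 24 * n_tokens * d_model ** 2      # QKV/O projections + 4x MLP
    attention = 4 * n_tokens ** 2 * d_model    # QK^T and attention-times-V
    return linear + attention

def mod_block_flops(seq_len, d_model, capacity):
    """Same block, but only k = capacity * seq_len tokens are processed."""
    k = int(capacity * seq_len)
    router = 2 * seq_len * d_model             # one dot product per token
    return block_flops(k, d_model) + router

seq_len, d_model = 2048, 4096                  # hypothetical sizes
dense = block_flops(seq_len, d_model)
routed = mod_block_flops(seq_len, d_model, capacity=0.125)
ratio = routed / dense
print(f"routed block costs {ratio:.1%} of the dense block")
```

Under these assumptions, a routed block costs slightly less than the 12.5% capacity fraction, because attention among the selected tokens shrinks quadratically. In an iso-FLOP comparison, the compute saved this way can be spent on a larger model or on more training steps.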

Impact on Local AI and Hardware

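For local LLM users (RTX 3090/4090/5090 owners), MoD promises higher throughput (tokens per second).

As an illustration, if you get 15 tokens/sec on a standard 70B model with all layers active, a comparable MoD model could reach roughly 30-40 tokens/sec, because most tokens skip the routed layers and the average compute per token drops. The actual gain depends on how many layers are routed, the capacity setting, and how compute-bound the workload is.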
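A back-of-the-envelope estimate helps put such throughput figures in context. The function below is a hypothetical Amdahl-style model: it assumes that a fraction `routed_fraction` of layers use MoD routing at the given `capacity`, that all layers cost the same, and that only the compute-bound share of each decoding step gets faster. Router overhead and memory traffic are ignored.

```python
def estimated_throughput(base_tps, routed_fraction, capacity,
                         compute_bound_share=1.0):
    """Estimate tokens/sec for an MoD model from a dense baseline."""
    # Average compute per token relative to the dense model.
    compute_ratio = (1.0 - routed_fraction) + routed_fraction * capacity
    # Only the compute-bound part of the step time shrinks.
    time_ratio = (1.0 - compute_bound_share) + compute_bound_share * compute_ratio
    return base_tps / time_ratio

# Every other layer routed at 12.5% capacity, fully compute-bound.
best_case = estimated_throughput(15.0, routed_fraction=0.5, capacity=0.125)
# Same model, but only half of the step time is compute-bound.
half_bound = estimated_throughput(15.0, 0.5, 0.125, compute_bound_share=0.5)
print(round(best_case, 1), round(half_bound, 1))
```

Under these assumptions, routing every other layer at 12.5% capacity lifts a 15 tokens/sec baseline to about 27 tokens/sec at best, and to about 19 tokens/sec if half of the step time is not compute-bound. Reaching 30-40 tokens/sec would require routing a larger share of the layers.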

VRAM vs. Compute

MoE (Mixture of Experts): Increases VRAM usage (all experts must be loaded), but reduces computation per token.
MoD (Mixture of Depths): VRAM usage for the weights stays constant (same as a standard model of equal size), but the computational load drops substantially.
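The contrast can be put into numbers with a simplified accounting sketch. The parameter counts and both helper functions are hypothetical, and embeddings, the KV cache, and router parameters are ignored.

```python
def moe_profile(shared_params, expert_params, n_experts, experts_per_token):
    """Return (parameters held in memory, parameters used per token)."""
    stored = shared_params + n_experts * expert_params
    active = shared_params + experts_per_token * expert_params
    return stored, active

def mod_profile(total_params, routed_fraction, capacity):
    """Return (parameters held in memory, average parameters used per token)."""
    stored = total_params
    # Non-routed layers always run; routed layers run for `capacity` of tokens.
    avg_active = total_params * ((1 - routed_fraction) + routed_fraction * capacity)
    return stored, avg_active

# Hypothetical MoE: 2B shared parameters, 8 experts of 1B each, 2 used per token.
moe_stored, moe_active = moe_profile(2_000_000_000, 1_000_000_000, 8, 2)
# Hypothetical MoD: 10B parameters, half the layers routed at 12.5% capacity.
mod_stored, mod_active = mod_profile(10_000_000_000, 0.5, 0.125)
print(moe_stored, moe_active, mod_stored, mod_active)
```

In this toy comparison both models hold 10B parameters in memory. The MoE model touches 4B of them for every token, while the MoD model touches about 5.6B on average, with the exact share varying from token to token.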

Therefore, MoD is the ideal optimization for systems bottlenecked by Compute rather than memory bandwidth.

Future Vision: MoE + MoD Hybrid Structures

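Projections for 2026 suggest a convergence of MoD (depth selection) and MoE (expert selection) architectures, a combination the original MoD paper already explores under the name Mixture-of-Depths-and-Experts (MoDE).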

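Imagine each token facing two routing decisions:

First, the MoD router asks, "Does this token need processing in this layer?"
If "Yes," the MoE router asks, "Which expert should process it?" (Labels such as Coding, History, or Logic are illustrative; learned experts rarely specialize this cleanly.)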
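The two questions above can be sketched as a two-stage routing function. This is an illustrative toy, not a published design: the depth router vector `w_depth`, the per-expert vectors `w_experts`, and the top-1 expert choice are all assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route_hybrid(tokens, w_depth, w_experts, k):
    """Map each token index to an expert index, or None if the layer is skipped."""
    # Stage 1 (MoD): keep only the k tokens with the highest depth score.
    depth_scores = [dot(x, w_depth) for x in tokens]
    ranked = sorted(range(len(tokens)), key=lambda t: depth_scores[t], reverse=True)
    processed = set(ranked[:k])
    plan = {}
    for t, x in enumerate(tokens):
        if t not in processed:
            plan[t] = None  # token takes the residual path
            continue
        # Stage 2 (MoE): send the token to its best-scoring expert (top-1).
        expert_scores = [dot(x, w_e) for w_e in w_experts]
        plan[t] = max(range(len(w_experts)), key=lambda e: expert_scores[e])
    return plan

tokens = [[0.1, 0.1], [2.0, 0.0], [0.0, 3.0], [0.2, 0.0]]
plan = route_hybrid(tokens, w_depth=[1.0, 1.0],
                    w_experts=[[1.0, 0.0], [0.0, 1.0]], k=2)
print(plan)
```

Here tokens 0 and 3 skip the layer entirely, token 1 goes to expert 0, and token 2 goes to expert 1.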

This "Double Sparsity" is a promising route toward running trillion-parameter models on an energy budget much closer to that of today's small models.

Conclusion: Intelligent Laziness

Mixture-of-Depths is a triumph of the "intelligent laziness" principle in AI engineering. Instead of treating every token with the same seriousness, the model decides what is important and what is not, which brings it one step closer to the working principle of biological brains (energy minimization).

In on-premise systems, this could mean serving up to roughly 50% more users with the same hardware, or noticeably shorter response times, depending on the workload.
