Training is Over, Thinking Begins: Inference-Time Compute and System 2 Scaling

Introduction: Fast Thinking (System 1) vs. Slow Thinking (System 2)

Most LLMs of the 2023-2024 era (GPT-4, Claude 3, Llama 3) were essentially System 1 thinkers. In Daniel Kahneman's framework, System 1 is fast, instinctive, and automatic. When you ask such an LLM "2+2?" or "Write a poem," it emits the answer token by token, spending the same fixed amount of computation on each token no matter how hard the question is. It doesn't deliberate; it gives a *reflex*.

However, this reflex falls short in complex math, coding, or strategy questions. Humans pause, grab a pen and paper, and plan steps when faced with a hard problem (System 2).

The new paradigm of 2025, Inference-Time Compute, has given AI this ability to "pause and think". Model performance now improves not only with model size and training data, but also with the compute spent while answering.


1. The New Scaling Law

Previously, the rule of thumb was "The larger the model, the smarter it is": Kaplan et al. (2020) showed that loss falls predictably as parameters, data, and training compute grow. Now the equation has an extra term:

$$\text{Performance} \propto (\text{Pre-training Compute}) \times (\text{Inference-Time Compute})$$

This leads to a striking conclusion: on reasoning-heavy tasks, a smaller model (e.g., 8B parameters) that is allowed to "think" long enough can outperform a much larger model (e.g., 70B) that answers directly.
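To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that generating one token costs roughly 2 FLOPs per model parameter, and it ignores attention cost, which grows with context length. The 500-token answer length is a hypothetical figure chosen for illustration.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb cost of generating one token: about 2 FLOPs per parameter."""
    return 2.0 * n_params


def tokens_for_equal_budget(small_params: float, large_params: float,
                            large_tokens: float) -> float:
    """How many tokens the small model can generate for the same FLOP budget
    the large model spends on `large_tokens` tokens."""
    budget = forward_flops_per_token(large_params) * large_tokens
    return budget / forward_flops_per_token(small_params)


small, large = 8e9, 70e9   # parameter counts from the example above
answer_tokens = 500        # hypothetical length of a direct answer from the 70B model

thinking_budget = tokens_for_equal_budget(small, large, answer_tokens)
print(f"{thinking_budget:.0f} tokens")  # 4375 tokens
```

Under these assumptions the 8B model can spend about 8.75 times as many tokens for the same compute. The sketch only shows that the budgets are comparable; whether the extra tokens actually close the quality gap depends on the task and on how well the model uses them.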

How Test-Time Compute Works

Before giving the final answer to the user, the model generates a long "Hidden Chain of Thought" in the background, often thousands of tokens and sometimes several candidate chains. During this process, the model:

  1. Breaks the problem into parts.
  2. Generates solution paths.
  3. Backtracks if it realizes a mistake.
  4. Tries alternative paths (Tree Search).

The user doesn't see this process, only the filtered, verified "golden" answer.
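The generate, backtrack, and retry loop can be illustrated with a toy search. This is only a sketch of the control flow, not how any particular model implements its hidden reasoning: the puzzle (turn 1 into a target number using "double" and "add 3") and all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Candidate "reasoning steps" the solver may apply (hypothetical toy moves).
STEPS: List[Tuple[str, Callable[[int], int]]] = [
    ("double", lambda x: x * 2),
    ("add 3", lambda x: x + 3),
]


def solve(value: int, target: int, max_depth: int,
          path: List[str]) -> Optional[List[str]]:
    """Depth-first search with backtracking over sequences of steps."""
    if value == target:
        return list(path)          # found a complete solution path
    if value > target or max_depth == 0:
        return None                # dead end: the caller will backtrack
    for name, step in STEPS:
        path.append(name)          # try this step
        result = solve(step(value), target, max_depth - 1, path)
        if result is not None:
            return result
        path.pop()                 # mistake detected further down: undo and try another step
    return None


print(solve(1, 10, 4, []))  # ['double', 'double', 'add 3', 'add 3']
```

The solver first follows double, double, double (reaching 8), finds that neither move leads to 10, backs up to 4, and then succeeds via "add 3". Only the final path is returned, which mirrors how the user sees the answer but not the abandoned branches.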


2. The Supervision Mechanism: PRM vs. ORM

How do we know the model is "thinking correctly"? Two types of reward models come into play here:

Outcome Reward Models (ORM)

The old method. It looks only at the model's final answer and says "Correct" or "Incorrect", like judging a chess game solely by whether it ended in checkmate. It gives no signal about which intermediate step went wrong, so it cannot help correct errors along the way.

Process Reward Models (PRM) - *Game Changer*

The new standard. It scores every single logic step the model takes.

  • *Step 1:* defined x = 5. (Score: 0.9)
  • *Step 2:* multiplied both sides by 2. (Score: 0.1 - Error!)

Thanks to the PRM, a mistake can be detected before the chain of thought reaches its end: the low-scoring branch is abandoned and another solution path is tried. This is what makes it practical to bring tree-search methods, in the spirit of Monte Carlo Tree Search (MCTS), into LLM inference.
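The difference between the two reward signals can be sketched in a few lines. The step scores below are hand-written, hypothetical values standing in for the output of a trained PRM, and multiplying step scores is just one simple way to aggregate them (taking the minimum is another).

```python
from math import prod
from typing import List, Optional


def orm_score(final_answer_correct: bool) -> float:
    """Outcome reward: a single signal for the whole chain."""
    return 1.0 if final_answer_correct else 0.0


def prm_chain_score(step_scores: List[float]) -> float:
    """Process reward aggregated over a chain (here: product of step scores)."""
    return prod(step_scores)


def first_bad_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """1-based index of the first step scoring below the threshold, if any."""
    for index, score in enumerate(step_scores, start=1):
        if score < threshold:
            return index
    return None


# Hypothetical per-step scores for two candidate reasoning chains.
chain_a = [0.9, 0.1, 0.8]    # goes wrong at step 2
chain_b = [0.9, 0.85, 0.9]   # every step looks sound

print(first_bad_step(chain_a))   # 2
print(first_bad_step(chain_b))   # None
print(prm_chain_score(chain_a) < prm_chain_score(chain_b))  # True
```

An ORM would only report that chain A's final answer is wrong. The step-level scores additionally show where it went wrong, so a search procedure can stop expanding chain A at step 2 instead of paying for the rest of it.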


3. Search Strategies: Best-of-N and Tree of Thoughts

Techniques used to increase intelligence during inference are hungry for hardware power:

  • Best-of-N Sampling: The model generates N different answers (e.g., N = 64) to the same question in parallel. A Verifier selects the best one. This significantly increases accuracy but multiplies the generation cost by roughly N.
  • Tree of Thoughts (ToT): The model simulates possible future steps in a tree structure, like playing chess. It expands only the "promising" branches.
  • Self-Correction: The model reads its own output, asks "Is there a logic error here?", and revises its answer.
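Best-of-N is the simplest of these strategies to sketch. In the example below, the generator is a hypothetical noisy sampler and the verifier is an oracle that recomputes the answer; in a real system both would be models, and the verifier would be imperfect. The helper `p_at_least_one_correct` assumes the N samples are independent.

```python
import random
from typing import Callable


def best_of_n(generate: Callable[[], int],
              verify: Callable[[int], float],
              n: int) -> int:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verify)


def p_at_least_one_correct(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct,
    if each sample is correct with probability p."""
    return 1 - (1 - p) ** n


# Toy setup (all hypothetical): the "model" answers 17 * 24 correctly 30% of the time.
rng = random.Random(0)
TRUE_ANSWER = 17 * 24


def noisy_generate() -> int:
    if rng.random() < 0.3:
        return TRUE_ANSWER
    return TRUE_ANSWER + rng.choice([-10, -1, 1, 10])


def oracle_verify(candidate: int) -> float:
    return 1.0 if candidate == TRUE_ANSWER else 0.0


print(best_of_n(noisy_generate, oracle_verify, n=64))
print(f"{p_at_least_one_correct(0.3, 64):.6f}")
```

The cost side is visible in `best_of_n`: N generation calls plus N verifier calls per question. The accuracy gain also depends entirely on the verifier; with a weak verifier, a correct candidate may be sampled but not selected.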

4. Hardware and the RTX 5090 Effect

This is the most critical point for our users. Inference-Time Compute relies heavily on KV Cache management and Memory Bandwidth rather than just VRAM capacity.

  • Long Thought Chains: The "hidden thoughts" a model generates before answering a question can be tens of thousands of tokens long. These tokens must be held in the KV Cache.
  • Local Reasoning: RTX 3090/4090/5090 cards with 24GB+ VRAM are well suited to running 8B-14B "Reasoning" models (e.g., the distilled DeepSeek-R1 models) and giving them local "thinking space".
  • Latency Acceptance: Instant answers are not expected from these systems. The user accepts a 10-20 second wait (the model's thinking time) in exchange for higher-quality, more reliable code and analysis.
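How much "thinking space" a long chain needs can be estimated directly. The sketch below uses the standard KV cache size formula; the architecture numbers (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are assumptions typical of an 8B model with grouped-query attention, so check your own model's configuration before relying on them.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory needed to cache keys and values for `tokens` tokens.

    The leading factor 2 accounts for storing one key and one value
    vector per KV head in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Assumed 8B-class configuration (hypothetical; verify against your model).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
chain = kv_cache_bytes(32_000, LAYERS, KV_HEADS, HEAD_DIM)

print(per_token // 1024, "KiB per token")          # 128 KiB per token
print(f"{chain / 2**30:.2f} GiB for 32k tokens")   # 3.91 GiB for 32k tokens
```

Under these assumptions, a single 32,000-token thought chain occupies about 4 GiB on top of the model weights (roughly 16 GB for 8B parameters at 16-bit precision). Running several chains in parallel, as Best-of-N does, multiplies the cache size accordingly, which is why KV cache management matters as much as raw capacity.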

Conclusion: From Knowledge to Wisdom

AI is no longer just a "Search Engine++" or "Autocomplete". With Inference-Time Compute, it has transformed into a "Reasoning Engine" that processes and synthesizes data, evaluates probabilities, and selects the best path.

As BRIQ Mind, we develop System 2-based AI solutions that "ponder" over your business problems and help you make strategic decisions, rather than simple chatbots.


