Introduction: The Evolution of the Alignment Paradigm
In the development of Large Language Models (LLMs), while the "pre-training" phase is critical for knowledge acquisition, the "post-training" or "alignment" phase—which ensures the model acts according to human intent—is the decisive factor for the model's usability.
For a long time, the industry standard has been Reinforcement Learning from Human Feedback (RLHF) and its core algorithm, PPO (Proximal Policy Optimization). However, PPO creates significant bottlenecks in on-premise systems due to its computational complexity and high VRAM requirements (specifically the need for a "Value Network").
GRPO (Group Relative Policy Optimization), introduced in DeepSeek's DeepSeekMath work in early 2024 and popularized by DeepSeek-R1 at the turn of 2024 and 2025, has emerged as an approach that removes much of this overhead. In this article, we will compare these two methodologies in terms of mathematical formulation, memory costs, and training stability.
1. The Industry Standard: RLHF & PPO (Proximal Policy Optimization)
Popularized by OpenAI's InstructGPT paper, PPO-based RLHF allows a model to be optimized according to human preferences. However, this process is extremely computationally expensive.
Architectural Complexity
In a PPO-based training loop, four distinct models must be held in memory (VRAM) simultaneously:
- Policy Model (Actor): The model being trained (e.g., Llama-3-70B).
- Reference Model: A frozen model used to calculate the KL Divergence penalty.
- Reward Model: The model that scores the quality of the output.
- Value Model (Critic): The model that estimates the expected reward.
Hardware Bottleneck
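If all four models are the same size as the policy, the weights alone take roughly four times the memory of a single model copy, before counting gradients, optimizer states, and activations. For a 70B-parameter policy this quickly exceeds what a small cluster can hold. The "Value Function" approach used by PPO to ensure stability therefore creates scalability issues, especially in on-premise GPU clusters (e.g., clusters with a limited number of A100/H100s or RTX 4090s).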
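To make the scale concrete, here is a rough back-of-the-envelope sketch of the weight memory alone. It assumes 16-bit weights (2 bytes per parameter) and that all four models are as large as the policy, which is not always the case in practice (reward and value models are sometimes smaller). The function name `weights_memory_gb` is made up for this illustration.

```python
def weights_memory_gb(n_params: float, bytes_per_param: int = 2, n_copies: int = 1) -> float:
    """Memory needed for model weights only, in GB (1 GB = 1e9 bytes).

    Ignores gradients, optimizer states, activations, and KV cache,
    all of which add substantially to the real footprint.
    """
    return n_params * bytes_per_param * n_copies / 1e9


# Assumption: a 70B-parameter policy stored in 16-bit precision.
single_copy = weights_memory_gb(70e9)
# Assumption: actor, reference, reward, and critic are all the same size.
ppo_total = weights_memory_gb(70e9, n_copies=4)

print(f"one copy of the weights: {single_copy:.0f} GB")
print(f"four copies (PPO setup): {ppo_total:.0f} GB")
```

Even this lower bound is well above the combined memory of a handful of 80 GB cards, which is why sharding and offloading are usually unavoidable at this scale.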
2. The Next-Gen Approach: GRPO (Group Relative Policy Optimization)
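GRPO is a method introduced by the DeepSeek team in the DeepSeekMath work, where it was used to improve mathematical reasoning capabilities. Its fundamental innovation is the complete elimination of the Value Model (Critic): the baseline that the critic would normally provide is estimated from a group of sampled outputs instead.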
Working Principle
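For each query (prompt) $q$, GRPO samples a group of $G$ outputs $\{o_1, \dots, o_G\}$ (e.g., $G = 64$) from the current policy ($\pi_{\theta_{\text{old}}}$). It then calculates the rewards ($\{r_1, \dots, r_G\}$) for these outputs and normalizes them using the mean and standard deviation of the rewards within the group.
Mathematically, the advantage function is simplified as follows:
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$
Where $\hat{A}_i$ is the advantage value of the $i$-th output.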
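The group normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop. It assumes, as in the DeepSeekMath formulation, that each reward is centered on the group mean and divided by the group standard deviation. The use of the population standard deviation and the small `eps` term are choices made here for the sketch, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize the rewards of one prompt's sampled outputs within the group.

    Each output's advantage is its reward minus the group mean, divided by
    the group standard deviation. `eps` avoids division by zero when every
    output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Outputs that beat the group average get a positive advantage and the rest get a negative one. If every output in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.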
Advantages of GRPO
- VRAM Efficiency: Since no Critic (Value) model is needed, the number of trainable models drops from two to one, so the memory spent on trainable weights, gradients, and optimizer states is roughly halved. This allows for the use of larger models or larger batch sizes on the same hardware.
- Training Stability: In PPO, the Value model has to be trained alongside the policy, and a poorly fitted critic can destabilize training. GRPO sidesteps this failure mode by replacing the learned baseline with group-relative comparisons.
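- Optional Reference Model: In the original formulation, GRPO adds a KL penalty against the frozen reference model directly to the loss, computed with a per-token estimator, rather than folding it into the reward. Some implementations set the KL coefficient to zero, in which case no separate reference model needs to be held in memory.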
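For the KL penalty mentioned in the list above, a common approach is to estimate it per token from the log-probabilities that the policy and a frozen reference model assign to the sampled token. The sketch below assumes the estimator described in the DeepSeekMath paper, ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta. The function name and the scalar inputs are simplifications for illustration, since real code operates on tensors of token log-probabilities.

```python
import math


def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate between the policy and a reference model.

    Uses ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta, computed
    from log-probabilities. The value is zero when both models agree on
    the token and positive otherwise.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


# Hypothetical log-probabilities for one sampled token.
print(kl_estimate(-1.0, -1.0))  # both models agree on this token
print(kl_estimate(-0.5, -1.5))  # policy is more confident than the reference
```

Because the estimate is never negative, it can be scaled by a coefficient and subtracted from the objective. Setting that coefficient to zero removes the term, and with it the need to query a reference model.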
3. Comparative Analysis: On-Premise Scenarios
The table below summarizes the practical impact of the two algorithms on on-premise hardware, with an indicative hardware example at the 70B scale.
| Feature | RLHF (PPO) | GRPO (DeepSeek Style) |
|---|---|---|
| Model Components | Actor, Ref, Reward, Critic | Actor, (Optional Ref), Reward |
| VRAM Consumption | Very High (Model x 4) | Medium (Model x 2 + Buffer) |
| Training Speed | Slow (Critic update required) | Fast (Only Policy update) |
| Application Area | General Chat, Creative Writing | Math, Coding, Reasoning |
| Hardware Example (70B) | 8x H100 (80GB) | 4x H100 or 8x A100 |
Why Choose GRPO?
GRPO is a particularly good fit for domains such as financial analysis or code generation, where the distinction between "correct" and "incorrect" is clear (verifiable rewards). In these settings the reward can be computed by a rule rather than by a learned Reward Model, which removes one more model from memory. As seen in the DeepSeek-R1-Zero experiments, a model trained with GRPO can improve its reasoning capabilities even without Supervised Fine-Tuning (SFT) data, using only rule-based rewards (e.g., did the code compile? is the result correct?).
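Rule-based rewards of the kind mentioned above can be very small functions. The following sketch is illustrative only: the `Answer:` marker, the function names, and the exact-match comparison are assumptions made for this example, not the reward design used by DeepSeek. The compile check uses Python's built-in `compile()`, which only verifies syntax and never runs the candidate code.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python.

    The built-in compile() parses the code without executing it.
    """
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def answer_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' marker equals `expected`.

    The 'Answer:' marker is a hypothetical output convention for this sketch.
    """
    marker = "Answer:"
    if marker not in completion:
        return 0.0
    final = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if final == expected.strip() else 0.0


print(compiles("x = 1 + 2"))
print(compiles("def f(:"))
print(answer_reward("6 * 7 is 42. Answer: 42", "42"))
```

Rewards like these are cheap and deterministic, and they are harder to game than a learned scorer. They still only check what the rule encodes: a syntax check says nothing about whether the code is correct.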
Conclusion
In on-premise LLM deployments, hardware resources are the most valuable asset. While RLHF/PPO has been the "gold standard" for years, the hardware burden it imposes (the necessity of a Critic model) increases operational costs on local servers.
GRPO is not just an optimization algorithm; it is a harbinger of the Efficient AI era. At BRIQ Mind, we actively integrate GRPO-based workflows into our pipelines to help our clients get the most out of their own infrastructure (RTX 4090/A6000/H100). Our aim is to bring GPT-4 level reasoning capabilities to local systems without compromising data privacy.