DeepSeek has rapidly emerged as a prominent force in the industry, thanks to its innovative technology and outstanding performance. Its applications have not only achieved significant user growth on major social media platforms both domestically and internationally but have also ranked highly in multiple AI technology evaluations, demonstrating its strong market competitiveness. The DeepSeek-R1 model introduces a reinforcement learning approach that does not rely on supervised fine-tuning, yet significantly improves the model's reasoning performance. For example, in terms of mathematical ability, if the base model scores 100 points, the trained model can reach 450 points. This article draws on the DeepSeek-V3 and DeepSeek-R1 papers to analyze the training process of DeepSeek-R1 and its core technical principles.
Core Contributions of DeepSeek R1
First, we need to clarify one point: DeepSeek-R1 caused a global sensation not because of the sheer power of the R1 model itself; however strong R1 is, it does not surpass OpenAI's O3 model. Personally, I believe there are two main reasons for DeepSeek-R1's breakthrough:
- DeepSeek has provided an efficient and low-cost method for training and inference, and replicated this "efficiency" capability in smaller models. In simpler terms, DeepSeek R1 has demonstrated to the world that AI can become smarter without relying on large amounts of labeled data, and can even "teach" this method of becoming smarter to smaller models.
- DeepSeek has open-sourced the entire training methodology and all models, reshaping the paradigm of AI technology development, lowering industry barriers while invigorating global competition and injecting new vitality into the global AI ecosystem. As Yann LeCun, Chief AI Scientist at Meta, said: "This is not simply a matter of Chinese AI technology surpassing the U.S., but rather open-source models surpassing proprietary models. This is a victory for the open-source world."
What Problems Does the R1 Reasoning Model Solve?
In fact, we can approximate the behavior of a reasoning model using prompt engineering alone. When ChatGPT first appeared, many users discovered that adding "Please think step by step slowly" to the prompt could improve the AI's output quality. Later, it was found that including a chain of thought (COT) in the prompt could significantly enhance the reasoning ability of large language models. DeepSeek-R1 essentially bakes this ability to generate long chains of thought into the model's foundational capabilities through reinforcement learning. In short, DeepSeek-R1 solves two problems for users:
- Users no longer need to input lengthy COT prompts every time to guide the AI's reasoning.
- It lowers the barrier to writing professional COT instructions, as not everyone can craft specialized COT prompts, and the reasoning processes generated by DeepSeek-R1 even have the potential to inspire human experts.
What Is a Chain of Thought?
A chain of thought (COT) is essentially guiding the model to reason step by step to arrive at an answer, providing the reasoning process.
Example: For a simple arithmetic problem like "If an apple costs 3 yuan, and Xiao Ming buys 5 apples, how much does he spend in total?" Traditional language models might directly generate the answer "15 yuan." A model using COT would first output intermediate steps, such as "Each apple costs 3 yuan, and 5 are bought, so 3×5 = 15."
Which Problems Require COT?
- Mathematics
- Programming
- Complex decision-making problems, e.g., "Should a company expand its production line?"
- Open-ended discussion questions, e.g., "How to improve urban air quality?"
- Problems requiring multi-step thinking, e.g., "Planning an overseas trip."
Which Problems Do Not Require COT?
- Text generation and creative writing.
- Very simple problems, e.g., "What is 2+3?"
- Self-awareness questions, e.g., "Hello, who are you?"
- Translation.
- Knowledge-based questions, e.g., "What is the capital of China?" or "What is the population of Shenzhen?"
DeepSeek-R1 collected approximately 600K reasoning-related training samples (where the model is required to reason) and about 200K non-reasoning samples (where it is not), totaling 800K fine-tuning data points. For questions outside these SFT training samples, the AI determines on its own whether reasoning is required.
Technical Terminology
- RL-LLM (Reinforcement Learning for Large Language Models): Applying reinforcement learning to the training of large language models.
- MCTS (Monte Carlo Tree Search): A search algorithm used to find optimal strategies in decision-making processes.
- RLHF (Reinforcement Learning from Human Feedback): Reinforcement learning based on human feedback.
- GPQA (Graduate-Level Google-Proof Q&A Benchmark): A graduate-level expert reasoning benchmark in the field of AI, consisting of 448 difficult multiple-choice questions that cannot be easily answered via Google search. It covers multiple disciplines, including biology, physics, and chemistry, and is meticulously designed by subject matter experts.
- SFT (Supervised Fine-Tuning): Fine-tuning using labeled data to adapt a pre-trained model to a specific task or domain, primarily used for post-training. In contrast, pre-training aims to allow the model to learn general features from large amounts of unlabeled data, without targeting specific tasks.
The Training Process of DeepSeek R1
Core Idea: Under simple reward criteria, directly apply reinforcement learning, allowing the model to evolve automatically and find the optimal solution.
Next, we will break down the training process of DeepSeek R1 and its related distilled models step by step.
DeepSeek-R1-Zero
First, starting with the DeepSeek-V3 base model, DeepSeek-R1-Zero was obtained through reinforcement learning. The specific steps involve preparing some questions for the AI to answer, filling these questions into a prompt template during training (as shown below), and requiring the model to first think through the reasoning process internally, then output the reasoning process within <think> tags and the answer within <answer> tags.
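Since the original template figure is not reproduced here, below is a rough sketch of what such a training template might look like, paraphrased from the description in the R1 paper; the exact wording and the R1_ZERO_TEMPLATE name are illustrative, not verbatim.

```python
# Approximate reconstruction of the R1-Zero training template; the wording is
# paraphrased from the paper's description, not quoted verbatim.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question, and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process internally and then provides the answer. The reasoning process is "
    "enclosed in <think> </think> tags and the answer in <answer> </answer> tags.\n"
    "User: {question}\n"
    "Assistant:"
)

prompt = R1_ZERO_TEMPLATE.format(
    question="If an apple costs 3 yuan and Xiao Ming buys 5 apples, how much does he spend?"
)
print(prompt)
```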
Reward Model
During training, a reward model is provided to determine the optimization direction of reinforcement learning. The incentives are based on two criteria (a minimal rule-based sketch follows this list):
- Accuracy Incentive: Primarily for mathematical and programming problems. Since the answers to math problems can be known in advance, correctness is clear-cut. Similarly, for programming problems, the final judgment is based on whether the code runs and whether the results meet the requirements.
- Output Format Incentive: Checks two behaviors, namely whether the reasoning process is output at all and whether the output follows the specified format (i.e., the reasoning process is placed within <think> tags and the answer within <answer> tags).
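As a concrete illustration, here is a minimal, hypothetical sketch of this kind of rule-based reward; the scoring values and the compute_reward helper are invented for illustration and are not taken from the paper.

```python
import re

def compute_reward(output: str, reference_answer: str) -> float:
    """Toy rule-based reward: format bonus + accuracy bonus (illustrative values)."""
    reward = 0.0

    # Format incentive: reasoning inside <think>...</think>, answer inside <answer>...</answer>.
    format_ok = re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", output, re.S)
    if format_ok:
        reward += 0.1

    # Accuracy incentive: compare the extracted answer with the known result.
    match = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

sample = "<think>3 yuan each, 5 apples, 3*5=15</think><answer>15</answer>"
print(compute_reward(sample, "15"))  # 1.1
```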
In summary, the training method for DeepSeek-R1-Zero is akin to teaching a child to walk: instead of directly providing the correct answer, the model is allowed to try on its own and adjust its behavior based on the results (e.g., whether the answer is correct). This method requires no pre-labeled data and relies entirely on the AI's own exploration, hence the "Zero" in the name, indicating zero supervised fine-tuning data.
The Model's "Aha Moment"
During the training of DeepSeek-R1-Zero, the model experienced an "aha moment" in its reasoning process, a sudden realization of a key step while solving a problem.
The model seemed to come alive, exhibiting human-like sparks of thought. One could even say its intelligence upgraded itself. Researchers also found that as training steps increased, the model automatically learned to spend more time on reasoning.
In other words, the model found its own optimization direction: the more time spent on reasoning, the higher the accuracy of the answers.
Limitations of DeepSeek-R1-Zero
DeepSeek-R1-Zero performed remarkably well, achieving capabilities in mathematics and programming comparable to OpenAI's O1-0912 model. However, it has a noticeable flaw: the generated answers often suffer from poor readability, frequently mixing Chinese and English.
DeepSeek-R1
To address the issues with DeepSeek-R1-Zero, the DeepSeek team implemented a series of optimizations. First, they used thousands of high-quality COT data points (e.g., detailed problem-solving steps) processed by humans to provide a "cold start" through supervised fine-tuning (SFT). Then, they further trained the model using reinforcement learning. This resulted in clearer answers and more consistent language.
In simpler terms, researchers gave DeepSeek-R1-Zero some high-quality example problems, teaching it standardized problem-solving formats, and then fine-tuned it with reinforcement learning. The final model could solve problems quickly and accurately, with neatly formatted outputs. This intermediate checkpoint was tentatively named DeepSeek-R1-One.
Next, using the same method as for DeepSeek-R1-Zero, the team generated a batch of high-quality COT data (long chain-of-thought data) with DeepSeek-R1-One. They combined this with knowledge data (primarily domain-specific data to enhance the AI's capabilities in tasks like writing and role-playing) and human feedback data. They then applied reinforcement learning to the DeepSeek-V3 base model, ultimately yielding DeepSeek-R1.
Distilled Models
After obtaining the DeepSeek-R1 model, the research team attempted a third step: Could the same data and methods be used to fine-tune other smaller models, thereby granting them stronger reasoning capabilities? This is akin to passing a top student's study notes and problem-solving techniques to underperforming students, turning them into top students as well. They fine-tuned six smaller models (1.5B, 7B, 8B, 14B, 32B, and 70B) based on Qwen2.5 and Llama3, and found that these models indeed exhibited excellent reasoning abilities, with significant improvements in various evaluation metrics.
Notably, since the distilled models were obtained by directly applying SFT to COT data and knowledge data without reinforcement learning, locally deployed models like deepseek-r1:7b occasionally exhibit issues such as mixed-language reasoning processes or chaotic output formats. Additionally, the reasoning speed of distilled models is slower than that of the full-fledged DeepSeek-R1, as MTP (Multi-Token Prediction) is a unique feature of DeepSeek-V3, designed to improve output speed by predicting two tokens at once. Qwen2.5 and Llama3 do not employ this technology.
Failed Attempts
After obtaining the distilled models, the research team further experimented with using Qwen or Llama as base models and applying the same training method as for DeepSeek-R1 to see if they could produce high-IQ reasoning models. The result: it didn't work.
According to the results reported in the paper, the DeepSeek-R1-Zero-Qwen-32B model, trained using the same method with Qwen-32B as the base model, showed no significant improvement in capabilities; on some metrics it even underperformed the base model. It couldn't compare to the distilled models. This suggests that the training method has certain requirements for the base model: its "IQ" cannot be too low. It's like expecting a primary school student to solve college-level calculus problems through self-study alone.
Another failed attempt was the DeepSeek team's early effort to train reasoning models using Monte Carlo Tree Search (MCTS) and Process Reward Models (PRM); both approaches failed. Of course, the researchers phrased this cautiously:
We share our failed experiences here to provide insights, but this does not mean these methods cannot develop effective reasoning models.
Summary
From the training process of DeepSeek-R1, we can draw three simple conclusions:
- Pure reinforcement learning (pure RL) with simple reward mechanisms can train powerful reasoning models. A top student can become a genius through self-study alone.
- Using high-quality COT data to distill smaller models can significantly enhance their reasoning abilities. Underperforming students can become top students by studying the notes of geniuses.
- Smaller models cannot achieve strong reasoning capabilities through pure reinforcement learning alone. Underperforming students cannot become geniuses through self-study. In fact, Kimi1.5 adopted a training strategy similar to DeepSeek's (interested readers can refer to the references at the end of this article [1]), but the results were far inferior, likely because Kimi's base model "lacked sufficient IQ."
Core Technology Interpretation
Next, we briefly interpret the core technologies used in DeepSeek-V3 and DeepSeek-R1.
1. GRPO Training Strategy
GRPO (Group Relative Policy Optimization) is a reinforcement learning optimization method. Its core idea is to replace the traditional value network model (Critic model) in PPO with relative scoring within groups. Traditional PPO algorithms rely on a value network comparable in size to the policy model to estimate the advantage function. Each training step requires updating the value network, resulting in high computational complexity and memory usage. GRPO completely discards the Critic model, adopting a more lightweight implementation.
Here's a simple example to illustrate how GRPO works; a minimal code sketch of the group-relative advantage calculation follows the steps below.
Suppose you're training a chatbot to answer math problems:
- Generate Group Outputs: For the question "1+1=?", GRPO generates multiple candidate answers (e.g., "2", "3", "The answer requires more steps").
- Calculate Group Rewards: Assign rewards to each answer based on rules (e.g., correctness) or model scoring (e.g., "2" gets a high score, "3" gets a low score).
- Calculate Relative Advantage: Based on the mean and standard deviation of group rewards, calculate the relative advantage for each answer (e.g., correct answers have positive advantage values, incorrect ones have negative values).
- Optimize Policy: Adjust model parameters to increase the probability of generating correct answers and decrease incorrect ones by maximizing the relative advantage objective function.
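A minimal sketch of the group-relative advantage calculation from step 3, in Python; the reward values are made up, and the full GRPO objective (clipped probability ratios plus a KL penalty) is only summarized in the comments.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Hypothetical rewards for a group of sampled answers to "1+1=?".
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each reward against the group mean and std.

    A_i = (r_i - mean(r)) / std(r), with no Critic model involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1e-8  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers: two correct ("2"), two incorrect.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantage, incorrect ones negative

# In training, each sampled answer's log-probability is scaled by its advantage
# (together with a clipped ratio and a KL penalty in the full objective), so the
# policy shifts probability mass toward higher-reward answers.
```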
Key Differences Between GRPO and PPO
| Difference | GRPO | PPO |
|---|---|---|
| Critic model | No Critic model | Requires a Critic model to estimate the state value function |
| Advantage calculation | Based on relative rewards within groups (model-free) | Relies on the Critic model's absolute value predictions |
| Computational complexity | Lower (no additional model to train) | Higher (requires joint optimization of the policy and Critic models) |
| Applicable scenarios | Problems with easily definable rewards and effective group comparisons | Complex tasks requiring precise value estimation |
| Stability | Depends on group sample diversity | Smoother value estimates via the Critic |
Key Takeaways
- GRPO's Advantage: Simplified architecture, reduced computational costs, suitable for tasks with clear reward rules (e.g., code generation, math problem-solving).
- PPO's Advantage: Provides finer value estimation via Critic, suitable for scenarios requiring complex state evaluation (e.g., open-domain dialogue).
The fundamental difference lies in whether external value estimation models are relied upon and how the advantage function is calculated. GRPO achieves lightweight optimization through group comparisons, while PPO achieves precise optimization via the Critic model.
2. MLA Architecture
Multi-head Latent Attention (MLA) is an efficient attention architecture used in DeepSeek-V3 (first introduced in DeepSeek-V2). Its core goal is to reduce GPU memory usage (the KV cache) during inference through low-rank compression while maintaining performance comparable to traditional multi-head attention (MHA). Intuitively, whereas traditional MHA must store the complete keys (Key) and values (Value) for each head, MLA optimizes this in two steps:
- Low-Rank Compression: Compresses multi-head keys and values into smaller latent vectors. This is similar to how Stable Diffusion compresses images into latent space during training.
- Decoupled Queries: Decomposes queries (Query) into compressed parts and independent position-aware parts.
For an input sequence length of 4096, MLA reduces the per-layer KV cache as follows (a quick check of the arithmetic follows this list):
- Traditional MHA: must cache 4096 × 32,768 ≈ 134M values per layer (keys and values for all heads).
- MLA: only needs to cache 4096 × (512 + 64) = 4096 × 576 ≈ 2.36M values per layer, reducing memory usage by ~98%.
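These figures can be reproduced with a quick back-of-the-envelope calculation. The per-token sizes below (128 heads × 128 dimensions with keys and values both cached for MHA; a 512-dimensional compressed latent plus a 64-dimensional decoupled RoPE key for MLA) are assumptions based on DeepSeek-V3's published configuration.

```python
# Per-layer KV cache comparison for a 4096-token sequence.
seq_len = 4096

# MHA: every head caches a 128-dim key and a 128-dim value (128 heads assumed).
mha_per_token = 128 * 128 * 2            # = 32,768 values per token
# MLA: one 512-dim compressed latent plus a 64-dim decoupled positional key.
mla_per_token = 512 + 64                 # = 576 values per token

mha_cache = seq_len * mha_per_token      # ~134M values
mla_cache = seq_len * mla_per_token      # ~2.36M values

print(f"MHA cache: {mha_cache / 1e6:.1f}M values")
print(f"MLA cache: {mla_cache / 1e6:.2f}M values")
print(f"Memory reduction: {100 * (1 - mla_cache / mha_cache):.1f}%")  # ~98.2%
```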
Example Explanation
Here's a more relatable analogy for MLA (Multi-head Latent Attention):
Imagine you have a massive wardrobe (analogous to the model's GPU memory) filled with clothes (analogous to the key-value information stored by attention heads). Traditional methods (MHA) require preserving independent location information for each piece of clothing, while MLA is like a "folding compression method"—sorting and folding clothes into storage boxes, labeling each box, saving space while allowing quick access to needed items.
Core Ideas of MLA
- Compression: "Folding" complex key-value information into streamlined versions to reduce storage.
- Position Tagging: Using special labels (position encodings) to remember information locations, ensuring quick retrieval.
- Balance: Sacrificing a bit of folding time (computational overhead) for massive space savings (memory optimization).
Real-World MLA Applications
- Long-Text Chat: For example, having an AI read a 1,000-page novel without memory overload, smoothly generating summaries.
- Mobile AI: Enabling large models to run on memory-limited devices, such as real-time translation of long speech.
Summary
MLA is like a "wardrobe organization technique," using folding and tagging to let AI handle vast amounts of information efficiently without chaos!
3. FP8 Mixed-Precision Training
Traditional models use FP32 full-precision calculations, which, while accurate, are computationally intensive. DeepSeek-V3 employs FP8 mixed-precision training. For matrix multiplication (GEMM) and activation storage (e.g., hidden layer outputs), low-precision (FP8) calculations are used, while weight updates and normalization (LayerNorm) use full-precision (FP32) calculations.
Simply put, FP8 is used for intermediate calculations, while FP32 is used for final accumulation.
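The sketch below illustrates the general pattern of "low-precision multiply, full-precision accumulate." NumPy has no FP8 type, so FP16 stands in for the low-precision format; this is a conceptual illustration, not DeepSeek's actual FP8 kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

# Low-precision path: inputs are stored in a reduced-precision format
# (FP16 here, standing in for FP8)...
a_lp = a.astype(np.float16)
b_lp = b.astype(np.float16)
# ...while the accumulation, the numerically sensitive step, is done in FP32.
mixed = a_lp.astype(np.float32) @ b_lp.astype(np.float32)

# Full-precision reference result.
reference = a @ b
print("max abs error vs. FP32:", np.abs(mixed - reference).max())
```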
Example Explanation
Here's a chef analogy to explain FP8 mixed-precision:
A chef needs precise control over seasoning for each dish. Traditional methods (FP32) require weighing all ingredients to the milligram, while FP8 mixed-precision is like using "smart measuring spoons"—key steps (e.g., adding salt or MSG) use highly precise small spoons, while other steps (e.g., pouring oil or water) use less precise large spoons.
Another example: some shops simplify daily accounting by rounding off small amounts, but at month-end, professional accountants meticulously reconcile all details with calculators.
In short, FP8 mixed-precision training isn't mindless compression but "applying the best steel to the blade"—engineering wisdom that makes large model training both efficient and stable!
4. DualPipe Cross-Node Communication
Imagine running a multinational logistics company (AI training cluster) that needs to efficiently transport packages (data) between sorting centers (GPU nodes) in different countries. Traditional pipelines often get stuck waiting for package transfers (communication delays), while DualPipe is like a "bidirectional conveyor belt + smart scheduling system" that ensures sorters (GPUs) are never idle.
Traditional training methods use unidirectional and serial communication between GPU nodes. Using the logistics analogy, packages can only move in one direction (e.g., China → Germany → USA). While the German sorting center processes a batch, the U.S. center must wait. During international transport (cross-node communication), the entire pipeline halts until packages arrive at the next stop. Sorters spend 50% of their time waiting, resulting in low efficiency. This makes network latency a major bottleneck for cross-room or cross-data-center training.
DualPipe's Solution
- Bidirectional Conveyor Design: China and the U.S. simultaneously send packages in opposite directions. The German center processes packages from China first, then immediately handles those from the U.S., reducing bubbles by 50%.
- Overlapping Transport and Sorting: When the Chinese center finishes processing a batch, it immediately packs and sends it to Germany while starting the next batch. A smart scheduling system predicts transport times to ensure sorters always have work.
- Multi-Node Coordination: Automatically selects the fastest routes (e.g., air transport from China → Germany, sea transport from Germany → USA) based on real-time traffic (network bandwidth).
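To make the idle-time comparison concrete, here is a toy timing model of serial versus overlapped communication; the per-micro-batch times are arbitrary, and it illustrates compute/communication overlap in general rather than DeepSeek's actual scheduler.

```python
# Toy model: each of N micro-batches needs `compute` time on the GPU and
# `comm` time to transfer to the next node.
compute, comm, n_micro_batches = 1.0, 1.0, 8

# Serial pipeline: the GPU waits for every transfer to finish.
serial_total = n_micro_batches * (compute + comm)

# Overlapped pipeline: the transfer of batch i hides behind the compute of
# batch i+1, leaving only the final transfer exposed.
overlapped_total = n_micro_batches * max(compute, comm) + min(compute, comm)

busy = n_micro_batches * compute
print(f"serial: {serial_total:.0f}s total, GPU idle {100 * (1 - busy / serial_total):.0f}% of the time")
print(f"overlapped: {overlapped_total:.0f}s total, GPU idle {100 * (1 - busy / overlapped_total):.0f}% of the time")
```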
Notably, due to U.S. chip export restrictions, Nvidia had to reduce the communication capabilities of the computing cards it could sell to DeepSeek. For example, the H100 was downgraded to the H800, reducing communication capacity by ~60% and increasing latency by 2x.
To address this, DeepSeek had to optimize at a lower level. Originally, all 132 streaming multiprocessors (SMs) on the card handled both computation and communication tasks, but DeepSeek used PTX-level programming to dedicate 20 of them exclusively to communication. In typical V3 training workloads, despite fewer compute units, the communication bottleneck was alleviated.
Summary: DualPipe is like a "multinational logistics AI scheduling system" that enables global GPUs to collaborate as seamlessly as local devices through bidirectional pipelines + real-time communication optimization. It doesn't eliminate distance but makes it "transparent"!
5. MTP Technology
Multi-Token Prediction (MTP) is a new technology adopted in DeepSeek-V3. Simply put, while traditional models predict one token at a time, DeepSeek-V3 predicts two tokens at once, which is why DeepSeek-R1 responds so quickly.
The innovation lies in adding a small additional transformer layer to generate the second token and designing a dual loss function. DeepSeek ultimately chose to predict two tokens instead of, say, 200, because the accuracy of predictions for later tokens drops sharply. According to DeepSeek-V3's technical paper, the acceptance rate for the second token is between 85% and 90%, improving inference speed by roughly 1.8x with only about a 5% increase in resource consumption.
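A quick sanity check on those figures, under the simplifying assumption that each decoding step emits one guaranteed token plus a second token accepted with probability p, and that the extra MTP head adds a flat ~5% cost:

```python
# Rough consistency check of the reported ~1.8x speedup (simplified model,
# not the paper's measurement methodology).
def mtp_speedup(acceptance_rate: float, overhead: float = 0.05) -> float:
    tokens_per_step = 1 + acceptance_rate      # expected tokens emitted per forward pass
    return tokens_per_step / (1 + overhead)    # discount the extra MTP-head cost

for p in (0.85, 0.90):
    print(f"acceptance {p:.0%}: ~{mtp_speedup(p):.2f}x faster")
# prints ~1.76x and ~1.81x, in line with the ~1.8x quoted above
```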
Predicting five tokens at once drops the acceptance rate below 30%, resulting in net speed gains worse than predicting two tokens. Additionally, the small transformer layer's memory usage would increase significantly, potentially exceeding the main model's—hardly worth the trade-off.
Moreover, multi-token prediction requires balancing loss weights across positions. Predicting two tokens allows stable training via simple weighting, but predicting 100 tokens risks the loss function converging to local optima, causing model divergence.
Summary
DeepSeek-V3's choice to predict two tokens is the optimal balance between speed, accuracy, complexity, and stability. It achieves this by:
- Precise Prediction: High acceptance rates ensure candidate tokens are valid.
- Lightweight Extension: Minimizes additional computation and memory overhead.
- End-to-End Optimization: Matches hardware limits and user experience needs.
This decision embodies the core philosophy of AI engineering: "solving 80% of problems with 20% of the complexity," rather than blindly pursuing theoretical peak speed.
References
- [1] https://x.com/kimi_moonshot/status/1882413059513471044?5=46&t=2QnaL4UHDCHYOWAWNe4NCW
- [2] https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
- [3] https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
- [4] https://magazine.sebastianraschka.com/p/understanding-reasoning-llms