DeepSeek Series - LLM

Paper list (papers in gray are not discussed in this blog)

  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  • DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  • DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
  • DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
  • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
  • DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
  • DeepSeek-V3 Technical Report
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

General information

DeepSeek is a series of strong open-source LLMs that aims to match or beat closed-source models, both in general language generation and in specific fields like math, coding, and reasoning. The well-known DeepSeek-V2/V3/R1 models use MoE architectures, a huge amount of training tokens, and novel architectural, training, and optimization innovations to reach state-of-the-art performance with efficient training and inference. In addition, DeepSeek has also developed a series of powerful Vision-Language models, which will be discussed in the next blog.

The details are in the original papers, but let’s try to briefly introduce them here.

Note: no DeepSeek models, ChatGPT, or other similar LLM tools are used in the blogs.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

| Number of tokens for pretraining | Model size options | Training pipeline | Performance |
| --- | --- | --- | --- |
| 2T | 7B, 67B | Pretraining + SFT + DPO | DeepSeek LLM 67B > LLaMA-2 70B; after SFT + DPO: DeepSeek LLM 67B > GPT-3.5 |

Overview

This might be the first LLM developed by the DeepSeek team. It is a dense LLM with two versions: 7B and 67B. The main contributions (in my understanding) are:

  1. Open-source (this is the key for long-term influence);
  2. Re-investigating the scaling laws: finding formulas for the optimal hyperparameters (learning rate and batch size), the non-embedding FLOPs/token $M$, and the number of training tokens $D$ as functions of the compute budget;
  3. Finding that data quality is a key factor affecting model performance: with higher-quality data, more of the compute budget should be allocated to the model rather than to the data.

Method

Data: deduplication, filtering, and remixing are used to create high-quality training data.
Architecture: largely follows the design of LLaMA, but with deeper networks.
Scaling law: the best hyperparameters are found by grid search with small-scale models, while the optimal non-embedding FLOPs/token $M$ and number of training tokens $D$ are found with the IsoFLOP profile approach: for a fixed compute budget, several $M$/$D$ allocations are trained to draw the loss curve and locate the optimum.
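To make the IsoFLOP idea concrete, here is a minimal sketch (my own illustration with made-up numbers, not the authors' code): for one fixed compute budget $C$, several runs allocate $C$ differently between model scale $M$ and data $D \approx C/M$, and a quadratic fit over the resulting losses locates the optimal $M$.

```python
import numpy as np

def optimal_model_scale(runs):
    """Fit an IsoFLOP profile for a single compute budget C.

    runs: list of (M, loss) pairs from models trained with D = C / M tokens,
          so every run spends roughly the same compute C = M * D.
    Returns the model scale M at the minimum of a quadratic fit in log-space.
    """
    log_m = np.log10([m for m, _ in runs])
    losses = [loss for _, loss in runs]
    a, b, _ = np.polyfit(log_m, losses, deg=2)   # loss ~ a*x^2 + b*x + c
    return 10 ** (-b / (2 * a))                  # vertex of the parabola

# Hypothetical (M, loss) points for one budget, for illustration only.
print(optimal_model_scale([(1e6, 3.10), (3e6, 2.85), (1e7, 2.80), (3e7, 2.95)]))
```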

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

| Number of tokens for pretraining | Model size options | Training pipeline | Performance |
| --- | --- | --- | --- |
| 2T | 2B (0.3B activated); 16B (2.8B activated); 145B (22.2B activated) | Pretraining + SFT (for the 16B model) | DeepSeekMoE 2B ≈ GShard 2.9B (the latter has 1.5× expert parameters); DeepSeekMoE 2B ≈ its dense counterpart; DeepSeekMoE 16B ≈ LLaMA2 7B (the former uses only 40% of the computation); DeepSeekMoE 145B ≈ DeepSeek 67B (the former uses only 28.5% of the computation) |

Overview

This paper mainly introduces two ideas for the architecture: fine-grained expert segmentation and shared expert isolation. In addition, some balance loss functions are designed to encourage load balance.

Method

Fine-grained Expert Segmentation

As shown in the above figure, starting from the basic MoE architecture in (a), each FFN expert is further segmented into $m$ smaller experts, so the total number of experts is multiplied by $m$. The ratio of activated experts to total experts stays the same, which means the number of activated experts is also multiplied by $m$, but each one is finer-grained. This increases the level of expert specialization.

Shared Expert Isolation

Some shared experts are activated all the time. They are dedicated to capturing and consolidating common knowledge across varying contexts, so that the parameter redundancy among the routed experts is alleviated.
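To make both ideas concrete, here is a minimal PyTorch-style sketch (my own simplification with assumed dimensions and a naive routing loop, not the official implementation): many small routed experts of which only the top-$K$ fire per token, plus shared experts that are always active.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoELayer(nn.Module):
    """Sketch of fine-grained routed experts + always-on shared experts."""

    def __init__(self, d_model, d_expert, n_routed, n_shared, top_k):
        super().__init__()
        # Fine-grained segmentation: many small experts, only top_k fire per token.
        self.routed = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_routed))
        # Shared experts are isolated from routing and always activated.
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)     # token-to-expert affinity s_{i,t}
        topv, topi = scores.topk(self.top_k, dim=-1)   # select K' routed experts per token
        out = sum(e(x) for e in self.shared)           # shared experts: no gating
        for k in range(self.top_k):
            for i in range(len(self.routed)):          # naive loop, fine for a sketch
                mask = topi[:, k] == i
                if mask.any():
                    out[mask] += topv[mask, k, None] * self.routed[i](x[mask])
        return out

layer = DeepSeekMoELayer(d_model=64, d_expert=32, n_routed=16, n_shared=2, top_k=4)
print(layer(torch.randn(10, 64)).shape)                # torch.Size([10, 64])
```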

Load balance loss functions

  • Expert-Level Balance loss:

    $$\mathcal{L}_{\text{ExpBal}} = \alpha_1 \sum_{i=1}^{N'} f_i\, P_i, \qquad f_i = \frac{N'}{K'T}\sum_{t=1}^{T}\mathbb{1}\left(\text{token } t \text{ selects expert } i\right), \qquad P_i = \frac{1}{T}\sum_{t=1}^{T} s_{i,t},$$

    where $\alpha_1$ is the balance factor, $N'$ is the number of routed experts (those that are not shared), $K'$ is the number of activated routed experts, $T$ is the sequence length, and $s_{i,t}$ is the token-to-expert affinity (the softmax score of the $t$-th token for the $i$-th expert). Intuitively, $f_i$ is the actual (normalized) number of tokens assigned to expert $i$, while $P_i$ is the soft number of tokens assigned to expert $i$. The loss encourages tokens to be evenly assigned across the experts (a code sketch follows after this list).

  • Device-level Balance loss:

It follows the same spirit at the device level: all routed experts are partitioned into $D$ groups $\{\mathcal{E}_1, \mathcal{E}_2, \dots, \mathcal{E}_D\}$, with each group deployed on a single device, and the loss takes the same $\sum_i f'_i P'_i$ form, computed with the per-device average of $f_j$ and the per-device sum of $P_j$, to encourage balanced computation across devices.
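Here is a rough Python sketch of the expert-level loss (based on the formulation above; the weight `alpha` is an arbitrary value for illustration):

```python
import torch

def expert_level_balance_loss(scores, topk_idx, alpha=0.01):
    """scores:   (T, N') softmax affinities s_{i,t} over the routed experts
       topk_idx: (T, K') indices of the experts actually selected per token"""
    T, n_routed = scores.shape
    k = topk_idx.shape[1]
    # f_i: fraction of tokens routed to expert i, scaled by N'/K' so that a
    # perfectly uniform assignment gives f_i = 1 for every expert.
    counts = torch.zeros(n_routed).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(T * k))
    f = counts * n_routed / (k * T)
    # P_i: average affinity of expert i over the sequence.
    p = scores.mean(dim=0)
    return alpha * (f * p).sum()

scores = torch.softmax(torch.randn(128, 16), dim=-1)   # 128 tokens, 16 routed experts
topk_idx = scores.topk(4, dim=-1).indices              # K' = 4 activated per token
print(expert_level_balance_loss(scores, topk_idx))
```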

DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence

| Number of tokens for pretraining | Model size options | Training pipeline | Performance |
| --- | --- | --- | --- |
| 2T tokens sourced from 87 programming languages | DeepSeek-Coder 1.3B, 6.7B, 33B (v1) | Pretraining + instruction tuning | As shown above |
| | DeepSeek-Coder-v1.5 7B | Pretraining starting from DeepSeek-LLM-7B Base, using only next-token prediction | Compared with the v1 models, improves math reasoning and natural language performance, with minor degradation on programming |

So, the innovations in this paper are:

  1. For training, they use the Fill-In-the-Middle (FIM) approach in addition to next-token prediction, which adds a fill-in-the-blank pretraining task. Within the FIM methodology, two modes are employed: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle), meaning the middle is predicted given both the prefix and the suffix. This enhances the model's capability to handle various structural arrangements in code (see the sketch after this list).
  2. They use repository-level data construction instead of file-level: all files in a project are considered and reordered to respect cross-file dependencies, which increases the practicality and applicability of the model in project-level code scenarios.
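Here is a rough sketch of what a PSM-style FIM training sample could look like (the sentinel strings below are placeholders; the real model uses its own special tokens):

```python
import random

# Placeholder sentinel tokens; the real tokenizer defines its own special tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def make_fim_sample(code: str, rng: random.Random) -> str:
    """Split a document into prefix / middle / suffix and pack it in PSM order,
    so the model learns to predict the middle given both prefix and suffix."""
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # PSM: Prefix, Suffix, then Middle as the prediction target.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(make_fim_sample("def add(a, b):\n    return a + b\n", random.Random(0)))
```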

 

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

| Number of tokens for pretraining | Model size options | Training pipeline | Performance |
| --- | --- | --- | --- |
| 120B | DeepSeekMath-Base 7B | Additional pretraining from DeepSeek-Coder-Base-v1.5 7B; mathematical instruction tuning; GRPO (Group Relative Policy Optimization, the RL algorithm proposed in this paper) | Shown above |

Basically, it continues pretraining DeepSeek-Coder-Base-v1.5 7B on 120B math-related tokens, then applies mathematical instruction tuning and GRPO.
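The core of GRPO is that it drops PPO's separate value (critic) model: for each prompt, a group of responses is sampled, each is scored by the reward, and the group-normalized reward serves as the advantage. A minimal sketch of that advantage computation (my own illustration, omitting the clipped policy-ratio objective and the KL penalty):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group Relative Policy Optimization: the advantage of each sampled
    response is its reward normalized within the group, so no separate
    value (critic) network is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, a group of G = 4 sampled answers scored by the reward function.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # ~ [ 1., -1., -1.,  1.]
```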

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

| Number of tokens for pretraining | Model size options | Training pipeline | Performance |
| --- | --- | --- | --- |
| 8.1T | 236B (21B activated), context length 128K tokens | Pretraining + SFT + RL (GRPO) | Comparison with DeepSeek 67B shown above (right), with other models (left) |

Overview

The main idea proposed in this paper is Multi-head Latent Attention (MLA), which reduces the KV cache to save memory and computation. DeepSeek-V2 is built on DeepSeekMoE (fine-grained expert segmentation and shared expert isolation), adds MLA and an extra load balance loss, and scales up the data and training strategies to reach its goal.

Question: what is the key to DeepSeek-V2's impressive performance?
My guess: from the architectural side, MLA makes DeepSeek-V2 efficient in both memory and computation, leading to much lower training and inference cost, and DeepSeekMoE provides a good starting point for an MoE architecture that balances efficiency and accuracy. From the training perspective, the performance may also come from the balance loss functions and the training pipeline. From the data perspective, the 8.1T pretraining tokens certainly play a role.

Method

Multi-head Latent Attention (MLA)

Since we have already discussed the DeepSeekMoE architecture (in this blog) and the KV cache (in another blog), let's quickly go through the MLA approach, which further saves memory and computation on top of the standard KV cache strategy.

As shown on the bottom right of the figure above, the idea comes from low-rank compression. The input $h_t$ is first mapped to a feature $c_t^{KV}$ in a latent space with a small dimension before creating K and V; K and V are then created from this latent feature with separate up-projection matrices that map it back to a larger dimension. By doing so, only the latent feature needs to be cached, which saves a lot of memory. To further reduce the activation memory during training, Q is also compressed with the same strategy, e.g. $c_t^{Q} = W^{DQ} h_t$ and $q_t^{C} = W^{UQ} c_t^{Q}$.
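A minimal sketch of the low-rank KV compression (my simplification with assumed dimensions, ignoring the multi-head split and the decoupled RoPE part discussed next): only the small latent $c_t^{KV}$ is cached, and K and V are re-expanded from it on the fly.

```python
import torch
import torch.nn as nn

class MLACompression(nn.Module):
    """Low-rank KV compression at the heart of Multi-head Latent Attention."""

    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # h_t -> c_t^{KV}
        self.up_k = nn.Linear(d_latent, d_model, bias=False)     # c_t^{KV} -> k_t
        self.up_v = nn.Linear(d_latent, d_model, bias=False)     # c_t^{KV} -> v_t

    def forward(self, h, kv_cache):
        c_kv = self.down_kv(h)                 # (batch, 1, d_latent): this is all we cache
        kv_cache.append(c_kv)
        c_all = torch.cat(kv_cache, dim=1)     # cached latents for every past token
        k, v = self.up_k(c_all), self.up_v(c_all)
        return k, v, kv_cache

m = MLACompression()
cache = []
for _ in range(3):                             # pretend we decode 3 tokens
    k, v, cache = m(torch.randn(1, 1, 4096), cache)
print(k.shape, len(cache))                     # torch.Size([1, 3, 4096]) 3
```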

For the positional embedding, rotary position embedding (RoPE) is used, but it cannot simply be folded into the two linear projections above because RoPE is position-dependent (it would prevent absorbing the up-projection matrices into the attention computation). Therefore, the authors decouple it (taking Q as an example): a small extra query part $q_t^{R} = \mathrm{RoPE}(W^{QR} c_t^{Q})$ and key part $k_t^{R} = \mathrm{RoPE}(W^{KR} h_t)$ carry the position information and are concatenated with the compressed parts, where $q^{R}$ is split into multiple heads while $k^{R}$ is shared across all heads.
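A tiny sketch of the decoupling (illustrative only; `apply_rope` here is a toy stand-in for a real rotary embedding): the compressed part of the query carries content, a small extra part carries position via RoPE, and the positional key is shared across all heads.

```python
import torch

def apply_rope(x, pos):
    """Stand-in for a real rotary embedding; just a single toy rotation here."""
    half = x.shape[-1] // 2
    angle = torch.tensor(pos * 0.01)                     # toy frequency
    cos, sin = torch.cos(angle), torch.sin(angle)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

n_heads, d_head, d_rope = 8, 128, 64
q_c = torch.randn(n_heads, d_head)                       # content part, from the compressed latent
q_r = apply_rope(torch.randn(n_heads, d_rope), pos=5)    # per-head positional part
k_r = apply_rope(torch.randn(1, d_rope), pos=5).expand(n_heads, -1)  # shared across heads
q = torch.cat([q_c, q_r], dim=-1)                        # final query = [q^C ; q^R]
print(q.shape, k_r.shape)                                # torch.Size([8, 192]) torch.Size([8, 64])
```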

Auxiliary Loss for Load Balance

In addition to the two load balance losses introduced in DeepSeekMoE, V2 adds a third one, the communication balance loss, which follows the same spirit and encourages each device to receive an approximately equal number of tokens, so that the all-to-all communication is balanced.

Long Context Extension

The YaRN method is applied to RoPE to extend the context length to 160K, so that strong performance at the advertised 128K context length can be expected.

DeepSeek-V3 Technical Report

| Number of tokens for pretraining | Model size options | Training pipeline | Performance |
| --- | --- | --- | --- |
| 14.8T | 671B (37B activated), context length 128K tokens | Pretraining + SFT + RL (GRPO); the post-training (SFT, RL) data is curated from DeepSeek-R1, i.e. distillation from R1 | Above figure |

Overview

Main ideas include:
1. An auxiliary-loss-free strategy to ensure load balance (no need to design balance loss functions anymore);
2. A multi-token prediction (MTP) training objective;
3. The DualPipe pipeline parallelism strategy;
4. Other training and inference optimizations such as FP8 training.

Why is V3 stronger?
Most obviously, the number of training tokens increased significantly, from 8.1T in V2 to 14.8T in V3, and the larger model scale also helps. In other words: scaling laws.

Let's also take a look at the training cost of V3 (I have not compared the numbers for other LLMs, but V3 is presumably much cheaper):

Method

The architecture still follows the DeepSeekMoE design and the MLA strategy for the KV cache, so here we only introduce the new ideas.

Auxiliary-Loss-Free Load Balancing

The motivation is that an auxiliary balance loss distracts the model from the primary accuracy (language modeling) objective. As shown in the equation, a bias term $b_i$ is added to each expert's affinity score, but only for the top-K selection (the gating weight still uses the original score), so it mainly shifts which experts are selected rather than how much they contribute. During training, the bias is decreased if the corresponding expert is overloaded and increased otherwise.
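A minimal sketch of how the bias adjustment could work (my own reading of the idea; the step size `gamma` and the bookkeeping are simplified):

```python
import torch

def update_routing_bias(bias, tokens_per_expert, gamma=0.001):
    """Auxiliary-loss-free balancing: nudge down the bias of overloaded experts
    and nudge up underloaded ones. The bias b_i is only added to the affinity
    score for top-K selection; the gating weight still uses the raw score."""
    load = tokens_per_expert.float()
    overloaded = load > load.mean()
    step = torch.full_like(bias, gamma)
    return bias - torch.where(overloaded, step, -step)

bias = torch.zeros(8)                                    # one bias b_i per routed expert
load = torch.tensor([10, 2, 3, 30, 5, 4, 6, 4])          # tokens routed to each expert this step
print(update_routing_bias(bias, load))
```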

However, the load balancing is not completely loss-free: the authors also introduce a complementary sequence-wise balance loss (with a very small weight) to prevent extreme imbalance within any single sequence.

Multi-Token Prediction


As shown above, they use $D$ additional MTP modules to predict $D$ extra future tokens while keeping the causal chain. The embedding layer and the output head are shared between the MTP modules and the main model. The input to the RMSNorm in each MTP module comes from the previous MTP module (or from the main model, for the first one).

During training, each MTP module predicts its own set of tokens and produces a loss, and all losses from the main model and the MTP modules are combined with some weights; this also means certain tokens are predicted more than once.
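A rough sketch of the loss combination (my reading of the report; the weight `lam` is an arbitrary value here, and the real losses are per-token cross-entropies):

```python
def total_training_loss(main_loss, mtp_losses, lam=0.3):
    """Overall objective = main next-token loss + lam * mean of the D MTP losses."""
    return main_loss + lam * sum(mtp_losses) / len(mtp_losses)

print(total_training_loss(2.10, [2.35, 2.60]))   # hypothetical loss values
```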

During inference, only the main model is used.

Other strategies

DualPipe is a bit involved; refer to the paper or code for details. Overall, it carefully divides each stage into smaller segments and overlaps communication with computation (so that they happen at the same time), which greatly improves efficiency while keeping pipeline bubbles to a minimum.

For other strategies like FP8 mixed precision training, efficient all-to-all communication, recomputation of certain activations (RMSNorm and MLA Up-Projection), and some inference and deployment tricks, please refer to the original paper.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning


The figure above compares the reasoning performance of different LLMs.

DeepSeek-R1 is based on DeepSeek-V3-Base and is developed with a novel multi-stage pipeline that combines a small amount of cold-start data, large-scale RL, and curated fine-tuning samples.

Overview

  1. They first propose DeepSeek-R1-Zero, where no supervised data is provided. The finetuning data is collected from the base model itself. Zero already shows great reasoning performance compared with other LLMs (e.g., OpenAI-o1-mini).
  2. Then they propose a novel pipeline to further develop DeepSeek-R1.
  3. They use the samples curated with R1 to finetune other open-source LLMs like Qwen and Llama. They only use SFT (no RL) to finetune these LLMs, even though they demonstrate that applying RL can further enhance the reasoning performance.

Method

DeepSeek-R1-Zero


Zero is based on DeepSeek-V3-Base and is obtained purely by RL, specifically GRPO (Group Relative Policy Optimization), which the team proposed earlier in DeepSeekMath. During RL, the base model is guided to adhere to specific output instructions, as shown in the template above.

The reward for RL comes from two sources. One is the accuracy reward, obtained by comparing the model's output with the ground-truth answer, e.g. for math questions, or for code questions by using a compiler/test suite to check whether the generated code passes. The other is the format reward, which enforces the model to put its thinking process between '<think>' and '</think>' tags.
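A minimal sketch of what such rule-based rewards could look like (my own illustration; the real verifiers, e.g. the math answer checker and the code sandbox, are more involved):

```python
import re

def format_reward(output: str) -> float:
    """Reward the required structure: reasoning inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.+?</think>", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """Rule-based check: compare the final line of the answer to the label.
    (For code questions this would instead run the tests in a sandbox.)"""
    answer = output.strip().splitlines()[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

sample = "<think>2+2 means adding two and two.</think>\n4"
print(format_reward(sample), accuracy_reward(sample, "4"))   # 1.0 1.0
```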

During training, Zero shows increasing reasoning abilities.

It is an amazing result: an LLM can evolve into a reasoning model without any supervised fine-tuning data, which underscores the model's ability to learn and generalize effectively through RL alone.

DeepSeek-R1

The authors further propose two questions:

  1. Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start?
  2. How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities?

To solve that, they propose a novel pipeline to train DeepSeek-R1:

  • Cold Start
    They construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor, by:
    • using few-shot prompting with a long CoT as an example;
    • directly prompting the model to generate detailed output with reflection and verification;
    • filtering DeepSeek-R1-Zero outputs and refining them with human annotators to ensure readability.
  • Reasoning-oriented Reinforcement Learning
    After fine-tuning on the cold start data, they combine the following rewards for the RL training.
    • Language consistency reward: the proportion of target language words in the CoT;
    • Accuracy reward like above.
  • Rejection Sampling and Supervised Fine-Tuning
    • Perform rejection sampling from the above RL training checkpoint to collect fine-tuning data. To evaluate the quality of the data, in addition to rule-based rewards such as accuracy, this step also uses a generative reward: for some prompts, the ground truth and the model output (here, I think this refers to the model after the RL stage above) are fed to DeepSeek-V3 for judgment (e.g., I guess, to give some match score). They collect 600K reasoning-related training samples in this step.
    • They also collect 200K non-reasoning samples from the SFT dataset of DeepSeek-V3.
      After that, they fine-tune the model for two epochs on the resulting 800K samples.
  • Reinforcement Learning for all Scenarios
    They train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, they adhere to the methodology of DeepSeek-R1-Zero, which uses rule-based rewards to guide learning in math, code, and logical reasoning domains. For general data, they resort to reward models to capture human preferences in complex and nuanced scenarios.

After the above four steps, R1 can be obtained.

Regarding distillation, they use R1 to curate the 800K samples and use this curated data to fine-tune (no RL here) other open-source LLMs; the performances are shown below:

They also state that incorporating RL could substantially boost model performance.

Limitations

  • Limitations in function calling, multi-turn dialogue, complex role-playing, JSON output, etc.
  • Currently optimized for Chinese and English; other languages are not handled as well.
  • Sensitive to prompts: few-shot prompting consistently degrades its performance, so zero-shot prompting with a detailed prompt is recommended.
  • Has not been applied extensively to software engineering tasks.