Flash Attention 1 & 2
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper (arXiv, 2022): https://arxiv.org/abs/2205.14135
Background: compute-bound and memory-bound
- The throughput of transformer layers can be either compute-bound or memory-bound. The higher an operation's arithmetic intensity (FLOPs performed per byte of memory traffic), the more likely it is to be compute-bound; operations with low arithmetic intensity tend to be memory-bound.
- Compute-bound operators, such as large matrix multiplications, spend more time on computation than on data movement. Memory-bound operators behave the opposite way; they include element-wise operators and reductions, e.g., sum, softmax, batch norm, etc. A rough arithmetic-intensity estimate for both cases is sketched below.
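As a rough illustration (not from the paper), the sketch below estimates arithmetic intensity for a matmul versus an element-wise operation, assuming fp16 tensors (2 bytes per element), ideal memory traffic (each operand read or written exactly once), and the usual 2·m·n·k FLOP count for matmul; the specific sizes and helper names are illustrative only.

```python
# Minimal sketch: arithmetic intensity (FLOPs per byte moved) under idealized
# assumptions -- fp16 operands, each tensor touched exactly once in DRAM.

def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """C[m,n] = A[m,k] @ B[k,n]: ~2*m*n*k FLOPs; read A and B, write C."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(n_elems: int, flops_per_elem: int = 1,
                          bytes_per_elem: int = 2) -> float:
    """y = f(x) applied per element: one read and one write per element."""
    flops = flops_per_elem * n_elems
    bytes_moved = 2 * bytes_per_elem * n_elems
    return flops / bytes_moved

if __name__ == "__main__":
    # Large matmul: intensity in the hundreds of FLOPs/byte -> compute-bound.
    print(f"matmul 4096^3:  {matmul_intensity(4096, 4096, 4096):.1f} FLOPs/byte")
    # Element-wise op of the same output size: well under 1 FLOP/byte -> memory-bound.
    print(f"element-wise:   {elementwise_intensity(4096 * 4096):.2f} FLOPs/byte")
```

Comparing these numbers against a GPU's ratio of peak FLOPs to memory bandwidth (the roofline "ridge point") indicates which regime an operator falls into.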