Paper list

  • DeepSeek-VL: Towards Real-World Vision-Language Understanding
  • Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
  • JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
  • DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
  • Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

General information

DeepSeek-VL and DeepSeek-VL2 are multimodal models for understanding only; they do not perform visual generation.
The Janus series are unified text-and-image generative models. Among them, Janus and Janus-Pro use an autoregressive mechanism for image generation, while JanusFlow uses rectified flow, similar to diffusion models, to iteratively refine the generated content from noise into an image (a rough sketch of rectified-flow sampling is shown below).
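
As a rough illustration of the rectified-flow idea (a generic sketch, not JanusFlow's actual sampler; `velocity_net` is a hypothetical model taking `(x, t)` and returning a tensor shaped like `x`), sampling simply integrates a learned velocity field from noise toward an image with Euler steps:

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, shape, num_steps=50, device="cpu"):
    """Euler integration of a learned velocity field v(x_t, t).

    Rectified flow moves a sample from pure noise (t=0) to data (t=1)
    along (approximately) straight paths: x_{t+dt} = x_t + v(x_t, t) * dt.
    """
    x = torch.randn(shape, device=device)              # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = velocity_net(x, t)                          # predicted velocity at time t
        x = x + v * dt                                  # Euler step toward the data
    return x
```

In practice fancier step schedules or solvers can be used, but the straight-path Euler update above is the core of rectified-flow sampling.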

Read more »

Paper list (papers in gray are not discussed in this blog)

  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  • DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  • DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
  • DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
  • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
  • DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
  • DeepSeek-V3 Technical Report
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

General information

DeepSeek is a series of strong open-source LLMs that aim to match or beat closed-source models, both in general language generation and in specific fields such as math, coding, and reasoning. The well-known DeepSeek-V2/V3/R1 models use MoE architectures, huge amounts of training tokens, and novel architectural, training, and optimization innovations to reach state-of-the-art performance with efficient training and inference. In addition, DeepSeek has also developed a series of powerful vision-language models, which will be discussed in the next blog post.

Read more »

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

paper (2022 arxiv): https://arxiv.org/abs/2205.14135

Background: compute-bound and memory-bound

  • The throughput of transformer layers can be either compute-bound or memory-bound. The lower the arithmetic intensity (FLOPs performed per byte of memory accessed), the more likely an operator is memory-bound.
  • Compute-bound operators, such as large matrix multiplications, spend more time on computation than on data movement; memory-bound operators behave the opposite way and include element-wise operations and reductions, e.g., sum, softmax, batch norm. A back-of-the-envelope estimate of arithmetic intensity is sketched below.
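
As a rough illustration (assuming fp16 tensors and counting only main-memory traffic, ignoring caches), arithmetic intensity can be estimated as FLOPs divided by bytes moved:

```python
def arithmetic_intensity_matmul(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for C = A @ B with A (m,k), B (k,n), fp16 by default."""
    flops = 2 * m * k * n                                    # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A and B, write C
    return flops / bytes_moved

def arithmetic_intensity_elementwise(numel, bytes_per_elem=2):
    """FLOPs per byte for an element-wise op like x + y (1 FLOP per element)."""
    flops = numel
    bytes_moved = bytes_per_elem * 3 * numel                 # read x and y, write result
    return flops / bytes_moved

# Large matmuls have high intensity -> usually compute-bound;
# element-wise ops have intensity well below 1 -> memory-bound.
print(arithmetic_intensity_matmul(4096, 4096, 4096))         # ~1365 FLOPs/byte
print(arithmetic_intensity_elementwise(4096 * 4096))         # ~0.17 FLOPs/byte
```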
Read more »

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

This paper combines tensor model parallelism from Megatron-LM with expert parallelism and ZeRO data parallelism for MoE model training. The design spirit is similar to that of DeepSpeed-MoE; a rough sketch of the resulting rank layout is shown below.
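
The grouping below is a loose, framework-agnostic illustration (the group sizes are made up, not the paper's configuration): each rank is assigned coordinates along the tensor-, expert-, and data-parallel dimensions.

```python
def build_parallel_groups(world_size, tensor_parallel=2, expert_parallel=4):
    """Assign each rank a (tensor, expert, data) parallel coordinate.

    Ranks sharing a tensor-parallel group shard each weight matrix;
    ranks sharing an expert-parallel group hold different experts;
    the remaining dimension is (ZeRO) data parallelism.
    """
    assert world_size % (tensor_parallel * expert_parallel) == 0
    data_parallel = world_size // (tensor_parallel * expert_parallel)

    groups = {}
    for rank in range(world_size):
        tp = rank % tensor_parallel
        ep = (rank // tensor_parallel) % expert_parallel
        dp = rank // (tensor_parallel * expert_parallel)
        groups[rank] = {"tensor": tp, "expert": ep, "data": dp}
    return data_parallel, groups

dp_size, groups = build_parallel_groups(world_size=16)
print(dp_size)      # 2 data-parallel replicas
print(groups[5])    # {'tensor': 1, 'expert': 2, 'data': 0}
```

Each rank then joins one communication group per dimension, e.g., the all-to-all for expert routing runs within its expert-parallel group.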

Read more »

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

paper (2022 arxiv): https://arxiv.org/abs/2201.05596

First, let’s look at what MoE architectures look like.
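
As a minimal reference, here is a generic top-k gated MoE feed-forward layer in PyTorch (a sketch of the common pattern, not DeepSpeed-MoE's actual implementation): each token is routed to a few experts and their outputs are mixed by the gate probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """A sparsely gated mixture-of-experts feed-forward layer."""

    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                     # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)              # routing probabilities
        topk_p, topk_idx = scores.topk(self.top_k, dim=-1)    # per-token expert choices
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)    # renormalize the gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                            # which tokens chose expert e
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += topk_p[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Real systems add load-balancing losses and capacity limits, and place different experts on different devices (expert parallelism).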

Read more »

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

paper (2021 arxiv): https://arxiv.org/abs/2104.07857

Highlights of the method

  1. Training can scale to much larger models with the same number of GPUs, compared with prior methods.
  2. High throughput and good scalability

    Surprisingly, even at small model sizes, ZeRO-Infinity performs comparably to 3D parallelism and ZeRO-Offload.
Read more »

ZeRO-Offload: Democratizing Billion-Scale Model Training

paper (2021 arxiv): https://arxiv.org/abs/2101.06840

My takeaways from ZeRO-Offload

ZeRO-Offload helps multiple GPUs scale to larger models with more parameters by offloading part of the training memory (mainly the optimizer states) to the CPU, without hurting training efficiency.

Way of thinking:

  • Offload as much memory as possible to the CPU.
  • Keep the communication overhead between GPU and CPU as small as possible.
  • Keep the computation placed on the CPU small enough that it doesn’t hurt training efficiency (a minimal sketch of the offloading idea follows this list).
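
A minimal sketch of the offloading idea (only the flavor; the class name is made up, and ZeRO-Offload's real implementation also partitions states across data-parallel ranks and overlaps communication): keep the parameters on the GPU for the forward and backward passes, but hold the memory-hungry Adam states and run the update on the CPU.

```python
import torch

class CPUOffloadOptimizer:
    """Toy optimizer-state offloading in the spirit of ZeRO-Offload."""

    def __init__(self, gpu_params, lr=1e-3):
        self.gpu_params = list(gpu_params)
        # fp32 master copies of the parameters, kept on the CPU
        self.cpu_params = [p.detach().float().cpu().requires_grad_(True)
                           for p in self.gpu_params]
        # Adam's momentum/variance states therefore also live on the CPU
        self.opt = torch.optim.Adam(self.cpu_params, lr=lr)

    @torch.no_grad()
    def step(self):
        # GPU -> CPU: copy the gradients computed during backward
        for gpu_p, cpu_p in zip(self.gpu_params, self.cpu_params):
            cpu_p.grad = gpu_p.grad.detach().float().cpu()
        # The memory- and bandwidth-heavy parameter update runs on the CPU
        self.opt.step()
        # CPU -> GPU: copy the updated weights back for the next forward pass
        for gpu_p, cpu_p in zip(self.gpu_params, self.cpu_params):
            gpu_p.copy_(cpu_p.to(device=gpu_p.device, dtype=gpu_p.dtype))
            gpu_p.grad = None
```

In ZeRO-Offload itself, the gradient transfer is overlapped with the backward pass, so the extra communication adds little to the step time.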
Read more »

Basics of Hexo

Basic knowledge before developing a Hexo project

| Tool | Full name | Purpose in Hexo | Key benefits |
| --- | --- | --- | --- |
| NVM | Node Version Manager | Manage Node.js versions for Hexo projects. | Avoids compatibility issues; works with multiple projects requiring different Node.js versions. |
| NPM | Node Package Manager | Install Hexo, plugins, and dependencies. | Streamlined dependency management; consistent environment across systems. |
Read more »