FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

paper (2022 arxiv): https://arxiv.org/abs/2205.14135

Background: compute-bound and memory-bound

  • The throughput of transformer layers can be either compute-bound or memory-bound. The higher the arithmetic intensity (FLOPs per byte of memory traffic), the more likely an operator is to be compute-bound.
  • Compute-bound operators, such as matrix multiplication, spend more time on computation than on data movement; memory-bound operators behave the opposite way, and include element-wise operators and reductions, e.g., sum, softmax, batch norm (see the sketch below).
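
To make arithmetic intensity concrete, here is a back-of-the-envelope sketch (my own illustration, not from the paper) comparing a large matrix multiplication with an element-wise add, assuming fp16 operands and ignoring cache effects:

```python
# Arithmetic intensity = FLOPs per byte moved between memory and compute units.

def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """(m, k) @ (k, n): 2*m*n*k FLOPs; reads A and B once, writes C once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_add_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """x + y: n FLOPs; reads x and y, writes the result."""
    flops = n
    bytes_moved = bytes_per_elem * 3 * n
    return flops / bytes_moved

print(matmul_intensity(4096, 4096, 4096))   # ~1365 FLOPs/byte -> compute-bound
print(elementwise_add_intensity(4096**2))   # ~0.17 FLOPs/byte -> memory-bound
```

The gap of several orders of magnitude is why element-wise and reduction ops (like softmax inside attention) are bottlenecked by memory bandwidth rather than by FLOPs.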
Read more »

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

This paper combines tensor model parallelism from Megatron-LM with expert parallelism and ZeRO data parallelism for MoE model training; the design spirit is similar to DeepSpeed-MoE.
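
As a toy illustration of how the three forms of parallelism might fit together (my own sketch, not the paper's code), the following partitions global GPU ranks into tensor-, expert-, and data-parallel groups, assuming world_size = tp × ep × dp with tensor parallelism innermost:

```python
def build_groups(world_size: int, tp: int, ep: int):
    assert world_size % (tp * ep) == 0
    dp = world_size // (tp * ep)
    ranks = list(range(world_size))
    # Tensor-parallel groups: tp consecutive ranks share sharded weight matrices.
    tp_groups = [ranks[i:i + tp] for i in range(0, world_size, tp)]
    # Expert-parallel groups: each group hosts a disjoint slice of the experts.
    ep_groups = [[base + j * tp + k for j in range(ep)]
                 for base in range(0, world_size, tp * ep)
                 for k in range(tp)]
    # Data-parallel groups: ranks holding replicas of the same parameters,
    # across which ZeRO partitions optimizer states.
    dp_groups = [[k + d * tp * ep for d in range(dp)] for k in range(tp * ep)]
    return tp_groups, ep_groups, dp_groups

tp_groups, ep_groups, dp_groups = build_groups(world_size=8, tp=2, ep=2)
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(ep_groups)  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```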

Read more »

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

paper (2022 arxiv): https://arxiv.org/abs/2201.05596

First, let’s look at what MoE architectures look like.
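
As a starting point, here is a minimal sketch of a top-1 gated MoE layer (my own illustration in PyTorch; the class and names are hypothetical, not DeepSpeed-MoE's API): a learned gate routes each token to one expert FFN, so only a fraction of the parameters is active per token.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-1 expert.
        scores = torch.softmax(self.gate(x), dim=-1)  # (tokens, num_experts)
        weight, idx = scores.max(dim=-1)              # top-1 gate value per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

layer = MoELayer(d_model=16, d_ff=64, num_experts=4)
y = layer(torch.randn(8, 16))  # 8 tokens in, 8 tokens out
```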

Read more »

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

paper (2021 arxiv): https://arxiv.org/abs/2104.07857

Highlights of the method

  1. Training can scale to much larger models with the same number of GPUs, compared with prior methods.
  2. High throughput and good scalability.

    Surprisingly, even at small model sizes, ZeRO-Infinity performs comparably to 3D parallelism and ZeRO-Offload.
Read more »

ZeRO-Offload: Democratizing Billion-Scale Model Training

paper (2021 arxiv): https://arxiv.org/abs/2101.06840

My memory on ZeRO-Offload

ZeRO-Offload helps multi-GPU training scale to models with more parameters by offloading part of the memory to the CPU during training, without hurting efficiency.

Way of thinking:

  • Offload as much memory as possible to the CPU.
  • Keep the communication overhead as small as possible.
  • The computation performed on the CPU shouldn’t hurt training efficiency (a toy sketch of this pattern follows the list).
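
The following toy sketch (my own, assuming PyTorch; ZeRO-Offload's real implementation uses an optimized CPU Adam and overlaps transfers with computation) shows the basic pattern: gradients move GPU → CPU, the optimizer step runs on an fp32 CPU copy of the parameters, and updated weights move back.

```python
import torch

def cpu_offloaded_step(gpu_params, cpu_params, cpu_opt):
    # 1. Move gradients to CPU (ZeRO-Offload overlaps this with backward).
    for gp, cp in zip(gpu_params, cpu_params):
        cp.grad = gp.grad.detach().to("cpu", dtype=torch.float32)
    # 2. Run the fp32 optimizer update entirely in CPU memory.
    cpu_opt.step()
    cpu_opt.zero_grad()
    # 3. Copy the updated parameters back to the GPU.
    with torch.no_grad():
        for gp, cp in zip(gpu_params, cpu_params):
            gp.copy_(cp.to(gp.device, dtype=gp.dtype))

# Hypothetical usage: keep an fp32 CPU mirror of the model's parameters.
# cpu_params = [p.detach().to("cpu", torch.float32).requires_grad_()
#               for p in model.parameters()]
# cpu_opt = torch.optim.Adam(cpu_params)
```

Keeping the optimizer states (the dominant memory cost with Adam and mixed precision) on the CPU is what frees GPU memory for larger models.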
Read more »

Basics of Hexo

Basic knowledge before developing a Hexo project

| Tool | Full name | Purpose in Hexo | Key benefits |
|------|-----------|-----------------|--------------|
| NVM | Node Version Manager | Manage Node.js versions for Hexo projects. | Avoids compatibility issues; works with multiple projects requiring different Node.js versions. |
| NPM | Node Package Manager | Install Hexo, plugins, and dependencies. | Streamlined dependency management; consistent environment across systems. |
Read more »