Megatron-Turing NLG 530B

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

paper (2022, arXiv): https://arxiv.org/abs/2201.11990

Model architecture and training hyperparameters

The model is a transformer decoder scaled to 530B parameters. Sequence length is 2048, and the global batch size is 1920.

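As a rough sanity check on the 530B figure, here is a back-of-the-envelope parameter count. It is only a sketch: the layer count (105), hidden size (20480), and vocabulary size are taken from the paper and common GPT-2 BPE settings, not from these notes, and small terms (biases, layernorms) are ignored.

```python
# Back-of-the-envelope parameter count for MT-NLG (illustrative sketch).
layers, hidden, vocab, seq_len = 105, 20480, 50257, 2048

per_layer = (
    4 * hidden * hidden    # attention: QKV projections + output projection
    + 8 * hidden * hidden  # MLP: hidden -> 4*hidden -> hidden
)                          # biases and layernorms add a negligible amount
embeddings = vocab * hidden + seq_len * hidden

total = layers * per_layer + embeddings
print(f"{total / 1e9:.0f}B parameters")  # ~530B
```
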
  • LR: linearly warmed up to $5.0e^{-5}$ over the first 1B tokens, then decayed to 10% of its peak value over 340B tokens with cosine decay (see the sketch after this list).
  • Batch size: starts at 32 and gradually increases to 1920 in increments of 32 over the first 12B tokens.
  • Weight initialization: normal distribution with zero mean and standard deviation $4.0e^{-3}$.
  • Optimizer: Adam, $\beta_1=0.9$, $\beta_2=0.95$, $\epsilon=10^{-8}$.
  • Gradient norm clipping at 1.0, weight decay of 0.1.
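
A minimal PyTorch sketch of these settings, not the authors' code: the optimizer hyperparameters come from the list above, while the token-based schedule helpers (`lr_at`, `global_batch_size_at`) and the stand-in model are illustrative.

```python
import math
import torch

PEAK_LR = 5.0e-5
WARMUP_TOKENS = 1e9   # linear warmup over the first 1B tokens
DECAY_TOKENS = 340e9  # cosine decay to 10% of peak over 340B tokens
MIN_LR = 0.1 * PEAK_LR

def lr_at(tokens_seen: float) -> float:
    """Learning rate as a function of the number of tokens consumed."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = min((tokens_seen - WARMUP_TOKENS) / DECAY_TOKENS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

def global_batch_size_at(tokens_seen: float) -> int:
    """Batch-size ramp: 32 -> 1920 in steps of 32 over the first 12B tokens."""
    steps = round(min(tokens_seen / 12e9, 1.0) * (1920 // 32))
    return 32 * max(1, steps)

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.Adam(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), eps=1e-8,
    weight_decay=0.1,  # the coupled/decoupled formulation is not specified above
)

# Inside the training loop: clip gradients and set the LR for the current token count.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
for group in optimizer.param_groups:
    group["lr"] = lr_at(tokens_seen=0.5e9)  # halfway through warmup -> 2.5e-5
```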

Dataset

The training dataset comprises 339B tokens in total.

Parallelism in training

Overall: ZeRO Data parallelism + Megatron-LM Tensor parallelism + 1F1B Pipeline parallelism.

Considerations on bandwidth

  • Tensor parallelism has the largest communication overhead of the three strategies, so we prioritize placing tensor-parallel workers within a node.
  • Pipeline parallelism has the lowest communication volume, so pipeline stages can be scheduled across nodes.
  • Data-parallel groups are placed within a node to accelerate gradient communication when possible; otherwise they are mapped to nearby nodes. One rank layout that realizes this placement is sketched below.
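
The sketch below is my own illustration, not the paper's code. It assumes a Megatron-style rank ordering (tensor-parallel fastest, then data-parallel, then pipeline-parallel) and an illustrative 16-way data-parallel degree, so that tensor-parallel peers share a node, data-parallel peers sit on nearby nodes, and pipeline stages land on distant nodes.

```python
# Illustrative rank layout (an assumption, not the authors' code):
# tensor-parallel rank varies fastest, then data-parallel, then pipeline.
TP, PP = 8, 35       # degrees quoted in the next subsection
DP = 16              # illustrative: 4480 GPUs / 280 GPUs per replica
GPUS_PER_NODE = 8

def coords(global_rank: int) -> tuple[int, int, int]:
    """Map a global rank to (tensor, data, pipeline) parallel coordinates."""
    tp_rank = global_rank % TP
    dp_rank = (global_rank // TP) % DP
    pp_rank = global_rank // (TP * DP)
    return tp_rank, dp_rank, pp_rank

def node_of(global_rank: int) -> int:
    return global_rank // GPUS_PER_NODE

# Ranks 0-7 form one tensor-parallel group on node 0 (all-reduce over NVLink).
assert [coords(r)[0] for r in range(8)] == list(range(8))
assert node_of(0) == node_of(7)
# Rank 8 is rank 0's data-parallel peer, one node away (gradient all-reduce).
assert coords(8) == (0, 1, 0)
# Rank 128 (= TP * DP) holds the next pipeline stage, 16 nodes away
# (point-to-point activation sends, the lowest volume of the three).
assert coords(128) == (0, 0, 1)
assert node_of(128) - node_of(0) == 16
```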

Parallelism degree

  • Each 530B-parameter model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor parallelism within a node and 35-way pipeline parallelism across nodes. Data parallelism is used to further scale out to thousands of GPUs (see the arithmetic check after this list).
  • The model is trained with mixed precision on NVIDIA’s Selene supercomputer, which has 560 DGX A100 nodes with 8 NVIDIA 80 GB A100 GPUs each (4480 GPUs in total). GPUs are connected with NVLink and NVSwitch within a node and InfiniBand across nodes.
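
A quick arithmetic check of these degrees (a sketch based only on the numbers above; whether every Selene node participates in a given run is not stated in these notes):

```python
gpus_per_node = 8
tensor_parallel = 8             # within a node
pipeline_parallel = 35          # across nodes
gpus_per_replica = tensor_parallel * pipeline_parallel
assert gpus_per_replica == 280  # one 530B replica = 280 GPUs = 35 nodes

total_gpus = 560 * gpus_per_node  # full Selene: 560 DGX A100 nodes
assert total_gpus == 4480

# If all GPUs participate, data parallelism adds this many model replicas:
data_parallel = total_gpus // gpus_per_replica
print(data_parallel)            # -> 16
```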

Results

On LAMBADA