Megatron-Turing NLG 530B

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

paper (2022, arXiv): https://arxiv.org/abs/2201.11990

Model architecture and training hyperparameters

The model is a transformer decoder scaled to 530B parameters. Sequence length is 2048, and the global batch size is 1920.

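As a rough sanity check on the 530B figure, here is a back-of-the-envelope parameter count. It is only a sketch: the layer count (105), hidden size (20480), and vocabulary size are taken from the paper and common GPT-2 BPE settings, not from these notes, and small terms (biases, layernorms) are ignored.

```python
# Back-of-the-envelope parameter count for MT-NLG (illustrative sketch).
layers, hidden, vocab, seq_len = 105, 20480, 50257, 2048

per_layer = (
    4 * hidden * hidden    # attention: QKV projections + output projection
    + 8 * hidden * hidden  # MLP: hidden -> 4*hidden -> hidden
)                          # biases and layernorms add a negligible amount
embeddings = vocab * hidden + seq_len * hidden

total = layers * per_layer + embeddings
print(f"{total / 1e9:.0f}B parameters")  # ~530B
```
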
  • LR: linearly warmed up to $5.0e^{-5}$ over the first 1B tokens, then decayed to 10% of its peak value over 340B tokens with cosine decay (see the sketch after this list).
  • Batch size: starts at 32 and gradually increases to 1920 in increments of 32 over the first 12B tokens.
  • Weight initialization: normal distribution with zero mean and standard deviation $4.0e^{-3}$.
  • Optimizer: Adam, $\beta_1=0.9$, $\beta_2=0.95$, $\epsilon=10^{-8}$.
  • Gradient norm clipping at 1.0, weight decay of 0.1.
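
A minimal PyTorch sketch of these settings, not the authors' code: the optimizer hyperparameters come from the list above, while the token-based schedule helpers (`lr_at`, `global_batch_size_at`) and the stand-in model are illustrative.

```python
import math
import torch

PEAK_LR = 5.0e-5
WARMUP_TOKENS = 1e9   # linear warmup over the first 1B tokens
DECAY_TOKENS = 340e9  # cosine decay to 10% of peak over 340B tokens
MIN_LR = 0.1 * PEAK_LR

def lr_at(tokens_seen: float) -> float:
    """Learning rate as a function of the number of tokens consumed."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = min((tokens_seen - WARMUP_TOKENS) / DECAY_TOKENS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

def global_batch_size_at(tokens_seen: float) -> int:
    """Batch-size ramp: 32 -> 1920 in steps of 32 over the first 12B tokens."""
    steps = round(min(tokens_seen / 12e9, 1.0) * (1920 // 32))
    return 32 * max(1, steps)

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.Adam(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), eps=1e-8,
    weight_decay=0.1,  # the coupled/decoupled formulation is not specified above
)

# Inside the training loop: clip gradients and set the LR for the current token count.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
for group in optimizer.param_groups:
    group["lr"] = lr_at(tokens_seen=0.5e9)  # halfway through warmup -> 2.5e-5
```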

Dataset

The training dataset comprises 339B tokens in total.

Parallelism in training

Overall: ZeRO Data parallelism + Megatron-LM Tensor parallelism + 1F1B Pipeline parallelism.

Considerations on bandwidth

  • Tensor parallelism has the largest communication overhead of the three strategies, so we prioritize placing tensor-parallel workers within a node.
  • Pipeline parallelism has the lowest communication volume, so pipeline stages can be scheduled across nodes.
  • Data-parallel groups are placed within a node to accelerate gradient communication when possible; otherwise they are mapped to nearby nodes. One rank layout that realizes this placement is sketched below.
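
The sketch below is my own illustration, not the paper's code. It assumes a Megatron-style rank ordering (tensor-parallel fastest, then data-parallel, then pipeline-parallel) and an illustrative 16-way data-parallel degree, so that tensor-parallel peers share a node, data-parallel peers sit on nearby nodes, and pipeline stages land on distant nodes.

```python
# Illustrative rank layout (an assumption, not the authors' code):
# tensor-parallel rank varies fastest, then data-parallel, then pipeline.
TP, PP = 8, 35       # degrees quoted in the next subsection
DP = 16              # illustrative: 4480 GPUs / 280 GPUs per replica
GPUS_PER_NODE = 8

def coords(global_rank: int) -> tuple[int, int, int]:
    """Map a global rank to (tensor, data, pipeline) parallel coordinates."""
    tp_rank = global_rank % TP
    dp_rank = (global_rank // TP) % DP
    pp_rank = global_rank // (TP * DP)
    return tp_rank, dp_rank, pp_rank

def node_of(global_rank: int) -> int:
    return global_rank // GPUS_PER_NODE

# Ranks 0-7 form one tensor-parallel group on node 0 (all-reduce over NVLink).
assert [coords(r)[0] for r in range(8)] == list(range(8))
assert node_of(0) == node_of(7)
# Rank 8 is rank 0's data-parallel peer, one node away (gradient all-reduce).
assert coords(8) == (0, 1, 0)
# Rank 128 (= TP * DP) holds the next pipeline stage, 16 nodes away
# (point-to-point activation sends, the lowest volume of the three).
assert coords(128) == (0, 0, 1)
assert node_of(128) - node_of(0) == 16
```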

Parallelism degree

  • Each 530B-parameter model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor parallelism within a node and 35-way pipeline parallelism across nodes. Data parallelism is used to further scale out to thousands of GPUs (see the arithmetic check after this list).
  • The model is trained with mixed precision on NVIDIA’s Selene supercomputer, which has 560 DGX A100 nodes with 8 NVIDIA 80 GB A100 GPUs each (4480 GPUs in total). GPUs are connected with NVLink and NVSwitch within a node and InfiniBand across nodes.
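
A quick arithmetic check of these degrees (a sketch based only on the numbers above; whether every Selene node participates in a given run is not stated in these notes):

```python
gpus_per_node = 8
tensor_parallel = 8             # within a node
pipeline_parallel = 35          # across nodes
gpus_per_replica = tensor_parallel * pipeline_parallel
assert gpus_per_replica == 280  # one 530B replica = 280 GPUs = 35 nodes

total_gpus = 560 * gpus_per_node  # full Selene: 560 DGX A100 nodes
assert total_gpus == 4480

# If all GPUs participate, data parallelism adds this many model replicas:
data_parallel = total_gpus // gpus_per_replica
print(data_parallel)            # -> 16
```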

Results

On LAMBADA