Other papers on DeepSpeed-MoE

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

This paper additionally adopts tensor model parallelism from Megatron-LM, combining it with expert parallelism and ZeRO data parallelism for MoE model training. The design spirit is similar to DeepSpeed-MoE:

In the forward pass, each transformer layer performs two "all-reduce" and two "all-to-all" operations. In the backward pass, an additional "all-reduce" is used to synchronize gradients across the different data-parallel groups.
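
To make this communication pattern concrete, here is a minimal sketch of one such layer's forward pass using torch.distributed. The process groups `tp_group` (tensor parallel) and `ep_group` (expert parallel) and all module names are my own assumptions for illustration, not the paper's code.

```python
import torch
import torch.distributed as dist

def moe_layer_forward(x, attn, gate, local_experts, tp_group, ep_group):
    """Sketch of the per-layer forward communication under hybrid
    tensor-expert-data parallelism (illustrative only)."""
    # Tensor parallelism: attention produces partial sums -> all-reduce #1.
    h = attn(x)
    dist.all_reduce(h, group=tp_group)

    # Route each token to an expert; reorder tokens so that tokens destined
    # for the same expert-parallel rank are contiguous.
    expert_ids = gate(h).argmax(dim=-1)            # [num_tokens]
    order = torch.argsort(expert_ids)
    send_buf = h[order].contiguous()

    # Expert parallelism: all-to-all #1 ships tokens to the ranks hosting
    # their experts (assumes an equal token count per rank for simplicity).
    recv_buf = torch.empty_like(send_buf)
    dist.all_to_all_single(recv_buf, send_buf, group=ep_group)

    # Each rank applies only its locally hosted expert FFN(s).
    expert_out = local_experts[0](recv_buf)

    # All-to-all #2 returns the processed tokens to their original ranks.
    out_buf = torch.empty_like(expert_out)
    dist.all_to_all_single(out_buf, expert_out, group=ep_group)

    # Tensor parallelism inside the expert FFN: partial outputs -> all-reduce #2.
    dist.all_reduce(out_buf, group=tp_group)

    # Undo the routing permutation.
    out = torch.empty_like(out_buf)
    out[order] = out_buf
    return out
```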

The authors also propose a tiled optimizer that partitions the parameters into tiles processed sequentially, so that the temporary memory needed for 32-bit gradients is independent of the number of experts and the base model size.
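
A minimal sketch of that tiled idea, as I understand it (function and buffer names are assumptions, not the authors' implementation): only one tile's gradient is upcast to fp32 at a time, so the temporary fp32 buffer is bounded by the tile size rather than the model size.

```python
import torch

def tiled_adam_step(params_fp16, grads_fp16, master_fp32, exp_avg, exp_avg_sq,
                    lr=1e-4, betas=(0.9, 0.999), eps=1e-8, step=1, tile_size=2**20):
    """Process parameters tile by tile so the temporary fp32 gradient buffer
    is bounded by `tile_size` (sketch). `master_fp32`, `exp_avg`, `exp_avg_sq`
    are assumed to be flat fp32 tensors with the same layout as the params."""
    flat_grad = torch.cat([g.reshape(-1) for g in grads_fp16])  # fp16, flat
    b1, b2 = betas
    for start in range(0, master_fp32.numel(), tile_size):
        end = min(start + tile_size, master_fp32.numel())
        # Only this tile's gradient is upcast to fp32 at a time.
        g32 = flat_grad[start:end].float()
        exp_avg[start:end].mul_(b1).add_(g32, alpha=1 - b1)
        exp_avg_sq[start:end].mul_(b2).addcmul_(g32, g32, value=1 - b2)
        denom = (exp_avg_sq[start:end] / (1 - b2 ** step)).sqrt_().add_(eps)
        update = exp_avg[start:end] / (1 - b1 ** step) / denom
        master_fp32[start:end].add_(update, alpha=-lr)
    # Copy updated master weights back into the fp16 model parameters.
    offset = 0
    for p in params_fp16:
        n = p.numel()
        p.data.copy_(master_fp32[offset:offset + n].view_as(p))
        offset += n
```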

Scaling Vision-Language Models with Sparse Mixture of Experts

This is an application of sparse MoE to vision-language models (VLMs).

Architecture

Similar to BEiT v3, but the FFN in every second layer is replaced with an expert layer (only in the first $L-F$ layers; the last $F$ layers keep a normal FFN).

The first $L-F$ layers handle unimodal inputs: text and image tokens are fed to T-MoE and V-MoE, respectively. The last $F$ layers have an additional VL-FFN that processes the tokens when the input is an image-text multimodal pair.
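
A hedged sketch of how this per-layer routing by modality might look (module names V-MoE, T-MoE, and VL-FFN follow the paper, but the code itself is an illustration, not the released implementation):

```python
import torch
import torch.nn as nn

class VLMoEBlock(nn.Module):
    """One transformer block of the VL-MoE-style model (illustrative sketch)."""
    def __init__(self, attn, v_ffn, t_ffn, vl_ffn=None):
        super().__init__()
        self.attn = attn        # shared multi-head self-attention
        self.v_ffn = v_ffn      # V-MoE (or V-FFN) for image tokens
        self.t_ffn = t_ffn      # T-MoE (or T-FFN) for text tokens
        self.vl_ffn = vl_ffn    # only present in the last F layers

    def forward(self, x, is_image_token, is_multimodal):
        """x: [seq, hidden]; is_image_token: boolean mask over the sequence."""
        h = x + self.attn(x)
        if self.vl_ffn is not None and is_multimodal:
            # Last F layers: image-text pairs go through the shared VL-FFN.
            out = self.vl_ffn(h)
        else:
            # First L-F layers: route each token to its modality's module.
            out = torch.empty_like(h)
            out[is_image_token] = self.v_ffn(h[is_image_token])
            out[~is_image_token] = self.t_ffn(h[~is_image_token])
        return h + out
```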

Training objectives

It uses masked data modeling, as in BEiT v3:

  • For text, as in BERT, the mask ratio is 15% and the model is trained to recover the masked tokens.
  • For images, masked image modeling is used as in MAE, but with block-wise masking and a 40% mask ratio; the input image is tokenized with the BEiT v2 tokenizer, which discretizes patches into visual tokens (similar to VQ-VAE).
  • For image-text pairs, the text and image mask ratios stay the same as in MLM and MIM (i.e., presumably 15% and 40%), and the masked content must be recovered by the model from the multimodal input.
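
The masking setup above can be summarized with a small sketch; plain random masking is used here for brevity, whereas the paper uses block-wise masking for images, and all names are my own:

```python
import torch

def build_masked_inputs(text_ids, image_token_ids, mask_token_id,
                        text_mask_ratio=0.15, image_mask_ratio=0.40):
    """Apply the paper's masking ratios (15% text, 40% image) to produce
    masked-modeling inputs (simplified sketch)."""
    # Text: mask 15% of token positions, BERT-style.
    text_mask = torch.rand(text_ids.shape) < text_mask_ratio
    masked_text = text_ids.clone()
    masked_text[text_mask] = mask_token_id

    # Image: mask 40% of the visual tokens produced by the BEiT v2 tokenizer.
    image_mask = torch.rand(image_token_ids.shape) < image_mask_ratio
    masked_image = image_token_ids.clone()
    masked_image[image_mask] = mask_token_id

    # The model is trained to recover the original ids at masked positions only.
    return (masked_text, text_mask), (masked_image, image_mask)
```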

The training objective also includes a load-balancing loss for the experts.
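
For reference, here is a sketch of the standard auxiliary load-balancing loss (as in Switch Transformer / GShard); the paper's exact formulation may differ.

```python
import torch

def load_balancing_loss(router_logits, expert_ids, num_experts):
    """Standard auxiliary load-balancing loss: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens dispatched to expert i and P_i is the
    mean router probability for expert i (sketch, not the paper's exact code).

    router_logits: [num_tokens, num_experts]; expert_ids: [num_tokens]."""
    probs = torch.softmax(router_logits, dim=-1)   # P(token -> expert)
    # f_i: fraction of tokens actually routed to each expert.
    f = torch.bincount(expert_ids, minlength=num_experts).float()
    f = f / expert_ids.numel()
    # P_i: average routing probability mass assigned to each expert.
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)
```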

During fine-tuning, all the MoE modules, i.e., the routers and experts, are frozen.
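
A tiny sketch of what freezing the MoE modules could look like in practice (the name-matching heuristic is my assumption, not the paper's code):

```python
def freeze_moe_modules(model):
    """Freeze routers and experts for fine-tuning; only dense (non-MoE)
    parameters stay trainable. Matching by parameter name is a guess about
    how the MoE submodules might be named."""
    for name, param in model.named_parameters():
        if "expert" in name or "router" in name or "gate" in name:
            param.requires_grad = False
```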

Model configuration and size

| Model | #layers | L-F / F | #parameters | Max. sequence length / Text tokenizer | Image resolution / Image tokenizer | Batch size | Content in each batch | Training steps / epochs |
|---|---|---|---|---|---|---|---|---|
| Base | 12 | 9/3 | 2B (180M per token) | 128 / SentencePiece | 224x224 / BEiT v2 tokenizer | 6144 | 2048 images, 2048 texts, 2048 image-text pairs | 200k steps / 40 epochs |
| Small | 8 | 7/1 | | | | | | |

Datasets for pre-training

| Modality | Dataset | Size |
|---|---|---|
| Text | English Wikipedia, BookCorpus | 4.7B words (English Wikipedia) + 1B words (BookCorpus) |
| Image | ImageNet-22K | 14M images |
| Image-Text | Conceptual Captions, SBU Captions, COCO, and Visual Genome | 4M images and 10M image-text pairs in total |

Fine-tuning performance on downstream tasks

The results above are for VL downstream tasks.

The results above are for vision-only and language-only downstream tasks. ImageNet has 1.3M images with 1K classes, and MNLI has 433K sentence pairs.