Multi-GPU Training Guide
This guide covers advanced training configurations for multi-GPU setups using Axolotl.
1 Overview
Axolotl supports several methods for multi-GPU training:
- DeepSpeed (recommended)
- FSDP (Fully Sharded Data Parallel)
- FSDP + QLoRA
2 DeepSpeed
DeepSpeed is the recommended approach for multi-GPU training due to its stability and performance. It provides various optimization levels through ZeRO stages.
2.1 Configuration
Add to your YAML config:
deepspeed: deepspeed_configs/zero1.json
2.2 Usage
accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
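The number of GPUs normally comes from your accelerate configuration (created with accelerate config), but it can also be pinned on the command line. A minimal sketch, assuming a single node with 4 GPUs and the same example config:
# hypothetical single-node launch; adjust --num_processes and the paths to your setup
accelerate launch --num_processes 4 -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json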
2.3 ZeRO Stages
We provide default configurations for:
- ZeRO Stage 1 (zero1.json)
- ZeRO Stage 2 (zero2.json)
- ZeRO Stage 3 (zero3.json)
Choose based on your memory requirements and performance needs: each successive stage shards more training state across GPUs (Stage 1 partitions optimizer states, Stage 2 also partitions gradients, Stage 3 also partitions the model parameters), lowering per-GPU memory at the cost of extra communication.
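If a model still does not fit at a lower stage, the only change needed in the Axolotl config is pointing the deepspeed key at a higher-stage file, for example:
deepspeed: deepspeed_configs/zero3.json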
3 FSDP
3.1 Basic FSDP Configuration
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
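For context, here is a minimal sketch of how these FSDP settings might sit in a full Axolotl config; the model name, batch sizes, and sequence length are illustrative placeholders, not recommendations:
base_model: meta-llama/Llama-2-7b-hf   # placeholder; substitute your own model
micro_batch_size: 1                    # keep the per-GPU batch small; scale up via accumulation
gradient_accumulation_steps: 4
sequence_len: 2048
bf16: true
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer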
3.2 FSDP + QLoRA
For combining FSDP with QLoRA, see our dedicated guide.
4 Performance Optimization
4.1 Liger Kernel Integration
Note: Liger Kernel provides efficient Triton kernels for LLM training, offering:
- 20% increase in multi-GPU training throughput
- 60% reduction in memory usage
- Compatibility with both FSDP and DeepSpeed
Configuration:
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
5 Troubleshooting
5.1 NCCL Issues
For NCCL-related problems, see our NCCL troubleshooting guide.
5.2 Common Problems
If you run out of GPU memory (a combined starting point is sketched below):
- Reduce micro_batch_size
- Reduce eval_batch_size
- Adjust gradient_accumulation_steps
- Consider using a higher ZeRO stage
If training is unstable or underperforming:
- Start with DeepSpeed ZeRO-2
- Monitor loss values
- Check learning rates
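As a rough illustration of the memory-related settings above (arbitrary starting values; tune them for your hardware and model size):
micro_batch_size: 1                         # smallest per-GPU batch
eval_batch_size: 1
gradient_accumulation_steps: 8              # recovers effective batch size lost to the smaller micro batch
deepspeed: deepspeed_configs/zero2.json     # move to zero3.json if parameters still do not fit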
For more detailed troubleshooting, see our debugging guide.