Multi-GPU Training Guide

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

1 Overview

Axolotl supports several methods for multi-GPU training:

  • DeepSpeed (recommended)
  • FSDP (Fully Sharded Data Parallel)
  • FSDP + QLoRA

2 DeepSpeed

DeepSpeed is the recommended approach for multi-GPU training due to its stability and performance. It provides various optimization levels through ZeRO stages.

2.1 Configuration

Add to your YAML config:

deepspeed: deepspeed_configs/zero1.json

2.2 Usage

accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json
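
To pin the run to a specific number of GPUs, accelerate launch accepts a --num_processes flag; the GPU count below is illustrative:

accelerate launch --num_processes 4 -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json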

2.3 ZeRO Stages

We provide default configurations for:

  • ZeRO Stage 1 (zero1.json): shards optimizer states across GPUs
  • ZeRO Stage 2 (zero2.json): shards optimizer states and gradients
  • ZeRO Stage 3 (zero3.json): shards optimizer states, gradients, and model parameters

Higher stages free more GPU memory but add communication overhead, so pick the lowest stage at which your model and batch size fit.
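
The ZeRO stage itself is set inside the DeepSpeed JSON file. The sketch below shows the general shape of a ZeRO-2 config using standard DeepSpeed keys; the zero2.json shipped with Axolotl may contain additional settings, so treat this as illustrative rather than the exact file contents:

{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}

The "auto" values are resolved from your Axolotl/Accelerate settings at launch time.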

3 FSDP

3.1 Basic FSDP Configuration

fsdp:
  - full_shard      # shard parameters, gradients, and optimizer states
  - auto_wrap       # automatically wrap transformer blocks
fsdp_config:
  fsdp_offload_params: true                               # offload sharded params to CPU to save GPU memory
  fsdp_state_dict_type: FULL_STATE_DICT                   # gather a full (unsharded) state dict when checkpointing
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer   # decoder layer class to wrap (model-specific)
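
The fsdp_transformer_layer_cls_to_wrap value must match the decoder layer class of the model being trained. LlamaDecoderLayer applies to Llama-family models; other architectures use their own class names from transformers, for example:

fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer  # Mistral models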

3.2 FSDP + QLoRA

For combining FSDP with QLoRA, see our dedicated guide.
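
That guide is the authoritative reference. As a rough, version-dependent sketch, combining the two generally means loading the base model in 4-bit and enabling the QLoRA adapter alongside the FSDP settings above; exact fsdp_config keys may differ between Axolotl releases:

adapter: qlora          # train LoRA adapters on a quantized base model
load_in_4bit: true      # 4-bit (QLoRA) quantization of the base weights
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer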

4 Performance Optimization

4.1 Liger Kernel Integration

Note: Liger Kernel provides efficient Triton kernels for LLM training, offering:

  • 20% increase in multi-GPU training throughput
  • 60% reduction in memory usage
  • Compatibility with both FSDP and DeepSpeed

Configuration:

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true                            # Triton RoPE kernel
liger_rms_norm: true                        # Triton RMSNorm kernel
liger_glu_activation: true                  # fused GLU/SwiGLU activation
liger_layer_norm: true                      # Triton LayerNorm kernel
liger_fused_linear_cross_entropy: true      # fused lm_head + cross-entropy, avoids materializing full logits

5 Troubleshooting

5.1 NCCL Issues

For NCCL-related problems, see our NCCL troubleshooting guide.
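
Before digging into that guide, turning on NCCL's debug logging is often enough to identify the failing collective or network interface; NCCL_DEBUG is a standard NCCL environment variable, not an Axolotl-specific setting:

NCCL_DEBUG=INFO accelerate launch -m axolotl.cli.train examples/llama-2/config.yml --deepspeed deepspeed_configs/zero1.json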

5.2 Common Problems

Out-of-memory errors (an illustrative config adjustment is sketched after this list):

  • Reduce micro_batch_size
  • Reduce eval_batch_size
  • Adjust gradient_accumulation_steps
  • Consider using a higher ZeRO stage

Training instability or poor convergence:

  • Start with DeepSpeed ZeRO-2
  • Monitor loss values
  • Check learning rates
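
A minimal sketch of the memory-saving adjustments above, using standard Axolotl config keys (the exact values depend on your model size and available GPU memory):

micro_batch_size: 1                         # smallest per-GPU batch
eval_batch_size: 1
gradient_accumulation_steps: 8              # preserve the effective batch size
deepspeed: deepspeed_configs/zero3.json     # move to a higher ZeRO stage if still OOM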

For more detailed troubleshooting, see our debugging guide.