Multi Node
Below are three ways to train multi-node in Axolotl.
Each machine needs a copy of Axolotl; we suggest using the same commit on every machine to ensure compatibility.
You will also need to have the same configuration file for your model on each machine.
Make sure the main machine is reachable by other machines.
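For example, a minimal sketch of getting every machine onto the same commit (the repository URL, commit placeholder, and install command are assumptions, adjust them to your setup):
git clone https://github.com/axolotl-ai-cloud/axolotl.git
cd axolotl
git checkout <commit-sha>  # check out the identical commit on every machine
pip install -e .           # install Axolotl from this checkout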
Accelerate
You will need to create a configuration for accelerate, either by running accelerate config
and following the instructions, or by using one of the presets below:
~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
machine_rank: 0 # Set to 0 for the main machine, increment by one for other machines
main_process_ip: 10.0.0.4 # Set to main machine's IP
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2 # Change to the number of machines
num_processes: 4 # Total number of GPUs across all machines (for example: 2 machines with 4 GPUs each = 8)
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
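Every other machine uses the same file with only machine_rank changed. As a sketch, the single line that differs on the second machine would be:
machine_rank: 1 # second machine; a third machine would use 2, and so on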
Configure your model to use FSDP in the Axolotl yaml. For example:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
All you have to do now is launch with accelerate on each machine as you usually would; the processes will start once accelerate has been launched on every machine.
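For example, assuming your Axolotl config file is named config.yaml (adjust the filename to your own), the per-machine launch looks like:
accelerate launch -m axolotl.cli.train config.yaml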
Ray Train
Please see the Ray Train documentation here.
Torchrun
If you are using InfiniBand, we recommend torchrun to utilize the full bandwidth.
Set the following environment variables (adjust the buffer size and socket interface names for your system):
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
Run the following on each node:
torchrun --nnodes $num_nodes --nproc_per_node $gpu_per_node --rdzv_id $rdzv_id --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:$head_node_port" -m axolotl.cli.train config.yaml
Please make sure to substitute the placeholder variables.
num_nodes: Number of nodes (machines containing GPUs)
gpu_per_node: Number of GPUs per node
head_node_ip: IP of the head node (make sure other machines can connect to it)
head_node_port: Port of the head node (make sure other machines can connect to it; default 29400)
rdzv_id: A unique job ID shared by the job across all nodes.
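As an illustration, here is the command filled in for a two-node cluster with 4 GPUs per node, a head node at 10.0.0.4 listening on the default port 29400, and an arbitrary job ID of 101 (all values are examples, substitute your own):
torchrun --nnodes 2 --nproc_per_node 4 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "10.0.0.4:29400" -m axolotl.cli.train config.yaml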
You need to call axolotl.cli.train instead of axolotl train, as the latter calls accelerate under the hood.
More info on the available configs can be found in the PyTorch docs here.