Setting up

import torch
# Check that a GPU is available; a T4 (free tier) is enough to run this notebook
assert torch.cuda.is_available(), "A CUDA-capable GPU is required to run this notebook"
!pip install --no-build-isolation axolotl[deepspeed]

Hugging Face login (optional)

from huggingface_hub import notebook_login
notebook_login()

Example configuration

import yaml

yaml_string = """
base_model: NousResearch/Meta-Llama-3.1-8B

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 2048
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: false
sdp_attention: true

warmup_steps: 1
max_steps: 25
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
"""


# Convert the YAML string to a Python dictionary
yaml_dict = yaml.safe_load(yaml_string)

# Specify your file path
file_path = 'test_axolotl.yaml'

# Write the YAML file
with open(file_path, 'w') as file:
    yaml.dump(yaml_dict, file)

Above we have a configuration file that specifies the base LLM and the datasets, among many other things. Axolotl automatically detects whether a specified dataset lives in a Hugging Face repository or on the local machine.
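For example, if the Alpaca-format data lived in a local JSON Lines file instead of on the Hub, the datasets entry could look like the following (the file path below is hypothetical):

datasets:
  - path: ./data/my_alpaca_data.jsonl
    type: alpaca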

The Axolotl configuration options encompass model and dataset selection, data pre-processing, and training. Let’s go through them line by line:

  • “base_model”: String. Specifies the underlying pre-trained LLM that will be used for finetuning.

Next come the options for model weight quantization. Quantization reduces the memory the model occupies on the GPU.

  • “load_in_8bit”: Boolean. Whether to quantize the model weights to 8-bit precision.

  • “load_in_4bit”: Boolean. Whether to quantize the model weights to 4-bit precision.

  • “strict”: Boolean. If false, configuration options set in the YAML file can be overridden from the command-line interface (see the example after this list).

  • “datasets”: A list of dicts specifying the path and type of each dataset, along with other optional dataset-level settings. Multiple datasets are supported.

  • “val_set_size”: Either a float less than 1 or an integer smaller than the total dataset size. Sets the size of the validation split. A float is interpreted as the proportion of the dataset held out for validation; an integer is the absolute number of validation examples.

  • “output_dir”: String. Path where the trained model is saved.
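For instance, because strict is false, an option from the YAML file can be overridden at launch time without editing the file. A hedged example (the exact override syntax may vary between Axolotl versions):

!accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml --learning_rate=1e-5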

For data preprocessing:

  • “sequence_len”: Integer. Specifies the maximum sequence length of the input. Typically 2048 or less.

  • “pad_to_sequence_len”: Boolean. Whether to pad inputs to the maximum sequence length.

  • “sample_packing”: Boolean. Specifies whether to use multi-packing with block-diagonal attention (see the sketch after this list).

  • “special_tokens”: Python dict, optional. Allows users to specify additional special tokens (e.g. the pad token) to register with the tokenizer.
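To illustrate what sample packing does, the simplified sketch below (not Axolotl's actual implementation) concatenates several short tokenized examples into one sequence_len-sized row and builds a block-diagonal attention mask so the packed samples cannot attend to each other:

import torch

sequence_len = 8   # tiny value for illustration; the config above uses 2048
samples = [[101, 7, 8, 102], [101, 5, 102]]   # two short tokenized examples

# Concatenate both samples into one packed row and pad the remainder.
packed = sum(samples, [])
packed += [0] * (sequence_len - len(packed))

# Block-diagonal mask: tokens attend only within their own original sample.
mask = torch.zeros(sequence_len, sequence_len, dtype=torch.bool)
start = 0
for s in samples:
    end = start + len(s)
    mask[start:end, start:end] = True
    start = end

print(packed)       # [101, 7, 8, 102, 101, 5, 102, 0]
print(mask.int())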

For LoRA configuration and its hyperparameters:

  • “adapter”: String. Either “lora” or “qlora”, depending on user’s choice.

  • “lora_model_dir”: String, optional. Path to a directory containing an already trained LoRA model the user would like to reuse.

  • “lora_r”: Integer. The rank of the LoRA decomposition matrices. A higher value increases the number of trainable parameters and reduces LoRA's efficiency gains. Recommended to be set to 8.

  • “lora_alpha”: Integer. The LoRA weight updates are scaled by \(\frac{\text{lora\_alpha}}{\text{lora\_r}}\). Recommended to be fixed at 16.

  • “lora_dropout”: Float between 0 and 1. The dropout probability of a LoRA layer.

  • “lora_target_linear”: Boolean. If true, LoRA will target all linear modules in the transformer architecture.

  • “lora_modules_to_save”: If you added new tokens to the tokenizer, you may need to save these modules in full (here embed_tokens and lm_head) so that they can learn the new tokens.

See LoRA for a detailed explanation of the LoRA implementation.
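Taken together, the LoRA options above correspond roughly to the following PEFT configuration. This is a sketch for intuition only; Axolotl builds the adapter internally and the exact arguments may differ:

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                          # lora_r: rank of the decomposition matrices
    lora_alpha=16,                 # updates are scaled by lora_alpha / lora_r = 0.5
    lora_dropout=0.05,             # lora_dropout
    target_modules="all-linear",   # lora_target_linear: true
    modules_to_save=["embed_tokens", "lm_head"],  # lora_modules_to_save
    task_type="CAUSAL_LM",
)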

For the training configurations:

  • “gradient_accumulation_steps”: Integer. The number of micro-batches over which gradients are accumulated before an optimizer update. E.g. if 2, the optimizer steps once every two micro-batches.

  • “micro_batch_size”: Integer. Batch size per GPU for each gradient accumulation step, i.e. per-GPU batch size / gradient_accumulation_steps (see the worked example after this list).

  • “num_epochs”: Integer. Number of epochs. One epoch is one full pass over the entire dataset.

  • “optimizer”: The optimizer to use for the training.

  • “learning_rate”: The learning rate.

  • “lr_scheduler”: The learning rate scheduler to use for adjusting learning rate during training.

  • “train_on_inputs”: Boolean. Whether to include the prompt (input) tokens in the training labels or mask them out of the loss.

  • “group_by_length”: Boolean. Whether to group similarly sized data to minimize padding.

  • “bf16”: Either “auto”, “true”, or “false”. Whether to use the CUDA bf16 floating-point format. If set to “auto”, bf16 is applied automatically if the GPU supports it.

  • “fp16”: Optional. Specifies whether to use CUDA fp16. If left unset, it is derived from the “bf16” setting and the GPU's capabilities.

  • “tf32”: Boolean. Whether to use CUDA tf32. Will override bf16.

  • “gradient_checkpointing”: Boolean. Whether to use gradient checkpointing (https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing).

  • “gradient_checkpointing_kwargs”: Python Dict. Fed into the trainer.

  • “logging_steps”: Integer. Log training information every this many steps.

  • “flash_attention”: Boolean. Whether to use the flash attention mechanism.

  • “sdp_attention”: Boolean. Whether to use scaled dot-product attention (the attention mechanism from the original Transformer architecture).

  • “warmup_steps”: Integer. The number of initial training steps during which the learning rate is gradually increased from a very low value.

  • “evals_per_epoch”: Integer. Number of evaluations to be performed within one training epoch.

  • “saves_per_epoch”: Integer. Number of times the model is saved in one training epoch.

  • “weight_decay”: Positive float. Sets the strength of weight decay (i.e. the coefficient of the L2 regularization term).
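Putting the batch-related options together, the effective batch size for the example configuration can be computed as follows (assuming a single GPU, as in the free-tier Colab setup):

micro_batch_size = 1
gradient_accumulation_steps = 2
num_gpus = 1

# Number of examples contributing to each optimizer update.
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 2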

The above is only a snippet meant to familiarize users with the kinds of streamlined configuration options Axolotl provides. For a full list of configuration options, see here.

Train the model

!accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml

Predict with trained model

!accelerate launch -m axolotl.cli.inference /content/test_axolotl.yaml \
    --lora_model_dir="./outputs/lora-out" --gradio

Deeper Dive

It is also helpful to gain some familiarity with some of the core inner workings of Axolotl.

Configuration Normalization

Axolotl uses a custom dict class, called DictDefault, to store the configurations specified in the YAML configuration file (in a Python variable named cfg). The definition of this custom dict can be found in utils/dict.py.

DictDefault is amended such that accessing a missing key returns None instead of raising an error. This is important because when some configuration options are not specified by the user, the None value allows Axolotl to use simple boolean logic to determine default settings for the missing options. For more examples of how this is done, check out utils/config/__init__.py.
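A simplified illustration of this behaviour (not the actual DictDefault implementation) shows why returning None for missing keys is convenient for defaulting logic:

class DictDefaultSketch(dict):
    """Minimal stand-in: attribute-style access where missing keys yield None."""

    def __getattr__(self, key):
        return self.get(key)  # missing keys return None instead of raising

cfg = DictDefaultSketch({"learning_rate": 2e-5})

print(cfg.learning_rate)     # 2e-05
print(cfg.flash_attention)   # None -> the option was not specified
# None is falsy, so defaults can be filled in with simple boolean logic:
use_flash = cfg.flash_attention or False
print(use_flash)             # False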

Loading Models, Tokenizers, and Trainer

If we inspect cli/train.py, we will find that most of the heavy lifting is done by the function train(), which is itself imported from src/axolotl/train.py.

train() takes care of loading the appropriate pre-trained model and tokenizer through load_model() and load_tokenizer() from src/axolotl/utils/models.py, respectively.

load_tokenizer() loads the appropriate tokenizer for the chosen model, along with the chat template.

The ModelLoader class takes over after the tokenizer has been selected. It automatically discerns the base model type, loads the desired model, and applies model-appropriate attention mechanism modifications (e.g. flash attention). Depending on which base model the user chooses in the configuration, ModelLoader will use the corresponding “attention hijacking” script. For example, if the user specifies NousResearch/Meta-Llama-3.1-8B as the base model, which is of llama type, and sets flash_attention to true, ModelLoader loads llama_attn_hijack_flash.py. For the list of supported attention hijacks, refer to the directory /src/axolotl/monkeypatch/.
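The dispatch is conceptually along the lines of the following simplified, hypothetical sketch (the real logic lives in load_model() and the monkeypatch scripts):

def choose_attention_patch(cfg, model_type):
    """Hypothetical sketch of the kind of dispatch ModelLoader performs."""
    if cfg.get("flash_attention") and model_type == "llama":
        return "llama_attn_hijack_flash"  # patch script in monkeypatch/
    if cfg.get("sdp_attention"):
        return "sdp_attention"            # use scaled dot-product attention
    return None                           # fall back to the default attention

print(choose_attention_patch({"flash_attention": True}, "llama"))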

Another important operation in train() is setting up the trainer, taking into account the user-specified training configurations (e.g. num_epochs, optimizer), through setup_trainer() from /src/axolotl/utils/trainer.py, which in turn relies on modules from /src/axolotl/core/trainer_builder.py. trainer_builder.py provides trainer options tailored to the task type (causal language modeling or reinforcement learning (‘dpo’, ‘ipo’, ‘kto’)).
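Conceptually, trainer selection follows the task type, along the lines of this sketch (hypothetical names, not the actual trainer_builder code; it assumes an rl config option, not shown in the config above, selects the reinforcement-learning task):

RL_TASKS = {"dpo", "ipo", "kto"}

def build_trainer(cfg):
    """Hypothetical sketch: pick a trainer flavor based on the task type."""
    if cfg.get("rl") in RL_TASKS:
        return "preference/RL trainer (e.g. DPO-style)"
    return "causal language-modeling trainer"

print(build_trainer({"rl": "dpo"}))   # preference/RL trainer (e.g. DPO-style)
print(build_trainer({}))              # causal language-modeling trainer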

Monkey patch

The monkeypatch directory is where model architecture/optimization patching scripts are stored (these are modifications not yet implemented in the official upstream releases, hence the name “monkey patch”). It includes attention hijacking, ReLoRA, and Unsloth optimizations.
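The underlying technique is ordinary Python monkey patching: replacing an attribute of an already-imported class or module at runtime. A generic illustration, unrelated to any specific Axolotl patch:

class Attention:
    def forward(self, x):
        return "original attention"

def patched_forward(self, x):
    return "optimized attention"

# Monkey patch: swap the method on the class after it has been defined/imported.
Attention.forward = patched_forward

print(Attention().forward(None))  # optimized attention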