Inference Guide

This guide covers how to use your trained models for inference, including model loading, interactive testing, and common troubleshooting steps.

1 Quick Start

1.1 Basic Inference

Run inference against a LoRA adapter directory, or against a fully finetuned or merged model:

axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"
axolotl inference your_config.yml --base-model="./completed-model"

2 Advanced Usage

2.1 Gradio Interface

Launch an interactive web interface:

axolotl inference your_config.yml --gradio

2.2 File-based Prompts

Process prompts from a text file:

cat /tmp/prompt.txt | axolotl inference your_config.yml \
  --base-model="./completed-model" --prompter=None

2.3 Memory Optimization

For large models or limited memory:

axolotl inference your_config.yml --load-in-8bit=True
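
For reference, 8-bit loading roughly corresponds to the following Python sketch using transformers and bitsandbytes; the model path is a placeholder and this is not part of the axolotl CLI:

# Approximate equivalent of 8-bit loading, assuming bitsandbytes is installed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_dir = "./completed-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate place layers across available devices
)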

3 Merging LoRA Weights

Merge LoRA adapters with the base model:

axolotl merge-lora your_config.yml --lora-model-dir="./completed-model"
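
For intuition, the merge step is roughly equivalent to the following Python sketch using peft's merge_and_unload; the paths and dtype are placeholders, and the CLI additionally handles config parsing for you:

# Rough sketch of a LoRA merge done directly with peft (not the axolotl code path)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your-base-model", torch_dtype=torch.float16)  # placeholder
model = PeftModel.from_pretrained(base, "./completed-model")  # LoRA adapter directory
merged = model.merge_and_unload()         # fold the adapter weights into the base weights
merged.save_pretrained("./merged-model")  # standalone model you can load without peft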

3.1 Memory Management for Merging

If merging runs out of GPU memory, set these options in your config:

gpu_memory_limit: 20GiB  # Adjust based on your GPU
lora_on_cpu: true        # Process on CPU if needed

If that is still not enough, hide the GPU so the merge runs entirely on CPU using system RAM (slower, but avoids GPU memory limits):

CUDA_VISIBLE_DEVICES="" axolotl merge-lora ...

4 Tokenization

4.1 Common Issues

Warning

Tokenization mismatches between training and inference are a common source of problems.

To debug:

  1. Check training tokenization:

axolotl preprocess your_config.yml --debug

  2. Verify inference tokenization by decoding tokens before model input (see the sketch after this list)

  3. Compare token IDs between training and inference
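
A minimal way to do step 2 is to encode your exact inference prompt with the saved tokenizer and inspect the IDs; the path and prompt below are placeholders:

# Hypothetical check: compare these IDs against what `axolotl preprocess --debug` prints
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./completed-model")  # placeholder path
prompt = "<|im_start|>user\nHello<|im_end|>\n"                  # example prompt text
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(ids)                    # token IDs the model will actually see
print(tokenizer.decode(ids))  # round-trip to spot missing or duplicated special tokens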

4.2 Special Tokens

Configure special tokens in your YAML:

special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

5 Troubleshooting

5.1 Common Problems

If you run out of memory:

  • Use 8-bit loading
  • Reduce batch sizes
  • Try CPU offloading

If tokenization looks wrong:

  • Verify special tokens
  • Check tokenizer settings
  • Compare training and inference preprocessing

If output is unexpected or low quality:

  • Verify model loading
  • Check prompt formatting
  • Check temperature and sampling settings (see the sketch after this list)
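
When checking prompt formatting and sampling settings outside the CLI, a minimal generation loop like the one below can help isolate the problem; the model path, prompt, and sampling values are only examples:

# Hypothetical standalone generation check with explicit sampling parameters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./merged-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("<|im_start|>user\nHello<|im_end|>\n", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # sampling must be enabled for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))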

For more details, see our debugging guide.