Inference Guide

This guide covers how to use your trained models for inference, including model loading, interactive testing, and common troubleshooting steps.

1 Quick Start

1.1 Basic Inference

Run inference against a LoRA adapter directory, or against a fully finetuned or merged model:

axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"
axolotl inference your_config.yml --base-model="./completed-model"

2 Advanced Usage

2.1 Gradio Interface

Launch an interactive web interface:

axolotl inference your_config.yml --gradio

2.2 File-based Prompts

Process prompts from a text file:

cat /tmp/prompt.txt | axolotl inference your_config.yml \
  --base-model="./completed-model" --prompter=None

2.3 Memory Optimization

For large models or limited memory:

axolotl inference your_config.yml --load-in-8bit=True
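
For reference, 8-bit loading roughly corresponds to the following Python sketch using transformers and bitsandbytes; the model path is a placeholder and this is not part of the axolotl CLI:

# Approximate equivalent of 8-bit loading, assuming bitsandbytes is installed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_dir = "./completed-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate place layers across available devices
)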

3 Merging LoRA Weights

Merge LoRA adapters with the base model:

axolotl merge-lora your_config.yml --lora-model-dir="./completed-model"
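
For intuition, the merge step is roughly equivalent to the following Python sketch using peft's merge_and_unload; the paths and dtype are placeholders, and the CLI additionally handles config parsing for you:

# Rough sketch of a LoRA merge done directly with peft (not the axolotl code path)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your-base-model", torch_dtype=torch.float16)  # placeholder
model = PeftModel.from_pretrained(base, "./completed-model")  # LoRA adapter directory
merged = model.merge_and_unload()         # fold the adapter weights into the base weights
merged.save_pretrained("./merged-model")  # standalone model you can load without peft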

3.1 Memory Management for Merging

If merging runs out of GPU memory, set these options in your config:

gpu_memory_limit: 20GiB  # Adjust based on your GPU
lora_on_cpu: true        # Process on CPU if needed

If that is still not enough, hide the GPU so the merge runs entirely on CPU using system RAM (slower, but avoids GPU memory limits):

CUDA_VISIBLE_DEVICES="" axolotl merge-lora ...

4 Tokenization

4.1 Common Issues

Warning

Tokenization mismatches between training and inference are a common source of problems.

To debug:

  1. Check training tokenization:

axolotl preprocess your_config.yml --debug

  2. Verify inference tokenization by decoding tokens before model input (see the sketch after this list)

  3. Compare token IDs between training and inference
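
A minimal way to do step 2 is to encode your exact inference prompt with the saved tokenizer and inspect the IDs; the path and prompt below are placeholders:

# Hypothetical check: compare these IDs against what `axolotl preprocess --debug` prints
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./completed-model")  # placeholder path
prompt = "<|im_start|>user\nHello<|im_end|>\n"                  # example prompt text
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(ids)                    # token IDs the model will actually see
print(tokenizer.decode(ids))  # round-trip to spot missing or duplicated special tokens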

4.2 Special Tokens

Configure special tokens in your YAML:

special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

5 Troubleshooting

5.1 Common Problems

If you run out of memory:

  • Use 8-bit loading
  • Reduce batch sizes
  • Try CPU offloading

If tokenization looks wrong:

  • Verify special tokens
  • Check tokenizer settings
  • Compare training and inference preprocessing

If output is unexpected or low quality:

  • Verify model loading
  • Check prompt formatting
  • Check temperature and sampling settings (see the sketch after this list)
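
When checking prompt formatting and sampling settings outside the CLI, a minimal generation loop like the one below can help isolate the problem; the model path, prompt, and sampling values are only examples:

# Hypothetical standalone generation check with explicit sampling parameters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./merged-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("<|im_start|>user\nHello<|im_end|>\n", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # sampling must be enabled for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))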

For more details, see our debugging guide.