Inference Guide
This guide covers how to use your trained models for inference, including model loading, interactive testing, and common troubleshooting steps.
1 Quick Start
1.1 Basic Inference
Run inference against a LoRA adapter or a fully fine-tuned model:
# With a LoRA adapter
axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"
# With a full fine-tune or merged model
axolotl inference your_config.yml --base-model="./completed-model"
2 Advanced Usage
2.1 Gradio Interface
Launch an interactive web interface:
axolotl inference your_config.yml --gradio
2.2 File-based Prompts
Process a raw prompt from a text file; passing --prompter=None skips the prompt template so the file contents are sent to the model as-is:
cat /tmp/prompt.txt | axolotl inference your_config.yml \
--base-model="./completed-model" --prompter=None
2.3 Memory Optimization
For large models or limited GPU memory, load the model in 8-bit:
axolotl inference your_config.yml --load-in-8bit=True
3 Merging LoRA Weights
Merge LoRA adapters with the base model:
axolotl merge-lora your_config.yml --lora-model-dir="./completed-model"
3.1 Memory Management for Merging
If the merge runs out of memory, set limits in your config:
gpu_memory_limit: 20GiB # Adjust based on your GPU
lora_on_cpu: true # Process on CPU if needed
You can also force the merge to run entirely on the CPU by hiding the GPUs:
CUDA_VISIBLE_DEVICES="" axolotl merge-lora ...
4 Tokenization
4.1 Common Issues
Warning
Tokenization mismatches between training and inference are a common source of problems.
To debug:
- Check training tokenization:
axolotl preprocess your_config.yml --debug
- Verify inference tokenization by decoding the token IDs just before they are passed to the model (see the sketch after this list)
- Compare token IDs between training and inference
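As a rough illustration, the sketch below decodes a prompt token by token with the Hugging Face tokenizer so the result can be compared against the dump from axolotl preprocess --debug. The model directory and prompt are placeholders, not part of this guide; substitute your own.
from transformers import AutoTokenizer

# Placeholder path: point this at the tokenizer saved with your trained model.
tokenizer = AutoTokenizer.from_pretrained("./completed-model")

prompt = "<|im_start|>user\nHello!<|im_end|>\n"
ids = tokenizer(prompt)["input_ids"]

# Print each ID with its decoded text so the token split is visible;
# these should line up with what the training preprocessing produced.
for token_id in ids:
    print(token_id, repr(tokenizer.decode([token_id])))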
4.2 Special Tokens
Configure special tokens in your YAML:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
5 Troubleshooting
5.1 Common Problems
Out-of-memory errors:
- Use 8-bit loading
- Reduce batch sizes
- Try CPU offloading
Tokenization mismatches:
- Verify special tokens
- Check tokenizer settings
- Compare training and inference preprocessing
Poor generation quality (see the sketch after this list):
- Verify model loading
- Check prompt formatting
- Check temperature and other sampling settings
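To isolate generation-quality issues from the CLI, one option is to load the trained or merged model directly with transformers and generate with explicit sampling settings. This is a minimal sketch; the model path, prompt format, and sampling values are placeholders to adapt to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: a merged model or full fine-tune output directory.
model_dir = "./completed-model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="auto"
)

# Use the same prompt format the model was trained with (ChatML shown as an example).
prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Explicit sampling settings make it easy to see how temperature and top_p affect output.
output = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))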
For more details, see our debugging guide.