# MultiModal / Vision Language Models (BETA)
## Supported Models
- Mllama, i.e. multimodal Llama models with vision support (e.g. Llama 3.2 Vision)
## Usage
Currently, multimodal support is limited and doesn't have full feature parity. To finetune a multimodal Llama with LoRA, you'll need to use the following YAML options in combination with the rest of the required hyperparams.
```yaml
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
skip_prepare_dataset: true

chat_template: llama3_2_vision
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages
remove_unused_columns: false
sample_packing: false

# only finetune the language model, leave the vision model and vision tower frozen
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
```
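
Once the config is saved, training is launched the same way as any other Axolotl run. A minimal sketch, assuming the config above was saved to the illustrative path `configs/llama-3.2-vision-lora.yaml` and that `accelerate` is already configured:

```bash
# Launch the LoRA vision finetune using the config above.
# The config path is only an example; point it at wherever you saved the YAML.
accelerate launch -m axolotl.cli.train configs/llama-3.2-vision-lora.yaml
```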