Conversation
Conversation format for supervised fine-tuning.
pygmalion
data.jsonl
{"conversations": [{"role": "...", "value": "..."}]}
chat_template
Chat Template strategy uses a jinja2 template that converts a list of messages into a prompt. Support using tokenizer’s template, a supported template, or custom jinja2.
data.jsonl
{"conversations": [{"role": "...", "content": "..."}]}
See config.qmd
for full configs and supported templates.
Examples
- Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
datasets:
- path: ...
type: chat_template
- Using the
gemma
chat template to override the tokenizer_config.json’s chat template on OpenAI messages format, training on all assistant messages.
chat_template: gemma # this overwrites the tokenizer's chat_template
datasets:
- path: ...
type: chat_template
roles_to_train: ["assistant"]
- Using the tokenizer_config.json’s chat template or
chatml
as fallback if the former’s chat template does not exist, on OpenAI messages format, training on all assistant messages.
chat_template: tokenizer_default_fallback_chatml # this overwrites the tokenizer's chat_template
datasets:
- path: ...
type: chat_template
roles_to_train: ["assistant"]
- Using a custom jinja template on OpenAI messages format, training on all assistant messages.
# chat_template: jinja # `jinja` will be implied if the `chat_template_jinja` is set and this field is empty
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"
datasets:
- path: ...
type: chat_template
roles_to_train: ["assistant"]
- (Advanced) Using fine-grained control over tokens and turns to train in a conversation
For a data sample that looks like:
data.jsonl
{
"conversations": [
{"from": "system", "value": "You are an AI assistant.", "train": false},
{"from": "human", "value": "Hello", "train": false},
{"from": "assistant", "value": "Hello", "train": true},
{"from": "human", "value": "How are you?", "train": true},
{
"from": "assistant",
"value": "I'm doing very well, thank you!",
"train_detail": [
{"begin_offset": 0, "end_offset": 8, "train": false},
{"begin_offset": 9, "end_offset": 18, "train": true},
{"begin_offset": 19, "end_offset": 30, "train": false},
],
},
{
"from": "human",
"value": "I'm doing very well, thank you!",
"train": true,
},
{"from": "assistant", "value": "Hi there!", "train": true}
]
}
The configuration would look like:
datasets:
- path: ...
type: chat_template
chat_template: tokenizer_default
field_messages: conversations
message_field_role: from
message_field_content: value
roles_to_train: []
train_on_eos: turn
message_field_training: train
message_field_training_detail: train_detail
Tip: It is not necessary to use both message_field_training
and message_field_training_detail
at a time.