# Custom Pre-Tokenized Dataset

How to use a custom pre-tokenized dataset.
- Pass an empty `type:` in your axolotl config.
- Columns in the dataset must be exactly `input_ids`, `attention_mask`, `labels`.
- To indicate that a token should be ignored during training, set its corresponding label to `-100`.
- You must add BOS and EOS yourself, and make sure you are training on EOS by not setting its label to `-100`.
- For pretraining, do not truncate/pad documents to the context window length.
- For instruction training, truncate/pad documents as desired.
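The rules above can be sketched for the instruction-training case. This is a minimal, illustrative helper, not part of axolotl itself; the `BOS_ID`/`EOS_ID` values are placeholders and must be replaced with your tokenizer's actual special-token IDs:

```python
import json

# Hypothetical special-token IDs -- substitute your tokenizer's real values.
BOS_ID, EOS_ID, IGNORE_INDEX = 1, 2, -100

def build_record(prompt_ids, response_ids):
    """Build one pre-tokenized record.

    The prompt (and, as a common choice, BOS) is masked out of the loss
    with -100; the response tokens and EOS keep their real labels so the
    model is trained on them, including learning to emit EOS.
    """
    input_ids = [BOS_ID] + prompt_ids + response_ids + [EOS_ID]
    labels = (
        [IGNORE_INDEX]                       # masking BOS is a design choice
        + [IGNORE_INDEX] * len(prompt_ids)   # ignore prompt tokens in the loss
        + response_ids                       # train on the response
        + [EOS_ID]                           # train on EOS -- do NOT set this to -100
    )
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels}

record = build_record([271, 299], [99])
print(json.dumps(record))
```

Each returned dict is one JSONL line; all three lists must have the same length.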
Sample config:

```yaml
# config.yml
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
    type:
```
Sample jsonl:

```jsonl
{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}
{"input_ids":[87,227,8383,12],"attention_mask":[1,1,1,1],"labels":[87,227,8383,12]}
```