Pre-training
Data format for a pre-training completion task.
For pre-training, there are no prompt templates or roles. The only required field is `text`:
data.jsonl
{"text": "first row"}
{"text": "second row"}
...
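As a minimal sketch of producing and checking a file in this shape, the Python below serializes strings into one `{"text": ...}` object per line and counts the valid rows. The helper names `to_jsonl` and `count_valid_rows` are hypothetical and are not part of Axolotl; only the standard `json` module is used.

```python
import json


def to_jsonl(texts):
    """Serialize an iterable of strings as JSONL: one {"text": ...} object per line."""
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return "".join(json.dumps({"text": t}, ensure_ascii=False) + "\n" for t in texts)


def count_valid_rows(jsonl):
    """Count the rows in a JSONL string.

    Raises ValueError if any non-blank line is not an object with a string `text` field.
    """
    count = 0
    # Split on "\n" only: JSON escapes newlines inside strings, so each row is one line.
    for line_number, line in enumerate(jsonl.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines, including the trailing one
        row = json.loads(line)
        if not isinstance(row, dict) or not isinstance(row.get("text"), str):
            raise ValueError(f"line {line_number}: missing string 'text' field")
        count += 1
    return count
```

Writing the result of `to_jsonl(["first row", "second row"])` to `data.jsonl` yields the two rows shown above.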
Streaming is recommended for large datasets
Axolotl usually loads the entire dataset into memory, which can be impractical for large datasets. Use the following config to enable streaming:
config.yaml
pretraining_dataset:
  - name:
    path:
    split:
    text_column: # column in dataset with the data, usually `text`
    type: pretrain
    trust_remote_code:
    skip: # number of rows of data to skip over from the beginning
...
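To illustrate the idea behind streaming, and what `text_column` and `skip` describe, here is a conceptual sketch in plain Python. It is not Axolotl's implementation: `stream_texts` is a hypothetical helper that reads rows lazily from any iterable of JSONL lines, so only one row is held in memory at a time.

```python
import itertools
import json


def stream_texts(lines, text_column="text", skip=0):
    """Lazily yield the text field of each JSONL row, skipping the first `skip` rows.

    `lines` can be any iterable of strings, such as an open file handle.
    """
    # Generator expression: rows are parsed one at a time, only when requested.
    rows = (json.loads(line) for line in lines if line.strip())
    # islice discards the first `skip` rows without materializing the rest.
    for row in itertools.islice(rows, skip, None):
        yield row[text_column]
```

Passing an open file, for example `stream_texts(open("data.jsonl", encoding="utf-8"), skip=1)`, would iterate over the file row by row starting from the second row.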