Dataset Preprocessing
Overview
Dataset pre-processing is the step where Axolotl takes each dataset you’ve configured alongside the dataset format and prompt strategies to:
- parse the dataset based on the dataset format
- transform the dataset to how you would interact with the model based on the prompt strategy
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
The processing of the datasets can happen one of two ways:
- Before kicking off training by calling
axolotl preprocess config.yaml --debug
- When training is started
What are the benefits of pre-processing?
When training interactively or for sweeps (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent training parameters so that it will intelligently pull from its cache when possible.
The path of the cache is controlled by dataset_prepared_path:
and is often left blank in example YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
If dataset_prepared_path:
is left empty, when training, the processed dataset will be cached in a default path of ./last_run_prepared/
, but will ignore anything already cached there. By explicitly setting dataset_prepared_path: ./last_run_prepared
, the trainer will use whatever pre-processed data is in the cache.
What are the edge cases?
Let’s say you are writing a custom prompt strategy or using a user-defined prompt template. Because the trainer cannot readily detect these changes, we cannot change the calculated hash value for the pre-processed dataset.
If you have dataset_prepared_path: ...
set and change your prompt templating logic, it may not pick up the changes you made and you will be training over the old prompt.