utils.collators.batching

Data collators for axolotl that pad labels and position_ids for packed sequences. Also includes the collation logic for sequence parallelism.

Classes

| Name | Description |
| --- | --- |
| BatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler. |
| DataCollatorForSeq2Seq | Data collator that dynamically pads the inputs received, as well as the labels and position_ids. |
| PretrainingBatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler. |
| V2BatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler. |

BatchSamplerDataCollatorForSeq2Seq

utils.collators.batching.BatchSamplerDataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)

Collator for multipack, specific to using the BatchSampler.
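
The multipack BatchSampler yields groups of samples whose combined length fits into one packed sequence, and the collator joins each group into a single row before padding. A minimal sketch of that idea on toy data (the concatenation below is illustrative, not the class's actual implementation):

```python
# One group from a multipack BatchSampler: samples that fit together
# in a single packed sequence (illustrative data).
group = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3], "position_ids": [0, 1, 2]},
    {"input_ids": [4, 5], "labels": [4, 5], "position_ids": [0, 1]},
]

# Concatenate the group into one packed row; the parent Seq2Seq
# collator then pads the result as usual.
packed = {key: sum((sample[key] for sample in group), []) for key in group[0]}

# packed["position_ids"] == [0, 1, 2, 0, 1] -- positions reset at each
# sample boundary, which is why position_ids need their own pad value.
```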

DataCollatorForSeq2Seq

utils.collators.batching.DataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)

Data collator that dynamically pads the inputs received, as well as the labels and position_ids.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | [PreTrainedTokenizer] or [PreTrainedTokenizerFast] | The tokenizer used for encoding the data. | required |
| model | [PreTrainedModel] | The model that is being trained. If set and the model has the prepare_decoder_input_ids_from_labels method, it is used to prepare the decoder_input_ids. This is useful when using label_smoothing to avoid calculating the loss twice. | None |
| padding | bool, str or [~utils.PaddingStrategy], optional, defaults to True | Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among: True or 'longest' (default): pad to the longest sequence in the batch (or no padding if only a single sequence is provided); 'max_length': pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided; False or 'do_not_pad': no padding (i.e., the batch can contain sequences of different lengths). | True |
| max_length | int, optional | Maximum length of the returned list and optionally padding length (see above). | None |
| pad_to_multiple_of | int, optional | If set, pads the sequence to a multiple of the provided value. This is especially useful for enabling Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). | None |
| label_pad_token_id | int, optional, defaults to -100 | The id to use when padding the labels (-100 is automatically ignored by PyTorch loss functions). | -100 |
| position_pad_token_id | int, optional, defaults to 0 | The id to use when padding the position_ids. | 0 |
| return_tensors | str | The type of Tensor to return. Allowable values are "np", "pt" and "tf". | 'pt' |
| sequence_parallel_degree | int | The degree of sequence parallelism. Defaults to 1 (no sequence parallelism). | 1 |
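
A hedged usage sketch for this collator. The import path is assumed from this page's module name (axolotl.utils.collators.batching); everything else follows the parameters documented above:

```python
from transformers import AutoTokenizer

from axolotl.utils.collators.batching import DataCollatorForSeq2Seq  # assumed path

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

collator = DataCollatorForSeq2Seq(
    tokenizer,
    pad_to_multiple_of=8,     # align lengths for Tensor Cores
    label_pad_token_id=-100,  # ignored by PyTorch loss functions
    return_tensors="pt",
)

features = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3], "position_ids": [0, 1, 2]},
    {"input_ids": [4, 5], "labels": [4, 5], "position_ids": [0, 1]},
]
batch = collator(features)

# input_ids, labels, and position_ids are all padded to the same
# multiple-of-8 length.
print(batch["input_ids"].shape, batch["labels"].shape, batch["position_ids"].shape)
```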

Methods

| Name | Description |
| --- | --- |
| apply_sequence_parallelism | Apply sequence parallelism slicing to a batch. |

apply_sequence_parallelism

utils.collators.batching.DataCollatorForSeq2Seq.apply_sequence_parallelism(
    batch,
)

Apply sequence parallelism slicing to a batch.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| batch | dict[str, torch.Tensor] | Batch dictionary from the parent collator. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
|  | dict[str, torch.Tensor] | Sliced batch dictionary. |
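
The effect is easiest to see on a toy batch. Below is a minimal sketch of the slicing idea, assuming each of sequence_parallel_degree ranks keeps one contiguous chunk of the sequence dimension; the helper name and explicit rank argument are illustrative, not part of this module's API:

```python
import torch

def slice_for_rank(batch: dict[str, torch.Tensor], rank: int, sp_degree: int):
    # Illustrative only: keep this rank's contiguous chunk of every
    # (batch, seq_len)-shaped tensor in the batch dict.
    sliced = {}
    for key, tensor in batch.items():
        chunk = tensor.shape[1] // sp_degree
        sliced[key] = tensor[:, rank * chunk : (rank + 1) * chunk]
    return sliced

batch = {
    "input_ids": torch.arange(16).reshape(2, 8),
    "position_ids": torch.arange(8).repeat(2, 1),
}
local = slice_for_rank(batch, rank=1, sp_degree=2)
print(local["input_ids"].shape)   # torch.Size([2, 4])
print(local["position_ids"][0])   # tensor([4, 5, 6, 7])
```

Note that any rank other than rank 0 starts mid-sequence, which is where adjust_position_ids_for_slice (documented below) comes in.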

PretrainingBatchSamplerDataCollatorForSeq2Seq

utils.collators.batching.PretrainingBatchSamplerDataCollatorForSeq2Seq(
    self,
    *args,
    multipack_attn=True,
    **kwargs,
)

Collator for multipack, specific to using the BatchSampler.

V2BatchSamplerDataCollatorForSeq2Seq

utils.collators.batching.V2BatchSamplerDataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)

Collator for multipack, specific to using the BatchSampler.

Functions

| Name | Description |
| --- | --- |
| adjust_position_ids_for_slice | Adjust position IDs for a sliced sequence to maintain proper relative positions. |

adjust_position_ids_for_slice

utils.collators.batching.adjust_position_ids_for_slice(position_ids, start_idx)

Adjust position IDs for a sliced sequence to maintain proper relative positions. This handles the case where position IDs might not be contiguous due to sample packing.
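
For intuition, a hypothetical re-implementation of the adjustment described above (not the module's actual code): in a packed row, position IDs reset to 0 at every sample boundary, so after slicing, only the leading, partial segment may start mid-sample and need shifting back down; segments that begin inside the slice already start at 0.

```python
import torch

def adjust_position_ids_for_slice_sketch(
    position_ids: torch.Tensor, start_idx: int
) -> torch.Tensor:
    # Hypothetical sketch. `position_ids` is the already-sliced
    # (batch, seq_len) tensor whose columns began at `start_idx` in the
    # full packed sequence. Only the leading run before the first reset
    # (a point where the position value drops) can start mid-sample;
    # this sketch re-bases it from the row itself, so `start_idx` goes
    # unused here even though the documented helper accepts it.
    out = position_ids.clone()
    for row in out:
        drops = torch.where(row[1:] < row[:-1])[0]
        end = int(drops[0]) + 1 if len(drops) else row.numel()
        row[:end] -= int(row[0])  # re-base the partial first sample to 0
    return out

# Two packed samples with positions [0..4] and [0..2], sliced at start_idx=3:
sliced = torch.tensor([[3, 4, 0, 1, 2]])
print(adjust_position_ids_for_slice_sketch(sliced, start_idx=3))
# tensor([[0, 1, 0, 1, 2]])
```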