utils.collators.batching

Data collators for axolotl that pad labels and position_ids for packed sequences. Also includes the collation logic for sequence parallelism.

Classes

| Name | Description |
| --- | --- |
| BatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler. |
| DataCollatorForSeq2Seq | Data collator that dynamically pads the inputs received, as well as the labels and position_ids. |
| PretrainingBatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler. |
| V2BatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler. |

BatchSamplerDataCollatorForSeq2Seq

utils.collators.batching.BatchSamplerDataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)

Collator for multipack, specific to using the BatchSampler.
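
The multipack BatchSampler yields groups of samples whose combined length fits into one packed sequence, and the collator joins each group into a single row before padding. A minimal sketch of that idea on toy data (the concatenation below is illustrative, not the class's actual implementation):

```python
# One group from a multipack BatchSampler: samples that fit together
# in a single packed sequence (illustrative data).
group = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3], "position_ids": [0, 1, 2]},
    {"input_ids": [4, 5], "labels": [4, 5], "position_ids": [0, 1]},
]

# Concatenate the group into one packed row; the parent Seq2Seq
# collator then pads the result as usual.
packed = {key: sum((sample[key] for sample in group), []) for key in group[0]}

# packed["position_ids"] == [0, 1, 2, 0, 1] -- positions reset at each
# sample boundary, which is why position_ids need their own pad value.
```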

DataCollatorForSeq2Seq

utils.collators.batching.DataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)

Data collator that dynamically pads the inputs received, as well as the labels and position_ids.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | [PreTrainedTokenizer] or [PreTrainedTokenizerFast] | The tokenizer used for encoding the data. | required |
| model | [PreTrainedModel] | The model that is being trained. If set and the model has the prepare_decoder_input_ids_from_labels method, it is used to prepare the decoder_input_ids. This is useful when using label_smoothing to avoid calculating the loss twice. | None |
| padding | bool, str or [~utils.PaddingStrategy], optional, defaults to True | Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among: True or 'longest' (default): pad to the longest sequence in the batch (or no padding if only a single sequence is provided); 'max_length': pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided; False or 'do_not_pad': no padding (i.e., the batch can contain sequences of different lengths). | True |
| max_length | int, optional | Maximum length of the returned list and optionally padding length (see above). | None |
| pad_to_multiple_of | int, optional | If set, pads the sequence to a multiple of the provided value. This is especially useful for enabling Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). | None |
| label_pad_token_id | int, optional, defaults to -100 | The id to use when padding the labels (-100 is automatically ignored by PyTorch loss functions). | -100 |
| position_pad_token_id | int, optional, defaults to 0 | The id to use when padding the position_ids. | 0 |
| return_tensors | str | The type of Tensor to return. Allowable values are "np", "pt" and "tf". | 'pt' |
| sequence_parallel_degree | int | The degree of sequence parallelism. Defaults to 1 (no sequence parallelism). | 1 |
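
A hedged usage sketch for this collator. The import path is assumed from this page's module name (axolotl.utils.collators.batching); everything else follows the parameters documented above:

```python
from transformers import AutoTokenizer

from axolotl.utils.collators.batching import DataCollatorForSeq2Seq  # assumed path

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

collator = DataCollatorForSeq2Seq(
    tokenizer,
    pad_to_multiple_of=8,     # align lengths for Tensor Cores
    label_pad_token_id=-100,  # ignored by PyTorch loss functions
    return_tensors="pt",
)

features = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3], "position_ids": [0, 1, 2]},
    {"input_ids": [4, 5], "labels": [4, 5], "position_ids": [0, 1]},
]
batch = collator(features)

# input_ids, labels, and position_ids are all padded to the same
# multiple-of-8 length.
print(batch["input_ids"].shape, batch["labels"].shape, batch["position_ids"].shape)
```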

Methods

| Name | Description |
| --- | --- |
| apply_sequence_parallelism | Apply sequence parallelism slicing to a batch. |

apply_sequence_parallelism

utils.collators.batching.DataCollatorForSeq2Seq.apply_sequence_parallelism(
    batch,
)

Apply sequence parallelism slicing to a batch.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| batch | dict[str, torch.Tensor] | Batch dictionary from the parent collator. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
|  | dict[str, torch.Tensor] | Sliced batch dictionary. |
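
The effect is easiest to see on a toy batch. Below is a minimal sketch of the slicing idea, assuming each of sequence_parallel_degree ranks keeps one contiguous chunk of the sequence dimension; the helper name and explicit rank argument are illustrative, not part of this module's API:

```python
import torch

def slice_for_rank(batch: dict[str, torch.Tensor], rank: int, sp_degree: int):
    # Illustrative only: keep this rank's contiguous chunk of every
    # (batch, seq_len)-shaped tensor in the batch dict.
    sliced = {}
    for key, tensor in batch.items():
        chunk = tensor.shape[1] // sp_degree
        sliced[key] = tensor[:, rank * chunk : (rank + 1) * chunk]
    return sliced

batch = {
    "input_ids": torch.arange(16).reshape(2, 8),
    "position_ids": torch.arange(8).repeat(2, 1),
}
local = slice_for_rank(batch, rank=1, sp_degree=2)
print(local["input_ids"].shape)   # torch.Size([2, 4])
print(local["position_ids"][0])   # tensor([4, 5, 6, 7])
```

Note that any rank other than rank 0 starts mid-sequence, which is where adjust_position_ids_for_slice (documented below) comes in.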

PretrainingBatchSamplerDataCollatorForSeq2Seq

utils.collators.batching.PretrainingBatchSamplerDataCollatorForSeq2Seq(
    self,
    *args,
    multipack_attn=True,
    **kwargs,
)

Collator for multipack, specific to using the BatchSampler.

V2BatchSamplerDataCollatorForSeq2Seq

utils.collators.batching.V2BatchSamplerDataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)

Collator for multipack, specific to using the BatchSampler.

Functions

| Name | Description |
| --- | --- |
| adjust_position_ids_for_slice | Adjust position IDs for a sliced sequence to maintain proper relative positions. |

adjust_position_ids_for_slice

utils.collators.batching.adjust_position_ids_for_slice(position_ids, start_idx)

Adjust position IDs for a sliced sequence to maintain proper relative positions. This handles the case where position IDs might not be contiguous due to sample packing.
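
For intuition, a hypothetical re-implementation of the adjustment described above (not the module's actual code): in a packed row, position IDs reset to 0 at every sample boundary, so after slicing, only the leading, partial segment may start mid-sample and need shifting back down; segments that begin inside the slice already start at 0.

```python
import torch

def adjust_position_ids_for_slice_sketch(
    position_ids: torch.Tensor, start_idx: int
) -> torch.Tensor:
    # Hypothetical sketch. `position_ids` is the already-sliced
    # (batch, seq_len) tensor whose columns began at `start_idx` in the
    # full packed sequence. Only the leading run before the first reset
    # (a point where the position value drops) can start mid-sample;
    # this sketch re-bases it from the row itself, so `start_idx` goes
    # unused here even though the documented helper accepts it.
    out = position_ids.clone()
    for row in out:
        drops = torch.where(row[1:] < row[:-1])[0]
        end = int(drops[0]) + 1 if len(drops) else row.numel()
        row[:end] -= int(row[0])  # re-base the partial first sample to 0
    return out

# Two packed samples with positions [0..4] and [0..2], sliced at start_idx=3:
sliced = torch.tensor([[3, 4, 0, 1, 2]])
print(adjust_position_ids_for_slice_sketch(sliced, start_idx=3))
# tensor([[0, 1, 0, 1, 2]])
```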