utils.collators.batching
Data collators for axolotl to pad labels and position_ids for packed sequences. Also includes logic for handling sequence parallelism collation.
Classes
Name | Description |
---|---|
BatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler |
DataCollatorForSeq2Seq | Data collator that will dynamically pad the inputs received, as well as the labels and position_ids |
PretrainingBatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler |
V2BatchSamplerDataCollatorForSeq2Seq | Collator for multipack, specific to using the BatchSampler |
BatchSamplerDataCollatorForSeq2Seq
utils.collators.batching.BatchSamplerDataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)
Collator for multipack, specific to using the BatchSampler
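The summary above is terse, so here is an illustrative sketch of what multipack collation generally produces. The values are hypothetical, not output captured from this class: several short samples share a single row, the attention mask numbers each sample so attention kernels can keep samples apart, and position_ids restart at 0 for each packed sample, which is why this module pads them separately from the labels.

```python
import torch

# Hypothetical packed row for illustration only (not produced by running
# BatchSamplerDataCollatorForSeq2Seq). Two samples share one row:
#   - attention_mask uses sample_index + 1 per token and 0 for padding,
#     so sample boundaries survive collation
#   - position_ids restart at 0 for the second sample
#   - padded label positions use -100 so the loss ignores them
packed_batch = {
    "input_ids":      torch.tensor([[11, 12, 13, 14, 21, 22, 23, 0]]),
    "attention_mask": torch.tensor([[1,  1,  1,  1,  2,  2,  2,  0]]),
    "position_ids":   torch.tensor([[0,  1,  2,  3,  0,  1,  2,  0]]),
    "labels":         torch.tensor([[11, 12, 13, 14, 21, 22, 23, -100]]),
}
```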
DataCollatorForSeq2Seq
utils.collators.batching.DataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)
Data collator that will dynamically pad the inputs received, as well as the labels and position_ids
Parameters
Name | Type | Description | Default |
---|---|---|---|
tokenizer | [PreTrainedTokenizer] or [PreTrainedTokenizerFast] | The tokenizer used for encoding the data. | required |
model | [PreTrainedModel] | The model that is being trained. If set and the model has the prepare_decoder_input_ids_from_labels method, it is used to prepare the decoder_input_ids. This is useful when using label_smoothing to avoid calculating loss twice. | None |
padding | bool, str or [~utils.PaddingStrategy], optional, defaults to True | Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among: True or 'longest' (default): pad to the longest sequence in the batch (or no padding if only a single sequence is provided); 'max_length': pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided; False or 'do_not_pad': no padding (i.e., can output a batch with sequences of different lengths). | True |
max_length | int, optional | Maximum length of the returned list and optionally padding length (see above). | None |
pad_to_multiple_of | int, optional | If set, will pad the sequence to a multiple of the provided value. This is especially useful for enabling Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). | None |
label_pad_token_id | int, optional, defaults to -100 | The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions). | -100 |
position_pad_token_id | int, optional, defaults to 0 | The id to use when padding the position_ids of packed sequences. | 0 |
return_tensors | str | The type of Tensor to return. Allowable values are "np", "pt" and "tf". | 'pt' |
sequence_parallel_degree | int | The degree of sequence parallelism. Defaults to 1 for no sequence parallelism. | 1 |
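A minimal usage sketch follows. It assumes the package-qualified import path `axolotl.utils.collators.batching` (the rendered module name above omits the package prefix) and pre-tokenized features; the call convention mirrors `transformers.DataCollatorForSeq2Seq`, whose parameters this class documents.

```python
from transformers import AutoTokenizer

# Assumed import path; adjust if your installation differs.
from axolotl.utils.collators.batching import DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding=True,             # pad to the longest sequence in the batch
    pad_to_multiple_of=8,     # friendlier shapes for Tensor Cores
    label_pad_token_id=-100,  # padded label positions are ignored by the loss
    return_tensors="pt",
)

features = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "attention_mask": [1, 1], "labels": [4, 5]},
]
batch = collator(features)
print(batch["input_ids"].shape)  # torch.Size([2, 8]) after padding
```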
Methods
Name | Description |
---|---|
apply_sequence_parallelism | Apply sequence parallelism slicing to a batch. |
apply_sequence_parallelism
utils.collators.batching.DataCollatorForSeq2Seq.apply_sequence_parallelism(batch)
Apply sequence parallelism slicing to a batch.
Parameters
Name | Type | Description | Default |
---|---|---|---|
batch | dict[str, torch.Tensor] | Batch dictionary from parent collator. | required |
Returns
Name | Type | Description |
---|---|---|
 | dict[str, torch.Tensor] | Sliced batch dictionary. |
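To make the method's purpose concrete, here is a standalone sketch of the slicing idea. This is a hypothetical re-implementation, not the library's code: with sequence parallelism of degree N, each rank keeps a contiguous 1/N slice of every sequence-shaped tensor, assuming the sequence length divides evenly by the degree.

```python
import torch

def slice_for_sequence_parallelism(
    batch: dict[str, torch.Tensor], rank: int, sp_degree: int
) -> dict[str, torch.Tensor]:
    """Hypothetical sketch: keep this rank's contiguous chunk of each sequence.

    Assumes every 2D+ tensor is laid out (batch, seq_len, ...) and that
    seq_len is divisible by sp_degree.
    """
    sliced = {}
    for key, tensor in batch.items():
        if tensor.dim() >= 2:
            chunk = tensor.size(1) // sp_degree
            start = rank * chunk
            sliced[key] = tensor[:, start : start + chunk]
        else:
            sliced[key] = tensor  # scalar / 1D metadata passes through unchanged
    return sliced

# Degree 2: rank 0 sees tokens 0..3, rank 1 sees tokens 4..7.
batch = {"input_ids": torch.arange(8).unsqueeze(0)}
print(slice_for_sequence_parallelism(batch, rank=1, sp_degree=2)["input_ids"])
# tensor([[4, 5, 6, 7]])
```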
PretrainingBatchSamplerDataCollatorForSeq2Seq
utils.collators.batching.PretrainingBatchSamplerDataCollatorForSeq2Seq(
    self,
    *args,
    multipack_attn=True,
    **kwargs,
)
Collator for multipack, specific to using the BatchSampler
V2BatchSamplerDataCollatorForSeq2Seq
utils.collators.batching.V2BatchSamplerDataCollatorForSeq2Seq(
    self,
    tokenizer,
    model=None,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    position_pad_token_id=0,
    return_tensors='pt',
    sequence_parallel_degree=1,
)
Collator for multipack, specific to using the BatchSampler
Functions
Name | Description |
---|---|
adjust_position_ids_for_slice | Adjust position IDs for a sliced sequence to maintain proper relative positions. |
adjust_position_ids_for_slice
utils.collators.batching.adjust_position_ids_for_slice(position_ids, start_idx)
Adjust position IDs for a sliced sequence to maintain proper relative positions. This handles the case where position IDs might not be contiguous due to sample packing.
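A hypothetical sketch of the behavior the docstring describes follows; this is one plausible reading for illustration only, not the library's implementation. With sample packing, position IDs reset to 0 at each sample boundary, so if a slice's positions were regenerated locally from 0, only the leading tokens, which belong to a sample cut by the slice, need to be shifted by the slice's start offset; samples that begin inside the slice already restart at 0 and are left alone. The name `adjust_position_ids_sketch` is an assumption, not the real function.

```python
import torch

def adjust_position_ids_sketch(position_ids: torch.Tensor, start_idx: int) -> torch.Tensor:
    """Hypothetical re-implementation for illustration only.

    Assumes position_ids is a (batch, slice_len) chunk whose values were
    regenerated locally from 0, and that packed samples reset to 0 at each
    boundary. The leading tokens of each row belong to a sample cut by the
    slice, so they are shifted by start_idx to restore their in-sample
    positions; segments starting inside the slice keep their 0-based restart.
    """
    adjusted = position_ids.clone()
    for row in range(adjusted.size(0)):
        # The first reset-to-zero after token 0 marks the next packed sample.
        resets = (adjusted[row][1:] == 0).nonzero(as_tuple=True)[0]
        cut_len = (resets[0].item() + 1) if resets.numel() else adjusted.size(1)
        adjusted[row, :cut_len] += start_idx
    return adjusted

# Slice starts 3 tokens into a packed sample; a new sample begins at index 2.
pos = torch.tensor([[0, 1, 0, 1, 2]])
print(adjust_position_ids_sketch(pos, start_idx=3))
# tensor([[3, 4, 0, 1, 2]])
```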