datasets

Module containing Dataset functionality

Classes

Name Description
ConstantLengthDataset Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
TokenizedPromptDataset Dataset that returns tokenized prompts from a stream of text files.

ConstantLengthDataset

datasets.ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant-length chunks of tokens from a stream of text files.

Args:
    tokenizer (Tokenizer): The processor used to process the data.
    datasets (dataset.Dataset): Dataset with text files.
    seq_length (int): Length of the token sequences to return. Defaults to 2048.
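The idea of constant-length chunking can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper `constant_length_chunks`, the `toy_tokenize` function, and the `eos_token_id` separator are all hypothetical, and the sketch assumes that examples are concatenated into one token stream and that a trailing partial chunk is dropped.

```python
from typing import Callable, Iterable, Iterator, List


def constant_length_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_length: int,
    eos_token_id: int,
) -> Iterator[List[int]]:
    """Yield chunks of exactly `seq_length` token ids from a stream of texts.

    Assumption: texts are joined into one stream, separated by `eos_token_id`,
    and any leftover tokens shorter than `seq_length` are discarded.
    """
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenize(text))
        buffer.append(eos_token_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer: maps each word to its length."""
    return [len(word) for word in text.split()]


chunks = list(
    constant_length_chunks(
        ["a bb ccc", "dddd ee", "f"],
        toy_tokenize,
        seq_length=4,
        eos_token_id=0,
    )
)
print(chunks)
```

Every yielded chunk has the same length, so batches can be stacked without padding; the cost is that a chunk may span the boundary between two source texts.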

TokenizedPromptDataset

datasets.TokenizedPromptDataset(
    self,
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.

Args:
    prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for processing the data.
    dataset (dataset.Dataset): Dataset with text files.
    process_count (int): Number of processes to use for tokenizing. Defaults to None.
    keep_in_memory (bool): Whether to keep the tokenized dataset in memory. Defaults to False.
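As a rough illustration of what "returns tokenized prompts" means, the sketch below applies a tokenizing strategy to each record of a small in-memory dataset. Everything here is a hypothetical stand-in: `ToyPromptTokenizingStrategy`, its `tokenize_prompt` method, the record fields, and the output keys are assumptions made for the example and are not taken from the library. The sketch runs in a single process and so does not model `process_count` or `keep_in_memory`.

```python
from typing import Dict, List


class ToyPromptTokenizingStrategy:
    """Hypothetical stand-in for a PromptTokenizingStrategy."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab

    def tokenize_prompt(self, record: Dict[str, str]) -> Dict[str, List[int]]:
        # Build one prompt string from the record, then map words to ids.
        # Unknown words map to id 0 in this toy vocabulary.
        prompt = f"{record['instruction']} {record['output']}"
        input_ids = [self.vocab.get(word, 0) for word in prompt.split()]
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": list(input_ids),
        }


def tokenize_records(
    strategy: ToyPromptTokenizingStrategy,
    records: List[Dict[str, str]],
) -> List[Dict[str, List[int]]]:
    """Apply the strategy to every record, one at a time."""
    return [strategy.tokenize_prompt(record) for record in records]


vocab = {"add": 1, "two": 2, "numbers": 3, "sum": 4}
strategy = ToyPromptTokenizingStrategy(vocab)
rows = tokenize_records(
    strategy,
    [{"instruction": "add two numbers", "output": "sum"}],
)
print(rows[0]["input_ids"])
```

In a real pipeline the per-record call would typically be spread across several worker processes, which is the role the `process_count` argument describes.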