datasets
Module containing Dataset functionality
Classes
Name | Description |
---|---|
ConstantLengthDataset | Iterable dataset that returns constant-length chunks of tokens from a stream of text files. |
TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
ConstantLengthDataset
datasets.ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant-length chunks of tokens from a stream of text files.

Args:
    tokenizer (Tokenizer): The processor used for processing the data.
    dataset (dataset.Dataset): Dataset with text files.
    seq_length (int): Length of token sequences to return.
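The idea of constant-length chunking can be illustrated with a minimal, library-independent sketch. This is not the implementation of `ConstantLengthDataset`; the function `constant_length_chunks`, the `toy_tokenize` helper, and the `eos_token_id` separator are hypothetical names chosen for illustration, and the handling of leftover tokens is an assumption.

```python
from typing import Callable, Iterable, Iterator, List


def constant_length_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_length: int,
    eos_token_id: int = 0,
) -> Iterator[List[int]]:
    """Yield lists of exactly seq_length token ids from a stream of texts."""
    buffer: List[int] = []
    for text in texts:
        # Append the document's tokens plus a separator token (assumed convention).
        buffer.extend(tokenize(text))
        buffer.append(eos_token_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Leftover tokens shorter than seq_length are dropped in this sketch.


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer: maps each whitespace-separated word to its length."""
    return [len(word) for word in text.split()]


chunks = list(
    constant_length_chunks(["a bb ccc", "dddd eeeee"], toy_tokenize, seq_length=3)
)
print(chunks)
```

Because the generator keeps only a small buffer, it can consume an arbitrarily long stream of texts without loading everything into memory, which is the usual motivation for an iterable dataset of this kind.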
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
    self,
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)
Dataset that returns tokenized prompts from a stream of text files.

Args:
    prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for processing the data.
    dataset (dataset.Dataset): Dataset with text files.
    process_count (int): Number of processes to use for tokenizing.
    keep_in_memory (bool): Whether to keep the tokenized dataset in memory.
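As a rough illustration of what "tokenized prompts" means, the sketch below applies a tokenizing strategy to each row of a small in-memory dataset. It does not use the library: `ToyPromptStrategy`, its `tokenize_prompt` method, the `instruction`/`response` row fields, and `tokenize_rows` are all hypothetical stand-ins. Rows are processed serially here, so the `process_count` and `keep_in_memory` options are not modeled.

```python
from typing import Dict, List


class ToyPromptStrategy:
    """Hypothetical stand-in for a prompt tokenizing strategy."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab

    def tokenize_prompt(self, prompt: Dict[str, str]) -> Dict[str, List[int]]:
        # Join instruction and response, then look up each word (unknown words map to 0).
        text = f"{prompt['instruction']} {prompt['response']}"
        input_ids = [self.vocab.get(word, 0) for word in text.split()]
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": list(input_ids),
        }


def tokenize_rows(
    strategy: ToyPromptStrategy, rows: List[Dict[str, str]]
) -> List[Dict[str, List[int]]]:
    """Apply the strategy to every row, one at a time."""
    return [strategy.tokenize_prompt(row) for row in rows]


rows = [
    {"instruction": "say hi", "response": "hi"},
    {"instruction": "say bye", "response": "bye"},
]
strategy = ToyPromptStrategy({"say": 1, "hi": 2, "bye": 3})
tokenized = tokenize_rows(strategy, rows)
print(tokenized[0]["input_ids"])
```

Separating the strategy object from the dataset wrapper keeps the prompt format swappable: the same rows can be tokenized under a different strategy without touching the iteration code.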