The DatasetsPorter exports a list of Chunk objects into a Hugging Face Dataset object. This is particularly useful for saving your processed chunks in a standardized format for training models, sharing, or archiving.

Installation

The DatasetsPorter requires the datasets library. You can install it with:
pip install "chonkie[datasets]"
For general installation instructions, see the Installation Guide.

Initialization

To get started, simply import and initialize the porter.
from chonkie.friends.porters import DatasetsPorter

porter = DatasetsPorter()

Parameters

The following parameters are accepted by the export method.

chunks (list[Chunk], required)
The list of Chunk objects to be exported.

save_to_disk (bool, default: True)
If True, the dataset is saved to the location given by the path parameter.

path (str, default: "chunks")
The local directory where the dataset is saved. Only used if save_to_disk is True.

**kwargs (Any)
Additional keyword arguments passed directly to the datasets.Dataset.save_to_disk method, such as num_shards or num_proc, which control the number of shards and worker processes.

Usage

The DatasetsPorter can either return a Dataset object directly for in-memory use or save it to disk.

Return a Dataset Object

Because save_to_disk defaults to True, pass save_to_disk=False to get a Dataset object without writing any files.
from chonkie import Chunk

chunks = [
    Chunk(text="This is the first chunk.", start_index=0, end_index=25, token_count=5),
    Chunk(text="This is the second chunk.", start_index=26, end_index=52, token_count=5),
]

# Get the dataset in memory only (nothing is written to disk)
dataset = porter.export(chunks, save_to_disk=False)

print(dataset)
# Expected output:
# Dataset({
#     features: ['text', 'start_index', 'end_index', 'token_count', 'context'],
#     num_rows: 2
# })
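
The features listed in that output correspond to the fields of each chunk: every chunk becomes one row, and every field becomes one column. The sketch below illustrates that mapping with the standard library only. SimpleChunk is a hypothetical stand-in for chonkie's Chunk, and the pivot step is an assumption about the general shape of the conversion, not the porter's actual implementation.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for chonkie's Chunk, for illustration only."""

    text: str
    start_index: int
    end_index: int
    token_count: int
    context: Optional[str] = None


chunks = [
    SimpleChunk("This is the first chunk.", 0, 25, 5),
    SimpleChunk("This is the second chunk.", 26, 52, 5),
]

# One dict per chunk: each chunk becomes a row.
rows = [asdict(chunk) for chunk in chunks]

# Pivot the rows into columns: each field becomes a feature.
columns = {name: [row[name] for row in rows] for name in rows[0]}

print(list(columns))
print(len(rows))
```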

Save a Dataset to Disk

To save the dataset, keep save_to_disk=True (the default) and provide a path. The method still returns the Dataset object.
# Save the dataset to a directory named "my_exported_chunks"
dataset = porter.export(chunks, save_to_disk=True, path="my_exported_chunks")

# You can now find the dataset files in the "my_exported_chunks" directory

Using as a Callable

The porter can also be used as a callable, which is an alias for the export method.
# Get the dataset in memory only (save_to_disk defaults to True)
dataset = porter(chunks, save_to_disk=False)

# Save the dataset to disk
porter(chunks, save_to_disk=True, path="my_exported_chunks")

Return Type

The export method (and the __call__ method) always returns a datasets.Dataset object, whether or not the dataset is saved to disk, so you can work with it immediately after exporting.