Export Chonkie’s Chunks into a Hugging Face Dataset.
DatasetsPorter exports a list of Chunk objects into a Hugging Face Dataset object. This is particularly useful for saving your processed chunks in a standardized format for training models, sharing, or archiving.
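As a rough illustration of what such an export amounts to, the sketch below flattens chunk-like objects into one record per chunk and, if the datasets library happens to be installed, builds a Dataset from those records. SimpleChunk and chunks_to_rows are hypothetical names invented for this example; they are not part of Chonkie, and the fields Chonkie actually exports may differ.

```python
from dataclasses import dataclass, asdict


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; this is not Chonkie's Chunk class."""
    text: str
    start_index: int
    end_index: int
    token_count: int


def chunks_to_rows(chunks):
    """Turn chunk-like dataclass instances into plain dicts, one per row."""
    return [asdict(chunk) for chunk in chunks]


chunks = [
    SimpleChunk("Hello world.", 0, 12, 3),
    SimpleChunk(" Second chunk.", 12, 26, 3),
]
rows = chunks_to_rows(chunks)

# Building the Dataset is optional here so the sketch also runs
# when the datasets library is not installed.
try:
    from datasets import Dataset
    dataset = Dataset.from_list(rows)
except ImportError:
    dataset = None
```

DatasetsPorter handles this conversion for you; the sketch is only meant to show the shape of the data that ends up in the dataset.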
DatasetsPorter requires the datasets library, which must be installed before use.
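The export method takes the Chunk objects to be exported, along with the following options:
save_to_disk: if True, the dataset will be saved to the location specified in the path parameter.
path: the location where the dataset is saved. It is only used if save_to_disk is True.
Additional keyword arguments are passed to the datasets.Dataset.save_to_disk method. This allows you to control aspects like the number of shards or processes.

DatasetsPorter can either return a Dataset object directly for in-memory use or save it to disk.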
In-memory use: get a Dataset object without writing any files.
Saving to disk: set save_to_disk=True and provide a path. The method will still return the Dataset object.
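The two modes described above might look like the following minimal sketch. It assumes DatasetsPorter can be imported from the top-level chonkie package and that passing save_to_disk=False skips writing files; export_both_ways and the default directory name are hypothetical, chosen only for illustration.

```python
def export_both_ways(chunks, out_dir="chunks_dataset"):
    """Export the same chunks once in memory and once with a copy on disk.

    `chunks` is expected to be a list of Chonkie Chunk objects, for example
    the output of one of Chonkie's chunkers.
    """
    # Imported inside the function so this sketch can be loaded even when
    # the optional dependencies are missing. The import path is an assumption.
    from chonkie import DatasetsPorter

    porter = DatasetsPorter()

    # In-memory use: nothing should be written to disk (assumed behavior
    # of save_to_disk=False).
    in_memory = porter.export(chunks, save_to_disk=False)

    # Saving to disk: the dataset is written to out_dir and also returned.
    on_disk = porter.export(chunks, save_to_disk=True, path=out_dir)

    return in_memory, on_disk
```

Both calls should hand back a Dataset, so the choice between them only affects whether files are written.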
The export method (and the __call__ method) always returns a datasets.Dataset object, regardless of whether it is saved to disk. This allows you to immediately work with the dataset after exporting.
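A short sketch of working with the returned object right away is shown below. It assumes that calling the porter directly accepts the same arguments as export, that DatasetsPorter is importable from the top-level chonkie package, and that extra keyword arguments such as num_shards are forwarded to datasets.Dataset.save_to_disk. export_and_inspect is a hypothetical helper name.

```python
def export_and_inspect(chunks, out_dir="chunks_dataset"):
    """Save chunks to disk and immediately inspect the returned Dataset."""
    # Local import so the sketch loads without the optional dependencies.
    # The import path is an assumption.
    from chonkie import DatasetsPorter

    porter = DatasetsPorter()

    # Calling the porter is assumed to behave like porter.export(...).
    # num_shards is a keyword argument of datasets.Dataset.save_to_disk;
    # passing it through the porter is an assumption about kwargs forwarding.
    dataset = porter(chunks, save_to_disk=True, path=out_dir, num_shards=1)

    # The returned object is a regular datasets.Dataset, so its usual
    # attributes are available without reloading from disk.
    return dataset.num_rows, dataset.column_names
```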