Recursively chunk documents into smaller, semantically meaningful pieces using customizable rules.
The `RecursiveChunker` recursively splits documents into smaller, semantically meaningful chunks using customizable hierarchical rules. It is ideal for long or structured documents (e.g., books, research papers, technical docs) where you want to preserve logical structure while respecting token limits.
To use the `RecursiveChunker` via the API, check out the API reference documentation.
The default tokenizer is `Xenova/gpt2`. The `returnType` option controls whether the chunker returns `RecursiveChunk` objects (with metadata) or plain text strings. `RecursiveChunk` objects are returned by default, and each chunk includes metadata about its position and size; if `returnType` is set to `'texts'`, only the chunked text strings are returned.
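A minimal usage sketch follows. The `chonkie` import path and the async `create` factory are assumptions; the option names (`tokenizer`, `chunkSize`, `minCharactersPerChunk`, `returnType`) and the callable-chunker usage are the ones described in this page.

```typescript
import { RecursiveChunker } from "chonkie"; // import path is an assumption

async function main() {
  const chunker = await RecursiveChunker.create({
    tokenizer: "Xenova/gpt2",   // the default tokenizer
    chunkSize: 512,             // maximum tokens per chunk
    minCharactersPerChunk: 24,  // chunks shorter than this may be merged
    returnType: "chunks",       // "texts" would return plain strings instead
  });

  // The chunker instance is directly callable on a document.
  const chunks = await chunker("Your long document text goes here...");

  for (const chunk of chunks) {
    // Each RecursiveChunk carries text, startIndex, endIndex, tokenCount, and level.
    console.log(chunk.level, chunk.tokenCount, chunk.text.slice(0, 40));
  }
}

main();
```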
The `rules` parameter allows for highly flexible, hierarchical chunking strategies. You can specify a list of levels, each with its own delimiters or whitespace splitting. For example:
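The sketch below defines a paragraph → sentence → word hierarchy with a token-level fallback. The `RecursiveRules`/`RecursiveLevel` constructor shapes and the `levels` field are assumptions; the per-level options are the documented ones listed after the code.

```typescript
import { RecursiveChunker, RecursiveRules, RecursiveLevel } from "chonkie";

// Hierarchy: paragraphs -> sentences -> whitespace-separated words -> token fallback.
const rules = new RecursiveRules({
  levels: [
    new RecursiveLevel({ delimiters: ["\n\n"], includeDelim: "prev" }), // paragraphs
    new RecursiveLevel({ delimiters: [". ", "! ", "? "] }),             // sentences
    new RecursiveLevel({ whitespace: true }),                           // words
    new RecursiveLevel({}),                                             // token-level fallback
  ],
});

// Inside an async context:
const chunker = await RecursiveChunker.create({ chunkSize: 256, rules });
```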
Each `RecursiveLevel` accepts:

- `delimiters`: Custom string(s) to split on (cannot be used with `whitespace`).
- `whitespace`: If true, splits on whitespace (cannot be used with `delimiters`).
- `includeDelim`: Whether to include the delimiter with the previous chunk (`"prev"`, the default) or the next chunk (`"next"`).
Usage notes:

- Call the chunker directly: `const chunks = await chunker(text)` or `await chunker([text1, text2])`.
- If `returnType` is set to `'chunks'`, each chunk includes metadata: `text`, `startIndex`, `endIndex`, `tokenCount`, and `level` (recursion depth).
- The `rules` parameter allows for hierarchical chunking (e.g., paragraphs → sentences → tokens). See `RecursiveRules` and `RecursiveLevel` for customization.
- Chunks shorter than `minCharactersPerChunk` may be merged with adjacent chunks.
- `chunkSize` is measured in tokens.
- The `chunkBatch` method (or calling with an array) allows efficient batch processing of multiple documents; see the sketch after this list.
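A short batch-processing sketch, reusing the chunker constructed earlier (inside an async context). The `chunkBatch` call and the array-call form are described above; the per-document nesting of the returned array is an assumption.

```typescript
const docs = [
  "First document text...",
  "Second document text...",
];

// chunkBatch and calling the chunker with an array are equivalent ways
// to chunk several documents in one call.
const perDocChunks = await chunker.chunkBatch(docs);
// const perDocChunks = await chunker(docs); // equivalent form

perDocChunks.forEach((chunks, i) => {
  // Assumption: one array of chunks is returned per input document.
  console.log(`Document ${i}: ${chunks.length} chunks`);
});
```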