Sentence Chunker
Split text into chunks while preserving sentence boundaries
The SentenceChunker splits text into chunks based on sentences, ensuring each chunk contains only complete sentences and stays within the specified token limit. This is ideal for preparing text for models that work best with semantically complete units, or for consistent chunking across different texts.
API Reference
To use the SentenceChunker via the API, check out the API reference documentation.
Initialization
Parameters
Tokenizer to use. Can be a string identifier (model name) or a Tokenizer instance. Defaults to the Xenova/gpt2 tokenizer.
Maximum number of tokens per chunk.
Number of overlapping tokens between chunks. Must be >= 0 and < chunkSize.
Minimum number of sentences per chunk. Must be > 0.
Minimum number of characters for a valid sentence. Sentences shorter than this are merged with adjacent sentences.
(Deprecated) Whether to use approximate token counting.
List of sentence delimiters to use for splitting. Default: [". ", "! ", "? ", "\n"].
Whether to include the delimiter with the previous sentence ("prev"), next sentence ("next"), or exclude it (null).
Whether to return chunks as SentenceChunk objects (with metadata) or plain text strings.
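To make the delimiter options concrete, here is a minimal, self-contained sketch of how splitting on a delimiter list with the "prev", "next", and null modes could behave. The splitSentences helper and the IncludeDelim type are hypothetical names written for this illustration; they are not part of the library, and the library's own splitting logic may differ.

```typescript
// Hypothetical type mirroring the three delimiter modes described above.
type IncludeDelim = "prev" | "next" | null;

// Hypothetical helper (not the library's implementation): splits text on any
// of the given delimiters and attaches each delimiter according to the mode.
function splitSentences(
  text: string,
  delim: string[] = [". ", "! ", "? ", "\n"],
  includeDelim: IncludeDelim = "prev",
): string[] {
  const sentences: string[] = [];
  let current = ""; // text accumulated for the sentence in progress
  let i = 0;
  while (i < text.length) {
    // Find the first delimiter that matches at position i, if any.
    const hit = delim.find((d) => d.length > 0 && text.startsWith(d, i));
    if (hit === undefined) {
      current += text[i];
      i += 1;
      continue;
    }
    if (includeDelim === "prev") {
      sentences.push(current + hit); // delimiter stays with the sentence it ends
      current = "";
    } else if (includeDelim === "next") {
      sentences.push(current);
      current = hit; // delimiter starts the following sentence
    } else {
      sentences.push(current); // delimiter is dropped
      current = "";
    }
    i += hit.length;
  }
  sentences.push(current);
  // Discard empty pieces produced by consecutive delimiters.
  return sentences.filter((s) => s.length > 0);
}

const sample = "Hi there. How are you? Fine";
console.log(splitSentences(sample, [". ", "! ", "? ", "\n"], "prev"));
console.log(splitSentences(sample, [". ", "! ", "? ", "\n"], null));
```

With "prev", joining the resulting pieces reproduces the original text exactly, which is convenient when character offsets into the source need to stay valid.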
Usage
Single Text Chunking
Batch Chunking
Using as a Callable
Return Type
SentenceChunker returns chunks as SentenceChunk objects by default. Each chunk includes metadata: text, startIndex, endIndex, tokenCount, and the list of sentences it contains.
If returnType is set to 'texts', only the chunked text strings are returned.
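The metadata fields named on this page can be pictured as the following shape. This is only an illustrative sketch: SentenceChunkLike and toTexts are hypothetical names, the element type of sentences and the exclusive endIndex are assumptions, and the token count in the example is made up. Consult the API reference for the actual class definition.

```typescript
// Hypothetical shape for illustration; field names follow the metadata
// listed on this page, but this is not the library's own type.
interface SentenceChunkLike {
  text: string;        // the chunk's text
  startIndex: number;  // offset of the chunk's first character in the source text
  endIndex: number;    // offset just past the chunk's last character (assumed exclusive)
  tokenCount: number;  // number of tokens in the chunk
  sentences: string[]; // assumption: modeled here as plain strings
}

// Hypothetical helper: what a 'texts' return type amounts to conceptually,
// i.e. keeping only the text of each chunk.
function toTexts(chunks: SentenceChunkLike[]): string[] {
  return chunks.map((c) => c.text);
}

const example: SentenceChunkLike = {
  text: "Hello world. ",
  startIndex: 0,
  endIndex: 13,
  tokenCount: 3, // illustrative value, not produced by a real tokenizer
  sentences: ["Hello world. "],
};

console.log(toTexts([example]));
```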
Notes:
- The chunker is directly callable as a function after creation: const chunks = await chunker(text) or await chunker([text1, text2]).
- If returnType is set to 'chunks', each chunk includes metadata: text, startIndex, endIndex, tokenCount, and the list of sentences.
- The chunker ensures that no chunk exceeds the specified chunkSize in tokens, and that each chunk contains at least minSentencesPerChunk sentences (except possibly the last chunk).
- Sentences shorter than minCharactersPerSentence are merged with adjacent sentences.
- Overlap is specified in tokens, and the chunker will overlap sentences as needed to meet the overlap requirement.
- You can customize sentence splitting using the delim and includeDelim options.
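The size constraint described in the notes can be sketched as a greedy packing loop: keep adding whole sentences to the current chunk until the next one would push it over chunkSize, then start a new chunk. The code below is a simplified illustration under stated assumptions, not the library's algorithm. packSentences and countTokens are hypothetical names, tokens are approximated by whitespace-separated words rather than a real tokenizer, and overlap, minSentencesPerChunk, and short-sentence merging are left out.

```typescript
// Stand-in token counter (assumption): counts whitespace-separated words.
// A real tokenizer such as GPT-2 BPE would produce different counts.
const countTokens = (s: string): number =>
  s.split(/\s+/).filter(Boolean).length;

// Hypothetical helper: greedily groups whole sentences into chunks whose
// approximate token count stays within chunkSize.
function packSentences(sentences: string[], chunkSize: number): string[] {
  const chunks: string[] = [];
  let current: string[] = []; // sentences in the chunk being built
  let currentTokens = 0;
  for (const sentence of sentences) {
    const n = countTokens(sentence);
    // Close the current chunk if adding this sentence would exceed the limit.
    if (current.length > 0 && currentTokens + n > chunkSize) {
      chunks.push(current.join(""));
      current = [];
      currentTokens = 0;
    }
    // Sentences are never split, so a single sentence longer than chunkSize
    // ends up as its own oversized chunk in this sketch.
    current.push(sentence);
    currentTokens += n;
  }
  if (current.length > 0) chunks.push(current.join(""));
  return chunks;
}

const parts = ["One two three. ", "Four five. ", "Six seven eight nine. ", "Ten."];
console.log(packSentences(parts, 5));
```

Because sentences are joined without a separator, the input is expected to keep its delimiters attached to each sentence, as in the "prev" mode.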
For more details, see the TypeScript API Reference.