Split text into chunks while preserving sentence boundaries
The `SentenceChunker` splits text into chunks based on sentences, ensuring each chunk maintains complete sentences and stays within specified token limits. This is ideal for preparing text for models that work best with semantically complete units, or for consistent chunking across different texts.
To use the `SentenceChunker` via the API, check out the API reference documentation.
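A minimal initialization sketch, assuming the chunker is imported from a `chonkie` package and constructed through an async `SentenceChunker.create` factory; the option names follow the parameters described on this page, and the default values shown in comments are assumptions except for the tokenizer and delimiters stated below.

```typescript
import { SentenceChunker } from "chonkie"; // assumed import path

// Assumed factory; option names match the parameters documented on this page.
const chunker = await SentenceChunker.create({
  tokenizer: "Xenova/gpt2",         // default tokenizer
  chunkSize: 512,                   // max tokens per chunk (assumed default)
  minSentencesPerChunk: 1,          // minimum sentences per chunk (assumed default)
  minCharactersPerSentence: 12,     // shorter sentences get merged (assumed default)
  delim: [". ", "! ", "? ", "\n"],  // default sentence delimiters
  includeDelim: "prev",             // attach each delimiter to the previous sentence
  returnType: "chunks",             // return SentenceChunk objects with metadata
});
```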
By default, the chunker uses the `Xenova/gpt2` tokenizer and splits sentences on the delimiters `[". ", "! ", "? ", "\n"]`. The `includeDelim` option controls whether each delimiter is attached to the previous sentence (`"prev"`), the next sentence (`"next"`), or excluded entirely (`null`).
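For example, to also treat semicolons as sentence boundaries and keep each delimiter with the sentence it ends (a sketch under the same `create` assumption as above):

```typescript
const customChunker = await SentenceChunker.create({
  delim: [". ", "! ", "? ", "; ", "\n"], // semicolons now end sentences too
  includeDelim: "prev", // "next" would attach to the following sentence; null drops it
});
```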
The chunker returns either `SentenceChunk` objects (with metadata) or plain text strings, and returns `SentenceChunk` objects by default. If `returnType` is set to `'texts'`, only the chunked text strings are returned.
Call the chunker directly on a single string or on an array of strings: `const chunks = await chunker(text)` or `await chunker([text1, text2])`.
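Putting the call forms together, using the chunker created above (the sample strings are illustrative):

```typescript
const text = "Hello, world! This is a test. It has several sentences.";

// Single text: returns the chunks for that text.
const chunks = await chunker(text);

// Batch: returns one result per input text.
const text1 = "First document. It has two sentences.";
const text2 = "Second document. Also two sentences.";
const batchResults = await chunker([text1, text2]);

// With returnType: 'texts' (see above), the same call yields plain strings.
const textChunker = await SentenceChunker.create({ returnType: "texts" });
const pieces = await textChunker(text); // strings rather than SentenceChunk objects
```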
If `returnType` is set to `'chunks'`, each chunk includes metadata: `text`, `startIndex`, `endIndex`, `tokenCount`, and the list of `sentences`.
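For example, iterating over the chunks returned above (the field names are as documented here; the exact TypeScript types are assumptions):

```typescript
for (const chunk of chunks) {
  console.log(chunk.text);       // the chunk's text
  console.log(chunk.startIndex); // start offset in the original input
  console.log(chunk.endIndex);   // end offset in the original input
  console.log(chunk.tokenCount); // number of tokens in the chunk
  console.log(chunk.sentences);  // the sentences that make up the chunk
}
```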
The chunker ensures that no chunk exceeds `chunkSize` in tokens, and that each chunk contains at least `minSentencesPerChunk` sentences (except possibly the last chunk). Sentences shorter than `minCharactersPerSentence` characters are merged with adjacent sentences. Sentence boundaries are detected using the `delim` and `includeDelim` options.
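As a sketch, a configuration that tightens these constraints (the values are illustrative, under the same `create` assumption as above):

```typescript
const strictChunker = await SentenceChunker.create({
  chunkSize: 200,               // hard upper bound in tokens per chunk
  minSentencesPerChunk: 2,      // every chunk except possibly the last
  minCharactersPerSentence: 20, // fragments below this merge into neighbours
});
```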