The TokenChunker
splits text into chunks based on token count, ensuring each chunk stays within specified token limits. It is ideal for preparing text for models with token limits, or for consistent chunking across different texts.
API Reference
To use the TokenChunker
via the API, check out the API reference documentation.
Initialization
import { TokenChunker } from "chonkie";
// Basic initialization with default parameters (async)
const chunker = await TokenChunker.create({
tokenizer: "Xenova/gpt2", // Supports string identifiers or Tokenizer instance
chunkSize: 512, // Maximum tokens per chunk
chunkOverlap: 128 // Overlap between chunks
});
// Using a custom tokenizer
import { Tokenizer } from "@huggingface/transformers";
const customTokenizer = await Tokenizer.from_pretrained("your-tokenizer");
const chunker = await TokenChunker.create({
tokenizer: customTokenizer,
chunkSize: 512,
chunkOverlap: 128
});
Parameters
tokenizer
string | Tokenizer
default:"Xenova/gpt2"
Tokenizer to use. Can be a string identifier (model name) or a Tokenizer instance. Defaults to using Xenova/gpt2
tokenizer.
Maximum number of tokens per chunk.
Number or percentage of overlapping tokens between chunks. Can be an absolute number (e.g., 16) or a decimal between 0 and 1 (e.g., 0.1 for 10% overlap).
returnType
'chunks' | 'texts'
default:"chunks"
Whether to return chunks as Chunk
objects (with metadata) or plain text strings.
Single Text Chunking
const text = "Some long text that needs to be chunked into smaller pieces...";
const chunks = await chunker.chunk(text);
for (const chunk of chunks) {
console.log(`Chunk text: ${chunk.text}`);
console.log(`Token count: ${chunk.tokenCount}`);
console.log(`Start index: ${chunk.startIndex}`);
console.log(`End index: ${chunk.endIndex}`);
}
Batch Chunking
const texts = [
"First document to chunk...",
"Second document to chunk..."
];
const batchChunks = await chunker.chunkBatch(texts);
for (const docChunks of batchChunks) {
for (const chunk of docChunks) {
console.log(`Chunk: ${chunk.text}`);
}
}
Using as a Callable
// Single text
const chunks = await chunker("Text to chunk...");
// Multiple texts
const batchChunks = await chunker(["Text 1...", "Text 2..."]);
Return Type
TokenChunker returns chunks as Chunk
objects by default. Each chunk includes metadata:
class Chunk {
text: string; // The chunk text
startIndex: number; // Starting position in original text
endIndex: number; // Ending position in original text
tokenCount: number; // Number of tokens in chunk
}
If returnType
is set to 'texts'
, only the chunked text strings are returned.
For more details, see the TypeScript API Reference.