The TokenChunker splits text into chunks based on token count, ensuring each chunk stays within specified token limits.

API Reference

To use the TokenChunker via the API, check out the API reference documentation.

Installation

TokenChunker is included in the base installation of Chonkie. No additional dependencies are required.
For installation instructions, see the Installation Guide.
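If Chonkie is not yet installed, the standard pip install is sufficient (see the Installation Guide for optional extras):

pip install chonkie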

Initialization

from chonkie import TokenChunker

# Basic initialization with default parameters
chunker = TokenChunker(
    tokenizer="character",  # Default tokenizer (or use "gpt2", etc.)
    chunk_size=2048,    # Maximum tokens per chunk
    chunk_overlap=128  # Overlap between chunks
)

# Using a custom tokenizer
from tokenizers import Tokenizer
custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")
chunker = TokenChunker(
    tokenizer=custom_tokenizer,
    chunk_size=2048,
    chunk_overlap=128
)

Parameters

tokenizer (Union[str, Any], default: "character")
    Tokenizer to use. Can be a string identifier ("character", "word", "gpt2", etc.) or a tokenizer instance.

chunk_size (int, default: 2048)
    Maximum number of tokens per chunk.

chunk_overlap (Union[int, float], default: 0)
    Number or percentage of overlapping tokens between chunks.
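
For illustration, an integer overlap is read as an absolute token count, while a float is read as a fraction of the chunk size. The sketch below shows both forms; the assumption that 0.1 resolves to roughly 10% of chunk_size is ours, not a guarantee from the library:

from chonkie import TokenChunker

# Overlap given as an absolute number of tokens
chunker_abs = TokenChunker(tokenizer="gpt2", chunk_size=512, chunk_overlap=50)

# Overlap given as a fraction of chunk_size (assumed to resolve to ~51 tokens here)
chunker_rel = TokenChunker(tokenizer="gpt2", chunk_size=512, chunk_overlap=0.1)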

Basic Usage

from chonkie import TokenChunker

# Initialize the chunker
chunker = TokenChunker(
    tokenizer="gpt2",
    chunk_size=512,
    chunk_overlap=50
)

# Chunk your text
text = "Your long document text here..."
chunks = chunker.chunk(text)

# Access chunk information
for chunk in chunks:
    print(f"Chunk: {chunk.text[:50]}...")
    print(f"Tokens: {chunk.token_count}")

Examples

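Chunking several documents at once follows the same pattern as single-text chunking. The sketch below assumes the chunker exposes a chunk_batch method that returns one list of chunks per input document; treat the method name as an assumption and check the API reference for the exact signature:

from chonkie import TokenChunker

chunker = TokenChunker(
    tokenizer="gpt2",
    chunk_size=512,
    chunk_overlap=50
)

documents = [
    "First document text...",
    "Second document text...",
]

# Assumed: chunk_batch returns a list of chunk lists, one per document
batches = chunker.chunk_batch(documents)
for doc_chunks in batches:
    for chunk in doc_chunks:
        print(chunk.token_count, chunk.text[:30])
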
Supported Tokenizers

TokenChunker supports multiple tokenizer backends (a usage sketch follows this list):
  • TikToken (Recommended)
    import tiktoken
    tokenizer = tiktoken.get_encoding("gpt2")
    
  • AutoTikTokenizer
    from autotiktokenizer import AutoTikTokenizer
    tokenizer = AutoTikTokenizer.from_pretrained("gpt2")
    
  • Hugging Face Tokenizers
    from tokenizers import Tokenizer
    tokenizer = Tokenizer.from_pretrained("gpt2")
    
  • Transformers
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    

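Any of the backends above can be passed directly as the tokenizer argument. A minimal sketch with tiktoken, assuming the gpt2 encoding is available locally:

import tiktoken
from chonkie import TokenChunker

# Use a tiktoken encoding as the tokenizer backend
tokenizer = tiktoken.get_encoding("gpt2")
chunker = TokenChunker(tokenizer=tokenizer, chunk_size=512, chunk_overlap=50)

chunks = chunker.chunk("Your long document text here...")
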
Return Type

TokenChunker returns chunks as Chunk objects. Chunk objects include a custom Context class for additional metadata alongside other attributes:
from dataclasses import dataclass
from typing import Optional

@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str           # The chunk text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text
    token_count: int    # Number of tokens in chunk
    context: Optional[Context] = None  # Contextual information (if any)
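
Because each chunk records its offsets into the original text, the source span can be recovered by slicing. A small sketch, assuming end_index is an exclusive offset into the original string:

from chonkie import TokenChunker

chunker = TokenChunker(tokenizer="character", chunk_size=512, chunk_overlap=0)
text = "Your long document text here..."
chunks = chunker.chunk(text)

for chunk in chunks:
    # Slice the original text using the chunk's recorded offsets
    span = text[chunk.start_index:chunk.end_index]
    assert span == chunk.text  # holds if end_index is exclusive, as assumed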