The SentenceChunker splits text into chunks while preserving complete sentences, so every chunk begins and ends on a sentence boundary and keeps its surrounding context intact.
API Reference
To use the SentenceChunker via the API, check out the API reference documentation.
Installation
SentenceChunker is included in the base installation of Chonkie. No additional dependencies are required.
Initialization
```python
from chonkie import SentenceChunker

# Basic initialization with default parameters
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",  # Supports string identifiers
    chunk_size=512,                     # Maximum tokens per chunk
    chunk_overlap=128,                  # Overlap between chunks
    min_sentences_per_chunk=1           # Minimum sentences in each chunk
)
```
Parameters
tokenizer_or_token_counter
Union[str, Callable, Any]
default:"gpt2"
Tokenizer to use. Can be a string identifier or a tokenizer instance.

chunk_size
int
default:"512"
Maximum number of tokens per chunk

chunk_overlap
int
default:"128"
Number of overlapping tokens between chunks

min_sentences_per_chunk
int
default:"1"
Minimum number of sentences to include in each chunk

min_characters_per_sentence
int
Minimum number of characters per sentence

approximate
bool
Use approximate token counting for faster processing.
Note: This field is deprecated and will be removed in future versions.

delim
Union[str, List[str]]
default:"['.', '!', '?', '\\n']"
Delimiters to split sentences on

include_delim
Optional[Literal["prev", "next"]]
default:"prev"
Whether to include delimiters in the chunk text, and if so, whether each delimiter attaches to the previous or the next sentence.

return_type
Literal['texts', 'chunks']
default:"chunks"
Whether to return chunks as plain text strings or as SentenceChunk objects.
Single Text Chunking
```python
text = """This is the first sentence. This is the second sentence.
And here's a third one with some additional context."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")
```
Batch Chunking
```python
texts = [
    "First document. With multiple sentences.",
    "Second document. Also with sentences. And more context."
]

batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
```
Using as a Callable
```python
# Single text
chunks = chunker("First sentence. Second sentence.")

# Multiple texts
batch_chunks = chunker(["Text 1. More text.", "Text 2. More."])
```
Supported Tokenizers
SentenceChunker supports multiple tokenizer backends:
TikToken (Recommended)

```python
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
```

AutoTikTokenizer

```python
from autotiktokenizer import AutoTikTokenizer
tokenizer = AutoTikTokenizer.from_pretrained("gpt2")
```

Hugging Face Tokenizers

```python
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("gpt2")
```

Transformers

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
Return Type
SentenceChunker returns chunks as SentenceChunk objects with additional sentence metadata:

```python
@dataclass
class Sentence:
    text: str         # The sentence text
    start_index: int  # Starting position in original text
    end_index: int    # Ending position in original text
    token_count: int  # Number of tokens in sentence

@dataclass
class SentenceChunk(Chunk):
    text: str                  # The chunk text
    start_index: int           # Starting position in original text
    end_index: int             # Ending position in original text
    token_count: int           # Number of tokens in chunk
    sentences: List[Sentence]  # List of sentences in chunk
```