Advanced AST-based code chunking with intelligent semantic preservation
The experimental CodeChunker provides advanced AST-based code parsing that goes beyond simple line-based splitting to understand and preserve code structure and semantics.
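To make the contrast with line-based splitting concrete, here is a minimal sketch of structure-aware splitting using Python's standard-library `ast` module. It is not the CodeChunker's implementation; the helper name `split_top_level` is hypothetical, and it only handles Python source, producing one span per top-level statement.

```python
import ast


def split_top_level(source: str) -> list[str]:
    """Return one text span per top-level statement (illustrative helper)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # A decorated definition reports the line of `def`/`class` as lineno,
        # so start from the earliest decorator line when there is one.
        decorators = getattr(node, "decorator_list", [])
        first_line = min([node.lineno] + [d.lineno for d in decorators])
        # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
        chunks.append("\n".join(lines[first_line - 1:node.end_lineno]))
    return chunks


source = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = split_top_level(source)
for chunk in chunks:
    print(repr(chunk))
```

Each function or class comes back whole, whereas a fixed line count could cut through the middle of a definition.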
Experimental Feature: This CodeChunker is experimental and may change significantly between versions. Use with caution in production environments.
from chonkie.experimental import CodeChunker

# Set a chunk size limit (chunks may exceed this to preserve semantics)
chunker = CodeChunker(
    language="python",
    chunk_size=2048,  # Target chunk size in characters
    tokenizer_or_token_counter="character"
)
The experimental CodeChunker can automatically detect the programming language using Magika, Google’s deep learning-based language detection model:
# Let the chunker detect the language automatically
chunker = CodeChunker(language="auto")

# Chunk different types of code - language is detected automatically
python_code = '''
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
'''

javascript_code = '''
function fibonacci(n) {
    if (n <= 1) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}
'''

rust_code = '''
fn fibonacci(n: u32) -> u32 {
    if n <= 1 { n } else { fibonacci(n-1) + fibonacci(n-2) }
}
'''

# All will be chunked with appropriate language-specific rules
python_chunks = chunker.chunk(python_code)  # Detected as Python
js_chunks = chunker.chunk(javascript_code)  # Detected as JavaScript
rust_chunks = chunker.chunk(rust_code)  # Detected as Rust
Performance Consideration: When using language="auto", the chunker will show a warning that auto-detection may affect performance. For better performance in production, specify the language explicitly when known.
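When file paths are available, one way to avoid auto-detection is to derive the language from the file extension and fall back to "auto" only for unknown files. The sketch below uses only the standard library; the `language_for` helper and the extension table are hypothetical, and the language strings "javascript" and "rust" are assumed to match the identifiers the chunker accepts.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a language identifier.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",  # assumed identifier
    ".ts": "typescript",
    ".rs": "rust",  # assumed identifier
    ".c": "c",
    ".html": "html",
    ".md": "markdown",
}


def language_for(path: str, default: str = "auto") -> str:
    """Pick an explicit language from the extension, else fall back to default."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower(), default)


print(language_for("src/main.rs"))
print(language_for("README.MD"))
print(language_for("Makefile"))
```

The returned string can then be passed as the `language` argument, so detection only runs for files the table does not cover.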
# Control whether to add split context information
chunker = CodeChunker(
    language="typescript",
    add_split_context=True  # Include context about split locations
)
The experimental CodeChunker prioritizes semantic coherence over strict size limits:
chunker = CodeChunker(language="python", chunk_size=100)

# This class will likely stay together even if it exceeds 100 characters
code = '''
class SmallButImportant:
    def __init__(self):
        self.value = "important"

    def get_value(self):
        return self.value
'''

chunks = chunker.chunk(code)
# The class will typically be kept as one chunk for semantic coherence
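The trade-off between size limits and semantic coherence can be sketched with a simple greedy packer that treats each semantic unit as indivisible. This is an illustration under stated assumptions, not the CodeChunker's actual algorithm: `pack_units` is a hypothetical helper, sizes are measured in characters, and units are joined with a single newline.

```python
def pack_units(units: list[str], chunk_size: int) -> list[str]:
    """Greedily group units; a unit is never split, so a chunk may exceed chunk_size."""
    chunks = []
    current = ""
    for unit in units:
        candidate = unit if not current else current + "\n" + unit
        if current and len(candidate) > chunk_size:
            # Adding this unit would overflow: close the current chunk
            # and start a new one with the unit kept whole.
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Placeholder units standing in for functions or classes of various sizes.
units = ["a" * 30, "b" * 150, "c" * 20, "d" * 20]
chunks = pack_units(units, chunk_size=100)
print([len(chunk) for chunk in chunks])
```

The 150-character unit ends up alone in a chunk that exceeds the limit, while the two small units at the end are merged into one chunk.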
# For code analysis tasks
chunker = CodeChunker(language="python", chunk_size=2048)

# For embedding generation (smaller chunks often work better)
chunker = CodeChunker(language="python", chunk_size=1024)

# No size limit (preserve all semantic units)
chunker = CodeChunker(language="python", chunk_size=None)
# For web development files with mixed content
html_chunker = CodeChunker(language="html", chunk_size=800)

# For documentation with code examples
md_chunker = CodeChunker(language="markdown", chunk_size=600)

# For system-level code that needs precise structure
c_chunker = CodeChunker(language="c", chunk_size=1200)
If migrating from the stable CodeChunker to the experimental version:
# Old stable version
from chonkie import CodeChunker

# New experimental version
from chonkie.experimental import CodeChunker

# The API is similar but with enhanced capabilities
chunker = CodeChunker(language="python", chunk_size=2048)