The CodeChunker splits source code into chunks along its syntactic structure, using Abstract Syntax Trees (ASTs) so that each chunk is a contextually coherent segment.
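
To make the idea concrete, here is a minimal sketch of structure-based splitting built on Python's standard-library ast module. It is an illustration only, not Chonkie's implementation: the CodeChunker relies on tree-sitter and measures size in tokens, whereas this sketch handles Python source only and uses a character budget. The function name split_top_level and the max_chars parameter are hypothetical.

```python
import ast


def split_top_level(source: str, max_chars: int = 200) -> list[str]:
    """Group consecutive top-level statements into chunks of roughly max_chars."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)

    # 0-based first line of each top-level statement, including any decorators.
    starts = []
    for node in tree.body:
        decorators = getattr(node, "decorator_list", [])
        first = min([node.lineno] + [d.lineno for d in decorators])
        starts.append(first - 1)
    if starts:
        starts[0] = 0  # keep leading comments and blank lines with the first statement

    # Each segment runs up to the start of the next statement, so no text is lost.
    bounds = starts + [len(lines)]
    segments = ["".join(lines[a:b]) for a, b in zip(bounds, bounds[1:])]

    chunks, current = [], ""
    for segment in segments:
        # Close the current chunk when adding this statement would exceed the budget.
        if current and len(current) + len(segment) > max_chars:
            chunks.append(current)
            current = ""
        current += segment
    if current:
        chunks.append(current)
    return chunks
```

Because boundaries fall only between top-level statements, a function or class is never cut in half. The trade-off is that a single statement longer than max_chars ends up as one oversized chunk in this sketch.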

API Reference

To use the CodeChunker via the API, check out the API reference documentation.

Installation

CodeChunker requires additional dependencies for code parsing. You can install it with:

pip install "chonkie[code]"

For installation instructions, see the Installation Guide.

Initialization

from chonkie import CodeChunker

# Basic initialization with default parameters
chunker = CodeChunker(
    language="python",                 # Specify the programming language
    tokenizer_or_token_counter="gpt2", # Tokenizer to use
    chunk_size=512,                    # Maximum tokens per chunk
    include_nodes=False                # Optionally include AST nodes in output
)

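# Using a custom tokenizer
# (stored in a separate variable so `chunker` above remains the Python chunker)
from tokenizers import Tokenizer
custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")  # Replace with a real tokenizer name
js_chunker = CodeChunker(
    language="javascript",
    tokenizer_or_token_counter=custom_tokenizer,
    chunk_size=512
)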

Parameters

language (str, required)
The programming language of the code. Accepts languages supported by tree-sitter-language-pack.

tokenizer_or_token_counter (Union[str, Callable, Any], default: "gpt2")
Tokenizer or token counting function to use for measuring chunk size.

chunk_size (int, default: 512)
Maximum number of tokens per chunk.

include_nodes (bool, default: False)
Whether to include the list of corresponding AST Node objects within each CodeChunk.

return_type (Literal['chunks', 'texts'], default: "chunks")
Whether to return chunks as CodeChunk objects or plain text strings.
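
Since tokenizer_or_token_counter also accepts a callable, a plain function can stand in for a tokenizer. The sketch below assumes the callable receives a string and returns an integer token count; the function name count_code_tokens and its regex-based notion of a "token" are hypothetical and only approximate what a real tokenizer would report.

```python
import re

# Identifiers/keywords, integer runs, or any single punctuation character.
_TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def count_code_tokens(text: str) -> int:
    """Rough token count for source code: words, numbers, and punctuation marks."""
    return len(_TOKEN_PATTERN.findall(text))
```

Under that assumption, the function would be passed as tokenizer_or_token_counter=count_code_tokens when constructing the chunker, and chunk_size would then be interpreted in these approximate tokens.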

Usage

Single Code Chunking

code = """
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
"""
chunks = chunker.chunk(code)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Language: {chunk.lang}")
    if chunk.nodes:
        print(f"Node count: {len(chunk.nodes)}")

Batch Chunking

# Every input should be written in the language the chunker was configured with
codes = [
    "def func1():\n    pass",
    "def add(a, b):\n    return a + b"
]
batch_chunks = chunker.chunk_batch(codes)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
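
A chunker is configured with a single language, so a mixed-language collection of files could be grouped by language before batching. The following is a minimal sketch of that grouping step using only the standard library. The EXTENSION_TO_LANGUAGE table and the group_by_language helper are hypothetical and not part of Chonkie; the language names are assumed to match those used by tree-sitter-language-pack.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to language name; extend as needed.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".go": "go",
    ".c": "c",
}


def group_by_language(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by language, skipping files with unknown extensions."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
        if language is not None:
            groups[language].append(path)
    return dict(groups)
```

Each group could then be read from disk and passed to chunk_batch on a chunker created for that group's language.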

Using as a Callable

# Single code string
chunks = chunker("def greet(name):\n    print(f'Hello, {name}')")

# Multiple code strings (all in the language the chunker was configured with)
batch_chunks = chunker([
    "def func1():\n    pass",
    "def add(a, b):\n    return a + b"
])

Return Type

With the default return_type of "chunks", CodeChunker returns a list of CodeChunk objects:

@dataclass
class CodeChunk(Chunk):
    text: str           # The chunk text (code snippet)
    start_index: int    # Starting position in original code
    end_index: int      # Ending position in original code
    token_count: int    # Number of tokens in chunk
    lang: Optional[str] = None # Language of the code chunk
    nodes: Optional[List[Node]] = None # List of AST nodes if include_nodes=True
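
The start_index and end_index fields make it possible to map a chunk back to the original source. The sketch below shows one way to sanity-check that mapping. It uses a hypothetical stand-in class, SpanChunk, rather than the real CodeChunk, and it assumes that end_index is exclusive (Python slice semantics), which the description above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class SpanChunk:
    """Hypothetical stand-in carrying only the span-related CodeChunk fields."""
    text: str
    start_index: int
    end_index: int  # assumed exclusive, as in a Python slice


def spans_match_source(source: str, chunks: list[SpanChunk]) -> bool:
    """Return True if every chunk's text equals the source slice it points at."""
    return all(source[c.start_index:c.end_index] == c.text for c in chunks)
```

If the check fails on real output, the indices may follow a different convention (for example, an inclusive end), so it is worth verifying against the installed version.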