Chonkie Documentation

The MistralOCR chef extracts text from images and PDF files using Mistral’s OCR API, returning structured MarkdownDocument objects for further processing.

Installation

pip install "chonkie[mistral]"

You need a Mistral API key. Set the MISTRAL_API_KEY environment variable or pass the key directly through the api_key parameter.

Initialization

from chonkie import MistralOCR

# Default initialization (uses MISTRAL_API_KEY env var)
ocr = MistralOCR()

# Custom model and explicit API key
ocr = MistralOCR(model="mistral-ocr-2505", api_key="sk-...")

Parameters

model

str

default:"mistral-ocr-latest"

The Mistral OCR model to use.

api_key

Optional[str]

default:None

Mistral API key. Falls back to the MISTRAL_API_KEY environment variable.
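The fallback order described for api_key can be pictured with a short sketch. This only illustrates the lookup order (explicit argument first, then the environment variable); it is not Chonkie's actual implementation, and resolve_api_key is a hypothetical helper introduced here for illustration, including its choice to raise ValueError when no key is found.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else the MISTRAL_API_KEY env var.

    Hypothetical helper: it mirrors the documented fallback order only.
    """
    key = api_key or os.environ.get("MISTRAL_API_KEY")
    if not key:
        # Raising here is an assumption made for this sketch.
        raise ValueError(
            "No Mistral API key found: pass api_key or set MISTRAL_API_KEY."
        )
    return key
```

Running a check like this before constructing MistralOCR surfaces a missing key early, before any file is sent for OCR.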

Methods

process()

Process an image or PDF file and return a MarkdownDocument.

Parameters

path

Union[str, Path]

required

Path to the image or PDF file.

Returns

MarkdownDocument containing the extracted text as markdown content.

process_batch()

Process multiple image or PDF files at once.

Parameters

paths

list[Union[str, Path]]

required

List of file paths to process.

Returns

list[MarkdownDocument] where each document contains extracted text from a file.

parse()

Wrap raw text in a Document without modifying it. No OCR is performed, since OCR operates on files rather than on text.

Parameters

text

str

required

Raw text to wrap into a Document.

Returns

Document containing the provided text.

Supported File Types

| Type | Extensions |
| --- | --- |
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.webp`, `.tiff`, `.tif` |
| Documents | `.pdf` |
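
When collecting files from a mixed directory, it can help to drop unsupported files before calling process_batch(). The sketch below is one way to do that, using the extensions listed in the table above. filter_supported and SUPPORTED_EXTENSIONS are hypothetical names, and the case-insensitive comparison is an assumption, not documented behavior.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Extensions copied from the supported file types table.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".tiff", ".tif", ".pdf",
}


def filter_supported(paths: Iterable[Union[str, Path]]) -> List[Path]:
    """Keep only paths whose extension is in the supported set.

    Comparison is case-insensitive (an assumption made for this sketch).
    """
    return [
        Path(p) for p in paths
        if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS
    ]
```

The filtered list can then be passed to ocr.process_batch(), which accepts both str and Path entries.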

Usage

Standalone

from chonkie import MistralOCR

ocr = MistralOCR()

# Single file
doc = ocr.process("research_paper.pdf")
print(doc.content)
print(f"Source: {doc.metadata['filename']}")

# Multiple files
docs = ocr.process_batch(["page1.png", "page2.png"])

# Async
import asyncio
doc = asyncio.run(ocr.aprocess("document.pdf"))
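
For several files, aprocess() calls can be awaited concurrently. The following is a minimal sketch, assuming only that aprocess() is an awaitable that takes a single path, as in the snippet above. process_concurrently and its limit parameter are hypothetical and not part of Chonkie; the default of 4 is arbitrary.

```python
import asyncio
from typing import Any, List, Sequence


async def process_concurrently(
    ocr: Any, paths: Sequence[str], limit: int = 4
) -> List[Any]:
    """Await ocr.aprocess() for each path, at most `limit` at a time.

    Results come back in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> Any:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await ocr.aprocess(path)

    return await asyncio.gather(*(worker(p) for p in paths))
```

A call such as asyncio.run(process_concurrently(ocr, ["page1.png", "page2.png"])) would then return one document per input path. Capping concurrency with a semaphore is a design choice of this sketch, intended to avoid sending every request at once.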

Pipeline

Use .process_with("mistral") to add OCR to a pipeline:

from chonkie import Pipeline

# Process a PDF with OCR and chunk it
doc = (Pipeline()
    .fetch_from("file", path="document.pdf")
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Extracted {len(doc.chunks)} chunks from PDF")

OCR + RAG Pipeline

Build a complete pipeline from scanned documents to vector database:

from chonkie import Pipeline

docs = (Pipeline()
    .fetch_from("file", dir="./scanned_docs", ext=[".pdf", ".png"])
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="scanned_documents")
    .run())

print(f"Ingested {len(docs)} documents")

OCR + Semantic Chunking

Use semantic chunking on OCR output so that chunk boundaries follow shifts in meaning rather than fixed sizes:

from chonkie import Pipeline

doc = (Pipeline()
    .fetch_from("file", path="textbook_chapter.pdf")
    .process_with("mistral")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("embedding", model="text-embedding-3-small")
    .export_with("json", file="textbook_chunks.json")
    .run())

Integration with Chunkers

MistralOCR returns a MarkdownDocument whose content is a plain markdown string, so it can be passed to any chunker:

from chonkie import MistralOCR, RecursiveChunker

# Step 1: Extract text from PDF
ocr = MistralOCR()
doc = ocr.process("report.pdf")

# Step 2: Chunk the extracted content
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in the document
doc.chunks = chunks

print(f"Document: {doc.metadata['filename']}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")

Notes

- OCR quality depends on image resolution and clarity.
- Large PDFs are processed page-by-page and concatenated with double newlines.
- The extracted text is returned as markdown, preserving structure from the source document.
- API calls are synchronous by default; use aprocess() for async execution.
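
The page concatenation mentioned in the notes can be illustrated with a small sketch. This shows the joining step only, under the assumption that each page yields one markdown string; join_pages is a hypothetical name and not a Chonkie function.

```python
from typing import List


def join_pages(pages: List[str]) -> str:
    """Join per-page markdown with a blank line between pages."""
    return "\n\n".join(pages)
```

Because markdown paragraphs are also separated by blank lines, page boundaries cannot be reliably recovered by splitting the combined content on double newlines.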
