The MistralOCR chef extracts text from images and PDF files using Mistral’s OCR API, returning structured MarkdownDocument objects for further processing.

Installation

pip install "chonkie[mistral]"

You need a Mistral API key. Set the MISTRAL_API_KEY environment variable or pass the key directly via the api_key argument.

Initialization

from chonkie import MistralOCR

# Default initialization (uses MISTRAL_API_KEY env var)
ocr = MistralOCR()

# Custom model and explicit API key
ocr = MistralOCR(model="mistral-ocr-2505", api_key="sk-...")

Parameters

model
str
default:"mistral-ocr-latest"
The Mistral OCR model to use.
api_key
Optional[str]
default:None
Mistral API key. Falls back to the MISTRAL_API_KEY environment variable.
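
The fallback described for api_key can be pictured with a small standalone sketch. The helper resolve_api_key below is hypothetical and not part of chonkie; it only illustrates the "explicit argument first, then MISTRAL_API_KEY" order, and raising an error when no key is found is an assumption about what a caller might want, not documented library behavior.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit key, else fall back to the env var."""
    key = api_key or os.environ.get("MISTRAL_API_KEY")
    if not key:
        # Assumption: failing early is friendlier than a failed API call later.
        raise ValueError("Pass api_key or set the MISTRAL_API_KEY environment variable.")
    return key
```

Checking for the key up front like this can be handy in scripts that would otherwise only fail once the first file is sent for OCR.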

Methods

process()

Process an image or PDF file and return a MarkdownDocument.

Parameters

path
Union[str, Path]
required
Path to the image or PDF file.

Returns

MarkdownDocument containing the extracted text as markdown content.

process_batch()

Process multiple image or PDF files at once.

Parameters

paths
list[Union[str, Path]]
required
List of file paths to process.

Returns

list[MarkdownDocument], where each document contains the text extracted from one input file.

parse()

Wrap raw text in a Document as-is. No OCR is performed, since OCR operates on files rather than text.

Parameters

text
str
required
Raw text to wrap into a Document.

Returns

Document containing the provided text.

Supported File Types

Type | Extensions
Images | .png, .jpg, .jpeg, .gif, .bmp, .webp, .tiff, .tif
Documents | .pdf
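
When collecting files from a mixed directory, it may help to filter paths against the extensions listed above before handing them to process_batch(). The following is a minimal sketch: filter_supported and SUPPORTED_EXTENSIONS are hypothetical names, not part of chonkie, and the case-insensitive comparison is an assumption.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Extensions copied from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".tiff", ".tif", ".pdf",
}


def filter_supported(paths: Iterable[Union[str, Path]]) -> List[Path]:
    """Keep only paths whose extension is in the supported set.

    Assumption: extensions are compared case-insensitively, so "scan.PNG" passes.
    """
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]
```

The filtered list can then be passed to ocr.process_batch(...), which accepts both str and Path entries.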

Usage

Standalone

from chonkie import MistralOCR

ocr = MistralOCR()

# Single file
doc = ocr.process("research_paper.pdf")
print(doc.content)
print(f"Source: {doc.metadata['filename']}")

# Multiple files
docs = ocr.process_batch(["page1.png", "page2.png"])

# Async
import asyncio
doc = asyncio.run(ocr.aprocess("document.pdf"))
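
Since aprocess() is a coroutine, several files can be processed concurrently. The sketch below assumes only that the OCR object exposes aprocess(path) as shown above; process_concurrently and its limit parameter are hypothetical, and the cap of 4 in-flight calls is an arbitrary illustration rather than a recommended value.

```python
import asyncio
from typing import Any, List, Sequence


async def process_concurrently(ocr: Any, paths: Sequence[str], limit: int = 4) -> List[Any]:
    """Run ocr.aprocess() on every path, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> Any:
        # The semaphore caps how many OCR requests run at the same time.
        async with semaphore:
            return await ocr.aprocess(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


# Usage (requires chonkie and a Mistral API key):
# docs = asyncio.run(process_concurrently(MistralOCR(), ["page1.png", "page2.png"]))
```

Bounding concurrency with a semaphore is one way to avoid sending every request at once; whether it is needed depends on the rate limits of your account.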

Pipeline

Use .process_with("mistral") to add OCR to a pipeline:

from chonkie import Pipeline

# Process a PDF with OCR and chunk it
doc = (Pipeline()
    .fetch_from("file", path="document.pdf")
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Extracted {len(doc.chunks)} chunks from PDF")

OCR + RAG Pipeline

Build a complete pipeline from scanned documents to a vector database:

from chonkie import Pipeline

docs = (Pipeline()
    .fetch_from("file", dir="./scanned_docs", ext=[".pdf", ".png"])
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="scanned_documents")
    .run())

print(f"Ingested {len(docs)} documents")

OCR + Semantic Chunking

Use semantic chunking on OCR output so that chunk boundaries follow shifts in topic rather than fixed sizes:

from chonkie import Pipeline

doc = (Pipeline()
    .fetch_from("file", path="textbook_chapter.pdf")
    .process_with("mistral")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("embedding", model="text-embedding-3-small")
    .export_with("json", file="textbook_chunks.json")
    .run())

Integration with Chunkers

MistralOCR returns a MarkdownDocument, so its content can be passed to any chunker:

from chonkie import MistralOCR, RecursiveChunker

# Step 1: Extract text from PDF
ocr = MistralOCR()
doc = ocr.process("report.pdf")

# Step 2: Chunk the extracted content
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in the document
doc.chunks = chunks

print(f"Document: {doc.metadata['filename']}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")
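
The same three steps can be applied to several files at once. This is a sketch only: ocr_and_chunk is a hypothetical helper, and it relies solely on the calls shown on this page (process_batch(), chunker.chunk(), and the doc.content and doc.chunks attributes).

```python
from typing import Any, List, Sequence


def ocr_and_chunk(ocr: Any, chunker: Any, paths: Sequence[str]) -> List[Any]:
    """Hypothetical helper: OCR each file, chunk its content, attach the chunks."""
    # Step 1: extract text from every file.
    docs = ocr.process_batch(list(paths))
    for doc in docs:
        # Steps 2 and 3: chunk the extracted markdown and store it on the document.
        doc.chunks = chunker.chunk(doc.content)
    return docs


# Usage (requires chonkie and a Mistral API key):
# docs = ocr_and_chunk(MistralOCR(), RecursiveChunker(chunk_size=512), ["a.pdf", "b.pdf"])
```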

Notes

  • OCR quality depends on image resolution and clarity
  • Large PDFs are processed page-by-page and concatenated with double newlines
  • The extracted text is returned as markdown, preserving structure from the source document
  • API calls are synchronous by default; use aprocess() for async execution
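
The page concatenation mentioned in the notes can be illustrated with plain strings. This is a sketch of the described behavior, not the library's internal code; join_pages is a hypothetical name and the page contents are made up for illustration.

```python
from typing import List


def join_pages(pages: List[str]) -> str:
    """Concatenate per-page markdown with a blank line (double newline) between pages."""
    return "\n\n".join(pages)


# Two made-up pages of extracted markdown.
pages = ["# Title\nIntro text.", "## Section\nMore text."]
combined = join_pages(pages)
```

Because a page can itself contain blank lines, splitting the combined text on double newlines will not reliably recover the original page boundaries.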