Datalab
State of the Art models for Document Intelligence
Surya
Surya is a document OCR toolkit built around a single vision-language model that does:
- Full-page OCR with layout, ranking near top models on olmOCR-bench
- Line-level text detection
- Layout analysis (table, image, header, etc.) with reading order
- Table recognition (rows + columns + cell HTML)
- Math / equations recognized inline (no separate LaTeX OCR pass)
It works on a range of documents (see usage and benchmarks).
Try Datalab's Managed Platform
Our managed platform runs both Surya and variants of Chandra, our highest-accuracy model.
If you have high volume workloads, we offer a batch processing service that can process 1B+ pages per week.
Get started with $5 in free credits — sign up (takes under 30 seconds) or try our public playground.
Commercial self-hosting of the model weights requires a license — see Commercial usage. For on-prem licensing, contact us.
Example outputs (images omitted): Detection, OCR, Layout, Reading Order, Table Recognition, Math / Equations.
Surya is named for the Hindu sun god, who has universal vision.
Examples
Each row links to four annotated views of the same page: text-line detection, layout, reading order, and (when present) table recognition.
| Name | Detection | Layout | Order | Table Rec |
|---|---|---|---|---|
| Newspaper | Image | Image | Image | |
| Textbook | Image | Image | Image | |
| Tax Form | Image | Image | Image | Image |
| Handwritten Notes | Image | Image | Image | Image |
| Corporate Doc | Image | Image | Image | Image |
Commercial usage
The Surya code is licensed under Apache 2.0. The model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue). For broader commercial licensing of the model weights, visit our pricing page here.
Installation
You'll need Python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.
Install with:
pip install surya-ocr
Upgrading from Surya v1
Surya 2 replaces the per-task encoder-decoder models (FoundationPredictor, RecognitionPredictor, LayoutPredictor, and TableRecPredictor, each holding its own torch checkpoint) with a single vision-language model served by vllm (Docker, GPU) or llama-server (Apple Silicon / CPU). If you have v1 code, you can migrate it to this pattern:
# v2
from surya.inference import SuryaInferenceManager
from surya.recognition import RecognitionPredictor
manager = SuryaInferenceManager() # auto-spawns vllm or llama-server
rec = RecognitionPredictor(manager)
predictions = rec([image]) # full-page OCR by default
What's different:
- SuryaInferenceManager replaces FoundationPredictor. Same manager instance is shared across LayoutPredictor, RecognitionPredictor, TableRecPredictor.
- Output schemas changed: see the per-section JSON tables below. Highlights: text_lines → blocks (with html); layout dropped top_k, added count; table_rec dropped is_header / colspan / rowspan from cells.
Usage
Surya 2 runs layout, OCR, and table recognition through a single VLM served
by vllm (GPU) or llama.cpp (CPU / Apple Silicon). The inference manager
will spawn one for you on first use; you can also point it at an existing
server via SURYA_INFERENCE_URL=http://host:port/v1.
- Inspect the settings in surya/settings.py. You can override any setting via env var (e.g. SURYA_INFERENCE_BACKEND=vllm).
- Text detection and OCR use separate models: detection is a standalone torch model, while OCR runs through the VLM.
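Before pointing SURYA_INFERENCE_URL at an existing server, it can help to confirm that the server answers. The sketch below is not part of Surya; it assumes the server exposes the usual OpenAI-compatible model listing at `<base_url>/models`, and the helper names `models_url` and `server_is_up` are made up for illustration.

```python
import http.client
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/models answers 200 with a JSON body."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a malformed reply.
        return False
```

For example, `server_is_up("http://localhost:8000/v1")` could gate whether you export SURYA_INFERENCE_URL or let the manager spawn its own backend.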
Interactive App
I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:
pip install streamlit pdftext
surya_gui
OCR (text recognition)
This command will write out a json file with the detected text and bboxes:
surya_ocr DATA_PATH
- DATA_PATH can be an image, pdf, or folder of images/pdfs
- --images will save images of the pages and detected blocks (optional)
- --output_dir specifies the directory to save results to instead of the default
- --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.
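To make the --page_range syntax concrete, here is a minimal sketch of how a string like 0,5-10,20 could be expanded into page indices. It is not Surya's own parser; `parse_page_range` is a hypothetical helper, and it assumes that range ends are inclusive.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            # Assumption: both ends of a range are included.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```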
The results.json file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:
- blocks - per-block OCR results in reading order
  - label - canonicalized layout label (e.g. Text, Section-Header, Table, Equation-Block, Picture, Form, Page-Header, ...). See surya/inference/prompts.py:LAYOUT_LABEL_SET for the full set.
  - raw_label - original label emitted by the model, before canonicalization
  - reading_order - 0-indexed position in layout output
  - html - block content as HTML (math wrapped in <math>...</math>, tables as <table>...</table>, etc.). "" if the block was skipped
  - polygon - 4-corner polygon in [[x0,y0],[x1,y0],[x1,y1],[x0,y1]] order
  - bbox - axis-aligned [x0, y0, x1, y1] derived from the polygon
  - confidence - mean per-token probability across the block's decode (0-1)
  - skipped - true if the block was a visual label (e.g. Picture) and not OCR'd
  - error - true if the block OCR call failed
- image_bbox - [0, 0, width, height] for the page image
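As a rough illustration of consuming this structure, the sketch below joins the HTML of one page's blocks in reading order and leaves out skipped or failed blocks. The `sample` dict is invented data shaped like the fields listed above, and `page_html` is a hypothetical helper; in practice you would load the real file with `json.load`.

```python
# Invented sample shaped like results.json (normally: json.load(open(path))).
sample = {
    "invoice": [
        {
            "blocks": [
                {"label": "Text", "reading_order": 1, "html": "<p>Total due: $40</p>",
                 "confidence": 0.96, "skipped": False, "error": False},
                {"label": "Section-Header", "reading_order": 0, "html": "<h1>Invoice</h1>",
                 "confidence": 0.99, "skipped": False, "error": False},
                {"label": "Picture", "reading_order": 2, "html": "",
                 "confidence": 0.90, "skipped": True, "error": False},
            ],
            "image_bbox": [0, 0, 816, 1056],
        }
    ]
}


def page_html(page: dict) -> str:
    """Join block HTML in reading order, dropping skipped and failed blocks."""
    blocks = sorted(page["blocks"], key=lambda b: b["reading_order"])
    usable = [b for b in blocks if not b.get("skipped") and not b.get("error")]
    return "\n".join(b["html"] for b in usable if b["html"])


print(page_html(sample["invoice"][0]))
```

The same loop is a natural place to flag blocks whose confidence falls below a cutoff you choose for manual review.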
Performance tips
Throughput is governed by the inference backend, not a RECOGNITION_BATCH_SIZE env var. With vllm, raise --max-num-seqs / --max-num-batched-tokens (or SURYA_INFERENCE_PARALLEL on the client side) to keep more pages in flight. With llama.cpp, set SURYA_INFERENCE_PARALLEL to match --parallel on llama-server.
From python
from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.recognition import RecognitionPredictor
manager = SuryaInferenceManager()
recognition_predictor = RecognitionPredictor(manager)
# Default: full-page OCR. One VLM call per page; returns layout + content as
# HTML <div data-bbox=... data-label=...> blocks.
predictions = recognition_predictor([Image.open(IMAGE_PATH)])
# Block mode: pre-run layout, then per-block OCR. Auto-selected when
# `layout_results` is passed.
from surya.layout import LayoutPredictor
layout = LayoutPredictor(manager)
layouts = layout([Image.open(IMAGE_PATH)])
predictions = recognition_predictor([Image.open(IMAGE_PATH)], layouts)
Text line detection
This command will write out a json file with the detected bboxes.
surya_detect DATA_PATH
- DATA_PATH can be an image, pdf, or folder of images/pdfs
- --images will save images of the pages and detected text lines (optional)
- --output_dir specifies the directory to save results to instead of the default
- --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.
The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
- bboxes - detected bounding boxes for text
  - bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
  - confidence - the confidence of the model in the detected text (0-1)
- vertical_lines - vertical lines detected in the document
  - bbox - the axis-aligned line coordinates.
- page - the page number in the file
- image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
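The relationship between polygon, bbox, and image_bbox can be sketched in a few lines. The code below uses an invented page dict in the shape described above; `polygon_to_bbox`, `inside`, and `confident_lines` are hypothetical helpers, not Surya functions.

```python
# Invented sample page shaped like the detection output described above.
page = {
    "bboxes": [
        {"polygon": [[10, 20], [110, 22], [110, 42], [10, 40]],
         "bbox": [10, 20, 110, 42], "confidence": 0.93},
        {"polygon": [[10, 60], [90, 60], [90, 80], [10, 80]],
         "bbox": [10, 60, 90, 80], "confidence": 0.31},
    ],
    "vertical_lines": [],
    "page": 1,
    "image_bbox": [0, 0, 200, 100],
}


def polygon_to_bbox(polygon: list) -> list:
    """Smallest axis-aligned (x1, y1, x2, y2) box enclosing the polygon."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


def inside(bbox: list, outer: list) -> bool:
    """True if bbox lies entirely within outer."""
    return (bbox[0] >= outer[0] and bbox[1] >= outer[1]
            and bbox[2] <= outer[2] and bbox[3] <= outer[3])


def confident_lines(page: dict, min_confidence: float = 0.5) -> list:
    """Keep only text lines at or above a confidence cutoff you choose."""
    return [line for line in page["bboxes"] if line["confidence"] >= min_confidence]


for line in page["bboxes"]:
    assert polygon_to_bbox(line["polygon"]) == line["bbox"]
    assert inside(line["bbox"], page["image_bbox"])
print(len(confident_lines(page)))
```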
Performance tips
Detection is a torch model. DETECTOR_BATCH_SIZE (default 36) controls VRAM usage on GPU; raise it on larger cards.
From python
from PIL import Image
from surya.detection import DetectionPredictor
det_predictor = DetectionPredictor()
predictions = det_predictor([Image.open(IMAGE_PATH)])
Layout and reading order
This command will write out a json file with the detected layout and reading order.
surya_layout DATA_PATH
- DATA_PATH can be an image, pdf, or folder of images/pdfs
- --images will save images of the pages and detected text lines (optional)
- --output_dir specifies the directory to save results to instead of the default
- --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.
The results.json file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:
- bboxes - layout boxes in reading order
  - polygon - 4-corner polygon [[x0,y0],[x1,y0],[x1,y1],[x0,y1]]
  - bbox - axis-aligned [x0, y0, x1, y1] derived from the polygon
  - label - canonicalized label. One of Caption, Footnote, Equation-Block, List-Group, Page-Header, Page-Footer, Image, Section-Header, Table, Text, Complex-Block, Code-Block, Form, Table-Of-Contents, Figure, Chemical-Block, Diagram, Bibliography, Blank-Page
  - raw_label - original label emitted by the model
  - position - 0-indexed reading order
  - count - model's token estimate for OCR'ing this block (rounded to multiples of 50; used to size the per-block decode budget)
  - confidence - mean per-token probability across the layout decode (0-1)
- image_bbox - [0, 0, width, height]
- raw - raw JSON the layout model emitted, for debugging
- error - true if the layout call failed
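A short sketch of how position and count might be used downstream: ordering labels by reading order and adding up the estimated token budget per label. The `page` dict is invented sample data with only the fields needed here, and both helper functions are hypothetical.

```python
from collections import Counter

# Invented sample page; real entries also carry polygon, bbox, and raw_label.
page = {
    "bboxes": [
        {"label": "Text", "position": 1, "count": 150, "confidence": 0.97},
        {"label": "Section-Header", "position": 0, "count": 50, "confidence": 0.99},
        {"label": "Table", "position": 2, "count": 400, "confidence": 0.91},
    ],
    "image_bbox": [0, 0, 816, 1056],
    "error": False,
}


def labels_in_reading_order(page: dict) -> list:
    """Block labels sorted by the 0-indexed position field."""
    ordered = sorted(page["bboxes"], key=lambda box: box["position"])
    return [box["label"] for box in ordered]


def token_budget_by_label(page: dict) -> Counter:
    """Sum the per-block token estimates (count) for each label."""
    budget = Counter()
    for box in page["bboxes"]:
        budget[box["label"]] += box["count"]
    return budget


print(labels_in_reading_order(page))
print(sum(token_budget_by_label(page).values()))
```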
Performance tips
Layout runs through the shared inference backend. Throughput tuning is the same as OCR — see Performance tips above.
From python
from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.layout import LayoutPredictor
layout_predictor = LayoutPredictor(SuryaInferenceManager())
layout_predictions = layout_predictor([Image.open(IMAGE_PATH)])
Table Recognition
This command will write out a json file with the detected table cells and row/column ids, along with row/column bounding boxes. If you want to get cell positions and text, along with nice formatting, check out the marker repo. You can use the TableConverter to detect and extract tables in images and PDFs. It supports output in json (with bboxes), markdown, and html.
surya_table DATA_PATH
- DATA_PATH can be an image, pdf, or folder of images/pdfs
- --images will save annotated row + column overlays alongside the json (optional)
- --output_dir specifies the directory to save results to instead of the default
- --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.
- --skip_table_detection tells table recognition not to detect tables first. Use this if your image is already cropped to a table.
The results.json file contains a dict keyed by input filename (no extension). Each value is a list of per-table dicts. Each table dict contains:
- rows - detected table rows in reading order
  - polygon / bbox - row geometry (same convention as everywhere else)
  - row_id - 0-indexed row id
- cols - detected table columns
  - polygon / bbox - column geometry
  - col_id - 0-indexed column id
- cells - geometric row × column intersections (simple mode)
  - polygon / bbox - cell geometry
  - row_id, col_id, cell_id
- html - full <table>...</table> HTML (only populated when predict_full is used; handles spanning cells / header rows). null in simple mode.
- mode - "simple" or "full"
- image_bbox - the table crop bbox
- error - true if the table_rec call failed
- raw - raw model output, for debugging
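One plausible reading of "geometric row × column intersections" is that each cell is the overlap of a row box and a column box. The sketch below illustrates that idea on invented boxes; it is not Surya's implementation, `derive_cells` is a hypothetical helper, and it leaves out cell_id because the numbering scheme is not described here.

```python
def intersect(a: list, b: list):
    """Overlap of two [x0, y0, x1, y1] boxes, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or y1 <= y0:
        return None
    return [x0, y0, x1, y1]


def derive_cells(rows: list, cols: list) -> list:
    """Build one cell per overlapping (row, column) pair."""
    cells = []
    for row in rows:
        for col in cols:
            bbox = intersect(row["bbox"], col["bbox"])
            if bbox is not None:
                cells.append({"row_id": row["row_id"], "col_id": col["col_id"], "bbox": bbox})
    return cells


# Invented 2 x 2 table: two rows stacked vertically, two columns side by side.
rows = [{"row_id": 0, "bbox": [0, 0, 200, 20]}, {"row_id": 1, "bbox": [0, 20, 200, 40]}]
cols = [{"col_id": 0, "bbox": [0, 0, 100, 40]}, {"col_id": 1, "bbox": [100, 0, 200, 40]}]

for cell in derive_cells(rows, cols):
    print(cell)
```

Pure intersections cannot express merged cells, which is consistent with the html field from predict_full being the place where spanning cells and header rows are handled.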
Performance tips
Table recognition routes through the shared VLM. Throughput tuning is the same as OCR.
From python
from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.table_rec import TableRecPredictor
table_rec_predictor = TableRecPredictor(SuryaInferenceManager())
# Default: rows + columns only, cells derived from intersections.
table_predictions = table_rec_predictor([Image.open(IMAGE_PATH)])
# Or full HTML output (better for spanning cells / headers):
# table_predictions = table_rec_predictor.predict_full([Image.open(IMAGE_PATH)])
Math / equations
Surya 2 handles math inline as part of full-page OCR — recognized equations
come back inside <math>...</math> tags in the same HTML output as
surrounding prose, in KaTeX-compatible LaTeX. No separate LaTeX OCR pass.
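Since equations arrive inside <math>...</math> tags, pulling the LaTeX out of a block's HTML can be done with a small regular expression. This is a sketch on an invented HTML string; `extract_math` is a hypothetical helper, and the unescape step assumes the LaTeX may contain HTML entities such as `&lt;`.

```python
import re
from html import unescape

# Non-greedy match so adjacent equations are captured separately.
MATH_RE = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL)


def extract_math(markup: str) -> list:
    """Return the LaTeX source of every <math>...</math> span, in order."""
    return [unescape(body).strip() for body in MATH_RE.findall(markup)]


# Invented example of block HTML with two inline equations.
block_html = "<p>Energy is <math>E = mc^2</math>, and <math>a^2 + b^2 = c^2</math>.</p>"
print(extract_math(block_html))
```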
Inference Backends
Layout / OCR / table_rec all share one VLM, served either by vllm (GPU) or llama.cpp (CPU / Apple Silicon). The SuryaInferenceManager will spawn one automatically; you can also point at a pre-running server:
# Attach to an existing vllm
export SURYA_INFERENCE_BACKEND=vllm
export SURYA_INFERENCE_URL=http://localhost:8000/v1
| Setting | Default | Notes |
|---|---|---|
| SURYA_INFERENCE_BACKEND | auto (vllm if NVIDIA, else llamacpp) | vllm, llamacpp, or unset (auto) |
| SURYA_INFERENCE_URL | (auto-spawn) | Attach to a running OpenAI-compatible server |
| SURYA_INFERENCE_PARALLEL | 8 | Client-side concurrency to the backend |
| SURYA_GUIDED_LAYOUT | true | JSON-schema-constrained layout decode |
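If you launch the CLI from a script, these settings can be collected into an environment dict first. The sketch below only builds the dict; `surya_env` is a hypothetical helper, and it assumes the settings are read from the environment of the process that runs Surya.

```python
import os


def surya_env(backend=None, url=None, parallel=None, base=None) -> dict:
    """Copy an environment and set the Surya inference variables on top of it."""
    env = dict(os.environ if base is None else base)
    if backend is not None:
        if backend not in ("vllm", "llamacpp"):
            raise ValueError(f"unknown backend: {backend!r}")
        env["SURYA_INFERENCE_BACKEND"] = backend
    if url is not None:
        env["SURYA_INFERENCE_URL"] = url
    if parallel is not None:
        # Environment values must be strings.
        env["SURYA_INFERENCE_PARALLEL"] = str(parallel)
    return env


env = surya_env(backend="vllm", url="http://localhost:8000/v1", parallel=16, base={})
print(env)
```

The result can be passed to something like `subprocess.run(["surya_ocr", "scan.pdf"], env=env, check=True)`, where `scan.pdf` is a placeholder input path.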
Limitations
- This is specialized for document OCR. Performance on photos or natural scenes is not the goal.
- Layout / OCR / table_rec all need a running inference backend (vllm or llama.cpp). Detection runs purely on torch and works without it.
Troubleshooting
If OCR isn't working properly:
- Try increasing resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than a 2048px width.
- Preprocessing the image (binarizing, deskewing, etc) can help with very old/blurry images.
- You can adjust DETECTOR_BLANK_THRESHOLD and DETECTOR_TEXT_THRESHOLD if you don't get good results. DETECTOR_BLANK_THRESHOLD controls the space between lines - any prediction below this number will be considered blank space. DETECTOR_TEXT_THRESHOLD controls how text is joined - any number above this is considered text. DETECTOR_TEXT_THRESHOLD should always be higher than DETECTOR_BLANK_THRESHOLD, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).
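Two of the tips above reduce to simple arithmetic: capping the width at 2048px while keeping the aspect ratio, and keeping the two thresholds ordered within 0-1. The helpers below are hypothetical and only sketch those checks; the threshold values in the example are arbitrary, not Surya's defaults.

```python
def target_size(width: int, height: int, max_width: int = 2048) -> tuple:
    """Return (width, height) scaled down so width <= max_width; never upscales."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def check_thresholds(blank: float, text: float) -> None:
    """Raise ValueError unless 0 <= blank < text <= 1."""
    if not (0 <= blank < text <= 1):
        raise ValueError(f"need 0 <= blank < text <= 1, got blank={blank}, text={text}")


print(target_size(4096, 3000))
check_thresholds(blank=0.35, text=0.6)  # arbitrary example values
```

With Pillow, the computed size can be passed to `image.resize(target_size(*image.size))`.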
Manual install
If you want to develop surya, you can install it manually with uv:
git clone https://github.com/VikParuchuri/surya.git
cd surya
uv sync --group dev # installs runtime + dev deps
uv run surya_ocr ... # or `uv shell` to enter the venv
Benchmarks
Surya 2 is a single VLM that handles layout analysis, OCR (full-page or per-block), and table recognition in one model. We evaluate end-to-end on olmOCR-bench — the standard quality benchmark for document parsers.
olmOCR-bench
| Model | Params | Score |
|---|---|---|
| Infinity-Parser2-Pro | 35.1B | 87.6 |
| Chandra OCR 2 (Datalab) | 5.3B | 85.9 |
| dots.mocr | 3.0B | 83.9 |
| LightOnOCR 2-1B * | 1.0B | 83.2 |
| Surya OCR 2 (Datalab) | 0.69B | 83.1 |
| Chandra OCR 1 (Datalab) | 9.0B | 83.1 |
| olmOCR (anchored) | 8.3B | 77.4 |
| GOT OCR | 0.6B | 48.3 |
* LightOnOCR 2-1B uses a different benchmark methodology than the other entries (see their release notes); the score is included for context but is not directly comparable.
Comparison scores from the olmOCR-bench dataset card. Surya OCR 2 is reported as 0.69B params — the on-disk safetensors duplicates the tied embedding + lm_head, so HuggingFace shows ~0.75B; the underlying parameter count is 0.69B.
Surya 2, per-source pass rate on the default preset (8,413 tests total):
| ArXiv | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
|---|---|---|---|---|---|---|---|
| 88.7 | 99.9 | 92.1 | 86.4 | 82.6 | 42.8 | 85.8 | 86.6 |
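As a quick sanity check, the unweighted mean of the eight per-source pass rates can be compared with the overall score in the table. This assumes the headline number is a plain average across sources, which is how the figures here line up.

```python
from statistics import mean

# Per-source pass rates for Surya 2, copied from the table above.
rates = {
    "ArXiv": 88.7, "Base": 99.9, "Hdr/Ftr": 92.1, "TinyTxt": 86.4,
    "MultCol": 82.6, "OldScan": 42.8, "OldMath": 85.8, "Tables": 86.6,
}

overall = mean(rates.values())
weakest = min(rates, key=rates.get)

print(round(overall, 1), weakest)
```

The mean rounds to the reported 83.1, and the weakest source by a wide margin is OldScan.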
Throughput
Full-page OCR, 96 DPI input (~3,000 output tokens/page average), measured client-side against a running inference server.
B200 (vllm)
vllm/vllm-openai:v0.20.1, single B200, --max-num-seqs 512 /
--max-num-batched-tokens 16384. Each row processes at least 20× as many pages as its
concurrency level, so steady state dominates. Prefix caching off.
| Concurrency | Pages | Wall (s) | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|---|
| 64 | 1,280 | 270 | 4.74 | 14,058 | 13,806 | 18,656 | 19,217 |
| 128 | 2,560 | 449 | 5.71 | 16,948 | 22,820 | 29,501 | 30,897 |
| 256 | 5,120 | 882 | 5.81 | 17,186 | 44,253 | 51,119 | 52,333 |
| 512 | 10,240 | 1,761 | 5.81 | 17,211 | 87,961 | 94,941 | 96,185 |
Throughput saturates at conc=256 (5.81 pages/s ≈ 17.2k tokens/s). Going to 512 doesn't add capacity — it just queues, doubling latency. For production batch jobs the sweet spot is conc=128–256.
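The queueing behaviour can be checked against the table with Little's law: at steady state, average latency is roughly concurrency divided by throughput. The sketch below recomputes pages/s from the Pages and Wall columns and compares the predicted latency with the measured p50 (a median, so only approximate agreement is expected).

```python
# (concurrency, pages, wall seconds, measured p50 in ms), copied from the table above.
rows = [
    (64, 1280, 270, 13806),
    (128, 2560, 449, 22820),
    (256, 5120, 882, 44253),
    (512, 10240, 1761, 87961),
]


def pages_per_second(pages: int, wall_s: float) -> float:
    """Throughput over the whole run."""
    return pages / wall_s


def expected_latency_s(concurrency: int, throughput: float) -> float:
    """Little's law: requests in flight = throughput * latency."""
    return concurrency / throughput


for concurrency, pages, wall_s, p50_ms in rows:
    throughput = pages_per_second(pages, wall_s)
    predicted = expected_latency_s(concurrency, throughput)
    print(f"conc={concurrency}: {throughput:.2f} pages/s, "
          f"predicted {predicted:.1f}s vs p50 {p50_ms / 1000:.1f}s")
```

Once throughput stops growing, doubling concurrency from 256 to 512 doubles the predicted latency, which matches the measured jump from about 44s to about 88s.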
Apple Silicon (llama.cpp / Metal)
llama-server with Metal backend, one process per --parallel level.
| --parallel | Pages/s | Tokens/s | p50 (ms) | p95 (ms) |
|---|---|---|---|---|
| 4 | 0.217 | 222 | 18,122 | 19,676 |
| 8 | 0.269 | 270 | 27,549 | 42,730 |
| 16 | 0.264 | 286 | 53,166 | 70,378 |
Throughput knees at --parallel=8: Metal is decode-saturated past that point.
Reproducing
We score Surya 2 on olmOCR-bench by serving the model with vllm (or
llama.cpp) and running the official olmOCR-bench harness from
allenai/olmocr. Use the
HIGH_ACCURACY_BBOX_PROMPT prompt (single full-page call per page); the
RecognitionPredictor defaults to that mode.
Training
Layout, OCR, and table recognition all share a single vision-language model (Qwen3.5-style architecture, ~770M params). It's trained on diverse document images to emit either a layout JSON or a full-page HTML output, depending on prompt. Text-line detection is a separate small torch model — a modified EfficientViT segformer trained from scratch on document line annotations.
If you want help finetuning Surya on your own data, or to use our managed training stack, reach us at hi@datalab.to.
Thanks
This work would not have been possible without amazing open source AI work:
- Qwen3-VL from Alibaba (architecture basis for the Surya 2 VLM)
- vllm and llama.cpp for inference
- Segformer from NVIDIA
- EfficientViT from MIT
- timm from Ross Wightman
- transformers from huggingface
- CRAFT, a great scene text detection model
Thank you to everyone who makes open source AI possible.
Citation
If you use surya (or the associated models) in your work or research, please consider citing us using the following BibTeX entry:
@misc{paruchuri2025surya,
author = {Vikas Paruchuri and Datalab Team},
title = {Surya: A lightweight document OCR and analysis toolkit},
year = {2025},
howpublished = {\url{https://github.com/VikParuchuri/surya}},
note = {GitHub repository},
}





