mirror of https://github.com/VikParuchuri/surya.git (synced 2026-06-04 21:03:53 +08:00)
Cleanups
parent 8e1d94ff7e
commit 9158557600
20
README.md
@@ -46,13 +46,9 @@ Commercial self-hosting of the model weights requires a license — see [Commerc
 |:----------------------------------------------------------------:|:-----------------------------------------------------------------------:|
 | <img src="static/images/excerpt.png" width="280"/> | <img src="static/images/excerpt_text.png" width="280"/> |
 
-| Layout | Reading Order |
-|:------------------------------------------------------------------:|:--------------------------------------------------------------------------:|
-| <img src="static/images/excerpt_layout.png" width="280"/> | <img src="static/images/excerpt_reading.png" width="280"/> |
+| Layout | Table Recognition |
+|:------------------------------------------------------------------:|:-------------------------------------------------------------:|
+| <img src="static/images/excerpt_layout.png" width="280"/> | <img src="static/images/scanned_tablerec.png" width="280"/> |
 
-| Table Recognition | Math / Equations |
-|:-------------------------------------------------------------:|:------------------------------------------------------------:|
-| <img src="static/images/scanned_tablerec.png" width="280"/> | <img src="static/images/latex_ocr.png" width="280"/> |
-
 
 Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
@@ -76,8 +72,6 @@ The Surya code is licensed under Apache 2.0. The model weights use a modified AI
 
 # Installation
 
-You'll need python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
-
 Install with:
 
 ```shell
@@ -377,7 +371,7 @@ standard quality benchmark for document parsers.
 
 \* **LightOnOCR 2-1B** uses a different benchmark methodology than the other entries (see their [release notes](https://huggingface.co/lightonai/LightOnOCR-2-1B)); the score is included for context but is not directly comparable.
 
-Comparison scores from the [olmOCR-bench dataset card](https://huggingface.co/datasets/allenai/olmOCR-bench). Surya OCR 2 is reported as 0.69B params — the on-disk safetensors duplicates the tied embedding + lm_head, so HuggingFace shows ~0.75B; the underlying parameter count is 0.69B.
+Comparison scores from the [olmOCR-bench dataset card](https://huggingface.co/datasets/allenai/olmOCR-bench).
 
 Surya 2, per-source pass rate on the `default` preset (8,413 tests total):
 
@@ -430,7 +424,7 @@ RecognitionPredictor defaults to that mode.
 # Training
 
 Layout, OCR, and table recognition all share a single vision-language model
-(Qwen3.5-style architecture, ~770M params). It's trained on diverse document
+(Qwen3.5-style architecture, ~690M params). It's trained on diverse document
 images to emit either a layout JSON or a full-page HTML output, depending on
 prompt. Text-line detection is a separate small torch model — a modified
 EfficientViT segformer trained from scratch on document line annotations.
@@ -442,7 +436,7 @@ training stack, reach us at hi@datalab.to.
 
 This work would not have been possible without amazing open source AI work:
 
-- [Qwen3-VL](https://huggingface.co/Qwen) from Alibaba (architecture basis for the Surya 2 VLM)
+- [Qwen3-VL](https://huggingface.co/Qwen) from Alibaba
 - [vllm](https://github.com/vllm-project/vllm) and [llama.cpp](https://github.com/ggerganov/llama.cpp) for inference
 - [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
 - [EfficientViT](https://github.com/mit-han-lab/efficientvit) from MIT
@@ -461,6 +455,6 @@ If you use surya (or the associated models) in your work or research, please con
 author = {Vikas Paruchuri and Datalab Team},
 title = {Surya: A lightweight document OCR and analysis toolkit},
 year = {2025},
-howpublished = {\url{https://github.com/VikParuchuri/surya}},
+howpublished = {\url{https://github.com/datalab-to/surya}},
 note = {GitHub repository},
 }
@@ -31,6 +31,9 @@ BASELINE_MAX_BATCHED_TOKENS = 8192
 BASELINE_MAX_NUM_SEQS = 32
 
 GPU_VRAM_GB = {
+    "b300": 270,
+    "b200": 180,
+    "h200": 141,
     "h100": 80,
     "a100-80": 80,
     "a100": 40,
@@ -6,14 +6,3 @@ def test_detection(detection_predictor, test_image):
 
     bboxes = detection_results[0].bboxes
     assert len(bboxes) == 4
-
-
-def test_detection_chunking(detection_predictor, test_image_tall):
-    detection_results = detection_predictor([test_image_tall])
-
-    assert len(detection_results) == 1
-    assert detection_results[0].image_bbox == [0, 0, 4096, 4096]
-
-    bboxes = detection_results[0].bboxes
-    assert len(bboxes) >= 3 # Sometimes merges into 3
-    assert abs(4000 - bboxes[1].polygon[0][0]) < 50