Add streamlit app

Vik Paruchuri 2024-02-09 12:44:01 -08:00
parent 9d3e9063e0
commit 272619af3e
39 changed files with 561 additions and 88 deletions

1
.gitignore vendored

@ -8,6 +8,7 @@ wandb
notebooks
results
data
slices
# Byte-compiled / optimized / DLL files
__pycache__/


@ -25,14 +25,16 @@ Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who
| Name | Text Detection | OCR |
|------------------|:-----------------------------------:|-----------------------------------------:|
| Japanese | [Image](static/images/japanese.jpg) | [Image](static/images/japanese_text.jpg) |
| Chinese | [Image](static/images/chinese.jpg) | [Image](static/images/chinese_text.jpg) |
| Hindi | [Image](static/images/hindi.jpg) | [Image](static/images/hindi_text.jpg) |
| Arabic | [Image](static/images/arabic.jpg) | [Image](static/images/arabic_text.jpg) |
| Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.jpg) |
| Scientific Paper | [Image](static/images/paper.jpg) | [Image](static/images/paper_text.jpg) |
| Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.jpg) |
| New York Times | [Image](static/images/nyt.png) | [Image](static/images/nyt_text.png) |
| Japanese | [Image](static/images/japanese.png) | [Image](static/images/japanese_text.png) |
| Chinese | [Image](static/images/chinese.png) | [Image](static/images/chinese_text.png) |
| Hindi | [Image](static/images/hindi.png) | [Image](static/images/hindi_text.png) |
| Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.png) |
| Scientific Paper | [Image](static/images/paper.png) | [Image](static/images/paper_text.png) |
| Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.png) |
| Scanned Form | [Image](static/images/funsd.png) | |
| Scanned Form | [Image](static/images/funsd.png) | [Image](static/images/funsd_text.jpg) |
| Textbook | [Image](static/images/textbook.jpg) | [Image](static/images/textbook_text.jpg) |
# Installation
@ -51,6 +53,15 @@ Model weights will automatically download the first time you run surya. Note th
- Inspect the settings in `surya/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. For text detection, the `mps` device has a bug (on the [Apple side](https://github.com/pytorch/pytorch/issues/84936)) that may prevent it from working properly.
## Interactive App
I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:
```
pip install streamlit
surya_gui
```
## OCR (text recognition)
You can detect text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
@ -78,10 +89,7 @@ The `results.json` file will contain these keys for each page of the input docum
**Performance tips**
Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default is a batch size `256`, which will use about 10GB of VRAM.
Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.
Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default is a batch size `256`, which will use about 10GB of VRAM. Depending on your CPU core count, it may help, too - the default CPU batch size is `32`.
### From python
@ -94,20 +102,15 @@ from surya.model.recognition.processor import load_processor as load_rec_process
image = Image.open(IMAGE_PATH)
langs = ["en"] # Replace with your languages
det_processor = load_det_processor()
det_model = load_det_model()
rec_model = load_rec_model()
rec_processor = load_rec_processor()
det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()
predictions = run_ocr([image], langs, det_model, det_processor, rec_model, rec_processor)
```
## Text line detection
You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes, and optionally save images of the pages with the bboxes.
You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes.
```
surya_detect DATA_PATH --images
@ -128,12 +131,7 @@ The `results.json` file will contain these keys for each page of the input docum
**Performance tips**
Setting the `DETECTOR_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `280MB` of VRAM, so very high batch sizes are possible. The default is a batch size `32`, which will use about 9GB of VRAM.
Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference there too - the default CPU batch size is `2`.
You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.
Setting the `DETECTOR_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `280MB` of VRAM, so very high batch sizes are possible. The default is a batch size `32`, which will use about 9GB of VRAM. Depending on your CPU core count, it might help, too - the default CPU batch size is `2`.
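The VRAM figures in the performance tips follow from multiplying the batch size by the per-item cost. A rough sketch of that arithmetic, using only the per-item numbers the README states (`40MB` for recognition, `280MB` for detection). The helper names are hypothetical and not part of surya, and real usage will also include model weights and framework overhead:

```python
# Hypothetical helpers, not part of surya. Per-item costs are the README's figures.
RECOGNITION_MB_PER_ITEM = 40
DETECTOR_MB_PER_ITEM = 280


def estimated_vram_gb(batch_size, mb_per_item):
    """Approximate per-batch memory in GB (taking 1 GB = 1000 MB)."""
    return batch_size * mb_per_item / 1000


def max_batch_size(vram_budget_gb, mb_per_item):
    """Largest batch size whose estimated cost fits inside the budget."""
    return int(vram_budget_gb * 1000 // mb_per_item)


print(estimated_vram_gb(256, RECOGNITION_MB_PER_ITEM))  # 10.24, i.e. "about 10GB"
print(estimated_vram_gb(32, DETECTOR_MB_PER_ITEM))      # 8.96, i.e. "about 9GB"
print(max_batch_size(8, DETECTOR_MB_PER_ITEM))          # 28
```

Under these assumptions, a card with roughly 8GB to spare for detection would suggest something like `DETECTOR_BATCH_SIZE=28`; treat that as a starting point to tune, not a guarantee.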
### From python
@ -149,9 +147,20 @@ model, processor = load_model(), load_processor()
predictions = batch_detection([image], model, processor)
```
## Table and chart detection
# Limitations
- This is specialized for document OCR. It will likely not work on photos or other images.
- It is for printed text, not handwriting (though it may work on some handwriting).
- The model has trained itself to ignore advertisements.
- You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.
## Troubleshooting
If OCR isn't working properly:
- If the lines aren't detected properly, try increasing resolution of the image if the width is below `896px`, and vice versa. Very high width images don't work well with the detector.
- You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines - any prediction below this number will be considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined - any number above this is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).
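The threshold rules in the troubleshooting notes can be restated as a small check. This is only an illustration of the constraints as described (both values in the 0-1 range, text threshold above blank threshold); the function names and the example values `0.35` and `0.6` are assumptions, and this is not surya's actual implementation:

```python
# Hypothetical illustration of the documented rules, not surya's code.

def validate_thresholds(blank_threshold, text_threshold):
    """Raise ValueError if the pair violates the constraints stated in the README."""
    for name, value in (("blank", blank_threshold), ("text", text_threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} threshold must be in the 0-1 range, got {value}")
    if text_threshold <= blank_threshold:
        raise ValueError("text threshold should be higher than blank threshold")


def classify(score, blank_threshold, text_threshold):
    """Label one heatmap score under the two thresholds."""
    if score < blank_threshold:
        return "blank"
    if score > text_threshold:
        return "text"
    return "in-between"  # the README does not say how these scores are treated


# Example values only; check surya/settings.py for the real defaults.
validate_thresholds(0.35, 0.6)
print([classify(s, 0.35, 0.6) for s in (0.1, 0.5, 0.9)])
# ['blank', 'in-between', 'text']
```

Lowering both numbers moves more scores out of the "blank" bucket, which matches the advice to lower the thresholds when faint boxes are being missed.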
Coming soon.
# Manual install
@ -162,13 +171,6 @@ If you want to develop surya, you can install it manually:
- `poetry install` - installs main and dev dependencies
- `poetry shell` - activates the virtual environment
# Limitations
- This is specialized for document OCR. It will likely not work on photos or other images.
- It is for printed text, not handwriting (though it may work on some handwriting).
- The model has trained itself to ignore advertisements.
- You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.
# Benchmarks
## OCR


@ -1,38 +0,0 @@
import gradio as gr
from surya.detection import batch_detection
from surya.model.detection.segformer import load_model, load_processor
from surya.postprocessing.heatmap import draw_polys_on_image
model, processor = load_model(), load_processor()
HEADER = """
# Surya OCR Demo
This demo will let you try surya, a multilingual OCR model. It supports text detection now, but will support text recognition in the future.
Notes:
- This works best on documents with printed text.
- Set DETECTOR_MODEL_CHECKPOINT=vikp/line_detector_math before running this app if you want better math detection.
Learn more [here](https://github.com/VikParuchuri/surya).
""".strip()
def text_detection(img):
    preds = batch_detection([img], model, processor)[0]
    img = draw_polys_on_image(preds["polygons"], img)
    return img, preds

with gr.Blocks() as app:
    gr.Markdown(HEADER)
    with gr.Row():
        input_image = gr.Image(label="Input Image", type="pil")
        output_image = gr.Image(label="Output Image", type="pil", interactive=False)
    text_detection_btn = gr.Button("Run Text Detection")
    json_output = gr.JSON(label="JSON Output")
    text_detection_btn.click(fn=text_detection, inputs=input_image, outputs=[output_image, json_output], api_name="text_detection")

if __name__ == "__main__":
    app.launch()

123
ocr_app.py Normal file

@ -0,0 +1,123 @@
import io
import pypdfium2
import streamlit as st
from surya.detection import batch_detection
from surya.model.detection.segformer import load_model, load_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor
from surya.postprocessing.heatmap import draw_polys_on_image
from surya.ocr import run_ocr
from surya.postprocessing.text import draw_text_on_image
from PIL import Image
from surya.languages import CODE_TO_LANGUAGE
from surya.input.langs import replace_lang_with_code
@st.cache_resource()
def load_det_cached():
    return load_model(), load_processor()

@st.cache_resource()
def load_rec_cached():
    return load_rec_model(), load_rec_processor()

def text_detection(img):
    preds = batch_detection([img], det_model, det_processor)[0]
    det_img = draw_polys_on_image(preds["polygons"], img.copy())
    return det_img, preds

# Function for OCR
def ocr(img, langs):
    replace_lang_with_code(langs)
    pred = run_ocr([img], [langs], det_model, det_processor, rec_model, rec_processor)[0]
    rec_img = draw_text_on_image(pred["bboxes"], pred["text_lines"], img.size)
    return rec_img, pred

def open_pdf(pdf_file):
    stream = io.BytesIO(pdf_file.getvalue())
    return pypdfium2.PdfDocument(stream)

@st.cache_data()
def get_page_image(pdf_file, page_num, dpi=96):
    doc = open_pdf(pdf_file)
    renderer = doc.render(
        pypdfium2.PdfBitmap.to_pil,
        page_indices=[page_num - 1],
        scale=dpi / 72,
    )
    png = list(renderer)[0]
    png_image = png.convert("RGB")
    return png_image

@st.cache_data()
def page_count(pdf_file):
    doc = open_pdf(pdf_file)
    return len(doc)
st.set_page_config(layout="wide")
col1, col2 = st.columns([.5, .5])
det_model, det_processor = load_det_cached()
rec_model, rec_processor = load_rec_cached()
st.markdown("""
# Surya OCR Demo
This app will let you try surya, a multilingual OCR model. It supports text detection in any language, and text recognition in 90+ languages.
Notes:
- This works best on documents with printed text.
- Try to keep the image width around 1024, especially if you have large text.
- This supports 90+ languages, see [here](https://github.com/VikParuchuri/surya/tree/master/surya/languages.py) for a full list of codes.
Find the project [here](https://github.com/VikParuchuri/surya).
""")
in_file = st.sidebar.file_uploader("PDF file or image:", type=["pdf", "png", "jpg", "jpeg", "gif", "webp"])
languages = st.sidebar.multiselect("Languages", sorted(list(CODE_TO_LANGUAGE.values())), default=["English"], max_selections=4)
if in_file is None:
    st.stop()

filetype = in_file.type
whole_image = False
if "pdf" in filetype:
    page_count = page_count(in_file)
    page_number = st.sidebar.number_input(f"Page number out of {page_count}:", min_value=1, value=1, max_value=page_count)
    pil_image = get_page_image(in_file, page_number)
else:
    pil_image = Image.open(in_file).convert("RGB")

text_det = st.sidebar.button("Run Text Detection")
text_rec = st.sidebar.button("Run OCR")

# Run Text Detection
if text_det and pil_image is not None:
    det_img, preds = text_detection(pil_image)
    with col1:
        st.image(det_img, caption="Detected Text", use_column_width=True)
        st.json(preds, expanded=True)

# Run OCR
if text_rec and pil_image is not None:
    rec_img, pred = ocr(pil_image, languages)
    with col1:
        st.image(rec_img, caption="OCR Result", use_column_width=True)
        json_tab, text_tab = st.tabs(["JSON", "Full Text"])
        with json_tab:
            st.json(pred, expanded=True)
        with text_tab:
            st.text("\n".join(pred["text_lines"]))

with col2:
    st.image(pil_image, caption="Uploaded Image", use_column_width=True)
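In `get_page_image`, the `scale=dpi / 72` argument converts from PDF points (1/72 inch) to pixels at the requested DPI. A small sketch of the resulting page size, assuming a hypothetical helper `rendered_size` that is not part of the app; the renderer's own rounding may differ by a pixel:

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_size(width_pt, height_pt, dpi=96):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))           # (816, 1056) at the default 96 DPI
print(rendered_size(612, 792, dpi=144))  # (1224, 1584)
```

With the default of 96 DPI, a Letter-sized page comes out a little under the "around 1024" width the app's notes suggest, so the `dpi` parameter is the knob to adjust if detection quality is poor on rendered PDF pages.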

353
poetry.lock generated

@ -110,6 +110,30 @@ files = [
[package.dependencies]
frozenlist = ">=1.1.0"
[[package]]
name = "altair"
version = "5.2.0"
description = "Vega-Altair: A declarative statistical visualization library for Python."
optional = false
python-versions = ">=3.8"
files = [
{file = "altair-5.2.0-py3-none-any.whl", hash = "sha256:8c4888ad11db7c39f3f17aa7f4ea985775da389d79ac30a6c22856ab238df399"},
{file = "altair-5.2.0.tar.gz", hash = "sha256:2ad7f0c8010ebbc46319cc30febfb8e59ccf84969a201541c207bc3a4fa6cf81"},
]
[package.dependencies]
jinja2 = "*"
jsonschema = ">=3.0"
numpy = "*"
packaging = "*"
pandas = ">=0.25"
toolz = "*"
typing-extensions = {version = ">=4.0.1", markers = "python_version < \"3.11\""}
[package.extras]
dev = ["anywidget", "geopandas", "hatch", "ipython", "m2r", "mypy", "pandas-stubs", "pyarrow (>=11)", "pytest", "pytest-cov", "ruff (>=0.1.3)", "types-jsonschema", "types-setuptools", "vega-datasets", "vegafusion[embed] (>=1.4.0)", "vl-convert-python (>=1.1.0)"]
doc = ["docutils", "jinja2", "myst-parser", "numpydoc", "pillow (>=9,<10)", "pydata-sphinx-theme (>=0.14.1)", "scipy", "sphinx", "sphinx-copybutton", "sphinx-design", "sphinxext-altair"]
[[package]]
name = "annotated-types"
version = "0.6.0"
@ -359,6 +383,28 @@ webencodings = "*"
[package.extras]
css = ["tinycss2 (>=1.1.0,<1.3)"]
[[package]]
name = "blinker"
version = "1.7.0"
description = "Fast, simple object-to-object and broadcast signaling"
optional = false
python-versions = ">=3.8"
files = [
{file = "blinker-1.7.0-py3-none-any.whl", hash = "sha256:c3f865d4d54db7abc53758a01601cf343fe55b84c1de4e3fa910e420b438d5b9"},
{file = "blinker-1.7.0.tar.gz", hash = "sha256:e6820ff6fa4e4d1d8e2747c2283749c3f547e4fee112b98555cdcdae32996182"},
]
[[package]]
name = "cachetools"
version = "5.3.2"
description = "Extensible memoizing collections and decorators"
optional = false
python-versions = ">=3.7"
files = [
{file = "cachetools-5.3.2-py3-none-any.whl", hash = "sha256:861f35a13a451f94e301ce2bec7cac63e881232ccce7ed67fab9b5df4d3beaa1"},
{file = "cachetools-5.3.2.tar.gz", hash = "sha256:086ee420196f7b2ab9ca2db2520aca326318b68fe5ba8bc4d49cca91add450f2"},
]
[[package]]
name = "certifi"
version = "2024.2.2"
@ -533,6 +579,20 @@ files = [
{file = "charset_normalizer-3.3.2-py3-none-any.whl", hash = "sha256:3e4d1f6587322d2788836a99c69062fbb091331ec940e02d12d179c1d53e25fc"},
]
[[package]]
name = "click"
version = "8.1.7"
description = "Composable command line interface toolkit"
optional = false
python-versions = ">=3.7"
files = [
{file = "click-8.1.7-py3-none-any.whl", hash = "sha256:ae74fb96c20a0277a1d615f1e4d73c8414f5a98db8b799a7931d1582f3390c28"},
{file = "click-8.1.7.tar.gz", hash = "sha256:ca9853ad459e787e2192211578cc907e7594e294c7ccc834310722b41b9ca6de"},
]
[package.dependencies]
colorama = {version = "*", markers = "platform_system == \"Windows\""}
[[package]]
name = "colorama"
version = "0.4.6"
@ -873,6 +933,37 @@ smb = ["smbprotocol"]
ssh = ["paramiko"]
tqdm = ["tqdm"]
[[package]]
name = "gitdb"
version = "4.0.11"
description = "Git Object Database"
optional = false
python-versions = ">=3.7"
files = [
{file = "gitdb-4.0.11-py3-none-any.whl", hash = "sha256:81a3407ddd2ee8df444cbacea00e2d038e40150acfa3001696fe0dcf1d3adfa4"},
{file = "gitdb-4.0.11.tar.gz", hash = "sha256:bf5421126136d6d0af55bc1e7c1af1c397a34f5b7bd79e776cd3e89785c2b04b"},
]
[package.dependencies]
smmap = ">=3.0.1,<6"
[[package]]
name = "gitpython"
version = "3.1.41"
description = "GitPython is a Python library used to interact with Git repositories"
optional = false
python-versions = ">=3.7"
files = [
{file = "GitPython-3.1.41-py3-none-any.whl", hash = "sha256:c36b6634d069b3f719610175020a9aed919421c87552185b085e04fbbdb10b7c"},
{file = "GitPython-3.1.41.tar.gz", hash = "sha256:ed66e624884f76df22c8e16066d567aaa5a37d5b5fa19db2c6df6f7156db9048"},
]
[package.dependencies]
gitdb = ">=4.0.1,<5"
[package.extras]
test = ["black", "coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock", "mypy", "pre-commit", "pytest (>=7.3.1)", "pytest-cov", "pytest-instafail", "pytest-mock", "pytest-sugar", "sumtypes"]
[[package]]
name = "huggingface-hub"
version = "0.20.3"
@ -1406,6 +1497,30 @@ files = [
{file = "jupyterlab_widgets-3.0.9.tar.gz", hash = "sha256:6005a4e974c7beee84060fdfba341a3218495046de8ae3ec64888e5fe19fdb4c"},
]
[[package]]
name = "markdown-it-py"
version = "3.0.0"
description = "Python port of markdown-it. Markdown parsing, done right!"
optional = false
python-versions = ">=3.8"
files = [
{file = "markdown-it-py-3.0.0.tar.gz", hash = "sha256:e3f60a94fa066dc52ec76661e37c851cb232d92f9886b15cb560aaada2df8feb"},
{file = "markdown_it_py-3.0.0-py3-none-any.whl", hash = "sha256:355216845c60bd96232cd8d8c40e8f9765cc86f46880e43a8fd22dc1a1a8cab1"},
]
[package.dependencies]
mdurl = ">=0.1,<1.0"
[package.extras]
benchmarking = ["psutil", "pytest", "pytest-benchmark"]
code-style = ["pre-commit (>=3.0,<4.0)"]
compare = ["commonmark (>=0.9,<1.0)", "markdown (>=3.4,<4.0)", "mistletoe (>=1.0,<2.0)", "mistune (>=2.0,<3.0)", "panflute (>=2.3,<3.0)"]
linkify = ["linkify-it-py (>=1,<3)"]
plugins = ["mdit-py-plugins"]
profiling = ["gprof2dot"]
rtd = ["jupyter_sphinx", "mdit-py-plugins", "myst-parser", "pyyaml", "sphinx", "sphinx-copybutton", "sphinx-design", "sphinx_book_theme"]
testing = ["coverage", "pytest", "pytest-cov", "pytest-regressions"]
[[package]]
name = "markupsafe"
version = "2.1.5"
@ -1489,6 +1604,17 @@ files = [
[package.dependencies]
traitlets = "*"
[[package]]
name = "mdurl"
version = "0.1.2"
description = "Markdown URL utilities"
optional = false
python-versions = ">=3.7"
files = [
{file = "mdurl-0.1.2-py3-none-any.whl", hash = "sha256:84008a41e51615a49fc9966191ff91509e3c40b939176e643fd50a5c2196b8f8"},
{file = "mdurl-0.1.2.tar.gz", hash = "sha256:bb413d29f5eea38f31dd4754dd7377d4465116fb207585f97bf925588687c1ba"},
]
[[package]]
name = "mistune"
version = "3.0.2"
@ -2270,6 +2396,26 @@ files = [
[package.dependencies]
wcwidth = "*"
[[package]]
name = "protobuf"
version = "4.25.2"
description = ""
optional = false
python-versions = ">=3.8"
files = [
{file = "protobuf-4.25.2-cp310-abi3-win32.whl", hash = "sha256:b50c949608682b12efb0b2717f53256f03636af5f60ac0c1d900df6213910fd6"},
{file = "protobuf-4.25.2-cp310-abi3-win_amd64.whl", hash = "sha256:8f62574857ee1de9f770baf04dde4165e30b15ad97ba03ceac65f760ff018ac9"},
{file = "protobuf-4.25.2-cp37-abi3-macosx_10_9_universal2.whl", hash = "sha256:2db9f8fa64fbdcdc93767d3cf81e0f2aef176284071507e3ede160811502fd3d"},
{file = "protobuf-4.25.2-cp37-abi3-manylinux2014_aarch64.whl", hash = "sha256:10894a2885b7175d3984f2be8d9850712c57d5e7587a2410720af8be56cdaf62"},
{file = "protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl", hash = "sha256:fc381d1dd0516343f1440019cedf08a7405f791cd49eef4ae1ea06520bc1c020"},
{file = "protobuf-4.25.2-cp38-cp38-win32.whl", hash = "sha256:33a1aeef4b1927431d1be780e87b641e322b88d654203a9e9d93f218ee359e61"},
{file = "protobuf-4.25.2-cp38-cp38-win_amd64.whl", hash = "sha256:47f3de503fe7c1245f6f03bea7e8d3ec11c6c4a2ea9ef910e3221c8a15516d62"},
{file = "protobuf-4.25.2-cp39-cp39-win32.whl", hash = "sha256:5e5c933b4c30a988b52e0b7c02641760a5ba046edc5e43d3b94a74c9fc57c1b3"},
{file = "protobuf-4.25.2-cp39-cp39-win_amd64.whl", hash = "sha256:d66a769b8d687df9024f2985d5137a337f957a0916cf5464d1513eee96a63ff0"},
{file = "protobuf-4.25.2-py3-none-any.whl", hash = "sha256:a8b7a98d4ce823303145bf3c1a8bdb0f2f4642a414b196f04ad9853ed0c8f830"},
{file = "protobuf-4.25.2.tar.gz", hash = "sha256:fe599e175cb347efc8ee524bcd4b902d11f7262c0e569ececcb89995c15f0a5e"},
]
[[package]]
name = "psutil"
version = "5.9.8"
@ -2518,6 +2664,25 @@ files = [
pydantic = ">=2.3.0"
python-dotenv = ">=0.21.0"
[[package]]
name = "pydeck"
version = "0.8.0"
description = "Widget for deck.gl maps"
optional = false
python-versions = ">=3.7"
files = [
{file = "pydeck-0.8.0-py2.py3-none-any.whl", hash = "sha256:a8fa7757c6f24bba033af39db3147cb020eef44012ba7e60d954de187f9ed4d5"},
{file = "pydeck-0.8.0.tar.gz", hash = "sha256:07edde833f7cfcef6749124351195aa7dcd24663d4909fd7898dbd0b6fbc01ec"},
]
[package.dependencies]
jinja2 = ">=2.10.1"
numpy = ">=1.16.4"
[package.extras]
carto = ["pydeck-carto"]
jupyter = ["ipykernel (>=5.1.2)", "ipython (>=5.8.0)", "ipywidgets (>=7,<8)", "traitlets (>=4.3.2)"]
[[package]]
name = "pygments"
version = "2.17.2"
@ -3177,6 +3342,24 @@ files = [
{file = "rfc3986_validator-0.1.1.tar.gz", hash = "sha256:3d44bde7921b3b9ec3ae4e3adca370438eccebc676456449b145d533b240d055"},
]
[[package]]
name = "rich"
version = "13.7.0"
description = "Render rich text, tables, progress bars, syntax highlighting, markdown and more to the terminal"
optional = false
python-versions = ">=3.7.0"
files = [
{file = "rich-13.7.0-py3-none-any.whl", hash = "sha256:6da14c108c4866ee9520bbffa71f6fe3962e193b7da68720583850cd4548e235"},
{file = "rich-13.7.0.tar.gz", hash = "sha256:5cb5123b5cf9ee70584244246816e9114227e0b98ad9176eede6ad54bf5403fa"},
]
[package.dependencies]
markdown-it-py = ">=2.2.0"
pygments = ">=2.13.0,<3.0.0"
[package.extras]
jupyter = ["ipywidgets (>=7.5.1,<9)"]
[[package]]
name = "rpds-py"
version = "0.17.1"
@ -3444,6 +3627,17 @@ files = [
{file = "six-1.16.0.tar.gz", hash = "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926"},
]
[[package]]
name = "smmap"
version = "5.0.1"
description = "A pure Python implementation of a sliding window memory map manager"
optional = false
python-versions = ">=3.7"
files = [
{file = "smmap-5.0.1-py3-none-any.whl", hash = "sha256:e6d8668fa5f93e706934a62d7b4db19c8d9eb8cf2adbb75ef1b675aa332b69da"},
{file = "smmap-5.0.1.tar.gz", hash = "sha256:dceeb6c0028fdb6734471eb07c0cd2aae706ccaecab45965ee83f11c8d3b1f62"},
]
[[package]]
name = "snakeviz"
version = "2.2.0"
@ -3499,6 +3693,45 @@ pure-eval = "*"
[package.extras]
tests = ["cython", "littleutils", "pygments", "pytest", "typeguard"]
[[package]]
name = "streamlit"
version = "1.31.0"
description = "A faster way to build and share data apps"
optional = false
python-versions = ">=3.8, !=3.9.7"
files = [
{file = "streamlit-1.31.0-py2.py3-none-any.whl", hash = "sha256:4d95c4f5d6881f7adebaec14997fa7024bb38853412d1bba9588074d585563f9"},
{file = "streamlit-1.31.0.tar.gz", hash = "sha256:40d71944e30394612481f80a8bc09e7de40d33b7a472989807467a5299e342ca"},
]
[package.dependencies]
altair = ">=4.0,<6"
blinker = ">=1.0.0,<2"
cachetools = ">=4.0,<6"
click = ">=7.0,<9"
gitpython = ">=3.0.7,<3.1.19 || >3.1.19,<4"
importlib-metadata = ">=1.4,<8"
numpy = ">=1.19.3,<2"
packaging = ">=16.8,<24"
pandas = ">=1.3.0,<3"
pillow = ">=7.1.0,<11"
protobuf = ">=3.20,<5"
pyarrow = ">=7.0"
pydeck = ">=0.8.0b4,<1"
python-dateutil = ">=2.7.3,<3"
requests = ">=2.27,<3"
rich = ">=10.14.0,<14"
tenacity = ">=8.1.0,<9"
toml = ">=0.10.1,<2"
tornado = ">=6.0.3,<7"
typing-extensions = ">=4.3.0,<5"
tzlocal = ">=1.1,<6"
validators = ">=0.2,<1"
watchdog = {version = ">=2.1.5", markers = "platform_system != \"Darwin\""}
[package.extras]
snowflake = ["snowflake-connector-python (>=2.8.0)", "snowflake-snowpark-python (>=0.9.0)"]
[[package]]
name = "sympy"
version = "1.12"
@ -3527,6 +3760,20 @@ files = [
[package.extras]
widechars = ["wcwidth"]
[[package]]
name = "tenacity"
version = "8.2.3"
description = "Retry code until it succeeds"
optional = false
python-versions = ">=3.7"
files = [
{file = "tenacity-8.2.3-py3-none-any.whl", hash = "sha256:ce510e327a630c9e1beaf17d42e6ffacc88185044ad85cf74c0a8887c6a0f88c"},
{file = "tenacity-8.2.3.tar.gz", hash = "sha256:5398ef0d78e63f40007c1fb4c0bff96e1911394d2fa8d194f77619c05ff6cc8a"},
]
[package.extras]
doc = ["reno", "sphinx", "tornado (>=4.5)"]
[[package]]
name = "terminado"
version = "0.18.0"
@ -3693,6 +3940,17 @@ dev = ["tokenizers[testing]"]
docs = ["setuptools_rust", "sphinx", "sphinx_rtd_theme"]
testing = ["black (==22.3)", "datasets", "numpy", "pytest", "requests"]
[[package]]
name = "toml"
version = "0.10.2"
description = "Python Library for Tom's Obvious, Minimal Language"
optional = false
python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*"
files = [
{file = "toml-0.10.2-py2.py3-none-any.whl", hash = "sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b"},
{file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"},
]
[[package]]
name = "tomli"
version = "2.0.1"
@ -3704,6 +3962,17 @@ files = [
{file = "tomli-2.0.1.tar.gz", hash = "sha256:de526c12914f0c550d15924c62d72abc48d6fe7364aa87328337a31007fe8a4f"},
]
[[package]]
name = "toolz"
version = "0.12.1"
description = "List processing tools and functional utilities"
optional = false
python-versions = ">=3.7"
files = [
{file = "toolz-0.12.1-py3-none-any.whl", hash = "sha256:d22731364c07d72eea0a0ad45bafb2c2937ab6fd38a3507bf55eae8744aa7d85"},
{file = "toolz-0.12.1.tar.gz", hash = "sha256:ecca342664893f177a13dac0e6b41cbd8ac25a358e5f215316d43e2100224f4d"},
]
[[package]]
name = "torch"
version = "2.2.0"
@ -3941,6 +4210,23 @@ files = [
{file = "tzdata-2023.4.tar.gz", hash = "sha256:dd54c94f294765522c77399649b4fefd95522479a664a0cec87f41bebc6148c9"},
]
[[package]]
name = "tzlocal"
version = "5.2"
description = "tzinfo object for the local timezone"
optional = false
python-versions = ">=3.8"
files = [
{file = "tzlocal-5.2-py3-none-any.whl", hash = "sha256:49816ef2fe65ea8ac19d19aa7a1ae0551c834303d5014c6d5a62e4cbda8047b8"},
{file = "tzlocal-5.2.tar.gz", hash = "sha256:8d399205578f1a9342816409cc1e46a93ebd5755e39ea2d85334bea911bf0e6e"},
]
[package.dependencies]
tzdata = {version = "*", markers = "platform_system == \"Windows\""}
[package.extras]
devenv = ["check-manifest", "pytest (>=4.3)", "pytest-cov", "pytest-mock (>=3.3)", "zest.releaser"]
[[package]]
name = "uri-template"
version = "1.3.0"
@ -3972,6 +4258,69 @@ h2 = ["h2 (>=4,<5)"]
socks = ["pysocks (>=1.5.6,!=1.5.7,<2.0)"]
zstd = ["zstandard (>=0.18.0)"]
[[package]]
name = "validators"
version = "0.22.0"
description = "Python Data Validation for Humans™"
optional = false
python-versions = ">=3.8"
files = [
{file = "validators-0.22.0-py3-none-any.whl", hash = "sha256:61cf7d4a62bbae559f2e54aed3b000cea9ff3e2fdbe463f51179b92c58c9585a"},
{file = "validators-0.22.0.tar.gz", hash = "sha256:77b2689b172eeeb600d9605ab86194641670cdb73b60afd577142a9397873370"},
]
[package.extras]
docs-offline = ["myst-parser (>=2.0.0)", "pypandoc-binary (>=1.11)", "sphinx (>=7.1.1)"]
docs-online = ["mkdocs (>=1.5.2)", "mkdocs-git-revision-date-localized-plugin (>=1.2.0)", "mkdocs-material (>=9.2.6)", "mkdocstrings[python] (>=0.22.0)", "pyaml (>=23.7.0)"]
hooks = ["pre-commit (>=3.3.3)"]
package = ["build (>=1.0.0)", "twine (>=4.0.2)"]
runner = ["tox (>=4.11.1)"]
sast = ["bandit[toml] (>=1.7.5)"]
testing = ["pytest (>=7.4.0)"]
tooling = ["black (>=23.7.0)", "pyright (>=1.1.325)", "ruff (>=0.0.287)"]
tooling-extras = ["pyaml (>=23.7.0)", "pypandoc-binary (>=1.11)", "pytest (>=7.4.0)"]
[[package]]
name = "watchdog"
version = "4.0.0"
description = "Filesystem events monitoring"
optional = false
python-versions = ">=3.8"
files = [
{file = "watchdog-4.0.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:39cb34b1f1afbf23e9562501673e7146777efe95da24fab5707b88f7fb11649b"},
{file = "watchdog-4.0.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:c522392acc5e962bcac3b22b9592493ffd06d1fc5d755954e6be9f4990de932b"},
{file = "watchdog-4.0.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6c47bdd680009b11c9ac382163e05ca43baf4127954c5f6d0250e7d772d2b80c"},
{file = "watchdog-4.0.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:8350d4055505412a426b6ad8c521bc7d367d1637a762c70fdd93a3a0d595990b"},
{file = "watchdog-4.0.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:c17d98799f32e3f55f181f19dd2021d762eb38fdd381b4a748b9f5a36738e935"},
{file = "watchdog-4.0.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4986db5e8880b0e6b7cd52ba36255d4793bf5cdc95bd6264806c233173b1ec0b"},
{file = "watchdog-4.0.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:11e12fafb13372e18ca1bbf12d50f593e7280646687463dd47730fd4f4d5d257"},
{file = "watchdog-4.0.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:5369136a6474678e02426bd984466343924d1df8e2fd94a9b443cb7e3aa20d19"},
{file = "watchdog-4.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:76ad8484379695f3fe46228962017a7e1337e9acadafed67eb20aabb175df98b"},
{file = "watchdog-4.0.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:45cc09cc4c3b43fb10b59ef4d07318d9a3ecdbff03abd2e36e77b6dd9f9a5c85"},
{file = "watchdog-4.0.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:eed82cdf79cd7f0232e2fdc1ad05b06a5e102a43e331f7d041e5f0e0a34a51c4"},
{file = "watchdog-4.0.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:ba30a896166f0fee83183cec913298151b73164160d965af2e93a20bbd2ab605"},
{file = "watchdog-4.0.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:d18d7f18a47de6863cd480734613502904611730f8def45fc52a5d97503e5101"},
{file = "watchdog-4.0.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:2895bf0518361a9728773083908801a376743bcc37dfa252b801af8fd281b1ca"},
{file = "watchdog-4.0.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:87e9df830022488e235dd601478c15ad73a0389628588ba0b028cb74eb72fed8"},
{file = "watchdog-4.0.0-pp310-pypy310_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6e949a8a94186bced05b6508faa61b7adacc911115664ccb1923b9ad1f1ccf7b"},
{file = "watchdog-4.0.0-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6a4db54edea37d1058b08947c789a2354ee02972ed5d1e0dca9b0b820f4c7f92"},
{file = "watchdog-4.0.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:d31481ccf4694a8416b681544c23bd271f5a123162ab603c7d7d2dd7dd901a07"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8fec441f5adcf81dd240a5fe78e3d83767999771630b5ddfc5867827a34fa3d3"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_armv7l.whl", hash = "sha256:6a9c71a0b02985b4b0b6d14b875a6c86ddea2fdbebd0c9a720a806a8bbffc69f"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_i686.whl", hash = "sha256:557ba04c816d23ce98a06e70af6abaa0485f6d94994ec78a42b05d1c03dcbd50"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_ppc64.whl", hash = "sha256:d0f9bd1fd919134d459d8abf954f63886745f4660ef66480b9d753a7c9d40927"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:f9b2fdca47dc855516b2d66eef3c39f2672cbf7e7a42e7e67ad2cbfcd6ba107d"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_s390x.whl", hash = "sha256:73c7a935e62033bd5e8f0da33a4dcb763da2361921a69a5a95aaf6c93aa03a87"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_x86_64.whl", hash = "sha256:6a80d5cae8c265842c7419c560b9961561556c4361b297b4c431903f8c33b269"},
{file = "watchdog-4.0.0-py3-none-win32.whl", hash = "sha256:8f9a542c979df62098ae9c58b19e03ad3df1c9d8c6895d96c0d51da17b243b1c"},
{file = "watchdog-4.0.0-py3-none-win_amd64.whl", hash = "sha256:f970663fa4f7e80401a7b0cbeec00fa801bf0287d93d48368fc3e6fa32716245"},
{file = "watchdog-4.0.0-py3-none-win_ia64.whl", hash = "sha256:9a03e16e55465177d416699331b0f3564138f1807ecc5f2de9d55d8f188d08c7"},
{file = "watchdog-4.0.0.tar.gz", hash = "sha256:e3e7065cbdabe6183ab82199d7a4f6b3ba0a438c5a512a68559846ccb76a78ec"},
]
[package.extras]
watchmedo = ["PyYAML (>=3.10)"]
[[package]]
name = "wcwidth"
version = "0.2.13"
@ -4273,5 +4622,5 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p
[metadata]
lock-version = "2.0"
python-versions = ">=3.9,<3.13"
content-hash = "3283801c83fc07a81307276855a4479dd11069bfc9821279cc5cbd42bf7794c6"
python-versions = ">=3.9,<3.13,!=3.9.7"
content-hash = "b6abaf81bb850c204b073e638c539f47a0c2bf1cfb46dbce2482265beed73198"

View File

@ -12,10 +12,13 @@ packages = [
]
include = [
"detect_text.py",
"ocr_text.py",
"ocr_app.py",
"run_ocr_app.py"
]
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
python = ">=3.9,<3.13,!=3.9.7"
transformers = "4.36.2"
torch = "^2.1.2"
pydantic = "^2.5.3"
@ -35,10 +38,12 @@ snakeviz = "^2.2.0"
datasets = "^2.16.1"
rapidfuzz = "^3.6.1"
arabic-reshaper = "^3.0.0"
streamlit = "^1.31.0"
[tool.poetry.scripts]
surya_detect = "detect_text:main"
surya_ocr = "ocr_text:main"
surya_gui = "run_ocr_app:run_app"
[build-system]
requires = ["poetry-core"]

8
run_ocr_app.py Normal file
View File

@ -0,0 +1,8 @@
import subprocess
import os
def run_app():
cur_dir = os.path.dirname(os.path.abspath(__file__))
ocr_app_path = os.path.join(cur_dir, "ocr_app.py")
subprocess.run(["streamlit", "run", ocr_app_path])
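The launcher above depends on a `streamlit` executable being on `PATH`. A minimal sketch of one alternative, assuming Streamlit is installed in the same environment as the running interpreter: invoke it as a module via `sys.executable`. The helper name `build_streamlit_command` is hypothetical and not part of this commit.

```python
import os
import subprocess
import sys
from typing import List


def build_streamlit_command(app_path: str) -> List[str]:
    # Hypothetical helper: run Streamlit as a module of the current
    # interpreter instead of relying on a `streamlit` binary on PATH.
    return [sys.executable, "-m", "streamlit", "run", app_path]


def run_app() -> None:
    # Resolve ocr_app.py next to this file, as the committed launcher does.
    cur_dir = os.path.dirname(os.path.abspath(__file__))
    ocr_app_path = os.path.join(cur_dir, "ocr_app.py")
    # check=True surfaces a non-zero exit code as CalledProcessError.
    subprocess.run(build_streamlit_command(ocr_app_path), check=True)
```

Building the command in a separate function keeps the launch logic testable without starting a server.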

Binary files (contents not shown; file names are given only where the diff view listed them):

static/images/arabic.jpg (BIN, normal file): After, 360 KiB
(file name not shown): After, 274 KiB
static/images/chinese.jpg (BIN, normal file): After, 343 KiB
(file name not shown): Before, 135 KiB
(file name not shown): After, 276 KiB
(file name not shown): Before, 189 KiB
(file name not shown): After, 121 KiB
static/images/hindi.jpg (BIN, normal file): After, 443 KiB
(file name not shown): Before, 220 KiB
(file name not shown): After, 386 KiB
(file name not shown): Before, 182 KiB
static/images/japanese.jpg (BIN, normal file): After, 477 KiB
(file name not shown): Before, 318 KiB
(file name not shown): After, 417 KiB
(file name not shown): Before, 403 KiB
static/images/paper.jpg (BIN, normal file): After, 668 KiB
(file name not shown): Before, 324 KiB
(file name not shown): After, 545 KiB
(file name not shown): Before, 389 KiB
static/images/pres_text.jpg (BIN, normal file): After, 181 KiB
(file name not shown): Before, 148 KiB
(file name not shown): After, 670 KiB
(file name not shown): Before, 300 KiB
static/images/textbook.jpg (BIN, normal file): After, 400 KiB
(file name not shown): After, 339 KiB

View File

@ -4,8 +4,8 @@ from surya.languages import LANGUAGE_TO_CODE, CODE_TO_LANGUAGE
def replace_lang_with_code(langs: List[str]):
for i in range(len(langs)):
if langs[i] in LANGUAGE_TO_CODE:
langs[i] = LANGUAGE_TO_CODE[langs[i]]
if langs[i].title() in LANGUAGE_TO_CODE:
langs[i] = LANGUAGE_TO_CODE[langs[i].title()]
if langs[i] not in CODE_TO_LANGUAGE:
raise ValueError(f"Language code {langs[i]} not found.")
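The switch to `.title()` makes the language-name lookup case-insensitive for single-word names. A minimal sketch of the effect, using a three-entry stand-in for the real tables in `surya.languages` (the entries and the reverse-mapping construction here are illustrative assumptions):

```python
from typing import List

# Stand-in tables; the real ones live in surya.languages.
CODE_TO_LANGUAGE = {"en": "English", "hi": "Hindi", "bn": "Bengali"}
LANGUAGE_TO_CODE = {name: code for code, name in CODE_TO_LANGUAGE.items()}


def replace_lang_with_code(langs: List[str]) -> None:
    # Mutates `langs` in place, mirroring the function in the diff.
    for i in range(len(langs)):
        # Title-casing lets "hindi" or "ENGLISH" match the capitalized keys.
        if langs[i].title() in LANGUAGE_TO_CODE:
            langs[i] = LANGUAGE_TO_CODE[langs[i].title()]
        # Anything that is still not a known code is rejected.
        if langs[i] not in CODE_TO_LANGUAGE:
            raise ValueError(f"Language code {langs[i]} not found.")


langs = ["hindi", "ENGLISH", "bn"]
replace_lang_with_code(langs)
print(langs)
```

Codes such as `"bn"` pass through unchanged because `"Bn"` is not a language name, while an unknown name raises `ValueError`.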

View File

@ -1,4 +1,5 @@
import os
import random
from typing import List
import numpy as np

View File

@ -6,7 +6,7 @@ CODE_TO_LANGUAGE = {
'az': 'Azerbaijani',
'be': 'Belarusian',
'bg': 'Bulgarian',
'bn': 'Bangla',
'bn': 'Bengali',
'br': 'Breton',
'bs': 'Bosnian',
'ca': 'Catalan',

View File

@ -73,7 +73,7 @@ class Byt5LangTokenizer(ByT5Tokenizer):
super().__init__()
def __call__(self, texts: List[str] | str, langs: List[List[str]] | List[str], pad_token_id: int = 0, **kwargs):
def __call__(self, texts: Union[List[str], str], langs: Union[List[List[str]], List[str]], pad_token_id: int = 0, **kwargs):
tokenized = []
all_langs = []

View File

@ -30,13 +30,35 @@ def clean_contained_boxes(boxes: List[PolygonBox]):
return new_boxes
def get_dynamic_thresholds(linemap, text_threshold, low_text, typical_top10_avg=.7):
# Find average intensity of top 10% pixels
# Do top 10% to account for pdfs that are mostly whitespace, etc.
flat_map = linemap.flatten()
sorted_map = np.sort(flat_map)[::-1]
top_10_count = int(np.ceil(len(flat_map) * 0.1))
top_10 = sorted_map[:top_10_count]
avg_intensity = np.mean(top_10)
# Adjust thresholds based on normalized intensity
scaling_factor = min(1, avg_intensity / typical_top10_avg) ** (1 / 2)
low_text = max(low_text * scaling_factor, 0.1)
text_threshold = max(text_threshold * scaling_factor, 0.15)
low_text = min(low_text, 0.6)
text_threshold = min(text_threshold, 0.8)
return text_threshold, low_text
def detect_boxes(linemap, text_threshold, low_text):
# From CRAFT - https://github.com/clovaai/CRAFT-pytorch
# prepare data
linemap = linemap.copy()
img_h, img_w = linemap.shape
ret, text_score = cv2.threshold(linemap, low_text, 1, 0)
text_threshold, low_text = get_dynamic_thresholds(linemap, text_threshold, low_text)
ret, text_score = cv2.threshold(linemap, low_text, 1, cv2.THRESH_BINARY)
text_score_comb = np.clip(text_score, 0, 1)
label_count, labels, stats, centroids = cv2.connectedComponentsWithStats(text_score_comb.astype(np.uint8), connectivity=4)
@ -96,7 +118,7 @@ def detect_boxes(linemap, text_threshold, low_text):
return det, labels
def get_detected_boxes(textmap, text_threshold=settings.DETECTOR_TEXT_THRESHOLD, low_text=settings.DETECTOR_NMS_THRESHOLD):
def get_detected_boxes(textmap, text_threshold=settings.DETECTOR_TEXT_THRESHOLD, low_text=settings.DETECTOR_BLANK_THRESHOLD):
textmap = textmap.copy()
textmap = textmap.astype(np.float32)
boxes, labels = detect_boxes(textmap, text_threshold, low_text)
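The arithmetic in `get_dynamic_thresholds` can be checked by hand. The sketch below restates it in plain Python over a flat list of pixel scores; it is an illustration, not the committed NumPy code, and the name `dynamic_thresholds` is hypothetical. It assumes a non-empty input.

```python
import math
from typing import Sequence, Tuple


def dynamic_thresholds(
    values: Sequence[float],
    text_threshold: float,
    low_text: float,
    typical_top10_avg: float = 0.7,
) -> Tuple[float, float]:
    # Average the brightest 10% of scores (assumes `values` is non-empty).
    ordered = sorted(values, reverse=True)
    top_count = math.ceil(len(ordered) * 0.1)
    avg_intensity = sum(ordered[:top_count]) / top_count

    # Square-root scaling: a faint map lowers the thresholds, but a bright
    # map never raises them, because the ratio is capped at 1.
    scaling_factor = math.sqrt(min(1.0, avg_intensity / typical_top10_avg))

    # Clamp to the same floors and ceilings as the diff.
    low_text = min(max(low_text * scaling_factor, 0.1), 0.6)
    text_threshold = min(max(text_threshold * scaling_factor, 0.15), 0.8)
    return text_threshold, low_text


# A map whose brightest tenth averages 0.175 is a quarter of the typical
# 0.7, so both thresholds are scaled by sqrt(0.25) = 0.5.
print(dynamic_thresholds([0.175] + [0.0] * 9, 0.6, 0.35))
```

With the default settings (0.6 and 0.35), a map at or above the typical intensity keeps its thresholds unchanged, and an all-zero map falls back to the floors of 0.15 and 0.1.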

View File

@ -12,7 +12,7 @@ def get_text_size(text, font):
return width, height
def draw_text_on_image(bboxes, texts, image_size=(1024, 1024), font_path=settings.RECOGNITION_RENDER_FONT, font_size=18, res_upscale=2):
def draw_text_on_image(bboxes, texts, image_size=(1024, 1024), font_path=settings.RECOGNITION_RENDER_FONT, max_font_size=60, res_upscale=2):
new_image_size = (image_size[0] * res_upscale, image_size[1] * res_upscale)
image = Image.new('RGB', new_image_size, color='white')
draw = ImageDraw.Draw(image)
@ -23,7 +23,7 @@ def draw_text_on_image(bboxes, texts, image_size=(1024, 1024), font_path=setting
bbox_height = s_bbox[3] - s_bbox[1]
# Shrink the text to fit in the bbox if needed
box_font_size = font_size
box_font_size = min(int(.75 * bbox_height), max_font_size)
# Download font if it doesn't exist
if not os.path.exists(font_path):
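The new sizing rule replaces a fixed `font_size` with one derived from each box. A small sketch of that rule in isolation; the function name `fit_font_size` and the `height_ratio` parameter are hypothetical, introduced only to name the `.75` constant from the diff:

```python
def fit_font_size(bbox_height: int, max_font_size: int = 60, height_ratio: float = 0.75) -> int:
    # Scale the font to a fraction of the box height, capped at max_font_size.
    return min(int(height_ratio * bbox_height), max_font_size)


for height in (1, 40, 200):
    print(height, fit_font_size(height))
```

A 40-pixel box yields size 30 and a 200-pixel box is capped at 60. Note that a 1-pixel box yields 0, so very thin boxes may need separate handling before a font is loaded at that size.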

View File

@ -44,14 +44,14 @@ class Settings(BaseSettings):
# Text detection
DETECTOR_BATCH_SIZE: Optional[int] = None # Defaults to 2 for CPU, 32 otherwise
DETECTOR_MODEL_CHECKPOINT: str = "vikp/line_detector"
DETECTOR_MODEL_CHECKPOINT: str = "vikp/surya_det"
DETECTOR_BENCH_DATASET_NAME: str = "vikp/doclaynet_bench"
DETECTOR_IMAGE_CHUNK_HEIGHT: int = 1200 # Height at which to slice images vertically
DETECTOR_TEXT_THRESHOLD: float = 0.6 # Threshold for text detection
DETECTOR_NMS_THRESHOLD: float = 0.35 # Threshold for non-maximum suppression
DETECTOR_IMAGE_CHUNK_HEIGHT: int = 1280 # Height at which to slice images vertically
DETECTOR_TEXT_THRESHOLD: float = 0.6 # Threshold for text detection (above this is considered text)
DETECTOR_BLANK_THRESHOLD: float = 0.35 # Threshold for blank space (below this is considered blank)
# Text recognition
RECOGNITION_MODEL_CHECKPOINT: str = "vikp/text_recognizer_test"
RECOGNITION_MODEL_CHECKPOINT: str = "vikp/surya_rec"
RECOGNITION_MAX_TOKENS: int = 160
RECOGNITION_BATCH_SIZE: Optional[int] = None # Defaults to 8 for CPU/MPS, 256 otherwise
RECOGNITION_IMAGE_SIZE: Dict = {"height": 196, "width": 896}