Commit Graph

841 Commits

Author SHA1 Message Date
Tarun Menta
d4496f8caa
Fix bad test - Add real latex image
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-29 14:53:41 -04:00
Tarun Menta
e7ec40ecb4
Move new model to R2 2025-08-29 14:40:14 -04:00
Tarun Menta
cdc7b18af9
Merge branch 'table-cell-updates' of https://github.com/VikParuchuri/surya into table-cell-updates
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-28 10:42:55 -04:00
Zach Nussbaum
4cdf1080cd
fix: ignore on utf16 errors 2025-08-28 00:08:51 +00:00
Zach Nussbaum
5d1c369477
feat: new tokenizer 2025-08-28 00:08:51 +00:00
Tarun Menta
c37c42e72c
Merge branch 'vik/new-enc' into table-cell-newenc 2025-08-27 10:21:59 -04:00
Zach Nussbaum
bc7ee4895a
fix: ignore on utf16 errors
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-27 10:45:08 +00:00
Tarun Menta
a4ed5523d0
Filter more unwanted tags
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-26 15:16:31 -04:00
Zach Nussbaum
a37919feff
feat: new tokenizer
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-26 13:59:13 +00:00
Tarun Menta
78302facbf
Merge branch 'vik/new-enc' into table-cell-updates 2025-08-19 16:56:28 -04:00
Vik Paruchuri
8def7db80e Patch in new encoder 2025-08-19 14:20:52 -04:00
Tarun Menta
82b88729aa
Correct dtype when forcing to table rec to CPU
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-18 16:29:25 -04:00
Tarun Menta
a9d5a093e5
Pin table model to CPU if using MPS. Fixes datalab-to/marker#827 2025-08-18 11:01:04 -04:00
Tarun Menta
38a452d2b2
Make list of tags to filter an argument to get passed in
Required so that lists are not skipped in tables
2025-08-16 15:53:33 -04:00
Tarun Menta
508ad43735
Improve behavior of disable_tqdm 2025-08-16 12:58:51 -04:00
Tarun Menta
a3efce1830
Improve filtering of tags + Increase tags in blacklist 2025-08-15 01:06:00 -04:00
Tarun Menta
5497449bfa
Merge pull request #429 from datalab-to/dev
Improve model performance on math
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-12 19:11:44 -04:00
Tarun Menta
5bb47b2f09
Bump model
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-12 18:59:39 -04:00
Vik Paruchuri
17b875fd55
Merge pull request #424 from datalab-to/dev
Dev
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-08 20:04:06 -04:00
Vik Paruchuri
632a5a9621 Bump version
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-08 18:11:35 -04:00
Vik Paruchuri
9f6d957f57
Merge pull request #422 from datalab-to/finetuning
[WIP]: Finetuning Script
2025-08-08 18:10:25 -04:00
Vik Paruchuri
685e63c0d6 Bump surya checkpoint 2025-08-08 18:09:49 -04:00
Tarun Menta
57fb761ac6
Bump model
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-08 17:36:49 -04:00
Tarun Menta
98010bee5c
Update README [skip ci] 2025-08-08 16:45:08 -04:00
Tarun Menta
59bc1a781c
Fix trailing whitespace 2025-08-08 16:33:07 -04:00
Tarun Menta
f97add87a0
Update README 2025-08-08 16:32:15 -04:00
Tarun Menta
e1bb6306b0
Update README 2025-08-08 16:29:02 -04:00
Tarun Menta
3689c5aa8c
Update README with finetuning details 2025-08-08 16:16:46 -04:00
Tarun Menta
68d9c7916f
Merge pull request #423 from starikovplusplus/finetuning
Fix tokenizer to correctly tokenize script tokens
2025-08-08 15:21:59 -04:00
Tarun Menta
e8fb02dad4
Add in language scripts to text inputs 2025-08-08 15:19:41 -04:00
Tarun Menta
64f0bd0c8b
Improve processing + limit image size 2025-08-08 15:11:31 -04:00
github-actions[bot]
1486d0bdca
@starikovplusplus has signed the CLA in datalab-to/surya#423 2025-08-08 18:29:58 +00:00
starikov.y.e
563054f0b5 Fix tokenizer to correctly tokenize script tokens 2025-08-08 23:24:24 +05:00
Tarun Menta
4451aa4716
Minimal working finetuning 2025-08-07 19:33:51 -04:00
Tarun Menta
9f5b2535fe
Typo - Fix #419 2025-08-07 15:45:34 -04:00
Zach Nussbaum
2fa3a1ee9a
Merge pull request #421 from datalab-to/download-progbar
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-07 13:51:40 -04:00
Zach Nussbaum
644f5feb13
feat: download progress bar for each file 2025-08-07 13:11:53 -04:00
github-actions[bot]
30e59deb64
@mebriki has signed the CLA in datalab-to/surya#418
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-05 10:54:39 +00:00
Vik Paruchuri
b215de26e7
Merge pull request #415 from datalab-to/dev
Dev
2025-08-04 13:05:59 -04:00
Vik Paruchuri
f47b0cdb96 Bump version 2025-08-04 13:05:36 -04:00
Tarun Menta
f2b6363482
Merge pull request #414 from datalab-to/tag-fix
Fix edge case for empty tags
2025-08-04 12:58:31 -04:00
Tarun Menta
1dd9b95a25
Fix edge case for empty tags 2025-08-04 12:42:32 -04:00
Vik Paruchuri
894dbd1d3c
Merge pull request #413 from datalab-to/dev
Bump recognition model
2025-08-04 11:04:08 -04:00
Vik Paruchuri
04d2ba9d9b Bump recognition model 2025-08-04 11:02:57 -04:00
Vik Paruchuri
b3a1aab4d3
Merge pull request #412 from datalab-to/dev
OCR model update
2025-08-04 09:44:18 -04:00
Vik Paruchuri
54d55bb0e7 Bump version 2025-08-04 09:16:46 -04:00
Vik Paruchuri
e55703eff5
Merge pull request #411 from datalab-to/foundation-ocr-release
Foundation ocr release
Some checks failed
Integration test / build (push) Has been cancelled
Unit tests / build (t4_gpu) (push) Has been cancelled
Unit tests / build (ubuntu-latest) (push) Has been cancelled
Unit tests / build (windows-latest) (push) Has been cancelled
Test CLI scripts / build (push) Has been cancelled
2025-08-01 18:42:33 -04:00
Tarun Menta
006becd9f3
Merge branch 'dev' into foundation-ocr-release 2025-08-01 18:41:31 -04:00
Tarun Menta
fe8545cfc8
Better calculation of max image token count 2025-08-01 18:39:06 -04:00
Tarun Menta
3212707c49
Move checkpoint to S3 2025-08-01 17:22:49 -04:00