Commit Graph

832 Commits

Author SHA1 Message Date
Zach Nussbaum
d6f3515009
feat: new unified tokenizer 2025-08-25 14:21:32 +00:00
Vik Paruchuri
e1aa09d3bc Set disable tqdm
2025-08-19 17:39:18 -04:00
Vik Paruchuri
a95b6cafe5 Fix layout and table rec image bbox 2025-08-19 11:33:19 -04:00
Vik Paruchuri
4613f45a5b Prefill fix 2025-08-18 08:06:45 -04:00
Vik Paruchuri
053f13cde7 Fix padding on tpu
2025-08-15 15:52:03 -04:00
Vik Paruchuri
a73eee6648 Force bf16 2025-08-15 15:38:50 -04:00
Vik Paruchuri
cbe23fae03 Tables can have a lot of cells 2025-08-15 15:29:50 -04:00
Vik Paruchuri
14e7ee6ed9 Avoid truncating layout and table 2025-08-15 11:18:31 -04:00
Vik Paruchuri
f2eecf1ad1 Properly pad
2025-08-12 12:45:56 -04:00
Vik Paruchuri
609caf42c9 Fix tensor creation 2025-08-12 12:18:30 -04:00
Vik Paruchuri
a511a095b9 Pad image embeddings 2025-08-12 12:03:42 -04:00
Vik Paruchuri
9e5fa2931b Wire in table structure 2025-08-12 09:53:17 -04:00
Vik Paruchuri
fc6657e8a6 Use fix-length index 2025-08-11 21:33:23 -04:00
Vik Paruchuri
de947006a5 Fix text lengths 2025-08-11 16:30:27 -04:00
Vik Paruchuri
2748109d33 Fix encoder chunking 2025-08-11 12:45:49 -04:00
Vik Paruchuri
8367a631a2 Accuracy fixes 2025-08-11 12:40:09 -04:00
Vik Paruchuri
eee29d4ae7 Fix beacon issue 2025-08-11 11:59:51 -04:00
Vik Paruchuri
f03b58b4e1 Fix table rec
2025-08-08 14:34:06 -04:00
Vik Paruchuri
d55f00a49e Integrate table rec predictor 2025-08-08 11:11:07 -04:00
Vik Paruchuri
669ce4869d Patch clamp issue 2025-08-06 21:52:47 -04:00
Vik Paruchuri
e1df24c93e Cleanup embedding 2025-08-06 16:54:38 -04:00
Vik Paruchuri
185b57abd7 Cleanup 2025-08-06 16:34:49 -04:00
Vik Paruchuri
8d1ef8517c Merge remote-tracking branch 'origin/vik/tpu-layout' into vik/tpu-layout 2025-08-06 16:28:12 -04:00
Vik Paruchuri
0600fc5904 Enable re-embedding bboxes 2025-08-06 16:22:39 -04:00
Vik Paruchuri
523bd6664c Merge branch 'vik/layout' into vik/tpu3 2025-08-06 12:44:57 -04:00
Vik Paruchuri
768d8d54a7 Move layout 2025-08-06 12:44:19 -04:00
Vik Paruchuri
d4461c6d30 Fix mark steps 2025-08-06 12:06:39 -04:00
Vik Paruchuri
3b30120601 Enable compile 2025-08-05 11:58:15 -04:00
Vik Paruchuri
2c60d24a81 Cleanup debug logs 2025-08-05 11:41:07 -04:00
Vik Paruchuri
a1aa1557a6 Fix embedding with a static scatter 2025-08-05 10:57:42 -04:00
Vik Paruchuri
d9f6e4c52e Fix issues with GPU codepaths 2025-08-04 16:27:50 -04:00
Vik Paruchuri
dd7b127d92 Add in original codepath 2025-08-04 16:03:35 -04:00
Vik Paruchuri
ab9eff4d69 Static cache impl 2025-08-04 13:57:40 -04:00
Vik Paruchuri
f47b0cdb96 Bump version 2025-08-04 13:05:36 -04:00
Vik Paruchuri
4fcc094159 Fresh tpu start 2025-08-04 13:05:06 -04:00
Tarun Menta
f2b6363482
Merge pull request #414 from datalab-to/tag-fix
Fix edge case for empty tags
2025-08-04 12:58:31 -04:00
Tarun Menta
1dd9b95a25
Fix edge case for empty tags 2025-08-04 12:42:32 -04:00
Vik Paruchuri
04d2ba9d9b Bump recognition model 2025-08-04 11:02:57 -04:00
Vik Paruchuri
54d55bb0e7 Bump version 2025-08-04 09:16:46 -04:00
Vik Paruchuri
e55703eff5
Merge pull request #411 from datalab-to/foundation-ocr-release
Foundation ocr release
2025-08-01 18:42:33 -04:00
Tarun Menta
006becd9f3
Merge branch 'dev' into foundation-ocr-release 2025-08-01 18:41:31 -04:00
Tarun Menta
fe8545cfc8
Better calculation of max image token count 2025-08-01 18:39:06 -04:00
Tarun Menta
3212707c49
Move checkpoint to S3 2025-08-01 17:22:49 -04:00
Tarun Menta
729ffc9295
Fix max image cache space logic 2025-08-01 16:04:52 -04:00
Tarun Menta
bb2d77d729
Filter more HTML tags out 2025-08-01 11:46:58 -04:00
Tarun Menta
02b6588de8
Filter unwanted tags from characters instead of joined text
This allows it to be filtered when appearing in marker as well
2025-08-01 10:46:16 -04:00
Tarun Menta
1cf444d752
Clean out unwanted formatting tags from OCR 2025-07-31 20:33:11 -04:00
Tarun Menta
48b98856bc
Optimize decode cache update 2025-07-31 19:56:00 -04:00
Tarun Menta
34f1148fd9
Cleanup 2025-07-31 19:51:25 -04:00
Tarun Menta
de9e9e74d4
Allow max tokens and sliding window to be set to custom values 2025-07-31 19:49:05 -04:00