800 Commits

Author SHA1 Message Date
Vik Paruchuri
5b830e2298 Change move to device 2025-08-01 09:51:42 -04:00
Vik Paruchuri
9911a9d928 Batched encoder 2025-07-31 15:36:41 -04:00
Vik Paruchuri
d71a38576c Fix grid sizes 2025-07-31 05:24:04 -04:00
Vik Paruchuri
5bce56ea26 Misc cleanup 2025-07-30 13:44:51 -04:00
Vik Paruchuri
9a042b127e Fix batch assignment 2025-07-30 13:19:41 -04:00
Vik Paruchuri
5196fd8e2d Fix bugs with forward 2025-07-29 19:00:30 -04:00
Vik Paruchuri
26f964574b Maybe add batch to encoder 2025-07-29 17:44:49 -04:00
Tarun Menta
a41fb6d99f
Merge branch 'vik/tpu' into foundation-update 2025-07-28 17:27:24 -04:00
Tarun Menta
a754f8fed9
Cleanup 2025-07-28 16:59:51 -04:00
Tarun Menta
d681c04bd1
Cleanup 2025-07-28 16:58:11 -04:00
Tarun Menta
92eee41256
Fix topk 2025-07-28 16:57:42 -04:00
Tarun Menta
bb8e5f8935
Allow multi token prediction for OCR 2025-07-28 16:29:03 -04:00
Tarun Menta
e1cab15a9c
Extend functionality of the new cache
Cache now supports "ragged" input_ids where each batch can have a different
number of "true tokens", with padding. This helps for lots of scenarios,
including MTP and beacons, when some sequences have shorter preds than others

Can be improved further
2025-07-28 16:24:02 -04:00
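
The "ragged" input idea from commit e1cab15a9c can be sketched in plain Python: pad every sequence to a common length and record how many "true tokens" each one holds. This is an illustration only, not the repository's cache code; `pad_ragged`, `pad_id`, and the choice of right padding are assumptions made for the example.

```python
def pad_ragged(sequences, pad_id):
    """Pad token lists to one length and count the true tokens in each.

    Hypothetical helper: right padding is an assumption of this sketch.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    # Number of real (non-padding) tokens per sequence.
    num_true = [len(seq) for seq in sequences]
    # Fill every shorter sequence up to max_len with pad_id.
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, num_true


# One sequence predicted three tokens, the other only one.
padded, num_true = pad_ragged([[5, 6, 7], [8]], pad_id=0)
```

Keeping the per-sequence counts next to the padded batch lets later steps ignore the padding positions.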
Tarun Menta
5a7347ab57
Bugfix in decode update - Text token counts were wrong
We wanted to limit the text token count to max of `text_sliding_window`,
but were clamping to min instead, which messed up the logic in a lot
of places downstream

Also removed dependence on huggingface caching
2025-07-28 15:05:48 -04:00
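
The min/max mix-up described in commit 5a7347ab57 is easy to reproduce in plain Python. This is a sketch under assumptions: the function names are hypothetical, and only `text_sliding_window` comes from the commit message. Limiting a count to at most some value means taking the smaller of the two numbers; taking the larger enforces a floor instead.

```python
def cap_text_tokens(count, text_sliding_window):
    """Intended behavior: the count never exceeds the sliding window."""
    return min(count, text_sliding_window)


def cap_text_tokens_buggy(count, text_sliding_window):
    """The mix-up: this enforces a floor, so small counts get inflated."""
    return max(count, text_sliding_window)


correct = cap_text_tokens(2, text_sliding_window=4)
buggy = cap_text_tokens_buggy(2, text_sliding_window=4)
```

With two text tokens and a window of four, the intended version keeps the count at two, while the buggy version reports four tokens that do not exist.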
Tarun Menta
99693a9abb
Fix speed issues due to topk
Do the topK on GPU before moving to CPU, avoids an expensive and slow
GPU<->CPU memory transfer of the full logits
2025-07-28 11:07:23 -04:00
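
The ordering in commit 99693a9abb (select first, transfer second) can be illustrated without any GPU library. The sketch below is an assumption-laden stand-in: it does not model device transfers, and `topk_before_transfer` is a hypothetical name. It only shows that selecting the top k entries first shrinks what would need to be copied from the full logits to k (index, value) pairs.

```python
import heapq


def topk_before_transfer(logits, k):
    """Return the k largest (index, value) pairs, highest value first."""
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])


logits = [0.1, 2.5, -1.0, 3.2, 0.7]
payload = topk_before_transfer(logits, k=2)
```

Here the payload holds two pairs instead of five logits; with a real vocabulary the gap is far larger.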
Tarun Menta
a489a58086
Fix decode attention mask update
During decode, if the sliding window is not full, we should update
the attention mask in the last `sliding_window` positions to only
attend to valid tokens. This update was not offset by the `text_cache_start`,
so we were actually making updates in the image cache space

Simple change to include this offset
2025-07-26 17:18:03 -04:00
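
The offset fix in commit a489a58086 can be sketched with a flat list standing in for the attention mask. The layout is an assumption for illustration: image slots come first, text slots start at `text_cache_start`, and valid text tokens fill the text region from its start. The function name and `num_valid` are hypothetical.

```python
def update_text_mask(mask, text_cache_start, sliding_window, num_valid):
    """Mark only valid text slots as attendable, leaving image slots alone."""
    updated = list(mask)
    for i in range(sliding_window):
        # Adding text_cache_start keeps the write inside the text region;
        # without it, indices 0..sliding_window-1 (image slots) are overwritten.
        updated[text_cache_start + i] = 1 if i < num_valid else 0
    return updated


# 3 image slots followed by a 4-slot text window holding 2 valid tokens.
mask = update_text_mask([1] * 7, text_cache_start=3, sliding_window=4, num_valid=2)
```

Dropping the offset in this example would zero out image slots 2 and 3 while leaving the empty text slots marked as attendable.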
Tarun Menta
467e7024d9
Delete unused function 2025-07-26 15:31:21 -04:00
Tarun Menta
fabcb0ed79
Cleanup 2025-07-26 13:40:56 -04:00
Tarun Menta
6251ab2568
Faster static cache implementation
Decode update is way faster now. Leverages the fact that flash
attention now has an option to both left and right pad the
cache
2025-07-25 19:51:20 -04:00
Vik Paruchuri
2351d34b0d Add item conv
2025-07-21 15:21:48 -04:00
Vik Paruchuri
ca9137d0c1 Fix issues
2025-07-21 15:17:08 -04:00
Vik Paruchuri
a8d6509685 Default sliding window: 2025-07-21 09:49:40 -04:00
Vik Paruchuri
5160f774cf Keep on cpu for longer
2025-07-15 10:02:57 -04:00
Vik Paruchuri
f8a9cedd1e Prefill experiments 2025-07-14 20:18:45 -04:00
Vik Paruchuri
d16721362d Improve prefill
2025-07-11 09:32:06 -04:00
Vik Paruchuri
9bb7fe5fd5 Improve embeddings 2025-07-10 22:56:40 -04:00
Vik Paruchuri
58b3054f6e Fix 2025-07-10 16:42:07 -04:00
Vik Paruchuri
b256889b0b Improve prefill and decode speed 2025-07-10 16:21:41 -04:00
Vik Paruchuri
6ae83d8df1 Revert encoder changes 2025-07-10 12:31:18 -04:00
Vik Paruchuri
7e819c3442 Refactor cache for tpu 2025-07-10 12:17:58 -04:00
Vik Paruchuri
8044edaaef Fix compile issues 2025-07-09 18:46:40 -04:00
Vik Paruchuri
64872755a6 Remove the loop 2025-07-09 18:14:45 -04:00
Vik Paruchuri
402003a346 Cleanup cache 2025-07-09 15:42:05 -04:00
Vik Paruchuri
c23bd234c2 Refactor cache 2025-07-09 11:59:47 -04:00
Vik Paruchuri
2eae380119 Fix graph break 2025-07-08 19:03:49 -04:00
Vik Paruchuri
cd0c46b9b9 Refactor caching 2025-07-08 17:42:57 -04:00
Vik Paruchuri
53d06c0da9 Pad the encoder properly 2025-07-08 16:25:23 -04:00
Vik Paruchuri
dce33261e3 Vectorize, add static shapes 2025-07-08 11:58:40 -04:00
Vik Paruchuri
d2ee4f241b Work on tpu 2025-07-07 20:07:11 -04:00
Tarun Menta
9ab2cd7753
Static cache on encoder when required
2025-07-03 17:45:44 -04:00
Tarun Menta
344d1834f8
Cleanup 2025-07-03 15:58:48 -04:00
Tarun Menta
ee68baa137
Pad prefill inputs batch size for compiled static shape
Cache was already static shape, but not prefill inputs since prefill can
happen at 0.2 times the initial batch size
2025-07-03 15:57:35 -04:00
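
The batch padding described in commit ee68baa137 can be sketched as follows. This is an illustration under assumptions, not the repository's code: `pad_batch`, `static_batch_size`, and `pad_row` are hypothetical names. The idea is that a compiled model expecting a fixed batch dimension always receives that many rows, with a flag telling real rows from filler.

```python
def pad_batch(rows, static_batch_size, pad_row):
    """Pad a batch up to a fixed size and flag which rows are real."""
    if len(rows) > static_batch_size:
        raise ValueError("batch is larger than the static batch size")
    n_pad = static_batch_size - len(rows)
    # Copy pad_row for each filler slot so the rows do not share one list.
    padded = list(rows) + [list(pad_row) for _ in range(n_pad)]
    is_real = [True] * len(rows) + [False] * n_pad
    return padded, is_real


# A prefill of 2 inputs padded to a static batch size of 4.
padded, is_real = pad_batch([[1, 2], [3, 4]], static_batch_size=4, pad_row=[0, 0])
```

Outputs for rows flagged `False` would be discarded after the forward pass.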
Tarun Menta
1995b65783
Cleanup 2025-07-03 15:33:54 -04:00
Tarun Menta
d2a52ce02d
Pin to seq len for static cache 2025-07-03 15:32:16 -04:00
Tarun Menta
4cc0c574cd
Some more fixes when moving from right to left padding 2025-07-03 14:04:11 -04:00
Tarun Menta
2bfb8168bf
Minor comments for SDPA [no ci] 2025-07-03 12:26:43 -04:00
Tarun Menta
33bd9c1bfd
Cleanup 2025-07-03 11:45:14 -04:00
Tarun Menta
6e11b95bef
Expose topk through foundation - Pipe into layout and rec
2025-07-01 19:16:48 -04:00
Tarun Menta
f6371d51d9
Fix math mode for layout 2025-06-30 14:58:14 -04:00
Tarun Menta
c733a9ba86
Make lookahead prediction configurable 2025-06-30 14:54:27 -04:00