ollama

mirror of https://github.com/ollama/ollama.git synced 2026-06-05 21:05:00 +08:00

Author	SHA1	Message	Date
Jeffrey Morgan	e7766a4a47	model: improvements to laguna-xs.2 parser/renderer (#16362 )	2026-05-31 14:11:07 -07:00
Jeffrey Morgan	be7de10c41	llama: handle Gemma 4 and LFM2 BOS override in llama server (#16367 )	2026-05-31 14:05:39 -07:00
Daniel Hiltgen	9db4bdbad6	runner: Remove CGO engines, use llama-server exclusively for GGML models (#16031 ) * broad lint fixes to sidestep CI scope glitch * runner: Remove CGO engines, use llama-server exclusively for GGML models Remove the vendored GGML and llama.cpp backend, CGO runner, Go model implementations, and sample. llama-server (built from upstream llama.cpp via FetchContent) is now the sole inference engine for GGUF-based models. (Safetensor based models continue to run on the new MLX engine.) This allows us to more rapidly pick up new capabilities and fixes from llama.cpp as they come out. On windows this now requires recent AMD driver versions to support ROCm v7 as llama.cpp currently does not support building against v6. * llama/compat: load Ollama-format GGUFs in llama-server Squashed from upstream/jmorganca/llama-compat on 2026-04-29. Source tip: `0c33775d37`. Original source commits: - `25223160d` llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs - `7449b539a` llm,server: route Ollama-format gemma3 blobs through llama/compat - `436f2e2b1` llama/compat: make patch-apply idempotent - `8c2c9d4c8` llama/compat: extend gemma3 handler to cover 1B and 270M blobs - `021389f7b` llama/compat: shrink clip.cpp injection from 18 lines to 1 - `61b367ec2` llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines) - `36049361c` llama/compat: simplify shim (gemma3-tested) - `8fa664865` llama/compat: add qwen35moe text handler - `db0c74530` llama/compat: add qwen35moe vision (clip) support - `2a388da77` llama/compat: split shared infra into a util TU - `9a69a17dc` llama/compat: document non-public API dependencies - `d0f38a915` llama/compat: add gpt-oss and lfm2 handlers - `086071822` llama/compat: add mistral3 text handler (vision TODO) - `63bde9ff7` llama/compat: add mistral3 vision (clip) support - `3a57b89d5` llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K - `99cb87439` llama/compat: add qwen35, gemma4, 
deepseek-ocr handlers - `2c7850dba` llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip) - `9e3b54225` llama/compat: add llama4 text + clip handlers - `034fee349` llama/compat: add gemma4 clip handler (gemma4v projector) - `9945c5a93` server: remove dhiltgen/* compat redirect table - `5d4539101` llama/compat: rewrite gemma4 tokenizer model to BPE - `7e0765327` llama/compat: add glm-ocr text handler + text-loader load-op hook - `f1bd1a25a` llama/compat: add glm-ocr clip handler (glm4v projector) - `4b5cf3420` llama/compat: collapse text-loader hook back to one new patch line - `eb4ecf4fc` llama/compat: extend gemma4 clip handler to gemma4a (audio) - `a23a5e76f` llama/compat: fix gemma4a per-block norm tensor mapping - `cd2dcaff4` llama/compat: add embeddinggemma handler - `1ce8a6b26` llama/compat: add qwen3-vl + qwen2.5-vl handlers - `fd98ffa1e` llama/compat: add gemma3n + glm4moelite handlers - `cc7bdf0bc` llama/compat: handle null buft in maybe_load_tensor - `0c33775d3` llama/compat: disable mmap when load_op transforms text-side tensors * refine implementation * ci: fix windows MLX build * ci: fix windows llama-server build * ci: fix windows rocm build * ci: windows mlx tuning Shorten long-tail on build, and get OllamaSetup.exe back under 2g limit * ci: fix windows dependencies * win: fix dependency gathering * disable openmp * win: arm64 cross-compile build also DRY out CI steps * scheduler improvements * ci: improvements from #15982 * win: favor ninja for faster developer builds * win: fix build * win: fix arm64 cross-compile * win: avoid spaces in compiler path * misc discovery fixes, and bos handling * lint fixes * win: fix arm cross-compile build/CI bugs * llama.cpp update * win: handle multiple CRT dirs * vulkan: add windows iGPU detection * fix creation bugs for patched models, other refactoring work * tune batch size for better performance * ci and lint fixes * fix repeat_last_n bug * build: revamp build for better developer UX * amd, sampler, 
qwen3next fixes * version bump * fix mlx build * revamp GPU discovery Scanning the output of llama-server is turning out to be too error prone across llama.cpp updates, so this switches to a thin dynamic library load against the bundled GGML libraries so more details can be gathered from the API. * version bump * missing file * ci: fix cache miss on rocm build * refine vulkan dep handling * fix ps reporting bug on full GPU load * improve cmake wiring for customized local builds * version bump * docker build arg cleanup * improve windows exit error logs * fix community gemma4 support and ci flakes * fix mlx unit test * tighten up ps logic to avoid double counting fit log lines * version bump * fix ps view for full gpu layer offload * add MTP wiring for llama-server and create with GGUFs * pick best template by capabilities * version bump * ci: harden apt repos * remove unused cpu core discovery * adjust batch default logic to reduce OOMs * support larger tool calls * fix audio support, template show * qwen35 mtp patch support * flesh out dtypes * rocm deps * version bump * lint fix * block broken gfx1150 on windows * fix qwen3.5 moe mtp tensors in patch * mmproj oom fallback and vulkan on by default * qwen MTP compat fix * version bump * ci: fix WoA cross-compile * ci: workaround ui tool in cross-compile * version bump * win: enable OpenMP for CPU builds * build: improve developer UX * ci: windows path workaround for CPU build * win: fix WoA dependencies * win: fix large offset reads for mmproj patched loads * version bump * fix vulkan dup detection * add OLLAMA_IGPU_ENABLE and largely disable iGPUs by default * opt-in MTP, win large offset, integraton fixes * fix unit test scheduler interaction hang * fix multi-gpu filtering * version bump * review comments * fix thinking level * fix linux rocm ordering and granite 3.3 template * version bump * ci fix - non-shallow MLX checkout * bypass linux sysfs unit test on windows --------- Co-authored-by: jmorganca 
<jmorganca@gmail.com>	2026-05-29 13:35:47 -07:00
Eva H	6bdb73073b	anthropic: Preserve Claude local image-path tool results in renderer-owned prompt formatting (#16047 )	2026-05-12 00:02:17 -04:00
Parth Sareen	c7c2837c96	renderers: update gemma4 renderer (#15886 )	2026-04-29 18:40:23 -07:00
Daniel Hiltgen	87288ced4f	New models (#15861 ) * mlx: add laguna model support * convert: support fp8 safetensors import Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights. Use that source-precision metadata during create quantization: default FP8-sourced GGUFs to Q8_0, keep non-FP8 tensors at their original precision for Q8_0, and promote non-FP8 quantizable tensors to Q8_0 for Q4_K requests. * ggml: add laguna model support * server: preserve generate logprobs with builtin parsers Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output. Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser. * review comments - perf improvements * ggml: implement nemotron 3 nano omni * add poolside integration * update poolside doc * adapt to new cache setup * fix test * fix test --------- Co-authored-by: Eva Ho <hoyyeva@gmail.com>	2026-04-28 11:50:12 -07:00
Devon Rifkin	9e3618d663	make empty block conditional	2026-04-15 15:35:25 -07:00
Devon Rifkin	e585ecd11f	gemma4: render differently based on model size Following up on #15560, this change now has e2b/e4b render differently from 26b/31b. For backwards compatibility, we take the existing renderer name `gemma4` and make it do dynamic resolution based on the model name/size, but the intended use is for the models to be republished with the renderer variant specified explicitly: `gemma4-small` or `gemma4-large`.	2026-04-15 14:37:16 -07:00
Devon Rifkin	bf2a421727	gemma4: restore e2b-style nothink prompt (#15560 ) Gemma 4 prompts differ when thinking is disabled for different sized models: 26b/31b emit an empty thought block, while e2b/e4b do not. Before #15490, our shared Gemma 4 renderer effectively matched the e2b behavior. #15490 changed it to always emit the empty thought block, which regressed e2b/e4b nothink behavior and led to #15536 (and possibly This change restores the previous shared behavior by removing the empty trailing thought block. It also renames the checked-in upstream chat templates so the e2b and 31b fixtures are tracked separately. A follow-up will split Gemma 4 rendering by model size. Fixes: #15536	2026-04-13 14:26:15 -07:00
Devon Rifkin	5dfac387a6	Revert "gemma4: fix nothink case renderer (#15553 )" (#15556 ) This reverts commit `4d75f5da03`.	2026-04-13 13:12:18 -07:00
Devon Rifkin	ee0266462a	Revert "gemma4: add nothink renderer tests (#15554 )" (#15555 ) This reverts commit `1b70bb8a10`.	2026-04-13 13:00:59 -07:00
Devon Rifkin	1b70bb8a10	gemma4: add nothink renderer tests (#15554 ) Meant to include in #15553	2026-04-13 11:38:19 -07:00
Devon Rifkin	4d75f5da03	gemma4: fix nothink case renderer (#15553 ) Regressed in #15490 Fixes: #15536	2026-04-13 11:23:19 -07:00
Devon Rifkin	9330bb9120	gemma4: be less strict about whitespace before bare keys (#15494 )	2026-04-11 16:30:27 -07:00
Devon Rifkin	40a1317dfd	gemma4: update renderer to match new jinja template (#15490 ) * gemma4: update renderer to match new jinja template Google has updated their jinja template for gemma4, and so this change gives us parity with the new template. The parsing also slightly changed upstream, so we make a small change to our parser as well. I've also corrected a few probably existing edge cases, especially around type unions. The upstream output format is weird (a stringified array), but in practice the models seem to understand it well. * gemma4: special case simple `AnyOf`s The upstream template doesn't handle `AnyOf`s, but since in the previous commit we saw type unions work reasonably well, I'm now treating very simple `AnyOf`s as type unions to help in cases where they might be used * fix lint * gemma4: prefer empty instead of `None` We can't currently distinguish between a result being not-present vs. empty. The empty case seems more important (e.g., a legitimately empty tool call) * gemma4: be more careful for tool results with missing IDs	2026-04-10 15:45:27 -07:00
Devon Rifkin	fdfe9cec98	model/parsers: fix missing parallel tool call indices (#15467 ) We were missing setting the function index for several models that can make parallel tool calls. In the future we may want to consider putting some sort of post-parse hook and relieve the parsers of this duty. Fixes: #15457	2026-04-10 15:23:21 -07:00
Devon Rifkin	8c8f8f3450	model/parsers: add gemma4 tool call repair (#15374 ) The existing strict gemma4 tool parser is still the primary path, but if this fails, we try to repair by fixing some of the most commonly seen mistakes these models seem to make in practice. We repair by building up a set of candidates, and use the first candidate that parses. Repairs cover: - missing Gemma string delimiters - single-quoted string values, including a dangling Gemma delimiter - raw terminal string values (if the corresponding tool schema indicates it should be a string) - missing object close only after a concrete repair Add regression coverage for malformed tool calls from issue #15315 and focused unit tests for the individual repair helpers and candidate pipeline.	2026-04-06 18:47:17 -07:00
Devon Rifkin	34a790a2e6	model/parsers: suppress extra gemma4 closing tool tags (#15370 ) We've observed Gemma 4 occasionally emitting extra <tool_call\|> tags after a valid tool call. We suppress leading close tags in this immediate post-tool-call state so the extra close tags do not leak into assistant content. The tradeoff is that if the model intentionally begins its next content span with the literal string "<tool_call\|>", we will erroneously treat it as noise and drop it.	2026-04-06 12:41:33 -07:00
Devon Rifkin	49d5fd5a3e	model/parsers: rework gemma4 tool call handling (#15306 ) Replace the custom Gemma4 argument normalizer with a stricter reference-style conversion: preserve Gemma-quoted strings, quote bare keys, and then unmarshal the result as JSON. This keeps quoted scalars as strings, preserves typed unquoted values, and adds test coverage for malformed raw-quoted inputs that the reference implementation rejects.	2026-04-03 14:35:00 -07:00
Devon Rifkin	036ed1b9b5	model/parsers: fix gemma4 arg parsing when quoted strings contain " (#15254 ) * model/parsers: fix gemma4 arg parsing when quoted strings contain " Fixes: #15241 * add more tests, be careful about what we escape We want Windows-style paths to not get misinterpreted * fix backslash-quote case, it really should be a literal backslash h/t to @chathaway-codes for pointing this out! Co-Authored-By: Charles H <2773397+chathaway-codes@users.noreply.github.com> --------- Co-authored-by: Charles H <2773397+chathaway-codes@users.noreply.github.com>	2026-04-02 22:52:51 -07:00
Daniel Hiltgen	de9673ac3f	tokenizer: add byte fallback for SentencePiece BPE encoding (#15232 ) * tokenizer: add byte fallback for SentencePiece BPE encoding When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes. Fixes #15229, fixes #15231 * tokenizer fixes	2026-04-02 13:04:45 -07:00
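The byte fallback described in `de9673ac3f` can be sketched as follows. This is a minimal illustration, not the code from that commit: the function names `byteFallback` and `decodeTokens` are hypothetical, and tokens are represented as plain strings rather than vocabulary IDs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// byteFallback (hypothetical name) encodes every UTF-8 byte of s as a
// "<0xHH>" token string instead of dropping the character.
func byteFallback(s string) []string {
	toks := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		toks = append(toks, fmt.Sprintf("<0x%02X>", s[i]))
	}
	return toks
}

// decodeTokens (hypothetical name) turns "<0xHH>" tokens back into raw
// bytes and copies every other token through unchanged.
func decodeTokens(toks []string) string {
	var b strings.Builder
	for _, t := range toks {
		if len(t) == 6 && strings.HasPrefix(t, "<0x") && strings.HasSuffix(t, ">") {
			if v, err := strconv.ParseUint(t[3:5], 16, 8); err == nil {
				b.WriteByte(byte(v))
				continue
			}
		}
		b.WriteString(t)
	}
	return b.String()
}

func main() {
	toks := byteFallback("é") // U+00E9 is the two bytes C3 A9 in UTF-8
	fmt.Println(toks)
	fmt.Println(decodeTokens(toks))
}
```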
Daniel Hiltgen	96b202d34b	Add support for gemma4 (#15214 ) * bench: add prompt calibration, context size flag, and NumCtx reporting Add --num-ctx flag to set context size, and report NumCtx in model info header. Calibrate tokens-per-word ratio during warmup using actual tokenization metrics from the model, replacing the fixed 1.3 heuristic. This produces more accurate prompt token counts for --prompt-tokens. Also add fetchContextLength() to query running model context via /api/ps. * integration: improve vision test robustness and add thinking tests Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL is set to a non-vision model. Add Think:false to context exhaustion test to prevent thinking models from using all context before the test can measure it. Add third test image (ollama homepage) and replace OCR test with ImageDescription test using it. Relax match strings for broader model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed to verify thinking output and channel tag handling. * gemma4: add Gemma 4 GGML model support Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense) for the GGML backend including text, vision, converter, parser, and renderer. 
Text model features: - Sliding window + full attention with per-layer patterns - KV sharing across layers with donor map - Per-layer embeddings (PLE) with learned projections - MoE routing with RMSNorm + learned scale - Proportional RoPE with freq_factors for global attention - Final logit softcapping Vision model features: - SigLIP vision encoder with 2D RoPE - ClippableLinear with input/output clamping via packed v.clamp_data - Adaptive average pooling with nMerge kernel - Multi-modal projection with unweighted RMSNorm Converter: - Safetensors to GGUF with vision tensor renaming - Fused MoE gate_up_proj splitting - Vision patch embedding reshape (HF to Conv2D layout) - Packed clamp data tensor for ClippableLinear bounds - Proportional RoPE freq_factors generation Also includes: - BackendGet() on ml.Tensor for reading weight tensor data - Q6_K CUDA get_rows kernel support - MoE-aware ffn_down quantization layer counting - Gemma4 parser with tool calling and thinking support - Gemma4 renderer with structured tool format - Architecture-based auto-detection of renderer/parser/stop tokens - Integration test gemma4 model list additions * gemma4: add audio support with USM conformer encoder Add audio encoding for Gemma 4 using the USM conformer architecture: - Converter: audio tensor mapping, SSCP/conformer/embedder name replacements, softplus repacker for per_dim_scale, F32 enforcement for conv weights - GGML backend: Conv1DDW and PadExt tensor ops - Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local attention with relative position embeddings + LightConv1d + FFW), output projection, audio-to-text embedding projector - Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go) - Model wiring: WAV detection, audio token handling, unified PostTokenize Correctly transcribes "why is the sky blue" from test audio. 
* integration: add gemma4 audio tests including OpenAI API coverage Test audio transcription and response via the Ollama native API, plus two new tests exercising the OpenAI-compatible endpoints: - /v1/audio/transcriptions (multipart form upload) - /v1/chat/completions with input_audio content type All tests use capability checks and skip models without audio support. * gemma4: add OpenAI audio API support and capability detection - Add CapabilityAudio and detect from audio.block_count in GGUF - Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware - Add input_audio content type support in /v1/chat/completions - Add TranscriptionRequest/Response types in openai package * gemma4: add audio input support for run command - /audio toggle in interactive mode for voice chat - Platform-specific microphone recording (AVFoundation on macOS, PulseAudio/ALSA on Linux, WASAPI on Windows) - Space to start/stop recording, automatic chunking for long audio * gemma4: add transcribe command (ollama transcribe MODEL) - Interactive mode with readline prompt and slash commands - Non-interactive mode for piped audio or record-until-Ctrl+C - Chunked streaming transcription for long recordings - Word-wrapped output matching run command style * gemma4: add parser, renderer, and integration test plumbing * gemma4: fix renderer to emit BOS token * gemma4: add OpenAI audio transcription API and input_audio support * gemma4: update converter for new weight drop naming * gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config * gemma4: rewrite renderer to match HF Jinja2 template exactly Fix 8 bugs found by building 55 reference tests verified against the HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python): - Tool responses use separate <\|turn>tool turns (not inline tags) - Tool calls emitted before content in assistant messages - Thinking content stripped from assistant history (strip_thinking) - User, tool, and system content trimmed (template 
does \| trim) - Empty system message still emits system turn (check role, not content) - Nested object properties rendered recursively with required field - Array items specification rendered for array-type properties - OBJECT/ARRAY type-specific rendering comma logic matches template Also adds Required field to api.ToolProperty for nested object schemas, replaces old gemma4_test.go with comprehensive gemma4_reference_test.go, and commits the Jinja2 template as testdata for verification. * gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing - Text MoE: split `ffn_gate_up_exps` into contiguous `[gate\|up]` halves instead of stride-2 slices. - Parser: escape control characters in `<\|"\|>...<\|"\|>` string literals when converting tool-call args to JSON. - Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments. - Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`. * cmd: simplify audio input to dropped file attachments * gemma4: use full SWA memory for better cache reuse * gemma4: initialize clamps after backend load * convert: align gemma4 audio tensor renames with llama.cpp * Remove redundant comments in gemma4 vision model * Format Gemma4 MoE block field alignment * use 4096 kvcache.NewSWAMemCache * convert: support new Gemma4 audio_tower tensor naming (#15221) Co-authored-by: jmorganca <jmorganca@gmail.com> * fix integration test defaults for audio * review comments and lint fixes * remove unused audio/video files --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2026-04-02 11:33:33 -07:00
Jeffrey Morgan	b7bda92d52	model: add qwen3-next compatibility for legacy ssm_in projections (#15133 )	2026-03-29 11:50:47 -07:00
Jesse Gross	ac83ac20c4	anthropic: fix KV cache reuse degraded by tool call argument reordering Use typed structs for tool call arguments instead of map[string]any to preserve JSON key order, which Go maps do not guarantee.	2026-03-27 14:30:16 -07:00
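The key-order problem named in `ac83ac20c4` is easy to reproduce with `encoding/json`. The sketch below is an illustration under assumptions, not the commit's code: the `writeArgs` struct and the helper names are hypothetical. Round-tripping through `map[string]any` re-emits keys in sorted order, while a typed struct always emits them in field declaration order, so the rendered prompt stays byte-identical across turns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// writeArgs is a hypothetical typed argument struct; field order here
// determines the key order of the marshaled JSON.
type writeArgs struct {
	Path    string `json:"path"`
	Content string `json:"content"`
}

// viaMap round-trips the arguments through map[string]any.
// encoding/json sorts map keys on output, so the input order is lost.
func viaMap(in string) string {
	var m map[string]any
	if err := json.Unmarshal([]byte(in), &m); err != nil {
		return ""
	}
	out, _ := json.Marshal(m)
	return string(out)
}

// viaStruct round-trips the arguments through the typed struct.
func viaStruct(in string) string {
	var a writeArgs
	if err := json.Unmarshal([]byte(in), &a); err != nil {
		return ""
	}
	out, _ := json.Marshal(a)
	return string(out)
}

func main() {
	in := `{"path":"a.txt","content":"hi"}`
	fmt.Println(viaMap(in))    // {"content":"hi","path":"a.txt"}
	fmt.Println(viaStruct(in)) // {"path":"a.txt","content":"hi"}
}
```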
Jeffrey Morgan	69ed0c2729	parsers: qwen3.5 streaming tool-call parsing and add regression test (#15098 )	2026-03-27 14:04:14 -07:00
Alfredo Matas	1cefa749aa	model/parsers: close think block if tool block starts in Qwen3.5 (#15022 )	2026-03-27 11:28:34 -07:00
Bruce MacDonald	126d8db7f3	parsers: robust xml tool repair (#14961 ) Previous xml repair for glm was a good start, but we need to go further and repair any incorrect open or closing tags Co-authored-by: Dongluo Chen <dongluo.chen@gmail.com>	2026-03-19 11:24:48 -07:00
Bruce MacDonald	1af850e6e3	parsers: repair unclosed arg_value tags in GLM tool calls (#14656 ) GLM models sometimes omit </arg_value> closing tags in tool call XML, causing xml.Unmarshal to fail with "element <arg_value> closed by </tool_call>". This is a known issue across the GLM family. Sanitize the input to fix closing arg_key values so encoding/xml can handle it.	2026-03-06 14:08:34 -08:00
Jeffrey Morgan	82848a7806	model: fix renderer and parser for qwen3.5 (#14605 )	2026-03-03 20:58:29 -08:00
Victor-Quqi	e8fcb29586	model/renderers: fix glm-ocr image tags in renderer prompts (#14584 )	2026-03-03 12:51:34 -08:00
Jeffrey Morgan	3490e9590b	model/qwen3next: avoid crash in DeltaNet when offloading (#14541 ) Co-authored-by: Yossi Ovadia <jabadia@gmail.com>	2026-03-01 18:44:04 -08:00
Jeffrey Morgan	8da09b1e7e	qwen3next: add compatibility with imported GGUF models (#14517 )	2026-02-28 14:21:42 -08:00
Parth Sareen	cc90a035a0	model/parsers: add stable tool call indexing for glm47 and qwen3 parsers (#14484 )	2026-02-26 18:14:29 -08:00
Jeffrey Morgan	d98dda4676	model: fix qwen3 tool calling in thinking (#14477 ) Align Qwen parser behavior with Transformers serve by allowing <tool_call> parsing while still in thinking collection. Changes: - qwen3vl: detect <tool_call> before </think> in thinking state and transition to tool parsing - qwen3: same thinking-state tool detection and partial-tag overlap handling - tests: update qwen3vl thinking/tool interleaving expectations - tests: add qwen3 cases for tool call before </think> and split <tool_call> streaming	2026-02-26 16:13:18 -08:00
Jeffrey Morgan	7f9efd53df	model: add support for qwen3.5-27b model (#14415 )	2026-02-25 01:09:58 -08:00
Jeffrey Morgan	da70c3222e	model: support for qwen3.5 architecture (#14378 )	2026-02-24 20:08:05 -08:00
Jeffrey Morgan	4b2ac1f369	model: improvements to LFM architectures (#14368 )	2026-02-23 14:38:10 -08:00
Jeffrey Morgan	0ade9205cc	models: add nemotronh architecture support (#14356 )	2026-02-22 15:09:14 -08:00
Patrick Devine	9aefd2dfee	model: add qwen3 support to mlxrunner (#14293 )	2026-02-17 13:58:49 -08:00
Michael Yang	f1373193dc	move tokenizers to separate package (#13825 )	2026-02-05 17:44:11 -08:00
Jeffrey Morgan	d25535c3f3	qwen3next: avoid inplace sigmoid for shared gate (#14077 )	2026-02-04 15:50:02 -08:00
Jeffrey Morgan	255579aaa7	qwen3next: fix issue in delta net (#14075 ) gDiffExp was being broadcast across the wrong axis when multiplying with k. This fix reshapes gDiffExp to [1, chunkSize, nChunks, ...]	2026-02-04 13:40:38 -08:00
Jeffrey Morgan	77eb2ca619	model: add qwen3-next architecture (#14051 )	2026-02-03 23:27:21 -08:00
Jeffrey Morgan	8f4a008139	Add GLM-OCR vision model support (#14024 )	2026-02-02 15:39:18 -08:00
Gyungrai Wang	e0f03790b1	parsers/ministral: fix nested tool call parsing by counting brace nesting (#13905 ) * parsers/ministral: fix nested tool call parsing by counting brace nesting * fix lint error * parsers: refactor ministral parser The old one was very tied to expecting to see only one token at a time, which I don't like to assume (who knows what the future might hold wrt speculative decoding, etc). This new one follows a similar structure to qwen3-coder's parser, which incidentally makes it easier to test as well (since we can test the individual events that come out when given particular inputs). --------- Co-authored-by: Devon Rifkin <drifkin@drifkin.net>	2026-01-26 15:03:43 -08:00
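Counting brace nesting, as `e0f03790b1` describes for nested tool calls, can be sketched like this. It is a simplified illustration, not the ministral parser itself: `jsonObjectEnd` is a hypothetical helper that assumes the whole candidate text is already buffered and starts with `{`.

```go
package main

import "fmt"

// jsonObjectEnd (hypothetical helper) returns the index just past the '}'
// that closes the object starting at s[0], or -1 if s does not start with
// '{' or the object is not yet complete. Braces inside JSON strings are
// ignored, including strings that contain escaped quotes.
func jsonObjectEnd(s string) int {
	if len(s) == 0 || s[0] != '{' {
		return -1
	}
	depth := 0
	inString, escaped := false, false
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case escaped:
			escaped = false // this byte was escaped by a preceding backslash
		case inString:
			if c == '\\' {
				escaped = true
			} else if c == '"' {
				inString = false
			}
		case c == '"':
			inString = true
		case c == '{':
			depth++
		case c == '}':
			depth--
			if depth == 0 {
				return i + 1
			}
		}
	}
	return -1
}

func main() {
	obj := `{"a":{"b":"}"}}`
	end := jsonObjectEnd(obj + " trailing text")
	fmt.Println(end == len(obj))
	fmt.Println(jsonObjectEnd(`{"a":{"b":1}`))
}
```

Returning -1 for an incomplete object lets a streaming caller keep buffering until the closing brace arrives, regardless of how many tokens each chunk carries.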
Jeffrey Morgan	a1ca428c90	glm4moelite: fix attention scale calculation (#13893 ) Use the original key dimension (qkNopeHeadDim + qkRopeHeadDim = 256) for the attention scale instead of the MLA absorbed dimension (kvLoraRank + qkRopeHeadDim = 576). MLA absorption is a mathematically equivalent reorganization of the attention computation - it should not change the effective attention scale. The scale should match training, which uses 1/sqrt(256). This improves tool calling and model looping issues.	2026-01-24 17:48:09 -08:00
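For reference, the two candidate attention scales in `a1ca428c90` work out as shown below. This is only the arithmetic from the commit message: the dimensions 256 and 576 come from the message, while the function name `attnScale` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// attnScale (hypothetical name) returns 1/sqrt(dim), the usual
// scaled-dot-product attention factor for a key dimension of dim.
func attnScale(dim int) float64 {
	return 1 / math.Sqrt(float64(dim))
}

func main() {
	// Original key dimension: qkNopeHeadDim + qkRopeHeadDim = 256.
	fmt.Printf("%.4f\n", attnScale(256)) // 0.0625
	// MLA absorbed dimension: kvLoraRank + qkRopeHeadDim = 576.
	fmt.Printf("%.4f\n", attnScale(576)) // 0.0417
}
```

Using the absorbed dimension would shrink the scale by a factor of 16/24, which is why the fix keeps the scale tied to the dimension used in training.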
Jeffrey Morgan	16750865d1	glm4moelite: quantize more tensors to q8_0 and avoid double BOS token (#13891 )	2026-01-24 16:33:54 -08:00
Jeffrey Morgan	64737330a4	Re-apply "model: add MLA absorption for glm4moelite" with fix (#13870 ) The nvidia_fp32 config for (576, 512) head sizes had nbatch_fa=32, which caused zero-sized arrays when computing array dimensions: nbatch_fa / (np * warp_size) = 32 / (2 * 32) = 0 This resulted in CUDA compilation failures on CUDA 12 (Windows and Linux arm64): - "static assertion failed with nbatch_fa % (np*warp_size) != 0" - "the size of an array must be greater than zero" Fix by changing nbatch_fa from 32 to 64 for all (576, 512) configs in the nvidia_fp32 function, matching the nvidia_fp16 and AMD configs.	2026-01-23 18:40:28 -08:00
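The zero-sized array in `64737330a4` comes from integer division. A small sketch of the arithmetic, using the values quoted in the commit message (`np = 2`, `warp_size = 32`); the function name `arrayLen` is hypothetical and the real computation lives in CUDA template code:

```go
package main

import "fmt"

// arrayLen (hypothetical name) mirrors the quoted expression
// nbatch_fa / (np * warp_size) with truncating integer division.
func arrayLen(nbatchFA, np, warpSize int) int {
	return nbatchFA / (np * warpSize)
}

func main() {
	fmt.Println(arrayLen(32, 2, 32)) // 0: a zero-length array, rejected by the compiler
	fmt.Println(arrayLen(64, 2, 32)) // 1: valid after raising nbatch_fa to 64
	fmt.Println(32 % (2 * 32))       // 32: non-zero, so the static assertion fires
	fmt.Println(64 % (2 * 32))       // 0: the static assertion passes
}
```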
Jeffrey Morgan	2eda97f1c3	Revert "model: add MLA absorption for glm4moelite (#13810 )" (#13869 ) This reverts commit `1044b0419a`.	2026-01-23 17:14:15 -08:00
Jeffrey Morgan	1044b0419a	model: add MLA absorption for glm4moelite (#13810 ) * model: add MLA absorption for glm4moelite Split the combined KV_B tensor into separate K_B and V_B tensors during conversion, enabling MLA (Multi-head Latent Attention) absorption which compresses the KV cache for improved efficiency. * ggml: enable MLA flash attention for GLM-4.7-flash Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash uses head size 576 with gqa_ratio 4, which was previously only supported for gqa_ratio 16 (DeepSeek). Metal changes: - Enable head size 576 for flash attention - Increase simdgroups to 8 for large heads (>=512) - Add case 8 kernel dispatch for 8 simdgroups CUDA changes: - Add gqa_ratio 4 support for head 576/512 - Add tile configs for (576, 512, 4) and (576, 512, 8) - Add MMA config cases for ncols 4 - Add template instances for ncols2=4 * model: add compatibility validation for glm4moelite architecture	2026-01-23 14:47:42 -08:00

228 Commits