This cleans up the capabilities logic so we can log more information about the various options we consider as well as the final template version we use.
* llama-server followups
Misc fixes for #16031
- Add back dropped ROCm build flag for multi-GPU support on windows
- Fix amdhip64_*.dll version detection for "latest" selection
- Fix embeddings API for consistent normalize behavior with prior versions
* ci: set up for automated llama.cpp update testing
* reduce batch for fa-disabled, and constrained vram
* mlx: fix v3 load bug on m5
Imagegen was incorrectly loading v3 first. This DRYs out the loading code so imagegen gets the same new v4/v3 selection logic.
* fix reload bug on embedding models
* bump version
* steer users on how to enable iGPU when disabled
This change addresses some problems with GGUF conversion including:
* correctly naming the MoE tensors
* correctly quantizing the nextn.eh_proj.weight MTP tensor
* broad lint fixes to sidestep CI scope glitch
* runner: Remove CGO engines, use llama-server exclusively for GGML models
Remove the vendored GGML and llama.cpp backend, CGO runner, Go model
implementations, and sample. llama-server (built from upstream llama.cpp via
FetchContent) is now the sole inference engine for GGUF-based models.
(Safetensor based models continue to run on the new MLX engine.) This allows
us to more rapidly pick up new capabilities and fixes from llama.cpp as they
come out.
On Windows this now requires recent AMD driver versions that support ROCm v7, as
llama.cpp currently does not support building against v6.
* llama/compat: load Ollama-format GGUFs in llama-server
Squashed from upstream/jmorganca/llama-compat on 2026-04-29.
Source tip: 0c33775d37.
Original source commits:
- 25223160d llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs
- 7449b539a llm,server: route Ollama-format gemma3 blobs through llama/compat
- 436f2e2b1 llama/compat: make patch-apply idempotent
- 8c2c9d4c8 llama/compat: extend gemma3 handler to cover 1B and 270M blobs
- 021389f7b llama/compat: shrink clip.cpp injection from 18 lines to 1
- 61b367ec2 llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines)
- 36049361c llama/compat: simplify shim (gemma3-tested)
- 8fa664865 llama/compat: add qwen35moe text handler
- db0c74530 llama/compat: add qwen35moe vision (clip) support
- 2a388da77 llama/compat: split shared infra into a util TU
- 9a69a17dc llama/compat: document non-public API dependencies
- d0f38a915 llama/compat: add gpt-oss and lfm2 handlers
- 086071822 llama/compat: add mistral3 text handler (vision TODO)
- 63bde9ff7 llama/compat: add mistral3 vision (clip) support
- 3a57b89d5 llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K
- 99cb87439 llama/compat: add qwen35, gemma4, deepseek-ocr handlers
- 2c7850dba llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip)
- 9e3b54225 llama/compat: add llama4 text + clip handlers
- 034fee349 llama/compat: add gemma4 clip handler (gemma4v projector)
- 9945c5a93 server: remove dhiltgen/* compat redirect table
- 5d4539101 llama/compat: rewrite gemma4 tokenizer model to BPE
- 7e0765327 llama/compat: add glm-ocr text handler + text-loader load-op hook
- f1bd1a25a llama/compat: add glm-ocr clip handler (glm4v projector)
- 4b5cf3420 llama/compat: collapse text-loader hook back to one new patch line
- eb4ecf4fc llama/compat: extend gemma4 clip handler to gemma4a (audio)
- a23a5e76f llama/compat: fix gemma4a per-block norm tensor mapping
- cd2dcaff4 llama/compat: add embeddinggemma handler
- 1ce8a6b26 llama/compat: add qwen3-vl + qwen2.5-vl handlers
- fd98ffa1e llama/compat: add gemma3n + glm4moelite handlers
- cc7bdf0bc llama/compat: handle null buft in maybe_load_tensor
- 0c33775d3 llama/compat: disable mmap when load_op transforms text-side tensors
* refine implementation
* ci: fix windows MLX build
* ci: fix windows llama-server build
* ci: fix windows rocm build
* ci: windows mlx tuning
Shorten the long tail of the build, and get OllamaSetup.exe back under the 2 GB limit
* ci: fix windows dependencies
* win: fix dependency gathering
* disable openmp
* win: arm64 cross-compile build
also DRY out CI steps
* scheduler improvements
* ci: improvements from #15982
* win: favor ninja for faster developer builds
* win: fix build
* win: fix arm64 cross-compile
* win: avoid spaces in compiler path
* misc discovery fixes, and bos handling
* lint fixes
* win: fix arm cross-compile build/CI bugs
* llama.cpp update
* win: handle multiple CRT dirs
* vulkan: add windows iGPU detection
* fix creation bugs for patched models, other refactoring work
* tune batch size for better performance
* ci and lint fixes
* fix repeat_last_n bug
* build: revamp build for better developer UX
* amd, sampler, qwen3next fixes
* version bump
* fix mlx build
* revamp GPU discovery
Scanning the output of llama-server is turning out to be too error prone across
llama.cpp updates, so this switches to a thin dynamic library load against the
bundled GGML libraries so more details can be gathered from the API.
* version bump
* missing file
* ci: fix cache miss on rocm build
* refine vulkan dep handling
* fix ps reporting bug on full GPU load
* improve cmake wiring for customized local builds
* version bump
* docker build arg cleanup
* improve windows exit error logs
* fix community gemma4 support and ci flakes
* fix mlx unit test
* tighten up ps logic to avoid double counting fit log lines
* version bump
* fix ps view for full gpu layer offload
* add MTP wiring for llama-server and create with GGUFs
* pick best template by capabilities
* version bump
* ci: harden apt repos
* remove unused cpu core discovery
* adjust batch default logic to reduce OOMs
* support larger tool calls
* fix audio support, template show
* qwen35 mtp patch support
* flesh out dtypes
* rocm deps
* version bump
* lint fix
* block broken gfx1150 on windows
* fix qwen3.5 moe mtp tensors in patch
* mmproj oom fallback and vulkan on by default
* qwen MTP compat fix
* version bump
* ci: fix WoA cross-compile
* ci: workaround ui tool in cross-compile
* version bump
* win: enable OpenMP for CPU builds
* build: improve developer UX
* ci: windows path workaround for CPU build
* win: fix WoA dependencies
* win: fix large offset reads for mmproj patched loads
* version bump
* fix vulkan dup detection
* add OLLAMA_IGPU_ENABLE and largely disable iGPUs by default
* opt-in MTP, win large offset, integration fixes
* fix unit test scheduler interaction hang
* fix multi-gpu filtering
* version bump
* review comments
* fix thinking level
* fix linux rocm ordering and granite 3.3 template
* version bump
* ci fix - non-shallow MLX checkout
* bypass linux sysfs unit test on windows
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
This change updates the show API for MLX models to:
* display the correct quantization in mixed precision models
* not display the global_scale scalar value
* not duplicate the `tools` capability
* Reduce startup model hydration
Add a lightweight model list cache for tags and launch inventory, while keeping show cache population lazy. This avoids loading every local model at startup on large model stores.
* harden flaky scheduler unit test
* remove extra launch model metadata text
* review comments
* review comments
* app: harden update flows
This hardens the Windows update flows and adds a new opt-in, CI-triggered unit test that exercises Mac/Windows updates with verification.
* test: harden unit tests for OLLAMA_MODELS being set
* app: harden updater
* mlx: refined model push behavior
Refine the algorithm for parallel push of safetensors based models to get
better reliability and throughput.
* review comments, hardening, and performance tuning for slow links
* review comments
* mlx: add laguna model support
* convert: support fp8 safetensors import
Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights.
Use that source-precision metadata during create quantization: default FP8-sourced GGUFs to Q8_0, keep non-FP8 tensors at their original precision for Q8_0, and promote non-FP8 quantizable tensors to Q8_0 for Q4_K requests.
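The FP8 decode step above can be sketched as follows. This is a minimal illustration, not the converter's actual code: it assumes the E4M3 "FN" layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN pattern) and a single scalar scale per block. `decodeE4M3` and `dequantize` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// decodeE4M3 converts one FP8 E4M3 byte to float32.
// Assumed layout: sign(1) | exponent(4, bias 7) | mantissa(3).
func decodeE4M3(b byte) float32 {
	sign := float32(1)
	if b&0x80 != 0 {
		sign = -1
	}
	exp := int((b >> 3) & 0x0F)
	man := int(b & 0x07)

	switch {
	case exp == 0x0F && man == 0x07:
		// The only NaN encoding in the FN variant; there are no infinities.
		return float32(math.NaN())
	case exp == 0:
		// Subnormal: (man/8) * 2^-6.
		return sign * float32(math.Ldexp(float64(man)/8, -6))
	default:
		// Normal: (1 + man/8) * 2^(exp-7).
		return sign * float32(math.Ldexp(1+float64(man)/8, exp-7))
	}
}

// dequantize applies a block scale to a decoded FP8 value.
// A real converter would look the scale up per block of weights.
func dequantize(b byte, scale float32) float32 {
	return decodeE4M3(b) * scale
}

func main() {
	fmt.Println(decodeE4M3(0x38))      // exponent 7, mantissa 0
	fmt.Println(decodeE4M3(0x7E))      // largest finite value
	fmt.Println(dequantize(0x38, 0.5)) // scaled by the block companion
}
```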
* ggml: add laguna model support
* server: preserve generate logprobs with builtin parsers
Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output.
Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser.
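The forwarding rule described above amounts to one extra condition. A minimal sketch, assuming a simplified chunk shape; the `chunk` type and `shouldForward` are hypothetical and do not mirror the server's real types:

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for one streamed generate response.
type chunk struct {
	Response  string
	Thinking  string
	ToolCalls []string
	Logprobs  []float64
}

// shouldForward reports whether a chunk carries anything the client needs.
// Before the fix, only the visible fields were considered, so logprob-only
// chunks were dropped while the parser was still buffering content.
func shouldForward(c chunk) bool {
	hasVisible := c.Response != "" || c.Thinking != "" || len(c.ToolCalls) > 0
	return hasVisible || len(c.Logprobs) > 0
}

func main() {
	buffered := chunk{Logprobs: []float64{-0.12}}
	fmt.Println(shouldForward(buffered)) // logprob-only chunk is kept
	fmt.Println(shouldForward(chunk{}))  // truly empty chunk is skipped
}
```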
* review comments - perf improvements
* ggml: implement nemotron 3 nano omni
* add poolside integration
* update poolside doc
* adapt to new cache setup
* fix test
* fix test
---------
Co-authored-by: Eva Ho <hoyyeva@gmail.com>
If you have a long running create, and start another ollama server with the
same model dir, the GC algorithm deletes the pending blobs and breaks the
create. This adds a 1h grace period to avoid deleting in-flight creation
operations.
Following up on #15560, this change now has e2b/e4b render differently
from 26b/31b.
For backwards compatibility, we take the existing renderer name `gemma4`
and make it do dynamic resolution based on the model name/size, but the
intended use is for the models to be republished with the renderer
variant specified explicitly: `gemma4-small` or `gemma4-large`.
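The backwards-compatible resolution could look roughly like this. It is a sketch under assumptions: that e2b/e4b map to `gemma4-small`, that everything else falls back to `gemma4-large`, and that matching is a substring check on the model name. `resolveGemma4Renderer` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveGemma4Renderer maps the legacy "gemma4" renderer name to a
// size-specific variant. Explicit variants pass through untouched.
func resolveGemma4Renderer(renderer, model string) string {
	if renderer != "gemma4" {
		return renderer
	}
	m := strings.ToLower(model)
	for _, size := range []string{"e2b", "e4b"} {
		if strings.Contains(m, size) {
			return "gemma4-small"
		}
	}
	// Assumption: anything that is not e2b/e4b uses the large variant.
	return "gemma4-large"
}

func main() {
	fmt.Println(resolveGemma4Renderer("gemma4", "gemma4:e2b"))
	fmt.Println(resolveGemma4Renderer("gemma4", "gemma4:31b"))
	fmt.Println(resolveGemma4Renderer("gemma4-small", "gemma4:31b"))
}
```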
This change fixes two issues with Modelfiles:
1. If a user ran `ollama show --modelfile` on a safetensors based
model, the Modelfile would leave the "FROM" field blank, which prevented
the user from recreating the model. This change adds the model's current
canonical short name to the FROM field.
2. If a user ran the `/save` command in the CLI, any messages that were
saved in a previous model were dropped (only the messages from the
current session were saved).
* create: Clean up experimental paths
This cleans up the experimental features, and adds both unit and integration test coverage to verify no regressions.
* create: preserve config and layer names when creating from safetensors models
When creating a model FROM an existing safetensors model, ModelFormat,
Capabilities, and layer Name fields were lost. ModelFormat stayed empty
because it's only set from GGML layers (which safetensors models lack),
and layer names weren't copied in parseFromModel. This caused derived
models to fail loading ("config.json not found in manifest").
* review comments
* bench: add prompt calibration, context size flag, and NumCtx reporting
Add --num-ctx flag to set context size, and report NumCtx in model info
header. Calibrate tokens-per-word ratio during warmup using actual
tokenization metrics from the model, replacing the fixed 1.3 heuristic.
This produces more accurate prompt token counts for --prompt-tokens.
Also add fetchContextLength() to query running model context via /api/ps.
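The calibration idea can be sketched as below. This is illustrative only: `calibrate` and `wordsForTokens` are hypothetical helpers, and the fallback to 1.3 when warmup metrics are missing is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"math"
)

// defaultTokensPerWord is the fixed heuristic that calibration replaces.
const defaultTokensPerWord = 1.3

// calibrate derives tokens-per-word from a warmup request: the prompt
// token count reported by the model divided by the words that were sent.
func calibrate(promptTokens, words int) float64 {
	if promptTokens <= 0 || words <= 0 {
		return defaultTokensPerWord
	}
	return float64(promptTokens) / float64(words)
}

// wordsForTokens estimates how many words to generate so that the prompt
// lands near the requested token count.
func wordsForTokens(targetTokens int, ratio float64) int {
	return int(math.Round(float64(targetTokens) / ratio))
}

func main() {
	ratio := calibrate(150, 100) // warmup: 100 words became 150 tokens
	fmt.Println(ratio, wordsForTokens(300, ratio))
}
```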
* integration: improve vision test robustness and add thinking tests
Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL
is set to a non-vision model. Add Think:false to context exhaustion test
to prevent thinking models from using all context before the test can
measure it. Add third test image (ollama homepage) and replace OCR test
with ImageDescription test using it. Relax match strings for broader
model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed
to verify thinking output and channel tag handling.
* gemma4: add Gemma 4 GGML model support
Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense)
for the GGML backend including text, vision, converter, parser, and
renderer.
Text model features:
- Sliding window + full attention with per-layer patterns
- KV sharing across layers with donor map
- Per-layer embeddings (PLE) with learned projections
- MoE routing with RMSNorm + learned scale
- Proportional RoPE with freq_factors for global attention
- Final logit softcapping
Vision model features:
- SigLIP vision encoder with 2D RoPE
- ClippableLinear with input/output clamping via packed v.clamp_data
- Adaptive average pooling with nMerge kernel
- Multi-modal projection with unweighted RMSNorm
Converter:
- Safetensors to GGUF with vision tensor renaming
- Fused MoE gate_up_proj splitting
- Vision patch embedding reshape (HF to Conv2D layout)
- Packed clamp data tensor for ClippableLinear bounds
- Proportional RoPE freq_factors generation
Also includes:
- BackendGet() on ml.Tensor for reading weight tensor data
- Q6_K CUDA get_rows kernel support
- MoE-aware ffn_down quantization layer counting
- Gemma4 parser with tool calling and thinking support
- Gemma4 renderer with structured tool format
- Architecture-based auto-detection of renderer/parser/stop tokens
- Integration test gemma4 model list additions
* gemma4: add audio support with USM conformer encoder
Add audio encoding for Gemma 4 using the USM conformer architecture:
- Converter: audio tensor mapping, SSCP/conformer/embedder name replacements,
softplus repacker for per_dim_scale, F32 enforcement for conv weights
- GGML backend: Conv1DDW and PadExt tensor ops
- Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local
attention with relative position embeddings + LightConv1d + FFW),
output projection, audio-to-text embedding projector
- Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go)
- Model wiring: WAV detection, audio token handling, unified PostTokenize
Correctly transcribes "why is the sky blue" from test audio.
* integration: add gemma4 audio tests including OpenAI API coverage
Test audio transcription and response via the Ollama native API, plus
two new tests exercising the OpenAI-compatible endpoints:
- /v1/audio/transcriptions (multipart form upload)
- /v1/chat/completions with input_audio content type
All tests use capability checks and skip models without audio support.
* gemma4: add OpenAI audio API support and capability detection
- Add CapabilityAudio and detect from audio.block_count in GGUF
- Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware
- Add input_audio content type support in /v1/chat/completions
- Add TranscriptionRequest/Response types in openai package
* gemma4: add audio input support for run command
- /audio toggle in interactive mode for voice chat
- Platform-specific microphone recording (AVFoundation on macOS,
PulseAudio/ALSA on Linux, WASAPI on Windows)
- Space to start/stop recording, automatic chunking for long audio
* gemma4: add transcribe command (ollama transcribe MODEL)
- Interactive mode with readline prompt and slash commands
- Non-interactive mode for piped audio or record-until-Ctrl+C
- Chunked streaming transcription for long recordings
- Word-wrapped output matching run command style
* gemma4: add parser, renderer, and integration test plumbing
* gemma4: fix renderer to emit BOS token
* gemma4: add OpenAI audio transcription API and input_audio support
* gemma4: update converter for new weight drop naming
* gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config
* gemma4: rewrite renderer to match HF Jinja2 template exactly
Fix 8 bugs found by building 55 reference tests verified against the
HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python):
- Tool responses use separate <|turn>tool turns (not inline tags)
- Tool calls emitted before content in assistant messages
- Thinking content stripped from assistant history (strip_thinking)
- User, tool, and system content trimmed (template does | trim)
- Empty system message still emits system turn (check role, not content)
- Nested object properties rendered recursively with required field
- Array items specification rendered for array-type properties
- OBJECT/ARRAY type-specific rendering comma logic matches template
Also adds Required field to api.ToolProperty for nested object schemas,
replaces old gemma4_test.go with comprehensive gemma4_reference_test.go,
and commits the Jinja2 template as testdata for verification.
* gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing
- Text MoE: split `ffn_gate_up_exps` into contiguous `[gate|up]` halves instead of stride-2 slices.
- Parser: escape control characters in `<|"|>...<|"|>` string literals when converting tool-call args to JSON.
- Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments.
- Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`.
* cmd: simplify audio input to dropped file attachments
* gemma4: use full SWA memory for better cache reuse
* gemma4: initialize clamps after backend load
* convert: align gemma4 audio tensor renames with llama.cpp
* Remove redundant comments in gemma4 vision model
* Format Gemma4 MoE block field alignment
* use 4096 kvcache.NewSWAMemCache
* convert: support new Gemma4 audio_tower tensor naming (#15221)
Co-authored-by: jmorganca <jmorganca@gmail.com>
* fix integration test defaults for audio
* review comments and lint fixes
* remove unused audio/video files
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
pullModelManifest unmarshals the registry response into a Go struct
then re-marshals with json.Marshal before writing to disk. When the
registry's JSON formatting or field ordering differs from Go's
output, the local SHA256 won't match the registry's
Ollama-Content-Digest header, causing false "out of date" warnings.
Preserve the raw bytes from the registry response and write them
directly to disk so the local manifest is byte-for-byte identical
to what the registry serves.
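The mismatch is easy to reproduce in isolation. In this sketch the `manifest` struct is a two-field stand-in rather than the real manifest type, and `hexDigest` and `remarshal` are hypothetical helpers.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// manifest is a tiny stand-in for the real manifest struct.
type manifest struct {
	SchemaVersion int    `json:"schemaVersion"`
	MediaType     string `json:"mediaType"`
}

// hexDigest returns the hex-encoded SHA256 of b.
func hexDigest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// remarshal round-trips raw JSON through the struct, which normalizes
// whitespace and field order to Go's output.
func remarshal(raw []byte) ([]byte, error) {
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	return json.Marshal(m)
}

func main() {
	// Same content, but different spacing and field order than Go emits.
	raw := []byte(`{ "mediaType": "x", "schemaVersion": 2 }`)
	out, err := remarshal(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("digests match:", hexDigest(raw) == hexDigest(out))
}
```

Writing `raw` to disk instead of `out` keeps the local digest equal to the one the registry computed.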
A stop-gap for now to guide users better. We'll add more in-depth recommendations per integration as well.
---------
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
Copilot Chat prefers to use `general.basename` in the built-in Ollama
integration, but this name isn't usually shown directly to users (and
there may be many models that share this name). Instead we pass back
`req.Model`, which for this extension is the value that we return from
`/api/tags`.
If `OLLAMA_DEBUG_LOG_REQUESTS` is set, then on server startup a temp
folder will be created. Upon any inference request, the body will be
logged to a file in this folder, as well as a small shell script to
"replay" the request using cURL.
This is just intended for debugging scenarios, not as something to turn
on normally.
`WebSearchAnthropicWriter` expects a single object per write. The new
transparent proxy will instead send it whatever bytes it sees. This
cloud-model + local-orchestration + cloud-search is a temporary code
path, so instead of making the web search code more robust to this, I
put an adapter in the middle that will flush line-by-line to preserve
the old behavior.
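A line-flushing adapter of this kind can be sketched as an `io.Writer` that buffers arbitrary byte chunks and passes complete lines downstream one write at a time. `lineWriter` and `recorder` are hypothetical names, and the sketch assumes one object per newline-terminated line.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// lineWriter buffers incoming bytes and forwards them to dst one
// complete line per Write call.
type lineWriter struct {
	dst io.Writer
	buf []byte
}

func (w *lineWriter) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	for {
		i := bytes.IndexByte(w.buf, '\n')
		if i < 0 {
			break // keep the partial line until more bytes arrive
		}
		if _, err := w.dst.Write(w.buf[:i+1]); err != nil {
			return len(p), err
		}
		w.buf = w.buf[i+1:]
	}
	return len(p), nil
}

// recorder captures each downstream write so the behavior is visible.
type recorder struct{ writes []string }

func (r *recorder) Write(p []byte) (int, error) {
	r.writes = append(r.writes, string(p))
	return len(p), nil
}

func main() {
	rec := &recorder{}
	lw := &lineWriter{dst: rec}
	// Chunk boundaries deliberately fall in the middle of an object.
	io.WriteString(lw, "{\"a\":1}\n{\"b\"")
	io.WriteString(lw, ":2}\n")
	fmt.Printf("%q\n", rec.writes)
}
```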
MLX runners (image generation and LLM) previously bypassed the
scheduler's standard load path via a separate loadMLX method. This meant
they skipped VRAM fitting checks and couldn't participate in model
eviction.
Now all model types flow through the same load function. Model eviction
for MLX is based on weights only, since the KV cache and compute graph
are dynamic. This means that eviction does not account for worst-case
memory use and models can still compete for memory, but it is a
significant improvement.
When a zstd-compressed request (e.g. from Codex CLI) hit /v1/responses
with a cloud model, the request failed. Fix by decompressing zstd bodies
before model extraction, so cloud models are detected and proxied
directly without the writer being wrapped.
Previously we were printing misleading errors for expected cases like
clients disconnecting. Now we only debug log when that happens (which
still might help in cases where we're figuring out why an integration
isn't working). For other errors, we now print a proper warning.
* don't require pulling stubs for cloud models
This is the first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously there was a layer of
indirection where you'd first have to pull a "stub" model that contains
a reference to a cloud model. With this change, you don't have to pull
first, you can just use a cloud model in various routes like `/api/chat`
and `/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.
There's also a new, simpler pass-through proxy that doesn't convert
requests before they reach the cloud models, which themselves already
support various formats (e.g., `v1/chat/completions` or Open
Responses). This will help prevent issues caused by double
converting (e.g., `v1/chat/completions` converted to `api/chat` on the
client, then calling cloud and converting back to a
`v1/chat/completions` response instead of the cloud model handling the
original `v1/chat/completions` request first).
There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like `gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but sets us up to allow for things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.
More detailed changes:
- Added a shared model selector parser in `types/modelselector`:
- supports `:cloud` and `:local`
- accepts source tags in any position
- supports legacy `:<tag>-cloud`
- rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
- `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
`EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
- same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
- normalizes `model` (and `name` for `/api/show`) before forwarding
- forwards request headers except hop-by-hop/proxy-managed headers
- uses bounded response-header timeout
- handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (though
this flow still uses the legacy proxy, since supporting Modelfile
overrides is more complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
auto-pull local stubs
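The selector rules listed above (source tags in any position, legacy `-cloud` suffix, conflict rejection) can be sketched as follows. This is not the `types/modelselector` API: `parseSelector`, `source`, and `errConflict` are hypothetical names, and the sketch ignores registry hosts that contain a port.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type source string

const (
	sourceUnset source = ""
	sourceCloud source = "cloud"
	sourceLocal source = "local"
)

var errConflict = errors.New("conflicting source tags")

// parseSelector splits a model name into its base name (without source
// tags) and the requested source.
func parseSelector(name string) (string, source, error) {
	parts := strings.Split(name, ":")
	kept := []string{parts[0]}
	src := sourceUnset

	setSource := func(s source) error {
		if src != sourceUnset && src != s {
			return errConflict
		}
		src = s
		return nil
	}

	for _, tag := range parts[1:] {
		var err error
		switch {
		case tag == "cloud":
			err = setSource(sourceCloud)
		case tag == "local":
			err = setSource(sourceLocal)
		case strings.HasSuffix(tag, "-cloud"):
			// Legacy form such as "20b-cloud": keep "20b", mark cloud.
			err = setSource(sourceCloud)
			kept = append(kept, strings.TrimSuffix(tag, "-cloud"))
		default:
			kept = append(kept, tag)
		}
		if err != nil {
			return "", sourceUnset, err
		}
	}
	return strings.Join(kept, ":"), src, nil
}

func main() {
	names := []string{"gpt-oss:20b:cloud", "gpt-oss:cloud:20b", "gpt-oss:20b-cloud", "gpt-oss:cloud:local"}
	for _, name := range names {
		base, src, err := parseSelector(name)
		fmt.Printf("%q -> base=%q source=%q err=%v\n", name, base, src, err)
	}
}
```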
What's next?
- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow
Fixes: https://github.com/ollama/ollama/issues/13801
* consolidate pull logic into confirmAndPull helper
pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.
* skip local existence checks for cloud models
ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.
* support optionally pulling stubs for new-style names
We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary; once we integrate properly into `/api/tags` we won't need
this anymore.
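The normalization for pull and delete can be sketched like this. It assumes the source tag is the final component and leaves names like `kimi-k2.5:cloud` unchanged; `legacyName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyName rewrites new-style names into the legacy stub form:
//   <family>:<size>:cloud -> <family>:<size>-cloud
//   <name>:local          -> <name>
// Anything else is returned unchanged.
func legacyName(name string) string {
	parts := strings.Split(name, ":")
	n := len(parts)
	switch {
	case n >= 2 && parts[n-1] == "local":
		return strings.Join(parts[:n-1], ":")
	case n >= 3 && parts[n-1] == "cloud":
		return strings.Join(parts[:n-1], ":") + "-cloud"
	default:
		return name
	}
}

func main() {
	for _, name := range []string{"gpt-oss:20b:cloud", "gpt-oss:20b:local", "kimi-k2.5:cloud"} {
		fmt.Printf("%s -> %s\n", name, legacyName(name))
	}
}
```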
* Fix server alias syncing
* Update cmd/cmd.go
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* address comments
* improve some naming
---------
Co-authored-by: ParthSareen <parth.sareen@ollama.com>