Commit Graph

1034 Commits

Author SHA1 Message Date
Jeffrey Morgan
5f56a289b3
server: classify mmproj GGUFs as projector layers (#16472) 2026-06-03 12:59:34 -07:00
Patrick Devine
50bbda5660
models: add support for gemma4-12b (#16457) 2026-06-03 07:44:57 -07:00
Daniel Hiltgen
7d3a6c3ae5
log template details to aid troubleshooting (#16403)
This cleans up the capabilities logic so we can log more information about the various options we consider as well as the final template version we use.
2026-06-01 16:25:44 -07:00
Daniel Hiltgen
630882621b
llama-server followups (#16353)
* llama-server followups

Misc fixes for #16031
- Add back dropped ROCm build flag for multi-GPU support on windows
- Fix amdhip64_*.dll version detection for "latest" selection
- Fix embeddings API for consistent normalize behavior with prior versions

* ci: set up for automated llama.cpp update testing

* reduce batch for fa-disabled, and constrained vram

* mlx: fix v3 load bug on m5

Imagegen was incorrectly loading v3 first.  This DRYs out the loading code so imagegen gets the same new v4/v3 selection logic.

* fix reload bug on embedding models

* bump version

* steer user how to enable iGPU when disabled
2026-06-01 10:44:21 -07:00
Patrick Devine
0e93ccc2cd
convert: fixes for qwen3next model conversion (#16354)
This change addresses some problems with GGUF conversion including:
 * correctly naming the MoE tensors
 * correctly quantizing the nextn.eh_proj.weight MTP tensor
2026-06-01 09:43:11 -07:00
Daniel Hiltgen
9db4bdbad6
runner: Remove CGO engines, use llama-server exclusively for GGML models (#16031)
* broad lint fixes to sidestep CI scope glitch

* runner: Remove CGO engines, use llama-server exclusively for GGML models

Remove the vendored GGML and llama.cpp backend, CGO runner, Go model
implementations, and sample.  llama-server (built from upstream llama.cpp via
FetchContent) is now the sole inference engine for GGUF-based models.
(Safetensor based models continue to run on the new MLX engine.)  This allows
us to more rapidly pick up new capabilities and fixes from llama.cpp as they
come out.

On windows this now requires recent AMD driver versions to support ROCm v7 as
llama.cpp currently does not support building against v6.

* llama/compat: load Ollama-format GGUFs in llama-server

Squashed from upstream/jmorganca/llama-compat on 2026-04-29.
Source tip: 0c33775d37.

Original source commits:
- 25223160d llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs
- 7449b539a llm,server: route Ollama-format gemma3 blobs through llama/compat
- 436f2e2b1 llama/compat: make patch-apply idempotent
- 8c2c9d4c8 llama/compat: extend gemma3 handler to cover 1B and 270M blobs
- 021389f7b llama/compat: shrink clip.cpp injection from 18 lines to 1
- 61b367ec2 llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines)
- 36049361c llama/compat: simplify shim (gemma3-tested)
- 8fa664865 llama/compat: add qwen35moe text handler
- db0c74530 llama/compat: add qwen35moe vision (clip) support
- 2a388da77 llama/compat: split shared infra into a util TU
- 9a69a17dc llama/compat: document non-public API dependencies
- d0f38a915 llama/compat: add gpt-oss and lfm2 handlers
- 086071822 llama/compat: add mistral3 text handler (vision TODO)
- 63bde9ff7 llama/compat: add mistral3 vision (clip) support
- 3a57b89d5 llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K
- 99cb87439 llama/compat: add qwen35, gemma4, deepseek-ocr handlers
- 2c7850dba llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip)
- 9e3b54225 llama/compat: add llama4 text + clip handlers
- 034fee349 llama/compat: add gemma4 clip handler (gemma4v projector)
- 9945c5a93 server: remove dhiltgen/* compat redirect table
- 5d4539101 llama/compat: rewrite gemma4 tokenizer model to BPE
- 7e0765327 llama/compat: add glm-ocr text handler + text-loader load-op hook
- f1bd1a25a llama/compat: add glm-ocr clip handler (glm4v projector)
- 4b5cf3420 llama/compat: collapse text-loader hook back to one new patch line
- eb4ecf4fc llama/compat: extend gemma4 clip handler to gemma4a (audio)
- a23a5e76f llama/compat: fix gemma4a per-block norm tensor mapping
- cd2dcaff4 llama/compat: add embeddinggemma handler
- 1ce8a6b26 llama/compat: add qwen3-vl + qwen2.5-vl handlers
- fd98ffa1e llama/compat: add gemma3n + glm4moelite handlers
- cc7bdf0bc llama/compat: handle null buft in maybe_load_tensor
- 0c33775d3 llama/compat: disable mmap when load_op transforms text-side tensors

* refine implementation

* ci: fix windows MLX build

* ci: fix windows llama-server build

* ci: fix windows rocm build

* ci: windows mlx tuning

Shorten long-tail on build, and get OllamaSetup.exe back under 2g limit

* ci: fix windows dependencies

* win: fix dependency gathering

* disable openmp

* win: arm64 cross-compile build

also DRY out CI steps

* scheduler improvements

* ci: improvements from #15982

* win: favor ninja for faster developer builds

* win: fix build

* win: fix arm64 cross-compile

* win: avoid spaces in compiler path

* misc discovery fixes, and bos handling

* lint fixes

* win: fix arm cross-compile build/CI bugs

* llama.cpp update

* win: handle multiple CRT dirs

* vulkan: add windows iGPU detection

* fix creation bugs for patched models, other refactoring work

* tune batch size for better performance

* ci and lint fixes

* fix repeat_last_n bug

* build: revamp build for better developer UX

* amd, sampler, qwen3next fixes

* version bump

* fix mlx build

* revamp GPU discovery

Scanning the output of llama-server is turning out to be too error prone across
llama.cpp updates, so this switches to a thin dynamic library load against the
bundled GGML libraries so more details can be gathered from the API.

* version bump

* missing file

* ci: fix cache miss on rocm build

* refine vulkan dep handling

* fix ps reporting bug on full GPU load

* improve cmake wiring for customized local builds

* version bump

* docker build arg cleanup

* improve windows exit error logs

* fix community gemma4 support and ci flakes

* fix mlx unit test

* tighten up ps logic to avoid double counting fit log lines

* version bump

* fix ps view for full gpu layer offload

* add MTP wiring for llama-server and create with GGUFs

* pick best template by capabilities

* version bump

* ci: harden apt repos

* remove unused cpu core discovery

* adjust batch default logic to reduce OOMs

* support larger tool calls

* fix audio support, template show

* qwen35 mtp patch support

* flesh out dtypes

* rocm deps

* version bump

* lint fix

* block broken gfx1150 on windows

* fix qwen3.5 moe mtp tensors in patch

* mmproj oom fallback and vulkan on by default

* qwen MTP compat fix

* version bump

* ci: fix WoA cross-compile

* ci: workaround ui tool in cross-compile

* version bump

* win: enable OpenMP for CPU builds

* build: improve developer UX

* ci: windows path workaround for CPU build

* win: fix WoA dependencies

* win: fix large offset reads for mmproj patched loads

* version bump

* fix vulkan dup detection

* add OLLAMA_IGPU_ENABLE and largely disable iGPUs by default

* opt-in MTP, win large offset, integraton fixes

* fix unit test scheduler interaction hang

* fix multi-gpu filtering

* version bump

* review comments

* fix thinking level

* fix linux rocm ordering and granite 3.3 template

* version bump

* ci fix - non-shallow MLX checkout

* bypass linux sysfs unit test on windows

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2026-05-29 13:35:47 -07:00
Patrick Devine
f63eea3d27
mlx: fix reported information in ollama show (#16289)
This change updates the show API for MLX models to:
  * display the correct quantization in mixed precision models
  * not display the global_scale scalar value
  * not duplicate the `tools` capability
2026-05-24 14:08:06 -07:00
Anto jones
632ff00798
server: remove duplicate template parsing (#16287) 2026-05-24 13:27:24 -07:00
Daniel Hiltgen
4b2d529966
Reduce startup model hydration (#16215)
* Reduce startup model hydration

Add a lightweight model list cache for tags and launch inventory, while keeping show cache population lazy. This avoids loading every local model at startup on large model stores.

* harden flaky scheduler unit test

* remove extra launch model metadata text

* review comments

* review comments
2026-05-19 15:53:08 -07:00
Eva H
6bdb73073b
anthropic: Preserve Claude local image-path tool results in renderer-owned prompt formatting (#16047) 2026-05-12 00:02:17 -04:00
Patrick Devine
d819ef0f97
mlx: update the imagegen runner for mlx thread affinity (#16096) 2026-05-11 13:05:06 -07:00
Daniel Hiltgen
3d5a011a2e
app: harden update flows (#16100)
* app: harden update flows

This hardens the windows update flows and adds a new opt-in and CI triggered unit test to verify Mac/Windows updates with verification.

* test: harden unit tests for OLLAMA_MODELS being set

* app: harden updater
2026-05-11 12:24:01 -07:00
Daniel Hiltgen
1e1b34dada
mlx: refined model push behavior (#15431)
* mlx: refined model push behavior

Refine the algorithm for parallel push of safetensors based models to get
better reliability and throughput.

* review comments, hardening, and performance tuning for slow links

* review comments
2026-05-08 14:25:30 -07:00
Parth Sareen
bab59072fb
launch: add plan-aware model gating (#16027) 2026-05-06 14:34:26 -07:00
Parth Sareen
d319227df0
server: cache show responses (#15967) 2026-05-05 14:40:18 -07:00
Daniel Hiltgen
2d84ec939c
mlx: partial cleanup of imagegen layout (#15435)
* mlx: partial cleanup of imagegen layout

This moves part of the imagegen safetensors code to the new package.

* test: remove flaky timing test
2026-05-05 14:15:30 -07:00
Parth Sareen
4017af96cd
go: bump to 1.26 (#15904) 2026-05-03 23:24:35 -07:00
Parth Sareen
b6447caebc
launch: use vram bytes for model recommendations (#15885) 2026-04-29 18:40:14 -07:00
Parth Sareen
321cc8a2ba
server/launch: add model recommendations cache endpoint (#15868) 2026-04-28 17:09:04 -07:00
Daniel Hiltgen
87288ced4f
New models (#15861)
* mlx: add laguna model support

* convert: support fp8 safetensors import

Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights.

Use that source-precision metadata during create quantization: default FP8-sourced GGUFs to Q8_0, keep non-FP8 tensors at their original precision for Q8_0, and promote non-FP8 quantizable tensors to Q8_0 for Q4_K requests.

* ggml: add laguna model support

* server: preserve generate logprobs with builtin parsers

Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output.

Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser.

* review comments - perf improvements

* ggml: implement nemotron 3 nano omni

* add poolside integration

* update poolside doc

* adapt to new cache setup

* fix test

* fix test

---------

Co-authored-by: Eva Ho <hoyyeva@gmail.com>
2026-04-28 11:50:12 -07:00
Parth Sareen
c2ebb4d57c
api: accept "max" as a think value (#15787) 2026-04-24 01:49:39 -07:00
Parth Sareen
5d1021603a
server: apply format when think=false for gemma4 (#15678) 2026-04-20 17:42:29 -07:00
Daniel Hiltgen
2bb7ea00d2
create: avoid gc race with create (#15628)
If you have a long running create, and start another ollama server with the
same model dir, the GC algorithm deletes the pending blobs and breaks the
create.  This adds a 1h grace period to avoid deleting in-flight creation
operations.
2026-04-16 13:29:16 -07:00
Devon Rifkin
e585ecd11f gemma4: render differently based on model size
Following up on #15560, this change now has e2b/e4b render differently
from 26b/31b.

For backwards compatibility, we take the existing renderer name `gemma4`
and make it do dynamic resolution based on the model name/size, but the
intended use is for the models to be republished with the renderer
variant specified explicitly: `gemma4-small` or `gemma4-large`.
2026-04-15 14:37:16 -07:00
Patrick Devine
eb97274e5c
modelfiles: fix /save command and add shortname for safetensors based models (#15413)
This change fixes two issues with Modelfiles:

  1. If a user uses `ollama show --modelfile` to show a safetensors based
     model, the Model would leave the "FROM" field blank which won't allow
     a user to recreate the model. This change adds the model's current
     canonical short name to the FROM field.
  2. If a user uses the `/save` command in the CLI any messages which were
     saved in a previous model wouldn't get saved (only the set of messages
     from the current session).
2026-04-08 21:05:39 -07:00
Daniel Hiltgen
30fdd229a4
create: Clean up experimental paths, fix create from existing safetensor model (#14679)
* create:  Clean up experimental paths

This cleans up the experimental features, and adds both unit and integration test coverage to verify no regressions.

* create: preserve config and layer names when creating from safetensors models

When creating a model FROM an existing safetensors model, ModelFormat,
Capabilities, and layer Name fields were lost. ModelFormat stayed empty
because it's only set from GGML layers (which safetensors models lack),
and layer names weren't copied in parseFromModel. This caused derived
models to fail loading ("config.json not found in manifest").

* review comments
2026-04-07 08:12:57 -07:00
Daniel Hiltgen
96b202d34b
Add support for gemma4 (#15214)
* bench: add prompt calibration, context size flag, and NumCtx reporting

Add --num-ctx flag to set context size, and report NumCtx in model info
header. Calibrate tokens-per-word ratio during warmup using actual
tokenization metrics from the model, replacing the fixed 1.3 heuristic.
This produces more accurate prompt token counts for --prompt-tokens.

Also add fetchContextLength() to query running model context via /api/ps.

* integration: improve vision test robustness and add thinking tests

Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL
is set to a non-vision model. Add Think:false to context exhaustion test
to prevent thinking models from using all context before the test can
measure it. Add third test image (ollama homepage) and replace OCR test
with ImageDescription test using it. Relax match strings for broader
model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed
to verify thinking output and channel tag handling.

* gemma4: add Gemma 4 GGML model support

Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense)
for the GGML backend including text, vision, converter, parser, and
renderer.

Text model features:
- Sliding window + full attention with per-layer patterns
- KV sharing across layers with donor map
- Per-layer embeddings (PLE) with learned projections
- MoE routing with RMSNorm + learned scale
- Proportional RoPE with freq_factors for global attention
- Final logit softcapping

Vision model features:
- SigLIP vision encoder with 2D RoPE
- ClippableLinear with input/output clamping via packed v.clamp_data
- Adaptive average pooling with nMerge kernel
- Multi-modal projection with unweighted RMSNorm

Converter:
- Safetensors to GGUF with vision tensor renaming
- Fused MoE gate_up_proj splitting
- Vision patch embedding reshape (HF to Conv2D layout)
- Packed clamp data tensor for ClippableLinear bounds
- Proportional RoPE freq_factors generation

Also includes:
- BackendGet() on ml.Tensor for reading weight tensor data
- Q6_K CUDA get_rows kernel support
- MoE-aware ffn_down quantization layer counting
- Gemma4 parser with tool calling and thinking support
- Gemma4 renderer with structured tool format
- Architecture-based auto-detection of renderer/parser/stop tokens
- Integration test gemma4 model list additions

* gemma4: add audio support with USM conformer encoder

Add audio encoding for Gemma 4 using the USM conformer architecture:
- Converter: audio tensor mapping, SSCP/conformer/embedder name replacements,
  softplus repacker for per_dim_scale, F32 enforcement for conv weights
- GGML backend: Conv1DDW and PadExt tensor ops
- Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local
  attention with relative position embeddings + LightConv1d + FFW),
  output projection, audio-to-text embedding projector
- Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go)
- Model wiring: WAV detection, audio token handling, unified PostTokenize

Correctly transcribes "why is the sky blue" from test audio.

* integration: add gemma4 audio tests including OpenAI API coverage

Test audio transcription and response via the Ollama native API, plus
two new tests exercising the OpenAI-compatible endpoints:
- /v1/audio/transcriptions (multipart form upload)
- /v1/chat/completions with input_audio content type

All tests use capability checks and skip models without audio support.

* gemma4: add OpenAI audio API support and capability detection

- Add CapabilityAudio and detect from audio.block_count in GGUF
- Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware
- Add input_audio content type support in /v1/chat/completions
- Add TranscriptionRequest/Response types in openai package

* gemma4: add audio input support for run command

- /audio toggle in interactive mode for voice chat
- Platform-specific microphone recording (AVFoundation on macOS,
  PulseAudio/ALSA on Linux, WASAPI on Windows)
- Space to start/stop recording, automatic chunking for long audio

* gemma4: add transcribe command (ollama transcribe MODEL)

- Interactive mode with readline prompt and slash commands
- Non-interactive mode for piped audio or record-until-Ctrl+C
- Chunked streaming transcription for long recordings
- Word-wrapped output matching run command style

* gemma4: add parser, renderer, and integration test plumbing

* gemma4: fix renderer to emit BOS token

* gemma4: add OpenAI audio transcription API and input_audio support

* gemma4: update converter for new weight drop naming

* gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config

* gemma4: rewrite renderer to match HF Jinja2 template exactly

Fix 8 bugs found by building 55 reference tests verified against the
HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python):

- Tool responses use separate <|turn>tool turns (not inline tags)
- Tool calls emitted before content in assistant messages
- Thinking content stripped from assistant history (strip_thinking)
- User, tool, and system content trimmed (template does | trim)
- Empty system message still emits system turn (check role, not content)
- Nested object properties rendered recursively with required field
- Array items specification rendered for array-type properties
- OBJECT/ARRAY type-specific rendering comma logic matches template

Also adds Required field to api.ToolProperty for nested object schemas,
replaces old gemma4_test.go with comprehensive gemma4_reference_test.go,
and commits the Jinja2 template as testdata for verification.

* gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing

- Text MoE: split `ffn_gate_up_exps` into contiguous `[gate|up]` halves instead of stride-2 slices.
- Parser: escape control characters in `<|"|>...<|"|>` string literals when converting tool-call args to JSON.
- Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments.
- Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`.

* cmd: simplify audio input to dropped file attachments

* gemma4: use full SWA memory for better cache reuse

* gemma4: initialize clamps after backend load

* convert: align gemma4 audio tensor renames with llama.cpp

* Remove redundant comments in gemma4 vision model

* Format Gemma4 MoE block field alignment

* use 4096 kvcache.NewSWAMemCache

* convert: support new Gemma4 audio_tower tensor naming (#15221)

Co-authored-by: jmorganca <jmorganca@gmail.com>

* fix integration test defaults for audio

* review comments and lint fixes

* remove unused audio/video files

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2026-04-02 11:33:33 -07:00
Parth Sareen
8e54823fd3
revert context length warnings change (#15121) 2026-03-28 16:43:59 -07:00
Patrick Devine
9e7cb9697e
mlx: fix vision capability + min version (#15106) 2026-03-27 17:09:28 -07:00
Bruce MacDonald
3824e380a8
server: preserve raw manifest bytes during pull (#15104)
pullModelManifest unmarshals the registry response into a Go struct
then re-marshals with json.Marshal before writing to disk. When the
registry's JSON formatting or field ordering differs from Go's
output, the local SHA256 won't match the registry's
Ollama-Content-Digest header, causing false "out of date" warnings.

Preserve the raw bytes from the registry response and write them
directly to disk so the local manifest is byte-for-byte identical
to what the registry serves.
2026-03-27 15:42:31 -07:00
Eva H
366625a831
launch: warn when server context length is below 64k for local models (#15044)
A stop-gap for now to guide users better. We'll add more in-depth recommendations per integration as well.

---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2026-03-27 00:15:53 -07:00
Devon Rifkin
26b9f53f8e
api/show: overwrite basename for copilot chat (#15062)
Copilot Chat prefers to use `general.basename` in the built-in Ollama
integration, but this name isn't usually shown directly to users (and
there may be many models that share this name). Instead we pass back
`req.Model`, which for this extension is the value that we return from
`/api/tags`
2026-03-25 14:02:22 -07:00
Devon Rifkin
46cb7795e1
add ability to turn on debug request logging (#14106)
If `OLLAMA_DEBUG_LOG_REQUESTS` is set, then on server startup a temp
folder will be created. Upon any inference request, the body will be
logged to a file in this folder, as well as a small shell script to
"replay" the request using cURL.

This is just intended for debugging scenarios, not as something to turn
on normally.
2026-03-19 17:08:17 -07:00
Devon Rifkin
e37a9b4c01
cloud_proxy: for the web_search legacy path, flush on newlines (#14897)
`WebSearchAnthropicWriter` expects a single object per write. The new
transparent proxy will instead send it whatever bytes it sees. This
cloud-model + local-orchestration + cloud-search is a temporary code
path, so instead of making the web search code more robust to this, I
put an adapter in the middle that will flush line-by-line to preserve
the old behavior.
2026-03-17 13:30:17 -07:00
Jesse Gross
bbbad97686 sched: Model eviction for MLX
MLX runners (image generation and LLM) previously bypassed the
scheduler's standard load path via a separate loadMLX method. This meant
they skipped VRAM fitting checks and couldn't participate in model
eviction.

Now all model types flow through the same load function. Model eviction
for MLX is based on weights as KV cache and compute graph are dynamic.
This means that eviction does not take into account the worst case
memory and models can still compete for memory but it is a significant
improvement.
2026-03-16 17:40:29 -07:00
Bruce MacDonald
3980c0217d
server: decompress zstd request bodies in cloud passthrough middleware (#14827)
When a zstd-compressed request (e.g. from Codex CLI) hits /v1/responses
with a cloud model the request failed.

Fix by decompressing zstd bodies before
model extraction, so cloud models are detected and proxied directly
without the writer being wrapped.
2026-03-13 15:06:47 -07:00
Devon Rifkin
f676231de9
server: remove experimental aliases support (#14810) 2026-03-12 20:27:24 -07:00
Devon Rifkin
8c4d5d6c2f
cloud_proxy: send ollama client version (#14769)
This was previously included in the user agent, and we've made use of it
in the past to hotpatch bugs server-side for particular Ollama versions.
2026-03-10 15:53:25 -07:00
Parth Sareen
61086083eb
server: add experimental web search and web fetch routes (#14753) 2026-03-09 21:52:12 -07:00
Devon Rifkin
afb4c62fbf
cloud_proxy: handle stream disconnects gracefully (#14685)
Previously we were printing out bad errors for expected cases like
clients disconnecting. Now we only debug log when that happens (which
still might help in cases where we're figuring out why an integration
isn't working). For other errors, we print out a proper warning now
2026-03-06 19:18:52 -08:00
Jeffrey Morgan
4eab60c1e2
Reapply "don't require pulling stubs for cloud models" again (#14608)
* Revert "Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)"

This reverts commit 39982a954e.

* fix test + do cloud lookup only when seeing cloud models

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-06 14:27:47 -08:00
Parth Sareen
122c68c151
server: loosen thinking level constraint (#14625) 2026-03-04 13:42:18 -08:00
Jeffrey Morgan
39982a954e
Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)
This reverts commit 799e51d419.
2026-03-03 20:56:10 -08:00
Patrick Devine
110eff01a9
chore: remove old imagegen LLMs models (#14597)
These models are implemented in the x/mlxrunner instead.
2026-03-03 13:23:40 -08:00
Jeffrey Morgan
799e51d419
Reapply "don't require pulling stubs for cloud models"
This reverts commit 97d2f05a6d.
2026-03-03 13:17:10 -08:00
Victor-Quqi
e8fcb29586
model/renderers: fix glm-ocr image tags in renderer prompts (#14584) 2026-03-03 12:51:34 -08:00
Jeffrey Morgan
97d2f05a6d
Revert "don't require pulling stubs for cloud models (#14574)" (#14596)
This reverts commit 8207e55ec7.
2026-03-03 12:51:23 -08:00
Devon Rifkin
8207e55ec7
don't require pulling stubs for cloud models (#14574)
* don't require pulling stubs for cloud models

This is a first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously we used to have a layer of
indirection where you'd first have to pull a "stub" model that contains
a reference to a cloud model. With this change, you don't have to pull
first, you can just use a cloud model in various routes like `/api/chat`
and `/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.

There's also a new, simpler pass-through proxy that doesn't convert the
requests ahead of hitting the cloud models, which they themselves
already support various formats (e.g., `v1/chat/completions` or Open
Responses, etc.). This will help prevent issues caused by double
converting (e.g., `v1/chat/completions` converted to `api/chat` on the
client, then calling cloud and converting back to a
`v1/chat/completions` response instead of the cloud model handling the
original `v1/chat/completions` request first).

There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like`gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but sets us up to allow for things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.

More detailed changes:

- Added a shared model selector parser in `types/modelselector`:
  - supports `:cloud` and `:local`
  - accepts source tags in any position
  - supports legacy `:<tag>-cloud`
  - rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
  - `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
    `EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
  - same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
  - normalizes `model` (and `name` for `/api/show`) before forwarding
  - forwards request headers except hop-by-hop/proxy-managed headers
  - uses bounded response-header timeout
  - handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (though
  this flow uses the legacy proxy still, supporting Modelfile overrides
  is more complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
  auto-pull local stubs

What's next?

- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
  request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow

Fixes: https://github.com/ollama/ollama/issues/13801

* consolidate pull logic into confirmAndPull helper

pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.

* skip local existence checks for cloud models

ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.

* support optionally pulling stubs for newly-style names

We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary, once we integrate properly into `/api/tags` we won't need
this anymore.

* Fix server alias syncing

* Update cmd/cmd.go

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

* address comments

* improve some naming

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-03 10:46:33 -08:00
Jesse Gross
ad16bffc7d mlx: Remove peak memory from the API
This is still in flux so it is better to just log it for now.
2026-03-02 15:56:18 -08:00
Bruce MacDonald
23d4cad1a2
server: verify digest is not empty on create (#14555)
An empty digest is not a valid digest for an incoming create request. Reject empty digests at the api level.
2026-03-02 13:43:35 -08:00