ollama

mirror of https://github.com/ollama/ollama.git synced 2026-06-05 21:05:00 +08:00

Author	SHA1	Message	Date
Jeffrey Morgan	5f56a289b3	server: classify mmproj GGUFs as projector layers (#16472 )	2026-06-03 12:59:34 -07:00
Patrick Devine	50bbda5660	models: add support for gemma4-12b (#16457 )	2026-06-03 07:44:57 -07:00
Daniel Hiltgen	7d3a6c3ae5	log template details to aid troubleshooting (#16403 ) This cleans up the capabilities logic so we can log more information about the various options we consider as well as the final template version we use.	2026-06-01 16:25:44 -07:00
Daniel Hiltgen	630882621b	llama-server followups (#16353 ) * llama-server followups Misc fixes for #16031 - Add back dropped ROCm build flag for multi-GPU support on windows - Fix amdhip64_.dll version detection for "latest" selection - Fix embeddings API for consistent normalize behavior with prior versions ci: set up for automated llama.cpp update testing * reduce batch for fa-disabled, and constrained vram * mlx: fix v3 load bug on m5 Imagegen was incorrectly loading v3 first. This DRYs out the loading code so imagegen gets the same new v4/v3 selection logic. * fix reload bug on embedding models * bump version * steer user how to enable iGPU when disabled	2026-06-01 10:44:21 -07:00
Patrick Devine	0e93ccc2cd	convert: fixes for qwen3next model conversion (#16354 ) This change addresses some problems with GGUF conversion including: * correctly naming the MoE tensors * correctly quantizing the nextn.eh_proj.weight MTP tensor	2026-06-01 09:43:11 -07:00
Daniel Hiltgen	9db4bdbad6	runner: Remove CGO engines, use llama-server exclusively for GGML models (#16031 ) * broad lint fixes to sidestep CI scope glitch * runner: Remove CGO engines, use llama-server exclusively for GGML models Remove the vendored GGML and llama.cpp backend, CGO runner, Go model implementations, and sample. llama-server (built from upstream llama.cpp via FetchContent) is now the sole inference engine for GGUF-based models. (Safetensor based models continue to run on the new MLX engine.) This allows us to more rapidly pick up new capabilities and fixes from llama.cpp as they come out. On windows this now requires recent AMD driver versions to support ROCm v7 as llama.cpp currently does not support building against v6. * llama/compat: load Ollama-format GGUFs in llama-server Squashed from upstream/jmorganca/llama-compat on 2026-04-29. Source tip: `0c33775d37`. Original source commits: - `25223160d` llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs - `7449b539a` llm,server: route Ollama-format gemma3 blobs through llama/compat - `436f2e2b1` llama/compat: make patch-apply idempotent - `8c2c9d4c8` llama/compat: extend gemma3 handler to cover 1B and 270M blobs - `021389f7b` llama/compat: shrink clip.cpp injection from 18 lines to 1 - `61b367ec2` llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines) - `36049361c` llama/compat: simplify shim (gemma3-tested) - `8fa664865` llama/compat: add qwen35moe text handler - `db0c74530` llama/compat: add qwen35moe vision (clip) support - `2a388da77` llama/compat: split shared infra into a util TU - `9a69a17dc` llama/compat: document non-public API dependencies - `d0f38a915` llama/compat: add gpt-oss and lfm2 handlers - `086071822` llama/compat: add mistral3 text handler (vision TODO) - `63bde9ff7` llama/compat: add mistral3 vision (clip) support - `3a57b89d5` llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K - `99cb87439` llama/compat: add qwen35, gemma4, 
deepseek-ocr handlers - `2c7850dba` llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip) - `9e3b54225` llama/compat: add llama4 text + clip handlers - `034fee349` llama/compat: add gemma4 clip handler (gemma4v projector) - `9945c5a93` server: remove dhiltgen/* compat redirect table - `5d4539101` llama/compat: rewrite gemma4 tokenizer model to BPE - `7e0765327` llama/compat: add glm-ocr text handler + text-loader load-op hook - `f1bd1a25a` llama/compat: add glm-ocr clip handler (glm4v projector) - `4b5cf3420` llama/compat: collapse text-loader hook back to one new patch line - `eb4ecf4fc` llama/compat: extend gemma4 clip handler to gemma4a (audio) - `a23a5e76f` llama/compat: fix gemma4a per-block norm tensor mapping - `cd2dcaff4` llama/compat: add embeddinggemma handler - `1ce8a6b26` llama/compat: add qwen3-vl + qwen2.5-vl handlers - `fd98ffa1e` llama/compat: add gemma3n + glm4moelite handlers - `cc7bdf0bc` llama/compat: handle null buft in maybe_load_tensor - `0c33775d3` llama/compat: disable mmap when load_op transforms text-side tensors * refine implementation * ci: fix windows MLX build * ci: fix windows llama-server build * ci: fix windows rocm build * ci: windows mlx tuning Shorten long-tail on build, and get OllamaSetup.exe back under 2g limit * ci: fix windows dependencies * win: fix dependency gathering * disable openmp * win: arm64 cross-compile build also DRY out CI steps * scheduler improvements * ci: improvements from #15982 * win: favor ninja for faster developer builds * win: fix build * win: fix arm64 cross-compile * win: avoid spaces in compiler path * misc discovery fixes, and bos handling * lint fixes * win: fix arm cross-compile build/CI bugs * llama.cpp update * win: handle multiple CRT dirs * vulkan: add windows iGPU detection * fix creation bugs for patched models, other refactoring work * tune batch size for better performance * ci and lint fixes * fix repeat_last_n bug * build: revamp build for better developer UX * amd, sampler, 
qwen3next fixes * version bump * fix mlx build * revamp GPU discovery Scanning the output of llama-server is turning out to be too error prone across llama.cpp updates, so this switches to a thin dynamic library load against the bundled GGML libraries so more details can be gathered from the API. * version bump * missing file * ci: fix cache miss on rocm build * refine vulkan dep handling * fix ps reporting bug on full GPU load * improve cmake wiring for customized local builds * version bump * docker build arg cleanup * improve windows exit error logs * fix community gemma4 support and ci flakes * fix mlx unit test * tighten up ps logic to avoid double counting fit log lines * version bump * fix ps view for full gpu layer offload * add MTP wiring for llama-server and create with GGUFs * pick best template by capabilities * version bump * ci: harden apt repos * remove unused cpu core discovery * adjust batch default logic to reduce OOMs * support larger tool calls * fix audio support, template show * qwen35 mtp patch support * flesh out dtypes * rocm deps * version bump * lint fix * block broken gfx1150 on windows * fix qwen3.5 moe mtp tensors in patch * mmproj oom fallback and vulkan on by default * qwen MTP compat fix * version bump * ci: fix WoA cross-compile * ci: workaround ui tool in cross-compile * version bump * win: enable OpenMP for CPU builds * build: improve developer UX * ci: windows path workaround for CPU build * win: fix WoA dependencies * win: fix large offset reads for mmproj patched loads * version bump * fix vulkan dup detection * add OLLAMA_IGPU_ENABLE and largely disable iGPUs by default * opt-in MTP, win large offset, integraton fixes * fix unit test scheduler interaction hang * fix multi-gpu filtering * version bump * review comments * fix thinking level * fix linux rocm ordering and granite 3.3 template * version bump * ci fix - non-shallow MLX checkout * bypass linux sysfs unit test on windows --------- Co-authored-by: jmorganca 
<jmorganca@gmail.com>	2026-05-29 13:35:47 -07:00
Patrick Devine	f63eea3d27	mlx: fix reported information in `ollama show` (#16289 ) This change updates the show API for MLX models to: * display the correct quantization in mixed precision models * not display the global_scale scalar value * not duplicate the `tools` capability	2026-05-24 14:08:06 -07:00
Anto jones	632ff00798	server: remove duplicate template parsing (#16287 )	2026-05-24 13:27:24 -07:00
Daniel Hiltgen	4b2d529966	Reduce startup model hydration (#16215 ) * Reduce startup model hydration Add a lightweight model list cache for tags and launch inventory, while keeping show cache population lazy. This avoids loading every local model at startup on large model stores. * harden flaky scheduler unit test * remove extra launch model metadata text * review comments * review comments	2026-05-19 15:53:08 -07:00
Eva H	6bdb73073b	anthropic: Preserve Claude local image-path tool results in renderer-owned prompt formatting (#16047 )	2026-05-12 00:02:17 -04:00
Patrick Devine	d819ef0f97	mlx: update the imagegen runner for mlx thread affinity (#16096 )	2026-05-11 13:05:06 -07:00
Daniel Hiltgen	3d5a011a2e	app: harden update flows (#16100 ) * app: harden update flows This hardens the windows update flows and adds a new opt-in and CI triggered unit test to verify Mac/Windows updates with verification. * test: harden unit tests for OLLAMA_MODELS being set * app: harden updater	2026-05-11 12:24:01 -07:00
Daniel Hiltgen	1e1b34dada	mlx: refined model push behavior (#15431 ) * mlx: refined model push behavior Refine the algorithm for parallel push of safetensors based models to get better reliability and throughput. * review comments, hardening, and performance tuning for slow links * review comments	2026-05-08 14:25:30 -07:00
Parth Sareen	bab59072fb	launch: add plan-aware model gating (#16027 )	2026-05-06 14:34:26 -07:00
Parth Sareen	d319227df0	server: cache show responses (#15967 )	2026-05-05 14:40:18 -07:00
Daniel Hiltgen	2d84ec939c	mlx: partial cleanup of imagegen layout (#15435 ) * mlx: partial cleanup of imagegen layout This moves part of the imagegen safetensors code to the new package. * test: remove flaky timing test	2026-05-05 14:15:30 -07:00
Parth Sareen	4017af96cd	go: bump to 1.26 (#15904 )	2026-05-03 23:24:35 -07:00
Parth Sareen	b6447caebc	launch: use vram bytes for model recommendations (#15885 )	2026-04-29 18:40:14 -07:00
Parth Sareen	321cc8a2ba	server/launch: add model recommendations cache endpoint (#15868 )	2026-04-28 17:09:04 -07:00
Daniel Hiltgen	87288ced4f	New models (#15861 ) * mlx: add laguna model support * convert: support fp8 safetensors import Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights. Use that source-precision metadata during create quantization: default FP8-sourced GGUFs to Q8_0, keep non-FP8 tensors at their original precision for Q8_0, and promote non-FP8 quantizable tensors to Q8_0 for Q4_K requests. * ggml: add laguna model support * server: preserve generate logprobs with builtin parsers Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output. Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser. * review comments - perf improvements * ggml: implement nemotron 3 nano omni * add poolside integration * update poolside doc * adapt to new cache setup * fix test * fix test --------- Co-authored-by: Eva Ho <hoyyeva@gmail.com>	2026-04-28 11:50:12 -07:00
Parth Sareen	c2ebb4d57c	api: accept "max" as a think value (#15787 )	2026-04-24 01:49:39 -07:00
Parth Sareen	5d1021603a	server: apply format when think=false for gemma4 (#15678 )	2026-04-20 17:42:29 -07:00
Daniel Hiltgen	2bb7ea00d2	create: avoid gc race with create (#15628 ) If you have a long running create, and start another ollama server with the same model dir, the GC algorithm deletes the pending blobs and breaks the create. This adds a 1h grace period to avoid deleting in-flight creation operations.	2026-04-16 13:29:16 -07:00
Devon Rifkin	e585ecd11f	gemma4: render differently based on model size Following up on #15560, this change now has e2b/e4b render differently from 26b/31b. For backwards compatibility, we take the existing renderer name `gemma4` and make it do dynamic resolution based on the model name/size, but the intended use is for the models to be republished with the renderer variant specified explicitly: `gemma4-small` or `gemma4-large`.	2026-04-15 14:37:16 -07:00
Patrick Devine	eb97274e5c	modelfiles: fix /save command and add shortname for safetensors based models (#15413 ) This change fixes two issues with Modelfiles: 1. If a user uses `ollama show --modelfile` to show a safetensors based model, the Model would leave the "FROM" field blank which won't allow a user to recreate the model. This change adds the model's current canonical short name to the FROM field. 2. If a user uses the `/save` command in the CLI any messages which were saved in a previous model wouldn't get saved (only the set of messages from the current session).	2026-04-08 21:05:39 -07:00
Daniel Hiltgen	30fdd229a4	create: Clean up experimental paths, fix create from existing safetensor model (#14679 ) * create: Clean up experimental paths This cleans up the experimental features, and adds both unit and integration test coverage to verify no regressions. * create: preserve config and layer names when creating from safetensors models When creating a model FROM an existing safetensors model, ModelFormat, Capabilities, and layer Name fields were lost. ModelFormat stayed empty because it's only set from GGML layers (which safetensors models lack), and layer names weren't copied in parseFromModel. This caused derived models to fail loading ("config.json not found in manifest"). * review comments	2026-04-07 08:12:57 -07:00
Daniel Hiltgen	96b202d34b	Add support for gemma4 (#15214 ) * bench: add prompt calibration, context size flag, and NumCtx reporting Add --num-ctx flag to set context size, and report NumCtx in model info header. Calibrate tokens-per-word ratio during warmup using actual tokenization metrics from the model, replacing the fixed 1.3 heuristic. This produces more accurate prompt token counts for --prompt-tokens. Also add fetchContextLength() to query running model context via /api/ps. * integration: improve vision test robustness and add thinking tests Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL is set to a non-vision model. Add Think:false to context exhaustion test to prevent thinking models from using all context before the test can measure it. Add third test image (ollama homepage) and replace OCR test with ImageDescription test using it. Relax match strings for broader model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed to verify thinking output and channel tag handling. * gemma4: add Gemma 4 GGML model support Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense) for the GGML backend including text, vision, converter, parser, and renderer. 
Text model features: - Sliding window + full attention with per-layer patterns - KV sharing across layers with donor map - Per-layer embeddings (PLE) with learned projections - MoE routing with RMSNorm + learned scale - Proportional RoPE with freq_factors for global attention - Final logit softcapping Vision model features: - SigLIP vision encoder with 2D RoPE - ClippableLinear with input/output clamping via packed v.clamp_data - Adaptive average pooling with nMerge kernel - Multi-modal projection with unweighted RMSNorm Converter: - Safetensors to GGUF with vision tensor renaming - Fused MoE gate_up_proj splitting - Vision patch embedding reshape (HF to Conv2D layout) - Packed clamp data tensor for ClippableLinear bounds - Proportional RoPE freq_factors generation Also includes: - BackendGet() on ml.Tensor for reading weight tensor data - Q6_K CUDA get_rows kernel support - MoE-aware ffn_down quantization layer counting - Gemma4 parser with tool calling and thinking support - Gemma4 renderer with structured tool format - Architecture-based auto-detection of renderer/parser/stop tokens - Integration test gemma4 model list additions * gemma4: add audio support with USM conformer encoder Add audio encoding for Gemma 4 using the USM conformer architecture: - Converter: audio tensor mapping, SSCP/conformer/embedder name replacements, softplus repacker for per_dim_scale, F32 enforcement for conv weights - GGML backend: Conv1DDW and PadExt tensor ops - Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local attention with relative position embeddings + LightConv1d + FFW), output projection, audio-to-text embedding projector - Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go) - Model wiring: WAV detection, audio token handling, unified PostTokenize Correctly transcribes "why is the sky blue" from test audio. 
* integration: add gemma4 audio tests including OpenAI API coverage Test audio transcription and response via the Ollama native API, plus two new tests exercising the OpenAI-compatible endpoints: - /v1/audio/transcriptions (multipart form upload) - /v1/chat/completions with input_audio content type All tests use capability checks and skip models without audio support. * gemma4: add OpenAI audio API support and capability detection - Add CapabilityAudio and detect from audio.block_count in GGUF - Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware - Add input_audio content type support in /v1/chat/completions - Add TranscriptionRequest/Response types in openai package * gemma4: add audio input support for run command - /audio toggle in interactive mode for voice chat - Platform-specific microphone recording (AVFoundation on macOS, PulseAudio/ALSA on Linux, WASAPI on Windows) - Space to start/stop recording, automatic chunking for long audio * gemma4: add transcribe command (ollama transcribe MODEL) - Interactive mode with readline prompt and slash commands - Non-interactive mode for piped audio or record-until-Ctrl+C - Chunked streaming transcription for long recordings - Word-wrapped output matching run command style * gemma4: add parser, renderer, and integration test plumbing * gemma4: fix renderer to emit BOS token * gemma4: add OpenAI audio transcription API and input_audio support * gemma4: update converter for new weight drop naming * gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config * gemma4: rewrite renderer to match HF Jinja2 template exactly Fix 8 bugs found by building 55 reference tests verified against the HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python): - Tool responses use separate <\|turn>tool turns (not inline tags) - Tool calls emitted before content in assistant messages - Thinking content stripped from assistant history (strip_thinking) - User, tool, and system content trimmed (template 
does \| trim) - Empty system message still emits system turn (check role, not content) - Nested object properties rendered recursively with required field - Array items specification rendered for array-type properties - OBJECT/ARRAY type-specific rendering comma logic matches template Also adds Required field to api.ToolProperty for nested object schemas, replaces old gemma4_test.go with comprehensive gemma4_reference_test.go, and commits the Jinja2 template as testdata for verification. * gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing - Text MoE: split `ffn_gate_up_exps` into contiguous `[gate\|up]` halves instead of stride-2 slices. - Parser: escape control characters in `<\|"\|>...<\|"\|>` string literals when converting tool-call args to JSON. - Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments. - Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`. * cmd: simplify audio input to dropped file attachments * gemma4: use full SWA memory for better cache reuse * gemma4: initialize clamps after backend load * convert: align gemma4 audio tensor renames with llama.cpp * Remove redundant comments in gemma4 vision model * Format Gemma4 MoE block field alignment * use 4096 kvcache.NewSWAMemCache * convert: support new Gemma4 audio_tower tensor naming (#15221) Co-authored-by: jmorganca <jmorganca@gmail.com> * fix integration test defaults for audio * review comments and lint fixes * remove unused audio/video files --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2026-04-02 11:33:33 -07:00
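The MoE fused gate_up fix in the row above contrasts two ways of splitting a fused row. A simplified one-dimensional sketch of the difference, assuming a flat `[]float32` row (the real tensors are multi-dimensional, and both helper names are hypothetical):

```go
package main

import "fmt"

// splitContiguous treats the fused row as [gate | up]: the first half is
// gate, the second half is up. This is the layout the fix switches to.
func splitContiguous(row []float32) (gate, up []float32) {
	half := len(row) / 2
	return row[:half], row[half:]
}

// splitStride2 treats the fused row as interleaved (gate, up, gate, up, ...).
// Applied to a contiguous layout, this mixes gate and up values.
func splitStride2(row []float32) (gate, up []float32) {
	for i, v := range row {
		if i%2 == 0 {
			gate = append(gate, v)
		} else {
			up = append(up, v)
		}
	}
	return gate, up
}

func main() {
	row := []float32{1, 2, 3, 4}
	g, u := splitContiguous(row)
	fmt.Println("contiguous:", g, u)
	g, u = splitStride2(row)
	fmt.Println("stride-2:  ", g, u)
}
```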
Parth Sareen	8e54823fd3	revert context length warnings change (#15121 )	2026-03-28 16:43:59 -07:00
Patrick Devine	9e7cb9697e	mlx: fix vision capability + min version (#15106 )	2026-03-27 17:09:28 -07:00
Bruce MacDonald	3824e380a8	server: preserve raw manifest bytes during pull (#15104 ) pullModelManifest unmarshals the registry response into a Go struct then re-marshals with json.Marshal before writing to disk. When the registry's JSON formatting or field ordering differs from Go's output, the local SHA256 won't match the registry's Ollama-Content-Digest header, causing false "out of date" warnings. Preserve the raw bytes from the registry response and write them directly to disk so the local manifest is byte-for-byte identical to what the registry serves.	2026-03-27 15:42:31 -07:00
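The digest mismatch described in #15104 is easy to reproduce in isolation. The sketch below uses a hypothetical two-field `manifest` struct and a made-up media type; it only shows that decoding and re-encoding JSON can change the bytes, and therefore the SHA256, even when the content is equivalent.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// manifest is a hypothetical, trimmed-down manifest shape for illustration.
type manifest struct {
	SchemaVersion int    `json:"schemaVersion"`
	MediaType     string `json:"mediaType"`
}

// digest returns the hex SHA256 of b with a "sha256:" prefix.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// roundTrip decodes raw into a struct and re-encodes it, which normalizes
// whitespace and field order to whatever json.Marshal produces.
func roundTrip(raw []byte) ([]byte, error) {
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	return json.Marshal(m)
}

func main() {
	// Same content as the struct, but different field order and spacing.
	raw := []byte(`{"mediaType": "application/vnd.example.manifest+json", "schemaVersion": 2}`)
	again, err := roundTrip(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("raw:       ", digest(raw))
	fmt.Println("re-encoded:", digest(again))
	fmt.Println("match:", digest(raw) == digest(again))
}
```

Writing `raw` to disk unchanged keeps the local digest equal to whatever the registry computed over the same bytes.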
Eva H	366625a831	launch: warn when server context length is below 64k for local models (#15044 ) A stop-gap for now to guide users better. We'll add more in-depth recommendations per integration as well. --------- Co-authored-by: Parth Sareen <parth.sareen@ollama.com>	2026-03-27 00:15:53 -07:00
Devon Rifkin	26b9f53f8e	api/show: overwrite basename for copilot chat (#15062 ) Copilot Chat prefers to use `general.basename` in the built-in Ollama integration, but this name isn't usually shown directly to users (and there may be many models that share this name). Instead we pass back `req.Model`, which for this extension is the value that we return from `/api/tags`	2026-03-25 14:02:22 -07:00
Devon Rifkin	46cb7795e1	add ability to turn on debug request logging (#14106 ) If `OLLAMA_DEBUG_LOG_REQUESTS` is set, then on server startup a temp folder will be created. Upon any inference request, the body will be logged to a file in this folder, as well as a small shell script to "replay" the request using cURL. This is just intended for debugging scenarios, not as something to turn on normally.	2026-03-19 17:08:17 -07:00
Devon Rifkin	e37a9b4c01	cloud_proxy: for the web_search legacy path, flush on newlines (#14897 ) `WebSearchAnthropicWriter` expects a single object per write. The new transparent proxy will instead send it whatever bytes it sees. This cloud-model + local-orchestration + cloud-search is a temporary code path, so instead of making the web search code more robust to this, I put an adapter in the middle that will flush line-by-line to preserve the old behavior.	2026-03-17 13:30:17 -07:00
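A line-by-line flushing adapter of the kind described in #14897 might look like the sketch below. `lineWriter` and `recorder` are hypothetical names, and this is not the actual adapter from the commit: it only shows the general technique of buffering arbitrary chunks and forwarding exactly one complete line per downstream write.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// lineWriter buffers whatever bytes it receives and forwards them to dst
// one newline-terminated line per Write call.
type lineWriter struct {
	dst io.Writer
	buf bytes.Buffer
}

// Write accepts an arbitrary chunk and flushes every complete line in the buffer.
func (w *lineWriter) Write(p []byte) (int, error) {
	w.buf.Write(p)
	for {
		i := bytes.IndexByte(w.buf.Bytes(), '\n')
		if i < 0 {
			break // no complete line yet; keep the partial data buffered
		}
		if _, err := w.dst.Write(w.buf.Next(i + 1)); err != nil {
			return len(p), err
		}
	}
	return len(p), nil
}

// Flush forwards any trailing partial line, e.g. at end of stream.
func (w *lineWriter) Flush() error {
	if w.buf.Len() == 0 {
		return nil
	}
	_, err := w.dst.Write(w.buf.Next(w.buf.Len()))
	return err
}

// recorder captures each downstream write separately for inspection.
type recorder struct{ writes []string }

func (r *recorder) Write(p []byte) (int, error) {
	r.writes = append(r.writes, string(p))
	return len(p), nil
}

func main() {
	rec := &recorder{}
	w := &lineWriter{dst: rec}
	// Chunks split at arbitrary byte boundaries, as a transparent proxy might send them.
	w.Write([]byte("{\"a\":1}\n{\"b\""))
	w.Write([]byte(":2}\n"))
	for i, s := range rec.writes {
		fmt.Printf("write %d: %q\n", i, s)
	}
}
```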
Jesse Gross	bbbad97686	sched: Model eviction for MLX MLX runners (image generation and LLM) previously bypassed the scheduler's standard load path via a separate loadMLX method. This meant they skipped VRAM fitting checks and couldn't participate in model eviction. Now all model types flow through the same load function. Model eviction for MLX is based on weights as KV cache and compute graph are dynamic. This means that eviction does not take into account the worst case memory and models can still compete for memory but it is a significant improvement.	2026-03-16 17:40:29 -07:00
Bruce MacDonald	3980c0217d	server: decompress zstd request bodies in cloud passthrough middleware (#14827 ) When a zstd-compressed request (e.g. from Codex CLI) hits /v1/responses with a cloud model the request failed. Fix by decompressing zstd bodies before model extraction, so cloud models are detected and proxied directly without the writer being wrapped.	2026-03-13 15:06:47 -07:00
Devon Rifkin	f676231de9	server: remove experimental aliases support (#14810 )	2026-03-12 20:27:24 -07:00
Devon Rifkin	8c4d5d6c2f	cloud_proxy: send ollama client version (#14769 ) This was previously included in the user agent, and we've made use of it in the past to hotpatch bugs server-side for particular Ollama versions.	2026-03-10 15:53:25 -07:00
Parth Sareen	61086083eb	server: add experimental web search and web fetch routes (#14753 )	2026-03-09 21:52:12 -07:00
Devon Rifkin	afb4c62fbf	cloud_proxy: handle stream disconnects gracefully (#14685 ) Previously we were printing out bad errors for expected cases like clients disconnecting. Now we only debug log when that happens (which still might help in cases where we're figuring out why an integration isn't working). For other errors, we print out a proper warning now	2026-03-06 19:18:52 -08:00
Jeffrey Morgan	4eab60c1e2	Reapply "don't require pulling stubs for cloud models" again (#14608 ) * Revert "Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)" This reverts commit `39982a954e`. * fix test + do cloud lookup only when seeing cloud models --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-06 14:27:47 -08:00
Parth Sareen	122c68c151	server: loosen thinking level constraint (#14625 )	2026-03-04 13:42:18 -08:00
Jeffrey Morgan	39982a954e	Revert "Reapply "don't require pulling stubs for cloud models"" (#14606 ) This reverts commit `799e51d419`.	2026-03-03 20:56:10 -08:00
Patrick Devine	110eff01a9	chore: remove old imagegen LLMs models (#14597 ) These models are implemented in the x/mlxrunner instead.	2026-03-03 13:23:40 -08:00
Jeffrey Morgan	799e51d419	Reapply "don't require pulling stubs for cloud models" This reverts commit `97d2f05a6d`.	2026-03-03 13:17:10 -08:00
Victor-Quqi	e8fcb29586	model/renderers: fix glm-ocr image tags in renderer prompts (#14584 )	2026-03-03 12:51:34 -08:00
Jeffrey Morgan	97d2f05a6d	Revert "don't require pulling stubs for cloud models (#14574 )" (#14596 ) This reverts commit `8207e55ec7`.	2026-03-03 12:51:23 -08:00
Devon Rifkin	8207e55ec7	don't require pulling stubs for cloud models (#14574 ) * don't require pulling stubs for cloud models This is a first in a series of PRs that will better integrate Ollama's cloud into the API and CLI. Previously we used to have a layer of indirection where you'd first have to pull a "stub" model that contains a reference to a cloud model. With this change, you don't have to pull first, you can just use a cloud model in various routes like `/api/chat` and `/api/show`. This change respects <https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled, these models won't be accessible. There's also a new, simpler pass-through proxy that doesn't convert the requests ahead of hitting the cloud models, which they themselves already support various formats (e.g., `v1/chat/completions` or Open Responses, etc.). This will help prevent issues caused by double converting (e.g., `v1/chat/completions` converted to `api/chat` on the client, then calling cloud and converting back to a `v1/chat/completions` response instead of the cloud model handling the original `v1/chat/completions` request first). There's now a notion of "source tags", which can be mixed with existing tags. So instead of having different formats like`gpt-oss:20b-cloud` vs. `kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify cloud by simply appending `:cloud`. This PR doesn't change model resolution yet, but sets us up to allow for things like omitting the non-source tag, which would make something like `ollama run gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works today. 
More detailed changes: - Added a shared model selector parser in `types/modelselector`: - supports `:cloud` and `:local` - accepts source tags in any position - supports legacy `:<tag>-cloud` - rejects conflicting source tags - Integrated selector handling across server inference/show routes: - `GenerateHandler`, `ChatHandler`, `EmbedHandler`, `EmbeddingsHandler`, `ShowHandler` - Added explicit-cloud passthrough proxy for ollama.com: - same-endpoint forwarding for `/api/`, `/v1/`, and `/v1/messages` - normalizes `model` (and `name` for `/api/show`) before forwarding - forwards request headers except hop-by-hop/proxy-managed headers - uses bounded response-header timeout - handles auth failures in a friendly way - Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`) - Updated create flow to support `FROM ...:cloud` model sources (though this flow uses the legacy proxy still, supporting Modelfile overrides is more complicated with the direct proxy approach) - Updated CLI/TUI/config cloud detection to use shared selector logic - Updated CLI preflight behavior so explicit cloud requests do not auto-pull local stubs What's next? - Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags` - Modelfile overlay support for virtual cloud models on OpenAI/Anthropic request families - Recommender/default-selection behavior for ambiguous model families - Fully remove the legacy flow Fixes: https://github.com/ollama/ollama/issues/13801 * consolidate pull logic into confirmAndPull helper pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic. Extract confirmAndPull to eliminate the duplication. * skip local existence checks for cloud models ModelExists and the TUI's modelExists both check the local model list, which causes cloud models to appear missing. Return true early for explicit cloud models so the TUI displays them beside the integration name and skips re-prompting the model picker on relaunch. 
* support optionally pulling stubs for new-style names We now normalize names like `<family>:<size>:cloud` into legacy-style names like `<family>:<size>-cloud` for pulling and deleting (this also supports stripping `:local`). Support for pulling cloud models is temporary; once we integrate properly into `/api/tags` we won't need this anymore. * Fix server alias syncing * Update cmd/cmd.go Co-authored-by: Parth Sareen <parth.sareen@ollama.com> * address comments * improve some naming --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-03 10:46:33 -08:00
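The source-tag rules listed in #14574 (`:cloud` and `:local` in any position, legacy `:<tag>-cloud`, conflicting tags rejected, normalization to legacy-style names) can be sketched roughly as below. This is not the `types/modelselector` implementation: `selector`, `parseSelector`, and `legacyName` are hypothetical names, and the sketch ignores registry hosts with ports and other name forms that contain extra colons.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errConflict = errors.New("conflicting source tags")

// selector is a hypothetical parsed form of a model name.
type selector struct {
	Name  string // name with source tags removed, e.g. "gpt-oss:20b"
	Cloud bool   // true if a cloud source tag was present
}

// parseSelector strips ":cloud" / ":local" from any position, accepts the
// legacy ":<tag>-cloud" suffix, and rejects cloud+local together.
func parseSelector(s string) (selector, error) {
	parts := strings.Split(s, ":")
	kept := []string{parts[0]}
	var cloud, local bool
	for _, p := range parts[1:] {
		switch {
		case p == "cloud":
			cloud = true
		case p == "local":
			local = true
		case strings.HasSuffix(p, "-cloud"):
			cloud = true
			kept = append(kept, strings.TrimSuffix(p, "-cloud"))
		default:
			kept = append(kept, p)
		}
	}
	if cloud && local {
		return selector{}, errConflict
	}
	return selector{Name: strings.Join(kept, ":"), Cloud: cloud}, nil
}

// legacyName renders the legacy-style name used for pulling and deleting:
// "<family>:<size>-cloud" when a tag exists, "<family>:cloud" otherwise.
func legacyName(sel selector) string {
	if !sel.Cloud {
		return sel.Name
	}
	if strings.Contains(sel.Name, ":") {
		return sel.Name + "-cloud"
	}
	return sel.Name + ":cloud"
}

func main() {
	inputs := []string{"gpt-oss:20b:cloud", "gpt-oss:20b-cloud", "kimi-k2.5:cloud", "llama3:8b:local", "x:cloud:local"}
	for _, in := range inputs {
		sel, err := parseSelector(in)
		if err != nil {
			fmt.Printf("%s -> error: %v\n", in, err)
			continue
		}
		fmt.Printf("%s -> %s\n", in, legacyName(sel))
	}
}
```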
Jesse Gross	ad16bffc7d	mlx: Remove peak memory from the API This is still in flux so it is better to just log it for now.	2026-03-02 15:56:18 -08:00
Bruce MacDonald	23d4cad1a2	server: verify digest is not empty on create (#14555 ) An empty digest is not a valid digest for an incoming create request. Reject empty digests at the api level.	2026-03-02 13:43:35 -08:00

1034 Commits