ollama/llama/compat/llama-ollama-compat-util.h
jmorganca 0c33775d37 llama/compat: disable mmap when load_op transforms text-side tensors
Root-cause fix for the glm-ocr GGML_ASSERT(buft) crash: handle_glmocr
fuses ffn_gate + ffn_up into a single ffn_up tensor with ne[1] doubled.
The reshape grows ggml_nbytes past what the file's mmap region for the
original tensor can back, and the upstream loader's mmap path tries to
bind the tensor's storage directly to that (too-small) region. With our
maybe_load_text_tensor hook attempting to fill the tensor before that
binding, cur->buffer is still null and ggml_backend_buft_is_host(nullptr)
asserts.
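
As a rough illustration of the size mismatch described above, the sketch below computes byte counts for a dense tensor before and after `ne[1]` is doubled. The dimensions, the F16 element size, and the `dense_nbytes` helper are assumptions made for the example; the real `ggml_nbytes` also accounts for quantized block types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed by a dense (non-quantized) 2-D tensor with ne[0] = ne0 and
// ne[1] = ne1. Simplified stand-in for ggml_nbytes, for illustration only.
constexpr size_t dense_nbytes(int64_t ne0, int64_t ne1, size_t elem_size) {
    return static_cast<size_t>(ne0) * static_cast<size_t>(ne1) * elem_size;
}

// Hypothetical dimensions, not taken from any real model.
constexpr int64_t kEmbd = 1024;  // ne[0]
constexpr int64_t kFF   = 4096;  // ne[1] of the original ffn_up
constexpr size_t  kF16  = 2;     // bytes per F16 element (assumed type)

// Size of the file region that backs the original ffn_up tensor.
constexpr size_t file_region_bytes = dense_nbytes(kEmbd, kFF, kF16);
// Size the tensor reports once ne[1] is doubled to hold gate + up.
constexpr size_t fused_bytes = dense_nbytes(kEmbd, 2 * kFF, kF16);

// True when the mapped region cannot back the reshaped tensor.
constexpr bool region_too_small() { return fused_bytes > file_region_bytes; }
```

Under these assumptions the fused tensor needs twice the bytes that the original tensor's file region provides, which is why binding it to that region cannot work.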

The previous defensive null-buft check (dbba9b170) only papered over the
crash — it would silently return false and let upstream proceed with the
wrong mmap-backed binding, producing garbage output instead.

Real fix: have handlers that transform text-side tensor bytes call
`disable_mmap_for(ml)`. translate_metadata then returns true, and the
patch site sets `use_mmap = false`. The non-mmap path pre-allocates real
backend buffers via ggml_backend_alloc_ctx_tensors_from_buft, after
which our load_op overrides land in writable memory.
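
A minimal sketch of the flag flow described above, assuming a process-wide set keyed by the loader pointer. Everything inside `namespace sketch`, including the `handler_transforms_text_tensors` parameter and `resolve_use_mmap`, is a hypothetical stand-in; the actual definitions live in llama-ollama-compat-util.cpp and the patch site, and may differ.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_set>

struct llama_model_loader;  // opaque: used only as a key, never dereferenced

namespace sketch {

inline std::mutex & flag_mutex() {
    static std::mutex m;
    return m;
}

inline std::unordered_set<const llama_model_loader *> & no_mmap_set() {
    static std::unordered_set<const llama_model_loader *> s;
    return s;
}

// Called by a handler whose load_op rewrites text-side tensor bytes.
inline void disable_mmap_for(const llama_model_loader * ml) {
    std::lock_guard<std::mutex> lock(flag_mutex());
    no_mmap_set().insert(ml);
}

inline bool is_mmap_disabled_for(const llama_model_loader * ml) {
    std::lock_guard<std::mutex> lock(flag_mutex());
    return no_mmap_set().count(ml) != 0;
}

// Stand-in for translate_metadata: runs a (simulated) handler, then
// reports whether the loader must avoid mmap.
inline bool translate_metadata(const llama_model_loader * ml,
                               bool handler_transforms_text_tensors) {
    if (handler_transforms_text_tensors) {
        disable_mmap_for(ml);
    }
    return is_mmap_disabled_for(ml);
}

// Stand-in for the patch site: consume the bool and override use_mmap.
inline bool resolve_use_mmap(const llama_model_loader * ml, bool requested,
                             bool handler_transforms_text_tensors) {
    bool use_mmap = requested;
    if (translate_metadata(ml, handler_transforms_text_tensors)) {
        use_mmap = false;
    }
    return use_mmap;
}

} // namespace sketch
```

Keying the flag by loader pointer keeps the decision per model load, so a loader whose handlers only rename tensors would keep its requested mmap setting.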

Currently only handle_glmocr needs this; the per-block FFN concat is the
sole text-side reshape with a load_op. Other handlers (gemma3, gemma4,
qwen35, etc.) only do KV translation or rename-without-resize and remain
mmap-compatible.

Patch unchanged in line count (78) — the existing translate_metadata
call site is rewritten to consume the new bool return.

Verification blocked: every prior llama-server process on this box is
wedged in macOS uninterruptible-Metal-wait (UE) and survives kill -9,
preventing new processes from initializing past Metal discovery. Fix
will be live-verified after reboot.
2026-04-20 09:30:27 -07:00


#pragma once
// Internal helpers shared by the per-architecture handlers in
// llama-ollama-compat.cpp. Not part of the public API.
//
// Everything lives under namespace llama_ollama_compat::detail. The
// definitions live in llama-ollama-compat-util.cpp, which also owns the
// registry globals (tensor skip list, load-op table) that need a single
// translation unit.
//
// ---- Non-public API dependencies (see also README.md "Maintenance") ----
//
// Mostly public: gguf_* and ggml_* accessors from ggml/include/ are all
// stable. `ggml_backend_*` and `ggml_fp16_to_fp32` are stable too.
//
// Three pieces we rely on that aren't strictly guaranteed public:
//
//   1. Direct writes to `ggml_tensor::type`, `ne[]`, `nb[]`: the struct is
//      public and fields are spec'd, but there's no sanctioned mutator for
//      them post-creation. Used in set_tensor_type / set_tensor_shape /
//      reclaim_slot_as. Risk: upstream could in principle introduce an
//      opaque-tensor mode; in practice it hasn't in years.
//
//   2. `const_cast<char *>(gguf_get_tensor_name(...))` in rename_tensor.
//      The pointer returned points into a mutable char[GGML_MAX_NAME]
//      buffer inside a std::vector element. Defined behavior as long as
//      upstream keeps name storage in-line (has done so forever).
//
//   3. `llama_model_loader` forward decl from src/llama-model-loader.h
//      (internal, not llama.h). Only used as an opaque pointer key for
//      the skip-prefix registry; we never dereference it. Could swap for
//      `const void *` if upstream ever moved that type around.
//
// All three are trivially replaceable if upstream changes out from under
// us. See llama/compat/README.md for the escape hatches.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <initializer_list>
#include <string>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"
#include "gguf.h"
struct llama_model_loader;
namespace llama_ollama_compat::detail {
// -- gguf_context KV helpers --
bool has_key(const gguf_context * meta, const char * key);
void copy_u32_kv(gguf_context * meta, const char * src, const char * dst);
void copy_f32_kv(gguf_context * meta, const char * src, const char * dst);
// Generic copy that preserves the source's gguf_type. Skips if `src` is
// missing or `dst` is already present. Arrays are copied verbatim
// (including element type).
void copy_kv(gguf_context * meta, const char * src, const char * dst);
// Copy every KV whose key starts with `old_prefix` to a new key under
// `new_prefix`. Old keys are left in place — harmless because the loader
// looks up keys by exact name and only queries the new prefix.
void rename_kv_prefix(gguf_context * meta, const char * old_prefix,
                      const char * new_prefix);
void inject_u32_if_missing (gguf_context * meta, const char * key, uint32_t v);
void inject_f32_if_missing (gguf_context * meta, const char * key, float v);
void inject_str_if_missing (gguf_context * meta, const char * key, const char * v);
void inject_bool_if_missing(gguf_context * meta, const char * key, bool v);
void inject_f32_arr_if_missing(gguf_context * meta, const char * key,
                               const float * data, size_t n);
void truncate_str_arr (gguf_context * meta, const char * key, size_t new_n);
void truncate_data_arr(gguf_context * meta, const char * key,
                       gguf_type elem_type, size_t elem_size, size_t new_n);
// -- ggml_context tensor scans --
bool any_tensor_with_prefix(const ggml_context * ctx, const char * prefix);
// -- Tensor renaming / reshaping (mutates both gguf_context and ggml_context) --
void rename_tensor(gguf_context * meta, ggml_context * ctx,
                   const char * old_name, const char * new_name);
void rename_tensors_containing(gguf_context * meta, ggml_context * ctx,
                               const char * needle, const char * replacement);
void set_tensor_type (ggml_tensor * t, ggml_type type);
void set_tensor_shape(ggml_tensor * t, std::initializer_list<int64_t> shape);
bool reclaim_slot_as (gguf_context * meta, ggml_context * ctx,
                      const char * orphan_name, const char * new_name,
                      std::initializer_list<int64_t> shape, ggml_type type);
// -- File-offset capture (before rename) --
size_t tensor_file_offset(const gguf_context * meta, const char * name);
// -- Per-loader skip-prefix registry --
void add_skip_prefix(const llama_model_loader * ml, std::string prefix);
bool should_skip_tensor_prefix(const llama_model_loader * ml, const char * name);
// -- Per-loader "needs no-mmap" flag --
// Handlers that register a load_op which transforms a TEXT-side tensor's
// bytes (e.g. concat reshape) must call disable_mmap_for(ml). With mmap
// the upstream loader binds the tensor directly to the file region, so
// our load_op has no writable buffer to fill. translate_metadata reads
// this flag and returns it to the patch site.
void disable_mmap_for(const llama_model_loader * ml);
bool is_mmap_disabled_for(const llama_model_loader * ml);
// -- Load-time transform registry --
struct LoadOp {
    std::function<bool(const char * src_file, void * dst, size_t dst_size)> apply;
    const char * description;
};
void register_load_op(std::string dest_name, LoadOp op);
bool take_load_op (const char * dest_name, LoadOp & out); // removes + returns
// Read `size` bytes at `offset` from `path` into `dst`. Used by LoadOps.
bool read_at(const char * path, size_t offset, void * dst, size_t size);
// -- Common high-level transforms --
// F16 -> F32 promotion. Captures the source file offset at registration
// time so later renames/reshapes of this tensor don't invalidate the read.
void promote_tensor_to_f32(gguf_context * meta, ggml_context * ctx, const char * name);
// Concatenate N source tensors into one destination. Captures each source's
// file offset + byte size at registration time. Layout assumption: sources
// concatenate cleanly along the destination's slowest-varying ggml axis
// (ne[0] is the fastest), so the destination bytes are exactly
// src[0] || src[1] || ... .
void register_concat_load(const gguf_context * meta, std::string dest_name,
                          const std::vector<std::string> & src_names);
// Mixed-type variant of register_concat_load: dequantizes each source to
// F32 via its ggml_type_traits.to_float and concatenates the F32 arrays.
// Use when sources differ in quantization (e.g. F16 q/k + Q8_0 v in some
// Ollama vision blobs). Caller must set the destination tensor's type to
// GGML_TYPE_F32 so dst_size matches the F32 concat size.
void register_concat_load_to_f32(const gguf_context * meta,
                                 const ggml_context * ctx,
                                 std::string dest_name,
                                 const std::vector<std::string> & src_names);
} // namespace llama_ollama_compat::detail
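
The concat layout assumption documented for register_concat_load can be sketched as follows: each source is read from its captured file offset and written back-to-back into the destination buffer. This is an illustrative stand-in, not the code in llama-ollama-compat-util.cpp; `SrcSpan`, `concat_load`, and this `read_at` body are hypothetical, and only the `read_at` signature is taken from the header above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

namespace sketch {

// Read `size` bytes at `offset` from `path` into `dst`.
// Returns false if the file cannot be opened or the read comes up short.
inline bool read_at(const char * path, size_t offset, void * dst, size_t size) {
    std::ifstream f(path, std::ios::binary);
    if (!f) {
        return false;
    }
    f.seekg(static_cast<std::streamoff>(offset));
    f.read(static_cast<char *>(dst), static_cast<std::streamsize>(size));
    return static_cast<size_t>(f.gcount()) == size;
}

// File offset and byte size of one source tensor, captured up front so
// later renames or reshapes cannot invalidate the read (hypothetical type).
struct SrcSpan {
    size_t offset;
    size_t size;
};

// Fill dst with src[0] || src[1] || ... ; refuses to run if the summed
// source sizes do not match dst_size exactly.
inline bool concat_load(const char * path, const std::vector<SrcSpan> & srcs,
                        void * dst, size_t dst_size) {
    size_t total = 0;
    for (const SrcSpan & s : srcs) {
        total += s.size;
    }
    if (total != dst_size) {
        return false;
    }
    char * out = static_cast<char *>(dst);
    for (const SrcSpan & s : srcs) {
        if (!read_at(path, s.offset, out, s.size)) {
            return false;
        }
        out += s.size;
    }
    return true;
}

} // namespace sketch
```

The size check up front mirrors the requirement that the destination tensor's shape already accounts for every source; a mismatch is reported rather than leaving the buffer partially filled.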