ollama/llama/compat/llama-ollama-compat.h
jmorganca 7e07653271 llama/compat: add glm-ocr text handler + text-loader load-op hook
glm-ocr (text side):
  * Arch rename `glmocr` → `glm4` (incl. KV prefix); upstream supports
    GLM-OCR via LLM_ARCH_GLM4 with n_layer=17 (16 main + 1 nextn). We
    report n_layer=16 and leave nextn_predict_layers absent — the
    Ollama blob doesn't ship the nextn layer's weights.
  * M-RoPE: pad `rope.mrope_section` (3 elements) →
    `rope.dimension_sections` (4 elements with trailing 0).
  * Inject `rope.dimension_count = key_length`.
  * Tokenizer pre-tokenizer rename `llama-bpe` → `chatglm-bpe`.
  * Tensor renames: `attn_out`→`attn_output`, `post_attn_norm`→
    `post_attention_norm`, `post_ffn_norm`→`post_ffw_norm`.
  * **Per-block FFN concat**: GLM4 expects fused
    `ffn_up.weight: [n_embd, n_ff*2]` (gate || up). Ollama writes
    separate `ffn_gate.weight` + `ffn_up.weight` (each `[n_embd, n_ff]`).
    Register a load-time concat op that stitches gate+up into the fused
    upstream slot, then add a per-block skip-prefix for the orphan
    `blk.X.ffn_gate.` so the n_tensors check lines up.
  * Hide embedded `v.*`/`mm.*` from the text loader.
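
The FFN concat above can be sketched as follows. This is a minimal illustration, not the shim's code: `concat_gate_up` is a hypothetical helper, and it assumes ggml's convention that the first listed dimension is contiguous in memory, so joining two `[n_embd, n_ff]` tensors along the second dimension is a back-to-back copy. It uses `float` for readability; the real load op reads raw tensor bytes from the file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama_ollama_compat).
// gate and up each hold n_ff rows of n_embd contiguous values.
// The fused result holds 2*n_ff rows: all gate rows, then all up rows.
inline std::vector<float> concat_gate_up(const std::vector<float> & gate,
                                         const std::vector<float> & up,
                                         std::size_t n_embd) {
    assert(n_embd > 0);
    assert(gate.size() == up.size());    // both are [n_embd, n_ff]
    assert(gate.size() % n_embd == 0);   // whole rows only

    std::vector<float> fused;
    fused.reserve(gate.size() + up.size());
    fused.insert(fused.end(), gate.begin(), gate.end());  // gate half first
    fused.insert(fused.end(), up.begin(), up.end());      // then the up half
    return fused;
}
```

Because the fused tensor takes over the `ffn_up.weight` name, the separate `ffn_gate.weight` has no destination left, which is why it needs the skip-prefix.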

This is the first text-side compat that needs custom load-time tensor
data (the FFN concat). Until now load-op support only covered the
clip side. New plumbing:

  * `set_loader_path(ml, fname)` — store the model file path on a
    per-loader registry, called from the loader constructor.
  * `maybe_load_text_tensor(ml, cur, off, buft)` — the text-side
    counterpart to `maybe_load_tensor`; looks up the path from the
    registry then delegates to the existing load-op machinery.
  * Upstream patch grows two new lines: a `set_loader_path` call in
    the constructor and a `maybe_load_text_tensor` hook in
    `load_all_data` (before the use_mmap branch).
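
A per-loader path registry like the one described above could look roughly like this. Everything here is an assumption made for illustration: `loader_stub` stands in for `llama_model_loader`, and `find_loader_path`, `forget_loader`, the mutex, and the map type are hypothetical choices, not details taken from the .cpp.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

struct loader_stub {};  // stand-in for llama_model_loader (assumption)

namespace registry_sketch {

inline std::mutex & mu() {
    static std::mutex m;
    return m;
}

inline std::unordered_map<const void *, std::string> & paths() {
    static std::unordered_map<const void *, std::string> p;
    return p;
}

// Record the file path for this loader, keyed by the loader's address.
inline void set_loader_path(const loader_stub * ml, const char * fname) {
    assert(ml != nullptr);
    std::lock_guard<std::mutex> lock(mu());
    paths()[ml] = fname ? fname : "";
}

// Copy the stored path into `out`; returns false if the loader is unknown.
inline bool find_loader_path(const loader_stub * ml, std::string & out) {
    std::lock_guard<std::mutex> lock(mu());
    auto it = paths().find(ml);
    if (it == paths().end()) {
        return false;
    }
    out = it->second;
    return true;
}

// Drop the entry so a later loader at the same address starts clean.
inline void forget_loader(const loader_stub * ml) {
    std::lock_guard<std::mutex> lock(mu());
    paths().erase(ml);
}

} // namespace registry_sketch
```

Keying on the loader's address keeps the upstream patch to a single call in the constructor, at the cost of needing some cleanup when a loader is destroyed.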

Verified: with --no-mmap, glm-ocr's blk.X.ffn_up.weight load fires
the concat op (28MB per block on the 1B variant) and the model emits
coherent text. Through `ollama run` the proper chat template applies.
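
The hook placement in `load_all_data` can be pictured with the sketch below: the hook is consulted first, and the regular mmap or read path runs only when the hook declines. The types and names (`tensor_stub`, `load_hook`, `load_all`) are hypothetical stand-ins, not llama.cpp code.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a tensor being loaded.
struct tensor_stub {
    std::string name;
    std::string loaded_by;  // records which path filled the tensor
};

// Returns true when the hook produced the tensor data itself.
using load_hook = std::function<bool(tensor_stub &)>;

inline void load_all(std::vector<tensor_stub> & tensors,
                     const load_hook & hook,
                     bool use_mmap) {
    for (auto & t : tensors) {
        // Hook runs before the use_mmap branch.
        if (hook && hook(t)) {
            continue;  // handled: skip the normal read path
        }
        t.loaded_by = use_mmap ? "mmap" : "read";
    }
}
```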

Note: vision (clip) handler is a follow-up.
2026-04-20 09:30:26 -07:00


#pragma once
// Ollama-format GGUF compatibility shim.
//
// Older Ollama builds ship GGUFs that differ from upstream in a handful of
// ways per-architecture (arch names, KV keys, tensor names, file layout).
// This shim detects those files during load and translates them in-memory
// so the rest of llama.cpp can load them unmodified.
//
// Upstream hook points in two files call into this namespace:
//
//   1. llama-model-loader.cpp (main model load):
//        set_loader_path()         - record the model file path per loader
//        translate_metadata()      - mutate KVs / tensor metadata
//        should_skip_tensor()      - filter weights_map population
//        maybe_load_text_tensor()  - override file read (e.g. FFN concat)
//
//   2. tools/mtmd/clip.cpp (mmproj load):
//        translate_clip_metadata() - rewrite KVs + tensor names for clip
//        maybe_load_tensor()       - override file read (e.g. F16->F32)
//
// Detection is per-arch; for any non-Ollama file every entry point is a
// no-op. Per-arch logic lives in anonymous-namespace handle_<arch>()
// functions in the .cpp; adding a new arch is a new handler plus one
// dispatch line in each translate_* entry point.
#include <cstddef>
#include <string>
#include "ggml-backend.h" // for ggml_backend_buffer_type_t
struct gguf_context;
struct ggml_context;
struct ggml_tensor;
struct llama_model_loader;
namespace llama_ollama_compat {
// Called from llama_model_loader's constructor, right after the arch is read.
void translate_metadata(const llama_model_loader * ml,
                        gguf_context * meta,
                        ggml_context * ctx,
                        std::string & arch_name);
// Called from llama_model_loader's weights_map population loop. Returns
// true to drop a tensor from the loader — used to hide embedded vision
// tensors from the text model's view without modifying the gguf_context.
bool should_skip_tensor(const llama_model_loader * ml, const char * tensor_name);
// Called from clip_model_loader's constructor. Rewrites the clip-facing
// view of the metadata (arch=clip, clip.vision.* KVs, renamed tensors)
// so the rest of clip.cpp can load an Ollama monolithic GGUF unchanged.
void translate_clip_metadata(gguf_context * meta, ggml_context * ctx);
// Called from clip.cpp's tensor-loading loop, before the normal file read.
// If this tensor was marked for type promotion by translate_clip_metadata
// (e.g. F16->F32), performs the conversion and writes the result into
// `cur` (host memcpy or backend_tensor_set based on `buft`). Returns true
// when the tensor was handled — caller should skip its normal read path.
bool maybe_load_tensor(ggml_tensor * cur,
                       const char * source_file,
                       size_t file_offset,
                       ggml_backend_buffer_type_t buft);
// Same as maybe_load_tensor but for the text-side llama_model_loader,
// which doesn't have the clip loader's `fname` in scope at the read
// site. Looks up the model's file path from a per-loader registry
// populated by `set_loader_path` (called from the model loader's
// constructor right after `fname` is in scope).
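bool maybe_load_text_tensor(const llama_model_loader * ml,
                            ggml_tensor * cur,
                            size_t file_offset,
                            ggml_backend_buffer_type_t buft);
// Called from llama_model_loader's constructor. Records `fname` for `ml`
// in the per-loader registry so maybe_load_text_tensor can find the file.
void set_loader_path(const llama_model_loader * ml, const char * fname);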
} // namespace llama_ollama_compat