mirror of
https://github.com/stack-auth/stack.git
synced 2026-06-30 21:01:54 +08:00
969bf03c5a
1 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
969bf03c5a
|
perf(platform-analytics): cut ClickHouse query peak memory (#1632)
## What
Performance pass on the internal **platform-analytics** route. All 17
ClickHouse queries fire in a single `Promise.all` on the shared
`stackframe` admin user, which is subject to a **9 GB per-user** memory
cap — so the worst case is the *sum* of per-query peaks, not the max.
Benchmarked at 10k projects / 1M users / 50M events (power-law, top
project ≈100k users), the sum of peaks was ~6.7 GiB. This PR brings it
down to ~3.8 GiB.
## Changes
**ClickHouse — `sipHash64(user_id)` as the distinct key** (exact,
verified byte-identical):
| query | peak mem | Δ |
|---|---|---|
| `dauSeries` | 949 → 373 MiB | −61% |
| `mauProjects` | 715 → 313 MiB | −56% |
| `activeByProject` | 635 → 374 MiB | −41% |
| `sparkByProject` | 1165 → 809 MiB | −31% |
A 64-bit hash has negligible collision probability over 1M users; the
benchmark confirmed identical output. (Same trick already used in the
internal-metrics MAU query.)
**ClickHouse — sample the activity split**
(`new`/`retained`/`reactivated`):
The split was the single heaviest query (~1.3 GiB) — its cost is a
window function over ~25.8M `(user, day)` rows plus an all-history scan,
which `sipHash` alone barely helped (−7%). It now uses **consistent
1-in-4 user sampling** (same `cityHash64(user_id) % 4` bucket applied to
both subqueries so each sampled user's full activity sequence is
preserved; counts scaled ×4):
- **317 MiB (−78%)** peak memory, **~0.4% mean error** (max 1.4% on the
smallest day) vs the exact result.
This is an **approximation** — the dashboard "Growth quality" chart now
notes it (`subtitle: "… · sampled estimate (~0.4%)"`).
`ACTIVITY_SPLIT_SAMPLE` is a single constant in the route; set it to `1`
to go back to exact.
## What I tried that did NOT make the cut (documented in the harnesses)
- `country` — peak memory is dominated by the per-user `argMax(country,
event_at)` payload, not the key, so hashing does nothing. Left
exact/unchanged.
- PG `authMethods` / `email` — with the production composite PK indexes
the original plans are already best; correlated-subquery / anti-join
rewrites were far worse. No PG query changes in this PR.
## Benchmark harnesses (added)
- `apps/backend/scripts/benchmark-platform-analytics.ts` — full-route
baseline (per-query time/memory/rows).
- `apps/backend/scripts/optimize-platform-analytics.ts` — sipHash & PG
variant comparison with byte-equality checks.
- `apps/backend/scripts/optimize-split.ts` — exact vs sampled split
variants with accuracy measurement.
They seed isolated `bench_pa` databases (server-side, auto-cleaned) and
read `system.query_log` / `EXPLAIN (ANALYZE, BUFFERS)`. Run e.g.:
`pnpm --filter @hexclave/backend run with-env:dev tsx
scripts/optimize-split.ts`
## Testing
- Backend `typecheck` passes. (Dashboard has pre-existing typecheck
errors on the base branch in unrelated files — auth-methods,
team-analytics, user-emails, RDE config — not touched here.)
- All exact rewrites verified byte-identical to the originals by the
harnesses; the sampled split measured at ~0.4% mean error.
Numbers are local warm-cache (relative shape, not production latency).
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Cuts worst-case ClickHouse memory for the internal platform analytics
route by switching to hashed distinct keys and sampling the heaviest
query. On a 10k projects / 1M users / 50M events benchmark, the sum of
per-query peaks drops from ~6.7 GiB to ~3.8 GiB with exact results (or
~0.4% error on the sampled chart).
- **Performance**
- Use sipHash64(user_id) as the distinct key in uniqExact/uniqExactIf
for DAU series, MAU/projects, active-by-project, and sparkline. Exact
results (verified). Peak memory down 31–61% per query.
- Sample the new/retained/reactivated split at 1-in-4 users (consistent
`cityHash64` bucket across subqueries, counts ×4). Peak memory ~−78%
(~1.3 GiB → ~0.3 GiB) with ~0.4% mean error. Toggle via
`ACTIVITY_SPLIT_SAMPLE` (set to 4; set to 1 for exact). Dashboard
subtitle now notes “sampled estimate (~0.4%).”
- Added local harnesses to seed isolated data and measure
time/memory/equality:
`apps/backend/scripts/internal-analytics/benchmark-platform-analytics.ts`,
`optimize-platform-analytics.ts`, `optimize-split.ts`.
<sup>Written for commit
|