Author SHA1 Message Date
BilalG1
969bf03c5a
perf(platform-analytics): cut ClickHouse query peak memory (#1632)
## What

Performance pass on the internal **platform-analytics** route. All 17
ClickHouse queries fire in a single `Promise.all` on the shared
`stackframe` admin user, which is subject to a **9 GB per-user** memory
cap — so the worst case is the *sum* of per-query peaks, not the max.
Benchmarked at 10k projects / 1M users / 50M events (power-law, top
project ≈100k users), the sum of peaks was ~6.7 GiB. This PR brings it
down to ~3.8 GiB.
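
To make the "sum, not max" point concrete, here is a minimal sketch that adds up the four per-query peaks listed in the table under Changes. It covers only those four of the 17 queries, so the totals are partial and are not the ~6.7 GiB / ~3.8 GiB route-level figures. The helper names are hypothetical.

```typescript
// Peak memory in MiB, copied from the ClickHouse table in this description.
// Only four of the route's 17 queries are listed there.
const peaksMiB: Record<string, { before: number; after: number }> = {
  dauSeries: { before: 949, after: 373 },
  mauProjects: { before: 715, after: 313 },
  activeByProject: { before: 635, after: 374 },
  sparkByProject: { before: 1165, after: 809 },
};

// Queries started together in one Promise.all can all sit at their peak at the
// same moment, so the number to hold against a per-user cap is the sum.
function worstCaseConcurrentMiB(peaks: number[]): number {
  return peaks.reduce((total, peak) => total + peak, 0);
}

// If the same queries ran one after another, only the largest would matter.
function worstCaseSequentialMiB(peaks: number[]): number {
  return Math.max(...peaks);
}

const beforePeaks = Object.keys(peaksMiB).map((name) => peaksMiB[name].before);
const afterPeaks = Object.keys(peaksMiB).map((name) => peaksMiB[name].after);

const concurrentBefore = worstCaseConcurrentMiB(beforePeaks); // 3464
const concurrentAfter = worstCaseConcurrentMiB(afterPeaks); // 1869
const sequentialBefore = worstCaseSequentialMiB(beforePeaks); // 1165
const sequentialAfter = worstCaseSequentialMiB(afterPeaks); // 809

console.log({ concurrentBefore, concurrentAfter, sequentialBefore, sequentialAfter });
```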

## Changes

**ClickHouse — `sipHash64(user_id)` as the distinct key** (exact,
verified byte-identical):

| query | peak mem | Δ |
|---|---|---|
| `dauSeries` | 949 → 373 MiB | −61% |
| `mauProjects` | 715 → 313 MiB | −56% |
| `activeByProject` | 635 → 374 MiB | −41% |
| `sparkByProject` | 1165 → 809 MiB | −31% |

A 64-bit hash has negligible collision probability over 1M users; the
benchmark confirmed identical output. (Same trick already used in the
internal-metrics MAU query.)
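
As a rough check on "negligible", the birthday bound gives the expected number of colliding pairs among `n` hashed ids. This is a sketch that assumes `sipHash64` output behaves like a uniformly random 64-bit value for these ids; the function name is hypothetical.

```typescript
// Birthday-bound approximation: among n uniformly random b-bit hashes, the
// expected number of colliding pairs is n(n-1)/2 divided by 2^b. When that
// value is far below 1 it is also roughly the chance of any collision at all.
function expectedCollidingPairs(n: number, bits: number): number {
  return (n * (n - 1)) / 2 / Math.pow(2, bits);
}

// Benchmark scale from this PR: 1M users, 64-bit hash.
const collisions64 = expectedCollidingPairs(1_000_000, 64); // about 2.7e-8

// For contrast, a 32-bit hash at the same scale would be expected to collide
// on the order of a hundred times, so it could not serve as an exact key.
const collisions32 = expectedCollidingPairs(1_000_000, 32); // about 116

console.log({ collisions64, collisions32 });
```

At that probability, `uniqExact(sipHash64(user_id))` is expected to return the same count as `uniqExact(user_id)`, which matches the identical output the benchmark reported.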

**ClickHouse — sample the activity split**
(`new`/`retained`/`reactivated`):
The split was the single heaviest query (~1.3 GiB). Its cost comes from
a window function over ~25.8M `(user, day)` rows plus an all-history
scan, and hashing the key with `sipHash64` alone lowered the peak by
only 7%. It now uses **consistent 1-in-4 user sampling**: the same
`cityHash64(user_id) % 4` bucket is applied to both subqueries so each
sampled user's full activity sequence is preserved, and the counts are
scaled ×4:

- **317 MiB (−78%)** peak memory, **~0.4% mean error** (max 1.4% on the
smallest day) vs the exact result.
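
The consistent-sampling idea can be sketched outside ClickHouse as follows. This is an illustration, not the route's code: `stableHash` is a hypothetical stand-in for `cityHash64`, and `ActivityRow`, `inSample`, `sampleRows`, and `estimateActiveUsersByDay` are made-up names. Only `ACTIVITY_SPLIT_SAMPLE` comes from the PR.

```typescript
// Name taken from the PR: 4 means 1-in-4 users, 1 means exact (no sampling).
const ACTIVITY_SPLIT_SAMPLE = 4;

interface ActivityRow {
  userId: string;
  day: string; // e.g. "2026-06-01"
}

// Hypothetical stand-in for ClickHouse's cityHash64: FNV-1a followed by a
// murmur-style finalizer. Any stable, well-mixed hash would do for the sketch.
function stableHash(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// The bucket depends only on the user id, so a user is either kept in every
// subquery or dropped from every subquery.
function inSample(userId: string, rate: number = ACTIVITY_SPLIT_SAMPLE): boolean {
  return stableHash(userId) % rate === 0;
}

function sampleRows(rows: ActivityRow[], rate: number = ACTIVITY_SPLIT_SAMPLE): ActivityRow[] {
  return rows.filter((row) => inSample(row.userId, rate));
}

// Distinct sampled users per day, scaled back up by the sampling rate.
function estimateActiveUsersByDay(
  rows: ActivityRow[],
  rate: number = ACTIVITY_SPLIT_SAMPLE,
): Map<string, number> {
  const usersByDay = new Map<string, Set<string>>();
  for (const row of sampleRows(rows, rate)) {
    const users = usersByDay.get(row.day) ?? new Set<string>();
    users.add(row.userId);
    usersByDay.set(row.day, users);
  }
  const estimate = new Map<string, number>();
  usersByDay.forEach((users, day) => estimate.set(day, users.size * rate));
  return estimate;
}
```

Filtering by a hash of the user id, rather than picking random rows, is what keeps each sampled user's `(user, day)` sequence whole, so classifications such as new, retained, and reactivated stay correct for the users that remain. Passing a rate of `1` keeps every user and gives the exact counts.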

This is an **approximation** — the dashboard "Growth quality" chart now
notes it (`subtitle: "… · sampled estimate (~0.4%)"`).
`ACTIVITY_SPLIT_SAMPLE` is a single constant in the route; set it to `1`
to go back to exact.

## What I tried that did NOT make the cut (documented in the harnesses)

- `country` — peak memory is dominated by the per-user `argMax(country,
event_at)` payload, not the key, so hashing does nothing. Left
exact/unchanged.
- PG `authMethods` / `email` — with the production composite PK indexes
the original plans are already best; correlated-subquery / anti-join
rewrites were far worse. No PG query changes in this PR.

## Benchmark harnesses (added)

- `apps/backend/scripts/benchmark-platform-analytics.ts` — full-route
baseline (per-query time/memory/rows).
- `apps/backend/scripts/optimize-platform-analytics.ts` — sipHash & PG
variant comparison with byte-equality checks.
- `apps/backend/scripts/optimize-split.ts` — exact vs sampled split
variants with accuracy measurement.

They seed isolated `bench_pa` databases (server-side, auto-cleaned) and
read `system.query_log` / `EXPLAIN (ANALYZE, BUFFERS)`. Run e.g.:
`pnpm --filter @hexclave/backend run with-env:dev tsx
scripts/optimize-split.ts`
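
For readers unfamiliar with `system.query_log`, the sketch below builds the kind of lookup a harness could run to read per-query peak memory. It is an assumption about the shape of such a query, not the harness code: `buildQueryLogSql` and the example query ids are hypothetical, while `type`, `query_id`, `query_duration_ms`, `memory_usage`, and `read_rows` are columns of ClickHouse's `system.query_log` table.

```typescript
// Builds a SELECT against system.query_log for a known set of query ids.
// Ids are validated against a conservative pattern because they are inlined
// into the SQL text rather than bound as parameters.
function buildQueryLogSql(queryIds: string[]): string {
  if (queryIds.length === 0) {
    throw new Error("at least one query id is required");
  }
  for (const id of queryIds) {
    if (!/^[A-Za-z0-9_-]+$/.test(id)) {
      throw new Error("unexpected characters in query id: " + id);
    }
  }
  const idList = queryIds.map((id) => "'" + id + "'").join(", ");
  return [
    "SELECT query_id, query_duration_ms, memory_usage, read_rows",
    "FROM system.query_log",
    "WHERE type = 'QueryFinish'",
    "  AND query_id IN (" + idList + ")",
    "ORDER BY memory_usage DESC",
  ].join("\n");
}

// Hypothetical ids that a benchmark run might have assigned to its queries.
console.log(buildQueryLogSql(["bench-dauSeries", "bench-mauProjects"]));
```

ClickHouse writes `system.query_log` asynchronously, so a benchmark would typically issue `SYSTEM FLUSH LOGS` before running a lookup like this.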

## Testing

- Backend `typecheck` passes. (Dashboard has pre-existing typecheck
errors on the base branch in unrelated files — auth-methods,
team-analytics, user-emails, RDE config — not touched here.)
- All exact rewrites verified byte-identical to the originals by the
harnesses; the sampled split measured at ~0.4% mean error.

Numbers are local warm-cache (relative shape, not production latency).

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Cuts worst-case ClickHouse memory for the internal platform analytics
route by switching to hashed distinct keys and sampling the heaviest
query. On a 10k projects / 1M users / 50M events benchmark, the sum of
per-query peaks drops from ~6.7 GiB to ~3.8 GiB with exact results (or
~0.4% error on the sampled chart).

- **Performance**
  - Use `sipHash64(user_id)` as the distinct key in
    `uniqExact`/`uniqExactIf` for DAU series, MAU/projects,
    active-by-project, and sparkline. Exact results (verified). Peak
    memory down 31–61% per query.
  - Sample the new/retained/reactivated split at 1-in-4 users
    (consistent `cityHash64` bucket across subqueries, counts ×4). Peak
    memory ~−78% (~1.3 GiB → ~0.3 GiB) with ~0.4% mean error. Toggle via
    `ACTIVITY_SPLIT_SAMPLE` (set to 4; set to 1 for exact). Dashboard
    subtitle now notes “sampled estimate (~0.4%).”
  - Added local harnesses to seed isolated data and measure
    time/memory/equality:
    `apps/backend/scripts/internal-analytics/benchmark-platform-analytics.ts`,
    `optimize-platform-analytics.ts`, `optimize-split.ts`.

<sup>Written for commit 60ccf1a06f.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Updates

* **Improvements**
  * Enhanced platform analytics calculations for more consistent and
    efficient user counting across key performance indicators (DAU, MAU,
    per-project metrics).
  * Updated the Growth Quality chart to indicate that user counts are
    sampled estimates with approximately 0.4% margin of error, a
    trade-off made for improved performance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: mantrakp04 <mantrakp@gmail.com>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: mantra <mantra@stack-auth.com>
2026-06-19 12:44:28 -07:00