CodeCow/stack

mirror of https://github.com/stack-auth/stack.git synced 2026-06-30 21:01:54 +08:00

Commit Graph

Author SHA1 Message Date


BilalG1

969bf03c5a

perf(platform-analytics): cut ClickHouse query peak memory (#1632)

## What

Performance pass on the internal **platform-analytics** route. All 17
ClickHouse queries fire in a single `Promise.all` on the shared
`stackframe` admin user, which is subject to a **9 GB per-user** memory
cap — so the worst case is the *sum* of per-query peaks, not the max.
Benchmarked at 10k projects / 1M users / 50M events (power-law, top
project ≈100k users), the sum of peaks was ~6.7 GiB. This PR brings it
down to ~3.8 GiB.
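
To make the sum-versus-max point concrete, here is a minimal TypeScript sketch. It uses only the four "before" peaks that this description itemizes in its table (the other 13 queries are not listed, so they are left out), and the function names are hypothetical rather than taken from the route.

```typescript
// Peak memory in MiB for the four queries itemized in this PR ("before" values).
const peaksBeforeMiB: { query: string; peakMiB: number }[] = [
  { query: "dauSeries", peakMiB: 949 },
  { query: "mauProjects", peakMiB: 715 },
  { query: "activeByProject", peakMiB: 635 },
  { query: "sparkByProject", peakMiB: 1165 },
];

// Queries started together (e.g. via Promise.all) can all sit at their peak
// at the same moment, so the worst case against a per-user cap is the sum.
function concurrentWorstCaseMiB(peaks: { peakMiB: number }[]): number {
  return peaks.reduce((sum, p) => sum + p.peakMiB, 0);
}

// If the same queries ran one at a time, only the largest peak would count.
function sequentialWorstCaseMiB(peaks: { peakMiB: number }[]): number {
  return Math.max(...peaks.map((p) => p.peakMiB));
}

console.log(concurrentWorstCaseMiB(peaksBeforeMiB)); // 3464
console.log(sequentialWorstCaseMiB(peaksBeforeMiB)); // 1165
```

Even this subset of four queries reaches roughly three times the largest single peak when run concurrently, which is why the route is budgeted on the sum.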

## Changes

**ClickHouse — `sipHash64(user_id)` as the distinct key** (exact,
verified byte-identical):

| query | peak mem | Δ |
|---|---|---|
| `dauSeries` | 949 → 373 MiB | −61% |
| `mauProjects` | 715 → 313 MiB | −56% |
| `activeByProject` | 635 → 374 MiB | −41% |
| `sparkByProject` | 1165 → 809 MiB | −31% |
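
The Δ column can be re-derived from the two peak figures in each row. The sketch below (helper names are illustrative, not from the PR) recomputes the percentages and the combined saving for these four queries only.

```typescript
// Before/after peaks in MiB, copied from the table above.
const hashedKeyRows = [
  { query: "dauSeries", beforeMiB: 949, afterMiB: 373 },
  { query: "mauProjects", beforeMiB: 715, afterMiB: 313 },
  { query: "activeByProject", beforeMiB: 635, afterMiB: 374 },
  { query: "sparkByProject", beforeMiB: 1165, afterMiB: 809 },
];

// Relative change in percent, rounded to the nearest integer.
function deltaPercent(beforeMiB: number, afterMiB: number): number {
  return Math.round(((afterMiB - beforeMiB) / beforeMiB) * 100);
}

const deltas = hashedKeyRows.map((r) => deltaPercent(r.beforeMiB, r.afterMiB));
const savedMiB = hashedKeyRows.reduce(
  (sum, r) => sum + (r.beforeMiB - r.afterMiB),
  0,
);

console.log(deltas);   // [ -61, -56, -41, -31 ]
console.log(savedMiB); // 1595
```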

A 64-bit hash has negligible collision probability over 1M users; the
benchmark confirmed identical output. (Same trick already used in the
internal-metrics MAU query.)
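
As a rough check on the collision claim: by the union (birthday) bound, the probability that any two of n distinct ids share a b-bit hash is at most n(n-1)/2 divided by 2^b, assuming the hash behaves uniformly. The sketch below illustrates that bound under this assumption; it is not something the harnesses measure.

```typescript
// Upper bound on the probability of at least one collision among `n`
// distinct inputs hashed to `bits` bits, assuming a uniform hash.
function collisionUpperBound(n: number, bits: number): number {
  const pairs = (n * (n - 1)) / 2;
  return pairs / Math.pow(2, bits);
}

const pCollision = collisionUpperBound(1e6, 64);
console.log(pCollision.toExponential(1)); // "2.7e-8"
```

At 1M users the bound is on the order of one in tens of millions, which is consistent with the benchmark producing identical output.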

**ClickHouse — sample the activity split**
(`new`/`retained`/`reactivated`):
The split was the single heaviest query (~1.3 GiB) — its cost is a
window function over ~25.8M `(user, day)` rows plus an all-history scan,
which `sipHash` alone barely helped (−7%). It now uses **consistent
1-in-4 user sampling** (same `cityHash64(user_id) % 4` bucket applied to
both subqueries so each sampled user's full activity sequence is
preserved; counts scaled ×4):

- **317 MiB (−78%)** peak memory, **~0.4% mean error** (max 1.4% on the
smallest day) vs the exact result.

This is an **approximation** — the dashboard "Growth quality" chart now
notes it (`subtitle: "… · sampled estimate (~0.4%)"`).
`ACTIVITY_SPLIT_SAMPLE` is a single constant in the route; set it to `1`
to go back to exact.
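
The idea behind consistent sampling can be sketched in TypeScript. This is an illustration, not the route's code: the real filter runs inside ClickHouse with `cityHash64`, while `hash32` below is a portable stand-in, and keeping bucket `0` is an assumption (the description does not say which bucket is kept). Only `ACTIVITY_SPLIT_SAMPLE` is a name from the PR.

```typescript
// Stand-in 32-bit string hash (FNV-1a followed by a murmur3-style finalizer).
// The PR uses ClickHouse's cityHash64; this is only a substitute for the demo.
function hash32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // force an unsigned result so `%` never goes negative
}

// Constant name from the PR: 4 keeps 1 user in 4, 1 keeps everyone (exact).
const ACTIVITY_SPLIT_SAMPLE = 4;

// Assumption: bucket 0 is the kept bucket.
function inSample(userId: string, sample: number): boolean {
  return hash32(userId) % sample === 0;
}

type ActivityRow = { userId: string; day: string };

// Filter by user, not by row: a kept user keeps every one of their rows,
// so per-user sequences (first seen, gaps, returns) stay intact.
function sampleByUser(rows: ActivityRow[], sample: number): ActivityRow[] {
  return rows.filter((r) => inSample(r.userId, sample));
}

// Distinct sampled users per day, scaled back up by the sampling factor.
function estimateDailyUsers(rows: ActivityRow[], sample: number): Map<string, number> {
  const usersByDay = new Map<string, Set<string>>();
  for (const r of sampleByUser(rows, sample)) {
    let users = usersByDay.get(r.day);
    if (users === undefined) {
      users = new Set<string>();
      usersByDay.set(r.day, users);
    }
    users.add(r.userId);
  }
  const estimate = new Map<string, number>();
  usersByDay.forEach((users, day) => estimate.set(day, users.size * sample));
  return estimate;
}
```

Because the predicate depends only on the user id, applying it to two different subqueries selects the same users in both, and passing `1` as the sample factor keeps every user and reproduces the exact counts.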

## What I tried that did NOT make the cut (documented in the harnesses)

- `country` — peak memory is dominated by the per-user `argMax(country,
event_at)` payload, not the key, so hashing does nothing. Left
exact/unchanged.
- PG `authMethods` / `email` — with the production composite PK indexes
the original plans are already best; correlated-subquery / anti-join
rewrites were far worse. No PG query changes in this PR.

## Benchmark harnesses (added)

- `apps/backend/scripts/benchmark-platform-analytics.ts` — full-route
baseline (per-query time/memory/rows).
- `apps/backend/scripts/optimize-platform-analytics.ts` — sipHash & PG
variant comparison with byte-equality checks.
- `apps/backend/scripts/optimize-split.ts` — exact vs sampled split
variants with accuracy measurement.

They seed isolated `bench_pa` databases (server-side, auto-cleaned) and
read `system.query_log` / `EXPLAIN (ANALYZE, BUFFERS)`. Run e.g.:
`pnpm --filter @hexclave/backend run with-env:dev tsx scripts/optimize-split.ts`
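
For readers unfamiliar with `system.query_log`, the sketch below shows one way peak memory could be read back for a tagged query. It is a guess at the general shape, not the harness code: identifying queries by `log_comment` and the function name are assumptions. The columns used (`type`, `memory_usage`, `query_duration_ms`, `read_rows`, `event_time`, `log_comment`) are standard `system.query_log` columns, and `{tag:String}` is ClickHouse's query-parameter syntax.

```typescript
// Builds a parameterized ClickHouse query that returns time, rows read, and
// peak memory for finished queries carrying a given log_comment tag.
// Note: query_log is flushed periodically, so a harness would typically run
// SYSTEM FLUSH LOGS before reading it.
function buildPeakMemoryQuery(tag: string): {
  query: string;
  query_params: { tag: string };
} {
  const query = [
    "SELECT query_id, query_duration_ms, read_rows, memory_usage",
    "FROM system.query_log",
    "WHERE type = 'QueryFinish'",
    "  AND log_comment = {tag:String}",
    "ORDER BY event_time DESC",
    "LIMIT 1",
  ].join("\n");
  return { query, query_params: { tag } };
}

console.log(buildPeakMemoryQuery("dauSeries").query);
```

The returned object has the shape that a ClickHouse client taking `query` and `query_params` would accept, which avoids interpolating the tag into the SQL string.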

## Testing

- Backend `typecheck` passes. (Dashboard has pre-existing typecheck
errors on the base branch in unrelated files — auth-methods,
team-analytics, user-emails, RDE config — not touched here.)
- All exact rewrites verified byte-identical to the originals by the
harnesses; the sampled split measured at ~0.4% mean error.
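
A minimal sketch of how a mean and max error figure like the one above could be computed, assuming the metric is the per-day absolute relative error between the exact and sampled series. That definition is an assumption, the function name is hypothetical, and the sample numbers are made up for illustration rather than taken from the benchmark.

```typescript
// Mean and maximum absolute relative error between two aligned series.
function relativeErrorStats(
  exact: number[],
  estimate: number[],
): { mean: number; max: number } {
  if (exact.length === 0 || exact.length !== estimate.length) {
    throw new Error("series must be non-empty and the same length");
  }
  const errors = exact.map((e, i) => {
    if (e === 0) throw new Error("exact value of 0 has no relative error");
    return Math.abs(estimate[i] - e) / e;
  });
  const mean = errors.reduce((sum, x) => sum + x, 0) / errors.length;
  const max = Math.max(...errors);
  return { mean, max };
}

// Made-up daily counts: the smallest day shows the largest relative error.
const errorStats = relativeErrorStats([1000, 2000, 500], [1004, 1992, 507]);
console.log(errorStats);
```

Smaller days contain fewer sampled users, so the same absolute wobble translates into a larger relative error, which matches the max being reported on the smallest day.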

Numbers are local warm-cache (relative shape, not production latency).

---
## Summary by cubic
Cuts worst-case ClickHouse memory for the internal platform analytics
route by switching to hashed distinct keys and sampling the heaviest
query. On a 10k projects / 1M users / 50M events benchmark, the sum of
per-query peaks drops from ~6.7 GiB to ~3.8 GiB with exact results (or
~0.4% error on the sampled chart).

- **Performance**
  - Use `sipHash64(user_id)` as the distinct key in `uniqExact`/`uniqExactIf`
    for DAU series, MAU/projects, active-by-project, and sparkline. Exact
    results (verified). Peak memory down 31–61% per query.
  - Sample the new/retained/reactivated split at 1-in-4 users (consistent
    `cityHash64` bucket across subqueries, counts ×4). Peak memory ~−78%
    (~1.3 GiB → ~0.3 GiB) with ~0.4% mean error. Toggle via
    `ACTIVITY_SPLIT_SAMPLE` (set to 4; set to 1 for exact). Dashboard
    subtitle now notes “sampled estimate (~0.4%).”
  - Added local harnesses to seed isolated data and measure
    time/memory/equality:
    `apps/backend/scripts/internal-analytics/benchmark-platform-analytics.ts`,
    `optimize-platform-analytics.ts`, `optimize-split.ts`.

<sup>Written for commit 60ccf1a06f.
Summary will update on new commits.</sup>

## Summary by CodeRabbit

## Updates

* **Improvements**
  * Enhanced platform analytics calculations for more consistent and
    efficient user counting across key performance indicators (DAU, MAU,
    per-project metrics).
  * Updated the Growth Quality chart to indicate that user counts
    represent sampled estimates with approximately 0.4% margin of error for
    improved performance.

---------

Co-authored-by: mantrakp04 <mantrakp@gmail.com>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: mantra <mantra@stack-auth.com>

2026-06-19 12:44:28 -07:00

1 Commits