stack

CodeCow/stack

mirror of https://github.com/stack-auth/stack.git synced 2026-06-27 21:01:03 +08:00

Commit Graph


Author

SHA1

Message

Date

BilalG1

91b8e4caa4

Fix /internal/metrics ClickHouse OOM (#1457)

Fixes Sentry
[STACK-BACKEND-16H](https://stackframe-pw.sentry.io/issues/STACK-BACKEND-16H)
— the `/api/v1/internal/metrics` endpoint was triggering the cluster's
10.8 GiB OvercommitTracker kill on tenants with months of
`$token-refresh` history.

## Root cause

Three queries in `loadAnalyticsOverview` plus `loadUsersByCountry` did
`GROUP BY user_id` over the events table with **no lower `event_at`
bound**, so their hash table working set scaled with
cumulative-distinct-users-ever-seen instead of the 30-day metrics
window.
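To make the scaling argument concrete, here is a minimal TypeScript sketch (not code from the PR). It uses the number of distinct `GROUP BY` keys as a rough proxy for hash table size and compares an unbounded scan with a 30-day lower bound. The `EventRow` shape, the function names, and the synthetic seed are all hypothetical.

```typescript
// Hypothetical event shape; the real events table has more columns.
interface EventRow {
  userId: string;
  eventAtMs: number; // event_at as epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Number of distinct GROUP BY keys, used here as a rough proxy for the
// hash table working set. With no lower bound, every user ever seen
// contributes a key.
function groupByKeyCount(events: EventRow[], sinceMs?: number): number {
  const keys = new Set<string>();
  events.forEach((e) => {
    if (sinceMs === undefined || e.eventAtMs >= sinceMs) {
      keys.add(e.userId);
    }
  });
  return keys.size;
}

// Synthetic history: one brand-new user per day, each seen exactly once.
function seedOneUserPerDay(nowMs: number, days: number): EventRow[] {
  const rows: EventRow[] = [];
  for (let d = 0; d < days; d++) {
    rows.push({ userId: `user-${d}`, eventAtMs: nowMs - d * DAY_MS });
  }
  return rows;
}
```

With a year of this synthetic history, the unbounded key count grows with project age, while the bounded count stays fixed at the window size.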

## Changes

- Add 30-day `event_at` lower bound to `loadUsersByCountry` and to the
`analyticsUserJoin` inner subquery (used by `dailyEvents`,
`totalVisitors`, `topReferrers`).
- New `getClickhouseAdminClientForMetrics()` factory in
`lib/clickhouse.tsx` with connection-level safety net: per-query +
per-user memory caps, external GROUP BY spill, and `join_algorithm:
'grace_hash,parallel_hash,hash'` (grace_hash measured to give 48% memory
reduction at zero latency cost — see benchmark notes in the file).
- Inline comment + concrete next steps for the long-term fix (option C:
stamp `is_anonymous` at ingest on page-view/click events, then drop the
join entirely).
- Extend `scripts/benchmark-internal-metrics.ts` with the
historical-seed knob and two new modes (`BENCH_BACKFILL_COMPARE`,
`BENCH_JOIN_ALGO_COMPARE`), plus an updated `BENCH_ROUTE_QUERIES` mode,
all used to validate the choices above.
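As a rough illustration of the connection-level safety net, the sketch below assembles a ClickHouse settings object in TypeScript. It is not the contents of `lib/clickhouse.tsx`: the type name, the helper, the 4 GB per-query cap, and the "spill at half the per-query cap" ratio are assumptions. Only the `join_algorithm` value and the 9 GB per-user figure come from the description above, and reading "9 GB" as 9 × 10⁹ bytes is itself an assumption.

```typescript
// Hypothetical shape; the real factory in lib/clickhouse.tsx may differ.
type MetricsClickhouseSettings = {
  max_memory_usage: string;                   // per-query cap, in bytes
  max_memory_usage_for_user: string;          // per-user cap, in bytes
  max_bytes_before_external_group_by: string; // GROUP BY spill threshold
  join_algorithm: string;                     // tried in the listed order
};

const GB = 1000 * 1000 * 1000;

function buildMetricsSettings(
  perQueryCapBytes: number,
  perUserCapBytes: number,
): MetricsClickhouseSettings {
  return {
    max_memory_usage: String(perQueryCapBytes),
    max_memory_usage_for_user: String(perUserCapBytes),
    // Assumption: start spilling GROUP BY state to disk at half the
    // per-query cap, leaving headroom for the merge phase.
    max_bytes_before_external_group_by: String(Math.floor(perQueryCapBytes / 2)),
    join_algorithm: "grace_hash,parallel_hash,hash",
  };
}

// The 9 GB per-user cap is from the PR text; the 4 GB per-query cap is a
// placeholder chosen only for this example.
const exampleSettings = buildMetricsSettings(4 * GB, 9 * GB);
```

An object like this would typically be passed as the client's default settings so that every metrics query inherits the caps without repeating a `SETTINGS` clause.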

## Benchmark — pre-PR vs post-PR

Synthetic seed: 300k users × 9 events spread over 365 days (~2.7M
events).

| | pre-PR | post-PR | delta |
|---|---:|---:|---:|
| Sum peak memory | 2.18 GiB | 515 MiB | **4.3× less** |
| Max query duration | 1293 ms | 101 ms | **12.8× faster** |
| Sum CPU duration | 5119 ms | 394 ms | 13× less work |
| Sum bytes read | 3.87 GiB | 929 MiB | 4.3× less I/O |

Per-query at 300k users:
- `analyticsOverview:dailyEvents` 561 → 44 MiB (12.8× less)
- `analyticsOverview:totalVisitors` 560 → 50 MiB (11.2× less)
- `analyticsOverview:topReferrers` 546 → 50 MiB (10.9× less)
- `loadUsersByCountry` 388 → 44 MiB (8.9× less)

## Caveats

- `loadDailyActiveSplitFromClickhouse` still scans all history in its
`min(event_at)` subquery. It can't be naively bounded: `first_date` is
used to classify entities as new vs reactivated, and a 30d bound would
silently mislabel old-but-active entities as "new." The new SETTINGS
cap its memory and let it spill to disk; the proper fix is option C
(documented inline).
- A user with a page-view but no `$token-refresh` in the last 30 days
now falls through to `coalesce(NULL, 0)` and is classified
non-anonymous. Token-refresh fires every few minutes per active session,
so this is rare but not impossible (embedded SDKs that poll less
frequently, sessions straddling the 30d boundary).
- `max_memory_usage_for_user: 9 GB` trades "cluster-wide
OvercommitTracker kill of a random query" for "clean per-user memory
error attributed to the specific query." After our 30d bounds, no query
is anywhere near 9 GB.
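The second caveat can be sketched as follows, in TypeScript rather than SQL. A `Map` stands in for the bounded `analyticsUserJoin` subquery and a missing key stands in for the `NULL` produced by a LEFT JOIN miss. The names are hypothetical; the real logic lives in the ClickHouse query.

```typescript
// Hypothetical lookup: the is_anonymous flag per user, derived only from
// $token-refresh events inside the 30-day window.
type AnonymousLookup = Map<string, 0 | 1>;

// Mirrors a LEFT JOIN followed by coalesce(joined.is_anonymous, 0):
// a user with no row in the lookup yields NULL, which becomes 0.
function isAnonymousAfterJoin(lookup: AnonymousLookup, userId: string): 0 | 1 {
  const joined = lookup.get(userId); // undefined plays the role of NULL
  return joined === undefined ? 0 : joined;
}

// "anon-recent" refreshed a token inside the window; "anon-stale" is also
// anonymous but its last $token-refresh is older than 30 days, so it has
// no entry here.
const windowLookup: AnonymousLookup = new Map<string, 0 | 1>([["anon-recent", 1]]);
```

Here `isAnonymousAfterJoin(windowLookup, "anon-stale")` returns 0, which is the misclassification the caveat describes.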

## Test plan

- [x] `pnpm typecheck` passes
- [x] `pnpm lint` passes
- [x] `pnpm test run
apps/e2e/tests/backend/endpoints/api/v1/internal-metrics.test.ts` — 9/10
pass; the 1 failure (`risk_scores` snapshot drift) reproduces on clean
`dev` and is unrelated
- [x] `pnpm test run
apps/e2e/tests/backend/endpoints/api/v1/analytics-{events,events-batch,query}.test.ts
apps/e2e/tests/backend/endpoints/api/v1/token-refresh-events.test.ts
apps/e2e/tests/backend/performance/metrics.test.ts`: every test passes
except 10 pre-existing `PRODUCT_DOES_NOT_EXIST` setup failures, which
reproduce on clean `dev`
- [x] Benchmark `BENCH_ROUTE_QUERIES=1` at 300k users shows the deltas
above

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
  * Improved internal metrics collection to use metrics-specific DB
    settings for more reliable, safer analytical reads.
  * Added guardrails to metrics queries to enforce time-window bounds and
    avoid unbounded scans.
  * Expanded benchmark modes (backfill and join-algo comparisons),
    extended perf seeding, and improved logging/retry behavior to capture
    more complete stats and reduce missing log rows.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

2026-05-21 13:47:32 -07:00

BilalG1

85ae4b1c9e

Fix ClickHouse OOM in MAU query + optimize /internal/metrics route (#1344)

## Summary

Fixes the Sentry `StackAssertionError: Failed to load monthly active
users for internal metrics` crash (ClickHouse OOM at the 7.2 GiB
per-query cap) and applies two related optimizations to other queries in
the same route. Adds a local benchmark harness that validates
correctness and measures peak memory / duration before and after.

## Root cause (the original Sentry error)

`loadMonthlyActiveUsers` was written as `SELECT user_id … GROUP BY
user_id`, with the counting done in Node via a `Set`. On a large project
that ships millions of user_ids back to Node. Two failure modes stacked:

1. **Result materialization** — every distinct user_id had to be
buffered in the server before streaming to Node (~20 MiB of result for
450k users; much more at real scale).
2. **`JSONExtract(toJSONString(data), 'is_anonymous', 'UInt8')`**:
`toJSONString(data)` re-serializes the entire nested JSON column for
every row, billions of times, just to pull one boolean. This dominates
bytes read.

Combined, on a single partition read from S3-backed MergeTree, this can
exceed ClickHouse's 7.2 GiB per-query memory cap. That's exactly what
the Sentry trace showed.

## Changes

### 1. Fix MAU query (`loadMonthlyActiveUsers`)

Moved counting to the server with
`uniqExact(sipHash64(normalized_user_id))` and pulled the JS-side
normalization (`lower`, `trim`, `isUuid`) into SQL. Picked `sipHash64`
after benchmarking 7 variants: it is effectively exact (64-bit hash
collisions are negligible while the user count stays far below 2³²) and
halves the uniqExact hash-state vs. raw string keys.
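For reference, a JS-side counterpart of this logic might look like the sketch below. It is not the repository's code: the UUID regex is a loose stand-in for whatever `isUuid` actually checks, and the function names are hypothetical. The last helper applies the standard birthday-bound approximation to estimate how likely any two of `n` users are to share a 64-bit hash, which is what "exact at <<2³² users" relies on.

```typescript
// Assumption: a loose UUID shape check; the real isUuid may be stricter.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

// Assumed normalization order: trim, lowercase, then keep only UUID-shaped ids.
function normalizeUserId(raw: string | null): string | null {
  if (raw === null) return null;
  const id = raw.trim().toLowerCase();
  return UUID_RE.test(id) ? id : null;
}

// Reference count in the style of the old pipeline: dedupe in a Set.
function countMonthlyActiveUsers(rawIds: Array<string | null>): number {
  const seen = new Set<string>();
  rawIds.forEach((raw) => {
    const id = normalizeUserId(raw);
    if (id !== null) seen.add(id);
  });
  return seen.size;
}

// Birthday-bound estimate of the probability that at least two of n
// distinct ids collide under an ideal 64-bit hash:
// p ≈ 1 - exp(-n(n-1) / (2 * 2^64)).
function hashCollisionProbability(n: number): number {
  const x = (n * (n - 1)) / (2 * Math.pow(2, 64));
  return -Math.expm1(-x); // expm1 keeps precision for very small x
}
```

At roughly 450k users the estimated collision probability is on the order of 10⁻⁹, so counting distinct hashes and counting distinct strings agree in practice.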

### 2. `JSONExtract(toJSONString(data), …)` → direct `CAST(data.is_anonymous, …)`

Applied everywhere the pattern appeared in the metrics route:
- `loadDailyActiveUsers`
- the `analyticsUserJoin` subquery
- the `nonAnonymousAnalyticsUserFilter`
- `analyticsOverview:topRegion`
- `analyticsOverview:online`

Semantics preserved (`coalesce(CAST(data.is_anonymous,
'Nullable(UInt8)'), 0)` matches `JSONExtract(…, 'UInt8')` behavior when
the field is missing).
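The equivalence claim can be mimicked outside ClickHouse with a small TypeScript emulation. This is only a sketch: it does not execute either SQL expression, and it assumes that a missing or null field extracts to the `UInt8` default of 0 and that a boolean maps to 0/1. The type and function names are hypothetical.

```typescript
// Hypothetical event payload; only the field of interest is modeled.
type EventData = { is_anonymous?: boolean | 0 | 1 | null };

// Emulates the old path: serialize the whole object, parse it back, then
// extract the field as a number, defaulting to 0 when absent or null.
function extractViaJson(data: EventData): number {
  const parsed = JSON.parse(JSON.stringify(data)) as EventData;
  const v = parsed.is_anonymous;
  return v === undefined || v === null ? 0 : Number(v);
}

// Emulates the new path: read the subfield directly as a nullable number,
// then coalesce NULL to 0.
function extractViaCast(data: EventData): number {
  const v = data.is_anonymous;
  const nullable: number | null = v === undefined || v === null ? null : Number(v);
  return nullable === null ? 0 : nullable;
}

// Edge cases in the spirit of the PR's equivalence matrix.
const extractionCases: EventData[] = [
  {},
  { is_anonymous: null },
  { is_anonymous: true },
  { is_anonymous: false },
  { is_anonymous: 1 },
];
```

Both functions return the same value for every case, while only the first pays for a full serialize-and-parse round trip per row.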

### 3. Server-aggregate the split queries

`loadDailyActiveUsersSplit` and `loadDailyActiveTeamsSplit` used to ship
1.2M+ `(day, user_id)` rows back to Node just so the JS could bucket
them into new / retained / reactivated. Rewrote both as one CTE-style
query that returns 31 rows (one per day in the 30-day window) with the
counts precomputed.

**Minor semantic shift** (documented inline in `route.tsx`): "new" is
now based on the user's first-ever `$token-refresh` event rather than
their Postgres `signedUpAt`. Agrees for users who log in immediately
after sign-up (the common case). Disagrees for the rare edge case of an
account that existed pre-window but never generated a `$token-refresh`
until now: old code classified it as "reactivated," new code classifies
it as "new." Judged acceptable; can be revisited.

Postgres round-trips for `ProjectUser.signedUpAt` / `Team.createdAt` are
no longer needed for the split, and the 76 MiB-ish wire ship is gone.
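The bucketing that now happens inside ClickHouse can be sketched in TypeScript. This is an illustration under assumptions, not the query from `route.tsx`: "new" means the entity's first-ever active day equals the day in question (as described above), while the "retained" rule (active on the previous day) and "reactivated" (everything else) are assumed definitions. Days are plain integer indexes.

```typescript
type Bucket = "new" | "retained" | "reactivated";
type DayCounts = Record<Bucket, number>;

// activity maps an entity id to every day index on which it was active,
// over all history (not only the 30-day window).
function splitForDay(activity: Map<string, number[]>, day: number): DayCounts {
  const counts: DayCounts = { new: 0, retained: 0, reactivated: 0 };
  activity.forEach((days) => {
    if (days.indexOf(day) === -1) return; // not active on this day
    // Equivalent of min(event_at) over all history for this entity.
    const firstDay = days.reduce((a, b) => Math.min(a, b));
    if (firstDay === day) {
      counts.new += 1;
    } else if (days.indexOf(day - 1) !== -1) {
      counts.retained += 1; // assumed rule: also active the day before
    } else {
      counts.reactivated += 1; // seen earlier, but not the day before
    }
  });
  return counts;
}
```

Calling this once per day in the window yields one small row of counts per day, which is the shape the rewritten query returns instead of shipping every `(day, user_id)` pair to Node.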

### 4. Benchmark harness (`apps/backend/scripts/benchmark-internal-metrics.ts`)

Local-only tool. Three modes:
- **MAU equivalence matrix** — 13 edge cases (empty, dedup, anonymous
filter, window boundary, null user_id, non-UUID user_id, case variation,
project isolation, missing/null `is_anonymous`, wrong event_type).
Asserts OLD pipeline and NEW query return the **same set** of users, not
just the same count.
- **MAU perf** — OLD vs NEW plus 6 other candidate variants (inline
regex, UUID keys, sipHash64, HLL sketches), reads `memory_usage` /
`read_rows` / `result_bytes` from `system.query_log` for each, prints a
ranked table.
- **Full-route benchmark** (`BENCH_ROUTE_QUERIES=1`) — runs every
ClickHouse query in `/internal/metrics` in three stages (BEFORE, AFTER,
candidate OPTIMIZED) against the same seed and prints per-query deltas
plus endpoint-level totals.

Seeds under a synthetic `project_id` so real data is never touched;
cleans up on exit via `ALTER TABLE … DELETE`.
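The two core checks of such a harness, set equality and a ranking by peak memory, can be sketched as below. The names are hypothetical and this is not the script's actual code; in the real harness the stats would come from `system.query_log` rather than being passed in by hand.

```typescript
// True only when both pipelines return exactly the same users. Equal
// counts are not enough, so membership is compared as well.
function sameSet(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size !== setB.size) return false;
  return Array.from(setA).every((x) => setB.has(x));
}

// Hypothetical per-variant stats row.
interface VariantStats {
  name: string;
  memoryUsageBytes: number;
  durationMs: number;
}

// Returns a new array sorted by peak memory, lowest first.
function rankByMemory(stats: VariantStats[]): VariantStats[] {
  return stats.slice().sort((x, y) => x.memoryUsageBytes - y.memoryUsageBytes);
}
```

Comparing sets rather than counts catches the case where two queries agree on the number of users but disagree on which users they are.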

## Benchmark results

### MAU query alone

Ran at two scales; set-equality verified (new query identifies the same
individual users, not just the same count).

| seed | MAU | peak memory (old → new) | bytes read | duration |
|---|---|---|---|---|
| 500k events | 89,939 | 158.7 MiB → 46.7 MiB (**3.4×**, −70%) | 175.7 MiB → 63.0 MiB (2.8×) | 483 ms → 76 ms (**6.4×**) |
| 2.5M events | 449,990 | 439.2 MiB → 281.4 MiB (1.56×, −36%) | 865.0 MiB → 310.9 MiB (2.8×) | 783 ms → 126 ms (**6.2×**) |

MAU variant bake-off at 2.5M events (exact variants are all set-equal to OLD; v5 and v6 are approximate):

| variant | memory | duration | notes |
|---|---|---|---|
| v0_old (baseline) | 440 MiB | 567 ms | — |
| v1_uniqExact_string | 284 MiB | 110 ms | naive fix |
| v3_uniqExact_toUUID | 244 MiB | 153 ms | UUID keys, slower per-row |
| **v4_uniqExact_sipHash64** | **125 MiB** | **95 ms** | **shipped** |
| v5_uniq (HLL) ~approx | 30 MiB | 86 ms | −0.25% error |
| v6_uniqCombined ~approx | 31 MiB | 67 ms | −0.15% error |

### Full `/internal/metrics` route (2.7M events, 300k users + page-views + clicks + teams)

Ranked by BEFORE peak memory:

| query | mem BEFORE | mem AFTER | Δ mem | dur BEFORE | dur AFTER | Δ dur |
|---|---|---|---|---|---|---|
| analyticsOverview:topReferrers | 588.1 MiB | 411.1 MiB | 1.43× | 1833 ms | 110 ms | **16.66×** |
| analyticsOverview:totalVisitors | 584.3 MiB | 403.5 MiB | 1.45× | 1829 ms | 121 ms | 15.12× |
| analyticsOverview:dailyEvents | 584.1 MiB | 403.7 MiB | 1.45× | 1897 ms | 140 ms | 13.55× |
| loadUsersByCountry | 393.1 MiB | 385.4 MiB | ≈same | 74 ms | 80 ms | ≈same |
| loadDailyActiveUsersSplit | 363.4 MiB | 396.8 MiB | *+9%* | 1966 ms | 356 ms | 5.52× |
| analyticsOverview:topRegion | 269.9 MiB | 106.4 MiB | 2.54× | 1602 ms | 65 ms | 24.65× |
| loadDailyActiveUsers | 268.3 MiB | 84.0 MiB | 3.19× | 1111 ms | 44 ms | 25.25× |
| loadDailyActiveTeamsSplit | 59.6 MiB | 78.1 MiB | *+31%* | 70 ms | 123 ms | *+76%* |
| loadMonthlyActiveUsers | 54.9 MiB | 54.9 MiB | ≈same | 68 ms | 56 ms | ≈same |
| analyticsOverview:online | 18.4 MiB | 5.8 MiB | 3.17× | 58 ms | 4 ms | 14.50× |

**Endpoint-level totals**

| metric | BEFORE | AFTER | Δ |
|---|---|---|---|
| Sum peak ClickHouse memory | 3.11 GiB | 2.28 GiB | **−27%** |
| **Max query duration** (endpoint wall-clock floor) | **1966 ms** | **356 ms** | **−82%** (5.5×) |
| Sum query duration (total CPU) | 10508 ms | 1099 ms | **−90%** (9.6×) |
| Bytes read | 10.70 GiB | 4.55 GiB | −57% |
| Bytes shipped to Node | 94.8 MiB | 44.2 KiB | **−99.95%** |

Both split queries show a small memory *regression* at this seed size
(the new server-side window-function + self-join has its own state cost
that's near break-even with "materialize + ship" at 300k users); at
prod scale the 76 MiB-ship saving dominates. Duration is unambiguously
better.
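The endpoint-level totals follow directly from the per-query rows: memory and duration are summed, and the wall-clock floor is the slowest single query. The TypeScript sketch below recomputes them from the figures in the per-query table; the interface and helper names are hypothetical.

```typescript
interface RouteQueryRow {
  name: string;
  memBeforeMiB: number;
  memAfterMiB: number;
  durBeforeMs: number;
  durAfterMs: number;
}

// Figures copied from the per-query table in this PR description.
const routeRows: RouteQueryRow[] = [
  { name: "analyticsOverview:topReferrers", memBeforeMiB: 588.1, memAfterMiB: 411.1, durBeforeMs: 1833, durAfterMs: 110 },
  { name: "analyticsOverview:totalVisitors", memBeforeMiB: 584.3, memAfterMiB: 403.5, durBeforeMs: 1829, durAfterMs: 121 },
  { name: "analyticsOverview:dailyEvents", memBeforeMiB: 584.1, memAfterMiB: 403.7, durBeforeMs: 1897, durAfterMs: 140 },
  { name: "loadUsersByCountry", memBeforeMiB: 393.1, memAfterMiB: 385.4, durBeforeMs: 74, durAfterMs: 80 },
  { name: "loadDailyActiveUsersSplit", memBeforeMiB: 363.4, memAfterMiB: 396.8, durBeforeMs: 1966, durAfterMs: 356 },
  { name: "analyticsOverview:topRegion", memBeforeMiB: 269.9, memAfterMiB: 106.4, durBeforeMs: 1602, durAfterMs: 65 },
  { name: "loadDailyActiveUsers", memBeforeMiB: 268.3, memAfterMiB: 84.0, durBeforeMs: 1111, durAfterMs: 44 },
  { name: "loadDailyActiveTeamsSplit", memBeforeMiB: 59.6, memAfterMiB: 78.1, durBeforeMs: 70, durAfterMs: 123 },
  { name: "loadMonthlyActiveUsers", memBeforeMiB: 54.9, memAfterMiB: 54.9, durBeforeMs: 68, durAfterMs: 56 },
  { name: "analyticsOverview:online", memBeforeMiB: 18.4, memAfterMiB: 5.8, durBeforeMs: 58, durAfterMs: 4 },
];

function sumOf(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0);
}

function maxOf(xs: number[]): number {
  return xs.reduce((a, b) => Math.max(a, b), 0);
}

function endpointTotals(rows: RouteQueryRow[]) {
  return {
    memBeforeGiB: sumOf(rows.map((r) => r.memBeforeMiB)) / 1024,
    memAfterGiB: sumOf(rows.map((r) => r.memAfterMiB)) / 1024,
    maxDurBeforeMs: maxOf(rows.map((r) => r.durBeforeMs)),
    maxDurAfterMs: maxOf(rows.map((r) => r.durAfterMs)),
    sumDurBeforeMs: sumOf(rows.map((r) => r.durBeforeMs)),
    sumDurAfterMs: sumOf(rows.map((r) => r.durAfterMs)),
  };
}
```

Rounded to two decimals, the summed memory comes out at 3.11 GiB before and 2.28 GiB after, matching the totals table.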

## Why we can't drop the `analyticsUserJoin` in this PR

The benchmark includes an OPTIMIZED stage that drops the LEFT JOIN and
trusts `e.data.is_anonymous` directly, which would shave another **1.2
GiB / 1.9× duration** off the endpoint. **But we can't ship that here**
— an audit of the client tracker
(`packages/js/src/lib/stack-app/apps/implementations/event-tracker.ts`)
confirmed `is_anonymous` is never set on client-emitted `$page-view` /
`$click` events. The JOIN is currently load-bearing. A follow-up PR will
enrich `is_anonymous` at the batch ingest endpoint using
`auth.user.is_anonymous`; after one metrics-window cycle (~30 days) the
JOIN can be dropped.
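A minimal sketch of what the follow-up enrichment could look like, assuming a batch handler that has the authenticated user at hand. The `IncomingEvent` and `AuthContext` shapes and the `enrichBatch` name are hypothetical; only the idea of stamping `is_anonymous` from `auth.user.is_anonymous` onto `$page-view` / `$click` events comes from the text above.

```typescript
// Hypothetical shapes for a batch-ingest handler.
interface IncomingEvent {
  event_type: string;
  data: Record<string, unknown>;
}

interface AuthContext {
  user: { id: string; is_anonymous: boolean } | null;
}

// Stamps is_anonymous onto client-emitted events at ingest time so that
// later reads can trust the event payload instead of joining against
// $token-refresh history. Events are copied, never mutated.
function enrichBatch(events: IncomingEvent[], auth: AuthContext): IncomingEvent[] {
  const user = auth.user;
  return events.map((e) => {
    const isClientEvent = e.event_type === "$page-view" || e.event_type === "$click";
    if (!isClientEvent || user === null) return e;
    return { ...e, data: { ...e.data, is_anonymous: user.is_anonymous } };
  });
}
```

Events ingested before such a change would still lack the field, which is why the JOIN has to stay for one full metrics window afterwards.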

## Follow-up work (out of scope for this PR)

- **Batch-endpoint enrichment** + drop the analytics-overview LEFT JOIN
(est. further −53% endpoint memory, −46% duration per the benchmark).
- **Teams-split hash-variant count mismatch** — `sipHash64(team_id)`
variant of the teams split shows a count discrepancy vs. the
string-keyed version in the benchmark. Not blocking since teams-split is
only #8 by memory; needs a root-cause pass before shipping that
particular optimization.
- **`loadUsersByCountry` window bound**: currently scans every
`$token-refresh` event ever for the tenancy (no time filter). Bounding
to 30 days would bound memory growth with project age, but changes
semantics ("country of latest login ever" → "in last 30 days").
Deferred because it's product-facing.

## Snapshot changes in `internal-metrics.test.ts.snap`

The `should return metrics data with users` test signs in 10 users
today, then deletes one of them mid-test. Two small snapshot values
change on today's date; both are just a reclassification of that single
deleted user — the total (10 active users) is unchanged.

- **`daily_active_users_split.new[today]`: 9 → 10**
All 10 users really did sign in for the first time today. The old code
only counted 9 because the deleted user's Postgres row was gone by the
time the metrics query ran, so the old classifier couldn't see they were
created today. The new query looks at ClickHouse events directly, sees
the deleted user's first event was today, and counts them as new like
everyone else.

- **`daily_active_users_split.reactivated[today]`: 1 → 0**
No user was "reactivated" today — nobody was active on an earlier day
and came back. The old "1" was the deleted user falling into this bucket
by default (the old classifier had no other rule that fit them). The new
code correctly reports zero.

Totals match either way (9 + 1 = 10 + 0). We're moving one deleted user
out of the "returning visitor" bucket and into the "brand-new user"
bucket, which is what they actually were.

## Test plan

- [x] `pnpm typecheck` and `pnpm lint` pass on the backend package
- [x] MAU equivalence matrix: 13/13 cases return the same set of users
(not just the same count) between OLD and NEW pipelines
- [x] Set-equality verified at 500k-MAU perf scale
- [x] Full-route benchmark confirms the expected memory / duration
improvements
- [ ] Sanity-check the dashboard rendering after deploy (split charts,
MAU counter, analytics overview)
- [ ] Monitor Sentry for the assertion error — should drop to zero

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Performance Improvements**
  * Monthly and daily active metrics are now computed entirely server-side
    for faster queries and reduced client-side processing.

* **Bug Fixes**
  * More consistent handling of anonymous/missing IDs and stricter ID
    filtering to improve accuracy across edge cases.

* **Tests**
  * Added a comprehensive benchmark and validation harness to measure
    query performance and verify result equivalence across variants.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

2026-04-19 22:57:46 -07:00

2 Commits