mirror of
https://github.com/stack-auth/stack.git
synced 2026-06-27 21:01:03 +08:00
44ff9bfdfd
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
91b8e4caa4
|
Fix /internal/metrics ClickHouse OOM (#1457)
Fixes Sentry [STACK-BACKEND-16H](https://stackframe-pw.sentry.io/issues/STACK-BACKEND-16H) — the `/api/v1/internal/metrics` endpoint was triggering the cluster's 10.8 GiB OvercommitTracker kill on tenants with months of `$token-refresh` history. ## Root cause Three queries in `loadAnalyticsOverview` plus `loadUsersByCountry` did `GROUP BY user_id` over the events table with **no lower `event_at` bound**, so their hash table working set scaled with cumulative-distinct-users-ever-seen instead of the 30-day metrics window. ## Changes - Add 30-day `event_at` lower bound to `loadUsersByCountry` and to the `analyticsUserJoin` inner subquery (used by `dailyEvents`, `totalVisitors`, `topReferrers`). - New `getClickhouseAdminClientForMetrics()` factory in `lib/clickhouse.tsx` with connection-level safety net: per-query + per-user memory caps, external GROUP BY spill, and `join_algorithm: 'grace_hash,parallel_hash,hash'` (grace_hash measured to give 48% memory reduction at zero latency cost — see benchmark notes in the file). - Inline comment + concrete next steps for the long-term fix (option C: stamp `is_anonymous` at ingest on page-view/click events, then drop the join entirely). - Extend `scripts/benchmark-internal-metrics.ts` with the historical-seed knob and three new modes (`BENCH_BACKFILL_COMPARE`, `BENCH_JOIN_ALGO_COMPARE`, plus the existing `BENCH_ROUTE_QUERIES` updated) used to validate the choices above. ## Benchmark — pre-PR vs post-PR Synthetic seed: 300k users × 9 events spread over 365 days (~2.7M events). | | pre-PR | post-PR | delta | |---|---:|---:|---:| | Sum peak memory | 2.18 GiB | 515 MiB | **4.3× less** | | Max query duration | 1293 ms | 101 ms | **12.8× faster** | | Sum CPU duration | 5119 ms | 394 ms | 13× less work | | Sum bytes read | 3.87 GiB | 929 MiB | 4.3× less I/O | Per-query at 300k users: - `analyticsOverview:dailyEvents` 561 → 44 MiB (12.8× less) - `analyticsOverview:totalVisitors` 560 → 50 MiB (11.2× less) - `analyticsOverview:topReferrers` 546 → 50 MiB (10.9× less) - `loadUsersByCountry` 388 → 44 MiB (8.9× less) ## Caveats - `loadDailyActiveSplitFromClickhouse` still scans all-history on its `min(event_at)` subquery. It can't be naively bounded — `first_date` is used to classify entities as new vs reactivated, and a 30d bound would silently mislabel old-but-active entities as "new." The new SETTINGS cap+spill it; the proper fix is option C (documented inline). - A user with a page-view but no `$token-refresh` in the last 30 days now falls through to `coalesce(NULL, 0)` and is classified non-anonymous. Token-refresh fires every few minutes per active session, so this is rare but not impossible (embedded SDKs that poll less frequently, sessions straddling the 30d boundary). - `max_memory_usage_for_user: 9 GB` trades "cluster-wide OvercommitTracker kill of a random query" for "clean per-user memory error attributed to the specific query." After our 30d bounds, no query is anywhere near 9 GB. ## Test plan - [x] `pnpm typecheck` passes - [x] `pnpm lint` passes - [x] `pnpm test run apps/e2e/tests/backend/endpoints/api/v1/internal-metrics.test.ts` — 9/10 pass; the 1 failure (`risk_scores` snapshot drift) reproduces on clean `dev` and is unrelated - [x] `pnpm test run apps/e2e/tests/backend/endpoints/api/v1/analytics-{events,events-batch,query}.test.ts apps/e2e/tests/backend/endpoints/api/v1/token-refresh-events.test.ts apps/e2e/tests/backend/performance/metrics.test.ts` — all passing tests pass; 10 pre-existing `PRODUCT_DOES_NOT_EXIST` setup failures reproduce on clean `dev` - [x] Benchmark `BENCH_ROUTE_QUERIES=1` at 300k users shows the deltas above <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Improved internal metrics collection to use metrics-specific DB settings for more reliable, safer analytical reads. * Added guardrails to metrics queries to enforce time-window bounds and avoid unbounded scans. * Expanded benchmark modes (backfill and join-algo comparisons), extended perf seeding, and improved logging/retry behavior to capture more complete stats and reduce missing log rows. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/hexclave/stack-auth/pull/1457?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> |
||
|
|
85ae4b1c9e
|
Fix ClickHouse OOM in MAU query + optimize /internal/metrics route (#1344)
## Summary Fixes the Sentry `StackAssertionError: Failed to load monthly active users for internal metrics` crash (ClickHouse OOM at the 7.2 GiB per-query cap) and applies two related optimizations to other queries in the same route while here. Adds a local benchmark harness that validates correctness and measures peak memory / duration before & after. ## Root cause (the original Sentry error) `loadMonthlyActiveUsers` was written as `SELECT user_id … GROUP BY user_id` and then counting in Node via a `Set`. On a large project that ships back millions of user_ids. Two failure modes stacked: 1. **Result materialization** — every distinct user_id had to be buffered in the server before streaming to Node (~20 MiB of result for 450k users; much more at real scale). 2. **`JSONExtract(toJSONString(data), 'is_anonymous', 'UInt8')`** — the `toJSONString(data)` per-row re-serialization of the entire nested JSON column, billions of times, just to pull one boolean. Dominates bytes-read. Combined, on a single partition read from S3-backed MergeTree, this can exceed ClickHouse's 7.2 GiB per-query memory cap. That's exactly what the Sentry trace showed. ## Changes ### 1. Fix MAU query (`loadMonthlyActiveUsers`) Moved counting to the server with `uniqExact(sipHash64(normalized_user_id))` and pulled the JS-side normalization (`lower`, `trim`, `isUuid`) into SQL. Picked `sipHash64` after benchmarking 7 variants — it's exact (at <<2³² users) and halves the uniqExact hash-state vs. raw string keys. ### 2. Fix 1 — `JSONExtract(toJSONString(data), …)` → direct `CAST(data.is_anonymous, …)` Applied everywhere the pattern appeared in the metrics route: - `loadDailyActiveUsers` - the `analyticsUserJoin` subquery - the `nonAnonymousAnalyticsUserFilter` - `analyticsOverview:topRegion` - `analyticsOverview:online` Semantics preserved (`coalesce(CAST(data.is_anonymous, 'Nullable(UInt8)'), 0)` matches `JSONExtract(…, 'UInt8')` behavior when the field is missing). ### 3. Fix 3 — server-aggregate the split queries `loadDailyActiveUsersSplit` and `loadDailyActiveTeamsSplit` used to ship 1.2M+ `(day, user_id)` rows back to Node just so the JS could bucket them into new / retained / reactivated. Rewrote both as one CTE-style query that returns 31 rows (one per day in the 30-day window) with the counts precomputed. **Minor semantic shift** (documented inline in `route.tsx`): \"new\" is now based on the user's first-ever `\$token-refresh` event rather than their Postgres `signedUpAt`. Agrees for users who log in immediately after sign-up (the common case). Disagrees for the rare edge case of an account that existed pre-window but never generated a `\$token-refresh` until now — old code classified as \"reactivated,\" new code classifies as \"new.\" Judged acceptable; can be revisited. Postgres round-trips for `ProjectUser.signedUpAt` / `Team.createdAt` are no longer needed for the split, and the 76 MiB-ish wire ship is gone. ### 4. Benchmark harness (`apps/backend/scripts/benchmark-internal-metrics.ts`) Local-only tool. Three modes: - **MAU equivalence matrix** — 13 edge cases (empty, dedup, anonymous filter, window boundary, null user_id, non-UUID user_id, case variation, project isolation, missing/null `is_anonymous`, wrong event_type). Asserts OLD pipeline and NEW query return the **same set** of users, not just the same count. - **MAU perf** — OLD vs NEW plus 6 other candidate variants (inline regex, UUID keys, sipHash64, HLL sketches), reads `memory_usage` / `read_rows` / `result_bytes` from `system.query_log` for each, prints a ranked table. - **Full-route benchmark** (`BENCH_ROUTE_QUERIES=1`) — runs every ClickHouse query in `/internal/metrics` in three stages (BEFORE, AFTER, candidate OPTIMIZED) against the same seed and prints per-query deltas plus endpoint-level totals. Seeds under a synthetic `project_id` so real data is never touched; cleans up on exit via `ALTER TABLE … DELETE`. ## Benchmark results ### MAU query alone Ran at two scales; set-equality verified (new query identifies the same individual users, not just the same count). | seed | MAU | peak memory (old → new) | bytes read | duration | |---|---|---|---|---| | 500k events | 89,939 | 158.7 MiB → 46.7 MiB (**3.4×**, −70%) | 175.7 MiB → 63.0 MiB (2.8×) | 483 ms → 76 ms (**6.4×**) | | 2.5M events | 449,990 | 439.2 MiB → 281.4 MiB (1.56×, −36%) | 865.0 MiB → 310.9 MiB (2.8×) | 783 ms → 126 ms (**6.2×**) | MAU variant bake-off at 2.5M events (all exact, all set-equal to OLD): | variant | memory | duration | notes | |---|---|---|---| | v0_old (baseline) | 440 MiB | 567 ms | — | | v1_uniqExact_string | 284 MiB | 110 ms | naive fix | | v3_uniqExact_toUUID | 244 MiB | 153 ms | UUID keys, slower per-row | | **v4_uniqExact_sipHash64** | **125 MiB** | **95 ms** | **shipped** | | v5_uniq (HLL) ~approx | 30 MiB | 86 ms | −0.25% error | | v6_uniqCombined ~approx | 31 MiB | 67 ms | −0.15% error | ### Full `/internal/metrics` route (2.7M events, 300k users + page-views + clicks + teams) Ranked by BEFORE peak memory: | query | mem BEFORE | mem AFTER | Δ mem | dur BEFORE | dur AFTER | Δ dur | |---|---|---|---|---|---|---| | analyticsOverview:topReferrers | 588.1 MiB | 411.1 MiB | 1.43× | 1833 ms | 110 ms | **16.66×** | | analyticsOverview:totalVisitors | 584.3 MiB | 403.5 MiB | 1.45× | 1829 ms | 121 ms | 15.12× | | analyticsOverview:dailyEvents | 584.1 MiB | 403.7 MiB | 1.45× | 1897 ms | 140 ms | 13.55× | | loadUsersByCountry | 393.1 MiB | 385.4 MiB | ≈same | 74 ms | 80 ms | ≈same | | loadDailyActiveUsersSplit | 363.4 MiB | 396.8 MiB | *+9%* | 1966 ms | 356 ms | 5.52× | | analyticsOverview:topRegion | 269.9 MiB | 106.4 MiB | 2.54× | 1602 ms | 65 ms | 24.65× | | loadDailyActiveUsers | 268.3 MiB | 84.0 MiB | 3.19× | 1111 ms | 44 ms | 25.25× | | loadDailyActiveTeamsSplit | 59.6 MiB | 78.1 MiB | *+31%* | 70 ms | 123 ms | *+76%* | | loadMonthlyActiveUsers | 54.9 MiB | 54.9 MiB | ≈same | 68 ms | 56 ms | ≈same | | analyticsOverview:online | 18.4 MiB | 5.8 MiB | 3.17× | 58 ms | 4 ms | 14.50× | **Endpoint-level totals** | metric | BEFORE | AFTER | Δ | |---|---|---|---| | Sum peak ClickHouse memory | 3.11 GiB | 2.28 GiB | **−27%** | | **Max query duration** (endpoint wall-clock floor) | **1966 ms** | **356 ms** | **−82%** (5.5×) | | Sum query duration (total CPU) | 10508 ms | 1099 ms | **−90%** (9.6×) | | Bytes read | 10.70 GiB | 4.55 GiB | −57% | | Bytes shipped to Node | 94.8 MiB | 44.2 KiB | **−99.95%** | Both split queries show a small memory *regression* at this seed size (the new server-side window-function + self-join has its own state cost that's near break-even with \"materialize + ship\" at 300k users); at prod scale the 76 MiB-ship saving dominates. Duration is unambiguously better. ## Why we don't need to drop the `analyticsUserJoin` in this PR The benchmark includes an OPTIMIZED stage that drops the LEFT JOIN and trusts `e.data.is_anonymous` directly, which would shave another **1.2 GiB / 1.9× duration** off the endpoint. **But we can't ship that here** — an audit of the client tracker (`packages/js/src/lib/stack-app/apps/implementations/event-tracker.ts`) confirmed `is_anonymous` is never set on client-emitted `$page-view` / `$click` events. The JOIN is currently load-bearing. A follow-up PR will enrich `is_anonymous` at the batch ingest endpoint using `auth.user.is_anonymous`; after one metrics-window cycle (~30 days) the JOIN can be dropped. ## Follow-up work (out of scope for this PR) - **Batch-endpoint enrichment** + drop the analytics-overview LEFT JOIN (est. further −53% endpoint memory, −46% duration per the benchmark). - **Teams-split hash-variant count mismatch** — `sipHash64(team_id)` variant of the teams split shows a count discrepancy vs. the string-keyed version in the benchmark. Not blocking since teams-split is only #8 by memory; needs a root-cause pass before shipping that particular optimization. - **`loadUsersByCountry` window bound** — currently scans every `$token-refresh` event ever for the tenancy (no time filter). Bounding to 30 days would bound memory growth with project age, but changes semantics (\"country of latest login ever\" → \"in last 30 days\"). Deferred because it's product-facing. ## Snapshot changes in `internal-metrics.test.ts.snap` The `should return metrics data with users` test signs in 10 users today, then deletes one of them mid-test. Two small snapshot values change on today's date; both are just a reclassification of that single deleted user — the total (10 active users) is unchanged. - **`daily_active_users_split.new[today]`: 9 → 10** All 10 users really did sign in for the first time today. The old code only counted 9 because the deleted user's Postgres row was gone by the time the metrics query ran, so the old classifier couldn't see they were created today. The new query looks at ClickHouse events directly, sees the deleted user's first event was today, and counts them as new like everyone else. - **`daily_active_users_split.reactivated[today]`: 1 → 0** No user was "reactivated" today — nobody was active on an earlier day and came back. The old "1" was the deleted user falling into this bucket by default (the old classifier had no other rule that fit them). The new code correctly reports zero. Totals match either way (9 + 1 = 10 + 0). We're moving one deleted user out of the "returning visitor" bucket and into the "brand-new user" bucket, which is what they actually were. ## Test plan - [x] `pnpm typecheck` and `pnpm lint` pass on the backend package - [x] MAU equivalence matrix: 13/13 cases return the same set of users (not just the same count) between OLD and NEW pipelines - [x] Set-equality verified at 500k-MAU perf scale - [x] Full-route benchmark confirms the expected memory / duration improvements - [ ] Sanity-check the dashboard rendering after deploy (split charts, MAU counter, analytics overview) - [ ] Monitor Sentry for the assertion error — should drop to zero <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Performance Improvements** * Monthly and daily active metrics are now computed entirely server-side for faster queries and reduced client-side processing. * **Bug Fixes** * More consistent handling of anonymous/missing IDs and stricter ID filtering to improve accuracy across edge cases. * **Tests** * Added a comprehensive benchmark and validation harness to measure query performance and verify result equivalence across variants. <!-- end of auto-generated comment: release notes by coderabbit.ai --> |