stack

mirror of https://github.com/stack-auth/stack.git synced 2026-07-20 21:29:36 +08:00


BilalG1 91b8e4caa4 Fix /internal/metrics ClickHouse OOM (#1457)

Fixes Sentry [STACK-BACKEND-16H](https://stackframe-pw.sentry.io/issues/STACK-BACKEND-16H): the `/api/v1/internal/metrics` endpoint was triggering the cluster's 10.8 GiB OvercommitTracker kill on tenants with months of `$token-refresh` history.

## Root cause

Three queries in `loadAnalyticsOverview` plus `loadUsersByCountry` did `GROUP BY user_id` over the events table with no lower `event_at` bound, so their hash table working set scaled with cumulative-distinct-users-ever-seen instead of the 30-day metrics window.

## Changes

- Add 30-day `event_at` lower bound to `loadUsersByCountry` and to the `analyticsUserJoin` inner subquery (used by `dailyEvents`, `totalVisitors`, `topReferrers`).
- New `getClickhouseAdminClientForMetrics()` factory in `lib/clickhouse.tsx` with connection-level safety net: per-query + per-user memory caps, external GROUP BY spill, and `join_algorithm: 'grace_hash,parallel_hash,hash'` (grace_hash measured to give 48% memory reduction at zero latency cost; see benchmark notes in the file).
- Inline comment + concrete next steps for the long-term fix (option C: stamp `is_anonymous` at ingest on page-view/click events, then drop the join entirely).
- Extend `scripts/benchmark-internal-metrics.ts` with the historical-seed knob and three new modes (`BENCH_BACKFILL_COMPARE`, `BENCH_JOIN_ALGO_COMPARE`, plus the existing `BENCH_ROUTE_QUERIES` updated) used to validate the choices above.

## Benchmark: pre-PR vs post-PR

Synthetic seed: 300k users × 9 events spread over 365 days (~2.7M events).

| | pre-PR | post-PR | delta |
|---|---:|---:|---:|
| Sum peak memory | 2.18 GiB | 515 MiB | 4.3× less |
| Max query duration | 1293 ms | 101 ms | 12.8× faster |
| Sum CPU duration | 5119 ms | 394 ms | 13× less work |
| Sum bytes read | 3.87 GiB | 929 MiB | 4.3× less I/O |

Per-query at 300k users:

- `analyticsOverview:dailyEvents` 561 → 44 MiB (12.8× less)
- `analyticsOverview:totalVisitors` 560 → 50 MiB (11.2× less)
- `analyticsOverview:topReferrers` 546 → 50 MiB (10.9× less)
- `loadUsersByCountry` 388 → 44 MiB (8.9× less)

## Caveats

- `loadDailyActiveSplitFromClickhouse` still scans all-history on its `min(event_at)` subquery. It can't be naively bounded: `first_date` is used to classify entities as new vs reactivated, and a 30d bound would silently mislabel old-but-active entities as "new." The new SETTINGS cap+spill it; the proper fix is option C (documented inline).
- A user with a page-view but no `$token-refresh` in the last 30 days now falls through to `coalesce(NULL, 0)` and is classified non-anonymous. Token-refresh fires every few minutes per active session, so this is rare but not impossible (embedded SDKs that poll less frequently, sessions straddling the 30d boundary).
- `max_memory_usage_for_user: 9 GB` trades "cluster-wide OvercommitTracker kill of a random query" for "clean per-user memory error attributed to the specific query." After our 30d bounds, no query is anywhere near 9 GB.
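
As a rough illustration of the two changes described above (the 30-day `event_at` lower bound and the connection-level safety net), here is a minimal sketch. It is not the code from this PR: `buildBoundedUsersQuery`, `metricsClickhouseSettings`, the `events` table, and the `country_code` column are hypothetical names, and the per-query cap and spill threshold are assumed values. Only the `join_algorithm` string and the 9 GB per-user cap come from the description.

```typescript
// Hypothetical sketch: table/column names and the values marked "assumed"
// are illustrative and are not taken from the repository.
const METRICS_WINDOW_DAYS = 30;

// The keys are real ClickHouse setting names; values marked "assumed" are placeholders.
const metricsClickhouseSettings = {
  max_memory_usage: "4000000000", // assumed per-query cap
  max_memory_usage_for_user: "9000000000", // 9 GB per-user cap, as stated in the description
  max_bytes_before_external_group_by: "2000000000", // assumed spill-to-disk threshold
  join_algorithm: "grace_hash,parallel_hash,hash", // as stated in the description
} as const;

// Builds a GROUP BY user_id query whose scan is limited to the metrics window,
// so the hash table holds only users seen inside that window.
function buildBoundedUsersQuery(windowDays: number = METRICS_WINDOW_DAYS): string {
  if (!Number.isInteger(windowDays) || windowDays <= 0) {
    throw new Error("windowDays must be a positive integer");
  }
  return [
    "SELECT user_id, argMax(country_code, event_at) AS country_code",
    "FROM events",
    `WHERE event_at >= now() - INTERVAL ${windowDays} DAY`,
    "GROUP BY user_id",
  ].join("\n");
}
```

The settings object is shaped so it could be handed to a ClickHouse client as connection-level settings (for example via the `clickhouse_settings` option of `@clickhouse/client`, assuming that client is in use), which would make the caps apply to every metrics query rather than relying on each query to opt in.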

## Test plan

- [x] `pnpm typecheck` passes
- [x] `pnpm lint` passes
- [x] `pnpm test run apps/e2e/tests/backend/endpoints/api/v1/internal-metrics.test.ts`: 9/10 pass; the 1 failure (`risk_scores` snapshot drift) reproduces on clean `dev` and is unrelated
- [x] `pnpm test run apps/e2e/tests/backend/endpoints/api/v1/analytics-{events,events-batch,query}.test.ts apps/e2e/tests/backend/endpoints/api/v1/token-refresh-events.test.ts apps/e2e/tests/backend/performance/metrics.test.ts`: all tests pass except 10 pre-existing `PRODUCT_DOES_NOT_EXIST` setup failures, which also reproduce on clean `dev`
- [x] Benchmark `BENCH_ROUTE_QUERIES=1` at 300k users shows the deltas above

## Summary by CodeRabbit

* Chores
  * Improved internal metrics collection to use metrics-specific DB settings for more reliable, safer analytical reads.
  * Added guardrails to metrics queries to enforce time-window bounds and avoid unbounded scans.
  * Expanded benchmark modes (backfill and join-algo comparisons), extended perf seeding, and improved logging/retry behavior to capture more complete stats and reduce missing log rows.

2026-05-21 13:47:32 -07:00
..
prisma	Add schema to migration that was missing it	2026-05-19 16:14:28 -07:00
scripts	Fix /internal/metrics ClickHouse OOM (#1457)	2026-05-21 13:47:32 -07:00
src	Fix /internal/metrics ClickHouse OOM (#1457)	2026-05-21 13:47:32 -07:00
.env	[Feat]: set flag to disable billing (#1417)	2026-05-06 14:58:06 -07:00
.env.development	[Feat]: set flag to disable billing (#1417)	2026-05-06 14:58:06 -07:00
.eslintrc.cjs	tsup for stack-shared (#647)	2025-04-28 21:26:52 -07:00
.gitignore	private files n sm build shit (#1276)	2026-03-23 12:31:36 -07:00
instrumentation-client.ts	Upgrade backend to Next.js 16	2025-12-12 16:59:07 -08:00
LICENSE	Split backend and dashboard (#83)	2024-06-18 15:49:31 +02:00
next.config.mjs	private files n sm build shit (#1276)	2026-03-23 12:31:36 -07:00
package.json	chore: update package versions	2026-05-20 11:58:44 -07:00
prisma.config.ts	[Fix]: Assortment of Bugs with Timefold Table and Payments (#1348)	2026-04-18 14:17:24 -07:00
tsconfig.json	Fix lint	2026-02-27 09:59:26 -08:00
vercel.json	External db sync (#1036)	2026-02-05 12:04:31 -08:00
vitest.config.ts	Fix flaky tests and preexisting CI failures (#1443)	2026-05-20 10:00:11 -07:00
vitest.setup.ts	Customizable ports (#962)	2025-10-20 15:24:47 -07:00