stack/apps/backend/src/app/api/latest
BilalG1 91b8e4caa4
Fix /internal/metrics ClickHouse OOM (#1457)
Fixes Sentry issue
[STACK-BACKEND-16H](https://stackframe-pw.sentry.io/issues/STACK-BACKEND-16H):
the `/api/v1/internal/metrics` endpoint was triggering the cluster's
10.8 GiB OvercommitTracker kill on tenants with months of
`$token-refresh` history.

## Root cause

Three queries in `loadAnalyticsOverview` plus `loadUsersByCountry` did
`GROUP BY user_id` over the events table with **no lower `event_at`
bound**, so their hash table working set scaled with
cumulative-distinct-users-ever-seen instead of the 30-day metrics
window.
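
The scaling difference can be illustrated with a small in-memory stand-in for `GROUP BY user_id`. This is only a sketch: the event shape, the `groupByUser` helper, and the sample data are hypothetical and are not taken from the route code. It shows that the map (the hash table) gets one entry per distinct user surviving the time filter, so without a lower bound it grows with every user ever seen.

```typescript
// Sketch only: the event shape and all names here are hypothetical.
type AnalyticsEvent = { userId: string; eventAtMs: number };

const DAY_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for `GROUP BY user_id`: one map entry per distinct user
// that survives the time filter. Omitting `windowDays` means no lower bound.
function groupByUser(
  events: AnalyticsEvent[],
  untilMs: number,
  windowDays?: number,
): Map<string, number> {
  const lowerBoundMs =
    windowDays === undefined ? -Infinity : untilMs - windowDays * DAY_MS;
  const counts = new Map<string, number>();
  for (const event of events) {
    if (event.eventAtMs >= untilMs || event.eventAtMs < lowerBoundMs) continue;
    counts.set(event.userId, (counts.get(event.userId) ?? 0) + 1);
  }
  return counts;
}

const untilMs = Date.UTC(2026, 4, 21);
const daysAgo = (n: number): number => untilMs - n * DAY_MS;

const sampleEvents: AnalyticsEvent[] = [
  { userId: "u1", eventAtMs: daysAgo(400) },
  { userId: "u2", eventAtMs: daysAgo(200) },
  { userId: "u3", eventAtMs: daysAgo(10) },
  { userId: "u3", eventAtMs: daysAgo(5) },
  { userId: "u1", eventAtMs: daysAgo(2) },
];

// Unbounded: entries for u1, u2 and u3 (every user ever seen).
const unboundedGroups = groupByUser(sampleEvents, untilMs);
// Bounded to 30 days: entries for u1 and u3 only.
const boundedGroups = groupByUser(sampleEvents, untilMs, 30);
```

Here `unboundedGroups.size` is 3 and `boundedGroups.size` is 2; on a real tenant the gap is the difference between cumulative users and users active in the window.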

## Changes

- Add 30-day `event_at` lower bound to `loadUsersByCountry` and to the
`analyticsUserJoin` inner subquery (used by `dailyEvents`,
`totalVisitors`, `topReferrers`).
- New `getClickhouseAdminClientForMetrics()` factory in
`lib/clickhouse.tsx` with a connection-level safety net: per-query and
per-user memory caps, external GROUP BY spill, and `join_algorithm:
'grace_hash,parallel_hash,hash'` (grace_hash was measured to give a 48%
memory reduction at zero latency cost; see the benchmark notes in the file).
- Inline comment + concrete next steps for the long-term fix (option C:
stamp `is_anonymous` at ingest on page-view/click events, then drop the
join entirely).
- Extend `scripts/benchmark-internal-metrics.ts` with the
historical-seed knob, two new modes (`BENCH_BACKFILL_COMPARE`,
`BENCH_JOIN_ALGO_COMPARE`), and an updated `BENCH_ROUTE_QUERIES` mode,
all used to validate the choices above.
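
As a rough illustration of the safety net described in the list above, the settings might be assembled along these lines. This is a sketch under assumptions, not the contents of `lib/clickhouse.tsx`: `buildMetricsSettings` is a hypothetical helper, the per-query cap is an arbitrary example value, and reading "9 GB" as 9 × 10^9 bytes is a guess. Only the setting names (standard ClickHouse settings) and the `join_algorithm` string come from ClickHouse and the description above.

```typescript
// Sketch only: helper name and byte values are assumptions, except the 9 GB
// per-user cap and the join_algorithm list, which the PR description states.
type MetricsClickhouseSettings = {
  max_memory_usage: string;                   // per-query cap, in bytes
  max_memory_usage_for_user: string;          // per-user cap, in bytes
  max_bytes_before_external_group_by: string; // GROUP BY spills to disk past this
  join_algorithm: string;                     // tried in order, left to right
};

const GB = 1e9; // assumption: "9 GB" is decimal gigabytes

function buildMetricsSettings(perQueryCapBytes: number): MetricsClickhouseSettings {
  return {
    // 64-bit settings are passed as strings here to avoid precision issues.
    max_memory_usage: String(perQueryCapBytes),
    max_memory_usage_for_user: String(9 * GB),
    // ClickHouse's guidance is to keep the spill threshold at about half of
    // max_memory_usage, so aggregation spills before the cap is hit.
    max_bytes_before_external_group_by: String(Math.floor(perQueryCapBytes / 2)),
    join_algorithm: "grace_hash,parallel_hash,hash",
  };
}

// Example with an illustrative 4 GB per-query cap.
const exampleSettings = buildMetricsSettings(4 * GB);
```

An object like this would typically be passed as the client's connection-level settings, so every metrics query inherits the caps without repeating a `SETTINGS` clause.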

## Benchmark — pre-PR vs post-PR

Synthetic seed: 300k users × 9 events spread over 365 days (~2.7M
events).

| Metric | pre-PR | post-PR | delta |
|---|---:|---:|---:|
| Sum peak memory | 2.18 GiB | 515 MiB | **4.3× less** |
| Max query duration | 1293 ms | 101 ms | **12.8× faster** |
| Sum CPU duration | 5119 ms | 394 ms | 13× less work |
| Sum bytes read | 3.87 GiB | 929 MiB | 4.3× less I/O |

Per-query at 300k users:
- `analyticsOverview:dailyEvents` 561 → 44 MiB (12.8× less)
- `analyticsOverview:totalVisitors` 560 → 50 MiB (11.2× less)
- `analyticsOverview:topReferrers` 546 → 50 MiB (10.9× less)
- `loadUsersByCountry` 388 → 44 MiB (8.9× less)
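
For reference, the deltas in the summary table can be recomputed from its rounded figures. This is a small arithmetic sketch, not part of the benchmark script; `reductionFactor` is a hypothetical helper, and GiB values are converted to MiB so both sides of each ratio share a unit.

```typescript
const MIB_PER_GIB = 1024;

// Ratio of a "before" measurement to an "after" one, rounded to one decimal.
function reductionFactor(before: number, after: number): number {
  return Math.round((before / after) * 10) / 10;
}

const summaryDeltas = {
  peakMemory: reductionFactor(2.18 * MIB_PER_GIB, 515), // MiB vs MiB
  maxDuration: reductionFactor(1293, 101),              // ms vs ms
  cpuDuration: reductionFactor(5119, 394),              // ms vs ms
  bytesRead: reductionFactor(3.87 * MIB_PER_GIB, 929),  // MiB vs MiB
};
```

The results are 4.3, 12.8, 13 and 4.3 respectively, matching the delta column.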

## Caveats

- `loadDailyActiveSplitFromClickhouse` still scans all history in its
`min(event_at)` subquery. It can't be naively bounded: `first_date` is
used to classify entities as new vs reactivated, and a 30d bound would
silently mislabel old-but-active entities as "new." The new SETTINGS
cap its memory and let it spill to disk; the proper fix is option C
(documented inline).
- A user with a page-view but no `$token-refresh` in the last 30 days
now falls through to `coalesce(NULL, 0)` and is classified
non-anonymous. Token-refresh fires every few minutes per active session,
so this is rare but not impossible (embedded SDKs that poll less
frequently, sessions straddling the 30d boundary).
- `max_memory_usage_for_user: 9 GB` trades "cluster-wide
OvercommitTracker kill of a random query" for "clean per-user memory
error attributed to the specific query." After our 30d bounds, no query
is anywhere near 9 GB.
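
The first two caveats can be sketched in a few lines. Everything here is an illustrative assumption: the helper names, the "inactive" label, and the rule that an entity counts as "new" when its first event falls inside the window are simplifications, not the actual logic of `loadDailyActiveSplitFromClickhouse` or `analyticsUserJoin`.

```typescript
// Sketch only: names and the classification rule are hypothetical.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

type EntityLabel = "new" | "reactivated" | "inactive";

// Stand-in for `min(event_at)`, optionally restricted by a lower bound.
function firstSeenMs(
  eventTimesMs: number[],
  lowerBoundMs: number = -Infinity,
): number | undefined {
  const visible = eventTimesMs.filter((t) => t >= lowerBoundMs);
  return visible.length === 0 ? undefined : Math.min(...visible);
}

// Assumed rule: first event inside the window means "new", otherwise "reactivated".
function classifyEntity(
  firstDateMs: number | undefined,
  windowStartMs: number,
): EntityLabel {
  if (firstDateMs === undefined) return "inactive";
  return firstDateMs >= windowStartMs ? "new" : "reactivated";
}

// Stand-in for `coalesce(is_anonymous, 0)` when the join finds no row.
function isAnonymous(joinedFlag: number | null): boolean {
  return (joinedFlag ?? 0) === 1;
}

const nowMs = Date.UTC(2026, 4, 21);
const windowStartMs = nowMs - 30 * MS_PER_DAY;
// An entity first seen 300 days ago that is still active 3 days ago.
const entityTimesMs = [nowMs - 300 * MS_PER_DAY, nowMs - 3 * MS_PER_DAY];

// Full history: first_date is 300 days old, so the entity is "reactivated".
const fullHistoryLabel = classifyEntity(firstSeenMs(entityTimesMs), windowStartMs);
// Bounded scan: the old event is invisible, so the same entity looks "new".
const boundedLabel = classifyEntity(
  firstSeenMs(entityTimesMs, windowStartMs),
  windowStartMs,
);
// No `$token-refresh` row in the window: NULL coalesces to non-anonymous.
const missingRowIsAnonymous = isAnonymous(null);
```

The same entity gets a different label depending only on how far back the scan reaches, which is why the bound cannot simply be copied onto that subquery.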

## Test plan

- [x] `pnpm typecheck` passes
- [x] `pnpm lint` passes
- [x] `pnpm test run
apps/e2e/tests/backend/endpoints/api/v1/internal-metrics.test.ts` — 9/10
pass; the 1 failure (`risk_scores` snapshot drift) reproduces on clean
`dev` and is unrelated
- [x] `pnpm test run
apps/e2e/tests/backend/endpoints/api/v1/analytics-{events,events-batch,query}.test.ts
apps/e2e/tests/backend/endpoints/api/v1/token-refresh-events.test.ts
apps/e2e/tests/backend/performance/metrics.test.ts`: everything passes
except 10 pre-existing `PRODUCT_DOES_NOT_EXIST` setup failures, which
also reproduce on clean `dev`
- [x] Benchmark `BENCH_ROUTE_QUERIES=1` at 300k users shows the deltas
above

## Summary by CodeRabbit

* **Chores**
  * Improved internal metrics collection to use metrics-specific DB
    settings for more reliable, safer analytical reads.
  * Added guardrails to metrics queries to enforce time-window bounds and
    avoid unbounded scans.
  * Expanded benchmark modes (backfill and join-algo comparisons),
    extended perf seeding, and improved logging/retry behavior to capture
    more complete stats and reduce missing log rows.

2026-05-21 13:47:32 -07:00
..
(api-keys) Upgrade Prisma to v7 (#1064) 2025-12-26 08:13:34 -08:00
ai/query/[mode] Dashboard: DataGrid refactor + layout (stacked on overview-revamp) (#1338) 2026-04-27 13:50:24 -07:00
analytics/events/batch [Feat]: set flag to disable billing (#1417) 2026-05-06 14:58:06 -07:00
auth stack-cli: support self-hosted URLs and tighten CLI auth polling (#1419) 2026-05-08 11:00:03 -07:00
check-feature-support fix types 2026-01-19 16:51:49 -08:00
check-version update version check alerter 2025-12-11 14:29:44 -08:00
connected-accounts Improved StackAssertionError error logging 2026-05-07 13:29:01 -07:00
contact-channels Upgrade ESLint 2026-02-27 10:58:28 -08:00
data-vault/stores/[id] encrypt neon connection strings, update connections route (#879) 2025-09-09 21:35:07 +00:00
emails stack auth preview mode (#1307) 2026-04-08 16:57:42 -07:00
integrations [Refactor][Feat] Implement Plan Limits for Hard-and-Soft Item Caps (#1215) 2026-05-04 18:25:13 -07:00
internal Fix /internal/metrics ClickHouse OOM (#1457) 2026-05-21 13:47:32 -07:00
migration-tests Move /api/v1 to /api/latest 2025-02-05 17:24:43 -08:00
oauth-providers Emit delete tombstone when provider_account_id changes (#1320) 2026-04-10 18:46:04 -07:00
payments [Fix]: Payments App Sundry Fixes (#1455) 2026-05-20 19:33:14 -07:00
project-permission-definitions Auto migration (#526) 2025-07-24 02:38:37 +02:00
project-permissions Backend fallback (cloud run) (#1306) 2026-04-11 00:57:37 +00:00
projects Onboarding app & restricted users (#1069) 2026-01-11 17:22:14 -08:00
projects-anonymous-users/[project_id]/.well-known Improved anonymous users (#857) 2025-08-24 11:36:01 -07:00
session-replays/batch [Feat]: set flag to disable billing (#1417) 2026-05-06 14:58:06 -07:00
team-invitations [Revert] team invitation accept email-match check (#1431) 2026-05-13 17:15:11 -07:00
team-member-profiles clickhouse new syncs and verify-data (#1304) 2026-04-08 14:43:22 -07:00
team-memberships Backend fallback (cloud run) (#1306) 2026-04-11 00:57:37 +00:00
team-permission-definitions refactor(dashboard): unify AI chat surfaces on assistant-ui Thread (#1427) 2026-05-15 14:21:00 -07:00
team-permissions Backend fallback (cloud run) (#1306) 2026-04-11 00:57:37 +00:00
teams Data-grid overhaul + session-replays / team-payments dashboard surfaces (#1424) 2026-05-15 14:16:47 -07:00
users refactor(dashboard): unify AI chat surfaces on assistant-ui Thread (#1427) 2026-05-15 14:21:00 -07:00
webhooks/svix-token stack auth preview mode (#1307) 2026-04-08 16:57:42 -07:00
beta-changes.txt Move /api/v1 to /api/latest 2025-02-05 17:24:43 -08:00
changes.txt Move /api/v1 to /api/latest 2025-02-05 17:24:43 -08:00
permission-definitions-pagination.ts refactor(dashboard): unify AI chat surfaces on assistant-ui Thread (#1427) 2026-05-15 14:21:00 -07:00
route.ts Endpoints branching (#659) 2025-04-30 15:39:47 -07:00