An open-source user management solution with built-in frontend components and an admin dashboard.
BilalG1 85ae4b1c9e
Fix ClickHouse OOM in MAU query + optimize /internal/metrics route (#1344)
## Summary

Fixes the Sentry `StackAssertionError: Failed to load monthly active
users for internal metrics` crash (ClickHouse OOM at the 7.2 GiB
per-query cap) and applies two related optimizations to other queries in
the same route while here. Adds a local benchmark harness that validates
correctness and measures peak memory / duration before & after.

## Root cause (the original Sentry error)

`loadMonthlyActiveUsers` ran `SELECT user_id … GROUP BY user_id` and
then counted the results in Node via a `Set`. On a large project, that
ships back millions of user_ids. Two failure modes stacked:

1. **Result materialization** — every distinct user_id had to be
buffered in the server before streaming to Node (~20 MiB of result for
450k users; much more at real scale).
2. **`JSONExtract(toJSONString(data), 'is_anonymous', 'UInt8')`**:
`toJSONString(data)` re-serializes the entire nested JSON column for
every row, billions of times, just to pull one boolean. This dominates
bytes read.

Combined, on a single partition read from S3-backed MergeTree, this can
exceed ClickHouse's 7.2 GiB per-query memory cap. That's exactly what
the Sentry trace showed.

## Changes

### 1. Fix MAU query (`loadMonthlyActiveUsers`)

Moved counting to the server with
`uniqExact(sipHash64(normalized_user_id))` and pulled the JS-side
normalization (`lower`, `trim`, `isUuid`) into SQL. Picked `sipHash64`
after benchmarking 7 variants — it's exact (at <<2³² users) and halves
the uniqExact hash-state vs. raw string keys.
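
The shape of that change can be sketched roughly as follows. This is an illustration only, not the code in `route.tsx`: the table name `events`, the column `event_at`, the query parameters, and every TypeScript identifier are assumptions made for the example, and `trimBoth` stands in for whatever trimming the real query uses.

```typescript
// Hypothetical sketch; names are illustrative and not taken from route.tsx.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

// Normalization that previously ran in Node: trim, lowercase, keep UUID-shaped ids only.
function normalizeUserId(raw: string | null): string | null {
  if (raw === null) return null;
  const id = raw.trim().toLowerCase();
  return UUID_RE.test(id) ? id : null;
}

// OLD shape: every distinct user_id is shipped to Node and deduplicated in a Set.
function countMauInNode(userIds: ReadonlyArray<string | null>): number {
  const seen = new Set<string>();
  for (const raw of userIds) {
    const id = normalizeUserId(raw);
    if (id !== null) seen.add(id);
  }
  return seen.size;
}

// NEW shape: normalization and the distinct count both run in ClickHouse,
// so a single row comes back. Table, column, and parameter names are assumed.
const mauQuerySketch = `
  SELECT uniqExact(sipHash64(normalized_user_id)) AS mau
  FROM (
    SELECT lower(trimBoth(user_id)) AS normalized_user_id
    FROM events
    WHERE project_id = {projectId:String}
      AND event_at >= {since:DateTime}
  )
  WHERE match(normalized_user_id, '^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$')
`;
```

Hashing each id to a fixed 64-bit value before `uniqExact` is what keeps the aggregation state small compared with storing the raw strings.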

### 2. Replace `JSONExtract(toJSONString(data), …)` with direct `CAST(data.is_anonymous, …)`

Applied everywhere the pattern appeared in the metrics route:
- `loadDailyActiveUsers`
- the `analyticsUserJoin` subquery
- the `nonAnonymousAnalyticsUserFilter`
- `analyticsOverview:topRegion`
- `analyticsOverview:online`

Semantics preserved (`coalesce(CAST(data.is_anonymous,
'Nullable(UInt8)'), 0)` matches `JSONExtract(…, 'UInt8')` behavior when
the field is missing).
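
As a rough illustration of the rewrite, the two expressions can be generated by small string helpers. The helper names and the table alias are hypothetical; only the SQL fragments themselves come from the description above.

```typescript
// Hypothetical helpers; function names and the alias parameter are assumptions.

// OLD: re-serializes the whole JSON column for every row to read one field.
function isAnonymousExprOld(alias: string): string {
  return `JSONExtract(toJSONString(${alias}.data), 'is_anonymous', 'UInt8')`;
}

// NEW: reads the sub-column directly; coalesce maps a missing/NULL field to 0.
function isAnonymousExprNew(alias: string): string {
  return `coalesce(CAST(${alias}.data.is_anonymous, 'Nullable(UInt8)'), 0)`;
}

// Filter fragment that keeps only non-anonymous rows.
function nonAnonymousFilter(alias: string): string {
  return `${isAnonymousExprNew(alias)} = 0`;
}
```

Centralizing the expression in one helper means every query in the route picks up the same missing-field behavior.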

### 3. Server-aggregate the split queries

`loadDailyActiveUsersSplit` and `loadDailyActiveTeamsSplit` used to ship
1.2M+ `(day, user_id)` rows back to Node just so the JS could bucket
them into new / retained / reactivated. Rewrote both as one CTE-style
query that returns 31 rows (one per day in the 30-day window) with the
counts precomputed.

**Minor semantic shift** (documented inline in `route.tsx`): "new" is
now based on the user's first-ever `$token-refresh` event rather than
their Postgres `signedUpAt`. The two agree for users who log in
immediately after sign-up (the common case). They disagree for the rare
edge case of an account that existed pre-window but never generated a
`$token-refresh` until now: old code classified it as "reactivated," new
code classifies it as "new." Judged acceptable; can be revisited.

Postgres round-trips for `ProjectUser.signedUpAt` / `Team.createdAt` are
no longer needed for the split, and the 76 MiB-ish wire ship is gone.
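
The bucketing that now happens server-side can be modeled in memory as below. This is a sketch under stated assumptions, not the shipped ClickHouse query: all names are hypothetical, days are plain integer indices, and the definitions of "retained" (also active on the previous day) and "reactivated" (seen before, but not on the previous day) are assumptions, since the description only pins down "new" as the first-ever event.

```typescript
// Hypothetical in-memory model; the shipped version is a single ClickHouse query.
type Bucket = "new" | "retained" | "reactivated";
type DayCounts = Record<Bucket, number>;

interface ActivityEvent {
  userId: string;
  day: number; // integer day index; may be earlier than the window
}

function splitByDay(
  events: ReadonlyArray<ActivityEvent>,
  firstDay: number,
  lastDay: number,
): Map<number, DayCounts> {
  // Collect the set of active days per user, including days before the window.
  const daysByUser = new Map<string, Set<number>>();
  for (const { userId, day } of events) {
    let days = daysByUser.get(userId);
    if (days === undefined) {
      days = new Set<number>();
      daysByUser.set(userId, days);
    }
    days.add(day);
  }

  // One row per day in the window, even when nobody was active that day.
  const result = new Map<number, DayCounts>();
  for (let d = firstDay; d <= lastDay; d++) {
    result.set(d, { new: 0, retained: 0, reactivated: 0 });
  }

  daysByUser.forEach((days) => {
    const firstEver = Math.min(...Array.from(days));
    days.forEach((day) => {
      const row = result.get(day);
      if (row === undefined) return; // activity outside the window is not counted
      const bucket: Bucket =
        day === firstEver ? "new" : days.has(day - 1) ? "retained" : "reactivated";
      row[bucket] += 1;
    });
  });
  return result;
}
```

Because "new" is derived from the event stream alone, no lookup of `signedUpAt` or `createdAt` is needed, which is the point of the rewrite.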

### 4. Benchmark harness (`apps/backend/scripts/benchmark-internal-metrics.ts`)

Local-only tool. Three modes:
- **MAU equivalence matrix** — 13 edge cases (empty, dedup, anonymous
filter, window boundary, null user_id, non-UUID user_id, case variation,
project isolation, missing/null `is_anonymous`, wrong event_type).
Asserts OLD pipeline and NEW query return the **same set** of users, not
just the same count.
- **MAU perf** — OLD vs NEW plus 6 other candidate variants (inline
regex, UUID keys, sipHash64, HLL sketches), reads `memory_usage` /
`read_rows` / `result_bytes` from `system.query_log` for each, prints a
ranked table.
- **Full-route benchmark** (`BENCH_ROUTE_QUERIES=1`) — runs every
ClickHouse query in `/internal/metrics` in three stages (BEFORE, AFTER,
candidate OPTIMIZED) against the same seed and prints per-query deltas
plus endpoint-level totals.

Seeds under a synthetic `project_id` so real data is never touched;
cleans up on exit via `ALTER TABLE … DELETE`.
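
A minimal sketch of the two checks the harness relies on, assuming nothing about its actual code: the function name and the `queryId` parameter are hypothetical, while the `system.query_log` columns are standard ClickHouse ones.

```typescript
// Hypothetical helper: true only when both pipelines return exactly the same ids.
// Equal counts with different members must fail, which is why sizes alone are not enough.
function sameUserSet(oldIds: Iterable<string>, newIds: Iterable<string>): boolean {
  const a = new Set(Array.from(oldIds));
  const b = new Set(Array.from(newIds));
  if (a.size !== b.size) return false;
  let equal = true;
  a.forEach((id) => {
    if (!b.has(id)) equal = false;
  });
  return equal;
}

// Sketch of reading per-query stats after a run; the parameter name is an assumption.
const queryStatsSketch = `
  SELECT memory_usage, read_rows, read_bytes, result_bytes, query_duration_ms
  FROM system.query_log
  WHERE query_id = {queryId:String}
    AND type = 'QueryFinish'
`;
```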

## Benchmark results

### MAU query alone

Ran at two scales; set-equality verified (new query identifies the same
individual users, not just the same count).

| seed | MAU | peak memory (old → new) | bytes read (old → new) | duration (old → new) |
|---|---|---|---|---|
| 500k events | 89,939 | 158.7 MiB → 46.7 MiB (**3.4×**, −70%) | 175.7 MiB → 63.0 MiB (2.8×) | 483 ms → 76 ms (**6.4×**) |
| 2.5M events | 449,990 | 439.2 MiB → 281.4 MiB (1.56×, −36%) | 865.0 MiB → 310.9 MiB (2.8×) | 783 ms → 126 ms (**6.2×**) |

MAU variant bake-off at 2.5M events (the exact variants are all set-equal to OLD; rows marked `~approx` trade a small count error for memory):

| variant | memory | duration | notes |
|---|---|---|---|
| v0_old (baseline) | 440 MiB | 567 ms | — |
| v1_uniqExact_string | 284 MiB | 110 ms | naive fix |
| v3_uniqExact_toUUID | 244 MiB | 153 ms | UUID keys, slower per-row |
| **v4_uniqExact_sipHash64** | **125 MiB** | **95 ms** | **shipped** |
| v5_uniq (HLL) ~approx | 30 MiB | 86 ms | −0.25% error |
| v6_uniqCombined ~approx | 31 MiB | 67 ms | −0.15% error |

### Full `/internal/metrics` route (2.7M events, 300k users + page-views + clicks + teams)

Ranked by BEFORE peak memory:

| query | mem BEFORE | mem AFTER | Δ mem | dur BEFORE | dur AFTER | Δ dur |
|---|---|---|---|---|---|---|
| analyticsOverview:topReferrers | 588.1 MiB | 411.1 MiB | 1.43× | 1833 ms | 110 ms | **16.66×** |
| analyticsOverview:totalVisitors | 584.3 MiB | 403.5 MiB | 1.45× | 1829 ms | 121 ms | 15.12× |
| analyticsOverview:dailyEvents | 584.1 MiB | 403.7 MiB | 1.45× | 1897 ms | 140 ms | 13.55× |
| loadUsersByCountry | 393.1 MiB | 385.4 MiB | ≈same | 74 ms | 80 ms | ≈same |
| loadDailyActiveUsersSplit | 363.4 MiB | 396.8 MiB | *+9%* | 1966 ms | 356 ms | 5.52× |
| analyticsOverview:topRegion | 269.9 MiB | 106.4 MiB | 2.54× | 1602 ms | 65 ms | 24.65× |
| loadDailyActiveUsers | 268.3 MiB | 84.0 MiB | 3.19× | 1111 ms | 44 ms | 25.25× |
| loadDailyActiveTeamsSplit | 59.6 MiB | 78.1 MiB | *+31%* | 70 ms | 123 ms | *+76%* |
| loadMonthlyActiveUsers | 54.9 MiB | 54.9 MiB | ≈same | 68 ms | 56 ms | ≈same |
| analyticsOverview:online | 18.4 MiB | 5.8 MiB | 3.17× | 58 ms | 4 ms | 14.50× |

**Endpoint-level totals**

| metric | BEFORE | AFTER | Δ |
|---|---|---|---|
| Sum peak ClickHouse memory | 3.11 GiB | 2.28 GiB | **−27%** |
| **Max query duration** (endpoint wall-clock floor) | **1966 ms** | **356 ms** | **−82%** (5.5×) |
| Sum query duration (total CPU) | 10508 ms | 1099 ms | **−90%** (9.6×) |
| Bytes read | 10.70 GiB | 4.55 GiB | −57% |
| Bytes shipped to Node | 94.8 MiB | 44.2 KiB | **−99.95%** |

Both split queries show a small memory *regression* at this seed size
(the new server-side window-function + self-join has its own state cost
that's near break-even with "materialize + ship" at 300k users); at
prod scale the 76 MiB-ship saving dominates. Duration is clearly better
for the users split (1966 ms → 356 ms); the teams split is slower at
this seed size (70 ms → 123 ms).
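
For readers checking the Δ columns, the two conventions used in the tables (a signed percentage for reductions, a `×` factor for speedups) can be reproduced with a couple of one-liners. The function names are illustrative, not part of the harness.

```typescript
// Hypothetical helpers for reproducing the Δ columns from BEFORE/AFTER values.

// Signed percentage change; negative means AFTER is smaller than BEFORE.
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Speedup factor; greater than 1 means AFTER is faster or smaller.
function speedup(before: number, after: number): number {
  return before / after;
}

// Values taken from the endpoint-level totals table.
const memoryDelta = percentChange(3.11, 2.28); // about -26.7, reported as −27%
const maxDurationFactor = speedup(1966, 356);  // about 5.52, reported as 5.5×
```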

## Why we don't need to drop the `analyticsUserJoin` in this PR

The benchmark includes an OPTIMIZED stage that drops the LEFT JOIN and
trusts `e.data.is_anonymous` directly, which would shave another **1.2
GiB / 1.9× duration** off the endpoint. **But we can't ship that here**
— an audit of the client tracker
(`packages/js/src/lib/stack-app/apps/implementations/event-tracker.ts`)
confirmed `is_anonymous` is never set on client-emitted `$page-view` /
`$click` events. The JOIN is currently load-bearing. A follow-up PR will
enrich `is_anonymous` at the batch ingest endpoint using
`auth.user.is_anonymous`; after one metrics-window cycle (~30 days) the
JOIN can be dropped.

## Follow-up work (out of scope for this PR)

- **Batch-endpoint enrichment** + drop the analytics-overview LEFT JOIN
(est. further −53% endpoint memory, −46% duration per the benchmark).
- **Teams-split hash-variant count mismatch** — `sipHash64(team_id)`
variant of the teams split shows a count discrepancy vs. the
string-keyed version in the benchmark. Not blocking since teams-split is
only #8 by memory; needs a root-cause pass before shipping that
particular optimization.
- **`loadUsersByCountry` window bound** — currently scans every
`$token-refresh` event ever for the tenancy (no time filter). Bounding
to 30 days would bound memory growth with project age, but changes
semantics ("country of latest login ever" → "in last 30 days").
Deferred because it's product-facing.

## Snapshot changes in `internal-metrics.test.ts.snap`

The `should return metrics data with users` test signs in 10 users
today, then deletes one of them mid-test. Two small snapshot values
change on today's date; both are just a reclassification of that single
deleted user — the total (10 active users) is unchanged.

- **`daily_active_users_split.new[today]`: 9 → 10**
All 10 users really did sign in for the first time today. The old code
only counted 9 because the deleted user's Postgres row was gone by the
time the metrics query ran, so the old classifier couldn't see they were
created today. The new query looks at ClickHouse events directly, sees
the deleted user's first event was today, and counts them as new like
everyone else.

- **`daily_active_users_split.reactivated[today]`: 1 → 0**
No user was "reactivated" today — nobody was active on an earlier day
and came back. The old "1" was the deleted user falling into this bucket
by default (the old classifier had no other rule that fit them). The new
code correctly reports zero.

Totals match either way (9 + 1 = 10 + 0). We're moving one deleted user
out of the "returning visitor" bucket and into the "brand-new user"
bucket, which is what they actually were.

## Test plan

- [x] `pnpm typecheck` and `pnpm lint` pass on the backend package
- [x] MAU equivalence matrix: 13/13 cases return the same set of users
(not just the same count) between OLD and NEW pipelines
- [x] Set-equality verified at 500k-MAU perf scale
- [x] Full-route benchmark confirms the expected memory / duration
improvements
- [ ] Sanity-check the dashboard rendering after deploy (split charts,
MAU counter, analytics overview)
- [ ] Monitor Sentry for the assertion error — should drop to zero

## Summary by CodeRabbit

* **Performance Improvements**
  * Monthly and daily active metrics are now computed entirely server-side for faster queries and reduced client-side processing.

* **Bug Fixes**
  * More consistent handling of anonymous/missing IDs and stricter ID filtering to improve accuracy across edge cases.

* **Tests**
  * Added a comprehensive benchmark and validation harness to measure query performance and verify result equivalence across variants.
2026-04-19 22:57:46 -07:00
.changeset Disable changesets changelogs 2026-01-12 15:21:56 -08:00
.claude Better Clickhouse errors during development 2026-01-30 22:39:17 -08:00
.cursor Update pre-push.md 2026-04-12 21:52:33 -07:00
.devcontainer Customizable ports (#962) 2025-10-20 15:24:47 -07:00
.github LLM MCP Flow (#1321) 2026-04-15 17:57:08 +00:00
.vscode Payments bulldozer txn rework (#1315) 2026-04-17 22:11:21 +00:00
apps Fix ClickHouse OOM in MAU query + optimize /internal/metrics route (#1344) 2026-04-19 22:57:46 -07:00
claude Fix Docker build scripts 2026-04-18 23:50:50 -07:00
configs [Fix] Infinite Loop on handler/sign-in due to useStackApp not being able to find the StackProvider given context (#1248) 2026-03-12 22:28:47 -07:00
docker Fix Docker build scripts 2026-04-18 23:50:50 -07:00
docs chore: update package versions 2026-04-18 14:20:39 -07:00
docs-mintlify chore: update package versions 2026-04-18 14:20:39 -07:00
examples chore: update package versions 2026-04-18 14:20:39 -07:00
packages ai proxy fix (#1343) 2026-04-19 22:57:38 -07:00
patches Fix MS OAuth (#457) 2025-02-21 19:39:22 +01:00
scripts Fix memory leak 2026-04-18 22:21:05 -07:00
sdks Don't override 5xx errors 2026-04-18 19:31:13 -07:00
.dockerignore emu with a q stuff (#1266) 2026-04-04 00:33:52 +00:00
.gitignore clickhouse new syncs and verify-data (#1304) 2026-04-08 14:43:22 -07:00
.gitmodules private files n sm build shit (#1276) 2026-03-23 12:31:36 -07:00
AGENTS.md Payments bulldozer txn rework (#1315) 2026-04-17 22:11:21 +00:00
CHANGELOG.md CHANGELOG.md Update with Images 2026-02-02 11:27:09 -06:00
CLAUDE.md session replays (#1187) 2026-02-16 14:15:17 -08:00
CONTRIBUTING.md Config sources (#1083) 2026-01-21 18:08:35 -08:00
LICENSE Fix user hooks 2025-06-22 19:32:52 -07:00
package.json Fix memory leak 2026-04-18 22:21:05 -07:00
pnpm-lock.yaml Fix memory leak 2026-04-18 22:21:05 -07:00
pnpm-workspace.yaml Replace npx with pnpm exec (#1300) 2026-04-08 17:08:55 -07:00
README.md LLM MCP Flow (#1321) 2026-04-15 17:57:08 +00:00
turbo.json Fix build 2026-02-27 00:48:07 -08:00
vitest.shared.ts Fix tests 2026-02-17 19:57:08 -08:00
vitest.workspace.ts Hosted components (#1229) 2026-03-10 11:29:05 -07:00

📘 Docs | ☁️ Hosted Version | Demo | 🎮 Discord

Stack Auth: The open-source auth platform

Stack Auth is a managed user authentication solution. It is developer-friendly and fully open-source (licensed under MIT and AGPL).

Stack Auth gets you started in just five minutes, after which you'll be ready to use all of its features as you grow your project. Our managed service is completely optional and you can export your user data and self-host, for free, at any time.

We support Next.js, React, and JavaScript frontends, along with any backend that can use our REST API. Check out our setup guide to get started.

How is this different from X?

Ask yourself about X:

  • Is X open-source?
  • Is X developer-friendly, well-documented, and lets you get started in minutes?
  • Besides authentication, does X also do authorization and user management (see feature list below)?

If you answered "no" to any of these questions, then that's how Stack Auth is different from X.

Features

To get notified first when we add new features, please subscribe to our newsletter.

<SignIn/> and <SignUp/>

Authentication components that support OAuth, password credentials, and magic links, with shared development keys to make setup faster. All components support dark/light modes.

Idiomatic Next.js APIs

We build on server components, React hooks, and route handlers.

User dashboard

Dashboard to filter, analyze, and edit users. Replaces the first internal tool you would have to build.

Account settings

Lets users update their profile, verify their e-mail, or change their password. No setup required.

Multi-tenancy & teams

Manage B2B customers with an organization structure that makes sense and scales to millions.

Role-based access control

Define an arbitrary permission graph and assign it to users. Organizations can create org-specific roles.

OAuth Connections

Beyond login, Stack Auth can also manage access tokens for third-party APIs, such as Outlook and Google Calendar. It handles refreshing tokens and controlling scope, making access tokens accessible via a single function call.

Passkeys

Support for passwordless authentication using passkeys, allowing users to sign in securely with biometrics or security keys across all their devices.

Impersonation

Impersonate users for debugging and support, logging into their account as if you were them.

Webhooks

Get notified when users use your product, built on Svix.

Automatic emails

Send customizable emails on triggers such as sign-up, password reset, and email verification, editable with a WYSIWYG editor.

User session & JWT handling

Stack Auth manages refresh and access tokens, JWTs, and cookies, resulting in the best performance at no implementation cost.

M2M authentication

Use short-lived access tokens to authenticate your machines to other machines.

📦 Installation & Setup

To install Stack Auth in your Next.js project (for React, JavaScript, or other frameworks, see our complete documentation):

  1. Run Stack Auth's installation wizard with the following command:

    npx @stackframe/stack-cli@latest init
    
  2. Then, create an account on the Stack Auth dashboard, create a new project with an API key, and copy its environment variables into the .env.local file of your Next.js project:

    NEXT_PUBLIC_STACK_PROJECT_ID=<your-project-id>
    NEXT_PUBLIC_STACK_PUBLISHABLE_CLIENT_KEY=<your-publishable-client-key>
    STACK_SECRET_SERVER_KEY=<your-secret-server-key>
    
  3. That's it! You can run your app with npm run dev and go to http://localhost:3000/handler/signup to see the sign-up page. You can also check out the account settings page at http://localhost:3000/handler/account-settings.

Check out the documentation for a more detailed guide.
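
If the sign-up page fails to load after step 2, a quick check that all three variables are present can help. The helper below is a hypothetical sketch and not part of Stack Auth; only the variable names come from the steps above.

```typescript
// Hypothetical helper; not part of the Stack Auth SDK.
const REQUIRED_STACK_ENV = [
  "NEXT_PUBLIC_STACK_PROJECT_ID",
  "NEXT_PUBLIC_STACK_PUBLISHABLE_CLIENT_KEY",
  "STACK_SECRET_SERVER_KEY",
] as const;

// Returns the names of required variables that are unset or blank.
function missingStackEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_STACK_ENV.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node script this could be called as `missingStackEnv(process.env)`; an empty array means all three values are set.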

🌱 Some community projects built with Stack Auth

Have your own? Happy to feature it if you create a PR or message us on Discord.

Templates

Examples

🏗 Development & Contribution

This is for you if you want to contribute to the Stack Auth project or run the Stack Auth dashboard locally.

Important: Please read the contribution guidelines carefully and join our Discord if you'd like to help.

Requirements

  • Node v20
  • pnpm v9
  • Docker

Setup

Note: 24GB+ of RAM is recommended for a smooth development experience.

In a new terminal:

pnpm install

# Build the packages and generate code. We only need to do this once, as `pnpm dev` will do this from now on
pnpm build:packages
pnpm codegen

# Start the dependencies (DB, Inbucket, etc.) as Docker containers, seeding the DB with the Prisma schema
# Make sure you have Docker (or OrbStack) installed and running
pnpm restart-deps

# Start the dev server
pnpm dev

# In a different terminal, run tests in watch mode
pnpm test # useful: --no-watch (disables watch mode) and --bail 1 (stops after the first failure) 

You can now open the dev launchpad at http://localhost:8100. From there, you can navigate to the dashboard at http://localhost:8101, API on port 8102, demo on port 8103, docs on port 8104, Inbucket (e-mails) on port 8105, and Prisma Studio on port 8106. See the dev launchpad for a list of all running services.

Your IDE may show an error on all @stackframe/XYZ imports. To fix this, simply restart the TypeScript language server; for example, in VSCode you can open the command palette (Ctrl+Shift+P) and run Developer: Reload Window or TypeScript: Restart TS server.

Pre-populated .env files for the setup below are provided as .env.development in each of the packages and are used by default. However, if you're creating a production build (e.g. with pnpm run build), you must supply the environment variables manually (see below).

Useful commands

# NOTE:
# Please see the dev launchpad (default: http://localhost:8100) for a list of all running services.

# Installation commands
pnpm install: Installs dependencies

# Types & linting commands
pnpm typecheck: Runs the TypeScript type checker. May require a build or dev server to run first.
pnpm lint: Runs the ESLint linter. Optionally, pass `--fix` to fix some of the linting errors. May require a build or dev server to run first.

# Build commands
pnpm build: Builds all projects, including apps, packages, examples, and docs. Also runs code-generation tasks. Before you can run this, you will have to copy all `.env.development` files in the folders to `.env.production.local` or set the environment variables manually.
pnpm build:packages: Builds all the npm packages.
pnpm codegen: Runs all the code-generation tasks, e.g. Prisma client and OpenAPI docs generation.

# Development commands
pnpm dev: Runs the development servers of the main projects, excluding most examples. On the first run, requires the packages to be built and codegen to be run. After that, it will watch for file changes (including those in code-generation files). If you have to restart the development server for anything, that is a bug that you can report.
pnpm dev:full: Runs the development servers for all projects, including examples.
pnpm dev:basic: Runs the development servers only for the necessary services (backend and dashboard). Not recommended for most users; upgrade your machine instead.

# Environment commands
pnpm start-deps: Starts the Docker dependencies (DB, Inbucket, etc.) as Docker containers, and initializes them with the seed script & migrations. Note: The started dependencies will be visible on the dev launchpad (port 8100 by default).
pnpm stop-deps: Stops the Docker dependencies (DB, Inbucket, etc.) and deletes the data on them.
pnpm restart-deps: Stops and starts the dependencies.

# Database commands
pnpm db:migration-gen: Currently not used. Please generate Prisma migrations manually (or with AI).
pnpm db:reset: Resets the database to the initial state. Run automatically by `pnpm start-deps`.
pnpm db:init: Initializes the database with the seed script & migrations. Run automatically by `pnpm db:reset`.
pnpm db:seed: Re-seeds the database with the seed script. Run automatically by `pnpm db:init`.
pnpm db:migrate: Runs the migrations. Run automatically by `pnpm db:init`.

# Testing commands
pnpm test <file-filters>: Runs the tests. Pass `--bail 1` to make the test only run until the first failure. Pass `--no-watch` to run the tests once instead of in watch mode.

# Various commands
pnpm explain-query: Paste a SQL query to get an explanation of the query plan, helping you debug performance issues.
pnpm verify-data-integrity: Verify the integrity of the data in the database by running a bunch of integrity checks. This should never fail at any point in time (unless you messed with the DB manually).

Note: When working with AI, you should keep a terminal tab with the dev server open so the AI can run queries against it.

❤ Contributors