Mirror of https://github.com/stack-auth/stack.git, synced 2026-06-04 21:04:37 +08:00
## Summary
`stack emulator start` now resumes a fully-warm VM snapshot instead of
cold-booting, bringing startup from 30–120s down to ~5–8s with
per-install secret rotation, or ~2.5s with rotation opt-out. The
snapshot is captured **locally on first `stack emulator pull`**, not
shipped from CI — QEMU migration state isn't portable across
accelerators (KVM/HVF/TCG) or `-cpu max` feature sets, so a CI-captured
snapshot couldn't resume reliably on arbitrary user hardware.
Also bundles CLI quality-of-life fixes: download progress bars, pulls of
PR-build and workflow-run artifacts, a native TypeScript ISO writer that
removes the host dependency on `hdiutil`/`mkisofs`/`genisoimage`, and unit
tests.

| Scenario | Before | After |
|---|---|---|
| Cold boot (no snapshot) | 30–120s | Same; still used as the fallback |
| `stack emulator pull` (one-time, includes local snapshot capture) | ~30s download | ~30s download + ~1–3 min cold-boot capture |
| Snapshot resume, normal start | n/a | **~5–8s** |
| Snapshot resume, `EMULATOR_NO_ROTATION=1` | n/a | **~2.5s** |

Backend (`/health?db=1`) and dashboard (`/handler/sign-in`) return 200
on all paths. Two successive snapshot resumes on the same install produce
different rotated values for the publishable client key (PCK), secret
server key (SSK), super-secret admin key (SAK), and CRON_SECRET.
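Both readiness checks reduce to polling an HTTP endpoint until it answers 200 or a deadline passes. The sketch below is a minimal illustration, assuming `curl` on the host and the default port prefix `81` from `docker/server/entrypoint.sh` (backend on `8102`, dashboard on `8101`); whether those ports are forwarded unchanged to the host is an assumption, and `retry_until`, `http_ok` and `wait_for_stack` are hypothetical names, not functions from `run-emulator.sh`.

```bash
# Hypothetical helpers, not taken from run-emulator.sh.

# retry_until TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD every INTERVAL_S seconds; returns 0 as soon as it succeeds,
# or 1 once TIMEOUT_S seconds have passed without success.
retry_until() {
  local timeout_s=$1 interval_s=$2 deadline
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# http_ok URL: succeeds only when URL answers with HTTP 200.
# curl prints "000" and exits non-zero when the connection fails.
http_ok() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1") || return 1
  [ "$code" = "200" ]
}

# wait_for_stack [PORT_PREFIX]: backend first, then dashboard, up to 120 s each.
# Assumes the default prefix 81 (backend 8102, dashboard 8101) is reachable.
wait_for_stack() {
  local prefix=${1:-81}
  retry_until 120 1 http_ok "http://localhost:${prefix}02/health?db=1" &&
    retry_until 120 1 http_ok "http://localhost:${prefix}01/handler/sign-in"
}
```

A start script could then run `wait_for_stack || fall_back_to_cold_boot`, where `fall_back_to_cold_boot` is a hypothetical stand-in for the cleanup-and-reboot path. Using `date +%s` rather than Bash's `SECONDS` keeps the helper usable under plain `sh`.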
## How it works
**Build (CI)** — `docker/local-emulator/qemu/build-image.sh`:
1. Cloud-init provisioning runs to completion (migrations, seed,
slim-image) producing `stack-emulator-<arch>.qcow2`.
2. Image is built with a topology compatible with later snapshot capture
(pinned SMP=4, phantom seed/bundle ISOs, STACKCFG runtime ISO mounted at
build time, qemu-guest-agent running, placeholder hex secrets baked in
under `STACK_EMULATOR_BUILD_SNAPSHOT=1`).
3. CI publishes **only the qcow2** — no `.savevm.zst` ships.
**Pull (user's machine)** —
`packages/stack-cli/src/commands/emulator.ts` + `run-emulator.sh
capture`:
1. `stack emulator pull` downloads the qcow2 with a progress bar (or
from a PR / workflow run via `--pr` / `--run`).
2. CLI invokes `run-emulator.sh capture`: cold-boots the qcow2 with a
matching device layout (phantom ISOs, fsdev, pcie-root-port; the virtfs
device is detached because it blocks migration), waits for backend and
dashboard health, then drives QMP: `stop` → set `mapped-ram` + `multifd`
caps → `migrate file:state.raw` → poll `query-migrate` → `quit`. The raw
mapped-ram file is zstd-compressed to `stack-emulator-<arch>.savevm.zst`
in the images dir.
3. `--skip-snapshot` opts out (first `start` will then cold-boot).
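QMP commands are JSON objects written to the monitor socket, so the capture in step 2 boils down to a short, fixed sequence. A minimal sketch that prints that sequence and parses the `status` field of a reply, assuming a QEMU new enough to offer the `mapped-ram` capability and `file:` migration URIs (CI builds 10.2.2); `qmp_capture_script` and `qmp_status` are hypothetical helpers, not the `capture_vm_state` function from `common.sh`, and a real driver must read each reply instead of writing commands blindly.

```bash
# Hypothetical helpers, not the capture_vm_state function from common.sh.

# qmp_capture_script STATE_FILE: print the QMP commands, one JSON object per
# line, that pause the guest and start writing its state to STATE_FILE.
# Assumes a path without quotes or backslashes (no JSON escaping is done).
qmp_capture_script() {
  printf '%s\n' \
    '{"execute":"qmp_capabilities"}' \
    '{"execute":"stop"}' \
    '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"mapped-ram","state":true},{"capability":"multifd","state":true}]}}' \
    "{\"execute\":\"migrate\",\"arguments\":{\"uri\":\"file:$1\"}}"
}

# qmp_status REPLY: print the first "status" value found in a QMP reply,
# e.g. a query-migrate result. A driver loops until this prints "completed"
# (or aborts on "failed") before sending {"execute":"quit"}.
qmp_status() {
  printf '%s' "$1" | grep -o '"status": *"[a-z-]*"' | head -n 1 |
    sed 's/.*"\([a-z-]*\)"$/\1/'
}
```

Resume mirrors this: the same `migrate-set-capabilities` call on a QEMU started with `-incoming defer`, then `{"execute":"migrate-incoming","arguments":{"uri":"file:<path>"}}`, and `cont` once the guest reports `paused`.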
**Runtime** — `run-emulator.sh start`:
1. Launch QEMU with `-incoming defer` when a `.savevm.zst` is present;
decompress on first use, keep the `.raw` cached for subsequent starts.
2. QMP: same `mapped-ram` + `multifd` caps → `migrate-incoming
file:<.raw>` → poll for `paused` → `cont`.
3. Generate fresh per-install secrets on the host; pipe them
base64-encoded through QGA `guest-exec input-data` →
`trigger-fast-rotate` in the guest → `docker exec -e … rotate-secrets`.
4. `rotate-secrets` in the container: validate keys (hex-only), targeted
`sed` on the placeholder PCK across built JS, `UPDATE ApiKeySet`,
`supervisorctl restart stack-app cron-jobs` (with
`stopasgroup`/`killasgroup` so the Node children actually die and
release their ports).
5. Poll backend+dashboard health; if anything fails, clean up and fall
back to cold boot transparently.
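Step 3 relies on the QEMU guest agent's `guest-exec` command, whose `input-data` argument is base64 and becomes the child process's stdin. A minimal sketch of building that request on the host, assuming the secrets travel as `KEY=value` lines; the variable names, the guest path `/usr/local/bin/trigger-fast-rotate` and the helper names are hypothetical, and `od` stands in for `openssl rand -hex 32` (same 64-hex-character shape) to avoid extra dependencies.

```bash
# Hypothetical sketch: build a QGA guest-exec request carrying fresh secrets.

# 32 random bytes as 64 lowercase hex characters.
rand_hex32() {
  od -An -tx1 -N32 /dev/urandom | tr -d ' \n'
}

# Print one KEY=value line per secret. The names are assumptions.
make_rotation_env() {
  local name
  for name in STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY \
              STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY \
              STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY \
              CRON_SECRET; do
    printf '%s=%s\n' "$name" "$(rand_hex32)"
  done
}

# qga_exec_request ENV_TEXT: wrap ENV_TEXT as base64 stdin for the guest
# helper. tr strips the line wrapping that base64 adds to long output.
qga_exec_request() {
  local payload
  payload=$(printf '%s' "$1" | base64 | tr -d '\n')
  printf '{"execute":"guest-exec","arguments":{"path":"/usr/local/bin/trigger-fast-rotate","input-data":"%s","capture-output":true}}\n' "$payload"
}
```

`guest-exec` replies with a PID; the caller then polls `guest-exec-status` with that PID until `exited` is true and checks `exitcode` before moving on to the health checks.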
**Security model**: placeholder hex values are baked into the snapshot
(`00…ff` PCK, `00…ee` SSK, `00…dd` SAK, `00…cc` CRON_SECRET). They are
non-secret by construction. Real per-install secrets are generated on the
host at each `emulator start`, passed only into the local guest VM, and
never leave the user's machine.
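Because both the placeholders and the fresh values are hex-only, the in-container swap needs no `sed` escaping at all, and the same check rules out SQL injection in the `UPDATE`. A minimal sketch of the validate-then-replace part, assuming GNU `sed -i` (as in the Linux container image); `is_hex` and `rotate_placeholder` are hypothetical names, and the `UPDATE ApiKeySet` and `supervisorctl restart` steps of the real `rotate-secrets.sh` are omitted.

```bash
# Hypothetical sketch of a hex-validated placeholder swap.

is_hex() {
  printf '%s' "$1" | grep -Eq '^[0-9a-fA-F]+$'
}

# rotate_placeholder DIR OLD NEW: replace OLD with NEW in every file under
# DIR that contains OLD. Non-hex input is refused, so neither value can
# carry sed or SQL metacharacters.
rotate_placeholder() {
  local dir=$1 old=$2 new=$3
  if ! is_hex "$old" || ! is_hex "$new"; then
    echo "refusing: placeholder and replacement must be hex-only" >&2
    return 1
  fi
  # grep -rlF lists only files that contain OLD, so other files are untouched.
  grep -rlF -- "$old" "$dir" | while IFS= read -r f; do
    sed -i "s/${old}/${new}/g" "$f"
  done
}
```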
## CLI changes (`packages/stack-cli`)
- **`src/lib/iso.ts`** (new): native TypeScript ISO 9660 + Joliet
writer, replacing the host-side `hdiutil`/`mkisofs`/`genisoimage`
dependency for generating the STACKCFG runtime config disk. Unit tests
in `src/lib/iso.test.ts`.
- **`src/commands/emulator.ts`**:
  - `pull`: streamed downloads with progress bar + ETA; `--pr <number>`
    and `--run <id>` to pull from a PR build's or a specific workflow
    run's CI artifacts (uses `extract-zip` for the nested zip);
    `--skip-snapshot` to opt out of the one-time local capture.
  - `start` (existing, extended): auto-pulls AND auto-captures when no
    image exists, so the first-ever `start` is self-bootstrapping; emits
    `STACK_EMULATOR_CLI_WROTE_ISO=1` so the shell helper skips its own ISO
    regen (avoids the genisoimage host dep).
  - `capture` (new, invoked by `pull` and the auto-pull path of `start`):
    drives the local snapshot capture via `run-emulator.sh`.
  - `status`, `stop`, `reset`, `list-releases`: preflight +
    path-resolution tightening (`STACK_EMULATOR_HOME` → images/run dirs).
  - Unit tests in `src/commands/emulator.test.ts`.
- **`EMULATOR_NO_ROTATION=1`** env var skips the post-resume rotation
(intended for tests/CI where the placeholder secrets are fine — comes
with a loud warning).
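Both environment switches come down to simple guards in the shell helper. A minimal sketch of how they might be honored, assuming `1` is the only value that enables either switch; `maybe_regen_iso`, `maybe_rotate` and the two stand-in functions are hypothetical, not code from `run-emulator.sh`.

```bash
# Hypothetical sketch; the real logic lives in run-emulator.sh.
regen_runtime_iso() { echo "regenerating STACKCFG ISO"; }   # stand-in
rotate_after_resume() { echo "rotating secrets"; }          # stand-in

# Before boot: the CLI writes the ISO itself and sets the flag to 1.
maybe_regen_iso() {
  if [ "${STACK_EMULATOR_CLI_WROTE_ISO:-}" = "1" ]; then
    echo "runtime ISO written by CLI; skipping regen"
  else
    regen_runtime_iso
  fi
}

# After resume: opting out keeps the well-known placeholder secrets live,
# so warn loudly on stderr.
maybe_rotate() {
  if [ "${EMULATOR_NO_ROTATION:-}" = "1" ]; then
    echo "WARNING: EMULATOR_NO_ROTATION=1, placeholder secrets remain active" >&2
  else
    rotate_after_resume
  fi
}
```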
## CI (`.github/workflows/qemu-emulator-build.yaml`)
- Builds **QEMU 10.2.2 from source** (cached), because
`mapped-ram`/`multifd` migration capabilities aren't available in the
distro's QEMU. Enables KVM on ubicloud runners so amd64 boots at
hardware speed.
- amd64 + arm64 both build on the same amd64 matrix
(`ubicloud-standard-8`); arm64 runs under cross-arch TCG (provisioning
only — boot/verify smoke test is amd64-only).
- Verification now runs through the CLI: `emulator start` → `emulator
status` → `emulator stop` against the freshly-built qcow2 (via
`STACK_EMULATOR_HOME` pointing at the workspace, so the CLI doesn't
silently auto-pull a prior release).
- Packages **only** the qcow2. No `.savevm.zst` upload / publish.
- Release notes updated.
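Because the snapshot path depends on migration capabilities that the distro QEMU lacks, a helper script could refuse to capture on an older binary. A minimal sketch using `sort -V`; treating the CI-built 10.2.2 as the floor is a conservative assumption rather than the true minimum, and `version_at_least`/`qemu_version` are hypothetical helpers.

```bash
# version_at_least HAVE NEED: true if HAVE >= NEED in version order.
# sort -V puts the smaller version first; equal versions also pass.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# qemu_version BINARY: extract "X.Y.Z" from `BINARY --version`, whose first
# line looks like "QEMU emulator version 10.2.2 (...)".
qemu_version() {
  "$1" --version | sed -n '1s/^QEMU emulator version \([0-9][0-9.]*\).*/\1/p'
}
```

For example: `version_at_least "$(qemu_version qemu-system-x86_64)" 10.2.2 || echo "QEMU too old for snapshot capture" >&2`.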
## Key files
**Shell / guest:**
- `docker/local-emulator/qemu/build-image.sh` — snapshot-compatible
device topology + STACKCFG runtime ISO at build time
- `docker/local-emulator/qemu/run-emulator.sh` — `start`, `capture`,
`stop`, `reset`, `status`; `-incoming defer`, `.raw` cache, QGA-driven
rotation, cold-boot fallback
- `docker/local-emulator/qemu/common.sh` (new) — shared `qmp_session` +
`capture_vm_state` (factored out so build-image.sh and run-emulator.sh
share the capture path)
- `docker/local-emulator/qemu/cloud-init/emulator/user-data` —
placeholder secrets in snapshot mode, `wait-for-stack-ready`,
`trigger-fast-rotate`, qemu-guest-agent enabled
- `docker/local-emulator/rotate-secrets.sh` (new) — in-container
rotation (sed + UPDATE + supervisorctl)
- `docker/local-emulator/supervisord.conf` — `stopasgroup`/`killasgroup`
on `stack-app` and `cron-jobs`
- `docker/local-emulator/entrypoint.sh` — only mint CRON_SECRET if unset
(placeholder supplied in snapshot mode via --env-file)
- `docker/local-emulator/Dockerfile` — ships `rotate-secrets` to
`/usr/local/bin`
- `docker/server/entrypoint.sh` — source
`/run/stack-auth/rotated-secrets.env`; skip full-tree sentinel scan on
warm restarts via marker
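The `entrypoint.sh` change for `CRON_SECRET` follows the usual mint-only-if-unset pattern, so a placeholder supplied through `--env-file` in snapshot mode takes precedence over a freshly generated value. A minimal sketch, with `ensure_cron_secret` a hypothetical name and `od` standing in for whatever generator the real `docker/local-emulator/entrypoint.sh` uses.

```bash
# Hypothetical sketch: mint CRON_SECRET only when nothing supplied one.
ensure_cron_secret() {
  if [ -z "${CRON_SECRET:-}" ]; then
    # 32 random bytes as 64 hex characters.
    CRON_SECRET=$(od -An -tx1 -N32 /dev/urandom | tr -d ' \n')
  fi
  export CRON_SECRET
}
```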
**CLI:**
- `packages/stack-cli/src/lib/iso.ts` (new) + `iso.test.ts` (new)
- `packages/stack-cli/src/commands/emulator.ts` + `emulator.test.ts`
(new)
- `packages/stack-cli/vitest.config.ts` (new)
**CI:**
- `.github/workflows/qemu-emulator-build.yaml`
## Test plan
- [x] `docker/local-emulator/qemu/build-image.sh {amd64,arm64}` produces
`stack-emulator-<arch>.qcow2` with snapshot-compatible topology
- [x] `stack emulator pull` downloads qcow2 with progress, then captures
locally (~1–3 min) and writes `stack-emulator-<arch>.savevm.zst` in the
images dir
- [x] `stack emulator pull --skip-snapshot` stops after download
- [x] `stack emulator pull --pr <n>` / `--run <id>` pull from PR /
workflow run artifacts
- [x] `stack emulator start` on a fresh dir auto-pulls **and**
auto-captures, then starts; subsequent starts fast-resume in ~5–8s;
backend + dashboard return 200
- [x] `EMULATOR_NO_ROTATION=1 stack emulator start` completes in ~2.5s;
backend + dashboard return 200 with warning printed
- [x] Two consecutive `emulator start` invocations produce different PCK
values in the internal `ApiKeySet` row
- [x] `stack emulator status` / `stop` / `reset` resolve paths from
`STACK_EMULATOR_HOME`
- [x] Verified end-to-end on arm64 macOS under HVF (capture ~50s,
fast-resume ~6.5s)
- [x] `pnpm lint` and `pnpm typecheck` pass; stack-cli unit tests (iso +
emulator) pass
- [ ] CI green on this PR (qemu-emulator-build matrix, smoke test)
- [ ] `gh release download emulator-<branch>-latest` contains only
`stack-emulator-<arch>.qcow2` once this PR merges and publish runs
## Summary by CodeRabbit
* **New Features**
  * Snapshot fast-start/resume with optional warm-snapshot assets, runtime
    ISO generation, and a cached QEMU build to speed emulator setup.
  * CLI: streamed artifact downloads with progress, improved release/asset
    handling, stronger preflight checks, and start/status/stop emulator
    commands.
  * Automated secret rotation and ability to apply rotated secrets at
    container startup; supervisor control socket enabled.
* **Bug Fixes**
  * More robust start/stop/resume flows with automatic fallback to cold
    boot and improved process-group shutdown behavior.
* **Tests**
  * New tests for CLI utilities and ISO image generation.
`docker/server/entrypoint.sh` (201 lines, 8.8 KiB, Bash):
```bash
#!/bin/bash

set -e

# ============= ROTATED SECRETS OVERLAY =============
# On emulator snapshot resume, the host injects freshly-generated secrets into
# /run/stack-auth/rotated-secrets.env before supervisorctl restarts us. Sourcing
# here lets a fast-restart pick up new values without a full container restart.
if [ -f /run/stack-auth/rotated-secrets.env ]; then
  set -a
  # shellcheck disable=SC1091
  source /run/stack-auth/rotated-secrets.env
  set +a
fi

# ============= FORWARD MOCK OAUTH SERVER =============

# Start socat to forward port 32202 for mock-oauth-server if enabled
if [ "$STACK_FORWARD_MOCK_OAUTH_SERVER" = "true" ]; then
  socat TCP-LISTEN:32202,fork,reuseaddr TCP:host.docker.internal:32202 &
fi

# ============= ENV VARS =============

if [ "$NEXT_PUBLIC_STACK_IS_LOCAL_EMULATOR" = "true" ]; then
  for v in STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY; do
    if [ -z "${!v:-}" ]; then
      echo "$v must be set in local-emulator mode (injected by the QEMU VM)." >&2
      exit 1
    fi
  done
  export STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY
else
  export STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY=${STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY:-$(openssl rand -base64 32)}
  export STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY=${STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY:-$(openssl rand -base64 32)}
  export STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY=${STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY:-$(openssl rand -base64 32)}
fi

export NEXT_PUBLIC_STACK_PROJECT_ID=internal
export NEXT_PUBLIC_STACK_PUBLISHABLE_CLIENT_KEY=${STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY}
if [ -n "${STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY:-}" ]; then
  export STACK_SECRET_SERVER_KEY=${STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY}
fi
if [ -n "${STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY:-}" ]; then
  export STACK_SUPER_SECRET_ADMIN_KEY=${STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY}
fi

export NEXT_PUBLIC_BROWSER_STACK_DASHBOARD_URL=${NEXT_PUBLIC_STACK_DASHBOARD_URL}
export NEXT_PUBLIC_STACK_PORT_PREFIX=${NEXT_PUBLIC_STACK_PORT_PREFIX:-81}
PORT_PREFIX=${NEXT_PUBLIC_STACK_PORT_PREFIX}
export NEXT_PUBLIC_SERVER_STACK_DASHBOARD_URL="http://localhost:${PORT_PREFIX}01"
export NEXT_PUBLIC_BROWSER_STACK_API_URL=${NEXT_PUBLIC_STACK_API_URL}
export NEXT_PUBLIC_SERVER_STACK_API_URL="http://localhost:${PORT_PREFIX}02"
export BACKEND_PORT=${BACKEND_PORT:-${PORT_PREFIX}02}
export DASHBOARD_PORT=${DASHBOARD_PORT:-${PORT_PREFIX}01}

export USE_INLINE_ENV_VARS=true

if [ -z "${NEXT_PUBLIC_STACK_SVIX_SERVER_URL}" ]; then
  export NEXT_PUBLIC_STACK_SVIX_SERVER_URL=${STACK_SVIX_SERVER_URL}
fi

# ============= MIGRATIONS =============

should_run_migrations=true
if [ "$STACK_SKIP_MIGRATIONS" = "true" ] || [ "$STACK_RUN_MIGRATIONS" = "false" ]; then
  should_run_migrations=false
fi

if [ "$should_run_migrations" = "false" ]; then
  echo "Skipping migrations."
else
  echo "Running migrations..."
  cd apps/backend
  node dist/db-migrations.mjs migrate
  cd ../..
fi

should_run_seed_script=true
if [ "$STACK_SKIP_SEED_SCRIPT" = "true" ] || [ "$STACK_RUN_SEED_SCRIPT" = "false" ]; then
  should_run_seed_script=false
fi

if [ "$should_run_seed_script" = "false" ]; then
  echo "Skipping seed script."
else
  echo "Running seed script..."
  cd apps/backend
  node dist/db-migrations.mjs seed
  cd ../..
fi

# ============= LOCAL EMULATOR: BOOTSTRAP INTERNAL API KEY SET =============
# The build-time seed ran without any keys (the VM generates random ones on
# first boot). The slim image strips apps/backend/dist so we can't re-run the
# full seed here. Instead, targeted-upsert the internal api key set with the
# VM-supplied keys:
#   - pck: used by stack-cli to auth against /api/v1/internal/local-emulator/project
#   - ssk/sak: required by the emulator's own dashboard (StackServerApp ctor
#     throws without ssk). User-app flows don't use these; per-project
#     credentials come from the /local-emulator/project route.
if [ "$NEXT_PUBLIC_STACK_IS_LOCAL_EMULATOR" = "true" ] && [ -n "${STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY:-}" ] && [ -n "${STACK_DATABASE_CONNECTION_STRING:-}" ]; then
  # Validate the keys are hex-only to defuse any SQL-injection risk (the VM
  # generates them via `openssl rand -hex 32`, so this is an assert, not a filter).
  for varname in STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY; do
    val="${!varname:-}"
    if [ -z "$val" ]; then
      echo "ERROR: $varname is not set; refusing to bootstrap internal api key set." >&2
      exit 1
    fi
    if ! printf '%s' "$val" | grep -Eq '^[0-9a-fA-F]+$'; then
      echo "ERROR: $varname is not hex-only; refusing to bootstrap internal api key set." >&2
      exit 1
    fi
  done
  echo "Bootstrapping internal API key set (emulator runtime)..."
  psql "$STACK_DATABASE_CONNECTION_STRING" -v ON_ERROR_STOP=1 <<SQL
INSERT INTO "ApiKeySet" ("projectId", id, description, "expiresAt", "createdAt", "updatedAt", "publishableClientKey", "secretServerKey", "superSecretAdminKey")
VALUES ('internal', '3142e763-b230-44b5-8636-aa62f7489c26', 'Internal API key set', '2099-12-31T23:59:59Z', NOW(), NOW(),
  '${STACK_SEED_INTERNAL_PROJECT_PUBLISHABLE_CLIENT_KEY}',
  '${STACK_SEED_INTERNAL_PROJECT_SECRET_SERVER_KEY}',
  '${STACK_SEED_INTERNAL_PROJECT_SUPER_SECRET_ADMIN_KEY}')
ON CONFLICT ("projectId", id) DO UPDATE SET
  "publishableClientKey" = EXCLUDED."publishableClientKey",
  "secretServerKey" = EXCLUDED."secretServerKey",
  "superSecretAdminKey" = EXCLUDED."superSecretAdminKey",
  "updatedAt" = NOW();
SQL
fi

# ============= WORKING DIRECTORY =============

# Create a working directory for our processed files.
# Keep this off /tmp so local-emulator config sharing can bind-mount /tmp
# without pushing the whole runtime copy step onto the host filesystem.
WORK_DIR="${STACK_RUNTIME_WORK_DIR:-/var/tmp/stack-runtime}"
mkdir -p "$WORK_DIR"

# Marker recording that the sentinel pass below has completed. It lives in
# WORK_DIR because the docker/server image runs as the unprivileged `node`
# user and cannot write to /var/run.
SENTINEL_MARKER="$WORK_DIR/.stack-sentinels-replaced"

# Copy the pristine app tree only on the first start. On a fast-restart the
# marker is present, so re-copying would overwrite the already-substituted
# files (including values patched by rotate-secrets) with fresh sentinels
# that the skipped scan below would never replace.
if [ "$WORK_DIR" != "/app" ] && [ ! -f "$SENTINEL_MARKER" ]; then
  echo "Copying files to working directory..."
  cp -r /app/. "$WORK_DIR"/.
fi

# The full-tree sentinel scan is expensive (several seconds over the whole
# built app tree). On a fast-restart, triggered by the emulator snapshot
# rotation path, the placeholders have already been sed-replaced by
# rotate-secrets and no new sentinels need substitution, so skip the scan.
if [ -f "$SENTINEL_MARKER" ]; then
  echo "Sentinels already replaced on a previous start; skipping scan."
else
  # Find all files in the apps directory that contain a STACK_ENV_VAR_SENTINEL
  # and extract the unique sentinel strings. `|| true` keeps `set -e` from
  # aborting the script when no unhandled sentinels remain (grep exits 1).
  echo "Finding unhandled sentinels..."
  unhandled_sentinels=$(find "$WORK_DIR/apps" -type f -exec grep -l "STACK_ENV_VAR_SENTINEL" {} + | \
    xargs grep -h "STACK_ENV_VAR_SENTINEL" | \
    grep -o "STACK_ENV_VAR_SENTINEL[A-Z_]*" | \
    sort -u | grep -v "^STACK_ENV_VAR_SENTINEL$") || true

  # Use an uncommon delimiter: the ASCII Unit Separator (0x1F).
  delimiter=$(printf '\037')

  echo "Replacing sentinels..."
  for sentinel in $unhandled_sentinels; do
    # The sentinel is like "STACK_ENV_VAR_SENTINEL_MY_VAR", so extract the env var name.
    env_var=${sentinel#STACK_ENV_VAR_SENTINEL_}

    # Get the corresponding environment variable value.
    value="${!env_var}"

    # If the env var is not set, skip replacement.
    if [ -z "$value" ]; then
      continue
    fi

    # Although the sentinel only contains [A-Z_] we still escape it for any regex meta-characters.
    escaped_sentinel=$(printf '%s\n' "$sentinel" | sed -e 's/\\/\\\\/g' -e 's/[][\/.^$*]/\\&/g')

    # For the replacement value, first escape backslashes, then escape any occurrence of
    # the chosen delimiter and the '&' (which has special meaning in sed replacements).
    escaped_value=$(printf '%s\n' "$value" | sed -e 's/\\/\\\\/g' -e "s/[${delimiter}&]/\\\\&/g")

    # Now replace the sentinel with the (properly escaped) value in all files in the working directory.
    find "$WORK_DIR/apps" -type f -exec sed -i "s${delimiter}${escaped_sentinel}${delimiter}${escaped_value}${delimiter}g" {} +
  done
  touch "$SENTINEL_MARKER"
fi

# ============= START BACKEND AND DASHBOARD =============

echo "Starting backend on port $BACKEND_PORT..."
cd "$WORK_DIR"
PORT=$BACKEND_PORT HOSTNAME=0.0.0.0 node apps/backend/server.js &

echo "Starting dashboard on port $DASHBOARD_PORT..."
PORT=$DASHBOARD_PORT HOSTNAME=0.0.0.0 node apps/dashboard/server.js &

# Return as soon as either server exits; the script then ends so the
# supervisor (or container runtime) can notice and restart it.
wait -n
```
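The sentinel pass in this entrypoint is easiest to follow on a throwaway directory. A minimal sketch that wraps the same escaping rules (0x1F as the `sed` delimiter; backslashes, the delimiter and `&` escaped in the value) in a hypothetical `replace_sentinel` helper, assuming GNU `sed -i` as in the container image.

```bash
# replace_sentinel DIR SENTINEL VALUE: hypothetical wrapper around the
# entrypoint's escaping and find/sed substitution.
replace_sentinel() {
  local dir=$1 sentinel=$2 value=$3
  local delimiter escaped_sentinel escaped_value
  # ASCII Unit Separator, unlikely to appear in any value.
  delimiter=$(printf '\037')
  # Escape regex metacharacters in the pattern.
  escaped_sentinel=$(printf '%s\n' "$sentinel" | sed -e 's/\\/\\\\/g' -e 's/[][\/.^$*]/\\&/g')
  # Escape backslashes first, then the delimiter and '&' in the replacement.
  escaped_value=$(printf '%s\n' "$value" | sed -e 's/\\/\\\\/g' -e "s/[${delimiter}&]/\\\\&/g")
  find "$dir" -type f -exec sed -i "s${delimiter}${escaped_sentinel}${delimiter}${escaped_value}${delimiter}g" {} +
}
```

With a control character as the delimiter, values such as URLs that contain `/` need no extra handling; only `\`, `&` and the delimiter itself are special in a single-line `sed` replacement.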