stack/docker
BilalG1 37ee5ec320
Fast-start local emulator via RAM snapshot + live secret rotation (#1340)
## Summary

`stack emulator start` now resumes a fully-warm VM snapshot instead of
cold-booting, bringing startup from 30–120s down to ~5–8s with
per-install secret rotation, or ~2.5s with rotation opt-out. The
snapshot is captured **locally on first `stack emulator pull`**, not
shipped from CI — QEMU migration state isn't portable across
accelerators (KVM/HVF/TCG) or `-cpu max` feature sets, so a CI-captured
snapshot couldn't resume reliably on arbitrary user hardware.

Also bundles a pile of CLI QoL fixes (progress bars, PR/run artifact
pulls, PR-build download, native-TS ISO writer replacing
`hdiutil`/`mkisofs`/`genisoimage` host dep, unit tests).

| Scenario | Before | After |
|---|---|---|
| Cold boot (no snapshot) | 30–120s | same, works as fallback |
| `stack emulator pull` (one-time, includes local snapshot capture) |
~30s download | ~30s download + ~1–3 min cold-boot capture |
| Snapshot resume, normal start | — | **~5–8s** |
| Snapshot resume, `EMULATOR_NO_ROTATION=1` | — | **~2.5s** |

Backend (`/health?db=1`) and dashboard (`/handler/sign-in`) return 200
on all paths. Two successive snapshot resumes produce different rotated
PCK/SSK/SAK/CRON_SECRET values per install.

## How it works

**Build (CI)** — `docker/local-emulator/qemu/build-image.sh`:

1. Cloud-init provisioning runs to completion (migrations, seed,
slim-image) producing `stack-emulator-<arch>.qcow2`.
2. Image is built with a topology compatible with later snapshot capture
(pinned SMP=4, phantom seed/bundle ISOs, STACKCFG runtime ISO mounted at
build time, qemu-guest-agent running, placeholder hex secrets baked in
under `STACK_EMULATOR_BUILD_SNAPSHOT=1`).
3. CI publishes **only the qcow2** — no `.savevm.zst` ships.

**Pull (user's machine)** —
`packages/stack-cli/src/commands/emulator.ts` + `run-emulator.sh
capture`:

1. `stack emulator pull` downloads the qcow2 with a progress bar (or
from a PR / workflow run via `--pr` / `--run`).
2. CLI invokes `run-emulator.sh capture`: cold-boots the qcow2 with a
matching device layout (phantom ISOs, fsdev, pcie-root-port, virtfs
detached — migration-incompatible), waits for backend+dashboard health,
then drives QMP: `stop` → set `mapped-ram` + `multifd` caps → `migrate
file:state.raw` → poll `query-migrate` → `quit`. Raw mapped-ram file is
zstd-compressed to `stack-emulator-<arch>.savevm.zst` in the images dir.
3. `--skip-snapshot` opts out (first `start` will then cold-boot).

**Runtime** — `run-emulator.sh start`:

1. Launch QEMU with `-incoming defer` when a `.savevm.zst` is present;
decompress on first use, keep the `.raw` cached for subsequent starts.
2. QMP: same `mapped-ram` + `multifd` caps → `migrate-incoming
file:<.raw>` → poll for `paused` → `cont`.
3. Generate fresh per-install secrets on the host; pipe them
base64-encoded through QGA `guest-exec input-data` →
`trigger-fast-rotate` in the guest → `docker exec -e … rotate-secrets`.
4. `rotate-secrets` in the container: validate keys (hex-only), targeted
`sed` on the placeholder PCK across built JS, `UPDATE ApiKeySet`,
`supervisorctl restart stack-app cron-jobs` (with
`stopasgroup`/`killasgroup` so the Node children actually die and
release their ports).
5. Poll backend+dashboard health; if anything fails, clean up and fall
back to cold boot transparently.

**Security model**: placeholder hex values are baked into the snapshot
(`00…ff` PCK, `00…ee` SSK, `00…dd` SAK, `00…cc` CRON_SECRET). They are
non-secret by construction. Real per-install secrets are generated at
each `emulator start` and never leave the host.

## CLI changes (`packages/stack-cli`)

- **`src/lib/iso.ts`** (new): native TypeScript ISO 9660 + Joliet
writer, replacing the host-side `hdiutil`/`mkisofs`/`genisoimage`
dependency for generating the STACKCFG runtime config disk. Unit tests
in `src/lib/iso.test.ts`.
- **`src/commands/emulator.ts`**:
- `pull`: streamed downloads with progress bar + ETA; `--pr <number>`
and `--run <id>` to pull from a PR build's CI artifacts (uses
`extract-zip` for the nested zip); `--skip-snapshot` to opt out of the
one-time local capture.
- `start` (existing, extended): auto-pulls AND auto-captures when no
image exists, so first-ever `start` is self-bootstrapping; emits
`STACK_EMULATOR_CLI_WROTE_ISO=1` so the shell helper skips its own ISO
regen (avoids the genisoimage host dep).
- `capture` (new, invoked by `pull` and the auto-pull path of `start`):
drives the local snapshot capture via `run-emulator.sh`.
- `status`, `stop`, `reset`, `list-releases`: preflight +
path-resolution tightening (`STACK_EMULATOR_HOME` → images/run dirs).
  - Unit tests in `src/commands/emulator.test.ts`.
- **`EMULATOR_NO_ROTATION=1`** env var skips the post-resume rotation
(intended for tests/CI where the placeholder secrets are fine — comes
with a loud warning).

## CI (`.github/workflows/qemu-emulator-build.yaml`)

- Builds **QEMU 10.2.2 from source** (cached), because
`mapped-ram`/`multifd` migration capabilities aren't available in the
distro's QEMU. Enables KVM on ubicloud runners so amd64 boots at
hardware speed.
- amd64 + arm64 both build on the same amd64 matrix
(`ubicloud-standard-8`); arm64 runs under cross-arch TCG (provisioning
only — boot/verify smoke test is amd64-only).
- Verification now runs through the CLI: `emulator start` → `emulator
status` → `emulator stop` against the freshly-built qcow2 (via
`STACK_EMULATOR_HOME` pointing at the workspace, so the CLI doesn't
silently auto-pull a prior release).
- Packages **only** the qcow2. No `.savevm.zst` upload / publish.
- Release notes updated.

## Key files

**Shell / guest:**
- `docker/local-emulator/qemu/build-image.sh` — snapshot-compatible
device topology + STACKCFG runtime ISO at build time
- `docker/local-emulator/qemu/run-emulator.sh` — `start`, `capture`,
`stop`, `reset`, `status`; `-incoming defer`, `.raw` cache, QGA-driven
rotation, cold-boot fallback
- `docker/local-emulator/qemu/common.sh` (new) — shared `qmp_session` +
`capture_vm_state` (factored out so build-image.sh and run-emulator.sh
share the capture path)
- `docker/local-emulator/qemu/cloud-init/emulator/user-data` —
placeholder secrets in snapshot mode, `wait-for-stack-ready`,
`trigger-fast-rotate`, qemu-guest-agent enabled
- `docker/local-emulator/rotate-secrets.sh` (new) — in-container
rotation (sed + UPDATE + supervisorctl)
- `docker/local-emulator/supervisord.conf` — `stopasgroup`/`killasgroup`
on `stack-app` and `cron-jobs`
- `docker/local-emulator/entrypoint.sh` — only mint CRON_SECRET if unset
(placeholder supplied in snapshot mode via --env-file)
- `docker/local-emulator/Dockerfile` — ships `rotate-secrets` to
`/usr/local/bin`
- `docker/server/entrypoint.sh` — source
`/run/stack-auth/rotated-secrets.env`; skip full-tree sentinel scan on
warm restarts via marker

**CLI:**
- `packages/stack-cli/src/lib/iso.ts` (new) + `iso.test.ts` (new)
- `packages/stack-cli/src/commands/emulator.ts` + `emulator.test.ts`
(new)
- `packages/stack-cli/vitest.config.ts` (new)

**CI:**
- `.github/workflows/qemu-emulator-build.yaml`

## Test plan

- [x] `docker/local-emulator/qemu/build-image.sh {amd64,arm64}` produces
`stack-emulator-<arch>.qcow2` with snapshot-compatible topology
- [x] `stack emulator pull` downloads qcow2 with progress, then captures
locally (~1–3 min) and writes `stack-emulator-<arch>.savevm.zst` in the
images dir
- [x] `stack emulator pull --skip-snapshot` stops after download
- [x] `stack emulator pull --pr <n>` / `--run <id>` pull from PR /
workflow run artifacts
- [x] `stack emulator start` on a fresh dir auto-pulls **and**
auto-captures, then starts; subsequent starts fast-resume in ~5–8s;
backend + dashboard return 200
- [x] `EMULATOR_NO_ROTATION=1 stack emulator start` completes in ~2.5s;
backend + dashboard return 200 with warning printed
- [x] Two consecutive `emulator start` invocations produce different PCK
values in the internal `ApiKeySet` row
- [x] `stack emulator status` / `stop` / `reset` resolve paths from
`STACK_EMULATOR_HOME`
- [x] Verified end-to-end on arm64 macOS under HVF (capture ~50s,
fast-resume ~6.5s)
- [x] `pnpm lint` and `pnpm typecheck` pass; stack-cli unit tests (iso +
emulator) pass
- [ ] CI green on this PR (qemu-emulator-build matrix, smoke test)
- [ ] `gh release download emulator-<branch>-latest` contains only
`stack-emulator-<arch>.qcow2` once this PR merges and publish runs

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Snapshot fast-start/resume with optional warm-snapshot assets, runtime
ISO generation, and a cached QEMU build to speed emulator setup.
* CLI: streamed artifact downloads with progress, improved release/asset
handling, stronger preflight checks, and start/status/stop emulator
commands.
* Automated secret rotation and ability to apply rotated secrets at
container startup; supervisor control socket enabled.

* **Bug Fixes**
* More robust start/stop/resume flows with automatic fallback to cold
boot and improved process-group shutdown behavior.

* **Tests**
  * New tests for CLI utilities and ISO image generation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-04-20 14:24:49 -07:00
..
backend Fix Docker build scripts 2026-04-18 23:50:50 -07:00
dependencies LLM MCP Flow (#1321) 2026-04-15 17:57:08 +00:00
dev-postgres-replica External db sync (#1036) 2026-02-05 12:04:31 -08:00
dev-postgres-with-extensions Payments bulldozer txn rework (#1315) 2026-04-17 22:11:21 +00:00
local-emulator Fast-start local emulator via RAM snapshot + live secret rotation (#1340) 2026-04-20 14:24:49 -07:00
mock-oauth-server Local emulator (#422) 2025-02-13 18:57:02 +01:00
server Fast-start local emulator via RAM snapshot + live secret rotation (#1340) 2026-04-20 14:24:49 -07:00