Overnight success

A runbook for the night before a release: what to measure, what each number should look like, what each /_api/ready flag means, and the manual gates you should not skip.

The numbers below are the v2 baseline measured on 2026-05-16 on the reference Apple Silicon dev box (32 GB RAM, Postgres 18 in Docker). Use them as the "no-regression" bar — any release that drifts more than ~10 % on these without explanation deserves a closer look.

v2 baseline numbers

Metric	v1 (2026-05-03)	v2 (2026-05-16)	Δ
Workspace LOC (Rust, `crates/`)	328,699	72,743	-77.9 %
Frontend LOC (TS+Svelte)	9,491	7,598	-19.9 %
`cargo tree` lines	1,363	1,369	+0.4 %
`cargo tree -d` duplicates	867	872	+0.6 %
Workspace crates	33	5	-84.8 %
Release binary (`forge` CLI, stripped)	6.1 MB	6.2 MB	+1.6 %
Debug binary (`forge` CLI)	27 MB	27 MB	flat
Workspace unit tests (passing)	907	854	-5.8 %
`cargo test --workspace` wall time	~2.3 s	~2.1 s	-8.7 %
Lib build (release, warm cache)	1 m 31 s	1 m 12 s	-20.9 %

A note on the LOC drop: the v1 number included generated migrations, a dead forge-server crate, the standalone forge-runtime-mcp split, and ~50k lines of test-only fixture trees. The v2 number counts live framework code only: the same surface area with less housekeeping. The test count went down (907 → 854) because we deleted the tests that exercised the deprecated row-level reactivity path; coverage on the kept paths is unchanged.
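The ~10 % no-regression bar is easy to apply mechanically. The following is a minimal sketch, not part of Forge: `drift_pct` and `exceeds_bar` are hypothetical helper names, and the candidate measurement is whatever you record on the night.

```rust
/// Relative change from `baseline` to `current`, in percent.
/// Negative values mean the metric shrank.
fn drift_pct(baseline: f64, current: f64) -> f64 {
    (current - baseline) / baseline * 100.0
}

/// True when the absolute drift is larger than `bar_pct` (10.0 for the bar above).
/// Improvements count too: a big unexplained drop also deserves a look.
fn exceeds_bar(baseline: f64, current: f64, bar_pct: f64) -> bool {
    drift_pct(baseline, current).abs() > bar_pct
}
```

For example, `exceeds_bar(6.2, measured_mb, 10.0)` compares a freshly measured release binary size against the 6.2 MB v2 baseline.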

Hard floors and feature requirements

Forge v2 hard-fails at startup when:

PostgreSQL major version is below 18. Reason: we rely on pg_notification_queue_usage(), SET ACCESS METHOD on partitioned tables, pg_stat_statements.toplevel, and NOWAIT skip-locked semantics. The check reads server_version_num from a temporary pool connection — see crates/forge-runtime/src/pg/pool.rs.
The embedded system migration list doesn't match forge_system_migrations. Always run forge migrate up before the new binary takes traffic.
Required readiness signals fail twice in a row at boot — the runtime logs the failing flag and exits non-zero so orchestrators (Kubernetes, systemd) restart instead of accepting traffic.

If you can't run PG 18 in production, don't ship v2 — there is no v1-style fallback.
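As an illustration of the version floor, here is a sketch of how a major version can be derived from `server_version_num`. It is not the code in `pool.rs`; `MIN_PG_MAJOR`, `pg_major`, and `check_pg_floor` are hypothetical names. The only external fact relied on is that PostgreSQL 10 and later encode the value as major × 10000 + minor.

```rust
/// Minimum supported PostgreSQL major version (the hard floor described above).
const MIN_PG_MAJOR: u32 = 18;

/// Major version from `server_version_num`, e.g. 180001 (18.1) -> 18.
/// Pre-10 servers (e.g. 90603 for 9.6.3) yield 9, which is below the floor anyway.
fn pg_major(server_version_num: u32) -> u32 {
    server_version_num / 10_000
}

/// Ok(major) when the floor is met, otherwise a message suitable for a startup log line.
fn check_pg_floor(server_version_num: u32) -> Result<u32, String> {
    let major = pg_major(server_version_num);
    if major >= MIN_PG_MAJOR {
        Ok(major)
    } else {
        Err(format!(
            "PostgreSQL {major} is below the required major version {MIN_PG_MAJOR}"
        ))
    }
}
```

A startup path would feed this the integer returned by `SHOW server_version_num` and exit non-zero on `Err`.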

What `/_api/ready` reports

The probe is the load-balancer contract. A 200 means every flag is true; a 503 means at least one is false and the body tells you which.

```
{
  "database":           true,
  "reactor":            true,
  "notify_queue_ok":    true,
  "migrations_ok":      true,
  "cluster_registered": true
}
```

Flag	Source	Failure means
`database`	`SELECT 1` on the primary pool	Pool exhausted, network split, or PG down. Restarting the gateway won't fix it.
`reactor`	NOTIFY listener is attached and not dead	Reactivity is offline. New subscribes return 503, existing ones idle.
`notify_queue_ok`	`pg_notification_queue_usage() < 0.75`	A NOTIFY consumer is stuck (most commonly: a hung connection holding a session). Restart that node.
`migrations_ok`	Embedded count == `forge_system_migrations` count	Code is ahead of DB. Run `forge migrate up`.
`cluster_registered`	This node's row in `forge_nodes` is `active`	Heartbeat hasn't landed (boot race) or the row was forcibly marked dead. Check logs.

The body is intentionally shallow — no version strings, no row counts, no stuck workflow names. Public load-balancer probes shouldn't leak deployment state. For diagnostics, query the underlying tables directly.
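The probe contract (200 only when every flag is true, 503 otherwise) can be modelled in a few lines. This is a sketch for reasoning about the contract, for instance in a deploy script or test; the `Ready` struct and its methods are hypothetical and are not Forge's own types.

```rust
/// Mirrors the five flags in the `/_api/ready` body.
#[derive(Debug, Clone, Copy)]
struct Ready {
    database: bool,
    reactor: bool,
    notify_queue_ok: bool,
    migrations_ok: bool,
    cluster_registered: bool,
}

impl Ready {
    /// Names of the flags that are currently false, in body order.
    fn failing(&self) -> Vec<&'static str> {
        let flags = [
            ("database", self.database),
            ("reactor", self.reactor),
            ("notify_queue_ok", self.notify_queue_ok),
            ("migrations_ok", self.migrations_ok),
            ("cluster_registered", self.cluster_registered),
        ];
        flags
            .iter()
            .filter(|(_, ok)| !*ok)
            .map(|(name, _)| *name)
            .collect()
    }

    /// 200 when every flag is true, 503 when at least one is false.
    fn status_code(&self) -> u16 {
        if self.failing().is_empty() { 200 } else { 503 }
    }
}

/// The `notify_queue_ok` rule from the table: queue usage must stay below 0.75.
fn notify_queue_ok(usage: f64) -> bool {
    usage < 0.75
}
```

The output of `failing()` maps directly onto the "Failure means" column: each name it returns tells you which row of the table to act on.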

Pre-release gates

Run these in order. Any failure stops the release.

```
# 1. Format and lints (deny warnings).
cargo fmt --all --check
cargo clippy --all-targets --all-features --workspace -- -D warnings

# 2. Workspace tests, offline (no DB).
SQLX_OFFLINE=true cargo test --workspace

# 3. Per-example template smoke. ALL six must pass.
for tpl in with-svelte/{minimal,demo,realtime-todo-list} \
           with-dioxus/{minimal,demo,realtime-todo-list}; do
  scripts/ci/test-template.sh "$tpl" target/debug/forge .
done

# 4. Sanity-check the release binary.
cargo build -p forgex --release
ls -la target/release/forge   # expect ~6 MB

# 5. Regenerate .sqlx cache if any SQL changed.
# (see CLAUDE.md "Regenerating .sqlx cache" for the full script)
```

The template smoke script scaffolds a fresh project with forge new, patches deps to the local workspace, runs forge check, applies migrations, installs Playwright, and runs the example's spec suite. Each test captures a full-page screenshot to test-results/, so the artifact bundle uploaded by CI has visual proof of every route.
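If you prefer to drive the cargo gates from a small program rather than pasting them by hand, the "run in order, stop at the first failure" rule could look like the sketch below. Everything here is an assumption made for illustration: the `Gate` struct, the labels, and the function names are hypothetical, and the template smoke loop is left out because it depends on shell brace expansion.

```rust
use std::process::Command;

/// One gate: a label, the program to run, its arguments, and extra environment variables.
struct Gate<'a> {
    label: &'a str,
    program: &'a str,
    args: &'a [&'a str],
    env: &'a [(&'a str, &'a str)],
}

/// The cargo gates from the list above, in the same order.
fn release_gates() -> Vec<Gate<'static>> {
    vec![
        Gate { label: "fmt", program: "cargo", args: &["fmt", "--all", "--check"], env: &[] },
        Gate {
            label: "clippy",
            program: "cargo",
            args: &["clippy", "--all-targets", "--all-features", "--workspace", "--", "-D", "warnings"],
            env: &[],
        },
        Gate {
            label: "test",
            program: "cargo",
            args: &["test", "--workspace"],
            env: &[("SQLX_OFFLINE", "true")],
        },
        Gate { label: "release-build", program: "cargo", args: &["build", "-p", "forgex", "--release"], env: &[] },
    ]
}

/// Run gates in order and return the label of the first one that fails, if any.
/// `run` is injected so the ordering logic can be checked without spawning processes.
fn first_failure<'a, F>(gates: &[Gate<'a>], mut run: F) -> Option<&'a str>
where
    F: FnMut(&Gate<'a>) -> bool,
{
    for gate in gates {
        if !run(gate) {
            return Some(gate.label);
        }
    }
    None
}

/// Spawn the gate's command; true only if it started and exited with status 0.
fn run_command(gate: &Gate<'_>) -> bool {
    Command::new(gate.program)
        .args(gate.args)
        .envs(gate.env.iter().copied())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

Calling `first_failure(&release_gates(), run_command)` from a `main` runs the real commands; a `Some(label)` result is the point at which the release stops.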

Per-example assertions (do not skip)

The six templates exist to catch regressions the unit tests can't reach. The Playwright suites assert end-to-end behavior across the gateway, reactor, job worker, and frontend together. Don't mark any of these tests with .skip() to "unblock" a release.

Template	Must assert
`with-svelte/minimal`	App boots, zero console errors, screenshot captured.
`with-dioxus/minimal`	App boots (after WASM init + SSE subscribe), zero console errors.
`with-svelte/demo`, `with-dioxus/demo`	Users CRUD reactive, `export_users` job reaches 100 %, `account_verification` runs all 6 steps post-confirm, refresh-token counter increments, webhook 401-on-bad/200-on-good, signals capture view + event + identify + error + correlation_id.
`with-svelte/realtime-todo-list`	Create → toggle → delete propagates to a second client over SSE without manual refresh.
`with-dioxus/realtime-todo-list`	Same as Svelte realtime: reactive create/toggle/delete.

The demo assertions exist because every one of them caught a regression during the v2 rewrite at least once. Treat them as canaries, not a checklist.

Cancel and recovery

Operators can stop runaway work without restarting nodes:

POST /_api/admin/jobs/{id}/cancel — flags the job; the worker checks JobContext::is_cancelled() on its next loop iteration.
POST /_api/admin/workflows/{id}/cancel — sets cancel_requested_at and fires forge_workflow_cancelled over NOTIFY. A run sleeping in ctx.sleep("...", 24h) wakes within 50 ms and runs its compensation chain before terminating in cancelled_by_operator.
POST /_api/admin/queues/{name}/pause — adds the queue to forge_paused_queues. The claim SQL has a NOT EXISTS against this table, so existing in-flight work finishes but no new jobs are claimed. Resume with the matching /resume route.

Every state-changing admin route requires the admin role on the calling AuthContext and appends a row to forge_admin_audit with actor, target, reason, request_id, and trace_id. Read-only list and inspect routes don't audit (they're hot paths for dashboards).
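Job cancellation is cooperative: the flag only takes effect when the job body checks it. The sketch below shows the shape of such a loop. The `JobContext` here is a hypothetical stand-in that models only the cancel flag (the real type's fields and the `request_cancel` helper are assumptions), and `export_rows` is an invented job body.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for Forge's JobContext; only the cancel flag is modelled.
#[derive(Clone, Default)]
struct JobContext {
    cancelled: Arc<AtomicBool>,
}

impl JobContext {
    /// Stands in for what the cancel route does: flag the job.
    fn request_cancel(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }

    /// Checked by the job body once per loop iteration.
    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}

/// Processes rows until finished or cancelled; returns how many rows were processed.
fn export_rows(ctx: &JobContext, total: usize, mut on_row: impl FnMut(usize)) -> usize {
    let mut done = 0;
    for row in 0..total {
        // The check sits at the top of the loop, so a cancel lands on the *next* iteration.
        if ctx.is_cancelled() {
            break;
        }
        on_row(row);
        done += 1;
    }
    done
}
```

The practical consequence for operators is that a job with long iterations, or one that never checks the flag, will not stop promptly after the cancel call.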

When the night is going wrong

Template smoke flaky on one example, fine elsewhere. Run that template's forge test locally with PWDEBUG=1 so Playwright opens the inspector — usually a missing await gotoReady(path) before a reactive assertion.
Workspace tests fail with "query is not in .sqlx/". Cache is stale; regenerate per the CLAUDE.md script and commit the diff.
Release binary jumps in size unexpectedly. Check cargo tree -d; a new transitive dep with a heavy feature flag is the usual culprit.
/_api/ready shows notify_queue_ok=false on every node. A client somewhere is holding LISTEN connections without consuming. Find it with
```
SELECT pid, application_name, query_start, state, query
FROM pg_stat_activity
WHERE wait_event = 'AsyncWait'
ORDER BY query_start;
```
then call pg_terminate_backend(pid) on the stuck session.
A workflow blocks readiness in blocked_signature_mismatch. Pin the in-flight runs with cancel_by_operator or retire_unresumable (admin endpoint + audit log), then redeploy.
