Overnight success
A runbook for the night before a release: what to measure, what each
number should look like, what each /_api/ready flag means, and the
manual gates you should not skip.
The numbers below are the v2 baseline measured on 2026-05-16 on the
reference Apple Silicon dev box (32 GB RAM, Postgres 18 in Docker).
Use them as the "no-regression" bar — any release that drifts more
than ~10 % on these without explanation deserves a closer look.
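The "no-regression" bar is plain percentage arithmetic, so it can be checked mechanically. The following is a minimal sketch, not part of Forge; the function names `drift_pct` and `exceeds_bar` are hypothetical, and the tolerance is passed in because the runbook only gives an approximate ~10 %.

```rust
/// Signed percentage change from `baseline` to `current`.
/// Hypothetical helper, not a Forge API.
fn drift_pct(baseline: f64, current: f64) -> f64 {
    (current - baseline) / baseline * 100.0
}

/// True when the absolute drift is larger than `tolerance_pct`.
/// The runbook's bar is roughly 10 %, in either direction.
fn exceeds_bar(baseline: f64, current: f64, tolerance_pct: f64) -> bool {
    drift_pct(baseline, current).abs() > tolerance_pct
}
```

For instance, against the 6.2 MB release-binary baseline, a 6.5 MB build is about +4.8 % and passes, while a 6.9 MB build is about +11.3 % and would deserve the closer look.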
v2 baseline numbers
| Metric | v1 (2026-05-03) | v2 (2026-05-16) | Δ |
|---|---|---|---|
| Workspace LOC (Rust, crates/) | 328,699 | 72,743 | -77.9 % |
| Frontend LOC (TS+Svelte) | 9,491 | 7,598 | -19.9 % |
| cargo tree lines | 1,363 | 1,369 | +0.4 % |
| cargo tree -d duplicates | 867 | 872 | +0.6 % |
| Workspace crates | 33 | 5 | -84.8 % |
| Release binary (forge CLI, stripped) | 6.1 MB | 6.2 MB | +1.6 % |
| Debug binary (forge CLI) | 27 MB | 27 MB | flat |
| Workspace unit tests (passing) | 907 | 854 | -5.8 % |
| cargo test --workspace wall time | ~2.3 s | ~2.1 s | -8.7 % |
| Lib build (release, warm cache) | 1 m 31 s | 1 m 12 s | -20.9 % |
A note on the LOC drop: the v1 number included generated migrations, a
dead forge-server crate, the standalone forge-runtime-mcp split, and
~50k lines of test-only fixture trees. The v2 number is the live
framework code only: the same surface area, less housekeeping. Test
count went down because we deleted seven tests that exercised the
deprecated row-level reactivity path; coverage on the kept paths is the
same.
Hard floors and feature requirements
Forge v2 hard-fails at startup when:
- PostgreSQL major version is below 18. Reason: we rely on `pg_notification_queue_usage()`, `SET ACCESS METHOD` on partitioned tables, `pg_stat_statements.toplevel`, and `NOWAIT` skip-locked semantics. The check reads `server_version_num` from a temporary pool connection; see `crates/forge-runtime/src/pg/pool.rs`.
- The embedded system migration list doesn't match `forge_system_migrations`. Always run `forge migrate up` before the new binary takes traffic.
- Required readiness signals fail twice in a row at boot. The runtime logs the failing flag and exits non-zero so orchestrators (Kubernetes, systemd) restart instead of accepting traffic.
If you can't run PG 18 in production, don't ship v2 — there is no v1-style fallback.
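To make the version floor concrete: since PostgreSQL 10, `server_version_num` is encoded as major × 10000 + minor, so 18.1 reports `180001`. The sketch below shows one way such a check could be written. It is an illustration under that assumption, not the code in `pool.rs`; `pg_major` and `check_pg_floor` are hypothetical names.

```rust
/// Major version from `server_version_num`.
/// Valid for PostgreSQL 10 and later, where the value is major * 10000 + minor.
fn pg_major(server_version_num: u32) -> u32 {
    server_version_num / 10_000
}

/// Hypothetical startup check: refuse to continue below the required major.
fn check_pg_floor(server_version_num: u32) -> Result<(), String> {
    const MIN_MAJOR: u32 = 18;
    let major = pg_major(server_version_num);
    if major < MIN_MAJOR {
        Err(format!(
            "PostgreSQL {} is below the required major version {}",
            major, MIN_MAJOR
        ))
    } else {
        Ok(())
    }
}
```

A caller would feed it the integer returned by `SHOW server_version_num` and exit non-zero on `Err`.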
What /_api/ready reports
The probe is the load-balancer contract. A 200 means every flag is
true; a 503 means at least one is false and the body tells you
which.
```json
{
  "database": true,
  "reactor": true,
  "notify_queue_ok": true,
  "migrations_ok": true,
  "cluster_registered": true
}
```
| Flag | Source | Failure means |
|---|---|---|
| database | SELECT 1 on the primary pool | Pool exhausted, network split, or PG down. Restarting the gateway won't fix it. |
| reactor | NOTIFY listener is attached and not dead | Reactivity is offline. New subscribes return 503, existing ones idle. |
| notify_queue_ok | pg_notification_queue_usage() < 0.75 | A NOTIFY consumer is stuck (most commonly: a hung connection holding a session). Restart that node. |
| migrations_ok | Embedded count == forge_system_migrations count | Code is ahead of DB. Run forge migrate up. |
| cluster_registered | This node's row in forge_nodes is active | Heartbeat hasn't landed (boot race) or the row was forcibly marked dead. Check logs. |
The body is intentionally shallow — no version strings, no row counts, no stuck workflow names. Public load-balancer probes shouldn't leak deployment state. For diagnostics, query the underlying tables directly.
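The probe contract (200 only when every flag is true, 503 otherwise) can be modelled in a few lines. This is a minimal sketch for reasoning about the contract, not the gateway's implementation; the `Readiness` struct and its methods are hypothetical, and only the flag names and the 0.75 threshold come from the table above.

```rust
/// Mirror of the five flags in the `/_api/ready` body (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct Readiness {
    database: bool,
    reactor: bool,
    notify_queue_ok: bool,
    migrations_ok: bool,
    cluster_registered: bool,
}

impl Readiness {
    /// Names of every flag that is currently false, in body order.
    fn failing_flags(&self) -> Vec<&'static str> {
        let flags = [
            ("database", self.database),
            ("reactor", self.reactor),
            ("notify_queue_ok", self.notify_queue_ok),
            ("migrations_ok", self.migrations_ok),
            ("cluster_registered", self.cluster_registered),
        ];
        flags
            .iter()
            .filter(|(_, ok)| !*ok)
            .map(|(name, _)| *name)
            .collect()
    }

    /// 200 only when every flag is true; otherwise 503.
    fn http_status(&self) -> u16 {
        if self.failing_flags().is_empty() {
            200
        } else {
            503
        }
    }
}

/// The table's rule for `notify_queue_ok`: queue usage must stay below 0.75.
fn notify_queue_ok(usage: f64) -> bool {
    usage < 0.75
}
```

With a queue usage of 0.8 and a pending migration, `failing_flags()` would list `notify_queue_ok` and `migrations_ok`, and the status would be 503.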
Pre-release gates
Run these in order. Any failure stops the release.
```bash
# 1. Format and lints (deny warnings).
cargo fmt --all --check
cargo clippy --all-targets --all-features --workspace -- -D warnings

# 2. Workspace tests, offline (no DB).
SQLX_OFFLINE=true cargo test --workspace

# 3. Per-example template smoke. ALL six must pass.
for tpl in with-svelte/{minimal,demo,realtime-todo-list} \
           with-dioxus/{minimal,demo,realtime-todo-list}; do
  scripts/ci/test-template.sh "$tpl" target/debug/forge .
done

# 4. Sanity-check the release binary.
cargo build -p forgex --release
ls -la target/release/forge   # expect ~6 MB

# 5. Regenerate .sqlx cache if any SQL changed.
#    (see CLAUDE.md "Regenerating .sqlx cache" for the full script)
```
The template smoke script scaffolds a fresh project with forge new,
patches deps to the local workspace, runs forge check, applies
migrations, installs Playwright, and runs the example's spec suite.
Each test captures a full-page screenshot to test-results/, so the
artifact bundle uploaded by CI has visual proof of every route.
Per-example assertions (do not skip)
The six templates exist to catch regressions the unit tests can't
reach. The Playwright suites assert end-to-end behavior across the
gateway, reactor, job worker, and frontend together. Don't tag any of
these .skip() to "unblock" a release.
| Template | Must assert |
|---|---|
| with-svelte/minimal | App boots, zero console errors, screenshot captured. |
| with-dioxus/minimal | App boots (after WASM init + SSE subscribe), zero console errors. |
| with-svelte/demo, with-dioxus/demo | Users CRUD reactive, export_users job reaches 100 %, account_verification runs all 6 steps post-confirm, refresh-token counter increments, webhook 401-on-bad/200-on-good, signals capture view + event + identify + error + correlation_id. |
| with-svelte/realtime-todo-list | Create → toggle → delete propagates to a second client over SSE without manual refresh. |
| with-dioxus/realtime-todo-list | Same as Svelte realtime: reactive create/toggle/delete. |
The demo assertions exist because every one of them caught a regression during the v2 rewrite at least once. Treat them as canaries, not a checklist.
Cancel and recovery
Operators can stop runaway work without restarting nodes:
- `POST /_api/admin/jobs/{id}/cancel`: flags the job; the worker checks `JobContext::is_cancelled()` on its next loop iteration.
- `POST /_api/admin/workflows/{id}/cancel`: sets `cancel_requested_at` and fires `forge_workflow_cancelled` over NOTIFY. A run sleeping in `ctx.sleep("...", 24h)` wakes within 50 ms and runs its compensation chain before terminating in `cancelled_by_operator`.
- `POST /_api/admin/queues/{name}/pause`: adds the queue to `forge_paused_queues`. The claim SQL has a `NOT EXISTS` against this table, so existing in-flight work finishes but no new jobs are claimed. Resume with the matching `/resume` route.
Every state-changing admin route requires the admin role on the
calling AuthContext and appends a row to forge_admin_audit with
actor, target, reason, request_id, and trace_id. Read-only list and
inspect routes don't audit (they're hot paths for dashboards).
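Job cancellation is cooperative: the cancel route only sets a flag, and the job stops when its own loop next checks it. The sketch below illustrates that pattern with a stand-in type. The `JobContext` here is a hypothetical model backed by an `AtomicBool`, not Forge's real `JobContext`, and `export_rows` and `Outcome` are invented for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for Forge's JobContext; only the flag is modelled.
struct JobContext {
    cancelled: Arc<AtomicBool>,
}

impl JobContext {
    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::Acquire)
    }
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    Cancelled { processed: usize },
}

/// Processes `total` rows, checking the cancel flag before each one.
/// A cancel request takes effect at the next iteration, never mid-row.
fn export_rows(ctx: &JobContext, total: usize, mut on_row: impl FnMut(usize)) -> Outcome {
    for i in 0..total {
        if ctx.is_cancelled() {
            return Outcome::Cancelled { processed: i };
        }
        on_row(i);
    }
    Outcome::Completed
}
```

The practical consequence for job authors: a loop body that runs for minutes without returning to the check will delay cancellation by the same amount, so long steps are worth splitting.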
When the night is going wrong
- Template smoke flaky on one example, fine elsewhere. Run that template's `forge test` locally with `PWDEBUG=1` so Playwright opens the inspector. The usual cause is a missing `await gotoReady(path)` before a reactive assertion.
- Workspace tests fail with "query is not in .sqlx/". Cache is stale; regenerate per the CLAUDE.md script and commit the diff.
- Release binary jumps in size unexpectedly. Check `cargo tree -d`; a new transitive dep with a heavy feature flag is the usual culprit.
- `/_api/ready` shows `notify_queue_ok=false` on every node. A client somewhere is holding `LISTEN` connections without consuming. Find it with

  ```sql
  SELECT pid, application_name, query_start, state, query
  FROM pg_stat_activity
  WHERE wait_event = 'AsyncWait'
  ORDER BY query_start;
  ```

  then `pg_terminate_backend()` the stuck session.
- A workflow blocks readiness in `blocked_signature_mismatch`. Pin the in-flight runs with `cancel_by_operator` or `retire_unresumable` (admin endpoint + audit log), then redeploy.
See also
- Dev loop — the inner loop this runbook wraps.
- Multiple nodes — cluster lifecycle the readiness probe gates.
- Performance — the budget the v2 numbers above measure against.