Overnight success

A runbook for the night before a release: what to measure, what each number should look like, what each /_api/ready flag means, and the manual gates you should not skip.

The numbers below are the v2 baseline measured on 2026-05-16 on the reference Apple Silicon dev box (32 GB RAM, Postgres 18 in Docker). Use them as the "no-regression" bar — any release that drifts more than ~10 % on these without explanation deserves a closer look.
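As a rough illustration of the ~10 % bar, the sketch below computes the relative drift of a new measurement against its v2 baseline and flags anything outside the threshold in either direction. The function names and the "measured" values are hypothetical; only the baselines come from this page.

```rust
/// Relative drift of `measured` from `baseline`, in percent (positive = grew).
fn drift_pct(baseline: f64, measured: f64) -> f64 {
    (measured - baseline) / baseline * 100.0
}

/// True when the drift exceeds the threshold in either direction.
fn needs_closer_look(baseline: f64, measured: f64, threshold_pct: f64) -> bool {
    drift_pct(baseline, measured).abs() > threshold_pct
}

fn main() {
    // (metric, v2 baseline, hypothetical measurement for a new release)
    let checks = [
        ("release binary (MB)", 6.2, 6.3),
        ("cargo test wall time (s)", 2.1, 2.4),
        ("lib build, warm cache (s)", 72.0, 75.0),
    ];
    for (name, baseline, measured) in checks {
        let d = drift_pct(baseline, measured);
        let flag = if needs_closer_look(baseline, measured, 10.0) {
            "CHECK"
        } else {
            "ok"
        };
        println!("{}: {:+.1} % [{}]", name, d, flag);
    }
}
```

The check is symmetric on purpose: an unexplained improvement of more than 10 % is as worth a look as a regression, since it often means something stopped being measured.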

v2 baseline numbers

| Metric | v1 (2026-05-03) | v2 (2026-05-16) | Δ |
| --- | --- | --- | --- |
| Workspace LOC (Rust, crates/) | 328,699 | 72,743 | -77.9 % |
| Frontend LOC (TS+Svelte) | 9,491 | 7,598 | -19.9 % |
| cargo tree lines | 1,363 | 1,369 | +0.4 % |
| cargo tree -d duplicates | 867 | 872 | +0.6 % |
| Workspace crates | 33 | 5 | -84.8 % |
| Release binary (forge CLI, stripped) | 6.1 MB | 6.2 MB | +1.6 % |
| Debug binary (forge CLI) | 27 MB | 27 MB | flat |
| Workspace unit tests (passing) | 907 | 854 | -5.8 % |
| cargo test --workspace wall time | ~2.3 s | ~2.1 s | -8.7 % |
| Lib build (release, warm cache) | 1 m 31 s | 1 m 12 s | -20.9 % |

A note on the LOC drop: the v1 number included generated migrations, a dead forge-server crate, the standalone forge-runtime-mcp split, and ~50k lines of test-only fixture trees. The v2 number is the live framework code only: the same surface area, less housekeeping. The test count went down because we deleted the 53 tests (907 → 854) that exercised the deprecated row-level reactivity path; coverage on the kept paths is unchanged.

Hard floors and feature requirements

Forge v2 hard-fails at startup when:

  • PostgreSQL major version is below 18. Reason: we rely on pg_notification_queue_usage(), SET ACCESS METHOD on partitioned tables, pg_stat_statements.toplevel, and NOWAIT skip-locked semantics. The check reads server_version_num from a temporary pool connection — see crates/forge-runtime/src/pg/pool.rs.
  • The embedded system migration list doesn't match forge_system_migrations. Always run forge migrate up before the new binary takes traffic.
  • Required readiness signals fail twice in a row at boot — the runtime logs the failing flag and exits non-zero so orchestrators (Kubernetes, systemd) restart instead of accepting traffic.

If you can't run PG 18 in production, don't ship v2 — there is no v1-style fallback.
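The version floor can be sketched as follows. This is not the code in crates/forge-runtime/src/pg/pool.rs; it only illustrates the arithmetic, and the constant and function names are hypothetical. It relies on the fact that, for PostgreSQL 10 and later, server_version_num is major × 10000 + minor (so 18.1 reports 180001).

```rust
/// Minimum supported PostgreSQL major version, per the hard floor above.
const MIN_PG_MAJOR: u32 = 18;

/// Extract the major version from `server_version_num`.
/// For PostgreSQL 10 and later the value is major * 10000 + minor.
fn pg_major(server_version_num: u32) -> u32 {
    server_version_num / 10_000
}

/// Hypothetical startup check: the Err carries the message to log before exiting.
fn check_pg_floor(server_version_num: u32) -> Result<(), String> {
    let major = pg_major(server_version_num);
    if major < MIN_PG_MAJOR {
        Err(format!(
            "PostgreSQL {} is below the required major version {}",
            major, MIN_PG_MAJOR
        ))
    } else {
        Ok(())
    }
}

fn main() {
    // In a real check the value would come from `SHOW server_version_num`.
    for num in [170004u32, 180000, 180001] {
        match check_pg_floor(num) {
            Ok(()) => println!("{}: ok", num),
            Err(e) => println!("{}: refuse to start ({})", num, e),
        }
    }
}
```

Comparing the integer rather than parsing the human-readable version string avoids surprises with suffixes such as beta or distribution tags.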

What /_api/ready reports

The probe is the load-balancer contract. A 200 means every flag is true; a 503 means at least one is false and the body tells you which.

{
  "database": true,
  "reactor": true,
  "notify_queue_ok": true,
  "migrations_ok": true,
  "cluster_registered": true
}

| Flag | Source | Failure means |
| --- | --- | --- |
| database | SELECT 1 on the primary pool | Pool exhausted, network split, or PG down. Restarting the gateway won't fix it. |
| reactor | NOTIFY listener is attached and not dead | Reactivity is offline. New subscribes return 503, existing ones idle. |
| notify_queue_ok | pg_notification_queue_usage() < 0.75 | A NOTIFY consumer is stuck (most commonly: a hung connection holding a session). Restart that node. |
| migrations_ok | Embedded count == forge_system_migrations count | Code is ahead of DB. Run forge migrate up. |
| cluster_registered | This node's row in forge_nodes is active | Heartbeat hasn't landed (boot race) or the row was forcibly marked dead. Check logs. |

The body is intentionally shallow — no version strings, no row counts, no stuck workflow names. Public load-balancer probes shouldn't leak deployment state. For diagnostics, query the underlying tables directly.
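The contract described above (200 only when every flag is true, 503 otherwise, with the body naming the failing flags) can be modelled in a few lines. The sketch below is an assumption-laden illustration, not the gateway's handler: the Ready struct, its methods, and the notify_queue_ok helper are hypothetical names; only the five flag names and the 0.75 threshold come from this page.

```rust
/// The five flags from the probe body above.
#[derive(Debug, Clone, Copy)]
struct Ready {
    database: bool,
    reactor: bool,
    notify_queue_ok: bool,
    migrations_ok: bool,
    cluster_registered: bool,
}

/// Derive the queue flag from pg_notification_queue_usage() (a fraction in 0..=1).
fn notify_queue_ok(usage: f64) -> bool {
    usage < 0.75
}

impl Ready {
    /// Names of the flags that are currently false.
    fn failing(&self) -> Vec<&'static str> {
        let flags = [
            ("database", self.database),
            ("reactor", self.reactor),
            ("notify_queue_ok", self.notify_queue_ok),
            ("migrations_ok", self.migrations_ok),
            ("cluster_registered", self.cluster_registered),
        ];
        let mut out = Vec::new();
        for (name, ok) in flags.iter() {
            if !*ok {
                out.push(*name);
            }
        }
        out
    }

    /// 200 when every flag is true, 503 otherwise.
    fn status(&self) -> u16 {
        if self.failing().is_empty() { 200 } else { 503 }
    }
}

fn main() {
    let ready = Ready {
        database: true,
        reactor: true,
        notify_queue_ok: notify_queue_ok(0.9), // hypothetical: queue 90 % full
        migrations_ok: true,
        cluster_registered: true,
    };
    println!("status {} failing {:?}", ready.status(), ready.failing());
}
```

A deploy script that polls the probe can use the same rule: treat anything other than 200 as "not ready" and log the false flags rather than parsing anything deeper.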

Pre-release gates

Run these in order. Any failure stops the release.

# 1. Format and lints (deny warnings).
cargo fmt --all --check
cargo clippy --all-targets --all-features --workspace -- -D warnings

# 2. Workspace tests, offline (no DB).
SQLX_OFFLINE=true cargo test --workspace

# 3. Per-example template smoke. ALL six must pass.
for tpl in with-svelte/{minimal,demo,realtime-todo-list} \
           with-dioxus/{minimal,demo,realtime-todo-list}; do
  scripts/ci/test-template.sh "$tpl" target/debug/forge .
done

# 4. Sanity-check the release binary.
cargo build -p forgex --release
ls -lh target/release/forge # expect ~6 MB (-h prints the size in human-readable units)

# 5. Regenerate .sqlx cache if any SQL changed.
# (see CLAUDE.md "Regenerating .sqlx cache" for the full script)

The template smoke script scaffolds a fresh project with forge new, patches deps to the local workspace, runs forge check, applies migrations, installs Playwright, and runs the example's spec suite. Each test captures a full-page screenshot to test-results/, so the artifact bundle uploaded by CI has visual proof of every route.

Per-example assertions (do not skip)

The six templates exist to catch regressions the unit tests can't reach. The Playwright suites assert end-to-end behavior across the gateway, reactor, job worker, and frontend together. Don't tag any of these .skip() to "unblock" a release.

| Template | Must assert |
| --- | --- |
| with-svelte/minimal | App boots, zero console errors, screenshot captured. |
| with-dioxus/minimal | App boots (after WASM init + SSE subscribe), zero console errors. |
| with-svelte/demo, with-dioxus/demo | Users CRUD reactive, export_users job reaches 100 %, account_verification runs all 6 steps post-confirm, refresh-token counter increments, webhook 401-on-bad/200-on-good, signals capture view + event + identify + error + correlation_id. |
| with-svelte/realtime-todo-list | Create → toggle → delete propagates to a second client over SSE without manual refresh. |
| with-dioxus/realtime-todo-list | Same as Svelte realtime: reactive create/toggle/delete. |

The demo assertions exist because every one of them caught a regression during the v2 rewrite at least once. Treat them as canaries, not a checklist.

Cancel and recovery

Operators can stop runaway work without restarting nodes:

  • POST /_api/admin/jobs/{id}/cancel — flags the job; the worker checks JobContext::is_cancelled() on its next loop iteration.
  • POST /_api/admin/workflows/{id}/cancel — sets cancel_requested_at and fires forge_workflow_cancelled over NOTIFY. A run sleeping in ctx.sleep("...", 24h) wakes within 50 ms and runs its compensation chain before terminating in cancelled_by_operator.
  • POST /_api/admin/queues/{name}/pause — adds the queue to forge_paused_queues. The claim SQL has a NOT EXISTS against this table, so existing in-flight work finishes but no new jobs are claimed. Resume with the matching /resume route.

Every state-changing admin route requires the admin role on the calling AuthContext and appends a row to forge_admin_audit with actor, target, reason, request_id, and trace_id. Read-only list and inspect routes don't audit (they're hot paths for dashboards).
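Job cancellation as described above is cooperative: the route only sets a flag, and the worker notices it at its next loop boundary. The sketch below models that behaviour with a hypothetical stand-in for JobContext built on an atomic flag; the real type, its fields, and the request_cancel and run_job helpers are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the framework's JobContext; only the
/// cancellation flag is modelled here.
#[derive(Clone, Default)]
struct JobContext {
    cancelled: Arc<AtomicBool>,
}

impl JobContext {
    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::Acquire)
    }

    /// What the cancel route amounts to in this model: set the flag.
    fn request_cancel(&self) {
        self.cancelled.store(true, Ordering::Release);
    }
}

/// Processes items one at a time, checking the flag before each.
/// Returns how many items were processed before stopping.
fn run_job(ctx: &JobContext, items: &[u32], mut on_item: impl FnMut(u32)) -> usize {
    let mut done = 0;
    for &item in items {
        if ctx.is_cancelled() {
            break; // stop at the loop boundary, never mid-item
        }
        on_item(item);
        done += 1;
    }
    done
}

fn main() {
    let ctx = JobContext::default();
    let operator = ctx.clone();
    // Simulate an operator cancelling while item 3 is being processed.
    let done = run_job(&ctx, &[1, 2, 3, 4, 5], |item| {
        if item == 3 {
            operator.request_cancel();
        }
    });
    println!("processed {} of 5 items before the cancel took effect", done);
}
```

The practical consequence for job authors is that a long-running iteration delays cancellation by its own duration, so loops should be kept fine-grained if operators need a fast stop.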

When the night is going wrong

  • Template smoke flaky on one example, fine elsewhere. Run that template's forge test locally with PWDEBUG=1 so Playwright opens the inspector. The usual cause is a missing await gotoReady(path) before a reactive assertion.
  • Workspace tests fail with "query is not in .sqlx/". Cache is stale; regenerate per the CLAUDE.md script and commit the diff.
  • Release binary jumps in size unexpectedly. Check cargo tree -d; a new transitive dep with a heavy feature flag is the usual culprit.
  • /_api/ready shows notify_queue_ok=false on every node. A client somewhere is holding LISTEN connections without consuming. Find it with
    SELECT pid, application_name, query_start, state, query
    FROM pg_stat_activity
    WHERE wait_event = 'AsyncWait'
    ORDER BY query_start;
    then pg_terminate_backend() the stuck session.
  • A workflow blocks readiness in blocked_signature_mismatch. Pin the in-flight runs with cancel_by_operator or retire_unresumable (admin endpoint + audit log), then redeploy.

See also

  • Dev loop — the inner loop this runbook wraps.
  • Multiple nodes — cluster lifecycle the readiness probe gates.
  • Performance — the budget the v2 numbers above measure against.