Production Architecture

Understand how a Forge process is structured, what roles it can play, and what a minimal production deployment looks like before you wire up containers and load balancers.

Single-Binary, All Subsystems

Every Forge application compiles to one binary. When it starts, it brings up all of the following subsystems in a single process:

┌─────────────────────────────────────────────────────┐
│                   forge binary                      │
│                                                     │
│  ┌────────────┐  ┌────────────┐  ┌───────────────┐  │
│  │  Gateway   │  │  Worker    │  │  Scheduler    │  │
│  │  (Axum)    │  │  (jobs)    │  │  (cron/leader)│  │
│  └────────────┘  └────────────┘  └───────────────┘  │
│                                                     │
│  ┌────────────┐  ┌────────────┐  ┌───────────────┐  │
│  │  Reactor   │  │  Daemons   │  │  Workflow     │  │
│  │  (SSE/RT)  │  │            │  │  Executor     │  │
│  └────────────┘  └────────────┘  └───────────────┘  │
│                                                     │
│               ↕ PostgreSQL                          │
└─────────────────────────────────────────────────────┘

There is no separate worker process, no separate scheduler, no sidecar. One binary, one PostgreSQL database. You deploy more copies of the same binary when you need to scale or add redundancy.

What each subsystem does

Subsystem	Role
Gateway	Axum HTTP server. Handles RPC (`/_api/rpc/*`), SSE (`/_api/events`), health probes, and static frontend assets.
Worker	Polls `forge_jobs` with `FOR UPDATE SKIP LOCKED`. Executes background jobs concurrently, bounded by the semaphore size in `[worker]`.
Scheduler	Triggers cron jobs on schedule. Only the elected leader node runs this — others stand by.
Reactor	Listens on the PostgreSQL `forge_changes` NOTIFY channel. On a change, debounces, re-executes affected queries, and pushes diffs to connected SSE clients.
Daemons	Long-running background loops. Either leader-only (one instance per cluster via advisory lock) or replicated (one per node).
Workflow Executor	Resumes durable workflows from their persisted checkpoint. Handles step re-execution, compensation, and durable sleep.

Node Roles

By default every node runs every subsystem. You can restrict what a node does via forge.toml:

[node]
roles = ["gateway", "worker", "scheduler", "function"]

Role	Enables
`gateway`	HTTP server and SSE endpoint
`function`	Query and mutation execution
`worker`	Background job processing
`scheduler`	Cron scheduling (leader-elected)

Enabling all four roles is the right default for single-node and small multi-node deployments. Split roles when you need to isolate concerns: for example, to put gateway nodes behind a WAF while worker nodes accept no inbound HTTP.
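As a rough illustration of how a role list can gate behaviour, the Rust sketch below parses the role strings from `[node]` and decides whether a node should serve HTTP. The `Role` enum and the `parse_roles` and `binds_http` helpers are hypothetical names chosen for this example; they are not taken from Forge's source.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of the four role names accepted in `[node].roles`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Role {
    Gateway,
    Function,
    Worker,
    Scheduler,
}

/// Parse role strings; an unknown name aborts parsing with an error.
fn parse_roles(names: &[&str]) -> Result<HashSet<Role>, String> {
    names
        .iter()
        .map(|name| match *name {
            "gateway" => Ok(Role::Gateway),
            "function" => Ok(Role::Function),
            "worker" => Ok(Role::Worker),
            "scheduler" => Ok(Role::Scheduler),
            other => Err(format!("unknown role: {other}")),
        })
        .collect()
}

/// Per the role table, only the gateway role enables the HTTP server and SSE endpoint.
fn binds_http(roles: &HashSet<Role>) -> bool {
    roles.contains(&Role::Gateway)
}

fn main() {
    let api = parse_roles(&["gateway", "function"]).unwrap();
    let worker = parse_roles(&["worker"]).unwrap();
    println!("api node binds http: {}", binds_http(&api));
    println!("worker node binds http: {}", binds_http(&worker));
}
```

Rejecting unknown role names at startup, rather than ignoring them, is a design choice of this sketch: a typo in `roles` would otherwise silently disable a subsystem.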

Deployment Topologies

Single node (development, staging, small apps)

Internet → [ forge binary ] → PostgreSQL

One node, all roles, one database. This is what cargo run gives you locally and what the Docker Compose file in Deploy sets up. A single node handles hundreds of concurrent connections before you need anything else.

No load balancer needed. No cluster config needed. Migrations run on startup and block until complete.

Minimum viable production: two nodes

Internet → [ Load Balancer ]
               ↙         ↘
    [ forge node A ]  [ forge node B ]
               ↘         ↙
              [ PostgreSQL ]

Two nodes, all roles, one PostgreSQL instance, one load balancer. This gives you:

Zero-downtime deploys (rolling update: start node B, drain node A)
Failover if one node crashes (the other keeps serving and claims orphaned jobs within 15 seconds)
Double the worker throughput

The load balancer routes based on /_api/ready. Nodes that are starting up (joining) or shutting down (draining) return 503 from /_api/ready and drop out of rotation automatically.

# forge.toml — same file on both nodes
[cluster]
discovery = "postgres"

[node]
roles = ["gateway", "worker", "scheduler", "function"]

One node wins the scheduler advisory lock. The other stands by. If the leader crashes, the standby acquires the lock within the next heartbeat interval (default 5 seconds).
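The timing figures quoted in this section can be related with a little arithmetic. The sketch below assumes that the 15-second window for claiming orphaned jobs corresponds to the `dead_threshold` setting and that nodes heartbeat every `heartbeat_interval`; the `is_dead` helper is hypothetical and only illustrates the comparison.

```rust
use std::time::Duration;

// Values taken from the defaults quoted in this document.
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);
const DEAD_THRESHOLD: Duration = Duration::from_secs(15);

/// Assumed rule: a node is presumed dead once its last heartbeat
/// is older than the dead threshold.
fn is_dead(since_last_heartbeat: Duration) -> bool {
    since_last_heartbeat > DEAD_THRESHOLD
}

fn main() {
    // Number of heartbeat intervals that fit inside the threshold.
    let intervals = DEAD_THRESHOLD.as_secs() / HEARTBEAT_INTERVAL.as_secs();
    println!("heartbeat intervals per threshold: {intervals}"); // prints 3

    println!("silent for 12s, dead? {}", is_dead(Duration::from_secs(12)));
    println!("silent for 16s, dead? {}", is_dead(Duration::from_secs(16)));
}
```

Under these assumptions a node can miss a couple of heartbeats (for example during a long GC-free pause or a brief network blip) without being declared dead, which is why the threshold is a multiple of the interval rather than equal to it.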

Separated concerns: API + worker nodes

For higher throughput or to isolate HTTP traffic from CPU-heavy job processing:

Internet → [ Load Balancer ]
               ↙     ↘
     [ API node ]  [ API node ]    (gateway + function, no worker)
                                         ↕
                              [ PostgreSQL ]
                                         ↕
     [ Worker node ] [ Worker node ]     (worker only, no gateway)

# API nodes
[node]
roles = ["gateway", "function"]

# Worker nodes
[node]
roles = ["worker"]
worker_capabilities = ["general"]

Worker nodes do not bind an HTTP port. They connect to PostgreSQL and poll for jobs. You can scale worker and API nodes independently. See Worker Pools for capability-based routing.

Leader Election

Certain subsystems run on exactly one node at a time:

Scheduler — triggers cron jobs; duplicate execution would double-fire scheduled tasks
Leader-mode daemons — daemons marked as leader-only in their config

Election uses a PostgreSQL advisory lock. The first node whose pg_try_advisory_lock(0x464F52470001) call succeeds becomes the scheduler leader. If that node crashes, its database connection closes, PostgreSQL releases the lock, and another node acquires it within seconds.

No quorum, no Raft, no Zookeeper. The database connection is the lease. Clock skew cannot cause split-brain because the lock is not time-based.
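To make the "connection is the lease" idea concrete, here is a toy in-memory model of session-level advisory locks. It does not talk to PostgreSQL; `LockTable`, `try_lock`, and `disconnect` are hypothetical names that only imitate the behaviour described above (a non-blocking attempt, and release when the holding session ends).

```rust
use std::collections::HashMap;

/// Lock key quoted in the text. The hex digits 46 4F 52 47 are ASCII "FORG".
const SCHEDULER_LOCK_KEY: i64 = 0x464F_5247_0001;

/// Toy stand-in for session-level advisory locks:
/// each key is held by at most one session until that session ends.
#[derive(Default)]
struct LockTable {
    holders: HashMap<i64, u32>, // lock key -> session id
}

impl LockTable {
    /// Models `pg_try_advisory_lock`: never blocks, reports whether the caller holds the lock.
    fn try_lock(&mut self, key: i64, session: u32) -> bool {
        match self.holders.get(&key) {
            Some(&holder) => holder == session,
            None => {
                self.holders.insert(key, session);
                true
            }
        }
    }

    /// Models a dropped connection: every lock held by the session is released.
    fn disconnect(&mut self, session: u32) {
        self.holders.retain(|_, holder| *holder != session);
    }
}

fn main() {
    let mut pg = LockTable::default();
    let (node_a, node_b) = (1, 2);

    assert!(pg.try_lock(SCHEDULER_LOCK_KEY, node_a)); // A becomes leader
    assert!(!pg.try_lock(SCHEDULER_LOCK_KEY, node_b)); // B stands by

    pg.disconnect(node_a); // A crashes and its connection closes
    assert!(pg.try_lock(SCHEDULER_LOCK_KEY, node_b)); // B wins on its next attempt
    println!("leader handover modelled without timestamps");
}
```

Nothing in the model compares timestamps, which mirrors the point made above: leadership depends only on which session currently holds the lock.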

Configuration for Production

A production forge.toml typically sets the sections below. Environment variable substitution (${VAR} and ${VAR-default}) works in any string value.

[project]
name = "my-app"

[database]
url = "${DATABASE_URL}"
pool_size = 20                    # tune for your workload and PG max_connections

[gateway]
port = 8080
host = "0.0.0.0"

[cluster]
discovery = "postgres"
heartbeat_interval = "5s"
dead_threshold = "15s"

[node]
roles = ["gateway", "worker", "scheduler", "function"]

[worker]
concurrency = 16                  # jobs processed simultaneously per node

[auth]
jwt_secret = "${JWT_SECRET}"

[observability]
enabled = true
otlp_endpoint = "${OTEL_EXPORTER_OTLP_ENDPOINT}"   # e.g. http://collector:4318

Key environment variables:

Variable	Required	Description
`DATABASE_URL`	Yes	`postgres://user:pass@host:5432/db`
`JWT_SECRET`	If using auth	Minimum 32 bytes
`RUST_LOG`	No	`info` for production, `debug` for troubleshooting

No other environment variables are required. Everything else lives in forge.toml.
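The `${VAR}` and `${VAR-default}` forms can be pictured with a small expansion routine. This is a minimal sketch, not Forge's loader: the `substitute` function is hypothetical, and it assumes that an unset variable without a default is treated as an error. The lookup is passed in as a closure so the example runs without touching the real environment; a real loader would presumably consult `std::env::var`.

```rust
/// Expand `${VAR}` and `${VAR-default}` in `input` using `lookup`.
/// Assumption: an unset variable with no default is an error.
fn substitute(
    input: &str,
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find('}').ok_or("unterminated ${")?;
        let expr = &after[..end];
        // Split on the first '-' so a default may itself contain dashes.
        let (name, default) = match expr.split_once('-') {
            Some((n, d)) => (n, Some(d)),
            None => (expr, None),
        };
        match lookup(name).or_else(|| default.map(str::to_string)) {
            Some(value) => out.push_str(&value),
            None => return Err(format!("{name} is not set and has no default")),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    // Stand-in environment containing only DATABASE_URL.
    let env = |name: &str| match name {
        "DATABASE_URL" => Some("postgres://user:pass@host:5432/db".to_string()),
        _ => None,
    };
    println!("{:?}", substitute("${DATABASE_URL}", &env));
    println!("{:?}", substitute("http://collector:${OTLP_PORT-4318}", &env));
    println!("{:?}", substitute("${JWT_SECRET}", &env));
}
```

Failing loudly on a missing required variable such as `DATABASE_URL` is the assumption here; substituting an empty string instead would postpone the failure to the first database connection attempt.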

Health Checks

Both endpoints are available whenever the gateway role is enabled.

Endpoint	Probe type	Returns
`/_api/health`	Liveness	`200` always (process is up)
`/_api/ready`	Readiness	`200` when DB reachable and reactor ready; `503` otherwise

Use /_api/health as the liveness probe (restart if the process is wedged). Use /_api/ready as the readiness probe (only route traffic here when it returns 200).

The readiness probe also returns 503 when in-flight workflow runs exist for a handler version that is no longer registered. This forces you to drain stranded workflows before the node accepts new traffic.
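The conditions described for the two probes can be summarised as a pair of pure functions. This is a sketch under assumptions: the `NodeState` variants (in particular `Active`), the `Readiness` struct, and its field names are hypothetical and only restate the rules from this page.

```rust
/// Hypothetical lifecycle states; the text names "joining" and "draining".
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState {
    Joining,
    Active,
    Draining,
}

/// Inputs the readiness probe is described as checking.
struct Readiness {
    state: NodeState,
    db_reachable: bool,
    reactor_ready: bool,
    /// In-flight workflow runs whose handler version is no longer registered.
    stranded_workflow_runs: u32,
}

/// Liveness: 200 whenever the process can answer at all.
fn health_status() -> u16 {
    200
}

/// Readiness: 200 only when every condition holds, otherwise 503.
fn ready_status(r: &Readiness) -> u16 {
    let ready = r.state == NodeState::Active
        && r.db_reachable
        && r.reactor_ready
        && r.stranded_workflow_runs == 0;
    if ready { 200 } else { 503 }
}

fn main() {
    let mut node = Readiness {
        state: NodeState::Joining,
        db_reachable: true,
        reactor_ready: true,
        stranded_workflow_runs: 0,
    };
    println!("joining:  health={} ready={}", health_status(), ready_status(&node));

    node.state = NodeState::Active;
    println!("active:   health={} ready={}", health_status(), ready_status(&node));

    node.state = NodeState::Draining;
    println!("draining: health={} ready={}", health_status(), ready_status(&node));
}
```

Keeping liveness unconditional and putting every dependency check into readiness means a database outage takes nodes out of rotation without triggering restarts.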

# Kubernetes
livenessProbe:
  httpGet:
    path: /_api/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /_api/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 1

What You Need to Operate

A complete Forge production deployment requires:

PostgreSQL 18 — all state lives here: jobs, workflows, sessions, signals, node registry, cron schedule
One or more instances of your binary — same binary, any number of nodes
A load balancer — routes to healthy nodes via /_api/ready; sticky sessions needed only for MCP OAuth (/_api/oauth/*)
No other infrastructure — no Redis, no message bus, no separate worker process, no service mesh required

Optional but recommended for production at scale:

Read replicas — configure under [database.replicas] to offload query traffic; see Multiple Nodes
OTLP collector — for distributed traces and metrics; configure [observability]
Connection pooler (PgBouncer or RDS Proxy) — if you run many nodes and approach PostgreSQL's max_connections limit
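A quick way to judge when a connection pooler becomes worthwhile is to budget connections per node. The sketch below assumes each node opens at most `pool_size` connections; Forge may hold additional connections (for example for LISTEN or the advisory lock), so treat the result as an upper bound on node count. The example figures of 100 and 3 are PostgreSQL's stock defaults for `max_connections` and `superuser_reserved_connections`, and the function names are hypothetical.

```rust
/// Connections the cluster opens if every node fills its pool.
fn total_connections(nodes: u32, pool_size: u32) -> u32 {
    nodes * pool_size
}

/// Largest node count that stays within the server's connection limit.
/// Returns 0 if `pool_size` is 0 to avoid dividing by zero.
fn max_nodes(max_connections: u32, reserved: u32, pool_size: u32) -> u32 {
    if pool_size == 0 {
        return 0;
    }
    max_connections.saturating_sub(reserved) / pool_size
}

fn main() {
    let pool_size = 20; // matches the [database] example in this document
    let limit = max_nodes(100, 3, pool_size);
    println!("nodes that fit without a pooler: {limit}"); // prints 4
    println!(
        "connections used by {limit} nodes: {}",
        total_connections(limit, pool_size)
    ); // prints 80
}
```

With these example numbers a fifth node would push the cluster to 100 pooled connections, past the 97 available to ordinary clients, which is the point at which lowering `pool_size`, raising `max_connections`, or adding a pooler needs to be considered.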
