
Performance

Forge ships with an adaptive capacity benchmark that ramps concurrent users until the system breaks. Every user holds a live SSE subscription while continuously making RPC calls, and 30% of traffic is writes that trigger the full reactivity pipeline. This is not a synthetic HTTP throughput test. It measures the most demanding steady-state workload: serving reads, processing writes, re-executing affected subscriptions, and pushing changes over SSE, all at the same time.

The benchmark, its infrastructure, and the load generator are all in benchmarks/app/ if you want to run it yourself.

Baseline numbers

12-core laptop, 18GB RAM, PostgreSQL 18 in Docker, 2 Forge instances, 1 primary + 1 read replica:

  users | active |  req/s |   p50     p90     p99 |  err %
──────────────────────────────────────────────────────────
    250 |    250 |   2373 |   2ms     9ms    41ms |  0.00%
    450 |    450 |   4324 |   1ms     4ms    11ms |  0.00%
    650 |    650 |   6251 |   1ms     4ms    11ms |  0.00%
    850 |    850 |   8155 |   1ms     3ms    33ms |  0.00%
   1050 |   1050 |  10022 |   1ms     3ms    69ms |  0.00%
   1250 |   1250 |  11659 |   1ms    14ms    74ms |  0.00%
   1450 |   1450 |  12535 |   2ms    46ms   110ms |  0.00%
   1650 |   1650 |  10980 |  27ms   128ms   265ms |  0.00%
   1850 |   1850 |  11533 |  35ms   151ms   290ms |  0.00%
   2050 |   2050 |  12203 |  47ms   165ms   266ms |  0.00%
   2250 |   2250 |  10962 |  85ms   248ms   404ms |  0.00%
   2450 |   2450 |   9065 |  85ms   282ms  2128ms |  8.46%

Peak throughput was 12,535 req/s at 1,450 concurrent users with p90 under 50ms. The system held zero errors through 2,250 concurrent SSE users, each maintaining a live subscription while generating 10 requests per second. The last level hit 8% errors as connection pool headroom ran out.

Every one of those 2,250 users had an open SSE connection watching a specific counter. When any user wrote to that counter (30% of all traffic), the change propagated through NOTIFY, invalidation debouncing, query re-execution, hash comparison, and SSE push. The numbers above include all of that overhead.

What the workload looks like

The benchmark app has two tables (users and counters) and five RPC handlers behind JWT auth. Each simulated user connects to SSE, subscribes to a counter, then runs this mix in a loop:

| Operation | Share | What it exercises |
|---|---|---|
| get_counter | 50% | Authenticated point read by primary key |
| list_counters | 20% | Paginated read (LIMIT 20) via replica |
| increment | 30% | Atomic UPDATE with row lock, triggers NOTIFY |

128 counters spread write contention across rows. With 2,000+ users, roughly 15 users share each counter, which means every write invalidates a subscription group and fans out to multiple SSE sessions.

What's actually happening on every write

The reported req/s only counts direct RPC calls. The actual database load is higher because each write triggers the reactivity pipeline:

increment (write)
→ Postgres NOTIFY on forge_changes channel
→ ChangeListener picks up notification
→ InvalidationEngine debounces (50ms quiet, 200ms max)
→ Reactor finds affected subscription groups
→ Re-executes query for each group
→ Hashes result, compares to previous
→ Pushes to subscribers over SSE (if changed)

At peak throughput (12k req/s), 30% writes means 3,600 writes per second across 128 counters. After debouncing, there are hundreds of additional query re-executions per second that the load generator never sees. The connection pool serves both visible RPC traffic and these invisible reactivity queries, so the effective DB load is meaningfully higher than the reported req/s.
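The coalescing arithmetic is worth seeing concretely. Below is a std-only sketch of the debounce rule described above (flush after 50ms of quiet or 200ms after the window opened, whichever comes first), operating on write timestamps rather than real NOTIFY events; `coalesced_flushes` and its signature are illustrative, not Forge's API:

```rust
/// Sketch of the invalidation debounce: a window flushes after `quiet` ms
/// with no new writes, or `max_wait` ms after the window opened, whichever
/// comes first. Returns how many re-executions a burst of writes collapses to.
fn coalesced_flushes(write_times_ms: &[u64], quiet: u64, max_wait: u64) -> u64 {
    let mut flushes = 0;
    let mut i = 0;
    while i < write_times_ms.len() {
        let window_start = write_times_ms[i];
        let mut last = window_start;
        i += 1;
        while i < write_times_ms.len() {
            let t = write_times_ms[i];
            // Flush if the writers went quiet or the window hit its max age.
            if t - last >= quiet || t - window_start >= max_wait {
                break;
            }
            last = t;
            i += 1;
        }
        flushes += 1;
    }
    flushes
}

fn main() {
    // 30 writes to one counter, 10ms apart (a 290ms burst), with the
    // 50ms quiet / 200ms max-wait defaults from the text:
    let writes: Vec<u64> = (0..30).map(|n| n * 10).collect();
    println!(
        "{} writes -> {} re-executions",
        writes.len(),
        coalesced_flushes(&writes, 50, 200)
    );
}
```

A sustained burst never goes quiet, so only the 200ms max-wait fires: the 30-write burst collapses into 2 re-executions.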

How it scales

Three resources gate throughput, and the benchmark is designed so you can see which one you hit first.

Connection pool

Each Forge instance maintains its own SQLx connection pools to the primary and replicas. The total connection count across all instances is what determines the first ceiling.

The baseline used 40 connections per instance (80 total). More connections means more headroom, and the relationship is close to linear until Postgres itself becomes the constraint. The upper bound is max_connections on the database.

You can see the pool saturation in the results: p90 stays under 15ms through 1,250 users, then jumps to 46ms at the next level as requests start queuing for a connection. Throughput goes flat while latency climbs. That's the pool.

Forge CPU

Once connection pool pressure is removed (more connections or a faster database), the Forge instances become the bottleneck. Each request involves JWT validation, JSON serialization, and a network round trip to Postgres. Writes add outbox buffer flushes and NOTIFY propagation.

The reactivity pipeline runs alongside direct RPC traffic. Subscription lookups go through a 64-shard DashMap, re-executions are bounded to 64 concurrent, and SSE fan-out is a series of mpsc channel sends. Each operation is cheap on its own, but the aggregate cost becomes visible past 10k req/s.
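The bounded re-execution idea can be shown with a minimal counting semaphore. This is a std-only sketch of the bounding concept, not Forge's actual implementation; `Semaphore` and `run_bounded` are illustrative names:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (std-only sketch).
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Self { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // loop guards against spurious wakeups
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `jobs` fake re-executions with at most `limit` in flight.
/// Returns the peak observed concurrency (always <= limit).
fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let gauge = Arc::new(Mutex::new((0usize, 0usize))); // (in_flight, peak)
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let gauge = Arc::clone(&gauge);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut g = gauge.lock().unwrap();
                    g.0 += 1;
                    g.1 = g.1.max(g.0);
                }
                thread::sleep(Duration::from_millis(5)); // stand-in for a query
                gauge.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let peak = gauge.lock().unwrap().1;
    peak
}

fn main() {
    println!("peak concurrency: {}", run_bounded(32, 4));
}
```

However many invalidations arrive at once, the in-flight query count never exceeds the limit, which keeps a write storm from monopolizing the connection pool.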

Adding Forge vCPUs scales throughput well until the database becomes the bottleneck.

Write throughput on the primary

All writes go to a single primary. Each increment takes a row-level lock, generates WAL, and fires NOTIFY. With 128 counters the contention spreads across rows, but at high write rates WAL generation and fsync (even with synchronous_commit = off) become the constraint.

Read replicas absorb all query traffic and reactivity re-executions. Adding replicas scales reads horizontally. The primary is the part that doesn't scale out.

Why SSE doesn't collapse under load

The reactivity system deduplicates subscriptions by hashing (query_name, args, auth_scope). Users subscribed to the same query with the same arguments share a single query group. The query executes once per invalidation window, and the result fans out to all subscribers through cheap mpsc channel sends.
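A sketch of the dedup idea, assuming arguments are canonicalized to a stable string before hashing. The `Subscription` struct and its field names mirror the tuple described above but are hypothetical, not Forge's actual types:

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical subscription descriptor mirroring the hashed tuple.
#[derive(Hash)]
struct Subscription {
    query_name: String,
    args: String,       // canonicalized arguments, e.g. serialized JSON
    auth_scope: String, // a scope/role, not a user id, so users can share groups
}

fn group_key(sub: &Subscription) -> u64 {
    let mut h = DefaultHasher::new();
    sub.hash(&mut h);
    h.finish()
}

/// How many distinct query groups a set of subscriptions collapses into.
fn distinct_groups(subs: &[Subscription]) -> usize {
    subs.iter().map(group_key).collect::<HashSet<_>>().len()
}

fn main() {
    // 2,000 users each watching one of 128 counters -> 128 groups.
    let subs: Vec<Subscription> = (0..2000)
        .map(|u| Subscription {
            query_name: "get_counter".to_string(),
            args: format!("{{\"counter_id\":{}}}", u % 128),
            auth_scope: "user".to_string(),
        })
        .collect();
    println!("{} subscribers -> {} query groups", subs.len(), distinct_groups(&subs));
}
```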

With 128 counters and 2,000 users, there are 128 subscription groups (users watching the same counter share a group). Adding more users to existing counters doesn't increase query re-execution count, only fan-out. The invalidation engine's debounce window (50ms quiet, 200ms max forced flush) coalesces rapid writes into fewer re-executions.

This is why the system held 2,250 SSE users with zero errors: the per-user cost of a subscription is a channel send, not a database query.
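The fan-out cost is easy to see in miniature: one result, one clone-and-send per subscriber, using std `mpsc` channels. `fan_out` is an illustrative helper, not Forge's API:

```rust
use std::sync::mpsc;

/// Deliver one query result to every subscriber in a group.
/// Per-subscriber cost is a clone and a channel send, not a query.
fn fan_out(result: &str, subscribers: &[mpsc::Sender<String>]) -> usize {
    for tx in subscribers {
        tx.send(result.to_string()).expect("subscriber went away");
    }
    subscribers.len()
}

fn main() {
    // ~15 users share each counter at 2,000 users across 128 counters.
    let (txs, rxs): (Vec<_>, Vec<_>) = (0..15).map(|_| mpsc::channel::<String>()).unzip();
    let delivered = fan_out("{\"counter\":7,\"value\":42}", &txs);
    for rx in &rxs {
        assert_eq!(rx.recv().unwrap(), "{\"counter\":7,\"value\":42}");
    }
    println!("1 execution -> {} deliveries", delivered);
}
```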

Reproducing the benchmark

The full benchmark lives in benchmarks/app/. Run it locally:

# needs: rust 1.92+, docker
./benchmarks/app/run.sh

This builds release binaries, starts PostgreSQL (1 primary + 1 read replica) in Docker, launches 2 Forge instances, warms up for 60 seconds, then ramps +200 users per level until p90 exceeds 2 seconds or errors pass 2%.
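The ramp's stop condition can be expressed as a pure function over each level's results. This is a sketch of the rule described above; the load generator's internals may differ:

```rust
/// A level passes if p90 stays at or under 2s and errors at or under 2%.
fn level_passed(p90_ms: u64, err_pct: f64) -> bool {
    p90_ms <= 2000 && err_pct <= 2.0
}

/// Walk ramp levels in order; report the last user count that passed.
/// Each level is (users, p90_ms, err_pct).
fn highest_passing_users(levels: &[(u64, u64, f64)]) -> Option<u64> {
    let mut best = None;
    for &(users, p90_ms, err_pct) in levels {
        if level_passed(p90_ms, err_pct) {
            best = Some(users);
        } else {
            break; // the ramp stops at the first failing level
        }
    }
    best
}

fn main() {
    // The tail of the baseline run: 2,450 users failed on errors (8.46%).
    let tail = [(2050, 165, 0.0), (2250, 248, 0.0), (2450, 282, 8.46)];
    println!("last passing level: {:?}", highest_passing_users(&tail));
}
```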

Point it at external infrastructure to test cloud deployments:

# external database
./benchmarks/app/run.sh \
--database-url 'postgres://user:pass@primary.rds.amazonaws.com/app' \
--replica-url 'postgres://user:pass@replica.rds.amazonaws.com/app'

# external Forge instances
./benchmarks/app/run.sh \
--forge-url 'http://10.0.1.10:9081' \
--forge-url 'http://10.0.1.11:9081'

PostgreSQL version

The benchmark defaults to PG 18. This workload benefits from PG 18's async I/O and improved connection handling under high concurrency. PG 16 and 17 work fine but expect lower throughput at the same connection count. Change image: postgres:18 in infra/docker-compose.yml to compare.

Tuning knobs

| Parameter | Default | Effect |
|---|---|---|
| POOL_SIZE (run.sh) | 40 | DB connections per Forge instance |
| FORGE_INSTANCES (run.sh) | 2 | Forge processes to start |
| Ramp step (loadgen) | 200 | Users added per level |
| Counter count (loadgen) | 128 | Write contention spread |
| Action interval (loadgen) | 100ms | 10 req/s per user |

What concurrent users mean in practice

The benchmark measures concurrent SSE connections, not unique visitors. In a typical SaaS or dashboard product where users keep tabs open, concurrent users are roughly 5-15% of daily active users depending on average session length and time-zone spread.

| Concurrent | DAU (at 10% online) | MAU (at 30% daily return) |
|---|---|---|
| 1,000 | ~10,000 | ~30,000 |
| 2,500 | ~25,000 | ~75,000 |
| 5,000 | ~50,000 | ~150,000 |
| 10,000 | ~100,000 | ~300,000 |

The 10% concurrent-to-DAU ratio is conservative for apps with long sessions (monitoring dashboards, project management, collaboration tools). Short-session apps like e-commerce might see 3-5%, which means the same infrastructure serves even more users. The 30% DAU-to-MAU ratio is typical for a B2B SaaS product with moderate engagement.
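The conversion behind the table is simple division. A sketch, where the 10% online and 30% daily-return ratios are the assumptions stated above, not measurements (note the table rounds the MAU column, e.g. ~33k shown as ~30,000):

```rust
/// Back out DAU and MAU from a concurrency target.
/// `online_ratio`: fraction of DAU concurrently connected.
/// `daily_return_ratio`: fraction of MAU active on a given day.
fn audience(concurrent: u64, online_ratio: f64, daily_return_ratio: f64) -> (u64, u64) {
    let dau = (concurrent as f64 / online_ratio).round() as u64;
    let mau = (dau as f64 / daily_return_ratio).round() as u64;
    (dau, mau)
}

fn main() {
    // The baseline ceiling of 2,250 concurrent users at the default ratios.
    let (dau, mau) = audience(2_250, 0.10, 0.30);
    println!("2,250 concurrent ~ {} DAU ~ {} MAU", dau, mau);
}
```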

Scaling to 10,000 concurrent users

The baseline hit 2,250 concurrent users on a laptop with everything sharing 12 cores and Postgres running in Docker. Getting to 10,000 requires roughly 4.4x the capacity. Here's what that looks like on dedicated infrastructure, and what it costs.

The math

From the baseline, we know the bottleneck order: connection pool first, then CPU, then Postgres writes. Each needs to scale:

Connections. The baseline used 80 connections for 2,250 users. At 10,000 users the RPC throughput is roughly 50k req/s (4.4x the ~11k req/s the baseline sustained at 2,250 users), plus reactivity re-executions on top. That needs around 300-400 connections total.

Forge CPU. The baseline had roughly 4-6 effective vCPUs for Forge (the rest was shared with Postgres, Docker, and the load generator). At 50k req/s with JWT validation and JSON serialization on every request, plus the reactivity pipeline, roughly 16-20 dedicated vCPUs should provide headroom.

Postgres writes. 30% of 50k req/s is 15,000 writes/s, each generating WAL and firing NOTIFY. That needs a primary with enough I/O throughput and CPU to handle the write load. A 4-8 vCPU instance with provisioned IOPS covers it. Reads and reactivity queries go to replicas.
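The three estimates above all come from scaling the baseline's last clean level linearly. A sketch of that arithmetic; the baseline constants are from the results table, and linearity is an assumption that held in the benchmark until the pool saturated:

```rust
/// Linear projection from the baseline's last clean level
/// (2,250 users, ~11k req/s, 80 pooled connections).
/// Returns (req/s, connections, writes/s). Estimates, not promises.
fn project(target_users: f64) -> (f64, f64, f64) {
    let (baseline_users, baseline_rps, baseline_conns) = (2_250.0, 11_000.0, 80.0);
    let scale = target_users / baseline_users; // 10,000 / 2,250 ~= 4.44
    let rps = baseline_rps * scale;            // ~49k req/s
    let conns = baseline_conns * scale;        // ~356 connections
    let writes_per_s = rps * 0.30;             // 30% write mix -> ~14.7k/s
    (rps, conns, writes_per_s)
}

fn main() {
    let (rps, conns, writes) = project(10_000.0);
    println!("~{:.0} req/s, ~{:.0} connections, ~{:.0} writes/s", rps, conns, writes);
}
```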

Infrastructure

| Component | Spec | Purpose | Monthly cost (AWS, us-east-1) |
|---|---|---|---|
| 4x Forge (c7g.xlarge) | 4 vCPU, 8GB each | RPC, SSE, reactivity | ~$400 |
| RDS primary (r7g.xlarge) | 4 vCPU, 32GB | Writes, NOTIFY, WAL | ~$350 |
| 2x RDS read replica (r7g.large) | 2 vCPU, 16GB each | Reads, reactivity re-executions | ~$350 |
| ALB + storage + transfer | | Load balancing, gp3 volumes | ~$100 |
| Total | | | ~$1,200/month |

Each Forge instance runs pool_size = 80, giving 320 total connections across the cluster. The 2 read replicas handle all query traffic and reactivity re-executions. The primary is dedicated to writes and replication.

This is on-demand pricing. Reserved instances or savings plans bring it closer to $800/month. Graviton (ARM) instances are used here because Rust compiles to aarch64 natively and they offer better price-per-vCPU than x86.

Headroom beyond 10,000

The $1,200/month setup doesn't max out at exactly 10,000 users. There's headroom in the connection pool (320 connections with some spare capacity) and CPU (16 vCPUs with room before saturation). The actual ceiling depends on query complexity and write ratio, which is why the benchmark exists.

To push further, the scaling levers are:

More Forge instances. Each c7g.xlarge adds 4 vCPUs and 80 connections for ~$100/month. Throughput growth is close to linear until the database can't keep up.

Bigger primary. Upgrading to r7g.2xlarge (8 vCPU, 64GB) roughly doubles write capacity for an extra ~$350/month. This is the lever to pull when the write ratio is higher than 30%.

More read replicas. Each r7g.large replica adds read capacity for ~$175/month. Useful when the query mix is heavier or when reactivity groups are growing.

The wall that doesn't move is the single primary. All writes and all NOTIFY events go through one Postgres instance. At some point that saturates, and the options become vertical scaling, write batching, or sharding by tenant, all of which are outside Forge's scope.

Cost at different scales

| Concurrent users | Approx. monthly cost | Key components |
|---|---|---|
| 2,500 | ~$600 | 2x Forge, 1 primary, 1 replica |
| 5,000 | ~$900 | 3x Forge, 1 primary, 2 replicas |
| 10,000 | ~$1,200 | 4x Forge, 1 primary (xlarge), 2 replicas |
| 25,000 | ~$2,500 | 8x Forge, 1 primary (2xlarge), 3 replicas |

These are projections based on the scaling characteristics observed in the benchmark. The benchmark supports --database-url and --forge-url flags so you can validate against your actual target infrastructure instead of trusting estimates.