Density Benchmark — v0.1

Phase 1 success gate from docs/design/v0.1.md §7:

≥ 50 concurrent sessions per worker process at ≤ 4 GB peak RSS, no errors.

This run passes the gate, with substantial headroom. Re-run after any behavioral change to the coroutine path; record new numbers below the existing table rather than overwriting (one row per session-count config per environment).

Methodology

The harness lives in tests/benchmarks/density.py. It constructs the same CoroutinePool chain _CoroutineAgentServer would build, then launches N concurrent fake-job sessions through it. Each session entrypoint:

  1. allocates a 5 MB bytearray (per-session footprint stand-in),
  2. holds the buffer for ~1 s via await asyncio.sleep(1.0),
  3. drops the buffer and exits.
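The three steps above can be sketched as a minimal coroutine. This is a hedged reconstruction — `fake_session` and its parameters are hypothetical names; the real entrypoint lives in tests/benchmarks/density.py:

```python
import asyncio

async def fake_session(buffer_mb: int = 5, hold_s: float = 1.0) -> None:
    # 1. allocate a per-session footprint stand-in
    buf = bytearray(buffer_mb * 1024 * 1024)
    # 2. hold the buffer across an await so it stays resident for ~hold_s
    await asyncio.sleep(hold_s)
    # 3. drop the buffer and exit
    del buf
```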

A background asyncio task samples openrtc.observability.metrics.process_resident_set_bytes() every 50 ms throughout the run; we record the maximum and the delta from baseline.
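The sampler can be sketched generically. `sample_peak` and its `read_rss` parameter are hypothetical stand-ins for the harness's use of `openrtc.observability.metrics.process_resident_set_bytes()`; the shape (poll on a fixed cadence, keep the max, stop on an event) matches the description above:

```python
import asyncio
from typing import Callable

async def sample_peak(read_rss: Callable[[], int], stop: asyncio.Event,
                      interval_s: float = 0.05) -> int:
    """Poll read_rss() every interval_s until `stop` is set; return the peak seen."""
    peak = read_rss()  # baseline sample before any sessions launch
    while not stop.is_set():
        await asyncio.sleep(interval_s)
        peak = max(peak, read_rss())
    return peak
```

The peak minus the baseline sample gives the delta-RSS column reported below.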

Caveats:

  • 5 MB per session is intentionally low. It exercises Python task scheduling and coroutine dispatch overhead, not realistic per-session memory pressure. The realistic ~60 MB/session target (audio buffers, WebRTC peer connection state, LLM context) validates against the §8.4 real-LiveKit integration test in Phase 2.
  • No real WebRTC, no real STT/LLM/TTS. AgentSession, rtc.Room, and the inference executor are bypassed via stubs. A real worker carries process-wide overhead from the Silero VAD and turn-detector models (~250-400 MB on macOS) that the benchmark replaces with a no-op prewarm.
  • One worker process. No multi-worker scaling claim is implied.

To reproduce a row:

```bash
uv run python tests/benchmarks/density.py --sessions 50 --json
uv run python tests/benchmarks/density.py --sessions 50 --rss-budget-mb 4096
```

Exit codes: 0 success, 2 peak RSS over budget, 3 any session error.
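A CI wrapper might dispatch on those exit codes like this. `run_gate` and `MESSAGES` are hypothetical, not part of the harness; only the 0/2/3 code meanings come from the text above:

```python
import subprocess
import sys

# Exit-code meanings from the benchmark: 0 success, 2 over budget, 3 session error.
MESSAGES = {0: "gate passed", 2: "peak RSS over budget", 3: "session error"}

def run_gate(cmd: list[str]) -> str:
    """Run the benchmark command and translate its exit code to a message."""
    result = subprocess.run(cmd)
    return MESSAGES.get(result.returncode, f"unexpected exit {result.returncode}")
```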

Results

2026-05-03 — local: macOS Darwin 24.3.0 / Python 3.13.5 / uv 0.8.15 / arm64

Three back-to-back runs at the §7 gate (50 sessions, 4096 MB budget) plus a headroom sweep:

| Run | Sessions | Successes | Failures | Baseline RSS | Peak RSS | Delta RSS | Elapsed | Within budget |
|-----|----------|-----------|----------|--------------|----------|-----------|---------|---------------|
| 1   | 50       | 50        | 0        | 115.5 MB     | 366.5 MB | 250.9 MB  | 1.08 s  | ✓             |
| 2   | 50       | 50        | 0        | 115.8 MB     | 366.8 MB | 251.0 MB  | 1.03 s  | ✓             |
| 3   | 50       | 50        | 0        | 115.9 MB     | 366.9 MB | 251.0 MB  | 1.04 s  | ✓             |
| 4   | 100      | 100       | 0        | 114.9 MB     | 616.9 MB | 502.0 MB  | 1.10 s  | ✓             |
| 5   | 200      | 200       | 0        | 115.7 MB     | 1072.7 MB | 956.9 MB | 1.19 s  | ✓             |
| 6   | 500      | 500       | 0        | 114.8 MB     | 1370.4 MB | 1255.7 MB | 1.30 s | ✓ (8 GB cap)  |

Notes:

  • Per-session memory tracks the 5 MB buffer up to ~200 sessions; at 500 sessions, freed buffers are recycled before the peak is reached and the amortized cost drops to ~2.5 MB per session. This says nothing about real workloads — 5 MB buffers are tiny — but it confirms the asyncio scheduler is not pathologically expensive at scale.
  • Walltime stays in the 1.0-1.3 s band (essentially the 1 s sleep + tiny setup/teardown) across 50-500 sessions. There is no quadratic spawning cost in the pool's launch_job path.
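The flat-walltime observation follows from launching every session concurrently: with linear spawn cost, elapsed time stays at roughly the hold duration plus a small constant regardless of N. A minimal sketch — `run_density` and `session` are hypothetical names, not the pool's actual `launch_job` path:

```python
import asyncio
import time

async def session(hold_s: float) -> None:
    # stand-in for one fake-job session
    await asyncio.sleep(hold_s)

async def run_density(n: int, hold_s: float = 0.1) -> float:
    """Launch n concurrent sessions; return total walltime in seconds."""
    start = time.monotonic()
    await asyncio.gather(*(session(hold_s) for _ in range(n)))
    return time.monotonic() - start
```

If spawning were quadratic, elapsed time would grow visibly with N instead of hugging the sleep duration.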

Verdict

Phase 1 §7 gate met. Peak RSS at 50 sessions is 367 MB, leaving ~3.7 GB of headroom against the 4 GB budget. The gate exists to verify the coroutine architecture supports many concurrent sessions in one process; with the stub workload it does, comfortably. The realistic per-session footprint validation (and the ~50-100 sessions per 4 GB working number) is deferred to the §8.4 real-LiveKit integration tests once the dev-server harness lands in Phase 2.

2026-05-05 — local: macOS Darwin 24.3.0 / Python 3.13.5 / arm64 (10 cores, 16 GB)

Re-run after tests/benchmarks/density.py was extended with scheduler-latency sampling (10 ms cadence, median + p99 + max) and a hardware fingerprint dict (commit 32bde3a). Three back-to-back runs at the §7 gate:

| Run | Sessions | Successes | Failures | Baseline RSS | Peak RSS | Delta RSS | Elapsed | Sched p99 | Sched max | Within budget |
|-----|----------|-----------|----------|--------------|----------|-----------|---------|-----------|-----------|---------------|
| A   | 50       | 50        | 0        | 116.1 MB     | 367.0 MB | 251.0 MB  | 1.06 s  | 6.17 ms   | 50.27 ms  | ✓             |
| B   | 50       | 50        | 0        | 116.5 MB     | 354.0 MB | 237.5 MB  | 1.08 s  | 5.64 ms   | 63.66 ms  | ✓             |
| C   | 50       | 50        | 0        | 116.5 MB     | 367.4 MB | 251.0 MB  | 1.02 s  | 3.19 ms   | 3.36 ms   | ✓             |

Hardware fingerprint (identical across runs): arm / 10 cores / 16 GB total / Darwin 24.3.0 / Python 3.13.5.

Notes:

  • Peak-RSS numbers track the 2026-05-03 row at the same N=50 config (~367 MB), confirming no regression from the benchmark instrumentation additions.
  • Scheduler median latency holds in the 1.06-1.10 ms band — well below the 10 ms sampling interval, so the loop is not starved at this load.
  • Scheduler p99 sits at 3-6 ms; the higher values (50-64 ms) on runs A and B come from a single-sample tail spike each (the max column), most likely a transient OS scheduling event on a busy laptop. Run C, with all background processes quiet, lands at p99 = 3.19 ms / max = 3.36 ms — the clean baseline. The p99 is the load-bearing number for worker stability; the tail max is an environmental artefact.
  • Walltime stays in the same 1.0-1.1 s band (≈ 1 s sleep + setup).
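The idea behind the latency metric — measure how late the loop services a fixed-cadence sleep — can be sketched as follows. `sample_sched_latency` is a hypothetical stand-in for the harness's instrumentation, not its actual code:

```python
import asyncio
import time

async def sample_sched_latency(n: int = 100, cadence_s: float = 0.01) -> dict:
    """Sleep on a fixed cadence n times; report how late each wake-up was, in ms."""
    lags = []
    for _ in range(n):
        t0 = time.perf_counter()
        await asyncio.sleep(cadence_s)
        # anything beyond the requested sleep is scheduler (or OS) delay
        lags.append((time.perf_counter() - t0 - cadence_s) * 1000.0)
    lags.sort()
    return {
        "median_ms": lags[len(lags) // 2],
        "p99_ms": lags[min(len(lags) - 1, int(len(lags) * 0.99))],
        "max_ms": lags[-1],
    }
```

Under load, a starved loop shows up as the median drifting toward the cadence interval; here the median stays near 1 ms, so the loop is healthy.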

Verdict (2026-05-05)

Phase 1 §7 gate continues to pass. All three runs hit 50/50 sessions, 0 failures, peak RSS ≤ 367 MB. The new scheduler-latency metric provides additional Phase 2 capacity-planning input: a healthy loop runs at ~1 ms median / ~3 ms p99 under a 50-session stub workload.

Released under the MIT License.