Skip to main content

Debugging Density

Running many sessions in one worker is efficient until one session misbehaves and you cannot tell which. This runbook is the flow for “my pool feels slow” or “one session is hot”, using openrtc top and the slow-session detector. It assumes coroutine mode with introspection on (the default).

1. Look at the pool

Open the live inspector next to your worker:
openrtc top
Scan the table:
  • One slow row. A session is blocking the shared event loop. Jump to section 3.
  • One row with a high, sustained cpu%. A session is CPU-heavy (a tight loop, a large sync transform). See section 4.
  • peak climbing across the board, mem(MB) trending up. Worker-level memory pressure. See section 5.
  • Nothing stands out but latency is bad. The bottleneck is likely in the voice pipeline (STT/LLM/TTS), which OpenRTC does not measure. See section 6.
Sort and filter to focus:
openrtc top --sort cpu_pct      # busiest sessions first
openrtc top --status slow       # only sessions blocking the loop

2. Confirm it is density, not a single wedged call

Press s to sort by duration_s. A single very long-lived session that should have ended can look like load. If a call is stuck, that is a session bug (an await that never resolves), not a density problem. Fix it in the agent.

3. A session is blocking the loop

A slow status means the detector measured the event loop stalling while that session’s task was on-CPU. Check the worker logs for the attribution line:
[slow-session] session_id=job-c1d550 blocked event loop for 320ms
The usual cause is a synchronous blocking call inside an otherwise async agent: a sync HTTP client, time.sleep, a blocking file or DB call, or a heavy CPU section run inline. The fix is always the same shape (get it off the loop):
  • Use the async client (aiohttp/httpx.AsyncClient) instead of a sync one.
  • Wrap unavoidable blocking work in await asyncio.to_thread(...).
  • Break large CPU loops into chunks that await asyncio.sleep(0) between them.
Lower slow_session_threshold_ms (default 50 ms) to catch smaller stalls while hunting, e.g. AgentPool(slow_session_threshold_ms=20).
The detector reports the session and duration, not the exact source line (stack sampling is deferred). Use the session id to find the call in that agent’s code; a block over ~50 ms in an async agent is almost always one sync call.

4. A session is CPU-hot

A high, sustained cpu% without a slow status is a session doing real work that is not (yet) blocking the loop, but it still competes for the one core. Remember cpu% is a sampled share, not exact seconds: use it to rank sessions, not to bill them. If one agent is consistently hot, move its heavy work to asyncio.to_thread, or run that agent under isolation="process" so it gets its own core and cannot starve the others. mem(MB) is an equal share of process RSS, so per-session numbers sum back to the real RSS (watch the trend, not any single row). If the total is climbing toward your memory_limit_mb, the worker will drain and restart when it crosses the limit (coroutine caps are worker-level, not per-session). To find a leak, restart with isolation="process" temporarily so the OS accounts memory per session, or reduce max_concurrent_sessions to lower peak pressure.

6. Nothing in OpenRTC explains it

If sessions look healthy in openrtc top but calls are still slow, the bottleneck is in the voice pipeline (STT, LLM, or TTS latency), which OpenRTC does not see (it sees coroutines, not providers). That is voicegateway’s lane: cost, provider latency, and quality metrics live there, keyed by the agent_name and metadata["tenant"] OpenRTC emits. Look there for per-provider latency, not here.

Quick reference

SymptomFirst moveLikely fix
One slow rowRead the [slow-session] log lineMove the sync call off the loop
High cpu%, not slowopenrtc top --sort cpu_pctto_thread or isolation="process"
mem(MB) climbingWatch the RSS trendLower concurrency / find the leak in process mode
Healthy table, slow calls(not an OpenRTC issue)Pipeline latency: voicegateway