Migration and drain
“Zero-downtime upgrade” can mean two very different things:
- Migration: pick up a live call from the old worker, move its state to the new worker, and resume it there. The caller never notices.
- Drain: stop sending new calls to the old worker, let its live calls finish where they are, and route new calls to the new worker.
What a live session is made of
A live AgentSession carries three kinds of state. The full inventory is in the worker state inventory; the summary is:
| Kind | Examples | Can it move? |
|---|---|---|
| Serializable | Conversation history, agent identity, tenant, job metadata | Yes, it is plain data. |
| Derivable | Agent class + instructions, provider config, prewarmed VAD | Yes, rebuilt from config on the new worker. |
| Live | The WebRTC transport, in-flight STT/LLM/TTS streams, turn state, open provider sockets | No. Bound to this process and this moment. |
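
The table can be read as a simple classification: every piece of session state belongs to exactly one kind, and only the live kind blocks a move. The sketch below illustrates that idea only. The field names and the `can_move` and `immovable_fields` helpers are hypothetical, derived from the table's examples, and are not OpenRTC's actual data model.

```python
from enum import Enum


class StateKind(Enum):
    SERIALIZABLE = "serializable"  # plain data, can be copied as-is
    DERIVABLE = "derivable"        # rebuilt from config on the new worker
    LIVE = "live"                  # bound to this process and this moment


# Hypothetical field names, taken from the examples in the table above.
SESSION_STATE = {
    "conversation_history": StateKind.SERIALIZABLE,
    "agent_identity": StateKind.SERIALIZABLE,
    "tenant": StateKind.SERIALIZABLE,
    "job_metadata": StateKind.SERIALIZABLE,
    "agent_class_and_instructions": StateKind.DERIVABLE,
    "provider_config": StateKind.DERIVABLE,
    "prewarmed_vad": StateKind.DERIVABLE,
    "webrtc_transport": StateKind.LIVE,
    "in_flight_streams": StateKind.LIVE,
    "turn_state": StateKind.LIVE,
    "provider_sockets": StateKind.LIVE,
}


def can_move(kind: StateKind) -> bool:
    """Serializable and derivable state can reach another worker; live state cannot."""
    return kind is not StateKind.LIVE


def immovable_fields(state: dict[str, StateKind]) -> list[str]:
    """Return the names of the fields that pin a session to its current worker."""
    return sorted(name for name, kind in state.items() if not can_move(kind))
```

As long as `immovable_fields(SESSION_STATE)` is non-empty, the session as a whole cannot be moved without a gap, however portable the rest of its state is.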
Why the live state cannot move
The blocker is the live row. A voice call holds an open WebRTC transport to the caller and, at any instant, an in-flight STT, LLM, or TTS stream. You cannot serialize a token half-generated by the LLM or an audio buffer mid-synthesis and resume it in another process without dropping that turn. And the caller’s SDK is not built to be silently re-pointed at a new server mid-call: that would require a renegotiation the SDK does not expose. Moving a live call means a gap the caller hears, which is not zero-downtime for the person on the phone.
So OpenRTC does not migrate. It drains: the call finishes on the worker it started on, uninterrupted, and only new calls land on the new version. See zero-downtime deployments for the mechanics.
What this means in practice
- A live call is never paused, moved, or resumed elsewhere. It runs to its natural end on its original worker.
- A new version affects only calls that start after it is deployed. To have new code affect an in-progress call, either wait for that call to end, or use hot reload, which swaps agent code within a running worker (a different mechanism, not a cross-worker move).
- Draining is reused from the graceful-shutdown path, not a new subsystem: a worker draining for a deploy and a worker draining for SIGTERM do the same thing.
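
The points above can be sketched with a toy worker built on `asyncio`. This is a minimal illustration under stated assumptions, not OpenRTC's implementation: the `Worker` class, its `begin_drain`, `wait_for_calls`, and `drain` methods, and the `route` function are all hypothetical names, and a call is simulated by a sleep.

```python
import asyncio


class Worker:
    """Toy worker (hypothetical): accepts calls until it is told to drain."""

    def __init__(self, version: str) -> None:
        self.version = version
        self.draining = False
        self.finished: list[str] = []
        self._live: set[asyncio.Task] = set()

    def start_call(self, call_id: str, duration: float) -> None:
        if self.draining:
            raise RuntimeError("worker is draining; route this call elsewhere")
        task = asyncio.create_task(self._run(call_id, duration))
        self._live.add(task)
        task.add_done_callback(self._live.discard)

    async def _run(self, call_id: str, duration: float) -> None:
        await asyncio.sleep(duration)  # stands in for the whole live call
        self.finished.append(call_id)

    def begin_drain(self) -> None:
        """Stop accepting new calls. Live calls are left untouched."""
        self.draining = True

    async def wait_for_calls(self) -> None:
        """Wait until every live call has run to its natural end."""
        if self._live:
            await asyncio.gather(*self._live)

    async def drain(self) -> None:
        # One path for both triggers: a deploy and a SIGTERM handler
        # would call this same method.
        self.begin_drain()
        await self.wait_for_calls()


def route(workers: list[Worker], call_id: str, duration: float) -> Worker:
    """Send a new call to the first worker that is not draining."""
    for worker in workers:
        if not worker.draining:
            worker.start_call(call_id, duration)
            return worker
    raise RuntimeError("no worker is accepting calls")


async def demo() -> tuple[list[str], str]:
    old, new = Worker("v1"), Worker("v2")
    route([old, new], "call-a", 0.05)            # lands on the old worker
    old.begin_drain()                            # deploy starts
    target = route([old, new], "call-b", 0.01)   # new call skips the old worker
    await old.wait_for_calls()                   # call-a finishes where it began
    return old.finished, target.version
```

Running `asyncio.run(demo())` shows `call-a` completing on the old worker while `call-b` is routed to the new one. Nothing in the sketch copies or transfers a call, which is the point: draining only changes where new calls go.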
Could migration ever return?
Possibly, but for a different feature than live upgrades. A “pause this call and resume it later” feature would only ever move the serializable and derivable state (never the live streams), so it would necessarily involve a real gap the caller consents to, not a silent handoff. If that is ever built, the recorded format choice is msgpack (compact, Python-native types, versioned header). The migration serialize/deserialize APIs and the migration.* audit events are reserved for that future and are deliberately not emitted today.
Until then: drain, do not migrate.