Rolling back a deploy

A rollback is just a blue-green deploy pointed the other way. The version you are rolling back to becomes the “new” version: it takes new calls while the bad version drains and exits. Because no live call is ever moved, a rollback drops zero calls, exactly like a forward deploy.

There is no separate rollback mechanism to learn. If you can deploy, you can roll back: the primitives (deployment_version, drain, signed membership, audit) are symmetric. Keep the previous version’s image tagged and ready so the platform can start it without a rebuild.

When to roll back (decision tree)

Rollback is the right move when the new version is actively harming calls and a fix is not one commit away. Fix forward when the fault is small, understood, and faster to patch than to reverse.

Signal after a deploy	Action
New-version sessions error or drop at an elevated rate	Roll back now. Drain the new version, bring the old one back.
A provider/config regression affecting every new call	Roll back now. The blast radius is the whole fleet.
A narrow bug (one agent, one tenant) with an obvious one-line fix	Fix forward. Ship a new version; rolling back loses the good parts too.
Degraded quality or cost (not correctness)	Escalate to voicegateway’s signals first. Cost/quality/latency live there, not in OpenRTC. Roll back only if it is a hard regression.
Unsure	Roll back. Reversing to a known-good version is the conservative default; investigate off the critical path.

The rollback walkthrough

Identify the last good version. From the fleet’s deployment_version distribution or your deploy history, pick the version that was healthy before this deploy.
Start the good version alongside the bad one. Your platform starts workers tagged deployment_version="<last-good>". New jobs begin landing on them. If a leftover bad-version worker must be kept from grabbing new traffic, gate it with signed membership against the rolled-back manifest.
Drain the bad version. Signal each bad-version worker to drain. In production this is the SIGTERM your platform sends when retiring the pods; to trigger it from a coordinator:
```
pool.begin_drain()   # bad version stops taking calls; in-flight run to hangup
```

Record the rollback. Emit a deployment.rolled_back audit event naming the version you left and the version you returned to, so the compliance trail shows the reversal and its reason:

pool.audit_log.emit(
    "deployment.rolled_back",
    actor="on-call",
    target="fleet",
    version="<last-good>",
    from_version="<bad>",
    reason="elevated session errors",
)

Confirm the fleet is clean. Every worker reports the good version and the bad-version workers have exited (active_sessions drained to zero). See monitoring a deploy.

What a rollback does not do

It does not rewind calls that already finished on the bad version. Those calls are over; a rollback only governs which version handles calls from now on.
It does not interrupt calls still in flight on the bad version. They finish on the bad version (drain), then that worker exits. If the bug makes in-flight calls unsafe to continue, ending them is an application decision, not a deployment one.

Next: monitoring a deploy and the audit-event reference.

​Rolling back a deploy

​When to roll back (decision tree)

​The rollback walkthrough

​What a rollback does not do

Rolling back a deploy

When to roll back (decision tree)

The rollback walkthrough

What a rollback does not do