Rolling back a deploy

A rollback is just a blue-green deploy pointed the other way. The version you are rolling back to becomes the “new” version: it takes new calls while the bad version drains and exits. Because no live call is ever moved, a rollback drops zero calls, exactly like a forward deploy.
There is no separate rollback mechanism to learn. If you can deploy, you can roll back: the primitives (deployment_version, drain, signed membership, audit) are symmetric. Keep the previous version’s image tagged and ready so the platform can start it without a rebuild.
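To make the symmetry concrete, here is a minimal sketch in plain Python. The names Transition, plan_deploy, and plan_rollback are hypothetical and exist only for this illustration; they are not part of the OpenRTC API. The point is only that a rollback plan is a deploy plan with the two versions swapped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One blue-green step (hypothetical type, for illustration only)."""
    start_version: str   # version that begins taking new calls
    drain_version: str   # version that stops taking calls and drains


def plan_deploy(current: str, target: str) -> Transition:
    # Forward deploy: the target takes new calls, the current version drains.
    return Transition(start_version=target, drain_version=current)


def plan_rollback(bad: str, last_good: str) -> Transition:
    # Rollback: the same transition, with the last good version as the target.
    return plan_deploy(current=bad, target=last_good)
```

Calling plan_rollback("v2", "v1") yields the same kind of plan as a forward deploy from v2 to v1, which is why no separate rollback machinery is needed.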

When to roll back (decision tree)

Rollback is the right move when the new version is actively harming calls and a fix is not one commit away. Fix forward when the fault is small, understood, and faster to patch than to reverse.
Signal after a deploy | Action
New-version sessions error or drop at an elevated rate | Roll back now. Drain the new version, bring the old one back.
A provider/config regression affecting every new call | Roll back now. The blast radius is the whole fleet.
A narrow bug (one agent, one tenant) with an obvious one-line fix | Fix forward. Ship a new version; rolling back loses the good parts too.
Degraded quality or cost (not correctness) | Escalate to voicegateway’s signals first. Cost/quality/latency live there, not in OpenRTC. Roll back only if it is a hard regression.
Unsure | Roll back. Reversing to a known-good version is the conservative default; investigate off the critical path.
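
The decision rules can also be written down as a small function, which is handy for a runbook or an alert handler. This is a sketch under assumptions: the signal strings, the Action enum, and decide are made-up names for this example, not something OpenRTC or voicegateway provides.

```python
from enum import Enum


class Action(Enum):
    ROLL_BACK = "roll back"
    FIX_FORWARD = "fix forward"
    ESCALATE = "escalate to voicegateway signals"


def decide(signal: str, hard_regression: bool = False) -> Action:
    """Map a post-deploy signal to an action (hypothetical signal names)."""
    if signal in ("elevated_session_errors", "fleet_wide_regression"):
        # The new version is actively harming calls: reverse now.
        return Action.ROLL_BACK
    if signal == "narrow_bug_obvious_fix":
        # Small, understood, one-line fix: ship a new version instead.
        return Action.FIX_FORWARD
    if signal == "degraded_quality_or_cost":
        # Not a correctness problem; roll back only for a hard regression.
        return Action.ROLL_BACK if hard_regression else Action.ESCALATE
    # Anything unrecognized counts as "unsure": the conservative default.
    return Action.ROLL_BACK
```

Note that the fall-through case returns ROLL_BACK, so an unclassified signal gets the same conservative treatment as "Unsure".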

The rollback walkthrough

  1. Identify the last good version. From the fleet’s deployment_version distribution or your deploy history, pick the version that was healthy before this deploy.
  2. Start the good version alongside the bad one. Your platform starts workers tagged deployment_version="<last-good>". New jobs begin landing on them. If a leftover bad-version worker must be kept from grabbing new traffic, gate it with signed membership against the rolled-back manifest.
  3. Drain the bad version. Signal each bad-version worker to drain. In production this is the SIGTERM your platform sends when retiring the pods; to trigger it from a coordinator:
    pool.begin_drain()   # bad version stops taking calls; in-flight run to hangup
    
  4. Record the rollback. Emit a deployment.rolled_back audit event naming the version you left and the version you returned to, so the compliance trail shows the reversal and its reason:
    pool.audit_log.emit(
        "deployment.rolled_back",
        actor="on-call",
        target="fleet",
        version="<last-good>",
        from_version="<bad>",
        reason="elevated session errors",
    )
    
  5. Confirm the fleet is clean. Every worker reports the good version and the bad-version workers have exited (active_sessions drained to zero). See monitoring a deploy.
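
The check in step 5 can be sketched as a small predicate over worker status records. WorkerStatus and the two helper functions are hypothetical stand-ins for whatever your monitoring exposes; only the field names deployment_version and active_sessions come from this page, and the exited flag is an assumption about how your platform reports a retired worker.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class WorkerStatus:
    """Hypothetical snapshot of one worker, for illustration only."""
    worker_id: str
    deployment_version: str
    active_sessions: int
    exited: bool = False  # assumed flag: the platform has retired this worker


def rollback_blockers(workers: Iterable[WorkerStatus], good_version: str) -> List[str]:
    """Return ids of workers that still keep the fleet from being clean."""
    blockers = []
    for w in workers:
        if w.deployment_version == good_version:
            continue  # good-version workers are expected to stay up
        # A bad-version worker must have drained to zero and exited.
        if w.active_sessions > 0 or not w.exited:
            blockers.append(w.worker_id)
    return blockers


def fleet_is_clean(workers: Iterable[WorkerStatus], good_version: str) -> bool:
    return not rollback_blockers(workers, good_version)
```

Returning the blocking worker ids, not just a boolean, makes it easier to see which bad-version worker is still finishing calls.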

What a rollback does not do

  • It does not rewind calls that already finished on the bad version. Those calls are over; a rollback only governs which version handles calls from now on.
  • It does not interrupt calls still in flight on the bad version. They finish on the bad version (drain), then that worker exits. If the bug makes in-flight calls unsafe to continue, ending them is an application decision, not a deployment one.
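
The drain behavior described above can be illustrated with a toy model. ToyWorker is not the OpenRTC worker or pool implementation; it is a hypothetical class that borrows the names begin_drain and active_sessions from this page to show the rule: a draining worker accepts no new calls, keeps its in-flight calls until hangup, and only then may exit.

```python
class ToyWorker:
    """Toy stand-in for a worker, illustrating drain semantics only."""

    def __init__(self, version: str) -> None:
        self.version = version
        self.draining = False
        self.active_sessions = set()

    def accept(self, call_id: str) -> bool:
        # A draining worker refuses new calls; they land on the other version.
        if self.draining:
            return False
        self.active_sessions.add(call_id)
        return True

    def begin_drain(self) -> None:
        # In-flight calls are left untouched; only new calls are refused.
        self.draining = True

    def hangup(self, call_id: str) -> None:
        self.active_sessions.discard(call_id)

    @property
    def can_exit(self) -> bool:
        # Safe to exit once draining and no calls remain.
        return self.draining and not self.active_sessions
```

In this model a call accepted before begin_drain stays in active_sessions until its own hangup, which mirrors the point that a rollback never interrupts a live call.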
Next: monitoring a deploy and the audit-event reference.