Devgains

The Saga Pattern and the Distributed-Transaction Trap

5 min read

The first time a team splits an order workflow across services, someone inevitably asks the dangerous question: "How do we wrap the whole thing in a transaction?" The honest answer is that you can't — not without two-phase commit, which couples your services together so tightly you've recreated the monolith with extra network hops. The saga pattern is the grown-up alternative, but it comes with its own trap that's easy to fall into.

A saga replaces one atomic transaction with a sequence of local transactions, each in its own service. If a later step fails, you don't roll back — you run compensating actions that semantically undo the earlier steps. That word "semantically" is doing enormous work, and it's where the trap lives.

Why Not Just Use Two-Phase Commit

Two-phase commit (2PC) does give you atomicity across services. The catch is that it holds locks across the network for the duration of the transaction. Any participant being slow or unavailable stalls everyone, and the coordinator becomes a single point of failure during the critical window. This is why the microservices.io saga pattern page and most practitioners steer away from 2PC for high-throughput business workflows. It trades availability for consistency in exactly the way distributed systems try to avoid.

A saga avoids those cross-service locks, but it gives up isolation, not just rollback. Intermediate states are visible to the rest of the system: an order can briefly be "payment taken but not yet shipped", and other services can observe that. You must design for those windows.
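
One way to design for those windows is to make the in-flight state explicit, so a reader cannot mistake it for a finished order. The sketch below is a minimal illustration, not a prescribed design: the `OrderRecord` shape, the `transition` and `fulfillable` helpers, and the transition table are all assumptions made for the example.

```typescript
// Status values for an order moving through a saga.
type OrderStatus = "pending" | "confirmed" | "cancelled";

// Hypothetical record shape, for illustration only.
interface OrderRecord {
  id: string;
  status: OrderStatus;
}

// Only these transitions are legal; anything else suggests a bug or a stale message.
const allowed: Record<OrderStatus, OrderStatus[]> = {
  pending: ["confirmed", "cancelled"],
  confirmed: [],
  cancelled: [],
};

function transition(order: OrderRecord, next: OrderStatus): OrderRecord {
  if (allowed[order.status].indexOf(next) === -1) {
    throw new Error(`illegal transition ${order.status} -> ${next}`);
  }
  return { ...order, status: next };
}

// Downstream readers filter on status instead of assuming that every
// order they can see has completed its saga.
function fulfillable(orders: OrderRecord[]): OrderRecord[] {
  return orders.filter((o) => o.status === "confirmed");
}
```

An order stays `pending` while the saga is in flight, so anything that only acts on `confirmed` orders never picks up a half-finished one.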

The Two Flavors: Choreography and Orchestration

There are two ways to wire a saga together, and the choice shapes how your system fails.

Choreography has no central coordinator. Each service listens for events and reacts by emitting its own. It's decentralized and loosely coupled, but the overall workflow exists only as an emergent property — no single place describes it.

// Choreography: each service reacts to events, nobody owns the whole flow
on("OrderCreated", async (e) => {
  await reservePayment(e.orderId, e.amount);
  emit("PaymentReserved", { orderId: e.orderId });
});
 
on("PaymentReserved", async (e) => {
  await reserveInventory(e.orderId);
  emit("InventoryReserved", { orderId: e.orderId });
});
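
The handlers above assume that `on` and `emit` already exist. To try the flow locally, a minimal in-memory stand-in might look like the sketch below. A real system would use a message broker; the `PaymentFailed` event, the `trace` array, and the made-up decline limit in `reservePayment` are illustrative assumptions, not part of the original flow.

```typescript
// Minimal in-memory event bus: a stand-in for a real broker (assumption).
interface SagaEvent {
  orderId: number;
  amount?: number;
}
type Handler = (event: SagaEvent) => Promise<void>;

const handlers = new Map<string, Handler[]>();
const trace: string[] = []; // records emitted events, for inspection

function on(name: string, handler: Handler): void {
  const list = handlers.get(name) ?? [];
  list.push(handler);
  handlers.set(name, list);
}

async function emit(name: string, event: SagaEvent): Promise<void> {
  trace.push(name);
  for (const h of handlers.get(name) ?? []) await h(event);
}

// Hypothetical payment step: declines amounts above a made-up limit.
async function reservePayment(orderId: number, amount: number): Promise<void> {
  if (amount > 10_000) throw new Error(`card declined for order ${orderId}`);
}

on("OrderCreated", async (e) => {
  try {
    await reservePayment(e.orderId, e.amount ?? 0);
  } catch {
    // In choreography, a failure is just another event that others react to.
    await emit("PaymentFailed", { orderId: e.orderId });
    return;
  }
  await emit("PaymentReserved", { orderId: e.orderId });
});

on("PaymentFailed", async (e) => {
  trace.push(`order ${e.orderId} cancelled`);
});
```

Note that nothing in this code states the full workflow; you can only reconstruct it by reading every handler, which is the trade-off choreography makes.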

Orchestration introduces a coordinator that explicitly drives each step and decides what to do on failure. The workflow is centralized and readable, at the cost of one component that knows about everyone.

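// Orchestration: the coordinator owns the sequence AND the rollback
// Each step pairs a forward action with its compensating action.
interface Step {
  run(order: Order): Promise<void>;
  compensate(order: Order): Promise<void>;
}

// reservePayment, reserveInventory, and scheduleShipping are assumed to be Step objects.
async function orderSaga(order: Order) {
  const steps: Step[] = [reservePayment, reserveInventory, scheduleShipping];
  const done: Step[] = [];
  try {
    for (const step of steps) {
      await step.run(order);
      done.push(step);
    }
  } catch (err) {
    // Compensate in REVERSE order for everything that succeeded
    for (const step of done.reverse()) await step.compensate(order);
    throw err;
  }
}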
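
To watch the reverse-order compensation happen, here is a self-contained sketch with simulated steps. The `makeStep` helper, the `runSaga` name, the `Order` fields, and the in-memory `log` are assumptions made for illustration; a real step would call another service rather than append to an array.

```typescript
interface Order {
  id: number;
  amount: number;
}

// Assumed shape: each step pairs a forward action with its compensation.
interface Step {
  name: string;
  run(order: Order): Promise<void>;
  compensate(order: Order): Promise<void>;
}

const log: string[] = [];

// Builds a step that records what it did; `fails` simulates an outage.
function makeStep(name: string, fails = false): Step {
  return {
    name,
    async run(order) {
      if (fails) throw new Error(`${name} failed for order ${order.id}`);
      log.push(`run:${name}`);
    },
    async compensate() {
      log.push(`undo:${name}`);
    },
  };
}

async function runSaga(order: Order, steps: Step[]): Promise<void> {
  const done: Step[] = [];
  try {
    for (const step of steps) {
      await step.run(order);
      done.push(step);
    }
  } catch (err) {
    // Walk the completed steps backwards, most recent first.
    for (const step of [...done].reverse()) await step.compensate(order);
    throw err;
  }
}
```

Running it with `reservePayment`, `reserveInventory`, and a failing `scheduleShipping` logs both forward steps and then undoes inventory before payment. The sketch does not handle a compensation that itself throws; a production coordinator would typically retry it until it succeeds.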

As a rule of thumb: choreography suits short, simple flows; orchestration earns its keep once you have more than three or four steps or complex failure logic. Microsoft's Saga distributed transactions pattern on the Azure Architecture Center is a good neutral reference for both.

The Trap: Compensations Are Not Rollbacks

Here's where teams get burned. A database rollback is perfect — it's as if the transaction never happened. A compensation is a new business action that tries to reverse a committed one, and reality often refuses to cooperate.

You took the payment, then shipping failed. Your compensation issues a refund. But:

  • A refund is not a deletion; it's a second ledger entry. The customer sees both a charge and a refund on their statement, and that's a support ticket waiting to happen.
  • The email confirming the order already went out. You can't un-send it.
  • Inventory you released may have been bought by someone else in the meantime. You can't un-promise it.

-- Forward: charge committed and is now visible to everyone
INSERT INTO ledger (order_id, type, amount) VALUES (42, 'charge', 9900);
 
-- Compensation is a NEW fact, not an undo. History keeps both.
INSERT INTO ledger (order_id, type, amount) VALUES (42, 'refund', -9900);

This is why sagas demand careful business design, not just careful engineering. For every forward step you must ask: if this commits and a later step fails, what does undoing it actually mean to a human?
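
The same idea can be sketched in application code. This is a minimal append-only ledger, assuming an in-memory array and amounts in cents (both assumptions for illustration). The net balance returns to zero after the refund, but the history the customer can see still holds two entries.

```typescript
type EntryType = "charge" | "refund";

interface LedgerEntry {
  orderId: number;
  type: EntryType;
  amount: number; // in cents (assumed); refunds are stored as negative amounts
}

// In-memory stand-in for a ledger table.
const ledger: LedgerEntry[] = [];

function balance(orderId: number): number {
  return ledger
    .filter((e) => e.orderId === orderId)
    .reduce((sum, e) => sum + e.amount, 0);
}

function charge(orderId: number, amount: number): void {
  ledger.push({ orderId, type: "charge", amount });
}

// Compensation appends an offsetting entry; it never deletes the charge.
function refund(orderId: number): void {
  const outstanding = balance(orderId);
  if (outstanding === 0) return; // nothing left to refund, so a retry is harmless
  ledger.push({ orderId, type: "refund", amount: -outstanding });
}

function history(orderId: number): EntryType[] {
  return ledger.filter((e) => e.orderId === orderId).map((e) => e.type);
}
```

Because `refund` checks the outstanding balance first, calling it twice does not refund twice, which matters once retries enter the picture.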

Making Sagas Survivable

A few disciplines keep sagas from becoming a debugging nightmare:

  • Make every step and every compensation idempotent. Messages get redelivered and coordinators retry after timeouts, so each action must produce the same result whether it runs once or several times.
  • Persist saga state. The coordinator (or the event log) must durably know which steps have completed, so a crash mid-saga can resume rather than restart.
  • Order compensations in reverse. Undo the most recent successful step first, mirroring how you'd unwind a stack.
  • Design for visible intermediate states. Use a status field (pending, confirmed, cancelled) so the rest of the system never mistakes an in-flight order for a finished one.
  • Accept that some steps can't be compensated. A "pivot" step — like physically shipping a package — is a point of no return. Structure the saga so irreversible actions come last, after everything that can fail has already succeeded.
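
The first two disciplines can be sketched together: record which steps a saga has finished, and skip any step that is delivered again. The `completed` map below is an in-memory stand-in for a durable store, and `runOnce`, `resume`, `chargeCard`, and the saga and step identifiers are hypothetical names chosen for the example.

```typescript
// In-memory stand-in for a durable store (assumption); a real saga would
// persist this in a database so it survives a crash.
const completed = new Map<string, Set<string>>(); // sagaId -> finished step names
let charges = 0; // counts real side effects, to show duplicates are suppressed

// Runs the action only if this saga has not already recorded the step.
async function runOnce(
  sagaId: string,
  stepName: string,
  action: () => Promise<void>,
): Promise<boolean> {
  const doneSteps = completed.get(sagaId) ?? new Set<string>();
  if (doneSteps.has(stepName)) return false; // duplicate delivery: skip
  await action();
  doneSteps.add(stepName);
  completed.set(sagaId, doneSteps);
  return true;
}

// Hypothetical side effect standing in for a payment call.
async function chargeCard(): Promise<void> {
  charges += 1;
}

// Resuming after a crash: recorded steps are skipped, the rest run in order.
async function resume(
  sagaId: string,
  steps: [string, () => Promise<void>][],
): Promise<string[]> {
  const ran: string[] = [];
  for (const [name, action] of steps) {
    if (await runOnce(sagaId, name, action)) ran.push(name);
  }
  return ran;
}
```

One caveat with this sketch: the action and the bookkeeping are not atomic, so a crash between them would rerun the action on resume. That is why the receiving service usually needs its own deduplication as well, for example by rejecting a request whose idempotency key it has already processed.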

Order your steps so that the cheapest-to-undo and most-likely-to-fail actions happen first, and the irreversible ones happen last. A saga that puts shipping before payment validation is a saga that will lose you money.
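
That ordering rule is easy to check mechanically. The sketch below assumes each step is annotated with two hypothetical flags, `reversible` and `canFail`, and reports any step that can still fail after the first irreversible one. The step names are illustrative.

```typescript
// Hypothetical planning metadata for a saga step.
interface PlannedStep {
  name: string;
  reversible: boolean; // can a compensation meaningfully undo it?
  canFail: boolean; // can it fail for business reasons?
}

// Returns the names of steps that can still fail after the first
// irreversible step, i.e. the ones that make the plan unsafe.
function riskySteps(plan: PlannedStep[]): string[] {
  const pivot = plan.findIndex((s) => !s.reversible);
  if (pivot === -1) return [];
  return plan
    .slice(pivot + 1)
    .filter((s) => s.canFail)
    .map((s) => s.name);
}

const safePlan: PlannedStep[] = [
  { name: "validatePayment", reversible: true, canFail: true },
  { name: "reserveInventory", reversible: true, canFail: true },
  { name: "shipPackage", reversible: false, canFail: false },
];

const unsafePlan: PlannedStep[] = [
  { name: "shipPackage", reversible: false, canFail: false },
  { name: "validatePayment", reversible: true, canFail: true },
];
```

Here `riskySteps(safePlan)` is empty, while `riskySteps(unsafePlan)` flags `validatePayment`, the shipping-before-payment mistake described above.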

Takeaways

  • You cannot wrap a multi-service workflow in a single ACID transaction without 2PC, which sacrifices availability.
  • A saga is a sequence of local transactions plus compensating actions that semantically undo prior steps.
  • Choreography (event-driven, decentralized) suits simple flows; orchestration (a coordinator) suits complex ones.
  • Compensations are new business actions, not rollbacks — refunds, cancellations, and apologies, not deletions.
  • Make every step and compensation idempotent, persist saga state, and compensate in reverse order.
  • Sequence steps so irreversible actions come last; some steps simply cannot be undone, and the design must respect that.