Saga Pattern: Managing Distributed Transactions

Learn how to manage distributed transactions across microservices using the Saga pattern with orchestration and choreography approaches in TypeScript.

While I was looking over some microservices architecture the other day, I came across a production bug that sent me down a rabbit hole. An order was confirmed, payment was charged, but inventory was never decremented. The customer got billed for something we couldn't fulfill.

I was once guilty of thinking distributed transactions were just like database transactions but spread across services. Little did I know that ACID transactions spanning several services become impractical once you split your monolith and each service owns its data. You can't start a transaction in the Order service and have it magically roll back changes in the Payment and Inventory services when something fails.

This is exactly where the Saga pattern comes in, and I cannot stress this enough—understanding this pattern will save you from countless production headaches.

Why Distributed Transactions Are Hard in Microservices

When I finally decided to break down a monolith into microservices, I quickly realized that each service now had its own database. This meant no more simple BEGIN TRANSACTION and COMMIT across everything.

Here's what made it click for me: imagine you're building an e-commerce system with separate Order, Payment, and Inventory services. When a customer places an order, you need to:

1. Create an order record
2. Process payment
3. Reserve inventory
4. Update shipping status

In a monolith, you'd wrap all of this in a single database transaction. If step 3 fails, everything rolls back automatically. Wonderful! But in microservices, each step happens in a different service with its own database. If payment succeeds but inventory reservation fails, you're left with an inconsistent state where money was charged but nothing ships.

The traditional two-phase commit (2PC) protocol technically solves this, but it requires every participating service to hold locks on its resources until the coordinator announces the outcome. That couples the services tightly, hurts throughput, and leaves participants blocked if the coordinator goes down. In other words, it defeats the purpose of microservices.

What is the Saga Pattern?

A Saga is a sequence of local transactions where each service performs its own transaction and publishes an event or message. If any step fails, the Saga executes compensating transactions to undo the changes made by previous steps.

I realized this was essentially breaking one big transaction into multiple smaller ones, each with a corresponding "undo" operation. When I came across this pattern, it felt like finally having a practical solution instead of theoretical approaches.

[Figure: Saga pattern visualization]

The key insight is that Sagas give you eventual consistency rather than immediate consistency. Your system might be temporarily inconsistent (for example, a payment can be charged for a moment before its inventory is reserved), but it will eventually reach a consistent state through either successful completion or compensating transactions.

Orchestration vs Choreography: Two Approaches to Sagas

There are two ways to implement Sagas, and choosing the right one matters more than I initially thought.

Orchestration uses a central coordinator (orchestrator) that tells each service what to do. The orchestrator knows the entire workflow and handles the saga logic.

Choreography has no central coordinator. Each service listens for events and decides what to do next. Services communicate through events, creating a chain of reactions.

When I first built Sagas, I gravitated toward choreography because it seemed more "microservices-like." But I learned the hard way that orchestration is often simpler to understand and debug. Let's look at both approaches.

Building an Order Saga with TypeScript and Node.js

Here's an orchestration-based Saga for processing orders. This orchestrator coordinates the entire workflow:

// OrderData, OrderResult, and the three service classes are assumed to be
// defined elsewhere in the codebase.
interface SagaStep {
  execute: () => Promise<unknown>;
  compensate: () => Promise<void>;
}

class OrderSaga {
  constructor(
    private orderService: OrderService,
    private paymentService: PaymentService,
    private inventoryService: InventoryService
  ) {}

  async execute(orderData: OrderData): Promise<OrderResult> {
    // Tracked per execution so that repeated or concurrent runs on the same
    // OrderSaga instance never compensate each other's steps
    const completedSteps: SagaStep[] = [];

    const steps: SagaStep[] = [
      {
        // Step 1: create the order; compensated by cancelling it
        execute: async () => {
          const order = await this.orderService.createOrder(orderData);
          return order;
        },
        compensate: async () => {
          await this.orderService.cancelOrder(orderData.orderId);
        }
      },
      {
        // Step 2: charge the customer; compensated by a refund
        execute: async () => {
          const payment = await this.paymentService.charge({
            orderId: orderData.orderId,
            amount: orderData.total,
            customerId: orderData.customerId
          });
          return payment;
        },
        compensate: async () => {
          await this.paymentService.refund(orderData.orderId);
        }
      },
      {
        // Step 3: reserve stock; compensated by releasing the reservation
        execute: async () => {
          const reservation = await this.inventoryService.reserve({
            orderId: orderData.orderId,
            items: orderData.items
          });
          return reservation;
        },
        compensate: async () => {
          await this.inventoryService.release(orderData.orderId);
        }
      }
    ];

    try {
      for (const step of steps) {
        const result = await step.execute();
        completedSteps.push(step);
        console.log('Step completed:', result);
      }

      return { success: true, orderId: orderData.orderId };
    } catch (error) {
      console.error('Saga failed, executing compensations:', error);
      await this.rollback(completedSteps);
      // `error` is `unknown` under strict TypeScript, so narrow it before use
      const message = error instanceof Error ? error.message : String(error);
      return { success: false, error: message };
    }
  }

  private async rollback(completedSteps: SagaStep[]): Promise<void> {
    // Execute compensating transactions in reverse order.
    // Copy first: Array.prototype.reverse() mutates the array in place.
    for (const step of [...completedSteps].reverse()) {
      try {
        await step.compensate();
        console.log('Compensation executed successfully');
      } catch (compensationError) {
        // Log compensation failures - this is critical
        console.error('Compensation failed:', compensationError);
        // In production, you'd alert and potentially retry
      }
    }
  }
}

This orchestrator manages the entire order flow. If inventory reservation fails after payment succeeds, it runs the compensating transactions for the completed steps in reverse order: first the payment refund, then the order cancellation.
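For comparison, here is roughly what the same flow could look like with choreography. This is a minimal sketch, not a production setup: the in-memory EventBus stands in for a real message broker, and the event names (OrderCreated, InventoryReservationFailed, and so on), the wireOrderSaga function, and the inStock flag are all hypothetical names I made up for illustration.

```typescript
// Hypothetical in-memory event bus standing in for a message broker.
type SagaEvent = { type: string; orderId: string };
type SagaHandler = (event: SagaEvent) => void;

class EventBus {
  private handlers = new Map<string, SagaHandler[]>();
  readonly log: string[] = []; // every published event type, in order

  subscribe(type: string, handler: SagaHandler): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  publish(event: SagaEvent): void {
    this.log.push(event.type);
    for (const handler of this.handlers.get(event.type) ?? []) {
      handler(event);
    }
  }
}

// Wires up the services. `inStock` is an assumption used to force a failure.
function wireOrderSaga(bus: EventBus, inStock: boolean): void {
  // Payment service reacts to new orders.
  bus.subscribe('OrderCreated', (e) =>
    bus.publish({ type: 'PaymentCharged', orderId: e.orderId }));

  // Inventory service reacts to successful payments.
  bus.subscribe('PaymentCharged', (e) =>
    bus.publish({
      type: inStock ? 'InventoryReserved' : 'InventoryReservationFailed',
      orderId: e.orderId,
    }));

  // Compensation: payment service refunds when the reservation fails.
  bus.subscribe('InventoryReservationFailed', (e) =>
    bus.publish({ type: 'PaymentRefunded', orderId: e.orderId }));

  // Compensation: order service cancels the order once the refund is out.
  bus.subscribe('PaymentRefunded', (e) =>
    bus.publish({ type: 'OrderCancelled', orderId: e.orderId }));
}
```

Notice that no single place describes the whole workflow here. It only emerges from the subscriptions, which is a big part of why I find choreography harder to follow and debug than an orchestrator.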

Implementing Compensating Transactions for Rollbacks

The compensating transactions are where I initially made mistakes. I thought a compensating transaction could just "undo" the previous step, but it's more nuanced than that.

Luckily we can design compensating transactions using semantic rollbacks. Instead of database-level rollbacks, we perform business-level reversals. Here's what I mean with a practical payment service:

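import { randomUUID } from 'node:crypto';

// Shape inferred from how the orchestrator calls charge()
interface ChargeRequest {
  orderId: string;
  amount: number;
  customerId: string;
}

// Assumed helper: any source of unique IDs works here
const generateId = (): string => randomUUID();

class PaymentService {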
  private payments: Map<string, Payment> = new Map();
  
  async charge(request: ChargeRequest): Promise<Payment> {
    // Idempotency check - critical for saga reliability
    const existingPayment = this.payments.get(request.orderId);
    if (existingPayment && existingPayment.status === 'charged') {
      console.log('Payment already processed, returning existing');
      return existingPayment;
    }
    
    const payment: Payment = {
      id: generateId(),
      orderId: request.orderId,
      amount: request.amount,
      customerId: request.customerId,
      status: 'charged',
      timestamp: new Date(),
      attempts: 1
    };
    
    // Simulate payment processing
    if (Math.random() > 0.9) {
      throw new Error('Payment gateway timeout');
    }
    
    this.payments.set(request.orderId, payment);
    
    // Publish event for choreography-based sagas
    await this.publishEvent('PaymentCharged', payment);
    
    return payment;
  }
  
  async refund(orderId: string): Promise<void> {
    const payment = this.payments.get(orderId);
    
    if (!payment) {
      console.log('No payment found, nothing to refund');
      return; // Idempotent - already compensated or never happened
    }
    
    if (payment.status === 'refunded') {
      console.log('Payment already refunded');
      return; // Idempotent
    }
    
    payment.status = 'refunded';
    payment.refundedAt = new Date();
    
    await this.publishEvent('PaymentRefunded', {
      orderId,
      amount: payment.amount,
      originalPaymentId: payment.id
    });
    
    console.log(`Refunded ${payment.amount} for order ${orderId}`);
  }
  
  private async publishEvent(eventType: string, data: any): Promise<void> {
    // In production, this would publish to a message broker
    console.log(`Event published: ${eventType}`, data);
  }
}
 
interface Payment {
  id: string;
  orderId: string;
  amount: number;
  customerId: string;
  status: 'charged' | 'refunded' | 'failed';
  timestamp: Date;
  refundedAt?: Date;
  attempts: number;
}

Notice how both charge and refund are idempotent. If you call them multiple times with the same order ID, they handle it gracefully. This is absolutely critical because network failures or retries can cause duplicate requests.

[Figure: Saga compensation flow]

Handling Failures and Idempotency in Saga Steps

When I finally understood idempotency in the context of Sagas, everything became clearer. Each saga step must be idempotent because retries will happen. The network is unreliable, services crash, and message brokers deliver messages multiple times.

Here's what I learned the hard way: you need unique identifiers for each saga execution. I use the order ID as the idempotency key, but in production, you might use a saga instance ID that's even more specific.
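One way to use a saga instance ID is to key each step's result by saga ID plus step name, so a retried step returns the recorded result instead of running again. Here's a minimal sketch under a few assumptions: IdempotentStepRunner and its run method are hypothetical names, and the in-memory Map stands in for what would need to be a durable store in production.

```typescript
// Hypothetical helper: runs each (sagaId, stepName) pair at most once.
class IdempotentStepRunner {
  // Assumption: a real system would use a durable store (for example a
  // database table), not an in-memory Map.
  private results = new Map<string, unknown>();

  async run<T>(
    sagaId: string,
    stepName: string,
    action: () => Promise<T>,
  ): Promise<T> {
    const key = `${sagaId}:${stepName}`;
    if (this.results.has(key)) {
      // Duplicate delivery or retry: return the recorded result.
      return this.results.get(key) as T;
    }
    const result = await action();
    this.results.set(key, result); // recorded only after the action succeeds
    return result;
  }
}
```

A failed action records nothing, so it can be retried safely. This sketch does not protect against two concurrent calls with the same key both running the action; a real store would need something like a unique constraint for that.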

The compensation logic also needs careful thought. What if a compensating transaction fails? In my experience, you have three options:

Retry forever - Keep trying until it succeeds. This works if the operation is truly idempotent.
Manual intervention - Log it and alert someone to fix it manually. Not ideal but sometimes necessary.
Forward recovery - Instead of rolling back, try to complete the saga differently.

For our order saga, if the inventory service is down when we try to compensate, we might choose to retry for a while, then eventually move the order to a "manual review" queue where a human can sort it out.
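The "retry for a while, then hand it to a human" approach can be sketched like this. Treat it as an illustration under assumptions: compensateWithRetry, ManualReviewItem, and the plain array used as a review queue are hypothetical, and the attempt count and delays are arbitrary defaults.

```typescript
// Hypothetical record describing a compensation that needs a human.
interface ManualReviewItem {
  orderId: string;
  reason: string;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries `compensate` with exponential backoff. After `maxAttempts` failures
// the order is pushed onto `reviewQueue` instead of being retried forever.
async function compensateWithRetry(
  orderId: string,
  compensate: () => Promise<void>,
  reviewQueue: ManualReviewItem[],
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await compensate();
      return true; // compensation succeeded
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      if (attempt === maxAttempts) {
        // Out of retries: hand the order over for manual review.
        reviewQueue.push({ orderId, reason });
        return false;
      }
      // Delay doubles on each attempt: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  return false; // only reachable when maxAttempts < 1
}
```

This only works because the compensation itself is idempotent. If release or refund were not safe to call twice, retrying would make things worse rather than better.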

When to Use Sagas vs Two-Phase Commit

This was a question that confused me for months. When should you actually use Sagas versus sticking with traditional two-phase commit?

Use Sagas when:

Services are loosely coupled and might be owned by different teams
Transactions are long-running and could take seconds or minutes
You can accept eventual consistency
Performance and availability matter more than immediate consistency

Use 2PC when:

You absolutely need ACID guarantees
The transaction completes in milliseconds
All participants are under your control and highly available
The risk of inconsistency is unacceptable (like financial transfers between accounts)

In short, for most microservices scenarios, Sagas are the better choice. The 2PC protocol couples every service to a shared coordinator, which becomes a single point of failure and runs against microservices principles.

I've found that many developers reach for 2PC because it's familiar, but it creates more problems than it solves in distributed systems. Sagas force you to think about failure modes upfront, which leads to more resilient systems.

Production Patterns: Monitoring and Observability for Sagas

Wonderful as Sagas are, they're harder to debug than regular transactions. You need excellent observability to track saga execution across services.

Here's what I implemented for production:

Correlation IDs - Every saga gets a unique ID that flows through all services. This makes it possible to trace the entire saga in your logs. I cannot stress this enough—add correlation IDs from day one.
Saga state persistence - Store the saga's current state in a database. If your orchestrator crashes, you can recover and continue or compensate from the last known state.
Dead letter queues - Failed messages go to a DLQ where you can inspect and potentially replay them. This saved me countless times when investigating why a saga failed.
Metrics and alerts - Track saga success/failure rates, duration, and compensation frequency. Set up alerts for compensation spikes—they usually indicate a deeper problem.
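To make the correlation ID and state persistence points more concrete, here's a rough sketch of what a persisted saga record might look like. The SagaRecord shape, the status values, and the SagaStateStore class are assumptions for illustration, and the Map stands in for a real database table.

```typescript
type SagaStatus = 'running' | 'completed' | 'compensating' | 'compensated';

// Hypothetical persisted record: one row per saga instance.
interface SagaRecord {
  correlationId: string; // also attached to every log line and outgoing call
  status: SagaStatus;
  completedSteps: string[]; // step names, in the order they finished
  updatedAt: Date;
}

// In-memory stand-in for a database table (an assumption for this sketch).
class SagaStateStore {
  private records = new Map<string, SagaRecord>();

  start(correlationId: string): SagaRecord {
    const record: SagaRecord = {
      correlationId,
      status: 'running',
      completedSteps: [],
      updatedAt: new Date(),
    };
    this.records.set(correlationId, record);
    return record;
  }

  markStepCompleted(correlationId: string, step: string): void {
    const record = this.get(correlationId);
    record.completedSteps.push(step);
    record.updatedAt = new Date();
  }

  setStatus(correlationId: string, status: SagaStatus): void {
    const record = this.get(correlationId);
    record.status = status;
    record.updatedAt = new Date();
  }

  // After a restart: sagas that never reached a final state and still need
  // to be resumed or compensated.
  findUnfinished(): SagaRecord[] {
    return Array.from(this.records.values()).filter(
      (r) => r.status === 'running' || r.status === 'compensating',
    );
  }

  private get(correlationId: string): SagaRecord {
    const record = this.records.get(correlationId);
    if (!record) throw new Error(`Unknown saga: ${correlationId}`);
    return record;
  }
}
```

On startup, an orchestrator could call findUnfinished and use each record's completedSteps to decide whether to continue the saga or compensate the steps that already ran.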

For monitoring, I use distributed tracing tools that show the entire saga flow visually. Being able to see which step failed and how long each took is invaluable. Check out my post on distributed tracing in Node.js microservices for more details on implementation.

Combining Sagas with patterns like circuit breakers creates a robust system that handles failures gracefully. Circuit breakers prevent cascading failures when a downstream service is struggling, while Sagas ensure you can recover from partial failures.
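If you haven't built a circuit breaker before, the core idea fits in a few lines. This is a deliberately minimal sketch, not a replacement for a battle-tested library: the CircuitBreaker class, its thresholds, and the injectable clock are all assumptions I chose for illustration.

```typescript
// Minimal circuit breaker: opens after `failureThreshold` consecutive
// failures and rejects calls until `resetTimeoutMs` has passed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private failureThreshold = 3,
    private resetTimeoutMs = 5000,
    private now: () => number = () => Date.now(), // injectable clock for tests
  ) {}

  async call<T>(action: () => Promise<T>): Promise<T> {
    if (
      this.openedAt !== null &&
      this.now() - this.openedAt < this.resetTimeoutMs
    ) {
      // Open: fail fast without touching the struggling service.
      throw new Error('Circuit open: call rejected');
    }
    // Closed, or the timeout has elapsed and this is a trial call.
    try {
      const result = await action();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or re-open) the circuit
      }
      throw err;
    }
  }
}
```

A saga step could wrap its service call in breaker.call(...). A rejected call then surfaces as a normal step failure, and the saga's compensation logic takes over from there.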

And that concludes this post! I hope you found it valuable, and look out for more in the future!