Retry with Backoff

Reliability & Latency

Exponential backoff with jitter for resilient retries of transient failures

Core Idea

Retries should be bounded, exponential, and jittered to avoid coordinated retry storms. Only retry idempotent operations, and respect server guidance such as Retry-After headers.

When to Use

When encountering transient network errors, 5xx responses, or timeouts on idempotent requests.
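
As a rough illustration of these conditions, a client might decide retryability from the HTTP method and the outcome of the call. The name `is_retryable` and the exact status set below are assumptions made for this sketch, not part of any particular library; the idempotent-method set follows HTTP semantics (RFC 9110).

```python
from typing import Optional

# Methods defined as idempotent by HTTP semantics (RFC 9110).
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})

# Assumed set of transient server statuses; tune this per service.
RETRYABLE_STATUSES = frozenset({500, 502, 503, 504})


def is_retryable(method: str, status: Optional[int], timed_out: bool = False) -> bool:
    """Return True when a failed request looks safe and worthwhile to retry."""
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # never blindly retry non-idempotent requests
    if timed_out or status is None:
        return True  # timeout or network error: no response was received
    return status in RETRYABLE_STATUSES
```

With this policy, `is_retryable("GET", 503)` is true, while `is_retryable("POST", 503)` and `is_retryable("GET", 404)` are false.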

Recognition Cues
Indicators that this pattern might be the right solution
  • Burst of 5xx/timeouts under load
  • Thundering herd after dependency failure
  • Client storms synchronized by fixed delays

Pattern Variants & Approaches

Overview
Client performs bounded exponential backoff with jitter on transient failures, respecting idempotency and server hints.

Overview Architecture

[Diagram: Client sends Attempt / Retry (jittered) to Service; Service returns Response/Error to Client]

When to Use This Variant

  • Transient 5xx/timeouts
  • Responses carrying Retry-After headers
  • Idempotent operations

Use Case

HTTP/gRPC clients, SDKs, and background jobs communicating over unreliable networks.

Advantages

  • Avoids retry storms
  • Improves success rate under intermittent failures
  • Simple client-side policy

Implementation Example

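# Exponential backoff with full jitter (defaults are illustrative, not tuned values)
import random, time
class TransientError(Exception): pass  # placeholder for the caller's transient error type
def retry(call, max_attempts=5, base=0.1, cap=10.0):
  for attempt in range(max_attempts):
    try:
      return call()
    except TransientError:
      if attempt == max_attempts - 1:
        raise  # attempts exhausted: surface the last error
      # Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds
      time.sleep(random.uniform(0, min(cap, base * 2**attempt)))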
Tradeoffs

Pros

  • Improves resiliency to transient faults
  • Reduces load on a struggling dependency compared with immediate retries
  • Simple to implement with libraries

Cons

  • Increases latency on failure paths
  • Can amplify load if mis-tuned
  • Complexity with layered retry policies
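
To make the load-amplification and layering concerns concrete: when retries are stacked, each layer multiplies the attempts of the layer below it. The small sketch below computes that worst case; the function name `worst_case_calls` and the per-layer numbers are illustrative assumptions.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Upper bound on backend calls when every layer exhausts its attempts."""
    return prod(attempts_per_layer)


# Client, mesh, and gateway each allowing 3 attempts (illustrative numbers).
print(worst_case_calls([3, 3, 3]))  # 27
```

A single user request can therefore reach a failing backend up to 27 times in this setup, which is why retries are usually best confined to one layer.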
Common Pitfalls
  • Retrying non-idempotent operations
  • Unbounded attempts or total retry time
  • Coordinated retries without jitter
  • Layered retries across gateway, mesh, and client
  • Ignoring Retry-After headers
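
One way to avoid the last pitfall is to treat Retry-After as a lower bound on the next delay. The sketch below parses both forms the header can take (a number of seconds or an HTTP date) using only the standard library. The helper names `retry_after_seconds` and `next_delay`, and the 60-second ceiling, are assumptions for illustration.

```python
import time
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str], now: Optional[float] = None) -> Optional[float]:
    """Parse a Retry-After value (delay-seconds or HTTP-date) into seconds to wait."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: fall back to the client's own backoff
    now = time.time() if now is None else now
    return max(0.0, when.timestamp() - now)


def next_delay(backoff_delay: float, header_value: Optional[str], max_delay: float = 60.0) -> float:
    """Wait at least as long as the server asks, but never beyond max_delay."""
    hinted = retry_after_seconds(header_value)
    return min(max_delay, max(backoff_delay, hinted or 0.0))
```

Capping the server hint with `max_delay` is a design choice: it keeps a single request from waiting indefinitely, at the cost of occasionally retrying earlier than the server asked.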
Design Considerations
  • Exponential backoff with full/decorrelated jitter
  • Cap attempts and total budget per request
  • Idempotency keys for writes where possible
  • Status/exception-based retry policies
  • Outlier detection and hedging for tail latency (sparingly)
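
The first two considerations can be combined in a single helper. The following is a minimal sketch, assuming a hypothetical `TransientError` exception type and illustrative defaults: it uses decorrelated jitter (each delay is drawn between the base and three times the previous delay, then capped) and gives up when either the attempt count or the total time budget would be exceeded.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_budget(
    call: Callable[[], T],
    max_attempts: int = 5,
    budget_s: float = 10.0,
    base_s: float = 0.1,
    cap_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `call` on TransientError within an attempt cap and a time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    deadline = clock() + budget_s
    delay = base_s
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            # Decorrelated jitter: draw from [base, 3 * previous delay], then cap.
            delay = min(cap_s, random.uniform(base_s, delay * 3))
            out_of_attempts = attempt == max_attempts
            out_of_time = clock() + delay > deadline
            if out_of_attempts or out_of_time:
                raise  # surface the last error to the caller
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` and `clock` as parameters is only there to make the policy easy to test without real waiting; production code could call `time.sleep` and `time.monotonic` directly.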
Real-World Examples
  • AWS SDKs: built-in exponential backoff with jitter (millions of clients)
  • Google: SRE guidance on jittered retries (planet-scale services)
  • Stripe: idempotency keys + retries for payments (global API traffic)
Complexity Analysis
Scalability

Client-side - Applies per call

Implementation Complexity

Low to Medium - Policy tuning

Cost

Low - Library-level feature