Circuit Breaker

Reliability & Latency

Prevent cascading failures by failing fast when a service is unhealthy

Core Idea

Circuit breaker pattern prevents an application from repeatedly trying to execute an operation that's likely to fail. It monitors for failures and 'opens the circuit' after a threshold, failing fast instead of waiting for timeouts. After a cooldown period, it allows test requests to check if the service has recovered.

When to Use

Use circuit breakers when calling external services, microservices, or any dependency that might fail or become slow. Essential for preventing cascading failures in distributed systems.

Recognition Cues
Indicators that this pattern might be the right solution
  • Calling external APIs or microservices
  • Need to prevent cascading failures
  • Timeouts are causing thread pool exhaustion
  • Want to fail fast instead of waiting
  • Need to provide fallback responses during outages

Pattern Variants & Approaches

Three-State Circuit Breaker
Classic implementation with Closed, Open, and Half-Open states. Transitions between states based on failure rates and recovery attempts.

Three-State Circuit Breaker Architecture

RequestIf ClosedIf OpenCached Response⚙️Service A⚖️Circuit Breaker⚙️Service BFallback

When to Use This Variant

  • Need standard circuit breaker behavior
  • Want gradual recovery testing
  • Can implement state machine logic
  • Need clear state transitions

Use Case

General-purpose circuit breaking for microservices, API calls, and database connections

Advantages

  • Clear state transitions and behavior
  • Gradual recovery testing in half-open state
  • Well-understood pattern
  • Easy to monitor and debug

Implementation Example

# Three-State Circuit Breaker Implementation
from enum import Enum
from datetime import datetime, timedelta
import threading

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds before trying half-open
        self.recovery_timeout = recovery_timeout
        
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
        self.lock = threading.Lock()
    
    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                # Check if timeout has passed
                if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
                    self.state = CircuitState.HALF_OPEN
                    print("Circuit breaker entering HALF_OPEN state")
                else:
                    raise Exception("Circuit breaker is OPEN - failing fast")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e
    
    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                # Recovery successful
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                print("Circuit breaker CLOSED - service recovered")
            elif self.state == CircuitState.CLOSED:
                # Reset failure count on success
                self.failure_count = 0
    
    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = datetime.now()
            
            if self.state == CircuitState.HALF_OPEN:
                # Recovery failed, go back to open
                self.state = CircuitState.OPEN
                print("Circuit breaker OPEN - recovery failed")
            elif self.failure_count >= self.failure_threshold:
                # Too many failures, open the circuit
                self.state = CircuitState.OPEN
                print(f"Circuit breaker OPEN - {self.failure_count} failures")

# Usage
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def call_external_api():
    try:
        return breaker.call(external_service.get_data)
    except Exception as e:
        # Provide fallback response
        return get_cached_data()
Adaptive Circuit Breaker
Dynamically adjusts thresholds based on recent success/failure rates and latency percentiles. More sophisticated than fixed thresholds.

Adaptive Circuit Breaker Architecture

RequestTrackIf HealthyResponse⚙️Service A⚖️Adaptive BreakerMetrics⚙️Service B

When to Use This Variant

  • Traffic patterns vary significantly
  • Need automatic threshold adjustment
  • Want to consider latency, not just failures
  • Have varying SLA requirements

Use Case

High-scale systems with variable traffic, services with changing performance characteristics

Advantages

  • Automatically adapts to changing conditions
  • Considers both errors and latency
  • More resilient to traffic spikes
  • Reduces false positives

Implementation Example

# Adaptive Circuit Breaker with Sliding Window
from collections import deque
from datetime import datetime, timedelta

class AdaptiveCircuitBreaker:
    def __init__(self, window_size=100, error_threshold_percent=50, 
                 latency_threshold_ms=1000):
        self.window_size = window_size
        self.error_threshold_percent = error_threshold_percent
        self.latency_threshold_ms = latency_threshold_ms
        
        self.requests = deque(maxlen=window_size)
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        
        start_time = datetime.now()
        try:
            result = func(*args, **kwargs)
            latency = (datetime.now() - start_time).total_seconds() * 1000
            
            self._record_success(latency)
            return result
        except Exception as e:
            self._record_failure()
            raise e
    
    def _record_success(self, latency_ms):
        self.requests.append({
            'success': True,
            'latency': latency_ms,
            'timestamp': datetime.now()
        })
        
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED
        
        self._evaluate_state()
    
    def _record_failure(self):
        self.requests.append({
            'success': False,
            'timestamp': datetime.now()
        })
        self._evaluate_state()
    
    def _evaluate_state(self):
        if len(self.requests) < self.window_size:
            return  # Not enough data
        
        # Calculate error rate
        errors = sum(1 for r in self.requests if not r['success'])
        error_rate = (errors / len(self.requests)) * 100
        
        # Calculate P95 latency
        latencies = [r['latency'] for r in self.requests if r['success']]
        if latencies:
            latencies.sort()
            p95_latency = latencies[int(len(latencies) * 0.95)]
        else:
            p95_latency = float('inf')
        
        # Open circuit if thresholds exceeded
        if (error_rate > self.error_threshold_percent or 
            p95_latency > self.latency_threshold_ms):
            self.state = CircuitState.OPEN
            print(f"Circuit OPEN - Error rate: {error_rate}%, P95: {p95_latency}ms")
Tradeoffs

Pros

  • Prevents cascading failures across services
  • Fails fast, freeing up resources
  • Allows failing service time to recover
  • Improves overall system resilience
  • Provides clear failure signals for monitoring

Cons

  • Adds complexity to service calls
  • Requires careful threshold tuning
  • Can mask underlying issues if not monitored
  • May reject valid requests during recovery
  • Requires fallback strategy implementation
Common Pitfalls
  • Setting thresholds too aggressive (opens too easily)
  • Not implementing proper fallback mechanisms
  • Forgetting to monitor circuit breaker state
  • Not considering half-open state for recovery
  • Using same circuit breaker for all endpoints
Design Considerations
  • Define failure threshold (number or percentage of failures)
  • Set appropriate timeout values
  • Determine cooldown period before retry
  • Implement fallback strategies (cached data, default values)
  • Monitor circuit breaker state changes
  • Consider per-endpoint vs. per-service circuit breakers
Real-World Examples
Netflix

Hystrix library implements circuit breakers for all microservice calls

Thousands of microservices, millions of requests per second
Amazon

Circuit breakers protect against cascading failures across AWS services

Global infrastructure with millions of service calls
Uber

Circuit breakers prevent driver matching failures from affecting entire platform

Millions of rides per day across 70+ countries
Complexity Analysis
Scalability

Per-service instance - Each instance maintains its own state

Implementation Complexity

Medium to High - Requires careful tuning and monitoring

Cost

Low - Minimal overhead, prevents expensive failures