Horizontal Scaling

Core Scale & Availability

Scale by adding more machines rather than upgrading a single node

Core Idea

Increase capacity by adding more replicas and distributing load across them. Prefer scale-out for elasticity, resilience, and cost efficiency over vertical scale-up.

When to Use

When workload grows beyond a single node, traffic is bursty, or you need zero-downtime elasticity and fault tolerance.

Recognition Cues

Indicators that this pattern might be the right solution:
  • CPU/memory saturation on single instance
  • Growing queue depth and rising p95/p99 latency
  • Frequent vertical upgrades with diminishing returns
  • Desire for multi-zone resilience and rolling updates

Pattern Variants & Approaches

Overview
Scale capacity by adding stateless replicas behind a load balancer; depend on shared stores for state.

Overview Architecture

[Architecture diagram: Client → Load Balancer (HTTP/HTTPS) → Replicas (traffic distributed); Replicas read hot data from a Cache and keep shared state in a Database]
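
As a minimal sketch of this topology (the names and image are assumptions, not from the original), a stateless Deployment fronted by a Service of type LoadBalancer gives the client → load balancer → replicas path shown above:

# Sketch: stateless replicas behind a cloud load balancer (hypothetical names/image)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3                       # baseline; the HPA shown later adjusts this
  selector:
    matchLabels: { app: web-api }
  template:
    metadata:
      labels: { app: web-api }
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.0    # placeholder image
          ports: [{ containerPort: 8080 }]
---
apiVersion: v1
kind: Service
metadata:
  name: web-api
spec:
  type: LoadBalancer                # provisions a cloud LB that spreads traffic across replicas
  selector: { app: web-api }
  ports: [{ port: 80, targetPort: 8080 }]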

When to Use This Variant

  • Autoscaling groups/HPA
  • Replica-based concurrency caps
  • Shared DB/cache bottlenecks

Use Case

Burst handling, low-latency APIs, and zero-downtime deployments across zones.
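
For the cross-zone, zero-downtime aspect, one hedged sketch is to add a rolling-update strategy and zone spreading to the hypothetical web-api Deployment from the overview (the field names are standard Kubernetes; the workload name is an assumption):

# Illustrative additions to the web-api Deployment: surge-based rollouts and zone spreading
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxSurge: 25%, maxUnavailable: 0 }   # never drop below desired capacity during a deploy
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone        # spread replicas evenly across availability zones
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: { app: web-api }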

Advantages

  • Elastic capacity
  • High availability via redundancy
  • Simple rolling updates

Implementation Example

# Horizontal scaling primitives
# Cloud: ASG/Instance Group + LB
# K8s: Deployment + HPA + Service (type: LoadBalancer)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                              # example name; match the target Deployment
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: web-api }
  minReplicas: 2                             # keep redundancy even at low load
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }   # ~30% headroom for bursts
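
If scale-in flapping or thundering-herd scale-out is a concern, autoscaling/v2 also exposes a behavior block for rate-limiting changes; the values below are illustrative assumptions, not recommendations, and would be appended under spec: of the HPA above.

# Optional HPA tuning (illustrative values)
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300                          # hold 5 minutes of low load before removing replicas
    policies: [{ type: Percent, value: 10, periodSeconds: 60 }]
  scaleUp:
    policies: [{ type: Pods, value: 4, periodSeconds: 60 }]  # add at most 4 pods per minute
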
Tradeoffs

Pros

  • Near-linear scale for embarrassingly parallel workloads
  • High availability via redundancy
  • Cost effective vs large high-memory/CPU machines
  • Easier rolling deploys and canaries

Cons

  • Coordination overhead and distributed bugs
  • Data/state partitioning complexity
  • Potential hotspots for skewed keys
  • Increased network costs

Common Pitfalls

  • No warm-up/slow start, so cold replicas are overwhelmed as traffic shifts to them
  • Unbounded concurrency and connection pools per replica
  • Shared dependencies (DB/cache) becoming the bottleneck instead
  • Lack of graceful shutdown and connection draining (see the sketch after this list)
  • Inconsistent hashing causing cache hot keys
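
One way to address the shutdown/draining pitfall, sketched against the hypothetical web-api pod template and an assumed /healthz endpoint, is to pair a readiness probe with a preStop delay and a generous termination grace period so routing stops before the process exits:

# Pod template sketch for graceful shutdown and draining (assumed names)
spec:
  terminationGracePeriodSeconds: 45          # allow in-flight requests to finish after SIGTERM
  containers:
    - name: web-api
      readinessProbe:
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5                     # failing readiness removes the pod from load balancing
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]         # keep serving while endpoint removal propagates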

Design Considerations

  • Autoscaling on CPU/RPS/queue depth with headroom
  • PodDisruptionBudgets and graceful draining (see the example after this list)
  • Per-instance concurrency caps and pool sizes
  • Partition awareness and consistent hashing where relevant
  • Capacity planning and chaos testing
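
For the disruption-budget point, a PodDisruptionBudget like the following (name and selector assumed to match the earlier sketch) keeps a floor of serving replicas during voluntary disruptions such as node drains:

# PodDisruptionBudget sketch (assumed names)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2                            # never drain below two serving replicas
  selector:
    matchLabels: { app: web-api }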

Real-World Examples

  • Amazon: ASGs scale EC2 fleets based on traffic and SLOs (hundreds of thousands of instances)
  • Kubernetes: HPA scales pods by CPU/custom metrics (clusters with thousands of pods)
  • Cloudflare: the edge runs many small stateless workers per PoP (hundreds of cities globally)

Complexity Analysis

  • Scalability: High (add replicas elastically)
  • Implementation Complexity: Medium (coordination and autoscaling)
  • Cost: Variable (pay for headroom)