Horizontal Scaling

Core Scale & Availability

Scale by adding more machines rather than upgrading a single node

Core Idea

Increase capacity by adding more replicas and distributing load across them. Prefer scale-out for elasticity, resilience, and cost efficiency over vertical scale-up.

When to Use

When workload grows beyond a single node, traffic is bursty, or you need zero-downtime elasticity and fault tolerance.

Recognition Cues

Indicators that this pattern might be the right solution:
  • CPU/memory saturation on single instance
  • Growing queue depth and rising p95/p99 latency
  • Frequent vertical upgrades with diminishing returns
  • Desire for multi-zone resilience and rolling updates

Pattern Variants & Approaches

Overview
Scale capacity by adding stateless replicas behind a load balancer; depend on shared stores for state.

Overview Architecture

[Architecture diagram: Client → Load Balancer (HTTP/HTTPS) → Replicas (traffic distributed); Replicas read hot data from a Cache and keep shared state in a Database]
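
As a minimal sketch of this topology (the names and image are assumptions, not from the original), a stateless Deployment fronted by a Service of type LoadBalancer gives the client → load balancer → replicas path shown above:

# Sketch: stateless replicas behind a cloud load balancer (hypothetical names/image)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3                       # baseline; the HPA shown later adjusts this
  selector:
    matchLabels: { app: web-api }
  template:
    metadata:
      labels: { app: web-api }
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.0    # placeholder image
          ports: [{ containerPort: 8080 }]
---
apiVersion: v1
kind: Service
metadata:
  name: web-api
spec:
  type: LoadBalancer                # provisions a cloud LB that spreads traffic across replicas
  selector: { app: web-api }
  ports: [{ port: 80, targetPort: 8080 }]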

When to Use This Variant

  • Autoscaling groups/HPA
  • Replica-based concurrency caps
  • Shared DB/cache bottlenecks

Use Case

Burst handling, low-latency APIs, and zero-downtime deployments across zones.
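
For the cross-zone, zero-downtime aspect, one hedged sketch is to add a rolling-update strategy and zone spreading to the hypothetical web-api Deployment from the overview (the field names are standard Kubernetes; the workload name is an assumption):

# Illustrative additions to the web-api Deployment: surge-based rollouts and zone spreading
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxSurge: 25%, maxUnavailable: 0 }   # never drop below desired capacity during a deploy
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone        # spread replicas evenly across availability zones
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: { app: web-api }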

Advantages

  • Elastic capacity
  • High availability via redundancy
  • Simple rolling updates

Implementation Example

# Horizontal scaling primitives
# Cloud: ASG/Instance Group + LB
# K8s: Deployment + HPA + Service (type: LoadBalancer)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                              # example name; match the target Deployment
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: web-api }
  minReplicas: 2                             # keep redundancy even at low load
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }   # ~30% headroom for bursts
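
If scale-in flapping or thundering-herd scale-out is a concern, autoscaling/v2 also exposes a behavior block for rate-limiting changes; the values below are illustrative assumptions, not recommendations, and would be appended under spec: of the HPA above.

# Optional HPA tuning (illustrative values)
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300                          # hold 5 minutes of low load before removing replicas
    policies: [{ type: Percent, value: 10, periodSeconds: 60 }]
  scaleUp:
    policies: [{ type: Pods, value: 4, periodSeconds: 60 }]  # add at most 4 pods per minute
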
Tradeoffs

Pros

  • Near-linear scale for embarrassingly parallel workloads
  • High availability via redundancy
  • Cost effective vs large high-memory/CPU machines
  • Easier rolling deploys and canaries

Cons

  • Coordination overhead and distributed bugs
  • Data/state partitioning complexity
  • Potential hotspots for skewed keys
  • Increased network costs

Common Pitfalls

  • No warm-up/slow start, so cold replicas are overwhelmed as traffic shifts to them
  • Unbounded concurrency and connection pools per replica
  • Shared dependencies (DB/cache) becoming the bottleneck instead
  • Lack of graceful shutdown and connection draining (see the sketch after this list)
  • Inconsistent hashing causing cache hot keys
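
One way to address the shutdown/draining pitfall, sketched against the hypothetical web-api pod template and an assumed /healthz endpoint, is to pair a readiness probe with a preStop delay and a generous termination grace period so routing stops before the process exits:

# Pod template sketch for graceful shutdown and draining (assumed names)
spec:
  terminationGracePeriodSeconds: 45          # allow in-flight requests to finish after SIGTERM
  containers:
    - name: web-api
      readinessProbe:
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5                     # failing readiness removes the pod from load balancing
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]         # keep serving while endpoint removal propagates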

Design Considerations

  • Autoscaling on CPU/RPS/queue depth with headroom
  • PodDisruptionBudgets and graceful draining (see the example after this list)
  • Per-instance concurrency caps and pool sizes
  • Partition awareness and consistent hashing where relevant
  • Capacity planning and chaos testing
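
For the disruption-budget point, a PodDisruptionBudget like the following (name and selector assumed to match the earlier sketch) keeps a floor of serving replicas during voluntary disruptions such as node drains:

# PodDisruptionBudget sketch (assumed names)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2                            # never drain below two serving replicas
  selector:
    matchLabels: { app: web-api }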

Real-World Examples

  • Amazon: ASGs scale EC2 fleets based on traffic and SLOs (hundreds of thousands of instances)
  • Kubernetes: HPA scales pods by CPU/custom metrics (clusters with thousands of pods)
  • Cloudflare: the edge runs many small stateless workers per PoP (hundreds of cities globally)

Complexity Analysis

  • Scalability: High (add replicas elastically)
  • Implementation Complexity: Medium (coordination and autoscaling)
  • Cost: Variable (pay for headroom)