Database Replication

Core Scale & Availability

Primary–replica and multi-primary replication to scale reads and improve HA

Core Idea

Maintain multiple copies of data across nodes. Use asynchronous or synchronous replication to improve read scale, availability, and geo-distribution. Multi-primary setups must also detect and resolve conflicting writes.
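
The difference between synchronous and asynchronous replication can be illustrated with a small in-memory simulation. This is only a sketch: the `Primary` and `Replica` classes, the `pending` queue, and `ship_pending` are hypothetical names invented for illustration and do not correspond to any database's API.

```python
class Replica:
    def __init__(self):
        self.log = []  # entries this replica has applied


class Primary:
    def __init__(self, replicas, synchronous):
        self.log = []
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # entries acknowledged but not yet shipped (async only)

    def write(self, entry):
        self.log.append(entry)
        if self.synchronous:
            # Synchronous: apply on every replica before acknowledging.
            for replica in self.replicas:
                replica.log.append(entry)
        else:
            # Asynchronous: acknowledge immediately, ship in the background.
            self.pending.append(entry)
        return "ack"

    def ship_pending(self):
        # Stand-in for the background replication stream.
        for entry in self.pending:
            for replica in self.replicas:
                replica.log.append(entry)
        self.pending.clear()


async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_primary.write("order-1")
print(async_replica.log)  # still empty: the write was acknowledged before shipping
async_primary.ship_pending()
print(async_replica.log)  # now contains the entry
```

In the asynchronous case, a primary failure between `write` and `ship_pending` would lose an acknowledged write, which is why asynchronous replication implies a non-zero recovery point objective. The synchronous path avoids that at the cost of waiting on every replica for each write.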

When to Use

When reads outpace a single primary, when you need failover/HA, or when you want regional read locality.

Recognition Cues
Indicators that this pattern might be the right solution
  • Read-heavy workloads saturating the primary
  • Need low-latency regional reads
  • Maintenance windows require failover
  • Backup/analytics traffic impacts production

Pattern Variants & Approaches

Primary-Replica Overview
Writes go to a primary; asynchronous replication feeds replicas for scalable, low-latency reads via a router.

Primary-Replica Overview Architecture

[Diagram: the Application sends writes to the Primary DB and reads to a lag-aware Read Router; the Primary DB replicates to Replica 1 and Replica 2.]

When to Use This Variant

  • Hot primary with read-heavy load
  • Lag-aware read routing
  • Follower reads after writes

Use Case

OLTP systems needing high read throughput and maintenance without downtime.

Advantages

  • Scale reads horizontally
  • Failover and maintenance without downtime
  • Geo-local read latency

Implementation Example

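# App-side routing sketch (Python pseudocode).
# `primary` and `pick_replica` are assumed to be provided by the application.
def read(query):
    # Pick a replica whose replication lag is under the threshold.
    # Assumption: pick_replica returns None when no replica qualifies.
    replica = pick_replica(lag_threshold_ms=200)
    target = replica if replica is not None else primary
    return target.execute(query)

def write(cmd):
    # All writes go to the primary.
    return primary.execute(cmd)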
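
A fuller, runnable version of the routing sketch above might look like the following. It is a minimal illustration, not a production router: `Node`, `ReadRouter`, and the `lag_ms` field are hypothetical names, and the lag values are hard-coded where a real system would measure them from the database.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a database connection."""
    name: str
    lag_ms: float = 0.0  # last measured replication lag; 0 for the primary

    def execute(self, statement: str) -> str:
        # A real node would run the statement; here we report who served it.
        return f"{self.name}: {statement}"


class ReadRouter:
    def __init__(self, primary: Node, replicas: list, lag_threshold_ms: float = 200):
        self.primary = primary
        self.replicas = replicas
        self.lag_threshold_ms = lag_threshold_ms

    def pick_replica(self) -> Optional[Node]:
        # Only replicas within the lag budget are eligible for reads.
        fresh = [r for r in self.replicas if r.lag_ms <= self.lag_threshold_ms]
        return random.choice(fresh) if fresh else None

    def read(self, query: str) -> str:
        # Fall back to the primary when every replica is too far behind.
        target = self.pick_replica() or self.primary
        return target.execute(query)

    def write(self, cmd: str) -> str:
        # Writes always go to the primary.
        return self.primary.execute(cmd)


primary = Node("primary")
router = ReadRouter(primary, [Node("replica-1", lag_ms=35), Node("replica-2", lag_ms=900)])
print(router.read("SELECT 1"))             # served by replica-1, the only fresh replica
print(router.write("UPDATE t SET x = 1"))  # served by the primary
```

Falling back to the primary keeps reads fresh when replicas lag, but it can push read load back onto the primary exactly when the system is under stress, so the threshold is worth tuning against real lag measurements.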
Tradeoffs

Pros

  • Read scaling via replicas
  • Higher availability and faster maintenance
  • Geo-local reads for latency reduction
  • Offload backups/analytics to followers

Cons

  • Complex operations and failover handling
  • Stale reads and read-after-write anomalies
  • Conflict resolution complexity in multi-primary
  • Higher cost for extra replicas
Common Pitfalls
  • Replication lag causing stale reads and anomalies
  • Read-your-writes violations after POST/PUT
  • Split-brain/conflicts in multi-primary setups
  • Failover loops and flapping primaries
  • Ignoring LSN/GTID monitoring and lag-aware routing
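
One way to avoid the read-your-writes pitfall listed above is to remember the log position of each session's last write and route its reads to a replica only once that replica has replayed past it. The sketch below simulates this in memory; the `Primary`, `Replica`, and `Session` classes and the integer `lsn` counter are hypothetical stand-ins for a real LSN or GTID, not any database's API.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    lsn: int = 0  # monotonically increasing log position (LSN/GTID stand-in)
    data: dict = field(default_factory=dict)

    def write(self, key, value) -> int:
        self.lsn += 1
        self.data[key] = value
        return self.lsn  # position the caller must see on later reads


@dataclass
class Replica:
    replayed_lsn: int = 0
    data: dict = field(default_factory=dict)

    def catch_up(self, primary: Primary) -> None:
        # Simulates replication applying everything up to the primary's position.
        self.data = dict(primary.data)
        self.replayed_lsn = primary.lsn


@dataclass
class Session:
    last_write_lsn: int = 0  # highest position this client has written

    def write(self, primary, key, value):
        self.last_write_lsn = primary.write(key, value)

    def read(self, primary, replica, key):
        # Use the replica only if it has replayed this session's last write.
        if replica.replayed_lsn >= self.last_write_lsn:
            return "replica", replica.data.get(key)
        return "primary", primary.data.get(key)


primary, replica, session = Primary(), Replica(), Session()
session.write(primary, "cart", ["book"])
print(session.read(primary, replica, "cart"))  # replica is behind, so the primary serves it
replica.catch_up(primary)
print(session.read(primary, replica, "cart"))  # replica has caught up and can serve it
```

This guarantees only that a session sees its own writes. A different session with no recorded write position may still be routed to a lagging replica and read stale data.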
Design Considerations
  • Choose sync vs async and define RPO/RTO
  • Lag-aware read routing and session stickiness after writes
  • Conflict resolution (LWW, vector clocks, CRDTs) for multi-primary
  • Delayed replicas and read intents (strong vs eventual)
  • Health checks, fencing, and automated failover
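
For the multi-primary conflict resolution item above, a common combination is to use vector clocks to tell causally ordered writes from concurrent ones, and to fall back to last-write-wins (LWW) only for the concurrent case. The following is a minimal sketch under those assumptions; `Version`, `dominates`, and `resolve` are hypothetical names, and the timestamps are made-up values.

```python
from dataclasses import dataclass


@dataclass
class Version:
    value: str
    clock: dict        # vector clock: node id -> update counter
    timestamp: float   # wall-clock write time, used only for the LWW tie-break


def dominates(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update that clock `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def resolve(a: Version, b: Version) -> Version:
    if dominates(a.clock, b.clock):
        return a  # a causally follows (or equals) b
    if dominates(b.clock, a.clock):
        return b  # b causally follows a
    # Concurrent writes: fall back to last-write-wins and merge the clocks.
    winner = a if a.timestamp >= b.timestamp else b
    nodes = a.clock.keys() | b.clock.keys()
    merged = {n: max(a.clock.get(n, 0), b.clock.get(n, 0)) for n in nodes}
    return Version(winner.value, merged, winner.timestamp)


v1 = Version("blue", {"A": 1}, timestamp=100.0)
v2 = Version("green", {"A": 2}, timestamp=101.0)        # builds on v1
v3 = Version("red", {"A": 1, "B": 1}, timestamp=102.0)  # concurrent with v2
print(resolve(v1, v2).value)  # green: v2 causally follows v1
print(resolve(v2, v3).value)  # red: concurrent, later timestamp wins
```

The LWW fallback silently discards the losing write and depends on reasonably synchronized clocks. Where losing a concurrent update is unacceptable, keeping both versions for application-level merging or using a CRDT is the usual alternative.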
Real-World Examples
MySQL

Primary-replica with semi-synchronous replication for HA

Thousands of replicas at large orgs
PostgreSQL

Streaming replication and Patroni failover

Hundreds of terabytes replicated
MongoDB

Replica sets with automatic elections

Global clusters with many nodes
Complexity Analysis
Scalability

Read-focused - Many followers

Implementation Complexity

High - Lag and failover management

Cost

Medium to High - Extra nodes