Advanced

Database Replication

Core Scale & Availability

Primary–replica and multi-primary replication to scale reads and improve HA

Core Idea

Maintain multiple copies of data across nodes. Use asynchronous or synchronous replication to improve read scale, availability, and geo-distribution. Multi-primary setups must also detect and resolve conflicting writes.
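To make the difference between asynchronous and synchronous replication concrete, here is a minimal in-memory sketch. The `Primary` and `Replica` classes and the `sync_replicas` parameter are hypothetical names invented for illustration, not any database's API. A write counts as committed only after `sync_replicas` replicas have applied it, so `0` models asynchronous replication.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    log: List[str] = field(default_factory=list)

    def catch_up(self, primary_log: List[str]) -> None:
        # Apply every entry this replica has not seen yet.
        self.log.extend(primary_log[len(self.log):])


@dataclass
class Primary:
    replicas: List[Replica]
    sync_replicas: int = 0  # 0 = asynchronous; len(replicas) = fully synchronous
    log: List[str] = field(default_factory=list)

    def write(self, entry: str) -> None:
        self.log.append(entry)
        # The commit blocks until `sync_replicas` replicas have applied the entry.
        for replica in self.replicas[: self.sync_replicas]:
            replica.catch_up(self.log)

    def replicate_in_background(self) -> None:
        # Stands in for the asynchronous shipping that happens after the commit.
        for replica in self.replicas:
            replica.catch_up(self.log)


def unreplicated_writes(primary: Primary, replica: Replica) -> int:
    """Writes that would be lost if the primary failed right now."""
    return len(primary.log) - len(replica.log)


async_primary = Primary(replicas=[Replica()], sync_replicas=0)
async_primary.write("order-1")
print(unreplicated_writes(async_primary, async_primary.replicas[0]))  # 1

sync_primary = Primary(replicas=[Replica()], sync_replicas=1)
sync_primary.write("order-1")
print(unreplicated_writes(sync_primary, sync_primary.replicas[0]))  # 0
```

In this model the asynchronous primary acknowledges the write while the replica is still one entry behind, which is the window that a recovery point objective (RPO) has to account for. The synchronous primary closes that window at the cost of waiting for the replica on every write.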

When to Use

When reads outpace a single primary, when you need failover/HA, or when you want regional read locality.

Recognition Cues

Indicators that this pattern might be the right solution

Read-heavy workloads saturating primary
Need low-latency regional reads
Maintenance windows require failover
Backup/analytics traffic impacts production

Pattern Variants & Approaches

Primary-Replica Overview

Writes go to a single primary; asynchronous replication feeds the replicas, and a router spreads reads across them for scalable, low-latency reads.

Figure: Primary-Replica Overview Architecture

When to Use This Variant

A hot primary under read-heavy load
Reads that can be routed by replica lag
Reads after writes that followers can serve once they have caught up

Use Case

OLTP systems needing high read throughput and maintenance without downtime.

Advantages

Scale reads horizontally
Failover and maintenance without downtime
Geo-local read latency

Implementation Example

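# App-side routing sketch (Python pseudocode).
# pick_replica() and primary are placeholders for the application's connection layer;
# this sketch assumes pick_replica() returns None when no replica is within the lag threshold.
def read(query, max_lag_ms=200):
    # Prefer a replica whose replication lag is within the threshold.
    replica = pick_replica(lag_threshold_ms=max_lag_ms)
    if replica is None:
        # No replica is fresh enough, so fall back to the primary.
        return primary.execute(query)
    return replica.execute(query)

def write(cmd):
    # All writes go to the primary.
    return primary.execute(cmd)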
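
The sketch above leaves `pick_replica` and `primary` undefined. One way to fill them in is shown below as a self-contained, runnable version. The `Node` and `Router` classes are hypothetical, `execute` only returns a string instead of talking to a database, and the lag values are assumed to come from whatever monitoring the deployment already has.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical handle for a database node; `execute` stands in for a real driver call."""
    name: str
    lag_ms: float  # most recently measured replication lag (0 for the primary)
    healthy: bool = True

    def execute(self, statement: str) -> str:
        return f"{self.name} ran: {statement}"


class Router:
    def __init__(self, primary: Node, replicas: List[Node], lag_threshold_ms: float = 200):
        self.primary = primary
        self.replicas = replicas
        self.lag_threshold_ms = lag_threshold_ms

    def pick_replica(self) -> Optional[Node]:
        # Keep only healthy replicas whose lag is within the threshold.
        fresh = [r for r in self.replicas if r.healthy and r.lag_ms <= self.lag_threshold_ms]
        return random.choice(fresh) if fresh else None

    def read(self, query: str) -> str:
        # Fall back to the primary when no replica is fresh enough.
        node = self.pick_replica() or self.primary
        return node.execute(query)

    def write(self, cmd: str) -> str:
        # All writes go to the primary.
        return self.primary.execute(cmd)


router = Router(
    primary=Node("primary", lag_ms=0),
    replicas=[Node("replica-a", lag_ms=50), Node("replica-b", lag_ms=900)],
)
print(router.read("SELECT 1"))  # replica-a ran: SELECT 1 (the only replica under 200 ms)
print(router.write("UPDATE t SET x = 1"))  # primary ran: UPDATE t SET x = 1
```

Falling back to the primary keeps reads fresh when every replica is lagging, but it also sends read load back to the node the pattern is meant to protect. A production router would likely cap or shed that fallback traffic.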

Tradeoffs

Pros

Read scaling via replicas
Higher availability and faster maintenance
Geo-local reads for latency reduction
Offload backups/analytics to followers

Cons

Complex operations and failover handling
Stale reads and read-after-write anomalies
Conflict resolution complexity in multi-primary
Higher cost for extra replicas

Common Pitfalls

Replication lag causing stale reads and anomalies
Read-your-writes violations after POST/PUT
Split-brain/conflicts in multi-primary setups
Failover loops and flapping primaries
Ignoring LSN/GTID monitoring and lag-aware routing
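
A common way to avoid the read-your-writes pitfall is to remember the log position of each session's last write and only read from replicas that have replayed at least that far. The sketch below assumes log positions can be compared as plain integers, which is a simplification of real LSNs and GTID sets. `Replica`, `Session`, and `route_read` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    replayed_lsn: int  # highest log position applied so far (simplified to an integer)


@dataclass
class Session:
    last_write_lsn: int = 0  # log position of this session's most recent write


def route_read(session: Session, replicas: List[Replica]) -> str:
    """Return the name of a node that reflects the session's own writes."""
    for replica in replicas:
        if replica.replayed_lsn >= session.last_write_lsn:
            return replica.name
    # No replica has caught up to this session's last write yet.
    return "primary"


replicas = [Replica("replica-a", replayed_lsn=100)]
session = Session()

# The session writes; the primary reports the commit position (101 here).
session.last_write_lsn = 101
print(route_read(session, replicas))  # primary: replica-a is still at 100

# Once the replica has replayed position 101, it may serve this session again.
replicas[0].replayed_lsn = 101
print(route_read(session, replicas))  # replica-a
```

In PostgreSQL, for example, comparable positions are exposed by `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on a standby. How the position travels with the session (cookie, header, connection state) is left open here.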

Design Considerations

Choose sync vs async and define RPO/RTO
Lag-aware read routing and session stickiness after writes
Conflict resolution (LWW, vector clocks, CRDTs) for multi-primary
Delayed replicas and read intents (strong vs eventual)
Health checks, fencing, and automated failover
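
Two of the conflict-resolution options named above can be sketched in a few lines. The code assumes each write carries either a vector clock (a per-node counter map) or a `(timestamp, node_id, value)` tuple. The function names and data shapes are illustrative only.

```python
from typing import Dict, Tuple

VectorClock = Dict[str, int]
TimestampedWrite = Tuple[float, str, str]  # (timestamp, node_id, value)


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify clock `a` relative to `b`: before, after, equal, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither write saw the other: a true conflict


def last_writer_wins(a: TimestampedWrite, b: TimestampedWrite) -> TimestampedWrite:
    # Highest timestamp wins; node_id breaks ties so every node picks the same winner.
    return max(a, b, key=lambda w: (w[0], w[1]))


print(compare({"us": 2, "eu": 1}, {"us": 2, "eu": 2}))  # before
print(compare({"us": 3, "eu": 1}, {"us": 2, "eu": 2}))  # concurrent
print(last_writer_wins((1700000000.0, "us", "blue"), (1700000000.5, "eu", "green")))
```

Vector clocks only detect that two writes are concurrent; something else (application logic, a CRDT merge, or LWW) still has to pick or combine the values. LWW is simple but silently discards the losing write and depends on reasonably synchronized clocks.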

Real-World Examples

MySQL

Primary-replica with semi-sync for HA

Thousands of replicas at large orgs

PostgreSQL

Streaming replication and Patroni failover

Hundreds of terabytes replicated

MongoDB

Replica sets with automatic elections

Global clusters with many nodes

Complexity Analysis

Scalability

Read-focused - Many followers

Implementation Complexity

High - Lag and failover management

Cost

Medium to High - Extra nodes

Related Patterns

Read/Write Split
CQRS
Database Sharding
Caching Strategies