1. CAP theorem states that a distributed data store can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance
2. In practice, network partitions are inevitable in distributed systems, so you must choose between CP and AP
3. Banks choose CP (consistency over availability), while social media platforms choose AP (availability over consistency)
4. Modern systems often use eventual consistency patterns to balance both properties
- Network Partition Rate: 0.1%
- Downtime Cost per Hour: $300K
- Systems Using AP: 65%
What is CAP Theorem?
CAP theorem, formulated by Eric Brewer in 2000 and formally proven by Gilbert and Lynch in 2002, is a fundamental principle in distributed systems. It states that any distributed data store can provide at most two of the following three guarantees simultaneously:
- Consistency (C): All nodes see the same data simultaneously
- Availability (A): The system remains operational and responsive
- Partition Tolerance (P): The system continues operating despite network failures
This isn't a design choice - it's a mathematical impossibility to achieve all three simultaneously when network partitions occur. Understanding CAP theorem is crucial for system design interviews and building real-world distributed applications.
Understanding the Three Properties
Let's break down each property with concrete examples:
- Consistency (C): Every read receives the most recent write or an error. All nodes must agree on the same value at the same time.
- Availability (A): Every request to a non-failing node receives a non-error response, with no guarantee that it reflects the most recent write.
- Partition Tolerance (P): The system continues to operate despite arbitrary message loss or failure between nodes.
Source: Google SRE Book
Why You Must Choose: The Partition Reality
In practice, network partitions are inevitable in distributed systems. Hardware fails, cables get cut, switches crash, and cloud regions go down. When partitions occur, you face a binary choice:
- Choose Consistency (CP): Reject requests to maintain data integrity
- Choose Availability (AP): Accept requests but risk serving stale data
This choice isn't theoretical; it plays out in production systems. During the 2017 Amazon S3 outage, many AP-style systems kept serving cached data, while CP-style systems stopped responding rather than risk inconsistency.
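The CP and AP behaviors above can be sketched with a toy replica. This is a minimal illustration, not a real database client: the `Replica` class, its `partitioned` flag, and the `Unavailable` exception are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class Unavailable(Exception):
    """Raised by a CP replica that cannot confirm it holds the latest value."""


@dataclass
class Replica:
    mode: str                      # "CP" or "AP"
    partitioned: bool = False      # True when cut off from the other nodes
    store: Dict[str, str] = field(default_factory=dict)

    def read(self, key: str) -> Optional[str]:
        if self.partitioned and self.mode == "CP":
            # CP: refuse the request rather than risk returning stale data
            raise Unavailable(f"cannot confirm latest value for {key!r}")
        # AP (or no partition): answer from local state, which may be stale
        return self.store.get(key)


cp = Replica(mode="CP", store={"balance": "100"})
ap = Replica(mode="AP", store={"balance": "100"})
cp.partitioned = ap.partitioned = True   # simulate a network partition

ap_result = ap.read("balance")           # served locally, possibly stale
try:
    cp.read("balance")
    cp_result = "served"
except Unavailable:
    cp_result = "rejected"
```

During the simulated partition the AP replica answers from its local copy, while the CP replica raises an error that the caller has to handle.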
Real-World CAP Theorem Examples
Different industries make different CAP tradeoffs based on business requirements:
| System Type | CAP Choice | Example | Why This Choice |
|---|---|---|---|
| Banking Systems | CP (Consistency + Partition Tolerance) | Core account ledgers and wire transfers | Money transfers must be exact - better to show error than wrong balance |
| Social Media | AP (Availability + Partition Tolerance) | Facebook, Twitter feeds | Users expect fast responses - temporary stale data is acceptable |
| DNS Systems | AP (Availability + Partition Tolerance) | Global DNS infrastructure | Web must work even with stale DNS records |
| Trading Platforms | CP (Consistency + Partition Tolerance) | Stock exchanges | Inconsistent prices could enable arbitrage and market manipulation |
| Content Delivery | AP (Availability + Partition Tolerance) | Netflix, YouTube | Streaming must continue even if metadata is slightly outdated |
CP vs AP: Architecture Patterns
The CAP choice fundamentally shapes your system architecture and technology choices:
Which Should You Choose?
Choose CP when:
- Financial transactions or money is involved
- Data corruption is catastrophic
- Regulatory compliance requires audit trails
- Users expect 100% accurate data
- Example technologies: PostgreSQL with sync replication, MongoDB with majority write concern
Choose AP when:
- User experience depends on low latency
- Temporary inconsistency is acceptable
- System must scale to millions of users
- Regional outages cannot stop service
- Example technologies: Cassandra, DynamoDB, CouchDB
Choose a hybrid approach when:
- Different data types have different consistency needs
- You can partition by geography or feature
- Some operations are more critical than others
- Example: User profiles (AP) + Payment processing (CP)
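One way to express a per-data-type choice like the user profile and payment example is a small policy table. The sketch below is only illustrative: the `Mode` enum, the `CONSISTENCY_POLICY` mapping, and the `mode_for` helper are assumed names, and defaulting unknown data to CP is a design assumption rather than a rule.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    CP = "CP"
    AP = "AP"


# Hypothetical mapping from data type to the consistency mode it needs
CONSISTENCY_POLICY: Dict[str, Mode] = {
    "payment": Mode.CP,        # money is involved, so reject rather than guess
    "user_profile": Mode.AP,   # temporary staleness is acceptable
}


def mode_for(data_type: str) -> Mode:
    # Assumption: unclassified data is treated as critical until reviewed
    return CONSISTENCY_POLICY.get(data_type, Mode.CP)
```

A request router could call `mode_for("payment")` before deciding whether a write needs a majority acknowledgement or can be accepted by any reachable node.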
Beyond CAP: The PACELC Theorem
CAP theorem only describes behavior during network partitions. The PACELC theorem extends this: if there's a Partition, choose between Availability and Consistency; Else, choose between Latency and Consistency.
Most systems spend 99.9% of their time in normal operation (no partitions), so the Latency vs Consistency tradeoff is often more important than CAP. This explains why eventual consistency patterns are so popular - they optimize for low latency during normal operation while handling partitions gracefully.
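To put the 99.9% figure in perspective, the arithmetic below converts it into hours per year and encodes the PACELC rule as a tiny function. Both helper names are made up for this sketch, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def partitioned_hours_per_year(normal_fraction: float) -> float:
    """Hours per year spent outside normal operation."""
    return HOURS_PER_YEAR * (1.0 - normal_fraction)


def pacelc_tradeoff(partitioned: bool) -> str:
    """PACELC: which tradeoff applies in the current state."""
    if partitioned:
        return "availability vs consistency"
    return "latency vs consistency"


hours = partitioned_hours_per_year(0.999)   # about 8.76 hours per year
```

Under these assumptions the partition tradeoff applies for under nine hours a year, and the latency tradeoff applies for the remaining time.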
Implementing CAP Choices in Practice
Here's how to implement different CAP choices in common scenarios:
CP System Implementation
1. Use Synchronous Replication
Write to a majority of nodes before acknowledging. Use techniques like two-phase commit or Raft consensus for strong consistency.
2. Implement Circuit Breakers
Fail fast when nodes are unreachable rather than serving potentially stale data. Monitor partition detection and healing.
3. Choose Appropriate Databases
PostgreSQL with synchronous replication, etcd for configuration, or MongoDB with w=majority write concern.
4. Design for Graceful Degradation
Return meaningful error messages during partitions. Implement read-only modes for non-critical data.
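The fail-fast behavior from steps 2 and 4 can be sketched as a small circuit breaker. This is a simplified illustration, not a production library: `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are assumptions introduced for this example, and only `ConnectionError` is counted as a failure.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast with a clear error instead of waiting on a timeout
                raise CircuitOpenError("backend unreachable, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state)
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                   # a success closes the breaker
        return result


# Usage with a fake clock so the example does not have to sleep
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0,
                         clock=lambda: fake_now[0])


def unreachable() -> str:
    raise ConnectionError("node unreachable")


for _ in range(2):
    try:
        breaker.call(unreachable)
    except ConnectionError:
        pass  # two consecutive failures open the breaker
```

While the breaker is open, callers get `CircuitOpenError` immediately, which a CP service could translate into a meaningful error response or a read-only mode.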
AP System Implementation
1. Embrace Eventual Consistency
Use asynchronous replication and conflict resolution strategies. Implement vector clocks or last-writer-wins patterns.
2. Implement Multi-Region Architecture
Deploy across multiple availability zones or regions. Use technologies like Cassandra or DynamoDB Global Tables.
3. Design Conflict Resolution
Plan for concurrent writes during partitions. Use application-level merging or tombstone patterns for deletes.
4. Monitor Data Staleness
Track replication lag and implement alerts for excessive inconsistency windows.
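Vector clocks, mentioned in step 1, let an AP system tell whether two versions of a value are ordered or were written concurrently during a partition. The following is a minimal sketch with assumed names (`VectorClock`, `compare`, `merge`); real systems attach such clocks to each stored value and still need an application-level rule to resolve concurrent writes.

```python
from typing import Dict

# Maps a node id to the number of writes that node has made
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither dominates: the writes happened independently and conflict
    return "concurrent"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, recorded once a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


left = {"n1": 2, "n2": 0}    # n1 accepted a write during the partition
right = {"n1": 1, "n2": 1}   # n2 accepted a different write
relation = compare(left, right)
resolved_clock = merge(left, right)
```

A `concurrent` result is the signal to run conflict resolution such as application-level merging or last-writer-wins; `before` and `after` mean one version can safely replace the other.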
Code Example: Detecting Network Partitions
Here's a simple pattern for detecting partitions and choosing your CAP behavior:
import asyncio
from typing import List, Set


class PartitionException(Exception):
    """Raised when a write cannot reach enough healthy nodes."""


class CAPAwareService:
    def __init__(self, nodes: List[str], consistency_mode: str = 'CP'):
        self.nodes = nodes
        self.consistency_mode = consistency_mode  # 'CP' or 'AP'
        self.healthy_nodes = set(nodes)
        # Majority quorum: more than half of all configured nodes
        self.partition_threshold = len(nodes) // 2 + 1
        # Hold references so background tasks are not garbage-collected early
        self._background_tasks: Set[asyncio.Task] = set()

    async def write_data(self, key: str, value: str) -> bool:
        if self.consistency_mode == 'CP':
            return await self._cp_write(key, value)
        return await self._ap_write(key, value)

    async def _cp_write(self, key: str, value: str) -> bool:
        """CP: Require majority of nodes to be healthy"""
        if len(self.healthy_nodes) < self.partition_threshold:
            raise PartitionException("Cannot maintain consistency during partition")
        # Write to healthy nodes one at a time until a majority acknowledges
        successful_writes = 0
        for node in list(self.healthy_nodes):
            if await self._write_to_node(node, key, value):
                successful_writes += 1
            if successful_writes >= self.partition_threshold:
                return True
        return False

    async def _ap_write(self, key: str, value: str) -> bool:
        """AP: Write to any available node"""
        # Iterate over a copy: _write_to_node may discard nodes from the set
        for node in list(self.healthy_nodes):
            if await self._write_to_node(node, key, value):
                # Trigger async replication to other nodes
                task = asyncio.create_task(self._replicate_async(key, value, node))
                self._background_tasks.add(task)
                task.add_done_callback(self._background_tasks.discard)
                return True
        raise PartitionException("No healthy nodes available")

    async def _write_to_node(self, node: str, key: str, value: str) -> bool:
        try:
            # Simulate network call with timeout
            await asyncio.wait_for(
                self._network_call(node, key, value),
                timeout=1.0
            )
            return True
        except (asyncio.TimeoutError, OSError):
            # Treat timeouts and connection errors as an unreachable node
            self.healthy_nodes.discard(node)
            return False

    async def _network_call(self, node: str, key: str, value: str) -> None:
        """Placeholder transport; replace with a real RPC or HTTP client."""
        await asyncio.sleep(0)

    async def _replicate_async(self, key: str, value: str, exclude_node: str):
        """Background replication for AP systems"""
        for node in list(self.healthy_nodes):
            if node != exclude_node:
                # Best effort: _write_to_node reports failure instead of raising
                await self._write_to_node(node, key, value)
Taylor Rupe
Full-Stack Developer (B.S. Computer Science, B.A. Psychology)
Taylor combines formal training in computer science with a background in human behavior to evaluate complex search, AI, and data-driven topics. His technical review ensures each article reflects current best practices in semantic search, AI systems, and web technology.