Distributed Storage: CAP Theorem & NoSQL Systems

CAP theorem fundamentals, the NoSQL landscape, and an architectural comparison of Bigtable (CP, wide-column) and Dynamo (AP, key-value).

Distributed storage systems must make explicit trade-offs between consistency, availability, and partition tolerance β€” as formalized by Brewer's CAP theorem. This section covers the two most studied systems in the DSA-AS exam: Google Bigtable (a CP wide-column store) and Amazon Dynamo (an AP key-value store).

πŸ’‘ Why do we need distributed storage?

A single machine's disk is limited in capacity, speed, and fault tolerance. Distributed storage spreads data across many nodes, enabling: (1) horizontal scalability beyond single-machine limits, (2) fault tolerance via replication, and (3) geographic distribution for low-latency access.

CAP Theorem

Brewer's CAP theorem (2000, formally proved by Gilbert & Lynch 2002) states that a distributed system can guarantee at most two of the three properties: Consistency, Availability, and Partition Tolerance. Since partitions are inevitable in real networks, every distributed system must choose between CP or AP.

CAP Theorem β€” Interactive Triangle

Click a zone (CP, AP, or CA) to see which systems live there and why.

CPAPCACConsistencyAAvailabilityPPartition Tol.BigtableDynamoRDBMS
Click a zone (CP, AP, CA) or the colored triangle sectors to explore
# CAP Theorem (Brewer 2000, formally proved 2002)
In any distributed system, during a network partition (P),
you must choose between Consistency (C) and Availability (A).
Note: Partitions are inevitable in real networks β†’
every distributed system is either CP or AP (not CA).
Consistency (C)

Every read receives the most recent write (or an error). All nodes see the same data at the same time. Equivalent to linearizability.

Availability (A)

Every request receives a response (not an error), though the response may be stale. Every non-failing node must return a response.

Partition Tolerance (P)

The system continues operating despite arbitrary network partitions (messages lost or delayed between nodes). Required by all distributed systems.

Exam tip: PACELC model

CAP only addresses behavior during partitions. The PACELC model extends it: even without partitions (E), systems must trade Latency (L) for Consistency (C). Dynamo is PA/EL (available during partition, low-latency otherwise). Bigtable is PC/EC (consistent during partition, consistent otherwise).

NoSQL System Classification

NoSQL systems are classified by their data model. Each type optimizes for a different query pattern and makes different trade-offs.

πŸ”‘Key-Value Store

Simplest model: opaque key β†’ opaque value. Extremely fast, no schema, no range queries.

DynamoDBRedisRiakVoldemort
No structured queries, no secondary indexes, client handles data format.
πŸ“ŠWide-Column Store

Row key + dynamic columns. Efficient for sparse data, supports range scans on row key.

BigtableHBaseCassandraScylla
Column families must be declared; queries limited to row key range scans.
πŸ“„Document Store

JSON/BSON documents with nested structures. Supports secondary indexes and rich queries.

MongoDBCouchDBFirestoreCouchbase
Write amplification for indexed fields; consistency tradeoffs for distribution.
πŸ•ΈοΈGraph Store

Nodes and edges with properties. Optimized for relationship traversal queries.

Neo4jAmazon NeptuneTigerGraphJanusGraph
Poor horizontal scalability; niche use cases; hard to shard graph relationships.

Storage Layout: Row-Oriented vs Column-Oriented

How data is physically stored on disk dramatically affects query performance. Row-oriented stores are optimized for OLTP (transactional) workloads; column-oriented (columnar) stores for OLAP (analytical) workloads.

Row-oriented vs. column-oriented storage layout comparison
PropertyRow-OrientedColumn-Oriented
Physical layoutAll columns of a row stored togetherAll values of a column stored together
ExamplesMySQL, PostgreSQL, Bigtable (per-tablet)BigQuery, Redshift, Parquet, ORC
Read patternEfficient for SELECT * (all columns of few rows)Efficient for SELECT avg(col) (one column, all rows)
Write patternFast single-row writes (one seek)Slow writes (update N column files)
CompressionLimited (heterogeneous row data)Excellent (column values are homogeneous, similar values together)
OLTP suitabilityExcellentPoor
OLAP suitabilityPoorExcellent
Projection pushdownNot applicableSkip unneeded columns entirely (no I/O)

πŸ’‘ Bigtable's column families

Bigtable is technically a wide-column store but stores data row-by-row within a tablet. However, Bigtable can be configured to compress data within a column family together, giving some columnar benefits. This is different from true columnar stores like BigQuery which decompose tables into per-column files.

Deep Dives

Explore each storage system in detail with interactive diagrams and exam questions.