Amazon Dynamo: Highly Available Key-Value Store
The storage system behind Amazon's shopping cart. Dynamo trades consistency for availability using consistent hashing, vector clocks, sloppy quorums, and eventual consistency.
Published by Amazon in 2007, Dynamo is a highly available key-value storage system designed for Amazon's e-commerce infrastructure โ most famously the shopping cart. Its core insight: always writable. Even during network partitions, Dynamo continues accepting writes. Conflicts are resolved at read time by the application.
๐ก Dynamo's design goals
| Property | Bigtable | Dynamo |
|---|---|---|
| Data model | Wide-column (row + column family + qualifier + timestamp) | Pure key-value (binary key โ binary value) |
| Consistency | Strong (CP) | Eventual (AP, tunable) |
| CAP position | CP โ refuses writes if tablet unavailable | AP โ always accepts writes, may diverge |
| Partitioning | Range-based (row key sorted) | Consistent hashing with MD5 |
| Conflict resolution | Server-side: last-write-wins by timestamp | Client-side: application-level merge via vector clocks |
| Architecture | Master + tablet servers | Fully decentralized, DHT peer-to-peer |
Architecture Overview
No master node. All N storage nodes are equal peers. Each node knows about every other node through a gossip protocol. Clients can contact any node; that node acts as the coordinator for the request.
For each key, Dynamo maintains a preference list of the N nodes responsible for storing it (the N clockwise successors on the ring). The coordinator sends writes to the first W reachable nodes, reads to the first R reachable nodes.
Each node runs a local storage engine (BerkeleyDB or custom). Objects are stored as opaque byte arrays โ Dynamo knows nothing about the value schema. Metadata (vector clock, version) is stored separately.
Nodes exchange membership information via gossip. When a node joins or leaves, the information propagates within seconds. Each node maintains a ring state (which nodes own which key ranges).
Consistent Hash Ring
Dynamo maps both keys and nodes onto a circular space using MD5 hashing. Each key is assigned to the first node clockwise from its hash position. Adding/removing a node affects only O(K/N) keys โ the minimum possible disruption.
Dynamo only needs uniform distribution, not cryptographic security. MD5 is 2โ3ร faster than SHA-256 and produces a 128-bit output sufficient for the ring. In Dynamo's trusted data-center environment, hash flooding attacks are not a threat.
When a new node joins, it takes over the arc between its position and its predecessor. Only keys in that arc migrate. Compare with modulo hashing: adding one node to N requires moving N/(N+1) โ all keys.
Virtual Nodes
With only N physical nodes on the ring, arc sizes vary significantly due to random hash placement โ some nodes receive much more load than others. Dynamo uses 150 virtual nodes (vnodes) per physical server.
๐ก Heterogeneous clusters
Join/leave disruption
When a node joins, its 150 vnodes are spread across the ring, meaning up to 150 different existing nodes each give up a small arc. This is worse for the disrupted nodes than the simple case, but the migrations are all small. Modern systems (Cassandra) use "token allocation" strategies to reduce this.
Vector Clocks
Dynamo uses vector clocks to track causality between object versions. A vector clock is a list of (node, counter) pairs. When two versions have incomparable clocks (concurrent), Dynamo returns all versions to the client for application-level reconciliation.
Sync sends a message from one node to another: sender increments its clock, receiver merges and increments.
โข Local event: node increments its own counter โ [A, B, C]A += 1
โข Send: sender increments own counter first
โข Receive: receiver takes element-wise max, then increments own counter
โข Dominates: X < Y iff โi: X[i] โค Y[i] (causal ordering)
โข Concurrent: X โ Y iff ยฌ(X < Y) โง ยฌ(Y < X) โ conflict!
1. User adds "Book" on device A โ clock [A:1]
2. User adds "Lamp" on device B (offline) โ clock [B:1]
3. Network reconnects: [A:1] and [B:1] are concurrent
4. Dynamo returns both versions to the application
5. Application merges: cart = {Book, Lamp} (union)
Vector clocks grow unboundedly as more nodes write the same key. Dynamo truncates old entries when the clock exceeds a threshold (typically 10 entries), removing the oldest (node, timestamp) pair. This sacrifices causal accuracy for bounded space โ conflicts may be falsely detected after truncation.
Quorum (N, W, R)
Dynamo's consistency is tunable via three parameters. The key invariant for strong consistency is W + R > N. Amazon's default is N=3, W=2, R=2.
Sloppy Quorum & Hinted Handoff
In the presence of failures, Dynamo doesn't wait for the designated N nodes. Instead, it uses the first N healthy nodes on the ring โ the "sloppy quorum." A write to a replacement node carries a "hint": deliver this to the original node when it recovers.
Sloppy quorum breaks W + R > N guarantee
If a write went to E (not C), and then you read from B and C (both in the original preference list), you might miss D's and E's updates. The read quorum doesn't overlap with the sloppy write quorum. True strong consistency requires waiting for the original N nodes โ which sacrifices availability.
Anti-Entropy with Merkle Trees
Hinted handoff handles short-term outages. For longer-term divergence (nodes that were down for extended periods, or network partitions), Dynamo uses Merkle tree anti-entropy.
1. Each node maintains a Merkle tree over its data range
2. Periodically, two replicas of the same key range compare their Merkle roots
3. If roots match โ no differences (O(1) check!)
4. If roots differ โ binary search down the tree to find diverged subtrees
5. Exchange only the keys in diverged subtrees (O(diff) bandwidth)
๐ก Why Merkle trees?
PUT Operation โ Step by Step
GET Operation & Conflict Resolution
1. Coordinator sends GET(key) to N preference list nodes in parallel
2. Wait for R responses
3. If all R responses have the same vector clock โ return value
4. Causally later version dominates and is returned
1. Coordinator receives R responses with incomparable vector clocks
2. Coordinator returns ALL conflicting versions + their contexts to the client
3. Client performs semantic reconciliation (e.g., union of shopping cart items)
4. Client writes back merged value with merged context
Exam tip: Who resolves conflicts?
In Bigtable: the server resolves conflicts with last-write-wins using timestamps. In Dynamo: the client application resolves conflicts using vector clocks. This is a key architectural difference โ Dynamo makes the application responsible for defining what "merge" means for its data type.
Bigtable vs Dynamo: Deep Comparison
| Dimension | Bigtable | Dynamo |
|---|---|---|
| Data model | Wide-column: (row, col family:qualifier, timestamp) โ value | Pure KV: binary key โ binary value (opaque) |
| Consistency | Strong (single master per tablet) | Eventual (tunable W+R quorum) |
| CAP position | CP: refuse ops when tablet unavailable | AP: always accept writes, may diverge |
| Partitioning | Range-based: lexicographic row key ranges | Consistent hashing: MD5 ring, 150 vnodes/server |
| Conflict resolution | Server-side: LWW by GFS timestamp | Client-side: app-level merge via vector clocks |
| Architecture | Master + tablet servers (hierarchical) | Fully decentralized DHT (all nodes equal) |
| Read path | Chubby โ root โ METADATA โ tablet server | Any node coordinates; directly contacts N replicas |
| Fault tolerance | Master detects failure via Chubby; GFS stores data | Hinted handoff + Merkle anti-entropy |
| Range queries | Efficient (sorted row keys) | Not supported (hash-distributed keys) |
| Use cases | Web index, time series, structured data | Shopping cart, session state, user profiles |
Real-World Impact
Always-writable cart
Dynamo ideas at massive scale
Dynamo Exam Questions
These 5 questions cover the most frequently tested Dynamo concepts.