Amazon Dynamo: Highly Available Key-Value Store

The storage system behind Amazon's shopping cart. Dynamo trades consistency for availability using consistent hashing, vector clocks, sloppy quorums, and eventual consistency.

Published by Amazon in 2007, Dynamo is a highly available key-value storage system designed for Amazon's e-commerce infrastructure โ€” most famously the shopping cart. Its core insight: always writable. Even during network partitions, Dynamo continues accepting writes. Conflicts are resolved at read time by the application.

๐Ÿ’ก Dynamo's design goals

(1) Always writable โ€” SLA requires every AddToCart to succeed. (2) Decentralized โ€” no single master, all nodes are equal peers. (3) Eventual consistency โ€” reads may see stale data, but conflicts are detected and resolved. (4) Tunable via N, W, R.
Bigtable vs Dynamo at a glance
PropertyBigtableDynamo
Data modelWide-column (row + column family + qualifier + timestamp)Pure key-value (binary key โ†’ binary value)
ConsistencyStrong (CP)Eventual (AP, tunable)
CAP positionCP โ€” refuses writes if tablet unavailableAP โ€” always accepts writes, may diverge
PartitioningRange-based (row key sorted)Consistent hashing with MD5
Conflict resolutionServer-side: last-write-wins by timestampClient-side: application-level merge via vector clocks
ArchitectureMaster + tablet serversFully decentralized, DHT peer-to-peer

Architecture Overview

Decentralized Design

No master node. All N storage nodes are equal peers. Each node knows about every other node through a gossip protocol. Clients can contact any node; that node acts as the coordinator for the request.

Preference List

For each key, Dynamo maintains a preference list of the N nodes responsible for storing it (the N clockwise successors on the ring). The coordinator sends writes to the first W reachable nodes, reads to the first R reachable nodes.

Internal Storage

Each node runs a local storage engine (BerkeleyDB or custom). Objects are stored as opaque byte arrays โ€” Dynamo knows nothing about the value schema. Metadata (vector clock, version) is stored separately.

Gossip-based Membership

Nodes exchange membership information via gossip. When a node joins or leaves, the information propagates within seconds. Each node maintains a ring state (which nodes own which key ranges).

Consistent Hash Ring

Dynamo maps both keys and nodes onto a circular space using MD5 hashing. Each key is assigned to the first node clockwise from its hash position. Adding/removing a node affects only O(K/N) keys โ€” the minimum possible disruption.

Consistent Hash Ring
ABCDEHash Ring0 โ€“ 2ยนยฒโธ
Key Lookup
Show Replication (N=3)
Virtual Nodes per Server: None
Load Distribution
A
20.0%
B
20.0%
C
20.0%
D
20.0%
E
20.0%
With only 5 nodes, arc sizes vary by random hash placement. Increase virtual nodes for better balance.
Key position (hashed)
Successor node (primary replica)
Replication nodes (N=3)
Dashed line = clockwise walk to successor
Why MD5, not SHA-256?

Dynamo only needs uniform distribution, not cryptographic security. MD5 is 2โ€“3ร— faster than SHA-256 and produces a 128-bit output sufficient for the ring. In Dynamo's trusted data-center environment, hash flooding attacks are not a threat.

Adding a node only moves K/N keys

When a new node joins, it takes over the arc between its position and its predecessor. Only keys in that arc migrate. Compare with modulo hashing: adding one node to N requires moving N/(N+1) โ‰ˆ all keys.

Virtual Nodes

With only N physical nodes on the ring, arc sizes vary significantly due to random hash placement โ€” some nodes receive much more load than others. Dynamo uses 150 virtual nodes (vnodes) per physical server.

150
Virtual nodes per physical server in Dynamo's original design
โ‰ˆ1/N
Expected load per node โ€” law of large numbers with many small arcs
2ร—
A server with 2ร— capacity gets 2ร— virtual nodes โ†’ 2ร— the load

๐Ÿ’ก Heterogeneous clusters

Virtual nodes enable proportional load distribution in heterogeneous clusters. A newer server with higher capacity gets proportionally more vnodes, ensuring it receives proportionally more traffic โ€” no manual key range assignments needed.

Join/leave disruption

When a node joins, its 150 vnodes are spread across the ring, meaning up to 150 different existing nodes each give up a small arc. This is worse for the disrupted nodes than the simple case, but the migrations are all small. Modern systems (Cassandra) use "token allocation" strategies to reduce this.

Vector Clocks

Dynamo uses vector clocks to track causality between object versions. A vector clock is a list of (node, counter) pairs. When two versions have incomparable clocks (concurrent), Dynamo returns all versions to the client for application-level reconciliation.

Vector Clock Simulator
Node A
[0, 0, 0]
A=0 ยท B=0 ยท C=0
Node B
[0, 0, 0]
A=0 ยท B=0 ยท C=0
Node C
[0, 0, 0]
A=0 ยท B=0 ยท C=0
Message Sync

Sync sends a message from one node to another: sender increments its clock, receiver merges and increments.

Vector Clock Rules

โ€ข Local event: node increments its own counter โ†’ [A, B, C]A += 1

โ€ข Send: sender increments own counter first

โ€ข Receive: receiver takes element-wise max, then increments own counter

โ€ข Dominates: X < Y iff โˆ€i: X[i] โ‰ค Y[i] (causal ordering)

โ€ข Concurrent: X โ€– Y iff ยฌ(X < Y) โˆง ยฌ(Y < X) โ†’ conflict!

Shopping cart example

1. User adds "Book" on device A โ†’ clock [A:1]

2. User adds "Lamp" on device B (offline) โ†’ clock [B:1]

3. Network reconnects: [A:1] and [B:1] are concurrent

4. Dynamo returns both versions to the application

5. Application merges: cart = {Book, Lamp} (union)

Clock truncation

Vector clocks grow unboundedly as more nodes write the same key. Dynamo truncates old entries when the clock exceeds a threshold (typically 10 entries), removing the oldest (node, timestamp) pair. This sacrifices causal accuracy for bounded space โ€” conflicts may be falsely detected after truncation.

Quorum (N, W, R)

Dynamo's consistency is tunable via three parameters. The key invariant for strong consistency is W + R > N. Amazon's default is N=3, W=2, R=2.

Quorum Calculator (N, W, R)
Presets
3
Total copies of each key
2
Must acknowledge write
2
Must respond to read
Replica Visualization
1W+R
2W+R
3
Write quorum Read quorum Both Not in quorum
โœ“ Strong Consistency (W + R > N)
2 + 2 = 4 > N = 3
Every read overlaps with every write by at least 1 node(s). You will always see the latest write.
Write Majority
Yes (W=2 > N/2=1.5)
Read Majority
Yes (R=2 > N/2=1.5)
Max Failures (both)
1 node
Write Availability
99.970%
P(โ‰ฅW of N nodes up, p=0.99)
Read Availability
99.970%
P(โ‰ฅR of N nodes up, p=0.99)
Overall Availability
99.970%
min(write, read)
Sloppy Quorum:In Dynamo, W and R can be satisfied by any N healthy nodes in the ring โ€” not necessarily the N designated replicas. This maintains availability during failures but can break the W + R > N guarantee.

Sloppy Quorum & Hinted Handoff

In the presence of failures, Dynamo doesn't wait for the designated N nodes. Instead, it uses the first N healthy nodes on the ring โ€” the "sloppy quorum." A write to a replacement node carries a "hint": deliver this to the original node when it recovers.

1
Key maps to nodes B, C, D
In normal operation, coordinator writes to B, C, D (preference list).
2
Node C goes down
Coordinator detects C is unreachable. Cannot complete W=2 with just B and D.
3
Sloppy quorum: use node E
Coordinator writes to B, D, E. The write to E carries a hint: "this belongs to C, deliver when C recovers."
4
C recovers, E delivers hint
E detects C is back. E transfers the hinted replica to C, then deletes its local copy.

Sloppy quorum breaks W + R > N guarantee

If a write went to E (not C), and then you read from B and C (both in the original preference list), you might miss D's and E's updates. The read quorum doesn't overlap with the sloppy write quorum. True strong consistency requires waiting for the original N nodes โ€” which sacrifices availability.

Anti-Entropy with Merkle Trees

Hinted handoff handles short-term outages. For longer-term divergence (nodes that were down for extended periods, or network partitions), Dynamo uses Merkle tree anti-entropy.

How it works

1. Each node maintains a Merkle tree over its data range

2. Periodically, two replicas of the same key range compare their Merkle roots

3. If roots match โ†’ no differences (O(1) check!)

4. If roots differ โ†’ binary search down the tree to find diverged subtrees

5. Exchange only the keys in diverged subtrees (O(diff) bandwidth)

Root: H(H1 โ€– H2)
โ”œโ”€ H1: H(H3 โ€– H4)
โ”‚ โ”œโ”€ H3: H(key1,key2)
โ”‚ โ””โ”€ H4: H(key3,key4)
โ””โ”€ H2: H(H5 โ€– H6)
โ”œโ”€ H5: H(key5,key6)
โ””โ”€ H6: H(key7,key8)

๐Ÿ’ก Why Merkle trees?

Comparing N keys naively requires O(N) bandwidth. Merkle trees reduce divergence detection to O(log N) messages โ€” just compare tree nodes top-down until you find differences. Only the differing subtrees need their keys transferred, minimizing anti-entropy traffic.

PUT Operation โ€” Step by Step

1
Route
Client โ†’ any Dynamo node
Any node can act as coordinator. Typically the client library routes to the topmost node in the preference list.
2
Coordinate
Coordinator generates a version
Coordinator increments its own entry in the object's vector clock: e.g., [Node5:1]. This becomes the causal context for this write.
3
Replicate
Send to all N preference list nodes in parallel
Coordinator sends (key, value, context/vector_clock) to all N nodes concurrently. Does not wait for all N โ€” just for W acknowledgments.
4
Quorum
Wait for W acknowledgments
Once W nodes confirm (W=2 in default config), the write is considered durable. The coordinator returns success to the client.
5
Background
Remaining N-W nodes receive write asynchronously
The remaining N-W nodes may receive the write slightly later. They are now in the preference list for future reads.

GET Operation & Conflict Resolution

Happy path (no conflicts)

1. Coordinator sends GET(key) to N preference list nodes in parallel

2. Wait for R responses

3. If all R responses have the same vector clock โ†’ return value

4. Causally later version dominates and is returned

Conflict path (concurrent versions)

1. Coordinator receives R responses with incomparable vector clocks

2. Coordinator returns ALL conflicting versions + their contexts to the client

3. Client performs semantic reconciliation (e.g., union of shopping cart items)

4. Client writes back merged value with merged context

Exam tip: Who resolves conflicts?

In Bigtable: the server resolves conflicts with last-write-wins using timestamps. In Dynamo: the client application resolves conflicts using vector clocks. This is a key architectural difference โ€” Dynamo makes the application responsible for defining what "merge" means for its data type.

Bigtable vs Dynamo: Deep Comparison

Detailed Bigtable vs Dynamo comparison
DimensionBigtableDynamo
Data modelWide-column: (row, col family:qualifier, timestamp) โ†’ valuePure KV: binary key โ†’ binary value (opaque)
ConsistencyStrong (single master per tablet)Eventual (tunable W+R quorum)
CAP positionCP: refuse ops when tablet unavailableAP: always accept writes, may diverge
PartitioningRange-based: lexicographic row key rangesConsistent hashing: MD5 ring, 150 vnodes/server
Conflict resolutionServer-side: LWW by GFS timestampClient-side: app-level merge via vector clocks
ArchitectureMaster + tablet servers (hierarchical)Fully decentralized DHT (all nodes equal)
Read pathChubby โ†’ root โ†’ METADATA โ†’ tablet serverAny node coordinates; directly contacts N replicas
Fault toleranceMaster detects failure via Chubby; GFS stores dataHinted handoff + Merkle anti-entropy
Range queriesEfficient (sorted row keys)Not supported (hash-distributed keys)
Use casesWeb index, time series, structured dataShopping cart, session state, user profiles

Real-World Impact

๐ŸŒ In the Wild:Amazon Shopping Cart

Always-writable cart

The shopping cart was Dynamo's original use case. Amazon's SLA requires every "Add to Cart" to succeed โ€” even during partial network failures. Dynamo's AP design means a cart write always succeeds locally. Conflicts (two carts diverged during a partition) are resolved with the union strategy: you may see an item you deleted, but you'll never lose an item you added.
๐ŸŒ In the Wild:Apache Cassandra

Dynamo ideas at massive scale

Cassandra is the most successful open-source implementation of Dynamo's design โ€” decentralized, consistent hashing, tunable N/W/R quorum, and gossip-based membership. Used by Netflix, Apple, Discord. Cassandra added CQL (SQL-like query language) on top of Dynamo's KV foundation, adding column family support inspired by Bigtable.

Dynamo Exam Questions

These 5 questions cover the most frequently tested Dynamo concepts.