Paxos: The Protocol That Powers Distributed Systems
A rigorous deep dive into single-decree Paxos, Multi-Paxos, safety proofs, and the infamous liveness problem with dueling proposers.
Paxos, invented by Leslie Lamport in 1989 (and published in 1998 after famous reviewer resistance), is the foundational consensus algorithm for crash-fault-tolerant distributed systems. It underpins Google Chubby, Apache ZooKeeper, and — through derivatives — virtually every strongly-consistent distributed system.
💡 What Paxos solves
Three Roles
Phase-by-Phase Data Path
Paxos proceeds in two phases. Phase 1 discovers what was previously accepted. Phase 2 proposes and confirms. Both require a quorum (majority) response.
PREPARE(n)PROMISE(n, ⊥)PROMISE(n, b_prev, v_prev)quorum checkACCEPT(n, v)ACCEPTED(n, v)quorum checkVALUE CHOSEN💡 Why 2 phases? The core invariant
Phase 2 (Accept/Accepted) actually commits the value to a quorum.
The invariant:if a value v was chosen (accepted by a quorum) in ballot b, then every future proposer with ballot > b will discover v in Phase 1 and reproduce it.
Python Pseudocode
1# ─── Paxos Pseudocode (Single-Decree, Crash Fault Tolerant) ───
2
3# State on each ACCEPTOR:
4# promised_ballot: int = 0 # highest PREPARE ballot promised
5# accepted_ballot: int = None # ballot of last accepted value
6# accepted_value: Any = None # value of last accepted proposal
7
8class Acceptor:
9 def on_prepare(self, ballot: int):
10 if ballot > self.promised_ballot:
11 self.promised_ballot = ballot
12 return PROMISE(ballot, self.accepted_ballot, self.accepted_value)
13 # else: ignore (already promised to a higher ballot)
14
15 def on_accept(self, ballot: int, value: Any):
16 if ballot >= self.promised_ballot:
17 self.accepted_ballot = ballot
18 self.accepted_value = value
19 return ACCEPTED(ballot, value)
20 # else: reject (promised to a higher ballot)
21
22class Proposer:
23 def __init__(self, ballot: int, value: Any, acceptors: list):
24 self.ballot = ballot # unique, monotonically increasing
25 self.value = value # my preferred value (may be overridden)
26 self.acceptors = acceptors
27
28 def run(self):
29 # ── PHASE 1: Prepare ──────────────────────────────────────
30 responses = [a.on_prepare(self.ballot) for a in self.acceptors]
31 promises = [r for r in responses if r is not None]
32
33 if len(promises) < majority(self.acceptors):
34 raise FailedToGetQuorum("Phase 1 failed")
35
36 # KEY RULE: if any acceptor already accepted a value,
37 # we MUST use the value with the highest accepted ballot.
38 highest = max(promises, key=lambda r: r.accepted_ballot or -1)
39 if highest.accepted_ballot is not None:
40 self.value = highest.accepted_value # override our preferred value!
41
42 # ── PHASE 2: Accept ───────────────────────────────────────
43 accepted = [a.on_accept(self.ballot, self.value) for a in self.acceptors]
44 acks = [r for r in accepted if r is not None]
45
46 if len(acks) >= majority(self.acceptors):
47 # Value is CHOSEN — inform learners
48 notify_learners(self.value)
49 return self.value
50 else:
51 raise FailedToGetQuorum("Phase 2 failed — retry with higher ballot")
52
53# ── Multi-Paxos optimisation ──────────────────────────────────
54# Once a stable leader is elected, Phase 1 only runs ONCE per
55# leader term. Subsequent log entries go straight to Phase 2.
56# This reduces message complexity from O(N²) to O(N) per slot.Basic Paxos vs Multi-Paxos
- •Decides one value per round
- •Every round runs both Phase 1 and Phase 2
- •O(N²) messages per decision
- •No built-in leader — any node can propose
- •Theoretically clean, not practical at scale
- •Decides a log of commands (SMR)
- •Phase 1 runs once per leader term
- •Subsequent slots: Phase 2 only → O(N) per slot
- •Distinguished leader elected (e.g. via Chubby)
- •Used in Chubby, ZooKeeper, Spanner
Liveness Problem: Dueling Proposers
Paxos guarantees safety (only one value can be chosen) but not liveness (it may never choose anything). The canonical counter-example:
Multi-Paxos uses a distinguished leader (elected via Chubby or Raft-like timeout). When one proposer has exclusive lease, it is the only one proposing — no dueling. However, leader election itself cannot be guaranteed in a purely async system (FLP!), so real systems use timeouts with exponential backoff.