Paxos: The Protocol That Powers Distributed Systems

A rigorous deep dive into single-decree Paxos, Multi-Paxos, safety proofs, and the infamous liveness problem with dueling proposers.

Paxos, invented by Leslie Lamport in 1989 (and published in 1998 after famous reviewer resistance), is the foundational consensus algorithm for crash-fault-tolerant distributed systems. It underpins Google Chubby, Apache ZooKeeper, and — through derivatives — virtually every strongly-consistent distributed system.

💡 What Paxos solves

Given N processes where up to f < N/2 may crash, reach agreement on a single value (single-decree Paxos) or a sequence of values (Multi-Paxos / state machine replication). Paxos guarantees safety always, but liveness only when a stable leader exists.

Three Roles

📤

Proposer

Initiates consensus by choosing a ballot number and proposing a value. In Multi-Paxos, exactly one distinguished proposer acts as leader, sending Phase 2 messages continuously.

✅

Acceptor

Votes on proposals. Maintains state: the highest ballot promised and the last (ballot, value) accepted. 2f+1 acceptors are needed to tolerate f failures.

📚

Learner

Learns the chosen value once a quorum of acceptors have accepted it. In practice, the proposer itself often acts as a learner.

Paxos Message Flow

Show Failure Scenario

Phase 1 (blue)

Phase 2 (green)

Phase-by-Phase Data Path

Paxos proceeds in two phases. Phase 1 discovers what was previously accepted. Phase 2 proposes and confirms. Both require a quorum (majority) response.

Phase 1a

Proposer→All acceptorsPREPARE(n)

Proposer picks a unique ballot number n (higher than any it has seen) and broadcasts PREPARE(n).

Phase 1b

Acceptor→ProposerPROMISE(n, ⊥)

Each acceptor that has not promised to a ballot ≥ n responds with PROMISE, including any previously accepted (ballot, value) pair.

Phase 1b

Acceptor→ProposerPROMISE(n, b_prev, v_prev)

If the acceptor previously accepted a value in some earlier ballot b_prev, it includes that in the PROMISE.

Phase 1b

Proposer→internalquorum check

Proposer waits for f+1 PROMISE responses (a majority). It notes the highest b_prev seen and selects that value (or its own if none).

Phase 2a

Proposer→Quorum of acceptorsACCEPT(n, v)

Proposer sends ACCEPT(n, v) to a quorum. v is the value with the highest b_prev seen in phase 1b (or proposer's own if none were accepted).

Phase 2b

Acceptor→Proposer + LearnersACCEPTED(n, v)

Each acceptor that has not promised to a ballot > n accepts the value and sends ACCEPTED to both the proposer and all learners.

Phase 2b

Proposer→internalquorum check

Proposer waits for f+1 ACCEPTED messages. Once received: value v is CHOSEN for this slot.

Phase done

Learner→—VALUE CHOSEN

Learner receives ACCEPTED messages from a majority. It now knows that value v is the final committed value for this consensus slot.

💡 Why 2 phases? The core invariant

Phase 1 (Prepare/Promise) learns about any value that might already have been accepted — so the proposer knows what it must propose. A proposer MUST use the highest-ballot value heard in Phase 1 responses rather than its own preferred value. This ensures that a value already chosen (by a prior quorum) is never replaced by a different value in a subsequent round.

Phase 2 (Accept/Accepted) actually commits the value to a quorum.

The invariant:if a value v was chosen (accepted by a quorum) in ballot b, then every future proposer with ballot > b will discover v in Phase 1 and reproduce it.

Python Pseudocode

paxos.py

1# ─── Paxos Pseudocode (Single-Decree, Crash Fault Tolerant) ───
2
3# State on each ACCEPTOR:
4#   promised_ballot: int  = 0      # highest PREPARE ballot promised
5#   accepted_ballot: int  = None   # ballot of last accepted value
6#   accepted_value:  Any  = None   # value of last accepted proposal
7
8class Acceptor:
9    def on_prepare(self, ballot: int):
10        if ballot > self.promised_ballot:
11            self.promised_ballot = ballot
12            return PROMISE(ballot, self.accepted_ballot, self.accepted_value)
13        # else: ignore (already promised to a higher ballot)
14
15    def on_accept(self, ballot: int, value: Any):
16        if ballot >= self.promised_ballot:
17            self.accepted_ballot = ballot
18            self.accepted_value  = value
19            return ACCEPTED(ballot, value)
20        # else: reject (promised to a higher ballot)
21
22class Proposer:
23    def __init__(self, ballot: int, value: Any, acceptors: list):
24        self.ballot    = ballot   # unique, monotonically increasing
25        self.value     = value    # my preferred value (may be overridden)
26        self.acceptors = acceptors
27
28    def run(self):
29        # ── PHASE 1: Prepare ──────────────────────────────────────
30        responses = [a.on_prepare(self.ballot) for a in self.acceptors]
31        promises  = [r for r in responses if r is not None]
32
33        if len(promises) < majority(self.acceptors):
34            raise FailedToGetQuorum("Phase 1 failed")
35
36        # KEY RULE: if any acceptor already accepted a value,
37        # we MUST use the value with the highest accepted ballot.
38        highest = max(promises, key=lambda r: r.accepted_ballot or -1)
39        if highest.accepted_ballot is not None:
40            self.value = highest.accepted_value   # override our preferred value!
41
42        # ── PHASE 2: Accept ───────────────────────────────────────
43        accepted = [a.on_accept(self.ballot, self.value) for a in self.acceptors]
44        acks     = [r for r in accepted if r is not None]
45
46        if len(acks) >= majority(self.acceptors):
47            # Value is CHOSEN — inform learners
48            notify_learners(self.value)
49            return self.value
50        else:
51            raise FailedToGetQuorum("Phase 2 failed — retry with higher ballot")
52
53# ── Multi-Paxos optimisation ──────────────────────────────────
54# Once a stable leader is elected, Phase 1 only runs ONCE per
55# leader term. Subsequent log entries go straight to Phase 2.
56# This reduces message complexity from O(N²) to O(N) per slot.

Basic Paxos vs Multi-Paxos

Basic (Single-Decree) Paxos

•Decides one value per round
•Every round runs both Phase 1 and Phase 2
•O(N²) messages per decision
•No built-in leader — any node can propose
•Theoretically clean, not practical at scale

Multi-Paxos (used in practice)

•Decides a log of commands (SMR)
•Phase 1 runs once per leader term
•Subsequent slots: Phase 2 only → O(N) per slot
•Distinguished leader elected (e.g. via Chubby)
•Used in Chubby, ZooKeeper, Spanner

Liveness Problem: Dueling Proposers

Paxos guarantees safety (only one value can be chosen) but not liveness (it may never choose anything). The canonical counter-example:

Dueling proposers scenario: Time 1: P1 starts with ballot n=1 P1 → all acceptors: PREPARE(n=1) All acceptors: PROMISE(n=1, ⊥) → P1 is about to send ACCEPT(n=1, v1)… Time 2: P2 starts with ballot n=2 (before P1 finishes) P2 → all acceptors: PREPARE(n=2) All acceptors: PROMISE(n=2, ⊥) → Acceptors now reject P1's ACCEPT(n=1) because n=1 < promised=2 Time 3: P1 sees its ACCEPT rejected, increments to n=3 P1 → all acceptors: PREPARE(n=3) → Acceptors reject P2's in-flight ACCEPT(n=2) because n=2 < 3 Time 4: P2 increments to n=4… and so on forever. Result: neither P1 nor P2 ever achieves Phase 2 quorum. Liveness is violated — no value is ever chosen.

Practical Solution: Leader Election

Multi-Paxos uses a distinguished leader (elected via Chubby or Raft-like timeout). When one proposer has exclusive lease, it is the only one proposing — no dueling. However, leader election itself cannot be guaranteed in a purely async system (FLP!), so real systems use timeouts with exponential backoff.

Paxos Exam Questions

View all exam questions →