Raft: Understandable Distributed Consensus
Deep dive into Raft's leader election, log replication, and safety properties — with a focus on why Raft is easier to reason about than Paxos.
Raft was designed by Diego Ongaro and John Ousterhout (2014) with a primary goal of understandability — making distributed consensus tractable for practitioners. Every design decision was made to minimise the number of states and interactions that a developer needs to reason about.
- • Defines consensus for a single value (slot)
- • Leader election and log replication are intertwined
- • Complex to extend to full SMR (Multi-Paxos is underspecified)
- • Many valid implementations — hard to verify correctness
- • Defines full replicated log management
- • Explicitly separates leader election from log replication
- • Restricted design: only leader appends, strong leader property
- • One canonical implementation — each state is deterministic
Raft Server States
Every Raft server is always in one of three states. The transitions are deterministic and well-defined — no hidden states or complex leader-change protocols.
Leader Election — Step by Step
Raft uses randomized election timeouts to ensure that only one candidate typically starts an election at a time. The randomization is the key mechanism that prevents Paxos-style dueling proposers.
With random timeouts (150–300 ms), only one follower typically times out first. It gets a head start, collects votes, and becomes leader before others even start. This is statistically safe: the probability that two nodes time out within 15 ms (typical RPC latency) of each other is very low with a 150 ms range.
Log Replication — Step by Step
Once a leader is elected, all client requests go through it. The leader never modifies existing log entries — it only appends. This "strong leader" property dramatically simplifies reasoning about consistency.
AppendEntries RPC (from the Raft spec)
1// Raft AppendEntries RPC — used for both log replication AND heartbeats
2// Heartbeat = AppendEntries with empty Entries slice
3
4// ── Arguments ────────────────────────────────────────────────────
5type AppendEntriesArgs struct {
6 Term int // leader's current term
7 LeaderID int // so followers can redirect clients
8 PrevLogIndex int // index of log entry immediately before new ones
9 PrevLogTerm int // term of PrevLogIndex entry (log matching check)
10 Entries []LogEntry // entries to replicate (empty = heartbeat)
11 LeaderCommit int // leader's commitIndex
12}
13
14// ── Results ──────────────────────────────────────────────────────
15type AppendEntriesReply struct {
16 Term int // current term of follower (so leader can update itself)
17 Success bool // true if follower contained matching entry at PrevLogIndex
18}
19
20// ── Log Entry ────────────────────────────────────────────────────
21type LogEntry struct {
22 Term int // term when entry was received by leader
23 Command interface{} // state machine command
24}
25
26// ── Safety check (Log Matching Property) ─────────────────────────
27// Follower rejects AppendEntries if:
28// 1. args.Term < currentTerm (stale leader)
29// 2. log[PrevLogIndex].Term != args.PrevLogTerm (log diverged)
30//
31// On rejection: leader decrements nextIndex for this follower and retries.
32// This "walks back" until a matching prefix is found.
33
34// ── Commit rule ───────────────────────────────────────────────────
35// Leader commits entry at index N if:
36// - N > commitIndex
37// - log[N].Term == currentTerm ← CRITICAL: never commit prior terms directly
38// - A majority of matchIndex[i] >= NSafety Properties
If a log entry is committed in term T, that entry will be present in the logs of all leaders for all terms > T.
Why: To win an election, a candidate must have a log at least as up-to-date as a majority. A committed entry was replicated to a majority. Any future majority must overlap with the committed majority by at least one node, which will block a vote for a candidate missing the entry.
If a server has applied log entry at index i to its state machine, no other server will ever apply a different command for log index i.
Why: The Log Matching Property guarantees that if two logs agree on (index, term) for any entry, they agree on all preceding entries. Combined with Leader Completeness, committed entries propagate to all future logs unchanged.
Raft vs Paxos: Detailed Comparison
| Dimension | Paxos (Multi-Paxos) | Raft |
|---|---|---|
| Leader election | Any node can propose; external mechanism for leader lease | Built-in randomized timeout; term-based election |
| Log commitment | Complex — prior ballot values may be committed "indirectly" | Simple: only commit entries from current term |
| Log divergence | Gaps allowed; out-of-order filling | No gaps; strict sequential ordering |
| Liveness | Dueling proposers possible; no built-in prevention | Random timeouts prevent simultaneous candidates |
| Message complexity | O(N²) per slot (basic); O(N) with stable leader | O(N) per slot (leader already established) |
| Spec completeness | Paxos = single decree; Multi-Paxos is underspecified | Full system spec: log, snapshots, membership changes |
| Used in | Chubby, ZooKeeper (via ZAB), Spanner | etcd, CockroachDB, TiKV, Consul |