Raft: Understandable Distributed Consensus

Deep dive into Raft's leader election, log replication, and safety properties — with a focus on why Raft is easier to reason about than Paxos.

Raft was designed by Diego Ongaro and John Ousterhout (2014) with a primary goal of understandability — making distributed consensus tractable for practitioners. Every design decision was made to minimise the number of states and interactions that a developer needs to reason about.

Raft vs Paxos: Design Philosophy
Paxos
  • • Defines consensus for a single value (slot)
  • • Leader election and log replication are intertwined
  • • Complex to extend to full SMR (Multi-Paxos is underspecified)
  • • Many valid implementations — hard to verify correctness
Raft
  • • Defines full replicated log management
  • Explicitly separates leader election from log replication
  • • Restricted design: only leader appends, strong leader property
  • • One canonical implementation — each state is deterministic
"Raft's design goal was understandability. We took almost every significant design decision and asked: which choice is easier to understand?" — Ongaro & Ousterhout, 2014

Raft Server States

Every Raft server is always in one of three states. The transitions are deterministic and well-defined — no hidden states or complex leader-change protocols.

Election timeoutMajority votesHigher term seenHigher term seenHeartbeatFollowerpassiveCandidaterequesting votesLeadersends heartbeats
Follower
Passive. Accepts log entries and votes. Starts election if heartbeat times out (150–300 ms random timeout).
Candidate
Increments term, votes for self, sends RequestVote. Becomes leader if wins majority; reverts to follower on higher term.
Leader
Sends periodic heartbeats (AppendEntries with empty log). Handles all client requests and log replication.

Leader Election — Step by Step

Raft uses randomized election timeouts to ensure that only one candidate typically starts an election at a time. The randomization is the key mechanism that prevents Paxos-style dueling proposers.

1
Election timeout fires
A follower's randomized timer (150–300 ms) expires without receiving a heartbeat. It suspects the leader has crashed.
2
Become Candidate
Follower transitions to Candidate, increments its term (e.g., term 1 → 2), votes for itself, and resets its election timer.
3
Send RequestVote RPCs
Candidate sends RequestVote to all other nodes, including its last log entry (index, term) to prove its log is up-to-date.
4
Collect votes
Nodes grant their vote only if: (1) they haven't voted this term, and (2) candidate's log is at least as up-to-date as theirs. Each node votes for at most one candidate per term.
5
Win majority → Leader
Candidate receives votes from more than half (N/2 + 1) the nodes and immediately becomes Leader. Safety: only one candidate can win a majority per term since each node votes once.
6
Send heartbeats
New leader immediately sends AppendEntries (empty = heartbeat) to all followers to suppress new elections and assert authority.
Why random timeouts?

With random timeouts (150–300 ms), only one follower typically times out first. It gets a head start, collects votes, and becomes leader before others even start. This is statistically safe: the probability that two nodes time out within 15 ms (typical RPC latency) of each other is very low with a 150 ms range.

Log Replication — Step by Step

Once a leader is elected, all client requests go through it. The leader never modifies existing log entries — it only appends. This "strong leader" property dramatically simplifies reasoning about consistency.

1
Client sends command
Client contacts the leader (or any node — others redirect to leader). Leader receives the command.
2
Append to local log
Leader appends the command as a new log entry with current term. Entry is not yet committed.
3
AppendEntries to followers
Leader sends AppendEntries RPC in parallel to all followers, including the new entry and the previous entry's (index, term) for the log consistency check.
4
Followers append & acknowledge
Each follower verifies the log matching check (PrevLogIndex / PrevLogTerm match). If so, appends the entry and responds Success=true.
5
Leader commits on majority
When a majority of followers have acknowledged, the leader marks the entry as committed (advances commitIndex). The entry can now be applied to the state machine.
6
Reply to client
Leader applies the command to its state machine, gets the result, and replies to the client.
7
Followers learn of commit
Next AppendEntries (or heartbeat) carries the updated leaderCommit. Followers advance their commitIndex and apply committed entries to their state machines.

AppendEntries RPC (from the Raft spec)

raft_rpc.go
1// Raft AppendEntries RPC — used for both log replication AND heartbeats
2// Heartbeat = AppendEntries with empty Entries slice
3
4// ── Arguments ────────────────────────────────────────────────────
5type AppendEntriesArgs struct {
6    Term         int        // leader's current term
7    LeaderID     int        // so followers can redirect clients
8    PrevLogIndex int        // index of log entry immediately before new ones
9    PrevLogTerm  int        // term of PrevLogIndex entry (log matching check)
10    Entries      []LogEntry // entries to replicate (empty = heartbeat)
11    LeaderCommit int        // leader's commitIndex
12}
13
14// ── Results ──────────────────────────────────────────────────────
15type AppendEntriesReply struct {
16    Term    int  // current term of follower (so leader can update itself)
17    Success bool // true if follower contained matching entry at PrevLogIndex
18}
19
20// ── Log Entry ────────────────────────────────────────────────────
21type LogEntry struct {
22    Term    int         // term when entry was received by leader
23    Command interface{} // state machine command
24}
25
26// ── Safety check (Log Matching Property) ─────────────────────────
27// Follower rejects AppendEntries if:
28//   1. args.Term < currentTerm  (stale leader)
29//   2. log[PrevLogIndex].Term != args.PrevLogTerm  (log diverged)
30//
31// On rejection: leader decrements nextIndex for this follower and retries.
32// This "walks back" until a matching prefix is found.
33
34// ── Commit rule ───────────────────────────────────────────────────
35// Leader commits entry at index N if:
36//   - N > commitIndex
37//   - log[N].Term == currentTerm   ← CRITICAL: never commit prior terms directly
38//   - A majority of matchIndex[i] >= N

Safety Properties

Leader Completeness

If a log entry is committed in term T, that entry will be present in the logs of all leaders for all terms > T.

Why: To win an election, a candidate must have a log at least as up-to-date as a majority. A committed entry was replicated to a majority. Any future majority must overlap with the committed majority by at least one node, which will block a vote for a candidate missing the entry.

State Machine Safety

If a server has applied log entry at index i to its state machine, no other server will ever apply a different command for log index i.

Why: The Log Matching Property guarantees that if two logs agree on (index, term) for any entry, they agree on all preceding entries. Combined with Leader Completeness, committed entries propagate to all future logs unchanged.

# Log Matching Property (formal)
If log[i].index == log[j].index AND log[i].term == log[j].term
THEN: log[i].command == log[j].command
AND: all entries before index i are identical in both logs
# Key: AppendEntries REJECTS if PrevLogTerm doesn't match → forces log truncation

Raft vs Paxos: Detailed Comparison

Raft vs Paxos — key design differences
DimensionPaxos (Multi-Paxos)Raft
Leader electionAny node can propose; external mechanism for leader leaseBuilt-in randomized timeout; term-based election
Log commitmentComplex — prior ballot values may be committed "indirectly"Simple: only commit entries from current term
Log divergenceGaps allowed; out-of-order fillingNo gaps; strict sequential ordering
LivenessDueling proposers possible; no built-in preventionRandom timeouts prevent simultaneous candidates
Message complexityO(N²) per slot (basic); O(N) with stable leaderO(N) per slot (leader already established)
Spec completenessPaxos = single decree; Multi-Paxos is underspecifiedFull system spec: log, snapshots, membership changes
Used inChubby, ZooKeeper (via ZAB), Spanneretcd, CockroachDB, TiKV, Consul

External Resources

Raft Exam Questions