Raft: Understandable Distributed Consensus

Deep dive into Raft's leader election, log replication, and safety properties — with a focus on why Raft is easier to reason about than Paxos.

Raft was designed by Diego Ongaro and John Ousterhout (2014) with a primary goal of understandability — making distributed consensus tractable for practitioners. Every design decision was made to minimise the number of states and interactions that a developer needs to reason about.

Raft vs Paxos: Design Philosophy

Paxos

• Defines consensus for a single value (slot)
• Leader election and log replication are intertwined
• Complex to extend to full SMR (Multi-Paxos is underspecified)
• Many valid implementations — hard to verify correctness

Raft

• Defines full replicated log management
• Explicitly separates leader election from log replication
• Restricted design: only leader appends, strong leader property
• One canonical implementation — each state is deterministic

"Raft's design goal was understandability. We took almost every significant design decision and asked: which choice is easier to understand?" — Ongaro & Ousterhout, 2014

Raft Server States

Every Raft server is always in one of three states. The transitions are deterministic and well-defined — no hidden states or complex leader-change protocols.

Follower

Passive. Accepts log entries and votes. Starts election if heartbeat times out (150–300 ms random timeout).

Candidate

Increments term, votes for self, sends RequestVote. Becomes leader if wins majority; reverts to follower on higher term.

Leader

Sends periodic heartbeats (AppendEntries with empty log). Handles all client requests and log replication.

Leader Election — Step by Step

Raft uses randomized election timeouts to ensure that only one candidate typically starts an election at a time. The randomization is the key mechanism that prevents Paxos-style dueling proposers.

Election timeout fires

A follower's randomized timer (150–300 ms) expires without receiving a heartbeat. It suspects the leader has crashed.

Become Candidate

Follower transitions to Candidate, increments its term (e.g., term 1 → 2), votes for itself, and resets its election timer.

Send RequestVote RPCs

Candidate sends RequestVote to all other nodes, including its last log entry (index, term) to prove its log is up-to-date.

Collect votes

Nodes grant their vote only if: (1) they haven't voted this term, and (2) candidate's log is at least as up-to-date as theirs. Each node votes for at most one candidate per term.

Win majority → Leader

Candidate receives votes from more than half (N/2 + 1) the nodes and immediately becomes Leader. Safety: only one candidate can win a majority per term since each node votes once.

Send heartbeats

New leader immediately sends AppendEntries (empty = heartbeat) to all followers to suppress new elections and assert authority.

Why random timeouts?

With random timeouts (150–300 ms), only one follower typically times out first. It gets a head start, collects votes, and becomes leader before others even start. This is statistically safe: the probability that two nodes time out within 15 ms (typical RPC latency) of each other is very low with a 150 ms range.

Log Replication — Step by Step

Once a leader is elected, all client requests go through it. The leader never modifies existing log entries — it only appends. This "strong leader" property dramatically simplifies reasoning about consistency.

Client sends command

Client contacts the leader (or any node — others redirect to leader). Leader receives the command.

Append to local log

Leader appends the command as a new log entry with current term. Entry is not yet committed.

AppendEntries to followers

Leader sends AppendEntries RPC in parallel to all followers, including the new entry and the previous entry's (index, term) for the log consistency check.

Followers append & acknowledge

Each follower verifies the log matching check (PrevLogIndex / PrevLogTerm match). If so, appends the entry and responds Success=true.

Leader commits on majority

When a majority of followers have acknowledged, the leader marks the entry as committed (advances commitIndex). The entry can now be applied to the state machine.

Reply to client

Leader applies the command to its state machine, gets the result, and replies to the client.

Followers learn of commit

Next AppendEntries (or heartbeat) carries the updated leaderCommit. Followers advance their commitIndex and apply committed entries to their state machines.

AppendEntries RPC (from the Raft spec)

raft_rpc.go

1// Raft AppendEntries RPC — used for both log replication AND heartbeats
2// Heartbeat = AppendEntries with empty Entries slice
3
4// ── Arguments ────────────────────────────────────────────────────
5type AppendEntriesArgs struct {
6    Term         int        // leader's current term
7    LeaderID     int        // so followers can redirect clients
8    PrevLogIndex int        // index of log entry immediately before new ones
9    PrevLogTerm  int        // term of PrevLogIndex entry (log matching check)
10    Entries      []LogEntry // entries to replicate (empty = heartbeat)
11    LeaderCommit int        // leader's commitIndex
12}
13
14// ── Results ──────────────────────────────────────────────────────
15type AppendEntriesReply struct {
16    Term    int  // current term of follower (so leader can update itself)
17    Success bool // true if follower contained matching entry at PrevLogIndex
18}
19
20// ── Log Entry ────────────────────────────────────────────────────
21type LogEntry struct {
22    Term    int         // term when entry was received by leader
23    Command interface{} // state machine command
24}
25
26// ── Safety check (Log Matching Property) ─────────────────────────
27// Follower rejects AppendEntries if:
28//   1. args.Term < currentTerm  (stale leader)
29//   2. log[PrevLogIndex].Term != args.PrevLogTerm  (log diverged)
30//
31// On rejection: leader decrements nextIndex for this follower and retries.
32// This "walks back" until a matching prefix is found.
33
34// ── Commit rule ───────────────────────────────────────────────────
35// Leader commits entry at index N if:
36//   - N > commitIndex
37//   - log[N].Term == currentTerm   ← CRITICAL: never commit prior terms directly
38//   - A majority of matchIndex[i] >= N

Safety Properties

Leader Completeness

If a log entry is committed in term T, that entry will be present in the logs of all leaders for all terms > T.

Why: To win an election, a candidate must have a log at least as up-to-date as a majority. A committed entry was replicated to a majority. Any future majority must overlap with the committed majority by at least one node, which will block a vote for a candidate missing the entry.

State Machine Safety

If a server has applied log entry at index i to its state machine, no other server will ever apply a different command for log index i.

Why: The Log Matching Property guarantees that if two logs agree on (index, term) for any entry, they agree on all preceding entries. Combined with Leader Completeness, committed entries propagate to all future logs unchanged.

# Log Matching Property (formal)

If log[i].index == log[j].index AND log[i].term == log[j].term

THEN: log[i].command == log[j].command

AND: all entries before index i are identical in both logs

# Key: AppendEntries REJECTS if PrevLogTerm doesn't match → forces log truncation

Raft vs Paxos: Detailed Comparison

Raft vs Paxos — key design differences
Dimension	Paxos (Multi-Paxos)	Raft
Leader election	Any node can propose; external mechanism for leader lease	Built-in randomized timeout; term-based election
Log commitment	Complex — prior ballot values may be committed "indirectly"	Simple: only commit entries from current term
Log divergence	Gaps allowed; out-of-order filling	No gaps; strict sequential ordering
Liveness	Dueling proposers possible; no built-in prevention	Random timeouts prevent simultaneous candidates
Message complexity	O(N²) per slot (basic); O(N) with stable leader	O(N) per slot (leader already established)
Spec completeness	Paxos = single decree; Multi-Paxos is underspecified	Full system spec: log, snapshots, membership changes
Used in	Chubby, ZooKeeper (via ZAB), Spanner	etcd, CockroachDB, TiKV, Consul

External Resources

Official Raft Page

raft.github.io

The canonical Raft resource: the original paper, reference implementation links, list of known Raft implementations across languages, and formal TLA+ specification.

Interactive Visualization

The Secret Lives of Data

Ben Johnson's beautiful animated walkthrough of Raft consensus — highly recommended before or after reading the paper. Demonstrates leader election and log replication with step-by-step animation.

Raft Exam Questions

View all exam questions →