Distributed consensus without the PhD
Distributed consensus gets treated like graduate-level material. The Paxos paper is famously hard to read. Raft was explicitly designed to be more understandable, and it mostly succeeds, but most explanations still lead with the algorithm rather than the problems it's solving.
This is my attempt to go the other way: start with the problems, then show how Raft addresses each one.
The Problem: Agreeing on Anything Across Multiple Machines
Imagine you have three servers storing the same data. A client sends a write to one of them. How do you ensure all three agree on what was written, in what order, and what the current value is: even if servers crash, messages get delayed, or the network partitions?
This is the consensus problem. It sounds manageable until you factor in that you can't distinguish between a slow server and a crashed one, that messages can be delayed indefinitely, and that the obvious fix of a single authoritative server just relocates the problem: what happens when that server fails?
Raft's Answer: Elect One Leader, Route Everything Through It
Raft simplifies the problem by designating one node as the leader at any given time. All writes go to the leader. The leader appends the entry to its log, replicates it to followers, and once a majority of nodes have confirmed they've written it, the entry is committed.
The majority quorum is key. If you have 5 nodes and 3 confirm a write, that write is committed: even if the other 2 are unreachable. As long as a majority of nodes agree, the cluster makes progress. Minority partitions stall; majority partitions continue.
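The arithmetic behind the quorum is worth making concrete. A minimal sketch (function names are illustrative, not from any particular implementation):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that constitutes a majority."""
    return cluster_size // 2 + 1

def is_committed(acks: int, cluster_size: int) -> bool:
    """An entry is committed once a majority has written it.
    The leader's own log counts as one of the acks."""
    return acks >= majority(cluster_size)

print(majority(5))         # 3
print(is_committed(3, 5))  # True: commits despite 2 unreachable nodes
print(is_committed(2, 5))  # False: a minority partition stalls
```

Note that any two majorities of the same cluster must overlap in at least one node, which is the property everything else in Raft leans on.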
Leader Election
Each node has an election timeout: a random timer. If a follower doesn't hear from the leader before its timer fires, it assumes the leader is dead and starts an election: it increments the term number, votes for itself, and asks everyone else to vote for it.
A node wins the election if it gets votes from a majority. The randomised timeouts are what prevent every node from starting an election simultaneously. In practice, one node's timer fires first, it requests votes, and since most followers haven't yet started their own election, it wins quickly.
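A sketch of the timer and the transition to candidate. This is illustrative pseudocode shaped as Python, not a real implementation; the 150-300 ms range is the default suggested in the Raft paper:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # suggested range from the Raft paper

def new_election_timeout() -> float:
    """Fresh random timeout, so nodes rarely time out simultaneously."""
    lo, hi = ELECTION_TIMEOUT_MS
    return random.uniform(lo, hi)

class Node:
    def __init__(self, name: str):
        self.name = name
        self.term = 0
        self.voted_for = None
        self.timeout_ms = new_election_timeout()

    def on_heartbeat(self):
        # Hearing from a live leader resets the timer.
        self.timeout_ms = new_election_timeout()

    def start_election(self):
        # Timer fired: bump the term, vote for ourselves, ask for votes.
        self.term += 1
        self.voted_for = self.name
        return {"type": "RequestVote", "term": self.term,
                "candidate": self.name}
```

The randomisation is doing real work here: with identical timeouts, every follower would time out together, split the vote, and repeat indefinitely.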
The term number is Raft's logical clock. Any message from an old term gets rejected. This prevents a previously-crashed leader from rejoining and confusing the cluster about who's in charge.
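The term check that every RPC goes through can be sketched in a few lines (a simplification: a real node that sees a newer term also steps down to follower):

```python
def check_term(my_term: int, msg_term: int) -> tuple[bool, int]:
    """Return (accept, updated_term) for an incoming RPC.
    Old terms are rejected; a newer term is adopted immediately."""
    if msg_term < my_term:
        return False, my_term   # stale leader or candidate: reject
    return True, msg_term       # current or newer: accept, adopt the term
```

A deposed leader rejoining with term 3 in a term-5 cluster has every one of its messages rejected, and the rejections carry term 5, which forces it back to follower.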
Log Replication
Once elected, the leader accepts writes, appends them to its local log with an index and term number, and sends AppendEntries RPCs to all followers. Followers write the entry to their own logs and acknowledge. When the leader sees a majority acknowledge, it marks the entry as committed and notifies followers in the next heartbeat.
The log is the source of truth. State is whatever results from replaying the log in order. If a follower falls behind, the leader resends the missing entries. If a follower has conflicting entries (from a previous leader that replicated them but crashed before they were committed), the current leader overwrites them. The leader's log always wins.
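The follower side of this can be sketched as follows. Entries are modelled as (term, command) tuples, and prev_index/prev_term identify the entry the new ones attach to; this is a simplification (a real follower truncates only on an actual conflict, to stay safe against stale or duplicate RPCs):

```python
def handle_append_entries(log, current_term, leader_term,
                          prev_index, prev_term, entries):
    """Follower-side sketch of AppendEntries. Returns True on success."""
    if leader_term < current_term:
        return False  # stale leader from an old term
    # Consistency check: do we hold the entry the leader says precedes these?
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False  # gap or mismatch: leader retries further back
    # Drop any conflicting suffix from a deposed leader, then append.
    del log[prev_index + 1:]
    log.extend(entries)
    return True
```

The False return on a mismatch is what drives catch-up: the leader decrements the index it sends and retries until it finds the point where the two logs agree, then replays everything after it.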
What 'Committed' Actually Guarantees
A committed entry will survive any future leader election. That's the core guarantee. Once a majority of nodes hold a log entry, any future leader must hold it too: a node only grants its vote to a candidate whose log is at least as up-to-date as its own, so a winning candidate's log is at least as up-to-date as those of a majority of nodes, and any two majorities overlap, so at least one of its voters holds the committed entry.
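The 'at least as up-to-date' comparison is precisely defined: compare the term of the last log entry first, and break ties on log length. A sketch:

```python
def at_least_as_up_to_date(cand_last_term, cand_last_index,
                           my_last_term, my_last_index):
    """Would this voter grant its vote, judging only by log recency?"""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term  # later last term wins
    return cand_last_index >= my_last_index   # same term: longer log wins
```

Note that a higher last term beats a longer log: a long run of entries from a stale term carries less authority than a short log that has seen a newer leader.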
This is why split-brain can't corrupt data in Raft: there can't be two leaders in the same term (each needs a majority of votes, and each node grants at most one vote per term), and an old leader can't come back and overwrite committed entries, because its messages carry a stale term and get rejected, and it can't win a fresh election with an out-of-date log.
The Bottom Line
Raft is Paxos with the design choices made explicit and the state machine model simplified. If you're working with any system that claims linearisable reads, strong consistency, or fault-tolerant replication (etcd, CockroachDB, TiKV, Aeron Cluster), you're working with Raft or something closely related. Knowing how leader election, log replication, and commitment work gives you real intuition about what those systems can and can't do under failure.