Write-ahead logs: the idea behind most durable systems
If you've worked with Postgres, SQLite, or pretty much any serious storage system, you've encountered the write-ahead log, usually as a footnote in the docs about crash recovery. It deserves more attention than that. The WAL is one of those ideas that shows up everywhere once you start looking, and understanding it properly changes how you think about durability.
The Core Idea
A write-ahead log is exactly what the name says: before a change is applied to the actual data files, a record of that change is written to a log. The log entry is written first (hence 'write-ahead'), and only after it's safely on disk does the system proceed with modifying the data.
Why does the ordering matter? Because disk writes to random locations are slow and can fail partway through. If you're updating a B-tree page in-place and the machine crashes halfway through the write, you have a partially-updated, corrupt page. The WAL gives you a recovery path: if the data page is corrupt after a crash, you replay the log entries that were committed and re-apply the changes.
The log is append-only, and sequential disk writes are much faster than random ones. You're trading a slightly more expensive write path (write to log, then write to data) for atomicity guarantees and a reliable recovery path.
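To make the ordering concrete, here's a toy sketch of the pattern in Python. The class and method names are invented for illustration; no real engine looks like this, but the sequence, append to the log, fsync, then touch the data, is the whole idea.

```python
import json
import os

class ToyWAL:
    """Minimal write-ahead log sketch (hypothetical names, not a real API).
    Every update is appended to the log and fsynced *before* the in-memory
    'data pages' are touched."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}  # stands in for the actual data files

    def put(self, key, value):
        # Step 1: record the intent in the append-only log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Step 2: only now apply the change to the data itself.
        self.data[key] = value

    def recover(self):
        # Crash recovery: rebuild state by replaying the log front to back.
        self.data = {}
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]
        return self.data
```

If the process dies between step 1 and step 2, the data is untouched but the log entry survives, and recover() re-applies it. That's the asymmetry the WAL buys you.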
What 'Committed' Actually Means
When a database tells you a transaction is committed, it means the WAL entry for that transaction has been flushed to disk. Not that the data pages have been updated; those might still be in memory. The WAL entry is the record of truth.
This is why databases can acknowledge a write quickly (WAL flush is sequential and fast) while still doing the actual data page update lazily in the background. If the system crashes before the page update, the WAL entry is there to replay it on restart.
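A sketch of that split, commit means "logged and fsynced", while the page write is deferred, might look like this. The names (commit, flush_pages, dirty) are illustrative, not any real engine's API:

```python
import json
import os

class LazyPageStore:
    """Sketch of 'committed = logged', with data-file writes deferred.
    Hypothetical design, not modeled on any specific database."""

    def __init__(self, log_path, data_path):
        self.log_path = log_path
        self.data_path = data_path
        self.pages = {}     # in-memory page cache
        self.dirty = set()  # changed in memory, not yet in the data file

    def commit(self, key, value):
        # The transaction is durable the moment this fsync returns:
        # the log entry alone can reconstruct the change after a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.pages[key] = value
        self.dirty.add(key)  # the data file hasn't been touched yet

    def flush_pages(self):
        # Background writer: push dirty pages to the data file lazily.
        with open(self.data_path, "w", encoding="utf-8") as f:
            json.dump(self.pages, f)
            f.flush()
            os.fsync(f.fileno())
        self.dirty.clear()
```

The caller gets its acknowledgment as soon as commit() returns, one sequential append and fsync, while the random-access data-file write happens whenever flush_pages() gets around to it.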
The fsync setting in Postgres is directly about this. fsync=off means WAL entries aren't forced to durable storage; they sit in the OS buffer cache. Faster writes, but a power failure can corrupt your database because the 'committed' log entries were never actually persisted. It's one of those settings where the performance gain is real and the risk is also real.
WALs Beyond Databases
The pattern shows up across the distributed systems stack. Kafka's partition storage is fundamentally a WAL: append-only, sequential, with offsets for replay. When a Kafka consumer processes a message and commits its offset, it's doing the same dance as a database acknowledging a transaction: recording progress durably so a crash doesn't silently lose work.
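The consumer side of that dance can be sketched without any real Kafka client: process a message, then durably record the offset, so a crash replays from the last committed offset instead of dropping work (at-least-once delivery). The function and file layout here are invented for the sketch:

```python
import json
import os

def consume(messages, offset_path):
    """At-least-once consumer loop sketch (no real Kafka client involved).
    The committed offset is itself a tiny durable log of progress."""
    # Recover the last committed offset, defaulting to the start.
    start = 0
    if os.path.exists(offset_path):
        with open(offset_path, encoding="utf-8") as f:
            start = json.load(f)

    processed = []
    for offset in range(start, len(messages)):
        processed.append(messages[offset])       # act on the message first
        with open(offset_path, "w", encoding="utf-8") as f:
            json.dump(offset + 1, f)             # then commit progress durably
            f.flush()
            os.fsync(f.fileno())
    return processed
```

Flip the two steps inside the loop and you get at-most-once instead: progress is recorded before the work, so a crash skips the in-flight message rather than replaying it.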
Aeron Cluster uses a log buffer as part of its Raft implementation; the replicated log that all nodes agree on is a WAL. The leader writes entries, followers acknowledge, and the entries become committed once a quorum confirms durability.
etcd, the key-value store that backs Kubernetes, uses a WAL for its Raft log. Most consensus systems do.
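The quorum-commit rule itself is small enough to sketch. An entry is committed once a majority of nodes, leader included, have it durably in their logs; that's the median of the acknowledged log positions. This is a simplified illustration of the Raft-style rule, not code from any of the systems above:

```python
def committed_index(follower_match_indexes, leader_last_index):
    """Raft-style commit rule sketch: return the highest log index that is
    durably replicated on a majority of nodes.
    follower_match_indexes: highest index each follower has acknowledged."""
    # The leader's own log counts toward the majority.
    indexes = sorted(follower_match_indexes + [leader_last_index], reverse=True)
    # The (n//2 + 1)-th highest acknowledged index is on a majority of nodes.
    return indexes[len(indexes) // 2]
```

With five nodes, a leader at index 10, and followers acknowledging 10, 9, 4, and 3, the commit point is 9: three of five nodes have everything up to there, and no crash of a minority can lose it.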
The Operational Side
WAL files accumulate. In Postgres, WAL segments sit in pg_wal and need to be either archived (for point-in-time recovery) or cleaned up after checkpoints. If your archiving process falls behind or your checkpoint frequency is too low, the WAL directory grows and you eventually run out of disk space, which is a bad way to discover how your storage durability works.
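The reason checkpoints let you reclaim that space: once every change up to some point is durably in the data files, the log entries before that point are no longer needed for recovery. A toy sketch of the idea (real engines recycle or archive whole segments rather than truncating one file, and the function here is invented):

```python
import json
import os

def checkpoint(pages, data_path, log_path):
    """Checkpoint sketch: flush all in-memory state to the data file,
    then drop the log entries it makes redundant."""
    with open(data_path, "w", encoding="utf-8") as f:
        json.dump(pages, f)
        f.flush()
        os.fsync(f.fileno())  # state is now recoverable without the log
    with open(log_path, "w", encoding="utf-8"):
        pass                  # safe to truncate the old log entries
```

The ordering matters just as much here as on the write path: truncate the log before the data file is durably written and a crash in between leaves you with neither.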
WAL shipping is also the basis of Postgres streaming replication. The primary writes WAL entries; replicas receive and replay them. The replica's state is always some number of WAL entries behind the primary, which is why 'replication lag' is measured in bytes or time, not records.
The Bottom Line
The write-ahead log is the mechanism behind most durability guarantees you rely on. When a system tells you data is safe, it usually means: a WAL entry exists somewhere that can reconstruct it. Understanding that lets you reason more clearly about what 'durable' actually means in any given system, and what assumptions you're making when you trust it.