Latency vs throughput: they're not the same thing
These two get conflated constantly, including by people who should know better. They're related but they measure fundamentally different things, and optimising for one often makes the other worse.
The Definitions That Actually Hold Up
Latency is the time it takes for a single operation to complete, from when a request is issued to when the response arrives. It's a per-operation measurement. Low latency means fast individual responses.
Throughput is how many operations a system can complete per unit of time: requests per second, messages per second, bytes per second. High throughput means the system handles a lot of work overall.
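To make the distinction concrete, here's a minimal Java sketch. The doOperation method and its 1ms sleep are hypothetical stand-ins for whatever your system actually does; the point is that latency is measured per call, while throughput is measured over the whole run.

```java
import java.util.ArrayList;
import java.util.List;

public class LatencyVsThroughput {

    // Hypothetical stand-in for a real operation (an RPC, a disk write, etc.).
    static void doOperation() throws InterruptedException {
        Thread.sleep(1); // pretend each operation takes ~1ms
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 1000;
        List<Long> latenciesNanos = new ArrayList<>();

        long runStart = System.nanoTime();
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            doOperation();
            latenciesNanos.add(System.nanoTime() - start); // per-operation latency
        }
        long runNanos = System.nanoTime() - runStart;

        // Latency: a property of each individual operation.
        double avgLatencyMs = latenciesNanos.stream()
                .mapToLong(Long::longValue).average().orElse(0) / 1e6;
        // Throughput: a property of the run as a whole.
        double opsPerSec = n / (runNanos / 1e9);

        System.out.printf("avg latency: %.2f ms, throughput: %.0f ops/sec%n",
                avgLatencyMs, opsPerSec);
    }
}
```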
A simple analogy: a single-lane road might have low latency (you can drive it fast) but low throughput (only one car at a time). A multi-lane highway has high throughput but probably higher latency for any individual car because of merging, congestion, and traffic management overhead.
Why They Trade Off
Batching is the clearest example of the tension. If you accumulate 1000 messages and send them together, your throughput goes up significantly: fewer round trips, better utilisation of the transport layer. But every individual message now waits for the batch to fill before it moves. Latency increases.
Kafka's producer has a linger.ms setting for exactly this reason. Set it to 0 and every message is sent immediately: low latency, lower throughput. Set it to 100ms and messages batch: higher throughput, higher latency. It's an explicit knob on the tradeoff.
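Here's roughly what the two ends of that knob look like with the standard Java client. This is a sketch, not production config; the broker address and topic name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LingerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Latency-leaning: send each record as soon as possible.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);

        // Throughput-leaning alternative: wait up to 100ms for a batch to fill,
        // and give batches room to grow (batch.size is in bytes).
        // props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
        // props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value")); // "events" is a placeholder topic
        }
    }
}
```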
Parallelism creates the same tension on the read side. Serving requests across a thread pool increases throughput, but the queueing and context switching add latency to individual requests, especially under load.
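A sketch of that effect, assuming a hypothetical workload of 100 tasks at roughly 10ms each on a fixed pool of four threads: aggregate throughput is fine, but each task's observed latency includes however long it sat in the queue first.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueueingDemo {
    public static void main(String[] args) throws InterruptedException {
        // 4 workers, 100 tasks: plenty of throughput in aggregate,
        // but most tasks sit in the queue before they ever run.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 100; i++) {
            long submitted = System.nanoTime();
            pool.submit(() -> {
                long queuedMs = (System.nanoTime() - submitted) / 1_000_000;
                try {
                    Thread.sleep(10); // stand-in for ~10ms of real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Per-task latency = queueing delay + service time; only the
                // service time is real work, the rest is the throughput tax.
                System.out.println("queued for " + queuedMs + " ms before starting");
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```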
The Mistake I See Most Often
Teams optimise for average latency when they should be looking at tail latency. Your p50 might be 5ms, but if your p99 is 2 seconds, your users are having a bad time: just not all of them, and not on every request, which makes it easy to miss in aggregate metrics.
Tail latency is where the real problems hide. Garbage collection pauses, lock contention, slow disk flushes: these show up at p99 and p99.9, not p50. If you're only watching averages, you'll declare victory while a subset of your users are experiencing something much worse.
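This is cheap to check for yourself. Here's a minimal nearest-rank percentile over synthetic data; the 5ms and 2-second figures mirror the scenario above and are illustrative only.

```java
import java.util.Arrays;

public class TailLatency {

    // Nearest-rank percentile over a sorted copy of the samples.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Synthetic latencies: 98% of requests take 5ms, 2% stall for 2s
        // (the shape a GC pause or lock convoy produces).
        long[] samples = new long[10_000];
        Arrays.fill(samples, 5);
        for (int i = 0; i < 200; i++) samples[i] = 2_000;

        double avg = Arrays.stream(samples).average().orElse(0);
        System.out.printf("avg: %.1f ms%n", avg);                       // ~44.9 ms: vaguely elevated
        System.out.println("p50: " + percentile(samples, 50) + " ms");  // 5 ms: looks healthy
        System.out.println("p99: " + percentile(samples, 99) + " ms");  // 2000 ms: the real story
    }
}
```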
The Bottom Line
When someone says their system is 'fast', ask which dimension they mean. A trading system needs low latency above everything else: a 10ms response at p99 might be the hard requirement. A data pipeline might not care about any individual message's latency as long as the overall throughput keeps up with the ingest rate.
Know which one you're optimising for before you start. They're not the same metric, and the techniques that help one often hurt the other.