Story of our Kafka message loss

Background

About a month after we went live with our first Kafka-heavy design at Shopee, we were notified of missing data in our new Audit service. After a quick E2E check, we realized there weren’t many parts that could cause this data loss. A short log-dive showed that our records had been successfully published by the producer, yet they were missing downstream, and there was no error whatsoever!

Our initial hypothesis put the root cause at the unmarshalling stage in the consumer, since that was the only step before our first log line. To our surprise, the issue was not there.

Missing error from Kafka producer

Several failure modes could have caused this data loss in Kafka:

  • The partition leader is down
  • The message never makes it from the producer to the leader
  • All brokers are down

Based on our logs, we knew that neither the leader nor the brokers were the problem; otherwise, we would have seen errors when producing the messages in the first place. Digging deeper into the Sarama logs, we discovered random rebalancing happening across the brokers.

Root cause

If we rely only on the leader broker’s acknowledgment, there is a chance that after we receive it, the leader crashes before replicating the message to the other brokers. If a follower that never received the message is then elected as the new leader, the message is lost even though the producer already got an acknowledgment for it.

ACK Options

There are multiple acknowledgment options for the producer:

Ack = 0

With Ack = 0, the producer does not wait for any response from the broker. If the broker crashes due to an exception or goes offline, there won’t be any error, and the message will silently be lost. Since the producer never waits for a response, this is the fastest publish method, with the least bandwidth use, but it should only be used for scenarios that can tolerate data loss without any log in the service layer.

Ack = 1

The default config for Kafka (and for Sarama) guarantees only that the message has been written successfully by the leader. Since this acknowledgment is sent before replication completes, a leader crash or rebalance in that window leads to data loss.

Ack = All

This method comes with the extra overhead of waiting for the in-sync replicas: the producer only considers the publication successful after the leader and the minimum required number of in-sync replicas (the broker-side min.insync.replicas setting) have acknowledged the message.
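For reference, here is a minimal sketch of how these three levels map onto Sarama’s producer configuration. The constant names are from Sarama’s public API; the import path is the current IBM one (older releases lived under github.com/Shopify/sarama):

```go
package main

import "github.com/IBM/sarama"

// ackConfig returns a Sarama producer config with the given ack level set.
func ackConfig(acks sarama.RequiredAcks) *sarama.Config {
	config := sarama.NewConfig()
	config.Producer.RequiredAcks = acks
	return config
}

func main() {
	_ = ackConfig(sarama.NoResponse)   // Ack = 0: fire and forget
	_ = ackConfig(sarama.WaitForLocal) // Ack = 1: leader only (Sarama's default)
	_ = ackConfig(sarama.WaitForAll)   // Ack = All: leader + in-sync replicas
}
```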

Solution

The solution to our problem ended up being simple: changing the producer configuration to require Ack = All was enough to protect us from this kind of data loss going forward.
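As a rough illustration (not our exact production code; the broker address and topic name below are placeholders), the fix amounts to one line in the Sarama producer setup:

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	config := sarama.NewConfig()
	// The fix: require acknowledgment from all in-sync replicas
	// instead of the default leader-only acknowledgment.
	config.Producer.RequiredAcks = sarama.WaitForAll
	// SyncProducer requires successes to be returned on the channel.
	config.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatalf("failed to start producer: %v", err)
	}
	defer producer.Close()

	// SendMessage now only returns once the leader and the minimum
	// number of in-sync replicas have acknowledged the write.
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "audit-events", // hypothetical topic name
		Value: sarama.StringEncoder(`{"event":"example"}`),
	})
	if err != nil {
		log.Fatalf("failed to publish: %v", err)
	}
	log.Printf("published to partition %d at offset %d", partition, offset)
}
```

Note that Ack = All only helps if the topic is replicated and min.insync.replicas is set accordingly on the broker side; with a replication factor of one, it behaves the same as Ack = 1.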