Background
About a month after we went live with our first Kafka-heavy design at Shopee, we were notified of missing data in our new Audit service. A quick E2E check showed there weren’t many components that could cause this data loss. After a short log-dive, we found that the producer reported our records as successfully published, with no error whatsoever, yet the data never arrived.
Our initial hypothesis put the root cause at the unmarshalling stage in the consumer, since that was the only step before our first log line. We were surprised to discover the issue was not there.
Missing error from Kafka producer
Several failure modes could explain data loss in Kafka:
- The partition leader is down
- The message never makes it from the producer to the partition leader
- All brokers are down
Based on our logs, we knew that neither the leader nor the brokers were the problem; otherwise, we would have had errors producing the message in the first place. Digging deeper into the Sarama logs, we discovered random rebalancing happening among the brokers.
Root cause
If we rely only on the leader broker’s acknowledgment, there is a window in which the leader acknowledges the message and then fails before replicating it to the other brokers. The message is lost even though the producer has already received a successful acknowledgment.
ACK Options
There are multiple acknowledgment options for the producer:
Ack = 0
The producer does not wait for any broker response. If the broker crashes or goes offline, there is no error and the message is lost. Since there is no acknowledgment round-trip, this is the fastest publish mode with the least bandwidth use, but it should only be used for scenarios that can tolerate data loss without any error surfacing in the service layer.

Ack = 1
This is the default configuration: it guarantees only that the leader has written the message successfully. Since the acknowledgment is sent before replication completes, a leader crash or rebalance can still lead to data loss.

Ack = All
This mode comes with the extra overhead of waiting for the minimum number of in-sync replicas. The producer only considers the publish successful after that many replicas have acknowledged the message.
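In Sarama, these three modes map to the `RequiredAcks` constants on the producer config. A minimal sketch of selecting each mode (the import path is Sarama’s current one; older versions lived under `github.com/Shopify/sarama`):

```go
package main

import (
	"fmt"

	"github.com/IBM/sarama"
)

func main() {
	config := sarama.NewConfig()

	// Ack = 0: fire-and-forget; the broker sends no response,
	// so a lost message produces no error at all.
	config.Producer.RequiredAcks = sarama.NoResponse

	// Ack = 1 (Sarama's default): only the partition leader
	// must confirm the write before the send succeeds.
	config.Producer.RequiredAcks = sarama.WaitForLocal

	// Ack = All: wait until the minimum in-sync replicas
	// have the message before the send succeeds.
	config.Producer.RequiredAcks = sarama.WaitForAll

	fmt.Println(config.Producer.RequiredAcks)
}
```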

Solution
The solution to our problem ended up being simple. Changing the producer configuration from the default to acknowledge = All was sufficient to prevent any further data loss.
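A sketch of what the fixed producer setup might look like with Sarama’s `SyncProducer`; the broker address and topic name below are placeholders, not our actual configuration:

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	config := sarama.NewConfig()
	// Require acknowledgment from all in-sync replicas before a
	// send is considered successful.
	config.Producer.RequiredAcks = sarama.WaitForAll
	// SyncProducer requires Return.Successes to be enabled.
	config.Producer.Return.Successes = true
	config.Producer.Retry.Max = 5

	// Placeholder broker address.
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatalf("failed to start producer: %v", err)
	}
	defer producer.Close()

	msg := &sarama.ProducerMessage{
		Topic: "audit-events", // placeholder topic name
		Value: sarama.StringEncoder("record payload"),
	}
	// With acks = all, an error here surfaces replication failures
	// instead of silently losing the record.
	if _, _, err := producer.SendMessage(msg); err != nil {
		log.Printf("produce failed: %v", err)
	}
}
```

Note that acks = all only provides real durability if the topic’s broker-side `min.insync.replicas` setting is greater than 1; with the default of 1, the leader alone can satisfy the acknowledgment.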