Hi everyone,

I am currently running Kafka 0.8.1.1 in a cluster with 6 brokers, and the
replication factor is set to 3. My producer sets the ack to 2 when producing
messages. I recently ran into a bad situation where I had to reboot one
broker machine by cutting its power, and that caused data loss.
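
For reference, the producer uses the 0.8 producer API with a config roughly
like the sketch below (broker hostnames and the topic name are placeholders,
not our real ones):

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckTwoProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list; the real cluster has 6 brokers.
            props.put("metadata.broker.list", "h1:9092,h2:9092,h3:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Wait for 2 replicas to acknowledge before the request succeeds.
            props.put("request.required.acks", "2");

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("my-topic", "M100"));
            producer.close();
        }
    }

With request.required.acks=2 the leader answers as soon as one follower
besides itself has the message, which is exactly what sets up the scenario
below.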

This is what actually happened.

Producer 1 (PD1) sends message M100 to partition 10 (leader h1; ISR h1,
h2, h3). Since ack == 2, as soon as two brokers have acknowledged, M100
is considered committed and ready for consumers. So h1 and h2 get M100,
and consumer C1 pulls M100 down and handles the message. So far so good;
we are just waiting for h3 to catch up.
But before that happens, h1 gets shut down while h3, still in the ISR,
never got the chance to fetch M100. So partition 10 chooses a new leader
from h2 and h3, and it (randomly) picks h3. M100 on h2 is then truncated
and the data is lost.
But this is not the worst part, because consumer C1 already got M100.
After C1 handled the message, it committed its offset (100) to an
external key-value store and started to pull message 101 from the new
leader h3. Since h3 doesn't have M100, it responded with an
OffsetOutOfRange error.
Now producer PD1 keeps producing to partition 10, say two messages M1
and M2, whose offsets on h3 are 100 and 101. When consumer C1 pulls from
h3 at offset 101, it sees only M2. So M1 will never be processed by the
consumer.
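
To make the consumer side concrete, this is roughly what C1's fetch looks
like with the SimpleConsumer API (topic name and client id are placeholders);
the truncation shows up as an OffsetOutOfRange error code on the fetch
response:

    import kafka.api.FetchRequest;
    import kafka.api.FetchRequestBuilder;
    import kafka.common.ErrorMapping;
    import kafka.javaapi.FetchResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class ResumeFromKvStoreOffset {
        public static void main(String[] args) {
            String topic = "my-topic"; // placeholder
            int partition = 10;
            long nextOffset = 101L;    // committed offset (100) + 1, from the KV store

            // h3 is the new leader after the election described above.
            SimpleConsumer consumer =
                new SimpleConsumer("h3", 9092, 100000, 64 * 1024, "C1");
            FetchRequest req = new FetchRequestBuilder()
                .clientId("C1")
                .addFetch(topic, partition, nextOffset, 100000)
                .build();
            FetchResponse resp = consumer.fetch(req);

            if (resp.hasError()
                    && resp.errorCode(topic, partition)
                       == ErrorMapping.OffsetOutOfRangeCode()) {
                // h3's log ends before our committed offset: data we already
                // consumed (and acked to the producer) no longer exists anywhere.
                System.err.println("OffsetOutOfRange fetching " + nextOffset);
            }
            consumer.close();
        }
    }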

This is extremely bad, because the producer got an acknowledgement but
the consumer will never be able to process the message.

I googled a bit on how to solve the problem. Most posts suggest changing
the ack to -1 (all). That is also prone to failure, since if one broker
is down, producers can lose the ability to produce any data.
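
For clarity, that suggestion amounts to the following change in the old
producer config (a sketch; request.timeout.ms just bounds how long a produce
request may wait for acks):

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.ProducerConfig;

    public class AckAllProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "h1:9092,h2:9092,h3:9092"); // placeholders
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1: the leader waits for every replica currently in the ISR to ack.
            props.put("request.required.acks", "-1");
            // Bound how long a produce request may block waiting for those acks.
            props.put("request.timeout.ms", "10000");

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            // ... send exactly as before; only the ack setting changed ...
            producer.close();
        }
    }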

I want to ask the community for more wisdom on how to solve this problem.
Any ideas or previous experience are welcome.

Thanks in advance.

-- 
come on
