Hi - We are running kafka_2.8.0-0.8.0-beta1 (we are a little behind in 
upgrading).

From what I can tell, connectivity to ZK was lost for a brief period. The 
cluster seemed to recover OK, except that we now have 2 (out of 125) partitions 
where the ISR appears to be out of date. In other words, kafka-list-topic is 
showing only one replica in the ISR for the 2 partitions in question (there 
should be 3).

What's odd is that in looking at the log segments for those partitions on the 
file system, I can see that they are in fact getting updated and by all 
measures look to be in sync. I can also see that the brokers where the 
out-of-sync replicas reside are doing fine and leading other partitions like 
nothing ever happened. Based on that, it seems like the ISR in ZK is just 
out-of-date due to a botched recovery from the brief ZK outage.
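
To make the "ISR in ZK is stale" check concrete, here is a minimal Python sketch 
that compares the ISR recorded in a partition's state znode against the assigned 
replica list. It assumes the 0.8-style layout, where the state is a JSON blob 
with an "isr" field stored at /brokers/topics/<topic>/partitions/<n>/state; that 
path, the sample JSON, and the broker ids are illustrative assumptions, not 
values taken from our cluster.

```python
import json


def missing_from_isr(state_json, assigned_replicas):
    """Return the assigned broker ids that are absent from the ISR in ZK.

    state_json: contents of the partition state znode (assumed to be JSON
                with an "isr" list, as in the 0.8 layout).
    assigned_replicas: broker ids expected to hold a replica of the partition.
    """
    state = json.loads(state_json)
    isr = set(state.get("isr", []))
    return [broker for broker in assigned_replicas if broker not in isr]


# Hypothetical znode contents for a partition whose ISR shrank to one broker.
example_state = (
    '{"controller_epoch": 1, "isr": [2], "leader": 2, '
    '"leader_epoch": 5, "version": 1}'
)

# Hypothetical replica assignment: brokers 1, 2 and 3.
print(missing_from_isr(example_state, [1, 2, 3]))
```

The znode contents can be fetched with any ZK client (for example zkCli's get 
command) and passed in as a string, so the check itself needs no Kafka code.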

Has anyone seen anything like this before? I saw this ticket which sounded 
similar:

https://issues.apache.org/jira/browse/KAFKA-948

Anyone have any suggestions for recovering from this state? I was thinking of 
running the preferred-replica-election tool next to see if that gets the ISRs 
in ZK back in sync.
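
For reference, a rough Python sketch of building the input for that tool, 
limited to the affected partitions. As far as I know, the tool accepts a JSON 
file of the form {"partitions": [{"topic": ..., "partition": ...}]} via a 
--path-to-json-file option, but treat that as an assumption to verify against 
this build. The topic name and partition numbers below are placeholders.

```python
import json


def election_payload(topic, partitions):
    """Build the JSON structure listing partitions for a preferred replica election."""
    return {
        "partitions": [
            {"topic": topic, "partition": p} for p in partitions
        ]
    }


# Placeholder topic and partition ids; substitute the two affected partitions.
payload = election_payload("my-topic", [17, 42])

# Serialize for writing to a file that is then handed to the election tool.
text = json.dumps(payload, indent=2)
print(text)
```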

After that, I guess the next step would be to bounce the kafka servers in 
question.

Thanks,
Paul
