Hey folks - we had a blip on one of our smaller clusters last night (3 nodes), around the same time as network maintenance in the DC where it operates. This caused a network partition between the brokers and ZooKeeper, as well as between the brokers themselves. The partition lasted approximately 6 seconds.
When the partition healed, the two brokers that weren't the controller tried to shrink the ISR set down to themselves for most topics (each topic has 1 partition, with two replicas per partition). The __offset topics shrunk down to exclude the controller. For the next hour or so, the two non-controller brokers were continually trying to shrink the ISR set, but kept seeing the following two log lines repeated across all partitions:

    2017-01-13 03:54:23,843 INFO [kafka-scheduler-7] cluster.Partition - Partition [foo.bar,0] on broker 11: Shrinking ISR for partition [foo_bar,0] from 11,9 to 11
    2017-01-13 03:54:23,931 INFO [kafka-scheduler-7] cluster.Partition - Partition [foo.bar,0] on broker 11: Cached zkVersion [64] not equal to that in zookeeper, skip updating ISR

NOTE: broker 9 was the controller. Brokers 10 and 11 were the ones having issues.

We recently rolled out Kafka 0.10.1.0 everywhere. Wondering if we hit KAFKA-4477 here, although it seems to have subtle differences. I wanted to check in here first to see if others have encountered this, and whether it is worth opening a bug for. Understandably, this kind of thing (random network partitions due to network maintenance) is basically impossible to reproduce. We'll be upgrading to 0.10.1.1 in the next few days nonetheless.

Thanks!
- nick
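For context on the second log line: the broker updates the partition-state znode via ZooKeeper's version-conditional setData, which only succeeds when the version the broker passes matches the znode's current version. A minimal sketch of why a stale cached zkVersion makes every shrink attempt a no-op (plain Python with a hypothetical ZNode class, not Kafka's actual code):

```python
# Hypothetical model of ZooKeeper's versioned conditional write,
# illustrating the "Cached zkVersion not equal" skip. Not Kafka code.

class ZNode:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        # Models ZooKeeper setData(path, data, version): the write
        # succeeds only if expected_version matches the znode version.
        if expected_version != self.version:
            return False  # BadVersionException in real ZooKeeper
        self.data = data
        self.version += 1
        return True

# Partition-state znode as the broker last saw it.
state = ZNode(data="isr=[11,9]")
cached_version = state.version  # broker caches zkVersion = 0

# The controller rewrites the znode during/after the partition,
# bumping its version behind the broker's back.
state.conditional_set("isr=[9]", expected_version=0)

# The broker now tries to shrink the ISR using its stale cached version.
ok = state.conditional_set("isr=[11]", expected_version=cached_version)
print(ok)  # False -> "Cached zkVersion ... not equal ... skip updating ISR"
```

In this sketch, until something refreshes the broker's cached version (e.g. an updated leader-and-ISR state pushed by the controller), every retry fails the same conditional check, which would match an hour of the same two lines repeating.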
