Hey folks - we had a blip on one of our smaller clusters last night (3 nodes),
around the same time as network maintenance in the DC it was operating in.
This caused a network partition between the brokers and ZooKeeper, as well as
between the brokers themselves. The partition lasted approx. 6 seconds.

When the partition healed, two of the brokers (those that weren't the
controller) tried to shrink the ISR set down to themselves for most topics
(each with 1 partition and two replicas per partition). The __consumer_offsets
topic shrank down to exclude the controller.

For the next hour or so, the two non-controller brokers were continually
trying to shrink the ISR set down, but kept logging the following two lines
repeatedly across all partitions:

2017-01-13 03:54:23,843  INFO [kafka-scheduler-7] cluster.Partition -
Partition [foo.bar,0] on broker 11: Shrinking ISR for partition [foo_bar,0]
from 11,9 to 11
2017-01-13 03:54:23,931  INFO [kafka-scheduler-7] cluster.Partition -
Partition [foo.bar,0] on broker 11: Cached zkVersion [64] not equal to that
in zookeeper, skip updating ISR

NOTE: broker 9 was the controller. Brokers 10 and 11 were the ones having
issues.
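
For context on what that second log line means: Kafka stores each partition's
ISR in a ZooKeeper znode and updates it with a conditional write, which only
succeeds if the broker's cached znode version matches ZooKeeper's current one.
If the controller has bumped the znode in the meantime, the write is rejected
and the broker logs "Cached zkVersion [..] not equal" and skips the update.
Here's a minimal sketch of that compare-and-set pattern in plain Python (not
Kafka's actual code; the class and function names are made up for
illustration):

```python
# Illustration of the conditional-update pattern behind the
# "Cached zkVersion not equal" message. A broker only writes the
# shrunken ISR if its cached znode version still matches ZooKeeper's.

class FakeZnode:
    """Stands in for a ZooKeeper znode with data plus a version counter."""
    def __init__(self, data, version=0):
        self.data = data
        self.version = version

    def conditional_set(self, new_data, expected_version):
        # Models ZooKeeper's setData(path, data, version): succeeds only
        # if the caller's expected version matches the znode's current one
        # (a mismatch raises BadVersion in real ZooKeeper).
        if expected_version != self.version:
            return False, self.version
        self.data = new_data
        self.version += 1
        return True, self.version


def try_shrink_isr(znode, cached_version, new_isr):
    ok, ver = znode.conditional_set(new_isr, cached_version)
    if not ok:
        # This branch corresponds to the "Cached zkVersion [..] not equal
        # to that in zookeeper, skip updating ISR" log line. If the broker
        # never refreshes its cached version, it keeps hitting this branch
        # on every retry, which matches the repeating log lines above.
        return f"skip: cached zkVersion [{cached_version}] != zk [{ver}]"
    return f"ISR shrunk to {new_isr}, new zkVersion [{ver}]"


# Broker 11 cached zkVersion 64, but the controller has since bumped
# the znode to 65, so every shrink attempt is skipped.
node = FakeZnode(data=[11, 9], version=65)
print(try_shrink_isr(node, cached_version=64, new_isr=[11]))
```

The loop in our logs looks consistent with the brokers retrying this write
without ever picking up the controller's newer znode version.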

We recently rolled out Kafka 0.10.1.0 everywhere.

Wondering if we hit KAFKA-4477 here, although there seem to be subtle
differences.

I wanted to check in here first to see if others have encountered this, and
whether it is worth opening a bug. Understandably, this kind of thing (random
network partitions due to network maintenance) is basically impossible to
reproduce.

We'll be upgrading to 0.10.1.1 in the next few days nonetheless.

Thanks!
- nick
