The following is also appearing in the logs a lot, if anyone has any ideas:
INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jbr...@easypost.com> wrote:

> We're running 0.10.1.0 on a five-node cluster.
>
> I was in the process of migrating some topics from having 2 replicas to
> having three replicas when two of the five machines in this cluster crashed
> (brokers 2 and 3).
>
> After restarting them, all of the topics that were previously assigned to
> them are unavailable and showing "Leader: -1".
>
> Example kafka-topics output:
>
> % kafka-topics.sh --zookeeper $ZK_HP --describe --unavailable-partitions
> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>
> Note that I wasn't even moving any of the __consumer_offsets partitions,
> so I'm not sure if the fact that a reassignment was in progress is a red
> herring or not.
>
> The logs are full of
>
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker
> 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker
> 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
>
> What can I do to fix this? Should I manually reassign all partitions that
> were led by brokers 2 or 3 to only have whatever the third broker was in
> their replica-set as their replica set? Do I need to temporarily enable
> unclean elections?
>
> I've never seen a cluster fail this way...
>
> --
> James Brown
> Engineer

--
James Brown
Engineer
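
For anyone who wants to see concretely what the manual reassignment asked about above would look like, here is a rough sketch in Python. It parses `kafka-topics.sh --describe` output like the lines quoted above and builds a JSON document in the version-1 format that kafka-reassign-partitions.sh accepts via --reassignment-json-file. The function name `build_reassignment` and the `crashed_brokers` parameter are made up for this illustration, and the sample input is just two of the lines from the output above. Whether applying such a plan is safe here (as opposed to waiting for the brokers to rejoin the ISR or enabling unclean elections) is exactly the open question, so treat this as a way to inspect the candidate plan, not as a recommended fix.

```python
import json
import re

# Matches one partition line of `kafka-topics.sh --describe` output.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)"
)


def build_reassignment(describe_output, crashed_brokers):
    """Hypothetical helper: for every leaderless partition, keep only the
    replicas that are NOT in crashed_brokers and emit a reassignment plan."""
    partitions = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if match is None or int(match.group("leader")) != -1:
            continue  # not a partition line, or the partition has a leader
        replicas = [int(b) for b in match.group("replicas").split(",")]
        survivors = [b for b in replicas if b not in crashed_brokers]
        if survivors:  # skip partitions with no surviving replica at all
            partitions.append({
                "topic": match.group("topic"),
                "partition": int(match.group("partition")),
                "replicas": survivors,
            })
    return {"version": 1, "partitions": partitions}


# Two lines copied from the output quoted above.
sample = """\
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
"""

plan = build_reassignment(sample, crashed_brokers={2, 3})
print(json.dumps(plan, indent=2))
```

Note that partitions whose replicas all lived on the crashed brokers are skipped entirely, since there is nothing left to reassign them to.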