Hi,
We recently upgraded from Kafka 0.10 to 1.1, and on several occasions since
then some partitions in the cluster have gone offline and been unable to
recover, with the following error:
20:33:04.702 [controller-event-thread] ERROR state.change.logger - [Controller id=1 epoch=14] Controller 1 epoch 14 failed to change state for partition __consumer_offsets-39 from OfflinePartition to OnlinePartition
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-39 under strategy PreferredReplicaPartitionLeaderElectionStrategy
    at kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:328) ~[kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:326) ~[kafka_2.11-1.1.0.jar:?]
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.11.12.jar:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) ~[scala-library-2.11.12.jar:?]
    at kafka.controller.PartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:326) ~[kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:254) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:175) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:116) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:604) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$18.apply(KafkaController.scala:1000) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$18.apply(KafkaController.scala:993) [kafka_2.11-1.1.0.jar:?]
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) [scala-library-2.11.12.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:993) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:980) [kafka_2.11-1.1.0.jar:?]
    at scala.collection.immutable.Map$Map4.foreach(Map.scala:188) [scala-library-2.11.12.jar:?]
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance(KafkaController.scala:980) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$AutoPreferredReplicaLeaderElection$.process(KafkaController.scala:1014) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) [kafka_2.11-1.1.0.jar:?]
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68) [kafka_2.11-1.1.0.jar:?]
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [kafka_2.11-1.1.0.jar:?]
In 0.10 we could fix offline partitions by restarting the whole cluster.
Since the upgrade, the restart only works if we set
unclean.leader.election.enable back to true.
My understanding is that unclean leader election can lose data. Is there
an alternative way to fix offline partitions?
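In case it helps anyone look at the same symptom, here is a minimal sketch (not something taken from our cluster) of how the affected partitions could be picked out of `kafka-topics.sh --describe` output. It assumes the usual per-partition line layout (`Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...`) and that a leaderless partition is reported as `Leader: -1` (or `none`); the function name and the sample text in the usage note are made up for illustration.

```python
import re

# Assumed layout of one per-partition line from `kafka-topics.sh --describe`:
#   Topic: <name>  Partition: <n>  Leader: <id>  Replicas: <ids>  Isr: <ids>
# The topic summary line (PartitionCount:, ReplicationFactor:) does not match.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>\S*)\s+"
    r"Isr:\s*(?P<isr>\S*)"
)

# Assumption: a partition without a leader is printed as -1 (or "none").
NO_LEADER = {"-1", "none"}


def parse_ids(field):
    """Turn a comma-separated broker id list such as '1,2,3' into [1, 2, 3]."""
    return [int(x) for x in field.split(",") if x]


def offline_partitions(describe_output):
    """Return the partitions whose Leader field indicates no leader."""
    result = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if match is None or match.group("leader") not in NO_LEADER:
            continue
        result.append({
            "topic": match.group("topic"),
            "partition": int(match.group("partition")),
            "replicas": parse_ids(match.group("replicas")),
            "isr": parse_ids(match.group("isr")),
        })
    return result
```

Calling `offline_partitions(text)` on the captured describe output returns one dict per leaderless partition. The `isr` list is the interesting part: if it still names a broker, a clean election should be possible once that broker is back online, whereas an ISR whose brokers cannot be brought back is the case where only an unclean election can restore a leader.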
Thanks,
Di Shang