No, they were actually stuck and did not respond to the shutdown request. I had to kill them with kill -9. I also tried to take a heap dump, which hung as well.
On Aug 27, 2013, at 8:14 AM, Jun Rao <jun...@gmail.com> wrote:

> The errors you listed may not be serious, as long as they are transient.
> When you say 2 of the brokers are not responsive, are they issuing fetch
> requests to the 3rd broker (look at the request log)? During a restart of
> the whole cluster, brokers that are started later may not have any leader
> and thus won't take any request from the client. You will need to run the
> leader balance tool.
>
> Thanks,
>
> Jun
>
>
> On Mon, Aug 26, 2013 at 10:12 PM, Vadim Keylis <vkeylis2...@gmail.com> wrote:
>
>> Somehow my Kafka instances are crashing. I started the Kafka
>> instances one by one and they started successfully. Later, somehow, two
>> of the 3 instances became completely unresponsive. The process is running,
>> but connecting over JMX or taking a heap dump is not possible. The last one
>> is somewhat responsive.
>> I am not sure how the servers got into this state. Is there anything I can
>> monitor to predict that instances are about to crash? What are the ways to
>> recover without data loss? What am I doing wrong to get to this state?
>> Please advise.
>> I poked around the error logs on the hosts that are not responsive, and
>> here are the errors I found. One that I have not listed is
>> LeaderNotFoundException.
>>
>> The most puzzling one is about ZooKeeper, as it was not redeployed or updated.
>> [2013-08-26 12:14:35,357] ERROR [KafkaApi-5] Error while fetching metadata for partition [self_reactivation,0] (kafka.server.KafkaApis)
>> kafka.common.ReplicaNotAvailableException
>>     at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:471)
>>     at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:456)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:76)
>>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>>
>>
>> in server.log
>> [2013-08-26 21:00:51,942] ERROR Conditional update of path /brokers/topics/meetme/partitions/12/state with data { "controller_epoch":6, "isr":[ 5 ], "leader":5, "leader_epoch":1, "version":1 } and expected version 2 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/meetme/partitions/12/state (kafka.utils.ZkUtils$)
>> [2013-08-26 21:00:51,943] INFO Partition [meetme,12] on broker 5: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>> [2013-08-26 21:00:51,990] INFO Partition [meetme,4] on broker 5: Shrinking ISR for partition [meetme,4] from 5,4 to 5 (kafka.cluster.Partition)
>> [2013-08-26 21:00:51,993] ERROR Conditional update of path /brokers/topics/meetme/partitions/4/state with data { "controller_epoch":6, "isr":[ 5 ], "leader":5, "leader_epoch":1, "version":1 } and expected version 2 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/meetme/partitions/4/state (kafka.utils.ZkUtils$)
>> [2013-08-26 21:00:51,993] INFO Partition [meetme,4] on broker 5: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>> [2013-08-26 21:00:52,103] INFO Partition [meetme,6] on broker 5: Shrinking ISR for partition [meetme,6] from 5,4 to 5 (kafka.cluster.Partition)
>> [2013-08-26 21:00:52,107] ERROR Conditional update of path /brokers/topics/meetme/partitions/6/state with data { "controller_epoch":6, "isr":[ 5 ], "leader":5, "leader_epoch":2, "version":1 } and expected version 3 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/meetme/partitions/6/state (kafka.utils.ZkUtils$)
>> [2013-08-26 21:00:52,107] INFO Partition [meetme,6] on broker 5: Cached zkVersion [3] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>>
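
For anyone reading the server.log excerpt above: the BadVersion errors come from a version-checked ("conditional") write. The writer passes the node version it last saw, and the write is rejected if the node has changed since then, which is why the broker then logs that its cached zkVersion does not match and skips the ISR update. Below is a minimal sketch of that pattern in Python. It uses a hypothetical in-memory store, not the real ZooKeeper client or Kafka code; `VersionedStore`, `BadVersionError`, and `try_update` are names made up for illustration.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Hypothetical in-memory stand-in for a store whose nodes carry a version."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        # New nodes start at version 0.
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]

    def set(self, path, data, expected_version):
        # Conditional update: only write if nobody changed the node
        # since the caller last read it.
        _, current = self._nodes[path]
        if current != expected_version:
            raise BadVersionError(
                f"{path}: expected version {expected_version}, found {current}"
            )
        self._nodes[path] = (data, current + 1)
        return current + 1


def try_update(store, path, new_data, cached_version):
    """Return (True, new_version) on success, (False, cached_version) if stale."""
    try:
        return True, store.set(path, new_data, cached_version)
    except BadVersionError:
        # Stale cache: skip the update instead of overwriting newer state.
        return False, cached_version


if __name__ == "__main__":
    store = VersionedStore()
    path = "/brokers/topics/meetme/partitions/4/state"
    store.create(path, {"isr": [5, 4], "leader": 5})

    # Another writer updates the node first, bumping its version.
    store.set(path, {"isr": [5, 4], "leader": 5, "leader_epoch": 1}, 0)

    # A writer still holding the old version is rejected and skips the update.
    ok, version = try_update(store, path, {"isr": [5], "leader": 5}, 0)
    print(ok, version)
```

In this sketch a rejected write leaves the stored data untouched. The writer would have to re-read the node to pick up the current version before it could update again.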