Do you see ZK session expiration? If so, take a look at https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whypartitionleadersmigratethemselvessometimes ?
Thanks, Jun On Fri, May 30, 2014 at 2:48 PM, joe smith <water4...@yahoo.com> wrote: > > Hi, > > We are doing load testing. Ran into some exception - need some help > clarifying what the problem is and how to bring the broker back on-line > (hoping > w/o the need to shutdown other brokers). Each broker was handling about > 100k msgs/sec. > Payload is small - less than 500 bytes > About 2 hours into our run, one of the brokers stopped. It > looked like an update to the zookeeper failed - conflict in /controller > data’s > version. It seems to happen when a new > leader is chosen. How do we prevent and > recover? > > Sorry for the long post. Thanks for the help. > > Broker version kafka_2.10-0.8.1.1 > jvm 1.7 > We have 3 brokers running on 3 virtual machines. > 1 topic, 6 partitions, 2 replicas > 3 producing apps (6 producer threads each) > 3 consuming apps (2 consumer each) > The partitions and consumers were properly load-balanced across 3 brokers. > > [2014-05-30 09:42:21,122] > ERROR Conditional update of path /brokers/topics/topic1/partitions/2/state > with > data > {"controller_epoch":27,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} > and expected version 1960 failed due to > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for /brokers/topics/topic1/partitions/2/state > (kafka.utils.ZkUtils$) > [2014-05-30 09:42:21,123] INFO Closing socket connection to /10.74.9.97. > (kafka.network.Processor) > [2014-05-30 09:42:21,147] INFO Partition [topic1,2] on broker 1: Cached > zkVersion [1960] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2014-05-30 09:42:21,147] INFO Partition [topic1,5] on broker 1: Shrinking > ISR > for partition [topic1,5] from 1,3 to 1 (kafka.cluster.Partition) > [2014-05-30 09:42:21,149] ERROR Conditional update of path > /brokers/topics/topic1/partitions/5/state > with data > {"controller_epoch":27,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} > and expected version 2394 failed due to > org.apache.zookeeper.KeeperException$BadVersionException: > KeeperErrorCode = BadVersion for /brokers/topics/topic1/partitions/5/state > (kafka.utils.ZkUtils$) > [2014-05-30 09:42:21,149] INFO Partition [topic1,5] on broker 1: Cached > zkVersion [2394] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2014-05-30 09:42:21,182] INFO conflict in /controller data: > {"version":1,"brokerid":1,"timestamp":"1401442941179"} > stored data: > {"version":1,"brokerid":3,"timestamp":"1401442940062"} > (kafka.utils.ZkUtils$) > [2014-05-30 09:42:21,259] INFO re-registering broker info in ZK for broker > 1 > (kafka.server.KafkaHealthcheck) > [2014-05-30 09:42:21,267] INFO Registered broker 1 at path /brokers/ids/1 > with > address host01:9092. (kafka.utils.ZkUtils$) > [2014-05-30 09:42:21,267] INFO done re-registering broker > (kafka.server.KafkaHealthcheck) > [2014-05-30 09:42:21,299] INFO Subscribing to /brokers/topics path to > watch for > new topics (kafka.server.KafkaHealthcheck) > > We then saw bunch of producer errors - indicating the target > partition is not local. > > [2014-05-30 09:42:21,482] WARN [KafkaApi-1] Produce request with > correlation id 277356381 from client pooledKafkaMsgProducer on partition > [topic1,2] > failed due to Leader not local for partition [topic1,2] on broker 1 > (kafka.server.KafkaApis) > [2014-05-30 09:42:21,482] WARN [KafkaApi-1] Produce request with > correlation id > 272337409 from client pooledKafkaMsgProducer on partition [topic1,2] > failed due > to Leader not local for partition [topic1,2] on broker 1 > (kafka.server.KafkaApis) > > We saw a new leader is elected. > > [2014-05-30 09:42:21,523] INFO [ReplicaFetcherManager on broker 1] > Removed fetcher for partitions [topic1,5],[topic1,2] > (kafka.server.ReplicaFetcherManager) > [2014-05-30 09:42:21,523] INFO Truncating log topic1-5 to offset > 2304238765. > (kafka.log.Log) > [2014-05-30 09:42:21,549] INFO New leader is 3 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2014-05-30 09:42:21,652] INFO Truncating log topic1-2 to offset > 2304238451. > (kafka.log.Log) > [2014-05-30 09:42:21,659] INFO [ReplicaFetcherManager on broker 1] Added > fetcher for partitions ArrayBuffer([[topic1,2], initOffset 2304238451 to > broker > id:2,host:host02:9092] , [[topic1,5], initOffset 2304238765 to broker > id:3,host03:9092] ) (kafka.server.ReplicaFetcherManager) > [2014-05-30 09:42:21,688] WARN [KafkaApi-1] Fetch request with correlation > id > 114863 from client > host02-2254-1-ConsumerFetcherThread-g1_host02-1401389340410-4d2e28ab-0-1 > on partition [topic1,2] failed due to Leader not local for partition > [topic1,2] > on broker 1 (kafka.server.KafkaApis) > [2014-05-30 09:42:21,690] WARN [KafkaApi-1] Fetch request with correlation > id > 114444 from client > host03-5415-1-ConsumerFetcherThread-g1-host03-1401389345244-70fc9378-0-1 on > partition [topic1,5] failed due to Leader not local for partition > [topic1,5] on > broker 1 (kafka.server.KafkaApis) > [2014-05-30 09:42:21,789] WARN [KafkaApi-1] Fetch request with > correlation id 114863 from client > host02-2254-1-ConsumerFetcherThread-g1_host02-1401389340410-4d2e28ab-0-1 > on partition [topic1,2] failed due to Leader not local for partition > [topic1,2] > on broker 1 (kafka.server.KafkaApis) > [2014-05-30 09:42:21,790] WARN [KafkaApi-1] Fetch request with > correlation id 114444 from client > host03-5415-1-ConsumerFetcherThread-g1_host03-1401389345244-70fc9378-0-1 > on partition [topic1,5] failed due to Leader not local for partition > [topic1,5] > on broker 1 (kafka.server.KafkaApis) > [2014-05-30 09:42:21,793] INFO Closing socket connection to /10.74.41.157. > (kafka.network.Processor) > [2014-05-30 09:42:21,794] INFO Closing socket connection to /10.74.37.171. > (kafka.network.Processor) > [2014-05-30 09:42:21,794] INFO Closing socket connection to /10.74.41.157. > (kafka.network.Processor) > [2014-05-30 09:42:21,794] INFO Closing socket connection to /10.74.41.157. > (kafka.network.Processor)