Sometimes an ephemeral ZK path does not go away after a consumer is
closed. Check the log for each rebalance to see whether it complains
about conflicting data at a ZK path. If all the complaints point to the
same consumer, bounce that consumer. Otherwise, you can try removing the
ZK path manually and retrying.
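If you do clean up by hand, the stale registration usually lives under
/consumers/<group>/ids. A sketch using the stock zkCli.sh (the ZK address and
group id are placeholders; the consumer id is the one from the exception quoted
below; these commands need a live ZooKeeper, so verify the path before deleting):

```
# List the consumer registrations for the group (group id "group1" is an example):
bin/zkCli.sh -server zk1:2181 ls /consumers/group1/ids

# If an id is still listed after its JVM is gone, delete that ephemeral node:
bin/zkCli.sh -server zk1:2181 delete /consumers/group1/ids/group1-system01-27422-kafka-787
```

Partition ownership claims live under /consumers/<group>/owners; the conflicts
reported in the rebalance log usually point at one of these two sub-trees.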

On 2/26/15, 6:04 PM, "Ashwin Jayaprakash" <ashwin.jayaprak...@gmail.com>
wrote:

>Hello, we have a set of JVMs that consume messages from Kafka topics. Each
>JVM creates 4 ConsumerConnectors that are used by 4 separate threads.
>These JVMs also use Curator's PathChildrenCache to watch a ZooKeeper
>sub-tree and keep it in sync with other JVMs. This path has several
>thousand child elements.
>
>Everything was working perfectly until one fine day we decided to restart
>these JVMs. We restart these JVMs to roll in new code every few weeks or
>so. We never had any problems until suddenly the Kafka consumers on these
>JVMs were unable to rebalance partitions among themselves. We have
>bounced these JVMs before with no issues.
>
>The exception:
>Caused by: kafka.common.ConsumerRebalanceFailedException:
>group1-system01-27422-kafka-787 can't rebalance after 12 retries
>at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
>at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
>at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:756)
>at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:145)
>at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:96)
>at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:100)
>
>We then set rebalance.max.retries=16 and rebalance.backoff.ms=10000. I've
>seen the Spark-Kafka issue
>https://issues.apache.org/jira/browse/SPARK-5505
>and Jun's recommendation to increase the backoff property.
>
>We must've tried restarting these JVMs about 20 times now, both with and
>without the "rebalance.xx" properties, and every time it is the same
>issue. The one exception was the first time we applied
>"rebalance.backoff.ms=10000", when all 4 JVMs started cleanly. We
>thought that solved everything, but when we restarted again just to make
>sure, we were back to square one.
>
>If we have only 1 thread create 1 ConsumerConnector instead of 4, it
>works. This way we can have any number of JVMs running 1
>ConsumerConnector each, and they all behave well and rebalance
>partitions. It is only when we try to start multiple ConsumerConnectors
>on the same JVM that this problem occurs. I'd like to point out that 4
>ConsumerConnectors per JVM worked for several months; the ZK sub-tree
>for our non-Kafka part of the code was small when we started.
>
>Does anybody have any thoughts on this? What could be causing this
>issue? Could there be a Curator/ZK client conflict with the high-level
>Kafka consumer? Or is the number of nodes that our code keeps in ZK
>causing problems with partition assignment in the Kafka code? The
>Curator framework keeps syncing data in the background while the Kafka
>code is creating ConsumerConnectors and rebalancing topics.
>
>Thanks,
>Ashwin Jayaprakash.
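As a footnote on the settings discussed above, here is a minimal sketch of how
the retry/backoff knobs can be wired into the old high-level consumer's
properties (the group id and ZooKeeper connect string are placeholders; the
resulting Properties would be passed to kafka.consumer.ConsumerConfig):

```java
import java.util.Properties;

public class RebalanceTuning {
    // Properties for the 0.8.x high-level consumer, using the retry/backoff
    // values from this thread. Group id and ZK connect string are examples.
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");
        props.put("group.id", "group1");
        props.put("rebalance.max.retries", "16");   // default is 4
        props.put("rebalance.backoff.ms", "10000"); // default follows zookeeper.sync.time.ms (2000)
        // With the default zookeeper.session.timeout.ms of 6000, a 10s backoff
        // gives a departed consumer's ephemeral nodes time to expire between
        // rebalance attempts.
        return props;
    }

    public static void main(String[] args) {
        Properties p = consumerProps();
        System.out.println("rebalance.max.retries=" + p.getProperty("rebalance.max.retries"));
        System.out.println("rebalance.backoff.ms=" + p.getProperty("rebalance.backoff.ms"));
    }
}
```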
