OK, in that case I'm thinking that you ran into issues that were a natural result of the Zookeeper ensemble having very high CPU usage. Unfortunate, but this would not be an unexpected situation when your ZK ensemble is having significant problems.
-Todd

On Fri, Jul 10, 2015 at 10:21 AM, Christofer Hedbrandh <[email protected]> wrote:

> Todd, the Kafka problems started when one of three ZooKeeper nodes was
> restarted.
>
> On Thu, Jul 9, 2015 at 12:10 PM, Todd Palino <[email protected]> wrote:
>
> > Did you hit the problems in the Kafka brokers and consumers during the
> > Zookeeper problem, or after you had already cleared it?
> >
> > For us, we decided to skip the leap second problem (even though we're
> > supposedly on a version that doesn't have that bug) by shutting down
> > ntpd everywhere and then allowing it to slowly adjust the time
> > afterwards, without sending the leap second.
> >
> > -Todd
> >
> > On Thu, Jul 9, 2015 at 7:58 AM, Christofer Hedbrandh
> > <[email protected]> wrote:
> >
> > > Hi Kafka users,
> > >
> > > ZooKeeper in our staging environment was running on a very old Ubuntu
> > > version that was exposed to the "leap second causes spuriously high
> > > CPU usage" bug:
> > >
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285
> > >
> > > As a result, when the leap second arrived, the ZooKeeper CPU usage
> > > went up to 100% and stayed there. In response, we restarted one
> > > ZooKeeper process. The ZooKeeper restart unfortunately made the
> > > situation much worse, as we hit three different (possibly related)
> > > Kafka problems. We are using Kafka 0.8.2 brokers, consumers and
> > > producers.
> > >
> > > #1
> > > One of our three brokers was kicked out of the ISR for some (most,
> > > but not all) partitions, and was continuously logging "Cached
> > > zkVersion [XX] not equal to that in zookeeper, skip updating ISR"
> > > over and over (until I eventually stopped this broker):
> > > INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition [topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition [topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition [topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > etc.
> > >
> > > Searching the [email protected] archive and Googling for this
> > > log output gives me similar descriptions, but nothing that exactly
> > > describes this. It is very similar to this thread, but without the
> > > "ERROR Conditional update of path ..." part:
> > > https://www.mail-archive.com/[email protected]/msg07044.html
> > >
> > > #2
> > > The remaining two brokers were logging this every five seconds or so:
> > > INFO conflict in /brokers/ids/xxx data:
> > > {"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
> > > stored data:
> > > {"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
> > > (kafka.utils.ZkUtils$)
> > > INFO I wrote this conflicted ephemeral node
> > > [{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
> > > at /brokers/ids/xxx a while back in a different session, hence I will
> > > backoff for this node to be deleted by Zookeeper and retry
> > > (kafka.utils.ZkUtils$)
> > >
> > > It sounds very much like we hit this bug:
> > > https://issues.apache.org/jira/browse/KAFKA-1387
> > >
> > > #3
> > > The most serious issue was that some consumer groups failed to claim
> > > all partitions. When using the ConsumerOffsetChecker, the owner of
> > > some partitions was listed as "none", the lag was constantly
> > > increasing, and it was clear that no consumers were processing these
> > > messages.
> > >
> > > It is exactly what Dave Hamilton is describing here, but from that
> > > email chain no one seems to know what caused it:
> > > https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
> > >
> > > It may be reasonable to assume that the consumer rebalance failures
> > > we also saw have something to do with this. But why the rebalance
> > > failed is still unclear:
> > > ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
> > > kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4 retries
> > >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
> > >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
> > >
> > > I am curious to hear whether anyone else has had problems similar to
> > > this, and also whether anyone can say if these are all known bugs
> > > that are being tracked under some ticket number.
> > >
> > > Thanks,
> > > Christofer
> > >
> > > P.S. Eventually, after ZooKeeper and Kafka broker and consumer
> > > restarts, everything returned to normal.
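
For anyone triaging the same symptoms, a quick way to see which partitions a broker is stuck on is to tally the "Cached zkVersion [...] not equal" lines from its log. This is a minimal sketch, not from the thread: the log format is taken from the kafka.cluster.Partition lines quoted above, and the `stuck_partitions` helper name is hypothetical.

```python
import re

# Matches the broker log lines quoted in the thread, e.g.:
# INFO Partition [topic-x,3] on broker 1: Cached zkVersion [62] not equal
# to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
PATTERN = re.compile(
    r"Partition \[(?P<topic>[^,\]]+),(?P<part>\d+)\] on broker \d+: "
    r"Cached zkVersion \[(?P<ver>\d+)\] not equal"
)

def stuck_partitions(log_lines):
    """Return {(topic, partition): cached_zk_version} for every line that
    reports a zkVersion mismatch (i.e. an ISR update that was skipped).
    Lines that don't match (e.g. the "Shrinking ISR" lines) are ignored."""
    found = {}
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            found[(m.group("topic"), int(m.group("part")))] = int(m.group("ver"))
    return found
```

A broker that logs these for most but not all partitions, as described in #1 above, would show up here as a large but incomplete subset of its partition assignment.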
