OK, in that case I'm thinking that you ran into issues that were a natural result of the Zookeeper ensemble having very high CPU usage. Unfortunate, but this would not be an unexpected situation when your ZK ensemble is having significant problems.
-Todd

On Fri, Jul 10, 2015 at 10:21 AM, Christofer Hedbrandh <[email protected]> wrote:

> Todd, the Kafka problems started when one of three ZooKeeper nodes was
> restarted.
>
> On Thu, Jul 9, 2015 at 12:10 PM, Todd Palino <[email protected]> wrote:
>
> > Did you hit the problems in the Kafka brokers and consumers during the
> > Zookeeper problem, or after you had already cleared it?
> >
> > For us, we decided to skip the leap second problem (even though we're
> > supposedly on a version that doesn't have that bug) by shutting down
> > ntpd everywhere and then allowing it to slowly adjust the time
> > afterwards, without sending the leap second.
> >
> > -Todd
> >
> > On Thu, Jul 9, 2015 at 7:58 AM, Christofer Hedbrandh
> > <[email protected]> wrote:
> >
> > > Hi Kafka users,
> > >
> > > ZooKeeper in our staging environment was running on a very old Ubuntu
> > > version that was exposed to the "leap second causes spuriously high
> > > CPU usage" bug:
> > >
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285
> > >
> > > As a result, when the leap second arrived, the ZooKeeper CPU usage
> > > went up to 100% and stayed there. In response, we restarted one
> > > ZooKeeper process. The ZooKeeper restart unfortunately made the
> > > situation much worse, as we hit three different (possibly related)
> > > Kafka problems. We are using Kafka 0.8.2 brokers, consumers and
> > > producers.
> > >
> > > #1
> > > One of our three brokers was kicked out of the ISR for some (most,
> > > but not all) partitions, and was continuously logging "Cached
> > > zkVersion [XX] not equal to that in zookeeper, skip updating ISR"
> > > over and over (until I eventually stopped this broker):
> > > INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition [topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition [topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition [topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > etc.
> > >
> > > Searching the [email protected] archive and Googling for this
> > > log output gives me similar descriptions, but nothing that exactly
> > > describes this. It is very similar to this thread, but without the
> > > "ERROR Conditional update of path ..." part:
> > > https://www.mail-archive.com/[email protected]/msg07044.html
> > >
> > > #2
> > > The remaining two brokers were logging this every five seconds or so:
> > > INFO conflict in /brokers/ids/xxx data:
> > > {"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
> > > stored data:
> > > {"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
> > > (kafka.utils.ZkUtils$)
> > > INFO I wrote this conflicted ephemeral node
> > > [{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
> > > at /brokers/ids/xxx a while back in a different session, hence I will
> > > backoff for this node to be deleted by Zookeeper and retry
> > > (kafka.utils.ZkUtils$)
> > >
> > > It sounds very much like we hit this bug:
> > > https://issues.apache.org/jira/browse/KAFKA-1387
> > >
> > > #3
> > > The most serious issue was that some consumer groups failed to claim
> > > all partitions. When using the ConsumerOffsetChecker, the owner of
> > > some partitions was listed as "none", the lag was constantly
> > > increasing, and it was clear that no consumers were processing these
> > > messages.
> > >
> > > It is exactly what Dave Hamilton is describing here, but from that
> > > email chain no one seems to know what caused it:
> > > https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
> > >
> > > It may be reasonable to assume that the consumer rebalance failures
> > > we also saw have something to do with this. But why the rebalance
> > > failed is still unclear:
> > > ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
> > > kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4 retries
> > >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
> > >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
> > >
> > > I am curious to hear whether anyone else has had problems similar to
> > > this, and also whether anyone can say if these are all known bugs
> > > that are being tracked under some ticket number.
> > >
> > > Thanks,
> > > Christofer
> > >
> > > P.S. Eventually, after ZooKeeper and Kafka broker and consumer
> > > restarts, everything returned to normal.
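
For anyone triaging the same symptoms, a quick way to see which partitions a broker is stuck on is to tally the "Cached zkVersion [...] not equal" lines from its log. This is a minimal sketch, not from the thread: the log format is taken from the kafka.cluster.Partition lines quoted above, and the `stuck_partitions` helper name is hypothetical.

```python
import re

# Matches the broker log lines quoted in the thread, e.g.:
# INFO Partition [topic-x,3] on broker 1: Cached zkVersion [62] not equal
# to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
PATTERN = re.compile(
    r"Partition \[(?P<topic>[^,\]]+),(?P<part>\d+)\] on broker \d+: "
    r"Cached zkVersion \[(?P<ver>\d+)\] not equal"
)

def stuck_partitions(log_lines):
    """Return {(topic, partition): cached_zk_version} for every line that
    reports a zkVersion mismatch (i.e. an ISR update that was skipped).
    Lines that don't match (e.g. the "Shrinking ISR" lines) are ignored."""
    found = {}
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            found[(m.group("topic"), int(m.group("part")))] = int(m.group("ver"))
    return found
```

A broker that logs these for most but not all partitions, as described in #1 above, would show up here as a large but incomplete subset of its partition assignment.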
