Re: Leader election failing

Chris Wed, 08 Aug 2018 08:10:03 -0700

Running 3.5.5

I managed to recreate it on the acceptance and test clusters today; it fails on shutdown of the leader. Both had been running for over a week. After restarting all ZooKeepers it runs fine no matter how many leader shutdowns I throw at it.


On 8 August 2018 5:05:34 pm Andor Molnar <an...@cloudera.com.INVALID> wrote:

Some kind of a network split?

It looks like 1 and 2 could communicate with each other, as could 3 and 4,
but connections between the two groups timed out. When 5 came back online it
started with the supporters of (1,2), and later 3 and 4 also joined.
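
To make the arithmetic behind such a split concrete, here is a minimal sketch. It assumes the default majority quorum over the five voting participants described in the original report (observer 6 does not vote); the function name `has_quorum` is made up for illustration and is not part of ZooKeeper.

```python
def has_quorum(group, voters):
    """Return True if the voting members in `group` are a strict majority of `voters`."""
    return len(set(group) & set(voters)) > len(voters) // 2


# Participants from the report; server 6 is an observer and has no vote.
voters = {1, 2, 3, 4, 5}

# While 5 is down and the two sides cannot reach each other:
print(has_quorum({1, 2}, voters))     # side A: 2 of 5 votes
print(has_quorum({3, 4, 6}, voters))  # side B: still 2 of 5, the observer adds nothing

# Once 5 is back and talks to 1 and 2:
print(has_quorum({1, 2, 5}, voters))  # 3 of 5 votes
```

Under these assumptions neither side can elect a leader on its own, which would match the cluster staying in "looking" until 5 returned.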

There was no such issue the day after.

Which version of ZooKeeper is this? 3.5.something?

Regards,
Andor



On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com> wrote:

Actually I have similar issues on my test and acceptance clusters, where
leader election fails if the cluster has been running for a couple of days.
If you stop/start the ZooKeepers once, they will work fine on further
disruptions that day. Not sure yet what the threshold is.


On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:

Hard to say. It looks like about 15 minutes after your first incident, where
5 goes down and then comes back up, servers 1 and 2 get socket errors to
their connections with 3, 4, and 6. It's possible if you had waited those
15 minutes, once those errors cleared the quorum would've formed with the
other servers. But as for why there were those errors in the first place
it's not clear. Could be a network glitch, or an obscure bug in the
connection logic. Has anyone else ever seen this?
If you see it again, getting a stack trace of the servers when they can't
form quorum might be helpful.
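
Alongside a stack trace, it may help to record what state each server reports at that moment. The sketch below sends the `srvr` four-letter command to a server's client port and pulls out the `Mode:` line. It assumes `srvr` is allowed by the server's four-letter-word whitelist and that the client port is 2181; the helper names are made up for illustration.

```python
import socket


def parse_mode(reply):
    """Extract the value of the 'Mode:' line from a srvr reply, or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=3.0):
    """Send 'srvr' to one server and return its mode, or the raw reply / error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
    except OSError as exc:
        return f"unreachable ({exc})"
    reply = b"".join(chunks).decode("utf-8", errors="replace")
    # A server that is still electing has no Mode line; return its raw reply instead.
    return parse_mode(reply) or reply.strip() or "no reply"
```

Calling `query_mode` once per server (with your own hostnames) while the election is stuck would show which servers report leader, follower or observer, and which are not serving requests at all.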

On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com> wrote:

I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).

1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
Yesterday one of the participants (id 5, which happened to be the leader) was
rebooted. Although all other servers were online and not suffering from
networking issues, the leader election failed and the cluster remained
"looking" until the old leader came back online, after which it was
promptly elected as leader again.

Today we tried the same exercise on the exact same servers, 5 was still
leader and was rebooted, and leader election worked fine with 4 as new
leader.

I have included the logs. From the logs I see that yesterday 1,2 never
received new leader proposals from 3,4 and vice versa.
Today all proposals came through. This is not the first time we've seen
this type of behavior, where some ZooKeepers can't seem to find each other
after the leader goes down.
All servers use dynamic configuration and have the same config node.

How could this be explained? These servers also host a replicated database
cluster and have no history of db replication issues.

Thanks,
Chris
