Actually, I have similar issues on my test and acceptance clusters, where
leader election fails if the cluster has been running for a couple of days.
If you stop/start the ZooKeeper servers once, they will work fine on further
disruptions that day. Not sure yet what the threshold is.
On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:
Hard to say. It looks like about 15 minutes after your first incident where
5 goes down and then comes back up, servers 1 and 2 get socket errors to
their connections with 3, 4, and 6. It's possible if you had waited those
15 minutes, once those errors cleared the quorum would've formed with the
other servers. But as for why there were those errors in the first place,
it's not clear. It could be a network glitch or an obscure bug in the
connection logic. Has anyone else ever seen this?
If you see it again, getting a stack trace of the servers when they can't
form quorum might be helpful.
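
For anyone wanting to script the stack-trace capture suggested above, here is a minimal sketch in Python. It assumes the JDK's `jstack` tool is on the PATH and that the script runs as the same user as the ZooKeeper JVM; the helper names and the output file name are made up for illustration, and the process id has to be supplied by you.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def jstack_command(pid: int) -> list[str]:
    """Build the jstack invocation; -l adds lock information to the dump."""
    return ["jstack", "-l", str(pid)]


def dump_threads(pid: int, out_dir: str = ".") -> Path:
    """Write a thread dump of the JVM with the given pid to a timestamped file.

    Hypothetical helper: the file naming scheme is an assumption, not a
    ZooKeeper convention.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = Path(out_dir) / f"zk-threads-{pid}-{stamp}.txt"
    result = subprocess.run(
        jstack_command(pid),
        capture_output=True,
        text=True,
        check=True,  # raise if jstack cannot attach to the process
    )
    out_path.write_text(result.stdout)
    return out_path
```

Calling `dump_threads(<pid>)` on each server while the ensemble is stuck, and again a minute or so later, should show whether the election threads are blocked in the same place both times.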
On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com> wrote:
I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
Yesterday one of the participants (id 5, which happened to be the leader) was
rebooted. Although all other servers were online and not suffering from
networking issues, the leader election failed and the cluster remained
"looking" until the old leader came back online, after which it was promptly
elected as leader again.
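
One way to record what each server reports while the cluster is stuck is to poll the `srvr` four-letter command. The sketch below is only an illustration: it assumes the default client port 2181 and that `srvr` is allowed by the servers' four-letter-word whitelist, and the function names are hypothetical. A server that is not part of a quorum replies without a `Mode:` line, which the parser reports as `None`.

```python
import socket
from typing import Optional


def four_letter_word(host: str, port: int = 2181, cmd: str = "srvr",
                     timeout: float = 3.0) -> str:
    """Send a four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(response: str) -> Optional[str]:
    """Extract the value of the 'Mode:' line, or None if there is none."""
    for line in response.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def report(hosts: list[str]) -> dict[str, str]:
    """Map each host to its reported mode, or to a short failure note."""
    modes = {}
    for host in hosts:
        try:
            modes[host] = parse_mode(four_letter_word(host)) or "not serving"
        except OSError as exc:
            modes[host] = f"unreachable ({exc})"
    return modes
```

Running `report([...])` with your own host names every few seconds during a reboot test would give a timeline of which servers were serving and which were not.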
Today we tried the same exercise on the exact same servers: 5 was still
leader and was rebooted, and leader election worked fine with 4 as the new
leader.
I have included the logs. From the logs I see that yesterday 1 and 2 never
received new leader proposals from 3 and 4, and vice versa.
Today all proposals came through. This is not the first time we've seen
this type of behavior, where some ZooKeeper servers can't seem to find each
other after the leader goes down.
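
The arithmetic behind that observation can be sketched in a few lines of Python. This assumes the default majority quorum (no hierarchical groups or weights) and assumes, based on the log reading above, that the surviving voters effectively split into {1, 2} and {3, 4}. The function names are illustrative only.

```python
def majority(n_voters: int) -> int:
    """Smallest vote count that is a strict majority of n_voters."""
    return n_voters // 2 + 1


# Voting members from the description above; observer 6 has no vote.
PARTICIPANTS = {1, 2, 3, 4, 5}


def can_elect(group: set[int], voters: set[int] = PARTICIPANTS) -> bool:
    """True if the voters within `group` alone form a majority quorum."""
    return len(group & voters) >= majority(len(voters))


if __name__ == "__main__":
    print("votes needed:", majority(len(PARTICIPANTS)))
    # Server 5 is down and the two sides do not hear each other.
    for group in ({1, 2}, {3, 4, 6}):
        print(sorted(group), "can elect:", can_elect(group))
    # All four survivors talking to each other, as on the second day.
    print([1, 2, 3, 4], "can elect:", can_elect({1, 2, 3, 4}))
    # Server 5 back and reachable by one side.
    print([1, 2, 5], "can elect:", can_elect({1, 2, 5}))
```

Under these assumptions neither side reaches the three votes needed while 5 is down, and either side reaches three as soon as 5 rejoins it, which would be consistent with the cluster staying in "looking" until the old leader returned.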
All servers use dynamic configuration and have the same config node.
How could this be explained? These servers also host a replicated database
cluster and have no history of db replication issues.
Thanks,
Chris