Hard to say. It looks like about 15 minutes after your first incident, where
5 goes down and then comes back up, servers 1 and 2 get socket errors on
their connections to 3, 4, and 6. It's possible that if you had waited those
15 minutes, the quorum would've formed with the other servers once those
errors cleared. But why those errors happened in the first place isn't
clear. Could be a network glitch, or an obscure bug in the connection
logic. Has anyone else ever seen this?
If you see it again, getting stack traces from the servers while they can't
form a quorum might be helpful.
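
Something like the sketch below, run on each server during the next
incident, would capture those dumps. It's only a rough sketch, not a
recommendation of any particular tooling: it assumes the JDK's jps and
jstack are on the PATH, that ZooKeeper is running as the usual
QuorumPeerMain process, and the output file naming is made up for
illustration.

    import datetime
    import subprocess

    def zookeeper_pids():
        """PIDs of JVMs running QuorumPeerMain, per `jps -l` output."""
        out = subprocess.run(["jps", "-l"], capture_output=True,
                             text=True, check=True).stdout
        return [line.split()[0] for line in out.splitlines()
                if "QuorumPeerMain" in line]

    def dump_threads(pid):
        """Write a jstack thread dump for the PID to a timestamped file."""
        stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
        dump = subprocess.run(["jstack", pid], capture_output=True,
                              text=True, check=True).stdout
        path = "zk-threaddump-%s-%s.txt" % (pid, stamp)
        with open(path, "w") as f:
            f.write(dump)
        return path

    if __name__ == "__main__":
        for pid in zookeeper_pids():
            print("wrote", dump_threads(pid))

Collecting one dump per server while the ensemble is stuck "looking" would
show where the election threads are blocked.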

On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com> wrote:

> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> Yesterday one of the participants (id 5, which by chance was the leader) was
> rebooted. Although all other servers were online and not suffering from
> networking issues, the leader election failed and the cluster remained
> "looking" until the old leader came back online, after which it was promptly
> elected leader again.
>
> Today we tried the same exercise on the exact same servers: 5 was still the
> leader and was rebooted, and leader election worked fine, with 4 as the new
> leader.
>
> I have included the logs. From the logs I see that yesterday 1 and 2 never
> received new leader proposals from 3 and 4, and vice versa.
> Today all proposals came through. This is not the first time we've seen
> this type of behavior, where some ZooKeeper servers can't seem to find each
> other after the leader goes down.
> All servers use dynamic configuration and have the same config node.
>
> How can this be explained? These servers also host a replicated database
> cluster and have no history of database replication issues.
>
> Thanks,
> Chris
>
>
>
