Quick update: apparently the election notifications were being dropped by the firewall between the datacenters whenever the election sockets had been idle for some time. We fixed this by setting zookeeper.tcpKeepAlive=true.
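
For anyone hitting the same thing, here is a minimal sketch of one way the setting might be applied. zookeeper.tcpKeepAlive is a Java system property, so it can be passed to the server JVM as a -D flag. The sketch assumes the stock zkServer.sh/zkEnv.sh scripts, which source conf/java.env when that file exists; the file location and the sysctl values in the comments are illustrative assumptions, not settings taken from this cluster. Note that the flag only switches keepalive on for the election sockets; how soon probes are sent on an idle connection is decided by the operating system, and the Linux default idle time of 7200 seconds may well be longer than a firewall's idle timeout.

```shell
# conf/java.env: sourced by zkEnv.sh if present (assumes the stock scripts).
# Passes the system property to the ZooKeeper server JVM.
export SERVER_JVMFLAGS="-Dzookeeper.tcpKeepAlive=true ${SERVER_JVMFLAGS:-}"

# The flag only enables SO_KEEPALIVE; probe timing comes from the OS.
# Linux example with illustrative values (run as root, and choose an idle
# time shorter than the firewall's idle timeout):
#   sysctl -w net.ipv4.tcp_keepalive_time=600
#   sysctl -w net.ipv4.tcp_keepalive_intvl=60
#   sysctl -w net.ipv4.tcp_keepalive_probes=5
```

Each server has to be restarted to pick up the flag; restarting them one at a time should keep the quorum available.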

Regards,
Chris

On Wed, Aug 8, 2018 at 5:05 PM Andor Molnar <an...@cloudera.com.invalid> wrote:

> Some kind of a network split?
>
> It looks like 1-2 and 3-4 were able to communicate with each other, but
> connections timed out between these 2 groups. When 5 came back online it
> started with supporters of (1,2) and later 3 and 4 also joined.
>
> There was no such issue the day after.
>
> Which version of ZooKeeper is this? 3.5.something?
>
> Regards,
> Andor
>
> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com> wrote:
>
> > Actually I have similar issues on my test and acceptance clusters,
> > where leader election fails if the cluster has been running for a
> > couple of days. If you stop/start the ZooKeepers once, they will work
> > fine on further disruptions that day. Not sure yet what the threshold
> > is.
> >
> > On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:
> >
> >> Hard to say. It looks like about 15 minutes after your first incident,
> >> where 5 goes down and then comes back up, servers 1 and 2 get socket
> >> errors on their connections with 3, 4, and 6. It's possible that if
> >> you had waited those 15 minutes, once those errors cleared the quorum
> >> would've formed with the other servers. But as for why there were
> >> those errors in the first place, it's not clear. Could be a network
> >> glitch, or an obscure bug in the connection logic. Has anyone else
> >> ever seen this?
> >> If you see it again, getting a stack trace of the servers when they
> >> can't form quorum might be helpful.
> >>
> >> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com> wrote:
> >>
> >>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> >>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> >>> Yesterday one of the participants (id 5, which by chance was the
> >>> leader) was rebooted. Although all other servers were online and not
> >>> suffering from networking issues, the leader election failed and the
> >>> cluster remained "looking" until the old leader came back online,
> >>> after which it was promptly elected as leader again.
> >>>
> >>> Today we tried the same exercise on the exact same servers: 5 was
> >>> still leader and was rebooted, and leader election worked fine with
> >>> 4 as the new leader.
> >>>
> >>> I have included the logs. From the logs I see that yesterday 1,2
> >>> never received new leader proposals from 3,4 and vice versa.
> >>> Today all proposals came through. This is not the first time we've
> >>> seen this type of behavior, where some ZooKeepers can't seem to find
> >>> each other after the leader goes down.
> >>> All servers use dynamic configuration and have the same config node.
> >>>
> >>> How could this be explained? These servers also host a replicated
> >>> database cluster and have no history of db replication issues.
> >>>
> >>> Thanks,
> >>> Chris