Re: Leader election failing

Andor Molnar Mon, 13 Aug 2018 05:06:54 -0700

Hi Chris,

Would you mind testing the following patch on your test clusters?
I'm not entirely sure, but the issue might be related.


https://issues.apache.org/jira/browse/ZOOKEEPER-2930

Regards,
Andor



On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <[email protected]> wrote:

> If you have the time and inclination, next time you see this problem in
> your test clusters, get stack traces and any other diagnostics possible
> before restarting. I'm not an expert at network debugging, but if you have
> someone who is, you might want them to take a look at the connections and
> settings of any switches/firewalls/etc. involved, to see if there are any
> unusual configurations or evidence of other long-lived connections failing
> (even if their services handle the failures more gracefully). Send us the
> stack traces as well; it would be interesting to take a look.
>
> C
>
>
> On Wed, Aug 8, 2018, 11:09 AM Chris <[email protected]> wrote:
>
> > Running 3.5.5
> >
> > I managed to recreate it on the acceptance and test clusters today,
> > failing on shutdown of the leader. Both had been running for over a week.
> > After restarting all zookeepers it runs fine no matter how many leader
> > shutdowns I throw at it.
> >
> > On 8 August 2018 5:05:34 pm Andor Molnar <[email protected]>
> > wrote:
> >
> > > Some kind of a network split?
> > >
> > > It looks like 1 and 2 could talk to each other, as could 3 and 4, but
> > > connections timed out between these two groups. When 5 came back online
> > > it started with supporters of (1,2) and later 3 and 4 also joined.
> > >
> > > There was no such issue the day after.
> > >
> > > Which version of ZooKeeper is this? 3.5.something?
> > >
> > > Regards,
> > > Andor
> > >
> > >
> > >
> > > On Wed, Aug 8, 2018 at 4:52 PM, Chris <[email protected]> wrote:
> > >
> > >> Actually I have similar issues on my test and acceptance clusters,
> > >> where leader election fails if the cluster has been running for a
> > >> couple of days. If you stop/start the ZooKeepers once, they will work
> > >> fine on further disruptions that day. Not sure yet what the threshold
> > >> is.
> > >>
> > >>
> > >> On 8 August 2018 4:32:56 pm Camille Fournier <[email protected]>
> > >> wrote:
> > >>
> > >>> Hard to say. It looks like about 15 minutes after your first incident
> > >>> where 5 goes down and then comes back up, servers 1 and 2 get socket
> > >>> errors on their connections with 3, 4, and 6. It's possible that if you
> > >>> had waited those 15 minutes, once those errors cleared the quorum
> > >>> would've formed with the other servers. But as for why there were
> > >>> those errors in the first place, it's not clear. Could be a network
> > >>> glitch, or an obscure bug in the connection logic. Has anyone else
> > >>> ever seen this?
> > >>> If you see it again, getting a stack trace of the servers when they
> > >>> can't form quorum might be helpful.
> > >>>
> > >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> > >>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> > >>>> Yesterday one of the participants (id 5, which by chance was the
> > >>>> leader) was rebooted. Although all other servers were online and not
> > >>>> suffering from networking issues, the leader election failed and the
> > >>>> cluster remained "looking" until the old leader came back online,
> > >>>> after which it was promptly elected as leader again.
> > >>>>
> > >>>> Today we tried the same exercise on the exact same servers: 5 was
> > >>>> still leader and was rebooted, and leader election worked fine with 4
> > >>>> as new leader.
> > >>>>
> > >>>> I have included the logs. From the logs I see that yesterday 1,2
> > >>>> never received new leader proposals from 3,4 and vice versa.
> > >>>> Today all proposals came through. This is not the first time we've
> > >>>> seen this type of behavior, where some zookeepers can't seem to find
> > >>>> each other after the leader goes down.
> > >>>> All servers use dynamic configuration and have the same config node.
> > >>>>
> > >>>> How could this be explained? These servers also host a replicated
> > >>>> database cluster and have no history of db replication issues.
> > >>>>
> > >>>> Thanks,
> > >>>> Chris
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>
> > >>
> > >>
> >
> >
> >
> >
>
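
For reference, the quorum arithmetic behind the split discussed in this thread can be sketched roughly as below. It assumes the default majority quorum (no hierarchical groups or weights, which the thread does not mention), so an ensemble of 5 voting participants needs 3 voters that can reach each other, and the observer (id 6) never counts. The helper name `has_quorum` is made up for this illustration and is not part of any ZooKeeper API.

```python
# Illustrative only: majority-quorum check for the ensemble in this thread.
# Assumption: plain majority quorums; `has_quorum` is a hypothetical helper.

VOTERS = {1, 2, 3, 4, 5}   # voting participants
OBSERVERS = {6}            # observers do not vote


def has_quorum(reachable, voters=VOTERS):
    """Return True if the mutually reachable servers include a strict
    majority of the voting participants. Non-voters are ignored."""
    reachable_voters = set(reachable) & set(voters)
    return len(reachable_voters) > len(voters) // 2


# Server 5 is down and the two groups cannot reach each other:
print(has_quorum({1, 2}))      # False: 2 of 5 voters
print(has_quorum({3, 4, 6}))   # False: the observer adds no vote

# Server 5 comes back and joins 1 and 2:
print(has_quorum({1, 2, 5}))   # True: 3 of 5 voters
```

This matches the reported behaviour: with 5 down, neither (1,2) nor (3,4) can elect a leader alone, and the election can complete as soon as 5 rejoins one side.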
