Erm. Thanks for carrying out these tests, Chris.

Have you by any chance - as Camille suggested - collected debug logs from these tests?
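By debug logs I just mean a temporary DEBUG override for the quorum/election packages in conf/log4j.properties - a minimal sketch, assuming the stock log4j setup shipped with the distribution (the servers need a restart to pick it up):

  # conf/log4j.properties (sketch) - raise verbosity for leader election / quorum only
  log4j.logger.org.apache.zookeeper.server.quorum=DEBUG
  # or, if log volume is not a concern, for everything:
  # log4j.logger.org.apache.zookeeper=DEBUG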
Andor

On 09/11/2018 11:08 AM, Cee Tee wrote:
> Concluded a test with a 3.4.13 cluster, it shows the same behaviour.
>
> On Mon, Sep 3, 2018 at 4:56 PM Andor Molnar <an...@cloudera.com.invalid> wrote:
>
>> Thanks for testing, Chris.
>>
>> So, if I understand you correctly, you're running the latest version from branch-3.5. Could we say that this is a 3.5-only problem? Have you ever tested the same cluster with 3.4?
>>
>> Regards,
>> Andor
>>
>> On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turks...@gmail.com> wrote:
>>
>>> I've tested the patch and let it run 6 days. It did not help; the result is still the same (the remaining ZKs form islands based on the datacenter they are in).
>>>
>>> I have mitigated it by doing a daily rolling restart.
>>>
>>> Regards,
>>> Chris
>>>
>>> On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <an...@cloudera.com.invalid> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Would you mind testing the following patch on your test clusters? I'm not entirely sure, but the issue might be related.
>>>>
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>>>>
>>>> Regards,
>>>> Andor
>>>>
>>>> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <cami...@apache.org> wrote:
>>>>
>>>>> If you have the time and inclination, next time you see this problem in your test clusters, get stack traces and any other diagnostics possible before restarting. I'm not an expert at network debugging, but if you have someone who is, you might want them to take a look at the connections and settings of any switches/firewalls/etc. involved, and see if there are any unusual configurations or evidence of other long-lived connections failing (even if their services handle the failures more gracefully). Send us the stack traces also; it would be interesting to take a look.
>>>>>
>>>>> C
>>>>>
>>>>> On Wed, Aug 8, 2018, 11:09 AM Chris <c.turks...@gmail.com> wrote:
>>>>>
>>>>>> Running 3.5.5.
>>>>>>
>>>>>> I managed to recreate it on the acceptance and test clusters today, failing on shutdown of the leader. Both had been running for over a week. After restarting all zookeepers it runs fine, no matter how many leader shutdowns I throw at it.
>>>>>>
>>>>>> On 8 August 2018 5:05:34 pm Andor Molnar <an...@cloudera.com.INVALID> wrote:
>>>>>>
>>>>>>> Some kind of a network split?
>>>>>>>
>>>>>>> It looks like 1-2 and 3-4 were able to communicate with each other, but the connection timed out between these two splits. When 5 came back online, it started with supporters of (1,2), and later 3 and 4 also joined.
>>>>>>>
>>>>>>> There was no such issue the day after.
>>>>>>>
>>>>>>> Which version of ZooKeeper is this? 3.5.something?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andor
>>>>>>>
>>>>>>> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Actually, I have similar issues on my test and acceptance clusters, where leader election fails if the cluster has been running for a couple of days. If you stop/start the Zookeepers once, they will work fine on further disruptions that day. Not sure yet what the threshold is.
>>>>>>>>
>>>>>>>> On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hard to say.
>>>>>>>>> It looks like about 15 minutes after your first incident, where 5 goes down and then comes back up, servers 1 and 2 get socket errors on their connections with 3, 4, and 6. It's possible that if you had waited those 15 minutes, the quorum would've formed with the other servers once those errors cleared. But as for why there were those errors in the first place, it's not clear. Could be a network glitch, or an obscure bug in the connection logic. Has anyone else ever seen this?
>>>>>>>>>
>>>>>>>>> If you see it again, getting a stack trace of the servers when they can't form quorum might be helpful.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6). 1, 2, 5 are in datacenter A; 3, 4, 6 are in datacenter B.
>>>>>>>>>>
>>>>>>>>>> Yesterday one of the participants (id 5, which by chance was the leader) was rebooted. Although all other servers were online and not suffering from networking issues, the leader election failed and the cluster remained "looking" until the old leader came back online, after which it was promptly elected as leader again.
>>>>>>>>>>
>>>>>>>>>> Today we tried the same exercise on the exact same servers (5 was still leader and was rebooted), and leader election worked fine with 4 as the new leader.
>>>>>>>>>>
>>>>>>>>>> I have included the logs. From the logs I see that yesterday 1, 2 never received new leader proposals from 3, 4, and vice versa. Today all proposals came through. This is not the first time we've seen this type of behavior, where some zookeepers can't seem to find each other after the leader goes down.
>>>>>>>>>>
>>>>>>>>>> All servers use dynamic configuration and have the same config node.
>>>>>>>>>>
>>>>>>>>>> How could this be explained? These servers also host a replicated database cluster and have no history of db replication issues.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Chris
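For the next occurrence: the stack traces Camille asked for above can be captured on each stuck server with something along the lines of the commands below. This is only a sketch - it assumes the JDK tools are on the path and that the server was started via the usual QuorumPeerMain entry point.

  # While the ensemble is unable to form quorum, on each affected server:
  ZK_PID=$(pgrep -f QuorumPeerMain)
  jstack -l "$ZK_PID" > /tmp/zk-threads-$(hostname)-$(date +%s).txt

  # Fallback if jstack is not installed: SIGQUIT makes the JVM print all
  # thread stacks to its stdout / console log without killing the process.
  kill -3 "$ZK_PID"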
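And just to make sure we're all picturing the same setup: the topology Chris describes (participants 1, 2, 5 in datacenter A; participants 3, 4 and observer 6 in datacenter B) would correspond to a dynamic configuration file roughly like the one below. The hostnames and ports are made up; only the ids and roles come from the thread.

  # zoo.cfg.dynamic (sketch; hypothetical hostnames)
  # datacenter A
  server.1=zk1.dc-a.example.com:2888:3888:participant;2181
  server.2=zk2.dc-a.example.com:2888:3888:participant;2181
  server.5=zk5.dc-a.example.com:2888:3888:participant;2181
  # datacenter B
  server.3=zk3.dc-b.example.com:2888:3888:participant;2181
  server.4=zk4.dc-b.example.com:2888:3888:participant;2181
  server.6=zk6.dc-b.example.com:2888:3888:observer;2181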