Thanks for testing, Chris. So, if I understand you correctly, you're running the latest version from branch-3.5. Could we say that this is a 3.5-only problem? Have you ever tested the same cluster with 3.4?
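Also, just to be sure we're comparing the same bits: each node reports its
exact version via the srvr four-letter word, which on 3.5 has to be
whitelisted first (e.g. 4lw.commands.whitelist=srvr in zoo.cfg). The
hostname below is a placeholder:

    echo srvr | nc zk-node 2181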
Regards,
Andor

On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turks...@gmail.com> wrote:

> I've tested the patch and let it run for 6 days. It did not help; the
> result is still the same (the remaining ZKs form islands based on the
> datacenter they are in).
>
> I have mitigated it by doing a daily rolling restart.
>
> Regards,
> Chris
>
> On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <an...@cloudera.com.invalid>
> wrote:
>
> > Hi Chris,
> >
> > Would you mind testing the following patch on your test clusters?
> > I'm not entirely sure, but the issue might be related:
> >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-2930
> >
> > Regards,
> > Andor
> >
> >
> > On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <cami...@apache.org>
> > wrote:
> >
> > > If you have the time and inclination, next time you see this problem
> > > in your test clusters, get stack traces and any other diagnostics
> > > possible before restarting. I'm not an expert at network debugging,
> > > but if you have someone who is, you might want them to take a look
> > > at the connections and settings of any switches/firewalls/etc
> > > involved, and see if there are any unusual configurations or
> > > evidence of other long-lived connections failing (even if their
> > > services handle the failures more gracefully). Send us the stack
> > > traces as well; it would be interesting to take a look.
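> > > For example (where <pid> is the ZooKeeper JVM process and the
> > > output path is just a placeholder):
> > >
> > >     jstack -l <pid> > /tmp/zk-stacks.txt
> > >
> > > If jstack isn't available on the box, kill -3 <pid> makes the JVM
> > > print the same thread dump to its stdout/log instead.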
> > >
> > > C
> > >
> > >
> > > On Wed, Aug 8, 2018, 11:09 AM Chris <c.turks...@gmail.com> wrote:
> > >
> > > > Running 3.5.5.
> > > >
> > > > I managed to recreate it on the acceptance and test clusters
> > > > today, failing on shutdown of the leader. Both had been running
> > > > for over a week. After restarting all zookeepers it runs fine, no
> > > > matter how many leader shutdowns I throw at it.
> > > >
> > > > On 8 August 2018 5:05:34 pm Andor Molnar <an...@cloudera.com.INVALID>
> > > > wrote:
> > > >
> > > > > Some kind of a network split?
> > > > >
> > > > > It looks like 1-2 and 3-4 were able to communicate with each
> > > > > other, but connections timed out between these two splits. When
> > > > > 5 came back online, it started with the support of (1,2), and
> > > > > later 3 and 4 also joined.
> > > > >
> > > > > There was no such issue the day after.
> > > > >
> > > > > Which version of ZooKeeper is this? 3.5.something?
> > > > >
> > > > > Regards,
> > > > > Andor
> > > > >
> > > > >
> > > > > On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turks...@gmail.com> wrote:
> > > > >
> > > > >> Actually, I have similar issues on my test and acceptance
> > > > >> clusters, where leader election fails if the cluster has been
> > > > >> running for a couple of days. If you stop/start the Zookeepers
> > > > >> once, they will work fine on further disruptions that day. Not
> > > > >> sure yet what the threshold is.
> > > > >>
> > > > >>
> > > > >> On 8 August 2018 4:32:56 pm Camille Fournier <cami...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hard to say. It looks like about 15 minutes after your first
> > > > >>> incident, where 5 goes down and then comes back up, servers 1
> > > > >>> and 2 get socket errors on their connections with 3, 4, and 6.
> > > > >>> It's possible that if you had waited those 15 minutes, the
> > > > >>> quorum would've formed with the other servers once those
> > > > >>> errors cleared. But it's not clear why there were those errors
> > > > >>> in the first place. It could be a network glitch, or an
> > > > >>> obscure bug in the connection logic. Has anyone else ever seen
> > > > >>> this?
> > > > >>> If you see it again, getting a stack trace of the servers when
> > > > >>> they can't form quorum might be helpful.
> > > > >>>
> > > > >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turks...@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> I have a cluster of 5 participants (id 1-5) and 1 observer
> > > > >>>> (id 6). 1, 2 and 5 are in datacenter A; 3, 4 and 6 are in
> > > > >>>> datacenter B.
> > > > >>>> Yesterday one of the participants (id 5, which by chance was
> > > > >>>> the leader) was rebooted. Although all the other servers were
> > > > >>>> online and not suffering from networking issues, the leader
> > > > >>>> election failed and the cluster remained "looking" until the
> > > > >>>> old leader came back online, after which it was promptly
> > > > >>>> elected as leader again.
> > > > >>>>
> > > > >>>> Today we tried the same exercise on the exact same servers
> > > > >>>> (5 was still leader and was rebooted), and leader election
> > > > >>>> worked fine with 4 as the new leader.
> > > > >>>>
> > > > >>>> I have included the logs. From the logs I see that yesterday
> > > > >>>> 1 and 2 never received new leader proposals from 3 and 4, and
> > > > >>>> vice versa. Today all proposals came through. This is not the
> > > > >>>> first time we've seen this type of behavior, where some
> > > > >>>> zookeepers can't seem to find each other after the leader
> > > > >>>> goes down.
> > > > >>>> All servers use dynamic configuration and have the same
> > > > >>>> config node (format shown in the P.S. below).
> > > > >>>>
> > > > >>>> How could this be explained? These servers also host a
> > > > >>>> replicated database cluster and have no history of db
> > > > >>>> replication issues.
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Chris
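> > > > >>>>
> > > > >>>> P.S. For reference, the config node follows the usual 3.5
> > > > >>>> dynamic format, along these lines (the hostnames here are
> > > > >>>> placeholders, not our real ones):
> > > > >>>>
> > > > >>>>     server.1=dc-a-node1:2888:3888:participant;2181
> > > > >>>>     server.2=dc-a-node2:2888:3888:participant;2181
> > > > >>>>     server.3=dc-b-node1:2888:3888:participant;2181
> > > > >>>>     server.4=dc-b-node2:2888:3888:participant;2181
> > > > >>>>     server.5=dc-a-node3:2888:3888:participant;2181
> > > > >>>>     server.6=dc-b-node3:2888:3888:observer;2181
> > > > >>>>
> > > > >>>> Every node returns the same thing from the "config" command
> > > > >>>> in zkCli.sh.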