On February 6, 2020 4:18:15 AM GMT+02:00, Eric Robinson <eric.robin...@psmnv.com> wrote:
>Hi Strahil –
>
>I think you may be right about the token timeouts being too short. I've
>also noticed that periods of high load can cause drbd to disconnect.
>What would you recommend for changes to the timeouts?
>
>I'm running Red Hat's Corosync Cluster Engine, version 2.4.3. The
>config is relatively simple.
>
>The corosync config looks like this…
>
>totem {
>    version: 2
>    cluster_name: 001db01ab
>    secauth: off
>    transport: udpu
>}
>
>nodelist {
>    node {
>        ring0_addr: 001db01a
>        nodeid: 1
>    }
>
>    node {
>        ring0_addr: 001db01b
>        nodeid: 2
>    }
>}
>
>quorum {
>    provider: corosync_votequorum
>    two_node: 1
>}
>
>logging {
>    to_logfile: yes
>    logfile: /var/log/cluster/corosync.log
>    to_syslog: yes
>}
>
>From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
>Sent: Wednesday, February 5, 2020 6:39 PM
>To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Andrei Borzenkov <arvidj...@gmail.com>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>Hi Andrei,
>
>Don't trust Azure so much :D . I've seen stuff that was way more
>unbelievable. Can you check whether other systems in the same subnet
>reported any issues? Even so, pcs most probably won't report any
>short-term issues. I have noticed that the RHEL7 defaults for token and
>consensus are quite small, so any short-term disruption can cause an
>issue. In fact, when I tested live migration on oVirt, the other hosts
>fenced the node that was being migrated.
>
>What are your corosync config and OS version?
>
>Best Regards,
>Strahil Nikolov
>
>On Thursday, February 6, 2020, 01:44:55 GMT+2, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>
>Hi Strahil –
>
>I can't prove there was no network loss, but:
>
> 1. There were no dmesg indications of ethernet link loss.
> 2. Other than corosync, there are no other log messages about connectivity issues.
> 3. Wouldn't pcsd say something about connectivity loss?
> 4. Both servers are in Azure.
> 5. There are many other servers in the same Azure subscription, including other corosync clusters, none of which had issues.
>
>So I guess it's possible, but it seems unlikely.
>
>--Eric
>
>From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
>Sent: Wednesday, February 5, 2020 3:13 PM
>To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Andrei Borzenkov <arvidj...@gmail.com>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>Hi Eric,
>
>What has led you to think that there was no network loss?
>
>Best Regards,
>Strahil Nikolov
>
>On Wednesday, February 5, 2020, 22:59:56 GMT+2, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>
>> -----Original Message-----
>> From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
>> Sent: Wednesday, February 5, 2020 1:59 PM
>> To: Andrei Borzenkov <arvidj...@gmail.com>; users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>>
>> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>> <arvidj...@gmail.com> wrote:
>> >On 05.02.2020 20:55, Eric Robinson wrote:
>> >> The two servers 001db01a and 001db01b were up and responsive.
>> >> Neither had been rebooted and neither was under heavy load. There's
>> >> no indication in the logs of loss of network connectivity. Any ideas
>> >> on why both nodes seem to think the other one is at fault?
>> >
>> >The very fact that the nodes lost connection to each other *is* an
>> >indication of network problems.
>> >Your logs start too late, after the problem had already happened.
>> >
>> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is
>> >> not an option at this time.)
>> >>
>> >> Log from 001db01a:
>> >>
>> >> Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
>> >> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
>> >> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
>> >> Feb 5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is now lost
>> >> Feb 5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b attributes for peer loss
>> >> Feb 5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is now lost
>> >> Feb 5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> >> Feb 5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> >> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> >> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b state is now lost
>> >> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> >> Feb 5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
>> >> Feb 5 08:01:03 001db01a corosync[1306]: [MAIN ] Completed service synchronization, ready to provide service.
>> >> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> >> Feb 5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b state is now lost
>> >> Feb 5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>> >> Feb 5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is now lost
>> >> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> >> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> >> Feb 5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM Quorum: Ignore
>> >>
>> >> From 001db01b:
>> >>
>> >> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
>> >> Feb 5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC (001db01a) is dead
>> >> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a state is now lost
>> >> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
>> >> Feb 5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
>> >> Feb 5 08:01:03 001db01b corosync[1455]: [MAIN ] Completed service synchronization, ready to provide service.
>> >> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> >> Feb 5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a state is now lost
>> >> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_NOT_DC -> S_ELECTION
>> >> Feb 5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is now lost
>> >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is now lost
>> >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a attributes for peer loss
>> >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer 001db01a
>> >> Feb 5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> >> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_ELECTION -> S_INTEGRATION
>> >> Feb 5 08:01:03 001db01b cib[1688]: notice: Node 001db01a state is now lost
>> >> Feb 5 08:01:03 001db01b cib[1688]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> >> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> >> Feb 5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
>> >> Feb 5 08:01:03 001db01b pengine[1692]: notice: On loss of CCM Quorum: Ignore
>> >>
>> >> -Eric
>> >>
>> >> Disclaimer: This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.
>>
>> Hi Eric,
>> Do you use 2 corosync rings (routed via separate switches)?
>
>I've done that with all my other clusters, but these two servers are in
>Azure, so the network is out of our control.
>
>> If not, you can easily set them up without downtime.
>>
>> Also, are you using multicast or unicast?
>
>Unicast, as Azure does not support multicast.
>
>> If a 3rd node is not an option, you can check whether your version
>> supports 'qdevice', which can be on a separate network and requires
>> very low resources - a simple VM will be enough.
>
>Thanks for the tip. I looked into qdevice years ago, but it didn't seem
>mature at the time. I appreciate the reminder. I will pop over there
>and investigate!
>
>> Best Regards,
>> Strahil Nikolov
Hey Eric,

The defaults are a 1 s token and a 1.2 s consensus, which is too small. On SUSE, the token is 10 s, while consensus is 1.2 * token -> 12 s. With those settings the cluster will not react for 22 s, and I think that's a good starting point for your cluster.

Don't forget to put the cluster in maintenance mode (pcs property set maintenance-mode=true) before restarting the stack, or even better, get some downtime. You can use the following article to run a simulation before removing the maintenance mode: https://www.suse.com/support/kb/doc/?id=7022764

Best Regards,
Strahil Nikolov

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
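For reference, here is a sketch of what the suggested 10 s / 12 s timers would look like in the totem section of /etc/corosync/corosync.conf. The totem timers are specified in milliseconds; the other values are copied from the config posted earlier in the thread, so verify everything against the corosync.conf(5) man page for your version before applying:

    totem {
        version: 2
        cluster_name: 001db01ab
        secauth: off
        transport: udpu
        token: 10000
        consensus: 12000
    }

Since consensus defaults to 1.2 * token when it is not set, the explicit consensus line here is optional; corosync would compute 12000 on its own.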
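A possible sequence for applying a timer change on a two-node pcs cluster, sketched from the advice above. The command syntax is assumed from pcs as shipped with RHEL 7; double-check each step against your own pcs/corosync versions and the linked SUSE article before running anything:

    # Keep Pacemaker from reacting while corosync is restarted;
    # resources keep running but become unmanaged
    pcs property set maintenance-mode=true

    # Edit /etc/corosync/corosync.conf on both nodes, then restart the stack
    pcs cluster stop --all
    pcs cluster start --all

    # Confirm the runtime timer values corosync actually loaded
    corosync-cmapctl | grep totem

    # Dry-run the next transition before ending maintenance
    crm_simulate --simulate --live-check

    pcs property set maintenance-mode=false

The simulation step is the safeguard: if crm_simulate shows Pacemaker planning restarts or moves, investigate before clearing maintenance mode.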