On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov <[email protected]> wrote:
>05.02.2020 20:55, Eric Robinson wrote:
>> The two servers 001db01a and 001db01b were up and responsive. Neither had been rebooted and neither was under heavy load. There's no indication in the logs of loss of network connectivity. Any ideas on why both nodes seem to think the other one is at fault?
>
>The very fact that the nodes lost connection to each other *is* an indication of network problems. Your logs start too late, after any problem had already happened.
>
>>
>> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option at this time.)
>>
>> Log from 001db01a:
>>
>> Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
>> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b attributes for peer loss
>> Feb 5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> Feb 5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
>> Feb 5 08:01:03 001db01a corosync[1306]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> Feb 5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM Quorum: Ignore
>>
>> From 001db01b:
>>
>> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC (001db01a) is dead
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
>> Feb 5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
>> Feb 5 08:01:03 001db01b corosync[1455]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_NOT_DC -> S_ELECTION
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a attributes for peer loss
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer 001db01a
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_ELECTION -> S_INTEGRATION
>> Feb 5 08:01:03 001db01b cib[1688]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b cib[1688]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Feb 5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
>> Feb 5 08:01:03 001db01b pengine[1692]: notice: On loss of CCM Quorum: Ignore
>>
>>
>> -Eric
>>
Hi Eric,

Do you use two corosync rings (routed via separate switches)? If not, you can set them up fairly easily and without downtime. Also, are you using multicast or unicast?

If a third node is not an option, you can check whether your version supports 'qdevice', which can sit on a separate network and needs very few resources - a simple VM is enough.
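Just as a rough sketch of the two-ring idea (the second-ring subnet 10.52.14.x below is made up - substitute whatever network runs over your other switch), a corosync 2.x unicast config with redundant rings would look roughly like this:

    totem {
        version: 2
        # udpu = unicast; leave this out if you are staying on multicast
        transport: udpu
        # passive = redundant ring protocol, one ring carries traffic at a time
        rrp_mode: passive
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 10.51.14.33
            # hypothetical address for 001db01a on the second network
            ring1_addr: 10.52.14.33
        }
        node {
            nodeid: 2
            ring0_addr: 10.51.14.34
            # hypothetical address for 001db01b on the second network
            ring1_addr: 10.52.14.34
        }
    }

Once both nodes have picked up the second ring, 'corosync-cfgtool -s' should list two rings per node, each "active with no faults".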
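For the qdevice route, assuming you manage the cluster with pcs and the corosync-qnetd/corosync-qdevice packages are available for your release ('qnetd-host' below is just a placeholder name), the setup is roughly:

    # on the small arbiter VM
    yum install pcs corosync-qnetd
    pcs qdevice setup model net --enable --start

    # on both cluster nodes
    yum install corosync-qdevice

    # then, from one cluster node, point the cluster at the arbiter
    pcs quorum device add model net host=qnetd-host algorithm=ffsplit
    pcs quorum status

The qnetd side only exchanges small vote/heartbeat messages (TCP port 5403 by default), which is why a tiny VM on a separate network is enough.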
Best Regards,
Strahil Nikolov

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/