What are your corosync.conf timeouts (especially token and consensus)? Last time I live-migrated a RHEL 7 node with the default values, the cluster fenced it, so I set token to 10s and also raised consensus (see 'man corosync.conf') above its default. A rough sketch of the relevant totem settings is below.
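Something like this (the token value is the 10s I mentioned; the consensus value here is just an example, since consensus defaults to 1.2 * token and has to grow along with it):

    totem {
        version: 2
        # time (in ms) without a token before a membership change is declared
        token: 10000
        # must stay above token; defaults to 1.2 * token when not set
        consensus: 12000
    }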
Also, start your investigation from the virtualization layer, as a lot of backups run during the night. Last week I had a cluster node fenced because it failed to respond for 40s. Thankfully that was just a QA cluster, so it wasn't a big deal.

The most common reasons for a VM to fail to respond are:
- CPU starvation due to high CPU utilisation on the host
- I/O issues causing the VM to pause
- backups eating the bandwidth on one of the hypervisors, or on a switch between them (if you have a single heartbeat network)

With RHEL 8, corosync allows more than two heartbeat rings and new features such as SCTP.

P.S.: You can add a second fencing mechanism like 'sbd', a.k.a. the "poison pill"; just make the VMDK shared and independent. This way your cluster can still fence even when vCenter is unreachable for any reason. Rough sketches of both are below.
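For illustration, a second heartbeat ring (knet link) in corosync.conf looks roughly like this; the addresses are made up, and knet supports up to 8 links per node:

    nodelist {
        node {
            ring0_addr: 10.0.0.11    # existing heartbeat network (example)
            ring1_addr: 10.1.0.11    # second, independent network (example)
            name: srv1
            nodeid: 1
        }
        node {
            ring0_addr: 10.0.0.12
            ring1_addr: 10.1.0.12
            name: srv2
            nodeid: 2
        }
    }

For sbd, once the shared VMDK is visible on both nodes, the disk is initialized once and then referenced on both nodes (the device path below is made up):

    # run on one node only - writes the sbd header to the shared disk
    sbd -d /dev/disk/by-id/scsi-SHARED_VMDK create
    # verify the header
    sbd -d /dev/disk/by-id/scsi-SHARED_VMDK dump

    # /etc/sysconfig/sbd on both nodes
    SBD_DEVICE="/dev/disk/by-id/scsi-SHARED_VMDK"

And regarding grepping pacemaker.log below: filtering by severity first, e.g. grep -E 'warning:|error:|crit' /var/log/pacemaker/pacemaker.log, usually gets you to the interesting lines quickly.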
Best Regards,
Strahil Nikolov

On June 10, 2020, at 20:06:28 GMT+03:00, Howard <hmon...@gmail.com> wrote:
>Good morning. Thanks for reading. We have a requirement to provide high
>availability for PostgreSQL 10. I have built a two-node cluster with a
>quorum device as the third vote, all running on RHEL 8.
>
>Here are the versions installed:
>[postgres@srv2 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>corosync-3.0.2-3.el8_1.1.x86_64
>corosync-qdevice-3.0.0-2.el8.x86_64
>corosync-qnetd-3.0.0-2.el8.x86_64
>corosynclib-3.0.2-3.el8_1.1.x86_64
>fence-agents-vmware-soap-4.2.1-41.el8.noarch
>pacemaker-2.0.2-3.el8_1.2.x86_64
>pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>pcs-0.10.2-4.el8.x86_64
>resource-agents-paf-2.3.0-1.noarch
>
>These are VMware VMs, so I configured the cluster to use the ESX host as
>the fencing device using fence_vmware_soap.
>
>Throughout each day things generally work very well. The cluster remains
>online and healthy. Unfortunately, when I check pcs status in the
>mornings, I see that all kinds of things went wrong overnight. It is
>hard to pinpoint what the issue is because there is so much information
>being written to pacemaker.log; I end up scrolling through pages and
>pages of informational log entries trying to find the lines that
>pertain to the issue. Is there a way to separate the logs out to make
>them easier to scroll through? Or maybe a list of keywords to grep for?
>
>It is clearly indicating that the server lost contact with the other
>node and also the quorum device. Is there a way to make this
>configuration more robust, or able to recover from a connectivity blip?
>
>Here are the pacemaker and corosync logs for this morning's failures:
>
>pacemaker.log
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld [10579] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 will be fenced: peer is no longer part of the cluster
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (determine_online_status) warning: Node srv1 is unclean
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1 for STONITH
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pcmk__log_transition_summary) warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-34.bz2
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1 (op=join_offer)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 3 (src=307)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 2 (src=308)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (do_log) warning: Input I_RELEASE_DC received in state S_RELEASE_DC from do_election_count_vote
>/var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]: Jun 10 00:07:19 WARNING: No secondary connected to the master
>/var/log/pacemaker/pacemaker.log:Sent 5 probes (5 broadcast(s))
>/var/log/pacemaker/pacemaker.log:Received 0 response(s)
>
>corosync.log
>Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:06:41 [10558] srv2 corosync notice [TOTEM ] Token has not been received in 12922 ms
>Jun 10 00:06:41 [10558] srv2 corosync notice [TOTEM ] A processor failed, forming new configuration.
>Jun 10 00:06:41 [10558] srv2 corosync info [VOTEQ ] lost contact with quorum device Qdevice
>Jun 10 00:06:41 [10558] srv2 corosync info [KNET ] link: host: 1 link: 0 is down
>Jun 10 00:06:41 [10558] srv2 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
>Jun 10 00:06:41 [10558] srv2 corosync warning [KNET ] host: host: 1 has no active links
>Jun 10 00:06:42 [10558] srv2 corosync info [KNET ] rx: host: 1 link: 0 is up
>Jun 10 00:06:42 [10558] srv2 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
>Jun 10 00:06:42 [10558] srv2 corosync info [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:42 [10558] srv2 corosync notice [TOTEM ] A new membership (2:952) was formed. Members left: 1
>Jun 10 00:06:42 [10558] srv2 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
>Jun 10 00:06:42 [10558] srv2 corosync warning [CPG ] downlist left_list: 1 received
>Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
>Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
>Jun 10 00:06:42 [10558] srv2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] This node is within the primary component and will provide service.
>Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
>Jun 10 00:06:43 [10558] srv2 corosync info [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:43 [10558] srv2 corosync notice [TOTEM ] A new membership (1:960) was formed. Members joined: 1
>Jun 10 00:06:43 [10558] srv2 corosync warning [CPG ] downlist left_list: 0 received
>Jun 10 00:06:43 [10558] srv2 corosync warning [CPG ] downlist left_list: 0 received
>Jun 10 00:06:45 [10558] srv2 corosync notice [QUORUM] Members[2]: 1 2
>Jun 10 00:06:45 [10558] srv2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>Jun 10 00:06:45 [10558] srv2 corosync warning [MAIN ] Corosync main process was not scheduled for 1747.0415 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:06:45 [10558] srv2 corosync info [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:45 [10558] srv2 corosync notice [TOTEM ] A new membership (1:964) was formed. Members
>Jun 10 00:06:45 [10558] srv2 corosync warning [CPG ] downlist left_list: 0 received
>Jun 10 00:06:45 [10558] srv2 corosync warning [CPG ] downlist left_list: 0 received
>Jun 10 00:06:45 [10558] srv2 corosync notice [QUORUM] Members[2]: 1 2
>Jun 10 00:06:45 [10558] srv2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>Jun 10 00:06:52 [10558] srv2 corosync notice [TOTEM ] Token has not been received in 750 ms
>Jun 10 00:06:52 [10558] srv2 corosync info [KNET ] link: host: 1 link: 0 is down
>Jun 10 00:06:52 [10558] srv2 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
>Jun 10 00:06:52 [10558] srv2 corosync warning [KNET ] host: host: 1 has no active links
>Jun 10 00:06:52 [10558] srv2 corosync notice [TOTEM ] A processor failed, forming new configuration.
>Jun 10 00:06:53 [10558] srv2 corosync info [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:53 [10558] srv2 corosync notice [TOTEM ] A new membership (2:968) was formed. Members left: 1
>Jun 10 00:06:53 [10558] srv2 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
>Jun 10 00:06:53 [10558] srv2 corosync warning [CPG ] downlist left_list: 1 received
>Jun 10 00:07:17 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
>Jun 10 00:07:17 [10558] srv2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>Jun 10 00:08:56 [10558] srv2 corosync notice [TOTEM ] Token has not been received in 750 ms
>Jun 10 00:09:04 [10558] srv2 corosync warning [MAIN ] Corosync main process was not scheduled for 4477.0459 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:09:13 [10558] srv2 corosync warning [MAIN ] Corosync main process was not scheduled for 5302.9785 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:09:13 [10558] srv2 corosync notice [TOTEM ] Token has not been received in 5295 ms
>Jun 10 00:09:13 [10558] srv2 corosync notice [TOTEM ] A processor failed, forming new configuration.
>Jun 10 00:09:13 [10558] srv2 corosync info [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:09:13 [10558] srv2 corosync notice [TOTEM ] A new membership (2:972) was formed. Members
>Jun 10 00:09:13 [10558] srv2 corosync warning [CPG ] downlist left_list: 0 received
>Jun 10 00:09:13 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
>Jun 10 00:09:13 [10558] srv2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>
>Thanks,
>Howard

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/