Hi! There were earlier indications of this problem, but today it happened again: I restarted a node (h18, the DC) via "crm cluster restart". The node shut down cleanly (at least the shutdown ran to completion), but while it was coming back up, it was fenced by the new DC (h16):
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Transition 409 (Complete=23, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Complete
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected from the executor
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected from Corosync
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected from the CIB manager
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-schedulerd
Feb 08 08:12:24 h18 pacemaker-schedulerd[8292]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-attrd
Feb 08 08:12:24 h18 pacemaker-attrd[8291]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemaker-execd[8290]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-execd
Feb 08 08:12:24 h18 pacemaker-based[8287]: warning: new_event_notification (8287-8291-11): Broken pipe (32)
Feb 08 08:12:24 h18 pacemaker-based[8287]: warning: Notification of client attrd/b836bbe9-648e-4b1a-ae78-b351d5ffa00b failed
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-fenced
Feb 08 08:12:24 h18 pacemaker-fenced[8288]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-based
Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Disconnected from Corosync
Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Disconnected from Corosync
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 8270) has terminated
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Shutdown complete
Feb 08 08:12:24 h18 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Feb 08 08:12:24 h18 sbd[33927]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Feb 08 08:12:24 h18 systemd[1]: Stopping Shared-storage based fencing daemon...
Feb 08 08:12:24 h18 systemd[1]: Stopping Corosync Cluster Engine...
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 33927) has terminated
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for cluster (pid: 8271) has terminated
Feb 08 08:12:24 h18 corosync[33929]: Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Feb 08 08:12:24 h18 corosync[8279]: [MAIN ] Node was shut down by a signal
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Unloading all Corosync service engines.
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync configuration map access
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync configuration service
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync profile loading service
Feb 08 08:12:24 h18 corosync[8279]: [MAIN ] Corosync Cluster Engine exiting normally
Feb 08 08:12:24 h18 corosync[33929]: Waiting for corosync services to unload:[ OK ]
Feb 08 08:12:24 h18 systemd[1]: Stopped Corosync Cluster Engine.
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 8268) has terminated
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 8269) has terminated
Feb 08 08:12:24 h18 systemd[1]: Stopped Shared-storage based fencing daemon.

So I hope we can agree that that was a "clean" shutdown of the node. Now the new DC:

Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: Our peer on the DC (h18) is dead
Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: State transition S_NOT_DC -> S_ELECTION
Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: State transition S_ELECTION -> S_INTEGRATION
Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Removing all h18 attributes for peer loss
Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 pacemaker-fenced[7521]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-fenced[7521]: notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 pacemaker-based[7520]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-based[7520]: notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 corosync[6567]: [TOTEM ] A new membership (172.20.16.16:42448) was formed. Members left: 118
Feb 08 08:12:24 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:24 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:24 h16 corosync[6567]: [QUORUM] Members[2]: 116 119
Feb 08 08:12:24 h16 corosync[6567]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:24 h16 pacemakerd[7518]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-controld[7525]: warning: Stonith/shutdown of node h18 was not expected
Feb 08 08:12:24 h16 kernel: dlm: closing connection to node 118
Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Updating quorum status to true (call=160)
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero

Why isn't the shutdown expected? h18 had "announced" its shutdown:

Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Caught 'Terminated' signal
Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Shutting down Pacemaker
Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Stopping pacemaker-controld
Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: Caught 'Terminated' signal
Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: Initiating controller shutdown sequence
Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 08 08:11:52 h18 pacemaker-attrd[8291]: notice: Setting shutdown[h18]: (unset) -> 1612768312
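(As an aside: the value 1612768312 is just the epoch timestamp of the shutdown request, i.e. 08:11:52 CET. If I understand the tooling correctly, the transient "shutdown" attribute and the node_state entry that the DC bases its expectation on can be queried from the CIB status section with something like

h16:~ # crm_attribute --node h18 --name shutdown --lifetime reboot --query
h16:~ # cibadmin --query --xpath "//node_state[@uname='h18']"

The host prompt is only illustrative, and I did not capture that output at the time, so take these as a sketch rather than as evidence.)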
So when h18 came up again, it was fenced:

Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: warning: Scheduling Node h18 for STONITH
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: notice: * Fence (reboot) h18 'resource actions are unrunnable'
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-36.bz2
Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Processing graph 0 (ref=pe_calc-dc-1612768345-16) derived from /var/lib/pacemaker/pengine/pe-warn-36.bz2
Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Requesting fencing (reboot) of node h18
Feb 08 08:12:25 h16 pacemaker-fenced[7521]: notice: Client pacemaker-controld.7525.4c7f7600 wants to fence (reboot) 'h18' with device '(any)'
Feb 08 08:12:25 h16 pacemaker-fenced[7521]: notice: Requesting peer fencing (reboot) targeting h18
Feb 08 08:12:30 h16 corosync[6567]: [TOTEM ] A new membership (172.20.16.16:42456) was formed. Members joined: 118
Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]: [QUORUM] Members[3]: 116 118 119
Feb 08 08:12:30 h16 corosync[6567]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:30 h16 pacemaker-controld[7525]: notice: Node h18 state is now member
Feb 08 08:12:30 h16 pacemakerd[7518]: notice: Node h18 state is now member
Feb 08 08:12:30 h16 pacemaker-fenced[7521]: notice: Requesting that h16 perform 'reboot' action targeting h18
Feb 08 08:12:30 h16 pacemaker-fenced[7521]: notice: prm_stonith_sbd is eligible to fence (reboot) h18: dynamic-list
Feb 08 08:12:37 h16 corosync[6567]: [TOTEM ] A processor failed, forming new configuration.
Feb 08 08:12:44 h16 corosync[6567]: [TOTEM ] A new membership (172.20.16.16:42460) was formed. Members left: 118
Feb 08 08:12:44 h16 corosync[6567]: [TOTEM ] Failed to receive the leave message. failed: 118
Feb 08 08:12:44 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:44 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:44 h16 corosync[6567]: [QUORUM] Members[2]: 116 119
Feb 08 08:12:44 h16 corosync[6567]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:44 h16 pacemaker-controld[7525]: notice: Node h18 state is now lost
Feb 08 08:12:44 h16 pacemakerd[7518]: notice: Node h18 state is now lost

Meanwhile, this is what the restart looked like from h18's point of view:

Feb 08 08:12:24 h18 systemd[1]: Starting Shared-storage based fencing daemon...
Feb 08 08:12:24 h18 systemd[1]: Starting Corosync Cluster Engine...
Feb 08 08:12:24 h18 sbd[33948]: notice: main: Do flush + write 'b' to sysrq in case of timeout
Feb 08 08:12:25 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-3P1: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 08 08:12:25 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 08 08:12:25 h18 sbd[33960]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Feb 08 08:12:25 h18 sbd[33962]: cluster: notice: servant_cluster: Monitoring unknown cluster health
Feb 08 08:12:25 h18 corosync[33964]: [MAIN ] Corosync Cluster Engine ('2.4.5'): started and ready to provide service.
Feb 08 08:12:25 h18 corosync[33964]: [MAIN ] Corosync built-in features: testagents systemd qdevices qnetd pie relro bindnow
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] The network interface [172.20.16.18] is now up.
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync configuration map access [0]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cmap
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync configuration service [1]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cfg
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cpg
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync profile loading service [4]
Feb 08 08:12:25 h18 corosync[33967]: [QUORUM] Using quorum provider corosync_votequorum
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: votequorum
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: quorum
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {172.20.16.16}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {172.20.16.18}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {172.20.16.19}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] The network interface [10.2.2.18] is now up.
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {10.2.2.16}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {10.2.2.18}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {10.2.2.19}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] A new membership (172.20.16.18:42448) was formed. Members joined: 118
Feb 08 08:12:25 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:25 h18 corosync[33967]: [QUORUM] Members[1]: 118
Feb 08 08:12:25 h18 corosync[33967]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:25 h18 corosync[33950]: Starting Corosync Cluster Engine (corosync): [ OK ]
Feb 08 08:12:25 h18 systemd[1]: Started Corosync Cluster Engine.
Feb 08 08:12:30 h18 corosync[33967]: [TOTEM ] A new membership (172.20.16.16:42456) was formed. Members joined: 116 119
Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]: [QUORUM] This node is within the primary component and will provide service.
Feb 08 08:12:30 h18 corosync[33967]: [QUORUM] Members[3]: 116 118 119
Feb 08 08:12:30 h18 corosync[33967]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:30 h18 sbd[33953]: notice: inquisitor_child: Servant cluster is healthy (age: 0)

Broadcast message from systemd-journald@h18 (Mon 2021-02-08 08:12:32 CET):
sbd[33953]: emerg: do_exit: Rebooting system: reboot

Feb 08 08:12:32 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-3P1: notice: servant_md: Received command reset from h16 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 08 08:12:32 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice: servant_md: Received command reset from h16 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 08 08:12:32 h18 sbd[33953]: warning: inquisitor_child: /dev/disk/by-id/dm-name-SBD_1-3P1 requested a reset
Feb 08 08:12:32 h18 sbd[33953]: emerg: do_exit: Rebooting system: reboot
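(Another aside: the reset that h18 acted on here arrived through its slots on the two SBD disks. As far as I know, the slot contents and the timeouts configured on a disk can be inspected with something like

h18:~ # sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 list
h18:~ # sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 dump

Again, the host prompt is illustrative and I did not run these at the time; it is only a sketch of how one could look at the disk side of the fencing.)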
OS/software: SLES15 SP2

ph16:~ # rpm -q pacemaker corosync
pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64
corosync-2.4.5-6.3.2.x86_64

Regards,
Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/