Hi! There were earlier indications of this problem, but today it happened again: I restarted a node (h18, the DC) via "crm cluster restart". The node shut down cleanly (at least the shutdown ran to completion), but while it was coming back up, it was fenced by the new DC (h16):
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Transition 409 (Complete=23, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Complete
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected from the executor
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected from Corosync
Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected from the CIB manager
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-schedulerd
Feb 08 08:12:24 h18 pacemaker-schedulerd[8292]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-attrd
Feb 08 08:12:24 h18 pacemaker-attrd[8291]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemaker-execd[8290]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-execd
Feb 08 08:12:24 h18 pacemaker-based[8287]: warning: new_event_notification (8287-8291-11): Broken pipe (32)
Feb 08 08:12:24 h18 pacemaker-based[8287]: warning: Notification of client attrd/b836bbe9-648e-4b1a-ae78-b351d5ffa00b failed
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-fenced
Feb 08 08:12:24 h18 pacemaker-fenced[8288]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-based
Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Disconnected from Corosync
Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Disconnected from Corosync
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 8270) has terminated
Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Shutdown complete
Feb 08 08:12:24 h18 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Feb 08 08:12:24 h18 sbd[33927]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Feb 08 08:12:24 h18 systemd[1]: Stopping Shared-storage based fencing daemon...
Feb 08 08:12:24 h18 systemd[1]: Stopping Corosync Cluster Engine...
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 33927) has terminated
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for cluster (pid: 8271) has terminated
Feb 08 08:12:24 h18 corosync[33929]: Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Feb 08 08:12:24 h18 corosync[8279]: [MAIN ] Node was shut down by a signal
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Unloading all Corosync service engines.
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync configuration map access
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync configuration service
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine unloaded: corosync profile loading service
Feb 08 08:12:24 h18 corosync[8279]: [MAIN ] Corosync Cluster Engine exiting normally
Feb 08 08:12:24 h18 corosync[33929]: Waiting for corosync services to unload:[ OK ]
Feb 08 08:12:24 h18 systemd[1]: Stopped Corosync Cluster Engine.
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 8268) has terminated
Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 8269) has terminated
Feb 08 08:12:24 h18 systemd[1]: Stopped Shared-storage based fencing daemon.

So I hope we can agree that that was a "clean" shutdown of the node. Now the new DC:

Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: Our peer on the DC (h18) is dead
Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: State transition S_NOT_DC -> S_ELECTION
Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: State transition S_ELECTION -> S_INTEGRATION
Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Removing all h18 attributes for peer loss
Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 pacemaker-fenced[7521]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-fenced[7521]: notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 pacemaker-based[7520]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-based[7520]: notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 corosync[6567]: [TOTEM ] A new membership (172.20.16.16:42448) was formed. Members left: 118
Feb 08 08:12:24 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:24 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:24 h16 corosync[6567]: [QUORUM] Members[2]: 116 119
Feb 08 08:12:24 h16 corosync[6567]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:24 h16 pacemakerd[7518]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-controld[7525]: warning: Stonith/shutdown of node h18 was not expected
Feb 08 08:12:24 h16 kernel: dlm: closing connection to node 118
Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Updating quorum status to true (call=160)
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero

Why isn't the shutdown expected? h18 had "announced" its shutdown:

Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Caught 'Terminated' signal
Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Shutting down Pacemaker
Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Stopping pacemaker-controld
Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: Caught 'Terminated' signal
Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: Initiating controller shutdown sequence
Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 08 08:11:52 h18 pacemaker-attrd[8291]: notice: Setting shutdown[h18]: (unset) -> 1612768312
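(As an aside: the value 1612768312 is just the epoch timestamp of the shutdown request, i.e. 08:11:52 CET. If I understand the tooling correctly, the transient "shutdown" attribute and the node_state entry that the DC bases its expectation on can be queried from the CIB status section with something like

h16:~ # crm_attribute --node h18 --name shutdown --lifetime reboot --query
h16:~ # cibadmin --query --xpath "//node_state[@uname='h18']"

The host prompt is only illustrative, and I did not capture that output at the time, so take these as a sketch rather than as evidence.)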
So when h18 came up again, it was fenced:

Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: warning: Scheduling Node h18 for STONITH
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: notice: * Fence (reboot) h18 'resource actions are unrunnable'
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-36.bz2
Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Processing graph 0 (ref=pe_calc-dc-1612768345-16) derived from /var/lib/pacemaker/pengine/pe-warn-36.bz2
Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Requesting fencing (reboot) of node h18
Feb 08 08:12:25 h16 pacemaker-fenced[7521]: notice: Client pacemaker-controld.7525.4c7f7600 wants to fence (reboot) 'h18' with device '(any)'
Feb 08 08:12:25 h16 pacemaker-fenced[7521]: notice: Requesting peer fencing (reboot) targeting h18
Feb 08 08:12:30 h16 corosync[6567]: [TOTEM ] A new membership (172.20.16.16:42456) was formed. Members joined: 118
Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]: [QUORUM] Members[3]: 116 118 119
Feb 08 08:12:30 h16 corosync[6567]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:30 h16 pacemaker-controld[7525]: notice: Node h18 state is now member
Feb 08 08:12:30 h16 pacemakerd[7518]: notice: Node h18 state is now member
Feb 08 08:12:30 h16 pacemaker-fenced[7521]: notice: Requesting that h16 perform 'reboot' action targeting h18
Feb 08 08:12:30 h16 pacemaker-fenced[7521]: notice: prm_stonith_sbd is eligible to fence (reboot) h18: dynamic-list
Feb 08 08:12:37 h16 corosync[6567]: [TOTEM ] A processor failed, forming new configuration.
Feb 08 08:12:44 h16 corosync[6567]: [TOTEM ] A new membership (172.20.16.16:42460) was formed. Members left: 118
Feb 08 08:12:44 h16 corosync[6567]: [TOTEM ] Failed to receive the leave message. failed: 118
Feb 08 08:12:44 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:44 h16 corosync[6567]: [CPG ] downlist left_list: 1 received
Feb 08 08:12:44 h16 corosync[6567]: [QUORUM] Members[2]: 116 119
Feb 08 08:12:44 h16 corosync[6567]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:44 h16 pacemaker-controld[7525]: notice: Node h18 state is now lost
Feb 08 08:12:44 h16 pacemakerd[7518]: notice: Node h18 state is now lost

Meanwhile, this is what the restart looked like from h18's point of view:

Feb 08 08:12:24 h18 systemd[1]: Starting Shared-storage based fencing daemon...
Feb 08 08:12:24 h18 systemd[1]: Starting Corosync Cluster Engine...
Feb 08 08:12:24 h18 sbd[33948]: notice: main: Do flush + write 'b' to sysrq in case of timeout
Feb 08 08:12:25 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-3P1: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 08 08:12:25 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 08 08:12:25 h18 sbd[33960]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Feb 08 08:12:25 h18 sbd[33962]: cluster: notice: servant_cluster: Monitoring unknown cluster health
Feb 08 08:12:25 h18 corosync[33964]: [MAIN ] Corosync Cluster Engine ('2.4.5'): started and ready to provide service.
Feb 08 08:12:25 h18 corosync[33964]: [MAIN ] Corosync built-in features: testagents systemd qdevices qnetd pie relro bindnow
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] The network interface [172.20.16.18] is now up.
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync configuration map access [0]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cmap
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync configuration service [1]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cfg
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cpg
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync profile loading service [4]
Feb 08 08:12:25 h18 corosync[33967]: [QUORUM] Using quorum provider corosync_votequorum
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: votequorum
Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: quorum
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {172.20.16.16}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {172.20.16.18}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {172.20.16.19}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] The network interface [10.2.2.18] is now up.
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {10.2.2.16}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {10.2.2.18}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU member {10.2.2.19}
Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] A new membership (172.20.16.18:42448) was formed. Members joined: 118
Feb 08 08:12:25 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:25 h18 corosync[33967]: [QUORUM] Members[1]: 118
Feb 08 08:12:25 h18 corosync[33967]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:25 h18 corosync[33950]: Starting Corosync Cluster Engine (corosync): [ OK ]
Feb 08 08:12:25 h18 systemd[1]: Started Corosync Cluster Engine.
Feb 08 08:12:30 h18 corosync[33967]: [TOTEM ] A new membership (172.20.16.16:42456) was formed. Members joined: 116 119
Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]: [QUORUM] This node is within the primary component and will provide service.
Feb 08 08:12:30 h18 corosync[33967]: [QUORUM] Members[3]: 116 118 119
Feb 08 08:12:30 h18 corosync[33967]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 08:12:30 h18 sbd[33953]: notice: inquisitor_child: Servant cluster is healthy (age: 0)

Broadcast message from systemd-journald@h18 (Mon 2021-02-08 08:12:32 CET):
sbd[33953]: emerg: do_exit: Rebooting system: reboot

Feb 08 08:12:32 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-3P1: notice: servant_md: Received command reset from h16 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 08 08:12:32 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice: servant_md: Received command reset from h16 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 08 08:12:32 h18 sbd[33953]: warning: inquisitor_child: /dev/disk/by-id/dm-name-SBD_1-3P1 requested a reset
Feb 08 08:12:32 h18 sbd[33953]: emerg: do_exit: Rebooting system: reboot
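(Another aside: the reset that h18 acted on here arrived through its slots on the two SBD disks. As far as I know, the slot contents and the timeouts configured on a disk can be inspected with something like

h18:~ # sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 list
h18:~ # sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 dump

Again, the host prompt is illustrative and I did not run these at the time; it is only a sketch of how one could look at the disk side of the fencing.)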
OS/software: SLES15 SP2

ph16:~ # rpm -q pacemaker corosync
pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64
corosync-2.4.5-6.3.2.x86_64

Regards,
Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/