Hi!

On current SLES15 SP3 I noticed an oddity when the node running the DC is 
about to be fenced:
Another node seems to start recovery operations before the old DC has been 
confirmed as fenced.

Like this (116 is the corosync node ID of the DC, h16):
Mar 01 01:33:53 h18 corosync[6754]:   [TOTEM ] A new membership 
(172.20.16.18:45612) was formed. Members left: 116

Mar 01 01:33:53 h18 corosync[6754]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Our peer on the DC (h16) 
is dead
Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State transition 
S_NOT_DC -> S_ELECTION

Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 
nodedown time 1646094833 fence_all dlm_stonith
Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State transition 
S_ELECTION -> S_INTEGRATION
Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 
116/(null): 0 in progress, 0 completed
Mar 01 01:33:53 h18 pacemaker-fenced[6973]:  notice: Client 
stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
Mar 01 01:33:53 h18 pacemaker-fenced[6973]:  notice: Requesting peer fencing 
(reboot) targeting h16
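
(For reference, going by the stonith_api_time / stonith_api_kick messages, the 
dlm_stonith helper seems to boil down to roughly the sketch below. This is only 
my illustration against pacemaker's public fencing API, not the actual dlm 
code; the node ID and nodedown time are simply the values from the log above:)

/* Rough sketch of the fence-request path suggested by the stonith_api_time /
 * stonith_api_kick messages above. Build e.g. with
 *   gcc sketch.c $(pkg-config --cflags --libs pacemaker-fencing)
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    uint32_t nodeid = 116;          /* values taken from the log above;      */
    time_t nodedown = 1646094833;   /* normally passed in by dlm_controld    */

    /* time of the last completed fencing of this node, 0 if none recorded */
    time_t fenced_at = stonith_api_time(nodeid, NULL, false);

    if (fenced_at > nodedown) {
        printf("node %u already fenced at %s", (unsigned) nodeid,
               ctime(&fenced_at));
        return 0;
    }

    /* ask pacemaker-fenced to reboot the node; this call waits for the
     * result, which would explain why pid 16307 only reports success at
     * 01:35:55 although the request was made at 01:33:53 */
    int rc = stonith_api_kick(nodeid, NULL, 120 /* timeout [s] */,
                              false /* reboot, not off */);
    fprintf(stderr, "kick returned %d (0 expected on success)\n", rc);
    return (rc == 0) ? 0 : 1;
}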

Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Cluster node h16 will 
be fenced: peer is no longer part of the cluster
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Node h16 is unclean

(so far, so good)
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Scheduling Node h16 
for STONITH
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  notice:  * Fence (reboot) h16 
'peer is no longer part of the cluster'

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Initiating monitor 
operation prm_stonith_sbd_monitor_600000 locally on h18
Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Requesting local 
execution of monitor operation for prm_stonith_sbd on h18
Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Initiating stop 
operation prm_cron_snap_v17_stop_0 on h19
(isn't h18 already acting as DC and initiating resource operations here, while 
h16 is not confirmed fenced yet?)
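
(To cross-check the ordering, the fencing history with its completion 
timestamps can also be read back from pacemaker-fenced. A minimal sketch 
against the public fencing API, assuming the pacemaker devel headers are 
installed; the client name "history-check" is just made up for the example:)

/* Minimal sketch: list pacemaker's fencing history for h16, so the recorded
 * completion time can be compared with the timestamps of the resource
 * actions above. Build e.g. with
 *   gcc history.c $(pkg-config --cflags --libs pacemaker-fencing)
 */
#include <stdio.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    stonith_t *st = stonith_api_new();
    stonith_history_t *hist = NULL;

    if (st == NULL || st->cmds->connect(st, "history-check", NULL) != 0) {
        fprintf(stderr, "could not connect to pacemaker-fenced\n");
        return 1;
    }

    /* fetch history entries for h16 (a NULL node would mean all nodes);
     * 0/pcmk_ok is the expected return on success */
    if (st->cmds->history(st, st_opt_sync_call, "h16", &hist, 120) == 0) {
        for (stonith_history_t *h = hist; h != NULL; h = h->next) {
            /* state st_done means the operation finished successfully */
            printf("%s of %s by %s: state=%d completed=%s",
                   h->action, h->target,
                   (h->delegate != NULL) ? h->delegate : "-",
                   h->state, ctime(&h->completed));
        }
    }

    st->cmds->disconnect(st);
    stonith_api_delete(st);
    return 0;
}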

Mar 01 01:35:23 h18 pacemaker-controld[6980]:  error: Node h18 did not send 
monitor result (via controller) within 90000ms (action timeout plus 
cluster-delay)
Mar 01 01:35:23 h18 pacemaker-controld[6980]:  error: [Action   26]: In-flight 
resource op prm_stonith_sbd_monitor_600000 on h18 (priority: 9900, waiting: 
(null))
Mar 01 01:35:23 h18 pacemaker-controld[6980]:  notice: Transition 0 aborted: 
Action lost
Mar 01 01:35:23 h18 pacemaker-controld[6980]:  warning: rsc_op 26: 
prm_stonith_sbd_monitor_600000 on h18 timed out
(whatever that means)

(now the fencing confirmation follows)
Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation 'reboot' [16309] 
(call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' 
returned: 0 (OK)
Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation 'reboot' 
targeting h16 on h18 for [email protected]: OK
Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) 
kicked: reboot
Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation 'reboot' 
targeting h16 on rksaph18 for [email protected] (merged): OK
Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was terminated 
(reboot) by h18 on behalf of stonith-api.16307: OK
Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Stonith operation 
2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was terminated 
(reboot) by h18 on behalf of pacemaker-controld.6980: OK

(actual recovery happens)
Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on 
device (9,10)

Mar 01 01:35:55 h18 kernel: md: md10: resync done.

(more actions follow)
Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]:  notice: Calculated transition 
1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2

(actions completed)
Mar 01 01:37:18 h18 pacemaker-controld[6980]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE

(pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)

Did I misunderstand something, or does it look like a bug?

Regards,
Ulrich

