On Tue, 2022-03-01 at 10:05 +0100, Ulrich Windl wrote:
> Hi!
> 
> For current SLES15 SP3 I noticed an oddity when the node running the
> DC is going to be fenced:
> It seems that another node is performing recovery operations while
> the old DC is not confirmed to be fenced.
> 
> Like this (116 is the DC):
> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership
> (172.20.16.18:45612) was formed. Members left: 116
> 
> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on
> the DC (h16) is dead
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State
> transition S_NOT_DC -> S_ELECTION
At this point, h16 loses its DC status, so there is no DC.

> 
> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid
> 16307 nodedown time 1646094833 fence_all dlm_stonith
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State
> transition S_ELECTION -> S_INTEGRATION

At this point, a new DC election has completed, and h18 is now the DC
(as indicated by the scheduler messages later).

> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1
> entries for 116/(null): 0 in progress, 0 completed
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-
> api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer
> fencing (reboot) targeting h16
> 
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster
> node h16 will be fenced: peer is no longer part of the cluster
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is
> unclean
> 
> (so far, so good)
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling
> Node h16 for STONITH
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence
> (reboot) h16 'peer is no longer part of the cluster'
> 
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating
> monitor operation prm_stonith_sbd_monitor_600000 locally on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting
> local execution of monitor operation for prm_stonith_sbd on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating
> stop operation prm_cron_snap_v17_stop_0 on h19
> (isn't h18 playing DC already while h16 isn't fenced yet?)
> 
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did
> not send monitor result (via controller) within 90000ms (action
> timeout plus cluster-delay)
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]:
> In-flight resource op prm_stonith_sbd_monitor_600000 on h18
> (priority: 9900, waiting: (null))
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0
> aborted: Action lost
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26:
> prm_stonith_sbd_monitor_600000 on h18 timed out
> (whatever that means)
> 
> (now the fencing confirmation follows)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
> 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with
> device 'prm_stonith_sbd' returned: 0 (OK)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
> 'reboot' targeting h16 on h18 for [email protected]: OK
> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node
> 116/(null) kicked: reboot
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
> 'reboot' targeting h16 on rksaph18 for
> [email protected] (merged): OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was
> terminated (reboot) by h18 on behalf of stonith-api.16307: OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith
> operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was
> terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
> 
> (actual recovery happens)
> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116,
> slot 0) on device (9,10)
> 
> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
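(As an aside, for anyone digging into a timeout like the 90000ms above —
"action timeout plus cluster-delay" — the following is a sketch of diagnostic
commands, assuming a running Pacemaker 2.x cluster; "h16" is the fence target
from the logs above:)

```shell
# Query the cluster-delay property; if unset, Pacemaker uses its
# default (60s), which plus the 30s operation timeout gives 90000ms.
crm_attribute --type crm_config --query --name cluster-delay

# Review the fencing history for the target node, including when the
# reboot was requested and when it completed.
stonith_admin --history h16 --verbose
```

(These only work against a live cluster, so they are shown here purely for
reference.)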
> 
> (more actions follow)
> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated
> transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-
> 87.bz2
> 
> (actions completed)
> Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State
> transition S_TRANSITION_ENGINE -> S_IDLE
> 
> (pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
> 
> Did I misunderstand something, or does it look like a bug?
> 
> Regards,
> Ulrich
> 
-- 
Ken Gaillot <[email protected]>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
