[ClusterLabs] Antw: [EXT] Re: Noticed oddity when DC is going to be fenced
>>> Ken Gaillot wrote on 01.03.2022 at 16:04 in message
<463458e414f7c411eb1107335be6ee9a6e2d13ee.ca...@redhat.com>:
> On Tue, 2022-03-01 at 10:05 +0100, Ulrich Windl wrote:
>> Hi!
>>
>> For current SLES15 SP3 I noticed an oddity when the node running the DC is going to be fenced:
>> It seems that another node is performing recovery operations while the old DC is not confirmed to be fenced.
>>
>> Like this (116 is the DC):
>> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (172.20.16.18:45612) was formed. Members left: 116
>>
>> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service synchronization, ready to provide service.
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC (h16) is dead
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_NOT_DC -> S_ELECTION
>
> At this point, h16 loses its DC status, so there is no DC

I agree from the perspective of h18, but does the DC (h16) know, too?

>
>> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_ELECTION -> S_INTEGRATION
>
> At this point, a new DC election has completed, and h18 is now the DC
> (as indicated by the scheduler messages later)

As above: Does h16 know at this point?

>
>> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer fencing (reboot) targeting h16
>>
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16 will be fenced: peer is no longer part of the cluster
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is unclean
>>
>> (so far, so good)
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node h16 for STONITH
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot) h16 'peer is no longer part of the cluster'
>>
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor operation prm_stonith_sbd_monitor_60 locally on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19
>> (isn't h18 playing DC already while h16 isn't fenced yet?)
>>
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not send monitor result (via controller) within 9ms (action timeout plus cluster-delay)
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]: In-flight resource op prm_stonith_sbd_monitor_60 on h18 (priority: 9900, waiting: (null))
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0 aborted: Action lost
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26: prm_stonith_sbd_monitor_60 on h18 timed out
>> (whatever that means)
>>
>> (now the fencing confirmation follows)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
>> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f (merged): OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
>>
>> (actual recovery happens)
>> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)
>>
>> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
>>
>> (more actions follow)
>> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2
>>
>> (actions completed)
>> Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>>
>> (pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
>>
>> Did I misunderstand something, or does it look like a bug?
>>
>> Regards,
>> Ulrich
>
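A practical aside on the "does h16 know?" question: each controller only has its own view of the membership, so that can really only be answered from h16's own logs. Which node a surviving node currently sees as DC can be checked with the standard Pacemaker CLI tools (assuming they are installed), roughly:

    # Ask the local controller which node it currently considers the DC
    crmadmin -D

    # The status summary also prints a "Current DC:" line
    crm_mon -1 | grep -i "current dc"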
Re: [ClusterLabs] Noticed oddity when DC is going to be fenced
On Tue, 2022-03-01 at 10:05 +0100, Ulrich Windl wrote:
> Hi!
>
> For current SLES15 SP3 I noticed an oddity when the node running the DC is going to be fenced:
> It seems that another node is performing recovery operations while the old DC is not confirmed to be fenced.
>
> Like this (116 is the DC):
> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (172.20.16.18:45612) was formed. Members left: 116
>
> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service synchronization, ready to provide service.
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC (h16) is dead
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_NOT_DC -> S_ELECTION

At this point, h16 loses its DC status, so there is no DC

>
> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_ELECTION -> S_INTEGRATION

At this point, a new DC election has completed, and h18 is now the DC
(as indicated by the scheduler messages later)

> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer fencing (reboot) targeting h16
>
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16 will be fenced: peer is no longer part of the cluster
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is unclean
>
> (so far, so good)
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node h16 for STONITH
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot) h16 'peer is no longer part of the cluster'
>
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor operation prm_stonith_sbd_monitor_60 locally on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19
> (isn't h18 playing DC already while h16 isn't fenced yet?)
>
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not send monitor result (via controller) within 9ms (action timeout plus cluster-delay)
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]: In-flight resource op prm_stonith_sbd_monitor_60 on h18 (priority: 9900, waiting: (null))
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0 aborted: Action lost
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26: prm_stonith_sbd_monitor_60 on h18 timed out
> (whatever that means)
>
> (now the fencing confirmation follows)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f (merged): OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
>
> (actual recovery happens)
> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)
>
> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
>
> (more actions follow)
> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2
>
> (actions completed)
> Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>
> (pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
>
> Did I misunderstand something, or does it look like a bug?
>
> Regards,
> Ulrich
--
Ken Gaillot
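A related note: whether the reboot of the target was really confirmed can also be checked after the fact through the fencer's history; with a reasonably recent stonith_admin that is roughly:

    # Fencing operations recorded for target h16 (requester, device, result)
    stonith_admin --history h16 --verbose

    # Or the history for all targets
    stonith_admin --history '*' --verbose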
[ClusterLabs] Antw: [EXT] Re: Noticed oddity when DC is going to be fenced
>>> Klaus Wenninger wrote on 01.03.2022 at 10:22 in message :
> On Tue, Mar 1, 2022 at 10:05 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> Hi!
>>
>> For current SLES15 SP3 I noticed an oddity when the node running the DC is going to be fenced:
>> It seems that another node is performing recovery operations while the old DC is not confirmed to be fenced.
>>
>> Like this (116 is the DC):
>> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (172.20.16.18:45612) was formed. Members left: 116
>>
>> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service synchronization, ready to provide service.
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC (h16) is dead
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_NOT_DC -> S_ELECTION
>>
>> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_ELECTION -> S_INTEGRATION
>> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer fencing (reboot) targeting h16
>>
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16 will be fenced: peer is no longer part of the cluster
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is unclean
>>
>> (so far, so good)
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node h16 for STONITH
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot) h16 'peer is no longer part of the cluster'
>>
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor operation prm_stonith_sbd_monitor_60 locally on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19
>> (isn't h18 playing DC already while h16 isn't fenced yet?)
>>
>
> Periodic monitors should happen autonomously.
> As long as you don't see pacemaker-schedulerd on h18 calculate a new
> transition recovering the resources, everything should be fine.

So what about "prm_cron_snap_v17_stop_0 on h19"? That's not a monitor.

> And yes, to a certain extent h18 is playing DC (it is elected to be the new DC)
> - somebody has to schedule fencing.

I see the exception for fencing. Is it the quorum that allows h18 to "play DC" while the old DC could still be active on shared storage?
Regards,
Ulrich

>
> Klaus
>
>>
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not send monitor result (via controller) within 9ms (action timeout plus cluster-delay)
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]: In-flight resource op prm_stonith_sbd_monitor_60 on h18 (priority: 9900, waiting: (null))
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0 aborted: Action lost
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26: prm_stonith_sbd_monitor_60 on h18 timed out
>> (whatever that means)
>>
>> (now the fencing confirmation follows)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
>> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f (merged): OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
>>
>> (actual recovery happens)
>> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)
>>
>> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
>>
>> (more actions follow)
>> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2
>>
>> (actions completed)
>> Mar 01 01:37:18 h18 pacemaker-controld[6980]: noti
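On the quorum question above: the DC election itself only requires the surviving nodes to form a membership; quorum, plus confirmed fencing of h16, is what allows the new DC to actually recover resources. The partition's view can be checked with the usual tools, for example:

    # Membership and quorum state as corosync sees it on this node
    corosync-quorumtool -s

    # Pacemaker's overall view (node states, quorum, current DC)
    crm_mon -1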
Re: [ClusterLabs] Noticed oddity when DC is going to be fenced
On Tue, Mar 1, 2022 at 10:05 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> For current SLES15 SP3 I noticed an oddity when the node running the DC is going to be fenced:
> It seems that another node is performing recovery operations while the old DC is not confirmed to be fenced.
>
> Like this (116 is the DC):
> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (172.20.16.18:45612) was formed. Members left: 116
>
> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service synchronization, ready to provide service.
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC (h16) is dead
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_NOT_DC -> S_ELECTION
>
> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_ELECTION -> S_INTEGRATION
> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer fencing (reboot) targeting h16
>
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16 will be fenced: peer is no longer part of the cluster
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is unclean
>
> (so far, so good)
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node h16 for STONITH
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot) h16 'peer is no longer part of the cluster'
>
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor operation prm_stonith_sbd_monitor_60 locally on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19
> (isn't h18 playing DC already while h16 isn't fenced yet?)
>

Periodic monitors should happen autonomously.
As long as you don't see pacemaker-schedulerd on h18 calculate a new transition recovering the resources, everything should be fine.
And yes, to a certain extent h18 is playing DC (it is elected to be the new DC) - somebody has to schedule fencing.
Klaus

>
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not send monitor result (via controller) within 9ms (action timeout plus cluster-delay)
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]: In-flight resource op prm_stonith_sbd_monitor_60 on h18 (priority: 9900, waiting: (null))
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0 aborted: Action lost
> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26: prm_stonith_sbd_monitor_60 on h18 timed out
> (whatever that means)
>
> (now the fencing confirmation follows)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f (merged): OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
>
> (actual recovery happens)
> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)
>
> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
>
> (more actions follow)
> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2
>
> (actions completed)
> Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>
> (pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
>
> Did I misunderstand something, or does it look like a bug?
>
> Regards,
> Ulrich
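If one wants to verify Klaus's point, i.e. that the transition calculated before fencing was confirmed did not already recover h16's resources, the inputs that pacemaker-schedulerd saves can be replayed offline. A rough sketch, using the file named in the log above (the file for transition 0 will have a different number):

    # Re-run the scheduler on a saved input and show the actions it would schedule
    crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-87.bz2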
[ClusterLabs] Noticed oddity when DC is going to be fenced
Hi!

For current SLES15 SP3 I noticed an oddity when the node running the DC is going to be fenced:
It seems that another node is performing recovery operations while the old DC is not confirmed to be fenced.

Like this (116 is the DC):
Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (172.20.16.18:45612) was formed. Members left: 116

Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC (h16) is dead
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_NOT_DC -> S_ELECTION

Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_ELECTION -> S_INTEGRATION
Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed
Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer fencing (reboot) targeting h16

Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16 will be fenced: peer is no longer part of the cluster
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is unclean

(so far, so good)
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node h16 for STONITH
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot) h16 'peer is no longer part of the cluster'

Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor operation prm_stonith_sbd_monitor_60 locally on h18
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19
(isn't h18 playing DC already while h16 isn't fenced yet?)
Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not send monitor result (via controller) within 9ms (action timeout plus cluster-delay)
Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]: In-flight resource op prm_stonith_sbd_monitor_60 on h18 (priority: 9900, waiting: (null))
Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0 aborted: Action lost
Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26: prm_stonith_sbd_monitor_60 on h18 timed out
(whatever that means)

(now the fencing confirmation follows)
Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)
Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot
Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f (merged): OK
Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK

(actual recovery happens)
Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)

Mar 01 01:35:55 h18 kernel: md: md10: resync done.

(more actions follow)
Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2

(actions completed)
Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

(pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)

Did I misunderstand something, or does it look like a bug?

Regards,
Ulrich
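Two things worth checking when interpreting the "action timeout plus cluster-delay" message (the number above is garbled) are the cluster-delay property and the monitor timeout of the fencing resource; with the standard tools that is roughly:

    # Query the cluster-delay property (Pacemaker's default is 60s)
    crm_attribute --type crm_config --name cluster-delay --query

    # Show the full definition of the fencing resource, including its
    # monitor interval and timeout
    crm_resource --resource prm_stonith_sbd --query-xml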
[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")
Hi!

I want to give an update on this issue (support is working on it):
First, I recommend everyone using Xen on a Dell PowerEdge R7415 _not_ to upgrade to SLES15 SP3, as we see about one crash per node per week.
We had one last night, a few minutes after a BtrFS balance had finished.
Meanwhile we also got a few crash dumps (kdump), and the call stacks look like this:

(5.3.18-150300.59.43-default)
[1175886.947081] ocfs2: Mounting device (9,10) on (node 116, slot 1) with ordered data mode.
[1175905.783132] general protection fault: [#1] SMP NOPTI
[1175905.785704] RIP: e030:down_read_trylock+0x18/0x50
[1175905.798982] Call Trace:
[1175905.800305] page_lock_anon_vma_read+0x48/0xe0
[1175905.801659] rmap_walk_anon+0x16c/0x250
[1175905.803021] page_referenced+0xd5/0x170
[1175905.804254] ? rmap_walk_anon+0x250/0x250
[1175905.805377] ? page_get_anon_vma+0x80/0x80
[1175905.806593] shrink_active_list+0x2dd/0x490
[1175905.807920] shrink_lruvec+0x4aa/0x6e0
[1175905.809253] ? free_unref_page_list+0x16f/0x180
[1175905.810460] ? free_unref_page_list+0x16f/0x180
[1175905.811024] ? shrink_node+0x143/0x600
[1175905.811593] shrink_node+0x143/0x600
[1175905.812149] balance_pgdat+0x28a/0x630
[1175905.812716] kswapd+0x14b/0x3d0
[1175905.813263] ? wait_woken+0x80/0x80
[1175905.813793] ? balance_pgdat+0x630/0x630
[1175905.814319] kthread+0x10d/0x130
[1175905.814847] ? kthread_park+0xa0/0xa0
[1175905.815386] ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[27926.595977] BUG: kernel NULL pointer dereference, address: 0007
[27926.597124] #PF: supervisor write access in kernel mode
[27926.598197] #PF: error_code(0x0002) - not-present page
[27926.599265] PGD 0 P4D 0
[27926.600322] Oops: 0002 [#1] SMP NOPTI
[27926.603924] RIP: e030:down_read_trylock+0x18/0x50
[27926.618618] Call Trace:
[27926.619138] page_lock_anon_vma_read+0x48/0xe0
[27926.619665] rmap_walk_anon+0x16c/0x250
[27926.620182] page_referenced+0xd5/0x170
[27926.620705] ? rmap_walk_anon+0x250/0x250
[27926.621214] ? page_get_anon_vma+0x80/0x80
[27926.621772] shrink_active_list+0x2dd/0x490
[27926.622280] balance_pgdat+0x50b/0x630
[27926.622793] kswapd+0x14b/0x3d0
[27926.623308] ? wait_woken+0x80/0x80
[27926.623822] ? balance_pgdat+0x630/0x630
[27926.624329] kthread+0x10d/0x130
[27926.624843] ? kthread_park+0xa0/0xa0
[27926.625577] ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[566428.257264] BTRFS info (device dm-0): scrub: finished on devid 1 with status: 0
[571252.379396] ping[3707]: segfault at 0 ip sp 7fffbae8dc10 error 14 in bash[55a3a9d26000+f1000]
[571252.379410] Code: Bad RIP value.
[571252.948920] BUG: Bad rss-counter state mm:4d568db6 idx:1 val:4
[571262.985375] general protection fault: [#1] SMP NOPTI
[571262.985410] RIP: e030:down_read_trylock+0x18/0x50
[571262.985470] Call Trace:
[571262.985481] page_lock_anon_vma_read+0x48/0xe0
[571262.985487] rmap_walk_anon+0x16c/0x250
[571262.985492] page_referenced+0xd5/0x170
[571262.985496] ? rmap_walk_anon+0x250/0x250
[571262.985500] ? page_get_anon_vma+0x80/0x80
[571262.985506] shrink_active_list+0x2dd/0x490
[571262.985512] shrink_lruvec+0x4aa/0x6e0
[571262.985517] ? free_unref_page_list+0x16f/0x180
[571262.985522] ? free_unref_page_list+0x16f/0x180
[571262.985526] ? shrink_node+0x143/0x600
[571262.985529] shrink_node+0x143/0x600
[571262.985534] balance_pgdat+0x28a/0x630
[571262.985539] kswapd+0x14b/0x3d0
[571262.985544] ? wait_woken+0x80/0x80
[571262.985548] ? balance_pgdat+0x630/0x630
[571262.985553] kthread+0x10d/0x130
[571262.985557] ? kthread_park+0xa0/0xa0
[571262.985563] ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[22707.270890] #PF: supervisor write access in kernel mode
[22707.271539] #PF: error_code(0x0002) - not-present page
[22707.272159] PGD 0 P4D 0
[22707.272786] Oops: 0002 [#1] SMP NOPTI
[22707.274680] RIP: e030:down_read_trylock+0x18/0x50
[22707.282129] Call Trace:
[22707.282682] page_lock_anon_vma_read+0x48/0xe0
[22707.283225] rmap_walk_anon+0x16c/0x250
[22707.283772] ? uncharge_batch+0xe3/0x180
[22707.284307] try_to_unmap+0x93/0xf0
[22707.284835] ? page_remove_rmap+0x2c0/0x2c0
[22707.285375] ? page_not_mapped+0x20/0x20
[22707.285906] ? page_get_anon_vma+0x80/0x80
[22707.286428] ? invalid_mkclean_vma+0x20/0x20
[22707.286974] migrate_pages+0x857/0xb50
[22707.287494] ? isolate_freepages_block+0x370/0x370
[22707.288013] ? move_freelist_tail+0xd0/0xd0
[22707.288538] compact_zone+0x775/0xd90
[22707.289059] kcompactd_do_work+0xfe/0x2a0
[22707.289576] ? xen_load_sp0+0x7a/0x160
[22707.290096] ? __set_cpus_allowed_ptr+0xb5/0x1e0
[22707.290623] ? kcompactd_do_work+0x2a0/0x2a0
[22707.291158] ? kcompactd+0x84/0x1e0
[22707.291703] kcompactd+0x84/0x1e0
[22707.292243] ? wait_woken+0x80/0x80
[22707.292788] kthread+0x10d/0x130
[22707.293329] ? kthread_park+0xa0/0xa0
[22707.293868] ret_from_fork+0x22/0x40

All those dumps happened with BIOS 1.17.0; today I realized that ther