On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
> Hi,
>
> Just found an issue where a node is silently unfenced.
>
> This is a fairly large setup (2 cluster nodes and 8 remote ones) with
> plenty of slowly starting resources (a Lustre filesystem).
>
> Fencing was initiated due to a resource stop failure.
> Lustre often starts very slowly due to internal recovery, and some such
> resources were starting in the same transition in which another resource
> failed to stop.
> As that transition did not finish within the time specified by
> "failure-timeout" (set to 9 min) and was not aborted, the stop failure was
> successfully cleaned.
> There were transition aborts due to attribute changes after that stop
> failure happened, but fencing was not initiated for some reason.
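(For anyone reading along in the archive: failure-timeout is an ordinary
resource meta-attribute. The CIB itself isn't quoted here, but given the
resource named in the logs below and the 9-minute value mentioned above,
the setting would have been created by something roughly like the
following -- a sketch only, so adjust the tool and value to the actual
configuration:

    # resource name taken from the quoted logs, value from the report above
    crm_resource --resource mdt0-es03a-vg --meta \
        --set-parameter failure-timeout --parameter-value 9min

That is the timeout whose expiry later clears the failed stop.)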
Unfortunately, that makes sense with the current code. When the failure
timeout expires, the fail-count node attribute is cleared; that aborts the
transition, the recalculation is based on the new state, and the fencing is
no longer needed. I'll make a note to investigate a fix, but feel free to
file a bug report at bugs.clusterlabs.org for tracking purposes. A rough
sketch of how to inspect and clear the fail count by hand, rather than
relying on failure-timeout, is at the end of this message, after the
quoted logs.

> The node where the stop failed was the DC.
> Pacemaker is 1.1.14-5a6cdd1 (from Fedora, built on EL7)
>
> Here is a log excerpt illustrating the above:
> Apr 19 14:57:56 mds1 pengine[3452]: notice: Move mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 14:58:06 mds1 pengine[3452]: notice: Move mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 14:58:10 mds1 crmd[3453]: notice: Initiating action 81: monitor mdt0-es03a-vg_monitor_0 on mds0
> Apr 19 14:58:11 mds1 crmd[3453]: notice: Initiating action 2993: stop mdt0-es03a-vg_stop_0 on mds1 (local)
> Apr 19 14:58:11 mds1 LVM(mdt0-es03a-vg)[6228]: INFO: Deactivating volume group vg_mdt0_es03a
> Apr 19 14:58:12 mds1 LVM(mdt0-es03a-vg)[6541]: ERROR: Logical volume vg_mdt0_es03a/mdt0 contains a filesystem in use. Can't deactivate volume group "vg_mdt0_es03a" with 1 open logical volume(s)
> [...]
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9939]: ERROR: LVM: vg_mdt0_es03a did not stop correctly
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9943]: WARNING: vg_mdt0_es03a still Active
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9947]: INFO: Retry deactivating volume group vg_mdt0_es03a
> Apr 19 14:58:31 mds1 lrmd[3450]: notice: mdt0-es03a-vg_stop_0:5865:stderr [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> [...]
> Apr 19 14:58:31 mds1 lrmd[3450]: notice: mdt0-es03a-vg_stop_0:5865:stderr [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> Apr 19 14:58:31 mds1 crmd[3453]: notice: Operation mdt0-es03a-vg_stop_0: unknown error (node=mds1, call=324, rc=1, cib-update=1695, confirmed=true)
> Apr 19 14:58:31 mds1 crmd[3453]: notice: mds1-mdt0-es03a-vg_stop_0:324 [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctl
> Apr 19 14:58:31 mds1 crmd[3453]: warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 14:58:31 mds1 crmd[3453]: warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Node mds1 will be fenced because of resource failure(s)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Scheduling Node mds1 for STONITH
> Apr 19 15:02:03 mds1 pengine[3452]: notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:02:03 mds1 pengine[3452]: notice: Recover mdt0-es03a-vg (Started mds1 -> mds0)
> [... many of these ]
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Node mds1 will be fenced because of resource failure(s)
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
> Apr 19 15:07:23 mds1 pengine[3452]: warning: Scheduling Node mds1 for STONITH
> Apr 19 15:07:23 mds1 pengine[3452]: notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:07:23 mds1 pengine[3452]: notice: Recover mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Node mds1 will be fenced because of resource failure(s)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Scheduling Node mds1 for STONITH
> Apr 19 15:07:24 mds1 pengine[3452]: notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:07:24 mds1 pengine[3452]: notice: Recover mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Clearing expired failcount for mdt0-es03a-vg on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Clearing expired failcount for mdt0-es03a-vg on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Clearing expired failcount for mdt0-es03a-vg on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
> Apr 19 15:07:33 mds1 crmd[3453]: notice: Initiating action 2016: monitor mdt0-es03a-vg_monitor_60000 on mds1 (local)
> Apr 19 15:07:33 mds1 crmd[3453]: notice: Transition aborted by deletion of nvpair[@id='status-2-fail-count-mdt0-es03a-vg']: Transient attribute change (cib=0.228.2601, source=abort_unless_down:343, path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-mdt0-es03a-vg'], 0)
> Apr 19 15:10:09 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
> Apr 19 15:12:40 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
>
> Best,
> Vladislav
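As mentioned above, here is a rough sketch of how to look at the fail count
that failure-timeout clears, and how to clean it up explicitly instead of
waiting for the timeout. Node, resource, and attribute names are taken from
the quoted logs; the exact option spelling can differ between versions, so
double-check against the crm_mon, crm_attribute, and crm_resource man pages:

    # show all current fail counts in one shot
    crm_mon --one-shot --failcounts

    # query the transient fail-count attribute the logs show being deleted
    # (status-2-fail-count-mdt0-es03a-vg on node mds1)
    crm_attribute --type status --node mds1 \
        --name fail-count-mdt0-es03a-vg --query

    # clear the failure explicitly once the underlying problem is resolved,
    # rather than relying on failure-timeout
    crm_resource --cleanup --resource mdt0-es03a-vg --node mds1

Cleaning up explicitly clears both the fail count and the failed-operation
history for the resource on that node, so the next scheduler run starts from
an unambiguous state.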
