On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
> Hi,
>
> Just found an issue where a node is silently unfenced.
>
> This is a fairly large setup (2 cluster nodes and 8 remote ones) with
> plenty of slowly starting resources (a Lustre filesystem).
>
> Fencing was initiated due to a resource stop failure.
> Lustre often starts very slowly due to internal recovery, and some such
> resources were starting in the same transition in which another resource
> failed to stop.
> As that transition did not finish within the time specified by
> "failure-timeout" (set to 9 min) and was not aborted, the stop failure was
> successfully cleaned.
> There were transition aborts due to attribute changes after that stop
> failure happened, but fencing was not initiated for some reason.
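(For anyone reading along in the archive: failure-timeout is an ordinary
resource meta-attribute. The CIB itself isn't quoted here, but given the
resource named in the logs below and the 9-minute value mentioned above,
the setting would have been created by something roughly like the
following -- a sketch only, so adjust the tool and value to the actual
configuration:

    # resource name taken from the quoted logs, value from the report above
    crm_resource --resource mdt0-es03a-vg --meta \
        --set-parameter failure-timeout --parameter-value 9min

That is the timeout whose expiry later clears the failed stop.)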
Unfortunately, that makes sense with the current code. When the failure
timeout expires, the fail-count node attribute is cleared; that aborts the
transition, the recalculation is based on the new state, and the fencing is
no longer needed. I'll make a note to investigate a fix, but feel free to
file a bug report at bugs.clusterlabs.org for tracking purposes. A rough
sketch of how to inspect and clear the fail count by hand, rather than
relying on failure-timeout, is at the end of this message, after the
quoted logs.

> The node where the stop failed was the DC.
> Pacemaker is 1.1.14-5a6cdd1 (from Fedora, built on EL7)
>
> Here is a log excerpt illustrating the above:
> Apr 19 14:57:56 mds1 pengine[3452]: notice: Move mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 14:58:06 mds1 pengine[3452]: notice: Move mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 14:58:10 mds1 crmd[3453]: notice: Initiating action 81: monitor mdt0-es03a-vg_monitor_0 on mds0
> Apr 19 14:58:11 mds1 crmd[3453]: notice: Initiating action 2993: stop mdt0-es03a-vg_stop_0 on mds1 (local)
> Apr 19 14:58:11 mds1 LVM(mdt0-es03a-vg)[6228]: INFO: Deactivating volume group vg_mdt0_es03a
> Apr 19 14:58:12 mds1 LVM(mdt0-es03a-vg)[6541]: ERROR: Logical volume vg_mdt0_es03a/mdt0 contains a filesystem in use. Can't deactivate volume group "vg_mdt0_es03a" with 1 open logical volume(s)
> [...]
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9939]: ERROR: LVM: vg_mdt0_es03a did not stop correctly
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9943]: WARNING: vg_mdt0_es03a still Active
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9947]: INFO: Retry deactivating volume group vg_mdt0_es03a
> Apr 19 14:58:31 mds1 lrmd[3450]: notice: mdt0-es03a-vg_stop_0:5865:stderr [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> [...]
> Apr 19 14:58:31 mds1 lrmd[3450]: notice: mdt0-es03a-vg_stop_0:5865:stderr [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> Apr 19 14:58:31 mds1 crmd[3453]: notice: Operation mdt0-es03a-vg_stop_0: unknown error (node=mds1, call=324, rc=1, cib-update=1695, confirmed=true)
> Apr 19 14:58:31 mds1 crmd[3453]: notice: mds1-mdt0-es03a-vg_stop_0:324 [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctl
> Apr 19 14:58:31 mds1 crmd[3453]: warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 14:58:31 mds1 crmd[3453]: warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Node mds1 will be fenced because of resource failure(s)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
> Apr 19 15:02:03 mds1 pengine[3452]: warning: Scheduling Node mds1 for STONITH
> Apr 19 15:02:03 mds1 pengine[3452]: notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:02:03 mds1 pengine[3452]: notice: Recover mdt0-es03a-vg (Started mds1 -> mds0)
> [... many of these ]
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Node mds1 will be fenced because of resource failure(s)
> Apr 19 15:07:22 mds1 pengine[3452]: warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
> Apr 19 15:07:23 mds1 pengine[3452]: warning: Scheduling Node mds1 for STONITH
> Apr 19 15:07:23 mds1 pengine[3452]: notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:07:23 mds1 pengine[3452]: notice: Recover mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Node mds1 will be fenced because of resource failure(s)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
> Apr 19 15:07:24 mds1 pengine[3452]: warning: Scheduling Node mds1 for STONITH
> Apr 19 15:07:24 mds1 pengine[3452]: notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:07:24 mds1 pengine[3452]: notice: Recover mdt0-es03a-vg (Started mds1 -> mds0)
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Clearing expired failcount for mdt0-es03a-vg on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Clearing expired failcount for mdt0-es03a-vg on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Clearing expired failcount for mdt0-es03a-vg on mds1
> Apr 19 15:07:32 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
> Apr 19 15:07:33 mds1 crmd[3453]: notice: Initiating action 2016: monitor mdt0-es03a-vg_monitor_60000 on mds1 (local)
> Apr 19 15:07:33 mds1 crmd[3453]: notice: Transition aborted by deletion of nvpair[@id='status-2-fail-count-mdt0-es03a-vg']: Transient attribute change (cib=0.228.2601, source=abort_unless_down:343, path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-mdt0-es03a-vg'], 0)
> Apr 19 15:10:09 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
> Apr 19 15:12:40 mds1 pengine[3452]: notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
>
> Best,
> Vladislav
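As mentioned above, here is a rough sketch of how to look at the fail count
that failure-timeout clears, and how to clean it up explicitly instead of
waiting for the timeout. Node, resource, and attribute names are taken from
the quoted logs; the exact option spelling can differ between versions, so
double-check against the crm_mon, crm_attribute, and crm_resource man pages:

    # show all current fail counts in one shot
    crm_mon --one-shot --failcounts

    # query the transient fail-count attribute the logs show being deleted
    # (status-2-fail-count-mdt0-es03a-vg on node mds1)
    crm_attribute --type status --node mds1 \
        --name fail-count-mdt0-es03a-vg --query

    # clear the failure explicitly once the underlying problem is resolved,
    # rather than relying on failure-timeout
    crm_resource --cleanup --resource mdt0-es03a-vg --node mds1

Cleaning up explicitly clears both the fail count and the failed-operation
history for the resource on that node, so the next scheduler run starts from
an unambiguous state.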
