Thanks Ken,
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, June 1, 2017 12:04 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >> Hi Ken,
> >>
> >>> -----Original Message-----
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>> Hi,
> >>>>
> >>>> Shouldn't the
> >>>>
> >>>>     cluster-recheck-interval="2m"
> >>>>
> >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>>> and clean the failcounts?
> >>>
> >>> It instructs pacemaker to recalculate whether any actions need to be
> >>> taken (including expiring any failcounts appropriately).
> >>>
> >>>> At the primitive level I also have
> >>>>
> >>>>     migration-threshold="30" failure-timeout="2m"
> >>>>
> >>>> but whenever I have a failure, it remains there forever.
> >>>>
> >>>> What could be causing this?
> >>>>
> >>>> thanks,
> >>>> Attila
> >>>
> >>> Is it a single old failure, or a recurring failure? The failure timeout
> >>> works in a somewhat nonintuitive way. Old failures are not individually
> >>> expired. Instead, all failures of a resource are simultaneously cleared
> >>> if all of them are older than the failure-timeout. So if something keeps
> >>> failing repeatedly (more frequently than the failure-timeout), none of
> >>> the failures will be cleared.
> >>>
> >>> If it's not a repeating failure, something odd is going on.
> >>
> >> It is not a repeating failure.
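[Editor's note] Ken's expiry rule above — all failures of a resource must be older than failure-timeout before any of them are cleared — can be illustrated with a small shell sketch. The timestamps below are made up for illustration; this is not Pacemaker's actual implementation:

```shell
failure_timeout=120            # failure-timeout="2m", in seconds
now=1000                       # hypothetical "current" time
failures="700 800 950"         # hypothetical failure timestamps for one resource

expire=yes
for t in $failures; do
    # 950 is only 50s old, newer than failure-timeout,
    # so the whole set is kept and nothing expires.
    if [ $(( now - t )) -le "$failure_timeout" ]; then
        expire=no
    fi
done
echo "expire all failures: $expire"
```

Here the single recent failure (age 50s) keeps all three on the books, which is why a resource that keeps failing more often than every 2 minutes never has its failures cleared.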
> >> Let's say that a resource fails for whatever action: it will remain in
> >> the failed actions (crm_mon -Af) until I issue a "crm resource cleanup
> >> <resource name>". Even after days or weeks, even though I see in the
> >> logs that the cluster is rechecked every 120 seconds.
> >>
> >> How could I troubleshoot this issue?
> >>
> >> thanks!
> >
> > Ah, I see what you're saying. That's expected behavior.
> >
> > The failure-timeout applies to the failure *count* (which is used for
> > checking against migration-threshold), not the failure *history* (which
> > is used for the status display).
> >
> > The idea is to have it no longer affect the cluster behavior, but still
> > allow an administrator to know that it happened. That's why a manual
> > cleanup is required to clear the history.
>
> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> used for status display.
>
> It works with the current versions. It's possible 1.1.10 had issues with
> that.

[AM] Well, if nothing helps, I will try to upgrade to a more recent version.

> Check the status to see which node is DC, and look at the pacemaker log
> there after the failure occurred. There should be a message about the
> failcount expiring. You can also look at the live CIB and search for
> last_failure to see what is used for the display.
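[Editor's note] Ken's checklist translates roughly to the commands below. This is a sketch: exact option spellings vary between Pacemaker versions, and `jboss_admin2` is simply the resource name from this thread.

```shell
# 1. Find which node is DC, then read that node's pacemaker log.
crm_mon -1 | grep "Current DC"

# 2. Show failed actions plus per-node fail counts (as used in this thread).
crm_mon -Af

# 3. Search the live CIB for the failure history used by the status display.
cibadmin --query | grep last_failure

# 4. Query the fail count for a single resource.
crm_failcount -G -r jboss_admin2

# 5. Manual cleanup, which clears both the count and the displayed history.
crm resource cleanup jboss_admin2
```

The `value` in the resulting `last-failure` nvpair is a Unix epoch timestamp; something like `date -u -d @<value>` converts it to a readable time for comparison against the failure-timeout.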
[AM] In the pacemaker log I see the following lines at every recheck interval:

Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)

If I check the CIB for the failure, I see:

<nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>

<lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>

I really have no clue why this isn't cleared...

> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org