Re: [ClusterLabs] Clearing failed actions

2018-07-09 Thread Ken Gaillot
On Mon, 2018-07-09 at 09:11 +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 06 Jul 2018 10:15:08 -0600
> Casey Allen Shobe  wrote:
> 
> > Hi,
> > 
> > I found a web page which suggested using `crm_resource -P` to clear
> > the Failed Actions.  Although this appears to work, it's not
> > documented in the man page at all.  Is this deprecated, and is there
> > a more correct way to do this?
> 
> -P means "reprobe", so I guess clearing failcounts is a side effect or
> a prerequisite of that, not its only purpose.

In the 1.1 series, -P is a deprecated synonym for --cleanup / -C. The
options clear fail counts and resource operation history (for a
specific resource and/or node if specified with -r and/or -N, otherwise
all).

In the 2.0 series, -P is gone. --refresh / -R now does what cleanup
used to; --cleanup / -C now cleans up only resources that have had
failures. In other words, the old --cleanup and new --refresh clean
resource history, forcing a re-probe, regardless of whether a resource
failed or not, whereas the new --cleanup will skip resources that
didn't have failures. 
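
To illustrate (a minimal sketch of the above; no resource or node is
specified, so these act cluster-wide):

# 1.1 series: wipe resource history and force a re-probe
crm_resource --cleanup        # the deprecated -P did the same

# 2.0 series: the same full wipe and re-probe is now
crm_resource --refresh

# 2.0 series: cleanup now only touches resources with recorded failures
crm_resource --cleanup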

> > Also, is there a way to clear one specific item from the list, or
> > is clearing all of them the only option?
> 
> pcs failcount reset <resource> [node]

With the low-level tools, you can use -r / --resource and/or -N / --node
with crm_resource to limit the clean-up.
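
For instance, with a hypothetical resource "db-ip" and node "node1":

crm_resource --cleanup --resource db-ip --node node1

or, with the short options, crm_resource -C -r db-ip -N node1.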
-- 
Ken Gaillot 


Re: [ClusterLabs] Clearing failed actions

2018-07-09 Thread Jehan-Guillaume de Rorthais
On Fri, 06 Jul 2018 10:15:08 -0600
Casey Allen Shobe  wrote:

> Hi,
> 
> I found a web page which suggested using `crm_resource -P` to clear the
> Failed Actions.  Although this appears to work, it's not documented in
> the man page at all.  Is this deprecated, and is there a more correct
> way to do this?

-P means "reprobe", so I guess clearing failcounts is a side effect or a
prerequisite of that, not its only purpose.

> Also, is there a way to clear one specific item from the list, or is
> clearing all of them the only option?

pcs failcount reset <resource> [node]
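
For example, with a hypothetical resource "asterisk" and node "node1" (in
the pcs versions I'm aware of, the full spelling sits under the "resource"
subcommand):

pcs resource failcount reset asterisk node1
pcs resource failcount show asterisk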


Re: [ClusterLabs] clearing failed actions

2017-06-21 Thread Ken Gaillot
On 06/19/2017 04:54 PM, Attila Megyeri wrote:
> One more thing to add.
> Two almost identical clusters, with the identical asterisk primitive, produce
> a different crm_verify output.  On one cluster it returns no warnings,
> whereas the other one complains:
> 
> On the problematic one:
> 
> crm_verify --live-check -VV
> warning: get_failcount_full:   Setting asterisk.failure_timeout=120 in 
> asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
> Warnings found during check: config may not be valid
> 
> 
> The relevant primitive is in both clusters:
> 
> primitive asterisk ocf:heartbeat:asterisk \
> op monitor interval="10s" timeout="45s" on-fail="restart" \
> op start interval="0" timeout="60s" on-fail="standby" \
> op stop interval="0" timeout="60s" on-fail="block" \
> meta migration-threshold="3" failure-timeout="2m"
> 
> Why is the same configuration valid in one, but not in the other cluster?
> Shall I simply omit the "op stop" line?
> 
> thanks :)
> Attila

Ah, that could explain it.

If a failure occurs when on-fail=block applies, the resource's failure
timeout is disabled. This is partly because the point of on-fail=block
is to allow the administrator to investigate and manually clear the
error, and partly because blocking means nothing was done to recover the
resource, so the failure likely is still present (clearing it would make
on-fail=block similar to on-fail=ignore).

The failure timeout should be ignored only if there's an actual error to
be handled by on-fail=block, which would mean a stop failure in this
case. That could explain why it's valid in one situation, if there are
no stop failures there.

Stop failures default to block without fencing because fencing is the
only way to recover from a stop failure. Configuring fencing and using
on-fail=fence for stop would avoid the issue.
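
For example, a sketch of the primitive from this thread with that change
(this assumes fencing/stonith is actually configured and enabled):

primitive asterisk ocf:heartbeat:asterisk \
op monitor interval="10s" timeout="45s" on-fail="restart" \
op start interval="0" timeout="60s" on-fail="standby" \
op stop interval="0" timeout="60s" on-fail="fence" \
meta migration-threshold="3" failure-timeout="2m"

With on-fail=fence on the stop operation there is no conflict with the
failure-timeout, so the crm_verify warning should go away.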

A future version of pacemaker will allow specifying the failure timeout
separately for different operations, which would allow you to set
failure timeout 0 on stop, and 1m on everything else. But that work
hasn't started yet.

> 
>> -Original Message-
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>> Sent: Monday, June 19, 2017 9:47 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users@clusterlabs.org>; kgail...@redhat.com
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> I did another experiment, even simpler.
>>
>> Created one node, one resource, using pacemaker 1.1.14 on ubuntu.
>>
>> Configured failcount to 1, migration threshold to 2, failure timeout to 1
>> minute.
>>
>> crm_mon:
>>
>> Last updated: Mon Jun 19 19:43:41 2017  Last change: Mon Jun 19
>> 19:37:09 2017 by root via cibadmin on test
>> Stack: corosync
>> Current DC: test (version 1.1.14-70404b0) - partition with quorum
>> 1 node and 1 resource configured
>>
>> Online: [ test ]
>>
>> db-ip-master(ocf::heartbeat:IPaddr2):   Started test
>>
>> Node Attributes:
>> * Node test:
>>
>> Migration Summary:
>> * Node test:
>>db-ip-master: migration-threshold=2 fail-count=1
>>
>> crm verify:
>>
>> crm_verify --live-check -
>> info: validate_with_relaxng:Creating RNG parser context
>> info: determine_online_status:  Node test is online
>> info: get_failcount_full:   db-ip-master has failed 1 times on test
>> info: get_failcount_full:   db-ip-master has failed 1 times on test
>> info: get_failcount_full:   db-ip-master has failed 1 times on test
>> info: get_failcount_full:   db-ip-master has failed 1 times on test
>> info: native_print: db-ip-master(ocf::heartbeat:IPaddr2):   
>> Started test
>> info: get_failcount_full:   db-ip-master has failed 1 times on test
>> info: common_apply_stickiness:  db-ip-master can fail 1 more times on
>> test before being forced off
>> info: LogActions:   Leave   db-ip-master(Started test)
>>
>>
>> crm configure is:
>>
>> node 168362242: test \
>> attributes standby=off
>> primitive db-ip-master IPaddr2 \
>> params lvs_support=true ip=10.9.1.10 cidr_netmask=24
>> broadcast=10.9.1.255 \
>> op start interval=0 timeout=20s on-fail=restart \
>> op monitor interval=20s timeout=20s \
>> op stop interval=0 timeout=20s on-fail=block \
>> meta migration-threshold=2 failure-timeout=1m target-role=Started
>&

Re: [ClusterLabs] clearing failed actions

2017-06-19 Thread Attila Megyeri
One more thing to add.
Two almost identical clusters, with the identical asterisk primitive, produce a
different crm_verify output.  On one cluster it returns no warnings, whereas
the other one complains:

On the problematic one:

crm_verify --live-check -VV
warning: get_failcount_full:   Setting asterisk.failure_timeout=120 in 
asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
Warnings found during check: config may not be valid


The relevant primitive is in both clusters:

primitive asterisk ocf:heartbeat:asterisk \
op monitor interval="10s" timeout="45s" on-fail="restart" \
op start interval="0" timeout="60s" on-fail="standby" \
op stop interval="0" timeout="60s" on-fail="block" \
meta migration-threshold="3" failure-timeout="2m"

Why is the same configuration valid in one, but not in the other cluster?
Shall I simply omit the "op stop" line?

thanks :)
Attila


> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 9:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>; kgail...@redhat.com
> Subject: Re: [ClusterLabs] clearing failed actions
>
> I did another experiment, even simpler.
>
> Created one node, one resource, using pacemaker 1.1.14 on ubuntu.
>
> Configured failcount to 1, migration threshold to 2, failure timeout to 1
> minute.
>
> crm_mon:
>
> Last updated: Mon Jun 19 19:43:41 2017  Last change: Mon Jun 19
> 19:37:09 2017 by root via cibadmin on test
> Stack: corosync
> Current DC: test (version 1.1.14-70404b0) - partition with quorum
> 1 node and 1 resource configured
>
> Online: [ test ]
>
> db-ip-master(ocf::heartbeat:IPaddr2):   Started test
>
> Node Attributes:
> * Node test:
>
> Migration Summary:
> * Node test:
>db-ip-master: migration-threshold=2 fail-count=1
>
> crm verify:
>
> crm_verify --live-check -
> info: validate_with_relaxng:Creating RNG parser context
> info: determine_online_status:  Node test is online
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: native_print: db-ip-master(ocf::heartbeat:IPaddr2):   
> Started test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: common_apply_stickiness:  db-ip-master can fail 1 more times on
> test before being forced off
> info: LogActions:   Leave   db-ip-master(Started test)
>
>
> crm configure is:
>
> node 168362242: test \
> attributes standby=off
> primitive db-ip-master IPaddr2 \
> params lvs_support=true ip=10.9.1.10 cidr_netmask=24
> broadcast=10.9.1.255 \
> op start interval=0 timeout=20s on-fail=restart \
> op monitor interval=20s timeout=20s \
> op stop interval=0 timeout=20s on-fail=block \
> meta migration-threshold=2 failure-timeout=1m target-role=Started
> location loc1 db-ip-master 0: test
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> cluster-recheck-interval=30s \
> symmetric-cluster=false
>
>
>
>
> Corosync log:
>
>
> Jun 19 19:45:07 [331] test   crmd:   notice: do_state_transition:   State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 19:45:07 [330] testpengine: info: process_pe_message:Input 
> has
> not changed since last time, not saving to disk
> Jun 19 19:45:07 [330] testpengine: info: determine_online_status:
> Node test is online
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: native_print:  db-ip-master
> (ocf::heartbeat:IPaddr2):   Started test
> Jun 19 19:45:07 [330] testpengine: info: ge

Re: [ClusterLabs] clearing failed actions

2017-06-19 Thread Attila Megyeri
I did another experiment, even simpler.

Created one node, one resource, using pacemaker 1.1.14 on ubuntu.

Configured failcount to 1, migration threshold to 2, failure timeout to 1 
minute.

crm_mon:

Last updated: Mon Jun 19 19:43:41 2017  Last change: Mon Jun 19 
19:37:09 2017 by root via cibadmin on test
Stack: corosync
Current DC: test (version 1.1.14-70404b0) - partition with quorum
1 node and 1 resource configured

Online: [ test ]

db-ip-master(ocf::heartbeat:IPaddr2):   Started test

Node Attributes:
* Node test:

Migration Summary:
* Node test:
   db-ip-master: migration-threshold=2 fail-count=1

crm verify:

crm_verify --live-check -
info: validate_with_relaxng:Creating RNG parser context
info: determine_online_status:  Node test is online
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: native_print: db-ip-master(ocf::heartbeat:IPaddr2):   Started 
test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: common_apply_stickiness:  db-ip-master can fail 1 more times on 
test before being forced off
info: LogActions:   Leave   db-ip-master(Started test)


crm configure is:

node 168362242: test \
attributes standby=off
primitive db-ip-master IPaddr2 \
params lvs_support=true ip=10.9.1.10 cidr_netmask=24 
broadcast=10.9.1.255 \
op start interval=0 timeout=20s on-fail=restart \
op monitor interval=20s timeout=20s \
op stop interval=0 timeout=20s on-fail=block \
meta migration-threshold=2 failure-timeout=1m target-role=Started
location loc1 db-ip-master 0: test
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
stonith-enabled=false \
cluster-recheck-interval=30s \
symmetric-cluster=false




Corosync log:


Jun 19 19:45:07 [331] test   crmd:   notice: do_state_transition:   State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Jun 19 19:45:07 [330] testpengine: info: process_pe_message:Input 
has not changed since last time, not saving to disk
Jun 19 19:45:07 [330] testpengine: info: determine_online_status:   
Node test is online
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: native_print:  db-ip-master
(ocf::heartbeat:IPaddr2):   Started test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: common_apply_stickiness:   
db-ip-master can fail 1 more times on test before being forced off
Jun 19 19:45:07 [330] testpengine: info: LogActions:Leave   
db-ip-master(Started test)
Jun 19 19:45:07 [330] testpengine:   notice: process_pe_message:
Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
Jun 19 19:45:07 [331] test   crmd: info: do_state_transition:   State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 19:45:07 [331] test   crmd:   notice: run_graph: Transition 34 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Jun 19 19:45:07 [331] test   crmd: info: do_log:FSA: Input 
I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 19:45:07 [331] test   crmd:   notice: do_state_transition:   State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]


I hope someone can help me figure this out :)

Thanks!



> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 7:45 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> Hi Ken,
>
> /sorry for the long text/
>
> I have created a relatively simple setup to localize the issue.
Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
Just as a reminder, my pr

Re: [ClusterLabs] clearing failed actions

2017-06-19 Thread Attila Megyeri
ransition:
Starting PEngine Recheck Timer
Jun 19 17:37:06 [18998] ctmgr   crmd:debug: crm_timer_start:Started 
PEngine Recheck Timer (I_PE_CALC:3ms), src=277



As you can see from the logs, pacemaker does not even try to re-monitor the 
resource that had a failure, or at least I'm not seeing it.
Cluster recheck interval is set to 30 seconds for troubleshooting reasons.

If I execute a

crm resource cleanup db-ip-master

the failure is removed.

Now, am I getting something terribly wrong here?
Or is this simply a bug in 1.1.10?


Thanks,
Attila




> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Wednesday, June 7, 2017 10:14 PM
> To: Attila Megyeri <amegy...@minerva-soft.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > Ken,
> >
> > I noticed something strange, this might be the issue.
> >
> > In some cases, even the manual cleanup does not work.
> >
> > I have a failed action of resource "A" on node "a". DC is node "b".
> >
> > e.g.
> > Failed actions:
> > jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1,
> status=complete, last-rc-change=Thu Jun  1 14:13:36 2017
> >
> >
> > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> happens. Basically the lrmd on "a" is not notified that it should monitor the
> resource.
> >
> >
> > When I execute a "crm resource cleanup A" command on node "a" (where
> the operation failed) , the failed action is cleared properly.
> >
> > Why could this be happening?
> > Which component should be responsible for this? pengine, crmd, lrmd?
>
> The crm shell will send commands to attrd (to clear fail counts) and
> crmd (to clear the resource history), which in turn will record changes
> in the cib.
>
> I'm not sure how crm shell implements it, but crm_resource sends
> individual messages to each node when cleaning up a resource without
> specifying a particular node. You could check the pacemaker log on each
> node to see whether attrd and crmd are receiving those commands, and
> what they do in response.
>
>
> >> -----Original Message-
> >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> >> Sent: Thursday, June 1, 2017 6:57 PM
> >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> >> clustering welcomed <users@clusterlabs.org>
> >> Subject: Re: [ClusterLabs] clearing failed actions
> >>
> >> thanks Ken,
> >>
> >>
> >>
> >>
> >>
> >>> -Original Message-
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Thursday, June 1, 2017 12:04 AM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >>>>> Hi Ken,
> >>>>>
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>>>>> To: users@clusterlabs.org
> >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>>>>
> >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Shouldn't the
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> cluster-recheck-interval="2m"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>> and
> >>>>>>> clean the failcounts?
> >>>>>>
> >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> >>>>>> taken (including expiring any failcounts appropriately).
> >>>>>>
> >>>>>>> At the primitive level I also have a
> >>>>>>>
> >>>>>>>
> >>>>>&

Re: [ClusterLabs] clearing failed actions

2017-06-01 Thread Attila Megyeri
thanks Ken,





> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, June 1, 2017 12:04 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >> Hi Ken,
> >>
> >>
> >>> -Original Message-
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>> Hi,
> >>>>
> >>>>
> >>>>
> >>>> Shouldn't the
> >>>>
> >>>>
> >>>>
> >>>> cluster-recheck-interval="2m"
> >>>>
> >>>>
> >>>>
> >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> and
> >>>> clean the failcounts?
> >>>
> >>> It instructs pacemaker to recalculate whether any actions need to be
> >>> taken (including expiring any failcounts appropriately).
> >>>
> >>>> At the primitive level I also have a
> >>>>
> >>>>
> >>>>
> >>>> migration-threshold="30" failure-timeout="2m"
> >>>>
> >>>>
> >>>>
> >>>> but whenever I have a failure, it remains there forever.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> What could be causing this?
> >>>>
> >>>>
> >>>>
> >>>> thanks,
> >>>>
> >>>> Attila
> >>> Is it a single old failure, or a recurring failure? The failure timeout
> >>> works in a somewhat nonintuitive way. Old failures are not individually
> >>> expired. Instead, all failures of a resource are simultaneously cleared
> >>> if all of them are older than the failure-timeout. So if something keeps
> >>> failing repeatedly (more frequently than the failure-timeout), none of
> >>> the failures will be cleared.
> >>>
> >>> If it's not a repeating failure, something odd is going on.
> >>
> >> It is not a repeating failure. Let's say that a resource fails for whatever
> action, It will remain in the failed actions (crm_mon -Af) until I issue a 
> "crm
> resource cleanup ". Even after days or weeks, even though
> I see in the logs that cluster is rechecked every 120 seconds.
> >>
> >> How could I troubleshoot this issue?
> >>
> >> thanks!
> >
> >
> > Ah, I see what you're saying. That's expected behavior.
> >
> > The failure-timeout applies to the failure *count* (which is used for
> > checking against migration-threshold), not the failure *history* (which
> > is used for the status display).
> >
> > The idea is to have it no longer affect the cluster behavior, but still
> > allow an administrator to know that it happened. That's why a manual
> > cleanup is required to clear the history.
> 
> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> used for status display.
> 
> It works with the current versions. It's possible 1.1.10 had issues with
> that.
> 

Well, if nothing helps I will try to upgrade to a more recent version.



> Check the status to see which node is DC, and look at the pacemaker log
> there after the failure occurred. There should be a message about the
> failcount expiring. You can also look at the live CIB and search for
> last_failure to see what is used for the display.
[AM] 

In the pacemaker log I see at every recheck interval the following lines:

Jun 01 16:54:08 [8700] ctabsws2pengine:  warning: unpack_rsc_op:
Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)

If I check the CIB for the failure I see:





Really have no clue why this isn't cleared...



> 



Re: [ClusterLabs] clearing failed actions

2017-05-31 Thread Ken Gaillot
On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>> Hi Ken,
>>
>>
>>> -Original Message-
>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>> To: users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>> Hi,
>>>>
>>>>
>>>>
>>>> Shouldn't the
>>>>
>>>>
>>>>
>>>> cluster-recheck-interval="2m"
>>>>
>>>>
>>>>
>>>> property instruct pacemaker to recheck the cluster every 2 minutes and
>>>> clean the failcounts?
>>>
>>> It instructs pacemaker to recalculate whether any actions need to be
>>> taken (including expiring any failcounts appropriately).
>>>
>>>> At the primitive level I also have a
>>>>
>>>>
>>>>
>>>> migration-threshold="30" failure-timeout="2m"
>>>>
>>>>
>>>>
>>>> but whenever I have a failure, it remains there forever.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> What could be causing this?
>>>>
>>>>
>>>>
>>>> thanks,
>>>>
>>>> Attila
>>> Is it a single old failure, or a recurring failure? The failure timeout
>>> works in a somewhat nonintuitive way. Old failures are not individually
>>> expired. Instead, all failures of a resource are simultaneously cleared
>>> if all of them are older than the failure-timeout. So if something keeps
>>> failing repeatedly (more frequently than the failure-timeout), none of
>>> the failures will be cleared.
>>>
>>> If it's not a repeating failure, something odd is going on.
>>
>> It is not a repeating failure. Let's say that a resource fails for whatever 
>> action, It will remain in the failed actions (crm_mon -Af) until I issue a 
>> "crm resource cleanup ". Even after days or weeks, even 
>> though I see in the logs that cluster is rechecked every 120 seconds.
>>
>> How could I troubleshoot this issue?
>>
>> thanks!
> 
> 
> Ah, I see what you're saying. That's expected behavior.
> 
> The failure-timeout applies to the failure *count* (which is used for
> checking against migration-threshold), not the failure *history* (which
> is used for the status display).
> 
> The idea is to have it no longer affect the cluster behavior, but still
> allow an administrator to know that it happened. That's why a manual
> cleanup is required to clear the history.

Hmm, I'm wrong there ... failure-timeout does expire the failure history
used for status display.

It works with the current versions. It's possible 1.1.10 had issues with
that.

Check the status to see which node is DC, and look at the pacemaker log
there after the failure occurred. There should be a message about the
failcount expiring. You can also look at the live CIB and search for
last_failure to see what is used for the display.
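
For example, from the DC (illustrative only):

crm_mon -1 | grep "Current DC"
cibadmin --query | grep last_failure

The first shows which node is currently DC; the second dumps the live CIB
and filters for the last_failure operation entries used for the display.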



Re: [ClusterLabs] clearing failed actions

2017-05-31 Thread Ken Gaillot
On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> Hi Ken,
> 
> 
>> -Original Message-
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Tuesday, May 30, 2017 4:32 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>> Hi,
>>>
>>>
>>>
>>> Shouldn't the
>>>
>>>
>>>
>>> cluster-recheck-interval="2m"
>>>
>>>
>>>
>>> property instruct pacemaker to recheck the cluster every 2 minutes and
>>> clean the failcounts?
>>
>> It instructs pacemaker to recalculate whether any actions need to be
>> taken (including expiring any failcounts appropriately).
>>
>>> At the primitive level I also have a
>>>
>>>
>>>
>>> migration-threshold="30" failure-timeout="2m"
>>>
>>>
>>>
>>> but whenever I have a failure, it remains there forever.
>>>
>>>
>>>
>>>
>>>
>>> What could be causing this?
>>>
>>>
>>>
>>> thanks,
>>>
>>> Attila
>> Is it a single old failure, or a recurring failure? The failure timeout
>> works in a somewhat nonintuitive way. Old failures are not individually
>> expired. Instead, all failures of a resource are simultaneously cleared
>> if all of them are older than the failure-timeout. So if something keeps
>> failing repeatedly (more frequently than the failure-timeout), none of
>> the failures will be cleared.
>>
>> If it's not a repeating failure, something odd is going on.
> 
> It is not a repeating failure. Let's say that a resource fails for whatever 
> action, It will remain in the failed actions (crm_mon -Af) until I issue a 
> "crm resource cleanup ". Even after days or weeks, even though 
> I see in the logs that cluster is rechecked every 120 seconds.
> 
> How could I troubleshoot this issue?
> 
> thanks!


Ah, I see what you're saying. That's expected behavior.

The failure-timeout applies to the failure *count* (which is used for
checking against migration-threshold), not the failure *history* (which
is used for the status display).

The idea is to have it no longer affect the cluster behavior, but still
allow an administrator to know that it happened. That's why a manual
cleanup is required to clear the history.
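
One way to see both at once (similar to the crm_mon -Af used elsewhere in
this thread) is a one-shot run that includes the fail counts:

crm_mon -1 --failcounts

The failed-action history appears in the status output, while the
per-resource fail counts appear under the migration summary.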



Re: [ClusterLabs] clearing failed actions

2017-05-30 Thread Attila Megyeri
Hi Ken,


> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, May 30, 2017 4:32 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > Hi,
> >
> >
> >
> > Shouldn't the
> >
> >
> >
> > cluster-recheck-interval="2m"
> >
> >
> >
> > property instruct pacemaker to recheck the cluster every 2 minutes and
> > clean the failcounts?
> 
> It instructs pacemaker to recalculate whether any actions need to be
> taken (including expiring any failcounts appropriately).
> 
> > At the primitive level I also have a
> >
> >
> >
> > migration-threshold="30" failure-timeout="2m"
> >
> >
> >
> > but whenever I have a failure, it remains there forever.
> >
> >
> >
> >
> >
> > What could be causing this?
> >
> >
> >
> > thanks,
> >
> > Attila
> Is it a single old failure, or a recurring failure? The failure timeout
> works in a somewhat nonintuitive way. Old failures are not individually
> expired. Instead, all failures of a resource are simultaneously cleared
> if all of them are older than the failure-timeout. So if something keeps
> failing repeatedly (more frequently than the failure-timeout), none of
> the failures will be cleared.
> 
> If it's not a repeating failure, something odd is going on.

It is not a repeating failure. Let's say that a resource fails for whatever
action; it will remain in the failed actions (crm_mon -Af) until I issue a "crm
resource cleanup ". Even after days or weeks, even though I see
in the logs that the cluster is rechecked every 120 seconds.

How could I troubleshoot this issue?

thanks!


> 



Re: [ClusterLabs] clearing failed actions

2017-05-30 Thread Ken Gaillot
On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> Hi,
> 
>  
> 
> Shouldn’t the 
> 
>  
> 
> cluster-recheck-interval="2m"
> 
>  
> 
> property instruct pacemaker to recheck the cluster every 2 minutes and
> clean the failcounts?

It instructs pacemaker to recalculate whether any actions need to be
taken (including expiring any failcounts appropriately).

> At the primitive level I also have a
> 
>  
> 
> migration-threshold="30" failure-timeout="2m"
> 
>  
> 
> but whenever I have a failure, it remains there forever.
> 
>  
> 
>  
> 
> What could be causing this?
> 
>  
> 
> thanks,
> 
> Attila
Is it a single old failure, or a recurring failure? The failure timeout
works in a somewhat nonintuitive way. Old failures are not individually
expired. Instead, all failures of a resource are simultaneously cleared
if all of them are older than the failure-timeout. So if something keeps
failing repeatedly (more frequently than the failure-timeout), none of
the failures will be cleared.

If it's not a repeating failure, something odd is going on.
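
To make that concrete: with failure-timeout=2m, if a resource failed at
10:00:00 and again at 10:01:30, nothing expires at 10:02:00 because the
second failure is not yet two minutes old; only once both failures are
more than two minutes old (some time after 10:03:30, at the next recheck)
are they cleared together.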
