I did another, even simpler experiment: one node, one resource, using Pacemaker 1.1.14 on Ubuntu.
Configured failcount to 1, migration threshold to 2, failure timeout to 1 minute.

crm_mon:

Last updated: Mon Jun 19 19:43:41 2017
Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test
Stack: corosync
Current DC: test (version 1.1.14-70404b0) - partition with quorum
1 node and 1 resource configured

Online: [ test ]

db-ip-master    (ocf::heartbeat:IPaddr2):       Started test

Node Attributes:
* Node test:

Migration Summary:
* Node test:
   db-ip-master: migration-threshold=2 fail-count=1

crm verify:

crm_verify --live-check -VVVV
info: validate_with_relaxng: Creating RNG parser context
info: determine_online_status: Node test is online
info: get_failcount_full: db-ip-master has failed 1 times on test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
info: LogActions: Leave db-ip-master (Started test)

crm configure is:

node 168362242: test \
        attributes standby=off
primitive db-ip-master IPaddr2 \
        params lvs_support=true ip=10.9.1.10 cidr_netmask=24 broadcast=10.9.1.255 \
        op start interval=0 timeout=20s on-fail=restart \
        op monitor interval=20s timeout=20s \
        op stop interval=0 timeout=20s on-fail=block \
        meta migration-threshold=2 failure-timeout=1m target-role=Started
location loc1 db-ip-master 0: test
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        stonith-enabled=false \
        cluster-recheck-interval=30s \
        symmetric-cluster=false

Corosync log:

Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 19:45:07 [330] test pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jun 19 19:45:07 [330] test pengine: info: determine_online_status: Node test is online
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
Jun 19 19:45:07 [330] test pengine: info: LogActions: Leave db-ip-master (Started test)
Jun 19 19:45:07 [330] test pengine: notice: process_pe_message: Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
Jun 19 19:45:07 [331] test crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 19:45:07 [331] test crmd: notice: run_graph: Transition 34 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Jun 19 19:45:07 [331] test crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
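
For reference, this is roughly how I exercise the expiry (just a sketch; the wait is arbitrary, anything longer than failure-timeout plus cluster-recheck-interval should do, and I am assuming crm_mon and cibadmin behave the same way on 1.1.14):

    crm_failcount -N test -r db-ip-master -v 1                   # inject a fail count of 1
    sleep 120                                                    # wait past failure-timeout (1m) + recheck (30s)
    crm_mon -1 -Af                                               # Migration Summary still shows fail-count=1
    cibadmin -Q -o status | grep -i 'fail-count\|last-failure'   # the attribute is still in the CIB
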
I hope someone can help me figure this out :)

Thanks!

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 7:45 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> Hi Ken,
>
> /sorry for the long text/
>
> I have created a relatively simple setup to localize the issue.
> Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
> Just as a reminder, my primary issue is that on cluster recheck intervals, the
> failcounts are not cleared.
>
> I simulated a failure with:
>
> crm_failcount -N ctdb1 -r db-ip-master -v 1
>
>
> crm_mon shows:
>
> Last updated: Mon Jun 19 17:34:35 2017
> Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
> Stack: corosync
> Current DC: ctmgr (168362243) - partition with quorum
> Version: 1.1.10-42f2063
> 3 Nodes configured
> 4 Resources configured
>
>
> Online: [ ctdb1 ctdb2 ctmgr ]
>
> db-ip-master   (ocf::heartbeat:IPaddr2):   Started ctdb1
> db-ip-slave    (ocf::heartbeat:IPaddr2):   Started ctdb2
> Master/Slave Set: mysql [db-mysql]
>     Masters: [ ctdb1 ]
>     Slaves: [ ctdb2 ]
>
> Node Attributes:
> * Node ctdb1:
>     + master-db-mysql : 3601
>     + readable : 1
> * Node ctdb2:
>     + master-db-mysql : 3600
>     + readable : 1
> * Node ctmgr:
>
> Migration summary:
> * Node ctmgr:
> * Node ctdb1:
>    db-ip-master: migration-threshold=1000000 fail-count=1
> * Node ctdb2:
>
>
>
> When I check the pacemaker log on the DC, I see the following:
>
> Jun 19 17:37:06 [18998] ctmgr crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: All 3 cluster nodes are eligible to run resources.
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke: Query 231: Requesting the current CIB: S_POLICY_ENGINE
> Jun 19 17:37:06 [18994] ctmgr cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/231, version=0.12.9)
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke_callback: Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884, quorate=1
> Jun 19 17:37:06 [18997] ctmgr pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH timeout: 60000
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH of failed nodes is disabled
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Stop all active resources: false
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Default stickiness: 0
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_domains: Unpacking domains
> Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctmgr is online
> Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb1 is online
> Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb2 is online
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctmgr to db-mysql:0
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb1 to db-mysql:0
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected value: 7 (not running)
> Jun 19 17:37:06 [18997] ctmgr pengine: notice: unpack_rsc_op: Operation monitor found resource db-mysql:0 active in master mode on ctdb1
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb2 to db-mysql:1
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the expected value: 7 (not running)
> Jun 19 17:37:06 [18997] ctmgr pengine: info: unpack_rsc_op: Operation monitor found resource db-mysql:1 active on ctdb2
> Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
> Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
> Jun 19 17:37:06 [18997] ctmgr pengine: info: clone_print: Master/Slave Set: mysql [db-mysql]
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
> Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Masters: [ ctdb1 ]
> Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Slaves: [ ctdb2 ]
> Jun 19 17:37:06 [18997] ctmgr pengine: info: get_failcount_full: db-ip-master has failed 1 times on ctdb1
> Jun 19 17:37:06 [18997] ctmgr pengine: info: common_apply_stickiness: db-ip-master can fail 999999 more times on ctdb1 before being forced off
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-mysql:0
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-mysql:1
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: clone_color: Allocated 2 mysql instances of a possible 2
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:0 master score: 3601
> Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: Promoting db-mysql:0 (Master ctdb1)
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:1 master score: 3600
> Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: mysql: Promoted 1 instances of a possible 1 to master
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-ip-master
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-ip-slave
> Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_create_actions: Creating actions for mysql
> Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-master (Started ctdb1)
> Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-slave (Started ctdb2)
> Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:0 (Master ctdb1)
> Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:1 (Slave ctdb2)
> Jun 19 17:37:06 [18997] ctmgr pengine: notice: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: unpack_graph: Unpacked transition 38: 0 actions in 0 synapses
> Jun 19 17:37:06 [18998] ctmgr crmd: info: do_te_invoke: Processing graph 38 (ref=pe_calc-dc-1497893826-144) derived from /var/lib/pacemaker/pengine/pe-input-16.bz2
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
> Jun 19 17:37:06 [18998] ctmgr crmd: notice: run_graph: Transition 38 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: te_graph_trigger: Transition 38 is now complete
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Processing transition completion in state S_TRANSITION_ENGINE
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Transition 38 status: done - <null>
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 19 17:37:06 [18998] ctmgr crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: Starting PEngine Recheck Timer
> Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started PEngine Recheck Timer (I_PE_CALC:30000ms), src=277
>
>
>
> As you can see from the logs, pacemaker does not even try to re-monitor the
> resource that had a failure, or at least I'm not seeing it.
> Cluster recheck interval is set to 30 seconds for troubleshooting reasons.
>
> If I execute a
>
> crm resource cleanup db-ip-master
>
> the failure is removed.
>
> Now, am I getting something terribly wrong here?
> Or is this simply a bug in 1.1.10?
>
>
> Thanks,
> Attila
>
>
> > -----Original Message-----
> > From: Ken Gaillot [mailto:kgail...@redhat.com]
> > Sent: Wednesday, June 7, 2017 10:14 PM
> > To: Attila Megyeri <amegy...@minerva-soft.com>; Cluster Labs - All topics
> > related to open-source clustering welcomed <users@clusterlabs.org>
> > Subject: Re: [ClusterLabs] clearing failed actions
> >
> > On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > > Ken,
> > >
> > > I noticed something strange, this might be the issue.
> > >
> > > In some cases, even the manual cleanup does not work.
> > >
> > > I have a failed action of resource "A" on node "a". DC is node "b".
> > >
> > > e.g.
> > > Failed actions:
> > > jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017)
> > >
> > >
> > > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> > > happens. Basically the lrmd on "a" is not notified that it should monitor
> > > the resource.
> > >
> > >
> > > When I execute a "crm resource cleanup A" command on node "a" (where
> > > the operation failed), the failed action is cleared properly.
> > >
> > > Why could this be happening?
> > > Which component should be responsible for this? pengine, crmd, lrmd?
> >
> > The crm shell will send commands to attrd (to clear fail counts) and
> > crmd (to clear the resource history), which in turn will record changes
> > in the cib.
> >
> > I'm not sure how crm shell implements it, but crm_resource sends
> > individual messages to each node when cleaning up a resource without
> > specifying a particular node. You could check the pacemaker log on each
> > node to see whether attrd and crmd are receiving those commands, and
> > what they do in response.
> >
> > >> -----Original Message-----
> > >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > >> Sent: Thursday, June 1, 2017 6:57 PM
> > >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> > >> clustering welcomed <users@clusterlabs.org>
> > >> Subject: Re: [ClusterLabs] clearing failed actions
> > >>
> > >> thanks Ken,
> > >>
> > >>> -----Original Message-----
> > >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> > >>> Sent: Thursday, June 1, 2017 12:04 AM
> > >>> To: users@clusterlabs.org
> > >>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>
> > >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> > >>>>> Hi Ken,
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> > >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> > >>>>>> To: users@clusterlabs.org
> > >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>>>>
> > >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> Shouldn't the
> > >>>>>>>
> > >>>>>>> cluster-recheck-interval="2m"
> > >>>>>>>
> > >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes and
> > >>>>>>> clean the failcounts?
> > >>>>>>
> > >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> > >>>>>> taken (including expiring any failcounts appropriately).
> > >>>>>>
> > >>>>>>> At the primitive level I also have a
> > >>>>>>>
> > >>>>>>> migration-threshold="30" failure-timeout="2m"
> > >>>>>>>
> > >>>>>>> but whenever I have a failure, it remains there forever.
> > >>>>>>>
> > >>>>>>> What could be causing this?
> > >>>>>>>
> > >>>>>>> thanks,
> > >>>>>>> Attila
> > >>>>>>
> > >>>>>> Is it a single old failure, or a recurring failure? The failure timeout
> > >>>>>> works in a somewhat nonintuitive way. Old failures are not individually
> > >>>>>> expired. Instead, all failures of a resource are simultaneously cleared
> > >>>>>> if all of them are older than the failure-timeout. So if something keeps
> > >>>>>> failing repeatedly (more frequently than the failure-timeout), none of
> > >>>>>> the failures will be cleared.
> > >>>>>>
> > >>>>>> If it's not a repeating failure, something odd is going on.
> > >>>>>
> > >>>>> It is not a repeating failure. Let's say that a resource fails for whatever
> > >>>>> action: it will remain in the failed actions (crm_mon -Af) until I issue a
> > >>>>> "crm resource cleanup <resource name>". Even after days or weeks, even
> > >>>>> though I see in the logs that the cluster is rechecked every 120 seconds.
> > >>>>>
> > >>>>> How could I troubleshoot this issue?
> > >>>>>
> > >>>>> thanks!
> > >>>>
> > >>>> Ah, I see what you're saying. That's expected behavior.
> > >>>>
> > >>>> The failure-timeout applies to the failure *count* (which is used for
> > >>>> checking against migration-threshold), not the failure *history* (which
> > >>>> is used for the status display).
> > >>>>
> > >>>> The idea is to have it no longer affect the cluster behavior, but still
> > >>>> allow an administrator to know that it happened. That's why a manual
> > >>>> cleanup is required to clear the history.
> > >>> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> > >>> used for status display.
> > >>>
> > >>> It works with the current versions. It's possible 1.1.10 had issues with
> > >>> that.
> > >>
> > >> Well, if nothing helps I will try to upgrade to a more recent version...
> > >>
> > >>> Check the status to see which node is DC, and look at the pacemaker log
> > >>> there after the failure occurred. There should be a message about the
> > >>> failcount expiring. You can also look at the live CIB and search for
> > >>> last_failure to see what is used for the display.
> > >> [AM]
> > >>
> > >> In the pacemaker log I see at every recheck interval the following lines:
> > >>
> > >> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> > >>
> > >> If I check the CIB for the failure I see:
> > >>
> > >> <nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>
> > >> <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> > >>
> > >> Really have no clue why this isn't cleared...
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org