There is an issue where pacemaker does not reschedule a resource that runs inside a docker
container after docker is restarted, even though the pacemaker cluster still shows the
resource as Started. It looks like a bug in pacemaker.

I am confused about what happens when pengine prints these logs (pengine:
notice: check_operation_expiry:       Clearing failure of event_agent on
120_120__fd4 because it expired | event_agent_clear_failcount_0). Does anyone
know what they mean? Thank you very much!


1. pacemaker/corosync version:  1.1.16/2.4.3


2. corosync logs are as follows:

Feb 06 09:52:19 [58629] node-4      attrd:     info: attrd_peer_update: Setting 
event_agent_status[120_120__fd4]: ok -> fail from 120_120__fd4

Feb 06 09:52:19 [58629] node-4      attrd:     info: write_attribute:   Sent 
update 50 with 1 changes for event_agent_status, id=<n/a>, set=(null)

Feb 06 09:52:19 [58629] node-4      attrd:     info: attrd_cib_callback:        
Update 50 for event_agent_status: OK (0)

Feb 06 09:52:19 [58629] node-4      attrd:     info: attrd_cib_callback:        
Update 50 for event_agent_status[120_120__fd4]=fail: OK (0)

Feb 06 09:52:19 [58630] node-4    pengine:   notice: unpack_config:     On loss 
of CCM Quorum: Ignore

Feb 06 09:52:19 [58630] node-4    pengine:     info: determine_online_status:   
Node 120_120__fd4 is online

Feb 06 09:52:19 [58630] node-4    pengine:     info: get_failcount_full:        
event_agent has failed 1 times on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:   notice: check_operation_expiry:    
Clearing failure of event_agent on 120_120__fd4 because it expired | 
event_agent_clear_failcount_0

Feb 06 09:52:19 [58630] node-4    pengine:   notice: unpack_rsc_op:     
Re-initiated expired calculated failure event_agent_monitor_60000 (rc=1, 
magic=0:1;9:18:0:9d1d66d2-2cbe-4182-89f6-c90ba008e2b7) on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: get_failcount_full:        
event_agent has failed 1 times on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:   notice: check_operation_expiry:    
Clearing failure of event_agent on 120_120__fd4 because it expired | 
event_agent_clear_failcount_0

Feb 06 09:52:19 [58630] node-4    pengine:     info: get_failcount_full:        
event_agent has failed 1 times on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:   notice: check_operation_expiry:    
Clearing failure of event_agent on 120_120__fd4 because it expired | 
event_agent_clear_failcount_0

Feb 06 09:52:19 [58630] node-4    pengine:     info: unpack_node_loop:  Node 
4052 is already processed

Feb 06 09:52:19 [58630] node-4    pengine:     info: unpack_node_loop:  Node 
4052 is already processed

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:      
pm_agent        (ocf::heartbeat:pm_agent):      Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:      
event_agent     (ocf::heartbeat:event_agent):   Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:      
nwmonitor_vip   (ocf::heartbeat:IPaddr2):       Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:      
nwmonitor       (ocf::heartbeat:nwmonitor):     Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:        Leave   
pm_agent        (Started 120_120__fd4)

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:        Leave   
event_agent     (Started 120_120__fd4)

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:        Leave   
nwmonitor_vip   (Started 120_120__fd4)

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:        Leave   
nwmonitor       (Started 120_120__fd4)

3. The event_agent resource is marked as failed by attrd, which triggered a pengine
computation, but the PE does not actually do anything about event_agent afterwards. Is this
related to the check_operation_expiry function in unpack.c? I see the following comment in
that function (my rough understanding of the expiry check is sketched after the excerpt):

    /* clearing recurring monitor operation failures automatically
     * needs to be carefully considered */
    if (safe_str_eq(crm_element_value(xml_op, XML_LRM_ATTR_TASK), "monitor") &&
        safe_str_neq(crm_element_value(xml_op, XML_LRM_ATTR_INTERVAL), "0")) {

        /* TODO, in the future we should consider not clearing recurring monitor
         * op failures unless the last action for a resource was a "stop" action.
         * otherwise it is possible that clearing the monitor failure will result
         * in the resource being in an undeterministic state.
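The way I read the "because it expired" message, it is driven by the resource's
failure-timeout meta attribute: once the last failure is older than that timeout, the PE
clears the failcount instead of scheduling recovery. Below is a minimal, self-contained C
sketch of that idea (not the real pacemaker code; the helper name failure_expired and the
timestamps are made up for illustration):

    /* Minimal sketch (not the actual pacemaker source): a past monitor
     * failure is treated as "expired" once it is older than the resource's
     * failure-timeout meta attribute. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static bool
    failure_expired(time_t now, time_t last_failure, long failure_timeout_s)
    {
        if (failure_timeout_s <= 0) {
            return false;               /* no failure-timeout configured */
        }
        return (now - last_failure) >= failure_timeout_s;
    }

    int
    main(void)
    {
        time_t now = time(NULL);
        time_t last_failure = now - 120;   /* monitor failed 2 minutes ago */
        long failure_timeout = 60;         /* e.g. failure-timeout=60s on event_agent */

        if (failure_expired(now, last_failure, failure_timeout)) {
            /* This is the case the PE logs as
             * "Clearing failure of event_agent ... because it expired":
             * the failcount is cleared, so LogActions only prints
             * "Leave event_agent (Started ...)" and no recovery is scheduled. */
            printf("failure expired -> failcount cleared, no recovery scheduled\n");
        }
        return 0;
    }

If that matches what is happening here, the failcount on event_agent is wiped during the
same transition in which attrd reported the failure, which would explain why the PE leaves
the resource alone. I would appreciate confirmation of whether this reading is correct.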