Hi Team,
Problem Statement:

The pcs resource monitor operation timed out after 120000ms, and Pacemaker tried to recover the resource (application) by stopping and starting it on the first occurrence itself. This restart caused a momentary traffic impact in the customer environment. We suspect the monitor timed out because the process-check commands executed by the monitor function hung and were delayed due to system resource unavailability.

We are not able to confirm whether this is exactly what happened; the only information we have is a log line such as "HAZELCAST_occ12_monitor_10000:59159 - terminated with signal 9".

Error messages seen on the customer node:

zrangun@seliius25303[16:40][var/log/pacemaker]$ grep -ia HAZELCAST pacemaker.log
Nov 15 22:42:33 occ12 pacemaker-execd     [2796] (child_timeout_callback)       
warning: HAZELCAST_occ12_monitor_10000 process (PID 57827) timed out
Nov 15 22:42:33 occ12 pacemaker-execd     [2796] (operation_finished)   
warning: HAZELCAST_occ12_monitor_10000:57827 - timed out after 120000ms
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (cancel_recurring_action)      
info: Cancelling ocf operation HAZELCAST_occ12_monitor_10000
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (services_action_cancel)       
info: Terminating in-flight op HAZELCAST_occ12_monitor_10000 (pid 59159) early 
because it was cancelled
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (operation_finished)   info: 
HAZELCAST_occ12_monitor_10000:59159 - terminated with signal 9
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (cancel_recurring_action)      
info: Cancelling ocf operation HAZELCAST_occ12_monitor_10000
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (log_execute)  info:  
executing - rsc:HAZELCAST_occ12 action:stop call_id:391
Nov 15 22:43:41 occ12 pacemaker-execd     [2796] (log_finished)         info: 
finished - rsc:HAZELCAST_occ12 action:stop call_id:391 pid:59476 exit-code:0 
exec-time:53623ms queue-time:0ms
Nov 15 22:43:42 occ12 pacemaker-execd     [2796] (log_execute)  info: executing 
- rsc:HAZELCAST_occ12 action:start call_id:392
Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (operation_finished)   notice: 
HAZELCAST_occ12_start_0:61681:stderr [ touch: cannot touch 
'/usr/var/run/resource-agents/hazelcast-HAZELCAST_occ12.state': No such file or 
directory ]
Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (log_finished)         info: 
finished - rsc:HAZELCAST_occ12 action:start call_id:392 pid:61681 exit-code:1 
exec-time:3525ms queue-time:0ms
Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (log_execute)  info: executing 
- rsc:HAZELCAST_occ12 action:stop call_id:393
Nov 15 22:43:47 occ12 pacemaker-execd     [2796] (log_finished)         info: 
finished - rsc:HAZELCAST_occ12 action:stop call_id:393 pid:64134 exit-code:0 
exec-time:695ms queue-time:0ms
Nov 15 22:43:50 occ12 pacemaker-execd     [2796] (log_execute)  info: executing 
- rsc:HAZELCAST_occ12 action:start call_id:394
Nov 15 22:45:15 occ12 pacemaker-execd     [2796] (log_finished)         info: 
finished - rsc:HAZELCAST_occ12 action:start call_id:394 pid:64410 exit-code:0 
exec-time:85211ms queue-time:1ms
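
As a side observation from the log above: the first restart attempt (call_id 392) exited with code 1 because the state-file path was missing. The failing step in the start action presumably looks roughly like this (a reconstruction from the stderr line above, not our exact RA code):

STATEFILE='/usr/var/run/resource-agents/hazelcast-HAZELCAST_occ12.state'
mkdir -p "$(dirname "$STATEFILE")"   # guard that appears to be missing; our assumption
touch "$STATEFILE"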

We have shared the resource configuration and the dummy_monitor function from our local node below for your reference.

Resource setup:

[root@vmc0137 ~]# pcs resource show HAZELCAST_vmc0137
Resource: HAZELCAST_vmc0137 (class=ocf provider=provider type=HazelCast_RA)
  Meta Attrs: failure-timeout=120s migration-threshold=5 priority=50
  Operations: migrate_from interval=0s timeout=20 
(HAZELCAST_vmc0137-migrate_from-interval-0s)
              migrate_to interval=0s timeout=20 
(HAZELCAST_vmc0137-migrate_to-interval-0s)
              monitor interval=10s on-fail=restart timeout=120s 
(HAZELCAST_vmc0137-monitor-interval-10s)
              reload interval=0s timeout=20 
(HAZELCAST_vmc0137-reload-interval-0s)
              start interval=0s on-fail=restart timeout=120s 
(HAZELCAST_vmc0137-start-interval-0s)
              stop interval=0s timeout=120s (HAZELCAST_vmc0137-stop-interval-0s)
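
For reference, the meta attributes and the monitor operation shown above can be adjusted with pcs commands roughly like these (exact syntax may vary with the pcs version):

pcs resource meta HAZELCAST_vmc0137 failure-timeout=120s migration-threshold=5 priority=50
pcs resource update HAZELCAST_vmc0137 op monitor interval=10s timeout=120s on-fail=restart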

Monitor function:

dummy_monitor() {
        # Monitor _MUST!_ differentiate correctly between running
        # (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
        # That is THREE states, not just yes/no
        #sleep ${OCF_RESKEY_op_sleep}

        # Ask the application's own control script for the component status
        output=$(su - ogw -c "/opt/occ/$PRODUCT_NUMBER/bin/RCControl status SERVER")
        number=$(grep -c "Running as PID" <<< "$output")
        # Fall back to a direct process check
        PID=$(pgrep -f "Dcmg.component.name=SERVER")

        if [ "$number" -eq 1 ] || [ -n "$PID" ]; then
                # Recreate the PID file if it went missing while the process is up
                if [ ! -f "/opt/occ/var/pid/SERVER.$(hostname).pid" ]; then
                        NOW=$(date +"%b %d %H:%M:%S")
                        echo "$PID" > "/opt/occ/var/pid/SERVER.$(hostname).pid"
                        chown ogw:med "/opt/occ/var/pid/SERVER.$(hostname).pid"
                        echo "$NOW Monitor found the SERVER pid file does not exist and is going to create it" >> /var/log/cluster/corosync.log
                fi
                return $OCF_SUCCESS
        fi
        NOW=$(date +"%b %d %H:%M:%S")
        echo "$NOW Monitor found the SERVER component is not running and is going for the restart" >> /var/log/cluster/corosync.log
        return $OCF_NOT_RUNNING
}

We need your support and answers to the questions below to avoid this scenario in the future. Kindly let us know if any additional logs are required.

1) Is there any option available to set fail/retry conditions for the resource monitor, so that Pacemaker initiates recovery only if the monitor fails twice in a row, rather than on the first failure? Please confirm.

2) Is there any other, better option available to avoid timeout issues on the first occurrence itself?

3) We are considering increasing the resource timeout value to 300s and adding retry logic, using the timeout command, to the dummy_monitor function itself in the RA file. In that case the pgrep command would be killed if it gets no response within 30s and retried in the next loop (see the sketch below). Will this solution help us?
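
To make question 3 concrete, here is a minimal sketch of the retry logic we have in mind; the two attempts and the 30s limit are assumptions on our side, not something we have tested:

dummy_monitor() {
        local attempt PID
        for attempt in 1 2; do
                # Kill the process check if it hangs for more than 30s
                # (e.g. under system resource starvation), then retry once
                PID=$(timeout 30s pgrep -f "Dcmg.component.name=SERVER")
                if [ -n "$PID" ]; then
                        return $OCF_SUCCESS
                fi
        done
        return $OCF_NOT_RUNNING
}

With a 300s monitor timeout, two 30s-bounded attempts would leave ample headroom before pacemaker-execd kills the operation.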

Thanks & Regards,
S Sathish S