Hi Team,
Problem Statement:
The pcs resource monitor timed out after 120000ms, and Pacemaker recovered the
resource (application) by stopping and starting it on the very first
occurrence. This restart caused a momentary traffic impact in the customer
environment. We suspect the monitor timed out because the process check
command in the monitor function hung or was delayed due to system resource
unavailability.
We are not able to confirm that this is exactly what happened. The only
information we have is a log line like
HAZELCAST_occ12_monitor_10000:59159 - terminated with signal 9.
Error messages seen on the customer node:
zrangun@seliius25303[16:40][var/log/pacemaker]$ grep -ia HAZELCAST pacemaker.log
Nov 15 22:42:33 occ12 pacemaker-execd [2796] (child_timeout_callback)
warning: HAZELCAST_occ12_monitor_10000 process (PID 57827) timed out
Nov 15 22:42:33 occ12 pacemaker-execd [2796] (operation_finished)
warning: HAZELCAST_occ12_monitor_10000:57827 - timed out after 120000ms
Nov 15 22:42:47 occ12 pacemaker-execd [2796] (cancel_recurring_action)
info: Cancelling ocf operation HAZELCAST_occ12_monitor_10000
Nov 15 22:42:47 occ12 pacemaker-execd [2796] (services_action_cancel)
info: Terminating in-flight op HAZELCAST_occ12_monitor_10000 (pid 59159) early
because it was cancelled
Nov 15 22:42:47 occ12 pacemaker-execd [2796] (operation_finished) info:
HAZELCAST_occ12_monitor_10000:59159 - terminated with signal 9
Nov 15 22:42:47 occ12 pacemaker-execd [2796] (cancel_recurring_action)
info: Cancelling ocf operation HAZELCAST_occ12_monitor_10000
Nov 15 22:42:47 occ12 pacemaker-execd [2796] (log_execute) info:
executing - rsc:HAZELCAST_occ12 action:stop call_id:391
Nov 15 22:43:41 occ12 pacemaker-execd [2796] (log_finished) info:
finished - rsc:HAZELCAST_occ12 action:stop call_id:391 pid:59476 exit-code:0
exec-time:53623ms queue-time:0ms
Nov 15 22:43:42 occ12 pacemaker-execd [2796] (log_execute) info: executing
- rsc:HAZELCAST_occ12 action:start call_id:392
Nov 15 22:43:46 occ12 pacemaker-execd [2796] (operation_finished) notice:
HAZELCAST_occ12_start_0:61681:stderr [ touch: cannot touch
'/usr/var/run/resource-agents/hazelcast-HAZELCAST_occ12.state': No such file or
directory ]
Nov 15 22:43:46 occ12 pacemaker-execd [2796] (log_finished) info:
finished - rsc:HAZELCAST_occ12 action:start call_id:392 pid:61681 exit-code:1
exec-time:3525ms queue-time:0ms
Nov 15 22:43:46 occ12 pacemaker-execd [2796] (log_execute) info: executing
- rsc:HAZELCAST_occ12 action:stop call_id:393
Nov 15 22:43:47 occ12 pacemaker-execd [2796] (log_finished) info:
finished - rsc:HAZELCAST_occ12 action:stop call_id:393 pid:64134 exit-code:0
exec-time:695ms queue-time:0ms
Nov 15 22:43:50 occ12 pacemaker-execd [2796] (log_execute) info: executing
- rsc:HAZELCAST_occ12 action:start call_id:394
Nov 15 22:45:15 occ12 pacemaker-execd [2796] (log_finished) info:
finished - rsc:HAZELCAST_occ12 action:start call_id:394 pid:64410 exit-code:0
exec-time:85211ms queue-time:1ms
For your reference, we have shared the resource configuration and the
dummy_monitor function from a local node.
Resource setup:
[root@vmc0137 ~]# pcs resource show HAZELCAST_vmc0137
Resource: HAZELCAST_vmc0137 (class=ocf provider=provider type=HazelCast_RA)
Meta Attrs: failure-timeout=120s migration-threshold=5 priority=50
Operations: migrate_from interval=0s timeout=20
(HAZELCAST_vmc0137-migrate_from-interval-0s)
migrate_to interval=0s timeout=20
(HAZELCAST_vmc0137-migrate_to-interval-0s)
monitor interval=10s on-fail=restart timeout=120s
(HAZELCAST_vmc0137-monitor-interval-10s)
reload interval=0s timeout=20
(HAZELCAST_vmc0137-reload-interval-0s)
start interval=0s on-fail=restart timeout=120s
(HAZELCAST_vmc0137-start-interval-0s)
stop interval=0s timeout=120s (HAZELCAST_vmc0137-stop-interval-0s)
Monitor function input:
dummy_monitor() {
    # Monitor _MUST!_ differentiate correctly between running
    # (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
    # That is THREE states, not just yes/no
    #sleep ${OCF_RESKEY_op_sleep}
    output=$(su - ogw -c "/opt/occ/$PRODUCT_NUMBER/bin/RCControl status SERVER")
    number=$(grep "Running as PID" <<< "$output" | wc -l)
    PID=`pgrep -f "Dcmg.component.name=SERVER"`
    if [ $number == 1 ] || [ -n "$PID" ] ; then
        if [ ! -f /opt/occ/var/pid/SERVER.`hostname`.pid ]; then
            NOW=$(date +"%b %d %H:%M:%S")
            echo "$PID" > /opt/occ/var/pid/SERVER.`hostname`.pid
            chown ogw:med /opt/occ/var/pid/SERVER.`hostname`.pid
            echo "$NOW Monitor found SERVER pid file not exist and going to create it" >>/var/log/cluster/corosync.log
        fi
        return $OCF_SUCCESS
    fi
    NOW=$(date +"%b %d %H:%M:%S")
    echo "$NOW Monitor found SERVER component is not running and going for the restart" >>/var/log/cluster/corosync.log
    return $OCF_NOT_RUNNING
}
We need your support and answers to avoid the above scenario in future. Kindly
let us know if any additional logs are required.
1) Is there any option to set fail-retry conditions for the resource
monitor, so that Pacemaker initiates recovery only when the monitor fails
twice in a row and does not recover the resource after a single failure?
Please confirm.
2) Is there any other, better option to avoid such timeout issues on the
first occurrence itself?
3) We are thinking of increasing the resource timeout value to 300s and
adding retry logic, using the timeout command, to the dummy_monitor function
in the RA file. In that case the pgrep command would be killed if it does not
respond within 30s, and retried in the next loop iteration. Would this
solution help us?
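To make point 3 concrete, here is a rough sketch of the retry logic we have in
mind. It has not been tested on the cluster. The helper name run_with_retry,
the 3 attempts, the 5s kill-after grace period and the 1s pause between
attempts are assumptions for illustration only; the 30s per-attempt limit is
the value from point 3. Note that once pgrep -f is wrapped in timeout, the
timeout process carries the search pattern in its own command line, so we
assume the pattern would need the bracket form "[D]cmg..." to avoid matching
the wrapper itself.

```shell
#!/bin/bash
# Hypothetical helper (not part of the current RA).
# Usage: run_with_retry ATTEMPTS PER_TRY_SECONDS COMMAND [ARGS...]
# Returns the command's own exit status, or 124 if every attempt timed out.
run_with_retry() {
    local attempts="$1" per_try="$2"
    shift 2
    local try=1 rc
    while [ "$try" -le "$attempts" ]; do
        rc=0
        # timeout(1) sends TERM after per_try seconds, then KILL 5s later
        timeout -k 5 "$per_try" "$@" || rc=$?
        # 124 = timed out; 137 = had to be killed with SIGKILL
        if [ "$rc" -ne 124 ] && [ "$rc" -ne 137 ]; then
            return "$rc"
        fi
        try=$((try + 1))
        sleep 1   # short pause before the next attempt (assumed value)
    done
    return 124
}

# Intended use inside dummy_monitor (commented out here):
#   PID=$(run_with_retry 3 30 pgrep -f "[D]cmg.component.name=SERVER")
#   [ $? -eq 124 ] && return $OCF_ERR_GENERIC   # assumption: how to report it
```

With these values the worst case is 3 x (30s + 5s + 1s) = 108s for the pgrep
check, which stays below the proposed 300s monitor timeout. The RCControl
status call could be wrapped in the same way, in which case its worst case
would have to be added to this budget.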
Thanks & Regards,
S Sathish S