On 09/12/2016 03:00 PM, Alex wrote:
> Hi Klaus,
>
> Thanks for the reply.
>
> I don't have any logs to indicate that was indeed the PID of apache, but
> I believe apache was killed successfully: when I logged on to the server,
> apache wasn't running.

The reason I asked is that apache might already have been dead, with some
other process having taken over its PID. That would explain both why the
monitor failed and why the more graceful ways of stopping failed.

>
> I am running:
> corosync-2.3.2-2
> pacemaker-1.1.10-19
>
> Thanks,
> Alex
>
>
> On Monday, September 12, 2016 1:03 PM, Klaus Wenninger
> <[email protected]> wrote:
>
>
> On 09/12/2016 12:55 PM, Alex wrote:
> > Hi all,
> >
> > I am having a problem with one of our pacemaker clusters that is
> > running in an active-active configuration.
> >
> > Sometimes the Website monitor will time out, triggering an apache
> > restart that fails. That will increase the fail-count to INFINITY for
> > the Website resource and make it unmanaged. I have tried the
> > following changes:
> >
> > pcs property set start-failure-is-fatal=false
> >
> > increasing the stop timeout on the Website resource:
> > pcs resource op add Website stop interval=0s timeout=60s
> >
> > Here is the resource configuration:
> > Resource: Website (class=ocf provider=heartbeat type=apache)
> >  Attributes: configfile=/etc/httpd/conf/httpd.conf
> >              statusurl=http://localhost/server-status
> >  Operations: start on-fail=restart interval=0s timeout=60s
> >              (Website-name-start-interval-0s-on-fail-restart-timeout-60s)
> >              monitor on-fail=restart interval=1min timeout=40s
> >              (Website-name-monitor-interval-1min-on-fail-restart-timeout-40s)
> >              stop interval=0s timeout=60s
> >              (Website-name-stop-interval-0s-timeout-60s)
> >
> > Here is what I see in the logs when it fails:
> > Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]: warning: child_timeout_callback: Website_monitor_60000 process (PID 10352) timed out
> > Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]: warning: operation_finished: Website_monitor_60000:10352 - timed out after 40000ms
> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]: error: process_lrm_event: LRM operation Website_monitor_60000 (32) Timed Out (timeout=40000ms)
> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]: warning: update_failcount: Updating failcount for Website on pcs-wwwclu01-02 after failed monitor: rc=1 (update=value++, time=1473543265)
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Website (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice: attrd_perform_update: Sent update 27: fail-count-Website=1
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Website (1473543265)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: warning: unpack_rsc_op: Processing failed op monitor for Website:0 on pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: notice: LogActions: Recover Website:0#011(Started pcs-wwwclu01-02)
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice: attrd_perform_update: Sent update 30: last-failure-Website=1473543265
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: warning: unpack_rsc_op: Processing failed op monitor for Website:0 on pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: notice: LogActions: Recover Website:0#011(Started pcs-wwwclu01-02)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: warning: unpack_rsc_op: Processing failed op monitor for Website:0 on pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: notice: LogActions: Recover Website:0#011(Started pcs-wwwclu01-02)
> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]: notice: te_rsc_command: Initiating action 2: stop Website_stop_0 on pcs-wwwclu01-02 (local)
> > Sep 10 17:34:25 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Attempting graceful stop of apache PID 3561
> > Sep 10 17:34:55 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Killing apache PID 3561
> > Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache still running (3561). Killing pid failed.
> > Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache children were signalled (SIGTERM)
> > Sep 10 17:35:06 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache children were signalled (SIGHUP)
> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]: notice: process_lrm_event: LRM operation Website_stop_0 (call=34, rc=1, cib-update=3097, confirmed=true) unknown error
> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]: warning: status_from_rc: Action 2 (Website_stop_0) on pcs-wwwclu01-02 failed (target: 0 vs. rc: 1): Error
> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]: warning: update_failcount: Updating failcount for Website on pcs-wwwclu01-02 after failed stop: rc=1 (update=INFINITY, time=1473543307)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Website (INFINITY)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_perform_update: Sent update 32: fail-count-Website=INFINITY
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Website (1473543307)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_perform_update: Sent update 34: last-failure-Website=1473543307
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Website (INFINITY)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning: unpack_rsc_op: Processing failed op stop for Website:0 on pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_perform_update: Sent update 36: fail-count-Website=INFINITY
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Website (1473543307)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice: attrd_perform_update: Sent update 38: last-failure-Website=1473543307
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning: unpack_rsc_op: Processing failed op stop for Website:0 on pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning: common_apply_stickiness: Forcing Website-clone away from pcs-wwwclu01-02 after 1000000 failures (max=1000000)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning: unpack_rsc_op: Processing failed op stop for Website:0 on pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning: common_apply_stickiness: Forcing Website-clone away from pcs-wwwclu01-02 after 1000000 failures (max=1000000)
> >
> > I don't see that pacemaker is waiting for 60 seconds for apache to
> > stop.
>
> .../heartbeat/apache:
> graceful_stop()
> {
>     ...
>     # Try graceful stop for half timeout period if timeout period is present
>     if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
>         tries=$((($OCF_RESKEY_CRM_meta_timeout/1000) / 2))
>     fi
>
> so the 30 seconds from the log are to be expected.
> Why it doesn't terminate within these 30 seconds and
> why escalation to SIGTERM doesn't help either is
> written on another page ...
>
> Do you have logs showing whether 3561 was really the PID of a
> running apache at the time the stop was attempted?
> I don't see the RA (at least the version I have on my
> test cluster) checking anywhere for the running binary
> or the like.
>
> > Has anyone encountered something like this before? Or am I missing
> > something in the configuration?
> >
> > Thank you,
> > Alex

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
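
For anyone following the thread, the two points discussed above can be sketched in a few lines of shell. This is only a minimal illustration, not code from the resource agent: the function names graceful_tries and pid_runs_command are made up for this note, and the process name "httpd" in the final comment is an assumption based on the configfile path in the resource configuration. The first function repeats the half-timeout arithmetic from the quoted graceful_stop() fragment; the second checks whether a PID still belongs to a process with the expected name, which is one way to notice a reused PID.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; neither function is part of
# the ocf:heartbeat:apache resource agent.

# Number of graceful-stop tries for a stop timeout given in milliseconds,
# mirroring the expression quoted from graceful_stop():
#   tries=$((($OCF_RESKEY_CRM_meta_timeout/1000) / 2))
graceful_tries() {
    timeout_ms=$1
    echo $(( (timeout_ms / 1000) / 2 ))
}

# Succeed only if PID $1 currently belongs to a process named $2.
pid_runs_command() {
    pid=$1
    expected=$2
    if [ -r "/proc/$pid/comm" ]; then
        # Linux: short command name of the process as the kernel reports it
        actual=$(cat "/proc/$pid/comm" 2>/dev/null) || return 1
    else
        # Fallback for systems without /proc: ask ps for the command name
        actual=$(ps -p "$pid" -o comm= 2>/dev/null) || return 1
    fi
    # Drop any leading directory so "/usr/sbin/httpd" compares as "httpd"
    actual=${actual##*/}
    [ "$actual" = "$expected" ]
}

graceful_tries 60000    # prints 30

# Assumed usage with the PID from the log:
#   pid_runs_command 3561 httpd || echo "PID 3561 is not httpd"
```

With the configured stop timeout of 60s the first function yields 30, which is consistent with the 30 seconds between "Attempting graceful stop" (17:34:25) and "Killing apache PID 3561" (17:34:55) in the log.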
