Hi Klaus,

"service mycustomprog status" returns fine with no errors; it does not hang.
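One thing worth double-checking: Pacemaker judges an lsb: monitor by the status action's exit code, not by what it prints, so a script can say "is running..." yet still be reported as rc=7 (not running). A minimal sketch with a mocked-up status function (hypothetical, the real init script isn't shown here) of that failure mode:

```shell
# Mock of a buggy LSB status action: the text says "running",
# but the exit code says "not running". Pacemaker only sees the code.
status() {
  echo "mycustomprogram (pid 15657) is running..."
  return 1   # non-zero status exit -> Pacemaker reports the monitor as failed
}
status
echo "exit code: $?"
```

If the real status action behaves like this under the environment lrmd uses (try it under `env -i` with a minimal PATH, which is much sparser than a login shell), that would explain monitor failures while the process is actually up.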
Suresh

On Sun, Jan 1, 2017 at 9:43 PM, Klaus Wenninger <[email protected]> wrote:
> Hi Suresh!
>
> Have you tried lsb-status in a shell?
> Does it show anything interesting or is it hanging?
>
> Regards,
> Klaus
>
> On 12/30/2016 08:45 AM, Suresh Rajagopalan wrote:
> > Cluster running centos 6.8 with pacemaker/corosync. This config was
> > running well for quite some time. All of a sudden we see regular
> > restarts of the monitored process, where corosync thinks it has
> > failed (even though it really has not). I am showing the relevant
> > logs and config below. Any pointers appreciated, as it is not clear
> > why this would occur.
> >
> > Thanks
> > Suresh
> >
> > Dec 28 13:18:20 [2198] a.b.com pengine: info: LogActions: Leave mycustomprog (Started a.b.com)
> > Dec 28 13:22:03 [2199] a.b.com crmd: info: process_lrm_event: Operation mycustomprog_monitor_10000: not running (node=a.b.com, call=29, rc=7, cib-update=1427, confirmed=false)
> > Dec 28 13:22:03 [2199] a.b.com crmd: notice: process_lrm_event: a.b.com-mycustomprog_monitor_10000:29 [ mycustomprogram (pid 15657) is running...\n ]
> > Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/lrm[@id='a.b.com']/lrm_resources/lrm_resource[@id='mycustomprog']: <lrm_rsc_op id="mycustomprog_last_failure_0" operation_key="mycustomprog_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7" transition-magic="0
> > Dec 28 13:22:03 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by mycustomprog_monitor_10000 'create' on a.b.com: Old event (magic=0:7;7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7, cib=0.48.2038786, source=process_graph_event:605, 1)
> > Dec 28 13:22:03 [2199] a.b.com crmd: info: update_failcount: Updating failcount for mycustomprog on a.b.com after failed monitor: rc=7 (update=value++, time=1482931323)
> > Dec 28 13:22:03 [2199] a.b.com crmd: info: process_graph_event: Detected action (462.7) mycustomprog_monitor_10000.29=not running: failed
> > Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mycustomprog (1)
> > Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_perform_update: Sent update 18: fail-count-mycustomprog=1
> > Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']: <nvpair id="status-a.b.com-fail-count-mycustomprog" name="fail-count-mycustomprog" value="1"/>
> > Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mycustomprog (1482931323)
> > Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_perform_update: Sent update 20: last-failure-mycustomprog=1482931323
> > Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']: <nvpair id="status-a.b.com-last-failure-mycustomprog" name="last-failure-mycustomprog" value="1482931323"/>
> > Dec 28 13:22:04 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by status-a.b.com-fail-count-mycustomprog, fail-count-mycustomprog=1: Transient attribute change (create cib=0.48.2038787, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com
> > Dec 28 13:22:04 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by status-a.b.com-last-failure-mycustomprog, last-failure-mycustomprog=1482931323: Transient attribute change (create cib=0.48.2038788, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com
> > Dec 28 13:22:04 [2198] a.b.com pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for mycustomprog on a.b.com: not running (7)
> > Dec 28 13:22:04 [2198] a.b.com pengine: info: native_print: mycustomprog (lsb:mycustomprog): FAILED a.b.com
> > Dec 28 13:22:04 [2198] a.b.com pengine: info: get_failcount_full: mycustomprog has failed 1 times on a.b.com
> > Dec 28 13:22:04 [2198] a.b.com pengine: info: common_apply_stickiness: mycustomprog can fail 999999 more times on a.b.com before being forced off
> > Dec 28 13:22:04 [2198] a.b.com pengine: info: RecurringOp: Start recurring monitor (10s) for mycustomprog on a.b.com
> > Dec 28 13:22:04 [2198] a.b.com pengine: notice: LogActions: Recover mycustomprog (Started a.b.com)
> > Dec 28 13:22:04 [2199] a.b.com crmd: notice: te_rsc_command:
> > Initiating action 5: stop mycustomprog_stop_0 on a.b.com (local)
> > Dec 28 13:22:04 [2196] a.b.com lrmd: info: cancel_recurring_action: Cancelling lsb operation mycustomprog_status_10000
> >
> > pcs config:
> >
> > source settings.rc
> >
> > pcs property set stonith-enabled=false
> > pcs property set no-quorum-policy=ignore
> > pcs resource create ClusterIP2 IPaddr2 ip=$MYVIP
> > cidr_netmask=$NETMASKVIP1
> > pcs resource create ClusterIP3 IPaddr2 ip=$MYVIP2 cidr_netmask=$NETMASKVIP2
> > pcs resource create mycustomprog lsb:mycustomprog op monitor interval="10s"
> > pcs constraint colocation add ClusterIP3 with ClusterIP2 INFINITY
> > pcs constraint colocation add mycustomprog with ClusterIP2 INFINITY
> > pcs property set start-failure-is-fatal=false
> > pcs resource defaults resource-stickiness=100
> > pcs constraint colocation add chkhealth with ClusterIP2 INFINITY
> >
> > _______________________________________________
> > Users mailing list: [email protected]
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
