Hi, yes, by "service" I meant the apache-clone resource.
Maybe I can give a more stripped-down and detailed example.

*Given the following configuration:*

[root@pacemaker-test-1 cluster]# pcs cluster cib --config
<configuration>
  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
      <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
      <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
      <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
      <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
      <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
      <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1628511747"/>
    </cluster_property_set>
  </crm_config>
  <nodes>
    <node id="1" uname="pacemaker-test-1"/>
    <node id="2" uname="pacemaker-test-2"/>
  </nodes>
  <resources>
    <clone id="apache-clone">
      <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
        <instance_attributes id="apache-instance_attributes">
          <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
          <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
        </instance_attributes>
        <operations>
          <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
          <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
          <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
        </operations>
      </primitive>
      <meta_attributes id="apache-meta_attributes">
        <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
        <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
        <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
      </meta_attributes>
    </clone>
  </resources>
  <constraints>
    <rsc_location id="location-apache-clone-pacemaker-test-1-100" node="pacemaker-test-1" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
    <rsc_location id="location-apache-clone-pacemaker-test-2-0" node="pacemaker-test-2" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
  </constraints>
  <rsc_defaults>
    <meta_attributes id="rsc_defaults-options">
      <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
    </meta_attributes>
  </rsc_defaults>
</configuration>

*With the cluster in a running state:*

[root@pacemaker-test-1 cluster]# pcs status
Cluster name: pacemaker-test
Stack: corosync
Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Mon Aug 9 14:45:38 2021
Last change: Mon Aug 9 14:43:14 2021 by hacluster via crmd on pacemaker-test-1

2 nodes configured
2 resource instances configured

Online: [ pacemaker-test-1 pacemaker-test-2 ]

Full list of resources:

 Clone Set: apache-clone [apache]
     Started: [ pacemaker-test-1 pacemaker-test-2 ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

*When simulating an error by killing the apache resource on pacemaker-test-1:*

[root@pacemaker-test-1 ~]# killall httpd

*After a few seconds, the cluster notices that the apache resource is down on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*

[root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:
Aug 09 14:49:30 [10336] pacemaker-test-1 crmd: info: process_lrm_event: Result of monitor operation for apache on pacemaker-test-1: 7 (not running) | call=12 key=apache_monitor_10000 confirmed=false cib-update=22
Aug 09 14:49:30 [10336] pacemaker-test-1 crmd: info: do_lrm_rsc_op: Performing key=3:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68 op=apache_stop_0
Aug 09 14:49:30 [10336] pacemaker-test-1 crmd: info: process_lrm_event: Result of monitor operation for apache on pacemaker-test-1: Cancelled | call=12 key=apache_monitor_10000 confirmed=true
Aug 09 14:49:30 [10336] pacemaker-test-1 crmd: notice: process_lrm_event: Result of stop operation for apache on pacemaker-test-1: 0 (ok) | call=14 key=apache_stop_0 confirmed=true cib-update=24
Aug 09 14:49:32 [10336] pacemaker-test-1 crmd: info: do_lrm_rsc_op: Performing key=5:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68 op=apache_start_0
Aug 09 14:49:33 [10336] pacemaker-test-1 crmd: notice: process_lrm_event: Result of start operation for apache on pacemaker-test-1: 0 (ok) | call=15 key=apache_start_0 confirmed=true cib-update=26
Aug 09 14:49:33 [10336] pacemaker-test-1 crmd: info: do_lrm_rsc_op: Performing key=6:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68 op=apache_monitor_10000
Aug 09 14:49:34 [10336] pacemaker-test-1 crmd: info: process_lrm_event: Result of monitor operation for apache on pacemaker-test-1: 0 (ok) | call=16 key=apache_monitor_10000 confirmed=false cib-update=28

*BUT the cluster also restarts the apache resource on pacemaker-test-2, which it should not do, because the apache instance on pacemaker-test-2 did not crash:*

[root@pacemaker-test-2 cluster]# cat corosync.log | grep crmd:
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: update_failcount: Updating failcount for apache on pacemaker-test-1 after failed monitor: rc=7 (update=value++, time=1628513370)
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: process_graph_event: Detected action (2.6) apache_monitor_10000.12=not running: failed
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: do_te_invoke: Processing graph 3 (ref=pe_calc-dc-1628513370-25) derived from /var/lib/pacemaker/pengine/pe-input-51.bz2
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: notice: abort_transition_graph: Transition aborted by status-1-fail-count-apache.monitor_10000 doing modify fail-count-apache#monitor_10000=2: Transient attribute change | cib=0.33.33 source=abort_unless_down:356 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-apache.monitor_10000'] complete=false
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: abort_transition_graph: Transition aborted by status-1-last-failure-apache.monitor_10000 doing modify last-failure-apache#monitor_10000=1628513370: Transient attribute change | cib=0.33.34 source=abort_unless_down:356 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-apache.monitor_10000'] complete=false
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: notice: run_graph: Transition 3 (Complete=1, Pending=0, Fired=0, Skipped=2, Incomplete=9, Source=/var/lib/pacemaker/pengine/pe-input-51.bz2): Stopped
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1628513370-26) derived from /var/lib/pacemaker/pengine/pe-input-52.bz2
*Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: notice: te_rsc_command: Initiating stop operation apache_stop_0 locally on pacemaker-test-2 | action 4*
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: do_lrm_rsc_op: Performing key=4:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68 op=apache_stop_0
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: notice: te_rsc_command: Initiating stop operation apache_stop_0 on pacemaker-test-1 | action 3
*Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: process_lrm_event: Result of monitor operation for apache on pacemaker-test-2: Cancelled | call=12 key=apache_monitor_10000 confirmed=true*
Aug 09 14:49:30 [18553] pacemaker-test-2 crmd: info: match_graph_event: Action apache_stop_0 (3) confirmed on pacemaker-test-1 (rc=0)
*Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: notice: process_lrm_event: Result of stop operation for apache on pacemaker-test-2: 0 (ok) | call=14 key=apache_stop_0 confirmed=true cib-update=50*
Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: info: match_graph_event: Action apache_stop_0 (4) confirmed on pacemaker-test-2 (rc=0)
Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: notice: te_rsc_command: Initiating start operation apache_start_0 on pacemaker-test-1 | action 5
*Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: notice: te_rsc_command: Initiating start operation apache_start_0 locally on pacemaker-test-2 | action 7*
Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: info: do_lrm_rsc_op: Performing key=7:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68 op=apache_start_0
*Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: notice: process_lrm_event: Result of start operation for apache on pacemaker-test-2: 0 (ok) | call=15 key=apache_start_0 confirmed=true cib-update=52*
Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: info: match_graph_event: Action apache_start_0 (7) confirmed on pacemaker-test-2 (rc=0)
Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: notice: te_rsc_command: Initiating monitor operation apache_monitor_10000 locally on pacemaker-test-2 | action 8
Aug 09 14:49:32 [18553] pacemaker-test-2 crmd: info: do_lrm_rsc_op: Performing key=8:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68 op=apache_monitor_10000
Aug 09 14:49:33 [18553] pacemaker-test-2 crmd: info: process_lrm_event: Result of monitor operation for apache on pacemaker-test-2: 0 (ok) | call=16 key=apache_monitor_10000 confirmed=false cib-update=54
Aug 09 14:49:33 [18553] pacemaker-test-2 crmd: info: match_graph_event: Action apache_monitor_10000 (8) confirmed on pacemaker-test-2 (rc=0)
Aug 09 14:49:33 [18553] pacemaker-test-2 crmd: info: match_graph_event: Action apache_start_0 (5) confirmed on pacemaker-test-1 (rc=0)
Aug 09 14:49:33 [18553] pacemaker-test-2 crmd: notice: te_rsc_command: Initiating monitor operation apache_monitor_10000 on pacemaker-test-1 | action 6
Aug 09 14:49:34 [18553] pacemaker-test-2 crmd: info: match_graph_event: Action apache_monitor_10000 (6) confirmed on pacemaker-test-1 (rc=0)
Aug 09 14:49:34 [18553] pacemaker-test-2 crmd: notice: run_graph: Transition 4 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-52.bz2): Complete
Aug 09 14:49:34 [18553] pacemaker-test-2 crmd: info: do_log: Input I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
Aug 09 14:49:34 [18553] pacemaker-test-2 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
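(Side note, in case it helps with the analysis: transition 4 is the one that schedules the additional stop/start on pacemaker-test-2, and the log above still references its input file. I assume it could be replayed offline on the DC with something along these lines; this is an untested sketch on my side, with the file name simply taken from the log above:

  crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-52.bz2
)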
So my questions are:

- Why is the apache resource being restarted on pacemaker-test-2 in this scenario?
- Is it possible to configure the cluster so that the apache resource on pacemaker-test-2 is not restarted in this scenario?
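If the second point is possible: I am not sure which knob would be the right one. Just to illustrate the direction I have in mind (an untested sketch, not something I have applied to the test cluster above), I was thinking of either changing the monitor's on-fail or limiting the retries via migration-threshold, e.g.:

  pcs resource update apache op monitor interval=10s timeout=20s on-fail=block
  pcs resource meta apache-clone migration-threshold=3

Or is the restart of the healthy instance expected regardless of these settings?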
Regards,

Andreas

On Mon, Aug 9, 2021 at 14:14, Andrei Borzenkov <arvidj...@gmail.com> wrote:

> On Mon, Aug 9, 2021 at 3:07 PM Andreas Janning <andreas.jann...@qaware.de> wrote:
> >
> > Hi,
> >
> > I have just tried your suggestion by adding
> >   <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
> > to the clone configuration.
> > Unfortunately, the behavior stays the same. The service is still restarted on the passive node when crashing it on the active node.
> >
>
> What is "service"? Is it the resource with id=apache-clone in your configuration?
>
> Logs from the DC around the time of the crash would certainly be useful here.
>
> > Regards
> >
> > Andreas
> >
> > On Mon, Aug 9, 2021 at 13:45, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> >>
> >> Hi.
> >> I'd suggest to set your clone meta attribute 'interleaved' to 'true'
> >>
> >> Best,
> >> Vladislav
> >>
> >> On August 9, 2021 1:43:16 PM Andreas Janning <andreas.jann...@qaware.de> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> we recently experienced an outage in our pacemaker cluster and I would like to understand how we can configure the cluster to avoid this problem in the future.
> >>>
> >>> First, our basic setup:
> >>> - CentOS 7
> >>> - Pacemaker 1.1.23
> >>> - Corosync 2.4.5
> >>> - Resource-Agents 4.1.1
> >>>
> >>> Our cluster is composed of multiple active/passive nodes. Each software component runs on two nodes simultaneously and all traffic is routed to the active node via a virtual IP.
> >>> If the active node fails, the passive node grabs the virtual IP and immediately takes over all work of the failed node. Since the software is already up and running on the passive node, there should be virtually no downtime.
> >>> We have tried to achieve this in pacemaker by configuring clone sets for each software component.
> >>>
> >>> Now the problem:
> >>> When a software component fails on the active node, the virtual IP is correctly grabbed by the passive node. BUT the software component is also immediately restarted on the passive node.
> >>> That unfortunately defeats the purpose of the whole setup, since we now have a downtime until the software component is restarted on the passive node, and the restart might even fail and lead to a complete outage.
> >>> After some investigating I now understand that the cloned resource is restarted on all nodes after a monitoring failure because the default "on-fail" of "monitor" is restart. But that is not what I want.
> >>>
> >>> I have created a minimal setup that reproduces the problem:
> >>>
> >>>> <configuration>
> >>>>   <crm_config>
> >>>>     <cluster_property_set id="cib-bootstrap-options">
> >>>>       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
> >>>>       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
> >>>>       <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> >>>>       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
> >>>>       <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
> >>>>       <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
> >>>>     </cluster_property_set>
> >>>>   </crm_config>
> >>>>   <nodes>
> >>>>     <node id="1" uname="active-node"/>
> >>>>     <node id="2" uname="passive-node"/>
> >>>>   </nodes>
> >>>>   <resources>
> >>>>     <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
> >>>>       <instance_attributes id="vip-instance_attributes">
> >>>>         <nvpair id="vip-instance_attributes-ip" name="ip" value="{{infrastructure.virtual_ip}}"/>
> >>>>       </instance_attributes>
> >>>>       <operations>
> >>>>         <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
> >>>>         <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
> >>>>         <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
> >>>>       </operations>
> >>>>     </primitive>
> >>>>     <clone id="apache-clone">
> >>>>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
> >>>>         <instance_attributes id="apache-instance_attributes">
> >>>>           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
> >>>>           <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
> >>>>         </instance_attributes>
> >>>>         <operations>
> >>>>           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
> >>>>           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
> >>>>           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
> >>>>         </operations>
> >>>>       </primitive>
> >>>>       <meta_attributes id="apache-meta_attributes">
> >>>>         <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
> >>>>         <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
> >>>>       </meta_attributes>
> >>>>     </clone>
> >>>>   </resources>
> >>>>   <constraints>
> >>>>     <rsc_location id="location-apache-clone-active-node-100" node="active-node" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
> >>>>     <rsc_location id="location-apache-clone-passive-node-0" node="passive-node" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
> >>>>     <rsc_location id="location-vip-clone-active-node-100" node="active-node" rsc="vip" score="100" resource-discovery="exclusive"/>
> >>>>     <rsc_location id="location-vip-clone-passive-node-0" node="passive-node" rsc="vip" score="0" resource-discovery="exclusive"/>
> >>>>     <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip" score="INFINITY" with-rsc="apache-clone"/>
> >>>>   </constraints>
> >>>>   <rsc_defaults>
> >>>>     <meta_attributes id="rsc_defaults-options">
> >>>>       <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
> >>>>     </meta_attributes>
> >>>>   </rsc_defaults>
> >>>> </configuration>
> >>>
> >>> When this configuration is started, httpd will be running on active-node and passive-node. The VIP runs only on active-node.
> >>> When crashing httpd on active-node (with killall httpd), passive-node immediately grabs the VIP and restarts its own httpd.
> >>>
> >>> How can I change this configuration so that, when the resource fails on active-node:
> >>> - passive-node immediately grabs the VIP (as it does now).
> >>> - active-node tries to restart the failed resource, giving up after x attempts.
> >>> - passive-node does NOT restart the resource.
> >>>
> >>> Regards
> >>>
> >>> Andreas Janning

--
------------------------------
*Best Employers in ITK 2021 - 1st place for QAware*
awarded by Great Place to Work
<https://www.qaware.de/news/platz-1-bei-beste-arbeitgeber-in-der-itk-2021/>
------------------------------
Andreas Janning
Expert Software Engineer

QAware GmbH
Aschauer Straße 32
81549 München, Germany
Mobile +49 160 1492426
andreas.jann...@qaware.de
www.qaware.de
------------------------------
Managing directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
Registration court: München
Commercial register number: HRB 163761
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/