On Mon, Aug 9, 2021 at 6:19 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> On 09.08.2021 16:00, Andreas Janning wrote:
> > Hi,
> >
> > yes, by "service" I meant the apache-clone resource.
> >
> > Maybe I can give a more stripped-down and detailed example:
> >
> > *Given the following configuration:*
> >
> > [root@pacemaker-test-1 cluster]# pcs cluster cib --config
> > <configuration>
> >   <crm_config>
> >     <cluster_property_set id="cib-bootstrap-options">
> >       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
> >       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
> >       <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> >       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
> >       <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
> >       <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
> >       <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1628511747"/>
> >     </cluster_property_set>
> >   </crm_config>
> >   <nodes>
> >     <node id="1" uname="pacemaker-test-1"/>
> >     <node id="2" uname="pacemaker-test-2"/>
> >   </nodes>
> >   <resources>
> >     <clone id="apache-clone">
> >       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
> >         <instance_attributes id="apache-instance_attributes">
> >           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
> >           <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
> >         </instance_attributes>
> >         <operations>
> >           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
> >           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
> >           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
> >         </operations>
> >       </primitive>
> >       <meta_attributes id="apache-meta_attributes">
> >         <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
> >         <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
> >         <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
> >       </meta_attributes>
> >     </clone>
> >   </resources>
> >   <constraints>
> >     <rsc_location id="location-apache-clone-pacemaker-test-1-100" node="pacemaker-test-1" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
> >     <rsc_location id="location-apache-clone-pacemaker-test-2-0" node="pacemaker-test-2" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
> >   </constraints>
> >   <rsc_defaults>
> >     <meta_attributes id="rsc_defaults-options">
> >       <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
> >     </meta_attributes>
> >   </rsc_defaults>
> > </configuration>
> >
> > *With the cluster in a running state:*
> >
> > [root@pacemaker-test-1 cluster]# pcs status
> > Cluster name: pacemaker-test
> > Stack: corosync
> > Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
> > Last updated: Mon Aug  9 14:45:38 2021
> > Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on pacemaker-test-1
> >
> > 2 nodes configured
> > 2 resource instances configured
> >
> > Online: [ pacemaker-test-1 pacemaker-test-2 ]
> >
> > Full list of resources:
> >
> >  Clone Set: apache-clone [apache]
> >      Started: [ pacemaker-test-1 pacemaker-test-2 ]
> >
> > Daemon Status:
> >   corosync: active/disabled
> >   pacemaker: active/disabled
> >   pcsd: active/enabled
> >
> > *When simulating an error by killing the apache resource on pacemaker-test-1:*
> >
> > [root@pacemaker-test-1 ~]# killall httpd
> >
> > *After a few seconds, the cluster notices that the apache resource is down on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*
> >
> > [root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:
>
> Never, ever filter the logs that you show unless you know what you are doing.
>
> You skipped the most interesting part, namely the intended actions, which are:
>
> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:  * Recover    apache:0     ( ha1 -> ha2 )
> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:  * Move       apache:1     ( ha2 -> ha1 )
>
> So pacemaker decides to "swap" the nodes where the current instances are running.

Correct. I've only skimmed this thread, but it looks like:
https://github.com/ClusterLabs/pacemaker/pull/2313
https://bugzilla.redhat.com/show_bug.cgi?id=1931023

I've had some personal things get in the way of following up on the PR for a while.

In my experience, configuring resource-stickiness has worked around the issue.
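A minimal sketch of that stickiness workaround, assuming the pcs 0.9.x syntax shipped with this pacemaker 1.1 cluster (the value 100 is only an example):

    # raise the cluster-wide default stickiness
    pcs resource defaults resource-stickiness=100

    # or set it only on the clone
    pcs resource meta apache-clone resource-stickiness=100

The intent is that the bonus for staying on the current node outweighs small differences in location scores, so the scheduler has less incentive to shuffle instances; how high the value needs to be depends on the constraints in play.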
> Looking at scores
>
> Using the original execution date of: 2021-08-09 12:59:37Z
>
> Current cluster status:
> Online: [ ha1 ha2 ]
>
>  vip    (ocf::pacemaker:Dummy):    Started ha1
>  Clone Set: apache-clone [apache]
>      apache    (ocf::pacemaker:Dummy):    FAILED ha1
>      Started: [ ha2 ]
>
> Allocation scores:
> pcmk__clone_allocate: apache-clone allocation score on ha1: 200
> pcmk__clone_allocate: apache-clone allocation score on ha2: 0
> pcmk__clone_allocate: apache:0 allocation score on ha1: 101
> pcmk__clone_allocate: apache:0 allocation score on ha2: 0
> pcmk__clone_allocate: apache:1 allocation score on ha1: 100
> pcmk__clone_allocate: apache:1 allocation score on ha2: 1
> pcmk__native_allocate: apache:1 allocation score on ha1: 100
> pcmk__native_allocate: apache:1 allocation score on ha2: 1
> pcmk__native_allocate: apache:1 allocation score on ha1: 100
> pcmk__native_allocate: apache:1 allocation score on ha2: 1
> pcmk__native_allocate: apache:0 allocation score on ha1: -INFINITY
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> pcmk__native_allocate: apache:0 allocation score on ha2: 0
> pcmk__native_allocate: vip allocation score on ha1: 100
> pcmk__native_allocate: vip allocation score on ha2: 0
>
> Transition Summary:
>  * Recover    apache:0     ( ha1 -> ha2 )
>  * Move       apache:1     ( ha2 -> ha1 )
>
> No, I do not have an explanation for why pacemaker decides that apache:0
> cannot run on ha1 in this case and so decides to move it to another node.
> It most certainly has something to do with the asymmetric cluster and the
> location scores. If you set the same location score for apache-clone on
> both nodes, pacemaker will recover the failed instance and won't attempt
> to move it. Like:
>
> location location-apache-clone-ha1-100 apache-clone resource-discovery=exclusive 100: ha1
> location location-apache-clone-ha2-100 apache-clone resource-discovery=exclusive 100: ha2
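For the node names from the original configuration, a rough pcs rendering of those equal-score constraints might look like the following (a sketch only; it assumes pcs accepts the resource-discovery option on "constraint location add", and the constraint IDs are arbitrary):

    pcs constraint location add location-apache-clone-pacemaker-test-1-100 apache-clone pacemaker-test-1 100 resource-discovery=exclusive
    pcs constraint location add location-apache-clone-pacemaker-test-2-100 apache-clone pacemaker-test-2 100 resource-discovery=exclusive

That keeps resource discovery limited to the two cluster nodes while giving both of them the same preference for the clone.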
--
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/