08.06.2019 5:12, Harvey Shepherd wrote:
> Thank you for your advice Ken. Sorry for the delayed reply - I was trying out
> a few things and trying to capture extra info. The changes that you suggested
> make sense, and I have incorporated them into my config. However, the
> original issue remains whereby Pacemaker does not attempt to restart the
> failed m_main_system process. I tried setting the migration-threshold of that
> resource to 1, to try to get Pacemaker to force it to be promoted on the
> other node, but this had no effect - the master instance remains "failed" and
> the slave instance remains "running" but is not promoted.

As far as I understand, for a clone to be promoted on a node, that node must
have an explicit master score or a location constraint for the clone. The
master score is normally set by the resource agent.

> Snipped output from crm_mon:
>
> Current DC: primary (version unknown) - partition with quorum
> Last updated: Sat Jun 8 02:04:05 2019
> Last change: Sat Jun 8 01:51:25 2019 by hacluster via crmd on primary
>
> 2 nodes configured
> 26 resources configured
>
> Online: [ primary secondary ]
>
> Active resources:
>
> Clone Set: m_main_system [main_system] (promotable)
> main_system (ocf::main_system-ocf): FAILED secondary
> Slaves: [ primary ]
>
> Migration Summary:
> * Node secondary:
> main_system: migration-threshold=1 fail-count=1 last-failure='Sat Jun 8
> 01:52:08 2019'
>
> Failed Resource Actions:
> * main_system_monitor_10000 on secondary 'unknown error' (1): call=214,
> status=complete, exitreason='',
> last-rc-change='Sat Jun 8 01:52:08 2019', queued=0ms, exec=0ms
>
>
> From the logs I see:
>
> 2019 Jun 8 01:52:09.574 daemon.warning VIRTUAL pacemaker-schedulerd 1131
> warning: Processing failed monitor of main_system:1 on secondary: unknown
> error
> 2019 Jun 8 01:52:09.586 daemon.warning VIRTUAL pacemaker-schedulerd 1131
> warning: Forcing m_main_system away from secondary after 1 failures (max=1)
> 2019 Jun 8 01:52:09.586 daemon.warning VIRTUAL pacemaker-schedulerd 1131
> warning: Forcing m_main_system away from secondary after 1 failures (max=1)
> 2019 Jun 8 01:52:10.692 daemon.warning VIRTUAL pacemaker-controld 1132
> warning: Transition 35 (Complete=33, Pending=0, Fired=0, Skipped=0,
> Incomplete=67, Source=/var/lib/pacemaker/pengine/pe-input-47.bz2): Terminated

Making this file (pe-input-47.bz2) available may help to determine why
Pacemaker decided not to promote the resource.

> 2019 Jun 8 01:52:10.692 daemon.warning VIRTUAL pacemaker-controld 1132
> warning: Transition failed: terminated
>
>
> Do you have any further suggestions?
> For your information I've upgraded
> Pacemaker to 2.0.2, but the behaviour is the same.
>
> Thanks,
> Harvey
> ________________________________________
> From: Users <[email protected]> on behalf of Ken Gaillot
> <[email protected]>
> Sent: Saturday, 1 June 2019 5:40 a.m.
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: EXTERNAL: Re: [ClusterLabs] Pacemaker not reacting as I would expect
> when two resources fail at the same time
>
> On Thu, 2019-05-30 at 23:39 +0000, Harvey Shepherd wrote:
>> Hi All,
>>
>> I'm running Pacemaker 2.0.1 on a cluster containing two nodes; one
>> master and one slave. I have a main master/slave resource
>> (m_main_system), a group of resources that run in active-active mode
>> (active_active - i.e. run on both nodes), and a group that runs in
>> active-disabled mode (snmp_active_disabled - resources only run on
>> the current promoted master). The snmp_active_disabled group is
>> configured to be co-located with the master of m_main_system, so only
>> a failure of the master m_main_system resource can trigger a
>> failover. The constraints specify that m_main_system must be started
>> before snmp_active_disabled.
>>
>> The problem I'm having is that when a resource in the
>> snmp_active_disabled group fails and gets into a constant cycle where
>> Pacemaker tries to restart it, and I then kill m_main_system on the
>> master, then Pacemaker still constantly tries to restart the failed
>> snmp_active_disabled resource and ignores the more important
>> m_main_system process which should be triggering a failover. If I
>> stabilise the snmp_active_disabled resource then Pacemaker finally
>> acts on the m_main_system failure. I hope I've described this well
>> enough, but I've included a cut down form of my CIB config below if
>> it helps!
>>
>> Is this a bug or an error in my config? Perhaps the order in which
>> the groups are defined in the CIB matters despite the constraints?
>> Any help would be gratefully received.
>>
>> Thanks,
>> Harvey
>>
>> <configuration>
>>   <crm_config>
>>     <cluster_property_set id="cib-bootstrap-options">
>>       <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
>>       <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
>>       <nvpair name="have-watchdog" value="false" id="cib-bootstrap-options-have-watchdog"/>
>>       <nvpair name="cluster-name" value="lbcluster" id="cib-bootstrap-options-cluster-name"/>
>>       <nvpair name="start-failure-is-fatal" value="false" id="cib-bootstrap-options-start-failure-is-fatal"/>
>>       <nvpair name="cluster-recheck-interval" value="0s" id="cib-bootstrap-options-cluster-recheck-interval"/>
>>     </cluster_property_set>
>>   </crm_config>
>>   <nodes>
>>     <node id="1" uname="primary"/>
>>     <node id="2" uname="secondary"/>
>>   </nodes>
>>   <resources>
>>     <group id="snmp_active_disabled">
>>       <primitive id="snmpd" class="lsb" type="snmpd">
>>         <operations>
>>           <op name="monitor" interval="10s" id="snmpd-monitor-10s"/>
>>           <op name="start" interval="0" timeout="30s" id="snmpd-start-30s"/>
>>           <op name="stop" interval="0" timeout="30s" id="snmpd-stop-30s"/>
>>         </operations>
>>       </primitive>
>>       <primitive id="snmp-auxiliaries" class="lsb" type="snmp-auxiliaries">
>>         <operations>
>>           <op name="monitor" interval="10s" id="snmp-auxiliaries-monitor-10s"/>
>>           <op name="start" interval="0" timeout="30s" id="snmp-auxiliaries-start-30s"/>
>>           <op name="stop" interval="0" timeout="30s" id="snmp-auxiliaries-stop-30s"/>
>>         </operations>
>>       </primitive>
>>     </group>
>>     <clone id="clone_active_active">
>>       <meta_attributes id="clone_active_active_meta_attributes">
>>         <nvpair id="group-unique" name="globally-unique" value="false"/>
>>       </meta_attributes>
>>       <group id="active_active">
>>         <primitive id="logd" class="lsb" type="logd">
>>           <operations>
>>             <op name="monitor" interval="10s" id="logd-monitor-10s"/>
>>             <op name="start" interval="0" timeout="30s" id="logd-start-30s"/>
>>             <op name="stop" interval="0" timeout="30s" id="logd-stop-30s"/>
>>           </operations>
>>         </primitive>
>>         <primitive id="serviced" class="lsb" type="serviced">
>>           <operations>
>>             <op name="monitor" interval="10s" id="serviced-monitor-10s"/>
>>             <op name="start" interval="0" timeout="30s" id="serviced-start-30s"/>
>>             <op name="stop" interval="0" timeout="30s" id="serviced-stop-30s"/>
>>           </operations>
>>         </primitive>
>>       </group>
>>     </clone>
>>     <master id="m_main_system">
>>       <meta_attributes id="m_main_system-meta_attributes">
>>         <nvpair name="notify" value="true" id="m_main_system-meta_attributes-notify"/>
>>         <nvpair name="clone-max" value="2" id="m_main_system-meta_attributes-clone-max"/>
>>         <nvpair name="promoted-max" value="1" id="m_main_system-meta_attributes-promoted-max"/>
>>         <nvpair name="promoted-node-max" value="1" id="m_main_system-meta_attributes-promoted-node-max"/>
>>       </meta_attributes>
>>       <primitive id="main_system" class="ocf" provider="acme" type="main-system-ocf">
>>         <operations>
>>           <op name="start" interval="0" timeout="120s" id="main_system-start-0"/>
>>           <op name="stop" interval="0" timeout="120s" id="main_system-stop-0"/>
>>           <op name="promote" interval="0" timeout="120s" id="main_system-promote-0"/>
>>           <op name="demote" interval="0" timeout="120s" id="main_system-demote-0"/>
>>           <op name="monitor" interval="10s" timeout="10s" role="Master" id="main_system-monitor-10s"/>
>>           <op name="monitor" interval="11s" timeout="10s" role="Slave" id="main_system-monitor-11s"/>
>>           <op name="notify" interval="0" timeout="60s" id="main_system-notify-0"/>
>>         </operations>
>>       </primitive>
>>     </master>
>>   </resources>
>>   <constraints>
>>     <rsc_colocation id="master_only_snmp_rscs_with_main_system" score="INFINITY" rsc="snmp_active_disabled" with-rsc="m_main_system" with-rsc-role="Master"/>
>>     <rsc_order
>>       id="snmp_active_disabled_after_main_system" kind="Mandatory" first="m_main_system" then="snmp_active_disabled"/>
>
> You want first-action="promote" in the above constraint, otherwise the
> slave being started (or the master being started but not yet promoted)
> is sufficient to start snmp_active_disabled (even though the colocation
> ensures it will only be started on the same node where the master will
> be).
>
> I'm not sure if that's related to your issue, but it's worth trying
> first.
>
>>     <rsc_order id="active_active_after_main_system" kind="Mandatory" first="m_main_system" then="clone_active_active"/>
>
> You may also want to set interleave to true on clone_active_active, if
> you want it to depend only on the local instance of m_main_system, and
> not both instances.
>
>>   </constraints>
>>   <rsc_defaults>
>>     <meta_attributes id="rsc-options">
>>       <nvpair name="resource-stickiness" value="1" id="rsc-options-resource-stickiness"/>
>>       <nvpair name="migration-threshold" value="0" id="rsc-options-migration-threshold"/>
>>       <nvpair name="requires" value="nothing" id="rsc-options-requires"/>
>>     </meta_attributes>
>>   </rsc_defaults>
>> </configuration>
> --
> Ken Gaillot <[email protected]>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
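
For reference, here is a minimal sketch of how Ken's two suggestions quoted in this thread might look in the CIB. It reuses the resource and constraint ids from the quoted configuration. The nvpair id clone_active_active-interleave is a made-up name for illustration, and then-action="start" is my own addition (as far as I know, then-action otherwise defaults to the value of first-action). I have not tested this against the cluster in question.

```xml
<!-- Order constraint: wait for m_main_system to be promoted, not merely
     started, before starting snmp_active_disabled.
     then-action="start" is an assumption added for this sketch. -->
<rsc_order id="snmp_active_disabled_after_main_system" kind="Mandatory"
           first="m_main_system" first-action="promote"
           then="snmp_active_disabled" then-action="start"/>

<!-- Existing meta_attributes block of clone_active_active with interleave
     added; the id of the new nvpair is hypothetical. -->
<meta_attributes id="clone_active_active_meta_attributes">
  <nvpair id="group-unique" name="globally-unique" value="false"/>
  <nvpair id="clone_active_active-interleave" name="interleave" value="true"/>
</meta_attributes>
```

Neither change sets a master score, so on its own this would probably not explain why the remaining slave instance is not promoted.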
