On 06/12/2017 03:24 PM, Dan Ragle wrote:
>
>
> On 6/12/2017 2:03 AM, Klaus Wenninger wrote:
>> On 06/10/2017 05:53 PM, Dan Ragle wrote:
>>>
>>>
>>> On 5/25/2017 5:33 PM, Ken Gaillot wrote:
>>>> On 05/24/2017 12:27 PM, Dan Ragle wrote:
>>>>> I suspect this has been asked before and apologize if so, a google search didn't seem to find anything that was helpful to me ...
>>>>>
>>>>> I'm setting up an active/active two-node cluster and am having an issue where one of my two defined clusterIPs will not return to the other node after it (the other node) has been recovered.
>>>>>
>>>>> I'm running on CentOS 7.3. My resource setups look like this:
>>>>>
>>>>> # cibadmin -Q|grep dc-version
>>>>>     <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.15-11.el7_3.4-e174ec8"/>
>>>>>
>>>>> # pcs resource show PublicIP-clone
>>>>>  Clone: PublicIP-clone
>>>>>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>>>>>   Resource: PublicIP (class=ocf provider=heartbeat type=IPaddr2)
>>>>>    Attributes: ip=75.144.71.38 cidr_netmask=24 nic=bond0
>>>>>    Meta Attrs: resource-stickiness=0
>>>>>    Operations: start interval=0s timeout=20s (PublicIP-start-interval-0s)
>>>>>                stop interval=0s timeout=20s (PublicIP-stop-interval-0s)
>>>>>                monitor interval=30s (PublicIP-monitor-interval-30s)
>>>>>
>>>>> # pcs resource show PrivateIP-clone
>>>>>  Clone: PrivateIP-clone
>>>>>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>>>>>   Resource: PrivateIP (class=ocf provider=heartbeat type=IPaddr2)
>>>>>    Attributes: ip=192.168.1.3 nic=bond1 cidr_netmask=24
>>>>>    Meta Attrs: resource-stickiness=0
>>>>>    Operations: start interval=0s timeout=20s (PrivateIP-start-interval-0s)
>>>>>                stop interval=0s timeout=20s (PrivateIP-stop-interval-0s)
>>>>>                monitor interval=10s timeout=20s (PrivateIP-monitor-interval-10s)
>>>>>
>>>>> # pcs constraint --full | grep -i publicip
>>>>>   start WEB-clone then start PublicIP-clone (kind:Mandatory) (id:order-WEB-clone-PublicIP-clone-mandatory)
>>>>> # pcs constraint --full | grep -i privateip
>>>>>   start WEB-clone then start PrivateIP-clone (kind:Mandatory) (id:order-WEB-clone-PrivateIP-clone-mandatory)
>>>>
>>>> FYI, these constraints cover ordering only. If you also want to be sure that the IPs only start on a node where the web service is functional, then you also need colocation constraints.
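
(For reference, the colocation constraints Ken describes might look something like the following in pcs. This is only a sketch using the resource names shown above; the scores are illustrative, and the last command corresponds to the optional, less-than-INFINITY colocation between the two IP clones that Ken suggests further down in the thread:

# pcs constraint colocation add PublicIP-clone with WEB-clone INFINITY
# pcs constraint colocation add PrivateIP-clone with WEB-clone INFINITY
# pcs constraint colocation add PrivateIP-clone with PublicIP-clone 100
)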
>>>>
>>>>>
>>>>> When I first create the resources, they split across the two nodes as expected/desired:
>>>>>
>>>>>  Clone Set: PublicIP-clone [PublicIP] (unique)
>>>>>      PublicIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>      PublicIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>>>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>>>>>      PrivateIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>      PrivateIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>>>  Clone Set: WEB-clone [WEB]
>>>>>      Started: [ node1-pcs node2-pcs ]
>>>>>
>>>>> I then put the second node in standby:
>>>>>
>>>>> # pcs node standby node2-pcs
>>>>>
>>>>> And the IPs both jump to node1 as expected:
>>>>>
>>>>>  Clone Set: PublicIP-clone [PublicIP] (unique)
>>>>>      PublicIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>      PublicIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>  Clone Set: WEB-clone [WEB]
>>>>>      Started: [ node1-pcs ]
>>>>>      Stopped: [ node2-pcs ]
>>>>>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>>>>>      PrivateIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>      PrivateIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>
>>>>> Then unstandby the second node:
>>>>>
>>>>> # pcs node unstandby node2-pcs
>>>>>
>>>>> The publicIP goes back, but the private does not:
>>>>>
>>>>>  Clone Set: PublicIP-clone [PublicIP] (unique)
>>>>>      PublicIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>      PublicIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>>>  Clone Set: WEB-clone [WEB]
>>>>>      Started: [ node1-pcs node2-pcs ]
>>>>>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>>>>>      PrivateIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>      PrivateIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>>>
>>>>> Anybody see what I'm doing wrong? I'm not seeing anything in the logs to indicate that it tries node2 and then fails; but I'm fairly new to the software so it's possible I'm not looking in the right place.
>>>>
>>>> The pcs status would show any failed actions, and anything important in the logs would start with "error:" or "warning:".
>>>>
>>>> At any given time, one of the nodes is the DC, meaning it schedules actions for the whole cluster. That node will have more "pengine:" messages in its logs at the time. You can check those logs to see what decisions were made, as well as a "saving inputs" message to get the cluster state that was used to make those decisions. There is a crm_simulate tool that you can run on that file to get more information.
>>>>
>>>> By default, pacemaker will try to balance the number of resources running on each node, so I'm not sure why in this case node1 has four resources and node2 has two. crm_simulate might help explain it.
>>>>
>>>> However, there's nothing here telling pacemaker that the instances of PrivateIP should run on different nodes when possible. With your existing constraints, pacemaker would be equally happy to run both PublicIP instances on one node and both PrivateIP instances on the other node.
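
(As a concrete illustration of the crm_simulate approach Ken mentions, one might run it against one of the saved policy-engine inputs. This is only a sketch; the file name below is hypothetical, the real one is reported in the DC's "saving inputs" log message and normally lives under /var/lib/pacemaker/pengine/:

# crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-123.bz2

The allocation scores in the output indicate why each clone instance was placed on a particular node.)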
>>>
>>> Thanks for your reply. Finally getting back to this.
>>>
>>> Looking back at my config and my notes I realized I'm guilty of not giving you enough information. There was indeed an additional pair of resources that I didn't list in my original output that I didn't think were relevant to the issue--my bad. Reading what you wrote made me realize that it does appear as though pacemaker is simply trying to balance the overall load of *all* the available resources.
>>>
>>> But I'm still confused as to how one would definitively correct the issue. I tried this full reduction this morning. Starting from an empty two-node cluster (no resources, no constraints):
>>>
>>> [root@node1 clustertest]# pcs status
>>> Cluster name: MyCluster
>>> Stack: corosync
>>> Current DC: NONE
>>> Last updated: Sat Jun 10 10:58:46 2017          Last change: Sat Jun 10 10:40:23 2017 by root via cibadmin on node1-pcs
>>>
>>> 2 nodes and 0 resources configured
>>>
>>> OFFLINE: [ node1-pcs node2-pcs ]
>>>
>>> No resources
>>>
>>>
>>> Daemon Status:
>>>   corosync: active/disabled
>>>   pacemaker: active/disabled
>>>   pcsd: active/enabled
>>>
>>> [root@node1 clustertest]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=1.2.3.4 nic=bond0 cidr_netmask=24
>>> [root@node1 clustertest]# pcs resource meta ClusterIP resource-stickiness=0
>>> [root@node1 clustertest]# pcs resource clone ClusterIP clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>>> [root@node1 clustertest]# pcs resource create Test1 systemd:vtest1
>>> [root@node1 clustertest]# pcs resource create Test2 systemd:vtest2
>>> [root@node1 clustertest]# pcs constraint location Test1 prefers node1-pcs=INFINITY
>>> [root@node1 clustertest]# pcs constraint location Test2 prefers node1-pcs=INFINITY
>>>
>>> [root@node1 clustertest]# pcs node standby node1-pcs
>>> [root@node1 clustertest]# pcs status
>>> Cluster name: MyCluster
>>> Stack: corosync
>>> Current DC: node1-pcs (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>>> Last updated: Sat Jun 10 11:01:07 2017          Last change: Sat Jun 10 11:00:59 2017 by root via crm_attribute on node1-pcs
>>>
>>> 2 nodes and 4 resources configured
>>>
>>> Node node1-pcs: standby
>>> Online: [ node2-pcs ]
>>>
>>> Full list of resources:
>>>
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>  Test1  (systemd:vtest1):  Started node2-pcs
>>>  Test2  (systemd:vtest2):  Started node2-pcs
>>>
>>> Daemon Status:
>>>   corosync: active/disabled
>>>   pacemaker: active/disabled
>>>   pcsd: active/enabled
>>>
>>> [root@node1 clustertest]# pcs node unstandby node1-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>
>>> [root@node1 clustertest]# pcs node standby node2-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>
>>> [root@node1 clustertest]# pcs node unstandby node2-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>
>>> [root@node1 clustertest]# pcs resource delete ClusterIP
>>> Attempting to stop: ClusterIP...Stopped
>>> [root@node1 clustertest]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=1.2.3.4 nic=bond0 cidr_netmask=24
>>> [root@node1 clustertest]# pcs resource meta ClusterIP resource-stickiness=0
>>> [root@node1 clustertest]# pcs resource clone ClusterIP clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>>>
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>
>>> [root@node1 clustertest]# pcs node standby node1-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node2-pcs
>>>  Test2  (systemd:vtest2):  Started node2-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>
>>> [root@node1 clustertest]# pcs node unstandby node1-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>
>>> [root@node1 clustertest]# pcs node standby node2-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>
>>> [root@node1 clustertest]# pcs node unstandby node2-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>
>>> [root@node1 clustertest]# pcs node standby node1-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node2-pcs
>>>  Test2  (systemd:vtest2):  Started node2-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>
>>> [root@node1 clustertest]# pcs node unstandby node1-pcs
>>> [root@node1 clustertest]# pcs status resources
>>>  Test1  (systemd:vtest1):  Started node1-pcs
>>>  Test2  (systemd:vtest2):  Started node1-pcs
>>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>
>>> So in the initial configuration, it works as expected; putting the nodes in standby one at a time (I waited at least 5 seconds between each standby/unstandby operation) and then restoring the nodes shows the ClusterIP bouncing back and forth as expected. But then after deleting the ClusterIP resource and recreating it exactly as it originally was, the clones initially both stay on one node (the one the test resources are not on). Putting the node the extra resources are on in standby and then restoring it, the IPs stay on the other node. Putting the node the extra resources are *not* on in standby and then restoring that node allows the IPs to split once again.
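
(Following Ken's crm_simulate suggestion, the allocation scores behind this placement could also be inspected on the live cluster. The form below is only a sketch: -L reads the live cluster state and -s prints the allocation scores for each resource instance on each node:

# crm_simulate -sL
)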
>>>
>>> I also did the test above with full pcs status displays after each standby/unstandby; there were no errors displayed at each step.
>>>
>>> So I guess my bottom line question is: How does one tell Pacemaker that the individual legs of globally unique clones should *always* be spread across the available nodes whenever possible, regardless of the number of processes on any one of the nodes? For kicks I did try:
>>>
>>
>> You configured 'clone-node-max=2'. Set that to '1' and the maximum number of clones per node is gonna be '1' - if this is what you intended ...
>>
>> Regards,
>> Klaus
>>
>
> Thanks for the reply, Klaus. My understanding was that with the IPaddr2 agent in an active-active setup it was necessary to set 'clone-node-max=2' so that in the event of failover the traffic that had been targeted to the now-failed node would be answered on the still-working node. I.E., I *want* that unique clone to bounce to the working node on failover, but I want it to bounce back when the node is recovered.
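
(Klaus's suggestion, expressed as a pcs command against the ClusterIP clone from the reduction test, would be something along these lines; it is shown only for reference, since, as Dan explains just above, clone-node-max=2 is deliberate in his active-active IPaddr2 setup:

# pcs resource meta ClusterIP-clone clone-node-max=1
)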
Thought you were just wondering why you suddenly had them both on one node ...

>
> My main reference here is the "Clusters from Scratch" tutorial:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_clone_the_ip_address.html
>
> Cheers,
>
> Dan
>
>>> pcs constraint location ClusterIP:0 prefers node1-pcs=INFINITY
>>>
>>> but it responded with an error about an invalid character (:).
>>>
>>> Thanks,
>>>
>>> Dan
>>>
>>>>
>>>> I think you could probably get what you want by putting an optional (<INFINITY) colocation preference between PrivateIP and PublicIP. The only way pacemaker could satisfy that would be to run one of each on each node.
>>>>
>>>>> Also, I noticed when putting a node in standby the main NIC appears to be interrupted momentarily (long enough for my SSH session, which is connected via the permanent IP on the NIC and not the clusterIP, to be dropped). Is there any way to avoid this? I was thinking that the cluster operations would only affect the ClusterIP and not the other IPs being served on that NIC.
>>>>
>>>> Nothing in the cluster should cause that behavior. Check all the system logs around the time to see if anything unusual is reported.
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Dan

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org