[Pacemaker] Different value on cluster-infrastructure between 2 nodes
Hi,

I am doing a rolling upgrade of Pacemaker from CentOS 6.3 to 6.4. When the 1st node is upgraded and gets version 1.1.8, it doesn't join the cluster, and I ended up with 2 clusters. In the logs of node1 I see cluster-infrastructure="classic openais (with plugin)", but node2 (still on CentOS 6.3 with pacemaker 1.1.7) has cluster-infrastructure="openais". I also see a different dc-version between the nodes. Does anyone know if these could be the reason node1 does not join the cluster and decides to form its own cluster? Corosync communication looks fine:

Printing ring status.
Local node ID 484162314
RING ID 0
        id      = 10.187.219.28
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.2
        status  = ring 1 active with no faults

Cheers,
Pavlos

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[Pacemaker] 1.1.8 not compatible with 1.1.7?
Hoi,

As I wrote in another post[1], I failed to upgrade to 1.1.8 on a 2-node cluster.

Before the upgrade process, both nodes were using CentOS 6.3, corosync 1.4.1-7 and pacemaker 1.1.7.

I followed the rolling upgrade process, so I stopped pacemaker and then corosync on node1 and upgraded to CentOS 6.4. The OS upgrade also upgrades pacemaker to 1.1.8-7 and corosync to 1.4.1-15. The upgrade of the rpms went smoothly, as I knew about the crmsh issue and made sure I had the crmsh rpm in my repos.

Corosync started without any problems and both nodes could see each other[2]. But for some reason node2 failed to receive a reply to the join offer from node1, and node1 never joined the cluster. Node1 formed a new cluster as it never got a reply from node2, so I ended up with a split-brain situation.

Logs of node1 can be found here
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.log
and of node2 here
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.log

I have found this thread[3], which could be related to my problem, but the bug which caused the failure to join in that case is solved in 1.1.8.

Any ideas?

Cheers,
Pavlos

[1] Subject: Different value on cluster-infrastructure between 2 nodes
[2] https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/corosync.status
[3] http://comments.gmane.org/gmane.linux.highavailability.pacemaker/13185
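For reference, the rolling-upgrade steps I followed on node1 were roughly these (a sketch of the procedure, assuming the stock CentOS 6 init scripts, with resources moved off the node first):

```
# On node1, after moving resources away:
service pacemaker stop      # stop the cluster resource manager first
service corosync stop       # then stop the messaging layer
yum update                  # pulls in pacemaker-1.1.8-7 and corosync-1.4.1-15
service corosync start      # bring the messaging layer back up
service pacemaker start     # the node should now rejoin the cluster
```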
Re: [Pacemaker] Disable startup fencing with cman
On 14/04/2013 10:47 AM, Andreas Mock wrote:
> Hi all,
>
> In a two-node cluster (RHEL6.x, cman, pacemaker), when I start up the
> very first node, this node will try to fence the other node if it
> can't see it. This can be the case during maintenance. How do I avoid
> this startup fencing temporarily when I know that the other node is
> down?
>
> Best regards
> Andreas

Have you tried putting the node in standby? I don't know if it will work, just sharing my idea here.
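To make the idea concrete, a sketch in crm shell syntax (untested with cman, as I said; remember to re-enable fencing after the maintenance window):

```
# Option 1: mark the absent node as standby before starting node1:
crm node standby node2

# Option 2: temporarily disable fencing for the maintenance window:
crm configure property stonith-enabled=false
# ... bring node2 back ...
crm configure property stonith-enabled=true
crm node online node2
```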
Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?
On 12/04/2013 09:37 PM, Pavlos Parissis wrote:
> Hoi,
>
> As I wrote in another post[1], I failed to upgrade to 1.1.8 on a 2-node
> cluster.
> [...snip...]
>
> Logs of node1 can be found here
> https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.log
> and of node2 here
> https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.log

Doing a Disconnect & Reattach upgrade of both nodes at the same time brings me a working 1.1.8 cluster. Any attempt to make a 1.1.8 node join a cluster with a 1.1.7 node failed.

Cheers,
Pavlos
Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?
Hoi,

I upgraded the 1st node and here are the logs:
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node1.debuglog
https://dl.dropboxusercontent.com/u/1773878/pacemaker-issue/node2.debuglog

Enabling tracing on the mentioned functions didn't give, at least to me, any more information.

Cheers,
Pavlos

On 15 April 2013 01:42, Andrew Beekhof wrote:
> On 15/04/2013, at 7:31 AM, Pavlos Parissis wrote:
>> [...snip...]
>> Doing a Disconnect & Reattach upgrade of both nodes at the same time
>> brings me a working 1.1.8 cluster. Any attempt to make a 1.1.8 node
>> join a cluster with a 1.1.7 node failed.
>
> There wasn't enough detail in the logs to suggest a solution, but if you
> add the following to /etc/sysconfig/pacemaker and re-test, it might shed
> some additional light on the problem.
>
>     export PCMK_trace_functions=ais_dispatch_message
>
> Certainly there was no intention to make them incompatible.
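For anyone else reproducing this, the tracing Andrew suggested goes into /etc/sysconfig/pacemaker on the upgraded node; my understanding is that several functions can be listed comma-separated:

```
# /etc/sysconfig/pacemaker
# Trace the function that handles messages from the plugin
# (comma-separate further function names to trace more of them):
export PCMK_trace_functions=ais_dispatch_message
```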
[Pacemaker] location constraint question
Hi,

I am having problems understanding why my DRBD ms resource wants a location constraint. My setup is quite simple:

3 nodes
2 resource groups which hold the ip, fs and dummy resources
2 DRBD resources and 2 master/slave resources for the 2 DRBD devices

The objective is to have pbx_service_01 use node-01 as primary and node-03 as secondary, and pbx_service_02 use node-02 as primary and node-03 as secondary. So, an N+1 architecture.

With the configuration [1] everything works as I want [2]. But I found a comment from Lars Ellenberg [3] which basically says to use a location constraint on the ms DRBD resource. So, I deleted the PrimaryNode-drbd_01 and SecondaryNode-drbd_01 location constraints, just to see the impact on only 1 of the 2 resource groups. I noticed that only ip_01 is started from the pbx_service_01 resource group, and not fs_01 and pbx_01 (pbx_01 not starting is normal because of the order constraint). I thought that having a location constraint for the resource group would be enough. What have I understood incorrectly?

BTW, why does crm_mon report only 4 resources?
Thanks,
Pavlos

[1]
[r...@node-01 ~]# crm configure show
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.10" cidr_netmask="28" broadcast="10.10.10.127" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.11" cidr_netmask="28" broadcast="10.10.10.127" \
        op monitor interval="5s"
primitive pbx_01 ocf:heartbeat:Dummy \
        params state="/pbx_service_01/Dummy.state"
primitive pbx_02 ocf:heartbeat:Dummy \
        params state="/pbx_service_02/Dummy.state"
group pbx_service_01 ip_01 fs_01 pbx_01
group pbx_service_02 ip_02 fs_02 pbx_02
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs-on-drbd_02 inf: fs_02 ms-drbd_02:Master
colocation pbx_01-with-fs_01 inf: pbx_01 fs_01
colocation pbx_01-with-ip_01 inf: pbx_01 ip_01
colocation pbx_02-with-fs_02 inf: pbx_02 fs_02
colocation pbx_02-with-ip_02 inf: pbx_02 ip_02
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        symmetric-cluster="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="1000"

[2]
[r...@node-03 ~]# crm_mon -1
Last updated: Mon Sep 20 15:36:46 2010
Stack: Heartbeat
Current DC: node-03 (e5195d6b-ed14-4bb3-92d3-9105543f9251) - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
3 Nodes configured, unknown expected votes
4 Resources configured.

Online: [ node-03 node-01 node-02 ]

 Resource Group: pbx_service_01
     ip_01      (ocf::heartbeat:IPaddr2):       Started node-01
     fs_01      (ocf::heartbeat:Filesystem):    Started node-01
     pbx_01     (ocf::heartbeat:Dummy):         Started node-01
 Resource Group: pbx_service_02
     ip_02      (ocf::heartbeat:IPaddr2):       Started node-02
     fs_02      (ocf::heartbeat:Filesystem):    Started node-02
     pbx_02     (ocf::heartbeat:Dummy):         Started node-02
 Master/Slave Set: ms-drbd_01
     Masters: [ node-01 ]
     Slaves: [ node-03 ]
 Master/Slave Set: ms-drbd_02
     Masters: [ node-02 ]
     Slaves: [ node-03 ]

[3] http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg04105.html
Re: [Pacemaker] location constraint question
On 21 September 2010 08:38, Andrew Beekhof wrote:
>> BTW, why does crm_mon report only 4 resources?
>
> Because the drbd resources were made into master/slaves.
>
> See:
> ms ms-drbd_01 drbd_01 \
>         meta master-max="1" master-node-max="1" clone-max="2"
>         clone-node-max="1" notify="true"

OK, thanks.

I tried several things today in order to avoid a location constraint directly on the drbd ms resource, but nothing worked. I am pretty sure that if you have an asymmetric cluster you need to have a location constraint on the drbd ms resource.

BTW Andrew, why shouldn't we have location preference constraints on the master role directly?

Cheers,
Pavlos
Re: [Pacemaker] location constraint question
On 21 September 2010 09:04, Andrew Beekhof wrote:
> On Tue, Sep 21, 2010 at 8:58 AM, Pavlos Parissis wrote:
>> [...snip...]
>> I tried several things today in order to avoid a location constraint
>> directly on the drbd ms resource, but nothing worked.
>> I am pretty sure that if you have an asymmetric cluster you need to
>> have a location constraint on the drbd ms resource.
>
> yep, otherwise it doesn't know where it's allowed to start
>
>> BTW Andrew, why shouldn't we have location preference constraints on
>> the master role directly?
>
> No reason at all. It's allowed.

Thanks for the clarification,
Pavlos
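Since Andrew confirms that role-based constraints are allowed, the preferences from my config could, as a sketch, be expressed directly against the Master role in crm shell rule syntax (the constraint names here are made up for the example):

```
# Prefer node-01 for the Master role of ms-drbd_01; node-03 can still
# run a slave instance:
location drbd_01-master-pref ms-drbd_01 \
        rule $role="Master" 100: #uname eq node-01
location drbd_01-slave-pref ms-drbd_01 0: node-03
```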
[Pacemaker] migration-threshold and failure-timeout
Hi,

I am trying to figure out a way to do the following: if the monitor of resource x fails N times in a period of Z, then fail over to the other node and clear the fail-count.

Regards,
Pavlos
Re: [Pacemaker] migration-threshold and failure-timeout
On 21 September 2010 15:28, Vadym Chepkov wrote:
> On Tue, Sep 21, 2010 at 9:14 AM, Dan Frincu wrote:
>> Hi,
>>
>> This =>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html
>> explains it pretty well. Notice the INFINITY score and what sets it.
>>
>> However, I don't know of any automatic method to clear the failcount.
>>
>> Regards,
>> Dan
>
> In pacemaker 1.0 nothing will clear the failcount automatically; this is a
> feature of pacemaker 1.1, imho.
>
> But,
>
>     crm configure rsc_defaults failure-timeout="10min"
>
> will make the cluster "forget" about a previous failure in 10 minutes.
> If you want to further decrease this parameter, you might need to decrease
>
>     crm configure property cluster-recheck-interval="10min"
>
> Cheers,
> Vadym

Ok guys, thank you very much for the info,
Pavlos
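Putting the two answers together, a sketch of the "fail over after N failures within Z" policy (the resource name x and the values are placeholders; note the window is only approximate, since expired failures are only noticed at cluster-recheck-interval granularity):

```
# Move the resource away after 3 monitor failures (N=3), and forget
# failures after 10 minutes (Z=10min):
crm configure primitive x ocf:heartbeat:Dummy \
        meta migration-threshold="3" failure-timeout="10min" \
        op monitor interval="20s"
# Let the cluster notice expired failures every 10 minutes:
crm configure property cluster-recheck-interval="10min"
```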
[Pacemaker] target-role default value
Hi,

What is the default value for target-role on a resource? I tried to query it with crm_resource but without success:

crm_resource pbx_02 --get-property target-role
crm_resource pbx_02 --get-parameter target-role --meta

Cheers,
Pavlos
Re: [Pacemaker] target-role default value
On 24 September 2010 11:40, Michael Schwartzkopff wrote:
> On Friday 24 September 2010 11:34:11 Pavlos Parissis wrote:
>> Hi,
>>
>> What is the default value for target-role on a resource?
>> I tried to query it with crm_resource but without success:
>> crm_resource pbx_02 --get-property target-role
>> crm_resource pbx_02 --get-parameter target-role --meta
>>
>> Cheers,
>> Pavlos
>
> started

Thanks. How do I get the default values for parameters which are not set?

Thanks again,
Pavlos
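For what it's worth, the long-option query form is less ambiguous; when the meta attribute is unset, crm_resource reports nothing and the built-in default ("Started" for target-role, per the answer above) applies:

```
# Query the target-role meta attribute of pbx_02; no output means the
# attribute is unset and the built-in default is in effect:
crm_resource --resource pbx_02 --meta --get-parameter target-role
```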
[Pacemaker] default timeout for op start/stop
Hi,

When I verify my conf I get complaints about the timeout on the start and stop operations:

crm(live)configure# verify
WARNING: drbd_01: default timeout 20s for start is smaller than the advised 240
WARNING: drbd_01: default timeout 20s for stop is smaller than the advised 100
WARNING: drbd_02: default timeout 20s for start is smaller than the advised 240
WARNING: drbd_02: default timeout 20s for stop is smaller than the advised 100

Since I don't specifically set a timeout for the mentioned resources, I thought this 20s was coming from the defaults. So, I queried the defaults and got the following:

[r...@node-03 ~]# crm_attribute --type op_defaults --name timeout
scope=op_defaults name=timeout value=(null)

So, I am wondering where this 20s is coming from. I had the same issue for the IP and Filesystem resources, and in order to get rid of the warning I specifically set it to 60s.

Regards,
Pavlos

[r...@node-03 ~]# crm configure show
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.10" cidr_netmask="25" broadcast="10.10.10.127" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.11" cidr_netmask="25" broadcast="10.10.10.127" \
        op monitor interval="5s"
primitive pbx_01 ocf:heartbeat:Dummy \
        params state="/pbx_service_01/Dummy.state" \
        meta failure-timeout="60" migration-threshold="3" \
        op monitor interval="20s" timeout="40s"
primitive pbx_02 ocf:heartbeat:Dummy \
        params state="/pbx_service_02/Dummy.state" \
        meta failure-timeout="60" migration-threshold="3"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
colocation pbx_01-with-fs_01 inf: pbx_01 fs_01
colocation pbx_01-with-ip_01 inf: pbx_01 ip_01
colocation pbx_02-with-fs_02 inf: pbx_02 fs_02
colocation pbx_02-with-ip_02 inf: pbx_02 ip_02
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        symmetric-cluster="false" \
        last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \
        resource-stickiness="1000"
Re: [Pacemaker] default timeout for op start/stop
On 24 September 2010 13:54, Michael Schwartzkopff wrote:
> On Friday 24 September 2010 13:50:49 Pavlos Parissis wrote:
>> Hi,
>>
>> When I verify my conf I get complaints about the timeout on the start
>> and stop operations:
>> [...snip...]
>
> The default timeout is coded into the resource agent. You can safely ignore
> the WARNINGs. These are also removed from more recent versions of pacemaker.

thanks again
Pavlos
Re: [Pacemaker] default timeout for op start/stop
On 24 September 2010 18:12, Dejan Muhamedagic wrote:
[...snip...]
>> The default timeout is coded into the resource agent. You can safely
>> ignore the WARNINGs. These are also removed from more recent versions
>> of pacemaker.
>
> These warnings shouldn't be ignored. The defaults which are coded
> in the RA are what the author of the RA advised as minimums. These
> values are, however, not used automatically by the CRM, so they
> need to be specified in the configuration. And then the resources
> should be thoroughly tested to see if the timeouts are meaningful
> in the given environment.
>
> Thanks,

Are you saying that if timeouts are not set, the CRM will wait forever for each operation to finish?

Regards,
Pavlos
Re: [Pacemaker] default timeout for op start/stop
On 27 September 2010 12:17, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Sep 27, 2010 at 12:00:19PM +0200, Pavlos Parissis wrote:
>> [...snip...]
>> Are you saying that if timeouts are not set, the CRM will wait forever
>> for each operation to finish?
>
> No. It will use the global default timeout value
> (default-action-timeout), which is set to 20s. That's why the
> shell issues the warnings: 20s is shorter than what has been
> advertised in the meta-data of the RA you want to configure.

ok thanks
Pavlos
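So the clean fix is to give each operation a timeout at least as large as the advised value; a sketch based on my drbd primitive and the 240s/100s figures from the warnings:

```
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
```

Alternatively, raising the global default with `crm configure property default-action-timeout="240s"` silences the warnings too, at the cost of a longer timeout for every operation in the cluster.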
[Pacemaker] crm resource move doesn't move the resource
Hi,

When I issue "crm resource move pbx_service_01 node-0N" it moves the resource group, but the fs_01 resource is not started because drbd_01 is still running on the other node; it is not moved to node-0N as well, even though I have colocation constraints. I am pretty sure that I had this working before, but I can't figure out why it doesn't work anymore. The resources pbx_service_01 and drbd_01 are moved to another node in case of failure, but for some reason not manually.

Can you see in my conf where the problem could be? I have already spent some time on it and I think I can't see the obvious anymore :-(

node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.10" cidr_netmask="25" broadcast="10.10.10.127" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.11" cidr_netmask="25" broadcast="10.10.10.127" \
        op monitor interval="5s"
primitive pbx_01 ocf:heartbeat:Dummy \
        params state="/pbx_service_01/Dummy.state" \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" timeout="40s"
primitive pbx_02 ocf:heartbeat:Dummy \
        params state="/pbx_service_02/Dummy.state" \
        meta failure-timeout="60" migration-threshold="3"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
colocation pbx_01-with-fs_01 inf: pbx_01 fs_01
colocation pbx_01-with-ip_01 inf: pbx_01 ip_01
colocation pbx_02-with-fs_02 inf: pbx_02 fs_02
colocation pbx_02-with-ip_02 inf: pbx_02 ip_02
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        symmetric-cluster="false" \
        last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \
        resource-stickiness="1000"
[Pacemaker] promote a ms resource to a node
Hi,

Let's say that I have manually demoted a ms resource and have the following situation:

crm(live)resource# demote ms-drbd_01
crm(live)resource# status
[..snip..]
 Master/Slave Set: ms-drbd_01
     Slaves: [ node-01 node-03 ]

How can I manually promote ms-drbd_01 on node-03? The promote command doesn't accept node names, and the move command on ms-drbd_01 says it can't find the resource.

Cheers,
Pavlos
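One workaround I am considering (untested, just an idea) is to steer the Master role with a temporary role-based location constraint instead of the promote command:

```
# Pin the Master role of ms-drbd_01 to node-03:
crm configure location tmp-master-node-03 ms-drbd_01 \
        rule $role="Master" inf: #uname eq node-03
crm resource promote ms-drbd_01
# Once promoted, drop the temporary constraint again:
crm configure delete tmp-master-node-03
```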
Re: [Pacemaker] crm resource move doesn't move the resource
On 28 September 2010 15:09, Pavlos Parissis wrote:
> Hi,
>
> When I issue "crm resource move pbx_service_01 node-0N" it moves the
> resource group, but the fs_01 resource is not started because drbd_01 is
> still running on the other node and is not moved to node-0N as well, even
> though I have colocation constraints.
> I am pretty sure I had this working before, but I can't figure out why it
> doesn't work anymore.
> The resources pbx_service_01 and drbd_01 are moved to another node in case
> of failure, but for some reason not manually.
>
> Can you see in my conf where the problem could be? I have already spent
> some time on it and I think I can't see the obvious anymore :-(
>
> [...snip ...]

Just to note that this issue applies to only one of the resource groups, even though the conf is the same for both of them! So, after hours of running the same test again and again, and reading the logs (which, BTW, do not seem to say in a clear way why certain things happen), I decided to recreate the drbd_01 and ms-drbd_01 resources and adjust the order constraints.

Before, it was like this:

order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02

and now it is like this:

order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start

As you can see, no major changes. The end result is that now, every time I issue "crm resource move pbx_service_01 node-0N", drbd_01 is promoted on that node as well and the whole resource group is started! So, the issue is solved, but I don't like it for the very simple reason that I don't know why it didn't work before, and that scares me!
Cheers,
Pavlos
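For what it's worth, the group-level order constraint that fixed things is usually paired with a group-level colocation in the standard DRBD-plus-group pattern (as documented in Clusters from Scratch). A sketch of that pairing, not taken from the poster's config:

```
colocation pbx_service_01-on-drbd_01 inf: pbx_service_01 ms-drbd_01:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
```

With only fs_01 colocated to the Master role, a manual move of the group gives the policy engine no direct reason to relocate the master along with it, which may be why the move only half-worked.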
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
Please paste the corosync conf; without it, it is quite difficult to help you.

Cheers,
Pavlos
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
On 29 September 2010 21:01, Andreas Hofmeister wrote:
> On 29.09.2010 19:59, Mike A Meyer wrote:
>
> We have two nodes that have the IP address assigned to a bond0 network
> interface instead of the usual eth0 network interface. We are wondering if
> there are issues with trying to configure corosync/pacemaker with an IP
> assigned to a bond0 network interface. We are seeing that
> corosync/pacemaker will start on both nodes, but it doesn't detect other
> nodes in the cluster. We do have SELinux and the firewall shut off on both
> nodes. Any information would be helpful.
>
>
> We run the cluster stuff on bonding devices (actually on a VLan on top of a
> bond) and it works well. We use it in a two-node setup in round-robin mode;
> the nodes are connected back-to-back (i.e. no switch in between).
>
> If you use bonding over a switch, check your bonding mode - round-robin
> just won't work. Try LACP if you have connected each node to a single
> switch, or if your switches support link aggregation over multiple devices
> (the cheaper ones won't). Try "active-backup" with multiple switches.
>
> To check your configuration, use "ping" and check the "icmp_seq" in the
> replies. If some sequence number is missing, your setup is probably broken.
>

It is quite common to connect both interfaces of a bond to the same switch and then face issues. Mike, you need to tell us a bit more about the layer-2 connectivity and what it looks like. We also use active-backup mode on our bond interfaces, but we use 2 switches and it works without any problem.

Cheers,
Pavlos
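Following up on the icmp_seq check above: a small helper can flag gaps in the sequence numbers automatically. The function name and parsing are my own, not from the thread, and note that some iputils versions print `icmp_req=` instead of `icmp_seq=`.

```shell
# find_seq_gaps is a hypothetical helper: pipe `ping -c N <peer>` output
# into it and it prints every gap in the echo-reply sequence numbers,
# which on a bonded link usually points at a broken bonding mode.
find_seq_gaps() {
    grep -o 'icmp_seq=[0-9]*' | cut -d= -f2 | awk '
        NR > 1 && $1 != prev + 1 { printf "gap: %d -> %d\n", prev, $1 }
        { prev = $1 }'
}
```

Usage: `ping -c 100 192.168.1.2 | find_seq_gaps` (no output means no lost replies).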
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
On 30 September 2010 15:23, Mike A Meyer wrote:
> Pavlos,
>
> Thanks for helping out on this. We are running RHEL 5.5 on the iron and
> not a VM. We don't have SELinux turned on and the firewall is disabled.
> Here is the information in the /etc/modprobe.conf file.
>
> alias eth0 bnx2
> alias eth1 bnx2
> alias scsi_hostadapter cciss
> alias scsi_hostadapter1 qla2xxx
> alias scsi_hostadapter2 usb-storage
> alias bond0 bonding
> options bond0 mode=1 miimon=100
> options lpfc lpfc_lun_queue_depth=16 lpfc_nodev_tmo=30
> lpfc_discovery_threads=32
>
>
> We did take off the bond0 as a test and now only have our IP address
> assigned to eth0, and we are still having the same problem when starting
> corosync. The problem we are finding in the /var/log/cluster/corosync.log
> file is below.
>
> Sep 30 07:58:57 e-magdb1.buysub.com crmd: [28406]: info: crm_timer_popped:
> Election Trigger (I_DC_TIMEOUT) just popped!
> Sep 30 07:58:57 e-magdb1.buysub.com crmd: [28406]: WARN: do_log: FSA:
> Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Sep 30 07:58:57 e-magdb1.buysub.com crmd: [28406]: info:
> do_state_transition: State transition S_PENDING -> S_ELECTION [
> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
>
> What could this 'just popped' message mean?
>

I have no idea about the meaning of this message. But let's exclude any network issues first. Does ping between the nodes work? If you run tcpdump on the interface and then start corosync, do you see multicast packets arriving? Unfortunately, I don't use corosync, so I can't compare your conf with mine; I use heartbeat, so I can't tell if you have any conf issue.

Cheers,
Pavlos
Re: [Pacemaker] resources are restarted without obvious reasons
Hi,
Could this be related to the possible bug mentioned here [1]?

BTW, here is the conf of pacemaker:

node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
        op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pbx_02 lsb:test-02 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote pbx_service_02:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        symmetric-cluster="false" \
        last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \

Cheers,
Pavlos

[1] http://oss.clusterlabs.org/pipermail/pacemaker/2010-September/007624.html
Re: [Pacemaker] resources are restarted without obvious reasons
Hi,
It seems that it happens every time the PE wants to recheck the conf:

09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!

and then check_rsc_parameters() wants to restart my resources:

09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_02 on node-02, provider changed: heartbeat ->
09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_01 on node-01, provider changed: heartbeat ->

Looking at the code, I can't conclude where the issue is: in the actual conf, or whether I am hitting a bug.

static gboolean
check_rsc_parameters(resource_t *rsc, node_t *node, xmlNode *rsc_entry,
                     pe_working_set_t *data_set)
{
    int attr_lpc = 0;
    gboolean force_restart = FALSE;
    gboolean delete_resource = FALSE;

    const char *value = NULL;
    const char *old_value = NULL;
    const char *attr_list[] = {
        XML_ATTR_TYPE,
        XML_AGENT_ATTR_CLASS,
        XML_AGENT_ATTR_PROVIDER
    };

    for(; attr_lpc < DIMOF(attr_list); attr_lpc++) {
        value = crm_element_value(rsc->xml, attr_list[attr_lpc]);
        old_value = crm_element_value(rsc_entry, attr_list[attr_lpc]);
        if(value == old_value /* ie. NULL */
           || crm_str_eq(value, old_value, TRUE)) {
            continue;
        }

        force_restart = TRUE;
        crm_notice("Forcing restart of %s on %s, %s changed: %s -> %s",
                   rsc->id, node->details->uname, attr_list[attr_lpc],
                   crm_str(old_value), crm_str(value));
    }
    if(force_restart) {
        /* make sure the restart happens */
        stop_action(rsc, node, FALSE);
        set_bit(rsc->flags, pe_rsc_start_pending);
        delete_resource = TRUE;
    }
    return delete_resource;
}

On 1 October 2010 09:13, Pavlos Parissis wrote:
> Hi
> Could be related to a possible bug mentioned here[1]?
> > BTW here is the conf of pacemaker > node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02 > node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01 > node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03 > primitive drbd_01 ocf:linbit:drbd \ > params drbd_resource="drbd_pbx_service_1" \ > op monitor interval="30s" \ > op start interval="0" timeout="240s" \ > op stop interval="0" timeout="120s" > primitive drbd_02 ocf:linbit:drbd \ > params drbd_resource="drbd_pbx_service_2" \ > op monitor interval="30s" \ > op start interval="0" timeout="240s" \ > op stop interval="0" timeout="120s" > primitive fs_01 ocf:heartbeat:Filesystem \ > params device="/dev/drbd1" directory="/pbx_service_01" > fstype="ext3" \ > meta migration-threshold="3" failure-timeout="60" \ > op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \ > op start interval="0" timeout="60s" \ > op stop interval="0" timeout="60s" > primitive fs_02 ocf:heartbeat:Filesystem \ > params device="/dev/drbd2" directory="/pbx_service_02" > fstype="ext3" \ > meta migration-threshold="3" failure-timeout="60" \ > op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \ > op start interval="0" timeout="60s" \ > op stop interval="0" timeout="60s" > primitive ip_01 ocf:heartbeat:IPaddr2 \ > params ip="192.168.78.10" cidr_netmask="24" > broadcast="192.168.78.255" \ > meta failure-timeout="120" migration-threshold="3" \ > op monitor interval="5s" > primitive ip_02 ocf:heartbeat:IPaddr2 \ > params ip="192.168.78.20" cidr_netmask="24" > broadcast="192.168.78.255" \ > op monitor interval="5s" > primitive pbx_01 lsb:test-01 \ > meta failure-timeout="60" migration-threshold="3" > target-role="Started" \ > op monitor interval="20s" \ > op start interval="0" timeout="60s" \ > op stop interval="0" timeout="60s" > primitive pbx_02 lsb:test-02 \ > meta failure-timeout="60" migration-threshold="3" > target-role="Started" \ > op monitor interval="20s" \ > op start interval="0" timeout="60s"
Re: [Pacemaker] crm resource move doesn't move the resource
Hi,

I am having the same issue again, on a different set of 3 nodes. When I try to fail over the resource group manually to the standby node, the ms-drbd resource is not moved as well, and as a result the resource group is not fully started; only the ip resource is started. Any ideas why I am having this issue?

Here is the info:

[r...@node-01 ~]# crm resource move pbx_service_01 node-03
[r...@node-01 ~]# crm resource unmove pbx_service_01
[r...@node-01 ~]# ptest -Ls
Allocation scores:
clone_color: ms-drbd_01 allocation score on node-01: 100
clone_color: ms-drbd_01 allocation score on node-03: 0
clone_color: drbd_01:0 allocation score on node-01: 11100
clone_color: drbd_01:0 allocation score on node-03: 0
clone_color: drbd_01:1 allocation score on node-01: 100
clone_color: drbd_01:1 allocation score on node-03: 11000
native_color: drbd_01:0 allocation score on node-01: 11100
native_color: drbd_01:0 allocation score on node-03: 0
native_color: drbd_01:1 allocation score on node-01: -100
native_color: drbd_01:1 allocation score on node-03: 11000
drbd_01:0 promotion score on node-01: 10100
drbd_01:1 promotion score on node-03: 1
drbd_01:2 promotion score on none: 0
group_color: pbx_service_01 allocation score on node-01: 200
group_color: pbx_service_01 allocation score on node-03: 10
group_color: ip_01 allocation score on node-01: 200
group_color: ip_01 allocation score on node-03: 1010
group_color: fs_01 allocation score on node-01: 0
group_color: fs_01 allocation score on node-03: 0
group_color: pbx_01 allocation score on node-01: 0
group_color: pbx_01 allocation score on node-03: 0
native_color: ip_01 allocation score on node-01: 200
native_color: ip_01 allocation score on node-03: 1010
drbd_01:0 promotion score on node-01: 100
drbd_01:1 promotion score on node-03: -100
drbd_01:2 promotion score on none: 0
native_color: fs_01 allocation score on node-01: -100
native_color: fs_01 allocation score on node-03: -100
native_color: pbx_01 allocation score on node-01: -100
native_color: pbx_01 allocation score on node-03: -100

[r...@node-01 ~]# crm status

Last updated: Sat Oct 2 18:27:32 2010
Stack: Heartbeat
Current DC: node-03 (3dd75a8f-9819-450f-9f18-c27730665925) - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
3 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ node-03 node-01 node-02 ]

 Master/Slave Set: ms-drbd_01
     Masters: [ node-01 ]
     Slaves: [ node-03 ]
 Resource Group: pbx_service_01
     ip_01      (ocf::heartbeat:IPaddr2):       Started node-03
     fs_01      (ocf::heartbeat:Filesystem):    Stopped
     pbx_01     (lsb:test-01):  Stopped

[r...@node-01 ~]# crm configure show
node $id="3dd75a8f-9819-450f-9f18-c27730665925" node-03
node $id="4e47db29-5f14-4371-9734-317bf342b8ed" node-02
node $id="a8f56e42-438f-4ea5-a6ba-a7f1d23ed401" node-01
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        symmetric-cluster="false" \
        stonith-enabled="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="1000"
[r...@node-01 ~]#

Thanks,
Pavlos
Re: [Pacemaker] crm resource move doesn't move the resource
I am wondering if resource-stickiness="1000" could be the reason for the behavior I see, but then again, when I recreated the ms-drbd resource on the other cluster, the issue was solved there.
Re: [Pacemaker] promote a ms resource to a node
Just for the record, here is the constraint:

location master-location ms-drbd_02 \
        rule $id="master-rule" $role="Master" 1000: #uname eq node-03

Cheers,
Pavlos

On 30 September 2010 10:24, Andrew Beekhof wrote:
> A resource location constraint with role=Master would do it.
> Not sure about the shell syntax though.
>
> On Tue, Sep 28, 2010 at 3:51 PM, Pavlos Parissis wrote:
> > Hi,
> >
> > Let's say that I have manually demoted a ms resource and have the
> > following situation
> > crm(live)resource# demote ms-drbd_01
> > crm(live)resource# status
> > [..snip..]
> > Master/Slave Set: ms-drbd_01
> > Slaves: [ node-01 node-03 ]
> >
> > How can I manually promote ms-drbd_01 on node-03?
> > The promote command doesn't accept node names and the move command on
> > ms-drbd_01 says can't find resource.
> >
> > Cheers,
> > Pavlos
[Pacemaker] Recommend Fencing device
Hi,
Which fencing devices would you recommend? I want to use a device that will give as few problems as possible when configuring a fencing resource for a 3-node cluster.

Regards,
Pavlos
Re: [Pacemaker] resources are restarted without obvious reasons
On 5 October 2010 11:15, Andrew Beekhof wrote:
> On Fri, Oct 1, 2010 at 9:53 AM, Pavlos Parissis wrote:
> > Hi,
> > It seems that it happens every time the PE wants to recheck the conf
> >
> > 09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer
> > (I_PE_CALC) just popped!
> >
> > and then check_rsc_parameters() wants to restart my resources
> >
> > 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
> > pbx_02 on node-02, provider changed: heartbeat ->
> > 09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
> > 09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of
> > pbx_01 on node-01, provider changed: heartbeat ->
>
> Could be a bug in the code that detects changes to the resource definition.
> Could you file a bug please?
>
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>

Here it is: http://developerbugs.linux-foundation.org/show_bug.cgi?id=2504
[Pacemaker] init Script fails in 1 of LSB Compatible test
Hi,

I am thinking of putting sshd under cluster control, and I am checking whether the /etc/init.d/sshd supplied by RedHat 5.4 is LSB compatible. So, I ran the tests mentioned here [1], and it fails at test 6: it returns 1 and a failed message. Could this create problems within pacemaker?

Regards,
Pavlos

[1] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
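For reference, the test sequence from [1] can be sketched as a small script. The `check_lsb` function and its messages are my own naming, not from the document; the checks mirror what Pacemaker relies on: start and stop being idempotent (exit 0 even when already started/stopped) and "status" returning 3 for a stopped service.

```shell
# Hypothetical checker sketching the LSB compatibility tests; run as:
#   check_lsb /etc/init.d/sshd
check_lsb() {
    script=$1 rc=0 st=0
    "$script" start  >/dev/null 2>&1 || { echo "test 1: start failed"; rc=1; }
    "$script" status >/dev/null 2>&1 || { echo "test 2: status after start failed"; rc=1; }
    "$script" start  >/dev/null 2>&1 || { echo "test 3: start is not idempotent"; rc=1; }
    "$script" stop   >/dev/null 2>&1 || { echo "test 4: stop failed"; rc=1; }
    st=0; "$script" status >/dev/null 2>&1 || st=$?
    [ "$st" -eq 3 ]                  || { echo "test 5: status after stop must return 3, got $st"; rc=1; }
    "$script" stop   >/dev/null 2>&1 || { echo "test 6: stop is not idempotent"; rc=1; }
    return $rc
}
```

If the numbering matches the appendix, the reported failure at test 6 would be `stop` on an already-stopped service returning 1 instead of 0.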
Re: [Pacemaker] init Script fails in 1 of LSB Compatible test
On 5 October 2010 13:19, Andrew Beekhof wrote:
> On Tue, Oct 5, 2010 at 12:51 PM, Pavlos Parissis wrote:
> > Hi,
> >
> > I am thinking of putting sshd under cluster control, and I am checking
> > whether the /etc/init.d/sshd supplied by RedHat 5.4 is LSB compatible.
> > So, I ran the tests mentioned here [1], and it fails at test 6: it
> > returns 1 and a failed message.
> > Could this create problems within pacemaker?
>
> yes
>

What kind of problems, and why?

Regards,
Pavlos
Re: [Pacemaker] Online and Offline status when doing crm_mon
On 5 October 2010 22:12, Mike A Meyer wrote:
> We are set up in a two-node active/passive cluster using pacemaker/corosync.
> We shut down pacemaker/corosync on both nodes and changed the uname -n on
> our nodes to show the short name instead of the FQDN. We started up
> pacemaker/corosync, and ever since we did that, when we run the crm_mon
> command, we see this below.
>
> Last updated: Tue Oct 5 13:28:16 2010
> Stack: openais
> Current DC: e-magdb2 - partition with quorum
> Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
> 4 Nodes configured, 2 expected votes
> 2 Resources configured.
>
> Online: [ e-magdb2 e-magdb1 ]
> OFFLINE: [ e-magdb1.testingpcmk.com e-magdb2.testingpcmkr.com ]
>
> We did edit the crm configuration file to use short names for both nodes.
> We can ping both the short name and the FQDN on our internal network, and
> both come back with the right IP address. We are running on RHEL 5.
> Anybody have any ideas why the FQDN nodes show offline since we configured
> pacemaker/corosync to use short names? Is it grabbing it from internal DNS
> from the IP address we have in the /etc/corosync.conf file? Everything
> seems to be working correctly and failing over correctly. Should this be
> something to worry about, or is it maybe a display bug? Below is the
> corosync.conf file.
>

Did you follow this procedure [1]? Changing the names in the corosync conf file is not enough.

[1] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-node-delete.html
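Changing corosync.conf alone leaves the old node entries in the CIB, which is why the FQDN nodes still show up as OFFLINE. Following the node-delete procedure in [1], the cleanup would be roughly the following (node names taken from the crm_mon output above; crmsh command syntax assumed, not from the thread):

```
crm node delete e-magdb1.testingpcmk.com
crm node delete e-magdb2.testingpcmkr.com
```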
Re: [Pacemaker] crm resource move doesn't move the resource
On 7 October 2010 09:01, Andrew Beekhof wrote:
> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis wrote:
> > Hi,
> >
> > I am having the same issue again, on a different set of 3 nodes. When I
> > try to fail over the resource group manually to the standby node, the
> > ms-drbd resource is not moved as well, and as a result the resource group
> > is not fully started; only the ip resource is started.
> > Any ideas why I am having this issue?
>
> I think it's a bug that was fixed recently. Could you try the latest
> code from Mercurial?

1.1 or 1.2 branch?
Re: [Pacemaker] pacemaker version
On 7 October 2010 08:33, Andrew Beekhof wrote:
>
> On Wed, Oct 6, 2010 at 5:04 PM, Gianluca Cecchi wrote:
> > On Wed, Oct 6, 2010 at 4:25 PM, Shravan Mishra wrote:
> >> That is what I heard too, that's the reason for this question.
> >>
> >
> > On June, inside a complex thread regarding "colocation -inf", Andrew
> > reported the link and also several clarifications after some questions
> > of mine...
> >
> > See in particular:
> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006606.html
> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006610.html
> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006620.html
> >
> > I think they are still valid...
>
> Absolutely. These will all be valid until 1.2 comes out (and then
> they'll apply to 1.3 instead :-)

I am a bit confused about the meaning of the pacemaker-1.0, 1.1 and 1.2 schemas mentioned here:
http://theclusterguy.clusterlabs.org/post/441442543/new-pacemaker-release-series

Say I have 1.1 installed; how do I know whether I am using the pacemaker-1.0 or the pacemaker-1.2 schema? Sorry if it sounds stupid, but I simply don't understand it.

Cheers,
Pavlos
Re: [Pacemaker] crm resource move doesn't move the resource
On 8 October 2010 04:26, jiaju liu wrote:
> Message: 2
> Date: Thu, 7 Oct 2010 21:58:29 +0200
> From: Pavlos Parissis <pavlos.paris...@gmail.com>
> To: The Pacemaker cluster resource manager <pacema...@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] crm resource move doesn't move the resource
>
> On 7 October 2010 09:01, Andrew Beekhof <and...@beekhof.net> wrote:
> > On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis wrote:
> > > Hi,
> > >
> > > I am having again the same issue, in a different set of 3 nodes. When I
> > > try to failover manually the resource group on the standby node, the
> > > ms-drbd resource is not moved as well and as a result the resource group
> > > is not fully started, only the ip resource is started.
> > > Any ideas why I am having this issue?
> >
> > I think its a bug that was fixed recently. Could you try the latest
> > code from Mercurial?
>
> Maybe you should clear the failcount.

The failcount was 0.
Re: [Pacemaker] pacemaker version
On 8 October 2010 07:47, Andrew Beekhof wrote:
> On Thu, Oct 7, 2010 at 10:10 PM, Pavlos Parissis wrote:
>> On 7 October 2010 08:33, Andrew Beekhof wrote:
>>>
>>> On Wed, Oct 6, 2010 at 5:04 PM, Gianluca Cecchi wrote:
>>> > On Wed, Oct 6, 2010 at 4:25 PM, Shravan Mishra wrote:
>>> >> That is what I heard too, that's the reason for this question.
>>> >>
>>> >
>>> > On June, inside a complex thread regarding "colocation -inf", Andrew
>>> > reported the link and also several clarifications after some questions
>>> > of mine...
>>> >
>>> > See in particular:
>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006606.html
>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006610.html
>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006620.html
>>> >
>>> > I think they are still valid...
>>>
>>> Absolutely. These will all be valid until 1.2 comes out (and then
>>> they'll apply to 1.3 instead :-)
>>
>> I am a bit confused about the meaning of the pacemaker-1.0, 1.1 and 1.2
>> schemas, mentioned here:
>> http://theclusterguy.clusterlabs.org/post/441442543/new-pacemaker-release-series
>>
>> Say I have 1.1 installed; how do I know if I use the pacemaker-1.0 or 1.2
>> schema?
>> Sorry if it sounds stupid, but I simply don't understand it.
>
> cibadmin -Ql | grep validate
>

[r...@node-01 ~]# cibadmin -Ql | grep validate

Since validate-with is set to pacemaker-1.0, I am using the pacemaker-1.0 schema, right?

So, if I upgrade to 1.1.3 and leave validate-with at pacemaker-1.0, I will run a stable 1.1.3, but if I set it to pacemaker-1.1 I will be running a "testing|unstable" 1.1.3. Have I understood it correctly?

Thanks,
Pavlos
Re: [Pacemaker] crm resource move doesn't move the resource
On 8 October 2010 08:29, Andrew Beekhof wrote: > On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis > wrote: >> >> >> On 7 October 2010 09:01, Andrew Beekhof wrote: >>> >>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis >>> wrote: >>> > Hi, >>> > >>> > I am having again the same issue, in a different set of 3 nodes. When I >>> > try >>> > to failover manually the resource group on the standby node, the ms-drbd >>> > resource is not moved as well and as a result the resource group is not >>> > fully started, only the ip resource is started. >>> > Any ideas why I am having this issue? >>> >>> I think its a bug that was fixed recently. Could you try the latest >>> from code Mercurial? >> >> 1.1 or 1.2 branch? > > 1.1 > to save time on compiling stuff I want to use the available rpms on 1.1.3 version from rpm-next repo. But before I go and recreate the scenario, which means rebuild 3 nodes, I would like to know if this bug is fixed in 1.1.3 Thanks, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] pacemaker version
On 8 October 2010 09:28, Andrew Beekhof wrote: > On Fri, Oct 8, 2010 at 8:31 AM, Pavlos Parissis > wrote: >> On 8 October 2010 07:47, Andrew Beekhof wrote: >>> On Thu, Oct 7, 2010 at 10:10 PM, Pavlos Parissis >>> wrote: >>>> On 7 October 2010 08:33, Andrew Beekhof wrote: >>>>> >>>>> On Wed, Oct 6, 2010 at 5:04 PM, Gianluca Cecchi >>>>> wrote: >>>>> > On Wed, Oct 6, 2010 at 4:25 PM, Shravan Mishra >>>>> > wrote: >>>>> >> That is what I heard too, that's the reason for this question. >>>>> >> >>>>> > >>>>> > On June, inside a complex thread regarding "colocation -inf", Andrew >>>>> > reported the link and also several clarifications after some questions >>>>> > of mine... >>>>> > >>>>> > See in particular: >>>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006606.html >>>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006610.html >>>>> > http://oss.clusterlabs.org/pipermail/pacemaker/2010-June/006620.html >>>>> > >>>>> > I think they are still valid... >>>>> >>>>> Absolutely. These will all be valid until 1.2 comes out (and then >>>>> they'll apply to 1.3 instead :-) >>>> >>>> I am a bit confused about the meaning of pacemaker-1.0,1.1 and 1.2 >>>> schema, mentioned here >>>> http://theclusterguy.clusterlabs.org/post/441442543/new-pacemaker-release-series >>>> >>>> Let's I have installed 1.1 how do I know if I use pacemaker-1.o or 1.2 >>>> schema? >>>> Sorry if it sounds stupid but I simple don't understand it >>> >>> cibadmin -Ql | grep validate >>> >> >> >> [r...@node-01 ~]# cibadmin -Ql | grep validate >> > have-quorum="1" dc-uuid="b7764e7b-0a00-4745-8d9e-6911271eefb2" >> admin_epoch="0" epoch="271" num_updates="3"> >> >> since validate-with is set to pacemaker-1.0, I am using pacemaker-1.0 >> schema, right? > > Right. > >> So, if I upgrade to 1.1.3 and leave validate-with to pacemaker-1.0, I >> will run a stable 1.1.3, but if I set to pacemaker-1.1 I will be >> running a "testing|unstable" 1.1.3. 
> > You'll be enabling some unfinished features. > >> Have I understood it correctly? > > Essentially, yes. thanks ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] crm resource move doesn't move the resource
On 8 October 2010 09:29, Andrew Beekhof wrote: > On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis > wrote: >> On 8 October 2010 08:29, Andrew Beekhof wrote: >>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis >>> wrote: >>>> >>>> >>>> On 7 October 2010 09:01, Andrew Beekhof wrote: >>>>> >>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis >>>>> wrote: >>>>> > Hi, >>>>> > >>>>> > I am having again the same issue, in a different set of 3 nodes. When I >>>>> > try >>>>> > to failover manually the resource group on the standby node, the ms-drbd >>>>> > resource is not moved as well and as a result the resource group is not >>>>> > fully started, only the ip resource is started. >>>>> > Any ideas why I am having this issue? >>>>> >>>>> I think its a bug that was fixed recently. Could you try the latest >>>>> from code Mercurial? >>>> >>>> 1.1 or 1.2 branch? >>> >>> 1.1 >>> >> to save time on compiling stuff I want to use the available rpms on >> 1.1.3 version from rpm-next repo. >> But before I go and recreate the scenario, which means rebuild 3 >> nodes, I would like to know if this bug is fixed in 1.1.3 > > As I said, I believe so. > I've just upgraded[1] my pacemaker to 1.1.3 and stonithd can not be started, am I missing something? 
Oct 08 21:08:01 node-02 heartbeat: [14192]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 14192) Oct 08 21:08:01 node-02 heartbeat: [14193]: info: Starting "/usr/lib/heartbeat/attrd" as uid 101 gid 103 (pid 14193) Oct 08 21:08:01 node-02 heartbeat: [14194]: info: Starting "/usr/lib/heartbeat/crmd" as uid 101 gid 103 (pid 14194) Oct 08 21:08:01 node-02 ccm: [14189]: info: Hostname: node-02 Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Connection failed 1 times (30 max) Oct 08 21:08:01 node-02 attrd: [14193]: info: Invoked: /usr/lib/heartbeat/attrd Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: Invoked: /usr/lib/heartbeat/stonithd Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Client [stonith-ng] pid 14192 failed authorization [no default client auth] Oct 08 21:08:01 node-02 heartbeat: [14158]: ERROR: api_process_registration_msg: cannot add client(stonith-ng) Oct 08 21:08:01 node-02 stonith-ng: [14192]: ERROR: register_heartbeat_conn: Cannot sign on with heartbeat: Oct 08 21:08:01 node-02 stonith-ng: [14192]: CRIT: main: Cannot sign in to the cluster... terminating Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Managed /usr/lib/heartbeat/stonithd process 14192 exited with return code 100. Oct 08 21:08:01 node-02 crmd: [14194]: info: Invoked: /usr/lib/heartbeat/crmd Oct 08 21:08:01 node-02 crmd: [14194]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Oct 08 21:08:02 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... 
pause and retry Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Connection failed 2 times (30 max) Oct 08 21:08:05 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry [..snip...] Oct 08 21:08:33 node-02 crmd: [14194]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry [1] I use CentOS 5.4 and when I did the installation I used the following repository [r...@node-02 ~]# cat /etc/yum.repos.d/pacemaker.repo [clusterlabs] name=High Availability/Clustering server technologies (epel-5) baseurl=http://www.clusterlabs.org/rpm/epel-5 type=rpm-md gpgcheck=0 enabled=1 and in order to perform the upgrade I added the following rep. [clusterlabs-next] name=High Availability/Clustering server technologies (epel-5-next) baseurl=http://www.clusterlabs.org/rpm-next/epel-5 metadata_expire=45m type=rpm-md gpgcheck=0 enabled=1 and here is the installation/upgrade log, where you can see only pacemaker-libs and pacemaker were upgraded. Oct 03 21:06:20 Installed: libibverbs-1.1.3-2.el5.i386 Oct 03 21:06:25 Installed: lm_sensors-2.10.7-9.el5.i386 Oct 03 21:06:31 Installed: 1:net-snmp-5.3.2.2-9.el5_5.1.i386 Oct 03 21:06:31 Installed: librdmacm-1.0.10-1.el5.i386 Oct 03 21:06:32 Installed: openhpi-libs-2.14.0-5.el5.i386 Oct 03 21:06:33 Installed: OpenIPMI-libs-2.0.16-7.el5.i386 Oct 03 21:06:35 Installed: libesmtp-1.0.4-5.el5.i386 Oct 03 21:06:36 Installed: cluster-glue-libs-1.0.6-1.6.el5.i386 Oct 03 21:06:37
[Pacemaker] unpack_rsc_op: Hard error
Hi, Does anyone know why the PE wants to unpack resources on nodes where they will never run due to location constraints? I am getting these messages and I am wondering whether they are harmless or not. 23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error - sshd-pbx_01_monitor_0 failed with rc=5: Preventing sshd-pbx_01 from re-starting on node-02 23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error - pbx_01_monitor_0 failed with rc=5: Preventing pbx_01 from re-starting on node-02 Cheers, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] crm resource move doesn't move the resource
On 8 October 2010 22:05, Pavlos Parissis wrote: > On 8 October 2010 09:29, Andrew Beekhof wrote: >> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis >> wrote: >>> On 8 October 2010 08:29, Andrew Beekhof wrote: >>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis >>>> wrote: >>>>> >>>>> >>>>> On 7 October 2010 09:01, Andrew Beekhof wrote: >>>>>> >>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis >>>>>> wrote: >>>>>> > Hi, >>>>>> > >>>>>> > I am having again the same issue, in a different set of 3 nodes. When I >>>>>> > try >>>>>> > to failover manually the resource group on the standby node, the >>>>>> > ms-drbd >>>>>> > resource is not moved as well and as a result the resource group is not >>>>>> > fully started, only the ip resource is started. >>>>>> > Any ideas why I am having this issue? >>>>>> >>>>>> I think its a bug that was fixed recently. Could you try the latest >>>>>> from code Mercurial? >>>>> >>>>> 1.1 or 1.2 branch? >>>> >>>> 1.1 >>>> >>> to save time on compiling stuff I want to use the available rpms on >>> 1.1.3 version from rpm-next repo. >>> But before I go and recreate the scenario, which means rebuild 3 >>> nodes, I would like to know if this bug is fixed in 1.1.3 >> >> As I said, I believe so. >> > > I've just upgraded[1] my pacemaker to 1.1.3 and stonithd can not be > started, am I missing something? 
> > Oct 08 21:08:01 node-02 heartbeat: [14192]: info: Starting > "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 14192) > Oct 08 21:08:01 node-02 heartbeat: [14193]: info: Starting > "/usr/lib/heartbeat/attrd" as uid 101 gid 103 (pid 14193) > Oct 08 21:08:01 node-02 heartbeat: [14194]: info: Starting > "/usr/lib/heartbeat/crmd" as uid 101 gid 103 (pid 14194) > Oct 08 21:08:01 node-02 ccm: [14189]: info: Hostname: node-02 > Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed > Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM > Connection failed 1 times (30 max) > Oct 08 21:08:01 node-02 attrd: [14193]: info: Invoked: > /usr/lib/heartbeat/attrd > Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: Invoked: > /usr/lib/heartbeat/stonithd > Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: > G_main_add_SignalHandler: Added signal handler for signal 17 > Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Client [stonith-ng] > pid 14192 failed authorization [no default client auth] > Oct 08 21:08:01 node-02 heartbeat: [14158]: ERROR: > api_process_registration_msg: cannot add client(stonith-ng) > Oct 08 21:08:01 node-02 stonith-ng: [14192]: ERROR: > register_heartbeat_conn: Cannot sign on with heartbeat: > Oct 08 21:08:01 node-02 stonith-ng: [14192]: CRIT: main: Cannot sign > in to the cluster... terminating > Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Managed > /usr/lib/heartbeat/stonithd process 14192 exited with return code 100. > Oct 08 21:08:01 node-02 crmd: [14194]: info: Invoked: /usr/lib/heartbeat/crmd > Oct 08 21:08:01 node-02 crmd: [14194]: info: G_main_add_SignalHandler: > Added signal handler for signal 17 > Oct 08 21:08:02 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't > complete CIB registration 1 times... 
pause and retry > Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed > Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM > Connection failed 2 times (30 max) > Oct 08 21:08:05 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't > complete CIB registration 2 times... pause and retry > [..snip...] > Oct 08 21:08:33 node-02 crmd: [14194]: ERROR: te_connect_stonith: > Sign-in failed: triggered a retry > Solved by adding "apiauth stonith-ng uid=root" in ha.cf, as mentioned here: http://www.gossamer-threads.com/lists/linuxha/users/67189#67189 and a patch exists which makes heartbeat no longer require the apiauth entry: http://hg.linux-ha.org/dev/rev/9624b66a6b82 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] crmd thinks lsb returns error on monitor
Hi, My resource is not started because I get this 00:44:27 crmd: [3141]: WARN: status_from_rc: Action 16 (pbx_02_monitor_0) on node-02 failed (target: 7 vs. rc: 5): Error but when I run the status manually I get 3, which is OK because the application is stopped [r...@node-02 ~]# /etc/init.d/znd-pbx_02 status pbx_02 is stopped [r...@node-02 ~]# echo $? 3 why does crm get an error in this case? Thanks, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
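The numbers in the log above can be read as follows; this is an illustrative sketch based on the values seen in this thread (7 is what the initial probe expects on a node where the resource is stopped, while 5 is treated as a hard "not installed" error), not an excerpt of Pacemaker source:

```shell
# Interpret an LSB "status" probe result the way the thread's logs suggest
# Pacemaker does: 3 maps to "not running" (fine on a standby node), while
# 5 means "program is not installed" and bans the resource from the node.
interpret_probe_rc() {
    case "$1" in
        0)   echo "running" ;;
        3|7) echo "stopped (expected on a standby node)" ;;
        5)   echo "hard error: not installed on this node" ;;
        *)   echo "unexpected rc $1" ;;
    esac
}

interpret_probe_rc 3   # prints: stopped (expected on a standby node)
interpret_probe_rc 5   # prints: hard error: not installed on this node
```

This is consistent with the resolution later in the thread: returning 3 from the init script's status action keeps the probe happy, while returning 5 produces the "Hard error ... Preventing ... from re-starting" messages.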
Re: [Pacemaker] unpack_rsc_op: Hard error
On 9 October 2010 23:20, Pavlos Parissis wrote: > Hi, > > Does anyone know why PE wants to unpack resources on nodes that will > never run due to location constraints? > I am getting this messages and I am wondering if they harmless or not. > > 23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error - > sshd-pbx_01_monitor_0 failed with rc=5: Preventing sshd-pbx_01 from > re-starting on node-02 > 23:12:38 pengine: [7705]: notice: unpack_rsc_op: Hard error - > pbx_01_monitor_0 failed with rc=5: Preventing pbx_01 from re-starting > on node-02 > > Cheers, > Pavlos > It seems that a return code of 5 from an LSB script confuses the cluster. I have made my init script LSB-compliant (it passes the tests here[1]), but I have also implemented what is mentioned here[2] regarding the exit codes. I implemented exit code 5, which causes trouble because when the cluster runs the monitor on the slave node, where no resources are active, it gets rc=5. If I remove the exit 5 everything is fine. Is this an expected behavior? [1]http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html [2]http://refspecs.freestandards.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html the init script [r...@node-03 ~]# cat /etc/init.d/znd-pbx_01 #!/bin/bash # ### BEGIN INIT INFO # Provides: pbx_01 # Required-Start: $local_fs $network # Required-Stop: $local_fs $network # Default-Start: 3 4 5 # Default-Stop: 0 1 2 6 # Short-Description: start and stop pbx_01 # Description: Init script for pbxnsip. ### END INIT INFO # source function library . 
/etc/init.d/functions RETVAL=0 # Installation location INSTALLDIR=/pbx_service_01/pbxnsip PBX_CONFIG=$INSTALLDIR/pbx.xml PBX=pbx_01 PID_FILE=/var/run/$PBX.pid LOCK_FILE=/var/lock/subsys/$PBX PBX_OPTIONS="--dir $INSTALLDIR --config $PBX_CONFIG --pidfile $PID_FILE" #sleep 10; #[ -x $INSTALLDIR/$PBX ] || exit 5 start() { echo -n "Starting PBX: " daemon --pidfile $PID_FILE $INSTALLDIR/$PBX $PBX_OPTIONS RETVAL=$? echo [ $RETVAL -eq 0 ] && touch $LOCK_FILE return $RETVAL } stop() { echo -n "Stopping PBX: " killproc -p $PID_FILE $PBX RETVAL=$? echo [ $RETVAL -eq 0 ] && rm -f $LOCK_FILE return $RETVAL } case "$1" in start) start ;; stop) stop ;; restart) stop start ;; force-reload) stop start ;; status) status -p $PID_FILE $PBX RETVAL=$? ;; *) echo $"Usage: $0 {start|stop|restart|force-reload|status}" exit 2 esac exit $RETVAL [r...@node-03 ~]# ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] crm resource move doesn't move the resource
On 8 October 2010 09:29, Andrew Beekhof wrote: > On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis > wrote: >> On 8 October 2010 08:29, Andrew Beekhof wrote: >>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis >>> wrote: >>>> >>>> >>>> On 7 October 2010 09:01, Andrew Beekhof wrote: >>>>> >>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis >>>>> wrote: >>>>> > Hi, >>>>> > >>>>> > I am having again the same issue, in a different set of 3 nodes. When I >>>>> > try >>>>> > to failover manually the resource group on the standby node, the ms-drbd >>>>> > resource is not moved as well and as a result the resource group is not >>>>> > fully started, only the ip resource is started. >>>>> > Any ideas why I am having this issue? >>>>> >>>>> I think its a bug that was fixed recently. Could you try the latest >>>>> from code Mercurial? >>>> >>>> 1.1 or 1.2 branch? >>> >>> 1.1 >>> >> to save time on compiling stuff I want to use the available rpms on >> 1.1.3 version from rpm-next repo. >> But before I go and recreate the scenario, which means rebuild 3 >> nodes, I would like to know if this bug is fixed in 1.1.3 > > As I said, I believe so. I recreated the 3 node cluster and I didn't face that issue, but I am going to keep an eye on it for few days and even rerun the whole scenario (recreate 3 node cluster ...) just to be very sure. If I don't the see it again I will also close the bug report Thanks, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] crmd thinks lsb returns error on monitor
On 10 October 2010 17:40, Andrew Beekhof wrote: > On Sun, Oct 10, 2010 at 12:47 AM, Pavlos Parissis > wrote: >> Hi, >> >> My resource is not started because I get this >> >> 00:44:27 crmd: [3141]: WARN: status_from_rc: Action 16 >> (pbx_02_monitor_0) on node-02 failed (target: 7 vs. rc: 5): Error >> >> but when I run manually the status I get 3, which ok because the >> application is stopped >> >> [r...@node-02 ~]# /etc/init.d/znd-pbx_02 status >> pbx_02 is stopped >> [r...@node-02 ~]# echo $? >> 3 >> >> why does crm get error in this case? > > I imagine because when pacemaker ran it, the script didn't return 3. > Pacemaker got 5 because the script returns 5 when the application is not available on the system, which happens only when the fs is not active. What actually happened in this particular case is that the start action on the fs and on the resource which holds the application started in the same second. I am pretty sure that the start of the application resource went too fast: at the time the LSB script was executed the fs was not available, even though the fs resource returned 0 on start and on the first monitor. This issue doesn't always happen, but if I put a sleep in the LSB script for the application resource I don't run into it. The resources are in a group with order ip, fs, app. I also removed the exit code 5 from the LSB script, as it confuses the cluster when the monitor action takes place on the slave node. Cheers, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] unpack_rsc_op: Hard error
On 10 October 2010 17:39, Andrew Beekhof wrote: > On Sat, Oct 9, 2010 at 11:20 PM, Pavlos Parissis > wrote: >> Hi, >> >> Does anyone know why PE wants to unpack resources on nodes that will >> never run due to location constraints? > > Because part of its job is to make sure they dont run there. > >> I am getting this messages and I am wondering if they harmless or not. > > Basically yes. We've since reduced this to an informational message. > So, it is not necessary to place the LSB script of a resource on nodes where the resource will never run due to location constraints. Am I right? Cheers, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] resource is stuck
On 11 October 2010 11:12, Pavlos Parissis wrote: > Hi, > > Cluster got an error on monitor and stop action on a resource and > since then I can't do stop/start/manage/unmanage that resource. > For some strange reason the actions monitor/stop failed, manually > worked, but i can't figure out why they failed when cluster run status > and stop on the specific lsb resource. > > The issue now is that I can't do anything about that resource, even I > have cleared out the failcount counter. > > How can i escape from the situation? > > hb_report attached > > Regards, > Pavlos > After reading the "Configuration Explained" document again and again, and especially page 18, I found a solution. Adding on-fail="stop" for the monitor/stop/start operations on the resource got me out of that situation. After I added this setting, the cluster initiated a stop action, which was successful! The resource was stuck, actually blocked, because blocking is the default action when a stop action fails and STONITH is disabled. Blame me for not remembering page 18 :-) Cheers, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
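The fix described above can be sketched in crm shell syntax. This is a hedged sketch only: the resource name `app_01` and the timeouts are illustrative, not taken from the attached hb_report.

```shell
# Without STONITH, a failed stop leaves the resource blocked (the default
# behavior when stonith-enabled=false). Setting on-fail="stop" asks the
# cluster to stop the resource and leave it stopped instead of wedging it.
primitive app_01 lsb:app_01 \
    op monitor interval="20s" timeout="20s" on-fail="stop" \
    op start interval="0" timeout="60s" on-fail="stop" \
    op stop interval="0" timeout="60s" on-fail="stop"
```

The same pattern appears in the sshd primitives of the configuration posted later in this archive.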
[Pacemaker] 1st monitor is too fast after the start
Hi, I noticed a race condition while I was integrating an application with Pacemaker and thought I would share it with you. The init script of the application is LSB-compliant and passes the tests mentioned in the Pacemaker documentation. Moreover, the init script uses the functions supplied by the system[1] for starting, stopping and checking the application. I observed a few times that the monitor action was failing after the startup of the cluster or the movement of the resource group. Because it was not happening every time, and a manual start/status always worked, it was quite tricky to find the root cause of the failure. After a few hours of troubleshooting, I found out that the 1st monitor action after the start action was executed too fast for the application to create its pid file. As a result, the monitor action was receiving an error. I know it sounds a bit strange, but it happened on my systems. The fact that my systems are basically VMware images on a laptop may be related to the issue. Nevertheless, I would like to ask if you are considering implementing an "init_wait" for the 1st monitor action. It could be useful. To solve my issue I put a sleep after the start of the application in the init script. This gives the application enough time to create its pid file, and the 1st monitor doesn't fail. Cheers, Pavlos [1] CentOS 5.4 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
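Instead of a fixed sleep, the start path of the init script could poll for the pid file with a bounded wait. This is a sketch under stated assumptions: the function name, the pid-file path in the usage comment, and the 10-second cap are all hypothetical and not part of the original script.

```shell
# Poll until the pid file exists and is non-empty, so the first cluster
# monitor cannot outrun the daemon's pid-file creation. Gives up and
# returns non-zero after $timeout seconds.
wait_for_pidfile() {
    pidfile="$1"
    timeout="${2:-10}"
    waited=0
    until [ -s "$pidfile" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical usage right after launching the daemon in start():
#   wait_for_pidfile /var/run/pbx_01.pid 10 || return 1
```

Compared with a blind `sleep 1`, this returns as soon as the pid file appears and also turns "the daemon never wrote its pid file" into an explicit start failure.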
[Pacemaker] sshd under cluster
Hi, I was asked to place the sshd daemon under cluster control, and because I faced a few challenges, I thought to share them with you. The 1st challenge was to clone the sshd daemon, its init script and its configuration. The procedure is at the bottom of this mail. The 2nd challenge was the init script of sshd in CentOS. It has 2 issues. The 1st issue was that it fails test 6 mentioned here: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html. The 2nd issue was that during shutdown or reboot of the cluster node, the stop action on the resource was receiving return code 143 from the init script, and the whole shutdown/reboot process was stuck for a few minutes. The root cause was the killall command called by the init script. The init script calls killall, only on shutdown or reboot, to close any open connections. But that call was also killing the script itself! Because of that, the cluster was getting an error on the stop action, and the lock file of sshd was not removed either. You can imagine the consequences. For both issues I filed a bug report and hacked the init script to get a short-term resolution. The last challenge was related to a mail I sent a few hours ago. The 1st monitor action after the start action was too fast, and sshd didn't have enough time to create its pid file. As a result the monitor thought that sshd was down, but it wasn't. A sleep 1 after the start function in the init script solved the issue. Cheers, Pavlos Clone SSH for pbx_0N Prerequisite: the default sshd must listen only on the node's IPs and not on all IPs. 
cp -p /etc/init.d/sshd /etc/init.d/sshd-pbx_02 cp -p /etc/pam.d/sshd /etc/pam.d/sshd-pbx_02 # optional because it is needed only if UsePam true - On RH is true by default ln -s /usr/sbin/sshd /usr/sbin/sshd-pbx_02 touch /etc/sysconfig/sshd-pbx_02 echo 'OPTIONS="-f /etc/ssh/sshd_config-pbx_02"' > /etc/sysconfig/sshd-pbx_02 cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config-pbx_02 [r...@node-02 ~]# diff -wu /etc/init.d/sshd /etc/init.d/sshd-pbx_02 --- /etc/init.d/sshd2009-09-03 20:12:38.0 +0200 +++ /etc/init.d/sshd-pbx_02 2010-10-12 12:25:50.0 +0200 @@ -1,33 +1,33 @@ -#!/bin/bash +#!/bin/bash -x # -# Init file for OpenSSH server daemon +# Init file for OpenSSH server daemon used by pbx_02 # # chkconfig: 2345 55 25 -# description: OpenSSH server daemon +# description: OpenSSH server daemon for pbx_02 # -# processname: sshd -# config: /etc/ssh/ssh_host_key -# config: /etc/ssh/ssh_host_key.pub +# processname: sshd-pbx_02 +# config: /etc/ssh/ssh_host_key-pbx_02 +# config: /etc/ssh/ssh_host_key-pbx_02.pub # config: /etc/ssh/ssh_random_seed -# config: /etc/ssh/sshd_config -# pidfile: /var/run/sshd.pid +# config: /etc/ssh/sshd_config-pbx_02 +# pidfile: /var/run/sshd-pbx_02.pid # source function library . /etc/rc.d/init.d/functions # pull in sysconfig settings -[ -f /etc/sysconfig/sshd ] && . /etc/sysconfig/sshd +[ -f /etc/sysconfig/sshd-pbx_02 ] && . /etc/sysconfig/sshd-pbx_02 RETVAL=0 -prog="sshd" +prog="sshd-pbx_02" # Some functions to make the below more readable KEYGEN=/usr/bin/ssh-keygen -SSHD=/usr/sbin/sshd -RSA1_KEY=/etc/ssh/ssh_host_key -RSA_KEY=/etc/ssh/ssh_host_rsa_key -DSA_KEY=/etc/ssh/ssh_host_dsa_key -PID_FILE=/var/run/sshd.pid +SSHD=/usr/sbin/sshd-pbx_02 +RSA1_KEY=/etc/ssh/ssh_host_key-pbx_02 +RSA_KEY=/etc/ssh/ssh_host_rsa_key-pbx_02 +DSA_KEY=/etc/ssh/ssh_host_dsa_key-pbx_02 +PID_FILE=/var/run/sshd-pbx_02.pid runlevel=$(set -- $(runlevel); eval "echo \$$#" ) @@ -110,7 +110,11 @@ echo -n $"Starting $prog: " $SSHD $OPTIONS && success || failure RETVAL=$? 
- [ "$RETVAL" = 0 ] && touch /var/lock/subsys/sshd + [ "$RETVAL" = 0 ] && touch /var/lock/subsys/sshd-pbx_02 +# to avoid a race condition, 1st cluster monitor after start fails +# because the pid file is not created yet. Few msecs detail on the +# creation of pid file is enough to cause issues. +sleep 1 echo } @@ -119,16 +123,25 @@ echo -n $"Stopping $prog: " if [ -n "`pidfileofproc $SSHD`" ] ; then killproc $SSHD + elif [ -z "`pidfileofproc $SSHD`"] && [ ! -f /var/lock/subsys/sshd-pbx_02 ] ; then +success +RETVAL=0 else failure $"Stopping $prog" fi RETVAL=$? + +### Added by Pavlos Parissis ### +# Disable the below bit because killall kills the script itself. +# This causes problems within the cluster, shutdown of a node fails. +# Any open connections will be killed by /etc/init.d.halt anyways + # if we are in halt or reboot runlevel kill all running sessions # so the TCP connections are closed cleanly - if [ "x$runlevel&qu
Re: [Pacemaker] Migrate resources based on connectivity
On 12 October 2010 20:00, Dan Frincu wrote: > Hi, > > Lars Ellenberg wrote: > > On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote: > > > Hi, > > Dejan Muhamedagic wrote: > > > Hi, > > On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote: > > > Hi, > > I have the following setup: > - order drbd0:promote drbd1:promote > - order drbd1:promote drbd2:promote > - order drbd2:promote all:start > - collocation all drbd2:Master > - all is a group of resources, drbd{0..3} are drbd ms resources. > > I want to migrate the resources based on ping connectivity to a > default gateway. Based on > http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks > and http://www.clusterlabs.org/wiki/Example_configurations I've > tried the following: > - primitive ping ocf:pacemaker:ping params host_list=1.2.3.4 > multiplier=100 op monitor interval=5s timeout=5s > - clone ping_clone ping meta globally-unique=false > - location ping_nok all \ > rule $id="ping_nok-rule" -inf: not_defined ping_clone or > ping_clone number:lte 0 > > > Use pingd to reference the attribute in the location constraint. > > > Not to be disrespectful, but after 3 days being stuck on this issue, > I don't exactly understand how to do that. Could you please provide > an example. > > Thank you in advance. > > > The example you reference lists: > > primitive pingdnet1 ocf:pacemaker:pingd \ > params host_list=192.168.23.1 \ > name=pingdnet1 > ^^ > > clone cl-pingdnet1 pingdnet1 > ^ > > param name default is pingd, > and is the attribute name to be used in the location constraints. > > You will need to reference pingd in you location constraint, or set an > explicit name in the primitive definition, and reference that. > > Your ping primitive sets the default 'pingd' attribute, > but you reference some 'ping_clone' attribute, > which apparently no-one really references. > > > > I've finally managed to finish the setup with the indications received > above, the behavior is the expected one. 
Also, I've tried the ocf:pacemaker:pingd and even though it does the reachability tests properly, it fails to update the CIB upon restoring connectivity; I had to manually run attrd_updater -R to get the resources to start again, therefore I'm going with ocf:pacemaker:ping. It would be quite useful for the rest of us if you posted your final, working configuration. Cheers, Pavlos ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
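Per Lars's explanation quoted above, the attribute tested in the location rule must match the pingd resource's `name` parameter (which defaults to `pingd`), not the primitive or clone id. A sketch in crm shell syntax, reusing the ids from the example he references (`pingdnet1`, `cl-pingdnet1`); the `all` group and the host IP are taken from earlier in this thread and are otherwise illustrative:

```shell
# The pingd RA writes a node attribute whose name is its "name" parameter.
primitive pingdnet1 ocf:pacemaker:pingd \
    params host_list="192.168.23.1" name="pingdnet1" \
    op monitor interval="5s" timeout="5s"
clone cl-pingdnet1 pingdnet1
# The rule references the attribute name, not the clone id "cl-pingdnet1":
location ping_nok all \
    rule -inf: not_defined pingdnet1 or pingdnet1 number:lte 0
```

In the original non-working configuration the rule tested `ping_clone` (the clone id), an attribute that nothing ever set, so the constraint never fired.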
Re: [Pacemaker] 1st monitor is too fast after the start
On 13 October 2010 09:48, Dan Frincu wrote: > Hi, > > I've noticed the same type of behavior, however in a different context, my > setup includes 3 drbd devices and a group of resources, all have to run on > the same node and move together to other nodes. My issue was with the first > resource that required access to a drbd device, which was the > ocf:heartbeat:Filesystem RA trying to do a mount and failing. > > The reason, it was trying to do the mount of the drbd device before the drbd > device had finished migrating to primary state. Same as you, I introduced a > start-delay, but on the start action. This proved to be of no use as the > behavior persisted, even with an increased start-delay. However, it only > happened when performing a fail-back operation, during fail-over, everything > was ok, during fail-back, error. > > The fix I've made was to remove any start-delay and to add group collocation > constraints to all ms_drbd resources. Before that I only had one collocation > constraint for the drbd device being promoted last. > > I hope this helps. > I am glad that somebody else experienced the same issue :) In my mail I was talking about the monitor action failing, but the behavior you described also happened on my system with the same setup (drbd and fs resource). It also happened on the application resource: the start was too fast and the FS was not (yet) mounted when the start action fired for the application resource. A delay in the start function of the application's resource agent fixed my issue. 
In my setup I have all the necessary constraints to avoid this, at least that is what I believe :-)

Cheers,
Pavlos

[r...@node-01 sysconfig]# crm configure show
node $id="059313ce-c6aa-4bd5-a4fb-4b781de6d98f" node-03
node $id="d791b1f5-9522-4c84-a66f-cd3d4e476b38" node-02
node $id="e388e797-21f4-4bbe-a588-93d12964b4d7" node-01
primitive drbd_01 ocf:linbit:drbd \
	params drbd_resource="drbd_pbx_service_1" \
	op monitor interval="30s" \
	op start interval="0" timeout="240s" \
	op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
	params drbd_resource="drbd_pbx_service_2" \
	op monitor interval="30s" \
	op start interval="0" timeout="240s" \
	op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
	params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
	meta migration-threshold="3" failure-timeout="60" \
	op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
	params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
	meta migration-threshold="3" failure-timeout="60" \
	op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
	params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
	meta failure-timeout="120" migration-threshold="3" \
	op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
	meta failure-timeout="120" migration-threshold="3" \
	params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
	op monitor interval="5s"
primitive pbx_01 lsb:znd-pbx_01 \
	meta migration-threshold="3" failure-timeout="60" target-role="Started" \
	op monitor interval="20s" timeout="20s" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="60s"
primitive pbx_02 lsb:znd-pbx_02 \
	meta migration-threshold="3" failure-timeout="60" \
	op monitor interval="20s" timeout="20s" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="60s"
primitive sshd_01 lsb:znd-sshd-pbx_01 \
	meta target-role="Started" is-managed="true" \
	op monitor on-fail="stop" interval="10m" \
	op start interval="0" timeout="60s" on-fail="stop" \
	op stop interval="0" timeout="60s" on-fail="stop"
primitive sshd_02 lsb:znd-sshd-pbx_02 \
	meta target-role="Started" \
	op monitor on-fail="stop" interval="10m" \
	op start interval="0" timeout="60s" on-fail="stop" \
	op stop interval="0" timeout="60s" on-fail="stop"
group pbx_service_01 ip_01 fs_01 pbx_01 sshd_01 \
	meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 sshd_02
ms ms-drbd_01 drbd_01 \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
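For reference, the start-delay Dan and I mention is an operation attribute; as a crm configuration fragment it would look roughly like this (a sketch only — the 10s value is arbitrary, and the resource shown is just one from the configuration above):

```
primitive fs_01 ocf:heartbeat:Filesystem \
	params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
	op monitor interval="20s" timeout="40s" start-delay="10s" \
	op start interval="0" timeout="60s"
```

As discussed above, this turned out to be the wrong fix for the fail-back case; the collocation constraints were what actually solved it.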
Re: [Pacemaker] 1st monitor is too fast after the start
On 13 October 2010 10:50, Dan Frincu wrote:
> From what I see you have a dual primary setup with failover on the third node. Basically, if you have one drbd resource for which you have both ordering and collocation, I don't think you need to "improve" it; if it ain't broke, don't fix it :)
> Regards,

No, I don't have dual-primary. My DRBD is in single-primary mode for both DRBD resources. I use an N+1 setup: I have 2 resource groups which have a unique primary and a shared secondary node. The pbx_service_01 resource group has primary node-01 and secondary node-03; the pbx_service_02 resource group has primary node-02 and secondary node-03. I use an asymmetric cluster with specific location constraints in order to implement the above. The DRBD resource will never be in primary mode on 2 nodes at the same time. I have set specific collocation and order constraints in order to "bond" each DRBD ms resource to the appropriate resource group.

I hope it is clear now.

Cheers and thanks for looking at my conf,
Pavlos
Re: [Pacemaker] crm resource move doesn't move the resource
On 11 October 2010 11:16, Pavlos Parissis wrote:
> On 8 October 2010 09:29, Andrew Beekhof wrote:
>> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis wrote:
>>> On 8 October 2010 08:29, Andrew Beekhof wrote:
>>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis wrote:
>>>>> On 7 October 2010 09:01, Andrew Beekhof wrote:
>>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis wrote:
>>>>>> > Hi,
>>>>>> > I am having again the same issue, in a different set of 3 nodes. When I try to fail over the resource group manually to the standby node, the ms-drbd resource is not moved as well, and as a result the resource group is not fully started; only the ip resource is started.
>>>>>> > Any ideas why I am having this issue?
>>>>>> I think its a bug that was fixed recently. Could you try the latest code from Mercurial?
>>>>> 1.1 or 1.2 branch?
>>>> 1.1
>>> To save time on compiling stuff I want to use the available rpms of the 1.1.3 version from the rpm-next repo. But before I go and recreate the scenario, which means rebuilding 3 nodes, I would like to know if this bug is fixed in 1.1.3.
>> As I said, I believe so.
> I recreated the 3 node cluster and I didn't face that issue, but I am going to keep an eye on it for a few days and even rerun the whole scenario (recreate the 3 node cluster ...) just to be very sure. If I don't see it again I will also close the bug report.
> Thanks,
> Pavlos

I recreated the 3-node cluster using the 1.1.3 version just to see if it is solved, but the issue appeared again. So, Andrew, the issue is not solved in 1.1.3. I am going to update the bug report accordingly.
Cheers,
Pavlos
Re: [Pacemaker] Active-Active HA Firewall
On 15 October 2010 09:47, Marcel Hauser wrote:
>> But that is no problem. Firewalling is no hard job any more. A reasonable machine can firewall 1 GBit/s of traffic.
> Valid point. My only "concern" is/was that i don't like the idea of a passive firewall, because when you need it to fail over (maybe after 2 years :-) ) you may just realize that it's somehow broken too.
> a monitoring system should help you out on this.
> In an active-active like setup you basically know that both systems are actually working as expected.
>>> - how would you guys detect a firewall failure on any node (pingd ??)... and if a failure occurs... will the crm automatically unconfigure the cloned ip's on that node ?
>> pingd to check the availability of the attached network. The cluster resource manager takes care of the failover. See the "from scratch" doc.
> Yes i've read that in the docs. But is this really common practice for firewall clusters ? i don't want the firewall to fail over if i'm having "internal problems with internal hosts/pingable addresses"!?
> otherwise i have to build an internal ping cluster ;-)

I have always believed that you should only trigger a failover when something that is needed to offer the service is not available (a disk, a filesystem, a NIC, etc.). Having said that, I believe a firewall, in order to be operational, needs access to common elements like disk/fs/NIC and, on top of that, to the uplink routers or to any routers that are part of its routing table. Furthermore, a firewall needs access to any layer-2 switch which gives it access to the attached LANs. But deciding which elements should be part of the "health system" depends on the network design and on whether layer-2 or layer-3 redundancy exists in your environment.
If layer-2 or layer-3 redundancy is not available, then it makes little sense to add those elements to your "health system", because in case of a failure the element won't be accessible from the standby firewall either.

> why did you choose to run conntrackd and heartbeat over a dedicated bonding interface in your pdf, compared to the FW Builder docs which say to run heartbeat over every interface of the firewall, which therefore might enable the cluster to detect network card failures... because the heartbeat is not received over a given failed interface anymore ?
>> Rumors say that there is a good German book about clusters from O'Reilly. In the examples chapter the author describes exactly the setup you mentioned. ;-)
> :-) i've seen that... but i hate reading books (no matter on what topic)... and my learning curve is much more efficient if i learn it myself :-)

I did a quick search and I couldn't find it; what is the name of the book?

> but thanks for the hint... and i really appreciate your and any other help!
Re: [Pacemaker] Help understanding why a failover occurred.
On 16 October 2010 00:45, Jai wrote:
> I have setup a DRBD->Xen failover cluster. Last night at around 02:50 it failed the resources from server "bravo" to "alpha". I'm trying to find out what caused the failover of resources. I don't see anything in the logs that indicates the cause, but I don't really know what to look for. If someone could help me understand these logs and what I'm looking for, that would be great. I'm not even sure how far back I need to go.

I don't see anything either, but I am not surprised by that. I have seen a similar issue on my cluster where the logs weren't that helpful.

Cheers,
Pavlos
[Pacemaker] using xml for rules
Hi,

I am trying to make a rule to control fail-back of resources. I want resource-stickiness to be 1000 during working days from 06:00 to 23:00 and on weekends from 08:00 to 16:00, and zero during the remaining hours, so the cluster can fail back any resource which failed over during the working hours. I wrote the following, but I get the error below [1]. I am not an XML guru, so I must have made some stupid mistake here. Any hints?

[1]
cibadmin -V --replace --obj_type rsc_defaults --xml-file tmp.xml
Call cib_replace failed (-47): Update does not conform to the configured schema/DTD
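The XML itself did not survive the archive. As a reference point, a time-based rsc_defaults rule could be sketched along the following lines — untested, the ids are made up, and the exact date_spec semantics and set-scoring behavior should be checked against Pacemaker Explained:

```xml
<rsc_defaults>
  <!-- stickiness 1000 during working hours: Mon-Fri 06:00-23:00, Sat-Sun 08:00-16:00 -->
  <meta_attributes id="rsc-options-work-hours" score="10">
    <rule id="work-hours-rule" boolean-op="or" score="10">
      <date_expression id="weekday-hours" operation="date_spec">
        <date_spec id="weekday-hours-spec" weekdays="1-5" hours="6-22"/>
      </date_expression>
      <date_expression id="weekend-hours" operation="date_spec">
        <date_spec id="weekend-hours-spec" weekdays="6-7" hours="8-15"/>
      </date_expression>
    </rule>
    <nvpair id="work-hours-stickiness" name="resource-stickiness" value="1000"/>
  </meta_attributes>
  <!-- fallback: stickiness 0 the rest of the time, so fail-back can happen -->
  <meta_attributes id="rsc-options-off-hours" score="0">
    <nvpair id="off-hours-stickiness" name="resource-stickiness" value="0"/>
  </meta_attributes>
</rsc_defaults>
```

Note that hours="6-22" covers 06:00 up to 22:59, i.e. until 23:00; the higher-scored attribute set is meant to win while its rule matches.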
Re: [Pacemaker] Help understanding why a failover occurred.
On 18 October 2010 04:03, Tim Serong wrote:
> On 10/16/2010 at 09:45 AM, Jai wrote:
>> I have setup a DRBD->Xen failover cluster. Last night at around 02:50 it failed the resources from server "bravo" to "alpha". I'm trying to find out what caused the failover of resources. I don't see anything in the logs that indicate the cause but I don't really know what to look for. If someone could help me understand these logs and what I'm looking for would be great. I'm not even sure how far back I need to go.
> I reckon it's this:
> Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 161: pingval=0
> Which suggests bravo lost connectivity to 12.12.12.1 around that time, causing the failover.
> For reference, if you're looking at pengine logs... A few lines above where it says "info: process_pe_message: Transition NNN: PEngine Input stored in: /var/lib/pengine/pe-input-MMM.bz2", you'll see what it's about to do to your resources. If this is just "Leave resource FOO (Started/Master/Slave etc.)", that transition is probably boring. If it says "Start FOO (...)" or "Promote/Demote/Stop FOO (...)", it means something has changed. Scroll up a bit, to above where pengine is saying "unpack_config", "determine_node_status" etc., and you should see a message suggesting the cause for the change (failed op, timeout, ping attribute modified, etc.). It might be a bit inscrutable sometimes, but it'll be there somewhere...
> HTH

These are very useful tips for understanding the logs.

Pavlos
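Tim's recipe boils down to a couple of greps. A sketch against a fabricated log excerpt (the paths, PIDs and resource name below are made up for illustration; on a real node you would grep /var/log/messages or wherever your cluster logs):

```shell
# Fabricated sample of what the relevant pengine/attrd lines look like.
cat > /tmp/pengine-sample.log <<'EOF'
Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 161: pingval=0
Oct 16 02:46:05 bravo pengine: [25099]: notice: LogActions: Stop vm_foo (bravo)
Oct 16 02:46:05 bravo pengine: [25099]: info: process_pe_message: Transition 42: PEngine Input stored in: /var/lib/pengine/pe-input-99.bz2
EOF

# 1. Find the transitions...
grep 'process_pe_message: Transition' /tmp/pengine-sample.log
# 2. ...then see which of them actually decided to change something.
grep -E 'LogActions: (Start|Stop|Promote|Demote|Move)' /tmp/pengine-sample.log
```

Anything the second grep returns is a transition worth scrolling up from, as Tim describes.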
Re: [Pacemaker] Help understanding why a failover occurred.
On 18 October 2010 05:17, Jai wrote:
>> I don't see anything either, but I am not surprised by that. I have seen a similar issue on my cluster where logs weren't that helpful.
> Does it still occur on your cluster?

No, I haven't seen it again. But it could be that I simply couldn't see it in the logs. If I were you, I would follow the advice from Tim; you may find out what really happened.

Pavlos
Re: [Pacemaker] Question: How many nodes can join a cluster?
On 18 October 2010 10:52, Florian Haas wrote:
> - Original Message -
> From: "Andreas Vogelsang"
> To: pacemaker@oss.clusterlabs.org
> Sent: Monday, October 18, 2010 9:46:12 AM
> Subject: [Pacemaker] Question: How many nodes can join a cluster?
> Hello,
> I'm creating a presentation about a virtual Linux-HA cluster. I just asked myself how many nodes pacemaker can handle. Mr. Schwartzkopff wrote in his book that Linux-HA version 2 can handle up to 16 nodes. Is this also true for pacemaker?

I have been asked the same question and I said to them: let's say it is 126 — what is the use of having 126 nodes in the cluster? Can someone imagine himself going through the logs to find why resource-XXX failed while there are 200 resources?!! The only use of having 126 nodes is if you want HPC, but HPC is a totally different story than high-availability clusters. Even in an N+N setup I wouldn't go with more than 4 or 6 nodes.

My 2 cents,
Pavlos
Re: [Pacemaker] Question: How many nodes can join a cluster?
On 18 October 2010 11:13, Dan Frincu wrote:
> Pavlos Parissis wrote:
>> I have been asked the same question and I said to them: let's say it is 126 — what is the use of having 126 nodes in the cluster? Can someone imagine himself going through the logs to find why resource-XXX failed while there are 200 resources?!! The only use of having 126 nodes is if you want HPC, but HPC is a totally different story than high-availability clusters. Even in an N+N setup I wouldn't go with more than 4 or 6 nodes.
> Actually, the syslog_facility in corosync.conf allows you to specify either a log file for each node in the cluster (locally), or setting up a remote syslog server. Either way, identifying the node by hostname or some other identifier should point out what is going on where. Granted, it's a large amount of data to process, therefore (such is the case with any large deployment) SNMP is a much better alternative for tracking issues, or (if you have _126_ times the same resource) adding some notification options to the RA might be a choice, such as an SNMP trap, or even email.
> BTW, I'm also interested in this, I remember reading something about 64 nodes, but I'd appreciate an official response.

Have you ever done troubleshooting on a 4-node cluster at 01:00 at night? Believe me, it is not fun.
I am not saying there are no use cases which require a lot of nodes, but I have my doubts that there are many such use cases for high-availability clusters. Adding nodes and services without a second thought increases complexity, which is one of the main root causes of major problems.

Cheers,
Pavlos
Re: [Pacemaker] Move DRBD master
On 19 October 2010 01:18, Vadym Chepkov wrote:
> Hi,
> What is the crm shell command to move a drbd master to a different node?

Take a look at this: http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg06300.html
Re: [Pacemaker] unpack_rsc_op: Hard error
On 19 October 2010 14:16, Andrew Beekhof wrote:
> On Mon, Oct 11, 2010 at 11:25 AM, Pavlos Parissis wrote:
>> On 10 October 2010 17:39, Andrew Beekhof wrote:
>>> On Sat, Oct 9, 2010 at 11:20 PM, Pavlos Parissis wrote:
>>>> Hi,
>>>> Does anyone know why the PE wants to unpack resources on nodes where they will never run due to location constraints?
>>> Because part of its job is to make sure they dont run there.
>>>> I am getting these messages and I am wondering if they are harmless or not.
>>> Basically yes. We've since reduced this to an informational message.
>> So, it is not necessary to place the LSB script of a resource on nodes where the resource will never run, due to location constraints. Am I right?
> Correct, though the probes might show up in crm_mon as "failed".

Even though that is correct, I placed the script on all nodes, just to avoid the warnings.

Thanks,
Pavlos
Re: [Pacemaker] Failover domains?
On 25 October 2010 19:50, David Quenzler wrote:
> Is there a way to limit failover behavior to a subset of cluster nodes or pin a resource to a node?

Yes, there is a way. Make sure you have an asymmetric cluster by setting symmetric-cluster to false, and then configure your location constraints accordingly in order to get the failover domains you wish. Here is an example from my cluster, where I have 3 nodes and 2 resource groups. Each resource group has a unique primary node, but both of them share the same secondary node.

location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03

Cheers,
Pavlos
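The opt-in switch itself is a cluster property; as a configuration fragment it would look roughly like this (a sketch — the $id shown is the usual bootstrap options set, so verify it against your own cib before applying):

```
property $id="cib-bootstrap-options" \
	symmetric-cluster="false"
```

With this in place, resources run only where a location constraint gives them a positive score, which is exactly what makes the constraints above act as failover domains.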
[Pacemaker] AP9606 fencing device
Hi,

I have an APC AP9606 PDU and I am trying to find a stonith agent which works with that PDU. The apcmaster and apcmastersnmp agents don't work, as you can see below. I managed to get rackpdu working by setting the outlet config (since the OID for snmpwalk fails) and also setting the command OID. Here is the full command:

stonith -t external/rackpdu hostlist=node-01,node-02,node-03 pduip=192.168.100.100 oid=.1.3.6.1.4.1.318.1.1.4.4.2.1.3 community=private outlet_config=/tmp/outlet_config -T on node-01

Does anyone know any other PDU which works out of the box with the supplied stonith agents?

Regards,
Pavlos

[r...@node-01 ~]# stonith -t apcmastersnmp ipaddr=192.168.100.100 port=161 community=private -S
** (process:3887): CRITICAL **: APC_read: error in response packet, reason 2 [(noSuchName) There is no such variable name in this MIB.].
** (process:3887): CRITICAL **: apcmastersnmp_set_config: cannot read number of outlets.
Invalid config info for apcmastersnmp device
Valid config names are: ipaddr port community

[r...@node-01 ~]# stonith -t apcmaster ipaddr=192.168.100.100 login=stonith password=stonith -S
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
** (process:4215): CRITICAL **: Received [\xff\xfb\u0001\xff\xfb\u0003\xff\xfd\u0003 \u000dUser Name : ]
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
** (process:4215): CRITICAL **: Received [] connect() failed: Connection reset by peer
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
** (process:4215): CRITICAL **: Received [] connect() failed: Connection reset by peer
connect() failed: Connection reset by peer
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
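For the archives: the working stonith invocation above, expressed as a cluster resource, would presumably look something like the following — a sketch only, since the primitive name is made up here and whether external/rackpdu accepts outlet_config as a resource parameter should be verified against the agent itself:

```
primitive pdu-fencing stonith:external/rackpdu \
	params pduip="192.168.100.100" community="private" \
		hostlist="node-01,node-02,node-03" \
		oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" \
		outlet_config="/tmp/outlet_config"
```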
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 13:12, Vadym Chepkov wrote:
> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
>> Does anyone know any other PDU which works out of the box with the supplied stonith agents?
> I use APC AP7901, works like a charm:
> primitive pdu stonith:external/rackpdu \
>	params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO"
> clone fencing pdu
> Vadym

Then most likely the default OIDs of the rackpdu agent match the OIDs of the AP7901. In my case I have to use the OID 1.3.6.1.4.1.318.1.1.4.4.2.1.3 for the device itself and the OID .1.3.6.1.4.1.318.1.1.4.4.2.1.4 for retrieving (via snmpwalk) the outlet list.

Hold on a sec, are you using a clone on the AP7901? Does it support multiple connections? Mine doesn't.

Cheers,
Pavlos
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 13:43, Vadym Chepkov wrote:
> On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote:
>> Hold on a sec, are you using a clone on the AP7901? Does it support multiple connections? Mine doesn't.
> Then it's useless regardless of clone or not; you have to have multiple instances, because a server can't reliably fence itself, right?

My understanding is/was that I need to have one stonith resource running on 1 of the 3 nodes in the cluster, and if a fence event has to be triggered then pacemaker will send it to that one stonith resource. I am planning to test that in the coming days.[1] Am I right? If not, then I have to buy a different PDU! :-(

Cheers,
Pavlos

[1] By testing I mean: kill the heartbeat links on 1 node, and the DC node should fence that node.
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 14:09, Vadym Chepkov wrote:
> [...snip...]
>> My understanding is/was that I need to have one stonith resource running on 1 of the 3 nodes in the cluster, and if a fence event has to be triggered then pacemaker will send it to that one stonith resource. I am planning to test that in the coming days. Am I right? If not, then I have to buy a different PDU! :-(
> My understanding is you have to have a fencing device for each of your hosts. Are you sure the one-connection limitation applies to SNMP? Most likely it's only for tcp sessions - ssh/http ?

Valid point, Vadym. SNMP is over UDP, so the communication is connectionless. I am wondering how I can test whether cloning works on this PDU.

> If you look into the rackpdu log you will see this:
> Oct 19 12:39:00 xen-11 stonithd: [8606]: debug: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/rackpdu gethosts'
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_run_cmd: '/usr/lib64/stonith/plugins/external/rackpdu gethosts' output: xen-11 xen-12 Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: running 'rackpdu gethosts' returned 0
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host xen-11
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host xen-12
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_3
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_4
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_5
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_6
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_7
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from the host list for pdu:0
> Check the last line - the agent is smart enough to know it can't fence itself.
> Vadym
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 14:11, Dejan Muhamedagic wrote: > Hi, > > On Wed, Oct 27, 2010 at 01:58:20PM +0200, Pavlos Parissis wrote: > > On 27 October 2010 13:43, Vadym Chepkov wrote: > > > > > > > > On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote: > > > > > > > On 27 October 2010 13:12, Vadym Chepkov wrote: > > > >> > > > >> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote: > > > >>> > > > >>> Does anyone know any other PDU which works out of box with the > > > >>> supplied stonith agents? > > > >>> > > > >> > > > >> I use APC AP7901, works like a charm: > > > >> > > > >> primitive pdu stonith:external/rackpdu \ > > > >>params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO" > > > >> clone fencing pdu > > > >> > > > >> Vadym > > > > > > > > Then most likely the defaults OIDs of the rackpdu agents matches the > > > > OIDs of the AP7901. > > > > In my case I have to use OID for the device itself > > > > 1.3.6.1.4.1.318.1.1.4.4.2.1.3 and OID for retrieving (snmpwalk) the > > > > outlet list .1.3.6.1.4.1.318.1.1.4.4.2.1.4 . > > > > > > > > Hold on a sec, are you using clone on AP7901? Does it support > multiple > > > > connections? Mine it doesn't. > > > > > > Then it's useless regardless clone or not, you have to have multiple > > > instances, because server can't reliable fence itself, right? > > > > > > > > > > > My understanding is/was that I need to have one resource running on 1 of > the > > 3 nodes in the cluster and if a fence event has to be triggered then > > pacemaker will send to it to the one stonith resource. I am planning to > test > > that the coming days.[1] > > Am I right? if not then I have to buy a different PDU! :-( > > Yes. In case a node which is currently running the stonith > resource is to be fenced, then the stonith resource would move > elsewhere first. But, yes, you should test this just like > anything else. Make sure to test both the "node gone" event > (failed links) and a critical action failing (such as stop). 
I am going to test this.

Cheers,
Pavlos
Re: [Pacemaker] AP9606 fencing device
Hi,

I quickly tested cloning of this fencing device and it worked. I used iptables to break the heartbeat link on node-01, and it was fenced by the other node - the DC. In the coming days I will test without cloning the fencing device.

Cheers,
Pavlos
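For reproducibility, the break-the-link trick can be scripted along these lines. The interface name and port are assumptions (eth1, and heartbeat's default udpport 694 from ha.cf); the rules are only printed here rather than executed, since inserting them needs root:

```shell
# Assumed values: adjust eth1 and 694 to your ha.cf (udpport) and topology.
rules="iptables -A INPUT -i eth1 -p udp --dport 694 -j DROP
iptables -A OUTPUT -o eth1 -p udp --dport 694 -j DROP"

# Print the rules that would cut heartbeat traffic on this node.
printf '%s\n' "$rules"
```

To undo the test afterwards, the same rules are removed with -D instead of -A.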
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 14:09, Vadym Chepkov wrote: > > On Oct 27, 2010, at 7:58 AM, Pavlos Parissis wrote: > > > On 27 October 2010 13:43, Vadym Chepkov wrote: > >> >> On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote: >> >> > On 27 October 2010 13:12, Vadym Chepkov wrote: >> >> >> >> On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote: >> >>> >> >>> Does anyone know any other PDU which works out of box with the >> >>> supplied stonith agents? >> >>> >> >> >> >> I use APC AP7901, works like a charm: >> >> >> >> primitive pdu stonith:external/rackpdu \ >> >>params pduip="10.6.6.6" community="pdu-6" hostlist="AUTO" >> >> clone fencing pdu >> >> >> >> Vadym >> > >> > Then most likely the defaults OIDs of the rackpdu agents matches the >> > OIDs of the AP7901. >> > In my case I have to use OID for the device itself >> > 1.3.6.1.4.1.318.1.1.4.4.2.1.3 and OID for retrieving (snmpwalk) the >> > outlet list .1.3.6.1.4.1.318.1.1.4.4.2.1.4 . >> > >> > Hold on a sec, are you using clone on AP7901? Does it support multiple >> > connections? Mine it doesn't. >> >> Then it's useless regardless clone or not, you have to have multiple >> instances, because server can't reliable fence itself, right? >> >> >> > My understanding is/was that I need to have one resource running on 1 of > the 3 nodes in the cluster and if a fence event has to be triggered then > pacemaker will send to it to the one stonith resource. I am planning to test > that the coming days.[1] > Am I right? if not then I have to buy a different PDU! :-( > > > My understanding is you have to have a fencing device for each of your > hosts. Are you sure one connection limitation applies for SNMP? Most likely > it's only for tcp sessions - ssh/http ? 
> If you look into rackpdu log you will see this: > > Oct 19 12:39:00 xen-11 stonithd: [8606]: debug: external_run_cmd: Calling > '/usr/lib64/stonith/plugins/external/rackpdu gethosts' > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_run_cmd: > '/usr/lib64/stonith/plugins/external/rackpdu gethosts' output: xen-11 xen-12 > Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: running > 'rackpdu gethosts' returned 0 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host xen-11 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host xen-12 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host Outlet_3 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host Outlet_4 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host Outlet_5 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host Outlet_6 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host Outlet_7 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu > host Outlet_8 > Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from the > host list for pdu:0 > > check the last line - the agent is smart enough to know it can't fence > itself. > > > Do you enable debug by setting 'debug 1' in ha.cf? Do you see this WARN on your system? stonith-ng: [3369]: WARN: parse_host_line: Could not parse (0 42): /usr/lib/stonith/plugins/external/rackpdu: line 125: local: can only be used in a function Cheers, Pavlos
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 19:23, Vadym Chepkov wrote: > > On Oct 27, 2010, at 1:18 PM, Pavlos Parissis wrote: > > > ok, i have done the same hack but i will remove it. I think 1.1.4 will be > out before we go on production and hopefully this will be fixed in 1.1.4. > > > > This is part of cluster-glue, not pacemaker and it's 1.0.6 now > > > Yeah, you are right and I am wrong.
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 19:25, Vadym Chepkov wrote: > > On Oct 27, 2010, at 1:19 PM, Pavlos Parissis wrote: > > > > On 27 October 2010 17:08, Vadym Chepkov wrote: > >> >> On Oct 27, 2010, at 11:02 AM, Pavlos Parissis wrote: >> >> > BTW >> > here is my conf for the fencing >> > primitive pdu stonith:external/rackpdu \ >> > params community="empisteftiko" >> names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" >> oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="AUTO" pduip="192.168.100.100" >> stonith-timeout="30" >> > clone fencing pdu \ >> > meta target-role="Started" >> > location fencing-on-node-01 fencing 1: node-01 >> > location fencing-on-node-02 fencing 1: node-02 >> > location fencing-on-node-03 fencing 1: node-03 >> > >> > am I missing something? >> > >> >> I would say you have "extra" :) >> why do you need location constraints for this device? >> >> Vadym >> >> > I use symmetric-cluster="false" and if I don't set location constraints the > resource will not start. > > > > > oh, I would expect stonith resources to be "exempt" > > It is a typical resource like all other resources, although I expected the same.
Re: [Pacemaker] AP9606 fencing device
I did more testing using the cloned fencing resource and it worked as I expected.

test1: hacked the init script to return 1 on stop and ran a crm resource move on that resource
result: the node was fenced and the resource was started on the other node

test2: used the firewall to break the heartbeat links on the node holding the resource
result: the node was fenced and the resource was started on the other node

As Dejan suggested, I am going to run the same type of tests with a single (non-cloned) fence resource. In that test I will try to cause fencing of the node which has the fencing resource running on it, and see if pacemaker moves the resource before it fences the node. Cheers, Pavlos
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 19:46, Pavlos Parissis wrote: > I did more testing using the clone type of fencing and worked as I > expected. > > test1 hack init script to return 1 on stop and run a crm resource move on > that resource > result node it was fenced and resource was started on the other node > > test2 using firewall to break the heartbeat links on node with resource > result node it was fenced and resource was started on the other node > > As Dejan suggested I am going to run the same type of tests when 1 fence > resource is used. > In this test I will try to cause a fencing on the node which has fencing > resource running on it and see if pacemaker moves the resource before it > fences the node. > > I did the same tests without cloning, and pacemaker moves the fencing resource before it triggers a reboot on the node where the fencing resource was running. So, a cloned fencing resource and a single fence resource have the same behaviour, at least for these two tests. Now I don't know which configuration I should choose! Cheers, Pavlos
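For anyone wanting to repeat test2, the firewall step can be sketched roughly as below. This is an illustration only: the heartbeat interface name (eth1) is an assumption, and the exact rules Pavlos used are not shown in the thread. Run as root on the node under test.

```shell
# Drop all traffic on the heartbeat interface (assumed to be eth1) so the
# peer node declares this node dead and fences it.
iptables -A INPUT  -i eth1 -j DROP
iptables -A OUTPUT -o eth1 -j DROP

# Cleanup is usually moot: the node is expected to be fenced (rebooted),
# which clears the rules anyway.
```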
Re: [Pacemaker] Multiple independent two-node clusters side-by-side?
this http://www.gossamer-threads.com/lists/linuxha/users/67482?search_string=Redundant%20Rings%20%26quot;Still%20Not%20There%3F;#67482 post has a lot of information for you on this subject. Cheers, Pavlos
Re: [Pacemaker] AP9606 fencing device
On 28 October 2010 10:21, Dejan Muhamedagic wrote: > Hi, > > On Wed, Oct 27, 2010 at 08:15:09PM +0200, Pavlos Parissis wrote: > > On 27 October 2010 19:46, Pavlos Parissis > wrote: > > > > > I did more testing using the clone type of fencing and worked as I > > > expected. > > > > > > test1 hack init script to return 1 on stop and run a crm resource move > on > > > that resource > > > result node it was fenced and resource was started on the other node > > > > > > test2 using firewall to break the heartbeat links on node with resource > > > result node it was fenced and resource was started on the other node > > > > > > As Dejan suggested I am going to run the same type of tests when 1 > fence > > > resource is used. > > > In this test I will try to cause a fencing on the node which has > fencing > > > resource running on it and see if pacemaker moves the resource before > it > > > fences the node. > > > > > > I did the same tests without cloning and pacemaker moves fencing resource > > before triggers a reboot on the node where fencing resource was running. > > So, cloning fencing resource and having just one fence resource have the > > same behaviour! at least for these 2 tests. > > now I don't know which configuration solution I should choose! > > Whichever you feel more comfortable with, providing that the > device really can support multiple connections simultaneously. > I'd opt for non-cloned version. It's simpler, it avoids possible > device contention. > > Thanks, > > Under which conditions does pacemaker initiate multiple connections to a fencing device? Given that the rackpdu agent uses SNMP, I don't quite understand how the cloned version would give me issues. I am making a big assumption here: that connection limits are not applicable when the fencing device is contacted over SNMP.
Furthermore, with the cloned version a fence event triggers faster than with the non-cloned version, because in the non-cloned case the resource must first move to another node if the node to be fenced holds the fencing resource. Because of the above, I selected the cloned version for now. But your mail worries me a bit. What test can I do in order to make sure that the cloned version will not give me issues? Cheers, Pavlos
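One crude way to probe the multiple-connections question is to run two SNMP walks against the PDU concurrently and see whether both complete. This is a sketch only: the community string and OID are copied from the configuration earlier in the thread, and a successful run would not conclusively prove the device has no connection limits.

```
snmpwalk -v1 -c empisteftiko 192.168.100.100 .1.3.6.1.4.1.318.1.1.4.4.2.1.4 &
snmpwalk -v1 -c empisteftiko 192.168.100.100 .1.3.6.1.4.1.318.1.1.4.4.2.1.4 &
wait
```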
[Pacemaker] Pacemaker-1.1.4, when?
Hi, When do we expect to have Pacemaker-1.1.4 available? Cheers, Pavlos
Re: [Pacemaker] Impossible to add a 4th node to a cluster
On 28 October 2010 16:09, Guillaume Chanaud wrote: > Hello, > > i have a cluster of two master/slave drbd server running into a vlan > (machines are dedicated servers) > (filer1 and filer2) > I added a third node to the cluster (a "blank node" for the moment) > correctly > (server1) > When i add a 4th node to the cluster (which is a "mirror" of server1) > (server2) > this node start as standalone...Here is the message.log : > > Oct 28 15:59:27 ns209045 corosync[16543]: [TOTEM ] A processor joined or > left the membership and a new membership was formed. > Oct 28 15:59:28 ns209045 corosync[16543]: [pcmk ] notice: > pcmk_peer_update: Transitional membership event on ring 945392: memb=1, > new=0, lost=0 > Oct 28 15:59:28 ns209045 corosync[16543]: [pcmk ] info: pcmk_peer_update: > memb: server2 16820416 > Oct 28 15:59:28 ns209045 corosync[16543]: [pcmk ] notice: > pcmk_peer_update: Stable membership event on ring 945392: memb=1, new=0, > lost=0 > Oct 28 15:59:28 ns209045 corosync[16543]: [pcmk ] info: pcmk_peer_update: > MEMB: server2 16820416 > Oct 28 15:59:28 ns209045 corosync[16543]: [TOTEM ] A processor joined or > left the membership and a new membership was formed. > Oct 28 15:59:29 ns209045 corosync[16543]: [pcmk ] notice: > pcmk_peer_update: Transitional membership event on ring 945416: memb=1, > new=0, lost=0 > Oct 28 15:59:29 ns209045 corosync[16543]: [pcmk ] info: pcmk_peer_update: > memb: server2 16820416 > Oct 28 15:59:29 ns209045 corosync[16543]: [pcmk ] notice: > pcmk_peer_update: Stable membership event on ring 945416: memb=1, new=0, > lost=0 > Oct 28 15:59:29 ns209045 corosync[16543]: [pcmk ] info: pcmk_peer_update: > MEMB: server2 16820416 > > [...] Message repeat many many times > > Now i stop the server1, and i start the server2...server2 start correctly > and is added to the cluster...but when > i want to start server1, same thing happens...(so things are inverted but > result is the same...when i start one the serverX, the other can't start...) 
> > My corosync.conf is configured in broadcast, not multicast. I have lots of > problems with multicast because lots of bridged VMs on the vlan > don't see the multicast packets, or don't join the multicast group > correctly... > > Any hint on this ?? Are the corosync and auth files the same on server2?
Re: [Pacemaker] Impossible to add a 4th node to a cluster
On 28 October 2010 18:30, Guillaume Chanaud wrote: [...snip...] >> corosync and auth files are the same on server2? >> > > Yes of course :D (copied by scp), as i told server1 can join when server2 is > offline, and server 2 can join when server1 is offline, but if one is > online, the other can't join and log the above things in loop... Hmm, you said that your server2 is a clone of server1; check if they have different uuids. > In fact i have lots of problems with > corosync/pacemaker... multicast/broadcast between physical > servers/virtual... lots of different shit everywhere, error logs are always > different depending on what i try... Try to go step by step: make sure you have correct rings, and check the related threads about rings. > > The strange thing is that the filer1, filer2, server2 and server1 are all > running the same distro (gentoo) with same tools and are on the same vlan > (which is working for lots of services like nfs...)
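One quick way to compare the two nodes' cluster configuration (and to spot clone-image leftovers) is to checksum the files that must match and diff the output between nodes. A minimal sketch: the paths are assumptions for a corosync setup and the helper name is made up.

```shell
# Print one checksum line per file; run the same command on every node
# and diff the results. Files that must be byte-identical across nodes
# include corosync.conf and the authkey (adjust paths for your distro).
config_fingerprint() {
  md5sum "$@"
}

# e.g. (hypothetical paths):
# config_fingerprint /etc/corosync/corosync.conf /etc/corosync/authkey
```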
Re: [Pacemaker] Pacemaker-1.1.4, when?
On 28 October 2010 22:55, Andrew Beekhof wrote: > Its released already, but the wrong packages got built because I ran > the wrong command :-( > Fedora 13 packages are uploading now, I'll do opensuse 11.3 in the morning I have seen the tag on Mercurial but I hadn't seen any rpm on rpm-next for EPEL, so I thought you were still testing the release. When do you expect to have the builds for EPEL? Thanks, Pavlos
[Pacemaker] PE ignores monitor failure of stonith:external/rackpdu
Hi, I wanted to check what happens when the monitor of a fencing agent fails, so I disconnected the PDU from the network, reduced the monitor interval and put debug statements in the fencing script. Here are the debug statements in the status code:

status)
    if [ -z "$pduip" ]; then
        exit 1
    fi
    date >> /tmp/pdu.monitor
    if ping -w1 -c1 $pduip >/dev/null 2>&1; then
        exit 0
    else
        echo "failed" >> /tmp/pdu.monitor
        exit 1
    fi
    ;;

Here is the debug output, which shows that the monitor failed:

[r...@node-03 tmp]# cat pdu.monitor
Fri Oct 29 08:29:20 CEST 2010
Fri Oct 29 08:31:05 CEST 2010
failed
Fri Oct 29 08:32:50 CEST 2010
failed

but pacemaker thinks it is fine:

[r...@node-03 tmp]# crm status|grep pdu
pdu (stonith:external/rackpdu): Started node-03
[r...@node-03 tmp]#

and here is the resource:

primitive pdu stonith:external/rackpdu \
    params community="empisteftiko" names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="AUTO" pduip="192.168.100.100" stonith-timeout="30" \
    op monitor interval="1m" timeout="60s"

Is this the expected behaviour? Cheers, Pavlos
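To rule out the agent script itself, the status logic can be exercised outside stonithd as a stand-alone shell function. This sketch mirrors the status snippet above; the function name and log-file argument are made up for illustration.

```shell
# Stand-alone version of the status check: returns 0 when the PDU answers
# a single ping within one second, 1 otherwise, logging each attempt.
pdu_status() {
  # $1: PDU IP address, $2: optional log file (defaults to /tmp/pdu.monitor)
  pduip=$1
  logfile=${2:-/tmp/pdu.monitor}
  if [ -z "$pduip" ]; then
    return 1
  fi
  date >> "$logfile"
  if ping -w1 -c1 "$pduip" >/dev/null 2>&1; then
    return 0
  else
    echo "failed" >> "$logfile"
    return 1
  fi
}
```

Running it by hand with the PDU unplugged should return 1 and append "failed" to the log, confirming that the agent reports the failure and the question is about how the PE handles it.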
Re: [Pacemaker] Pacemaker-1.1.4, when?
On 29 October 2010 10:25, Andrew Beekhof wrote: > On Fri, Oct 29, 2010 at 8:15 AM, Pavlos Parissis > wrote: >> On 28 October 2010 22:55, Andrew Beekhof wrote: >>> Its released already, but the wrong packages got built because I ran >>> the wrong command :-( >>> Fedora 13 packages are uploading now, I'll do opensuse 11.3 in the morning >> >> I have seen the tag on Mercurial but I haven't seen any rpm on >> rpm-next for EPEL and I thought you are still testing the release. >> When do you expect to have the builds for EPEL? > > There wont be unfortunately. > Some of the changes we needed to make involved the use of > g_hash_table_get_values() which only appeared in glib 2.14 > So EPEL5 is stuck on the 1.0 series. Does that mean I shouldn't use 1.1.x (x<4) on EPEL? I guess not, since the change you mentioned is only in 1.1.4. I currently use 1.1.3 on EPEL 5.4. Cheers, Pavlos
Re: [Pacemaker] Pacemaker-1.1.4, when?
On 29 October 2010 11:47, Andrew Beekhof wrote: [...snip..] >>> There wont be unfortunately. >>> Some of the changes we needed to make involved the use of >>> g_hash_table_get_values() which only appeared in glib 2.14 >>> So EPEL5 is stuck on the 1.0 series. >> >> Does that mean I shouldn't use 1.1.x (x<4) on EPEL? I guess not since >> the change you mentioned is only in 1.1.4 >> I currently use 1.1.3 on EPEL 5.4 > > 1.1.3 is generally ok still, it was mostly performance stuff that went into >>> .4 >>> You could update glib manually and rebuild the 1.1.4 packages though... I won't go down this path. So for EPEL, 1.1.3 is the last available release without any upgrade path; that's not very nice for production systems. I am considering switching back to 1.0.9, which hopefully still gets updates. Cheers, Pavlos
Re: [Pacemaker] Pacemaker-1.1.4, when?
On 29 October 2010 12:23, Andrew Beekhof wrote: > On Fri, Oct 29, 2010 at 11:58 AM, Pavlos Parissis > wrote: >> On 29 October 2010 11:47, Andrew Beekhof wrote: >> [...snip..] >>>>> There wont be unfortunately. >>>>> Some of the changes we needed to make involved the use of >>>>> g_hash_table_get_values() which only appeared in glib 2.14 >>>>> So EPEL5 is stuck on the 1.0 series. >>>> >>>> Does that mean I shouldn't use 1.1.x (x<4) on EPEL? I guess not since >>>> the change you mentioned is only in 1.1.4 >>>> I currently use 1.1.3 on EPEL 5.4 >>> >>> 1.1.3 is generally ok still, it was mostly performance stuff that went into >>> .4 >>> You could update glib manually and rebuild the 1.1.4 packages though... >> >> I wont go down this path. So, for EPEL 1.1.3 is the last available >> release without any upgrade paths, > > to be fair, there is an upgrade path, it just involves a version of > glib2 that was released less than 4 years ago > >> that's not very nice for production >> systems. I consider switching back to 1.0.9, which hopefully gets >> updated. > > 1.0.10 is almost done > > Initially, I moved to 1.1.3 to see if it solves bug #2500 (it doesn't), and I stayed on 1.1.3, even though I am using the pacemaker-1.0 schema, because I wanted to use the latest/greatest and get regular updates. Since there is no realistic upgrade path to 1.1.4 on EPEL, I am wondering if there is any benefit in staying on 1.1.3 compared to using 1.0.10. Andrew, thanks for the clarifications, very much appreciated. Pavlos
[Pacemaker] IP Power 9258HP with external/ippower9258
Hi, Does anyone know if the fencing agent ippower9258 works with the IP Power 9258HP PDU? The readme file of the fencing agent mentions the following: Especially "IP Power 9258 HP" uses a different http command interface Doesn't that mean that it won't work with the 9258 HP? The fact that Aviosys has different types of ip9258 makes it a bit confusing to decide what someone should buy. Any ideas? Cheers, Pavlos
Re: [Pacemaker] IP Power 9258HP with external/ippower9258
On 30 October 2010 16:03, Pavlos Parissis wrote: > Hi, > > Does anyone know if the fencing agent ippower9258 works with IP Power > 9258HP PDU? > The readme file of the fencing agent mentions the following > > Especially "IP Power 9258 HP" uses a different http command interface > > Doesn't that mean that it wont with 9258 HP? > The fact that Aviosys has different type of ip9258 makes a bit > confusing on what someone should buy. > > Any ideas? I was too fast to send out the above mail. Reading http://www.aviosys.com/downloads/manuals/power/9258hp_en.pdf and http://www.aviosys.com/downloads/manuals/power/9258st_en.pdf gave me the answer. The external/ippower9258 agent doesn't work with the 9258 HP due to its different http command interface, as mentioned in the readme file of the agent. Sorry for the noise. Pavlos
Re: [Pacemaker] Ordering clones and primitives
On 30 October 2010 19:55, Lars Kellogg-Stedman wrote: > I have a two node cluster that hosts two virtual ips on the same network: > > primitive proxy_0_ip ocf:heartbeat:IPaddr \ >params ip="10.10.10.20" cidr_netmask="255.255.255.0" nic="eth3" > primitive proxy_1_ip ocf:heartbeat:IPaddr \ >params ip="10.10.10.30" cidr_netmask="255.255.255.0" nic="eth3" > > After the ip address comes up, the system must establish a network > route and a default route. I'm having trouble defining the > relationships between these services. I started with this: > > primitive public_net_route ocf:heartbeat:Route \ >params destination="10.10.10.0/24" > device="eth3" table="1" > primitive public_def_route ocf:heartbeat:Route \ >params destination="default" gateway="10.10.10.1" > device="eth3" table="1" > > clone clone_public_def_route public_def_route > clone clone_public_net_route public_net_route > Why do you need/want to clone these 2 resources? To me it would make more sense to have one group per IP and place the resources in the order you want. > But having got this before, I don't understand how to establish the > necessary ordering between the routes and the ip address resources. > The clones can't come up on a host until one of the ip addresses are > available on the host. In other words, the cloned resources cannot be > active on a host unless an ip address resource is also active on that > host. > > I tried this: > > order ip_0_before_routes inf: proxy_0_ip clone_public_net_route > order ip_1_before_routes inf: proxy_1_ip clone_public_net_route > order net_route_before_def_route \ > inf: clone_public_net_route clone_public_def_route > > ...but the clone services in this case don't start unless both ips are > started. Shutting down either ip takes down *all* of the clone > resources on both nodes. > > Is it possible to do what I want? 
This seems like exactly the same > relationship that would exist between, say, a cloned Apache instance > and a set of ip address resources, but I can't find a good example. > > I am not sure if you can place order constraints like this on clones. More experienced users will know better. Cheers, Pavlos
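The one-group-per-IP suggestion could look roughly like this in crm syntax. A sketch only: the new primitive names are made up, a group gives implicit ordering and colocation of its members, and whether two default routes can coexist in the same routing table on one node is something to verify first.

```
primitive net_route_0 ocf:heartbeat:Route \
    params destination="10.10.10.0/24" device="eth3" table="1"
primitive def_route_0 ocf:heartbeat:Route \
    params destination="default" gateway="10.10.10.1" device="eth3" table="1"
# group = implicit ordering + colocation: the IP first, then the routes
group grp_proxy_0 proxy_0_ip net_route_0 def_route_0
```

A second group (grp_proxy_1) would hold proxy_1_ip and its own route primitives, so each IP carries its routes wherever it runs.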
[Pacemaker] downgrading to pacemaker-1.0.9.1-1.15.el5
Hi, I have been using 1.1.3 on CentOS and I decided to downgrade to 1.0.9.1-1.15.el5. The procedure was the following:

stop heartbeat on all cluster members

downgrade to 1.0.9 by doing the following on all cluster members:
yum downgrade pacemaker-1.0.9.1-1.15.el5 pacemaker-libs-1.0.9.1-1.15.el5 pacemaker-debuginfo-1.0.9.1-1.15.el5

Starting heartbeat gave me the following, and crmd was stopped:
crmd: [10772]: debug: debug3: compare_version: 3.0.2 > 3.0.1 (3)
crmd: [10772]: ERROR: revision_check_callback: This build (1.0.9) does not support the current resource configuration
crmd: [10772]: ERROR: revision_check_callback: We can support up to CRM feature set 3.0.2 (current=3.0.1)
crmd: [10772]: ERROR: revision_check_callback: Shutting down the CRM

Why does crm complain about the resource configuration? Even though I was using 1.1.3, I had the pacemaker-1.0 schema: validate-with="pacemaker-1.0" crm_feature_set="3.0.2"

Could the following be the root cause of the problem? Any ideas? Pavlos
Re: [Pacemaker] downgrading to pacemaker-1.0.9.1-1.15.el5
On 1 November 2010 09:19, Pavlos Parissis wrote: > Hi, > > I have been using 1.1.3 on CentOS and I decided to downgrade to > 1.0.9.1-1.15.el5. > > The procedure was the following > stop heartbeat on all cluster members > > downgrade to 1.0.9 doing the following on all cluster memebrs > yum downgrade pacemaker-1.0.9.1-1.15.el5 pacemaker-libs-1.0.9.1-1.15.el5 > pacemaker-debuginfo-1.0.9.1-1.15.el5 > > starting heartbeat gave me the following and crmd was stopped > crmd: [10772]: debug: debug3: compare_version: 3.0.2 > 3.0.1 (3) > crmd: [10772]: ERROR: revision_check_callback: This build (1.0.9) does not > support the current resource configuration > crmd: [10772]: ERROR: revision_check_callback: We can support up to CRM > feature set 3.0.2 (current=3.0.1) > crmd: [10772]: ERROR: revision_check_callback: Shutting down the CRM > > why does crm complain about the resource configuration? > Even I was using 1.1.3, I had pacemaker-schema 1.0 > validate-with="pacemaker-1.0" crm_feature_set="3.0.2" > > Could be the following the root cause of the problem? > value="1.1.3-9c2342c0378140df9bed7d192f2b9ed157908007"/> > > Any ideas? > Pavlos > > Yes, I have! Solved by doing the following:

yum upgrade pacemaker pacemaker-libs
(gave me a working crmd on 1.1.3 again)

cibadmin --modify --crm_xml ''
(set feature_set to 3.0.1; I had to look at the code to realize that this was the cause of the problem. The log line "We can support up to CRM feature set 3.0.2 (current=3.0.1)" is a bit confusing and makes you think the feature-set version is not the issue here.)

heartbeat stop
yum downgrade pacemaker-1.0.9.1-1.15.el5 pacemaker-libs-1.0.9.1-1.15.el5 pacemaker-debuginfo-1.0.9.1-1.15.el5
heartbeat start

and everything is fine again!
Cheers, Pavlos
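For what it's worth, the "compare_version: 3.0.2 > 3.0.1" line in the log is a plain field-by-field numeric comparison of the dotted feature-set strings, which is why a CIB stamped 3.0.2 is rejected by a 1.0.9 crmd that only supports 3.0.1. A stand-alone re-implementation for illustration; the behaviour is inferred from the log output, not copied from the crmd source.

```shell
# Compare two dotted version strings numerically, field by field.
# Prints -1, 0 or 1 for $1 <, = or > $2. Missing fields count as 0.
compare_version() {
  v1=$1 v2=$2
  while [ -n "$v1" ] || [ -n "$v2" ]; do
    x=${v1%%.*}; y=${v2%%.*}
    if [ "${x:-0}" -lt "${y:-0}" ]; then echo -1; return; fi
    if [ "${x:-0}" -gt "${y:-0}" ]; then echo 1; return; fi
    case $v1 in *.*) v1=${v1#*.};; *) v1=;; esac
    case $v2 in *.*) v2=${v2#*.};; *) v2=;; esac
  done
  echo 0
}

compare_version 3.0.2 3.0.1   # prints 1: the CIB feature set is newer than 1.0.9 supports
```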
Re: [Pacemaker] Stonith Device APC AP7900
On 1 November 2010 15:01, Rick Cone wrote: > Dejan, > > Below I had: > > primitive res_stonith stonith:apcmastersnmp \ > params ipaddr="192.1.1.109" port="161" community="sps" \ > op start interval="0" timeout="60s" \ > op monitor interval="60s" timeout="60s" \ > op stop interval="0" timeout="60s" > clone rc_res_stonith res_stonith \ > meta target-role="Started" > > And you commented: > > You can also use a single instance setup, i.e. without clones. > > What is this "single instance setup", and what would it look like in the > crm > configure? You have one already; res_stonith is your single-instance setup. > What are the pros/cons to this compared to the clone setup I > have? > The decision between a cloned and a non-cloned stonith resource is mainly driven by the ability of the fencing device to accept multiple connections simultaneously. If your fencing device doesn't allow that, then you can only use a non-cloned resource. The cloned resource will fence a node a bit faster in the case where the stonith resource is running on the node to be fenced, since there is no need to move the resource to another node before fencing. But a cloned resource has a slightly more complex configuration; cloning looks easy, but I bet there is some complexity behind it. I have experimented with both versions on an AP9606 using the rackpdu agent, and both worked as expected. Now I am expecting an Aviosys 9258ST and I will use the non-cloned version. My 2 cents, Pavlos
Re: [Pacemaker] Stonith Device APC AP7900
On 2 November 2010 11:04, Dejan Muhamedagic wrote: > Hi, > > On Tue, Nov 02, 2010 at 08:08:32AM +0100, Pavlos Parissis wrote: > > On 1 November 2010 15:01, Rick Cone > wrote: > > > > > Dejan, > > > > > > Below I had: > > > > > > primitive res_stonith stonith:apcmastersnmp \ > > > params ipaddr="192.1.1.109" port="161" community="sps" \ > > > op start interval="0" timeout="60s" \ > > > op monitor interval="60s" timeout="60s" \ > > > op stop interval="0" timeout="60s" > > > clone rc_res_stonith res_stonith \ > > > meta target-role="Started" > > > > > > And you commented: > > > > > > You can also use a single instance setup, i.e. without clones. > > > > > > What is this "single instance setup", and what would it look like in > the > > > crm > > > configure? > > > > > > You have one already, res_stonith is your single instance setup. > > Yes, just don't clone that resource. > > > What are the pros/cons to this compared to the clone setup I > > > have? > > > > > > > The decision of cloned or non-cloned stonith resource is mainly driven > about > > the ability of the fencing device to accept multiple connections > > simultaneously. > > If you fencing device doesn't allow that then you can only use a > non-cloned > > resource. > > Right. > > Do you know under which conditions pacemaker initiates multiple connections to a fencing device? > The cloned resource will fence a node a bit faster in a case stonith > > resource is running on the node to be fenced, there is no need to move > the > > resource to another node and then fence the node. > > Did you measure the time it takes to start the stonith resource? > The last time I tested it was with rackpdu and it took 5 secs for pacemaker to move the resource and trigger the reboot event. > > > But, cloned resource is has a bit more complex configuration. cloning > looks > > easy but I bet there is some complexity behind it. 
> > There is, but in this simple case where clones are not in any > relation to other resources, that shouldn't pose a problem. > > > I have experiment with both versions on AP9606 using the rackpdu agent > and > > in both cases worked as expected. Now I am expecting a Aviosys 9258ST and > I > > will use the non-cloned version. > > Thanks, > > Dejan
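As a reference for the "single instance setup" discussed above, the non-cloned variant is just the primitive without the clone wrapper; a minimal sketch in crm syntax (parameters copied from Rick's example; the location constraint and its names are illustrative additions, not required):

```
primitive res_stonith stonith:apcmastersnmp \
        params ipaddr="192.1.1.109" port="161" community="sps" \
        op start interval="0" timeout="60s" \
        op monitor interval="60s" timeout="60s" \
        op stop interval="0" timeout="60s"
# Optional: keep the stonith resource off the node it is most likely
# to shoot, so no migration is needed at fencing time (names made up).
location loc_res_stonith_avoid_node1 res_stonith -inf: node1
```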
Re: [Pacemaker] PE ignores monitor failure of stonith:external/rackpdu
On 2 November 2010 11:22, Dejan Muhamedagic wrote: > Hi, > > On Fri, Oct 29, 2010 at 08:37:04AM +0200, Pavlos Parissis wrote: > > Hi, > > > > I wanted to check what happens when the monitor of a fencing agents > > fails, thus I disconnected the PDU from network, reduced the monitor > > interval and put debug statements on the fencing script. > > > > here is the debug statements on the status code > > status) > > if [ -z "$pduip" ]; then > > exit 1 > > fi > > date >> /tmp/pdu.monitor > > if ping -w1 -c1 $pduip >/dev/null 2>&1; then > > exit 0 > > else > > echo "failed" >> /tmp/pdu.monitor > > exit 1 > > fi > > ;; > > > > > > here is the debug output which states that monitor failed > > [r...@node-03 tmp]# cat pdu.monitor > > Fri Oct 29 08:29:20 CEST 2010 > > Fri Oct 29 08:31:05 CEST 2010 > > failed > > Fri Oct 29 08:32:50 CEST 2010 > > failed > > > > but pacemaker thinks is fine > > [r...@node-03 tmp]# crm status|grep pdu > > pdu(stonith:external/rackpdu): Started node-03 > > [r...@node-03 tmp]# > > > > > > and here is the resource > > primitive pdu stonith:external/rackpdu \ > > params community="empisteftiko" > > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" > > oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="AUTO" > > pduip="192.168.100.100" stonith-timeout="30" \ > > op monitor interval="1m" timeout="60s" > > > > Is it the expected behaviour? > > Definitely not. If you do the monitor action from the command > line does that also return the unexpected exit code: > From the code I pasted you can see it returned 1. > > # stonith -t external/rackpdu community="empisteftiko" > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" ... -lS > > Which pacemaker release do you run? I couldn't reproduce this > with a recent Pacemaker. > That was on 1.1.3; now I run 1.0.9. Do you want me to run the test on 1.0.9?
> > Thanks, > > Dejan > > > Cheers, > > Pavlos
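For anyone who wants to exercise the status check outside the plugin, the quoted shell logic can be sketched as a small standalone function; this is only a test harness mirroring the snippet above (the function name is mine, and nothing here is part of the real rackpdu agent):

```python
import subprocess

def pdu_status(pduip: str) -> int:
    """Mirror of the plugin's status) branch: 0 if the PDU answers one ping, else 1."""
    if not pduip:
        return 1  # no PDU address configured -> fail, as the script does
    # -w1 -c1: a single packet with a one-second deadline, as in the shell code
    rc = subprocess.call(
        ["ping", "-w1", "-c1", pduip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return 0 if rc == 0 else 1
```

With the PDU unplugged this returns 1, matching the "failed" entries logged in /tmp/pdu.monitor.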
Re: [Pacemaker] Stonith Device APC AP7900
On 2 November 2010 12:58, Dejan Muhamedagic wrote: [...snip...] > > > Do you know under which conditions pacemaker initiates multiple > connections > > to a fencing device? > > There are no specific conditions. It can happen by chance because > individual clone instances run independently. > > > The cloned resource will fence a node a bit faster in a case stonith > > > > resource is running on the node to be fenced, there is no need to > move > > > the > > > > resource to another node and then fence the node. > > > > > > Did you measure the time it takes to start the stonith resource? > > > > > > > The last time I tested it was with rackpdu and it took 5 secs for > pacemaker > > to move the resource and trigger the reboot event. > > So, the time difference is 5 seconds in your case, right? > > > No, this is the time it took to fence a node with the non-cloned version. I don't remember exactly how many seconds it took with the cloned version, but I do remember that it was faster. When I get the new PDU (Aviosys 9258) I will run the test again and report back. Cheers, Pavlos
Re: [Pacemaker] PE ignores monitor failure of stonith:external/rackpdu
On 2 November 2010 13:02, Dejan Muhamedagic wrote: [...snip...] > > > > Definitely not. If you do the monitor action from the command > > > line does that also return the unexpected exit code: > > > > > > > from the code I pasted you can see it returned 1. > > There is a difference. stonith-ng (stonithd) is a daemon that > runs a perl script (fencing_legacy) which invokes stonith which > then invokes the plugin. A problem can occur in any of these > components. It's important to find out where. > > > > # stonith -t external/rackpdu community="empisteftiko" > > > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" ... -lS > > > > > > Which pacemaker release do you run? I couldn't reproduce this > > > with a recent Pacemaker. > > > > > > > that it was on 1.1.3 and now I run 1.0.9. > > Do you want me to run the test on 1.0.9? > > Yes, please. 1.0.9 is still running the old, and well tested, > stonithd, so the result could be different. > > I have the PDU off because it stopped working! As a result the resource is stopped. But I did the test, and I see that even though rackpdu returns 1 on status, stonithd reports 256. Here is a run of stonith; remember, the PDU is off.
[r...@node-01 ~]# stonith -d -t external/rackpdu hostlist="node-01,node-02,node-03" pduip="192.168.100.100" community="empisteftiko" names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" -l ** (process:8115): DEBUG: NewPILPluginUniv(0x8f690c8) ** (process:8115): DEBUG: PILS: Plugin path = /usr/lib/stonith/plugins:/usr/lib/heartbeat/plugins ** (process:8115): DEBUG: NewPILInterfaceUniv(0x8f69768) ** (process:8115): DEBUG: NewPILPlugintype(0x8f69a28) ** (process:8115): DEBUG: NewPILPlugin(0x8f69a40) ** (process:8115): DEBUG: NewPILInterface(0x8f69b50) ** (process:8115): DEBUG: NewPILInterface(0x8f69b50:InterfaceMgr/InterfaceMgr)*** user_data: 0x0 *** ** (process:8115): DEBUG: InterfaceManager_plugin_init(0x8f69b50/InterfaceMgr) ** (process:8115): DEBUG: Registering Implementation manager for Interface type 'InterfaceMgr' ** (process:8115): DEBUG: PILS: Looking for InterfaceMgr/generic => [/usr/lib/stonith/plugins/InterfaceMgr/generic.so] ** (process:8115): DEBUG: Plugin file /usr/lib/stonith/plugins/InterfaceMgr/generic.so does not exist ** (process:8115): DEBUG: PILS: Looking for InterfaceMgr/generic => [/usr/lib/heartbeat/plugins/InterfaceMgr/generic.so] ** (process:8115): DEBUG: Plugin path for InterfaceMgr/generic => [/usr/lib/heartbeat/plugins/InterfaceMgr/generic.so] ** (process:8115): DEBUG: PluginType InterfaceMgr already present ** (process:8115): DEBUG: Plugin InterfaceMgr/generic init function: InterfaceMgr_LTX_generic_pil_plugin_init ** (process:8115): DEBUG: NewPILPlugin(0x8f6a1d8) ** (process:8115): DEBUG: Plugin InterfaceMgr/generic loaded and constructed. ** (process:8115): DEBUG: Calling init function in plugin InterfaceMgr/generic. 
** (process:8115): DEBUG: NewPILInterface(0x8f69cd8) ** (process:8115): DEBUG: NewPILInterface(0x8f69cd8:InterfaceMgr/stonith2)*** user_data: 0x8f69b18 *** ** (process:8115): DEBUG: Registering Implementation manager for Interface type 'stonith2' ** (process:8115): DEBUG: IfIncrRefCount(1 + 1 ) ** (process:8115): DEBUG: PluginIncrRefCount(0 + 1 ) ** (process:8115): DEBUG: IfIncrRefCount(1 + 100 ) ** (process:8115): DEBUG: PILS: Looking for stonith2/external => [/usr/lib/stonith/plugins/stonith2/external.so] ** (process:8115): DEBUG: Plugin path for stonith2/external => [/usr/lib/stonith/plugins/stonith2/external.so] ** (process:8115): DEBUG: Creating PluginType for stonith2 ** (process:8115): DEBUG: NewPILPlugintype(0x8f6a398) ** (process:8115): DEBUG: Plugin stonith2/external init function: stonith2_LTX_external_pil_plugin_init ** (process:8115): DEBUG: NewPILPlugin(0x8f69d68) ** (process:8115): DEBUG: Plugin stonith2/external loaded and constructed. ** (process:8115): DEBUG: Calling init function in plugin stonith2/external. ** (process:8115): DEBUG: NewPILInterface(0x8f6a3b0) ** (process:8115): DEBUG: NewPILInterface(0x8f6a3b0:stonith2/external)*** user_data: 0x9e9fbc *** ** (process:8115): DEBUG: IfIncrRefCount(101 + 1 ) ** (process:8115): DEBUG: PluginIncrRefCount(0 + 1 ) ** (process:8115): DEBUG: external_set_config: called. ** (process:8115): DEBUG: external_get_confignames: called. 
** (process:8115): DEBUG: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/rackpdu getconfignames' ** (process:8115): DEBUG: external_run_cmd: '/usr/lib/stonith/plugins/external/rackpdu getconfignames' output: hostlist pduip community ** (process:8115): DEBUG: external_get_confignames: 'rackpdu getconfignames' returned 0 ** (process:8115): DEBUG: plugin output: hostlist pduip community ** (process:8115): DEBUG: external_get_confignames: rackpdu configname hostlist ** (process:8115): DEBUG: external_get_confignames: rackpdu configname pduip ** (process:8115): DEBUG: external_get_confignames: rackpdu configname community ** (process:8115): DEBUG: external_status: called. ** (process:8115): DEBUG: external_run_cmd: C
Re: [Pacemaker] IP Power 9258HP with external/ippower9258
On 2 November 2010 13:13, Dejan Muhamedagic wrote: > Hi, > > On Sat, Oct 30, 2010 at 04:31:38PM +0200, Pavlos Parissis wrote: > > On 30 October 2010 16:03, Pavlos Parissis > wrote: > > > Hi, > > > > > > Does anyone know if the fencing agent ippower9258 works with IP Power > > > 9258HP PDU? > > > The readme file of the fencing agent mentions the following > > > > > > Especially "IP Power 9258 HP" uses a different http command interface > > > > > > Doesn't that mean that it wont with 9258 HP? > > > The fact that Aviosys has different type of ip9258 makes a bit > > > confusing on what someone should buy. > > > > > > Any ideas? > > > > I was too fast to send out the above mail. > > Reading http://www.aviosys.com/downloads/manuals/power/9258hp_en.pdf > > and http://www.aviosys.com/downloads/manuals/power/9258st_en.pdf > > gave me the answer. The external/ippower9258 doesn't work with 9258 HP > > due to different http command interface, as it is mentioned in the > > readme file of the agent. > > There was a fairly good implementation of the ippower9258hp > posted by Johan Verrept. Unfortunately, the discussion somehow > petered out when the plugin was almost done. Can't recall the > details anymore, they should be in the list archives, but I guess > that we can revive it. > > Thanks, > > I ordered the ST version of the 9258, so I can't test that RA. Cheers, Pavlos
Re: [Pacemaker] PE ignores monitor failure of stonith:external/rackpdu
On 2 November 2010 13:18, Dejan Muhamedagic wrote: > Hi, > > On Tue, Nov 02, 2010 at 01:09:02PM +0100, Pavlos Parissis wrote: > > On 2 November 2010 13:02, Dejan Muhamedagic wrote: > > [...snip...] > > > > > > > > > > Definitely not. If you do the monitor action from the command > > > > > line does that also return the unexpected exit code: > > > > > > > > > > > > > from the code I pasted you can see it returned 1. > > > > > > There is a difference. stonith-ng (stonithd) is a daemon that > > > runs a perl script (fencing_legacy) which invokes stonith which > > > then invokes the plugin. A problem can occur in any of these > > > components. It's important to find out where. > > > > > > > > # stonith -t external/rackpdu community="empisteftiko" > > > > > names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" ... -lS > > > > > > > > > > Which pacemaker release do you run? I couldn't reproduce this > > > > > with a recent Pacemaker. > > > > > > > > > > > > > that it was on 1.1.3 and now I run 1.0.9. > > > > Do you want me to run the test on 1.0.9? > > > > > > Yes, please. 1.0.9 is still running the old, and well tested, > > > stonithd, so the result could be different. > > > > > > > > I have the pdu off because it stopped working anymore! As a result the > > resource is stopped. > > But I did the test I see that even rackpdu returns 1 on status stonithd > > reports 256 > > Ah, I understand what's going on now. It's a bug in the interface > to external plugins which was exposed by stonith-ng. It has been > fixed in August. The fix is here (in hg.linux-ha.org/glue): > > changeset: 2427:b7df127fc09e > user:Dejan Muhamedagic > date:Thu Aug 12 14:01:10 2010 +0200 > summary: High: stonith: external: interpret properly exit codes from > external stonith plugins (bnc#630357) > > There hasn't been a glue release since then, but there should be > one fairly soon. Note that this affects only Pacemaker 1.1. 
> > Thanks, > > Dejan > > > > Does this bug have anything to do with the PE ignoring the monitor failure? Pavlos
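The 1-versus-256 mismatch is consistent with a raw wait(2) status word being reported instead of the decoded exit code: an exit code of 1 lives in the high byte of the status word, which reads as 256 when printed as-is. A quick illustration in Python (POSIX only; this only demonstrates the encoding, it is not the cluster-glue code itself):

```python
import os

# On POSIX, os.system() returns the raw waitpid()-style status word.
status = os.system("exit 1")

print(status)                  # 256: exit code 1 shifted into the high byte
print(os.WEXITSTATUS(status))  # 1: the decoded exit code
```

Any layer that forwards the raw status word without calling WEXITSTATUS() will report 256 where the plugin actually returned 1.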
[Pacemaker] drbd on heartbeat links
Hi, I am trying to figure out how I can resolve the following scenario. Facts: 3 nodes; 2 DRBD ms resources; 2 group resources; by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node-02; drbd1/group1 can only run on node-01 and node-03; drbd2/group2 can only run on node-02 and node-03; the DRBD fencing policy is resource-only [1]; 2 heartbeat links, one of which is also used for the DRBD communication. Scenario: 1) node-01 loses both heartbeat links 2) the DRBD monitor detects the loss of the DRBD communication first and does resource fencing by adding a location constraint which prevents drbd1 from running on node-03 3) pacemaker fencing kicks in and kills node-01 Due to the location constraint created at step 2, drbd1/group1 cannot run anywhere in the cluster. Any ideas? Cheers, Pavlos [1] it is not resource-and-stonith because, in the scenario where a node is primary for drbd1 and secondary for drbd2, it could be fenced because the primary node for drbd2 has resource-and-stonith as its fencing policy
Re: [Pacemaker] drbd on heartbeat links
On 2 November 2010 16:15, Dan Frincu wrote: > Hi, > > Pavlos Parissis wrote: >> >> Hi, >> >> I am trying to figure out how I can resolve the following scenario >> >> Facts >> 3 nodes >> 2 DRBD ms resource >> 2 group resource >> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node2 >> drbd1/group1 can only run on node-01 and node-03 >> drbd2/group2 can only run on node-02 and node-03 >> DRBD fencing_policy is resource-only [1] >> 2 heartbeat links and one of them used by DRBD communication >> >> Scenario >> 1) node-01 loses both heartbeat links >> 2) DRBD monitor detects first the absence of the drbd communication >> and does resource fencing by add location constraint which prevent >> drbd1 to run on node3 >> 3) pacemaker fencing kicks in and kills node-01 >> >> due to location constraint created at step 2, drbd1/group1 can run in >> the cluster >> >> > > I don't understand exactly what you mean by this. Resource-only fencing > would create a -inf score on node1 when the node loses the drbd > communication channel (the only one drbd uses), Because node-01 is the primary at the moment of the failure, resource-fencing will create a -inf score for node-03. > however you could still have > heartbeat communication available via the secondary link, then you shouldn't As I wrote, neither of the heartbeat links is available. After I sent the mail, I realized that node-03 will not see the location constraint created by node-01 because there is no heartbeat communication! Thus I think my scenario has a flaw, since neither of the heartbeat links is available on node-01. Resource-fencing from DRBD will be triggered but without any effect, and node-03 or node-02 will fence node-01, and node-03 will become the primary for drbd1 > fence the entire node, the resource-only fencing does that for you, the only > thing you need to do is to add the drbd fence handlers in /etc/drbd.conf.
> handlers { > fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; > after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; > } > > Is this what you meant? No. Dan, thanks for your mail. Since there is a flaw in that scenario, let's define a similar one. Status: node-01 is primary for drbd1 and group1 runs on it; node-02 is primary for drbd2 and group2 runs on it; node-03 is secondary for drbd1 and drbd2; 2 heartbeat links, one of which is used for the DRBD communication. Here is the scenario: 1) on node-01, the heartbeat link which also carries the DRBD communication is lost 2) node-01 does resource-fencing and places a score of -inf for drbd1 on node-03 3) on node-01, the second heartbeat link is lost 4) node-01 is fenced by one of the other cluster members 5) drbd1 can't run on node-03 due to the location constraint created at step 2 The problem here is that the location constraint will remain active even after node-01 is fenced. Any ideas? Pavlos drbd.conf global { usage-count yes; } common { protocol C; syncer { csums-alg sha1; verify-alg sha1; rate 10M; } net { data-integrity-alg sha1; max-buffers 20480; max-epoch-size 16384; } disk { on-io-error detach; ### Only when DRBD is under cluster ### fencing resource-only; ### --- ### } startup { wfc-timeout 60; degr-wfc-timeout 30; outdated-wfc-timeout 15; } ### Only when DRBD is under cluster ### handlers { split-brain "/usr/lib/drbd/notify-split-brain.sh root"; fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; } ### --- ### } resource drbd_resource_01 { on node-01 { device /dev/drbd1; disk /dev/sdb1; address 10.10.10.129:7789; meta-disk internal; } on node-03 { device /dev/drbd1; disk /dev/sdb1; address 10.10.10.131:7789; meta-disk internal; } syncer { cpu-mask 2; } } resource drbd_resource_02 { on node-02 { device /dev/drbd2; disk /dev/sdb1; address 10.10.10.130:7790; meta-disk internal; } on node-03 { device /dev/drbd2; disk /dev/sdc1; address 10.10.10.131:7790; meta-disk internal; } syncer { cpu-mask 1; } }
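For reference, the constraint that crm-fence-peer.sh leaves behind looks roughly like this in crm syntax (the id is generated by the handler, and the resource and node names here are taken from the scenario above, so treat the exact names as illustrative):

```
location drbd-fence-by-handler-drbd_resource_01 ms_drbd1 \
        rule $role="Master" -inf: #uname ne node-01
```

The rule forbids the Master role everywhere except node-01, which is why it still blocks promotion on node-03 after node-01 has been fenced; it is expected to be removed again by crm-unfence-peer.sh (the after-resync-target handler) once the peer has resynced.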
Re: [Pacemaker] drbd on heartbeat links
On 2 November 2010 22:07, Pavlos Parissis wrote: > On 2 November 2010 16:15, Dan Frincu wrote: >> Hi, >> >> Pavlos Parissis wrote: >>> >>> Hi, >>> >>> I am trying to figure out how I can resolve the following scenario >>> >>> Facts >>> 3 nodes >>> 2 DRBD ms resource >>> 2 group resource >>> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node2 >>> drbd1/group1 can only run on node-01 and node-03 >>> drbd2/group2 can only run on node-02 and node-03 >>> DRBD fencing_policy is resource-only [1] >>> 2 heartbeat links and one of them used by DRBD communication >>> >>> Scenario >>> 1) node-01 loses both heartbeat links >>> 2) DRBD monitor detects first the absence of the drbd communication >>> and does resource fencing by add location constraint which prevent >>> drbd1 to run on node3 >>> 3) pacemaker fencing kicks in and kills node-01 >>> >>> due to location constraint created at step 2, drbd1/group1 can run in >>> the cluster >>> >>> >> >> I don't understand exactly what you mean by this. Resource-only fencing >> would create a -inf score on node1 when the node loses the drbd >> communication channel (the only one drbd uses), > Because node-01 is the primary at the moment of the failure, > resource-fencing will create an -inf score for the node-03. > >> however you could still have >> heartbeat communication available via the secondary link, then you shouldn't > As I wrote none of the heartbeat links is available. > After I sent the mail, I realized that the node-03 will not see > location constraint created by node-01 because there no heartbeat > communication! > Thus I think my scenario has a flaw, since none of the heartbeat links > are available on node-01. 
> Resource-fencing from DRBD will be triggered but without any effect > and node-03 or node-02 will fence node-01, and node-03 will be become > the primary for drbd1 > >> fence the entire node, the resource-only fencing does that for you, the only >> thing you need to do is to add the drbd fence handlers in /etc/drbd.conf. >> handlers { >> fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; >> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; >> } >> >> Is this what you meant? > > No. > Dan thanks for your mail. > > > Since there is a flaw on the scenario let's define a similar scenario. > > status > node-01 primary for drbd1 and group1 runs on it > node-02 primary for drbd2 and group2 runs on it > node-3 secondary for drbd1 and drbd2 > > 2 heartbeat links, and one of them being used for DRBD communication > > here is the scenario > 1) on node-01 heartbeat link which carries also DRBD communication is lost > 2) node-01 does resource-fencing and places score -inf for drbd1 on node-03 > 3) on node-01 second heartbeat link is lost > 4) node-01 will be fenced by one other cluster members > 5) drbd1 can't run on node-03 due to location constraint created at step 2 > > The problem here is that location constraint will be active even > node-01 is fenced. > > Any ideas? > I found this related thread: http://www.gossamer-threads.com/lists/drbd/users/15380#15380 Wouldn't it be better if pacemaker/drbd did this automatically? Manual actions add delay to recovery. Cheers, Pavlos
Re: [Pacemaker] [DRBD-user] drbd on heartbeat links
On 2 November 2010 22:57, Lars Ellenberg wrote: > On Tue, Nov 02, 2010 at 10:07:17PM +0100, Pavlos Parissis wrote: >> On 2 November 2010 16:15, Dan Frincu wrote: >> > Hi, >> > >> > Pavlos Parissis wrote: >> >> >> >> Hi, >> >> >> >> I am trying to figure out how I can resolve the following scenario >> >> >> >> Facts >> >> 3 nodes >> >> 2 DRBD ms resource >> >> 2 group resource >> >> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node2 >> >> drbd1/group1 can only run on node-01 and node-03 >> >> drbd2/group2 can only run on node-02 and node-03 >> >> DRBD fencing_policy is resource-only [1] >> >> 2 heartbeat links and one of them used by DRBD communication >> >> >> >> Scenario >> >> 1) node-01 loses both heartbeat links >> >> 2) DRBD monitor detects first the absence of the drbd communication >> >> and does resource fencing by add location constraint which prevent >> >> drbd1 to run on node3 >> >> 3) pacemaker fencing kicks in and kills node-01 >> >> >> >> due to location constraint created at step 2, drbd1/group1 can run in >> >> the cluster >> >> >> >> >> > >> > I don't understand exactly what you mean by this. Resource-only fencing >> > would create a -inf score on node1 when the node loses the drbd >> > communication channel (the only one drbd uses), >> Because node-01 is the primary at the moment of the failure, >> resource-fencing will create an -inf score for the node-03. >> >> > however you could still have >> > heartbeat communication available via the secondary link, then you >> > shouldn't >> As I wrote none of the heartbeat links is available. >> After I sent the mail, I realized that the node-03 will not see >> location constraint created by node-01 because there no heartbeat >> communication! >> Thus I think my scenario has a flaw, since none of the heartbeat links >> are available on node-01. 
>> Resource-fencing from DRBD will be triggered but without any effect >> and node-03 or node-02 will fence node-01, and node-03 will be become >> the primary for drbd1 >> >> > fence the entire node, the resource-only fencing does that for you, the >> > only >> > thing you need to do is to add the drbd fence handlers in /etc/drbd.conf. >> > handlers { >> > fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; >> > after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; >> > } >> > >> > Is this what you meant? >> >> No. >> Dan thanks for your mail. >> >> >> Since there is a flaw on the scenario let's define a similar scenario. >> >> status >> node-01 primary for drbd1 and group1 runs on it >> node-02 primary for drbd2 and group2 runs on it >> node-3 secondary for drbd1 and drbd2 >> >> 2 heartbeat links, and one of them being used for DRBD communication >> >> here is the scenario >> 1) on node-01 heartbeat link which carries also DRBD communication is lost >> 2) node-01 does resource-fencing and places score -inf for drbd1 on node-03 >> 3) on node-01 second heartbeat link is lost >> 4) node-01 will be fenced by one other cluster members >> 5) drbd1 can't run on node-03 due to location constraint created at step 2 >> >> The problem here is that location constraint will be active even >> node-01 is fenced. > > Which is good, and intended behaviour, as it protects you from > going online with stale data (changes between 1) and 4) would be lost). > >> Any ideas? > > The drbd setting "resource-and-stonith" simply tells DRBD > that you have stonith configured in your cluster. > It does not by itself trigger any stonith action. > > So if you have stonith enabled, and you want to protect against being > shot while modifying data, you should say "resource-and-stonith". I do have stonith enabled in my cluster, but I don't quite understand what you wrote.
The resource-and-stonith setting will add the location constraint just as the resource-only fencing policy does, and it will also prevent a node in the primary role from being fenced, am I right? So, what happens when the cluster sends a fence event? Initially, I thought this setting would trigger a fence event, and I didn't use it because I wanted to avoid a node which has the role of secondary for drbd1 and the role primar