Re: [Pacemaker] [ha-wg] [RFC] Organizing HA Summit 2015
On 11/25/2014 02:14 AM, Digimer wrote:
> On 24/11/14 10:12 AM, Lars Marowsky-Bree wrote:
>> Beijing, the US, Tasmania (OK, one crazy guy), various countries in
>
> Oh, bring him! crazy++

What, you want to bring the guy who's boldly maintaining the outpost on
the southern frontier? ;)

*cough* Barring a miracle or a sudden huge advance in matter transporter
technology I'm rather unlikely to make it, I'm afraid. But I'll add my
voice to what Lars said in another email - go all physical (with good
minutes/notes/etherpads for others to review - which I assume is what's
going to happen this time), or all virtual. Mixing the two is exceedingly
difficult to do well, IMO.

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] hawk session timeout?
On 12/02/2014 07:43 AM, Schaefer, Diane E wrote:
> Hi, I am running hawk 0.6.1-0.11.1 on SLES SP3. How do I configure Hawk
> so my web session times out? My users are concerned since it never
> times out by default.

It actually will eventually time out if you don't log out manually, but
it'll take ten days... This was put in so that if you're using the
dashboard function to view multiple clusters, you wouldn't have to keep
logging in to them if the sessions timed out.

A quick workaround is to edit this file:

  /srv/www/hawk/config/initializers/session_store.rb

You want to change :expire_after to a smaller value (expressed in
seconds), then restart hawk.

Please feel free to file a bug for this (to either set it lower by
default, or break the setting out into a config file, or both) on the
SUSE bugzilla (assuming you're using SLE HA), or the github issue
tracker (https://github.com/ClusterLabs/hawk) if not.

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
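[Editor's note] A concrete sketch of the workaround described in the
message above. The file path and the :expire_after key come from the
message itself; the 3600-second value and the sed pattern are
illustrative assumptions, and the script works on a copy (or a sample
line) so you can eyeball the result before touching the real file:

```shell
#!/bin/sh
# Lower Hawk's session lifetime (sketch; the real file may format the
# options differently, so inspect it before editing in place).
src=/srv/www/hawk/config/initializers/session_store.rb
work=$(mktemp)
if [ -r "$src" ]; then
    cp "$src" "$work"
else
    # Stand-in line in the style of a Rails-era session options hash
    echo '  :expire_after => 864000,' > "$work"
fi
# 864000 seconds = 10 days (the default mentioned above); make it 1 hour
sed -i 's/:expire_after[[:space:]]*=>[[:space:]]*[0-9]*/:expire_after => 3600/' "$work"
grep ':expire_after' "$work"
# If the result looks right: copy $work back over $src, then run
# `service hawk restart` to pick up the change.
```

Running it prints the rewritten :expire_after line so you can verify
the substitution before applying it for real.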
Re: [Pacemaker] Hawk session ends after start or stop action
On 03/05/2014 12:59 AM, Schaefer, Diane E wrote:
> Hi Lars, I am running pacemaker on SLES 11 SP3 and have applied the
> update package released in December. The hawk level is 0.6.1-0.11.1 and
> lighttpd is 1.4.20-2.52.1. When I log into hawk using Firefox, Google
> Chrome or IE 9, all with the hacluster userid, I can view my cluster
> definition, but I cannot perform any actions without my web session
> ending. The action does get submitted OK. One of my systems is running
> hawk 0.6.1-0.7.11 and lighttpd-1.4.20-2.46.10 and I don't seem to have
> the issue.

This is a bit strange, but it's possible that you are hitting a real bug
in hawk somewhere. Can you please take this log, and a hb_report from
the offending cluster, and open a support call? Then we can investigate
properly.

> I rebooted the system and then my hawk sessions no longer close. I had
> originally configured my cluster without hawk support and then started
> it via /etc/init.d/hawk start. I also turned on the chkconfig bit at
> this time. I suspect not everything that is needed was started before
> my reboot, or I'm not starting hawk correctly? I have many clusters up
> in my test lab; is there some process to check to see if it's running
> before I reboot?

This sounds like the sort of problem that could happen if something was
wrong with the session cookie Hawk sets in your browser. Hawk
0.6.1-0.11.1 had an update which changed the session key, so it's not
impossible that the login checking was confused by the update, but has
resolved itself since you rebooted the system (and thus hawk was
restarted). If that *was* the problem, just restarting hawk (service
hawk restart) and possibly logging out and back in in your web browser
should have been enough to resolve it.
Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
[Pacemaker] Announce: Hawk (HA Web Konsole) 0.6.2
Greetings,

This is to announce version 0.6.2 of Hawk, a web-based GUI for managing
and monitoring Pacemaker High-Availability clusters. Notable features
include:

- View cluster status (summary and detailed views).
- Examine potential failure scenarios via simulator mode.
- History explorer for analysis of cluster behaviour and prior failures.
- Perform regular management tasks (start, stop, move resources, put
  nodes on standby/maintenance, etc.)
- Configure resources, constraints, general cluster properties.
- Setup wizard for common configurations (currently web server and
  OCFS2 filesystem).

Packages for various openSUSE releases, as well as Fedora 18 and 19, are
available from the Open Build Service:

  http://software.opensuse.org/download?project=network:ha-clustering:Stable&package=hawk

More information is available in the README in the source tree:

  https://github.com/ClusterLabs/hawk

Some important notes:

- The latest versions of Hawk require pacemaker >= 1.1.8.
- Hawk uses the crm shell[1] internally to provide much of its
  functionality, so you'll need that installed too.
- The history explorer requires hb_report, which is presently available
  in cluster-glue[2]. If you don't have that installed, you'll miss that
  piece of functionality, but everything else should work just fine.
- Hawk has long been used and tested on SLES and openSUSE. I suspect
  (but have no actual way of knowing) that it has been rather less
  widely deployed on other distros. Accordingly there may be some rough
  edges. Please tell me about them!

More detailed usage documentation is available in the SUSE Linux
Enterprise High Availability Extension book:

  https://www.suse.com/documentation/sle_ha/book_sleha/data/cha_ha_configuration_hawk.html

Please direct comments, feedback, questions, etc. to myself and/or
(preferably) the Pacemaker mailing list.
Happy clustering,
Tim

[1] http://software.opensuse.org/download?project=network:ha-clustering:Stable&package=crmsh
[2] http://software.opensuse.org/download?project=network:ha-clustering:Stable&package=cluster-glue

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
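[Editor's note] For readers who prefer the command line to the download
page linked above, the OBS URL scheme is predictable: colons in the
project name become ":/" path components under
download.opensuse.org/repositories. A small sketch follows; the
openSUSE_12.3 path component is an assumption (substitute your own
release), and on Fedora you would use the repo's matching .repo file
instead of zypper:

```shell
#!/bin/sh
# Build the repository URL for an OBS project and print the zypper
# commands you'd run to add it and install hawk. The commands are
# printed rather than executed, so this is safe to run anywhere.
proj="network:ha-clustering:Stable"
distro="openSUSE_12.3"   # assumption: adjust for your release
path=$(printf '%s' "$proj" | sed 's/:/:\//g')
url="http://download.opensuse.org/repositories/${path}/${distro}/"
echo "zypper addrepo ${url} ha-clustering-stable"
echo "zypper refresh"
echo "zypper install hawk"
```

The sed step turns "network:ha-clustering:Stable" into
"network:/ha-clustering:/Stable", which is how OBS projects map onto the
download server's directory layout.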
[Pacemaker] Announce: opensuse-ha mailing list
Greetings,

There is now an opensuse-ha mailing list. This list is for discussion of
high availability clustering on openSUSE. This includes:

- The base cluster stack, i.e. corosync and pacemaker
- Management tools such as crmsh and hawk
- Clustered filesystems (e.g. ocfs2)
- Replicated storage (drbd)
- Basically, anything in network:ha-clustering:* on OBS is on topic :)

If you'd like to subscribe, just send an email to:

  opensuse-ha+subscr...@opensuse.org

Please also see the wiki page at:

  https://en.opensuse.org/openSUSE:High_Availability

Happy clustering!
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] reorg of network:ha-clustering repo on build.opensuse.org
On 07/26/2013 09:58 PM, Tim Serong wrote:
> On 07/25/2013 03:59 PM, Tim Serong wrote:
>> Hi All,
>>
>> This is just a quick heads-up. We're in the process of reorganising
>> the network:ha-clustering repository on build.opensuse.org. If you
>> don't use any of the software from this repo feel free to stop
>> reading now :)
>>
>> Currently we have:
>>
>> - network:ha-clustering (stable builds for various distros)
>> - network:ha-clustering:Factory (devel project for openSUSE:Factory)
>>
>> This is going to change to:
>>
>> - network:ha-clustering:Stable (stable builds for various distros)
>> - network:ha-clustering:Unstable (unstable/dev, various distros)
>> - network:ha-clustering:Factory (devel project for openSUSE:Factory)
>>
>> This means that if you're currently using packages from
>> network:ha-clustering, you'll need to point to
>> network:ha-clustering:Stable instead (once we've finished shuffling
>> everything around). I'll send another email out when this is done.
>
> network:ha-clustering:Stable has now been populated. There is some
> documentation of the new repository configuration at:
>
>   https://en.opensuse.org/openSUSE:High_Availability
>
> The old packages in the base network:ha-clustering repo will be purged
> soon, but not before 2013-08-05.

The old packages in the base network:ha-clustering repo have now been
purged.
For HA clustering fun, as mentioned above, please use one of:

- network:ha-clustering:Stable (stable builds for various distros)
- network:ha-clustering:Unstable (unstable/dev, various distros)
- network:ha-clustering:Factory (devel project for openSUSE:Factory)

For those of you not using openSUSE, network:ha-clustering:Stable
notably includes:

- crmsh, cluster-glue, pssh for CentOS 6, FC 18, FC 19
- hawk for FC 18

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] reorg of network:ha-clustering repo on build.opensuse.org
On 07/25/2013 03:59 PM, Tim Serong wrote:
> Hi All,
>
> This is just a quick heads-up. We're in the process of reorganising
> the network:ha-clustering repository on build.opensuse.org. If you
> don't use any of the software from this repo feel free to stop reading
> now :)
>
> Currently we have:
>
> - network:ha-clustering (stable builds for various distros)
> - network:ha-clustering:Factory (devel project for openSUSE:Factory)
>
> This is going to change to:
>
> - network:ha-clustering:Stable (stable builds for various distros)
> - network:ha-clustering:Unstable (unstable/dev, various distros)
> - network:ha-clustering:Factory (devel project for openSUSE:Factory)
>
> This means that if you're currently using packages from
> network:ha-clustering, you'll need to point to
> network:ha-clustering:Stable instead (once we've finished shuffling
> everything around). I'll send another email out when this is done.

network:ha-clustering:Stable has now been populated. There is some
documentation of the new repository configuration at:

  https://en.opensuse.org/openSUSE:High_Availability

The old packages in the base network:ha-clustering repo will be purged
soon, but not before 2013-08-05.

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
[Pacemaker] reorg of network:ha-clustering repo on build.opensuse.org
Hi All,

This is just a quick heads-up. We're in the process of reorganising the
network:ha-clustering repository on build.opensuse.org. If you don't use
any of the software from this repo feel free to stop reading now :)

Currently we have:

- network:ha-clustering (stable builds for various distros)
- network:ha-clustering:Factory (devel project for openSUSE:Factory)

This is going to change to:

- network:ha-clustering:Stable (stable builds for various distros)
- network:ha-clustering:Unstable (unstable/dev, various distros)
- network:ha-clustering:Factory (devel project for openSUSE:Factory)

This means that if you're currently using packages from
network:ha-clustering, you'll need to point to
network:ha-clustering:Stable instead (once we've finished shuffling
everything around). I'll send another email out when this is done.

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] Is crm_gui available under RHEL6?
On 02/15/2013 01:53 AM, Dejan Muhamedagic wrote:
> On Thu, Feb 14, 2013 at 10:46:40AM +0100, Rasto Levrinc wrote:
>> On Thu, Feb 14, 2013 at 12:20 AM, Ron Kerry rke...@sgi.com wrote:
>>> I am not sure if this is an appropriate question for a community
>>> forum since it is a RHEL specific question. However, I cannot think
>>> of a better forum to use (as someone coming from a heavy SLES
>>> background), so I will ask it anyway. Feel free to shoot me down or
>>> point me in a different direction. I do not find the pacemaker GUI
>>> in any of the RHEL6 HA distribution rpms. I have tried to think of
>>> all of its various names - crm_gui, hb_gui, mgmt/haclient, etc. -
>>> but I have not found it. A simple Google search also was not helpful
>>> - perhaps due to me not being sufficiently skilled at search
>>> techniques. Is it available somewhere in the RHEL6 HA distribution
>>> and I am just not finding it? Or do I need to build it from source
>>> or pull some community built rpm off the web?
>>
>> I am also not aware of any crm_gui packages for rhel6, not even a
>> community build. But you should be able to compile it on rhel6 from
>> here:
>>
>>   https://github.com/ClusterLabs/pacemaker-mgmt
>>
>> Luckily there are many alternative GUIs, but only 1 or 2 really
>> usable. In theory you can get the crmsh package from here:
>>
>>   http://download.opensuse.org/repositories/network:/ha-clustering/
>
> In practice too :) Every new version of crmsh is going to be available
> there for the selected platforms. Along with resource-agents,
> cluster-glue, etc.
>
>> I don't see a HAWK package there, so probably it's still not
>> compatible with the rhel 6 Ruby version at this moment.
>
> Right, hawk is not built. Tim should be able to tell why.

Yeah, the hawk build in network:ha-clustering is against rails 2, which
precludes building on recent Fedora (and presumably RHEL) versions
(FC 18 ships rails 3.2). I do have a reasonable rails 3.2 port which
I'll make available soon, but I still have some work in progress, bugs
to fix, things to clean up, etc. etc. before announcing a release.
Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent
On 11/08/2012 07:56 PM, Andrew Beekhof wrote:
> On Thu, Nov 8, 2012 at 5:16 PM, Tim Serong tser...@suse.com wrote:
>> On 11/08/2012 12:11 PM, Andrew Beekhof wrote:
>>> On Thu, Nov 8, 2012 at 9:59 AM, Matthew O'Connor m...@ecsorl.com wrote:
>>>> Follow-up and additional info: System is Ubuntu 12.04. Not sure
>>>> where killproc is supposed to be derived from, or if there is an
>>>> assumption for it to be a standalone binary or script. I did find
>>>> it defined in /lib/lsb/init-functions. Adding a
>>>> ". /lib/lsb/init-functions" to the start of the
>>>> /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs file makes the
>>>> process-kill work, but I suspect this is not the most desirable
>>>> solution.
>>>
>>> I think that's as good a solution as any. I wonder where other
>>> distros are getting it from.
>>
>> SLES 11 SP2:
>>
>>   # rpm -qf /sbin/killproc
>>   sysvinit-2.86-210.1
>>
>> openSUSE 12.2:
>>
>>   # rpm -qf /sbin/killproc
>>   sysvinit-tools-2.88+-77.3.1.x86_64
>>
>> Can't speak for any others offhand...
>
> Definitely not on fedora or its derivatives

Hrm. Well, I just had a quick skim of the ocfs2-tools source, and I'd be
willing to bet the o2cb RA was based on the upstream o2cb init script,
which uses killproc, but also sources /lib/lsb/init-functions. Does
Fedora have killproc buried somewhere in there maybe?

On SUSE, /lib/lsb/init-functions defines start_daemon(), killproc(), and
pidofproc(), but these just wrap binaries of the same name in /sbin
(which would explain why o2cb works fine on SUSE, as those missing
things are presumably in $PATH anyway).

I don't know about sourcing /lib/lsb/init-functions in .ocf-shellfuncs -
might be a bit broad? Presumably couldn't hurt to source it in the o2cb
RA though, unless there's some other cleaner solution...
Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
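[Editor's note] One way to make the sourcing less broad than patching
.ocf-shellfuncs, as discussed in the message above, is a guard inside
the resource agent itself: only pull in /lib/lsb/init-functions (or fall
back to a crude pkill-based stand-in) when killproc isn't already
available. This is a sketch of the idea, not the actual o2cb RA code;
the fallback function is invented here:

```shell
#!/bin/sh
# Hypothetical guard for a resource agent that needs killproc.
if ! command -v killproc >/dev/null 2>&1; then
    if [ -r /lib/lsb/init-functions ]; then
        # Distros like Ubuntu define killproc() as a shell function here
        . /lib/lsb/init-functions
    fi
fi
# Last-resort fallback so the RA still works if neither a killproc
# binary nor an LSB shell function turned up (crude: matches the
# process by exact name only, with none of killproc's usual semantics).
if ! command -v killproc >/dev/null 2>&1 && ! type killproc >/dev/null 2>&1; then
    killproc() { pkill -x "$1"; }
fi
type killproc >/dev/null 2>&1 && echo "killproc is available"
```

The advantage over sourcing init-functions unconditionally in
.ocf-shellfuncs is that only the one agent that needs killproc picks up
whatever else the distro's init-functions happens to define.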
Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent
On 11/08/2012 12:11 PM, Andrew Beekhof wrote:
> On Thu, Nov 8, 2012 at 9:59 AM, Matthew O'Connor m...@ecsorl.com wrote:
>> Follow-up and additional info: System is Ubuntu 12.04. Not sure where
>> killproc is supposed to be derived from, or if there is an assumption
>> for it to be a standalone binary or script. I did find it defined in
>> /lib/lsb/init-functions. Adding a ". /lib/lsb/init-functions" to the
>> start of the /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs file
>> makes the process-kill work, but I suspect this is not the most
>> desirable solution.
>
> I think that's as good a solution as any. I wonder where other distros
> are getting it from.

SLES 11 SP2:

  # rpm -qf /sbin/killproc
  sysvinit-2.86-210.1

openSUSE 12.2:

  # rpm -qf /sbin/killproc
  sysvinit-tools-2.88+-77.3.1.x86_64

Can't speak for any others offhand...

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
[Pacemaker] Fwd: Re: How can I make the secondary machine elect itself owner of the floating IP address?
Forwarding to the list for posterity (i.e. google) - I believe my reply
did solve the problem, BTW. The crm config in question is:

node scc-bak
node scc-pri
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=10.1.1.180 cidr_netmask=24 \
        op monitor interval=30s
primitive drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=15 role=Master \
        op monitor interval=30 role=Slave
primitive fs_r0 ocf:heartbeat:Filesystem \
        params device=/dev/drbd1 directory=/home/scc fstype=ext3 \
        op monitor interval=10s
primitive scc-stonith stonith:meatware \
        operations $id=scc-stonith-operations \
        op monitor interval=3600 timeout=20 start-delay=15 \
        params hostlist="10.1.1.32 10.1.1.31"
group r0 fs_r0 ClusterIP
ms ms_drbd_r0 drbd_r0 \
        meta master-max=1 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true
colocation r0_on_drbd inf: r0 ms_drbd_r0:Master
order r0_after_drbd inf: ms_drbd_r0:promote r0:start
property $id=cib-bootstrap-options \
        dc-version=1.1.6-b988976485d15cb702c9307df55512d323831a5e \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        no-quorum-policy=ignore
rsc_defaults $id=rsc-options \
        resource-stickiness=200

I probably should have noted that scc-pri and scc-bak aren't really the
best choice of names, because "pri" and "bak" are kind of meaningless
assuming identical nodes (and the nomenclature gets confusing when you
start talking about masters and slaves on top of that). Anyway...

-------- Original Message --------
Subject: Re: How can I make the secondary machine elect itself owner of
the floating IP address?
Date: Thu, 20 Sep 2012 12:36:03 +1000
From: Tim Serong
To: Epps, Josh

Hi Josh,

On 09/20/2012 10:47 AM, Epps, Josh wrote:
> Hi Tim, I saw one of your Gossamer threads and I really need some
> help. I have a two-node cluster running on SLES 11 SP2 with Pacemaker
> and DRBD. When I shutdown the primary with "shutdown -h now" the
> ocf:heartbeat:IPaddr2 transfers nicely to the backup server.
> But when I simulate a failure on the primary node by killing the
> power, neither the floating IP address nor the mount transfer to the
> secondary machine.

What's probably happening is:

- When you do a clean shutdown of one node, the surviving node knows
  the first has gone away, and it can safely take over those resources.
- When you cut power, the surviving node doesn't know what state the
  first node is in, so will do nothing until the first node is fenced.
- You're using the meatware STONITH plugin (which probably doesn't need
  a monitor op, BTW), which means you should see a CRIT message in
  syslog on the surviving node, telling you it expects the first node
  to be fenced.

> How can I make the secondary machine elect itself owner of the
> floating IP address?

Assuming the first machine is really down :) you should be able to tell
the cluster this is so by running "meatclient -c scc-pri" on the
surviving node (but do check syslog to see if you're really getting
warnings about a node needing to be fenced).

> Suse support today said that it can't be done with just two nodes but
> we just require a one-way failover.

Two node clusters should work fine, they're just more annoying than
three node - see for example "STONITH Deathmatch Explained" at
http://ourobengr.com/ha/

If the above doesn't solve it for you, do you mind if we take this to
the linux-ha or pacemaker public mailing list? More eyes on a problem
never hurts, and then a solution becomes googlable :)

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] [corosync] Ideas on merging #linux-ha and #linux-cluster on freenode
On 05/28/2012 10:51 AM, Andrew Beekhof wrote:
> On Mon, May 28, 2012 at 8:02 AM, Digimer li...@alteeve.ca wrote:
>> I'm not sure if this has come up before, but I thought it might be
>> worth discussing. With the cluster stacks merging, it strikes me that
>> having two separate channels for effectively the same topic splits up
>> folks. I know that #linux-ha technically still supports Heartbeat,
>> but other than that, I see little difference between the two
>> channels. I suppose a similar argument could be made for the myriad
>> of mailing lists, too. I don't know if any of the lists really have
>> significant enough load to cause a problem if the lists were merged.
>> Could Linux-Cluster, Corosync and Pacemaker be merged? Thoughts?
>>
>> Digimer, hoping a hornets nest wasn't just opened. :)
>
> I think the only thing you missed was proposing a meta-project to rule
> them all :-)

...One Totem Ring to rule them all, one Totem Ring to find them...

If only Sauron had implemented RRP during the Second Age, things might
have turned out differently for Middle Earth.

SCNR,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] [Openais] Help on mysql-proxy resource
Hi Carlos,

You'll have most luck with crm configuration questions on the Pacemaker
list (CC'd): pacemaker@oss.clusterlabs.org

I don't actually know anything about the mysql-proxy RA, but you might
have a typo.

On 03/30/2012 12:52 PM, Carlos xavier wrote:
> Hi. I have mysql-proxy running on my system and I want to aggregate it
> to the cluster configuration. When it is started by the system I got
> this as result of "ps auwwwx":
>
>   root 29644 0.0 0.0 22844 844 ? S 22:37 0:00 /usr/sbin/mysql-proxy
>   --pid-file /var/run/mysql-proxy.pid --daemon --proxy-lua-script

Note this is --proxy-lua-script (singular)

>   /usr/share/doc/packages/mysql-proxy/examples/tutorial-basic.lua
>   --proxy-backend-addresses=10.10.10.5:3306
>   --proxy-address=172.31.0.192:3306
>
> So I created the following configuration at the CRM:
>
>   primitive mysql-proxy ocf:heartbeat:mysql-proxy \
>     params binary=/usr/sbin/mysql-proxy pidfile=/var/run/mysql-proxy.pid \
>       proxy_backend_addresses=10.10.10.5:3306 \
>       proxy_address=172.31.0.191:3306 \
>       parameters="--proxy-lua-scripts /usr/share/doc/packages/mysql-proxy/examples/tutorial-basic.lua"

This is --proxy-lua-scripts (plural). I'm guessing maybe that's the
problem.

HTH,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] pacemaker - corosync with not automatic failover
On 02/07/2012 02:26 AM, Dimokritos Stamatakis wrote:
> Hello, regarding my previous issue with pacemaker and heartbeat: there
> was a problem with the version that apt-get used to retrieve. I now
> use pacemaker with corosync and it works fine. In our setup we need to
> have the ability to decide which node shall get the failover IP
> resource and force them to do so. In the default corosync-heartbeat
> configuration the cluster nodes decide which one shall get the
> failover IP resource. I want a way to stop the nodes from
> auto-assigning the failover IP resource after a node failure. I tried
> with monitoring disabled, but nothing happened. If I kill the node
> that owns the failover IP resource, then they elect another node as
> the new failover IP owner. I want to stop that, and be able to assign
> the failover IP to a specific node via the "crm resource migrate
> failover-IP node_x" command whenever I want, and corosync not to
> assign by itself! Is there a way to do that?

Well... If you run "crm resource migrate failover-IP node_x" as
mentioned above, failover-IP will stay on node_x forever, until you
migrate it somewhere else (or unmigrate it, in which case it'll have the
default behaviour of running on some node) :)

But you probably want to look at setting up some non-infinity location
constraints, e.g.:

  location ip-prefer-node_0 failover-IP 100: node_0
  location ip-maybe-node_1 failover-IP 50: node_1

...failover-IP would be placed with preference on node_0 (score 100), or
node_1 (score 50), or some other node if neither node_0 nor node_1 are
available (and assuming you have more than two nodes).

HTH,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
Re: [Pacemaker] OCFS2 problems when connectivity lost
On 12/21/2011 09:47 PM, Ivan Savčić | Epix wrote:
> Hello,
>
> We are having a problem with a 3-node cluster based on
> Pacemaker/Corosync with 2 primary DRBD+OCFS2 nodes and a quorum node.
> Nodes run on Debian Squeeze, all packages are from the stable branch
> except for Corosync (which is from backports for udpu functionality).
> Each node has a single network card.
>
> When the network is up, everything works without any problems,
> graceful shutdown of resources on any node works as intended and
> doesn't reflect on the remaining cluster partition.
>
> When the network is down on one OCFS2 node, Pacemaker
> (no-quorum-policy=stop) tries to shut the resources down on that node,
> but fails to stop the OCFS2 filesystem resource, stating that it is in
> use. *Both* OCFS2 nodes (ie. the one with the network down and the one
> which is still up in the partition with quorum) hang, with dmesg
> reporting that events, ocfs2rec and ocfs2_wq are blocked for more than
> 120 seconds.

My guess would be:

The filesystem can't stop on the non-quorate node, because the network
connection is down, so DLM can't do its thing.

The filesystem is probably frozen on the quorate node, because of loss
of DLM comms.

If STONITH is configured, the non-quorate node should be killed after a
failed (or timed out) stop, and the quorate node should resume behaving
normally.

HTH,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
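[Editor's note] For reference, the sort of STONITH configuration the
advice above presumes might look like this in crm shell syntax. The
external/ipmi plugin and every parameter value here are illustrative
assumptions; use whatever fencing device actually matches your
hardware:

```
# Hypothetical IPMI-based fencing for one node (repeat per node)
primitive fence-nodeA stonith:external/ipmi \
        params hostname=nodeA ipaddr=192.168.1.101 userid=admin passwd=secret \
        op monitor interval=3600
# Keep a node from being responsible for fencing itself
location l-fence-nodeA fence-nodeA -inf: nodeA
property stonith-enabled=true
```

With something like this in place, a failed or timed-out stop on the
non-quorate node leads to that node being fenced, which is what lets
DLM and the surviving OCFS2 mount recover.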
Re: [Pacemaker] Doc: Resource templates
On 12/14/2011 02:57 AM, Dejan Muhamedagic wrote:
On Tue, Dec 13, 2011 at 04:18:33PM +0800, Gao,Yan wrote:
On 12/13/11 04:25, Andrew Beekhof wrote:
On Mon, Dec 12, 2011 at 9:20 PM, Gao,Yan y...@suse.com wrote:
On 12/12/11 17:52, Florian Haas wrote:
On Mon, Dec 12, 2011 at 10:36 AM, Gao,Yan y...@suse.com wrote:
On 12/12/11 17:16, Florian Haas wrote:
On Mon, Dec 12, 2011 at 10:04 AM, Gao,Yan y...@suse.com wrote:
On 12/12/11 15:55, Gao,Yan wrote:

Hi, As some people have noticed, we've provided a new feature,
"Resource templates", since pacemaker-1.1.6. I made a document about it
which is meant to be included into Pacemaker_Explained. I borrowed the
materials from Tanja Roth, Thomas Schraitle (the documentation
specialists from SUSE) and Dejan Muhamedagic. Thanks to them! Attaching
it here first. If you are interested, please help review it. And if
anyone would like to help convert it into DocBook and make a patch, I
would much appreciate it. :-)

I can tell people would like to see a crm shell version of it as well.
I'll sort it out and post it here soon.

Attached the crm shell version of the document.

As much as I appreciate the new feature, was it really necessary that
you re-used a term that already has a defined meaning in the shell?
http://www.clusterlabs.org/doc/crm_cli.html#_templates Couldn't you
have called them resource prototypes instead? We've already confused
users enough in the past.

Since Dejan adopted the object name rsc_template in crm shell, and
calls it "Resource template" in the help, I'm not inclined to use
another term in the document. Opinion, Dejan?

I didn't mean to suggest to use a term in the documentation that's
different from the one the shell uses. I am suggesting to rename the
feature altogether. Granted, it may be a bit late to have a naming
discussion now, but I haven't seen this feature discussed on the list
at all, so there wasn't really a chance to voice these concerns sooner.

Actually there were discussions in the pcmk-devel mailing list.
Given that it has been included into the pacemaker-1.2 schema and
released with pacemaker-1.1.6, it seems too late for us to change it
from the cib side now.

Technically it's not yet in the 1.2 area; that change was pending on
this documentation update.

OK then. I would like to hear more voices about that, since Dejan and
Tim have been working on this for quite some time too.

Well, I believe that we already discussed the name. And there were no
better ideas heard. But it could as well be that my memory fails me.

I don't recall any better naming ideas floating past either (although,
now that Florian mentions prototype, hmm...) Anyway, IMO, overloading
the word template isn't /too/ bad. It could be qualified if necessary
as "resource template" (the new feature we're talking about here) and
"configuration template" (existing shell feature)...

Regards,
Tim

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
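[Editor's note] For readers arriving via the archives, here is roughly
what the feature under discussion looks like in crm shell syntax: a
rsc_template carries the common definition, and primitives reference it
with "@". The resource names and parameter values below are invented
for illustration:

```
rsc_template vm-base ocf:heartbeat:VirtualDomain \
        params hypervisor="qemu:///system" \
        op start timeout=120s op stop timeout=120s
primitive vm1 @vm-base params config="/etc/libvirt/qemu/vm1.xml"
primitive vm2 @vm-base params config="/etc/libvirt/qemu/vm2.xml"
```

Compare the shell's pre-existing "configuration templates" (the other
meaning of the word discussed above), which are skeletons expanded at
configure time rather than objects stored in the CIB.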
Re: [Pacemaker] ACL setup
On 12/10/2011 10:35 AM, Larry Brigman wrote: On Fri, Dec 9, 2011 at 3:19 PM, Andreas Kurz andr...@hastexo.com mailto:andr...@hastexo.com wrote: Hello Larry, On 12/09/2011 11:15 PM, Larry Brigman wrote: I have installed pacemaker 1.1.5 and configure ACLs based on the info from http://www.clusterlabs.org/doc/acls.html It looks like the user still does not have read access. Here is the acl section of config acls acl_role id=monitor read id=monitor-read xpath=/cib/ /acl_role acl_user id=nvs role_ref id=monitor/ /acl_user acl_user id=acm role_ref id=monitor/ /acl_user /acls Here is what the user is getting: [nvs@sweng0057 ~]$ crm node show Signon to CIB failed: connection failed Init failed, could not perform requested operations ERROR: cannot parse xml: no element found: line 1, column 0 [nvs@sweng0057 ~]$ crm status Connection to cluster failed: connection failed Any ideas as to why this wouldn't work and what to fix? If you really followed exactly the guide ... did you check user nvs already is in group haclient? Thought of that. Adding the user to the haclient group removes any restrictions as I was able to write to the config without error. Did you set crm configure property enable-acl=true? Without this, all users in the haclient group have full access. Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] How to stop a failed resource?
On 11/07/2011 08:27 PM, Tim Ward wrote: From: Andreas Kurz [mailto:andr...@hastexo.com] and of course you did: crm resource cleanup TestResource42 That works, thanks. However I found no mention of it in either Clusters from Scratch or Pacemaker Explained ... so which document(s) have I missed please? http://clusterlabs.org/doc/crm_cli.html Also, just run crm, it has tab completion, online help, etc. Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] Language bindings, again (was Re: Newcomer's question - API?)
On 11/02/2011 06:35 PM, Florian Haas wrote: On 2011-11-02 04:33, Tim Serong wrote: ianalI vaguely recall reading the FSF considered headers generally exempt from GPL provisions, provided they're boring, i.e. just structs, function definitions etc. If they're a whole lotta inline code, things are a bit different/ianal. Really? Here's a rough citation: http://linux.slashdot.org/story/11/03/20/1529238/rms-on-header-files-and-derivative-works (No, I didn't read the source material or any of the comments) Anyway. Roughly speaking, if we *did* have other language bindings for libcib/libpengine, the story would be something like this (Andrew can correct me if I'm wrong): libcib would let you manipulate the CIB programatically, with much the same ability you have when running cibadmin, i.e. you're just manipulating chunks of XML. Unless I'm not paying attention, there's no e.g. create resource API; your program would have to construct the correct XML resource definition then give it to libcib to inject into the cluster configuration. Likewise, to stop and start a resource, you'll be writing code to set the target-role meta attribute of that resource. I hate to handwave, as due to my practically non-existent C and C++-fu this is something I can't tackle myself. But let me float this idea here again. Coming from Python, what's usually available there is a thin, low-level wrapper around the C API, plus a high-level object-oriented API that is the only thing callers ever actually use. To make this portable to multiple languages, one possible option that's been suggested to me before is to create an OO C++ wrapper around the libcib/libpengine C APIs, and then SWIGify that (I do understand Andrew cringes at that, which I'll just accept for a moment). 
Such that, eventually, you might end up with something like cib = cib.connect() r = cib.resources.add("p_mysql", "ocf:heartbeat:mysql", binary="/usr/bin/mysqld") cib.commit() r.start() Extrapolate for Perl, Java, PHP, Ruby, or anything else that SWIG supports. No objection to that in principle - the major part of the work there is (or should be) the wrapper layer though, not the SWIG bits. By contrast, SWIGing what we have now would only give the thin, low-level wrapper you referred to above. And anyone using that thin wrapper would probably need to go read crm_resource.c or crm_mon.c to figure out how to use it :) So you may as well just invoke cibadmin, crm_resource, crm_attribute directly. I think it's safe to assume those interfaces are stable. At a higher level, crm configure ... should also be considered pretty stable; we know people use it in scripts so we try not to break it (and BTW, I use all this stuff in Hawk[1]). Where I do seem to recall you conceded at one point that firing off a binary every time you need to get a resource status doesn't exactly scale to scores of resources in, say, a 16-node cluster, and a Ruby library interface would be much more useful. Or am I mis-remembering? No, you're not misremembering, but my previous email maybe could have been clearer... For creating/modifying resources, IMO there's minimal overhead in invoking crm, or cibadmin or whatever, because you usually only have one-ish invocation(s) per create/edit/delete. Getting status is the annoying thing. The only way I know to do it comprehensively that doesn't involve multiple invocations of some CLI tool is to run cibadmin -Q, then interpret the status section, which is what I do in Hawk. This means I now have a few hundred lines of fairly hairy Ruby code which reimplements a few of Pacemaker's pengine status unpack functions. Which works, BTW.
But it doesn't really help anyone else, and TBH SWIG bindings would serve Hawk better here anyway, because then status calculation would only happen in one place (pengine), which would mean zero possibility of drift/misinterpretation/confusion. Blah. I do actually want to do the SWIG bindings at some point (it still hasn't filtered to the top of my list, and I wouldn't complain if someone beat me to it), but I want to make sure that whatever we do here, we get it right, because once it's there, we'll have to support it. Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
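As an editorial aside, the hypothetical high-level API Florian sketches (`cib.resources.add(...)`, `r.start()`) can be prototyped today as a thin wrapper that shells out to the crm shell, since those CLI interfaces are the ones Tim calls stable. The class and method names below are invented for illustration (they are not a real Pacemaker or crmsh API); a runner function is injected so the generated commands can be inspected or mocked without a live cluster.

```python
import shlex
import subprocess

class Cib:
    """Hypothetical high-level wrapper that builds crm shell invocations.

    The shape mirrors the sketch in the thread; it is NOT a real API.
    By default the runner executes the command; tests can inject a
    runner that merely records the command strings.
    """
    def __init__(self, runner=None):
        self.runner = runner or (lambda cmd: subprocess.run(
            shlex.split(cmd), check=True))
        self.resources = Resources(self)

    def run(self, cmd):
        return self.runner(cmd)

class Resources:
    def __init__(self, cib):
        self.cib = cib

    def add(self, rsc_id, agent, **params):
        # crm configure primitive <id> <class:provider:type> params k=v ...
        kv = " ".join("%s=%s" % (k, v) for k, v in sorted(params.items()))
        self.cib.run("crm configure primitive %s %s params %s"
                     % (rsc_id, agent, kv))
        return Resource(self.cib, rsc_id)

class Resource:
    def __init__(self, cib, rsc_id):
        self.cib = cib
        self.id = rsc_id

    def start(self):
        # 'crm resource start' sets target-role=Started under the hood
        self.cib.run("crm resource start %s" % self.id)
```

With a recording runner, `Cib(runner=commands.append)` lets you verify that `resources.add("p_mysql", "ocf:heartbeat:mysql", binary="/usr/bin/mysqld")` produces the expected `crm configure primitive` invocation before pointing it at a real cluster.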
Re: [Pacemaker] [Ocfs2-users] Error building ocfs2-tools
Hi Nick, It might not be obvious, but IMO this probably belongs back on the Pacemaker list (CC'd). On 11/03/2011 02:40 AM, Nick Khamis wrote: Hello Sunil and Tim, Thank you so much for your responses. I have applied the patch, and recompiled ocfs2-tools. When spinning the pcmk stack, I am recieving the following error from ocfs_conrtold.pcmk ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: crm_abort: send_ais_text: Triggered assert at ais.c:346 : dest != crm_msg_ais Sending message 0 via cpg: FAILED (rc=22): Message error: Success (0) ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: send_ais_text: Sending message 0 via cpg: FAILED (rc=22): Message error: Success (0) ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: crm_abort: send_ais_text: Triggered assert at ais.c:346 : dest != crm_msg_ais Sending message 1 via cpg: FAILED (rc=22): Message error: Success (0) ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: send_ais_text: Sending message 1 via cpg: FAILED (rc=22): Message error: Success (0) 1320247939 setup_stack@170: Cluster connection established. Local node id: 1 1320247939 setup_stack@174: Added Pacemaker as client 1 with fd -1 When in doubt, use the source... ocfs2-tools' ocfs2_controld/pacemaker.c:165[1] says: send_ais_text(crm_class_notify, true, TRUE, NULL, crm_msg_ais); pacemaker's lib/common/ais.c:327[2] says: switch(cluster_type) { case pcmk_cluster_classic_ais: ... break; case pcmk_cluster_corosync: case pcmk_cluster_cman: transport = cpg; CRM_CHECK(dest != crm_msg_ais, rc = CS_ERR_MESSAGE_ERROR; goto bail); So you're hitting that assert, because Pacemaker sees cluster_type as either pcmk_cluster_corosync or pcmk_cluster_cman. If Pacemaker saw cluster_type as pcmk_cluster_classic_ais, it would work fine. From memory, you're running Pacemaker under CMAN, somehow. 
Unfortunately I have no idea what you need to do to reconfigure it so that ocfs2_controld works, or even if it will work in that environment, but the above code is the source of your trouble. Regards, Tim [1] http://oss.oracle.com/git/?p=ocfs2-tools.git;a=blob;f=ocfs2_controld/pacemaker.c;h=822cf41c4c64cd3e5cb4373c339c2e575c4a5efd;hb=d45856e4a75348c1e3b44dc510c6b7f07b88a36f#l165 [2] http://hg.clusterlabs.org/pacemaker/1.1/file/9971ebba4494/lib/common/ais.c#l327 but note ais.c moved to corosync.c in newer source tree on github -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Linux-HA] pcmk + corosync + cman for dlm support?
On 11/03/2011 04:11 PM, Vladislav Bogdanov wrote: 02.11.2011 16:36, Nick Khamis wrote: Vladislav, Thank you so much for your response. Just to make sure, all I need is to: * Apply the three patches to cman. Found here http://www.gossamer-threads.com/lists/linuxha/pacemaker/75164?do=post_view_threaded;. * Recompile CMAN * Do I have to recompile PCMK again? I also want to mention that fencing is not important right now, and I would like to disable fencing JUST for the prototype, and untill things are going. I am almost there (cman-pcmk+ocfs2) with That patches are for fencing only. I have no idea about what goes wrong with your ocfs2_controld, I gave up on trying ocfs2 because it hangs the whole cluster for me. That reminds me... Nick, if you disable fencing (even for your prototype), and you experience (or try to test) any kind of split brain, or you kill one node (ungracefully), the clustered filesystem on all the other (surviving) nodes will freeze/lock up, because the cluster is unable to fence the failed node. Even if you choose something like meatware (see http://clusterlabs.org/doc/crm_fencing.html), you should still configure *some* means of fencing for any prototype system that's going to need fencing when put into production :) Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Newcomer's question - API?
On 11/02/2011 08:34 AM, Florian Haas wrote: On 2011-11-01 21:30, Andrew Beekhof wrote: On Wed, Nov 2, 2011 at 7:04 AM, Florian Haasflor...@hastexo.com wrote: On 2011-11-01 17:52, Tim Ward wrote: You can try tooking at LCMC as that is a Java-based GUI that should at least get you going. I did find some Java code but we can't use it because it's GPL, and I didn't want to study it in case I accidentally copied some of it in recreating it. Well if you can't use anything that's under GPL, I presume anything derived from cib.h is off limits to you anyway, as _that_ is under GPL. LGPL iirc From include/crm/cib.h: /* * Copyright (C) 2004 Andrew Beekhofand...@beekhof.net * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. Doesn't say much about LGPL afaics. ianalI vaguely recall reading the FSF considered headers generally exempt from GPL provisions, provided they're boring, i.e. just structs, function definitions etc. If they're a whole lotta inline code, things are a bit different/ianal. Anyway. Roughly speaking, if we *did* have other language bindings for libcib/libpengine, the story would be something like this (Andrew can correct me if I'm wrong): libcib would let you manipulate the CIB programatically, with much the same ability you have when running cibadmin, i.e. you're just manipulating chunks of XML. Unless I'm not paying attention, there's no e.g. create resource API; your program would have to construct the correct XML resource definition then give it to libcib to inject into the cluster configuration. Likewise, to stop and start a resource, you'll be writing code to set the target-role meta attribute of that resource. So you may as well just invoke cibadmin, crm_resource, crm_attribute directly. I think it's safe to assume those interfaces are stable. 
At a higher level, crm configure ... should also be considered pretty stable; we know people use it in scripts so we try not to break it (and BTW, I use all this stuff in Hawk[1]). libpengine is more interesting. That would give you reliable information about resource status. The alternative (given no other language bindings) is generally either: - various invocations of crm_mon and crm_resource (maybe lots of invocations, depending on what information you want to extract), which can suck on large clusters, or, - one invocation of cibadmin -Q to get the CIB status section, then process this yourself to determine resource status, using the Dragon Page[2] as a guide. If you do a good jobs of this and/or you care about op history (not just current status), you will end up reimplementing parts of the determine_online_status() and unpack_rsc_op() functions from Pacemaker's lib/pengine/unpack.c in $other_language_of_your_choice. Regards, Tim [1] http://clusterlabs.org/wiki/Hawk [2] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch-status.html -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Trouble with KVM Resource
On 11/01/2011 02:23 PM, Cliff Massey wrote: I am having a problem with my kvm resource. It was working until I decided to re-install the kvm machine. The libvirt xml file and the pacemaker configuration did not change. I can start the kvm outside of pacemaker just fine. When I check the libvirt log, it shows no attempt to start the kvm machine from pacemaker. crm_mon -1 shows: Online: [ admin01 admin02 ] convirt-kvm (ocf::heartbeat:VirtualDomain): Started admin01 (unmanaged) FAILED Master/Slave Set: ms-convirt [convirt-drbd] Masters: [ admin02 ] Slaves: [ admin01 ] sitescope-kvm (ocf::heartbeat:VirtualDomain): Started admin02 Master/Slave Set: ms-sitescope [sitescope-drbd] Masters: [ admin02 ] Slaves: [ admin01 ] Failed actions: convirt-kvm_monitor_0 (node=admin01, call=2, rc=1, status=complete): unknown error convirt-kvm_stop_0 (node=admin01, call=6, rc=1, status=complete): unknown error My other kvm machine with the same config works just fine. I can't tell you why it doesn't work anymore, but... my logs are at: http://pastebin.com/peFw5KKp The relevant bit of that log is (pardon the formatting): Nov 1 03:14:37 admin01 crmd: [15349]: info: te_rsc_command: Initiating action 4: monitor convirt-kvm_monitor_0 on admin01 (local) ... Nov 1 03:14:38 admin01 VirtualDomain[15370]: ERROR: /var/run/heartbeat/rsctmp/VirtualDomain-convirt-kvm.state is empty. This is unexpected. Cannot determine domain name. ... Nov 1 03:14:38 admin01 lrmd: [15346]: WARN: Managed convirt-kvm:monitor process 15370 exited with return code 1. ... Nov 1 03:14:38 admin01 crmd: [15349]: info: process_lrm_event: LRM operation convirt-kvm_monitor_0 (call=2, rc=1, cib-update=29, confirmed=true) unknown error So the probe (and presumably subsequent stop) for that resource failed, hence no attempt to start it. As for how the state file is empty, I'm not sure. 
Look at VirtualDomain_Define() in /usr/lib/ocf/resource.d/heartbeat/VirtualDomain (line ~200 onwards), by my reading it shouldn't be possible for that state file to be empty. Unless, somehow (wild guess), permissions on the state file or some parent directory prohibit writing? Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] cloning primatives with differing params
On 26/10/11 05:45, Brian J. Murrell wrote: I want to create a stonith primitive and clone it for each node in my cluster. I'm using the fence-agents virsh agent as my stonith primitive. Currently for a single node it looks like: primitive st-pm-node1 stonith:fence_virsh \ params ipaddr=192.168.122.1 login=xxx passwd=xxx port=node1 action=reboot pcmk_host_list=node1 pcmk_host_check=static-list pcmk_host_map= secure=true But of course that only works for one node and I want to create a clonable primitive that will apply to all nodes as they are added to the cluster. What is stumping me though is the required port parameter which is the node to stonith. I've not seen an example of how a clone resource can be created that can substitute values in for each clone. Is that even possible? OCF resource agents can be aware they're running as clones, and do interesting things as a result, e.g.: IPaddr2, when cloned, with the unique_clone_address parameter set will add the clone ID to the IP address, to give you a whole bunch of IP addresses. Unfortunately I don't know offhand if the same trick can work with STONITH agents (they'd have to be told by pacemaker they were cloned, and then each would have to be instrumented to support it). On a pretty un-related question... given an asymmetric cluster, is there a way to specify that a resource can run on any node without having to add a location constraint for each node as they are added? You could try one constraint per resource, covering all nodes, something like: location some-res-on-all-nodes some-resource \ rule 0: #uname eq node1 or #uname eq node2 or #uname eq node3 ... 
Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
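For an asymmetric cluster with many nodes, the `rule 0: #uname eq node1 or ...` constraint suggested in this thread gets tedious to write by hand as nodes are added. A small helper can generate it; the function below is purely illustrative string-building (the name and signature are invented), and its output would still need to be loaded via `crm configure`.

```python
def location_rule(constraint_id, rsc_id, nodes, score=0):
    """Build a crm shell location constraint allowing rsc_id on the
    given nodes, in the 'rule <score>: #uname eq ...' form used in
    this thread. Hypothetical helper; not part of crmsh itself."""
    expr = " or ".join("#uname eq %s" % n for n in nodes)
    return "location %s %s rule %s: %s" % (constraint_id, rsc_id, score, expr)
```

Regenerating and replacing the single constraint when the node list changes is still a manual step, but at least the rule text stays consistent.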
Re: [Pacemaker] 4 servers; different resources on different servers?
On 04/10/11 04:06, Nick Khamis wrote: I forgot to ask, for creating an asymmetric cluster, do the services (mysql, apache etc..) have to be installed on all the nodes. Probably. Pacemaker will still try to probe resources on all nodes, to ensure they're not running, then the RA will return not installed if the software isn't installed, and you'll see an error message on that node. The error might not matter, but you might not like to see it :) And finally is asymmetric active/active? Asymmetric means resources will never run at all by default, unless you specifically create location constraints to make them run on some node. Active/active generally means something like some set of resources is running on at least two nodes[1]. There is no reason you can't do this with an asymmetric cluster. It just depends what location constraints you configure. HTH, Tim [1] Depending on your definition, it might also mean the exact same resource is running on at least two nodes, e.g.: a clustered filesystem. -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] Call cib_modify failed (-22): The object/attribute does not exist
On 25/09/11 01:16, Brian J. Murrell wrote: Using pacemaker-1.0.10-1.4.el5 I am trying to add the R_10.10.10.101 IPaddr2 example resource: primitive id=R_10.10.10.101 class=ocf type=IPaddr2 provider=heartbeat instance_attributes id=RA_R_10.10.10.101 attributes nvpair id=R_ip_P_ip name=ip value=10.10.10.101/ nvpair id=R_ip_P_nic name=nic value=eth0/ /attributes /instance_attributes /primitive from the cibadmin manpage under EXAMPLES and getting: # cibadmin -o resources -U -x test.xml Call cib_modify failed (-22): The object/attribute does not exist null Any ideas why? Because: 1) You need to run cibadmin -o resources -C -x test.xml to create the resource (-C creates, -U updates an existing resource). 2) Even if you use -C, it will probably still fail due to a schema violation, because the attributes element is bogus (apparently the cibadmin man page needs tweaking). Try: primitive id=R_10.10.10.101 class=ocf type=IPaddr2 provider=heartbeat instance_attributes id=RA_R_10.10.10.101 nvpair id=R_ip_P_ip name=ip value=10.10.10.101/ nvpair id=R_ip_P_nic name=nic value=eth0/ /instance_attributes /primitive Better yet, use the crm shell instead of cibadmin, and you can forget about the XML :) Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
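For anyone generating resource definitions programmatically rather than hand-editing test.xml, the corrected structure from this reply (nvpair elements directly inside instance_attributes, with no intermediate attributes element) can be built with a few lines of ElementTree. This is a sketch only; the function name is invented, and feeding the result to `cibadmin -o resources -C` is left to the caller.

```python
import xml.etree.ElementTree as ET

def ipaddr2_primitive(rsc_id, ip, nic):
    # Mirrors the corrected XML above: instance_attributes contains
    # nvpair children directly (no bogus <attributes> wrapper).
    prim = ET.Element("primitive", {
        "id": rsc_id, "class": "ocf", "type": "IPaddr2",
        "provider": "heartbeat",
    })
    ia = ET.SubElement(prim, "instance_attributes", {"id": "RA_%s" % rsc_id})
    ET.SubElement(ia, "nvpair",
                  {"id": "%s_P_ip" % rsc_id, "name": "ip", "value": ip})
    ET.SubElement(ia, "nvpair",
                  {"id": "%s_P_nic" % rsc_id, "name": "nic", "value": nic})
    return ET.tostring(prim, encoding="unicode")
```

Writing the returned string to a file and running `cibadmin -o resources -C -x file.xml` should then create (not update) the resource, per the fix in the reply.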
Re: [Pacemaker] crm_mon -n -1 : Command output format
On 09/09/11 01:28, manish.gu...@ionidea.com wrote: Hi, I am using the crm_mon -n -1 command to parse resource status. Sometimes the format changes, and because of that I am getting unexpected output in my backend program. Can anybody help me to know all the possible output formats of crm_mon -n -1? General format of output: = Node NodeName: NodeStatus ResourceName ResourceAgentType Status = But for a clone resource I am getting this when the cluster is in unmanaged status: Node NodeName: NodeStatus ResourceName ResourceAgentType (ORPHANED) Status Because of (ORPHANED) the resource status is shifted, and I am getting the wrong result. Please can you help me with all the possible output scenarios? Or please can you share the source code of the crm_mon command. I don't think all possible output scenarios are documented anywhere, given crm_mon is generally more for human consumption. If it helps though, the source is at: http://hg.clusterlabs.org/pacemaker/1.1/file/tip/tools/crm_mon.c You might also like to experiment with crm_resource -O, although I can't say offhand what that does with orphans. Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
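If parsing crm_mon output really is unavoidable, the field-shift problem described in this thread can be sidestepped by treating flags like (ORPHANED) as optional tokens rather than fixed columns. The sketch below is modeled on the resource-line format discussed in the mail; crm_mon output is version-dependent, so this regex is illustrative, not a guaranteed match for every release.

```python
import re

# crm_mon -n -1 resource lines look roughly like (version-dependent!):
#   p_foo:0 (ocf::heartbeat:Dummy) Started
#   p_foo:0 (ocf::heartbeat:Dummy) (ORPHANED) Started
# Making the flags group optional keeps the status field from shifting.
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+"
    r"\((?P<agent>[^)]+)\)"
    r"(?:\s+\((?P<flags>[^)]+)\))?"
    r":?\s+(?P<status>\w+)")

def parse_resource_line(line):
    """Return (name, agent, flags, status), or None if no match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return (m.group("name"), m.group("agent"),
            m.group("flags"), m.group("status"))
```

Even so, Tim's advice stands: for machine consumption, `cibadmin -Q` plus your own status interpretation is more robust than screen-scraping a tool meant for humans.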
Re: [Pacemaker] crm resource status and HAWK display differ after manually mounting filesystem resource
On 29/08/11 13:24, Tim Serong wrote: On 28/08/11 21:43, Sebastian Kaps wrote: Hi, on our two-node cluster (SLES11-SP1+HAE; corosync 1.3.1, pacemaker 1.1.5) we have defined the following FS resource and its corresponding clone: primitive p_fs_wwwdata ocf:heartbeat:Filesystem \ params device=/dev/drbd1 \ directory=/mnt/wwwdata fstype=ocfs2 \ options=rw,noatime,noacl,nouser_xattr,commit=30,data=writeback \ op start interval=0 timeout=90s \ op stop interval=0 timeout=300s clone c_fs_wwwdata p_fs_wwwdata \ params master-max=2 clone-max=2 \ meta target-role=Started is-managed=true one of the nodes (node01) went down last night and I started it with the cluster put into maintenance-mode. After checking everything else, I mounted the ocfs2-resource manually, did some crm resource reprobe/cleanup to make the cluster aware of this and finally turned off the maintenance-mode. Looking at the output of crm_mon, everything looks good again: Clone Set: c_fs_wwwdata [p_fs_wwwdata] Started: [ node01 node02 ] alternatively looking at crm_mon -n: Node node02: online p_fs_wwwdata:1 (ocf::heartbeat:Filesystem) Started Node node01: online p_fs_wwwdata:0 (ocf::heartbeat:Filesystem) Started but the HAWK web interface (version 0.3.6 coming with SLES11SP1-HAE) displays this: Clone Set: c_fs_wwwdata - p_fs_wwwdata:0: Started: node01, node02 - p_fs_wwwdata:1: Stopped Does anybody know why there is a difference? Did I make a mistake when manually mounting the FS while it was unmanaged? Or is this only a cosmetical issue with HAWK? When these resources are started by pacemaker, HAWK shows exactly what's expected: two started resoures, one per node. Thanks in advance! It's almost certainly a cosmetic issue in Hawk. I have fixed one or two bugs along these lines since version 0.3.6. 
If you'd like to try a newer (not-officially-supported-by-SUSE-but-best-effort-support-by-me) build, you can try hawk-0.4.1 from: http://software.opensuse.org/search?q=Hawk&baseproject=SUSE%3ASLE-11%3ASP1&lang=en Alternately, if you can reproduce the issue then send me the output of cibadmin -Q (offlist is fine), I can verify/fix it. Just for the record, it was a cosmetic issue in Hawk, now fixed in hg: http://hg.clusterlabs.org/pacemaker/hawk/rev/3266874ef3fe Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] crm resource status and HAWK display differ after manually mounting filesystem resource
On 28/08/11 21:43, Sebastian Kaps wrote: Hi, on our two-node cluster (SLES11-SP1+HAE; corosync 1.3.1, pacemaker 1.1.5) we have defined the following FS resource and its corresponding clone: primitive p_fs_wwwdata ocf:heartbeat:Filesystem \ params device=/dev/drbd1 \ directory=/mnt/wwwdata fstype=ocfs2 \ options=rw,noatime,noacl,nouser_xattr,commit=30,data=writeback \ op start interval=0 timeout=90s \ op stop interval=0 timeout=300s clone c_fs_wwwdata p_fs_wwwdata \ params master-max=2 clone-max=2 \ meta target-role=Started is-managed=true one of the nodes (node01) went down last night and I started it with the cluster put into maintenance-mode. After checking everything else, I mounted the ocfs2-resource manually, did some crm resource reprobe/cleanup to make the cluster aware of this and finally turned off the maintenance-mode. Looking at the output of crm_mon, everything looks good again: Clone Set: c_fs_wwwdata [p_fs_wwwdata] Started: [ node01 node02 ] alternatively looking at crm_mon -n: Node node02: online p_fs_wwwdata:1 (ocf::heartbeat:Filesystem) Started Node node01: online p_fs_wwwdata:0 (ocf::heartbeat:Filesystem) Started but the HAWK web interface (version 0.3.6 coming with SLES11SP1-HAE) displays this: Clone Set: c_fs_wwwdata - p_fs_wwwdata:0: Started: node01, node02 - p_fs_wwwdata:1: Stopped Does anybody know why there is a difference? Did I make a mistake when manually mounting the FS while it was unmanaged? Or is this only a cosmetical issue with HAWK? When these resources are started by pacemaker, HAWK shows exactly what's expected: two started resoures, one per node. Thanks in advance! It's almost certainly a cosmetic issue in Hawk. I have fixed one or two bugs along these lines since version 0.3.6. 
If you'd like to try a newer (not-officially-supported-by-SUSE-but-best-effort-support-by-me) build, you can try hawk-0.4.1 from: http://software.opensuse.org/search?q=Hawk&baseproject=SUSE%3ASLE-11%3ASP1&lang=en Alternately, if you can reproduce the issue then send me the output of cibadmin -Q (offlist is fine), I can verify/fix it. Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] DLM and Control instances for OCFS2
On 19/08/11 13:12, Prakash Velayutham wrote: Hi, I am using pacemaker - 1.1.5-5.5.5 corosync - 1.3.0-5.6.1 ocfs2 - 1.4.3-0.16.7 I will be using 2 OCFS2 volumes for different purposes. Is it enough to have just one instance of ocf:pacemaker:controld and ocf:ocfs2:o2cb or do I need a separate instance of the above for each OCFS2 volume being managed by Corosync/Pacemaker cluster? Nope, just the one. Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] Announce: Hawk 4.1 (Pacemaker GUI) packages for Debian Squeeze
On 22/08/11 07:44, Joerg Sauer wrote: On Aug 20, 2011, at 6:14 PM, Joerg Sauer lists_pacema...@dizopsin.net wrote: This version should also install and run on Ubuntu 10.04 (only minimally tested). On Sun August 21 2011 06:07:26 Cotton Tenney wrote: Awesome, I'll be trying this out next week. Thanks! Uhm, that statement about Ubuntu 10.04 was actually wildly incorrect. The Squeeze package will not work on Lucid, so I created a separate one. It does not use the Ruby libs provided by Ubuntu, though (has frozen gems just like upstream). There is also an APT repository with both packages now. More information: http://www.dizopsin.net/debian-and-ubuntu-packages-for-clusterlabs-ha Best regards, Joerg Many thanks for your work! Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com
Re: [Pacemaker] Extracting resource state information from the XML
On 11/08/11 21:51, pskrap wrote: Hi, I have a setup with tens of resources over several nodes. The interface that is used to administer the system has a page showing all resources, their state and which node they are running on. I can get the information of one resource using 'crm_resource -W -rrsc' but running this command over and over again for that many resources is far to slow for my needs. The crm_mon produced web page is not enough as I need it in a customized format. I figured the best way to do this efficiently is to query the XML using cibadmin -Q, parse it and get the state of all resources from there in one go. Unfortunately I am not familiar with the status part of the XML. Is anyone able to tell me how i can find the following information in the XML: - resource state (running, stopped, failed) - which node the resource is currently running on You probably want to read Chapter 12. Status - Here be dragons of Pacemaker Explained: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch-status.html In particular, the Complex Resource History Example: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch12s03s02.html Very roughly speaking, for each node_state, you have to look at each lrm_resource_op for each lrm_resource, and based on the specific op (start, stop, monitor, promote, demote, etc.) and its return code, you determine the state of the resource on that node. e.g.: if the last op was a successful (rc=0) start, or a successful monitor, the resource is running on that node. If you're in a hurry, you might find it less painful to parse the output of something like crm_mon -o -1 or crm_mon -n -1. 
Or, if you'd like to examine some hairy Ruby code for interpreting the CIB status section, have a look at: http://hg.clusterlabs.org/pacemaker/hawk/file/tip/hawk/app/models/cib.rb#l300 Note though that this looks at all the ops, to record a list of what's failed (it's a loose transliteration of Pacemaker's C code that does the same thing). If you only care about state, you probably only care about the *last* op. I should also take the opportunity to plug Hawk, if you need a web based thing for managing Pacemaker clusters: http://www.clusterlabs.org/wiki/Hawk HTH, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
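To make the last-op-wins idea above concrete, here is a hedged sketch of the interpretation step only: a shell helper that maps a resource's most recent LRM operation and its return code to a coarse state. The function and state names are invented for illustration (rc=7 is OCF_NOT_RUNNING in the OCF return-code conventions); actually extracting the last lrm_rsc_op per lrm_resource from `cibadmin -Q` output is left to your XML tool of choice.

```shell
# Hypothetical helper: interpret the last op + return code for a resource
# on one node, per the rules described above. Names are made up.
resource_state() {
  # $1 = operation (start/stop/monitor/promote/...), $2 = return code
  case "$1/$2" in
    start/0|monitor/0|promote/0) echo "running" ;;  # successful start/monitor
    stop/0|monitor/7)            echo "stopped" ;;  # rc=7 is OCF_NOT_RUNNING
    *)                           echo "failed"  ;;  # anything else went wrong
  esac
}

resource_state start 0    # prints "running"
```

Feed it the op name and rc of the *last* operation for each lrm_resource under each node_state and you get the per-node picture the original poster is after.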
Re: [Pacemaker] Dependency Loop Errors in Log
On 09/08/11 02:36, Bobbie Lind wrote: I have 6 servers with three sets of 2 failover pairs. So 2 servers for one pair, 2 servers for another pair etc. I am trying to configure this under one pacemaker instance. I changed from using Resource groups because the resources are not dependent on each other, just located together. I have 4 dummy resources that are used to help with colocation. The following configuration works as designed when I first start up pacemaker but when I try and run failover tests that's when things get screwy. Here is the relevant snippet of my configuration that shows the location and colocation set up. As well as what I *think* I am asking it to do. [...snip...] ** Ensuring that the resources from one failover node do not start up on the other nodes giving -500 points. ** failover pairs are MDSgroup, OSS1/OSS3, and OSS2/OSS4 colocation colocMDSOSS1 -500: anchorOSS1 MDSgroup colocation colocMDSOSS2 -500: anchorOSS2 MDSgroup colocation colocMDSOSS3 -500: anchorOSS3 MDSgroup colocation colocMDSOSS4 -500: anchorOSS4 MDSgroup colocation colocOSS1MDS -500: MDSgroup anchorOSS1 colocation colocOSS2MDS -500: MDSgroup anchorOSS2 colocation colocOSS3MDS -500: MDSgroup anchorOSS3 colocation colocOSS4MDS -500: MDSgroup anchorOSS4 colocation colocOSS2OSS1 -500: anchorOSS1 anchorOSS2 colocation colocOSS4OSS1 -500: anchorOSS1 anchorOSS4 colocation colocOSS1OSS2 -500: anchorOSS2 anchorOSS1 colocation colocOSS3OSS2 -500: anchorOSS2 anchorOSS3 colocation colocOSS2OSS3 -500: anchorOSS3 anchorOSS2 colocation colocOSS4OSS3 -500: anchorOSS3 anchorOSS4 colocation colocOSS1OSS4 -500: anchorOSS4 anchorOSS1 colocation colocOSS3OSS4 -500: anchorOSS4 anchorOSS3 [...snip...] One of the issues I am running into is the logs are giving me dependency loop errors. 
Here is a snippet but it does this for all the anchor/dummy resources and the LVM resource (from MDSgroup) Aug 08 11:05:56 s02ns070 pengine: [32677]: info: rsc_merge_weights: anchorOSS1: Breaking dependency loop at MDSgroup [...snip...] I think these dependency loops are what's causing the other quirky behavior I have of resources failing to the wrong server. I'm not sure where the dependency loop is coming from, but I'm sure it has something to do with my configuration and score setup. Any help deciphering this would be greatly appreciated. You can't have bidirectional colocation, i.e. either specify colocation colocMDSOSS1 -500: anchorOSS1 MDSgroup or colocation colocOSS1MDS -500: MDSgroup anchorOSS1, but not both. The dependency loop error means pacemaker is tossing one of these away. For some more detail, check the Resource Constraints chapter of Pacemaker Explained (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/) or the mailing list archives (this has come up a few times in recent memory). HTH, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
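As a sketch of what "one direction per pair" means for the poster's own config (constraint IDs and scores taken from the snippet above; which direction you keep is your choice):

```
# keep ONE colocation constraint per pair, e.g.:
colocation colocMDSOSS1 -500: anchorOSS1 MDSgroup
colocation colocOSS2OSS1 -500: anchorOSS1 anchorOSS2
# ...and delete the reverse constraints (colocOSS1MDS, colocOSS1OSS2, etc.)
```

With the reverse entries gone, pengine no longer has to break a loop, and the -500 anti-colocation still keeps each failover pair apart.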
Re: [Pacemaker] wiping out cluster config
On 07/07/11 06:23, Jean-Francois Malouin wrote: Hi, I want to wipe out my existing cluster config and start afresh, with a pristine/empty config without actually starting pacemaker -- cluster is down right now. Is it enough to just remove files in /var/lib/heartbeat/crm and /var/lib/pengine ? That always worked for me. Just make sure you do it on all nodes before you start any of them. And if you break it, you get to keep both pieces :) Regards, Tim This is on Debian/Squeeze with pacemaker 1.0.9.1. thanks! jf -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
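In script form, the procedure above might look like the following sketch. The paths are the Pacemaker 1.0/heartbeat-stack defaults from this thread; the CRM_DIR/PE_DIR overrides are invented here purely so the sketch can be exercised safely. Again: cluster stopped, run on every node, no warranty.

```shell
#!/bin/sh
# Sketch only: wipe the saved CIB and policy engine files so the cluster
# starts with an empty configuration. Run on ALL nodes while they are down.
wipe_cluster_config() {
  crm_dir="${CRM_DIR:-/var/lib/heartbeat/crm}"  # saved CIB (cib.xml etc.)
  pe_dir="${PE_DIR:-/var/lib/pengine}"          # saved pengine transition inputs
  rm -f "$crm_dir"/* "$pe_dir"/*
}
```

Called with no overrides it removes the real files; the variables exist only so the function can be pointed at scratch directories first.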
Re: [Pacemaker] Pacemaker+Corosync from OBS
On 22/06/11 22:14, Ciro Iriarte wrote: 2011/6/21 Tim Serong tser...@novell.com: On 22/06/11 08:57, Ciro Iriarte wrote: Hi, I'm trying pacemaker from OBS and I don't see any init script for corosync or pacemaker, am I overlooking something obvious? Name: pacemaker Relocations: (not relocatable) Version : 1.1.5 Vendor: openSUSE Build Service Release : 1.1 Build Date: Thu Apr 14 04:25:55 2011 Name: corosync Relocations: (not relocatable) Version : 1.3.0 Vendor: openSUSE Build Service Release : 1.1 Build Date: Thu Apr 14 04:08:04 2011 Regards, Install openais as well - it includes /etc/init.d/openais which starts corosync. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. Thanks, I thought corosync replaced openais... I was expecting a corosync init script :) Understandable :) The openais init script is a holdover from prior to the corosync/openais split. Keeping it made upgrading systems from openais 0.8.x to corosync 1.x + openais 1.x a bit nicer, but we should probably do something about an actual corosync init script. Also, I've read that it's better to start corosync and pacemaker independently, (service --- ver: 1), that's not currently possible with OBS build then, am I right? Correct, not yet possible (although, FWIW, AFAIK, the problems people experienced with service ver: 0 generally didn't manifest on SUSE). I believe adding support for service ver: 1 (MCP) is mostly a matter of tweaking the spec file to include the init script and a couple of other things, then test it. See lines 196-198 at: https://build.opensuse.org/package/view_file?file=pacemaker.spec&package=pacemaker&project=network%3Aha-clustering&srcmd5=a2aa81b9e6b8f3e4fcd7a5bbb6b25e8a Patches (or, given it's OBS, submitreqs) gladly accepted :) Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
Re: [Pacemaker] Pacemaker+Corosync from OBS
On 22/06/11 08:57, Ciro Iriarte wrote: Hi, I'm trying pacemaker from OBS and I don't see any init script for corosync or pacemaker, am I overlooking something obvious? Name: pacemaker Relocations: (not relocatable) Version : 1.1.5 Vendor: openSUSE Build Service Release : 1.1 Build Date: Thu Apr 14 04:25:55 2011 Name: corosync Relocations: (not relocatable) Version : 1.3.0 Vendor: openSUSE Build Service Release : 1.1 Build Date: Thu Apr 14 04:08:04 2011 Regards, Install openais as well - it includes /etc/init.d/openais which starts corosync. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Permission denied using HAWK
On 18/06/11 22:02, Michael Schwartzkopff wrote: Hi, Creating a resource in HAWK works like a charm. Very nice. Now I want to start or stop the resource and the pop-up window tells me: Error: Permission Denied Any idea what might be wrong? System: - OpenSUSE 11.4 - pacemaker 1.1.5 - hawk 0.4.1 from OBS Editing the resource works. Rails 2.3.11 introduced a fix for a cross site request forgery exploit, which broke Hawk's start/stop/etc. functionality on the status screen. I've just updated Hawk on OBS to work correctly in this case. Please try upgrading to the latest version (hawk-0.4.1-2.1.$ARCH.rpm). Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1
On 19/05/11 00:43, Tim Serong wrote: Hi Everybody, This is to announce version 0.4.1 of Hawk, a web-based GUI for managing and monitoring Pacemaker High-Availability clusters. [...] Building an RPM for Fedora/Red Hat is still just as easy as last time: # hg clone http://hg.clusterlabs.org/pacemaker/hawk # cd hawk # hg update hawk-0.4.1 # make rpm *ahem* It /would/ still be just as easy if I had said hg update tip, or, in this specific instance, hg update 398ae27386e (the Makefile grabs the last tag from hg to use as a version number, which is one commit *after* the actual tagged commit). Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1
On 19/05/11 20:49, Lars Marowsky-Bree wrote: On 2011-05-18T18:54:27, Daugherity, Andrew W adaugher...@tamu.edu wrote: This is to announce version 0.4.1 of Hawk, a web-based GUI for managing and monitoring Pacemaker High-Availability clusters. ... As before, packages for various SUSE-based distros can be obtained from the network:ha-clustering and network:ha-clustering:Factory repos on OBS, or you can just search for Hawk on software.opensuse.org: http://software.opensuse.org/search?baseproject=ALL&q=Hawk Are there any plans to push this out to the SLE11 HAE SP1 update channel? Yes. But that may take a bit longer. Tim's announcing the open source/upstream/community release here ;-) I guess I could always just grab the hawk RPM from the OBS repo and upgrade hawk... I'd rather not add the repo and risk mixing corosync/pacemaker/etc. packages between repos on a production cluster. That's understandable, and I'd not advise that you do that. Perhaps Tim can investigate publishing hawk packages that are built against the latest maintenance updates for SLE HA. I'll see what I can do. In the meantime, the hawk RPM from OBS does install and run on top of SLE HA (i.e. works for me), but obviously any RPMs that aren't in the official update channel aren't officially supported by SUSE. All us clustering types being paranoid by nature, if in doubt, I'd suggest trying it out on a test cluster first :) Also, does anyone know why hawk and crm_gui are case-sensitive for usernames when nothing else is? (Yes, I know mixed-case usernames are bad -- I didn't set up the central auth.) Everything else using LDAP auth (e.g. pam_ldap, apache mod_authnz_ldap, LDAP plugins to various CMSes/Wikis/issue trackers, etc.) is fine with both adaugherity and ADaugherity but hawk/crm_gui require the mixed-case version. They go via the PAM backends too, so this is surprising ... Thanks for pointing this out. Noted. I'm not sure what's going on there yet... 
Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1
Hi Everybody, This is to announce version 0.4.1 of Hawk, a web-based GUI for managing and monitoring Pacemaker High-Availability clusters. You can use Hawk to: - Monitor your cluster. - Perform basic operator tasks (start/stop/migrate etc). - Create, edit and delete resources. - Edit crm_config properties. - Create, edit and delete location, colocation, and ordering constraints (new in 0.4.1) The constraint editor is accessible from the popup menu on the resources panel on the main status screen. Ordering and colocation constraint chains are drawn with arrows between resources indicating dependencies, much as you see in the constraints chapter of Pacemaker Explained[1]. That is to say, to start A then B, you have an order constraint: [A] -> [B] ...and to colocate B with A, you have a colocation constraint: [B] -> [A] Location constraints can be edited in simple form (just a resource, node and a score), or with a rule editor (if you need to specify roles or complex expressions). Note that date expressions and some explanatory text are still to come here. Any questions in the meantime, feel free to ask (I am particularly interested in feedback from people with large and/or complex sets of constraints). As before, packages for various SUSE-based distros can be obtained from the network:ha-clustering and network:ha-clustering:Factory repos on OBS, or you can just search for Hawk on software.opensuse.org: http://software.opensuse.org/search?baseproject=ALL&q=Hawk Building an RPM for Fedora/Red Hat is still just as easy as last time: # hg clone http://hg.clusterlabs.org/pacemaker/hawk # cd hawk # hg update hawk-0.4.1 # make rpm (My apologies continue for all the non-RPM-based distro users.) Further information is available at: http://www.clusterlabs.org/wiki/Hawk Please direct comments, feedback, questions, etc. to myself and/or (preferably) the Pacemaker mailing list. 
Happy clustering, Tim [1] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch-constraints.html -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
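The arrow notation in the announcement above maps onto crm shell syntax roughly as follows (a sketch with placeholder resources A and B, not output from Hawk itself):

```
order A-then-B inf: A B        # start A, then B:  [A] -> [B]
colocation B-with-A inf: B A   # place B with A:   [B] -> [A]
```

In a colocation constraint the dependent resource comes first, so B follows A's placement; Hawk draws the arrow in the same direction as the dependency.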
Re: [Pacemaker] Failover when storage fails
On 13/05/11 18:54, Max Williams wrote: Well this is not what I am seeing here. Perhaps a bug? I also tried adding op stop interval=0 timeout=10 to the LVM resources but still when the storage disappears the cluster just stops where it is and those log entries (below) just get printed in a loop. Cheers, Max OK, that's just weird (unless I'm missing something - anyone else seen this?). Do you mind sending me an hb_report tarball (offlist)? I'd suggest starting everything up cleanly, knocking the storage over, waiting a few minutes, then getting the hb_report for that entire time period. Regards, Tim -Original Message- From: Tim Serong [mailto:tser...@novell.com] Sent: 13 May 2011 04:22 To: The Pacemaker cluster resource manager (pacemaker@oss.clusterlabs.org) Subject: Re: [Pacemaker] Failover when storage fails On 5/12/2011 at 02:28 AM, Max Williams max.willi...@betfair.com wrote: After further testing even with stonith enabled the cluster still gets stuck in this state, presumably waiting on IO. I can get around it by setting on-fail=fence on the LVM resources but shouldn't Pacemaker be smart enough to realise the host is effectively offline? If you've got STONITH enabled, nodes should just get fenced when this occurs, without your having to specify on-fail=fence for the monitor op. What *should* happen is, the monitor fails or times out, then pacemaker will try to stop the resource. If the stop also fails or times out, the node will be fenced. See: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html Also, http://ourobengr.com/ha#causes is relevant here. Regards, Tim Or am I missing some timeout value that would fix this situation? 
pacemaker-1.1.2-7.el6.x86_64 corosync-1.2.3-21.el6.x86_64 RHEL 6.0 Config: node host001.domain \ attributes standby=off node host002.domain \ attributes standby=off primitive MyApp_IP ocf:heartbeat:IPaddr \ params ip=192.168.104.26 \ op monitor interval=10s primitive MyApp_fs_graph ocf:heartbeat:Filesystem \ params device=/dev/VolGroupB00/AppLV2 directory=/naab1 fstype=ext4 \ op monitor interval=10 timeout=10 primitive MyApp_fs_landing ocf:heartbeat:Filesystem \ params device=/dev/VolGroupB01/AppLV1 directory=/naab2 fstype=ext4 \ op monitor interval=10 timeout=10 primitive MyApp_lvm_graph ocf:heartbeat:LVM \ params volgrpname=VolGroupB00 exclusive=yes \ op monitor interval=10 timeout=10 on-fail=fence depth=0 primitive MyApp_lvm_landing ocf:heartbeat:LVM \ params volgrpname=VolGroupB01 exclusive=yes \ op monitor interval=10 timeout=10 on-fail=fence depth=0 primitive MyApp_scsi_reservation ocf:heartbeat:sg_persist \ params sg_persist_resource=scsi_reservation0 devs=/dev/dm-6 /dev/dm-7 required_devs_nof=2 reservation_type=1 primitive MyApp_init_script lsb:AppInitScript \ op monitor interval=10 timeout=10 primitive fence_host001.domain stonith:fence_ipmilan \ params ipaddr=192.168.16.148 passwd=password login=root pcmk_host_list=host001.domain pcmk_host_check=static-list \ meta target-role=Started primitive fence_host002.domain stonith:fence_ipmilan \ params ipaddr=192.168.16.149 passwd=password login=root pcmk_host_list=host002.domain pcmk_host_check=static-list \ meta target-role=Started group MyApp_group MyApp_lvm_graph MyApp_lvm_landing MyApp_fs_graph MyApp_fs_landing MyApp_IP MyApp_init_script \ meta target-role=Started migration-threshold=2 on-fail=restart failure-timeout=300s ms ms_MyApp_scsi_reservation MyApp_scsi_reservation \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true colocation MyApp_group_on_scsi_reservation inf: MyApp_group ms_MyApp_scsi_reservation:Master order MyApp_group_after_scsi_reservation inf: 
ms_MyApp_scsi_reservation:promote MyApp_group:start property $id=cib-bootstrap-options \ dc-version=1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ no-quorum-policy=ignore \ stonith-enabled=true \ last-lrm-refresh=1305129673 rsc_defaults $id=rsc-options \ resource-stickiness=1 From: Max Williams [mailto:max.willi...@betfair.com] Sent: 11 May 2011 13:55 To: The Pacemaker cluster resource manager (pacemaker@oss.clusterlabs.org) Subject: [Pacemaker] Failover when storage fails Hi, I want to configure pacemaker to failover a group of resources and sg_persist (master/slave) when there is a problem with the storage but when I cause the iSCSI LUN to disappear simulating a failure, the cluster always gets stuck in this state: Last updated: Wed May 11 10:52:43 2011 Stack: openais Current DC: host001.domain - partition with quorum Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe 2 Nodes configured, 2 expected votes 4 Resources configured
Re: [Pacemaker] addendum: problems with node membership
On 5/11/2011 at 10:00 PM, Thomas thomascasp...@t-online.de wrote: p.s. 1 cluster nodes failed to respond to the join offer can be found in my corosync log. Google was of no use with that message, I haven't found a solution yet. Cannot be that difficult I think, I just need the freshly installed condition of pacemaker without reinstalling the complete package, because a fresh node joins without problems...how can this be done? I'd suggest double-checking the corosync config and network settings (IP addresses and preferably disable any firewalls) on all nodes. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional role parameter
On 5/10/2011 at 08:22 PM, Holger Teutsch holger.teut...@web.de wrote: On Tue, 2011-05-10 at 08:24 +0200, Andrew Beekhof wrote: On Mon, May 9, 2011 at 8:44 PM, Holger Teutsch holger.teut...@web.de wrote: On Wed, 2011-04-27 at 13:25 +0200, Andrew Beekhof wrote: On Sun, Apr 24, 2011 at 4:31 PM, Holger Teutsch holger.teut...@web.de wrote: ... Remaining diffs seem to be not related to my changes. Unlikely I'm afraid. We run the regression tests after every commit and complain loudly if they fail. What is the regression test output? That's the output of tools/regression.sh of pacemaker-devel *without* my patches: Version: parent: 10731:bf7b957f4cbe tip see attachment There seems to be something not quite right with your environment. Had you built the tools directory before running the test? Yes, + install In a clean chroot it passes on both opensuse and fedora: http://build.clusterlabs.org:8010/builders/opensuse-11.3-i386-devel/builds/48/steps/cli_test/logs/stdio and http://build.clusterlabs.org:8010/builders/fedora-13-x86_64-devel/builds/48/steps/cli_test/logs/stdio What distro are you on? Opensuse 11.4 Works for me on openSUSE 11.4 with a clean checkout of devel tip, so presumably isn't endemic (not that this really helps you, sorry, but I had to test). Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml
On 5/4/2011 at 08:49 PM, Andrew Beekhof and...@beekhof.net wrote: Tick tock. I'm going to push this soon unless someone raises an objection RSN. This is going into 1.1, right? Do existing CIBs automagically get updated to this syntax, or does the admin have to force this? (Sorry, I forget if that was covered already). Thanks, Tim On Fri, Apr 15, 2011 at 4:55 PM, Andrew Beekhof and...@beekhof.net wrote: On Fri, Apr 15, 2011 at 3:00 PM, Lars Marowsky-Bree l...@novell.com wrote: On 2011-04-13T08:37:12, Andrew Beekhof and...@beekhof.net wrote: Before: rsc_colocation id=coloc-set score=INFINITY resource_set id=coloc-set-0 resource_ref id=dummy2/ resource_ref id=dummy3/ /resource_set resource_set id=coloc-set-1 sequential=false role=Master resource_ref id=dummy0/ resource_ref id=dummy1/ /resource_set /rsc_colocation rsc_order id=order-set score=INFINITY resource_set id=order-set-0 role=Master resource_ref id=dummy0/ resource_ref id=dummy1/ /resource_set resource_set id=order-set-1 sequential=false resource_ref id=dummy2/ resource_ref id=dummy3/ /resource_set /rsc_order After: So I am understanding this properly - we're getting rid of the sequential attribute, yes? Absolutely. If so, three cheers. ;-) Can you share the proposed schema and XSLT, if you already have some? Attached rsc_colocation id=coloc-set score=INFINITY colocation_set id=coloc-set-1 internal-colocation=0 resource_ref id=dummy0 role=Master/ resource_ref id=dummy1 role=Master/ /colocation_set colocation_set id=coloc-set-0 internal-colocation=INFINITY resource_ref id=dummy2/ resource_ref id=dummy3/ /colocation_set /rsc_colocation rsc_order id=order-set kind=Mandatory ordering_set id=order-set-0 internal-ordering=Mandatory So we have (score|kind) on the outside, and internal-(ordering|colocation) on the inner elements. Is there a particular reason to use a different name on the inner ones? The language didn't feel right tbh - the inner ones felt like they needed more context/clarification. 
We can change the outer name too if you like. Also, rsc_order has either score or kind; are you doing away with that here? Yes. Standardizing on kind. Score never made sense for ordering :-( lifetime would only apply to the entire element, right? right And, just to be fully annoying - is there a real benefit to having ordering_set and colocation_set? Very much so. Because kind makes no sense for a colocation - and vice-versa for score. Using different element names means the rng can be stricter. ordering_set id=order-set-1 internal-ordering=Optional resource_ref id=dummy2/ While we're messing with sets anyway, I'd like to re-hash the idea I brought up on pcmk-devel. To make configuration more compact, I'd like to have automatic sets - i.e., the set of all resources that match something. Imagine: resource_list type=Xen provider=heartbeat class=ocf / and suddenly all your Xen guests are correctly ordered and collocated. The savings in administrative complexity and CIB size are huge. Or would you rather do this via templates? You mean something like? ordering_set id=order-set-0 internal-ordering=Mandatory resource_pattern type= provider= /ordering Might make sense. But doesn't strictly need to be bundled with this change. I'd be wary about allowing pattern matching on the name, I can imagine resources ending up in multiple sets (loops!) very easily. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
Re: [Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)
On 4/28/2011 at 11:06 PM, Florian Haas florian.h...@linbit.com wrote: On 2011-04-27 20:55, Lars Marowsky-Bree wrote: On 2011-04-26T23:34:16, Yan Gao y...@novell.com wrote: And the cibs between different sites would still be synchronized? The idea is that there would be - perhaps as part of the CTR daemon - a process that would replicate (manually triggered, periodically, or automatically) the configuration details of resources associated with a given ticket (which are easily determined since they depend on it) to the other sites that are eligible for the ticket. Initially, I'd be quite happy if there was a replicate now button to push or script to call - admins may actually have good reasons not to immediately replicate everywhere, anyway. It's conceivable that there would need to be some mangling as configuration is replicated; e.g., path names and IP addresses may be different. We _could_ express this using our CIB syntax already (instance attribute sets take rules, and it'd probably be easy enough to extend this matching to select on ticket ownership), and perhaps that is good enough, since I'd imagine there would actually be quite little to modify. (Having many differences would make the configuration very complex to manage and understand; hence, we want a syntax that makes it easy to have a few different values, and annoying to have many ;-) As I understood it we had essentially reached consensus in Boston that CIB replication would best be achieved by a pair of complementary resource agents. I don't think we had a name then, but I'll call them Publisher and Subscriber for the purposes of this discussion. The idea would be that Publisher exposes the configuration/ section of the CIB via a network daemon, preferably one that uses encryption. Suppose this is something like lighttpd with SSL/TLS support. This would be a simple primitive running exactly once in the Pacemaker cluster, and only if that cluster holds the ticket. 
Hawk just about does that (exposes bits of the CIB via HTTPS), although admittedly it'd be overkill for just exposing the configuration section for machine processing. A stunningly trivial implementation is, simply, in lighttpd.conf:

cgi.assign = ( "/config" => "" )

Then, create a shell script called config in lighttpd's document root directory, containing:

#!/bin/sh
echo Content-type: text/xml
echo
/usr/sbin/cibadmin -Q --scope constraints

Not so much with the security, but it works... Subscriber, by contrast, subscribes to this stream and will usually mangle configuration in some shape or form, preferably configurable through an RA parameter. What was discussed in Boston is that in an initial step, Subscriber could simply take an XSLT script, apply it to the CIB stream with xsltproc, and then update its local CIB with the result. Subscriber would be the only resource (besides STONITH resources and Slaves of master/slave sets) that can be active in a cluster that does not hold the ticket. Comments? Cheers, Florian -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.0
On 4/22/2011 at 10:14 PM, Nikita Michalko michalko.sys...@a-i-p.com wrote: On Tuesday 19 April 2011 12:59:35, Tim Serong wrote: Greetings All, This is to announce version 0.4.0 of Hawk, a web-based GUI for managing and monitoring Pacemaker High-Availability clusters. You can use Hawk 0.4.0 to: - Monitor your cluster, with much the same functionality as crm_mon (displays node and resource status, failed ops). - Perform basic operator tasks: - Node: standby, online, fence - Resource: start, stop, migrate, unmigrate, clean up. - Create, edit and delete primitives, groups, clones, m/s resources. - Edit crm_config properties. Hawk is intended to run on each node in your cluster, and is accessible via HTTPS on port 7630. You can then access it by pointing your web browser at the IP address of any cluster node, or the address of any IPaddr(2) resource you may have configured. You will need to configure a user account to log in as. The same rules apply as for the python GUI; you need to log in as a user in the haclient group. Packages for various SUSE-based distros can be obtained from the network:ha-clustering and network:ha-clustering:Factory repos on OBS, or you can just search for Hawk on software.opensuse.org: http://software.opensuse.org/search?baseproject=ALLq=Hawk - just tried to download HAWK, but I don't know what password is required by YaST - which one? Did you download an RPM, or click the 1-Click Install link? If the latter, it'll try to just install it on the system you're downloading on, in which case YaST is asking for your root password in order to install it. This may or may not be what you want. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] operative tasks for a pacemaker cluster
On 4/13/2011 at 02:04 AM, mark - pacemaker list m+pacema...@nerdish.us wrote: Hello, On Mon, Apr 11, 2011 at 11:11 AM, Andrew Beekhof and...@beekhof.net wrote: On Mon, Apr 11, 2011 at 2:48 PM, Klaus Darilion klaus.mailingli...@pernau.at wrote: Recently I got hit by running out of inodes due to too many files in /var/lib/pengine. man pengine, look for -series-max.

There is no pengine man page in the packages (pacemaker, heartbeat, or corosync) from the EPEL repo, nor online with the other online manpages at clusterlabs. Am I missing it someplace? I want to read about this as I have just under 7000 files in /var/lib/pengine on a node that has 7 days of uptime. Will this grow unchecked, or do older files eventually get cleaned up?

Not sure what's up with the EPEL packaging, sorry. The relevant bit of that manpage is:

    pe-error-series-max = integer [-1]
        The number of PE inputs resulting in ERRORs to save.
        Zero to disable, -1 to store unlimited.

    pe-warn-series-max = integer [-1]
        The number of PE inputs resulting in WARNINGs to save.
        Zero to disable, -1 to store unlimited.

    pe-input-series-max = integer [-1]
        The number of other PE inputs to save.
        Zero to disable, -1 to store unlimited.

So, yeah, by default unless you specifically limit it, it'll just keep saving 'em. They're invaluable for debugging failures, BTW. Were those 7000 pe-inputs all created over that 7 day period? Because that's a transition every 1.44 minutes. Is it just me, or does that sound like a rather busy cluster?

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
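Tim's back-of-the-envelope arithmetic above is easy to check: 7 days of uptime spread over ~7000 pe-input files does come out to roughly one transition every 1.44 minutes.

```python
# Verify the transition rate quoted in the message above:
# ~7000 pe-input files over 7 days of uptime.
days = 7
pe_inputs = 7000

minutes_per_transition = days * 24 * 60 / pe_inputs
print(round(minutes_per_transition, 2))  # → 1.44
```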
Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml
On 3/21/2011 at 08:20 PM, Andrew Beekhof and...@beekhof.net wrote: Small improvement to:

    + The only thing that matters is that in order for any member of a set to be
      active, all the members of the previous set must also be active (and
      naturally on the same node). When a set has
      <literal>sequential="true"</literal>, then in order for any member to be
      active, the previous members must also be active.

    + The only thing that matters is that in order for any member of a set to be
      active, all the members of the previous set<footnote><para>as determined by
      the display order in the configuration</para></footnote> must also be
      active (and naturally on the same node).
    + When a set has <literal>sequential="true"</literal>, then in order for any
      member to be active, the previous members must also be active.

This isn't quite correct. For members within a set (sequential=true), it is true that for a given member to be active, the previous members must also be active. Between sets however, it's the other way around - a given set depends on the subsequent set. The example colocation chain in Pacemaker Explained right now should thus be changed as follows in order to match the diagram:

    <constraints>
      <rsc_colocation id="coloc-1" score="INFINITY">
        <resource_set id="collocated-set-1" sequential="true">
    -     <resource_ref id="A"/>
    -     <resource_ref id="B"/>
    +     <resource_ref id="F"/>
    +     <resource_ref id="G"/>
        </resource_set>
        <resource_set id="collocated-set-2" sequential="false">
          <resource_ref id="C"/>
          <resource_ref id="D"/>
          <resource_ref id="E"/>
        </resource_set>
        <resource_set id="collocated-set-2" sequential="true" role="Master">
    -     <resource_ref id="F"/>
    -     <resource_ref id="G"/>
    +     <resource_ref id="A"/>
    +     <resource_ref id="B"/>
        </resource_set>
      </rsc_colocation>
    </constraints>

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml
On 4/11/2011 at 09:37 PM, Andrew Beekhof and...@beekhof.net wrote: On Mon, Apr 11, 2011 at 12:57 PM, Tim Serong tser...@novell.com wrote: On 3/21/2011 at 08:20 PM, Andrew Beekhof and...@beekhof.net wrote: Small improvement to: + The only thing that matters is that in order for any member of a set to be active, all the members of the previous set must also be active (and naturally on the same node). When a set has literalsequential=true/literal, then in order for any member to be active, the previous members must also be active. + The only thing that matters is that in order for any member of a set to be active, all the members of the previous setfootnoteparaas determined by the display order in the configuration/para/footnote must also be active (and naturally on the same node). + When a set has literalsequential=true/literal, then in order for any member to be active, the previous members must also be active. This isn't quite correct. For members within a set (sequential=true), it is true that for a given member to be active, the previous members must also be active. Between sets however, it's the other way around - a given set depends on the subsequent set.

Did I really write it like that? You tested it?

Yup. Well, I tested it (pcmk 1.1.5), so I assume you wrote it like that :)

We want (pardon the ASCII art):

              /-- C --\
    G -- F --+--- D ---+-- B -- A
              \-- E --/

Test is:

    # crm configure colocation c inf: F G ( C D E ) A B
    # crm resource stop F    (stops F and G)
    # crm resource start F
    # crm resource stop D    (stops D, F and G)
    # crm resource start D
    # crm resource stop B    (stops everything except A)

That shell colocation constraint maps exactly to the (new) XML shown below (verified just in case it turned out to be a shell oddity).

If so, that's just retarded and needs an overhaul.

It is a little... confusing. 
Regards, Tim The example colocation chain in Pacemaker Explained right now should thus be changed as follows in order to match the diagram: constraints rsc_colocation id=coloc-1 score=INFINITY resource_set id=collocated-set-1 sequential=true -resource_ref id=A/ -resource_ref id=B/ +resource_ref id=F/ +resource_ref id=G/ /resource_set resource_set id=collocated-set-2 sequential=false resource_ref id=C/ resource_ref id=D/ resource_ref id=E/ /resource_set resource_set id=collocated-set-2 sequential=true role=Master -resource_ref id=F/ -resource_ref id=G/ +resource_ref id=A/ +resource_ref id=B/ /resource_set /rsc_colocation /constraints Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml
On 4/11/2011 at 10:23 PM, Andrew Beekhof and...@beekhof.net wrote: On Mon, Apr 11, 2011 at 2:18 PM, Tim Serong tser...@novell.com wrote: On 4/11/2011 at 09:37 PM, Andrew Beekhof and...@beekhof.net wrote: On Mon, Apr 11, 2011 at 12:57 PM, Tim Serong tser...@novell.com wrote: On 3/21/2011 at 08:20 PM, Andrew Beekhof and...@beekhof.net wrote: Small improvement to: + The only thing that matters is that in order for any member of a set to be active, all the members of the previous set must also be active (and naturally on the same node). When a set has literalsequential=true/literal, then in order for any member to be active, the previous members must also be active. + The only thing that matters is that in order for any member of a set to be active, all the members of the previous setfootnoteparaas determined by the display order in the configuration/para/footnote must also be active (and naturally on the same node). + When a set has literalsequential=true/literal, then in order for any member to be active, the previous members must also be active. This isn't quite correct. For members within a set (sequential=true), it is true that for a given member to be active, the previous members must also be active. Between sets however, it's the other way around - a given set depends on the subsequent set. Did I really write it like that? You tested it? Yup. Well, I tested it (pcmk 1.1.5), so I assume you wrote it like that :) We want (pardon the ASCII art): /-- C --\ G -- F --+--- D ---+-- B -- A \- - E --/ Test is: # crm configure colocation c inf: F G ( C D E ) A B Given the well discussed issues with the shell syntax, I'd prefer to see the raw xml actually. 
    <constraints>
      <rsc_colocation id="c" score="INFINITY">
        <resource_set id="c-0">
          <resource_ref id="F"/>
          <resource_ref id="G"/>
        </resource_set>
        <resource_set id="c-1" sequential="false">
          <resource_ref id="C"/>
          <resource_ref id="D"/>
          <resource_ref id="E"/>
        </resource_set>
        <resource_set id="c-2">
          <resource_ref id="A"/>
          <resource_ref id="B"/>
        </resource_set>
      </rsc_colocation>
    </constraints>

    # crm resource stop F    (stops F and G)
    # crm resource start F
    # crm resource stop D    (stops D, F and G)
    # crm resource start D
    # crm resource stop B    (stops everything except A)

That shell colocation constraint maps exactly to the (new) XML shown below (verified just in case it turned out to be a shell oddity). If so, that's just retarded and needs an overhaul. It is a little... confusing.

Regards, Tim

The example colocation chain in Pacemaker Explained right now should thus be changed as follows in order to match the diagram: constraints rsc_colocation id=coloc-1 score=INFINITY resource_set id=collocated-set-1 sequential=true -resource_ref id=A/ -resource_ref id=B/ +resource_ref id=F/ +resource_ref id=G/ /resource_set resource_set id=collocated-set-2 sequential=false resource_ref id=C/ resource_ref id=D/ resource_ref id=E/ /resource_set resource_set id=collocated-set-2 sequential=true role=Master -resource_ref id=F/ -resource_ref id=G/ +resource_ref id=A/ +resource_ref id=B/ /resource_set /rsc_colocation /constraints Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http
Re: [Pacemaker] emulate crm_mon output by xsltproc'essing cibadmin -Ql
On 3/9/2011 at 07:51 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Wed, Mar 09, 2011 at 09:42:49AM +0100, Andrew Beekhof wrote: I had http://hg.clusterlabs.org/pacemaker/1.1/raw-file/tip/xml/crm.xsl doing something similar. Agree it's an interesting capability, haven't found much practical use for it yet though. Happy to put it in the extras directory though :-) Fine with me. Then at least it does not get lost. How to figure out from the CIB Pacemaker's idea of the current status (and location) of a resource? Look at the most recent lrm_rsc_op, and its result?

Pretty much. For all the gory details, read unpack_rsc_op() in pacemaker/lib/pengine/unpack.c. But it (more or less) comes down to:

- For each node, sort the ops in order of descending call ID.
- The most recent op and rc on each node (highest call ID) tells you the state of the resource on that node.

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
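The procedure Tim describes (per node, take the lrm_rsc_op with the highest call-id) can be sketched in a few lines of Python. The XML below is a hand-made sample loosely following the CIB status schema; the operation values are invented for illustration, not taken from a real cluster.

```python
import xml.etree.ElementTree as ET

# Hand-made sample of a CIB status fragment; values are invented.
status = ET.fromstring("""
<node_state uname="node1">
  <lrm_resources>
    <lrm_resource id="resX">
      <lrm_rsc_op operation="start"   call-id="12" rc-code="0"/>
      <lrm_rsc_op operation="monitor" call-id="15" rc-code="0"/>
      <lrm_rsc_op operation="stop"    call-id="9"  rc-code="0"/>
    </lrm_resource>
  </lrm_resources>
</node_state>
""")

for rsc in status.iter("lrm_resource"):
    # Highest call-id == most recent operation on this node.
    latest = max(rsc.iter("lrm_rsc_op"), key=lambda op: int(op.get("call-id")))
    print(rsc.get("id"), latest.get("operation"), latest.get("rc-code"))
# → resX monitor 0
```

As the message notes, the authoritative logic (expected vs. actual results, pending ops, and so on) lives in unpack_rsc_op(); this only shows the sort-by-call-id core of it.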
Re: [Pacemaker] version confusion
On 3/3/2011 at 09:17 AM, Klaus Darilion klaus.mailingli...@pernau.at wrote: Hi! I just updated my Debian box to wheezy to test Pacemaker 1.0.10. dpkg reports version 1.0.10 but crm_mon reports version 1.0.9. So, which version is really running? Is really 1.0.9 running or is this due to the previously used 1.0.9 version?

    # dpkg -l | grep pacem
    ii  pacemaker  1.0.10-5  HA cluster resource manager

    # crm_mon -1
    Last updated: Wed Mar 2 23:14:23 2011
    Stack: openais
    Current DC: bulgari - partition WITHOUT quorum
    Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3

Note the hash after the version number. If you search for that hash at http://hg.clusterlabs.org/pacemaker/stable-1.0/ and poke around a bit you'll find it's the commit two commits *before* the one that actually updated the version number to 1.0.10. So, yes, you do have version 1.0.10. Try to think of it as an unfortunate typo :)

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Linux-HA] Solved: SLES 11 HAE SP1 Signon to CIB Failed
On 2/9/2011 at 09:49 PM, darren.mans...@opengi.co.uk wrote: So I compared the /etc/ais/openais.conf in non-sp1 with /etc/corosync/corosync.conf from sp1 and found this bit missing which could be quite useful...

    service {
        # Load the Pacemaker Cluster Resource Manager
        ver:        0
        name:       pacemaker
        use_mgmtd:  yes
        use_logd:   yes
    }

Added it and it works. Doh. It seems the example corosync.conf that is shipped won't start pacemaker, I'm not sure if that's on purpose or not, but I found it a bit confusing after being used to it 'just working' previously.

Ah. Understandably confusing. That got fixed post-SP1, in a maintenance update that went out in September or thereabouts. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.

--- Thanks Tim. Although the media that can be downloaded *now* from Novell downloads still has this issue, so any new clusters will fall foul of it. Generally with a test build you won't perform updates as it burns a licence you would need for the production system. Should the downloadable media have the issue fixed?

With the disclaimer that I haven't tried this myself lately... :) On this page: http://www.novell.com/products/highavailability/eval.html It says: Please note: Once you login, your evaluation software will automatically be registered to you. You will be able to immediately access free maintenance patches and updates online for a 60-day period following your registration date. So apparently new users should be able to get the latest maintenance updates.

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Solved: [Linux-HA] SLES 11 HAE SP1 Signon to CIB Failed
On 2/3/2011 at 08:47 PM, darren.mans...@opengi.co.uk wrote: On Fri, Jan 28, 2011 at 1:06 PM, darren.mans...@opengi.co.uk wrote: Hi all, this seems like it should be an easy one to fix, I'll raise a support call with Novell if required. Base install of SLES 11 32 bit SP1 with HAE SP1 and crm_mon gives 'signon to CIB failed'. Same thing with the CRM shell etc.

Too many open file descriptors? lsof might show something interesting

--- Unfortunately not. It seems that corosync doesn't spawn anything else, which is causing this issue: [...] So I compared the /etc/ais/openais.conf in non-sp1 with /etc/corosync/corosync.conf from sp1 and found this bit missing which could be quite useful...

    service {
        # Load the Pacemaker Cluster Resource Manager
        ver:        0
        name:       pacemaker
        use_mgmtd:  yes
        use_logd:   yes
    }

Added it and it works. Doh. It seems the example corosync.conf that is shipped won't start pacemaker, I'm not sure if that's on purpose or not, but I found it a bit confusing after being used to it 'just working' previously.

Ah. Understandably confusing. That got fixed post-SP1, in a maintenance update that went out in September or thereabouts.

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] STONITH external/ssh missing on RHEL 5.5 EPEL 5.4 + ClusterLabs Repo RPM Build?
On 12/20/2010 at 10:09 PM, Pavlos Parissis pavlos.paris...@gmail.com wrote: On 17 December 2010 20:41, Eliot Gable ega...@broadvox.com wrote: I just did an install of Pacemaker on my CentOS 5.5 system using EPEL 5.4 and ClusterLabs Repo. It seems the RPMs do not include the STONITH plugin external/ssh. Is it in some package that I missed or is it really not provided? Is there any way to get it? Thanks. the following line from cluster-glue-fedora.spec could the reason %exclude %{_libdir}/stonith/plugins/external/ssh That's intentional, see: http://hg.linux-ha.org/glue/rev/5ef3f9370458 You really don't want to rely on SSH STONITH in a production environment. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Resources not migrating on node failure?
On 12/1/2010 at 05:11 AM, Anton Altaparmakov ai...@cam.ac.uk wrote: Hi, I have set up a three node cluster (running Ubuntu 10.04 LTS server with Corosync 1.2.0, Pacemaker 1.0.8, drbd 8.3.7), where one node is only present to provide quorum to the other two nodes in case one node fails but it itself cannot run any resources. The other two nodes are running drbd in master/slave to provide replicated storage, then XFS file system on top of the drbd storage on the master, together with an NFS server on top of the XFS mount, and a service IP address on which the NFS export is shared. This is all working brilliantly and I can cause the resources to move to the slave node by running crm_standby -U cerberus -v on where cerberus is the master node and everything then migrates to the slave node minotaur. My problem is if I pull the power plug on the master node cerberus. Then nothing happens! minotaur continues to run as slave and it never takes over. And I don't get why. )-: Probably because STONITH is disabled. It can't take over the resources unless it knows they're stopped, and without a clean shutdown, there's no way to guarantee they're stopped without STONITH. Also, a second question, possibly related to the first problem, is do I need to define monitor actions for each resource or is that done automatically? No, you need to define them. If I need to do it specifically, how do I do that now that I have it all up and running without defining monitor actions? Run crm configure edit and add whichever monitor ops you need. Have a look at Clusters from Scratch at: http://www.clusterlabs.org/wiki/Documentation HTH, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
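Tim's suggestion above to add monitor operations via `crm configure edit` might end up looking like the following crm shell fragment. Note this is only a hedged sketch: the resource name, IP address, and interval/timeout values are invented placeholders, not taken from the thread.

```
# Hypothetical example: a service IP with an explicit monitor op.
# Names and timings are placeholders; adjust to your own configuration.
primitive p_service_ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.10" \
    op monitor interval="10s" timeout="20s"
```

Without an `op monitor` line like this, Pacemaker only starts and stops the resource; it never re-checks that the resource is still healthy.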
Re: [Pacemaker] Extending CTS with other tests
On 11/30/2010 at 09:21 PM, Andrew Beekhof and...@beekhof.net wrote: On Thu, Nov 25, 2010 at 1:36 PM, Vit Pelcak vpel...@suse.cz wrote: Hello everyone. I ran into a problem. I cannot format an ocfs2 partition with pcmk until primitive o2cb ocf:ocfs2:o2cb is running. Right? Probably.

To my intense amazement, you can do this:

    mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=pacemaker /dev/foo

This works when the cluster is not running. These parameters are not mentioned anywhere at all in the mkfs.ocfs2 manpage. *sigh*

Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] colocation that doesn't
On 11/30/2010 at 10:11 AM, Alan Jones falanclus...@gmail.com wrote: On Thu, Nov 25, 2010 at 6:32 AM, Tim Serong tser...@novell.com wrote: Can you elaborate on why you want this particular behaviour? Maybe there's some other way to approach the problem? I have explained the issue as clearly as I know how. The problem is fundamental to the design of the policy engine in Pacemaker. It performs only two passes to resolve constraints, when what is required for general purpose constraint resolution is an iterative model. These problems have been addressed in the literature for decades.

What I meant by maybe there's some other way to approach the problem is maybe there's some other way we can figure out how to get something *like* the behaviour you desire, given the fact that Pacemaker's colocation constraints behave the way they do. If you have:

    primitive resX ocf:pacemaker:Dummy
    primitive resY ocf:pacemaker:Dummy
    location resX-loc resX 1: nodeA.acme.com
    location resY-loc resY 1: nodeB.acme.com
    colocation resX-resY -2: resX resY

And you have -inf constraints coming from an external source, as you said before, can you change the external source so that it generates different constraints? e.g., instead of generating either of:

    location resX-nodeA resX -inf: nodeA.acme.com
    location resY-nodeB resY -inf: nodeB.acme.com

(where only the second one works, because of the dependency inherent in the colocation constraint) can your external source specify these constraints only in terms of resY, which is the one that's capable of dragging resX around the place? e.g.:

    location resX-nodeA resY inf: nodeA.acme.com
    location resY-nodeB resY -inf: nodeB.acme.com

Or, if that sounds completely deranged, how about this: On the assumption your external source will only ever inject one -inf rule, for one resource, why not make it change the colocation constraint as well? 
e.g.: generate either of:

    location resX-nodeA resX -inf: nodeA.acme.com
    colocation resY-resX -2: resY resX
    (and delete resX-resY if present)

-- or --

    location resY-nodeB resY -inf: nodeB.acme.com
    colocation resX-resY -2: resX resY
    (and delete resY-resX if present)

Are there any more details about your application you can share?

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] colocation that doesn't
On 11/25/2010 at 10:33 AM, Alan Jones falanclus...@gmail.com wrote: Instead of: colocation resX-resY -2: resX resY Try: colocation resX-resY -2: resY resX That works fine, as you describe, for placing resY when resX is limited by the -inf rule; but not the reverse. In my configuration the -inf constraints come from an external source and I wish place resX and resY in a symmetric way. Start with resX and resY which can run on either nodeA or nodeB. Give each a preferred node respectively; a weak preference. Now request that, if possible, they should run on different nodes; potentially overriding the weak node preference. Now add external constraints that prohibit one or other from running on one or the other node. For example, if any one of the resources is prevented from running on its preferred node, it should run on the non-preferred node and push the other resource onto its non-preferred node. I have not figured out how to express this in pacemaker. Ah, OK. I'm not seeing it either. Can you elaborate on why you want this particular behaviour? Maybe there's some other way to approach the problem? (Or maybe someone else can think of a way to express this...) Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] colocation that doesn't
On 11/24/2010 at 12:32 PM, Alan Jones falanclus...@gmail.com wrote: On Sat, Nov 20, 2010 at 1:05 AM, Andrew Beekhof and...@beekhof.net wrote: Then -2 obviously isn't big enough is it. I need a value between and not including -inf and -2 that will work. All the values I've tried don't, so I'm open to suggestions. Please read and understand: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-colocation.html The way I read the with-rsc description it is in direct conflict with your comment from Nov. 11th (below), so I'm truly confused.

    colocation X-Y -2: X Y
    colocation Y-X -2: Y X

the second one is implied by the first and is therefore redundant. For how colocation constraints actually work instead of inventing your own rules. I'm interested in inventing rules, I'm trying to express the constraints of my application. So far, I have not been able to do so.

Instead of:

    colocation resX-resY -2: resX resY

Try:

    colocation resX-resY -2: resY resX

Because: The cluster decides where to put with-rsc (the second one), then decides where to put rsc (the first one). You have:

    location resX-nodeA resX -inf: nodeA.acme.com
    location resY-loc resY 1: nodeB.acme.com

If it decides where to put resY first, it puts resY on nodeB. Then it tries to place resX, wants to place it where resY is not (nodeA), but can't, due to the -inf score for resX on nodeA. So in this case, resX lands on nodeB as well.

If it decides where to put resX first, it puts resX on nodeB because of the -inf score for nodeA. Then it puts resY on nodeA, because of the -2 score for the colocation constraint.

HTH, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
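The two placement orders Tim describes above can be sketched as a toy score calculation. This is a deliberate simplification of the real policy engine (which handles stickiness, utilization, ordering, and much more); the node names and scores mirror the thread's example, but the model itself is invented for illustration.

```python
# Toy model of the placement logic described above: the cluster places
# with-rsc first, then places rsc with the colocation score (-2) added
# against with-rsc's chosen node. Scores are illustrative only.

location = {
    "resX": {"nodeA": float("-inf"), "nodeB": 0},  # resX banned from nodeA
    "resY": {"nodeA": 0, "nodeB": 1},              # resY prefers nodeB
}

def place(rsc, extra=None):
    """Pick the highest-scoring node for rsc, with optional extra scores."""
    scores = dict(location[rsc])
    for node, s in (extra or {}).items():
        scores[node] += s
    return max(scores, key=scores.get)

# colocation resX-resY -2: resX resY  => resY (with-rsc) is placed first.
y = place("resY")            # resY lands on its preferred nodeB
x = place("resX", {y: -2})   # resX wants to avoid nodeB, but nodeA is -inf
print(y, x)  # → nodeB nodeB

# colocation resX-resY -2: resY resX  => resX (with-rsc) is placed first.
x2 = place("resX")             # nodeA is -inf, so resX lands on nodeB
y2 = place("resY", {x2: -2})   # nodeB drops to -1, so resY moves to nodeA
print(x2, y2)  # → nodeB nodeA
```

The two runs reproduce the thread's conclusion: with `resX resY` both resources pile onto nodeB, while `resY resX` spreads them across nodeB and nodeA.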
Re: [Pacemaker] Help understanding why a failover occurred.
On 10/16/2010 at 09:45 AM, Jai away...@gmail.com wrote: I have setup a DRBD-Xen failover cluster. Last night at around 02:50 it failed the resources from server bravo to alpha. I'm trying to find out what caused the failover of resources. I don't see anything in the logs that indicate the cause but I don't really know what to look for. If someone could help me understand these logs and what I'm looking for would be great. I'm not even sure how far back I need to go.

I reckon it's this:

    Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 161: pingval=0

Which suggests bravo lost connectivity to 12.12.12.1 around that time, causing the failover. For reference, if you're looking at pengine logs... A few lines above where it says

    info: process_pe_message: Transition NNN: PEngine Input stored in: /var/lib/pengine/pe-input-MMM.bz2

you'll see what it's about to do to your resources. If this is just Leave resource FOO (Started/Master/Slave etc.), that transition is probably boring. If it says Start FOO (...) or Promote/Demote/Stop FOO (...), it means something has changed. Scroll up a bit, to above where pengine is saying unpack_config, determine_node_status etc., and you should see a message suggesting the cause for the change (failed op, timeout, ping attribute modified, etc.). It might be a bit inscrutable sometimes, but it'll be there somewhere...

HTH Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
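The process_pe_message lines Tim points at are regular enough to pull out with a small script, e.g. to map transition numbers to their pe-input files. The sample log line below is modeled on the format quoted in the message, not copied from a real log.

```python
import re

# Sample line modeled on the process_pe_message format quoted above.
line = ("pengine: [1234]: info: process_pe_message: "
        "Transition 161: PEngine Input stored in: /var/lib/pengine/pe-input-42.bz2")

m = re.search(r"Transition (\d+): PEngine Input stored in: (\S+)", line)
if m:
    print(m.group(1), m.group(2))
# → 161 /var/lib/pengine/pe-input-42.bz2
```

Run over a whole log, this gives you the list of transitions worth inspecting with crm_simulate or ptest against the saved pe-input files.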
Re: [Pacemaker] Cluster failure with mod_security using rotatelogs
On 10/11/2010 at 10:17 AM, Markus Schlup mar...@qbik.ch wrote: Hi all, I'm running a cluster-based Apache reverse proxy with the mod_security module. I would like to rotate the logfiles with rotatelogs as follows:

    CustomLog "|/usr/sbin/rotatelogs -l /var/log/httpd/access_log.%Y-%m-%d 86400" common

And especially the mod_security log with

    SecAuditLog "|/usr/sbin/rotatelogs -l /var/log/httpd/modsec_audit_log.%Y-%m-%d 86400"

As soon as I change the mod_security log to this (instead of just using SecAuditLog /var/log/httpd/modsec_audit_log) the resource does not start anymore. When trying to debug and start the apache resource by hand with

    OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/httpd/conf/httpd.conf \
    OCF_RESKEY_statusurl="http://localhost:80/server-status" \
    sh -x /usr/lib/ocf/resource.d/heartbeat/apache start

it stops after ...

    + for p in '$PORT' '$Port' 80
    + CheckPort 80
    + ocf_is_decimal 80
    + case $1 in
    + true
    + '[' 80 -gt 0 ']'
    + PORT=80
    + break
    + echo 127.0.0.1:80
    + grep :
    + '[' Xhttp://localhost:80/server-status = X ']'
    + test /etc/httpd/run/httpd.pid
    + : OK
    + case $COMMAND in
    + start_apache
    + silent_status
    + '[' -f /etc/httpd/run/httpd.pid ']'
    + : No pid file
    + false
    + ocf_run /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf
    ++ /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf

The resource is in fact started but the command does not finish - so I guess that's the reason why the cluster fails in this setup ... strangely enough, using the rotatelogs directives for the Apache error and access logs is not an issue and works as expected. Does someone know how to fix that problem?

I've not seen that before, but, just to rule out one possibility... What happens if you just run:

    /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf

Does that ever return? If no, I'd suggest apache is broken. If yes, I'd start pointing my finger towards ocf_run or the RA.

HTH, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] crm_gui login failure
On 9/28/2010 at 04:11 PM, Yan Gao y...@novell.com wrote:

On 09/28/10 01:25, Phil Armstrong wrote:

I'm running pacemaker-1.1.2-0.6.1 on sles11sp1. I was only able to successfully log in to the crm_gui from one of my nodes, in spite of the fact that the login parameters appeared to be identical. I traced the problem to a zero-length /etc/pam.d/hbmgmt file on the node that exhibited the login failure. This file is part of pacemaker-mgmt-2.0.0-0.3.10, which I have installed on both nodes. I had no previous knowledge of this file, and I am quite sure it wasn't anything I did consciously to zero out the file, or to consciously populate it with the contents of the working node:

    #%PAM-1.0
    auth    include common-auth
    account include common-account

Can anyone tell me how this file is created or modified?

The file is extracted from pacemaker-mgmt on package installation. The back-end of the GUI (mgmtd) reads it for user authentication. No one is supposed to, or needs to, modify the file for any reason. So it's strange that it was zeroed out. You might need to check the modification time to recall what was happening.

Wild guess - was your system STONITH'd or otherwise forcibly reset immediately after installing pacemaker-mgmt, and are you using XFS for your root filesystem?

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] /etc/hosts
On 9/28/2010 at 07:29 PM, Andrew Beekhof and...@beekhof.net wrote: On Tue, Sep 28, 2010 at 6:05 AM, Mark Horton m...@nostromo.net wrote: Hello, I was wondering what side effects occur if you don't add all the cluster nodes to the /etc/hosts file on each node? I'd also be interested in hearing how others keep the hosts file in sync. For example, lets say you have 3 nodes, and 1 node is currently down. Then you add a 4th node, but you can't update the hosts file of the down node. So you must remember to do it when it comes back up. I was trying to see if there was an automated way to keep them in sync in case we forget to update the hosts file on the down node. Pacemaker doesn't care, but your messaging layer (corosync or heartbeat) might. If the node that is down has no other way to find out the address of the new node, and the cluster is configured to start automatically when the machine boots, then you might have a problem. You might find csync2[1] useful. You can use this to synchronize config files across a cluster. Assuming you've configured it to sync /etc/hosts, any time you edit /etc/hosts on one node, run csync2 -x and it will magically sync the changes out to the other nodes in your cluster. It's a smart manual push mechanism, not something that runs continuously in the background, but it's a hell of a lot better than scp and having to remember where to copy what to, and when :) shameless-plug There's a little section on csync2 in the SLE HAE Guide under Transferring the Configuration to All Nodes at: http://www.novell.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_installation_setup.html /shameless-plug HTH Tim [1] http://oss.linbit.com/csync2/ -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. 
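To make the csync2 suggestion above a little more concrete, here is a minimal /etc/csync2/csync2.cfg sketch. The host names node1/node2 and the key path are hypothetical placeholders, not taken from the thread:

```
# /etc/csync2/csync2.cfg -- minimal hypothetical example
# syncing /etc/hosts across a two-node cluster.
group ha_cluster
{
    host node1;
    host node2;
    key /etc/csync2/key_hacluster;
    include /etc/hosts;
}
```

Roughly: generate the shared key once with csync2 -k, distribute the config and key to every node, and after editing /etc/hosts on any node run csync2 -x (add -v for verbose output) to push the change out, as described in the post.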
Re: [Pacemaker] clvmd hangs on node1 if node2 is fenced
On 8/27/2010 at 03:37 PM, Michael Smith msm...@cbnco.com wrote:

On Thu, 26 Aug 2010, Tim Serong wrote:

for now I have stonith-enabled=false in my CIB. Is there a way to make clvmd/dlm respect it?

No. At least, I don't think so, and/or I hope not :)

I think I'd consider it a bug: I've disabled stonith, so dlm shouldn't wait forever for a fence operation that isn't going to happen. CLVM is just making the metadata cluster-aware, so the only way I can imagine screwing things up without fencing would be if I ran something like lvresize on two nodes at the same time, during a split brain. So I dug around a little:

    # dlm_controld.pcmk -h
    Usage: dlm_controld [options]
    Options:
      ...
      -f num    Enable (1) or disable (0) fencing recovery dependency
                Default is 1
      -q num    Enable (1) or disable (0) quorum recovery dependency
                Default is 0

I reckon if you set the args parameter of your ocf:pacemaker:controld resource to "-f 0 -q 0", you'll have DLM ignoring fencing.

At this point (lest someone reading the archives later thinks I am advocating this) it would be irresponsible of me not to mention this story about Why You Need STONITH: http://advogato.org/person/lmb/diary/105.html There is also an accompanying comic: http://ourobengr.com/stonith-story

If DLM is ignoring fencing, everything that uses DLM is also going to ignore fencing, so if you've got (say) an OCFS2 filesystem on top of your CLVM volume, your filesystem will potentially be toast in a split-brain situation.

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
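For anyone determined to try the above despite the warnings, a rough crm shell sketch of what such a controld resource might look like; the resource and clone names and the monitor interval are made up for illustration:

```
# Hypothetical DLM control daemon with fencing (-f 0) and quorum
# (-q 0) recovery dependencies disabled -- per the thread, this
# risks filesystem corruption in a split brain. Names/values are
# placeholders, not a tested configuration.
primitive dlm ocf:pacemaker:controld \
    params args="-f 0 -q 0" \
    op monitor interval="60s" timeout="60s"
clone cl-dlm dlm \
    meta interleave="true"
```

Again: with -f 0, everything stacked on DLM (CLVM, OCFS2) also stops waiting for fencing, so this is only defensible on clusters whose data you are prepared to lose.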
Re: [Pacemaker] clvmd hangs on node1 if node2 is fenced
On 8/27/2010 at 08:50 AM, Michael Smith msm...@cbnco.com wrote:

Xinwei Hu hxin...@... writes: That sounds worrying actually. I think this is logged as bug 585419 on SLES' bugzilla. If you can reproduce this issue, it's worth reopening it, I think.

I've got a pair of fully patched SLES11 SP1 nodes and they're showing what I guess is the same behaviour: if I hard-poweroff node2, operations like vgdisplay -v hang on node1 for quite some time. Sometimes a minute, sometimes two, sometimes forever. They get stuck here:

    Aug 26 18:31:42 xen-test1 clvmd[8906]: doing PRE command LOCK_VG 'V_vm_store' at 1 (client=0x7f2714000b40)
    Aug 26 18:31:42 xen-test1 clvmd[8906]: lock_resource 'V_vm_store', flags=0, mode=3

After a few seconds, corosync and dlm notice the node is gone, but vgdisplay and friends still hang while trying to lock the VG:

    Aug 26 18:31:44 xen-test1 corosync[8476]: [TOTEM ] A processor failed, forming new configuration.
    Aug 26 18:31:50 xen-test1 cluster-dlm[8870]: update_cluster: Processing membership 1260
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Skipped active node 219878572: born-on=1256, last-seen=1260, this-event=1260, last-event=1256
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: del_configfs_node: del_configfs_node rmdir /sys/kernel/config/dlm/cluster/comms/236655788
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Removed inactive node 236655788: born-on=1252, last-seen=1256, this-event=1260, last-event=1256
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:controld conf 1 0 1 memb 219878572 join left 236655788
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:ls:clvmd conf 1 0 1 memb 219878572 join left 236655788
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd add_change cg 3 remove nodeid 236655788 reason 3
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: stop_kernel: clvmd stop_kernel cg 3
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: do_sysfs: write 0 to /sys/kernel/dlm/clvmd/control
    Aug 26 18:31:51 xen-test1 kernel: [ 365.267802] dlm: closing connection to node 236655788
    Aug 26 18:31:51 xen-test1 clvmd[8906]: confchg callback. 0 joined, 1 left, 1 members
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: fence_node_time: Node 236655788/xen-test2 has not been shot yet
    Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: check_fencing_done: clvmd check_fencing 23665578 not fenced add 1282861615 fence 0
    Aug 26 18:31:51 xen-test1 crmd: [8489]: info: ais_dispatch: Membership 1260: quorum still lost
    Aug 26 18:31:51 xen-test1 cluster-dlm: [8870]: info: ais_dispatch: Membership 1260: quorum still lost

Do you have STONITH configured? Note that it says "xen-test2 has not been shot yet" and "clvmd ... not fenced". It's just going to sit there until the down node is successfully fenced - this is intentional, as it's not safe to keep running until you *know* the dead node is dead.

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
[Pacemaker] Updated openSUSE packages in network:ha-clustering repo
Hi All,

Just a quick note for openSUSE users - there are updated packages now in the network:ha-clustering and network:ha-clustering:Factory repos, built for SLE 11, SLE 11 SP1, openSUSE 11.2, openSUSE 11.3 and Factory:

http://download.opensuse.org/repositories/network:/ha-clustering/
http://download.opensuse.org/repositories/network:/ha-clustering:/Factory/

This includes:

- cluster-glue 1.0.6
- corosync 1.2.7
- csync2 1.34
- hawk 0.3.5
- ldirectord 1.0.3
- libdlm 3.00.01
- ocfs2-tools 1.4.3
- openais 1.1.3
- pacemaker 1.1.2.1
- pacemaker-mgmt 2.0.0
- resource-agents 1.0.3

Happy clustering, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] CFP: Linux Plumbers Mini-Conf on High-Availability/Clustering
On 8/13/2010 at 11:21 PM, Florian Haas florian.h...@linbit.com wrote:

On 08/11/2010 01:53 PM, Florian Haas wrote: On 08/10/2010 07:48 PM, Lars Marowsky-Bree wrote: On 2010-08-04T15:59:27, Lars Marowsky-Bree l...@novell.com wrote:

Hi all, there will (hopefully!) be a mini-conference on HA/Clustering at this year's LPC in Cambridge, MA, Nov 3-5th. Just a quick reminder, there've not been many proposals submitted yet. If the trend continues, the mini-conf slot might instead be allocated to another topic ... Please do consider submitting a talk to this soon - I know it's to a large degree my fault for sending out the request so late.

I have a couple of proposals queued, but you caught me between leave and Linuxcon. :) I'll submit them as soon as I can.

OK, I've submitted 3 proposals. But I'm a bit baffled to see just one other proposal besides that. Red Hat folks, NTT people, please! We need you! This is likely the only chance we get to collaborate in one place this whole year.

I actually can't see the original CFP email in the linux-cluster archives. On the bold assumption that *this* email somehow magically makes it to that list, here's the URL to submit proposals: http://www.linuxplumbersconf.org/2010/ocw/events/LPC2010MC/proposals/new

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] Opensuse 11.3
On 7/26/2010 at 11:51 PM, Andrew Beekhof and...@beekhof.net wrote:

On Mon, Jul 26, 2010 at 7:56 AM, Andrew Beekhof and...@beekhof.net wrote: Probably this week.

On Wed, Jul 21, 2010 at 11:59 PM, Roberto Giordani r.giord...@libero.it wrote: Hello Andrew, do you know when the clusterlabs rpms for openSUSE 11.3 will be available?

It doesn't look to be possible, I'm afraid. SUSE isn't including the repodata directory at http://download.opensuse.org/distribution/11.3/repo/oss/ which means yum can't use it and I can't build packages for 11.3. I don't know what's up with that. Perhaps they want to encourage people to use their build service. Any volunteers?

FWIW, openSUSE 11.3 includes reasonably current versions of Pacemaker (1.1.2.1), corosync (1.2.1), openais (1.1.2), cluster-glue (1.0.5) and resource-agents (1.0.3). Heartbeat is a bit out of date (2.99.3). There's one problem I'm aware of (can't start openais/corosync on x86_64) but this can be worked around by creating a few symlinks, see the bug report for details: https://bugzilla.novell.com/show_bug.cgi?id=623427

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] RFC: cluster-wide attributes
On 7/5/2010 at 04:54 PM, Andrew Beekhof and...@beekhof.net wrote:

On Mon, Jul 5, 2010 at 6:21 AM, Tim Serong tser...@novell.com wrote: On 6/30/2010 at 09:42 PM, Andrew Beekhof and...@beekhof.net wrote: On Thu, Jun 24, 2010 at 5:41 PM, Lars Marowsky-Bree l...@novell.com wrote:

Hi, another idea that goes along with the previous post are cluster-wide attributes. Similar to per-node attributes, but basically a special section in configuration:

    <optional>
      <element name="cluster_attributes">
        <zeroOrMore>
          <element name="attributes">
            <externalRef href="nvset.rng"/>
          </element>
        </zeroOrMore>
      </element>
    </optional>

Do we need a new section? Or can they go in with cluster-infrastructure etc?

These then would also be referenceable in the various dependencies like node attributes, just globally. Question - 1. Do we want to treat them like true node attributes, i.e., per-node attributes would override the cluster-wide settings - or as indeed a completely separate class? I lean towards the latter, but would solicit some more opinions.

Not sure it really gives you anything by making them a separate class, does it? Just means you have to look twice, right?

Just for the record, a use case of this came up on IRC last week: you could specify cluster-wide standby=on, so new nodes joining the cluster would automatically join in standby mode, with the admin activating them later (per-node standby=off thus overriding the cluster-wide attribute).

That doesn't necessarily mean they need to be a separate class though.

No, not at all. I'm just adding to the conversation in an unnecessarily confusing fashion :)

Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] RFC: cluster-wide attributes
On 6/30/2010 at 09:42 PM, Andrew Beekhof and...@beekhof.net wrote:

On Thu, Jun 24, 2010 at 5:41 PM, Lars Marowsky-Bree l...@novell.com wrote:

Hi, another idea that goes along with the previous post are cluster-wide attributes. Similar to per-node attributes, but basically a special section in configuration:

    <optional>
      <element name="cluster_attributes">
        <zeroOrMore>
          <element name="attributes">
            <externalRef href="nvset.rng"/>
          </element>
        </zeroOrMore>
      </element>
    </optional>

Do we need a new section? Or can they go in with cluster-infrastructure etc?

These then would also be referenceable in the various dependencies like node attributes, just globally. Question - 1. Do we want to treat them like true node attributes, i.e., per-node attributes would override the cluster-wide settings - or as indeed a completely separate class? I lean towards the latter, but would solicit some more opinions.

Not sure it really gives you anything by making them a separate class, does it? Just means you have to look twice, right?

Just for the record, a use case of this came up on IRC last week: you could specify cluster-wide standby=on, so new nodes joining the cluster would automatically join in standby mode, with the admin activating them later (per-node standby=off thus overriding the cluster-wide attribute).

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] Pacemaker can't start CTDB
On 7/2/2010 at 02:02 AM, Justin Shafer justinsha...@gmail.com wrote:

Hello all, I have noticed that corosync can't start CTDB in Fedora and Ubuntu. It will work in SLES11 after installing samba-winbind. Going through the logs, sometimes it can't get a recovery lock (filesystem related, I know).. but other times I have tried, it can get a recovery lock.

Possibly the CTDB RA is hitting its start timeout before CTDB has stabilized (which includes some recovery lock fiddling). Try increasing the timeout (crm configure ... op start timeout=...) for your CTDB resource. If that doesn't work, have a look at the CTDB RA itself, about line 359: change seq 30 to something higher (probably we need to make this configurable).

and once it does, it stops the monitoring and stops winbind and shuts down.

Does it say why? You probably want /var/log/ctdb/log.ctdb and /var/log/samba/log.{smbd,winbindd}...

It was doing this with SLES 11 until I added samba-winbind, so I am just guessing it can't find smb, nmb and winbind on Ubuntu and Fedora, but it's just a guess..

Hard to say without seeing logs, but I'm guessing the CTDB RA is setting CTDB_SERVICE_SMB, CTDB_SERVICE_NMB etc. incorrectly on those distros. Please file a bug for this: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Linux-HA

In SUSE it would start and then stop and never really say why in ctdb.log until I added winbind, and then the logs showed it trying to start samba, etc. Too bad not all distros are the same in regards to smb, smbd, samba. I configured /etc/default/ctdb in Ubuntu and /etc/sysconfig/ctdb in Fedora, but no dice. Also I noticed that corosync doesn't rip out /etc/default/ctdb and replace it with its own like in SLES11.. at least Ubuntu isn't.

Curious. It's *meant* to replace that file. Anything interesting that you can specify in that file should be specified using RA instance parameters.
For some notes on this, see: http://linux-ha.org/wiki/CTDB_%28resource_agent%29 Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
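To make the timeout advice above concrete, here is a rough crm shell sketch. Note that the resource/clone names, the lock path, and every timeout value are illustrative guesses, not tested settings from this thread, and the parameter list is deliberately minimal:

```
# Hypothetical CTDB resource with generous start/stop timeouts,
# so the RA doesn't give up before CTDB finishes its recovery-lock
# negotiation. All names and values here are placeholders.
primitive ctdb ocf:heartbeat:CTDB \
    params ctdb_recovery_lock="/shared/ctdb.lock" \
    op start timeout="180s" \
    op stop timeout="120s" \
    op monitor interval="10s" timeout="30s"
clone cl-ctdb ctdb \
    meta interleave="true"
```

Check `crm ra info ocf:heartbeat:CTDB` on your own system for the actual parameter names your RA version accepts before copying any of this.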
Re: [Pacemaker] Set order on two clone set, but apply on each node
On 6/6/2010 at 12:58 AM, Comet / 余尚哲 comet...@gmail.com wrote:

I have 10 nodes and want to run httpd and mysqld on all nodes (I plan to do load balancing later), so I set two clone sets for apache and mysql:

    clone cl-apache apache
    clone cl-mysqld mysqld

Each apache will connect to mysqld locally, so if mysqld crashes, I must turn off the apache running on the same node to avoid errors if apache tries to load data from mysql. But I think the order constraint can't do that for me if my setting is like this:

    order mysqld-before-apache inf: cl-mysqld cl-apache

If I want to apply this rule to each node, what setting should I configure?

Try cloning a group, something like:

    group mysqld-with-apache mysqld apache
    clone cl-mysqld-with-apache mysqld-with-apache

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
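Spelled out slightly further, the cloned-group suggestion above might look like the following in the crm shell. The primitive definitions are hypothetical placeholders (the original post only names the resources, not their agents or parameters):

```
# Hypothetical primitives; the original thread defines these elsewhere.
primitive mysqld ocf:heartbeat:mysql
primitive apache ocf:heartbeat:apache

# Grouping gives per-node ordering and colocation automatically:
# on each node, mysqld starts before apache, and a failed mysqld
# instance takes down only the apache on that same node.
group mysqld-with-apache mysqld apache
clone cl-mysqld-with-apache mysqld-with-apache \
    meta interleave="true"
```

The `interleave="true"` meta attribute is an assumption on my part; it is commonly used so that clone instances on one node don't wait on instances on other nodes.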
Re: [Pacemaker] Issues with constraints - working for start/stop, being ignored on failures
On 6/2/2010 at 11:10 AM, Cnut Jansen w...@cnutjansen.eu wrote:

On 31.05.2010 05:47, Tim Serong wrote:

On 5/31/2010 at 12:57 PM, Cnut Jansen w...@cnutjansen.eu wrote:

Current constraints:

    colocation TEST_colocO2cb inf: cloneO2cb cloneDlm
    colocation colocGrpMysql inf: grpMysql cloneMountMysql
    colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
    colocation colocMountMysql_o2cb inf: cloneMountMysql cloneO2cb
    colocation colocMountOpencms_drbd inf: cloneMountOpencms msDrbdOpencms:Master
    colocation colocMountOpencms_o2cb inf: cloneMountOpencms cloneO2cb
    colocation colocTomcat inf: cloneTomcat cloneMountOpencms:Started
    order TEST_orderO2cb 0: cloneDlm cloneO2cb
    order orderGrpMysql 0: cloneMountMysql:start grpMysql
    order orderMountMysql_drbd 0: msDrbdMysql:promote cloneMountMysql:start
    order orderMountMysql_o2cb 0: cloneO2cb cloneMountMysql
    order orderMountOpencms_drbd 0: msDrbdOpencms:promote cloneMountOpencms:start
    order orderMountOpencms_o2cb 0: cloneO2cb cloneMountOpencms
    order orderTomcat 0: cloneMountOpencms:start cloneTomcat

Try specifying inf for those ordering scores rather than zero. Ordering constraints with score=0 are considered optional and only have an effect when both resources are starting or stopping. You should also be able to leave out the :start specifiers as this is implicit.

About those :start specifiers on the mount resources' order constraints you're of course right, and I also already knew about that. They're just remains from some tests I did (probably a search for (other?) workarounds or something), which I only - due to their (to my knowledge) harmless redundancy - so far always forgot to remove again when doing other, more relevant/important changes.
You know, because the crm shell (which I currently use for editing my configuration) cancels all resource monitor operations on the node it is started on, I prefer to avoid starting it as much as possible, since I always have to make sure afterwards that all monitor operations run again (i.e. switch the cluster's maintenance-mode on/off, or switch the node to standby and back online).

Say what? The CRM shell shouldn't be canceling ops...

About those 0-scores, unfortunately they're necessary, since they're - afaik - the official workaround to prevent instances of clone resources being restarted on nodes where it's unnecessary to do so. With scores set to inf instead, when I for example put one node into standby and/or back online, most clone resources would also be restarted on the other node. That's not acceptable for production. This behaviour was, according to what I remember having read, only changed in Pacemaker 1.0.7, which isn't shipped with SLES 11 yet. I'm hoping for SLES 11 SP1 to change that, but haven't found any reliable information about its version of Pacemaker yet.

SLES 11 SP1 and the SLE High Availability Extension 11 SP1 are now available for download from http://download.novell.com/ - this includes Pacemaker 1.1.2.

Constraints added to work around at least the DRBD resources left in state started (unmanaged) failed:

    order GNAH_orderDrbdMysql_stop 0: cloneMountMysql:stop msDrbdMysql:stop
    order GNAH_orderDrbdOpencms_stop 0: cloneMountOpencms:stop msDrbdOpencms:stop

(Also tried similar constraints for msDrbd*:demote and cloneDlm:stop, but neither seemed to have an effect)

Those shouldn't be necessary (I never tried putting ordering constraints on stop ops before...)

They shouldn't, right; that's also what I had expected. But as I reported in my post above, they - for whatever reason - actually DO have an effect! I simply don't know yet why, and hope someone else may have a clue.
Anyway, so far they're the most acceptable workaround I know of for those strange constraint issues that made me write here. (Another workaround is start-delays on stop operations, but those are - given their dependency upon individual nodes' system and resource performance - not acceptable for production.) I just still don't know if it's a case of misconfiguration and/or lack of knowledge/experience on my side, or if it's really a bug in Pacemaker; maybe even an already-fixed one in more recent versions than SLES 11's Pacemaker 1.0.6.

Curious... I'd suggest seeing if you can reproduce on SLE 11 SP1.

Regards, Tim

In case someone would like to have a look at it, I attached the complete cluster configuration, with and without the workaround, both as XML and as output of crm configure show. (Please don't wonder about some quite high monitor operation intervals, which were just set so when dumping the config; the tests done and configs dumped when posting in Novell's support forum were done with those timings at 1/100 of it, and it made no difference) Here
Re: [Pacemaker] Openais OCF Script Question
On 5/30/2010 at 11:13 AM, Emil Popov epo...@postpath.com wrote:

Hi, I'm trying to use an OCF script in my openais cluster. For the most part it works. From time to time, though, Pacemaker executes the original LSB resource script instead of the correct OCF one, therefore not passing the correct parameters to the resource. When I stop the resource and start it again, it executes the correct OCF script the second time around. This usually happens when the resource fails over to another node and initially runs the LSB script instead of the OCF one. Very strange. Any advice is greatly appreciated.

Below is the error in /var/log/messages. It insists on using the LSB script in the /etc/init.d directory. I had renamed the /etc/init.d/ppsd script, but that causes the below error and STONITH reboots the node.

    May 29 05:01:40 gpp0099pun018 crmd: [10927]: info: do_lrm_rsc_op: Performing key=186:20891:0:977e982d-1345-4d4f-b69f-9bf0de010aa3 op=ppsd-6_start_0 )
    May 29 05:01:40 gpp0099pun018 lrmd: [10924]: info: rsc:ppsd-6: start
    May 29 05:01:40 gpp0099pun018 lrmd: [7387]: WARN: For LSB init script, no additional parameters are needed.
    May 29 05:01:40 gpp0099pun018 lrmd: [7387]: ERROR: (raexeclsb.c:execra:266) execv failed for /etc/init.d/ppsd: No such file or directory
    May 29 05:01:40 gpp0099pun018 lrmd: [10924]: ERROR: Failed to open lsb RA ppsd. No meta-data gotten.
    May 29 05:01:40 gpp0099pun018 lrmd: [10924]: WARN: on_msg_get_metadata: empty metadata for lsb::heartbeat::ppsd.
    May 29 05:01:40 gpp0099pun018 crmd: [10927]: ERROR: lrm_get_rsc_type_metadata(575): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg.
    May 29 05:01:40 gpp0099pun018 crmd: [10927]: WARN: get_rsc_metadata: No metadata found for ppsd::lsb:heartbeat
    May 29 05:01:40 gpp0099pun018 crmd: [10927]: ERROR: string2xml: Can't parse NULL input
    May 29 05:01:40 gpp0099pun018 crmd: [10927]: ERROR: get_rsc_restart_list: Metadata for (null)::lsb:ppsd is not valid XML
    May 29 05:01:40 gpp0099pun018 crmd: [10927]: info: process_lrm_event: LRM operation ppsd-6_start_0 (call=103, rc=254, cib-update=239, confirmed=true) complete unknown

Here is the resource configuration that I have in Pacemaker. It's supposed to use an OCF script named ppsd in directory /usr/lib/ocf/resource.d/custom/ppsd:

    primitive ppsd-0 ocf:custom:ppsd \
        params externalip=192.168.0.50 \
        op monitor interval=10s timeout=90s \
        op start interval=0 timeout=1800s \
        op stop interval=0 timeout=180s \
        meta target-role=Started is-managed=true

Using openais 0.80.5, Pacemaker 1.0.4.

Do you also have an LSB primitive defined called ppsd-6? Because that's what those logs say LRMD is trying to start...

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: [Pacemaker] Issues with constraints - working for start/stop, being ignored on failures
On 5/31/2010 at 12:57 PM, Cnut Jansen w...@cnutjansen.eu wrote:

Hi, I'm not sure if it's really some kind of bug (maybe already widely known and even already fixed in more recent versions) or simply misconfiguration and lack of knowledge and experience or something (since I'm still quite new to HA computing), but I have issues with Pacemaker regarding the order constraints I defined; I can't get rid of them and can only partially work around them. But such workarounds don't really seem as intended/designed to me...

The problem: upon starting / switching-to-online and stopping / switching-to-standby the nodes / cluster, all constraint chains work as they should, and so do they even upon directly stopping the troubling fundamental resources - the DRBD and DLM resources, which are the bases of my constraint chains. Therefore, when e.g. a failure occurs in the DRBD resource for MySQL's DataDir, the cluster should first stop the MySQL resource group (MySQL + IP address), then stop the MySQL mount resource, then demote and finally stop the DRBD resource. But when trying to test the cluster's behaviour upon such a failure via crm_resource -F -r drbdMysql:0 -H nde28, the cluster first tries to demote the DRBD resource, then also already stop it, then the MySQL IP, the MySQL mount and only finally MySQL. The result of such a test isn't - due to failing demote and stop for the DRBD resource - hard to guess: the DRBD resource is left in started (unmanaged) failed, and the rest of the involved resources are stopped.

I'm running Pacemaker 1.0.6, delivered with and running on SLES 11 with HAE, both kept up-to-date with the official update repositories (due to company directives). In a few days SLES 11 SP1 shall be released, where I also hope for a more recent version of Pacemaker, DRBD (still have to run 8.2.7) and other HA-cluster-related stuff.
I also already posted about these issues in Novell's support forum in a lot more detail: http://forums.novell.com/novell-product-support-forums/suse-linux-enterprise-server-sles/sles-configure-administer/411152-constraint-issues-upon-failure-drbd-resource-suse-linux-enterprise-hae-11-a.html

So I'm wondering:

1) Aren't constraint chains, upon defining them, also already implicitly exactly invertedly defined for stopping resources too?

Yes, but see below for a note on scores.

2) After my testing for workarounds: why (seem to) do - in case of the failing fundamental resources - order constraints for MS resources' stop action have an effect, but neither those for MS resources' demote action, nor those for (primitives'/?)clones' stop action? Or is that just because the MS resource's stop action is only the second command anyway, and just therefore following my additional constraint?!

I'm not sure about that.

Current constraints:

    colocation TEST_colocO2cb inf: cloneO2cb cloneDlm
    colocation colocGrpMysql inf: grpMysql cloneMountMysql
    colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
    colocation colocMountMysql_o2cb inf: cloneMountMysql cloneO2cb
    colocation colocMountOpencms_drbd inf: cloneMountOpencms msDrbdOpencms:Master
    colocation colocMountOpencms_o2cb inf: cloneMountOpencms cloneO2cb
    colocation colocTomcat inf: cloneTomcat cloneMountOpencms:Started
    order TEST_orderO2cb 0: cloneDlm cloneO2cb
    order orderGrpMysql 0: cloneMountMysql:start grpMysql
    order orderMountMysql_drbd 0: msDrbdMysql:promote cloneMountMysql:start
    order orderMountMysql_o2cb 0: cloneO2cb cloneMountMysql
    order orderMountOpencms_drbd 0: msDrbdOpencms:promote cloneMountOpencms:start
    order orderMountOpencms_o2cb 0: cloneO2cb cloneMountOpencms
    order orderTomcat 0: cloneMountOpencms:start cloneTomcat

Try specifying inf for those ordering scores rather than zero.
Ordering constraints with score=0 are considered optional and only have an effect when both resources are starting or stopping. You should also be able to leave out the :start specifiers, as :start is implicit. Constraints added to work around at least the DRBD resources being left in state started (unmanaged) failed:
order GNAH_orderDrbdMysql_stop 0: cloneMountMysql:stop msDrbdMysql:stop
order GNAH_orderDrbdOpencms_stop 0: cloneMountOpencms:stop msDrbdOpencms:stop
(Also tried similar constraints for msDrbd*:demote and cloneDlm:stop, but neither seemed to have any effect.) Those shouldn't be necessary (I never tried putting ordering constraints on stop ops before...) Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
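For the record, the mandatory form of the orderings discussed above would look roughly like this in crm shell syntax (a sketch using two of the constraint names quoted in this thread, with the score changed from 0 to inf; the resource names are the poster's, everything else follows standard crm shell usage):

```shell
# With score inf the ordering is mandatory, so the cluster must also
# honour it in reverse (stop/demote) order, which the optional score-0
# form does not guarantee when only one side is changing state.
crm configure order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql
crm configure order orderGrpMysql inf: cloneMountMysql grpMysql
```

The same substitution (0 → inf) applies to the other order constraints in the quoted configuration.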
Re: [Pacemaker] How SuSEfirewall2 affects on openais startup?
Hi, On 5/13/2010 at 03:56 PM, Aleksey Zholdak alek...@zholdak.com wrote: The firewall should let through the UDP multicast traffic on ports mcastport and mcastport+1. As I wrote above: all interfaces in SuSEfirewall2 are set to the Internal zone. So, how can I open these ports if they are already open? Just to double check, I assume the Internal zone does not have any firewall rules applied to it? If you go to Allowed Services in the YaST2 firewall config app, it should show everything greyed-out or allowed for the Internal Zone. (Disclaimer: my major experience with SuSEfirewall2 is opening the ssh port on a system I care about, and turning the firewall off completely on my test cluster systems, because they're inside networks I trust.) You said earlier that openais starts OK if you have the firewall on, but resources do not run. What does the output of crm_mon -r1 show in this case? Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
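For anyone who does want to keep the firewall enabled with rules applied (rather than an unprotected Internal zone), the cluster ports can be opened in the SuSEfirewall2 sysconfig file. A hedged sketch; 5405 is only the common default mcastport, so substitute whatever your openais.conf actually uses:

```shell
# /etc/sysconfig/SuSEfirewall2 (fragment)
# Allow the openais UDP multicast traffic on mcastport and mcastport+1
# (assuming the default mcastport of 5405):
FW_SERVICES_EXT_UDP="5405 5406"
```

Then reload the firewall with the SuSEfirewall2 command for the change to take effect.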
Re: [Pacemaker] How SuSEfirewall2 affects on openais startup?
On 5/13/2010 at 07:22 PM, Aleksey Zholdak alek...@zholdak.com wrote: firewall should let through the UDP multicast traffic on ports mcastport and mcastport+1. As I wrote above: all interfaces in SuSEfirewall2 is set to Internal zone. So, how can I open these ports if it already opened? Just to double check, I assume Internal zone does not have any firewall rules applied to it? If you go to Allowed Services in the YaST2 firewall config app, it should show everything greyed-out or allowed for Internal Zone. Yes, exactly, everything greyed-out and allowed for Internal Zone. Internal zone is unprotected. All ports are open. OK, that sounds fine. You said earlier that openais starts OK if you have the firewall on, but resources do not run. What does the output of crm_mon -r1 show in this case? sles2:~ # crm_mon -r1 Last updated: Thu May 13 12:21:21 2010 Stack: openais Current DC: NONE 2 Nodes configured, 2 expected votes 10 Resources configured. Node sles2: UNCLEAN (offline) Node sles1: UNCLEAN (offline) The above is normal for while the cluster is starting up. This may sound a little silly, but I would have expected everything to come online if you just wait a few minutes. You can watch status changes (if any) as they occur, with crm_mon -r. It's worth checking /var/log/messages etc. on each node too, to see if anything is obviously screaming in pain. Full list of resources: Clone Set: sbd-clone Stopped: [ sbd_fense:0 sbd_fense:1 ] Don't clone the SBD stonith resource, you only need a single primitive here (not that this should be causing your startup trouble). Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Re: [Pacemaker] How SuSEfirewall2 affects on openais startup?
On 5/13/2010 at 11:48 PM, Aleksey Zholdak alek...@zholdak.com wrote: firewall should let through the UDP multicast traffic on ports mcastport and mcastport+1. As I wrote above: all interfaces in SuSEfirewall2 is set to Internal zone. So, how can I open these ports if it already opened? Just to double check, I assume Internal zone does not have any firewall rules applied to it? If you go to Allowed Services in the YaST2 firewall config app, it should show everything greyed-out or allowed for Internal Zone. Yes, exactly, everything greyed-out and allowed for Internal Zone. Internal zone is unprotected. All ports are open. OK, that sounds fine. You said earlier that openais starts OK if you have the firewall on, but resources do not run. What does the output of crm_mon -r1 show in this case? sles2:~ # crm_mon -r1 Last updated: Thu May 13 12:21:21 2010 Stack: openais Current DC: NONE 2 Nodes configured, 2 expected votes 10 Resources configured. Node sles2: UNCLEAN (offline) Node sles1: UNCLEAN (offline) The above is normal for while the cluster is starting up. This may sound a little silly, but I would have expected everything to come online if you just wait a few minutes. You can watch status changes (if any) as they occur, with crm_mon -r. It's worth checking /var/log/messages etc. on each node too, to see if anything is obviously screaming in pain. In such state node are unchanged for hours. OK, I had to ask. Analysis of logs in this situation does not say anything ... If the firewall is blocking anything, it'll be making noise in /var/log/firewall and/or dmesg. Another thing to try is set debug: on in the openais/corosync config file, then look at /var/log/messages. This should give you more log info... I must remind you that we are talking about a running one node of the two. The second node is turned off (burned, stolen, etc.) 
Clone Set: sbd-clone Stopped: [ sbd_fense:0 sbd_fense:1 ] Don't clone the SBD stonith resource, you only need a single primitive here (not that this should be causing your startup trouble). sbd fence must be on each node. The sbd daemon needs to be running on both nodes (the openais init script should take care of that on SLES), but there only needs to be one sbd primitive, it does not need to be cloned. Pacemaker will make sure it is running somewhere, which is enough. When the firewall is off or run both of nodes - no problem. So, one node running, with the firewall off, is OK? Two nodes running, with the firewall on, is OK? I think I'm becoming confused... Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
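In other words, an un-cloned SBD configuration is just a single stonith primitive, roughly like this (a sketch; the device path is a placeholder, substitute your actual shared-storage partition):

```shell
# One stonith primitive is enough: Pacemaker will run it on some node,
# while the sbd daemon itself runs on every node via the openais init
# script on SLES.
crm configure primitive sbd-fencing stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/example-shared-partition"
```

This replaces the cloned sbd-clone / sbd_fense:0 / sbd_fense:1 arrangement shown in the status output above.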
Re: [Pacemaker] Making a resource slightly sticky?
On 5/14/2010 at 07:39 AM, Paul Graydon p...@ehawaii.gov wrote: Hi, One of my nodes decided to throw a wobbly this morning and locked up its network card for about a minute. Pacemaker came to the rescue and merrily transferred everything over to the other node successfully; however, when the original node came back again it transferred the functions back across. Is it possible at all to make resources sticky? i.e. resources start on node 1. Node 1 fails, resources migrate to node 2. Node 1 recovers, but resources stay on node 2 until node 2 fails, at which point they'd migrate to node 1. Yes, you want the resource-stickiness property. Using crm configure, per resource: # primitive foo \ meta resource-stickiness=1 Or, to make everything a bit sticky: # rsc_defaults resource-stickiness=1 Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
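To make the arithmetic concrete: stickiness competes with whatever placement score is pulling the resource back to its preferred node, so it must be large enough to win. A hypothetical example (the resource name foo and the scores are illustrations, not from this thread):

```shell
# node1 is preferred with score 100:
crm configure location prefer-node1 foo 100: node1
# A stickiness of 1 loses to the location score of 100, so foo would
# migrate back when node1 recovers. A stickiness of 200 outweighs it,
# so foo stays on node2 after failover:
crm configure rsc_defaults resource-stickiness="200"
```

Setting resource-stickiness=INFINITY makes resources never move back automatically at all.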
Re: [Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)
On 4/13/2010 at 08:13 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Mon, Apr 12, 2010 at 10:56:30PM +0200, Roberto Giordani wrote: Hi Tim it's working! Thanks the only simple error was /root/.crm_help_index that should be owned by hacluster:haclient Why should it be owned by another user if it's for root? Does hawk use the crm shell? For performing management ops, yes. It invokes: /usr/sbin/crm resource (start|stop|migrate|unmigrate|cleanup) [...] It should be effectively run as hacluster:haclient though, not as root... Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Re: [Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)
On 4/14/2010 at 01:59 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Tue, Apr 13, 2010 at 05:45:02AM -0600, Tim Serong wrote: On 4/13/2010 at 08:13 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Mon, Apr 12, 2010 at 10:56:30PM +0200, Roberto Giordani wrote: Hi Tim it's working! Thanks the only simple error was /root/.crm_help_index that should be owned by hacluster:haclient Why should it be owned by another user if it's for root? Does hawk use the crm shell? For performing management ops, yes. It invokes: /usr/sbin/crm resource (start|stop|migrate|unmigrate|cleanup) [...] It should be effectively run as hacluster:haclient though, not as root... Makes me wonder how it ended up in /root . I'd have expected it to appear in /var/lib/heartbeat/cores/hacluster if it was going to appear anywhere... But actually, I didn't think the help index was created unless you tried to access the help (which Hawk doesn't do)? Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
[Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)
Greetings All, This is to announce version 0.3.3 of Hawk, a web-based GUI for Pacemaker HA clusters. Noticeable changes from version 0.3.1 include: - Port number is now 7630 (registered with IANA) - Shows Master/Slave sets (but currently just shows children as started) - Added confirmation prompt for node ops (bnc#593003) and resource ops. - Allow resource mgmt ops on groups (in addition to group children) - Added ability to migrate resources (bnc#593005) - Invoke crm for resource ops, report invocation errors in UI (bnc#583605) - Add mgmt buttons for new resources that appear via JSON update (bnc#590037) SLES/openSUSE packages can be obtained from the openSUSE Build Service: http://software.opensuse.org/search?baseproject=ALLp=1q=hawk Finally, the wiki page at http://clusterlabs.org/wiki/Hawk has been updated slightly to reflect the current project status as outlined in this email. As before, please direct comments, feedback, questions etc. to tser...@novell.com and/or the Pacemaker mailing list. Happy clustering, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Re: [Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)
On 4/13/2010 at 01:08 AM, Roberto Giordani r.giord...@libero.it wrote: Hello, where can I find sysvinit 2.86-215.2 for openSUSE 11.2 x86_64? This is the dependency: rpm -ivh hawk-0.3.3-1.1.x86_64.rpm warning: hawk-0.3.3-1.1.x86_64.rpm: Header V3 DSA signature: NOKEY, key ID 45bd6ae1 error: Failed dependencies: sysvinit 2.86-215.2 is needed by hawk-0.3.3-1.1.x86_64 There's one in the YaST:Web/openSUSE_11.2 repo... But actually, if that's the only missing dependency, you could just install with --nodeps. You'll only get into trouble if you're running a separate instance of lighttpd for some other purpose (in which case startproc etc. may get confused about which lighttpd it's meant to be dealing with). /me makes a note to do something more friendly about this dependency. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
[Pacemaker] [PATCH] Low: tools: crm_simulate - fix small xpath memory leak in inject_node_state()
# HG changeset patch
# User Tim Serong tser...@novell.com
# Date 1269931000 -39600
# Node ID 37312dd57d64ef67d829b3dbb868c659438dc495
# Parent 8b867b37c8007042877943b0c74601528db24d0f
Low: tools: crm_simulate - fix small xpath memory leak in inject_node_state()

diff -r 8b867b37c800 -r 37312dd57d64 tools/crm_inject.c
--- a/tools/crm_inject.c  Mon Mar 29 16:45:22 2010 +0200
+++ b/tools/crm_inject.c  Tue Mar 30 17:36:40 2010 +1100
@@ -92,6 +92,7 @@
         rc = cib_conn->cmds->query(cib_conn, xpath, &cib_object, cib_xpath|cib_sync_call|cib_scope_local);
     }
+    crm_free(xpath);
     CRM_ASSERT(rc == cib_ok);
     return cib_object;
 }
___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] ERROR: unpack_rsc_op: Hard error
On 3/10/2010 at 09:48 AM, Werner wkuba...@co.sanmateo.ca.us wrote: I have a problem with setting up a simple two node cluster with an IP address that should fail over. The two systems run SLES 11 with HAE. I have done this configuration with two virtual machines and it works just fine in this environment. However, when I do the exact same configuration on the real (physical) systems it fails. This is what I get: r...@imsrcdbdgrid2:~# crm configure show node imsrcdbdgrid1 node imsrcdbdgrid2 primitive ClusterIP ocf:heartbeat:IPadd2 \ ^IPaddr2 Looks like a typo - if your configuration is missing that 'r' character, that'll be the source of your problem (although, if the crm shell let you create a primitive using an RA that doesn't exist, that sounds like a bug). r...@imsrcdbdgrid2:~# crm_verify -L -V crm_verify[26812]: 2010/03/09_14:42:44 ERROR: unpack_rsc_op: Hard error - ClusterIP_monitor_0 failed with rc=5: Preventing ClusterIP from re-starting on imsrcdbdgrid1 rc=5 means not installed, which you'll get if the RA explicitly returns that error code, or if the RA itself doesn't exist. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
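For reference, the corrected primitive, plus a quick way to confirm an agent name actually exists before configuring it (commands as on SLES 11 HAE; the IP address and netmask are placeholders):

```shell
# Inspect the agent first; a typo like IPadd2 fails loudly here instead
# of producing a confusing rc=5 (not installed) from the probe at runtime:
crm ra info ocf:heartbeat:IPaddr2
ls /usr/lib/ocf/resource.d/heartbeat/ | grep -i ipaddr

# Then define the resource with the correct agent name:
crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.100" cidr_netmask="24"
```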
[Pacemaker] Announce: HA Web Konsole (Hawk) 0.3.1
Greetings All, This is to announce version 0.3.1 of Hawk, a web-based GUI for Pacemaker HA clusters. This version introduces the ability to perform some basic management tasks (node standby/online/fence and resource start/stop/cleanup). It also now includes a login screen, so random passersby can't break your cluster. The rule here is the same as for the python GUI - you need to log in as a user who is a member of the haclient group. SLES/openSUSE packages can be obtained from the openSUSE Build Service: http://software.opensuse.org/search?baseproject=ALLp=1q=hawk The wiki page at http://clusterlabs.org/wiki/Hawk has also been updated to reflect the current project status as outlined in this email. As before, please direct comments, feedback, questions etc. to tser...@novell.com and/or the Pacemaker mailing list. Enjoy, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] DRBD Management Console 0.6.0
On 3/2/2010 at 11:12 PM, Rasto Levrinc rasto.levr...@linbit.com wrote: On Tue, March 2, 2010 11:41 am, Lars Marowsky-Bree wrote: On 2010-02-28T12:24:26, Rasto Levrinc rasto.levr...@linbit.com wrote: cool stuff. It's sad that we end up with a competing thingy ... Maybe we could keep Tim's pure web-ui for the monitoring and most simple bits and have drbd-mc replace the python UI. Thanks lmb. I see a place for Hawk as a lightweight tool to quickly make some changes and I could even somehow integrate in the DRBD-MC. I can speed up the DRBD-MC quite a bit still, I did not even try to optimize it till now, but it will never be very fast. Yeah, from my perspective I think Hawk and DRBD-MC will each have different strengths, for example pointing a web browser at a cluster node to see status is easy/quick/lightweight, whereas visualizing complex dependency relationships between resources is more straightforward to implement and interact with in a regular non-web app (although no doubt HTML5 advocates will disagree with me here :)) How does it interact with the CRM shell? Does it issue XML changes directly? What kind of network connection is required between the UI client and the servers? DRBD-MC connects itself via SSH and uses mostly cibadmin and crm_resource commands on the host. It could simply use crm shell commands instead, but it doesn't at the moment, mostly to be compatible with older Heartbeats and there was no reason to change it. By comparison, Hawk doesn't need SSH, as it's running on the cluster nodes. Internally it also uses cibadmin, a couple of crm_* commands and the crm shell, so currently only works with Pacemaker 1.x. It /reads/ XML from cibadmin, but I wasn't planning on having it change the XML directly, rather any changes are (and will be) made through existing CLI tools. 
Side point: I have it in the back of my mind that I may end up wanting to communicate directly with libcib if the CLI tools ever become a performance bottleneck, but this isn't a problem yet (earlier, Hawk was running crm_resource to get the status of each resource, so the more resources, the more execs: yuck. Now it just figures everything out from a single run of cibadmin). Is there a chance to share more code between the various UIs that, I think, we are going to keep going forward? (I'm pretty sure the crm shell, the web-ui and yours are going to remain actively maintained.) Yes, I like that. I'm also keen on duplicating as little as possible, but I think there's more scope for sharing of underlying tools (crm shell etc.), or perhaps developing new scaffolding as necessary, than for sharing pieces of higher-level GUI implementation. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] DRBD Management Console 0.6.0
On 3/1/2010 at 11:16 PM, Rasto Levrinc rasto.levr...@linbit.com wrote: On Mon, March 1, 2010 12:10 pm, Cristian Mammoli - Apra Sistemi wrote: Hi again... I tried adding a resource with DMC. My script needs 2 mandatory parameters: vmxpath and vimshbin. In the GUI I filled in the field for vmxpath, while vimshbin was already present because the resource agent has: <shortdesc lang="en">vmware-vim-cmd path</shortdesc> <content type="string" default="/usr/bin/vmware-vim-cmd"/> The question is what default means here. Is it something the RA would use if nothing is specified, or a suggestion to the GUI of what to offer as a default value? Obviously DRBD-MC assumes the former and the vmware RA the latter. IMO it's both :) If the parameter is optional, default is the value the RA should use internally if no value is explicitly specified. If the parameter is mandatory, default is what the management tools should populate that field with initially. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, OPS Engineering, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Frustrating fun with Pacemaker / CentOS / Apache
On 2/19/2010 at 08:40 AM, Paul Graydon p...@ehawaii.gov wrote: I started looking into this today to find out whether it was possible to use another URL for testing. According to the heartbeat script you can specify the parameter statusurl, and as long as the page you test has a <body> and <html> tag it should work. You can also set your own testregex, which should match the output of statusurl. Since resource agents release 1.0.2, apache can also do more thorough tests (see crm ra info apache or ocf_heartbeat_apache(7)). So I thought I'd give it a try, but it failed. Initially I assumed it was because I hadn't selected a page with the <body> and <html> tags (having not noticed that was a necessity), but even against a page that has them it still failed. Trying to execute the command it runs came up with a failure for me too, but it appears to be how all the arguments are presented to wget courtesy of sh -c. It's looking for a positive return from: sh -c wget -O- -q -L http://whatever.url.youprovided | tr '\012' ' ' | grep -Ei '</ *body *>[[:space:]]*</ *html *>' Problem is, if you cut it down to just that first section: sh -c wget -O- -q -L http://whatever.url.youprovided it pops back and tells you wget: missing URL Usage: wget [OPTION]... [URL]... Try `wget --help' for more options. If you execute wget without sh -c in front of it, it sees the URL and parses it successfully. Surrounding the wget string with ' marks, as in: sh -c 'wget -O- -q -L http://whatever.url.youprovided' works. I'm trying to figure out what other options are available. Adding in ' marks on line 406 of the ocf heartbeat apache script breaks it! I really don't think there is a need to change anything there. Otherwise, apache would never be able to work. If you think you found a problem, you can try to wrap the critical part in set -x/+x and we'll see what the output looks like. 
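The quoting behaviour described above is easy to demonstrate without wget at all: sh -c treats only its first argument as the command string, and anything after it becomes $0, $1, and so on. A minimal illustration using echo:

```shell
# Unquoted: the command string is just "echo"; "one" and "two" become
# $0 and $1 of the new shell, so echo runs with no arguments and
# prints an empty line.
sh -c echo one two

# Quoted: the whole string is the command string, so this prints
# "one two".
sh -c 'echo one two'
```

This is why the RA's wget pipeline has to be passed to sh -c as a single quoted argument, pipeline and all, when reproducing it by hand.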
Thanks, Dejan I've looked into this with fresh eyes this morning and managed to track down the problem: it was related to the addition of meta target-role=Started. Not sure quite where I picked that up from; presumably one of the configurations I used as a template? Without setting it as an attribute it works fine, tested and retested with and without that addition. This works: primitive failover-apache ocf:heartbeat:apache \ params configfile=/etc/httpd/conf/httpd.conf httpd=/usr/sbin/httpd port=80 \ op monitor interval=5s timeout=20s statusurl=https://valid.test.url/index.html This doesn't: primitive failover-apache ocf:heartbeat:apache \ params configfile=/etc/httpd/conf/httpd.conf httpd=/usr/sbin/httpd port=80 \ op monitor interval=5s timeout=20s statusurl=https://valid.test.url/index.html \ meta target-role=Started That's weird. That attribute shouldn't make any difference in this case - it's just telling the cluster that it should try to start the resource, which is the default anyway. My understanding of the meta bits is a little weak, and I can't find an explanation of what target-role is actually trying to do. It specifies the state the resource is meant to be in[1], i.e. stopped, started, or a master or slave (the latter of which you would use for an active/passive DRBD clone pair, for example). Ignoring master/slave resources, this attribute is set if you use crm resource stop or crm resource start to force a resource to stop or start. Regards, Tim [1] Yes, I know I should use the word role here, not state :) -- Tim Serong tser...@novell.com Senior Clustering Engineer, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0
On 2/9/2010 at 11:05 PM, darren.mans...@opengi.co.uk wrote: It's pacemaker-1.0.3-4.1 No output for cluster-infrastructure. But the HTML source does contain information, just display: none hides it: <div id="summary" style="display: none"> <table> <tr><th>Stack:</th> <td><span id="summary::stack"></span></td></tr> ... </table> </div> It was keying the display off the cluster-infrastructure parameter, which first appeared in Pacemaker 1.0.4. I've since fixed this, and OBS packages for hawk-0.2.1 have been built. They should thus appear in the repos on download.opensuse.org in the fullness of time. Once said time has elapsed, please install the new packages and let me know how you go. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0
On 2/10/2010 at 03:40 AM, darren.mans...@opengi.co.uk wrote: On Tue, 2010-02-09 at 06:44 -0700, Tim Serong wrote: On 2/9/2010 at 11:05 PM, darren.mans...@opengi.co.uk wrote: It's pacemaker-1.0.3-4.1 No output for cluster-infrastructure. But the HTML source does contain information, just display: none hides it: <div id="summary" style="display: none"> <table> <tr><th>Stack:</th> <td><span id="summary::stack"></span></td></tr> ... </table> </div> It was keying the display off the cluster-infrastructure parameter, which first appeared in Pacemaker 1.0.4. I've since fixed this, and OBS packages for hawk-0.2.1 have been built. They should thus appear in the repos on download.opensuse.org in the fullness of time. Once said time has elapsed, please install the new packages and let me know how you go. Regards, Tim This is great, thanks. The only problem now is that in FF 3.5 and Google Chrome in Linux it displays for about 5 seconds, then the screen goes blank. Is it this bit of JS? Event.observe(window, 'load', function() { do_update(); }); So, by fixed I clearly meant fixed in only one of the two places that require fixing. 
Please try the following change (the relevant file will be /srv/www/hawk/public/javascripts/application.js):

diff -r ed8bf3b8be26 hawk/public/javascripts/application.js
--- a/hawk/public/javascripts/application.js  Tue Feb 09 23:27:49 2010 +1100
+++ b/hawk/public/javascripts/application.js  Wed Feb 10 10:35:23 2010 +1100
@@ -35,7 +35,7 @@
 }
 function update_summary(summary) {
-  if (summary.stack) {
+  if (summary.version) {
     for (var e in summary) {
       $("summary::" + e).update(summary[e]);
     }
@@ -101,7 +101,7 @@
     update_errors(request.responseJSON.errors);
     update_summary(request.responseJSON.summary);
-    if (request.responseJSON.summary.stack) {
+    if (request.responseJSON.summary.version) {
       $("nodelist").show();
       if (update_panel(request.responseJSON.nodes)) {
         if ($("nodelist::children").hasClassName("closed")) {

Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0
Greetings All, This is to announce version 0.2.0 of Hawk, a web-based GUI for Pacemaker HA clusters. The major item of note for this version is that we now have reasonable feature parity with crm_mon, and there are SLES/openSUSE packages available from the openSUSE Build Service: http://software.opensuse.org/search?baseproject=ALLp=1q=hawk There is also a wiki page up at http://clusterlabs.org/wiki/Hawk that gives a brief overview of the project, and tells you how to get the source from Mercurial, if you don't want to (or can't) use the above packages. As before, please direct comments, feedback, questions etc. to tser...@novell.com and/or the Pacemaker mailing list. Thanks for listening, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] Announce: Hawk (HA Web Konsole)
Greetings All, This is to announce the development of the Hawk project, a web-based GUI for Pacemaker HA clusters. So, why another management tool, given that we already have the crm shell, the Python GUI, and DRBD MC? In order: 1) We have the usual rationale for a GUI over (or in addition to) a CLI tool; it is (or should be) easier to use, for a wider audience. 2) The Python GUI is not always easily installable/runnable (think: sysadmins with Windows desktops and/or people who don't want to, or can't, forward X). 3) Believe it or not, there are a number of cases where, citing security reasons, site policy prohibits ssh access to servers (which is what DRBD MC uses internally). There are also some differing goals; Hawk is not intended to expose absolutely everything. There will be a point somewhere where you have to say and now you must learn to use a shell. Likewise, Hawk is not intended to install the base cluster stack for you (whereas DRBD MC does a good job of this). It's early days yet (no downloadable packages), but you can get the current source as follows: # hg clone http://hg.clusterlabs.org/pacemaker/hawk # cd hawk # hg update tip This will give you a web-based GUI with a display roughly analogous to crm_mon, in terms of status of cluster resources. It will show you running/dead/standby nodes, and the resources (clones, groups, primitives) running on those nodes. It does not yet provide information about failed resources or nodes, other than the fact that they are not running. Display of nodes and resources is collapsible (collapsed by default), but if something breaks while you are looking at it, the display will expand to show the broken nodes and/or resources. Hawk is intended to run on each node in your cluster. You can then access it by pointing your web browser at the IP address of any cluster node, or the address of any IPaddr(2) resource you may have configured. 
Minimally, to see it in action, you will need the following packages and their dependencies (names per openSUSE/SLES): - ruby - rubygem-rails-2_3 - rubygem-gettext_rails Once you've got those installed, run the following command: # hawk/script/server Then, point your browser at http://your-server:3000/ to see the status of your cluster. Ultimately, hawk is intended to be installed and run as a regular system service via /etc/init.d/hawk. To do this, you will need the following additional packages: - lighttpd - lighttpd-mod_magnet - ruby-fcgi - rubygem-rake Then, try the following, but READ THE MAKEFILE FIRST! make install (and the rest of the build system for that matter) is frightfully primitive at the moment: # make # sudo make install # /etc/init.d/hawk start Then, point your browser at http://your-server:/ to see the status of your cluster. Assuming you've read this far, what next? - In the very near future (but probably not next week, because I'll be busy at linux.conf.au) you can expect to see further documentation and roadmap info up on the clusterlabs.org wiki. - Immediate goal is to obtain feature parity with crm_mon (completing status display, adding error/failure messages). - Various pieces of scaffolding need to be put in place (login page, access via HTTPS, clean up build/packaging, theming, etc.) - After status display, the following major areas of functionality are: - Basic operator tasks (stop/start/migrate resource, standby/online node, etc.) - Explore failure scenarios (shadow CIB magic to see what would happen if a node/resource failed). - Ability to actually configure resources and nodes. Please direct comments, feedback, questions, etc. to tser...@novell.com and/or the Pacemaker mailing list. Thank you for your attention. Regards, Tim -- Tim Serong tser...@novell.com Senior Clustering Engineer, Novell Inc. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Wrong stack o2cb
On 12/16/2009 at 01:41 AM, Поляченко Владимир Владимирович strafer.ad...@gmail.com wrote:

Hi all (sorry for my English; I can read and understand, but not write in English). I configured a cluster on Fedora 12 (following the Clusters from Scratch manual for Apache on Fedora 11), with packages from the Fedora repo:

  [r...@server1 /]# rpm -q pacemaker ocfs2-tools ocfs2-tools-pcmk dlm-pcmk heartbeat corosync resource-agents drbd
  pacemaker-1.0.5-4.fc12.i686
  ocfs2-tools-1.4.3-3.fc12.i686
  ocfs2-tools-pcmk-1.4.3-3.fc12.i686
  dlm-pcmk-3.0.6-1.fc12.i686
  heartbeat-3.0.0-0.5.0daab7da36a8.hg.fc12.i686
  corosync-1.2.0-1.fc12.i686
  resource-agents-3.0.6-1.fc12.i686
  drbd-8.3.6-2.fc12.i686

The configuration is Active/Active. The problem (from /var/log/messages):

  Dec 15 16:07:21 server1 crmd: [1189]: info: te_rsc_command: Initiating action 4: monitor o2cb:0_monitor_0 on server1 (local)
  Dec 15 16:07:21 server1 crmd: [1189]: info: do_lrm_rsc_op: Performing key=4:91:7:78a6a7b0-ef15-434f-8aaf-e00cd0f9d6ef op=o2cb:0_monitor_0 )
  Dec 15 16:07:21 server1 lrmd: [1186]: info: rsc:o2cb:0:101: monitor
  Dec 15 16:07:21 server1 o2cb[20999]: ERROR: Wrong stack o2cb
  Dec 15 16:07:21 server1 lrmd: [1186]: info: RA output: (o2cb:0:monitor:stderr) 2009/12/15_16:07:21 ERROR: Wrong stack o2cb
  Dec 15 16:07:21 server1 crmd: [1189]: info: process_lrm_event: LRM operation o2cb:0_monitor_0 (call=101, rc=5, cib-update=430, confirmed=true) not installed
  Dec 15 16:07:21 server1 crmd: [1189]: WARN: status_from_rc: Action 4 (o2cb:0_monitor_0) on server1 failed (target: 7 vs. rc: 5): Error
  Dec 15 16:07:21 server1 crmd: [1189]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=o2cb:0_monitor_0, magic=0:5;4:91:7:78a6a7b0-ef15-434f-8aaf-e00cd0f9d6ef, cib=0.329.2) : Event failed
  Dec 15 16:07:21 server1 crmd: [1189]: info: update_abort_priority: Abort priority upgraded from 0 to 1
  Dec 15 16:07:21 server1 crmd: [1189]: info: update_abort_priority: Abort action done superceeded by restart
  Dec 15 16:07:21 server1 crmd: [1189]: info: match_graph_event: Action o2cb:0_monitor_0 (4) confirmed on server1 (rc=4)
  Dec 15 16:07:21 server1 crmd: [1189]: info: te_rsc_command: Initiating action 3: probe_complete probe_complete on server1 (local) - no waiting

But the resource /dev/drbd1 mounts without problem (nodes online, the mount does not start, I mount it manually).

You don't want to be mounting it manually, the cluster needs to do it for you.

crm config (only the relevant rows):

  primitive DataFS ocf:heartbeat:Filesystem \
    params device=/dev/drbd/by-res/data directory=/opt fstype=ocfs2 \
    meta target-role=Started
  primitive ServerData ocf:linbit:drbd \
    params drbd_resource=data
  primitive dlm ocf:pacemaker:controld \
    op monitor interval=120s
  primitive o2cb ocf:ocfs2:o2cb \
    op monitor interval=120s
  ms ServerDataClone ServerData \
    meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  clone dlm-clone dlm \
    meta interleave=true
  clone o2cb-clone o2cb \
    meta interleave=true
  colocation o2cb-with-dlm inf: o2cb-clone dlm-clone
  order start-o2cb-after-dlm inf: dlm-clone o2cb-clone

I created /etc/ocfs2/cluster.conf:

  node:
    name = server1
    cluster = ocfs2
    number = 0
    ip_address = 10.10.10.1
    ip_port =
  node:
    name = server2
    cluster = ocfs2
    number = 1
    ip_address = 10.10.10.2
    ip_port =
  cluster:
    name = ocfs2
    node_count = 2

How do I resolve this problem?

You shouldn't need /etc/ocfs2/cluster.conf.
AFAIK this is only used in non-Pacemaker environments, when o2cb is managing the cluster. Did you create your filesystem with o2cb running, or with the Pacemaker cluster? If the former, I'd suggest:

- Make sure o2cb is chkconfig'd off.
- Make sure your Pacemaker cluster is running, and that dlm and ocfs2 are up.
- Run tunefs.ocfs2 --update-cluster-stack (or use mkfs to recreate your clustered filesystem).

One cluster stack can't mount a filesystem created with a different cluster stack.

HTH,

Tim
--
Tim Serong
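The steps above can be sketched as follows. This is a sketch only, not a verbatim recipe from the thread: the device path /dev/drbd1 is taken from the poster's earlier message and may differ in your setup, and you should check the exact tunefs.ocfs2 invocation against the man page for your ocfs2-tools version before running it.

```shell
# Stop the o2cb init script from bringing up its own (non-Pacemaker)
# cluster stack at boot; Pacemaker's o2cb resource agent manages it instead.
chkconfig o2cb off

# With the Pacemaker cluster running (dlm-clone and o2cb-clone started),
# switch the filesystem's on-disk cluster stack metadata to the Pacemaker
# stack. /dev/drbd1 is the device from the original post; adjust to suit.
tunefs.ocfs2 --update-cluster-stack /dev/drbd1
```

Alternatively, as noted above, recreating the filesystem with mkfs.ocfs2 while the Pacemaker stack is up achieves the same result (at the cost of the data on it).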
Re: [Pacemaker] How to delete a resource
On 12/7/2009 at 08:53 PM, Colin colin@gmail.com wrote:

Hi, when trying to delete a resource, either by replacing the whole resources part of the CIB via cibadmin with a new version where some resources are missing, or by using crm_resource -t primitive --resource name --delete, I get the following error:

  Error performing operation: Update does not conform to the configured schema/DTD

Now, since the error doesn't tell me where the problem is, I can only guess that other, dynamic parts of the CIB still reference the resource, and the schema prevents dangling references. So if these methods don't work, and the crm shell doesn't have a delete for resources, is there an official and simple way to delete a resource?

This should do it:

  # crm configure delete resource-id

Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.
Re: [Pacemaker] Node crash when 'ifdown eth0'
On 12/1/2009 at 11:05 AM, hj lee kerd...@gmail.com wrote:

On Fri, Nov 27, 2009 at 3:05 PM, Steven Dake sd...@redhat.com wrote:

On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote: I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using openais) with CentOS 5.4. The packages are from here: http://www.clusterlabs.org/rpm/epel-5/

On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remírez de Ganuza Satrústegui oscar...@unav.es wrote: Good morning, we are testing a cluster configuration on RHEL5 (x86_64) with pacemaker 1.0.5 and openais (0.80.5): a two node cluster, active-passive, with the following resources: a MySQL service resource and an NFS filesystem resource (shared storage in a SAN). In our tests, when we bring down the network interface (ifdown eth0), the

What is the use case for ifdown eth0 (ie what are you trying to verify)?

I have the same test case. In my case, when the two-node cluster is disconnected, I want to see split-brain, and then I want to see the split-brain handler reset one of the nodes. What I want to verify is that the cluster recovers from network disconnection and a split-brain situation.

Try this, on one node:

  # iptables -A INPUT -s ip.of.other.node -j DROP
  # iptables -A OUTPUT -d ip.of.other.node -j DROP

HTH,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.
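A fuller version of that test might look like the sketch below. The peer address 10.10.10.2 is a placeholder (not from this thread), and the -D lines that undo the partition are an addition for completeness; both rules must be removed to restore communication. Run as root on one node only.

```shell
# Simulate a network partition toward the peer node without taking the
# interface down (so local IP resources and routes stay intact).
# 10.10.10.2 is a placeholder for the other node's cluster address.
iptables -A INPUT  -s 10.10.10.2 -j DROP
iptables -A OUTPUT -d 10.10.10.2 -j DROP

# ...observe the cluster's split-brain handling (e.g. via crm_mon),
# then delete the same rules to restore connectivity:
iptables -D INPUT  -s 10.10.10.2 -j DROP
iptables -D OUTPUT -d 10.10.10.2 -j DROP
```

This approach tests loss of cluster communication in isolation, which is usually what people actually mean to verify with "ifdown eth0".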
Re: [Pacemaker] Fwd: virtual IP
On 11/21/2009 at 06:25 AM, Shravan Mishra shravan.mis...@gmail.com wrote:

This is my exact output:

  Last updated: Fri Nov 20 18:20:51 2009
  Stack: openais
  Current DC: node1.itactics.com - partition with quorum
  Version: 1.0.5-9e9faaab40f3f97e3c0d623e4a4c47ed83fa1601
  2 Nodes configured, 2 expected votes
  4 Resources configured.

  Online: [ node1.itactics.com node2.itactics.com ]

  Master/Slave Set: ms-drbd
      Masters: [ node1.itactics.com ]
      Slaves: [ node2.itactics.com ]
  node1.itactics.com-stonith (stonith:external/safe/ipmi): Started node2.itactics.com
  node2.itactics.com-stonith (stonith:external/safe/ipmi): Started node1.itactics.com
  Resource Group: svcs_grp
      fs0 (ocf::heartbeat:Filesystem): Started node1.itactics.com
      safe_svcs (ocf::itactics:safe): Started node1.itactics.com
      vip (ocf::heartbeat:IPaddr2): Stopped

  Failed actions:
      vip_monitor_0 (node=node1.itactics.com, call=7, rc=2, status=complete): invalid parameter
      vip_monitor_0 (node=node2.itactics.com, call=7, rc=2, status=complete): invalid parameter

The config I tried this time was:

  <primitive id="vip" class="ocf" type="IPaddr2" provider="heartbeat">
    <operations>
      <op id="op-vip-1" name="monitor" timeout="30s" interval="10s"/>
    </operations>
    <instance_attributes id="ia-vip">
      <nvpair id="vip-addr" name="ip" value="172.30.0.17"/>
      <nvpair id="vip-intf" name="nic" value="eth1"/>
      <nvpair id="vip-bcast" name="broadcast" value="172.30.255.255"/>
      <nvpair id="vip-cidr_netmask" name="cidr_netmask" value="16"/>
    </instance_attributes>
  </primitive>

Can somebody tell me what the problem is here?

You're probably suffering from https://bugzilla.novell.com/show_bug.cgi?id=553753 which is fixed by http://hg.linux-ha.org/agents/rev/5d341d5dc96a

Try explicitly adding the parameter clusterip_hash=sourceip-sourceport to the IP address. This will add something like the following to the instance_attributes:

  <nvpair id="vip-clusterip_hash" name="clusterip_hash" value="sourceip-sourceport"/>

Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.
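One way to add that parameter without hand-editing the CIB XML is via the crm shell; this is a sketch under the assumption that your crmsh version supports the `resource param` subcommands, with "vip" being the resource id from the post above.

```shell
# Set the clusterip_hash instance attribute on the vip resource.
# "vip" is the resource id from the configuration quoted above.
crm resource param vip set clusterip_hash sourceip-sourceport

# Read it back to confirm the attribute was added:
crm resource param vip show clusterip_hash
```

The cluster will then re-probe the resource with the extra parameter in place; check `crm_mon` to see whether the failed vip_monitor_0 actions clear.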
Re: [Pacemaker] RFC: Compacting constraints
On 11/6/2009 at 05:13 AM, Andrew Beekhof and...@beekhof.net wrote:

On Thu, Nov 5, 2009 at 4:57 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: "conjoin" sounds sort of funny to me (as a non-native speaker).

Equally so to me, and Australian is kinda like English. How about "coordinate"?

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.