Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
I will look into adding alerts, thanks for the info. For now I have introduced a 5-second sleep after "pcs cluster start ...". It seems to be enough time for the monitor operation to run.

On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot wrote:
> Another possibility you might want to look into is alerts. Pacemaker can
> call a script of your choosing whenever a resource is started or
> stopped. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296
>
> for the concepts, and the pcs man page for the "pcs alert" interface.
>
> On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> > I checked the node_state of the node that is killed and brought back
> > (test3). in_ccm == true and crmd == online for a second or two between
> > "pcs cluster start test3" and "monitor":
> >
> > <node_state ... in_ccm="true" crmd="online"
> >   crm-debug-origin="peer_update_callback" join="member" expected="member">
> >
> > On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> > <ludovi...@gmail.com> wrote:
> > > Yes, I haven't been using the "nodes" element in the XML, only the
> > > "resources" element. I couldn't find "node_state" elements or
> > > attributes in the XML, so after some searching I found that they are
> > > in the CIB, which can be fetched with "pcs cluster cib foo.xml". I
> > > will start exploring this as an alternative to crm_mon/"pcs status".
> > > However I still find what happens to be confusing, so below I try to
> > > better explain what I see:
> > >
> > > Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> > > shutdown a minute ago):
> > >
> > > crm_mon -1:
> > >
> > >     Stack: corosync
> > >     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> > >     Last updated: Fri May 12 10:45:36 2017
> > >     Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1
> > >
> > >     3 nodes and 4 resources configured
> > >
> > >     Online: [ test1 test2 ]
> > >     OFFLINE: [ test3 ]
> > >
> > >     Active resources:
> > >
> > >     Master/Slave Set: pgsql-ha [pgsqld]
> > >         Masters: [ test1 ]
> > >         Slaves: [ test2 ]
> > >     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
> > >
> > > crm_mon -X:
> > >
> > >     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
> > >         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
> > >             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> > >         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
> > >             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> > >         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true"
> > >             failed="false" failure_ignored="false" nodes_running_on="0" />
> > >     </clone>
> > >     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
> > >         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
> > >         nodes_running_on="1"> ... </resource>
> > >
> > > At 10:45:39.440, after "pcs cluster start test3", before first
> > > "monitor" on test3 (this is where I can't seem to know that
> > > resources on test3 are down):
> > >
> > > crm_mon -1:
> > >
> > >     Stack: corosync
> > >     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> > >     Last updated: Fri May 12 10:45:39 2017
> > >     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
> > >
> > >     3 nodes and 4 resources configured
> > >
> > >     Online: [ test1 test2 test3 ]
> > >
> > >     Active resources:
> > >
> > >     Master/Slave Set: pgsql-ha [pgsqld]
> > >         Masters: [ test1 ]
> > >         Slaves: [ test2 test3 ]
> > >     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
> > >
> > > crm_mon -X:
> > >
> > >     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
> > >         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
> > >             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> > >         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
> > >             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> > >         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
> > >             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> > >     </clone>
> > >     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
> > >         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
> > >         nodes_running_on="1"> ... </resource>
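The node_state check discussed above can be scripted: fetch the CIB (e.g. with "pcs cluster cib") and verify the node is up at both levels. A minimal Python sketch; the CIB fragment below is illustrative, and only the attribute names quoted in this thread (uname aside) are assumed to match the real output:

```python
import xml.etree.ElementTree as ET

# Illustrative CIB "status" fragment; in practice this text would come
# from running "pcs cluster cib" and the elements carry many more
# attributes than shown here.
CIB_SNIPPET = """
<cib>
  <status>
    <node_state uname="test1" in_ccm="true" crmd="online"
                join="member" expected="member"/>
    <node_state uname="test3" in_ccm="true" crmd="online"
                crm-debug-origin="peer_update_callback"
                join="member" expected="member"/>
  </status>
</cib>
"""

def node_is_up(cib_xml, node):
    """True when the node is up at both the cluster-stack level (in_ccm)
    and the pacemaker level (crmd), per its node_state entry."""
    root = ET.fromstring(cib_xml)
    for state in root.iter("node_state"):
        if state.get("uname") == node:
            return (state.get("in_ccm") == "true"
                    and state.get("crmd") == "online")
    return False
```

In a test harness the literal snippet would be replaced by something like `subprocess.check_output(["pcs", "cluster", "cib"])`. Note that, as observed in this thread, in_ccm/crmd can report up for a second or two before the first monitor runs, so this check alone does not prove the resource is healthy.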
Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Another possibility you might want to look into is alerts. Pacemaker can call a script of your choosing whenever a resource is started or stopped. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296

for the concepts, and the pcs man page for the "pcs alert" interface.

On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> I checked the node_state of the node that is killed and brought back
> (test3). in_ccm == true and crmd == online for a second or two between
> "pcs cluster start test3" and "monitor":
>
> <node_state ... in_ccm="true" crmd="online"
>   crm-debug-origin="peer_update_callback" join="member" expected="member">
>
> On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> <ludovi...@gmail.com> wrote:
>
> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or
> attributes in the XML, so after some searching I found that they are
> in the CIB, which can be fetched with "pcs cluster cib foo.xml". I
> will start exploring this as an alternative to crm_mon/"pcs status".
> However I still find what happens to be confusing, so below I try to
> better explain what I see:
>
> Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> shutdown a minute ago):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>     Last updated: Fri May 12 10:45:36 2017
>     Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 ]
>     OFFLINE: [ test3 ]
>
>     Active resources:
>
>     Master/Slave Set: pgsql-ha [pgsqld]
>         Masters: [ test1 ]
>         Slaves: [ test2 ]
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
> crm_mon -X:
>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
>         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
>         nodes_running_on="1"> ... </resource>
>
> At 10:45:39.440, after "pcs cluster start test3", before first
> "monitor" on test3 (this is where I can't seem to know that resources
> on test3 are down):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>     Last updated: Fri May 12 10:45:39 2017
>     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 test3 ]
>
>     Active resources:
>
>     Master/Slave Set: pgsql-ha [pgsqld]
>         Masters: [ test1 ]
>         Slaves: [ test2 test3 ]
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
> crm_mon -X:
>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
>         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
>         nodes_running_on="1"> ... </resource>
>
> At 10:45:41.606, after first "monitor" on test3 (I can now tell the
> resources on test3 are not ready):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>     Last updated: Fri May 12 10:45:41 2017
>     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources
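As a sketch of the alert approach suggested above: Pacemaker invokes the configured alert script with event details passed in CRM_alert_* environment variables. Everything else here (the log path, message format, and script name) is arbitrary, and the script would be registered with something like "pcs alert create path=/usr/local/bin/alert_log.py":

```python
#!/usr/bin/env python3
"""Minimal Pacemaker alert agent sketch: log resource events."""
import os

def format_alert(env):
    """Build a one-line summary from the CRM_alert_* variables."""
    kind = env.get("CRM_alert_kind", "unknown")  # "resource", "node", ...
    if kind == "resource":
        return "resource {rsc}: {task} on {node} rc={rc}".format(
            rsc=env.get("CRM_alert_rsc"),
            task=env.get("CRM_alert_task"),
            node=env.get("CRM_alert_node"),
            rc=env.get("CRM_alert_rc"))
    return "{kind} event on {node}".format(
        kind=kind, node=env.get("CRM_alert_node"))

if __name__ == "__main__":
    # A real agent would append to a log file or notify a test harness.
    print(format_alert(dict(os.environ)))
```

With an agent like this, the functional test could wait for a "start" event with rc=0 on test3 instead of polling status output.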
Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Hi Jehan-Guillaume,

I would be glad to discuss my motivations and findings with you, by mail or even in person. Let's just say that I originally wanted to create something that would allow deploying a PG cluster in a matter of minutes (yes, using Python). From there I tried to understand how PAF works, and at some point I wanted to start changing it, but not being too good with Perl, I chose to translate it. This kinda became a pet project.

Ludovic

On Fri, May 12, 2017 at 2:01 PM, Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:
> Hi Ludovic,
>
> On Thu, 11 May 2017 22:00:12 +0200
> Ludovic Vaugeois-Pepin wrote:
>
> > I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF)
> > in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> > editing it heavily.
>
> Could you please provide feedback to the upstream project (or here :))?
>
> * What did you improve in PAF?
> * What did you change in PAF?
> * Why did you translate PAF to Python? Any advantages?
>
> A lot of time and research has been dedicated to this project. PAF is a
> pure open source project. We would love some feedback and contributors to
> keep improving it. Do not hesitate to open issues on the PAF project if
> you need to discuss improvements.
>
> Regards,
> --
> Jehan-Guillaume de Rorthais
> Dalibo

--
Ludovic Vaugeois-Pepin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Hi Ludovic,

On Thu, 11 May 2017 22:00:12 +0200 Ludovic Vaugeois-Pepin wrote:
> I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF)
> in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> editing it heavily.

Could you please provide feedback to the upstream project (or here :))?

* What did you improve in PAF?
* What did you change in PAF?
* Why did you translate PAF to Python? Any advantages?

A lot of time and research has been dedicated to this project. PAF is a pure open source project. We would love some feedback and contributors to keep improving it. Do not hesitate to open issues on the PAF project if you need to discuss improvements.

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
I checked the node_state of the node that is killed and brought back (test3). in_ccm == true and crmd == online for a second or two between "pcs cluster start test3" and "monitor":

<node_state ... in_ccm="true" crmd="online"
  crm-debug-origin="peer_update_callback" join="member" expected="member">

On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin <ludovi...@gmail.com> wrote:
> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or attributes
> in the XML, so after some searching I found that they are in the CIB,
> which can be fetched with "pcs cluster cib foo.xml". I will start
> exploring this as an alternative to crm_mon/"pcs status".
>
> However I still find what happens confusing, so below I try to better
> explain what I see:
>
> Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> shutdown a minute ago):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>     Last updated: Fri May 12 10:45:36 2017
>     Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 ]
>     OFFLINE: [ test3 ]
>
>     Active resources:
>
>     Master/Slave Set: pgsql-ha [pgsqld]
>         Masters: [ test1 ]
>         Slaves: [ test2 ]
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
> crm_mon -X:
>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
>         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
>         nodes_running_on="1"> ... </resource>
>
> At 10:45:39.440, after "pcs cluster start test3", before the first
> "monitor" on test3 (this is where I can't seem to know that resources on
> test3 are down):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>     Last updated: Fri May 12 10:45:39 2017
>     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 test3 ]
>
>     Active resources:
>
>     Master/Slave Set: pgsql-ha [pgsqld]
>         Masters: [ test1 ]
>         Slaves: [ test2 test3 ]
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
> crm_mon -X:
>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
>         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
>         nodes_running_on="1"> ... </resource>
>
> At 10:45:41.606, after the first "monitor" on test3 (I can now tell the
> resources on test3 are not ready):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
>     Last updated: Fri May 12 10:45:41 2017
>     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 test3 ]
>
>     Active resources:
>
>     Master/Slave Set: pgsql-ha [pgsqld]
>         Masters: [ test1 ]
>         Slaves: [ test2 ]
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
> crm_mon -X:
>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true"
>             failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
>         active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
>         nodes_running_on="1"> ... </resource>
>
> On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot wrote:
>
>> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
>>
Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Yes, I haven't been using the "nodes" element in the XML, only the "resources" element. I couldn't find "node_state" elements or attributes in the XML, so after some searching I found that they are in the CIB, which can be fetched with "pcs cluster cib foo.xml". I will start exploring this as an alternative to crm_mon/"pcs status".

However I still find what happens confusing, so below I try to better explain what I see:

Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW shutdown a minute ago):

crm_mon -1:

    Stack: corosync
    Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
    Last updated: Fri May 12 10:45:36 2017
    Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1

    3 nodes and 4 resources configured

    Online: [ test1 test2 ]
    OFFLINE: [ test3 ]

    Active resources:

    Master/Slave Set: pgsql-ha [pgsqld]
        Masters: [ test1 ]
        Slaves: [ test2 ]
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1

crm_mon -X:

    <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
        <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
        <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="0" />
    </clone>
    <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
        active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
        nodes_running_on="1"> ... </resource>

At 10:45:39.440, after "pcs cluster start test3", before the first "monitor" on test3 (this is where I can't seem to know that resources on test3 are down):

crm_mon -1:

    Stack: corosync
    Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
    Last updated: Fri May 12 10:45:39 2017
    Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1

    3 nodes and 4 resources configured

    Online: [ test1 test2 test3 ]

    Active resources:

    Master/Slave Set: pgsql-ha [pgsqld]
        Masters: [ test1 ]
        Slaves: [ test2 test3 ]
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1

crm_mon -X:

    <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
        <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
    </clone>
    <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
        active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
        nodes_running_on="1"> ... </resource>

At 10:45:41.606, after the first "monitor" on test3 (I can now tell the resources on test3 are not ready):

crm_mon -1:

    Stack: corosync
    Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
    Last updated: Fri May 12 10:45:41 2017
    Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1

    3 nodes and 4 resources configured

    Online: [ test1 test2 test3 ]

    Active resources:

    Master/Slave Set: pgsql-ha [pgsqld]
        Masters: [ test1 ]
        Slaves: [ test2 ]
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1

crm_mon -X:

    <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
        <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
        <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true"
            failed="false" failure_ignored="false" nodes_running_on="0" />
    </clone>
    <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started"
        active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"
        nodes_running_on="1"> ... </resource>

On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot wrote:
> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> > Hi
> >
> > I translated a PostgreSQL multi-state RA
> > (https://github.com/dalibo/PAF) in Python
> > (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> > heavily.
> >
> > In parallel I am writing unit tests and functional tests.
> >
> > I am having an issue with a functional test that abruptly powers off a
> > slave named, say, "host3" (a hot standby PG instance). Later on I start
> > the slave back up. Once it is started, I run "pcs cluster start host3".
> > And this is where I start having a problem.
> >
> > I check every second the output of "pcs status xml" until host3 is said
> > to be ready as a slave again. In the following I assume that test3 is
> > ready as a slave:
> >
> > <nodes>
> >     <node ... standby_onfail="false" maintenance="false" pending="false"
> >         unclean="false" shutdown="false" expected_up="true" is_dc="false"
> >         resources_running="2" type="member" />
> >     <node ... standby_onfail="false" maintenance="false" pending="false"
> >         unclean="false" shutdown="false" expected_up="true" is_dc="true"
> >         resources_running="1" type="member" />
> >     <node ... standby_onfail="false" maintenance="false" pending="false"
> >         unclean="false" shutdown="false" expected_up="true" is_dc="false"
> >         resources_running="1" type="member" />
> > </nodes>
>
> The <nodes> section says nothing about the current state of the nodes.
> Look at the <node_state> entries for that. in_ccm means the cluster
> stack level, and crmd means the pacemaker level -- both need to be up.
>
> > <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
> >     <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
> >         failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> >     <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
> >         failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> >     <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
> >         failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> > </clone>
> >
> > By ready to go I mean
Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> Hi
>
> I translated a PostgreSQL multi-state RA
> (https://github.com/dalibo/PAF) in Python
> (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> heavily.
>
> In parallel I am writing unit tests and functional tests.
>
> I am having an issue with a functional test that abruptly powers off a
> slave named, say, "host3" (a hot standby PG instance). Later on I start
> the slave back up. Once it is started, I run "pcs cluster start host3".
> And this is where I start having a problem.
>
> I check every second the output of "pcs status xml" until host3 is said
> to be ready as a slave again. In the following I assume that test3 is
> ready as a slave:
>
> <nodes>
>     <node ... standby_onfail="false" maintenance="false" pending="false"
>         unclean="false" shutdown="false" expected_up="true" is_dc="false"
>         resources_running="2" type="member" />
>     <node ... standby_onfail="false" maintenance="false" pending="false"
>         unclean="false" shutdown="false" expected_up="true" is_dc="true"
>         resources_running="1" type="member" />
>     <node ... standby_onfail="false" maintenance="false" pending="false"
>         unclean="false" shutdown="false" expected_up="true" is_dc="false"
>         resources_running="1" type="member" />
> </nodes>

The <nodes> section says nothing about the current state of the nodes.
Look at the <node_state> entries for that. in_ccm means the cluster
stack level, and crmd means the pacemaker level -- both need to be up.

> <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
>     <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>         failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>     <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
>         failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
>     <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
>         failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
> </clone>
>
> By ready to go I mean that upon running "pcs cluster start test3", the
> following occurs before test3 appears ready in the XML:
>
> pcs cluster start test3
> monitor          -> RA returns unknown error (1)
> notify/pre-stop  -> RA returns ok (0)
> stop             -> RA returns ok (0)
> start            -> RA returns ok (0)
>
> The problem I have is that between "pcs cluster start test3" and
> "monitor", it seems that the XML returned by "pcs status xml" says test3
> is ready (the XML extract above is what I get at that moment). Once
> "monitor" occurs, the returned XML shows test3 to be offline, and not
> until the start is finished do I once again have test3 shown as ready.
>
> Am I getting anything wrong? Is there a simpler or better way to check
> if test3 is fully functional again, i.e. that the OCF start was
> successful?
>
> Thanks
>
> Ludovic
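On the resource side, the crm_mon -X output quoted in this thread can be parsed to see which role a clone instance reports on a given node. A Python sketch; the element layout below is modeled loosely on the quoted output and is illustrative, not the exact schema:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the style of "crm_mon -X" output.
STATUS = """
<crm_mon>
  <resources>
    <clone id="pgsql-ha" multi_state="true" managed="true">
      <resource id="pgsqld" role="Master" active="true" failed="false"
                nodes_running_on="1">
        <node name="test1" id="1" cached="false"/>
      </resource>
      <resource id="pgsqld" role="Slave" active="true" failed="false"
                nodes_running_on="1">
        <node name="test2" id="2" cached="false"/>
      </resource>
    </clone>
  </resources>
</crm_mon>
"""

def resource_role_on(status_xml, rsc_id, node):
    """Return the role an active instance of rsc_id reports on the given
    node, or None if no active instance is running there."""
    root = ET.fromstring(status_xml)
    for res in root.iter("resource"):
        if res.get("id") != rsc_id or res.get("active") != "true":
            continue
        if any(n.get("name") == node for n in res.iter("node")):
            return res.get("role")
    return None
```

As the thread shows, this alone is not enough either: right after "pcs cluster start test3" the status can briefly report a Slave on test3 before the first monitor corrects it, so the resource check should be combined with the node_state check.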
[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Hi

I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF) in Python (https://github.com/ulodciv/deploy_cluster), and I have been editing it heavily.

In parallel I am writing unit tests and functional tests.

I am having an issue with a functional test that abruptly powers off a slave named, say, "host3" (a hot standby PG instance). Later on I start the slave back up. Once it is started, I run "pcs cluster start host3". And this is where I start having a problem.

I check every second the output of "pcs status xml" until host3 is said to be ready as a slave again. In the following I assume that test3 is ready as a slave:

<nodes>
    <node ... standby_onfail="false" maintenance="false" pending="false"
        unclean="false" shutdown="false" expected_up="true" is_dc="false"
        resources_running="2" type="member" />
    <node ... standby_onfail="false" maintenance="false" pending="false"
        unclean="false" shutdown="false" expected_up="true" is_dc="true"
        resources_running="1" type="member" />
    <node ... standby_onfail="false" maintenance="false" pending="false"
        unclean="false" shutdown="false" expected_up="true" is_dc="false"
        resources_running="1" type="member" />
</nodes>

<clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false">
    <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
        failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
    <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true"
        failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
    <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true"
        failed="false" failure_ignored="false" nodes_running_on="1"> ... </resource>
</clone>

By ready to go I mean that upon running "pcs cluster start test3", the following occurs before test3 appears ready in the XML:

pcs cluster start test3
monitor          -> RA returns unknown error (1)
notify/pre-stop  -> RA returns ok (0)
stop             -> RA returns ok (0)
start            -> RA returns ok (0)

The problem I have is that between "pcs cluster start test3" and "monitor", it seems that the XML returned by "pcs status xml" says test3 is ready (the XML extract above is what I get at that moment). Once "monitor" occurs, the returned XML shows test3 to be offline, and not until the start is finished do I once again have test3 shown as ready.

Am I getting anything wrong? Is there a simpler or better way to check if test3 is fully functional again, i.e. that the OCF start was successful?

Thanks

Ludovic
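The check-every-second loop described above can be made robust by wrapping the readiness predicate in a polling helper with an explicit timeout, rather than relying on a fixed sleep. A small sketch; the helper name and default values are arbitrary:

```python
import time

def wait_until(predicate, timeout=30.0, interval=1.0):
    """Poll predicate() until it returns truthy or the timeout expires.
    Returns True on success, False on timeout; the predicate is always
    tried at least once."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The predicate passed in would be the full readiness check (node_state up at both levels, and the expected resource role reported on the node), so a transiently optimistic status snapshot just delays success instead of ending the wait early.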