[Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?
Hi, I am working on containerizing OpenStack in the Kolla project (http://launchpad.net/kolla). One of the key things we want to do over the next few months is add H/A support to our container tech.

David Vossel had suggested using systemd to monitor the containers themselves, by running healthchecking scripts within the containers. That idea is sound.

There is another technology called "super-privileged containers". Essentially it allows more host access for the container, allowing Pacemaker to be treated as a container rather than an RPM or DEB file. I'd like corosync to run in a separate container. These containers will communicate using their normal mechanisms in a super-privileged mode. We will implement this in Kolla.

Where I am stuck is how Pacemaker within a container controls other containers in the host OS. One way I have considered is using the docker --pid=host flag, allowing Pacemaker to communicate directly with the host systemd process. The complication is that our containers don't run via systemd, but instead via shell scripts that are executed by third-party deployment software.

An example: let's say we want to run a RabbitMQ container. The user would run

        kolla-mgr deploy messaging

This would run a small bit of code to launch the docker container set for messaging. Could Pacemaker run something like

        kolla-mgr status messaging

to control the lifecycle of the processes? Or would we be better off with some systemd integration with kolla-mgr?

Thoughts welcome.

Regards -steve

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
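For illustration, this is roughly how the Pacemaker side could look, assuming someone writes an OCF resource agent that wraps kolla-mgr - the agent name ocf:kolla:kolla-mgr and its service parameter are hypothetical, not an existing agent:

        primitive messaging ocf:kolla:kolla-mgr \
                params service=messaging \
                op start timeout=120s \
                op stop timeout=120s \
                op monitor interval=30s timeout=60s

The agent's monitor action would exec "kolla-mgr status messaging" and translate the result into OCF return codes (OCF_SUCCESS / OCF_NOT_RUNNING), so Pacemaker could drive the container set's lifecycle without any systemd involvement.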
Re: [Pacemaker] Need to relax corosync due to backup of VM through snapshot
On 11/21/2013 06:26 AM, Gianluca Cecchi wrote: On Thu, Nov 21, 2013 at 9:09 AM, Lars Marowsky-Bree wrote: On 2013-11-20T16:58:01, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

Based on the docs I thought that the timeout should be token x token_retransmits_before_loss_const.

No, the comments in corosync.conf.example and man corosync.conf should be pretty clear, I hope. Can you recommend which phrasing we should improve?

I have not understood the exact relationship between token and token_retransmits_before_loss_const - when one comes into play and when the other does. So perhaps the second one could be given more detail, or some web links.

The token timeout is a timer that is started each time a token is transmitted. This is the maximum timer that exists - the loss timeout is not token * retransmits_before_loss_const. The retrans_before_loss_const says: please transmit a replacement token X many times within the token period. Since the token is UDP, it could be lost in network overload situations or other scenarios. Using a real-world example - token: 10000, retrans_before_loss_const: 10 - the token will be retransmitted roughly every 1000 msec, and the token will be declared lost after 10000 msec.

Regards -steve

So my current test config is:

# diff corosync.conf corosync.conf.pre181113
24,25c24
#token: 5000
token: 120000

A 120s node timeout? That is really, really long. Why is the backup tool interfering with the scheduling of high-priority processes so much? That sounds like the real bug.

In fact I inherited the analysis of a previous production cluster, and I'm setting up a test environment to demonstrate that one realistic outcome could well be that a cluster is not the right solution, because the underlying infrastructure is not stable enough. I'm not given great visibility into the VMware and SAN details, but I'm pressing to get them. I saw disk latencies sometimes reaching 8000 milliseconds ;-( So another possible outcome could be to build a more reliable infrastructure before going with a cluster. I'm deliberately setting high values to see what happens, and will lower them step by step.

BTW: I remember in the past some thread with others having problems with NetBackup (or similar backup software) using snapshots, where setting higher values solved the sporadic problems (possibly 2 for token and 10 for retransmit, but I couldn't find them ...). Any comments? Any different strategies successfully used in similar environments where high latencies occur at snapshot deletion, when the disk consolidation phase is executed?

A setup where a VM apparently can freeze for almost 120s is not suitable for HA.

I see from previous logs that sometimes drbd disconnects and reconnects only after 30-40 seconds with default timeouts... Thanks for your inputs. Gianluca

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
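To make the timing relationship concrete, the two values from the example above would appear in corosync.conf like this (token is the total loss timeout in milliseconds; the retransmit interval is derived from it):

        totem {
                version: 2
                # total time in ms before the token is declared lost
                token: 10000
                # transmit a replacement token this many times within the
                # token period, i.e. roughly every 1000 ms here
                token_retransmits_before_loss_const: 10
        }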
[Pacemaker] Need HA for OpenStack instances? Check out heat V5!
Hi folks, A few developers from the HA community have been hard at work on a project called heat, which provides native HA for OpenStack virtual machines. Heat provides a template-based system with an API matching AWS CloudFormation semantics, specifically for OpenStack. In v5, instance healthchecking has been added. To get started on Fedora 16+ check out the getting started guide: https://github.com/heat-api/heat/blob/master/docs/GettingStarted.rst#readme or on Ubuntu Precise check out the devstack guide: https://github.com/heat-api/heat/wiki/Getting-Started-with-Heat-using-Master-on-Ubuntu An example template with instance HA features is here: https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_IHA.template An example template with application HA features that includes escalation is here: https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_HA.template Our website is here: http://www.heat-api.org The software can be downloaded from: https://github.com/heat-api/heat/downloads Enjoy -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [corosync] Different Corosync Rings for Different Nodes in Same Cluster?
On 07/02/2012 08:19 AM, Andrew Martin wrote: Hi Steve, Thanks for the clarification. Am I correct in understanding that in a complete network, corosync will automatically re-add nodes that drop out and reappear for any reason (e.g. maintenance, network connectivity loss, STONITH, etc.)? Apologies for the delay - was on PTO. That is correct. Regards -steve Thanks, Andrew From: Steven Dake sd...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Cc: disc...@corosync.org Sent: Friday, June 29, 2012 9:40:43 AM Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster? On 06/29/2012 01:42 AM, Dan Frincu wrote: Hi, On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin amar...@xes-inc.com wrote: Hi Dan, Thanks for the help. If I configure the network as I described - ring 0 as the network all 3 nodes are on, ring 1 as the network only 2 of the nodes are on, and using passive - and the ring 0 network goes down, corosync will start using ring 1. Does this mean that the quorum node will appear to be offline to the cluster? Will the cluster attempt to STONITH it? Once the ring 0 network is available again, will corosync transition back to using it as the communication ring, or will it continue to use ring 1 until it fails? The ideal behavior would be: when ring 0 fails, it communicates over ring 1 but keeps periodically checking whether ring 0 is working again; once it is, it returns to using ring 0. Is this possible? Added the corosync ML in CC as I think this is better asked there as well. Regards, Dan Thanks, Andrew From: Dan Frincu df.clus...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, June 27, 2012 3:42:42 AM Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster? Hi, On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin amar...@xes-inc.com wrote: Hello, I am setting up a 3-node cluster with Corosync + Pacemaker on Ubuntu 12.04 server. Two of the nodes are real nodes, while the 3rd is in standby mode as a quorum node. The two real nodes each have two NICs, one that is connected to a shared LAN and the other that is directly connected between the two nodes (for DRBD replication). The quorum node is only connected to the shared LAN. I would like to have multiple Corosync rings for redundancy, however I do not know if this would cause problems for the quorum node. Is it possible for me to configure the shared LAN as ring 0 (which all 3 nodes are connected to) and set the rrp_mode to passive so that it will use ring 0 unless there is a failure, but to also configure the direct link between the two real nodes as ring 1? In general I think you cannot do what you describe. Let me repeat it so it's clear:

A B C - Net #1
A B - Net #2

Where A and B are your cluster nodes, and C is your quorum node. You want Net #1 and Net #2 to serve as redundant rings. Since C is missing from Net #2, Net #2 will automatically be detected as faulty. The part about corosync automatically repairing nodes is correct; that would work (if you had a complete network). Regards -steve Short answer: yes. Longer answer: I have a setup with two nodes with two interfaces; one is connected via a switch to the other node and one is a back-to-back link for DRBD replication. In Corosync I have two rings, one that goes via the switch and one via the back-to-back link (rrp_mode: active). With rrp_mode: passive it should work the way you mentioned.
HTH, Dan Thanks, Andrew -- Dan Frincu CCNA, RHCE ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster?
On 06/29/2012 01:42 AM, Dan Frincu wrote: Hi, On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin amar...@xes-inc.com wrote: Hi Dan, Thanks for the help. If I configure the network as I described - ring 0 as the network all 3 nodes are on, ring 1 as the network only 2 of the nodes are on, and using passive - and the ring 0 network goes down, corosync will start using ring 1. Does this mean that the quorum node will appear to be offline to the cluster? Will the cluster attempt to STONITH it? Once the ring 0 network is available again, will corosync transition back to using it as the communication ring, or will it continue to use ring 1 until it fails? The ideal behavior would be: when ring 0 fails, it communicates over ring 1 but keeps periodically checking whether ring 0 is working again; once it is, it returns to using ring 0. Is this possible? Added the corosync ML in CC as I think this is better asked there as well. Regards, Dan Thanks, Andrew From: Dan Frincu df.clus...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, June 27, 2012 3:42:42 AM Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster? Hi, On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin amar...@xes-inc.com wrote: Hello, I am setting up a 3-node cluster with Corosync + Pacemaker on Ubuntu 12.04 server. Two of the nodes are real nodes, while the 3rd is in standby mode as a quorum node. The two real nodes each have two NICs, one that is connected to a shared LAN and the other that is directly connected between the two nodes (for DRBD replication). The quorum node is only connected to the shared LAN. I would like to have multiple Corosync rings for redundancy, however I do not know if this would cause problems for the quorum node. Is it possible for me to configure the shared LAN as ring 0 (which all 3 nodes are connected to) and set the rrp_mode to passive so that it will use ring 0 unless there is a failure, but to also configure the direct link between the two real nodes as ring 1? In general I think you cannot do what you describe. Let me repeat it so it's clear:

A B C - Net #1
A B - Net #2

Where A and B are your cluster nodes, and C is your quorum node. You want Net #1 and Net #2 to serve as redundant rings. Since C is missing from Net #2, Net #2 will automatically be detected as faulty. The part about corosync automatically repairing nodes is correct; that would work (if you had a complete network). Regards -steve Short answer: yes. Longer answer: I have a setup with two nodes with two interfaces; one is connected via a switch to the other node and one is a back-to-back link for DRBD replication. In Corosync I have two rings, one that goes via the switch and one via the back-to-back link (rrp_mode: active). With rrp_mode: passive it should work the way you mentioned.
HTH, Dan Thanks, Andrew -- Dan Frincu CCNA, RHCE ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
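For reference, a two-ring passive RRP setup along the lines discussed in this thread would look roughly like the following; the addresses are placeholders, not taken from either poster's environment:

        totem {
                version: 2
                rrp_mode: passive
                interface {
                        ringnumber: 0
                        bindnetaddr: 10.0.0.0        # shared LAN - all three nodes
                        mcastaddr: 239.255.1.1
                        mcastport: 5405
                }
                interface {
                        ringnumber: 1
                        bindnetaddr: 192.168.100.0   # back-to-back DRBD link - two nodes only
                        mcastaddr: 239.255.2.1
                        mcastport: 5407
                }
        }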
Re: [Pacemaker] [corosync] Unable to join cluster from a newly-installed centos 6.2 node
On 03/02/2012 05:29 PM, Diego Lima wrote: Hello, I've recently installed Corosync on two CentOS 6.2 machines. One is working fine but on the other machine I've been unable to connect to the cluster. On the logs I can see this whenever I start corosync+pacemaker: Mar 2 21:33:16 no2 corosync[15924]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service. Mar 2 21:33:16 no2 corosync[15924]: [MAIN ] Corosync built-in features: nss dbus rdma snmp Mar 2 21:33:16 no2 corosync[15924]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Mar 2 21:33:16 no2 corosync[15924]: [TOTEM ] Initializing transport (UDP/IP Multicast). Mar 2 21:33:16 no2 corosync[15924]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Mar 2 21:33:16 no2 corosync[15924]: [TOTEM ] The network interface [172.16.100.2] is now up. Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: process_ais_conf: Reading configure Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: config_find_init: Local handle: 4730966301143465987 for logging Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: config_find_next: Processing additional logging options... Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Found 'off' for option: debug Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Found 'no' for option: to_logfile Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Found 'daemon' for option: syslog_facility Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: config_find_init: Local handle: 7739444317642555396 for quorum Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: config_find_next: No additional configuration supplied for: quorum Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: No default for option: provider Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: config_find_init: Local handle: 5650605097994944517 for service Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: config_find_next: Processing additional service options... 
Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Found '0' for option: ver Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: pcmk_startup: CRM: Initialized Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] Logging: Initialized pcmk_startup Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615 Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: pcmk_startup: Service: 10 Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: pcmk_startup: Local hostname: no2.informidia.int Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: pcmk_update_nodeid: Local node id: 40112300 Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: update_member: Creating entry for node 40112300 born on 0 Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: update_member: 0x766520 Node 40112300 now known as no2.informidia.int (was: (null)) Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: update_member: Node no2.informidia.int now has 1 quorum votes (was 0) Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: update_member: Node 40112300/no2.informidia.int is now: member Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: spawn_child: Forked child 15930 for process stonith-ng Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: spawn_child: Forked child 15931 for process cib Mar 2 21:33:16 no2 corosync[15924]: [pcmk ] info: spawn_child: Forked child 15932 for process lrmd Mar 2 21:33:16 no2 lrmd: [15932]: info: G_main_add_SignalHandler: Added signal handler for signal 15 Mar 2 21:33:16 no2 stonith-ng: [15930]: info: Invoked: /usr/lib64/heartbeat/stonithd Mar 2 21:33:16 no2 stonith-ng: [15930]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root Mar 2 21:33:16 no2 stonith-ng: [15930]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Mar 2 21:33:16 no2 stonith-ng: [15930]: info: get_cluster_type: Cluster type is: 'openais' Mar 2 21:33:16 no2 stonith-ng: [15930]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin) Mar 2 21:33:16 no2 stonith-ng: [15930]: info: init_ais_connection_classic: Creating connection to our Corosync plugin Mar 2 21:33:16 no2 cib: [15931]: info: crm_log_init_worker: Changed active directory to
Re: [Pacemaker] need cluster-wide variables
On 12/21/2011 12:01 AM, Nirmala S wrote: Hi, This is a followup on an earlier thread (http://www.gossamer-threads.com/lists/linuxha/pacemaker/76705). My situation is somewhat similar. I need a cluster which contains 3 kinds of nodes – master, preferred slave, slave. The preferred slave is an entity that becomes the master in case of switchover/failover. The master is the master for the pref_slave, and the pref_slave is master for the other slaves. The master election is easy – it is done by crm; all I need to do is use crm_master. Re the subject, the cpg interface is perfect for maintaining replicated state among your cluster nodes. man cpg_overview. Regards -steve But for the preferred slave, there needs to be an election amongst the existing slaves. As of now I am using a variable in the CIB with pref_slave|pref_slave_score|temp_score. If temp_score is 0, then the slave will update pref_slave, pref_slave_score and temp_score. If temp_score is non-zero, then the node compares its score with pref_slave_score and updates only if its score is bigger. Now I have 2 problems: 1. Every time I change the CIB (which I am doing in pre-promote), the event (pre-promote) is retriggered. 2. The event (pre-promote) is sent in parallel to all the slaves, so each slave thinks temp_score is 0 and overwrites it with its own score. Is there any way to serialize this using some sort of lock? Or is there a provision to store cluster-wide attributes apart from the CIB? Regards Nirmala ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
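For anyone curious what the cpg approach looks like in practice, below is a minimal sketch of a client (error handling trimmed; link with -lcpg). Every member multicasts its score into the group and receives all members' updates in the same total order, which removes the parallel-overwrite race described above. The group name and payload here are made up for illustration:

        #include <stdio.h>
        #include <string.h>
        #include <sys/uio.h>
        #include <corosync/cpg.h>

        /* Invoked on every member, once per message multicast to the group. */
        static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                               uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
        {
                printf("update from node %u pid %u: %.*s\n",
                       nodeid, pid, (int)msg_len, (const char *)msg);
        }

        /* Invoked whenever membership changes (join/leave/failure). */
        static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                               const struct cpg_address *members, size_t n_members,
                               const struct cpg_address *left, size_t n_left,
                               const struct cpg_address *joined, size_t n_joined)
        {
                printf("membership change: %u members\n", (unsigned int)n_members);
        }

        int main(void)
        {
                cpg_callbacks_t callbacks = {
                        .cpg_deliver_fn = deliver_cb,
                        .cpg_confchg_fn = confchg_cb,
                };
                cpg_handle_t handle;
                struct cpg_name group;
                const char *update = "pref_slave_score=42";      /* made-up payload */
                struct iovec iov = { (void *)update, strlen(update) + 1 };

                if (cpg_initialize(&handle, &callbacks) != CS_OK)
                        return 1;
                strcpy(group.value, "pref-slave-election");      /* made-up group name */
                group.length = strlen(group.value);
                cpg_join(handle, &group);

                /* CPG_TYPE_AGREED = totally ordered delivery: every member sees
                 * all updates in the same order, so no two slaves can both
                 * conclude they won the election. */
                cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

                /* Blocks and invokes the callbacks; real code would integrate
                 * the cpg fd into its main loop instead. */
                cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
                return 0;
        }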
Re: [Pacemaker] Questions about reasonable cluster size...
On 10/20/2011 07:42 AM, Alan Robertson wrote: On 10/20/2011 03:11 AM, Proskurin Kirill wrote: On 10/20/2011 03:15 AM, Steven Dake wrote: On 10/19/2011 01:50 PM, Alan Robertson wrote: Hi, I have an application where having a 12-node cluster with about 250 resources would be desirable. Is this reasonable? Can Pacemaker+Corosync be expected to reliably handle a cluster of this size? If not, what is the current recommendation for the maximum number of nodes and resources? Steven Dake wrote: We regularly test 16 nodes. As far as resources go, Andrew could answer that. I start to have problems with 10+ nodes. It's heavily dependent on the corosync configuration, AFAIK. You should test it. This is somewhat different from Steven's comment. Exactly what things did you have in mind for the corosync configuration that could either help or hurt with larger clusters? Steven: Proskurin seems to think that there are some particular things to watch out for in the corosync configuration for larger clusters. Does anything come to mind for you about this? We do 16-node testing with token=10000 (10 seconds). The rest of the parameters autoconfigure. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] corosync mailing list address change
Sending one last reminder that the Corosync mailing list has changed homes from the Linux Foundation's servers. I have been unable to obtain the previous subscriber list, so please resubscribe. http://lists.corosync.org/mailman/listinfo The list is called discuss. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Questions about reasonable cluster size...
On 10/19/2011 01:50 PM, Alan Robertson wrote: Hi, I have an application where having a 12-node cluster with about 250 resources would be desirable. Is this reasonable? Can Pacemaker+Corosync be expected to reliably handle a cluster of this size? If not, what is the current recommendation for maximum number of nodes and resources? Many thanks! We regularly test 16 nodes. As far as resources go, Andrew could answer that. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Building a Corosync 1.4.1 RPM package for SLES11 SP1
On 08/31/2011 11:39 PM, Sebastian Kaps wrote: Hi, I'm trying to compile Corosync v1.4.1 from source[1] and create an RPM x86_64 package for SLES11 SP1. When running make rpm the build process complains about a broken dependency for the nss-devel package. The package is not installed on the system - mozilla-nss (non-devel), however, is. I'd be fine if I could just build the package without using the nss libs. I have no problem compiling Corosync using ./configure --disable-nss make, but I see no way for doing that with the make rpm command. Alternatively I'd compile everything --with-nss, but I can't install the mozilla-nss-devel package, because the version on the SLE11-SP1-SDK DVD is older than the installed mozilla-nss package (3.12.6-3.1.1 vs. 3.12.8-1.2.1) and creates a conflict when I try to install it. [1] ftp://corosync.org/downloads/corosync-1.4.1/corosync-1.4.1.tar.gz Thanks for pointing out this problem with the build tools for corosync. nss should be conditionalized. This would allow rpmbuild --with-nss or rpmbuild --without-nss from the default rpm builds. I would send a patch to the openais ml to resolve this problem but it is not operating at the moment, so I'll send one here for you to give a spin. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
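A sketch of what that conditionalization typically looks like in a spec file, using the standard RPM %bcond macros; the actual patch for corosync.spec may differ:

        # Build against nss by default; "rpmbuild --without nss" turns it off.
        %bcond_without nss

        %if %{with nss}
        BuildRequires: nss-devel
        %endif

        %build
        %if %{with nss}
        %configure --enable-nss
        %else
        %configure --disable-nss
        %endif

With %bcond_without, the feature is on in default builds, which matches the behavior the tarball's ./configure already provides.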
Re: [Pacemaker] Backup ring is marked faulty
On 08/04/2011 02:04 PM, Sebastian Kaps wrote: Hi Steven, On 04.08.2011, at 20:59, Steven Dake wrote: meaning the corosync community doesn't investigate redundant ring issues prior to corosync version 1.4.1. Sadly, we need to use the SLES version for support reasons. I'll try to convince them to supply us with a fix for this problem. In the meantime: would it be safe to leave the backup ring marked faulty the next time this happens? Would this result in a state that is effectively like having no second ring, or is there a chance that this might still affect the cluster's stability? If a ring is marked faulty, it is no longer operational and there is no longer a redundant network. To my knowledge, changing the ring configuration requires a complete restart of the cluster framework on all nodes, right? Yes, although fixing the retransmit list problem will not require a restart. Regards -steve I expect the root of your problem (the retransmit list problem) is already fixed in the repos and latest released versions. I'll try to get an update as soon as possible. Thanks a lot! ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Backup ring is marked faulty
On 08/03/2011 11:31 PM, Tegtmeier.Martin wrote: Hello again, in my case it is always the slower ring that fails (the 100MB network). Does rrp_mode passive expect both rings to have the same speed? Sebastian, can you confirm that in your environment the slower ring also fails? Thanks, -Martin Martin, I have never tested faster+slower networks in redundant ring configs. We just recently added support for this feature in the corosync project, meaning we can start to tackle some of these issues going forward. The protocol is designed to limit itself to the speed of the slowest ring - perhaps this is not working as intended. Regards -steve -Original Message- From: Tegtmeier.Martin [mailto:martin.tegtme...@realtech.com] Sent: Wednesday, 3 August 2011 11:03 To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Backup ring is marked faulty Hello, we have exactly the same issue! Same version of corosync (1.3.1), also running on SuSE Linux Enterprise Server 11 SP1 with HAE. Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 6a Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 63 Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 60 Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 6d Aug 01 15:45:18 corosync [TOTEM ] Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - administrative intervention required.

rksaph06:/var/log/cluster # corosync-cfgtool -s
Printing ring status.
Local node ID 101717164
RING ID 0
        id = 172.20.16.6
        status = ring 0 active with no faults
RING ID 1
        id = 10.2.2.6
        status = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - administrative intervention required.

rrp_mode is set to passive. Ring 0 (172.20.16.0) supports 1 GBit and ring 1 (10.2.2.0) supports 100 MBit. There was no other network traffic on ring 1 - only corosync (!) After re-activating both rings with corosync-cfgtool -r, the problem is reproducible by simply connecting a crm_gui and hitting refresh inside the GUI 3-5 times. After that, ring 1 (10.2.2.0) will be marked as faulty again. Thanks and best regards, -Martin Tegtmeier -Original Message- From: Sebastian Kaps [mailto:sebastian.k...@imail.de] Sent: Wed 03.08.2011 08:53 To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Backup ring is marked faulty Hi Steven! On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote: Which version of corosync? # corosync -v Corosync Cluster Engine, version '1.3.1' Copyright (c) 2006-2009 Red Hat, Inc. It's the version that comes with SLES11-SP1-HA. -- Sebastian ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...
On 08/04/2011 05:46 AM, Sebastian Kaps wrote: Hello, here's another problem we're having: Jul 31 03:51:02 node01 corosync[5870]: [TOTEM ] Process pause detected for 11149 ms, flushing membership messages. This process pause message indicates the scheduler doesn't schedule corosync for 11 seconds, which is greater than the failure detection timeouts. What does your config file look like? What load are you running? Regards -steve Jul 31 03:51:11 node01 corosync[5870]: [CLM ] CLM CONFIGURATION CHANGE Jul 31 03:51:11 node01 corosync[5870]: [CLM ] New Configuration: Jul 31 03:51:11 node01 corosync[5870]: [CLM ] r(0) ip(192.168.1.1) r(1) ip(x.y.z.3) Jul 31 03:51:11 node01 corosync[5870]: [CLM ] Members Left: Jul 31 03:51:11 node01 corosync[5870]: [CLM ] r(0) ip(192.168.1.2) r(1) ip(x.y.z.1) Jul 31 03:51:11 node01 corosync[5870]: [CLM ] Members Joined: Jul 31 03:51:11 node01 corosync[5870]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 9708: memb=1, new=0, lost=1 Jul 31 03:51:11 node01 corosync[5870]: [pcmk ] info: pcmk_peer_update: memb: node01 16885952 Jul 31 03:51:11 node01 corosync[5870]: [pcmk ] info: pcmk_peer_update: lost: node02 33663168 Jul 31 03:51:11 node01 corosync[5870]: [CLM ] CLM CONFIGURATION CHANGE Jul 31 03:51:11 node01 corosync[5870]: [CLM ] New Configuration: Jul 31 03:51:11 node01 corosync[5870]: [CLM ] r(0) ip(192.168.1.1) r(1) ip(x.y.z.3) Jul 31 03:51:11 node01 corosync[5870]: [CLM ] Members Left: Jul 31 03:51:11 node01 corosync[5870]: [CLM ] Members Joined: Jul 31 03:51:11 node01 crmd: [5912]: notice: ais_dispatch_message: Membership 9708: quorum lost Node01 gets STONITHed shortly after that. There is no indication whatsoever in the logs that this would happen. For at least half an hour before that there's only the normal status-message noise from monitor ops etc. Jul 31 03:51:01 node02 corosync[5810]: [TOTEM ] A processor failed, forming new configuration. Jul 31 03:51:11 node02 corosync[5810]: [CLM ] CLM CONFIGURATION CHANGE Jul 31 03:51:11 node02 corosync[5810]: [CLM ] New Configuration: Jul 31 03:51:11 node02 corosync[5810]: [CLM ] r(0) ip(192.168.1.2) r(1) ip(x.y.z.1) Jul 31 03:51:11 node02 corosync[5810]: [CLM ] Members Left: Jul 31 03:51:11 node02 corosync[5810]: [CLM ] r(0) ip(192.168.1.1) r(1) ip(x.y.z.3) Jul 31 03:51:11 node02 corosync[5810]: [CLM ] Members Joined: Jul 31 03:51:11 node02 corosync[5810]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 9708: memb=1, new=0, lost=1 Jul 31 03:51:11 node02 corosync[5810]: [pcmk ] info: pcmk_peer_update: memb: node02 33663168 Jul 31 03:51:11 node02 corosync[5810]: [pcmk ] info: pcmk_peer_update: lost: node01 16885952 Jul 31 03:51:11 node02 corosync[5810]: [CLM ] CLM CONFIGURATION CHANGE Jul 31 03:51:11 node02 corosync[5810]: [CLM ] New Configuration: Jul 31 03:51:11 node02 corosync[5810]: [CLM ] r(0) ip(192.168.1.2) r(1) ip(x.y.z.1) Jul 31 03:51:11 node02 corosync[5810]: [CLM ] Members Left: Jul 31 03:51:11 node02 corosync[5810]: [CLM ] Members Joined: What does "Process pause detected" mean? Quoting from my other recent post regarding the backup ring being marked faulty sporadically: |We're running a two-node cluster with redundant rings. |Ring 0 is a 10 GB direct connection; ring 1 consists of two 1GB interfaces that are bonded in |active-backup mode and routed through two independent switches for each node. The ring 1 network |is our normal 1G LAN and should only be used in case the direct 10G connection should fail.
| |Corosync Cluster Engine, version '1.3.1' |Copyright (c) 2006-2009 Red Hat, Inc. | |It's the version that comes with SLES11-SP1-HA. Thanks in advance! ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST
On 08/03/2011 06:39 PM, Bob Schatz wrote: Steven, Are you planning on recording/taping it if I want to watch it later? Thanks, Bob Bob, Yes, I will record it if I can beat elluminate into submission. Regards -steve From: Steven Dake sd...@redhat.com To: pcmk-cl...@oss.clusterlabs.org Cc: aeolus-de...@lists.fedorahosted.org; Fedora Cloud SIG cl...@lists.fedoraproject.org; open...@lists.linux-foundation.org; The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, August 3, 2011 9:42 AM Subject: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST Extending a general invitation to the high availability communities and other cloud community contributors to participate in a live demo I am giving on Friday August 5th at 8am PST (GMT-7). The demo portion of the session is 15 minutes and will be provided first, followed by more details of our approach to high availability. I will use elluminate to show the demo on my desktop machine. To make elluminate work, you will need icedtea-web installed on your system, which is not typically installed by default. You will also need a conference # and bridge code. Please contact me off-list with your location and I'll provide you with a hopefully toll-free conference # and bridge code. Elluminate link: https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F Bridge Code: Please contact me off-list with your location and I'll respond back with dial-in information. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Backup ring is marked faulty
On 08/02/2011 11:53 PM, Sebastian Kaps wrote: Hi Steven! On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote: Which version of corosync? # corosync -v Corosync Cluster Engine, version '1.3.1' Copyright (c) 2006-2009 Red Hat, Inc. It's the version that comes with SLES11-SP1-HA. Redundant ring is only supported upstream in corosync 1.4.1 or later. The retransmit list message issues you are having are fixed in corosync 1.3.3 and later. This is what is triggering the redundant ring faulty error. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Backup ring is marked faulty
On 08/04/2011 11:43 AM, Sebastian Kaps wrote: Hi Steven, On 04.08.2011, at 18:27, Steven Dake wrote: redundant ring is only supported upstream in corosync 1.4.1 or later. What does "supported" mean in this context, exactly? Meaning the corosync community doesn't investigate redundant ring issues prior to corosync version 1.4.1. I expect the root of your problem (the retransmit list problem) is already fixed in the repos and latest released versions. Regards -steve I'm asking because we've been having serious issues with these systems since they went into production (the testing phase did not show any problems, but we also couldn't use real workloads then). Since the cluster went into production, we're having issues with seemingly random STONITH events that seem to be related to high I/O load on a DRBD-mirrored OCFS2 volume - but I don't see any pattern yet. We've had these machines running for nearly two weeks without major problems and suddenly they went back to killing each other :-( The retransmit list message issues you are having are fixed in corosync 1.3.3 and later. This is what is triggering the redundant ring faulty error. Could it also cause the instability problems we're seeing? Thanks again for helping! yes ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST
Extending a general invitation to the high availability communities and other cloud community contributors to participate in a live demo I am giving on Friday August 5th at 8am PST (GMT-7). The demo portion of the session is 15 minutes and will be provided first, followed by more details of our approach to high availability. I will use elluminate to show the demo on my desktop machine. To make elluminate work, you will need icedtea-web installed on your system, which is not typically installed by default. You will also need a conference # and bridge code. Please contact me off-list with your location and I'll provide you with a hopefully toll-free conference # and bridge code. Elluminate link: https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F Bridge Code: Please contact me off-list with your location and I'll respond back with dial-in information. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Backup ring is marked faulty
Which version of corosync? On 08/02/2011 07:35 AM, Sebastian Kaps wrote: Hi, we're running a two-node cluster with redundant rings. Ring 0 is a 10 GB direct connection; ring 1 consists of two 1 GB interfaces that are bonded in active-backup mode and routed through two independent switches for each node. The ring 1 network is our normal 1G LAN and should only be used in case the direct 10G connection fails. I often (once a day on average, I'd guess) see that ring 1 (and only that one) is marked as FAULTY without any obvious reason. Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84 Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c82 Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required. Whenever I see this, I check that the other node's address can be pinged (I never saw any connectivity problems there), then reenable the ring with corosync-cfgtool -r, and everything looks OK for a while (i.e. hours or days). How could I find out why this happens? What do these Retransmit List or seqid (sequence id, I assume?) values tell me? Is it safe to reenable the second ring when the partner node can be pinged successfully? The totem section of our config looks like this:

totem {
        rrp_mode: passive
        join: 60
        max_messages: 20
        vsftype: none
        consensus: 1
        secauth: on
        token_retransmits_before_loss_const: 10
        threads: 16
        token: 1
        version: 2
        interface {
                bindnetaddr: 192.168.1.0
                mcastaddr: 239.250.1.1
                mcastport: 5405
                ringnumber: 0
        }
        interface {
                bindnetaddr: x.y.z.0
                mcastaddr: 239.250.1.2
                mcastport: 5415
                ringnumber: 1
        }
        clear_node_high_bit: yes
}

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
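For anyone following along: ring state can be inspected, and a ring marked FAULTY re-enabled at runtime without restarting the cluster, using corosync-cfgtool (both invocations appear elsewhere in this thread):

        # corosync-cfgtool -s    (print the status of the local node's rings)
        # corosync-cfgtool -r    (re-enable redundant ring operation after a ring is marked FAULTY)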
[Pacemaker] Announcing Pacemaker Cloud 0.4.1 - Available now for download!
Angus and I announced a project to apply high availability best known practice to the field of cloud computing in late March 2011. We reuse the policy engine of Pacemaker. Our first tarball is available today containing a functional prototype demonstrating these best known practices. Today the software supports a deployable/assembly model. Assemblies represent a virtual machine and deployables represent a collection of virtual machines. Resources within a virtual machine can be monitored for failure and recovered. Assemblies and deployables are also monitored for failure and recovered. Currently the significant limitation with the software is that it operates single node. As a result it is not suitable for deployment today. We plan to address this in the future by integrating with other cloud infrastructure systems such as Aeolus (developer ml on CC list). The software will be available in Fedora 16 for all to evaluate that run Fedora. Your feedback is greatly appreciated. To provide feedback, join the mailing list: http://oss.clusterlabs.org/mailman/listinfo/pcmk-cloud/ If you have interest in developing for cloud environments around the topic of high availability, please feel free to download our git repo and submit patches. We also are interested in user feedback! To get the software, check out: http://pacemaker-cloud.org/ ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist
On 07/22/2011 01:15 AM, Proskurin Kirill wrote: Hello all. pacemaker-1.1.5 corosync-1.4.0 4 nodes in the cluster: 3 online, 1 not. In the logs: Jul 22 11:50:23 my106.example.com crmd: [28030]: info: pcmk_quorum_notification: Membership 0: quorum retained (0) Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started: Delaying start, no membership data (0010) Jul 22 11:50:23 my106.example.com crmd: [28030]: info: config_query_callback: Shutdown escalation occurs after: 120ms Jul 22 11:50:23 my106.example.com crmd: [28030]: info: config_query_callback: Checking for expired actions every 90ms Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started: Delaying start, no membership data (0010) Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect: Connected to the CIB after 1 signon attempts Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect: Sending full refresh Jul 22 11:52:18 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jul 22 11:52:18 corosync [CPG ] chosen downlist: sender r(0) ip(10.3.1.107) ; members(old:4 left:1) Jul 22 11:52:18 corosync [MAIN ] Completed service synchronization, ready to provide service. Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist DC: Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee Jul 22 11:50:07 my107.example.com pacemakerd: [22388]: info: update_node_processes: Node my106.example.com now has process list: 0002 (was 00 12) Jul 22 11:50:07 my107.example.com attrd: [22397]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00 02 (new) Jul 22 11:50:07 my107.example.com cib: [22395]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new) Jul 22 11:50:07 my107.example.com stonith-ng: [22394]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0 002 (new) Jul 22 11:50:07 my107.example.com crmd: [22399]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=000 2 (new) Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee Is this a problem? Does your retransmit list continually display e4 e5 etc. for the rest of the cluster lifetime, or is this short-lived? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Openais] Linux HA on debian sparc
On 06/07/2011 04:44 AM, william felipe_welter wrote: Two more questions: Will the patch for the mmap calls be in the mainline development branch for all archs? Is there any problem if I send these patches to the Debian project? These patches will go into the maintenance branches. You can send them to whoever you like ;) Regards -steve 2011/6/3 Steven Dake sd...@redhat.com: On 06/02/2011 08:16 PM, william felipe_welter wrote: Well, now with this patch the pacemakerd process starts and brings up its other processes (crmd, lrmd, pengine), but after pacemakerd does a fork, the forked pacemakerd process dies due to signal 10, Bus error.. And in the log, the pacemaker processes (crmd, lrmd, pengine) can't connect to the openais plugin (possibly because of the death of the pacemakerd process). But this time when the forked pacemakerd dies, it generates a coredump. gdb -c /usr/var/lib/heartbeat/cores/root/ pacemakerd 7986 -se /usr/sbin/pacemakerd : GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as sparc-linux-gnu. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /usr/sbin/pacemakerd...done. Reading symbols from /usr/lib64/libuuid.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libuuid.so.1 Reading symbols from /usr/lib/libcoroipcc.so.4...done. Loaded symbols for /usr/lib/libcoroipcc.so.4 Reading symbols from /usr/lib/libcpg.so.4...done. Loaded symbols for /usr/lib/libcpg.so.4 Reading symbols from /usr/lib/libquorum.so.4...done. Loaded symbols for /usr/lib/libquorum.so.4 Reading symbols from /usr/lib64/libcrmcommon.so.2...done. Loaded symbols for /usr/lib64/libcrmcommon.so.2 Reading symbols from /usr/lib/libcfg.so.4...done. Loaded symbols for /usr/lib/libcfg.so.4 Reading symbols from /usr/lib/libconfdb.so.4...done. Loaded symbols for /usr/lib/libconfdb.so.4 Reading symbols from /usr/lib64/libplumb.so.2...done. Loaded symbols for /usr/lib64/libplumb.so.2 Reading symbols from /usr/lib64/libpils.so.2...done. Loaded symbols for /usr/lib64/libpils.so.2 Reading symbols from /lib/libbz2.so.1.0...(no debugging symbols found)...done. Loaded symbols for /lib/libbz2.so.1.0 Reading symbols from /usr/lib/libxslt.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libxslt.so.1 Reading symbols from /usr/lib/libxml2.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libxml2.so.2 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/librt.so.1 Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/libglib-2.0.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/libglib-2.0.so.0 Reading symbols from /usr/lib/libltdl.so.7...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libltdl.so.7 Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/ld-linux.so.2 Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0 Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/libm.so.6 Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libz.so.1 Reading symbols from /lib/libpcre.so.3...(no debugging symbols found)...done. Loaded symbols for /lib/libpcre.so.3 Reading symbols from /lib/libnss_compat.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libnss_compat.so.2 Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libnss_nis.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libnss_nis.so.2 Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libnss_files.so.2 Core was generated by `pacemakerd'. Program terminated with signal 10, Bus error. #0 cpg_dispatch (handle=17861288972693536769, dispatch_types=7986) at cpg.c:339 339 switch (dispatch_data->id) { (gdb) bt #0 cpg_dispatch (handle=17861288972693536769, dispatch_types=7986) at cpg.c:339 #1 0xf6f100f0 in ?? () #2 0xf6f100f4 in ?? () Backtrace stopped: previous frame identical to this frame (corrupt stack?) I took a look at cpg.c and saw that the dispatch_data was acquired by coroipcc_dispatch_get
[Pacemaker] Updated pacemaker-cloud.org website
Hi, I want to spend a moment to tell you about our new website at http://pacemaker-cloud.org. This website will serve as our information store and tarball repo location for the Pacemaker-Cloud project. The features page contains the feature set we plan to deliver. Please have a look and forward any questions or comments to: pcmk-cl...@oss.clusterlabs.org. A big thanks to Adam Stokes who worked on the Matahari website design. We used his design as our inspiration for most of our website. Also a thanks to Angus Salkeld for contributing to moving our hosting to github. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Openais] Linux HA on debian sparc
] exit_fn for conn=0x62500 Jun 02 23:12:21 corosync [TOTEM ] mcasted message added to pending queue Jun 02 23:12:21 corosync [TOTEM ] Delivering 15 to 16 Jun 02 23:12:21 corosync [TOTEM ] Delivering MCAST message with seq 16 to pending delivery queue Jun 02 23:12:21 corosync [CPG ] got procleave message from cluster node 1377289226 Jun 02 23:12:21 corosync [TOTEM ] releasing messages up to and including 16 Jun 02 23:12:21 xx attrd: [7992]: info: Invoked: /usr/lib64/heartbeat/attrd Jun 02 23:12:21 xx attrd: [7992]: info: crm_log_init_worker: Changed active directory to /usr/var/lib/heartbeat/cores/hacluster Jun 02 23:12:21 xx attrd: [7992]: info: main: Starting up Jun 02 23:12:21 xx attrd: [7992]: info: get_cluster_type: Cluster type is: 'openais'. Jun 02 23:12:21 xx attrd: [7992]: info: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin) Jun 02 23:12:21 xx attrd: [7992]: info: init_ais_connection_classic: Creating connection to our Corosync plugin Jun 02 23:12:21 xx attrd: [7992]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: Doesn't exist (12) Jun 02 23:12:21 xx attrd: [7992]: ERROR: main: HA Signon failed Jun 02 23:12:21 xx attrd: [7992]: info: main: Cluster connection active Jun 02 23:12:21 xx attrd: [7992]: info: main: Accepting attribute updates Jun 02 23:12:21 xx attrd: [7992]: ERROR: main: Aborting startup Jun 02 23:12:21 xx crmd: [7994]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw Jun 02 23:12:21 xx crmd: [7994]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_rw Jun 02 23:12:21 xx crmd: [7994]: debug: cib_native_signon_raw: Connection to command channel failed Jun 02 23:12:21 xx crmd: [7994]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_callback ... 2011/6/2 Steven Dake sd...@redhat.com: On 06/01/2011 11:05 PM, william felipe_welter wrote: I recompile my kernel without hugetlb .. and the result are the same.. My test program still resulting: PATH=/dev/shm/teste123XX page size=2 fd=3 ADDR_ORIG:0xe000a000 ADDR:0x Erro And Pacemaker still resulting because the mmap error: Could not initialize Cluster Configuration Database API instance error 2 Give the patch I posted recently a spin - corosync WFM with this patch on sparc64 with hugetlb set. Please report back results. Regards -steve For make sure that i have disable the hugetlb there is my /proc/meminfo: MemTotal: 33093488 kB MemFree:32855616 kB Buffers:5600 kB Cached:53480 kB SwapCached:0 kB Active:45768 kB Inactive: 28104 kB Active(anon): 18024 kB Inactive(anon): 1560 kB Active(file): 27744 kB Inactive(file):26544 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 6104680 kB SwapFree:6104680 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 14936 kB Mapped: 7736 kB Shmem: 4624 kB Slab: 39184 kB SReclaimable: 10088 kB SUnreclaim:29096 kB KernelStack:7088 kB PageTables: 1160 kB Quicklists:17664 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:22651424 kB Committed_AS: 519368 kB VmallocTotal: 1069547520 kB VmallocUsed: 11064 kB VmallocChunk: 1069529616 kB 2011/6/1 Steven Dake sd...@redhat.com: On 06/01/2011 07:42 AM, william felipe_welter wrote: Steven, cat /proc/meminfo ... HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 4096 kB ... It definitely requires a kernel compile and setting the config option to off. I don't know the debian way of doing this. 
The only reason you may need this option is if you have very large memory sizes, such as 48GB or more. Regards -steve Its 4MB.. How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to kernel ?) 2011/6/1 Steven Dake sd...@redhat.com mailto:sd...@redhat.com On 06/01/2011 01:05 AM, Steven Dake wrote: On 05/31/2011 09:44 PM, Angus Salkeld wrote: On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote: Angus, I make some test program (based on the code coreipcc.c) and i now i sure that are problems with the mmap systems call on sparc.. Source code of my test program: #include stdlib.h #include sys/mman.h #include stdio.h #define PATH_MAX 36 int main() { int32_t fd; void *addr_orig; void *addr; char path[PATH_MAX
Re: [Pacemaker] [Openais] Linux HA on debian sparc
On 06/01/2011 11:05 PM, william felipe_welter wrote: I recompile my kernel without hugetlb .. and the result are the same.. My test program still resulting: PATH=/dev/shm/teste123XX page size=2 fd=3 ADDR_ORIG:0xe000a000 ADDR:0x Erro And Pacemaker still resulting because the mmap error: Could not initialize Cluster Configuration Database API instance error 2 Give the patch I posted recently a spin - corosync WFM with this patch on sparc64 with hugetlb set. Please report back results. Regards -steve For make sure that i have disable the hugetlb there is my /proc/meminfo: MemTotal: 33093488 kB MemFree:32855616 kB Buffers:5600 kB Cached:53480 kB SwapCached:0 kB Active:45768 kB Inactive: 28104 kB Active(anon): 18024 kB Inactive(anon): 1560 kB Active(file): 27744 kB Inactive(file):26544 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 6104680 kB SwapFree:6104680 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 14936 kB Mapped: 7736 kB Shmem: 4624 kB Slab: 39184 kB SReclaimable: 10088 kB SUnreclaim:29096 kB KernelStack:7088 kB PageTables: 1160 kB Quicklists:17664 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:22651424 kB Committed_AS: 519368 kB VmallocTotal: 1069547520 kB VmallocUsed: 11064 kB VmallocChunk: 1069529616 kB 2011/6/1 Steven Dake sd...@redhat.com: On 06/01/2011 07:42 AM, william felipe_welter wrote: Steven, cat /proc/meminfo ... HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 4096 kB ... It definitely requires a kernel compile and setting the config option to off. I don't know the debian way of doing this. The only reason you may need this option is if you have very large memory sizes, such as 48GB or more. Regards -steve Its 4MB.. How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to kernel ?) 2011/6/1 Steven Dake sd...@redhat.com mailto:sd...@redhat.com On 06/01/2011 01:05 AM, Steven Dake wrote: On 05/31/2011 09:44 PM, Angus Salkeld wrote: On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote: Angus, I make some test program (based on the code coreipcc.c) and i now i sure that are problems with the mmap systems call on sparc.. Source code of my test program: #include stdlib.h #include sys/mman.h #include stdio.h #define PATH_MAX 36 int main() { int32_t fd; void *addr_orig; void *addr; char path[PATH_MAX]; const char *file = teste123XX; size_t bytes=10024; snprintf (path, PATH_MAX, /dev/shm/%s, file); printf(PATH=%s\n,path); fd = mkstemp (path); printf(fd=%d \n,fd); addr_orig = mmap (NULL, bytes, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0); printf(ADDR_ORIG:%p ADDR:%p\n,addr_orig,addr); if (addr != addr_orig) { printf(Erro); } } Results on x86: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0x7f867d8e6000 ADDR:0x7f867d8e6000 Results on sparc: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0xf7f72000 ADDR:0x Note: 0x == MAP_FAILED (from man mmap) RETURN VALUE On success, mmap() returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set appropriately. But im wondering if is really needed to call mmap 2 times ? What are the reason to call the mmap 2 times, on the second time using the address of the first? 
Well there are 3 calls to mmap() 1) one to allocate 2 * what you need (in pages) 2) maps the first half of the mem to a real file 3) maps the second half of the mem to the same file The point is when you write to an address over the end of the first half of memory it is taken care of by the third mmap which maps the address back to the top of the file for you. This means you don't have to worry about ringbuffer wrapping which can be a headache. -Angus Interesting - this mmap operation doesn't work on sparc linux. Not sure how I can help here - next step would be a follow up with the sparc linux mailing list.
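Since this trick keeps coming up in the thread, here is a minimal, self-contained sketch of the double-mapping pattern Angus describes - not the actual coreipcc.c code; the file name and one-page ring size are arbitrary choices for illustration:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long bytes = sysconf(_SC_PAGESIZE);   /* ring size: one page */
        char path[] = "/dev/shm/ringXXXXXX";
        int fd = mkstemp(path);
        if (fd < 0 || ftruncate(fd, bytes) != 0)
            return 1;
        unlink(path);                          /* fd keeps the file alive */

        /* 1) reserve 2 * bytes of contiguous address space */
        char *base = mmap(NULL, bytes * 2, PROT_NONE,
                          MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (base == MAP_FAILED) { perror("reserve"); return 1; }

        /* 2) map the file over the first half */
        void *lo = mmap(base, bytes, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);
        /* 3) map the same file over the second half */
        void *hi = mmap(base + bytes, bytes, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);
        if (lo == MAP_FAILED || hi == MAP_FAILED) { perror("map"); return 1; }

        /* a write that runs past the first half wraps to the start */
        strcpy(base + bytes - 3, "wrap");
        printf("byte 0 of the ring holds: %c\n", base[0]);  /* prints 'p' */
        return 0;
    }

On x86 the final line prints the wrapped byte; per the reports in this thread, on sparc it is the third mmap() that returns MAP_FAILED.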
Re: [Pacemaker] [Openais] Linux HA on debian sparc
On 05/31/2011 09:44 PM, Angus Salkeld wrote: On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote: Angus, I made a test program (based on the code in coreipcc.c) and now I am sure there are problems with the mmap system call on sparc. Source code of my test program: #include <stdint.h> #include <stdlib.h> #include <sys/mman.h> #include <stdio.h> #define PATH_MAX 36 int main() { int32_t fd; void *addr_orig; void *addr; char path[PATH_MAX]; const char *file = "teste123XX"; size_t bytes=10024; snprintf (path, PATH_MAX, "/dev/shm/%s", file); printf("PATH=%s\n",path); fd = mkstemp (path); printf("fd=%d \n",fd); addr_orig = mmap (NULL, bytes, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0); printf("ADDR_ORIG:%p ADDR:%p\n",addr_orig,addr); if (addr != addr_orig) { printf("Erro"); } } Results on x86: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0x7f867d8e6000 ADDR:0x7f867d8e6000 Results on sparc: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0xf7f72000 ADDR:0xffffffff Note: 0xffffffff == MAP_FAILED (from man mmap) RETURN VALUE On success, mmap() returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set appropriately. But I am wondering, is it really needed to call mmap 2 times? What is the reason to call mmap 2 times, the second time using the address of the first? Well there are 3 calls to mmap() 1) one to allocate 2 * what you need (in pages) 2) maps the first half of the mem to a real file 3) maps the second half of the mem to the same file The point is when you write to an address over the end of the first half of memory it is taken care of by the third mmap which maps the address back to the top of the file for you. This means you don't have to worry about ringbuffer wrapping which can be a headache. -Angus Interesting - this mmap operation doesn't work on sparc linux. Not sure how I can help here - next step would be a follow up with the sparc linux mailing list. I'll do that and cc you on the message - see if we get any response. http://vger.kernel.org/vger-lists.html 2011/5/31 Angus Salkeld asalk...@redhat.com On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote: Thanks Steven, Now I am trying to run on the MCP: - Uninstall pacemaker 1.0 - Compile and install 1.1 But now I have problems initializing pacemakerd: Could not initialize Cluster Configuration Database API instance error 2 Debugging with gdb I see that the error is in the confdb.. more specifically the errors start in coreipcc.c at line: 448 if (addr != addr_orig) { 449 goto error_close_unlink; <- enter here 450 } Any idea about what can cause this? I tried porting a ringbuffer (www.libqb.org) to sparc and had the same failure. There are 3 mmap() calls and on sparc the third one keeps failing. This is a common way of creating a ring buffer, see: http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation I couldn't get it working in the short time I tried. It's probably worth looking at the clib implementation to see why it's failing (I didn't get to that).
-Angus -- William Felipe Welter -- Consultor em Tecnologias Livres william.wel...@4linux.com.br www.4linux.com.br ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Openais] Linux HA on debian sparc
On 06/01/2011 01:05 AM, Steven Dake wrote: On 05/31/2011 09:44 PM, Angus Salkeld wrote: On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote: Angus, I make some test program (based on the code coreipcc.c) and i now i sure that are problems with the mmap systems call on sparc.. Source code of my test program: #include stdlib.h #include sys/mman.h #include stdio.h #define PATH_MAX 36 int main() { int32_t fd; void *addr_orig; void *addr; char path[PATH_MAX]; const char *file = teste123XX; size_t bytes=10024; snprintf (path, PATH_MAX, /dev/shm/%s, file); printf(PATH=%s\n,path); fd = mkstemp (path); printf(fd=%d \n,fd); addr_orig = mmap (NULL, bytes, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0); printf(ADDR_ORIG:%p ADDR:%p\n,addr_orig,addr); if (addr != addr_orig) { printf(Erro); } } Results on x86: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0x7f867d8e6000 ADDR:0x7f867d8e6000 Results on sparc: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0xf7f72000 ADDR:0x Note: 0x == MAP_FAILED (from man mmap) RETURN VALUE On success, mmap() returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set appropriately. But im wondering if is really needed to call mmap 2 times ? What are the reason to call the mmap 2 times, on the second time using the address of the first? Well there are 3 calls to mmap() 1) one to allocate 2 * what you need (in pages) 2) maps the first half of the mem to a real file 3) maps the second half of the mem to the same file The point is when you write to an address over the end of the first half of memory it is taken care of the the third mmap which maps the address back to the top of the file for you. This means you don't have to worry about ringbuffer wrapping which can be a headache. -Angus interesting this mmap operation doesn't work on sparc linux. Not sure how I can help here - Next step would be a follow up with the sparc linux mailing list. I'll do that and cc you on the message - see if we get any response. http://vger.kernel.org/vger-lists.html 2011/5/31 Angus Salkeld asalk...@redhat.com On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote: Thanks Steven, Now im try to run on the MCP: - Uninstall the pacemaker 1.0 - Compile and install 1.1 But now i have problems to initialize the pacemakerd: Could not initialize Cluster Configuration Database API instance error 2 Debbuging with gdb i see that the error are on the confdb.. most specificaly the errors start on coreipcc.c at line: 448if (addr != addr_orig) { 449goto error_close_unlink; - enter here 450 } Some ideia about what can cause this ? I tried porting a ringbuffer (www.libqb.org) to sparc and had the same failure. There are 3 mmap() calls and on sparc the third one keeps failing. This is a common way of creating a ring buffer, see: http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation I couldn't get it working in the short time I tried. It's probably worth looking at the clib implementation to see why it's failing (I didn't get to that). -Angus Note, we sorted this out we believe. Your kernel has hugetlb enabled, probably with 4MB pages. This requires corosync to allocate 4MB pages. Can you verify your hugetlb settings? If you can turn this option off, you should have atleast a working corosync. 
Regards -steve -- William Felipe Welter -- Consultor em Tecnologias Livres william.wel...@4linux.com.br www.4linux.com.br ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Openais] Linux HA on debian sparc
On 06/01/2011 07:42 AM, william felipe_welter wrote: Steven, cat /proc/meminfo ... HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 4096 kB ... It definitely requires a kernel compile and setting the config option to off. I don't know the debian way of doing this. The only reason you may need this option is if you have very large memory sizes, such as 48GB or more. Regards -steve Its 4MB.. How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to kernel ?) 2011/6/1 Steven Dake sd...@redhat.com mailto:sd...@redhat.com On 06/01/2011 01:05 AM, Steven Dake wrote: On 05/31/2011 09:44 PM, Angus Salkeld wrote: On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote: Angus, I make some test program (based on the code coreipcc.c) and i now i sure that are problems with the mmap systems call on sparc.. Source code of my test program: #include stdlib.h #include sys/mman.h #include stdio.h #define PATH_MAX 36 int main() { int32_t fd; void *addr_orig; void *addr; char path[PATH_MAX]; const char *file = teste123XX; size_t bytes=10024; snprintf (path, PATH_MAX, /dev/shm/%s, file); printf(PATH=%s\n,path); fd = mkstemp (path); printf(fd=%d \n,fd); addr_orig = mmap (NULL, bytes, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0); printf(ADDR_ORIG:%p ADDR:%p\n,addr_orig,addr); if (addr != addr_orig) { printf(Erro); } } Results on x86: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0x7f867d8e6000 ADDR:0x7f867d8e6000 Results on sparc: PATH=/dev/shm/teste123XX fd=3 ADDR_ORIG:0xf7f72000 ADDR:0x Note: 0x == MAP_FAILED (from man mmap) RETURN VALUE On success, mmap() returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set appropriately. But im wondering if is really needed to call mmap 2 times ? What are the reason to call the mmap 2 times, on the second time using the address of the first? Well there are 3 calls to mmap() 1) one to allocate 2 * what you need (in pages) 2) maps the first half of the mem to a real file 3) maps the second half of the mem to the same file The point is when you write to an address over the end of the first half of memory it is taken care of the the third mmap which maps the address back to the top of the file for you. This means you don't have to worry about ringbuffer wrapping which can be a headache. -Angus interesting this mmap operation doesn't work on sparc linux. Not sure how I can help here - Next step would be a follow up with the sparc linux mailing list. I'll do that and cc you on the message - see if we get any response. http://vger.kernel.org/vger-lists.html 2011/5/31 Angus Salkeld asalk...@redhat.com mailto:asalk...@redhat.com On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote: Thanks Steven, Now im try to run on the MCP: - Uninstall the pacemaker 1.0 - Compile and install 1.1 But now i have problems to initialize the pacemakerd: Could not initialize Cluster Configuration Database API instance error 2 Debbuging with gdb i see that the error are on the confdb.. most specificaly the errors start on coreipcc.c at line: 448if (addr != addr_orig) { 449goto error_close_unlink; - enter here 450 } Some ideia about what can cause this ? I tried porting a ringbuffer (www.libqb.org http://www.libqb.org) to sparc and had the same failure. There are 3 mmap() calls and on sparc the third one keeps failing. 
This is a common way of creating a ring buffer, see: http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation I couldn't get it working in the short time I tried. It's probably worth looking at the clib implementation to see why it's failing (I didn't get to that). -Angus Note, we believe we have sorted this out. Your kernel has hugetlb enabled, probably with 4MB pages. This requires corosync to allocate 4MB pages. Can you verify your hugetlb settings? If you can turn this option off, you should have at least a working corosync.
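A sketch of that verification, for anyone following along (the kernel config file path varies by distro):

    # was the running kernel built with hugetlbfs support?
    grep -i HUGETLB /boot/config-$(uname -r)
    # runtime view of huge page usage
    grep -i huge /proc/meminfo

As noted above, CONFIG_HUGETLBFS is a compile-time kernel option, not a boot parameter, so turning it off on Debian means rebuilding the kernel.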
Re: [Pacemaker] [Openais] Linux HA on debian sparc
Try running pacemaker using the MCP. The plugin mode of pacemaker never really worked very well because of complexities of posix mmap and fork. Not having sparc hardware personally, YMMV. We have recently with corosync 1.3.1 gone through an alignment fixing process for ARM arches - hope that solves your alignment problems on sparc as well. Regards -steve On 05/31/2011 08:38 AM, william felipe_welter wrote: I'm trying to set up HA with corosync and pacemaker using the Debian packages on the SPARC architecture. Using the Debian package, the corosync process dies after the pacemaker process initializes. I ran some tests with ltrace and strace and these tools tell me that corosync died because of a segmentation fault. I tried a lot of things to solve this problem, but nothing made corosync work. My second try was to compile from scratch (using these docs: http://www.clusterlabs.org/wiki/Install#From_Source). This way the corosync process starts up perfectly! but some pacemaker processes don't start.. Analyzing the log, I see the probable reason: attrd: [2283]: info: init_ais_connection_once: Connection to our AIS plugin (9) failed: Library error (2) stonithd: [2280]: info: init_ais_connection_once: Connection to our AIS plugin (9) failed: Library error (2) . cib: [2281]: info: init_ais_connection_once: Connection to our AIS plugin (9) failed: Library error (2) . crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: cib_native_signon_raw: Connection to command channel failed crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: cib_native_signon_raw: Connection to callback channel failed crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: cib_native_signon_raw: Connection to command channel failed crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: cib_native_signon_raw: Connection to callback channel failed crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service crmd: [3320]: info: do_cib_control: Could not connect to the CIB service: connection failed My conf: # Please read the corosync.conf.5 manual page compatibility: whitetank totem { version: 2 join: 60 token: 3000 token_retransmits_before_loss_const: 10 secauth: off threads: 0 consensus: 8601 vsftype: none threads: 0 rrp_mode: none clear_node_high_bit: yes max_messages: 20 interface { ringnumber: 0 bindnetaddr: 10.10.23.0 mcastaddr: 226.94.1.1 mcastport: 5405 } } logging { fileline: off to_stderr: no to_logfile: yes to_syslog: yes logfile: /var/log/cluster/corosync.log debug: on timestamp: on logger_subsys { subsys: AMF debug: on } } amf { mode:
disabled } service { # Load the Pacemaker Cluster Resource Manager ver: 0 name: pacemaker } aisexec { user: root group: root } My question is: why can't attrd, cib, etc. connect to the AIS plugin? What could be the reasons for the connection failure? (Yes, my /dev/shm is tmpfs) -- William Felipe Welter -- Consultor em Tecnologias Livres william.wel...@4linux.com.br www.4linux.com.br ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
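For anyone hitting the same wall: the config above loads pacemaker as a corosync plugin (ver: 0). Switching to the MCP Steven recommends above means changing one stanza - a sketch, assuming a pacemaker 1.1 build with MCP support:

    service {
        # Load the Pacemaker Cluster Resource Manager
        name: pacemaker
        ver: 1    # 1 = MCP mode; pacemakerd spawns and supervises the daemons
    }

and then starting corosync first and pacemakerd after it, instead of letting the plugin fork the pacemaker processes itself.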
Re: [Pacemaker] Linux HA on debian sparc
Note. there are three signals you could possibly see that generate a core file. SIGABRT (assert() called in the codebase) SIGSEGV (segmentation violation) SIGBUS (alignment error) Make sure you don't have a sigbus. Opening the core file with gdb will tell you which signal triggered the fault. Regards -steve On 05/31/2011 08:34 AM, william felipe_welter wrote: Im trying to setup HA with corosync and pacemaker using the debian packages on SPARC Architecture. Using Debian package corosync process dies after initializate pacemaker process. I make some tests with ltrace and strace and this tools tell me that corosync died because a segmentation fault. I try a lot of thing to solve this problem, but nothing made corosync works. My second try is to compile from scratch (using this docs:http://www.clusterlabs.org/wiki/Install#From_Source) http://www.clusterlabs.org/wiki/Install#From_Source%29. . This way corosync process startup perfectly! but some process of pacemaker don't start.. Analyzing log i see the probably reason: attrd: [2283]: info: init_ais_connection_once: Connection to our AIS plugin (9) failed: Library error (2) stonithd: [2280]: info: init_ais_connection_once: Connection to our AIS plugin (9) failed: Library error (2) . cib: [2281]: info: init_ais_connection_once: Connection to our AIS plugin (9) failed: Library error (2) . crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: cib_native_signon_raw: Connection to command channel failed crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: cib_native_signon_raw: Connection to callback channel failed crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_rw crmd: [3320]: debug: cib_native_signon_raw: Connection to command channel failed crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /usr/var/run/crm/cib_callback crmd: [3320]: debug: cib_native_signon_raw: Connection to callback channel failed crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service crmd: [3320]: info: do_cib_control: Could not connect to the CIB service: connection failed My conf: # Please read the corosync.conf.5 manual page compatibility: whitetank totem { version: 2 join: 60 token: 3000 token_retransmits_before_loss_const: 10 secauth: off threads: 0 consensus: 8601 vsftype: none threads: 0 rrp_mode: none clear_node_high_bit: yes max_messages: 20 interface { ringnumber: 0 bindnetaddr: 10.10.23.0 mcastaddr: 226.94.1.1 mcastport: 5405 } } logging { fileline: off to_stderr: no to_logfile: yes to_syslog: yes logfile: /var/log/cluster/corosync.log debug: on timestamp: on logger_subsys { subsys: AMF debug: on } } amf { mode: disabled } service { # Load the Pacemaker Cluster 
Resource Manager ver: 0 name: pacemaker } aisexec { user: root group: root } My question is: why can't attrd, cib, etc. connect to the AIS plugin? What could be the reasons for the connection failure? (Yes, my /dev/shm is tmpfs) -- William Felipe Welter -- Consultor em Tecnologias Livres william.wel...@4linux.com.br www.4linux.com.br ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
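A concrete version of Steven's core-file check, as a sketch (binary and core file paths are examples and vary by distro):

    ulimit -c unlimited        # make sure cores are written, then reproduce the crash
    gdb /usr/sbin/corosync core
    (gdb) bt

The first lines gdb prints identify the signal, e.g. "Program terminated with signal 11, Segmentation fault." - that is how you tell SIGSEGV from SIGABRT or SIGBUS before digging into the backtrace.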
Re: [Pacemaker] [Openais] Corosync goes into endless loop when same hostname is used on more than one node
On 05/12/2011 07:04 AM, Dan Frincu wrote: Hi, When using the same hostname on 2 nodes (debian squeeze, corosync 1.3.0-3 from unstable) the following happens: May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/84, version=0.5.1): ok (rc=0) May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has id: 620757002 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State transition S_INTEGRATION - S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ] May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1 cluster nodes responded to the join offer. May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-29: Syncing the CIB from debian to the rest of the cluster May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has id: 603979786 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State transition S_FINALIZE_JOIN - S_INTEGRATION [ input=I_JOIN_REQUEST cause=C_HA_MESSAGE origin=route_message ] May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/86, version=0.5.1): ok (rc=0) May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all: join-30: Waiting on 1 outstanding join acks May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian (3.0.1) May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has id: 620757002 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State transition S_INTEGRATION - S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ] May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1 cluster nodes responded to the join offer. May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-30: Syncing the CIB from debian to the rest of the cluster May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has id: 603979786 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State transition S_FINALIZE_JOIN - S_INTEGRATION [ input=I_JOIN_REQUEST cause=C_HA_MESSAGE origin=route_message ] May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all: join-31: Waiting on 1 outstanding join acks May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian (3.0.1) May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/88, version=0.5.1): ok (rc=0) May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has id: 620757002 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State transition S_INTEGRATION - S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ] May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1 cluster nodes responded to the join offer. 
May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-31: Syncing the CIB from debian to the rest of the cluster May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has id: 603979786 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State transition S_FINALIZE_JOIN - S_INTEGRATION [ input=I_JOIN_REQUEST cause=C_HA_MESSAGE origin=route_message ] May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all: join-32: Waiting on 1 outstanding join acks May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian (3.0.1) Basically it goes into an endless loop. This is an improper configuration, but it would help users if it were handled or a relevant message were printed in the logfile, such as "duplicate hostname found". Dan, I believe this is a pacemaker RFE. corosync operates entirely on IP addresses and never does any hostname to IP resolution (because the resolver can block and cause bad things to happen). Regards. Dan -- Dan Frincu CCNA, RHCE ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
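Until such a message exists, a trivial pre-flight check catches this misconfiguration (a sketch; the node names and ssh access are assumptions):

    # every node must report a unique name before corosync/pacemaker start
    for n in node1 node2; do ssh $n uname -n; done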
[Pacemaker] Pacemaker Cloud Policy Engine Red Hat Summit slides and Mailing List
In February we announced our intentions to work on a cloud-specific high availability solution on this list. The code is coming along, and we have reached a point where we should have a mailing list dedicated to cloud-specific topics of Pacemaker. The mailing list subscription page is: http://oss.clusterlabs.org/mailman/listinfo/pcmk-cloud To see how we have progressed since February, have a look at the source in our git repo, or take a look at the Red Hat Summit 2011 slides where our work was presented this past week: http://www.redhat.com/summit/2011/presentations/summit/whats_new/thursday/dake_th_1130_high_availability_in_the_cloud.pdf If you're interested in cloud high availability technology, please feel free to participate on our mailing lists. Your input there is invaluable to ensuring we deliver a great project that downstream distros and administrators can use. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] corosync crash
On 02/25/2011 12:38 AM, Andrew Beekhof wrote: This is the same one you sent to the openais list right? Andrew, This was root-caused to a faulty network setup resulting in the "failed to receive" abort we are currently working on. One key detail missing from this thread is that the implementation worked great on VMW ESX 4.0 but then started having problems in ESX 4.1. Regards -steve On Thu, Feb 24, 2011 at 10:32 AM, u.schmel...@online.de wrote: Hi, my configuration has 2 nodes, one has a set of virtual addresses and a webservice. The situation before the crash: node1: has all resources node2: online, no resources action on node2: crm standby node2 result on node1: corosync crashes, the child processes consume all available cpu time my actions: stop all child processes on node1 (kill -9) and restart corosync result on node1: node1: online, all resources node2: offline result on node2: node1: offline node2: online, all resources The only way I found to work around this problem: remove node2 from the cluster and add it again. There should be other solutions, maybe someone can help. Appended the coredump and fplay. Update: If I keep the cluster in the split brain state, it recovers after about 9 hours (logfile available) regards Uwe ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Cluster Communication fails after VMWare Migration
On 02/25/2011 12:40 AM, Andrew Beekhof wrote: On Wed, Feb 23, 2011 at 10:31 AM, u.schmel...@online.de wrote: Have built a 2-node apache cluster on VMWare virtual machines, which was running as expected. We had to migrate the machines to another computing center and after that the cluster communication didn't work anymore. Migration of VMs causes a change of the network's MAC address. Maybe that's the reason for my problem. After removing one node from the cluster and adding it again the communication worked. Because migrations between computing centers can happen at any time (mirrored esx infrastructure), I have to find out if this breaks the cluster communication. Cluster communication issues are the domain of corosync/heartbeat - their mailing lists may be able to provide more information. We're just the poor consumer of their services :-) poor consumer lol Regarding migration, I doubt a MAC address migration will work properly with modern switches and igmp. For that type of operation to work properly, you will definitely want to take multicast out of the equation and instead use the udpu transport mode. Keep in mind the corosync devs don't test the types of things you talk about as we don't have proprietary software licenses. Regards -steve regards Uwe ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
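For reference, a sketch of what switching to udpu looks like in /etc/corosync/corosync.conf (corosync 1.3.0 or later; the addresses are examples):

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 10.10.23.0
            mcastport: 5405
            member { memberaddr: 10.10.23.1 }
            member { memberaddr: 10.10.23.2 }
        }
    }

Every node is listed explicitly as a member, so nothing depends on multicast group membership surviving a VM migration.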
Re: [Pacemaker] Article on HA in the IBM cloud using Pacemaker and Heartbeat
On 01/28/2011 08:02 AM, Alan Robertson wrote: Hi, I recently co-authored an article on HA in the IBM cloud using Pacemaker and Heartbeat. http://www.ibm.com/developerworks/cloud/library/cl-highavailabilitycloud/ The cool thing is that the IBM cloud supports virtual IPs. With most of the other clouds you have to do DNS failover - which is sub-optimal ;-). Of course, they added this after we harangued them ;-) - but still it's very nice to have. It uses Heartbeat rather than Corosync because (for good reason) clouds don't support multicast or broadcast. Corosync works in non-broadcast/multicast modes (the transport is called udpu). Regards -steve There will be a follow-up article on setting up DRBD in the cloud as well... Probably a month away or so... -- Alan Robertson al...@unix.sh Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] pacemaker + corosync in the cloud
On 12/14/2010 05:14 PM, ruslan usifov wrote: Hi Is it possible to use pacemaker based on corosync in cloud hosting like Amazon or SoftLayer? Yes, with corosync 1.3.0 in udpu mode. The udpu mode avoids the use of multicast, allowing operation in Amazon's cloud. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] service corosync start failed
On 11/22/2010 01:27 AM, jiaju liu wrote: Hi all If I use a command like this: service corosync start it shows Starting Corosync Cluster Engine (corosync): [FAILED] and if I do nothing, just reboot my computer, it will be OK. What is the reason? Thanks a lot my pacemaker packages are pacemaker-1.0.8-6.1.el5 pacemaker-libs-devel-1.0.8-6.1.el5 pacemaker-libs-1.0.8-6.1.el5 openais packages are openaislib-devel-1.1.0-1.el5 openais-1.1.0-1.el5 openaislib-1.1.0-1.el5 corosync packages are corosync-1.2.2-1.1.el5 corosynclib-devel-1.2.2-1.1.el5 corosynclib-1.2.2-1.1.el5 who knows why? thanks a lot Your packages are about 1 year old. I'd suggest updating - we release z streams to fix bugs and problems that people run into. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
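The mechanical version of that advice, as a sketch (assumes the clusterlabs yum repository is still configured on the machine):

    yum update corosync corosynclib openais openaislib pacemaker pacemaker-libs
    service corosync restart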
Re: [Pacemaker] UDPU transport patch added, when will the RPMs be available
On 11/22/2010 09:27 AM, Dan Frincu wrote: Hi Steven, Steven Dake wrote: On 11/19/2010 11:42 AM, Andrew Beekhof wrote: On Fri, Nov 19, 2010 at 11:38 AM, Dan Frincu dfri...@streamwide.ro wrote: Hi, The subject is pretty self-explanatory but I'll ask anyway, the patch for UDPU has been released, this adds the ability to set unicast peer addresses of nodes in a cluster, in network environments where multicast is not an option. When will it be available as an RPM? When upstream does a new release. Dan, The flatiron branch (containing the udpu patches) is going through testing for 1.3.0. We find currently that single CPU virtual machine systems seem to have problems with these patches which we will sort out before release. Regards -steve I've taken the (tip I think it is called) of corosync.git and compiled the RPM's on RH5U3 64-bit (I got the code the day it was first released, haven't had a chance to post yet). First off, we release from the flatiron branch. It is our stable branch. From git, do git checkout flatiron This will provide the full flatiron branch for building. # git show commit 565b32c2621c08f82cab57420217060d100d4953 Author: Fabio M. Di Nitto fdini...@redhat.com Date: Fri Nov 19 09:21:47 2010 +0100 There were some issues when compiling, deps mostly, some in the spec related to version which was UNKNOWN, I did a sed, placed 1.2.9 as a We are aware of this problem. We just moved from svn to git, and there is some pain associated. This particular problem comes from a lack of a specific type of tag in the git repo for version numbers. It will be fixed once 1.3.0 is released. Then RPM builds will work as expected. number instead of UNKNOWN and it compiled OK. I've installed it on two Xen VM's I use for testing and found some issues so the question is: where can I send feedback (and what kind of feedback is required) about development code? I'm not saying that you guys haven't run into these errors, maybe you did and they were fixed and maybe some are specific to my setup and haven't been found so, if I can provide some feedback on development code, I'd be more than happy to, if that's OK. We are certainly interested in contributions to the master branch. What most people use is the flatiron branch. If you see defects on the tip of flatiron, let us know, and we will work to address them. The best way to report an issue is to start a conversation on our mailing list (in the cc) Is XYZ supposed to happen?. The developers can say yes or no and ask for further information if there is a defect. Regards -steve Also, I've read about the cluster test suite, but I'm not actually sure how it works, could somebody provide some details as to how I can use the cluster test suite on a cluster to check for issues and then how can I report if there are any issues found (again, what kind of feedback is required). Regards, Dan p.s.: ignore my other email, I didn't see the reply on this one. If I'm barking up the wrong tree, please direct me to the proper channel to direct this request, I'm really looking forward to testing the UDPU. 
Regards, Dan -- Dan FRINCU Systems Engineer CCNA, RHCE Streamwide Romania ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
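Putting Steven's pointers together, a sketch of building from the stable branch (assumes autotools are installed and you already have a clone of corosync.git):

    cd corosync
    git checkout flatiron    # the stable branch releases are cut from
    ./autogen.sh
    ./configure
    make
    make install

The UNKNOWN version Dan hit comes from the missing release tags Steven mentions, so RPM builds from the spec stay broken until the 1.3.0 tag lands in the repo.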
Re: [Pacemaker] UDPU transport patch added, when will the RPMs be available
On 11/19/2010 11:42 AM, Andrew Beekhof wrote: On Fri, Nov 19, 2010 at 11:38 AM, Dan Frincu dfri...@streamwide.ro wrote: Hi, The subject is pretty self-explanatory but I'll ask anyway, the patch for UDPU has been released, this adds the ability to set unicast peer addresses of nodes in a cluster, in network environments where multicast is not an option. When will it be available as an RPM? When upstream does a new release. Dan, The flatiron branch (containing the udpu patches) is going through testing for 1.3.0. We find currently that single CPU virtual machine systems seem to have problems with these patches which we will sort out before release. Regards -steve If I'm barking up the wrong tree, please direct me to the proper channel to direct this request, I'm really looking forward to testing the UDPU. Regards, Dan -- Dan FRINCU Systems Engineer CCNA, RHCE Streamwide Romania ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Corosync using unicast instead of multicast
On 11/08/2010 05:50 AM, Dan Frincu wrote: Hi, Steven Dake wrote: On 11/05/2010 01:30 AM, Dan Frincu wrote: Hi, Alan Jones wrote: This question should be on the openais list, however, I happen to know the answer. To get up and running quickly you can configure broadcast with the version you have. I've done that already, however I was a little concerned as to what Steven Dake said on the openais mailing list about using broadcast Broadcast and redundant ring probably don't work to well together.. I've also done some testing and saw that the broadcast address used is 255.255.255.255, regardless of what the bindnetaddr network address is, and quite frankly, I was hoping to see a directed broadcast address. This wasn't the case, therefore I wonder whether this was the issue that Steven was referring to, because by using the 255.255.255.255 as a broadcast address, there is the slight chance that some application running in the same network might send a broadcast packet using the same This can happen with multicast or unicast modes as well. If a third party application communicates on the multicast/port combo or unicast port of a cluster node, there is conflict. With encryption, corosync encrypts and authenticates all packets, ignoring packets without a proper signature. The signatures are difficult to spoof. Without encryption, bad things happen in this condition. For more details, read SECURITY file in our source distribution. OK, I read the SECURITY file, a lot of overhead is added, I understand the reasons why it does it this way, not going to go into the details right now. Basically enabling encryption ensures that any traffic going between the nodes is both encrypted and authenticated, so rogue messages that happen to reach the exact network socket will be discarded. I'll come back to this a little bit later. Then again, I have this sentence in my head that I can't seem to get rid of Broadcast and redundant ring probably don't work to well together, broadcast and redundant ring probably don't work to well together and also I read OpenAIS now provides broadcast network communication in addition to multicast. This functionality is considered Technology Preview for standalone usage of OpenAIS, therefore I'm a little bit more concerned. Can you shed some light on this please? Two questions: 1) What do you mean by Broadcast and redundant ring probably don't work to well together? broadcast requires a specific port to run on. As a result, the ports should be different for each interface. I have not done any specific testing on broadcast with redundant ring - you would probably be the first. 2) Is using Corosync's broadcast feature instead of multicast stable enough to be used in production systems? Personally I'd wait for 2.0 for this feature and use bonding for the moment. Thank you in advance. Best regards, Dan port as configured on the cluster. How would the cluster react to that, would it ignore the packet, would it wreak havoc? Regards, Dan That's my main concern right now. Corosync can distinguish separate clusters with the multicast address and port that become payload to the messages. The patch you referred to can be applied to the top of tree for corosync or you can wait for a new release 1.3.0 planned for the end of November. 
Alan On Thu, Nov 4, 2010 at 1:02 AM, Dan Frincudfri...@streamwide.ro wrote: Hi all, I'm having an issue with a setup using the following: cluster-glue-1.0.6-1.6.el5.x86_64.rpm cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm corosync-1.2.7-1.1.el5.x86_64.rpm corosynclib-1.2.7-1.1.el5.x86_64.rpm drbd83-8.3.2-6.el5_3.x86_64.rpm kmod-drbd83-8.3.2-6.el5_3.x86_64.rpm openais-1.1.3-1.6.el5.x86_64.rpm openaislib-1.1.3-1.6.el5.x86_64.rpm pacemaker-1.0.9.1-1.el5.x86_64.rpm pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm resource-agents-1.0.3-2.el5.x86_64.rpm This is a two-node HA cluster, with the nodes interconnected via bonded interfaces through the switch. The issue is that I have no control of the switch itself, can't do anything about that, and from what I understand the environment doesn't allow enabling multicast on the switch. In this situation, how can I have the setup functional (with redundant rings, rrp_mode: active) without using multicast. I've seen that individual network sockets are formed between nodes, unicast sockets, as well as the multicast sockets. I'm interested in knowing how will the lack of multicast affect the redundant rings, connectivity, failover, etc. I've also seen this page https://lists.linux-foundation.org/pipermail/openais/2010-October/015271.html And here it states using UDPU transport mode avoids using multicast or broadcast, but it's a patch, is this integrated in any of the newer versions of corosync? Thank you in advance. Regards, Dan -- Dan FRINCU Systems Engineer CCNA, RHCE Streamwide Romania ___ Pacemaker mailing list:Pacemaker@oss.clusterlabs.org http
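A sketch of the encryption knob discussed above, for corosync.conf (the key is created once and copied to every node):

    totem {
        version: 2
        secauth: on    # sign and encrypt all totem traffic
        threads: 0
    }

    # generate /etc/corosync/authkey on one node, then copy it to the others
    corosync-keygen

With secauth on, stray packets that happen to hit the cluster's port are discarded because they lack a valid signature, which is the protection Steven describes.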
Re: [Pacemaker] Corosync using unicast instead of multicast
On 11/05/2010 01:30 AM, Dan Frincu wrote: Hi, Alan Jones wrote: This question should be on the openais list, however, I happen to know the answer. To get up and running quickly you can configure broadcast with the version you have. I've done that already, however I was a little concerned as to what Steven Dake said on the openais mailing list about using broadcast Broadcast and redundant ring probably don't work to well together.. I've also done some testing and saw that the broadcast address used is 255.255.255.255, regardless of what the bindnetaddr network address is, and quite frankly, I was hoping to see a directed broadcast address. This wasn't the case, therefore I wonder whether this was the issue that Steven was referring to, because by using the 255.255.255.255 as a broadcast address, there is the slight chance that some application running in the same network might send a broadcast packet using the same This can happen with multicast or unicast modes as well. If a third party application communicates on the multicast/port combo or unicast port of a cluster node, there is conflict. With encryption, corosync encrypts and authenticates all packets, ignoring packets without a proper signature. The signatures are difficult to spoof. Without encryption, bad things happen in this condition. For more details, read SECURITY file in our source distribution. port as configured on the cluster. How would the cluster react to that, would it ignore the packet, would it wreak havoc? Regards, Dan That's my main concern right now. Corosync can distinguish separate clusters with the multicast address and port that become payload to the messages. The patch you referred to can be applied to the top of tree for corosync or you can wait for a new release 1.3.0 planned for the end of November. Alan On Thu, Nov 4, 2010 at 1:02 AM, Dan Frincudfri...@streamwide.ro wrote: Hi all, I'm having an issue with a setup using the following: cluster-glue-1.0.6-1.6.el5.x86_64.rpm cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm corosync-1.2.7-1.1.el5.x86_64.rpm corosynclib-1.2.7-1.1.el5.x86_64.rpm drbd83-8.3.2-6.el5_3.x86_64.rpm kmod-drbd83-8.3.2-6.el5_3.x86_64.rpm openais-1.1.3-1.6.el5.x86_64.rpm openaislib-1.1.3-1.6.el5.x86_64.rpm pacemaker-1.0.9.1-1.el5.x86_64.rpm pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm resource-agents-1.0.3-2.el5.x86_64.rpm This is a two-node HA cluster, with the nodes interconnected via bonded interfaces through the switch. The issue is that I have no control of the switch itself, can't do anything about that, and from what I understand the environment doesn't allow enabling multicast on the switch. In this situation, how can I have the setup functional (with redundant rings, rrp_mode: active) without using multicast. I've seen that individual network sockets are formed between nodes, unicast sockets, as well as the multicast sockets. I'm interested in knowing how will the lack of multicast affect the redundant rings, connectivity, failover, etc. I've also seen this page https://lists.linux-foundation.org/pipermail/openais/2010-October/015271.html And here it states using UDPU transport mode avoids using multicast or broadcast, but it's a patch, is this integrated in any of the newer versions of corosync? Thank you in advance. 
Regards, Dan -- Dan FRINCU Systems Engineer CCNA, RHCE Streamwide Romania ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
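And for completeness, a sketch of the broadcast variant Dan tested (addresses are examples; treat it as the technology preview it is described as):

    interface {
        ringnumber: 0
        bindnetaddr: 10.10.23.0
        broadcast: yes    # when set, omit mcastaddr
        mcastport: 5405
    }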
Re: [Pacemaker] Corosync node detection working too good
On 10/04/2010 02:04 AM, Stephan-Frank Henry wrote: Hello all, still working on my nodes and although the last problem is not officially solved (I hard coded certain versions of the packages and that seems to be ok now) I have a different interesting feature I need to handle. I am setting up my nodes by default as single node setups. But today when I set up another node, *without* doing any special config to make them know each other, the corosyncs on each node found each other and distributed the cib.xml between each other. They both also show up together in crm_mon. Not quite what I wanted. :) I presume I have a config that is too generic and thus the nodes are finding each other and thinking they should link up. What configs do I have to look into to avoid this? Thanks A unique cluster is defined by mcastaddr and mcastport in the file /etc/corosync/corosync.conf If you simply installed them, you may have the same corosync.conf file for each unique cluster, which would result in the problem you describe. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
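[Editor's note] For reference, a sketch of what keeping two single-node clusters apart can look like in each machine's /etc/corosync/corosync.conf (the addresses and ports below are illustrative, not taken from Stephan-Frank's setup); note that corosync uses both mcastport and mcastport - 1, so ports for co-located clusters should differ by at least 2:

# cluster A
totem {
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}

# cluster B: a different mcastaddr (and/or an mcastport at least 2 apart)
totem {
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.2
        mcastport: 5409
    }
}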
Re: [Pacemaker] Fail over algorithm used by Pacemaker
On 10/03/2010 07:01 AM, hudan studiawan wrote: Hi, I want to start to contribute to the Pacemaker project. I have started to read the documentation and try some basic configurations. I have a question: what kind of algorithm is used by Pacemaker to choose another node when a node dies in a cluster? Is there any manual or documentation I can read? Thank you, Hudan In the case of using Corosync, we use a protocol designed in the 90s to determine membership. It is called The Totem Single Ring Protocol: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf Its full operation is described in that PDF. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Connection to our AIS plugin (9) failed: Library error
On 09/22/2010 04:02 AM, Szymon Hersztek wrote: Message written on 2010-09-22 at 10:26 by Andrew Beekhof: 2010/9/21 Szymon Hersztek s...@globtel.pl: Message written on 2010-09-21 at 09:08 by Andrew Beekhof: 2010/9/21 Szymon Hersztek s...@globtel.pl: Message written on 2010-09-21 at 08:34 by Andrew Beekhof: On Mon, Sep 20, 2010 at 3:34 PM, Szymon Hersztek s...@globtel.pl wrote: Hi, I'm trying to set up corosync to work as a drbd cluster, but after installing following http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf I got an error like the one below: Unusual, but did pacemaker fork a replacement attrd process? At what time did corosync start? corosync was started manually, or do you want the exact time of start? Well, you included at most 1 second's worth of logging, so it's kinda hard to know if something took too long or what recovery was attempted. OK, it is not a problem to send more. Do you need debug logging or standard? I have to install the server once again, so in half an hour I can reproduce the logs. Here's your issue: corosynclib i386 1.2.7-1.1.el5 clusterlabs 155 k corosynclib x86_64 1.2.7-1.1.el5 clusterlabs 172 k Why do you have both i386 and x86_64 versions installed on your machine?? There should be no problems installing lib files for both i386 and x86_64. These rpms only contain the *.so files (and a LICENSE file). Regards -steve Because yum installed it this way .. as many other packages. The problem was that I did not have /dev/shm mounted as tmpfs. But thanks for trying ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
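[Editor's note] For anyone hitting the same wall, a quick sanity check that /dev/shm really is a tmpfs mount, plus a hedged example of mounting one if it is missing (the 512m size is an arbitrary choice):

# verify /dev/shm is mounted as tmpfs
grep /dev/shm /proc/mounts

# if nothing shows up, mount one (example size; add to /etc/fstab to persist)
mount -t tmpfs -o size=512m tmpfs /dev/shm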
Re: [Pacemaker] Timeout after nodejoin
On 09/22/2010 05:43 AM, Dan Frincu wrote: Hi all, I have the following packages: # rpm -qa | grep -i (openais|cluster|heartbeat|pacemaker|resource) openais-0.80.5-15.2 cluster-glue-1.0-12.2 pacemaker-1.0.5-4.2 cluster-glue-libs-1.0-12.2 resource-agents-1.0-31.5 pacemaker-libs-1.0.5-4.2 pacemaker-mgmt-1.99.2-7.2 libopenais2-0.80.5-15.2 heartbeat-3.0.0-33.3 pacemaker-mgmt-client-1.99.2-7.2 When I start openais, I get nodejoin immediately, as seen in the logs below. However, it takes some time before the nodes are visible in crm_mon output. Any idea how to minimize this delay? Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: send_member_notification: Sending membership update 8 to 1 children Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message 192.168.165.33 Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message 192.168.165.35 Sep 22 15:27:24 bench1 mgmtd: [12947]: info: Started. Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2) Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2) Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Recorded connection 0x174840d0 for crmd/12946 Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Sending membership update 8 to crmd Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: update_expected_votes: Expected quorum votes 1024 - 2 Sep 22 15:27:25 bench1 crmd: [12946]: notice: ais_dispatch: Membership 8: quorum aquired Sep 22 15:28:15 bench1 crmd: [12946]: info: do_election_count_vote: Election 2 (owner: bench2) pass: vote from bench2 (Host name) Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State transition S_PENDING - S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ] Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State transition S_ELECTION - S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ] Sep 22 15:28:15 bench1 crmd: [12946]: info: do_te_control: Registering TE UUID: 87c28ab8-ba93-4111-a26a-67e88dd927fb Sep 22 15:28:15 bench1 crmd: [12946]: WARN: cib_client_add_notify_callback: Callback already present Sep 22 15:28:15 bench1 crmd: [12946]: info: set_graph_functions: Setting custom graph functions Sep 22 15:28:15 bench1 crmd: [12946]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses Sep 22 15:28:15 bench1 crmd: [12946]: info: do_dc_takeover: Taking over DC status for this partition Sep 22 15:28:15 bench1 cib: [12942]: info: cib_process_readwrite: We are now in R/W mode Regards, Dan Where did you get that version of openais? openais 0.80.x is deprecated in the community (and hence, no support). We recommend using corosync instead which has improved testing with pacemaker. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] MCP init script to 21/79?
On 08/24/2010 11:06 PM, Andrew Beekhof wrote: On Wed, Aug 25, 2010 at 8:02 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 25.08.2010 08:56, Andrew Beekhof wrote: On Wed, Aug 25, 2010 at 7:39 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi all, pacemaker has # chkconfig: - 90 90 in its MCP initscript. Shouldn't it be corrected to 90 10? I thought higher numbers started later and shut down earlier... no? Nope, they are in a natural order for both start and stop sequences. So lower number means 'do start or stop earlier'. grep '# chkconfig' /etc/init.d/* Ok, thanks. Changed to 10. Given that the corosync default is 20/80, shouldn't MCP be 21/79? Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
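[Editor's note] Concretely, the change being discussed is a one-line edit to the chkconfig header of the pacemaker MCP init script (a sketch; the '-' means no runlevels are enabled by default, and the rest of the stock header is unchanged):

# before: starts/stops at 90/90
# chkconfig: - 90 90
# after: start just after corosync (20), stop just before it (80)
# chkconfig: - 21 79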
Re: [Pacemaker] MCP init script to 21/79?
On 09/03/2010 09:56 AM, Vladislav Bogdanov wrote: 03.09.2010 19:34, Steven Dake wrote: Nope, they are in a natural order for both start and stop sequences. So lower number means 'do start or stop earlier'. grep '# chkconfig' /etc/init.d/* Ok, thanks. Changed to 10 Given that corosync default is 20/80, shouldnt mcp be 21/79? I think that pcmk may require additional services to be started (I at least see reference to cooperation with cman for GFS as one of pcmk MCP scenarios in Andrew's wiki, but that scenario is still unclear to me), so it is safer to have it start later, 90 is ok for me. That is also what Vadim wrote about. Best, Vladislav I was mistaken, not having read the current code. Ignore the noise. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Corosync + Pacemaker New Install: Corosync Fails Without Error Message
On 06/18/2010 09:42 AM, Eliot Gable wrote: I don’t have an “aisexec” section at all. I simply copied the sample file, which did not have one. I did figure out why it wasn’t logging. It was set to AMF mode and ‘mode’ was ‘disabled’ in the AMF configuration section. After changing that to ‘enabled’, I now have logging. That allowed me to figure out that I needed to set rrp_mode to something other than ‘none’, because I have two interfaces to run the totem protocol over. However, with it set to ‘passive’ or ‘active’, corosync tries to start, then seg faults: Jun 18 07:33:23 corosync [MAIN ] Corosync Cluster Engine ('1.2.2'): started and ready to provide service. Jun 18 07:33:23 corosync [MAIN ] Corosync built-in features: nss rdma Jun 18 07:33:23 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Jun 18 07:33:23 corosync [TOTEM ] Token Timeout (1000 ms) retransmit timeout (238 ms) Jun 18 07:33:23 corosync [TOTEM ] token hold (180 ms) retransmits before loss (4 retrans) Jun 18 07:33:23 corosync [TOTEM ] join (50 ms) send_join (0 ms) consensus (1200 ms) merge (200 ms) Jun 18 07:33:23 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (50 msgs) Jun 18 07:33:23 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1402 Jun 18 07:33:23 corosync [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (17 messages) Jun 18 07:33:23 corosync [TOTEM ] send threads (0 threads) Jun 18 07:33:23 corosync [TOTEM ] RRP token expired timeout (238 ms) Jun 18 07:33:23 corosync [TOTEM ] RRP token problem counter (2000 ms) Jun 18 07:33:23 corosync [TOTEM ] RRP threshold (10 problem count) Jun 18 07:33:23 corosync [TOTEM ] RRP mode set to passive. Jun 18 07:33:23 corosync [TOTEM ] heartbeat_failures_allowed (0) Jun 18 07:33:23 corosync [TOTEM ] max_network_delay (50 ms) Jun 18 07:33:23 corosync [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed 0 Jun 18 07:33:23 corosync [TOTEM ] Initializing transport (UDP/IP). Jun 18 07:33:23 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Jun 18 07:33:23 corosync [TOTEM ] Initializing transport (UDP/IP). Jun 18 07:33:23 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Jun 18 07:33:23 corosync [IPC ] you are using ipc api v2 Jun 18 07:33:23 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes). Jun 18 07:33:23 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes). Jun 18 07:33:23 corosync [TOTEM ] The network interface is down. Jun 18 07:33:23 corosync [TOTEM ] Created or loaded sequence id 0.127.0.0.1 for this ring. Jun 18 07:33:23 corosync [pcmk ] info: process_ais_conf: Reading configure Jun 18 07:33:23 corosync [pcmk ] info: config_find_init: Local handle: 2013064636357672962 for logging Jun 18 07:33:23 corosync [pcmk ] info: config_find_next: Processing additional logging options... 
Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Found 'on' for option: debug Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility Jun 18 07:33:23 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465987 for service Jun 18 07:33:23 corosync [pcmk ] info: config_find_next: Processing additional service options... Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd Jun 18 07:33:23 corosync [pcmk ] info: pcmk_startup: CRM: Initialized Jun 18 07:33:23 corosync [pcmk ] Logging: Initialized pcmk_startup Jun 18 07:33:23 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615 Segmentation fault (gdb) where full #0 0x00332de797c0 in strlen () from /lib64/libc.so.6 No symbol table info available. #1 0x2acefb9b in logsys_worker_thread (data=<value optimized out>) at logsys.c:760 rec = 0x2aef0c28 dropped = 0 #2 0x00332e60673d in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00332ded3d1d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Downgrading again back to 1.2.1-1.el5 seems to resolve the issue, and Corosync runs. Eliot Gable Senior Product Developer 1228 Euclid Ave, Suite 390 Cleveland, OH 44115 Direct: 216-373-4808 Fax: 216-373-4657 ega...@broadvox.net
Re: [Pacemaker] use_logd or use_mgmtd kills corosync
On 06/08/2010 11:20 PM, Andrew Beekhof wrote: On Wed, Jun 9, 2010 at 7:27 AM, Devin Reade g...@gno.org wrote: I was following the instructions for a new installation of corosync and was wanting to make use of hb_gui so, following an installation via yum per the docs, built Pacemaker-Python-GUI-pacemaker-mgmt-2.0.0 from source. Starting corosync works normally without mgmtd in the picture, but as soon as *either* of the two lines is added to /etc/corosync/service.d/pcmk, corosync fails to start with no diagnostics in the logfile or syslog: use_logd: 1 use_mgmtd: 1 I ran 'strace corosync -f' and got rather uninformative information, the tail end of it shown here: statfs("/etc/corosync/service.d", {f_type=EXT2_SUPER_MAGIC, f_bsize=4096, f_blocks=507860, f_bfree=388733, f_bavail=362519, f_files=524288, f_ffree=517073, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0 getdents(3, /* 3 entries */, 32768) = 72 stat("/etc/corosync/service.d/pcmk", {st_mode=S_IFREG|0644, st_size=101, ...}) = 0 open("/etc/corosync/service.d/pcmk", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0644, st_size=101, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2acb16dd5000 read(4, "service {\n \t# Load the Pacemaker"..., 4096) = 101 close(4) = 0 munmap(0x2acb16dd5000, 4096) = 0 close(3) = 0 exit_group(8) = ? Any thoughts? Not really. Do any other children start up? Where is the mgmtd binary installed to? # uname -srv Linux 2.6.18-194.3.1.el5 #1 SMP Thu May 13 13:08:30 EDT 2010 # rpm -q -a | grep openais | sort openais-1.1.0-2.el5.i386 openais-1.1.0-2.el5.x86_64 openaislib-1.1.0-2.el5.i386 openaislib-1.1.0-2.el5.x86_64 openaislib-devel-1.1.0-2.el5.i386 openaislib-devel-1.1.0-2.el5.x86_64 ### /etc/corosync/corosync.conf compatibility: none totem { version: 2 secauth: off threads: 0 interface { ringnumber: 0 # but with a real netaddr, obviously bindnetaddr: A.B.C.D mcastaddr: 226.94.1.1 mcastport: 5405 } } logging { fileline: off to_stderr: no to_file: yes to_syslog: yes logfile: /var/log/corosync.log # debug: off timestamp: on logger_subsys { subsys: AMF debug: off } } amf { mode: disabled } aisexec { user: root group: root } ### /etc/corosync/service.d/pcmk service { # Load the Pacemaker Cluster Resource Manager name: pacemaker ver: 0 use_logd: 1 } ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker This is likely the sem_wait issue related to some CentOS deployments. An update for corosync is pending release. Hopefully new source tarballs will be available Wednesday. Regards -steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] handle EINTR in sem_wait (pacemaker corosync 1.2.2+ crash)
Hello, I have found the cause of the crash that was occurring only on some deployments. The cause is that sem_wait is interrupted by a signal, and the wait operation is not retried (as is customary in POSIX). Patch attached to fix. A big thank you to Vladislav Bogdanov for running the test case and verifying it fixes the problem. Regards -steve

Index: logsys.c
===
--- logsys.c	(revision 2915)
+++ logsys.c	(working copy)
@@ -661,7 +661,18 @@
 	sem_post (&logsys_thread_start);
 	for (;;) {
 		dropped = 0;
-		sem_wait (&logsys_print_finished);
+retry_sem_wait:
+		res = sem_wait (&logsys_print_finished);
+		if (res == -1 && errno == EINTR) {
+			goto retry_sem_wait;
+		} else
+		if (res == -1) {
+			/*
+			 * This case shouldn't happen
+			 */
+			pthread_exit (NULL);
+		}
+
 		logsys_wthread_lock();
 		if (wthread_should_exit) {

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] corosync/openais fails to start
This is a known issue on some platforms, although the exact cause is unknown. I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo rpms and been unable to reproduce. I'll keep looking. Regards -steve On 05/27/2010 06:07 AM, Diego Remolina wrote: Hi, I was running the old rpms from the opensuse repo and wanted to change over to the latest packages from the clusterlabs repo on my RHEL 5.5 machines. Steps I took 1. Disabled the old repo 2. Set the nodes to standby (two node drbd cluster) and turned off openais 3. Enabled the new repo. 4. Performed an update with yum -y update which replaced all packages. 5. The configuration file for ais was renamed openais.conf.rpmsave 6. I ran corosync-keygen and copied the key to the second machine 7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf and modified it by removing the service section and moving that to /etc/corosync/service.d/pcmk 8. I copied the configurations to the other machine. 9. When I try to start either openais or corosync with the init scripts I get a failure and nothing that can really point me to an error in the logs. Updated packages: May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64 May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64 May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64 May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64 May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64 May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64 May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64 May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64 May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64 May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64 May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64 May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64 May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64 May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64 Apparently corosync is seg faulting when run from the command line: # /usr/sbin/corosync -f Segmentation fault Any help would be greatly appreciated. Diego ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Re: [Pacemaker] corosync/openais fails to start
On 05/27/2010 08:40 AM, Diego Remolina wrote: Is there any workaround for this? Perhaps a slightly older version of the rpms? If so where do I find those? Corosync 1.2.1 doesn't have this issue apparently. With corosync 1.2.1, please don't use the debug: on keyword in your config options. I am not sure where Andrew has corosync 1.2.1 rpms available. The corosync project itself doesn't release rpms. See our policy on this topic: http://www.corosync.org/doku.php?id=faq:release_binaries Regards -steve I cannot get the opensuse-ha rpms any more so I am stuck with a non-functioning cluster. Diego Steven Dake wrote: This is a known issue on some platforms, although the exact cause is unknown. I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo rpms and been unable to reproduce. I'll keep looking. Regards -steve On 05/27/2010 06:07 AM, Diego Remolina wrote: Hi, I was running the old rpms from the opensuse repo and wanted to change over to the latest packages from the clusterlabs repo on my RHEL 5.5 machines. Steps I took 1. Disabled the old repo 2. Set the nodes to standby (two node drbd cluster) and turned off openais 3. Enabled the new repo. 4. Performed an update with yum -y update which replaced all packages. 5. The configuration file for ais was renamed openais.conf.rpmsave 6. I ran corosync-keygen and copied the key to the second machine 7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf and modified it by removing the service section and moving that to /etc/corosync/service.d/pcmk 8. I copied the configurations to the other machine. 9. When I try to start either openais or corosync with the init scripts I get a failure and nothing that can really point me to an error in the logs. Updated packages: May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64 May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64 May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64 May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64 May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64 May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64 May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64 May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64 May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64 May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64 May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64 May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64 May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64 May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64 Apparently corosync is seg faulting when run from the command line: # /usr/sbin/corosync -f Segmentation fault Any help would be greatly appreciated. Diego ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
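[Editor's note] That is, as a stopgap on corosync 1.2.1, the logging section should leave debug off (a sketch based on Steve's advice above, not a full config):

logging {
    to_syslog: yes
    # leave this off on corosync 1.2.1 until the fixed release lands
    debug: off
}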
Re: [Pacemaker] Being fenced node is killed again and again even the connection is recovered!
ifconfig eth0 down is not a valid test case. That will likely lead to bad things happening. I recommend using iptables to test the software. Also Corosync 1.2.2 is out, which fixes bugs vs corosync 1.2.0. Regards -steve On Fri, 2010-05-14 at 18:02 +0800, Javen Wu wrote: I forgot to mention the version I used. I used SLES11-SP1-HAE Beta5 Pacemaker 1.0.7 Corosync 1.2.0 Cluster Glue 1.0.3 2010/5/14 Javen Wu wu.ja...@gmail.com Hi Folks, I set up a three-node cluster with SBD STONITH configured. After I manually isolated one node by running ifconfig eth1 down on the node, the node was fenced as expected. But after reboot, even though the network is recovered, the node is killed again once I start openais & pacemaker. I saw the state of the node go from OFFLINE to ONLINE in `crm_mon -n` before it was killed. And I saw the SBD slot go from reset to clear to reset. I attached the syslog and corosync log. And my CIB configuration is very simple. Could you help me check what's the problem? In my mind, it's not expected behaviour. ===%CIB information= <cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1" have-quorum="1" admin_epoch="0" epoch="349" num_updates="99" cib-last-written="Fri May 14 14:50:21 2010" dc-uuid="vm209"> <configuration> <crm_config> <cluster_property_set id="cib-bootstrap-options"> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.1-530add2a3721a0ecccb24660a97dbfdaa3e68f51"/> <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/> <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3"/> </cluster_property_set> </crm_config> <nodes> <node id="vm208" uname="vm208" type="normal"/> <node id="vm209" uname="vm209" type="normal"/> <node id="vm210" uname="vm210" type="normal"/> </nodes> <resources> <clone id="Fencing"> <primitive class="stonith" id="sbd-fencing" type="external/sbd"> <instance_attributes id="sbd-fencing-instance_attributes"> <nvpair id="sbd-fencing-instance_attributes-sbd_device" name="sbd_device" value="/dev/sdc"/> </instance_attributes> <operations> <op id="sbd-fencing-monitor-20s" interval="20s" name="monitor"/> </operations> </primitive> </clone> </resources> <constraints/> <rsc_defaults/> <op_defaults/> </configuration> <status> <node_state id="vm209" uname="vm209" ha="active" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="post_cache_update" shutdown="0"> <transient_attributes id="vm209"> <instance_attributes id="status-vm209"> <nvpair id="status-vm209-probe_complete" name="probe_complete" value="true"/> </instance_attributes> </transient_attributes> <lrm id="vm209"> <lrm_resources> <lrm_resource id="sbd-fencing:0" type="external/sbd" class="stonith"> <lrm_rsc_op id="sbd-fencing:0_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="4:1:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" transition-magic="0:7;4:1:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" call-id="2" rc-code="7" op-status="0" interval="0" last-run="1273820137" last-rc-change="1273820137" exec-time="60" queue-time="0" op-digest="4c3fd39434577fbb6540606d808ed050"/> <lrm_rsc_op id="sbd-fencing:0_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="5:1:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" transition-magic="0:0;5:1:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" call-id="3" rc-code="0" op-status="0" interval="0" last-run="1273820137" last-rc-change="1273820137" exec-time="10" queue-time="0" op-digest="4c3fd39434577fbb6540606d808ed050"/> <lrm_rsc_op id="sbd-fencing:0_monitor_2" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="6:2:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" transition-magic="0:0;6:2:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" call-id="4" rc-code="0" op-status="0" interval="2" last-run="1273822956" last-rc-change="1273820137" exec-time="1170" queue-time="0" op-digest="4029bbaef749649e82d602afb46dd872"/> </lrm_resource> </lrm_resources> </lrm
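[Editor's note] For reference, the iptables-based test Steve recommends can be as simple as silently dropping totem traffic on the cluster's UDP port (5405 is the common default; adjust to your mcastport). Unlike ifconfig eth1 down, this leaves the interface up, so corosync sees a genuine peer loss:

# on one node: drop cluster traffic to simulate a partition
iptables -A INPUT -p udp --dport 5405 -j DROP
iptables -A OUTPUT -p udp --dport 5405 -j DROP

# undo afterwards
iptables -D INPUT -p udp --dport 5405 -j DROP
iptables -D OUTPUT -p udp --dport 5405 -j DROP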
Re: [Pacemaker] High load issues
On Thu, 2010-02-04 at 16:09 +0100, Dominik Klein wrote: Hi people, I'll take the risk of annoying you, but I really think this should not be forgotten. If there is high load on a node, the cluster seems to have problems recovering from that. I'd expect the cluster to recognize that a node is unresponsive, stonith it and start services elsewhere. By unresponsive I mean not being able to use the cluster's service, not being able to ssh into the node. I am not sure whether this is an issue of pacemaker (iiuc, beekhof seems to think it is not) or corosync (iiuc, sdake seems to think it is not) or maybe a configuration/thinking thing on my side (which might just be). Anyway, attached you will find a hb_report which covers the startup of the cluster nodes, then what it does when there is high load and no memory left. Then I killed the load producing things and almost immediately, the cluster cleaned up things. I had at least expected that after I saw FAILED status in crm_mon, that after the configured timeouts for stop (120s max in my case), the failover should happen, but it did not. What I did to produce load: * run several md5sum $file on 1gig files * run several heavy sql statements on large tables * saturate(?) the nic using netcat -l on the busy node and netcat -w fed by /dev/urandom on another node * start a forkbomb script which does while (true); do bash $0; done; Used versions: corosync 1.2.0 pacemaker 1.0.7 64 bit packages from clusterlabs for opensuse 11.1 The forkbomb triggers an OOM situation. In Linux, when OOM happens, really all bets are off as to what will occur. I expect that the system would work properly without the forkbomb. Could you try that? Corosync actually works quite well in OOM situations and usually doesn't detect this as a failure unless the oom killer blows away the corosync process. To corosync, the node is fully operational (because it is designed to work in an OOM situation). Detecting memory overcommit and doing something about it may be something we should do with Corosync. But generally I believe this test case is invalid. A system should be properly sized memory-wise to handle the applications that are intended to run on it. Really sounds like a deployment issue if the systems don't contain the appropriate RAM to run the applications. I believe there is a way of setting affinity in the OOM killer but it's been 4 years since I've worked on the kernel full-time so I don't know the details. One option is to set the affinity to always try to blow away the corosync process. Then you would get fencing in this condition. Regards -steve If you need more information, want me to try patches, whatever, please let me know. Regards Dominik ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
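[Editor's note] The OOM-killer steering Steve mentions is available on 2.6-era kernels via /proc/<pid>/oom_adj (newer kernels use oom_score_adj instead). A sketch of both directions:

# make corosync the OOM killer's preferred victim, so an OOM'd node gets fenced
echo 15 > /proc/$(pidof corosync)/oom_adj

# or the opposite: exempt corosync from the OOM killer entirely
echo -17 > /proc/$(pidof corosync)/oom_adj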
[Pacemaker] thread safety problem with pacemaker and corosync integration
For some time people have reported segfaults on startup when using pacemaker as a plugin to corosync, related to tzset in the stack trace. I believe we had fixed this by removing the thread-unsafe usage of localtime and strftime calls in the code base of corosync in 1.2.0. Further investigation by H.J. Lee identified a problem with localtime_r calling tzset, which calls getenv(). If at about the same time another thread calls setenv(), the other thread's getenv could segfault. syslog() also calls localtime_r in glibc. On some rare occasions Pacemaker calls setenv() while corosync executes a syslog operation, resulting in a segfault. POSIX is clear on this issue - tzset should be thread safe, localtime_r should be thread safe, syslog should be thread safe. Some C library implementations of these functions unfortunately are not thread safe when used in conjunction with setenv, because they use getenv internally (which is not required to be thread safe by POSIX). Our short term plan is to work around these problems in glibc by doing the following: 1) providing a getenv/setenv api inside coroapi.h so that corosync internal code and third party plugins such as pacemaker can use a mutex-protected getenv/setenv 2) porting our syslog-direct-communication code from whitetank and avoiding the syslog C library api (which again uses localtime_r) entirely 3) implementing a localtime_r replacement which does not call tzset on each execution, so that timestamp:on operational mode does not suffer from this same problem If you're suffering from this issue, please be aware we have a root cause and will get it resolved. Regards -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
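[Editor's note] A minimal sketch of the kind of mutex-protected wrapper item 1 describes; the coro_* names are illustrative, not the actual coroapi.h symbols. Note it only serializes callers that go through the wrapper - glibc's own getenv use inside syslog/localtime_r is untouched, which is why items 2 and 3 are also needed:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t env_mutex = PTHREAD_MUTEX_INITIALIZER;

/* illustrative names; the real coroapi.h interface may differ */
char *coro_getenv (const char *name)
{
	char *value;

	pthread_mutex_lock (&env_mutex);
	value = getenv (name);
	pthread_mutex_unlock (&env_mutex);
	return (value);
}

int coro_setenv (const char *name, const char *value, int overwrite)
{
	int res;

	pthread_mutex_lock (&env_mutex);
	res = setenv (name, value, overwrite);
	pthread_mutex_unlock (&env_mutex);
	return (res);
}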
Re: [Pacemaker] errors in corosync.log
One possibility is you have a different cluster in your network on the same multicast address and port. Regards -steve On Sat, 2010-01-16 at 15:20 -0500, Shravan Mishra wrote: Hi Guys, I'm running the following version of pacemaker and corosync corosync=1.1.1-1-2 pacemaker=1.0.9-2-1 Everything had been running fine for quite some time now, but then I started seeing the following errors in the corosync logs, = Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid digest... ignoring. Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid digest... ignoring. Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid digest... ignoring. Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data I can perform all the crm shell commands and what not, but it's troubling that the above is happening. My crm_mon output looks good. I also checked the authkey and did md5sum on both; it's the same. Then I stopped corosync and regenerated the authkey with corosync-keygen and copied it to the other machine, but I still get the above message in the corosync log. Is there anything other than the authkey that I should look into? corosync.conf # Please read the corosync.conf.5 manual page compatibility: whitetank totem { version: 2 token: 3000 token_retransmits_before_loss_const: 10 join: 60 consensus: 1500 vsftype: none max_messages: 20 clear_node_high_bit: yes secauth: on threads: 0 rrp_mode: passive interface { ringnumber: 0 bindnetaddr: 192.168.2.0 #mcastaddr: 226.94.1.1 broadcast: yes mcastport: 5405 } interface { ringnumber: 1 bindnetaddr: 172.20.20.0 #mcastaddr: 226.94.1.1 broadcast: yes mcastport: 5405 } } logging { fileline: off to_stderr: yes to_logfile: yes to_syslog: yes logfile: /tmp/corosync.log debug: off timestamp: on logger_subsys { subsys: AMF debug: off } } service { name: pacemaker ver: 0 } aisexec { user: root group: root } amf { mode: disabled } === Thanks Shravan ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] mcast vs broadcast
On Mon, 2010-01-18 at 11:25 -0500, Shravan Mishra wrote: Hi all, Following is my corosync.conf. Even though broadcast is enabled I see mcasted messages like these in corosync.log. Is that OK even when broadcast is on and not mcast? Yes, you are using broadcast, and the debug output doesn't print a special case for broadcast (but it really is broadcasting). This output is debug output meant for developer consumption. It is really not all that useful for end users. == Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173 Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq 172 to pending delivery queue Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq 173 to pending delivery queue Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173 Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172 Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173 = === # Please read the corosync.conf.5 manual page compatibility: whitetank totem { version: 2 token: 3000 token_retransmits_before_loss_const: 10 join: 60 consensus: 1500 vsftype: none max_messages: 20 clear_node_high_bit: yes secauth: on threads: 0 rrp_mode: passive interface { ringnumber: 0 bindnetaddr: 192.168.2.0 # mcastaddr: 226.94.1.1 broadcast: yes mcastport: 5405 } interface { ringnumber: 1 bindnetaddr: 172.20.20.0 #mcastaddr: 226.94.2.1 broadcast: yes mcastport: 5405 } } logging { fileline: off to_stderr: yes to_logfile: yes to_syslog: yes logfile: /tmp/corosync.log debug: on timestamp: on logger_subsys { subsys: AMF debug: off } } service { name: pacemaker ver: 0 } aisexec { user: root group: root } amf { mode: disabled } = Thanks Shravan ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pacemaker/OpenAIS Software for openSuSE 11.2
d) If I would try to compile from source as described at http://www.clusterlabs.org/wiki/Install#First_Steps one step is to get openais. Why are all the relevant prebuilt library packages called corosync? I don't understand the distinction between openais and corosync read this link: http://www.corosync.org/doku.php?id=faq:why Corosync used to be part of Openais. Then they split it into two parts to make maintenance easier. From their home page: "The OpenAIS software is built to operate on the Corosync Cluster Engine" - and how these two pieces fit together. By the way: Their homepage doesn't enlighten me either. Enough questions for a restart. Best regards Andreas Mock ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] openais/corosync
On Mon, 2010-01-11 at 19:59 +0100, Andreas Mock wrote: Hi all, I don't understand the distinction between openais and corosync. The prebuilt packages are named after corosync while the documentation always talks about openais. See reasoning here: http://www.corosync.org/doku.php?id=faq:why The info I get from the homepages of openais/corosync doesn't help either. There is one paper on corosync's homepage saying that pacemaker is using corosync, while the installation guide at http://www.clusterlabs.org/wiki/Install#OpenAIS.2A says to download openais. The clusterlabs documentation is technically correct. You can use openais whitetank (see link above) but I recommend just using Corosync instead. Can someone enlighten me, even if this may be more related to openais/corosync? I'm sure that the users of this code can tell me how the parts fit together. ;-) Best regards Andreas Mock ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] openais/corosync
On Mon, 2010-01-11 at 21:00 +0100, Andreas Mock wrote: -Original Message- From: Steven Dake sd...@redhat.com Sent: 11.01.10 20:13:39 To: pacema...@clusterlabs.org Subject: Re: [Pacemaker] openais/corosync See reasoning here: http://www.corosync.org/doku.php?id=faq:why Hi Steve, thank you for that link. A piece of documentation I didn't find. They know why they have improved documentation on their 2010 agenda. ;-) Ya, it's pretty clear Corosync documentation is weak. We really focused on developing a great quality implementation and a good release model at the expense of all other activities such as documentation and project marketing. We hope developers can deal with the documentation warts in the near term until we sort that out. In most cases, users don't need much documentation on Corosync at all except managing corosync.conf, which is very well documented in man pages. Corosync's functionality should mostly be hidden behind applications' functionality. That said, we do want to improve documentation. Beyond man pages for all tools and APIs, we would eventually like to produce a user guide and a separate developer guide which may number 100-200 PDF pages combined. These objectives will happen this year. Regards -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] corosync init script broken
Hopefully all of these init script problems have been fixed in 1.2.0 by Fabio and Andrew and should be in a repo available for you soon. Regards -steve On Mon, 2009-12-28 at 13:22 +0100, Dominik Klein wrote: Hi cluster people, been a while, couldn't really follow things. Today I was tasked to install a new cluster, went for 1.0.6 and corosync as described on the wiki and hit this: New cluster with pacemaker 1.0.6 and the latest available corosync from the clusterlabs.org/rpm opensuse 11.1 repo. This installs /etc/init.d/corosync. start says OK, but does not start corosync. Starting it manually works, but then stop never returns. This is because the internal status check in the script calls killall -0 corosync. This finds the /etc/init.d/corosync script itself, therefore start returns early and stop never returns. Workaround: Rename /etc/init.d/corosync I can't believe I am the first one to hit this. Am I? Regards Dominik ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
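[Editor's note] A sketch of the kind of status check that sidesteps the trap: pidof given a full pathname only matches processes actually running that binary, so the shell interpreting /etc/init.d/corosync can never match itself (assumes the daemon lives at /usr/sbin/corosync):

# inside the init script: exit status 0 only if the real daemon is running
status() {
    pidof /usr/sbin/corosync >/dev/null 2>&1
}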
Re: [Pacemaker] corosync not able to exec services properly
If you're using corosync 1.2.0, we enforced a constraint on consensus and token such that consensus must be at least 1.2 * token. Your consensus is 1/2 of the token, which will cause corosync to exit at start. Regards -steve On Mon, 2009-12-28 at 12:58 +0100, Dejan Muhamedagic wrote: Hi, On Thu, Dec 24, 2009 at 02:35:01PM -0500, Shravan Mishra wrote: Hi Guys, I had a perfectly running system for about 3 weeks now, but now on reboot I see problems. Looks like the processes are being spawned and respawned but a proper exec is not happening. According to the logs, attrd can't start (exit code 100) for some reason (perhaps there are more logs elsewhere where it says what's wrong) and pengine segfaults. For the latter please enable coredumps (ulimit -c unlimited) and file a bugzilla. Am I missing some permissions on directories? I have a script which does the following for directories: Why do you need this script? It should be done by the package installation scripts. = getent group haclient >/dev/null || groupadd -r haclient getent passwd hacluster >/dev/null || useradd -r -g haclient -d /var/lib/heartbeat/cores/hacluster -s /sbin/nologin -c "cluster user" hacluster if [ ! -d /var/lib/pengine ];then mkdir /var/lib/pengine fi chown -R hacluster:haclient /var/lib/pengine if [ ! -d /var/lib/heartbeat ];then mkdir /var/lib/heartbeat fi if [ ! -d /var/lib/heartbeat/crm ];then mkdir /var/lib/heartbeat/crm fi chown -R hacluster:haclient /var/lib/heartbeat/crm/ chmod 750 /var/lib/heartbeat/crm/ if [ ! -d /var/lib/heartbeat/ccm ];then mkdir /var/lib/heartbeat/ccm fi chown -R hacluster:haclient /var/lib/heartbeat/ccm/ chmod 750 /var/lib/heartbeat/ccm/ if [ ! -d /var/run/heartbeat/ ];then mkdir /var/run/heartbeat/ fi if [ ! -d /var/run/heartbeat/ccm ];then mkdir /var/run/heartbeat/ccm/ fi chown -R hacluster:haclient /var/run/heartbeat/ccm/ chmod 750 /var/run/heartbeat/ccm/ You don't need ccm for corosync/openais clusters. if [ ! -d /var/run/heartbeat/crm ];then mkdir /var/run/heartbeat/crm/ fi chown -R hacluster:haclient /var/run/heartbeat/crm/ chmod 750 /var/run/heartbeat/crm/ if [ ! -d /var/run/crm ];then mkdir /var/run/crm fi if [ ! -d /var/lib/corosync ];then mkdir /var/lib/corosync fi = I have a very simple active-passive configuration with just 2 nodes. On starting Corosync and doing [r...@node2 ~]# ps -ef | grep coro root 8242 1 0 11:33 ? 00:00:00 /usr/sbin/corosync root 8248 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync root 8249 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync root 8250 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync root 8252 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync root 8393 8242 0 11:35 ? 00:00:00 /usr/sbin/corosync [r...@node2 ~]# ps -ef | grep heart 827924 1 0 11:28 ? 00:00:00 /usr/lib64/heartbeat/pengine I'm attaching the log file. My config is: # Please read the corosync.conf.5 manual page compatibility: whitetank totem { version: 2 token: 3000 token_retransmits_before_loss_const: 10 join: 60 consensus: 1500 vsftype: none max_messages: 20 clear_node_high_bit: yes secauth: on threads: 0 rrp_mode: passive interface { ringnumber: 0 bindnetaddr: 192.168.1.0 # mcastaddr: 226.94.1.1 broadcast: yes mcastport: 5405 } interface { ringnumber: 1 bindnetaddr: 172.20.20.0 # mcastaddr: 226.94.1.1 broadcast: yes mcastport: 5405 } } logging { fileline: off to_stderr: yes to_logfile: yes to_syslog: yes logfile: /tmp/corosync.log Don't log to file. Can't recall exactly but there were some permission problems with that, probably because Pacemaker daemons don't run as root.
Thanks, Dejan debug: on timestamp: on logger_subsys { subsys: AMF debug: off } } service { name: pacemaker ver: 0 } aisexec { user: root group: root } amf { mode: disabled } Please help. Sincerely Shravan ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
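[Editor's note] For reference, a totem fragment that satisfies the corosync 1.2.0 constraint Steve describes, keeping Shravan's 3000 ms token (only the two relevant options shown):

totem {
    token: 3000
    # must be at least 1.2 * token with corosync 1.2.0, so >= 3600
    consensus: 3600
}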
Re: [Pacemaker] Fedora 12 repository
Pacemaker is integrated directly into the Fedora repo instead of provided externally. You can grab it using yum install pacemaker. Regards -steve On Sun, 2009-12-20 at 11:46 -0500, E-Blokos wrote: Hi, is there any yum repository for Fedora 12? I checked http://download.opensuse.org/repositories/server%3A/ha-clustering but there are only folders for 10 and 11. Thanks Franck ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Node crash when 'ifdown eth0'
On Mon, 2009-11-30 at 17:05 -0700, hj lee wrote: On Fri, Nov 27, 2009 at 3:05 PM, Steven Dake sd...@redhat.com wrote: On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote: I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using openais) with centos 5.4. The packages are from here: http://www.clusterlabs.org/rpm/epel-5/ Mark On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remírez de Ganuza Satrústegui oscar...@unav.es wrote: Good morning, We are testing a cluster configuration on RHEL5 (x86_64) with pacemaker 1.0.5 and openais (0.80.5). Two node cluster, active-passive, with the following resources: Mysql service resource and a NFS filesystem resource (shared storage in a SAN). In our tests, when we bring down the network interface (ifdown eth0), the What is the use case for ifdown eth0 (ie what are you trying to verify)? I have the same test case. In my case, when the two-node cluster is disconnected, I want to see split-brain. And then I want to see the split-brain handler reset one of the nodes. What I want to verify is that the cluster will recover from network disconnection and a split-brain situation. ifconfig eth0 down is totally different from testing whether there is a node disconnection. When corosync detects eth0 being taken down, it binds to the interface 127.0.0.1. This is probably not what you had in mind when you wanted to test split brain. Keep in mind an interface taken out of service is different from an interface failing, from a POSIX API perspective. What you really want to test is pulling the network cable between the machines. Regards -steve Thanks hj ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Node crash when 'ifdown eth0'
On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote: I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using openais) with centos 5.4. The packages are from here: http://www.clusterlabs.org/rpm/epel-5/ Mark On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remírez de Ganuza Satrústegui oscar...@unav.es wrote: Good morning, We are testing a cluster configuration on RHEL5 (x86_64) with pacemaker 1.0.5 and openais (0.80.5). Two node cluster, active-passive, with the following resources: Mysql service resource and a NFS filesystem resource (shared storage in a SAN). In our tests, when we bring down the network interface (ifdown eth0), the What is the use case for ifdown eth0 (ie what are you trying to verify)? I recommend using the latest pacemaker and corosync as well if you're doing a new deployment. Regards -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] pacemaker-1.0.6 + corosync 1.1.2 crashing
Nik, Any chance you have a backtrace of the core files? That might be helpful in pinpointing the issue. To do this, run gdb binaryname corefilename and then type bt at the (gdb) prompt. Regards -steve On Thu, 2009-11-19 at 17:50 +0100, Nikola Ciprich wrote: Hi Andrew, sorry to bother again, do you have some idea what else might be wrong? Does it make sense to CC the openais or cluster mailing list? Is there some other debugging you would recommend? with best regards nik On Wed, Nov 18, 2009 at 03:26:28PM +0100, Nikola Ciprich wrote: I've packaged those myself, all are based on clean sources without any additional patches. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Resource capacity limit
On Thu, 2009-11-12 at 14:53 +0100, Andrew Beekhof wrote: On Wed, Nov 11, 2009 at 1:36 PM, Lars Marowsky-Bree l...@suse.de wrote: On 2009-11-05T14:45:36, Andrew Beekhof and...@beekhof.net wrote: Lastly, I would really like to defer this to 1.2. I know I've bent the rules a bit for 1.0 in the past, but it's really late in the game now. Personally, I think the Linux kernel model works really well, i.e., no major releases any more, but bugfixes and features alike get merged over time and constantly. That's a great model if you've got hordes of developers and testers. Of which we have neither. At this point in time, I can't see us going back to the way heartbeat releases were done. If there was a single thing that I'd credit Pacemaker's current reliability to, it would be our release strategy. Maintaining corosync and openais, I'd surely like to only have one tree where all work is done and never have a stable branch. Andrew is right though; this model only works if there is large downstream adoption and support, and distros take on the work of stabilizing the efforts of the trunk development. Talking with distros, I know this is generally not the case with any package other than kernel.org and maybe some related bits like xen/kvm (which has forced this model upon them). Regards -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] pacemaker-1.0.6 + corosync 1.1.2 crashing
Nikola, yet another possibility is your box doesn't have any/enough shared memory available. Usually this is in the directory /dev/shm. Unfortunately bad things happen and error handling around this condition needs some work. It's hard to tell because the signal delivered to the application on failure is not shown in your backtrace. For example I have plenty of shared memory available (command is from df). tmpfs 1027020 3560 1023460 1% /dev/shm Regards -steve On Tue, 2009-11-10 at 10:28 +0100, Nikola Ciprich wrote: Hello Andrew et al, few days ago, I asked about pacemaker + corosync + clvmd etc. With your advice, I got this working well. It was in testing virtual machines; I'm now trying to install a similar setup on raw hardware, but for some reason attrd and cib seem to be crashing. Here's a snippet from the corosync log: Nov 10 14:12:21 vbox3 corosync[4299]: [MAIN ] Corosync Cluster Engine ('1.1.2'): started and ready to provide service. Nov 10 14:12:21 vbox3 corosync[4299]: [MAIN ] Corosync built-in features: nss rdma Nov 10 14:12:21 vbox3 corosync[4299]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Nov 10 14:12:21 vbox3 corosync[4299]: [TOTEM ] Initializing transport (UDP/IP). Nov 10 14:12:21 vbox3 corosync[4299]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Nov 10 14:12:21 vbox3 corosync[4299]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Nov 10 14:12:21 vbox3 corosync[4299]: [TOTEM ] The network interface [10.58.0.1] is now up. Nov 10 14:12:21 vbox3 corosync[4299]: [pcmk ] info: process_ais_conf: Reading configure Nov 10 14:13:16 vbox3 corosync[4348]: [MAIN ] Corosync Cluster Engine ('1.1.2'): started and ready to provide service. Nov 10 14:13:16 vbox3 corosync[4348]: [MAIN ] Corosync built-in features: nss rdma Nov 10 14:13:16 vbox3 corosync[4348]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Nov 10 14:13:16 vbox3 corosync[4348]: [TOTEM ] Initializing transport (UDP/IP). Nov 10 14:13:16 vbox3 corosync[4348]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Nov 10 14:13:16 vbox3 corosync[4348]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Nov 10 14:13:16 vbox3 corosync[4348]: [TOTEM ] The network interface [10.58.0.1] is now up. Nov 10 14:13:16 vbox3 corosync[4348]: [pcmk ] info: process_ais_conf: Reading configure Nov 10 14:13:24 vbox3 corosync[4357]: [MAIN ] Corosync Cluster Engine ('1.1.2'): started and ready to provide service. Nov 10 14:13:24 vbox3 corosync[4357]: [MAIN ] Corosync built-in features: nss rdma Nov 10 14:13:24 vbox3 corosync[4357]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Nov 10 14:13:24 vbox3 corosync[4357]: [TOTEM ] Initializing transport (UDP/IP). Nov 10 14:13:24 vbox3 corosync[4357]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Nov 10 14:13:24 vbox3 corosync[4357]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Nov 10 14:13:24 vbox3 corosync[4357]: [TOTEM ] The network interface [10.58.0.1] is now up. Nov 10 14:13:24 vbox3 corosync[4357]: [pcmk ] info: process_ais_conf: Reading configure Nov 10 14:13:57 vbox3 corosync[4380]: [MAIN ] Corosync Cluster Engine ('1.1.2'): started and ready to provide service.
Nov 10 14:13:57 vbox3 corosync[4380]: [MAIN ] Corosync built-in features: nss rdma Nov 10 14:13:57 vbox3 corosync[4380]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Nov 10 14:13:57 vbox3 corosync[4380]: [TOTEM ] Initializing transport (UDP/IP). Nov 10 14:13:57 vbox3 corosync[4380]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Nov 10 14:13:57 vbox3 corosync[4380]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Nov 10 14:13:58 vbox3 corosync[4380]: [TOTEM ] The network interface [10.58.0.1] is now up. Nov 10 14:13:58 vbox3 corosync[4380]: [pcmk ] info: process_ais_conf: Reading configure Nov 10 14:13:58 vbox3 corosync[4380]: [pcmk ] info: config_find_init: Local handle: 9213452461992312833 for logging Nov 10 14:13:58 vbox3 corosync[4380]: [pcmk ] info: config_find_next: Processing additional logging options... Nov 10 14:13:58 vbox3 corosync[4380]: [pcmk ] info: get_config_opt: Found 'off' for option: debug Nov 10 14:13:58 vbox3 corosync[4380]: [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file Nov 10 14:13:58 vbox3 corosync[4380]: [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility Nov 10
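A quick way to check the shared memory situation Steve describes is (commands are illustrative; the tmpfs mount point can vary by distribution):
df -h /dev/shm
mount | grep shm
If /dev/shm is missing or nearly full, freeing files there or remounting the tmpfs with a larger size is worth trying before chasing the crash further.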
Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped
On Thu, 2009-11-05 at 00:06 +0100, Colin wrote: On Wed, Nov 4, 2009 at 5:47 PM, Andrew Beekhof and...@beekhof.net wrote: Hopelessly out of date? Corosync has been supported for all of 3 days now. Sorry, it seems that I jumped to a wrong conclusion (namely that with Corosync being a part of OpenAIS, and Pacemaker having run on OpenAIS for a while, that there wasn't much difference to supporting Corosync instead of OpenAIS -- shows that I'm still quite ignorant about some of the internals.) Actually, I set up Pacemaker with Corosync from the new packages, just to see what it looks like, and it was so easy that we'll stick to it for the next round of tests. In other words, the details of the cluster underneath Pacemaker are so well hidden that (a) it doesn't make much difference, and (b) my ignorance in that area never was a problem: It just works. -Colin The intent with Corosync was that the migration path for users is mostly seamless, and we have more or less nailed that with the exception of a few configuration file and CLI binary renames (and of course a new ABI for Pacemaker to program to, which was not painless for Andrew :). Regards -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
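For reference, the renames mentioned above work out roughly as follows (these are the usual default paths; check your distribution's packaging):
/etc/ais/openais.conf -> /etc/corosync/corosync.conf
aisexec -> corosync
openais-cfgtool -> corosync-cfgtool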
Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped
On Wed, 2009-11-04 at 09:35 +0800, Romain CHANU wrote: Hi Martin, Could you tell us what's the rationale to remove openais and include corosync? Would it mean that people should use corosync from now on for any HA development? Best Regards, Romain Chanu Just a short note: I would also recommend making available the latest openais packages, which complement both corosync and pacemaker with the SA Forum compliant APIs. Regards -steve 2009/11/3 Martin Gerhard Loschwitz martin.loschw...@linbit.com Ladies and Gentlemen, I am happy to announce the availability of Pacemaker 1.0.6 packages for Debian GNU/Linux 5.0 alias Lenny (i386 and amd64). These packages are a remarkable break, as they have totally and ruthlessly been revamped. The whole layout has actually changed; here are the most important things to keep in mind when using them: * pacemaker-openais and pacemaker-heartbeat are gone; pacemaker now only comes in one flavour, having support for corosync and heartbeat built in. This is based on pacemaker's capability to detect by which messaging framework it has been started and act accordingly. * openais is gone. pacemaker 1.0.6 uses corosync. * the new layout allows flawless updates. if you have heartbeat 2.1.4 and do a dist-upgrade, you will automatically get pacemaker. all you need to do afterwards is convert the xml file to work with pacemaker -- you can then start heartbeat, and things are going to be fine (more on this can be found in the Clusterlabs Wiki) * Now that we finally have a decent layout for pacemaker, we can easily provide gui packages: welcome pacemaker-mgmt, being in good condition and shape now, allowing you to administer your cluster via a GTK tool. The new packages can as always be found at: deb http://people.debian.org/~madkiss/ha lenny main deb-src http://people.debian.org/~madkiss/ha lenny main -- : Martin G. Loschwitz Tel +43-1-8178292-63 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com : ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
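Putting the announcement into practice on a Lenny box looks something like this (repository line as given above; package names could change in future revisions):
echo 'deb http://people.debian.org/~madkiss/ha lenny main' >> /etc/apt/sources.list
apt-get update
apt-get install pacemaker pacemaker-mgmt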
Re: [Pacemaker] corosync doesn't stop all services
We had to change both pacemaker and corosync for this problem. I suspect you don't have the updated pacemaker. Regards -steve On Wed, 2009-10-21 at 15:11 +0200, Michael Schwartzkopff wrote: Hi, perhaps this is the wrong list but anyway: I have corosync-1.1.1 and pacemaker-1.0.5 on Debian Lenny. When I start corosync everything looks fine. But when I stop corosync I still see a lot of heartbeat processes. I thought this was fixed in corosync-1.1.1, so what might be the problem? # ps uax | grep heart root 2083 0.0 0.4 4884 1220 pts/1S 17:04 0:00 /usr/lib/heartbeat/ha_logd -d root 2084 0.0 0.3 4884 820 pts/1S 17:04 0:00 /usr/lib/heartbeat/ha_logd -d root 2099 0.0 4.1 10712 10712 ?SLs 17:04 0:00 /usr/lib/heartbeat/stonithd 104 2100 0.1 1.4 12768 3748 ?S 17:04 0:00 /usr/lib/heartbeat/cib root 2101 0.0 0.7 5352 1800 ?S 17:04 0:00 /usr/lib/heartbeat/lrmd 104 2102 0.0 1.0 12260 2596 ?S 17:04 0:00 /usr/lib/heartbeat/attrd 104 2103 0.0 1.1 8880 3024 ?S 17:04 0:00 /usr/lib/heartbeat/pengine 104 2104 0.0 1.2 12404 3176 ?S 17:04 0:00 /usr/lib/heartbeat/crmd root 2140 0.0 0.2 3116 720 pts/1R+ 17:08 0:00 grep heart ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
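A quick check for whether your packages carry the fix (commands are illustrative; the bracket trick just keeps grep from matching itself):
service corosync stop
ps uax | grep '[h]eartbeat'
With updated pacemaker and corosync packages the second command should print nothing.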
Re: [Pacemaker] pacemaker unable to start
I recommend using corosync 1.1.1 - several bug fixes, one critical for proper pacemaker operation. It won't fix this particular problem however. Corosync loads pacemaker by searching for a pacemaker lcrso file. These files are installed by default in /usr/libexec/lcrso but may be in a different location depending on your distribution. Regards -steve On Wed, 2009-10-21 at 11:13 -0400, Shravan Mishra wrote: Hello guys, We are running corosync-1.0.0 heartbeat-2.99.1 pacemaker-1.0.4 the corosync.conf under /etc/corosync/ is # Please read the corosync.conf.5 manual page compatibility: whitetank aisexec { user: root group: root } totem { version: 2 secauth: off threads: 0 interface { ringnumber: 0 bindnetaddr: 172.30.0.0 mcastaddr: 226.94.1.1 mcastport: 5406 } } logging { fileline: off to_stderr: yes to_logfile: yes to_syslog: yes logfile: /tmp/corosync.log debug: on timestamp: on logger_subsys { subsys: pacemaker debug: on tags: enter|leave|trace1|trace2| trace3|trace4|trace6 } } service { name: pacemaker ver: 0 # use_mgmtd: yes # use_logd: yes } corosync { user: root group: root } amf { mode: disabled } #service corosync start starts the messaging but fails to load pacemaker, /tmp/corosync.log --- == Oct 21 11:05:43 corosync [MAIN ] Corosync Cluster Engine ('trunk'): started and ready to provide service. Oct 21 11:05:43 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. Oct 21 11:05:43 corosync [TOTEM ] Token Timeout (1000 ms) retransmit timeout (238 ms) Oct 21 11:05:43 corosync [TOTEM ] token hold (180 ms) retransmits before loss (4 retrans) Oct 21 11:05:43 corosync [TOTEM ] join (50 ms) send_join (0 ms) consensus (800 ms) merge (200 ms) Oct 21 11:05:43 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (50 msgs) Oct 21 11:05:43 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1500 Oct 21 11:05:43 corosync [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (17 messages) Oct 21 11:05:43 corosync [TOTEM ] send threads (0 threads) Oct 21 11:05:43 corosync [TOTEM ] RRP token expired timeout (238 ms) Oct 21 11:05:43 corosync [TOTEM ] RRP token problem counter (2000 ms) Oct 21 11:05:43 corosync [TOTEM ] RRP threshold (10 problem count) Oct 21 11:05:43 corosync [TOTEM ] RRP mode set to none. Oct 21 11:05:43 corosync [TOTEM ] heartbeat_failures_allowed (0) Oct 21 11:05:43 corosync [TOTEM ] max_network_delay (50 ms) Oct 21 11:05:43 corosync [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed 0 Oct 21 11:05:43 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Oct 21 11:05:43 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes). Oct 21 11:05:43 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes). Oct 21 11:05:43 corosync [TOTEM ] The network interface [172.30.0.145] is now up. Oct 21 11:05:43 corosync [TOTEM ] Created or loaded sequence id 184.172.30.0.145 for this ring. Oct 21 11:05:43 corosync [TOTEM ] entering GATHER state from 15. Oct 21 11:05:43 corosync [SERV ] Service failed to load 'pacemaker'. 
Oct 21 11:05:43 corosync [SERV ] Service initialized 'corosync extended virtual synchrony service' Oct 21 11:05:43 corosync [SERV ] Service initialized 'corosync configuration service' Oct 21 11:05:43 corosync [SERV ] Service initialized 'corosync cluster closed process group service v1.01' Oct 21 11:05:43 corosync [SERV ] Service initialized 'corosync cluster config database access v1.01' Oct 21 11:05:43 corosync [SERV ] Service initialized 'corosync profile loading service' Oct 21 11:05:43 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Oct 21 11:05:43 corosync [TOTEM ] Creating commit token because I am the rep. Oct 21 11:05:43 corosync [TOTEM ] Saving state aru 0 high seq received 0 Oct 21 11:05:43 corosync [TOTEM ] Storing new sequence id for ring bc Oct 21 11:05:43 corosync [TOTEM ] entering COMMIT state. Oct 21 11:05:43 corosync [TOTEM ] got commit token Oct 21 11:05:43 corosync [TOTEM ] entering RECOVERY state. Oct 21 11:05:43 corosync [TOTEM ] position [0] member 172.30.0.145: Oct 21 11:05:43 corosync [TOTEM ] previous ring seq 184 rep 172.30.0.145 Oct 21 11:05:43 corosync [TOTEM ] aru 0 high delivered 0 received flag 1 Oct 21 11:05:43 corosync [TOTEM ] Did not need to originate any messages in recovery. Oct 21 11:05:43 corosync [TOTEM ] got commit token Oct 21
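To see where (or whether) the pacemaker service engine was installed, a filesystem search like this helps (the path shown is only the usual default):
find / -name 'pacemaker*.lcrso' 2>/dev/null
If nothing turns up, pacemaker was built without corosync support or installed its lcrso outside the directory corosync searches.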
Re: [Pacemaker] pacemaker unable to start
Ya, you're missing the pacemaker lcrso file. Either you didn't build pacemaker with corosync support or pacemaker didn't install that binary in the proper place. try: updatedb; locate lcrso Regards -steve On Wed, 2009-10-21 at 12:28 -0400, Shravan Mishra wrote: Steve, this is what my installation shows-- ls -l /usr/libexec/lcrso -rwxr-xr-x 1 root root 101243 Jul 29 11:21 coroparse.lcrso -rwxr-xr-x 1 root root 117688 Jul 29 11:21 objdb.lcrso -rwxr-xr-x 1 root root 92702 Jul 29 11:54 openaisserviceenable.lcrso -rwxr-xr-x 1 root root 110808 Jul 29 11:21 quorum_testquorum.lcrso -rwxr-xr-x 1 root root 159057 Jul 29 11:21 quorum_votequorum.lcrso -rwxr-xr-x 1 root root 1175430 Jul 29 11:54 service_amf.lcrso -rwxr-xr-x 1 root root 133976 Jul 29 11:21 service_cfg.lcrso -rwxr-xr-x 1 root root 218374 Jul 29 11:54 service_ckpt.lcrso -rwxr-xr-x 1 root root 139029 Jul 29 11:54 service_clm.lcrso -rwxr-xr-x 1 root root 122668 Jul 29 11:21 service_confdb.lcrso -rwxr-xr-x 1 root root 138412 Jul 29 11:21 service_cpg.lcrso -rwxr-xr-x 1 root root 125638 Jul 29 11:21 service_evs.lcrso -rwxr-xr-x 1 root root 196443 Jul 29 11:54 service_evt.lcrso -rwxr-xr-x 1 root root 194885 Jul 29 11:54 service_lck.lcrso -rwxr-xr-x 1 root root 235168 Jul 29 11:54 service_msg.lcrso -rwxr-xr-x 1 root root 120445 Jul 29 11:21 service_pload.lcrso -rwxr-xr-x 1 root root 135340 Jul 29 11:54 service_tmr.lcrso -rwxr-xr-x 1 root root 124092 Jul 29 11:21 vsf_quorum.lcrso -rwxr-xr-x 1 root root 121298 Jul 29 11:21 vsf_ykd.lcrso I also did export COROSYNC_DEFAULT_CONFIG_IFACE=openaisserviceenable:openaisparser In place of openaisparser I also tried corosyncparse and corosyncparser but to no avail. -sincerely Shravan On Wed, Oct 21, 2009 at 11:49 AM, Steven Dake sd...@redhat.com wrote: I recommend using corosync 1.1.1 - several bug fixes, one critical for proper pacemaker operation. It won't fix this particular problem however. Corosync loads pacemaker by searching for a pacemaker lcrso file. These files are installed by default in /usr/libexec/lcrso but may be in a different location depending on your distribution. Regards -steve On Wed, 2009-10-21 at 11:13 -0400, Shravan Mishra wrote: Hello guys, We are running corosync-1.0.0 heartbeat-2.99.1 pacemaker-1.0.4 the corosync.conf under /etc/corosync/ is # Please read the corosync.conf.5 manual page compatibility: whitetank aisexec { user: root group: root } totem { version: 2 secauth: off threads: 0 interface { ringnumber: 0 bindnetaddr: 172.30.0.0 mcastaddr: 226.94.1.1 mcastport: 5406 } } logging { fileline: off to_stderr: yes to_logfile: yes to_syslog: yes logfile: /tmp/corosync.log debug: on timestamp: on logger_subsys { subsys: pacemaker debug: on tags: enter|leave|trace1|trace2| trace3|trace4|trace6 } } service { name: pacemaker ver: 0 # use_mgmtd: yes # use_logd: yes } corosync { user: root group: root } amf { mode: disabled } #service corosync start starts the messaging but fails to load pacemaker, /tmp/corosync.log --- == Oct 21 11:05:43 corosync [MAIN ] Corosync Cluster Engine ('trunk'): started and ready to provide service. Oct 21 11:05:43 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. 
Oct 21 11:05:43 corosync [TOTEM ] Token Timeout (1000 ms) retransmit timeout (238 ms) Oct 21 11:05:43 corosync [TOTEM ] token hold (180 ms) retransmits before loss (4 retrans) Oct 21 11:05:43 corosync [TOTEM ] join (50 ms) send_join (0 ms) consensus (800 ms) merge (200 ms) Oct 21 11:05:43 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (50 msgs) Oct 21 11:05:43 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1500 Oct 21 11:05:43 corosync [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (17 messages) Oct 21 11:05:43 corosync [TOTEM ] send threads (0 threads) Oct 21 11:05:43 corosync [TOTEM ] RRP token expired timeout (238 ms) Oct 21 11:05:43 corosync [TOTEM ] RRP token problem counter (2000 ms) Oct 21 11:05:43 corosync [TOTEM ] RRP threshold (10 problem count) Oct 21 11:05:43 corosync [TOTEM ] RRP mode set to none. Oct 21 11:05:43 corosync [TOTEM ] heartbeat_failures_allowed (0) Oct 21 11:05:43 corosync [TOTEM ] max_network_delay (50 ms) Oct 21 11:05:43 corosync [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed 0 Oct 21 11:05:43 corosync [TOTEM ] Initializing transmit/receive security
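On a working install, the directory listing above would also include a pacemaker.lcrso entry. If it is absent, rebuild pacemaker after making sure the corosync development headers are installed first (the package name is distro-dependent; corosynclib-devel on Fedora/Red Hat style systems is one example), then re-check:
updatedb
locate lcrso | grep pacemaker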
Re: [Pacemaker] Failed in restart of Corosync.
This bug has been reported and we are working on a solution. Regards -steve On Mon, 2009-10-19 at 11:05 +0900, renayama19661...@ybb.ne.jp wrote: Hi, I understand that the combination of Corosync and Pacemaker is not official. However, I am contributing this report because I thought the problem was important. I started Corosync with the following combination (on Redhat 5.4 (x86)): * corosync trunk 2530 * Cluster-Resource-Agents-6d652f7cf9d8 * Reusable-Cluster-Components-4edc8f99701c * Pacemaker-1-0-de2a3778ace7 I then stopped the service (corosync). But I had to KILL the processes, because the Pacemaker processes did not stop cleanly. [r...@rh54-1 ~]# service Corosync stop Stopping Corosync Cluster Engine (corosync): [ OK ] Waiting for services to unload:[ OK ] [r...@rh54-1 ~]# ps -ef |grep coro root 5263 4617 0 10:54 pts/000:00:00 grep coro [r...@rh54-1 ~]# ps -ef |grep heartbeat root 4882 1 0 10:52 ?00:00:00 /usr/lib/heartbeat/stonithd 500 4883 1 0 10:52 ?00:00:00 /usr/lib/heartbeat/cib root 4884 1 0 10:52 ?00:00:00 /usr/lib/heartbeat/lrmd 500 4885 1 0 10:52 ?00:00:00 /usr/lib/heartbeat/attrd 500 4886 1 0 10:52 ?00:00:00 /usr/lib/heartbeat/pengine 500 4887 1 0 10:52 ?00:00:00 /usr/lib/heartbeat/crmd root 5278 4617 0 10:54 pts/000:00:00 grep heartbeat [r...@rh54-1 ~]# kill -9 4882 4883 4884 4885 4886 4887 [r...@rh54-1 ~]# ps -ef |grep heartbeat root 5310 4617 0 10:54 pts/000:00:00 grep heartbeat I started Corosync again. But the cib process of Pacemaker seems unable to communicate with Corosync. Oct 19 10:55:29 rh54-1 cib: [5354]: info: startCib: CIB Initialization completed successfully Oct 19 10:55:29 rh54-1 cib: [5354]: info: crm_cluster_connect: Connecting to OpenAIS Oct 19 10:55:29 rh54-1 cib: [5354]: info: init_ais_connection: Creating connection to our AIS plugin Oct 19 10:55:30 rh54-1 mgmtd: [5359]: info: login to cib live: 1, ret:-10 Oct 19 10:55:30 rh54-1 crmd: [5358]: info: do_cib_control: Could not connect to the CIB service: connection failed Oct 19 10:55:30 rh54-1 crmd: [5358]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry Oct 19 10:55:30 rh54-1 crmd: [5358]: info: crmd_init: Starting crmd's mainloop Oct 19 10:55:31 rh54-1 mgmtd: [5359]: info: login to cib live: 2, ret:-10 Oct 19 10:55:32 rh54-1 mgmtd: [5359]: info: login to cib live: 3, ret:-10 Oct 19 10:55:32 rh54-1 crmd: [5358]: info: crm_timer_popped: Wait Timer (I_NULL) just popped! Oct 19 10:55:33 rh54-1 mgmtd: [5359]: info: login to cib live: 4, ret:-10 Oct 19 10:55:33 rh54-1 crmd: [5358]: info: do_cib_control: Could not connect to the CIB service: connection failed Oct 19 10:55:33 rh54-1 crmd: [5358]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry As a result, Pacemaker never finishes starting, no matter how long it waits. Corosync seems to be stuck in poll(?) somehow. However, the cause may possibly lie in the failure of the first stop. [r...@rh54-1 ~]# ps -ef |grep coro root 5348 1 0 10:55 ?00:00:00 /usr/sbin/corosync root 5400 4617 0 10:56 pts/000:00:00 grep coro [r...@rh54-1 ~]# strace -p 5348 Process 5348 attached - interrupt to quit futex(0x805c8c0, FUTEX_WAIT_PRIVATE, 2, NULL Is there a way to avoid this phenomenon? Can I work around the problem by deleting some file? * I hope the combination of Corosync and Pacemaker becomes practical to use soon. Best Regards, Hideo Yamauchi. 
___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] fedora11: openais fails to start
You could try the f12 rpms - we have tested these. We are in the process of making these available in f11/f10, but there is a bit of a lag because of the Fedora process. The f12 rpms are at koji.fedoraproject.org. From looking at your logs, it appears iptables is enabled and not configured properly. try 'service iptables stop'. Regards -steve On Fri, 2009-10-09 at 14:31 +0200, Michael Schwartzkopff wrote: Hi, I wanted to try pacemaker/openais on a fedora11. Packages from OSBS: # rpm -qa | grep ais\|pace pacemaker-1.0.5-4.1.i386 libopenais2-0.80.5-15.1.i386 pacemaker-libs-1.0.5-4.1.i386 openais-0.80.5-15.1.i386 pacemaker-mgmt-1.99.2-6.1.i386 When I start /etc/init.d/openais start - There are some entries in the log. Nothing that I could identify as an error. See: http://www.pastebin.org/41120 - openais-cfgtool -s stops at Printing ring status. Need to CTRL-C to stop. - No pacemaker processes are actually started: ps uax | grep crm is empty. Any ideas? Using corosync-1.0.0 from fedora11 is not an option. Results in another error. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
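If iptables has to stay enabled, opening the totem ports is usually enough. Corosync/openais use two UDP ports, mcastport and mcastport - 1, so assuming the default mcastport of 5405 something like this should work (adjust the port range to your configuration):
iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT
service iptables save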
Re: [Pacemaker] A problem to fail in a stop of Pacemaker.
On Wed, 2009-09-30 at 09:51 +0900, renayama19661...@ybb.ne.jp wrote: Hi Remi, It appears that this is a similar problem to the one that I reported, yes. It appears to not be a bug in Corosync, but rather one in Pacemaker. This bug has been filed in Red Hat Bugzilla, see it at: https://bugzilla.redhat.com/show_bug.cgi?id=525589 Perhaps you could add any additional details that you have found (affected packages, etc.) to the bug; it may help the developers fix it. All right. Thank you. Best Regards, Hideo Yamauchi. Please note this could still be a bz in corosync related to service engine integration. It is just too early to tell. Andrew should be able to tell us for certain when he has an opportunity to take a look at it. Regards -steve --- Remi Broemeling r...@nexopia.com wrote: Hello Hideo, It appears that this is a similar problem to the one that I reported, yes. It appears to not be a bug in Corosync, but rather one in Pacemaker. This bug has been filed in Red Hat Bugzilla, see it at: https://bugzilla.redhat.com/show_bug.cgi?id=525589 Perhaps you could add any additional details that you have found (affected packages, etc.) to the bug; it may help the developers fix it. Thanks. renayama19661...@ybb.ne.jp wrote: Hi, I started a Dummy resource on one node with the following combination: * corosync 1.1.0 * Pacemaker-1-0-05c8b63cbca7 * Reusable-Cluster-Components-6ef02517ee57 * Cluster-Resource-Agents-88a9cfd9e8b5 The Dummy resource started on the node. I then tried to stop the node (service corosync stop), but it did not stop. --log-- (snip) Sep 29 13:52:01 rh53-1 crmd: [11193]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated Sep 29 13:52:01 rh53-1 crmd: [11193]: info: crm_shutdown: Requesting shutdown Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ] Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_state_transition: All 1 cluster nodes are eligible to run resources. Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_shutdown_req: Sending shutdown request to DC: rh53-1 Sep 29 13:52:30 rh53-1 corosync[11183]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=11193) to terminate... Sep 29 13:53:30 rh53-1 last message repeated 2 times Sep 29 13:55:00 rh53-1 last message repeated 3 times Sep 29 13:56:30 rh53-1 last message repeated 3 times Sep 29 13:58:01 rh53-1 last message repeated 3 times Sep 29 13:59:31 rh53-1 last message repeated 3 times Sep 29 14:00:31 rh53-1 last message repeated 2 times Sep 29 14:00:46 rh53-1 cib: [11189]: info: cib_stats: Processed 94 operations (11489.00us average, 0% utilization) in the last 10min Sep 29 14:01:01 rh53-1 corosync[11183]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=11193) to terminate... (snip) --log-- Is the cause possibly the same as in this email? * http://www.gossamer-threads.com/lists/linuxha/pacemaker/58127 The same problem also occurred with the following combination: * corosync 1.0.1 * Pacemaker-1-0-595cca870aff * Reusable-Cluster-Components-6ef02517ee57 * Cluster-Resource-Agents-88a9cfd9e8b5 I attach an hb_report file. Best Regards, Hideo Yamauchi. -- Remi Broemeling Sr System Administrator Nexopia.com Inc. direct: 780 444 1250 ext 435 email: r...@nexopia.com mailto:r...@nexopia.com fax: 780 487 0376 www.nexopia.com http://www.nexopia.com You are only young once, but you can stay immature indefinitely. 
www.siglets.com ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Cluster Refuses to Stop/Shutdown
Remi, Likely a defect. We will have to look into it. Please file a bug as per instructions on the corosync wiki at www.corosync.org. On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote: I've spent all day working on this; even going so far as to completely build my own set of packages from the Debian-available ones (which appear to be different from the Ubuntu-available ones). It didn't have any effect on the issue at all: the cluster still freaks out and becomes a split-brain after a single SIGQUIT. The debian packages that also demonstrate this behavior were the below versions: cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb corosync_1.0.0-5~bpo50+1_i386.deb libcorosync4_1.0.0-5~bpo50+1_i386.deb libopenais3_1.0.0-4~bpo50+1_i386.deb openais_1.0.0-4~bpo50+1_i386.deb pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb These packages were re-built (under Ubuntu Hardy Heron LTS) from the *.diff.gz, *.dsc, and *.orig.tar.gz files available at http://people.debian.org/~madkiss/ha-corosync, and as I said the symptoms remain exactly the same, both under the configuration that I list below and the sample configuration that came with these packages. I also attempted the same with a single IP Address resource associated with the cluster, just to be sure it wasn't an edge case for a cluster with no resources; but again that had no effect. Basically I'm still exactly at the point that I was at yesterday morning at about 0900. Remi Broemeling wrote: I posted this to the OpenAIS Mailing List (open...@lists.linux-foundation.org) yesterday, but haven't received a response and upon further reflection I think that maybe I chose the wrong list to post it to. That list seems to be far less about user support and far more about developer communication. Therefore re-trying here, as the archives show it to be somewhat more user-focused. The problem is that I'm having an issue with corosync refusing to shut down in response to a QUIT signal. Given the below cluster (output of crm_mon): Last updated: Wed Sep 23 15:56:24 2009 Stack: openais Current DC: boot1 - partition with quorum Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56 2 Nodes configured, 2 expected votes 0 Resources configured. Online: [ boot1 boot2 ] If I go onto the host 'boot2', and issue the command killall -QUIT corosync, the anticipated result would be that boot2 would go offline (out of the cluster), and all of the cluster processes (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) would shut down. However, this is not occurring, and I don't really have any idea why. After logging into boot2, and issuing the command killall -QUIT corosync, the result is a split-brain: From boot1's viewpoint: Last updated: Wed Sep 23 15:58:27 2009 Stack: openais Current DC: boot1 - partition WITHOUT quorum Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56 2 Nodes configured, 2 expected votes 0 Resources configured. Online: [ boot1 ] OFFLINE: [ boot2 ] From boot2's viewpoint: Last updated: Wed Sep 23 15:58:35 2009 Stack: openais Current DC: boot1 - partition with quorum Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56 2 Nodes configured, 2 expected votes 0 Resources configured. Online: [ boot1 boot2 ] At this point the status quo holds until such time as ANOTHER QUIT signal is sent to corosync (i.e. the command killall -QUIT corosync is executed on boot2 again). Then, boot2 shuts down properly and everything appears to be kosher. 
Basically, what should happen after a single QUIT signal is instead taking two QUIT signals to occur; and that summarizes my question: why does it take two QUIT signals to force corosync to actually shut down? Is that desired behavior? From everything online that I have read it seems to be very strange, and it makes me think that I have a problem in my configuration(s), but I've no idea what that would be even after playing with things and investigating for the day. I would be very grateful for any guidance that could be provided, as at the moment I seem to be at an impasse. Log files, with debugging set to 'on', can be found at the following pastebin locations: After first QUIT signal issued on boot2: boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd boot2:/var/log/syslog: http://pastebin.com/d26fdfee After second QUIT signal issued on boot2: boot1:/var/log/syslog: http://pastebin.com/m755fb989 boot2:/var/log/syslog: http://pastebin.com/m22dcef45 OS, Software Packages, and Versions: * two nodes, each running Ubuntu Hardy Heron LTS * ubuntu-ha packages, as downloaded from http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/: *
Re: [Pacemaker] CentOS problem starting openais
On Tue, 2009-08-25 at 21:55 +0200, Michael Schwartzkopff wrote: Hi, I just installed pacemaker from OSBS on a fully patched CentOS 5.3. When I call aisexec -f manually, everything works as expected. When I use /etc/init.d/openais start, I get the following in /var/log/messages: Aug 25 23:46:02 localhost openais[16897]: [MAIN ] AIS Executive Service RELEASE 'subrev 1152 version 0.80' Aug 25 23:46:02 localhost openais[16897]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Aug 25 23:46:02 localhost openais[16897]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Aug 25 23:46:02 localhost openais[16897]: [MAIN ] AIS Executive Service: started and ready to provide service. Aug 25 23:46:02 localhost openais[16897]: [TOTEM] Token Timeout (3000 ms) retransmit timeout (294 ms) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] token hold (225 ms) retransmits before loss (10 retrans) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] join (60 ms) send_join (0 ms) consensus (1500 ms) merge (200 ms) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (20 messages) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] send threads (0 threads) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] RRP token expired timeout (294 ms) Aug 25 23:46:02 localhost openais[16897]: [TOTEM] RRP token problem counter (2000 ms) Aug 25 23:46:03 localhost openais[16897]: [TOTEM] RRP threshold (10 problem count) Aug 25 23:46:03 localhost openais[16897]: [TOTEM] RRP mode set to none. Aug 25 23:46:03 localhost openais[16897]: [TOTEM] heartbeat_failures_allowed (0) Aug 25 23:46:03 localhost openais[16897]: [TOTEM] max_network_delay (50 ms) Aug 25 23:46:03 localhost openais[16897]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed 0 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] The network interface [172.19.93.1] is now up. Aug 25 23:46:03 localhost openais[16897]: [TOTEM] Created or loaded sequence id 24.172.19.93.1 for this ring. Aug 25 23:46:03 localhost openais[16897]: [TOTEM] entering GATHER state from 15. Aug 25 23:46:03 localhost openais[16897]: [crm ] info: process_ais_conf: Reading configure Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: config_find_next: Processing additional logging options... Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Found 'off' for option: debug Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Defaulting to 'off' for option: to_file Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Found 'daemon' for option: syslog_facility Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: config_find_next: Processing additional service options... 
Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Defaulting to 'no' for option: use_logd Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Found 'yes' for option: use_mgmtd Aug 25 23:46:03 localhost openais[16897]: [crm ] pcmk_plugin_init: Could not enable /proc/sys/kernel/core_uses_pid: (22) Invalid argument Aug 25 23:46:03 localhost openais[16897]: [crm ] info: pcmk_plugin_init: CRM: Initialized Aug 25 23:46:03 localhost openais[16897]: [crm ] Logging: Initialized pcmk_plugin_init Aug 25 23:46:03 localhost openais[16897]: [crm ] info: pcmk_plugin_init: Service: 9 Aug 25 23:46:03 localhost openais[16897]: [crm ] info: pcmk_plugin_init: Local node id: 22877100 Aug 25 23:46:03 localhost openais[16897]: [crm ] info: pcmk_plugin_init: Local hostname: localhost.localdomain Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: Creating entry for node 22877100 born on 0 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: 0x9c4c6d8 Node 22877100 now known as localhost.localdomain (was: (null)) Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: Node localhost.localdomain now has 1 quorum votes (was 0) Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: Node 22877100/localhost.localdomain is now: member Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked child 16903 for process stonithd Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked child 16904 for process cib Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked child 16905 for process lrmd Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked child 16906 for process attrd Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked child 16907 for process pengine Aug 25
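When a daemon behaves differently under its init script than when started by hand, tracing the script usually shows where the two diverge (a generic technique, not specific to these packages):
sh -x /etc/init.d/openais start 2>&1 | tail -50
Comparing the environment of the two invocations (ulimits, PATH, locale) is also worth a look.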
Re: [Pacemaker] [PATCH] Whitetank: fix chkconfig entries for openais init script
If no one objects to this patch, I'll commit it in the next few days. Regards -steve On Wed, 2009-06-24 at 09:04 +0200, Florian Haas wrote: # HG changeset patch # User Florian Haas florian.h...@linbit.com # Date 1245827047 -7200 # Branch whitetank # Node ID a60db27fe11b9bfd399b847e33c5f49de3d227bc # Parent e65f52176ba646c9d93c3b76e9e52df24f18d6dc Whitetank: fix chkconfig entries for openais init script The openais init script uses chkconfig entries which may cause OpenAIS to start too early or stop too late in the system startup/shutdown sequence. SUSE Linux doesn't care about this as it ignores chkconfig entries for the most part, but Red Hat systems (and potentially others) get bitten by this. This patch puts openais in the same spot in the sequence as the original heartbeat init script. diff -r e65f52176ba6 -r a60db27fe11b init/generic --- a/init/generic Thu Feb 12 11:29:20 2009 +0100 +++ b/init/generic Wed Jun 24 09:04:07 2009 +0200 @@ -5,7 +5,7 @@ # Author: Andrew Beekhof abeek...@suse.de # License: Revised BSD # -# chkconfig: - 20 20 +# chkconfig: - 75 05 # processname: aisexec # description: OpenAIS daemon # diff -r e65f52176ba6 -r a60db27fe11b init/redhat --- a/init/redhat Thu Feb 12 11:29:20 2009 +0100 +++ b/init/redhat Wed Jun 24 09:04:07 2009 +0200 @@ -2,7 +2,7 @@ # # OpenAIS daemon init script for Red Hat Linux and compatibles. # -# chkconfig: - 20 20 +# chkconfig: - 75 05 # processname: aisexec # pidfile: /var/run/aisexec.pid # description: OpenAIS daemon ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
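Note that on Red Hat style systems the new chkconfig priorities only take effect once the service is re-registered, e.g. (standard chkconfig usage):
chkconfig --del openais
chkconfig --add openais
chkconfig --list openais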
Re: [Pacemaker] About a combination with OpenAIS.
On Thu, 2009-06-11 at 12:30 +0200, Dejan Muhamedagic wrote: Hi Hideo-san, On Thu, Jun 11, 2009 at 03:17:08PM +0900, renayama19661...@ybb.ne.jp wrote: Hi, I understood the cause of the problem. The init script in WhiteTank was the problem. It works correctly when I use the init script for Pacemaker that Mr. Andrew made. I hope the correct init script will be included. Perhaps you would be better off with the versions released by Andrew (from the OBS). I'm not sure myself; it's just that the openais API was moving until recently, so it may be a problem to match releases. Of course, if this combination works for you then it's fine. Thanks, Dejan The openais whitetank ABI hasn't changed much for years. The init script could be a problem however. Is there a special version required by SUSE Linux? I thought we had all the proper init scripts upstream to work with Pacemaker and SUSE. Thanks -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure
On Thu, 2009-06-04 at 17:54 +0200, Lars Marowsky-Bree wrote: On 2009-05-26T12:50:34, Andrew Beekhof and...@beekhof.net wrote: try all the time also after failure like was done before failure. Complete Totem amateur behind the keyboard, but I'd second that. Since you're constantly checking the link status while it's up, why not keep doing so after it's gone down, to see if it has recovered? Perhaps even at a decreased (user configurable) interval/rate. I think that was actually discussed on the openais list and on IRC in the past, and it was never completely explained why it wouldn't work ;-) The problem with checking the link status with the current code is that the protocol blocks I/O waiting for a response from the failed ring. This could of course be modified to behave differently. So the act of failing a link is expensive and we don't want to retest that it is valid very often. The obvious solution to this is to redesign the protocol to not have this constraint. No patch has been written and I don't have time to do such work at the present time. Regards -steve Regards, Lars ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
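For context, the redundant-ring setup under discussion is declared in corosync.conf along these lines (addresses, ports, and mode are illustrative, not taken from anyone's actual config):
totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.2.0
        mcastaddr: 226.94.2.1
        mcastport: 5407
    }
}
The blocking behavior Steve describes applies once one of these rings has been marked faulty.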
Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure
On Thu, 2009-06-04 at 18:30 +0200, Lars Marowsky-Bree wrote: On 2009-06-04T09:23:04, Steven Dake sd...@redhat.com wrote: The problem with checking the link status with the current code is that the protocol blocks I/O waiting for a response from the failed ring. This could of course be modified to behave differently. Right, so the rechecking could possibly be a separate thread, sending an occasional liveness packet on the failed ring and triggering the RRP recovery after it has heard from other nodes on it? Well, I prefer totem to remain non-threaded except for encrypted xmit operations, but in general, that is the basic idea. Some smarts would be needed of course to not constantly retrigger partially active rings (which would fail again immediately). So the act of failing a link is expensive and we don't want to retest that it is valid very often. Does expensive mean that it'll actually slow down the healthy ring(s)? At the moment it blocks until the problem counter reaches the threshold, at which point the ring is declared failed and normal communication continues. Regards, Lars ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
On Mon, 2009-05-25 at 18:32 +0300, Juha Heinanen wrote: Florian Haas writes: Agree that they're hacks, but disagree with your alternative. Why should Pacemaker be concerned with low-level OpenAIS recovery procedures? then have the variable in OpenAIS configuration. Self-healing is not as obvious or easy as it sounds. Totem (the protocol) has no way to determine when the admin has replaced the faulty switch in the network. One option I see is to periodically try the failed ring for liveness. The problem with this approach is it is hard to implement. Another option is to re-enable the ring after some period of time internally and hope for the best. The problem with this approach is that it causes performance degradation every time the failed ring is re-enabled and restarted. I think the first option is the best, but at the moment no one has written patches and most people are focused on the 1.0 release... Regards -steve -- juha ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
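Until such patches exist, re-enabling a repaired ring is a manual step on corosync-based stacks; corosync ships corosync-cfgtool for this (see corosync-cfgtool(8)):
corosync-cfgtool -s   # print the status of each configured ring
corosync-cfgtool -r   # reset rings marked faulty once the fault is repaired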
[Pacemaker] recent stabilization changes to corosync and openais API/ABIs
Hello, Some people on these lists may be interested to know what is happening with Corosync and OpenAIS ABIs as well as our schedules for 1.0. We are currently planning the following dates for our releases: Corosync 1.0 - May 15, 2009 OpenAIS 1.0 - June 1, 2009 We have made great progress and the code is seeing good stabilization. Over the past few weeks, our community team has sanitized the corosync ABI and APIs. Specifically, the following types of changes were made to corosync: * A const qualifier was added to any parameter that was a constant buffer or struct * Any size parameter was changed from int/unsigned int to size_t * Any buffer that an API copies into gained a length parameter of type size_t * All of the external headers had these changes applied, as well as coroapi.h. No significant changes were made to the openais ABI; however, because of changes to internal APIs, we are bumping the .so major version, requiring a recompile if you use the SA Forum APIs or any corosync APIs as dependencies. We are targeting a new release with these changes to be introduced into distributions that ship this software in early May. Regards -steve ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
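To illustrate the flavor of these changes with a made-up function (cs_example_send and cs_example_get_name are hypothetical, not actual corosync symbols):
/* before the cleanup */
int cs_example_send (void *msg, int msg_len);
/* after: const-qualified input, size_t for sizes */
int cs_example_send (const void *msg, size_t msg_len);
/* buffers an API copies into now take an explicit size_t length */
int cs_example_get_name (char *name, size_t name_len);
Anything built against the old headers needs a recompile, which is why the .so major version is being bumped.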