[Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?

2015-02-07 Thread Steven Dake (stdake)
Hi,

I am working on Containerizing OpenStack in the Kolla project 
(http://launchpad.net/kolla).  One of the key things we want to do over the 
next few months is add H/A support to our container tech.  David Vossel had 
suggested using systemctl to monitor the containers themselves by running 
healthchecking scripts within the containers.  That idea is sound.
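
(For context, that systemd approach could look roughly like the following per
container.  This is only a sketch; the unit name, container name, and the
in-container healthcheck behaviour are hypothetical, not existing Kolla pieces.)

    # /etc/systemd/system/kolla-rabbitmq.service  -- hypothetical sketch
    [Unit]
    Description=Kolla rabbitmq container
    Requires=docker.service
    After=docker.service

    [Service]
    # Run the container in the foreground so systemd tracks its lifetime;
    # a healthcheck script inside the container would exit non-zero on failure,
    # taking the container down and letting systemd restart it.
    ExecStart=/usr/bin/docker start -a rabbitmq
    ExecStop=/usr/bin/docker stop rabbitmq
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target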

There is another technology called “super-privileged containers”.  Essentially 
it allows more host access for the container, allowing the treatment of 
Pacemaker as a container rather than an RPM or DEB file.  I'd like corosync to 
run in a separate container.  These containers will communicate using their 
normal mechanisms in a super-privileged mode.  We will implement this in Kolla.

Where I am stuck is how Pacemaker within a container can control other 
containers in the host OS.  One way I have considered is using the docker 
--pid=host flag, allowing Pacemaker to communicate directly with the host's 
systemd.  The complication is that our containers don't run via systemctl, 
but instead via shell scripts that are executed by third-party deployment 
software.

An example:
Let's say a rabbitmq container wants to run:

The user would run
kolla-mgr deploy messaging

This would run a small bit of code to launch the docker container set for 
messaging.

Could pacemaker run something like

kolla-mgr status messaging

To control the lifecycle of the processes?

Or would we be better off with some systemd integration with kolla-mgr?
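
(One possibility, sketched below, is a thin OCF-style resource agent that wraps
kolla-mgr, so Pacemaker drives the container set through its usual
start/stop/monitor actions.  The stop and status subcommands and their exit
codes are assumptions for illustration, not existing kolla-mgr behaviour.)

    #!/bin/sh
    # Hypothetical OCF-style agent wrapping kolla-mgr -- a sketch only.
    # Assumes kolla-mgr grows stop/status subcommands that return 0 on success.
    OCF_SUCCESS=0; OCF_ERR_GENERIC=1; OCF_ERR_UNIMPLEMENTED=3; OCF_NOT_RUNNING=7
    SERVICE="${OCF_RESKEY_service:-messaging}"

    case "$1" in
      start)     kolla-mgr deploy "$SERVICE" || exit $OCF_ERR_GENERIC ;;
      stop)      kolla-mgr stop   "$SERVICE" || exit $OCF_ERR_GENERIC ;;
      monitor)   kolla-mgr status "$SERVICE" || exit $OCF_NOT_RUNNING ;;
      meta-data) echo '<resource-agent name="kolla-mgr"/>' ;; # full metadata omitted
      *)         exit $OCF_ERR_UNIMPLEMENTED ;;
    esac
    exit $OCF_SUCCESS

Pacemaker would then call the agent's monitor action on its configured
interval, so the lifecycle question could be answered without putting systemd
in the path at all.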

Thoughts welcome

Regards,
-steve
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Need to relax corosync due to backup of VM through snapshot

2013-11-24 Thread Steven Dake


On 11/21/2013 06:26 AM, Gianluca Cecchi wrote:

On Thu, Nov 21, 2013 at 9:09 AM, Lars Marowsky-Bree wrote:

On 2013-11-20T16:58:01, Gianluca Cecchi gianluca.cec...@gmail.com wrote:


Based on docs  I thought that the timeout should be

token x token_retransmits_before_loss_const

No, the comments in the corosync.conf.example and man corosync.conf
should be pretty clear, I hope. Can you recommend which phrasing we
should improve?

I have not understood exact relationship between token and
token_retransmits_before_loss_const.
When one comes into play and when the other one...
So perhaps the second one could be given more details.
Or some web links


The token timeout is a timer that is started each time a token is 
transmitted.  It is the maximum timeout that exists - it is not token * 
retransmits_before_loss_const.


The retrans_before_loss_const says "please transmit a replacement token 
x many times within the token period".  Since the token is sent over UDP, it 
could be lost in network overload situations or other scenarios.


Using a real-world example:
token: 10000
retrans_before_loss_const: 10

the token will be retransmitted roughly every 1000 msec and the token will 
be determined lost after 10000 msec.
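
Expressed as a corosync.conf fragment (using the full parameter name), that
example would be:

    totem {
            version: 2
            # token declared lost 10000 msec after transmission ...
            token: 10000
            # ... with roughly one retransmit every 1000 msec in that window
            token_retransmits_before_loss_const: 10
    }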


Regards
-steve


So my current test config is:
   # diff corosync.conf corosync.conf.pre181113
24,25c24
 #token: 5000
 token: 120000

A 120s node timeout? That is really, really long. Why is the backup tool
interfering with the scheduling of high priority processes so much? That
sounds like the real bug.

In fact I inherited the analysis for a previous production cluster and I'm
setting up a test environment to demonstrate that one realistic
outcome could well be that a cluster is not the right solution
because the underlying infrastructure is not stable enough.
I'm not given great visibility into the VMware and SAN details,
but I'm pressing to get them.
I have sometimes seen disk latencies of 8000 milliseconds ;-(
So another possible outcome could be to build a more reliable infrastructure
before going with a cluster.
I'm deliberately setting high values to see what happens and will lower
them step by step.
BTW: I remember a past thread with others having problems
with Netbackup (or similar backup software) using snapshots, where
setting higher values solved the sporadic problems (possibly 2 for
token and 10 for retransmit but I couldn't find them ...)



Any comments?
Any different strategies successfully used in similar environments
where high latencies arise at snapshot deletion, when the disk
consolidation phase is executed?

A setup where a VM apparently can freeze for almost 120s is not suitable
for HA.


I see from previous logs that sometimes drbd disconnect and reconnect
only after 30-40 seconds with default timeouts...

Thanks for your inputs.

Gianluca

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Need HA for OpenStack instances? Check out heat V5!

2012-08-01 Thread Steven Dake
Hi folks,

A few developers from the HA community have been hard at work on a project
called heat which provides native HA for OpenStack virtual machines.
Heat provides a template based system with API matching AWS
CloudFormation semantics specifically for OpenStack.

In v5, instance healthchecking has been added.  To get started on Fedora
16+ check out the getting started guide:

https://github.com/heat-api/heat/blob/master/docs/GettingStarted.rst#readme

or on Ubuntu Precise check out the devstack guide:
https://github.com/heat-api/heat/wiki/Getting-Started-with-Heat-using-Master-on-Ubuntu

An example template with instance HA features is here:

https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_IHA.template

An example template with application HA features that includes
escalation is here:

https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_HA.template

Our website is here:

http://www.heat-api.org

The software can be downloaded from:
https://github.com/heat-api/heat/downloads

Enjoy
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] Different Corosync Rings for Different Nodes in Same Cluster?

2012-07-08 Thread Steven Dake
On 07/02/2012 08:19 AM, Andrew Martin wrote:
 Hi Steve,
 
 Thanks for the clarification. Am I correct in understanding that in a
 complete network, corosync will automatically re-add nodes that drop out
 and reappear for any reason (e.g. maintenance, network connectivity
 loss, STONITH, etc)?
 

Apologies for delay - was on PTO.

That is correct.

Regards
-steve

 Thanks,
 
 Andrew
 
 
 From: Steven Dake sd...@redhat.com
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Cc: disc...@corosync.org
 Sent: Friday, June 29, 2012 9:40:43 AM
 Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes
 in Same Cluster?
 
 On 06/29/2012 01:42 AM, Dan Frincu wrote:
 Hi,

 On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin amar...@xes-inc.com
 wrote:
 Hi Dan,

 Thanks for the help. If I configure the network as I described - ring
 0 as
 the network all 3 nodes are on, ring 1 as the network only 2 of the nodes
 are on, and using passive - and the ring 0 network goes down, corosync
 will start using ring 1. Does this mean that the quorum node will
 appear to
 be offline to the cluster? Will the cluster attempt to STONITH it?
 Once the
 ring 0 network is available again, will corosync transition back to
 using it
 as the communication ring, or will it continue to use ring 1 until it
 fails?

 The ideal behavior would be when ring 0 fails it then communicates
 over ring
 1, but keeps periodically checking to see if ring 0 is working again.
 Once
 it is, it returns to using ring 0. Is this possible?

 Added corosync ML in CC as I think this is better asked here as well.

 Regards,
 Dan


 Thanks,

 Andrew

 
 From: Dan Frincu df.clus...@gmail.com
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, June 27, 2012 3:42:42 AM
 Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes
 inSame Cluster?


 Hi,

 On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin amar...@xes-inc.com
 wrote:
 Hello,

 I am setting up a 3 node cluster with Corosync + Pacemaker on Ubuntu
 12.04
 server. Two of the nodes are real nodes, while the 3rd is in standby
 mode
 as a quorum node. The two real nodes each have two NICs, one that is
 connected to a shared LAN and the other that is directly connected
 between
 the two nodes (for DRBD replication). The quorum node is only
 connected to
 the shared LAN. I would like to have multiple Corosync rings for
 redundancy,
 however I do not know if this would cause problems for the quorum
 node. Is
 it possible for me to configure the shared LAN as ring 0 (which all 3
 nodes
 are connected to) and set the rrp_mode to passive so that it will
 use ring
 0
 unless there is a failure, but to also configure the direct link between
 the
 two real nodes as ring 1?

 
 In general I think you cannot do what you describe.  Let me repeat it so
 it's clear:
 
 A B C - NET #1
 A B   - Net #2
 
 Where A, B are your cluster nodes, and C is your quorum node.
 
 You want Net #1 and Net #2 to serve as redundant rings.  Since C is
 missing, Net #2 will automatically be detected as faulty.
 
 The part about corosync automatically repairing nodes is correct, that
 would work (If you had a complete network).
 
 Regards
 -steve
 
 Short answer, yes.

 Longer answer. I have a setup with two nodes with two interfaces, one
 is connected via a switch to the other node and one is a back-to-back
 link for DRBD replication. In Corosync I have two rings, one that goes
 via the switch and one via the back-to-back link (rrp_mode: active).
 With rrp_mode: passive it should work the way you mentioned.

 HTH,
 Dan


 Thanks,

 Andrew

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 --
 Dan Frincu
 CCNA, RHCE

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc

Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster?

2012-06-29 Thread Steven Dake
On 06/29/2012 01:42 AM, Dan Frincu wrote:
 Hi,
 
 On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin amar...@xes-inc.com wrote:
 Hi Dan,

 Thanks for the help. If I configure the network as I described - ring 0 as
 the network all 3 nodes are on, ring 1 as the network only 2 of the nodes
 are on, and using passive - and the ring 0 network goes down, corosync
 will start using ring 1. Does this mean that the quorum node will appear to
 be offline to the cluster? Will the cluster attempt to STONITH it? Once the
 ring 0 network is available again, will corosync transition back to using it
 as the communication ring, or will it continue to use ring 1 until it fails?

 The ideal behavior would be when ring 0 fails it then communicates over ring
 1, but keeps periodically checking to see if ring 0 is working again. Once
 it is, it returns to using ring 0. Is this possible?
 
 Added corosync ML in CC as I think this is better asked here as well.
 
 Regards,
 Dan
 

 Thanks,

 Andrew

 
 From: Dan Frincu df.clus...@gmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, June 27, 2012 3:42:42 AM
 Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes
 inSame Cluster?


 Hi,

 On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin amar...@xes-inc.com wrote:
 Hello,

 I am setting up a 3 node cluster with Corosync + Pacemaker on Ubuntu 12.04
 server. Two of the nodes are real nodes, while the 3rd is in standby
 mode
 as a quorum node. The two real nodes each have two NICs, one that is
 connected to a shared LAN and the other that is directly connected between
 the two nodes (for DRBD replication). The quorum node is only connected to
 the shared LAN. I would like to have multiple Corosync rings for
 redundancy,
 however I do not know if this would cause problems for the quorum node. Is
 it possible for me to configure the shared LAN as ring 0 (which all 3
 nodes
 are connected to) and set the rrp_mode to passive so that it will use ring
 0
 unless there is a failure, but to also configure the direct link between
 the
 two real nodes as ring 1?


In general I think you cannot do what you describe.  Let me repeat it so
it's clear:

A B C - NET #1
A B   - Net #2

Where A, B are your cluster nodes, and C is your quorum node.

You want Net #1 and Net #2 to serve as redundant rings.  Since C is
missing, Net #2 will automatically be detected as faulty.

The part about corosync automatically repairing nodes is correct, that
would work (If you had a complete network).

Regards
-steve
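
(For reference, the two-ring totem configuration being discussed would look
roughly like the fragment below; the addresses are placeholders, not taken
from this thread.  With node C absent from the ring 1 network, this is exactly
the layout that gets ring 1 marked faulty.)

    totem {
            version: 2
            rrp_mode: passive
            interface {
                    ringnumber: 0
                    bindnetaddr: 192.168.10.0    # shared LAN, all three nodes
                    mcastaddr: 239.255.1.1
                    mcastport: 5405
            }
            interface {
                    ringnumber: 1
                    bindnetaddr: 192.168.20.0    # direct link, two nodes only
                    mcastaddr: 239.255.1.2
                    mcastport: 5415
            }
    }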

 Short answer, yes.

 Longer answer. I have a setup with two nodes with two interfaces, one
 is connected via a switch to the other node and one is a back-to-back
 link for DRBD replication. In Corosync I have two rings, one that goes
 via the switch and one via the back-to-back link (rrp_mode: active).
 With rrp_mode: passive it should work the way you mentioned.

 HTH,
 Dan


 Thanks,

 Andrew

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 --
 Dan Frincu
 CCNA, RHCE

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

 
 
 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] Unable to join cluster from a newly-installed centos 6.2 node

2012-03-02 Thread Steven Dake
On 03/02/2012 05:29 PM, Diego Lima wrote:
 Hello,
 
 I've recently installed Corosync on two CentOS 6.2 machines. One is
 working fine but on the other machine I've been unable to connect to
 the cluster. On the logs I can see this whenever I start
 corosync+pacemaker:
 
 Mar  2 21:33:16 no2 corosync[15924]:   [MAIN  ] Corosync Cluster
 Engine ('1.4.1'): started and ready to provide service.
 Mar  2 21:33:16 no2 corosync[15924]:   [MAIN  ] Corosync built-in
 features: nss dbus rdma snmp
 Mar  2 21:33:16 no2 corosync[15924]:   [MAIN  ] Successfully read main
 configuration file '/etc/corosync/corosync.conf'.
 Mar  2 21:33:16 no2 corosync[15924]:   [TOTEM ] Initializing transport
 (UDP/IP Multicast).
 Mar  2 21:33:16 no2 corosync[15924]:   [TOTEM ] Initializing
 transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Mar  2 21:33:16 no2 corosync[15924]:   [TOTEM ] The network interface
 [172.16.100.2] is now up.
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 process_ais_conf: Reading configure
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 config_find_init: Local handle: 4730966301143465987 for logging
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 config_find_next: Processing additional logging options...
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Found 'off' for option: debug
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Found 'no' for option: to_logfile
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Found 'yes' for option: to_syslog
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Found 'daemon' for option: syslog_facility
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 config_find_init: Local handle: 7739444317642555396 for quorum
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 config_find_next: No additional configuration supplied for: quorum
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 No default for option: provider
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 config_find_init: Local handle: 5650605097994944517 for service
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 config_find_next: Processing additional service options...
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Found '0' for option: ver
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Defaulting to 'pcmk' for option: clustername
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Defaulting to 'no' for option: use_logd
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
 Defaulting to 'no' for option: use_mgmtd
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup:
 CRM: Initialized
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] Logging: Initialized
 pcmk_startup
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup:
 Maximum core file size is: 18446744073709551615
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup: Service: 
 10
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup:
 Local hostname: no2.informidia.int
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
 pcmk_update_nodeid: Local node id: 40112300
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
 Creating entry for node 40112300 born on 0
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
 0x766520 Node 40112300 now known as no2.informidia.int (was: (null))
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
 Node no2.informidia.int now has 1 quorum votes (was 0)
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
 Node 40112300/no2.informidia.int is now: member
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: spawn_child:
 Forked child 15930 for process stonith-ng
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: spawn_child:
 Forked child 15931 for process cib
 Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: spawn_child:
 Forked child 15932 for process lrmd
 Mar  2 21:33:16 no2 lrmd: [15932]: info: G_main_add_SignalHandler:
 Added signal handler for signal 15
 Mar  2 21:33:16 no2 stonith-ng: [15930]: info: Invoked:
 /usr/lib64/heartbeat/stonithd
 Mar  2 21:33:16 no2 stonith-ng: [15930]: info: crm_log_init_worker:
 Changed active directory to /var/lib/heartbeat/cores/root
 Mar  2 21:33:16 no2 stonith-ng: [15930]: info:
 G_main_add_SignalHandler: Added signal handler for signal 17
 Mar  2 21:33:16 no2 stonith-ng: [15930]: info: get_cluster_type:
 Cluster type is: 'openais'
 Mar  2 21:33:16 no2 stonith-ng: [15930]: notice: crm_cluster_connect:
 Connecting to cluster infrastructure: classic openais (with plugin)
 Mar  2 21:33:16 no2 stonith-ng: [15930]: info:
 init_ais_connection_classic: Creating connection to our Corosync
 plugin
 Mar  2 21:33:16 no2 cib: [15931]: info: crm_log_init_worker: Changed
 active directory to 

Re: [Pacemaker] need cluster-wide variables

2012-01-11 Thread Steven Dake
On 12/21/2011 12:01 AM, Nirmala S wrote:
 Hi,
 
  
 
 This is a followup on earlier thread
 (http://www.gossamer-threads.com/lists/linuxha/pacemaker/76705).
 
  
 
 My situation is somewhat similar. I need a cluster which contains 3
 kinds of nodes – master, preferred slave, slave. Preferred slave is an
 entity that becomes the master in case of switchover/failover. Master is
 the master for pref_slave and pref_slave is master for other slaves. The
 master election is easy – it is done by crm, all I need to do is use
 crm_master.
 
 

RE subject, the cpg interface is perfect for maintaining replicated
state among your cluster nodes.  man cpg_overview.

Regards
-steve
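
(To illustrate what that looks like in practice, here is a minimal sketch
against the CPG client API - link with -lcpg.  The group name and payload are
made up for this example, and error checking is omitted.)

    #include <string.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    /* Every member sees the same messages in the same total order, so the
     * deliver callback is the natural place to update replicated election
     * state such as a preferred-slave score. */
    static void deliver_cb(cpg_handle_t h, const struct cpg_name *group,
                           uint32_t nodeid, uint32_t pid,
                           void *msg, size_t msg_len)
    {
            printf("update from node %u: %.*s\n",
                   nodeid, (int)msg_len, (char *)msg);
    }

    static cpg_callbacks_t callbacks = {
            .cpg_deliver_fn = deliver_cb,
            .cpg_confchg_fn = NULL,  /* membership changes would be handled here */
    };

    int main(void)
    {
            cpg_handle_t handle;
            struct cpg_name group;
            struct iovec iov;
            const char *msg = "pref_slave_score=42";  /* made-up payload */

            cpg_initialize(&handle, &callbacks);
            strcpy(group.value, "pref_slave_election");
            group.length = strlen(group.value);
            cpg_join(handle, &group);

            /* Multicast a state update to every member of the group. */
            iov.iov_base = (void *)msg;
            iov.iov_len = strlen(msg);
            cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

            cpg_dispatch(handle, CS_DISPATCH_BLOCKING);  /* run the callbacks */
            cpg_finalize(handle);
            return 0;
    }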

 
 But for the preferred slave, there needs to an election amongst existing
 slaves. As of now I am using a variable in CIB with
 pref_slave|pref_slave_score|temp_score. If temp_score is 0, then the
 slave will update pref_slave and pref_slave_score and temp_score. If
 temp_score is non-zero, then the node compares its score with
 pref_slave_score and updates only if it is bigger.
 
  
 
 Now I have 2 problems
 
  1. Every time I change the CIB (which I am doing in pre-promote), the
 event (pre-promote) is getting retriggered.
  2. The event(pre-promote) is sent in parallel to all the slaves. So
 each slave thinks temp_score is 0, and overwrites with its score. Is
 there any way to serialize this using some sort of lock ? Or is
 there a provision to store cluster-wide attributes apart from CIB ?
 
  
 
 Regards
 
 Nirmala
 
  
 
  
 
 This e-mail and attachments contain confidential information from
 HUAWEI, which is intended only for the person or entity whose address is
 listed above. Any use of the information contained herein in any way
 (including, but not limited to, total or partial disclosure,
 reproduction, or
 
 dissemination) by persons other than the intended recipient's) is
 prohibited. If you receive this e-mail in error, please notify the
 sender by phone or email immediately and delete it!
 
  
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Questions about reasonable cluster size...

2011-10-20 Thread Steven Dake
On 10/20/2011 07:42 AM, Alan Robertson wrote:
 On 10/20/2011 03:11 AM, Proskurin Kirill wrote:
 On 10/20/2011 03:15 AM, Steven Dake wrote:
 On 10/19/2011 01:50 PM, Alan Robertson wrote:
 Hi,

 I have an application where having a 12-node cluster with about 250
 resources would be desirable.

 Is this reasonable?  Can Pacemaker+Corosync be expected to reliably
 handle a cluster of this size?

 If not, what is the current recommendation for maximum number of nodes
 and resources?
 Steven Dake wrote:
 
 We regularly test 16 nodes.  As far as resources go, Andrew could answer
 that.
 

 I start to have problems with 10+ nodes. It's heavily dependent on the
 corosync configuration AFAIK. You should test it.
 This is somewhat different from Steven's comment.  Exactly what things
 did you have in mind for the corosync configuration that could either
 help or hurt with larger clusters?
 
 Steven:  Proskurin seems to think that there are some particular things
 to watch out for in the Corosync configuration for larger clusters. 
 Does anything come to mind for you about this?
 
 

We do 16 node testing with token=10000 (10 seconds).  The rest of the
parameters autoconfigure.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] corosync mailing list address change

2011-10-20 Thread Steven Dake
Sending one last reminder that the Corosync mailing list has changed
homes from the Linux Foundation's servers.  I have been unable to obtain
the previous subscriber list, so please resubscribe.

http://lists.corosync.org/mailman/listinfo

The list is called discuss.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Questions about reasonable cluster size...

2011-10-19 Thread Steven Dake
On 10/19/2011 01:50 PM, Alan Robertson wrote:
 Hi,
 
 I have an application where having a 12-node cluster with about 250
 resources would be desirable.
 
 Is this reasonable?  Can Pacemaker+Corosync be expected to reliably
 handle a cluster of this size?
 
 If not, what is the current recommendation for maximum number of nodes
 and resources?
 
 Many thanks!
 

We regularly test 16 nodes.  As far as resources go, Andrew could answer
that.

Regards
-steve



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Building a Corosync 1.4.1 RPM package for SLES11 SP1

2011-09-01 Thread Steven Dake
On 08/31/2011 11:39 PM, Sebastian Kaps wrote:
 Hi,
 
 I'm trying to compile Corosync v1.4.1 from source[1] and create an RPM
 x86_64 package for SLES11 SP1.
 When running make rpm the build process complains about a broken
 dependency for the nss-devel package.
 The package is not installed on the system - mozilla-nss (non-devel),
 however, is.
 
 I'd be fine if I could just build the package without using the nss libs.
 I have no problem compiling Corosync using ./configure --disable-nss
 followed by make, but I see no way of
 doing that with the make rpm command.
 
 Alternatively I'd compile everything --with-nss, but I can't install the
 mozilla-nss-devel package,
 because the version on the SLE11-SP1-SDK DVD is older than the installed
 mozilla-nss package (3.12.6-3.1.1
 vs. 3.12.8-1.2.1) and creates a conflict when I try to install it.
 
 [1] ftp://corosync.org/downloads/corosync-1.4.1/corosync-1.4.1.tar.gz

Thanks for pointing out this problem with the build tools for corosync.
 nss should be conditionalized.  This would allow rpmbuild --with-nss or
rpmbuild --without-nss from the default rpm builds.  I would send a
patch to the openais ml to resolve this problem but it is not operating
at the moment, so I'll send one here for you to give a spin.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-07 Thread Steven Dake
On 08/04/2011 02:04 PM, Sebastian Kaps wrote:
 Hi Steven,
 
 On 04.08.2011, at 20:59, Steven Dake wrote:
 
 meaning the corosync community doesn't investigate redundant ring issues
 prior to corosync versions 1.4.1.
 
 Sadly, we need to use the SLES version for support reasons.
 I'll try to convince them to supply us with a fix for this problem.
 
 In the meantime: would it be safe to leave the backup ring marked faulty
 the next time this happens? Would this result in a state that is effectively
 like having no second ring, or is there a chance that this might still
 affect the cluster's stability?

If a ring is marked faulty, it is no longer operational and there is no
longer a redundant network.

 To my knowledge, changing the ring configuration requires a complete 
 restart of the cluster framework on all nodes, right?
 

Yes, although fixing the retransmit list problem will not require a restart.

Regards
-steve

 I expect the root of your problem is already fixed (the retransmit list
 problem) however in the repos and latest released versions.
 
 
 I'll try to get an update as soon as possible. Thanks a lot!
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/03/2011 11:31 PM, Tegtmeier.Martin wrote:
 Hello again,
 
 in my case it is always the slower ring that fails (the 100 Mbit network). Does
 rrp_mode passive expect both rings to have the same speed?
 
 Sebastian, can you confirm that in your environment also the slower ring 
 fails?
 
 Thanks,
   -Martin
 
 

Martin,

I have never tested faster+slower networks in redundant ring configs.
We just recently added support for this feature in the corosync project
meaning we can start to tackle some of these issues going forward.

The protocol is designed to limit to the speed of the slowest ring -
perhaps this is not working as intended.

Regards
-steve

 -Original Message-
 From: Tegtmeier.Martin [mailto:martin.tegtme...@realtech.com] 
 Sent: Mittwoch, 3. August 2011 11:03
 To: The Pacemaker cluster resource manager
 Subject: AW: [Pacemaker] Backup ring is marked faulty
 
 Hello,
 
 we have exactly the same issue! Same version of corosync (1.3.1), also 
 running on SuSE Linux Enterprise Server 11 SP1 with HAE.
 
 Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 6a
 
 Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 63
 
 Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 60
 
 Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 6d
 
 Aug 01 15:45:18 corosync [TOTEM ] Marking seqid 162 ringid 1 interface 
 10.2.2.6 FAULTY - administrative intervention required.
 
 rksaph06:/var/log/cluster # corosync-cfgtool -s
 
 Printing ring status.
 
 Local node ID 101717164
 
 RING ID 0
 
 id  = 172.20.16.6
 
 status  = ring 0 active with no faults
 
 RING ID 1
 
 id  = 10.2.2.6
 
 status  = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - 
 administrative intervention required.
 
 
 
 rrp_mode is set to passive
 Ring 0 (172.20.16.0) supports 1GB and ring 1 (10.2.2.0) supports 100 MBit. 
 There was no other network traffic on ring 1 - only corosync (!)
 
 After re-activating both rings with corosync-cfgtool -r the problem is 
 reproducable by simply connecting a crm_gui and hitting refresh inside the 
 GUI 3-5 times. After that ring 1 (10.2.2.0) will be marked as faulty again.
 
 Thanks and best regards,
   -Martin Tegtmeier
 
 
 
 
 -----Original Message-----
 From: Sebastian Kaps [mailto:sebastian.k...@imail.de]
 Sent: Wed 03.08.2011 08:53
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Backup ring is marked faulty
  
  Hi Steven!
 
  On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
 Which version of corosync?
 
  # corosync -v
  Corosync Cluster Engine, version '1.3.1'
  Copyright (c) 2006-2009 Red Hat, Inc.
 
  It's the version that comes with SLES11-SP1-HA.
 
 --
  Sebastian
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Steven Dake
On 08/04/2011 05:46 AM, Sebastian Kaps wrote:
 Hello,
 
 here's another problem we're having:
 
 Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected
 for 11149 ms, flushing membership messages.

This process pause message indicates the scheduler doesn't schedule
corosync for 11 seconds, which is greater than the failure detection
timeouts.  What does your config file look like?  What load are you running?

Regards
-steve

 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1)
 r(1) ip(x.y.z.3)
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.2)
 r(1) ip(x.y.z.1)
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
 Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] notice:
 pcmk_peer_update: Transitional membership event on ring 9708: memb=1,
 new=0, lost=1
 Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update:
 memb: node01 16885952
 Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update:
 lost: node02 33663168
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1)
 r(1) ip(x.y.z.3)
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
 Jul 31 03:51:11 node01 crmd: [5912]: notice: ais_dispatch_message:
 Membership 9708: quorum lost
 
 Node01 gets Stonith'd shortly after that. There is no indication
 whatsoever that this would happen in the logs.
 For at least half an hour before that there's only the normal
 status-message noise from monitor ops etc.
 
 Jul 31 03:51:01 node02 corosync[5810]:  [TOTEM ] A processor failed,
 forming new configuration.
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2)
 r(1) ip(x.y.z.1)
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.1)
 r(1) ip(x.y.z.3)
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
 Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] notice:
 pcmk_peer_update: Transitional membership event on ring 9708: memb=1,
 new=0, lost=1
 Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update:
 memb: node02 33663168
 Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update:
 lost: node01 16885952
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2)
 r(1) ip(x.y.z.1)
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
 
 What does Process pause detected mean?
 
 Quoting from my other recent post regarding the backup ring being marked
 faulty sporadically:
 
 |We're running a two-node cluster with redundant rings.
 |Ring 0 is a 10 GB direct connection; ring 1 consists of two 1GB
 interfaces that are bonded in
 |active-backup mode and routed through two independent switches for each
 node. The ring 1 network
 |is our normal 1G LAN and should only be used in case the direct 10G
 connection should fail.
 |
 |Corosync Cluster Engine, version '1.3.1'
 |Copyright (c) 2006-2009 Red Hat, Inc.
 |
 |It's the version that comes with SLES11-SP1-HA.
 
 Thanks in advance!
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

2011-08-04 Thread Steven Dake
On 08/03/2011 06:39 PM, Bob Schatz wrote:
 Steven,
 
 Are you planning on recording/taping it if I want to watch it later?
 
 Thanks,
 
 Bob

Bob,

Yes I will record if I can beat elluminate into submission.

Regards
-steve


 
 
 From: Steven Dake sd...@redhat.com
 To: pcmk-cl...@oss.clusterlabs.org
 Cc: aeolus-de...@lists.fedorahosted.org; Fedora Cloud SIG
 cl...@lists.fedoraproject.org; open...@lists.linux-foundation.org
 open...@lists.linux-foundation.org; The Pacemaker cluster resource
 manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, August 3, 2011 9:42 AM
 Subject: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday
 August 5th at 8am PST
 
 Extending a general invitation to the high availability communities and
 other cloud community contributors to participate in a live demo I am
 giving on Friday August 5th 8am PST (GMT-7).  Demo portion of session is
 15 minutes and will be provided first followed by more details of our
 approach to high availability.
 
 I will use elluminate to show the demo on my desktop machine.  To make
 elluminate work, you will need icedtea-web installed on your system
 which is not typically installed by default.
 
 You will also need a conference # and bridge code.  Please contact me
 offlist with your location and I'll provide you with a hopefully toll
 free conference # and bridge code.
 
 Elluminate link:
 https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F
 
 Bridge Code:  Please contact me off list with your location and I'll
 respond back with dial-in information.
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/02/2011 11:53 PM, Sebastian Kaps wrote:
 Hi Steven!
 
 On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
 Which version of corosync?
 
 # corosync -v
 Corosync Cluster Engine, version '1.3.1'
 Copyright (c) 2006-2009 Red Hat, Inc.
 
 It's the version that comes with SLES11-SP1-HA.
 

redundant ring is only supported upstream in corosync 1.4.1 or later.

The retransmit list message issue you are having is fixed in corosync
1.3.3 and later.  This is what is triggering the redundant ring faulty
error.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/04/2011 11:43 AM, Sebastian Kaps wrote:
 Hi Steven,
 
 On 04.08.2011, at 18:27, Steven Dake wrote:
 
 redundant ring is only supported upstream in corosync 1.4.1 or later.
 
 What does supported mean in this context, exactly? 
 

meaning the corosync community doesn't investigate redundant ring issues
prior to corosync versions 1.4.1.

I expect the root of your problem is already fixed (the retransmit list
problem) however in the repos and latest released versions.

Regards
-steve

 I'm asking, because we're having serious issues with these systems since 
 they went into production (the testing phase did not show any problems, 
 but we also couldn't use real workloads then).
 
 Since the cluster went productive, we're having issues with seemingly random 
 STONITH events that seem to be related to a high I/O load on a DRBD-mirrored
 OCFS2 volume - but I don't see any pattern yet. We've had these machines 
 running for nearly two weeks without major problems and suddenly they went 
 back to killing each other :-(
 
 The retransmit list message issues you are having is fixed in corosync
 1.3.3. and later  This is what is triggering the redundant ring faulty
 error.
 
 Could it also cause the instability problems we're seeing?
 Thanks again, for helping!

yes

 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

2011-08-03 Thread Steven Dake
Extending a general invitation to the high availability communities and
other cloud community contributors to participate in a live demo I am
giving on Friday August 5th 8am PST (GMT-7).  Demo portion of session is
15 minutes and will be provided first followed by more details of our
approach to high availability.

I will use elluminate to show the demo on my desktop machine.  To make
elluminate work, you will need icedtea-web installed on your system
which is not typically installed by default.

You will also need a conference # and bridge code.  Please contact me
offlist with your location and I'll provide you with a hopefully toll
free conference # and bridge code.

Elluminate link:
https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F

Bridge Code:  Please contact me off list with your location and I'll
respond back with dial-in information.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-02 Thread Steven Dake
Which version of corosync?

On 08/02/2011 07:35 AM, Sebastian Kaps wrote:
 Hi,
 
 we're running a two-node cluster with redundant rings.
 Ring 0 is a 10 GB direct connection; ring 1 consists of two 1GB
 interfaces that are bonded in
 active-backup mode and routed through two independent switches for each
 node. The ring 1 network
 is our normal 1G LAN and should only be used in case the direct 10G
 connection should fail.
 I often (once a day on average, I'd guess) see that ring 1 (and only that
 one) is marked as
 FAULTY without any obvious reason.
 
 Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c76
 c7a c7c c7e c80 c82 c84
 Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c82
 Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Marking seqid 568416
 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
 
 Whenever I see this, I check if the other node's address can be pinged
 (I never saw any
 connectivity problems there), then reenable the ring with
 corosync-cfgtool -r and
 everything looks ok for a while (i.e. hours or days).
 
 How could I find out why this happens?
 What do these Retransmit List or seqid (sequence id, I assume?) values
 tell me?
 Is it safe to reenable the second ring when the partner node can be
 pinged successfully?
 
 The totem section on our config looks like this:
 
 totem {
rrp_mode:   passive
join:   60
max_messages:   20
vsftype:none
consensus:  1
secauth:on
token_retransmits_before_loss_const:10
threads:16
token:  1
version:2
interface {
bindnetaddr:192.168.1.0
mcastaddr:  239.250.1.1
mcastport:  5405
ringnumber: 0
}
interface {
bindnetaddr:x.y.z.0
mcastaddr:  239.250.1.2
mcastport:  5415
ringnumber: 1
}
clear_node_high_bit:yes
 }
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Announcing Pacemaker Cloud 0.4.1 - Available now for download!

2011-07-27 Thread Steven Dake
Angus and I announced a project to apply high availability best known
practice to the field of cloud computing in late March 2011.  We reuse
the policy engine of Pacemaker.  Our first tarball is available today
containing a functional prototype demonstrating these best known practices.

Today the software supports a deployable/assembly model.  Assemblies
represent a virtual machine and deployables represent a collection of
virtual machines.  Resources within a virtual machine can be monitored
for failure and recovered.  Assemblies and deployables are also
monitored for failure and recovered.

Currently the significant limitation with the software is that it
operates single node.  As a result it is not suitable for deployment
today.  We plan to address this in the future by integrating with other
cloud infrastructure systems such as Aeolus (developer ml on CC list).

The software will be available in Fedora 16 for all to evaluate that run
Fedora.  Your feedback is greatly appreciated.  To provide feedback,
join the mailing list:

http://oss.clusterlabs.org/mailman/listinfo/pcmk-cloud/

If you have interest in developing for cloud environments around the
topic of high availability, please feel free to download our git repo
and submit patches.  We also are interested in user feedback!

To get the software, check out:

http://pacemaker-cloud.org/

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist

2011-07-22 Thread Steven Dake
On 07/22/2011 01:15 AM, Proskurin Kirill wrote:
 Hello all.
 
 
 pacemaker-1.1.5
 corosync-1.4.0
 
 4 nodes in cluster. 3 online 1 not.
 In logs:
 
 Jul 22 11:50:23 my106.example.com crmd: [28030]: info:
 pcmk_quorum_notification: Membership 0: quorum retained (0)
 Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started:
 Delaying start, no membership data (0010)
 Jul 22 11:50:23 my106.example.com crmd: [28030]: info:
 config_query_callback: Shutdown escalation occurs after: 120ms
 Jul 22 11:50:23 my106.example.com crmd: [28030]: info:
 config_query_callback: Checking for expired actions every 90ms
 Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started:
 Delaying start, no membership data (0010)
 Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect:
 Connected to the CIB after 1 signon attempts
 Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect:
 Sending full refresh
 Jul 22 11:52:18 corosync [TOTEM ] A processor joined or left the
 membership and a new membership was formed.
 Jul 22 11:52:18 corosync [CPG   ] chosen downlist: sender r(0)
 ip(10.3.1.107) ; members(old:4 left:1)
 Jul 22 11:52:18 corosync [MAIN  ] Completed service synchronization,
 ready to provide service.
 Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR:
 send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
 Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR:
 send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
 Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR:
 send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
 
 
 
 DC:
 
 Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
 Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
 Jul 22 11:50:07 my107.example.com pacemakerd: [22388]: info:
 update_node_processes: Node my106.example.com now has process list:
 0002 (was 00
 12)
 Jul 22 11:50:07 my107.example.com attrd: [22397]: info: crm_update_peer:
 Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0
 seen=0 proc=00
 02 (new)
 Jul 22 11:50:07 my107.example.com cib: [22395]: info: crm_update_peer:
 Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0
 seen=0 proc=0002
  (new)
 Jul 22 11:50:07 my107.example.com stonith-ng: [22394]: info:
 crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null)
 votes=0 born=0 seen=0 proc=0
 002 (new)
 Jul 22 11:50:07 my107.example.com crmd: [22399]: info: crm_update_peer:
 Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0
 seen=0 proc=000
 2 (new)
 Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
 Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
 
 
 There is a problem?
 

Does your retransmit list continually display e4 e5 etc. for the rest of the
cluster lifetime, or is this short-lived?



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-07 Thread Steven Dake
On 06/07/2011 04:44 AM, william felipe_welter wrote:
 Two more questions: will the patch for the mmap calls be in the main
 development branch for all archs?
 Any problem if I send these patches to the Debian project?
 

These patches will go into the maintenance branches

You can send them to whoever you like ;)

Regards
-steve

 2011/6/3 Steven Dake sd...@redhat.com:
 On 06/02/2011 08:16 PM, william felipe_welter wrote:
 Well,

 Now with this patch, the pacemakerd process starts and brings up its other
 processes (crmd, lrmd, pengine), but after the pacemakerd process does
 a fork, the forked pacemakerd process dies due to signal 10, Bus
 error.  And in the log, the pacemaker processes (crmd, lrmd,
 pengine) can't connect to the openais plugin (possibly because of the
 death of the pacemakerd process).
 But this time when the forked pacemakerd dies, it generates a coredump.

 gdb  -c /usr/var/lib/heartbeat/cores/root/ pacemakerd 7986  -se
 /usr/sbin/pacemakerd :
 GNU gdb (GDB) 7.0.1-debian
 Copyright (C) 2009 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later 
 http://gnu.org/licenses/gpl.html
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type show copying
 and show warranty for details.
 This GDB was configured as sparc-linux-gnu.
 For bug reporting instructions, please see:
 http://www.gnu.org/software/gdb/bugs/...
 Reading symbols from /usr/sbin/pacemakerd...done.
 Reading symbols from /usr/lib64/libuuid.so.1...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib64/libuuid.so.1
 Reading symbols from /usr/lib/libcoroipcc.so.4...done.
 Loaded symbols for /usr/lib/libcoroipcc.so.4
 Reading symbols from /usr/lib/libcpg.so.4...done.
 Loaded symbols for /usr/lib/libcpg.so.4
 Reading symbols from /usr/lib/libquorum.so.4...done.
 Loaded symbols for /usr/lib/libquorum.so.4
 Reading symbols from /usr/lib64/libcrmcommon.so.2...done.
 Loaded symbols for /usr/lib64/libcrmcommon.so.2
 Reading symbols from /usr/lib/libcfg.so.4...done.
 Loaded symbols for /usr/lib/libcfg.so.4
 Reading symbols from /usr/lib/libconfdb.so.4...done.
 Loaded symbols for /usr/lib/libconfdb.so.4
 Reading symbols from /usr/lib64/libplumb.so.2...done.
 Loaded symbols for /usr/lib64/libplumb.so.2
 Reading symbols from /usr/lib64/libpils.so.2...done.
 Loaded symbols for /usr/lib64/libpils.so.2
 Reading symbols from /lib/libbz2.so.1.0...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libbz2.so.1.0
 Reading symbols from /usr/lib/libxslt.so.1...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libxslt.so.1
 Reading symbols from /usr/lib/libxml2.so.2...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libxml2.so.2
 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
 Loaded symbols for /lib/libc.so.6
 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/librt.so.1
 Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
 Loaded symbols for /lib/libdl.so.2
 Reading symbols from /lib/libglib-2.0.so.0...(no debugging symbols
 found)...done.
 Loaded symbols for /lib/libglib-2.0.so.0
 Reading symbols from /usr/lib/libltdl.so.7...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libltdl.so.7
 Reading symbols from /lib/ld-linux.so.2...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/ld-linux.so.2
 Reading symbols from /lib/libpthread.so.0...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libpthread.so.0
 Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
 Loaded symbols for /lib/libm.so.6
 Reading symbols from /usr/lib/libz.so.1...(no debugging symbols 
 found)...done.
 Loaded symbols for /usr/lib/libz.so.1
 Reading symbols from /lib/libpcre.so.3...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libpcre.so.3
 Reading symbols from /lib/libnss_compat.so.2...(no debugging symbols
 found)...done.
 Loaded symbols for /lib/libnss_compat.so.2
 Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/libnsl.so.1
 Reading symbols from /lib/libnss_nis.so.2...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libnss_nis.so.2
 Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
 found)...done.
 Loaded symbols for /lib/libnss_files.so.2
 Core was generated by `pacemakerd'.
 Program terminated with signal 10, Bus error.
 #0  cpg_dispatch (handle=17861288972693536769, dispatch_types=7986) at 
 cpg.c:339
 339   switch (dispatch_data->id) {
 (gdb) bt
 #0  cpg_dispatch (handle=17861288972693536769, dispatch_types=7986) at 
 cpg.c:339
 #1  0xf6f100f0 in ?? ()
 #2  0xf6f100f4 in ?? ()
 Backtrace stopped: previous frame identical to this frame (corrupt stack?)



 I took a look at cpg.c and see that the dispatch_data was acquired
 by coroipcc_dispatch_get

[Pacemaker] Updated pacemaker-cloud.org website

2011-06-06 Thread Steven Dake
Hi,

I want to spend a moment to tell you about our new website at
http://pacemaker-cloud.org.  This website will serve as our information
store and tarball repo location for the Pacemaker-Cloud project.  The
features page contains the feature set we plan to deliver.

Please have a look and forward any questions or comments to:

pcmk-cl...@oss.clusterlabs.org.

A big thanks to Adam Stokes who worked on the Matahari website design.
We used his design as our inspiration for most of our website.  Also a
thanks to Angus Salkeld for contributing to moving our hosting to github.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-03 Thread Steven Dake
   ] exit_fn for conn=0x62500
 Jun 02 23:12:21 corosync [TOTEM ] mcasted message added to pending queue
 Jun 02 23:12:21 corosync [TOTEM ] Delivering 15 to 16
 Jun 02 23:12:21 corosync [TOTEM ] Delivering MCAST message with seq 16
 to pending delivery queue
 Jun 02 23:12:21 corosync [CPG   ] got procleave message from cluster
 node 1377289226
 Jun 02 23:12:21 corosync [TOTEM ] releasing messages up to and including 16
 Jun 02 23:12:21 xx attrd: [7992]: info: Invoked:
 /usr/lib64/heartbeat/attrd
 Jun 02 23:12:21 xx attrd: [7992]: info: crm_log_init_worker:
 Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
 Jun 02 23:12:21 xx attrd: [7992]: info: main: Starting up
 Jun 02 23:12:21 xx attrd: [7992]: info: get_cluster_type:
 Cluster type is: 'openais'.
 Jun 02 23:12:21 xx attrd: [7992]: info: crm_cluster_connect:
 Connecting to cluster infrastructure: classic openais (with plugin)
 Jun 02 23:12:21 xx attrd: [7992]: info:
 init_ais_connection_classic: Creating connection to our Corosync
 plugin
 Jun 02 23:12:21 xx attrd: [7992]: info:
 init_ais_connection_classic: Connection to our AIS plugin (9) failed:
 Doesn't exist (12)
 Jun 02 23:12:21 xx attrd: [7992]: ERROR: main: HA Signon failed
 Jun 02 23:12:21 xx attrd: [7992]: info: main: Cluster connection 
 active
 Jun 02 23:12:21 xx attrd: [7992]: info: main: Accepting
 attribute updates
 Jun 02 23:12:21 xx attrd: [7992]: ERROR: main: Aborting startup
 Jun 02 23:12:21 xx crmd: [7994]: debug:
 init_client_ipc_comms_nodispatch: Attempting to talk on:
 /usr/var/run/crm/cib_rw
 Jun 02 23:12:21 xx crmd: [7994]: debug:
 init_client_ipc_comms_nodispatch: Could not init comms on:
 /usr/var/run/crm/cib_rw
 Jun 02 23:12:21 xx crmd: [7994]: debug: cib_native_signon_raw:
 Connection to command channel failed
 Jun 02 23:12:21 xx crmd: [7994]: debug:
 init_client_ipc_comms_nodispatch: Attempting to talk on:
 /usr/var/run/crm/cib_callback
 ...
 
 
 2011/6/2 Steven Dake sd...@redhat.com:
 On 06/01/2011 11:05 PM, william felipe_welter wrote:
 I recompile my kernel without hugetlb .. and the result are the same..

 My test program still resulting:
 PATH=/dev/shm/teste123XX
 page size=2
 fd=3
 ADDR_ORIG:0xe000a000  ADDR:0xffffffff
 Erro

 And Pacemaker still resulting because the mmap error:
 Could not initialize Cluster Configuration Database API instance error 2


 Give the patch I posted recently a spin - corosync WFM with this patch
 on sparc64 with hugetlb set.  Please report back results.

 Regards
 -steve

 For make sure that i have disable the hugetlb there is my /proc/meminfo:
 MemTotal:   33093488 kB
 MemFree:32855616 kB
 Buffers:5600 kB
 Cached:53480 kB
 SwapCached:0 kB
 Active:45768 kB
 Inactive:  28104 kB
 Active(anon):  18024 kB
 Inactive(anon): 1560 kB
 Active(file):  27744 kB
 Inactive(file):26544 kB
 Unevictable:   0 kB
 Mlocked:   0 kB
 SwapTotal:   6104680 kB
 SwapFree:6104680 kB
 Dirty: 0 kB
 Writeback: 0 kB
 AnonPages: 14936 kB
 Mapped: 7736 kB
 Shmem:  4624 kB
 Slab:  39184 kB
 SReclaimable:  10088 kB
 SUnreclaim:29096 kB
 KernelStack:7088 kB
 PageTables: 1160 kB
 Quicklists:17664 kB
 NFS_Unstable:  0 kB
 Bounce:0 kB
 WritebackTmp:  0 kB
 CommitLimit:22651424 kB
 Committed_AS: 519368 kB
 VmallocTotal:   1069547520 kB
 VmallocUsed:   11064 kB
 VmallocChunk:   1069529616 kB


 2011/6/1 Steven Dake sd...@redhat.com:
 On 06/01/2011 07:42 AM, william felipe_welter wrote:
 Steven,

 cat /proc/meminfo
 ...
 HugePages_Total:   0
 HugePages_Free:0
 HugePages_Rsvd:0
 HugePages_Surp:0
 Hugepagesize:   4096 kB
 ...


 It definitely requires a kernel compile and setting the config option to
 off.  I don't know the debian way of doing this.

 The only reason you may need this option is if you have very large
 memory sizes, such as 48GB or more.

 Regards
 -steve

 Its 4MB..

 How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to
 kernel ?)

  2011/6/1 Steven Dake sd...@redhat.com

 On 06/01/2011 01:05 AM, Steven Dake wrote:
  On 05/31/2011 09:44 PM, Angus Salkeld wrote:
  On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter
 wrote:
  Angus,
 
  I make some test program (based on the code coreipcc.c) and i
 now i sure
  that are problems with the mmap systems call on sparc..
 
  Source code of my test program:
 
   #include <stdlib.h>
   #include <sys/mman.h>
   #include <stdio.h>
 
  #define PATH_MAX  36
 
  int main()
  {
 
  int32_t fd;
  void *addr_orig;
  void *addr;
  char path[PATH_MAX

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-02 Thread Steven Dake
On 06/01/2011 11:05 PM, william felipe_welter wrote:
 I recompile my kernel without hugetlb .. and the result are the same..
 
 My test program still resulting:
 PATH=/dev/shm/teste123XX
 page size=2
 fd=3
 ADDR_ORIG:0xe000a000  ADDR:0xffffffff
 Erro
 
 And Pacemaker still resulting because the mmap error:
 Could not initialize Cluster Configuration Database API instance error 2
 

Give the patch I posted recently a spin - corosync WFM with this patch
on sparc64 with hugetlb set.  Please report back results.

Regards
-steve

 For make sure that i have disable the hugetlb there is my /proc/meminfo:
 MemTotal:   33093488 kB
 MemFree:32855616 kB
 Buffers:5600 kB
 Cached:53480 kB
 SwapCached:0 kB
 Active:45768 kB
 Inactive:  28104 kB
 Active(anon):  18024 kB
 Inactive(anon): 1560 kB
 Active(file):  27744 kB
 Inactive(file):26544 kB
 Unevictable:   0 kB
 Mlocked:   0 kB
 SwapTotal:   6104680 kB
 SwapFree:6104680 kB
 Dirty: 0 kB
 Writeback: 0 kB
 AnonPages: 14936 kB
 Mapped: 7736 kB
 Shmem:  4624 kB
 Slab:  39184 kB
 SReclaimable:  10088 kB
 SUnreclaim:29096 kB
 KernelStack:7088 kB
 PageTables: 1160 kB
 Quicklists:17664 kB
 NFS_Unstable:  0 kB
 Bounce:0 kB
 WritebackTmp:  0 kB
 CommitLimit:22651424 kB
 Committed_AS: 519368 kB
 VmallocTotal:   1069547520 kB
 VmallocUsed:   11064 kB
 VmallocChunk:   1069529616 kB
 
 
 2011/6/1 Steven Dake sd...@redhat.com:
 On 06/01/2011 07:42 AM, william felipe_welter wrote:
 Steven,

 cat /proc/meminfo
 ...
 HugePages_Total:   0
 HugePages_Free:0
 HugePages_Rsvd:0
 HugePages_Surp:0
 Hugepagesize:   4096 kB
 ...


 It definitely requires a kernel compile and setting the config option to
 off.  I don't know the debian way of doing this.

 The only reason you may need this option is if you have very large
 memory sizes, such as 48GB or more.

 Regards
 -steve

 Its 4MB..

 How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to
 kernel ?)

  2011/6/1 Steven Dake sd...@redhat.com

 On 06/01/2011 01:05 AM, Steven Dake wrote:
  On 05/31/2011 09:44 PM, Angus Salkeld wrote:
  On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter
 wrote:
  Angus,
 
  I make some test program (based on the code coreipcc.c) and i
 now i sure
  that are problems with the mmap systems call on sparc..
 
  Source code of my test program:
 
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <stdio.h>

 #define PATH_MAX  36

 int main()
 {

 int32_t fd;
 void *addr_orig;
 void *addr;
 char path[PATH_MAX];
 const char *file = "teste123XXXXXX";
 size_t bytes=10024;

 snprintf (path, PATH_MAX, "/dev/shm/%s", file);
 printf("PATH=%s\n", path);

 fd = mkstemp (path);
 printf("fd=%d \n", fd);


 addr_orig = mmap (NULL, bytes, PROT_NONE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);


 addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
              MAP_FIXED | MAP_SHARED, fd, 0);

 printf("ADDR_ORIG:%p  ADDR:%p\n", addr_orig, addr);


 if (addr != addr_orig) {
      printf("Erro");
 }
 }
 
  Results on x86:
  PATH=/dev/shm/teste123XX
  fd=3
  ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000
 
  Results on sparc:
  PATH=/dev/shm/teste123XX
  fd=3
  ADDR_ORIG:0xf7f72000  ADDR:0xffffffff
 
  Note: 0xffffffff == MAP_FAILED
 
  (from man mmap)
  RETURN VALUE
 On success, mmap() returns a pointer to the mapped area.  On
 error, the value MAP_FAILED (that is, (void *) -1) is
 returned,
 and errno is  set appropriately.
 
 
 
  But im wondering if is really needed to call mmap 2 times ?
  What are the
  reason to call the mmap 2 times, on the second time using the
 address of the
  first?
 
 
  Well there are 3 calls to mmap()
  1) one to allocate 2 * what you need (in pages)
  2) maps the first half of the mem to a real file
  3) maps the second half of the mem to the same file
 
  The point is when you write to an address over the end of the
  first half of memory it is taken care of the the third mmap which
 maps
  the address back to the top of the file for you. This means you
  don't have to worry about ringbuffer wrapping which can be a
 headache.
 
  -Angus
 
 
  interesting this mmap operation doesn't work on sparc linux.
 
  Not sure how I can help here - Next step would be a follow up with the
  sparc linux

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-01 Thread Steven Dake
On 05/31/2011 09:44 PM, Angus Salkeld wrote:
 On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote:
 Angus,

 I made a test program (based on the code in coreipcc.c) and I am now sure
 that there are problems with the mmap system call on sparc..

 Source code of my test program:

 #include <stdlib.h>
 #include <sys/mman.h>
 #include <stdio.h>

 #define PATH_MAX  36

 int main()
 {

 int32_t fd;
 void *addr_orig;
 void *addr;
 char path[PATH_MAX];
 const char *file = "teste123XXXXXX";
 size_t bytes=10024;

 snprintf (path, PATH_MAX, "/dev/shm/%s", file);
 printf("PATH=%s\n", path);

 fd = mkstemp (path);
 printf("fd=%d \n", fd);


 addr_orig = mmap (NULL, bytes, PROT_NONE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);


 addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
              MAP_FIXED | MAP_SHARED, fd, 0);

 printf("ADDR_ORIG:%p  ADDR:%p\n", addr_orig, addr);


 if (addr != addr_orig) {
      printf("Erro");
 }
 }

 Results on x86:
 PATH=/dev/shm/teste123XX
 fd=3
 ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000

 Results on sparc:
 PATH=/dev/shm/teste123XX
 fd=3
 ADDR_ORIG:0xf7f72000  ADDR:0xffffffff
 
 Note: 0xffffffff == MAP_FAILED
 
 (from man mmap)
 RETURN VALUE
On success, mmap() returns a pointer to the mapped area.  On
error, the value MAP_FAILED (that is, (void *) -1) is returned,
and errno is  set appropriately.
 


 But im wondering if is really needed to call mmap 2 times ?  What are the
 reason to call the mmap 2 times, on the second time using the address of the
 first?


 Well there are 3 calls to mmap()
 1) one to allocate 2 * what you need (in pages)
 2) maps the first half of the mem to a real file
 3) maps the second half of the mem to the same file
 
 The point is when you write to an address over the end of the
 first half of memory it is taken care of the the third mmap which maps
 the address back to the top of the file for you. This means you
 don't have to worry about ringbuffer wrapping which can be a headache.
 
 -Angus
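
For illustration, here is a minimal, self-contained sketch of the double-mapping trick Angus describes (this is not the corosync or libqb code; the file name and size are arbitrary):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  int main(void)
  {
      size_t bytes = (size_t)sysconf(_SC_PAGESIZE);  /* one page for the demo */
      char path[] = "/dev/shm/ringXXXXXX";
      int fd = mkstemp(path);

      if (fd < 0 || ftruncate(fd, (off_t)bytes) < 0)
          return 1;

      /* 1) reserve 2 * bytes of contiguous address space */
      char *base = mmap(NULL, bytes * 2, PROT_NONE,
                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      if (base == MAP_FAILED)
          return 1;

      /* 2) map the file over the first half */
      if (mmap(base, bytes, PROT_READ | PROT_WRITE,
               MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED)
          return 1;

      /* 3) map the same file (offset 0) over the second half */
      if (mmap(base + bytes, bytes, PROT_READ | PROT_WRITE,
               MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED)
          return 1;

      /* a write through the first mapping is visible through the second,
         so writes that run past the end of the buffer wrap transparently */
      strcpy(base, "wrap");
      printf("%s == %s\n", base, base + bytes);

      unlink(path);
      return 0;
  }

On sparc it is the third mmap() (step 3) that keeps returning MAP_FAILED, which matches the failure being reported here.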
 

interesting this mmap operation doesn't work on sparc linux.

Not sure how I can help here - Next step would be a follow up with the
sparc linux mailing list.  I'll do that and cc you on the message - see
if we get any response.

http://vger.kernel.org/vger-lists.html





 2011/5/31 Angus Salkeld asalk...@redhat.com

 On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote:
 Thanks Steven,

 Now im try to run on the MCP:
 - Uninstall the pacemaker 1.0
 - Compile and install 1.1

 But now i have problems to initialize the pacemakerd: Could not
 initialize
 Cluster Configuration Database API instance error 2
 Debugging with gdb I see that the error is in the confdb.. more
 specifically,
 the errors start in coreipcc.c at this line:


 448        if (addr != addr_orig) {
 449                goto error_close_unlink;   <- enter here
 450        }

 Some ideia about  what can cause this  ?


 I tried porting a ringbuffer (www.libqb.org) to sparc and had the same
 failure.
 There are 3 mmap() calls and on sparc the third one keeps failing.

 This is a common way of creating a ring buffer, see:
 http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation

 I couldn't get it working in the short time I tried. It's probably
 worth looking at the clib implementation to see why it's failing
 (I didn't get to that).

 -Angus


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




 -- 
 William Felipe Welter
 --
 Consultor em Tecnologias Livres
 william.wel...@4linux.com.br
 www.4linux.com.br
 
 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-01 Thread Steven Dake
On 06/01/2011 01:05 AM, Steven Dake wrote:
 On 05/31/2011 09:44 PM, Angus Salkeld wrote:
 On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote:
 Angus,

 I made a test program (based on the code in coreipcc.c) and I am now sure
 that there are problems with the mmap system call on sparc..

 Source code of my test program:

 #include <stdlib.h>
 #include <sys/mman.h>
 #include <stdio.h>

 #define PATH_MAX  36

 int main()
 {

 int32_t fd;
 void *addr_orig;
 void *addr;
 char path[PATH_MAX];
 const char *file = "teste123XXXXXX";
 size_t bytes=10024;

 snprintf (path, PATH_MAX, "/dev/shm/%s", file);
 printf("PATH=%s\n", path);

 fd = mkstemp (path);
 printf("fd=%d \n", fd);


 addr_orig = mmap (NULL, bytes, PROT_NONE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);


 addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
              MAP_FIXED | MAP_SHARED, fd, 0);

 printf("ADDR_ORIG:%p  ADDR:%p\n", addr_orig, addr);


 if (addr != addr_orig) {
      printf("Erro");
 }
 }

 Results on x86:
 PATH=/dev/shm/teste123XX
 fd=3
 ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000

 Results on sparc:
 PATH=/dev/shm/teste123XX
 fd=3
 ADDR_ORIG:0xf7f72000  ADDR:0xffffffff

 Note: 0xffffffff == MAP_FAILED

 (from man mmap)
 RETURN VALUE
On success, mmap() returns a pointer to the mapped area.  On
error, the value MAP_FAILED (that is, (void *) -1) is returned,
and errno is  set appropriately.



 But im wondering if is really needed to call mmap 2 times ?  What are the
 reason to call the mmap 2 times, on the second time using the address of the
 first?


 Well there are 3 calls to mmap()
 1) one to allocate 2 * what you need (in pages)
 2) maps the first half of the mem to a real file
 3) maps the second half of the mem to the same file

 The point is when you write to an address over the end of the
 first half of memory it is taken care of the the third mmap which maps
 the address back to the top of the file for you. This means you
 don't have to worry about ringbuffer wrapping which can be a headache.

 -Angus

 
 interesting this mmap operation doesn't work on sparc linux.
 
 Not sure how I can help here - Next step would be a follow up with the
 sparc linux mailing list.  I'll do that and cc you on the message - see
 if we get any response.
 
 http://vger.kernel.org/vger-lists.html
 




 2011/5/31 Angus Salkeld asalk...@redhat.com

 On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote:
 Thanks Steven,

 Now im try to run on the MCP:
 - Uninstall the pacemaker 1.0
 - Compile and install 1.1

 But now i have problems to initialize the pacemakerd: Could not
 initialize
 Cluster Configuration Database API instance error 2
 Debbuging with gdb i see that the error are on the confdb.. most
 specificaly
 the errors start on coreipcc.c  at line:


 448if (addr != addr_orig) {
 449goto error_close_unlink;  - enter here
 450   }

 Some ideia about  what can cause this  ?


 I tried porting a ringbuffer (www.libqb.org) to sparc and had the same
 failure.
 There are 3 mmap() calls and on sparc the third one keeps failing.

 This is a common way of creating a ring buffer, see:
 http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation

 I couldn't get it working in the short time I tried. It's probably
 worth looking at the clib implementation to see why it's failing
 (I didn't get to that).

 -Angus


Note, we believe we have sorted this out.  Your kernel has hugetlb enabled,
probably with 4MB pages.  This requires corosync to allocate 4MB pages.

Can you verify your hugetlb settings?

If you can turn this option off, you should have atleast a working corosync.

Regards
-steve

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




 -- 
 William Felipe Welter
 --
 Consultor em Tecnologias Livres
 william.wel...@4linux.com.br
 www.4linux.com.br

 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
 
 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais


___
Pacemaker mailing

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-01 Thread Steven Dake
On 06/01/2011 07:42 AM, william felipe_welter wrote:
 Steven,
 
 cat /proc/meminfo
 ...
 HugePages_Total:   0
 HugePages_Free:0
 HugePages_Rsvd:0
 HugePages_Surp:0
 Hugepagesize:   4096 kB
 ...
 

It definitely requires a kernel compile and setting the config option to
off.  I don't know the debian way of doing this.

The only reason you may need this option is if you have very large
memory sizes, such as 48GB or more.
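
A quick way to double-check what the running kernel is doing (the second command only works if the kernel was built with CONFIG_IKCONFIG_PROC; paths are illustrative):

   $ grep -i huge /proc/meminfo
   $ zgrep HUGETLB /proc/config.gz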

Regards
-steve

 Its 4MB..
 
 How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to
 kernel ?)
 
 2011/6/1 Steven Dake sd...@redhat.com
 
 On 06/01/2011 01:05 AM, Steven Dake wrote:
  On 05/31/2011 09:44 PM, Angus Salkeld wrote:
  On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter
 wrote:
  Angus,
 
  I make some test program (based on the code coreipcc.c) and i
 now i sure
  that are problems with the mmap systems call on sparc..
 
  Source code of my test program:
 
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <stdio.h>

  #define PATH_MAX  36

  int main()
  {

  int32_t fd;
  void *addr_orig;
  void *addr;
  char path[PATH_MAX];
  const char *file = "teste123XXXXXX";
  size_t bytes=10024;

  snprintf (path, PATH_MAX, "/dev/shm/%s", file);
  printf("PATH=%s\n", path);

  fd = mkstemp (path);
  printf("fd=%d \n", fd);


  addr_orig = mmap (NULL, bytes, PROT_NONE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);


  addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
               MAP_FIXED | MAP_SHARED, fd, 0);

  printf("ADDR_ORIG:%p  ADDR:%p\n", addr_orig, addr);


  if (addr != addr_orig) {
       printf("Erro");
  }
  }
 
  Results on x86:
  PATH=/dev/shm/teste123XX
  fd=3
  ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000
 
  Results on sparc:
  PATH=/dev/shm/teste123XX
  fd=3
  ADDR_ORIG:0xf7f72000  ADDR:0xffffffff
 
  Note: 0xffffffff == MAP_FAILED
 
  (from man mmap)
  RETURN VALUE
 On success, mmap() returns a pointer to the mapped area.  On
 error, the value MAP_FAILED (that is, (void *) -1) is
 returned,
 and errno is  set appropriately.
 
 
 
  But im wondering if is really needed to call mmap 2 times ?
  What are the
  reason to call the mmap 2 times, on the second time using the
 address of the
  first?
 
 
  Well there are 3 calls to mmap()
  1) one to allocate 2 * what you need (in pages)
  2) maps the first half of the mem to a real file
  3) maps the second half of the mem to the same file
 
  The point is when you write to an address over the end of the
  first half of memory it is taken care of the the third mmap which
 maps
  the address back to the top of the file for you. This means you
  don't have to worry about ringbuffer wrapping which can be a
 headache.
 
  -Angus
 
 
  interesting this mmap operation doesn't work on sparc linux.
 
  Not sure how I can help here - Next step would be a follow up with the
  sparc linux mailing list.  I'll do that and cc you on the message
 - see
  if we get any response.
 
  http://vger.kernel.org/vger-lists.html
 
 
 
 
 
   2011/5/31 Angus Salkeld asalk...@redhat.com
 
  On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter
 wrote:
  Thanks Steven,
 
  Now im try to run on the MCP:
  - Uninstall the pacemaker 1.0
  - Compile and install 1.1
 
  But now i have problems to initialize the pacemakerd: Could not
  initialize
  Cluster Configuration Database API instance error 2
  Debbuging with gdb i see that the error are on the confdb.. most
  specificaly
  the errors start on coreipcc.c  at line:
 
 
  448if (addr != addr_orig) {
  449goto error_close_unlink;  - enter here
  450   }
 
  Some ideia about  what can cause this  ?
 
 
   I tried porting a ringbuffer (www.libqb.org) to sparc and had the same
  failure.
  There are 3 mmap() calls and on sparc the third one keeps failing.
 
  This is a common way of creating a ring buffer, see:
 
 
 http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation
 
  I couldn't get it working in the short time I tried. It's probably
  worth looking at the clib implementation to see why it's failing
  (I didn't get to that).
 
  -Angus
 
 
 Note, we sorted this out we believe.  Your kernel has hugetlb enabled,
 probably with 4MB pages.  This requires corosync

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-05-31 Thread Steven Dake
Try running pacemaker using the MCP.  The plugin mode of pacemaker
never really worked very well because of the complexities of posix mmap and
fork.  Not having sparc hardware personally, YMMV.  We have recently,
with corosync 1.3.1, gone through an alignment fixing process for ARM
arches - hope that solves your alignment problems on sparc as well.
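
For reference, whether pacemaker runs in plugin mode or under the MCP is selected by the service block in corosync.conf; a sketch (the same stanza shown further down in this thread, only with ver changed) would be:

   service {
           # ver: 0 - the corosync plugin spawns the pacemaker daemons (plugin mode)
           # ver: 1 - pacemakerd (the MCP) is started separately, e.g. by init
           ver:  1
           name: pacemaker
   }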

Regards
-steve

On 05/31/2011 08:38 AM, william felipe_welter wrote:
 I'm trying to set up HA with corosync and pacemaker using the Debian
 packages on the SPARC architecture. Using the Debian packages, the corosync
 process dies after initializing the pacemaker process. I ran some tests with
 ltrace and strace, and these tools tell me that corosync died because of a
 segmentation fault. I tried a lot of things to solve this problem, but
 nothing made corosync work.
 
 My second try was to compile from scratch (using these docs:
 http://www.clusterlabs.org/wiki/Install#From_Source). This way the
 corosync process starts up perfectly, but some pacemaker processes don't
 start. Analyzing the log, I see the probable reason:
 
 attrd: [2283]: info: init_ais_connection_once: Connection to our AIS
 plugin (9) failed: Library error (2)
 
 stonithd: [2280]: info: init_ais_connection_once: Connection to our AIS
 plugin (9) failed: Library error (2)
 .
 cib: [2281]: info: init_ais_connection_once: Connection to our AIS
 plugin (9) failed: Library error (2)
 .
 crmd: [3320]: debug: init_client_ipc_comms_
 nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: cib_native_signon_raw: Connection to command
 channel failed
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
 channel failed
 crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
 connection failed
 crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: cib_native_signon_raw: Connection to command
 channel failed
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
 channel failed
 crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
 connection failed
 crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
 crmd: [3320]: info: do_cib_control: Could not connect to the CIB
 service: connection failed
 
 
 
 
 
 
 My conf:
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 totem {
 version: 2
 join: 60
 token: 3000
 token_retransmits_before_loss_const: 10
 secauth: off
 threads: 0
 consensus: 8601
 vsftype: none
 threads: 0
 rrp_mode: none
 clear_node_high_bit: yes
 max_messages: 20
 interface {
 ringnumber: 0
 bindnetaddr: 10.10.23.0
 mcastaddr: 226.94.1.1
 mcastport: 5405
 }
 }
 
 logging {
 fileline: off
 to_stderr: no
 to_logfile: yes
 to_syslog: yes
 logfile: /var/log/cluster/corosync.log
 debug: on
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: on
 }
 }
 
 amf {
 mode: disabled
 }
 
 service {
 # Load the Pacemaker Cluster Resource Manager
 ver:   0
 name:  pacemaker
 }
 
 aisexec {
 user:   root
 group:  root
 }
 
 
 My Question is: why attrd, cib ... can't connect to  AIS Plugin?  What
 could be the reasons for the connection failed ?
 (Yes, my /dev/shm are tmpfs)
 
 
 
 -- 
 William Felipe Welter
 --
 Consultor em Tecnologias Livres
 william.wel...@4linux.com.br
 www.4linux.com.br
 
 
 
 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Linux HA on debian sparc

2011-05-31 Thread Steven Dake
Note.  there are three signals you could possibly see that generate a
core file.

SIGABRT (assert() called in the codebase)
SIGSEGV (segmentation violation)
SIGBUS (alignment error)

Make sure you don't have a sigbus.

Opening the core file with gdb will tell you which signal triggered the
fault.
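
For example (binary and core file names are illustrative):

   $ gdb /usr/sbin/pacemakerd core
   ...
   Core was generated by `pacemakerd'.
   Program terminated with signal 10, Bus error.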

Regards
-steve

On 05/31/2011 08:34 AM, william felipe_welter wrote:
 I'm trying to set up HA with corosync and pacemaker using the Debian
 packages on the SPARC architecture. Using the Debian packages, the corosync
 process dies after initializing the pacemaker process. I ran some tests with
 ltrace and strace, and these tools tell me that corosync died because of a
 segmentation fault. I tried a lot of things to solve this problem, but
 nothing made corosync work.
 
 My second try was to compile from scratch (using these docs:
 http://www.clusterlabs.org/wiki/Install#From_Source). This way the
 corosync process starts up perfectly, but some pacemaker processes don't
 start. Analyzing the log, I see the probable reason:
 
 attrd: [2283]: info: init_ais_connection_once: Connection to our AIS
 plugin (9) failed: Library error (2)
 
 stonithd: [2280]: info: init_ais_connection_once: Connection to our AIS
 plugin (9) failed: Library error (2)
 .
 cib: [2281]: info: init_ais_connection_once: Connection to our AIS
 plugin (9) failed: Library error (2)
 .
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: cib_native_signon_raw: Connection to command
 channel failed
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
 channel failed
 crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
 connection failed
 crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_rw
 crmd: [3320]: debug: cib_native_signon_raw: Connection to command
 channel failed
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
 talk on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
 comms on: /usr/var/run/crm/cib_callback
 crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
 channel failed
 crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
 connection failed
 crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
 crmd: [3320]: info: do_cib_control: Could not connect to the CIB
 service: connection failed
 
 
 
 
 
 
 My conf:
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 totem {
 version: 2
 join: 60
 token: 3000
 token_retransmits_before_loss_const: 10
 secauth: off
 threads: 0
 consensus: 8601
 vsftype: none
 threads: 0
 rrp_mode: none
 clear_node_high_bit: yes
 max_messages: 20
 interface {
 ringnumber: 0
 bindnetaddr: 10.10.23.0
 mcastaddr: 226.94.1.1
 mcastport: 5405
 }
 }
 
 logging {
 fileline: off
 to_stderr: no
 to_logfile: yes
 to_syslog: yes
 logfile: /var/log/cluster/corosync.log
 debug: on
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: on
 }
 }
 
 amf {
 mode: disabled
 }
 
 service {
 # Load the Pacemaker Cluster Resource Manager
 ver:   0
 name:  pacemaker
 }
 
 aisexec {
 user:   root
 group:  root
 }
 
 
 My Question is: why attrd, cib ... can't connect to  AIS Plugin?  What
 could be the reasons for the connection failed ?
 (Yes, my /dev/shm are tmpfs)
 
 
 
 
 -- 
 William Felipe Welter
 --
 Consultor em Tecnologias Livres
 william.wel...@4linux.com.br
 www.4linux.com.br
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 

Re: [Pacemaker] [Openais] Corosync goes into endless loop when same hostname is used on more than one node

2011-05-12 Thread Steven Dake
On 05/12/2011 07:04 AM, Dan Frincu wrote:
 Hi,
 
 When using the same hostname on 2 nodes (debian squeeze, corosync
 1.3.0-3 from unstable) the following happens:
 
 May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
 complete: op cib_sync for section 'all' (origin=local/crmd/84,
 version=0.5.1): ok (rc=0)
 May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
 has id: 620757002
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
 transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
 cause=C_FSA_INTERNAL origin=check_join_state ]
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
 cluster nodes responded to the join offer.
 May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-29:
 Syncing the CIB from debian to the rest of the cluster
 May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
 has id: 603979786
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
 transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
 cause=C_HA_MESSAGE origin=route_message ]
 May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
 May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
 complete: op cib_sync for section 'all' (origin=local/crmd/86,
 version=0.5.1): ok (rc=0)
 May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
 join-30: Waiting on 1 outstanding join acks
 May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
 (3.0.1)
 May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
 has id: 620757002
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
 transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
 cause=C_FSA_INTERNAL origin=check_join_state ]
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
 cluster nodes responded to the join offer.
 May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-30:
 Syncing the CIB from debian to the rest of the cluster
 May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
 has id: 603979786
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
 transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
 cause=C_HA_MESSAGE origin=route_message ]
 May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
 May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
 join-31: Waiting on 1 outstanding join acks
 May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
 (3.0.1)
 May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
 complete: op cib_sync for section 'all' (origin=local/crmd/88,
 version=0.5.1): ok (rc=0)
 May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
 has id: 620757002
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
 transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
 cause=C_FSA_INTERNAL origin=check_join_state ]
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
 cluster nodes responded to the join offer.
 May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-31:
 Syncing the CIB from debian to the rest of the cluster
 May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
 has id: 603979786
 May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
 transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
 cause=C_HA_MESSAGE origin=route_message ]
 May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
 May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
 join-32: Waiting on 1 outstanding join acks
 May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
 (3.0.1)
 
 Basically it goes into an endless loop. This is an improperly configured
 option, but it would help users if this case were handled or a relevant
 message were printed in the logfile, such as "duplicate hostname found".
 

Dan,

I believe this is a pacemaker RFE.  corosync operates entirely on IP
addresses and never does any hostname to IP resolution (because the
resolver can block and cause bad things to happen).

 Regards.
 Dan
 
 -- 
 Dan Frincu
 CCNA, RHCE
 
 
 
 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Pacemaker Cloud Policy Engine Red Hat Summit slides and Mailing List

2011-05-08 Thread Steven Dake
In February we announced our intentions to work on a cloud-specific high
availability solution on this list.  The code is coming along, and we
have reached a point where we should have a mailing list dedicated to
cloud specific topics of Pacemaker.

The mailing list subscription page is:

http://oss.clusterlabs.org/mailman/listinfo/pcmk-cloud

To see how we have progressed since February, have a look at the source
in our git repo, or take a look at the Red Hat Summit 2011 slides where
our work was presented this last week:
http://www.redhat.com/summit/2011/presentations/summit/whats_new/thursday/dake_th_1130_high_availability_in_the_cloud.pdf

If you're interested in cloud high availability technology, please feel
free to participate on our mailing lists.  Your input there is
invaluable to ensuring we deliver a great project that downstream
distros and administrators can use.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] corosync crash

2011-03-01 Thread Steven Dake
On 02/25/2011 12:38 AM, Andrew Beekhof wrote:
 This is the same one you sent to the openais list right?
 

Andrew,

This was root-caused to a faulty network setup resulting in the "failed
to receive" abort we are currently working on.  One key detail missing
from this thread is that the implementation worked great on VMware ESX 4.0 but
then started having problems on ESX 4.1.

Regards
-steve

 On Thu, Feb 24, 2011 at 10:32 AM,  u.schmel...@online.de wrote:

 Hi,

 my configuration has 2 nodes, one has a set of virtual addresses and a
 webservice. The situation before the crash:
 node1: has all resources
 node2: online, no resources

 action on node2: crm standby node2
 result on node1: corosync crashes, the child processes consume all available 
 cpu time

 my actions: stop all child processes on node1 (kill -9) and restart corosync

 result on node1:
 node1: online, all resources
 node2: offline

 result on node2:
 node1: offline
 node2: online, all resources

 The only way I found to work around this problem: remove node2 from the
 cluster and add it again.
 There should be other solutions, maybe someone can help. Appended the 
 coredump and fplay.

 Update: If I keep the cluster in the split brain state, it recovers after 
 about 9 hours (logfile available)

 regards Uwe

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster Communication fails after VMWare Migration

2011-03-01 Thread Steven Dake
On 02/25/2011 12:40 AM, Andrew Beekhof wrote:
 On Wed, Feb 23, 2011 at 10:31 AM,  u.schmel...@online.de wrote:

 I have built a 2 node apache cluster on VMware virtual machines, which was
 running as expected. We had to migrate the machines to another computing
 center and after that the cluster communication didn't work anymore.
 Migration of VMs causes a change of the network's MAC address. Maybe that's
 the reason for my problem. After removing one node from the cluster and
 adding it again the communication worked. Because migrations between
 computing centers can happen at any time (mirrored ESX infrastructure), I
 have to find out if this breaks the cluster communication.
 
 Cluster communication issues are the domain of corosync/heartbeat -
 their mailing lists may be able to provide more information.
 We're just the poor consumer of their services :-)
 

poor consumer lol

Regarding migration, I doubt a MAC address migration will work properly
with modern switches and IGMP.  For that type of operation to work
properly, you will definitely want to take multicast out of the equation
and instead use the udpu transport mode.

Keep in mind the corosync devs don't test the types of things you talk
about as we don't have proprietary software licenses.

Regards
-steve


 regards Uwe


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Article on HA in the IBM cloud using Pacemaker and Heartbeat

2011-01-28 Thread Steven Dake
On 01/28/2011 08:02 AM, Alan Robertson wrote:
  Hi,
 
 I recently co-authored an article on HA in the IBM cloud using Pacemaker
 and Heartbeat.
 
 http://www.ibm.com/developerworks/cloud/library/cl-highavailabilitycloud/
 
 The cool thing is that the IBM cloud supports virtual IPs.  With most of
 the other clouds you have to do DNS failover - which is sub-optimal
 ;-).  Of course, they added this after we harangued them ;-) - but still
 it's very nice to have.
 
 It uses Heartbeat rather than Corosync because (for good reason) clouds
 don't support multicast or broadcast.
 

Corosync works in non broadcast/multicast modes.  (the transport is
called udpu).

Regards
-steve

 There will be a follow-up article on setting up DRBD in the cloud as
 well...  Probably a month away or so...
 
 -- 
 Alan Robertson al...@unix.sh
 
 Openness is the foundation and preservative of friendship...  Let me claim 
 from you at all times your undisguised opinions. - William Wilberforce
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pacemaker + corosync in the cloud

2010-12-15 Thread Steven Dake
On 12/14/2010 05:14 PM, ruslan usifov wrote:
 Hi
 
 Is it possible to use pacemaker based on corosync in the cloud hosting
 like amazon or soflayer?
 
 
 

yes with corosync 1.3.0 in udpu mode.  The udpu mode avoids the use of
multicast allowing operation in amazon's cloud.
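
A minimal totem fragment using udpu might look like this (a sketch assuming the corosync 1.3.0 syntax; addresses are placeholders and every cluster node is listed as a member):

   totem {
           version: 2
           transport: udpu
           interface {
                   ringnumber: 0
                   bindnetaddr: 10.10.23.0
                   mcastport: 5405
                   member {
                           memberaddr: 10.10.23.1
                   }
                   member {
                           memberaddr: 10.10.23.2
                   }
           }
   }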

Regards
-steve

 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] service corosync start failed

2010-11-22 Thread Steven Dake
On 11/22/2010 01:27 AM, jiaju liu wrote:
 Hi all
 If I use command like this
  service corosync start
 it shows
 Starting Corosync Cluster Engine (corosync):   [FAILED]
  
 and if I do nothing but reboot my computer it will be OK. What is the
 reason?
 Thanks a lot
  
 my pacemaker packages are
  pacemaker-1.0.8-6.1.el5
  pacemaker-libs-devel-1.0.8-6.1.el5
  pacemaker-libs-1.0.8-6.1.el5
 
  openais packages are
  openaislib-devel-1.1.0-1.el5
  openais-1.1.0-1.el5
  openaislib-1.1.0-1.el5
 
 corosync packages are
  corosync-1.2.2-1.1.el5
  corosynclib-devel-1.2.2-1.1.el5
  corosynclib-1.2.2-1.1.el5
  who know why thanks a lot
 
 
  
 

Your packages are about 1 year old.  I'd suggest updating - we release z
streams to fix bugs and problems that people run into.

Regards
-steve

 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] UDPU transport patch added, when will the RPMs be available

2010-11-22 Thread Steven Dake
On 11/22/2010 09:27 AM, Dan Frincu wrote:
 Hi Steven,
 
 Steven Dake wrote:
 On 11/19/2010 11:42 AM, Andrew Beekhof wrote:
   
 On Fri, Nov 19, 2010 at 11:38 AM, Dan Frincu dfri...@streamwide.ro wrote:
 
 Hi,

 The subject is pretty self-explanatory but I'll ask anyway, the patch for
 UDPU has been released, this adds the ability to set unicast peer addresses
 of nodes in a cluster, in network environments where multicast is not an
 option. When will it be available as an RPM?
   
 When upstream does a new release.

 

 Dan,

 The flatiron branch (containing the udpu patches) is going through
 testing for 1.3.0.  We find currently that single CPU virtual machine
 systems seem to have problems with these patches which we will sort out
 before release.

 Regards
 -steve


   
 I've taken the (tip I think it is called) of corosync.git and compiled
 the RPM's on RH5U3 64-bit (I got the code the day it was first released,
 haven't had a chance to post yet).
 

First off, we release from the flatiron branch.  It is our stable
branch.  From git, do

git checkout flatiron

This will provide the full flatiron branch for building.

 # git show
 commit 565b32c2621c08f82cab57420217060d100d4953
 Author: Fabio M. Di Nitto fdini...@redhat.com
 Date:   Fri Nov 19 09:21:47 2010 +0100
 
 There were some issues when compiling, deps mostly, some in the spec
 related to version which was UNKNOWN, I did a sed, placed 1.2.9 as a

We are aware of this problem.  We just moved from svn to git, and there
is some pain associated.  This particular problem comes from a lack of a
specific type of tag in the git repo for version numbers.  It will be
fixed once 1.3.0 is released.  Then RPM builds will work as expected.

 number instead of UNKNOWN and it compiled OK. I've installed it on two
 Xen VM's I use for testing and found some issues so the question is:
 where can I send feedback (and what kind of feedback is required) about
 development code? I'm not saying that you guys haven't run into these
 errors, maybe you did and they were fixed and maybe some are specific to
 my setup and haven't been found so, if I can provide some feedback on
 development code, I'd be more than happy to, if that's OK.
 

We are certainly interested in contributions to the master branch.  What
most people use is the flatiron branch.  If you see defects on the tip
of flatiron, let us know, and we will work to address them.

The best way to report an issue is to start a conversation on our
mailing list (in the cc): "Is XYZ supposed to happen?"  The developers
can say yes or no and ask for further information if there is a defect.

Regards
-steve

 Also, I've read about the cluster test suite, but I'm not actually sure
 how it works, could somebody provide some details as to how I can use
 the cluster test suite on a cluster to check for issues and then how can
 I report if there are any issues found (again, what kind of feedback is
 required).
 
 Regards,
 Dan
 
 p.s.: ignore my other email, I didn't see the reply on this one.
 
 If I'm barking up the wrong tree, please direct me to the proper channel to
 direct this request, I'm really looking forward to testing the UDPU.

 Regards,

 Dan

 --
 Dan FRINCU
 Systems Engineer
 CCNA, RHCE
 Streamwide Romania


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

   
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
 


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
   
 
 -- 
 Dan FRINCU
 Systems Engineer
 CCNA, RHCE
 Streamwide Romania
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http

Re: [Pacemaker] UDPU transport patch added, when will the RPMs be available

2010-11-19 Thread Steven Dake
On 11/19/2010 11:42 AM, Andrew Beekhof wrote:
 On Fri, Nov 19, 2010 at 11:38 AM, Dan Frincu dfri...@streamwide.ro wrote:
 Hi,

 The subject is pretty self-explanatory but I'll ask anyway, the patch for
 UDPU has been released, this adds the ability to set unicast peer addresses
 of nodes in a cluster, in network environments where multicast is not an
 option. When will it be available as an RPM?
 
 When upstream does a new release.
 

Dan,

The flatiron branch (containing the udpu patches) is going through
testing for 1.3.0.  We find currently that single CPU virtual machine
systems seem to have problems with these patches which we will sort out
before release.

Regards
-steve



 If I'm barking up the wrong tree, please direct me to the proper channel to
 direct this request, I'm really looking forward to testing the UDPU.

 Regards,

 Dan

 --
 Dan FRINCU
 Systems Engineer
 CCNA, RHCE
 Streamwide Romania


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Corosync using unicast instead of multicast

2010-11-08 Thread Steven Dake

On 11/08/2010 05:50 AM, Dan Frincu wrote:

Hi,

Steven Dake wrote:

On 11/05/2010 01:30 AM, Dan Frincu wrote:

Hi,

Alan Jones wrote:

This question should be on the openais list, however, I happen to know
the answer.
To get up and running quickly you can configure broadcast with the
version you have.


I've done that already, however I was a little concerned as to what
Steven Dake said on the openais mailing list about using broadcast:
"Broadcast and redundant ring probably don't work too well together."

I've also done some testing and saw that the broadcast address used is
255.255.255.255, regardless of what the bindnetaddr network address is,
and quite frankly, I was hoping to see a directed broadcast address.
This wasn't the case, therefore I wonder whether this was the issue that
Steven was referring to, because by using the 255.255.255.255 as a
broadcast address, there is the slight chance that some application
running in the same network might send a broadcast packet using the same


This can happen with multicast or unicast modes as well. If a third
party application communicates on the multicast/port combo or unicast
port of a cluster node, there is conflict.

With encryption, corosync encrypts and authenticates all packets,
ignoring packets without a proper signature. The signatures are
difficult to spoof. Without encryption, bad things happen in this
condition.

For more details, read SECURITY file in our source distribution.
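
For example, enabling it is a one-line change in the totem section of corosync.conf (a sketch; the shared key is generated once with corosync-keygen, which writes /etc/corosync/authkey, and must be copied to every node):

   totem {
           secauth: on
           threads: 0
   }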


OK, I read the SECURITY file, a lot of overhead is added, I understand
the reasons why it does it this way, not going to go into the details
right now. Basically enabling encryption ensures that any traffic going
between the nodes is both encrypted and authenticated, so rogue messages
that happen to reach the exact network socket will be discarded. I'll
come back to this a little bit later.

Then again, I have this sentence in my head that I can't seem to get rid
of, "Broadcast and redundant ring probably don't work too well together",
and I also read "OpenAIS now provides broadcast network communication in
addition to multicast. This functionality is considered Technology
Preview for standalone usage of OpenAIS", therefore I'm a little bit
more concerned.

Can you shed some light on this please? Two questions:

1) What do you mean by "Broadcast and redundant ring probably don't work
too well together"?



broadcast requires a specific port to run on.  As a result, the ports 
should be different for each interface.  I have not done any specific 
testing on broadcast with redundant ring - you would probably be the first.
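For illustration only (the subnets and port numbers here are hypothetical, not
from Dan's setup), a two-ring broadcast layout following that advice would use
a distinct port per ring, along these lines:

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                broadcast: yes
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.0.0
                broadcast: yes
                mcastport: 5415
        }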



2) Is using Corosync's broadcast feature instead of multicast stable
enough to be used in production systems?



Personally I'd wait for 2.0 for this feature and use bonding for the moment.


Thank you in advance.

Best regards,

Dan

port as configured on the cluster. How would the cluster react to that,
would it ignore the packet, would it wreak havoc?

Regards,

Dan

That's my main concern right now.

Corosync can distinguish separate clusters with the multicast address
and port that become payload to the messages.
The patch you referred to can be applied to the top of tree for
corosync or you can wait for a new release 1.3.0 planned for the end
of November.
Alan

On Thu, Nov 4, 2010 at 1:02 AM, Dan Frincudfri...@streamwide.ro
wrote:


Hi all,

I'm having an issue with a setup using the following:
cluster-glue-1.0.6-1.6.el5.x86_64.rpm
cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
corosync-1.2.7-1.1.el5.x86_64.rpm
corosynclib-1.2.7-1.1.el5.x86_64.rpm
drbd83-8.3.2-6.el5_3.x86_64.rpm
kmod-drbd83-8.3.2-6.el5_3.x86_64.rpm
openais-1.1.3-1.6.el5.x86_64.rpm
openaislib-1.1.3-1.6.el5.x86_64.rpm
pacemaker-1.0.9.1-1.el5.x86_64.rpm
pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm
resource-agents-1.0.3-2.el5.x86_64.rpm

This is a two-node HA cluster, with the nodes interconnected via
bonded
interfaces through the switch. The issue is that I have no control
of the
switch itself, can't do anything about that, and from what I
understand the
environment doesn't allow enabling multicast on the switch. In this
situation, how can I have the setup functional (with redundant rings,
rrp_mode: active) without using multicast.

I've seen that individual network sockets are formed between nodes,
unicast
sockets, as well as the multicast sockets. I'm interested in
knowing how
will the lack of multicast affect the redundant rings, connectivity,
failover, etc.

I've also seen this page
https://lists.linux-foundation.org/pipermail/openais/2010-October/015271.html

And here it states using UDPU transport mode avoids using multicast or
broadcast, but it's a patch, is this integrated in any of the newer
versions
of corosync?

Thank you in advance.

Regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list:Pacemaker@oss.clusterlabs.org
http

Re: [Pacemaker] Corosync using unicast instead of multicast

2010-11-05 Thread Steven Dake

On 11/05/2010 01:30 AM, Dan Frincu wrote:

Hi,

Alan Jones wrote:

This question should be on the openais list, however, I happen to know
the answer.
To get up and running quickly you can configure broadcast with the
version you have.


I've done that already, however I was a little concerned as to what
Steven Dake said on the openais mailing list about using broadcast:
"Broadcast and redundant ring probably don't work too well together."

I've also done some testing and saw that the broadcast address used is
255.255.255.255, regardless of what the bindnetaddr network address is,
and quite frankly, I was hoping to see a directed broadcast address.
This wasn't the case, therefore I wonder whether this was the issue that
Steven was referring to, because by using the 255.255.255.255 as a
broadcast address, there is the slight chance that some application
running in the same network might send a broadcast packet using the same


This can happen with multicast or unicast modes as well.  If a third 
party application communicates on the multicast/port combo or unicast 
port of a cluster node, there is conflict.


With encryption, corosync encrypts and authenticates all packets, 
ignoring packets without a proper signature.  The signatures are 
difficult to spoof.  Without encryption, bad things happen in this 
condition.


For more details, read SECURITY file in our source distribution.


port as configured on the cluster. How would the cluster react to that,
would it ignore the packet, would it wreak havoc?

Regards,

Dan

That's my main concern right now.

Corosync can distinguish separate clusters with the multicast address
and port that become payload to the messages.
The patch you referred to can be applied to the top of tree for
corosync or you can wait for a new release 1.3.0 planned for the end
of November.
Alan

On Thu, Nov 4, 2010 at 1:02 AM, Dan Frincudfri...@streamwide.ro  wrote:


Hi all,

I'm having an issue with a setup using the following:
cluster-glue-1.0.6-1.6.el5.x86_64.rpm
cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
corosync-1.2.7-1.1.el5.x86_64.rpm
corosynclib-1.2.7-1.1.el5.x86_64.rpm
drbd83-8.3.2-6.el5_3.x86_64.rpm
kmod-drbd83-8.3.2-6.el5_3.x86_64.rpm
openais-1.1.3-1.6.el5.x86_64.rpm
openaislib-1.1.3-1.6.el5.x86_64.rpm
pacemaker-1.0.9.1-1.el5.x86_64.rpm
pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm
resource-agents-1.0.3-2.el5.x86_64.rpm

This is a two-node HA cluster, with the nodes interconnected via bonded
interfaces through the switch. The issue is that I have no control of the
switch itself, can't do anything about that, and from what I understand the
environment doesn't allow enabling multicast on the switch. In this
situation, how can I have the setup functional (with redundant rings,
rrp_mode: active) without using multicast.

I've seen that individual network sockets are formed between nodes, unicast
sockets, as well as the multicast sockets. I'm interested in knowing how
will the lack of multicast affect the redundant rings, connectivity,
failover, etc.

I've also seen this page
https://lists.linux-foundation.org/pipermail/openais/2010-October/015271.html
And here it states using UDPU transport mode avoids using multicast or
broadcast, but it's a patch, is this integrated in any of the newer versions
of corosync?

Thank you in advance.

Regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list:Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home:http://www.clusterlabs.org
Getting started:http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker





___
Pacemaker mailing list:Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home:http://www.clusterlabs.org
Getting started:http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Corosync node detection working too good

2010-10-04 Thread Steven Dake

On 10/04/2010 02:04 AM, Stephan-Frank Henry wrote:

Hello all,

still working on my nodes and although the last problem is not officially 
solved (I hard coded certain versions of the packages and that seems to be ok 
now) I have a different interesting feature I need to handle.

I am setting up my nodes by default as single node setups. But today when I set 
up another node, *without* doing any special config to make them know each 
other, the corosyncs on each nodes found each other and distributed the cib.xml 
between each other.
They both also show up together in crm_mon.

Not quite what I wanted. :)

I presume I have a config that is too generic and thus the nodes are finding 
each other and thinking they should link up.
What configs do I have to look into to avoid this?

thanks


A unique cluster is defined by mcastaddr and mcastport in the file

/etc/corosync/corosync.conf

If you simply installed them, you may have the same corosync.conf file 
for each unique cluster which would result in the problem you describe.
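As an illustration (addresses are hypothetical), two independent clusters
sharing a network would keep these totem values distinct in their respective
corosync.conf files, for example:

        # cluster A
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }

        # cluster B
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastaddr: 226.94.1.2
                mcastport: 5407
        }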


Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Fail over algorithm used by Pacemaker

2010-10-04 Thread Steven Dake

On 10/03/2010 07:01 AM, hudan studiawan wrote:

Hi,

I want to start to contribute to Pacemaker project. I start to read
Documentation and try some basic configurations. I have a question: what
kind of algorithm used by Pacemaker to choose another node when a node
die in a cluster? Is there any manual or documentation I can read?

Thank you,
Hudan




In the case of using Corosync, we use a protocol designed in the 90s to 
determine membership.  It is called The Totem Single Ring Protocol:


http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf

Its full operation is described in that PDF.

Regards
-steve



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Connection to our AIS plugin (9) failed: Library error

2010-09-22 Thread Steven Dake

On 09/22/2010 04:02 AM, Szymon Hersztek wrote:


Wiadomość napisana w dniu 2010-09-22, o godz. 10:26, przez Andrew Beekhof:


2010/9/21 Szymon Hersztek s...@globtel.pl:


Wiadomość napisana w dniu 2010-09-21, o godz. 09:08, przez Andrew
Beekhof:


2010/9/21 Szymon Hersztek s...@globtel.pl:


Wiadomość napisana w dniu 2010-09-21, o godz. 08:34, przez Andrew
Beekhof:


On Mon, Sep 20, 2010 at 3:34 PM, Szymon Hersztek s...@globtel.pl
wrote:


Hi
Im trying to setup corosync to work as drbd cluster but after
installing
follow by http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
i got error like below:


Unusual, but did pacemaker fork a replacement attrd process?
At what time did corosync start?



corosync was started manually or do you want to have exact time of
start
?


well you included at most 1 second's worth of logging.
so its kinda hard to know if something took too long or what recovery
was attempted.


Ok it is not a problem to send more. Do you need debug logging or
standard
I have to install server once again so in half of hour i can
reproduce logs



Here's your issue:

corosynclib i386
1.2.7-1.1.el5
clusterlabs 155 k
corosynclib x86_64
1.2.7-1.1.el5
clusterlabs 172 k

Why do you have both i386 and x86_64 versions installed on your machine??





There should be no problems installing lib files for both i386 and 
x86_64.  These rpms only contain the *.so files (and a LICENSE file).


Regards
-steve


Because yum installed it in this way, as it did many other packages.
The problem was that I do not use /dev/shm as tmpfs.
But thanks for trying.






___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Timeout after nodejoin

2010-09-22 Thread Steven Dake

On 09/22/2010 05:43 AM, Dan Frincu wrote:

Hi all,

I have the following packages:

# rpm -qa | grep -i (openais|cluster|heartbeat|pacemaker|resource)
openais-0.80.5-15.2
cluster-glue-1.0-12.2
pacemaker-1.0.5-4.2
cluster-glue-libs-1.0-12.2
resource-agents-1.0-31.5
pacemaker-libs-1.0.5-4.2
pacemaker-mgmt-1.99.2-7.2
libopenais2-0.80.5-15.2
heartbeat-3.0.0-33.3
pacemaker-mgmt-client-1.99.2-7.2

When I start openais, I get nodejoin immediately, as seen in the logs
below. However, it takes some time before the nodes are visible in
crm_mon output. Any idea how to minimize this delay?

Sep 22 15:27:24 bench1 openais[12935]: [crm ] info:
send_member_notification: Sending membership update 8 to 1 children
Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message
192.168.165.33
Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message
192.168.165.35
Sep 22 15:27:24 bench1 mgmtd: [12947]: info: Started.
Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Recorded
connection 0x174840d0 for crmd/12946
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Sending
membership update 8 to crmd
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info:
update_expected_votes: Expected quorum votes 1024 -> 2
Sep 22 15:27:25 bench1 crmd: [12946]: notice: ais_dispatch: Membership
8: quorum aquired
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_election_count_vote:
Election 2 (owner: bench2) pass: vote from bench2 (Host name)
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_te_control: Registering
TE UUID: 87c28ab8-ba93-4111-a26a-67e88dd927fb
Sep 22 15:28:15 bench1 crmd: [12946]: WARN:
cib_client_add_notify_callback: Callback already present
Sep 22 15:28:15 bench1 crmd: [12946]: info: set_graph_functions: Setting
custom graph functions
Sep 22 15:28:15 bench1 crmd: [12946]: info: unpack_graph: Unpacked
transition -1: 0 actions in 0 synapses
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_dc_takeover: Taking over
DC status for this partition
Sep 22 15:28:15 bench1 cib: [12942]: info: cib_process_readwrite: We are
now in R/W mode

Regards,

Dan



Where did you get that version of openais?  openais 0.80.x is deprecated 
in the community (and hence, no support).  We recommend using corosync 
instead which has improved testing with pacemaker.


Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] MCP init script to 21/79?

2010-09-03 Thread Steven Dake

On 08/24/2010 11:06 PM, Andrew Beekhof wrote:

On Wed, Aug 25, 2010 at 8:02 AM, Vladislav Bogdanov
bub...@hoster-ok.com  wrote:

25.08.2010 08:56, Andrew Beekhof wrote:

On Wed, Aug 25, 2010 at 7:39 AM, Vladislav Bogdanov
bub...@hoster-ok.com  wrote:

Hi all,

pacemaker has
# chkconfig - 90 90
in its MCP initscript.

Shouldn't it be corrected to 90 10?


I thought higher numbers started later and shut down earlier... no?


Nope, they are in a natural order for both start and stop sequences.
So lower number means 'do start or stop earlier'.

grep '# chkconfig' /etc/init.d/*



Ok, thanks.  Changed to 10



Given that the corosync default is 20/80, shouldn't the MCP be 21/79?
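For reference, the ordering being discussed is the chkconfig header comment
inside each init script; the values below are illustrative only:

        # corosync init script
        # chkconfig: - 20 80

        # pacemaker MCP init script: starts after corosync (21 > 20)
        # and stops before it (79 < 80)
        # chkconfig: - 21 79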

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] MCP init script to 21/79?

2010-09-03 Thread Steven Dake

On 09/03/2010 09:56 AM, Vladislav Bogdanov wrote:

03.09.2010 19:34, Steven Dake wrote:

Nope, they are in a natural order for both start and stop sequences.
So lower number means 'do start or stop earlier'.

grep '# chkconfig' /etc/init.d/*



Ok, thanks.  Changed to 10



Given that corosync default is 20/80, shouldnt mcp be 21/79?


I think that pcmk may require additional services to be started (I at
least see reference to cooperation with cman for GFS as one of pcmk MCP
scenarios in Andrew's wiki, but that scenario is still unclear to me),
so it is safer to have it start later, 90 is ok for me. That is also
what Vadim wrote about.

Best,
Vladislav


I was mistaken, not having read the current code.  Ignore the noise.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Corosync + Pacemaker New Install: Corosync Fails Without Error Message

2010-06-22 Thread Steven Dake

On 06/18/2010 09:42 AM, Eliot Gable wrote:

I don’t have an “aisexec” section at all. I simply copied the sample
file, which did not have one.

I did figure out why it wasn’t logging. It was set to AMF mode and
‘mode’ was ‘disabled’ in the AMF configuration section. After changing
that to ‘enabled’, I now have logging. That allowed me to figure out
that I needed to set rrp_mode to something other than ‘none’, because I
have two interfaces to run the totem protocol over. However, with it set
to ‘passive’ or ‘active’, corosync tries to start, then seg faults:

Jun 18 07:33:23 corosync [MAIN ] Corosync Cluster Engine ('1.2.2'):
started and ready to provide service.

Jun 18 07:33:23 corosync [MAIN ] Corosync built-in features: nss rdma

Jun 18 07:33:23 corosync [MAIN ] Successfully read main configuration
file '/etc/corosync/corosync.conf'.

Jun 18 07:33:23 corosync [TOTEM ] Token Timeout (1000 ms) retransmit
timeout (238 ms)

Jun 18 07:33:23 corosync [TOTEM ] token hold (180 ms) retransmits before
loss (4 retrans)

Jun 18 07:33:23 corosync [TOTEM ] join (50 ms) send_join (0 ms)
consensus (1200 ms) merge (200 ms)

Jun 18 07:33:23 corosync [TOTEM ] downcheck (1000 ms) fail to recv const
(50 msgs)

Jun 18 07:33:23 corosync [TOTEM ] seqno unchanged const (30 rotations)
Maximum network MTU 1402

Jun 18 07:33:23 corosync [TOTEM ] window size per rotation (50 messages)
maximum messages per rotation (17 messages)

Jun 18 07:33:23 corosync [TOTEM ] send threads (0 threads)

Jun 18 07:33:23 corosync [TOTEM ] RRP token expired timeout (238 ms)

Jun 18 07:33:23 corosync [TOTEM ] RRP token problem counter (2000 ms)

Jun 18 07:33:23 corosync [TOTEM ] RRP threshold (10 problem count)

Jun 18 07:33:23 corosync [TOTEM ] RRP mode set to passive.

Jun 18 07:33:23 corosync [TOTEM ] heartbeat_failures_allowed (0)

Jun 18 07:33:23 corosync [TOTEM ] max_network_delay (50 ms)

Jun 18 07:33:23 corosync [TOTEM ] HeartBeat is Disabled. To enable set
heartbeat_failures_allowed > 0

Jun 18 07:33:23 corosync [TOTEM ] Initializing transport (UDP/IP).

Jun 18 07:33:23 corosync [TOTEM ] Initializing transmit/receive
security: libtomcrypt SOBER128/SHA1HMAC (mode 0).

Jun 18 07:33:23 corosync [TOTEM ] Initializing transport (UDP/IP).

Jun 18 07:33:23 corosync [TOTEM ] Initializing transmit/receive
security: libtomcrypt SOBER128/SHA1HMAC (mode 0).

Jun 18 07:33:23 corosync [IPC ] you are using ipc api v2

Jun 18 07:33:23 corosync [TOTEM ] Receive multicast socket recv buffer
size (262142 bytes).

Jun 18 07:33:23 corosync [TOTEM ] Transmit multicast socket send buffer
size (262142 bytes).

Jun 18 07:33:23 corosync [TOTEM ] The network interface is down.

Jun 18 07:33:23 corosync [TOTEM ] Created or loaded sequence id
0.127.0.0.1 for this ring.

Jun 18 07:33:23 corosync [pcmk ] info: process_ais_conf: Reading configure

Jun 18 07:33:23 corosync [pcmk ] info: config_find_init: Local handle:
2013064636357672962 for logging

Jun 18 07:33:23 corosync [pcmk ] info: config_find_next: Processing
additional logging options...

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Found 'on' for
option: debug

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'off' for option: to_file

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Found 'yes' for
option: to_syslog

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'daemon' for option: syslog_facility

Jun 18 07:33:23 corosync [pcmk ] info: config_find_init: Local handle:
4730966301143465987 for service

Jun 18 07:33:23 corosync [pcmk ] info: config_find_next: Processing
additional service options...

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'pcmk' for option: clustername

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'no' for option: use_logd

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'no' for option: use_mgmtd

Jun 18 07:33:23 corosync [pcmk ] info: pcmk_startup: CRM: Initialized

Jun 18 07:33:23 corosync [pcmk ] Logging: Initialized pcmk_startup

Jun 18 07:33:23 corosync [pcmk ] info: pcmk_startup: Maximum core file
size is: 18446744073709551615

Segmentation fault

(gdb) where full

#0 0x00332de797c0 in strlen () from /lib64/libc.so.6

No symbol table info available.

#1 0x2acefb9b in logsys_worker_thread (data=<value optimized
out>) at logsys.c:760

rec = 0x2aef0c28

dropped = 0

#2 0x00332e60673d in start_thread () from /lib64/libpthread.so.0

No symbol table info available.

#3 0x00332ded3d1d in clone () from /lib64/libc.so.6

No symbol table info available.

(gdb)

Downgrading again back to 1.2.1-1.el5 seems to resolve the issue, and
Corosync runs.

Eliot Gable
Senior Product Developer
1228 Euclid Ave, Suite 390
Cleveland, OH 44115

Direct: 216-373-4808
Fax: 216-373-4657
ega...@broadvox.net mailto:ega...@broadvox.net



Re: [Pacemaker] use_logd or use_mgmtd kills corosync

2010-06-09 Thread Steven Dake

On 06/08/2010 11:20 PM, Andrew Beekhof wrote:

On Wed, Jun 9, 2010 at 7:27 AM, Devin Readeg...@gno.org  wrote:

I was following the instructions for a new installation of corosync
and was wanting to make use of hb_gui so, following an installation
via yum per the docs, built Pacemaker-Python-GUI-pacemaker-mgmt-2.0.0
from source.

Starting corosync works normally without mgmtd in the picture, but as
soon as *either* of the two lines are added to /etc/corosync/service.d/pcmk,
corosync fails to start with no diagnostics in the logfile or syslog:
use_logd: 1
use_mgmtd: 1

I ran 'strace corosync -f' and got rather uninformative information, the
tail end of it shown here:

statfs(/etc/corosync/service.d, {f_type=EXT2_SUPER_MAGIC, f_bsize=4096,
f_blocks=507860, f_bfree=388733, f_bavail=362519, f_files=524288,
f_ffree=517073, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0
getdents(3, /* 3 entries */, 32768) = 72
stat(/etc/corosync/service.d/pcmk, {st_mode=S_IFREG|0644, st_size=101,
...}) = 0
open(/etc/corosync/service.d/pcmk, O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=101, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x2acb16dd5000
read(4, service {\n \t# Load the Pacemaker..., 4096) = 101
close(4)= 0
munmap(0x2acb16dd5000, 4096)= 0
close(3)= 0
exit_group(8)   = ?


Any thoughts?


Not really.
Do any other children start up?
Where is the mgmtd binary installed to?


# uname -srv
Linux 2.6.18-194.3.1.el5 #1 SMP Thu May 13 13:08:30 EDT 2010

# rpm -q -a | grep openais | sort
openais-1.1.0-2.el5.i386
openais-1.1.0-2.el5.x86_64
openaislib-1.1.0-2.el5.i386
openaislib-1.1.0-2.el5.x86_64
openaislib-devel-1.1.0-2.el5.i386
openaislib-devel-1.1.0-2.el5.x86_64


### /etc/corosync/corosync.conf 
compatibility: none

totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
# but with a real netaddr, obviously
bindnetaddr: A.B.C.D
mcastaddr: 226.94.1.1
mcastport: 5405
}
}

logging {
fileline: off
to_stderr: no
to_file: yes
to_syslog: yes
logfile: /var/log/corosync.log
# debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}

aisexec {
user: root
group: root
}

 /etc/corosync/service.d/pcmk #
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver:  0
use_logd: 1
}


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


This is likely the sem_wait issue related to some CentOS deployments. 
An update for corosync is pending release.  Hopefully new source 
tarballs will be available Wednesday.


Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] handle EINTR in sem_wait (pacemaker corosync 1.2.2+ crash)

2010-06-01 Thread Steven Dake

Hello,

I have found the cause of the crash that was occurring only on some 
deployments.  The cause is that sem_wait is interrupted by signal, and 
the wait operation is not retried (as is customary in posix).


Patch attached to fix

A big thank you to Vladislav Bogdanov for running the test case and 
verifying it fixes the problem.



Regards
-steve
Index: logsys.c
===================================================================
--- logsys.c	(revision 2915)
+++ logsys.c	(working copy)
@@ -661,7 +661,18 @@
 	sem_post (&logsys_thread_start);
 	for (;;) {
 		dropped = 0;
-		sem_wait (&logsys_print_finished);
+retry_sem_wait:
+		res = sem_wait (&logsys_print_finished);
+		if (res == -1 && errno == EINTR) {
+			goto retry_sem_wait;
+		} else
+		if (res == -1) {
+			/*
+			 * This case shouldn't happen
+			 */
+			pthread_exit (NULL);
+		}
+
 
 		logsys_wthread_lock();
 		if (wthread_should_exit) {
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] corosync/openais fails to start

2010-05-27 Thread Steven Dake
This is a known issue on some platforms, although the exact cause is 
unknown.  I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo 
rpms and been unable to reproduce.  I'll keep looking.


Regards
-steve

On 05/27/2010 06:07 AM, Diego Remolina wrote:

Hi,

I was running the old rpms from the opensuse repo and wanted to change
over to the latest packages from the clusterlabs repo in my RHEL 5.5
machines.

Steps I took
1. Disabled the old repo
2. Set the nodes to standby (two node drbd cluster) and turned of openais
3. Enabled the new repo.
4. Performed an update with yum -y update which replaced all packages.
5. The configuration file for ais was renamed openais.conf.rpmsave
6. I ran corosync-keygen and copied the key to the second machine
7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf
and modified it by removing the service section and moving that to
/etc/corosync/service.d/pcmk
8. I copied the configurations to the other machine.
9. When I try to start either openais or corosync with the init scripts
I get a failure and nothing that can really point me to an error in the
logs.

Updated packages:
May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64
May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64
May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64
May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64
May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64
May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64
May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64
May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64
May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64
May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64
May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64
May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64

Apparently corosync is sec faulting when run from the command line:

# /usr/sbin/corosync -f
Segmentation fault

Any help would be greatly appreciated.

Diego



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] corosync/openais fails to start

2010-05-27 Thread Steven Dake

On 05/27/2010 08:40 AM, Diego Remolina wrote:

Is there any workaround for this? Perhaps a slightly older version of
the rpms? If so where do I find those?



Corosync 1.2.1 doesn't have this issue apparently.  With corosync 1.2.1, 
please don't use the 'debug: on' keyword in your config options.  I am not 
sure where Andrew has corosync 1.2.1 rpms available.


The corosync project itself doesn't release rpms.  See our policy on 
this topic:


http://www.corosync.org/doku.php?id=faq:release_binaries

Regards
-steve


I cannot get the opensuse-ha rpms any more so I am stuck with a
non-functioning cluster.

Diego

Steven Dake wrote:

This is a known issue on some platforms, although the exact cause is
unknown. I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo
rpms and been unable to reproduce. I'll keep looking.

Regards
-steve

On 05/27/2010 06:07 AM, Diego Remolina wrote:

Hi,

I was running the old rpms from the opensuse repo and wanted to change
over to the latest packages from the clusterlabs repo in my RHEL 5.5
machines.

Steps I took
1. Disabled the old repo
2. Set the nodes to standby (two node drbd cluster) and turned of
openais
3. Enabled the new repo.
4. Performed an update with yum -y update which replaced all packages.
5. The configuration file for ais was renamed openais.conf.rpmsave
6. I ran corosync-keygen and copied the key to the second machine
7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf
and modified it by removing the service section and moving that to
/etc/corosync/service.d/pcmk
8. I copied the configurations to the other machine.
9. When I try to start either openais or corosync with the init scripts
I get a failure and nothing that can really point me to an error in the
logs.

Updated packages:
May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64
May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64
May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64
May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64
May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64
May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64
May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64
May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64
May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64
May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64
May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64
May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64

Apparently corosync is sec faulting when run from the command line:

# /usr/sbin/corosync -f
Segmentation fault

Any help would be greatly appreciated.

Diego



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Being fenced node is killed again and again even the connection is recovered!

2010-05-14 Thread Steven Dake
"ifconfig eth0 down" is not a valid test case.  That will likely lead to
bad things happening.

I recommend using iptables to test the software.
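A sketch of the kind of iptables test meant here (eth1 is the interconnect
from the original report; the exact rules are illustrative):

# simulate a network partition without taking the interface itself down
iptables -A INPUT  -i eth1 -j DROP
iptables -A OUTPUT -o eth1 -j DROP

# restore connectivity afterwards
iptables -D INPUT  -i eth1 -j DROP
iptables -D OUTPUT -o eth1 -j DROP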

Also Corosync 1.2.2 is out which fixes bugs vs corosync 1.2.0.

Regards
-steve

On Fri, 2010-05-14 at 18:02 +0800, Javen Wu wrote:
 I forget mention the version I used. 
 I used SLES11-SP1-HAE Beta5
 Pacemaker 1.0.7
 Corosync 1.2.0
 Cluster Glue 1.0.3
 
 
 2010/5/14 Javen Wu wu.ja...@gmail.com
 Hi Folks,
 
 I setup a three nodes cluster with SBD STONITH configured.
 After I manually isolate one node by running ifconfig eth1
 down on the node. The node is fenced as expected.
 But after reboot, even the network is recovered, the node is
 killed again once I start openais & pacemaker.
 I saw the state of the node become from OFFLINE to ONLINE from
 `crm_mon -n` before being killed. And I saw SBD slot from
 reset-clear-reset.
 
 I attached the syslog and corosync log.
 And my CIB configuration is very simple.
 
 Could you help me check what's the problem? In my mind, it's
 not expected behaviour.
 
 ===%CIB information=
 
 cib validate-with=pacemaker-1.0 crm_feature_set=3.0.1
 have-quorum=1 admin_epoch=0 epoch=349 num_updates=99
 cib-last-written=Fri May 14 14:50:21 2010 dc-uuid=vm209
   configuration
 crm_config
   cluster_property_set id=cib-bootstrap-options
 nvpair id=cib-bootstrap-options-dc-version
 name=dc-version
 value=1.1.1-530add2a3721a0ecccb24660a97dbfdaa3e68f51/
 nvpair
 id=cib-bootstrap-options-cluster-infrastructure
 name=cluster-infrastructure value=openais/
 nvpair
 id=cib-bootstrap-options-expected-quorum-votes
 name=expected-quorum-votes value=3/
   /cluster_property_set
 /crm_config
 nodes
   node id=vm208 uname=vm208 type=normal/
   node id=vm209 uname=vm209 type=normal/
   node id=vm210 uname=vm210 type=normal/
 /nodes
 resources
   clone id=Fencing
 primitive class=stonith id=sbd-fencing
 type=external/sbd
   instance_attributes
 id=sbd-fencing-instance_attributes
 nvpair
 id=sbd-fencing-instance_attributes-sbd_device
 name=sbd_device value=/dev/sdc/
   /instance_attributes
   operations
 op id=sbd-fencing-monitor-20s interval=20s
 name=monitor/
   /operations
 /primitive
   /clone
 /resources
 constraints/
 rsc_defaults/
 op_defaults/
   /configuration
   status
 node_state id=vm209 uname=vm209 ha=active
 in_ccm=true crmd=online join=member expected=member
 crm-debug-origin=post_cache_update shutdown=0
   transient_attributes id=vm209
 instance_attributes id=status-vm209
   nvpair id=status-vm209-probe_complete
 name=probe_complete value=true/
 /instance_attributes
   /transient_attributes
   lrm id=vm209
 lrm_resources
   lrm_resource id=sbd-fencing:0 type=external/sbd
 class=stonith
 lrm_rsc_op id=sbd-fencing:0_monitor_0
 operation=monitor crm-debug-origin=build_active_RAs
 crm_feature_set=3.0.1
 transition-key=4:1:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0
 transition-magic=0:7;4:1:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0 
 call-id=2 rc-code=7 op-status=0 interval=0 last-run=1273820137 
 last-rc-change=1273820137 exec-time=60 queue-time=0 
 op-digest=4c3fd39434577fbb6540606d808ed050/
 lrm_rsc_op id=sbd-fencing:0_start_0
 operation=start crm-debug-origin=build_active_RAs
 crm_feature_set=3.0.1
 transition-key=5:1:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0
 transition-magic=0:0;5:1:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0 
 call-id=3 rc-code=0 op-status=0 interval=0 last-run=1273820137 
 last-rc-change=1273820137 exec-time=10 queue-time=0 
 op-digest=4c3fd39434577fbb6540606d808ed050/
 lrm_rsc_op id=sbd-fencing:0_monitor_2
 operation=monitor crm-debug-origin=build_active_RAs
 crm_feature_set=3.0.1
 transition-key=6:2:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0
 transition-magic=0:0;6:2:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0 
 call-id=4 rc-code=0 op-status=0 interval=2 last-run=1273822956 
 last-rc-change=1273820137 exec-time=1170 queue-time=0 
 op-digest=4029bbaef749649e82d602afb46dd872/
   /lrm_resource
 /lrm_resources
   /lrm
 

Re: [Pacemaker] High load issues

2010-02-04 Thread Steven Dake
On Thu, 2010-02-04 at 16:09 +0100, Dominik Klein wrote:
 Hi people,
 
 I'll take the risk of annoying you, but I really think this should not
 be forgotten.
 
 If there is high load on a node, the cluster seems to have problems
 recovering from that. I'd expect the cluster to recognize that a node is
 unresponsive, stonith it and start services elsewhere.
 
 By unresponsive I mean not being able to use the cluster's service, not
 being able to ssh into the node.
 
 I am not sure whether this is an issue of pacemaker (iiuc, beekhof seems
 to think it is not) or corosync (iiuc, sdake seems to think it is not)
 or maybe a configuration/thinking thing on my side (which might just be).
 
 Anyway, attached you will find a hb_report which covers the startup of
 the cluster nodes, then what it does when there is high load and no
 memory left. Then I killed the load producing things and almost
 immediately, the cluster cleaned up things.
 
 I had at least expected that after I saw FAILED status in crm_mon,
 that after the configured timeouts for stop (120s max in my case), the
 failover should happen, but it did not.
 
 What I did to produce load:
 * run several md5sum $file on 1gig files
 * run several heavy sql statements on large tables
 * saturate(?) the nic using netcat -l on the busy node and netcat -w fed
 by /dev/urandom on another node
 * start a forkbomb script which does while (true); do bash $0; done;
 
 Used versions:
 corosync 1.2.0
 pacemaker 1.0.7
 64 bit packages from clusterlabs for opensuse 11.1
 

The forkbomb triggers an OOM situation.  In Linux, when OOM happens
really all bets are off as to what will occur.  I expect that the system
would work properly without the forkbomb.  Could you try that?

Corosync actually works quite well in OOM situations and usually doesn't
detect this as a failure unless the oom killer blows away the corosync
process.  To corosync, the node is fully operational (because it is
designed to work in an OOM situation).

Detecting memory overcommit and doing something about it may be
something we should do with Corosync.

But generally I believe this test case is invalid.  A system should be
properly sized, memory-wise, to handle the applications that are intended
to run on it.  It really sounds like a deployment issue if the systems
don't contain the appropriate RAM to run the applications.

I believe there is a way of setting affinity in the OOM killer but it's
been 4 years since I've worked on the kernel fulltime so I don't know
the details.  One option is to set the affinity to always try to blow
away the corosync process.  Then you would get fencing in this
condition.
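As a rough illustration of that idea (not from the original message, and
assuming the old /proc/<pid>/oom_adj interface of kernels from that era):

# make corosync the OOM killer's preferred victim, so an out-of-memory
# node loses membership and gets fenced instead of limping along
for p in $(pidof corosync); do echo 15 > /proc/$p/oom_adj; done

# the opposite choice: -17 exempts the process from the OOM killer entirely
# for p in $(pidof corosync); do echo -17 > /proc/$p/oom_adj; done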

Regards
-steve

 If you need more information, want me to try patches, whatever, please
 let me know.
 
 Regards
 Dominik
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] thread safety problem with pacemaker and corosync integration

2010-02-03 Thread Steven Dake
For some time people have reported segfaults on startup when using
pacemaker as a plugin to corosync related to tzset in the stack trace.
I believe we had fixed this by removing the thread-unsafe usage of
localtime and strftime calls in the code base of corosync in 1.2.0.

Via further investigation by H.J. Lee, he mostly identified a problem
with localtime_r calling tzset calling getenv().  If at about the same
time, another thread calls setenv(), the other thread's getenv could
segfault.  syslog() also calls localtime_r in glibc.  On some rare
occasions Pacemaker calls setenv() while corosync executes a syslog
operation resulting in a segfault.

Posix is clear on this issue - tzset should be thread safe, localtime_r
should be thread safe, syslog should be thread safe.  Some C libraries
implementations of these functions unfortunately are not thread safe for
these functions when used in conjunction with setenv because they use
getenv internally (which is not required to be thread safe by posix).

Our short term plan is to workaround these problems in glibc by doing
the following:
1) providing a getenv/setenv api inside coroapi.h so that corosync
internal code and third party plugins such as pacemaker can use a mutex
protected getenv/setenv
2) porting our syslog-direct-communication code from whitetank and avoid
using the syslog C library api (which again uses localtime_r) call
entirely
3) implementing a localtime_r replacement which does not call tzset on
each execution so that timestamp:on operational mode does not suffer
from this same problem
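
To make item 1 concrete, here is a minimal sketch of the kind of
mutex-protected wrapper pair meant there; the names and the
copy-out-under-lock signature are illustrative, not the actual coroapi.h
interface:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t env_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Copy the value out under the lock so the caller never holds a pointer
 * into the environment while another thread may be calling setenv().
 */
int guarded_getenv (const char *name, char *value, size_t value_len)
{
	const char *v;
	int res = -1;

	pthread_mutex_lock (&env_mutex);
	v = getenv (name);
	if (v != NULL && strlen (v) < value_len) {
		strcpy (value, v);
		res = 0;
	}
	pthread_mutex_unlock (&env_mutex);
	return (res);
}

int guarded_setenv (const char *name, const char *value, int overwrite)
{
	int res;

	pthread_mutex_lock (&env_mutex);
	res = setenv (name, value, overwrite);
	pthread_mutex_unlock (&env_mutex);
	return (res);
}

Of course this only helps if every thread goes through the wrappers; the
getenv hidden inside glibc's localtime_r cannot be wrapped this way, which
is why items 2 and 3 avoid those library code paths altogether.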

If you're suffering from this issue, please be aware we have a root cause
and will get it resolved.

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Steven Dake
One possibility is you have a different cluster in your network on the
same multicast address and port.

Regards
-steve

On Sat, 2010-01-16 at 15:20 -0500, Shravan Mishra wrote:
 Hi Guys,
 
 I'm running the following version of pacemaker and corosync
 corosync=1.1.1-1-2
 pacemaker=1.0.9-2-1
 
 Every thing had been running fine for quite some time now but then I
 started seeing following errors in the corosync logs,
 
 
 =
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 
 
 I can perform all the crm shell commands and what not but it's
 troubling that the above is happening.
 
 My crm_mon output looks good.
 
 
 I also checked the authkey and did md5sum on both it's same.
 
 Then I stopped corosync and regenerated the authkey with
 corosync-keygen and copied it to the the other machine but I still get
 the above message in the corosync log.
 
 Is there anything other authkey that I should look into ?
 
 
 corosync.conf
 
 
 
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 totem {
 version: 2
 token: 3000
 token_retransmits_before_loss_const: 10
 join: 60
 consensus: 1500
 vsftype: none
 max_messages: 20
 clear_node_high_bit: yes
 secauth: on
 threads: 0
 rrp_mode: passive
 
 interface {
 ringnumber: 0
 bindnetaddr: 192.168.2.0
 #mcastaddr: 226.94.1.1
 broadcast: yes
 mcastport: 5405
 }
 interface {
 ringnumber: 1
 bindnetaddr: 172.20.20.0
 #mcastaddr: 226.94.1.1
 broadcast: yes
 mcastport: 5405
 }
 }
 
 
 logging {
 fileline: off
 to_stderr: yes
 to_logfile: yes
 to_syslog: yes
 logfile: /tmp/corosync.log
 debug: off
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 }
 }
 
 service {
 name: pacemaker
 ver: 0
 }
 
 aisexec {
 user:root
 group: root
 }
 
 amf {
 mode: disabled
 }
 
 
 ===
 
 
 Thanks
 Shravan
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] mcast vs broadcast

2010-01-18 Thread Steven Dake
On Mon, 2010-01-18 at 11:25 -0500, Shravan Mishra wrote:
 Hi all,
 
 
 
 Following is my corosync.conf.
 
 Even though broadcast is enabled I see mcasted messages like these
 in corosync.log.
 
 Is it ok?  even when the broadcast is on and not mcast.
 

Yes you are using broadcast and the debug output doesn't print a special
case for broadcast (but it really is broadcasting).

This output is debug output meant for developer consumption.  It is
really not all that useful for end users.  
 ==
 Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
 Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
 Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
 Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
 172 to pending delivery queue
 Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
 173 to pending delivery queue
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
 Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
 Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173
 
 
 =
 
 ===
 
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 totem {
 version: 2
 token: 3000
 token_retransmits_before_loss_const: 10
 join: 60
 consensus: 1500
 vsftype: none
 max_messages: 20
 clear_node_high_bit: yes
 secauth: on
 threads: 0
 rrp_mode: passive
 
 interface {
 ringnumber: 0
 bindnetaddr: 192.168.2.0
 #   mcastaddr: 226.94.1.1
 broadcast: yes
 mcastport: 5405
 }
 interface {
 ringnumber: 1
 bindnetaddr: 172.20.20.0
 #mcastaddr: 226.94.2.1
 broadcast: yes
 mcastport: 5405
 }
 }
 logging {
 fileline: off
 to_stderr: yes
 to_logfile: yes
 to_syslog: yes
 logfile: /tmp/corosync.log
 debug: on
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 }
 }
 
 service {
 name: pacemaker
 ver: 0
 }
 
 aisexec {
 user:root
 group: root
 }
 
 amf {
 mode: disabled
 }
 =
 
 
 
 Thanks
 Shravan
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pacemaker/OpenAIS Software for openSuSE 11.2

2010-01-12 Thread Steven Dake

  d) If I would try to compile from source as described at
  http://www.clusterlabs.org/wiki/Install#First_Steps
  one step is to get openais. Why are all the relevant
  prebuild library packages called corosync?
  I don't understand the distinction between openais and corosync
 

read this link:
http://www.corosync.org/doku.php?id=faq:why


 Corosync used to be part of Openais.
 Then they split it into two parts to make maintenance easier.
 From their home page The OpenAIS software is built to operate on the
 Corosync Cluster Engine 
 
  and how this two pieces fit together. By the way: There homepage
  doesn't enlight me either.
 
  Enough questions for a restart.
 
  Best regards
  Andreas Mock
 
 
 
  ___
  Pacemaker mailing list
  Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] openais/corosync

2010-01-11 Thread Steven Dake
On Mon, 2010-01-11 at 19:59 +0100, Andreas Mock wrote:
 Hi all,
 
 I don't understand the distinction between
 openais and corosync. The prebuild packages are
 named after corosync while the documentation
 always talk about openais.
 

See reasoning here:
http://www.corosync.org/doku.php?id=faq:why

 The infos I get from the homepages of openais/corosync
 dont help either. There is one paper on corosync's homepage
 saying that pacemaker is using corosync while the installation
 guide at http://www.clusterlabs.org/wiki/Install#OpenAIS.2A
 says to download openais.
 

the clusterlabs documentation is technically correct.  You can use
openais whitetank (see link above) but I recommend just using Corosync
instead.

 Can someone enlight me even this may be more related
 to openais/corosync. I'm sure that the users of this code
 can tell me how the parts fit together.  ;-)
 
 Best regards
 Andreas Mock
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] openais/corosync

2010-01-11 Thread Steven Dake
On Mon, 2010-01-11 at 21:00 +0100, Andreas Mock wrote:
  -Ursprüngliche Nachricht-
  Von: Steven Dake sd...@redhat.com
  Gesendet: 11.01.10 20:13:39
  An: pacema...@clusterlabs.org
  Betreff: Re: [Pacemaker] openais/corosync
 
 
  
  See reasoning here:
  http://www.corosync.org/doku.php?id=faq:why
 
 Hi Steve,
 
 thank you for that link. A piece of documentation I didn't find.
 
 They know why they do have improved documentation on
 their 2010 agenda.  ;-)
 

Yeah, it's pretty clear Corosync documentation is weak.  We really focused
on developing a great quality implementation and a good release model at
the expense of all other activities such as documentation and project
marketing.  We hope developers can deal with the documentation warts in
the near term until we sort that out.  In most cases, users don't need
much documentation on Corosync at all except managing corosync.conf
which is very well documented in man pages.  Corosync's functionality
should mostly be hidden behind application's functionality.

That said, we do want to improve documentation.  Beyond man pages for
all tools and APIs, we would eventually like to produce a user guide and
a separate developer guide which may number 100-200 PDF pages combined.
These objectives will happen this year.

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] corosync init script broken

2010-01-02 Thread Steven Dake
Hopefully all of these init script problems have been fixed in 1.2.0 by
Fabio and Andrew and should be in a repo available for you soon.

Regards
-steve

On Mon, 2009-12-28 at 13:22 +0100, Dominik Klein wrote:
 Hi cluster people
 
 been a while, couldn't really follow things. Today I was tasked to
 install a new cluster, went for 1.0.6 and corosync as described on the
 wiki and hit this:
 
 New cluster with pacemaker 106 and latest available corosync from the
 clusterlabs.org/rpm opensuse 11.1 repo.
 
 This installs /etc/init.d/corosync
 
 start says OK, but does not start corosync.
 
 Manually starting it, then
 
 stop never returns.
 
 This is because the internal status in the script calls killall -0
 corosync. This finds /etc/init.d/corosync, therefore start returns
 early and stop never returns.
 
 Workaround: Rename /etc/init.d/corosync
 
 I can't believe I am the first one to hit this. Am I?
 
 Regards
 Dominik
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] coroync not able to exec services properly

2010-01-02 Thread Steven Dake
If you're using corosync 1.2.0, we enforced a constraint on consensus and
token such that consensus must be at least 1.2 * token. Your consensus is 1/2
of token, which will cause corosync to exit at start.
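In other words, with the token value from the config quoted below, something
along these lines is needed for 1.2.0 to start (illustrative excerpt):

totem {
        ...
        token: 3000
        # consensus must be at least 1.2 * token
        consensus: 3600
        ...
}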

Regards
-steve

On Mon, 2009-12-28 at 12:58 +0100, Dejan Muhamedagic wrote:
 Hi,
 
 On Thu, Dec 24, 2009 at 02:35:01PM -0500, Shravan Mishra wrote:
  Hi Guys,
  
  I had a perfectly running system for about 3 weeks now but now on reboot I
  see problems.
  
  Looks like the processes are being spawned and respawned but a proper exec
  is not happening.
 
 According to the logs, attrd can't start (exit code 100) for some
 reason (perhaps there are more logs elsewhere where it says
 what's wrong) and pengine segfaults. For the latter please
 enable coredumps (ulimit -c unlimited) and file a bugzilla.
 
  Am I missing some permissions on directories.
  
  
  I have a script which does the following for directories:
 
 Why do you need this script? It should be done by the package
 installation scripts.
 
  =
  getent group haclient  /dev/null || groupadd -r haclient
  getent passwd hacluster  /dev/null || useradd -r -g haclient -d
  /var/lib/heartbeat/cores/hacluster -s /sbin/nologin -c cluster user
  hacluster
  
  if [ ! -d /var/lib/pengine ];then
   mkdir /var/lib/pengine
  fi
  chown -R hacluster:haclient /var/lib/pengine
  
  if [ ! -d /var/lib/heartbeat ];then
  mkdir /var/lib/heartbeat
  fi
  
  if [ ! -d /var/lib/heartbeat/crm ];then
   mkdir /var/lib/heartbeat/crm
  fi
  chown -R hacluster:haclient /var/lib/heartbeat/crm/
  chmod 750 /var/lib/heartbeat/crm/
  
  if [ ! -d /var/lib/heartbeat/ccm ];then
   mkdir /var/lib/heartbeat/ccm
  fi
  chown -R hacluster:haclient /var/lib/heartbeat/ccm/
  chmod 750 /var/lib/heartbeat/ccm/
  
  if [ ! -d /var/run/heartbeat/ ];then
   mkdir /var/run/heartbeat/
   fi
  
  if [ ! -d /var/run/heartbeat/ccm ];then
   mkdir /var/run/heartbeat/ccm/
   fi
  chown -R hacluster:haclient /var/run/heartbeat/ccm/
  chmod 750 /var/run/heartbeat/ccm/
 
 You don't need ccm for corosync/openais clusters.
 
  if [ ! -d /var/run/heartbeat/crm ];then
   mkdir /var/run/heartbeat/crm/
   fi
  chown -R hacluster:haclient /var/run/heartbeat/crm/
  chmod 750 /var/run/heartbeat/crm/
  
  if [ ! -d /var/run/crm ];then
   mkdir /var/run/crm
  fi
  
  if [ ! -d /var/lib/corosync ];then
   mkdir /var/lib/corosync
  fi
  =
  
  
  I have a very simple active-passive configuration with just 2 nodes.
  
  On starting Corosync , on doing
  
  
  [r...@node2 ~]# ps -ef | grep coro
  root  8242 1  0 11:33 ?00:00:00 /usr/sbin/corosync
  root  8248  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
  root  8249  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
  root  8250  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
  root  8252  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
  root  8393  8242  0 11:35 ?00:00:00 /usr/sbin/corosync
  [r...@node2 ~]# ps -ef | grep heart
  827924 1  0 11:28 ?00:00:00 /usr/lib64/heartbeat/pengine
  
  I'm attaching the log file.
  
  My config is:
  
  
  # Please read the corosync.conf.5 manual page
  compatibility: whitetank
  
  totem {
   version: 2
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 1500
vsftype: none
max_messages: 20
clear_node_high_bit: yes
secauth: on
threads: 0
rrp_mode: passive
  interface {
  ringnumber: 0
  bindnetaddr: 192.168.1.0
  # mcastaddr: 226.94.1.1
  broadcast: yes
  mcastport: 5405
  }
  interface {
  ringnumber: 1
  bindnetaddr: 172.20.20.0
  # mcastaddr: 226.94.1.1
  broadcast: yes
  mcastport: 5405
  }
  }
  
  logging {
  fileline: off
  to_stderr: yes
  to_logfile: yes
  to_syslog: yes
  logfile: /tmp/corosync.log
 
 Don't log to file. Can't recall exactly but there were some
 permission problems with that, probably because Pacemaker daemons
 don't run as root.
 
 Thanks,
 
 Dejan
 
  debug: on
  timestamp: on
  logger_subsys {
  subsys: AMF
  debug: off
  }
  }
  
  service {
  name: pacemaker
  ver: 0
  }
  
  aisexec {
  user:root
  group: root
  }
  
  amf {
  mode: disabled
  }
  
  
  Please help.
  
  Sincerely
  Shravan
 
 
  ___
  Pacemaker mailing list
  Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Fedora 12 repository

2009-12-20 Thread Steven Dake
Pacemaker is integrated directly into the Fedora repository rather than
provided externally.  You can grab it using 'yum install pacemaker'.

Regards
-steve

On Sun, 2009-12-20 at 11:46 -0500, E-Blokos wrote:
 Hi,
 
 is there any yum repository for Fedora 12 ?
 I checked http://download.opensuse.org/repositories/server%3A/ha-clustering
 but there are only folder for 10 and 11
 
 Thanks
 
 Franck
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Node crash when 'ifdown eth0'

2009-11-30 Thread Steven Dake
On Mon, 2009-11-30 at 17:05 -0700, hj lee wrote:
 
 
 On Fri, Nov 27, 2009 at 3:05 PM, Steven Dake sd...@redhat.com wrote:
 On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote:
  I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using
 openais) with
  centos 5.4.  The packages are from here:
  http://www.clusterlabs.org/rpm/epel-5/
 
  Mark
 
  On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remí­rez de Ganuza
 Satrústegui
  oscar...@unav.es wrote:
   Good morning,
  
   We are testing a cluster configuration on RHEL5 (x86_64)
 with pacemaker
   1.0.5 and openais (0.80.5).
   Two node cluster, active-passive, with the following
 resources:
   Mysql service resource and a NFS filesystem resource
 (shared storage in a
   SAN).
  
   In our tests, when we bring down the network interface
 (ifdown eth0), the
 
 
 What is the use case for ifdown eth0 (ie what are you trying
 to verify)?
 
 I have the same test case. In my case, when two nodes cluster is
 disconnect, I want to see split-brain. And then I want to see the
 split-brain handler resets one of nodes. What I want to verify is that
 the cluster will recover network disconnection and split-brain
 situation.
 

Taking eth0 down with ifconfig is totally different from testing whether there is a
node disconnection.  When corosync detects eth0 being taken down, it
rebinds to the loopback interface 127.0.0.1.  This is probably not what you had in
mind when you wanted to test split brain.  Keep in mind that an interface
taken out of service is different from an interface failing, from a POSIX
API perspective.

What you really want to test is pulling the network cable between the
machines.
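
If pulling the cable is awkward in your environment, a rough alternative
sketch is to drop the cluster traffic with iptables so the interface stays
up but the nodes stop seeing each other (the peer address and port below
are placeholders; use the values from your own corosync.conf):

# on node A, block corosync traffic to/from node B
iptables -I INPUT  -s 192.168.1.2 -p udp --dport 5405 -j DROP
iptables -I OUTPUT -d 192.168.1.2 -p udp --dport 5405 -j DROP
# when the test is done, remove the rules again
iptables -D INPUT  -s 192.168.1.2 -p udp --dport 5405 -j DROP
iptables -D OUTPUT -d 192.168.1.2 -p udp --dport 5405 -j DROP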

Regards
-steve

 Thanks
 hj
 
 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Node crash when 'ifdown eth0'

2009-11-27 Thread Steven Dake
On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote:
 I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using openais) with
 centos 5.4.  The packages are from here:
 http://www.clusterlabs.org/rpm/epel-5/
 
 Mark
 
 On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remí­rez de Ganuza Satrústegui
 oscar...@unav.es wrote:
  Good morning,
 
  We are testing a cluster configuration on RHEL5 (x86_64) with pacemaker
  1.0.5 and openais (0.80.5).
  Two node cluster, active-passive, with the following resources:
  Mysql service resource and a NFS filesystem resource (shared storage in a
  SAN).
 
  In our tests, when we bring down the network interface (ifdown eth0), the

What is the use case for ifdown eth0 (ie what are you trying to verify)?

I recommend using the latest pacemaker and corosync as well if you're doing a
new deployment.

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] **** SPAM **** Re: pacemaker-1.0.6 + corosync 1.1.2 crashing

2009-11-20 Thread Steven Dake
Nik,

Any chance you have a backtrace of the core files?  That might be
helpful in pinpointing the issue.

To do this, run:

gdb binaryname corefilename

and at the (gdb) prompt enter bt to print the backtrace.
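
A hypothetical session might look like this (the paths and PID are
examples; substitute the real binary and core file):

ulimit -c unlimited               # make sure core files get written at all
gdb /usr/sbin/corosync core.12345
(gdb) bt                          # backtrace of the crashing thread
(gdb) thread apply all bt         # backtraces of all threads, if threaded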

Regards
-steve

On Thu, 2009-11-19 at 17:50 +0100, Nikola Ciprich wrote:
 Hi Andrew,
 sorry to bother again, do You have some idea what else might be wrong?
 Does it make sense to CC openais or cluster maillist?
 Is there some other debugging You would recommend?
 with best regards
 nik
 
 On Wed, Nov 18, 2009 at 03:26:28PM +0100, Nikola Ciprich wrote:
  I've packaged those myself, all are based on clean sources without any
  additional patches.


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Resource capacity limit

2009-11-12 Thread Steven Dake
On Thu, 2009-11-12 at 14:53 +0100, Andrew Beekhof wrote:
 On Wed, Nov 11, 2009 at 1:36 PM, Lars Marowsky-Bree l...@suse.de wrote:
  On 2009-11-05T14:45:36, Andrew Beekhof and...@beekhof.net wrote:
 
  Lastly, I would really like to defer this for 1.2
  I know I've bent the rules a bit for 1.0 in the past, but its really
  late in the game now.
 
  Personally, I think the Linux kernel model works really well. ie, no
  major releases any more, but bugfixes and features alike get merged
  over time and constantly.
 
 Thats a great model if you've got hoards of developers and testers.
 Of which we have neither.
 
 At this point in time, I can't see us going back to the way heartbeat
 releases were done.
 If there was a single thing that I'd credit Pacemaker's current
 reliability to, it would be our release strategy.

Maintaining corosync and openais, I'd surely like to have only one tree
where all work is done and never have a stable branch.  Andrew is
right though: this model only works if there is large downstream
adoption and support, and distros take on the work of stabilizing the
efforts of trunk development.

Talking with distros, I know this is generally not the case with any
package other than kernel.org and maybe some related bits like xen/kvm
(which have had this model forced upon them).

Regards
-steve




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] pacemaker-1.0.6 + corosync 1.1.2 crashing

2009-11-10 Thread Steven Dake
Nikola,

Yet another possibility is that your box doesn't have any/enough shared
memory available.  Usually this is mounted at /dev/shm.
Unfortunately bad things happen, and error handling around this condition
needs some work.  It's hard to tell because the signal delivered to the
application on failure is not shown in your backtrace.

For example, I have plenty of shared memory available (the output below is
from df):
tmpfs  1027020  3560   1023460   1% /dev/shm
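
A quick way to check, and to enlarge the tmpfs if it turns out to be too
small (the size below is only an example):

df -h /dev/shm
mount -o remount,size=256M /dev/shm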

Regards
-steve

On Tue, 2009-11-10 at 10:28 +0100, Nikola Ciprich wrote:
 Hello Andrew et al,
 few days ago, I asked about pacemaker + corosync + clvmd etc. With Your 
 advice, I got this working well.
 It was in testing virtual machines, I'm now trying to install similar setup 
 on raw hardware but for some
 reasong attrd and cib seem to be crashing.
 
 here's snippet from corosync log:
 Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Corosync Cluster Engine 
 ('1.1.2'): started and ready to provide service.
 Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Corosync built-in features: 
 nss rdma
 Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Successfully read main 
 configuration file '/etc/corosync/corosync.conf'.
 Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] Initializing transport 
 (UDP/IP).
 Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] Initializing 
 transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Compatibility mode set to 
 whitetank.  Using V1 and V2 of the synchronization engine.
 Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] The network interface 
 [10.58.0.1] is now up.
 Nov 10 14:12:21 vbox3 corosync[4299]:   [pcmk  ] info: process_ais_conf: 
 Reading configure
 Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Corosync Cluster Engine 
 ('1.1.2'): started and ready to provide service.
 Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Corosync built-in features: 
 nss rdma
 Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Successfully read main 
 configuration file '/etc/corosync/corosync.conf'.
 Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] Initializing transport 
 (UDP/IP).
 Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] Initializing 
 transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Compatibility mode set to 
 whitetank.  Using V1 and V2 of the synchronization engine.
 Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] The network interface 
 [10.58.0.1] is now up.
 Nov 10 14:13:16 vbox3 corosync[4348]:   [pcmk  ] info: process_ais_conf: 
 Reading configure
 Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Corosync Cluster Engine 
 ('1.1.2'): started and ready to provide service.
 Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Corosync built-in features: 
 nss rdma
 Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Successfully read main 
 configuration file '/etc/corosync/corosync.conf'.
 Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] Initializing transport 
 (UDP/IP).
 Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] Initializing 
 transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Compatibility mode set to 
 whitetank.  Using V1 and V2 of the synchronization engine.
 Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] The network interface 
 [10.58.0.1] is now up.
 Nov 10 14:13:24 vbox3 corosync[4357]:   [pcmk  ] info: process_ais_conf: 
 Reading configure
 Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Corosync Cluster Engine 
 ('1.1.2'): started and ready to provide service.
 Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Corosync built-in features: 
 nss rdma
 Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Successfully read main 
 configuration file '/etc/corosync/corosync.conf'.
 Nov 10 14:13:57 vbox3 corosync[4380]:   [TOTEM ] Initializing transport 
 (UDP/IP).
 Nov 10 14:13:57 vbox3 corosync[4380]:   [TOTEM ] Initializing 
 transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Compatibility mode set to 
 whitetank.  Using V1 and V2 of the synchronization engine.
 Nov 10 14:13:58 vbox3 corosync[4380]:   [TOTEM ] The network interface 
 [10.58.0.1] is now up.
 Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: process_ais_conf: 
 Reading configure
 Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: config_find_init: 
 Local handle: 9213452461992312833 for logging
 Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: config_find_next: 
 Processing additional logging options...
 Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: get_config_opt: Found 
 'off' for option: debug
 Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: get_config_opt: 
 Defaulting to 'off' for option: to_file
 Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: get_config_opt: 
 Defaulting to 'daemon' for option: syslog_facility
 Nov 10 

Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-04 Thread Steven Dake
On Thu, 2009-11-05 at 00:06 +0100, Colin wrote:
 On Wed, Nov 4, 2009 at 5:47 PM, Andrew Beekhof and...@beekhof.net wrote:
 
  Hopelessly out of date?
  Corosync has been supported for all of 3 days now.
 
 Sorry, it seems that I jumped to a wrong conclusion (namely that with
 Corosync being a part of OpenAIS, and Pacemaker having run on OpenAIS
 for a while, that there wasn't much difference to supporting Corosync
 instea of OpenAIS -- shows that I'm still quite ignorant about some of
 the internals.)
 
 Actually, I set up Pacemaker with Corosync from the new packages, just
 to see what it looks like, and it was so easy that we'll stick to it
 for the next round of tests, i.o.w., the details of the cluster
 underneath Pacemaker are so well hidden that (a) it doesn't make much
 difference, and (b) my ignorance in that area never was a problem: It
 just works.
 
 -Colin

The intent with Corosync was that the migration path for users be mostly
seamless, and we have more or less nailed that, with the exception of a
few configuration file and CLI binary renames (and
of course a new ABI for Pacemaker to program to, which was not painless
for Andrew :).

Regards
-steve
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-03 Thread Steven Dake
On Wed, 2009-11-04 at 09:35 +0800, Romain CHANU wrote:
 Hi Martin,
 
 Could you tell us what's the rationale to remove openais and include
 corosync?
 
 Would it mean that people should use corosync from now on for any HA
 development?
 
 Best Regards,
 
 Romain Chanu
 

Just a short note: I would also recommend making available the latest
openais packages, which complement both corosync and pacemaker with SA
Forum compliant APIs.

Regards
-steve

 
 2009/11/3 Martin Gerhard Loschwitz martin.loschw...@linbit.com
 Ladies and Gentleman,
 
 i am happy to announce the availability of Pacemaker 1.0.6
 packages
 for Debian GNU/Linux 5.0 alias Lenny (i386 and amd64).
 
 These packages are a remarkable break, as they have totally
 and
 ruthlessly been revamped. The whole layout has actually
 changed;
 here are the most important things to keep in mind when using
 them:
 
 * pacemaker-openais and pacemaker-heartbeat are gone;
 pacemaker now
 only comes in one flavour, having support for corosync and
 heartbeat
 built it. This is based on pacemaker's capability to detect by
 which
 messaging framework it has been started and act accordingly.
 
 * openais is gone. pacemaker 1.0.6 uses corosync.
 
 * the new layout allows flawless updates. if you have
 heartbeat
 2.1.4 and do a dist-upgrade, you will automatically get
 pacemaker.
 all you need to do afterwards is converting the xml-file to
 work
 with pacemaker -- you can then start heartbeat, and things are
 going to be fine (more on this can be found in the
 Clusterlabs-
 Wiki)
 
 * Now that we finally have a decent layout for pacemaker, we
 can
 easily provide gui packages: welcome pacemaker-mgmt, being in
 good
 condition and shape now, allowing you do administer your
 cluster
 via a GTK tool.
 
 The new packages can as always be found on:
 
 deb http://people.debian.org/~madkiss/ha lenny main
 deb-src http://people.debian.org/~madkiss/ha lenny main
 
 --
 : Martin G. Loschwitz   Tel +43-1-8178292-63
  :
 : LINBIT Information Technologies GmbH  Fax +43-1-8178292-82
  :
 : Vivenotgasse 48, 1120 Vienna, Austria
 http://www.linbit.com :
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] corosync doesn't stop all services

2009-10-21 Thread Steven Dake
We had to change both pacemaker and corosync for this problem.  I
suspect you don't have the updated pacemaker.

Regards
-steve

On Wed, 2009-10-21 at 15:11 +0200, Michael Schwartzkopff wrote:
 Hi,
 
 perhaps this is the wrong list but anyway:
 
 I have corosync-1.1.1 and pacemaker-1.0.5 on debian lenny.
 
 When I start corosync everything looks fine. But when I stop corosync I still 
 see a lot of heartbeart processes. I thought this was fixed in 
 corosync-1.1.1. 
 so what might be the problem?
 
 # ps uax | grep heart
 root  2083  0.0  0.4   4884  1220 pts/1S   17:04   0:00 
 /usr/lib/heartbeat/ha_logd -d
 root  2084  0.0  0.3   4884   820 pts/1S   17:04   0:00 
 /usr/lib/heartbeat/ha_logd -d
 root  2099  0.0  4.1  10712 10712 ?SLs 17:04   0:00 
 /usr/lib/heartbeat/stonithd
 104   2100  0.1  1.4  12768  3748 ?S   17:04   0:00 
 /usr/lib/heartbeat/cib
 root  2101  0.0  0.7   5352  1800 ?S   17:04   0:00 
 /usr/lib/heartbeat/lrmd
 104   2102  0.0  1.0  12260  2596 ?S   17:04   0:00 
 /usr/lib/heartbeat/attrd
 104   2103  0.0  1.1   8880  3024 ?S   17:04   0:00 
 /usr/lib/heartbeat/pengine
 104   2104  0.0  1.2  12404  3176 ?S   17:04   0:00 
 /usr/lib/heartbeat/crmd
 root  2140  0.0  0.2   3116   720 pts/1R+  17:08   0:00 grep heart
 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] pacemaker unable to start

2009-10-21 Thread Steven Dake
I recommend using corosync 1.1.1 - it contains several bug fixes, one of them
critical for proper pacemaker operation.  It won't fix this particular problem,
however.

Corosync loads pacemaker by searching for a pacemaker lcrso file.  These
files are installed by default in /usr/libexec/lcrso but may be in a
different location depending on your distribution.

Regards
-steve

On Wed, 2009-10-21 at 11:13 -0400, Shravan Mishra wrote:
 Hello guys,
 
 We are running 
 
 corosync-1.0.0
 heartbeat-2.99.1
 pacemaker-1.0.4
 
 the corosync.conf  under /etc/corosync/ is 
 
 
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 aisexec {
user: root
group: root
 }
 totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
bindnetaddr: 172.30.0.0
mcastaddr:226.94.1.1
mcastport: 5406
}
 }
 
 logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /tmp/corosync.log
debug: on
timestamp: on
logger_subsys {
subsys: pacemaker
debug: on
tags: enter|leave|trace1|trace2| trace3|trace4|trace6
}
 }
 
 
 service {
name: pacemaker
ver: 0
 #   use_mgmtd: yes
  #  use_logd:yes
 }
 
 
 corosync {
user: root
group: root
 }
 
 
 amf {
mode: disabled
 }
 
 
 
 #service corosync start   
 
 starts the messaging but fails to load pacemaker,
 
 /tmp/corosync.log  ---   
 
 ==
 
 Oct 21 11:05:43 corosync [MAIN  ] Corosync Cluster Engine ('trunk'):
 started and ready to provide service.
 Oct 21 11:05:43 corosync [MAIN  ] Successfully read main configuration
 file '/etc/corosync/corosync.conf'.
 Oct 21 11:05:43 corosync [TOTEM ] Token Timeout (1000 ms) retransmit
 timeout (238 ms)
 Oct 21 11:05:43 corosync [TOTEM ] token hold (180 ms) retransmits
 before loss (4 retrans)
 Oct 21 11:05:43 corosync [TOTEM ] join (50 ms) send_join (0 ms)
 consensus (800 ms) merge (200 ms)
 Oct 21 11:05:43 corosync [TOTEM ] downcheck (1000 ms) fail to recv
 const (50 msgs)
 Oct 21 11:05:43 corosync [TOTEM ] seqno unchanged const (30 rotations)
 Maximum network MTU 1500
 Oct 21 11:05:43 corosync [TOTEM ] window size per rotation (50
 messages) maximum messages per rotation (17 messages)
 Oct 21 11:05:43 corosync [TOTEM ] send threads (0 threads)
 Oct 21 11:05:43 corosync [TOTEM ] RRP token expired timeout (238 ms)
 Oct 21 11:05:43 corosync [TOTEM ] RRP token problem counter (2000 ms)
 Oct 21 11:05:43 corosync [TOTEM ] RRP threshold (10 problem count)
 Oct 21 11:05:43 corosync [TOTEM ] RRP mode set to none.
 Oct 21 11:05:43 corosync [TOTEM ] heartbeat_failures_allowed (0)
 Oct 21 11:05:43 corosync [TOTEM ] max_network_delay (50 ms)
 Oct 21 11:05:43 corosync [TOTEM ] HeartBeat is Disabled. To enable set
 heartbeat_failures_allowed > 0
 Oct 21 11:05:43 corosync [TOTEM ] Initializing transmit/receive
 security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Oct 21 11:05:43 corosync [TOTEM ] Receive multicast socket recv buffer
 size (262142 bytes).
 Oct 21 11:05:43 corosync [TOTEM ] Transmit multicast socket send
 buffer size (262142 bytes).
 Oct 21 11:05:43 corosync [TOTEM ] The network interface [172.30.0.145]
 is now up.
 Oct 21 11:05:43 corosync [TOTEM ] Created or loaded sequence id
 184.172.30.0.145 for this ring.
 Oct 21 11:05:43 corosync [TOTEM ] entering GATHER state from 15.
 Oct 21 11:05:43 corosync [SERV  ] Service failed to load 'pacemaker'.
 Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
 extended virtual synchrony service'
 Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
 configuration service'
 Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
 cluster closed process group service v1.01'
 Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
 cluster config database access v1.01'
 Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
 profile loading service'
 Oct 21 11:05:43 corosync [MAIN  ] Compatibility mode set to
 whitetank.  Using V1 and V2 of the synchronization engine.
 Oct 21 11:05:43 corosync [TOTEM ] Creating commit token because I am
 the rep.
 Oct 21 11:05:43 corosync [TOTEM ] Saving state aru 0 high seq received
 0
 Oct 21 11:05:43 corosync [TOTEM ] Storing new sequence id for ring bc
 Oct 21 11:05:43 corosync [TOTEM ] entering COMMIT state.
 Oct 21 11:05:43 corosync [TOTEM ] got commit token
 Oct 21 11:05:43 corosync [TOTEM ] entering RECOVERY state.
 Oct 21 11:05:43 corosync [TOTEM ] position [0] member 172.30.0.145:
 Oct 21 11:05:43 corosync [TOTEM ] previous ring seq 184 rep
 172.30.0.145
 Oct 21 11:05:43 corosync [TOTEM ] aru 0 high delivered 0 received flag
 1
 Oct 21 11:05:43 corosync [TOTEM ] Did not need to originate any
 messages in recovery.
 Oct 21 11:05:43 corosync [TOTEM ] got commit token
 Oct 21 

Re: [Pacemaker] pacemaker unable to start

2009-10-21 Thread Steven Dake
Yeah, you're missing the pacemaker lcrso file.  Either you didn't build
pacemaker with corosync support, or pacemaker didn't install that binary
in the proper place.

try:

updatedb
locate lcrso

Regards
-steve

On Wed, 2009-10-21 at 12:28 -0400, Shravan Mishra wrote:
 Steve, this is what my installation shows--
 
 ls -l /usr/libexec/lcrso
 
 -rwxr-xr-x  1 root root  101243 Jul 29 11:21 coroparse.lcrso
 -rwxr-xr-x  1 root root  117688 Jul 29 11:21 objdb.lcrso
 -rwxr-xr-x  1 root root   92702 Jul 29 11:54 openaisserviceenable.lcrso
 -rwxr-xr-x  1 root root  110808 Jul 29 11:21 quorum_testquorum.lcrso
 -rwxr-xr-x  1 root root  159057 Jul 29 11:21 quorum_votequorum.lcrso
 -rwxr-xr-x  1 root root 1175430 Jul 29 11:54 service_amf.lcrso
 -rwxr-xr-x  1 root root  133976 Jul 29 11:21 service_cfg.lcrso
 -rwxr-xr-x  1 root root  218374 Jul 29 11:54 service_ckpt.lcrso
 -rwxr-xr-x  1 root root  139029 Jul 29 11:54 service_clm.lcrso
 -rwxr-xr-x  1 root root  122668 Jul 29 11:21 service_confdb.lcrso
 -rwxr-xr-x  1 root root  138412 Jul 29 11:21 service_cpg.lcrso
 -rwxr-xr-x  1 root root  125638 Jul 29 11:21 service_evs.lcrso
 -rwxr-xr-x  1 root root  196443 Jul 29 11:54 service_evt.lcrso
 -rwxr-xr-x  1 root root  194885 Jul 29 11:54 service_lck.lcrso
 -rwxr-xr-x  1 root root  235168 Jul 29 11:54 service_msg.lcrso
 -rwxr-xr-x  1 root root  120445 Jul 29 11:21 service_pload.lcrso
 -rwxr-xr-x  1 root root  135340 Jul 29 11:54 service_tmr.lcrso
 -rwxr-xr-x  1 root root  124092 Jul 29 11:21 vsf_quorum.lcrso
 -rwxr-xr-x  1 root root  121298 Jul 29 11:21 vsf_ykd.lcrso
 
 I also did
 
 export COROSYNC_DEFAULT_CONFIG_IFACE=openaisserviceenable:openaisparser
 
 In place of openaisparser I also tried corosyncparse and
 corosyncparser but to no avail.
 
 -sincerely
 Shravan
 
 On Wed, Oct 21, 2009 at 11:49 AM, Steven Dake sd...@redhat.com wrote:
  I recommend using corosync 1.1.1 - several bug fixes one critical for
  proper pacemaker operation.  It won't fix this particular problem
  however.
 
  Corosync loads pacemaker by searching for a pacemaker lcrso file.  These
  files are default installed in /usr/libexec/lcrso but may be in a
  different location depending on your distribution.
 
  Regards
  -steve
 
  On Wed, 2009-10-21 at 11:13 -0400, Shravan Mishra wrote:
  Hello guys,
 
  We are running
 
  corosync-1.0.0
  heartbeat-2.99.1
  pacemaker-1.0.4
 
  the corosync.conf  under /etc/corosync/ is
 
  
  # Please read the corosync.conf.5 manual page
  compatibility: whitetank
 
  aisexec {
 user: root
 group: root
  }
  totem {
 version: 2
 secauth: off
 threads: 0
 interface {
 ringnumber: 0
 bindnetaddr: 172.30.0.0
 mcastaddr:226.94.1.1
 mcastport: 5406
 }
  }
 
  logging {
 fileline: off
 to_stderr: yes
 to_logfile: yes
 to_syslog: yes
 logfile: /tmp/corosync.log
 debug: on
 timestamp: on
 logger_subsys {
 subsys: pacemaker
 debug: on
 tags: enter|leave|trace1|trace2| trace3|trace4|trace6
 }
  }
 
 
  service {
 name: pacemaker
 ver: 0
  #   use_mgmtd: yes
   #  use_logd:yes
  }
 
 
  corosync {
 user: root
 group: root
  }
 
 
  amf {
 mode: disabled
  }
  
 
 
  #service corosync start
 
  starts the messaging but fails to load pacemaker,
 
  /tmp/corosync.log  ---
 
  ==
 
  Oct 21 11:05:43 corosync [MAIN  ] Corosync Cluster Engine ('trunk'):
  started and ready to provide service.
  Oct 21 11:05:43 corosync [MAIN  ] Successfully read main configuration
  file '/etc/corosync/corosync.conf'.
  Oct 21 11:05:43 corosync [TOTEM ] Token Timeout (1000 ms) retransmit
  timeout (238 ms)
  Oct 21 11:05:43 corosync [TOTEM ] token hold (180 ms) retransmits
  before loss (4 retrans)
  Oct 21 11:05:43 corosync [TOTEM ] join (50 ms) send_join (0 ms)
  consensus (800 ms) merge (200 ms)
  Oct 21 11:05:43 corosync [TOTEM ] downcheck (1000 ms) fail to recv
  const (50 msgs)
  Oct 21 11:05:43 corosync [TOTEM ] seqno unchanged const (30 rotations)
  Maximum network MTU 1500
  Oct 21 11:05:43 corosync [TOTEM ] window size per rotation (50
  messages) maximum messages per rotation (17 messages)
  Oct 21 11:05:43 corosync [TOTEM ] send threads (0 threads)
  Oct 21 11:05:43 corosync [TOTEM ] RRP token expired timeout (238 ms)
  Oct 21 11:05:43 corosync [TOTEM ] RRP token problem counter (2000 ms)
  Oct 21 11:05:43 corosync [TOTEM ] RRP threshold (10 problem count)
  Oct 21 11:05:43 corosync [TOTEM ] RRP mode set to none.
  Oct 21 11:05:43 corosync [TOTEM ] heartbeat_failures_allowed (0)
  Oct 21 11:05:43 corosync [TOTEM ] max_network_delay (50 ms)
  Oct 21 11:05:43 corosync [TOTEM ] HeartBeat is Disabled. To enable set
  heartbeat_failures_allowed > 0
  Oct 21 11:05:43 corosync [TOTEM ] Initializing transmit/receive
  security

Re: [Pacemaker] Failed in restart of Corosync.

2009-10-18 Thread Steven Dake
This bug has been reported and we are working on a solution.

Regards
-steve

On Mon, 2009-10-19 at 11:05 +0900, renayama19661...@ybb.ne.jp wrote:
 Hi,
 
 I understand that a combination is not official in Corosync and Pacemaker.
 However, I contributed it because I thought that it was important that I 
 reported a problem.
 
 I started next combination Corosync.(on Redhat5.4(x86))
 
 * corosync trunk 2530
 * Cluster-Resource-Agents-6d652f7cf9d8
 * Reusable-Cluster-Components-4edc8f99701c
 * Pacemaker-1-0-de2a3778ace7
 
 I stopped service(corosync) next.
 But, I did KILL of a process because a process of Pacemaker did not stop well.
 
 
 [r...@rh54-1 ~]# service Corosync stop
 Stopping Corosync Cluster Engine (corosync):   [  OK  ]
 Waiting for services to unload:[  OK  ]
 [r...@rh54-1 ~]# ps -ef |grep coro
 root  5263  4617  0 10:54 pts/000:00:00 grep coro
 [r...@rh54-1 ~]# ps -ef |grep heartbeat 
 root  4882 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/stonithd
 500   4883 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/cib
 root  4884 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/lrmd
 500   4885 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/attrd
 500   4886 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/pengine
 500   4887 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/crmd
 root  5278  4617  0 10:54 pts/000:00:00 grep heartbeat
 [r...@rh54-1 ~]# kill -9 4882 4883 4884 4885 4886 4887
 [r...@rh54-1 ~]# ps -ef |grep heartbeat 
 root  5310  4617  0 10:54 pts/000:00:00 grep heartbeat
 
 
 
 I started Corosync again.
 But, a cib process of Pacemaker seems not to be able to communicate with 
 Corosync.
 
 
 
 Oct 19 10:55:29 rh54-1 cib: [5354]: info: startCib: CIB Initialization 
 completed successfully
 Oct 19 10:55:29 rh54-1 cib: [5354]: info: crm_cluster_connect: Connecting to 
 OpenAIS
 Oct 19 10:55:29 rh54-1 cib: [5354]: info: init_ais_connection: Creating 
 connection to our AIS plugin
 Oct 19 10:55:30 rh54-1 mgmtd: [5359]: info: login to cib live: 1, ret:-10
 Oct 19 10:55:30 rh54-1 crmd: [5358]: info: do_cib_control: Could not connect 
 to the CIB service:
 connection failed
 Oct 19 10:55:30 rh54-1 crmd: [5358]: WARN: do_cib_control: Couldn't complete 
 CIB registration 1
 times... pause and retry
 Oct 19 10:55:30 rh54-1 crmd: [5358]: info: crmd_init: Starting crmd's mainloop
 Oct 19 10:55:31 rh54-1 mgmtd: [5359]: info: login to cib live: 2, ret:-10
 Oct 19 10:55:32 rh54-1 mgmtd: [5359]: info: login to cib live: 3, ret:-10
 Oct 19 10:55:32 rh54-1 crmd: [5358]: info: crm_timer_popped: Wait Timer 
 (I_NULL) just popped!
 Oct 19 10:55:33 rh54-1 mgmtd: [5359]: info: login to cib live: 4, ret:-10
 Oct 19 10:55:33 rh54-1 crmd: [5358]: info: do_cib_control: Could not connect 
 to the CIB service:
 connection failed
 Oct 19 10:55:33 rh54-1 crmd: [5358]: WARN: do_cib_control: Couldn't complete 
 CIB registration 2
 times... pause and retry
 
 
 
 On this account it does not start definitely even if Pacemaker waits till 
 when.
 
 As for the problem, Corosync seems to fail in poll(?) somehow or other.
 However, possibly the cause may depend on the failure of the first stop.
 
 
 [r...@rh54-1 ~]# ps -ef |grep coro
 root  5348 1  0 10:55 ?00:00:00 /usr/sbin/corosync
 root  5400  4617  0 10:56 pts/000:00:00 grep coro
 [r...@rh54-1 ~]# strace -p 5348
 Process 5348 attached - interrupt to quit
 futex(0x805c8c0, FUTEX_WAIT_PRIVATE, 2, NULL
 
 
 Is there a method with the avoidance of this phenomenon what it is?
 Can I evade a problem by deleting some file?
 
 * I hope it so that a combination of Corosync and Pacemaker becomes the 
 practical use early.
 
 Best Regards,
 Hideo Yamauchi.
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] fedora11: openais fails to start

2009-10-09 Thread Steven Dake
You could try the f12 rpms - we have tested these.  We are in the
process of making these available in f11/f10, but there is a bit of a
lag because of the Fedora process.

The f12 rpms are at koji.fedoraproject.org.

From looking at your logs, it appears iptables is enabled and not
configured properly.  Try 'service iptables stop'.
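
If you would rather keep the firewall running, a rough sketch of rules
that let the cluster traffic through (this assumes the default mcastport
5405 plus its companion port 5404; adjust to your corosync.conf):

iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT
iptables -I INPUT -p igmp -j ACCEPT      # multicast group membership
service iptables save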

Regards
-steve

On Fri, 2009-10-09 at 14:31 +0200, Michael Schwartzkopff wrote:
 Hi,
 
 I wanted to try pacemaker/openais on a fedora11. Packages from OSBS:
 # rpm -qa | grep ais\|pace
 pacemaker-1.0.5-4.1.i386
 libopenais2-0.80.5-15.1.i386
 pacemaker-libs-1.0.5-4.1.i386
 openais-0.80.5-15.1.i386
 pacemaker-mgmt-1.99.2-6.1.i386
 
 When I start /etc/init.d/openais start
 - There are some entries in the log. Nothing what I could identify as an 
 error. See: http://www.pastebin.org/41120
 
 - openais-cfgtool -s stops at
 Printing ring status.
 Need to CTRL-C to stop.
 
 - No pacemaker process are really started:
 ps uax | grep crm
 is empty.
 
 Any ideas?
 
 Using corosync-1.0.0 from fedora11 is no option. Results in another error.
 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] A problem to fail in a stop of Pacemaker.

2009-09-30 Thread Steven Dake
On Wed, 2009-09-30 at 09:51 +0900, renayama19661...@ybb.ne.jp wrote:
 Hi Remi,
 
  It appears that this is a similar problem to the one that I reported, 
  yes.  It appears to not be a bug in Corosync, but rather one in 
  Pacemaker.  This bug has been filed in Red Hat Bugzilla, see it at:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=525589
  
  Perhaps you could add any additional details that you have found 
  (affected packages, etc.) to the bug; it may help the developers fix it.
 
 All right.
 Thank you.
 
 Best Regards,
 Hideo Yamauchi.
 

Please note this could still be a bz in corosync related to service
engine integration.  It is just too early to tell.  Andrew should be
able to tell us for certain when he has an opportunity to take a look at
it.

Regards
-steve

 --- Remi Broemeling r...@nexopia.com wrote:
 
  Hello Hideo,
  
  It appears that this is a similar problem to the one that I reported, 
  yes.  It appears to not be a bug in Corosync, but rather one in 
  Pacemaker.  This bug has been filed in Red Hat Bugzilla, see it at:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=525589
  
  Perhaps you could add any additional details that you have found 
  (affected packages, etc.) to the bug; it may help the developers fix it.
  
  Thanks.
  
  
  renayama19661...@ybb.ne.jp wrote:
   Hi,
  
   I started a Dummy resource in one node by the next combination.
* corosync 1.1.0
* Pacemaker-1-0-05c8b63cbca7
* Reusable-Cluster-Components-6ef02517ee57
* Cluster-Resource-Agents-88a9cfd9e8b5
  
   The Dummy resource started in a node.
  
   I was going to stop a node(service Corosync stop), but did not stop.
  
   --log--
   (snip)
  
   Sep 29 13:52:01 rh53-1 crmd: [11193]: info: crm_signal_dispatch: Invoking 
   handler for signal
  15:
   Terminated
   Sep 29 13:52:01 rh53-1 crmd: [11193]: info: crm_shutdown: Requesting 
   shutdown
   Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_state_transition: State 
   transition S_IDLE -
   S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
   Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_state_transition: All 1 
   cluster nodes are
  eligible to
   run resources.
   Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_shutdown_req: Sending 
   shutdown request to DC:
  rh53-1
   Sep 29 13:52:30 rh53-1 corosync[11183]:   [pcmk  ] notice: pcmk_shutdown: 
   Still waiting for
  crmd
   (pid=11193) to terminate...
   Sep 29 13:53:30 rh53-1 last message repeated 2 times
   Sep 29 13:55:00 rh53-1 last message repeated 3 times
   Sep 29 13:56:30 rh53-1 last message repeated 3 times
   Sep 29 13:58:01 rh53-1 last message repeated 3 times
   Sep 29 13:59:31 rh53-1 last message repeated 3 times
   Sep 29 14:00:31 rh53-1 last message repeated 2 times
   Sep 29 14:00:46 rh53-1 cib: [11189]: info: cib_stats: Processed 94 
   operations (11489.00us
  average, 0%
   utilization) in the last 10min
   Sep 29 14:01:01 rh53-1 corosync[11183]:   [pcmk  ] notice: pcmk_shutdown: 
   Still waiting for
  crmd
   (pid=11193) to terminate...
  
   (snip)
   --log--
  
  
   Possibly is the cause same as the next email?
* http://www.gossamer-threads.com/lists/linuxha/pacemaker/58127
  
   And, the same problem was taking place by the next combination.
* corosync 1.0.1
* Pacemaker-1-0-595cca870aff
* Reusable-Cluster-Components-6ef02517ee57
* Cluster-Resource-Agents-88a9cfd9e8b5
  
   I attach a file of hb_report.
  
   Best Regards,
   Hideo Yamauchi.
 
  
  -- 
  
  Remi Broemeling
  Sr System Administrator
  
  Nexopia.com Inc.
  direct: 780 444 1250 ext 435
  email: r...@nexopia.com mailto:r...@nexopia.com
  fax: 780 487 0376
  
  www.nexopia.com http://www.nexopia.com
  
  You are only young once, but you can stay immature indefinitely.
  www.siglets.com
   ___
  Pacemaker mailing list
  Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Cluster Refuses to Stop/Shutdown

2009-09-24 Thread Steven Dake
Remi,

Likely a defect.  We will have to look into it.  Please file a bug as
per instructions on the corosync wiki at www.corosync.org.

On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
 I've spent all day working on this; even going so far as to completely
 build my own set of packages from the Debian-available ones (which
 appear to be different than the Ubuntu-available ones).  It didn't
 have any effect on the issue at all: the cluster still freaks out and
 becomes a split-brain after a single SIGQUIT.
 
 The debian packages that also demonstrate this behavior were the below
 versions:
 cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
 corosync_1.0.0-5~bpo50+1_i386.deb
 libcorosync4_1.0.0-5~bpo50+1_i386.deb
 libopenais3_1.0.0-4~bpo50+1_i386.deb
 openais_1.0.0-4~bpo50+1_i386.deb
 pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
 
 These packages were re-built (under Ubuntu Hardy Heron LTS) from the
 *.diff.gz, *.dsc, and *.orig.tar.gz files available at
 http://people.debian.org/~madkiss/ha-corosync, and as I said the
 symptoms remain exactly the same, both under the configuration that I
 list below and the sample configuration that came with these packages.
 I also attempted the same with a single IP Address resource associated
 with the cluster; just to be sure it wasn't an edge case for a cluster
 with no resources; but again that had no effect.
 
 Basically I'm still exactly at the point that I was at yesterday
 morning at about 0900.
 
 Remi Broemeling wrote: 
  I posted this to the OpenAIS Mailing List
  (open...@lists.linux-foundation.org) yesterday, but haven't received
  a response and upon further reflection I think that maybe I chose
  the wrong list to post it to.  That list seems to be far less about
  user support and far more about developer communication.  Therefore
  re-trying here, as the archives show it to be somewhat more
  user-focused.
  
  The problem is that I'm having an issue with corosync refusing to
  shutdown in response to a QUIT signal.  Given the below cluster
  (output of crm_mon):
  
  
  Last updated: Wed Sep 23 15:56:24 2009
  Stack: openais
  Current DC: boot1 - partition with quorum
  Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
  2 Nodes configured, 2 expected votes
  0 Resources configured.
  
  
  Online: [ boot1 boot2 ]
  
  If I go onto the host 'boot2', and issue the command killall -QUIT
  corosync, the anticipated result would be that boot2 would go
  offline (out of the cluster), and all of the cluster processes
  (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) would shut-down.
  However, this is not occurring, and I don't really have any idea
  why.  After logging into boot2, and issuing the command killall
  -QUIT corosync, the result is a split-brain:
  
  From boot1's viewpoint:
  
  Last updated: Wed Sep 23 15:58:27 2009
  Stack: openais
  Current DC: boot1 - partition WITHOUT quorum
  Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
  2 Nodes configured, 2 expected votes
  0 Resources configured.
  
  
  Online: [ boot1 ]
  OFFLINE: [ boot2 ]
  
  From boot2's viewpoint:
  
  Last updated: Wed Sep 23 15:58:35 2009
  Stack: openais
  Current DC: boot1 - partition with quorum
  Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
  2 Nodes configured, 2 expected votes
  0 Resources configured.
  
  
  Online: [ boot1 boot2 ]
  
  At this point the status quo holds until such time as ANOTHER QUIT
  signal is sent to corosync, (i.e. the command killall -QUIT
  corosync is executed on boot2 again).  Then, boot2 shuts down
  properly and everything appears to be kosher.  Basically, what I
  expect to happen after a single QUIT signal is instead taking two
  QUIT signals to occur; and that summarizes my question: why does it
  take two QUIT signals to force corosync to actually shutdown?  Is
  that desired behavior?  From everything online that I have read it
  seems to be very strange, and it makes me think that I have a
  problem in my configuration(s), but I've no idea what that would be
  even after playing with things and investigating for the day.
  
  I would be very grateful for any guidance that could be provided, as
  at the moment I seem to be at an impasse.
  
  Log files, with debugging set to 'on', can be found at the following
  pastebin locations:
  After first QUIT signal issued on boot2:
  boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
  boot2:/var/log/syslog: http://pastebin.com/d26fdfee
  After second QUIT signal issued on boot2:
  boot1:/var/log/syslog: http://pastebin.com/m755fb989
  boot2:/var/log/syslog: http://pastebin.com/m22dcef45
  
  OS, Software Packages, and Versions:
  * two nodes, each running Ubuntu Hardy Heron LTS
  * ubuntu-ha packages, as downloaded from
  http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
  * 

Re: [Pacemaker] CentOS problem starting openais

2009-08-26 Thread Steven Dake
On Tue, 2009-08-25 at 21:55 +0200, Michael Schwartzkopff wrote:
 Hi,
 
 I just installed pacemaker from OSBS on fully patched CentOS 5.3.
 when I call aisexec -f manually everything works as expected. When Is use 
 /etc/init.d/openais start I get the following in /var/log/messages
 
 Aug 25 23:46:02 localhost openais[16897]: [MAIN ] AIS Executive Service 
 RELEASE 'subrev 1152 version 0.80'
 Aug 25 23:46:02 localhost openais[16897]: [MAIN ] Copyright (C) 2002-2006 
 MontaVista Software, Inc and contributors.
 Aug 25 23:46:02 localhost openais[16897]: [MAIN ] Copyright (C) 2006 Red Hat, 
 Inc.
 Aug 25 23:46:02 localhost openais[16897]: [MAIN ] AIS Executive Service: 
 started and ready to provide service.
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] Token Timeout (3000 ms) 
 retransmit timeout (294 ms)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] token hold (225 ms) 
 retransmits before loss (10 retrans)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] join (60 ms) send_join (0 
 ms) consensus (1500 ms) merge (200 ms)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] downcheck (1000 ms) fail to 
 recv const (50 msgs)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] seqno unchanged const (30 
 rotations) Maximum network MTU 1500
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] window size per rotation 
 (50 
 messages) maximum messages per rotation (20 messages)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] send threads (0 threads)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] RRP token expired timeout 
 (294 ms)
 Aug 25 23:46:02 localhost openais[16897]: [TOTEM] RRP token problem counter 
 (2000 ms)
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] RRP threshold (10 problem 
 count)
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] RRP mode set to none.
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] heartbeat_failures_allowed 
 (0)
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] max_network_delay (50 ms)
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] HeartBeat is Disabled. To 
 enable set heartbeat_failures_allowed > 0
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] The network interface 
 [172.19.93.1] is now up.
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] Created or loaded sequence 
 id 24.172.19.93.1 for this ring.
 Aug 25 23:46:03 localhost openais[16897]: [TOTEM] entering GATHER state from 
 15.
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] info: process_ais_conf: 
 Reading configure
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: config_find_next: 
 Processing additional logging options...
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Found 
 'off' for option: debug
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: 
 Defaulting to 'off' for option: to_file
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Found 
 'daemon' for option: syslog_facility
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: config_find_next: 
 Processing additional service options...
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: 
 Defaulting to 'no' for option: use_logd
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: get_config_opt: Found 
 'yes' for option: use_mgmtd
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] pcmk_plugin_init: Could not 
 enable /proc/sys/kernel/core_uses_pid: (22) Invalid argument
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] info: pcmk_plugin_init: 
 CRM: 
 Initialized
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] Logging: Initialized 
 pcmk_plugin_init
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] info: pcmk_plugin_init: 
 Service: 9
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] info: pcmk_plugin_init: 
 Local node id: 22877100
 Aug 25 23:46:03 localhost openais[16897]: [crm  ] info: pcmk_plugin_init: 
 Local hostname: localhost.localdomain
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: 
 Creating entry for node 22877100 born on 0
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: 
 0x9c4c6d8 Node 22877100 now known as localhost.localdomain (was: (null))
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: Node 
 localhost.localdomain now has 1 quorum votes (was 0)
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: update_member: Node 
 22877100/localhost.localdomain is now: member
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked 
 child 16903 for process stonithd
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked 
 child 16904 for process cib
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked 
 child 16905 for process lrmd
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked 
 child 16906 for process attrd
 Aug 25 23:46:03 localhost openais[16897]: [MAIN ] info: spawn_child: Forked 
 child 16907 for process pengine
 Aug 25 

Re: [Pacemaker] [PATCH] Whitetank: fix chkconfig entries for openais init script

2009-06-25 Thread Steven Dake
If no one has objections to this patch, I'll commit it in the next few days.

Regards
-steve

On Wed, 2009-06-24 at 09:04 +0200, Florian Haas wrote:
 # HG changeset patch
 # User Florian Haas florian.h...@linbit.com
 # Date 1245827047 -7200
 # Branch whitetank
 # Node ID a60db27fe11b9bfd399b847e33c5f49de3d227bc
 # Parent  e65f52176ba646c9d93c3b76e9e52df24f18d6dc
 Whitetank: fix chkconfig entries for openais init script
 
 The openais init script uses chkconfig entries which may cause OpenAIS
 to start too early or stop too late in the system startup/shutdown
 sequence. SUSE Linux doesn't care about this as it ignores chkconfig
 entries for the most part, but Red Hat systems (and potentially
 others) get bitten by this. This patch puts openais in the same spot
 in the sequence as the original heartbeat init script.
 
 diff -r e65f52176ba6 -r a60db27fe11b init/generic
 --- a/init/genericThu Feb 12 11:29:20 2009 +0100
 +++ b/init/genericWed Jun 24 09:04:07 2009 +0200
 @@ -5,7 +5,7 @@
  # Author:   Andrew Beekhof abeek...@suse.de
  # License:  Revised BSD
  #
 -# chkconfig: - 20 20
 +# chkconfig: - 75 05
  # processname:  aisexec
  # description:  OpenAIS daemon
  #
 diff -r e65f52176ba6 -r a60db27fe11b init/redhat
 --- a/init/redhat Thu Feb 12 11:29:20 2009 +0100
 +++ b/init/redhat Wed Jun 24 09:04:07 2009 +0200
 @@ -2,7 +2,7 @@
  #
  # OpenAIS daemon init script for Red Hat Linux and compatibles.
  #
 -# chkconfig: - 20 20
 +# chkconfig: - 75 05
  # processname:  aisexec
  # pidfile:  /var/run/aisexec.pid
  # description:  OpenAIS daemon
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] About a combination with OpenAIS.

2009-06-11 Thread Steven Dake
On Thu, 2009-06-11 at 12:30 +0200, Dejan Muhamedagic wrote:
 Hi Hideo-san,
 
 On Thu, Jun 11, 2009 at 03:17:08PM +0900, renayama19661...@ybb.ne.jp wrote:
  Hi,
  
  I understood the cause of the problem.
  
  An init script in WhiteTank was a problem.
  I work definitely when I use an init script for Pacemaker which Mr. Andrew 
  made.
  
  I hope a right init script to be included.
 
 Perhaps you would be better off with the versions released by
 Andrew (from the OBS). I'm not sure myself, it's just that
 openais API was moving until recently, so it may be a problem to
 match releases. Of course, if this combination works for you
 then it's fine.
 
 Thanks,
 
 Dejan
 

The openais whitetank ABI hasn't changed much in years. The init script
could be a problem, however.  Is there a special version required by SUSE
Linux?  I thought we had all the proper init scripts upstream to work
with Pacemaker and SUSE.

Thanks
-steve



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Steven Dake
On Thu, 2009-06-04 at 17:54 +0200, Lars Marowsky-Bree wrote:
 On 2009-05-26T12:50:34, Andrew Beekhof and...@beekhof.net wrote:
 
   try all the time also after failure like was done before failure.
  
   Complete Totem amateur behind the keyboard, but I'd second that. Since
   you're constantly checking the link status while it's up, why not keep
   doing so after it's gone down, to see if it has recovered?
  
  Perhaps even at a decreased (user configurable) interval/rate.
 
 I think that was actually discussed on the openais list and on IRC in
 the past and never completely explained why it wouldn't work ;-)
 
 
 

The problem with checking the link status with the current code is that
the protocol blocks I/O waiting for a response from the failed ring.
This could of course be modified to behave differently.  So the act of
failing a link is expensive and we don't want to retest that it is valid
very often.  The obvious solution to this is to redesign the protocol to
not have this constraint.  No patch has been written and I don't have
time to do such work at the present time.

Regards
-steve

 Regards,
 Lars
 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Steven Dake
On Thu, 2009-06-04 at 18:30 +0200, Lars Marowsky-Bree wrote:
 On 2009-06-04T09:23:04, Steven Dake sd...@redhat.com wrote:
 
  The problem with checking the link status with the current code is that
  the protocol blocks I/O waiting for a response from the failed ring.
  This could of course be modified to behave differently.
 
 Right, so the rechecking could possibly be a separate thread, sending an
 occasional liveness packet on the failed ring and trigger the RRP
 recovery after it has heard from other nodes on it?

Well I prefer totem to remain nonthreaded except for encrypted xmit
operations, but in general, that is the basic idea.  

 Some smarts would be needed of course to not constantly retrigger
 partially active rings (which would fail again immediately).
 
  So the act of failing a link is expensive and we dont want to retest
  that it is valid very often.
 
 Does expensive mean that it'll actually slow down the healthy
 ring(s)?
 
At the moment it blocks until the problem counter reaches the threshold
at which point the ring is declared failed and normal communication
continues.
 
 Regards,
 Lars
 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-05-25 Thread Steven Dake
On Mon, 2009-05-25 at 18:32 +0300, Juha Heinanen wrote:
 Florian Haas writes:
 
   Agree that they're hacks, but disagree with your alternative. Why should
   Pacemaker be concerned with low-level OpenAIS recovery procedures?
 
 then have the variable in OpenAIS configuration.
 

Self-healing is not as obvious or easy as it sounds.  Totem (the
protocol) has no way to determine when the admin has replaced the faulty
switch in the network.

One option I see is to periodically probe the failed ring for
liveness.  The problem with this approach is that it is hard to implement.
Another option is to internally re-enable the ring after some period of time
and hope for the best.  The problem with this approach is
that it causes performance degradation every time the failed ring is
re-enabled and restarted.

I think the first option is the best, but at the moment no one has
written patches and most people are focused on the 1.0 release...

Regards
-steve

 -- juha
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] recent stabilization changes to corosync and openais API/ABIs

2009-04-17 Thread Steven Dake
Hello,

Some people on these lists may be interested to know what is happening
with Corosync and OpenAIS ABIs as well as our schedules for 1.0.

We are currently planning the following dates for our releases:
Corosync 1.0 - May 15, 2009
OpenAIS 1.0 - June 1, 2009

We have made great progress and the code is seeing good stabilization.
Over the past few weeks, our community team has sanitized the corosync
ABI and APIs.

Specifically, the following types of changes were made to corosync (a
hypothetical illustration follows the list):
* A const qualifier was added to any parameter that was a constant
buffer or struct.
* Any size parameter was changed from int/unsigned int to size_t.
* Any buffer that is copied into by the called function gained a length
parameter of type size_t.
* All of the external headers had these changes applied, as did
coroapi.h.
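
As a purely hypothetical illustration of the first two kinds of change
(this is not an actual corosync prototype; the handle type and function
names are placeholders):

#include <stddef.h>   /* size_t */
#include <stdint.h>

typedef uint64_t example_handle_t;   /* placeholder handle type */

/* before: buffer could not be passed as const, length was a plain int */
int example_msg_send_old(example_handle_t handle, void *msg, int msg_len);

/* after: const-qualified buffer, size_t length */
int example_msg_send_new(example_handle_t handle, const void *msg, size_t msg_len);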

No significant changes were made to the openais ABI; however, because of
changes to internal APIs, we are bumping the .so major version, requiring
a recompile if you use the SA Forum APIs or any corosync APIs as
dependencies.

We are targeting a new release with these changes to be introduced into
distributions that ship this software in early May.

Regards
-steve



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker