Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-12T22:52:14, Bart Coninckx bart.conin...@telenet.be wrote: Jan 12 22:20:34 xen2 pengine: [6633]: WARN: unpack_rsc_op: Processing failed op intranet1_stop_0 on xen1: unknown exec error (-2) My monitors are set to restart a resource. What makes the PE decide to fence the node in

[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

2011-01-13 Thread Lars Marowsky-Bree
Hi all, sorry for the delay in posting this. Introduction: At LPC 2010, we discussed (once more) that a key feature for pacemaker in 2011 would be improved support for multi-site clusters; by multi-site, we mean two (or more) sites with a local cluster each, and some higher level entity

[Pacemaker] Stretched cluster support

2011-01-13 Thread Valentin Vidic
On Thu, Jan 13, 2011 at 10:14:09AM +0100, Lars Marowsky-Bree wrote: Introduction: At LPC 2010, we discussed (once more) that a key feature for pacemaker in 2011 would be improved support for multi-site clusters; by multi-site, we mean two (or more) sites with a local cluster each, and some

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 09:51:16 Lars Marowsky-Bree wrote: On 2011-01-12T22:52:14, Bart Coninckx bart.conin...@telenet.be wrote: Jan 12 22:20:34 xen2 pengine: [6633]: WARN: unpack_rsc_op: Processing failed op intranet1_stop_0 on xen1: unknown exec error (-2) My monitors are set to

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T11:08:49, Bart Coninckx bart.conin...@telenet.be wrote: thx for your answer. So do I get this straight: - resource undergoes monitor operation - monitor reports failure - a restart of the resource is issued (stop and start) - stop fails - PE decides to fence the node because

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:13:42 Lars Marowsky-Bree wrote: On 2011-01-13T11:08:49, Bart Coninckx bart.conin...@telenet.be wrote: thx for your answer. So do I get this straight: - resource undergoes monitor operation - monitor reports failure - a restart of the resource is issued

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:13:42 Lars Marowsky-Bree wrote: On 2011-01-13T11:08:49, Bart Coninckx bart.conin...@telenet.be wrote: thx for your answer. So do I get this straight: - resource undergoes monitor operation - monitor reports failure - a restart of the resource is issued

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:58:03 Lars Marowsky-Bree wrote: On 2011-01-13T11:48:41, Bart Coninckx bart.conin...@telenet.be wrote: I notice that you work for Novell; this is a SLES11SP1 installation, so if the resource agent for Xen is faulty I guess you know about it? Yes, I think I'd know

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Florian Haas
On 2011-01-13 13:16, Bart Coninckx wrote: On Thursday 13 January 2011 11:58:03 Lars Marowsky-Bree wrote: On 2011-01-13T11:48:41, Bart Coninckx bart.conin...@telenet.be wrote: I notice that you work Novell, this is a SLES11SP1 installation so if the resource agent for Xen is faulty I guess you

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Michael Smith
Bart Coninckx wrote: By the way: things seem better when I change the monitor timeout to 30 seconds instead of 10 seconds. Very strange though, because the resource agent basically does an xm list --long while monitoring, which takes less than half a second in a console. I think sometimes

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T09:30:48, Michael Smith msm...@cbnco.com wrote: the resource agent basically does a xm list --long while monitoring, which takes less than half a second in a console. I think sometimes xend hangs for a while. 30 seconds should be good. There's a pending fix for this, which

[Pacemaker] [Ubuntu-ha] startup problem DLM on ubuntu lucid

2011-01-13 Thread Jake Smith
I read the thread related to this startup problem (dlm segfaults when server comes up with corosync auto starting up). I just have one follow-up question: The 3.07 package in Ubuntu-HA has not been patched for Lucid yet and there is not a backport of 3.0.12 for Lucid to fix this problem. So

Re: [Pacemaker] Node doesn't rejoin automatically after reboot

2011-01-13 Thread Bob Haxo
Tom, others, Please, what was the solution to this issue? Thanks, Bob Haxo On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote: Yes, corosync is running after the reboot. It comes up with the regular init-procedure (runlevel 3 in my case). 2010/9/6 Andrew Beekhof and...@beekhof.net: On

Re: [Pacemaker] Node doesn't rejoin automatically after reboot

2011-01-13 Thread Tom Tux
I don't know. I still have this issue (and it seems that I'm not the only one...). I'll check whether pacemaker updates are available through the zypper update channel (sles11-sp1). Regards, Tom 2011/1/13 Bob Haxo bh...@sgi.com: Tom, others, Please, what was the solution to this

Re: [Pacemaker] Node doesn't rejoin automatically after reboot

2011-01-13 Thread Bob Haxo
So, Tom ...how do you get the failed node online? I've re-installed with the same image that is running on three other nodes, but still fails. This node was quite happy for the past 3 months. As I'm testing installs, this and other nodes have been installed a significant number of times

[Pacemaker] Howto write a STONITH agent

2011-01-13 Thread Christoph Herrmann
Hi, I have some brand new HP Blades with ILO Boards (iLO 2 Standard Blade Edition 1.81 ...) But I'm not able to connect with them via the external/riloe agent. When I try: stonith -t external/riloe -p hostlist=node1 ilo_hostname=ilo1 ilo_user=ilouser ilo_password=ilopass ilo_can_reset=1

Re: [Pacemaker] Node doesn't rejoin automatically after reboot - POSSIBLE CAUSE

2011-01-13 Thread Bob Haxo
Hi Tom (and Andrew), I figured out an easy fix for the problem that I encountered. However, there would seem to be a problem lurking in the code. Here is what I found. On one of the servers that was online and hosting resources: r2lead1:~ # netstat -a | grep crm Proto RefCnt Flags Type

Re: [Pacemaker] Howto write a STONITH agent

2011-01-13 Thread Bob Haxo
Hi Christoph, Have you taken a look in /usr/lib64/stonith/plugins/external? The ipmi plugin might serve as a coding example/template. Or maybe the drac5 plugin. At first glance, drac5 appears to be using ssh. Bob Haxo On Thu, 2011-01-13 at 21:09 +0100, Christoph Herrmann wrote: Hi, I have

[Pacemaker] Help with configuring pacemaker automatically with chef

2011-01-13 Thread Todd Nine
Hi guys, I'm having a hard time finding the info I need to configure pacemaker from an input file. I've been using Zookeeper a lot in our application tier, so I'm familiar with clusters, however I'm struggling to adapt that knowledge to the pacemaker configuration. Here is an overview of our

[Pacemaker] How to delete warning information

2011-01-13 Thread jiaju liu
When I use the command crm configure property start-failure-is-fatal=FALSE, it shows: WARNING: status: operation not recognized WARNING: status: operation not recognized WARNING: status: operation not recognized WARNING: status: operation not recognized WARNING: status: operation not recognized