Re: [Linux-HA] Backing out of HA

2013-08-22 Thread Ferenc Wagner
Lars Marowsky-Bree l...@suse.com writes:

 Poisoned resources indeed should just fail to start and that should be
 that. What instead can happen is that the resource agent notices it
 can't start, reports back to the cluster, and the cluster manager goes
 "Oh no, I couldn't start the resource successfully! It's now possibly in
 a weird state and I better stop it!"

 ... And because of the misconfiguration, the *stop* also fails, and
 you're hit with the full power of node-level recovery.

 I think this is an issue with some resource agents (if the parameters
 are so bad that the resource couldn't possibly have started, why fail
 the stop?) and possibly also something where one could contemplate a
 better on-fail= default for stop in response to first-start failure.

Check out http://www.linux-ha.org/doc/dev-guides/_execution_block.html,
especially the comment "anything other than meta-data and usage must
pass validation".  So if the start action fails with some validation
error, the stop action will as well.  Is this good practice after all?
Or is OCF_ERR_GENERIC treated differently from the other errors in this
regard and thus the validate action should never return OCF_ERR_GENERIC?
-- 
Thanks,
Feri.


Re: [Linux-HA] Backing out of HA

2013-07-03 Thread Dejan Muhamedagic
Hi,

On Tue, Jul 02, 2013 at 10:36:37PM +0200, Arnold Krille wrote:
 Just a small comment, maybe of benefit to others...
 
 On Mon, 01 Jul 2013 16:31:13 -0400 William Seligman
 selig...@nevis.columbia.edu wrote:
  Poisoned resource
  
  This is the one you can directly attribute to my stupidity.
  
  I add a new resource to the pacemaker configuration. Even though the
  pacemaker configuration is syntactically correct, and even though I
  think I've tested it, in fact the resource cannot run on either node.
  
  The most recent example: I created a new virtual domain and tested
  it. It worked fine. I created the ocf:heartbeat:VirtualDomain
  resource, verified that crm could parse it, and activated the
  configuration. However, I had not actually created the domain for the
  virtual machine; I had typed "virsh create ..." but not
  "virsh define ...".
  
  So I had a resource that could not run. What I'd want to happen is
  for the poisoned resource to fail, for me to see lots of error
  messages, but for the remaining resources to continue to run.
  
  What actually happens is that the resource tries to run on both nodes
  alternately a seemingly infinite number of times (1,000,000 times, or whatever
  the value is). Then one of the nodes stoniths the other. The poisoned
  resource still won't run on the remaining node, so that node tries
  restarting all the other resources in the pacemaker configuration.
  That still won't work.
 
 This is the reason why I _always_ create new resources with
 'is-managed=false' and see what happens. Pacemaker then runs a
 monitoring action without doing anything about the results. Very nice
 to see if the resource is workable for pacemaker without killing the
 cluster. If all works (and the normal working day is over) I activate
 all the resources that are not yet managed...

New resources can also be tested on all nodes by crm configure
rsctest. The test is completely driven by crmsh and won't disturb
the cluster in any way.
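
For example (names are made up; see "crm help rsctest" for the exact syntax):

    # try the resource on the given nodes, outside the cluster's control,
    # and report whether it would work
    crm configure rsctest p_vm_guest1 node-a node-b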

Cheers,

Dejan

 Have fun,
 
 Arnold





Re: [Linux-HA] Backing out of HA

2013-07-02 Thread Lars Marowsky-Bree
On 2013-07-01T16:31:13, William Seligman selig...@nevis.columbia.edu wrote:

 a) people can exclaim "You fool!" and point out all the stupid things I did
 wrong;
 
 b) sysadmins who are contemplating the switch to HA have additional points to
 add to the pros and cons.

I think you bring up an important point that I also try to stress when I
talk to customers: HA is not for everyone, since it's not a magic
bullet. HA environments can protect against hardware faults (and some
operator issues and failing software), but they need to be carefully
managed and designed. They don't come for free, and the complexity can
be a deterrent.

(While "complex, but not complicated" is a good goal, it's not
easily achieved.)

And it's also why recovery plans should always include a Plan C: how do I
bring my services online manually, without a cluster stack?
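
(For a DRBD+VM setup like the one described here, Plan C can be as short as
a handful of commands run by hand -- the resource, VG and domain names below
are only examples:)

    # stop the cluster stack so it cannot interfere
    service pacemaker stop && service cman stop

    # bring the storage up manually
    drbdadm up admin && drbdadm primary admin
    vgchange -ay vg_admin    # a clustered VG may first need clvmd running,
                             # or its cluster flag cleared

    # and finally the service itself
    virsh start guest1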

 I'll mention this one first because it's the most recent, and it was the straw
 that broke the camel's back as far as the users were concerned.
 
 Last week, cman crashed, and the cluster stopped working. There was no clear
 message in the logs indicating why. I had no time for archeology, since the
 crash happened in the middle of our working day; I rebooted everything and cman
 started up again just fine.

Such stuff happens; without archaeology, we're unlikely to be able to
fix it ;-) I take it, though, that you're not running the latest supported
versions and don't have good support contracts; those are really
important.

And this is, of course, why we strive to produce support tools that allow
first-failure data capture - so we can get a full overview of the log files
and what triggered whatever problem the system encountered, without
needing to reproduce it.  (crm_report/hb_report, qb_blackbox, etc.)
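
For example, run right after an incident (exact options differ a bit between
versions; see crm_report --help; the time window here is made up):

    # collect logs, the CIB and other state from all cluster nodes
    # for the given time window into a single tarball for later analysis
    crm_report -f "2013-06-25 12:00" -t "2013-06-25 14:00" /tmp/cman-crash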

 Problems under heavy server load:
 
 Let's call the two nodes on the cluster A and B. Node A starts running a 
 process
 that does heavy disk writes to the shared DRBD volume. The load on A starts to
 rise. The load on B rises too, more slowly, because the same blocks must be
 written to node B's disk.
 
 Eventually the load on A grows so great that cman+clvmd+pacemaker does not
 respond promptly, and node B stoniths node A. The problem is that the DRBD
 partition on node B is marked Inconsistent. All the other resources in the
 pacemaker configuration depend on DRBD, so none of them are allowed to run.

This shouldn't happen. The cluster stack is supposed to be isolated from
the mere workload via realtime scheduling/IO priority and locking
itself into memory. Or you had too short timeouts for the monitoring
services.
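
As an illustration only -- the values and the resource name are made up, and
what is appropriate depends entirely on the workload; the other knob to
review is the totem token timeout in cluster.conf:

    # give resource operations more headroom before they are treated as hung
    crm configure op_defaults timeout=120s
    # then review the monitor/start/stop timeouts on the DRBD resource itself
    crm configure show ms_drbd_admin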

(I have noticed a recent push to drop the SCHED_RR priority from
processes, because that seemingly makes some problems go away. But
personally, I think that just masks a priority-inversion issue somewhere
in the messaging layers and isn't a proper fix; instead, it exposes us to
more situations like the one described here.)

But even that shouldn't lead to stonith directly, but to resources being
stopped. Only if that fails would it then cause a fence.

And - a fence also shouldn't make DRBD inconsistent like this. Is your
DRBD set up properly?
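
(Worth double-checking: the fencing policy and the Pacemaker fence-peer
handlers in the DRBD resource definition -- roughly like the sketch below;
the handler paths are the ones usually shipped with the drbd packages, but
verify them against your version:)

    resource admin {
      disk {
        # resource-and-stonith is the usual policy for dual-primary setups
        fencing resource-and-stonith;
      }
      handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      # ... rest of the resource definition ...
    }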

 Poisoned resource
 
 This is the one you can directly attribute to my stupidity.
 
 I add a new resource to the pacemaker configuration. Even though the pacemaker
 configuration is syntactically correct, and even though I think I've tested 
 it,
 in fact the resource cannot run on either node.
 
 The most recent example: I created a new virtual domain and tested it. It 
 worked
 fine. I created the ocf:heartbeat:VirtualDomain resource, verified that crm
 could parse it, and activated the configuration. However, I had not actually
 created the domain for the virtual machine; I had typed "virsh create ..." but
 not "virsh define ...".
 
 So I had a resource that could not run. What I'd want to happen is for the
 poisoned resource to fail, for me to see lots of error messages, but for the
 remaining resources to continue to run.
 
 What actually happens is that the resource tries to run on both nodes
 alternately a seemingly infinite number of times (1,000,000 times, or
 whatever the value is). Then one of
 the nodes stoniths the other. The poisoned resource still won't run on the
 remaining node, so that node tries restarting all the other resources in the
 pacemaker configuration. That still won't work.

Yeah, so, you describe a real problem here.

Poisoned resources indeed should just fail to start and that should be
that. What instead can happen is that the resource agent notices it
can't start, reports back to the cluster, and the cluster manager goes
"Oh no, I couldn't start the resource successfully! It's now possibly in
a weird state and I better stop it!"

... And because of the misconfiguration, the *stop* also fails, and
you're hit with the full power of node-level recovery.

I think this is an issue with some resource agents (if the parameters
are so bad that the resource couldn't possibly have started, why fail
the stop?) and possibly also something where one could contemplate a
better on-fail= default for stop in response to first-start failure.

Re: [Linux-HA] Backing out of HA

2013-07-02 Thread Arnold Krille
Just a small comment, maybe of benefit to others...

On Mon, 01 Jul 2013 16:31:13 -0400 William Seligman
selig...@nevis.columbia.edu wrote:
 Poisoned resource
 
 This is the one you can directly attribute to my stupidity.
 
 I add a new resource to the pacemaker configuration. Even though the
 pacemaker configuration is syntactically correct, and even though I
 think I've tested it, in fact the resource cannot run on either node.
 
 The most recent example: I created a new virtual domain and tested
 it. It worked fine. I created the ocf:heartbeat:VirtualDomain
 resource, verified that crm could parse it, and activated the
 configuration. However, I had not actually created the domain for the
 virtual machine; I had typed "virsh create ..." but not
 "virsh define ...".
 
 So I had a resource that could not run. What I'd want to happen is
 for the poisoned resource to fail, for me to see lots of error
 messages, but for the remaining resources to continue to run.
 
 What actually happens is that the resource tries to run on both nodes
 alternately a seemingly infinite number of times (1,000,000 times, or whatever
 the value is). Then one of the nodes stoniths the other. The poisoned
 resource still won't run on the remaining node, so that node tries
 restarting all the other resources in the pacemaker configuration.
 That still won't work.

This is the reason why I _always_ create new resources with
'is-managed=false' and see what happens. Pacemaker then runs a
monitoring action without doing anything about the results. Very nice
to see if the resource is workable for pacemaker without killing the
cluster. If all works (and the normal working day is over) I activate
all the resources that are not yet managed...
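
For instance (resource name and parameters are only an example):

    # create the resource unmanaged: pacemaker monitors and reports on it,
    # but takes no recovery action if it misbehaves
    crm configure primitive p_vm_guest1 ocf:heartbeat:VirtualDomain \
            params config=/etc/libvirt/qemu/guest1.xml \
            op monitor interval=30s timeout=60s \
            meta is-managed=false

    # once it has proven workable, hand it over to the cluster
    crm resource manage p_vm_guest1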

Have fun,

Arnold



[Linux-HA] Backing out of HA

2013-07-01 Thread William Seligman
I'm about to write a transition plan for getting rid of high-availability on our
lab's cluster. Before I do that, I thought I'd put my reasons before this group
so that:

a) people can exclaim "You fool!" and point out all the stupid things I did
wrong;

b) sysadmins who are contemplating the switch to HA have additional points to
add to the pros and cons.

The basic reason why we want to back out of HA is that, in the three years since
I've implemented an HA cluster at the lab, we have had not a single hardware
problem for which HA would have been useful. However, we've had many instances
of lab-wide downtime due to the HA configuration.

Description: two-node cluster, Scientific Linux 6.2 (=RHEL6.2),
cman+clvmd+pacemaker, dedicated Ethernet ports for DRBD traffic. I've had both
primary/secondary and dual-primary configurations. The resources pacemaker
manages include DRBD and VMs with configuration files and virtual disks on the
DRBD partition. Detailed package versions and configurations are at the end of
this post.

Here are some examples of our difficulties. This is not an exhaustive list.

Mystery crashes

I'll mention this one first because it's the most recent, and it was the straw
that broke the camel's back as far as the users were concerned.

Last week, cman crashed, and the cluster stopped working. There was no clear
message in the logs indicating why. I had no time for archeology, since the
crash happened in the middle of our working day; I rebooted everything and cman
started up again just fine.

Problems under heavy server load:

Let's call the two nodes on the cluster A and B. Node A starts running a process
that does heavy disk writes to the shared DRBD volume. The load on A starts to
rise. The load on B rises too, more slowly, because the same blocks must be
written to node B's disk.

Eventually the load on A grows so great that cman+clvmd+pacemaker does not
respond promptly, and node B stoniths node A. The problem is that the DRBD
partition on node B is marked Inconsistent. All the other resources in the
pacemaker configuration depend on DRBD, so none of them are allowed to run.

The cluster stays in this non-working state (node A powered off, node B not
running any resources) until I manually intervene.

Poisoned resource

This is the one you can directly attribute to my stupidity.

I add a new resource to the pacemaker configuration. Even though the pacemaker
configuration is syntactically correct, and even though I think I've tested it,
in fact the resource cannot run on either node.

The most recent example: I created a new virtual domain and tested it. It worked
fine. I created the ocf:heartbeat:VirtualDomain resource, verified that crm
could parse it, and activated the configuration. However, I had not actually
created the domain for the virtual machine; I had typed "virsh create ..." but
not "virsh define ...".

So I had a resource that could not run. What I'd want to happen is for the
poisoned resource to fail, for me to see lots of error messages, but for the
remaining resources to continue to run.

What actually happens is that the resource tries to run on both nodes alternately
a seemingly infinite number of times (1,000,000 times, or whatever the value is).
Then one of
the nodes stoniths the other. The poisoned resource still won't run on the
remaining node, so that node tries restarting all the other resources in the
pacemaker configuration. That still won't work.

By this time, usually one of the other resources has failed (possibly because
it's not designed to be restarted so frequently), and the cluster is in a
non-working state until I manually intervene.

In this particular case, had we not been running HA, the only problem would have
been that the incorrectly-initialized domain would not have come up after a
system reboot. With HA, my error crashed the cluster.


Let me be clear: I do not claim that HA is without value. My only point is that
for our particular combination of hardware, software, and available sysadmin
support (me), high-availability has not been a good investment.

I also acknowledge that I haven't provided logs for these problems to
corroborate any of the statements I've made. I'm sharing the problems I've had,
but at this point I'm not asking for fixes.


Turgid details:

# rpm -q kernel drbd pacemaker cman \
   lvm2 lvm2-cluster resource-agents
kernel-2.6.32-220.4.1.el6.x86_64
drbd-8.4.1-1.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
lvm2-2.02.87-7.el6.x86_64
lvm2-cluster-2.02.87-7.el6.x86_64
resource-agents-3.9.2-7.el6.x86_64

/etc/cluster/cluster.conf: http://pastebin.com/qRAxLpkx
/etc/lvm/lvm.conf: http://pastebin.com/tLyZd09i
/etc/drbd.d/global_common.conf: http://pastebin.com/H8Kfi2tM
/etc/drbd.d/admin.res: http://pastebin.com/1GWupJz8
output of crm configure show: http://pastebin.com/wJaX3Msn
output of crm configure show xml: http://pastebin.com/gyUUb2hi
-- 
William Seligman  | Phone: (914) 591-2823
Nevis Labs, Columbia Univ |
PO Box 137  

Re: [Linux-HA] Backing out of HA

2013-07-01 Thread Andrew Beekhof
I know you're not looking for fixes, but there are a couple of points I would 
make:

On 02/07/2013, at 6:31 AM, William Seligman selig...@nevis.columbia.edu wrote:

 I'm about to write a transition plan for getting rid of high-availability on 
 our
 lab's cluster. Before I do that, I thought I'd put my reasons before this 
 group
 so that:
 
 a) people can exclaim "You fool!" and point out all the stupid things I did
 wrong;
 
 b) sysadmins who are contemplating the switch to HA have additional points to
 add to the pros and cons.
 
 The basic reason why we want to back out of HA is that, in the three years 
 since
 I've implemented an HA cluster at the lab, we have had not a single hardware
 problem for which HA would have been useful. However, we've had many instances
 of lab-wide downtime due to the HA configuration.

Two of the three cases you list in support of this are the HA software being 
unable to mask an external problem, rather than being the root cause itself.

Even cman crashing should have been easily recoverable without human 
intervention.

 
 Description: two-node cluster, Scientific Linux 6.2 (=RHEL6.2),
 cman+clvmd+pacemaker, dedicated Ethernet ports for DRBD traffic. I've had both
 primary/secondary and dual-primary configurations. The resources pacemaker
 manages include DRBD and VMs with configuration files and virtual disks on the
 DRBD partition. Detailed package versions and configurations are at the end of
 this post.
 
 Here are some examples of our difficulties. This is not an exhaustive list.
 
 Mystery crashes
 
 I'll mention this one first because it's the most recent, and it was the straw
 that broke the camel's back as far as the users were concerned.
 
 Last week, cman crashed, and the cluster stopped working. There was no clear
 message in the logs indicating why. I had no time for archeology, since the
 crash happened in the middle of our working day; I rebooted everything and cman
 started up again just fine.
 
 Problems under heavy server load:
 
 Let's call the two nodes on the cluster A and B. Node A starts running a 
 process
 that does heavy disk writes to the shared DRBD volume. The load on A starts to
 rise. The load on B rises too, more slowly, because the same blocks must be
 written to node B's disk.
 
 Eventually the load on A grows so great that cman+clvmd+pacemaker does not
 respond promptly, and node B stoniths node A. The problem is that the DRBD
 partition on node B is marked Inconsistent. All the other resources in the
 pacemaker configuration depend on DRBD, so none of them are allowed to run.
 
 The cluster stays in this non-working state (node A powered off, node B not
 running any resources) until I manually intervene.
 
 Poisoned resource
 
 This is the one you can directly attribute to my stupidity.
 
 I add a new resource to the pacemaker configuration. Even though the pacemaker
 configuration is syntactically correct, and even though I think I've tested 
 it,
 in fact the resource cannot run on either node.
 
 The most recent example: I created a new virtual domain and tested it. It 
 worked
 fine. I created the ocf:heartbeat:VirtualDomain resource, verified that crm
 could parse it, and activated the configuration. However, I had not actually
 created the domain for the virtual machine; I had typed "virsh create ..." but
 not "virsh define ...".
 
 So I had a resource that could not run. What I'd want to happen is for the
 poisoned resource to fail, for me to see lots of error messages, but for the
 remaining resources to continue to run.
 
 What actually happens is that the resource tries to run on both nodes
 alternately a seemingly infinite number of times (1,000,000 times, or
 whatever the value is). Then one of
 the nodes stoniths the other. The poisoned resource still won't run on the
 remaining node, so that node tries restarting all the other resources in the
 pacemaker configuration. That still won't work.
 
 By this time, usually one of the other resources has failed (possibly because
 it's not designed to be restarted so frequently), and the cluster is in a
 non-working state until I manually intervene.
 
 In this particular case, had we not been running HA, the only problem would 
 have
 been that the incorrectly-initialized domain would not have come up after a
 system reboot. With HA, my error crashed the cluster.

I guess you could have set on-fail=block, which would have inhibited the
recovery process.
But HA isn't a magic wand that can make buggy apps less buggy or improve an 
admin's memory.

And if you _really_ don't trust yourself, put the rest of the cluster into 
maintenance-mode so that we won't do anything to the existing resources.
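
Roughly like this (resource name and values are only an example):

    # block (and wait for the admin) instead of fencing when a stop fails
    crm configure primitive p_vm_guest1 ocf:heartbeat:VirtualDomain \
            params config=/etc/libvirt/qemu/guest1.xml \
            op stop timeout=120s on-fail=block

    # or freeze the whole cluster while reworking the configuration
    crm configure property maintenance-mode=true
    # ... add and test the new resource ...
    crm configure property maintenance-mode=false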

 
 
 Let me be clear: I do not claim that HA is without value. My only point is 
 that
 for our particular combination of hardware, software, and available sysadmin
 support (me), high-availability has not been a good investment.
 
 I also acknowledge that I haven't provided logs for these problems to
 corroborate any of the