[Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
I observe strange problems with fencing when a cluster loses quorum for a short time. After regaining quorum, fenced reports 'wait state messages', and the whole cluster is blocked waiting for fenced. I can reproduce that bug here easily. It always happens with the following test: Software: RHEL6.3

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
I observe strange problems with fencing when a cluster loses quorum for a short time. After regaining quorum, fenced reports 'wait state messages', and the whole cluster is blocked waiting for fenced. Just found the following in fenced/cpg.c: /* This is how we deal with cpg's

[Cluster-devel] GFS2: Review bug traps in glops.c

2012-10-03 Thread Steven Whitehouse
Two of the bug traps here could really be warnings. The others are converted from BUG() to GLOCK_BUG_ON() since we'll most likely need to know the glock state in order to debug any issues which arise. As a result of this, __dump_glock has to be renamed and is no longer static. Signed-off-by:

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 09:25:08AM +, Dietmar Maurer wrote: So the observed behavior is expected? Yes, it's a stateful partition merge, and I think /var/log/messages should have mentioned something about that. When a node is partitioned from the others (e.g. network disconnected), it has

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
On Wed, Oct 03, 2012 at 09:25:08AM +, Dietmar Maurer wrote: So the observed behavior is expected? Yes, it's a stateful partition merge, and I think /var/log/messages should have mentioned something about that. What message

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
Yes, it's a stateful partition merge, and I think /var/log/messages should have mentioned something about that. When a node is partitioned from the others (e.g. network disconnected), it has to be cleanly reset before it's allowed back. "Cleanly reset" typically means rebooted. If it comes

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 04:12:10PM +, Dietmar Maurer wrote: Yes, it's a stateful partition merge, and I think /var/log/messages should have mentioned something about that. When a node is partitioned from the others (e.g. network disconnected), it has to be cleanly reset before it's

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
I guess you're talking about the dlm_tool ls output? Yes. The fencing there means it is waiting for fenced to finish fencing before it starts dlm recovery. fenced waits for quorum. So who actually starts fencing when cluster is not quorate? rgmanager?

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 04:26:35PM +, Dietmar Maurer wrote: I guess you're talking about the dlm_tool ls output? Yes. The fencing there means it is waiting for fenced to finish fencing before it starts dlm recovery. fenced waits for quorum. So who actually starts fencing

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
The intention of that is to prevent an inquorate node/partition from killing a quorate group of nodes that are running normally. e.g. if a 5 node cluster is partitioned into 2/3 or 1/4. You don't want the 2 or 1 node group to fence the 3 or 4 nodes that are fine. sure, I understand that.

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 04:55:55PM +, Dietmar Maurer wrote: The difficult cases, which I think you're seeing, are partitions where no group has quorum, e.g. 2/2. In this case we do nothing, and the user has to resolve it by resetting some of the nodes. The problem with that is that
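
The partition cases discussed in this thread (2/3, 1/4 and 2/2) can be illustrated with a small sketch. It is not taken from fenced; it assumes one vote per node and a strict-majority quorum rule, and the function names has_quorum and fencing_allowed are hypothetical, chosen only for this illustration.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Assumed rule: a partition is quorate only with a strict majority of votes."""
    return 2 * partition_votes > total_votes


def fencing_allowed(partition_sizes: list[int]) -> list[bool]:
    """For each partition, report whether it would be allowed to fence the others.

    Assumption: one vote per node, so the total is the sum of partition sizes.
    """
    total = sum(partition_sizes)
    return [has_quorum(size, total) for size in partition_sizes]


if __name__ == "__main__":
    # The splits mentioned in the thread: 5 nodes as 2/3 or 1/4, 4 nodes as 2/2.
    for split in ([2, 3], [1, 4], [2, 2]):
        print(split, fencing_allowed(split))
```

Under these assumptions, only the larger group in a 2/3 or 1/4 split may fence, and in a 2/2 split neither group may, which matches the "we do nothing" case described above.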