Re: [Cluster-devel] qdisk and fence race

2013-02-18 Thread Dietmar Maurer
> > If so, what is the suggested way to avoid that? Should I configure with
> > master_wins="1" instead?
>
> It's best to use master_wins over a heuristic. Ping as a tiebreaker seems
> good in practice, but the problem is that there are certain network failures
> where two nodes can see the thing t
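
For context, the fix being recommended is a one-attribute change to the quorumd stanza in cluster.conf. A minimal sketch, with placeholder label and timing values (not the poster's actual config):

  <!-- heuristic tiebreaker: can race when both nodes still reach the target -->
  <quorumd interval="1" tko="10" votes="1" label="myqdisk">
    <heuristic program="ping -c1 -w1 10.0.0.254" score="1" interval="2" tko="3"/>
  </quorumd>

  <!-- master_wins: only the current qdiskd master advertises the qdisk vote,
       so two split nodes cannot both claim it -->
  <quorumd interval="1" tko="10" votes="1" label="myqdisk" master_wins="1"/>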

[Cluster-devel] qdisk and fence race

2013-02-14 Thread Dietmar Maurer
The following example is copied from 'man qdisk': Is it true that this can still result in a fence race? If so, what is the suggested way to avoid that? Should I configure with master_wins="1" instead?

Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd

2012-10-10 Thread Dietmar Maurer
> > [ -f /etc/debian_version && -d /etc/default ]
>
> that doesn't scale well for debian derivatives that don't ship debian_version :)
> (see ubuntu & co..)
>
> You can't even use something like "which dpkg" since the tool is available on
> rpm based distributions... or vice versa... there is rpm
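
One way to sidestep the distro-detection problem discussed above is to skip the guessing entirely and source whichever config file actually exists; a sketch using the paths from the patch being discussed:

  # Source the first config file that exists instead of guessing the distro.
  for conf in /etc/sysconfig/checkquorum /etc/default/checkquorum; do
      if [ -f "$conf" ]; then
          . "$conf"
          break
      fi
  done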

Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd

2012-10-10 Thread Dietmar Maurer
> Anyway examples and all, setups, limitations.. all in the doc as soon as it's
> ready. Be a bit patient :)

Ok (I am just curious) - many thanks for your fast answers! - Dietmar

Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd

2012-10-10 Thread Dietmar Maurer
> On 10/10/2012 6:26 AM, Dietmar Maurer wrote:
> >> +# rpm based distros
> >> +[ -d /etc/sysconfig ] && \
> >> + [ -f /etc/sysconfig/checkquorum ] && \
> >> + . /etc/sysconfig/checkquorum
> >> +
> >> +# deb based distros

Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd

2012-10-09 Thread Dietmar Maurer
Will you add some documentation on how to use those scripts? It seems those scripts do not check if the node has joined the fence domain?

Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd

2012-10-09 Thread Dietmar Maurer
> +# rpm based distros
> +[ -d /etc/sysconfig ] && \
> + [ -f /etc/sysconfig/checkquorum ] && \
> + . /etc/sysconfig/checkquorum
> +
> +# deb based distros
> +[ ! -d /etc/sysconfig ] && \
> + [ -f /etc/default/checkquorum ] && \
> + . /etc/default/checkquorum
> +

FYI: Some RAID too

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
> The intention of that is to prevent an inquorate node/partition from killing a
> quorate group of nodes that are running normally. e.g. if a 5 node cluster is
> partitioned into 2/3 or 1/4. You don't want the 2 or 1 node group to fence
> the 3 or 4 nodes that are fine.

Sure, I understand that.
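
The quorum arithmetic behind that example, as a quick sketch (assuming the default of one vote per node):

  # Quorum is a strict majority of the expected votes.
  nodes=5
  quorum=$(( nodes / 2 + 1 ))   # = 3 for a 5-node cluster
  # A partition of 1 or 2 nodes holds fewer than 3 votes, is inquorate,
  # and therefore must not be allowed to fence the quorate 3- or 4-node side.
  echo "quorum: $quorum"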

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
> I guess you're talking about the dlm_tool ls output?

Yes.

> The "fencing" there means it is waiting for fenced to finish fencing before
> it starts dlm recovery. fenced waits for quorum.

So who actually starts fencing when the cluster is not quorate? rgmanager?
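
The commands involved in inspecting that state, for reference (both ship with the cluster suite; output formats vary between versions, so treat this as a sketch):

  # Show dlm lockspaces; a lockspace stuck behind fencing shows up here.
  dlm_tool ls
  # Show the fence domain: members, and whether fenced still has pending victims.
  fence_tool ls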

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
> Yes, it's a stateful partition merge, and I think /var/log/messages should
> have mentioned something about that. When a node is partitioned from the
> others (e.g. network disconnected), it has to be cleanly reset before it's
> allowed back. "cleanly reset" typically means rebooted. If it

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
> On Wed, Oct 03, 2012 at 09:25:08AM +0000, Dietmar Maurer wrote:
> > So the observed behavior is expected?
>
> Yes, it's a stateful partition merge, and I think /var/log/messages should
> have mention

Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
> I observe strange problems with fencing when a cluster loses quorum for a
> short time.
>
> After regaining quorum, fenced reports 'wait state   messages', and the whole
> cluster is blocked waiting for fenced.

Just found the following in fenced/cpg.c: /* This is how we deal with cpg

[Cluster-devel] fence daemon problems

2012-10-03 Thread Dietmar Maurer
I observe strange problems with fencing when a cluster loses quorum for a short time. After regaining quorum, fenced reports 'wait state messages', and the whole cluster is blocked waiting for fenced. I can reproduce that bug here easily. It always happens with the following test: Software: RHEL6.3
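
A common way to reproduce a short quorum loss like this is to drop cluster traffic on one node and then restore it; a sketch assuming corosync's default UDP ports 5404-5405 (this is not the poster's actual test, which was elided above):

  # Simulate a transient partition on one node, then heal it.
  iptables -A INPUT -p udp --dport 5404:5405 -j DROP
  sleep 30    # long enough for the totem token to expire and quorum to drop
  iptables -D INPUT -p udp --dport 5404:5405 -j DROP
  # After the merge, check whether fenced is stuck waiting for state messages:
  fence_tool ls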

Re: [Cluster-devel] checkquorum script for self fencing

2012-10-02 Thread Dietmar Maurer
> On 10/02/2012 08:07 PM, Dietmar Maurer wrote:
> > Hi Fabio,
> >
> > was there any progress on that topic?
>
> As a matter of fact, yes, we are completing the first implementation and

Re: [Cluster-devel] checkquorum script for self fencing

2012-10-02 Thread Dietmar Maurer
> On 12/21/2011 08:28 PM, Dietmar Maurer wrote:
> > I recently discovered the checkquorum script for self fencing.
> >
> > That seems to work reliably, but the remaining nodes (with quorum)
> > do not get any

Re: [Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-11 Thread Dietmar Maurer
> Can you please try the patch I just posted to the list? it works for me, but a
> couple of extra eyes won't hurt.

Ok, seems to work here. Many thanks for your help! - Dietmar

Re: [Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-11 Thread Dietmar Maurer
> Well yes, it is an error if we can't determine our nodename.
>
> The issue now is to understand why it fails for you but doesn't fail for me
> using git.

Oh, you can't reproduce the bug?

Re: [Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-11 Thread Dietmar Maurer
> >> This led directly to commit f3f4499d4ace7a3bf5fe09ce6d9f04ed6d8958f6
> >>
> >> But this is just the check you introduced. If I revert that patch,
> >> everything works as before, but I noticed that it still deletes the
> >> values from the corosync objdb after config reload - even in 3.1.8!

Re: [Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-11 Thread Dietmar Maurer
> Ok, bisect myself.
>
> This led directly to commit f3f4499d4ace7a3bf5fe09ce6d9f04ed6d8958f6
>
> But this is just the check you introduced. If I revert that patch, everything
> works as before, but I noticed that it still deletes the values from the
> corosync objdb after config reload - even i
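
For reference, the workflow used to land on that commit looks roughly like this (the good/bad revisions shown are illustrative, not taken from the thread):

  git bisect start
  git bisect bad HEAD            # current stable32 tip shows the bug
  git bisect good cluster-3.1.8  # assumed tag name for the last good release
  # rebuild and retest at each step, then mark the result:
  git bisect good                # or: git bisect bad
  git bisect reset               # return to the original branch when done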

Re: [Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-11 Thread Dietmar Maurer
> >> If you are running stable32 from git, can you please revert:
> >>
> >> commit 8975bd6341b2d94c1f89279b1b00d4360da1f5ff
> >>
> >> and see if it's still a problem?
> >
> > Yes, same problem.
> >
> > - Dietmar
>
> Ok. then please file a bugzilla. I'll need to bisect and see when the probl

Re: [Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-10 Thread Dietmar Maurer
> If you are running stable32 from git, can you please revert:
>
> commit 8975bd6341b2d94c1f89279b1b00d4360da1f5ff
>
> and see if it's still a problem?

Yes, same problem. - Dietmar

[Cluster-devel] cluster.cman.nodename vanish on config reload

2012-07-10 Thread Dietmar Maurer
I just updated from 3.1.8 to latest STABLE32. I use this cluster.conf:

# cat /etc/cluster/cluster.conf

cman service starts without problems:

# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Checking Netwo

Re: [Cluster-devel] when do I need to start cpglockd

2012-06-19 Thread Dietmar Maurer
> rgmanager.init can simply fire cpglockd.init without any check, as those > would be done properly by cpglockd.init. > > I think this should solve the issue for Debian and keep current behavior in > Fedora. Yes, that would work. - Dietmar

Re: [Cluster-devel] when do I need to start cpglockd

2012-06-19 Thread Dietmar Maurer
> As you can see, rgmanager is on, cpglockd off.

I see.

> At boot rgmanager starts fine, without cpglockd running.
> I think the problem here is the interpretation of the LSB specifications
> between different distributions. I am not going to argue which one is right or
> wrong but the key issue

Re: [Cluster-devel] when do I need to start cpglockd

2012-06-19 Thread Dietmar Maurer
> And then again, expressing an order is correct. If "Required-Start"
> behavior in Debian is different than in other distros (I can speak for
> Fedora/RHEL here), then clearly there needs to be some distro specific
> "tuning".

You simply start a daemon which is not necessary. And I guess you do t

Re: [Cluster-devel] when do I need to start cpglockd

2012-06-18 Thread Dietmar Maurer
> > Yes, and that script 'unconditionally' (always) starts cpglockd
>
> Nothing wrong with that. If you ask a daemon to start it will start :)

For me this is wrong. I have to maintain a debian package, and I do not want to start unnecessary daemons. So I simply remove that dependency. Anyways,

Re: [Cluster-devel] when do I need to start cpglockd

2012-06-18 Thread Dietmar Maurer
> > Yes, that's a bug. cpglockd will be started from the rgmanager init
> > script when RRP mode is enabled.
> >
> > Ryan
>
> Actually no, it's not a bug.
>
> cpglockd has its own init script too.

Yes, and that script 'unconditionally' (always) starts cpglockd

> The Required-Start: te
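
The dependency under dispute lives in the rgmanager init script's LSB header; a sketch of the relevant stanza (runlevels and names illustrative):

  ### BEGIN INIT INFO
  # Provides:          rgmanager
  # Required-Start:    cman cpglockd
  # Required-Stop:     cman cpglockd
  # Default-Start:     2 3 4 5
  # Default-Stop:      0 1 6
  ### END INIT INFO
  # On Debian, insserv reads Required-Start as "that service must be started
  # first", so listing cpglockd here starts it even when RRP mode is unused.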

Re: [Cluster-devel] when do I need to start cpglockd

2012-06-14 Thread Dietmar Maurer

[Cluster-devel] when do I need to start cpglockd

2012-06-13 Thread Dietmar Maurer
I just tried to upgrade my packages from STABLE31 to STABLE32, and noticed that there is a new daemon 'cpglockd'. That daemon is always started with the default init script. Is that really necessary? I just wonder why rgmanager needs 2 different locking mechanisms - dlm and cpglockd? The manual

Re: [Cluster-devel] checkquorum script for self fencing

2011-12-21 Thread Dietmar Maurer
> We are already working on a similar feature based on checkquorum, but I got
> injured on my hand and I had to delay a bit the write up for the feature (I am
> incredibly slow writing with one hand, never mind the typos ;)).
>
> The way you suggest is dangerous, so no, don't take that route.

[Cluster-devel] checkquorum script for self fencing

2011-12-21 Thread Dietmar Maurer
I recently discovered the checkquorum script for self fencing. That seems to work reliably, but the remaining nodes (with quorum) do not get any fence acknowledgement. I wonder if it would be possible to extend the checkquorum script so that it runs fence_ack_manual on the fence master after some sa
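
A rough sketch of the extension being proposed, meant to run on the quorate side; the delay, the node-name argument, and the output parsing are all hypothetical, and the exact fence_ack_manual invocation differs between cluster versions:

  victim="$1"      # name of the node that self-fenced (hypothetical argument)
  safe_delay=120   # must comfortably exceed the victim's watchdog reset time
  sleep "$safe_delay"
  # Only acknowledge if fenced still lists the node as a pending victim
  # (grepping fence_tool output like this is illustrative; formats vary).
  if fence_tool ls | grep -q "$victim"; then
      fence_ack_manual "$victim"
  fi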

[Cluster-devel] clvm, snapshots and live migration

2011-11-27 Thread Dietmar Maurer
Hi all, I started this thread on the wrong (cluster-devel) mailing list, so I am posting it to the linux-lvm list to correct that mistake:

> > We use clvm and store VM images on shared storage, and our backup tool
> > uses LVM snapshots, which seems to require that we exclusively lock the LVs.

[Cluster-devel] clvm, snapshots and live migration

2011-11-24 Thread Dietmar Maurer
Hi all, We use clvm and store VM images on shared storage, and our backup tool uses LVM snapshots, which seems to require that we exclusively lock the LVs. But live migration requires that we activate the LVs on two nodes, so this conflicts with exclusive locks. So is there a way to convert th
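
For context, the two activation modes in conflict are standard clustered-LVM flags; whether a volume can be switched between them without a full deactivation is exactly the open question, so this is only a sketch with a placeholder VG/LV name:

  # Exclusive activation on one node (what the snapshot backup needs):
  lvchange -aey vg0/vm-disk
  # Local (shared) activation, needed on both nodes during live migration:
  lvchange -aly vg0/vm-disk
  # Deactivate first when changing the lock mode:
  lvchange -an vg0/vm-disk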

Re: [Cluster-devel] use cman_tool leave remove on shutdown/restart

2011-09-26 Thread Dietmar Maurer
> So leave remove is one option that cannot be enabled by default (it is indeed > dangerous, even for some small corner cases, they can still > happen) but it is there as a tool for cluster admins when necessary (and has > to be > used carefully and knowing what is happening internally). OK, many

Re: [Cluster-devel] use cman_tool leave remove on shutdown/restart

2011-09-26 Thread Dietmar Maurer
> The leave "remove" option has to be used only when permanently removing a
> node from a cluster.
>
> The "remove" code path totally cancels the knowledge of that node from the
> cluster, including quorum recalculation.
>
> It has to be used very carefully, for example when doing some long planned
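
The two invocations being contrasted, for reference (run after the cluster services on the node have been stopped):

  # Leave and drop this node from quorum accounting - permanent removal only:
  cman_tool leave remove
  # Plain leave: the node stays counted in expected votes until it rejoins:
  cman_tool leave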

[Cluster-devel] use cman_tool leave remove on shutdown/restart

2011-09-26 Thread Dietmar Maurer
in /etc/init.d/cman: I wonder if it would be safe to use the 'remove' option for runlevels 0 and 6. Or is that considered dangerous? If so, why? - Dietmar

Re: [Cluster-devel] bug in cman/init.d/cman.in?

2011-08-31 Thread Dietmar Maurer
> > BTW, I can see the same bug in rgmanager init script > > Nope... not any more ;) Thanks for the fix, - Dietmar

Re: [Cluster-devel] bug in cman/init.d/cman.in?

2011-08-31 Thread Dietmar Maurer
> Fix is pushed in git, can you please let me know if it works for you?

Thanks, seems to work.

> A reproducer to trigger the error would be nice to have, so I can verify it
> here too.

# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Che

Re: [Cluster-devel] bug in cman/init.d/cman.in?

2011-08-29 Thread Dietmar Maurer
> Ok, unless you have strong objections I prefer to rename the variable
> completely in status() to avoid confusion and fix the issue.

Well, I have to say that I can't see the 'beauty' of shell scripts in general ;-) So please feel free to fix it as you like - I will then test the fix. - Di

Re: [Cluster-devel] bug in cman/init.d/cman.in?

2011-08-29 Thread Dietmar Maurer
> On 08/29/2011 10:44 AM, Dietmar Maurer wrote:
> > * do not overwrite global return status 'rtrn' - use local keywor

Re: [Cluster-devel] Question about /etc/init.d/cman start

2011-08-29 Thread Dietmar Maurer
> > So we have rgmanager running, but no dlm_controld and no fenced. Is it
> > expected to work that way?
>
> If dlm is not working, rgmanager will exit with error.

The startup script prints: rgmanager[XXX]: Waiting for quorum to form. But you are right, it exits after the cluster got quorum: Aug

Re: [Cluster-devel] Question about /etc/init.d/cman start

2011-08-29 Thread Dietmar Maurer
> Yes, generally the fact that quorum is not achieved within N seconds is an
> indication of something wrong in the cluster or the hw (for instance network
> issues). Users prefer to see an error at that point, rather than keep
> executing more daemons that will just make things more confusing on

[Cluster-devel] bug in cman/init.d/cman.in?

2011-08-29 Thread Dietmar Maurer
* do not overwrite global return status 'rtrn' - use local keyword

Index: new/cman/init.d/cman.in
===
--- new.orig/cman/init.d/cman.in 2010-12-02 07:19:35.0 +0100
+++ new/cman/init.d/cman.in 2010-12-23 11:32:12.
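
The bug class this patch addresses, as a minimal self-contained sketch (the function body is illustrative, not the actual init script code):

  #!/bin/sh
  rtrn=0   # script-wide exit status accumulated across start/stop steps

  status() {
      # Without 'local', the assignment below would clobber the script-wide
      # 'rtrn' and corrupt the init script's final exit code.
      local rtrn
      cman_tool status >/dev/null 2>&1
      rtrn=$?
      return $rtrn
  }

  status
  exit $rtrn   # still the global value, untouched by status()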

Re: [Cluster-devel] Question about /etc/init.d/cman start

2011-08-29 Thread Dietmar Maurer
> They are smart enough. You are misreading the comments about wait for
> quorum in cman init.

Ah, OK.

> The daemons can be safely started at boot time, even without quorum, but they
> can't do anything useful till quorum is achieved. That is why it is possible
> to override the wait for quoru

Re: [Cluster-devel] Question about /etc/init.d/cman start

2011-08-29 Thread Dietmar Maurer
> It is actually configurable via /etc/sysconfig/cman (or /etc/defaults/cman on
> debian based systems)
>
> # CMAN_QUORUM_TIMEOUT -- amount of time to wait for a quorate cluster on
> # startup. quorum is needed by many other applications, so we may as
> # well wait here. If CMAN_QUORUM_T
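
Spelled out as it would appear in the file (the timeout value is illustrative; a value below 1 is commonly used to skip the quorum wait, though the quoted comment is cut off before saying so):

  # /etc/sysconfig/cman (or /etc/default/cman on Debian-based systems)
  CMAN_QUORUM_TIMEOUT=45    # seconds to wait for a quorate cluster at startup
  #CMAN_QUORUM_TIMEOUT=0    # don't wait for quorum at all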

[Cluster-devel] Question about /etc/init.d/cman start

2011-08-28 Thread Dietmar Maurer
Hi all, the current startup script simply exit if there is no quorum, so fenced and dlm_controld are not started. Even cmannotifyd is not started, so you can't react to quorum changes with cmannotifyd. What is the suggested way to start those services after the node gets quorum? And why can't
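
One way to defer the dependent daemons until quorum forms, sketched under the assumption that cman_tool's wait command with -q behaves as its man page describes (this is not an officially recommended sequence):

  # Block until the cluster becomes quorate, then start what cman's init
  # script skipped; fenced, dlm_controld and cmannotifyd daemonize themselves.
  cman_tool wait -q
  fenced
  dlm_controld
  cmannotifyd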

Re: [Cluster-devel] Problems with [PATCH] dlm_controld: fix plock dev_write no op

2011-08-25 Thread Dietmar Maurer

Re: [Cluster-devel] Problems with [PATCH] dlm_controld: fix plock dev_write no op

2011-08-25 Thread Dietmar Maurer
> > Are there known bugs with 131.6.1.el6 (rhel6.2 is still beta)?
>
> I am not sure I understand the question... kernel is kind of big and a query
> in bugzilla.redhat.com will answer your question too :)

You suggested to use another kernel, so I thought you were aware of a specific bug relat

Re: [Cluster-devel] Problems with [PATCH] dlm_controld: fix plock dev_write no op

2011-08-25 Thread Dietmar Maurer
> >> I pushed a workaround for this, but you need to update your kernel headers.
> >
> > Well, I use headers from recent RHEL6 kernel (131.6.1.el6).
> >
> > Is that fixed in newer RHEL6 kernels or what kernel version do you suggest?
>
> You need kernel from RHEL6.2.

Are there known bugs with 131.

Re: [Cluster-devel] Problems with [PATCH] dlm_controld: fix plock dev_write no op

2011-08-24 Thread Dietmar Maurer
> On 8/24/2011 8:18 AM, Dietmar Maurer wrote:
> > Compile is broken after that patch. I get:
> >
> > /cluster-3.1.5/group/dlm_controld/plock.c:744:2: warning: #warning
> > DLM_PLOCK_FL_CLOSE undefined. Enabling build workaround.
> > /cluster-3.1.5/group/dlm

Re: [Cluster-devel] Problems with [PATCH] dlm_controld: fix plock dev_write no op

2011-08-23 Thread Dietmar Maurer
Thanks, the fix works well. - Dietmar

[Cluster-devel] Problems with [PATCH] dlm_controld: fix plock dev_write no op

2011-08-23 Thread Dietmar Maurer
Compile is broken after that patch. I get:

/cluster-3.1.5/group/dlm_controld/plock.c:744:2: warning: #warning DLM_PLOCK_FL_CLOSE undefined. Enabling build workaround.
/cluster-3.1.5/group/dlm_controld/plock.c: In function 'process_plocks':
/cluster-3.1.5/group/dlm_controld/plock.c:1595
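
A quick way to check whether the installed kernel headers are recent enough for the patch, assuming the standard exported header location:

  # No output means the headers predate DLM_PLOCK_FL_CLOSE and the build
  # falls back to the #warning workaround path.
  grep DLM_PLOCK_FL_CLOSE /usr/include/linux/dlm_plock.h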