A split brain wouldn't be a concern if you put the node in standby before killing Corosync, correct? I suppose as a workaround, one could make the init script do that...
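
Something like the following in the init script's stop path is what I had in mind (a sketch only; it assumes Pacemaker's crm shell is on the node, that crmd is still responsive, and the 10-second grace period is an arbitrary illustration):

```shell
# Hypothetical init-script fragment: drain the node before stopping corosync.
crm node standby "$(uname -n)"   # ask Pacemaker to migrate resources away
sleep 10                         # illustrative grace period for migrations
/etc/init.d/corosync stop        # a "lost" node now holds no resources
```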

-Ray


Remi Broemeling wrote:
Hi, Ray!

It is true that the daemon shuts down if the init script attempts to stop it, yes; but it isn't actually doing what it is supposed to do. The init script kills it by sending a QUIT signal, delaying, and then sending _another_ QUIT signal (which is what actually kills it). The problem is that corosync is supposed to (by design) exit cleanly after a single QUIT signal. What is actually happening (at least on my system) is that corosync freezes up/errors on receipt of the first QUIT signal, and then treats the second one as an escalation and hard-kills itself. While the treatment of the second signal seems correct (treating it as a hard-kill and dying), the treatment of the first QUIT signal is not.

What actually happens is that, given two nodes (boot1/boot2) and issuing a single QUIT signal (not using the init script, just `killall -QUIT corosync`) on boot2:

1) boot2 corosync receives QUIT signal.
2) boot2 corosync sends TERM signal to crmd on boot2.
3) boot2 corosync/crmd apparently lock up. The cluster is still running on boot2 (crm_mon still displays output showing both boot1 and boot2 as online), but the state of the cluster is static: it no longer updates. This means that from boot2's viewpoint, both boot1 and boot2 are still online, and that seems to remain the case.
4) boot1 reports that it has "lost" boot2.
5) boot1 removes boot2 from the cluster as a lost/missing node.
6) From boot1's viewpoint, boot1 is online but boot2 is offline.

As you can see, this situation results in a split-brain for the cluster: boot2 hasn't actually shut down (and to my knowledge hasn't shut down any services), but it has been dismissed as offline by boot1. Of course, with a proper STONITH setup this would still work, as boot1 would immediately STONITH boot2 and boot2 would immediately drop offline, but I am dealing with a case where STONITH is not (at least not yet) enabled.

The second QUIT signal on boot2 cleans this situation up somewhat, as corosync then treats it as a harder termination and actually exits (apparently, anyway). Thus stopping with the init script does actually kill corosync, but it does so slowly -- and, to my mind, quite dangerously, as a momentary split-brain is implicit in each shutdown.

After consultation with the Pacemaker mailing list as well as the OpenAIS mailing list, this issue has been filed in Red Hat Bugzilla as bug #525589 ( https://bugzilla.redhat.com/show_bug.cgi?id=525589 ).

I think at this point I'll be moving (backwards) to using Heartbeat for cluster communications/membership. I don't mind applying a bit of elbow grease to get packages back-ported to Hardy, but this simply seems unworkable until this bug is fixed. It's really too bad, as I was looking forward to the enhanced functionality of the new system, but I can't seem to get it to work right!

Ray Pitmon wrote:
Hi Remi,

That's good to know. I had thought about messing around with the startup order.

As for the shutting down: if you watch the logs, you should see that corosync shuts down cleanly when stopping it with the init script. It just takes another 5-10 seconds longer than the script is willing to wait. Perhaps there's a way to make the init script wait a little longer (I haven't looked into it, as I've always been watching the logs). It has stopped cleanly 100% of the time as long as I wait.
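
A more patient stop sequence could look something like this sketch (a dummy background process stands in for corosync so the pattern is self-contained; on a real node you'd use corosync's pid, and the 10-second ceiling is an arbitrary choice based on the extra 5-10 seconds observed):

```shell
#!/bin/sh
# Demonstrate "wait longer before escalating": send one polite QUIT,
# then poll for up to 10 seconds before sending a second (escalating) QUIT.
sleep 60 &                            # stand-in for the corosync process
pid=$!

kill -QUIT "$pid"                     # polite shutdown request
waited=0
while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt 10 ]; do
    sleep 1                           # grace period, one second at a time
    waited=$((waited + 1))
done

if kill -0 "$pid" 2>/dev/null; then
    kill -QUIT "$pid" 2>/dev/null     # escalate only after the grace period
fi
echo "stopped after ${waited}s"
```

The point of the sketch is only the shape of the loop: the stock init script's fixed delay is what turns a slow-but-clean shutdown into a hard kill.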

-Ray

Remi Broemeling wrote:
Hi Ray, thanks for the response.

I've tracked down the issue as far as some sort of conflict between the following scripts in /etc/rcS.d:

S59corosync
S70screen-cleanup
S80bootmisc.sh
S85urandom
S90console-screen.sh

With the startup sequence given above, the problem seems to occur close to 100% of the time (in fact, I've never seen a correct startup of corosync after boot).

I believe the problem revolves around the sockets that corosync uses in /var/run/*, so I am assuming that one of the scripts executing after corosync in that sequence messes up some of the /var/run/* files that it needs to communicate back and forth with all of its children. I've looked through the scripts (although not really closely) and haven't been able to find the culprit.

However, I have managed to "fix" (work-around) the problem by moving S59corosync to S95corosync. i.e.,

mv /etc/rcS.d/S59corosync /etc/rcS.d/S95corosync

Since doing that on my system(s), I've rebooted 10 times, and corosync (and all of its children) have come up correctly all ten times.
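
For what it's worth, on Debian/Ubuntu the same reordering can be expressed through the rc machinery instead of a raw mv (a sketch; the priority 95 simply mirrors the rename above):

```shell
update-rc.d -f corosync remove     # drop the existing S59 link
update-rc.d corosync start 95 S .  # recreate it as S95 in /etc/rcS.d
```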

I'm moving on to another problem with the corosync init.d scripts now: "/etc/init.d/corosync stop" seems to fail 100% of the time. I believe it to be related to timeouts (i.e. the init script simply isn't giving corosync enough time to shut down); I'll post back to this list once I have more information on that.

I haven't encountered the problem where corosync fails during a manual start -- it has only been automatic/on-boot starts that have caused problems, and those only when it was at /etc/rcS.d/S59corosync.

Thanks.

Ray Pitmon wrote:
Hi Remi,

I have not found a solution. I thought about adding it to rc.local, but now I'm finding that starting the thing manually doesn't always work either (especially after a hard-shutdown by pulling the power cable).

I've found that I have to do this on a hard-shutdown:

1. Start corosync, tail syslog, and notice that the processes in /usr/bin/heartbeat/ that start up (cib, lrmd, etc.) are screaming that they can't get going for some reason.
2. Stop corosync (/etc/init.d/corosync stop), then manually kill all those procs since they won't quit on their own.
3. Stop corosync again, just for good measure.
4. Start corosync. Watch the logs. Sometimes it says a few things and exits again (with no indication in the logs why it exited).
5. Start corosync one more time, if it exited, and then it runs great.
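
That recovery sequence could be captured in one place, something like (a sketch only; the child-daemon names beyond cib/lrmd are assumptions based on Pacemaker's usual process names, and the 5-second pause is illustrative):

```shell
/etc/init.d/corosync stop
killall -9 cib lrmd crmd pengine attrd stonithd 2>/dev/null  # orphaned children
/etc/init.d/corosync stop          # again, just for good measure
/etc/init.d/corosync start
sleep 5                            # give it a moment to silently exit
pgrep corosync >/dev/null || /etc/init.d/corosync start  # retry if it went away
```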

So.. After further reading, I decided that it probably isn't wise to start pacemaker automatically on boot-up anyway. From what I've read, I fear I might run into a STONITH death-match.

-Ray

Remi Broemeling wrote:
Hi, Ray.

I'm in the process of playing around with the pacemaker-openais 1.0.5+hg20090813-0ubuntu2~hardy1 package on Ubuntu Hardy Heron, and I encountered the issue that you wrote about on the [email protected] mailing list.

There is no follow-up conversation on the list (at least none that I can see), so I was wondering if you had found a solution to the problem with pacemaker/corosync startup during system boot?

Thanks.

--

Remi Broemeling
Sr System Administrator

Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: [email protected] <mailto:[email protected]>
fax: 780 487 0376

www.nexopia.com

Sarcasm is just one more service we offer.


_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp