A split brain wouldn't be a concern if you put the node in standby before killing Corosync, correct? I suppose as a workaround, one could make the init script do that...
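
Something like the following in the init script's stop path is what I had in mind (a sketch only; it assumes Pacemaker's crm shell is on the node, that crmd is still responsive, and the 10-second grace period is an arbitrary illustration):

```shell
# Hypothetical init-script fragment: drain the node before stopping corosync.
crm node standby "$(uname -n)"   # ask Pacemaker to migrate resources away
sleep 10                         # illustrative grace period for migrations
/etc/init.d/corosync stop        # a "lost" node now holds no resources
```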

-Ray


Remi Broemeling wrote:
Hi, Ray!

It is true that the daemon shuts down if the init script attempts to stop it, yes; but it isn't actually doing what it is supposed to do. The init script kills it by sending a QUIT signal, delaying, and then sending _another_ QUIT signal (which is what actually kills it). The problem is that corosync is supposed to (by design) exit cleanly after a single QUIT signal. What is actually happening (at least on my system) is that corosync freezes up/errors on receipt of the first QUIT signal, and then treats the second one as an escalation and hard-kills itself. While the treatment of the second signal seems correct (treating it as a hard-kill and dying), the treatment of the first QUIT signal is not.

What actually happens is that, given two nodes (boot1/boot2) and issuing a single QUIT signal (not using the init script, just `killall -QUIT corosync`) on boot2:

1) boot2 corosync receives QUIT signal.
2) boot2 corosync sends TERM signal to crmd on boot2.
3) boot2 corosync/crmd apparently lock up. The cluster is still running on boot2 (crm_mon still displays output showing both boot1 and boot2 as online), but the state of the cluster is static: it no longer updates. This means that from boot2's viewpoint, both boot1 and boot2 are still online, and that seems to remain the case.
4) boot1 reports that it has "lost" boot2.
5) boot1 removes boot2 from the cluster as a lost/missing node.
6) From boot1's viewpoint, boot1 is online but boot2 is offline.

As you can see, this situation results in a split-brain for the cluster: boot2 hasn't actually shut down (and to my knowledge hasn't shut down any services), but it has been dismissed as offline by boot1. Of course, with a proper STONITH setup this would still work, as boot1 would immediately STONITH boot2 and boot2 would immediately drop offline, but I am dealing with a case where STONITH is not (at least not yet) enabled.

The second QUIT signal on boot2 cleans this situation up somewhat, as corosync then treats it as a harder termination and actually exits (apparently, anyway). Thus stopping with the init script does actually kill corosync, but it does so slowly -- and, to my mind, quite dangerously, as a momentary split-brain is implicit in each shutdown.

After consultation with the Pacemaker mailing list as well as the OpenAIS mailing list, this issue has been filed in Red Hat Bugzilla as bug #525589 ( https://bugzilla.redhat.com/show_bug.cgi?id=525589 ).

I think at this point I'll be moving (backwards) to using Heartbeat for cluster communications/membership. I don't mind applying a bit of elbow grease to get packages back-ported to Hardy, but this simply seems unworkable until this bug is fixed. It's really too bad, as I was looking forward to the enhanced functionality of the new system, but I can't seem to get it to work right!

Ray Pitmon wrote:
Hi Remi,

That's good to know. I had thought about messing around with the startup order.

As for the shutting down: if you watch the logs, you should see that corosync shuts down cleanly when stopping it with the init script. It just takes another 5-10 seconds longer than the script is willing to wait. Perhaps there's a way to make the init script wait a little longer (I haven't looked into it, as I've always been watching the logs). It has stopped cleanly 100% of the time as long as I wait.
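
A more patient stop sequence could look something like this sketch (a dummy background process stands in for corosync so the pattern is self-contained; on a real node you'd use corosync's pid, and the 10-second ceiling is an arbitrary choice based on the extra 5-10 seconds observed):

```shell
#!/bin/sh
# Demonstrate "wait longer before escalating": send one polite QUIT,
# then poll for up to 10 seconds before sending a second (escalating) QUIT.
sleep 60 &                            # stand-in for the corosync process
pid=$!

kill -QUIT "$pid"                     # polite shutdown request
waited=0
while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt 10 ]; do
    sleep 1                           # grace period, one second at a time
    waited=$((waited + 1))
done

if kill -0 "$pid" 2>/dev/null; then
    kill -QUIT "$pid" 2>/dev/null     # escalate only after the grace period
fi
echo "stopped after ${waited}s"
```

The point of the sketch is only the shape of the loop: the stock init script's fixed delay is what turns a slow-but-clean shutdown into a hard kill.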

-Ray

Remi Broemeling wrote:
Hi Ray, thanks for the response.

I've tracked down the issue as far as some sort of conflict between the following scripts in /etc/rcS.d:

S59corosync
S70screen-cleanup
S80bootmisc.sh
S85urandom
S90console-screen.sh

With the startup sequence given above, the problem seems to occur close to 100% of the time (in fact, I've never seen a correct startup of corosync after boot).

I believe the problem revolves around the sockets that corosync uses in /var/run/*, so I am assuming that one of the scripts executing after corosync in that sequence messes up some of the /var/run/* files that it needs to communicate back and forth with all of its children. I've looked through the scripts (although not really closely) and haven't been able to find the culprit.

However, I have managed to "fix" (work-around) the problem by moving S59corosync to S95corosync. i.e.,

mv /etc/rcS.d/S59corosync /etc/rcS.d/S95corosync

Since doing that on my system(s), I've rebooted 10 times, and corosync (and all of its children) have come up correctly all ten times.
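
For what it's worth, on Debian/Ubuntu the same reordering can be expressed through the rc machinery instead of a raw mv (a sketch; the priority 95 simply mirrors the rename above):

```shell
update-rc.d -f corosync remove     # drop the existing S59 link
update-rc.d corosync start 95 S .  # recreate it as S95 in /etc/rcS.d
```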

I'm moving on to another problem with the corosync init.d scripts now: "/etc/init.d/corosync stop" seems to fail 100% of the time. I believe it to be related to timeouts (i.e. the init script simply isn't giving corosync enough time to shut down); I'll post back to this list once I have more information on that.

I haven't encountered the problem where corosync fails during a manual start -- it has only been automatic/on-boot starts that have caused problems, and those only when it was at /etc/rcS.d/S59corosync.

Thanks.

Ray Pitmon wrote:
Hi Remi,

I have not found a solution. I thought about adding it to rc.local, but now I'm finding that starting the thing manually doesn't always work either (especially after a hard-shutdown by pulling the power cable).

I've found that I have to do this on a hard-shutdown:

1. Start corosync, tail syslog, and notice that the processes in /usr/bin/heartbeat/ that start up (cib, lrmd, etc.) are screaming that they can't get going for some reason.
2. Stop corosync (/etc/init.d/corosync stop), then manually kill all those procs since they won't quit on their own.
3. Stop corosync again, just for good measure.
4. Start corosync. Watch the logs. Sometimes it says a few things and exits again (with no indication in the logs why it exited).
5. Start corosync one more time, if it exited, and then it runs great.
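
That recovery sequence could be captured in one place, something like (a sketch only; the child-daemon names beyond cib/lrmd are assumptions based on Pacemaker's usual process names, and the 5-second pause is illustrative):

```shell
/etc/init.d/corosync stop
killall -9 cib lrmd crmd pengine attrd stonithd 2>/dev/null  # orphaned children
/etc/init.d/corosync stop          # again, just for good measure
/etc/init.d/corosync start
sleep 5                            # give it a moment to silently exit
pgrep corosync >/dev/null || /etc/init.d/corosync start  # retry if it went away
```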

So.. After further reading, I decided that it probably isn't wise to start pacemaker automatically on boot-up anyway. From what I've read, I fear I might run into a STONITH death-match.

-Ray

Remi Broemeling wrote:
Hi, Ray.

I'm in the process of playing around with the pacemaker-openais 1.0.5+hg20090813-0ubuntu2~hardy1 package on Ubuntu Hardy Heron, and I encountered the issue that you wrote about on the [email protected] mailing list.

There is no follow-up conversation on the list (at least none that I can see), so I was wondering if you had found a solution to the problem with pacemaker/corosync startup during system boot?

Thanks.

--

Remi Broemeling
Sr System Administrator

Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: [email protected] <mailto:[email protected]>
fax: 780 487 0376

www.nexopia.com

Sarcasm is just one more service we offer.


_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp