Hi, Ray!
It is true that the daemon shuts down if the init script attempts to
stop it, yes; but it isn't actually doing what it is supposed to do. The
reason that the init script kills it is that it sends a QUIT signal,
delays, and then sends _another_ QUIT signal (which is what actually
kills it). The problem is that corosync is supposed to (by design)
exit cleanly after a single QUIT signal. What is actually happening
(at least on my system) is that corosync is freezing up/erroring on
receipt of the first QUIT signal, and then treating the second one as
an escalation and hard-killing itself. While the treatment of the
second signal seems correct (treating it as a hard kill and dying),
the treatment of the first QUIT signal is not.
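The stop sequence just described is, in effect, something like the
following sketch (the function name, timings, and retry count are my
own illustration, not taken from the actual init script):

```shell
#!/bin/sh
# Sketch of the init script's stop behaviour as described above: send
# QUIT, give the daemon a grace period to exit cleanly, and escalate
# with a second QUIT if it is still alive.
stop_with_escalation() {
    pid="$1"
    kill -QUIT "$pid" 2>/dev/null        # first QUIT: should be a clean shutdown
    for i in 1 2 3 4 5; do               # grace period: up to ~5 seconds
        kill -0 "$pid" 2>/dev/null || return 0   # gone: clean exit worked
        sleep 1
    done
    kill -QUIT "$pid" 2>/dev/null        # still alive: escalate
}
```

With a healthy daemon the first QUIT would suffice; on my system
corosync only actually dies at the escalation step.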
What actually happens is that, given two nodes (boot1/boot2) and
issuing a single QUIT signal (not using the init script, just `killall
-QUIT corosync`) on boot2:
1) boot2 corosync receives QUIT signal.
2) boot2 corosync sends TERM signal to crmd on boot2.
3) boot2 corosync/crmd apparently "lock up". The cluster is still
running on boot2 (crm_mon still shows both boot1 and boot2 as online),
but the state of the cluster is static: it no longer updates. This
means that from boot2's viewpoint, both boot1 and boot2 are still
online, and that seems to remain the case.
4) boot1 reports that it has "lost" boot2.
5) boot1 removes boot2 from the cluster as a lost/missing node.
6) From boot1's viewpoint, boot1 is online but boot2 is offline.
As you can see, this situation results in a split-brain for the
cluster: boot2 hasn't actually shut down (and to my knowledge hasn't
shut down any services), but it has been dismissed as offline by boot1.
Of course with a proper STONITH setup this would still work, as boot1
would immediately STONITH boot2 and boot2 would immediately drop
offline, but I am dealing with a case where STONITH is not (at least
not yet) enabled.
The second QUIT signal on boot2 cleans this situation up somewhat, as
corosync then treats it as a harder termination and (apparently)
actually exits. Thus stopping with the init script does kill corosync;
but it does so slowly -- and, to my mind, quite dangerously, as a
momentary split-brain is implicit in every shutdown.
After consultation with the Pacemaker mailing list as well as the
OpenAIS mailing list, this issue has been filed in Red Hat Bugzilla as
bug #525589 ( https://bugzilla.redhat.com/show_bug.cgi?id=525589 ).
I think at this point I'll be moving (backwards) to using Heartbeat for
cluster communications/membership. I don't mind applying a bit of
elbow grease to get packages back-ported to Hardy, but this simply
seems unworkable, at least until this bug is fixed.
It's really too bad, as I was looking forward to the enhanced
functionality of the new system, but I can't seem to get it to work
right!
Ray Pitmon wrote:
Hi Remi,
That's good to know. I had thought about messing around with the
startup order.
As for the shutting down: if you watch the logs, you should see that
corosync shuts down cleanly when stopping it with the init script. It
just takes 5-10 seconds longer than the script is willing to wait.
Perhaps there's a way to make the init script wait a little longer (I
haven't looked into it, as I've always been watching the logs). It has
stopped cleanly 100% of the time as long as I wait.
-Ray
Remi Broemeling wrote:
Hi Ray, thanks for the response.
I've tracked down the issue as far as some sort of conflict between the
following scripts in /etc/rcS.d:
S59corosync
S70screen-cleanup
S80bootmisc.sh
S85urandom
S90console-screen.sh
With the startup sequence given above, the problem seems to occur close
to 100% of the time (actually, I've never seen a correct startup of
corosync after boot).
I believe the problem revolves around the sockets that corosync uses
in /var/run/*; I am assuming that one of the scripts that executes
after corosync in that sequence messes up some of the /var/run/* files
that corosync needs to communicate back-and-forth with all of its
children. I've looked through the scripts (although not really
closely) and haven't been able to find the culprit.
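One way to finger the culprit might be to diff a listing of /var/run
taken right after S59corosync against one taken at the end of boot.
A rough sketch (the snapshot paths and hook points are hypothetical):

```shell
#!/bin/sh
# Record a directory listing to a file, so that two snapshots -- one
# taken just after S59corosync runs, one taken at the end of boot --
# can be diffed to see which files changed underneath corosync.
snapshot() {
    dir="$1"; out="$2"
    ls -la "$dir" > "$out"
}
# Intended use (hook points are hypothetical):
#   snapshot /var/run /var/tmp/run.early   # from a script right after S59corosync
#   snapshot /var/run /var/tmp/run.late    # from rc.local, at the end of boot
#   diff /var/tmp/run.early /var/tmp/run.late
```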
However, I have managed to "fix" (work-around) the problem by moving
S59corosync to S95corosync. i.e.,
mv /etc/rcS.d/S59corosync /etc/rcS.d/S95corosync
Since doing that on my system(s), I've rebooted 10 times, and corosync
(and all of its children) have come up correctly all ten times.
I'm moving on to another problem with the corosync init.d scripts now:
"/etc/init.d/corosync stop" seems to fail 100% of the time, and I
believe it to be related to timeouts (i.e. the init script simply isn't
giving corosync enough time to shutdown), I'll post back to this list
once I have more information on that.
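If the timeout theory holds, the fix would be to poll for exit rather
than rely on a fixed short delay. A minimal sketch (the function name
and default timeout are mine, not from the actual script):

```shell
#!/bin/sh
# Poll for a PID to exit, for up to $timeout seconds, instead of
# sleeping a fixed short interval; returns non-zero if the process is
# still running (at which point an init script might escalate).
wait_for_exit() {
    pid="$1"; timeout="${2:-30}"
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$elapsed" -ge "$timeout" ] && return 1   # gave up: still running
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0                                         # exited within the timeout
}
```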
I haven't encountered the problem where corosync fails during a manual
start -- it has only been automatic/on-boot starts that have caused
problems, and those only when it was at /etc/rcS.d/S59corosync.
Thanks.
Ray Pitmon wrote:
Hi Remi,
I have not found a solution. I thought about adding it to rc.local,
but now I'm finding that starting the thing manually doesn't always
work either (especially after a hard-shutdown by pulling the power
cable).
I've found that I have to do this on a hard-shutdown:
1. Start corosync, tail syslog and notice that the processes in
/usr/bin/heartbeat/ that start up (cib, lrmd, etc) are screaming that
they can't get going for some reason.
2. Stop corosync (/etc/init.d/corosync stop), then manually kill all
those procs since they won't quit on their own.
3. Stop corosync again, just for good measure.
4. Start corosync. Watch the logs. Sometimes it says a few things and
exits again (with no indication in the logs why it exited).
5. If it exited, start corosync one more time, and it runs great.
So... After further reading, I decided that it probably isn't wise to
start pacemaker automatically on boot-up anyway. From what I've read,
I fear I might run into a STONITH death-match.
-Ray
Remi Broemeling wrote:
Hi, Ray.
I'm in the process of playing around with the pacemaker-openais
1.0.5+hg20090813-0ubuntu2~hardy1 package on Ubuntu Hardy Heron, and I
encountered the issue that you wrote about on the
[email protected] mailing list.
There is no follow-up conversation on the list (at least none that I
can see), so I was wondering if you had found a solution to the problem
with pacemaker/corosync startup during system boot?
Thanks.
--
Remi Broemeling
Sr System Administrator
Nexopia.com Inc.
Sarcasm is just one more service we offer.