I got into a bit of an off-topic discussion with Rainer over in
sysadmin-discuss.  I think the juiciest bits are found below, but if
you would like to look at the rest of the thread, it all started out
as an innocent conversation about disk layouts...

http://mail.opensolaris.org/pipermail/sysadmin-discuss/2007-September/001640.html

In a nutshell, it seems as though:

1) Entries in inittab that should run at run level 3 race with work
that, pre-SMF, would have completed during sysinit via /sbin/rcS.
2) It's not clear that proper dependencies are set up to ensure that
the dumpadm service is done with the dump device before other things
enable or use swap.

Hopefully someone here can shed some light.

---------- Forwarded message ----------
From: Mike Gerdts <mger...@gmail.com>
Date: Sep 19, 2007 10:28 PM
Subject: Re: [sysadmin-discuss] Default OS partition layout
To: Rainer Heilke <rheilke at dragonhearth.com>
Cc: sysadmin-discuss at opensolaris.org


On 9/19/07, Rainer Heilke <rheilke at dragonhearth.com> wrote:
> Sorry, first chance I've had to get to email/post today.

No worries.

> Yes, I'm referring to this second stage.
>
> Swap isn't active yet, but the OS (and any apps running from /etc/rcx.d
> or SMF) are starting up. With a RAC'd DB, CRS (the Oracle process that
> manages RAC) comes up first, it brings up ASM, and then the DB comes up.
> If CRS starts and doesn't get its heartbeat in the 7.3 nanoseconds it
> wants (because the core is still writing out), it reboots the node. If
> the core wrote to swap before getting written to /var/crash, it's now
> toast. We had to set up CRS so that it would not start automatically.
> This allowed us to get the core dumps we needed for Sun to examine our
> problem. We never even considered the idea of having the first write go
> to somewhere other than the default swap. So, I've learned something. :-)

I think that this is because CRS likes to start via /etc/inittab - but
with the default entries it shouldn't start until the system hits
runlevel 3.  So when does a system enter the requested runlevel?
Pretty much as soon as init finishes off the "sysinit" and any "boot"
entries, or at some other time?
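
For concreteness, a 10g-era CRS install typically appends entries along
these lines to /etc/inittab (recalled from such an install, so treat
the exact paths and fields as illustrative):

h1:3:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:3:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:3:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

The runlevel field of 3 makes these eligible to spawn the moment init
believes the system has entered runlevel 3, which is exactly the
timing in question.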

A quick look at the code:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/init/init.c#708
. . .
    716         } else {
    717                 /*
    718                  * It's fine to boot up with state as zero, because
    719                  * startd will later tell us the real state.
    720                  */
    721                 cur_state = 0;
    722                 op_modes = BOOT_MODES;
    723
    724                 boot_init();
    725         }

boot_init is:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/init/init.c#2020
. . .
   2053                 } else if (cmd.c_action == M_SYSINIT) {
   2054                         /*
   2055                          * Execute the "sysinit" entry and wait for it to
   2056                          * complete.  No bookkeeping is performed on these
   2057                          * entries because we avoid writing to the file system
   2058                          * until after there has been a chance to check it.
   2059                          */

So, when it is in boot mode it executes sysinit actions, waiting for
each of them to complete.  inittab(4) has this key entry:

smf::sysinit:/lib/svc/bin/svc.startd    >/dev/msglog 2<>/dev/msglog </dev/console

Based upon the comment at line 718, this implies that smf will signal
init to enter the appropriate runlevel.  I started going through the
smf code to figure out exactly when init is signaled relative to when
the various services come online; it was not immediately clear.  If
startd signals init too early, the runlevel 3 entries may start before
the traditional sysinit work has finished.
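
One crude after-the-fact check (stock commands only; assuming the
FMRIs below are the ones of interest) is to compare when init entered
runlevel 3 against when the relevant services came online:

$ who -r     # time at which init entered the current run level
$ svcs -o state,stime,fmri \
      svc:/milestone/multi-user-server:default \
      svc:/system/dumpadm:default

If who -r reports a time earlier than dumpadm's STIME, then
inittab-launched processes had a window in which to start before the
dumpadm service was finished.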

In Solaris 9, we had:

fs::sysinit:/sbin/rcS sysinit           >/dev/msglog 2<>/dev/msglog </dev/console
...
s2:23:wait:/sbin/rc2                    >/dev/msglog 2<>/dev/msglog </dev/console
s3:3:wait:/sbin/rc3                     >/dev/msglog 2<>/dev/msglog </dev/console
...
...

If Oracle appended its entries, they would run after fs, s2, and s3:
the s2 and s3 entries use the "wait" action, so init does not process
later inittab entries for those runlevels until /sbin/rc2 and
/sbin/rc3 have finished.  This means that Oracle wouldn't start until
after savecore is done.

In Solaris 10 and later we have:

...
smf::sysinit:/lib/svc/bin/svc.startd    >/dev/msglog 2<>/dev/msglog </dev/console
...

If Oracle appended its entries in the same way, they would start as
soon as svc.startd signals init.  Like I said before, and as it seems
you observed, this can happen while savecore is still running.  Yikes.
The following comment hints that there may be something bad going on
here:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/svc/startd/graph.c#4412

   4413  * This is called when one of the major milestones changes state, or when
   4414  * init is signalled and tells us it was told to change runlevel.  We wait
   4415  * to reach the milestone because this allows /etc/inittab entries to retain
   4416  * some boot ordering: historically, entries could place themselves before/after
   4417  * the running of /sbin/rcX scripts but we can no longer make the
   4418  * distinction because the /sbin/rcX scripts no longer exist as punctuation
   4419  * marks in /etc/inittab.

It does imply that the runlevel isn't set to 3 until
svc:/milestone/multi-user-server:default makes some state change (I'm
not sure which change that is, however).
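
For reference, svcadm(1M) documents a fixed mapping between init's run
levels and SMF milestones, so "enter runlevel 3" and "reach
multi-user-server" are intended to be two views of the same event:

  run level S  ->  svc:/milestone/single-user:default
  run level 2  ->  svc:/milestone/multi-user:default
  run level 3  ->  svc:/milestone/multi-user-server:default

and "svcadm milestone svc:/milestone/multi-user-server:default" is
documented as roughly equivalent to "init 3".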

Rather interestingly, there seems to be no dependency established that
ensures dumpadm completes before swap is enabled.  This assumes that
svcstree[1] is correct.

1. http://blogs.sun.com/jkini/resource/svcstree, linked from
http://blogs.sun.com/jkini/entry/printing_service_dependency_tree


$ ./svcstree -D dumpadm | less
1       svc:/system/dumpadm:default
2       +-->svc:/system/fmd:default

Looking at it the other way around, to see if anything on the path to
multi-user-server requires dumpadm to have finished:

$ ./svcstree -d multi-user-server | grep dumpadm
<no output>
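
The stock svcs(1) options only walk one level of the dependency graph
(hence the need for a recursive wrapper like svcstree), but they can
be used to spot-check the above:

$ svcs -d svc:/system/dumpadm:default              # what dumpadm requires
$ svcs -D svc:/system/dumpadm:default              # what requires dumpadm
$ svcs -d svc:/milestone/multi-user-server:default # direct deps only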

The really interesting thing here is that I don't see anything that
would cause savecore to deterministically run before swap is enabled,
potentially corrupting the crash dump.  As such, there *seems* to be a
race condition between savecore and swapadd, which is called from
multiple /lib/svc/method/* scripts.
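
If someone wanted to close the window for a service like CRS, one
sketch (assuming CRS were wrapped in a hypothetical svc:/site/crs
service instead of being launched from inittab) is an explicit
require_all dependency on dumpadm:

# svccfg -s svc:/site/crs
svc:/site/crs> addpg dumpadm_done dependency
svc:/site/crs> setprop dumpadm_done/grouping = astring: require_all
svc:/site/crs> setprop dumpadm_done/restart_on = astring: none
svc:/site/crs> setprop dumpadm_done/type = astring: service
svc:/site/crs> setprop dumpadm_done/entities = fmri: svc:/system/dumpadm:default
svc:/site/crs> end
# svcadm refresh svc:/site/crs

That only protects the one service, though; it does nothing about the
underlying savecore/swapadd race.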

But then I've gone off topic a little bit, haven't I?  :)

--
Mike Gerdts
http://mgerdts.blogspot.com/


