I got into a bit of an off-topic discussion with Rainer over in sysadmin-discuss. I think the juiciest bits are found below, but if you would like to look at the rest of the thread, it all started out as an innocent conversation about disk layouts...
http://mail.opensolaris.org/pipermail/sysadmin-discuss/2007-September/001640.html

In a nutshell, it seems as though:

1) Entries in inittab that should run in run level 3 race with things that
   pre-SMF would have completed during sysinit through /sbin/rcS.

2) It's not clear that proper dependencies are set up to be sure that
   dumpadm is done with swap before other things need/use swap.

Hopefully someone here can shed some light.

---------- Forwarded message ----------
From: Mike Gerdts <mger...@gmail.com>
Date: Sep 19, 2007 10:28 PM
Subject: Re: [sysadmin-discuss] Default OS partition layout
To: Rainer Heilke <rheilke at dragonhearth.com>
Cc: sysadmin-discuss at opensolaris.org

On 9/19/07, Rainer Heilke <rheilke at dragonhearth.com> wrote:
> Sorry, first chance I've had to get to email/post today.

No worries.

> Yes, I'm referring to this second stage.
>
> Swap isn't active yet, but the OS (and any apps running from /etc/rcx.d
> or SMF) are starting up. With a RAC'd DB, CRS (the Oracle process that
> manages RAC) comes up first, it brings up ASM, and then the DB comes up.
> If CRS starts and doesn't get its heartbeat in the 7.3 nanosecond it
> wants (because the core is still writing out), it reboots the node. If
> the core wrote to swap before getting written to /var/crash, it's now
> toast. We had to set up CRS so that it would not start automatically.
> This allowed us to get the core dumps we needed for Sun to examine our
> problem. We never even considered the idea of having the first write go
> to somewhere other than the default swap. So, I've learned something. :-)

I think that this is because CRS likes to start via /etc/inittab - but with
the default entries it shouldn't start until the system hits runlevel 3.
When does a system enter the requested runlevel? Pretty much as soon as
init finishes off the "sysinit" and any "boot" entries - or some other time?

A quick look at the code:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/init/init.c#708
. . .
 716         } else {
 717                 /*
 718                  * It's fine to boot up with state as zero, because
 719                  * startd will later tell us the real state.
 720                  */
 721                 cur_state = 0;
 722                 op_modes = BOOT_MODES;
 723
 724                 boot_init();
 725         }

boot_init is:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/init/init.c#2020
. . .
2053         } else if (cmd.c_action == M_SYSINIT) {
2054                 /*
2055                  * Execute the "sysinit" entry and wait for it to
2056                  * complete. No bookkeeping is performed on these
2057                  * entries because we avoid writing to the file system
2058                  * until after there has been an chance to check it.
2059                  */

So, when it is in boot mode it executes sysinit actions, waiting for each of
them to complete.

inittab(4) has this key entry:

smf::sysinit:/lib/svc/bin/svc.startd >/dev/msglog 2<>/dev/msglog </dev/console

Which, based upon the comment at line 718, implies that smf will signal init
to enter the appropriate runlevel. I started going through the smf code to
figure out exactly when init is signaled relative to when the various
services come online. It was not immediately clear. If it signals init too
early, the runlevel 3 stuff may start to come online before the traditional
sysinit stuff.

In Solaris 9, we had:

fs::sysinit:/sbin/rcS sysinit >/dev/msglog 2<>/dev/msglog </dev/console
...
s2:23:wait:/sbin/rc2 >/dev/msglog 2<>/dev/msglog </dev/console
s3:3:wait:/sbin/rc3 >/dev/msglog 2<>/dev/msglog </dev/console
...

If Oracle appended its entries, they would run after fs, s2, and s3.
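For reference, the entries that the CRS installer appends to /etc/inittab
look something like the following (illustrative only - the exact ids,
run-level fields, and script names vary by Oracle release and platform):

h1:3:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:3:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:3:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

Since the s2 and s3 entries use the "wait" action, init wouldn't get to
these appended lines until /sbin/rc2 and /sbin/rc3 had finished.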
This means that Oracle wouldn't start until after savecore is done.

In Solaris 10 and later we have:

...
smf::sysinit:/lib/svc/bin/svc.startd >/dev/msglog 2<>/dev/msglog </dev/console
...

If Oracle did the same append of entries, they would start as soon as
svc.startd signals init. Like I said before, and as you seem to have
observed, this happens while savecore is still running. Yikes.

The following comment hints that there may be something bad going on here:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/svc/startd/graph.c#4412

4413  * This is called when one of the major milestones changes state, or when
4414  * init is signalled and tells us it was told to change runlevel. We wait
4415  * to reach the milestone because this allows /etc/inittab entries to retain
4416  * some boot ordering: historically, entries could place themselves before/after
4417  * the running of /sbin/rcX scripts but we can no longer make the
4418  * distinction because the /sbin/rcX scripts no longer exist as punctuation
4419  * marks in /etc/inittab.

It does imply that it doesn't set the runlevel to 3 until
svc:/milestone/multi-user-server:default makes some state change (not sure
which change that is, however).

Rather interestingly, there seems to be no dependency established that
ensures that dumpadm completes before swap is enabled. This assumes that
svcstree[1] is correct.

1. http://blogs.sun.com/jkini/resource/svcstree, linked from
   http://blogs.sun.com/jkini/entry/printing_service_dependency_tree

$ ./svcstree -D dumpadm | less
1 svc:/system/dumpadm:default
2 +-->svc:/system/fmd:default

Looking at it the other way around, to see if anything before
multi-user-server requires that it has finished:

$ ./svcstree -d multi-user-server | grep dumpadm
<no output>

The really interesting thing here is that I don't see anything that would
cause savecore to deterministically run before swap is enabled, potentially
corrupting the crash dump. As such, there *seems* to be a race condition
between savecore and swapadd, which is called from multiple
/lib/svc/method/* scripts.

But, then I've gone off topic a little bit, haven't I? :)

--
Mike Gerdts
http://mgerdts.blogspot.com/
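P.S. For anyone who wants to poke at the dependency graph without grabbing
svcstree, the stock svcs(1) options walk the same data, and svccfg(1M) could
be used to experiment with adding the missing dependency by hand. A rough,
untested sketch follows; it assumes svc:/system/filesystem/local is one of
the services whose method runs swapadd, and the property group name
"dumpadm_first" is arbitrary - this is an illustration, not a claim about
the right fix.

# What dumpadm depends on, and what depends on it:
svcs -d svc:/system/dumpadm:default
svcs -D svc:/system/dumpadm:default

# Experiment: make a swapadd-calling service wait for dumpadm/savecore.
svccfg -s svc:/system/filesystem/local:default <<'EOF'
addpg dumpadm_first dependency
setprop dumpadm_first/grouping = astring: require_all
setprop dumpadm_first/restart_on = astring: none
setprop dumpadm_first/type = astring: service
setprop dumpadm_first/entities = fmri: svc:/system/dumpadm:default
EOF
svcadm refresh svc:/system/filesystem/local:default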