The more I look at this, the more I think it is load related. svcs -x only
shows that the LP print server is not running, which I don't think has any
impact on what I'm seeing.
As for who not reporting what I would expect, I tracked that down to someone
installing the GNU tools in /usr/local/bin and then setting the default PATH
to reference those before /bin/ :-(
/bin/who -r shows the zone is at run level 3.
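For anyone chasing the same kind of PATH shadowing, here is a rough sketch of how to list every copy of a command on the PATH in lookup order. The helper name resolve_all is made up for illustration. The first entry returned is the one the shell would run.

```python
import os

def resolve_all(cmd, path=None):
    """Return every executable named cmd on the search path, in lookup order.

    resolve_all is a hypothetical helper, not a standard function.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH components
        candidate = os.path.join(directory, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits

if __name__ == "__main__":
    # If /usr/local/bin/who is listed before /bin/who, the GNU copy wins.
    for hit in resolve_all("who"):
        print(hit)
```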
Looking at /var/svc/log/milestone-multi-user-server:default.log I can see
that some of the other services have most likely not completed before it
tries to run the rc scripts. It appears that the /usr filesystem hasn't yet
been mounted read/write and the appstart script is logging an error that
indicates rpc services are not completely running.
Executing legacy init script "/etc/rc3.d/S98apache".
(30)Read-only file system: httpd: could not open error log file
Unable to open logs
Legacy init script "/etc/rc3.d/S98apache" exited with return code 0.
Executing legacy init script "/etc/rc3.d/S99appstart".
ERROR: Unable to contact any server
Legacy init script "/etc/rc3.d/S99appstart" exited with return code 0.
[ Dec 1 09:17:13 Method "start" exited with status 0 ]
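Since the legacy scripts exit with return code 0 even when they fail, the exit status alone won't flag a bad zone. Below is a minimal sketch, assuming the log keeps the exact line format shown above, that picks out any legacy init script that printed output between its "Executing" and "exited" lines. The function name and regexes are mine, not anything SMF provides.

```python
import re

# Patterns assume the milestone log format seen in the excerpt above.
EXEC_RE = re.compile(r'^Executing legacy init script "([^"]+)"\.$')
EXIT_RE = re.compile(
    r'^Legacy init script "([^"]+)" exited with return code (\d+)\.$'
)

def scripts_with_output(log_lines):
    """Map script path -> (return code, output lines) for scripts that
    printed anything while they ran."""
    results = {}
    current = None
    output = []
    for raw in log_lines:
        line = raw.strip()
        started = EXEC_RE.match(line)
        if started:
            current, output = started.group(1), []
            continue
        finished = EXIT_RE.match(line)
        if finished and current == finished.group(1):
            if output:
                results[current] = (int(finished.group(2)), output)
            current = None
            continue
        if current is not None and line:
            output.append(line)  # anything the script itself printed
    return results

if __name__ == "__main__":
    log = "/var/svc/log/milestone-multi-user-server:default.log"
    with open(log) as handle:
        for script, (code, lines) in scripts_with_output(handle).items():
            print(script, "rc=%d" % code, lines)
```

Run against the excerpt above, this would flag both S98apache and S99appstart even though each reported return code 0.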
We have a process in place that only starts 3 zones at one time, so we are
not doing all 40 at once. It could be that with this hardware even trying
3 at a time is too much, and we may need to drop to 2.
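One variation I'm considering is to gate each batch on the multi-user-server milestone actually coming online in the zones just booted, rather than only limiting how many start together. This is a rough sketch and not our actual startup process: the batch size, timeout and poll interval are placeholder values, and it assumes zoneadm, zlogin and svcs behave as usual from the global zone.

```python
import subprocess
import time

MILESTONE = "svc:/milestone/multi-user-server:default"

def boot_zone(zone):
    """Boot one zone from the global zone."""
    subprocess.run(["zoneadm", "-z", zone, "boot"], check=True)

def milestone_state(zone):
    """Ask SMF inside the zone for the milestone state, e.g. 'online'."""
    proc = subprocess.run(
        ["zlogin", zone, "svcs", "-H", "-o", "state", MILESTONE],
        capture_output=True, text=True,
    )
    return proc.stdout.strip()

def boot_in_batches(zones, batch_size=2, timeout=600, poll=10,
                    boot=boot_zone, state=milestone_state, sleep=time.sleep):
    """Boot zones batch_size at a time and wait for each batch to reach
    the milestone. Returns the zones that did not come online in time."""
    failed = []
    for start in range(0, len(zones), batch_size):
        batch = zones[start:start + batch_size]
        for zone in batch:
            boot(zone)
        pending = set(batch)
        waited = 0
        while pending and waited < timeout:
            pending = {z for z in pending if state(z) != "online"}
            if pending:
                sleep(poll)
                waited += poll
        failed.extend(sorted(pending))  # timed out; move on to next batch
    return failed
```

The boot, state and sleep parameters are only there so the loop can be exercised without real zones. Anything returned in the failed list would be a zone to check with svcs -x.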
On Thu, Dec 1, 2011 at 12:07 PM, Mike Gerdts <mike.ger...@oracle.com> wrote:
> On Thu 01 Dec 2011 at 10:39AM, Derek McEachern wrote:
> > Have a peculiar problem that I haven't seen before.
> > When starting a system that has about 35 - 40 zones on it occasionally we
> > see that one of the zones doesn't come up properly. You can log into the
> > zone but none of the /etc/rc3.d scripts have been run.
> > /var/adm/messages is completely empty and when running who -r to see the
> > run level it doesn't report anything.
> Take a look at the output of svcs -x. Most likely you have a service
> that svc:/milestone/multi-user-server:default depends on (directly or
> indirectly) that has timed out and as such is in maintenance. Because
> the dependency is not satisfied, this milestone doesn't come up so the
> rc3 scripts are not run.
> My guess is the timeout is because so many zones are starting at once
> that the disks are being thrashed. The resulting I/O backlog slows down
> the startup of services, which leads to timeouts, which lead to some
> services failing to even try to start.
> A google search and a 5 second read suggests that this link may be of
> help to adjust the timeout of services that require a longer timeout:
> Mike Gerdts
> Solaris Core OS / Zones http://blogs.oracle.com/zoneszone/
zones-discuss mailing list