Thanks Mike. The more I look at this, the more I think it is load related. svcs -x only shows that the LP print server is not running, which I don't think has any impact on what I'm seeing.
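To look past what svcs -x reports, something like the sketch below could list the direct dependencies of the milestone that are not online. The function name not_online_deps is made up for illustration, and it only checks the direct dependencies that svcs -d prints, not the whole dependency tree.

```shell
# Hypothetical helper (not from the thread): print the direct dependencies of
# an SMF service that are not currently in the "online" state.
# Assumes the Solaris svcs(1) command is available in PATH.
not_online_deps() {
    fmri=$1
    # -H: no header, -o state,fmri: two columns, -d: services this one depends on
    svcs -H -o state,fmri -d "$fmri" | while read state dep; do
        [ "$state" = "online" ] || printf '%s\t%s\n' "$state" "$dep"
    done
}

# Example: not_online_deps svc:/milestone/multi-user-server:default
```

An empty result would suggest that the direct dependencies are all online, which would fit the milestone itself having started.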
As for who not reporting what I would expect, I tracked that down to someone installing the GNU tools in /usr/local/bin and then setting the default path to reference those before /bin/ :-( /bin/who -r shows the zone is at run level 3.

Looking at /var/svc/log/milestone-multi-user-server:default.log I can see that some of the other services have most likely not completed before it tries to run the rc scripts. It appears that the /usr filesystem hasn't yet been mounted read/write, and the appstart script is logging an error that indicates rpc services are not completely running.

Executing legacy init script "/etc/rc3.d/S98apache".
(30)Read-only file system: httpd: could not open error log file /usr/local/apache2/logs/error_log.
Unable to open logs
Legacy init script "/etc/rc3.d/S98apache" exited with return code 0.
Executing legacy init script "/etc/rc3.d/S99appstart".
ERROR: Unable to contact any server
Legacy init script "/etc/rc3.d/S99appstart" exited with return code 0.
[ Dec 1 09:17:13 Method "start" exited with status 0 ]

We have a process in place that only starts 3 zones at one time, so we are not doing all 40 at once, but it could be that with this hardware even trying 3 at a time is too much and we may need to drop to 2.

Derek

On Thu, Dec 1, 2011 at 12:07 PM, Mike Gerdts <mike.ger...@oracle.com> wrote:
> On Thu 01 Dec 2011 at 10:39AM, Derek McEachern wrote:
> > Have a peculiar problem that I haven't seen before.
> >
> > When starting a system that has about 35 - 40 zones on it occasionally we
> > see that one of the zones doesn't come up properly. You can log into the
> > zone but none of the /etc/rc3.d scripts have been run.
> >
> > /var/adm/messages is completely empty and when running who -r to see the
> > run level it doesn't report anything.
>
> Take a look at the output of svcs -x. Most likely you have a service
> that svc:/milestone/multi-user-server:default depends on (directly or
> indirectly) that has timed out and as such is in maintenance. Because
> the dependency is not satisfied, this milestone doesn't come up so the
> rc3 scripts are not run.
>
> My guess is the timeout is because so many zones are starting at once
> that the disks are being thrashed. The resulting I/O backlog slows down
> the startup of services, which leads to timeouts, which lead to some
> services failing to even try to start.
>
> A google search and a 5 second read suggests that this link may be of
> help to adjust the timeout of services that require a longer timeout:
>
> http://www.runningunix.com/2009/01/changing-timeouts-on-smf-services/
>
> --
> Mike Gerdts
> Solaris Core OS / Zones http://blogs.oracle.com/zoneszone/
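Since the log above suggests the rc scripts ran while the filesystem was still read-only, one possible stopgap is a guard at the top of the legacy script that waits for the directory to accept writes. This is only a sketch: the function name wait_for_writable, the probe file name, and the 60 second default are all assumptions of mine, not anything taken from the existing scripts.

```shell
# Hypothetical guard (not from the thread) for the top of a legacy rc script:
# poll until a directory accepts writes, or give up after a number of tries.
wait_for_writable() {
    dir=$1
    tries=${2:-60}              # assumed default: about 60 seconds
    probe="$dir/.rc_probe.$$"   # made-up temporary file name
    while [ "$tries" -gt 0 ]; do
        # Creating a file fails while the filesystem is mounted read-only
        if ( : > "$probe" ) 2>/dev/null; then
            rm -f "$probe"
            return 0
        fi
        sleep 1
        # POSIX arithmetic; a legacy Bourne /sbin/sh would need expr here
        tries=$((tries - 1))
    done
    return 1
}
```

In S98apache this might be called as wait_for_writable /usr/local/apache2/logs 120 || exit 1 before httpd is started, so the script returns a non-zero code instead of 0 when the logs cannot be opened. It only works around the ordering problem and does nothing about the underlying load.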
_______________________________________________
zones-discuss mailing list
zones-discuss@opensolaris.org