Re: [zones-discuss] Zone Not Starting Properly?

2011-12-14 Thread Derek McEachern
I actually know that..

rpcinfo -p remote_host

 The script is trying to mount an NFS share based on a configured list of
remote NFS filers. In theory the client only has access to one filer, and it
determines which one by using the rpcinfo command.
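
A minimal sketch of the kind of filer probing described here, not the actual script. The filer names and the mount paths in the usage comment are hypothetical, and a POSIX shell (for example /usr/xpg4/bin/sh or ksh) is assumed:

```shell
#!/bin/sh
# Hypothetical filer list; the real list comes from the script's configuration.
FILERS="${FILERS:-filer1 filer2 filer3}"

# Print the first filer whose portmapper answers "rpcinfo -p".
# Returns 1 (and prints an error) if no filer responds.
find_filer() {
    for host in $FILERS; do
        if rpcinfo -p "$host" >/dev/null 2>&1; then
            echo "$host"
            return 0
        fi
    done
    echo "ERROR: Unable to contact any server" >&2
    return 1
}

# Example use (export path and mount point are placeholders):
#   filer=$(find_filer) && mount -F nfs "$filer:/export/app" /mnt/app
```

If rpcbind inside the zone is not fully up when this runs, every rpcinfo call can fail even though the filer is reachable, which would match the intermittent failure at boot.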

Derek

On Tue, Dec 13, 2011 at 6:55 PM, Edward Pilatowicz 
edward.pilatow...@oracle.com wrote:

 On Tue, Dec 13, 2011 at 09:44:23AM -0600, Derek McEachern wrote:
  Thought I would just send an update on this. Thanks for all the
  suggestions.
 
  To get around our particular issue I just added some retry logic to the
  /etc/init.d/ script. When it runs, if it finds that the operation has
  failed it pauses for a second and will try again. It will try up to three
  times before giving up.
 

 it'd be interesting to know what particular operation is failing within
 the script...

 ed

___
zones-discuss mailing list
zones-discuss@opensolaris.org

Re: [zones-discuss] Zone Not Starting Properly?

2011-12-13 Thread Derek McEachern
Thought I would just send an update on this. Thanks for all the
suggestions.

To get around our particular issue I just added some retry logic to the
/etc/init.d/ script. When it runs, if it finds that the operation has failed
it pauses for a second and will try again. It will try up to three times
before giving up.

Running more tests we were able to see that on some occasions it still
fails on the first attempt but so far has always been successful on the 2nd.
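
The retry logic described above could look roughly like this. It is a sketch only: the function name `retry` and the wrapped command `mount_app_share` are hypothetical, and a POSIX shell is assumed.

```shell
#!/bin/sh
# retry MAX DELAY COMMAND [ARGS...]
# Run COMMAND up to MAX times, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example matching the description (three tries, one second apart):
#   retry 3 1 mount_app_share
```

A retry hides the race rather than removing it, so it is worth logging which attempt succeeded to see how often the first one fails.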

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-13 Thread Edward Pilatowicz
On Tue, Dec 13, 2011 at 09:44:23AM -0600, Derek McEachern wrote:
 Thought I would just send an update on this. Thanks for all the
 suggestions.

 To get around our particular issue I just added some retry logic to the
 /etc/init.d/ script. When it runs, if it finds that the operation has failed
 it pauses for a second and will try again. It will try up to three times
 before giving up.


it'd be interesting to know what particular operation is failing within
the script...

ed


[zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
Have a peculiar problem that I haven't seen before.

When starting a system that has about 35 - 40 zones on it occasionally we
see that one of the zones doesn't come up properly. You can log into the
zone but none of the /etc/rc3.d scripts have been run.

/var/adm/messages is completely empty and when running who -r to see the
run level it doesn't report anything.

# who -r
run-level Dec 1 09:17 last=

Anyone else seen anything similar? We are running Solaris 10 update 9.

Regards,
Derek

Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

For 30-40 zones:
How much RAM does the main host have? What kind of CPU, and how many CPUs?
Is everything on ZFS? What storage/HDD is used for the zone roots?
Regards




--
Hung-Sheng Tsao Ph D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
System has 72GB RAM
xeon cpu - 2 socket - 4 core - 16 thread

zoneroot is on a UFS filesystem on its own drive, separate from the OS.

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Mike Gerdts
On Thu 01 Dec 2011 at 10:39AM, Derek McEachern wrote:
 Have a peculiar problem that I haven't seen before.
 
 When starting a system that has about 35 - 40 zones on it occasionally we
 see that one of the zones doesn't come up properly. You can log into the
 zone but none of the /etc/rc3.d scripts have been run.
 
 /var/adm/messages is completely empty and when running who -r to see the
 run level it doesn't report anything.

Take a look at the output of svcs -x.  Most likely you have a service
that svc:/milestone/multi-user-server:default depends on (directly or
indirectly) that has timed out and as such is in maintenance.  Because
the dependency is not satisfied, this milestone doesn't come up so the
rc3 scripts are not run.

My guess is the timeout is because so many zones are starting at once
that the disks are being thrashed.  The resulting I/O backlog slows down
the startup of services, which leads to timeouts, which lead to some
services failing to even try to start.

A google search and a 5 second read suggests that this link may be of
help to adjust the timeout of services that require a longer timeout:

http://www.runningunix.com/2009/01/changing-timeouts-on-smf-services/
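
For reference, a hedged sketch of the commands involved in this kind of check and timeout change. The wrapper function names are made up for illustration, the FMRI and value in the usage comment are placeholders, and the `start/timeout_seconds` property may live on the service rather than the instance depending on how the manifest defines the start method:

```shell
#!/bin/sh
# Explain which services are not running and why.
explain_failures() {
    svcs -x
}

# List the services that the multi-user-server milestone depends on directly.
milestone_deps() {
    svcs -d svc:/milestone/multi-user-server:default
}

# Raise the start-method timeout of a service, then refresh it so the
# running configuration picks up the change.
set_start_timeout() {
    fmri=$1
    seconds=$2
    svccfg -s "$fmri" setprop start/timeout_seconds = count: "$seconds" &&
        svcadm refresh "$fmri"
}

# Example (placeholder FMRI and value):
#   set_start_timeout svc:/example/app:default 300
```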

-- 
Mike Gerdts
Solaris Core OS / Zones http://blogs.oracle.com/zoneszone/


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
Thanks Mike.

The more I look at this, the more I think it is load related. svcs -x only shows
that the LP print server is not running, which I don't think has any impact
on what I'm seeing.

As for who not reporting what I would expect, I tracked that down to someone
installing the GNU tools in /usr/local/bin and then setting the default path to
reference those before /bin/ :-(

/bin/who -r shows the zone is at run level 3.

Looking at /var/svc/log/milestone-multi-user-server:default.log I can see
that some of the other services have most likely not completed before it
tries to run the rc scripts. It appears that the /usr filesystem hasn't yet
been mounted read/write and the appstart script is logging an error that
indicates rpc services are not completely running.

Executing legacy init script /etc/rc3.d/S98apache.
(30)Read-only file system: httpd: could not open error log file
/usr/local/apache2/logs/error_log.
Unable to open logs
Legacy init script /etc/rc3.d/S98apache exited with return code 0.
Executing legacy init script /etc/rc3.d/S99appstart.
ERROR: Unable to contact any server
Legacy init script /etc/rc3.d/S99appstart exited with return code 0.
[ Dec 1 09:17:13 Method start exited with status 0 ]

We have a process in place that only starts 3 zones at one time so we are
not doing all 40 at once but it could be that with this hardware even
trying 3 at a time is too much and we may need to drop to 2.
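
As an illustration only (this is not the process we actually run), batch booting with a readiness check could be sketched as below. It assumes a POSIX shell in the global zone, and it assumes that "ready" means the multi-user-server milestone inside the zone reports online; the function names, MAX_TRIES, and POLL_SECONDS are hypothetical.

```shell
#!/bin/sh
MILESTONE=svc:/milestone/multi-user-server:default

# True if the milestone inside zone $1 reports "online".
zone_ready() {
    case "$(zlogin "$1" svcs -H -o state "$MILESTONE" 2>/dev/null)" in
        online*) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll each named zone until it is ready; give up after MAX_TRIES polls.
wait_for_zones() {
    for z in "$@"; do
        tries=0
        until zone_ready "$z"; do
            tries=$((tries + 1))
            [ "$tries" -ge "${MAX_TRIES:-60}" ] && return 1
            sleep "${POLL_SECONDS:-5}"
        done
    done
}

# boot_in_batches SIZE ZONE...
# Boot SIZE zones, wait until all of them are ready, then boot the next SIZE.
boot_in_batches() {
    size=$1
    shift
    batch=""
    count=0
    for zone in "$@"; do
        zoneadm -z "$zone" boot || return 1
        batch="$batch $zone"
        count=$((count + 1))
        if [ "$count" -ge "$size" ]; then
            wait_for_zones $batch || return 1
            batch=""
            count=0
        fi
    done
    [ -z "$batch" ] || wait_for_zones $batch
}

# Example: boot_in_batches 2 zone01 zone02 zone03
```

Waiting on the milestone rather than on zoneadm itself matters because zoneadm boot returns well before the services inside the zone have finished starting.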

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 06:07 AM, Derek McEachern wrote:

System has 72GB RAM
xeon cpu - 2 socket - 4 core - 16 thread

zoneroot is on a UFS filesystem on its own drive, separate from the OS.



That (UFS) is a strange choice for a recent Solaris 10 version.  You 
lose the useful zones/ZFS features such as cloning.


--
Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

It seems that you could either
1) improve your rc script so that it checks apache's other dependencies,
or
2) run apache under SMF, which checks those dependencies for you.
My 2c
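
A rough sketch of option 1, assuming a POSIX shell. Which services the rc script should wait for is a guess here (local filesystems and rpcbind), and the apachectl path in the usage comment is assumed from the log path in this thread:

```shell
#!/bin/sh
# wait_online FMRI [TIMEOUT_SECONDS]
# Poll an SMF service once a second until it is online.
# Returns 1 if it is still not online after TIMEOUT_SECONDS (default 60).
wait_online() {
    fmri=$1
    timeout=${2:-60}
    waited=0
    while :; do
        case "$(svcs -H -o state "$fmri" 2>/dev/null)" in
            online*) return 0 ;;
        esac
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use at the top of the rc script (FMRIs and path are assumptions):
#   wait_online svc:/system/filesystem/local:default 60 &&
#   wait_online svc:/network/rpc/bind:default 60 &&
#   /usr/local/apache2/bin/apachectl start
```

Option 2 moves the same dependency information into the service manifest, so SMF does the waiting instead of the script.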



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 05:39 AM, Derek McEachern wrote:

Have a peculiar problem that I haven't seen before.

When starting a system that has about 35 - 40 zones on it occasionally 
we see that one of the zones doesn't come up properly. You can log 
into the zone but none of the /etc/rc3.d scripts have been run.



The same zone, or a random one?

What happens if you halt one or more zones before rebooting?  Is there a 
threshold where the problem begins to occur?


--
Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
Random zone.

We've been testing to see if there is a threshold of trying to start too
many in parallel but so far we don't see anything.

We saw the problem trying to start 3 zones in parallel but it was very
intermittent. In roughly 1 out of every 4 attempts at starting all 40 zones
we would see 1 failure. We ran some tests starting 10 zones in parallel and
so far no errors. Our assumption was that if it was load related, moving
from 3 to 10 zones would cause problems.

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
I agree, our script could certainly be improved to add logic to check for
these failures and handle them, which we will probably end up doing.

Derek

On Thu, Dec 1, 2011 at 2:47 PM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. 
laot...@gmail.com wrote:

 It seems that you could either
 1) improve your rc script so that it checks apache's other dependencies,
 or
 2) run apache under SMF, which checks those dependencies for you.
 My 2c




Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
We haven't made the jump to ZFS yet :-) We do lose some useful features
but haven't spent the time to port our stuff over to use ZFS.

On Thu, Dec 1, 2011 at 2:47 PM, Ian Collins i...@ianshome.com wrote:

 On 12/ 2/11 06:07 AM, Derek McEachern wrote:

 System has 72GB RAM
 xeon cpu - 2 socket - 4 core - 16 thread

 zoneroot is on a UFS filesystem on its own drive, separate from the OS.


 That (UFS) is a strange choice for a recent Solaris 10 version.  You lose
 the useful zones/ZFS features such as cloning.

 --
 Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 10:30 AM, Derek McEachern wrote:
On Thu, Dec 1, 2011 at 2:48 PM, Ian Collins i...@ianshome.com wrote:


On 12/ 2/11 05:39 AM, Derek McEachern wrote:

Have a peculiar problem that I haven't seen before.

When starting a system that has about 35 - 40 zones on it
occasionally we see that one of the zones doesn't come up
properly. You can log into the zone but none of the /etc/rc3.d
scripts have been run.

The same zone, or a random one?

What happens if you halt one or more zones before rebooting?  Is
there a threshold where the problem begins to occur?

Random zone.

We've been testing to see if there is a threshold of trying to start 
too many in parallel but so far we don't see anything.


We saw the problem trying to start 3 zones in parallel but it was very 
intermittent. In roughly 1 out of every 4 attempts at starting all 40 
zones we would see 1 failure. We ran some tests starting 10 zones in 
parallel and so far no errors. Our assumption was that if it was load 
related, moving from 3 to 10 zones would cause problems.


I have several systems that start 10 or more zones and I've never seen 
any problems.


I agree with the comment elsewhere that you should be using SMF rather 
than rc scripts to start services.


It is also possible to create SMF services with the appropriate 
dependencies to start your zones in the correct order.


--
Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 10:36 AM, Derek McEachern wrote:
We haven't made the jump to ZFS yet :-) We do lose some useful 
features but haven't spent the time to port our stuff over to use ZFS.




Make the jump sooner rather than later or you will flounder on Solaris 11.

--
Ian.
