Re: [zones-discuss] Zone Not Starting Properly?
I actually know that: rpcinfo -p remote_host

The script is trying to mount an NFS share based on a configured list of remote NFS filers. In theory the client only has access to one filer, and the script determines which one by using the rpcinfo command.

Derek

On Tue, Dec 13, 2011 at 6:55 PM, Edward Pilatowicz edward.pilatow...@oracle.com wrote:
> On Tue, Dec 13, 2011 at 09:44:23AM -0600, Derek McEachern wrote:
> > Thought I would just send an update on this. Thanks for all the suggestions.
> >
> > To get around our particular issue I just added some retry logic to the /etc/init.d/ script. When it runs, if it finds that the operation has failed it pauses for a second and tries again. It will try up to three times before giving up.
>
> it'd be interesting to know what particular operation is failing within the script...
>
> ed

___
zones-discuss mailing list
zones-discuss@opensolaris.org
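The filer selection described above could be sketched roughly as follows. This is not the actual init script, which is not shown in the thread: the function names first_reachable and rpc_probe are made up for illustration, and the only command taken from the message is rpcinfo -p. Backticks and plain Bourne syntax are used on the assumption that the script runs under the Solaris 10 /sbin/sh.

```shell
#!/bin/sh
# Sketch only: the real init script is not shown in this thread.
# first_reachable and rpc_probe are hypothetical names.

# first_reachable PROBE HOST...
# Print the first HOST for which "PROBE HOST" succeeds.
# Returns 1 (and prints nothing on stdout) if no host answers.
first_reachable() {
    probe=$1
    shift
    for host in "$@"; do
        if "$probe" "$host" >/dev/null 2>&1; then
            echo "$host"
            return 0
        fi
    done
    echo "no filer answered the probe" >&2
    return 1
}

# Probe with the command named in the message: ask the remote
# portmapper (rpcbind) for its list of registered RPC programs.
rpc_probe() {
    rpcinfo -p "$1"
}
```

A caller might then run filer=`first_reachable rpc_probe filer-a filer-b` and mount from "$filer"; the host names are placeholders. If RPC services inside the zone are not up yet, every probe fails and the function returns 1, which is the kind of early-boot failure a retry would cover.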
Re: [zones-discuss] Zone Not Starting Properly?
Thought I would just send an update on this. Thanks for all the suggestions.

To get around our particular issue I just added some retry logic to the /etc/init.d/ script. When it runs, if it finds that the operation has failed it pauses for a second and tries again. It will try up to three times before giving up.

Running more tests, we were able to see that on some occasions it still fails on the first attempt, but so far it has always been successful on the 2nd.

Derek

On Thu, Dec 1, 2011 at 3:37 PM, Ian Collins i...@ianshome.com wrote:
> I have several systems that start 10 or more zones and I've never seen any problems.
>
> I agree with the comment elsewhere that you should be using SMF rather than rc scripts to start services. It is also possible to create SMF services with the appropriate dependencies to start your zones in the correct order.
>
> --
> Ian.
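The retry described above (pause for a second, give up after three attempts) could look something like the sketch below. It is not Derek's actual change, which is not shown in the thread; the function name retry and its argument order are assumptions. It sticks to expr and backticks because the legacy Solaris 10 Bourne shell does not support $(( )) arithmetic.

```shell
#!/bin/sh
# Sketch only: "retry" is a hypothetical helper, not the real script.

# retry MAX DELAY CMD [ARG...]
# Run CMD until it succeeds or MAX attempts have been made,
# sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
        sleep "$delay"
        attempt=`expr $attempt + 1`
    done
}
```

With the values from the message, the failing operation would be wrapped as retry 3 1 some_command. The per-attempt message on stderr also records which attempt succeeded, which matches the observation that the first attempt sometimes fails and the second has so far always worked.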
Re: [zones-discuss] Zone Not Starting Properly?
On Tue, Dec 13, 2011 at 09:44:23AM -0600, Derek McEachern wrote:
> Thought I would just send an update on this. Thanks for all the suggestions.
>
> To get around our particular issue I just added some retry logic to the /etc/init.d/ script. When it runs, if it finds that the operation has failed it pauses for a second and tries again. It will try up to three times before giving up.

it'd be interesting to know what particular operation is failing within the script...

ed
[zones-discuss] Zone Not Starting Properly?
Have a peculiar problem that I haven't seen before. When starting a system that has about 35-40 zones on it, occasionally we see that one of the zones doesn't come up properly. You can log into the zone, but none of the /etc/rc3.d scripts have been run.

/var/adm/messages is completely empty, and when running who -r to see the run level it doesn't report anything.

    # who -r
    run-level Dec 1 09:17 last=

Anyone else seen anything similar? We are running Solaris 10 update 9.

Regards,
Derek
Re: [zones-discuss] Zone Not Starting Properly?
For 30-40 zones, how much RAM does the main host have? What kind of CPU, and how many CPUs? Is everything on ZFS? What storage/HDD is used for the zone roots?

regards

On 12/1/2011 11:39 AM, Derek McEachern wrote:
> Have a peculiar problem that I haven't seen before. When starting a system that has about 35-40 zones on it, occasionally we see that one of the zones doesn't come up properly. You can log into the zone, but none of the /etc/rc3.d scripts have been run.
>
> /var/adm/messages is completely empty, and when running who -r to see the run level it doesn't report anything.
>
>     # who -r
>     run-level Dec 1 09:17 last=
>
> Anyone else seen anything similar? We are running Solaris 10 update 9.
>
> Regards,
> Derek

--
Hung-Sheng Tsao Ph D.
Founder Principal
HopBit GridComputing LLC
cell: 9734950840
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/
Re: [zones-discuss] Zone Not Starting Properly?
System has 72GB RAM
Xeon CPU - 2 socket - 4 core - 16 thread
Zone root is on a UFS filesystem on its own drive, separate from the OS.

Derek

On Thu, Dec 1, 2011 at 11:01 AM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. laot...@gmail.com wrote:
> For 30-40 zones, how much RAM does the main host have? What kind of CPU, and how many CPUs? Is everything on ZFS? What storage/HDD is used for the zone roots?
>
> regards
Re: [zones-discuss] Zone Not Starting Properly?
On Thu 01 Dec 2011 at 10:39AM, Derek McEachern wrote:
> Have a peculiar problem that I haven't seen before. When starting a system that has about 35-40 zones on it, occasionally we see that one of the zones doesn't come up properly. You can log into the zone, but none of the /etc/rc3.d scripts have been run.
>
> /var/adm/messages is completely empty, and when running who -r to see the run level it doesn't report anything.

Take a look at the output of svcs -x. Most likely you have a service that svc:/milestone/multi-user-server:default depends on (directly or indirectly) that has timed out and as such is in maintenance. Because the dependency is not satisfied, this milestone doesn't come up, so the rc3 scripts are not run.

My guess is the timeout is because so many zones are starting at once that the disks are being thrashed. The resulting I/O backlog slows down the startup of services, which leads to timeouts, which lead to some services failing to even try to start.

A Google search and a 5 second read suggest that this link may help with adjusting the timeout of services that need a longer one:

http://www.runningunix.com/2009/01/changing-timeouts-on-smf-services/

--
Mike Gerdts
Solaris Core OS / Zones
http://blogs.oracle.com/zoneszone/
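Alongside svcs -x, the dependency check suggested above can be narrowed to the milestone itself. The sketch below assumes it is run inside the affected zone; the function name not_online_deps is hypothetical. It lists only direct dependencies, so an indirect one stuck in maintenance would still need svcs -x or a second pass.

```shell
#!/bin/sh
# Sketch only: not_online_deps is a hypothetical helper name.

# not_online_deps FMRI
# List the direct dependencies of FMRI whose state is not "online".
# -d selects the services FMRI depends on, -H drops the header line,
# -o picks the columns to print (state first, so awk can filter on it).
not_online_deps() {
    svcs -H -o state,fmri -d "$1" | awk '$1 != "online"'
}
```

Calling not_online_deps svc:/milestone/multi-user-server:default prints nothing when every direct dependency is online. A dependency shown as disabled is not necessarily a fault, since optional dependencies may legitimately be disabled.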
Re: [zones-discuss] Zone Not Starting Properly?
Thanks Mike. The more I look at this, the more I think it is load related. svcs -x only shows that the LP print server is not running, which I don't think has any impact on what I'm seeing.

As for who not reporting what I would expect, I tracked that down to someone installing the GNU tools in /usr/local/bin and then setting the default path to reference those before /bin/ :-( /bin/who -r shows the zone is at run level 3.

Looking at /var/svc/log/milestone-multi-user-server:default.log I can see that some of the other services have most likely not completed before it tries to run the rc scripts. It appears that the /usr filesystem hasn't yet been mounted read/write, and the appstart script is logging an error that indicates RPC services are not completely running.

    Executing legacy init script /etc/rc3.d/S98apache.
    (30)Read-only file system: httpd: could not open error log file /usr/local/apache2/logs/error_log.
    Unable to open logs
    Legacy init script /etc/rc3.d/S98apache exited with return code 0.
    Executing legacy init script /etc/rc3.d/S99appstart.
    ERROR: Unable to contact any server
    Legacy init script /etc/rc3.d/S99appstart exited with return code 0.
    [ Dec 1 09:17:13 Method start exited with status 0 ]

We have a process in place that only starts 3 zones at one time, so we are not doing all 40 at once, but it could be that with this hardware even trying 3 at a time is too much and we may need to drop to 2.

Derek

On Thu, Dec 1, 2011 at 12:07 PM, Mike Gerdts mike.ger...@oracle.com wrote:
> Take a look at the output of svcs -x. Most likely you have a service that svc:/milestone/multi-user-server:default depends on (directly or indirectly) that has timed out and as such is in maintenance. Because the dependency is not satisfied, this milestone doesn't come up, so the rc3 scripts are not run.
>
> My guess is the timeout is because so many zones are starting at once that the disks are being thrashed. The resulting I/O backlog slows down the startup of services, which leads to timeouts, which lead to some services failing to even try to start.
>
> A Google search and a 5 second read suggest that this link may help with adjusting the timeout of services that need a longer one:
>
> http://www.runningunix.com/2009/01/changing-timeouts-on-smf-services/
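The who -r confusion above came from a different who being found first on the search path. One way to see which file a command name resolves to, without relying on shell builtins that differ between shells, is to walk PATH by hand. This is a sketch under that assumption; find_in_path is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: find_in_path is a hypothetical helper name.

# find_in_path NAME
# Print the first executable file called NAME found in $PATH,
# searching directories in order. Returns 1 if none is found.
find_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH component means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            echo "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

In the situation described, find_in_path who would presumably print a path under /usr/local/bin rather than /bin/who. Calling system tools by absolute path in init scripts avoids the ambiguity.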
Re: [zones-discuss] Zone Not Starting Properly?
On 12/ 2/11 06:07 AM, Derek McEachern wrote:
> System has 72GB RAM
> Xeon CPU - 2 socket - 4 core - 16 thread
> Zone root is on a UFS filesystem on its own drive, separate from the OS.

That (UFS) is a strange choice for a recent Solaris 10 version. You lose the useful zones/ZFS features such as cloning.

--
Ian.
Re: [zones-discuss] Zone Not Starting Properly?
It seems that you could
1) improve your rc script to check the other dependencies for Apache, or
2) use SMF for Apache, which checks the other dependencies.

my 2c

On 12/1/2011 1:33 PM, Derek McEachern wrote:
> Thanks Mike. The more I look at this, the more I think it is load related. svcs -x only shows that the LP print server is not running, which I don't think has any impact on what I'm seeing.
>
> As for who not reporting what I would expect, I tracked that down to someone installing the GNU tools in /usr/local/bin and then setting the default path to reference those before /bin/ :-( /bin/who -r shows the zone is at run level 3.
>
> Looking at /var/svc/log/milestone-multi-user-server:default.log I can see that some of the other services have most likely not completed before it tries to run the rc scripts. It appears that the /usr filesystem hasn't yet been mounted read/write, and the appstart script is logging an error that indicates RPC services are not completely running.
>
>     Executing legacy init script /etc/rc3.d/S98apache.
>     (30)Read-only file system: httpd: could not open error log file /usr/local/apache2/logs/error_log.
>     Unable to open logs
>     Legacy init script /etc/rc3.d/S98apache exited with return code 0.
>     Executing legacy init script /etc/rc3.d/S99appstart.
>     ERROR: Unable to contact any server
>     Legacy init script /etc/rc3.d/S99appstart exited with return code 0.
>     [ Dec 1 09:17:13 Method start exited with status 0 ]
>
> We have a process in place that only starts 3 zones at one time, so we are not doing all 40 at once, but it could be that with this hardware even trying 3 at a time is too much and we may need to drop to 2.
>
> Derek
Re: [zones-discuss] Zone Not Starting Properly?
On 12/ 2/11 05:39 AM, Derek McEachern wrote:
> Have a peculiar problem that I haven't seen before. When starting a system that has about 35-40 zones on it, occasionally we see that one of the zones doesn't come up properly. You can log into the zone, but none of the /etc/rc3.d scripts have been run.

The same zone, or a random one?

What happens if you halt one or more zones before rebooting? Is there a threshold where the problem begins to occur?

--
Ian.
Re: [zones-discuss] Zone Not Starting Properly?
Random zone. We've been testing to see if there is a threshold of trying to start too many in parallel, but so far we don't see anything. We saw the problem trying to start 3 zones in parallel, but it was very intermittent: in roughly 1 out of every 4 tries at starting all 40 zones we would see 1 failure. We ran some tests starting 10 zones in parallel and so far no errors. Our assumption was that if it was load related, moving from 3 to 10 zones we would see problems.

Derek

On Thu, Dec 1, 2011 at 2:48 PM, Ian Collins i...@ianshome.com wrote:
> The same zone, or a random one?
>
> What happens if you halt one or more zones before rebooting? Is there a threshold where the problem begins to occur?
>
> --
> Ian.
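Starting zones a few at a time, as in the tests above, can be sketched as a small batching helper. The process Derek's site actually uses is not shown in the thread, so run_in_batches and boot_zone are hypothetical names; zoneadm -z NAME boot is the standard command for booting a zone.

```shell
#!/bin/sh
# Sketch only: run_in_batches and boot_zone are hypothetical names.

# run_in_batches SIZE CMD ITEM...
# Run "CMD ITEM" for every ITEM, at most SIZE at a time. Each batch
# is started in the background and must finish before the next begins.
run_in_batches() {
    size=$1
    cmd=$2
    shift 2
    running=0
    for item in "$@"; do
        "$cmd" "$item" &
        running=`expr $running + 1`
        if [ "$running" -ge "$size" ]; then
            wait
            running=0
        fi
    done
    # Wait for the final, possibly partial, batch.
    wait
}

# Boot one zone by name.
boot_zone() {
    zoneadm -z "$1" boot
}
```

A call such as run_in_batches 3 boot_zone zone01 zone02 zone03 zone04 would boot three zones, wait, then boot the fourth; the zone names are placeholders. One caveat: zoneadm boot returns once the zone is running, before the services inside it have finished starting, so a throttle that is meant to limit start-up load would probably need boot_zone to wait on something inside the zone as well.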
Re: [zones-discuss] Zone Not Starting Properly?
I agree, our script could certainly be improved by adding logic to check for these failures and handle them, which we will probably end up doing.

Derek

On Thu, Dec 1, 2011 at 2:47 PM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. laot...@gmail.com wrote:
> It seems that you could
> 1) improve your rc script to check the other dependencies for Apache, or
> 2) use SMF for Apache, which checks the other dependencies.
>
> my 2c
Re: [zones-discuss] Zone Not Starting Properly?
We haven't made the jump to ZFS yet :-) We do lose some useful features, but we haven't spent the time to port our stuff over to ZFS.

On Thu, Dec 1, 2011 at 2:47 PM, Ian Collins i...@ianshome.com wrote:
> That (UFS) is a strange choice for a recent Solaris 10 version. You lose the useful zones/ZFS features such as cloning.
>
> --
> Ian.
Re: [zones-discuss] Zone Not Starting Properly?
On 12/ 2/11 10:30 AM, Derek McEachern wrote:
> On Thu, Dec 1, 2011 at 2:48 PM, Ian Collins i...@ianshome.com wrote:
> > On 12/ 2/11 05:39 AM, Derek McEachern wrote:
> > > Have a peculiar problem that I haven't seen before. When starting a system that has about 35-40 zones on it, occasionally we see that one of the zones doesn't come up properly. You can log into the zone, but none of the /etc/rc3.d scripts have been run.
> >
> > The same zone, or a random one?
> >
> > What happens if you halt one or more zones before rebooting? Is there a threshold where the problem begins to occur?
>
> Random zone. We've been testing to see if there is a threshold of trying to start too many in parallel, but so far we don't see anything. We saw the problem trying to start 3 zones in parallel, but it was very intermittent: in roughly 1 out of every 4 tries at starting all 40 zones we would see 1 failure. We ran some tests starting 10 zones in parallel and so far no errors. Our assumption was that if it was load related, moving from 3 to 10 zones we would see problems.

I have several systems that start 10 or more zones and I've never seen any problems.

I agree with the comment elsewhere that you should be using SMF rather than rc scripts to start services. It is also possible to create SMF services with the appropriate dependencies to start your zones in the correct order.

--
Ian.
Re: [zones-discuss] Zone Not Starting Properly?
On 12/ 2/11 10:36 AM, Derek McEachern wrote:
> We haven't made the jump to ZFS yet :-) We do lose some useful features, but we haven't spent the time to port our stuff over to ZFS.

Make the jump sooner rather than later or you will flounder on Solaris 11.

--
Ian.