That's the thing: I did not see any!  I am still running my test;
it is up to iteration #98 right now.  I have verified via dumpadm
that a dump device is configured and enabled, so it is a bit of a
waiting game at this point.  I did extend the delay between boot
and halt just a bit to more accurately reflect my original crontab,
but I doubt that should change anything.  We shall see...
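
For reference, the checks I am relying on are roughly the following (a minimal
sketch; the savecore directory shown is the default and may differ on other
systems):

pfexec dumpadm                 # confirm the dump device and savecore are configured/enabled
ls -l /var/crash/`hostname`    # after the next unexpected reboot, look for unix.* / vmcore.* files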

g


On 12/23/09 1:57 PM, Steve Lawrence wrote:
Do you have the panic message or crash dump?

-Steve L.


On Wed, Dec 23, 2009 at 09:26:17AM -0500, Glenn Brunette wrote:

Frank,

Just verified that something is still wrong in b129, but the problem is
_not_ with a vanilla configuration.  This time, around boot/halt #102,
the system apparently shut down or panicked.  I was running it overnight
and came in to find a system that had been rebooted.  I did not see any
problem in the audit log nor in /var/adm/messages.  Any pointers?

I am running an Immutable Service Container configuration, based upon
the installation steps at:

http://kenai.com/projects/isc/pages/OpenSolaris

Specifically:

pfexec pkg install SUNWmercurial
hg clone https://kenai.com/hg/isc~source  isc
pfexec isc/bin/iscadm.ksh -N 0
pfexec bootadm update-archive
pfexec shutdown -g 0 -i 0 -y
[after reboot]
zlogin -C isc1
[wait for zone isc1 to fully complete boot process]

Then run the script that I provided, which stops and starts the zone
(it is included again in the quoted message below).

Apparently, there must be something wrong with the interaction of
these components.  In this configuration, we have resource
controls, auditing, IP Filter/IP NAT, and zones all enabled.
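
If it is useful for comparison, here is roughly how I am checking that those
pieces are actually active on my system (just a sketch; the exact service
names are from memory and may vary by build):

svcs -a | egrep 'ipfilter|auditd'   # IP Filter and auditing services
pfexec ipfstat -io                  # active inbound/outbound IP Filter rules
pfexec auditconfig -getcond         # confirm auditing is enabled
pfexec prctl -i zone isc1           # resource controls on the running zone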

Would it be possible for you to try the steps above on a fresh
install of 2009.06 or later (b129 is where I am right now)?  Also,
if you have other debugging methods, please let me know.

I am going to kick this off again to see if I can catch any
error messages.
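
To improve the odds of catching something this time, I am capturing the run
roughly like this (zone-cycle.ksh is just a stand-in name for the boot/halt
script I sent earlier):

./zone-cycle.ksh 2>&1 | tee /var/tmp/zone-cycle.log   # loop output, with the date stamps from the script
tail -f /var/adm/messages                             # watched in a second terminal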

g


On 12/16/09 3:49 AM, Frank Batschulat (Home) wrote:
Glenn, I've not been able to reproduce this on onnv build 126 (it's been running for
a day now).

If that script reproduced 6894901 straight away, it should be doing so
on 126 as well (similar to what you've seen on 127).

This poses the question of whether there are some other details in your
environment that I don't have, or whether that script really reliably reproduces
6894901.

cheers
frankB

On Tue, 15 Dec 2009 15:23:06 +0100, Frank Batschulat (Home) <frank.batschu...@sun.com> wrote:

Glenn, I've been running this test case now for nearly a day on build 129 and couldn't
reproduce it at all.  There is a good chance this was indeed fixed by 6894901 in build 128.

I'll also try to reproduce this now on build 126.

cheers
frankB

On Fri, 11 Dec 2009 21:48:52 +0100, Glenn Brunette <glenn.brune...@sun.com> wrote:

As part of an Immutable Service Container[1] demonstration that I am
creating for an event in January, I need to start and stop a zone
quite a few times (as part of a Self-Cleansing[2] demo).  During the
course of my testing, I have been able to repeatedly get zoneadm to
hang.

Since I am working with a highly customized configuration, I started
over with a default zone on OpenSolaris (b127) and was able to repeat
this issue.  To reproduce this problem, use the following script after
creating a zone using the normal/default steps:

isc...@osol-isc:~$ while : ; do
   >   echo "`date`: ZONE BOOT"
   >   pfexec zoneadm -z test boot
   >   sleep 30
   >   pfexec zoneadm -z test halt
   >   echo "`date`: ZONE HALT"
   >   sleep 10
   >   done
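
As an aside, the fixed 30-second sleep is just a guess at how long the zone
needs to come up.  A variant along these lines would wait for the reported
zone state before the dwell (an untested sketch; "running" only means the zone
booted at the zoneadm level, not that all of its SMF services are online):

while : ; do
   echo "`date`: ZONE BOOT"
   pfexec zoneadm -z test boot
   # poll until zoneadm reports the zone as "running"
   # (state is field 3 of the parsable "list -p" output)
   while [ "`zoneadm -z test list -p | cut -d: -f3`" != "running" ]; do
      sleep 1
   done
   sleep 30
   pfexec zoneadm -z test halt
   echo "`date`: ZONE HALT"
   sleep 10
done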

This script works just fine for a while, but eventually zoneadm hangs
(was at pass #90 in my last test).  When this happens, zoneadm is shown
to be consuming quite a bit of CPU:

      PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
    16598 root       11M 3140K run      1    0   0:54:49  74% zoneadm/1


A stack trace of zoneadm shows:

isc...@osol-isc:~$ pfexec pstack `pgrep zoneadm`
16082:  zoneadmd -z test
-----------------  lwp# 1  --------------------------------
-----------------  lwp# 2  --------------------------------
    feef41c6 door     (0, 0, 0, 0, 0, 8)
    feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
    feeee3f3 _thrp_setup (fe5b0a00) + 9b
    feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
-----------------  lwp# 3  --------------------------------
    feef420f __door_return () + 2f
-----------------  lwp# 4  --------------------------------
    feef420f door     (0, 0, 0, fe140e00, f5f00, a)
    feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
    feeee3f3 _thrp_setup (fe5b1a00) + 9b
    feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
16598:  zoneadm -z test boot
    feef3fc8 door     (6, 80476d0, 0, 0, 0, 3)
    feede653 door_call (6, 80476d0, 400, fe3d43f7) + 7b
    fe3d44f0 zonecfg_call_zoneadmd (8047e33, 8047730, 8078448, 1) + 124
    0805792d boot_func (0, 8047d74, 100, 805ff0b) + 1cd
    08060125 main     (4, 8047d64, 8047d78, 805570f) + 2b9
    0805576d _start   (4, 8047e28, 8047e30, 8047e33, 8047e38, 0) + 7d


A stack trace of zoneadmd shows:

isc...@osol-isc:~$ pfexec pstack `pgrep zoneadmd`
16082:  zoneadmd -z test
-----------------  lwp# 1  --------------------------------
-----------------  lwp# 2  --------------------------------
    feef41c6 door     (0, 0, 0, 0, 0, 8)
    feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
    feeee3f3 _thrp_setup (fe5b0a00) + 9b
    feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
-----------------  lwp# 3  --------------------------------
    feef4147 __door_ucred (80a37c8, fef81000, fe23e838, feed9cfe) + 27
    feed9d0d door_ucred (fe23f870, 1000, 0, 0) + 32
    08058a88 server   (0, fe23f8f0, 510, 0, 0, 8058a04) + 84
    feef4240 __door_return () + 60
-----------------  lwp# 4  --------------------------------
    feef420f door     (0, 0, 0, fe140e00, f5f00, a)
    feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
    feeee3f3 _thrp_setup (fe5b1a00) + 9b
    feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)


A truss of zoneadm (-f -vall -wall -tall) shows this looping:

16598:  door_call(6, 0x080476D0)                        = 0
16598:          data_ptr=8047730 data_size=0
16598:          desc_ptr=0x0 desc_num=0
16598:          rbuf=0x807F2D8 rsize=4096
16598:  close(6)                                        = 0
16598:  mkdir("/var/run/zones", 0700)                   Err#17 EEXIST
16598:  chmod("/var/run/zones", 0700)                   = 0
16598:  open("/var/run/zones/test.zoneadm.lock", O_RDWR|O_CREAT, 0600) = 6
16598:  fcntl(6, F_SETLKW, 0x08046DC0)                  = 0
16598:          typ=F_WRLCK  whence=SEEK_SET start=0     len=0 sys=4277003009 pid=6
16598:  open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 7
16598:  door_info(7, 0x08047230)                        = 0
16598:          target=16082 proc=0x8058A04 data=0x0
16598:          attributes=DOOR_UNREF|DOOR_REFUSE_DESC|DOOR_NO_CANCEL
16598:          uniquifier=26426
16598:  close(7)                                        = 0
16598:  close(6)                                        = 0
16598:  open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 6
16082/3:        door_return(0x00000000, 0, 0x00000000, 0xFE23FE00, 1007360) = 0
16082/3:        door_ucred(0x080A37C8)                          = 0
16082/3:                euid=0 egid=0
16082/3:                ruid=0 rgid=0
16082/3:                pid=16598 zoneid=0
16082/3:                E: all
16082/3:                I: basic
16082/3:                P: all
16082/3:                L: all


PID 16598 is zoneadm and PID 16082 is zoneadmd.


Is this a known issue?  Are there any other things that I can do to
help debug this situation?  Once things get into this state, I have
only been able to recover by rebooting the zone.
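
One thing I can try the next time it wedges, if it would help, is grabbing the
kernel-side stacks of zoneadmd with mdb (a rough sketch from memory, run from
the global zone with root privileges):

echo "::pgrep zoneadmd | ::walk thread | ::findstack -v" | pfexec mdb -k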



Please advise.

g


[1] http://kenai.com/projects/isc/pages/OpenSolaris
[2] http://kenai.com/attachments/wiki_images/isc/isc-autonomic-cleansing-time-v1.3.png
_______________________________________________
zones-discuss mailing list
zones-discuss@opensolaris.org