And finally, I have had this script running on a real OSOL build 127 box for a
day now.

I cannot reproduce it there either.

So I failed to reproduce this at all using the script on:

- ONNV 129 (zfs root, 1 cpu)
- ONNV 126 (ufs root, 2 cpus)
- OSOL 127 (zfs root, 4 cores)

There must be something special that I am missing.
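
To make runs on different boxes easier to compare, a variant of the loop that counts passes and logs every step might help. The following is only a sketch, not the script that was actually run: the names ZONE, ZONEADM, PASSES, BOOT_WAIT, HALT_WAIT and LOG, the log path, and the `run` argument are all assumptions made for illustration. The only zoneadm calls it uses are the `boot` and `halt` subcommands from the original loop.

```shell
#!/bin/sh
# Sketch: boot/halt loop with a pass counter and per-step logging.
# All variable names and the log path below are illustrative assumptions.
ZONE=${ZONE:-test}
ZONEADM=${ZONEADM:-"pfexec zoneadm"}
PASSES=${PASSES:-200}
BOOT_WAIT=${BOOT_WAIT:-30}
HALT_WAIT=${HALT_WAIT:-10}
LOG=${LOG:-/var/tmp/zone-cycle.log}

# Log one line before and one line after a zoneadm subcommand, so the
# last line in the log names the pass and step that never returned.
step() {
    n=$1
    sub=$2
    echo "`date`: pass $n: $sub start" >> "$LOG"
    $ZONEADM -z "$ZONE" "$sub"
    rc=$?
    echo "`date`: pass $n: $sub exit=$rc" >> "$LOG"
    return $rc
}

# Boot and halt the zone PASSES times; stop at the first failing step.
cycle() {
    pass=1
    while [ "$pass" -le "$PASSES" ]; do
        step "$pass" boot || return 1
        sleep "$BOOT_WAIT"
        step "$pass" halt || return 1
        sleep "$HALT_WAIT"
        pass=`expr "$pass" + 1`
    done
}

# Only start the loop when asked to, e.g. "sh zone-cycle.sh run".
if [ "${1:-}" = run ]; then
    cycle
fi
```

If zoneadm hangs, the last line of the log should be a "start" line without a matching "exit" line, which gives the pass number and whether boot or halt was in progress.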

On Wed, 16 Dec 2009 09:49:25 +0100, Frank Batschulat (Home) 
<frank.batschu...@sun.com> wrote:

> Glenn, I've not been able to reproduce this on onnv build 126 (it has been
> running for a day now).
>
> If that script reproduced 6894901 straight away, it should do so
> on 126 as well (similar to what you've seen in 127).
>
> This poses the question of whether there are some other details in your
> environment that I don't have, or whether that script really reliably
> reproduces 6894901.
>
> On Tue, 15 Dec 2009 15:23:06 +0100, Frank Batschulat (Home) 
> <frank.batschu...@sun.com> wrote:
>
>> Glenn, I've been running this test case for nearly a day now on build 129
>> and couldn't reproduce it at all. There is a good chance this was indeed
>> fixed by 6894901 in build 128.
>>
>> I'll also try to reproduce this now on build 126.
>>
>> On Fri, 11 Dec 2009 21:48:52 +0100, Glenn Brunette <glenn.brune...@sun.com> 
>> wrote:
>>>
>>> As part of an Immutable Service Container[1] demonstration that I am
>>> creating for an event in January, I need to start/stop a zone
>>> quite a few times (as part of a Self-Cleansing[2] demo).  During the
>>> course of my testing, I have been able to repeatedly get zoneadm to
>>> hang.
>>>
>>> Since I am working with a highly customized configuration, I started
>>> over with a default zone on OpenSolaris (b127) and was able to repeat
>>> this issue.  To reproduce this problem, use the following script after
>>> creating a zone using the normal/default steps:
>>>
>>> isc...@osol-isc:~$ while : ; do
>>>  > echo "`date`: ZONE BOOT"
>>>  > pfexec zoneadm -z test boot
>>>  > sleep 30
>>>  > pfexec zoneadm -z test halt
>>>  > echo "`date`: ZONE HALT"
>>>  > sleep 10
>>>  > done
>>>
>>> This script works just fine for a while, but eventually zoneadm hangs
>>> (was at pass #90 in my last test).  When this happens, zoneadm is shown
>>> to be consuming quite a bit of CPU:
>>>
>>>     PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>>>
>>>   16598 root       11M 3140K run      1    0   0:54:49  74% zoneadm/1
>>>
>>>
>>> A stack trace of zoneadm shows:
>>>
>>> isc...@osol-isc:~$ pfexec pstack `pgrep zoneadm`
>>> 16082:      zoneadmd -z test
>>> -----------------  lwp# 1  --------------------------------
>>> -----------------  lwp# 2  --------------------------------
>>>   feef41c6 door     (0, 0, 0, 0, 0, 8)
>>>   feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
>>>   feeee3f3 _thrp_setup (fe5b0a00) + 9b
>>>   feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
>>> -----------------  lwp# 3  --------------------------------
>>>   feef420f __door_return () + 2f
>>> -----------------  lwp# 4  --------------------------------
>>>   feef420f door     (0, 0, 0, fe140e00, f5f00, a)
>>>   feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
>>>   feeee3f3 _thrp_setup (fe5b1a00) + 9b
>>>   feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
>>> 16598:      zoneadm -z test boot
>>>   feef3fc8 door     (6, 80476d0, 0, 0, 0, 3)
>>>   feede653 door_call (6, 80476d0, 400, fe3d43f7) + 7b
>>>   fe3d44f0 zonecfg_call_zoneadmd (8047e33, 8047730, 8078448, 1) + 124
>>>   0805792d boot_func (0, 8047d74, 100, 805ff0b) + 1cd
>>>   08060125 main     (4, 8047d64, 8047d78, 805570f) + 2b9
>>>   0805576d _start   (4, 8047e28, 8047e30, 8047e33, 8047e38, 0) + 7d
>>>
>>>
>>> A stack trace of zoneadmd shows:
>>>
>>> isc...@osol-isc:~$ pfexec pstack `pgrep zoneadmd`
>>> 16082:      zoneadmd -z test
>>> -----------------  lwp# 1  --------------------------------
>>> -----------------  lwp# 2  --------------------------------
>>>   feef41c6 door     (0, 0, 0, 0, 0, 8)
>>>   feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
>>>   feeee3f3 _thrp_setup (fe5b0a00) + 9b
>>>   feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
>>> -----------------  lwp# 3  --------------------------------
>>>   feef4147 __door_ucred (80a37c8, fef81000, fe23e838, feed9cfe) + 27
>>>   feed9d0d door_ucred (fe23f870, 1000, 0, 0) + 32
>>>   08058a88 server   (0, fe23f8f0, 510, 0, 0, 8058a04) + 84
>>>   feef4240 __door_return () + 60
>>> -----------------  lwp# 4  --------------------------------
>>>   feef420f door     (0, 0, 0, fe140e00, f5f00, a)
>>>   feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
>>>   feeee3f3 _thrp_setup (fe5b1a00) + 9b
>>>   feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
>>>
>>> A truss of zoneadm (-f -vall -wall -tall) shows this looping:
>>>
>>> 16598:  door_call(6, 0x080476D0)                        = 0
>>> 16598:          data_ptr=8047730 data_size=0
>>> 16598:          desc_ptr=0x0 desc_num=0
>>> 16598:          rbuf=0x807F2D8 rsize=4096
>>> 16598:  close(6)                                        = 0
>>> 16598:  mkdir("/var/run/zones", 0700)                   Err#17 EEXIST
>>> 16598:  chmod("/var/run/zones", 0700)                   = 0
>>> 16598:  open("/var/run/zones/test.zoneadm.lock", O_RDWR|O_CREAT, 0600) = 6
>>> 16598:  fcntl(6, F_SETLKW, 0x08046DC0)                  = 0
>>> 16598:          typ=F_WRLCK  whence=SEEK_SET start=0     len=0 sys=4277003009 pid=6
>>> 16598:  open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 7
>>> 16598:  door_info(7, 0x08047230)                        = 0
>>> 16598:          target=16082 proc=0x8058A04 data=0x0
>>> 16598:          attributes=DOOR_UNREF|DOOR_REFUSE_DESC|DOOR_NO_CANCEL
>>> 16598:          uniquifier=26426
>>> 16598:  close(7)                                        = 0
>>> 16598:  close(6)                                        = 0
>>> 16598:  open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 6
>>> 16082/3:        door_return(0x00000000, 0, 0x00000000, 0xFE23FE00, 1007360) = 0
>>> 16082/3:        door_ucred(0x080A37C8)                          = 0
>>> 16082/3:                euid=0 egid=0
>>> 16082/3:                ruid=0 rgid=0
>>> 16082/3:                pid=16598 zoneid=0
>>> 16082/3:                E: all
>>> 16082/3:                I: basic
>>> 16082/3:                P: all
>>> 16082/3:                L: all
>>>
>>> PID 16598 is zoneadm and PID 16082 is zoneadmd.
>>>
>>> Is this a known issue?  Are there any other things that I can do to
>>> help debug this situation?  Once things get into this state, I have
>>> only been able to recover by rebooting the zone.
>>>
>>> Please advise.
>>>
>>> g
>>>
>>> [1] http://kenai.com/projects/isc/pages/OpenSolaris
>>> [2] http://kenai.com/attachments/wiki_images/isc/isc-autonomic-cleansing-time-v1.3.png
>>> _______________________________________________
>>> zones-discuss mailing list
>>> zones-discuss@opensolaris.org

In most cases this is a bad idea.