Thanks for your reply. While the savecore works its way up the chain to (hopefully) Sun, the vendor has asked us not to use x4500-02, so we have moved its load over to x4500-04 and x4500-05. But perhaps moving x4500-02 to Sol 10 10/08 once it is fixed is the way to go.
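(For anyone following along: this is roughly how we confirmed where the dump had landed before handing it over. Only a sketch, and the paths are just the Solaris defaults; your dump device and savecore directory may differ.)

  # dumpadm                        # current dump device, savecore directory, savecore on/off
  # ls -l /var/crash/`hostname`    # the unix.N / vmcore.N pairs written by savecore on reboot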
The savecore had the usual info: everything is blocked waiting on locks.

 601* threads trying to get a mutex (598 user, 3 kernel)
      longest sleeping 10 minutes 13.52 seconds earlier
 115* threads trying to get an rwlock (115 user, 0 kernel)
1678 total threads in allthreads list (1231 user, 447 kernel)
  10 thread_reapcnt
   0 lwp_reapcnt
1688 nthread

thread              pri pctcpu        idle   PID              wchan command
0xfffffe8000137c80   60  0.000   -9m44.88s     0 0xfffffe84d816cdc8 sched
0xfffffe800092cc80   60  0.000   -9m44.52s     0 0xffffffffc03c6538 sched
0xfffffe8527458b40   59  0.005   -1m41.38s  1217 0xffffffffb02339e0 /usr/lib/nfs/rquotad
0xfffffe8527b534e0   60  0.000    -5m4.79s   402 0xfffffe84d816cdc8 /usr/lib/nfs/lockd
0xfffffe852578f460   60  0.000   -4m59.79s   402 0xffffffffc0633fc8 /usr/lib/nfs/lockd
0xfffffe8532ad47a0   60  0.000   -10m4.40s   623 0xfffffe84bde48598 /usr/lib/nfs/nfsd
0xfffffe8532ad3d80   60  0.000   -10m9.10s   623 0xfffffe84d816ced8 /usr/lib/nfs/nfsd
0xfffffe8532ad3360   60  0.000   -10m3.77s   623 0xfffffe84d816cde0 /usr/lib/nfs/nfsd
0xfffffe85341e9100   60  0.000   -10m6.85s   623 0xfffffe84bde48428 /usr/lib/nfs/nfsd
0xfffffe85341e8a40   60  0.000   -10m4.76s   623 0xfffffe84d816ced8 /usr/lib/nfs/nfsd

SolarisCAT(vmcore.0/10X)> tlist sobj locks | grep nfsd | wc -l
680

scl_writer = 0xfffffe8000185c80   <- locking thread

thread 0xfffffe8000185c80
==== kernel thread: 0xfffffe8000185c80  PID: 0 ====
cmd: sched
t_wchan: 0xfffffffffbc8200a  sobj: condition var (from genunix:bflush+0x4d)
t_procp: 0xfffffffffbc22dc0(proc_sched)
  p_as: 0xfffffffffbc24a20(kas)
  zone: global
t_stk: 0xfffffe8000185c80  sp: 0xfffffe8000185aa0  t_stkbase: 0xfffffe8000181000
t_pri: 99(SYS)  pctcpu: 0.000000
t_lwp: 0x0  psrset: 0  last CPU: 0
idle: 44943 ticks (7 minutes 29.43 seconds)
start: Tue Jan 27 23:44:21 2009
age: 674 seconds (11 minutes 14 seconds)
tstate: TS_SLEEP - awaiting an event
tflg:   T_TALLOCSTK - thread structure allocated from stk
tpflg:  none set
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SSYS - system resident process

pc:      0xfffffffffb83616f unix:_resume_from_idle+0xf8 resume_return
startpc: 0xffffffffeff889e0 zfs:spa_async_thread+0x0

unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
genunix:bflush+0x4d()
genunix:ldi_close+0xbe()
zfs:vdev_disk_close+0x6a()
zfs:vdev_close+0x13()
zfs:vdev_raidz_close+0x26()
zfs:vdev_close+0x13()
zfs:vdev_reopen+0x1d()
zfs:spa_async_reopen+0x5f()
zfs:spa_async_thread+0xc8()
unix:thread_start+0x8()
-- end of kernel thread's stack --

In other words, the lock writer everyone is queued behind is the ZFS spa_async thread, asleep in cv_wait() under bflush()/ldi_close() while trying to reopen the replaced vdev; 680 of the blocked threads are nfsd.
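For the archives, plain mdb(1) against the same dump gives roughly the same picture if you don't have SolarisCAT handy. A minimal sketch only: it assumes the default unix.0/vmcore.0 naming in the default crash directory, the thread address is the scl_writer from above, and ::stacks is only present on newer mdb builds (::threadlist -v is there everywhere):

  # cd /var/crash/`hostname`
  # mdb unix.0 vmcore.0
  > ::threadlist -v                              // every kernel thread with its stack and sleep state
  > ::stacks -m zfs                              // unique stacks, filtered to the zfs module (newer mdb)
  > 0xfffffe8000185c80::findstack -v             // the locking thread stuck under bflush/ldi_close
  > ::pgrep nfsd | ::walk thread | ::findstack   // what the blocked nfsd threads are sleeping on
  > $q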
Blake wrote:
> I'm not an authority, but on my 'vanilla' filer, using the same
> controller chipset as the thumper, I've been in really good shape
> since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs
> upgrade' to all my mirrors (3 3-way). I'd been having similar
> troubles to yours in the past.
>
> My system is pretty puny next to yours, but it's been reliable now for
> slightly over a month.
>
>
> On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman <lund...@gmo.jp> wrote:
>> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it
>> was "constantly busy", and since our x4500 has always died miserably in
>> the past when a HDD dies, they wanted to replace it before the HDD
>> actually died.
>>
>> The usual was done, HDD replaced, resilvering started and ran for about
>> 50 minutes. Then the system hung, same as always: all ZFS-related
>> commands would just hang and do nothing. The system is otherwise fine and
>> completely idle.
>>
>> The vendor for some reason decided to fsck root-fs, not sure why as it
>> is mounted with "logging", and also decided it would be best to do so
>> from a CDRom boot.
>>
>> Anyway, that was 12 hours ago and the x4500 is still down. I think they
>> have it at the single-user prompt resilvering again. (I also noticed they'd
>> decided to break the mirror of the root disks for some very strange
>> reason). It still shows:
>>
>>          raidz1        DEGRADED     0     0     0
>>            c0t1d0      ONLINE       0     0     0
>>            replacing   UNAVAIL      0     0     0  insufficient replicas
>>              c1t1d0s0/o OFFLINE     0     0     0
>>              c1t1d0     UNAVAIL     0     0     0  cannot open
>>
>> So I am pretty sure it'll hang again sometime soon. What is interesting
>> though is that this is on x4500-02, and all our previous troubles mailed
>> to the list were regarding our first x4500. The hardware is all
>> different, but identical. Solaris 10 5/08.
>>
>> Anyway, I think they want to boot CDrom to fsck root again for some
>> reason, but since customers have been without their mail for 12 hours,
>> they can go a little longer, I guess.
>>
>> What I was really wondering: has there been any progress or patches
>> regarding the system always hanging whenever a HDD dies (or is replaced,
>> it seems)? It really is rather frustrating.
>>
>> Lund
>>
>> --
>> Jorgen Lundman       | <lund...@lundman.net>
>> Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
>> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
>> Japan                | +81 (0)3-3375-1767 (home)
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>

--
Jorgen Lundman       | <lund...@lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss