Thanks for your reply. While the savecore works its way up the chain to (hopefully) Sun, the vendor has asked us not to use x4500-02, so we have moved its load over to x4500-04 and x4500-05. But perhaps moving x4500-02 to Sol 10 10/08 once it is fixed is the way to go.
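(For anyone following along: this is roughly how we confirmed where the dump had landed before handing it over. Only a sketch, and the paths are just the Solaris defaults; your dump device and savecore directory may differ.)

  # dumpadm                        # current dump device, savecore directory, savecore on/off
  # ls -l /var/crash/`hostname`    # the unix.N / vmcore.N pairs written by savecore on reboot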
The savecore had the usual info: everything is blocked waiting on locks.

 601* threads trying to get a mutex (598 user, 3 kernel)
      longest sleeping 10 minutes 13.52 seconds earlier
 115* threads trying to get an rwlock (115 user, 0 kernel)
1678 total threads in allthreads list (1231 user, 447 kernel)
  10 thread_reapcnt
   0 lwp_reapcnt
1688 nthread

thread              pri pctcpu        idle   PID              wchan command
0xfffffe8000137c80   60  0.000   -9m44.88s     0 0xfffffe84d816cdc8 sched
0xfffffe800092cc80   60  0.000   -9m44.52s     0 0xffffffffc03c6538 sched
0xfffffe8527458b40   59  0.005   -1m41.38s  1217 0xffffffffb02339e0 /usr/lib/nfs/rquotad
0xfffffe8527b534e0   60  0.000    -5m4.79s   402 0xfffffe84d816cdc8 /usr/lib/nfs/lockd
0xfffffe852578f460   60  0.000   -4m59.79s   402 0xffffffffc0633fc8 /usr/lib/nfs/lockd
0xfffffe8532ad47a0   60  0.000   -10m4.40s   623 0xfffffe84bde48598 /usr/lib/nfs/nfsd
0xfffffe8532ad3d80   60  0.000   -10m9.10s   623 0xfffffe84d816ced8 /usr/lib/nfs/nfsd
0xfffffe8532ad3360   60  0.000   -10m3.77s   623 0xfffffe84d816cde0 /usr/lib/nfs/nfsd
0xfffffe85341e9100   60  0.000   -10m6.85s   623 0xfffffe84bde48428 /usr/lib/nfs/nfsd
0xfffffe85341e8a40   60  0.000   -10m4.76s   623 0xfffffe84d816ced8 /usr/lib/nfs/nfsd

SolarisCAT(vmcore.0/10X)> tlist sobj locks | grep nfsd | wc -l
680

scl_writer = 0xfffffe8000185c80   <- locking thread

thread 0xfffffe8000185c80
==== kernel thread: 0xfffffe8000185c80  PID: 0 ====
cmd: sched
t_wchan: 0xfffffffffbc8200a  sobj: condition var (from genunix:bflush+0x4d)
t_procp: 0xfffffffffbc22dc0(proc_sched)
  p_as: 0xfffffffffbc24a20(kas)
  zone: global
t_stk: 0xfffffe8000185c80  sp: 0xfffffe8000185aa0  t_stkbase: 0xfffffe8000181000
t_pri: 99(SYS)  pctcpu: 0.000000
t_lwp: 0x0  psrset: 0  last CPU: 0
idle: 44943 ticks (7 minutes 29.43 seconds)
start: Tue Jan 27 23:44:21 2009
age: 674 seconds (11 minutes 14 seconds)
tstate: TS_SLEEP - awaiting an event
tflg:   T_TALLOCSTK - thread structure allocated from stk
tpflg:  none set
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SSYS - system resident process

pc:      0xfffffffffb83616f unix:_resume_from_idle+0xf8 resume_return
startpc: 0xffffffffeff889e0 zfs:spa_async_thread+0x0

unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
genunix:bflush+0x4d()
genunix:ldi_close+0xbe()
zfs:vdev_disk_close+0x6a()
zfs:vdev_close+0x13()
zfs:vdev_raidz_close+0x26()
zfs:vdev_close+0x13()
zfs:vdev_reopen+0x1d()
zfs:spa_async_reopen+0x5f()
zfs:spa_async_thread+0xc8()
unix:thread_start+0x8()
-- end of kernel thread's stack --

In other words, the lock writer everyone is queued behind is the ZFS spa_async thread, asleep in cv_wait() under bflush()/ldi_close() while trying to reopen the replaced vdev; 680 of the blocked threads are nfsd.
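For the archives, plain mdb(1) against the same dump gives roughly the same picture if you don't have SolarisCAT handy. A minimal sketch only: it assumes the default unix.0/vmcore.0 naming in the default crash directory, the thread address is the scl_writer from above, and ::stacks is only present on newer mdb builds (::threadlist -v is there everywhere):

  # cd /var/crash/`hostname`
  # mdb unix.0 vmcore.0
  > ::threadlist -v                              // every kernel thread with its stack and sleep state
  > ::stacks -m zfs                              // unique stacks, filtered to the zfs module (newer mdb)
  > 0xfffffe8000185c80::findstack -v             // the locking thread stuck under bflush/ldi_close
  > ::pgrep nfsd | ::walk thread | ::findstack   // what the blocked nfsd threads are sleeping on
  > $q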
Blake wrote:
> I'm not an authority, but on my 'vanilla' filer, using the same
> controller chipset as the thumper, I've been in really good shape
> since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs
> upgrade' to all my mirrors (3 3-way). I'd been having similar
> troubles to yours in the past.
>
> My system is pretty puny next to yours, but it's been reliable now for
> slightly over a month.
>
>
> On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman <lund...@gmo.jp> wrote:
>> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it
>> was "constantly busy", and since our x4500 has always died miserably in
>> the past when a HDD dies, they wanted to replace it before the HDD
>> actually died.
>>
>> The usual was done, HDD replaced, resilvering started and ran for about
>> 50 minutes. Then the system hung, same as always: all ZFS-related
>> commands would just hang and do nothing. The system is otherwise fine and
>> completely idle.
>>
>> The vendor for some reason decided to fsck root-fs, not sure why as it
>> is mounted with "logging", and also decided it would be best to do so
>> from a CDRom boot.
>>
>> Anyway, that was 12 hours ago and the x4500 is still down. I think they
>> have it at the single-user prompt resilvering again. (I also noticed they'd
>> decided to break the mirror of the root disks for some very strange
>> reason). It still shows:
>>
>>          raidz1        DEGRADED     0     0     0
>>            c0t1d0      ONLINE       0     0     0
>>            replacing   UNAVAIL      0     0     0  insufficient replicas
>>              c1t1d0s0/o OFFLINE     0     0     0
>>              c1t1d0     UNAVAIL     0     0     0  cannot open
>>
>> So I am pretty sure it'll hang again sometime soon. What is interesting
>> though is that this is on x4500-02, and all our previous troubles mailed
>> to the list were regarding our first x4500. The hardware is all
>> different, but identical. Solaris 10 5/08.
>>
>> Anyway, I think they want to boot CDrom to fsck root again for some
>> reason, but since customers have been without their mail for 12 hours,
>> they can go a little longer, I guess.
>>
>> What I was really wondering: has there been any progress or patches
>> regarding the system always hanging whenever a HDD dies (or is replaced,
>> it seems)? It really is rather frustrating.
>>
>> Lund
>>
>> --
>> Jorgen Lundman       | <lund...@lundman.net>
>> Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
>> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
>> Japan                | +81 (0)3-3375-1767 (home)
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>

--
Jorgen Lundman       | <lund...@lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss