Re: [zfs-discuss] Replacing HDD in x4500
I've been told we got a BugID:

    3-way deadlock happens in ufs filesystem on zvol when writing ufs log

but I cannot view the BugID yet (presumably due to my account's weak credentials).

Perhaps it isn't something we are doing wrong; that would be a nice change.

Lund

Jorgen Lundman wrote:
>> I assume you've changed the failmode to continue already?
>>
>> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
>
> This appears to be new to 10/08, so that is another vote to upgrade.
>
> Also interesting that the default is wait, since that is almost how the system behaves already. Not sure why it would block zpool, zfs and df commands as well though?
>
> Lund

--
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replacing HDD in x4500
I'm not an authority, but on my 'vanilla' filer, using the same controller chipset as the thumper, I've been in really good shape since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs upgrade' to all my mirrors (3 3-way). I'd been having similar troubles to yours in the past.

My system is pretty puny next to yours, but it's been reliable now for slightly over a month.

On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman lund...@gmo.jp wrote:
> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it was constantly busy, and since our x4500 has always died miserably in the past when an HDD dies, they wanted to replace it before the HDD actually died.
>
> The usual was done: HDD replaced, resilvering started, and it ran for about 50 minutes. Then the system hung, same as always: all ZFS-related commands would just hang and do nothing. System is otherwise fine and completely idle.
>
> The vendor for some reason decided to fsck root-fs, not sure why as it is mounted with logging, and also decided it would be best to do so from a CD-ROM boot.
>
> Anyway, that was 12 hours ago and the x4500 is still down. I think they have it at single-user prompt resilvering again. (I also noticed they'd decided to break the mirror of the root disks for some very strange reason). It still shows:
>
>   raidz1          DEGRADED  0  0  0
>     c0t1d0        ONLINE    0  0  0
>     replacing     UNAVAIL   0  0  0  insufficient replicas
>       c1t1d0s0/o  OFFLINE   0  0  0
>       c1t1d0      UNAVAIL   0  0  0  cannot open
>
> So I am pretty sure it'll hang again sometime soon. What is interesting though is that this is on x4500-02, and all our previous troubles mailed to the list were regarding our first x4500. The hardware is all separate, but identical. Solaris 10 5/08.
>
> Anyway, I think they want to boot from CD-ROM to fsck root again for some reason, but since customers have been without their mail for 12 hours, they can go a little longer, I guess.
>
> What I was really wondering is whether there has been any progress or patches regarding the system always hanging whenever an HDD dies (or is replaced, it seems). It really is rather frustrating.
>
> Lund
Re: [zfs-discuss] Replacing HDD in x4500
Thanks for your reply,

While the savecore is working its way up the chain to (hopefully) Sun, the vendor asked us not to use x4500-02, so we moved its load to x4500-04 and x4500-05. But perhaps moving to Sol 10 10/08 on x4500-02 when fixed is the way to go.

The savecore had the usual info, that everything is blocked waiting on locks:

601* threads trying to get a mutex (598 user, 3 kernel)
        longest sleeping 10 minutes 13.52 seconds earlier
115* threads trying to get an rwlock (115 user, 0 kernel)

1678 total threads in allthreads list (1231 user, 447 kernel)
10 thread_reapcnt
0 lwp_reapcnt
1688 nthread

thread          pri  pctcpu  idle        PID   wchan           command
0xfe8000137c80  60   0.000   -9m44.88s   0     0xfe84d816cdc8  sched
0xfe800092cc80  60   0.000   -9m44.52s   0     0xc03c6538      sched
0xfe8527458b40  59   0.005   -1m41.38s   1217  0xb02339e0      /usr/lib/nfs/rquotad
0xfe8527b534e0  60   0.000   -5m4.79s    402   0xfe84d816cdc8  /usr/lib/nfs/lockd
0xfe852578f460  60   0.000   -4m59.79s   402   0xc0633fc8      /usr/lib/nfs/lockd
0xfe8532ad47a0  60   0.000   -10m4.40s   623   0xfe84bde48598  /usr/lib/nfs/nfsd
0xfe8532ad3d80  60   0.000   -10m9.10s   623   0xfe84d816ced8  /usr/lib/nfs/nfsd
0xfe8532ad3360  60   0.000   -10m3.77s   623   0xfe84d816cde0  /usr/lib/nfs/nfsd
0xfe85341e9100  60   0.000   -10m6.85s   623   0xfe84bde48428  /usr/lib/nfs/nfsd
0xfe85341e8a40  60   0.000   -10m4.76s   623   0xfe84d816ced8  /usr/lib/nfs/nfsd

SolarisCAT(vmcore.0/10X) tlist sobj locks | grep nfsd | wc -l
680

scl_writer = 0xfe8000185c80 - locking thread

thread 0xfe8000185c80
kernel thread: 0xfe8000185c80  PID: 0
cmd: sched
t_wchan: 0xfbc8200a  sobj: condition var (from genunix:bflush+0x4d)
t_procp: 0xfbc22dc0(proc_sched)
  p_as: 0xfbc24a20(kas)
  zone: global
t_stk: 0xfe8000185c80  sp: 0xfe8000185aa0  t_stkbase: 0xfe8000181000
t_pri: 99(SYS)  pctcpu: 0.00
t_lwp: 0x0  psrset: 0  last CPU: 0
idle: 44943 ticks (7 minutes 29.43 seconds)
start: Tue Jan 27 23:44:21 2009
age: 674 seconds (11 minutes 14 seconds)
tstate: TS_SLEEP - awaiting an event
tflg:   T_TALLOCSTK - thread structure allocated from stk
tpflg:  none set
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SSYS - system resident process

pc:      0xfb83616f unix:_resume_from_idle+0xf8 resume_return
startpc: 0xeff889e0 zfs:spa_async_thread+0x0

unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
genunix:bflush+0x4d()
genunix:ldi_close+0xbe()
zfs:vdev_disk_close+0x6a()
zfs:vdev_close+0x13()
zfs:vdev_raidz_close+0x26()
zfs:vdev_close+0x13()
zfs:vdev_reopen+0x1d()
zfs:spa_async_reopen+0x5f()
zfs:spa_async_thread+0xc8()
unix:thread_start+0x8()
-- end of kernel thread's stack --

Blake wrote:
> I'm not an authority, but on my 'vanilla' filer, using the same controller chipset as the thumper, I've been in really good shape since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs upgrade' to all my mirrors (3 3-way). I'd been having similar troubles to yours in the past.
>
> My system is pretty puny next to yours, but it's been reliable now for slightly over a month.
>
> On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman lund...@gmo.jp wrote:
>> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it was constantly busy, and since our x4500 has always died miserably in the past when an HDD dies, they wanted to replace it before the HDD actually died.
>>
>> The usual was done: HDD replaced, resilvering started, and it ran for about 50 minutes. Then the system hung, same as always: all ZFS-related commands would just hang and do nothing. System is otherwise fine and completely idle.
>>
>> The vendor for some reason decided to fsck root-fs, not sure why as it is mounted with logging, and also decided it would be best to do so from a CD-ROM boot.
>>
>> Anyway, that was 12 hours ago and the x4500 is still down. I think they have it at single-user prompt resilvering again. (I also noticed they'd decided to break the mirror of the root disks for some very strange reason). It still shows:
>>
>>   raidz1          DEGRADED  0  0  0
>>     c0t1d0        ONLINE    0  0  0
>>     replacing     UNAVAIL   0  0  0  insufficient replicas
>>       c1t1d0s0/o  OFFLINE   0  0  0
>>       c1t1d0      UNAVAIL   0  0  0  cannot open
>>
>> So I am pretty sure it'll hang again sometime soon. What is interesting though is that this is on x4500-02, and all our previous troubles mailed to the list were regarding our first x4500. The hardware is all separate, but identical. Solaris 10 5/08.
>>
>> Anyway, I think they want to boot from CD-ROM to
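One way to read the thread listing in the savecore message above is to group the sleeping threads by wait channel (`wchan`), since threads that share a `wchan` are waiting on the same object. The following is a minimal Python sketch of that grouping, not part of the original analysis. The rows are copied from the listing as they appear in the archive, and the helper name `group_by_wchan` is an assumption made for illustration.

```python
from collections import defaultdict

# Rows copied from the tlist output in the message above
# (columns: thread pri pctcpu idle PID wchan command).
TLIST = """\
0xfe8000137c80 60 0.000 -9m44.88s 0 0xfe84d816cdc8 sched
0xfe800092cc80 60 0.000 -9m44.52s 0 0xc03c6538 sched
0xfe8527458b40 59 0.005 -1m41.38s 1217 0xb02339e0 /usr/lib/nfs/rquotad
0xfe8527b534e0 60 0.000 -5m4.79s 402 0xfe84d816cdc8 /usr/lib/nfs/lockd
0xfe852578f460 60 0.000 -4m59.79s 402 0xc0633fc8 /usr/lib/nfs/lockd
0xfe8532ad47a0 60 0.000 -10m4.40s 623 0xfe84bde48598 /usr/lib/nfs/nfsd
0xfe8532ad3d80 60 0.000 -10m9.10s 623 0xfe84d816ced8 /usr/lib/nfs/nfsd
0xfe8532ad3360 60 0.000 -10m3.77s 623 0xfe84d816cde0 /usr/lib/nfs/nfsd
0xfe85341e9100 60 0.000 -10m6.85s 623 0xfe84bde48428 /usr/lib/nfs/nfsd
0xfe85341e8a40 60 0.000 -10m4.76s 623 0xfe84d816ced8 /usr/lib/nfs/nfsd
"""


def group_by_wchan(text):
    """Map each wait channel to the (thread, command) pairs sleeping on it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed rows
        thread, _pri, _pctcpu, _idle, _pid, wchan, command = fields
        groups[wchan].append((thread, command))
    return dict(groups)


groups = group_by_wchan(TLIST)

# Keep only wait channels that more than one thread is sleeping on.
shared = {w: t for w, t in groups.items() if len(t) > 1}
for wchan, threads in shared.items():
    print(wchan, [command for _thread, command in threads])
```

On this excerpt only two wait channels are shared by more than one thread. The full dump has far more threads than the ten rows shown, so this says nothing about which lock is the real bottleneck.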
Re: [zfs-discuss] Replacing HDD in x4500
On Tue, Jan 27, 2009 at 9:28 PM, Jorgen Lundman lund...@gmo.jp wrote:
> Thanks for your reply,
>
> While the savecore is working its way up the chain to (hopefully) Sun, the vendor asked us not to use x4500-02, so we moved its load to x4500-04 and x4500-05. But perhaps moving to Sol 10 10/08 on x4500-02 when fixed is the way to go.
>
> The savecore had the usual info, that everything is blocked waiting on locks:

I assume you've changed the failmode to continue already?

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
Re: [zfs-discuss] Replacing HDD in x4500
> I assume you've changed the failmode to continue already?
>
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

This appears to be new to 10/08, so that is another vote to upgrade.

Also interesting that the default is wait, since that is almost how the system behaves already. Not sure why it would block zpool, zfs and df commands as well though?

Lund

--
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
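For reference, failmode is a pool property that accepts wait, continue or panic, and it is read and changed with `zpool get` and `zpool set`. The Python sketch below only builds those two command lines and does not run them. The helper name `failmode_commands` and the pool name "tank" are placeholders for illustration, not names from this thread. As noted above, the property appears to be new in 10/08, so these commands would presumably fail on a 5/08 system.

```python
# Values the failmode pool property accepts; "wait" is the default.
FAILMODES = ("wait", "continue", "panic")


def failmode_commands(pool, mode="continue"):
    """Return argv lists that show, then set, a pool's failmode property."""
    if mode not in FAILMODES:
        raise ValueError(f"unknown failmode: {mode!r}")
    return [
        ["zpool", "get", "failmode", pool],
        ["zpool", "set", f"failmode={mode}", pool],
    ]


# "tank" is a placeholder pool name (an assumption, not from the thread).
for argv in failmode_commands("tank"):
    print(" ".join(argv))
```

Building the argv lists separately from running them keeps the sketch safe to try on a machine without ZFS; each list could be passed to `subprocess.run` on a real system.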
[zfs-discuss] Replacing HDD in x4500
The vendor wanted to come in and replace an HDD in the 2nd X4500, as it was constantly busy, and since our x4500 has always died miserably in the past when an HDD dies, they wanted to replace it before the HDD actually died.

The usual was done: HDD replaced, resilvering started, and it ran for about 50 minutes. Then the system hung, same as always: all ZFS-related commands would just hang and do nothing. System is otherwise fine and completely idle.

The vendor for some reason decided to fsck root-fs, not sure why as it is mounted with logging, and also decided it would be best to do so from a CD-ROM boot.

Anyway, that was 12 hours ago and the x4500 is still down. I think they have it at single-user prompt resilvering again. (I also noticed they'd decided to break the mirror of the root disks for some very strange reason). It still shows:

  raidz1          DEGRADED  0  0  0
    c0t1d0        ONLINE    0  0  0
    replacing     UNAVAIL   0  0  0  insufficient replicas
      c1t1d0s0/o  OFFLINE   0  0  0
      c1t1d0      UNAVAIL   0  0  0  cannot open

So I am pretty sure it'll hang again sometime soon. What is interesting though is that this is on x4500-02, and all our previous troubles mailed to the list were regarding our first x4500. The hardware is all separate, but identical. Solaris 10 5/08.

Anyway, I think they want to boot from CD-ROM to fsck root again for some reason, but since customers have been without their mail for 12 hours, they can go a little longer, I guess.

What I was really wondering is whether there has been any progress or patches regarding the system always hanging whenever an HDD dies (or is replaced, it seems). It really is rather frustrating.

Lund

--
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
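The zpool status fragment in the message above can be scanned mechanically for devices that are not ONLINE. The following is a minimal Python sketch under the assumption that each row has the form name, state, three error counters, and an optional trailing note. The rows are copied from the message, and the function name `unhealthy` is hypothetical.

```python
# Rows copied from the zpool status fragment in the message above.
STATUS = """\
raidz1 DEGRADED 0 0 0
c0t1d0 ONLINE 0 0 0
replacing UNAVAIL 0 0 0 insufficient replicas
c1t1d0s0/o OFFLINE 0 0 0
c1t1d0 UNAVAIL 0 0 0 cannot open
"""


def unhealthy(text):
    """Return (name, state, note) for every row whose state is not ONLINE."""
    rows = []
    for line in text.splitlines():
        # name, state, read, write, cksum, then the rest of the line as a note
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue  # not a device row
        name, state = parts[0], parts[1]
        note = parts[5].strip() if len(parts) > 5 else ""
        if state != "ONLINE":
            rows.append((name, state, note))
    return rows


for name, state, note in unhealthy(STATUS):
    print(f"{name:12} {state:9} {note}")
```

For this fragment the sketch reports the raidz1 vdev, the replacing vdev and both of its children, which matches the picture of a replacement that never completed.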