Re: [zfs-discuss] Replacing HDD in x4500

2009-02-03 Thread Jorgen Lundman

I've been told we got a BugID:

3-way deadlock happens in ufs filesystem on zvol when writing ufs log

but I cannot view the BugID yet (presumably due to my account's weak 
credentials).

Perhaps it isn't something we are doing wrong; that would be a nice change.
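
For anyone curious, the setup that synopsis describes is simply a UFS 
filesystem living on a ZFS volume; a minimal sketch, with hypothetical 
pool/volume names and mount point:

  # create a 100 GB zvol and put a logging UFS on top of it
  zfs create -V 100g zpool1/mailvol
  newfs /dev/zvol/rdsk/zpool1/mailvol
  mount -F ufs -o logging /dev/zvol/dsk/zpool1/mailvol /export/mail

With logging on, UFS writes its intent log through the zvol, which is 
where the synopsis says the deadlock bites.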

Lund


Jorgen Lundman wrote:
 I assume you've changed the failmode to 'continue' already?

 http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
  
 
 This appears to be new to 10/08, so that is another vote to upgrade. 
 Also interesting that the default is 'wait', since it almost behaves 
 like that already. Not sure why it would block the zpool, zfs, and df 
 commands as well, though?
 
 
 Lund
 
 

-- 
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)


Re: [zfs-discuss] Replacing HDD in x4500

2009-01-27 Thread Blake
I'm not an authority, but on my 'vanilla' filer, using the same
controller chipset as the Thumper, I've been in really good shape
since moving to ZFS boot on 10/08 and running 'zpool upgrade' and 'zfs
upgrade' on all my mirrors (three 3-way).  I'd been having troubles
similar to yours in the past.

My system is pretty puny next to yours, but it's been reliable now for
slightly over a month.
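
In case it helps anyone, the upgrade itself is just the stock commands; 
the '-a' flags here are assumed, to cover every pool and dataset on the 
box:

  # bring all pools, then all datasets, up to the current on-disk versions
  zpool upgrade -a
  zfs upgrade -a

One caveat: once upgraded, the pools can no longer be imported on older 
releases.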




Re: [zfs-discuss] Replacing HDD in x4500

2009-01-27 Thread Jorgen Lundman

Thanks for your reply,

While the savecore is working its way up the chain to (hopefully) Sun, 
the vendor has asked us not to use the machine, so we moved x4500-02's 
load over to x4500-04 and x4500-05. But perhaps moving x4500-02 to Sol 
10 10/08 once it is fixed is the way to go.

The savecore had the usual info: everything is blocked waiting on 
locks:


 601*  threads trying to get a mutex (598 user, 3 kernel)
       longest sleeping 10 minutes 13.52 seconds earlier
 115*  threads trying to get an rwlock (115 user, 0 kernel)

1678   total threads in allthreads list (1231 user, 447 kernel)
  10   thread_reapcnt
   0   lwp_reapcnt
1688   nthread

   thread            pri  pctcpu  idle        PID  wchan            command
   0xfe8000137c80    60   0.000   -9m44.88s     0  0xfe84d816cdc8   sched
   0xfe800092cc80    60   0.000   -9m44.52s     0  0xc03c6538       sched
   0xfe8527458b40    59   0.005   -1m41.38s  1217  0xb02339e0       /usr/lib/nfs/rquotad
   0xfe8527b534e0    60   0.000   -5m4.79s    402  0xfe84d816cdc8   /usr/lib/nfs/lockd
   0xfe852578f460    60   0.000   -4m59.79s   402  0xc0633fc8       /usr/lib/nfs/lockd
   0xfe8532ad47a0    60   0.000   -10m4.40s   623  0xfe84bde48598   /usr/lib/nfs/nfsd
   0xfe8532ad3d80    60   0.000   -10m9.10s   623  0xfe84d816ced8   /usr/lib/nfs/nfsd
   0xfe8532ad3360    60   0.000   -10m3.77s   623  0xfe84d816cde0   /usr/lib/nfs/nfsd
   0xfe85341e9100    60   0.000   -10m6.85s   623  0xfe84bde48428   /usr/lib/nfs/nfsd
   0xfe85341e8a40    60   0.000   -10m4.76s   623  0xfe84d816ced8   /usr/lib/nfs/nfsd

SolarisCAT(vmcore.0/10X)> tlist sobj locks | grep nfsd | wc -l
  680

scl_writer = 0xfe8000185c80  <- locking thread



thread 0xfe8000185c80
  kernel thread: 0xfe8000185c80  PID: 0  cmd: sched
  t_wchan: 0xfbc8200a  sobj: condition var (from genunix:bflush+0x4d)
  t_procp: 0xfbc22dc0(proc_sched)
    p_as: 0xfbc24a20(kas)
    zone: global
  t_stk: 0xfe8000185c80  sp: 0xfe8000185aa0  t_stkbase: 0xfe8000181000
  t_pri: 99(SYS)  pctcpu: 0.00
  t_lwp: 0x0  psrset: 0  last CPU: 0
  idle: 44943 ticks (7 minutes 29.43 seconds)
  start: Tue Jan 27 23:44:21 2009
  age: 674 seconds (11 minutes 14 seconds)
  tstate: TS_SLEEP - awaiting an event
  tflg:   T_TALLOCSTK - thread structure allocated from stk
  tpflg:  none set
  tsched: TS_LOAD - thread is in memory
          TS_DONT_SWAP - thread/LWP should not be swapped
  pflag:  SSYS - system resident process

  pc:      0xfb83616f unix:_resume_from_idle+0xf8 resume_return
  startpc: 0xeff889e0 zfs:spa_async_thread+0x0

unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
genunix:bflush+0x4d()
genunix:ldi_close+0xbe()
zfs:vdev_disk_close+0x6a()
zfs:vdev_close+0x13()
zfs:vdev_raidz_close+0x26()
zfs:vdev_close+0x13()
zfs:vdev_reopen+0x1d()
zfs:spa_async_reopen+0x5f()
zfs:spa_async_thread+0xc8()
unix:thread_start+0x8()
-- end of kernel thread's stack --
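
For the record, the same stack can be pulled out of the dump with plain 
mdb as well; a quick sketch, assuming the dump pair unix.0/vmcore.0 
sitting in /var/crash/<hostname>:

  # open the crash dump, check what it is, then walk the stuck thread
  mdb unix.0 vmcore.0
  > ::status
  > 0xfe8000185c80::findstack -v

The thread address is the scl_writer one from the SolarisCAT output above.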





Re: [zfs-discuss] Replacing HDD in x4500

2009-01-27 Thread Tim
On Tue, Jan 27, 2009 at 9:28 PM, Jorgen Lundman lund...@gmo.jp wrote:


 Thanks for your reply,

 While the savecore is working its way up the chain to (hopefully) Sun,
 the vendor has asked us not to use the machine, so we moved x4500-02's
 load over to x4500-04 and x4500-05. But perhaps moving x4500-02 to Sol
 10 10/08 once it is fixed is the way to go.

 The savecore had the usual info: everything is blocked waiting on
 locks:


I assume you've changed the failmode to 'continue' already?

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
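
For reference, it's just a pool property; a sketch, with 'zpool1' 
standing in for the real pool name:

  # fail I/O to a faulted pool instead of blocking everything
  zpool set failmode=continue zpool1
  zpool get failmode zpool1

The possible values are wait (the default), continue, and panic.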


Re: [zfs-discuss] Replacing HDD in x4500

2009-01-27 Thread Jorgen Lundman
 
 I assume you've changed the failmode to 'continue' already?
 
 http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
  

This appears to be new to 10/08, so that is another vote to upgrade. 
Also interesting that the default is 'wait', since it almost behaves 
like that already. Not sure why it would block the zpool, zfs, and df 
commands as well, though?


Lund


-- 
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)


[zfs-discuss] Replacing HDD in x4500

2009-01-26 Thread Jorgen Lundman

The vendor wanted to come in and replace an HDD in the 2nd X4500, as 
the disk was constantly busy, and since our x4500s have always died 
miserably in the past when an HDD dies, they wanted to replace this 
one before it actually failed.

The usual was done: HDD replaced, resilvering started and ran for about 
50 minutes. Then the system hung, same as always; all ZFS-related 
commands just hang and do nothing. The system is otherwise fine and 
completely idle.

The vendor for some reason decided to fsck the root FS, not sure why 
since it is mounted with logging, and also decided it would be best to 
do so from a CD-ROM boot.

Anyway, that was 12 hours ago and the x4500 is still down. I think they 
have it at the single-user prompt, resilvering again. (I also noticed 
they'd decided to break the mirror of the root disks, for some very 
strange reason.) It still shows:

  raidz1          DEGRADED     0     0     0
    c0t1d0        ONLINE       0     0     0
    replacing     UNAVAIL      0     0     0  insufficient replicas
      c1t1d0s0/o  OFFLINE      0     0     0
      c1t1d0      UNAVAIL      0     0     0  cannot open
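
For completeness, the replace step that kicked off the resilver would 
have been along these lines (the pool name 'zpool1' is a stand-in; the 
device name is from the status above):

  # swap in the new disk and start the resilver, then watch progress
  zpool replace zpool1 c1t1d0
  zpool status -v zpool1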

So I am pretty sure it'll hang again sometime soon. What is interesting, 
though, is that this is on x4500-02, and all our previous troubles mailed 
to the list were regarding our first x4500. The hardware is a physically 
different unit, but identical in configuration. Solaris 10 5/08.

Anyway, I think they want to boot the CD-ROM to fsck root again for some 
reason, but since customers have been without their mail for 12 hours, 
they can go a little longer, I guess.

What I was really wondering: has there been any progress, or are there 
patches, regarding the system always hanging whenever an HDD dies (or, 
it seems, is replaced)? It really is rather frustrating.

Lund

-- 
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss