Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Robert Milkowski


I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.



It seems like it is event driven. Hmmm... perhaps it shouldn't be.

Anyway, you can do 'zpool replace' and it is the same thing; why wouldn't it be?

--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Peter Tribble
On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock <eric.schr...@oracle.com> wrote:

 On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

 I have a pool (on an X4540 running S10U8) in which a disk failed, and the
 hot spare kicked in. That's perfect. I'm happy.

 Then a second disk fails.

 Now, I've replaced the first failed disk, and it's resilvered and I have my
 hot spare back.

 But: why hasn't it used the spare to cover the other failed drive? And
 can I hotspare it manually?  I could do a straight replace, but that
 isn't quite the same thing.

 Hot spares are only activated in response to a fault received by the 
 zfs-retire FMA agent.  There is no notion that the spares should be 
 re-evaluated when they become available at a later point in time.  Certainly 
 a reasonable RFE, but not something ZFS does today.

Definitely an RFE I would like.

 You can 'zpool attach' the spare like a normal device - that's all that the 
 retire agent is doing under the hood.

So, given:

NAMESTATE READ WRITE CKSUM
images  DEGRADED 0 0 0
  raidz1DEGRADED 0 0 0
c2t0d0  FAULTED  4 0 0  too many errors
c3t0d0  ONLINE   0 0 0
c4t0d0  ONLINE   0 0 0
c5t0d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
c1t1d0  ONLINE   0 0 0
c2t1d0  ONLINE   0 0 0
c3t1d0  ONLINE   0 0 0
c4t1d0  ONLINE   0 0 0
spares
  c5t7d0AVAIL

then it would be this?

zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says "The
existing device cannot be part of a raidz configuration."

If I try that it fails, saying:
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images.
Please see zpool(1M).

Thanks!

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/


Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Robert Milkowski


On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock <eric.schr...@oracle.com> wrote:

On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:


I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire 
FMA agent.  There is no notion that the spares should be re-evaluated when they 
become available at a later point in time.  Certainly a reasonable RFE, but not 
something ZFS does today.

Definitely an RFE I would like.


You can 'zpool attach' the spare like a normal device - that's all that the 
retire agent is doing under the hood.

So, given:

 NAMESTATE READ WRITE CKSUM
 images  DEGRADED 0 0 0
   raidz1DEGRADED 0 0 0
 c2t0d0  FAULTED  4 0 0  too many errors
 c3t0d0  ONLINE   0 0 0
 c4t0d0  ONLINE   0 0 0
 c5t0d0  ONLINE   0 0 0
 c0t1d0  ONLINE   0 0 0
 c1t1d0  ONLINE   0 0 0
 c2t1d0  ONLINE   0 0 0
 c3t1d0  ONLINE   0 0 0
 c4t1d0  ONLINE   0 0 0
 spares
   c5t7d0AVAIL

then it would be this?

zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says "The
existing device cannot be part of a raidz configuration."

If I try that it fails, saying:
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images.
Please see zpool(1M).

Thanks!


You need to use zpool replace.
Once you replace the failed drive and it resilvers, the hot spare will
detach automatically (regardless of whether you forced it to kick in via
zpool replace or FMA activated it).


For more details see http://blogs.sun.com/eschrock/entry/zfs_hot_spares
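The sequence Robert describes can be sketched as a short shell session (untested; device names are taken from the pool layout quoted above, so adjust them for your own system):

```shell
# Press the available hot spare into service for the faulted disk.
zpool replace images c2t0d0 c5t7d0

# Confirm the spare now shows as INUSE in the pool status.
zpool status images

# After physically swapping out the bad c2t0d0, resilver onto the new disk.
# When the resilver completes, the spare detaches back to AVAIL on its own.
zpool replace images c2t0d0
```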

--
Robert Milkowski
http://milek.blogspot.com





Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Ian Collins

On 03/31/10 10:54 PM, Peter Tribble wrote:

On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock <eric.schr...@oracle.com> wrote:

On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:


I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire 
FMA agent.  There is no notion that the spares should be re-evaluated when they 
become available at a later point in time.  Certainly a reasonable RFE, but not 
something ZFS does today.

Definitely an RFE I would like.


You can 'zpool attach' the spare like a normal device - that's all that the 
retire agent is doing under the hood.

So, given:

 NAMESTATE READ WRITE CKSUM
 images  DEGRADED 0 0 0
   raidz1DEGRADED 0 0 0
 c2t0d0  FAULTED  4 0 0  too many errors
 c3t0d0  ONLINE   0 0 0
 c4t0d0  ONLINE   0 0 0
 c5t0d0  ONLINE   0 0 0
 c0t1d0  ONLINE   0 0 0
 c1t1d0  ONLINE   0 0 0
 c2t1d0  ONLINE   0 0 0
 c3t1d0  ONLINE   0 0 0
 c4t1d0  ONLINE   0 0 0
 spares
   c5t7d0AVAIL

then it would be this?

zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says "The
existing device cannot be part of a raidz configuration."

If I try that it fails, saying:
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images.
Please see zpool(1M).


What happens if you remove it as a spare first?
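Ian's suggestion would presumably look something like this (untested sketch; c5t7d0 and c2t0d0 are the devices from the status output quoted above):

```shell
# Release the disk from hot-spare duty so it is no longer reserved...
zpool remove images c5t7d0

# ...then use it as an ordinary replacement for the faulted raidz member.
zpool replace images c2t0d0 c5t7d0
```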

--
Ian.



Re: [zfs-discuss] Simultaneous failure recovery

2010-03-31 Thread Eric Schrock

On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

 I have a pool (on an X4540 running S10U8) in which a disk failed, and the
 hot spare kicked in. That's perfect. I'm happy.
 
 Then a second disk fails.
 
 Now, I've replaced the first failed disk, and it's resilvered and I have my
 hot spare back.
 
 But: why hasn't it used the spare to cover the other failed drive? And
 can I hotspare it manually?  I could do a straight replace, but that
 isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire 
FMA agent.  There is no notion that the spares should be re-evaluated when they 
become available at a later point in time.  Certainly a reasonable RFE, but not 
something ZFS does today.

You can 'zpool attach' the spare like a normal device - that's all that the 
retire agent is doing under the hood.

Hope that helps,

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock



Re: [zfs-discuss] Simultaneous failure recovery

2010-03-30 Thread Ian Collins

On 03/31/10 10:39 AM, Peter Tribble wrote:

I have a pool (on an X4540 running S10U8) in which a disk failed, and the
hot spare kicked in. That's perfect. I'm happy.

Then a second disk fails.

Now, I've replaced the first failed disk, and it's resilvered and I have my
hot spare back.

But: why hasn't it used the spare to cover the other failed drive? And
can I hotspare it manually?  I could do a straight replace, but that
isn't quite the same thing.

Was the spare still available when the second drive failed?  If not, I don't
think it will get used.  My understanding is that spares are attached when a
drive is faulted, so it's an event-driven rather than a level-driven action.


At least I'm not the only one seeing multiple drive failures this week!

--
Ian.
