Re: [zfs-discuss] ZFS disk failure question

Cindy Swearingen Thu, 22 Oct 2009 12:16:18 -0700

Hi Jason,

Since spare replacement is an important process, I've rewritten this
section to provide 3 main examples, here:


http://docs.sun.com/app/docs/doc/817-2271/gcvcw?a=view

Scroll down to the section:

Activating and Deactivating Hot Spares in Your Storage Pool

Example 4–7 Manually Replacing a Disk With a Hot Spare
Example 4–8 Detaching a Hot Spare After the Failed Disk is Replaced
Example 4–9 Detaching a Failed Disk and Using the Hot Spare

The third example is your scenario. The answer, which I finally took in,
is that you must detach the original disk if you want to continue to
use the spare and replace the original disk later. It all works as
described.
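
The third scenario can be sketched as the command sequence below. It is
a dry run by default and only prints what it would do. The pool name
tank and the failed disk c8t7d0 are taken from later in this thread; the
run helper, the RUN variable, and the two function names are
hypothetical, made up for illustration.

```shell
#!/bin/sh
# Sketch only: dry run unless RUN=1 is set in the environment.
POOL=tank        # pool name used in this thread
FAILED=c8t7d0    # the original, failed disk (t7 in this thread)

# Hypothetical helper: print the command, or execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Step 1: detach the ORIGINAL disk. The hot spare that already
# resilvered in its place then becomes a permanent pool member.
keep_spare() {
    run zpool detach "$POOL" "$FAILED"
    run zpool status "$POOL"
}

# Step 2 (later, after the failed disk is physically replaced):
# add the new disk back to the pool as a hot spare.
restore_spare() {
    run zpool add "$POOL" spare "$FAILED"
}

keep_spare
restore_spare
```

Reviewing the printed commands before setting RUN=1 seems a sensible
habit on production storage.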

I see some other improvements coming with spare replacement and will
provide details when they are available.

Thanks,

Cindy

On 10/14/09 15:54, Jason Frank wrote:

See, I get overly literal when working on failed production storage
(and yes, I do have backups...)  I wasn't wanting to cancel the
in-progress spare replacement.  I had a completed spare replacement,
and I wanted to make it "official".  So, that didn't really fit my
scenario either.

I'm glad you agree on the brevity of the detach subcommand man page.
I would guess that the intricacies of the failure modes would probably
lend themselves to richer content than a man page.

I'd really like to see some kind of web-based wizard to walk through
it.  I doubt I'd get motivated to write it myself, though.

The web page Cindy pointed to does not cover how to make the
replacement official either.  It gets close.  But at the end, it
detaches the hot spare, and not the original disk.  Everything seems
to be close, but not quite there.  Of course, now that I've been
through this once, I'll remember it all.  I'm just thinking of the
children.

Also, I wanted to try to reconstruct all of my steps from zpool
history -i tank.  According to that, zpool decided to replace t7 with
t11 this morning (why wasn't it last night?), and I offlined, onlined,
and detached t7, and I was OK.  I did notice that the history records
internal scrubs, but not resilvers.  It also doesn't record failed
commands, or disk failures in a zpool.  It would be sweet to have a
line that said something like "marking vdev /dev/dsk/c8t7d0s0 as
UNAVAIL due to X read errors in Y minutes".  Then we could really see
what happened.
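
To reconstruct the steps for one disk, the history output can be
filtered by device name. A small sketch, assuming the device shows up in
the history lines under the short name c8t7d0 (taken from the path
above); the function name history_for_device is made up here.

```shell
#!/bin/sh
# Hypothetical helper: keep only the history lines that mention a device.
# Reads `zpool history` output on stdin; $1 is the device name.
history_for_device() {
    grep -F -- "$1" || true   # exit 0 even when nothing matches
}

# Usage (run on the storage host):
#   zpool history -i tank | history_for_device c8t7d0
```

The match is a fixed string, so c8t7d0 also catches slice names such as
c8t7d0s0. Since resilvers and disk faults are not in the history,
zpool status -v tank is still needed for the current device state.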

Jason

On Wed, Oct 14, 2009 at 4:32 PM, Eric Schrock <eric.schr...@sun.com> wrote:

On 10/14/09 14:26, Jason Frank wrote:

Thank you, that did the trick.  That's not terribly obvious from the
man page though.  The man page says it detaches the devices from a
mirror, and I had a raidz2.  Since I'm messing with production data, I
decided I wasn't going to chance it when I was reading the man page.
You might consider changing the man page, and explaining a little more
what it means, maybe even what the circumstances look like where you
might use it.

This is covered in the "Hot Spares" section of the manpage:

    An in-progress spare replacement can be cancelled by detaching
    the hot spare. If the original faulted device is detached, then
    the hot spare assumes its place in the configuration, and is
    removed from the spare list of all active pools.

It is true that the description for "zpool detach" is overly brief and could
be expanded to include this use case.
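
The two outcomes in that man page paragraph differ only in which device
is detached. A sketch that prints the matching command rather than
running it; the spare name c8t11d0 is an assumption (the thread only
says "t11"), and device_to_detach is a hypothetical helper.

```shell
#!/bin/sh
POOL=tank
FAILED=c8t7d0    # original faulted device
SPARE=c8t11d0    # hot spare; full name assumed from "t11" in this thread

# Hypothetical helper: print the device to detach for a given intent.
#   cancel  -> detach the spare, cancelling the spare replacement
#   promote -> detach the original, so the spare takes its place
device_to_detach() {
    case "$1" in
        cancel)  echo "$SPARE" ;;
        promote) echo "$FAILED" ;;
        *)       echo "usage: device_to_detach cancel|promote" >&2
                 return 2 ;;
    esac
}

# Print (not run) the command for making the spare permanent.
echo "zpool detach $POOL $(device_to_detach promote)"
# prints: zpool detach tank c8t7d0
```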

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock

