Re: [zfs-discuss] ZFS disk failure question

2009-10-22 Thread Jason Frank
Thank you for your follow-up.  The doc looks great.  Having good
examples goes a long way to helping others that have my problem.

Ideally, the replacement would all happen magically, and I would have
had everything marked as good, with one failed disk (like a certain
other storage vendor that has its beefs with Sun does).  But I can
live with detaching them if I have to.

Another thing that would be nice would be to receive notification of
disk failures from the OS via email or SMS (like the vendor I
previously alluded to), but I know I'm talking crazy now.

Jason

On Thu, Oct 22, 2009 at 2:15 PM, Cindy Swearingen
cindy.swearin...@sun.com wrote:
 Hi Jason,

 Since spare replacement is an important process, I've rewritten this
 section to provide 3 main examples, here:

 http://docs.sun.com/app/docs/doc/817-2271/gcvcw?a=view

 Scroll down to the section:

 Activating and Deactivating Hot Spares in Your Storage Pool

 Example 4–7 Manually Replacing a Disk With a Hot Spare
 Example 4–8 Detaching a Hot Spare After the Failed Disk is Replaced
 Example 4–9 Detaching a Failed Disk and Using the Hot Spare

 The third example is your scenario. I finally listened to the answer,
 which is that you must detach the original disk if you want to continue
 to use the spare and replace the original disk later. It all works as
 described.
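
 For example, a rough sketch of that flow using the device names from
 your pool (adjust the pool and disk names to match your setup):

 1. Detach the original disk so that the spare permanently takes its place.

 # zpool detach tank c8t7d0

 2. Physically replace the failed disk.

 3. Add the new disk back to the pool as a hot spare.

 # zpool add tank spare c8t7d0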

 I see some other improvements coming with spare replacement and will
 provide details when they are available.

 Thanks,

 Cindy

 On 10/14/09 15:54, Jason Frank wrote:

 See, I get overly literal when working on failed production storage
 (and yes, I do have backups...).  I wasn't trying to cancel the
 in-progress spare replacement.  I had a completed spare replacement,
 and I wanted to make it official.  So, that didn't really fit my
 scenario either.

 I'm glad you agree on the brevity of the detach subcommand man page.
 I would guess that the intricacies of the failure modes would probably
 lend themselves to richer content than a man page.

 I'd really like to see some kind of web-based wizard to walk through
 it.  I doubt I'd get motivated to write it myself, though.

 The web page Cindy pointed to does not cover how to make the
 replacement official either.  It gets close.  But at the end, it
 detaches the hot spare, not the original disk.  Everything seems
 to be close, but not quite there.  Of course, now that I've been
 through this once, I'll remember it all.  I'm just thinking of the
 children.

 Also, I wanted to try to reconstruct all of my steps from zpool
 history -i tank.  According to that, zpool decided to replace t7 with
 t11 this morning (why wasn't it last night?), and I offlined, onlined,
 and detached t7, and I was OK.  I did notice that the history records
 internal scrubs, but not resilvers.  It also doesn't record failed
 commands or disk failures in a zpool.  It would be sweet to have a
 line that said something like "marking vdev /dev/dsk/c8t7d0s0 as
 UNAVAIL due to X read errors in Y minutes."  Then we could really see
 what happened.

 Jason

 On Wed, Oct 14, 2009 at 4:32 PM, Eric Schrock eric.schr...@sun.com
 wrote:

 On 10/14/09 14:26, Jason Frank wrote:

 Thank you, that did the trick.  That's not terribly obvious from the
 man page though.  The man page says it detaches a device from a
 mirror, and I had a raidz2.  Since I'm messing with production data, I
 decided I wasn't going to chance it when I was reading the man page.
 You might consider changing the man page, and explaining a little more
 about what it means, maybe even what the circumstances look like where
 you might use it.

 This is covered in the Hot Spares section of the manpage:

     An in-progress spare replacement can be cancelled by detaching
     the hot spare.  If the original faulted device is detached, then
     the hot spare assumes its place in the configuration, and is
     removed from the spare list of all active pools.

 It is true that the description for zpool detach is overly brief and
 could be expanded to include this use case.

 - Eric

 --
 Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock




Re: [zfs-discuss] ZFS disk failure question

2009-10-15 Thread Jason Frank
Thank you, that did the trick.  That's not terribly obvious from the
man page though.  The man page says it detaches a device from a
mirror, and I had a raidz2.  Since I'm messing with production data, I
decided I wasn't going to chance it when I was reading the man page.
You might consider changing the man page, and explaining a little more
about what it means, maybe even what the circumstances look like where
you might use it.

Actually, an official and easily searchable "What to do when you have
a ZFS disk failure" guide, with lots of examples, would be great.  There are a
lot of attempts out there, but nothing I've found is comprehensive.

Jason

On Wed, Oct 14, 2009 at 4:23 PM, Eric Schrock eric.schr...@sun.com wrote:
 On 10/14/09 14:17, Cindy Swearingen wrote:

 Hi Jason,

 I think you are asking how you tell ZFS that you want to replace the
 failed disk c8t7d0 with the spare, c8t11d0.

 I just tried to do this on my Nevada build 124 lab system, simulating a
 disk failure and using zpool replace to replace the failed disk with
 the spare. The replace fails because the spare is busy. This has to be a bug.

 You need to 'zpool detach' the original (c8t7d0).
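
 (Concretely, with the pool from your mail, that would be something
 like:

 # zpool detach tank c8t7d0

 after which c8t11d0 should take over c8t7d0's slot in the raidz2.)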

 - Eric


 Another way to recover, if you have a replacement disk for c8t7d0, is
 like this:

 1. Physically replace c8t7d0.

 You might have to unconfigure the disk first. It depends
 on the hardware.

 2. Tell ZFS that you replaced it.

 # zpool replace tank c8t7d0

 3. Detach the spare.

 # zpool detach tank c8t11d0

 4. Clear the pool or the device specifically.

 # zpool clear tank c8t7d0
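
 5. As an optional sanity check, verify the pool afterward.

 # zpool status -x tank

 This should report the pool as healthy once the new c8t7d0 finishes
 resilvering.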

 Cindy

 On 10/14/09 14:44, Jason Frank wrote:

 So, my Areca controller has been complaining via email of read errors for
 a couple of days on SATA channel 8.  The disk finally gave up last night at
 17:40.  I've got to say I really appreciate the Areca controller taking such
 good care of me.

 For some reason, I wasn't able to log into the server last night or in
 the morning, probably because my home dir was on the zpool with the failed
 disk (although it's a raidz2, so I don't know why that was a problem).  So,
 I went ahead and rebooted it the hard way this morning.

 The reboot went OK, and I was able to get access to my home directory by
 waiting about 5 minutes after authenticating.  I checked my zpool, and it
 was resilvering.  But, it had only been running for a few minutes.
  Evidently, it didn't start resilvering until I rebooted it.  I would have
 expected it to do that when the disk failed last night (I had set up a hot
 spare disk already).

 All of the zpool commands were taking minutes to complete while c8t7d0
 was UNAVAIL, so I offlined it.  When I say all, that includes iostat,
 status, upgrade, just about anything non-destructive that I could try.  That
 was a little odd.  Once I offlined the drive, my resilver restarted, which
 surprised me.  After all, I simply changed an UNAVAIL drive to OFFLINE; in
 either case, you can't use it for operations.  But no big deal there.  That
 fixed the login slowness and the zpool command slowness.

 The resilver completed, and now I'm left with the following zpool config.
  I'm not sure how to get things back to normal though, and I hate to do
 something stupid...

 r...@datasrv1:~# zpool status tank
  pool: tank
  state: DEGRADED
  scrub: scrub stopped after 0h10m with 0 errors on Wed Oct 14 15:23:06 2009
 config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c8t0d0     ONLINE       0     0     0
            c8t1d0     ONLINE       0     0     0
            c8t2d0     ONLINE       0     0     0
            c8t3d0     ONLINE       0     0     0
            c8t4d0     ONLINE       0     0     0
            c8t5d0     ONLINE       0     0     0
            c8t6d0     ONLINE       0     0     0
            spare      DEGRADED     0     0     0
              c8t7d0   REMOVED      0     0     0
              c8t11d0  ONLINE       0     0     0
            c8t8d0     ONLINE       0     0     0
            c8t9d0     ONLINE       0     0     0
            c8t10d0    ONLINE       0     0     0
        spares
          c8t11d0      INUSE     currently in use

 Since it's not obvious, the spare line had both t7 and t11 indented under
 it.
 When the resilver completed, I yanked the hard drive on target 7.

 I'm assuming that t11 has the same content as t7, but that's not
 necessarily clear from the output above.

 So, now I'm left with the following config.  I can't zpool remove t7,
 because it's not a hot spare or a cache disk.  I can't zpool replace t7 with
 t11; I'm told that t11 is busy.  And I didn't see any other zpool
 subcommands that look likely to fix the problem.

 Here are my system details:
 SunOS datasrv1 5.11 snv_118 i86pc i386 i86xpv Solaris

 This system is currently running ZFS pool version 16.

 Pool 'tank' is already formatted using the current version.

 How do I tell the system that t11 is the replacement for t7, and how do I
 then add t7 as the hot spare (after I replace the disk)?

Re: [zfs-discuss] ZFS disk failure question

2009-10-15 Thread Jason Frank
See, I get overly literal when working on failed production storage
(and yes, I do have backups...).  I wasn't trying to cancel the
in-progress spare replacement.  I had a completed spare replacement,
and I wanted to make it official.  So, that didn't really fit my
scenario either.

I'm glad you agree on the brevity of the detach subcommand man page.
I would guess that the intricacies of the failure modes would probably
lend themselves to richer content than a man page.

I'd really like to see some kind of web-based wizard to walk through
it.  I doubt I'd get motivated to write it myself, though.

The web page Cindy pointed to does not cover how to make the
replacement official either.  It gets close.  But at the end, it
detaches the hot spare, not the original disk.  Everything seems
to be close, but not quite there.  Of course, now that I've been
through this once, I'll remember it all.  I'm just thinking of the
children.

Also, I wanted to try to reconstruct all of my steps from zpool
history -i tank.  According to that, zpool decided to replace t7 with
t11 this morning (why wasn't it last night?), and I offlined, onlined,
and detached t7, and I was OK.  I did notice that the history records
internal scrubs, but not resilvers.  It also doesn't record failed
commands or disk failures in a zpool.  It would be sweet to have a
line that said something like "marking vdev /dev/dsk/c8t7d0s0 as
UNAVAIL due to X read errors in Y minutes."  Then we could really see
what happened.
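
For anyone trying the same reconstruction, this is roughly the command
I used (the grep is just one way to pull out the events for a single
device):

# zpool history -i tank | grep c8t7d0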

Jason

On Wed, Oct 14, 2009 at 4:32 PM, Eric Schrock eric.schr...@sun.com wrote:
 On 10/14/09 14:26, Jason Frank wrote:

 Thank you, that did the trick.  That's not terribly obvious from the
 man page though.  The man page says it detaches a device from a
 mirror, and I had a raidz2.  Since I'm messing with production data, I
 decided I wasn't going to chance it when I was reading the man page.
 You might consider changing the man page, and explaining a little more
 about what it means, maybe even what the circumstances look like where
 you might use it.

 This is covered in the Hot Spares section of the manpage:

      An in-progress spare replacement can be cancelled by detaching
      the hot spare.  If the original faulted device is detached, then
      the hot spare assumes its place in the configuration, and is
      removed from the spare list of all active pools.

 It is true that the description for zpool detach is overly brief and could
 be expanded to include this use case.
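
 In other words, with the devices from your pool (a sketch of the two
 outcomes, not an exact transcript):

 # zpool detach tank c8t11d0
     (cancels the spare-in; c8t11d0 goes back to the spares list)

 # zpool detach tank c8t7d0
     (makes the replacement permanent; c8t11d0 takes c8t7d0's place)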

 - Eric

 --
 Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock



[zfs-discuss] ZFS disk failure question

2009-10-14 Thread Jason Frank
So, my Areca controller has been complaining via email of read errors for a
couple of days on SATA channel 8.  The disk finally gave up last night at 17:40.
I've got to say I really appreciate the Areca controller taking such good care
of me.

For some reason, I wasn't able to log into the server last night or in the
morning, probably because my home dir was on the zpool with the failed disk
(although it's a raidz2, so I don't know why that was a problem).  So, I went
ahead and rebooted it the hard way this morning.

The reboot went OK, and I was able to get access to my home directory by 
waiting about 5 minutes after authenticating.  I checked my zpool, and it was 
resilvering.  But, it had only been running for a few minutes.  Evidently, it 
didn't start resilvering until I rebooted it.  I would have expected it to do 
that when the disk failed last night (I had set up a hot spare disk already).

All of the zpool commands were taking minutes to complete while c8t7d0 was
UNAVAIL, so I offlined it.  When I say all, that includes iostat, status,
upgrade, just about anything non-destructive that I could try.  That was a
little odd.  Once I offlined the drive, my resilver restarted, which surprised
me.  After all, I simply changed an UNAVAIL drive to OFFLINE; in either case,
you can't use it for operations.  But no big deal there.  That fixed the login
slowness and the zpool command slowness.
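
In case it helps anyone hitting the same thing, the offline was just the
standard command (with my device name):

# zpool offline tank c8t7d0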

The resilver completed, and now I'm left with the following zpool config.  I'm 
not sure how to get things back to normal though, and I hate to do something 
stupid...

r...@datasrv1:~# zpool status tank
  pool: tank
 state: DEGRADED
 scrub: scrub stopped after 0h10m with 0 errors on Wed Oct 14 15:23:06 2009
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c8t0d0     ONLINE       0     0     0
            c8t1d0     ONLINE       0     0     0
            c8t2d0     ONLINE       0     0     0
            c8t3d0     ONLINE       0     0     0
            c8t4d0     ONLINE       0     0     0
            c8t5d0     ONLINE       0     0     0
            c8t6d0     ONLINE       0     0     0
            spare      DEGRADED     0     0     0
              c8t7d0   REMOVED      0     0     0
              c8t11d0  ONLINE       0     0     0
            c8t8d0     ONLINE       0     0     0
            c8t9d0     ONLINE       0     0     0
            c8t10d0    ONLINE       0     0     0
        spares
          c8t11d0      INUSE     currently in use

Since it's not obvious, the spare line had both t7 and t11 indented under it. 

When the resilver completed, I yanked the hard drive on target 7.

I'm assuming that t11 has the same content as t7, but that's not necessarily 
clear from the output above.

So, now I'm left with the following config.  I can't zpool remove t7, because
it's not a hot spare or a cache disk.  I can't zpool replace t7 with t11; I'm
told that t11 is busy.  And I didn't see any other zpool subcommands that look
likely to fix the problem.

Here are my system details:
SunOS datasrv1 5.11 snv_118 i86pc i386 i86xpv Solaris

This system is currently running ZFS pool version 16.

Pool 'tank' is already formatted using the current version.

How do I tell the system that t11 is the replacement for t7, and how do I then
add t7 as the hot spare (after I replace the disk)?

Thanks