Re: [zfs-discuss] raidz recovery

2010-12-21 Thread Gareth de Vaux
Hi, I'm copying the list - I assume you meant to send it there.

On Sun 2010-12-19 (15:52), Miles Nordin wrote:
 If 'zpool replace /dev/ad6' will not accept that the disk is a
 replacement, then you can unplug the disk, erase the label in a
 different machine using
 
 dd if=/dev/zero of=/dev/thedisk bs=512 count=XXX
 dd if=/dev/zero of=/dev/thedisk bs=512 count=XXX seek=YYY
 
 then plug it back into its old spot and issue 'zpool replace /dev/ad6'.
 
 XXX should be about a megabyte's worth of sectors, and YYY should be
 the LBA about 1 MB from the end of the disk.  You can read up or
 experiment to determine the exact values.  You do need to know the
 size of your disk in sectors, though.  There's a copy of the EFI label
 at the end of the disk and another at the beginning, which is why you
 have to do this.

Awesome, that does the trick, thanks. I assumed it was identifying the
disk by serial number or something. I don't need to unplug the disk
though; it works if I zero it from the same machine.
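
For the archives, something like this computes the values Miles
describes (a sketch only; it assumes FreeBSD's diskinfo(8) default
field order and 512-byte sectors):

  DISK=/dev/ad6
  SECTORS=$(diskinfo $DISK | awk '{print $4}')  # mediasize in sectors
  COUNT=2048                                    # ~1 MB at 512 B/sector
  # wipe the front and back EFI label copies, then replace in place:
  dd if=/dev/zero of=$DISK bs=512 count=$COUNT
  dd if=/dev/zero of=$DISK bs=512 count=$COUNT seek=$(($SECTORS - $COUNT))
  zpool replace pool ad6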

This should probably be implemented as a zpool function, if it hasn't
already been added in later versions.

 In general, especially when a disk has corrupt data on it rather than
 unreadable sectors, it's best to do the replacement in a way that
 keeps the old and new disks available simultaneously, because ZFS
 will sometimes read from the old disk in places where it is still
 correct.  If you take away the old disk, it can't be used at all,
 even where it's correct, so if there are a few spots where the other
 good disks in the raidz also have problems, you will not be able to
 recover that data, while with a suspect old disk still attached you
 could.  OTOH if the old disk has unreadable sectors, the controller
 and ZFS will freeze whenever they touch those sectors, causing the
 replacement to take forever.  This is kind of bullshit and should be
 solved in software IMNSHO, but it's how things are.  So if you have a
 physically failing disk I would suggest running the replace/resilver
 with the failing disk physically removed (while if the disk merely
 has bad data on it and is not physically failing, I suggest keeping
 it connected somehow).  So... yeah... if there is corrupt data on
 this disk, you'll have to buy another disk to follow the advice in
 this paragraph.  You can go ahead and break the advice, though: wipe
 the label and replace.

Noted. Though if there are a few spots with problems on the other good
disks, ZFS should know about them, right?
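
For reference, the two replacement modes being contrasted look like
this (illustrative; ad7 stands in for a spare disk that isn't part of
this pool):

  # suspect disk stays attached as an extra data source during resilver:
  zpool replace pool ad6 ad7
  # in-place replace after wiping the label; old disk's data unavailable:
  zpool replace pool ad6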


Re: [zfs-discuss] raidz recovery

2010-12-18 Thread Tuomas Leikola
On Wed, Dec 15, 2010 at 3:29 PM, Gareth de Vaux z...@lordcow.org wrote:
 On Mon 2010-12-13 (16:41), Marion Hakanson wrote:
 After you clear the errors, do another scrub before trying anything
 else.  Once you get a complete scrub with no new errors (and no checksum
 errors), you should be confident that the damaged drive has been fully
 re-integrated into the pool.

 OK, I did a scrub after zeroing the errors, and the array came back
 clean, apparently, but same final result - the array faults as soon as
 I 'offline' a different vdev. The zeroing is just a
 pretend-the-errors-aren't-there directive, and the scrub seems to go
 along with that. What I need in this situation is a way to prompt ad6
 to resilver from scratch.


I think scrub only repairs data reachable from the active dataset; it
doesn't rewrite the drive labels, superblocks, or other metadata
outside it.

Have you tried zpool replace? I.e. remove ad6, fill it with zeroes, put
it back, and run 'zpool replace tank ad6'. That should simulate a drive
failure and replacement with a new disk.
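
In outline (SECTORS here stands for the disk's size in 512-byte
sectors; note that both EFI label copies, at the start and the end of
the disk, need wiping):

  zpool offline tank ad6
  dd if=/dev/zero of=/dev/ad6 bs=512 count=2048   # front label copy
  dd if=/dev/zero of=/dev/ad6 bs=512 count=2048 seek=$((SECTORS - 2048))
  zpool replace tank ad6    # resilvers it as if it were a new disk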

-- 
- Tuomas


Re: [zfs-discuss] raidz recovery

2010-12-18 Thread Gareth de Vaux
On Sat 2010-12-18 (14:55), Tuomas Leikola wrote:
 Have you tried zpool replace? I.e. remove ad6, fill it with zeroes,
 put it back, and run 'zpool replace tank ad6'. That should simulate a
 drive failure and replacement with a new disk.

'replace' requires a different disk to replace with.

How do you remove ad6?


Re: [zfs-discuss] raidz recovery

2010-12-15 Thread Gareth de Vaux
On Mon 2010-12-13 (16:41), Marion Hakanson wrote:
 After you clear the errors, do another scrub before trying anything
 else.  Once you get a complete scrub with no new errors (and no checksum
 errors), you should be confident that the damaged drive has been fully
 re-integrated into the pool.

OK, I did a scrub after zeroing the errors, and the array came back
clean, apparently, but same final result - the array faults as soon as
I 'offline' a different vdev. The zeroing is just a
pretend-the-errors-aren't-there directive, and the scrub seems to go
along with that. What I need in this situation is a way to prompt ad6
to resilver from scratch.

Btw I can reproduce this behaviour every time. I can also produce
faultless behaviour by offlining and then onlining, or replacing disks
repeatedly, as expected.
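
For the record, the repro sequence is roughly this (a sketch - exact dd
flags illustrative; pool and device names as in my original mail):

  zpool offline pool ad6
  dd if=/dev/zero of=/dev/ad6 bs=1m       # simulate a wiped disk
  zpool online pool ad6                   # triggers a resilver
  zpool scrub pool                        # completes, CKSUM errors on ad6
  zpool clear pool
  zpool scrub pool                        # completes clean this time
  zpool offline pool ad4                  # offline a different member
  # reboot -> the pool faults, so ad6 was never fully rebuilt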


Re: [zfs-discuss] raidz recovery

2010-12-13 Thread Marion Hakanson
z...@lordcow.org said:
 For example when I 'dd if=/dev/zero of=/dev/ad6', or physically remove
 the drive for a while, then 'online' the disk, after it resilvers I'm
 typically left with the following after scrubbing:
 
 r...@file:~# zpool status
   pool: pool
  state: ONLINE
 status: One or more devices has experienced an unrecoverable error.  An
         attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
         using 'zpool clear' or replace the device with 'zpool replace'.
    see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: scrub completed after 0h0m with 0 errors on Fri Dec 10 23:45:56 2010
 config:
 
         NAME        STATE     READ WRITE CKSUM
         pool        ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             ad12    ONLINE       0     0     0
             ad13    ONLINE       0     0     0
             ad4     ONLINE       0     0     0
             ad6     ONLINE       0     0     7
 
 errors: No known data errors
 
 http://www.sun.com/msg/ZFS-8000-9P lists my above actions as a cause
 for this state and rightly doesn't consider them serious. When I
 'clear' the errors, though, and then offline/fault another drive and
 reboot, the array faults. That tells me ad6 was never fully integrated
 back in. Can I tell the array to re-add ad6 from scratch? 'detach' and
 'remove' don't work for raidz. Otherwise I need to use 'replace' to
 get out of this situation.


After you clear the errors, do another scrub before trying anything
else.  Once you get a complete scrub with no new errors (and no checksum
errors), you should be confident that the damaged drive has been fully
re-integrated into the pool.
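
Concretely, using the pool name from your status output:

  zpool clear pool
  zpool scrub pool
  zpool status -v pool   # check for any new READ/WRITE/CKSUM counts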

Regards,

Marion




[zfs-discuss] raidz recovery

2010-12-11 Thread Gareth de Vaux
Hi all, I'm trying to simulate a drive failure and recovery on a
raidz array. I'm able to do so using 'replace', but this requires
an extra disk that was not part of the array. How do you manage when
you don't have or need an extra disk yet?

For example when I 'dd if=/dev/zero of=/dev/ad6', or physically remove
the drive for a while, then 'online' the disk, after it resilvers I'm
typically left with the following after scrubbing:

r...@file:~# zpool status
  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Fri Dec 10 23:45:56 2010
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     7

errors: No known data errors


http://www.sun.com/msg/ZFS-8000-9P lists my above actions as a cause
for this state and rightly doesn't consider them serious. When I
'clear' the errors, though, and then offline/fault another drive and
reboot, the array faults. That tells me ad6 was never fully integrated
back in. Can I tell the array to re-add ad6 from scratch? 'detach' and
'remove' don't work for raidz. Otherwise I need to use 'replace' to
get out of this situation.
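
In commands, what I mean is (ad7 being a hypothetical extra disk I'd
rather not have to buy):

  zpool detach pool ad6        # fails: detach only applies to mirrors
  zpool remove pool ad6        # fails: remove only handles spares/cache
  zpool replace pool ad6 ad7   # works, but requires the extra disk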

My system:

r...@file:~# uname -a
FreeBSD file 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #0: Sun Nov 28 13:36:08 SAST 2010     r...@file:/usr/obj/usr/src/sys/COWNEL  amd64
r...@file:~# dmesg | grep ZFS
ZFS filesystem version 4
ZFS storage pool version 15