Re: [zfs-discuss] raidz recovery
Hi, I'm copying the list - assume you meant to send it there.

On Sun 2010-12-19 (15:52), Miles Nordin wrote:
> If 'zpool replace /dev/ad6' will not accept that the disk is a
> replacement, then you can unplug the disk, erase the label in a
> different machine using
>
>   dd if=/dev/zero of=/dev/thedisk bs=512 count=XXX
>   dd if=/dev/zero of=/dev/thedisk bs=512 count=XXX seek=YYY
>
> then plug it back into its old spot and issue 'zpool replace
> /dev/ad6'. XXX should be about a megabyte's worth of sectors, and YYY
> should be the LBA about 1 MB from the end of the disk. You can read or
> experiment to determine the exact values, but you do need to know the
> size of your disk in sectors. There's a copy of the EFI label at the
> end of the disk and another at the beginning, which is why you have to
> zero both spots.

Awesome, that does the trick, thanks. I assumed it was identifying the
disk by serial number or something. I don't need to unplug the disk
though - it works if I zero it from the same machine. This should
probably be implemented as a zpool function, if it hasn't been already
in later versions.

> In general, especially when a disk has corrupt data on it rather than
> unreadable sectors, it's best to do the replacement in a way that
> keeps the old and new disks available simultaneously, because ZFS
> will sometimes read from the old disk in places where it is correct.
> If you take away the old disk, it can't be used at all even where it
> is correct, so if there are a few spots with problems on the other
> good disks in the raidz you will not be able to recover them, while
> with a suspect old disk still attached you could. OTOH if the old
> disk has unreadable sectors, the controller and ZFS will freeze
> whenever they touch those sectors, causing the replacement to take
> forever.
>
> This is kind of bullshit and should be solved in software IMNSHO, but
> it's how things are. So if you have a physically failing disk, I
> would suggest running the replace/resilver with the failing disk
> physically removed (whereas if the disk merely has bad data on it and
> is not physically failing, I suggest keeping it connected somehow).
> So... yeah... if there is corrupt data on this disk, you'll have to
> buy another disk to follow the advice in this paragraph. You can go
> ahead and break the advice, though - wipe the label and replace.

Noted. Though if there are a few spots with problems on the other good
disks, ZFS should know about them, right?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
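The label-wipe recipe quoted above can be sketched as a script. This is a hedged sketch, not from the thread: the device name and sector count below are placeholder assumptions (on FreeBSD, `diskinfo` reports the real sector count), and the dd commands are only echoed so that nothing is overwritten by accident.

```shell
#!/bin/sh
# Sketch of the EFI-label wipe described above. /dev/ad6 and the sector
# count are placeholders; substitute your own values. Commands are
# echoed rather than executed, to avoid accidental data loss.
DISK=/dev/ad6
SECTORS=976773168            # hypothetical total 512-byte sectors

ZAP=2048                     # ~1 MiB worth of 512-byte sectors (the XXX above)
SEEK=$((SECTORS - ZAP))      # LBA ~1 MiB from the end (the YYY above)

echo dd if=/dev/zero of=$DISK bs=512 count=$ZAP            # front label copy
echo dd if=/dev/zero of=$DISK bs=512 count=$ZAP seek=$SEEK # back label copy
```

Remove the `echo`s (and double-check DISK) to run it for real.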
Re: [zfs-discuss] raidz recovery
On Wed, Dec 15, 2010 at 3:29 PM, Gareth de Vaux z...@lordcow.org wrote:
> On Mon 2010-12-13 (16:41), Marion Hakanson wrote:
>> After you clear the errors, do another scrub before trying anything
>> else. Once you get a complete scrub with no new errors (and no
>> checksum errors), you should be confident that the damaged drive has
>> been fully re-integrated into the pool.
>
> Ok, I did a scrub after zeroing and the array came back clean,
> apparently, but same final result - the array faults as soon as I
> 'offline' a different vdev. The zeroing is just a
> pretend-the-errors-aren't-there directive, and the scrub seems to be
> listening to that. What I need in this situation is a way to prompt
> ad6 to resilver from scratch.

I think scrub doesn't replace superblocks, drive labels, or other stuff
outside the active dataset. Have you tried 'zpool replace'? Like:
remove ad6, fill it with zeroes, put it back, then run 'zpool replace
tank ad6'. That should simulate drive failure and replacement with a
new disk.

-- 
- Tuomas
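Tuomas's zero-and-replace suggestion, written out as one possible sequence. A hedged sketch: the pool name (tank) and device (ad6) come from the thread, but the exact ordering is my reading of his suggestion, and `DRYRUN=echo` makes the script print the commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of "remove ad6, fill with zeroes, replace". Pool and
# device names are the ones used in the thread; set DRYRUN= (empty) to
# actually execute instead of just printing each step.
DRYRUN=echo
POOL=tank
DISK=ad6

$DRYRUN zpool offline $POOL $DISK            # take the member out of service
$DRYRUN dd if=/dev/zero of=/dev/$DISK bs=1m  # wipe its contents (and labels)
$DRYRUN zpool replace $POOL $DISK            # resilver it as a "new" disk
$DRYRUN zpool status $POOL                   # watch the resilver progress
```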
Re: [zfs-discuss] raidz recovery
On Sat 2010-12-18 (14:55), Tuomas Leikola wrote:
> Have you tried 'zpool replace'? Like: remove ad6, fill it with
> zeroes, put it back, then run 'zpool replace tank ad6'. That should
> simulate drive failure and replacement with a new disk.

'replace' requires a different disk to replace with. How do you remove
ad6?
Re: [zfs-discuss] raidz recovery
On Mon 2010-12-13 (16:41), Marion Hakanson wrote:
> After you clear the errors, do another scrub before trying anything
> else. Once you get a complete scrub with no new errors (and no
> checksum errors), you should be confident that the damaged drive has
> been fully re-integrated into the pool.

Ok, I did a scrub after zeroing and the array came back clean,
apparently, but same final result - the array faults as soon as I
'offline' a different vdev. The zeroing is just a
pretend-the-errors-aren't-there directive, and the scrub seems to be
listening to that. What I need in this situation is a way to prompt ad6
to resilver from scratch.

Btw, I can reproduce this behaviour every time. I can also produce
faultless behaviour by offlining and then onlining, or by replacing
disks repeatedly, as expected.
Re: [zfs-discuss] raidz recovery
z...@lordcow.org said:
> For example, when I 'dd if=/dev/zero of=/dev/ad6', or physically
> remove the drive for a while, then 'online' the disk, after it
> resilvers I'm typically left with the following after scrubbing:
>
> r...@file:~# zpool status
>   pool: pool
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.
>         An attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with 'zpool
>         replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Fri Dec 10 23:45:56 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     7
>
> errors: No known data errors
>
> http://www.sun.com/msg/ZFS-8000-9P lists my actions above as a cause
> for this state and rightly doesn't consider them serious. When I
> 'clear' the errors though, offline/fault another drive, and then
> reboot, the array faults. That tells me ad6 was never fully
> integrated back in. Can I tell the array to re-add ad6 from scratch?
> 'detach' and 'remove' don't work for raidz, so otherwise I need to
> use 'replace' to get out of this situation.

After you clear the errors, do another scrub before trying anything
else. Once you get a complete scrub with no new errors (and no checksum
errors), you should be confident that the damaged drive has been fully
re-integrated into the pool.

Regards,

Marion
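Marion's clear-then-scrub check might look like the following. A sketch only: it assumes the pool is named "pool" as in the status output, and the 'scrub in progress' string matches the `zpool status` wording of this era (pool version 15) but may differ in other versions.

```shell
#!/bin/sh
# Sketch of "clear the errors, then scrub again and confirm no new
# ones". Assumes pool "pool"; the grep pattern matches the status text
# of this ZFS version and may vary elsewhere.
zpool clear pool
zpool scrub pool
while zpool status pool | grep -q 'scrub in progress'; do
    sleep 60                     # poll until the scrub finishes
done
zpool status -v pool             # expect zero READ/WRITE/CKSUM errors
```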
[zfs-discuss] raidz recovery
Hi all,

I'm trying to simulate a drive failure and recovery on a raidz array.
I'm able to do so using 'replace', but this requires an extra disk that
was not part of the array. How do you manage when you don't have, or
shouldn't need, an extra disk?

For example, when I 'dd if=/dev/zero of=/dev/ad6', or physically remove
the drive for a while, then 'online' the disk, after it resilvers I'm
typically left with the following after scrubbing:

r...@file:~# zpool status
  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the
        errors using 'zpool clear' or replace the device with 'zpool
        replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Fri Dec 10 23:45:56 2010
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     7

errors: No known data errors

http://www.sun.com/msg/ZFS-8000-9P lists my actions above as a cause
for this state and rightly doesn't consider them serious. When I
'clear' the errors though, offline/fault another drive, and then
reboot, the array faults. That tells me ad6 was never fully integrated
back in. Can I tell the array to re-add ad6 from scratch? 'detach' and
'remove' don't work for raidz, so otherwise I need to use 'replace' to
get out of this situation.

My system:

r...@file:~# uname -a
FreeBSD file 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #0: Sun Nov 28
13:36:08 SAST 2010 r...@file:/usr/obj/usr/src/sys/COWNEL amd64
r...@file:~# dmesg | grep ZFS
ZFS filesystem version 4
ZFS storage pool version 15
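The simulation described in the original post can be written out as one sequence. A hedged, dry-run sketch: it prints the commands instead of running them, the pool and device names are the ones from the status output, and ad4 stands in for the "different drive" being offlined.

```shell
#!/bin/sh
# Dry-run sketch of the simulation above: corrupt ad6, let it resilver,
# scrub, clear, then offline a different raidz member. Names come from
# the status output in the post; set DRYRUN= (empty) to run for real.
DRYRUN=echo
$DRYRUN dd if=/dev/zero of=/dev/ad6 bs=1m   # simulate silent corruption
$DRYRUN zpool online pool ad6               # triggers a resilver
$DRYRUN zpool scrub pool                    # should show CKSUM errors on ad6
$DRYRUN zpool clear pool                    # reset the error counters
$DRYRUN zpool offline pool ad4              # now fault a *different* member
```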