On Dec 21, 2011, at 11:45 AM, Gareth de Vaux wrote:

> Hi guys, after a scrub my raidz array status showed:
>
> # zpool status
>   pool: pool
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>   scan: scrub repaired 85.5K in 1h21m with 0 errors on Mon Dec 19 06:24:25 2011
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             ad18    ONLINE       0     0     1
>             ad19    ONLINE       0     0     0
>             ad10    ONLINE       0     0     1
>             ad4     ONLINE       0     0     0
>
> errors: No known data errors
>
>
> I assume the checksum counts are current and irreconcilable. (Why does
> the scan say 'repaired with 0 errors' then?).
>
> What should one do at this point?

Be happy. Dance a jig. Buy a lottery ticket.
Notice:
    scrub repaired 85.5K in 1h21m with 0 errors on Mon Dec 19 06:24:25 2011

ZFS found corruption and fixed it.

>
> I rebooted and ran another scrub, this time it came up with 0 errors
> and 0 checksum counts, what does that mean?

ZFS found corruption and fixed it.

>
> I then backed up the array, kicked out ad18 and resilvered it from scratch:

oops... tempting the fates? Transient errors do occur, frequently. Not all
errors are persistent or fatal. Given the information presented here, IMHO,
this system did not warrant further action.

>
> # zpool status
>   pool: pool
>  state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>   scan: resilvered 218G in 1h25m with 14 errors on Wed Dec 21 14:48:47 2011
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         pool             DEGRADED     0     0    14
>           raidz1-0       DEGRADED     0     0    28
>             replacing-0  OFFLINE      0     0     0
>               ad18/old   OFFLINE      0     0     0
>               ad18       ONLINE       0     0     0
>             ad19         ONLINE       0     0     0
>             ad10         ONLINE       0     0     0
>             ad4          ONLINE       0     0     0
>
> errors: 11 data errors, use '-v' for a list
>
>
> and 'zpool status -v' gives me a list of affected files.
>
> I assume I delete those files, then follow the same procedure on ad10?
>
>
> # uname -a
> FreeBSD file 8.2-STABLE FreeBSD 8.2-STABLE #0: Sat Nov 12 17:51:22 SAST 2011
>     root@file:/usr/obj/usr/src/sys/COWNEL amd64
>
> ZFS filesystem version 5
> ZFS storage pool version 28
>
>
> ps. I did get 1 disk alert in the logs during this whole process, half an
> hour before resilvering:
>
> Dec 21 12:41:48 file kernel: ad10: WARNING - READ_DMA48 UDMA ICRC error
> (retrying request) LBA=306763504
> Dec 21 12:41:48 file kernel: ad10: FAILURE - READ_DMA48
> status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=306763504

This appears to be a [S]ATA error generated by the drive.
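
For anyone reading the FAILURE line above: the status=51 and error=10 values are hexadecimal bit fields, and the names in angle brackets are just the set bits spelled out. Below is a minimal sketch in Python of that decoding, assuming the usual ATA status/error register bit layout. Only READY, DSC, ERROR and NID_NOT_FOUND are confirmed by the log itself. The other bit names are filled in from memory of the FreeBSD ata(4) driver and should be treated as assumptions.

```python
# Bit masks for the ATA status and error registers.
# ASSUMPTION: names other than READY, DSC, ERROR and NID_NOT_FOUND do not
# appear in the log quoted above; they are recalled from the driver and
# may differ in spelling.
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x20: "DMA_READY",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORRECTABLE",
    0x02: "INDEX",
    0x01: "ERROR",
}

ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNCORRECTABLE",
    0x20: "MEDIA_CHANGED",
    0x10: "NID_NOT_FOUND",
    0x08: "MEDIA_CHANGE_REQUEST",
    0x04: "ABORTED",
    0x02: "NO_MEDIA",
    0x01: "ILLEGAL_LENGTH",
}


def decode(value_hex, names):
    """Return the names of all bits set in a hex register value."""
    value = int(value_hex, 16)
    return [name for bit, name in names.items() if value & bit]


# Values taken from the kernel message in the thread.
print("status=51 ->", decode("51", STATUS_BITS))
print("error=10  ->", decode("10", ERROR_BITS))
```

Run against the two values from the log, this reproduces the names the kernel printed, which is a quick sanity check that the drive reported an "ID not found" condition for that request, not some other error bit.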
If LBA 306763504 is a legal LBA, then this can be one of the factors
contributing to the original checksum error.
 -- richard

-- 
ZFS and performance consulting
http://www.RichardElling.com

_______________________________________________
zfs-discuss mailing list
email@example.com
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
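
A note on the "legal LBA" check mentioned in the reply: an LBA is legal if it is smaller than the number of sectors the drive exposes. The sketch below shows the arithmetic in Python. The sector count and the 512-byte sector size are hypothetical placeholders, since the thread never states the size of ad10. On FreeBSD the real figure should be available from `diskinfo -v ad10` (the "mediasize in sectors" line).

```python
# ASSUMPTION: 512-byte logical sectors; the thread does not state this.
SECTOR_SIZE = 512

# HYPOTHETICAL sector count, for illustration only (typical of a 500 GB
# drive). Substitute the value reported for the actual device.
HYPOTHETICAL_TOTAL_SECTORS = 976_773_168


def lba_is_legal(lba, total_sectors):
    """True if lba addresses a sector that exists on the device."""
    return 0 <= lba < total_sectors


def lba_to_byte_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset of the first byte of the given sector."""
    return lba * sector_size


lba = 306763504  # LBA from the kernel message in the thread
print("legal:", lba_is_legal(lba, HYPOTHETICAL_TOTAL_SECTORS))
print("byte offset:", lba_to_byte_offset(lba))
```

With 512-byte sectors the LBA corresponds to a byte offset of 157062914048, roughly 157 GB into the disk.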