On Oct 18, 2011, at 5:21 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Tim Cook
>> I had and have redundant storage, it has *NEVER* automatically fixed
>> it. You're the first person I've heard that has had it automatically fix
> That's probably just because it's normal and expected behavior to
> automatically fix it - I always have redundancy, and every cksum error I
> ever find is always automatically fixed. I never tell anyone here because
> it's normal and expected.
Yes, and in fact the automated tests for ZFS developers intentionally corrupt
data so that the repair code can be tested. Also, the same checksum code is
used to calculate the checksum when writing and when reading.
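To make the verify-and-repair path concrete, here is a minimal sketch. It is a toy model, not ZFS code: the Mirror class, its fields, and the use of SHA-256 as the checksum are all assumptions made for illustration. The one point it is meant to show is that a single checksum function serves both the write and the read path, and that a bad copy is rewritten from a good one when redundancy exists.

```python
import hashlib


def checksum(data):
    # One function for both paths; SHA-256 is only a stand-in here.
    return hashlib.sha256(data).digest()


class Mirror:
    """Hypothetical two-way mirror holding a single block."""

    def __init__(self, sides=2):
        self.copies = [b""] * sides
        self.cksum = checksum(b"")  # kept apart from the data copies

    def write(self, data):
        # Write path: compute the checksum, then store every copy.
        self.cksum = checksum(data)
        self.copies = [data for _ in self.copies]

    def read(self):
        # Read path: verify each copy with the same checksum function.
        good = None
        bad = []
        for i, copy in enumerate(self.copies):
            if checksum(copy) == self.cksum:
                if good is None:
                    good = copy
            else:
                bad.append(i)
        if good is None:
            # No redundant copy verifies: nothing to repair from.
            raise IOError("unrecoverable checksum error")
        for i in bad:
            # Self-heal: overwrite the bad copy with verified data.
            self.copies[i] = good
        return good, bad
```

Corrupting one side and reading back, as the test suite does on purpose, should return the original data and report which side was repaired. Corrupting every side leaves the read with no verified copy, which is the unrecoverable case.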
> If you have redundancy, and cksum errors, and it's not automatically fixed,
> then you should report the bug.
For modern Solaris-based implementations, each checksum mismatch that is
repaired reports the bitmap of the corrupted vs expected data. Obviously, if the
data cannot be repaired, you cannot know the expected data, so the error is
reported without identification of the broken bits.
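As an illustration of why the bitmap needs the expected data, here is a small sketch that derives a bad-bit map by XORing the corrupted bytes against the repaired bytes. The function names and the output layout are hypothetical and do not reflect the actual error report format; the XOR idea is the only thing being shown.

```python
def bad_bit_map(bad, good):
    # A set bit in the result marks a bit that differs between
    # the corrupted data and the expected (repaired) data.
    if len(bad) != len(good):
        raise ValueError("buffers must be the same length")
    return bytes(b ^ g for b, g in zip(bad, good))


def bad_bit_positions(bitmap):
    # Flatten the map into bit offsets (bit 0 = least significant
    # bit of byte 0) so the damage pattern is easy to inspect.
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if (byte >> bit) & 1
    ]
```

Without a good copy there is no second argument to pass, which is why an unrepairable error can only be reported as a mismatch. A single flipped bit, a run of flipped bits, or a whole wrong block each leave a different pattern, and that pattern can hint at which component is at fault.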
In the archives, you can find reports of recoverable and unrecoverable errors
attributed to:
1. ZFS software (rare, but a bug a few years ago mishandled a raidz
2. SAN switch firmware
3. "Hardware" RAID array firmware
4. Power supplies
7. PCI-X bus
8. BIOS settings
9. CPU and chipset errata
Personally, I've seen all of the above except #7, because PCI-X hardware is
hard to find now.
If you consistently see unrecoverable data from a system that has protected
data, there may be an issue with a part of the system that is a single point of
failure. Very, very few x86 systems are designed with no SPOF.
ZFS and performance consulting
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9