On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:
> 2012-01-21 0:33, Jim Klimov wrote:
>> 2012-01-13 4:12, Jim Klimov wrote:
>>> As I recently wrote, my data pool has experienced some
>>> "unrecoverable errors". It seems that a userdata block
>>> of deduped data got corrupted and no longer matches the
>>> stored checksum. For whatever reason, raidz2 did not
>>> help in recovery of this data, so I rsync'ed the files
>>> over from another copy. Then things got interesting...
>> Well, after some crawling over my data with zdb, od and dd,
>> I guess ZFS was right about finding checksum errors - the
>> metadata's checksum matched that of a block on original
>> system, and the data block was indeed erring.
> Well, as I'm moving to close my quest with broken data, I'd
> like to draw up some conclusions and RFEs. I am still not
> sure if they are factually true, as I'm still learning the
> ZFS internals. So "it currently seems to me, that":
> 1) My on-disk data could get corrupted for whatever reason
> ZFS tries to protect it from, at least once probably
> from misdirected writes (i.e. the head landed not where
> it was asked to write). It cannot be ruled out that the
> checksums got corrupted in non-ECC RAM before the block
> pointers for some of my data were written, thus leading to
> mismatches. One way or another, ZFS noted the discrepancy
> during scrubs and "normal" file accesses. There is no
> (automatic) way to tell which part is faulty - checksum
> or data.
Untrue. If a block pointer is corrupted, then the corruption is detected
on read, logged, and the bad copy is ignored. I'm not sure you have
grasped the concept of checksums being stored in the parent object.
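For what it's worth, zdb will show you exactly where those checksums live:
the block pointers it prints for an object are the ones stored in that
object's parent/indirect blocks, checksums included. Something along these
lines (dataset name and object number are just examples):

    zdb -ddddd tank/data 12345   # dump the dnode and its indirect/block pointer tree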
> 2) In the case where on-disk data did get corrupted, the
> checksum in block pointer was correct (matching original
> data), but the raidz2 redundancy did not aid recovery.
I think your analysis is incomplete. Have you determined the root cause?
> 3) The file in question was created on a dataset with enabled
> deduplication, so at the very least the dedup bit was set
> on the corrupted block's pointer and a DDT entry likely
> existed. Attempts to rewrite the block with the original
> one (having "dedup=on") failed in fact, probably because
> the matching checksum was already in DDT.
Works as designed.
> Rewrites of such blocks with "dedup=off" or "dedup=verify"
> did succeed. Failure/success was tested by "sync; md5sum FILE"
> some time after the fix attempt. (When done right after the
> fix, the test tends to return success even if the on-disk data
> is bad, "thanks" to caching.)
No, I think you've missed the root cause. By default, data that does
not match its checksum is not used.
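If you want to retest a repair without the ARC masking the result, something
along these lines should work (pool, dataset and file names are just examples;
exporting and re-importing the pool forces the md5sum to read from disk, and a
reboot works too):

    zfs set dedup=verify tank/data        # or dedup=off just for the rewrite
    rsync --inplace goodhost:/backup/FILE /tank/data/FILE
    sync
    zpool export tank                     # drop cached data for this pool
    zpool import tank
    md5sum /tank/data/FILE                # compare against the known-good copy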
> My last attempt was to set "dedup=on" and write the block
> again and sync; the (remote) computer hung instantly :(
> 3*) The RFE stands: deduped blocks found to be invalid and not
> recovered by redundancy should somehow be evicted from DDT
> (or marked for required verification-before-write) so as
> not to pollute further writes, including repair attempts.
> Alternatively, "dedup=verify" takes care of the situation
> and should be the recommended option.
I have lobbied for this, but so far people prefer performance to dependability.
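Until then you can opt in per dataset; the verify variants force a
byte-for-byte comparison before a write is deduplicated against an existing
block (dataset name is an example):

    zfs set dedup=sha256,verify tank/data   # or simply dedup=verify
    zfs get dedup tank/data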
> 3**) It was suggested to set "dedupditto" to small values,
> like "2". My oi_148a refused to set values smaller than 100.
> Moreover, it seems reasonable to have two dedupditto values:
> for example, to make a ditto copy when the DDT reference counter
> exceeds some small value (2-5), and to add further ditto copies
> every "N" references for frequently-referenced data (every 64-128).
> 4) I did not get to check whether "dedup=verify" triggers a
> checksum mismatch alarm if the preexisting on-disk data
> does not in fact match the checksum.
All checksum mismatches are handled the same way.
> I think such an alarm should exist and do as much as a scrub,
> read or other means of error detection and recovery would.
Checksum mismatches are logged; what was your root cause?
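They show up in the usual places, for example (pool name is an example):

    zpool status -v tank   # per-vdev CKSUM counters plus the list of affected files
    fmdump -e              # FMA ereports, including the zfs checksum events, with timestamps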
> 5) It seems like a worthy RFE to include a pool-wide option to
> "verify-after-write/commit" - to test that recent TXG sync
> data has indeed made it to disk on (consumer-grade) hardware
> into the designated sector numbers. Perhaps the test should
> be delayed several seconds after the sync writes.
There are highly-reliable systems that do this in the fault-tolerant market.
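The closest thing available in stock ZFS today is a scrub, which re-reads
and verifies everything rather than just the latest TXGs (pool name is an
example):

    zpool scrub tank
    zpool status tank      # scrub progress and any repairs made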
> If the verification fails, currently cached data from recent
> TXGs can be recovered from on-disk redundancy and/or still
> exist in RAM cache, and rewritten again (and tested again).
> More importantly, a failed test *may* mean that the write
> landed on disk randomly, and the pool should be scrubbed
> ASAP. It may be guessed that the yet-unknown errors lie
> within "epsilon" tracks (sector numbers) of the unwritten
> data just discovered, so if it is possible to scrub just
> a portion of the pool based on DVAs - that would be the
> preferred start. It is possible that some data can be
> recovered if it is tended to ASAP (i.e. on mirror, raidz,
> copies>1)...
> Finally, I should say I'm sorry for lame questions arising
> from not reading the format spec and zdb blogs carefully ;)
> In particular, it was my understanding for a long time that
> block pointers each have a sector of their own, leading to
> overheads that I've seen. Now I know (and checked) that most
> of the block-pointer tree is made of larger groupings (128
> blkptr_t's in a single 16KB block), reducing the impact of
> BPs on fragmentation and/or the slack-space waste on large
> sectors that I predicted and expected for the past year.
> Sad that nobody ever contradicted that (mis)understanding
> of mine.
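For the record, the arithmetic behind that: a blkptr_t is 128 bytes on disk,
so a 16KB indirect block holds exactly 128 of them:

    echo $((16 * 1024 / 128))   # = 128 block pointers per 16KB indirect block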
Perhaps some day you can become a ZFS guru, but the journey is long...