2012-01-21 0:33, Jim Klimov wrote:
2012-01-13 4:12, Jim Klimov wrote:
As I recently wrote, my data pool has experienced some
"unrecoverable errors". It seems that a userdata block
of deduped data got corrupted and no longer matches the
stored checksum. For whatever reason, raidz2 did not
help in recovery of this data, so I rsync'ed the files
over from another copy. Then things got interesting...

Well, after some crawling over my data with zdb, od and dd,
I guess ZFS was right about finding checksum errors - the
metadata's checksum matched that of a block on original
system, and the data block was indeed erring.

Well, as I'm moving to close my quest with broken data, I'd
like to draw up some conclusions and RFEs. I am still not
sure if they are factually true, as I'm still learning the
ZFS internals. So "it currently seems to me, that":

1) My on-disk data could get corrupted for whatever reason
   ZFS tries to protect it from, at least once probably
   from misdirected writes (i.e. the head landed not where
   it was asked to write). It can not be ruled out that the
   checksums got broken in non-ECC RAM before writes of
   block pointers for some of my data, thus leading to
   mismatches. One way or another, ZFS noted the discrepancy
   during scrubs and "normal" file accesses. There is no
   (automatic) way to tell which part is faulty - checksum
   or data.

2) In the case where on-disk data did get corrupted, the
   checksum in block pointer was correct (matching original
   data), but the raidz2 redundancy did not aid recovery.

3) The file in question was created on a dataset with enabled
   deduplication, so at the very least the dedup bit was set
   on the corrupted block's pointer and a DDT entry likely
   existed. Attempts to rewrite the block with the original
   one (having "dedup=on") failed in fact, probably because
   the matching checksum was already in DDT.

   Rewrites of such blocks with "dedup=off" or "dedup=verify"
   succeeded.

   Failure/success was tested by "sync; md5sum FILE" some
   time after the fix attempt. (When done just after the
   fix, the test tends to return success even if the on-disk
   data is bad, "thanks" to caching).

   My last attempt was to set "dedup=on" and write the block
   again and sync; the (remote) computer hung instantly :(

3*) The RFE stands: deduped blocks found to be invalid and not
   recovered by redundancy should somehow be evicted from DDT
   (or marked for required verification-before-write) so as
   not to pollute further writes, including repair attempts.

   Alternatively, "dedup=verify" takes care of the situation
   and should be the recommended option.

3**) It was suggested to set "dedupditto" to small values,
   like "2". My oi_148a refused to set values smaller than 100.
   Moreover, it seems reasonable to have two dedupditto values:
   for example, to make a ditto copy when DDT reference counter
   exceeds some small value (2-5), and add ditto copies every
   "N" values for frequently-referenced data (every 64-128).

4) I did not get to check whether "dedup=verify" triggers a
   checksum mismatch alarm if the preexisting on-disk data
   does not in fact match the checksum.

   I think such an alarm should exist and do as much as a
   scrub, read or other means of error detection and recovery
   would.

5) It seems like a worthy RFE to include a pool-wide option to
   "verify-after-write/commit" - to test that recent TXG sync
   data has indeed made it to disk on (consumer-grade) hardware
   into the designated sector numbers. Perhaps the test should
   be delayed several seconds after the sync writes.

   If the verification fails, data from recent TXGs can be
   recovered from on-disk redundancy and/or may still exist in
   the RAM cache, and can be rewritten (and tested again).

   More importantly, a failed test *may* mean that the write
   landed on disk randomly, and the pool should be scrubbed
   ASAP. It may be guessed that the yet-unknown error can lie
   within "epsilon" tracks (sector numbers) from the currently
   found non-written data, so if it is possible to scrub just
   a portion of the pool based on DVAs - that's a preferred
   start. It is possible that some data can be recovered if
   it is tended to ASAP (i.e. on mirror, raidz, copies>1)...
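
To make the "verify-after-write" idea from item 5 more concrete,
here is a minimal toy sketch in Python. It is not ZFS code and
does not reflect how ZFS actually works: the ToyDisk class, the
one-shot "misdirect" fault injection, the use of SHA-256 and the
function names are all assumptions made up for illustration. It
only shows the proposed flow: write a TXG, re-read the designated
sectors, rewrite mismatches from the cached copy, and derive an
"epsilon" neighbourhood of sectors worth scrubbing first.

```python
import hashlib


class ToyDisk:
    """In-memory stand-in for a disk: maps sector number -> bytes."""

    def __init__(self, misdirect=None):
        self.sectors = {}
        # Hypothetical fault injection: {intended_sector: actual_sector}.
        # Each entry fires once, imitating a single misdirected write.
        self.misdirect = dict(misdirect or {})

    def write(self, sector, data):
        target = self.misdirect.pop(sector, sector)
        self.sectors[target] = data

    def read(self, sector):
        return self.sectors.get(sector, b"")


def sync_txg(disk, txg_blocks):
    """Write every block of one TXG; return {sector: expected checksum}."""
    expected = {}
    for sector, data in txg_blocks.items():
        disk.write(sector, data)
        expected[sector] = hashlib.sha256(data).digest()
    return expected


def verify_after_write(disk, expected):
    """Re-read each designated sector; return sorted sectors that mismatch."""
    return sorted(
        sector
        for sector, cksum in expected.items()
        if hashlib.sha256(disk.read(sector)).digest() != cksum
    )


def rewrite_from_cache(disk, txg_blocks, bad_sectors):
    """Rewrite failed sectors from the copy still held in (RAM) cache."""
    for sector in bad_sectors:
        disk.write(sector, txg_blocks[sector])


def suspect_ranges(bad_sectors, epsilon):
    """Sector ranges within +/- epsilon of each failure, to scrub first."""
    return [(max(0, s - epsilon), s + epsilon) for s in bad_sectors]
```

In this toy, a write intended for sector 10 that lands on sector
12 is reported by verify_after_write() as a failure of sector 10;
the stray copy at sector 12 is exactly the kind of damage that the
targeted scrub over suspect_ranges() would be meant to find.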

Finally, I should say I'm sorry for lame questions arising
from not reading the format spec and zdb blogs carefully ;)

In particular, it was my understanding for a long time that
block pointers each have a sector of their own, leading to
overheads that I've seen. Now I know (and checked) that most
of the blockpointer tree is made of larger groupings (128
blkptr_t's in a single 16KB block), reducing the impact of
BP's on fragmentation and/or slacky waste of large sectors
that I predicted and expected for the past year.
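
The arithmetic behind this can be sketched in a few lines of
Python. The sizes of blkptr_t (128 bytes) and of the indirect
block (16KB) are the ones mentioned above; the 128KB recordsize,
the single level of indirection, and the absence of compression
and of metadata ditto copies are simplifying assumptions of mine,
and the function names are made up for the example.

```python
BLKPTR_SIZE = 128            # bytes in one blkptr_t
INDIRECT_BLOCK = 16 * 1024   # bytes in one indirect block


def blkptrs_per_indirect():
    """How many block pointers fit into one indirect block."""
    return INDIRECT_BLOCK // BLKPTR_SIZE


def data_blocks(file_size, recordsize=128 * 1024):
    """Number of data blocks needed for a file (rounded up)."""
    return -(-file_size // recordsize)


def packed_bp_bytes(file_size, recordsize=128 * 1024):
    """Space for level-1 pointers when they are packed into 16KB blocks."""
    n = data_blocks(file_size, recordsize)
    indirect_blocks = -(-n // blkptrs_per_indirect())
    return indirect_blocks * INDIRECT_BLOCK


def per_sector_bp_bytes(file_size, sector, recordsize=128 * 1024):
    """Space if every pointer took a whole sector (my old misunderstanding)."""
    return data_blocks(file_size, recordsize) * sector
```

Under these assumptions, a 1GB file has 8192 data blocks, so the
packed pointers take 64 indirect blocks (1MB before compression),
whereas one pointer per 4KB sector would have taken 32MB.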

Sad that nobody ever contradicted that (mis)understanding
of mine.

//Jim Klimov
