On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
> As I recently wrote, my data pool has experienced some
> "unrecoverable errors". It seems that a userdata block
> of deduped data got corrupted and no longer matches the
> stored checksum. For whatever reason, raidz2 did not
> help in recovery of this data, so I rsync'ed the files
> over from another copy. Then things got interesting...
> Bug alert: it seems the block-pointer block with that
> mismatching checksum did not get invalidated, so my
> attempts to rsync known-good versions of the bad files
> from external source seemed to work, but in fact failed:
> subsequent reads of the files produced IO errors.
> Apparently (my wild guess), upon writing the blocks,
> checksums were calculated and the matching DDT entry
> was found. ZFS did not care that the entry pointed to
> inconsistent data (not matching the checksum now),
> it still increased the DDT counter.
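The guessed mechanism can be sketched as a toy model (this is NOT ZFS code; the class, names, and the repair-in-place behavior under verify are illustrative assumptions — real ZFS with dedup=verify would store the colliding write as a new unique block rather than overwriting the stale copy):

```python
# Toy model of a dedup table keyed by checksum. Without verify, a new
# write whose checksum matches an existing entry just bumps the refcount,
# even if the stored block has silently rotted on disk.
import hashlib

class DedupTable:
    def __init__(self, verify=False):
        self.verify = verify
        self.entries = {}  # checksum -> [refcount, stored block]

    def write(self, block: bytes) -> bytes:
        key = hashlib.sha256(block).digest()
        entry = self.entries.get(key)
        if entry is not None:
            if self.verify and entry[1] != block:
                # verify on: stored copy does not match the incoming data,
                # so take the fresh copy (simplified; ZFS would allocate a
                # separate unique block instead).
                entry[1] = block
            entry[0] += 1  # without verify, this is all that happens
        else:
            self.entries[key] = [1, block]
        return key

    def read(self, key: bytes) -> bytes:
        refcount, block = self.entries[key]
        if hashlib.sha256(block).digest() != key:
            raise IOError("checksum mismatch")  # the EIO seen on read
        return block

ddt = DedupTable(verify=False)
key = ddt.write(b"good data")
ddt.entries[key][1] = b"rotted!!"   # simulate on-disk corruption
ddt.write(b"good data")             # the "repair" rsync: refcount bumped,
                                    # rotted data left in place
# ddt.read(key) now raises IOError, matching the post-rsync behavior
```

With verify=True the second write in this model detects the mismatch and restores good data, which is why the inline replies below keep pointing at the verify option.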
> The problem was solved by disabling dedup for the dataset
> involved and rsync-updating the file in-place. After the
> dedup feature was disabled and new blocks were uniquely
> written, everything was readable (and md5sums matched)
> as expected.
> I think of a couple of solutions:
In theory, the verify option will correct this going forward.
> If the block is detected to be corrupt (checksum mismatches
> the data), the checksum value in blockpointers and DDT
> should be rewritten to an "impossible" value, perhaps
> all-zeroes or such, when the error is detected.
What if it is a transient fault?
> Alternatively (opportunistically), a flag might be set
> in the DDT entry requesting that a new write matching
> this stored checksum should get committed to disk - thus
> "repairing" all files which reference the block (at least,
> stopping the IO errors).
verify eliminates this failure mode.
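For reference, verify is enabled through the dedup dataset property; a sketch assuming a placeholder dataset tank/data:

```shell
# Require a byte-for-byte comparison before counting a write as a
# duplicate, using sha256 as the dedup checksum:
zfs set dedup=sha256,verify tank/data

# Or keep the dataset's current checksum and just add verification:
zfs set dedup=verify tank/data
```

Verify costs an extra read on every dedup hit, but it is what prevents a stale DDT entry from swallowing a known-good rewrite.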
> Alas, so far there is no guarantee that it was
> not the checksum itself that got corrupted (except for
> using ZDB to retrieve the block contents and matching
> that with a known-good copy of the data, if any), so
> corruption of the checksum would also cause replacement
> of "really-good-but-normally-inaccessible" data.
Extremely unlikely. The metadata is also checksummed. To arrive here
you will have to have two corruptions each of which generate the proper
checksum. Not impossible, but… I'd buy a lottery ticket instead.
See also dedupditto. I could argue that the default value of dedupditto
should be 2 rather than "off".
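dedupditto is a pool-wide property rather than a dataset one; a hedged example assuming a placeholder pool tank (some implementations enforce a minimum threshold well above 2, so check zpool(1M) on your platform):

```shell
# Keep an extra (ditto) copy of a deduped block once its reference
# count crosses the given threshold, limiting the blast radius of a
# single corrupted block that many files share:
zpool set dedupditto=2 tank
```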
> //Jim Klimov
> (Bug reported to Illumos: https://www.illumos.org/issues/1981)
ZFS and performance consulting
zfs-discuss mailing list