On Jan 12, 2012, at 2:34 PM, Jim Klimov wrote:
> I guess I have another practical rationale for a second
> checksum, be it ECC or not: scrubbing my pool found some
> "unrecoverable errors". Luckily, for those files I still
> have external originals, so I rsynced them over. Still,
> there is one file whose broken prehistory is referenced
> in snapshots, and properly fixing that would probably
> require me to resend the whole stack of snapshots.
> That's uncool, but a subject for another thread.
> This thread is about checksums - namely, now, what are
> our options when they mismatch the data? As many
> blog posts researching ZDB have reported, there are
> cases where the checksum is broken (i.e. bitrot in the
> block pointer, or rather in RAM while the checksum was
> being calculated - so each ditto copy of the BP carries
> the same error),
> but the file data is in fact intact (extracted from
> disk with ZDB or DD, and compared to other copies).
Metadata is at least doubly redundant and checksummed.
Can you provide links to posts that describe this failure mode?
> For these cases bloggers asked (in vain): why is an
> admin not allowed to confirm the validity of end-user
> data and have the system reconstruct (re-checksum) the
> metadata for it? IMHO, that's a valid RFE.
Metadata is COW, too. Rewriting the data also rewrites the metadata.
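A toy sketch of that propagation (a hypothetical hash-tree model, not the actual ZFS on-disk structures): rewriting a leaf never modifies old blocks; it produces a new leaf, a new parent, and a new root, so repairing the data rewrites the checksum path along with it.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

# Toy copy-on-write tree: the "root" is just the hash of its
# children's hashes. Rewriting one leaf leaves old blocks intact
# and yields a new leaf, new parent payload, and new root.
old_leaves = [b"A", b"B"]
old_root = h(b"".join(h(x) for x in old_leaves))

new_leaves = [b"A", b"B2"]           # rewrite of the second leaf
new_root = h(b"".join(h(x) for x in new_leaves))

assert new_root != old_root          # metadata was rewritten too
assert old_root == h(b"".join(h(x) for x in old_leaves))  # old tree untouched
```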
> While the system was scrubbing, I was reading up on theory
> and found a nice text, "Keeping Bits Safe: How Hard Can
> It Be?" by David Rosenthal, where I stumbled upon an
> interesting passage:
>   "The bits forming the digest are no different from the
>   bits forming the data; neither is magically incorruptible.
>   ... Applications need to know whether the digest has
>   been changed."
Hence for ZFS, the checksum (digest) is kept in the parent metadata.
The condition described above can affect T10 DIF-style checksums, but not ZFS.
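That point can be sketched with a toy Merkle-style model (hypothetical layout, not the real ZFS block format): because each digest lives inside a parent block that is itself checksummed by *its* parent, a flipped bit in a stored digest is caught one level up instead of silently blaming the data.

```python
import hashlib

def sha256(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

# Leaf data blocks; the indirect block's payload is the concatenation
# of its children's digests, and is in turn checksummed by its parent.
data = [b"block-0", b"block-1"]
leaf_digests = [sha256(d) for d in data]
indirect_payload = b"".join(leaf_digests)   # digests live in the parent
root_digest = sha256(indirect_payload)      # ...which is itself checksummed

# Verify the path: root digest validates the indirect block,
# which validates the leaves.
assert sha256(indirect_payload) == root_digest
for d, g in zip(data, leaf_digests):
    assert sha256(d) == g

# Corrupt a stored digest (not the data): the parent's checksum
# catches it, so a flipped digest bit cannot masquerade as bad data.
bad = bytearray(indirect_payload)
bad[0] ^= 0x01
assert sha256(bytes(bad)) != root_digest
```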
> In our case, where the original checksum in the block
> pointer could be corrupted in the (non-ECC) RAM of my
> home NAS just before it was dittoed to disk, another
> checksum - a copy of this same one, or a differently
> calculated one - could provide ZFS with the means to
> determine whether the data or one of the checksums got
> corrupted (or all of them).
> Of course, this is not an absolute protection method,
> but it can reduce the cases where pools have to be
> "destroyed, recreated and recovered from tape".
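For what it's worth, the disambiguation proposed above can be sketched as simple two-checksum voting (a toy model with hypothetical helper names, not an existing ZFS feature): if the data matches one stored checksum but not the other, the odd checksum out is the likely casualty, not the data.

```python
import hashlib
import zlib

# Two independently computed checksums stored with each block.
def checksums(data: bytes):
    return hashlib.sha256(data).digest(), zlib.crc32(data)

# On a mismatch, majority logic suggests whether the data or one of
# the stored checksums is suspect.
def diagnose(data: bytes, stored_sha: bytes, stored_crc: int) -> str:
    sha_ok = hashlib.sha256(data).digest() == stored_sha
    crc_ok = zlib.crc32(data) == stored_crc
    if sha_ok and crc_ok:
        return "ok"
    if sha_ok or crc_ok:
        return "stored checksum corrupt, data likely intact"
    return "data corrupt (or both checksums corrupt)"

data = b"user data"
sha, crc = checksums(data)
assert diagnose(data, sha, crc) == "ok"

# Flip a bit in one stored checksum: the other still vouches for the data.
bad_sha = bytes([sha[0] ^ 1]) + sha[1:]
assert diagnose(data, bad_sha, crc).startswith("stored checksum corrupt")

# Corrupt the data itself: both checksums disagree.
assert diagnose(data + b"!", sha, crc).startswith("data corrupt")
```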
> It is my belief that using dedup contributed to my issue -
> there's a lot more updating of the block pointers and their
> checksums, so it gradually becomes more likely that a
> metadata (checksum) block gets broken (i.e. in non-ECC
> RAM), while the written-once user data remains intact...
> http://queue.acm.org/detail.cfm?id=1866298
> While the text discusses what most ZFSers already know -
> bit-rot, MTTDL and such - it does so in great detail and
> with many examples, and it gave me a better understanding
> of it all even though I have dealt with this for several
> years now. A good read; I recommend it to others ;)
> //Jim Klimov
ZFS and performance consulting
SCALE 10x, Los Angeles, Jan 20-22, 2012
zfs-discuss mailing list