2012-01-13 5:01, Richard Elling wrote:
On Jan 12, 2012, at 2:34 PM, Jim Klimov wrote:
Metadata is at least doubly redundant and checksummed.
True, and this helps if it is valid in the first place
>> As has been
>> reported by many blog-posts researching ZDB, there do
>> happen cases when checksums are broken ...
>> but the file data is in fact intact
Can you provide links to posts that describe this failure mode?
I'll try in another message. That would take some googling
I think the most apparent ones are the tutorials on ZDB
where authors poisoned their VDEVs in those sectors where
metadata was (all copies), so that file data is actually
intact but not accessible due to mismatching checksums
along the metadata path.
Right now I can't think of any other posts like that,
but nature can produce the same phenomena and I think
it could have been discussed on-line. I've read too much
during the past weeks :(
For these cases bloggers asked (in vain) - why is it
not allowed for an admin to confirm validity of end-user
data and have the system reconstruct (re-checksum) the
metadata for it?.. IMHO, that's a valid RFE.
Metadata is COW, too. Rewriting the data also rewrites the metadata.
COW does not help much against mis-targeted hardware
writes, bit rot, solar storms, etc. that would break
existing on-disk data.
Random bit errors can happen anywhere, RAM buffers or
committed disks alike.
It is a fact (since the first blogposts about ZDB and
ZFS internals by Marcelo Leal, Max Bruning, Ben Rockwood
and countless other kind samaritans) that inquisitive
users - or those repairing their systems - can determine
DVA and ultimately LBA addresses of their data, extract
the userdata blocks and confirm (sometimes) that their
data is intact, and the problem is in metadata paths.
While the system is scrubbing, I was reading up on theory.
Found a nice text "Keeping Bits Safe: How Hard Can It Be?"
by David Rosenthal, where I stumbled upon an interesting
The bits forming the digest are no different from the
bits forming the data; neither is magically incorruptible.
...Applications need to know whether the digest has
Hence for ZFS, the checksum (digest) is kept in the parent metadata.
But it can still rot. And for a while they are in the
same RAM, which might lie. Probably the one good effect
there is - checksum is stored away from the data and
*likely* both at once won't get scratched by HDD head
crash ;) Unless they were coalesced to storage near
Hm... so if the checksum in metadata has bit-rotted
on-disk, this metadata block would first not match
its parent block (as it is the parent's checksummed
data), and would cause reread of a ditto copy.
But if the checksum got broken in-RAM just before the
write, so both ditto blocks have bad checksum values -
but they match their metadata-parents - currently the
data is considered bad :(
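To make the two cases above concrete, here is a toy sketch in Python.
It is not how ZFS is implemented: the "metadata block" is just a tag
plus the child's checksum, SHA-256 stands in for whatever checksum the
pool uses, and the names make_metadata, read_data and flip_bit are made
up for this illustration.

```python
import hashlib


def digest(buf: bytes) -> bytes:
    """Stand-in checksum (assumption: SHA-256, chosen only for the demo)."""
    return hashlib.sha256(buf).digest()


def flip_bit(buf: bytes, bit: int) -> bytes:
    """Return a copy of buf with one bit inverted."""
    out = bytearray(buf)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def make_metadata(child_checksum: bytes) -> bytes:
    """Toy metadata block: a 3-byte tag followed by the child's checksum."""
    return b"BP:" + child_checksum


def read_data(parent_checksum: bytes, meta_copies: list, data: bytes) -> str:
    """Walk the ditto copies; trust the first one that matches its parent."""
    for meta in meta_copies:
        if digest(meta) != parent_checksum:
            continue  # this copy rotted on disk, try the next ditto copy
        stored = meta[3:]
        return "ok" if digest(data) == stored else "data checksum mismatch"
    return "no valid metadata copy"


data = b"perfectly fine user data"

# Case A: one ditto copy bit-rots on disk after a correct write.
meta = make_metadata(digest(data))
parent = digest(meta)
case_a = read_data(parent, [flip_bit(meta, 40), meta], data)

# Case B: the checksum is damaged in RAM before the write, so both ditto
# copies carry the same bad value and the parent checksums the bad block.
bad_meta = make_metadata(flip_bit(digest(data), 7))
bad_parent = digest(bad_meta)
case_b = read_data(bad_parent, [bad_meta, bad_meta], data)

print(case_a)  # ok
print(case_b)  # data checksum mismatch
```

In case A the rotted copy is skipped and the good ditto copy is used;
in case B the metadata path is self-consistent, so the intact data is
the thing that gets blamed.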
Granted, data is larger so there is seemingly a higher
chance that it would get a 1-bit error; but as I wrote,
metadata blocks are rewritten more often - so in fact
they could suffer errors more frequently.
Does your practice or theory prove this statement of
mine fundamentally wrong?
The condition described above can affect T10 DIF-style checksums, but not ZFS.
In our case, where original checksum in the blockpointer
could be corrupted in (non-ECC) RAM of my home-NAS just
before it was dittoed to disk, another checksum - copy
of this same one, or a differently calculated one, could
provide ZFS with the means to determine whether the data
or one of the checksums got corrupted (or all of them).
Of course, this is not an absolute protection method,
but it can reduce the cases where pools have to be
"destroyed, recreated and recovered from tape".
Maybe so... as I elaborate below, there are indeed some
scenarios with using several checksums of data, where
we can not unambiguously determine correctness of either.
Say, we have a data block D in RAM, which can fail at any time
(more probable without ECC - as is probable on consumer
devices like laptops or home-NASes). We produce two checksums
D' and then D" with different algorithms while preparing to
write (these checksum values would go to all ditto blocks).
During this time a bit flipped, or whatever undetected
(non-ECC) RAM failure happened at least once. Variants:
1) Block D got broken before checksum calcs - we're out
of luck, checksums would probably match, but the data is
2) Block D got broken between checksum calcs - one of
checksums (always D") matches the data, another one
(always D') doesn't.
3) Block D is okay, but one of checksums broke - one of
checksums matches the data, another one doesn't.
About 50% similarity to case (2).
4) Block D is okay, and both checksums broke - block is
considered broken even if it is not...
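The four variants can be played through in a small Python sketch. The
choice of SHA-256 for D' and CRC-32 for D" is an assumption made only
for the demo, as are the helper names; the point is just to see which
(D' matches, D" matches) pair each variant produces at read time.

```python
import hashlib
import zlib


def cksum_1(data: bytes) -> bytes:
    """First checksum D' (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()


def cksum_2(data: bytes) -> bytes:
    """Second checksum D" (assumption: CRC-32)."""
    return zlib.crc32(data).to_bytes(4, "big")


def flip_bit(buf: bytes, bit: int) -> bytes:
    """Return a copy of buf with one bit inverted."""
    out = bytearray(buf)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def verdict(data: bytes, c1: bytes, c2: bytes) -> tuple:
    """What a reader sees: (D' matches data, D" matches data)."""
    return (cksum_1(data) == c1, cksum_2(data) == c2)


D = b"user data block D" * 8

# 1) D breaks before both checksum calculations.
d1 = flip_bit(D, 5)
v1 = verdict(d1, cksum_1(d1), cksum_2(d1))

# 2) D breaks between the two calculations.
c1 = cksum_1(D)
d2 = flip_bit(D, 5)
v2 = verdict(d2, c1, cksum_2(d2))

# 3) D is okay, one checksum breaks (either the first or the second).
v3_first = verdict(D, flip_bit(cksum_1(D), 3), cksum_2(D))
v3_second = verdict(D, cksum_1(D), flip_bit(cksum_2(D), 3))

# 4) D is okay, both checksums break.
v4 = verdict(D, flip_bit(cksum_1(D), 3), flip_bit(cksum_2(D), 3))

for name, v in [("1", v1), ("2", v2), ("3 (D' broke)", v3_first),
                ("3 (D\" broke)", v3_second), ("4", v4)]:
    print(name, v)
```

Variant 2 and the "D' broke" half of variant 3 give the same verdict,
(False, True), although the data is bad in one and good in the other.
That is the ambiguity: a second checksum alone does not tell the reader
which side to trust.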
The idea needs to be rethought, indeed ;)
Perhaps we can checksum or ECC the checksums, or a digest
of a (primary) checksum and the data?
Maybe we can presume that bitflips would produce small
(1-few bits at random location, 0xdeadbeef -> 0xdeafbeef)
differences, and with fuzzy logic the data would still
"likely match" the checksum?
I refuse to easily believe there is no solution, no hope! ;)
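The "small differences" idea above could be prototyped as a Hamming
distance test between the stored and the recomputed checksum. This is
only a sketch under assumptions: the threshold max_flips is arbitrary,
the function names are made up, and the heuristic leans on a
well-mixing hash (SHA-256 here), where a single flipped data bit
changes about half of the digest bits. A simple additive checksum may
change only a few bits when the data changes, so there the two cases
would not separate cleanly.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checksums must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def classify(stored: bytes, computed: bytes, max_flips: int = 2) -> str:
    """Guess which side is damaged from how far apart the checksums are."""
    d = hamming_distance(stored, computed)
    if d == 0:
        return "match"
    if d <= max_flips:
        return "checksum probably damaged"
    return "data probably damaged"


# The example from the text: 0xdeadbeef vs 0xdeafbeef differ in one bit.
a = (0xDEADBEEF).to_bytes(4, "big")
b = (0xDEAFBEEF).to_bytes(4, "big")
print(hamming_distance(a, b))  # 1
print(classify(a, b))          # checksum probably damaged

# A one-bit change in the data moves a SHA-256 digest much further away.
data = b"some user data"
damaged = bytes([data[0] ^ 0x01]) + data[1:]
good_sum = hashlib.sha256(data).digest()
print(classify(good_sum, hashlib.sha256(damaged).digest()))
```

Even when the verdict is "checksum probably damaged", the admin (or the
filesystem) would still be taking a statistical bet, not a proof.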
zfs-discuss mailing list