> Hi Phil,
I know this is an old thread, but I didn't see where you ever got word back
from the OpenZFS dev team, and this is an issue I feel needs to be
addressed. I am a software engineer, and I have many years of experience
working with ZFS. Admittedly I have not worked on ZFS development myself,
but I am familiar with the sorts of data structures and processes used by
ZFS. I'm very skeptical of this idea of "ZFS cancer," as I would call it,
where ZFS's self-healing routines become poisonous and start corrupting
the entire filesystem due to a data error that occurs in memory. Now, this
is a very complicated subject, because there is a lot to take into
consideration, but let us consider only the data for a moment. ZFS uses an
implementation of what in computer science is called a self-validating
Merkle tree, where each node is validated by a hash stored in its parent
node, all the way up to the uberblock (the root node), which is then
duplicated elsewhere.
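To make that concrete, here is a minimal sketch of the parent-validates-child idea in Python. This is my own illustration, not OpenZFS code: SHA256 stands in for whichever checksum is configured, and the `Node`/`validate` names are made up for the example.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in for the configured block checksum (fletcher4, sha256, ...).
    return hashlib.sha256(data).digest()

class Node:
    """Toy Merkle node: the parent records each child's checksum at
    write time, so a child can never vouch for itself."""
    def __init__(self, data: bytes, children=None):
        self.data = data
        self.children = children or []
        self.child_sums = [checksum(c.data) for c in self.children]

def validate(node: Node) -> bool:
    """Walk down from the root, checking every child block against the
    checksum its parent stored for it."""
    for child, expected in zip(node.children, node.child_sums):
        if checksum(child.data) != expected:
            return False          # corruption detected at this block
        if not validate(child):
            return False
    return True
```

The key property is visible even in the toy version: flipping a bit in any node's data breaks the chain of checksums above it, because the expected hash lives in the parent, not in the block itself.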
The proposed cancer scenario is that an in-memory error affects the data
in question, which in turn causes a checksum invalidation to occur, and so
ZFS starts self-healing and writing the corrupted data all over the
system. However, this is not how it works. Before ZFS corrects a single
block of corrupted data, it first finds a validated copy. That means there
has to be redundant data. If you are running ZFS on a single drive in a
standard configuration, without block duplication or a split volume, you
only have one copy of the data, which means self-healing doesn't even turn
on. Now let's assume you are running a mirror or RAIDZ-1/2/3, where you do
have duplicate data, and ZFS detects data corruption due to a hash
failure. Before ZFS starts healing itself, it will try to find a valid
copy of the data by looking at the redundant data and doing hash
validation on it. The data must pass this hash validation in order to be
propagated. So now you need a second failure, where the redundant data is
also wrong, but moreover, that data has to pass the validation as well,
which would require a hash collision (a collision is where you have
different data that hashes to the same value). The odds of this are
astronomical!
But assuming you have a checksum failure, which triggers a self-healing
operation, which then finds a corrupted piece of redundant data that also
managed to pass the hash check, then yes, it would replicate that data.
However, it would only replicate the error for that one block, because
every block is hashed individually. That is hardly destroying your entire
data set! It would take a gross set of improbabilities for ZFS to corrupt
the single block containing your 32nd picture of Marilyn Monroe, and if
ZFS were going to corrupt a second block, we'd have to repeat all of this!
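The repair rule described above can be sketched in a few lines. Again this is a hedged illustration of the logic, not the OpenZFS implementation: a redundant copy is only used for healing if it passes the very checksum the original failed.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in for the configured block checksum.
    return hashlib.sha256(data).digest()

def self_heal(copies: list[bytearray], expected: bytes) -> bool:
    """Toy self-heal for one block: repair corrupted copies only from a
    copy that validates against the checksum stored in the parent."""
    good = next((bytes(c) for c in copies if checksum(c) == expected), None)
    if good is None:
        return False              # no validated copy: report, heal nothing
    for c in copies:
        if checksum(c) != expected:
            c[:] = good           # rewrite only this one block
    return True
```

Note what the sketch cannot do: it never writes a copy that fails validation, and if every copy is bad it refuses to heal at all, which is exactly why the "cancer" scenario needs a hash collision to get started.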
The above assumes errors in the data itself, which is the MORE LIKELY case
to succeed, if you can believe it. Now let's assume an error in the hash.
Well, each hash is itself covered by a hash in its parent node, so the
faulty hash would need a hash collision with its parent node's hash! That
is especially difficult, because there are fewer possible collisions in a
1:1 relationship than in, say, a 1:100 relationship. But even assuming you
somehow manage to have a successful collision, you still fall back into
the above scenario, where you now need to find data that successfully
matches the hash, so you need a second collision! ...and again, that's for
a single block of data! That's to say nothing of the fact that you will
have a hash mismatch between the original corrupted hash and the hash of
the prospective replacement data, so the system will realize at that point
that there is a problem and will move into tie-breaker routines to sort
out the issue. I don't even see a path where this ultimately manages to
succeed.
You see how this runaway cancer scenario starts out as statistically
untenable and only becomes more and more difficult as you go? Because the
odds of ZFS corrupting the very first block are utterly remote, but the
odds of it happening a second time are even worse, and so on.
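Some back-of-envelope numbers show why the odds compound the way they do. This assumes a 256-bit checksum and models a random corruption passing validation as a uniform 2^-256 event — both simplifying assumptions of mine (the default fletcher4 checksum is weaker than this, but the compounding argument is the same).

```python
# Assumption: a corrupted block slips past a 256-bit checksum with
# probability 2**-256, and blocks fail independently.
p_one_block = 2.0 ** -256      # first block corrupted AND passing validation

# A second block requires a second, independent collision:
p_two_blocks = p_one_block ** 2

print(p_one_block, p_two_blocks)
```

Even the single-block probability is on the order of 10^-78; each additional block multiplies in another factor that small, which is the "only becomes more difficult as you go" point in numeric form.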
I've read a fair amount of this thread, and a lot of stuff has been thrown
around that seems poorly understood. For example, someone mentioned Jeff
Bonwick's comments on SHA256. However, those comments are really tied to
the deduplication feature (which I highly recommend not using unless you
have a VERY good reason to) when it is run with data verification
disabled, i.e., when ZFS goes off the hash alone instead of checking that
supposed duplicates are actually byte-for-byte duplicates. SHA256 is
extreme overkill for block-level validation; in fact, MD5 would be extreme
overkill, which is why the original ZFS implementations used CRC (if I
remember correctly; it's been a while), though now I believe ZFS defaults
to fletcher (fletcher4?). However, if you were to use SHA256 (which you
can specify), all of the above becomes multiple orders of magnitude more
remote!
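For a sense of how cheap the default checksum is, here is a fletcher4-style running checksum over 32-bit words. This is a sketch in the spirit of ZFS's fletcher4, not the kernel implementation, which differs in details (endianness handling, SIMD variants, handling of trailing bytes):

```python
import struct

def fletcher4(data: bytes) -> tuple[int, int, int, int]:
    """Four cascaded 64-bit accumulators over little-endian 32-bit words,
    fletcher4-style. Assumes len(data) is a multiple of 4."""
    a = b = c = d = 0
    mask = (1 << 64) - 1          # emulate 64-bit wraparound arithmetic
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)
```

It is just a handful of additions per word — no cryptographic machinery — yet any single bit flip in the input changes the accumulators, which is all block-level validation needs.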
OK, so that addresses all of the data-related corruption problems. Now
let's say you have a memory error (be it in the system RAM, the CPU cache,
the ALU registers, etc.) that actually affects ZFS's algorithms and
routines:
1) Unless the error is persistent and affects a choke point such as the
ALU registers, it's extremely unlikely that, of all the data in memory, it
would be the ZFS code that was affected.
2) Assuming that the ZFS code was affected, in the most likely case the
error would be caught by an error handler and dealt with accordingly.
3) Assuming the error got past the built-in error detection and handling
code, it is most likely the code would be affected in some way that would
simply cause a process failure.
4) But let's assume the error gets past all of the above considerations
and actually causes ZFS to perform operations outside of spec, such as
bypassing hash validation. In that case the validation code would never be
triggered, and thus the self-healing would never take place! So even
though the system would then be vulnerable to new errors coming in, it
wouldn't be replicating them. Again, even if the system wanted to
replicate errors, it would be on a block-by-block basis. You'd have to
have massive, coordinated errors in the ZFS routines for it to go into a
runaway destroy-the-data condition, but then similar failures could happen
to any system process (processes that aren't anywhere near as hardened,
and which constitute a larger amount of memory usage, and thus a larger
threat surface).
It's actually more likely that some other piece of software would be
corrupted in such a way as to tell ZFS to do bad things, such as delete
this or that, or to pass ZFS bad data to start with. Say you're editing a
picture, it's corrupted while in the editor, and you save; obviously ZFS
won't fix that. Or say you are accessing data via Samba; if Samba hands
ZFS corrupt data, ZFS won't fix that. There are so many ways corrupted
data could be handed to ZFS that ZFS would just see as data. Say the data
is corrupted while crossing the network, where all it has to do is get
past the relatively weak TCP safeguards (a simple checksum). (Though
honestly, TCP is pretty darn safe, which should really say something about
how much better ZFS is!) ZFS's fail-safes only kick in AFTER ZFS has the
data, so any corruption created by the system's use of the data wouldn't
be protected against. This is where data corruption happens in most cases.
Really, not only is ZFS no more dangerous under unprotected-memory
conditions, ZFS is in fact a more secure filesystem under all use cases,
including unprotected memory. ZFS does provide corruption resistance, even
from memory errors, ASSUMING the corruption takes place while ZFS is
safeguarding the data (if the corruption happens elsewhere in the system
and is then passed back to ZFS, ZFS will simply see it as an update).
Because of ZFS's multistep data-validation process, ZFS is less likely to
get into a runaway data-destruction condition than other filesystem
approaches, which don't have those steps that must be traversed before
writes occur. Further, because of ZFS's copy-on-write nature, even if ZFS
did get into such a state, recovery is MUCH easier (especially if prudent
safeguards are established), because ZFS isn't going to write over the
block in question, and so the data is still there to be recovered. As an
aside: I have found myself in truly nasty positions using ZFS beta code,
where I ended up with a corrupted pool (I was working with early
deduplication code), and I still managed to recover the data! ZFS's
built-in data recovery tools are truly extraordinary!
With all of that said, if you are building a storage server, where the
whole point is to store data, and you are selecting ZFS specifically for
its data-integrity functionality, you are crazy if you don't buy ECC
memory, because you need to protect not only ZFS but all of the
surrounding software. As noted above, external software can corrupt data,
and when it is handed back to ZFS it will look like regular data. ECC also
improves overall system reliability. ...and ECC memory isn't that
expensive!