> Hi Phil,

I know this is an old thread, but I didn't see where you ever got word back 
from the OpenZFS dev team, and this is an issue I feel needs to be 
addressed.  I am a software engineer with many years of experience working 
with ZFS.  Admittedly I have not worked on ZFS development myself, but I am 
familiar with the sort of data structures and processes ZFS uses.  I'm very 
skeptical of this idea of "ZFS cancer," as I would call it, where ZFS's 
self-healing routines become poisonous and start corrupting the entire 
filesystem because of a data error that occurs in memory.  This is a 
complicated subject, because there is a lot to take into consideration, but 
let us consider only the data for a moment.  ZFS uses an implementation of 
what in computer science is called a self-validating Merkle tree, where 
each node is validated by a checksum stored in its parent node, all the way 
up to the uberblock (the root node), which is itself duplicated elsewhere.
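
To make that concrete, here is a rough toy model in Python (my own sketch, 
not actual ZFS code): every block pointer stores the checksum of the block 
it points to, so a child is always verified against its parent before the 
data is trusted.

    import hashlib

    def checksum(data):
        # Stand-in for ZFS's block checksum (fletcher4 or sha256 in real ZFS).
        return hashlib.sha256(data).digest()

    class BlockPtr:
        """Toy block pointer: the child's data plus the checksum of that
        data, which lives in the *parent* block, not with the child."""
        def __init__(self, data):
            self.data = data               # the on-disk child block
            self.cksum = checksum(data)    # expected checksum, held by parent

        def read(self):
            # Verify the block against the parent's checksum before
            # handing the data to anyone.
            if checksum(self.data) != self.cksum:
                raise IOError("checksum mismatch - block is corrupt")
            return self.data

    # A leaf is verified against its parent, the parent against its parent,
    # and so on up to the uberblock; corruption anywhere breaks the chain.
    leaf = BlockPtr(b"file contents")
    assert leaf.read() == b"file contents"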

The proposed cancer scenario is that an in-memory error affects the data in 
question, which in turn causes a checksum validation failure, and so ZFS 
starts self-healing and writing the corrupted data all over the system. 
However, this is not how it works.  Before ZFS corrects a single block of 
corrupted data, it first finds a validated copy.  That means there has to 
be redundant data.  If you are running ZFS on a single drive in a standard 
configuration, without block duplication (copies=2) or redundancy across 
devices, you only have one copy of the data, which means self-healing never 
even comes into play.  Now let's assume you are running a mirror or 
RAID-Z1/2/3, where you do have redundant data, and ZFS detects corruption 
via a checksum failure.  Before ZFS starts healing anything, it tries to 
find a valid copy of the data by reading the redundant copy and running the 
same checksum validation on it.  The data must pass this validation in 
order to be propagated.  So now you need a second failure where the 
redundant data is also wrong, but moreover that wrong data has to pass 
validation, which would require a hash collision (a collision is where 
different data hashes to the same value).  The odds of this are 
astronomical!  But assuming you have a checksum failure, which triggers a 
self-healing operation, which then finds a corrupted redundant copy that 
also manages to pass the hash check, then yes, ZFS would replicate that bad 
data.  However, it would only replicate the error for that one block, 
because every block is checksummed individually.  That is hardly destroying 
your entire data set!  It would take a gross set of improbabilities for ZFS 
to corrupt even the single block containing your 32nd picture of Marilyn 
Monroe, and if ZFS were going to corrupt a second block, we'd have to 
repeat all of this!
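
Here is a minimal sketch of that decision, continuing the toy Python model 
above (again my own code, not the actual resilver/self-heal path in the ZFS 
source): a damaged block is only ever rewritten from a redundant copy that 
itself verifies against the checksum held by the parent.

    def try_self_heal(copies, expected_cksum):
        """Toy self-heal: 'copies' are the redundant on-disk copies of one
        block (mirror halves, RAID-Z reconstructions, ditto blocks)."""
        good = [c for c in copies if checksum(c) == expected_cksum]
        if not good:
            # No copy verifies: report an unrecoverable error and do NOT
            # overwrite anything with unverified data.
            return None
        # Only now would the bad copies be rewritten with the verified data.
        return good[0]

    original = b"block contents"
    expect = checksum(original)
    # A flipped bit in one copy gets repaired from the copy that verifies.
    assert try_self_heal([b"flipped bit", original], expect) == original
    # If no copy verifies, nothing is propagated at all.
    assert try_self_heal([b"flipped bit", b"also bad"], expect) is None

For corrupted data to be propagated here, a corrupt copy would have to hash 
to exactly the expected checksum, i.e. a hash collision.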

The above assumes errors in the data itself, which is the MORE LIKELY case 
to succeed, if you can believe it.  Now let's assume an error in the hash. 
Well, each hash is itself covered by a hash in its parent node.  So the 
faulty checksum would need to collide with what its parent node's hash 
expects!  That is especially difficult, because there are far fewer chances 
for a collision in a 1:1 relationship than in, say, a 1:100 relationship. 
But even assuming you somehow manage to have a successful collision, you 
still fall back into the above scenario, where you now need to find data 
that successfully matches the corrupted hash, so you need a second 
collision, and again, that's for a single block of data!  That's to say 
nothing of the fact that you will have a hash mismatch between the original 
corrupted hash and the hash of the prospective replacement data, so the 
system will realize at that point that there is a problem and move into its 
tie-breaker routines to sort out the issue.  I don't even see a path where 
this ultimately manages to propagate.
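
Continuing the same toy model, an error in the stored checksum (rather than 
in the data) just makes every copy fail verification; nothing gets 
rewritten unless some block happens to collide with the now-bogus checksum.

    # Corrupt the checksum instead of the data: flip one bit of it.
    original = b"block contents"
    bad_cksum = bytearray(checksum(original))
    bad_cksum[0] ^= 0x01

    # Neither (perfectly good) copy verifies against the corrupted checksum,
    # so the toy self-heal refuses to rewrite anything.
    assert try_self_heal([original, original], bytes(bad_cksum)) is None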

You see how this runaway cancer scenario starts out as statistically 
untenable and only becomes more and more difficult as you go?  Because the 
odds of ZFS corrupting the very first block are utterly remote, but the 
odds of it happening a second time are even worse, and so on.

I've read a fair amount of this thread, and a lot of stuff has been thrown 
around that seems poorly understood.  For example, someone mentioned Jeff 
Bonwick's comments on SHA256.  However, those comments are really tied to 
the deduplication feature (which I highly recommend not using unless you 
have a VERY good reason to) when dedup verification is disabled 
(verification is where ZFS checks that candidate duplicates really are 
byte-for-byte identical instead of simply trusting the hash).  SHA256 is 
extreme overkill for block-level validation; in fact, MD5 would be extreme 
overkill, which is why the original ZFS implementations used CRC (if I 
remember correctly; it's been a while), though now I believe ZFS defaults 
to fletcher (fletcher4).  However, if you were to use SHA256 (which you can 
specify), all of the above becomes multiple orders of magnitude more remote!
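
For reference, the checksum algorithm is just a per-dataset property, so 
you can turn the stronger hash on yourself (the dataset name "tank/data" 
below is only a placeholder):

    # Use SHA256 checksums for newly written blocks on a dataset.
    zfs set checksum=sha256 tank/data

    # If you do use dedup, at least have it verify candidate duplicates
    # byte-for-byte instead of trusting the hash alone.
    zfs set dedup=verify tank/data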

OK, so that addresses all of the data-related corruption problems.  Now 
let's say you have a memory error (be it in the system RAM, the CPU cache, 
the ALU registers, etc.) that actually affects ZFS's own algorithms and 
routines:

1) Unless the error is persistent and sits at a choke point such as the ALU 
registers, it's extremely unlikely that, out of everything held in memory, 
it would be the ZFS code that gets affected.
2) Assuming that the ZFS code was affected, in the most likely case the 
error would be caught by an error handler and dealt with accordingly.
3) Assuming the error got past the built-in error detection and handling 
code, it is most likely the code would be affected in some way that simply 
causes a process failure.
4) But let's assume the error gets past all of the above considerations and 
actually causes ZFS to perform operations outside of spec, such as 
bypassing hash validation.  In that case the validation code is never 
triggered, so the self-healing never takes place either!  So even though 
the system would then be vulnerable to new errors coming in, it wouldn't be 
replicating them.  And again, even if the system wanted to replicate 
errors, it would be on a block-by-block basis.  You'd have to have massive, 
coordinated errors in the ZFS routines for it to go into a runaway, 
destroy-the-data condition, but then similar failures could happen to any 
system process (processes that aren't anywhere near as hardened, and which 
account for a far larger share of memory usage, and thus a larger threat 
vector).  It's actually more likely that some other piece of software would 
be corrupted in such a way as to tell ZFS to do bad things, such as delete 
this or that, or to hand ZFS bad data to start with.  Say you're editing a 
picture, it gets corrupted in the editor's memory, and you save it: 
obviously ZFS won't fix that.  Or say you are accessing data via Samba: if 
Samba hands ZFS corrupt data, ZFS won't fix that either.  There are so many 
ways corrupted data can be handed to ZFS in a form that ZFS will simply see 
as ordinary data.  For example, the data could be corrupted while crossing 
the network, where all that stands in its way is TCP's relatively weak 
safeguard (a simple 16-bit checksum).  (Though honestly TCP is pretty safe, 
which should really say something about how much better ZFS is!)  ZFS's 
fail-safes only kick in AFTER ZFS has the data, so any corruption created 
by the system's use of the data before that point isn't protected against. 
This is where data corruption happens in most cases.
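
The only real defense against that window is an end-to-end check done by 
the application layer itself.  A minimal sketch in Python (the file paths 
and the idea of carrying a checksum alongside the transfer are just 
assumptions for the example):

    import hashlib

    def sha256_of(path):
        # Hash the file in chunks so large files don't need to fit in RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Sending side: record the checksum before the data leaves the
    # application (before it crosses the network or goes through Samba).
    expected = sha256_of("photo.jpg")

    # Receiving side: verify after the copy has landed on the ZFS dataset.
    if sha256_of("/tank/photos/photo.jpg") != expected:
        raise RuntimeError("corrupted before ZFS ever stored it")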

Really, not only is ZFS no more dangerous under unprotected memory 
conditions, ZFS is in fact a safer filesystem under all use cases, 
including unprotected memory.  ZFS does provide corruption resistance, even 
against memory errors, ASSUMING the corruption takes place while ZFS is 
safeguarding the data (if the corruption happens elsewhere in the system 
and is then passed back to ZFS, ZFS will simply see it as an update). 
Because of ZFS's multistep data validation process, ZFS is less likely to 
get into a runaway data-destruction condition than other filesystem 
approaches, which don't have those steps that must be traversed before 
writes occur.  Further, because of ZFS's copy-on-write nature, even if ZFS 
did get into such a state, recovery is MUCH easier (especially if prudent 
safeguards are established), because ZFS isn't going to write over the 
block in question, and so the data is still there to be recovered.  As an 
aside: I have found myself in truly nasty positions using ZFS beta code, 
where I ended up with a corrupted pool (I was working with early 
deduplication code), and still managed to recover the data!  ZFS's built-in 
data recovery tools are truly extraordinary!
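
As a footnote to that recovery point: because old block copies aren't 
overwritten right away, a damaged pool can often be brought back by 
rewinding to an earlier transaction group at import time ("tank" below is 
just a placeholder pool name):

    # Dry run first: report whether a rewind-based recovery would succeed,
    # without modifying anything.
    zpool import -Fn tank

    # Then actually import, discarding the last few transactions if needed.
    zpool import -F tank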

With all of that said, if you are building a storage server, where the 
whole point is to store data, and you are selecting ZFS specifically for 
its data integrity functionality, you are crazy if you don't buy ECC 
memory, because you need to protect not only ZFS but all of the surrounding 
software.  As noted above, external software can corrupt data, and when 
that data is handed back to ZFS it will look like regular data.  ECC also 
improves overall system reliability.  ...and ECC memory isn't that 
expensive.

