On 07/11/12 02:10, Sašo Kiselkov wrote: > Oh jeez, I can't remember how many times this flame war has been going > on on this list. Here's the gist: SHA-256 (or any good hash) produces a > near uniform random distribution of output. Thus, the chances of getting > a random hash collision are around 2^-256 or around 10^-77.
I think you're correct that most users don't need to worry about this -- sha-256 dedup without verification is not going to cause trouble for them. But your analysis is off. You're citing the chance that two blocks picked at random will have the same hash. But that's not what dedup does; it compares the hash of a new block to a possibly-large population of other hashes, and that gets you into the realm of "birthday problem" or "birthday paradox". See http://en.wikipedia.org/wiki/Birthday_problem for formulas. So, maybe somewhere between 10^-50 and 10^-55 for there being at least one collision in really large collections of data - still not likely enough to worry about. Of course, that assumption goes out the window if you're concerned that an adversary may develop practical ways to find collisions in sha-256 within the deployment lifetime of a system. sha-256 is, more or less, a scaled-up sha-1, and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from 2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see http://en.wikipedia.org/wiki/SHA-1#SHA-1). on a somewhat less serious note, perhaps zfs dedup should contain "chinese lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation) which asks the sysadmin to report a detected sha-256 collision to eprint.iacr.org or the like... _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss