On Thu, 06 Jan 2011 22:42:15 PST Michael DeMan <sola...@deman.com>  wrote:
> To be quite honest, I too am skeptical about using de-dupe just based on
> SHA256.  In prior posts it was asked that the potential adopter of the
> technology provide the mathematical reason to NOT use SHA-256 only.  However,
> if Oracle believes that it is adequate to do that, would it be possible for
> somebody to provide:
> 
> (A) The theoretical documents and associated mathematics specific to say one 
> simple use case?

See http://en.wikipedia.org/wiki/Birthday_problem -- in
particular see section 5.1 and the probability table of
section 3.4.

> On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:
> 
> >> I have been told that the checksum value returned by Sha256 is almost
> >> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> >> bigger problem such as memory corruption, etc. Essentially, adding
> >> verification to sha256 is an overkill.

Agreed.

> > Someone please correct me if I'm wrong.

OK :-)

> > Suppose you have 128TB of data.  That is ...  you have 2^35 unique 4k blocks
> > of uniformly sized data.  Then the probability you have any collision in
> > your whole dataset is (sum(1 thru 2^35))*2^-256 
> > Note: sum of integers from 1 to N is  (N*(N+1))/2
> > Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> > Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> > So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> > 5.1E-57 + 1.5E-67
> > 
> > ~= 5.1E-57

I believe this is wrong. See the wikipedia article referenced
above.

    p(n,d) = 1 - d!/(d^n*(d-n)!)

In your example n = 2^35, d = 2^256.  If you extrapolate the
256 bits row of the probability table of section 3.1, it is
somewhere between 10^-48 and 10^-51. 
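
To make the formula for p(n,d) concrete, here is a minimal sketch that
evaluates it on the classic small case (d = 365 days, n = 23 people).
The function name is my own, not from the article. It uses the
equivalent product form 1 - prod(1 - k/d) for k = 0..n-1, summed in log
space, because the factorials in the closed form are far too large to
compute directly. The loop runs n times, so it is only practical for
small n, not for n = 2^35.

```python
import math

def collision_probability(n, d):
    """Exact birthday-problem probability p(n, d) = 1 - d!/(d^n * (d-n)!).

    Uses the equivalent product 1 - prod_{k=0}^{n-1} (1 - k/d),
    accumulated as a sum of logs so no factorial is ever formed.
    (Hypothetical helper name, for illustration only.)
    """
    if n > d:
        return 1.0  # pigeonhole: more items than possible values
    log_no_collision = sum(math.log1p(-k / d) for k in range(n))
    return -math.expm1(log_no_collision)

# Classic birthday case: 23 people, 365 possible birthdays.
print(collision_probability(23, 365))  # roughly 0.507
```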

This may be easier to grasp: to get a 50% probability of a
collision with sha256, you need 4*10^38 blocks. For a
probability similar to disk error rates (10^-15), you need
1.5*10^31 blocks.
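
These block counts can be reproduced with the usual birthday-bound
approximation n ~= sqrt(2 * d * ln(1/(1-p))), which holds when n is much
smaller than d. The sketch below assumes that approximation and an
ideal 256-bit hash; the function name and its parameters are mine, for
illustration only.

```python
import math

def blocks_for_probability(p, bits=256):
    """Approximate number of blocks n at which the collision probability
    reaches p, for an ideal hash with the given number of output bits.

    Assumes the approximation n ~= sqrt(2 * d * ln(1/(1-p))), d = 2^bits.
    (Hypothetical helper name, for illustration only.)
    """
    d = 2.0 ** bits
    # -log1p(-p) equals ln(1/(1-p)) and stays accurate for tiny p.
    return math.sqrt(2.0 * d * -math.log1p(-p))

print(f"{blocks_for_probability(0.5):.1e}")    # about 4.0e+38
print(f"{blocks_for_probability(1e-15):.1e}")  # about 1.5e+31
```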
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
