# Re: [zfs-discuss] New fast hash algorithm - is it needed?

On 07/11/2012 06:23 PM, Gregg Wonderly wrote:
> What I'm saying is that I am getting conflicting information from your
> rebuttals here.

> I (and others) say there will be collisions that will cause data loss if
> verify is off.

Saying that "there will be" without any supporting evidence to back it
up amounts to a prophecy.

> You say it would be so rare as to be impossible from your perspective.

Correct.

> Tomas says, well then lets just use the hash value for a 4096X compression.
> You fluff around his argument calling him names.

Tomas' argument was, as I understood later, an attempt at sarcasm.
Nevertheless, I later explained exactly why I consider the
hash-compression claim total and utter bunk:

"So for a full explanation of why hashes aren't usable for compression:

1) they are one-way (kind of bummer for decompression)
2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)"
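To make point 3 concrete, here's a toy sketch (illustrative only, using a deliberately truncated 16-bit hash so collisions happen quickly). By the pigeonhole principle distinct inputs must collide, and the digest alone gives no way to recover which input produced it:

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> int:
    """Truncate SHA-256 to 16 bits -- a toy stand-in for a full hash."""
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

# With only 2**16 possible outputs, distinct inputs collide fast
# (birthday effect: expect a repeat after a few hundred tries), and
# nothing about the digest tells us which colliding input was "meant".
seen = {}
for i in count():
    data = str(i).encode()
    h = tiny_hash(data)
    if h in seen:
        print(f"collision: {seen[h]!r} and {data!r} both hash to {h:#06x}")
        break
    seen[h] = data
```

A full 256-bit hash has the same structure, just with an astronomically larger output space.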

> I say, well then compute all the possible hashes for all possible bit
> patterns and demonstrate no dupes.

This assumes it's possible to do so. Frenc made a similar claim and I
responded with this question: "how long do you expect this going to take
on average? Come on, do the math!". I pose the same to you. Find the
answer and you'll understand exactly why what you're proposing is
impossible.

> You say it's not possible to do that.

Please go on and compute a reduced version of the problem for, say, 2^64
32-byte values (still a laughably small space for the problem, but I'm
feeling generous). Here's the amount of storage you'll need:

2^64 * 32 bytes = 2^69 bytes = 524288 pebibytes (512 EiB)

And that's for a problem that I've already shrunk for you by a factor of
2^192. You see, only when you do the math do you realize how far off
base you are in claiming that pre-computation of hash rainbow tables for
generic bit patterns is doable.
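The arithmetic is trivial to check:

```python
# Storage needed to tabulate 2**64 values of 32 bytes each.
total_bytes = 2**64 * 32        # = 2**69 bytes
pib = total_bytes // 2**50      # pebibytes
eib = total_bytes // 2**60      # exbibytes
print(f"{total_bytes} bytes = {pib} PiB = {eib} EiB")
```

That's for 2^64 entries; the full 2^256 table is larger by a factor of 2^192.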

> I illustrate a way that loss of data could cost you money.

That's merely an emotional argument: you're trying to provoke a reaction
by putting "my ass" on the line. Sorry, that doesn't invalidate the
original argument that you can't do rainbow-table pre-computation for
long bit patterns.

> You say it's impossible for there to be a chance of me constructing a block
> that has the same hash but different content.

To make sure we're not using ambiguous rhetoric here, allow me to
summarize my position: you cannot produce, in practical terms, a hash
collision on a 256-bit secure hash algorithm using a brute-force algorithm.
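To put a number on "practical terms": matching one specific 256-bit digest takes about 2^256 trial hashes on average (even a free "birthday" collision between any two blocks needs about 2^128, which is equally out of reach). A back-of-the-envelope sketch, assuming a wildly generous hypothetical rate of 10^15 hashes per second:

```python
# Expected brute-force work to hit one specific 256-bit digest.
trials = 2**256
rate = 10**15                       # hashes/second (assumed, very generous)
seconds_per_year = 365 * 24 * 3600
years = trials / rate / seconds_per_year
print(f"{years:.2e} years")
```

Even at that rate, you'd be waiting on the order of 10^54 years.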

> Several people have illustrated that 128K to 32bits is a huge and lossy ratio
> of compression, yet you still say it's viable to leave verify off.

Except that we're not talking 128K to 32 bits, but 128K to 256 bits. Also, only
once you appreciate the mathematics behind the size of the 256-bit
pattern space can you understand why leaving verify off is okay.

> I say, in fact that the total number of unique patterns that can exist on any
> pool is small, compared to the total, illustrating that I understand how the
> key space for the algorithm is small when looking at a ZFS pool, and thus
> could have a non-collision opportunity.

This is so profoundly wrong that it leads me to suspect you never took
courses on cryptography and/or information theory. The size of your
storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
if your pool were the size of a single block, we're talking here about
the *mathematical* possibility of hitting on a random block that hashes
to the same value. Given a stream of random data blocks (thus simulating
an exhaustive brute-force search) and a secure pseudo-random hash
function (which has a roughly equal chance of producing any output value
for a given input block), you've got only about a 10^-77 chance per
block of getting a hash collision. If you don't understand how this
works, read a book on
digital coding theory.
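For the curious, the 10^-77 figure is simply 1/2^256, which a few lines verify:

```python
from decimal import Decimal

# Chance that one random block hashes to one specific 256-bit value.
p = Decimal(1) / Decimal(2) ** 256
print(f"{p:.3e}")   # ~8.6e-78, i.e. on the order of 10**-77
```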

> So I can see what perspective you are drawing your confidence from, but I,
> and others, are not confident that the risk has zero probability.

I never said the risk is zero. The risk is non-zero, but so close to
zero that you may safely ignore it (we take much greater risks on a
daily basis without so much as a blink of an eye).
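If you want the whole-pool odds rather than the per-block odds, apply the standard birthday approximation n(n-1)/2 divided by 2^256. Assume, hypothetically, 2^38 unique 128K blocks (a 32 PiB pool of entirely unique data):

```python
from decimal import Decimal

# Birthday-bound estimate: chance that ANY two distinct blocks in the
# pool share a 256-bit hash. n = 2**38 unique blocks is hypothetical.
n = Decimal(2) ** 38
space = Decimal(2) ** 256
p_any = n * (n - 1) / (2 * space)
print(f"{p_any:.3e}")
```

That works out to roughly 3e-55: still absurdly far below any risk we worry about in daily life.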

> I'm pushing you to find a way to demonstrate that there is zero risk because
> if you do that, then you've, in fact created the ultimate compression factor
> (but enlarged the keys that could collide because the pool is now virtually
> larger), to date for random bit patterns, and you've also demonstrated that
> the particular algorithm is very good for dedup.
> That would indicate to me, that you can then take that algorithm, and run it
> inside of ZFS dedup to automatically manage when verify is necessary by
> detecting when a collision occurs.

Do you know what a dictionary is in compression algorithms? Do you even
know how things like Huffman coding or LZW work, at least in principle?
If not, then I can see why you didn't understand my earlier explanations
of why hashes aren't usable for compression.
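For reference, a "dictionary" here is a shared, growing table of previously seen sequences. A minimal LZW sketch (illustrative only, not any real implementation) shows the crucial property hashes lack: every code maps back to exactly one sequence, so the mapping is invertible:

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW: the dictionary maps byte sequences to codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc
        else:
            out.append(dictionary[w])
            dictionary[wc] = len(dictionary)  # grow the shared dictionary
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    """Inverse mapping: every code recovers exactly one sequence."""
    dictionary = {i: bytes([i]) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:  # the one special case: code defined by this very step
            entry = w + w[:1]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[:1]
        w = entry
    return b"".join(out)

msg = b"TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decompress(lzw_compress(msg)) == msg
```

A hash gives you no such table and no inverse, which is the whole point.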

> direction of what is known and finite, away from what is infinitely complex
> and thus impossible to explore.

If you don't understand the mathematics behind my arguments, just say so.

--
Saso
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss