On Jul 11, 2012, at 12:06 PM, Sašo Kiselkov wrote:
>> I say, in fact that the total number of unique patterns that can exist on
>> any pool is small, compared to the total, illustrating that I understand how
>> the key space for the algorithm is small when looking at a ZFS pool, and
>> thus could have a non-collision opportunity.
> This is so profoundly wrong that it leads me to suspect you never took
> courses on cryptography and/or information theory. The size of your
> storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
> if your pool were the size of a single block, we're talking here about
> the *mathematical* possibility of hitting on a random block that hashes
> to the same value. Given a stream of random data blocks (thus simulating
> an exhaustive brute-force search) and a secure pseudo-random hash
> function (which has a roughly equal chance of producing any output value
> for a given input block), you've got only a 10^-77 chance of getting a
> hash collision. If you don't understand how this works, read a book on
> digital coding theory.
The size of the pool absolutely does matter, because it bounds the total
number of distinct bit patterns that can be fed into the hash. If the size of
the ZFS pool is limited, the total number of "unique" blocks is likewise
limited by the size of the pool. This affects how many collisions are
possible, and thus how effective dedup can be.
Over time, as the bit patterns on each block change, you can eventually
arrive at one of the collisions. Yes, it's rare; I'm not disputing that. I am
disputing that the risk is discardable in computer applications where data
integrity matters. For example, losing money, as in the example I used earlier.
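To put a number on "rare": the standard birthday bound gives the probability of at least one collision among n random blocks under a 256-bit hash. A minimal sketch (the pool size of 4 PiB of unique 128 KiB blocks is a hypothetical example, not a figure from this thread):

```python
import math

def collision_probability(n_blocks: int, hash_bits: int = 256) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    space = 2.0 ** hash_bits
    exponent = -n_blocks * (n_blocks - 1) / (2.0 * space)
    return -math.expm1(exponent)  # numerically stable 1 - exp(x)

# Hypothetical pool: 4 PiB of unique 128 KiB blocks -> 2**35 blocks.
p = collision_probability(2 ** 35)
print(p)  # astronomically small, but strictly greater than zero
```

So both sides of the argument are represented: the probability is on the order of 10^-57 for a pool that size, yet it is nonzero, which is the whole point of the verify option.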
>> I'm pushing you to find a way to demonstrate that there is zero risk because
>> if you do that, then you've, in fact created the ultimate compression factor
>> (but enlarged the keys that could collide because the pool is now virtually
>> larger), to date for random bit patterns, and you've also demonstrated that
>> the particular algorithm is very good for dedup.
>> That would indicate to me, that you can then take that algorithm, and run it
>> inside of ZFS dedup to automatically manage when verify is necessary by
>> detecting when a collision occurs.
> Do you know what a dictionary is in compression algorithms?
Yes, I am familiar with this kind of compression.
> Do you even
> know how things like Huffman coding or LZW work, at least in principle?
> If not, then I can see why you didn't understand my earlier explanations
> of why hashes aren't usable for compression.
With zero collisions in a well-defined key space, they would work perfectly for
compression. To wit, you are saying that you are comfortable enough using
them for dedup, which is exactly a form of compression. I'm agreeing that the
keyspace is huge, but the collision possibilities mean I'm not comfortable with it.
If there weren't a sufficiently "small" keyspace in a ZFS pool, then dedup would
never succeed. Some block contents do recur: usually blocks filled with 0x00,
0xFF, or some pattern left over from a power-up memory state, etc.
So those few common patterns are easily dedup'd out.
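The verify behavior being argued for can be sketched as a toy hash-keyed dedup table (a hypothetical illustration, not ZFS code): on a hash hit, verify compares the actual bytes before deduplicating, so a collision degrades into a stored duplicate rather than silent corruption.

```python
import hashlib

def dedup_store(store: dict, block: bytes, verify: bool = True) -> bool:
    """Return True if the block was deduplicated against an existing entry."""
    key = hashlib.sha256(block).digest()
    existing = store.get(key)
    if existing is None:
        store[key] = block          # first occurrence: store it
        return False
    if verify and existing != block:
        # Hash collision detected: keep the block under a distinct key
        # instead of silently mapping it to the wrong data.
        store[key + b"!collision"] = block
        return False
    return True                      # same hash (and, with verify, same bytes)

store = {}
zeros = bytes(128)
dedup_store(store, zeros)            # first copy: stored
print(dedup_store(store, zeros))     # second copy: deduplicated -> True
```

Without verify, the `existing != block` comparison never runs, which is exactly the trade-off under discussion.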
>> I appreciate the push back. I'm trying to drive thinking about this into
>> the direction of what is known and finite, away from what is infinitely
>> complex and thus impossible to explore.
> If you don't understand the mathematics behind my arguments, just say so.
I understand the math. I'm not convinced it's nothing to worry about, because
my data is valuable enough to me that I am using ZFS. If I were using dedup,
I'd for sure turn on verify…
zfs-discuss mailing list