Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Sašo Kiselkov
On 07/12/2012 09:52 PM, Sašo Kiselkov wrote:
> I have far too much time to explain

P.S. that should have read "I have taken far too much time explaining".
Men are crap at multitasking...

Cheers,
--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Sašo Kiselkov
On 07/12/2012 07:16 PM, Tim Cook wrote:
> Sasso: yes, it's absolutely worth implementing a higher performing hashing
> algorithm.  I'd suggest simply ignoring the people that aren't willing to
> acknowledge basic mathematics rather than lashing out.  No point in feeding
> the trolls.  The PETABYTES of data Quantum and DataDomain have out there
> are proof positive that complex hashes get the job done without verify,
> even if you don't want to acknowledge the math behind it.

That's what I deduced as well. I have far too much time to explain
statistics, coding theory and compression algorithms to random people on
the Internet. Code talks. I'm almost done implementing SHA-512/256 and
Edon-R-512/256. I've spent most of the time learning how the ZFS code is
structured and I now have most of the pieces in place, so expect the
completed code drop with (SHA-512, Edon-R and Skein) by this weekend.

Cheers,

--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Tim Cook
On Thu, Jul 12, 2012 at 9:14 AM, Edward Ned Harvey <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> > From: Jim Klimov [mailto:jimkli...@cos.ru]
> > Sent: Thursday, July 12, 2012 8:42 AM
> > To: Edward Ned Harvey
> > Subject: Re: [zfs-discuss] New fast hash algorithm - is it needed?
> >
> > 2012-07-11 18:03, Edward Ned Harvey пишет:
> > >> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > >> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
> > >>
> > >> As your dedup
> > >> ratio grows, so does the performance hit from dedup=verify. At, say,
> > >> dedupratio=10.0x, on average, every write results in 10 reads.
> > >
> > > Why?
> > > If you intend to write a block, and you discover it has a matching checksum,
> > > you only need to verify it matches one block.  You don't need to verify it
> > > matches all the other blocks that have previously been verified to match
> > > each other.
> >
> > As Saso explained, if you wrote the same block 10 times
> > and detected it was already deduped, then by verifying
> > this detection 10 times you get about 10 extra reads.
>
> (In this case, Jim, you wrote me off-list and I replied on-list, but in
> this case, I thought it would be ok because this message doesn't look
> private or exclusive to me.  I apologize if I was wrong.)
>
> I get the miscommunication now -
>
> When you write the duplicate block for the 10th time, we all understand
> you're not going to go back and verify 10 blocks.  (It seemed, at least to
> me, that's what Saso was saying.  Which is why I told him, No you don't.)
>
> You're saying, that when you wrote the duplicate block the 2nd time, you
> verified... And then when you wrote it the 3rd time, you verified...  And
> the 4th time, and the 5th time...   By the time you write the 10th time,
> you have already verified 9 previous times, but you're still going to
> verify again.
>
> Normally you would expect writing duplicate blocks dedup'd to be faster
> than writing the non-dedup'd data, because you get to skip the actual
> write.  (This idealistic performance gain might be pure vapor due to need
> to update metadata, but ignoring that technicality, continue
> hypothetically...)  When you verify, supposedly you're giving up that
> hypothetical performance gain, because you have to do a read instead of a
> write.  So at first blush, it seems like no net-gain for performance.  But
> if the duplicate block gets written frequently (for example a block of all
> zeros) then there's a high probability the duplicate block is already in
> ARC, so you can actually skip reading from disk and just read from RAM.
>
> So, the "10 extra reads" will sometimes be true - if the duplicate block
> doesn't already exist in ARC.  And the "10 extra reads" will sometimes be
> false - if the duplicate block is already in ARC.



Sasso: yes, it's absolutely worth implementing a higher performing hashing
algorithm.  I'd suggest simply ignoring the people that aren't willing to
acknowledge basic mathematics rather than lashing out.  No point in feeding
the trolls.  The PETABYTES of data Quantum and DataDomain have out there
are proof positive that complex hashes get the job done without verify,
even if you don't want to acknowledge the math behind it.

--Tim


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
> From: Jim Klimov [mailto:jimkli...@cos.ru]
> Sent: Thursday, July 12, 2012 8:42 AM
> To: Edward Ned Harvey
> Subject: Re: [zfs-discuss] New fast hash algorithm - is it needed?
> 
> 2012-07-11 18:03, Edward Ned Harvey пишет:
> >> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> >> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
> >>
> >> As your dedup
> >> ratio grows, so does the performance hit from dedup=verify. At, say,
> >> dedupratio=10.0x, on average, every write results in 10 reads.
> >
> > Why?
> > If you intend to write a block, and you discover it has a matching checksum,
> > you only need to verify it matches one block.  You don't need to verify it
> > matches all the other blocks that have previously been verified to match
> > each other.
> 
> As Saso explained, if you wrote the same block 10 times
> and detected it was already deduped, then by verifying
> this detection 10 times you get about 10 extra reads.

(In this case, Jim, you wrote me off-list and I replied on-list, but in this 
case, I thought it would be ok because this message doesn't look private or 
exclusive to me.  I apologize if I was wrong.)

I get the miscommunication now - 

When you write the duplicate block for the 10th time, we all understand you're 
not going to go back and verify 10 blocks.  (It seemed, at least to me, that's 
what Saso was saying.  Which is why I told him, No you don't.)

You're saying, that when you wrote the duplicate block the 2nd time, you 
verified... And then when you wrote it the 3rd time, you verified...  And the 
4th time, and the 5th time...   By the time you write the 10th time, you have 
already verified 9 previous times, but you're still going to verify again.

Normally you would expect writing duplicate blocks dedup'd to be faster than 
writing the non-dedup'd data, because you get to skip the actual write.  (This 
idealistic performance gain might be pure vapor due to need to update metadata, 
but ignoring that technicality, continue hypothetically...)  When you verify, 
supposedly you're giving up that hypothetical performance gain, because you 
have to do a read instead of a write.  So at first blush, it seems like no 
net-gain for performance.  But if the duplicate block gets written frequently 
(for example a block of all zeros) then there's a high probability the 
duplicate block is already in ARC, so you can actually skip reading from disk 
and just read from RAM.

So, the "10 extra reads" will sometimes be true - if the duplicate block 
doesn't already exist in ARC.  And the "10 extra reads" will sometimes be false 
- if the duplicate block is already in ARC.
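The tradeoff described here can be reduced to one line of arithmetic: with verify on, a duplicate write costs an extra disk read only when the candidate block is not already in ARC. A toy model follows (a sketch for illustration only; the hit rates and the 10-write example are assumed parameters, not measurements from any pool):

def expected_verify_reads(duplicate_writes, arc_hit_rate):
    """Expected extra disk reads caused by dedup=verify, assuming each
    duplicate write triggers one verification read and that read is served
    from ARC with probability arc_hit_rate (an assumed model parameter)."""
    return duplicate_writes * (1.0 - arc_hit_rate)

# Writing the same block 10 times: 10 extra disk reads if it is never cached,
# approaching 0 if it is almost always in ARC (e.g. a block of all zeros).
for hit_rate in (0.0, 0.5, 0.99):
    print(f"ARC hit rate {hit_rate:.2f}: "
          f"{expected_verify_reads(10, hit_rate):.1f} extra reads")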



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
> 
> This is so profoundly wrong that it leads me to suspect you never took
> courses on cryptography and/or information theory. 

More inflammatory commentary, and personal comments?  Saso, seriously, knock
it off.


> The size of your
> storage pool DOESN'T MATTER ONE BIT to the size of the key space. 

Are we still talking about SHA256 collisions in a zpool?  If so, the size of
the storage pool (at least, the used blocks inside the pool) definitely has
an effect on the probability of a collision.

The first block written to the pool gets a checksum.  There is zero
probability of collision, because it's the first one.
The second block written gets another checksum.  It has a 2^-256 probability
of collision with the first one.
The 3rd block has a 2*2^-256 probability...
The 4th block has 3*2^-256 probability...

This is the birthday problem.  If you don't know it, look it up on
wikipedia.  Each block written has a higher (but still very low) probability
of collision with any other block that was previously written, until ...

When you write your approximately 2^128th block, you have a 50/50 chance of
collision with any other block.  Fortunately, this amount of data is far
greater than all the storage in the world combined, so this probability can
never be reached with earthly means.

For all Earthly data sets, assuming no SHA256 weaknesses are being
exploited, the probability of a collision is still extremely small, but the
more data blocks in the data set, the higher the probability is.
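The accumulation described above is easy to check numerically. Below is a small sketch (added for illustration; the pool sizes are arbitrary examples, not figures from this thread) that sums the per-block collision chances exactly as laid out: the k-th block collides with one of the k-1 earlier hashes with probability (k-1) * 2^-256.

HASH_BITS = 256
PER_PAIR = 2.0 ** -HASH_BITS          # chance that two random hashes match

def collision_probability(n_blocks):
    # sum over k of (k-1) * 2^-256  ==  n*(n-1)/2 * 2^-256
    return n_blocks * (n_blocks - 1) / 2 * PER_PAIR

for exp in (20, 30, 40):              # ~1 million to ~1 trillion unique blocks
    n = 2 ** exp
    print(f"2^{exp} unique blocks: p ~ {collision_probability(n):.3e}")

The sum also shows where the 50/50 point quoted above comes from: n*(n-1)/2 * 2^-256 reaches roughly one half when n is on the order of 2^128.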



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Nico Williams
> 
> IMO dedup should always verify.

IMO, it should be a decision left to the user or admin, with the default being 
verify.  (Exactly as it is now.)  Because there is a performance-to-reliability 
tradeoff, and there is a very solid argument that the probability of collision 
causing data loss is so small that you should instead be more concerned with 
meteor strike and other improbable failures causing data loss.  For some people 
and some situations, that argument will be good enough - they would opt for the 
performance increase.  And for others, that argument is not good enough - they 
would opt for verification.



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
> 
> On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
> > You're entirely sure that there could never be two different blocks that can
> > hash to the same value and have different content?
> >
> > Wow, can you just send me the cash now and we'll call it even?
> 
> You're the one making the positive claim and I'm calling bullshit. So
> the onus is on you to demonstrate the collision (and that you arrived at
> it via your brute force method as described). Until then, my money stays
> safely on my bank account. Put up or shut up, as the old saying goes.

Come on dude.  Chill out.



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
You can treat whatever hash function as an idealized one, but actual
hash functions aren't.  There may well be as-yet-undiscovered input
bit pattern ranges where there's a large density of collisions in some
hash function, and indeed, since our hash functions aren't ideal,
there must be.  We just don't know where these potential collisions
are -- for cryptographically secure hash functions that's enough (plus
2nd pre-image and 1st pre-image resistance, but allow me to handwave),
but for dedup?  *shudder*.

Now, for some content types collisions may not be a problem at all.
Think of security camera recordings: collisions will show up as bad
frames in a video stream that no one is ever going to look at, and if
they should need it, well, too bad.

And for other content types collisions can be horrible.  Us ZFS lovers
love to talk about how silent bit rot means you may never know about
serious corruption in other filesystems until it's too late.  Now, if
you disable verification in dedup, what do you get?  The same
situation as other filesystems are in relative to bit rot, only with
different likelihoods.

Disabling verification is something to do after careful deliberation,
not something to do by default.

Nico
--


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly

On Jul 11, 2012, at 12:06 PM, Sašo Kiselkov wrote:

>> I say, in fact that the total number of unique patterns that can exist on 
>> any pool is small, compared to the total, illustrating that I understand how 
>> the key space for the algorithm is small when looking at a ZFS pool, and 
>> thus could have a non-collision opportunity.
> 
> This is so profoundly wrong that it leads me to suspect you never took
> courses on cryptography and/or information theory. The size of your
> storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
> if your pool were the size of a single block, we're talking here about
> the *mathematical* possibility of hitting on a random block that hashes
> to the same value. Given a stream of random data blocks (thus simulating
> an exhaustive brute-force search) and a secure pseudo-random hash
> function (which has a roughly equal chance of producing any output value
> for a given input block), you've got only a 10^-77 chance of getting a
> hash collision. If you don't understand how this works, read a book on
> digital coding theory.

The size of the pool does absolutely matter, because it represents the total 
number of possible bit patterns you can involve in the mapping (through the 
math).  If the size of the ZFS pool is limited, the total number of "unique" 
blocks is in fact limited by the size of the pool.  This affects how many 
collisions are possible, and thus how effective dedup can be. 

Over time, if the bit patterns can change on each block, at some point you can 
arrive at one of the collisions.  Yes, it's rare, I'm not disputing that; I am 
disputing that the risk is discardable in computer applications where data 
integrity matters (losing money, as in the example I used).

>> I'm pushing you to find a way to demonstrate that there is zero risk because 
>> if you do that, then you've, in fact created the ultimate compression factor 
>> (but enlarged the keys that could collide because the pool is now virtually 
>> larger), to date for random bit patterns, and you've also demonstrated that 
>> the particular algorithm is very good for dedup. 
>> That would indicate to me, that you can then take that algorithm, and run it 
>> inside of ZFS dedup to automatically manage when verify is necessary by 
>> detecting when a collision occurs.
> 
> Do you know what a dictionary is in compression algorithms?

Yes I am familiar with this kind of compression

> Do you even
> know how things like Huffman coding or LZW work, at least in principle?

Yes

> If not, then I can see why you didn't understand my earlier explanations
> of why hashes aren't usable for compression.

With zero collisions in a well defined key space, they work perfectly for 
compression.  To wit, you are saying that you are comfortable enough using 
them for dedup, which is exactly a form of compression.  I'm agreeing that the 
keyspace is huge, but the collision possibilities mean I'm not comfortable with 
verify=no.

If there wasn't a sufficiently "small" keyspace in a ZFS pool, then dedup would 
never succeed.  There are some block contents that are recurring.  Usually 
blocks filled with 00, FF, or some pattern from a power up memory pattern etc.  
So those few common patterns are easily dedup'd out.  

> 
>> I appreciate the push back.  I'm trying to drive thinking about this into 
>> the direction of what is known and finite, away from what is infinitely 
>> complex and thus impossible to explore.
> 
> If you don't understand the mathematics behind my arguments, just say so.

I understand the math.   I'm not convinced it's nothing to worry about, because 
my data is valuable enough to me that I am using ZFS.   If I was using dedup, 
I'd for sure turn on verify…

Gregg


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gary
On Wed, Jul 11, 2012 at 7:48 AM, Casper Dik wrote:

> Dan Brown seems to think so in "Digital Fortress" but it just means he
> has no grasp on "big numbers".
>

Or little else, for that matter. I seem to recall one character in the book
that would routinely slide under a "mainframe" on his back as if on a
mechanics dolly, solder CPUs to the motherboard above his face, and perform
all manner of bullshit on the fly repairs that never even existed back in
the earliest days of mid-20th century computing. I don't recall anything
else of a technical nature that made a lick of sense and the story was only
further insulting by the mass of alleged super geniuses that could barely
tie their own shoelaces, etc. etc. Reading the one star reviews of this
book on Amazon is far more enlightening & entertaining than reading the
actual book. I found it so insulting that I couldn't finish the last 70
pages of the paperback.

-Gary


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 1:06 PM, Bill Sommerfeld wrote:
> on a somewhat less serious note, perhaps zfs dedup should contain "chinese
> lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation)
> which asks the sysadmin to report a detected sha-256 collision to
> eprint.iacr.org or the like...


Agree. George was in that section of the code a few months ago (zio.c) and I
asked him to add a kstat, at least. I'll follow up with him next week, or get
it done some other way.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:06 PM, Bill Sommerfeld wrote:
> On 07/11/12 02:10, Sašo Kiselkov wrote:
>> Oh jeez, I can't remember how many times this flame war has been going
>> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
>> near uniform random distribution of output. Thus, the chances of getting
>> a random hash collision are around 2^-256 or around 10^-77.
> 
> I think you're correct that most users don't need to worry about this --
> sha-256 dedup without verification is not going to cause trouble for them.
> 
> But your analysis is off.  You're citing the chance that two blocks picked at
> random will have the same hash.  But that's not what dedup does; it compares
> the hash of a new block to a possibly-large population of other hashes, and
> that gets you into the realm of "birthday problem" or "birthday paradox".
> 
> See http://en.wikipedia.org/wiki/Birthday_problem for formulas.
> 
> So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
> collision in really large collections of data - still not likely enough to
> worry about.

Yeah, I know, I did this as a quick first-degree approximation. However,
the provided range is still very far above the chances of getting a
random bit-rot error that even Fletcher won't catch.

> Of course, that assumption goes out the window if you're concerned that an
> adversary may develop practical ways to find collisions in sha-256 within the
> deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
> and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
> 2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
> http://en.wikipedia.org/wiki/SHA-1#SHA-1).

Of course, this is theoretically possible, however, I do not expect such
an attack to be practical within any reasonable time frame of the
deployment. In any case, should a realistic need to solve this arise, we
can always simply switch hashes (I'm also planning to implement
Skein-512/256) and do a recv/send to rewrite everything on disk. PITA?
Yes. Serious problem? Don't think so.

> on a somewhat less serious note, perhaps zfs dedup should contain "chinese
> lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation)
> which asks the sysadmin to report a detected sha-256 collision to
> eprint.iacr.org or the like...

How about we ask them to report to me instead, like so:

1) Detect collision
2) Report to me
3) ???
4) Profit!

Cheers,
--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bill Sommerfeld
On 07/11/12 02:10, Sašo Kiselkov wrote:
> Oh jeez, I can't remember how many times this flame war has been going
> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
> near uniform random distribution of output. Thus, the chances of getting
> a random hash collision are around 2^-256 or around 10^-77.

I think you're correct that most users don't need to worry about this --
sha-256 dedup without verification is not going to cause trouble for them.

But your analysis is off.  You're citing the chance that two blocks picked at
random will have the same hash.  But that's not what dedup does; it compares
the hash of a new block to a possibly-large population of other hashes, and
that gets you into the realm of "birthday problem" or "birthday paradox".

See http://en.wikipedia.org/wiki/Birthday_problem for formulas.

So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
collision in really large collections of data - still not likely enough to
worry about.
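For anyone who wants to reproduce that estimate, the standard birthday approximation p ~ 1 - exp(-n(n-1)/2^257) gives figures in exactly that range for very large pools. A short sketch follows (the block counts below are illustrative assumptions, roughly 128 PiB to 4 EiB of unique 128K blocks, not numbers taken from this thread):

import math

HASH_BITS = 256

def birthday_collision_probability(n):
    """P(at least one collision among n random 256-bit hashes),
    using the usual approximation 1 - exp(-n*(n-1) / 2^(bits+1))."""
    exponent = -n * (n - 1) / 2.0 ** (HASH_BITS + 1)
    return -math.expm1(exponent)      # 1 - exp(x), accurate when p is tiny

for exp in (40, 42, 45):              # 2^40 .. 2^45 unique 128K blocks
    n = 2 ** exp
    print(f"2^{exp} blocks: p ~ {birthday_collision_probability(n):.2e}")

The output lands between roughly 10^-54 and 10^-51, consistent with the 10^-50 to 10^-55 window above.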

Of course, that assumption goes out the window if you're concerned that an
adversary may develop practical ways to find collisions in sha-256 within the
deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
http://en.wikipedia.org/wiki/SHA-1#SHA-1).

on a somewhat less serious note, perhaps zfs dedup should contain "chinese
lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation)
which asks the sysadmin to report a detected sha-256 collision to
eprint.iacr.org or the like...





Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Hung-Sheng Tsao Ph.D.


On 7/11/2012 3:16 PM, Bob Friesenhahn wrote:
> On Wed, 11 Jul 2012, Hung-Sheng Tsao (LaoTsao) Ph.D wrote:
>>> Not correct as far as I can tell.  You should re-read the page you
>>> referenced.  Oracle rescinded (or lost) the special Studio releases
>>> needed to build the OpenSolaris kernel.
>>
>> you can still download 12 12.1 12.2, AFAIK through OTN
>
> That is true (and I have done so).  Unfortunately the versions offered
> are not the correct ones to build the OpenSolaris kernel.  Special
> patched versions with particular date stamps are required.  The only
> way that I see to obtain these files any more is via distribution
> channels primarily designed to perform copyright violations.

one can still download the patches through MOS, but one needs to pay for
support to get the development tools :-(

> Bob




Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Hung-Sheng Tsao (LaoTsao) Ph.D wrote:
>> Not correct as far as I can tell.  You should re-read the page you referenced.
>> Oracle rescinded (or lost) the special Studio releases needed to build the
>> OpenSolaris kernel.
>
> you can still download 12 12.1 12.2, AFAIK through OTN

That is true (and I have done so).  Unfortunately the versions offered
are not the correct ones to build the OpenSolaris kernel.  Special
patched versions with particular date stamps are required.  The only
way that I see to obtain these files any more is via distribution
channels primarily designed to perform copyright violations.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 3:45 AM, Sašo Kiselkov  wrote:
> It's also possible to set "dedup=verify" with "checksum=sha256",
> however, that makes little sense (as the chances of getting a random
> hash collision are essentially nil).

IMO dedup should always verify.

Nico
--


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Hung-Sheng Tsao (LaoTsao) Ph.D


Sent from my iPad

On Jul 11, 2012, at 13:11, Bob Friesenhahn  wrote:

> On Wed, 11 Jul 2012, Richard Elling wrote:
>> The last studio release suitable for building OpenSolaris is available in 
>> the repo.
>> See the instructions at 
>> http://wiki.illumos.org/display/illumos/How+To+Build+illumos
> 
> Not correct as far as I can tell.  You should re-read the page you 
> referenced.  Oracle rescinded (or lost) the special Studio releases needed to 
> build the OpenSolaris kernel.  

hi
you can still download 12 12.1 12.2, AFAIK through OTN


> The only way I can see to obtain these releases is illegally.
> 
> However, Studio 12.3 (free download) produces user-space executables which 
> run fine under Illumos.
> 
> Bob
> -- 
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 10:23 AM, Sašo Kiselkov wrote:

> Hi Richard,
> 
> On 07/11/2012 06:58 PM, Richard Elling wrote:
>> Thanks Sašo!
>> Comments below...
>> 
>> On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote:
>> 
>>> Hi guys,
>>> 
>>> I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
>>> implementation to supplant the currently utilized sha256.
>> 
>> No need to supplant, there are 8 bits for enumerating hash algorithms, so 
>> adding another is simply a matter of coding. With the new feature flags, it 
>> is
>> almost trivial to add new algorithms without causing major compatibility 
>> headaches. Darren points out that Oracle is considering doing the same,
>> though I do not expect Oracle to pick up the feature flags.
> 
> I meant in the functional sense, not in the technical - of course, my
> changes would be implemented as a feature flags add-on.

Great! Let's do it! 
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 10:11 AM, Bob Friesenhahn wrote:
> On Wed, 11 Jul 2012, Richard Elling wrote:
>> The last studio release suitable for building OpenSolaris is available in 
>> the repo.
>> See the instructions at 
>> http://wiki.illumos.org/display/illumos/How+To+Build+illumos
> 
> Not correct as far as I can tell.  You should re-read the page you 
> referenced.  Oracle rescinded (or lost) the special Studio releases needed to 
> build the OpenSolaris kernel.  The only way I can see to obtain these 
> releases is illegally.

In the US, the term "illegal" is most often used for criminal law. Contracts
between parties are covered under civil law. It is the responsibility of the
parties to agree to and enforce civil contracts. This includes you, dear reader.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Richard Elling wrote:
> The last studio release suitable for building OpenSolaris is available in the repo.
> See the instructions at http://wiki.illumos.org/display/illumos/How+To+Build+illumos

Not correct as far as I can tell.  You should re-read the page you
referenced.  Oracle rescinded (or lost) the special Studio releases
needed to build the OpenSolaris kernel.  The only way I can see to
obtain these releases is illegally.

However, Studio 12.3 (free download) produces user-space executables
which run fine under Illumos.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 06:23 PM, Gregg Wonderly wrote:
> What I'm saying is that I am getting conflicting information from your 
> rebuttals here.

Well, let's address that then:

> I (and others) say there will be collisions that will cause data loss if 
> verify is off.

Saying that "there will be" without any supporting evidence to back it
up amounts to a prophecy.

> You say it would be so rare as to be impossible from your perspective.

Correct.

> Tomas says, well then let's just use the hash value for a 4096X compression.
> You fluff around his argument calling him names.

Tomas' argument was, as I understood later, an attempt at sarcasm.
Nevertheless, I later explained exactly why I consider the
hash-compression claim total and utter bunk:

"So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)"

> I say, well then compute all the possible hashes for all possible bit 
> patterns and demonstrate no dupes.

This assumes it's possible to do so. Ferenc made a similar claim and I
responded with this question: "how long do you expect this going to take
on average? Come on, do the math!". I pose the same to you. Find the
answer and you'll understand exactly why what you're proposing is
impossible.

> You say it's not possible to do that.

Please go on and compute a reduced size of the problem for, say, 2^64
32-byte values (still a laughably small space for the problem, but I'm
feeling generous). Here's the amount of storage you'll need:

2^64 * 32 bytes = 2^69 bytes = roughly 512 exabytes

And that's for a problem that I've reduced for you by a factor of 2^192
(about 58 decimal orders of magnitude). You see, only when you do the math
do you realize how off base you are in claiming that pre-computation of
hash rainbow tables for generic bit patterns is doable.
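The storage figure is simple to verify; a throwaway calculation (included purely to make the magnitude concrete, and counting only the hash values themselves, not the inputs) is:

# Storage needed just to hold 2^64 distinct 32-byte (256-bit) hash values.
values = 2 ** 64
bytes_per_value = 32
total_bytes = values * bytes_per_value           # 2^69 bytes

print(total_bytes)                               # 590295810358705651712
print(total_bytes / 2 ** 60, "EiB")              # 512.0, i.e. ~512 binary exabytes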

> I illustrate a way that loss of data could cost you money.

That's merely an emotional argument where you are trying to attack me by
trying to invoke an emotional response from when "my ass" is on the
line. Sorry, that doesn't invalidate the original argument that you
can't do rainbow table pre-computation for long bit patterns.

> You say it's impossible for there to be a chance of me constructing a block 
> that has the same hash but different content.

To make sure we're not using ambiguous rhetoric here, allow me to
summarize my position: you cannot produce, in practical terms, a hash
collision on a 256-bit secure hash algorithm using a brute-force algorithm.

> Several people have illustrated that 128K to 32bits is a huge and lossy ratio 
> of compression, yet you still say it's viable to leave verify off.

Except that we're not talking 128K to 32b, but 128K to 256b. Also, only
once you appreciate the mathematics behind the size of the 256-bit
pattern space can you understand why leaving verify off is okay.

> I say, in fact that the total number of unique patterns that can exist on any 
> pool is small, compared to the total, illustrating that I understand how the 
> key space for the algorithm is small when looking at a ZFS pool, and thus 
> could have a non-collision opportunity.

This is so profoundly wrong that it leads me to suspect you never took
courses on cryptography and/or information theory. The size of your
storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
if your pool were the size of a single block, we're talking here about
the *mathematical* possibility of hitting on a random block that hashes
to the same value. Given a stream of random data blocks (thus simulating
an exhaustive brute-force search) and a secure pseudo-random hash
function (which has a roughly equal chance of producing any output value
for a given input block), you've got only a 10^-77 chance of getting a
hash collision. If you don't understand how this works, read a book on
digital coding theory.

> So I can see what perspective you are drawing your confidence from, but I, 
> and others, are not confident that the risk has zero probability.

I never said the risk is zero. The risk is non-zero, but it is so close to
zero that you may safely ignore it (since we take much greater risks on
a daily basis without so much as a blink of an eye).

> I'm pushing you to find a way to demonstrate that there is zero risk because 
> if you do that, then you've, in fact created the ultimate compression factor 
> (but enlarged the keys that could collide because the pool is now virtually 
> larger), to date for random bit patterns, and you've also demonstrated that 
> the particular algorithm is very good for dedup. 
> That would indicate to me, that you can then take that algorithm, and run it 
> inside of ZFS dedup to automatically manage when verify is necessary by 
> detecting when a collision occurs.

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
Thanks Sašo!
Comments below...

On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote:

> Hi guys,
> 
> I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
> implementation to supplant the currently utilized sha256.

No need to supplant, there are 8 bits for enumerating hash algorithms, so 
adding another is simply a matter of coding. With the new feature flags, it is
almost trivial to add new algorithms without causing major compatibility 
headaches. Darren points out that Oracle is considering doing the same,
though I do not expect Oracle to pick up the feature flags.

> On modern
> 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
> slower than many of the SHA-3 candidates, so I went out and did some
> testing (details attached) on a possible new hash algorithm that might
> improve on this situation.
> 
> However, before I start out on a pointless endeavor, I wanted to probe
> the field of ZFS users, especially those using dedup, on whether their
> workloads would benefit from a faster hash algorithm (and hence, lower
> CPU utilization). Developments of late have suggested to me three
> possible candidates:
> 
> * SHA-512: simplest to implement (since the code is already in the
>   kernel) and provides a modest performance boost of around 60%.
> 
> * Skein-512: overall fastest of the SHA-3 finalists and much faster
>   than SHA-512 (around 120-150% faster than the current sha256).
> 
> * Edon-R-512: probably the fastest general purpose hash algorithm I've
>   ever seen (upward of 300% speedup over sha256) , but might have
>   potential security problems (though I don't think this is of any
>   relevance to ZFS, as it doesn't use the hash for any kind of security
>   purposes, but only for data integrity & dedup).
> 
> My testing procedure: nothing sophisticated, I took the implementation
> of sha256 from the Illumos kernel and simply ran it on a dedicated
> psrset (where possible with a whole CPU dedicated, even if only to a
> single thread) - I tested both the generic C implementation and the
> Intel assembly implementation. The Skein and Edon-R implementations are
> in C optimized for 64-bit architectures from the respective authors (the
> most up to date versions I could find). All code has been compiled using
> GCC 3.4.3 from the repos (the same that can be used for building
> Illumos). Sadly, I don't have access to Sun Studio.
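A rough userland analogue of the quoted procedure can be run with Python's hashlib; it is only a sketch (single-threaded, no psrset pinning, and Skein and Edon-R are not in the standard library), but it is enough to see the SHA-512-faster-than-SHA-256 effect on 64-bit CPUs. The buffer size and round count below are arbitrary assumptions:

import hashlib
import time

BUF = b"\xa5" * (128 * 1024)          # one 128K "block" of dummy data
ROUNDS = 2000                          # ~256 MB hashed per algorithm

for name in ("sha256", "sha512"):
    start = time.perf_counter()
    for _ in range(ROUNDS):
        hashlib.new(name, BUF).digest()
    elapsed = time.perf_counter() - start
    mb = len(BUF) * ROUNDS / 1e6
    print(f"{name}: {mb / elapsed:.0f} MB/s")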

The last studio release suitable for building OpenSolaris is available in the 
repo.
See the instructions at 
http://wiki.illumos.org/display/illumos/How+To+Build+illumos

I'd be curious about whether you see much difference based on studio 12.1,
gcc 3.4.3 and gcc 4.4 (or even 4.7)
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
>> For example, the well-known block might be part of a Windows anti-virus
>> package, or a Windows firewall configuration, and corrupting it might
>> leave a Windows VM open to malware attack.
>
> True, but that may not be enough to produce a practical collision for
> the reason that while you know which bytes you want to attack, these
> might not line up with ZFS disk blocks (especially the case with Windows
> VMs which are stored in large opaque zvols) - such an attack would
> require physical access to the machine (at which point you can simply
> manipulate the blocks directly).

I think that well-known blocks are much easier to predict than you say
because operating systems, VMs, and application software behave in
predictable patterns.  However, deriving another useful block which
hashes the same should be extremely difficult, and any block hashing
algorithm needs to assure that.  Having an excellent random
distribution property is not sufficient if it is relatively easy to
compute some other block producing the same hash.  It may be useful to
compromise a known block even if the compromised result is complete
garbage.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 11:58, Gregg Wonderly wrote:
> You're entirely sure that there could never be two different blocks that
> can hash to the same value and have different content?
[...]

The odds of being hit by lightning (at least in the US) are about 1 in
700,000. I don't worry about that happening, so personally I don't see the
point in worrying about something many orders of magnitude less likely to
happen.

When I get hit by lightning (or win the lottery), I'll start worrying
about hash collisions. Until then: meh. :)




Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
What I'm saying is that I am getting conflicting information from your 
rebuttals here.

I (and others) say there will be collisions that will cause data loss if verify 
is off.
You say it would be so rare as to be impossible from your perspective.
Tomas says, well then let's just use the hash value for a 4096X compression.
You fluff around his argument calling him names.
I say, well then compute all the possible hashes for all possible bit patterns 
and demonstrate no dupes.
You say it's not possible to do that.
I illustrate a way that loss of data could cost you money.
You say it's impossible for there to be a chance of me constructing a block 
that has the same hash but different content.
Several people have illustrated that 128K to 32bits is a huge and lossy ratio 
of compression, yet you still say it's viable to leave verify off.
I say, in fact that the total number of unique patterns that can exist on any 
pool is small, compared to the total, illustrating that I understand how the 
key space for the algorithm is small when looking at a ZFS pool, and thus could 
have a non-collision opportunity.

So I can see what perspective you are drawing your confidence from, but I, and 
others, are not confident that the risk has zero probability.

I'm pushing you to find a way to demonstrate that there is zero risk because if 
you do that, then you've, in fact created the ultimate compression factor (but 
enlarged the keys that could collide because the pool is now virtually larger), 
to date for random bit patterns, and you've also demonstrated that the 
particular algorithm is very good for dedup. 

That would indicate to me, that you can then take that algorithm, and run it 
inside of ZFS dedup to automatically manage when verify is necessary by 
detecting when a collision occurs.

I appreciate the push back.  I'm trying to drive thinking about this into the 
direction of what is known and finite, away from what is infinitely complex and 
thus impossible to explore.

Maybe all the work has already been done…

Gregg

On Jul 11, 2012, at 11:02 AM, Sašo Kiselkov wrote:

> On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
>> You're entirely sure that there could never be two different blocks that can 
>> hash to the same value and have different content?
>> 
>> Wow, can you just send me the cash now and we'll call it even?
> 
> You're the one making the positive claim and I'm calling bullshit. So
> the onus is on you to demonstrate the collision (and that you arrived at
> it via your brute force method as described). Until then, my money stays
> safely on my bank account. Put up or shut up, as the old saying goes.
> 
> --
> Saso



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
> You're entirely sure that there could never be two different blocks that can 
> hash to the same value and have different content?
> 
> Wow, can you just send me the cash now and we'll call it even?

You're the one making the positive claim and I'm calling bullshit. So
the onus is on you to demonstrate the collision (and that you arrived at
it via your brute force method as described). Until then, my money stays
safely on my bank account. Put up or shut up, as the old saying goes.

--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
You're entirely sure that there could never be two different blocks that can 
hash to the same value and have different content?

Wow, can you just send me the cash now and we'll call it even?

Gregg

On Jul 11, 2012, at 9:59 AM, Sašo Kiselkov wrote:

> On 07/11/2012 04:56 PM, Gregg Wonderly wrote:
>> So, if I had a block collision on my ZFS pool that used dedup, and it had my 
>> bank balance of $3,212.20 on it, and you tried to write your bank balance of 
>> $3,292,218.84 and got the same hash, no verify, and thus you got my 
>> block/balance and now your bank balance was reduced by 3 orders of 
>> magnitude, would you be okay with that?  What assurances would you be 
>> content with using my ZFS pool?
> 
> I'd feel entirely safe. There, I said it.
> 
> --
> Saso



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:33 PM, Bob Friesenhahn wrote:
> On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
>>
>> The reason why I don't think this can be used to implement a practical
>> attack is that in order to generate a collision, you first have to know
>> the disk block that you want to create a collision on (or at least the
>> checksum), i.e. the original block is already in the pool. At that
>> point, you could write a colliding block which would get de-dup'd, but
>> that doesn't mean you've corrupted the original data, only that you
>> referenced it. So, in a sense, you haven't corrupted the original block,
>> only your own collision block (since that's the copy that doesn't get
>> written).
> 
> This is not correct.  If you know the well-known block to be written,
> then you can arrange to write your collision block prior to when the
> well-known block is written.  Therefore, it is imperative that the hash
> algorithm make it clearly impractical to take a well-known block and
> compute a collision block.
> 
> For example, the well-known block might be part of a Windows anti-virus
> package, or a Windows firewall configuration, and corrupting it might
> leave a Windows VM open to malware attack.

True, but that may not be enough to produce a practical collision for
the reason that while you know which bytes you want to attack, these
might not line up with ZFS disk blocks (especially the case with Windows
VMs which are stored in large opaque zvols) - such an attack would
require physical access to the machine (at which point you can simply
manipulate the blocks directly).

--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

>On Wed, Jul 11, 2012 at 9:48 AM,   wrote:
>>>Huge space, but still finite…
>>
>> Dan Brown seems to think so in "Digital Fortress" but it just means he
>> has no grasp on "big numbers".
>
>I couldn't get past that.  I had to put the book down.  I'm guessing
>it was as awful as it threatened to be.


It is *fiction*.  So just read it as if it is "magical", like Harry Potter.

It's just like "well researched" in fiction means exactly the same as
"well researched" in journalism: the facts aren't actually facts but could 
pass as facts to those who haven't had a proper science education
(which unfortunately includes a large part of the population and 99.5% of
all politicians)

Casper


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 9:48 AM,   wrote:
>>Huge space, but still finite…
>
> Dan Brown seems to think so in "Digital Fortress" but it just means he
> has no grasp on "big numbers".

I couldn't get past that.  I had to put the book down.  I'm guessing
it was as awful as it threatened to be.

IMO, FWIW, yes, do add SHA-512 (truncated to 256 bits, of course).

Nico
--


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
> The reason why I don't think this can be used to implement a practical
> attack is that in order to generate a collision, you first have to know
> the disk block that you want to create a collision on (or at least the
> checksum), i.e. the original block is already in the pool. At that
> point, you could write a colliding block which would get de-dup'd, but
> that doesn't mean you've corrupted the original data, only that you
> referenced it. So, in a sense, you haven't corrupted the original block,
> only your own collision block (since that's the copy that doesn't get written).

This is not correct.  If you know the well-known block to be written,
then you can arrange to write your collision block prior to when the
well-known block is written.  Therefore, it is imperative that the
hash algorithm make it clearly impractical to take a well-known block
and compute a collision block.

For example, the well-known block might be part of a Windows
anti-virus package, or a Windows firewall configuration, and
corrupting it might leave a Windows VM open to malware attack.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 10:23, casper@oracle.com wrote:

> I think that I/O isn't getting as fast as CPU is; memory capacity and
> bandwidth and CPUs are getting faster.  I/O, not so much.
> (Apart from the one single step from hard disk to SSD; but note that
> I/O is limited to standard interfaces and as such it is likely to be
> held down by requiring a new standard.)

The only other thing on the (theoretical) horizon would be things based on
the memristor.




Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:10 PM, David Magda wrote:
> On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote:
> 
>> I'm not convinced waiting makes much sense. The SHA-3 standardization
>> process' goals are different from "ours". SHA-3 can choose to go with
>> something that's slower, but has a higher security margin. I think that
>> absolute super-tight security isn't all that necessary for ZFS, since
>> the hash isn't used for security purposes. We only need something that's
>> fast and has a good pseudo-random output distribution. That's why I
>> looked toward Edon-R. Even though it might have security problems in
>> itself, it's by far the fastest algorithm in the entire competition.
> 
> Fair enough, though I think eventually the SHA-3 winner will be
> incorporated into hardware (or at least certain instructions used in the
> algorithm will). I think waiting a few more weeks/months shouldn't be a
> big deal, as the winner should be announced Real Soon Now, and then a more
> informed decision can probably be made.

The AES process winner had been announced in October 2000. Considering
AES-NI was proposed in March 2008 and first silicon for it appeared
around January 2010, I wouldn't hold my breath hoping for hardware
SHA-3-specific acceleration getting a widespread foothold for at least
another 5-10 years (around 2-3 technology generations).

That being said, a lot can be achieved using SIMD instructions, but that
doesn't depend on the SHA-3 process in any way.

--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote:

> I'm not convinced waiting makes much sense. The SHA-3 standardization
> process' goals are different from "ours". SHA-3 can choose to go with
> something that's slower, but has a higher security margin. I think that
> absolute super-tight security isn't all that necessary for ZFS, since
> the hash isn't used for security purposes. We only need something that's
> fast and has a good pseudo-random output distribution. That's why I
> looked toward Edon-R. Even though it might have security problems in
> itself, it's by far the fastest algorithm in the entire competition.

Fair enough, though I think eventually the SHA-3 winner will be
incorporated into hardware (or at least certain instructions used in the
algorithm will). I think waiting a few more weeks/months shouldn't be a
big deal, as the winner should be announced Real Soon Now, and then a more
informed decision can probably be made.




Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
I'm just suggesting that the "time frame" of when 256-bits or 512-bits is less 
safe, is closing faster than one might actually think, because social elements 
of the internet allow a lot more effort to be focused on a single "problem" 
than one might consider.  

Gregg Wonderly

On Jul 11, 2012, at 9:50 AM, Edward Ned Harvey wrote:

>> From: Gregg Wonderly [mailto:gr...@wonderly.org]
>> Sent: Wednesday, July 11, 2012 10:28 AM
>> 
>> Unfortunately, the government imagines that people are using their home
>> computers to compute hashes and try and decrypt stuff.  Look at what is
>> happening with GPUs these days.  
> 
> heheheh.  I guess the NSA didn't think of that.;-)
> (That's sarcasm, in case anyone didn't get it.)
> 



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:56 PM, Gregg Wonderly wrote:
> So, if I had a block collision on my ZFS pool that used dedup, and it had my 
> bank balance of $3,212.20 on it, and you tried to write your bank balance of 
> $3,292,218.84 and got the same hash, no verify, and thus you got my 
> block/balance and now your bank balance was reduced by 3 orders of magnitude, 
> would you be okay with that?  What assurances would you be content with using 
> my ZFS pool?

I'd feel entirely safe. There, I said it.

--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:54 PM, Ferenc-Levente Juhos wrote:
> You don't have to store all hash values:
> a. Just memorize the first one SHA256(0)
> b. start counting
> c. bang: by the time you get to 2^256 you get at least a collision.

Just one question: how long do you expect this going to take on average?
Come on, do the math!

--
Saso


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
So, if I had a block collision on my ZFS pool that used dedup, and it had my 
bank balance of $3,212.20 on it, and you tried to write your bank balance of 
$3,292,218.84 and got the same hash, no verify, and thus you got my 
block/balance and now your bank balance was reduced by 3 orders of magnitude, 
would you be okay with that?  What assurances would you be content with using 
my ZFS pool?

Gregg Wonderly

On Jul 11, 2012, at 9:43 AM, Sašo Kiselkov wrote:

> On 07/11/2012 04:30 PM, Gregg Wonderly wrote:
>> This is exactly the issue for me.  It's vital to always have verify on.  If 
>> you don't have the data to prove that every possible block combination 
>> possible, hashes uniquely for the "small" bit space we are talking about, 
>> then how in the world can you say that "verify" is not necessary?  That just 
>> seems ridiculous to propose.
> 
> Do you need assurances that in the next 5 seconds a meteorite won't fall
> to Earth and crush you? No. And yet, the Earth puts on thousands of tons
> of weight each year from meteoric bombardment and people have been hit
> and killed by them (not to speak of mass extinction events). Nobody has
> ever demonstrated of being able to produce a hash collision in any
> suitably long hash (128-bits plus) using a random search. All hash
> collisions have been found by attacking the weaknesses in the
> mathematical definition of these functions (i.e. some part of the input
> didn't get obfuscated well in the hash function machinery and spilled
> over into the result, resulting in a slight, but usable non-randomness).
> 
> Cheers,
> --
> Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
You don't have to store all hash values:
a. Just memorize the first one SHA256(0)
b. start counting
c. bang: by the time you get to 2^256 you get at least one collision.

(do this using BOINC, you don't have to wait for the last hash to be
calculated, I'm pretty sure a collision will occur sooner)

1. SHA256 produces a 256 bit hash
2. That means it produces a value on 256 bits, in other words a value
between 0..2^256 - 1
3. If you start counting from 0 to 2^256 and for each number calculate the
SHA256 you will get at least one hash collision (if the hash algorithm is
perfectly distributed)
4. Counting from 0 to 2^256, is nothing else but reproducing all possible
bit pattern on 32 bytes


And nobody is trying to proove that SHA256 is bad, and I don't want to do
compression either.
But it's unrealistic to think that a SHA256 collision won't occur just
because it would be impossible to find it via brute-force.

On Wed, Jul 11, 2012 at 4:49 PM, Sašo Kiselkov wrote:

> On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote:
> > As I said several times before, to produce hash collisions. Or to
> calculate
> > rainbow tables (as a previous user theorized it) you only need the
> > following.
> >
> > You don't need to reproduce all possible blocks.
> > 1. SHA256 produces a 256 bit hash
> > 2. That means it produces a value on 256 bits, in other words a value
> > between 0..2^256 - 1
> > 3. If you start counting from 0 to 2^256 and for each number calculate
> the
> > SHA256 you will get at least one hash collision (if the hash algorithm is
> > perfectly distributed)
> > 4. Counting from 0 to 2^256, is nothing else but reproducing all possible
> > bit pattern on 32 bytes
> >
> > It's not about whether one computer is capable of producing the above
> > hashes or not, or whether there are actually that many unique 32 byte bit
> > patterns in the universe.
> > A collision can happen.
>
> It's actually not that simple, because in hash collision attacks you're
> not always afforded the luxury of being able to define your input block.
> More often than not, you want to modify a previously hashed block in
> such a fashion that it carries your intended modifications while hashing
> to the same original value. Say for instance you want to modify a
> 512-byte message (e.g. an SSL certificate) to point to your own CN. Here
> your rainbow table, even if you could store it somewhere (you couldn't,
> btw), would do you little good here.
>
> --
> Saso
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


>Do you need assurances that in the next 5 seconds a meteorite won't fall
>to Earth and crush you? No. And yet, the Earth puts on thousands of tons
>of weight each year from meteoric bombardment and people have been hit
>and killed by them (not to speak of mass extinction events). Nobody has
>ever demonstrated of being able to produce a hash collision in any
>suitably long hash (128-bits plus) using a random search. All hash
>collisions have been found by attacking the weaknesses in the
>mathematical definition of these functions (i.e. some part of the input
>didn't get obfuscated well in the hash function machinery and spilled
>over into the result, resulting in a slight, but usable non-randomness).

The reason why we don't protect against such an event is that it would
be extremely expensive with a very small chance of it being needed.

"verify" doesn't cost much, so even if the risk is as infinitesimal as a
direct meteorite hit, it may still be cost effective.

(Just like we'd be better off preparing for the climate changing (rather
cheap) rather than trying to keep the climate from changing (impossible
and still extremely expensive).)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
> From: Gregg Wonderly [mailto:gr...@wonderly.org]
> Sent: Wednesday, July 11, 2012 10:28 AM
> 
> Unfortunately, the government imagines that people are using their home
> computers to compute hashes and try and decrypt stuff.  Look at what is
> happening with GPUs these days.  

heheheh.  I guess the NSA didn't think of that.;-)
(That's sarcasm, in case anyone didn't get it.)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
Yes, but from the other angle, the number of unique 128K blocks that you can
store on your ZFS pool is actually finitely small compared to the total
space.  So the patterns you actually need to consider are no more than what
the physical limits of the universe allow.

Gregg Wonderly

On Jul 11, 2012, at 9:39 AM, Sašo Kiselkov wrote:

> On 07/11/2012 04:27 PM, Gregg Wonderly wrote:
>> Unfortunately, the government imagines that people are using their home 
>> computers to compute hashes and try and decrypt stuff.  Look at what is 
>> happening with GPUs these days.  People are hooking up 4 GPUs in their 
>> computers and getting huge performance gains.  5-6 char password space 
>> covered in a few days.  12 or so chars would take one machine a couple of 
>> years if I recall.  So, if we had 20 people with that class of machine, we'd 
>> be down to a few months.   I'm just suggesting that while the compute space 
>> is still huge, it's not actually undoable, it just requires some thought 
>> into how to approach the problem, and then some time to do the computations.
>> 
>> Huge space, but still finite…
> 
> There are certain physical limits which one cannot exceed. For instance,
> you cannot store 2^256 units of 32-byte quantities in Earth. Even if you
> used proton spin (or some other quantum property) to store a bit, there
> simply aren't enough protons in the entire visible universe to do it.
> You will never ever be able to search a 256-bit memory space using a
> simple exhaustive search. The reason why our security hashes are so long
> (256-bits, 512-bits, more...) is because attackers *don't* do an
> exhaustive search.
> 
> --
> Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

>Unfortunately, the government imagines that people are using their home
>computers to compute hashes and try and decrypt stuff.  Look at what is
>happening with GPUs these days.  People are hooking up 4 GPUs in their
>computers and getting huge performance gains.  5-6 char password space
>covered in a few days.  12 or so chars would take one machine a couple of
>years if I recall.  So, if we had 20 people with that class of machine, we'd
>be down to a few months.   I'm just suggesting that while the compute space
>is still huge, it's not actually undoable, it just requires some thought
>into how to approach the problem, and then some time to do the computations.
>
>Huge space, but still finite…

Dan Brown seems to think so in "Digital Fortress" but it just means he 
has no grasp on "big numbers".

2^128 is a huge space, finite *but* beyond brute force *forever*.

Considering that we have nearly 10 billion people, and if you give all of
them 1 billion computers, each able to compute 1 billion checks
per second, how many years does it take before we get the solution?

Did you realize that that number is roughly *twice* the number of years
needed for a *single* computer with the same specification to solve this
problem for 64 bits?
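
Spelled out as a small Python sketch (the population, machine count and
per-machine rate below are simply the figures assumed above):

  SECONDS_PER_YEAR = 3600 * 24 * 365

  # Figures assumed above: 10 billion people, 1 billion computers each,
  # 1 billion checks per second per computer.
  grid_rate = 10e9 * 1e9 * 1e9          # checks/sec for the whole planet
  single_rate = 1e9                     # checks/sec for one computer

  years_128_grid = 2 ** 128 / grid_rate / SECONDS_PER_YEAR
  years_64_single = 2 ** 64 / single_rate / SECONDS_PER_YEAR

  print("2^128 on the whole grid:  %.0f years" % years_128_grid)   # ~1079
  print("2^64 on a single machine: %.0f years" % years_64_single)  # ~585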

There are two reasons for finding a new hash algorithm:
- a faster one on current hardware
- a better one with a larger output

But brute force is not what we are defending against: we're trying to
defend against bugs in the hash algorithm.  In the case of md5 and
related hash algorithms, a new attack method was discovered and it made
many hash algorithms obsolete/broken.

When an algorithm is broken, the "work factor" needed for a successful
attack depends in part on the hash, e.g., you may be left with 64 bits
of effective hash and that would be brute-forceable.


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote:
> As I said several times before, to produce hash collisions. Or to calculate
> rainbow tables (as a previous user theorized it) you only need the
> following.
> 
> You don't need to reproduce all possible blocks.
> 1. SHA256 produces a 256 bit hash
> 2. That means it produces a value on 256 bits, in other words a value
> between 0..2^256 - 1
> 3. If you start counting from 0 to 2^256 and for each number calculate the
> SHA256 you will get at least one hash collision (if the hash algorithm is
> perfectly distributed)
> 4. Counting from 0 to 2^256, is nothing else but reproducing all possible
> bit pattern on 32 bytes
> 
> It's not about whether one computer is capable of producing the above
> hashes or not, or whether there are actually that many unique 32 byte bit
> patterns in the universe.
> A collision can happen.

It's actually not that simple, because in hash collision attacks you're
not always afforded the luxury of being able to define your input block.
More often than not, you want to modify a previously hashed block in
such a fashion that it carries your intended modifications while hashing
to the same original value. Say for instance you want to modify a
512-byte message (e.g. an SSL certificate) to point to your own CN. Here
your rainbow table, even if you could store it somewhere (you couldn't,
btw), would do you little good here.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Gregg Wonderly
> 
> But this is precisely the kind of "observation" that some people seem to
miss
> out on the importance of.  As Tomas suggested in his post, if this was
true,
> then we could have a huge compression ratio as well.  And even if there
was
> 10% of the bit patterns that created non-unique hashes, you could use the
> fact that a block hashed to a known bit pattern that didn't have
collisions, to
> compress the other 90% of your data.

In fact, if you were to assume hash value X corresponded to data block Y,
you could use the hash function as a form of compression, but for
decompression you would need a lookup table.  So if you were going to
compress Y once, you would not gain anything.  But if you had a data stream
with a bunch of duplicates in it, you might hash Y many times, discover X
is already in your lookup table, and then only need to store another copy
of X rather than of Y.
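
As a toy illustration of that lookup-table idea (plain Python, nothing to do
with how ZFS actually lays out its data; the class and block size are made
up for the example):

  import hashlib

  class ToyDedupStore:
      """Store each unique block once, keyed by its 256-bit hash."""
      def __init__(self):
          self.table = {}     # digest -> block data (the lookup table)
          self.stream = []    # the "compressed" stream: one digest per write

      def write(self, block: bytes) -> None:
          digest = hashlib.sha256(block).digest()   # 32 bytes, always
          if digest not in self.table:
              self.table[digest] = block            # first copy: store the data
          self.stream.append(digest)                # duplicates: store only X

      def read_all(self) -> list:
          # "Decompression" only works because we kept the lookup table.
          return [self.table[d] for d in self.stream]

  store = ToyDedupStore()
  for _ in range(10):
      store.write(b"Y" * 128 * 1024)    # the same 128K block, written 10 times
  print(len(store.table), "stored block,", len(store.stream), "references")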

So in fact, dedup is a form of hash-based compression.  But we know it's not
a lossless compression, so the decision to verify or not to verify is a
matter of probability of collision.  Many of us have done the math on this
probability, and many of us are comfortable with neglecting the verify,
because we know the probability of collision is so small.  But the decision
to verify or not verify is very much dependent on your intended purpose (the
type of data you're storing, and what its significance is). 

More importantly, the decision to verify or not verify depends on who you
would need to justify your decision to.  Nobody will ever get in trouble for
enabling verify.  But if you have to convince your CEO that you made the
right choice by disabling verify ... You might just choose to enable verify
for the sake of not explaining your decision.


> I'm serious about this from a number of perspectives.  We worry about the
> time it would take to reverse SHA or RSA hashes to passwords, not even
> thinking that what if someone has been quietly computing all possible
hashes
> for the past 10-20 years into a database some where, with every 5-16
> character password, and now has an instantly searchable hash-to-password
> database.

That is called a rainbow table, and it's commonly used for cracking poorly
implemented password schemes.  Even with massively powerful computers and a
whole lot of storage, you only have enough time and space to compute a
rainbow table for commonly used (easily guessable) passwords.  To prevent
attackers from successfully using rainbow tables, even when users might
choose bad passwords, a security conscious developer will use salting and
stretching.  This increases the randomness and the time to compute the
password hashes, thwarting rainbow tables.
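
A minimal sketch of salting and stretching, using Python's standard PBKDF2
helper (the 16-byte salt and 100,000-iteration count are illustrative
choices, not a recommendation):

  import hashlib, hmac, os

  def hash_password(password: bytes, salt: bytes = None):
      # Salting: a per-user random value, stored alongside the hash, makes
      # any precomputed hash-to-password table useless.
      if salt is None:
          salt = os.urandom(16)
      # Stretching: many iterations make every guess expensive to test.
      digest = hashlib.pbkdf2_hmac("sha256", password, salt, 100000)
      return salt, digest

  def check_password(password: bytes, salt: bytes, stored: bytes) -> bool:
      return hmac.compare_digest(hash_password(password, salt)[1], stored)

  salt, stored = hash_password(b"correct horse battery staple")
  print(check_password(b"correct horse battery staple", salt, stored))  # True
  print(check_password(b"tr0ub4dor&3", salt, stored))                   # False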


> If no one has computed the hashes for every single 4K and 8K block, then
> fine.  But, if that was done, and we had that data, we'd know for sure
which
> algorithm was going to work the best for the number of bits we are
> considering.

Let's imagine computing a 256-bit hash (32 bytes = 2^5 bytes) for every 1K
(8Kbit) block.  Storage for this table would require 2^5 * 2^8192 bytes =
2^8197 bytes = 3.5 * 10^2467 bytes.  Recall, a petabyte is 10^15 bytes.  So
... Storing the rainbow table for merely a 1K block size ... it would not come
close to fitting in all the storage on planet Earth.  But having the
rainbow table for *precisely* 1K block size isn't useful unless you happen
to have precisely a 1K block you want to lookup...  If you happen to be
looking up a 1025 byte data hash, or a 4K data hash ... You're out of luck.
This table won't help you.

It's physically impossible to do, even on the supremely simplified problem
of 1K block size and 256-bit hash.
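
The arithmetic above is easy to reproduce exactly, since Python integers are
unbounded (block and digest sizes as in the example):

  BLOCK_BITS = 1024 * 8        # a 1K block: 2^8192 possible contents
  DIGEST_BYTES = 32            # a 256-bit hash

  table_bytes = (2 ** BLOCK_BITS) * DIGEST_BYTES    # = 2^8197 bytes

  exponent = len(str(table_bytes)) - 1
  print("table size ~ 10^%d bytes" % exponent)       # 10^2467
  print("          ~ 10^%d petabytes" % (exponent - 15))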

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:36 PM, Justin Stringfellow wrote:
> 
> 
>> Since there is a finite number of bit patterns per block, have you tried to 
>> just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
>> if there is ever a collision?  If you found an algorithm that produced no 
>> collisions for any possible block bit pattern, wouldn't that be the win?
>  
> Perhaps I've missed something, but if there was *never* a collision, you'd 
> have stumbled across a rather impressive lossless compression algorithm. I'm 
> pretty sure there's some Big Mathematical Rules (Shannon?) that mean this 
> cannot be.

Do you realize how big your lookup dictionary would have to be?

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:30 PM, Gregg Wonderly wrote:
> This is exactly the issue for me.  It's vital to always have verify on.  If 
> you don't have the data to prove that every possible block combination 
> possible, hashes uniquely for the "small" bit space we are talking about, 
> then how in the world can you say that "verify" is not necessary?  That just 
> seems ridiculous to propose.

Do you need assurances that in the next 5 seconds a meteorite won't fall
to Earth and crush you? No. And yet, the Earth puts on thousands of tons
of weight each year from meteoric bombardment and people have been hit
and killed by them (not to speak of mass extinction events). Nobody has
ever demonstrated the ability to produce a hash collision in any
suitably long hash (128 bits plus) using a random search. All hash
collisions have been found by attacking the weaknesses in the
mathematical definition of these functions (i.e. some part of the input
didn't get obfuscated well in the hash function machinery and spilled
over into the result, resulting in a slight, but usable non-randomness).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:27 PM, Gregg Wonderly wrote:
> Unfortunately, the government imagines that people are using their home 
> computers to compute hashes and try and decrypt stuff.  Look at what is 
> happening with GPUs these days.  People are hooking up 4 GPUs in their 
> computers and getting huge performance gains.  5-6 char password space 
> covered in a few days.  12 or so chars would take one machine a couple of 
> years if I recall.  So, if we had 20 people with that class of machine, we'd 
> be down to a few months.   I'm just suggesting that while the compute space 
> is still huge, it's not actually undoable, it just requires some thought into 
> how to approach the problem, and then some time to do the computations.
> 
> Huge space, but still finite…

There are certain physical limits which one cannot exceed. For instance,
you cannot store 2^256 units of 32-byte quantities in Earth. Even if you
used proton spin (or some other quantum property) to store a bit, there
simply aren't enough protons in the entire visible universe to do it.
You will never ever be able to search a 256-bit memory space using a
simple exhaustive search. The reason why our security hashes are so long
(256-bits, 512-bits, more...) is because attackers *don't* do an
exhaustive search.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
As I said several times before, to produce hash collisions. Or to calculate
rainbow tables (as a previous user theorized it) you only need the
following.

You don't need to reproduce all possible blocks.
1. SHA256 produces a 256 bit hash
2. That means it produces a value on 256 bits, in other words a value
between 0..2^256 - 1
3. If you start counting from 0 to 2^256 and for each number calculate the
SHA256 you will get at least one hash collision (if the hash algorithm is
perfectly distributed)
4. Counting from 0 to 2^256, is nothing else but reproducing all possible
bit pattern on 32 bytes

It's not about whether one computer is capable of producing the above
hashes or not, or whether there are actually that many unique 32 byte bit
patterns in the universe.
A collision can happen.

On Wed, Jul 11, 2012 at 4:34 PM, Sašo Kiselkov wrote:

> On 07/11/2012 04:23 PM, casper@oracle.com wrote:
> >
> >> On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
> >>>
> >>> CPU's are not getting much faster.  But IO is definitely getting
> >>> faster.  It's best to keep ahead of that curve.
> >>
> >> It seems that per-socket CPU performance is doubling every year.
> >> That seems like faster to me.
> >
> > I think that I/O isn't getting as fast as CPU is; memory capacity and
> > bandwith and CPUs are getting faster.  I/O, not so much.
> > (Apart from the one single step from harddisk to SSD; but note that
> > I/O is limited to standard interfaces and as such it is likely be
> > helddown by requiring a new standard.
>
> Have you seen one of those SSDs made by FusionIO? Those things fit in a
> single PCI-e x8 slot and can easily push a sustained rate upward of
> several GB/s. Do not expect that drives are the be-all end-all to
> storage. Hybrid storage invalidated the traditional "CPU & memory fast,
> disks slow" wisdom years ago.
>
> Cheers,
> --
> Saso
>  ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow


> Since there is a finite number of bit patterns per block, have you tried to 
> just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
> if there is ever a collision?  If you found an algorithm that produced no 
> collisions for any possible block bit pattern, wouldn't that be the win?
 
Perhaps I've missed something, but if there was *never* a collision, you'd have 
stumbled across a rather impressive lossless compression algorithm. I'm pretty 
sure there's some Big Mathematical Rules (Shannon?) that mean this cannot be.
 
cheers,
--justin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:23 PM, casper@oracle.com wrote:
> 
>> On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
>>>
>>> CPU's are not getting much faster.  But IO is definitely getting faster.
>>> It's best to keep ahead of that curve.
>>
>> It seems that per-socket CPU performance is doubling every year. 
>> That seems like faster to me.
> 
> I think that I/O isn't getting as fast as CPU is; memory capacity and
> bandwith and CPUs are getting faster.  I/O, not so much.
> (Apart from the one single step from harddisk to SSD; but note that
> I/O is limited to standard interfaces and as such it is likely be
> helddown by requiring a new standard.

Have you seen one of those SSDs made by FusionIO? Those things fit in a
single PCI-e x8 slot and can easily push a sustained rate upward of
several GB/s. Do not expect that drives are the be-all end-all to
storage. Hybrid storage invalidated the traditional "CPU & memory fast,
disks slow" wisdom years ago.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Joerg Schilling wrote:


Bob Friesenhahn  wrote:


On Tue, 10 Jul 2012, Edward Ned Harvey wrote:


CPU's are not getting much faster.  But IO is definitely getting faster.  It's 
best to keep ahead of that curve.


It seems that per-socket CPU performance is doubling every year.
That seems like faster to me.


This would only apply, if you implement a multi threaded hash.


While it is true that the per-block hash latency does not improve much
with new CPUs (and may even regress), given multiple I/Os at once,
hashes may be computed by different cores and so it seems that
total system performance will scale with per-socket CPU performance.
Even with a single stream of I/O, multiple zfs blocks will be read or
written so multiple block hashes may be computed at once on different
cores.
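
As a rough sketch of that (Python, illustrative only; CPython's hashlib
releases its global lock while hashing large buffers, so plain threads can
let separate cores hash separate 128K blocks):

  import hashlib, os
  from concurrent.futures import ThreadPoolExecutor

  def hash_block(block: bytes) -> bytes:
      # One block, one digest: the per-block latency is whatever it is...
      return hashlib.sha256(block).digest()

  # ...but with many blocks in flight, different cores each take one.
  blocks = [os.urandom(128 * 1024) for _ in range(64)]   # 64 pending writes

  with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
      digests = list(pool.map(hash_block, blocks))

  print(len(digests), "blocks hashed,", len(set(digests)), "unique digests")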


Server OSs like Solaris have been focusing on improving total system 
throughput rather than single-threaded bandwidth.


I don't mean to discount the importance of this effort though.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
This is exactly the issue for me.  It's vital to always have verify on.  If you
don't have the data to prove that every possible block combination
hashes uniquely for the "small" bit space we are talking about, then how in the
world can you say that "verify" is not necessary?  That just seems ridiculous
to propose.

Gregg Wonderly

On Jul 11, 2012, at 9:22 AM, Bob Friesenhahn wrote:

> On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
>> the hash isn't used for security purposes. We only need something that's
>> fast and has a good pseudo-random output distribution. That's why I
>> looked toward Edon-R. Even though it might have security problems in
>> itself, it's by far the fastest algorithm in the entire competition.
> 
> If an algorithm is not 'secure' and zfs is not set to verify, doesn't that 
> mean that a knowledgeable user will be able to cause intentional data 
> corruption if deduplication is enabled?  A user with very little privilege 
> might be able to cause intentional harm by writing the magic data block 
> before some other known block (which produces the same hash) is written.  
> This allows one block to substitute for another.
> 
> It does seem that security is important because with a human element, data is 
> not necessarily random.
> 
> Bob
> -- 
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,
> http://www.GraphicsMagick.org/___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:22 PM, Bob Friesenhahn wrote:
> On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
>> the hash isn't used for security purposes. We only need something that's
>> fast and has a good pseudo-random output distribution. That's why I
>> looked toward Edon-R. Even though it might have security problems in
>> itself, it's by far the fastest algorithm in the entire competition.
> 
> If an algorithm is not 'secure' and zfs is not set to verify, doesn't
> that mean that a knowledgeable user will be able to cause intentional
> data corruption if deduplication is enabled?  A user with very little
> privilege might be able to cause intentional harm by writing the magic
> data block before some other known block (which produces the same hash)
> is written.  This allows one block to substitute for another.
> 
> It does seem that security is important because with a human element,
> data is not necessarily random.

Theoretically yes, it is possible, but the practicality of such an
attack is very much in doubt. In case this is a concern, however, one
can always switch to a more secure hash function (e.g. Skein-512).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:19 PM, Gregg Wonderly wrote:
> But this is precisely the kind of "observation" that some people seem to miss 
> out on the importance of.  As Tomas suggested in his post, if this was true, 
> then we could have a huge compression ratio as well.  And even if there was 
> 10% of the bit patterns that created non-unique hashes, you could use the 
> fact that a block hashed to a known bit pattern that didn't have collisions, 
> to compress the other 90% of your data.
> 
> I'm serious about this from a number of perspectives.  We worry about the 
> time it would take to reverse SHA or RSA hashes to passwords, not even 
> thinking that what if someone has been quietly computing all possible hashes 
> for the past 10-20 years into a database some where, with every 5-16 
> character password, and now has an instantly searchable hash-to-password 
> database.

This is something very well known in the security community as "rainbow
tables" and a common method to protect against it is via salting. Never
use a password hashing scheme which doesn't use salts for exactly the
reason you outlined above.

> Sometimes we ignore the scale of time, thinking that only the immediately 
> visible details are what we have to work with.
> 
> If no one has computed the hashes for every single 4K and 8K block, then 
> fine.  But, if that was done, and we had that data, we'd know for sure which 
> algorithm was going to work the best for the number of bits we are 
> considering.

Do you even realize how many 4K or 8K blocks there are?!?! Exactly
2^32768 or 2^65536 respectively. I wouldn't worry about somebody having
those pre-hashed ;-) Rainbow tables only work for a very limited subset
of data.

> Speculating based on the theory of the algorithms for "random" number of bits 
> is just silly.  Where's the real data that tells us what we need to know?

If you don't trust math, then there's little I can do to convince you.
But remember our conversation the next time you step into a car or get
on an airplane. The odds that you'll die on that ride are far higher
than that you'll find a random hash collision in a 256-bit hash algorithm...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
Unfortunately, the government imagines that people are using their home 
computers to compute hashes and try and decrypt stuff.  Look at what is 
happening with GPUs these days.  People are hooking up 4 GPUs in their 
computers and getting huge performance gains.  5-6 char password space covered 
in a few days.  12 or so chars would take one machine a couple of years if I 
recall.  So, if we had 20 people with that class of machine, we'd be down to a 
few months.   I'm just suggesting that while the compute space is still huge, 
it's not actually undoable, it just requires some thought into how to approach 
the problem, and then some time to do the computations.

Huge space, but still finite…

Gregg Wonderly

On Jul 11, 2012, at 9:13 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Gregg Wonderly
>> 
>> Since there is a finite number of bit patterns per block, have you tried
> to just
>> calculate the SHA-256 or SHA-512 for every possible bit pattern to see if
> there
>> is ever a collision?  If you found an algorithm that produced no
> collisions for
>> any possible block bit pattern, wouldn't that be the win?
> 
> Maybe I misunderstand what you're saying, but if I got it right, what you're
> saying is physically impossible to do in the time of the universe...  And
> guaranteed to fail even if you had all the computational power of God.
> 
> I think you're saying ... In a block of 128k, sequentially step through all
> the possible values ... starting with 0, 1, 2, ... 2^128k ... and compute
> the hashes of each value, and see if you ever find a hash collision.  If
> this is indeed what you're saying, recall, the above operation will require
> on order 2^128k operations to complete.  But present national security
> standards accept 2^256 operations as satisfactory to protect data from brute
> force attacks over the next 30 years.  Furthermore, in a 128k block, there
> exist 2^128k possible values, while in a 512bit hash, there exist only 2^512
> possible values (which is still a really huge number.)  This means there
> will exist at least 2^127.5k collisions.  However, these numbers are so
> astronomically universally magnanimously huge, it will still take more than
> a lifetime to find any one of those collisions.  So it's impossible to
> perform such a computation, and if you could, you would be guaranteed to
> find a LOT of collisions.
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> Sent: Wednesday, July 11, 2012 10:06 AM
> 
> On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
> >
> > CPU's are not getting much faster.  But IO is definitely getting faster.
> > It's best to keep ahead of that curve.
> 
> It seems that per-socket CPU performance is doubling every year.
> That seems like faster to me.

It's a good point.  It's only *serial* computation that isn't increasing
much anymore.  But computing hashes for a bunch of independent blocks is a
parallel computation operation.  For now, they're able to keep parallelizing
via more cores.

Still, I wonder how many more years this can continue?  The reason the
serial computation isn't increasing much anymore is simply because the
transistors etc have gotten small enough that they can't really increase
clock speed much more.  If you have a 3.0 GHz clock, how far does light
travel in one clock period?  About 10cm.  All the transistors, capacitors, wires
etc need time to react and stabilize before the next clock cycle...

For now, they're still able to keep making larger dies, with more cores on
them, and they're certainly developing layering and 3-D technologies, which
will certainly allow the number of cores to keep growing for some time to
come.  But there is a limit somewhere.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

>On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
>>
>> CPU's are not getting much faster.  But IO is definitely getting faster.  
>> It's best to keep ahea
d of that curve.
>
>It seems that per-socket CPU performance is doubling every year. 
>That seems like faster to me.

I think that I/O isn't getting as fast as CPU is; memory capacity and
bandwidth and CPUs are getting faster.  I/O, not so much.
(Apart from the one single step from harddisk to SSD; but note that
I/O is limited to standard interfaces and as such it is likely to be
held down by requiring a new standard.)

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
You don't need to reproduce all possible blocks.
1. SHA256 produces a 256 bit hash
2. That means it produces a value on 256 bits, in other words a value
between 0..2^256 - 1
3. If you start counting from 0 to 2^256 and for each number calculate the
SHA256 you will get at least one hash collision (if the hash algorithm is
perfectly distributed)
4. Counting from 0 to 2^256, is nothing else but reproducing all possible
bit pattern on 32 bytes

It's not about whether one computer is capable of producing the above
hashes or not, or whether there are actually that many unique 32 byte bit
patterns in the universe.
A collision can happen.

On Wed, Jul 11, 2012 at 3:57 PM, Gregg Wonderly  wrote:

> Since there is a finite number of bit patterns per block, have you tried
> to just calculate the SHA-256 or SHA-512 for every possible bit pattern to
> see if there is ever a collision?  If you found an algorithm that produced
> no collisions for any possible block bit pattern, wouldn't that be the win?
>
> Gregg Wonderly
>
> On Jul 11, 2012, at 5:56 AM, Sašo Kiselkov wrote:
>
> > On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
> >>> Suppose you find a weakness in a specific hash algorithm; you use this
> >>> to create hash collisions and now imagined you store the hash
> collisions
> >>> in a zfs dataset with dedup enabled using the same hash algorithm.
> >>
> >> Sorry, but isn't this what dedup=verify solves? I don't see the problem
> here. Maybe all that's needed is a comment in the manpage saying hash
> algorithms aren't perfect.
> >
> > It does solve it, but at a cost to normal operation. Every write gets
> > turned into a read. Assuming a big enough and reasonably busy dataset,
> > this leads to tremendous write amplification.
> >
> > Cheers,
> > --
> > Saso
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.


If an algorithm is not 'secure' and zfs is not set to verify, doesn't 
that mean that a knowledgeable user will be able to cause intentional 
data corruption if deduplication is enabled?  A user with very little 
privilege might be able to cause intentional harm by writing the magic 
data block before some other known block (which produces the same 
hash) is written.  This allows one block to substitute for another.


It does seem that security is important because with a human element, 
data is not necessarily random.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
But this is precisely the kind of "observation" that some people seem to miss 
out on the importance of.  As Tomas suggested in his post, if this was true, 
then we could have a huge compression ratio as well.  And even if there was 10% 
of the bit patterns that created non-unique hashes, you could use the fact that 
a block hashed to a known bit pattern that didn't have collisions, to compress 
the other 90% of your data.

I'm serious about this from a number of perspectives.  We worry about the time
it would take to reverse SHA or RSA hashes to passwords, not even thinking about
what happens if someone has been quietly computing all possible hashes for the past
10-20 years into a database somewhere, with every 5-16 character password, and
now has an instantly searchable hash-to-password database.

Sometimes we ignore the scale of time, thinking that only the immediately 
visible details are what we have to work with.

If no one has computed the hashes for every single 4K and 8K block, then fine.  
But, if that was done, and we had that data, we'd know for sure which algorithm 
was going to work the best for the number of bits we are considering.

Speculating based on the theory of the algorithms for "random" number of bits 
is just silly.  Where's the real data that tells us what we need to know?

Gregg Wonderly

On Jul 11, 2012, at 9:02 AM, Sašo Kiselkov wrote:

> On 07/11/2012 03:57 PM, Gregg Wonderly wrote:
>> Since there is a finite number of bit patterns per block, have you tried to 
>> just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
>> if there is ever a collision?  If you found an algorithm that produced no 
>> collisions for any possible block bit pattern, wouldn't that be the win?
> 
> Don't think that, if you can think of this procedure, that the crypto
> security guys at universities haven't though about it as well? Of course
> they have. No, simply generating a sequence of random patterns and
> hoping to hit a match won't do the trick.
> 
> P.S. I really don't mean to sound smug or anything, but I know one thing
> for sure: the crypto researchers who propose these algorithms are some
> of the brightest minds on this topic on the planet, so I would hardly
> think they didn't consider trivial problems.
> 
> Cheers,
> --
> Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Joerg Schilling
Bob Friesenhahn  wrote:

> On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
> >
> > CPU's are not getting much faster.  But IO is definitely getting faster.  
> > It's best to keep ahead of that curve.
>
> It seems that per-socket CPU performance is doubling every year. 
> That seems like faster to me.

This would only apply if you implement a multi-threaded hash.

The single-threaded performance win is less than doubling each year.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:58 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
>>
>> I really mean no disrespect, but this comment is so dumb I could swear
>> my IQ dropped by a few tenths of a point just by reading.
> 
> Cool it please.  You say "I mean no disrespect" and then say something which
> is clearly disrespectful.

I sort of flew off the handle there, and I shouldn't have. It felt like
Tomas was misrepresenting my position and putting words in my mouth I
didn't say. I certainly didn't mean to diminish the validity of an
honest question.

> Tomas's point is to illustrate that hashing is a many-to-one function.  If
> it were possible to rely on the hash to always be unique, then you could use
> it as a compression algorithm.  He's pointing out that's insane.  His
> comment was not in the slightest bit dumb; if anything, it seems like maybe
> somebody (or some people) didn't get his point.

I understood his point very well and I never argued that hashing always
results in unique hash values, which is why I thought he was
misrepresenting what I said.

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)
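
Point 3 is easy to see with a deliberately truncated hash, where collisions
are cheap to find (a real 256-bit hash behaves the same way, just with
unimaginably more preimages behind each value); a quick Python illustration:

  import hashlib
  from collections import defaultdict

  def tiny_hash(data: bytes) -> bytes:
      # Truncate SHA-256 to 16 bits so the pigeonhole effect is visible.
      return hashlib.sha256(data).digest()[:2]

  preimages = defaultdict(list)
  for i in range(200000):              # far more inputs than 2^16 outputs
      preimages[tiny_hash(b"block-%d" % i)].append(i)

  digest, inputs = max(preimages.items(), key=lambda kv: len(kv[1]))
  print("value %s has %d known preimages" % (digest.hex(), len(inputs)))
  # Given only the hash value, nothing tells you which of those inputs was
  # "the" original -- which is exactly why you cannot decompress a hash.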

A formal proof would of course take longer to construct and would take
time that I feel is best spent writing code.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Gregg Wonderly
> 
> Since there is a finite number of bit patterns per block, have you tried
to just
> calculate the SHA-256 or SHA-512 for every possible bit pattern to see if
there
> is ever a collision?  If you found an algorithm that produced no
collisions for
> any possible block bit pattern, wouldn't that be the win?

Maybe I misunderstand what you're saying, but if I got it right, what you're
saying is physically impossible to do in the time of the universe...  And
guaranteed to fail even if you had all the computational power of God.

I think you're saying ... In a block of 128k, sequentially step through all
the possible values ... starting with 0, 1, 2, ... 2^128k ... and compute
the hashes of each value, and see if you ever find a hash collision.  If
this is indeed what you're saying, recall, the above operation will require
on the order of 2^128k operations to complete.  But present national security
standards accept 2^256 operations as satisfactory to protect data from brute
force attacks over the next 30 years.  Furthermore, in a 128k block, there
exist 2^128k possible values, while in a 512bit hash, there exist only 2^512
possible values (which is still a really huge number.)  This means there
will exist at least 2^127.5k collisions.  However, these numbers are so
astronomically universally magnanimously huge, it will still take more than
a lifetime to find any one of those collisions.  So it's impossible to
perform such a computation, and if you could, you would be guaranteed to
find a LOT of collisions.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Tue, 10 Jul 2012, Edward Ned Harvey wrote:


CPU's are not getting much faster.  But IO is definitely getting faster.  It's 
best to keep ahead of that curve.


It seems that per-socket CPU performance is doubling every year. 
That seems like faster to me.


If server CPU chipsets offer acceleration for some type of standard 
encryption, then that needs to be considered.  The CPU might not need 
to do the encryption the hard way.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
> 
> As your dedup
> ratio grows, so does the performance hit from dedup=verify. At, say,
> dedupratio=10.0x, on average, every write results in 10 reads.

Why?  
If you intend to write a block, and you discover it has a matching checksum,
you only need to verify it matches one block.  You don't need to verify it
matches all the other blocks that have previously been verified to match
each other.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:57 PM, Gregg Wonderly wrote:
> Since there is a finite number of bit patterns per block, have you tried to 
> just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
> if there is ever a collision?  If you found an algorithm that produced no 
> collisions for any possible block bit pattern, wouldn't that be the win?

Do you really think that, if you can think of this procedure, the crypto
security guys at universities haven't thought about it as well? Of course
they have. No, simply generating a sequence of random patterns and
hoping to hit a match won't do the trick.

P.S. I really don't mean to sound smug or anything, but I know one thing
for sure: the crypto researchers who propose these algorithms are some
of the brightest minds on this topic on the planet, so I would hardly
think they didn't consider trivial problems.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
> 
> On 07/11/2012 11:53 AM, Tomas Forsman wrote:
> > On 11 July, 2012 - Sašo Kiselkov sent me these 1,4K bytes:
> >> Oh jeez, I can't remember how many times this flame war has been going
> >> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
> >> near uniform random distribution of output. Thus, the chances of getting
> >> a random hash collision are around 2^-256 or around 10^-77. If I asked
> >> you to pick two atoms at random *from the entire observable universe*,
> >> your chances of hitting on the same atom are higher than the chances of
> >> that hash collision. So leave dedup=on with sha256 and move on.
> >
> > So in ZFS, which normally uses 128kB blocks, you can instead store them
> > 100% uniquely into 32 bytes.. A nice 4096x compression rate..
> > decompression is a bit slower though..
> 
> I really mean no disrespect, but this comment is so dumb I could swear
> my IQ dropped by a few tenths of a point just by reading.

Cool it please.  You say "I mean no disrespect" and then say something which
is clearly disrespectful.

Tomas's point is to illustrate that hashing is a many-to-one function.  If
it were possible to rely on the hash to always be unique, then you could use
it as a compression algorithm.  He's pointing out that's insane.  His
comment was not in the slightest bit dumb; if anything, it seems like maybe
somebody (or some people) didn't get his point.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
Since there is a finite number of bit patterns per block, have you tried to 
just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if 
there is ever a collision?  If you found an algorithm that produced no 
collisions for any possible block bit pattern, wouldn't that be the win?

Gregg Wonderly

On Jul 11, 2012, at 5:56 AM, Sašo Kiselkov wrote:

> On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
>>> Suppose you find a weakness in a specific hash algorithm; you use this
>>> to create hash collisions and now imagined you store the hash collisions 
>>> in a zfs dataset with dedup enabled using the same hash algorithm.
>> 
>> Sorry, but isn't this what dedup=verify solves? I don't see the problem 
>> here. Maybe all that's needed is a comment in the manpage saying hash 
>> algorithms aren't perfect.
> 
> It does solve it, but at a cost to normal operation. Every write gets
> turned into a read. Assuming a big enough and reasonably busy dataset,
> this leads to tremendous write amplification.
> 
> Cheers,
> --
> Saso
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:39 PM, David Magda wrote:
> On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote:
>> However, before I start out on a pointless endeavor, I wanted to probe
>> the field of ZFS users, especially those using dedup, on whether their
>> workloads would benefit from a faster hash algorithm (and hence, lower
>> CPU utilization). Developments of late have suggested to me three
>> possible candidates:
> [...]
> 
> I'd wait until SHA-3 is announced. It's supposed to happen this year, of
> which only six months are left:
> 
> http://csrc.nist.gov/groups/ST/hash/timeline.html
> http://en.wikipedia.org/wiki/NIST_hash_function_competition
> 
> It was actually supposed to happen to "2Q", so they're running a little
> late it seems.

I'm not convinced waiting makes much sense. The SHA-3 standardization
process' goals are different from "ours". SHA-3 can choose to go with
something that's slower, but has a higher security margin. I think that
absolute super-tight security isn't all that necessary for ZFS, since
the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote:
> However, before I start out on a pointless endeavor, I wanted to probe
> the field of ZFS users, especially those using dedup, on whether their
> workloads would benefit from a faster hash algorithm (and hence, lower
> CPU utilization). Developments of late have suggested to me three
> possible candidates:
[...]

I'd wait until SHA-3 is announced. It's supposed to happen this year, of
which only six months are left:

http://csrc.nist.gov/groups/ST/hash/timeline.html
http://en.wikipedia.org/wiki/NIST_hash_function_competition

It was actually supposed to happen in "2Q", so they're running a little
late it seems.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 04:50, Ferenc-Levente Juhos wrote:
> Actually, although as you pointed out the chances of an sha256
> collision are minimal, it can still happen, and that would mean
> that the dedup algorithm discards a block that it thinks is a duplicate.
> Probably it's better anyway to do a byte-to-byte comparison
> if the hashes match, to be sure that the blocks are really identical.
>
> The funny thing here is that ZFS tries to solve all sorts of data
> integrity
> issues with checksumming and healing, etc.,
> and on the other hand a hash collision in the dedup algorithm can cause
> loss of data if wrongly configured.
>
> Anyway thanks that you have brought up the subject, now I know if I will
> enable the dedup feature I must set it to sha256,verify.

There was a long-ish thread on the topic last year
("(Fletcher+Verification) versus (Sha256+No Verification) "):

http://mail.opensolaris.org/pipermail/zfs-discuss/2011-January/thread.html#46864


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:42 PM, Justin Stringfellow wrote:
>> This assumes you have low volumes of deduplicated data. As your dedup
>> ratio grows, so does the performance hit from dedup=verify. At, say,
>> dedupratio=10.0x, on average, every write results in 10 reads.
> 
> Well you can't make an omelette without breaking eggs! Not a very nice one, 
> anyway.
>  
> Yes dedup is expensive but much like using O_SYNC, it's a conscious decision 
> here to take a performance hit in order to be sure about our data. Moving the 
> actual reads to a async thread as I suggested should improve things.

And my point here is that the expense is unnecessary and can be omitted
if we choose our algorithms and settings carefully.

"Async" here won't help, you'll still get equal write amplification,
only it's going to be spread in between txg's.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:36 PM, casper@oracle.com wrote:
> 
> 
>> This assumes you have low volumes of deduplicated data. As your dedup
>> ratio grows, so does the performance hit from dedup=verify. At, say,
>> dedupratio=10.0x, on average, every write results in 10 reads.
> 
> I don't follow.
> 
> If dedupratio == 10, it means that each item is *referenced* 10 times
> but it is only stored *once*.  Only when you have hash collisions then 
> multiple reads would be needed.
> 
> Only one read is needed except in the case of hash collisions.

No, *every* dedup write will result in a block read. This is how:

 1) ZIO gets block X and computes HASH(X)
 2) ZIO looks up HASH(X) in DDT
  2a) HASH(X) not in DDT -> unique write; exit
  2b) HASH(X) in DDT; continue
 3) Read original disk block Y with HASH(Y) = HASH(X) <--here's the read
 4) Verify X == Y
  4a) X == Y; increment refcount
  4b) X != Y; hash collision; write new block   <-- here's the collision

So, in other words, by the time you figure out you've got a hash
collision, you've already done the read; ergo, every dedup write creates a read!
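
To make that concrete, here's a minimal Python sketch of the verify path
(a toy model for illustration only, not the actual ZIO code; the dicts
stand in for the DDT and the on-disk blocks):

import hashlib

ddt = {}    # hash -> refcount (stand-in for the dedup table)
vdev = {}   # hash -> block bytes (stand-in for the blocks on disk)

def dedup_write(block: bytes, verify: bool = True) -> str:
    h = hashlib.sha256(block).digest()   # 1) compute HASH(X)
    if h not in ddt:                     # 2a) not in DDT: unique write
        ddt[h] = 1
        vdev[h] = block
        return "unique write"
    if verify:                           # 2b) already in DDT
        original = vdev[h]               # 3) read block Y  <-- the extra read
        if original != block:            # 4b) X != Y: hash collision
            return "collision: write X as a new, unique block"
    ddt[h] += 1                          # 4a) X == Y: bump the refcount
    return "dedup'd write"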

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow
> This assumes you have low volumes of deduplicated data. As your dedup
> ratio grows, so does the performance hit from dedup=verify. At, say,
> dedupratio=10.0x, on average, every write results in 10 reads.

Well, you can't make an omelette without breaking eggs! Not a very nice
one, anyway.
 
Yes, dedup is expensive, but much like using O_SYNC, it's a conscious
decision here to take a performance hit in order to be sure about our data.
Moving the actual reads to an async thread as I suggested should improve
things.
 
cheers,
--justin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


>This assumes you have low volumes of deduplicated data. As your dedup
>ratio grows, so does the performance hit from dedup=verify. At, say,
>dedupratio=10.0x, on average, every write results in 10 reads.

I don't follow.

If dedupratio == 10, it means that each item is *referenced* 10 times
but it is only stored *once*.  Only when you have hash collisions would
multiple reads be needed.

Only one read is needed except in the case of hash collisions.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:09 PM, Justin Stringfellow wrote:
>> The point is that hash functions are many to one and I think the point
>> was about that verify wasn't really needed if the hash function is good
>> enough.
> 
> This is a circular argument really, isn't it? Hash algorithms are never 
> perfect, but we're trying to build a perfect one?
>  
> It seems to me the obvious fix is to use hash to identify candidates for 
> dedup, and then do the actual verify and dedup asynchronously. Perhaps a 
> worker thread doing this at low priority?
> Did anyone consider this?

This assumes you have low volumes of deduplicated data. As your dedup
ratio grows, so does the performance hit from dedup=verify. At, say,
dedupratio=10.0x, on average, every write results in 10 reads.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:37 PM, Ferenc-Levente Juhos wrote:
> Precisely, I said the same thing a few posts before:
> dedup=verify solves that. And as I said, one could use dedup=<hash algorithm>,verify with
> an inferior hash algorithm (that is much faster) with the purpose of
> reducing the number of dedup candidates.
> For that matter one could use a trivial CRC32, if the two blocks have the
> same CRC you do anyway a
> byte-to-byte comparison, but if they have differenct CRC's you don't need
> to do the actual comparison.

Yes, and that's exactly what the Fletcher4+verify combo does (Fletcher
is faster than CRC32). That's why I started this thread - to assess
whether a better hash algorithm is needed and/or desired.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:32 PM, Ferenc-Levente Juhos wrote:
> Saso, I'm not flaming at all, I happen to disagree, but still I understand
> that
> chances are very very very slim, but as one poster already said, this is
> how
> the lottery works. I'm not saying one should make an exhaustive search with
> trillions of computers just to produce a sha256 collision.
> If I wanted an exhaustive search I would generate all the numbers from
> 0 to 2**256 and I would definitely get at least 1 collision.
> If you formulate it in another way, by generating all the possible 256 bit
> (32 byte)
> blocks + 1 you will definitely get a collision. This is much more credible
> than the analogy with the age of the universe and atoms picked at random,
> etc.

First of all, I never said that the chance is zero. It's definitely
non-zero, but claiming that it is analogous to the lottery is just not
appreciating the scale of the difference.

Next, your proposed method of finding hash collisions is naive in that
you assume that you merely need to generate 256-bit numbers. First of all,
the smallest blocks in ZFS are 1k (IIRC), i.e. 8192 bits. Next, you fail
to appreciate the difficulty of even generating 2^256 256-bit numbers.
Here's how much memory you'd need:

2^256 * 32 ~= 2^261 bytes

Memory may be cheap, but not that cheap...
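
A quick sanity check of that figure (Python, illustration only):

blocks = 2 ** 256            # every possible 256-bit value
total_bytes = blocks * 32    # 32 bytes each
assert total_bytes == 2 ** 261
print(f"{total_bytes:.3e} bytes")   # ~3.7e+78 bytes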

> The fact is it can happen, it's entirely possible that there are two jpg's
> in the universe with different content and they have the same hash.
> I can't prove the existence of those, but you can't deny it.

I'm not denying that. Read what I said.

> The fact is, that zfs and everyone using it trys to correct data
> degradation e.g. cause by cosmic rays, and on the other hand their
> using probability calculation (no matter how slim the chances are) to
> potentially discard valid data.
> You can come with other universe and atom theories and with the
> age of the universe, etc. The fact remains the same.

This is just a profoundly naive statement. You always need to make
trade-offs between practicality and performance. How "slim" the chances
are has a very real impact on engineering decisions.

> And each generation was convinced that their current best checksum or
> hash algorithm is the best and will be the best forever. MD5 has
> demonstrated that it's not the case. Time will tell what becomes of
> SHA256, but why take any chances.

You really don't understand much about hash algorithms, do you? MD5 has
*very* good safety against random collisions - that's why it's still
used in archive integrity checking (ever wonder what algorithm Debian
and RPM packages use?). The reason why it's been replaced by newer
algorithms has nothing to do with its chance of random hash collisions,
but with deliberate collision attacks on security algorithms built on
top of it. Also, the reason why we don't use it in ZFS is because
SHA-256 is faster and it has a larger pattern space (thus further
lowering the odds of random collisions).
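
To illustrate the random-corruption case with a small Python sketch (not
tied to any particular package tool): flipping a single bit in a random
128 KiB block changes both digests, so accidental damage is detected;
what broke MD5 were crafted collisions, not random ones.

import hashlib, os

block = os.urandom(128 * 1024)       # a random 128 KiB "block"
corrupted = bytearray(block)
corrupted[0] ^= 0x01                 # flip a single bit

same_md5 = hashlib.md5(block).digest() == hashlib.md5(bytes(corrupted)).digest()
same_sha = hashlib.sha256(block).digest() == hashlib.sha256(bytes(corrupted)).digest()
print(same_md5, same_sha)            # False False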

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow


> The point is that hash functions are many to one and I think the point
> was about that verify wasn't really needed if the hash function is good
> enough.

This is a circular argument really, isn't it? Hash algorithms are never 
perfect, but we're trying to build a perfect one?
 
It seems to me the obvious fix is to use the hash to identify candidates for
dedup, and then do the actual verify and dedup asynchronously. Perhaps a worker
thread doing this at low priority?
Did anyone consider this?
 
cheers,
--justin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

>On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
>>> Suppose you find a weakness in a specific hash algorithm; you use this
>>> to create hash collisions and now imagined you store the hash collisions 
>>> in a zfs dataset with dedup enabled using the same hash algorithm.
>> 
>> Sorry, but isn't this what dedup=verify solves? I don't see the problem 
>> here. Maybe all that's needed is a comment in the manpage saying hash
>> algorithms aren't perfect.
>
>It does solve it, but at a cost to normal operation. Every write gets
>turned into a read. Assuming a big enough and reasonably busy dataset,
>this leads to tremendous write amplification.


If and only if the block is being dedup'ed.  (In that case, you're just
changing the write of a whole block into one read (of the block) and an
update in the dedup table; the whole block isn't written.)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
>> Suppose you find a weakness in a specific hash algorithm; you use this
>> to create hash collisions and now imagined you store the hash collisions 
>> in a zfs dataset with dedup enabled using the same hash algorithm.
> 
> Sorry, but isn't this what dedup=verify solves? I don't see the problem here. 
> Maybe all that's needed is a comment in the manpage saying hash algorithms 
> aren't perfect.

It does solve it, but at a cost to normal operation. Every write gets
turned into a read. Assuming a big enough and reasonably busy dataset,
this leads to tremendous write amplification.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:00 PM, casper@oracle.com wrote:
> 
> 
>> You do realize that the age of the universe is only on the order of
>> around 10^18 seconds, do you? Even if you had a trillion CPUs each
>> chugging along at 3.0 GHz for all this time, the number of processor
>> cycles you will have executed cumulatively is only on the order 10^40,
>> still 37 orders of magnitude lower than the chance for a random hash
>> collision.
>>
> 
> Suppose you find a weakness in a specific hash algorithm; you use this
> to create hash collisions and now imagined you store the hash collisions 
> in a zfs dataset with dedup enabled using the same hash algorithm.

That is one possibility I considered when evaluating whether to
implement the Edon-R hash algorithm. Its security margin is somewhat
questionable, so the next step is to determine whether this can be used
in any way to implement a practical data-corruption attack. My guess is
that it can't, but I'm open to debate on this issue, since I'm not a
security expert myself.

The reason why I don't think this can be used to implement a practical
attack is that in order to generate a collision, you first have to know
the disk block that you want to create a collision on (or at least its
checksum), i.e. the original block is already in the pool. At that
point, you could write a colliding block which would get dedup'd, but
that doesn't mean you've corrupted the original data, only that you've
referenced it. So, in a sense, you haven't corrupted the original block,
only your own colliding block (since that's the copy that doesn't get
written).
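
A toy model of that scenario (mine, purely for illustration; the "hash"
here is deliberately terrible, just the first byte of the block, so that
a collision actually happens):

def weak_hash(block: bytes) -> bytes:
    return block[:1]          # stand-in for a hash the attacker can collide

pool = {}                     # weak hash -> the first block stored with it
refcount = {}

def dedup_write(block: bytes) -> bytes:
    """Return the 'block pointer' (here just the hash) handed to the writer."""
    h = weak_hash(block)
    if h in pool:
        refcount[h] += 1      # duplicate or collision: only add a reference
    else:
        pool[h] = block
        refcount[h] = 1
    return h

def read(ptr: bytes) -> bytes:
    return pool[ptr]

victim = dedup_write(b"Victim data already in the pool")
attacker = dedup_write(b"Vandal's crafted colliding block")  # same leading byte

print(read(victim))     # the victim's data is untouched
print(read(attacker))   # the attacker reads back the victim's block; only
                        # the attacker's own copy was "lost"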

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 11:53 AM, Tomas Forsman wrote:
> On 11 July, 2012 - Sašo Kiselkov sent me these 1,4K bytes:
>> Oh jeez, I can't remember how many times this flame war has been going
>> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
>> near uniform random distribution of output. Thus, the chances of getting
>> a random hash collision are around 2^-256 or around 10^-77. If I asked
>> you to pick two atoms at random *from the entire observable universe*,
>> your chances of hitting on the same atom are higher than the chances of
>> that hash collision. So leave dedup=on with sha256 and move on.
> 
> So in ZFS, which normally uses 128kB blocks, you can instead store them
> 100% uniquely into 32 bytes.. A nice 4096x compression rate..
> decompression is a bit slower though..

I really mean no disrespect, but this comment is so dumb I could swear
my IQ dropped by a few tenths of a point just by reading it.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Precisely, I said the same thing a few posts before:
dedup=verify solves that. And as I said, one could use dedup=<hash
algorithm>,verify with an inferior hash algorithm (that is much faster)
with the purpose of reducing the number of dedup candidates.
For that matter, one could use a trivial CRC32: if the two blocks have
the same CRC you do a byte-to-byte comparison anyway, but if they have
different CRCs you don't need to do the actual comparison.
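
A minimal sketch of that idea (my illustration only, not ZFS code): the
cheap checksum only nominates candidates, and the byte-for-byte compare
makes the final call, so a CRC collision can never merge two different
blocks.

import zlib

store = {}   # crc32 -> list of distinct blocks already written with that CRC

def write_block(block: bytes) -> str:
    crc = zlib.crc32(block)
    for existing in store.get(crc, []):
        if existing == block:                  # candidate confirmed byte-for-byte
            return "deduplicated"
    store.setdefault(crc, []).append(block)    # new data (or a CRC collision)
    return "stored as new block"

print(write_block(b"a" * 4096))   # stored as new block
print(write_block(b"a" * 4096))   # deduplicated
print(write_block(b"b" * 4096))   # stored as new block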

On Wed, Jul 11, 2012 at 12:24 PM, Justin Stringfellow <
justin.stringfel...@btinternet.com> wrote:

>   >>You do realize that the age of the universe is only on the order of
> >>around 10^18 seconds, do you? Even if you had a trillion CPUs each
> >>chugging along at 3.0 GHz for all this time, the number of processor
> >>cycles you will have executed cumulatively is only on the order 10^40,
> >>still 37 orders of magnitude lower than the chance for a random hash
> >>collision.
>
> Here we go, boiling the oceans again :)
>
>
> >Suppose you find a weakness in a specific hash algorithm; you use this
> >to create hash collisions and now imagined you store the hash collisions
> >in a zfs dataset with dedup enabled using the same hash algorithm.
>
> Sorry, but isn't this what dedup=verify solves? I don't see the problem
> here. Maybe all that's needed is a comment in the manpage saying hash
> algorithms aren't perfect.
>
> cheers,
> --justin
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Saso, I'm not flaming at all; I simply happen to disagree. I do understand
that the chances are very, very slim, but as one poster already said, that
is how the lottery works. I'm not saying one should run an exhaustive
search with trillions of computers just to produce a SHA-256 collision.
If I wanted an exhaustive search, I would generate all the numbers from
0 to 2**256 and I would definitely get at least one collision.
To formulate it another way: by generating all the possible 256-bit
(32-byte) blocks plus one more, you will definitely get a collision. This
is much more credible than the analogy with the age of the universe and
atoms picked at random, etc.

The fact is it can happen: it's entirely possible that there are two JPEGs
in the universe with different content that have the same hash.
I can't prove the existence of those, but you can't deny it.

The fact is that ZFS and everyone using it tries to correct data
degradation, e.g. caused by cosmic rays, and on the other hand they're
using probability calculations (no matter how slim the chances are) to
potentially discard valid data.
You can come up with more universe and atom theories and with the
age of the universe, etc. The fact remains the same.

And each generation was convinced that their current best checksum or
hash algorithm is the best and will be the best forever. MD5 has
demonstrated that this is not the case. Time will tell what becomes of
SHA-256, but why take any chances?

On Wed, Jul 11, 2012 at 11:10 AM, Sašo Kiselkov wrote:

> On 07/11/2012 10:50 AM, Ferenc-Levente Juhos wrote:
> > Actually although as you pointed out that the chances to have an sha256
> > collision is minimal, but still it can happen, that would mean
> > that the dedup algorithm discards a block that he thinks is a duplicate.
> > Probably it's anyway better to do a byte to byte comparison
> > if the hashes match to be sure that the blocks are really identical.
> >
> > The funny thing here is that ZFS tries to solve all sorts of data
> integrity
> > issues with checksumming and healing, etc.,
> > and on the other hand a hash collision in the dedup algorithm can cause
> > loss of data if wrongly configured.
> >
> > Anyway thanks that you have brought up the subject, now I know if I will
> > enable the dedup feature I must set it to sha256,verify.
>
> Oh jeez, I can't remember how many times this flame war has been going
> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
> near uniform random distribution of output. Thus, the chances of getting
> a random hash collision are around 2^-256 or around 10^-77. If I asked
> you to pick two atoms at random *from the entire observable universe*,
> your chances of hitting on the same atom are higher than the chances of
> that hash collision. So leave dedup=on with sha256 and move on.
>
> Cheers,
> --
> Saso
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

>Sorry, but isn't this what dedup=verify solves? I don't see the problem here.
>Maybe all that's needed is a comment in the manpage saying hash algorithms
>aren't perfect.

The point is that hash functions are many-to-one, and I think the argument
was that verify isn't really needed if the hash function is good
enough.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow
>>You do realize that the age of the universe is only on the order of
>>around 10^18 seconds, do you? Even if you had a trillion CPUs each
>>chugging along at 3.0 GHz for all this time, the number of processor
>>cycles you will have executed cumulatively is only on the order 10^40,
>>still 37 orders of magnitude lower than the chance for a random hash
>>collision.

Here we go, boiling the oceans again :)

>Suppose you find a weakness in a specific hash algorithm; you use this
>to create hash collisions and now imagined you store the hash collisions 
>in a zfs dataset with dedup enabled using the same hash algorithm.

Sorry, but isn't this what dedup=verify solves? I don't see the problem here. 
Maybe all that's needed is a comment in the manpage saying hash algorithms 
aren't perfect.
 
cheers,
--justin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


>You do realize that the age of the universe is only on the order of
>around 10^18 seconds, do you? Even if you had a trillion CPUs each
>chugging along at 3.0 GHz for all this time, the number of processor
>cycles you will have executed cumulatively is only on the order 10^40,
>still 37 orders of magnitude lower than the chance for a random hash
>collision.
>


Suppose you find a weakness in a specific hash algorithm; you use this
to create hash collisions and now imagine you store the hash collisions
in a zfs dataset with dedup enabled using the same hash algorithm.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Tomas Forsman
On 11 July, 2012 - Sašo Kiselkov sent me these 1,4K bytes:

> On 07/11/2012 10:50 AM, Ferenc-Levente Juhos wrote:
> > Actually although as you pointed out that the chances to have an sha256
> > collision is minimal, but still it can happen, that would mean
> > that the dedup algorithm discards a block that he thinks is a duplicate.
> > Probably it's anyway better to do a byte to byte comparison
> > if the hashes match to be sure that the blocks are really identical.
> > 
> > The funny thing here is that ZFS tries to solve all sorts of data integrity
> > issues with checksumming and healing, etc.,
> > and on the other hand a hash collision in the dedup algorithm can cause
> > loss of data if wrongly configured.
> > 
> > Anyway thanks that you have brought up the subject, now I know if I will
> > enable the dedup feature I must set it to sha256,verify.
> 
> Oh jeez, I can't remember how many times this flame war has been going
> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
> near uniform random distribution of output. Thus, the chances of getting
> a random hash collision are around 2^-256 or around 10^-77. If I asked
> you to pick two atoms at random *from the entire observable universe*,
> your chances of hitting on the same atom are higher than the chances of
> that hash collision. So leave dedup=on with sha256 and move on.

So in ZFS, which normally uses 128kB blocks, you can instead store them
100% uniquely into 32 bytes.. A nice 4096x compression rate..
decompression is a bit slower though..

/Tomas
-- 
Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Joerg Schilling
Sašo Kiselkov wrote:

> On 07/11/2012 10:47 AM, Joerg Schilling wrote:
> > Sašo Kiselkov wrote:
> > 
> >> write in case verify finds the blocks are different). With hashes, you
> >> can leave verify off, since hashes are extremely unlikely (~10^-77) to
> >> produce collisions.
> > 
> > This is how a lottery works. the chance is low but some people still win.
> > 
> > q~A
>
> You do realize that the age of the universe is only on the order of
> around 10^18 seconds, do you? Even if you had a trillion CPUs each
> chugging along at 3.0 GHz for all this time, the number of processor
> cycles you will have executed cumulatively is only on the order 10^40,
> still 37 orders of magnitude lower than the chance for a random hash
> collision.

This is not how chance works.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:50 AM, Ferenc-Levente Juhos wrote:
> Actually although as you pointed out that the chances to have an sha256
> collision is minimal, but still it can happen, that would mean
> that the dedup algorithm discards a block that he thinks is a duplicate.
> Probably it's anyway better to do a byte to byte comparison
> if the hashes match to be sure that the blocks are really identical.
> 
> The funny thing here is that ZFS tries to solve all sorts of data integrity
> issues with checksumming and healing, etc.,
> and on the other hand a hash collision in the dedup algorithm can cause
> loss of data if wrongly configured.
> 
> Anyway thanks that you have brought up the subject, now I know if I will
> enable the dedup feature I must set it to sha256,verify.

Oh jeez, I can't remember how many times this flame war has been going
on on this list. Here's the gist: SHA-256 (or any good hash) produces a
near uniform random distribution of output. Thus, the chances of getting
a random hash collision are around 2^-256 or around 10^-77. If I asked
you to pick two atoms at random *from the entire observable universe*,
your chances of hitting on the same atom are higher than the chances of
that hash collision. So leave dedup=on with sha256 and move on.
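
(For the record, the arithmetic behind that figure; illustration only:)

p = 2.0 ** -256
print(p)    # ~8.6e-78, i.e. on the order of 10^-77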

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 11:02 AM, Darren J Moffat wrote:
> On 07/11/12 00:56, Sašo Kiselkov wrote:
>>   * SHA-512: simplest to implement (since the code is already in the
>> kernel) and provides a modest performance boost of around 60%.
> 
> FIPS 180-4 introduces SHA-512/t support and explicitly SHA-512/256.
> 
> http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf
> 
> Note this is NOT a simple truncation of SHA-512 since when using
> SHA-512/t the initial value H(0) is different.

Yes, I know that. In my original post I was only trying to probe the
landscape for whether there is interest in this in the community. Of
course for the implementation I'd go with the standardized truncation
function. Skein-512 already includes a truncation function. For Edon-R,
I'd have to devise one (probably based around SHA-512/256 or Skein).
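
For the curious, a quick way to see the difference in practice (a Python
sketch; it assumes the local OpenSSL build exposes SHA-512/256, which is
not guaranteed everywhere):

import hashlib

data = b"zfs dedup block"
truncated = hashlib.sha512(data).hexdigest()[:64]   # first 256 bits of SHA-512
try:
    real = hashlib.new("sha512_256", data).hexdigest()
    print(truncated == real)   # False: H(0) differs, so the digests differ
except ValueError:
    print("sha512_256 not provided by this build")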

> See sections 5.3.6.2 and 6.7.
> 
> I recommend the checksum value for this be
> checksum=sha512/256
> 
> A / in the value doesn't cause any problems and it is the official NIST
> name of that hash.
> 
> With the internal enum being: ZIO_CHECKSUM_SHA512_256
> 
> CR 7020616 already exists for adding this in Oracle Solaris.

Okay, I'll implement it identically in Illumos then. However, given
that I plan to use feature flags to advertise the feature to the outside
world, I doubt interoperability will be possible (even though the
on-disk format could be identical).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:47 AM, Joerg Schilling wrote:
> Sašo Kiselkov wrote:
> 
>> write in case verify finds the blocks are different). With hashes, you
>> can leave verify off, since hashes are extremely unlikely (~10^-77) to
>> produce collisions.
> 
> This is how a lottery works. the chance is low but some people still win.
> 
> q~A

You do realize that the age of the universe is only on the order of
around 10^18 seconds, do you? Even if you had a trillion CPUs each
chugging along at 3.0 GHz for all this time, the number of processor
cycles you will have executed cumulatively is only on the order 10^40,
still 37 orders of magnitude lower than the chance for a random hash
collision.
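
For anyone who wants numbers rather than analogies, here is a
back-of-the-envelope birthday-bound estimate (my own illustrative
assumption: a pool holding 2^38 unique 128 KiB blocks, i.e. about 32 PiB
of unique data):

n = 2 ** 38     # unique blocks in the DDT (assumed for illustration)
d = 2 ** 256    # size of the SHA-256 output space

# P(at least one collision) ~ n*(n-1) / (2*d) while n is far below sqrt(d)
p = n * (n - 1) / (2 * d)
print(f"unique data: {n * 128 * 1024 / 2**50:.0f} PiB")   # 32 PiB
print(f"collision probability ~ {p:.1e}")                 # on the order of 1e-55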

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Darren J Moffat

On 07/11/12 00:56, Sašo Kiselkov wrote:

  * SHA-512: simplest to implement (since the code is already in the
kernel) and provides a modest performance boost of around 60%.


FIPS 180-4 introduces SHA-512/t support and explicitly SHA-512/256.

http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf

Note this is NOT a simple truncation of SHA-512 since when using 
SHA-512/t the initial value H(0) is different.


See sections 5.3.6.2 and 6.7.

I recommend the checksum value for this be
checksum=sha512/256

A / in the value doesn't cause any problems and it is the official NIST 
name of that hash.


With the internal enum being: ZIO_CHECKSUM_SHA512_256

CR 7020616 already exists for adding this in Oracle Solaris.

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
I'm pushing the send button too often, but yes, considering what was said
before, a byte-to-byte comparison should be mandatory when deduplicating,
and therefore a "lighter" hash or checksum algorithm would suffice to
reduce the number of dedup candidates. Overall, deduping would be
"bulletproof" and faster.

On Wed, Jul 11, 2012 at 10:50 AM, Ferenc-Levente Juhos
wrote:

> Actually although as you pointed out that the chances to have an sha256
> collision is minimal, but still it can happen, that would mean
> that the dedup algorithm discards a block that he thinks is a duplicate.
> Probably it's anyway better to do a byte to byte comparison
> if the hashes match to be sure that the blocks are really identical.
>
> The funny thing here is that ZFS tries to solve all sorts of data
> integrity issues with checksumming and healing, etc.,
> and on the other hand a hash collision in the dedup algorithm can cause
> loss of data if wrongly configured.
>
> Anyway thanks that you have brought up the subject, now I know if I will
> enable the dedup feature I must set it to sha256,verify.
>  On Wed, Jul 11, 2012 at 10:41 AM, Ferenc-Levente Juhos <
> feci1...@gmail.com> wrote:
>
>> I was under the impression that the hash (or checksum) used for data
>> integrity is the same as the one used for deduplication,
>> but now I see that they are different.
>>
>>
>> On Wed, Jul 11, 2012 at 10:23 AM, Sašo Kiselkov 
>> wrote:
>>
>>> On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
>>> > Hello all,
>>> >
>>> > what about the fletcher2 and fletcher4 algorithms? According to the
>>> zfs man
>>> > page on oracle, fletcher4 is the current default.
>>> > Shouldn't the fletcher algorithms be much faster then any of the SHA
>>> > algorithms?
>>> > On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov >> >wrote:
>>>
>>> Fletcher is a checksum, not a hash. It can and often will produce
>>> collisions, so you need to set your dedup to verify (do a bit-by-bit
>>> comparison prior to deduplication) which can result in significant write
>>> amplification (every write is turned into a read and potentially another
>>> write in case verify finds the blocks are different). With hashes, you
>>> can leave verify off, since hashes are extremely unlikely (~10^-77) to
>>> produce collisions.
>>>
>>> --
>>> Saso
>>>
>>
>>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Actually, although as you pointed out the chances of an SHA-256 collision
are minimal, it can still happen, and that would mean that the dedup
algorithm discards a block that it thinks is a duplicate.
It's probably better anyway to do a byte-to-byte comparison if the hashes
match, to be sure that the blocks are really identical.

The funny thing here is that ZFS tries to solve all sorts of data integrity
issues with checksumming and healing, etc., and on the other hand a hash
collision in the dedup algorithm can cause loss of data if it is wrongly
configured.

Anyway, thanks for bringing up the subject; now I know that if I enable
the dedup feature I must set it to sha256,verify.
On Wed, Jul 11, 2012 at 10:41 AM, Ferenc-Levente Juhos
wrote:

> I was under the impression that the hash (or checksum) used for data
> integrity is the same as the one used for deduplication,
> but now I see that they are different.
>
>
> On Wed, Jul 11, 2012 at 10:23 AM, Sašo Kiselkov wrote:
>
>> On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
>> > Hello all,
>> >
>> > what about the fletcher2 and fletcher4 algorithms? According to the zfs
>> man
>> > page on oracle, fletcher4 is the current default.
>> > Shouldn't the fletcher algorithms be much faster then any of the SHA
>> > algorithms?
>> > On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov > >wrote:
>>
>> Fletcher is a checksum, not a hash. It can and often will produce
>> collisions, so you need to set your dedup to verify (do a bit-by-bit
>> comparison prior to deduplication) which can result in significant write
>> amplification (every write is turned into a read and potentially another
>> write in case verify finds the blocks are different). With hashes, you
>> can leave verify off, since hashes are extremely unlikely (~10^-77) to
>> produce collisions.
>>
>> --
>> Saso
>>
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Joerg Schilling
Sašo Kiselkov wrote:

> write in case verify finds the blocks are different). With hashes, you
> can leave verify off, since hashes are extremely unlikely (~10^-77) to
> produce collisions.

This is how a lottery works. the chance is low but some people still win.

q~A
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:41 AM, Ferenc-Levente Juhos wrote:
> I was under the impression that the hash (or checksum) used for data
> integrity is the same as the one used for deduplication,
> but now I see that they are different.

They are the same "in use", i.e. once you switch dedup on, that implies
checksum=sha256. However, if you want to, you can force ZFS to use
fletcher even with dedup by setting dedup=verify and checksum=fletcher4
(setting "dedup=on" with fletcher4 is not advised, due to fletcher's
possibility of producing collisions and the subsequent silent corruption).
It's also possible to set "dedup=verify" with "checksum=sha256"; however,
that makes little sense (as the chances of getting a random hash
collision are essentially nil).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
I was under the impression that the hash (or checksum) used for data
integrity is the same as the one used for deduplication,
but now I see that they are different.


On Wed, Jul 11, 2012 at 10:23 AM, Sašo Kiselkov wrote:

> On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
> > Hello all,
> >
> > what about the fletcher2 and fletcher4 algorithms? According to the zfs
> man
> > page on oracle, fletcher4 is the current default.
> > Shouldn't the fletcher algorithms be much faster then any of the SHA
> > algorithms?
> > On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov  >wrote:
>
> Fletcher is a checksum, not a hash. It can and often will produce
> collisions, so you need to set your dedup to verify (do a bit-by-bit
> comparison prior to deduplication) which can result in significant write
> amplification (every write is turned into a read and potentially another
> write in case verify finds the blocks are different). With hashes, you
> can leave verify off, since hashes are extremely unlikely (~10^-77) to
> produce collisions.
>
> --
> Saso
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

