Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
 
 On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
  You're entirely sure that there could never be two different blocks that
can
 hash to the same value and have different content?
 
  Wow, can you just send me the cash now and we'll call it even?
 
 You're the one making the positive claim and I'm calling bullshit. So
 the onus is on you to demonstrate the collision (and that you arrived at
 it via your brute force method as described). Until then, my money stays
 safely on my bank account. Put up or shut up, as the old saying goes.

Come on dude.  Chill out.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 
 IMO dedup should always verify.

IMO, it should be a decision left to the user or admin, with the default being 
verify.  (Exactly as it is now.)  Because there is a performance-to-reliability 
tradeoff, and there is a very solid argument that the probability of collision 
causing data loss is so small that you should instead be more concerned with 
meteor strikes and other improbable failures causing data loss.  For some people 
and some situations, that argument will be good enough - they would opt for the 
performance increase.  And for others, that argument is not good enough - they 
would opt for verification.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
 
 This is so profoundly wrong that it leads me to suspect you never took
 courses on cryptography and/or information theory. 

More inflammatory commentary, and personal comments?  Saso, seriously, knock
it off.


 The size of your
 storage pool DOESN'T MATTER ONE BIT to the size of the key space. 

Are we still talking about SHA256 collisions in a zpool?  If so, the size of
the storage pool (at least, the used blocks inside the pool) definitely has
an effect on the probability of a collision.

The first block written to the pool gets a checksum.  There is zero
probability of collision, because it's the first one.
The second block written gets another checksum.  It has a 2^-256 probability
of collision with the first one.
The 3rd block has a 2*2^-256 probability...
The 4th block has 3*2^-256 probability...

This is the birthday problem.  If you don't know it, look it up on
wikipedia.  Each block written has a higher (but still very low) probability
of collision with any other block that was previously written, until ...

When you write your approximately 2^128th block, you have a 50/50 chance of
collision with any other block.  Fortunately, this amount of data is far
greater than all the storage in the world combined, so this probability can
never be reached with earthly means.

For all Earthly data sets, assuming no SHA256 weaknesses are being
exploited, the probability of a collision is still extremely small, but the
more data blocks in the data set, the higher the probability is.
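
If you want to put rough numbers on this, here is a minimal C sketch (mine, not
anything from ZFS itself) using the standard birthday approximation
p ~= 1 - exp(-n*(n-1)/2^257) for n blocks and a 256-bit hash; the block counts
in main() are purely illustrative:

#include <math.h>
#include <stdio.h>

/*
 * Birthday-bound approximation: probability of at least one collision
 * among n random 'bits'-bit hashes.  expm1l() keeps precision for the
 * extremely small probabilities involved.
 */
static long double
collision_prob(long double n, int bits)
{
	long double pairs = n * (n - 1.0L) / 2.0L;
	long double space = powl(2.0L, (long double)bits);
	return (-expm1l(-pairs / space));
}

int
main(void)
{
	/* Illustrative pool sizes: 2^20, 2^35 and 2^64 unique blocks. */
	printf("2^20 blocks: p ~= %.3Le\n", collision_prob(powl(2.0L, 20.0L), 256));
	printf("2^35 blocks: p ~= %.3Le\n", collision_prob(powl(2.0L, 35.0L), 256));
	printf("2^64 blocks: p ~= %.3Le\n", collision_prob(powl(2.0L, 64.0L), 256));
	return (0);
}

Even at 2^64 unique blocks (far beyond any earthly pool), the probability stays
vanishingly small; it just grows with the block count, as described above.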

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Edward Ned Harvey
 From: Jim Klimov [mailto:jimkli...@cos.ru]
 Sent: Thursday, July 12, 2012 8:42 AM
 To: Edward Ned Harvey
 Subject: Re: [zfs-discuss] New fast hash algorithm - is it needed?
 
 2012-07-11 18:03, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
 
  As your dedup
  ratio grows, so does the performance hit from dedup=verify. At, say,
  dedupratio=10.0x, on average, every write results in 10 reads.
 
  Why?
  If you intend to write a block, and you discover it has a matching checksum,
  you only need to verify it matches one block.  You don't need to verify it
  matches all the other blocks that have previously been verified to match
  each other.
 
 As Saso explained, if you wrote the same block 10 times
 and detected it was already deduped, then by verifying
 this detection 10 times you get about 10 extra reads.

(Jim, you wrote me off-list and I replied on-list; in this case, I thought it 
would be ok because this message doesn't look private or exclusive to me.  I 
apologize if I was wrong.)

I get the miscommunication now - 

When you write the duplicate block for the 10th time, we all understand you're 
not going to go back and verify 10 blocks.  (It seemed, at least to me, that's 
what Saso was saying.  Which is why I told him, No you don't.)

You're saying that when you wrote the duplicate block the 2nd time, you 
verified... And then when you wrote it the 3rd time, you verified...  And the 
4th time, and the 5th time...   By the time you write the 10th time, you have 
already verified 9 previous times, but you're still going to verify again.

Normally you would expect writing dedup'd duplicate blocks to be faster than 
writing the non-dedup'd data, because you get to skip the actual write.  (This 
idealistic performance gain might be pure vapor due to the need to update metadata, 
but ignoring that technicality, continue hypothetically...)  When you verify, 
supposedly you're giving up that hypothetical performance gain, because you 
have to do a read instead of a write.  So at first blush, it seems like no 
net-gain for performance.  But if the duplicate block gets written frequently 
(for example a block of all zeros) then there's a high probability the 
duplicate block is already in ARC, so you can actually skip reading from disk 
and just read from RAM.

So, the 10 extra reads will sometimes be true - if the duplicate block 
doesn't already exist in ARC.  And the 10 extra reads will sometimes be false 
- if the duplicate block is already in ARC.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Tim Cook
On Thu, Jul 12, 2012 at 9:14 AM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

  From: Jim Klimov [mailto:jimkli...@cos.ru]
  Sent: Thursday, July 12, 2012 8:42 AM
  To: Edward Ned Harvey
  Subject: Re: [zfs-discuss] New fast hash algorithm - is it needed?
 
  2012-07-11 18:03, Edward Ned Harvey wrote:
   From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
   boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
  
   As your dedup
   ratio grows, so does the performance hit from dedup=verify. At, say,
   dedupratio=10.0x, on average, every write results in 10 reads.
  
   Why?
   If you intend to write a block, and you discover it has a matching
 checksum,
   you only need to verify it matches one block.  You don't need to
 verify it
   matches all the other blocks that have previously been verified to
 match
   each other.
 
  As Saso explained, if you wrote the same block 10 times
  and detected it was already deduped, then by verifying
  this detection 10 times you get about 10 extra reads.

 (In this case, Jim, you wrote me off-list and I replied on-list, but in
 this case, I thought it would be ok because this message doesn't look
 private or exclusive to me.  I apologize if I was wrong.)

 I get the miscommunication now -

 When you write the duplicate block for the 10th time, we all understand
 you're not going to go back and verify 10 blocks.  (It seemed, at least to
 me, that's what Saso was saying.  Which is why I told him, No you don't.)

 You're saying, that when you wrote the duplicate block the 2nd time, you
 verified... And then when you wrote it the 3rd time, you verified...  And
 the 4th time, and the 5th time...   By the time you write the 10th time,
 you have already verified 9 previous times, but you're still going to
 verify again.

 Normally you would expect writing duplicate blocks dedup'd to be faster
 than writing the non-dedup'd data, because you get to skip the actual
 write.  (This idealistic performance gain might be pure vapor due to need
 to update metadata, but ignoring that technicality, continue
 hypothetically...)  When you verify, supposedly you're giving up that
 hypothetical performance gain, because you have to do a read instead of a
 write.  So at first blush, it seems like no net-gain for performance.  But
 if the duplicate block gets written frequently (for example a block of all
 zeros) then there's a high probability the duplicate block is already in
 ARC, so you can actually skip reading from disk and just read from RAM.

 So, the 10 extra reads will sometimes be true - if the duplicate block
 doesn't already exist in ARC.  And the 10 extra reads will sometimes be
 false - if the duplicate block is already in ARC.



Sasso: yes, it's absolutely worth implementing a higher performing hashing
algorithm.  I'd suggest simply ignoring the people that aren't willing to
acknowledge basic mathematics rather than lashing out.  No point in feeding
the trolls.  The PETABYTES of data Quantum and DataDomain have out there
are proof positive that complex hashes get the job done without verify,
even if you don't want to acknowledge the math behind it.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Sašo Kiselkov
On 07/12/2012 07:16 PM, Tim Cook wrote:
 Sasso: yes, it's absolutely worth implementing a higher performing hashing
 algorithm.  I'd suggest simply ignoring the people that aren't willing to
 acknowledge basic mathematics rather than lashing out.  No point in feeding
 the trolls.  The PETABYTES of data Quantum and DataDomain have out there
 are proof positive that complex hashes get the job done without verify,
 even if you don't want to acknowledge the math behind it.

That's what I deduced as well. I have far too much time to explain
statistics, coding theory and compression algorithms to random people on
the Internet. Code talks. I'm almost done implementing SHA-512/256 and
Edon-R-512/256. I've spent most of the time learning how the ZFS code is
structured and I now have most of the pieces in place, so expect the
completed code drop with (SHA-512, Edon-R and Skein) by this weekend.

Cheers,

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Sašo Kiselkov
On 07/12/2012 09:52 PM, Sašo Kiselkov wrote:
 I have far too much time to explain

P.S. that should have read "I have taken far too much time explaining."
Men are crap at multitasking...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 02:18 AM, John Martin wrote:
 On 07/10/12 19:56, Sašo Kiselkov wrote:
 Hi guys,

 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256. On modern
 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
 slower than many of the SHA-3 candidates, so I went out and did some
 testing (details attached) on a possible new hash algorithm that might
 improve on this situation.
 
 Is the intent to store the 512 bit hash or truncate to 256 bit?
 

The intent is to truncate. I know ZFS can only store up to 32 bytes in
the checksum.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:20 AM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov

 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256. On modern
 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
 slower than many of the SHA-3 candidates, so I went out and did some
 testing (details attached) on a possible new hash algorithm that might
 improve on this situation.
 
 As coincidence would have it, I recently benchmarked md5 hashing and AES 
 encryption on systems with and without AES-NI.  Theoretically, hashing should 
 be much faster because it's asymmetric, while symmetric encryption has much 
 less speed potential.  I found md5 could hash at most several hundred MB/sec, 
 and AES was about half to quarter of that speed ...  Which is consistent with 
 the theory.  But if I had AES-NI, then AES was about 1.1 GB/sec.  Which means 
 we have much *much* more speed potential available untapped in terms of 
 hashing.

MD5 is a painfully slow hash compared to the SHA-3 candidates, or even
SHA-512. The candidates I tested produced the following throughputs (a
simple reversal of the cycles/byte metric for each CPU):

Opteron 4234 (3.1 GHz):
  Skein-512: 355 MB/s
  Edon-R: 643 MB/s

AMD Athlon II Neo N36L (1.3 GHz):
  Skein-512: 213 MB/s
  Edon-R: 364 MB/s

Intel Xeon E5645 (2.4 GHz):
  Skein-512: 357 MB/s
  Edon-R: 734 MB/s

Intel Xeon E5405 (2.0 GHz):
  Skein-512: 280 MB/s
  Edon-R: 531 MB/s

Intel Xeon E5450 (3.0 GHz):
  Skein-512: 416 MB/s
  Edon-R: 792 MB/s

Keep in mind that this is single-threaded on a pure-C implementation.
During my tests I used GCC 3.4.3 in order to be able to assess speed
improvements should the code be folded into Illumos (since that's one
compiler Illumos can be built with), but GCC 3.4.3 is seriously
stone-age. Compiling with GCC 4.6.2 I got a further speed boost of
around 10-20%, so even in pure C, Edon-R is probably able to breathe
down the neck of the AES-NI optimized implementation you mentioned.
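
For reference, a single-threaded harness along these lines is enough to
reproduce MB/s numbers of this kind (a sketch only; the xor-fold routine below
is just a stand-in hash so the snippet compiles on its own, plug in the actual
candidate to measure it):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/* Stand-in "hash"; replace with the real candidate (Skein-512, Edon-R, ...). */
static uint64_t
standin_hash(const uint8_t *buf, size_t len)
{
	uint64_t h = 0;
	for (size_t i = 0; i < len; i++)
		h = (h << 1) ^ buf[i];
	return (h);
}

int
main(void)
{
	size_t bufsz = 128 * 1024;	/* one 128K ZFS record */
	size_t iters = 8192;		/* ~1 GiB hashed in total */
	uint8_t *buf = malloc(bufsz);
	volatile uint64_t sink = 0;

	for (size_t i = 0; i < bufsz; i++)
		buf[i] = (uint8_t)i;

	clock_t t0 = clock();
	for (size_t i = 0; i < iters; i++)
		sink ^= standin_hash(buf, bufsz);
	double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

	printf("%.1f MB/s (sink=%llu)\n",
	    (double)bufsz * iters / secs / 1e6, (unsigned long long)sink);
	free(buf);
	return (0);
}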

 Now, when you consider that a single disk typically is able to sustain 
 1.0Gbit (128 MB) per second, it means, very quickly the CPU can become the 
 bottleneck for sustained disk reads in a large raid system.  I think a lot of 
 the time, people with a bunch of disks in a raid configuration are able to 
 neglect the CPU load, just because they're using fletcher.

Yes, that's exactly what I'm getting at. It would be great to have a
hash that you could enable with significantly more peace of mind than
sha256 - with sha256, you always need to keep in mind that the hashing
is going to be super-expensive (even for reads). My testing with a
small JBOD from Supermicro showed that I was easily able to exceed 2
GB/s of reads off of just 45 7k2 SAS drives.

 Of the SHA3 finalists, I was pleased to see that only one of them was based 
 on AES-NI, and the others are actually faster.  So in vague hand-waving 
 terms, yes I believe the stronger & faster hash algorithm, in time will be a 
 big win for zfs performance.  But only in situations where people have a 
 sufficiently large number of disks and sufficiently high expectation for IO 
 performance.

My whole reason for starting this exercise is that RAM is getting dirt
cheap nowadays, so a reasonably large and fast dedup setup can be had
for relatively little money. However, I think that the sha256 hash is
really ill suited to this application and even if it isn't a critical
issue, I think it's really inexcusable we are using something worse than
the best of breed here (especially considering how ZFS was always
designed to be easily extensible with new algorithms).

 CPU's are not getting much faster.  But IO is definitely getting faster.  
 It's best to keep ahead of that curve.

As I said above, RAM is getting cheaper much faster than CPU performance
is. Nowadays you can get 128 GB of server-grade RAM for around $1000, so
equipping a machine with half a terabyte of RAM or more is getting
commonplace. By a first degree approximation (assuming 200 B per block
in the DDT) this would allow one to store upward of 80TB of unique 128K
blocks in 128 GB of RAM - that's easily above 100 TB of attached raw
storage, so we're looking at things like this:

http://dataonstorage.com/dataon-products/6g-sas-jbod/dns-1660-4u-60-bay-6g-35inch-sassata-jbod.html

These things come with eight 4-wide SAS 2.0 ports and enough bandwidth
to saturate a QDR InfiniBand port. Another thing to consider are SSDs. A
single 2U server with eight or even sixteen 2.5'' SATA-III SSDs can
achieve even higher throughput. So I fully agree with you, we need to
stay ahead of the curve, however, I think the curve is much closer than
we think!
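
As a rough sanity check on the DDT sizing above, here's a back-of-the-envelope
sketch (it just takes the ~200 B per in-core DDT entry figure as an assumption
and multiplies it out):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t ram_bytes      = 128ULL << 30;	/* 128 GiB dedicated to the DDT */
	uint64_t ddt_entry_size = 200;		/* assumed bytes per DDT entry  */
	uint64_t block_size     = 128ULL << 10;	/* 128K records                 */

	uint64_t entries = ram_bytes / ddt_entry_size;
	uint64_t unique  = entries * block_size;

	printf("DDT entries: %llu (~%.0f million)\n",
	    (unsigned long long)entries, entries / 1e6);
	printf("unique data: %.1f TiB\n", unique / (1024.0 * 1024 * 1024 * 1024));
	return (0);
}

Which lands right around the 80 TB of unique 128K blocks mentioned above.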

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Hello all,

what about the fletcher2 and fletcher4 algorithms? According to the zfs man
page on oracle, fletcher4 is the current default.
Shouldn't the fletcher algorithms be much faster then any of the SHA
algorithms?
On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 07/11/2012 05:20 AM, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
 
  I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
  implementation to supplant the currently utilized sha256. On modern
  64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
  slower than many of the SHA-3 candidates, so I went out and did some
  testing (details attached) on a possible new hash algorithm that might
  improve on this situation.
 
  As coincidence would have it, I recently benchmarked md5 hashing and AES
 encryption on systems with and without AES-NI.  Theoretically, hashing
 should be much faster because it's asymmetric, while symmetric encryption
 has much less speed potential.  I found md5 could hash at most several
 hundred MB/sec, and AES was about half to quarter of that speed ...  Which
 is consistent with the theory.  But if I had AES-NI, then AES was about 1.1
 GB/sec.  Which means we have much *much* more speed potential available
 untapped in terms of hashing.

 MD5 is a painfully slow hash compared to the SHA-3 candidates, or even
 SHA-512. The candidates I tested produced the following throughputs (a
 simple reversal of the cycles/byte metric for each CPU):

 Opteron 4234 (3.1 GHz):
   Skein-512: 355 MB/s
   Edon-R: 643 MB/s

 AMD Athlon II Neo N36L (1.3 GHz):
   Skein-512: 213 MB/s
   Edon-R: 364 MB/s

 Intel Xeon E5645 (2.4 GHz):
   Skein-512: 357 MB/s
   Edon-R: 734 MB/s

 Intel Xeon E5405 (2.0 GHz):
   Skein-512: 280 MB/s
   Edon-R: 531 MB/s

 Intel Xeon E5450 (3.0 GHz):
   Skein-512: 416 MB/s
   Edon-R: 792 MB/s

 Keep in mind that this is single-threaded on a pure-C implementation.
 During my tests I used GCC 3.4.3 in order to be able to assess speed
 improvements should the code be folded into Illumos (since that's one
 compiler Illumos can be built with), but GCC 3.4.3 is seriously
 stone-age. Compiling with GCC 4.6.2 I got a further speed boost of
 around 10-20%, so even in pure C, Edon-R is probably able to breathe
 down the neck of the AES-NI optimized implementation you mentioned.

  Now, when you consider that a single disk typically is able to sustain
 1.0Gbit (128 MB) per second, it means, very quickly the CPU can become the
 bottleneck for sustained disk reads in a large raid system.  I think a lot
 of the time, people with a bunch of disks in a raid configuration are able
 to neglect the CPU load, just because they're using fletcher.

 Yes, that's exactly what I'm getting at. It would be great to have a
 hash that you could enable with significantly more peace of mind than
 sha256 - with sha256, you always need to keep in mind that the hashing
 is going to be super-expensive (even for reads). My testing with a
 small JBOD from Supermicro showed that I was easily able to exceed 2
 GB/s of reads off of just 45 7k2 SAS drives.

  Of the SHA3 finalists, I was pleased to see that only one of them was
 based on AES-NI, and the others are actually faster.  So in vague
 hand-waving terms, yes I believe the stronger & faster hash algorithm, in
 time will be a big win for zfs performance.  But only in situations where
 people have a sufficiently large number of disks and sufficiently high
 expectation for IO performance.

 My whole reason for starting this exercise is that RAM is getting dirt
 cheap nowadays, so a reasonably large and fast dedup setup can be had
 for relatively little money. However, I think that the sha256 hash is
 really ill suited to this application and even if it isn't a critical
 issue, I think it's really inexcusable we are using something worse than
 the best of breed here (especially considering how ZFS was always
 designed to be easily extensible with new algorithms).

  CPU's are not getting much faster.  But IO is definitely getting faster.
  It's best to keep ahead of that curve.

 As I said above, RAM is getting cheaper much faster than CPU performance
 is. Nowadays you can get 128 GB of server-grade RAM for around $1000, so
 equipping a machine with half a terabyte of RAM or more is getting
 commonplace. By a first degree approximation (assuming 200 B per block
 in the DDT) this would allow one to store upward of 80TB of unique 128K
 blocks in 128 GB of RAM - that's easily above 100 TB of attached raw
 storage, so we're looking at things like this:


 http://dataonstorage.com/dataon-products/6g-sas-jbod/dns-1660-4u-60-bay-6g-35inch-sassata-jbod.html

 These things come with eight 4-wide SAS 2.0 ports and enough bandwidth
 to saturate a QDR InfiniBand port. Another thing to consider are SSDs. A
 single 2U server with eight or even 

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
 Hello all,
 
 what about the fletcher2 and fletcher4 algorithms? According to the zfs man
 page on oracle, fletcher4 is the current default.
 Shouldn't the fletcher algorithms be much faster then any of the SHA
 algorithms?
 On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov skiselkov...@gmail.comwrote:

Fletcher is a checksum, not a hash. It can and often will produce
collisions, so you need to set your dedup to verify (do a bit-by-bit
comparison prior to deduplication) which can result in significant write
amplification (every write is turned into a read and potentially another
write in case verify finds the blocks are different). With hashes, you
can leave verify off, since hashes are extremely unlikely (~10^-77) to
produce collisions.
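
To illustrate why: a Fletcher-4-style checksum is just four running sums over
the data, roughly like the sketch below (simplified, in the spirit of ZFS's
fletcher_4, not a verbatim copy of it). Unrelated blocks can easily end up with
the same four sums, which is why dedup on fletcher needs verify.

#include <stdint.h>
#include <stddef.h>

void
fletcher4_sketch(const void *buf, size_t size, uint64_t cksum[4])
{
	const uint32_t *ip = buf;
	const uint32_t *end = ip + size / sizeof (uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	/* Four cascaded 64-bit sums over 32-bit words. */
	for (; ip < end; ip++) {
		a += *ip;
		b += a;
		c += b;
		d += c;
	}
	cksum[0] = a;
	cksum[1] = b;
	cksum[2] = c;
	cksum[3] = d;
}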

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
I was under the impression that the hash (or checksum) used for data
integrity is the same as the one used for deduplication,
but now I see that they are different.


On Wed, Jul 11, 2012 at 10:23 AM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
  Hello all,
 
  what about the fletcher2 and fletcher4 algorithms? According to the zfs
 man
  page on oracle, fletcher4 is the current default.
  Shouldn't the fletcher algorithms be much faster then any of the SHA
  algorithms?
  On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov skiselkov...@gmail.com
 wrote:

 Fletcher is a checksum, not a hash. It can and often will produce
 collisions, so you need to set your dedup to verify (do a bit-by-bit
 comparison prior to deduplication) which can result in significant write
 amplification (every write is turned into a read and potentially another
 write in case verify finds the blocks are different). With hashes, you
 can leave verify off, since hashes are extremely unlikely (~10^-77) to
 produce collisions.

 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:41 AM, Ferenc-Levente Juhos wrote:
 I was under the impression that the hash (or checksum) used for data
 integrity is the same as the one used for deduplication,
 but now I see that they are different.

They are the same in use, i.e. once you switch dedup on, that implies
checksum=sha256. However, if you want to, you can force ZFS to use
fletcher even with dedup by setting dedup=verify and checksum=fletcher4
(setting dedup=on with fletcher4 is not advised due to fletcher's
possibility of producing collisions and subsequent silent corruption).
It's also possible to set dedup=verify with checksum=sha256,
however, that makes little sense (as the chances of getting a random
hash collision are essentially nil).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Joerg Schilling
Sašo Kiselkov skiselkov...@gmail.com wrote:

 write in case verify finds the blocks are different). With hashes, you
 can leave verify off, since hashes are extremely unlikely (~10^-77) to
 produce collisions.

This is how a lottery works: the chance is low but some people still win.

q~A
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Although, as you pointed out, the chances of an SHA-256 collision are
minimal, it can still happen, and that would mean the dedup algorithm
discards a block that it thinks is a duplicate. It's probably better to do
a byte-to-byte comparison anyway if the hashes match, to be sure the blocks
are really identical.

The funny thing here is that ZFS tries to solve all sorts of data integrity
issues with checksumming and healing, etc.,
and on the other hand a hash collision in the dedup algorithm can cause
loss of data if wrongly configured.

Anyway, thanks for bringing up the subject; now I know that if I enable the
dedup feature I must set it to sha256,verify.
On Wed, Jul 11, 2012 at 10:41 AM, Ferenc-Levente Juhos
feci1...@gmail.comwrote:

 I was under the impression that the hash (or checksum) used for data
 integrity is the same as the one used for deduplication,
 but now I see that they are different.


 On Wed, Jul 11, 2012 at 10:23 AM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
  Hello all,
 
  what about the fletcher2 and fletcher4 algorithms? According to the zfs
 man
  page on oracle, fletcher4 is the current default.
  Shouldn't the fletcher algorithms be much faster then any of the SHA
  algorithms?
  On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov skiselkov...@gmail.com
 wrote:

 Fletcher is a checksum, not a hash. It can and often will produce
 collisions, so you need to set your dedup to verify (do a bit-by-bit
 comparison prior to deduplication) which can result in significant write
 amplification (every write is turned into a read and potentially another
 write in case verify finds the blocks are different). With hashes, you
 can leave verify off, since hashes are extremely unlikely (~10^-77) to
 produce collisions.

 --
 Saso



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
I'm pushing the send button too often, but yes: considering what was said
before, byte-to-byte comparison should be mandatory when deduplicating, and
therefore a lighter hash or checksum algorithm would suffice to reduce the
number of dedup candidates. Overall, deduping would be bulletproof and
faster.

On Wed, Jul 11, 2012 at 10:50 AM, Ferenc-Levente Juhos
feci1...@gmail.comwrote:

 Actually although as you pointed out that the chances to have an sha256
 collision is minimal, but still it can happen, that would mean
 that the dedup algorithm discards a block that he thinks is a duplicate.
 Probably it's anyway better to do a byte to byte comparison
 if the hashes match to be sure that the blocks are really identical.

 The funny thing here is that ZFS tries to solve all sorts of data
 integrity issues with checksumming and healing, etc.,
 and on the other hand a hash collision in the dedup algorithm can cause
 loss of data if wrongly configured.

 Anyway thanks that you have brought up the subject, now I know if I will
 enable the dedup feature I must set it to sha256,verify.
  On Wed, Jul 11, 2012 at 10:41 AM, Ferenc-Levente Juhos 
 feci1...@gmail.com wrote:

 I was under the impression that the hash (or checksum) used for data
 integrity is the same as the one used for deduplication,
 but now I see that they are different.


 On Wed, Jul 11, 2012 at 10:23 AM, Sašo Kiselkov 
 skiselkov...@gmail.comwrote:

 On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
  Hello all,
 
  what about the fletcher2 and fletcher4 algorithms? According to the
 zfs man
  page on oracle, fletcher4 is the current default.
  Shouldn't the fletcher algorithms be much faster then any of the SHA
  algorithms?
  On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov skiselkov...@gmail.com
 wrote:

 Fletcher is a checksum, not a hash. It can and often will produce
 collisions, so you need to set your dedup to verify (do a bit-by-bit
 comparison prior to deduplication) which can result in significant write
 amplification (every write is turned into a read and potentially another
 write in case verify finds the blocks are different). With hashes, you
 can leave verify off, since hashes are extremely unlikely (~10^-77) to
 produce collisions.

 --
 Saso




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Darren J Moffat

On 07/11/12 00:56, Sašo Kiselkov wrote:

  * SHA-512: simplest to implement (since the code is already in the
kernel) and provides a modest performance boost of around 60%.


FIPS 180-4 introduces SHA-512/t support and explicitly SHA-512/256.

http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf

Note this is NOT a simple truncation of SHA-512 since when using 
SHA-512/t the initial value H(0) is different.


See sections 5.3.6.2 and 6.7.

I recommend the checksum value for this be
checksum=sha512/256

A / in the value doesn't cause any problems and it is the official NIST 
name of that hash.


With the internal enum being: ZIO_CHECKSUM_SHA512_256

CR 7020616 already exists for adding this in Oracle Solaris.

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:47 AM, Joerg Schilling wrote:
 Sašo Kiselkov skiselkov...@gmail.com wrote:
 
 write in case verify finds the blocks are different). With hashes, you
 can leave verify off, since hashes are extremely unlikely (~10^-77) to
 produce collisions.
 
 This is how a lottery works. the chance is low but some people still win.
 
 q~A

You do realize that the age of the universe is only on the order of
around 10^18 seconds, don't you? Even if you had a trillion CPUs each
chugging along at 3.0 GHz for all this time, the number of processor
cycles you will have executed cumulatively is only on the order of 10^40,
still 37 orders of magnitude lower than the chance for a random hash
collision.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 11:02 AM, Darren J Moffat wrote:
 On 07/11/12 00:56, Sašo Kiselkov wrote:
   * SHA-512: simplest to implement (since the code is already in the
 kernel) and provides a modest performance boost of around 60%.
 
 FIPS 180-4 introduces SHA-512/t support and explicitly SHA-512/256.
 
 http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf
 
 Note this is NOT a simple truncation of SHA-512 since when using
 SHA-512/t the initial value H(0) is different.

Yes, I know that. In my original post I was only trying to probe the
landscape for whether there is interest in this in the community. Of
course for the implementation I'd go with the standardized truncation
function. Skein-512 already includes a truncation function. For Edon-R,
I'd have to devise one (probably based around SHA-512/256 or Skein).

 See sections 5.3.6.2 and 6.7.
 
 I recommend the checksum value for this be
 checksum=sha512/256
 
 A / in the value doesn't cause any problems and it is the official NIST
 name of that hash.
 
 With the internal enum being: ZIO_CHECKSUM_SHA512_256
 
 CR 7020616 already exists for adding this in Oracle Solaris.

Okay, I'll implement it identically in Illumos then. However, given
that I plan to use featureflags to advertise the feature to the outside
world, I doubt interoperability will be possible (even though the
on-disk format could be identical).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:50 AM, Ferenc-Levente Juhos wrote:
 Actually although as you pointed out that the chances to have an sha256
 collision is minimal, but still it can happen, that would mean
 that the dedup algorithm discards a block that he thinks is a duplicate.
 Probably it's anyway better to do a byte to byte comparison
 if the hashes match to be sure that the blocks are really identical.
 
 The funny thing here is that ZFS tries to solve all sorts of data integrity
 issues with checksumming and healing, etc.,
 and on the other hand a hash collision in the dedup algorithm can cause
 loss of data if wrongly configured.
 
 Anyway thanks that you have brought up the subject, now I know if I will
 enable the dedup feature I must set it to sha256,verify.

Oh jeez, I can't remember how many times this flame war has been going
on on this list. Here's the gist: SHA-256 (or any good hash) produces a
near uniform random distribution of output. Thus, the chances of getting
a random hash collision are around 2^-256 or around 10^-77. If I asked
you to pick two atoms at random *from the entire observable universe*,
your chances of hitting on the same atom are higher than the chances of
that hash collision. So leave dedup=on with sha256 and move on.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Joerg Schilling
Sašo Kiselkov skiselkov...@gmail.com wrote:

 On 07/11/2012 10:47 AM, Joerg Schilling wrote:
  Sašo Kiselkov skiselkov...@gmail.com wrote:
  
  write in case verify finds the blocks are different). With hashes, you
  can leave verify off, since hashes are extremely unlikely (~10^-77) to
  produce collisions.
  
  This is how a lottery works. the chance is low but some people still win.
  
  q~A

 You do realize that the age of the universe is only on the order of
 around 10^18 seconds, do you? Even if you had a trillion CPUs each
 chugging along at 3.0 GHz for all this time, the number of processor
 cycles you will have executed cumulatively is only on the order 10^40,
 still 37 orders of magnitude lower than the chance for a random hash
 collision.

This is not how chance works.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


You do realize that the age of the universe is only on the order of
around 10^18 seconds, do you? Even if you had a trillion CPUs each
chugging along at 3.0 GHz for all this time, the number of processor
cycles you will have executed cumulatively is only on the order 10^40,
still 37 orders of magnitude lower than the chance for a random hash
collision.



Suppose you find a weakness in a specific hash algorithm; you use this
to create hash collisions, and now imagine you store the hash collisions 
in a zfs dataset with dedup enabled using the same hash algorithm.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow
You do realize that the age of the universe is only on the order of
around 10^18 seconds, do you? Even if you had a trillion CPUs each
chugging along at 3.0 GHz for all this time, the number of processor
cycles you will have executed cumulatively is only on the order 10^40,
still 37 orders of magnitude lower than the chance for a random hash
collision.

Here we go, boiling the oceans again :)

Suppose you find a weakness in a specific hash algorithm; you use this
to create hash collisions and now imagined you store the hash collisions 
in a zfs dataset with dedup enabled using the same hash algorithm.

Sorry, but isn't this what dedup=verify solves? I don't see the problem here. 
Maybe all that's needed is a comment in the manpage saying hash algorithms 
aren't perfect.
 
cheers,
--justin___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

Sorry, but isn't this what dedup=verify solves? I don't see the problem here.
Maybe all that's needed is a comment in the manpage saying hash algorithms
aren't perfect.

The point is that hash functions are many-to-one, and I think the claim
was that verify isn't really needed if the hash function is good
enough.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Saso, I'm not flaming at all; I just happen to disagree. I understand that
the chances are very, very slim, but as one poster already said, this is how
the lottery works. I'm not saying one should make an exhaustive search with
trillions of computers just to produce a SHA-256 collision.
If I wanted an exhaustive search I would generate all the numbers from
0 to 2**256 and I would definitely get at least one collision.
Put another way: by generating all the possible 256-bit (32-byte)
blocks + 1, you will definitely get a collision. This is much more credible
than the analogy with the age of the universe and atoms picked at random,
etc.

The fact is it can happen: it's entirely possible that there are two JPEGs
in the universe with different content that have the same hash.
I can't prove the existence of such a pair, but you can't deny it.

The fact is that ZFS and everyone using it tries to correct data
degradation, e.g. caused by cosmic rays, and on the other hand they're
using probability calculations (no matter how slim the chances are) to
potentially discard valid data.
You can come up with more universe-and-atom theories and the
age of the universe, etc. The fact remains the same.

And each generation was convinced that its current best checksum or
hash algorithm would be the best forever. MD5 has
demonstrated that that's not the case. Time will tell what becomes of
SHA-256, but why take any chances?

On Wed, Jul 11, 2012 at 11:10 AM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 07/11/2012 10:50 AM, Ferenc-Levente Juhos wrote:
  Actually although as you pointed out that the chances to have an sha256
  collision is minimal, but still it can happen, that would mean
  that the dedup algorithm discards a block that he thinks is a duplicate.
  Probably it's anyway better to do a byte to byte comparison
  if the hashes match to be sure that the blocks are really identical.
 
  The funny thing here is that ZFS tries to solve all sorts of data
 integrity
  issues with checksumming and healing, etc.,
  and on the other hand a hash collision in the dedup algorithm can cause
  loss of data if wrongly configured.
 
  Anyway thanks that you have brought up the subject, now I know if I will
  enable the dedup feature I must set it to sha256,verify.

 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77. If I asked
 you to pick two atoms at random *from the entire observable universe*,
 your chances of hitting on the same atom are higher than the chances of
 that hash collision. So leave dedup=on with sha256 and move on.

 Cheers,
 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
Precisely, I said the same thing a few posts before:
dedup=verify solves that. And as I said, one could use dedup=hash
algorithm,verify with
an inferior hash algorithm (that is much faster) with the purpose of
reducing the number of dedup candidates.
For that matter, one could use a trivial CRC32: if the two blocks have the
same CRC you do a byte-to-byte comparison anyway, but if they have different
CRCs you don't need to do the actual comparison.

On Wed, Jul 11, 2012 at 12:24 PM, Justin Stringfellow 
justin.stringfel...@btinternet.com wrote:

   You do realize that the age of the universe is only on the order of
 around 10^18 seconds, do you? Even if you had a trillion CPUs each
 chugging along at 3.0 GHz for all this time, the number of processor
 cycles you will have executed cumulatively is only on the order 10^40,
 still 37 orders of magnitude lower than the chance for a random hash
 collision.

 Here we go, boiling the oceans again :)


 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions
 in a zfs dataset with dedup enabled using the same hash algorithm.

 Sorry, but isn't this what dedup=verify solves? I don't see the problem
 here. Maybe all that's needed is a comment in the manpage saying hash
 algorithms aren't perfect.

 cheers,
 --justin

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 11:53 AM, Tomas Forsman wrote:
 On 11 July, 2012 - Sašo Kiselkov sent me these 1,4K bytes:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77. If I asked
 you to pick two atoms at random *from the entire observable universe*,
 your chances of hitting on the same atom are higher than the chances of
 that hash collision. So leave dedup=on with sha256 and move on.
 
 So in ZFS, which normally uses 128kB blocks, you can instead store them
 100% uniquely into 32 bytes.. A nice 4096x compression rate..
 decompression is a bit slower though..

I really mean no disrespect, but this comment is so dumb I could swear
my IQ dropped by a few tenths of a point just by reading it.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:00 PM, casper@oracle.com wrote:
 
 
 You do realize that the age of the universe is only on the order of
 around 10^18 seconds, do you? Even if you had a trillion CPUs each
 chugging along at 3.0 GHz for all this time, the number of processor
 cycles you will have executed cumulatively is only on the order 10^40,
 still 37 orders of magnitude lower than the chance for a random hash
 collision.

 
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions 
 in a zfs dataset with dedup enabled using the same hash algorithm.

That is one possibility I considered when evaluating whether to
implement the Edon-R hash algorithm. Its security margin is somewhat
questionable, so the next step is to determine whether this can be
used in any way to implement a practical data-corruption attack. My guess
is that it can't, but I'm open to debate on this issue, since I'm not a
security expert myself.

The reason why I don't think this can be used to implement a practical
attack is that in order to generate a collision, you first have to know
the disk block that you want to create a collision on (or at least the
checksum), i.e. the original block is already in the pool. At that
point, you could write a colliding block which would get de-dup'd, but
that doesn't mean you've corrupted the original data, only that you
referenced it. So, in a sense, you haven't corrupted the original block,
only your own collision block (since that's the copy that doesn't get written).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions 
 in a zfs dataset with dedup enabled using the same hash algorithm.
 
 Sorry, but isn't this what dedup=verify solves? I don't see the problem here. 
 Maybe all that's needed is a comment in the manpage saying hash algorithms 
 aren't perfect.

It does solve it, but at a cost to normal operation. Every write gets
turned into a read. Assuming a big enough and reasonably busy dataset,
this leads to tremendous write amplification.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions 
 in a zfs dataset with dedup enabled using the same hash algorithm.
 
 Sorry, but isn't this what dedup=verify solves? I don't see the problem 
 here. Maybe all that's needed is a comment in the manpage saying hash algorithms aren't perfect.

It does solve it, but at a cost to normal operation. Every write gets
turned into a read. Assuming a big enough and reasonably busy dataset,
this leads to tremendous write amplification.


If and only if the block is being dedup'ed.  (In that case, you're just
changing the write of a whole block into one read of the block and an
update in the dedup data; the whole block isn't written.)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow


 The point is that hash functions are many to one and I think the point
 was about that verify wasn't really needed if the hash function is good
 enough.

This is a circular argument really, isn't it? Hash algorithms are never 
perfect, but we're trying to build a perfect one?
 
It seems to me the obvious fix is to use the hash to identify candidates for dedup, 
and then do the actual verify and dedup asynchronously. Perhaps a worker thread 
doing this at low priority?
Did anyone consider this?
 
cheers,
--justin___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:32 PM, Ferenc-Levente Juhos wrote:
 Saso, I'm not flaming at all, I happen to disagree, but still I understand
 that
 chances are very very very slim, but as one poster already said, this is
 how
 the lottery works. I'm not saying one should make an exhaustive search with
 trillions of computers just to produce a sha256 collision.
 If I wanted an exhaustive search I would generate all the numbers from
 0 to 2**256 and I would definitely get at least 1 collision.
 If you formulate it in another way, by generating all the possible 256 bit
 (32 byte)
 blocks + 1 you will definitely get a collision. This is much more credible
 than the analogy with the age of the universe and atoms picked at random,
 etc.

First of all, I never said that the chance is zero. It's definitely
non-zero, but claiming that it is analogous to the lottery is just not
appreciating the scale of the difference.

Next, your proposed method of finding hash collisions is naive in that
you assume that you merely need generate 256-bit numbers. First of all,
the smallest blocks in ZFS are 1k (IIRC), i.e. 8192 bits. Next, you fail
to appreciate the difficulty of even generating 2^256 256-bit numbers.
Here's how much memory you'd need:

2^256 * 32 ~= 2^261 bytes

Memory may be cheap, but not that cheap...

 The fact is it can happen, it's entirely possible that there are two jpg's
 in the universe with different content and they have the same hash.
 I can't prove the existence of those, but you can't deny it.

I'm not denying that. Read what I said.

 The fact is, that zfs and everyone using it trys to correct data
 degradation e.g. cause by cosmic rays, and on the other hand their
 using probability calculation (no matter how slim the chances are) to
 potentially discard valid data.
 You can come with other universe and atom theories and with the
 age of the universe, etc. The fact remains the same.

This is just a profoundly naive statement. You always need to make
trade-offs between practicality and performance. How slim the chances
are has a very real impact on engineering decisions.

 And each generation was convinced that their current best checksum or
 hash algorithm is the best and will be the best forever. MD5 has
 demonstrated that it's not the case. Time will tell what becomes of
 SHA256, but why take any chances.

You really don't understand much about hash algorithms, do you? MD5 has
*very* good safety against random collisions - that's why it's still
used in archive integrity checking (ever wonder what algorithm Debian
and RPM packages use?). The reason why it's been replaced by newer
algorithms has nothing to do with its chance of random hash collisions,
but with deliberate collision attacks on security algorithms built on
top of it. Also, the reason why we don't use it in ZFS is because
SHA-256 is faster and it has a larger pattern space (thus further
lowering the odds of random collisions).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:37 PM, Ferenc-Levente Juhos wrote:
 Precisely, I said the same thing a few posts before:
 dedup=verify solves that. And as I said, one could use dedup=hash
 algorithm,verify with
 an inferior hash algorithm (that is much faster) with the purpose of
 reducing the number of dedup candidates.
 For that matter one could use a trivial CRC32, if the two blocks have the
 same CRC you do anyway a
 byte-to-byte comparison, but if they have differenct CRC's you don't need
 to do the actual comparison.

Yes, and that's exactly what the Fletcher4+verify combo does (Fletcher
is faster than CRC32). That's why I started this thread - to assess
whether a better hash algorithm is needed and/or desired.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:09 PM, Justin Stringfellow wrote:
 The point is that hash functions are many to one and I think the point
 was about that verify wasn't really needed if the hash function is good
 enough.
 
 This is a circular argument really, isn't it? Hash algorithms are never 
 perfect, but we're trying to build a perfect one?
  
 It seems to me the obvious fix is to use hash to identify candidates for 
 dedup, and then do the actual verify and dedup asynchronously. Perhaps a 
 worker thread doing this at low priority?
 Did anyone consider this?

This assumes you have low volumes of deduplicated data. As your dedup
ratio grows, so does the performance hit from dedup=verify. At, say,
dedupratio=10.0x, on average, every write results in 10 reads.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


This assumes you have low volumes of deduplicated data. As your dedup
ratio grows, so does the performance hit from dedup=verify. At, say,
dedupratio=10.0x, on average, every write results in 10 reads.

I don't follow.

If dedupratio == 10, it means that each item is *referenced* 10 times
but it is only stored *once*.  Only when you have hash collisions would
multiple reads be needed.

Only one read is needed except in the case of hash collisions.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow
 This assumes you have low volumes of deduplicated data. As your dedup
 ratio grows, so does the performance hit from dedup=verify. At, say,
 dedupratio=10.0x, on average, every write results in 10 reads.

Well you can't make an omelette without breaking eggs! Not a very nice one, 
anyway.
 
Yes dedup is expensive but much like using O_SYNC, it's a conscious decision 
here to take a performance hit in order to be sure about our data. Moving the 
actual reads to an async thread as I suggested should improve things.
 
cheers,
--justin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:36 PM, casper@oracle.com wrote:
 
 
 This assumes you have low volumes of deduplicated data. As your dedup
 ratio grows, so does the performance hit from dedup=verify. At, say,
 dedupratio=10.0x, on average, every write results in 10 reads.
 
 I don't follow.
 
 If dedupratio == 10, it means that each item is *referenced* 10 times
 but it is only stored *once*.  Only when you have hash collisions then 
 multiple reads would be needed.
 
 Only one read is needed except in the case of hash collisions.

No, *every* dedup write will result in a block read. This is how:

 1) ZIO gets block X and computes HASH(X)
 2) ZIO looks up HASH(X) in DDT
  2a) HASH(X) not in DDT - unique write; exit
  2b) HASH(X) in DDT; continue
 3) Read original disk block Y with HASH(Y) = HASH(X) --here's the read
 4) Verify X == Y
  4a) X == Y; increment refcount
  4b) X != Y; hash collision; write new block   -- here's the collision

So in other words, by the time you figure out you've got a hash
collision, you already did the read, ergo, every dedup write creates a read!
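
For anyone who prefers it spelled out in code, here is a minimal sketch of
the same flow in Python (my own illustration; the dicts 'ddt' and 'storage'
are stand-ins for the DDT and the pool, not the actual ZIO interfaces):

    import hashlib

    def dedup_write(block, ddt, storage, verify=True):
        h = hashlib.sha256(block).digest()       # 1) compute HASH(X)
        entry = ddt.get(h)                       # 2) look up HASH(X) in the DDT
        if entry is None:                        # 2a) not in DDT: unique write
            storage[len(storage)] = block
            ddt[h] = len(storage) - 1
            return "unique write"
        if not verify:                           # dedup without verify trusts the hash
            return "refcount++ (unverified)"
        original = storage[entry]                # 3) read original block Y -- here's the read
        if original == block:                    # 4) byte-for-byte compare
            return "refcount++"                  # 4a) genuine duplicate
        storage[len(storage)] = block            # 4b) hash collision: store the new block
        return "collision, new block written"

With verify=False the read in step 3 disappears entirely, which is exactly
the trade-off being argued about in this thread.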

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:42 PM, Justin Stringfellow wrote:
 This assumes you have low volumes of deduplicated data. As your dedup
 ratio grows, so does the performance hit from dedup=verify. At, say,
 dedupratio=10.0x, on average, every write results in 10 reads.
 
 Well you can't make an omelette without breaking eggs! Not a very nice one, 
 anyway.
  
 Yes dedup is expensive but much like using O_SYNC, it's a conscious decision 
 here to take a performance hit in order to be sure about our data. Moving the 
 actual reads to a async thread as I suggested should improve things.

And my point here is that the expense is unnecessary and can be omitted
if we choose our algorithms and settings carefully.

Async here won't help, you'll still get equal write amplification,
only it's going to be spread in between txg's.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 04:50, Ferenc-Levente Juhos wrote:
 Actually, although as you pointed out the chance of an SHA256
 collision is minimal, it can still happen, and that would mean
 that the dedup algorithm discards a block that it thinks is a duplicate.
 It's probably better anyway to do a byte-to-byte comparison
 if the hashes match, to be sure that the blocks are really identical.

 The funny thing here is that ZFS tries to solve all sorts of data
 integrity
 issues with checksumming and healing, etc.,
 and on the other hand a hash collision in the dedup algorithm can cause
 loss of data if wrongly configured.

 Anyway, thanks for bringing up the subject; now I know that if I
 enable the dedup feature I must set it to sha256,verify.

There was a long-ish thread on the topic last year
((Fletcher+Verification) versus (Sha256+No Verification) ):

http://mail.opensolaris.org/pipermail/zfs-discuss/2011-January/thread.html#46864


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote:
 However, before I start out on a pointless endeavor, I wanted to probe
 the field of ZFS users, especially those using dedup, on whether their
 workloads would benefit from a faster hash algorithm (and hence, lower
 CPU utilization). Developments of late have suggested to me three
 possible candidates:
[...]

I'd wait until SHA-3 is announced. It's supposed to happen this year, of
which only six months are left:

http://csrc.nist.gov/groups/ST/hash/timeline.html
http://en.wikipedia.org/wiki/NIST_hash_function_competition

It was actually supposed to happen in 2Q, so they're running a little
late, it seems.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:39 PM, David Magda wrote:
 On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote:
 However, before I start out on a pointless endeavor, I wanted to probe
 the field of ZFS users, especially those using dedup, on whether their
 workloads would benefit from a faster hash algorithm (and hence, lower
 CPU utilization). Developments of late have suggested to me three
 possible candidates:
 [...]
 
 I'd wait until SHA-3 is announced. It's supposed to happen this year, of
 which only six months are left:
 
 http://csrc.nist.gov/groups/ST/hash/timeline.html
 http://en.wikipedia.org/wiki/NIST_hash_function_competition
 
  It was actually supposed to happen in 2Q, so they're running a little
  late, it seems.

I'm not convinced waiting makes much sense. The SHA-3 standardization
process' goals are different from ours. SHA-3 can choose to go with
something that's slower, but has a higher security margin. I think that
absolute super-tight security isn't all that necessary for ZFS, since
the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
Since there is a finite number of bit patterns per block, have you tried to 
just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if 
there is ever a collision?  If you found an algorithm that produced no 
collisions for any possible block bit pattern, wouldn't that be the win?

Gregg Wonderly

On Jul 11, 2012, at 5:56 AM, Sašo Kiselkov wrote:

 On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions 
 in a zfs dataset with dedup enabled using the same hash algorithm.
 
 Sorry, but isn't this what dedup=verify solves? I don't see the problem 
 here. Maybe all that's needed is a comment in the manpage saying hash 
 algorithms aren't perfect.
 
 It does solve it, but at a cost to normal operation. Every write gets
 turned into a read. Assuming a big enough and reasonably busy dataset,
 this leads to tremendous write amplification.
 
 Cheers,
 --
 Saso
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
 
 On 07/11/2012 11:53 AM, Tomas Forsman wrote:
  On 11 July, 2012 - Sašo Kiselkov sent me these 1,4K bytes:
  Oh jeez, I can't remember how many times this flame war has been going
  on on this list. Here's the gist: SHA-256 (or any good hash) produces a
  near uniform random distribution of output. Thus, the chances of
getting
  a random hash collision are around 2^-256 or around 10^-77. If I asked
  you to pick two atoms at random *from the entire observable universe*,
  your chances of hitting on the same atom are higher than the chances of
  that hash collision. So leave dedup=on with sha256 and move on.
 
  So in ZFS, which normally uses 128kB blocks, you can instead store them
  100% uniquely into 32 bytes.. A nice 4096x compression rate..
  decompression is a bit slower though..
 
 I really mean no disrespect, but this comment is so dumb I could swear
 my IQ dropped by a few tenths of a point just by reading.

Cool it please.  You say I mean no disrespect and then say something which
is clearly disrespectful.

Tomas's point is to illustrate that hashing is a many-to-one function.  If
it were possible to rely on the hash to always be unique, then you could use
it as a compression algorithm.  He's pointing out that's insane.  His
comment was not in the slightest bit dumb; if anything, it seems like maybe
somebody (or some people) didn't get his point.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:57 PM, Gregg Wonderly wrote:
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?

Don't think that, if you can think of this procedure, that the crypto
security guys at universities haven't thought about it as well? Of course
they have. No, simply generating a sequence of random patterns and
hoping to hit a match won't do the trick.

P.S. I really don't mean to sound smug or anything, but I know one thing
for sure: the crypto researchers who propose these algorithms are some
of the brightest minds on this topic on the planet, so I would hardly
think they didn't consider trivial problems.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
 
 As your dedup
 ratio grows, so does the performance hit from dedup=verify. At, say,
 dedupratio=10.0x, on average, every write results in 10 reads.

Why?  
If you intend to write a block, and you discover it has a matching checksum,
you only need to verify it matches one block.  You don't need to verify it
matches all the other blocks that have previously been verified to match
each other.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Tue, 10 Jul 2012, Edward Ned Harvey wrote:


CPU's are not getting much faster.  But IO is definitely getting faster.  It's 
best to keep ahead of that curve.


It seems that per-socket CPU performance is doubling every year. 
That seems like faster to me.


If server CPU chipsets offer acceleration for some type of standard 
encryption, then that needs to be considered.  The CPU might not need 
to do the encryption the hard way.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Gregg Wonderly
 
 Since there is a finite number of bit patterns per block, have you tried
to just
 calculate the SHA-256 or SHA-512 for every possible bit pattern to see if
there
 is ever a collision?  If you found an algorithm that produced no
collisions for
 any possible block bit pattern, wouldn't that be the win?

Maybe I misunderstand what you're saying, but if I got it right, what you're
saying is physically impossible to do in the time of the universe...  And
guaranteed to fail even if you had all the computational power of God.

I think you're saying ... In a block of 128k, sequentially step through all
the possible values ... starting with 0, 1, 2, ... 2^128k ... and compute
the hashes of each value, and see if you ever find a hash collision.  If
this is indeed what you're saying, recall, the above operation will require
on order 2^128k operations to complete.  But present national security
standards accept 2^256 operations as satisfactory to protect data from brute
force attacks over the next 30 years.  Furthermore, in a 128k block, there
exist 2^128k possible values, while in a 512bit hash, there exist only 2^512
possible values (which is still a really huge number.)  This means there
will exist at least 2^127.5k collisions.  However, these numbers are so
astronomically universally magnanimously huge, it will still take more than
a lifetime to find any one of those collisions.  So it's impossible to
perform such a computation, and if you could, you would be guaranteed to
find a LOT of collisions.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:58 PM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov

 I really mean no disrespect, but this comment is so dumb I could swear
 my IQ dropped by a few tenths of a point just by reading.
 
 Cool it please.  You say I mean no disrespect and then say something which
 is clearly disrespectful.

I sort of flew off the handle there, and I shouldn't have. It felt like
Tomas was misrepresenting my position and putting words in my mouth I
didn't say. I certainly didn't mean to diminish the validity of an
honest question.

 Tomas's point is to illustrate that hashing is a many-to-one function.  If
 it were possible to rely on the hash to always be unique, then you could use
 it as a compression algorithm.  He's pointing out that's insane.  His
 comment was not in the slightest bit dumb; if anything, it seems like maybe
 somebody (or some people) didn't get his point.

I understood his point very well and I never argued that hashing always
results in unique hash values, which is why I thought he was
misrepresenting what I said.

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)
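
To put a rough number on point 3 (my own back-of-the-envelope, nothing
more): for a 128K record there are 2^1048576 possible inputs but only 2^256
possible digests, so on average each digest value has about 2^1048320
preimages and tells you essentially nothing about which input produced it.

    block_bits  = 128 * 1024 * 8     # bits in a 128K record: 1048576
    digest_bits = 256                # bits in a SHA-256 digest

    # log2 of the average number of distinct 128K blocks per digest value
    print(block_bits - digest_bits)  # 1048320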

A formal proof would of course take longer to construct and would take
time that I feel is best spent writing code.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Joerg Schilling
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

 On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
 
  CPU's are not getting much faster.  But IO is definitely getting faster.  
  It's best to keep ahead of that curve.

 It seems that per-socket CPU performance is doubling every year. 
 That seems like faster to me.

This would only apply, if you implement a multi threaded hash.

The single-threaded performance win is less than doubling each year.

Jörg

-- 
 EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
        j...@cs.tu-berlin.de (uni)
        joerg.schill...@fokus.fraunhofer.de (work)
 Blog:  http://schily.blogspot.com/
 URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
But this is precisely the kind of observation that some people seem to miss 
out on the importance of.  As Tomas suggested in his post, if this was true, 
then we could have a huge compression ratio as well.  And even if there was 10% 
of the bit patterns that created non-unique hashes, you could use the fact that 
a block hashed to a known bit pattern that didn't have collisions, to compress 
the other 90% of your data.

I'm serious about this from a number of perspectives.  We worry about the time 
it would take to reverse SHA or RSA hashes to passwords, not even thinking that 
what if someone has been quietly computing all possible hashes for the past 
10-20 years into a database some where, with every 5-16 character password, and 
now has an instantly searchable hash-to-password database.

Sometimes we ignore the scale of time, thinking that only the immediately 
visible details are what we have to work with.

If no one has computed the hashes for every single 4K and 8K block, then fine.  
But, if that was done, and we had that data, we'd know for sure which algorithm 
was going to work the best for the number of bits we are considering.

Speculating based on the theory of the algorithms for random number of bits 
is just silly.  Where's the real data that tells us what we need to know?

Gregg Wonderly

On Jul 11, 2012, at 9:02 AM, Sašo Kiselkov wrote:

 On 07/11/2012 03:57 PM, Gregg Wonderly wrote:
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?
 
 Don't think that, if you can think of this procedure, that the crypto
  security guys at universities haven't thought about it as well? Of course
 they have. No, simply generating a sequence of random patterns and
 hoping to hit a match won't do the trick.
 
 P.S. I really don't mean to sound smug or anything, but I know one thing
 for sure: the crypto researchers who propose these algorithms are some
 of the brightest minds on this topic on the planet, so I would hardly
 think they didn't consider trivial problems.
 
 Cheers,
 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.


If an algorithm is not 'secure' and zfs is not set to verify, doesn't 
that mean that a knowledgeable user will be able to cause intentional 
data corruption if deduplication is enabled?  A user with very little 
privilege might be able to cause intentional harm by writing the magic 
data block before some other known block (which produces the same 
hash) is written.  This allows one block to substitute for another.


It does seem that security is important because with a human element, 
data is not necessarily random.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
You don't need to reproduce all possible blocks.
1. SHA256 produces a 256 bit hash
2. That means it produces a value on 256 bits, in other words a value
between 0..2^256 - 1
3. If you start counting from 0 to 2^256 and for each number calculate the
SHA256 you will get at least one hash collision (if the hash algorithm is
perfectly distributed)
4. Counting from 0 to 2^256, is nothing else but reproducing all possible
bit pattern on 32 bytes

It's not about whether one computer is capable of producing the above
hashes or not, or whether there are actually that many unique 32 byte bit
patterns in the universe.
A collision can happen.
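
The pigeonhole argument is easy to see at toy scale. A small sketch (my own
illustration, nothing to do with ZFS itself): count upward, hash each
counter value, but keep only the first 3 bytes of SHA-256, so a collision
turns up after a few thousand tries instead of after roughly 2^128:

    import hashlib

    def find_truncated_collision(nbytes=3):
        seen = {}                              # truncated digest -> first input seen
        i = 0
        while True:
            digest = hashlib.sha256(i.to_bytes(8, "big")).digest()
            key = digest[:nbytes]              # keep only the first 24 bits
            if key in seen:
                return seen[key], i            # two different inputs, same truncated hash
            seen[key] = i
            i += 1

    a, b = find_truncated_collision()
    print(a, b, "collide in the first 24 bits of SHA-256")

The same counting argument holds for the full 256 bits; only the expected
number of steps changes, to an utterly impractical ~2^128 for a
birthday-style search.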

On Wed, Jul 11, 2012 at 3:57 PM, Gregg Wonderly gr...@wonderly.org wrote:

 Since there is a finite number of bit patterns per block, have you tried
 to just calculate the SHA-256 or SHA-512 for every possible bit pattern to
 see if there is ever a collision?  If you found an algorithm that produced
 no collisions for any possible block bit pattern, wouldn't that be the win?

 Gregg Wonderly

 On Jul 11, 2012, at 5:56 AM, Sašo Kiselkov wrote:

  On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
  Suppose you find a weakness in a specific hash algorithm; you use this
  to create hash collisions and now imagined you store the hash
 collisions
  in a zfs dataset with dedup enabled using the same hash algorithm.
 
  Sorry, but isn't this what dedup=verify solves? I don't see the problem
 here. Maybe all that's needed is a comment in the manpage saying hash
 algorithms aren't perfect.
 
  It does solve it, but at a cost to normal operation. Every write gets
  turned into a read. Assuming a big enough and reasonably busy dataset,
  this leads to tremendous write amplification.
 
  Cheers,
  --
  Saso
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

On Tue, 10 Jul 2012, Edward Ned Harvey wrote:

 CPU's are not getting much faster.  But IO is definitely getting faster.  
It's best to keep ahead of that curve.

It seems that per-socket CPU performance is doubling every year. 
That seems like faster to me.

I think that I/O isn't getting as fast as CPU is; memory capacity and
bandwidth and CPUs are getting faster.  I/O, not so much.
(Apart from the one single step from hard disk to SSD; but note that
I/O is limited to standard interfaces and as such it is likely to be
held down by requiring a new standard.)

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
 From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
 Sent: Wednesday, July 11, 2012 10:06 AM
 
 On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
 
  CPU's are not getting much faster.  But IO is definitely getting faster.
It's
 best to keep ahead of that curve.
 
 It seems that per-socket CPU performance is doubling every year.
 That seems like faster to me.

It's a good point.  It's only *serial* computation that isn't increasing
much anymore.  But computing hashes for a bunch of independent blocks is a
parallel computation operation.  For now, they're able to keep parallelizing
via more cores.

Still, I wonder how many more years this can continue?  The reason the
serial computation isn't increasing much anymore is simply because the
transistors etc have gotten small enough that they can't really increase
clock speed much more.  If you have a 3.0 GHz clock, how far does light
travel in one clock cycle?  About 10 cm.  All the transistors, capacitors, wires
etc need time to react and stabilize before the next clock cycle...
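
The back-of-the-envelope behind that ~10 cm figure, for anyone who wants to
check it:

    c = 299_792_458      # speed of light in vacuum, m/s
    clock = 3.0e9        # a 3.0 GHz clock, cycles per second
    print(c / clock)     # ~0.0999 m travelled per clock cycle, i.e. about 10 cm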

For now, they're still able to keep making larger dies, with more cores on
them, and they're certainly developing layering and 3-D technologies, which
will certainly allow the number of cores to keep growing for some time to
come.  But there is a limit somewhere.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
Unfortunately, the government imagines that people are using their home 
computers to compute hashes and try and decrypt stuff.  Look at what is 
happening with GPUs these days.  People are hooking up 4 GPUs in their 
computers and getting huge performance gains.  5-6 char password space covered 
in a few days.  12 or so chars would take one machine a couple of years if I 
recall.  So, if we had 20 people with that class of machine, we'd be down to a 
few months.   I'm just suggesting that while the compute space is still huge, 
it's not actually undoable, it just requires some thought into how to approach 
the problem, and then some time to do the computations.

Huge space, but still finite…

Gregg Wonderly

On Jul 11, 2012, at 9:13 AM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Gregg Wonderly
 
 Since there is a finite number of bit patterns per block, have you tried
 to just
 calculate the SHA-256 or SHA-512 for every possible bit pattern to see if
 there
 is ever a collision?  If you found an algorithm that produced no
 collisions for
 any possible block bit pattern, wouldn't that be the win?
 
 Maybe I misunderstand what you're saying, but if I got it right, what you're
 saying is physically impossible to do in the time of the universe...  And
 guaranteed to fail even if you had all the computational power of God.
 
 I think you're saying ... In a block of 128k, sequentially step through all
 the possible values ... starting with 0, 1, 2, ... 2^128k ... and compute
 the hashes of each value, and see if you ever find a hash collision.  If
 this is indeed what you're saying, recall, the above operation will require
 on order 2^128k operations to complete.  But present national security
 standards accept 2^256 operations as satisfactory to protect data from brute
 force attacks over the next 30 years.  Furthermore, in a 128k block, there
 exist 2^128k possible values, while in a 512bit hash, there exist only 2^512
 possible values (which is still a really huge number.)  This means there
 will exist at least 2^127.5k collisions.  However, these numbers are so
 astronomically universally magnanimously huge, it will still take more than
 a lifetime to find any one of those collisions.  So it's impossible to
 perform such a computation, and if you could, you would be guaranteed to
 find a LOT of collisions.
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:19 PM, Gregg Wonderly wrote:
 But this is precisely the kind of observation that some people seem to miss 
 out on the importance of.  As Tomas suggested in his post, if this was true, 
 then we could have a huge compression ratio as well.  And even if there was 
 10% of the bit patterns that created non-unique hashes, you could use the 
 fact that a block hashed to a known bit pattern that didn't have collisions, 
 to compress the other 90% of your data.
 
 I'm serious about this from a number of perspectives.  We worry about the 
 time it would take to reverse SHA or RSA hashes to passwords, not even 
 thinking that what if someone has been quietly computing all possible hashes 
 for the past 10-20 years into a database some where, with every 5-16 
 character password, and now has an instantly searchable hash-to-password 
 database.

This is something very well known in the security community as rainbow
tables and a common method to protect against it is via salting. Never
use a password hashing scheme which doesn't use salts for exactly the
reason you outlined above.
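
A minimal sketch of why salting defeats precomputed tables (illustration
only; a real system would use a dedicated password KDF such as PBKDF2,
bcrypt or scrypt rather than bare SHA-256): the same password hashed under
two random salts yields unrelated digests, so a table keyed on unsalted
hashes never matches.

    import hashlib, os

    password = b"correct horse battery staple"
    salt1, salt2 = os.urandom(16), os.urandom(16)

    h1 = hashlib.sha256(salt1 + password).hexdigest()
    h2 = hashlib.sha256(salt2 + password).hexdigest()

    print(h1)
    print(h2)   # different digest for the same password; the precomputed table misses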

 Sometimes we ignore the scale of time, thinking that only the immediately 
 visible details are what we have to work with.
 
 If no one has computed the hashes for every single 4K and 8K block, then 
 fine.  But, if that was done, and we had that data, we'd know for sure which 
 algorithm was going to work the best for the number of bits we are 
 considering.

Do you even realize how many 4K or 8K blocks there are?!?! Exactly
2^32768 or 2^65536 respectively. I wouldn't worry about somebody having
those pre-hashed ;-) Rainbow tables only work for a very limited subset
of data.

 Speculating based on the theory of the algorithms for random number of bits 
 is just silly.  Where's the real data that tells us what we need to know?

If you don't trust math, then I there's little I can do to convince you.
But remember our conversation the next time you step into a car or get
on an airplane. The odds that you'll die on that ride are far higher
than that you'll find a random hash collision in a 256-bit hash algorithm...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:22 PM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 If an algorithm is not 'secure' and zfs is not set to verify, doesn't
 that mean that a knowledgeable user will be able to cause intentional
 data corruption if deduplication is enabled?  A user with very little
 privilege might be able to cause intentional harm by writing the magic
 data block before some other known block (which produces the same hash)
 is written.  This allows one block to substitute for another.
 
 It does seem that security is important because with a human element,
 data is not necessarily random.

Theoretically yes, it is possible, but the practicality of such an
attack is very much in doubt. In case this is a concern, however, one
can always switch to a more secure hash function (e.g. Skein-512).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
This is exactly the issue for me.  It's vital to always have verify on.  If you 
don't have the data to prove that every possible block combination 
hashes uniquely for the small bit space we are talking about, then how in the 
world can you say that verify is not necessary?  That just seems ridiculous 
to propose.

Gregg Wonderly

On Jul 11, 2012, at 9:22 AM, Bob Friesenhahn wrote:

 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 If an algorithm is not 'secure' and zfs is not set to verify, doesn't that 
 mean that a knowledgeable user will be able to cause intentional data 
 corruption if deduplication is enabled?  A user with very little privilege 
 might be able to cause intentional harm by writing the magic data block 
 before some other known block (which produces the same hash) is written.  
 This allows one block to substitute for another.
 
 It does seem that security is important because with a human element, data is 
 not necessarily random.
 
 Bob
 -- 
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,
 http://www.GraphicsMagick.org/___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Joerg Schilling wrote:


Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:


On Tue, 10 Jul 2012, Edward Ned Harvey wrote:


CPU's are not getting much faster.  But IO is definitely getting faster.  It's 
best to keep ahead of that curve.


It seems that per-socket CPU performance is doubling every year.
That seems like faster to me.


This would only apply, if you implement a multi threaded hash.


While it is true that the per-block hash latency does not improve much 
with new CPUs (and may even regress), given multiple I/Os at once, 
hashes may be computed by different cores and so it seems that 
total system performance will scale with per-socket CPU performance. 
Even with a single stream of I/O, multiple zfs blocks will be read or 
written so multiple block hashes may be computed at once on different 
cores.
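
A rough sketch of what "multiple block hashes computed at once on different
cores" looks like, assuming a pile of independent 128K blocks (illustration
only, not how the ZIO pipeline actually schedules work):

    import hashlib
    import os
    from concurrent.futures import ProcessPoolExecutor

    def sha256_of(block):
        return hashlib.sha256(block).hexdigest()

    if __name__ == "__main__":
        blocks = [os.urandom(128 * 1024) for _ in range(64)]  # 64 independent 128K blocks
        with ProcessPoolExecutor() as pool:                   # defaults to one worker per CPU
            digests = list(pool.map(sha256_of, blocks))       # hashed in parallel across cores
        print(len(digests), "blocks hashed")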


Server OSs like Solaris have been focusing on improving total system 
throughput rather than single-threaded bandwidth.


I don't mean to discount the importance of this effort though.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:23 PM, casper@oracle.com wrote:
 
 On Tue, 10 Jul 2012, Edward Ned Harvey wrote:

 CPU's are not getting much faster.  But IO is definitely getting faster.  
 It's best to keep ahead of that curve.

 It seems that per-socket CPU performance is doubling every year. 
 That seems like faster to me.
 
 I think that I/O isn't getting as fast as CPU is; memory capacity and
 bandwidth and CPUs are getting faster.  I/O, not so much.
 (Apart from the one single step from hard disk to SSD; but note that
 I/O is limited to standard interfaces and as such it is likely to be
 held down by requiring a new standard.)

Have you seen one of those SSDs made by FusionIO? Those things fit in a
single PCI-e x8 slot and can easily push a sustained rate upward of
several GB/s. Do not expect that drives are the be-all end-all to
storage. Hybrid storage invalidated the traditional CPU  memory fast,
disks slow wisdom years ago.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Justin Stringfellow


 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?
 
Perhaps I've missed something, but if there was *never* a collision, you'd have 
stumbled across a rather impressive lossless compression algorithm. I'm pretty 
sure there's some Big Mathematical Rules (Shannon?) that mean this cannot be.
 
cheers,
--justin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
As I said several times before, to produce hash collisions. Or to calculate
rainbow tables (as a previous user theorized it) you only need the
following.

You don't need to reproduce all possible blocks.
1. SHA256 produces a 256 bit hash
2. That means it produces a value on 256 bits, in other words a value
between 0..2^256 - 1
3. If you start counting from 0 to 2^256 and for each number calculate the
SHA256 you will get at least one hash collision (if the hash algorithm is
perfectly distributed)
4. Counting from 0 to 2^256, is nothing else but reproducing all possible
bit pattern on 32 bytes

It's not about whether one computer is capable of producing the above
hashes or not, or whether there are actually that many unique 32 byte bit
patterns in the universe.
A collision can happen.

On Wed, Jul 11, 2012 at 4:34 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 07/11/2012 04:23 PM, casper@oracle.com wrote:
 
  On Tue, 10 Jul 2012, Edward Ned Harvey wrote:
 
  CPU's are not getting much faster.  But IO is definitely getting
 faster.  It's best to keep ahead of that curve.
 
  It seems that per-socket CPU performance is doubling every year.
  That seems like faster to me.
 
  I think that I/O isn't getting as fast as CPU is; memory capacity and
  bandwidth and CPUs are getting faster.  I/O, not so much.
  (Apart from the one single step from hard disk to SSD; but note that
  I/O is limited to standard interfaces and as such it is likely to be
  held down by requiring a new standard.)

 Have you seen one of those SSDs made by FusionIO? Those things fit in a
 single PCI-e x8 slot and can easily push a sustained rate upward of
 several GB/s. Do not expect that drives are the be-all end-all to
  storage. Hybrid storage invalidated the traditional "CPU & memory fast,
  disks slow" wisdom years ago.

 Cheers,
 --
 Saso
  ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:27 PM, Gregg Wonderly wrote:
 Unfortunately, the government imagines that people are using their home 
 computers to compute hashes and try and decrypt stuff.  Look at what is 
 happening with GPUs these days.  People are hooking up 4 GPUs in their 
 computers and getting huge performance gains.  5-6 char password space 
 covered in a few days.  12 or so chars would take one machine a couple of 
 years if I recall.  So, if we had 20 people with that class of machine, we'd 
 be down to a few months.   I'm just suggesting that while the compute space 
 is still huge, it's not actually undoable, it just requires some thought into 
 how to approach the problem, and then some time to do the computations.
 
 Huge space, but still finite…

There are certain physical limits which one cannot exceed. For instance,
you cannot store 2^256 units of 32-byte quantities in Earth. Even if you
used proton spin (or some other quantum property) to store a bit, there
simply aren't enough protons in the entire visible universe to do it.
You will never ever be able to search a 256-bit memory space using a
simple exhaustive search. The reason why our security hashes are so long
(256-bits, 512-bits, more...) is because attackers *don't* do an
exhaustive search.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:30 PM, Gregg Wonderly wrote:
 This is exactly the issue for me.  It's vital to always have verify on.  If 
 you don't have the data to prove that every possible block combination 
 possible, hashes uniquely for the small bit space we are talking about, 
 then how in the world can you say that verify is not necessary?  That just 
 seems ridiculous to propose.

Do you need assurances that in the next 5 seconds a meteorite won't fall
to Earth and crush you? No. And yet, the Earth puts on thousands of tons
of weight each year from meteoric bombardment and people have been hit
and killed by them (not to speak of mass extinction events). Nobody has
ever demonstrated being able to produce a hash collision in any
suitably long hash (128-bits plus) using a random search. All hash
collisions have been found by attacking the weaknesses in the
mathematical definition of these functions (i.e. some part of the input
didn't get obfuscated well in the hash function machinery and spilled
over into the result, resulting in a slight, but usable non-randomness).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:36 PM, Justin Stringfellow wrote:
 
 
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?
  
 Perhaps I've missed something, but if there was *never* a collision, you'd 
 have stumbled across a rather impressive lossless compression algorithm. I'm 
 pretty sure there's some Big Mathematical Rules (Shannon?) that mean this 
 cannot be.

Do you realize how big your lookup dictionary would have to be?

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Gregg Wonderly
 
 But this is precisely the kind of observation that some people seem to
miss
 out on the importance of.  As Tomas suggested in his post, if this was
true,
 then we could have a huge compression ratio as well.  And even if there
was
 10% of the bit patterns that created non-unique hashes, you could use the
 fact that a block hashed to a known bit pattern that didn't have
collisions, to
 compress the other 90% of your data.

In fact, if you were to assume hash value X corresponded to data block Y,
you could use the hash function as a form of compression, but for
decompression you would need a lookup table.  So if you were going to
compress Y once, you would not gain anything.  But if you had a data stream
with a bunch of duplicates in it, you might hash Y many times, discovering X
is already in your lookup table, you only need to store another copy of X.

So in fact, dedup is a form of hash-based compression.  But we know it's not
a lossless compression, so the decision to verify or not to verify is a
matter of probability of collision.  Many of us have done the math on this
probability, and many of us are comfortable with neglecting the verify,
because we know the probability of collision is so small.  But the decision
to verify or not verify is very much dependent on your intended purpose (the
type of data you're storing, and what its significance is). 

More importantly, the decision to verify or not verify depends on who you
would need to justify your decision to.  Nobody will ever get in trouble for
enabling verify.  But if you have to convince your CEO that you made the
right choice by disabling verify ... You might just choose to enable verify
for the sake of not explaining your decision.


 I'm serious about this from a number of perspectives.  We worry about the
 time it would take to reverse SHA or RSA hashes to passwords, not even
 thinking that what if someone has been quietly computing all possible
hashes
 for the past 10-20 years into a database some where, with every 5-16
 character password, and now has an instantly searchable hash-to-password
 database.

That is called a rainbow table, and it's commonly used for cracking poorly
implemented password schemes.  Even with massively powerful computers and a
whole lot of storage, you only have enough time and space to compute a
rainbow table for commonly used (easily guessable) passwords.  To prevent
attackers from successfully using rainbow tables, even when users might
choose bad passwords, a security conscious developer will use salting and
stretching.  This increases the randomness and the time to compute the
password hashes, thwarting rainbow tables.


 If no one has computed the hashes for every single 4K and 8K block, then
 fine.  But, if that was done, and we had that data, we'd know for sure
which
 algorithm was going to work the best for the number of bits we are
 considering.

Let's imagine computing a 256-bit hash (32 bytes = 2^5 bytes) for every 1K
(8Kbit) block.  Storage for this table would require 2^5 * 2^8192 bytes =
2^8197 bytes = 3.5 * 10^2467 bytes.  Recall, a petabyte is 10^15 bytes.  So
... Storing the rainbow table for merely 1K possibilities ... I don't know
if it would fit in all the storage on planet Earth.  Maybe.  But having the
rainbow table for *precisely* 1K block size isn't useful unless you happen
to have precisely a 1K block you want to lookup...  If you happen to be
looking up a 1025 byte data hash, or a 4K data hash ... You're out of luck.
This table won't help you.

It's physically impossible to do, even on the supremely simplified problem
of 1K block size and 256-bit hash.
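
The same arithmetic, re-checked (my own back-of-the-envelope; it lands in
the same ballpark of roughly 3.5 * 10^2467 bytes):

    from math import log10

    entry_bytes_log2 = 5          # each 256-bit digest is 2^5 = 32 bytes
    entries_log2 = 1024 * 8       # 2^8192 possible 1K (8192-bit) blocks

    table_log10 = (entries_log2 + entry_bytes_log2) * log10(2)
    print(table_log10)            # ~2467.5, i.e. the table is about 3.5 * 10^2467 bytes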

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote:
 As I said several times before, to produce hash collisions. Or to calculate
 rainbow tables (as a previous user theorized it) you only need the
 following.
 
 You don't need to reproduce all possible blocks.
 1. SHA256 produces a 256 bit hash
 2. That means it produces a value on 256 bits, in other words a value
 between 0..2^256 - 1
 3. If you start counting from 0 to 2^256 and for each number calculate the
  SHA256 you will get at least one hash collision (if the hash algorithm is
  perfectly distributed)
 4. Counting from 0 to 2^256, is nothing else but reproducing all possible
 bit pattern on 32 bytes
 
 It's not about whether one computer is capable of producing the above
 hashes or not, or whether there are actually that many unique 32 byte bit
 patterns in the universe.
 A collision can happen.

It's actually not that simple, because in hash collision attacks you're
not always afforded the luxury of being able to define your input block.
More often than not, you want to modify a previously hashed block in
such a fashion that it carries your intended modifications while hashing
to the same original value. Say for instance you want to modify a
512-byte message (e.g. an SSL certificate) to point to your own CN. Your
rainbow table, even if you could store it somewhere (you couldn't, btw),
would do you little good here.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

Unfortunately, the government imagines that people are using their home
computers to compute hashes and try and decrypt stuff.  Look at what is
happening with GPUs these days.  People are hooking up 4 GPUs in their
computers and getting huge performance gains.  5-6 char password space
covered in a few days.  12 or so chars would take one machine a couple of
years if I recall.  So, if we had 20 people with that class of machine,
we'd be down to a few months.  I'm just suggesting that while the compute
space is still huge, it's not actually undoable, it just requires some
thought into how to approach the problem, and then some time to do the
computations.

Huge space, but still finite…

Dan Brown seems to think so in Digital Fortress but it just means he 
has no grasp on big numbers.

2^128 is a huge space, finite *but* beyond brute force *forever*.

Considering that we have nearly 10 billion people, and if you give each of
them 1 billion computers, each able to compute 1 billion checks per second,
how many years does it take before we get the solution?

Did you realize that that number is *twice* the number of years needed for
a *single* computer with the same specification to solve the same problem
for 64 bits?
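
Worked out explicitly with those assumptions (10 billion people, a billion
machines each, a billion checks per second per machine; my arithmetic, not
anything more authoritative):

    SECONDS_PER_YEAR = 3600 * 24 * 365

    combined_rate = 10e9 * 1e9 * 1e9              # 10^28 checks/second for all of humanity
    years_128 = 2**128 / combined_rate / SECONDS_PER_YEAR
    print(round(years_128))                       # ~1079 years to sweep a 2^128 space

    single_rate = 1e9                             # one machine doing 10^9 checks/second
    years_64 = 2**64 / single_rate / SECONDS_PER_YEAR
    print(round(years_64))                        # ~585 years for 2^64, about half as long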

There are two reasons for finding a new hash algorithm:
- a faster one on current hardware
- a better one with a larger output

But brute force is not what we are defending against: we're trying to 
defend against bugs in the hash algorithm.  In the case of md5 and 
related hash algorithms, a new attack method was discovered and it made 
many hash algorithms obsolete/broken.

When an algorithm is broken, the work factor needed for a successful 
attack depends in part on the hash; e.g., you may be left with 64 bits
of effective hash, and that would be brute-forcible.


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
Yes, but from the other angle, the number of unique 128K blocks that you can 
store on your ZFS pool, is actually finitely small, compared to the total 
space.  So the patterns you need to actually consider is not more than the 
physical limits of the universe.

Gregg Wonderly

On Jul 11, 2012, at 9:39 AM, Sašo Kiselkov wrote:

 On 07/11/2012 04:27 PM, Gregg Wonderly wrote:
 Unfortunately, the government imagines that people are using their home 
 computers to compute hashes and try and decrypt stuff.  Look at what is 
 happening with GPUs these days.  People are hooking up 4 GPUs in their 
 computers and getting huge performance gains.  5-6 char password space 
 covered in a few days.  12 or so chars would take one machine a couple of 
 years if I recall.  So, if we had 20 people with that class of machine, we'd 
 be down to a few months.   I'm just suggesting that while the compute space 
 is still huge, it's not actually undoable, it just requires some thought 
 into how to approach the problem, and then some time to do the computations.
 
 Huge space, but still finite…
 
 There are certain physical limits which one cannot exceed. For instance,
 you cannot store 2^256 units of 32-byte quantities in Earth. Even if you
 used proton spin (or some other quantum property) to store a bit, there
 simply aren't enough protons in the entire visible universe to do it.
 You will never ever be able to search a 256-bit memory space using a
 simple exhaustive search. The reason why our security hashes are so long
 (256-bits, 512-bits, more...) is because attackers *don't* do an
 exhaustive search.
 
 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Edward Ned Harvey
 From: Gregg Wonderly [mailto:gr...@wonderly.org]
 Sent: Wednesday, July 11, 2012 10:28 AM
 
 Unfortunately, the government imagines that people are using their home
 computers to compute hashes and try and decrypt stuff.  Look at what is
 happening with GPUs these days.  

heheheh.  I guess the NSA didn't think of that.;-)
(That's sarcasm, in case anyone didn't get it.)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


Do you need assurances that in the next 5 seconds a meteorite won't fall
to Earth and crush you? No. And yet, the Earth puts on thousands of tons
of weight each year from meteoric bombardment and people have been hit
and killed by them (not to speak of mass extinction events). Nobody has
ever demonstrated of being able to produce a hash collision in any
suitably long hash (128-bits plus) using a random search. All hash
collisions have been found by attacking the weaknesses in the
mathematical definition of these functions (i.e. some part of the input
didn't get obfuscated well in the hash function machinery and spilled
over into the result, resulting in a slight, but usable non-randomness).

The reason why we don't protect against such an event is that it would
be extremely expensive, with a very small chance of it being needed.

verify doesn't cost much, so even if the risk is as infinitesimal as a 
direct meteorite hit, it may still be cost effective.

(Just like we'd be better off preparing for the climate changing (rather 
cheap) rather than trying to keep the climate from changing (impossible 
and still extremely expensive).)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Ferenc-Levente Juhos
You don't have to store all hash values:
a. Just memorize the first one SHA256(0)
b. start counting
c. bang: by the time you get to 2^256 you get at least a collision.

(do this using BOINC, you don't have to wait for the last hash to be
calculated, I'm pretty sure a collision will occur sooner)

1. SHA256 produces a 256 bit hash
2. That means it produces a value on 256 bits, in other words a value
between 0..2^256 - 1
3. If you start counting from 0 to 2^256 and for each number calculate the
SHA256 you will get at least one hash collision (if the hash algorithm is
perfectly distributed)
4. Counting from 0 to 2^256, is nothing else but reproducing all possible
bit pattern on 32 bytes


And nobody is trying to prove that SHA256 is bad, and I don't want to do
compression either.
But it's unrealistic to think that a SHA256 collision won't occur just
because it would be impossible to find it via brute-force.

On Wed, Jul 11, 2012 at 4:49 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote:
  As I said several times before, to produce hash collisions. Or to
 calculate
  rainbow tables (as a previous user theorized it) you only need the
  following.
 
  You don't need to reproduce all possible blocks.
  1. SHA256 produces a 256 bit hash
  2. That means it produces a value on 256 bits, in other words a value
  between 0..2^256 - 1
  3. If you start counting from 0 to 2^256 and for each number calculate
 the
  SHA256 you will get at least one hash collision (if the hash algorithm is
  perfectly distributed)
  4. Counting from 0 to 2^256, is nothing else but reproducing all possible
  bit pattern on 32 bytes
 
  It's not about whether one computer is capable of producing the above
  hashes or not, or whether there are actually that many unique 32 byte bit
  patterns in the universe.
  A collision can happen.

 It's actually not that simple, because in hash collision attacks you're
 not always afforded the luxury of being able to define your input block.
 More often than not, you want to modify a previously hashed block in
 such a fashion that it carries your intended modifications while hashing
 to the same original value. Say for instance you want to modify a
 512-byte message (e.g. an SSL certificate) to point to your own CN. Here
 your rainbow table, even if you could store it somewhere (you couldn't,
 btw), would do you little good here.

 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
So, if I had a block collision on my ZFS pool that used dedup, and it had my 
bank balance of $3,212.20 on it, and you tried to write your bank balance of 
$3,292,218.84 and got the same hash, no verify, and thus you got my 
block/balance and now your bank balance was reduced by 3 orders of magnitude, 
would you be okay with that?  What assurances would you be content with using 
my ZFS pool?

Gregg Wonderly

On Jul 11, 2012, at 9:43 AM, Sašo Kiselkov wrote:

 On 07/11/2012 04:30 PM, Gregg Wonderly wrote:
 This is exactly the issue for me.  It's vital to always have verify on.  If 
 you don't have the data to prove that every possible block combination 
 possible, hashes uniquely for the small bit space we are talking about, 
 then how in the world can you say that verify is not necessary?  That just 
 seems ridiculous to propose.
 
 Do you need assurances that in the next 5 seconds a meteorite won't fall
 to Earth and crush you? No. And yet, the Earth puts on thousands of tons
 of weight each year from meteoric bombardment and people have been hit
 and killed by them (not to speak of mass extinction events). Nobody has
 ever demonstrated of being able to produce a hash collision in any
 suitably long hash (128-bits plus) using a random search. All hash
 collisions have been found by attacking the weaknesses in the
 mathematical definition of these functions (i.e. some part of the input
 didn't get obfuscated well in the hash function machinery and spilled
 over into the result, resulting in a slight, but usable non-randomness).
 
 Cheers,
 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:54 PM, Ferenc-Levente Juhos wrote:
 You don't have to store all hash values:
 a. Just memorize the first one SHA256(0)
 b. start counting
 c. bang: by the time you get to 2^256 you get at least a collision.

Just one question: how long do you expect this is going to take on average?
Come on, do the math!
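
To actually do that math, a back-of-the-envelope sketch in Python; the rate
of 10^12 hashes per second is an assumed, deliberately generous figure:

    SECONDS_PER_YEAR = 3.15e7
    rate = 1e12  # assumed hashes per second, deliberately generous

    full_enumeration = 2 ** 256 / rate / SECONDS_PER_YEAR  # counting 0..2^256
    birthday_search = 2 ** 128 / rate / SECONDS_PER_YEAR   # ~50% collision odds

    print("enumerating 2^256 inputs: ~%.1e years" % full_enumeration)
    print("birthday-bound search:    ~%.1e years" % birthday_search)

Both figures exceed the age of the universe (~1.4e10 years) by dozens of
orders of magnitude.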

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:56 PM, Gregg Wonderly wrote:
 So, if I had a block collision on my ZFS pool that used dedup, and it had my 
 bank balance of $3,212.20 on it, and you tried to write your bank balance of 
 $3,292,218.84 and got the same hash, no verify, and thus you got my 
 block/balance and now your bank balance was reduced by 3 orders of magnitude, 
 would you be okay with that?  What assurances would you be content with using 
 my ZFS pool?

I'd feel entirely safe. There, I said it.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
I'm just suggesting that the time frame of when 256-bits or 512-bits is less 
safe, is closing faster than one might actually think, because social elements 
of the internet allow a lot more effort to be focused on a single problem 
than one might consider.  

Gregg Wonderly

On Jul 11, 2012, at 9:50 AM, Edward Ned Harvey wrote:

 From: Gregg Wonderly [mailto:gr...@wonderly.org]
 Sent: Wednesday, July 11, 2012 10:28 AM
 
 Unfortunately, the government imagines that people are using their home
 computers to compute hashes and try and decrypt stuff.  Look at what is
 happening with GPUs these days.  
 
 heheheh.  I guess the NSA didn't think of that.;-)
 (That's sarcasm, in case anyone didn't get it.)
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote:

 I'm not convinced waiting makes much sense. The SHA-3 standardization
 process' goals are different from ours. SHA-3 can choose to go with
 something that's slower, but has a higher security margin. I think that
 absolute super-tight security isn't all that necessary for ZFS, since
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.

Fair enough, though I think eventually the SHA-3 winner will be
incorporated into hardware (or at least certain instructions used in the
algorithm will). I think waiting a few more weeks/months shouldn't be a
big deal, as the winner should be announced Real Soon Now, and then a more
informed decision can probably be made.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:10 PM, David Magda wrote:
 On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote:
 
 I'm not convinced waiting makes much sense. The SHA-3 standardization
 process' goals are different from ours. SHA-3 can choose to go with
 something that's slower, but has a higher security margin. I think that
 absolute super-tight security isn't all that necessary for ZFS, since
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 Fair enough, though I think eventually the SHA-3 winner will be
 incorporated into hardware (or at least certain instructions used in the
 algorithm will). I think waiting a few more weeks/months shouldn't be a
 big deal, as the winner should be announced Real Soon Now, and then a more
 informed decision can probably be made.

The AES process winner had been announced in October 2000. Considering
AES-NI was proposed in March 2008 and first silicon for it appeared
around January 2010, I wouldn't hold my breath hoping for hardware
SHA-3-specific acceleration getting a widespread foothold for at least
another 5-10 years (around 2-3 technology generations).

That being said, a lot can be achieved using SIMD instructions, but that
doesn't depend on the SHA-3 process in any way.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 10:23, casper@oracle.com wrote:

 I think that I/O isn't getting as fast as CPU is; memory capacity and
 bandwidth and CPUs are getting faster.  I/O, not so much.
 (Apart from the one single step from hard disk to SSD; but note that
 I/O is limited to standard interfaces and as such it is likely to be
 held down by requiring a new standard.)

The only other thing on the (theoretical) horizon would be things based on
the memristor.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:


The reason why I don't think this can be used to implement a practical
attack is that in order to generate a collision, you first have to know
the disk block that you want to create a collision on (or at least the
checksum), i.e. the original block is already in the pool. At that
point, you could write a colliding block which would get de-dup'd, but
that doesn't mean you've corrupted the original data, only that you
referenced it. So, in a sense, you haven't corrupted the original block,
only your own collision block (since that's the copy that doesn't get written).


This is not correct.  If you know the well-known block to be written, 
then you can arrange to write your collision block prior to when the 
well-known block is written.  Therefore, it is imperative that the 
hash algorithm make it clearly impractical to take a well-known block 
and compute a collision block.


For example, the well-known block might be part of a Windows 
anti-virus package, or a Windows firewall configuration, and 
corrupting it might leave a Windows VM open to malware attack.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 9:48 AM,  casper@oracle.com wrote:
Huge space, but still finite…

 Dan Brown seems to think so in Digital Fortress but it just means he
 has no grasp on big numbers.

I couldn't get past that.  I had to put the book down.  I'm guessing
it was as awful as it threatened to be.

IMO, FWIW, yes, do add SHA-512 (truncated to 256 bits, of course).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

On Wed, Jul 11, 2012 at 9:48 AM,  casper@oracle.com wrote:
Huge space, but still finite…

 Dan Brown seems to think so in Digital Fortress but it just means he
 has no grasp on big numbers.

I couldn't get past that.  I had to put the book down.  I'm guessing
it was as awful as it threatened to be.


It is *fiction*.  So just read it as if it is magical, like Harry Potter.

It's just like "well researched" in fiction means exactly the same as
"well researched" in journalism: the facts aren't actually facts but could 
pass as facts to those who haven't had a proper science education
(which unfortunately includes a large part of the population and 99.5% of
all politicians)

Casper




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:33 PM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

 The reason why I don't think this can be used to implement a practical
 attack is that in order to generate a collision, you first have to know
 the disk block that you want to create a collision on (or at least the
 checksum), i.e. the original block is already in the pool. At that
 point, you could write a colliding block which would get de-dup'd, but
 that doesn't mean you've corrupted the original data, only that you
 referenced it. So, in a sense, you haven't corrupted the original block,
 only your own collision block (since that's the copy doesn't get
 written).
 
 This is not correct.  If you know the well-known block to be written,
 then you can arrange to write your collision block prior to when the
 well-known block is written.  Therefore, it is imperative that the hash
 algorithm make it clearly impractical to take a well-known block and
 compute a collision block.
 
 For example, the well-known block might be part of a Windows anti-virus
 package, or a Windows firewall configuration, and corrupting it might
 leave a Windows VM open to malware attack.

True, but that may not be enough to produce a practical collision for
the reason that while you know which bytes you want to attack, these
might not line up with ZFS disk blocks (especially the case with Windows
VMs which are stored in large opaque zvols) - such an attack would
require physical access to the machine (at which point you can simply
manipulate the blocks directly).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
You're entirely sure that there could never be two different blocks that can 
hash to the same value and have different content?

Wow, can you just send me the cash now and we'll call it even?

Gregg

On Jul 11, 2012, at 9:59 AM, Sašo Kiselkov wrote:

 On 07/11/2012 04:56 PM, Gregg Wonderly wrote:
 So, if I had a block collision on my ZFS pool that used dedup, and it had my 
 bank balance of $3,212.20 on it, and you tried to write your bank balance of 
 $3,292,218.84 and got the same hash, no verify, and thus you got my 
 block/balance and now your bank balance was reduced by 3 orders of 
 magnitude, would you be okay with that?  What assurances would you be 
 content with using my ZFS pool?
 
 I'd feel entirely safe. There, I said it.
 
 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
 You're entirely sure that there could never be two different blocks that can 
 hash to the same value and have different content?
 
 Wow, can you just send me the cash now and we'll call it even?

You're the one making the positive claim and I'm calling bullshit. So
the onus is on you to demonstrate the collision (and that you arrived at
it via your brute force method as described). Until then, my money stays
safely on my bank account. Put up or shut up, as the old saying goes.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Gregg Wonderly
What I'm saying is that I am getting conflicting information from your 
rebuttals here.

I (and others) say there will be collisions that will cause data loss if verify 
is off.
You say it would be so rare as to be impossible from your perspective.
Tomas says, well then let's just use the hash value for a 4096X compression.
You fluff around his argument calling him names.
I say, well then compute all the possible hashes for all possible bit patterns 
and demonstrate no dupes.
You say it's not possible to do that.
I illustrate a way that loss of data could cost you money.
You say it's impossible for there to be a chance of me constructing a block 
that has the same hash but different content.
Several people have illustrated that 128K to 32bits is a huge and lossy ratio 
of compression, yet you still say it's viable to leave verify off.
I say, in fact that the total number of unique patterns that can exist on any 
pool is small, compared to the total, illustrating that I understand how the 
key space for the algorithm is small when looking at a ZFS pool, and thus could 
have a non-collision opportunity.

So I can see what perspective you are drawing your confidence from, but I, and 
others, are not confident that the risk has zero probability.

I'm pushing you to find a way to demonstrate that there is zero risk because if 
you do that, then you've, in fact created the ultimate compression factor (but 
enlarged the keys that could collide because the pool is now virtually larger), 
to date for random bit patterns, and you've also demonstrated that the 
particular algorithm is very good for dedup. 

That would indicate to me, that you can then take that algorithm, and run it 
inside of ZFS dedup to automatically manage when verify is necessary by 
detecting when a collision occurs.

I appreciate the push back.  I'm trying to drive thinking about this into the 
direction of what is known and finite, away from what is infinitely complex and 
thus impossible to explore.

Maybe all the work has already been done…

Gregg

On Jul 11, 2012, at 11:02 AM, Sašo Kiselkov wrote:

 On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
 You're entirely sure that there could never be two different blocks that can 
 hash to the same value and have different content?
 
 Wow, can you just send me the cash now and we'll call it even?
 
 You're the one making the positive claim and I'm calling bullshit. So
 the onus is on you to demonstrate the collision (and that you arrived at
 it via your brute force method as described). Until then, my money stays
 safely on my bank account. Put up or shut up, as the old saying goes.
 
 --
 Saso

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread David Magda
On Wed, July 11, 2012 11:58, Gregg Wonderly wrote:
 You're entirely sure that there could never be two different blocks that
 can hash to the same value and have different content?
[...]

The odds of being hit by lighting (at least in the US) are about 1 in
700,000. I don't worry about that happening, so personally I don't see the
point in worry something a few more orders of magnitude less likely to
happen.

When I get hit by lightning (or win the lottery), I'll start worrying
about hash collisions. Until then: meh. :)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:


For example, the well-known block might be part of a Windows anti-virus
package, or a Windows firewall configuration, and corrupting it might
leave a Windows VM open to malware attack.


True, but that may not be enough to produce a practical collision for
the reason that while you know which bytes you want to attack, these
might not line up with ZFS disk blocks (especially the case with Windows
VMs which are stored in large opaque zvols) - such an attack would
require physical access to the machine (at which point you can simply
manipulate the blocks directly).


I think that well-known blocks are much easier to predict than you say 
because operating systems, VMs, and application software behave in 
predictable patterns.  However, deriving another useful block which 
hashes the same should be extremely difficult and any block hashing 
algorithm needs to assure that.  Having an excellent random 
distribution property is not sufficient if it is relatively easy to 
compute some other block producing the same hash.  It may be useful to 
compromise a known block even if the compromised result is complete 
garbage.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
Thanks Sašo!
Comments below...

On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote:

 Hi guys,
 
 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256.

No need to supplant, there are 8 bits for enumerating hash algorithms, so 
adding another is simply a matter of coding. With the new feature flags, it is
almost trivial to add new algorithms without causing major compatibility 
headaches. Darren points out that Oracle is considering doing the same,
though I do not expect Oracle to pick up the feature flags.

 On modern
 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
 slower than many of the SHA-3 candidates, so I went out and did some
 testing (details attached) on a possible new hash algorithm that might
 improve on this situation.
 
 However, before I start out on a pointless endeavor, I wanted to probe
 the field of ZFS users, especially those using dedup, on whether their
 workloads would benefit from a faster hash algorithm (and hence, lower
 CPU utilization). Developments of late have suggested to me three
 possible candidates:
 
 * SHA-512: simplest to implement (since the code is already in the
   kernel) and provides a modest performance boost of around 60%.
 
 * Skein-512: overall fastest of the SHA-3 finalists and much faster
   than SHA-512 (around 120-150% faster than the current sha256).
 
 * Edon-R-512: probably the fastest general purpose hash algorithm I've
   ever seen (upward of 300% speedup over sha256), but might have
   potential security problems (though I don't think this is of any
   relevance to ZFS, as it doesn't use the hash for any kind of security
   purposes, but only for data integrity & dedup).
 
 My testing procedure: nothing sophisticated, I took the implementation
 of sha256 from the Illumos kernel and simply ran it on a dedicated
 psrset (where possible with a whole CPU dedicated, even if only to a
 single thread) - I tested both the generic C implementation and the
 Intel assembly implementation. The Skein and Edon-R implementations are
 in C optimized for 64-bit architectures from the respective authors (the
 most up to date versions I could find). All code has been compiled using
 GCC 3.4.3 from the repos (the same that can be used for building
 Illumos). Sadly, I don't have access to Sun Studio.

The last studio release suitable for building OpenSolaris is available in the 
repo.
See the instructions at 
http://wiki.illumos.org/display/illumos/How+To+Build+illumos

I'd be curious about whether you see much difference based on studio 12.1,
gcc 3.4.3 and gcc 4.4 (or even 4.7)
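
For a rough, compiler-independent cross-check of the SHA-256 vs SHA-512
ordering, something like the sketch below can be run on any box. It times
the platform's OpenSSL code through Python's hashlib, so the absolute
numbers won't match the in-kernel results, but the relative gap should show
up; the buffer sizes here are arbitrary choices:

    import hashlib, os, time

    def throughput(algo, total=256 * 1024 * 1024, blocksize=128 * 1024):
        buf = os.urandom(blocksize)
        h = hashlib.new(algo)
        start = time.perf_counter()
        for _ in range(total // blocksize):
            h.update(buf)
        h.digest()
        return total / (time.perf_counter() - start) / 1e6  # MB/s

    for algo in ("sha256", "sha512"):
        print("%s: %.0f MB/s" % (algo, throughput(algo)))
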
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 06:23 PM, Gregg Wonderly wrote:
 What I'm saying is that I am getting conflicting information from your 
 rebuttals here.

Well, let's address that then:

 I (and others) say there will be collisions that will cause data loss if 
 verify is off.

Saying that "there will be" without any supporting evidence to back it
up amounts to a prophecy.

 You say it would be so rare as to be impossible from your perspective.

Correct.

 Tomas says, well then let's just use the hash value for a 4096X compression.
 You fluff around his argument calling him names.

Tomas' argument was, as I understood later, an attempt at sarcasm.
Nevertheless, I later explained exactly why I consider the
hash-compression claim total and utter bunk:

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)
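
Point 2 in concrete numbers, assuming 128 KiB records and a 256-bit hash (a
quick counting sketch, nothing more):

    import math

    block_bits = 128 * 1024 * 8                 # bits in one 128 KiB block
    log10_blocks = block_bits * math.log10(2)   # log10 of distinct possible blocks
    log10_hashes = 256 * math.log10(2)          # log10 of distinct 256-bit hashes

    # On average this many distinct blocks map onto every single hash value,
    # so the mapping cannot be inverted ("decompressed").
    print("~10^%.0f blocks per hash value" % (log10_blocks - log10_hashes))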

 I say, well then compute all the possible hashes for all possible bit 
 patterns and demonstrate no dupes.

This assumes it's possible to do so. Ferenc made a similar claim and I
responded with this question: how long do you expect this is going to take
on average? Come on, do the math! I pose the same to you. Find the
answer and you'll understand exactly why what you're proposing is
impossible.

 You say it's not possible to do that.

Please go on and compute a reduced size of the problem for, say, 2^64
32-byte values (still a laughably small space for the problem, but I'm
feeling generous). Here's the amount of storage you'll need:

2^64 * 32 bytes = 2^69 bytes, i.e. about 512 EiB (roughly 590 exabytes)

And that's for a problem that I've reduced for you by a factor of 2^192
(nearly 58 decimal orders of magnitude). You see, only when you do the math
do you realize how off base you are in claiming that pre-computation of
hash rainbow tables for generic bit patterns is doable.
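
The arithmetic is easy to check (decimal exabytes assumed):

    n_values = 2 ** 64
    bytes_total = n_values * 32                    # 32 bytes per stored value
    print("%.0f exabytes" % (bytes_total / 1e18))  # ~590 EB, i.e. 512 EiB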

 I illustrate a way that loss of data could cost you money.

That's merely an emotional argument where you are trying to attack me by
trying to invoke an emotional response from when my ass is on the
line. Sorry, that doesn't invalidate the original argument that you
can't do rainbow table pre-computation for long bit patterns.

 You say it's impossible for there to be a chance of me constructing a block 
 that has the same hash but different content.

To make sure we're not using ambiguous rhetoric here, allow me to
summarize my position: you cannot produce, in practical terms, a hash
collision on a 256-bit secure hash algorithm using a brute-force algorithm.

 Several people have illustrated that 128K to 32bits is a huge and lossy ratio 
 of compression, yet you still say it's viable to leave verify off.

Except that we're not talking 128K to 32b, but 128K to 256b. Also, only
once you appreciate the mathematics behind the size of the 256-bit
pattern space can you understand why leaving verify off is okay.

 I say, in fact that the total number of unique patterns that can exist on any 
 pool is small, compared to the total, illustrating that I understand how the 
 key space for the algorithm is small when looking at a ZFS pool, and thus 
 could have a non-collision opportunity.

This is so profoundly wrong that it leads me to suspect you never took
courses on cryptography and/or information theory. The size of your
storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
if your pool were the size of a single block, we're talking here about
the *mathematical* possibility of hitting on a random block that hashes
to the same value. Given a stream of random data blocks (thus simulating
an exhaustive brute-force search) and a secure pseudo-random hash
function (which has a roughly equal chance of producing any output value
for a given input block), you've got only a 10^-77 chance of getting a
hash collision. If you don't understand how this works, read a book on
digital coding theory.

 So I can see what perspective you are drawing your confidence from, but I, 
 and others, are not confident that the risk has zero probability.

I never said the risk is zero. The risk is non-zero, but is so close to
zero, that you may safely ignore it (since we take much greater risks on
a daily basis without so much as a blink of an eye).

 I'm pushing you to find a way to demonstrate that there is zero risk because 
 if you do that, then you've, in fact created the ultimate compression factor 
 (but enlarged the keys that could collide because the pool is now virtually 
 larger), to date for random bit patterns, and you've also demonstrated that 
 the particular algorithm is very good for dedup. 
 That would indicate to me, that you can then take that algorithm, and run it 
 inside of ZFS dedup to automatically manage when verify is necessary by 
 detecting when a collision occurs.

Do 

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Richard Elling wrote:


The last studio release suitable for building OpenSolaris is available in the 
repo.
See the instructions 
at http://wiki.illumos.org/display/illumos/How+To+Build+illumos


Not correct as far as I can tell.  You should re-read the page you 
referenced.  Oracle rescinded (or lost) the special Studio releases 
needed to build the OpenSolaris kernel.  The only way I can see to 
obtain these releases is illegally.


However, Studio 12.3 (free download) produces user-space executables 
which run fine under Illumos.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 10:11 AM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Richard Elling wrote:
 The last studio release suitable for building OpenSolaris is available in 
 the repo.
 See the instructions at 
 http://wiki.illumos.org/display/illumos/How+To+Build+illumos
 
 Not correct as far as I can tell.  You should re-read the page you 
 referenced.  Oracle rescinded (or lost) the special Studio releases needed to 
 build the OpenSolaris kernel.  The only way I can see to obtain these 
 releases is illegally.

In the US, the term illegal is most often used for criminal law. Contracts 
between parties are covered
under civil law. It is the responsibility of the parties to agree to and 
enforce civil contracts. This includes
you, dear reader.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 10:23 AM, Sašo Kiselkov wrote:

 Hi Richard,
 
 On 07/11/2012 06:58 PM, Richard Elling wrote:
 Thanks Sašo!
 Comments below...
 
 On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote:
 
 Hi guys,
 
 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256.
 
 No need to supplant, there are 8 bits for enumerating hash algorithms, so 
 adding another is simply a matter of coding. With the new feature flags, it 
 is
 almost trivial to add new algorithms without causing major compatibility 
 headaches. Darren points out that Oracle is considering doing the same,
 though I do not expect Oracle to pick up the feature flags.
 
 I meant in the functional sense, not in the technical - of course, my
 changes would be implemented as a feature flags add-on.

Great! Let's do it! 
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Hung-Sheng Tsao (LaoTsao) Ph.D


Sent from my iPad

On Jul 11, 2012, at 13:11, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

 On Wed, 11 Jul 2012, Richard Elling wrote:
 The last studio release suitable for building OpenSolaris is available in 
 the repo.
 See the instructions at 
 http://wiki.illumos.org/display/illumos/How+To+Build+illumos
 
 Not correct as far as I can tell.  You should re-read the page you 
  referenced.  Oracle rescinded (or lost) the special Studio releases needed to 
 build the OpenSolaris kernel.  

hi
you can still download 12 12.1 12.2, AFAIK through OTN


 The only way I can see to obtain these releases is illegally.
 
 However, Studio 12.3 (free download) produces user-space executables which 
 run fine under Illumos.
 
 Bob
 -- 
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 3:45 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 It's also possible to set dedup=verify with checksum=sha256,
 however, that makes little sense (as the chances of getting a random
 hash collision are essentially nil).

IMO dedup should always verify.
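
For readers following along, a minimal sketch of what verify buys you (toy
Python, not ZFS code; the in-memory dict stands in for the on-disk DDT): on
a checksum match the block contents are compared before the write becomes a
reference, so even a genuine collision cannot silently substitute someone
else's data.

    import hashlib

    ddt = {}  # checksum -> block contents; toy stand-in for the dedup table

    def dedup_write(block, verify=True):
        key = hashlib.sha256(block).digest()
        if key not in ddt:
            ddt[key] = block
            return "stored new block"
        if verify and ddt[key] != block:
            # A real collision: without verify this write would silently
            # become a reference to different data.
            return "collision detected -> write block out as unique data"
        return "checksum match -> reference the existing block"

    print(dedup_write(b"A" * 131072))
    print(dedup_write(b"A" * 131072))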

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bob Friesenhahn

On Wed, 11 Jul 2012, Hung-Sheng Tsao (LaoTsao) Ph.D wrote:


Not correct as far as I can tell.  You should re-read the page you referenced.  
Oracle rescinded (or lost) the special Studio releases needed to build the 
OpenSolaris kernel.


you can still download 12 12.1 12.2, AFAIK through OTN


That is true (and I have done so).  Unfortunately the versions offered 
are not the correct ones to build the OpenSolaris kernel.  Special 
patched versions with particular date stamps are required.  The only 
way that I see to obtain these files any more is via distribution
channels primarily designed to perform copyright violations.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Hung-Sheng Tsao Ph.D.


On 7/11/2012 3:16 PM, Bob Friesenhahn wrote:

On Wed, 11 Jul 2012, Hung-Sheng Tsao (LaoTsao) Ph.D wrote:


Not correct as far as I can tell.  You should re-read the page you 
referenced.  Oracle rescinded (or lost) the special Studio releases 
needed to build the OpenSolaris kernel.


you can still download 12 12.1 12.2, AFAIK through OTN


That is true (and I have done so).  Unfortunately the versions offered 
are not the correct ones to build the OpenSolaris kernel. Special 
patched versions with particular date stamps are required.  The only 
way that I see to obtain these files any more is via distribution
channels primarily designed to perform copyright violations.
one can still download the patches through MOS, but one needs to pay for 
support for the development tools :-(


Bob


--



attachment: laotsao.vcf
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bill Sommerfeld
On 07/11/12 02:10, Sašo Kiselkov wrote:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77.

I think you're correct that most users don't need to worry about this --
sha-256 dedup without verification is not going to cause trouble for them.

But your analysis is off.  You're citing the chance that two blocks picked at
random will have the same hash.  But that's not what dedup does; it compares
the hash of a new block to a possibly-large population of other hashes, and
that gets you into the realm of birthday problem or birthday paradox.

See http://en.wikipedia.org/wiki/Birthday_problem for formulas.

So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
collision in really large collections of data - still not likely enough to
worry about.
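
Plugging numbers into the birthday approximation p ~ 1 - exp(-n^2 / 2d) with
d = 2^256: the exact exponential underflows ordinary floats, so the sketch
below uses the first-order term n^2 / 2d, which is also an upper bound; the
two pool sizes are arbitrary examples.

    import math

    d = 2 ** 256
    for n in (2 ** 35, 2 ** 48):     # ~4 PiB and ~32 EiB of unique 128K blocks
        p = n * n / (2 * d)          # first-order birthday bound
        print("n = 2^%d: p <= ~10^%d" % (round(math.log2(n)), round(math.log10(p))))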

Of course, that assumption goes out the window if you're concerned that an
adversary may develop practical ways to find collisions in sha-256 within the
deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
its 160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
http://en.wikipedia.org/wiki/SHA-1#SHA-1).

on a somewhat less serious note, perhaps zfs dedup should contain chinese
lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
which asks the sysadmin to report a detected sha-256 collision to
eprint.iacr.org or the like...



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:06 PM, Bill Sommerfeld wrote:
 On 07/11/12 02:10, Sašo Kiselkov wrote:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77.
 
 I think you're correct that most users don't need to worry about this --
 sha-256 dedup without verification is not going to cause trouble for them.
 
 But your analysis is off.  You're citing the chance that two blocks picked at
 random will have the same hash.  But that's not what dedup does; it compares
 the hash of a new block to a possibly-large population of other hashes, and
 that gets you into the realm of birthday problem or birthday paradox.
 
 See http://en.wikipedia.org/wiki/Birthday_problem for formulas.
 
 So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
 collision in really large collections of data - still not likely enough to
 worry about.

Yeah, I know, I did this as a quick first-order approximation. However,
the provided range is still very far above the chances of getting a
random bit-rot error that even Fletcher won't catch.

 Of course, that assumption goes out the window if you're concerned that an
 adversary may develop practical ways to find collisions in sha-256 within the
 deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
 and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
 2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
 http://en.wikipedia.org/wiki/SHA-1#SHA-1).

Of course, this is theoretically possible, however, I do not expect such
an attack to be practical within any reasonable time frame of the
deployment. In any case, should a realistic need to solve this arise, we
can always simply switch hashes (I'm also planning to implement
Skein-512/256) and do a recv/send to rewrite everything on disk. PITA?
Yes. Serious problem? Don't think so.

 on a somewhat less serious note, perhaps zfs dedup should contain chinese
 lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
 which asks the sysadmin to report a detected sha-256 collision to
 eprint.iacr.org or the like...

How about we ask them to report to me instead, like so:

1) Detect collision
2) Report to me
3) ???
4) Profit!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 1:06 PM, Bill Sommerfeld wrote:
 on a somewhat less serious note, perhaps zfs dedup should contain chinese
 lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
 which asks the sysadmin to report a detected sha-256 collision to
 eprint.iacr.org or the like...


Agree. George was in that section of the code a few months ago (zio.c) and I 
asked
him to add a kstat, at least. I'll follow up with him next week, or get it done 
some other
way.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


  1   2   >