Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-14 Thread Eric D. Mudama
I hold my breath for weeks at a time, just incase something like this
happens!  I thought I was the only one!

On 4/12/05, Theodore Ts'o <[EMAIL PROTECTED]> wrote:
> So past a certain point, there is a probability that all of molecules
> of oxygen in the room will suddenly migrate outdoors, and you could
> suffocate.  Is it rational to spend time worrying about that
> possibility?  I'll leave that for you to decide.
> 
> - Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-14 Thread Eric Rannaud
On Thu, 2005-04-14 at 01:30 -0700, Andy Isaacson wrote:
> In particular, your defense here is specious.  I agree that second
> preimage is an unmanagably large problem for SHA1 for the forseeable
> future (say, 8 years out), but collision results almost always result in
> partially-controlled attack text.  So I can choose my nefarious content,
> and encode the collision entropy in non-operative text (a C comment, for
> example).

Everything you said is fair enough, and I do agree with this argument in
the context of authentication: that is if you expect an attack (even if
a chunk of non-operative bytes won't go unnoticed for long in a the
kernel tree, but that's not the point).

But the original post of Pedro was, I think, about collisions occurring
'randomly' between two genuine modifications of a source file. In
particular, the paper that was linked to by Pedro is concerned with this
kind of unexpected collisions, i.e. about integrity in normal operation
and not about security (btw, this paper seems kind of bogus to me
because it never specifies the hash function in use).

I took the example of this attack against SHA-1 only to illustrate that
*even* if you try to find a collision, you just can't find one (at least
these days).
>From which I think it is fair to say that such a collision won't happen
between two different versions of the same file, during the normal
process of revision.

I mean, if in the normal revision process of the kernel, a SHA-1
collision was to be found, by chance, who gets to publish the paper at
the next Crypto conference?

However, if we consider the case of an attack, well, that's different.
And you're right.

/er.
-- 
http://www.eleves.ens.fr/home/rannaud/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-14 Thread Andy Isaacson
On Tue, Apr 12, 2005 at 06:35:49PM +0200, Eric Rannaud wrote:
> Simply put, the best known attack of SHA-1 takes 2^69 hash operations.
> ( http://www.schneier.com/blog/archives/2005/02/sha1_broken.html )
> The attack is still only an unpublished paper and has not yet been
> implemented. An attack is: you try as hard as you can to find a collision
> between two arbitrary messages (i.e. two arbitrary --and nonsensical--
> source files).

Well, don't be quite so confident.  Yes, the attacks generally require a
significant uncontrollable chunk of data, but it's easy enough to encode
that in (for example) a comment.

And remember, attacks always get better, they never get worse.

Remaining cautious about the strength of your hash functions is a good
idea.

Especially because *nobody* has a real theory of operation behind online
hash functions.  The stories I've heard imply that even NSA doesn't
really "do" hash functions; they prefer HMAC-style keyed verifiers
(though of course, we on the outside can never be sure what's happening
in the puzzle palace.)

Every (practical) hash function in existence today is of the form "Well,
I mixed stuff up real good, and my friends and I tried and couldn't
break it".  And 160 bits still looks *fairly* secure, but so did 64-bit
symmetric key back in '93.

> In the context of git, a better estimation would be the number of hash
> operations needed to find a message that has the same hash than a given
> fixed message (e.g. mm/memory.c). This is more like 2^100 hash
> operations. And if a collision is found, this is very likely using a
> message that *doesn't* look like a C source file...

In particular, your defense here is specious.  I agree that second
preimage is an unmanagably large problem for SHA1 for the forseeable
future (say, 8 years out), but collision results almost always result in
partially-controlled attack text.  So I can choose my nefarious content,
and encode the collision entropy in non-operative text (a C comment, for
example).

When your tool breaks, replace it with a new one.  Don't say "well, the
jagged bits haven't hurt me *yet*."

-andy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-14 Thread Andy Isaacson
On Tue, Apr 12, 2005 at 06:35:49PM +0200, Eric Rannaud wrote:
 Simply put, the best known attack of SHA-1 takes 2^69 hash operations.
 ( http://www.schneier.com/blog/archives/2005/02/sha1_broken.html )
 The attack is still only an unpublished paper and has not yet been
 implemented. An attack is: you try as hard as you can to find a collision
 between two arbitrary messages (i.e. two arbitrary --and nonsensical--
 source files).

Well, don't be quite so confident.  Yes, the attacks generally require a
significant uncontrollable chunk of data, but it's easy enough to encode
that in (for example) a comment.

And remember, attacks always get better, they never get worse.

Remaining cautious about the strength of your hash functions is a good
idea.

Especially because *nobody* has a real theory of operation behind online
hash functions.  The stories I've heard imply that even NSA doesn't
really do hash functions; they prefer HMAC-style keyed verifiers
(though of course, we on the outside can never be sure what's happening
in the puzzle palace.)

Every (practical) hash function in existence today is of the form Well,
I mixed stuff up real good, and my friends and I tried and couldn't
break it.  And 160 bits still looks *fairly* secure, but so did 64-bit
symmetric key back in '93.

 In the context of git, a better estimation would be the number of hash
 operations needed to find a message that has the same hash than a given
 fixed message (e.g. mm/memory.c). This is more like 2^100 hash
 operations. And if a collision is found, this is very likely using a
 message that *doesn't* look like a C source file...

In particular, your defense here is specious.  I agree that second
preimage is an unmanagably large problem for SHA1 for the forseeable
future (say, 8 years out), but collision results almost always result in
partially-controlled attack text.  So I can choose my nefarious content,
and encode the collision entropy in non-operative text (a C comment, for
example).

When your tool breaks, replace it with a new one.  Don't say well, the
jagged bits haven't hurt me *yet*.

-andy
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-14 Thread Eric Rannaud
On Thu, 2005-04-14 at 01:30 -0700, Andy Isaacson wrote:
 In particular, your defense here is specious.  I agree that second
 preimage is an unmanagably large problem for SHA1 for the forseeable
 future (say, 8 years out), but collision results almost always result in
 partially-controlled attack text.  So I can choose my nefarious content,
 and encode the collision entropy in non-operative text (a C comment, for
 example).

Everything you said is fair enough, and I do agree with this argument in
the context of authentication: that is if you expect an attack (even if
a chunk of non-operative bytes won't go unnoticed for long in a the
kernel tree, but that's not the point).

But the original post of Pedro was, I think, about collisions occurring
'randomly' between two genuine modifications of a source file. In
particular, the paper that was linked to by Pedro is concerned with this
kind of unexpected collisions, i.e. about integrity in normal operation
and not about security (btw, this paper seems kind of bogus to me
because it never specifies the hash function in use).

I took the example of this attack against SHA-1 only to illustrate that
*even* if you try to find a collision, you just can't find one (at least
these days).
From which I think it is fair to say that such a collision won't happen
between two different versions of the same file, during the normal
process of revision.

I mean, if in the normal revision process of the kernel, a SHA-1
collision was to be found, by chance, who gets to publish the paper at
the next Crypto conference?

However, if we consider the case of an attack, well, that's different.
And you're right.

/er.
-- 
http://www.eleves.ens.fr/home/rannaud/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-14 Thread Eric D. Mudama
I hold my breath for weeks at a time, just incase something like this
happens!  I thought I was the only one!

On 4/12/05, Theodore Ts'o [EMAIL PROTECTED] wrote:
 So past a certain point, there is a probability that all of molecules
 of oxygen in the room will suddenly migrate outdoors, and you could
 suffocate.  Is it rational to spend time worrying about that
 possibility?  I'll leave that for you to decide.
 
 - Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Eric Rannaud

Simply put, the best known attack of SHA-1 takes 2^69 hash operations.
( http://www.schneier.com/blog/archives/2005/02/sha1_broken.html )
The attack is still only an unpublished paper and has not yet been
implemented. An attack is: you try as hard as you can to find a collision
between two arbitrary messages (i.e. two arbitrary --and nonsensical--
source files).
In the context of git, a better estimation would be the number of hash
operations needed to find a message that has the same hash than a given
fixed message (e.g. mm/memory.c). This is more like 2^100 hash
operations. And if a collision is found, this is very likely using a
message that *doesn't* look like a C source file...

Moreover, no example of collision is known, AFAIK.

In other words: this won't happen.

Best,
   /er.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Theodore Ts'o
On Tue, Apr 12, 2005 at 12:40:21AM +0200, Pedro Larroy wrote:
> 
> I had a quick look at the source of GIT tonight, I'd like to warn you
> about the use of hash functions as content indexers.
> 
> As probably you are aware, hash functions such as SHA-1 are surjective not
> bijective (1-to-1 map), so they have collisions. Here one can argue
> about the low probability of such a collision, I won't get into
> subjetive valorations of what probability of collision is tolerable for
> me and what is not. 
> 
> I my humble opinion, choosing deliberately, as a design decision, a
> method such as this one, that in some cases could corrupt data in a
> silent and very hard to detect way, is not very good. 

Actually, it will very likely be very, very easy to detect.  What
happens if there is a hash collision?  Well, in the case of a
non-malicious collision, instead of retrieving the correct version of
a file, we will get some random version of another file.  The moment
you try to compile with that incorrect file, you will with an
extremely high probability, get a failed compile, which will be
blantently obvious.

In the case of a malicous attacker trying to introduce a collision, it
should be noted first of all that the recent SHA-1 breakage was a
collision attack, not a pre-image attack.  So it's not useful for
trying to find another message that hashes to the same value as a one
already in use by git.  So the work factor is still 2**80.  Secondly,
even if an attacker could generate another file which has the same
hash as a pre-existing file, it still has to look like a valid git
object, and it still has to be a valid C file or it will again be
blatently obvious when you try to compile the sucker.

> One can also argue
> that the probability of data getting corrupted in the disk, or whatever
> could be higher than that of the collision, again this is not valid
> comparison

That's a matter of some religious dispute.  You can always reduce the
probability of a collsion down to an arbitrarily small value simply by
using a larger hash --- and switch hashes in git is quite simple since
it would just be a matter of running a program to calculate the hash
using a different algorithm, and renaming the files.  You can even use
hardlinks if you want to support two different hash algorithms
simultaneously during some transition period.

So past a certain point, there is a probability that all of molecules
of oxygen in the room will suddenly migrate outdoors, and you could
suffocate.  Is it rational to spend time worrying about that
possibility?  I'll leave that for you to decide.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Richard B. Johnson
On Mon, 11 Apr 2005, Petr Baudis wrote:
Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
where Pedro Larroy <[EMAIL PROTECTED]> told me that...
Hi
Hello,
I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.
As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not.
I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.
(i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.
(ii) In git-pasky, there's (turnable off) detection of collisions.
(iii) Your argument against comparing with the probability of a hardware
error does not make sense to me.
(iv) You fail to propose a better solution.
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
This is a standard, free (no patents) hash-function that works.
The fact that somebody can write a program, designed to create
collisions, and demonstrate that after many weeks of processing,
iteratively building upon the previous result, and finally
cause a collision, really isn't relevant for this application.
There is a good possibility that there are no hash-functions
that can't be broken.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
 Notice : All mail here is now cached for review by Dictator Bush.
 98.36% of all statistics are fiction.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Catalin Marinas
Magnus Damm <[EMAIL PROTECTED]> wrote:
> On 4/12/05, Petr Baudis <[EMAIL PROTECTED]> wrote:
>
>> (iv) You fail to propose a better solution.
>
> I would feel safer with back end storage filenames based on email and
> mtime together with an optional hash lookup that turns collisions into
> worse performance. But that's just me.

Have a look at bazaar-ng (http://www.bazaar-ng.org/), they seem to do
this. I had a quick look at it and they also seem to store the full
files when they change (similar to git). It is also a bit ahead of git
(started earlier) and looks quite promising.

-- 
Catalin

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Barry K. Nathan
On Tue, Apr 12, 2005 at 12:51:39AM +0200, Petr Baudis wrote:
> Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
> where Pedro Larroy <[EMAIL PROTECTED]> told me that...
[snip...]
> (iii) Your argument against comparing with the probability of a hardware
> error does not make sense to me.

I didn't understand it either, at first, but then I read this:

http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node9.html

Whether this is serious enough to actually worry about is another
question...

-Barry K. Nathan <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Barry K. Nathan
On Tue, Apr 12, 2005 at 12:51:39AM +0200, Petr Baudis wrote:
 Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
 where Pedro Larroy [EMAIL PROTECTED] told me that...
[snip...]
 (iii) Your argument against comparing with the probability of a hardware
 error does not make sense to me.

I didn't understand it either, at first, but then I read this:

http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node9.html

Whether this is serious enough to actually worry about is another
question...

-Barry K. Nathan [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Catalin Marinas
Magnus Damm [EMAIL PROTECTED] wrote:
 On 4/12/05, Petr Baudis [EMAIL PROTECTED] wrote:

 (iv) You fail to propose a better solution.

 I would feel safer with back end storage filenames based on email and
 mtime together with an optional hash lookup that turns collisions into
 worse performance. But that's just me.

Have a look at bazaar-ng (http://www.bazaar-ng.org/), they seem to do
this. I had a quick look at it and they also seem to store the full
files when they change (similar to git). It is also a bit ahead of git
(started earlier) and looks quite promising.

-- 
Catalin

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Richard B. Johnson
On Mon, 11 Apr 2005, Petr Baudis wrote:
Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
where Pedro Larroy [EMAIL PROTECTED] told me that...
Hi
Hello,
I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.
As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not.
I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.
(i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.
(ii) In git-pasky, there's (turnable off) detection of collisions.
(iii) Your argument against comparing with the probability of a hardware
error does not make sense to me.
(iv) You fail to propose a better solution.
--
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
This is a standard, free (no patents) hash-function that works.
The fact that somebody can write a program, designed to create
collisions, and demonstrate that after many weeks of processing,
iteratively building upon the previous result, and finally
cause a collision, really isn't relevant for this application.
There is a good possibility that there are no hash-functions
that can't be broken.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
 Notice : All mail here is now cached for review by Dictator Bush.
 98.36% of all statistics are fiction.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Theodore Ts'o
On Tue, Apr 12, 2005 at 12:40:21AM +0200, Pedro Larroy wrote:
 
 I had a quick look at the source of GIT tonight, I'd like to warn you
 about the use of hash functions as content indexers.
 
 As probably you are aware, hash functions such as SHA-1 are surjective not
 bijective (1-to-1 map), so they have collisions. Here one can argue
 about the low probability of such a collision, I won't get into
 subjetive valorations of what probability of collision is tolerable for
 me and what is not. 
 
 I my humble opinion, choosing deliberately, as a design decision, a
 method such as this one, that in some cases could corrupt data in a
 silent and very hard to detect way, is not very good. 

Actually, it will very likely be very, very easy to detect.  What
happens if there is a hash collision?  Well, in the case of a
non-malicious collision, instead of retrieving the correct version of
a file, we will get some random version of another file.  The moment
you try to compile with that incorrect file, you will with an
extremely high probability, get a failed compile, which will be
blantently obvious.

In the case of a malicous attacker trying to introduce a collision, it
should be noted first of all that the recent SHA-1 breakage was a
collision attack, not a pre-image attack.  So it's not useful for
trying to find another message that hashes to the same value as a one
already in use by git.  So the work factor is still 2**80.  Secondly,
even if an attacker could generate another file which has the same
hash as a pre-existing file, it still has to look like a valid git
object, and it still has to be a valid C file or it will again be
blatently obvious when you try to compile the sucker.

 One can also argue
 that the probability of data getting corrupted in the disk, or whatever
 could be higher than that of the collision, again this is not valid
 comparison

That's a matter of some religious dispute.  You can always reduce the
probability of a collsion down to an arbitrarily small value simply by
using a larger hash --- and switch hashes in git is quite simple since
it would just be a matter of running a program to calculate the hash
using a different algorithm, and renaming the files.  You can even use
hardlinks if you want to support two different hash algorithms
simultaneously during some transition period.

So past a certain point, there is a probability that all of molecules
of oxygen in the room will suddenly migrate outdoors, and you could
suffocate.  Is it rational to spend time worrying about that
possibility?  I'll leave that for you to decide.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-12 Thread Eric Rannaud

Simply put, the best known attack of SHA-1 takes 2^69 hash operations.
( http://www.schneier.com/blog/archives/2005/02/sha1_broken.html )
The attack is still only an unpublished paper and has not yet been
implemented. An attack is: you try as hard as you can to find a collision
between two arbitrary messages (i.e. two arbitrary --and nonsensical--
source files).
In the context of git, a better estimation would be the number of hash
operations needed to find a message that has the same hash than a given
fixed message (e.g. mm/memory.c). This is more like 2^100 hash
operations. And if a collision is found, this is very likely using a
message that *doesn't* look like a C source file...

Moreover, no example of collision is known, AFAIK.

In other words: this won't happen.

Best,
   /er.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-11 Thread Magnus Damm
On 4/12/05, Petr Baudis <[EMAIL PROTECTED]> wrote:

> (iv) You fail to propose a better solution.

I would feel safer with back end storage filenames based on email and
mtime together with an optional hash lookup that turns collisions into
worse performance. But that's just me.

/ magnus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-11 Thread Petr Baudis
Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
where Pedro Larroy <[EMAIL PROTECTED]> told me that...
> Hi

Hello,

> I had a quick look at the source of GIT tonight, I'd like to warn you
> about the use of hash functions as content indexers.
> 
> As probably you are aware, hash functions such as SHA-1 are surjective not
> bijective (1-to-1 map), so they have collisions. Here one can argue
> about the low probability of such a collision, I won't get into
> subjetive valorations of what probability of collision is tolerable for
> me and what is not. 
> 
> I my humble opinion, choosing deliberately, as a design decision, a
> method such as this one, that in some cases could corrupt data in a
> silent and very hard to detect way, is not very good. One can also argue
> that the probability of data getting corrupted in the disk, or whatever
> could be higher than that of the collision, again this is not valid
> comparison, since the fact that indexing by hash functions without
> additional checking could make data corruption legal between the
> reasonable working parameters of the program is very dangerous.

(i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.

(ii) In git-pasky, there's (turnable off) detection of collisions.

(iii) Your argument against comparing with the probability of a hardware
error does not make sense to me.

(iv) You fail to propose a better solution.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Call to atention about using hash functions as content indexers (SCM saga)

2005-04-11 Thread Pedro Larroy
Hi

I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.

As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not. 

I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.

And by the way, I've read
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node15.html

and I don't agree with it. Asides from the "avalanche effect" the
properties of a good hash function is that distributes well the entropy
between the output bits, so probably, although the inputs are not
random, the pdf change in the outputs would be small, otherwise it would
be easier to find collision by differential or statistical methods.

Last, hash functions should be only used to check data integrity, and
that's the reason of the "avalanche effect", so even single bit errors
are propagated and it's effect is very noticeable at the output.

Regards.

-- 
Pedro Larroy Tovar | Linux & Network consultant |  pedro%larroy.com 
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
* Las patentes de programación son nocivas para la innovación * 
http://proinnova.hispalinux.es/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Call to atention about using hash functions as content indexers (SCM saga)

2005-04-11 Thread Pedro Larroy
Hi

I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.

As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not. 

I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.

And by the way, I've read
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node15.html

and I don't agree with it. Asides from the avalanche effect the
properties of a good hash function is that distributes well the entropy
between the output bits, so probably, although the inputs are not
random, the pdf change in the outputs would be small, otherwise it would
be easier to find collision by differential or statistical methods.

Last, hash functions should be only used to check data integrity, and
that's the reason of the avalanche effect, so even single bit errors
are propagated and it's effect is very noticeable at the output.

Regards.

-- 
Pedro Larroy Tovar | Linux  Network consultant |  pedro%larroy.com 
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
* Las patentes de programación son nocivas para la innovación * 
http://proinnova.hispalinux.es/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-11 Thread Petr Baudis
Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
where Pedro Larroy [EMAIL PROTECTED] told me that...
 Hi

Hello,

 I had a quick look at the source of GIT tonight, I'd like to warn you
 about the use of hash functions as content indexers.
 
 As probably you are aware, hash functions such as SHA-1 are surjective not
 bijective (1-to-1 map), so they have collisions. Here one can argue
 about the low probability of such a collision, I won't get into
 subjetive valorations of what probability of collision is tolerable for
 me and what is not. 
 
 I my humble opinion, choosing deliberately, as a design decision, a
 method such as this one, that in some cases could corrupt data in a
 silent and very hard to detect way, is not very good. One can also argue
 that the probability of data getting corrupted in the disk, or whatever
 could be higher than that of the collision, again this is not valid
 comparison, since the fact that indexing by hash functions without
 additional checking could make data corruption legal between the
 reasonable working parameters of the program is very dangerous.

(i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.

(ii) In git-pasky, there's (turnable off) detection of collisions.

(iii) Your argument against comparing with the probability of a hardware
error does not make sense to me.

(iv) You fail to propose a better solution.

-- 
Petr Pasky Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Call to atention about using hash functions as content indexers (SCM saga)

2005-04-11 Thread Magnus Damm
On 4/12/05, Petr Baudis [EMAIL PROTECTED] wrote:

 (iv) You fail to propose a better solution.

I would feel safer with back end storage filenames based on email and
mtime together with an optional hash lookup that turns collisions into
worse performance. But that's just me.

/ magnus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/