Re: binary file compare...

2009-04-18 Thread Piet van Oostrum
 Adam Olsen rha...@gmail.com (AO) wrote:

AO The Wayback Machine has 150 billion pages, so 2**37.  Google's index
AO is a bit larger at over a trillion pages, so 2**40.  A little closer
AO than I'd like, but that's still 562949953421312 (2**49) to 1 odds
AO of having *any* collisions between *any* of the files.  Step up to
AO SHA-256 and it becomes about 1.9e53 (2**177) to 1.  Sadly, I can't
AO even give you the odds for SHA-512, Qalculate considers that too
AO close to infinite to display. :)

AO You should worry more about your head spontaneously exploding than you
AO should about a hash collision on that scale.  To do otherwise is
AO irrational paranoia.

And that is the probability of there being two files in that huge
collection with the same hash. If you just take two files, not
fabricated to collide, the probability of them having the same hash
under MD5 is 2**-128 which I think is way smaller than the probability
of the bit representing the answer being swapped by some physical cause
in your computer. But then again, it doesn't make sense to use that
instead of byte-by-byte comparison if the files are on the same machine.
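(An aside, not part of the original message: the byte-by-byte comparison Piet recommends for local files is short to write; this sketch is essentially what the standard library's `filecmp.cmp(f1, f2, shallow=False)` does.)

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Compare two local files byte-by-byte, reading in fixed-size chunks."""
    # A cheap size check avoids reading either file when lengths differ.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both files exhausted at the same point
                return True
```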
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: binary file compare...

2009-04-17 Thread Nigel Rantor

Adam Olsen wrote:

On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote:

And yes, he is right: CRCs and hashes all have a probability of saying
that the files are identical when in fact they are not.


Here's the bottom line.  It is either:

A) Several hundred years of mathematics and cryptography are wrong.
The birthday problem as described is incorrect, so a collision is far
more likely than 42 trillion trillion to 1.  You are simply the first
person to have noticed it.

B) Your software was buggy, or possibly the input was maliciously
produced.  Or, a really tiny chance that your particular files
contained a pattern that provoked bad behaviour from MD5.

Finding a specific limitation of the algorithm is one thing.  Claiming
that the math is fundamentally wrong is quite another.


You are confusing yourself about probabilities, young man.

Just because something is extremely unlikely does not mean it can't 
happen on the first attempt.


This is true *no matter how big the numbers are*.

If you persist in making these ridiculous claims that people *cannot* 
have found collisions then as I said, that's up to you, but I'm not 
going to employ you to do anything except make tea.


Thanks,

  Nigel



Re: binary file compare...

2009-04-17 Thread Nigel Rantor

Adam Olsen wrote:

On Apr 16, 4:27 pm, Rhodri James rho...@wildebst.demon.co.uk
wrote:

On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen rha...@gmail.com wrote:

On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote:

Okay, before I tell you about the empirical, real-world evidence I have
could you please accept that hashes collide and that no matter how many
samples you use the probability of finding two files that do collide is
small but not zero.

I'm afraid you will need to back up your claims with real files.

So that would be a no then.  If the implementation of dicts in Python,
say, were to assert as you are that the hashes aren't going to collide,
then I'd have to walk away from it.  There's no point in using something
that guarantees a non-zero chance of corrupting your data.


Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or
65 thousand) will give you a decent chance of a collision.  In
contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or
2**256 (SHA-512).  The two are at totally different extremes.
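(An aside, not part of the thread: Adam's 2**16 figure checks out against the standard birthday approximation p ≈ 1 - exp(-n(n-1)/2d). For 2**16 keys in a 32-bit hash space that is roughly a 39% chance of at least one collision.)

```python
from math import exp

def birthday_probability(n, bits):
    """Approximate chance of at least one collision among n values
    drawn uniformly at random from a space of 2**bits values."""
    return 1.0 - exp(-n * (n - 1) / 2.0 ** (bits + 1))

p = birthday_probability(2 ** 16, 32)
print(round(p, 3))  # about 0.393 -- a "decent chance" indeed
```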


I'm just going to go ahead and take the above as an admission by you 
that the chance of collision is non-zero, and that if we accept that 
fact you cannot rely on a hash function to tell you if two files are 
identical.


Thanks.


There is *always* a non-zero chance of corruption, due to software
bugs, hardware defects, or even operator error.  It is only in that
broader context that you can realize just how minuscule the risk is.


Please explain why you're dragging the notion of corruption into this 
when it seems to be beside the point?



Can you explain to me why you justify great lengths of paranoia, when
the risk is so much lower?


Because in the real world, where I work, in practical, real, factual 
terms I have seen it happen. Not once. Not twice. But many, many times.



Why are you advocating a solution to the OP's problem that is more
computationally expensive than a simple byte-by-byte comparison and
doesn't guarantee to give the correct answer?


For single, one-off comparison I have no problem with a byte-by-byte
comparison.  There's a decent chance the files won't be in the OS's
cache anyway, so disk IO will be your bottleneck.



Only if you're doing multiple comparisons is a hash database
justified.  Even then, if you expect matching files to be fairly rare
I won't lose any sleep if you're paranoid and do a byte-by-byte
comparison anyway.  New vulnerabilities are found, and if you don't
update promptly there is a small (but significant) chance of a
malicious file leading to collision.


If I have a number of files then I would certainly use a hash as a quick 
test, but if two files hash to the same value I still have to go compare 
them. Hashing in this case saves time doing a comparison for each file 
but it doesn't obviate the need to do a byte-by-byte comparison to see 
if the files that hash to the same value are actually the same.
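(The prefilter-then-verify scheme Nigel describes can be sketched as follows; this is a hypothetical helper, assuming SHA-256 as the quick test, not code from the thread.)

```python
import hashlib
from collections import defaultdict

def find_duplicates(paths, chunk_size=64 * 1024):
    """Group files by SHA-256 digest, then confirm each candidate
    pair byte-by-byte before reporting it as a duplicate."""
    by_digest = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        by_digest[h.hexdigest()].append(path)

    duplicates = []
    for group in by_digest.values():
        # Equal digests only make files *candidates*: verify for real.
        for other in group[1:]:
            with open(group[0], "rb") as fa, open(other, "rb") as fb:
                if fa.read() == fb.read():
                    duplicates.append((group[0], other))
    return duplicates
```

The hash pass is O(total bytes) once; the byte-by-byte pass only runs on the (usually tiny) groups that share a digest.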



That's not my concern though.  What I'm responding to is Nigel
Rantor's grossly incorrect statements about probability.  The chance
of collision, in our life time, is *insignificant*.


Please tell me which statements? That the chance of two files hashing to 
the same value is non-zero? You admit as much above.


Also, please remember I gave you a way of verifying what I said: go 
crawl the web for images, pages, whatever, build a hash DB and tell me 
how long it takes you to find a collision using MD5 (which is the hash I 
was talking about when I told you I had real-world experience to back 
up the theoretical claim that collisions occur).


Regards,

  Nigel



Re: binary file compare...

2009-04-17 Thread Tim Wintle
On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
 The Wayback Machine has 150 billion pages, so 2**37.  Google's index
 is a bit larger at over a trillion pages, so 2**40.  A little closer
 than I'd like, but that's still 562949953421312 (2**49) to 1 odds of
 having *any* collisions between *any* of the files.  Step up to SHA-256
 and it becomes about 1.9e53 (2**177) to 1.  Sadly, I can't even give
 you the odds for SHA-512, Qalculate considers that too close to
 infinite to display. :)

That might be true as long as your data is completely uniformly
distributed. For the example you give there's:

a) a high chance that there's html near the top

b) a non-uniform distribution of individual words within the text.

c) a non-uniform distribution of all n-grams within the text (as there is
in natural language)

So it's very far from uniformly distributed. Just about the only
situation where I could imagine that holding would be where you are
hashing uniformly random data for the sake of testing the hash.


I believe the point being made is that comparing hash values is a
probabilistic algorithm anyway, which is fine if you're ok with that,
but for mission critical software it's crazy.

Tim Wintle



Re: binary file compare...

2009-04-17 Thread norseman

Adam Olsen wrote:

On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote:

And yes, he is right: CRCs and hashes all have a probability of saying
that the files are identical when in fact they are not.


Here's the bottom line.  It is either:

A) Several hundred years of mathematics and cryptography are wrong.
The birthday problem as described is incorrect, so a collision is far
more likely than 42 trillion trillion to 1.  You are simply the first
person to have noticed it.

B) Your software was buggy, or possibly the input was maliciously
produced.  Or, a really tiny chance that your particular files
contained a pattern that provoked bad behaviour from MD5.

Finding a specific limitation of the algorithm is one thing.  Claiming
that the math is fundamentally wrong is quite another.



Spending a lifetime in applied math has taught me:
1) All applied math is finite.
2) Any algorithm failing to handle all contingencies is flawed.

The meaning of 1) is that it is limited in what it can actually do.
The meaning of 2) is that the designer missed or left out something.

Neither should be taken as bad. Both need to be accepted 'as is' and 
the decision to use (when,where,conditions) based on the probability of 
non-failure.



...a pattern that provoked bad behavior...  does mean the algorithm is 
incomplete and may be fundamentally wrong. Emphasis on 'is' and 'may'.


The more complicated the math the harder it is to keep a higher form of 
math from checking (or improperly displacing) a lower one.  Which, of 
course, breaks the rules.  Commonly called improper thinking. A number 
of math teasers make use of that.




Steve


Re: binary file compare...

2009-04-17 Thread SpreadTooThin
On Apr 17, 4:54 am, Nigel Rantor wig...@wiggly.org wrote:
 Adam Olsen wrote:
  On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote:
  And yes, he is right: CRCs and hashes all have a probability of saying
  that the files are identical when in fact they are not.

  Here's the bottom line.  It is either:

  A) Several hundred years of mathematics and cryptography are wrong.
  The birthday problem as described is incorrect, so a collision is far
  more likely than 42 trillion trillion to 1.  You are simply the first
  person to have noticed it.

  B) Your software was buggy, or possibly the input was maliciously
  produced.  Or, a really tiny chance that your particular files
  contained a pattern that provoked bad behaviour from MD5.

  Finding a specific limitation of the algorithm is one thing.  Claiming
  that the math is fundamentally wrong is quite another.

 You are confusing yourself about probabilities, young man.

 Just because something is extremely unlikely does not mean it can't
 happen on the first attempt.

 This is true *no matter how big the numbers are*.

 If you persist in making these ridiculous claims that people *cannot*
 have found collisions then as I said, that's up to you, but I'm not
 going to employ you to do anything except make tea.

 Thanks,

    Nigel

You know this is just insane.  I'd be satisfied with a CRC16 or
something in the situation I'm in.
I have two large files, one local and one remote.  Transferring every
byte across the internet to be sure that the two files are identical
is just not feasible.  If two servers, one on each side, both
calculate the CRCs and transmit them for comparison, I'm happy.


Re: binary file compare...

2009-04-17 Thread Adam Olsen
On Apr 17, 5:30 am, Tim Wintle tim.win...@teamrubber.com wrote:
 On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
  The Wayback Machine has 150 billion pages, so 2**37.  Google's index
  is a bit larger at over a trillion pages, so 2**40.  A little closer
  than I'd like, but that's still 562949953421312 (2**49) to 1 odds of
  having *any* collisions between *any* of the files.  Step up to
  SHA-256 and it becomes about 1.9e53 (2**177) to 1.  Sadly, I can't
  even give you the odds for SHA-512, Qalculate considers that too
  close to infinite to display. :)

 That might be true as long as your data is completely uniformly
 distributed. For the example you give there's:

 a) a high chance that there's html near the top

 b) a non-uniform distribution of individual words within the text.

 c) a non-uniform distribution of all n-grams within the text (as there is
 in natural language)

 So it's very far from uniformly distributed. Just about the only
 situation where I could imagine that holding would be where you are
 hashing uniformly random data for the sake of testing the hash.

 I believe the point being made is that comparing hash values is a
 probabilistic algorithm anyway, which is fine if you're ok with that,
 but for mission critical software it's crazy.

Actually, *cryptographic* hashes handle that just fine.  Even for
files with just a 1 bit change the output is totally different.  This
is known as the Avalanche Effect.  Otherwise they'd be vulnerable to
attacks.

Which isn't to say you couldn't *construct* a pattern that it would be
vulnerable to.  Figuring that out is pretty much the whole point of
attacking a cryptographic hash.  MD5 has significant vulnerabilities
by now, and others will in the future.  That's just a risk you need to
manage.
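(An illustrative sketch, not from the thread: the avalanche effect is easy to observe directly. Flip a single input bit and roughly half of the 256 output bits of SHA-256 change.)

```python
import hashlib

data = b"The quick brown fox jumps over the lazy dog"
flipped = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit of the input

d1 = hashlib.sha256(data).digest()
d2 = hashlib.sha256(flipped).digest()

# Count how many of the 256 digest bits differ between the two hashes.
diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(d1, d2))
print(diff_bits)  # typically close to 128
```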


Re: binary file compare...

2009-04-17 Thread Adam Olsen
On Apr 17, 9:59 am, norseman norse...@hughes.net wrote:
 The more complicated the math the harder it is to keep a higher form of
 math from checking (or improperly displacing) a lower one.  Which, of
 course, breaks the rules.  Commonly called improper thinking. A number
 of math teasers make use of that.

Of course, designing a hash is hard.  That's why the *recommended*
ones get so many years of peer review and attempted attacks first.

I'd love it if Nigel provided evidence that MD5 was broken, I really
would.  It'd be quite interesting to investigate, assuming malicious
content can be ruled out.  Of course even he doesn't think that.  He
claims that his 42 trillion trillion to 1 odds happened not just once,
but multiple times.


Re: binary file compare...

2009-04-17 Thread Adam Olsen
On Apr 17, 9:59 am, SpreadTooThin bjobrie...@gmail.com wrote:
 You know this is just insane.  I'd be satisfied with a CRC16 or
 something in the situation i'm in.
 I have two large files, one local and one remote.  Transferring every
 byte across the internet to be sure that the two files are identical
 is just not feasible.  If two servers one on one side and the other on
 the other side both calculate the CRCs and transmit the CRCs for
 comparison I'm happy.

Definitely use a hash, ignore Nigel.  SHA-256 or SHA-512.  Or, if you
might need to update one of the files, look at rsync.  Rsync still
uses MD4 and MD5 (optionally!), but they're fine in a trusted
environment.
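(A minimal sketch of the suggestion above, not code from the thread: each side hashes its own file incrementally and only the short digest crosses the wire, assuming both servers run the same code.)

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file incrementally so large files never sit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Each server computes file_digest() on its copy and transmits only the
# 64-character hex string; matching digests mean matching files, up to
# the (astronomically small) collision risk discussed in this thread.
```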


Re: binary file compare...

2009-04-17 Thread Lawrence D'Oliveiro
In message mailman.3934.1239821812.11746.python-l...@python.org, Nigel 
Rantor wrote:

 Adam Olsen wrote:

 The chance of *accidentally* producing a collision, although
 technically possible, is so extraordinarily rare that it's completely
 overshadowed by the risk of a hardware or software failure producing
 an incorrect result.
 
 Not when you're using them to compare lots of files.
 
 Trust me. Been there, done that, got the t-shirt.

Not with any cryptographically-strong hash, you haven't.



Re: binary file compare...

2009-04-17 Thread Steven D'Aprano
On Fri, 17 Apr 2009 11:19:31 -0700, Adam Olsen wrote:

 Actually, *cryptographic* hashes handle that just fine.  Even for files
 with just a 1 bit change the output is totally different.  This is known
 as the Avalanche Effect.  Otherwise they'd be vulnerable to attacks.
 
 Which isn't to say you couldn't *construct* a pattern that it would be
 vulnerable to.  Figuring that out is pretty much the whole point of
 attacking a cryptographic hash.  MD5 has significant vulnerabilities by
 now, and others will in the future.  That's just a risk you need to
 manage.

There are demonstrated methods for constructing MD5 collisions against an 
arbitrary file in about a minute on a laptop computer.

http://en.wikipedia.org/wiki/MD5



-- 
Steven


Re: binary file compare...

2009-04-16 Thread Adam Olsen
On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote:
 Adam Olsen wrote:
  The chance of *accidentally* producing a collision, although
  technically possible, is so extraordinarily rare that it's completely
  overshadowed by the risk of a hardware or software failure producing
  an incorrect result.

 Not when you're using them to compare lots of files.

 Trust me. Been there, done that, got the t-shirt.

 Using hash functions to tell whether or not files are identical is an
 error waiting to happen.

 But please, do so if it makes you feel happy, you'll just eventually get
 an incorrect result and not know it.

Please tell us what hash you used and provide the two files that
collided.

If your hash is 256 bits, then you need around 2**128 files to produce
a collision.  This is known as a Birthday Attack.  I seriously doubt
you had that many files, which suggests something else went wrong.


Re: binary file compare...

2009-04-16 Thread Adam Olsen
On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote:
 Adam Olsen wrote:
  On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote:
  Adam Olsen wrote:
  The chance of *accidentally* producing a collision, although
  technically possible, is so extraordinarily rare that it's completely
  overshadowed by the risk of a hardware or software failure producing
  an incorrect result.
  Not when you're using them to compare lots of files.

  Trust me. Been there, done that, got the t-shirt.

  Using hash functions to tell whether or not files are identical is an
  error waiting to happen.

  But please, do so if it makes you feel happy, you'll just eventually get
  an incorrect result and not know it.

  Please tell us what hash you used and provide the two files that
  collided.

 MD5

  If your hash is 256 bits, then you need around 2**128 files to produce
  a collision.  This is known as a Birthday Attack.  I seriously doubt
  you had that many files, which suggests something else went wrong.

 Okay, before I tell you about the empirical, real-world evidence I have
 could you please accept that hashes collide and that no matter how many
 samples you use the probability of finding two files that do collide is
 small but not zero.

I'm afraid you will need to back up your claims with real files.
Although MD5 is a smaller, older hash (128 bits, so you only need
2**64 files to find collisions), and it has substantial known
vulnerabilities, the scenario you suggest where you *accidentally*
find collisions (and you imply multiple collisions!) would be a rather
significant finding.

Please help us all by justifying your claim.

Mind you, since you use MD5 I wouldn't be surprised if your files were
maliciously produced.  As I said before, you need to consider
upgrading your hash every few years to avoid new attacks.


Re: binary file compare...

2009-04-16 Thread Nigel Rantor

Adam Olsen wrote:

On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote:

Adam Olsen wrote:

On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote:

Adam Olsen wrote:

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.

Not when you're using them to compare lots of files.
Trust me. Been there, done that, got the t-shirt.
Using hash functions to tell whether or not files are identical is an
error waiting to happen.
But please, do so if it makes you feel happy, you'll just eventually get
an incorrect result and not know it.

Please tell us what hash you used and provide the two files that
collided.

MD5


If your hash is 256 bits, then you need around 2**128 files to produce
a collision.  This is known as a Birthday Attack.  I seriously doubt
you had that many files, which suggests something else went wrong.

Okay, before I tell you about the empirical, real-world evidence I have
could you please accept that hashes collide and that no matter how many
samples you use the probability of finding two files that do collide is
small but not zero.


I'm afraid you will need to back up your claims with real files.
Although MD5 is a smaller, older hash (128 bits, so you only need
2**64 files to find collisions), and it has substantial known
vulnerabilities, the scenario you suggest where you *accidentally*
find collisions (and you imply multiple collisions!) would be a rather
significant finding.


No. It wouldn't. It isn't.

The files in question were millions of audio files. I no longer work at 
the company where I had access to them so I cannot give you examples, 
and even if I did Data Protection regulations wouldn't have allowed it.


If you still don't believe me you can easily verify what I'm saying by 
doing some simple experiments. Go spider the web for images, keep 
collecting them until you get an MD5 hash collision.


It won't take long.


Please help us all by justifying your claim.


Now, please go and re-read my request first and admit that everything I 
have said so far is correct.



Mind you, since you use MD5 I wouldn't be surprised if your files were
maliciously produced.  As I said before, you need to consider
upgrading your hash every few years to avoid new attacks.


Good grief, this is nothing to do with security concerns, this is about 
someone suggesting to the OP that they use a hash function to determine 
whether or not two files are identical.


Regards,

  Nige


Re: binary file compare...

2009-04-16 Thread Nigel Rantor

Adam Olsen wrote:

On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote:

Adam Olsen wrote:

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.

Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Using hash functions to tell whether or not files are identical is an
error waiting to happen.

But please, do so if it makes you feel happy, you'll just eventually get
an incorrect result and not know it.


Please tell us what hash you used and provide the two files that
collided.


MD5


If your hash is 256 bits, then you need around 2**128 files to produce
a collision.  This is known as a Birthday Attack.  I seriously doubt
you had that many files, which suggests something else went wrong.


Okay, before I tell you about the empirical, real-world evidence I have 
could you please accept that hashes collide and that no matter how many 
samples you use the probability of finding two files that do collide is 
small but not zero.


Which is the only thing I've been saying.

Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.

If you are of the opinion though that a hash function can be used to 
tell you whether or not two files are identical then you are wrong. It 
really is that simple.


I'm not sitting here discussing this for my health, I'm just trying to 
give the OP the benefit of my experience, I have worked with other 
people who insisted on this route and had to find out the hard way that 
it was a Bad Idea (tm). They just wouldn't be told.


Regards,

  Nige


Re: binary file compare...

2009-04-16 Thread Grant Edwards
On 2009-04-16, Adam Olsen rha...@gmail.com wrote:

 The chance of *accidentally* producing a collision, although
 technically possible, is so extraordinarily rare that it's
 completely overshadowed by the risk of a hardware or software
 failure producing an incorrect result.

 Not when you're using them to compare lots of files.

 Trust me. Been there, done that, got the t-shirt.

I must admit, that is a bit hard to believe.

 Okay, before I tell you about the empirical, real-world
 evidence I have could you please accept that hashes collide
 and that no matter how many samples you use the probability of
 finding two files that do collide is small but not zero.

 I'm afraid you will need to back up your claims with real files.
 Although MD5 is a smaller, older hash (128 bits, so you only need
 2**64 files to find collisions),

You don't need quite that many to have a significant chance of
a collision.  With only something on the order of 2**61
files, you still have about a 1% chance of a collision.
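(An aside, not part of the original message: the 1% figure at 2**61 files checks out against the same birthday approximation, using `math.expm1` to keep precision when the exponent is tiny.)

```python
from math import expm1

def collision_probability(n, bits):
    """Birthday approximation: 1 - exp(-n*(n-1) / 2**(bits+1))."""
    return -expm1(-n * (n - 1) / 2.0 ** (bits + 1))

print(collision_probability(2 ** 61, 128))  # about 0.0078, i.e. roughly 1%
```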

Just for fun here are a few data points showing the probability
of at least one collision in a sample of size n using a perfect
128-bit hash.  These were calculated using the taylor-series
approximation shown at

 http://en.wikipedia.org/wiki/Birthday_paradox#Calculating_the_probability

For a few million files (we'll say 4e6), the probability of a
collision is so close to 0 that it can't be calculated using
double-precision IEEE floats.

For a few _billion_ files, the taylor series approximation is
still 0 to about 15 significant digits.

For a few _trillion_ files, you finally get a non-zero
probability of about 2e-14.

Here are a few more points:

n = 1806791225998070; p(n) = 0.
n = 1987470348597877; p(n) = 0.0001
n = 2186217383457665; p(n) = 0.0001
n = 2404839121803431; p(n) = 0.0001
n = 2645323033983774; p(n) = 0.0001
n = 2909855337382151; p(n) = 0.0001
n = 3200840871120366; p(n) = 0.0002
n = 3520924958232403; p(n) = 0.0002
n = 3873017454055643; p(n) = 0.0002
n = 4260319199461207; p(n) = 0.0003
n = 4686351119407328; p(n) = 0.0003
n = 5154986231348061; p(n) = 0.0004
n = 5670484854482868; p(n) = 0.0005
n = 623759931155; p(n) = 0.0006
n = 6861286673924271; p(n) = 0.0007
n = 7547415341316699; p(n) = 0.0008
n = 8302156875448370; p(n) = 0.0010
n = 9132372562993208; p(n) = 0.0012
n =10045609819292530; p(n) = 0.0015
n =11050170801221784; p(n) = 0.0018
n =12155187881343964; p(n) = 0.0022
n =13370706669478362; p(n) = 0.0026
n =1470336426200; p(n) = 0.0032
n =16178555070068822; p(n) = 0.0038
n =17796410577075706; p(n) = 0.0047
n =19576051634783280; p(n) = 0.0056
n =21533656798261608; p(n) = 0.0068
n =23687022478087772; p(n) = 0.0082
n =26055724725896552; p(n) = 0.0100
n =28661297198486208; p(n) = 0.0121
n =31527426918334832; p(n) = 0.0146
n =34680169610168320; p(n) = 0.0177
n =38148186571185152; p(n) = 0.0214
n =41963005228303672; p(n) = 0.0259
n =46159305751134040; p(n) = 0.0313
n =50775236326247448; p(n) = 0.0379
n =55852759958872200; p(n) = 0.0458
n =61438035954759424; p(n) = 0.0555
n =67581839550235368; p(n) = 0.0671
n =74340023505258912; p(n) = 0.0812
n =81774025855784816; p(n) = 0.0983
n =89951428441363312; p(n) = 0.1189
n =98946571285499648; p(n) = 0.1439
n =   108841228414049616; p(n) = 0.1741
n =   119725351255454592; p(n) = 0.2106
n =   13169788638164; p(n) = 0.2548
n =   144867675019100096; p(n) = 0.3084
n =   159354442521010112; p(n) = 0.3731
n =   17528988677336; p(n) = 0.4515
n =   192818875450422272; p(n) = 0.5463
n =   212100762995464512; p(n) = 0.6610
n =   233310839295010976; p(n) = 0.7998
n =   256641923224512096; p(n) = 0.9678
n =   282306115546963328; p(n) = 0.00011710
n =   310536727101659712; p(n) = 0.00014169
n =   341590399811825728; p(n) = 0.00017144
n =   375749439793008320; p(n) = 0.00020744
n =   413324383772309184; p(n) = 0.00025099
n =   454656822149540160; p(n) = 0.00030369
n =   500122504364494208; p(n) = 0.00036745
n =   550134754800943680; p(n) = 0.00044460
n =   605148230281038080; p(n) = 0.00053794
n =   665663053309141888; p(n) = 0.00065088
n =   732229358640056192; p(n) = 0.00078751
n =   805452294504061824; p(n) = 0.00095280
n =   885997523954468096; p(n) = 0.00115278
n =   974597276349915008; p(n) = 0.00139469
n =  1072057003984906624; p(n) = 0.00168733
n =  1179262704383397376; p(n) = 0.00204131
n =  1297188974821737216; p(n) = 0.00246945
n =  1426907872303911168; p(n) = 0.00298726
n =  1569598659534302464; p(n) = 0.00361345
n =  1726558525487732736; p(n) = 0.00437061
n =  1899214378036506112; p(n) = 0.00528601
n =  2089135815840156928; p(n) = 0.00639252

Re: binary file compare...

2009-04-16 Thread SpreadTooThin
On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote:
 Adam Olsen wrote:
  On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote:
  Adam Olsen wrote:
  The chance of *accidentally* producing a collision, although
  technically possible, is so extraordinarily rare that it's completely
  overshadowed by the risk of a hardware or software failure producing
  an incorrect result.
  Not when you're using them to compare lots of files.

  Trust me. Been there, done that, got the t-shirt.

  Using hash functions to tell whether or not files are identical is an
  error waiting to happen.

  But please, do so if it makes you feel happy, you'll just eventually get
  an incorrect result and not know it.

  Please tell us what hash you used and provide the two files that
  collided.

 MD5

  If your hash is 256 bits, then you need around 2**128 files to produce
  a collision.  This is known as a Birthday Attack.  I seriously doubt
  you had that many files, which suggests something else went wrong.

 Okay, before I tell you about the empirical, real-world evidence I have
 could you please accept that hashes collide and that no matter how many
 samples you use the probability of finding two files that do collide is
 small but not zero.

 Which is the only thing I've been saying.

 Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.

 If you are of the opinion though that a hash function can be used to
 tell you whether or not two files are identical then you are wrong. It
 really is that simple.

 I'm not sitting here discussing this for my health, I'm just trying to
 give the OP the benefit of my experience, I have worked with other
 people who insisted on this route and had to find out the hard way that
 it was a Bad Idea (tm). They just wouldn't be told.

 Regards,

    Nige

And yes, he is right: CRCs and hashes all have a probability of saying
that the files are identical when in fact they are not.


Re: binary file compare...

2009-04-16 Thread Adam Olsen
On Apr 16, 8:59 am, Grant Edwards inva...@invalid wrote:
 On 2009-04-16, Adam Olsen rha...@gmail.com wrote:
  I'm afraid you will need to back up your claims with real files.
  Although MD5 is a smaller, older hash (128 bits, so you only need
  2**64 files to find collisions),

 You don't need quite that many to have a significant chance of
 a collision.  With only something on the order of 2**61
 files, you still have about a 1% chance of a collision.

Aye, 2**64 is more of the middle of the curve or so.  You can still go
either way.  What's important is the order of magnitude required.


 For a few million files (we'll say 4e6), the probability of a
 collision is so close to 0 that it can't be calculated using
 double-precision IEEE floats.

≈ 2.35e-26

Or about 4.25e25 to 1.

Or 42 trillion trillion to 1.


 Here's the Python function I'm using:

 from math import exp

 def bp(n, d):
     return 1.0 - exp(-n*(n-1.)/(2.*d))

 I haven't spent much time studying the numerical issues of the
 way that the exponent is calculated, so I'm not entirely
 confident in the results for small n values such that
 p(n) == 0.0.
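(An aside on that numerical concern, not from the thread: `1.0 - exp(x)` underflows to zero when `|x|` is below about 1e-16, because `exp(x)` rounds to exactly 1.0. `math.expm1`, which computes `exp(x) - 1` accurately, recovers the ~2.35e-26 probability for four million files.)

```python
from math import exp, expm1

n, d = 4e6, 2.0 ** 128

naive = 1.0 - exp(-n * (n - 1) / (2.0 * d))
better = -expm1(-n * (n - 1) / (2.0 * d))

print(naive)   # 0.0 -- exp() rounds to exactly 1.0, the difference is lost
print(better)  # about 2.35e-26
```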

Try using Qalculate.  I always resort to it for things like this.


Re: binary file compare...

2009-04-16 Thread Rhodri James

On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen rha...@gmail.com wrote:


On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote:

Okay, before I tell you about the empirical, real-world evidence I have
could you please accept that hashes collide and that no matter how many
samples you use the probability of finding two files that do collide is
small but not zero.


I'm afraid you will need to back up your claims with real files.


So that would be a no then.  If the implementation of dicts in Python,
say, were to assert as you are that the hashes aren't going to collide,
then I'd have to walk away from it.  There's no point in using something
that guarantees a non-zero chance of corrupting your data.

Why are you advocating a solution to the OP's problem that is more
computationally expensive than a simple byte-by-byte comparison and
doesn't guarantee to give the correct answer?

--
Rhodri James *-* Wildebeeste Herder to the Masses


Re: binary file compare...

2009-04-16 Thread Adam Olsen
On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote:
 And yes he is right CRCs hashing all have a probability of saying that
 the files are identical when in fact they are not.

Here's the bottom line.  It is either:

A) Several hundred years of mathematics and cryptography are wrong.
The birthday problem as described is incorrect, so a collision is far
more likely than 42 trillion trillion to 1.  You are simply the first
person to have noticed it.

B) Your software was buggy, or possibly the input was maliciously
produced.  Or, a really tiny chance that your particular files
contained a pattern that provoked bad behaviour from MD5.

Finding a specific limitation of the algorithm is one thing.  Claiming
that the math is fundamentally wrong is quite another.


Re: binary file compare...

2009-04-16 Thread Adam Olsen
On Apr 16, 4:27 pm, Rhodri James rho...@wildebst.demon.co.uk
wrote:
 On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen rha...@gmail.com wrote:
  On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote:
  Okay, before I tell you about the empirical, real-world evidence I have
  could you please accept that hashes collide and that no matter how many
  samples you use the probability of finding two files that do collide is
  small but not zero.

  I'm afraid you will need to back up your claims with real files.

 So that would be a no then.  If the implementation of dicts in Python,
 say, were to assert as you are that the hashes aren't going to collide,
 then I'd have to walk away from it.  There's no point in using something
 that guarantees a non-zero chance of corrupting your data.

Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or
65 thousand) will give you a decent chance of a collision.  In
contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or
2**256 (SHA-512).  The two are at totally different extremes.
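Adam's point about scale is easy to check empirically. Truncating MD5 to its first 32 bits gives a well-mixed hash the same width as a 32-bit dict hash, and the birthday effect then shows up within roughly 2**16 inputs. A sketch (the truncation scheme and the inputs are mine, for illustration only):

```python
import hashlib

def h32(data):
    # First 4 bytes of MD5: well mixed, but only 32 bits wide
    return hashlib.md5(data).digest()[:4]

seen = {}
collision = None
for i in range(1 << 20):              # up to ~a million distinct inputs
    key = h32(str(i).encode())
    if key in seen:
        collision = (seen[key], i)    # two different inputs, one 32-bit hash
        break
    seen[key] = i

print(collision)   # a colliding pair turns up long before the loop ends
```

The same experiment run against the full 128-bit digest would, for all practical purposes, never terminate with a collision.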

There is *always* a non-zero chance of corruption, due to software
bugs, hardware defects, or even operator error.  It is only in that
broader context that you can realize just how minuscule the risk is.

Can you explain to me why you justify great lengths of paranoia, when
the risk is so much lower?


 Why are you advocating a solution to the OP's problem that is more
 computationally expensive than a simple byte-by-byte comparison and
 doesn't guarantee to give the correct answer?

For single, one-off comparison I have no problem with a byte-by-byte
comparison.  There's a decent chance the files won't be in the OS's
cache anyway, so disk IO will be your bottleneck.

Only if you're doing multiple comparisons is a hash database
justified.  Even then, if you expect matching files to be fairly rare
I won't lose any sleep if you're paranoid and do a byte-by-byte
comparison anyway.  New vulnerabilities are found, and if you don't
update promptly there is a small (but significant) chance of a
malicious file leading to collision.
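The two-stage scheme described above — a hash database to nominate candidate duplicates cheaply, then a byte-by-byte check for the paranoid — might be sketched like this (the names and the grouping-by-size step are mine):

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def sha256_of(path, bufsize=1 << 16):
    # Stream the file so large files need not fit in memory
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Only files sharing (size, digest) can possibly be equal
    groups = defaultdict(list)
    for p in paths:
        groups[(os.path.getsize(p), sha256_of(p))].append(p)
    # Confirm each candidate against the group's first member, byte by byte
    dups = []
    for members in groups.values():
        for other in members[1:]:
            if filecmp.cmp(members[0], other, shallow=False):
                dups.append((members[0], other))
    return dups
```

Matching digests only nominate candidates here; the filecmp.cmp(..., shallow=False) pass is what actually settles equality.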

That's not my concern though.  What I'm responding to is Nigel
Rantor's grossly incorrect statements about probability.  The chance
of collision, in our life time, is *insignificant*.

The Wayback Machine has 150 billion pages, so 2**37.  Google's index
is a bit larger at over a trillion pages, so 2**40.  A little closer
than I'd like, but that's still about 5.6294995e14 to 1 odds of having
*any* collisions between *any* of the files.  Step up to SHA-256 and
it becomes about 1.9156194e53 to 1.  Sadly, I can't even give you the
odds for SHA-512, Qalculate considers that too close to infinite to
display. :)

You should worry more about your head spontaneously exploding than you
should about a hash collision on that scale.  To do otherwise is
irrational paranoia.


Re: binary file compare...

2009-04-15 Thread Steven D'Aprano
On Wed, 15 Apr 2009 07:54:20 +0200, Martin wrote:

 Perhaps I'm being dim, but how else are you going to decide if two
 files are the same unless you compare the bytes in the files?
 
 I'd say checksums, just about every download relies on checksums to
 verify you do have indeed the same file.

The checksum does look at every byte in each file. Checksumming isn't a 
way to avoid looking at each byte of the two files, it is a way of 
mapping all the bytes to a single number.



 You could hash them and compare the hashes, but that's a lot more work
 than just comparing the two byte streams.
 
 hashing is not exactly much work; in its simplest form it's 2 lines per
 file.

Hashing is a *lot* more work than just comparing two bytes. The MD5 
checksum has been specifically designed to be fast and compact, and the 
algorithm is still complicated:

http://en.wikipedia.org/wiki/MD5#Pseudocode

The reference implementation is here:

http://www.fastsum.com/rfc1321.php#APPENDIXA

SHA-1 is even more complicated still:

http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode


Just because *calling* some checksum function is easy doesn't make the 
checksum function itself simple. They do a LOT more work than just a 
simple comparison between bytes, and that's totally unnecessary work if 
you are making a one-off comparison of two local files.



-- 
Steven


Re: binary file compare...

2009-04-15 Thread Martin
On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
ste...@remove.this.cybersource.com.au wrote:
 The checksum does look at every byte in each file. Checksumming isn't a
 way to avoid looking at each byte of the two files, it is a way of
 mapping all the bytes to a single number.

My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one-way) mapping between a (possibly) longer byte sequence and
another one, but how does checksumming not take each byte in the
original sequence into account?

I'd still say rather burn CPU cycles than development hours (if I got
the question right), if not then with binary files you will have to
find some way of representing differences between the 2 files in a
readable manner anyway.

 Hashing is a *lot* more work than just comparing two bytes. The MD5
 checksum has been specifically designed to be fast and compact, and the
 algorithm is still complicated:

I know that the various checksum algorithms aren't exactly cheap, but
I do think that just to know whether 2 files are different, a solution
which takes 5 mins to implement wins hands down against a lengthy
discussion which optimizes too early.

regards,
martin

-- 
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


Re: binary file compare...

2009-04-15 Thread Grant Edwards
On 2009-04-15, Martin mar...@marcher.name wrote:
 Hi,

 On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards inva...@invalid wrote:
 On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

 I want to compare two binary files and see if they are the same.
 I see the filecmp.cmp function but I don't get a warm fuzzy feeling
 that it is doing a byte by byte comparison of two files to see if they
 are they same.

 Perhaps I'm being dim, but how else are you going to decide if
 two files are the same unless you compare the bytes in the
 files?

 I'd say checksums, just about every download relies on checksums to
 verify you do have indeed the same file.

That's slower than a byte-by-byte compare.

 You could hash them and compare the hashes, but that's a lot
 more work than just comparing the two byte streams.

 hashing is not exactly much work; in its simplest form it's 2
 lines per file.

I meant a lot more CPU time/cycles.

-- 
Grant Edwards   grante Yow! Was my SOY LOAF left
  at   out in th'RAIN?  It tastes
   visi.comREAL GOOD!!


Re: binary file compare...

2009-04-15 Thread Grant Edwards
On 2009-04-15, Martin mar...@marcher.name wrote:
 On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano

 I'd still say rather burn CPU cycles than development hours (if I got
 the question right),

_Hours_?  Calling the file compare module takes  _one_line_of_code_.

Implementing a file compare from scratch takes about a half
dozen lines of code.

 if not then with binary files you will have to find some way
 of representing differences between the 2 files in a readable
 manner anyway.

 1) Who said anything about a readable representation of the
differences?

 2) How does a checksum provide that?

 Hashing is a *lot* more work than just comparing two bytes.
 The MD5 checksum has been specifically designed to be fast and
 compact, and the algorithm is still complicated:

 I know that the various checksum algorithms aren't exactly
 cheap, but I do think that just to know wether 2 files are
 different a solution which takes 5mins to implement wins
 against a lengthy discussion

Bah.  A direct compare is trivial.  The discussion of which
checksum to use, how to implement it, and how reliable it is
will be far longer than any discussion over a direct
comparison.

 which optimizes too early wins hands down.

Optimizes too early?  Comparing the bytes is the simplest and
most direct, obvious solution.  It takes one line of code to
call the file compare module.  Implementing it from scratch
takes about five lines of code.

We all rail against premature optimization, but using a
checksum instead of a direct comparison is premature
unoptimization.  ;)
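For reference, the from-scratch version Grant estimates at half a dozen lines might look like this (the function name and buffer size are arbitrary choices of mine):

```python
def files_equal(f1, f2, bufsize=8192):
    # Chunked byte-by-byte comparison; bails out at the first difference
    with open(f1, 'rb') as a, open(f2, 'rb') as b:
        while True:
            b1, b2 = a.read(bufsize), b.read(bufsize)
            if b1 != b2:        # also catches files of unequal length
                return False
            if not b1:          # both streams exhausted together: equal
                return True
```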

-- 
Grant Edwards   grante Yow! Hmmm ... A hash-singer
  at   and a cross-eyed guy were
   visi.comSLEEPING on a deserted
   island, when ...


Re: binary file compare...

2009-04-15 Thread Nigel Rantor

Martin wrote:

On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
ste...@remove.this.cybersource.com.au wrote:

The checksum does look at every byte in each file. Checksumming isn't a
way to avoid looking at each byte of the two files, it is a way of
mapping all the bytes to a single number.


My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one-way) mapping between a (possibly) longer byte sequence and
another one, but how does checksumming not take each byte in the
original sequence into account?


The fact that two md5 hashes are equal does not mean that the sources 
they were generated from are equal. To do that you must still perform a 
byte-by-byte comparison which is much less work for the processor than 
generating an md5 or sha hash.


If you insist on using a hashing algorithm to determine the equivalence 
of two files you will eventually realise that it is a flawed plan 
because you will eventually find two files with different contents that 
nonetheless hash to the same value.


The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.

  n



Re: binary file compare...

2009-04-15 Thread Nigel Rantor

Grant Edwards wrote:

We all rail against premature optimization, but using a
checksum instead of a direct comparison is premature
unoptimization.  ;)


And more than that, will provide false positives for some inputs.

So, basically it's a worse-than-useless approach for determining if two 
files are the same.


   n


Re: binary file compare...

2009-04-15 Thread SpreadTooThin
On Apr 15, 8:04 am, Grant Edwards inva...@invalid wrote:
 On 2009-04-15, Martin mar...@marcher.name wrote:



  Hi,

  On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards inva...@invalid wrote:
  On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

  I want to compare two binary files and see if they are the same.
  I see the filecmp.cmp function but I don't get a warm fuzzy feeling
  that it is doing a byte by byte comparison of two files to see if they
  are they same.

  Perhaps I'm being dim, but how else are you going to decide if
  two files are the same unless you compare the bytes in the
  files?

  I'd say checksums, just about every download relies on checksums to
  verify you do have indeed the same file.

 That's slower than a byte-by-byte compare.

  You could hash them and compare the hashes, but that's a lot
  more work than just comparing the two byte streams.

  hashing is not exactly much work; in its simplest form it's 2
  lines per file.

 I meant a lot more CPU time/cycles.

 --
 Grant Edwards                   grante             Yow! Was my SOY LOAF left
                                   at               out in th'RAIN?  It tastes
                                visi.com            REAL GOOD!!

I'd like to add my 2 cents here.. (Thats 1.8 cents US)
All I was trying to get was a clarification of the documentation of
the cmp method.
It isn't clear.

byte by byte comparison is good enough for me as long as there are no
cache issues.
a checksum is not good because it can't tell 1 + 2 + 3 from 3 + 2 + 1:
different orderings sum to the same value.
a crc of any sort is more work than a byte by byte comparison and
doesn't give you any more information.




Re: binary file compare...

2009-04-15 Thread Adam Olsen
On Apr 15, 11:04 am, Nigel Rantor wig...@wiggly.org wrote:
 The fact that two md5 hashes are equal does not mean that the sources
 they were generated from are equal. To do that you must still perform a
 byte-by-byte comparison which is much less work for the processor than
 generating an md5 or sha hash.

 If you insist on using a hashing algorithm to determine the equivalence
 of two files you will eventually realise that it is a flawed plan
 because you will eventually find two files with different contents that
 nonetheless hash to the same value.

 The more files you test with the quicker you will find out this basic truth.

 This is not complex, it's a simple fact about how hashing algorithms work.

The only flaw on a cryptographic hash is the increasing number of
attacks that are found on it.  You need to pick a trusted one when you
start and consider replacing it every few years.

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.


Re: binary file compare...

2009-04-15 Thread Nigel Rantor

Adam Olsen wrote:

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.


Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Using hash functions to tell whether or not files are identical is an 
error waiting to happen.


But please, do so if it makes you feel happy, you'll just eventually get 
an incorrect result and not know it.


  n


binary file compare...

2009-04-14 Thread SpreadTooThin
I want to compare two binary files and see if they are the same.
I see the filecmp.cmp function but I don't get a warm fuzzy feeling
that it is doing a byte by byte comparison of two files to see if they
are they same.

What should I be using if not filecmp.cmp?


Re: binary file compare...

2009-04-14 Thread Adam Olsen
On Apr 13, 8:39 pm, Grant Edwards gra...@visi.com wrote:
 On 2009-04-13, Peter Otten __pete...@web.de wrote:

  But there's a cache. A change of file contents may go
  undetected as long as the file stats don't change:

 Good point.  You can fool it if you force the stats to their
 old values after you modify a file and you don't clear the
 cache.

The timestamps stored on the filesystem (for ext3 and most other
filesystems) are fairly coarse, so it's quite possible for a check/
update/check sequence to have the same timestamp at the beginning and
end.


Re: binary file compare...

2009-04-14 Thread Martin
Hi,

On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards inva...@invalid wrote:
 On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

 I want to compare two binary files and see if they are the same.
 I see the filecmp.cmp function but I don't get a warm fuzzy feeling
 that it is doing a byte by byte comparison of two files to see if they
 are they same.

 Perhaps I'm being dim, but how else are you going to decide if
 two files are the same unless you compare the bytes in the
 files?

I'd say checksums, just about every download relies on checksums to
verify you do have indeed the same file.


 You could hash them and compare the hashes, but that's a lot
 more work than just comparing the two byte streams.

hashing is not exactly much work; in its simplest form it's 2 lines per file.

$ dd if=/dev/urandom of=testfile.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.4491 s, 3.6 MB/s
$ dd if=/dev/urandom of=testfile2.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.92479 s, 2.7 MB/s
$ cp testfile.data testfile3.data
$ python
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update(file("testfile.data").read())
>>> sha.hexdigest()
'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
>>> sha2 = hashlib.sha256()
>>> sha2.update(file("testfile2.data").read())
>>> sha2.hexdigest()
'25597380f833f287e8dad936b15ddb616669102c38f54dbd60ce57998d99ad3b'
>>> sha3 = hashlib.sha256()
>>> sha3.update(file("testfile3.data").read())
>>> sha3.hexdigest()
'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
>>> sha.hexdigest() == sha2.hexdigest()
False
>>> sha.hexdigest() == sha3.hexdigest()
True
>>> sha2.hexdigest() == sha3.hexdigest()
False




-- 
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


Re: binary file compare...

2009-04-13 Thread Przemyslaw Kaminski
SpreadTooThin wrote:

 I want to compare two binary files and see if they are the same.
 I see the filecmp.cmp function but I don't get a warm fuzzy feeling
 that it is doing a byte by byte comparison of two files to see if they
 are they same.
 
 What should I be using if not filecmp.cmp?

Well, here's something
http://www.daniweb.com/forums/thread115959.html
but it seems from the post on the bottom that filecmp does comparison of 
binary files.



Re: binary file compare...

2009-04-13 Thread SpreadTooThin
On Apr 13, 2:00 pm, Przemyslaw Kaminski cge...@gmail.com wrote:
 SpreadTooThin wrote:
  I want to compare two binary files and see if they are the same.
  I see the filecmp.cmp function but I don't get a warm fuzzy feeling
  that it is doing a byte by byte comparison of two files to see if they
  are they same.

  What should I be using if not filecmp.cmp?

 Well, here's something http://www.daniweb.com/forums/thread115959.html
 but it seems from the post on the bottom that filecmp does comparison of
 binary files.

I just want to be clear, the comparison is not just based on file size
and creation date but by a byte by byte comparison of the data in each
file.


Re: binary file compare...

2009-04-13 Thread Grant Edwards
On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

 I want to compare two binary files and see if they are the same.
 I see the filecmp.cmp function but I don't get a warm fuzzy feeling
 that it is doing a byte by byte comparison of two files to see if they
 are they same.

Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?

You could hash them and compare the hashes, but that's a lot
more work than just comparing the two byte streams.

 What should I be using if not filecmp.cmp?

I don't understand what you've got against comparing the files
when you stated that what you wanted to do was compare the files.

-- 
Grant Edwards   grante Yow! Look DEEP into the
  at   OPENINGS!!  Do you see any
   visi.comELVES or EDSELS ... or a
   HIGHBALL?? ...


Re: binary file compare...

2009-04-13 Thread SpreadTooThin
On Apr 13, 2:03 pm, Grant Edwards inva...@invalid wrote:
 On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

  I want to compare two binary files and see if they are the same.
  I see the filecmp.cmp function but I don't get a warm fuzzy feeling
  that it is doing a byte by byte comparison of two files to see if they
  are they same.

 Perhaps I'm being dim, but how else are you going to decide if
 two files are the same unless you compare the bytes in the
 files?

 You could hash them and compare the hashes, but that's a lot
 more work than just comparing the two byte streams.

  What should I be using if not filecmp.cmp?

 I don't understand what you've got against comparing the files
 when you stated that what you wanted to do was compare the files.


I think its just the way the documentation was worded
http://www.python.org/doc/2.5.2/lib/module-filecmp.html

Unless shallow is given and is false, files with identical os.stat()
signatures are taken to be equal.
Files that were compared using this function will not be compared
again unless their os.stat() signature changes.

So to do a comparison:
filecmp.cmp(filea, fileb, False)
?




 --
 Grant Edwards                   grante             Yow! Look DEEP into the
                                   at               OPENINGS!!  Do you see any
                                visi.com            ELVES or EDSELS ... or a
                                                    HIGHBALL?? ...



Re: binary file compare...

2009-04-13 Thread Grant Edwards
On 2009-04-13, Grant Edwards inva...@invalid wrote:
 On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

 I want to compare two binary files and see if they are the same.
 I see the filecmp.cmp function but I don't get a warm fuzzy feeling
 that it is doing a byte by byte comparison of two files to see if they
 are they same.

 Perhaps I'm being dim, but how else are you going to decide if
 two files are the same unless you compare the bytes in the
 files?

 You could hash them and compare the hashes, but that's a lot
 more work than just comparing the two byte streams.

 What should I be using if not filecmp.cmp?

 I don't understand what you've got against comparing the files
 when you stated that what you wanted to do was compare the files.

Doh!  I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-byte
compare.  Now I'm a bit confused.  Are you under the impression
it's _not_ doing a byte-byte compare?  Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me.  Note that when
this function is called the file lengths have already been
compared and found to be equal.

-- 
Grant Edwards   grante Yow! Alright, you!!
  at   Imitate a WOUNDED SEAL
   visi.compleading for a PARKING
   SPACE!!


Re: binary file compare...

2009-04-13 Thread SpreadTooThin
On Apr 13, 2:37 pm, Grant Edwards inva...@invalid wrote:
 On 2009-04-13, Grant Edwards inva...@invalid wrote:



  On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

  I want to compare two binary files and see if they are the same.
  I see the filecmp.cmp function but I don't get a warm fuzzy feeling
  that it is doing a byte by byte comparison of two files to see if they
  are they same.

  Perhaps I'm being dim, but how else are you going to decide if
  two files are the same unless you compare the bytes in the
  files?

  You could hash them and compare the hashes, but that's a lot
  more work than just comparing the two byte streams.

  What should I be using if not filecmp.cmp?

  I don't understand what you've got against comparing the files
  when you stated that what you wanted to do was compare the files.

 Doh!  I misread your post and thought were weren't getting a
 warm fuzzying feeling _because_ it was doing a byte-byte
 compare. Now I'm a bit confused.  Are you under the impression
 it's _not_ doing a byte-byte compare?  Here's the code:

 def _do_cmp(f1, f2):
     bufsize = BUFSIZE
     fp1 = open(f1, 'rb')
     fp2 = open(f2, 'rb')
     while True:
         b1 = fp1.read(bufsize)
         b2 = fp2.read(bufsize)
         if b1 != b2:
             return False
         if not b1:
             return True

 It looks like a byte-by-byte comparison to me.  Note that when
 this function is called the file lengths have already been
 compared and found to be equal.

 --
 Grant Edwards                   grante             Yow! Alright, you!!
                                   at               Imitate a WOUNDED SEAL
                                visi.com            pleading for a PARKING
                                                    SPACE!!

I am indeed under the impression that it is not always doing a byte by
byte comparison...
as well the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seem... makes me wonder.

Thanks for the code! :)



Re: binary file compare...

2009-04-13 Thread Peter Otten
Grant Edwards wrote:

 On 2009-04-13, Grant Edwards inva...@invalid wrote:
 On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:

 I want to compare two binary files and see if they are the same.
 I see the filecmp.cmp function but I don't get a warm fuzzy feeling
 that it is doing a byte by byte comparison of two files to see if they
 are they same.

 Perhaps I'm being dim, but how else are you going to decide if
 two files are the same unless you compare the bytes in the
 files?

 You could hash them and compare the hashes, but that's a lot
 more work than just comparing the two byte streams.

 What should I be using if not filecmp.cmp?

 I don't understand what you've got against comparing the files
 when you stated that what you wanted to do was compare the files.
 
 Doh!  I misread your post and thought were weren't getting a
 warm fuzzying feeling _because_ it was doing a byte-byte
 compare. Now I'm a bit confused.  Are you under the impression
 it's _not_ doing a byte-byte compare?  Here's the code:
 
 def _do_cmp(f1, f2):
     bufsize = BUFSIZE
     fp1 = open(f1, 'rb')
     fp2 = open(f2, 'rb')
     while True:
         b1 = fp1.read(bufsize)
         b2 = fp2.read(bufsize)
         if b1 != b2:
             return False
         if not b1:
             return True
 
 It looks like a byte-by-byte comparison to me.  Note that when
 this function is called the file lengths have already been
 compared and found to be equal.

But there's a cache. A change of file contents may go undetected as long as
the file stats don't change:

$ cat fool_filecmp.py
import filecmp, shutil, sys

for fn in "adb":
    with open(fn, "w") as f:
        f.write("yadda")

shutil.copystat("d", "a")
filecmp.cmp("a", "b", False)

with open("a", "w") as f:
    f.write("*****")
shutil.copystat("d", "a")

if "--clear" in sys.argv:
    print "clearing cache"
    filecmp._cache.clear()

if filecmp.cmp("a", "b", False):
    print "file a and b are equal"
else:
    print "file a and b differ"
print "a's contents:", open("a").read()
print "b's contents:", open("b").read()

$ python2.6 fool_filecmp.py
file a and b are equal
a's contents: *****
b's contents: yadda

Oops. If you are paranoid you have to clear the cache before doing the
comparison:

$ python2.6 fool_filecmp.py --clear
clearing cache
file a and b differ
a's contents: *****
b's contents: yadda

Peter


Re: binary file compare...

2009-04-13 Thread Steven D'Aprano
On Mon, 13 Apr 2009 15:03:32 -0500, Grant Edwards wrote:

 On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:
 
 I want to compare two binary files and see if they are the same. I see
 the filecmp.cmp function but I don't get a warm fuzzy feeling that it
 is doing a byte by byte comparison of two files to see if they are they
 same.
 
 Perhaps I'm being dim, but how else are you going to decide if two files
 are the same unless you compare the bytes in the files?

If you start with an image in one format (e.g. PNG), and convert it to 
another format (e.g. JPEG), you might want the two files to compare equal 
even though their byte contents are completely different, because their 
content (the image itself) is visually identical.

Or you might want a heuristic as a short cut for comparing large files, 
and decide that if two files have the same size and modification dates, 
and the first (say) 100KB are equal, that you will assume the rest are 
probably equal too.

Neither of these are what the OP wants, I'm just mentioning them to 
answer your rhetorical question :)



-- 
Steven


Re: binary file compare...

2009-04-13 Thread Dave Angel

SpreadTooThin wrote:

On Apr 13, 2:37 pm, Grant Edwards inva...@invalid wrote:
  

On 2009-04-13, Grant Edwards inva...@invalid wrote:





On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote:
  

I want to compare two binary files and see if they are the same.
I see the filecmp.cmp function but I don't get a warm fuzzy feeling
that it is doing a byte by byte comparison of two files to see if they
are they same.


Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?
  
You could hash them and compare the hashes, but that's a lot

more work than just comparing the two byte streams.
  

What should I be using if not filecmp.cmp?


I don't understand what you've got against comparing the files
when you stated that what you wanted to do was compare the files.
  

Doh!  I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-byte
compare. Now I'm a bit confused.  Are you under the impression
it's _not_ doing a byte-byte compare?  Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me.  Note that when
this function is called the file lengths have already been
compared and found to be equal.

--
Grant Edwards               grante at visi.com
                            Yow! Alright, you!! Imitate a WOUNDED SEAL
                            pleading for a PARKING SPACE!!



I am indeed under the impression that it is not always doing a byte by
byte comparison...
as well the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seem... makes me wonder.

Thanks for the code! :)


  
Some of this discussion depends on the version of Python, but nobody said 
which.  In version 2.6.1, the code is different (and more complex) than 
what's listed above.  The docs are different too.  In this version, at 
least, you'll want to explicitly pass the shallow=False parameter.  It 
defaults to 1, by which they must mean True.  I think it's a bad 
default, but it's still a useful function.  Just be careful to include 
that parameter in your call.
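Concretely, a minimal sketch of the call Dave recommends (the `shallow` 
keyword is part of the documented `filecmp.cmp` API; the temp files here are 
just illustrative):

```python
import filecmp
import os
import tempfile

d = tempfile.mkdtemp()
f1 = os.path.join(d, "a.bin")
f2 = os.path.join(d, "b.bin")
for p in (f1, f2):
    with open(p, "wb") as f:
        f.write(b"\x00" * 1024)

# shallow=False forces the byte-by-byte comparison path instead of
# trusting the os.stat() signature (type, size, mtime) alone.
print(filecmp.cmp(f1, f2, shallow=False))   # prints True
```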


Further, you'll want to check the copy of filecmp.py included with your own 
version.  The file is in the Lib directory, so it's no trouble to check it.





Re: binary file compare...

2009-04-13 Thread Grant Edwards
On 2009-04-13, Peter Otten __pete...@web.de wrote:

 But there's a cache. A change of file contents may go
 undetected as long as the file stats don't change:

Good point.  You can fool it if you force the stats to their
old values after you modify a file and you don't clear the
cache.
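Peter's observation can be reproduced directly.  A sketch, assuming Python 
3.4+ where `filecmp.clear_cache()` is public (older versions keep the cache 
in a private module-level dict):

```python
import filecmp
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a.bin")
b = os.path.join(d, "b.bin")
for p in (a, b):
    with open(p, "wb") as f:
        f.write(b"AAAA")

# A deep comparison reads both files once and caches the verdict,
# keyed on each file's (type, size, mtime) stat signature.
first = filecmp.cmp(a, b, shallow=False)          # True, now cached

st = os.stat(a)
with open(a, "wb") as f:
    f.write(b"BBBB")                              # same size, new bytes
os.utime(a, ns=(st.st_atime_ns, st.st_mtime_ns))  # forge the old stats

stale = filecmp.cmp(a, b, shallow=False)          # still True: cache hit
filecmp.clear_cache()
fresh = filecmp.cmp(a, b, shallow=False)          # False: really re-read
print(first, stale, fresh)
```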

-- 
Grant
