Re: binary file compare...
Adam Olsen rha...@gmail.com (AO) wrote: AO The Wayback Machine has 150 billion pages, so 2**37. Google's index AO is a bit larger at over a trillion pages, so 2**40. A little closer AO than I'd like, but that's still 562949953421312 to 1 odds of having AO *any* collisions between *any* of the files. Step up to SHA-256 and AO it becomes about 1.9 × 10**53 to AO 1. Sadly, I can't even give you the odds for SHA-512, Qalculate AO considers that too close to infinite to display. :) AO You should worry more about your head spontaneously exploding than you AO should about a hash collision on that scale. To do otherwise is AO irrational paranoia. And that is the probability of there being two files in that huge collection with the same hash. If you just take two files, not fabricated to collide, the probability of them having the same hash under MD5 is 2**-128, which I think is way smaller than the probability of the bit representing the answer being flipped by some physical cause in your computer. But then again, it doesn't make sense to use that instead of a byte-by-byte comparison if the files are on the same machine. -- Piet van Oostrum p...@cs.uu.nl URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: p...@vanoostrum.org -- http://mail.python.org/mailman/listinfo/python-list
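Piet's closing point, that for local files you should just read the bytes, can be sketched in a few lines of Python. This is an illustrative snippet, not code from the thread; the standard library's filecmp.cmp(a, b, shallow=False) does the same job.

```python
def files_identical(path_a, path_b, chunk_size=1 << 16):
    """Exact byte-by-byte comparison of two local files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False  # first differing chunk settles it
            if not block_a:
                return True   # both files exhausted at the same point
```

It stops at the first differing chunk, so mismatched files are usually rejected after a single read.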
Re: binary file compare...
Adam Olsen wrote: On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote: And yes, he is right: CRCs and hashes all have a probability of saying that the files are identical when in fact they are not. Here's the bottom line. It is either: A) Several hundred years of mathematics and cryptography are wrong. The birthday problem as described is incorrect, so a collision is far more likely than 42 trillion trillion to 1. You are simply the first person to have noticed it. B) Your software was buggy, or possibly the input was maliciously produced. Or, a really tiny chance that your particular files contained a pattern that provoked bad behaviour from MD5. Finding a specific limitation of the algorithm is one thing. Claiming that the math is fundamentally wrong is quite another. You are confusing yourself about probabilities, young man. Just because something is extremely unlikely does not mean it can't happen on the first attempt. This is true *no matter how big the numbers are*. If you persist in making these ridiculous claims that people *cannot* have found collisions then, as I said, that's up to you, but I'm not going to employ you to do anything except make tea. Thanks, Nigel
Re: binary file compare...
Adam Olsen wrote: On Apr 16, 4:27 pm, Rhodri James rho...@wildebst.demon.co.uk wrote: On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen rha...@gmail.com wrote: On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote: Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. I'm afraid you will need to back up your claims with real files. So that would be a no then. If the implementation of dicts in Python, say, were to assert as you are that the hashes aren't going to collide, then I'd have to walk away from it. There's no point in using something that guarantees a non-zero chance of corrupting your data. Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or 65 thousand) will give you a decent chance of a collision. In contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or 2**256 (SHA-512). The two are at totally different extremes. I'm just going to go ahead and take the above as an admission by you that the chance of collision is non-zero, and that if we accept that fact you cannot rely on a hash function to tell you if two files are identical. Thanks. There is *always* a non-zero chance of corruption, due to software bugs, hardware defects, or even operator error. It is only in that broader context that you can realize just how minuscule the risk is. Please explain why you're dragging the notion of corruption into this when it seems to be beside the point? Can you explain to me why you justify great lengths of paranoia, when the risk is so much lower? Because in the real world, where I work, in practical, real, factual terms I have seen it happen. Not once. Not twice. But many, many times. 
Why are you advocating a solution to the OP's problem that is more computationally expensive than a simple byte-by-byte comparison and doesn't guarantee to give the correct answer? For a single, one-off comparison I have no problem with a byte-by-byte comparison. There's a decent chance the files won't be in the OS's cache anyway, so disk IO will be your bottleneck. Only if you're doing multiple comparisons is a hash database justified. Even then, if you expect matching files to be fairly rare I won't lose any sleep if you're paranoid and do a byte-by-byte comparison anyway. New vulnerabilities are found, and if you don't update promptly there is a small (but significant) chance of a malicious file leading to collision. If I have a number of files then I would certainly use a hash as a quick test, but if two files hash to the same value I still have to go compare them. Hashing in this case saves time doing a comparison for each file, but it doesn't obviate the need to do a byte-by-byte comparison to see if the files that hash to the same value are actually the same. That's not my concern though. What I'm responding to is Nigel Rantor's grossly incorrect statements about probability. The chance of collision, in our lifetime, is *insignificant*. Please tell me which statements? That the chance of two files hashing to the same value is non-zero? You admit as much above. Also, please remember I gave you a way of verifying what I said: go crawl the web for images, pages, whatever, build a hash DB and tell me how long it takes you to find a collision using MD5 (which is the hash I was talking about when I told you I had real-world experience to back up the theoretical claim that collisions occur). Regards, Nigel
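The workflow Nigel describes, hash as a cheap pre-filter with a byte-by-byte comparison as the final word, might look like the following sketch. The function name and structure are illustrative, not from the thread, and whole files are read into memory for brevity (chunk the reads for big files).

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_pairs(paths):
    """Group by size, then by MD5 digest, then confirm byte-by-byte."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    pairs = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_digest = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_digest[hashlib.md5(f.read()).hexdigest()].append(p)
        for candidates in by_digest.values():
            # a matching hash only nominates candidates; the verdict is
            # still a full byte-by-byte comparison
            for i in range(len(candidates)):
                for j in range(i + 1, len(candidates)):
                    with open(candidates[i], "rb") as f1, \
                         open(candidates[j], "rb") as f2:
                        if f1.read() == f2.read():
                            pairs.append((candidates[i], candidates[j]))
    return pairs
```

Grouping by size first means most non-duplicates never get hashed at all.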
Re: binary file compare...
On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote: The Wayback Machine has 150 billion pages, so 2**37. Google's index is a bit larger at over a trillion pages, so 2**40. A little closer than I'd like, but that's still 562949953421312 to 1 odds of having *any* collisions between *any* of the files. Step up to SHA-256 and it becomes about 1.9 × 10**53 to 1. Sadly, I can't even give you the odds for SHA-512, Qalculate considers that too close to infinite to display. :) That might be true as long as your data is completely uniformly distributed. For the example you give there's: a) a high chance that there's html near the top b) a non-uniform distribution of individual words within the text. c) a non-uniform distribution of all n-grams within the text (as there is in natural language) So it's very far from uniformly distributed. Just about the only situation where I could imagine that holding would be where you are hashing uniformly random data for the sake of testing the hash. I believe the point being made is that comparing hash values is a probabilistic algorithm anyway, which is fine if you're OK with that, but for mission-critical software it's crazy. Tim Wintle
Re: binary file compare...
Adam Olsen wrote: On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote: And yes, he is right: CRCs and hashes all have a probability of saying that the files are identical when in fact they are not. Here's the bottom line. It is either: A) Several hundred years of mathematics and cryptography are wrong. The birthday problem as described is incorrect, so a collision is far more likely than 42 trillion trillion to 1. You are simply the first person to have noticed it. B) Your software was buggy, or possibly the input was maliciously produced. Or, a really tiny chance that your particular files contained a pattern that provoked bad behaviour from MD5. Finding a specific limitation of the algorithm is one thing. Claiming that the math is fundamentally wrong is quite another. Spending a lifetime in applied math has taught me: 1) All applied math is finite. 2) Any algorithm failing to handle all contingencies is flawed. The meaning of 1) is that it is limited in what it can actually do. The meaning of 2) is that the designer missed or left out something. Neither should be taken as bad. Both need to be accepted 'as is' and the decision to use (when, where, conditions) based on the probability of non-failure. ...a pattern that provoked bad behavior... does mean the algorithm is incomplete and may be fundamentally wrong. Underscore 'is' and 'may'. The more complicated the math, the harder it is to keep a higher form of math from checking (or improperly displacing) a lower one. Which, of course, breaks the rules. Commonly called improper thinking. A number of math teasers make use of that. Steve
Re: binary file compare...
On Apr 17, 4:54 am, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote: And yes, he is right: CRCs and hashes all have a probability of saying that the files are identical when in fact they are not. Here's the bottom line. It is either: A) Several hundred years of mathematics and cryptography are wrong. The birthday problem as described is incorrect, so a collision is far more likely than 42 trillion trillion to 1. You are simply the first person to have noticed it. B) Your software was buggy, or possibly the input was maliciously produced. Or, a really tiny chance that your particular files contained a pattern that provoked bad behaviour from MD5. Finding a specific limitation of the algorithm is one thing. Claiming that the math is fundamentally wrong is quite another. You are confusing yourself about probabilities, young man. Just because something is extremely unlikely does not mean it can't happen on the first attempt. This is true *no matter how big the numbers are*. If you persist in making these ridiculous claims that people *cannot* have found collisions then, as I said, that's up to you, but I'm not going to employ you to do anything except make tea. Thanks, Nigel You know this is just insane. I'd be satisfied with a CRC16 or something in the situation I'm in. I have two large files, one local and one remote. Transferring every byte across the internet to be sure that the two files are identical is just not feasible. If the two servers, one on each side, both calculate the CRCs and transmit them for comparison, I'm happy.
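For the remote-file case, each server can compute a checksum locally and ship only a few bytes across the wire. Python's standard library exposes CRC-32 (a close cousin of the CRC-16 mentioned) as zlib.crc32; a chunked version, as an illustrative sketch, looks like:

```python
import zlib

def file_crc32(path, chunk_size=1 << 16):
    """CRC-32 of a file, computed incrementally so large files never
    have to fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF  # normalize to an unsigned 32-bit value
```

Each side runs this and the two integers are compared; swapping in hashlib.md5() or hashlib.sha256() with .update() and .hexdigest() gives the stronger-hash variant.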
Re: binary file compare...
On Apr 17, 5:30 am, Tim Wintle tim.win...@teamrubber.com wrote: On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote: The Wayback Machine has 150 billion pages, so 2**37. Google's index is a bit larger at over a trillion pages, so 2**40. A little closer than I'd like, but that's still 562949953421312 to 1 odds of having *any* collisions between *any* of the files. Step up to SHA-256 and it becomes about 1.9 × 10**53 to 1. Sadly, I can't even give you the odds for SHA-512, Qalculate considers that too close to infinite to display. :) That might be true as long as your data is completely uniformly distributed. For the example you give there's: a) a high chance that there's html near the top b) a non-uniform distribution of individual words within the text. c) a non-uniform distribution of all n-grams within the text (as there is in natural language) So it's very far from uniformly distributed. Just about the only situation where I could imagine that holding would be where you are hashing uniformly random data for the sake of testing the hash. I believe the point being made is that comparing hash values is a probabilistic algorithm anyway, which is fine if you're OK with that, but for mission-critical software it's crazy. Actually, *cryptographic* hashes handle that just fine. Even for files with just a 1 bit change the output is totally different. This is known as the Avalanche Effect. Otherwise they'd be vulnerable to attacks. Which isn't to say you couldn't *construct* a pattern that it would be vulnerable to. Figuring that out is pretty much the whole point of attacking a cryptographic hash. MD5 has significant vulnerabilities by now, and others will have them in the future. That's just a risk you need to manage.
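The avalanche effect Adam describes is easy to observe directly: flip a single input bit and compare the digests. A small illustrative experiment with SHA-256:

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
tweaked = bytearray(original)
tweaked[0] ^= 0x01  # flip one bit of the first byte

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(bytes(tweaked)).digest()

# count how many of the 256 digest bits differ
differing_bits = sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))
print(differing_bits)  # for a good hash, typically close to half of 256
```

Non-uniform input makes no difference here: the digest of any pair of distinct inputs is as good as independent, which is why the uniformity objection doesn't bite for cryptographic hashes.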
Re: binary file compare...
On Apr 17, 9:59 am, norseman norse...@hughes.net wrote: The more complicated the math, the harder it is to keep a higher form of math from checking (or improperly displacing) a lower one. Which, of course, breaks the rules. Commonly called improper thinking. A number of math teasers make use of that. Of course, designing a hash is hard. That's why the *recommended* ones get so many years of peer review and attempted attacks first. I'd love it if Nigel provided evidence that MD5 was broken, I really would. It'd be quite interesting to investigate, assuming malicious content can be ruled out. Of course even he doesn't think that. He claims that his 42 trillion trillion to 1 odds happened not just once, but multiple times.
Re: binary file compare...
On Apr 17, 9:59 am, SpreadTooThin bjobrie...@gmail.com wrote: You know this is just insane. I'd be satisfied with a CRC16 or something in the situation I'm in. I have two large files, one local and one remote. Transferring every byte across the internet to be sure that the two files are identical is just not feasible. If the two servers, one on each side, both calculate the CRCs and transmit them for comparison, I'm happy. Definitely use a hash, ignore Nigel. SHA-256 or SHA-512. Or, if you might need to update one of the files, look at rsync. Rsync still uses MD4 and MD5 (optionally!), but they're fine in a trusted environment.
Re: binary file compare...
In message mailman.3934.1239821812.11746.python-l...@python.org, Nigel Rantor wrote: Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Not with any cryptographically-strong hash, you haven't.
Re: binary file compare...
On Fri, 17 Apr 2009 11:19:31 -0700, Adam Olsen wrote: Actually, *cryptographic* hashes handle that just fine. Even for files with just a 1 bit change the output is totally different. This is known as the Avalanche Effect. Otherwise they'd be vulnerable to attacks. Which isn't to say you couldn't *construct* a pattern that it would be vulnerable to. Figuring that out is pretty much the whole point of attacking a cryptographic hash. MD5 has significant vulnerabilities by now, and others will have them in the future. That's just a risk you need to manage. There are demonstrated methods for constructing MD5 collisions against an arbitrary file in about a minute on a laptop computer. http://en.wikipedia.org/wiki/MD5 -- Steven
Re: binary file compare...
On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Using hash functions to tell whether or not files are identical is an error waiting to happen. But please, do so if it makes you feel happy, you'll just eventually get an incorrect result and not know it. Please tell us what hash you used and provide the two files that collided. If your hash is 256 bits, then you need around 2**128 files to produce a collision. This is known as a Birthday Attack. I seriously doubt you had that many files, which suggests something else went wrong.
Re: binary file compare...
On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Using hash functions to tell whether or not files are identical is an error waiting to happen. But please, do so if it makes you feel happy, you'll just eventually get an incorrect result and not know it. Please tell us what hash you used and provide the two files that collided. MD5 If your hash is 256 bits, then you need around 2**128 files to produce a collision. This is known as a Birthday Attack. I seriously doubt you had that many files, which suggests something else went wrong. Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. I'm afraid you will need to back up your claims with real files. Although MD5 is a smaller, older hash (128 bits, so you only need 2**64 files to find collisions), and it has substantial known vulnerabilities, the scenario you suggest where you *accidentally* find collisions (and you imply multiple collisions!) would be a rather significant finding. Please help us all by justifying your claim. Mind you, since you use MD5 I wouldn't be surprised if your files were maliciously produced. As I said before, you need to consider upgrading your hash every few years to avoid new attacks.
Re: binary file compare...
Adam Olsen wrote: On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Using hash functions to tell whether or not files are identical is an error waiting to happen. But please, do so if it makes you feel happy, you'll just eventually get an incorrect result and not know it. Please tell us what hash you used and provide the two files that collided. MD5 If your hash is 256 bits, then you need around 2**128 files to produce a collision. This is known as a Birthday Attack. I seriously doubt you had that many files, which suggests something else went wrong. Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. I'm afraid you will need to back up your claims with real files. Although MD5 is a smaller, older hash (128 bits, so you only need 2**64 files to find collisions), and it has substantial known vulnerabilities, the scenario you suggest where you *accidentally* find collisions (and you imply multiple collisions!) would be a rather significant finding. No. It wouldn't. It isn't. The files in question were millions of audio files. I no longer work at the company where I had access to them so I cannot give you examples, and even if I did Data Protection regulations wouldn't have allowed it. If you still don't believe me you can easily verify what I'm saying by doing some simple experiments.
Go spider the web for images, keep collecting them until you get an MD5 hash collision. It won't take long. Please help us all by justifying your claim. Now, please go and re-read my request first and admit that everything I have said so far is correct. Mind you, since you use MD5 I wouldn't be surprised if your files were maliciously produced. As I said before, you need to consider upgrading your hash every few years to avoid new attacks. Good grief, this is nothing to do with security concerns, this is about someone suggesting to the OP that they use a hash function to determine whether or not two files are identical. Regards, Nige
Re: binary file compare...
Adam Olsen wrote: On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Using hash functions to tell whether or not files are identical is an error waiting to happen. But please, do so if it makes you feel happy, you'll just eventually get an incorrect result and not know it. Please tell us what hash you used and provide the two files that collided. MD5 If your hash is 256 bits, then you need around 2**128 files to produce a collision. This is known as a Birthday Attack. I seriously doubt you had that many files, which suggests something else went wrong. Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. Which is the only thing I've been saying. Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice. If you are of the opinion though that a hash function can be used to tell you whether or not two files are identical then you are wrong. It really is that simple. I'm not sitting here discussing this for my health, I'm just trying to give the OP the benefit of my experience, I have worked with other people who insisted on this route and had to find out the hard way that it was a Bad Idea (tm). They just wouldn't be told. Regards, Nige
Re: binary file compare...
On 2009-04-16, Adam Olsen rha...@gmail.com wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. I must admit, that is a bit hard to believe. Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. I'm afraid you will need to back up your claims with real files. Although MD5 is a smaller, older hash (128 bits, so you only need 2**64 files to find collisions), You don't need quite that many to have a significant chance of a collision. With only something on the order of 2**61 files, you still have about a 1% chance of a collision. Just for fun, here are a few data points showing the probability of at least one collision in a sample of size n using a perfect 128-bit hash. These were calculated using the Taylor-series approximation shown at http://en.wikipedia.org/wiki/Birthday_paradox#Calculating_the_probability For a few million files (we'll say 4e6), the probability of a collision is so close to 0 that it can't be calculated using double-precision IEEE floats. For a few _billion_ files, the Taylor-series approximation is still 0 to about 15 significant digits. For a few _trillion_ files, you finally get a non-zero probability of about 2e-14. Here are a few more points:
n = 1806791225998070; p(n) ≈ 4.8e-9
n = 26055724725896552; p(n) ≈ 1.0e-6
n = 89951428441363312; p(n) ≈ 1.2e-5
n = 282306115546963328; p(n) ≈ 1.2e-4
n = 2089135815840156928; p(n) ≈ 6.4e-3
Re: binary file compare...
On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: On Apr 15, 12:56 pm, Nigel Rantor wig...@wiggly.org wrote: Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Using hash functions to tell whether or not files are identical is an error waiting to happen. But please, do so if it makes you feel happy, you'll just eventually get an incorrect result and not know it. Please tell us what hash you used and provide the two files that collided. MD5 If your hash is 256 bits, then you need around 2**128 files to produce a collision. This is known as a Birthday Attack. I seriously doubt you had that many files, which suggests something else went wrong. Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. Which is the only thing I've been saying. Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice. If you are of the opinion though that a hash function can be used to tell you whether or not two files are identical then you are wrong. It really is that simple. I'm not sitting here discussing this for my health, I'm just trying to give the OP the benefit of my experience, I have worked with other people who insisted on this route and had to find out the hard way that it was a Bad Idea (tm). They just wouldn't be told. Regards, Nige And yes, he is right: CRCs and hashes all have a probability of saying that the files are identical when in fact they are not.
Re: binary file compare...
On Apr 16, 8:59 am, Grant Edwards inva...@invalid wrote: On 2009-04-16, Adam Olsen rha...@gmail.com wrote: I'm afraid you will need to back up your claims with real files. Although MD5 is a smaller, older hash (128 bits, so you only need 2**64 files to find collisions), You don't need quite that many to have a significant chance of a collision. With only something on the order of 2**61 files, you still have about a 1% chance of a collision. Aye, 2**64 is more of the middle of the curve or so. You can still go either way. What's important is the order of magnitude required. For a few million files (we'll say 4e6), the probability of a collision is so close to 0 that it can't be calculated using double-precision IEEE floats. ≈ 2.3509887 × 10**-26. Or about 4.25 × 10**25 to 1. Or 42 trillion trillion to 1. Here's the Python function I'm using: def bp(n, d): return 1.0 - exp(-n*(n-1.)/(2.*d)) I haven't spent much time studying the numerical issues of the way that the exponent is calculated, so I'm not entirely confident in the results for small n values such that p(n) == 0.0. Try using Qalculate. I always resort to it for things like this.
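The numerical issue flagged here, 1.0 - exp(-x) rounding to exactly 0.0 once x drops below double precision, is avoidable: math.expm1 computes exp(x) - 1 accurately for tiny arguments. A reworked version of the quoted function, as an illustrative sketch rather than code from the thread:

```python
from math import expm1

def bp(n, d):
    """Birthday approximation: probability of at least one collision
    among n samples drawn from a space of size d.  Using -expm1(-x)
    instead of 1.0 - exp(-x) preserves precision when x is tiny."""
    return -expm1(-n * (n - 1.0) / (2.0 * d))
```

With this, bp(4 * 10**6, 2**128) comes out near 2.35e-26 instead of underflowing to 0.0, which is exactly the few-million-file case the post couldn't evaluate.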
Re: binary file compare...
On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen rha...@gmail.com wrote: On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote: Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. I'm afraid you will need to back up your claims with real files. So that would be a no then. If the implementation of dicts in Python, say, were to assert as you are that the hashes aren't going to collide, then I'd have to walk away from it. There's no point in using something that guarantees a non-zero chance of corrupting your data. Why are you advocating a solution to the OP's problem that is more computationally expensive than a simple byte-by-byte comparison and doesn't guarantee to give the correct answer? -- Rhodri James *-* Wildebeeste Herder to the Masses
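For what it's worth, CPython's dicts make no such assertion: the hash only narrows the search, and colliding keys are then told apart by a full __eq__ check, the same hash-then-verify structure under discussion. A toy demonstration (the Key class is invented for illustration):

```python
class Key:
    """Every instance returns the same hash, forcing collisions."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42  # deliberately useless hash
    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

d = {Key("a"): 1, Key("b"): 2}
# despite identical hashes, the dict distinguishes the keys via __eq__
print(d[Key("a")], d[Key("b")])  # 1 2
```

So a colliding hash degrades a dict's speed, never its correctness, which is the distinction between hashing as an index and hashing as a proof of identity.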
Re: binary file compare...
On Apr 16, 11:15 am, SpreadTooThin bjobrie...@gmail.com wrote: And yes, he is right: CRCs and hashes all have a probability of saying that the files are identical when in fact they are not. Here's the bottom line. It is either: A) Several hundred years of mathematics and cryptography are wrong. The birthday problem as described is incorrect, so a collision is far more likely than 42 trillion trillion to 1. You are simply the first person to have noticed it. B) Your software was buggy, or possibly the input was maliciously produced. Or, a really tiny chance that your particular files contained a pattern that provoked bad behaviour from MD5. Finding a specific limitation of the algorithm is one thing. Claiming that the math is fundamentally wrong is quite another.
Re: binary file compare...
On Apr 16, 4:27 pm, Rhodri James rho...@wildebst.demon.co.uk wrote: On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen rha...@gmail.com wrote: On Apr 16, 3:16 am, Nigel Rantor wig...@wiggly.org wrote: Okay, before I tell you about the empirical, real-world evidence I have could you please accept that hashes collide and that no matter how many samples you use the probability of finding two files that do collide is small but not zero. I'm afraid you will need to back up your claims with real files. So that would be a no then. If the implementation of dicts in Python, say, were to assert as you are that the hashes aren't going to collide, then I'd have to walk away from it. There's no point in using something that guarantees a non-zero chance of corrupting your data. Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or 65 thousand) will give you a decent chance of a collision. In contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or 2**256 (SHA-512). The two are at totally different extremes. There is *always* a non-zero chance of corruption, due to software bugs, hardware defects, or even operator error. It is only in that broader context that you can realize just how minuscule the risk is. Can you explain to me why you justify great lengths of paranoia, when the risk is so much lower? Why are you advocating a solution to the OP's problem that is more computationally expensive than a simple byte-by-byte comparison and doesn't guarantee to give the correct answer? For single, one-off comparison I have no problem with a byte-by-byte comparison. There's a decent chance the files won't be in the OS's cache anyway, so disk IO will be your bottleneck. Only if you're doing multiple comparisons is a hash database justified. Even then, if you expect matching files to be fairly rare I won't lose any sleep if you're paranoid and do a byte-by-byte comparison anyway. 
New vulnerabilities are found, and if you don't update promptly there is a small (but significant) chance of a malicious file leading to collision. That's not my concern though. What I'm responding to is Nigel Rantor's grossly incorrect statements about probability. The chance of collision, in our life time, is *insignificant*. The Wayback Machine has 150 billion pages, so 2**37. Google's index is a bit larger at over a trillion pages, so 2**40. A little closer than I'd like, but that's still 56294995000 to 1 odds of having *any* collisions between *any* of the files. Step up to SHA-256 and it becomes 1915619400 to 1. Sadly, I can't even give you the odds for SHA-512, Qalculate considers that too close to infinite to display. :) You should worry more about your head spontaneously exploding than you should about a hash collision on that scale. To do otherwise is irrational paranoia. -- http://mail.python.org/mailman/listinfo/python-list
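The compromise described here -- a hash database to find candidate matches, with a byte-by-byte check for the paranoid -- might be sketched like this (Python 3; `find_duplicates` and its parameters are illustrative names, not anything from the thread):

```python
import filecmp
import hashlib
from collections import defaultdict

def find_duplicates(paths, chunk_size=64 * 1024):
    """Group files by SHA-256 digest, then confirm each candidate
    pair byte-by-byte so that even a hash collision cannot produce
    a false positive."""
    by_digest = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        by_digest[h.hexdigest()].append(path)

    confirmed = []
    for group in by_digest.values():
        # Only files whose digests already match get the slow check,
        # so the paranoia costs one extra pass per candidate pair.
        for other in group[1:]:
            if filecmp.cmp(group[0], other, shallow=False):
                confirmed.append((group[0], other))
    return confirmed
```

This gets the best of both positions in the thread: the hash does the O(n) candidate search, and the direct comparison guarantees the final answer.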
Re: binary file compare...
On Wed, 15 Apr 2009 07:54:20 +0200, Martin wrote: Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? I'd say checksums, just about every download relies on checksums to verify you do have indeed the same file. The checksum does look at every byte in each file. Checksumming isn't a way to avoid looking at each byte of the two files, it is a way of mapping all the bytes to a single number. You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. hashing is not exactly much work; in its simplest form it's 2 lines per file. Hashing is a *lot* more work than just comparing two bytes. The MD5 checksum has been specifically designed to be fast and compact, and the algorithm is still complicated: http://en.wikipedia.org/wiki/MD5#Pseudocode The reference implementation is here: http://www.fastsum.com/rfc1321.php#APPENDIXA SHA-1 is even more complicated still: http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode Just because *calling* some checksum function is easy doesn't make the checksum function itself simple. They do a LOT more work than just a simple comparison between bytes, and that's totally unnecessary work if you are making a one-off comparison of two local files. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano ste...@remove.this.cybersource.com.au wrote: The checksum does look at every byte in each file. Checksumming isn't a way to avoid looking at each byte of the two files, it is a way of mapping all the bytes to a single number. My understanding of the original question was a way to determine whether 2 files are equal or not. Creating a checksum of 1-n files and comparing those checksums IMHO is a valid way to do that. I know it's a (one way) mapping between a (possibly) longer byte sequence and another one, but how does checksumming not take each byte in the original sequence into account? I'd still say rather burn CPU cycles than development hours (if I got the question right); if not, then with binary files you will have to find some way of representing differences between the 2 files in a readable manner anyway. Hashing is a *lot* more work than just comparing two bytes. The MD5 checksum has been specifically designed to be fast and compact, and the algorithm is still complicated: I know that the various checksum algorithms aren't exactly cheap, but I do think that just to know whether 2 files are different, a solution which takes 5mins to implement wins hands down against a lengthy discussion which optimizes too early. regards, martin -- http://soup.alt.delete.co.at http://www.xing.com/profile/Martin_Marcher http://www.linkedin.com/in/martinmarcher You are not free to read this message, by doing so, you have violated my licence and are required to urinate publicly. Thank you. Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On 2009-04-15, Martin mar...@marcher.name wrote: Hi, On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? I'd say checksums, just about every download relies on checksums to verify you do have indeed the same file. That's slower than a byte-by-byte compare. You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. hashing is not exactly much work; in its simplest form it's 2 lines per file. I meant a lot more CPU time/cycles. -- Grant Edwards grante Yow! Was my SOY LOAF left at out in th'RAIN? It tastes visi.com REAL GOOD!! -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On 2009-04-15, Martin mar...@marcher.name wrote: On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano I'd still say rather burn CPU cycles than development hours (if I got the question right), _Hours_? Calling the file compare module takes _one_line_of_code_. Implementing a file compare from scratch takes about a half dozen lines of code. if not then with binary files you will have to find some way of representing differences between the 2 files in a readable manner anyway. 1) Who said anything about a readable representation of the differences? 2) How does a checksum provide that? Hashing is a *lot* more work than just comparing two bytes. The MD5 checksum has been specifically designed to be fast and compact, and the algorithm is still complicated: I know that the various checksum algorithms aren't exactly cheap, but I do think that just to know whether 2 files are different a solution which takes 5mins to implement wins against a lengthy discussion Bah. A direct compare is trivial. The discussion of which checksum to use, how to implement it, and how reliable it is will be far longer than any discussion over a direct comparison. which optimizes too early wins hands down. Optimizes too early? Comparing the bytes is the simplest and most direct, obvious solution. It takes one line of code to call the file compare module. Implementing it from scratch takes about five lines of code. We all rail against premature optimization, but using a checksum instead of a direct comparison is premature unoptimization. ;) -- Grant Edwards grante Yow! Hmmm ... A hash-singer at and a cross-eyed guy were visi.com SLEEPING on a deserted island, when ... -- http://mail.python.org/mailman/listinfo/python-list
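The half-dozen-line from-scratch comparison mentioned above might look like this (a sketch, not the filecmp source; the function name is illustrative):

```python
def files_equal(path1, path2, bufsize=8192):
    """Direct chunked byte-by-byte comparison of two files."""
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            b1, b2 = f1.read(bufsize), f2.read(bufsize)
            if b1 != b2:
                return False
            if not b1:  # both streams exhausted at the same point
                return True
```

Files of different lengths also compare unequal here, since the shorter stream runs dry first and the chunks stop matching.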
Re: binary file compare...
Martin wrote: On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano ste...@remove.this.cybersource.com.au wrote: The checksum does look at every byte in each file. Checksumming isn't a way to avoid looking at each byte of the two files, it is a way of mapping all the bytes to a single number. My understanding of the original question was a way to determine whether 2 files are equal or not. Creating a checksum of 1-n files and comparing those checksums IMHO is a valid way to do that. I know it's a (one way) mapping between a (possibly) longer byte sequence and another one, how does checksumming not take each byte in the original sequence into account? The fact that two md5 hashes are equal does not mean that the sources they were generated from are equal. To do that you must still perform a byte-by-byte comparison, which is much less work for the processor than generating an md5 or sha hash. If you insist on using a hashing algorithm to determine the equivalence of two files you will eventually realise that it is a flawed plan, because you will eventually find two files with different contents that nonetheless hash to the same value. The more files you test with, the quicker you will find out this basic truth. This is not complex, it's a simple fact about how hashing algorithms work. n -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
Grant Edwards wrote: We all rail against premature optimization, but using a checksum instead of a direct comparison is premature unoptimization. ;) And more than that, will provide false positives for some inputs. So, basically it's a worse-than-useless approach for determining if two files are the same. n -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Apr 15, 8:04 am, Grant Edwards inva...@invalid wrote: On 2009-04-15, Martin mar...@marcher.name wrote: Hi, On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? I'd say checksums, just about every download relies on checksums to verify you do have indeed the same file. That's slower than a byte-by-byte compare. You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. hashing is not exactly much work; in its simplest form it's 2 lines per file. I meant a lot more CPU time/cycles. -- Grant Edwards grante Yow! Was my SOY LOAF left at out in th'RAIN? It tastes visi.com REAL GOOD!! I'd like to add my 2 cents here.. (That's 1.8 cents US) All I was trying to get was a clarification of the documentation of the cmp method. It isn't clear. Byte by byte comparison is good enough for me as long as there are no cache issues. A simple check sum is no good because it can't detect reordering: 1 + 2 + 3 == 3 + 2 + 1. A CRC of any sort is more work than a byte by byte comparison and doesn't give you any more information. -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Apr 15, 11:04 am, Nigel Rantor wig...@wiggly.org wrote: The fact that two md5 hashes are equal does not mean that the sources they were generated from are equal. To do that you must still perform a byte-by-byte comparison which is much less work for the processor than generating an md5 or sha hash. If you insist on using a hashing algorithm to determine the equivalence of two files you will eventually realise that it is a flawed plan because you will eventually find two files with different contents that nonetheless hash to the same value. The more files you test with the quicker you will find out this basic truth. This is not complex, it's a simple fact about how hashing algorithms work. The only flaw on a cryptographic hash is the increasing number of attacks that are found on it. You need to pick a trusted one when you start and consider replacing it every few years. The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
Adam Olsen wrote: The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result. Not when you're using them to compare lots of files. Trust me. Been there, done that, got the t-shirt. Using hash functions to tell whether or not files are identical is an error waiting to happen. But please, do so if it makes you feel happy, you'll just eventually get an incorrect result and not know it. n -- http://mail.python.org/mailman/listinfo/python-list
binary file compare...
I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. What should I be using if not filecmp.cmp? -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Apr 13, 8:39 pm, Grant Edwards gra...@visi.com wrote: On 2009-04-13, Peter Otten __pete...@web.de wrote: But there's a cache. A change of file contents may go undetected as long as the file stats don't change: Good point. You can fool it if you force the stats to their old values after you modify a file and you don't clear the cache. The timestamps stored on the filesystem (for ext3 and most other filesystems) are fairly coarse, so it's quite possible for a check/update/check sequence to have the same timestamp at the beginning and end. -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
Hi, On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? I'd say checksums, just about every download relies on checksums to verify you do have indeed the same file. You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. hashing is not exactly much work; in its simplest form it's 2 lines per file.

$ dd if=/dev/urandom of=testfile.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.4491 s, 3.6 MB/s
$ dd if=/dev/urandom of=testfile2.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.92479 s, 2.7 MB/s
$ cp testfile.data testfile3.data
$ python
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45) [GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update(file("testfile.data").read())
>>> sha.hexdigest()
'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
>>> sha2 = hashlib.sha256()
>>> sha2.update(file("testfile2.data").read())
>>> sha2.hexdigest()
'25597380f833f287e8dad936b15ddb616669102c38f54dbd60ce57998d99ad3b'
>>> sha3 = hashlib.sha256()
>>> sha3.update(file("testfile3.data").read())
>>> sha3.hexdigest()
'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
>>> sha.hexdigest() == sha2.hexdigest()
False
>>> sha.hexdigest() == sha3.hexdigest()
True
>>> sha2.hexdigest() == sha3.hexdigest()
False

-- http://soup.alt.delete.co.at http://www.xing.com/profile/Martin_Marcher http://www.linkedin.com/in/martinmarcher You are not free to read this message, by doing so, you have violated my licence and are required to urinate publicly. Thank you. Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
SpreadTooThin wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are they same. What should I be using if not filecmp.cmp? Well, here's something http://www.daniweb.com/forums/thread115959.html but it seems from the post on the bottom that filecmp does comparison of binary files. -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Apr 13, 2:00 pm, Przemyslaw Kaminski cge...@gmail.com wrote: SpreadTooThin wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. What should I be using if not filecmp.cmp? Well, here's something http://www.daniweb.com/forums/thread115959.html but it seems from the post on the bottom that filecmp does comparison of binary files. I just want to be clear, the comparison is not just based on file size and creation date but by a byte by byte comparison of the data in each file. -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are they same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. What should I be using if not filecmp.cmp? I don't understand what you've got against comparing the files when you stated that what you wanted to do was compare the files. -- Grant Edwards grante Yow! Look DEEP into the at OPENINGS!! Do you see any visi.comELVES or EDSELS ... or a HIGHBALL?? ... -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Apr 13, 2:03 pm, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are they same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. What should I be using if not filecmp.cmp? I don't understand what you've got against comparing the files when you stated that what you wanted to do was compare the files. I think its just the way the documentation was worded http://www.python.org/doc/2.5.2/lib/module-filecmp.html Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal. Files that were compared using this function will not be compared again unless their os.stat() signature changes. So to do a comparison: filecmp.cmp(filea, fileb, False) ? -- Grant Edwards grante Yow! Look DEEP into the at OPENINGS!! Do you see any visi.com ELVES or EDSELS ... or a HIGHBALL?? ... -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On 2009-04-13, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. What should I be using if not filecmp.cmp? I don't understand what you've got against comparing the files when you stated that what you wanted to do was compare the files. Doh! I misread your post and thought you weren't getting a warm fuzzy feeling _because_ it was doing a byte-by-byte compare. Now I'm a bit confused. Are you under the impression it's _not_ doing a byte-by-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when this function is called the file lengths have already been compared and found to be equal. -- Grant Edwards grante Yow! Alright, you!! at Imitate a WOUNDED SEAL visi.com pleading for a PARKING SPACE!! -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Apr 13, 2:37 pm, Grant Edwards inva...@invalid wrote: On 2009-04-13, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. What should I be using if not filecmp.cmp? I don't understand what you've got against comparing the files when you stated that what you wanted to do was compare the files. Doh! I misread your post and thought you weren't getting a warm fuzzy feeling _because_ it was doing a byte-byte compare. Now I'm a bit confused. Are you under the impression it's _not_ doing a byte-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when this function is called the file lengths have already been compared and found to be equal. -- Grant Edwards grante Yow! Alright, you!! at Imitate a WOUNDED SEAL visi.com pleading for a PARKING SPACE!! I am indeed under the impression that it is not always doing a byte by byte comparison... as well the documentation states: Compare the files named f1 and f2, returning True if they seem equal, False otherwise. That word... Seem... makes me wonder. Thanks for the code! :) -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
Grant Edwards wrote: On 2009-04-13, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. What should I be using if not filecmp.cmp? I don't understand what you've got against comparing the files when you stated that what you wanted to do was compare the files. Doh! I misread your post and thought you weren't getting a warm fuzzy feeling _because_ it was doing a byte-byte compare. Now I'm a bit confused. Are you under the impression it's _not_ doing a byte-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when this function is called the file lengths have already been compared and found to be equal. But there's a cache. A change of file contents may go undetected as long as the file stats don't change:

$ cat fool_filecmp.py
import filecmp, shutil, sys

for fn in "adb":
    with open(fn, "w") as f:
        f.write("yadda")
shutil.copystat("d", "a")
filecmp.cmp("a", "b", False)

with open("a", "w") as f:
    f.write("*")
shutil.copystat("d", "a")

if "--clear" in sys.argv:
    print "clearing cache"
    filecmp._cache.clear()

if filecmp.cmp("a", "b", False):
    print "file a and b are equal"
else:
    print "file a and b differ"
print "a's contents:", open("a").read()
print "b's contents:", open("b").read()

$ python2.6 fool_filecmp.py
file a and b are equal
a's contents: *
b's contents: yadda

Oops.
If you are paranoid you have to clear the cache before doing the comparison:

$ python2.6 fool_filecmp.py --clear
clearing cache
file a and b differ
a's contents: *
b's contents: yadda

Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: binary file compare...
On Mon, 13 Apr 2009 15:03:32 -0500, Grant Edwards wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are they same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? If you start with an image in one format (e.g. PNG), and convert it to another format (e.g. JPEG), you might want the two files to compare equal even though their byte contents are completely different, because their content (the image itself) is visually identical. Or you might want a heuristic as a short cut for comparing large files, and decide that if two files have the same size and modification dates, and the first (say) 100KB are equal, that you will assume the rest are probably equal too. Neither of these are what the OP wants, I'm just mentioning them to answer your rhetorical question :) -- Steven -- http://mail.python.org/mailman/listinfo/python-list
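The short-cut heuristic described above (same size, same modification time, identical first 100KB) could be sketched as follows -- with the stated caveat that it only tells you the files are *probably* equal (the function name and `prefix` parameter are illustrative):

```python
import os

def probably_equal(path1, path2, prefix=100 * 1024):
    """Heuristic comparison: equal size, equal modification time,
    and identical first `prefix` bytes.  May wrongly report True
    for files that differ only beyond the prefix."""
    s1, s2 = os.stat(path1), os.stat(path2)
    if s1.st_size != s2.st_size or s1.st_mtime != s2.st_mtime:
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        return f1.read(prefix) == f2.read(prefix)
```

As the post says, this is not what the OP wants; it trades certainty for speed on large collections.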
Re: binary file compare...
SpreadTooThin wrote: On Apr 13, 2:37 pm, Grant Edwards inva...@invalid wrote: On 2009-04-13, Grant Edwards inva...@invalid wrote: On 2009-04-13, SpreadTooThin bjobrie...@gmail.com wrote: I want to compare two binary files and see if they are the same. I see the filecmp.cmp function but I don't get a warm fuzzy feeling that it is doing a byte by byte comparison of two files to see if they are the same. Perhaps I'm being dim, but how else are you going to decide if two files are the same unless you compare the bytes in the files? You could hash them and compare the hashes, but that's a lot more work than just comparing the two byte streams. What should I be using if not filecmp.cmp? I don't understand what you've got against comparing the files when you stated that what you wanted to do was compare the files. Doh! I misread your post and thought you weren't getting a warm fuzzy feeling _because_ it was doing a byte-byte compare. Now I'm a bit confused. Are you under the impression it's _not_ doing a byte-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when this function is called the file lengths have already been compared and found to be equal. -- Grant Edwards grante Yow! Alright, you!! at Imitate a WOUNDED SEAL visi.com pleading for a PARKING SPACE!! I am indeed under the impression that it is not always doing a byte by byte comparison... as well the documentation states: Compare the files named f1 and f2, returning True if they seem equal, False otherwise. That word... Seem... makes me wonder. Thanks for the code! :) Some of this discussion depends on the version of Python, but didn't say so. In version 2.6.1, the code is different (and more complex) than what's listed above. The docs are different too.
In this version, at least, you'll want to explicitly pass the shallow=False parameter. It defaults to 1, by which they must mean True. I think it's a bad default, but it's still a useful function. Just be careful to include that parameter in your call. Further, you'll want to check the implementation included with your version. The file filecmp.py is in the Lib directory, so it's no trouble to check it. -- http://mail.python.org/mailman/listinfo/python-list
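Following that advice, a call that always forces the content comparison might be wrapped like this (the wrapper name is illustrative):

```python
import filecmp

def binary_files_equal(path1, path2):
    """Compare file contents, not just os.stat() signatures.
    shallow defaults to a true value, so pass shallow=False
    explicitly to get a real byte-level comparison."""
    return filecmp.cmp(path1, path2, shallow=False)
```

Note that filecmp keeps a result cache keyed on the files' stat signatures, as discussed earlier in the thread, so a file modified without changing its stats can still produce a stale answer.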
Re: binary file compare...
On 2009-04-13, Peter Otten __pete...@web.de wrote: But there's a cache. A change of file contents may go undetected as long as the file stats don't change: Good point. You can fool it if you force the stats to their old values after you modify a file and you don't clear the cache. -- Grant -- http://mail.python.org/mailman/listinfo/python-list