Re: SHA question

2010-01-16 Thread David Precious
Andy Wardley wrote: On 14/01/2010 17:41, Philip Newton wrote: Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to comp

Re: SHA question

2010-01-15 Thread Andy Wardley
On 15/01/2010 20:23, Roger Burton West wrote: And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? Sorry, I missed this bit in Philip's message: > if both source and destination are on a local file system I was thinking about remote compa

Re: SHA question

2010-01-15 Thread Ask Bjørn Hansen
On Jan 15, 2010, at 14:19, ian wrote: >>> My understanding[*] is that it computes a checksum for each block of a file >>> and only transmits blocks that have different checksums. >> >> And to calculate the checksum on each block of the file, it has to, um, >> read each block of the file... yes?

Re: SHA question

2010-01-15 Thread ian
On 15/01/2010 20:23, Roger Burton West wrote: On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote: My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. And to calculate the checksum on each block of the f

Re: SHA question

2010-01-15 Thread Roger Burton West
On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote: > My understanding[*] is that it computes a checksum for each block of a file > and only transmits blocks that have different checksums. And to calculate the checksum on each block of the file, it has to, um, read each block of the fil

Re: SHA question

2010-01-15 Thread Andy Wardley
On 14/01/2010 17:41, Philip Newton wrote: Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "loca

Re: SHA question

2010-01-14 Thread Philip Newton
On Thu, Jan 14, 2010 at 16:20, Matthew Boyle wrote: > David Cantrell wrote: >> >> On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: >> >>> That reminds me of how I was disappointed to find that rsync generally >>> transfers complete files (rather than diffs) if both source and >>> des

Re: SHA question

2010-01-14 Thread Matt Lawrence
Matthew Boyle wrote: David Cantrell wrote: On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I re

Re: SHA question

2010-01-14 Thread Matt Lawrence
David Cantrell wrote: On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote: On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent runnin

Re: SHA question

2010-01-14 Thread Peter Corlett
On 14 Jan 2010, at 14:16, Mark Fowler wrote: [...] > I'd just use Digest::MD5 to calculate the filesize. It's cheap > compared to SHA, you don't care about the exact cryptographic security > of the hash, and will work even if you don't have the original to > compare again. I assume you wrote "fil

Re: SHA question

2010-01-14 Thread Matthew Boyle
David Cantrell wrote: On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute

Re: SHA question

2010-01-14 Thread David Cantrell
On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote: > On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: > >Shame that "local" includes "at the other end of a really slow NFS > >connection to the other side of the world". Mind you, absent running the > >rsync daemon at t

Re: SHA question

2010-01-14 Thread Mark Fowler
On Wed, Jan 13, 2010 at 3:16 PM, Philip Newton wrote: > Along those lines, you may wish to store the filesize in bytes in your > database as well, as a first point of comparison; if the filesize is > unique, then the file must also be unique and you could save yourself > the time spent calculatin

Re: SHA question

2010-01-14 Thread Roger Burton West
On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: >Shame that "local" includes "at the other end of a really slow NFS >connection to the other side of the world". Mind you, absent running the >rsync daemon at the other end and using that instead of NFS, I'm not >sure if there's a bet

Re: SHA question

2010-01-14 Thread David Cantrell
On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: > That reminds me of how I was disappointed to find that rsync generally > transfers complete files (rather than diffs) if both source and > destination are on a local file system -- before I realised that to > compute the diffs, it wo

Re: SHA question

2010-01-14 Thread Philip Newton
On Thu, Jan 14, 2010 at 13:22, Peter Corlett wrote: > For de-duping purposes, SHA is still faster than you can pull the files off > the disk and a secondary cheaper hash is unnecessary. That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than d
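The per-block comparison being discussed can be sketched like this (a toy version: real rsync also uses a cheap rolling checksum so block boundaries can shift, and the block size here is arbitrary):

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Digest a string block by block. Comparing two such lists tells you
# which blocks differ and so would need transferring -- but computing
# the list still means reading every block at both ends, which is
# exactly the point being made about slow NFS links.
sub block_digests {
    my ($data, $blocksize) = @_;
    my @digests;
    for (my $off = 0; $off < length $data; $off += $blocksize) {
        push @digests, sha1_hex(substr $data, $off, $blocksize);
    }
    return @digests;
}
```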

Re: SHA question

2010-01-14 Thread Peter Corlett
On 13 Jan 2010, at 17:53, David Cantrell wrote: [...] > Other hashing algorithms exist and are faster but more prone to > inadvertant collisions. If you've got a lot of data to compare, I'd > use one of them (eg one of the variations on a CRC) and then only > bring out the big SHA guns when that f

Re: SHA question

2010-01-13 Thread Paul Makepeace
On Wed, Jan 13, 2010 at 09:53, David Cantrell wrote: > On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: > >> I am using it in a perl class but if I could system(`fdupes`) that >> might be preferable. I'll try building the sources and see what >> happens. Failing that I'll have to fallback t

Re: SHA question

2010-01-13 Thread David Cantrell
On Wed, Jan 13, 2010 at 02:58:59PM +, Dermot wrote: > 2010/1/13 Avi Greenbury : > > Thirdly, be aware of what hashing guarantees. It does *not* guarantee > > uniqueness, it just gives you a very low chance that two files with > > the same hash are different. It does guarantee that files with >

Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Paul Makepeace : > On Wed, Jan 13, 2010 at 07:16, Philip Newton wrote: >> On Wed, Jan 13, 2010 at 15:58, Dermot wrote: >>> 2010/1/13 Avi Greenbury : >> >> I think you're putting the cart before the horse. >> >> Did someone come up to you and say, "Dermot, put the SHA value in a >> data

Re: SHA question

2010-01-13 Thread David Cantrell
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: > I am using it in a perl class but if I could system(`fdupes`) that > might be preferable. I'll try building the sources and see what > happens. Failing that I'll have to fallback to slurping and SHA or > MD5. Other hashing algorithms exist
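The two-pass idea can be sketched as below; Perl's built-in `unpack` checksum stands in for the CRC David mentions (any cheap, collision-prone checksum works for the first pass, and the helper names are illustrative):

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Cheap first-pass checksum: a 32-bit sum of byte values via unpack.
# It collides easily, so on its own it can only say "definitely different".
sub cheap_sum { unpack "%32C*", $_[0] }

# Bring out the big SHA guns only when the cheap checksums match.
sub same_content {
    my ($a, $b) = @_;
    return 0 if cheap_sum($a) != cheap_sum($b);   # fast reject
    return sha256_hex($a) eq sha256_hex($b) ? 1 : 0;
}
```

Note that "ab" and "ba" have equal byte sums, so the cheap pass waves them through and the SHA comparison is what correctly rejects the match.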

Re: SHA question

2010-01-13 Thread Paul Makepeace
On Wed, Jan 13, 2010 at 07:16, Philip Newton wrote: > On Wed, Jan 13, 2010 at 15:58, Dermot wrote: >> 2010/1/13 Avi Greenbury : >> >>> You might've missed his point. >>> >>> If two files are of different sizes, they cannot be identical. Getting >>> the size of a file is substantially cheaper than

Re: SHA question

2010-01-13 Thread A. J. Trickett
On Wed, 13 Jan 2010 at 12:44:47PM +, Dermot wrote: > Hi, > > I have a lot of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is on

Re: SHA question

2010-01-13 Thread Matthew Boyle
Dan Rowles wrote: Dermot wrote: [snip] Incidentally I get poor results from the MD5 compared with SHA so I can't rely on MD5 for MD5 (md5_base64) results: mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 duplicate.pdf 01f73c1

Re: SHA question

2010-01-13 Thread Dan Rowles
Dermot wrote: [snip] Incidentally I get poor results from the MD5 compared with SHA so I can't rely on MD5 for MD5 (md5_base64) results: mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 duplicate.pdf 01f73c142dae9f9f403bbab543

Re: SHA question

2010-01-13 Thread Roger Burton West
On Wed, Jan 13, 2010 at 02:25:58PM +, Alexander Clouter wrote: >The following gives the duplicated hashes (you might prefer '-D' instead >of '-d'): But does not take account of hardlinks, and again hashes every file rather than just the ones that might be duplicates. R

Re: SHA question

2010-01-13 Thread Andy Armstrong
On 13 Jan 2010, at 14:58, Dermot wrote: > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa

Re: SHA question

2010-01-13 Thread Andy Armstrong
On 13 Jan 2010, at 14:58, Dermot wrote: > Incidentally I get poor results from the MD5 compared with SHA so I can't > rely on MD5 for > > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf
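The results quoted above actually show MD5 behaving correctly: the two duplicate files share a digest, and the 32-character strings are hex digests (what `md5_hex` returns), not base64 ones (`md5_base64` gives 22 characters). A quick check:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex md5_base64);

# Identical content always yields an identical digest; the digest
# lengths tell you which encoding you are looking at.
my $hex = md5_hex("same bytes");
my $b64 = md5_base64("same bytes");
print length($hex), " ", length($b64), "\n";   # 32 22
```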

Re: SHA question

2010-01-13 Thread Philip Newton
On Wed, Jan 13, 2010 at 15:58, Dermot wrote: > 2010/1/13 Avi Greenbury : > >> You might've missed his point. >> >> If two files are of different sizes, they cannot be identical. Getting >> the size of a file is substantially cheaper than hashing it. >> >> So you check all your filesizes, and need

Re: SHA question

2010-01-13 Thread Alexander Clouter
Roger Burton West wrote: > > You may want to be slightly cleverer about it - taking a SHAsum is > computationally expensive, and it's only worth doing if the files have > the same size. > > If you don't require a pure-Perl solution, bear in mind that all this > has been done for you in the "fdupes

Re: SHA question

2010-01-13 Thread Peter Corlett
On 13 Jan 2010, at 14:40, Philip Newton wrote: [...] > Well, that said, is the "very low chance" not on the order of the > chance that you'll be run over by a bus in the morning, or that one of > the files will be changed through cosmic rays or bit rot in the > magnetic domains of the hard disk pla

Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Avi Greenbury : > You might've missed his point. > > If two files are of different sizes, they cannot be identical. Getting > the size of a file is substantially cheaper than hashing it. > > So you check all your filesizes, and need only hash those pairs or > groups that are all the same

Re: SHA question

2010-01-13 Thread Philip Newton
On Wed, Jan 13, 2010 at 15:06, James Laver wrote: > Thirdly, be aware of what hashing guarantees. It does *not* guarantee > uniqueness, it just gives you a very low chance that two files with > the same hash are different. Well, that said, is the "very low chance" not on the order of the chance t
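Philip's "very low chance" can be put in rough numbers with the birthday approximation, p ~ n(n-1)/2^(b+1) for n files and a b-bit hash; the million-file figure below is purely illustrative, not from the thread.

```perl
use strict;
use warnings;

# Birthday approximation for an accidental SHA-1 (160-bit) collision
# among n random files. n is an illustrative guess.
my $n = 1_000_000;
my $p = $n * ($n - 1) / 2 ** 161;
printf "chance of any collision among %d files ~ %.1e\n", $n, $p;
```

That works out around 10^-37, vastly smaller than the chance of bit rot or cosmic rays corrupting one of the files in the first place.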

Re: SHA question

2010-01-13 Thread James Laver
On Wed, Jan 13, 2010 at 1:46 PM, Dermot wrote: > 2010/1/13 Roger Burton West : > >>>I am using it in a perl class >> >> So I won't point out the implications, but there's an obvious one which >> will make your life easier. > > You can't leave me hanging there > Dp. > Well, there are a few thi

Re: SHA question

2010-01-13 Thread Avi Greenbury
Dermot wrote: > 2010/1/13 Roger Burton West : > > You may want to be slightly cleverer about it - taking a SHAsum is > > computationally expensive, and it's only worth doing if the files > > have the same size. > > Unfortunately the size varies quite a bit. You might've missed his point. If two
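Avi's scheme can be sketched as follows (the function name and choice of SHA-256 are illustrative; the thread doesn't prescribe an implementation): bucket files by size first, then hash only the buckets with more than one member.

```perl
use strict;
use warnings;
use Digest::SHA;

# Size-first de-duplication sketch: only files that share a size can
# possibly be duplicates, so only those ever get hashed.
sub find_duplicates {
    my (@files) = @_;

    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @files;

    my %by_digest;
    for my $group (grep { @$_ > 1 } values %by_size) {
        for my $file (@$group) {
            my $sha = Digest::SHA->new(256);
            $sha->addfile($file, "b");   # hash the contents, binary mode
            push @{ $by_digest{ $sha->hexdigest } }, $file;
        }
    }
    # Groups of two or more files with identical content.
    return grep { @$_ > 1 } values %by_digest;
}
```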

Re: SHA question

2010-01-13 Thread Philip Potter
2010/1/13 Luis Motta Campos : > I believe the official answer to this question would be "The London Perl > Mongers list considers on-topic messages that talk about Ponies, Buffy, > Beer, and Pie. Everything else should be tagged as 'off-topic'". There is even a FAQ about this: http://london.pm.or

Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Roger Burton West : >>I am using it in a perl class > > So I won't point out the implications, but there's an obvious one which > will make your life easier. You can't leave me hanging there Dp.

Re: SHA question

2010-01-13 Thread Roger Burton West
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: >Unfortunately the size varies quite a bit. There are a few 11Mb pdfs >but the majority are under 1mb. No, that's _good_. >I am using it in a perl class So I won't point out the implications, but there's an obvious one which will make your

Re: SHA question

2010-01-13 Thread Steffan Davies
Dermot wrote at 12:44 on 2010-01-13: > Hi, > > I have a lot of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SH

Re: SHA question

2010-01-13 Thread Luis Motta Campos
Dermot wrote: > Hi, > > I have a lot of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the f

Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Roger Burton West : > On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote: > >>I have a lot of PDFs that I need to catalogue and I want to ensure >>the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned >>something similar with SHA1 and binary files. Am I right in thinking >

Re: SHA question

2010-01-13 Thread Roger Burton West
On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote: >I have a lot of PDFs that I need to catalogue and I want to ensure >the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned >something similar with SHA1 and binary files. Am I right in thinking >that the code below is only taking t

SHA question

2010-01-13 Thread Dermot
Hi, I have a lot of PDFs that I need to catalogue and I want to ensure the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned something similar with SHA1 and binary files. Am I right in thinking that the code below is only taking the SHA on the name of the file and if I want to ensure u
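The distinction Dermot is asking about can be shown in a minimal sketch (the helper name `file_sha1` is illustrative, not from the thread): passing a path string to `sha1_hex` digests only the name, while `Digest::SHA`'s `addfile` digests the contents.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Illustrative helper: digest the *contents* of a file.
# By contrast, sha1_hex($path) would digest only the path string itself.
sub file_sha1 {
    my ($path) = @_;
    my $sha = Digest::SHA->new(1);   # SHA-1
    $sha->addfile($path, "b");       # read the file in binary mode
    return $sha->hexdigest;
}
```

Two copies of the same PDF under different names then hash identically, which is what the de-duplication needs.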