Re: a program to delete duplicate files

2005-03-29 Thread alanwo
Why not try NoClone? It finds and deletes duplicate files by
true byte-by-byte comparison. A smart marker filters the duplicate files
to delete. It has a GUI.
http://noclone.net


Xah Lee wrote:
 here's a large exercise that uses what we built before.
 [...]



Re: [perl-python] a program to delete duplicate files

2005-03-20 Thread Claudio Grondi
 I'll post my version in a few days.
Have I missed something?
Where can I see your version?

Claudio


Xah Lee [EMAIL PROTECTED] wrote in the news message
news:[EMAIL PROTECTED]
 here's a large exercise that uses what we built before.
 [...]





Re: a program to delete duplicate files

2005-03-14 Thread David Eppstein
In article [EMAIL PROTECTED],
 John Machin [EMAIL PROTECTED] wrote:

 Just look at the efficiency of processing N files of the same size S,
 where they differ after d bytes: [If they don't differ, d = S]

I think this misses the point.  It's easy to find the files that are 
different.  Just a file size comparison will get most of them, and most 
of the rest can be detected in the first few kbytes.  So I think it's 
safe to assume both of these filtering mechanisms would be incorporated 
into a good duplicate detection code, and that any remaining tests that 
would be performed are between files that are very likely to be the same 
as each other.

The hard part is verifying that the files that look like duplicates 
really are duplicates.  To do so, for a group of m files that appear to 
be the same, requires 2(m-1) reads through the whole files if you use a 
comparison based method, or m reads if you use a strong hashing method.  
You can't hope to cut the reads off early when using comparisons, 
because the files won't be different.

The question to me is: when you have a group of m > 2 likely-to-be-equal 
files, is 2(m-1) reads and very little computation better than m reads 
and a strong hash computation?  I haven't yet seen an answer to this, 
but it's a question for experimentation rather than theory.
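
For what it's worth, the experiment is easy to set up. Below is a rough
sketch (modern Python assumed: hashlib, filecmp; `paths` is a hypothetical
list of same-sized, probably-identical files) that times the two
alternatives: comparing every file against one representative, about
2(m-1) full reads, versus hashing each file once, m reads plus the
SHA-256 work.

import filecmp
import hashlib
import time

def time_compare(paths):
    # compare every file against the first one: about 2*(m-1) full reads
    start = time.time()
    ref = paths[0]
    dups = [p for p in paths[1:] if filecmp.cmp(ref, p, shallow=False)]
    return time.time() - start, dups

def time_hash(paths, chunk=1 << 20):
    # read each file once and hash it: m reads plus the hash computation
    start = time.time()
    digests = {}
    for p in paths:
        h = hashlib.sha256()
        with open(p, 'rb') as f:
            for block in iter(lambda: f.read(chunk), b''):
                h.update(block)
        digests.setdefault(h.hexdigest(), []).append(p)
    return time.time() - start, digests

Whether the 2(m-1) reads actually hit the disk twice depends on the OS
cache, which is one more reason the answer really is experimental.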

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


Re: a program to delete duplicate files

2005-03-14 Thread David Eppstein
In article [EMAIL PROTECTED],
 Patrick Useldinger [EMAIL PROTECTED] wrote:

 Shouldn't you add the additional comparison time that has to be done 
 after hash calculation? Hashes do not give 100% guarantee.

When I've been talking about hashes, I've been assuming very strong 
cryptographic hashes, good enough that you can trust equal results to 
really be equal without having to verify by a comparison.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


Re: a program to delete duplicate files

2005-03-14 Thread Patrick Useldinger
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong 
cryptographic hashes, good enough that you can trust equal results to 
really be equal without having to verify by a comparison.
I am not an expert in this field. All I know is that MD5 and SHA1 can 
create collisions. Are there stronger algorithms that do not? And, more 
importantly, has it been *proved* that they do not?

-pu


Re: a program to delete duplicate files

2005-03-14 Thread Patrick Useldinger
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates 
really are duplicates.  To do so, for a group of m files that appear to 
be the same, requires 2(m-1) reads through the whole files if you use a 
comparison based method, or m reads if you use a strong hashing method.  
You can't hope to cut the reads off early when using comparisons, 
because the files won't be different.
If you read them in parallel, it's _at most_ m (m is the worst case 
here), not 2(m-1). In my tests, it has always been significantly less than m.
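
A minimal sketch of that idea, not fdups itself (it assumes modern Python,
that all paths are already known to have the same size, and that there are
enough file handles to keep one open per candidate): the files are read
block by block in lock-step, the group is split whenever the current blocks
differ, and reading stops for any file that ends up alone in its sub-group,
which is why the total I/O is at most one full read per file and usually less.

BLOCK = 64 * 1024  # assumed block size

def duplicate_groups(paths):
    """Return lists of byte-identical files among `paths` (all the same size)."""
    handles = {p: open(p, 'rb') for p in paths}
    try:
        pending = [list(paths)]
        results = []
        while pending:
            group = pending.pop()
            if len(group) < 2:
                continue               # a lone file cannot be a duplicate
            buckets = {}
            for p in group:
                buckets.setdefault(handles[p].read(BLOCK), []).append(p)
            if len(buckets) == 1:
                block = next(iter(buckets))
                if block == b'':       # all reached end of file together: identical
                    results.append(group)
                else:
                    pending.append(group)           # still equal so far, keep reading
            else:
                pending.extend(buckets.values())    # split on the differing block
        return results
    finally:
        for f in handles.values():
            f.close()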

-pu


Re: a program to delete duplicate files

2005-03-14 Thread Bengt Richter
On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein [EMAIL PROTECTED] wrote:

In article [EMAIL PROTECTED],
 John Machin [EMAIL PROTECTED] wrote:

 Just look at the efficiency of processing N files of the same size S,
 where they differ after d bytes: [If they don't differ, d = S]

I think this misses the point.  It's easy to find the files that are 
different.  Just a file size comparison will get most of them, and most 
of the rest can be detected in the first few kbytes.  So I think it's 
safe to assume both of these filtering mechanisms would be incorporated 
into a good duplicate detection code, and that any remaining tests that 
would be performed are between files that are very likely to be the same 
as each other.

The hard part is verifying that the files that look like duplicates 
really are duplicates.  To do so, for a group of m files that appear to 
be the same, requires 2(m-1) reads through the whole files if you use a 
comparison based method, or m reads if you use a strong hashing method.
What do you mean by a comparison based method ? Are you excluding the
possibility of parallel reading and comparing? The problem then becomes
verifying that m buffers containing corresponding segments of the files
are equal. You could do that by comparing hashes of the buffers (or updated
running hashes for the whole files so far read), or you could compare the
buffers in parallel byte by byte if that turns out efficient to implement
in the given language and os environment. Small buffers would probably
not be efficient for disk access, but too large might not be good either.
Maybe some automatic tuning algorithm could be developed to optimize
comparison operations in the shadow of i/o. But does the OS accept hints
about read-ahead buffering? Are the disks SCSI raid or USB external or
networked? Can automatic tuning achieve an optimum disregarding all that?

What is it that we're trying to optimize? Time to classify the
whole file set into groups of equivalent files? (note that there can be
multiple groups that are internally equal but unequal to members of other 
groups).

The problem of limitation on the number of simultaneously open files could
be virtualized away, and only have impact when the limit is actually 
encountered.
  
You can't hope to cut the reads off early when using comparisons, 
because the files won't be different.
So the worst case will be reading all through to the end once in parallel.
But pre-hashing guarantees the worst case for all -- unless disk access patterns
happen to make that the most efficient way to get everything read and compared
when most are equal.

The question to me is: when you have a group of m > 2 likely-to-be-equal 
files, is 2(m-1) reads and very little computation better than m reads 
and a strong hash computation?  I haven't yet seen an answer to this, 
but it's a question for experimentation rather than theory.
ISTM 2(m-1) reads is not necessary, so the question to me doesn't involve 
that ;-)

Regards,
Bengt Richter


Re: a program to delete duplicate files

2005-03-14 Thread John J. Lee
Patrick Useldinger [EMAIL PROTECTED] writes:

 David Eppstein wrote:
 
  The hard part is verifying that the files that look like duplicates
  really are duplicates.  To do so, for a group of m files that appear
  to be the same, requires 2(m-1) reads through the whole files if you
  use a comparison based method, or m reads if you use a strong
  hashing method.  You can't hope to cut the reads off early when
  using comparisons, because the files won't be different.
 
 If you read them in parallel, it's _at most_ m (m is the worst case
 here), not 2(m-1). In my tests, it has always significantly less than
 m.

Hmm, Patrick's right, David, isn't he?

Except when m gets really BIG (say, M), in which case I suppose m is
no longer the worst case number of whole-file reads.

And I'm not sure what the trade off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).


John


Re: a program to delete duplicate files

2005-03-14 Thread Jeff Shannon
Patrick Useldinger wrote:
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong 
cryptographic hashes, good enough that you can trust equal results to 
really be equal without having to verify by a comparison.
I am not an expert in this field. All I know is that MD5 and SHA1 can 
create collisions. Are there stronger algorithms that do not? And, more 
importantly, has it been *proved* that they do not?
I'm not an expert either, but I seem to remember reading recently 
that, while it's been proven that it's possible for SHA1 to have 
collisions, no actual collisions have been found.  Even if that's not 
completely correct, you're *far* more likely to be killed by a 
meteorite than to stumble across a SHA1 collision.  Heck, I'd expect 
that it's more likely for civilization to be destroyed by a 
dinosaur-killer-sized meteor.

With very few exceptions, if you're contorting yourself to avoid SHA1 
hash collisions, then you should also be wearing meteor-proof (and 
lightning-proof) armor everywhere you go.  (Those few exceptions would 
be cases where a malicious attacker stands to gain enough from 
constructing a single hash collision to make it worthwhile to invest a 
*large* number of petaflops of processing power.)  Sure it's not 100% 
perfect, but... how perfect do you *really* need?

Jeff Shannon



Re: a program to delete duplicate files

2005-03-14 Thread David Eppstein
In article [EMAIL PROTECTED], [EMAIL PROTECTED] (John J. Lee) 
wrote:

  If you read them in parallel, it's _at most_ m (m is the worst case
  here), not 2(m-1). In my tests, it has always significantly less than
  m.
 
 Hmm, Patrick's right, David, isn't he?

Yes, I was only considering pairwise comparisons. As he says, 
simultaneously comparing all files in a group would avoid repeated reads 
without the CPU overhead of a strong hash.  Assuming you use a system 
that allows you to have enough files open at once...

 And I'm not sure what the trade off between disk seeks and disk reads
 does to the problem, in practice (with caching and realistic memory
 constraints).

Another interesting point.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


Re: a program to delete duplicate files

2005-03-13 Thread Patrick Useldinger
John Machin wrote:
Test:
!for k in range(1000):
!    open('foo' + str(k), 'w')
I ran that and watched it open 2 million files and keep going strong ... 
until I figured out that the files are closed by Python immediately because 
there's no reference to them ;-)

Here's my code:
#!/usr/bin/env python
import os
print 'max number of file handles today is',
n = 0
h = []
try:
    while True:
        filename = 'mfh' + str(n)
        h.append((file(filename,'w'),filename))
        n = n + 1
except:
    print n
for handle, filename in h:
    handle.close()
    os.remove(filename)
On Slackware 10.1, this yields 1021.
On WinXPSP2, this yields 509.
-pu


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]
PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].
Hashing method: O(NS) reading time, O(NS) hash calc time

Shouldn't you add the additional comparison time that has to be done 
after hash calculation? Hashes do not give 100% guarantee. If there's a 
large number of identical hashes, you'd still need to read all of these 
files to make sure.

Just to explain why I appear to be a lawyer: everybody I spoke to about 
this program told me to use hashes, but nobody has been able to explain 
why. I came up with 2 possible reasons myself:

1) it's easier to program: you don't compare several files in parallel, 
but process one by one. But it's not perfect and you still need to 
compare afterwards. In the worst case, you end up with 3 files with 
identical hashes, of which 2 are identical and 1 is not. In order to 
find this, you'd still have to program the algorithm I use, unless you 
say "oh well, there's a problem with the hash, go and look yourself".

2) it's probably useful if you compare files over a network and you want 
to reduce bandwidth. A hash lets you do that at the cost of local CPU 
and disk usage, which may be OK. That was not my case.

-pu


Re: a program to delete duplicate files

2005-03-12 Thread Scott David Daniels
Patrick Useldinger wrote:
Just to explain why I appear to be a lawer: everybody I spoke to about 
this program told me to use hashes, but nobody has been able to explain 
why. I found myself 2 possible reasons:

1) it's easier to program: you don't compare several files in parallel, 
but process one by one. But it's not perfect and you still need to 
compare afterwards. In the worst case, you end up with 3 files with 
identical hashes, of which 2 are identical and 1 is not. In order to 
find this, you'd still have to program the algorithm I use, unless you 
say oh well, there's a problem with the hash, go and look yourself.

2) it's probably useful if you compare files over a network and you want 
to reduce bandwidth. A hash lets you do that at the cost of local CPU 
and disk usage, which may be OK. That was not my case.

3) Probability is on your side.  If two files match in length, but differ
   in contents, a good hash will have probability (1 / max_hash) of
   having a matching hash.  For exactly two files of the same length,
   calculating the hash is clearly a waste.  For even three files of
   the same length, you are most likely to find them distinct in three
   comparisons.  Using hashes, three file reads and three comparisons
   of hash values.  Without hashes, six file reads; you must read both
   files to do a file comparison, so three comparisons means six file reads.
   Naturally, as it grows beyond three, the difference grows drastically.
-Scott David Daniels
[EMAIL PROTECTED]


Re: a program to delete duplicate files

2005-03-12 Thread François Pinard
[Patrick Useldinger]

 Shouldn't you add the additional comparison time that has to be done
 after hash calculation? Hashes do not give 100% guarantee. If there's
 a large number of identical hashes, you'd still need to read all of
 these files to make sure.

Identical hashes for different files?  The probability of this happening
should be extremely small, or else, your hash function is not a good one.

I once was over-cautious about relying on hashes only, without actually
comparing files.  A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray.  So,
my cautious attitude was by far, for all practical means, a waste.

Similar arguments apply, say, for the confidence we may have in
probabilistic algorithms for the determination of number primality.
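
To put numbers on that argument, the sketch below evaluates the standard
birthday bound, an upper bound on the probability of any accidental
collision among n files with a well-behaved b-bit hash; the figures are
only meant to show the orders of magnitude involved.

def collision_bound(n_files, hash_bits):
    # birthday bound: P(any collision) <= n*(n-1) / 2**(bits+1)
    return n_files * (n_files - 1) / float(2 ** (hash_bits + 1))

print(collision_bound(10 ** 9, 128))   # MD5-sized hash, 1e9 files: about 1.5e-21
print(collision_bound(10 ** 9, 256))   # SHA-256, 1e9 files: about 4e-60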

-- 
François Pinard   http://pinard.progiciels-bpi.ca


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
François Pinard wrote:
Identical hashes for different files?  The probability of this happening
should be extremely small, or else, your hash function is not a good one.
We're talking about md5, sha1 or similar. They are all known not to be 
100% perfect. I agree it's a rare case, but still, why settle on 
something "about right" when you can have "right"?

I once was over-cautious about relying on hashes only, without actually
comparing files.  A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray.  So,
my cautious attitude was by far, for all practical means, a waste.
It was not my only argument for not using hashes. My algorithm also does 
fewer reads, for example.

-pu


Re: a program to delete duplicate files

2005-03-12 Thread John Machin

Patrick Useldinger wrote:
 John Machin wrote:

  Just look at the efficiency of processing N files of the same size
S,
  where they differ after d bytes: [If they don't differ, d = S]
 
  PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
  which is important for small N and large d].
 
  Hashing method: O(NS) reading time, O(NS) hash calc time


 Shouldn't you add the additional comparison time that has to be done
 after hash calculation? Hashes do not give 100% guarantee. If there's
a
 large number of identical hashes, you'd still need to read all of
these
 files to make sure.

Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ..."

1. You are ahead already even before adding any extra time for
comparison on files with equal hashes.

2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.

Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?


[1] Yes, I am aware of the subject of this thread.

Regards,

John



Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ..."
Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.
Still, if you can get it 100% right automatically, why would you bother 
checking manually? Why go back to arguments like "impossible", 
"implausible", "can't be" if you can have a simple and correct answer - 
yes or no?

Anyway, fdups does not do anything else than report duplicates. 
Deleting, hardlinking or anything else might be an option depending on 
the context in which you use fdups, but then we'd have to discuss the 
context. I never assumed any context, in order to keep it as universal 
as possible.

Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
For the time being, the additional files will be ignored, and a warning 
is issued. fdups does not quit, why are you saying this?

A fallback solution would be to open the file before every _block_ read, 
and close it afterwards. In my mind, it would be a command-line option, 
because it's difficult to determine the number of available file handles 
in a multitasking environment.

Not difficult to implement, but I first wanted to refactor the code so 
that it's a proper class that can be used in other Python programs, as 
you also asked. That is what I have sent you tonight. It's not that I 
don't care about the file handle problem, it's just that I do changes by 
(my own) priority.

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.
Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly 
less than 500 files of the same length.

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?
Sorry, I do not understand what you mean by this.
-pu


Re: a program to delete duplicate files

2005-03-12 Thread John Machin

Patrick Useldinger wrote:
 John Machin wrote:

  Maybe I was wrong: lawyers are noted for irritating precision. You
  meant to say in your own defence: "If there are *any* number (n >= 2)
  of identical hashes, you'd still need to *RE*-read and *compare* ..."


 Right, that is what I meant.

  2. As others have explained, with a decent hash function, the
  probability of a false positive is vanishingly small. Further,
nobody
  in their right mind [1] would contemplate automatically deleting
n-1
  out of a bunch of n reportedly duplicate files without further
  investigation. Duplicate files are usually (in the same directory
with
  different names or in different-but-related directories with the
same
  names) and/or (have a plausible explanation for how they were
  duplicated) -- the one-in-zillion-chance false-positive should
stand
  out as implausible.

 Still, if you can get it 100% right automatically, why would you
bother
 checking manually?

A human in their right mind is required to decide what to do with the
duplicates. The proponents of hashing -- of which I'm not one -- would
point out that any false-positives would be picked up as part of the
human scrutiny.

 Why get back to argments like impossible,
 implausible, can't be if you can have a simple and correct answer
-
 yes or no?

Oh yeah, the computer said so, it must be correct. Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.


 Anyway, fdups does not do anything else than report duplicates.
 Deleting, hardlinking or anything else might be an option depending
on
 the context in which you use fdups, but then we'd have to discuss the

 context. I never assumed any context, in order to keep it as
universal
 as possible.

That's very good, but it wasn't under contention.


  Different subject: maximum number of files that can be open at
once. I
  raised this issue with you because I had painful memories of having
to
  work around max=20 years ago on MS-DOS and was aware that this
magic
  number was copied blindly from early Unix. I did tell you that
  empirically I could get 509 successful opens on Win 2000 [add 3 for
  stdin/out/err to get a plausible number] -- this seems high enough
to
  me compared to the likely number of files with the same size -- but
you
  might like to consider a fall-back detection method instead of just
  quitting immediately if you ran out of handles.

 For the time being, the additional files will be ignored, and a
warning
 is issued. fdups does not quit, why are you saying this?

I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.


 A fallback solution would be to open the file before every _block_
read,
 and close it afterwards.

Ugh. Better use more memory, so less blocks!!

 In my mind, it would be a command-line option,
 because it's difficult to determine the number of available file
handles
 in a multitasking environment.

The pythonic way is to press ahead optimistically and recover if you
get bad news.


 Not difficult to implement, but I first wanted to refactor the code
so
 that it's a proper class that can be used in other Python programs,
as
 you also asked.

I didn't ask; I suggested. I would never suggest a
class for classes' sake. You already had a singleton class; why
another? What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output, which contains a
mixture of warning & info messages and data].

 That is what I have sent you tonight. It's not that I
 don't care about the file handle problem, it's just that I do changes
by
 (my own) priority.

  You wrote at some stage in this thread that (a) this caused
problems on
  Windows and (b) you hadn't had any such problems on Linux.
 
  Re (a): what evidence do you have?

 I've had the case myself on my girlfriend's XP box. It was certainly
 less than 500 files of the same length.

Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?


  Re (b): famous last words! How long would it take you to do a test
and
  announce the margin of safety that you have?

 Sorry, I do not understand what you mean by this.

Test:
!for k in range(1000):
!    open('foo' + str(k), 'w')
Announce:
I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry.

Cheers,
John



Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote:
Oh yeah, the computer said so, it must be correct. Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.
Of course, but it's good to know that the computer is right, isn't it? 
That leaves the human to take decisions instead of double-checking.

I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.
The buffer pool is a parameter, and the default values allow for 4096 files 
of the same size. It's more likely to run out of file handles than out 
of buffer space, don't you think?

The pythonic way is to press ahead optimistically and recover if you
get bad news.
You're right, that's what I thought about afterwards. The current idea is to 
design a second class that opens/closes/reads the files and handles the 
situation independently of the main class.
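
A rough sketch of what such a class might look like (the name and block
size below are made up, this is not fdups code): it remembers the file's
current offset and only holds an OS-level handle while a block is actually
being read, so the number of candidate files is no longer tied to the
per-process handle limit, at the cost of an open and a seek per block.

class BlockReader:
    def __init__(self, path, block_size=64 * 1024):
        self.path = path
        self.block_size = block_size
        self.offset = 0

    def next_block(self):
        """Read the next block, opening and closing the file around the read."""
        with open(self.path, 'rb') as f:
            f.seek(self.offset)
            block = f.read(self.block_size)
        self.offset += len(block)
        return block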

I didn't ask; I suggested. I would never suggest a
class-for-classes-sake. You had already a singleton class; why
another. What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning  info messages and data].
That is what I have submitted to you. Are you sure that *I* am the 
lawyer here?

Re (a): what evidence do you have?
See ;-)
Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Nothing I started manually, but the usual bunch of local firewall, virus 
scanner (not doing a complete machine check at that time).

Test:
!for k in range(1000):
!    open('foo' + str(k), 'w')
I'll try that.
Announce:
I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry.
I wouldn't count on that in a multi-tasking environment, as I said. The 
class I described earlier seems a cleaner approach.

Regards,
-pu


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread TZOTZIOY
On Fri, 11 Mar 2005 01:24:59 +0100, rumours say that Patrick Useldinger
[EMAIL PROTECTED] might have written:

 Have you found any way to test if two files on NTFS are hard linked without
 opening them first to get a file handle?

No. And even then, I wouldn't know how to find out.

MSDN is our friend for Windows stuff.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createhardlink.asp

and then

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/getfileinformationbyhandle.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/by_handle_file_information_str.asp

The relevant parts from this last page:

st_dev - dwVolumeSerialNumber

st_ino - (nFileIndexHigh, nFileIndexLow)
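
Putting those pages together, a sketch in Python would look roughly like
this. It assumes the pywin32 extensions (win32file, win32con), and the
field indices assume the tuple returned by GetFileInformationByHandle
follows the BY_HANDLE_FILE_INFORMATION layout, so check them against the
pywin32 documentation. Note that it still opens the file briefly, because
the API only works on a handle.

import win32con
import win32file

def ntfs_file_id(path):
    # open the file just long enough to ask for its volume serial and file index
    handle = win32file.CreateFile(
        path, win32con.GENERIC_READ, win32con.FILE_SHARE_READ,
        None, win32con.OPEN_EXISTING, 0, None)
    try:
        info = win32file.GetFileInformationByHandle(handle)
    finally:
        handle.Close()
    volume_serial = info[4]                    # dwVolumeSerialNumber
    index_high, index_low = info[8], info[9]   # nFileIndexHigh, nFileIndexLow
    return volume_serial, (index_high << 32) | index_low

Two paths are hard links to the same file exactly when their ids compare equal.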
-- 
TZOTZIOY, I speak England very best.
Be strict when sending and tolerant when receiving. (from RFC1958)
I really should keep that in mind when talking with people, actually...


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread TZOTZIOY
On Fri, 11 Mar 2005 01:12:14 +0100, rumours say that Patrick Useldinger
[EMAIL PROTECTED] might have written:

 On POSIX filesystems, one has also to avoid comparing files having same 
 (st_dev,
 st_inum), because you know that they are the same file.

I then have a bug here - I consider all files with the same inode equal, 
  but according to what you say I need to consider the tuple 
(st_dev,ST_ium). I'll have to fix that for 0.13.

I have a bug here too-- I wrote st_inum meaning st_ino, but that would be quick
to find!

 Thanks ;-)

You are very welcome.
-- 
TZOTZIOY, I speak England very best.
Be strict when sending and tolerant when receiving. (from RFC1958)
I really should keep that in mind when talking with people, actually...


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
The relevant parts from this last page:
st_dev - dwVolumeSerialNumber
st_ino - (nFileIndexHigh, nFileIndexLow)
I see. But if I am not mistaken, that would mean that I
(1) had to detect NTFS volumes
(2) use non-standard libraries to find this information (like the 
Python Win extensions).

I am not seriously motivated to do so, but if somebody is interested to 
help, I am open to it.

-pu


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
David Eppstein wrote:
You need do no comparisons between files.  Just use a sufficiently 
strong hash algorithm (SHA-256 maybe?) and compare the hashes.
That's not very efficient. IMO, it only makes sense in network-based 
operations such as rsync.

-pu


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
Changed.


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread David Eppstein
In article [EMAIL PROTECTED],
 Patrick Useldinger [EMAIL PROTECTED] wrote:

  You need do no comparisons between files.  Just use a sufficiently 
  strong hash algorithm (SHA-256 maybe?) and compare the hashes.
 
 That's not very efficient. IMO, it only makes sense in network-based 
 operations such as rsync.

Well, but the spec didn't say efficiency was the primary criterion, it 
said minimizing the number of comparisons was.

More seriously, the best I can think of that doesn't use a strong slow 
hash would be to group files by (file size, cheap hash) then compare 
each file in a group with a representative of each distinct file found 
among earlier files in the same group -- that leads to an average of 
about three reads per duplicated file copy: one to hash it, and two for 
the comparison between it and its representative (almost all of the 
comparisons will turn out equal but you still need to check unless you 
use a strong hash).

I'm assuming of course that there are too many files and/or they're too 
large just to keep them all in core.

Anyone have any data on whether reading files and SHA-256'ing them (or 
whatever other cryptographic hash you think is strong enough) is 
I/O-bound or CPU-bound?  That is, is three reads with very little CPU 
overhead really cheaper than one read with a strong hash?

I guess it also depends on the number of files you expect to have 
duplicates of.  If most of the files exist in only one copy, it's clear 
that the cheap hash will find them more cheaply than the expensive hash.  
In that case you could combine the (file size, cheap hash) filtering 
with the expensive hash and get only two reads per copy rather than 
three.
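
A sketch of that strategy (modern Python assumed: hashlib, filecmp,
os.walk; the 16 KiB prefix length for the cheap hash is an arbitrary
choice): files are grouped by (size, hash of the first few KiB), and each
file is then compared byte for byte against a representative of every
distinct content already seen in its group.

import filecmp
import hashlib
import os

CHEAP_BYTES = 16 * 1024   # arbitrary prefix length for the cheap hash

def cheap_key(path):
    # (size, hash of the first few KiB): cheap to compute, filters most non-duplicates
    with open(path, 'rb') as f:
        prefix = f.read(CHEAP_BYTES)
    return os.path.getsize(path), hashlib.md5(prefix).hexdigest()

def duplicate_clusters(root):
    groups = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            groups.setdefault(cheap_key(path), []).append(path)
    for candidates in groups.values():
        representatives = []   # one path per distinct content seen so far
        clusters = []          # files identical to the matching representative
        for path in candidates:
            for rep, cluster in zip(representatives, clusters):
                if filecmp.cmp(rep, path, shallow=False):
                    cluster.append(path)
                    break
            else:
                representatives.append(path)
                clusters.append([path])
        for cluster in clusters:
            if len(cluster) > 1:
                yield cluster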

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Terry Hancock
On Thursday 10 March 2005 11:02 am, Christos TZOTZIOY Georgiou wrote:
 On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
 [EMAIL PROTECTED] might have written:
 
 For anyone interested in responding to the above, a starting
 place might be this maintenance script I wrote for my own use.  I don't
 think it exactly matches the spec, but it addresses the problem.  I wrote
 this to clean up a large tree of image files once.  The exact behavior
 described requires the '--exec=ls %s' option as mentioned in the help.
 
 The drawback of this method is that you have to read everything.  For example,
 if you have ten files less than 100KiB each and one file more than 2 GiB in
 size, there is no need to read the 2 GiB file, is there?
 
 If it's a one-shot attempt, I guess it won't mind a lot.
 
 On POSIX filesystems, one has also to avoid comparing files having same 
 (st_dev,
 st_inum), because you know that they are the same file.

Two good points, but hey, it ran fast enough for me.  ;-)

Cheers,
Terry

-- 
--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks  http://www.anansispaceworks.com



Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
David Eppstein wrote:
Well, but the spec didn't say efficiency was the primary criterion, it 
said minimizing the number of comparisons was.
That's exactly what my program does.
More seriously, the best I can think of that doesn't use a strong slow 
hash would be to group files by (file size, cheap hash) then compare 
each file in a group with a representative of each distinct file found 
among earlier files in the same group -- that leads to an average of 
about three reads per duplicated file copy: one to hash it, and two for 
the comparison between it and its representative (almost all of the 
comparisons will turn out equal but you still need to check unless you 
My point is: forget hashes. If you work with hashes, you do have to 
read each file completely, cheap hash or not. My program normally reads 
*at most* 100% of the files to analyse, but usually much less. Also, I 
do plain comparisons which are much cheaper than hash calculations.

I'm assuming of course that there are too many files and/or they're too 
large just to keep them all in core.
I assume that file handles are sufficient to keep one open per file of 
the same size. This led to trouble on Windows installations, but I 
guess that's a parameter to change. On Linux, I never had the problem.

Regarding buffer size, I use a maximum which is then split up between 
all open files.

Anyone have any data on whether reading files and SHA-256'ing them (or 
whatever other cryptographic hash you think is strong enough) is 
I/O-bound or CPU-bound?  That is, is three reads with very little CPU 
overhead really cheaper than one read with a strong hash?
It also depends on the OS. I found that my program runs much slower on 
Windows, probably due to the way Linux anticipates reads and tries to 
reduce head movement.

I guess it also depends on the number of files you expect to have 
duplicates of.  If most of the files exist in only one copy, it's clear 
that the cheap hash will find them more cheaply than the expensive hash.  
In that case you could combine the (file size, cheap hash) filtering 
with the expensive hash and get only two reads per copy rather than 
three.
Sorry, but I still cannot see a point in using hashes. Maybe you'll have 
a look at my program and tell me where a hash could be useful?

It's available at http://www.homepages.lu/pu/fdups.html.
Regards,
-pu


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread David Eppstein
In article [EMAIL PROTECTED],
 Patrick Useldinger [EMAIL PROTECTED] wrote:

  Well, but the spec didn't say efficiency was the primary criterion, it 
  said minimizing the number of comparisons was.
 
 That's exactly what my program does.

If you're doing any comparisons at all, you're not minimizing the number 
of comparisons.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread TZOTZIOY
On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
[EMAIL PROTECTED] might have written:

More seriously, the best I can think of that doesn't use a strong slow 
hash would be to group files by (file size, cheap hash) then compare 
each file in a group with a representative of each distinct file found 
among earlier files in the same group -- that leads to an average of 
about three reads per duplicated file copy: one to hash it, and two for 
the comparison between it and its representative (almost all of the 
comparisons will turn out equal but you still need to check unless you 
use a strong hash).

The code I posted in another thread (and provided a link in this one) does
exactly that (a quick hash of the first few K before calculating the whole
file's md5 sum).  However, Patrick's code is faster, reading only what's
necessary (he does what I intended to do, but I was too lazy-- I actually
rewrote from scratch one of the first programs I wrote in Python, which
obviously was too amateurish code for me to publish :)

It seems your objections are related to Xah Lee's specifications; I have no
objections to your objections (-:) other than that we are just trying to produce
something of practical value out of an otherwise doomed thread...
-- 
TZOTZIOY, I speak England very best.
Be strict when sending and tolerant when receiving. (from RFC1958)
I really should keep that in mind when talking with people, actually...


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Bengt Richter
On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein [EMAIL PROTECTED] wrote:

In article [EMAIL PROTECTED],
 Patrick Useldinger [EMAIL PROTECTED] wrote:

  Well, but the spec didn't say efficiency was the primary criterion, it 
  said minimizing the number of comparisons was.
 
 That's exactly what my program does.

If you're doing any comparisons at all, you're not minimizing the number 
of comparisons.

ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known), they will all have to be crunched to the last byte to
eliminate them from possible duplicate-set membership. If there is a-priori 
knowledge
of highly probable difference locations (e.g., internal date stamp at some 
offset etc)
then obviously the candidates can be pre-classified into smaller candidate sets 
by some
simple heuristic probes.

If differences are likely to appear early in a sequential scan, perhaps 
developing hashes
in parallel would be a good strategy. But I doubt if blindly pre-computing full 
hashes would be optimal.
Compare to sort|uniq as a sieve. You wouldn't want to read through any files 
any further than you had to.

Regards,
Bengt Richter


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread Patrick Useldinger
I wrote something similar, have a look at 
http://www.homepages.lu/pu/fdups.html.


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread TZOTZIOY
On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
[EMAIL PROTECTED] might have written:

For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use.  I don't
think it exactly matches the spec, but it addresses the problem.  I wrote
this to clean up a large tree of image files once.  The exact behavior
described requires the '--exec=ls %s' option as mentioned in the help.

The drawback of this method is that you have to read everything.  For example,
if you have ten files less than 100KiB each and one file more than 2 GiB in
size, there is no need to read the 2 GiB file, is there?

If it's a one-shot attempt, I guess it won't mind a lot.

On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.
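
The check itself is small with plain os.stat; a sketch (the function name
is only illustrative):

import os

def drop_hard_links(paths):
    # keep only one path per (st_dev, st_ino) pair: hard links to the same
    # inode are the same file, so there is nothing to compare
    seen = {}
    for path in paths:
        st = os.stat(path)
        seen.setdefault((st.st_dev, st.st_ino), path)
    return list(seen.values())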
-- 
TZOTZIOY, I speak England very best.
Be strict when sending and tolerant when receiving. (from RFC1958)
I really should keep that in mind when talking with people, actually...


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread P
I've written a python GUI wrapper around some shell scripts:
http://www.pixelbeat.org/fslint/
the shell script logic is essentially:
exclude hard linked files
only include files where there are more than 1 with the same size
print files with matching md5sum
Pádraig.


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread TZOTZIOY
On Thu, 10 Mar 2005 10:54:05 +0100, rumours say that Patrick Useldinger
[EMAIL PROTECTED] might have written:

I wrote something similar, have a look at 
http://www.homepages.lu/pu/fdups.html.

That's fast and good.

A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).

Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?
-- 
TZOTZIOY, I speak England very best.
Be strict when sending and tolerant when receiving. (from RFC1958)
I really should keep that in mind when talking with people, actually...


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.
I then have a bug here - I consider all files with the same inode equal, 
 but according to what you say I need to consider the tuple 
(st_dev,ST_ium). I'll have to fix that for 0.13.

Thanks ;-)
-pu


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
That's fast and good.
Nice to hear.
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
I'll look into that.
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?
No. And even then, I wouldn't know how to find out.
-pu


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread David Eppstein
In article [EMAIL PROTECTED],
 Xah Lee [EMAIL PROTECTED] wrote:

 a absolute requirement in this problem is to minimize the number of
 comparison made between files. This is a part of the spec.

You need do no comparisons between files.  Just use a sufficiently 
strong hash algorithm (SHA-256 maybe?) and compare the hashes.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread John Bokma
David Eppstein wrote:

 In article [EMAIL PROTECTED],
  Xah Lee [EMAIL PROTECTED] wrote:
 
 a absolute requirement in this problem is to minimize the number of
 comparison made between files. This is a part of the spec.
 
 You need do no comparisons between files.  Just use a sufficiently 
 strong hash algorithm (SHA-256 maybe?) and compare the hashes.

I did it as follows (some time ago):

is the filesize already in the hash?

if so, calculate the md5 (and store it); if the md5s are equal, then compare
the files.

store the info in the hash.

In some cases it might be faster to drop the md5 (since it reads all data).
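
A rough Python transcription of that recipe (the original was presumably
Perl, with a hash keyed on file size; the names below are made up): the
md5 is only computed once a second file of the same size turns up, and
equal digests are still confirmed byte by byte. Dropping the md5, as
suggested, would amount to relying on filecmp.cmp alone.

import filecmp
import hashlib
import os

def md5_of(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

by_size = {}   # size -> list of (path, md5 digest or None)

def record(path):
    """Return a known duplicate of `path` (or None) and remember `path`."""
    entries = by_size.setdefault(os.path.getsize(path), [])
    digest = md5_of(path) if entries else None      # only hash on a size clash
    for i, (other, other_digest) in enumerate(entries):
        if other_digest is None:                    # lazily hash the first file too
            other_digest = md5_of(other)
            entries[i] = (other, other_digest)
        if other_digest == digest and filecmp.cmp(other, path, shallow=False):
            return other
    entries.append((path, digest))
    return None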

-- 
John   Small Perl scripts: http://johnbokma.com/perl/
   Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html



[perl-python] a program to delete duplicate files

2005-03-09 Thread Xah Lee
here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.

Here's the spec.
--
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicates
(including files in sub-directories). More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A. These
duplicates' full paths will be prefixed with the string "rm ". An empty
line follows each group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

order does not matter. (i.e. which file will not get the "rm " prefix does
not matter.)


perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared to
all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dirs. One is certain that there are no duplicates within each dir
itself. (imagine that del_dup.pl has run on each already) Files in
dir1 have already been categorized into sub-directories by a human, so
that when there are duplicates among dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except files in later dirs will have "rm "
first. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure. (which one in
dir2 gets "rm " does not matter) Note, although dir2 doesn't compare files
inside itself, duplicates may still be found implicitly by indirect
comparison, i.e. a==c, b==c, therefore a==b, even though a and b are
never compared.


--

Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is a part of the spec.

feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

 Xah
 [EMAIL PROTECTED]
 http://xahlee.org/PageTwo_dir/more.html
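
Not the promised version, just a minimal sketch of one way to meet the
spec in modern Python (os.walk and filecmp assumed): files are bucketed by
size, each bucket is split into groups of byte-identical files, and within
a group the copy from the earliest directory given on the command line is
printed bare while the others get the "rm " prefix. The
minimize-comparisons requirement is only addressed crudely here, via the
size bucketing; the block-wise parallel comparison discussed elsewhere in
the thread would be the natural refinement, and corner cases such as
duplicates inside a single directory in the multi-directory case are
glossed over, as the spec itself assumes they do not occur.

import filecmp
import os
import sys

def files_under(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)

def content_groups(paths):
    """Split same-sized `paths` into groups of byte-identical files."""
    groups = []
    for path in paths:
        for group in groups:
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            groups.append([path])
    return [g for g in groups if len(g) > 1]

def main(dirs):
    by_size = {}
    for rank, d in enumerate(dirs):
        for path in files_under(d):
            by_size.setdefault(os.path.getsize(path), []).append((rank, path))
    for entries in by_size.values():
        entries.sort()                    # files from the earliest directory first
        ranks = dict((p, r) for r, p in entries)
        for group in content_groups([p for _, p in entries]):
            if len(dirs) > 1 and len(set(ranks[p] for p in group)) == 1:
                continue                  # duplicates within a single dir: not considered
            print(group[0])               # the copy that is kept
            for path in group[1:]:
                print("rm " + path)
            print()

if __name__ == '__main__':
    main(sys.argv[1:])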



Re: [perl-python] a program to delete duplicate files

2005-03-09 Thread Terry Hancock
On Wednesday 09 March 2005 06:56 am, Xah Lee wrote:
 here's a large exercise that uses what we built before.
 
 suppose you have tens of thousands of files in various directories.
 Some of these files are identical, but you don't know which ones are
 identical with which. Write a program that prints out which file are
 redundant copies.

For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use.  I don't
think it exactly matches the spec, but it addresses the problem.  I wrote
this to clean up a large tree of image files once.  The exact behavior
described requires the '--exec=ls %s' option as mentioned in the help.

#!/usr/bin/env python
# (C) 2003 Anansi Spaceworks
#---
# find_duplicates
"""
Utility to find duplicate files in a directory tree by
comparing their checksums.
"""
#---
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#---

import os, sys, md5, getopt


def file_walker(tbl, srcpath, files):
    """
    Visit a path and collect data (including checksum) for files in it.
    """
    for file in files:
        filepath = os.path.join(srcpath, file)
        if os.path.isfile(filepath):
            chksum = md5.new(open(os.path.join(srcpath, file)).read()).digest()
            if not tbl.has_key(chksum): tbl[chksum]=[]
            tbl[chksum].append(filepath)

def find_duplicates(treeroot, tbl=None):
    """
    Find duplicate files in directory.
    """
    dup = {}
    if tbl is None: tbl = {}
    os.path.walk(treeroot, file_walker, tbl)
    for k,v in tbl.items():
        if len(v) > 1:
            dup[k] = v
    return dup

usage = """
 USAGE: find_duplicates <options> [path ...]

 Find duplicate files (by matching md5 checksums) in a
 collection of paths (defaults to the current directory).

 Note that the order of the paths searched will be retained
 in the resulting duplicate file lists. This can be used
 with --exec and --index to automate handling.

 Options:
   -h, -H, --help
        Print this help.

   -q, --quiet
        Don't print normal report.

   -x, --exec=<command string>
        Python-formatted command string to act on the indexed
        duplicate in each duplicate group found.  E.g. try
        --exec="ls %s"

   -n, --index=<index into duplicates>
        Which in a series of duplicates to use. Begins with '1'.
        Default is '1' (i.e. the first file listed).

  Example:
    You've copied many files from path ./A into path ./B. You want
    to delete all the ones you've processed already, but not
    delete anything else:

    % find_duplicates -q --exec="rm %s" --index=1 ./A ./B
"""

def main():
    action = None
    quiet  = 0
    index  = 1
    dup    = {}

    opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:',
                               ['quiet', 'help', 'exec=', 'index='])

    for opt, val in opts:
        if   opt in ('-h', '-H', '--help'):
            print usage
            sys.exit()
        elif opt in ('-x', '--exec'):
            action = str(val)
        elif opt in ('-n', '--index'):
            index  = int(val)
        elif opt in ('-q', '--quiet'):
            quiet = 1

    if len(args)==0:
        dup = find_duplicates('.')
    else:
        tbl = {}
        for arg in args:
            dup = find_duplicates(arg, tbl=tbl)

    for k, v in dup.items():
        if not quiet:
            print "Duplicates:"
            for f in v: print "\t%s" % f
        if action:
            os.system(action % v[index-1])

if __name__=='__main__':
    main()



-- 
--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks  http://www.anansispaceworks.com
