Re: a program to delete duplicate files
Why not try to use NoClone, it finds and deletes duplicate files by true byte-by-byte comparison. Smart marker filters duplicate files to delete. With GUI. http://noclone.net Xah Lee wrote: here's a large exercise that uses what we built before. suppose you have tens of thousands of files in various directories. Some of these files are identical, but you don't know which ones are identical with which. Write a program that prints out which file are redundant copies. Here's the spec. -- The program is to be used on the command line. Its arguments are one or more full paths of directories. perl del_dup.pl dir1 prints the full paths of all files in dir1 that are duplicate. (including files in sub-directories) More specifically, if file A has duplicates, A's full path will be printed on a line, immediately followed the full paths of all other files that is a copy of A. These duplicates's full paths will be prefixed with rm string. A empty line follows a group of duplicates. Here's a sample output. inPath/a.jpg rm inPath/b.jpg rm inPath/3/a.jpg rm inPath/hh/eu.jpg inPath/ou.jpg rm inPath/23/a.jpg rm inPath/hh33/eu.jpg order does not matter. (i.e. which file will not be rm does not matter.) perl del_dup.pl dir1 dir2 will do the same as above, except that duplicates within dir1 or dir2 themselves not considered. That is, all files in dir1 are compared to all files in dir2. (including subdirectories) And, only files in dir2 will have the rm prefix. One way to understand this is to imagine lots of image files in both dir. One is certain that there are no duplicates within each dir themselves. (imagine that del_dup.pl has run on each already) Files in dir1 has already been categorized into sub directories by human. So that when there are duplicates among dir1 and dir2, one wants the version in dir2 to be deleted, leaving the organization in dir1 intact. perl del_dup.pl dir1 dir2 dir3 ... does the same as above, except files in later dir will have rm first. So, if there are these identical files: dir2/a dir2/b dir4/c dir4/d the c and d will both have rm prefix for sure. (which one has rm in dir2 does not matter) Note, although dir2 doesn't compare files inside itself, but duplicates still may be implicitly found by indirect comparison. i.e. a==c, b==c, therefore a==b, even though a and b are never compared. -- Write a Perl or Python version of the program. a absolute requirement in this problem is to minimize the number of comparison made between files. This is a part of the spec. feel free to write it however you want. I'll post my version in a few days. http://www.xahlee.org/perl-python/python.html Xah [EMAIL PROTECTED] http://xahlee.org/PageTwo_dir/more.html -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
I'll post my version in a few days. Have I missed something? Where can I see your version? Claudio Xah Lee [EMAIL PROTECTED] schrieb im Newsbeitrag news:[EMAIL PROTECTED] here's a large exercise that uses what we built before. suppose you have tens of thousands of files in various directories. Some of these files are identical, but you don't know which ones are identical with which. Write a program that prints out which file are redundant copies. Here's the spec. -- The program is to be used on the command line. Its arguments are one or more full paths of directories. perl del_dup.pl dir1 prints the full paths of all files in dir1 that are duplicate. (including files in sub-directories) More specifically, if file A has duplicates, A's full path will be printed on a line, immediately followed the full paths of all other files that is a copy of A. These duplicates's full paths will be prefixed with rm string. A empty line follows a group of duplicates. Here's a sample output. inPath/a.jpg rm inPath/b.jpg rm inPath/3/a.jpg rm inPath/hh/eu.jpg inPath/ou.jpg rm inPath/23/a.jpg rm inPath/hh33/eu.jpg order does not matter. (i.e. which file will not be rm does not matter.) perl del_dup.pl dir1 dir2 will do the same as above, except that duplicates within dir1 or dir2 themselves not considered. That is, all files in dir1 are compared to all files in dir2. (including subdirectories) And, only files in dir2 will have the rm prefix. One way to understand this is to imagine lots of image files in both dir. One is certain that there are no duplicates within each dir themselves. (imagine that del_dup.pl has run on each already) Files in dir1 has already been categorized into sub directories by human. So that when there are duplicates among dir1 and dir2, one wants the version in dir2 to be deleted, leaving the organization in dir1 intact. perl del_dup.pl dir1 dir2 dir3 ... does the same as above, except files in later dir will have rm first. So, if there are these identical files: dir2/a dir2/b dir4/c dir4/d the c and d will both have rm prefix for sure. (which one has rm in dir2 does not matter) Note, although dir2 doesn't compare files inside itself, but duplicates still may be implicitly found by indirect comparison. i.e. a==c, b==c, therefore a==b, even though a and b are never compared. -- Write a Perl or Python version of the program. a absolute requirement in this problem is to minimize the number of comparison made between files. This is a part of the spec. feel free to write it however you want. I'll post my version in a few days. http://www.xahlee.org/perl-python/python.html Xah [EMAIL PROTECTED] http://xahlee.org/PageTwo_dir/more.html -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
In article [EMAIL PROTECTED], John Machin [EMAIL PROTECTED] wrote: Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S] I think this misses the point. It's easy to find the files that are different. Just a file size comparison will get most of them, and most of the rest can be detected in the first few kbytes. So I think it's safe to assume both of these filtering mechanisms would be incorporated into a good duplicate detection code, and that any remaining tests that would be performed are between files that are very likely to be the same as each other. The hard part is verifying that the files that look like duplicates really are duplicates. To do so, for a group of m files that appear to be the same, requires 2(m-1) reads through the whole files if you use a comparison based method, or m reads if you use a strong hashing method. You can't hope to cut the reads off early when using comparisons, because the files won't be different. The question to me is: when you have a group of m2 likely-to-be-equal files, is 2(m-1) reads and very little computation better than m reads and a strong hash computation? I haven't yet seen an answer to this, but it's a question for experimentation rather than theory. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
In article [EMAIL PROTECTED], Patrick Useldinger [EMAIL PROTECTED] wrote: Shouldn't you add the additional comparison time that has to be done after hash calculation? Hashes do not give 100% guarantee. When I've been talking about hashes, I've been assuming very strong cryptographic hashes, good enough that you can trust equal results to really be equal without having to verify by a comparison. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
David Eppstein wrote: When I've been talking about hashes, I've been assuming very strong cryptographic hashes, good enough that you can trust equal results to really be equal without having to verify by a comparison. I am not an expert in this field. All I know is that MD5 and SHA1 can create collisions. Are there stronger algorithms that do not? And, more importantly, has it been *proved* that they do not? -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
David Eppstein wrote: The hard part is verifying that the files that look like duplicates really are duplicates. To do so, for a group of m files that appear to be the same, requires 2(m-1) reads through the whole files if you use a comparison based method, or m reads if you use a strong hashing method. You can't hope to cut the reads off early when using comparisons, because the files won't be different. If you read them in parallel, it's _at most_ m (m is the worst case here), not 2(m-1). In my tests, it has always significantly less than m. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein [EMAIL PROTECTED] wrote: In article [EMAIL PROTECTED], John Machin [EMAIL PROTECTED] wrote: Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S] I think this misses the point. It's easy to find the files that are different. Just a file size comparison will get most of them, and most of the rest can be detected in the first few kbytes. So I think it's safe to assume both of these filtering mechanisms would be incorporated into a good duplicate detection code, and that any remaining tests that would be performed are between files that are very likely to be the same as each other. The hard part is verifying that the files that look like duplicates really are duplicates. To do so, for a group of m files that appear to be the same, requires 2(m-1) reads through the whole files if you use a comparison based method, or m reads if you use a strong hashing method. What do you mean by a comparison based method ? Are you excluding the possibility of parallel reading and comparing? The problem then becomes verifying that m buffers containing corresponding segments of the files are equal. You could do that by comparing hashes of the buffers (or updated running hashes for the whole files so far read), or you could compare the buffers in parallel byte by byte if that turns out efficient to implement in the given language and os environment. Small buffers would probably not be efficient for disk access, but too large might not be good either. Maybe some automatic tuning algorithm could be developed to optimize comparison operations in the shadow of i/o. But does the OS accept hints about read-ahead buffering? Are the disks SCSI raid or USB external or networked? Can automatic tuning achieve an optimum disregarding all that? What is it that we're trying to optimize? Time to classify the whole file set into groups of equivalent files? (note that there can be multiple groups that are internally equal but uneqal to members of other groups). The problem of limitation on the number of simultaneously open files could be virtualized away, and only have impact when the limit is actually encountered. You can't hope to cut the reads off early when using comparisons, because the files won't be different. So the worst case will be reading all through to the end once in parallel. But pre-hashing guarantees the worst case for all -- unless disk access patterns happen to make that the most efficient way to get everything read and compared when most are equal. The question to me is: when you have a group of m2 likely-to-be-equal files, is 2(m-1) reads and very little computation better than m reads and a strong hash computation? I haven't yet seen an answer to this, but it's a question for experimentation rather than theory. ISTM 2(m-1) reads is not necessary, so the question to me doesn't involve that ;-) Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
Patrick Useldinger [EMAIL PROTECTED] writes: David Eppstein wrote: The hard part is verifying that the files that look like duplicates really are duplicates. To do so, for a group of m files that appear to be the same, requires 2(m-1) reads through the whole files if you use a comparison based method, or m reads if you use a strong hashing method. You can't hope to cut the reads off early when using comparisons, because the files won't be different. If you read them in parallel, it's _at most_ m (m is the worst case here), not 2(m-1). In my tests, it has always significantly less than m. Hmm, Patrick's right, David, isn't he? Except when m gets really BIG (say, M), in which case I suppose m is no longer the worst case number of whole-file reads. And I'm not sure what the trade off between disk seeks and disk reads does to the problem, in practice (with caching and realistic memory constraints). John -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
Patrick Useldinger wrote: David Eppstein wrote: When I've been talking about hashes, I've been assuming very strong cryptographic hashes, good enough that you can trust equal results to really be equal without having to verify by a comparison. I am not an expert in this field. All I know is that MD5 and SHA1 can create collisions. Are there stronger algorithms that do not? And, more importantly, has it been *proved* that they do not? I'm not an expert either, but I seem to remember reading recently that, while it's been proven that it's possible for SHA1 to have collisions, no actual collisions have been found. Even if that's not completely correct, you're *far* more likely to be killed by a meteorite than to stumble across a SHA1 collision. Heck, I'd expect that it's more likely for civilization to be destroyed by a dinosaur-killer-sized meteor. With very few exceptions, if you're contorting yourself to avoid SHA1 hash collisions, then you should also be wearing meteor-proof (and lightning-proof) armor everywhere you go. (Those few exceptions would be cases where a malicious attacker stands to gain enough from constructing a single hash collision to make it worthwhile to invest a *large* number of petaflops of processing power.) Sure it's not 100% perfect, but... how perfect do you *really* need? Jeff Shannon -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
In article [EMAIL PROTECTED], [EMAIL PROTECTED] (John J. Lee) wrote: If you read them in parallel, it's _at most_ m (m is the worst case here), not 2(m-1). In my tests, it has always significantly less than m. Hmm, Patrick's right, David, isn't he? Yes, I was only considering pairwise comparisons. As he says, simultaneously comparing all files in a group would avoid repeated reads without the CPU overhead of a strong hash. Assuming you use a system that allows you to have enough files open at once... And I'm not sure what the trade off between disk seeks and disk reads does to the problem, in practice (with caching and realistic memory constraints). Another interesting point. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
John Machin wrote: Test: !for k in range(1000): !open('foo' + str(k), 'w') I ran that and watched it open 2 million files and going strong ... until I figured that files are closed by Python immediately because there's no reference to them ;-) Here's my code: #!/usr/bin/env python import os print 'max number of file handles today is', n = 0 h = [] try: while True: filename = 'mfh' + str(n) h.append((file(filename,'w'),filename)) n = n + 1 except: print n for handle, filename in h: handle.close() os.remove(filename) On Slackware 10.1, this yields 1021. On WinXPSP2, this yields 509. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
John Machin wrote: Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S] PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d which is important for small N and large d]. Hashing method: O(NS) reading time, O(NS) hash calc time Shouldn't you add the additional comparison time that has to be done after hash calculation? Hashes do not give 100% guarantee. If there's a large number of identical hashes, you'd still need to read all of these files to make sure. Just to explain why I appear to be a lawer: everybody I spoke to about this program told me to use hashes, but nobody has been able to explain why. I found myself 2 possible reasons: 1) it's easier to program: you don't compare several files in parallel, but process one by one. But it's not perfect and you still need to compare afterwards. In the worst case, you end up with 3 files with identical hashes, of which 2 are identical and 1 is not. In order to find this, you'd still have to program the algorithm I use, unless you say oh well, there's a problem with the hash, go and look yourself. 2) it's probably useful if you compare files over a network and you want to reduce bandwidth. A hash lets you do that at the cost of local CPU and disk usage, which may be OK. That was not my case. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
Patrick Useldinger wrote: Just to explain why I appear to be a lawer: everybody I spoke to about this program told me to use hashes, but nobody has been able to explain why. I found myself 2 possible reasons: 1) it's easier to program: you don't compare several files in parallel, but process one by one. But it's not perfect and you still need to compare afterwards. In the worst case, you end up with 3 files with identical hashes, of which 2 are identical and 1 is not. In order to find this, you'd still have to program the algorithm I use, unless you say oh well, there's a problem with the hash, go and look yourself. 2) it's probably useful if you compare files over a network and you want to reduce bandwidth. A hash lets you do that at the cost of local CPU and disk usage, which may be OK. That was not my case. 3) Probability is on your side. If two files match in length, but differ in contents, a good hash will have probability (1 / max_hash) of having a matching hash. For exactly two files of the same length, calculating the hash is clearly a waste. For even three files of the same length, you are most likely to find them distinct in three comparisons. Using hashes, three file reads and three comparisons of hash values. Without hashes, six file reads; you must read both files to do a file comparison, so three comparisons is six files. Naturally, as it grows beyond three, the deference grows drastically. -Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
[Patrick Useldinger] Shouldn't you add the additional comparison time that has to be done after hash calculation? Hashes do not give 100% guarantee. If there's a large number of identical hashes, you'd still need to read all of these files to make sure. Identical hashes for different files? The probability of this happening should be extremely small, or else, your hash function is not a good one. I once was over-cautious about relying on hashes only, without actually comparing files. A friend convinced me, doing maths, that with a good hash function, the probability of a false match was much, much smaller than the probability of my computer returning the wrong answer, despite thorough comparisons, due to some electronic glitch or cosmic ray. So, my cautious attitude was by far, for all practical means, a waste. Similar arguments apply, say, for the confidence we may have in probabilistic algorithms for the determination of number primality. -- François Pinard http://pinard.progiciels-bpi.ca -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
François Pinard wrote: Identical hashes for different files? The probability of this happening should be extremely small, or else, your hash function is not a good one. We're talking about md5, sha1 or similar. They are all known not to be 100% perfect. I agree it's a rare case, but still, why settle on something about right when you can have right? I once was over-cautious about relying on hashes only, without actually comparing files. A friend convinced me, doing maths, that with a good hash function, the probability of a false match was much, much smaller than the probability of my computer returning the wrong answer, despite thorough comparisons, due to some electronic glitch or cosmic ray. So, my cautious attitude was by far, for all practical means, a waste. It was not my only argument for not using hashed. My algorithm also does less reads, for example. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
Patrick Useldinger wrote: John Machin wrote: Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S] PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d which is important for small N and large d]. Hashing method: O(NS) reading time, O(NS) hash calc time Shouldn't you add the additional comparison time that has to be done after hash calculation? Hashes do not give 100% guarantee. If there's a large number of identical hashes, you'd still need to read all of these files to make sure. Maybe I was wrong: lawyers are noted for irritating precision. You meant to say in your own defence: If there are *any* number (n = 2) of identical hashes, you'd still need to *RE*-read and *compare* 1. You are ahead already even before adding any extra time for comparison on files with equal hashes. 2. As others have explained, with a decent hash function, the probability of a false positive is vanishingly small. Further, nobody in their right mind [1] would contemplate automatically deleting n-1 out of a bunch of n reportedly duplicate files without further investigation. Duplicate files are usually (in the same directory with different names or in different-but-related directories with the same names) and/or (have a plausible explanation for how they were duplicated) -- the one-in-zillion-chance false-positive should stand out as implausible. Different subject: maximum number of files that can be open at once. I raised this issue with you because I had painful memories of having to work around max=20 years ago on MS-DOS and was aware that this magic number was copied blindly from early Unix. I did tell you that empirically I could get 509 successful opens on Win 2000 [add 3 for stdin/out/err to get a plausible number] -- this seems high enough to me compared to the likely number of files with the same size -- but you might like to consider a fall-back detection method instead of just quitting immediately if you ran out of handles. You wrote at some stage in this thread that (a) this caused problems on Windows and (b) you hadn't had any such problems on Linux. Re (a): what evidence do you have? Re (b): famous last words! How long would it take you to do a test and announce the margin of safety that you have? [1] Yes, I am aware of the subject of this thread. Regards, John -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
John Machin wrote: Maybe I was wrong: lawyers are noted for irritating precision. You meant to say in your own defence: If there are *any* number (n = 2) of identical hashes, you'd still need to *RE*-read and *compare* Right, that is what I meant. 2. As others have explained, with a decent hash function, the probability of a false positive is vanishingly small. Further, nobody in their right mind [1] would contemplate automatically deleting n-1 out of a bunch of n reportedly duplicate files without further investigation. Duplicate files are usually (in the same directory with different names or in different-but-related directories with the same names) and/or (have a plausible explanation for how they were duplicated) -- the one-in-zillion-chance false-positive should stand out as implausible. Still, if you can get it 100% right automatically, why would you bother checking manually? Why get back to argments like impossible, implausible, can't be if you can have a simple and correct answer - yes or no? Anyway, fdups does not do anything else than report duplicates. Deleting, hardlinking or anything else might be an option depending on the context in which you use fdups, but then we'd have to discuss the context. I never assumed any context, in order to keep it as universal as possible. Different subject: maximum number of files that can be open at once. I raised this issue with you because I had painful memories of having to work around max=20 years ago on MS-DOS and was aware that this magic number was copied blindly from early Unix. I did tell you that empirically I could get 509 successful opens on Win 2000 [add 3 for stdin/out/err to get a plausible number] -- this seems high enough to me compared to the likely number of files with the same size -- but you might like to consider a fall-back detection method instead of just quitting immediately if you ran out of handles. For the time being, the additional files will be ignored, and a warning is issued. fdups does not quit, why are you saying this? A fallback solution would be to open the file before every _block_ read, and close it afterwards. In my mind, it would be a command-line option, because it's difficult to determine the number of available file handles in a multitasking environment. Not difficult to implement, but I first wanted to refactor the code so that it's a proper class that can be used in other Python programs, as you also asked. That is what I have sent you tonight. It's not that I don't care about the file handle problem, it's just that I do changes by (my own) priority. You wrote at some stage in this thread that (a) this caused problems on Windows and (b) you hadn't had any such problems on Linux. Re (a): what evidence do you have? I've had the case myself on my girlfriend's XP box. It was certainly less than 500 files of the same length. Re (b): famous last words! How long would it take you to do a test and announce the margin of safety that you have? Sorry, I do not understand what you mean by this. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
Patrick Useldinger wrote: John Machin wrote: Maybe I was wrong: lawyers are noted for irritating precision. You meant to say in your own defence: If there are *any* number (n = 2) of identical hashes, you'd still need to *RE*-read and *compare* Right, that is what I meant. 2. As others have explained, with a decent hash function, the probability of a false positive is vanishingly small. Further, nobody in their right mind [1] would contemplate automatically deleting n-1 out of a bunch of n reportedly duplicate files without further investigation. Duplicate files are usually (in the same directory with different names or in different-but-related directories with the same names) and/or (have a plausible explanation for how they were duplicated) -- the one-in-zillion-chance false-positive should stand out as implausible. Still, if you can get it 100% right automatically, why would you bother checking manually? A human in their right mind is required to decide what to do with the duplicates. The proponents of hashing -- of which I'm not one -- would point out that any false-positives would be picked up as part of the human scrutiny. Why get back to argments like impossible, implausible, can't be if you can have a simple and correct answer - yes or no? Oh yeah, the computer said so, it must be correct. Even with your algorithm, I would be investigating cases where files were duplicates but there was nothing in the names or paths that suggested how that might have come about. Anyway, fdups does not do anything else than report duplicates. Deleting, hardlinking or anything else might be an option depending on the context in which you use fdups, but then we'd have to discuss the context. I never assumed any context, in order to keep it as universal as possible. That's very good, but it wasn't under contention. Different subject: maximum number of files that can be open at once. I raised this issue with you because I had painful memories of having to work around max=20 years ago on MS-DOS and was aware that this magic number was copied blindly from early Unix. I did tell you that empirically I could get 509 successful opens on Win 2000 [add 3 for stdin/out/err to get a plausible number] -- this seems high enough to me compared to the likely number of files with the same size -- but you might like to consider a fall-back detection method instead of just quitting immediately if you ran out of handles. For the time being, the additional files will be ignored, and a warning is issued. fdups does not quit, why are you saying this? I beg your pardon, I was wrong. Bad memory. It's the case of running out of the minuscule buffer pool that you allocate by default where it panics and pulls the sys.exit(1) rip-cord. A fallback solution would be to open the file before every _block_ read, and close it afterwards. Ugh. Better use more memory, so less blocks!! In my mind, it would be a command-line option, because it's difficult to determine the number of available file handles in a multitasking environment. The pythonic way is to press ahead optimistically and recover if you get bad news. Not difficult to implement, but I first wanted to refactor the code so that it's a proper class that can be used in other Python programs, as you also asked. I didn't ask; I suggested. I would never suggest a class-for-classes-sake. You had already a singleton class; why another. What I did suggest was that you provide a callable interface that returned clusters of duplicates [so that people could do their own thing instead of having to parse your file output which contains a mixture of warning info messages and data]. That is what I have sent you tonight. It's not that I don't care about the file handle problem, it's just that I do changes by (my own) priority. You wrote at some stage in this thread that (a) this caused problems on Windows and (b) you hadn't had any such problems on Linux. Re (a): what evidence do you have? I've had the case myself on my girlfriend's XP box. It was certainly less than 500 files of the same length. Interesting. Less on XP than on 2000? Maybe there's a machine-wide limit, not a per-process limit, like the old DOS max=20. What else was running at the time? Re (b): famous last words! How long would it take you to do a test and announce the margin of safety that you have? Sorry, I do not understand what you mean by this. Test: !for k in range(1000): !open('foo' + str(k), 'w') Announce: I can open A files at once on box B running os C. The most files of the same length that I have seen is D. The ratio A/D is small enough not to worry. Cheers, John -- http://mail.python.org/mailman/listinfo/python-list
Re: a program to delete duplicate files
John Machin wrote: Oh yeah, the computer said so, it must be correct. Even with your algorithm, I would be investigating cases where files were duplicates but there was nothing in the names or paths that suggested how that might have come about. Of course, but it's good to know that the computer is right, isn't it? That leaves the human to take decisions instead of double-checking. I beg your pardon, I was wrong. Bad memory. It's the case of running out of the minuscule buffer pool that you allocate by default where it panics and pulls the sys.exit(1) rip-cord. Bufferpool is a parameter, and the default values allow for 4096 files of the same size. It's more likely to run out of file handles than out of bufferspace, don't you think? The pythonic way is to press ahead optimistically and recover if you get bad news. You're right, that's what I thought about afterwards. Current idea is to design a second class that opens/closes/reads the files and handles the situation independantly of the main class. I didn't ask; I suggested. I would never suggest a class-for-classes-sake. You had already a singleton class; why another. What I did suggest was that you provide a callable interface that returned clusters of duplicates [so that people could do their own thing instead of having to parse your file output which contains a mixture of warning info messages and data]. That is what I have submitted to you. Are you sure that *I* am the lawyer here? Re (a): what evidence do you have? See ;-) Interesting. Less on XP than on 2000? Maybe there's a machine-wide limit, not a per-process limit, like the old DOS max=20. What else was running at the time? Nothing I started manually, but the usual bunch of local firewall, virus scanner (not doing a complete machine check at that time). Test: !for k in range(1000): !open('foo' + str(k), 'w') I'll try that. Announce: I can open A files at once on box B running os C. The most files of the same length that I have seen is D. The ratio A/D is small enough not to worry. I wouldn't count on that on a multi-tasking environment, as I said. The class I described earlier seems a cleaner approach. Regards, -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Fri, 11 Mar 2005 01:24:59 +0100, rumours say that Patrick Useldinger [EMAIL PROTECTED] might have written: Have you found any way to test if two files on NTFS are hard linked without opening them first to get a file handle? No. And even then, I wouldn't know how to find out. MSDN is our friend for Windows stuff. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createhardlink.asp and then http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/getfileinformationbyhandle.asp http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/by_handle_file_information_str.asp The relevant parts from this last page: st_dev - dwVolumeSerialNumber st_ino - (nFileIndexHigh, nFileIndexLow) -- TZOTZIOY, I speak England very best. Be strict when sending and tolerant when receiving. (from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Fri, 11 Mar 2005 01:12:14 +0100, rumours say that Patrick Useldinger [EMAIL PROTECTED] might have written: On POSIX filesystems, one has also to avoid comparing files having same (st_dev, st_inum), because you know that they are the same file. I then have a bug here - I consider all files with the same inode equal, but according to what you say I need to consider the tuple (st_dev,ST_ium). I'll have to fix that for 0.13. I have a bug here too-- I wrote st_inum meaning st_ino, but that would be quick to find! Thanks ;-) You are very welcome. -- TZOTZIOY, I speak England very best. Be strict when sending and tolerant when receiving. (from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
Christos TZOTZIOY Georgiou wrote: The relevant parts from this last page: st_dev - dwVolumeSerialNumber st_ino - (nFileIndexHigh, nFileIndexLow) I see. But if I am not mistaken, that would mean that I (1) had to detect NTFS volumes (2) use non-standard libraries to find these information (like the Python Win extentions). I am not seriously motivated to do so, but if somebody is interested to help, I am open to it. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
David Eppstein wrote: You need do no comparisons between files. Just use a sufficiently strong hash algorithm (SHA-256 maybe?) and compare the hashes. That's not very efficient. IMO, it only makes sense in network-based operations such as rsync. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
Christos TZOTZIOY Georgiou wrote: A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux). Changed. -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
In article [EMAIL PROTECTED], Patrick Useldinger [EMAIL PROTECTED] wrote: You need do no comparisons between files. Just use a sufficiently strong hash algorithm (SHA-256 maybe?) and compare the hashes. That's not very efficient. IMO, it only makes sense in network-based operations such as rsync. Well, but the spec didn't say efficiency was the primary criterion, it said minimizing the number of comparisons was. More seriously, the best I can think of that doesn't use a strong slow hash would be to group files by (file size, cheap hash) then compare each file in a group with a representative of each distinct file found among earlier files in the same group -- that leads to an average of about three reads per duplicated file copy: one to hash it, and two for the comparison between it and its representative (almost all of the comparisons will turn out equal but you still need to check unless you use a strong hash). I'm assuming of course that there are too many files and/or they're too large just to keep them all in core. Anyone have any data on whether reading files and SHA-256'ing them (or whatever other cryptographic hash you think is strong enough) is I/O-bound or CPU-bound? That is, is three reads with very little CPU overhead really cheaper than one read with a strong hash? I guess it also depends on the number of files you expect to have duplicates of. If most of the files exist in only one copy, it's clear that the cheap hash will find them more cheaply than the expensive hash. In that case you could combine the (file size, cheap hash) filtering with the expensive hash and get only two reads per copy rather than three. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Thursday 10 March 2005 11:02 am, Christos TZOTZIOY Georgiou wrote: On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock [EMAIL PROTECTED] might have written: For anyone interested in responding to the above, a starting place might be this maintenance script I wrote for my own use. I don't think it exactly matches the spec, but it addresses the problem. I wrote this to clean up a large tree of image files once. The exact behavior described requires the '--exec=ls %s' option as mentioned in the help. The drawback of this method is that you have to read everything. For example, if you have ten files less than 100KiB each and one file more than 2 GiB in size, there is no need to read the 2 GiB file, is there? If it's a one-shot attempt, I guess it won't mind a lot. On POSIX filesystems, one has also to avoid comparing files having same (st_dev, st_inum), because you know that they are the same file. Two good points, but hey, it ran fast enough for me. ;-) Cheers, Terry -- -- Terry Hancock ( hancock at anansispaceworks.com ) Anansi Spaceworks http://www.anansispaceworks.com -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
David Eppstein wrote: Well, but the spec didn't say efficiency was the primary criterion, it said minimizing the number of comparisons was. That's exactly what my program does. More seriously, the best I can think of that doesn't use a strong slow hash would be to group files by (file size, cheap hash) then compare each file in a group with a representative of each distinct file found among earlier files in the same group -- that leads to an average of about three reads per duplicated file copy: one to hash it, and two for the comparison between it and its representative (almost all of the comparisons will turn out equal but you still need to check unless you My point is : forget hashes. If you work with hashes, you do have to read each file completely, cheap hash or not. My program normally reads *at most* 100% of the files to analyse, but usually much less. Also, I do plain comparisons which are much cheaper than hash calculations. I'm assuming of course that there are too many files and/or they're too large just to keep them all in core. I assume that file handles are sufficient to keep one open per file of the same size. This lead to trouble on Windows installations, but I guess that's a parameter to change. On Linux, I never had the problem. Regarding buffer size, I use a maxumim which is then split up between all open files. Anyone have any data on whether reading files and SHA-256'ing them (or whatever other cryptographic hash you think is strong enough) is I/O-bound or CPU-bound? That is, is three reads with very little CPU overhead really cheaper than one read with a strong hash? It also depends on the OS. I found that my program runs much slower on Windows, probably due to the way Linux anticipates reads and tries to reduce head movement. I guess it also depends on the number of files you expect to have duplicates of. If most of the files exist in only one copy, it's clear that the cheap hash will find them more cheaply than the expensive hash. In that case you could combine the (file size, cheap hash) filtering with the expensive hash and get only two reads per copy rather than three. Sorry, but I can still not see a point tu use hashes. Maybe you'll have a look at my program and tell me where a hash could be useful? It's available at http://www.homepages.lu/pu/fdups.html. Regards, -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
In article [EMAIL PROTECTED], Patrick Useldinger [EMAIL PROTECTED] wrote: Well, but the spec didn't say efficiency was the primary criterion, it said minimizing the number of comparisons was. That's exactly what my program does. If you're doing any comparisons at all, you're not minimizing the number of comparisons. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein [EMAIL PROTECTED] might have written: More seriously, the best I can think of that doesn't use a strong slow hash would be to group files by (file size, cheap hash) then compare each file in a group with a representative of each distinct file found among earlier files in the same group -- that leads to an average of about three reads per duplicated file copy: one to hash it, and two for the comparison between it and its representative (almost all of the comparisons will turn out equal but you still need to check unless you use a strong hash). The code I posted in another thread (and provided a link in this one) does exactly that (a quick hash of the first few K before calculating the whole file's md5 sum). However, Patrick's code is faster, reading only what's necessary (he does what I intended to do, but I was too lazy-- I actually rewrote from scratch one of the first programs I wrote in Python, which obviously was too amateurish code for me to publish :) It seems your objections are related to Xah Lee's specifications; I have no objections to your objections (-:) other than that we are just trying to produce something of practical value out of an otherwise doomed thread... -- TZOTZIOY, I speak England very best. Be strict when sending and tolerant when receiving. (from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein [EMAIL PROTECTED] wrote: In article [EMAIL PROTECTED], Patrick Useldinger [EMAIL PROTECTED] wrote: Well, but the spec didn't say efficiency was the primary criterion, it said minimizing the number of comparisons was. That's exactly what my program does. If you're doing any comparisons at all, you're not minimizing the number of comparisons. ISTM a lot will depend on the distribution of differences between candidate (equal-sized, obviously) files. If multi-GB files differ in the last byte only (but this is not known), they will all have to be crunched to the last byte to eliminate them from possible duplicate-set membership. If there is a-priori knowledge of highly probable difference locations (e.g., internal date stamp at some offset etc) then obviously the candidates can be pre-classified into smaller candidate sets by some simple heuristic probes. If differences are likely to appear early in a sequential scan, perhaps developing hashes in parallel would be a good strategy. But I doubt if blindly pre-computing full hashes would be optimal. Compare to sort|uniq as a sieve. You wouldn't want to read through any files any further than you had to. Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
I wrote something similar, have a look at http://www.homepages.lu/pu/fdups.html. -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock [EMAIL PROTECTED] might have written: For anyone interested in responding to the above, a starting place might be this maintenance script I wrote for my own use. I don't think it exactly matches the spec, but it addresses the problem. I wrote this to clean up a large tree of image files once. The exact behavior described requires the '--exec=ls %s' option as mentioned in the help. The drawback of this method is that you have to read everything. For example, if you have ten files less than 100KiB each and one file more than 2 GiB in size, there is no need to read the 2 GiB file, is there? If it's a one-shot attempt, I guess it won't mind a lot. On POSIX filesystems, one has also to avoid comparing files having same (st_dev, st_inum), because you know that they are the same file. -- TZOTZIOY, I speak England very best. Be strict when sending and tolerant when receiving. (from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
I've written a python GUI wrapper around some shell scripts: http://www.pixelbeat.org/fslint/ the shell script logic is essentially: exclude hard linked files only include files where there are more than 1 with the same size print files with matching md5sum Pádraig. -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Thu, 10 Mar 2005 10:54:05 +0100, rumours say that Patrick Useldinger [EMAIL PROTECTED] might have written: I wrote something similar, have a look at http://www.homepages.lu/pu/fdups.html. That's fast and good. A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux). Have you found any way to test if two files on NTFS are hard linked without opening them first to get a file handle? -- TZOTZIOY, I speak England very best. Be strict when sending and tolerant when receiving. (from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
Christos TZOTZIOY Georgiou wrote: On POSIX filesystems, one has also to avoid comparing files having same (st_dev, st_inum), because you know that they are the same file. I then have a bug here - I consider all files with the same inode equal, but according to what you say I need to consider the tuple (st_dev,ST_ium). I'll have to fix that for 0.13. Thanks ;-) -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
Christos TZOTZIOY Georgiou wrote: That's fast and good. Nice to hear. A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux). I'll look into that. Have you found any way to test if two files on NTFS are hard linked without opening them first to get a file handle? No. And even then, I wouldn't know how to find out. -pu -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
In article [EMAIL PROTECTED], Xah Lee [EMAIL PROTECTED] wrote: a absolute requirement in this problem is to minimize the number of comparison made between files. This is a part of the spec. You need do no comparisons between files. Just use a sufficiently strong hash algorithm (SHA-256 maybe?) and compare the hashes. -- David Eppstein Computer Science Dept., Univ. of California, Irvine http://www.ics.uci.edu/~eppstein/ -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
David Eppstein wrote: In article [EMAIL PROTECTED], Xah Lee [EMAIL PROTECTED] wrote: a absolute requirement in this problem is to minimize the number of comparison made between files. This is a part of the spec. You need do no comparisons between files. Just use a sufficiently strong hash algorithm (SHA-256 maybe?) and compare the hashes. I did it as follows (some time ago): is filesize in hash? calculate md5 (and store), if equal then compare files. store info in hash. In some cases if might be faster to drop the md5 (since it reads all data) -- John Small Perl scripts: http://johnbokma.com/perl/ Perl programmer available: http://castleamber.com/ Happy Customers: http://castleamber.com/testimonials.html -- http://mail.python.org/mailman/listinfo/python-list
[perl-python] a program to delete duplicate files
here's a large exercise that uses what we built before. suppose you have tens of thousands of files in various directories. Some of these files are identical, but you don't know which ones are identical with which. Write a program that prints out which file are redundant copies. Here's the spec. -- The program is to be used on the command line. Its arguments are one or more full paths of directories. perl del_dup.pl dir1 prints the full paths of all files in dir1 that are duplicate. (including files in sub-directories) More specifically, if file A has duplicates, A's full path will be printed on a line, immediately followed the full paths of all other files that is a copy of A. These duplicates's full paths will be prefixed with rm string. A empty line follows a group of duplicates. Here's a sample output. inPath/a.jpg rm inPath/b.jpg rm inPath/3/a.jpg rm inPath/hh/eu.jpg inPath/ou.jpg rm inPath/23/a.jpg rm inPath/hh33/eu.jpg order does not matter. (i.e. which file will not be rm does not matter.) perl del_dup.pl dir1 dir2 will do the same as above, except that duplicates within dir1 or dir2 themselves not considered. That is, all files in dir1 are compared to all files in dir2. (including subdirectories) And, only files in dir2 will have the rm prefix. One way to understand this is to imagine lots of image files in both dir. One is certain that there are no duplicates within each dir themselves. (imagine that del_dup.pl has run on each already) Files in dir1 has already been categorized into sub directories by human. So that when there are duplicates among dir1 and dir2, one wants the version in dir2 to be deleted, leaving the organization in dir1 intact. perl del_dup.pl dir1 dir2 dir3 ... does the same as above, except files in later dir will have rm first. So, if there are these identical files: dir2/a dir2/b dir4/c dir4/d the c and d will both have rm prefix for sure. (which one has rm in dir2 does not matter) Note, although dir2 doesn't compare files inside itself, but duplicates still may be implicitly found by indirect comparison. i.e. a==c, b==c, therefore a==b, even though a and b are never compared. -- Write a Perl or Python version of the program. a absolute requirement in this problem is to minimize the number of comparison made between files. This is a part of the spec. feel free to write it however you want. I'll post my version in a few days. http://www.xahlee.org/perl-python/python.html Xah [EMAIL PROTECTED] http://xahlee.org/PageTwo_dir/more.html -- http://mail.python.org/mailman/listinfo/python-list
Re: [perl-python] a program to delete duplicate files
On Wednesday 09 March 2005 06:56 am, Xah Lee wrote: here's a large exercise that uses what we built before. suppose you have tens of thousands of files in various directories. Some of these files are identical, but you don't know which ones are identical with which. Write a program that prints out which file are redundant copies. For anyone interested in responding to the above, a starting place might be this maintenance script I wrote for my own use. I don't think it exactly matches the spec, but it addresses the problem. I wrote this to clean up a large tree of image files once. The exact behavior described requires the '--exec=ls %s' option as mentioned in the help. #!/usr/bin/env python # (C) 2003 Anansi Spaceworks #--- # find_duplicates Utility to find duplicate files in a directory tree by comparing their checksums. #--- # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA #--- import os, sys, md5, getopt def file_walker(tbl, srcpath, files): Visit a path and collect data (including checksum) for files in it. for file in files: filepath = os.path.join(srcpath, file) if os.path.isfile(filepath): chksum = md5.new(open(os.path.join(srcpath, file)).read()).digest() if not tbl.has_key(chksum): tbl[chksum]=[] tbl[chksum].append(filepath) def find_duplicates(treeroot, tbl=None): Find duplicate files in directory. dup = {} if tbl is None: tbl = {} os.path.walk(treeroot, file_walker, tbl) for k,v in tbl.items(): if len(v) 1: dup[k] = v return dup usage = USAGE: find_duplicates options [path ...] Find duplicate files (by matching md5 checksums) in a collection of paths (defaults to the current directory). Note that the order of the paths searched will be retained in the resulting duplicate file lists. This can be used with --exec and --index to automate handling. Options: -h, -H, --help Print this help. -q, --quiet Don't print normal report. -x, --exec=command string Python-formatted command string to act on the indexed duplicate in each duplicate group found. E.g. try --exec=ls %s -n, --index=index into duplicates Which in a series of duplicates to use. Begins with '1'. Default is '1' (i.e. the first file listed). Example: You've copied many files from path ./A into path ./B. You want to delete all the ones you've processed already, but not delete anything else: % find_duplicates -q --exec=rm %s --index=1 ./A ./B def main(): action = None quiet = 0 index = 1 dup= {} opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:', ['quiet', 'help', 'exec=', 'index=']) for opt, val in opts: if opt in ('-h', '-H', '--help'): print usage sys.exit() elif opt in ('-x', '--exec'): action = str(val) elif opt in ('-n', '--index'): index = int(val) elif opt in ('-q', '--quiet'): quiet = 1 if len(args)==0: dup = find_duplicates('.') else: tbl = {} for arg in args: dup = find_duplicates(arg, tbl=tbl) for k, v in dup.items(): if not quiet: print Duplicates: for f in v: print \t%s % f if action: os.system(action % v[index-1]) if __name__=='__main__': main() -- -- Terry Hancock ( hancock at anansispaceworks.com ) Anansi Spaceworks http://www.anansispaceworks.com -- http://mail.python.org/mailman/listinfo/python-list