On 07/03/2013 03:51 PM, bja...@jamesgang.dyndns.org wrote:
I've written my first program to take a given directory and look in all
directories below it for duplicate files (duplicate being defined as
having the same MD5 hash, which I know isn't a perfect solution, but it's
good enough for what I'm doing).

This is a great first project for learning Python. It's a utility which doesn't write any data to the disk (other than the result file), and therefore bugs won't cause much havoc. Trust me, you will have bugs, we all do. One of the things experience teaches you is how to isolate the damage that bugs do before they're discovered.


My problem now is that my output file is a rather confusing jumble of
paths and I'm not sure the best way to make it more user readable.  My gut
reaction would be to go through and list by first directory, but is there
a logical way to do it so that all the groupings that have files in the
same two directories would be grouped together?

I've come up with the same "presentation problem" with my own similar utilities. Be assured, there's no one "right answer."

The first question is: have you considered what you want when there are MORE than two copies of one of those files? Once you know what you'd like to see for four identical files, you might have a better idea what you should do even for two. Additionally, consider that two identical files may be in the same directory, with different names.

Anyway, if you can explain why you want a particular grouping, we might better understand how to accomplish it.
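
For what it's worth, grouping by hash handles any number of copies uniformly. Here's a sketch (assuming a hashdict like the one in your code, mapping each MD5 digest to a list of full paths):

    # Sketch: print each group of identical files together, whatever
    # its size: two copies, four, or forty.
    def report(hashdict):
        for digest, paths in sorted(hashdict.items()):
            if len(paths) > 1:              # only hashes with duplicates
                print("%d identical files (md5 %s):" % (len(paths), digest))
                for path in sorted(paths):
                    print("    " + path)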


So I'm thinking I'd have:
First File Dir /some/directory/
Duplicate directories:
some/other/directory/
    Original file 1, duplicate file 1
    Original file 2, duplicate file 2
some/third directory/
    original file 3, duplicate file 3

At present, this First File Dir could be any of the directories involved; without some effort, os.walk doesn't promise you any order of processing. But if you want them to appear in sorted order, you can do sorts at key points inside your os.walk code, and they'll at least come out in an order that's recognizable. (Some OSes may happen to hand os.walk sorted entries, but you'd do better not to count on it.) You could also sort each list in the itervalues of hashdict, after the dict is fully populated.
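
For instance (a sketch; rootdir and hashdict as in your code):

    # Sketch: get a deterministic order out of os.walk by sorting the
    # lists it yields, in place.  Sorting dirnames also controls the
    # order in which os.walk recurses into subdirectories.
    import os

    for dirpath, dirnames, filenames in os.walk(rootdir):
        dirnames.sort()
        filenames.sort()
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            # ... hash fullname and add it to hashdict as before ...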

Even with sorting, you run into the problem that there may be duplicates between some/other/directory and some/third/directory that are not in /some/directory. So in the sample you show above, they won't be listed with the ones that are in /some/directory.


and so forth, where the Original file would be the file name in the First
File Dir, so that they're all the same there.

I fear I'm not explaining this well but I'm hoping someone can either ask
questions to help get out of my head what I'm trying to do or can decipher
this enough to help me.

Here's a git repo of my code if it helps:
https://github.com/CyberCowboy/FindDuplicates


At 40 lines, you should have just included it. It's usually much better to include the code inline if you want any comments on it. Think of what the archives are going to show in a year, when you've removed that repo, or thoroughly updated it. Somebody at that time will not be able to make sense of comments directed at the current version of the code.

BTW, thanks for posting as text, since that'll mean that when you do post code, it shouldn't get mangled.

So I'll comment on the code.

You never call the dupe() function, so presumably this is a module intended to be used from somewhere else. But if that's the case, I would have expected it to be factored better, at least to separate the input processing from the output file formatting. That way you could re-use the dups logic and provide new output formatting without duplicating anything. The first function could return the hashdict, and the second one could analyze it to produce a particular formatted output.

The hashdict and dups variables should be initialized within the function, since they are not going to be used outside. Avoid non-const globals. And of course once you factor it, dups will be in the second function only.
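
A sketch of that factoring (the function names are mine, and md5_of is a hypothetical helper; one way to write it appears further down):

    import os

    def find_dupes(rootdir):
        """First half: return a dict mapping each md5 digest to a
        list of the full paths that have that digest."""
        hashdict = {}
        for dirpath, dirnames, filenames in os.walk(rootdir):
            for name in filenames:
                fullname = os.path.join(dirpath, name)
                digest = md5_of(fullname)      # hypothetical helper
                hashdict.setdefault(digest, []).append(fullname)
        return hashdict

    def save_report(hashdict, outname):
        """Second half: turn hashdict into whatever format you settle on."""
        with open(outname, "w") as out:
            for digest, paths in sorted(hashdict.items()):
                if len(paths) > 1:             # only groups with duplicates
                    out.write("%s\n" % ", ".join(sorted(paths)))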

You do have an if __name__ == "__main__": line, but it's inside the function. Probably you meant it to be at the left margin. And importing inside a conditional is seldom a good idea, though it doesn't matter here since you're not using the import; normally you want all your imports at the top, so they're easy to spot. You also probably want a call to dupe() inside the conditional, and perhaps some parsing of argv to get rootdir.
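
Something like this (a sketch; the output file name is made up for the example):

    # Imports at the top of the module, the guard at the left margin,
    # and rootdir taken from the command line.
    import os
    import sys

    # ... find_dupes() and save_report() from the earlier sketch ...

    if __name__ == "__main__":
        rootdir = sys.argv[1] if len(sys.argv) > 1 else "."
        save_report(find_dupes(rootdir), "duplicates.txt")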

You don't mention your OS, but many OSes have symbolic links or the equivalent. There's no code here to handle that possibility. Symlinks are a pain to do right. You could just state in your docs that no symlinks are allowed under the rootdir.
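
If you'd rather handle them in code, the minimal approach is simply to refuse to look at links at all. A sketch:

    # os.walk already defaults to followlinks=False, so symlinked
    # directories aren't descended into; this also skips files that
    # are themselves symlinks.
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            if os.path.islink(fullname):
                continue        # don't hash through a symlinked file
            # ... hash fullname as usual ...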

Your open() call has no mode argument. If you want the md5 to be "correct", the file has to be opened in binary. So you want something like:
    with open(fullname, "rb") as f:

It doesn't matter this time, since you never export the hashes. But if you ever needed to debug it, it'd be nice if the hashes matched the standard values produced by md5sum. Besides, if you're on Windows, treating a binary file as though it were text (translating line endings, and stopping at the first Ctrl-Z) could make two different files hash the same, giving you false duplicates.
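
For reference, here's one way to write the hashing helper assumed in the earlier sketch (md5_of is my name for it, not yours). It reads in binary and in chunks, so a large file needn't fit in memory, and its output will match md5sum:

    import hashlib

    def md5_of(fullname, chunksize=1 << 20):
        """Return the hex MD5 of a file, read in binary 1 MiB chunks."""
        h = hashlib.md5()
        with open(fullname, "rb") as f:
            for chunk in iter(lambda: f.read(chunksize), b""):
                h.update(chunk)
        return h.hexdigest()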




--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
