On 07/03/2013 03:51 PM, bja...@jamesgang.dyndns.org wrote:
I've written my first program to take a given directory and look in all
directories below it for duplicate files (duplicate being defined as
having the same MD5 hash, which I know isn't a perfect solution, but it's
good enough for what I'm doing).

This is a great first project for learning Python. It's a utility which doesn't write any data to the disk (other than the result file), and therefore bugs won't cause much havoc. Trust me, you will have bugs, we all do. One of the things experience teaches you is how to isolate the damage that bugs do before they're discovered.


My problem now is that my output file is a rather confusing jumble of
paths and I'm not sure the best way to make it more user readable.  My gut
reaction would be to go through and list by first directory, but is there
a logical way to do it so that all the groupings that have files in the
same two directories would be grouped together?

I've come up with the same "presentation problem" with my own similar utilities. Be assured, there's no one "right answer."

The first question is: have you considered what you want when there are MORE than two copies of one of those files? Once you know what you'd like to see for four identical files, you might have a better idea what you should do even for two. Additionally, consider that two identical files may be in the same directory, with different names.

Anyway, if you can explain why you want a particular grouping, we might better understand how to accomplish it.
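
For what it's worth, grouping by hash handles any number of copies uniformly. Here's a sketch (assuming a hashdict like the one in your code, mapping each MD5 digest to a list of full paths):

    # Sketch: print each group of identical files together, whatever
    # its size: two copies, four, or forty.
    def report(hashdict):
        for digest, paths in sorted(hashdict.items()):
            if len(paths) > 1:              # only hashes with duplicates
                print("%d identical files (md5 %s):" % (len(paths), digest))
                for path in sorted(paths):
                    print("    " + path)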


So I'm thinking I'd have:
First File Dir /some/directory/
Duplicate directories:
some/other/directory/
    Original file 1, duplicate file 1
    Original file 2, duplicate file 2
some/third directory/
    original file 3, duplicate file 3

At present, this First File Dir could be any of the directories involved; without some effort, os.walk doesn't promise you any order of processing. But if you want them to appear in sorted order, you can do sorts at key points inside your os.walk code, and they'll at least come out in an order that's recognizable. (Some OSes may happen to hand os.walk sorted entries, but you'd do better not to count on it.) You could also sort each list in the itervalues of hashdict, after the dict is fully populated.
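
For instance (a sketch; rootdir and hashdict as in your code):

    # Sketch: get a deterministic order out of os.walk by sorting the
    # lists it yields, in place.  Sorting dirnames also controls the
    # order in which os.walk recurses into subdirectories.
    import os

    for dirpath, dirnames, filenames in os.walk(rootdir):
        dirnames.sort()
        filenames.sort()
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            # ... hash fullname and add it to hashdict as before ...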

Even with sorting, you run into the problem that there may be duplicates between some/other/directory and some/third/directory that are not in /some/directory. So in the sample you show above, they won't be listed with the ones that are in /some/directory.


and so forth, where the Original file would be the file name in the First
File Dir, so that they're all the same there.

I fear I'm not explaining this well but I'm hoping someone can either ask
questions to help get out of my head what I'm trying to do or can decipher
this enough to help me.

Here's a git repo of my code if it helps:
https://github.com/CyberCowboy/FindDuplicates


At 40 lines, you should have just included it. It's usually much better to include the code inline if you want any comments on it. Think of what the archives are going to show in a year, when you've removed that repo, or thoroughly updated it. Somebody at that time will not be able to make sense of comments directed at the current version of the code.

BTW, thanks for posting as text, since that'll mean that when you do post code, it shouldn't get mangled.

So I'll comment on the code.

You never call the dupe() function, so presumably this is a module intended to be used from somewhere else. But if that's the case, I would have expected it to be factored better, at least to separate the input processing from the output file formatting. That way you could re-use the dups logic and provide new output formatting without duplicating anything. The first function could return the hashdict, and the second one could analyze it to produce a particular formatted output.

The hashdict and dups variables should be initialized within the function, since they are not going to be used outside. Avoid non-const globals. And of course once you factor it, dups will be in the second function only.
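
A sketch of that factoring (the function names are mine, and md5_of is a hypothetical helper; one way to write it appears further down):

    import os

    def find_dupes(rootdir):
        """First half: return a dict mapping each md5 digest to a
        list of the full paths that have that digest."""
        hashdict = {}
        for dirpath, dirnames, filenames in os.walk(rootdir):
            for name in filenames:
                fullname = os.path.join(dirpath, name)
                digest = md5_of(fullname)      # hypothetical helper
                hashdict.setdefault(digest, []).append(fullname)
        return hashdict

    def save_report(hashdict, outname):
        """Second half: turn hashdict into whatever format you settle on."""
        with open(outname, "w") as out:
            for digest, paths in sorted(hashdict.items()):
                if len(paths) > 1:             # only groups with duplicates
                    out.write("%s\n" % ", ".join(sorted(paths)))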

You do have an if __name__ == "__main__": line, but it's inside the function. Probably you meant it to be at the left margin. And importing inside a conditional is seldom a good idea, though it doesn't matter here since you're not using the import; normally you want all your imports at the top, so they're easy to spot. You also probably want a call to dupe() inside the conditional, and perhaps some parsing of argv to get rootdir.
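
Something like this (a sketch; the output file name is made up for the example):

    # Imports at the top of the module, the guard at the left margin,
    # and rootdir taken from the command line.
    import os
    import sys

    # ... find_dupes() and save_report() from the earlier sketch ...

    if __name__ == "__main__":
        rootdir = sys.argv[1] if len(sys.argv) > 1 else "."
        save_report(find_dupes(rootdir), "duplicates.txt")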

You don't mention your OS, but many OSes have symbolic links or the equivalent. There's no code here to handle that possibility. Symlinks are a pain to do right. You could just state in your docs that no symlinks are allowed under the rootdir.
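
If you'd rather handle them in code, the minimal approach is simply to refuse to look at links at all. A sketch:

    # os.walk already defaults to followlinks=False, so symlinked
    # directories aren't descended into; this also skips files that
    # are themselves symlinks.
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            if os.path.islink(fullname):
                continue        # don't hash through a symlinked file
            # ... hash fullname as usual ...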

Your open() call has no mode argument. If you want the md5 to be "correct", the file has to be opened in binary. So you want something like:
    with open(fullname, "rb") as f:

It doesn't matter this time, since you never export the hashes. But if you ever needed to debug it, it'd be nice if the hashes matched the standard values produced by md5sum. Besides, if you're on Windows, treating a binary file as though it were text (translating line endings, and stopping at the first Ctrl-Z) could make two different files hash the same, giving you false duplicates.
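
For reference, here's one way to write the hashing helper assumed in the earlier sketch (md5_of is my name for it, not yours). It reads in binary and in chunks, so a large file needn't fit in memory, and its output will match md5sum:

    import hashlib

    def md5_of(fullname, chunksize=1 << 20):
        """Return the hex MD5 of a file, read in binary 1 MiB chunks."""
        h = hashlib.md5()
        with open(fullname, "rb") as f:
            for chunk in iter(lambda: f.read(chunksize), b""):
                h.update(chunk)
        return h.hexdigest()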




--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
