Re: [Tutor] How to identify clusters of similar files

Steven D'Aprano Sat, 02 Jun 2012 19:02:43 -0700

Albert-Jan Roskam wrote:

> Hi,
>
> I want to use difflib to compare a lot (tens of thousands) of text files. I
> know that many files are quite similar as they are subsequent versions of
> the same document (a primitive kind of version control). What would be a
> good approach to cluster the files based on their likeness?

You have already identified the basic tool: difflib. But your question is not really about Python, it is more about the algorithm used for clustering data according to goodness of fit. That's a hard problem, and you should consider asking it on the main Python mailing list or newsgroup too.
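
To make the difflib route concrete, here is a minimal sketch of one naive approach, greedy threshold clustering. It is only an illustration under stated assumptions, not a recommended algorithm: the names similarity and cluster_texts, the 0.8 threshold and the sample documents are all hypothetical, and reading the files from disk is left out.

```python
import difflib


def similarity(a, b):
    """Return difflib's similarity ratio for two strings (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster_texts(texts, threshold=0.8):
    """Greedily group texts by similarity to each cluster's first member.

    texts: dict mapping a name (e.g. a filename) to its contents.
    threshold: assumed cut-off; tune it for your own documents.
    Returns a list of clusters, each a list of names.
    """
    clusters = []  # the first name in each cluster acts as its representative
    for name in sorted(texts):
        for cluster in clusters:
            representative = texts[cluster[0]]
            if similarity(representative, texts[name]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


# Made-up sample data standing in for file contents.
sample = {
    "report_v1.txt": "The quick brown fox jumps over the lazy dog.",
    "report_v2.txt": "The quick brown fox jumped over the lazy dog!",
    "notes.txt": "Completely unrelated shopping list: eggs, milk, bread.",
}
print(cluster_texts(sample))
# prints [['notes.txt'], ['report_v1.txt', 'report_v2.txt']]
```

The result depends on the order in which files are visited and on the threshold, which is one reason the general problem is hard. With tens of thousands of files, comparing each file against every cluster representative will be slow; SequenceMatcher.quick_ratio() returns a cheap upper bound on ratio() and could be used to skip pairs that are obviously dissimilar.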


Some search terms to get you started:

biopython
nltk  (the Natural Language Tool Kit)
unrooted phylogram


Good luck!


--
Steven
_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
