Albert-Jan Roskam wrote:
> Hi,
>
> I want to use difflib to compare a lot (tens of thousands) of text files. I
> know that many files are quite similar as they are subsequent versions of
> the same document (a primitive kind of version control). What would be a
> good approach to cluster the files based on their likeness?

You have already identified the basic tool: difflib. But your question is not really about Python; it is about which algorithm to use for clustering data by similarity. That's a hard problem, and you should consider asking it on the main Python mailing list or newsgroup too.

Some search terms to get you started:

biopython
nltk  (the Natural Language Tool Kit)
unrooted phylogram
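
In the meantime, here is a minimal sketch of one naive way to start, not a recommendation of the "right" clustering algorithm. It assumes the file contents are already loaded into a dict mapping names to text, compares files line by line, and greedily puts each file into the first cluster whose first member is similar enough. The function names and the 0.8 threshold are made up for illustration. SequenceMatcher's real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(), so they are checked first to skip hopeless pairs.

```python
import difflib


def is_similar(a_lines, b_lines, threshold):
    """Return True if two lists of lines have a difflib ratio >= threshold."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines)
    # The two quick ratios are upper bounds on ratio(), so if either is
    # below the threshold the expensive ratio() call can be skipped.
    return (matcher.real_quick_ratio() >= threshold
            and matcher.quick_ratio() >= threshold
            and matcher.ratio() >= threshold)


def cluster_by_similarity(texts, threshold=0.8):
    """Greedy one-pass clustering (an illustrative assumption, not the
    only or best approach).

    texts maps a file name to its contents. Each file joins the first
    cluster whose representative (its first member) is similar enough;
    otherwise it starts a new cluster. Returns a list of lists of names.
    """
    lines = {name: text.splitlines() for name, text in texts.items()}
    clusters = []
    for name in texts:
        for members in clusters:
            if is_similar(lines[members[0]], lines[name], threshold):
                members.append(name)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([name])
    return clusters
```

With tens of thousands of files the worst case is still a very large number of comparisons, and the result depends on the order in which files are visited, so treat this only as a starting point before looking at proper hierarchical clustering.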


Good luck!


--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
