Albert-Jan Roskam wrote:
> Hi,
> I want to use difflib to compare a lot (tens of thousands) of text files. I
> know that many files are quite similar as they are subsequent versions of
> the same document (a primitive kind of version control). What would be a
> good approach to cluster the files based on their likeness?
You have already identified the basic tool: difflib. But your question is not
really about Python; it is more about the algorithm used to cluster data
according to how similar the items are. That's a hard problem, and you should
consider asking it on the main Python mailing list or newsgroup too.
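To make the difflib part concrete, here is a rough sketch of one naive
approach: score pairs of texts with difflib.SequenceMatcher and greedily put
each text into the first cluster whose first member is similar enough. The
function names, the sample data and the 0.8 threshold are all made up for
illustration, and the greedy strategy is just an assumption to show the idea,
not a recommended clustering algorithm.

```python
import difflib


def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster_texts(texts, threshold=0.8):
    """Greedy one-pass clustering of a {name: text} mapping.

    Each text joins the first cluster whose representative (the first
    member added to it) scores at least `threshold`; otherwise it
    starts a new cluster. The threshold is an arbitrary assumption.
    """
    clusters = []  # each cluster is a list of names
    for name, text in texts.items():
        for cluster in clusters:
            representative = texts[cluster[0]]
            if similarity(representative, text) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([name])
    return clusters


# Hypothetical sample data standing in for the contents of real files.
docs = {
    "report_v1.txt": "The quick brown fox jumps over the lazy dog.",
    "report_v2.txt": "The quick brown fox jumped over the lazy dog!",
    "notes.txt": "Completely unrelated shopping list: eggs, milk, bread.",
}
clusters = cluster_texts(docs)
print(clusters)
```

Be warned that comparing every file against every cluster gets slow with tens
of thousands of files. SequenceMatcher also has quick_ratio() and
real_quick_ratio(), which return cheaper upper bounds on ratio() and could be
used to skip pairs that obviously don't match.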
Some search terms to get you started:
biopython
nltk (the Natural Language Tool Kit)
unrooted phylogram
Good luck!
--
Steven