Re: Comparing two book chapters (text files)
On Feb 4, 10:20 pm, Nick Matzke mat...@berkeley.edu wrote:
> So I have an interesting challenge. I want to compare two book
> chapters, which I have in plain text format, and find out (a)
> percentage similarity and (b) what has changed.

no idea if it will help, but i found this yesterday - http://www.nltk.org/

it's a python toolkit for natural language processing. there's a book at http://www.nltk.org/book with much more info.

andrew

-- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing two book chapters (text files)
andrew cooke wrote:
> On Feb 4, 10:20 pm, Nick Matzke mat...@berkeley.edu wrote:
>> So I have an interesting challenge. I want to compare two book
>> chapters, which I have in plain text format, and find out (a)
>> percentage similarity and (b) what has changed.
>
> no idea if it will help, but i found this yesterday -
> http://www.nltk.org/
>
> it's a python toolkit for natural language processing. there's a
> book at http://www.nltk.org/book with much more info.

Also, there is difflib in the standard library, which can be used depending on the exact definition of similarity.

Regards
Tino
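A minimal sketch of the difflib suggestion, assuming that "percentage similarity" means the ratio reported by difflib.SequenceMatcher over the two word lists. The helper name similarity_percent and the sample sentences are made up for illustration; other definitions of similarity would give different numbers.

```python
import difflib


def similarity_percent(text_a, text_b):
    """Return a 0-100 similarity score based on matching runs of words."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False switches off the heuristic that treats very frequent
    # items as junk in sequences of 200 or more items.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return 100.0 * matcher.ratio()


# Hypothetical sample text, not taken from the chapters in question.
old = "The quick brown fox jumps over the lazy dog."
new = "The quick brown fox leaps over the lazy dog."
print("%.1f%% similar" % similarity_percent(old, new))
```

Splitting on whitespace first means that re-wrapped lines do not count as changes, since only the word sequence is compared.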
Re: Comparing two book chapters (text files)
On 2009-02-05 02:20, Nick Matzke wrote:
> Hi all,
>
> So I have an interesting challenge. I want to compare two book
> chapters, which I have in plain text format, and find out (a)
> percentage similarity and (b) what has changed.
>
> Some features make this problem different than what seems to be the
> standard text-matching problem solvable with e.g. difflib. Here is
> what I mean:
>
> * there is no guarantee that single lines from each file will be
>   directly comparable -- e.g., if a few words are inserted into a
>   sentence, then a chunk of the sentence will be moved to the next
>   line, then a chunk of that line moved to the next, etc.
>
> * Also, there are cases where paragraphs have been moved around,
>   sections re-ordered, etc. So it can't just be a linear match.
>
> I imagine this kind of thing can't be all that hard in the grand
> scheme of things, but I couldn't find an easily applicable solution
> readily available. I have advanced beginner python skills but am not
> quite where I could do this kind of thing from scratch without some
> guidance about the likely functions, libraries etc. to use.
>
> PS: I am going to have to do this for multiple book chapters so
> various software packages, e.g. for windows, are not really usable.
>
> Any help is much appreciated!!

difflib is in the Python stdlib and provides many ways to implement difference detection:

http://docs.python.org/library/difflib.html

Here's a script that I use for diff'ing text files on a word basis, called tdiff.py:

http://downloads.egenix.com/python/tdiff.py

It helps a lot with text that gets word wrapped or reformatted.

Cheers,
--
Marc-Andre Lemburg
eGenix.com
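To illustrate what diffing "on a word basis" can look like, here is a minimal sketch built on difflib.SequenceMatcher. It is not the tdiff.py script linked above; the function name word_diff and the sample text are assumptions, and the [-...-] / {+...+} markers are simply borrowed from wdiff's default output style.

```python
import difflib


def word_diff(old_text, new_text):
    """Return a one-line diff with [-removed-] and {+added+} markers."""
    # Splitting on any whitespace makes line wrapping irrelevant.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
        if tag in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


# Hypothetical sample: one word inserted, which also shifts the line break.
old = "a few words are\ninserted into a sentence"
new = "a few extra words are inserted\ninto a sentence"
print(word_diff(old, new))
```

Because the comparison runs over words rather than lines, the shifted line break in the sample is ignored and only the inserted word is reported.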
Re: Comparing two book chapters (text files)
On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke mat...@berkeley.edu wrote:
> Hi all,
>
> So I have an interesting challenge. I want to compare two book
> chapters, which I have in plain text format, and find out (a)
> percentage similarity and (b) what has changed.
> [...]
> PS: I am going to have to do this for multiple book chapters so
> various software packages, e.g. for windows, are not really usable.

Though not written in Python, wdiff (http://www.gnu.org/software/wdiff/wdiff.html) might be a good starting point.

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
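The question also mentions paragraphs that have been moved around, which a front-to-back word diff will report as a deletion in one place and an insertion in another. One possible way to handle that, sketched here under assumptions, is to pair each paragraph of the new text with its most similar paragraph in the old text regardless of position. The function names, the blank-line paragraph rule and the 0.6 threshold are all illustrative choices, not something proposed in the thread.

```python
import difflib
import re


def split_paragraphs(text):
    """Split on blank lines and normalise whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def best_matches(old_text, new_text, threshold=0.6):
    """For each new paragraph, find the most similar old paragraph.

    Returns a list of (new_index, old_index_or_None, ratio) tuples.
    The threshold is an arbitrary cut-off chosen for this sketch.
    """
    old_paras = split_paragraphs(old_text)
    results = []
    for new_index, new_para in enumerate(split_paragraphs(new_text)):
        best_index, best_ratio = None, 0.0
        for old_index, old_para in enumerate(old_paras):
            ratio = difflib.SequenceMatcher(
                None, old_para.split(), new_para.split(), autojunk=False
            ).ratio()
            if ratio > best_ratio:
                best_index, best_ratio = old_index, ratio
        if best_ratio < threshold:
            best_index = None  # no close counterpart: treat as new material
        results.append((new_index, best_index, best_ratio))
    return results


# Hypothetical sample: two paragraphs swapped, one edited, one added.
old = "Alpha beta gamma delta.\n\nOne two three four five."
new = ("One two three\nfour five six.\n\n"
       "Alpha beta gamma delta.\n\n"
       "Completely fresh material here.")
for new_index, old_index, ratio in best_matches(old, new):
    print(new_index, old_index, round(ratio, 2))
```

This compares every new paragraph against every old one, so the cost grows with the product of the two paragraph counts; for a single chapter that should be tolerable, but it is a design trade-off rather than an optimised approach.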
Re: Comparing two book chapters (text files)
Chris Rebert wrote:
> On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke mat...@berkeley.edu wrote:
>> Hi all,
>>
>> So I have an interesting challenge. I want to compare two book
>> chapters, which I have in plain text format, and find out (a)
>> percentage similarity and (b) what has changed.
>> [...]
>
> Though not written in Python, wdiff
> (http://www.gnu.org/software/wdiff/wdiff.html) might be a good
> starting point.

Wow -- this is actually amazingly effective. And fast! Simple to run from python then use python to parse the output.

Thanks!
Nick

> Cheers,
> Chris

--
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab, Center for Theoretical Evolutionary Genomics
Department of Integrative Biology
University of California, Berkeley
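A rough sketch of "run it from Python, then parse the output", assuming wdiff is installed and on the PATH and that it is used with its default markers, [-deleted-] and {+inserted+}. The function names and the sample string are made up for illustration, and the exit-status handling reflects my reading of the wdiff manual (0 for no differences, 1 for differences, 2 for trouble), so treat it as an assumption.

```python
import re
import subprocess

# wdiff's default markers: [-deleted words-] and {+inserted words+}.
DELETED = re.compile(r"\[-(.*?)-\]", re.DOTALL)
INSERTED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)


def run_wdiff(old_path, new_path):
    """Run wdiff on two files and return its output as text.

    Status 1 is assumed to mean "differences found", so only a status
    of 2 or higher is treated as an error here.
    """
    result = subprocess.run(
        ["wdiff", old_path, new_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


def summarize(wdiff_output):
    """Count deleted, inserted and unchanged words in wdiff output."""
    deleted = sum(len(m.split()) for m in DELETED.findall(wdiff_output))
    inserted = sum(len(m.split()) for m in INSERTED.findall(wdiff_output))
    # Whatever remains after removing both kinds of marked runs is common text.
    unchanged_text = INSERTED.sub(" ", DELETED.sub(" ", wdiff_output))
    return {
        "deleted": deleted,
        "inserted": inserted,
        "unchanged": len(unchanged_text.split()),
    }


# Hand-written sample in wdiff's output style, so the parser can be tried
# without wdiff being installed.
sample = "The quick [-brown-] {+red+} fox jumps {+high+} over the dog"
print(summarize(sample))
```

With real files the two steps would be chained, for example summarize(run_wdiff("chapter1_old.txt", "chapter1_new.txt")), where both file names are placeholders. Text that itself contains "[-" or "{+" would confuse this simple parser.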