Re: Comparing two book chapters (text files)

2009-02-05 Thread andrew cooke
On Feb 4, 10:20 pm, Nick Matzke mat...@berkeley.edu wrote:
 So I have an interesting challenge.  I want to compare two book
 chapters, which I have in plain text format, and find out (a) percentage
 similarity and (b) what has changed.

no idea if it will help, but i found this yesterday - http://www.nltk.org/

it's a python toolkit for natural language processing.  there's a book
at http://www.nltk.org/book with much more info.

andrew
--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing two book chapters (text files)

2009-02-05 Thread Tino Wildenhain

andrew cooke wrote:
 On Feb 4, 10:20 pm, Nick Matzke mat...@berkeley.edu wrote:
  So I have an interesting challenge.  I want to compare two book
  chapters, which I have in plain text format, and find out (a) percentage
  similarity and (b) what has changed.

 no idea if it will help, but i found this yesterday - http://www.nltk.org/

 it's a python toolkit for natural language processing.  there's a book
 at http://www.nltk.org/book with much more info.


There is also difflib in the standard library, which can be used
depending on your exact definition of similarity.
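
A minimal sketch of one such definition, assuming a current Python 3:
compare the two texts as word sequences rather than lines, so that
re-wrapped lines do not count as changes. The function name
word_similarity is made up for illustration; the score is difflib's
ratio (twice the number of matching words divided by the total word
count), which is only one of several reasonable similarity measures.

```python
import difflib


def word_similarity(text_a, text_b):
    """Return a 0-100 similarity score computed over words, not lines."""
    # split() breaks on any whitespace, so line wrapping is ignored.
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False turns off the heuristic that discards very common
    # items in long sequences (frequent words such as "the").
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return 100.0 * matcher.ratio()


old = "The quick brown fox\njumps over the lazy dog."
new = "The quick brown fox jumps\nover the sleepy dog."
print("%.1f%% similar" % word_similarity(old, new))
```

For real chapters, read each file into a string first and pass the two
strings in.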

Regards
Tino




Re: Comparing two book chapters (text files)

2009-02-05 Thread M.-A. Lemburg
On 2009-02-05 02:20, Nick Matzke wrote:
 Hi all,
 
 So I have an interesting challenge.  I want to compare two book
 chapters, which I have in plain text format, and find out (a) percentage
 similarity and (b) what has changed.
 
 Some features make this problem different than what seems to be the
 standard text-matching problem solvable with e.g. difflib.  Here is what
 I mean:
 
 * there is no guarantee that single lines from each file will be
 directly comparable -- e.g., if a few words are inserted into a
 sentence, then a chunk of the sentence will be moved to the next line,
 then a chunk of that line moved to the next, etc.
 
 * Also, there are cases where paragraphs have been moved around,
 sections re-ordered, etc.  So it can't just be a linear match.
 
 I imagine this kind of thing can't be all that hard in the grand scheme
 of things, but I couldn't find an easily applicable solution readily
 available.  I have advanced beginner python skills but am not quite
 where I could do this kind of thing from scratch without some guidance
 about the likely functions, libraries etc. to use.
 
 PS: I am going to have to do this for multiple book chapters so various
 software packages, e.g. for windows, are not really usable.
 
 Any help is much appreciated!!

difflib is in the Python stdlib and provides many ways to implement
difference detection:

http://docs.python.org/library/difflib.html
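
For the re-ordered paragraphs mentioned in the question, one possible
approach with difflib is to match each paragraph against every
paragraph of the other version and keep the best score, regardless of
position. This is only a rough sketch, not something taken from the
thread: the function names and the 0.6 threshold are assumptions, and
the all-pairs comparison gets slow for very long chapters.

```python
import difflib
import re


def split_paragraphs(text):
    """Split on blank lines and normalise whitespace inside each paragraph."""
    return [" ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()]


def match_paragraphs(old_text, new_text, threshold=0.6):
    """For each new paragraph, return (new_index, old_index or None, ratio)."""
    old_paras = split_paragraphs(old_text)
    results = []
    for j, new_para in enumerate(split_paragraphs(new_text)):
        best_i, best_ratio = None, 0.0
        for i, old_para in enumerate(old_paras):
            ratio = difflib.SequenceMatcher(
                None, old_para.split(), new_para.split(), autojunk=False
            ).ratio()
            if ratio > best_ratio:
                best_i, best_ratio = i, ratio
        if best_ratio < threshold:
            best_i = None  # no old paragraph is close enough: treat as new text
        results.append((j, best_i, best_ratio))
    return results


old = "Alpha beta gamma delta.\n\nOne two three four five."
new = ("One two three four six.\n\nAlpha beta gamma delta.\n\n"
       "Something entirely different here.")
for new_idx, old_idx, ratio in match_paragraphs(old, new):
    print(new_idx, old_idx, round(ratio, 2))
```

A new paragraph whose best match sits at a different index has probably
been moved; one with no match above the threshold is probably new.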

Here's a script that I use for diff'ing text files on a word
basis, called tdiff.py:

http://downloads.egenix.com/python/tdiff.py

It helps a lot with text that gets word wrapped or reformatted.
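
The word-based idea can be sketched with difflib directly. This is not
tdiff.py itself, just an illustration of the same principle, assuming
Python 3; the function name word_changes is hypothetical.

```python
import difflib


def word_changes(text_a, text_b):
    """Return (tag, old_words, new_words) for every span that differs."""
    a = text_a.split()  # word lists, so line breaks play no role
    b = text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "delete", "insert"
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


old = "if a few words are inserted\ninto a sentence"
new = "if a few extra words are\ninserted into this sentence"
for tag, removed, added in word_changes(old, new):
    print(tag, repr(removed), "->", repr(added))
```

Because both texts are reduced to word lists before comparison, a
sentence that merely spills onto the next line produces no changes.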

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



Re: Comparing two book chapters (text files)

2009-02-04 Thread Chris Rebert
On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke mat...@berkeley.edu wrote:
 Hi all,

 So I have an interesting challenge.  I want to compare two book chapters,
 which I have in plain text format, and find out (a) percentage similarity
 and (b) what has changed.

 Some features make this problem different than what seems to be the standard
 text-matching problem solvable with e.g. difflib.  Here is what I mean:

 * there is no guarantee that single lines from each file will be directly
 comparable -- e.g., if a few words are inserted into a sentence, then a
 chunk of the sentence will be moved to the next line, then a chunk of that
 line moved to the next, etc.

 * Also, there are cases where paragraphs have been moved around, sections
 re-ordered, etc.  So it can't just be a linear match.

 I imagine this kind of thing can't be all that hard in the grand scheme of
 things, but I couldn't find an easily applicable solution readily available.
  I have advanced beginner python skills but am not quite where I could do
 this kind of thing from scratch without some guidance about the likely
 functions, libraries etc. to use.

 PS: I am going to have to do this for multiple book chapters so various
 software packages, e.g. for windows, are not really usable.

Though not written in Python, wdiff
(http://www.gnu.org/software/wdiff/wdiff.html) might be a good
starting point.

Cheers,
Chris

-- 
Follow the path of the Iguana...
http://rebertia.com


Re: Comparing two book chapters (text files)

2009-02-04 Thread Nick Matzke



Chris Rebert wrote:


 Though not written in Python, wdiff
 (http://www.gnu.org/software/wdiff/wdiff.html) might be a good
 starting point.



Wow -- this is actually amazingly effective.  And fast!  Simple to run
from Python, then use Python to parse the output.
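
A rough sketch of that workflow, assuming Python 3 and a wdiff
executable on the PATH. It relies on wdiff's default output markers,
[-...-] for deleted words and {+...+} for inserted ones, and on wdiff
exiting with status 1 when the files differ (as diff does), which is my
understanding of its behaviour. The function names run_wdiff and
parse_wdiff are made up for illustration.

```python
import re
import subprocess

# wdiff's default markers: [-deleted words-] and {+inserted words+}
WDIFF_SPAN = re.compile(r"\[-(.*?)-\]|\{\+(.*?)\+\}", re.DOTALL)


def run_wdiff(path_a, path_b):
    """Run wdiff on two files and return its marked-up output as text."""
    # Status 1 just means "differences found", so check=True is not used.
    proc = subprocess.run(["wdiff", path_a, path_b],
                          stdout=subprocess.PIPE, universal_newlines=True)
    if proc.returncode > 1:
        raise RuntimeError("wdiff failed with status %d" % proc.returncode)
    return proc.stdout


def parse_wdiff(output):
    """Return (deleted, inserted) lists of the text spans wdiff marked."""
    deleted, inserted = [], []
    for match in WDIFF_SPAN.finditer(output):
        if match.group(1) is not None:
            deleted.append(match.group(1))
        else:
            inserted.append(match.group(2))
    return deleted, inserted


# Hand-written sample in wdiff's output format, so this runs without wdiff.
sample = "The quick [-brown-] {+red+} fox jumps over the {+very+} lazy dog."
print(parse_wdiff(sample))
```

With real files this would be parse_wdiff(run_wdiff("old.txt",
"new.txt")), where the file names are placeholders. wdiff also has a -s
(--statistics) option that reports word counts and percentages, which
may cover the percentage-similarity part.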


Thanks!
Nick







