Re: [somewhat OT] Text processing question (sort of)

Robert Brenstein Mon, 19 May 2008 14:38:39 -0700

On 18/05/08 at 23:03 -0700 Kee Nethery apparently wrote:

Interesting problem.
If you are looking for typos, here are my thoughts.

What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two words instead of one)
3. Punctuation differences
4. Perhaps words such as "the", "and", "an" missing from titles.

...
So long story short, slice and dice the quotes to collect a set of pairs that appear to be similar. Then build a flashcard kind of interface in RunRev that allows you the human to read the two similar quotes and decide whether to delete one or not.
I'd combine brute force with human visuals. 40000 lines seems like a small data set for brute force.
Kee Nethery

Finding identical lines is fairly trivial. Using fuzzy search to find similar lines is definitely more complicated. However, there are well-known algorithms for detecting spelling errors. One of the common and rather simple approaches is to compute the so-called Damerau-Levenshtein distance. This is quite fast in Rev.


http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
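
To make the idea concrete, here is a minimal sketch in Python rather than Rev (the language choice and the function name osa_distance are assumptions made for illustration, not something from this thread). Note that it computes the "optimal string alignment" variant, a restricted form of Damerau-Levenshtein distance in which no substring is edited more than once; the unrestricted version needs extra bookkeeping.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters
    for j in range(cols):
        d[0][j] = j          # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition, e.g. "teh" -> "the"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

With this sketch, osa_distance("teh", "the") is 1, because swapping two adjacent letters counts as a single edit, whereas plain Levenshtein distance would give 2. The table is O(len(a) * len(b)) in time and memory, so for long lines it pays to run it only on pairs already suspected to be similar.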

The approach I'd take

0. find and eliminate identical items
1. clean the word spacing
2. find and eliminate identical items

3. compare and clean punctuation. This may require partial human inspection but the program can report those cases.

4. again eliminate identical items

5. use a simplified approach, like what Kee suggests or computing word factor as you suggested, to identify line pairs suspected to differ by spelling and other minor alterations.

6. compute Damerau-Levenshtein distance for those and report cases for human inspection.

7. correct typos and standardize texts as needed.
8. find and eliminate identical items.
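
The cleaning and pre-filtering steps above (0 to 5) might be sketched roughly as follows, again in Python rather than Rev. All function names are hypothetical, the word-overlap ratio is only a stand-in for the "word factor" mentioned in step 5 (which is not defined in this thread), and the 0.6 threshold is an arbitrary assumption to be tuned on real data.

```python
import string
from collections import defaultdict

def clean_spacing(line: str) -> str:
    """Step 1: collapse runs of whitespace and trim the ends."""
    return " ".join(line.split())

def dedupe(lines):
    """Steps 0, 2, 4, 8: drop exact duplicates, keeping first occurrences."""
    seen, kept = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept

def punctuation_key(line: str) -> str:
    """Lower-cased line with ASCII punctuation removed (assumed normal form)."""
    table = str.maketrans("", "", string.punctuation)
    return clean_spacing(line.translate(table)).lower()

def punctuation_variants(lines):
    """Step 3: report groups of lines differing only in punctuation or case."""
    groups = defaultdict(list)
    for line in lines:
        groups[punctuation_key(line)].append(line)
    return [group for group in groups.values() if len(group) > 1]

def word_overlap(a: str, b: str) -> float:
    """Shared words divided by all distinct words (an assumed similarity)."""
    wa, wb = set(punctuation_key(a).split()), set(punctuation_key(b).split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def candidate_pairs(lines, threshold=0.6):
    """Step 5: brute-force all pairs, keep those with enough shared words."""
    pairs = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if word_overlap(lines[i], lines[j]) >= threshold:
                pairs.append((lines[i], lines[j]))
    return pairs
```

The groups from punctuation_variants and the pairs from candidate_pairs are meant to be reported for inspection rather than deleted automatically; the surviving pairs are what step 6 would feed to the distance computation. The all-pairs loop is quadratic, so with 40000 lines it is roughly 800 million comparisons, which is feasible but slow in a scripting language unless the lines are first bucketed in some way.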

Robert
