On 18/05/08 at 23:03 -0700 Kee Nethery apparently wrote:
Interesting problem.

If you are looking for typos, here are my thoughts.

What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two words instead of one)
3. Punctuation differences
4. Perhaps words such as "the", "and", "an" missing from titles.

...
So, long story short: slice and dice the quotes to collect a set of pairs that appear to be similar. Then build a flashcard-style interface in RunRev that lets you, the human, read the two similar quotes and decide whether to delete one or not.

I'd combine brute force with human visuals. 40000 lines seems like a small data set for brute force.

Kee Nethery

Finding identical lines is fairly trivial. Using fuzzy search to find similar lines is definitely more complicated. However, there are well-known algorithms for detecting spelling errors. One common and fairly simple approach is to compute the so-called Damerau-Levenshtein distance. This is quite fast in Rev.

http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
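
To make the distance idea concrete, here is a minimal sketch in Python (not Rev, so treat it as pseudocode to port). It computes the restricted variant of the Damerau-Levenshtein distance, often called optimal string alignment, where a swap of two adjacent characters counts as a single edit. The function name osa_distance is my own choice for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "teh" -> "the"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

A pair such as "teh" and "the" comes out at distance 1, so a small threshold relative to line length is probably a reasonable starting point for flagging pairs for human inspection.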

The approach I'd take:

0. find and eliminate identical items
1. clean the word spacing
2. find and eliminate identical items
3. compare and clean punctuation. This may partly require human inspection, but the program can report those cases.
4. again eliminate identical items
5. use a simplified approach, like what Kee suggests or computing word factor as you suggested, to identify line pairs suspected to differ by spelling and other minor alterations.
6. compute Damerau-Levenshtein distance for those and report cases for human inspection.
7. correct typos and standardize texts as needed.
8. find and eliminate identical items.
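
Steps 0 to 4 above could be sketched roughly as follows, again in Python rather than Rev. All of the helper names (clean_spacing, comparison_key, dedupe, punctuation_variants) are hypothetical, and stripping every ASCII punctuation mark and ignoring case when comparing are my assumptions; adjust to whatever the data actually needs.

```python
import string

# Translation table that deletes all ASCII punctuation (an assumption:
# the real data may need a narrower or wider set of characters).
_PUNCT = str.maketrans("", "", string.punctuation)


def clean_spacing(line: str) -> str:
    """Collapse runs of whitespace and trim both ends (step 1)."""
    return " ".join(line.split())


def comparison_key(line: str) -> str:
    """Key that ignores punctuation, spacing and case (assumed for step 3)."""
    return clean_spacing(line.translate(_PUNCT)).lower()


def dedupe(lines):
    """Drop exact duplicates, keeping first occurrences in order (steps 0, 2, 4)."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def punctuation_variants(lines):
    """Group lines that differ only by punctuation or case, for human review."""
    groups = {}
    for line in lines:
        groups.setdefault(comparison_key(line), []).append(line)
    return [group for group in groups.values() if len(group) > 1]


quotes = [
    "To be,  or not to be",
    "To be, or not to be",
    "To be or not to be",
    "Carpe diem",
]
cleaned = dedupe(clean_spacing(q) for q in quotes)
to_review = punctuation_variants(cleaned)
```

Here cleaned keeps three lines, and to_review holds the one group of two "To be" variants that a human would then decide on.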

Robert
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
