Interesting problem.

If you are looking for typos, here are my thoughts.

What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two words instead of one)
3. Punctuation differences
4. Perhaps words such as "the", "and", "an" missing from titles.

I think I would generate a letter count for each quotation.

For your example:
"God bless America"    George W Bush    Houston, March 18 2005
"Godi bless America"    George W Bush    Huston, March 18 2005

The quotation letter counts are
2 1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "God bless America" (2 a's, 1 b, 1 c ...)
and
2 1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "Godi bless America"

Then if you sort by these number sets and compare to see how similar each count is, you'll get potential matches that you should eyeball. For example, these two strings have all but one count exactly the same. I'd go through this process multiple times by rotating the first count to the rear and re-sorting.

1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2

and just keep doing that until every letter has had a chance to be the first in the sort.
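
Here is a rough sketch of the count, rotate and sort idea. It is in Python rather than Revolution, just to show the shape of it. The function names, the "sum of differences" similarity measure and the max_distance cutoff of 2 are my own assumptions for illustration; pick whatever cutoff gives you a list you can stand to eyeball.

```python
from string import ascii_lowercase


def letter_counts(text):
    """Return a 26-tuple: how many a's, b's, ..., z's the text contains."""
    lowered = text.lower()
    return tuple(lowered.count(ch) for ch in ascii_lowercase)


def count_distance(a, b):
    """Sum of absolute differences between two count tuples (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def candidate_pairs(quotes, max_distance=2):
    """Sort by each rotation of the counts and pair up similar neighbours."""
    counts = [letter_counts(q) for q in quotes]
    found = set()
    for shift in range(26):
        # Rotate the counts so a different letter leads the sort on each pass.
        order = sorted(
            range(len(quotes)),
            key=lambda i: counts[i][shift:] + counts[i][:shift],
        )
        # After sorting, only neighbouring quotes are compared.
        for left, right in zip(order, order[1:]):
            if count_distance(counts[left], counts[right]) <= max_distance:
                found.add((min(left, right), max(left, right)))
    return sorted(found)


quotes = [
    "God bless America",
    "To be or not to be",
    "Godi bless America",
]
print(letter_counts(quotes[0]))
print(candidate_pairs(quotes))  # index pairs worth a human look
```

The pairs that come back are only candidates. Nothing here deletes anything.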

Basically, what I'd have it do is find pairs of quotes that appear to be very similar. Then, once you have a huge list of potential pairs, have something that displays them to you one pair at a time so that you can quickly tell the interface to delete one or to skip it.

I really do think you are going to want to make no changes to the data unless you look at the matches with your eyeballs. You could very easily end up with two completely different quotes that are extremely similar, just because person B was playing with a famous quote from person A.

So, long story short: slice and dice the quotes to collect a set of pairs that appear to be similar. Then build a flashcard kind of interface in RunRev that lets you, the human, read the two similar quotes and decide whether to delete one or not.
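
The flashcard step could look something like the sketch below. Again this is Python standing in for a real RunRev stack, and review_pairs, ask_on_console and the "1" / "2" / anything-else answers are all made-up names and conventions for illustration. The point is only that nothing gets marked for deletion unless a human says so.

```python
def review_pairs(quotes, pairs, decide):
    """Show each candidate pair to `decide`; return indexes marked for deletion."""
    to_delete = set()
    for first, second in pairs:
        # If one side is already marked for deletion, there is nothing to ask.
        if first in to_delete or second in to_delete:
            continue
        choice = decide(quotes[first], quotes[second])
        if choice == "1":
            to_delete.add(first)
        elif choice == "2":
            to_delete.add(second)
        # Any other answer means "skip": both quotes are kept.
    return to_delete


def ask_on_console(quote_a, quote_b):
    """Hypothetical prompt: type 1 or 2 to delete that quote, Enter to skip."""
    print("1:", quote_a)
    print("2:", quote_b)
    return input("Delete 1, delete 2, or Enter to skip? ").strip()


# Example call (interactive, so left commented out):
# doomed = review_pairs(quotes, pairs, ask_on_console)
```

Passing the decision function in as a parameter means the same loop works whether the answer comes from a console prompt, a button click, or a test.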

I'd combine brute force with human visuals. 40000 lines seems like a small data set for brute force.

Kee Nethery