Re: [somewhat OT] Text processing question (sort of)

Kee Nethery Sun, 18 May 2008 23:03:13 -0700

Interesting problem.

If you are looking for typos, here are my thoughts.


What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two words instead of one)
3. Punctuation differences
4. Perhaps words such as "the", "and", "an" missing from titles

I think I would generate a letter count for each quotation.

For your example:
"God bless America"    George W Bush    Houston, March 18 2005
"Godi bless America"    George W Bush    Huston, March 18 2005

The quotation letter counts are

2 1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "God bless America" (2 a's, 1 b, 1 c, 1 d ...)

and

2 1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "Godi bless America"
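
To make the counting concrete, here is a minimal sketch in Python (not RunRev; the language and the function name `letter_counts` are my assumptions for illustration). It ignores case, spaces, and punctuation, which is one reasonable reading of "letter count" and also takes care of the extra-space and punctuation errors listed above.

```python
import string
from collections import Counter

def letter_counts(text):
    """Return a 26-element tuple with the number of a's, b's, ... z's in text.

    Case, spaces, digits, and punctuation are ignored (an assumption).
    """
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return tuple(counts[letter] for letter in string.ascii_lowercase)

first = letter_counts("God bless America")
second = letter_counts("Godi bless America")

# Positions where the two count sets disagree (0 = a, 8 = i, ...).
differing = [i for i, (x, y) in enumerate(zip(first, second)) if x != y]
print(" ".join(str(n) for n in first))
print(" ".join(str(n) for n in second))
print(differing)
```

For the two example quotes the only differing position is the one for "i".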

Then if you sort by these number sets and compare to see how similar each count is, you'll get potential matches that you should eyeball. For example, these two strings have all but one count exactly the same. Two quotes that differ in an early letter can sort far apart, so I'd go through this process multiple times by rotating the first count to the rear and re-sorting.


1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2

and just keep doing that until every letter has had a chance to be the first in the sort.
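
The rotate-and-re-sort passes could be sketched roughly like this in Python (again an illustration, not RunRev). The names `count_distance` and `candidate_pairs`, and the `max_distance` threshold of 2, are hypothetical choices of mine; the idea is only that after each sort, neighbours whose counts barely differ get flagged for eyeballing.

```python
import string
from collections import Counter

def letter_counts(text):
    """26-element tuple of letter counts, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return tuple(counts[letter] for letter in string.ascii_lowercase)

def count_distance(a, b):
    """Sum of the absolute differences between two count tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def candidate_pairs(quotes, max_distance=2):
    """Return index pairs of quotes that sort next to each other and nearly match.

    max_distance is an assumed threshold; tune it against real data.
    """
    vectors = [letter_counts(q) for q in quotes]
    found = set()
    for shift in range(26):
        # Rotate each tuple so that letter number `shift` leads the sort key.
        order = sorted(
            range(len(quotes)),
            key=lambda i: vectors[i][shift:] + vectors[i][:shift],
        )
        # Only neighbours in the sorted order are compared.
        for left, right in zip(order, order[1:]):
            if count_distance(vectors[left], vectors[right]) <= max_distance:
                found.add((min(left, right), max(left, right)))
    return sorted(found)

quotes = ["God bless America", "To be or not to be", "Godi bless America"]
print(candidate_pairs(quotes))
```

Each pass costs one sort, so 26 passes over 40000 quotes is cheap compared with comparing every quote against every other.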

Basically, the thing I'd have it do is find pairs of quotes that appear to be very similar. Then, once you have a huge list of potential pairs, have something that displays them to you in pairs so that you can quickly tell the interface to delete one or to skip it.

I really do think you are going to want to make no changes to the data unless you look at the matches with your eyeballs. You could very easily end up with two completely different quotes that are extremely similar, just because person B was playing with a famous quote from person A.

So, long story short: slice and dice the quotes to collect a set of pairs that appear to be similar. Then build a flashcard kind of interface in RunRev that allows you, the human, to read the two similar quotes and decide whether to delete one or not.
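
The review step itself would be a RunRev interface, but the logic behind it might look something like this Python sketch. Everything here is hypothetical: `review_pairs` is a made-up name, and `decide` stands in for the human, returning "first", "second", or "skip" to say which quote of the pair to delete. Nothing is removed unless `decide` says so.

```python
def review_pairs(quotes, pairs, decide):
    """Show each candidate pair to `decide` and return the quotes that survive.

    `pairs` holds (index, index) tuples; `decide(a, b)` returns "first" or
    "second" to delete that quote, or anything else to skip the pair.
    """
    to_delete = set()
    for a, b in pairs:
        if a in to_delete or b in to_delete:
            continue  # one side is already marked, so there is nothing to ask
        choice = decide(quotes[a], quotes[b])
        if choice == "first":
            to_delete.add(a)
        elif choice == "second":
            to_delete.add(b)
    return [q for i, q in enumerate(quotes) if i not in to_delete]

quotes = ["God bless America", "Godi bless America", "To be or not to be"]
# A stand-in for the human: always delete the second quote of a pair.
kept = review_pairs(quotes, [(0, 1)], lambda first, second: "second")
print(kept)
```

In an interactive session `decide` could simply print both quotes and read a keystroke.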

I'd combine brute force with human visuals. 40000 lines seems like a small data set for brute force.


Kee Nethery
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
