On 18/05/08 at 23:03 -0700 Kee Nethery apparently wrote:
Interesting problem.
If you are looking for typos, here are my thoughts.
What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two
words instead of one)
3. Punctuation differences
4. Perhaps words such as "the", "and", "an" missing from titles.
...
So long story short, slice and dice the quotes to collect a set of
pairs that appear to be similar. Then build a flashcard kind of
interface in RunRev that lets you, the human, read the two
similar quotes and decide whether to delete one or not.
I'd combine brute force with human visuals. 40000 lines seems like a
small data set for brute force.
Kee Nethery
Finding identical lines is fairly trivial. Using fuzzy search to find
similar lines is definitely more complicated. However, there are
well-known algorithms for detecting spelling errors. One of the common
and rather simple approaches is to compute the so-called
Damerau-Levenshtein distance: the number of single-character
insertions, deletions, substitutions and swaps of adjacent characters
needed to turn one string into the other. This is quite fast in Rev.
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
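
For illustration only, here is a minimal sketch of that distance in
Python rather than Rev. It implements the restricted variant (often
called optimal string alignment), which counts a swap of two adjacent
characters as one edit but never edits the same substring twice. The
function name osa_distance is my own choice, not from any library.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # swap of two adjacent characters, e.g. "teh" -> "the"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

A small distance relative to the line length (say 1 or 2) suggests a
typo; the cutoff is something to tune against the real data.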
The approach I'd take:
0. find and eliminate identical items
1. clean the word spacing
2. find and eliminate identical items
3. compare and clean punctuation. This may partly require human
inspection, but the program can report those cases.
4. again eliminate identical items
5. use a simplified approach, like what Kee suggests or computing a
word factor as you suggested, to identify line pairs suspected to
differ by spelling and other minor alterations.
6. compute Damerau-Levenshtein distance for those and report cases
for human inspection.
7. correct typos and standardize texts as needed.
8. find and eliminate identical items.
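
As a rough sketch of steps 0 to 5, again in Python rather than Rev:
the helper names, the word-overlap measure standing in for the "word
factor", and the 0.5 threshold are all assumptions of mine, not
something from the original data or from any library.

```python
import string
from itertools import combinations

def collapse_spaces(line: str) -> str:
    """Step 1: trim the line and squeeze runs of whitespace to one space."""
    return " ".join(line.split())

def dedupe(lines):
    """Steps 0, 2, 4: drop exact duplicates, keeping first occurrences in order."""
    return list(dict.fromkeys(lines))

def strip_punctuation(line: str) -> str:
    """Remove ASCII punctuation so lines can be compared without it."""
    return line.translate(str.maketrans("", "", string.punctuation))

def punctuation_variants(lines):
    """Step 3: group lines that differ only in punctuation, case or spacing.

    The groups are meant to be reported for human inspection, not
    merged automatically.
    """
    groups = {}
    for line in lines:
        key = collapse_spaces(strip_punctuation(line)).lower()
        groups.setdefault(key, []).append(line)
    return [group for group in groups.values() if len(group) > 1]

def word_overlap(a: str, b: str) -> float:
    """Share of distinct words the two lines have in common (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def suspect_pairs(lines, threshold=0.5):
    """Step 5: brute-force every pair and keep those sharing enough words."""
    return [(a, b) for a, b in combinations(lines, 2)
            if word_overlap(a, b) >= threshold]
```

The pairs returned by suspect_pairs are the ones worth passing to the
distance computation in step 6. Comparing every pair is quadratic in
the number of lines, so on the full list it pays to run this only
after the duplicates are gone.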
Robert