Hi list, I've been asked to do some "cleaning" in a client's data, and am trying to figure out a simple and fast algorithm to do the job in Rev, but haven't had much success so far...
Here's the problem: the data consists of a collection of quotations by various writers, politicians, etc. It is organized in lines of 3 items: the quote, the author, and the place & date. The cleaning job consists of finding duplicates caused by typos. Here's an (imaginary) example:

"God bless America"	George W Bush	Houston, March 18 2005
"Godi bless America"	George W Bush	Huston, March 18 2005

Typos can occur in any of the 3 items, and sometimes even in 2 or all 3 items of the same line... Last but not least, the data runs to about 40000 lines...

The first idea that comes to mind is a kind of brute-force approach: compare each line, item by item, with every other line, compute a ratio of identical words, and keep only lines where the ratio found for each item is above a certain threshold (say 80%)... The problem with such a huge set of data is that it might take forever...

I've also tried sorting the lines and comparing each line with the previous one only, but if the typo occurs in the first character of any item of a line, the duplicates can end up far apart after the sort... so that won't work either...

Any ideas? Thanks in advance,

JB

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
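[Editor's note] The brute-force idea described in the post (pairwise comparison, per-item word-overlap ratio against a threshold) can be sketched as follows. This is a hypothetical illustration, not the poster's actual Rev code: the tab-delimited field layout, the `word_ratio` helper, and the thresholds are all assumptions.

```python
def word_ratio(a, b):
    """Fraction of words the two strings share, ignoring case and order."""
    wa, wb = a.lower().split(), b.lower().split()
    if not wa or not wb:
        return 0.0
    common = len(set(wa) & set(wb))
    return common / max(len(wa), len(wb))

def find_duplicates(lines, threshold=0.8):
    """Return pairs of line indices whose 3 items all match above threshold.

    Each line is assumed tab-delimited: quote, author, place & date.
    This is the O(n^2) brute force from the post: every line is compared
    with every other line, item by item.
    """
    records = [line.split("\t") for line in lines]
    dupes = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if all(word_ratio(x, y) >= threshold
                   for x, y in zip(records[i], records[j])):
                dupes.append((i, j))
    return dupes
```

Two caveats worth noting. First, on a 3-word quote a single typo already drops the word ratio to 2/3, below the suggested 80% threshold, so a character-level similarity (e.g. `difflib.SequenceMatcher(None, a, b).ratio()` from the Python standard library) may be more forgiving for short items. Second, at 40000 lines the pairwise loop performs roughly 800 million comparisons, which is exactly the "might take forever" worry raised above.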
