If anyone is interested: while trying to find the fastest way to compare each line of a list with every other line, I found the following technique quite fast:

-- myData contains the 40000 lines to check
-- myData1 is a duplicate of myData
put myData into myData1
repeat for each line j in myData
  -- after this delete, myData1 holds only the lines that follow line j,
  -- so each pair of lines is visited exactly once
  delete line 1 of myData1
  repeat for each line i in myData1
    -- compare line j with line i here
  end repeat
end repeat

> Hi list,
>
> I've been asked to do some "cleaning" in a client's data, and am trying
> to figure out some simple and fast algorithm to do the job in Rev, but
> haven't got much success so far...
>
> Here's the problem : the data consists in a collection of quotations by
> various writers, politicians, etc. The data is organized in lines of 3
> items :
> the quote, the author, the place & date
> The cleaning job consists in finding duplicates caused by typos.
>
> Here's an (imaginary) example :
> "God bless America" George W Bush Houston, March 18 2005
> "Godi bless America" George W Bush Huston, March 18 2005
>
> Typos can occur in any of the 3 items, and sometimes even in 2 or 3
> items of the same line...
> Last but not least, the data consists in about 40000 lines...
>
> The first idea that comes to mind is a kind of brute force approach :
> to compare each line, item by item, with each of the other lines,
> compute a ratio of identical words, and keep only lines where the ratio
> found for each item is above a certain threshold (say 80%)... The
> problem with such huge set of data, is that it might take forever...
>
> I've also tried to sort lines and compare each line with the previous
> one only, but if the typo occurs in the first char of any item of a
> line, duplicates might be far away from each other after the sort...
> so it won't work...
>
> Any idea ?
>
> thanks in advance,
> JB

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
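
For what it's worth, here is a rough sketch of how the word-ratio test described in the quoted message could be dropped into the nested loop from the reply. It is only an illustration, not a tested solution: the handler names (wordRatio, looksLikeDuplicate, findSuspects) are made up for this example, the items are assumed to be tab-separated (the quote and the place both contain commas in the sample), and the 0.8 threshold is simply the "say 80%" from the original question.

```livecode
-- Hypothetical helper: share (0 to 1) of the words of pA that also occur in pB
function wordRatio pA, pB
  local tHits, tCount, tWord
  -- a double-quoted string counts as ONE word in Transcript,
  -- so strip the quote marks before counting words
  replace quote with empty in pA
  replace quote with empty in pB
  put the number of words of pA into tCount
  if tCount is 0 then return 0
  put 0 into tHits
  repeat for each word tWord in pA
    if tWord is among the words of pB then add 1 to tHits
  end repeat
  return tHits / tCount
end wordRatio

-- Hypothetical helper: true when each of the 3 items reaches the threshold
function looksLikeDuplicate pLine1, pLine2, pThreshold
  local n
  set the itemDelimiter to tab -- assumption: items are tab-separated
  repeat with n = 1 to 3
    if wordRatio(item n of pLine1, item n of pLine2) < pThreshold then
      return false
    end if
  end repeat
  return true
end looksLikeDuplicate

-- The nested loop from the reply, with the comparison filled in
function findSuspects myData
  local myData1, tSuspects, i, j
  put myData into myData1
  repeat for each line j in myData
    delete line 1 of myData1
    repeat for each line i in myData1
      if looksLikeDuplicate(j, i, 0.8) then
        put j & cr & i & cr & cr after tSuspects
      end if
    end repeat
  end repeat
  return tSuspects
end findSuspects
```

Two caveats on this sketch. The ratio is measured against the words of the first line only, so it is not symmetric, and a typo inside a word makes that whole word count as a mismatch. Also, with 40000 lines the loop still visits 40000 * 39999 / 2 = 799,980,000 pairs, so whatever test goes inside it has to be cheap.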
