On 19/5/08 2:41 AM, "jbv" <[EMAIL PROTECTED]> wrote:

> Hi list,
>
> I've been asked to do some "cleaning" in a client's data, and am trying
> to figure out some simple and fast algorithm to do the job in Rev, but
> haven't got much success so far...
>
> Here's the problem : the data consists in a collection of quotations by
> various writers, politicians, etc. The data is organized in lines of 3
> items :
> the quote, the author, the place & date
> The cleaning job consists in finding duplicates caused by typos.
>
> Here's an (imaginary) example :
> "God bless America" George W Bush Houston, March 18 2005
> "Godi bless America" George W Bush Huston, March 18 2005
>
> Typos can occur in any of the 3 items, and sometimes even in 2 or 3
> items of the same line...
> Last but not least, the data consists in about 40000 lines...
How about using the compress function to compare pairs of lines? If the compressed lengths of the two lines are similar, and the compressed length of the two lines combined is more or less the same as that of either line alone, then you've almost certainly got a match (a near-duplicate gives the compressor very little new to encode). I haven't done this with thousands of records, but I have done it with hundreds and it's relatively quick.

Terry...
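To make the idea concrete, here is a minimal sketch of the pairwise comparison. It is written in Python with zlib standing in for Rev's compress function, so it shows the technique rather than the script I actually used. The names clen, pair_score and looks_like_duplicate, and the 0.4 threshold, are assumptions made up for the example; you would want to tune the threshold against real data.

```python
import zlib


def clen(text: str) -> int:
    """Length in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def pair_score(a: str, b: str) -> float:
    """How much extra compressed data the second line costs.

    Close to 0 when one line is nearly a copy of the other,
    close to 1 when the lines have little in common.
    """
    ca, cb = clen(a), clen(b)
    cab = clen(a + "\n" + b)  # both lines compressed together
    return (cab - min(ca, cb)) / max(ca, cb)


def looks_like_duplicate(a: str, b: str, threshold: float = 0.4) -> bool:
    # threshold is a guess for illustration, not a measured value
    return pair_score(a, b) <= threshold


if __name__ == "__main__":
    first = '"God bless America" George W Bush Houston, March 18 2005'
    second = '"Godi bless America" George W Bush Huston, March 18 2005'
    print(pair_score(first, second), looks_like_duplicate(first, second))
```

Dividing by the larger compressed length folds both checks (similar individual lengths, and combined length close to a single one) into one number. Note that comparing every pair among 40000 lines means roughly 800 million comparisons, so some kind of pre-grouping would probably be needed before running it on the full data set.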
