Re: [somewhat OT] Text processing question (sort of)

Terry Judd Sun, 18 May 2008 14:52:50 -0700

On 19/5/08 2:41 AM, "jbv" <[EMAIL PROTECTED]> wrote:

> Hi list,
> 
> I've been asked to do some "cleaning" in a client's data, and am trying
> to figure out some simple and fast algorithm to do the job in Rev, but
> haven't got much success so far...
> 
> Here's the problem : the data consists in a collection of quotations by
> various writers, politicians, etc. The data is organized in lines of 3
> items :
> the quote, the author, the place & date
> The cleaning job consists in finding duplicates caused by typos.
> 
> Here's an (imaginary) example :
> "God bless America"    George W Bush    Houston, March 18 2005
> "Godi bless America"    George W Bush    Huston, March 18 2005
> 
> Typos can occur in any of the 3 items, and sometimes even in 2 or 3
> items of the same line...
> Last but not least, the data consists in about 40000 lines...


How about using the compress function to compare pairs of lines? Compress
each line on its own, then compress the two lines joined together. If the
two individual compressed lengths are similar, and the compressed length of
the joined lines is more or less the same as either one alone, then you've
almost certainly got a match. I haven't done this with thousands of records,
but I have done it with hundreds and it's relatively quick.
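
A minimal sketch of that comparison, written in Python rather than Rev, with
zlib.compress standing in for Rev's compress function. The function names,
the sample lines and both thresholds are assumptions made up for
illustration, and the thresholds would need tuning against the real data.

```python
import zlib


def compressed_len(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def looks_like_duplicate(a: str, b: str,
                         size_tolerance: float = 0.15,
                         growth_limit: float = 0.5) -> bool:
    """Guess whether two lines are near-duplicates.

    size_tolerance and growth_limit are illustrative values (assumptions),
    not figures from the original post.
    """
    ca = compressed_len(a)
    cb = compressed_len(b)
    cab = compressed_len(a + "\n" + b)  # both lines compressed together

    smaller, larger = min(ca, cb), max(ca, cb)

    # Test 1: the two compressed lengths are similar.
    similar_size = (larger - smaller) <= size_tolerance * larger

    # Test 2: joining the lines adds little to the compressed length,
    # because the second line is mostly a repeat of the first.
    growth = (cab - smaller) / larger
    return similar_size and growth <= growth_limit


# Sample lines based on the example in the question (tab-separated items).
original = '"God bless America"\tGeorge W Bush\tHouston, March 18 2005'
typo = '"Godi bless America"\tGeorge W Bush\tHuston, March 18 2005'
other = '"I have a dream"\tMartin Luther King\tWashington, August 28 1963'

print(looks_like_duplicate(original, typo))
print(looks_like_duplicate(original, other))
```

Comparing every pair among 40000 lines means roughly 800 million
comparisons, so it would probably pay to narrow the candidates first, for
example by only comparing lines of similar length.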

Terry...

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
