That works, but if you pass the data to Perl it'll be handled 10,000 times
faster. Not that I can remember how I did it! :(
I have a great memory, it's just short.
Best,
Andy
----- Original Message -----
From: "jbv" <[EMAIL PROTECTED]>
To: "How to use Revolution" <[email protected]>
Sent: Sunday, May 18, 2008 2:27 PM
Subject: Re: [somewhat OT] Text processing question (sort of)
If anyone is interested: while trying to find the fastest way to compare
each line of a list with every other line, I found the following technique
quite fast:
-- myData contains the 40000 lines to check
-- myData1 is a working copy of myData
put myData into myData1
repeat for each line j in myData
  -- drop line j from the copy so each pair is visited only once
  delete line 1 of myData1
  repeat for each line i in myData1
    -- compare line j with line i here
  end repeat
end repeat
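For readers who don't use Rev, the same idea (pairing each line only with
the lines that come after it) can be sketched in Python. This is only an
illustration, not the code from the post; the function name unique_pairs
and the sample list are made up.

```python
from itertools import combinations


def unique_pairs(lines):
    """Return an iterator over each unordered pair of lines, once each."""
    # combinations(lines, 2) pairs every line only with the lines after it,
    # which is what the shrinking copy achieves in the loop above.
    return combinations(lines, 2)


# Hypothetical sample data, just to show the pairing.
lines = ["a", "b", "c", "d"]
pairs = list(unique_pairs(lines))
```

For 40000 lines this is still 40000 x 39999 / 2 = 799,980,000 pairs, so
whatever test runs inside the inner loop has to be cheap.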
Hi list,
I've been asked to do some "cleaning" in a client's data, and am trying
to figure out a simple and fast algorithm to do the job in Rev, but
haven't had much success so far...
Here's the problem: the data consists of a collection of quotations by
various writers, politicians, etc. The data is organized in lines of 3
items:
the quote, the author, the place & date
The cleaning job consists in finding duplicates caused by typos.
Here's an (imaginary) example :
"God bless America" George W Bush Houston, March 18 2005
"Godi bless America" George W Bush Huston, March 18 2005
Typos can occur in any of the 3 items, and sometimes even in 2 or 3
items of the same line...
Last but not least, the data consists of about 40000 lines...
The first idea that comes to mind is a kind of brute force approach:
compare each line, item by item, with each of the other lines, compute
a ratio of identical words, and keep only the lines where the ratio found
for each item is above a certain threshold (say 80%)... The problem is
that with such a huge set of data, it might take forever...
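To make the brute force test concrete, here is a rough Python sketch of
one way to read "ratio of identical words". Everything in it is an
assumption made for illustration: the names word_ratio and
probable_duplicates, the tab as item separator, and the choice to divide
the number of shared words by the size of the larger word set.

```python
def word_ratio(a, b):
    """Share of distinct words two strings have in common (0.0 to 1.0)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Shared words divided by the size of the larger word set.
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def probable_duplicates(line_a, line_b, threshold=0.8, sep="\t"):
    """True if every item of line_a is close enough to the same item of line_b."""
    items_a = line_a.split(sep)
    items_b = line_b.split(sep)
    if len(items_a) != len(items_b):
        return False
    return all(word_ratio(x, y) >= threshold for x, y in zip(items_a, items_b))


# The imaginary example from this message, with tabs assumed between items.
line_1 = '"God bless America"\tGeorge W Bush\tHouston, March 18 2005'
line_2 = '"Godi bless America"\tGeorge W Bush\tHuston, March 18 2005'
```

Note that on short items a single typo costs a lot: the two quotes above
share 2 words out of 3, and the two place & date items share 3 out of 4,
so with this particular measure the pair stays below an 80% threshold and
only matches with something lower, such as 60%.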
I've also tried to sort lines and compare each line with the previous
one only, but if the typo occurs in the first char of any item of a line,
duplicates might be far away from each other after the sort... so it
won't work...
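A tiny Python sketch shows the effect. The four quotes and the helper
name gap_after_sort are made up for the illustration.

```python
def gap_after_sort(lines, a, b):
    """Distance between lines a and b once the list is sorted."""
    ordered = sorted(lines)
    return abs(ordered.index(a) - ordered.index(b))


# Hypothetical data: the third line is the first one with a typo
# in its very first character.
lines = [
    "God bless America",
    "Give peace a chance",
    "Fod bless America",
    "Free at last",
]
gap = gap_after_sort(lines, "God bless America", "Fod bless America")
```

Here the two near-duplicates end up at opposite ends of the sorted list
(gap is 3), so comparing each line with its neighbour would miss them.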
Any ideas?
thanks in advance,
JB
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your
subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution