jbv wrote:

Hi list,

I'm trying to build the fastest possible algorithm for the
following task:
I have a variable containing reference words; each "word"
in this variable is considered an item.
I also have a list of "sentences" (each sentence being a
list of "words" separated by spaces), 1 per line. Each
sentence can contain 1 or more words from the reference,
as well as other words.

I need to determine, for each sentence, which words from
the reference it contains, in which order, and output
a list of offsets.
For example:
   reference variable: W1,W2,W3,W4
   sentence 1: Wx W2 Wy Wz W3
   output: 2,3

And last but not least, I need to keep only sentences that
contain more than 1 word from the reference.

Here's the fastest script I managed to code:

put "" into newList
repeat for each line j in mySentences
   put j into a
   replace " " with itemdelimiter in a
   put "" into b

   repeat for each item i in it
       get itemoffset(i,myReference)
       if it>0 then
           put it & itemdelimiter after b
       end if
   end repeat

Shouldn't that be "repeat for each item i in a" rather than "... in it", since "it" gets overwritten within the loop?

   if b contains itemdelimiter then
       delete last char of b
       put b & cr after newList
   end if
end repeat
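For reference, here is the same scan as a Python sketch (my own translation, not Revolution code; it assumes a comma itemDelimiter, uses the loop variable `a` rather than `it`, and applies the "more than 1 word" filter described above — function and variable names are mine):

```python
def find_offsets(my_reference, my_sentences):
    # my_reference: comma-delimited reference words, e.g. "W1,W2,W3,W4"
    # my_sentences: one space-delimited sentence per line
    reference = my_reference.split(",")
    new_list = []
    for sentence in my_sentences.splitlines():
        offsets = []
        for word in sentence.split(" "):
            if word in reference:
                # itemOffset is 1-based in Transcript, so add 1
                offsets.append(reference.index(word) + 1)
        # keep only sentences containing more than 1 reference word
        if len(offsets) > 1:
            new_list.append(",".join(str(n) for n in offsets))
    return "\n".join(new_list)
```

With the example above, `find_offsets("W1,W2,W3,W4", "Wx W2 Wy Wz W3")` yields "2,3".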

But I was wondering if there was any faster approach...
I don't see anything here which "keeps only those sentences containing more than 1 word". And newList just contains the lists of offsets, with no indication of which sentence each came from.

Do you have a "typical" dataset in mind, or know any characteristics of it?
How many words in the reference list?
How many sentences, and how long is a typical one?
How many "hits" vs "misses"?

Since you're asking for the "fastest" algorithm, I assume at least one of these numbers is large, and it's unlikely the same algorithm would excel at both ends of the spectrum.

I have 2 or 3 approaches in mind; any clues you have about the characteristics of the data would help decide which ones to pursue. E.g. build an associative array of the words which occur in myReference, where the content of each array element is that word's position within the reference, and look each sentence word up in that array. It should be faster than "itemOffset", but probably not by enough to justify it if the number of words in myReference is very small.
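That associative-array idea, sketched in Python (again a stand-in for Revolution's arrays, assuming a comma itemDelimiter; the names are mine). Each dictionary lookup is constant-time, where itemOffset rescans the whole reference for every word:

```python
def find_offsets_hashed(my_reference, my_sentences):
    # Build the lookup table once: word -> 1-based position in the reference.
    position = {w: i for i, w in enumerate(my_reference.split(","), start=1)}
    new_list = []
    for sentence in my_sentences.splitlines():
        # One O(1) dict lookup per word instead of a linear itemOffset scan.
        offsets = [position[w] for w in sentence.split(" ") if w in position]
        if len(offsets) > 1:  # keep only sentences with more than one hit
            new_list.append(",".join(map(str, offsets)))
    return "\n".join(new_list)
```

For a small reference the table-building overhead may outweigh the saving, which is why the size of myReference matters.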

--
Alex Tweedly       http://www.tweedly.net




_______________________________________________
use-revolution mailing list
[email protected]
http://lists.runrev.com/mailman/listinfo/use-revolution
