Re: Words Indexing strategies

Brian Yennie Wed, 10 Feb 2010 08:17:15 -0800

Alejandro,

The first step for this would likely include creating an "inverted" index. This 
means you store something like:

monkey:1,34,3827,21314

Where the word being indexed in "monkey" and the numbers that follow are 
article IDs. Using this information it is pretty trivial to implement AND / OR. 
Just merge the article IDs as needed. You can use a mix of this inverted index 
and your regular index in combination to do other types of queries. You can 
even store word frequency information to find similar articles.

With that said, I think you will have a very hard time crafting a competitive 
algorithm for this in Rev. A lot of the engines out there are very mature and 
there are many open source ones that you might consider bundling with your 
project instead of writing it yourself. Some things to watch out for:

1) File size. Indexing every word takes a lot of space, especially if you 
indices are in plain text and not a compact binary format
2) Common words. You might consider a "stop word" list, or a threshold. For 
example, if a word is in > 20% of all articles, don't index it. Or if it's in 
your "stop word" list.
3) Root words / plurals. Can you detect that "monkeys" is the plural of 
"monkey" (or more complex cases)?

An example of an open source engine that is very mature is Lucene, which can be 
run from the command-line in Java.

> Hi Bernard,
> 
> on Wed, 10 Feb 2010 07:20:36 +0000
> Bernard Devlin wrote:
>> Can I just clarify your problem?  You want to be able to search for
>> phrases (partial sentences, possibly with boolean logic) inside the
>> text stored in the xml nodes of the article, once the article is found
>> in the index?
> 
> No, it's not a search inside the displayed article.
> It's a global search, within a general index created
> using all words from all articles of Wikipedia.
> (I do not believe that it's necessary to load this full
> index in memory, instead just open specific parts
> of this index when users start searching)
> 
> For this reason, i am looking for advice to create an
> index structure that allows to implement a fast search
> algorithm, using multiple words (and boolean logic, if
> possible), similar to Wikipedia's own search engine or
> (better yet) just like google. :-)
> 
> Many thanks for your interest in this question!
> 
> Alejandro
> _______________________________________________
> use-revolution mailing list
> [email protected]
> Please visit this url to subscribe, unsubscribe and manage your subscription 
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolutio
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution

Re: Words Indexing strategies

Reply via email to