The ambitious Alejandro Tejada wrote:

It's a global search, within a general index created
using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full
index in memory, instead just open specific parts
of this index when users start searching)

For this reason, I am looking for advice on creating an
index structure that allows me to implement a fast search
algorithm, using multiple words (and boolean logic, if
possible), similar to Wikipedia's own search engine or
(better yet) just like Google. :-)

A good place to start on that is the seminal paper describing the initial Google implementation, written by the founders:

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
<http://infolab.stanford.edu/~backrub/google.html>

But be warned: indexing is a deep topic, and may become a consuming passion. Roulette is rumored to have been invented by a monk who came to believe he could find a way to predict its outcomes, and eventually went mad trying. Indexing is a bit like that. :)

A couple of the longer-term projects I work on need to incorporate good indexing of large corpora, and my own code to that end has advanced only in baby steps as I learn more about it.

So while I have little in the way of applied code to share at the moment, I can offer a few theoretical pointers:

From what I'm reading, my first advice would be not to bother unless you absolutely need to. If there's any way you can use an existing index, you'll be a happier man for it.
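
One concrete example of "existing index": SQLite's optional full-text modules (FTS) build and maintain an inverted index for you, on disk, and give you multi-word queries with boolean operators through MATCH. I haven't wired this up in Rev myself, and whether your SQLite build includes FTS5 is something to check, but here's a minimal sketch in Python just to show the shape of it (the table and column names are my own invention):

  import sqlite3

  conn = sqlite3.connect("wiki.db")

  # One virtual table holds the searchable text; the FTS5 module keeps its
  # inverted index on disk, so nothing has to be loaded wholesale into memory.
  conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")
  conn.execute("INSERT INTO articles VALUES (?, ?)",
               ("Child development", "How children learn words..."))
  conn.commit()

  # Multi-word queries with boolean logic come with the MATCH operator.
  for (title,) in conn.execute("SELECT title FROM articles WHERE articles MATCH ?",
                               ("children OR child",)):
      print(title)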

But if you have to make your own index, you may find it helpful (or maddening) to consider the challenges involved with the various tenses and inflections of words, and how to determine the root word in its native form (the "lemma") for your lookups.

Currently I'm looking into the various lemmatization schemes available, since it can help tremendously both to keep the index small enough to be practical and to return more relevant search results.

Lemmatization attempts this through linguistic rules; stemming could be said to be a form of "cheating" by using simpler algos less dependent on the nuances of a given language to attempt the same result. The differences are explained well here:

Stemming and lemmatization
<http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html>
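
To make the distinction concrete, a lemmatizer is essentially a dictionary lookup against known word forms. This little Python snippet uses NLTK's WordNet lemmatizer purely as an illustration (it assumes NLTK is installed and the WordNet data has been fetched once with nltk.download("wordnet")); Rev would need its own lookup table or an external tool for the same job:

  from nltk.stem import WordNetLemmatizer

  lemmatizer = WordNetLemmatizer()
  print(lemmatizer.lemmatize("children"))           # -> child  (irregular plural, found by lookup)
  print(lemmatizer.lemmatize("words"))              # -> word
  print(lemmatizer.lemmatize("indexing", pos="v"))  # -> index  (needs the part of speech as a hint)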

Links to specific stemming algos are at the bottom of this article:
<http://www.comp.lancs.ac.uk/computing/research/stemming/general/>

Without lemmatization or stemming, searches for "children" will fail to find "child", which could well be relevant to the searcher. And some stemming methods won't be able to transform "children" to "child", since it's an irregular form, far less common linguistically than plurals made by simply adding "s" or "es" to the end.
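
Just to illustrate the "cheating" end of the spectrum, here's a toy suffix-stripper with a tiny exception table for irregular plurals like "children". It's nowhere near a real stemmer (see the Porter links above), but it shows why you end up needing both the rules and the exceptions:

  IRREGULAR = {"children": "child", "feet": "foot", "mice": "mouse"}

  def crude_stem(word):
      word = word.lower()
      if word in IRREGULAR:            # irregular plurals need a lookup table
          return IRREGULAR[word]
      if word.endswith("es") and len(word) > 4:
          return word[:-2]             # searches -> search
      if word.endswith("s") and len(word) > 3:
          return word[:-1]             # words -> word
      return word

  # Run both the indexed text and the query terms through the same function,
  # so "children" and "child" end up under the same index key.
  print(crude_stem("children"))   # -> child
  print(crude_stem("searches"))   # -> search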

There's a wealth of info available searching the web for "indexing algorithms", "stemming algorithms", etc. Doing it well may take a lifetime; fortunately it seems there are some "cheating" methods which will do most of the job well enough in less time.
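
For what it's worth, the core structure behind most of those articles is the inverted index: a table mapping each normalized word to the set of articles containing it, with boolean AND reduced to set intersection. A bare-bones, in-memory sketch (again Python, purely illustrative; a real one lives on disk and gets much cleverer about storage):

  from collections import defaultdict

  index = defaultdict(set)   # word -> set of article IDs (the "posting list")

  def normalize(word):
      return word.lower().strip(".,;:!?")   # a real build would stem or lemmatize here

  def add_article(article_id, text):
      for word in text.split():
          index[normalize(word)].add(article_id)

  def search_all(query):
      # Boolean AND: intersect the posting lists of every query term.
      postings = [index.get(normalize(w), set()) for w in query.split()]
      return set.intersection(*postings) if postings else set()

  add_article(1, "the child learns a word")
  add_article(2, "a word about indexing")
  print(search_all("word child"))   # -> {1}
  print(search_all("word"))         # -> {1, 2}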


All that said, I have to wonder: if Wikipedia's content is available, isn't their index also available?

Porting it from MySQL to SQLite seems a far less daunting task than writing an index from scratch.

--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv