The ambitious Alejandro Tejada wrote:

It's a global search, within a general index created
using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full
index in memory, instead just open specific parts
of this index when users start searching)

For this reason, I am looking for advice on creating an
index structure that allows me to implement a fast search
algorithm, using multiple words (and boolean logic, if
possible), similar to Wikipedia's own search engine or
(better yet) just like Google. :-)

A good place to start on that is the seminal paper describing the initial Google implementation, written by the founders:

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
<http://infolab.stanford.edu/~backrub/google.html>

But be warned: indexing is a deep topic, and may become a consuming passion. Roulette is rumored to have been invented by a monk who came to believe he could find a way to predict its outcomes, and eventually went mad trying. Indexing is a bit like that. :)

A couple of the longer-term projects I work on need to incorporate good indexing of large corpora, and my own code to that end has advanced only in baby steps as I learn more about it.

So while I have little in the way of applied code to share at the moment, I can offer a few theoretical pointers:

From what I'm reading, my first advice would be not to bother unless you absolutely need to. If there's any way you can use an existing index, you'll be a happier man for it.
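
One concrete example of "existing index": SQLite's optional full-text modules (FTS) build and maintain an inverted index for you, on disk, and give you multi-word queries with boolean operators through MATCH. I haven't wired this up in Rev myself, and whether your SQLite build includes FTS5 is something to check, but here's a minimal sketch in Python just to show the shape of it (the table and column names are my own invention):

  import sqlite3

  conn = sqlite3.connect("wiki.db")

  # One virtual table holds the searchable text; the FTS5 module keeps its
  # inverted index on disk, so nothing has to be loaded wholesale into memory.
  conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")
  conn.execute("INSERT INTO articles VALUES (?, ?)",
               ("Child development", "How children learn words..."))
  conn.commit()

  # Multi-word queries with boolean logic come with the MATCH operator.
  for (title,) in conn.execute("SELECT title FROM articles WHERE articles MATCH ?",
                               ("children OR child",)):
      print(title)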

But if you have to make your own index, you may find it helpful (or maddening) to consider the challenges involved with the various tenses and inflections of words, and how to determine the root word in its native form (the "lemma") for your lookups.

Currently I'm looking into the various lemmatization schemes available, since it can help tremendously both to keep the index small enough to be practical and to return more relevant search results.

Lemmatization attempts this through linguistic rules; stemming could be said to be a form of "cheating" by using simpler algos less dependent on the nuances of a given language to attempt the same result. The differences are explained well here:

Stemming and lemmatization
<http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html>
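
To make the distinction concrete, a lemmatizer is essentially a dictionary lookup against known word forms. This little Python snippet uses NLTK's WordNet lemmatizer purely as an illustration (it assumes NLTK is installed and the WordNet data has been fetched once with nltk.download("wordnet")); Rev would need its own lookup table or an external tool for the same job:

  from nltk.stem import WordNetLemmatizer

  lemmatizer = WordNetLemmatizer()
  print(lemmatizer.lemmatize("children"))           # -> child  (irregular plural, found by lookup)
  print(lemmatizer.lemmatize("words"))              # -> word
  print(lemmatizer.lemmatize("indexing", pos="v"))  # -> index  (needs the part of speech as a hint)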

Links to specific stemming algos are at the bottom of this article:
<http://www.comp.lancs.ac.uk/computing/research/stemming/general/>

Without lemmatization or stemming, searches for "children" will fail to find "child", which could well be relevant to the searcher. And some stemming methods won't be able to transform "children" to "child", since it's an irregular form, far less common linguistically than plurals made by simply adding "s" or "es" to the end.
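
Just to illustrate the "cheating" end of the spectrum, here's a toy suffix-stripper with a tiny exception table for irregular plurals like "children". It's nowhere near a real stemmer (see the Porter links above), but it shows why you end up needing both the rules and the exceptions:

  IRREGULAR = {"children": "child", "feet": "foot", "mice": "mouse"}

  def crude_stem(word):
      word = word.lower()
      if word in IRREGULAR:            # irregular plurals need a lookup table
          return IRREGULAR[word]
      if word.endswith("es") and len(word) > 4:
          return word[:-2]             # searches -> search
      if word.endswith("s") and len(word) > 3:
          return word[:-1]             # words -> word
      return word

  # Run both the indexed text and the query terms through the same function,
  # so "children" and "child" end up under the same index key.
  print(crude_stem("children"))   # -> child
  print(crude_stem("searches"))   # -> search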

There's a wealth of info available searching the web for "indexing algorithms", "stemming algorithms", etc. Doing it well may take a lifetime; fortunately it seems there are some "cheating" methods which will do most of the job well enough in less time.
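
For what it's worth, the core structure behind most of those articles is the inverted index: a table mapping each normalized word to the set of articles containing it, with boolean AND reduced to set intersection. A bare-bones, in-memory sketch (again Python, purely illustrative; a real one lives on disk and gets much cleverer about storage):

  from collections import defaultdict

  index = defaultdict(set)   # word -> set of article IDs (the "posting list")

  def normalize(word):
      return word.lower().strip(".,;:!?")   # a real build would stem or lemmatize here

  def add_article(article_id, text):
      for word in text.split():
          index[normalize(word)].add(article_id)

  def search_all(query):
      # Boolean AND: intersect the posting lists of every query term.
      postings = [index.get(normalize(w), set()) for w in query.split()]
      return set.intersection(*postings) if postings else set()

  add_article(1, "the child learns a word")
  add_article(2, "a word about indexing")
  print(search_all("word child"))   # -> {1}
  print(search_all("word"))         # -> {1, 2}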


All that said, I have to wonder: if Wikipedia's content is available, isn't their index also available?

Porting it from MySQL to SQLite seems a far less daunting task than writing an index from scratch.

--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv