Alejandro,

Some off-the-cuff thoughts:

* Add synonyms for common xTalk terms (cd => card, btn => button, etc.) and combine their indices
* Support some sort of stemming (or at least, combine words with their plurals)
* Create a stop word threshold: any term which occurs in more than X% of messages becomes a stop word and is discarded from the index.
* Index by message, not by line. You could always find the line in the message on the fly.
* Don't index all message headers
* Don't index message footers and/or signatures
* Remove duplicates (i.e. if a word appears twice on a line or twice in a message, index it only once)
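
Here is a rough sketch of how several of these ideas could fit together. It is in Python rather than xTalk purely for illustration, and everything in it is an assumption on my part: the synonym table, the names normalize and build_index, the crude plural rule (which is not real stemming and will mangle some words), and the default threshold. Header and footer stripping is left out.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: abbreviation -> canonical term.
SYNONYMS = {"cd": "card", "btn": "button", "fld": "field"}


def normalize(word):
    """Lowercase, expand abbreviations, and fold simple plurals."""
    word = word.lower()
    if word in SYNONYMS:
        return SYNONYMS[word]
    # Crude plural folding, not real stemming: "buttons" -> "button".
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return SYNONYMS.get(word, word)


def build_index(messages, stop_threshold=0.5):
    """Return {term: sorted message ids}, indexed per message.

    Using a set per term removes duplicates within a message.
    Terms found in more than stop_threshold of all messages are
    treated as stop words and dropped.
    """
    postings = defaultdict(set)
    for msg_id, body in enumerate(messages):
        for word in re.findall(r"[A-Za-z]+", body):
            postings[normalize(word)].add(msg_id)
    limit = stop_threshold * len(messages)
    return {
        term: sorted(ids)
        for term, ids in postings.items()
        if len(ids) <= limit
    }


def find_lines(body, term):
    """Locate the matching lines inside one message on the fly."""
    return [
        number
        for number, line in enumerate(body.splitlines(), start=1)
        if term in {normalize(w) for w in re.findall(r"[A-Za-z]+", line)}
    ]
```

With this layout the index only stores message ids, and find_lines is run on the handful of messages a search returns, so the per-line positions never need to be stored.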

Hope these give you some ideas.

Of course, I also have a high-level question: what's wrong with just a 5MB index on a CD-ROM? If the concern is just disk space, you could compress the index and probably get significant savings.
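
To give a feel for the compression idea, here is a minimal sketch, again in Python for illustration only. It assumes the index is a mapping from term to a list of message ids and that it is serialized as JSON before being compressed with zlib; the names pack_index and unpack_index are made up, and the actual savings will depend entirely on the real data.

```python
import json
import zlib


def pack_index(index):
    """Serialize the index to JSON and compress it at the highest zlib level."""
    raw = json.dumps(index, sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 9)


def unpack_index(blob):
    """Reverse pack_index: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The compressed blob would be written to disk once and unpacked into memory when the search stack opens.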

- Brian

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
