Alejandro,
Some off-the-cuff thoughts:
* Add synonyms for common xTalk terms (cd => card, btn => button, etc.)
and combine their indices
* Support some sort of stemming (or at least, combine words with their
plurals)
* Create a stop word threshold: any term which occurs in more than X%
of messages becomes a stop word and is discarded from the index.
* Index by message, not by line. You could always find the line in the
message on the fly.
* Don't index all message headers
* Don't index message footers and/or signatures
* Remove dups (i.e. index a word only once, even if it appears twice on
a line or twice in a message)
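
To make a few of the points above concrete, here is a rough sketch in Python (not xTalk, just because it is compact). It indexes by message, folds synonyms and simple plurals together, drops duplicate words, and applies a stop word threshold. Everything named here is my own assumption for illustration: the SYNONYMS table, the normalize and build_index helpers, and the 50% default threshold. The plural handling is deliberately crude and is not real stemming, and the message bodies are assumed to have had headers and signatures stripped beforehand.

```python
import re
from collections import defaultdict

# Illustrative synonym table (an assumption): abbreviation -> canonical term.
SYNONYMS = {"cd": "card", "btn": "button", "fld": "field"}


def normalize(word):
    """Lower-case a word, expand abbreviations, and fold simple plurals."""
    word = word.lower()
    word = SYNONYMS.get(word, word)
    # Crude plural folding, not real stemming: "buttons" -> "button".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def build_index(messages, max_fraction=0.5):
    """Map each term to the set of message ids that contain it.

    messages: dict of message id -> body text.
    max_fraction: a term found in more than this share of messages is
    treated as a stop word and left out of the index.
    """
    index = defaultdict(set)
    for msg_id, body in messages.items():
        # Building a set per message removes duplicate words up front.
        terms = {normalize(w) for w in re.findall(r"[A-Za-z]+", body)}
        for term in terms:
            index[term].add(msg_id)
    limit = max_fraction * len(messages)
    return {term: ids for term, ids in index.items() if len(ids) <= limit}
```

With four short messages, a word such as "the" that appears in all of them falls over the threshold and is discarded, while "btn" and "buttons" end up under the single entry "button". Finding the matching line is then a matter of scanning the few messages the index returns.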
Hope these give you some ideas.
Of course, I also have a high-level question: what's wrong with just a
5MB index on a CD-ROM? If the concern is just disk space, you could
compress the index and probably get significant savings.
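
As a rough illustration of the compression idea, here is a small Python sketch that serializes an index and compresses it with zlib. The pack_index and unpack_index names and the JSON layout are assumptions of mine, not anything from your stack, and the actual savings will depend on what your index looks like.

```python
import json
import zlib


def pack_index(index):
    """Serialize a term -> message-id mapping and compress it."""
    # Sorted lists keep the output deterministic and JSON-friendly.
    plain = json.dumps(
        {term: sorted(ids) for term, ids in index.items()}, sort_keys=True
    )
    return zlib.compress(plain.encode("utf-8"), 9)


def unpack_index(blob):
    """Reverse pack_index, returning term -> set of message ids."""
    plain = zlib.decompress(blob).decode("utf-8")
    return {term: set(ids) for term, ids in json.loads(plain).items()}
```

The index would be unpacked once at startup and held in memory, so searches themselves would not pay the decompression cost.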
- Brian
_______________________________________________
use-revolution mailing list