Hi Brian, :-)

Brian Yennie wrote:
> Some off-the-cuff thoughts:
> * Add synonyms for common xTalk terms (cd => card,
>   btn => button, etc.) and combine their indices

Interesting idea. I'll give more thought to this possibility.

> * Support some sort of stemming (or at least, combine
>   words with their plurals)

Yes, this is a must.

> * Create a stop word threshold: any term which occurs
>   in more than X% of messages becomes a stop word and
>   is discarded from the index.

This is a good recommendation. For example, the word "revolution" should be a stop word. :-)

> * Index by message, not by line. You could always
>   find the line in the message on the fly.

Yes, Alex Tweedley makes this recommendation too.

> * Don't index all message headers
> * Don't index message footers and/or signatures

The headers contain some useful info... No?

> * Remove dups (i.e. if a word appears twice on a line
>   or twice in a message)

Yes, this is a must too.

> Hope these give you some ideas.

Sure they do! These are mind-opening ideas. You can be sure that many other ideas, probably unrelated to this task, will come to life while working on this... :-) Today I stumbled on an interesting idea for a new educational game. Let's keep hoping to raise the resources to make this game a reality!

> Of course I also have a high level question- what's
> wrong with just a 5MB index on a CD-ROM? If it is
> just for disk space, you could compress
> the index and probably get a significant savings.

Space is not the problem; fast searching in optimized indexes is the goal. ;-)

Thanks again for your help, Brian!

al

Visit my site: http://www.geocities.com/capellan2000/
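
The indexing suggestions quoted above (synonym folding, plural folding, per-message indexing, duplicate removal, and a stop word threshold) can be sketched roughly as follows. This is only an illustration in Python, not the code used for the actual index: the synonym table, the function names `normalize` and `build_index`, and the `stop_fraction` parameter are all assumptions made for the example, and the plural handling is a crude stand-in for real stemming.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: xTalk abbreviations mapped to a canonical term.
SYNONYMS = {"cd": "card", "btn": "button", "fld": "field"}


def normalize(word):
    """Lower-case a word, fold a simple plural 's', then fold synonyms."""
    word = word.lower()
    # Naive plural folding, not a real stemmer: "buttons" -> "button".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return SYNONYMS.get(word, word)


def build_index(messages, stop_fraction=0.5):
    """Map each term to the sorted list of message ids that contain it."""
    index = defaultdict(set)
    for msg_id, text in enumerate(messages):
        # Building a set removes duplicate terms within one message.
        terms = {normalize(w) for w in re.findall(r"[A-Za-z]+", text)}
        for term in terms:
            index[term].add(msg_id)
    # Terms found in more than stop_fraction of all messages are treated
    # as stop words and left out of the index.
    limit = stop_fraction * len(messages)
    return {t: sorted(ids) for t, ids in index.items() if len(ids) <= limit}


if __name__ == "__main__":
    sample = [
        "Revolution: put the btn on the cd",
        "Revolution buttons and cards",
        "Revolution stacks",
    ]
    for term, ids in sorted(build_index(sample, stop_fraction=0.9).items()):
        print(term, ids)
```

Because the index stores message ids rather than line numbers, the matching line would have to be located on the fly by scanning the message text after a hit, as suggested in the quoted list.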
