Seth> Another possible meta-token that might help detect word salad
Seth> (probably what Skip had in mind):
Seth> percentage of unique word tokens that are not significant

I see a chicken-and-egg situation developing when we try to compute these
sorts of numbers.  Start with an empty database.  Train on a ham message.
No words are significant at that point, so having no significant word
tokens is a hammy clue.  Train on a spam.  By definition, every word in the
database at that point is significant, so only words not yet seen will be
deemed not significant.  Lather, rinse, repeat.

Maybe after you're done training on all available messages you can toss
these percentage tokens and make a second pass over your messages,
computing only those tokens.  Are there better ways to compute tokens like
this, which depend on the contributions of other messages in the database?

Skip
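
P.S. Here's a rough sketch of the second-pass idea.  The helper names
(train(), is_significant(), add_metatoken() and the toy tokenizer) are
just placeholders, not the real SpamBayes classifier API:

    def unique_tokens(msg):
        """Toy tokenizer: return the set of unique word tokens in a message."""
        return set(msg.split())

    def two_pass_train(messages, db):
        # Pass 1: build the word database without any percentage
        # meta-tokens, so significance doesn't depend on training order.
        for msg, is_spam in messages:
            db.train(unique_tokens(msg), is_spam)

        # Pass 2: with the database complete, compute the percentage
        # meta-token for each message and fold it back in.
        for msg, is_spam in messages:
            tokens = unique_tokens(msg)
            insig = len([t for t in tokens if not db.is_significant(t)])
            pct = 100.0 * insig / max(len(tokens), 1)
            # Bucket to the nearest 10% so the token space stays small.
            db.add_metatoken("pct-insignificant:%d" % round(pct, -1), is_spam)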