> ----------
> From: Jim Grusendorf
> Sent: Friday, February 06, 2004 3:25:23 PM
> To: '[EMAIL PROTECTED]'
> Subject: Bayes Tokens
> Auto forwarded by a Rule
>

I'm sure this is nothing new to you SpamAssassin experts, but it seemed like a profound realization when it hit me:
In addition to the actual words and phrases in a message and its headers, *any* aspect or property of the message -- whether intrinsic or derived -- can be considered a Bayes token. POPFile takes this into account to a limited extent with its "pseudowords" (http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord). But a rule-based system can be enfolded into a Bayes solution, because the result of evaluating any rule can be considered a token.

Imagine if SpamAssassin had an optional mode where, instead of using manual scoring, all the active rules of its impressively comprehensive ruleset were tokenized. I think this would result in very impressive accuracy.

I'm imagining examples such as testing for blacklist membership, which some people (mostly elsewhere, apparently) think is a bad idea. I could let Bayes decide exactly how relevant it is to *my* corpus. How about the "Message is x% to y% HTML" rules, or the "text to image ratio" rules? This would develop a very precise profile of what I consider to be ham and spam.

I've emailed Paul Graham and Gary Robinson for their opinions, and both agree that this is a good idea. Paul pointed out that he mentions the possibility in the appendix to A Plan for Spam, and Gary mentioned a commercial product (PureMessage) that apparently does some of this.

It sure would be nice to see it in SpamAssassin someday.

Jim Grusendorf
Computer Systems Manager
HHS Management Limited Partnership
[EMAIL PROTECTED]
PGP Key ID 0x5534507C
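
To make the idea concrete, here is a rough Python sketch of rule results being fed to a Bayes classifier as ordinary tokens. Everything in it is an assumption made for illustration: the rule names, the message fields (`text`, `html_fraction`, `sender_blacklisted`), and the `TinyBayes` class are hypothetical and are not SpamAssassin's or POPFile's actual rules or code. The classifier is a plain naive Bayes with add-one smoothing that scores only the tokens present in a message.

```python
import math
import re
from collections import Counter

# Hypothetical rules, for illustration only. Each rule is a yes/no test on
# a message; a rule that fires contributes its name as a pseudo-token.
RULES = {
    "RULE:MOSTLY_HTML": lambda msg: msg["html_fraction"] >= 0.5,
    "RULE:SENDER_BLACKLISTED": lambda msg: msg["sender_blacklisted"],
}


def tokenize(msg):
    """Return the set of word tokens plus one pseudo-token per fired rule."""
    tokens = set(re.findall(r"[a-z']+", msg["text"].lower()))
    tokens.update(name for name, test in RULES.items() if test(msg))
    return tokens


class TinyBayes:
    """Naive Bayes over token presence, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, msg, label):
        self.totals[label] += 1
        self.counts[label].update(tokenize(msg))

    def spam_probability(self, msg):
        # Start from the smoothed prior odds, then add the log-likelihood
        # ratio of every token present in the message.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for tok in tokenize(msg):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def make_msg(text, html_fraction, sender_blacklisted):
    return {
        "text": text,
        "html_fraction": html_fraction,
        "sender_blacklisted": sender_blacklisted,
    }


clf = TinyBayes()
clf.train(make_msg("cheap pills now", 0.9, True), "spam")
clf.train(make_msg("win money now", 0.8, True), "spam")
clf.train(make_msg("meeting notes attached", 0.0, False), "ham")
clf.train(make_msg("lunch tomorrow", 0.1, False), "ham")

# Same unseen text, differing only in the blacklist rule result.
listed = clf.spam_probability(make_msg("hello there", 0.0, True))
unlisted = clf.spam_probability(make_msg("hello there", 0.0, False))
print(listed, unlisted)
```

In this sketch the weight of the blacklist rule is not assigned by hand. It comes entirely from how often the rule fired on the spam and ham used for training, which is the "let Bayes decide how relevant it is to my corpus" behavior described above.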
