> ----------
> From:         Jim Grusendorf
> Sent:         Friday, February 06, 2004 3:25:23 PM
> To:   '[EMAIL PROTECTED]'
> Subject:      Bayes Tokens
> 
> 
> 
I'm sure this is nothing new to you SpamAssassin experts, but it seemed like
a profound realization when it hit me:

In addition to the actual words and phrases in a message and its headers,
*any* aspect or property of the message -- whether intrinsic or derived --
can be considered a Bayes token.  POPFile takes this into account to a
limited extent with its "pseudowords"
(http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord).  But a
rule-based system can be enfolded into a Bayes solution, because the result
of evaluating any rule can be considered a token.  Imagine if SpamAssassin
had an optional mode where, instead of using manual scoring, all the active
rules of its impressively comprehensive ruleset were tokenized.  I think
this would result in very high accuracy.
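
To make the idea concrete, here is a rough sketch in Python of what I mean.
It is not SpamAssassin code: the rule name ON_BLACKLIST, the message dict
layout, and the "RULE:" prefix are all made up for illustration, and the
classifier is a plain naive Bayes with add-one smoothing rather than the
combining methods Paul or Gary describe.

```python
import math
from collections import Counter


def tokenize(message, rules):
    """Return body words plus one pseudo-token per rule that fires.

    `message` is a hypothetical dict with a "body" key; `rules` maps a
    rule name to a function that returns True when the rule matches.
    """
    tokens = message["body"].lower().split()
    for name, rule in rules.items():
        if rule(message):
            tokens.append("RULE:" + name)
    return tokens


class RuleTokenBayes:
    """Naive Bayes over token presence, with add-one smoothing."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        # Count each token once per message (presence, not frequency).
        if is_spam:
            self.spam_counts.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham_counts.update(set(tokens))
            self.n_ham += 1

    def spam_probability(self, tokens):
        # Start from the smoothed prior odds, then add the log
        # likelihood ratio of every token present in the message.
        log_odds = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for tok in set(tokens):
            p_spam = (self.spam_counts[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_counts[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical rule: the message dict carries a precomputed flag.
rules = {"ON_BLACKLIST": lambda msg: msg.get("blacklisted", False)}

classifier = RuleTokenBayes()
classifier.train(tokenize({"body": "buy now", "blacklisted": True}, rules), True)
classifier.train(tokenize({"body": "cheap pills", "blacklisted": True}, rules), True)
classifier.train(tokenize({"body": "meeting at noon"}, rules), False)
classifier.train(tokenize({"body": "lunch tomorrow"}, rules), False)

unseen = tokenize({"body": "hello", "blacklisted": True}, rules)
print(classifier.spam_probability(unseen))
```

The point is that the rule hit is learned exactly like a word: if
blacklisted senders turn out to be mostly ham in my corpus, the same token
would pull the score the other way with no manual tuning.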

I'm imagining examples such as testing for blacklist membership, which some
people (mostly elsewhere, apparently) think is a bad idea.  I could let
Bayes decide exactly how relevant it is to *my* corpus.  How about the
"Message is x% to y% HTML" rules, or the "text to image ratio" rules?
Tokenizing these would develop a very precise profile of what I consider to
be ham and spam.
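
For the numeric rules, one way to get a token is to bucket the value into
ranges, so each range becomes its own pseudo-word.  A small sketch, again
with made-up names (bucket_token, HTML_PCT) and an arbitrary bucket width
of 10 that is assumed to divide the upper bound evenly:

```python
def bucket_token(name, value, width=10, upper=100):
    """Turn a numeric property into a range token such as 'HTML_PCT:30-40'.

    Values are clamped to [0, upper]; the top value falls in the last
    bucket. Assumes `upper` is a multiple of `width`.
    """
    value = max(0, min(value, upper))
    low = min(int(value // width) * width, upper - width)
    return f"{name}:{low}-{low + width}"


print(bucket_token("HTML_PCT", 37))
print(bucket_token("HTML_PCT", 100))
```

Narrower buckets give a finer profile but need more mail before each token
has seen enough examples to mean anything.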

I've emailed Paul Graham and Gary Robinson for their opinions, and both
agree that this is a good idea.  Paul pointed out that he mentions the
possibility in the appendix to A Plan for Spam, and Gary mentioned a
commercial product (PureMessage) that apparently does some of this.  It sure
would be nice to see it in SpamAssassin someday.

Jim Grusendorf
Computer Systems Manager
HHS Management Limited Partnership
[EMAIL PROTECTED]
PGP Key ID 0x5534507C
