Re: FW: Bayes Tokens

Chris Thielen 7 Feb 2004 01:39:22 -0000

On Fri, 2004-02-06 at 16:25, SpamAssassin List wrote:
> I'm sure this is nothing new to you SpamAssassin experts, but it seemed like
> a profound realization when it hit me:
> 
> In addition to the actual words and phrases in a message and its headers,
> *any* aspect or property of the message -- whether intrinsic or derived --
> can be considered a Bayes token.  POPFile takes this into account to a


Quite right...

> limited extent with their "pseudowords"
> (http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord).  But a
> rule-based system can be enfolded into a Bayes solution, because the result
> of evaluating any rule can be considered a token.  Imagine if SpamAssassin
> had an optional mode where, instead of using manual scoring, all the active
> rules of its impressively comprehensive ruleset were tokenized?  I think
> this would result in most impressive accuracy.
> 
There was actually some discussion about a similar topic on the sa-dev
list about a week or two back.  Their discussion was specifically
related to *generating* the (static) scores using naive bayes.  IIRC,
the general consensus was that the existing GA, or the new perceptron
(which is orders of magnitude faster), can come up with better scores.
The argument, as I recall it, was that they take context into account
instead of rating each rule as simply "good" or "bad" (e.g. rule1 alone
is neutral between ham and spam, but rule1 and rule2 together are
spammy).
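
To make the rule1/rule2 point concrete, here is a small Python sketch.
It is only my illustration, not anything from sa-dev or the SpamAssassin
code: the toy corpus and the names token_probs and combine are made up,
and the math is a simplified version of the per-token estimate and
combining formula from A Plan for Spam. Each rule fires equally often in
ham and spam, so each token scores 0.5, and so does the pair, even
though the pair shows up only in spam.

```python
from collections import Counter


def token_probs(spam_msgs, ham_msgs):
    """Per-token spam probability, each token judged on its own."""
    spam_counts = Counter(t for msg in spam_msgs for t in msg)
    ham_counts = Counter(t for msg in ham_msgs for t in msg)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / len(spam_msgs)  # fraction of spam with token
        h = ham_counts[tok] / len(ham_msgs)    # fraction of ham with token
        # Clamp so a single token can never force a score of 0 or 1.
        probs[tok] = max(0.01, min(0.99, s / (s + h)))
    return probs


def combine(probs, tokens):
    """Combine per-token probabilities, assuming tokens are independent."""
    p_spam = 1.0
    p_ham = 1.0
    for tok in tokens:
        p = probs.get(tok, 0.5)  # unseen tokens are treated as neutral
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)


# Toy corpus (assumed): each message is the set of rules that fired.
spam = [{"RULE1", "RULE2"}, set()]
ham = [{"RULE1"}, {"RULE2"}]

probs = token_probs(spam, ham)
print(probs["RULE1"], probs["RULE2"])
print(combine(probs, {"RULE1", "RULE2"}))
```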

I fully expect somebody to reply telling me I'm smoking crack.  Here,
have a grain of salt.

> I'm imagining examples such as testing for blacklist membership, which some
> people (mostly elsewhere, apparently) think is a bad idea.  I could let
> Bayes decide just exactly how relevant it is to *my* corpus.  How about the
> "Message is x% to y% HTML" rules, or "text to image ratio" rules.  This
> would develop a very precise profile of what I consider to be ham and spam.
> 
This is definitely an interesting idea... letting Bayes tweak scores...
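
For what it's worth, the tokenizing side of this looks easy. A rough
Python sketch, with everything in it assumed for illustration: the
function name property_tokens, the "rule:" and "html:" prefixes, the 10%
bucket width, and the rule name in the usage line are all hypothetical,
not SpamAssassin's or POPFile's actual format.

```python
def property_tokens(rule_hits, html_fraction):
    """Turn rule results and one derived property into Bayes-style tokens.

    rule_hits     -- names of the rules that fired on the message
    html_fraction -- share of the message that is HTML, from 0.0 to 1.0
    """
    # One token per rule that fired, e.g. a blacklist lookup.
    tokens = {f"rule:{name}" for name in rule_hits}

    # Bucket the HTML share into 10% ranges so that similar messages
    # end up sharing a token ("Message is x% to y% HTML").
    frac = min(max(html_fraction, 0.0), 1.0)
    low = min(int(frac * 10), 9) * 10
    tokens.add(f"html:{low}-{low + 10}")
    return tokens


# Hypothetical rule name; the tokens would be fed to the Bayes learner
# alongside the ordinary word tokens.
print(sorted(property_tokens(["EXAMPLE_RBL_HIT"], 0.37)))
```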

> I've emailed Paul Graham and Gary Robinson for their opinions, and both
> agree that this is a good idea.  Paul pointed out that he mentions the

An even more interesting idea.

> possibility in the appendix to A Plan for Spam, and Gary mentioned a
> commercial product (PureMessage) that apparently does some of this.  It sure
> would be nice to see it in SpamAssassin someday.


-- 
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/
