Re: FW: Bayes Tokens

Justin Mason 7 Feb 2004 02:05:26 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> In addition to the actual words and phrases in a message and its headers,
> *any* aspect or property of the message -- whether intrinsic or derived --
> can be considered a Bayes token.  POPFile takes this into account to a
> limited extent with their "pseudowords"
> (http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord).

We do this extensively ;)  See the source of Bayes.pm for details.
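
To illustrate the general idea of derived tokens, here is a minimal Python sketch. It is not taken from Bayes.pm and does not reflect how SpamAssassin or POPFile actually tokenize; the function name `tokenize`, the `*`-prefixed token names, and the choice of properties are all assumptions made up for this example.

```python
import email
import re


def tokenize(raw):
    """Return word tokens plus a few derived pseudo-tokens for one message.

    Hypothetical sketch: the pseudo-token names below are invented for
    illustration and are not SpamAssassin's or POPFile's.
    """
    msg = email.message_from_string(raw)
    has_html = False
    text_chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            has_html = True
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                text_chunks.append(payload)

    # Ordinary word tokens from the text parts.
    body = "\n".join(text_chunks)
    tokens = {w.lower() for w in re.findall(r"[A-Za-z]{3,}", body)}

    # Derived properties become tokens too; the "*" prefix keeps them
    # from colliding with real words.
    tokens.add("*has_html:" + ("yes" if has_html else "no"))
    letters = [c for c in msg.get("Subject", "") if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        tokens.add("*subject:all_caps")
    return tokens
```

Once a property is expressed as a token, the classifier treats it like any other word and learns from the corpus how much weight it deserves.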

> But a
> rule-based system can be enfolded into a Bayes solution, because the result
> of evaluating any rule can be considered a token.  Imagine if SpamAssassin
> had an optional mode where, instead of using manual scoring, all the active
> rules of its impressively comprehensive ruleset were tokenized?  I think
> this would result in most impressive accuracy.

This is under investigation.  Some previous attempts failed pretty
badly, but we think there might be a way to do it correctly.
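
For readers who want to see what "rule results as tokens" could look like, here is a rough Python sketch. It is not one of the attempts mentioned above and is not SpamAssassin code: the three rules, their names, and the `RuleBayes` class are hypothetical, and the classifier is a plain Bernoulli naive Bayes with add-one smoothing, chosen only because it is short.

```python
import math
from collections import Counter

# Hypothetical stand-ins for a real ruleset; each rule is a yes/no test.
RULES = {
    "HAS_HTML": lambda m: "<html" in m.lower(),
    "MENTIONS_FREE": lambda m: "free" in m.lower(),
    "HAS_URL": lambda m: "http://" in m.lower(),
}


def rule_tokens(message):
    """Names of the rules that fire on this message."""
    return {name for name, test in RULES.items() if test(message)}


class RuleBayes:
    """Bernoulli naive Bayes over rule hits instead of hand-set scores."""

    def __init__(self):
        self.hits = {"spam": Counter(), "ham": Counter()}
        self.total = {"spam": 0, "ham": 0}

    def train(self, message, label):
        self.total[label] += 1
        self.hits[label].update(rule_tokens(message))

    def spam_probability(self, message):
        # Start from the smoothed prior odds of spam versus ham.
        log_odds = math.log((self.total["spam"] + 1) / (self.total["ham"] + 1))
        fired = rule_tokens(message)
        for name in RULES:
            # Smoothed probability that this rule fires in each class.
            p_spam = (self.hits["spam"][name] + 1) / (self.total["spam"] + 2)
            p_ham = (self.hits["ham"][name] + 1) / (self.total["ham"] + 2)
            if name in fired:
                log_odds += math.log(p_spam / p_ham)
            else:
                log_odds += math.log((1 - p_spam) / (1 - p_ham))
        return 1 / (1 + math.exp(-log_odds))
```

Usage would be to call `train(message, "spam")` or `train(message, "ham")` over a corpus and then `spam_probability(message)` on new mail. Note that this treats every rule as independent of the others, which is a simplifying assumption of the sketch.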

> I'm imagining examples such as testing for blacklist membership, which some
> people (mostly elsewhere, apparently) think is a bad idea.  I could let
> Bayes decide just exactly how relevant it is to *my* corpus.  How about the
> "Message is x% to y% HTML" rules, or "text to image ratio" rules.  This
> would develop a very precise profile of what I consider to be ham and spam.
> 
> I've emailed Paul Graham and Gary Robinson for their opinions, and both
> agree that this is a good idea.  Paul pointed out that he mentions the
> possibility in the appendix to A Plan for Spam, and Gary mentioned a
> commercial product (PureMessage) that apparently does some of this.  It sure
> would be nice to see it in SpamAssassin someday.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAJEfRQTcbUG5Y7woRApcIAJ0R45Y8fXC+p+XCCt3MmnLAFEWawACfVA9k
EgVQDJPoVcZGbprzJsPOZ5A=
=20Ub
-----END PGP SIGNATURE-----
