> In addition to the actual words and phrases in a message and its headers,
> *any* aspect or property of the message -- whether intrinsic or derived --
> can be considered a Bayes token. POPFile takes this into account to a
> limited extent with their "pseudowords"
> (http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord).

We do this extensively ;) see the source to Bayes.pm for details.

> But a
> rule-based system can be enfolded into a Bayes solution, because the result
> of evaluating any rule can be considered a token. Imagine if SpamAssassin
> had an optional mode where, instead of using manual scoring, all the active
> rules of its impressively comprehensive ruleset were tokenized? I think
> this would result in most impressive accuracy.

This is under investigation. Some previous attempts failed pretty badly,
but we think there might be a way to do it correctly.

> I'm imagining examples such as testing for blacklist membership, which some
> people (mostly elsewhere, apparently) think is a bad idea. I could let
> Bayes decide just exactly how relevant it is to *my* corpus. How about the
> "Message is x% to y% HTML" rules, or "text to image ratio" rules. This
> would develop a very precise profile of what I consider to be ham and spam.
>
> I've emailed Paul Graham and Gary Robinson for their opinions, and both
> agree that this is a good idea. Paul pointed out that he mentions the
> possibility in the appendix to A Plan for Spam, and Gary mentioned a
> commercial product (PureMessage) that apparently does some of this. It sure
> would be nice to see it in SpamAssassin someday.

--j.
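
To make the "rule results as tokens" idea concrete, here is a rough sketch in Python. It is not how Bayes.pm works, and none of the names come from SpamAssassin: the two rules, the `RULE:` prefix, and the `TinyBayes` class are all made up for illustration. It uses a plain naive Bayes with add-one smoothing, not the combining methods Paul Graham or Gary Robinson describe. Each rule that fires simply contributes one extra pseudo-token next to the ordinary word tokens, so training decides how much weight it carries.

```python
import math
import re
from collections import Counter

# Hypothetical rules (not SpamAssassin's): name -> predicate over the raw message.
RULES = {
    "MOSTLY_HTML": lambda msg: msg.count("<") > 5,
    "HAS_IMAGE": lambda msg: "<img" in msg.lower(),
}


def tokenize(msg):
    """Return lowercase word tokens plus one pseudo-token per rule that fires."""
    tokens = set(re.findall(r"[a-z0-9$'-]+", msg.lower()))
    # Word tokens are lowercase with no colon, so "RULE:..." cannot collide.
    tokens |= {"RULE:" + name for name, test in RULES.items() if test(msg)}
    return tokens


class TinyBayes:
    """Minimal naive Bayes over token presence, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, msg, label):
        # label must be "spam" or "ham"
        self.totals[label] += 1
        self.counts[label].update(tokenize(msg))

    def spam_probability(self, msg):
        # Start from the smoothed prior odds, then add each token's log-ratio.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for tok in tokenize(msg):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

To try it, call `train(msg, "spam")` or `train(msg, "ham")` on a few messages from your own corpus and then `spam_probability(msg)` on a new one. A rule that fires equally often in ham and spam ends up with a ratio near 1 and contributes almost nothing, which is the "let Bayes decide how relevant it is to *my* corpus" behaviour described above.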
