This webpage: http://lynx.auton.cs.cmu.edu/~agoode/spam/spam
It mentions two spam-detection algorithms for SpamAssassin in addition to the current ones. Any ideas on whether they are worthwhile?

"With Professor Atkeson's spam problem in mind, we devised the following tests:

The compression/entropy test: For each new message, we append it to each known spam message (one at a time) and then compress the result using zlib. We then compute the ratio of the compressed document size to the uncompressed document size to get a measure of the similarity between the new message and that known spam message. The result of the test is the minimum such ratio over all known spam. Using the Perl Compress::Zlib module, we can run the compression algorithm without ever writing the appended spam-plus-message document or its compressed result to disk. As you can imagine, since this algorithm loops over all known spam, it can take quite a while to compute; our tests show it takes about twice as long to run this test as the rest of the tests in SpamAssassin combined: about 2 seconds per message.

The HTML test: Most people do not send messages in HTML, and there are many good reasons for this -- not the least of which is that it is really annoying for those of us who prefer Mutt or Pine (for those of you laughing, when was the last time YOU got infected by an email virus?). As such, SpamAssassin already has some tests to determine how much of a message is HTML versus plain text. Our test takes this one step further and counts the number of characters that make up the HTML tags as opposed to the number of characters that make up the actual message text. This is a reasonable test for several reasons. Firstly, most people who write messages don't spend tons of time decorating them to make them visually appealing -- really, who cares if the email saying the A/C will be off on Saturday looks pretty?
Secondly, in an effort to evade Bayesian word-count filters, spammers will split easily detectable words using fake tags, so that the source is hard for a spam filter to recognize while the text rendered by an HTML-enabled mail client appears as one would expect. For example, consider the string "V<aa>i<aa>a<aa>g<aa>r<aa>a". A Bayesian filter would have a tough time figuring out what this word is, while Mozilla renders it simply as "Viagra". We could either write filters that specifically check for this kind of trickery for each common spam word (which has been done for SpamAssassin) or, as we have done, just compare the number of HTML tag characters with the number of text characters. If the tag characters far outnumber the text characters, we know that something funny is going on."

-- Luke
Computer Science System Administrator
Security Administrator, College of Engineering
Montana State University-Bozeman, Montana
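For reference, the compression/entropy test quoted above could be sketched roughly like this. The post uses Perl's Compress::Zlib; this sketch uses Python's zlib module as a stand-in, and the function name is just illustrative:

```python
import zlib

def compression_similarity(message, known_spam):
    """Minimum compressed/uncompressed size ratio over all known spam.

    A low ratio means the new message compresses very well when appended
    to some known spam message, i.e. it is highly similar to that spam.
    All work happens in memory; nothing is written to disk.
    """
    best = float("inf")
    for spam in known_spam:
        combined = (spam + message).encode()
        # Ratio of compressed size to uncompressed size for this pair.
        ratio = len(zlib.compress(combined)) / len(combined)
        best = min(best, ratio)
    return best
```

As the post notes, this loops over the entire spam corpus per message, so the cost grows linearly with the number of known spam messages.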
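Similarly, the HTML tag-versus-text character count might look like the following sketch. The regex-based tag matching here is a simplification (real-world HTML would need a proper parser), but it is enough to illustrate the idea on the "V<aa>i<aa>a<aa>g<aa>r<aa>a" example:

```python
import re

# Naive tag matcher: anything between '<' and '>'.
TAG_RE = re.compile(r"<[^>]*>")

def html_tag_ratio(html):
    """Ratio of characters spent on HTML tags to characters of visible text."""
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(html))
    text_chars = len(TAG_RE.sub("", html))
    # Guard against division by zero for tag-only input.
    return tag_chars / max(text_chars, 1)
```

On the obfuscated Viagra string, the five fake `<aa>` tags account for 20 characters against only 6 characters of rendered text, so the ratio is well above 1 -- exactly the "something funny is going on" signal the post describes.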