-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lucas Albers writes:
> This webpage:
> http://lynx.auton.cs.cmu.edu/~agoode/spam/spam
>
> Mentions two other spamassassin algorithms for spam detecting in addition
> to the current ones.

Well, we already do the second, and I think it's in 2.6x too
(HTML_90_100 et al).  We took a look at the first one a while back,
but it was very slow.  I wonder if these guys have any more info
on their success rate with it?

- --j.

> Ideas on whether they are worthwhile?
> "
> With Professor Atkeson's spam problem in mind, we devised the following
> tests:
>
> The compression/entropy test: For each new message, we append it to all
> other known spam messages (one at a time) and then compress the result
> using zlib. We then compute the ratio of the compressed document size to
> the uncompressed document size to get a measure of the similarity between
> the new message and the known spam message. The result of this test is
> then the minimum ratio of compressed to uncompressed document size. Using
> the Perl Compress::Zlib module, we can run the compression algorithm
> without ever writing either the appended spam and message document or its
> compressed result. As you can imagine, since this algorithm loops over all
> known spam, it can take quite a while to compute; our tests show it takes
> about twice as long to run this test as the rest of the tests in
> SpamAssassin combined: about 2 seconds per message.
>
> The HTML Test: Most people do not send messages in HTML and there are many
> good reasons for this -- not the least of which is that it is really
> annoying for those of us that prefer to use Mutt or Pine (for those of
> you laughing, when was the last time YOU got infected by an email virus?)
> As such SpamAssassin already has some tests to determine how much of the
> text is in HTML versus how much is in plain text.
> Our test takes this one step further and counts the number of characters
> that comprise the HTML tags as opposed to the number of characters that
> actually comprise the message text. We note that this is a reasonable
> test for many reasons.
> Firstly, most people that write messages don't spend tons of time
> decorating them to make them visually appealing -- really, who cares if
> the email that says that A/C will be off on Saturday looks pretty?
> Secondly, in an effort to avoid Bayesian word count filters, spammers will
> split easily detectable words using fake tags so that the source is hard
> for a spam filter to recognize, but the text rendered by an HTML-enabled
> mail client appears as one would expect. For example, consider the string
> "V<aa>i<aa>a<aa>g<aa>r<aa>a". A Bayesian filter would have a tough
> time figuring out what this word was, while Mozilla renders this easily as
> "Viagra". We could either write filters that specifically check for this
> kind of trickery for each common spam word (which has been done for
> SpamAssassin) or, as we have done, we could just compare the number of
> HTML tag characters with text characters. If the number of tag characters
> far exceeds the number of text characters, we know that something funny is
> going on.
> "
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBA1ouQTcbUG5Y7woRAiTUAJkBS9Ih5VD+19D2eKqHhdDXgPxl9ACfZ1Ri
P/ae04PBK4YEa3GBJ3Ppzak=
=gPXY
-----END PGP SIGNATURE-----
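
For anyone who wants to experiment with the two tests described in the
quoted text, here is a minimal sketch. It is not the code from that page
and not anything in SpamAssassin: it uses Python's zlib module in place of
Perl's Compress::Zlib, all function names are made up for illustration,
and the tag matcher is a naive regex rather than a real HTML parser. The
first function pair implements the "minimum compressed/uncompressed ratio
over all known spam" idea; the last one compares tag characters with text
characters.

```python
import re
import zlib
from typing import Iterable

# Assumption: a "tag" is anything between '<' and the next '>'.
# A real filter would need a proper HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by uncompressed size (lower = more redundant)."""
    if not data:
        return 1.0
    return len(zlib.compress(data)) / len(data)


def min_compression_ratio(message: bytes, known_spam: Iterable[bytes]) -> float:
    """Append the message to each known spam in turn, compress in memory,
    and return the smallest ratio seen. Returns 1.0 if there is no spam
    corpus to compare against (an arbitrary choice for this sketch)."""
    return min(
        (compression_ratio(spam + message) for spam in known_spam),
        default=1.0,
    )


def tag_to_text_ratio(html: str) -> float:
    """Number of characters inside tags divided by characters outside them."""
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(html))
    text_chars = len(TAG_RE.sub("", html))
    if text_chars == 0:
        return float("inf") if tag_chars else 0.0
    return tag_chars / text_chars


if __name__ == "__main__":
    # The obfuscated word from the quoted example: 20 tag chars, 6 text chars.
    print(tag_to_text_ratio("V<aa>i<aa>a<aa>g<aa>r<aa>a"))
```

The compression test still loops over the whole spam corpus for every
message, so this sketch has the same cost problem mentioned above; any
thresholds for either score would have to be tuned against real mail.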