This webpage:
http://lynx.auton.cs.cmu.edu/~agoode/spam/spam

Mentions two additional spam-detection algorithms for SpamAssassin, beyond
its current tests.

Ideas on whether they are worthwhile? (Rough Perl sketches of both tests
follow the quoted text below.)
"
With Professor Atkeson's spam problem in mind, we devised the following
tests:

The compression/entropy test: For each new message, we append it to each
known spam message (one at a time) and compress the result using zlib. We
then compute the ratio of the compressed document size to the uncompressed
document size as a measure of the similarity between the new message and
that known spam message: if the two are similar, zlib can exploit the
redundancy between them, so the concatenation compresses better and the
ratio drops. The result of the test is the minimum such ratio over all
known spam. Using the Perl Compress::Zlib module, we can run the
compression algorithm without ever writing the concatenated document or
its compressed result to disk. As you can imagine, since this algorithm
loops over all known spam, it can take quite a while: our tests show this
test takes about twice as long to run as all the other SpamAssassin tests
combined, about 2 seconds per message.

The HTML test: Most people do not send messages in HTML, and there are
many good reasons for this, not the least of which is that HTML mail is
really annoying for those of us who prefer Mutt or Pine (for those of you
laughing, when was the last time YOU got infected by an email virus?).
SpamAssassin already has some tests to determine how much of a message is
HTML versus plain text. Our test takes this one step further and counts
the number of characters that make up the HTML tags as opposed to the
number of characters that make up the actual message text. This is a
reasonable test for several reasons. First, most people who write messages
don't spend much time decorating them to make them visually appealing;
really, who cares if the email saying the A/C will be off on Saturday
looks pretty? Second, to evade Bayesian word-count filters, spammers split
easily detectable words with fake tags so that the source is hard for a
spam filter to recognize, while the text rendered by an HTML-enabled mail
client looks as one would expect. For example, consider the string
"V<aa>i<aa>a<aa>g<aa>r<aa>&#097;". A Bayesian filter would have a tough
time figuring out what this word is, while Mozilla happily renders it as
"Viagra". We could write filters that specifically check for this kind of
trickery for each common spam word (which has been done for SpamAssassin)
or, as we have done, simply compare the number of HTML tag characters with
the number of text characters. If the tag characters far outnumber the
text characters, we know that something funny is going on.
"

-- 
Luke
Computer Science System Administrator,
Security Administrator, College of Engineering
Montana State University-Bozeman, Montana

