Re: further spam detection algorithms

Justin Mason 25 Jul 2004 06:59:59 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Lucas Albers writes:
> This webpage:
> http://lynx.auton.cs.cmu.edu/~agoode/spam/spam
> 
> Mentions two other SpamAssassin algorithms for detecting spam, in addition
> to the current ones.

Well, we already do the second, and I think it's in 2.6x too (HTML_90_100
et al).

We took a look at the first one a while back, but it was very slow.
I wonder if these guys have any more info on their success rate
with it?

- --j.

> Ideas on whether they are worthwhile?
> "
> With Professor Atkeson's spam problem in mind, we devised the following
> tests:
> The compression/entropy test: For each new message, we append it to all
> other known spam messages (one at a time) and then compress the result
> using zlib. We then compute the ratio of the compressed document size to the
> uncompressed document size to get a measure of the similarity between the
> new message and the known spam message. The result of this test is then
> the minimum ratio of compressed to uncompressed document size. Using the
> Perl Compress::Zlib module, we can run the compression algorithm without
> ever writing either the appended spam and message document or its
> compressed result. As you can imagine, since this algorithm loops over all
> known spam, it can take quite a while to compute; our tests show it takes
> about twice as long to run this test as the rest of the tests in
> SpamAssassin combined: about 2 seconds per message.
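
For anyone who wants to experiment with the compression test described
above, here is a minimal sketch in Python. It is not the code from the
quoted page (which used Perl's Compress::Zlib) and not anything in
SpamAssassin; the function names and the default value for an empty
corpus are my own assumptions.

```python
import zlib


def compression_ratio(known_spam: bytes, message: bytes) -> float:
    """Compressed size divided by uncompressed size of known_spam + message.

    Everything happens in memory, so nothing is written to disk.
    """
    combined = known_spam + message
    if not combined:
        return 1.0  # assumption: treat empty input as "no similarity"
    return len(zlib.compress(combined)) / len(combined)


def min_compression_ratio(message: bytes, spam_corpus) -> float:
    """Lowest ratio over every known spam message.

    A lower value suggests the new message shares more content with
    at least one known spam. Returns 1.0 for an empty corpus
    (assumption made for this sketch).
    """
    return min(
        (compression_ratio(spam, message) for spam in spam_corpus),
        default=1.0,
    )
```

The loop over the whole corpus is what makes this expensive: the cost
grows linearly with the number of known spam messages, which fits the
slowness reported above.
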
> The HTML Test: Most people do not send messages in HTML and there are many
> good reasons for this -- not the least of which is that it is really
> annoying for those of us who prefer to use Mutt or Pine (for those of
> you laughing, when was the last time YOU got infected by an email virus?).
> As such, SpamAssassin already has some tests to determine how much of the
> text is in HTML versus how much is in plain text. Our test takes this one
> step further and counts the number of characters that comprise the HTML
> tags as opposed to the number of characters that actually comprise the
> message text. We note that this is a reasonable test for many reasons.
> Firstly, most people that write messages don't spend tons of time
> decorating them to make them visually appealing -- really who cares if the
> email that says that A/C will be off on Saturday looks pretty? Secondly,
> in an effort to avoid Bayesian word count filters, spammers will split
> easily detectable words using fake tags so that the source is hard for a
> spam filter to recognize, but the text rendered by an HTML-enabled mail
> client appears as one would expect. For example, consider the string
> "V<aa>i<aa>a<aa>g<aa>r<aa>&#097;". A Bayesian filter would have a tough
> time figuring out what this word was, while Mozilla renders this easily as
> "Viagra". We could either write filters that specifically check for this
> kind of trickery for each common spam word (which has been done for
> SpamAssassin) or, as we have done, we could just compare the number of
> HTML tag characters with text characters. If the number of tag characters
> far exceeds the number of text characters, we know that something funny is
> going on.
> "
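
To make the tag-versus-text comparison quoted above concrete, here is a
rough Python sketch using the standard html.parser module. It is not the
code from the quoted page and not how SpamAssassin's HTML rules are
implemented; the class and function names are hypothetical, and the way
end tags are counted is an approximation.

```python
from html.parser import HTMLParser


class TagTextCounter(HTMLParser):
    """Counts characters spent on HTML tags versus rendered text."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#097; before
        # handle_data() sees them, so they count as one text character.
        super().__init__(convert_charrefs=True)
        self.tag_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it was written.
        self.tag_chars += len(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: count them once.
        self.tag_chars += len(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Approximation: assume the end tag was written as "</tag>".
        self.tag_chars += len(tag) + 3

    def handle_data(self, data):
        self.text_chars += len(data)


def count_tag_and_text_chars(html: str):
    """Return (tag_chars, text_chars) for an HTML fragment."""
    counter = TagTextCounter()
    counter.feed(html)
    counter.close()
    return counter.tag_chars, counter.text_chars


def tag_to_text_ratio(html: str) -> float:
    """Tag characters per text character (text count floored at 1)."""
    tag_chars, text_chars = count_tag_and_text_chars(html)
    return tag_chars / max(text_chars, 1)
```

For the example string in the quoted text, the five "<aa>" tags account
for 20 characters against 6 characters of rendered text, so the ratio is
well above 1, while a plain-text message has a ratio of 0. Where to put
the cut-off is left open here, as it is in the quoted description.
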
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBA1ouQTcbUG5Y7woRAiTUAJkBS9Ih5VD+19D2eKqHhdDXgPxl9ACfZ1Ri
P/ae04PBK4YEa3GBJ3Ppzak=
=gPXY
-----END PGP SIGNATURE-----
