On Sun, 2004-02-22 at 15:39, Jeff Waugh wrote:

> So, who's been having fun with their anti-spam tools recently? I'm still
> using spamassassin and bogofilter [1] these days, but finding more and more
> crap in my real inbox, thanks to all this random-text crap. Gar.

I'm using only bogofilter. I stopped using spamassassin because it was
just too CPU intensive on my little laptop where I run {fetch,proc}mail.
I made some observations that I can't really back up but I thought I'd
share nonetheless.

* I was chatting to the guy behind death2spam a while ago and he was
saying that his bayesian filter managed to learn virus signatures with
enough training. I thought it was an interesting idea, and I remember
tools like TBAV on DOS that were able to detect viruses they didn't have
signatures for, so I figured it was worth attempting to train bogofilter
on the MyDoom virus. This proved to be arguably my worst idea ever. Very
soon I was getting all sorts of random spam dumped in my inbox. Not
happy. As soon as I got clamav to obliterate all the viruses from my
junkmail folder and rebuilt the bogofilter database, accuracy went way
up again. In fact it was better than before, because the database was
also rid of the viruses I'd previously received. So observation 1: spam
and viruses are separate problems; treat them as such.

* I had noticed bogofilter's accuracy slowly falling of late, even with
the virus purging. I gave this some thought and realised that my
junkmail folder was now much, much larger than my personal mail folder.
(I train bogofilter on spam and on mail addressed directly to me. I've
been shuffling out all but the last year's worth of personal mail to
archive folders for a while now, but hadn't been doing the same thing
with the junkmail folder.) Anyway, I've gone back to
almost 100% accuracy by deleting all but the last year's worth of spam
from my junkmail folder and rebuilding bogofilter's database. So it
seems to me that because both my legitimate correspondents and the
junkmail change over time (and people's writing style too, I guess),
the database should reflect that and decay input based on age. The spam
corpus bogofilter is now trained on will probably weigh randomly chosen
words "better" than before. The database is more in line with current
spamming practices.
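
For anyone wanting to try the pruning step, here is a rough sketch of
how it could be scripted. It is not what I actually ran. It assumes the
junkmail folder is a single mbox file, that the Date header is good
enough to judge age by, and that messages with a missing or unparseable
date can simply be dropped (spam often has junk dates). The names
prune_mbox and message_date and the file names are made up for
illustration.

```python
import mailbox
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def message_date(msg):
    """Return the Date header as an aware datetime, or None if unusable."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        # Malformed date: treat the message as undated.
        return None
    if parsed.tzinfo is None:
        # Assumption: a date without a zone is close enough to UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def prune_mbox(src_path, dst_path, max_age_days=365, now=None):
    """Copy messages newer than max_age_days from src_path to dst_path.

    The source mbox is left untouched. Returns the number of messages
    kept. Undated messages are dropped (an assumption of this sketch).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    kept = 0
    try:
        for msg in src:
            date = message_date(msg)
            if date is not None and date >= cutoff:
                dst.add(msg)
                kept += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return kept


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own mail layout.
    print(prune_mbox("junkmail.mbox", "junkmail-recent.mbox"))
```

After that you would rebuild the database from the pruned spam and your
personal mail, using bogofilter's -s (register as spam) and -n
(register as non-spam) options. How mbox input is handled differs
between versions, so check the man page for yours.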

I've noticed great value in keeping the spam that bogofilter classifies
(and that I check). It makes re-training trivial, and it seems that
retraining does help you keep up with spam practices.

Sorry for the long email.

HTH,

James.


-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
