When manually applying the filters "Mark as SPAM" or "Mark as HAM", which pipe the message to the command sa-learn --spam or sa-learn --ham respectively, it
takes up to a minute to process on a PIV 4.3GHz HT with 1GB of RAM, which
seems like ages.

I've noticed that the SQL backends to Bayes and AWL are quite a bit faster. If you're learning on a single-message basis, you might want to add --no-sync to your sa-learn invocation so that it doesn't sync the journal and the database with every single message. Then run the sync itself (sa-learn --sync) as a cron job on an appropriate schedule, like once a day.
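As a rough sketch of the --no-sync idea: the learning commands take one message on stdin with the sync deferred, and a separate job does the sync later. The function names, the SA_LEARN variable and the 03:00 schedule are my own illustrative assumptions, not part of the original setup.

```sh
#!/bin/sh
# Sketch only. SA_LEARN is a hypothetical override so the wrapper can be
# dry-run with another command; by default it calls the real sa-learn.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Each function reads one message on stdin, the way the mail client's
# "Mark as SPAM" / "Mark as HAM" filters pipe it in.
# --no-sync skips the journal/database sync for this message.
learn_spam() { "$SA_LEARN" --spam --no-sync; }
learn_ham()  { "$SA_LEARN" --ham --no-sync; }

# Run the deferred sync in one go, e.g. from cron once a day.
sync_bayes() { "$SA_LEARN" --sync; }

# Example crontab entry (03:00 daily is an arbitrary choice):
#   0 3 * * * sa-learn --sync
```

The filters would then pipe each message to learn_spam or learn_ham instead of calling sa-learn directly, and only the cron job pays the cost of the sync.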

Second, it seems that spamassassin vs spam is nothing less than an arms race,
with spamassassin perpetually running behind.

Well, of course. Any rules are going to be reactive to what they've seen, not proactive. The Bayesian filter gets much closer to being an "on the fly" reaction to the mail you see, but it still needs a historical record to go on, not intuition. Anything else would end up resembling a Douglas Adams novel :)

As more and more rules are added, doesn't it come to a point where deciding if
a message is spam or ham takes longer and longer or up to a point where
spamassassin alone can't handle it anymore?

I'm not geeky enough to formulate this in fancier words, but it seems like there's an upper threshold to how complicated you can make a mail message, therefore there should be an upper limit to the rules to identify a message automatically based on certain characteristics. But, there may come a time when the "arms race" goes thermonuclear and the only way to deal with spam is to nuke SMTP as we know it and formulate a new system that better deals with the loopholes spammers exploit to send their ads.

Lastly, I am running spamassassin 3.1 out of the box; that is, I installed the
RPM and that's it.

What can I do to increase effectiveness of spamassassin in differentiating spam from ham? Right now, there's about 10% of all messages that come in on a day (4.500) that are unjustly marked as ham or spam (10% is not a lot, but still
45 messages each day!)

Uh, wouldn't 10% be 450 messages?  ;)

This is my prejudice showing, but personally I would compile SA from scratch rather than relying on an RPM. I rarely trust that precompiled packages are going to contain the options I want, or exclude the options I'll never use (but then, I'm also a FreeBSD user, and even when you install something from ports it's compiled from scratch and can be fine-tuned). Make sure you have all the SQL tools you need and use your favorite database backend for Bayes and AWL. This is purely anecdotal, but it seems much faster on the several servers where I've implemented it than the older database methods. I'd also look for other bottlenecks, because with a 4GHz processor and 1GB RAM, SA should kick booty. Either something else is consuming your resources, or something's rotten in Denmark.
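For the SQL backend suggestion, a local.cf fragment might look roughly like the following. This is a sketch that assumes MySQL; the database name, user name and password are placeholders I made up, and the tables have to be created first from the schema files that ship with the SpamAssassin source.

```
# Bayes in SQL (MySQL storage module); DSN and credentials are placeholders
bayes_store_module      Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn           DBI:mysql:spamassassin:localhost
bayes_sql_username      sa_user
bayes_sql_password      CHANGE_ME

# Auto-whitelist in SQL; same placeholder database and credentials
auto_whitelist_factory  Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn            DBI:mysql:spamassassin:localhost
user_awl_sql_username   sa_user
user_awl_sql_password   CHANGE_ME
```

If you prefer a different database, the storage module and DSN change accordingly; check the documentation for your SA version before copying any of this.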
