Hi all, It's been a long time since I frequented this list!
After many years of faithful service I'm upgrading my server and thought I'd check to see what's happening with James. I'm pleased to see v3 is beginning to emerge and I'll be happy to take it for a spin.
I see nothing much has changed with the Bayesian analysis mailet. It has performed very well for me and I'd definitely recommend it to people. However, I've just taken a look at the code for the first time and I think I'd like to have a go at improving it, especially as IMAP is now a possibility.
I have a couple of ideas I'd like to try and I thought I'd air them here in case anyone has a brighter idea or some advice; thanks.
As it stands, the current Bayesian filter has a relatively simplistic tokenizer. It seems to break the email into tokens with little regard to whether a given piece of text is a MIME boundary, base64 data, an image, a document, a header, etc. My spam and ham database is filled with millions of random-looking chunks of text, mainly from base64-encoded images! So my first plan is to make the tokenizer more intelligent. It should carefully extract far more metadata from the email.
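To make the idea concrete, here is a minimal sketch of what a context-aware tokenizer might look like. It is not the existing James code; the class `SketchTokenizer` and its methods are hypothetical names of my own. It assumes the message has already been split into headers and MIME parts by something else (JavaMail, for example), that text parts are UTF-8, and it borrows the `Header*token` marking and the "ignore all-digit tokens" rule from [2].

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not part of the James code base. */
final class SketchTokenizer {

    // Letters, digits, apostrophes, dollar signs, exclamation marks and
    // dashes count as token characters; everything else separates tokens.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}$'!-]+");

    /** Splits text into tokens, preserving case and dropping all-digit tokens. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group();
            if (!t.matches("\\d+")) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Marks each token of a header value with the header it came from. */
    static List<String> tokenizeHeader(String name, String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : split(value)) {
            tokens.add(name + "*" + t);
        }
        return tokens;
    }

    /**
     * Tokenizes one MIME part. Non-text parts contribute only their content
     * type; base64-encoded text parts are decoded before tokenizing.
     */
    static List<String> tokenizePart(String contentType, String transferEncoding, String content) {
        List<String> tokens = new ArrayList<>();
        String type = contentType.toLowerCase(Locale.ROOT);
        int semi = type.indexOf(';');
        if (semi >= 0) {
            type = type.substring(0, semi).trim(); // drop parameters such as charset
        }
        if (!type.startsWith("text/")) {
            // Record that an attachment of this type exists, but never
            // tokenize its encoded payload.
            tokens.add("Content-Type*" + type);
            return tokens;
        }
        String text = content;
        if ("base64".equalsIgnoreCase(transferEncoding)) {
            // Assumes UTF-8; a real implementation would honour the charset parameter.
            text = new String(Base64.getMimeDecoder().decode(content), StandardCharsets.UTF_8);
        }
        tokens.addAll(split(text));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeHeader("Subject", "Free money"));
        System.out.println(tokenizePart("image/png", "base64", "iVBORw0KGgo="));
        System.out.println(tokenizePart("text/plain", "7bit", "Call 555 now"));
    }
}
```

With something along these lines, a base64-encoded image would add a single `Content-Type*image/png` token to the corpus instead of thousands of meaningless fragments, while a word in the Subject line is kept distinct from the same word in the body.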
I'm not the first to think of this, of course. Paul Graham originally wrote 'A Plan for Spam' [1] back in 2002 and then updated it with 'Better Bayesian Filtering' [2] in 2003. This spawned several projects and products. The most feature-complete of these is SpamProbe [3] by Brian Burton, but a Java implementation also exists in a project called jASEN [4]. This latter project has been quiet for a few years and was also forked into a proprietary product.
I'm quite interested in the fact that James 3 supports IMAP. I think this may make it easier and more efficient for users to maintain their own spam folder. Currently users have to send any spam (or ham) they receive to an address such as [email protected] (or [email protected]), and if they forget to send it as an attachment they risk poisoning the spam corpus. Think how much easier it would be to simply move an email from one of your email folders to a special 'spam' folder. Also think how much easier it would be to browse the spam folder looking for misclassified emails and drag them back to the correct folder. Currently, I delete emails classified as spam, and if someone wants one back I have to go rooting about in MySQL's binary logs!
I worry about how big the spam folder may get if I'm not deleting spam messages. I may have to automatically expire spam messages once they reach a certain age. Or it may be that a small amount of fast-failing reduces the spam intake to a manageable amount.
I'm not sure how IMAP and POP3 play together yet. I guess a user should only manage their email via IMAP or POP3, but not both. Is that right? Either way, improving the Bayesian tokenizer should improve spam filtering for both access methods.
Best Regards,
David Legg

[1] http://www.paulgraham.com/spam.html
[2] http://www.paulgraham.com/better.html
[3] http://spamprobe.sourceforge.net/
[4] http://jasen.sourceforge.net/
