Bayesian Analysis for v3

David Legg Mon, 22 Oct 2012 17:12:44 -0700

Hi all,

It's been a long time since I frequented this list!

After many years of faithful service I'm upgrading my server and thought I'd check to see what's happening with James. I'm pleased to see v3 is beginning to emerge and I'll be happy to take it for a spin.

I see nothing much has changed with the Bayesian analysis mailet. It has performed very well for me and I'd definitely recommend it to people. However, I've just taken a look at the code for the first time and I think I'd like to have a go at improving it, especially as IMAP is now a possibility.

I have a couple of ideas I'd like to try and I thought I'd air them here in case anyone has a brighter idea or some advice; thanks.

As it stands, the current Bayesian filter has a relatively simplistic tokenizer. It seems to break the email into tokens with little regard to whether a given bit of text is a MIME boundary, base64 data, an image, a document, a header and so on. My spam and ham database is filled with millions of random-looking chunks of text, mainly from base64-encoded images! So my first plan is to make the tokenizer more intelligent. It should carefully extract far more metadata from the email.
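
To make the idea concrete, here is a minimal sketch in Java of what a MIME-aware tokenizer might look like. It is not taken from the James code base: the class name MimeAwareTokenizer, the "subject*"-style prefixes (loosely modelled on the token marking described in [2]) and the "skipped*" marker are all my own assumptions for illustration. It walks multipart boundaries, tags words from a few headers with the header name, tokenizes plain text parts, and records only the content type of anything it does not decode, so base64 image data never reaches the token database. It does not decode quoted-printable or base64 text parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; not the tokenizer used by the James Bayesian mailet.
public class MimeAwareTokenizer {
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9$'!-]+");
    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    public static List<String> tokenize(String rawMessage) {
        List<String> tokens = new ArrayList<>();
        tokenizeEntity(rawMessage.replace("\r\n", "\n"), tokens);
        return tokens;
    }

    private static void tokenizeEntity(String entity, List<String> tokens) {
        String headers;
        String body;
        if (entity.startsWith("\n")) {            // a part with no headers at all
            headers = "";
            body = entity.substring(1);
        } else {
            int split = entity.indexOf("\n\n");   // blank line ends the headers
            headers = split < 0 ? entity : entity.substring(0, split);
            body = split < 0 ? "" : entity.substring(split + 2);
        }
        headers = headers.replaceAll("\n[ \t]+", " ");   // unfold continuation lines

        String contentType = "text/plain";
        String encoding = "7bit";
        for (String line : headers.split("\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) continue;
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (name.equals("content-type")) {
                contentType = value;
            } else if (name.equals("content-transfer-encoding")) {
                encoding = value.toLowerCase(Locale.ROOT);
            } else if (name.equals("subject") || name.equals("from") || name.equals("to")) {
                addWords(name + "*", value, tokens);     // mark tokens with their header
            }
        }

        String type = contentType.toLowerCase(Locale.ROOT);
        if (type.startsWith("multipart/")) {
            Matcher m = BOUNDARY.matcher(contentType);
            if (!m.find()) return;                       // malformed: no boundary given
            String delimiter = "--" + m.group(1);
            StringBuilder part = null;                   // null while outside any part
            for (String line : body.split("\n", -1)) {
                String t = line.trim();
                if (t.equals(delimiter) || t.equals(delimiter + "--")) {
                    if (part != null) tokenizeEntity(part.toString(), tokens);
                    part = t.equals(delimiter) ? new StringBuilder() : null;
                } else if (part != null) {
                    part.append(line).append('\n');
                }
            }
            if (part != null) tokenizeEntity(part.toString(), tokens); // unterminated part
        } else if (type.startsWith("text/") && !encoding.equals("base64")) {
            addWords("", body, tokens);
        } else {
            // Images, documents and base64 bodies: keep only the content type.
            tokens.add("skipped*" + type.split(";")[0].trim());
        }
    }

    private static void addWords(String prefix, String text, List<String> tokens) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String w = m.group();
            if (w.length() >= 2 && w.length() <= 40) tokens.add(prefix + w);
        }
    }
}
```

Calling MimeAwareTokenizer.tokenize(rawMessage) on a message with a text part and a PNG attachment would yield the header and body words plus a single "skipped*image/png" token, rather than thousands of base64 fragments.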

I'm not the first to think of this of course. Paul Graham originally wrote 'A Plan for Spam' [1] back in 2002 and then updated it with 'Better Bayesian Filtering' [2] in 2003. This spawned several projects and products. The most feature-complete of these is SpamProbe [3] by Brian Burton, but a Java version also exists in a project called jASEN [4]. This latter project has been quiet for a few years and was forked into a proprietary product as well.
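
For anyone who hasn't read [1], the scoring step it describes is small enough to sketch. The following is my own illustrative Java, not the code in the James mailet: the class name GrahamCombiner and the method signature are hypothetical. It follows the article in taking the 15 tokens whose spam probability is furthest from 0.5, giving unseen tokens a probability of 0.4, and combining them as prod(p) / (prod(p) + prod(1 - p)). Counting each distinct token once is my assumption, and the per-token probabilities are assumed to be already clamped to the range 0.01 to 0.99, as the article does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the combining step from 'A Plan for Spam' [1].
public class GrahamCombiner {
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.4;
    static final int MAX_INTERESTING_TOKENS = 15;

    public static double spamProbability(List<String> tokens,
                                         Map<String, Double> tokenSpamProbability) {
        // Look up each distinct token; unseen tokens lean slightly towards ham.
        List<Double> probs = new ArrayList<>();
        for (String token : new LinkedHashSet<>(tokens)) {
            probs.add(tokenSpamProbability.getOrDefault(token, UNKNOWN_TOKEN_PROBABILITY));
        }
        // Most "interesting" first: largest distance from the neutral 0.5.
        probs.sort(Comparator.comparingDouble((Double p) -> Math.abs(p - 0.5)).reversed());

        double spam = 1.0;
        double ham = 1.0;
        int n = Math.min(MAX_INTERESTING_TOKENS, probs.size());
        for (double p : probs.subList(0, n)) {
            spam *= p;
            ham *= 1.0 - p;
        }
        // With probabilities clamped to [0.01, 0.99] the denominator is never zero.
        return spam / (spam + ham);
    }
}
```

The article treats a message as spam when this combined value exceeds 0.9. The better the tokens fed in, the better this step works, which is why the tokenizer is where I'd like to start.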

I'm quite interested in the fact that James 3 supports IMAP. I think this may make it easier and more efficient for users to maintain their own spam folder. Currently users have to send any spam (or ham) they receive to an address such as [email protected] (or [email protected]), and if they forget to send it as an attachment they risk poisoning the spam corpus. Think how much easier it would be to simply move an email from one of your email folders to a special 'spam' folder. Also think how much easier it would be to browse the spam folder looking for mis-classified emails and drag them back to the correct folder. Currently, I delete emails classified as spam, and if someone wants one back I have to go rooting about in MySQL's binary logs!

I worry how big the spam folder may get if I'm not deleting spam messages. I may have to automatically expire spam messages once they reach a certain age. Or it may be that a small amount of fastfailing reduces the spam intake to a manageable amount.
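
The expiry rule itself would be simple. As a rough sketch, assuming the received time of each message in the spam folder is available (the class SpamExpiry, the method name and the message-id map are hypothetical, and nothing here touches a real mailbox API), selecting the messages to purge might look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: picks the messages that are older than a maximum age.
public class SpamExpiry {
    /** Returns the ids of messages received more than maxAge before now. */
    public static List<String> expired(Map<String, Instant> receivedById,
                                       Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : receivedById.entrySet()) {
            if (entry.getValue().isBefore(cutoff)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

Passing the current time in as a parameter keeps the rule easy to test; a scheduled job could call it with Instant.now() and, say, Duration.ofDays(30).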

I'm not sure how IMAP and POP3 play together yet. I guess a user should only manage their email via IMAP or POP3 but not both. Is that right? However, improving the Bayesian tokenizer should improve spam filtering for both access methods.


Best Regards,
David Legg


[1] http://www.paulgraham.com/spam.html
[2] http://www.paulgraham.com/better.html
[3] http://spamprobe.sourceforge.net/
[4] http://jasen.sourceforge.net/

