Re: Bayesian Analysis for v3

Josip Almasi Wed, 24 Oct 2012 07:43:29 -0700

Hi,

David Legg wrote:

Hi all,


It's been a long time since I frequented this list!

After many years of faithful service I'm upgrading my server and thought I'd 
check to see what's happening with James.  I'm pleased to see v3 is beginning 
to emerge and I'll be happy to take it for a spin.


Same here, though I think I'll wait until it works with Java 7 (the workaround 
didn't work for me).

I see nothing much has changed with the Bayesian analysis mailet. It has 
performed very well for me and I'd definitely recommend it to people. However, 
I've just taken a look at the code for the first time and I think I'd like to 
have a go at improving it, especially as IMAP is now a possibility.

I have a couple of ideas I'd like to try and I thought I'd air them here in 
case anyone has a brighter idea or some advice; thanks.

As it stands, the current Bayesian filter has a relatively simplistic 
tokenizer.  It seems to break the email into tokens with little regard to 
whether a given piece of text is a MIME boundary, base64 data, an image, a 
document, a header, etc.  My spam and ham database is filled with millions of 
random-looking chunks of text, mainly from base64-encoded images!  So my first 
plan is to make the tokenizer more intelligent.  It should carefully extract 
far more metadata from the email.


I might be able to help you with that.
I wrote some mail parsing code; it parses plain text and HTML and ignores other 
MIME types. For the others, I guess only headers should be taken into account.
Malformed MIME messages are a real issue there, so I used heuristics to avoid 
them: number of tokens and size of tokens.
Also, it's better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, and should 
be limited by maximum allowed time and/or number of tokens.
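
To make those heuristics concrete, here is a minimal sketch in Java. It is not 
the code mentioned above and not part of James; the class name and all three 
limits are assumptions chosen for illustration. It expects text that has 
already been extracted from a text/plain or text/html part, skips 
whitespace-delimited runs that are too long to be words (typical of encoded 
data in malformed messages), ignores numbers, and caps the number of tokens. 
A time limit is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class HeuristicTokenizer {
    // Illustrative limits only; these are assumptions, not values used by James.
    static final int MAX_TOKENS = 5000;
    static final int MIN_TOKEN_LENGTH = 3;
    static final int MAX_TOKEN_LENGTH = 40;

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern EDGE_NON_LETTERS =
            Pattern.compile("^[^\\p{L}]+|[^\\p{L}]+$");
    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");

    /** Splits already-decoded text into lower-case word tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String candidate : WHITESPACE.split(text)) {
            if (tokens.size() >= MAX_TOKENS) {
                break; // cap the work done on any single message
            }
            if (candidate.length() > MAX_TOKEN_LENGTH) {
                continue; // long unbroken runs are usually encoded data
            }
            // Drop surrounding punctuation and digits, e.g. "NOW!" becomes "NOW".
            String word = EDGE_NON_LETTERS.matcher(candidate).replaceAll("");
            if (word.length() < MIN_TOKEN_LENGTH) {
                continue; // also discards pure numbers, which strip to ""
            }
            if (!LETTERS_ONLY.matcher(word).matches()) {
                continue; // mixed letters and digits or symbols: ignore
            }
            tokens.add(word.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

Calling HeuristicTokenizer.tokenize on a short subject line keeps the words 
and drops the numbers, while a single long base64 run yields no tokens at all.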

I'm not the first to think of this, of course.  Paul Graham originally wrote 
'A Plan for Spam' [1] back in 2002 and then updated it with 'Better Bayesian 
Filtering' [2] in 2003.  This spawned several projects and products.  The more 
feature-complete version is SpamProbe [3] by Brian Burton, but a Java version 
exists in a project called jASEN [4]. This latter project has been quiet for a 
few years and was forked into a proprietary product as well.
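
For readers who have not read [1], the core of the approach is the step that 
combines per-token spam probabilities into one score for the message. The 
sketch below follows the combining formula from 'A Plan for Spam', including 
its clamping of each probability to the range 0.01 to 0.99. The class and 
method names are made up for this example, and nothing here is taken from the 
James mailet, SpamProbe or jASEN.

```java
final class GrahamCombiner {
    /**
     * Combines per-token spam probabilities into a message score:
     * prod(p) / (prod(p) + prod(1 - p)).
     */
    static double combine(double[] tokenSpamProbabilities) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : tokenSpamProbabilities) {
            // Clamp so that no single token can force the score to 0 or 1.
            double clamped = Math.min(0.99, Math.max(0.01, p));
            spamProduct *= clamped;
            hamProduct *= 1.0 - clamped;
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

In the essay only the most "interesting" tokens (those furthest from 0.5) are 
fed into this step, which also keeps the products from underflowing. Two 
strongly spammy tokens at 0.99 each combine to a score above 0.99, while 
tokens at 0.9 and 0.1 cancel out to 0.5.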

I'm quite interested in the fact that James 3 supports IMAP.  I think this may 
make it easier and more efficient for users to maintain their own spam folder.  
Currently users have to send any spam (or ham) they receive to an address such 
as [email protected] (or [email protected]), and if they forget to send it as 
an attachment they risk poisoning the spam corpus.  Think how much easier it 
would be to simply move an email from one of your email folders to a special 
'spam' folder.  Also think how much easier it would be to browse the spam 
folder looking for misclassified emails and drag them back to the correct 
folder.  Currently, I delete emails classified as spam, and if someone wants 
one back I have to go rooting about in MySQL's binary logs!


Right!

I worry how big the spam folder may get if I'm not deleting spam messages.  I 
may have to automatically expire spam messages that get to a certain age.  Or 
it may be that a small amount of fastfailing reduces the spam intake to 
manageable amounts.
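
The age-based expiry could be sketched roughly as follows. This is only an 
illustration: StoredMessage is a hypothetical stand-in for however messages 
are actually stored, not a James type, and the maximum age is left to the 
caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class SpamExpiry {
    /** Hypothetical stand-in for a stored message; not a James type. */
    static final class StoredMessage {
        final String id;
        final Instant receivedAt;

        StoredMessage(String id, Instant receivedAt) {
            this.id = id;
            this.receivedAt = receivedAt;
        }
    }

    /** Returns the messages in the spam folder that are older than maxAge. */
    static List<StoredMessage> selectExpired(
            List<StoredMessage> spamFolder, Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        List<StoredMessage> expired = new ArrayList<>();
        for (StoredMessage message : spamFolder) {
            if (message.receivedAt.isBefore(cutoff)) {
                expired.add(message); // candidate for deletion
            }
        }
        return expired;
    }
}
```

Passing the current time in as a parameter keeps the selection easy to test; 
a scheduled job would call selectExpired with Instant.now() and delete 
whatever comes back.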


Well, I'm not deleting any spam :) You never know when you may need some ;)
Right now I have 143286 unread messages in my junk folder; the total is 250k+, 
all correctly marked as 100% spam, 850 MB.

Regards...


