Hi,
David Legg wrote:
Hi all,
It's been a long time since I frequented this list!
After many years of faithful service I'm upgrading my server and thought I'd
check to see what's happening with James. I'm pleased to see v3 is beginning
to emerge and I'll be happy to take it for a spin.
Same here, though I think I'll wait until it works with Java 7 (the workaround
didn't work for me).
I see nothing much has changed with the Bayesian analysis mailet. It has
performed very well for me and I'd definitely recommend it to people. However,
I've just taken a look at the code for the first time and I think I'd like to
have a go at improving it, especially as IMAP is now a possibility.
I have a couple of ideas I'd like to try and I thought I'd air them here in
case anyone has a brighter idea or some advice; thanks.
As it stands, the current Bayesian filter has a relatively simplistic
tokenizer. It seems to break the email into tokens with little regard to
whether a given bit of text is a MIME boundary, base64 data, an image, a
document or a header. My spam and ham database is filled with millions of
random-looking chunks of text, mainly from base64-encoded images! So my first
plan is to make the tokenizer more intelligent. It should carefully extract far
more meta-data from the email.
I might be able to help you with that.
I wrote some mail parsing code that parses plain text and HTML and ignores
other MIME types. For the others, I guess only the headers should be taken
into account.
Malformed MIME messages are a real issue there, so I used heuristics to avoid
them, based on the number of tokens and the size of tokens.
Also, it's better to ignore numbers, or to use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, and should
be limited by a maximum allowed time and/or number of tokens.
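
To make those points concrete, here is a minimal sketch in Python using only
the standard library. It is not the parsing code mentioned above and not the
James mailet; every name and limit in it (tokenize, MAX_TOKENS, MAX_TOKEN_LEN,
MAX_SECONDS, the "part:" and "filename:" prefixes) is an assumption made for
illustration. It tokenizes text/plain and text/html bodies, keeps only
metadata for other MIME types, treats digits as delimiters, and stops at a
token or time budget.

```python
import re
import time
from email import message_from_string, policy
from html.parser import HTMLParser

# All limits below are assumed values for illustration, not tuned settings.
MAX_TOKENS = 5000      # stop after this many tokens per message
MAX_TOKEN_LEN = 40     # longer runs are probably encoded data, not words
MAX_SECONDS = 2.0      # time budget per message

# Non-word characters, digits and underscores all act as delimiters.
DELIMITERS = re.compile(r"[\W\d_]+")


class TextOnly(HTMLParser):
    """Collects the text nodes of an HTML part and drops the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words(text, prefix=""):
    """Yield lower-cased words of a plausible size, optionally prefixed."""
    for word in DELIMITERS.split(text):
        if 2 <= len(word) <= MAX_TOKEN_LEN:
            yield prefix + word.lower()


def part_tokens(part):
    """Yield tokens for one leaf MIME part."""
    ctype = part.get_content_type()
    if ctype not in ("text/plain", "text/html"):
        # Other types: keep only metadata, never the encoded payload.
        yield "part:" + ctype
        yield from words(part.get_filename() or "", "filename:")
        return
    try:
        body = part.get_content()
    except (LookupError, UnicodeError, ValueError):
        return  # unknown charset or undecodable payload: skip the part
    if ctype == "text/html":
        parser = TextOnly()
        parser.feed(body)
        parser.close()
        body = " ".join(parser.chunks)
    yield from words(body)


def tokenize(raw_message):
    """Return a bounded list of tokens for a raw message string."""
    msg = message_from_string(raw_message, policy=policy.default)
    deadline = time.monotonic() + MAX_SECONDS
    tokens = []

    def all_tokens():
        # Only two top-level headers are sampled here, as a simplification.
        for name in ("From", "Subject"):
            yield from words(str(msg.get(name, "")), name.lower() + ":")
        for part in msg.walk():
            if not part.is_multipart():
                yield from part_tokens(part)

    for token in all_tokens():
        if len(tokens) >= MAX_TOKENS or time.monotonic() > deadline:
            break
        tokens.append(token)
    return tokens
```

With this shape a base64-encoded image contributes a single "part:image/png"
token instead of thousands of random chunks, and a pathological message can
cost at most MAX_TOKENS tokens or MAX_SECONDS of work. Prefixing header words
keeps "subject:pills" distinct from "pills" in the body, which is one of the
ideas in 'Better Bayesian Filtering' [2].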
I'm not the first to think of this of course. Paul Graham originally wrote 'A
Plan for Spam' [1] back in 2002 and then updated it with 'Better Bayesian
Filtering' [2] in 2003. This spawned several projects and products. The more
feature-complete version is SpamProbe [3] by Brian Burton, but a Java version
exists in a project called jASEN [4]. This latter project has been quiet for a
few years and was forked into a proprietary product as well.
I'm quite interested in the fact that James 3 supports IMAP. I think this may
make it easier and more efficient for users to maintain their own spam folder.
Currently users have to send any spam (or ham) they receive to an address
such as [email protected] (or [email protected]) and if they forget to send
it as an attachment they risk poisoning the spam corpus. Think how much easier
it would be to simply move an email from one of your email folders to a special
'spam' folder. Also think how much easier it would be to browse the spam folder
looking for mis-classified emails and drag them back to the correct folder.
Currently, I delete emails classified as spam and if someone wants one back I
have to go rooting about in MySQL's binary logs!
Right!
I worry how big the spam folder may get if I'm not deleting spam messages. I
may have to automatically expire spam messages that get to a certain age. Or
it may be that a small amount of fastfailing reduces the spam intake to
manageable amounts.
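
As a rough illustration of age-based expiry over IMAP, here is a sketch in
Python. It assumes conn is an already authenticated imaplib.IMAP4 (or
IMAP4_SSL) connection; the function names, the folder name and the age limit
are all hypothetical. It relies on the standard IMAP SEARCH key BEFORE, which
matches on the server's internal date for each message.

```python
from datetime import date, timedelta

# IMAP dates use English month abbreviations regardless of locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def before_criterion(today, max_age_days):
    """Build an IMAP SEARCH criterion for messages older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    stamp = f"{cutoff.day:02d}-{MONTHS[cutoff.month - 1]}-{cutoff.year}"
    return f'(BEFORE "{stamp}")'


def expire_old_spam(conn, folder, today, max_age_days):
    """Delete messages in `folder` older than `max_age_days`; return count.

    `conn` is assumed to be a logged-in imaplib.IMAP4 connection, and
    `folder` a mailbox name that needs no quoting.
    """
    status, _ = conn.select(folder)
    if status != "OK":
        return 0
    status, data = conn.search(None, before_criterion(today, max_age_days))
    if status != "OK" or not data or not data[0]:
        return 0
    ids = data[0].split()
    for msg_id in ids:
        # Flag each match; nothing is removed until EXPUNGE runs.
        conn.store(msg_id, "+FLAGS", "\\Deleted")
    conn.expunge()
    return len(ids)
```

Something like expire_old_spam(conn, "Spam", date.today(), 90) could then be
run from a scheduled job. Flagging and expunging in two steps means a failure
halfway through leaves the remaining messages untouched.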
Well, I'm not deleting any spam :) You never know when you may need some ;)
Right now I have 143286 unread messages in my junk folder, out of 250k+ in
total (850MB), all correctly marked as 100% spam.
Regards...