Hi Gary,

Thank you for your prompt response.

The James mail server *already has* many anti-spam features besides the statistical filter: blacklisting, whitelisting, server-side S/MIME signature support and, already working but due in the next release, greylisting, SPF, SURBL and others. The just-released Bayesian statistical filter, which has full "Paul Graham" support and which we want to extend to the "chi technique", was written by me more than 3 years ago and has been used successfully by several people.

Moreover, anti-spam support is central to our goals, and we are convinced that we can do good things in this area within James, as it is a *very* flexible and extensible system.

The wish to implement the "chi technique" came out of a discussion about interfacing with external filters; we decided to implement our own solution, as it gives us more control and flexibility in the future.
See: http://issues.apache.org/jira/browse/JAMES-514
and the following thread:
http://www.mail-archive.com/server-dev@james.apache.org/msg09608.html
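
For reference, the combining step of the "chi technique" could be sketched in Java roughly as below. This is only a minimal sketch under stated assumptions: it uses the commonly cited form I = (1 + H - S) / 2 with the inverse chi-square function evaluated for an even number of degrees of freedom, the class and method names (ChiCombiner, chi2Q, combine) are hypothetical and are not part of James, and the per-token probabilities f(w) are assumed to be already computed and strictly between 0 and 1.

```java
public class ChiCombiner {

    // Upper-tail probability of the chi-square distribution for an EVEN
    // number of degrees of freedom (closed-form series).
    // Assumption: for very large x2, Math.exp underflows to 0; a real
    // implementation would need to guard against that.
    static double chi2Q(double x2, int df) {
        if (df <= 0 || df % 2 != 0) {
            throw new IllegalArgumentException("df must be positive and even");
        }
        double m = x2 / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < df / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // Combines per-token spam probabilities f(w) into one indicator in [0, 1]:
    // values near 1 suggest spam, near 0 suggest ham, near 0.5 "unsure".
    static double combine(double[] tokenProbs) {
        int n = tokenProbs.length;
        if (n == 0) {
            return 0.5; // no evidence either way
        }
        double lnProdF = 0.0;        // ln of the product of f(w)
        double lnProdOneMinusF = 0.0; // ln of the product of (1 - f(w))
        for (double f : tokenProbs) {
            lnProdF += Math.log(f);
            lnProdOneMinusF += Math.log(1.0 - f);
        }
        double h = chi2Q(-2.0 * lnProdF, 2 * n);
        double s = chi2Q(-2.0 * lnProdOneMinusF, 2 * n);
        return (1.0 + h - s) / 2.0;
    }
}
```

Summing logarithms instead of multiplying the probabilities directly avoids underflow in the product when a message contains many tokens.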

That's why we will gladly wait for your help, whenever you have time :-).

In the meantime, thank you for the links, which I will read right away.

Vincenzo

Gary Robinson wrote:

Hi Vincenzo -- Let me ask you something. Most top spam filters have more stuff in them than the statistical filter. Would it be practical for you to just copy the code of another open-source filter, either translating it into Java or using it directly? SpamBayes is excellent and it's written in Python -- technically, you should be able to run its engine directly in Java using Jython, which translates Python to Java bytecodes on the fly. Bogofilter is also excellent -- it's written in C so you should be able to use its engine from Java. SpamAssassin is also excellent, though I'm not sure what language it's written in. They've all done very well in head-to-head competitions and/or won awards. (It also happens that they all use the chi technique.)

Bogofilter is the only one currently using the "handling redundancy" extension to the original technique. (See http://www.bgl.nu/bogofilter/esf.html for more info.) I think it's only used optionally though; I think the default is the original technique described here in my Linux Journal article: http://www.linuxjournal.com/article/6467. The extension increases filtering accuracy very significantly, as you can see from that bogofilter link, but it does take a fair amount of CPU time to find the optimal parameters, and spending that time is crucial to make it work.

I've been thinking about ways to speed it up, but I haven't had time to implement or test them. The person who implemented the approach described in the paper, Greg Louis, is unable to spend time on Bogofilter any more. If you're interested in going a bit farther than other filters have with this, let me know, and I'd be happy to write it up for you, although I won't have time for a couple of weeks.

If you don't want to do that, which would be perfectly understandable, I don't think you could go wrong by linking to the bogofilter C code, which is fast, or using SpamBayes' Python code via Jython. (One caveat is that I personally can't advise on whether there are difficulties in breaking the filtering engine out from the interface code in those projects.) I can understand, of course, that there may be a need for pure Java source, in which case you'd have to translate or, as you're already planning, write your own. One advantage of using a pre-existing project is that most projects evolve over time to match the evolution of spammer tactics, which do change significantly over time. I can imagine that it might be too distracting for a project that isn't focused on spam filtering to really try to keep up, and thus it may be hard to stay competitive with the dedicated projects.

Anyway, those are the thoughts that come to mind. I'll respond to your specific question as soon as I can -- I've got a heavy meeting and deadline schedule next week. Also, I'm not sure if you saw the Linux Journal article mentioned above -- if you haven't, you might want to look at it. If so, let me know if it resolves any of your questions.

Gary






Gary Robinson
CTO
Emergent Music, LLC
[EMAIL PROTECTED]
207-942-3463
Company: http://www.goombah.com
Blog:    http://www.garyrobinson.net



