Hi Gary,
thank you for your prompt response.
The James mail server *already has* many more anti-spam features
besides the statistical filter: blacklisting, whitelisting, and
server-side S/MIME signature support. Greylisting, SPF, SURBL and others
are already working and will ship in the next release. The just-released
Bayesian statistical filter, which has full "Paul Graham" support and
which we want to extend to the "chi technique", was written by me more
than 3 years ago and has been used successfully by several people.
Moreover, anti-spam support is central to our goals, and we are
convinced that we can do good things in this area within James, as it is
a *very* flexible and extensible system.
The decision to implement the "chi technique" came after we discussed
interfacing with external filters; we decided to implement our own
solution, as it gives us more control and flexibility in the future.
See: http://issues.apache.org/jira/browse/JAMES-514
and the following thread:
http://www.mail-archive.com/server-dev@james.apache.org/msg09608.html
That's why we will still look forward to your help, whenever you have
time :-) .
In the meantime, thank you for the links, which I will read right away.
Vincenzo
Gary Robinson wrote:
Hi Vincenzo --
Let me ask you something. Most top spam filters have more stuff in
them than the statistical filter. Would it be practical for you to just
copy the code of another open-source filter, either translating it into
Java or using it directly? SpamBayes is excellent and it's written in
Python -- technically, you should be able to run its engine directly in
Java using Jython, which translates Python to Java bytecode on the
fly. Bogofilter is also excellent -- it's written in C so you should be
able to use its engine from Java. SpamAssassin is also excellent, though
I'm not sure what language it's written in. They've all done very well
in head-to-head competitions and/or won awards. (It also happens that
they all use the chi technique.)
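
For readers unfamiliar with it, the chi technique mentioned above combines the per-token spam probabilities of a message using Fisher's method and the chi-square distribution, rather than the "Paul Graham" product formula. The following is only a minimal Java sketch of that idea as described in the Linux Journal article; it is not code from James, SpamBayes, Bogofilter or SpamAssassin. The class and method names are hypothetical, and the clamping bounds are assumptions made for illustration.

```java
/** Hypothetical helper (not part of James): chi-square combining of token probabilities. */
final class ChiSquareCombiner {

    /**
     * Smoothed token probability f(w) = (s*x + n*p(w)) / (s + n), where n is the
     * number of times the token was seen, s the strength given to the prior, and
     * x the assumed probability for an unseen token. s and x are tunable.
     */
    static double smoothed(double pw, int n, double s, double x) {
        return (s * x + n * pw) / (s + n);
    }

    /**
     * Probability that a chi-square variable with an even number of degrees of
     * freedom exceeds x2. Closed-form series valid only for even evenDf.
     */
    static double chi2Q(double x2, int evenDf) {
        double m = x2 / 2.0;
        double term = Math.exp(-m); // may underflow for very large m; see note below
        double sum = term;
        for (int i = 1; i < evenDf / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    /** Returns a score in [0, 1]: near 1 is spam, near 0 is ham, near 0.5 is unsure. */
    static double score(double[] tokenProbs) {
        int n = tokenProbs.length;
        if (n == 0) {
            return 0.5; // no evidence either way
        }
        double lnSpam = 0.0; // sum of ln f(w)
        double lnHam = 0.0;  // sum of ln (1 - f(w))
        for (double p : tokenProbs) {
            // Clamp away from 0 and 1 so the logarithms stay finite (assumed bounds).
            double f = Math.min(Math.max(p, 1e-6), 1.0 - 1e-6);
            lnSpam += Math.log(f);
            lnHam += Math.log(1.0 - f);
        }
        double spamEvidence = 1.0 - chi2Q(-2.0 * lnHam, 2 * n);
        double hamEvidence = 1.0 - chi2Q(-2.0 * lnSpam, 2 * n);
        return (1.0 + spamEvidence - hamEvidence) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(score(new double[] {0.99, 0.98, 0.97})); // strongly spammy tokens
        System.out.println(score(new double[] {0.01, 0.02, 0.03})); // strongly hammy tokens
        System.out.println(score(new double[] {0.99, 0.01}));       // conflicting evidence
    }
}
```

A message with conflicting strong evidence lands near 0.5, which is what lets a filter report "unsure" instead of forcing a verdict. This sketch computes exp(-m) directly, so it can underflow when a message contributes many extreme tokens; a production implementation would need to work in the log domain or cap the number of tokens used.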
Bogofilter is the only one currently using the "handling redundancy"
extension to the original technique. (See
http://www.bgl.nu/bogofilter/esf.html for more info.) I think it's only
used optionally though; I think the default is the original technique
described here in my Linux Journal article:
http://www.linuxjournal.com/article/6467. It increases filtering
accuracy very significantly as you can see from that bogofilter link,
but does take a fair amount of cpu time to find the optimal parms, and
spending that time is crucial to make it work.
I've been thinking about ways to speed it up, but I haven't had time
to implement or test them. The person who implemented the approach
described in the paper, Greg Louis, is unable to spend time on
Bogofilter any more. If you're interested in going a bit farther than
other filters have with this, let me know, and I'd be happy to write it
up for you, although I won't have time for a couple of weeks.
If you don't want to do that, which would be perfectly understandable,
I don't think you could go wrong by linking to the bogofilter C code,
which is fast, or by using SpamBayes' Python code via Jython. (One caveat is
that I personally can't advise on whether there are difficulties in
breaking the filtering engine out from the interface code in those
projects.) I can understand, of course, that there may be a need for
pure Java source, in which case you'd have to translate or, as you're
already planning, write your own.
One advantage of using a pre-existing project is that most projects
evolve over time to match the evolution of spammer tactics, which do
change significantly over time. I can imagine that it might be too
distracting for a project that isn't focused on spam filtering to
really try to keep up, and thus it may be hard to stay competitive with
the dedicated projects.
Anyway, those are thoughts that come to mind. I'll respond to your
specific question as soon as I can -- I've got a heavy meeting and
deadline schedule next week. Also, I'm not sure if you saw the Linux
Journal article mentioned above -- if you haven't you might want to
look at it. If so let me know if it resolves any of your questions.
Gary
Gary Robinson
CTO
Emergent Music, LLC
[EMAIL PROTECTED]
207-942-3463
Company: http://www.goombah.com
Blog: http://www.garyrobinson.net