Hi folks,
I started a new thread for this one so everybody notices it and nobody
forgets to flame me :) These are more or less my long-promised proposals
for SpamAssassin and how I see the future.
On Wednesday 21 January 2004 19:57 CET Justin Mason wrote:
> Theo Van Dinter writes:
> > On Tue, Jan 20, 2004 at 09:12:09PM -0800, Justin Mason wrote:
> > > my $spamtest = Mail::SpamAssassin->new();
> > > my $mail = Mail::SpamAssassin::SomethingOrOther->new(data =>
> > > [EMAIL PROTECTED]);
> > > my $status = $spamtest->check ($mail);
> > >
> > > so it's user-visible.
> >
> > Well, actually it's still not user-visible. It'll be:
> >
> > my $mail = Mail::SpamAssassin::MsgParser->new([EMAIL PROTECTED]);
> >
> > It's a bit confusing because you call MsgParser but get back a
> > MsgContainer. The MsgParser code could, theoretically, go into
> > M::SA::Util, but ...
>
> hmm. In that case, "MsgContainer" may be fine to use, since it's
> not *quite* as user visible.
Nooo, please not that name. Every time I read something like 'MsgParser' or
'PerMsgStatus' I am reminded of this [1] posting on the KMail list ;-) Hrm,
seriously: I don't think we need/should use prefixes like 'Msg' to group
stuff, it's better to use namespaces. That's what they're made for and it's
also more shell-completion friendly (currently one has to hit tab three
times to edit PerMsgStatus.pm).
You might remember that some day in the autumn we already discussed if we
should go for 3.0 and I said something like "please wait, our API sucks, I
have a draft for a better one". I tried to finish that one but got confused
by that '*Message.pm' and 'Encapped*.pm' stuff (and then didn't have the
time/energy to look at it again). Now that Theo cleaned up that mess I see
some light again :) Thanks, Theo.
Currently our codebase has some other big flaws:
1. (IMO) Mostly because of the flat namespace, the API is very confusing.
Without very deep understanding of the infrastructure one can't really
differentiate between the public and private stuff, which modules are just
helpers and so on. (I'm hacking on SA since... um... some time in 2002 when
Justin was still travelling around the world, and still often have to trace
through the modules when I try to fix something). That not only makes it
hard for other applications' developers to access the SA API but might also
scare away potential new developers, and we could use more of those.
2. The code is very monolithic and tangled. That makes it hard to extend the
app with "plug-ins". I personally (and I think somebody else said so too)
would like to see SA not only as "the application" SpamAssassin with its
stable codebase but as something like a framework on which other people
might easily develop (eg. without reimplementing all that mail-parsing
stuff) their own anti-spam solutions. That would also fit well with us
becoming a top-level project of our own (or whatever that's called) in the ASF.
Some possible plug-ins which immediately come to my mind are, for one, other
Learners apart from Bayes. There was once a hack called Fitz [2] announced
which worked with some kind of AI and looked interesting -- but I
never got it running because it is a big patch and didn't apply to my
modified codebase (and I didn't care *that* much, so I never fixed the
rejected parts).
Another candidate for this are the storage backends -- it should be possible
to store (all) your stuff in an SQL database or wherever you want. The
SQL stuff is currently scattered through the whole codebase and some parts
are, AFAICS, heavily outdated.
3. [Got to remember this one.]
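To make the plug-in idea from point 2 a bit more concrete, here is a minimal
sketch of what a common Learner interface might look like. Everything in it is
made up for illustration: the My::Learner base class, the My::Learner::Toy
plug-in and the learn()/score() method names are assumptions, not existing SA
code, and the "classifier" is just a token counter.

```perl
use strict;
use warnings;

# Hypothetical base class: what any learner plug-in would have to provide.
package My::Learner;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub learn { die ref(shift) . " must implement learn()\n" }
sub score { die ref(shift) . " must implement score()\n" }

# Toy plug-in: counts tokens seen in spam and ham. Not a real classifier.
package My::Learner::Toy;
our @ISA = ('My::Learner');

sub learn {
    my ($self, $text, $is_spam) = @_;
    my $bucket = $is_spam ? 'spam' : 'ham';
    $self->{$bucket}{lc($_)}++ for split(' ', $text);
    return;
}

sub score {
    my ($self, $text) = @_;
    my $score = 0;
    for my $tok (map { lc($_) } split(' ', $text)) {
        # Positive means "seen more often in spam", negative "more in ham".
        $score += ($self->{spam}{$tok} || 0) - ($self->{ham}{$tok} || 0);
    }
    return $score;
}

package main;

# The engine would only ever talk to the interface, not to a concrete class.
my @learners = (My::Learner::Toy->new);
$_->learn('cheap pills now', 1)        for @learners;
$_->learn('meeting notes attached', 0) for @learners;
print ref($_), ': ', $_->score('cheap pills'), "\n" for @learners;
```

The point is only that the engine loops over whatever learners are registered,
so something like Fitz could be dropped in without patching the core.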
So, what were my ideas, my "draft" for 3.0 (and after)?
SA 3.0 should be a first step in the direction of SA-as-a-framework. We'd
make some backwards-incompatible changes and break "binary compatibility" (if
you can say so when you talk about Perl :). Those would include:
* A cleaner API/module structure (as started by Theo).
* A new config parser (including a more flexible file format/backend) which
I started to write.
* Some cleanup of the frontends (like getting rid of some command line
parameters and moving them to the config files).
* Possibly some other code moves (which I started with spamc).
* Rename the Autowhitelist :)
* Some stuff from Matt's sa-3.0 stuff? I didn't look at it again because I
didn't want to get "tainted" by non-ASL code.
* ... other ideas?
But 3.x won't be the final "framework" I was talking about. I don't think we
can plan everything in advance (I know I can't). I also don't want to start
from scratch. That never works out. We need to make a first step in the
right direction and then go on refactoring till we're where we want to
be. And we need input from the outside, the possible plug-in writers. I see
3.x as a "moving target" -- which doesn't mean it's unstable all the time,
just that the "plug-in" API (and other stuff) may change. When we finally
have something we like, we call it 4.0 and all the world is happy :)
My rough timeplan would be to get 3.0 out around the end of March (because I
write exams in the first two weeks of February and have two free months
afterwards where I have time to code :). I don't expect 4.0 before summer
2005 (I guess the spam problem won't be solved by then). It's also possible
that there won't ever be a 4.0, we'll see. But it's nice to have a rough
target to aim at.
As a first concrete step I propose (something like) this move of modules
(annotations on the line below the module, you should have tabs of eight
spaces):
ArchiveIterator.pm      Util/ArchiveIterator.pm
                                The backend is split into the following
                                modules:
                        Util/Archive.pm
                        Util/Archive/MBox.pm
                        Util/Archive/Maildir.pm
AutoWhitelist.pm        Rules/Autowhateverlist.pm
                                We really need to rename that one :)
Bayes.pm                Learner/Bayes.pm
BayesStore.pm           Learner/Bayes/Store.pm
                                The above is a factory for the correct
                                Storage module.
                        Learner/Bayes/StoreDBM.pm
                                That's DB_File or whatever we currently
                                use.
                        Learner/Bayes/StoreSQL.pm
                                Not yet available :)
                        Learner/Fitz.pm
                                Just a sample for possible future
                                plug-ins.
CmdLearn.pm             App/Learner.pm
                                This isn't really a lib but an application
                                of its own. I don't like this, some stuff
                                really ought to go into sa-learn.pl while
                                some "lib" stuff might stay here and
                                shareable code should go under Util/ or so.
                        App/Daemon/*.pm
                                spamd is currently very monolithic, I'd
                                like to put stuff like the qmail and BSMTP
                                features into their own modules so spamd
                                is smaller and easier to maintain.
Conf.pm                 Conf.pm
                                The rewritten Conf.pm.
                        Conf/ConfSQL.pm
                        Conf/Conf*.pm
                                "Optional" features/plug-ins should get
                                their own config module which defines the
                                options available.
                        Conf/Store.pm
                                A factory for the config storage backend.
                        Conf/StoreFile.pm
                                Implementation of the standard config file
                                format.
ConfSourceSQL.pm        Conf/StoreSQL.pm
                                And for SQL.
DBBasedAddrList.pm      Rules/AddrList.pm
                                Anybody got a better name for this?
                        Rules/AddrList/StoreDBM.pm
                        Rules/AddrList/StoreSQL.pm
                                The backends.
Dns.pm                  Rules/DNSDBs.pm
                        Rules/SPF.pm
                        Rules/Habeas.pm
                        Rules/?.pm
                                "DNS" is too generic, those all use DNS.
                                The first name is just a suggestion, see [3].
                        Util/DNS.pm
                                Split rules and helper routines. (If still
                                necessary, I didn't look at that file for
                                some time.)
EvalTests.pm            Rules.pm
                                I think we need a factory or some general
                                management module for all the rules.
                        Rules/Default.pm
                                Anybody got a better name for this?
HTML.pm                 Parser/HTML.pm
Locales.pm              Rules/Locales.pm
Locker.pm               Util/Locker.pm
                                This is probably just a factory or
                                something, see below.
MIME.pm                 Message.pm
                                No PerMsg* or Msg* stuff please. Just
                                Message.
MIME/Parser.pm          Parser/MIME.pm
MailingList.pm          Rules/Mailinglists.pm
NetSet.pm               Util/NetSet.pm
                                WTF is this? Better name? Other location?
NoMailAudit.pm
                                Will end up in Message.pm, I think?
PerMsgLearner.pm        Learner.pm
                                A factory? I'm not sure.
PerMsgStatus.pm         Status.pm
                                Shorter is better :)
PersistentAddrList.pm
                                Dunno. Looks like a factory but don't ask
                                me where to put it.
Received.pm             Parser/Trace.pm
                                I think we parse other trace headers (like
                                Delivered-To) here, too.
Reporter.pm
                                Ummm... dunno.
SHA1.pm                 Util/DigestSHA1.pm
TextCat.pm              Rules/TextCat.pm
                                or Util/TextCat.pm?
UnixLocker.pm           Util/Locker/Unix.pm
                                See above.
Util.pm                 Util.pm
                                Some general routines?
                        Util/*
                                There might be other stuff...
Win32Locker.pm          Util/Locker/Windows.pm
                                See above.
                        Store.pm
                        Store/DBM.pm
                        Store/SQL.pm
                                And finally the backends for general storage
                                access.
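Since the word "factory" comes up a lot in the list above (Learner/Bayes/Store.pm,
Conf/Store.pm, Util/Locker.pm), here is a rough sketch of what is meant by it.
The package names under My:: and the 'type' argument are hypothetical and only
there to show the idea; real backends would live in their own files and do
actual work.

```perl
use strict;
use warnings;

# Hypothetical backends, stand-ins for e.g. StoreDBM.pm and StoreSQL.pm.
package My::Bayes::StoreDBM;
sub new { my ($class, %args) = @_; return bless { %args }, $class }

package My::Bayes::StoreSQL;
sub new { my ($class, %args) = @_; return bless { %args }, $class }

# The factory: callers ask for a store by type and never name a backend class.
package My::Bayes::Store;

my %BACKENDS = (
    dbm => 'My::Bayes::StoreDBM',
    sql => 'My::Bayes::StoreSQL',
);

sub new {
    my ($class, %args) = @_;
    my $type = delete($args{type}) || 'dbm';   # default backend is an assumption
    my $impl = $BACKENDS{$type}
        or die "unknown store type '$type'\n";
    return $impl->new(%args);
}

package main;

my $store = My::Bayes::Store->new(type => 'sql');
print ref($store), "\n";
```

Adding a new backend would then mean adding one module and one entry in the
lookup table, without touching any of the callers.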
Whew. If the above confused you a bit: it would give us a nice and clean
structure/API like
Mail/
  SpamAssassin.pm
  SpamAssassin/
    Status.pm
    Message.pm
    Conf.pm
    Conf/Conf*.pm
    Conf/Store*.pm
    Learner.pm
    Learner/
      Bayes.pm
      Bayes/Store*.pm
    Rules.pm
    Rules/*.pm
    Parser.pm
    Parser/
      MIME.pm
      Trace.pm
      HTML.pm
    Util.pm
    Util/*.pm
    Store.pm
    Store/*.pm
    App/*.pm
To continue the original discussion, one would do:
my $engine = Mail::SpamAssassin->new();
my $message = Mail::SpamAssassin::Message->new(
    data => [EMAIL PROTECTED],
);
my $status = $engine->check($message); # Mail::SpamAssassin::Status
$engine->destroy(); # It should be possible to free some mem.
Hm. I think that was it. Comments, flames, patches? :)
Cheers,
Malte
[1] http://article.gmane.org/gmane.comp.kde.devel.kmail/14964
[2] http://zebra.fh-weingarten.de/~thorsten/index-e.html
[3] http://www.hexkey.co.uk/lee/log/2002/12/20/
--
[SGT] Simon G. Tatham: "How to Report Bugs Effectively"
<http://www.chiark.greenend.org.uk/~sgtatham/bugs.html>
[ESR] Eric S. Raymond: "How To Ask Questions The Smart Way"
<http://www.catb.org/~esr/faqs/smart-questions.html>