Hi folks,

I started a new thread for this one so everybody notices it and nobody 
forgets to flame me :) These are more or less my long-promised proposals 
for SpamAssassin and how I see the future.

On Wednesday 21 January 2004 19:57 CET Justin Mason wrote:
> Theo Van Dinter writes:
> > On Tue, Jan 20, 2004 at 09:12:09PM -0800, Justin Mason wrote:
> > >   my $spamtest = Mail::SpamAssassin->new();
> > >   my $mail = Mail::SpamAssassin::SomethingOrOther->new(data =>
> > > [EMAIL PROTECTED]); my $status = $spamtest->check ($mail);
> > >
> > > so it's user-visible.
> >
> > Well, actually it's still not user-visible.  It'll be:
> >
> > my $mail = Mail::SpamAssassin::MsgParser->new([EMAIL PROTECTED]);
> >
> > It's a bit confusing because you call MsgParser but get back a
> > MsgContainer.  The MsgParser code could, theoretically, go into
> > M::SA::Util, but ...
>
> hmm.  In that case, "MsgContainer" may be fine to use, since it's
> not *quite* as user visible.

Nooo, please not that name. Every time I read something like 'MsgParser' or 
'PerMsgStatus' I am reminded of this [1] posting on the KMail list ;-) Hrm, 
seriously: I don't think we need/should use prefixes like 'Msg' to group 
stuff, it's better to use namespaces. That's what they're made for and it's 
also more shell-completion friendly (currently one has to hit tab three 
times to edit PerMsgStatus.pm).
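
For anyone less used to Perl's module layout: nested package names map 
directly onto directories, which is why grouping by namespace also groups 
the files on disk. A minimal sketch of that mapping follows; module_path() 
is a helper made up for this illustration, and the package name is just one 
of the names proposed further down, not an existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a package name into the relative file path
# that "require Package::Name" would look for in @INC
# ('::' becomes '/', and '.pm' is appended).
sub module_path {
    my ($package) = @_;
    (my $path = $package) =~ s{::}{/}g;
    return "$path.pm";
}

print module_path('Mail::SpamAssassin::Parser::MIME'), "\n";
```

With a layout like that, everything related to parsing sits under one 
directory instead of sharing a filename prefix.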

You might remember that some day in the autumn we already discussed if we 
should go for 3.0 and I said something like "please wait, our API sucks, I 
have a draft for a better one". I tried to finish that one but got confused 
by that '*Message.pm' and 'Encapped*.pm' stuff (and then didn't have the 
time/energy to look at it again). Now that Theo cleaned up that mess I see 
some light again :) Thanks, Theo.


Currently our codebase has some other big flaws:

1. (IMO) Mostly because of the flat namespace, the API is very confusing. 
Without very deep understanding of the infrastructure one can't really 
differentiate between the public and private stuff, which modules are just 
helpers and so on. (I'm hacking on SA since... um... some time in 2002 when 
Justin was still travelling around the world, and still often have to trace 
through the modules when I try to fix something). That makes it not only 
hard for other applications' developers to access the SA API but might also 
scare away possible new developers which we might need more of.

2. The code is very monolithic and tangled. That makes it hard to extend the 
app with "plug-ins". I personally (and I think somebody else said so too) 
would like to see SA not only as "the application" SpamAssassin with its 
stable codebase but as something like a framework on which other people 
might easily develop (eg. without reimplementing all that mail-parsing 
stuff) their own anti-spam solutions. That would also fit with us becoming 
a top-level project of our own (or whatever that's called) in the ASF.

Some possible plug-ins which immediately come to my mind are, for one, other 
learners apart from Bayes. There was once a hack called Fitz [2] announced 
which was working with some kind of AI and looked interesting -- but I 
never got it running because it is a big patch and didn't apply to my 
modified codebase (and I didn't care *that* much, so I never fixed the 
rejected parts).

Another candidate for this are the storage backends -- it should be possible 
to store (all) your stuff in an SQL database or wherever you want. The 
SQL stuff is currently scattered through the whole codebase and some parts 
are AFAICS heavily outdated.

3. [Got to remember this one.]


So, what were my ideas, my "draft" for 3.0 (and after)?

SA 3.0 should be a first step into the direction of SA-as-a-framework. We'd 
make some backwards-incompatible changes, break "binary compatibility" (if 
you can say so when you talk about Perl :). Those would include:
* A cleaner API/module structure (as started by Theo).
* A new config parser (including a more flexible file format/backend) which 
I started to write.
* Some cleanup of the frontends (like getting rid of some command line 
parameters and moving them to the config files).
* Some other possible code moves (which I started with spamc).
* Rename the Autowhitelist :)
* Some stuff from Matt's sa-3.0 stuff? I didn't look at it again because I 
didn't want to get "tainted" by non-ASL code.
* ... other ideas?

But 3.x won't be the final "framework" I was talking about. I don't think we 
can plan everything in advance (I know I can't). I also don't want to start 
from scratch. That never works out. We need to make a first step into the 
right direction and then go on with refactoring till we're where we want to 
be. And we need input from the outside, the possible plug-in writers. I see 
3.x as a "moving target" -- which doesn't mean it's unstable all the time, 
just that the "plug-in" API (and other stuff) may change. When we finally 
have something we like, we call it 4.0 and all the world is happy :)

My rough timeplan would be to get 3.0 out around the end of March (because I 
write exams in the first two weeks of February and have two free months 
afterwards where I have time to code :). I don't expect 4.0 before Summer 
2005 (I guess the spam problem won't be solved by then). Also possible that 
there won't ever be a 4.0, we'll see. But it's nice to have a rough target 
to aim at.


As a first concrete step I propose (something like) this move of modules 
(annotations on the line below the module, you should have tabs of eight 
spaces):
ArchiveIterator.pm      Util/ArchiveIterator.pm
                                The backend is split into the following
                                modules:
                        Util/Archive.pm
                        Util/Archive/MBox.pm
                        Util/Archive/Maildir.pm
AutoWhitelist.pm        Rules/Autowhateverlist.pm
                                We really need to rename that one :)
Bayes.pm                Learner/Bayes.pm
BayesStore.pm           Learner/Bayes/Store.pm
                                The above is a factory for the correct 
                                Storage module.
                        Learner/Bayes/StoreDBM.pm
                                That's DB_File or whatever we currently 
                                use.
                        Learner/Bayes/StoreSQL.pm
                                Not yet available :)
                        Learner/Fitz.pm
                                Just a sample for possible future 
                                plug-ins
CmdLearn.pm             App/Learner.pm
                                This isn't really a lib but an application
                                of its own. I don't like this, some stuff 
                                really ought to go into sa-learn.pl while
                                some "lib" stuff might stay here and 
                                shareable code should go under Util/ or so.
                        App/Daemon/*.pm
                                spamd is currently very monolithic, I'd 
                                like to put stuff like the qmail and BSMTP 
                                features into their own modules so spamd 
                                is smaller and easier to maintain.
Conf.pm                 Conf.pm
                                The rewritten Conf.pm.
                        Conf/ConfSQL.pm
                        Conf/Conf*.pm
                                "optional" features/plug-ins should get 
                                their own config module which define the
                                options available.
                        Conf/Store.pm
                                A factory for the config storage backend.
                        Conf/StoreFile.pm
                                Implementation of the standard config file
                                format.
ConfSourceSQL.pm        Conf/StoreSQL.pm
                                And for SQL.
DBBasedAddrList.pm      Rules/AddrList.pm
                                Anybody got a better name for this?
                        Rules/AddrList/StoreDBM.pm
                        Rules/AddrList/StoreSQL.pm
                                The Backends.
Dns.pm                  Rules/DNSDBs.pm
                        Rules/SPF.pm
                        Rules/Habeas.pm
                        Rules/?.pm
                                "DNS" is too generic, those all use DNS.
                                The first name is just a suggestion, see [3].
                        Util/DNS.pm
                                Split rules and helper routines. (If still
                                necessary, I didn't look at that file for 
                                some time.)
EvalTests.pm            Rules.pm
                                I think we need a factory or some general 
                                management module for all the rules.
                        Rules/Default.pm
                                Anybody got a better name for this?
HTML.pm                 Parser/HTML.pm
Locales.pm              Rules/Locales.pm
Locker.pm               Util/Locker.pm
                                This is probably just a factory or 
                                something, see below.
MIME.pm                 Message.pm
                                No PerMsg* or Msg* stuff please. Just 
                                Message.
MIME/Parser.pm          Parser/MIME.pm
MailingList.pm          Rules/Mailinglists.pm
NetSet.pm               Util/NetSet.pm
                                WTF is this? Better name? Other location?
NoMailAudit.pm
                                Will end in Message.pm, I think?
PerMsgLearner.pm        Learner.pm
                                A factory? I'm not sure.
PerMsgStatus.pm         Status.pm
                                Shorter is better :)
PersistentAddrList.pm
                                Dunno. Looks like a factory but don't ask 
                                me where to put it.
Received.pm             Parser/Trace.pm
                                I think we parse other trace headers (like 
                                Delivered-To) here, too.
Reporter.pm             
                                Ummm... dunno.
SHA1.pm                 Util/DigestSHA1.pm
TextCat.pm              Rules/TextCat.pm
                                or Util/TextCat.pm?
UnixLocker.pm           Util/Locker/Unix.pm
                                See above.
Util.pm                 Util.pm
                                Some general routines?
                        Util/*
                                There might be other stuff...
Win32Locker.pm          Util/Locker/Windows.pm
                                See above.
                        Store.pm
                        Store/DBM.pm
                        Store/SQL.pm
                                And finally the backends for general storage
                                access.
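
Since "factory" comes up so often in the list above, here is a minimal 
sketch of what I mean by it, assuming a generic Store front end that hands 
out a DBM or SQL backend by name. All package names (Sketch::Store and 
friends), the 'type' and 'dsn' arguments and the put/get/name methods are 
made up for illustration, and both backends only keep their data in memory 
so the sketch runs without DB_File or DBI.

```perl
use strict;
use warnings;

# Hypothetical DBM backend: an in-memory stand-in, not real DB_File code.
package Sketch::Store::DBM;
sub new {
    my ($class, %args) = @_;
    my $self = { %args, data => {} };
    return bless $self, $class;
}
sub put  { my ($self, $key, $value) = @_; $self->{data}{$key} = $value; return 1 }
sub get  { my ($self, $key) = @_; return $self->{data}{$key} }
sub name { return 'dbm' }

# Hypothetical SQL backend: reuses the in-memory behaviour from above.
package Sketch::Store::SQL;
our @ISA = ('Sketch::Store::DBM');
sub name { return 'sql' }

# The factory: callers ask for a backend type and never name the
# implementing class themselves.
package Sketch::Store;
my %BACKENDS = (
    dbm => 'Sketch::Store::DBM',
    sql => 'Sketch::Store::SQL',
);

sub new {
    my ($class, %args) = @_;
    my $type = delete $args{type};
    $type = 'dbm' unless defined $type;     # assumed default backend
    my $impl = $BACKENDS{lc $type}
        or die "unknown storage backend '$type'\n";
    return $impl->new(%args);
}

package main;

my $store = Sketch::Store->new(type => 'sql', dsn => 'dbi:example');
$store->put(token => 42);
print $store->name, ' ', $store->get('token'), "\n";
```

The point of the design is that adding another backend means one more 
module plus one more entry in the lookup table, and nothing changes for 
the callers.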

                                
Whew. If the above confused you a bit: it would give us a nice and clean 
structure/API like
        Mail/
                SpamAssassin.pm
                SpamAssassin/
                        Status.pm
                        Message.pm
                        Conf.pm
                        Conf/Conf*.pm
                        Conf/Store*.pm
                        Learner.pm
                        Learner/
                                Bayes.pm
                                Bayes/Store*.pm
                        Rules.pm
                        Rules/*.pm
                        Parser.pm
                        Parser/
                                MIME.pm
                                Trace.pm
                                HTML.pm
                        Util.pm
                        Util/*.pm
                        Store.pm
                        Store/*.pm
                        App/*.pm


To continue the original discussion, one would do a
my $engine  = Mail::SpamAssassin->new();
my $message = Mail::SpamAssassin::Message->new(
                    data => [EMAIL PROTECTED],
                  );
my $status  = $engine->check($message); # Mail::SpamAssassin::Status
$engine->destroy(); # It should be possible to free some mem.
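
One reason an explicit destroy() can matter in Perl is reference cycles: 
with reference counting, two objects pointing at each other are never 
freed by themselves. The sketch below assumes that this is the kind of 
memory meant above. Sketch::Engine and Sketch::Conf are stand-ins invented 
for the illustration and are not the real classes.

```perl
use strict;
use warnings;

# Hypothetical config object that keeps a back-reference to its engine.
package Sketch::Conf;
sub new {
    my ($class, $engine) = @_;
    my $self = { engine => $engine };
    return bless $self, $class;
}

# Hypothetical engine: engine -> conf -> engine forms a reference cycle.
package Sketch::Engine;
our $destroyed = 0;                 # counts how many engines were freed

sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    $self->{conf} = Sketch::Conf->new($self);
    return $self;
}

# Break the cycle by hand so the reference count can drop to zero.
sub destroy {
    my ($self) = @_;
    delete $self->{conf};
    return;
}

sub DESTROY { $destroyed++ }

package main;

{ my $leaky = Sketch::Engine->new(); }
print "freed without destroy(): $Sketch::Engine::destroyed\n";

{ my $clean = Sketch::Engine->new(); $clean->destroy(); }
print "freed with destroy():    $Sketch::Engine::destroyed\n";
```

The first engine stays allocated until the interpreter exits, so the first 
count printed is 0; the second engine is freed as soon as it goes out of 
scope, so the second count is 1. In a long-running daemon that difference 
adds up with every message.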


Hm. I think that was it. Comments, flames, patches? :)

Cheers,
Malte

[1]http://article.gmane.org/gmane.comp.kde.devel.kmail/14964
[2]http://zebra.fh-weingarten.de/~thorsten/index-e.html
[3]http://www.hexkey.co.uk/lee/log/2002/12/20/

-- 
[SGT] Simon G. Tatham: "How to Report Bugs Effectively"
      <http://www.chiark.greenend.org.uk/~sgtatham/bugs.html>
[ESR] Eric S. Raymond: "How To Ask Questions The Smart Way"
      <http://www.catb.org/~esr/faqs/smart-questions.html>
