I know I'm not a committer and I don't know all parts of the code very
well, but I've put a lot of time and energy into the AWL and Bayes
backend stores so feel I can talk intelligently about the codebase.

On Wed, Jan 21, 2004 at 10:55:10PM +0100, Malte S. Stretz wrote:
> 
> Currently our codebase has some other big flaw:
> 
> 1. (IMO) Mostly because of the flat namespace, the API is very confusing. 
> Without very deep understanding of the infrastructure one can't really 
> differentiate between the public and private stuff, which modules are just 
> helpers and so on. (I'm hacking on SA since... um... some time in 2002 when 
> Justin was still travelling around the world, and still often have to trace 
> through the modules when I try to fix something). That makes it not only 
> hard for other applications' developers to access the SA API but might also 
> scare away possible new developers which we might need more of.
> 

Agreed, the codebase needs more structure.  It seems like a giant
hodgepodge where code doesn't always end up in the right place.  That's
understandable: any codebase that "grows up" this way, with multiple
developers and incremental design, usually suffers this fate.  Cleaning
things up would do a world of good and might attract more folks to
hack on the code as it becomes easier to work with.

> 
> Another candidate for this are the storage backends -- it should be possible 
> to store (all) your stuff into an SQL database or wherever you want. The 
> SQL stuff currently scattered through the whole codebase and some parts are 
> AFAICS heavily outdated.

My recent Bayes storage work has greatly improved this.  I was able to
add a new RPC-based storage module in a very short time.  It can be
made even better, but going further was too great a leap, IMO, for a
first pass at making it easier to extend.  My plan is that once I've
got things stable and working well (I'm close) and hopefully merged
in, I'll continue to pull implementation-specific methods and whatnot
out of the storage code.

Interestingly, the AWL stuff was VERY easy to expand, but it is much
simpler than the Bayes storage code.

> * A new config parser (including a more flexible file format/backend) which 
> I started to write.

The ability to have "plugin" modules add config variables without
having to add them into Conf.pm would be a VERY good thing.  This
would mean plugins could be distributed separately from SA.
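
To make the idea concrete, here is a rough sketch of what I mean.  It
is Python rather than Perl, it is not SpamAssassin code, and every name
in it (ConfRegistry, register, example_threshold, and so on) is made up
for illustration.  The point is only that the parser looks settings up
in a table that plugins fill in, instead of having each one hard-coded:

```python
# Hypothetical sketch only: not SpamAssassin code, all names are invented.


class ConfRegistry:
    """Table of known config settings that plugins can extend at load time."""

    def __init__(self):
        self._settings = {}  # setting name -> (default value, parser function)
        self._values = {}    # setting name -> parsed value from the config

    def register(self, name, default, parser=str):
        """Called by a plugin to announce a setting it understands."""
        if name in self._settings:
            raise ValueError(f"setting already registered: {name}")
        self._settings[name] = (default, parser)

    def parse_line(self, line):
        """Parse one 'name value' line; return False for an unknown setting."""
        line = line.strip()
        if not line or line.startswith("#"):
            return True  # blank lines and comments are silently accepted
        parts = line.split(None, 1)
        name = parts[0]
        raw = parts[1].strip() if len(parts) > 1 else ""
        if name not in self._settings:
            return False
        _, parser = self._settings[name]
        self._values[name] = parser(raw)
        return True

    def get(self, name):
        """Return the configured value, or the registered default."""
        default, _ = self._settings[name]
        return self._values.get(name, default)


class ExamplePlugin:
    """A separately distributed plugin that brings its own settings."""

    def __init__(self, conf):
        conf.register("example_threshold", 5.0, float)
        conf.register("example_enabled", False, lambda v: v == "1")


conf = ConfRegistry()
ExamplePlugin(conf)
conf.parse_line("example_threshold 7.5")
```

With something along these lines the core parser never needs to know
about example_threshold; it only needs to know how to ask the registry.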

> * Some cleanup of the frontends (like getting rid of some command line 
> parameters and moving them to the config files).

One thing I wouldn't mind seeing in this area is the ability to
specify a username on the command line for tools such as sa-learn, and
an expansion of the ability to fetch user config data from SQL.  There
are a couple of bugs that cover this sort of thing.

> * Rename the Autowhitelist :)

Something like NormalizeAddr? AddrAvgScore? BalanceAddrScore?
BalanceScore? I know they are all TOO long; I stink at coming up with
module names.

No real idea, but I agree with a rename.  The current name is
confusing to folks who don't understand what it does.

> * ... other ideas?
> 

I've been contemplating an Apache/httpd RPC/SOAP-based implementation
of spamd, but haven't taken it very far.  It would be nice to be able
to leverage some of the other Apache projects in the server area.  Of
course, that assumes you don't lose the speed that the current
spamc/spamd combination gives.


I like this layout for the most part.  A few notes below.

> Bayes.pm              Learner/Bayes.pm
> BayesStore.pm         Learner/Bayes/Store.pm
>                               The above is a factory for the correct 
>                               Storage module.
>                       Learner/Bayes/StoreDBM.pm
>                               That's DB_File or whatever we currently 
>                               use.
>                       Learner/Bayes/StoreSQL.pm
>                               Not yet available :)

This can be done now, assuming my changes are folded in.
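
For what it's worth, the factory arrangement described in the quoted
layout can be sketched roughly like this.  It is Python, not Perl, and
it is not the actual Bayes storage code; the names (BayesStore,
register_backend, make_store, MemoryStore) are invented for the
example, and the in-memory backend just stands in for a real DBM or
SQL one:

```python
# Hypothetical sketch only: illustrates a storage factory, names are invented.

_BACKENDS = {}  # backend name -> backend class


def register_backend(name):
    """Class decorator that makes a backend available to the factory."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


class BayesStore:
    """Interface every backend implements; callers only ever see this."""

    def get_token(self, token):
        raise NotImplementedError

    def add_token(self, token, spam=0, ham=0):
        raise NotImplementedError


@register_backend("memory")
class MemoryStore(BayesStore):
    """Stand-in backend that keeps (spam, ham) counts in a dict."""

    def __init__(self):
        self._counts = {}

    def get_token(self, token):
        return self._counts.get(token, (0, 0))

    def add_token(self, token, spam=0, ham=0):
        old_spam, old_ham = self.get_token(token)
        self._counts[token] = (old_spam + spam, old_ham + ham)


def make_store(name, **options):
    """Factory: pick the backend class by name and instantiate it."""
    try:
        cls = _BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown Bayes store backend: {name!r}") from None
    return cls(**options)


store = make_store("memory")
store.add_token("viagra", spam=2)
store.add_token("viagra", ham=1)
```

Adding another backend is then a matter of writing one more class and
registering it under a new name; nothing that calls make_store has to
change.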

> DBBasedAddrList.pm    Rules/AddrList.pm
>                               Anybody got a better name for this?
>                       Rules/AddrList/StoreDBM.pm
>                       Rules/AddrList/StoreSQL.pm
>                               The Backends.

This can also be done now.

>                       Store.pm
>                       Store/DBM.pm
>                       Store/SQL.pm
>                               And finally the backends for general storage
>                               access.

I've also been thinking about this.  In general we've got two ways we
access data: key/value pairs and more specialized access (i.e.
BayesStorage).  It shouldn't be too difficult to make a quick and easy
Store module that assumes key/value pairs and stores them in a DBM file
(BTW, what were the downsides to only supporting DB_File and not the
other DBM implementations?) and also in a generic table in a SQL
database.  The table might be something like:

subsystem  username  key                value
conf       parker    use_bayes          1
conf       parker    whitelist_from     [EMAIL PROTECTED]
hashcash   parker    foo                blah

(Sorry, I'm not familiar with hashcash.)  And so on and so forth.  Even
the AWL (or whatever it's renamed to) could use this and not need the
custom SQL implementation, although I think a custom implementation
might be a tad faster.
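
A quick sketch of the generic table idea, again in Python and purely
for illustration: it uses the standard sqlite3 module with an in-memory
database, and the GenericStore class, the table name "store", and the
put/get methods are all my own invention here, not anything that exists
in SA.  It also assumes one value per (subsystem, username, key), so a
setting that can appear more than once would need a different key:

```python
# Hypothetical sketch only: a generic key/value table, names are invented.
import sqlite3


class GenericStore:
    """Key/value storage scoped by subsystem and username."""

    def __init__(self, conn):
        self.conn = conn
        # One row per (subsystem, username, key); this is an assumption of
        # the sketch, not a claim about how any real setting behaves.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS store ("
            " subsystem TEXT NOT NULL,"
            " username  TEXT NOT NULL,"
            " key       TEXT NOT NULL,"
            " value     TEXT,"
            " PRIMARY KEY (subsystem, username, key))"
        )

    def put(self, subsystem, username, key, value):
        """Insert the value, replacing any existing row for the same key."""
        self.conn.execute(
            "INSERT OR REPLACE INTO store (subsystem, username, key, value)"
            " VALUES (?, ?, ?, ?)",
            (subsystem, username, key, value),
        )
        self.conn.commit()

    def get(self, subsystem, username, key, default=None):
        """Return the stored value, or default if there is no such row."""
        row = self.conn.execute(
            "SELECT value FROM store"
            " WHERE subsystem = ? AND username = ? AND key = ?",
            (subsystem, username, key),
        ).fetchone()
        return row[0] if row else default


store = GenericStore(sqlite3.connect(":memory:"))
store.put("conf", "parker", "use_bayes", "1")
store.put("hashcash", "parker", "foo", "blah")
```

The same put/get interface could sit on top of a DBM file by joining
the subsystem, username, and key into a single DBM key, which is what
would make the two backends interchangeable.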

> 
> 
> Hm. I think that was it. Comments, flames, patches? :)
> 

Sorry for the length of the reply.  I will say that I like your ideas
and am more than happy to do what I can to help.

Michael
