Re: [RulesEmporium] RE: development of new rules (was: This ROCKS!)

Daniel Quinlan 4 May 2004 02:30:34 -0000

Robert Menschel <[EMAIL PROTECTED]> writes:

> Yes, new rules used to make their way into CVS quickly, but those
> rules (at least so far) take months to get into the field, because of
> the overhead and other challenges associated with the GA run. SARE
> provides a method whereby rules can be tested and then adopted by
> systems very quickly.


I think you're overestimating how much users want to download unofficial
rule sets.  I've seen some user complaints that this now seems to be
required (at least by some people answering questions on the mailing
list) and that there's too much confusion about which sets to use,
questions about FP rates, etc.  Making users do extra work is uncool.
Making it a prerequisite for running SpamAssassin is even worse, but I'm
concerned that's exactly where we could be headed.

Why are we here?  I'm not sure.  The bar for getting SVN access is not
really all that high (ask Michael and Sidney), but perhaps SARE is so
easy to get rules into that the bar *seems* high.  When I started
SpamAssassin development, I was only submitting rules and rule
improvements (even though I was quite comfortable with Perl) and there
was no alternative if I wanted to contribute.  The reality is that
submitting rules into SVN is much easier than maintaining my own set
could ever be.  Testing happens automatically, I get peer review and
fixes, I don't have to worry about scoring (and I hope to automate
scoring of new rules for automatic updates after 3.0 is out), etc.

> A/The major benefit of the GA run is that the rules get properly and
> reliably scored across the comprehensive corpora. Outside of the dev
> cycle, that can't yet happen. However, with SARE running its
> mass-checks against multiple corpora, we're able to generate reasonable
> rule scores which aren't as good as the GA scores, but are good enough
> for most systems.

Score optimization works better when you score everything at once, not
just new rules.  Scoring only new rules means your FP rate is going to
be significantly higher, especially if you have many of them.

If someone tells me which SARE rules I should use, I could prove this.
:-)
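
A minimal sketch of the overlap problem, assuming hypothetical rule
names, scores, and ham messages (none of these come from SVN or SARE)
and SpamAssassin's default threshold of 5.0.  Each add-on rule looks
harmless when scored by itself, but the scores stack on top of the
existing ones:

```python
THRESHOLD = 5.0  # SpamAssassin's default required score


def total_score(hits, scores):
    """Sum the scores of every rule that hit a message."""
    return sum(scores.get(rule, 0.0) for rule in hits)


def false_positives(ham, scores, threshold=THRESHOLD):
    """Return ids of ham messages that would be flagged as spam."""
    return [msg_id for msg_id, hits in ham.items()
            if total_score(hits, scores) >= threshold]


# Hypothetical ham corpus: message id -> set of rules that hit it.
ham = {
    "ham1": {"OLD_RULE"},
    "ham2": {"OLD_RULE", "NEW_RULE_A", "NEW_RULE_B"},
}

# Hypothetical scores: one existing rule, plus two add-on rules that
# were each scored without looking at what else hits the same mail.
official = {"OLD_RULE": 1.5}
add_on = {"NEW_RULE_A": 2.0, "NEW_RULE_B": 2.0}
combined = {**official, **add_on}

print(false_positives(ham, official))  # existing rules only
print(false_positives(ham, combined))  # existing rules plus add-ons
```

With the existing scores alone neither message reaches 5.0; with the
add-on scores stacked on top, ham2 reaches 5.5 and becomes a false
positive.  Scoring everything at once lets the optimizer see that
overlap and lower the scores accordingly.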

> And then, there's a whole universe of rules which are not included in
> SA's distribution rule set, and which should NOT be included in SA's
> distribution rule set.

Sure, any rule that has too many false positives for some people or
doesn't hit enough spam shouldn't be included.

>>> - there's less QA and only manual scoring of SARE rules
> Agree with the first, and quibble with the second (we actually
> generate most scores automatically now, based on the results of our
> mass-check runs; they aren't as high quality as the scores provided by
> the GA, but they aren't "manual" nor "arbitrary").

Most of the scores seem much higher than they should be.  Tell me which
SARE rules I should be using and I can do a corpus test to prove it.  :-)

>>> - SpamAssassin is not being well-maintained to integrate these rules
>>> efficiently and with low overlap, so speed and efficiency suffer.
> I'm not sure what you mean here. The huge majority of rules we develop
> are extremely simple, phrases or variations on phrases, easily tested
> by regex, and not the sort of thing I'd expect to require any tuning
> you haven't already done.
>
> The exceptions would be rulesets like backhair, weeks, tripwire, where
> I expect we'd be better off with well built eval capabilities rather
> than multiple regexes, but we don't have the ability (yet) to create
> those eval capabilities (or equivalent).

I was thinking specifically of backhair (which is now an eval in
3.0-svn), but there are other examples.

> Actually, part of the reason SARE is growing and strengthening is
> because during the development of version 3.0 the core developers
> needed to concentrate on code changes and not so much on rules. There
> was even a comment to that effect on one or both lists a few months
> ago.

Rule development has never really stopped (if you look at SVN, we have a
lot of new rules that never saw Bugzilla or SARE), even if some of the
developers have been focusing on code.  I've been mostly working on new
rules for most of this year.

> If I understand what you're saying here, this would improve the
> quality of the rules, and would also slow down the release of rule
> updates.

Um, I think you're misunderstanding me.  Daily automated updates are
definitely not slower.  Sure, maybe only an additional rule or two per
day on average would be pushed out, but it adds up.

> What we really need is someone who can work through the current SVN
> rules, compare them to our better SARE rules, and submit those that
> are worthwhile but not yet in the SVN queue. Again, I don't have the
> time for this. Hopefully someone else will.

I'm looking more for people to work directly on SVN.  If it's someone
just adding stuff to the Bugzilla queue, it's just as efficient for one
of the existing developers to poke at SARE on their own (which is how
backhair ended up in SVN).  That's why we're looking for more help.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
