Hello Daniel,

Monday, May 3, 2004, 7:30:15 PM, you wrote:
>> Yes, new rules used to make their way into CVS quickly, but those
>> rules (at least so far) take months to get into the field, because of
>> the overhead and other challenges associated with the GA run. SARE
>> provides a method whereby rules can be tested and then adopted by
>> systems very quickly.

DQ> I think you're overestimating how much users want to download
DQ> unofficial rule sets. ...

Actually, I think I've been /underestimating/ how many systems actually download our unofficial rule sets. SARE started as a small number of people who were developing rules for their own use, and who were sharing them through private web pages and/or the exit0.us wiki, just as William Stearns shares the blacklist compilation he manages. After a short while we decided to share them with the community in a slightly more formal manner, but "more formal" still doesn't actually reach "formal." Even with the exit0.us wiki, I expected just a dozen or two people/systems outside SARE itself to use these fly-by-the-seat-of-the-pants rules. Instead, as you explore below, so many people are now using the new rule sets that our informality has been causing problems for some of them.

DQ> I've seen some user complaints that this now seems to be
DQ> required (at least by some people answering questions on the mailing
DQ> list) and that there's too much confusion about which sets to use,
DQ> questions about FP rates, etc. Making users do extra work is uncool.
DQ> Making it a prerequisite for running SpamAssassin is even worse, but I'm
DQ> concerned that's exactly where we could be headed.

I don't see that as our destination (intended or not), but I can understand why some people feel the SARE sets are becoming required. SpamAssassin 2.4x through 2.6x with network checks and a decently trained Bayes database caught 80% to 95% of all spam when they first came out. That was and is quite satisfactory -- it means that only a few spam messages managed to reach user inboxes unflagged.
A major problem, however, is that spammers adapt to SA faster than SA can adapt to spammers, because of the long cycle between releases. As a result, an unmodified vanilla installation of SA loses accuracy over time. I think Chris Santerre was the first to push a production SA above the 99% accuracy mark, and he did that with 2.4x while most of us were working with 2.5x. I was able to push SA 2.5x as high as 95% by adjusting scores in the distribution rule set and by adopting William Stearns' blacklists, but I didn't hit 99% until I started developing some very powerful rules of my own -- many of which are domain specific, many of which have been submitted to SA and/or SARE, and many of which are not domain specific but also not viable SARE candidates.

I've been using SA on three domains for a year now. In that year, the amount of spam reaching my domains has doubled. Even at an accuracy rate of 99.8%, I get almost a dozen FNs each week, and mine are very low volume systems. If 80% to 90% spam reduction is sufficient, then 2.6x with network and Bayes tests is sufficient as it stands (especially with the SURBL enhancement). If a system needs 95% or better spam reduction, it currently needs SARE's help.

Agreed, there's too much confusion about which rule sets to use. There's confusion about which rules are "safer" or more conservative, and which are "riskier" or more aggressive. We're making some progress on that, but the progress is admittedly experimental.

DQ> Why are we here? I'm not sure. The benchmark to get SVN access is not
DQ> really all that high (ask Michael and Sidney), but perhaps SARE is so
DQ> easy to get rules into that it *seems* high. ...

Actually, it's not getting SVN access that concerns me; it's integrating that SVN service with production email systems that must necessarily run stable, officially released versions of SA.
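To put those false-negative counts in perspective, the implied volumes can be sketched quickly (the weekly spam total below is inferred from the 99.8% figure and "almost a dozen FNs each week"; it is not stated directly in the thread):

```python
# Back-of-the-envelope check of the false-negative numbers above.
# The weekly spam volume is an inference, not a figure from the thread.

def weekly_false_negatives(spam_per_week: float, catch_rate: float) -> float:
    """Spam messages that slip through per week at a given catch rate."""
    return spam_per_week * (1.0 - catch_rate)

# At 99.8% accuracy, roughly 12 FNs per week implies about
# 12 / 0.002 = 6000 spam messages per week across the three domains.
implied_volume = 12 / (1 - 0.998)

# If spam volume doubles while the catch rate stays fixed,
# the number of FNs doubles with it.
fn_now = weekly_false_negatives(6000, 0.998)     # roughly 12 per week
fn_later = weekly_false_negatives(12000, 0.998)  # roughly 24 per week
```

The point of the arithmetic: a fixed accuracy percentage does not mean a fixed number of misses; FNs scale linearly with incoming volume.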
To keep my systems at 99.8% I need to add new rules regularly, and I need to test/verify those rules against the same production version my production systems run. And I agree that it's not difficult to get rules into SVN, through Bugzilla if nothing else. It's not the alleged difficulty of SVN that is significant, but rather the delay between submission, rule verification, and the final release of the next SA. My "longwords" rules were submitted to SVN over a month ago. I'm sure they've been improved on, and they will be a great help to everyone when released with 3.0. For that same month, however, they've been a) unavailable for production use via SA, and b) flagging thousands of spam here, and maybe dozens or hundreds of thousands of spam through SARE distribution.

Monday, May 3, 2004, 7:24:51 PM, Justin stated:

JM> I think that the SARE ruleset is probably the best "first deployment"
JM> area for rules -- but I also think that getting some of those rules
JM> into the SpamAssassin distro would be nice ;)

and I believe SARE (or an equivalent) will be the path of "first deployment" as long as the SVN path delays formal distribution. If under 3.0 there will be a method by which the SVN path leads to reasonably speedy distribution of qualified rules, as you suggest (and I'm hoping you'll be able to beat even SARE's speed of distribution), then SARE will fall further into the background.

DQ> ... The reality is that submitting rules into SVN is much easier than
DQ> maintaining my own set could ever be. Testing happens automatically,
DQ> I get peer review and fixes, I don't have to worry about scoring (and
DQ> I hope to automate scoring of new rules for automatic updates after
DQ> 3.0 is out), etc.

If you can use SVN for your email systems, that works great. I can't afford to use SVN in production, and I can't afford to wait months between releases.
If the release cycle speeds up, or distribution rules can be released more frequently under 3.0, then yes, that will be a big benefit, and it will lessen the dependence so many systems have developed on SARE.

>> A/The major benefit of the GA run is that the rules get properly and
>> reliably scored across the comprehensive corpora. Outside of the dev
>> cycle, that can't yet happen. However, with SARE running its
>> mass-checks against multiple corpora, we're able to generate reasonable
>> rule scores which aren't as good as the GA scores, but are good enough
>> for most systems.

DQ> Score optimization works better when you score everything at once, not
DQ> just new rules. Scoring only new rules means your FP rate is going to
DQ> be significantly higher, especially if you have many of them.
DQ> If someone tells me which SARE rules I should use, I could prove this.
DQ> :-)

Very definitely. I agree with all of this. If it's easy for you to test SARE rulesets through the SVN process, then I'd be VERY interested in such tests against these two ruleset files:

http://www.rulesemporium.com/rules/70_sare_genlsubj0.cf
http://www.rulesemporium.com/rules/70_sare_genlsubj2.cf

They hit NO ham during any SARE testing, against several corpora. If they hit any ham during SVN testing, then I definitely want to revise their scores accordingly.

Note that some of the rules in http://www.rulesemporium.com/rules/70_sare_genlsubj0.cf may be suitable for inclusion in the distribution rule set. According to our tests, 35 of them each hit 0.1% or more of all spam, and no ham. The rest drop off quickly, from those that hit a few dozen spam down to those that hit just one or two. Those are obviously NOT appropriate for the distribution set, but they are rules that the more aggressive anti-spam systems find useful.
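The promotion criterion described above (a rule hits at least 0.1% of all corpus spam and zero ham) is mechanical enough to script. A minimal sketch, assuming per-rule mass-check tallies are available; the rule names and counts here are invented for illustration:

```python
# Sketch of the promotion filter described above: a rule is a candidate
# for the distribution set if it hits >= 0.1% of all corpus spam and
# hits zero ham. All names and numbers are hypothetical.

SPAM_TOTAL = 100_000  # total spam messages across the test corpora

# (spam hits, ham hits) per rule, from hypothetical mass-check runs
results = {
    "SARE_SUB_EXAMPLE_A": (450, 0),  # 0.45% of spam, no ham: candidate
    "SARE_SUB_EXAMPLE_B": (30, 0),   # too rare for the distribution set
    "SARE_SUB_EXAMPLE_C": (900, 2),  # hits ham, so it needs rescoring
}

def distro_candidates(results, spam_total, min_fraction=0.001):
    """Rules that meet the 0.1%-of-spam, zero-ham bar."""
    return sorted(
        name
        for name, (spam_hits, ham_hits) in results.items()
        if ham_hits == 0 and spam_hits / spam_total >= min_fraction
    )

print(distro_candidates(results, SPAM_TOTAL))  # ['SARE_SUB_EXAMPLE_A']
```

Rules that fail the volume bar but still hit no ham are the ones the aggressive sets keep; rules with any ham hits go back for rescoring.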
>>>> - there's less QA and only manual scoring of SARE rules

>> Agree with the first, and quibble with the second (we actually
>> generate most scores automatically now, based on the results of our
>> mass-check runs; they aren't as high quality as the scores provided by
>> the GA, but they aren't "manual" nor "arbitrary").

DQ> Most of the scores seem much higher than they should be. Tell me which
DQ> SARE rules I should be using and I can do a corpus test to prove it. :-)

I agree. Maybe not with the sets above, but you should be able to demonstrate this with http://www.rulesemporium.com/rules/70_sare_genlsubj3.cf (if you add this ruleset, as scored, to the full SVN rule set and don't get some FP somewhere, I'd be surprised, even though I think the scores calculated for that rule set are conservative).

The point for everyone to remember (and we probably don't say this enough on the lists, the website, and the rule set documentation): SARE is composed of people who are aggressive anti-spam fighters. Scores that are conservative to us may be overly aggressive for others. We also suffer from the problems you mentioned about "local" testing -- even though we use multiple corpora, they're still *our* corpora. Just because I have not seen any FP in over a month does not mean that another site will avoid FPs using the same rules and the same scores.

(For that matter, I run SA with a required_hits of 9. This allows me to be simultaneously more conservative and more aggressive than many sites. To maintain SA's power I've had to increase the scores of many distribution rules, but /only/ some of them. And I've had to lower the scores of only a few distribution rules, and by less than I'd have needed at 5.0. Finally, with a required_hits of 9, I have a lot more room for error without creating FPs.
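To make that headroom argument concrete, a quick sketch; the ham score used here is invented for illustration:

```python
# Illustration of the required_hits margin argument above. A message's
# "margin for error" is the gap between the spam threshold and the
# score a ham message has already innocently accumulated.

def margin(required_hits: float, ham_score: float) -> float:
    return required_hits - ham_score

# A lengthy ham message that happens to trip rules worth 2.5 points
# (a hypothetical figure):
at_nine = margin(9.0, 2.5)  # 6.5 points of headroom before an FP
at_five = margin(5.0, 2.5)  # only 2.5 points of headroom

# With no ham hits at all, the headroom ratio is simply 5/9:
ratio = 5.0 / 9.0  # about 0.56
```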
Even though my SARE contributions are scored down to a 5.0 standard, applying those rules within a 5.0 system gives less room for error, and increases the chance that multiple rule matches in lengthy ham will generate FPs. A 5.0 system has only 5/9 as much margin for error as I have.)

>> Actually, part of the reason SARE is growing and strengthening is
>> because during the development of version 3.0 the core developers
>> needed to concentrate on code changes and not so much on rules. There
>> was even a comment to that effect on one or both lists a few months
>> ago.

DQ> Rule development has never really stopped (if you look at SVN, we have a
DQ> lot of new rules that never saw Bugzilla or SARE), even if some of the
DQ> developers have been focusing on code. I've been mostly working on new
DQ> rules for most of this year.

I'm very glad to hear that, and I look forward to benefiting from your work (and others').

>> If I understand what you're saying here, this would improve the
>> quality of the rules, and would also slow down the release of rule
>> updates.

DQ> Um, I think you're misunderestanding me. Daily automated updates is
DQ> definitely not slower. Sure, maybe only an average of an additional
DQ> rule or two per day might be pushed out, but it adds up.

This makes it sound like under 3.0 there will be the ability to provide daily, automated updates of SVN-validated and SVN-scored rules. If so, that will be FANTASTIC! It doesn't solve my problem of needing to stay in sync with my production systems -- when work begins on SA 3.1, the daily automated updates provided to the world will need to be limited to those that work on 3.0, or I won't be able to use them. Is that planned to be part of the system?

>> What we really need is someone who can work through the current SVN
>> rules, compare them to our better SARE rules, and submit those that
>> are worthwhile but not yet in the SVN queue. Again, I don't have the
>> time for this.
>> Hopefully someone else will.

DQ> I'm looking more for people to work directly on SVN. If it's someone
DQ> just adding stuff to the bugzilla queue, it's just as efficient for one
DQ> of the existing developers to poke at SARE on their own (which is how
DQ> backhair ended up in SVN), and this is why we're looking for more help.

I understand, and I hope you'll get people to work on this.

Bob Menschel
