** Phil, if Daniel replies to the SARE list, let it thru ;) **

>-----Original Message-----
>From: Daniel Quinlan [mailto:[EMAIL PROTECTED]
>Sent: Monday, May 03, 2004 5:08 PM
>To: [EMAIL PROTECTED]
>Cc: Chris Santerre; Daniel Quinlan
>Subject: development of new rules (was: This ROCKS!)
>
>
>[ removed cross-posting to SURBL list ]
>
>Chris Santerre <[EMAIL PROTECTED]> writes:
>
>> Sort of. I didn't know you guys did that nightly. Very nice. I'm
>> looking for a more localized process that doesn't require a run of
>> anything. Having counters generated means just checking totals. And
>> even if one didn't use the exact same rules as another, they could
>> easily combine totals for the ones they do.
>
>One of the problems with developing entirely on local messages is that
>you inevitably end up with a huge bias from your corpus.  First, in
>development of the rules and second in testing them.  One of the nice
>things about the nightly run is that you get graded on how well your
>rule works on other corpora that you don't have the ability to tune
>against, at least not easily.

I see your point. We are both looking at this from different sides. We use
corpora from 3-4 different SARE people with some wide differences in corpus
type. But the localization of the rule hit counts was to look at a couple of
things. One being that admins may simply not need a certain ruleset. If
there aren't enough hits to justify the ruleset, then why keep the overhead?
Giving them local counts helps with that.

There are some meta rules using regular rules we want to look at. Every
ruleset starts with local corpora, then moves on to the groups. Having to grep
something or run a GA is a pain. Simply looking at a total number in a flat
file is much easier :)
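
To make the flat-file idea concrete, here is a rough Python sketch of
combining hit totals from several sites. It is only an illustration: the
"RULE_NAME count" one-per-line file format, the function names, and the
SARE_EXAMPLE_* rule names are all made up for the example, not anything
SARE or SpamAssassin actually ships.

```python
from collections import Counter


def parse_counts(text):
    """Parse 'RULE_NAME count' lines (a hypothetical flat-file format)."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.split()
        counts[name] += int(value)
    return counts


def merge_counts(per_site):
    """Add up hit totals from several sites; rules need not overlap."""
    total = Counter()
    for counts in per_site:
        total.update(counts)  # Counter.update adds counts together
    return total


def rules_below(counts, rule_names, min_hits):
    """Return the rules whose total hits fall under min_hits."""
    return sorted(name for name in rule_names if counts[name] < min_hits)


# Made-up example data from two sites with partly different rulesets.
site_a = parse_counts("SARE_EXAMPLE_A 120\nSARE_EXAMPLE_B 3\n")
site_b = parse_counts("SARE_EXAMPLE_A 80\nSARE_EXAMPLE_C 15\n")
merged = merge_counts([site_a, site_b])
quiet = rules_below(merged, ["SARE_EXAMPLE_A", "SARE_EXAMPLE_B", "SARE_EXAMPLE_C"], 10)
```

An admin could run rules_below() on local counts alone to see which rules
rarely fire, or on the merged totals to compare against the group.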


>
>Sometimes, there is a bug that needs to be fixed, so I do ask for FPs
>and FNs from mass-check submitters from time to time, but if you end up
>"fixing" rules via exceptions (especially more than one or two), then
>the rule is probably not going to be stable once you get outside of the
>larger test set.

LOL, trust me. Any member of SARE is going to agree with that statement ;)
But our corpora are getting pretty darn big and diverse!

>
>>> We've been doing this for well over a year and it works great.  If only
>>> we had more active developers working on rules...
>
>> I'm not quite sure how to take that last line. 
>
>We need active rule developers.
>
>New rules used to make their way into CVS relatively quickly because
>that was the only place for them to go.  SARE is making very nice
>strides in developing new rules, but those rules aren't being integrated
>into SpamAssassin quickly at all and everyone is suffering.
>
> - it's more work for users
> - there's less QA and only manual scoring of SARE rules
> - SpamAssassin is not being well-maintained to integrate these rules
>   efficiently and with low overlap, so speed and efficiency suffer.
>
>I'm not saying that I want SARE to go away!  SARE does a better job
>tracking new rule sets than was possible before, but we need to avoid
>falling to a non-optimal pattern of where effort is going.  Developers
>come and go and we've maintained a strong core team for the Perl code in
>SpamAssassin, but the number of people actively working on rules is
>lower now (since January, about 2/3 of SA 3.0 test rule work is the work
>of one person, 94% is two people).
>
>What I think would work better and what I'd like to see:
>
> - Some of the experienced SARE developers also become SpamAssassin
>   developers (with commit access soon enough) so that the best rules
>   are quickly integrated into the SVN tree.
> - Use (and further development) of the infrastructure of the
>   SpamAssassin project to ship rule updates for existing SpamAssassin
>   releases using SARE rules.
>
>and the big one:
>
> - Shift from using maintenance releases for rule updates to automated
>   official rule updates for stable SpamAssassin releases (think: cron
>   job that you can trust).
>
>   - There are a number of killer rules in SA 3.0 SVN that have been
>     through extensive QA and would require minimum development to test.
>     Those could have been deployed in general-release quality for 2.6x,
>     I'd like to see something set up now for 3.0 SVN.
>   - The perceptron is also fast to run, so with a bit of work to make
>     it easier to run (and especially if we can get rid of score sets):
>     - we can use it to generate scores for new rules
>     - and eventually, all scores can also be updated regularly
>   - In addition, the plug-in architecture of SA 3.0 will make it
>     somewhat more feasible to do automated updates for non-trivial
>     rules, so now is the time.
>

OK, I didn't want to break up those comments. I've cc'd this to the secret
SARE list. The ninjas haven't talked about this for a while. We had been
using only 2 members, I think, to submit rules for official releases: Fred
and Robert. We just figured "Why have more than 2? This way we won't get
confused with who submits what."

We'd like nothing better than to submit all our high scoring rules to become
official SA rules. I've been reluctant to get involved with the devs for a
few reasons:
1) I didn't want to bother you guys.
2) I kind of kept SARE separate to not seem like I'm bullying into dev
territory
3) I never expected SARE to turn into the cool thing it is now with all
these people smarter than me :)

I'll *gulp* join the dev list and try to get SARE more active with you guys.
Less work and duplication is better for all.

Dang, security calling me, got to go, but I'll catch up later.

--Chris
