Hello Chris, Dan, etc.,

Monday, May 3, 2004, 2:31:36 PM, you wrote:

>>We need active rule developers.
>>
>>New rules used to make their way into CVS relatively quickly because
>>that was the only place for them to go. SARE is making very nice
>>strides in developing new rules, those rules aren't being integrated
>>into SpamAssassin quickly at all and everyone is suffering.

There are tradeoffs to be considered. Yes, new rules used to make their way into CVS quickly, but those rules (at least so far) take months to get into the field, because of the overhead and other challenges associated with the GA run. SARE provides a method whereby rules can be tested and then adopted by systems very quickly.

A major benefit of the GA run is that the rules get properly and reliably scored across the comprehensive corpora. Outside of the dev cycle, that can't yet happen. However, with SARE running its mass-checks against multiple corpora, we're able to generate reasonable rule scores which aren't as good as the GA scores, but are good enough for most systems.

And then, there's a whole universe of rules which are not included in SA's distribution rule set, and which should NOT be included in SA's distribution rule set. An example from my 70_sare_genlsubj3.cf:

# OVERALL%  SPAM%   HAM%    S/O    RANK  SCORE  NAME
#   2.674   3.1530  0.5427  0.853  0.61  0.50   SARE_SUB_ONLINE

This rule hits 3% of all spam here, but also 0.5% of ham, and so its S/O is too low to warrant inclusion in the distribution set at this time. It's not a rule that would be appropriate for all or possibly most systems. But those that can be a little more aggressive can benefit from this rule. This is one type of rule that I see as having a permanent home in SARE.

>> - it's more work for users

Agreed.

>> - there's less QA and only manual scoring of SARE rules

Agree with the first, and quibble with the second (we actually generate most scores automatically now, based on the results of our mass-check runs; they aren't as high quality as the scores provided by the GA, but they are neither "manual" nor "arbitrary").

>> - SpamAssassin is not being well-maintained to integrate these rules
>>   efficiently and with low overlap, so speed and efficiency suffer.

I'm not sure what you mean here. The huge majority of rules we develop are extremely simple, phrases or variations on phrases, easily tested by regex, and not the sort of thing I'd expect to require any tuning you haven't already done. The exceptions would be rulesets like backhair, weeks, tripwire, where I expect we'd be better off with well-built eval capabilities rather than multiple regexes, but we don't have the ability (yet) to create those eval capabilities (or equivalent).

>>I'm not saying that I want SARE to go away! SARE does a better job
>>tracking new rule sets than was possible before, but we need to avoid
>>falling to a non-optimal pattern of where effort is going. Developers
>>come and go and we've maintained a strong core team for the Perl code
>>in SpamAssassin, but the number of people actively working on rules is
>>lower now (since January, about 2/3 of SA 3.0 test rule work is the
>>work of one person, 94% is two people).

Actually, part of the reason SARE is growing and strengthening is that during the development of version 3.0 the core developers needed to concentrate on code changes and not so much on rules. There was even a comment to that effect on one or both lists a few months ago.

Justin posted a request for rules volunteers a couple of months back, and I was really tempted to step up for that, but I just don't have the time. Anti-spam is a sideline for me, and my primary activities demand that I don't put any more time into SA than I already do.
Hopefully one or two others in SARE who can call this part of their actual job (or who have more spare time than I do) can step forward.

>>What I think would work better and what I'd like to see:
>>
>> - Some of the experienced SARE developers also become SpamAssassin
>>   developers (with commit access soon enough) so that the best rules
>>   are quickly integrated into the SVN tree.

That would be good.

>> - Use (and further development) of the infrastructure of the
>>   SpamAssassin project to ship rule updates for existing SpamAssassin
>>   releases using SARE rules.

That would be good.

>>
>>and the big one:
>>
>> - Shift from using maintenance releases for rule updates to automated
>>   official rule updates for stable SpamAssassin releases (think: cron
>>   job that you can trust).

If I understand what you're saying here, this would improve the quality of the rules, and would also slow down the release of rule updates. I think there's a place for the quick supply of rules which aren't quite GA quality, but have been tested through multiple corpora and are reasonably scored.

Yes, I would like to see our best rules folded into formal SA releases (and we're developing naming conventions for rules and files to try to support this), but we need to maintain the ability to get rules quickly to that section of the user community that wants them.

>> - There are a number of killer rules in SA 3.0 SVN that have been
>>   through extensive QA and would require minimum development to test.
>>   Those could have been deployed in general-release quality for 2.6x,
>>   I'd like to see something set up now for 3.0 SVN.

This can work both ways. Not only can we feed the best rules to SVN, but those rules within SVN which don't require new code can be released through SARE, making them available to those who run 2.6x or 2.5x. I'd like to see this happen.

(I agree that people should upgrade to the newest official release of SA as soon as possible, but there are people like me who don't have control over that upgrade, and others who cannot upgrade for various organizational reasons. I think it's a good thing that we're able to improve SA performance for older systems as well as for the newest systems.)

>> - The perceptron is also fast to run, so with a bit of work to make
>>   it easier to run (and especially if we can get rid of score sets):
>>    - we can use it to generate scores for new rules
>>    - and eventually, all scores can also be updated regularly

I'm looking forward to this. I need to keep my system here at the same SA level as my production mail systems, so I haven't done more than download an occasional 3.0 copy to look at its rules definitions. I'll be able to move to 3.0 only after my production systems move there, and I have no control over that timing.

The main reason I haven't done more with my automatic scoring capability than I have is that I'm hoping the perceptron under 3.0 will make my current system obsolete.

>> - In addition, the plug-in architecture of SA 3.0 will make it
>>   somewhat more feasible to do automated updates for non-trivial
>>   rules, so now is the time.

Yes, that also will help.

CS> I'll *gulp* join the dev list and try to get SARE more active it you guys.
CS> Less work and duplication is better for all.

Hurrah! What we really need is someone who can work through the current SVN rules, compare them to our better SARE rules, and submit those that are worthwhile but not yet in the SVN queue. Again, I don't have the time for this. Hopefully someone else will.

Bob Menschel
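
For anyone who wants to check the S/O figure quoted for SARE_SUB_ONLINE, here is a minimal sketch. It assumes S/O is the rule's spam hit percentage divided by the sum of its spam and ham hit percentages, which reproduces the 0.853 in the table. The function name so_ratio and the zero-hit convention are hypothetical, chosen only for illustration; this is not SARE's scoring code.

```python
def so_ratio(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hits that land on spam, from per-corpus hit percentages.

    Assumption: S/O = SPAM% / (SPAM% + HAM%).
    """
    total = spam_pct + ham_pct
    if total == 0:
        # Hypothetical convention: a rule that never fires gets 0.0.
        return 0.0
    return spam_pct / total


# SPAM% and HAM% for SARE_SUB_ONLINE, taken from the table above.
print(round(so_ratio(3.1530, 0.5427), 3))  # 0.853
```

Read that way, roughly 15% of this rule's percentage-weighted hits are on ham, which is why it suits only the more aggressive systems.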
