Loren Wilton wrote:
Well, it looks like I need to spend some time reading the code to study exactly how SA runs rules, and see if it's doing something that pollutes the memory cache, which would cause the over-sorting not to matter.

As best I recall, it runs rules by type, sorted by priority within type. There is also code to resolve meta-rule chaining order; I don't recall having looked at that code since Justin wrote it.
I read the code; it runs by priority, then by type within each priority.

The first loop can be found in Check.pm, sub check_main.

foreach my $priority (sort { $a <=> $b } keys %{$pms->{conf}->{priorities}}) {

and within that loop, after some code for DNSBL handling, you've got:

   $self->do_head_tests($pms, $priority);
   $self->do_head_eval_tests($pms, $priority);

   $self->do_body_tests($pms, $priority, $decoded);
   $self->do_uri_tests($pms, $priority, @uris);
.....
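The loop structure above can be modeled as a short sketch (Python here purely for illustration; the real code is Perl in Check.pm, and the type list below is abbreviated): priority is the outermost loop, so every rule type at one priority finishes before the next priority starts.

```python
# Illustrative model, NOT SpamAssassin's actual code: priority is the
# outer loop, rule type the inner loop, so all types at priority -1000
# run before anything at priority -900.
def run_rules(rules_by_priority_and_type):
    for priority in sorted(rules_by_priority_and_type):        # numeric sort
        by_type = rules_by_priority_and_type[priority]
        for rule_type in ("head", "head_eval", "body", "uri"): # fixed type order
            for rule in by_type.get(rule_type, []):
                yield priority, rule_type, rule

# Hypothetical rules: a body rule at -900 and head/body rules at -1000.
rules = {
    -900: {"body": ["B1"]},
    -1000: {"head": ["H1"], "body": ["B0"]},
}
order = list(run_rules(rules))
# Every -1000 rule precedes the -900 rule, regardless of rule type.
```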



Since it runs rules by type, I don't think it's guaranteed that a -1000 rule will run before a -900 rule if they aren't the same rule type. (Maybe it is; I'd have to look at the code again. From what I remember, that wouldn't have been guaranteed.)
It's guaranteed.

There is (at least in theory) a cache advantage to doing things like running all the head tests and then all the body tests, rather than some of each. OTOH, both head and body are probably in memory, and the headers are generally not huge. The body of course may be even smaller on many spams. So I'm not *convinced* that the cache locality argument will hold up under actual testing, albeit the theory sounds good.
Well, I was thinking about performance on large messages, which comes down to how it handles "lines" in the body. (Even though linewraps are removed for "body" rules, SA does break the body up into largish chunks the code calls lines.)

This part of the code runs the entire body through one rule at a time, not all the rules over each "line" at a time.
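The difference in traversal order can be sketched like this (Python for illustration; rule names and chunk contents are invented). The current behavior is "rule-major", re-walking the whole body once per rule; a "chunk-major" order would instead run every rule against one chunk while it is still hot in cache.

```python
# Hypothetical illustration of the two traversal orders over body "lines"
# (the largish chunks the SA code calls lines). All names are made up.
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
rules = {"R1": "one", "R2": "three"}  # rule name -> substring to match

def rule_major(chunks, rules):
    # Current behavior: run the entire body through one rule at a time.
    hits = []
    for name, pattern in rules.items():
        for chunk in chunks:              # whole body re-walked per rule
            if pattern in chunk:
                hits.append(name)
    return sorted(hits)

def chunk_major(chunks, rules):
    # Alternative: run every rule against each chunk while it is cache-hot.
    hits = []
    for chunk in chunks:
        for name, pattern in rules.items():
            if pattern in chunk:
                hits.append(name)
    return sorted(hits)
```

Both orders find the same hits; only the memory-access pattern over the message data differs.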


What is useful is starting the net tests as early as possible, and harvesting them as late as possible.
It already does that. Or at least it harvests them late.
However, net tests can be started early regardless of priority or short-circuiting, with (probably) minimal performance loss. If you decide the case before all the net results arrive, you just ignore the stragglers.
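A minimal sketch of that "start early, harvest late" idea, assuming hypothetical rule names and a stand-in lookup function (this is not SA's async DNS code, just a model of the control flow): the lookups are launched immediately, local scoring runs in the meantime, and if the verdict is decided before the net results arrive, the stragglers are simply ignored.

```python
# Sketch, assuming invented names: launch net lookups up front, harvest
# their results as late as possible, and ignore them entirely if the
# message is already decided.
from concurrent.futures import ThreadPoolExecutor

def slow_dnsbl_lookup(rule):
    # Stand-in for a real DNSBL query; returns (rule, score contribution).
    return (rule, 1.0)

def scan(net_rules, local_score, threshold):
    with ThreadPoolExecutor() as pool:
        # Start every net test immediately, before local rules finish.
        futures = [pool.submit(slow_dnsbl_lookup, r) for r in net_rules]
        score = local_score                  # local rules run meanwhile
        if score >= threshold:
            return score, "decided-early"    # straggler results ignored
        for f in futures:                    # harvest as late as possible
            rule, s = f.result()
            score += s
            if score >= threshold:
                break
        return score, "decided-late"
```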

I would not be terribly surprised to find out that on average there was no appreciable difference in running all rules of all types in priority order, over the current method; at least if this didn't push a lot of net rule mandatory result checking too early.
Of course it resulted in no difference. The code as it stands makes zero effort to take advantage of cache locality at all. Well, I guess you could say it's maximizing locality of the rule code, while minimizing locality of the message data.
And even if that happened, it would slow throughput per item, but it wouldn't necessarily increase processor overhead. Indeed, it might in some cases reduce processor overhead.

Doing something like what you did, assigning a priority to every rule that doesn't already have one (based on the score, pretty much in the order the OP suggested), then sorting the rules by priority regardless of rule type and running all of them that way, will I *suspect* perform about the same as the current algorithm.
Well, that's roughly what my test did. The code doesn't group by type.
A reduction in performance would, I suspect, most likely come from the code having to switch on rule type for each rule it runs. There are probably clever tricks (like the current eval-ed compiled procedures) that would eliminate this switch overhead.
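The score-to-priority assignment being discussed could look something like this sketch (Python for illustration; the bucket boundaries and rule names are invented, not SA's): rules with no explicit priority get one derived from their score, and then everything sorts into one flat priority-ordered list with type ignored.

```python
# Sketch with invented bucket boundaries: derive a default priority from
# a rule's score, then sort all rules into one flat list regardless of type.
def default_priority(score):
    # Hypothetical mapping: strongly negative (ham-indicating) rules run
    # first, high-scoring spam rules next, weak rules last.
    if score <= -1:
        return -100
    if score >= 3:
        return 0
    return 100

def order_rules(rules):
    # rules: list of (name, rule_type, score, explicit_priority_or_None)
    keyed = [(p if p is not None else default_priority(s), n)
             for n, t, s, p in rules]
    return [n for _, n in sorted(keyed)]   # flat ordering, type ignored
```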

This still doesn't necessarily check for bailing on score. But note that short-circuiting is already present. I think it is based on a 'short-circuit rule' hitting rather than a score comparison. But it is still potentially a per-rule bailout test. An extra numeric comparison after each rule that evaluates true would likely be trivial compared to the other tracking that is done for rules that hit.
Agreed, or you could do it once per priority. That would give you flexibility to control how often the check is actually made.
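The per-priority variant can be sketched as follows (Python, with an invented threshold and tiers; not SA's short-circuit implementation): rather than comparing the running score after every rule hit, compare once after each priority tier completes and stop if the verdict is already decided.

```python
# Sketch of a per-priority score bailout (threshold and tiers invented):
# one comparison per priority tier instead of one per rule hit.
def run_with_tier_bailout(tiers, threshold):
    # tiers: dict mapping priority -> list of (rule_name, score)
    total, fired = 0.0, []
    for priority in sorted(tiers):
        for name, score in tiers[priority]:
            total += score
            fired.append(name)
        if total >= threshold:     # check once after the whole tier
            break                  # later tiers never run
    return total, fired
```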
