Loren Wilton wrote:
Well, it looks like I need to spend some time reading the code to study exactly how SA runs rules, and see if it's doing something that pollutes the memory cache, which would cause the over-sorting not to matter.

As best I recall, it runs rules by type, sorted by priority within type. There is also code to resolve meta-rule chaining order; I don't recall having looked at that code since Justin wrote it.
I read the code; it runs by priority, then by type within each priority.

The first loop can be found in Check.pm, sub check_main.

foreach my $priority (sort { $a <=> $b } keys %{$pms->{conf}->{priorities}}) {

and within that loop, after some code for DNSBL handling, you've got:

   $self->do_head_tests($pms, $priority);
   $self->do_head_eval_tests($pms, $priority);

   $self->do_body_tests($pms, $priority, $decoded);
   $self->do_uri_tests($pms, $priority, @uris);
.....
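The loop structure above can be modeled as a short sketch (Python here purely for illustration; the real code is Perl in Check.pm, and the type list below is abbreviated): priority is the outermost loop, so every rule type at one priority finishes before the next priority starts.

```python
# Illustrative model, NOT SpamAssassin's actual code: priority is the
# outer loop, rule type the inner loop, so all types at priority -1000
# run before anything at priority -900.
def run_rules(rules_by_priority_and_type):
    for priority in sorted(rules_by_priority_and_type):        # numeric sort
        by_type = rules_by_priority_and_type[priority]
        for rule_type in ("head", "head_eval", "body", "uri"): # fixed type order
            for rule in by_type.get(rule_type, []):
                yield priority, rule_type, rule

# Hypothetical rules: a body rule at -900 and head/body rules at -1000.
rules = {
    -900: {"body": ["B1"]},
    -1000: {"head": ["H1"], "body": ["B0"]},
}
order = list(run_rules(rules))
# Every -1000 rule precedes the -900 rule, regardless of rule type.
```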



Since it runs rules by type, I don't think it's guaranteed that a -1000 rule will run before a -900 rule if they aren't the same rule type. (Maybe it is; I'd have to look at the code again. From what I remember, that wouldn't have been guaranteed.)
It's guaranteed.

There is (at least in theory) a cache advantage to doing things like running all the head tests and then all the body tests, rather than some of each. OTOH, both head and body are probably in memory, and the headers are generally not huge. The body of course may be even smaller on many spams. So I'm not *convinced* that the cache locality argument will hold up under actual testing, albeit the theory sounds good.
Well, I was thinking about performance on large messages, which comes down to how it handles "lines" in the body. (Even though linewraps are removed for "body" rules, SA does break the body up into largish chunks the code calls lines.)

This part of the code runs the entire body through one rule at a time, not all the rules over each "line" at a time.
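The difference in traversal order can be sketched like this (Python for illustration; rule names and chunk contents are invented). The current behavior is "rule-major", re-walking the whole body once per rule; a "chunk-major" order would instead run every rule against one chunk while it is still hot in cache.

```python
# Hypothetical illustration of the two traversal orders over body "lines"
# (the largish chunks the SA code calls lines). All names are made up.
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
rules = {"R1": "one", "R2": "three"}  # rule name -> substring to match

def rule_major(chunks, rules):
    # Current behavior: run the entire body through one rule at a time.
    hits = []
    for name, pattern in rules.items():
        for chunk in chunks:              # whole body re-walked per rule
            if pattern in chunk:
                hits.append(name)
    return sorted(hits)

def chunk_major(chunks, rules):
    # Alternative: run every rule against each chunk while it is cache-hot.
    hits = []
    for chunk in chunks:
        for name, pattern in rules.items():
            if pattern in chunk:
                hits.append(name)
    return sorted(hits)
```

Both orders find the same hits; only the memory-access pattern over the message data differs.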


What is useful is starting the net tests as early as possible, and harvesting them as late as possible.
It already does that. Or at least it harvests them late.
However, net tests can be started early regardless of priority or short-circuiting, with (probably) minimal performance loss. If you decide the case before all the net results arrive, you just ignore the stragglers.
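A minimal sketch of that "start early, harvest late" idea, assuming hypothetical rule names and a stand-in lookup function (this is not SA's async DNS code, just a model of the control flow): the lookups are launched immediately, local scoring runs in the meantime, and if the verdict is decided before the net results arrive, the stragglers are simply ignored.

```python
# Sketch, assuming invented names: launch net lookups up front, harvest
# their results as late as possible, and ignore them entirely if the
# message is already decided.
from concurrent.futures import ThreadPoolExecutor

def slow_dnsbl_lookup(rule):
    # Stand-in for a real DNSBL query; returns (rule, score contribution).
    return (rule, 1.0)

def scan(net_rules, local_score, threshold):
    with ThreadPoolExecutor() as pool:
        # Start every net test immediately, before local rules finish.
        futures = [pool.submit(slow_dnsbl_lookup, r) for r in net_rules]
        score = local_score                  # local rules run meanwhile
        if score >= threshold:
            return score, "decided-early"    # straggler results ignored
        for f in futures:                    # harvest as late as possible
            rule, s = f.result()
            score += s
            if score >= threshold:
                break
        return score, "decided-late"
```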

I would not be terribly surprised to find out that on average there was no appreciable difference in running all rules of all types in priority order, over the current method; at least if this didn't push a lot of net rule mandatory result checking too early.
Of course it resulted in no difference. The code as it stands makes zero effort to take advantage of cache locality at all. Well, I guess you could say it's maximizing locality of the rule code, while minimizing locality of the message data.
And even if that happened, it would slow throughput per item, but it wouldn't necessarily increase processor overhead. Indeed, it might in some cases reduce processor overhead.

Doing something like what you did, assigning a priority to every rule that doesn't already have one (based on the score, pretty much in the order the OP suggested), then sorting the rules by priority regardless of rule type and running all of them that way, will I *suspect* perform about the same as the current algorithm.
Well, that's roughly what my test did. The code doesn't group by type.
A reduction in performance would, I suspect, most likely come from the code having to switch on rule type for each rule it runs. There are probably clever tricks (like the current eval-ed compiled procedures) that would eliminate this switch overhead.
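The score-to-priority assignment being discussed could look something like this sketch (Python for illustration; the bucket boundaries and rule names are invented, not SA's): rules with no explicit priority get one derived from their score, and then everything sorts into one flat priority-ordered list with type ignored.

```python
# Sketch with invented bucket boundaries: derive a default priority from
# a rule's score, then sort all rules into one flat list regardless of type.
def default_priority(score):
    # Hypothetical mapping: strongly negative (ham-indicating) rules run
    # first, high-scoring spam rules next, weak rules last.
    if score <= -1:
        return -100
    if score >= 3:
        return 0
    return 100

def order_rules(rules):
    # rules: list of (name, rule_type, score, explicit_priority_or_None)
    keyed = [(p if p is not None else default_priority(s), n)
             for n, t, s, p in rules]
    return [n for _, n in sorted(keyed)]   # flat ordering, type ignored
```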

This still doesn't necessarily check for bailing on score. But note that short-circuiting is already present. I think it is based on a 'short-circuit rule' hitting rather than a score comparison. But it is still potentially a per-rule bailout test. An extra numeric comparison after each rule that evaluates true would likely be trivial compared to the other tracking that is done for rules that hit.
Agreed, or you could do it once per priority. That would give you flexibility to control how often the check is actually made.
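The per-priority variant can be sketched as follows (Python, with an invented threshold and tiers; not SA's short-circuit implementation): rather than comparing the running score after every rule hit, compare once after each priority tier completes and stop if the verdict is already decided.

```python
# Sketch of a per-priority score bailout (threshold and tiers invented):
# one comparison per priority tier instead of one per rule hit.
def run_with_tier_bailout(tiers, threshold):
    # tiers: dict mapping priority -> list of (rule_name, score)
    total, fired = 0.0, []
    for priority in sorted(tiers):
        for name, score in tiers[priority]:
            total += score
            fired.append(name)
        if total >= threshold:     # check once after the whole tier
            break                  # later tiers never run
    return total, fired
```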
