On 24 Jun 2004 23:00:47 -0700, Daniel Quinlan <[EMAIL PROTECTED]> writes:

> I think the bigger performance issue is given the frequency with which
> we call the SHA1 routine (about 420 times per message for this data set)
> is to _avoid that extra function call_.

Function call overhead is less than 100 ns; at 420 calls/message and 20
messages/sec, that comes to only about 0.1% of overall CPU time. SHA-1 is
also fairly expensive because it hashes in blocks of 64 bytes. Jenkins'
hash was rejected earlier because its output was only 32 bits; it could be
run twice with two different keys to get 64 bits of output.
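The two-keys idea can be sketched like this (Python for illustration; the one-at-a-time variant of Jenkins' hash and the particular second seed are my choices, not something specified in this thread):

```python
def jenkins_oaat(data: bytes, seed: int = 0) -> int:
    """Jenkins' one-at-a-time hash: 32-bit output, seedable."""
    h = seed
    for b in data:
        h = (h + b) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

def hash64(data: bytes) -> int:
    """Run the 32-bit hash twice with different seeds and concatenate
    the results into 64 bits. The second seed (a golden-ratio constant)
    is an arbitrary choice for this sketch."""
    lo = jenkins_oaat(data, seed=0)
    hi = jenkins_oaat(data, seed=0x9E3779B9)
    return (hi << 32) | lo
```

Whether two 32-bit runs give the same collision resistance as a native 64-bit hash is a separate question, but it addresses the output-width objection.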

> > I have two questions. 1) What do the numbers show? How much will we save 
> > if the hash function is twice as fast as SHA-1? Five times as fast?
> 
> Here are the results for a sa-learn on 2000 msgs under MySQL:
> 

> Also, the time per call is not measurable, so I suspect most of the time
> is simply spent in the overhead of a function call, passing data in and
> out of Digest::SHA1, etc.  I don't think it's worth it.
> 
> I think more is to be gained in these areas:
> 
>  - global optimizations to regular expression body tests (the test set
>    that's the most expensive)
>  - general performance improvements (start with some Devel::DProf data)
>  - the spamd architecture (too much overhead per message)
>  - reexamining the idea of not doing work when it is not needed (not to
>    be entered into lightly since it can often be more expensive)
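To make the per-call cost concrete, here's a rough timing harness in the spirit of the measurement above (Python with hashlib standing in for Digest::SHA1; the token contents and call count are made up for illustration):

```python
import hashlib
import timeit

def noop(data: bytes) -> int:
    """Stand-in for a trivial hash: measures call overhead only."""
    return 0

token = b"example-token"   # hypothetical Bayes token
calls = 100_000

t_sha1 = timeit.timeit(lambda: hashlib.sha1(token).digest(), number=calls)
t_noop = timeit.timeit(lambda: noop(token), number=calls)

print(f"sha1 per call:  {t_sha1 / calls * 1e9:.0f} ns")
print(f"no-op per call: {t_noop / calls * 1e9:.0f} ns")
```

The no-op time approximates pure call overhead; the gap between the two is the hashing itself plus data marshalling, which is the split being argued over here.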

I've run spamassassin through OProfile, a system-wide, kernel-level
profiler that works without modifying binaries or libraries. Here's a
quick run of spamc/spamd 2.63 on stock perl 5.8.4 with the HTML::Parser
from CPAN, using the default database library.

From the OProfile results, with bayes activated, it appears that only
10% or so of the runtime is spent in the regular expression
code. After about 24,000 messages have been processed, bayes
performance declines by 40% (perhaps because it has autolearned enough
spam and is now classifying) and libdb3 is responsible for 11% of
runtime. About 6% of runtime seems to be identifiably caused by
copy-on-write from spamd's forking behavior.

After 127 minutes and 51,000 messages, it looks like:

CPU: Athlon, speed 1837.72 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask 
of 0x00 (No unit mask) count 1000000
samples  %        app name                 symbol name
1619397  11.1000  libdb3.so.3.0.2          (no symbols)
1494937  10.2469  perl                     S_regmatch
1074194   7.3630  libc-2.3.2.so            (no symbols)
983368    6.7404  vmlinux                  fast_copy_page
343397    2.3498  perl                     S_find_byclass
342439    2.3432  perl                     S_hv_fetch_common
265513    1.8168  perl                     Perl_sv_setsv_flags
223108    1.5267  perl                     Perl_leave_scope
222373    1.5216  perl                     Perl_sv_clear
206650    1.4140  perl                     Perl_pp_padsv
202517    1.3858  perl                     S_visit
197297    1.3500  perl                     S_regtry

Running 24,000 messages without bayes, I get these results, showing
about 10% from fork()ing overhead and 24% in the regexp engine.

CPU: Athlon, speed 1837.72 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask 
of 0x00 (No unit mask) count 1000000
samples  %        app name                 symbol name
587894   19.1067  perl                     S_regmatch
321703   10.4554  vmlinux                  fast_copy_page
231942    7.5382  libc-2.3.2.so            (no symbols)
144354    4.6915  perl                     S_find_byclass
79662     2.5890  perl                     S_regtry
70016     2.2755  perl                     S_visit
65777     2.1378  perl                     S_hv_fetch_common
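As a sanity check on that regexp figure, the table above can be summed directly (which symbols count as "the regexp engine" is my reading; the post doesn't spell it out):

```python
# Percentages for the regexp-related perl symbols in the no-bayes run above
regexp_syms = {
    "S_regmatch":     19.1067,
    "S_find_byclass":  4.6915,
    "S_regtry":        2.5890,
}

core = regexp_syms["S_regmatch"] + regexp_syms["S_find_byclass"]
total = sum(regexp_syms.values())
print(f"regmatch + find_byclass: {core:.1f}%")   # 23.8%, roughly the 24% quoted
print(f"all three symbols:       {total:.1f}%")  # 26.4% if S_regtry is included
```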

Unfortunately, my libc and libdb3 are stripped, so I can't break their
overheads down any further; getting symbol-level accuracy requires
unstripped libraries and binaries.

Scott
