On 24 Jun 2004 23:00:47 -0700, Daniel Quinlan <[EMAIL PROTECTED]> writes:
> I think the bigger performance issue is given the frequency with which
> we call the SHA1 routine (about 420 times per message for this data set)
> is to _avoid that extra function call_.

Function call overhead is less than 100ns.  420 times/message * 20
messages/sec is only 0.1% of overall CPU time.  SHA-1 is also fairly
expensive because it hashes in blocks of 64 bytes.

Earlier, Jenkins's hash was rejected because its output was only 32 bits;
it could be run twice with two different keys to get 64 bits of output.

> > I have two questions.  1) What do the numbers show?  How much will we save
> > if the hash function is twice as fast as SHA-1?  Five times as fast?
>
> Here are the results for a sa-learn on 2000 msgs under MySQL:
>
> Also, the time per call is not measurable, so I suspect most of the time
> is simply spent in the overhead of a function call, passing data in and
> out of Digest::SHA1, etc.  I don't think it's worth it.
>
> I think more is to be gained in these areas:
>
> - global optimizations to regular expression body tests (the test set
>   that's the most expensive)
> - general performance improvements (start with some Devel::DProf data)
> - the spamd architecture (too much overhead per message)
> - reexamining the idea of not doing work when it is not needed (not to
>   be entered into lightly since it can often be more expensive)

I've run SpamAssassin through OProfile.  OProfile is a system-wide,
kernel-level profiler that works without modifying binaries or libraries.
I did a quick run on spamc/spamd 2.63 on stock Perl 5.8.4 with the
HTML::Parser from CPAN, using the default database library.

From the OProfile results, with Bayes activated, it appears that only
about 10% of the runtime is spent in the regular expression code.  After
about 24,000 messages have been processed, Bayes performance declines by
40% (perhaps because it has autolearned enough spam and is now
classifying) and libdb3 is responsible for 11% of the runtime.
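For reference, here is a minimal Python sketch of the two claims above: the
back-of-the-envelope overhead arithmetic, and running a 32-bit Jenkins hash
twice with different keys to get 64 bits of output.  The thread doesn't say
which Jenkins variant was proposed; the one-at-a-time hash and the seed
values below are assumptions for illustration only.

```python
# Overhead check: 420 calls/message * 20 messages/sec * ~100 ns/call,
# expressed as a fraction of one second of CPU time.
CALLS_PER_MSG = 420
MSGS_PER_SEC = 20
OVERHEAD_PER_CALL = 100e-9  # seconds; the <100ns figure from above
cpu_fraction = CALLS_PER_MSG * MSGS_PER_SEC * OVERHEAD_PER_CALL
# cpu_fraction is ~0.00084, i.e. roughly 0.1% of overall CPU time


def one_at_a_time(data: bytes, seed: int = 0) -> int:
    """Jenkins's one-at-a-time hash, 32-bit output (variant assumed)."""
    h = seed & 0xFFFFFFFF
    for b in data:
        h = (h + b) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def hash64(data: bytes) -> int:
    """Run the 32-bit hash twice with two different keys and concatenate
    the two results into a 64-bit value (the seeds here are arbitrary)."""
    return (one_at_a_time(data, 0xDEADBEEF) << 32) | one_at_a_time(data, 0)
```

Whether two seeded 32-bit runs are as collision-resistant as a true 64-bit
hash is a separate question; this only illustrates the mechanics.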
About 6% of the runtime is identifiably caused by copy-on-write from
spamd's forking behavior.

After 127 minutes and 51,000 messages, it looks like:

CPU: Athlon, speed 1837.72 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 1000000

samples  %        app name         symbol name
1619397  11.1000  libdb3.so.3.0.2  (no symbols)
1494937  10.2469  perl             S_regmatch
1074194   7.3630  libc-2.3.2.so    (no symbols)
 983368   6.7404  vmlinux          fast_copy_page
 343397   2.3498  perl             S_find_byclass
 342439   2.3432  perl             S_hv_fetch_common
 265513   1.8168  perl             Perl_sv_setsv_flags
 223108   1.5267  perl             Perl_leave_scope
 222373   1.5216  perl             Perl_sv_clear
 206650   1.4140  perl             Perl_pp_padsv
 202517   1.3858  perl             S_visit
 197297   1.3500  perl             S_regtry

Running 24,000 messages without Bayes, I get these results, showing
about 10% from fork()ing overhead and 24% in the regexp engine:

CPU: Athlon, speed 1837.72 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 1000000

samples  %        app name         symbol name
 587894  19.1067  perl             S_regmatch
 321703  10.4554  vmlinux          fast_copy_page
 231942   7.5382  libc-2.3.2.so    (no symbols)
 144354   4.6915  perl             S_find_byclass
  79662   2.5890  perl             S_regtry
  70016   2.2755  perl             S_visit
  65777   2.1378  perl             S_hv_fetch_common

Unfortunately, my libc and libdb3 are stripped, so I can't break those
overheads down any further; getting symbol-level accuracy requires
unstripped libraries and binaries.

Scott
