http://bugzilla.spamassassin.org/show_bug.cgi?id=3671
Summary: Possible 30 - 40% speed increase for Bayes with
SDBM_File
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P3
Component: Learner
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]
As I discussed yesterday with Daniel (was: another profile: scanning with
learning), I achieved outstanding results by switching from DB_File to
SDBM_File for the bayes database - interestingly, SDBM_File is generally
said to be slower than DB_File, but it seems to suit the bayes token
operations just fine.
I have witnessed this on multiple systems, but only with rather small user
bases. I'd be particularly interested in how this works out for sites with
very large databases.
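For anyone who wants to try this outside the SpamAssassin code, the core of
the switch is just a different tie() call. A minimal standalone sketch (file
names are made up; the real change of course belongs in the Bayes store
module):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;
    use SDBM_File;

    my (%old, %new);

    # DB_File: a single .db file, takes an extra access-method argument
    tie(%old, 'DB_File', 'bayes_toks.db', O_RDWR|O_CREAT, 0640, $DB_HASH)
        or die "DB_File tie failed: $!";

    # SDBM_File: same tied-hash interface, creates a .pag/.dir file pair
    tie(%new, 'SDBM_File', 'bayes_toks', O_RDWR|O_CREAT, 0640)
        or die "SDBM_File tie failed: $!";

    untie %old;
    untie %new;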
For a start, here are my benchmark results - configuration: pre-4; Perl 5.8.4,
Linux (slow box).
In a first test run, the two databases returned different results; I tracked
this down to differing token expiration timings.
A second run with expiration disabled (in BayesStore.pm) produced completely
identical results for both DBs, right down to the tokens.
DB_File taking about twenty minutes longer in Run 1 than in Run 2 suggests
that expiry passes eat up a lot of time with DB_File, but far less with
SDBM_File.
All times are per step; tests were limited to bayes.
                                          db_file       sdbm_file

Training data
-------------
* learn 300 ham:                          4m56.905s     3m40.712s
* learn 1368 (of 1500) spam:              14m12.346s    10m3.399s

Run 1: Expiration enabled
-------------------------
* run against 600 ham, 2236 spam:         ~97 min       49m23.579s (!)
  => similar, but not the same results due to token expiration

Run 2: Expiration disabled
--------------------------
* revert to training state
* run against 75 ham, 150 spam:           5m50.910s     3m22.906s
* learn another 3 x 5 MB of spam:
    843 mails (of 978):                   13m18.612s    10m33.272s
    907 mails (of 1019):                  13m12.591s    10m10.109s
    1044 mails (of 1193):                 15m37.188s    12m30.815s
* run against 600 ham, 2236 spam:         74m10.344s    50m50.962s
  => equal results and db dumps
There have been rumors about potential issues with SDBM - the documentation
only mentions a 1k limit on the size of an individual entry (key plus
value), which is irrelevant for bayes given the current token size.
In contrast, the SDBM_File documentation talks about "multiple" limitations,
but does not elaborate on them - does anyone have insight here?
Furthermore, http://qdbm.sourceforge.net/benchmark.pdf claims that SDBM does
not support more than 100,000 records - yet my benchmark db worked fine even
with 400,000 tokens.
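To double-check that claim in isolation, here is a quick standalone sketch
(the file name and the packed value format are made up, just to keep the
entries roughly token-sized):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    my %db;
    tie(%db, 'SDBM_File', '/tmp/sdbm_bench', O_RDWR|O_CREAT, 0640)
        or die "tie failed: $!";

    # store 400,000 short, roughly token-sized entries
    for my $i (1 .. 400_000) {
        $db{ sprintf('tok%06d', $i) } = pack('NNN', 1, 0, time());
    }

    # spot-check a few entries instead of iterating the whole db
    for my $i (1, 200_000, 400_000) {
        my $key = sprintf('tok%06d', $i);
        die "lost $key" unless defined $db{$key};
    }
    print "400,000 records stored and readable\n";
    untie %db;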
In a posting, the SDBM author states that SDBM always throws errors when its
limits are hit - I have had this in operation for a few months now and
haven't run into any errors so far.
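That matches what a minimal test shows; assuming I read SDBM_File's tie
layer correctly, an oversized store croaks rather than failing silently
(again a made-up standalone example):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    my %db;
    tie(%db, 'SDBM_File', '/tmp/sdbm_limit', O_RDWR|O_CREAT, 0640)
        or die "tie failed: $!";

    # key and value together must fit into one sdbm page (about 1k),
    # so an oversized pair should be rejected loudly
    eval { $db{'big'} = 'x' x 2000; 1 }
        or print "store rejected as expected: $@";
    untie %db;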