http://bugzilla.spamassassin.org/show_bug.cgi?id=3671

           Summary: Possible 30 - 40% speed increase for Bayes with
                    SDBM_File
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P3
         Component: Learner
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


As I discussed yesterday with Daniel (was: another profile: scanning with 
learning), I achieved outstanding results by switching from DB_File to 
SDBM_File for the Bayes database - interestingly, SDBM is generally said to 
be slower than DB_File, but it seems to fit the Bayes token operations just 
fine.
I have witnessed this on multiple systems, but only with rather small user 
bases. I'd be particularly interested in how this works out for sites with 
very large db's. 
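For illustration, a minimal sketch of what the switch amounts to at the tie() 
level (this is not SpamAssassin's actual BayesStore code, and the packed 
(spam_count, ham_count) layout is a hypothetical stand-in for the real token 
format): SDBM_File ships with core Perl, so no Berkeley DB dependency is 
needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use SDBM_File;                   # core module, replaces "use DB_File;"
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Instead of: tie(%tokens, 'DB_File', $path, ...)
my %tokens;
tie(%tokens, 'SDBM_File', "$dir/bayes_toks", O_RDWR | O_CREAT, 0640)
    or die "tie failed: $!";

# Token entries are tiny (a few packed bytes), far below SDBM's ~1k
# per-entry limit mentioned below. 'w' packs BER compressed integers.
$tokens{'viagra'} = pack('ww', 120, 3);   # hypothetical (spam, ham) counts
my ($spam, $ham) = unpack('ww', $tokens{'viagra'});
print "spam=$spam ham=$ham\n";            # prints "spam=120 ham=3"

untie %tokens;
```

Since the tied-hash interface is the same for both modules, the rest of the 
store code can stay largely untouched.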

For a start, here are my benchmark results - configuration: pre-4; Perl 5.8.4; 
Linux (slow box).
In a first test run the two databases returned different results; I tracked 
this down to different token expiration timings.
A second run with expiration disabled (in BayesStore.pm) produced completely 
identical results for both DBs (down to the individual tokens).
DB_File taking twenty minutes longer on Run 1 suggests that expiry runs eat 
up a lot of time with DB_File, but far less with SDBM_File.

All times are per step; tests were limited to bayes.

                                                db_file         sdbm_file
Training data
-------------
* learn 300 ham:                                4m56.905s       3m40.712s
* learn 1368 (of 1500) spam:                    14m12.346s      10m3.399s


Run 1: Expiration enabled
-------------------------
* run against 600 ham, 2236 spam:               ~97 min         49m23.579s (!)
=> similar, but not the same results due to token expiration


Run 2: Expiration disabled
---------------------------
* revert to training state
* run against 75 ham, 150 spam:                 5m50.910s       3m22.906s

* learn another 3 x 5 mb of spam:
843 mails (of 978)                              13m18.612s      10m33.272s
907 mails (of 1019)                             13m12.591s      10m10.109s
1044 mails (of 1193)                            15m37.188s      12m30.815s

* run against 600 ham, 2236 spam:               74m10.344s      50m50.962s
=> equal results and db dumps

There have been rumors about potential issues with SDBM. The dbm 
documentation only mentions a 1k limit on the size of an individual entry, 
which is irrelevant for Bayes given the current token size.
The SDBM_File documentation, in contrast, speaks of "multiple" limitations 
but does not elaborate on them - does anyone have insight here?
Furthermore, http://qdbm.sourceforge.net/benchmark.pdf claims SDBM does not 
support more than 100,000 records - my benchmark db worked fine even with 
400,000 tokens.
In a posting, the SDBM author states that SDBM will always throw errors when 
its limits are hit - I have had this in operation for a few months now and 
haven't run into any errors so far.
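The "fails loudly" claim is easy to check: storing an entry whose key plus 
value exceeds SDBM's per-entry limit (around 1k) makes the store fail rather 
than silently truncate, which Perl's tie layer surfaces as a fatal error. A 
small sketch (file names are arbitrary):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
tie(my %db, 'SDBM_File', "$dir/limit_test", O_RDWR | O_CREAT, 0640)
    or die "tie failed: $!";

$db{'small'} = 'x' x 100;                       # well under the limit: fine

# key + value exceed the ~1k per-entry limit; the store dies, so we
# trap it with eval to observe the failure instead of aborting.
my $ok = eval { $db{'big'} = 'x' x 2000; 1 };
print $ok ? "oversized store succeeded\n"
          : "oversized store failed loudly\n";

untie %db;
```

A Bayes token plus its packed counts is nowhere near that size, so in normal 
operation this limit should never be reached.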



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
