David F. Skoll wrote:
> Using Pg for Bayes data will be really slow. We don't use the SpamAssassin
> Bayes implementation and we went through three iterations of storage
> back-ends before finding one we liked.
>
> 1) PostgreSQL: Convenient but slow.
>
> 2) Berkeley DB: Faster than PostgreSQL, but still slow and
>    occasionally flaky
>
> 3) CDB: Very fast, but cannot be incrementally updated. You need to
>    rebuild the entire DB and then atomically rename it.
>
> In our implementation, it's not a problem to have a read-only DB, so we
> went with CDB. It's dramatically faster than Berkeley DB:
> http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/

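For readers unfamiliar with the "rebuild the entire DB and then atomically rename it" approach quoted above, here is a minimal sketch of the pattern in Python. It is only an illustration under stated assumptions: the function names rebuild_and_swap and load are made up for this example, and a JSON file stands in for a real CDB writer, which would be used in the same place.

```python
import json
import os
import tempfile


def rebuild_and_swap(records, final_path):
    """Write a complete new database file, then atomically rename it into place.

    Hypothetical helper: JSON is only a stand-in for a real CDB writer.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rebuild-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(records, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk before the swap
        # Readers see either the old file or the new one, never a partial file.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load(path):
    """Read the whole (read-only) database back."""
    with open(path) as fh:
        return json.load(fh)
```

A reader that already has the old file open keeps reading the old contents until it reopens the path, which is why this works well for a read-only database that is refreshed periodically offline.
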
Thanks for pointing out your benchmark. It is interesting, especially (for me) the reference to Tokyo Cabinet and its successor Kyoto Cabinet, which would be worth investigating further. The benchmark may not directly reflect the usage profile of SpamAssassin's Bayes module, though, especially its auto-expiration runs and its search-in-a-set lookups (the IN operator in SQL).

I can very much believe and agree that for a read-only Bayes database CDB provides the best performance, as long as you can afford to update it periodically offline (or, in large-scale environments, have no other choice).

Regarding Berkeley DB and SQL, I do not share this experience. When we started using SpamAssassin years back, our Bayes and AWL databases were on Berkeley DB. This worked reasonably well (I share your opinion on it being 'occasionally flaky'), but the auto-expiration run times started to grow from minutes to hours. Initially this was solved by turning off opportunistic auto-expiry and running expiry explicitly at regular intervals. A long auto-expiry run could bog down mail processing for a good part of an hour or more, building up a large backlog in the mail queue.

So we finally gave up on Berkeley DB for Bayes and switched to MySQL - and what a relief that was! Opportunistic auto-expiry could be used again in real time, and the whole mail system could breathe again.

Well - occasionally the MyISAM-type database would enter an unusable state, where SpamAssassin would still appear to be running normally, but Bayes would not be returning sensible results. The solution was to run an occasional database repair, which would make things right again for a week or two. The MyISAM screwups were finally resolved by switching our database to the InnoDB engine and using the Mail::SpamAssassin::BayesStore::MySQL back-end (instead of Mail::SpamAssassin::BayesStore::SQL). This finally solved the reliability/stability problems of the Bayes and AWL databases for us, and the speed was good!

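To make the "search-in-a-set" access pattern mentioned above concrete, here is a rough sketch of a batched token lookup using the SQL IN operator. It is an illustration only: SQLite (from the Python standard library) stands in for MySQL or PostgreSQL, the table is a simplified assumption loosely modelled on a Bayes token table, and fetch_token_counts is a made-up helper, not part of SpamAssassin.

```python
import sqlite3


def fetch_token_counts(conn, tokens):
    """Look up spam/ham counts for a whole set of tokens in one query."""
    tokens = list(tokens)
    if not tokens:
        return {}
    # One "?" placeholder per token; the values themselves are bound, not inlined.
    placeholders = ",".join("?" for _ in tokens)
    sql = (
        "SELECT token, spam_count, ham_count FROM bayes_token "
        "WHERE token IN (" + placeholders + ")"
    )
    return {tok: (spam, ham) for tok, spam, ham in conn.execute(sql, tokens)}


# Simplified, assumed schema with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_token ("
    "token TEXT PRIMARY KEY, spam_count INTEGER, ham_count INTEGER)"
)
conn.executemany(
    "INSERT INTO bayes_token VALUES (?, ?, ?)",
    [("viagra", 40, 1), ("meeting", 2, 55), ("invoice", 9, 12)],
)

counts = fetch_token_counts(conn, ["viagra", "unknown", "meeting"])
```

Tokens that are not in the table are simply absent from the result. How well a given server plans this kind of query is exactly what a generic key-value benchmark would not show.
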
Life was beautiful - until somehow this SQL solution started to become slow, and tweaking and cleaning the database did not help. I'm not sure what exactly happened, but even a new database started from scratch soon lost its speed. Not seeing any obvious solution, we tried switching to PostgreSQL - and have stayed there ever since, never looking back! The switch was made somewhere around version 8.3 of the PostgreSQL server; we are now running 9.0.3 (on FreeBSD) and are very happy with it (along with SpamAssassin from trunk, which is to become 3.4).

I must admit that during our first attempted transition to SQL we tried both MySQL and PostgreSQL, and MySQL did indeed turn out to be faster. It seemed that the IN operator in PostgreSQL was not well optimized. This problem appears to be no longer present in more recent versions of PostgreSQL, though I'm not sure in which version the change occurred.

To put things into perspective, our user base is about 1000 users, so my experience does not necessarily translate to large ISPs, or to SOHO.

Anyway, just wanted to share my point of view.

Mark