David F. Skoll wrote:

> Using Pg for Bayes data will be really slow.  We don't use the SpamAssassin
> Bayes implementation and we went through three iterations of storage
> back-ends before finding one we liked.
> 
> 1) PostgreSQL: Convenient but slow.
> 
> 2) Berkeley DB: Faster than PostgreSQL, but still slow and
> occasionally flaky
> 
> 3) CDB: Very fast, but cannot be incrementally updated.  You need to
> rebuild the entire DB and then atomically rename it.
> 
> In our implementation, it's not a problem to have a read-only DB, so we
> went with CDB.  It's dramatically faster than Berkeley DB:
>      http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/

Thanks for pointing out your benchmark. It is interesting, and for me
especially the reference to Tokyo Cabinet and its successor Kyoto Cabinet.
These would be worth investigating further. The benchmark may not directly
reflect the usage profile of SpamAssassin's Bayes module, though,
especially its auto-expiration runs and its search-in-a-set lookups (the
IN operator in SQL).
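
To make the search-in-a-set pattern concrete, here is a minimal sketch in
Python against an in-memory SQLite database. The table and column names
(bayes_token, token, spam_count, ham_count) and the helper lookup_tokens
are assumptions for illustration only, not copied from the SpamAssassin
schema or code.

```python
import sqlite3

def lookup_tokens(conn, tokens):
    """Fetch counts for a whole set of tokens with one IN query.

    Hypothetical helper: table and column names are illustrative.
    """
    if not tokens:
        return {}
    # One placeholder per token, so the values are bound, not interpolated.
    placeholders = ",".join("?" for _ in tokens)
    sql = (
        "SELECT token, spam_count, ham_count FROM bayes_token "
        f"WHERE token IN ({placeholders})"
    )
    rows = conn.execute(sql, list(tokens))
    return {tok: (spam, ham) for tok, spam, ham in rows}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_token ("
    "token TEXT PRIMARY KEY, spam_count INTEGER, ham_count INTEGER)"
)
conn.executemany(
    "INSERT INTO bayes_token VALUES (?, ?, ?)",
    [("viagra", 40, 1), ("meeting", 2, 35), ("invoice", 10, 12)],
)

# Tokens that are not in the table are simply absent from the result.
found = lookup_tokens(conn, ["viagra", "meeting", "unseen"])
```

A message yields many tokens at once, so how well the server plans such an
IN lookup matters more here than in a benchmark of single-key reads.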

I can very much believe and agree that for a read-only Bayes database
CDB provides the best performance - as long as you can afford (or, in
large-scale environments, have no other choice but) to update it
periodically offline.
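
The rebuild-then-rename update could be sketched roughly as below, in
Python. This is not David's implementation: the function name
rebuild_and_swap is hypothetical, and JSON stands in for a real CDB
writer, which the Python standard library does not provide. Only the
write-to-temporary-file and atomic-rename steps are the point.

```python
import json
import os
import tempfile

def rebuild_and_swap(records, live_path):
    """Write a complete new database file, then atomically replace the old one.

    Assumption: JSON is only a stand-in for the real on-disk format.
    """
    directory = os.path.dirname(os.path.abspath(live_path))
    # The temporary file must be on the same filesystem as the live file,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rebuild-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(records, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, live_path)
    except BaseException:
        # Clean up the partial file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Readers that already have the old file open keep reading it until they
reopen, which is why this only suits a database that may lag behind.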

Regarding Berkeley DB and SQL, I do not share this experience.

When we started using SpamAssassin years back, our Bayes and AWL
databases were on Berkeley DB. This worked reasonably well (I share
your opinion on it being 'occasionally flaky'), but the auto-expiration
run times started to grow from minutes to hours. Initially this was
solved by turning off opportunistic auto-expiry and running it
explicitly at regular intervals. Even so, a long auto-expiry run could
bog down mail processing for a good part of an hour or more, building up
a large backlog in the mail queue.

So we finally gave up on using Berkeley DB for Bayes and
switched to MySQL - and what a relief that was! Opportunistic
auto-expire could be used again in real time, and the whole mail
system could breathe again. Well - occasionally the MyISAM-type
database would enter an unusable state, where SpamAssassin
would still appear to be running normally, but Bayes would not be
returning sensible results. The solution was to run an occasional
database repair, which would make things right again for a week
or two.

The MyISAM screwups were finally resolved by switching our database to
the InnoDB engine and using the Mail::SpamAssassin::BayesStore::MySQL
back-end (instead of Mail::SpamAssassin::BayesStore::SQL).
This finally solved the reliability/stability problems of the Bayes and
AWL databases for us, and the speed was good!

Life was beautiful - until somehow this SQL solution started to become
slow, and tweaking and cleaning the database did not help. I'm not
sure what exactly happened, but even a new database started from scratch
soon lost its speed. Not seeing any obvious solution, we tried
switching to PostgreSQL - and have stayed there ever since, never looking
back! The switch was made somewhere in the 8.3 series of the PostgreSQL
server, but now we are running 9.0.3 (on FreeBSD) and are very happy
with it (along with SpamAssassin from trunk - to become 3.4).

I must admit that during our first attempted transition to SQL we
tried both MySQL and PostgreSQL, and indeed MySQL turned out to be
faster. It seemed that the IN operator in PostgreSQL was not well
optimized. But this problem appears to be no longer present in more
current versions of PostgreSQL - I am not sure in which version the
change occurred.

To put things into perspective, our user base is about 1000 users, so
my experience does not necessarily translate to large ISPs, or to SOHO.
Anyway, just wanted to share my point of view.

  Mark
