Wes wrote:
On 11/29/07 2:49 PM, "Daryl C. W. O'Shea" <[EMAIL PROTECTED]> wrote:
Even still though, 5 queries times, say, 50ms is 1/4 of a second that
you're idle in that spamd child process. That leaves you trying to make
up for it by running more child processes (you've freed up some CPU
time by having those children idle so you'll have some CPU time to run
more) but you'll never get it all back and you'll be lucky to get even
half of the lost throughput back.
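
For anyone who wants to plug in their own numbers, here is a rough back-of-envelope sketch of that arithmetic. Only the 5 queries and the 50ms come from the paragraph above; the one second of scan time per message is a made-up figure, and the function names are just for illustration.

```python
# Back-of-envelope model of the idle time per message in a spamd child.
# Only the query count and latency come from the discussion above;
# the CPU time per message is a hypothetical placeholder.

QUERIES_PER_MESSAGE = 5         # from the text above
QUERY_LATENCY_S = 0.050         # "say, 50ms" from the text above
CPU_TIME_PER_MESSAGE_S = 1.0    # assumption: scan time with a local database


def idle_time_per_message(queries, latency_s):
    """Seconds a child spends waiting on the database for one message."""
    return queries * latency_s


def per_child_throughput(cpu_time_s, idle_s):
    """Messages per second for one child when each message costs cpu + idle."""
    return 1.0 / (cpu_time_s + idle_s)


idle = idle_time_per_message(QUERIES_PER_MESSAGE, QUERY_LATENCY_S)
local = per_child_throughput(CPU_TIME_PER_MESSAGE_S, 0.0)
remote = per_child_throughput(CPU_TIME_PER_MESSAGE_S, idle)
lost_fraction = 1.0 - remote / local

print(f"idle per message: {idle:.2f} s")
print(f"per-child throughput lost: {lost_fraction:.0%}")
```

The shorter your real scan time is, the larger the share of each message that goes to waiting, so measure your own per-message time before trusting the percentage.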
If you'd like to share a database between distributed MXes/spamd
machines you're best off using replication and limiting autolearning
to the machines that connect to the master database server.
Thanks for the details. That gives me an idea what activity to expect. One
DB per location may end up being the way to go. How well does it handle
concurrency, if it has to update the last access time of tokens and learn
new tokens? Are there any numbers on concurrent servers when it starts to
bog down?
Sorry, I have no concrete data on that. Most of my high-volume
customers don't use bayes (usually because of memories of misguided
configurations in the past or the fear of bayes taking off in the wrong
direction, as it occasionally has a habit of doing).
I would expect, though, that if your spamd and SQL machines are of
similar hardware, you may be able to support a few hundred spamd
children per SQL server. I could be way off... it's just a guess.
I'd imagine you'd naturally do this, but for others following along,
rather than switching everything over at once, I would switch one
machine (or a couple of machines, depending on how many you have) over
at a time to using the SQL database, track throughput stats for a day
(so you get a complete day's mail flow cycle and an expiry or two in) and
then add more. Stop when the average throughput of the SQL-using spamd
machines falls too far below linear.
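
A minimal sketch of that "too far below linear" check, in case it helps. The 80% cutoff, the function names and the sample throughput figures are all hypothetical; pick a threshold that suits your own setup.

```python
# Sketch of the staged rollout check: compare the measured throughput of
# the SQL-using machines against perfectly linear scaling.
# The threshold and all sample numbers are hypothetical.

def scaling_efficiency(baseline_per_machine, total_throughput, machines):
    """Measured total throughput as a fraction of linear scaling."""
    linear = baseline_per_machine * machines
    return total_throughput / linear


def should_keep_adding(baseline_per_machine, total_throughput, machines,
                       min_efficiency=0.8):
    """True while the SQL-using machines are still close enough to linear."""
    eff = scaling_efficiency(baseline_per_machine, total_throughput, machines)
    return eff >= min_efficiency


# Daily average of the first machine alone on SQL (messages/second).
baseline = 10.0

# (machines on SQL, combined daily-average throughput) after each step.
steps = [(1, 10.0), (2, 19.5), (4, 36.0), (6, 45.0)]

for machines, total in steps:
    eff = scaling_efficiency(baseline, total, machines)
    keep = should_keep_adding(baseline, total, machines)
    verdict = "add more" if keep else "stop"
    print(f"{machines} machines: {eff:.0%} of linear -> {verdict}")
```

Feed it full-day averages rather than spot readings, so a bayes expiry run or a traffic peak doesn't trip the check on its own.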
Selecting a storage engine that supports row-level locking could help
with concurrency... but not always... for MySQL, MyISAM is faster than
InnoDB, probably due to its faster indexing (and no transaction support
overhead).
See http://wiki.apache.org/spamassassin/BayesBenchmarkResults for some
small scale stats. Note that I don't think that SDBM's performance will
scale to really large databases. Matt Kettler may have input on that
though.
Also, be sure to read the sql/README.bayes documentation in the SA
release tarball (make sure you use the PostgreSQL-specific storage
module if you're going to use PostgreSQL).
Daryl