Re: Mondo bayes_toks - millions of entries

Daryl C. W. O'Shea Thu, 29 Nov 2007 21:59:58 -0800

Wes wrote:

On 11/29/07 2:49 PM, "Daryl C. W. O'Shea" <[EMAIL PROTECTED]> wrote:

Even still though, 5 queries times, say, 50ms is a 1/4 of a second that
you're idle in that spamd child process.  That leaves you trying to make
up for it by running more child processes (you've freed up some CPU
time by having those children idle so you'll have some CPU time to run
more) but you'll never get it all back and you'll be lucky to get even
half of the lost throughput back.
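
The arithmetic above can be sketched roughly as follows. The 5 queries and 50ms latency come from the paragraph; the 0.5 seconds of CPU time per message is a made-up figure for illustration only, and the function names are hypothetical. The model assumes a child simply blocks on every query and does not account for running extra children.

```python
def idle_seconds(queries_per_msg, query_latency_s):
    """Time one spamd child spends waiting on the database per message."""
    return queries_per_msg * query_latency_s


def child_throughput(cpu_seconds_per_msg, queries_per_msg, query_latency_s):
    """Messages per second for one child that blocks on each query."""
    wall = cpu_seconds_per_msg + idle_seconds(queries_per_msg, query_latency_s)
    return 1.0 / wall


# Figures from the message: 5 queries at roughly 50ms each.
QUERIES = 5
LATENCY_S = 0.050

# Assumption for illustration only: 0.5s of CPU work per message.
CPU_S = 0.5

local = child_throughput(CPU_S, QUERIES, 0.0)         # no network wait
remote = child_throughput(CPU_S, QUERIES, LATENCY_S)  # remote database
lost_fraction = 1.0 - remote / local

print(f"idle per message: {idle_seconds(QUERIES, LATENCY_S):.2f}s")
print(f"per-child throughput lost: {lost_fraction:.0%}")
```

With a smaller CPU cost per message the same quarter second of waiting takes a proportionally larger share of each child's time.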

If you'd like to share a database between distributed MXes/spamd
machines you're best off to use replication and limit autolearning to
the machines that connect to the master database server.


Thanks for the details.  That gives me an idea what activity to expect.  One
DB per location may end up being the way to go.  How well does it handle
concurrency, if it has to update the last access time of tokens and learn
new tokens?  Are there any numbers on concurrent servers when it starts to
bog down?

Sorry, I have no concrete data on that. Most of my high-volume customers don't use bayes (usually because of memories of misguided configurations in the past or the fear of bayes taking off in the wrong direction as it occasionally has a habit of doing).

I would expect that if your spamd and SQL machines are of similar hardware, though, that you may be able to support a few hundred spamd children per SQL server. I could be way off though... it's just a guess.

I'd imagine you'd naturally do this, but for others following along, rather than switching everything over at once, I would switch one machine (or a couple of machines, depending on how many you have) over at a time to using the SQL database, track throughput stats for a day (so you get a complete day's mail flow cycle and an expiry or two in) and then add more. Stop when the average throughput of the SQL-using spamd machines falls too far below linear.
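
One way to put a number on "too far below linear" is to compare the average throughput of the machines already on SQL against what a single machine managed before the switch. This is only a sketch: the function names, the sample figures and the 0.8 cutoff are all assumptions, not anything from SpamAssassin itself, and you would pick your own threshold.

```python
def scaling_efficiency(per_machine_throughputs, baseline):
    """Average throughput of the SQL-using machines relative to the
    pre-switch throughput of one machine (1.0 means perfectly linear)."""
    average = sum(per_machine_throughputs) / len(per_machine_throughputs)
    return average / baseline


def should_stop_adding(per_machine_throughputs, baseline, min_efficiency=0.8):
    """True once the average has dropped below the chosen cutoff.

    min_efficiency is an arbitrary illustrative threshold.
    """
    return scaling_efficiency(per_machine_throughputs, baseline) < min_efficiency


# Hypothetical daily averages (messages per minute) after each rollout step.
baseline = 100.0
after_three_machines = [100.0, 90.0, 80.0]
after_six_machines = [70.0, 60.0, 65.0, 72.0, 68.0, 61.0]

print(should_stop_adding(after_three_machines, baseline))
print(should_stop_adding(after_six_machines, baseline))
```

Feed it a full day's stats per machine, as described above, so that an expiry run or two is included in the average.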

Selecting a storage engine that supports row-level locking could help with concurrency... but not always... for MySQL, MyISAM is faster than InnoDB, probably due to its faster indexing (and no transaction support overhead).

See http://wiki.apache.org/spamassassin/BayesBenchmarkResults for some small scale stats. Note that I don't think that SDBM's performance will scale to really large databases. Matt Kettler may have input on that though.

Also, be sure to read the sql/README.bayes documentation in the SA release tarball (make sure you use the PostgreSQL-specific storage module if you're going to use PostgreSQL).

Daryl
