John Hardin wrote:
> On Fri, 27 Jan 2012, Kris Deugau wrote:

>> Every so often, one of our spamd instances gets locked up when a burst
>> of messages with "lots" (150-200K+) of body text gets passed in.
>>
>> If we catch this happening, restarting spamd seems to clear up
>> whatever gets deadlocked. Otherwise, it typically takes 10-15 minutes
>> to get unlocked, and then there's a big burst of processing as the
>> backlog clears.

> But it does eventually recover?

After ~10-15 minutes, according to the notifications from the monitoring system and load balancers. Then the log shows a burst of new connections, child-state entries reporting a growing number of busy children, and eventually the usual "spamd result" entries.

> Sounds like you're hitting swap. When that happens things *really* bog
> down.

*nod* I've been there on a much smaller, more cramped all-in-one system. It took me a while to fine-tune all of the conflicting demands on the physical RAM (outbound SMTP relay, Clam, SpamAssassin, webmail, POP, IMAP) to get that system to behave.

> How much memory do you have, and how many max spamd children are defined?

The two physical servers currently in the cluster have 12G physical RAM.

top claims only ~4G is in use - ~2G of that is cache. I'm not certain where the tmpfs is counted, but even in the worst case - that it's somehow lost in the ~8G reported as "free" - it's still only another 2G as currently configured.
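(FWIW, the tmpfs is a dedicated mount capped at 2G - something like this in /etc/fstab, path illustrative:

  tmpfs  /var/lib/mysql-tmpfs  tmpfs  size=2g,mode=0700  0 0

and as I understand it, tmpfs pages normally get counted under "cached" rather than against any one process, which would explain why they're hard to pick out in top.)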

Swap... is not happening. <g> If anything, during mail spikes, they're CPU-bound. The rest of the time they're mostly idle.

The complete spamd command looks like this (allowed-IP list redacted). --syslog-ident is a local patch to let us identify different spamd instances with different configurations all logging to the same log stream.

/usr/local/bin/spamd -d -x -q -r /var/run/spamd.pid --min-children=59 --min-spare=1 --max-spare=1 --max-conn-per-child=100 -m 60 -s local1 -u spamd --timeout-child=60 -i 0.0.0.0 -A <IP list> --syslog-ident spamd/main

MySQL for SA is running on one machine; the other machine also runs MySQL, but only for web-hosting logs.

> Can you capture "top" or other process stats while this is happening?

I've managed to catch it happening live a couple of times, and the only process eating CPU was spamd. Even MySQL was essentially idle.

>> Is this a Bayes update deadlock? (We use a global Bayes DB, currently
>> MySQL ISAM tables on a tmpfs.) Testing just before migrating to the
>> current hardware showed this was actually the *fastest* (and least
>> I/O-intensive) setup (comparing with InnoDB tables on disk, or
>> "memory" tables).

> Devoting memory to a tmpfs for bayes means less memory is available to
> spamd and makes it more likely you're hitting swap during a message
> burst...
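The tmpfs is a fixed 2G carve-out, so it can't grow into spamd's share. For reference, the glue on the SA side is just the stock SQL Bayes store - roughly this in local.cf (DSN and credentials illustrative):

  bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
  bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
  bayes_sql_username  sa_user
  bayes_sql_password  xxxxxx
  # one shared Bayes DB for all users:
  bayes_sql_override_username  bayes_global

pointed at the MyISAM-on-tmpfs database.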

> How is SA glued to your MTA?

Postfix, calling a custom delivery handler that does a variety of filtering including calling spamc. Functionally the equivalent of the example .procmailrc recipe.
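For reference, the documented recipe is roughly:

  :0fw: spamassassin.lock
  * < 256000
  | /usr/bin/spamc

(size guard and lock name as in the stock example; our handler is custom code, but it behaves like that pipe).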

The MXes are separate physical machines, so system load on that side doesn't inherently affect SA's performance.

> Can you enforce process limits there so
> that spamc doesn't just return a "can't scan" result if it gets overloaded?

Well, the current setup is designed to make sure mail *does* get scanned before delivery (assuming the account has spam filtering enabled). For some similar issues (simple slowdown of overall processing during blasts of legitimate mail) I've looked into dropping the delivery concurrency so we don't have more overall delivery attempts than CPU cores in the filter cluster (given other factors, we could probably go 2:1, possibly 3:1 on parallel delivery to spamd CPU cores).
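In Postfix terms that would just be a per-transport concurrency cap in main.cf - assuming the handler runs as a pipe transport named (hypothetically) "mailfilter" in master.cf, something like:

  # cap parallel deliveries into the filter at ~2x spamd CPU cores
  mailfilter_destination_concurrency_limit = 8

(transport name and number illustrative).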

Concurrency aside, the problem seems to be that some unknown event causes spamd to lock up *completely* - it will have one or two active children, a new connection will come in and start processing... and then spamd goes completely unresponsive, and won't even accept new connections.
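(If we were willing to fail open during one of these lockups, spamc's client-side timeout would at least bound how long each delivery waits, e.g.:

  # hand the message back unscanned if spamd doesn't answer within 30s
  spamc -t 30 < message

but the whole point of the current design is to avoid delivering unscanned mail.)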

> (it is possible it's database-related if you're using ISAM rather than
> InnoDB, but apart from asking "have you tried InnoDB on a tmpfs?" I'll
> let others pursue that...)

I thought about doing that, but I got stuck trying to work around the way InnoDB tables for *all* databases in the MySQL instance are stored in one great humongous file (or two), rather than split up per-database. (I'd like to kick whoever thought that was a good idea. Oy.)
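(There's apparently an innodb_file_per_table option that splits InnoDB storage into one tablespace file per table - untested here, but presumably

  [mysqld]
  innodb_file_per_table = 1

would be the starting point if we ever revisit InnoDB.)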

We've patched the init script (stock Debian) to load a dump of the SA database (Bayes, AWL, and userprefs) on system startup, and we decided a daily dump was "good enough" for backup purposes.
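The moving parts are trivial - conceptually just (database name and paths illustrative):

  # nightly cron job:
  mysqldump sa_bayes > /var/backups/sa_bayes.sql
  # in the init script, before spamd starts:
  mysql sa_bayes < /var/backups/sa_bayes.sql

since anything on the tmpfs evaporates on every reboot.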

-kgd
