John Hardin wrote:
> On Fri, 27 Jan 2012, Kris Deugau wrote:
>> Every so often, one of our spamd instances gets locked up when a burst
>> of messages with "lots" (150-200K+) of body text gets passed in.
>> If we catch this happening, restarting spamd seems to clear up
>> whatever gets deadlocked. Otherwise, it typically takes 10-15 minutes
>> to get unlocked, and then there's a big burst of processing as the
>> backlog clears.
> But it does eventually recover?
After ~10-15 minutes, according to the notifications from the monitoring
system and load balancers. Then the log file shows a run of new
connections and child-state entries showing a growing number of busy
children, and eventually the usual "spamd result" entries.
> Sounds like you're hitting swap. When that happens things *really* bog
> down.
*nod* I've been there on a much smaller, more cramped all-in-one
system. It took me a while to fine-tune all of the conflicting demands
on the physical RAM (outbound SMTP relay, Clam, SpamAssassin, webmail,
POP, IMAP) to get that system to behave.
> How much memory do you have, and how many max spamd children are defined?
The two physical servers currently in the cluster have 12G of physical
RAM. top claims only ~4G is in use, and ~2G of that is cache. I'm not
certain where the tmpfs is counted, but even in the worst case - that
it's managing to get lost in the ~8G reported as "free" - it's still
only another 2G as currently configured.
Swap... is not happening. <g> If anything, during mail spikes, they're
CPU-bound. The rest of the time they're mostly idle.
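(On where the tmpfs shows up: tmpfs pages are normally counted as
cache/Shmem rather than charged to any process, so something like the
following should pin down its real footprint. The mount point here is a
placeholder:)

  # actual tmpfs usage, independent of top's summary line
  df -h /var/lib/mysql-tmpfs
  # tmpfs-backed pages are counted under Shmem on recent kernels
  grep -i shmem /proc/meminfo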
The complete spamd command looks like this (allowed-IP list redacted).
--syslog-ident is a local patch to let us identify different spamd
instances with different configurations all logging to the same log stream.
/usr/local/bin/spamd -d -x -q -r /var/run/spamd.pid \
    --min-children=59 --min-spare=1 --max-spare=1 \
    --max-conn-per-child=100 -m 60 -s local1 -u spamd \
    --timeout-child=60 -i 0.0.0.0 -A <IP list> \
    --syslog-ident spamd/main
MySQL for SA is running on one machine, although it's also running on
the other for web hosting logs.
> Can you capture "top" or other process stats while this is happening?
I've managed to catch it happening live a couple of times, and the only
process eating CPU was spamd. Even MySQL was essentially idle.
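(For next time, a watchdog cron entry could snapshot process state in
batch mode while the lockup is in progress - file names here are
arbitrary:)

  # one batch-mode snapshot of everything, cron-friendly
  top -b -n 1 >> /var/log/spamd-snapshot.log
  # plus the spamd children, with wait-channel to hint at what they block on
  ps -o pid,stat,pcpu,rss,wchan:20,args -C spamd >> /var/log/spamd-snapshot.log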
>> Is this a Bayes update deadlock? (We use a global Bayes DB, currently
>> MySQL MyISAM tables on a tmpfs.) Testing just before migrating to the
>> current hardware showed this was actually the *fastest* (and least
>> I/O-intensive) setup (comparing with InnoDB tables on disk, or
>> "memory" tables).
> Devoting memory to a tmpfs for bayes means less memory is available to
> spamd and makes it more likely you're hitting swap during a message
> burst...
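(For reference, the global Bayes hookup is the stock SQL Bayes
configuration, something along these lines in local.cf - credentials
and database name are placeholders:)

  bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
  bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
  bayes_sql_username  sa_user
  bayes_sql_password  sa_pass
  # everyone shares one site-wide Bayes DB
  bayes_sql_override_username  global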
> How is SA glued to your MTA?
Postfix, calling a custom delivery handler that does a variety of
filtering including calling spamc. Functionally the equivalent of the
example .procmailrc recipe.
The MXes are separate physical machines, so system load on that side
doesn't inherently affect SA's performance.
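(That recipe, roughly as shipped with the SA docs, for anyone following
along - the size guard keeps procmail from feeding huge messages to
spamc:)

  :0fw: spamassassin.lock
  * < 256000
  | /usr/bin/spamc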
> Can you enforce process limits there so that spamc doesn't just return
> a "can't scan" result if it gets overloaded?
Well, the current setup is designed to make sure mail *does* get scanned
before delivery (assuming the account has spam filtering enabled). For
some similar issues (simple slowdown of overall processing during blasts
of legitimate mail) I've looked into dropping the delivery concurrency
so we don't have more overall delivery attempts than CPU cores in the
filter cluster (given other factors, we could probably go 2:1, possibly
3:1 on parallel delivery to spamd CPU cores).
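(In Postfix terms that would presumably be a per-transport concurrency
cap in main.cf; "filter" here is a placeholder for whatever transport
hands mail to the delivery handler:)

  # main.cf: cap parallel deliveries through the filtering transport
  filter_destination_concurrency_limit = 8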
Concurrency aside, the problem seems to be that some unknown event
causes spamd to lock up *completely* - it will have one or two active
children, a new connection will come in and start processing... and
then it goes completely unresponsive and won't even accept new
connections.
> (it is possible it's database-related if you're using MyISAM rather
> than InnoDB, but apart from asking "have you tried InnoDB on a tmpfs?"
> I'll let others pursue that...)
I thought about doing that, but I got stuck trying to work around the
way InnoDB tables for *all* databases in the MySQL instance are stored
in one great humongous file (or two), rather than split up per-database.
(I'd like to kick whoever thought that was a good idea. Oy.)
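(For the record, innodb_file_per_table splits InnoDB storage into one
.ibd file per table, though it only takes effect for tables created -
or rebuilt - after it's enabled:)

  # my.cnf
  [mysqld]
  innodb_file_per_table = 1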
We've patched the init script (stock Debian) to load a dump of the SA
database (Bayes, AWL, and userprefs) on system startup, and we decided a
daily dump was "good enough" for backup purposes.
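(The dump/restore plumbing itself is nothing fancy; paths and database
name are illustrative:)

  # nightly cron: snapshot the SA database to disk
  mysqldump sa > /var/backups/sa-daily.sql
  # init script, once MySQL is up on the freshly-mounted tmpfs
  mysql sa < /var/backups/sa-daily.sql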
-kgd