John Hardin wrote:
> On Fri, 27 Jan 2012, Kris Deugau wrote:
>> Every so often, one of our spamd instances gets locked up when a burst
>> of messages with "lots" (150-200K+) of body text gets passed in.
>> If we catch this happening, restarting spamd seems to clear up
>> whatever gets deadlocked. Otherwise, it typically takes 10-15 minutes
>> to get unlocked, and then there's a big burst of processing as the
>> backlog clears.
> But it does eventually recover?
After ~10-15 minutes, according to the notifications from the monitoring
system and load balancers. Then the log file shows a run of new
connections and child-state entries showing a growing number of busy
children, and eventually the usual "spamd result" entries.
> Sounds like you're hitting swap. When that happens things *really* bog
> down.
*nod* I've been there on a much smaller, more cramped all-in-one
system. It took me a while to fine-tune all of the conflicting demands
on the physical RAM (outbound SMTP relay, Clam, SpamAssassin, webmail,
POP, IMAP) to get that system to behave.
> How much memory do you have, and how many max spamd children are defined?
The two physical servers currently in the cluster have 12G of physical
RAM. top claims only ~4G is in use, and ~2G of that is cache. I'm not
certain where the tmpfs is counted, but even in the worst case - that
it's managing to get lost in the ~8G reported as "free" - it's still
only another 2G as currently configured.
Swap... is not happening. <g> If anything, during mail spikes, they're
CPU-bound. The rest of the time they're mostly idle.
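(On where the tmpfs shows up: tmpfs pages are normally counted as
cache/Shmem rather than charged to any process, so something like the
following should pin down its real footprint. The mount point here is a
placeholder:)

  # actual tmpfs usage, independent of top's summary line
  df -h /var/lib/mysql-tmpfs
  # tmpfs-backed pages are counted under Shmem on recent kernels
  grep -i shmem /proc/meminfo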
The complete spamd command looks like this (allowed-IP list redacted).
--syslog-ident is a local patch to let us identify different spamd
instances with different configurations all logging to the same log stream.
/usr/local/bin/spamd -d -x -q -r /var/run/spamd.pid \
    --min-children=59 --min-spare=1 --max-spare=1 \
    --max-conn-per-child=100 -m 60 -s local1 -u spamd \
    --timeout-child=60 -i 0.0.0.0 -A <IP list> \
    --syslog-ident spamd/main
MySQL for SA is running on one machine, although it's also running on
the other for web hosting logs.
> Can you capture "top" or other process stats while this is happening?
I've managed to catch it happening live a couple of times, and the only
process eating CPU was spamd. Even MySQL was essentially idle.
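(For next time, a watchdog cron entry could snapshot process state in
batch mode while the lockup is in progress - file names here are
arbitrary:)

  # one batch-mode snapshot of everything, cron-friendly
  top -b -n 1 >> /var/log/spamd-snapshot.log
  # plus the spamd children, with wait-channel to hint at what they block on
  ps -o pid,stat,pcpu,rss,wchan:20,args -C spamd >> /var/log/spamd-snapshot.log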
>> Is this a Bayes update deadlock? (We use a global Bayes DB, currently
>> MySQL MyISAM tables on a tmpfs.) Testing just before migrating to the
>> current hardware showed this was actually the *fastest* (and least
>> I/O-intensive) setup (comparing with InnoDB tables on disk, or
>> "memory" tables).
> Devoting memory to a tmpfs for bayes means less memory is available to
> spamd and makes it more likely you're hitting swap during a message
> burst...
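(For reference, the global Bayes hookup is the stock SQL Bayes
configuration, something along these lines in local.cf - credentials
and database name are placeholders:)

  bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
  bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
  bayes_sql_username  sa_user
  bayes_sql_password  sa_pass
  # everyone shares one site-wide Bayes DB
  bayes_sql_override_username  global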
> How is SA glued to your MTA?
Postfix, calling a custom delivery handler that does a variety of
filtering including calling spamc. Functionally the equivalent of the
example .procmailrc recipe.
The MXes are separate physical machines, so system load on that side
doesn't inherently affect SA's performance.
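(That recipe, roughly as shipped with the SA docs, for anyone following
along - the size guard keeps procmail from feeding huge messages to
spamc:)

  :0fw: spamassassin.lock
  * < 256000
  | /usr/bin/spamc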
> Can you enforce process limits there so that spamc doesn't just return
> a "can't scan" result if it gets overloaded?
Well, the current setup is designed to make sure mail *does* get scanned
before delivery (assuming the account has spam filtering enabled). For
some similar issues (simple slowdown of overall processing during blasts
of legitimate mail) I've looked into dropping the delivery concurrency
so we don't have more overall delivery attempts than CPU cores in the
filter cluster (given other factors, we could probably go 2:1, possibly
3:1 on parallel delivery to spamd CPU cores).
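(In Postfix terms that would presumably be a per-transport concurrency
cap in main.cf; "filter" here is a placeholder for whatever transport
hands mail to the delivery handler:)

  # main.cf: cap parallel deliveries through the filtering transport
  filter_destination_concurrency_limit = 8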
Concurrency aside, the problem seems to be that some unknown event
causes spamd to lock up *completely* - it will have one or two active
children, a new connection will come in and start processing... and
then it goes completely unresponsive and won't even accept new
connections.
> (it is possible it's database-related if you're using MyISAM rather
> than InnoDB, but apart from asking "have you tried InnoDB on a tmpfs?"
> I'll let others pursue that...)
I thought about doing that, but I got stuck trying to work around the
way InnoDB tables for *all* databases in the MySQL instance are stored
in one great humongous file (or two), rather than split up per-database.
(I'd like to kick whoever thought that was a good idea. Oy.)
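(For the record, innodb_file_per_table splits InnoDB storage into one
.ibd file per table, though it only takes effect for tables created -
or rebuilt - after it's enabled:)

  # my.cnf
  [mysqld]
  innodb_file_per_table = 1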
We've patched the init script (stock Debian) to load a dump of the SA
database (Bayes, AWL, and userprefs) on system startup, and we decided a
daily dump was "good enough" for backup purposes.
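(The dump/restore plumbing itself is nothing fancy; paths and database
name are illustrative:)

  # nightly cron: snapshot the SA database to disk
  mysqldump sa > /var/backups/sa-daily.sql
  # init script, once MySQL is up on the freshly-mounted tmpfs
  mysql sa < /var/backups/sa-daily.sql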
-kgd