On Fri, 27 Jan 2012, Kris Deugau wrote:
> John Hardin wrote:
>> How much memory do you have, and how many max spamd children are defined?
> The two physical servers currently in the cluster have 12G physical RAM.
That sounds adequate. :)
> top claims only ~4G is in use - ~2G of that is cache. I'm not certain where
> the tmpfs is counted, but even in the worst-case that it's managing to get
> lost in the ~8G reported as "free", it's still only another 2G as currently
> configured.
> Swap... is not happening. <g> If anything, during mail spikes, they're
> CPU-bound. The rest of the time they're mostly idle.
Sure sounds like swap ain't the problem.
> The complete spamd command looks like this (allowed-IP list redacted).
> --syslog-ident is a local patch to let us identify different spamd instances
> with different configurations all logging to the same log stream.
> /usr/local/bin/spamd -d -x -q -r /var/run/spamd.pid --min-children=59
> --min-spare=1 --max-spare=1 --max-conn-per-child=100 -m 60 -s local1 -u spamd
> --timeout-child=60 -i 0.0.0.0 -A <IP list> --syslog-ident spamd/main
Always 59 or more child processes?
>> Can you capture "top" or other process stats while this is happening?
> I've managed to catch it happening live a couple of times, and the only
> process eating CPU was spamd. Even MySQL was essentially idle.
I was more interested in the memory stats, but less so now given your
comments above.
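For catching the process and memory stats the next time this happens, a small polling script is one option. What follows is only a sketch, not something either poster ran: it assumes a procps-style ps that accepts "-eo rss=,pcpu=,comm=", and the log path, interval and function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: periodically record how many spamd processes exist and how much
# memory and CPU they use. LOG and INTERVAL are hypothetical defaults.
LOG=${LOG:-/tmp/spamd-stats.log}
INTERVAL=${INTERVAL:-5}

# Read "rss_kib pcpu comm" lines on stdin and print one summary line for
# processes whose command name starts with "spamd".
# Note: summed RSS counts memory shared between children more than once.
summarise_spamd() {
    awk '$3 ~ /^spamd/ { n++; rss += $1; cpu += $2 }
         END { printf "procs=%d rss_mib=%d cpu=%.1f\n", n, rss / 1024, cpu }'
}

# Print one timestamped summary line for the current process table.
snapshot() {
    printf '%s ' "$(date '+%Y-%m-%d %H:%M:%S')"
    ps -eo rss=,pcpu=,comm= | summarise_spamd
}

# Loop only when invoked with "run", so the functions can also be sourced.
if [ "${1:-}" = "run" ]; then
    while :; do
        snapshot >> "$LOG"
        sleep "$INTERVAL"
    done
fi
```

Started with "run" before a mail spike, this leaves a timestamped trail that can be lined up against the moment spamd stops responding.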
>> Can you enforce process limits there so
>> that spamc doesn't just return a "can't scan" result if it gets
>> overloaded?
> Well, the current setup is designed to make sure mail *does* get scanned
> before delivery (assuming the account has spam filtering enabled). For some
> similar issues (simple slowdown of overall processing during blasts of
> legitimate mail) I've looked into dropping the delivery concurrency so we
> don't have more overall delivery attempts than CPU cores in the filter
> cluster (given other factors, we could probably go 2:1, possibly 3:1 on
> parallel delivery to spamd CPU cores).
Bear in mind DNS lookup latency means your concurrency can probably safely
be 2:1.
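The ratios mentioned above translate into a delivery concurrency cap by simple multiplication. A tiny sketch of that arithmetic follows; the 16-core figure is a hypothetical example, since the thread does not give the core count of the filter cluster, and the function name is invented here.

```shell
#!/bin/sh
# Upper bound on parallel deliveries for a given number of spamd CPU cores
# and a deliveries-per-core ratio (2 or 3 in the discussion above).
max_delivery_concurrency() {
    cores=$1
    ratio=$2
    echo $(( cores * ratio ))
}

# Hypothetical cluster with 16 cores in total.
max_delivery_concurrency 16 2   # prints 32
max_delivery_concurrency 16 3   # prints 48
```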
> Concurrency aside, the problem seems to be that some unknown event causes
> spamd to lock up *completely* - it will have one or two active children, a
> new connection will come in, which will start processing... and then it goes
> completely unresponsive, won't even accept new connections.
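To pin down exactly when a lockup like this starts, a lightweight probe can ask spamd for a keep-alive reply instead of scanning a message. The sketch below uses spamc's -K (ping), -d, -p and -t options; the host, port and timeout values are placeholders, and treating a reply containing "PONG" as healthy is an assumption about spamc's output that should be checked against the installed version.

```shell
#!/bin/sh
# Sketch of a spamd liveness probe. All defaults below are hypothetical.
SPAMD_HOST=${SPAMD_HOST:-127.0.0.1}
SPAMD_PORT=${SPAMD_PORT:-783}
PROBE_TIMEOUT=${PROBE_TIMEOUT:-10}
SPAMC=${SPAMC:-spamc}

# Succeeds only if the ping reply contains "PONG" (assumed reply format).
# -t bounds how long spamc waits, so a wedged spamd shows up as a failure
# rather than hanging the probe.
probe_spamd() {
    "$SPAMC" -K -d "$SPAMD_HOST" -p "$SPAMD_PORT" -t "$PROBE_TIMEOUT" 2>/dev/null \
        | grep -q 'PONG'
}

# Print "ok" or "unresponsive" for a single probe attempt.
check_once() {
    if probe_spamd; then
        echo "ok"
    else
        echo "unresponsive"
    fi
}
```

Calling check_once from cron or a loop and logging the first "unresponsive" gives a timestamp to compare against the spamd log and any process stats collected at the time.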
Ouch. Yeah, that doesn't sound like swapping either.
At this point, I'd suggest (for no really good reason) reducing
--min-children to see whether that affects the behavior.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Christian martyrs don't explode. -- Marisol
-----------------------------------------------------------------------
Today: Wolfgang Amadeus Mozart's 256th Birthday