On Fri, 27 Jan 2012, Kris Deugau wrote:

John Hardin wrote:

 How much memory do you have, and how many max spamd children are defined?

The two physical servers currently in the cluster have 12G physical RAM.

That sounds adequate. :)

top claims only ~4G is in use - ~2G of that is cache. I'm not certain where the tmpfs is counted, but even in the worst case, where it's managing to get lost in the ~8G reported as "free", it's still only another 2G as currently configured.

Swap... is not happening. <g> If anything, during mail spikes, they're CPU-bound. The rest of the time they're mostly idle.

Sure sounds like swap ain't the problem.
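On the question of where tmpfs gets counted: as far as I know, on Linux kernels that report a Shmem line in /proc/meminfo, tmpfs pages show up there and are also included in Cached, not in MemFree. A minimal sketch for splitting those buckets apart, assuming that field is present (older kernels may not have it); the function names are made up for illustration:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Split memory into free / tmpfs+shm / file cache / swap used (kB)."""
    shmem = info.get("Shmem", 0)      # tmpfs and other shared memory
    cached = info.get("Cached", 0)    # page cache; assumed to include Shmem
    return {
        "free": info.get("MemFree", 0),
        "tmpfs_and_shm": shmem,
        "file_cache": cached - shmem,
        "swap_used": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }


def read_summary(path="/proc/meminfo"):
    """Convenience wrapper: read the live file and summarize it."""
    with open(path) as fh:
        return summarize(parse_meminfo(fh.read()))
```

Calling read_summary() on one of the filter boxes during a spike would show whether the tmpfs is anywhere near its configured 2G and whether any swap is in use at all.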

The complete spamd command looks like this (allowed-IP list redacted). --syslog-ident is a local patch to let us identify different spamd instances with different configurations all logging to the same log stream.

/usr/local/bin/spamd -d -x -q -r /var/run/spamd.pid --min-children=59 --min-spare=1 --max-spare=1 --max-conn-per-child=100 -m 60 -s local1 -u spamd --timeout-child=60 -i 0.0.0.0 -A <IP list> --syslog-ident spamd/main

Always 59 or more child processes?

 Can you capture "top" or other process stats while this is happening?

I've managed to catch it happening live a couple of times, and the only process eating CPU was spamd. Even MySQL was essentially idle.

I was more interested in the memory stats, but less so now given your comments above.

 Can you enforce process limits there so
 that spamc doesn't just return a "can't scan" result if it gets
 overloaded?

Well, the current setup is designed to make sure mail *does* get scanned before delivery (assuming the account has spam filtering enabled). For some similar issues (simple slowdown of overall processing during blasts of legitimate mail) I've looked into dropping the delivery concurrency so we don't have more overall delivery attempts than CPU cores in the filter cluster (given other factors, we could probably go 2:1, possibly 3:1 on parallel delivery to spamd CPU cores).

Bear in mind DNS lookup latency means your concurrency can probably safely be 2:1.
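The arithmetic behind that ratio is simple enough to sketch. This is only an illustration: the core count below is hypothetical (the thread doesn't give one), and the function name is invented. The optional upper bound reflects that there is no point allowing more parallel deliveries than spamd has children (-m 60 in the command above).

```python
def delivery_concurrency(cpu_cores, ratio=2.0, max_children=None):
    """Return a cap on parallel deliveries for a given number of spamd CPU cores.

    ratio is deliveries per core (2.0 means 2:1). If max_children is given,
    the cap never exceeds the number of spamd children available.
    """
    if cpu_cores < 1 or ratio <= 0:
        raise ValueError("cpu_cores must be >= 1 and ratio must be > 0")
    cap = max(1, int(cpu_cores * ratio))
    if max_children is not None:
        cap = min(cap, max_children)
    return cap


if __name__ == "__main__":
    # Hypothetical cluster: 16 cores total, spamd limited to 60 children.
    for ratio in (1.0, 2.0, 3.0):
        print(ratio, delivery_concurrency(16, ratio, max_children=60))
```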

Concurrency aside, the problem seems to be that some unknown event causes spamd to lock up *completely* - it will have one or two active children, a new connection will come in, which will start processing... and then it goes completely unresponsive, won't even accept new connections.

Ouch. Yeah, that doesn't sound like swapping either.
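To catch the lockup when nobody is watching, something like the probe below could run from cron or a monitoring loop. It is a rough sketch, not a tested tool: it assumes spamd is listening on the default port 783 and that the probing host is in the allowed-IP list, and it relies on the PING request from the spamd protocol as I understand it (the same check spamc -K performs). The function names are invented for illustration.

```python
import socket


def is_pong(reply):
    """True if reply looks like 'SPAMD/<version> 0 PONG'."""
    parts = reply.split()
    return (
        len(parts) >= 3
        and parts[0].startswith(b"SPAMD/")
        and parts[2] == b"PONG"
    )


def probe_spamd(host="127.0.0.1", port=783, timeout=5.0):
    """Classify spamd's state as 'ok', 'timeout', 'refused', 'error' or 'bad-reply'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"PING SPAMC/1.2\r\n\r\n")
            reply = sock.recv(256)
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"
    return "ok" if is_pong(reply) else "bad-reply"


if __name__ == "__main__":
    print(probe_spamd())
```

One caveat: if the parent process is wedged but the listening socket is still open, the kernel will usually complete the TCP handshake from the listen backlog anyway, so a hang of this kind is more likely to show up as "timeout" than as "refused".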

At this point, I'd suggest (for no really good reason) trying a lower min-children setting to see if it affects the behavior.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Christian martyrs don't explode.                         -- Marisol
-----------------------------------------------------------------------
 Today: Wolfgang Amadeus Mozart's 256th Birthday
