On 10/03/2008 6:36 AM, Johann Spies wrote:
> On Mon, Mar 10, 2008 at 05:24:24AM -0400, Daryl C. W. O'Shea wrote:
>> Try to roughly compare the actual amount of CPU time that the spamd
>> children are using on each server. 3.2 will use more resources than
>
> How do I do that? Just watching 'top' is not a reliable method I
> suspect.

Roughly. :) Yeah, use top. If messages are really taking an average of a minute (or even a peak-adjusted average of ~15 seconds) to process, you should be able to easily see if a bunch of apparently busy children aren't really doing anything (that is, they are using a percentage of CPU time much less than 100% * #cores / #active children, less whatever the MTA is consuming).

>> Are the timeouts for the same zone(s)?
>
> Most of them are lookups against list.dsbl.org. A dig
> 146.226.86.70.list.dsbl.org that timed out according to the log, took
> 352 milliseconds when checked by hand. A little bit longer on the old
> one.

That's likely due to your geography. You could reduce that by running a copy of the zone locally with rbldnsd: http://dsbl.org/usage

>> Test lookups against those zones manually.
>> Is your upstream (or downstream) bandwidth usage near full capacity?
>
> It is 92.5% full at the moment.

A number of other DNSBLs can be rsync'd and run locally to save you even more bandwidth and time.

92% is OK so long as it's near a peak and there's no traffic shaping going on (unless it's not a guaranteed rate interface which has since slowed down to where 92% is a lot closer to wh. With regards to DNS, congested links and performance don't mix.

>> Do the two servers share the same DNS setup?
>
> Yes

You would expect DNS or network congestion issues to affect both servers equally. However, remember that 3.0 will (to its detriment) time out DNS queries a *lot* faster than 3.2. So 3.0 will mask network issues at the expense of accuracy, while 3.2 won't, at the expense of time (but not overall throughput, as long as you have the memory for the additional children that the CPU now has time to run).

>> Is there something else running on the new server that is driving
>> the load average up (a common cause of the "child processing
>> timeout" message)?
>
> The load average on the new server is lower than that on the old
> server - as expected.
> For the past 24 hours the highest load average
> was 1.6.

That's fine then. I suspect that the child timeouts you were/are seeing due to the TVD_STOCK1 test are due to a bug triggered by a specific form of input. Capturing a message (preferably a few) that triggers the error may confirm that.

>> A little more work... review the debug output for a bunch of messages
>> (you'll have to separate each message's debug info from the combined
>> debug log). What parts of the scanning process are taking the most
>> amount of time?
>
> I will do that.

Great. That's the best and most straightforward way to tell concretely what's going on.

Daryl
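
P.S. In case it helps with the two rough checks above, here is a minimal Python sketch. It is only an illustration under a few assumptions: the function names (dnsbl_query_name, time_lookup, fair_cpu_share) are made up for this example, the lookup goes through whatever resolver the system is configured to use (so it can be cached and follows the resolver's own timeout, not SpamAssassin's), and subtracting the MTA's CPU usage before dividing is just one reading of the rule of thumb.

```python
import socket
import time
from typing import Tuple


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build a DNSBL query name: the IPv4 octets reversed, then the zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone


def time_lookup(name: str) -> Tuple[bool, float]:
    """Resolve `name` once via the system resolver.

    Returns (resolved, elapsed_ms). A False result covers both "not listed"
    (NXDOMAIN) and resolver failures; this sketch does not tell them apart.
    """
    start = time.perf_counter()
    try:
        socket.gethostbyname(name)
        resolved = True
    except OSError:  # socket.gaierror is a subclass of OSError
        resolved = False
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return resolved, elapsed_ms


def fair_cpu_share(cores: int, active_children: int, mta_percent: float = 0.0) -> float:
    """Rough per-child CPU percentage if all active children were truly busy.

    Assumption: the MTA's usage (in top-style percent, where one full core
    is 100) is subtracted from the total before dividing among children.
    """
    if active_children <= 0:
        raise ValueError("active_children must be positive")
    return max(0.0, (100.0 * cores - mta_percent) / active_children)
```

For example, dnsbl_query_name("70.86.226.146", "list.dsbl.org") gives 146.226.86.70.list.dsbl.org, the name you dug by hand, and time_lookup() on that name gives a figure to set against the timeout in the log. Children sitting well below fair_cpu_share() in top are most likely waiting on something (DNS, for instance), not scanning.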