Re: Timeout-problem - additional information

Daryl C. W. O'Shea Mon, 10 Mar 2008 11:32:49 -0700

On 10/03/2008 6:36 AM, Johann Spies wrote:
> On Mon, Mar 10, 2008 at 05:24:24AM -0400, Daryl C. W. O'Shea wrote:
>> Try to roughly compare the actual amount of CPU time that the spamd
>> children are using on each server.  3.2 will use more resources than
> 
> How do I do that?  Just watching 'top' is not a reliable method I
> suspect.


Roughly. :)  Yeah, use top.  If messages are really taking an average of
a minute (or even a peak-adjusted average of ~15 seconds) to process, you
should easily be able to see whether a bunch of apparently busy children
aren't really doing anything (i.e. are using a percentage of CPU time much
less than 100% * #cores / #active children, less whatever the MTA is
consuming).
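
To make that rule of thumb concrete, here is a rough sketch of the
arithmetic.  The function name, its parameters and the sample figures are
made up for illustration; they are not measurements from either server.
Subtracting the MTA's usage from the total before dividing is just one
reading of "less whatever the MTA is consuming".

```python
def expected_cpu_per_child(cores, active_children, mta_cpu_percent=0.0):
    """Rough %CPU each spamd child would show in top if it were truly busy.

    Implements 100% * cores / active children, after first taking the
    MTA's share (in top's percent units) out of the total.
    """
    if cores < 1 or active_children < 1:
        raise ValueError("cores and active_children must be at least 1")
    available = max(100.0 * cores - mta_cpu_percent, 0.0)
    # Assumption: each child is single-threaded, so it cannot show more
    # than 100% of one core.
    return min(available / active_children, 100.0)


# Hypothetical box: 4 cores, 10 active children, MTA using about 40%.
share = expected_cpu_per_child(4, 10, mta_cpu_percent=40.0)
```

If the children in top sit far below that figure while messages still take
many seconds each, they are most likely waiting on something (DNS, for
instance) rather than working.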

>> Are the timeouts for the same zone(s)?  
> 
> Most of them are lookups against list.dsbl.org.   A dig
> 146.226.86.70.list.dsbl.org that timed out according to the log, took
> 352 milliseconds when checked by hand.  A little bit longer on the old
> one.

That's likely due to your geography.  You could reduce that by running a
copy of the zone locally with rbldnsd: http://dsbl.org/usage
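
If you want to repeat the manual check for more than one address, something
like the following might help.  It is only a sketch: the helper names are
invented, and the resolver is passed in as a plain function so you can swap
in whatever lookup you prefer.

```python
import ipaddress
import time


def dnsbl_query_name(ip, zone="list.dsbl.org"):
    """Build the DNSBL query name: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def time_lookup(name, resolve):
    """Call resolve(name) and return (answered, seconds elapsed).

    `resolve` is any callable that raises OSError when there is no answer,
    which is how socket.gethostbyname behaves.
    """
    start = time.monotonic()
    try:
        resolve(name)
        answered = True
    except OSError:
        answered = False
    return answered, time.monotonic() - start
```

For example, time_lookup(dnsbl_query_name("70.86.226.146"),
socket.gethostbyname) times the same name as the dig above.  Note that this
goes through the system resolver, so a cached answer will look much faster
than a fresh query.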

>> Test lookups against those zones manually.
>> Is your upstream (or downstream) bandwidth usage near full capacity?  
> 
> It is 92.5% full at the moment.

A number of other DNSBLs can be rsync'd and run locally to save you even
more bandwidth and time.

92% is OK so long as it's near a peak and there's no traffic shaping
going on (unless it's not a guaranteed-rate interface which has since
slowed down to where 92% is a lot closer to wh.  With regard to DNS,
congested links and performance don't mix.

>> Do the two servers share the same DNS setup?  
> 
> Yes

You would expect DNS or network congestion issues to affect both servers
equally.  However, remember that 3.0 will (to its detriment) time out DNS
queries a *lot* faster than 3.2.  So 3.0 will mask network issues at the
expense of accuracy, while 3.2 won't, at the expense of time (but not of
overall throughput, as long as you have the memory for the additional
children that the CPU now has time to run).

>> Is there something else running on the new server that is driving
>> the load average up (a common cause of the "child processing
>> timeout" message)?
> 
> The load average on the new server is lower than that on the old
> server - as expected.  For the past 24 hours the highest load average
> was 1.6.

That's fine then.  I suspect that the child timeouts you were/are seeing
due to the TVD_STOCK1 test are due to a bug triggered by a specific form
of input.  Capturing a message (preferably a few) that triggers the
error may confirm that.

>> A little more work... review the debug output for a bunch of messages
>> (you'll have to separate each message's debug info from the combined
>> debug log).  What parts of the scanning process are taking the most
>> amount of time?
> 
> 
> I will do that.

Great.  That's the best and most straightforward way to find out concretely
what's going on.
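
As a starting point for pulling the combined log apart, here is a rough
sketch.  It assumes syslog-style lines of the form
"Mar 10 11:32:01 host spamd[PID]: text" and groups them by child PID; your
log format may differ, and the function names and the default year are made
up for the example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout: "Mon DD HH:MM:SS host spamd[PID]: text"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+spamd\[(\d+)\]:\s+(.*)$"
)


def split_by_child(lines, year=2008):
    """Group log lines by spamd child PID as (timestamp, text) pairs."""
    by_pid = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a spamd line in the assumed format
        # Syslog timestamps carry no year, so one is supplied by the caller.
        stamp = datetime.strptime(
            "%d %s" % (year, m.group(1)), "%Y %b %d %H:%M:%S"
        )
        by_pid[int(m.group(2))].append((stamp, m.group(3)))
    return by_pid


def largest_gaps(entries, top=3):
    """Return the biggest pauses as (seconds, text of the line before)."""
    gaps = [
        ((t1 - t0).total_seconds(), text0)
        for (t0, text0), (t1, _) in zip(entries, entries[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

A child handles one message after another, so a single PID's lines can span
several messages; you would still need to cut each group at the point where
a new message starts.  The line just before a long pause usually points at
the stage that is slow.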

Daryl
