[Pdns-users] Spikey response times in powerdns recursor

2010-03-17 Thread Simon Bedford

Hi guys,

Apologies if this has been discussed before, but as a new mailing list 
user I have not seen anything.


We have been running the recursor as a caching name server for a number 
of months, having moved from Unbound. Since then we have seen good, in 
fact quick, DNS response times, but when running 3.1.7.1, 3.1.7.2 and 
also 3.2.1 we see random spikes of up to 2 seconds in response times, 
often at the quietest times for the name servers.


I had put this down to the 3.1 version after reading the changelog and 
bugs fixed in 3.2, but having upgraded we still see the same spiking, 
this time more frequent overnight but not quite as severe as before.


We are using hardware load balancers with 4 servers behind each, each 
server listens on multiple ports and I now have the recursor running on 
2 threads (a new feature in 3.2).


The servers have no real load and CPU is mostly 95% idle. They have 8G 
of RAM and never go over 2G used by the whole OS (Debian Etch) and software.


Graphs show norms of between 20 and 40ms, but the spikes are 700ms 
and over. As a result, our external monitoring and scoring against 
other companies suffers, and in the worst of circumstances the servers 
appear unavailable.


I realise that outside lookups will influence the results, but it's weird 
that the servers are more responsive at their busiest than when it's 
quiet, and that most of the unusual behaviour occurs at the quiet times.


Recursor performance graphing and dnsscope stats look OK, although the 
average time to respond goes up by 100% overnight; see sample stats below 
from overnight/this morning :-


Timespan: 0.828056 hours
Saw 4049548 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 99 
dns decoding errors, 0 bogus packets

3467 packets went unanswered, of which 1 were answered on exact retransmit
1047 answers could not be matched to questions
99 answers were unsatisfactory (indefinite, or SERVFAIL)
7764 answers (would be) discarded because older than 2 seconds
Rcode   Count
0   1482490
2   16215
3   166680
5   1
68.45% of questions answered within 50 usec (68.45%)
71.06% of questions answered within 100 usec (2.62%)
74.16% of questions answered within 200 usec (3.10%)
74.30% of questions answered within 250 usec (0.14%)
74.36% of questions answered within 300 usec (0.06%)
74.40% of questions answered within 350 usec (0.03%)
74.42% of questions answered within 400 usec (0.02%)
74.46% of questions answered within 800 usec (0.04%)
74.48% of questions answered within 1000 usec (0.02%)
77.90% of questions answered within 2.00 msec (3.42%)
79.80% of questions answered within 4.00 msec (1.90%)
80.72% of questions answered within 8.00 msec (0.92%)
84.03% of questions answered within 16.00 msec (3.31%)
85.88% of questions answered within 32.00 msec (1.85%)
87.11% of questions answered within 64.00 msec (1.23%)
93.06% of questions answered within 128.00 msec (5.95%)
96.86% of questions answered within 256.00 msec (3.80%)
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
100.00% of questions answered within 2048.00 msec (1.21%)
Average response time: 40419.9 usec

As opposed to a run when everything is OK :-

Timespan: 0.381944 hours
Saw 3929598 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 58 
dns decoding errors, 0 bogus packets

2098 packets went unanswered, of which 0 were answered on exact retransmit
4813 answers could not be matched to questions
58 answers were unsatisfactory (indefinite, or SERVFAIL)
1882 answers (would be) discarded because older than 2 seconds
Rcode   Count
0   1550451
2   7742
3   125547
5   16
70.36% of questions answered within 50 usec (70.36%)
73.27% of questions answered within 100 usec (2.91%)
76.53% of questions answered within 200 usec (3.26%)
76.82% of questions answered within 250 usec (0.29%)
76.96% of questions answered within 300 usec (0.14%)
77.04% of questions answered within 350 usec (0.08%)
77.09% of questions answered within 400 usec (0.05%)
77.18% of questions answered within 800 usec (0.09%)
77.20% of questions answered within 1000 usec (0.02%)
79.46% of questions answered within 2.00 msec (2.26%)
81.67% of questions answered within 4.00 msec (2.21%)
82.83% of questions answered within 8.00 msec (1.16%)
86.36% of questions answered within 16.00 msec (3.53%)
88.47% of questions answered within 32.00 msec (2.11%)
89.89% of questions answered within 64.00 msec (1.42%)
94.79% of questions answered within 128.00 msec (4.89%)
98.18% of questions answered within 256.00 msec (3.39%)
99.32% of questions answered within 512.00 msec (1.14%)
99.59% of questions answered within 1024.00 msec (0.28%)
100.00% of questions answered within 2048.00 msec (0.41%)
Average response time: 24119.3 usec
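
For anyone wanting to compare two such runs, here is a minimal sketch that reads the cumulative histogram lines and reports how many answers fall in the slow tail. The function names are hypothetical, and only the last three lines of each run above are included as sample input.

```python
import re

# Matches lines such as "98.79% of questions answered within 1024.00 msec (0.42%)".
LINE = re.compile(r"([\d.]+)% of questions answered within ([\d.]+) (usec|msec)")

def parse_histogram(text):
    """Return a list of (upper_bound_in_usec, cumulative_percent) pairs."""
    rows = []
    for match in LINE.finditer(text):
        cumulative, bound, unit = match.groups()
        scale = 1000.0 if unit == "msec" else 1.0
        rows.append((float(bound) * scale, float(cumulative)))
    return rows

def percent_slower_than(rows, bound_usec):
    """Percentage of answers that took longer than bound_usec."""
    for bound, cumulative in rows:
        if bound == bound_usec:
            return round(100.0 - cumulative, 2)
    raise ValueError("no such bucket")

overnight = """
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
100.00% of questions answered within 2048.00 msec (1.21%)
"""
normal = """
99.32% of questions answered within 512.00 msec (1.14%)
99.59% of questions answered within 1024.00 msec (0.28%)
100.00% of questions answered within 2048.00 msec (0.41%)
"""

for label, sample in (("overnight", overnight), ("normal", normal)):
    rows = parse_histogram(sample)
    print(label, percent_slower_than(rows, 1024000.0), "% slower than 1024 msec")
```

The tail beyond 1024 msec is about three times larger overnight (1.21% against 0.41%), while the bulk of the distribution is similar in both runs.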

None of this behaviour was seen in either Unbound or BIND. We moved from 
these because of other limitations/security concerns, but may have to 
look at moving back to Unbound if this persists.


Any 

Re: [Pdns-users] Spikey response times in powerdns recursor

2010-03-17 Thread bert hubert
On Wed, Mar 17, 2010 at 10:43:19AM +, Simon Bedford wrote:
 We have been running  recursor as a caching name server for a number
 of months having moved from unbound, since this time we see good, in
 fact quick DNS response time but then when running 3.1.7.1 and .2
 and also 3.2.1 we see random spikes up to 2 seconds for the response
 times often at the quietest of times for the name servers.

Versions below 3.2 can indeed sometimes show prolonged delays when running
with large caches. This issue is solved in 3.2.

 I had put this down to 3.1 version after reading the changelog and
 bugs fixed in 3.2 but having upgraded we still see the same spiking,
 this time more frequent over night but not quite as severe as they
 were.

As discussed off-list, you see these spikes for a number of domain names, at
least one of which has a short lived TTL and an unresponsive authoritative
server.

Mar 17 11:57:02 [5] bbc.co.uk.: Resolved 'bbc.co.uk.' NS ns1.bbc.co.uk. to: 
132.185.132.21
Mar 17 11:57:02 [5] bbc.co.uk.: Trying IP 132.185.132.21:53, asking 
'bbc.co.uk.|A'
Mar 17 11:57:04 [5] bbc.co.uk.: timeout resolving

bbc.co.uk has a 300 second ttl, and thus expires frequently.

 I realise that outside lookups will influence the results but it's
 weird that when at their busiest they are more responsive than when
 it's quiet and also have most of the unusual behaviour at that time.

When servers are busy, your monitoring system is unlikely to encounter
expired TTLs. This is why a busy server in fact provides superior service
compared to an idle one.
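
A rough way to see this effect, under two simplifying assumptions that are not a description of the recursor's internals: queries for a name arrive as a Poisson process, and the record is re-fetched on the first query after its TTL expires. Each cache miss is then followed on average by rate × TTL cache hits, so the fraction of queries that have to wait for the authoritative server is 1 / (1 + rate × TTL). The query rates below are hypothetical.

```python
def miss_fraction(queries_per_second, ttl_seconds):
    """Expected fraction of queries that find the record expired.

    Assumes Poisson arrivals and a TTL that starts counting at each miss.
    """
    return 1.0 / (1.0 + queries_per_second * ttl_seconds)

TTL = 300  # seconds, the bbc.co.uk TTL quoted above

# Hypothetical query rates for a single name, from busy to nearly idle.
for rate in (10.0, 1.0, 0.01, 0.001):
    share = miss_fraction(rate, TTL)
    print(f"{rate:>6} q/s -> {share:.2%} of queries find the record expired")
```

At busy rates almost every query is served from cache, while on a nearly idle server most queries for the name, including a monitoring probe, arrive after the record has expired.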

 Recursor performance graphing and dnsscope stats look OK although
 the average time to respond goes up by 100% overnight, see sample
 stats below from overnight/this morning :-

This matches the expectations.

Bert
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Spikey response times in powerdns recursor

2010-03-17 Thread Simon Bedford

bert hubert wrote:

On Wed, Mar 17, 2010 at 10:43:19AM +, Simon Bedford wrote:

We have been running  recursor as a caching name server for a number
of months having moved from unbound, since this time we see good, in
fact quick DNS response time but then when running 3.1.7.1 and .2
and also 3.2.1 we see random spikes up to 2 seconds for the response
times often at the quietest of times for the name servers.


Versions below 3.2 can indeed sometimes show prolonged delays when running
with large caches. This issue is solved in 3.2.


Understood




I had put this down to 3.1 version after reading the changelog and
bugs fixed in 3.2 but having upgraded we still see the same spiking,
this time more frequent over night but not quite as severe as they
were.


As discussed off-list, you see these spikes for a number of domain names, at
least one of which has a short lived TTL and an unresponsive authoritative
server.

Mar 17 11:57:02 [5] bbc.co.uk.: Resolved 'bbc.co.uk.' NS ns1.bbc.co.uk. to: 
132.185.132.21
Mar 17 11:57:02 [5] bbc.co.uk.: Trying IP 132.185.132.21:53, asking 
'bbc.co.uk.|A'
Mar 17 11:57:04 [5] bbc.co.uk.: timeout resolving


Looking at our graphs, though, it never used to happen before 23/12/09, 
and as you say this happens for at least 5 domain names that we monitor 
(some of which are our own and some external).




bbc.co.uk has a 300 second ttl, and thus expires frequently.


I realise that outside lookups will influence the results but it's
weird that when at their busiest they are more responsive than when
it's quiet and also have most of the unusual behaviour at that time.


When servers are busy, your monitoring system is unlikely to encounter
expired TTLs. This is why a busy server in fact provides superior service
compared to an idle one.


I did wonder about this and whether that would be the case.




Recursor performance graphing and dnsscope stats look OK although
the average time to respond goes up by 100% overnight, see sample
stats below from overnight/this morning :-


This matches the expectations.


Doubling overnight and acceptable to have 2 second look up times??  This 
is definitely not something that would be acceptable to our customers 
for valid domains...




Bert


Also, my previous post may appear to have been having a dig at the 
support I have received off-list from Bert, or to suggest that this is 
entirely due to the recursor software. Far from it: I have been delighted 
with the level of support received so far and really want to fix this 
issue and stay with PowerDNS.  I have sent our config, as I wasn't 
confident that we hadn't missed something.


I look forward to getting to the bottom of this and being a happy 
powerdns user.


Simon


Re: [Pdns-users] Spikey response times in powerdns recursor

2010-03-17 Thread bert hubert
On Wed, Mar 17, 2010 at 11:16:40AM +, Simon Bedford wrote:

 Mar 17 11:57:02 [5] bbc.co.uk.: Resolved 'bbc.co.uk.' NS ns1.bbc.co.uk. to: 
 132.185.132.21
 Mar 17 11:57:02 [5] bbc.co.uk.: Trying IP 132.185.132.21:53, asking 
 'bbc.co.uk.|A'
 Mar 17 11:57:04 [5] bbc.co.uk.: timeout resolving
 
 It never used to happen before 23/12/09 though looking at our graphs
 and as you say this happens for at least 5 domain names that we
 monitor (some of which are our own and some external).

bbc.co.uk still has a nameserver that is down, so having that domain resolve
slowly every once in a while is to be expected.

You've indicated you've occasionally seen 500ms lookup times for
google.com, but I have not heard of any other problems.

google.com takes between 0 and 100ms to resolve in my tests. 

 This matches the expectations.
 
 Doubling overnight and acceptable to have 2 second look up times??
 This is definitely not something that would be acceptable to our
 customers for valid domains...

These measurements from dnsscope are for _all_ domain names, not just valid
domains. Please do not think that I recommend 2 second lookup times. 

The reality is that a huge number of domains have unresponsive nameservers.
Your graph indicates that 1% of queries take between 1024 and 2048 msec to
resolve at night, and this is entirely to be expected.

A doubling of *average* response times, but still in the 40ms range, is
entirely to be expected on a server that is relatively idle at night.
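
To illustrate how much a small tail moves the average, here is a rough calculation using the figures from the two dnsscope runs. The 1536 msec midpoint for the slowest bucket is an assumption made for illustration; dnsscope does not report the mean latency inside a bucket.

```python
BUCKET_MIDPOINT_MS = 1536.0  # assumed mean latency inside the 1024-2048 msec bucket

def tail_contribution_ms(bucket_percent):
    """Milliseconds the slowest bucket adds to the overall average."""
    return bucket_percent / 100.0 * BUCKET_MIDPOINT_MS

runs = {
    # name: (percent of answers in the 1024-2048 msec bucket, reported average in msec)
    "overnight": (1.21, 40.4199),
    "normal": (0.41, 24.1193),
}

for name, (percent, average_ms) in runs.items():
    tail = tail_contribution_ms(percent)
    print(f"{name}: slowest bucket adds about {tail:.1f} of {average_ms:.1f} msec")
```

Under that assumption the slowest bucket alone accounts for roughly 12 of the 16 msec difference between the two averages.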

Bert


Re: [Pdns-users] Spikey response times in powerdns recursor

2010-03-17 Thread Stefan Schmidt

On 17.03.2010, at 12:37, Simon Bedford wrote:

 
 This is what is causing the mystery for me, when it's good it's really good but 
 then response times go crazy at a random time, it's dropped our customer 
 experience graphing from 99.987% to 89% (some of this will be the 3.1.7.2 
 cache maintenance bug though, in fact a larger proportion as we only have 1 
 of 4 upgraded to 3.2.1).


I think you mean version 3.2; there is no 3.2.1 yet, and as that point release 
would likely be a security fix, I thought I should point that out.

I wonder what you test for in your customer experience graphing. If it's only a 
set of, say, 500 of the most commonly used websites, it would have a considerable 
impact on your graph if, say, all the authoritative nameservers of a medium-sized 
ISP went down for a while, or if some links were overloaded.
The latter is frequently reported by http://internetpulse.net/ for example.

Neither I nor our customers have ever had any reason to complain about 
pdns_recursor's performance in this field, however. In fact our graphs showed it 
to give far fewer server failure responses than BIND9 for months, and packet loss 
or network delay situations were handled more gracefully too.

Stefan


[Pdns-users] EDNS support + default buffer size

2010-03-17 Thread Michael Fincham
Hi all,

I've just tested the PowerDNS Recursor 3.2 with its out of the box
configuration against the tests outlined at
https://www.dns-oarc.net/oarc/services/replysizetest

It seems that EDNS is disabled by default, which is confirmed by the
comment attached to changeset #1430
(http://wiki.powerdns.com/trac/changeset/1430)

Looking at the source it seems in 3.2 an option disable-edns=no was
added which turns EDNS support on. A cursory test here shows that adding
this to the stock config does cause the dns-oarc reply size test to
report a reply size of 1200 vs 512 when EDNS is off.

What is the status of EDNS support? Is it safe to rely on in production
environments? What specifically does the "nothing but trouble" comment
on the changeset refer to?

Also, the buffer size of 1200 appears to be hard coded. Is there any
particular reason for this value? I'm guessing it has to do with
avoiding fragmentation, but it'd be nice to know for sure.
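
For reference, EDNS0 advertises the UDP buffer size in the CLASS field of the OPT pseudo-record appended to a query. Below is a minimal sketch of such a query built by hand with the standard library. The query ID, the name example.com and the helper names are placeholders for illustration, and 1200 is simply the value observed above; this is not taken from the recursor source.

```python
import struct

def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_edns_query(name, udp_payload_size=1200, query_id=0x1234):
    """Build an A query carrying an EDNS0 OPT record in the additional section."""
    # Header: ID, flags (plain query, no bits set), QDCOUNT=1, ANCOUNT=0,
    # NSCOUNT=0, ARCOUNT=1 (the OPT record).
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 1)
    # Question: name, QTYPE=A (1), QCLASS=IN (1).
    question = encode_name(name) + struct.pack("!HH", 1, 1)
    # OPT record: root name, TYPE=41, CLASS=advertised UDP payload size,
    # TTL=0 (extended RCODE, version 0, no flags), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_payload_size, 0, 0)
    return header + question + opt

packet = build_edns_query("example.com")
advertised = struct.unpack("!H", packet[-8:-6])[0]
print(len(packet), "byte query, advertised UDP payload size", advertised)
```

A server that honours the OPT record may send UDP replies up to the advertised size; without it, replies are limited to 512 bytes, which matches the two values reported by the reply size test.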

Thanks,
-- 
-Michael Fincham
System Administrator, Unleash
www.unleash.co.nz
Phone: 0800 750 250
