[Pdns-users] Spikey response times in powerdns recursor
Hi guys,

Apologies if this has been discussed before, but as a new mailing list user I have not seen anything.

We have been running recursor as a caching name server for a number of months, having moved from Unbound. Since then we have seen good, in fact quick, DNS response times, but when running 3.1.7.1 and .2 and also 3.2.1 we see random spikes of up to 2 seconds in the response times, often at the quietest of times for the name servers. I had put this down to the 3.1 version after reading the changelog and the bugs fixed in 3.2, but having upgraded we still see the same spiking, this time more frequent overnight but not quite as severe as before.

We are using hardware load balancers with 4 servers behind each; each server listens on multiple ports, and I now have the recursor running on 2 threads (a new feature in 3.2). The servers have no real load and the CPU is mostly 95% idle; they have 8G of RAM and never go over 2G used by the whole OS (Debian Etch) and software.

Graphs show norms of between 20 and 40ms, but the spikes are 700ms and over. This results in our external monitoring and scoring against other companies suffering, and in the worst of circumstances we are recorded as unavailable. I realise that outside lookups will influence the results, but it's weird that the servers are more responsive at their busiest than when it's quiet, and that most of the unusual behaviour happens at the quiet times.
Recursor performance graphing and dnsscope stats look OK, although the average time to respond goes up by 100% overnight. See sample stats below from overnight/this morning :-

Timespan: 0.828056 hours
Saw 4049548 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 99 dns decoding errors, 0 bogus packets
3467 packets went unanswered, of which 1 were answered on exact retransmit
1047 answers could not be matched to questions
99 answers were unsatisfactory (indefinite, or SERVFAIL)
7764 answers (would be) discarded because older than 2 seconds

Rcode   Count
0       1482490
2       16215
3       166680
5       1

68.45% of questions answered within 50 usec (68.45%)
71.06% of questions answered within 100 usec (2.62%)
74.16% of questions answered within 200 usec (3.10%)
74.30% of questions answered within 250 usec (0.14%)
74.36% of questions answered within 300 usec (0.06%)
74.40% of questions answered within 350 usec (0.03%)
74.42% of questions answered within 400 usec (0.02%)
74.46% of questions answered within 800 usec (0.04%)
74.48% of questions answered within 1000 usec (0.02%)
77.90% of questions answered within 2.00 msec (3.42%)
79.80% of questions answered within 4.00 msec (1.90%)
80.72% of questions answered within 8.00 msec (0.92%)
84.03% of questions answered within 16.00 msec (3.31%)
85.88% of questions answered within 32.00 msec (1.85%)
87.11% of questions answered within 64.00 msec (1.23%)
93.06% of questions answered within 128.00 msec (5.95%)
96.86% of questions answered within 256.00 msec (3.80%)
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
100.00% of questions answered within 2048.00 msec (1.21%)
Average response time: 40419.9 usec

As opposed to a run when everything is OK :-

Timespan: 0.381944 hours
Saw 3929598 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 58 dns decoding errors, 0 bogus packets
2098 packets went unanswered, of which 0 were answered on exact retransmit
4813 answers could not be matched to questions
58 answers were unsatisfactory (indefinite, or SERVFAIL)
1882 answers (would be) discarded because older than 2 seconds

Rcode   Count
0       1550451
2       7742
3       125547
5       16

70.36% of questions answered within 50 usec (70.36%)
73.27% of questions answered within 100 usec (2.91%)
76.53% of questions answered within 200 usec (3.26%)
76.82% of questions answered within 250 usec (0.29%)
76.96% of questions answered within 300 usec (0.14%)
77.04% of questions answered within 350 usec (0.08%)
77.09% of questions answered within 400 usec (0.05%)
77.18% of questions answered within 800 usec (0.09%)
77.20% of questions answered within 1000 usec (0.02%)
79.46% of questions answered within 2.00 msec (2.26%)
81.67% of questions answered within 4.00 msec (2.21%)
82.83% of questions answered within 8.00 msec (1.16%)
86.36% of questions answered within 16.00 msec (3.53%)
88.47% of questions answered within 32.00 msec (2.11%)
89.89% of questions answered within 64.00 msec (1.42%)
94.79% of questions answered within 128.00 msec (4.89%)
98.18% of questions answered within 256.00 msec (3.39%)
99.32% of questions answered within 512.00 msec (1.14%)
99.59% of questions answered within 1024.00 msec (0.28%)
100.00% of questions answered within 2048.00 msec (0.41%)
Average response time: 24119.3 usec

None of this behaviour was seen in either Unbound or Bind. We moved from these because of other limitations/security concerns, but may have to look at moving back to Unbound if this persists.

Any
Re: [Pdns-users] Spikey response times in powerdns recursor
On Wed, Mar 17, 2010 at 10:43:19AM +, Simon Bedford wrote:
> We have been running recursor as a caching name server for a number of months having moved from unbound, since this time we see good, in fact quick DNS response time but then when running 3.1.7.1 and .2 and also 3.2.1 we see random spikes up to 2 seconds for the response times often at the quietest of times for the name servers.

Versions below 3.2 can indeed sometimes show prolonged delays when running with large caches. This issue is solved in 3.2.

> I had put this down to 3.1 version after reading the changelog and bugs fixed in 3.2 but having upgraded we still see the same spiking, this time more frequent over night but not quite as severe as they were.

As discussed off-list, you see these spikes for a number of domain names, at least one of which has a short-lived TTL and an unresponsive authoritative server.

Mar 17 11:57:02 [5] bbc.co.uk.: Resolved 'bbc.co.uk.' NS ns1.bbc.co.uk. to: 132.185.132.21
Mar 17 11:57:02 [5] bbc.co.uk.: Trying IP 132.185.132.21:53, asking 'bbc.co.uk.|A'
Mar 17 11:57:04 [5] bbc.co.uk.: timeout resolving

bbc.co.uk has a 300 second TTL, and thus expires frequently.

> I realise that outside lookups will influence the results but its weird that when at their busiest they are more responsive than when its quiet and also have most of the unusual behaviour at that time.

When servers are busy, your monitoring system is unlikely to encounter expired TTLs, because other clients' queries keep refreshing the cache. This is why a busy server in fact provides superior service compared to an idle one.

> Recursor performance graphing and dnsscope stats look OK although the average time to respond goes up by 100% overnight, see sample stats below from overnight/this morning :-

This matches the expectations.

Bert
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users
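The point about busy versus idle servers can be made concrete with a rough back-of-the-envelope model. The sketch below is not taken from the recursor; it assumes client queries for one name arrive as a Poisson process, that the record is cached for exactly its TTL, and that the first query after expiry refills the cache instantly. The function name and the example query rates are made up for illustration. Under those assumptions the record spends on average TTL seconds cached and 1/rate seconds expired per cycle, so a monitoring probe arriving at a random moment finds it expired with probability 1 / (1 + rate * TTL).

```python
def cold_probe_probability(client_qps: float, ttl_seconds: float) -> float:
    """Chance that a probe arriving at a random moment finds the record expired.

    Simplified renewal model (an assumption, not recursor behaviour):
    each cycle is ttl_seconds of "cached" followed by an average of
    1/client_qps seconds of "expired" until the next client query refills it.
    """
    if client_qps <= 0:
        # No other client traffic at all: nothing refreshes the record,
        # so a probe arriving at a random moment finds it expired.
        return 1.0
    mean_gap = 1.0 / client_qps
    return mean_gap / (ttl_seconds + mean_gap)


if __name__ == "__main__":
    ttl = 300.0  # the bbc.co.uk TTL quoted in the thread
    # Hypothetical per-name query rates, chosen only to contrast day and night.
    for label, qps in [("busy", 5.0), ("quiet", 0.002)]:
        p = cold_probe_probability(qps, ttl)
        print(f"{label:5s} {qps:7.3f} q/s -> probe hits an expired record {p:.2%} of the time")
```

A probe that does land on an expired record pays for the full upstream lookup, including any timeout against an unresponsive authoritative server such as the two-second one in the log above.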
Re: [Pdns-users] Spikey response times in powerdns recursor
bert hubert wrote:
> On Wed, Mar 17, 2010 at 10:43:19AM +, Simon Bedford wrote:
>> We have been running recursor as a caching name server for a number of months having moved from unbound, since this time we see good, in fact quick DNS response time but then when running 3.1.7.1 and .2 and also 3.2.1 we see random spikes up to 2 seconds for the response times often at the quietest of times for the name servers.
> Versions below 3.2 can indeed sometimes show prolonged delays when running with large caches. This issue is solved in 3.2.

Understood.

>> I had put this down to 3.1 version after reading the changelog and bugs fixed in 3.2 but having upgraded we still see the same spiking, this time more frequent over night but not quite as severe as they were.
> As discussed off-list, you see these spikes for a number of domain names, at least one of which has a short lived TTL and an unresponsive authoritative server.
> Mar 17 11:57:02 [5] bbc.co.uk.: Resolved 'bbc.co.uk.' NS ns1.bbc.co.uk. to: 132.185.132.21
> Mar 17 11:57:02 [5] bbc.co.uk.: Trying IP 132.185.132.21:53, asking 'bbc.co.uk.|A'
> Mar 17 11:57:04 [5] bbc.co.uk.: timeout resolving

Looking at our graphs, though, it never used to happen before 23/12/09, and as you say this happens for at least 5 domain names that we monitor (some of which are our own and some external).

> bbc.co.uk has a 300 second ttl, and thus expires frequently.
>> I realise that outside lookups will influence the results but its weird that when at their busiest they are more responsive than when its quiet and also have most of the unusual behaviour at that time.
> When servers are busy, your monitoring system is unlikely to encounter expired TTLs. This is why a busy server in fact provides superior service compared to an idle one.

I did wonder about this and whether that would be the case.

>> Recursor performance graphing and dnsscope stats look OK although the average time to respond goes up by 100% overnight, see sample stats below from overnight/this morning :-
> This matches the expectations.

Doubling overnight, and acceptable to have 2 second lookup times?? This is definitely not something that would be acceptable to our customers for valid domains...

> Bert

Also, my previous post may appear to have been having a dig at the support I have received off-list from Bert, or to suggest that this is entirely due to the recursor software. Far from it: I have been delighted with the level of support received so far and really want to fix this issue and stay with PowerDNS. I have sent our config, as I wasn't confident that we hadn't missed something as well. I look forward to getting to the bottom of this and being a happy PowerDNS user.

Simon
Re: [Pdns-users] Spikey response times in powerdns recursor
On Wed, Mar 17, 2010 at 11:16:40AM +, Simon Bedford wrote:
>> Mar 17 11:57:02 [5] bbc.co.uk.: Resolved 'bbc.co.uk.' NS ns1.bbc.co.uk. to: 132.185.132.21
>> Mar 17 11:57:02 [5] bbc.co.uk.: Trying IP 132.185.132.21:53, asking 'bbc.co.uk.|A'
>> Mar 17 11:57:04 [5] bbc.co.uk.: timeout resolving
> It never used to happen before 23/12/09 though looking at our graphs and as you say this happens for at least 5 domain names that we monitor (some of which are our own and some external).

bbc.co.uk still has a nameserver that is down, so having that domain resolve slowly every once in a while is to be expected.

You've indicated you've occasionally seen 500ms lookup times for google.com, but I have not heard of any other problems. google.com takes between 0 and 100ms to resolve in my tests.

>> This matches the expectations.
> Doubling overnight and acceptable to have 2 second look up times?? This is definitely not something that would be acceptable to our customers for valid domains...

These measurements from dnsscope are for _all_ domain names, not just valid domains. Please do not think that I recommend 2 second lookup times. The reality is that a huge number of domains have unresponsive nameservers.

Your graph indicates that 1% of queries take between 1024 and 2048 msec to resolve at night, and this is entirely to be expected. A doubling of *average* response times, while still in the 40ms range, is entirely to be expected on a server that is relatively idle at night.

Bert
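To see how strongly that slowest 1% drives the average, here is a small arithmetic sketch using only figures from the dnsscope output earlier in the thread: the share of answers in the 1024 to 2048 msec bucket (1.21% overnight, 0.41% in the good run) and the reported averages (40419.9 usec and 24119.3 usec). The helper name is made up, and since dnsscope only reports bucket edges the result is a lower and upper bound rather than an exact figure.

```python
def bucket_contribution_ms(share_percent: float, low_ms: float, high_ms: float):
    """Return (lower, upper) bounds on what one latency bucket adds to the mean, in ms.

    share_percent is the fraction of all answers that fell in the bucket;
    every such answer took at least low_ms and at most high_ms.
    """
    share = share_percent / 100.0
    return share * low_ms, share * high_ms


# Shares of the 1024-2048 msec bucket, as printed by dnsscope in the thread.
overnight = bucket_contribution_ms(1.21, 1024.0, 2048.0)
good_run = bucket_contribution_ms(0.41, 1024.0, 2048.0)

if __name__ == "__main__":
    print(f"overnight: slowest bucket adds {overnight[0]:.1f} to {overnight[1]:.1f} ms "
          f"of a 40.4 ms average")
    print(f"good run:  slowest bucket adds {good_run[0]:.1f} to {good_run[1]:.1f} ms "
          f"of a 24.1 ms average")
```

Under these bounds the slowest bucket alone accounts for a large part of both averages, which is consistent with the observation that the mean is dominated by a small number of lookups against unresponsive nameservers rather than by typical queries.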
Re: [Pdns-users] Spikey response times in powerdns recursor
On 17.03.2010, at 12:37, Simon Bedford wrote:
> This is what is causing the mystery for me, when its good its really good but then response times go crazy at a random time, its dropped our customer experience graphing from 99.987% to 89% (some of this will be the 3.1.7.2 cache maintenance bug though, in fact a larger proportion as we only have 1 of 4 upgraded to 3.2.1).

I think you mean version 3.2; there is no 3.2.1 yet, and as such a point release would likely be a security fix I thought I should point that out.

I wonder what you test for in your customer experience graphing. If it's only a set of, say, 500 of the most commonly used websites, it would make a considerable impact on your graph should, say, all the authoritative nameservers of a medium-sized ISP go down for a while, or when some links are overloaded. The latter is frequently reported by http://internetpulse.net/ for example.

Neither I nor our customers have ever had any reason to complain about pdns_recursor's performance in this field, however. In fact our graphs showed it to give far fewer server failure responses than BIND9 for months, and packet loss or network delay situations were handled more gracefully too.

Stefan
[Pdns-users] EDNS support + default buffer size
Hi all,

I've just tested the PowerDNS Recursor 3.2 with its out-of-the-box configuration against the tests outlined at https://www.dns-oarc.net/oarc/services/replysizetest

It seems that EDNS is disabled by default, which is confirmed by the comment attached to changeset #1430 (http://wiki.powerdns.com/trac/changeset/1430).

Looking at the source, it seems that in 3.2 an option disable-edns=no was added which turns EDNS support on. A cursory test here shows that adding this to the stock config does cause the dns-oarc reply size test to report a reply size of 1200, vs 512 when EDNS is off.

What is the status of EDNS support? Is it safe to rely on in production environments? What specifically does the "nothing but trouble" comment on the changeset refer to?

Also, the buffer size of 1200 appears to be hard coded. Is there any particular reason for this value? I'm guessing it has to do with avoiding fragmentation, but it'd be nice to know for sure.

Thanks,

--
-Michael Fincham
System Administrator, Unleash
www.unleash.co.nz
Phone: 0800 750 250
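For readers unfamiliar with where the 1200 figure lives on the wire, the following is a minimal sketch of a DNS query carrying an EDNS0 OPT pseudo-record, built with nothing but the Python standard library. It says nothing about how the recursor itself constructs packets; the function name, the fixed query ID and the example.com name are placeholders chosen for illustration. In EDNS0 the sender advertises the largest UDP reply it will accept in the CLASS field of the OPT record, and without that record the classic 512 byte limit applies.

```python
import struct
from typing import Optional


def build_query(name: str, qtype: int = 16,
                udp_payload_size: Optional[int] = 1200,
                query_id: int = 0x1234) -> bytes:
    """Build a DNS query; include an EDNS0 OPT record unless udp_payload_size is None."""
    arcount = 0 if udp_payload_size is None else 1
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    # QNAME: length-prefixed labels terminated by the zero-length root label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE (16 = TXT), QCLASS IN
    if udp_payload_size is None:
        return header + question
    # OPT pseudo-RR: root name, TYPE 41, CLASS = advertised UDP payload size,
    # TTL field (extended RCODE, version, flags) all zero, RDLENGTH 0.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_payload_size, 0, 0)
    return header + question + opt


if __name__ == "__main__":
    plain = build_query("example.com", udp_payload_size=None)
    edns = build_query("example.com")
    print(f"without EDNS: {len(plain)} bytes, with EDNS: {len(edns)} bytes")
    print("OPT record:", edns[len(plain):].hex())
```

The OPT record adds 11 bytes to the query, and the two bytes following the type code carry the advertised size, here the 1200 mentioned above.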