[Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Hey guys,

I've been having a problem with recursion. For some reason, certain domains seem to throw SERVFAIL errors when dug most of the time, but then NOERROR with a correct response at other random times. For example:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2636
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 0 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Wed Sep 3 13:36:33 2014
;; MSG SIZE rcvd: 36

And then, a few hours later:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56751
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; ANSWER SECTION:
toyotasupplier.com.     18296   IN      A       12.169.52.71

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Thu Sep 4 10:39:38 2014
;; MSG SIZE rcvd: 52

And then, a few hours later still:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5171
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 3017 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep 5 07:50:25 2014
;; MSG SIZE rcvd: 36

All without making a single change.

I have been working on debugging this for two days now and absolutely cannot pinpoint a source for the issue. I've increased the max query lengths, the recursor's network and client TCP timeouts, restarted the service several times on several of our DNS servers, and nothing I do seems to fix it. It of course doesn't help that the bug is a bit of a gremlin and keeps mischievously disappearing at random (and in fact never, to my knowledge, happened before until about a week ago, when it started to occur for no apparent reason).

Any idea on what could be causing this?

FWIW, when I run dig toyotasupplier.com ns it consistently works fine:

root@yoshi:/# dig toyotasupplier.com ns

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com ns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39522
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      NS

;; ANSWER SECTION:
toyotasupplier.com.     50741   IN      NS      gslb-ns2.toyota-na.com.
toyotasupplier.com.     50741   IN      NS      gslb-ns1.toyota-na.com.

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep 5 07:49:29 2014
;; MSG SIZE rcvd: 92

Many thanks in advance,

Todd W. Smith
IP Services Technician
2331 East 600 North
Greenfield, IN 46140
(317) 323-2021
tsm...@ninestarconnect.com
www.ninestarconnect.com

___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users
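Since the failure comes and goes, it may help to record the status at fixed intervals rather than rely on occasional manual checks. Below is a rough Python sketch of that idea, not something from the thread: it assumes dig is installed and on the PATH, and the names parse_status and poll are made up for illustration.

```python
import re
import subprocess
import time
from collections import Counter

# Matches the status field in dig's header line, e.g. "status: SERVFAIL, id: 2636".
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def parse_status(dig_output):
    """Return the status found in dig output, or "NO_RESPONSE" if there is none."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else "NO_RESPONSE"


def poll(name, server, count=10, interval=60.0):
    """Run dig `count` times against `server` and tally the statuses seen."""
    tally = Counter()
    for i in range(count):
        # Assumption: a `dig` binary is available on PATH.
        result = subprocess.run(
            ["dig", name, "A", "@" + server],
            capture_output=True,
            text=True,
        )
        status = parse_status(result.stdout)
        tally[status] += 1
        print(time.strftime("%Y-%m-%d %H:%M:%S"), server, name, status)
        if i < count - 1:
            time.sleep(interval)
    return tally
```

For instance, poll("toyotasupplier.com", "208.88.248.25", count=60, interval=60) would log one result per minute for an hour and return the counts per status. Bear in mind that a cached answer on the resolver will keep returning NOERROR until its TTL runs out, so the tally reflects what clients see rather than every upstream attempt.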
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Hey Brian,

That would make perfect sense, and I was thinking along similar lines, but if that's the case, why do I get a consistent NOERROR when using Google DNS? Google's cache perhaps?

root@yoshi:/# dig toyotasupplier.com @8.8.8.8

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35779
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; ANSWER SECTION:
toyotasupplier.com.     21594   IN      A       12.169.52.71

;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Sep 9 12:34:43 2014
;; MSG SIZE rcvd: 52

root@yoshi:/# dig toyotasupplier.com @208.88.248.27

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @208.88.248.27
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29841
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 49 msec
;; SERVER: 208.88.248.27#53(208.88.248.27)
;; WHEN: Tue Sep 9 12:35:02 2014
;; MSG SIZE rcvd: 36

-T

From: pdns-users-boun...@mailman.powerdns.com [mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Brian Menges
Sent: Tuesday, September 09, 2014 12:56 PM
To: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

I'd say it's on Toyota's end:

$ dig toyotasupplier.com +short @gslb-ns1.toyota-na.com
; <<>> DiG 9.7.3 <<>> toyotasupplier.com +short @gslb-ns1.toyota-na.com
;; global options: +cmd
connection timed out; no servers could be reached

Their other DNS server works fine... several attempts to reach the first one, however, fail (haven't gotten a success yet). I'd say it's their problem.

- Brian Menges
Principal Engineer, DevOps @ GoGrid, LLC.
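Brian's test (asking one of the domain's own nameservers directly, bypassing any resolver cache) can be repeated against each authoritative server in turn. The following is a minimal Python sketch under stated assumptions, not a replacement for dig: it sends a single hand-built UDP query with the recursion-desired bit left off, reads only the response code, and does not check the reply's transaction ID or retry over TCP. The function names are hypothetical.

```python
import socket
import struct

# Common DNS response codes (RFC 1035); anything else is shown numerically.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, txid=0x1234, qtype=1, recurse=False):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    flags = 0x0100 if recurse else 0x0000  # 0x0100 is the RD (recursion desired) bit
    header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Return the response code name from a raw DNS reply."""
    if len(response) < 4:
        return "MALFORMED"
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F  # low four bits of the flags word
    return RCODES.get(rcode, "RCODE%d" % rcode)


def probe(server_ip, name, timeout=3.0):
    """Ask one server directly over UDP; return its rcode, or "TIMEOUT"."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server_ip, 53))
        try:
            response, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    return parse_rcode(response)
```

Calling probe() once per nameserver address for toyotasupplier.com, several times over, would show whether one server is silent while the other answers. A resolver with a warm cache, such as a large public one, can keep returning NOERROR even while an authoritative server is unreachable, which may be why the two resolvers above disagree.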
From: pdns-users-boun...@mailman.powerdns.com [mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Todd Smith
Sent: Tuesday, September 09, 2014 9:24 AM
To: 'pdns-users@mailman.powerdns.com'
Subject: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Rather long output here; however, it certainly looks like these results pretty much confirm that the issue is Toyota's, not ours:

Sep 9 12:47:07 yoshi pdns_recursor[31821]: 1 [10690756] question for 'toyotasupplier.com.|A' from 208.88.248.27
Sep 9 12:47:12 yoshi pdns_recursor[31821]: 0 [3961638] question for 'toyotasupplier.com.|A' from 208.88.248.27
Sep 9 12:47:17 yoshi pdns_recursor[31821]: 1 [10691043] question for 'toyotasupplier.com.|A' from 208.88.248.27
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Looking for CNAME cache hit of 'toyotasupplier.com.|CNAME'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: No CNAME cache hit of 'toyotasupplier.com.|CNAME' found
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: No cache hit for 'toyotasupplier.com.|A', trying to find an appropriate NS record
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Checking if we have NS in cache for 'toyotasupplier.com.'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: NS (with ip, or non-glue) in cache for 'toyotasupplier.com.' - 'gslb-ns1.toyota-na.com.'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: within bailiwick: 0, not in cache / did not look at cache
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: NS (with ip, or non-glue) in cache for 'toyotasupplier.com.' - 'gslb-ns2.toyota-na.com.'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: within bailiwick: 0, not in cache / did not look at cache
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: We have NS in cache for 'toyotasupplier.com.' (flawedNSSet=0)
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Cache consultations done, have 2 NS to contact
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Nameservers: gslb-ns1.toyota-na.com.(0.00ms), gslb-ns2.toyota-na.com.(0.00ms)
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying to resolve NS 'gslb-ns1.toyota-na.com.' (1/2)
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] gslb-ns1.toyota-na.com.: Looking for CNAME cache hit of 'gslb-ns1.toyota-na.com.|CNAME'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] gslb-ns1.toyota-na.com.: No CNAME cache hit of 'gslb-ns1.toyota-na.com.|CNAME' found
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] gslb-ns1.toyota-na.com.: Found cache hit for A: 63.238.139.235[ttl=80545]
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: timeout resolving
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying to resolve NS 'gslb-ns2.toyota-na.com.' (2/2)
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] gslb-ns2.toyota-na.com.: Looking for CNAME cache hit of 'gslb-ns2.toyota-na.com.|CNAME'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] gslb-ns2.toyota-na.com.: No CNAME cache hit of 'gslb-ns2.toyota-na.com.|CNAME' found
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] gslb-ns2.toyota-na.com.: Found cache hit for A: 12.169.52.62[ttl=80540]
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns2.toyota-na.com. to: 12.169.52.62
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: timeout resolving
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: failed (res=-1)
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: Looking for CNAME cache hit of 'toyotasupplier.com.|CNAME'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: No CNAME cache hit of 'toyotasupplier.com.|CNAME' found
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: No cache hit for 'toyotasupplier.com.|A', trying to find an appropriate NS record
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: Checking if we have NS in cache for 'toyotasupplier.com.'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: NS (with ip, or non-glue) in cache for 'toyotasupplier.com.' - 'gslb-ns1.toyota-na.com.'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: within
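With a trace this long, the interesting part (which upstream addresses timed out) is easy to lose. As a rough Python sketch, assuming log lines shaped like the ones above, each "Trying IP" line can be paired with the next "timeout resolving" line that carries the same bracketed query id. The function name and regular expressions are illustrative only and are not part of PowerDNS.

```python
import re
from collections import Counter

# "[10690756] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking ..."
TRYING_RE = re.compile(r"\[(\d+)\] \S+ Trying IP ([0-9.]+):\d+, asking")
# "[10690756] toyotasupplier.com.: timeout resolving"
TIMEOUT_RE = re.compile(r"\[(\d+)\] \S+ timeout resolving")


def timed_out_servers(lines):
    """Count timeouts per upstream IP seen in recursor trace lines."""
    pending = {}          # query id -> IP address most recently tried
    timeouts = Counter()  # IP address -> number of timeouts
    for line in lines:
        match = TRYING_RE.search(line)
        if match:
            pending[match.group(1)] = match.group(2)
            continue
        match = TIMEOUT_RE.search(line)
        if match and match.group(1) in pending:
            timeouts[pending.pop(match.group(1))] += 1
    return timeouts
```

Fed the lines for query 10690756, this would report one timeout each for 63.238.139.235 and 12.169.52.62, matching the "Failed to resolve via any of the 2 offered NS" line in the trace.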
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Actually, that raises one more question: as of right now I have network-timeout set to 5000 in recursor.conf, yet it's obviously still timing out considerably sooner than that. Is there some other setting (within PowerDNS, that is) that might be conflicting with it and causing it to time out sooner? If not (as I suspect), I'll investigate our network settings to see if anything else might be clipping these requests short.

Many, many thanks again,

-T

-Original Message-
From: bert hubert [mailto:bert.hub...@netherlabs.nl]
Sent: Tuesday, September 09, 2014 1:39 PM
To: Todd Smith
Cc: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

On Tue, Sep 09, 2014 at 05:16:24PM +, Todd Smith wrote:

Rather long output here; however, it certainly looks like these results pretty much confirm that the issue is Toyota's, not ours:

Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: timeout resolving
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: timeout resolving
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: failed (res=-1)

You could try a traceroute to these two addresses to debug, but this indeed does not look like a powerdns issue but more a networking issue!

Note that your timeout does appear to be 1 second; you could try raising this with 'network-timeout=2000' (2 seconds) and see if this helps.

Good luck!

Bert