[Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Hey guys,

I've been having a problem with recursion. For some reason, certain domains throw SERVFAIL errors when dug most of the time, but then return NOERROR with a correct response at other, seemingly random times. For example:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2636
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 0 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Wed Sep 3 13:36:33 2014
;; MSG SIZE rcvd: 36

And then, a few hours later:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56751
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; ANSWER SECTION:
toyotasupplier.com.     18296   IN      A       12.169.52.71

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Thu Sep 4 10:39:38 2014
;; MSG SIZE rcvd: 52

And then, a few hours later still:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5171
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 3017 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep 5 07:50:25 2014
;; MSG SIZE rcvd: 36

All without making a single change. I have been working on debugging this for two days now and absolutely cannot pinpoint a source for the issue. I've increased the max query lengths and the recursor's network and client TCP timeouts, and restarted the service several times on several of our DNS servers, and nothing I do seems to fix it. It of course doesn't help that the bug is a bit of a gremlin and keeps mischievously disappearing at random (and in fact, to my knowledge, it never happened until about a week ago, when it started to occur for no apparent reason).

Any idea on what could be causing this?

FWIW, when I run dig toyotasupplier.com ns it consistently works fine:

root@yoshi:/# dig toyotasupplier.com ns

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com ns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39522
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      NS

;; ANSWER SECTION:
toyotasupplier.com.     50741   IN      NS      gslb-ns2.toyota-na.com.
toyotasupplier.com.     50741   IN      NS      gslb-ns1.toyota-na.com.

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep 5 07:49:29 2014
;; MSG SIZE rcvd: 92

Many thanks in advance,

Todd W. Smith
IP Services Technician
2331 East 600 North
Greenfield, IN 46140
(317) 323-2021
tsm...@ninestarconnect.com
www.ninestarconnect.com

___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
I'd say it's on Toyota's end:

$ dig toyotasupplier.com +short @gslb-ns1.toyota-na.com

; <<>> DiG 9.7.3 <<>> toyotasupplier.com +short @gslb-ns1.toyota-na.com
;; global options: +cmd
;; connection timed out; no servers could be reached

Their other DNS server works fine, but several attempts to reach the first one have all failed (I haven't gotten a success yet). I'd say it's their problem.

- Brian Menges
Principal Engineer, DevOps @ GoGrid, LLC.

From: pdns-users-boun...@mailman.powerdns.com [mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Todd Smith
Sent: Tuesday, September 09, 2014 9:24 AM
To: 'pdns-users@mailman.powerdns.com'
Subject: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

[...]
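Brian's check (asking each authoritative server directly instead of going through the recursor) can also be scripted. The following is only a minimal sketch using the Python standard library: it hand-builds a plain DNS query without EDNS, and the function names `build_query` and `probe` are made up for this illustration. The two IP addresses in the usage comment are the ones the recursor trace later in this thread reports for gslb-ns1 and gslb-ns2.

```python
import socket
import struct

# Text names for the response codes we are most likely to see.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A, class IN, no EDNS)."""
    # Header: id, flags (0 = standard query, recursion not desired),
    # one question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    # Each label is length-prefixed; the name ends with a zero byte.
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server_ip, name, timeout=2.0):
    """Send one UDP query to server_ip and return the rcode text or 'timeout'."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server_ip, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
    # A valid reply has a 12-byte header and echoes our query id.
    if len(reply) < 12 or reply[:2] != packet[:2]:
        return "malformed reply"
    rcode = reply[3] & 0x0F  # low four bits of the second flags byte
    return RCODES.get(rcode, "rcode %d" % rcode)


# Example use (needs network access, so it is left commented out):
# for ip in ("63.238.139.235", "12.169.52.62"):
#     print(ip, probe(ip, "toyotasupplier.com"))
```

Running the probe a few dozen times per server, from more than one network if possible, gives a rough idea of how often each one answers, which is more telling than a single dig.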
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Hey Brian,

That would make perfect sense, and I was thinking along similar lines, but if that's the case, why do I get a consistent NOERROR when using Google DNS? Google's cache perhaps?

root@yoshi:/# dig toyotasupplier.com @8.8.8.8

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35779
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; ANSWER SECTION:
toyotasupplier.com.     21594   IN      A       12.169.52.71

;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Sep 9 12:34:43 2014
;; MSG SIZE rcvd: 52

root@yoshi:/# dig toyotasupplier.com @208.88.248.27

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @208.88.248.27
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29841
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 49 msec
;; SERVER: 208.88.248.27#53(208.88.248.27)
;; WHEN: Tue Sep 9 12:35:02 2014
;; MSG SIZE rcvd: 36

-T

From: pdns-users-boun...@mailman.powerdns.com [mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Brian Menges
Sent: Tuesday, September 09, 2014 12:56 PM
To: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

[...]
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
On Tue, Sep 9, 2014 at 9:55 AM, Brian Menges bmen...@gogrid.com wrote:
> I’d say it’s on Toyota’s end:

Same here, gslb-ns1.toyota-na.com is not responding (Comcast, Seattle, WA).
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Well, as long as one server works you should get an answer. Try setting a trace-regex on toyota and see what your powerdns reports!

http://doc.powerdns.com/html/rec-control.html - trace-regex

Bert

On Tue, Sep 09, 2014 at 10:06:03AM -0700, Michael Loftis wrote:
> On Tue, Sep 9, 2014 at 9:55 AM, Brian Menges bmen...@gogrid.com wrote:
> > I’d say it’s on Toyota’s end:
>
> Same here gslb-ns1.toyota-na.com not responding (Comcast, Seattle, WA)
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
[...] for A: 12.169.52.62[ttl=80535]
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns2.toyota-na.com. to: 12.169.52.62
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: timeout resolving
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Trying to resolve NS 'gslb-ns1.toyota-na.com.' (2/2)
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] gslb-ns1.toyota-na.com.: Looking for CNAME cache hit of 'gslb-ns1.toyota-na.com.|CNAME'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] gslb-ns1.toyota-na.com.: No CNAME cache hit of 'gslb-ns1.toyota-na.com.|CNAME' found
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] gslb-ns1.toyota-na.com.: Found cache hit for A: 63.238.139.235[ttl=80530]
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: query throttled
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: failed (res=-1)

-Original Message-
From: pdns-users-boun...@mailman.powerdns.com [mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of bert hubert
Sent: Tuesday, September 09, 2014 1:11 PM
To: Michael Loftis
Cc: pdns-users@mailman.powerdns.com
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

[...]
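When the trace output gets long, it can help to boil it down to one outcome per server. Below is a rough sketch that pairs each "Trying IP" line with the "timeout resolving" or "query throttled" line that follows it; the sample lines are taken from the trace above, while the function name `summarize` and the exact matching rules are assumptions made for this example and are not part of PowerDNS. As far as I understand it, "query throttled" means the recursor skipped a server it had recently seen fail, which would also fit the near-instant SERVFAIL answers seen earlier in the thread.

```python
import re
from collections import Counter

# Matches the server address in a "Trying IP a.b.c.d:53" trace line.
TRYING = re.compile(r"Trying IP (\d+\.\d+\.\d+\.\d+):53")
# Outcome strings as they appear in the trace quoted in this thread.
OUTCOMES = ("timeout resolving", "query throttled")

SAMPLE = [
    "Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'",
    "Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: timeout resolving",
    "Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Trying to resolve NS 'gslb-ns1.toyota-na.com.' (2/2)",
    "Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'",
    "Sep 9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: query throttled",
]


def summarize(lines):
    """Return {server_ip: Counter(outcome -> count)} for a list of trace lines."""
    results = {}
    current_ip = None
    for line in lines:
        match = TRYING.search(line)
        if match:
            # Remember which server the next outcome line belongs to.
            current_ip = match.group(1)
            results.setdefault(current_ip, Counter())
            continue
        if current_ip is None:
            continue
        for outcome in OUTCOMES:
            if outcome in line:
                results[current_ip][outcome] += 1
                current_ip = None  # this attempt is accounted for
                break
    return results


for ip, outcomes in summarize(SAMPLE).items():
    print(ip, dict(outcomes))
```

Fed with a day's worth of trace lines, a summary like this shows at a glance whether one server or both are failing, and how often the recursor gave up on a server without even asking it.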
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
On Tue, Sep 09, 2014 at 05:16:24PM +0000, Todd Smith wrote:
> Rather long output here; however, it certainly looks like these results pretty much confirm that the issue is Toyota's, not ours:
>
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: timeout resolving
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: timeout resolving
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
> Sep 9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: failed (res=-1)

You could try a traceroute to these two addresses to debug, but this indeed does not look like a powerdns issue but more a networking issue!

Note that your timeout does appear to be 1 second; you could try raising this with 'network-timeout=2000' (2 seconds) and see if this helps.

Good luck!

Bert
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
Actually that begs one more question--as of right now I actually have network-timeout set to 5000 in recursor.conf, yet obviously it's still timing out considerably sooner than that; is there, say, some other setting (that is, of course, within PowerDNS) that might be conflicting with this and causing it to time out sooner? If not (as I suspect), I'll investigate our network settings to see if anything else might be clipping these requests off short.

Many many thanks again

-T

-Original Message-
From: bert hubert [mailto:bert.hub...@netherlabs.nl]
Sent: Tuesday, September 09, 2014 1:39 PM
To: Todd Smith
Cc: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

[...]
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
On Tue, Sep 09, 2014 at 05:57:08PM +0000, Todd Smith wrote:
> Actually that begs one more question--as of right now I actually have network-timeout set to 5000 in recursor.conf, yet obviously it's still timing out considerably sooner than that; is there, say, some other setting (that is, of course, within PowerDNS) that might be conflicting with this causing it time out sooner?

Just checked: 3.6.0 honors network-timeout correctly under normal conditions. I'm now seeing if there are situations where we might be short-circuiting it.

Bert
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
On Tue, Sep 09, 2014 at 08:20:48PM +0200, bert hubert wrote:
> Just checked, 3.6.0 honors network-timeout correctly under normal conditions, seeing if there are possibilities when we might be short circuiting it.

Ok, this has to do with the nature of trace-regex, which outputs the whole thing at the end of the resolve process, with only one timestamp. Clarified the output a bit to:

Sep 09 20:28:50 [13281] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep 09 20:28:50 [13281] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|MX'
Sep 09 20:28:50 [13281] toyotasupplier.com.: timeout resolving after 5000.47msec
Sep 09 20:28:50 [13281] toyotasupplier.com.: Trying to resolve NS 'gslb-ns2.toyota-na.com.' (2/2)

https://github.com/PowerDNS/pdns/commit/863ca18dd298ad0f2ee377aaf539450bc81e0b0a

I also get intermittent failures of toyotasupplier.com here by the way, so it isn't just you!

Bert
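With the elapsed time now printed on the timeout line, checking whether network-timeout is really being honored no longer depends on the (identical) syslog timestamps. A small sketch, assuming the "timeout resolving after N msec" wording shown above; the helper name `check_timeouts` and the 50 ms tolerance are arbitrary choices for this example.

```python
import re

# Matches the elapsed time in e.g. "timeout resolving after 5000.47msec".
TIMEOUT_MS = re.compile(r"timeout resolving after ([0-9]+(?:\.[0-9]+)?)msec")


def check_timeouts(lines, configured_ms, tolerance_ms=50.0):
    """Return (elapsed_ms, honored) for every timeout line found.

    honored is False when the recursor gave up noticeably earlier than
    the configured network-timeout, i.e. something cut the wait short.
    """
    found = []
    for line in lines:
        match = TIMEOUT_MS.search(line)
        if match:
            elapsed = float(match.group(1))
            found.append((elapsed, elapsed >= configured_ms - tolerance_ms))
    return found


trace = [
    "Sep 09 20:28:50 [13281] toyotasupplier.com.: Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|MX'",
    "Sep 09 20:28:50 [13281] toyotasupplier.com.: timeout resolving after 5000.47msec",
]
print(check_timeouts(trace, configured_ms=5000))
```

Older trace output without the elapsed time (plain "timeout resolving") simply yields an empty list here, so it cannot be used for this check.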
Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random
I'd say Google is talking to the one that answers, and caches that. 63.238.139.235 (gslb-ns1.toyota-na.com) definitely has issues.

- Brian Menges
Principal Engineer, DevOps @ GoGrid, LLC.

From: pdns-users-boun...@mailman.powerdns.com [mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Todd Smith
Sent: Tuesday, September 09, 2014 10:04 AM
To: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

[...]
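Brian's point about caching is why a resolver can keep answering NOERROR for hours while the authoritative servers are flaky: one successful lookup is enough until the TTL runs out. The toy cache below only illustrates that idea; it is not how Google Public DNS or the PowerDNS recursor is actually implemented, the class name `TTLCache` is invented here, and the TTL figures are simply borrowed from the dig outputs in this thread.

```python
class TTLCache:
    """Toy resolver cache: entries expire once their TTL has elapsed."""

    def __init__(self):
        self._store = {}  # key -> (value, absolute expiry time in seconds)

    def put(self, key, value, ttl, now):
        """Store value under key for ttl seconds, counted from now."""
        self._store[key] = (value, now + ttl)

    def get(self, key, now):
        """Return (value, remaining_ttl), or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if now >= expires:
            # Expired: the resolver would have to ask upstream again,
            # and that is the moment a flaky server becomes visible.
            del self._store[key]
            return None
        return value, int(expires - now)


cache = TTLCache()
key = ("toyotasupplier.com.", "A")
cache.put(key, "12.169.52.71", ttl=21594, now=0)
print(cache.get(key, now=3298))   # still cached, TTL counting down
print(cache.get(key, now=21594))  # expired, must be resolved again
```

A busy public resolver has many more chances to catch the working server and refill its cache, so its users rarely notice; a smaller recursor that happens to hit two timeouts in a row returns SERVFAIL instead.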