[Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Hey guys,

I've been having a problem with recursion. For some reason, certain domains 
return SERVFAIL most of the time when queried, but then NOERROR with a correct 
response at other, seemingly random times. For example:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2636
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 0 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Wed Sep  3 13:36:33 2014
;; MSG SIZE  rcvd: 36

And then, a few hours later:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56751
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; ANSWER SECTION:
toyotasupplier.com. 18296   IN  A   12.169.52.71

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Thu Sep  4 10:39:38 2014
;; MSG SIZE  rcvd: 52

And then, a few hours later still:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5171
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 3017 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep  5 07:50:25 2014
;; MSG SIZE  rcvd: 36

All without making a single change.
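
To put a number on "most of the time", the status field from repeated dig runs can be tallied. The following is only a minimal sketch, assuming the dig output has already been captured as text; `dig_status` and `tally_statuses` are hypothetical helper names, not part of dig or PowerDNS.

```python
import re
from collections import Counter

# Matches the "status: NOERROR" field in dig's header line.
STATUS_RE = re.compile(r"status:\s*([A-Z0-9]+)")


def dig_status(output):
    """Return the response code from one dig run, or "NO_RESPONSE" if absent."""
    match = STATUS_RE.search(output)
    return match.group(1) if match else "NO_RESPONSE"


def tally_statuses(outputs):
    """Count response codes across several captured dig runs."""
    return Counter(dig_status(text) for text in outputs)
```

Feeding it the captured output of a dig run repeated every few minutes would show how the SERVFAIL and NOERROR counts compare over a day.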

I have been working on debugging this for two days now and absolutely cannot 
pinpoint a source for the issue. I've increased the max query lengths and the 
recursor's network and client TCP timeouts, and restarted the service several 
times on several of our DNS servers, but nothing I do seems to fix it. It of 
course doesn't help that the bug is a bit of a gremlin and keeps mischievously 
disappearing at random (and in fact, to my knowledge, it never happened until 
about a week ago, when it started to occur for no apparent reason). Any idea 
what could be causing this? FWIW, when I run dig toyotasupplier.com ns it 
consistently works fine:

root@yoshi:/# dig toyotasupplier.com ns

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com ns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39522
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      NS

;; ANSWER SECTION:
toyotasupplier.com. 50741   IN  NS  gslb-ns2.toyota-na.com.
toyotasupplier.com. 50741   IN  NS  gslb-ns1.toyota-na.com.

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep  5 07:49:29 2014
;; MSG SIZE  rcvd: 92

Many thanks in advance,

Todd W. Smith
IP Services Technician
2331 East 600 North
Greenfield, IN 46140
(317) 323-2021
tsm...@ninestarconnect.com
www.ninestarconnect.com
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Hey Brian,

That would make perfect sense, and I was thinking along similar lines, but if 
that's the case, why do I get a consistent NOERROR when using Google DNS? 
Google's cache perhaps?

root@yoshi:/# dig toyotasupplier.com @8.8.8.8

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35779
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; ANSWER SECTION:
toyotasupplier.com. 21594   IN  A   12.169.52.71

;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Sep  9 12:34:43 2014
;; MSG SIZE  rcvd: 52

root@yoshi:/# dig toyotasupplier.com @208.88.248.27

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @208.88.248.27
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29841
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.            IN      A

;; Query time: 49 msec
;; SERVER: 208.88.248.27#53(208.88.248.27)
;; WHEN: Tue Sep  9 12:35:02 2014
;; MSG SIZE  rcvd: 36

-T

From: pdns-users-boun...@mailman.powerdns.com 
[mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Brian Menges
Sent: Tuesday, September 09, 2014 12:56 PM
To: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at 
random

I'd say it's on Toyota's end:

$ dig toyotasupplier.com +short @gslb-ns1.toyota-na.com
; <<>> DiG 9.7.3 <<>> toyotasupplier.com +short @gslb-ns1.toyota-na.com
;; global options: +cmd
connection timed out; no servers could be reached

Their other DNS server works fine, but several attempts to reach the first one 
have all failed (haven't gotten a success yet).

I'd say it's their problem.
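
The check above, asking each authoritative server directly rather than going through a recursor, can be repeated for every NS in one pass. This is a rough sketch, assuming `dig` is installed and on the PATH; `build_dig_command` and `check_nameservers` are hypothetical helper names, and the 2-second timeout is an arbitrary choice.

```python
import subprocess


def build_dig_command(name, server, rtype="A", timeout_s=2):
    """Build a dig invocation that asks one server directly, without recursion."""
    return [
        "dig", f"@{server}", name, rtype,
        "+norecurse",          # do not ask the server to recurse for us
        f"+time={timeout_s}",  # per-try timeout in seconds
        "+tries=1",            # single attempt, so a dead server shows up fast
        "+short",              # print only the answer data
    ]


def check_nameservers(name, servers, rtype="A", timeout_s=2):
    """Map each server to its answer text, or None if nothing usable came back."""
    results = {}
    for server in servers:
        proc = subprocess.run(
            build_dig_command(name, server, rtype, timeout_s),
            capture_output=True,
            text=True,
        )
        answer = proc.stdout.strip()
        # dig exits non-zero when it gets no reply from the server.
        results[server] = answer if proc.returncode == 0 and answer else None
    return results
```

For this thread the call would be `check_nameservers("toyotasupplier.com", ["gslb-ns1.toyota-na.com", "gslb-ns2.toyota-na.com"])`; a `None` entry marks a server that timed out or returned no answer.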

- Brian Menges
Principal Engineer, DevOps @ GoGrid, LLC.

From: pdns-users-boun...@mailman.powerdns.com 
[mailto:pdns-users-boun...@mailman.powerdns.com] On Behalf Of Todd Smith
Sent: Tuesday, September 09, 2014 9:24 AM
To: 'pdns-users@mailman.powerdns.com'
Subject: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Rather long output here; however, it certainly looks like these results pretty 
much confirm that the issue is Toyota's, not ours:

Sep  9 12:47:07 yoshi pdns_recursor[31821]: 1 [10690756] question for 
'toyotasupplier.com.|A' from 208.88.248.27
Sep  9 12:47:12 yoshi pdns_recursor[31821]: 0 [3961638] question for 
'toyotasupplier.com.|A' from 208.88.248.27
Sep  9 12:47:17 yoshi pdns_recursor[31821]: 1 [10691043] question for 
'toyotasupplier.com.|A' from 208.88.248.27
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Looking for CNAME cache hit of 'toyotasupplier.com.|CNAME'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: No 
CNAME cache hit of 'toyotasupplier.com.|CNAME' found
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: No 
cache hit for 'toyotasupplier.com.|A', trying to find an appropriate NS record
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Checking if we have NS in cache for 'toyotasupplier.com.'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: NS 
(with ip, or non-glue) in cache for 'toyotasupplier.com.' - 
'gslb-ns1.toyota-na.com.'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
within bailiwick: 0, not in cache / did not look at cache
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: NS 
(with ip, or non-glue) in cache for 'toyotasupplier.com.' - 
'gslb-ns2.toyota-na.com.'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
within bailiwick: 0, not in cache / did not look at cache
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: We 
have NS in cache for 'toyotasupplier.com.' (flawedNSSet=0)
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Cache consultations done, have 2 NS to contact
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Nameservers: gslb-ns1.toyota-na.com.(0.00ms), gslb-ns2.toyota-na.com.(0.00ms)
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Trying to resolve NS 'gslb-ns1.toyota-na.com.' (1/2)
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756]
gslb-ns1.toyota-na.com.: Looking for CNAME cache hit of 
'gslb-ns1.toyota-na.com.|CNAME'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756]
gslb-ns1.toyota-na.com.: No CNAME cache hit of 'gslb-ns1.toyota-na.com.|CNAME' 
found
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756]
gslb-ns1.toyota-na.com.: Found cache hit for A: 63.238.139.235[ttl=80545] 
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
timeout resolving 
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Trying to resolve NS 'gslb-ns2.toyota-na.com.' (2/2)
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756]
gslb-ns2.toyota-na.com.: Looking for CNAME cache hit of 
'gslb-ns2.toyota-na.com.|CNAME'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756]
gslb-ns2.toyota-na.com.: No CNAME cache hit of 'gslb-ns2.toyota-na.com.|CNAME' 
found
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756]
gslb-ns2.toyota-na.com.: Found cache hit for A: 12.169.52.62[ttl=80540] 
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Resolved 'toyotasupplier.com.' NS gslb-ns2.toyota-na.com. to: 12.169.52.62
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
timeout resolving 
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
failed (res=-1)
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: 
Looking for CNAME cache hit of 'toyotasupplier.com.|CNAME'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: No 
CNAME cache hit of 'toyotasupplier.com.|CNAME' found
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: No 
cache hit for 'toyotasupplier.com.|A', trying to find an appropriate NS record
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: 
Checking if we have NS in cache for 'toyotasupplier.com.'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: NS 
(with ip, or non-glue) in cache for 'toyotasupplier.com.' - 
'gslb-ns1.toyota-na.com.'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [3961638] toyotasupplier.com.: 
within 
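
A trace this long is easier to read once it is reduced to the servers that were tried and then timed out. The following is a minimal sketch, assuming the trace for a single query id has been saved as text with the same "Trying IP" and "timeout resolving" wording as above; `timed_out_servers` is a hypothetical helper, and interleaved traces from several queries would need to be split by id first.

```python
import re

# Either a "Trying IP a.b.c.d:port" entry (IPv4 only) or a timeout entry.
EVENT_RE = re.compile(r"Trying IP (\d+\.\d+\.\d+\.\d+):\d+|timeout resolving")


def timed_out_servers(trace_text):
    """List IPv4 addresses whose "Trying IP" entry is followed by a timeout."""
    flat = " ".join(trace_text.split())  # undo mail line wrapping
    timed_out = []
    current = None
    for match in EVENT_RE.finditer(flat):
        if match.group(1):          # a "Trying IP" entry: remember the address
            current = match.group(1)
        elif current is not None:   # a timeout entry: blame the last address
            timed_out.append(current)
            current = None
    return timed_out
```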

Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Actually that raises one more question: as of right now I have 
network-timeout set to 5000 in recursor.conf, yet obviously it's still timing 
out considerably sooner than that. Is there some other setting (within 
PowerDNS, that is) that might be conflicting with this and causing it to time 
out sooner?

If not (as I suspect), I'll investigate our network settings to see if anything 
else might be clipping these requests off short.
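
One quick thing to rule out is a second assignment of the same setting elsewhere in the configuration. The following is only a sketch, assuming a plain `key=value` file with `#` comments; `find_setting` is a hypothetical helper, and it says nothing about which value the recursor would actually use or about options passed on the command line.

```python
def find_setting(config_text, key):
    """Return (line_number, value) for every line that assigns `key`."""
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            hits.append((lineno, value.strip()))
    return hits
```

Calling `find_setting(open("recursor.conf").read(), "network-timeout")` and getting more than one entry back would point at a duplicate worth cleaning up (the file path here is an assumption).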

Many many thanks again
-T

-----Original Message-----
From: bert hubert [mailto:bert.hub...@netherlabs.nl] 
Sent: Tuesday, September 09, 2014 1:39 PM
To: Todd Smith
Cc: 'pdns-users@mailman.powerdns.com'
Subject: Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at 
random

On Tue, Sep 09, 2014 at 05:16:24PM +, Todd Smith wrote:
> Rather long output here; however, it certainly looks like these results
> pretty much confirm that the issue is Toyota's, not ours:
>
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> timeout resolving
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> timeout resolving
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.:
> failed (res=-1)

You could try a traceroute to these two addresses to debug, but this indeed 
does not look like a powerdns issue but more a networking issue!

Note that your timeout does appear to be 1 second; you could try raising this 
with 'network-timeout=2000' (2 seconds) and see if this helps.

Good luck!

Bert