[Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Hey guys,

I've been having a problem with recursion. For some reason, certain domains 
seem to throw SERVFAIL errors when dug most of the time, but then NOERROR with 
a correct response at other random times. For example:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2636
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.         IN  A

;; Query time: 0 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Wed Sep  3 13:36:33 2014
;; MSG SIZE  rcvd: 36

And then, a few hours later:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56751
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.         IN  A

;; ANSWER SECTION:
toyotasupplier.com. 18296   IN  A   12.169.52.71

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Thu Sep  4 10:39:38 2014
;; MSG SIZE  rcvd: 52

And then, a few hours later still:

root@yoshi:/# dig toyotasupplier.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5171
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.         IN  A

;; Query time: 3017 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep  5 07:50:25 2014
;; MSG SIZE  rcvd: 36

All without making a single change.

I have been working on debugging this for two days now and absolutely cannot 
pinpoint a source for the issue. I've increased the max query lengths and the 
recursor's network and client TCP timeouts, and restarted the service several 
times on several of our DNS servers, and nothing I do seems to fix it. It of 
course doesn't help that the bug is a bit of a gremlin and keeps mischievously 
disappearing at random (and in fact had never, to my knowledge, happened until 
about a week ago, when it started to occur for no apparent reason). Any idea 
what could be causing this? FWIW, when I run dig toyotasupplier.com ns it 
consistently works fine:

root@yoshi:/# dig toyotasupplier.com ns

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com ns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39522
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.         IN  NS

;; ANSWER SECTION:
toyotasupplier.com. 50741   IN  NS  gslb-ns2.toyota-na.com.
toyotasupplier.com. 50741   IN  NS  gslb-ns1.toyota-na.com.

;; Query time: 1 msec
;; SERVER: 208.88.248.25#53(208.88.248.25)
;; WHEN: Fri Sep  5 07:49:29 2014
;; MSG SIZE  rcvd: 92

Many thanks in advance,

Todd W. Smith
IP Services Technician
2331 East 600 North
Greenfield, IN 46140
(317) 323-2021
tsm...@ninestarconnect.com
www.ninestarconnect.com
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Brian Menges
I'd say it's on Toyota's end:

$ dig toyotasupplier.com +short @gslb-ns1.toyota-na.com
; <<>> DiG 9.7.3 <<>> toyotasupplier.com +short @gslb-ns1.toyota-na.com
;; global options: +cmd
;; connection timed out; no servers could be reached

Their other DNS server works fine, but several attempts to reach the first one 
have all failed (I haven't gotten a success yet).

I'd say it's their problem.

- Brian Menges
Principal Engineer, DevOps @ GoGrid, LLC.
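
The check above, asking an authoritative server directly instead of going 
through the recursor, can be repeated for every nameserver of a zone. The 
following is only a minimal sketch and was not posted in the thread: it assumes 
dig is installed, and the function name check_authoritatives and the 2-second 
timeout are illustrative choices.

```shell
# Query each authoritative nameserver of a zone directly and report which
# ones answer. check_authoritatives is a hypothetical helper name.
check_authoritatives() {
    zone=$1
    # Ask the default resolver for the NS set of the zone.
    for ns in $(dig +short NS "$zone"); do
        # One attempt, 2-second timeout, recursion off, straight to the server.
        if answer=$(dig +short +norecurse +time=2 +tries=1 "@$ns" "$zone" A) \
                && [ -n "$answer" ]; then
            echo "$ns OK" $answer
        else
            echo "$ns FAIL"
        fi
    done
}

# Usage (needs network access):
#   check_authoritatives toyotasupplier.com
```

A server that times out makes dig exit with a non-zero status, so it is 
reported as FAIL; an empty answer is treated the same way.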



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Hey Brian,

That would make perfect sense, and I was thinking along similar lines, but if 
that's the case, why do I get a consistent NOERROR when using Google DNS? 
Google's cache perhaps?

root@yoshi:/# dig toyotasupplier.com @8.8.8.8

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35779
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.         IN  A

;; ANSWER SECTION:
toyotasupplier.com. 21594   IN  A   12.169.52.71

;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Sep  9 12:34:43 2014
;; MSG SIZE  rcvd: 52

root@yoshi:/# dig toyotasupplier.com @208.88.248.27

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> toyotasupplier.com @208.88.248.27
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29841
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;toyotasupplier.com.         IN  A

;; Query time: 49 msec
;; SERVER: 208.88.248.27#53(208.88.248.27)
;; WHEN: Tue Sep  9 12:35:02 2014
;; MSG SIZE  rcvd: 36

-T


Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Michael Loftis
On Tue, Sep 9, 2014 at 9:55 AM, Brian Menges bmen...@gogrid.com wrote:
> I’d say it’s on Toyota’s end:

Same here, gslb-ns1.toyota-na.com not responding (Comcast, Seattle, WA)



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread bert hubert
Well, as long as one server works you should get an answer.

Try setting a trace-regex on toyota and see what your powerdns reports!

http://doc.powerdns.com/html/rec-control.html - trace-regex

Bert
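
For readers who have not used trace-regex before, here is a rough sketch of how 
it might be driven from the shell. It was not posted in the thread. It assumes 
rec_control can reach the running recursor, and that calling trace-regex 
without an argument clears the trace again, which is how I understand the 
rec_control documentation linked above. The helper name trace_for and the 
60-second default are made up for illustration.

```shell
# Enable a query trace for names matching a regex, wait, then disable it.
# trace_for is a hypothetical helper name.
trace_for() {
    pattern=$1
    seconds=${2:-60}
    # Start tracing every query whose name matches the regex.
    rec_control trace-regex "$pattern"
    # Leave time for clients (or a manual dig) to trigger the lookup.
    sleep "$seconds"
    # Assumption: trace-regex without an argument switches tracing off.
    rec_control trace-regex
}

# Usage:
#   trace_for 'toyota' 30
```

The trace itself is written to wherever the recursor logs, so it has to be read 
from there rather than from the output of rec_control.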



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
 for A: 12.169.52.62[ttl=80535] 
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
Resolved 'toyotasupplier.com.' NS gslb-ns2.toyota-na.com. to: 12.169.52.62
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
timeout resolving 
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
Trying to resolve NS 'gslb-ns1.toyota-na.com.' (2/2)
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043]
gslb-ns1.toyota-na.com.: Looking for CNAME cache hit of 
'gslb-ns1.toyota-na.com.|CNAME'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043]
gslb-ns1.toyota-na.com.: No CNAME cache hit of 'gslb-ns1.toyota-na.com.|CNAME' 
found
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043]
gslb-ns1.toyota-na.com.: Found cache hit for A: 63.238.139.235[ttl=80530] 
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
query throttled 
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
Sep  9 12:47:22 yoshi pdns_recursor[31821]: [10691043] toyotasupplier.com.: 
failed (res=-1)
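
Trace output like the above gets long quickly. As a rough sketch (not part of 
the original thread), the outcomes can be tallied per server address. This 
assumes each log entry sits on a single line, as it would in the log file 
rather than in this wrapped email; the function name summarize_trace and the 
log path in the usage comment are assumptions.

```shell
# Count "timeout resolving" and "query throttled" outcomes per server IP
# in recursor trace output. summarize_trace is a hypothetical helper name.
summarize_trace() {
    awk '
        /Trying IP / {
            # Remember the server address that the next outcome line refers to.
            for (i = 1; i <= NF; i++)
                if ($i == "IP") { ip = $(i + 1); sub(/:[0-9]+,?$/, "", ip) }
        }
        /timeout resolving/ { count[ip " timeout"]++ }
        /query throttled/   { count[ip " throttled"]++ }
        END { for (k in count) print count[k], k }
    ' "$@" | sort -k2
}

# Usage (log location is an assumption and differs between systems):
#   grep pdns_recursor /var/log/syslog | summarize_trace
```

Each output line has the form "count address outcome", which makes it easy to 
see whether one server or both are failing.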



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread bert hubert
On Tue, Sep 09, 2014 at 05:16:24PM +0000, Todd Smith wrote:
> Rather long output here; however, it certainly looks like these results 
> pretty much confirm that the issue is Toyota's, not ours:
>
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> Resolved 'toyotasupplier.com.' NS gslb-ns1.toyota-na.com. to: 63.238.139.235
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> Trying IP 63.238.139.235:53, asking 'toyotasupplier.com.|A'
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> timeout resolving 
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> Trying IP 12.169.52.62:53, asking 'toyotasupplier.com.|A'
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> timeout resolving 
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> Failed to resolve via any of the 2 offered NS at level 'toyotasupplier.com.'
> Sep  9 12:47:17 yoshi pdns_recursor[31821]: [10690756] toyotasupplier.com.: 
> failed (res=-1)

You could try a traceroute to these two addresses to debug, but this indeed
does not look like a powerdns issue but more a networking issue!

Note that your timeout does appear to be 1 second; you could try raising this
with 'network-timeout=2000' (2 seconds) and see if this helps.

Good luck!

Bert
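
Before changing network-timeout it can help to confirm what the configuration 
file actually contains, including any repeated entries. A minimal sketch, not 
from the thread: the helper name show_setting is made up, and the path in the 
usage comment is an assumption that varies between installations.

```shell
# Print every line (with its line number) that sets a given key in a
# key=value style configuration file. show_setting is a hypothetical name.
show_setting() {
    key=$1
    file=$2
    # Allow optional whitespace before the key and around the "=" sign;
    # commented-out lines starting with "#" do not match.
    grep -n -E "^[[:space:]]*${key}[[:space:]]*=" "$file"
}

# Usage (path is an assumption):
#   show_setting network-timeout /etc/powerdns/recursor.conf
# Reachability of the two authoritative servers can then be checked with:
#   traceroute 63.238.139.235
#   traceroute 12.169.52.62
```

If the key shows up more than once, that alone is worth tidying up before 
drawing conclusions from the observed timeouts.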

 


Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Todd Smith
Actually that begs one more question--as of right now I actually have 
network-timeout set to 5000 in recursor.conf, yet obviously it's still timing 
out considerably sooner than that; is there, say, some other setting (that is, 
of course, within PowerDNS) that might be conflicting with this, causing it to 
time out sooner?

If not (as I suspect), I'll investigate our network settings to see if anything 
else might be clipping these requests off short.

Many many thanks again
-T



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread bert hubert
On Tue, Sep 09, 2014 at 05:57:08PM +0000, Todd Smith wrote:

> Actually that begs one more question--as of right now I actually have
> network-timeout set to 5000 in recursor.conf, yet obviously it's still
> timing out considerably sooner than that; is there, say, some other
> setting (that is, of course, within PowerDNS) that might be conflicting
> with this, causing it to time out sooner?

Just checked: 3.6.0 honors network-timeout correctly under normal
conditions. I'm seeing if there are possibilities where we might be
short-circuiting it.

Bert



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread bert hubert
On Tue, Sep 09, 2014 at 08:20:48PM +0200, bert hubert wrote:
> Just checked, 3.6.0 honors network-timeout correctly under normal
> conditions, seeing if there are possibilities when we might be short
> circuiting it.

Ok, this has to do with the nature of trace-regex, which outputs the whole
thing at the end of the resolve process, with only one timestamp.

Clarified the output a bit to:

Sep 09 20:28:50 [13281] toyotasupplier.com.: Resolved 'toyotasupplier.com.' NS 
gslb-ns1.toyota-na.com. to: 63.238.139.235
Sep 09 20:28:50 [13281] toyotasupplier.com.: Trying IP 63.238.139.235:53, 
asking 'toyotasupplier.com.|MX'
Sep 09 20:28:50 [13281] toyotasupplier.com.: timeout resolving after 
5000.47msec 
Sep 09 20:28:50 [13281] toyotasupplier.com.: Trying to resolve NS 
'gslb-ns2.toyota-na.com.' (2/2)

https://github.com/PowerDNS/pdns/commit/863ca18dd298ad0f2ee377aaf539450bc81e0b0a

I also get intermittent failures of toyotasupplier.com here by the way, so
it isn't just you!

Bert
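
Since the trace timestamps only reflect the end of the resolve process, the 
time a lookup really takes is easier to see from the client side. The sketch 
below is not from the thread; it simply pulls the status and the "Query time" 
out of ordinary dig output. The function name dig_summary is hypothetical, and 
the +time=10 in the usage comment is an arbitrary value chosen to be longer 
than the 5000 ms network-timeout mentioned earlier.

```shell
# Read full dig output on stdin and print "<status> <query time in msec>".
# dig_summary is a hypothetical helper name.
dig_summary() {
    awk '
        /status:/ {
            # The word after "status:" is the response code, e.g. SERVFAIL.
            for (i = 1; i <= NF; i++)
                if ($i == "status:") { s = $(i + 1); sub(/,$/, "", s) }
        }
        /Query time:/ { t = $(NF - 1) }
        END { print s, t }
    '
}

# Usage (one attempt, and wait longer than the recursor timeout):
#   dig +tries=1 +time=10 toyotasupplier.com @208.88.248.25 | dig_summary
```

Running this a few times in a row shows whether failures come back instantly, 
which would suggest a cached or throttled result, or only after several 
seconds of waiting on the remote servers.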



Re: [Pdns-users] Recursion issue--SERVFAIL then NOERROR totally at random

2014-09-09 Thread Brian Menges
I'd say Google is talking to the one that answers, and caching that.

63.238.139.235 (gslb-ns1.toyota-na.com) definitely has issues.

- Brian Menges
Principal Engineer, DevOps @ GoGrid, LLC.
