Re: [Pdns-users] Intermittent recursion failure due to timeout

2018-01-25 Thread Brian T
Brian and Bert - thank you both for the analysis and sanity check.  I'll
continue to focus on the network between me and the affected name server.

On Tue, Jan 23, 2018 at 2:28 PM, bert hubert 
wrote:

> Hi Brian,
>
> On Tue, Jan 23, 2018 at 02:00:58PM -0700, Brian T wrote:
> > My recursor config has 'network-timeout' set to 5500ms (instead of the
> > default 1500ms), so I can understand the "timeout resolving after
> > 5535.35msec" message, but I was surprised by the timestamps of the
> messages
> > all being the same.  Are these messages flushed to syslog all at once or
> is
> > the timeout really happening immediately?
>
> This is exactly why we note the 'observed' time passed in the log line.
> When running with trace=servfail, we indeed buffer the log with one single
> timestamp.
>
> > leading up to the timeout that look suspicious?  This has been happening
> > about 10-15 times per day and started happening on Jan 18th.  I see
> similar
> > timeouts to 3 other sites, but this one most frequently.
>
> 91.189.95.68 is the host causing timeouts.  It answers very reliably for me
> from here, but it is close to me (London).  It may be that this address
> isn't very well served to you somehow.
>
> > Lastly, is there any way to retry here?  I'd rather lower the network
> > timeout and attempt a retry if possible.
>
> "Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
> nova.clouds.archive.ubuntu.com: Cache consultations done, have 1 NS to
> contact"
>
> Usually domains have at least two nameservers.  That generates two
> attempts,
> which should really be enough.
>
> For domains with only a single nameserver with a single IP address, we
> might
> indeed consider trying twice. But there is no way to make that happen right
> now from the configuration.
>
> For now, I think you are mostly observing a somewhat unreliably hosted
> domain name.
>
> Bert
>



-- 
Brian Talley
b225...@gmail.com ::: (720) 6 36-0004
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Intermittent recursion failure due to timeout

2018-01-23 Thread David

On 2018-01-23 2:00 PM, Brian T wrote:


My recursor config has 'network-timeout' set to 5500ms (instead of the 
default 1500ms), so I can understand the "timeout resolving after 
5535.35msec" message, but I was surprised by the timestamps of the 
messages all being the same.  Are these messages flushed to syslog all 
at once or is the timeout really happening immediately?




This seems excessive. If the reply hasn't come in 1500ms what 
expectations do you have that it will arrive within 5500ms?


Aside from the timing question, can anyone offer any insight about this 
sequence of events that lead up to the timeout?  Would this "just" 
indicate a network or transport layer issue or is there anything in the 
events leading up to the timeout that look suspicious?  This has been 
happening about 10-15 times per day and started happening on Jan 18th.  
I see similar timeouts to 3 other sites, but this one most frequently.


Lastly, is there any way to retry here?  I'd rather lower the network 
timeout and attempt a retry if possible.


This is version 4.1.0

Thanks

--
Brian Talley


___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users



___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Intermittent recursion failure due to timeout

2018-01-23 Thread bert hubert
Hi Brian,

On Tue, Jan 23, 2018 at 02:00:58PM -0700, Brian T wrote:
> My recursor config has 'network-timeout' set to 5500ms (instead of the
> default 1500ms), so I can understand the "timeout resolving after
> 5535.35msec" message, but I was surprised by the timestamps of the messages
> all being the same.  Are these messages flushed to syslog all at once or is
> the timeout really happening immediately?

This is exactly why we note the 'observed' time passed in the log line. 
When running with trace=servfail, we indeed buffer the log with one single
timestamp.

> leading up to the timeout that look suspicious?  This has been happening
> about 10-15 times per day and started happening on Jan 18th.  I see similar
> timeouts to 3 other sites, but this one most frequently.

91.189.95.68 is the host causing timeouts.  It answers very reliably for me
from here, but it is close to me (London).  It may be that this address
isn't very well served to you somehow.

> Lastly, is there any way to retry here?  I'd rather lower the network
> timeout and attempt a retry if possible.

"Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333] 
 
nova.clouds.archive.ubuntu.com: Cache consultations done, have 1 NS to  

contact"

Usually domains have at least two nameservers.  That generates two attempts,
which should really be enough.

For domains with only a single nameserver with a single IP address, we might
indeed consider trying twice. But there is no way to make that happen right
now from the configuration.

For now, I think you are mostly observing a somewhat unreliably hosted
domain name. 

Bert
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Intermittent recursion failure due to timeout

2018-01-23 Thread Brian Candler

On 23/01/2018 21:00, Brian T wrote:
I've been seeing intermittent lookup failures for 
'nova.clouds.archive.ubuntu.com '.


Hmm:

$ dig +trace nova.clouds.archive.ubuntu.com

...

ubuntu.com.        172800    IN    NS    ns1.p27.dynect.net.
ubuntu.com.        172800    IN    NS    ns3.p27.dynect.net.
ubuntu.com.        172800    IN    NS    ns2.p27.dynect.net.
ubuntu.com.        172800    IN    NS    ns4.p27.dynect.net.
;; Received 198 bytes from 2001:501:b1f9::30#53(2001:501:b1f9::30) in 117 ms

clouds.archive.ubuntu.com. 60    IN    NS    piru.canonical.com.
;; Received 77 bytes from 2001:500:94:1::27#53(2001:500:94:1::27) in 46 ms

Ergh!

1. clouds.archive.ubuntu.com has only a *single* nameserver (haven't 
they read RFC2182?)


2. the single NS record has a ridiculously low TTL of 60 seconds

The A record for piru.canonical.com has a semi-reasonable TTL of 30 
minutes, although the NS records for canonical.com are cranked down to 
10 minutes in the zone:


$ dig +trace piru.canonical.com.

...

canonical.com.        172800    IN    NS    ns1.p27.dynect.net.
canonical.com.        172800    IN    NS    ns3.p27.dynect.net.
canonical.com.        172800    IN    NS    ns2.p27.dynect.net.
canonical.com.        172800    IN    NS    ns4.p27.dynect.net.
;; Received 186 bytes from 2001:502:8cc::30#53(2001:502:8cc::30) in 205 ms

piru.canonical.com.    1800    IN    A    91.189.95.68
canonical.com.        600    IN    NS    ns3.p27.dynect.net.
canonical.com.        600    IN    NS    ns2.p27.dynect.net.
canonical.com.        600    IN    NS    ns1.p27.dynect.net.
canonical.com.        600    IN    NS    ns4.p27.dynect.net.
; Received 138 bytes from 208.78.70.27#53(208.78.70.27) in 12 ms

This is pretty badly configured, and getting some failures to resolve is 
probably to be expected.


What I'd like to understand though is how many times pdns-recursor 
retries a query to an authoritative server, within that 5500ms timeout 
you've set (or the default 1500ms timeout), given that it has no other 
server to failover to.


Regards,

Brian Candler.
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


[Pdns-users] Intermittent recursion failure due to timeout

2018-01-23 Thread Brian T
I've been seeing intermittent lookup failures for '
nova.clouds.archive.ubuntu.com'.  Looking at the logs, I see these messages
during the time when the lookup fails:

Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Wants DNSSEC processing, auth data in query
for A
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Looking for CNAME cache hit of '
nova.clouds.archive.ubuntu.com|CNAME'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: No CNAME cache hit of '
nova.clouds.archive.ubuntu.com|CNAME' found
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: No cache hit for '
nova.clouds.archive.ubuntu.com|A', trying to find an appropriate NS record
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333] : got TA for '.'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333] : setting cut state for
. to Secure
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Checking if we have NS in cache for '
nova.clouds.archive.ubuntu.com'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: no valid/useful NS in cache for '
nova.clouds.archive.ubuntu.com'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Checking if we have NS in cache for '
clouds.archive.ubuntu.com'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: NS (with ip, or non-glue) in cache for '
clouds.archive.ubuntu.com' -> 'piru.canonical.com'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: within bailiwick: 0, not in cache / did not
look at cache
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: We have NS in cache for '
clouds.archive.ubuntu.com' (flawedNSSet=0)
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: initial validation status for
nova.clouds.archive.ubuntu.com is Indeterminate
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Cache consultations done, have 1 NS to
contact
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com.: Nameservers: piru.canonical.com.(38.62ms)
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Trying to resolve NS 'piru.canonical.com'
(1/1)
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]piru.canonical.com:
Wants DNSSEC processing, NO auth data in query for A
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]piru.canonical.com:
Looking for CNAME cache hit of 'piru.canonical.com|CNAME'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]piru.canonical.com:
No CNAME cache hit of 'piru.canonical.com|CNAME' found
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]piru.canonical.com:
Found cache hit for A: 91.189.95.68[ttl=1340]
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]piru.canonical.com:
updating validation state with cache content for piru.canonical.com to
Indeterminate
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Resolved 'clouds.archive.ubuntu.com' NS
piru.canonical.com to: 91.189.95.68
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Trying IP 91.189.95.68:53, asking '
nova.clouds.archive.ubuntu.com|A'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: timeout resolving after 5535.35msec
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: Failed to resolve via any of the 1 offered
NS at level 'clouds.archive.ubuntu.com'
Jan 23 05:47:55 n422 pdns_recursor[8739]: [414333]
nova.clouds.archive.ubuntu.com: failed (res=-1)

My recursor config has 'network-timeout' set to 5500ms (instead of the
default 1500ms), so I can understand the "timeout resolving after
5535.35msec" message, but I was surprised by the timestamps of the messages
all being the same.  Are these messages flushed to syslog all at once or is
the timeout really happening immediately?

Aside from the timing question, can anyone offer any insight about this
sequence of events that lead up to the timeout?  Would this "just" indicate
a network or transport layer issue or is there anything in the events
leading up to the timeout that look suspicious?  This has been happening
about 10-15 times per day and started happening on Jan 18th.  I see similar
timeouts to 3 other sites, but this one most frequently.

Lastly, is there any way to retry here?  I'd rather lower the network
timeout and attempt a retry if possible.

This is version 4.1.0

Thanks

-- 
Brian Talley
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users