Re: [Dnsmasq-discuss] DNSMASQ failing to return SRV records with loss of communication to a single DNS server

2018-08-17 Thread Simon Kelley
I think that the source of the problem here is


> ;; Truncated, retrying in TCP mode.


Dnsmasq is forwarding the query via UDP, and getting a reply, but it has
the bit set which says "the reply is too big for your UDP packet, try
again using TCP."

That gets returned to your requester, which does the right thing and
sends the query again to dnsmasq via TCP. Dnsmasq attempts to connect to
the first upstream server by TCP, but that server is down. By the time
the kernel times out and dnsmasq tries to connect to the second server,
the original requester has already timed out and given up.
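
To make the truncation signal concrete, here is a minimal sketch of how a client can spot that "too big" bit (the TC flag) in a raw DNS message. The helper name is_truncated and the two sample headers are made up for illustration; only the header layout and the query id 14584 (from the dig output in this thread) are real.

```python
import struct

TC_FLAG = 0x0200  # TC ("truncated") bit in the 16-bit DNS header flags word


def is_truncated(message: bytes) -> bool:
    """Return True if a raw DNS message has the TC bit set."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes; message too short")
    _ident, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_FLAG)


# Hypothetical headers: id, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
# 0x8600 = QR | AA | TC, 0x8400 = QR | AA (the "qr aa" dig prints).
truncated_reply = struct.pack("!HHHHHH", 14584, 0x8600, 1, 0, 0, 0)
full_reply = struct.pack("!HHHHHH", 14584, 0x8400, 1, 5, 0, 5)

print(is_truncated(truncated_reply), is_truncated(full_reply))
```

A requester that sees the flag set is expected to repeat the same question over TCP, which is the step that stalls here when the first upstream server is down.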

The best fix for this is probably to tackle the too-big-reply problem.
The replies you are getting are 528 bytes, which is just over the
512-byte UDP message size that every DNS server must support. You can
either remove one of the five SRV records, or change the server config
to support EDNS0 and >512 byte replies.

Note that it's possible that the upstream server is truncating not
because of its config but because the _queries_ you are sending don't
have EDNS0 specifying a larger than 512 byte allowable reply. Modern
"dig" implementations normally do this by default, but yours doesn't
seem to be. Of course fixing dig doesn't fix the problem if the "real"
queries, rather than test ones from dig, have the same problem.
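
As a rough illustration of what "EDNS0 specifying a larger reply" means on the wire, the sketch below builds an SRV query with and without an OPT pseudo-record advertising a 4096-byte UDP payload. The function names, the query id and the 4096 figure are assumptions chosen for the example, not anything dnsmasq or dig is confirmed to use in this setup.

```python
import struct

SRV = 33       # RR type for SRV
OPT = 41       # EDNS0 pseudo-RR type
IN_CLASS = 1   # class IN


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in (part for part in name.rstrip(".").split(".") if part):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = SRV, udp_payload: int = 0,
                ident: int = 0x1234) -> bytes:
    """Build a DNS query; add an EDNS0 OPT record if udp_payload is non-zero."""
    arcount = 1 if udp_payload else 0
    # Flags 0x0100 = recursion desired; one question, no answers.
    header = struct.pack("!HHHHHH", ident, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, IN_CLASS)
    opt = b""
    if udp_payload:
        # Root name, TYPE=OPT, CLASS carries the payload size,
        # TTL=0 (extended rcode, version 0, no flags), RDLEN=0.
        opt = b"\x00" + struct.pack("!HHIH", OPT, udp_payload, 0, 0)
    return header + question + opt


name = "_sip._udp.scscf.sprout.lp.labdomain.net"
plain = build_query(name)
with_edns = build_query(name, udp_payload=4096)
print(len(plain), len(with_edns))
```

The only difference between the two queries is the 11-byte OPT record and the additional-record count of 1; without it, a server is entitled to truncate anything over 512 bytes.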

If none of that works, you are into sysctl hacking to reduce the
timeout on TCP connection setup.
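
For a feel of the numbers involved, here is a back-of-envelope sketch of how long a Linux connect() to a dead host takes to fail for a given net.ipv4.tcp_syn_retries. It assumes an initial retransmission timeout of 1 second that doubles on each retry, as described in current kernel documentation; older kernels used a different initial value, so treat the results as estimates. The function name is made up.

```python
def syn_connect_timeout(syn_retries: int, initial_rto: float = 1.0) -> float:
    """Estimate seconds until connect() gives up on an unreachable host.

    Assumes the first SYN waits initial_rto seconds and each of the
    syn_retries retransmissions doubles the wait (1 + 2 + 4 + ...).
    """
    if syn_retries < 0:
        raise ValueError("syn_retries must be non-negative")
    return initial_rto * (2 ** (syn_retries + 1) - 1)


for retries in (6, 3, 2, 1):
    print(retries, syn_connect_timeout(retries))
```

Under those assumptions a setting of 6 gives roughly 127 seconds, while 2 gives about 7 seconds, which would fit inside the 10-second gap between client retries visible in the log from the original report. Note that the sysctl applies to every outgoing TCP connection from the host, not just dnsmasq's.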


Cheers,

Simon.




[Dnsmasq-discuss] DNSMASQ failing to return SRV records with loss of communication to a single DNS server

2018-08-14 Thread Warner, Andrew C [CTO]
Subject: DNSMASQ failing to return SRV records with loss of communication to a 
single DNS server

Issue: We have SIP SRV records for a domain which can be provided by two DNS
servers in our environment. During testing we have noticed that if one of the
DNS servers is unreachable, the request for the SRV records via dnsmasq times
out.


This only happens when the query originates from outside the box where
dnsmasq is running, i.e. if we issue the SRV query from the dnsmasq server,
the SRV records are returned. If we issue the request from a client VM which
is set to resolve queries against our dnsmasq host, the request times out.



Note:  some of the information below has been changed/replaced with xxx,  such 
as IP addresses and domain names for security reasons.



dnsmasq.conf has the following entries, which forward requests for
labdomain.net to 10.xx.xx.12 and 10.xx.xx.20.

server=/labdomain.net/10.xx.xx.12

server=/labdomain.net/10.xx.xx.20



VM making SRV queries is 10.xx.xx.99





When we query for an SRV record with 10.xx.xx.5 being our DNSMASQ server, and 
have commented out the non-reachable DNS server: 10.xx.xx.12 - we receive a 
response to the SRV query.



#server=/labdomain.net/10.xx.xx.12

server=/labdomain.net/10.xx.xx.20





[labuser@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5

;; Truncated, retrying in TCP mode.



; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5

;; global options: +cmd

;; Got answer:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14584

;; flags: qr aa; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5



;; QUESTION SECTION:

;_sip._udp.scscf.sprout.lp.labdomain.net. IN SRV



;; ANSWER SECTION:

_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-05.labdomain.net.

_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-01.labdomain.net.

_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-02.labdomain.net.

_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-03.labdomain.net.

_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-04.labdomain.net.



;; ADDITIONAL SECTION:

ovpklp-viscscf-spn-05.labdomain.net. 43200 IN A 10.xx.xx.18

ovpklp-viscscf-spn-01.labdomain.net. 43200 IN A 10.xx.xx.14

ovpklp-viscscf-spn-02.labdomain.net. 43200 IN A 10.xx.xx.15

ovpklp-viscscf-spn-03.labdomain.net. 43200 IN A 10.xx.xx.16

ovpklp-viscscf-spn-04.labdomain.net. 43200 IN A 10.xx.xx.17



;; Query time: 2 msec

;; SERVER: 10.xx.xx.5#53(10.xx.xx.5)

;; WHEN: Mon Aug 13 16:34:40 2018

;; MSG SIZE  rcvd: 528





When we query for an SRV record with 10.xx.xx.5 being our DNSMASQ server, and 
have both the good and non-reachable DNS server in play - we receive a timeout 
to the SRV query.  In this case - 10.xx.xx.20 is fully capable of responding to 
the SRV query.


server=/labdomain.net/10.xx.xx.12    <-- not reachable

server=/labdomain.net/10.xx.xx.20



[labuser@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5

;; Truncated, retrying in TCP mode.



; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5

;; global options: +cmd

;; connection timed out; no servers could be reached


Dnsmasq logging shows:

Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.12
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.20
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: nameserver 10.xx.xx.20 refused to do a recursive query
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5172]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
Aug 14 16:22:24 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5173]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
Aug 14 16:22:34 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5174]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99


I could use some ideas on how to further troubleshoot this issue.




Andy Warner
Telecom Design Engineer
O: 406-752-3330 / M: 913-972-7521
andrew.c.war...@sprint.com

___
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss