I think that the source of the problem here is

> ;; Truncated, retrying in TCP mode.

Dnsmasq is forwarding the query via UDP, and getting a reply, but the
reply has the TC (truncated) bit set, which says "the reply is too big
for your UDP packet, try again using TCP." That gets returned to your
requester, which does the right thing and sends the query again to
dnsmasq via TCP. Dnsmasq attempts to connect to the first upstream
server by TCP, but that server is down. By the time the kernel times out
and dnsmasq tries to connect to the second server, the original
requester has already timed out and given up.

The best fix for this is probably to tackle the too-big-reply problem.
The replies you are getting are 528 bytes, which is just bigger than the
512-byte minimum which must be supported by any DNS server. You can
either remove one of the five SRV records, or change the server config
to support EDNS0 and >512 byte replies.

Note that it's possible that the upstream server is truncating not
because of its config but because the _queries_ you are sending don't
have EDNS0 specifying a larger than 512 byte allowable reply. Modern
"dig" implementations normally do this by default, but yours doesn't
seem to. Of course fixing dig doesn't fix the problem if the "real"
queries, rather than test ones from dig, have the same problem.

If none of that works, you are into sysctl hacking to reduce the timeout
on TCP connection setup.

Cheers,

Simon.

On 14/08/18 18:05, Warner, Andrew C [CTO] wrote:
> Subject: DNSMASQ failing to return SRV records with loss of
> communication to a single DNS server
>
> Issue: We have SIP SRV records for a domain which can be provided by
> two DNS servers in our environment. During testing we have noticed that
> if one of the DNS servers is un-reachable, the request for the SRV
> records via dnsmasq times out.
>
> This only happens when the query is originated from outside the box
> where dnsmasq is running. IE – if we issue the SRV query from the
> dnsmasq server, the SRV records are returned.
> If we issue the request from a client VM which is set to resolve
> queries against our dnsmasq host – the request times out.
>
> Note: some of the information below has been changed/replaced with xxx,
> such as IP addresses and domain names for security reasons.
>
> Dnsmasq.conf has the following entries – indicating to forward requests
> for labdomain.net to 10.xx.xx.12 and 10.xx.xx.20.
>
> server=/labdomain.net/10.xx.xx.12
> server=/labdomain.net/10.xx.xx.20
>
> VM making SRV queries is 10.xx.xx.99
>
> *When we query for an SRV record with 10.xx.xx.5 being our DNSMASQ
> server, and have commented out the non-reachable DNS server: 10.xx.xx.12
> – we receive a response to the SRV query.*
>
> #server=/labdomain.net/10.xx.xx.12
> server=/labdomain.net/10.xx.xx.20
>
> [labuser@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
> ;; Truncated, retrying in TCP mode.
>
> ; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14584
> ;; flags: qr aa; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
>
> ;; QUESTION SECTION:
> ;_sip._udp.scscf.sprout.lp.labdomain.net. IN SRV
>
> ;; ANSWER SECTION:
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-05.labdomain.net.
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-01.labdomain.net.
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-02.labdomain.net.
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-03.labdomain.net.
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-04.labdomain.net.
>
> ;; ADDITIONAL SECTION:
> ovpklp-viscscf-spn-05.labdomain.net. 43200 IN A 10.xx.xx.18
> ovpklp-viscscf-spn-01.labdomain.net. 43200 IN A 10.xx.xx.14
> ovpklp-viscscf-spn-02.labdomain.net. 43200 IN A 10.xx.xx.15
> ovpklp-viscscf-spn-03.labdomain.net. 43200 IN A 10.xx.xx.16
> ovpklp-viscscf-spn-04.labdomain.net. 43200 IN A 10.xx.xx.17
>
> ;; Query time: 2 msec
> ;; SERVER: 10.xx.xx.5#53(10.xx.xx.5)
> ;; WHEN: Mon Aug 13 16:34:40 2018
> ;; MSG SIZE rcvd: 528
>
> *When we query for an SRV record with 10.xx.xx.5 being our DNSMASQ
> server, and have both the good and non-reachable DNS server in play – we
> receive a timeout to the SRV query. In this case – 10.xx.xx.20 is fully
> capable of responding to the SRV query.*
>
> server=/labdomain.net/10.xx.xx.12 <- not reachable
> server=/labdomain.net/10.xx.xx.20
>
> [labuser@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
> ;; Truncated, retrying in TCP mode.
>
> ; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> Dnsmasq logging shows:
>
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.12
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.20
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: nameserver 10.xx.xx.20 refused to do a recursive query
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5172]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> Aug 14 16:22:24 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5173]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> Aug 14 16:22:34 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5174]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
>
> I could use some ideas on how to further troubleshoot this issue.
>
> *Andy Warner*
> Telecom Design Engineer
> O: 406-752-3330 / M: 913-972-7521
> andrew.c.war...@sprint.com
>
> _______________________________________________
> Dnsmasq-discuss mailing list
> Dnsmasq-discuss@lists.thekelleys.org.uk
> http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
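
For checking by hand whether a query advertises EDNS0, and whether a
reply came back with the truncated bit set, here is a minimal Python
sketch of the wire format involved. It is an illustration only, not
anything taken from dnsmasq: the function names (encode_name,
build_query, is_truncated), the query id and the 4096-byte payload size
are all assumptions made for the example, and nothing is sent on the
network.

```python
import struct

def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"

def build_query(name, qtype=33, query_id=0x1234, edns_payload=None):
    """Build a DNS query (qtype 33 = SRV), optionally with an EDNS0 OPT record.

    edns_payload is the UDP reply size we claim to accept. The values used
    in this example are illustrative assumptions, not recommendations.
    """
    arcount = 1 if edns_payload else 0
    # Header: id, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    packet = header + question
    if edns_payload:
        # OPT pseudo-record: root name, TYPE=41, CLASS=payload size,
        # TTL=0 (extended rcode, version, flags), RDLENGTH=0
        packet += b"\x00" + struct.pack("!HHIH", 41, edns_payload, 0, 0)
    return packet

def is_truncated(reply: bytes) -> bool:
    """Return True if the TC bit (0x0200 in the flags word) is set in a reply."""
    flags = struct.unpack("!H", reply[2:4])[0]
    return bool(flags & 0x0200)

if __name__ == "__main__":
    plain = build_query("_sip._udp.example.net")
    edns = build_query("_sip._udp.example.net", edns_payload=4096)
    print(len(edns) - len(plain), "extra bytes for the OPT record")
```

A query built without edns_payload only entitles the sender to a
512-byte UDP reply, which is why a 528-byte answer comes back truncated.
With dig, adding +bufsize=4096 to the command line should make it send
the equivalent OPT record, assuming your build supports that option.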