Hi all,

I wanted to collect some ideas on how you deal with DNS connectivity problems. I've run into these issues a couple of times already and haven't found a perfect solution so far. Maybe I can trigger some discussion:

Some background:
- opensips blocks the child process while resolving a domain / querying ENUM
- the standard resolver has a minimum timeout of 1s
- the standard resolver does only one query at a time and can cycle nameservers, but does not save state

I believe these are not real problems - just ugly legacy :) that we can work around.

The implication is that if you don't use a caching nameserver on your side and you allow users to route based on a domain name (not very hard - do you handle "302"s, record-routes, registration?), you're basically screwed:

1. If you don't cache, any domain which times out will block a child for at least 1s. If you use retries, you block for at least N seconds, where N = number of nameservers. You can be DoS-ed with ~8 packets per second in the standard configuration.
2. If you cycle N nameservers and one of them is down, you process N-1 packets correctly, then block until timeout on the Nth one, then process another N-1, and so on - not good for a high-traffic proxy.
3. If you cache results, you're safe from random failures, but only if you cache timeouts as negative results and keep the state of servers being down, so you don't try to query them again. (Nothing apart from `dnsmasq` does that, AFAIK.)
4. What solves half of the problem for me is `dnsmasq` - as far as I know it's the only caching DNS server which can query all nameservers in parallel. I get 4 times the needed DNS traffic, but I never time out connections if one of the servers is down. Also, some results come from the cache, so in reality it's only 2 times the traffic. The problem with `dnsmasq` is that it doesn't cache SRV and NAPTR requests (and doesn't cache the timeouts / NX responses for them either), only A/AAAA/PTR/...
5. So even if you have a local caching and backup resolver in `resolv.conf`, minimal timeout, parallel querying from the local cache, saved state of upstream resolvers being down, and you route all internal traffic via IPs...
it takes only one person with a custom NAPTR record sending you to a custom SRV address which times out to kill all the traffic.

So... what's your experience with this? Do you have some better protection in place? I'm considering adding negative caching of DNS timeouts and general caching of SRV and NAPTR records to `dnsmasq` to complete my protection. Do you know of any software which would solve those problems out of the box?

Thanks,
Stan

_______________________________________________
Users mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
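
To make the "cache timeouts as negative results" idea from point 3 a bit more concrete, here is a minimal Python sketch of the behaviour I have in mind. It is not how `dnsmasq` or opensips work internally. The class name `NegativeTimeoutCache`, the `resolve` callback and the `negative_ttl` parameter are all made up for illustration, and the sketch assumes the callback returns a `(records, ttl)` pair or raises `TimeoutError`. It caches any record type (so SRV and NAPTR too) and remembers a timeout for a while, so a name that keeps timing out blocks the caller only once per `negative_ttl` window.

```python
import time


class NegativeTimeoutCache:
    """Illustrative cache that stores timeouts as negative results.

    `resolve(name, rtype)` is a hypothetical callback supplied by the caller.
    It is assumed to return (records, ttl_seconds) or raise TimeoutError.
    """

    def __init__(self, resolve, negative_ttl=30.0, clock=time.monotonic):
        self.resolve = resolve
        self.negative_ttl = negative_ttl  # how long a timeout is remembered
        self.clock = clock                # injectable so expiry can be tested
        self._entries = {}                # (name, rtype) -> (expires_at, records)

    def lookup(self, name, rtype):
        """Return cached or fresh records, or None for a remembered timeout."""
        key = (name.lower(), rtype)
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # cache hit; None means "timed out recently"
        try:
            records, ttl = self.resolve(name, rtype)
        except TimeoutError:
            # Negative caching: remember the failure instead of retrying
            # (and blocking again) on every packet for this name.
            records, ttl = None, self.negative_ttl
        self._entries[key] = (self.clock() + ttl, records)
        return records
```

With something like this in front of the resolver, the first lookup of a broken SRV target still blocks for the full timeout, but repeat lookups within `negative_ttl` return `None` immediately. It does not track which upstream servers are down. That would need separate state.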
