So the cache servers are HA behind something (an F5 LTM, a Cisco
LocalDirector, something else). Are the authoritative servers HA as well? It
would seem sensible to treat them the same way; that way a timeout only
occurs if the whole HA cluster is unavailable.
You can alleviate even that situation by seeding the cache servers every
(TTL minus some margin) minutes, or by slaving the domain on the cache servers.
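For the slaving option, a minimal sketch of what each cache would carry
(the zone name comes from your message; the master addresses and file path
are illustrative, not yours):

```
// named.conf fragment on each cache server.
// Master addresses below are placeholders for your authoritative servers.
zone "xxx.com" {
    type slave;
    masters { 10.2.1.1; 10.2.1.2; };
    file "slaves/xxx.com.db";
};
```

Seeding, by contrast, can be as simple as a cron job that queries your hot
names against each cache shortly before their TTL runs out, so the records
never fall out of cache even when the authoritative servers are unreachable.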
On 14/09/10 11:34 AM, "Howard Wilkinson" wrote:
> I have been working on building out a couple of large data centres and
> have been struggling with how to set up the systems so that we get a high
> resilience, highly responsive DNS service in the presence of failing
> equipment.
>
> The configuration we have adopted includes a layer of BIND 9.6.x servers
> that act as pure name server caches. We have six of these servers in each
> data centre paired to provide service on VIPs so that if one of the pair
> fails the other cache takes over.
>
> Our resolv.conf is of the following form.
>
> search xxx.com yyy.com
> nameserver 10.1.1.1
> nameserver 10.1.2.1
> nameserver 10.1.3.1
> options timeout:1 attempts:15 no-check-names rotate
>
> The name servers are thus on different networks within the DCs.
>
> Our first problem arises because the timeouts seem to be taken serially on
> each server, rather than the rotate option applying between each name server
> request. Is this what I should have expected, i.e. a 15-second timeout
> before the next server is tried in sequence?
>
> The second problem we face is that even if we could get a one-second
> timeout, this is orders of magnitude too slow for names that should be
> resolved within our local name space. In other words, for lookups within
> the xxx.com and yyy.com domains I would like to see timeouts in the
> microsecond range.
>
> Thinking further about this problem I have been considering whether the
> resolver should be multi-threaded or parallelised in some way so that it
> tries all of the servers at once and accepts the first to respond. I have
> come to the conclusion that this would be too difficult to make resilient
> in the general use of the resolver code, but would make sense if the
> lwresd layer is added to the equation.
>
> Which brings me on to the use of lwresd: this would reduce the incidence
> of problems with non-responsive servers, in that it would detect and switch
> to an alternative server on the first failed attempt. However, this still
> means that if lwresd has not yet detected the down server, we get a stall
> in responses within the data centre.
>
> So my questions are:
>
> 1. Does anybody have any experience in building such systems and
> suggestions on how we should tune the clients and servers to make the
> system less fragile in the presence of hardware, software and network
> failures?
>
> 2. Is it possible with lwresd as it is written today to get the effect of
> precognition - i.e. can I get lwresd to notice that a server has gone down
> or has come back up without it needing to be triggered by a resolver
> request?
>
> 3. Does anybody know if I can configure lwresd to expect particular zones
> to be resolved within very small windows and use this to fail over to the
> next server?
>
> And for discussion I wonder if there would be room to add to the resolver
> code and or lwresd additional options of the form
>
> options zone-timeout: xxx.com:1usec
>
> or something similar, whereby the resolver could be told that if the cache
> does not respond within this time for that particular zone, then it can
> be assumed that the server is misbehaving.
>
> Thank you for your attention
>
> Regards, Howard.
>
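On the parallel-resolver idea: the fan-out, first-responder-wins pattern you
describe can be sketched as below. This is purely illustrative; resolve_via()
is a hypothetical stand-in for a real per-server UDP DNS query and just
simulates latency here.

```python
# Sketch of "query all servers at once, accept the first to respond".
import concurrent.futures
import time

def resolve_via(server, name, latency):
    """Hypothetical stand-in for a real UDP DNS query to one server.
    Here it only sleeps to simulate that server's response time."""
    time.sleep(latency)
    return (server, "10.0.0.42")  # pretend A-record answer

def parallel_resolve(name, servers, timeout=1.0):
    """Fan the query out to every server; return the first response.

    servers is a list of (address, simulated_latency) pairs."""
    with concurrent.futures.ThreadPoolExecutor(len(servers)) as pool:
        futures = [pool.submit(resolve_via, addr, name, lat)
                   for addr, lat in servers]
        done, _ = concurrent.futures.wait(
            futures, timeout=timeout,
            return_when=concurrent.futures.FIRST_COMPLETED)
        if not done:
            raise TimeoutError("no server answered within %gs" % timeout)
        return next(iter(done)).result()

if __name__ == "__main__":
    servers = [("10.1.1.1", 0.30), ("10.1.2.1", 0.05), ("10.1.3.1", 0.20)]
    who, addr = parallel_resolve("host.xxx.com", servers)
    print(who)  # the fastest server wins
```

The hard part, as you say, is making this resilient in general-purpose
resolver code (socket exhaustion, duplicate in-flight queries); doing it in
an lwresd-style local daemon, where the fan-out is centralised, seems much
more tractable.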
--
Kal Feher
___
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users