Re: Timeouts and retries on high speed Lans

2010-09-14 Thread Kalman Feher
So the cache servers are HA behind something (F5 LTM, Cisco LocalDirector,
something else). Are the authoritative servers as well? It would seem
sensible to do the same with them. That way a timeout only occurs if the
whole HA cluster is unavailable.
You can alleviate even that situation by seeding the cache servers every
(TTL minus some margin) minutes, or by slaving the domain on the cache
servers.
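
As a rough illustration of the seeding idea, the sketch below computes how
often a record would have to be re-queried so that the cache is refreshed
before the record expires. The function name and the example TTL and margin
are made up for illustration, and it works in seconds because that is the
unit TTLs are expressed in.

```python
def reseed_interval(ttl_seconds, margin_seconds):
    """Seconds to wait between re-queries so a record is refreshed before it expires."""
    if margin_seconds <= 0 or margin_seconds >= ttl_seconds:
        raise ValueError("margin must be positive and smaller than the TTL")
    return ttl_seconds - margin_seconds


# Hypothetical values: a one-hour TTL refreshed five minutes early.
print(reseed_interval(3600, 300))  # 3300
```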


On 14/09/10 11:34 AM, "Howard Wilkinson" wrote:

> I have been working on building out a couple of large data centres and
> have been struggling with how to set up the systems so that we get a high
> resilience, highly responsive DNS service in the presence of failing
> equipment.
> 
> The configuration we have adopted includes a layer of BIND 9.6.x servers
> that act as pure name server caches. We have six of these servers in each
> data centre paired to provide service on VIPs so that if one of the pair
> fails the other cache takes over.
> 
> Our resolv.conf is of the following form.
> 
> search xxx.com yyy.com
> nameserver 10.1.1.1
> nameserver 10.1.2.1
> nameserver 10.1.3.1
> options timeout:1 attempts:15 no-check-names rotate
> 
> The name servers are thus on different networks within the DCs.
> 
> Our first problem arises because the timeouts seem to be taken serially on
> each server rather than the rotate applying between each name server
> request. Is this what I should have expected, i.e. a 15 second timeout
> before the next server is tried in sequence?
> 
> The second problem we face is that even if we could get a one second
> timeout, this is orders of magnitude too slow for names that should be
> resolved within our local name space. In other words for lookups within
> the xxx.com and yyy.com domains I would like to see timeouts in the
> micro-second range.
> 
> Thinking further about this problem I have been considering whether the
> resolver should be multi-threaded or parallelised in some way so that it
> tries all of the servers at once and accepts the first to respond. I have
> come to the conclusion that this would be too difficult to make resilient
> in the general use of the resolver code, but would make sense if the
> lwresd layer is added to the equation.
> 
> Which brings me on to the use of lwresd. This would reduce the incidence
> of problems with non-responsive servers in that it would detect and switch
> to an alternative server on the first failed attempt. However, this still
> means that if lwresd has not detected the down server then we get a stall
> in response within the data centre.
> 
> So my questions are:
> 
> 1. Does anybody have any experience in building such systems and
> suggestions on how we should tune the clients and servers to make the
> system less fragile in the presence of hardware, software and network
> failures?
> 
> 2. Is it possible with lwresd as it is written today to get the effect of
> precognition - i.e. can I get lwresd to notice that a server has gone down
> or has come back up without it needing to be triggered by a resolver
> request?
> 
> 3. Does anybody know if I can configure lwresd to expect particular zones
> to be resolved within very small windows and use this to fail over to the
> next server?
> 
> And for discussion I wonder if there would be room to add to the resolver
> code and/or lwresd additional options of the form
> 
> options zone-timeout: xxx.com:1usec
> 
> or something similar, whereby the resolver could be told that if the cache
> does not respond within this time about that particular zone then it can
> be assumed that the server is misbehaving.
> 
> Thank you for your attention
> 
> Regards, Howard.
> 
> ___
> bind-users mailing list
> bind-users@lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users

-- 
Kal Feher 



Timeouts and retries on high speed Lans

2010-09-14 Thread Howard Wilkinson
I have been working on building out a couple of large data centres and
have been struggling with how to set up the systems so that we get a high
resilience, highly responsive DNS service in the presence of failing
equipment.

The configuration we have adopted includes a layer of BIND 9.6.x servers
that act as pure name server caches. We have six of these servers in each
data centre paired to provide service on VIPs so that if one of the pair
fails the other cache takes over.

Our resolv.conf is of the following form.

search xxx.com yyy.com
nameserver 10.1.1.1
nameserver 10.1.2.1
nameserver 10.1.3.1
options timeout:1 attempts:15 no-check-names rotate

The name servers are thus on different networks within the DCs.

Our first problem arises because the timeouts seem to be taken serially on
each server rather than the rotate applying between each name server
request. Is this what I should have expected, i.e. a 15 second timeout
before the next server is tried in sequence?
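
To put numbers on the two behaviours being contrasted here, the sketch
below uses a deliberately simplified model in which every try waits the
full timeout. The function names are invented for illustration, and real
resolvers may shorten later waits or cap the options: the glibc
resolv.conf(5) man page, for example, says attempts is silently capped
to 5, which would turn the 15 and 45 below into 5 and 15 on such clients.

```python
def seconds_before_second_server(timeout_s, attempts, serial):
    """Delay before a second name server is first queried when the first is dead.

    serial=True models all attempts being spent on one server before moving on;
    serial=False models moving to the next server after a single timeout.
    """
    return timeout_s * attempts if serial else timeout_s


def worst_case_total(timeout_s, attempts, servers):
    """Upper bound on the stall when no server answers and every try waits fully."""
    return timeout_s * attempts * servers


# Values from the resolv.conf above: timeout:1 attempts:15, three nameservers.
print(seconds_before_second_server(1, 15, serial=True))   # 15
print(seconds_before_second_server(1, 15, serial=False))  # 1
print(worst_case_total(1, 15, 3))                         # 45
```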

The second problem we face is that even if we could get a one second
timeout, this is orders of magnitude too slow for names that should be
resolved within our local name space. In other words for lookups within
the xxx.com and yyy.com domains I would like to see timeouts in the
micro-second range.

Thinking further about this problem I have been considering whether the
resolver should be multi-threaded or parallelised in some way so that it
tries all of the servers at once and accepts the first to respond. I have
come to the conclusion that this would be too difficult to make resilient
in the general use of the resolver code, but would make sense if the
lwresd layer is added to the equation.
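
A minimal sketch of the "ask every server at once, take the first answer"
idea, to make the trade-off concrete. It is not how the stub resolver or
lwresd works; `first_answer` and `fake_query` are hypothetical names, and
`fake_query` only simulates a slow server, a fast server and a dead one
rather than sending real DNS packets.

```python
import concurrent.futures
import time


def first_answer(query, servers, name, timeout_s):
    """Send the same query to every server at once; return the first success.

    `query(server, name)` is a caller-supplied function that returns an
    answer or raises an exception on failure.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(servers))
    futures = [pool.submit(query, server, name) for server in servers]
    try:
        for future in concurrent.futures.as_completed(futures, timeout=timeout_s):
            if future.exception() is None:
                return future.result()
        raise LookupError("every server returned an error")
    except concurrent.futures.TimeoutError:
        raise LookupError("no server answered within the timeout") from None
    finally:
        # Do not wait for the slower servers; their threads finish on their own.
        pool.shutdown(wait=False)


def fake_query(server, name):
    """Stand-in for a real DNS query: one server is slow, one is down."""
    if server == "10.1.3.1":
        raise ConnectionError("server down")
    time.sleep(0.3 if server == "10.1.1.1" else 0.01)
    return (server, name)


servers = ["10.1.1.1", "10.1.2.1", "10.1.3.1"]
print(first_answer(fake_query, servers, "host.xxx.com", timeout_s=1.0))
```

The cost is visible even in the sketch: every lookup generates one query
per server, and the slower queries keep running after an answer has
already been returned.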

Which brings me on to the use of lwresd. This would reduce the incidence
of problems with non-responsive servers in that it would detect and switch
to an alternative server on the first failed attempt. However, this still
means that if lwresd has not detected the down server then we get a stall
in response within the data centre.

So my questions are:

1. Does anybody have any experience in building such systems and
suggestions on how we should tune the clients and servers to make the
system less fragile in the presence of hardware, software and network
failures?

2. Is it possible with lwresd as it is written today to get the effect of
precognition - i.e. can I get lwresd to notice that a server has gone down
or has come back up without it needing to be triggered by a resolver
request?

3. Does anybody know if I can configure lwresd to expect particular zones
to be resolved within very small windows and use this to fail over to the
next server?
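
The "precognition" in question 2 amounts to probing the servers on a timer
instead of discovering a dead one during a real lookup. The sketch below
shows that behaviour only; it is not an lwresd feature or configuration,
and the class name, the `probe` callback and the example addresses being
marked down are all assumptions made for illustration.

```python
class ServerHealth:
    """Track which servers answered their most recent probe."""

    def __init__(self, servers, probe):
        self.servers = list(servers)
        self.probe = probe  # caller-supplied: probe(server) -> True if it answered
        self.alive = {server: True for server in self.servers}

    def run_probes(self):
        """Probe every server once; meant to be called on a timer, not per lookup."""
        for server in self.servers:
            try:
                self.alive[server] = bool(self.probe(server))
            except Exception:
                self.alive[server] = False

    def candidates(self):
        """Servers currently believed alive; all servers if none appear to be."""
        live = [server for server in self.servers if self.alive[server]]
        return live or list(self.servers)


# Simulated probe: pretend 10.1.2.1 has stopped answering.
health = ServerHealth(["10.1.1.1", "10.1.2.1", "10.1.3.1"],
                      probe=lambda server: server != "10.1.2.1")
health.run_probes()
print(health.candidates())  # ['10.1.1.1', '10.1.3.1']
```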

And for discussion I wonder if there would be room to add to the resolver
code and/or lwresd additional options of the form

options zone-timeout: xxx.com:1usec

or something similar, whereby the resolver could be told that if the cache
does not respond within this time about that particular zone then it can
be assumed that the server is misbehaving.
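
To make the proposal concrete, here is a sketch of the lookup such an
option would imply: pick the timeout of the most specific configured zone
that contains the name being resolved, and fall back to the global timeout
otherwise. No such resolv.conf or lwresd option exists as far as this
message is concerned; `timeout_for`, the table and its values are
hypothetical.

```python
def timeout_for(name, zone_timeouts, default_s):
    """Return the timeout of the most specific configured zone containing `name`."""
    labels = name.rstrip(".").lower().split(".")
    # Try the longest suffix first so that a more specific zone wins.
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in zone_timeouts:
            return zone_timeouts[zone]
    return default_s


# Hypothetical per-zone timeouts in seconds; 1.0 stands in for "timeout:1".
ZONE_TIMEOUTS = {"xxx.com": 0.001, "yyy.com": 0.001}

print(timeout_for("host.xxx.com", ZONE_TIMEOUTS, 1.0))     # 0.001
print(timeout_for("www.example.org", ZONE_TIMEOUTS, 1.0))  # 1.0
```

Matching on whole labels rather than on a plain string suffix keeps a name
such as notxxx.com from being treated as part of xxx.com.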

Thank you for your attention

Regards, Howard.
