Bug#518129: bind9 hangs: NXDOMAIN for recursive requests but serves authoritative zones

2009-05-11 Thread Steven Chamberlain
Version: 1:9.5.1.dfsg.P1-2

Hi,

Just suffered the same problem!  It sounds pretty nasty if you run a
busy nameserver or just set a low cache size to restrict memory usage.
I had max-cache-size 1m; which I think triggers the problem sooner.

My best guess is that the cache becomes exhausted after several
hours/days of running; old entries are purged from the cache, but
unfortunately this includes the root hints.  Is that a bug, or a
misconfiguration on my part?  It causes recursive queries to fail,
although answers are still given from authoritative zones.
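
To illustrate the guess above, here is a toy model only. It is not
named's cache code; the class name, the capacity and the "pinned"
idea are all assumptions made up for the illustration. It shows how a
small least-recently-used cache can push out the root NS entry unless
that entry is protected from eviction.

```python
from collections import OrderedDict


class ToyCache:
    """Hypothetical LRU cache; NOT how named is implemented."""

    def __init__(self, capacity, pinned=()):
        self.capacity = capacity
        self.pinned = set(pinned)      # keys that must never be evicted
        self.entries = OrderedDict()   # oldest entry first

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)  # mark as most recently used
        while len(self.entries) > self.capacity:
            # Evict the oldest entry that is not pinned.
            for victim in self.entries:
                if victim not in self.pinned:
                    del self.entries[victim]
                    break
            else:
                break  # everything left is pinned; nothing to evict

    def __contains__(self, key):
        return key in self.entries


def fill(cache):
    """Store the root NS set first, then enough names to overflow."""
    cache.put(".", "root NS rrset")
    for name in ("a.example", "b.example", "c.example"):
        cache.put(name, "A record")
    return "." in cache
```

With a capacity of 3, fill(ToyCache(3)) loses the root entry, while
fill(ToyCache(3, pinned={"."})) keeps it.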

My configuration is a little complicated:  split-horizon with
internal/external views, but only the internal view allows recursion and
that's where I had problems.

Relevant global options:

options {
    // ...

    max-cache-size 1m;
    recursive-clients 256;
};

Internal view options:

view "internal" {
    match-clients { 192.168.0.0/16; 127.0.0.1/16; };
    recursion yes;
    notify no;

    // prime the server with knowledge of the root servers
    zone "." {
        type hint;
        file "/etc/bind/db.root";
    };

    // ...
};

My root hints file was the 2008020400-serial that shipped with the
Debian package, but I'll be updating that now.

My workaround will be to set max-cache-size unlimited; for the time being.

Regards,
-- 
Steven Chamberlain
ste...@pyro.eu.org





Bug#518129: bind9 hangs: NXDOMAIN for recursive requests but serves authoritative zones

2009-03-04 Thread Christoph Haas
Package: bind9
Version: 1:9.5.1.dfsg.P1-1
Severity: important


Since we upgraded our bind9 name servers from Etch to Lenny we are
experiencing occasional hangs. While all requests for authoritative zones are
still answered correctly we can't seem to get replies for recursive queries.
All we get is NXDOMAIN until we init.d/restart the bind process. Every tenth
or so request is answered properly but the next request fails again with
NXDOMAIN. So the successful response from the root servers doesn't seem to get
served from the internal cache either.

In that situation our log file fills up with:

Mar  4 07:21:45 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:21:56 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:22:07 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:22:57 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:22:58 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:22:59 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:23:03 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:23:09 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:23:32 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:23:36 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:23:37 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:23:43 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:26:23 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:26:34 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:31:35 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:32:33 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:32:35 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:32:45 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:47 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:51 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:54 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:54 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:56 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:56 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:58 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:58 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:33:58 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:00 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:03 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:12 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:12 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:13 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:13 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
Mar  4 07:34:13 pns named[7077]: general: checkhints: unable to get root NS rrset from cache: not found
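
A rough way to see how often the message above appears is to count it
per minute. The following is only a sketch: the function name is made
up, and the pattern assumes the syslog line layout shown in this
report (timestamp, host, named[pid], then the message).

```python
import re
from collections import Counter

# Matches the start of a checkhints failure line as logged above.
# Only the first physical line is needed, so wrapped lines still match.
CHECKHINTS = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}):\d{2}\s+\S+\s+named\[\d+\]:"
    r".*checkhints: unable to get root NS"
)


def checkhints_per_minute(log_text):
    """Return a Counter keyed by (month, day, 'HH:MM')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHECKHINTS.match(line)
        if match:
            month, day, minute = match.groups()
            counts[(month, int(day), minute)] += 1
    return counts
```

Feeding it the syslog file contents and printing
counts.most_common() should show the minutes in which the failures
cluster.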

We traced (tshark) what's happening on the network and it seems like bind9
isn't even sending out requests to the internet if we send it a recursive
query from inside/LAN. Instead it instantly replies with NXDOMAIN.

This situation is happening every few days and requires a bind restart or else
our clients can't run recursive queries any more (which apparently isn't
making them happy).
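
Until the cause is known, the failure could be detected with a small
probe that sends a recursive query for a name known to exist and
checks the response code. This is a sketch under assumptions: the
function names are hypothetical, the packet layout follows RFC 1035,
and the server address and query name passed in are placeholders to
be chosen by whoever runs it.

```python
import socket
import struct

NXDOMAIN = 3  # RCODE 3 in the DNS header (RFC 1035)


def build_query(name, query_id=0x1234):
    """Build a minimal A/IN query with the RD (recursion desired) bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def response_rcode(packet):
    """Extract the RCODE (low 4 bits of the flags word) from a response."""
    if len(packet) < 12:
        raise ValueError("DNS header is 12 bytes; packet too short")
    (flags,) = struct.unpack("!H", packet[2:4])
    return flags & 0x000F


def probe(server, name, timeout=2.0):
    """Send one UDP query to server:53 and return the RCODE."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        packet, _ = sock.recvfrom(512)
    return response_rcode(packet)
```

If probe() returns NXDOMAIN for a name that certainly exists, that
would match the symptom described here, and a cron job could raise an
alert or restart named at that point.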

Our name server serves nearly 500 authoritative zones and is used as a
forwarder for the internal/LAN clients. rndc status shows:

==
version: 9.5.1-P1
number of zones: 511
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is OFF
recursive clients: 6/0/1000
tcp clients: 0/100
server is up and running
==
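
For anyone who wants to track these values over time, the status
output above can be split into fields with a few lines of code. This
is a sketch with made-up function names; reading the three
"recursive clients" numbers as current / soft limit / hard limit is
an assumption, not something stated in this report.

```python
def parse_rndc_status(text):
    """Turn 'key: value' lines of rndc status output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. 'server is up and running'
            status[key.strip()] = value.strip()
    return status


def recursive_clients(status):
    """Return the three slash-separated counters as integers.

    Assumed meaning: (current, soft limit, hard limit).
    """
    current, soft, hard = (
        int(part) for part in status["recursive clients"].split("/")
    )
    return current, soft, hard
```

Run against the output above, recursive_clients() gives (6, 0, 1000),
so the server was nowhere near its recursive client limit when the
status was taken.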

Our