Hi,
I'm running unbound 1.4.6-1 from Debian squeeze on a couple of machines as a resolver and have come across a strange problem where unbound stops resolving some hosts. Which names work and which don't seems sporadic, but nagios generally reports a timeout after 10s when trying to resolve www.google.com. I sometimes notice it before nagios does if I happen to be using that resolver. I've tried sniffing traffic, but it's hard to pinpoint individual queries as the servers are quite busy.

What I've noticed, though, is that sometimes, for example, I won't be able to resolve yahoo.com, but after a minute or two it works again, and then it fails again a few minutes later. I don't believe there is a network problem, as it fails for some domains whilst working for others which are on the same name servers.
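
To pin down the on/off pattern described above, a rough sketch in Python of a probe that times one lookup per name and records whether it worked. The function name probe and its parameters are made up for illustration. It goes through the system resolver, so it assumes /etc/resolv.conf on the probing host points at the unbound instance, and it only measures how long a lookup took rather than enforcing a timeout itself.

```python
# Sketch: time a lookup for each name and record which ones fail or are slow.
# "probe" and its parameters are hypothetical names, not part of any tool.
import socket
import time


def probe(names, resolve=socket.gethostbyname, clock=time.monotonic, limit=10.0):
    """Return (name, status, seconds) tuples; status is 'ok', 'slow' or 'fail'."""
    results = []
    for name in names:
        start = clock()
        try:
            resolve(name)
            status = "ok"
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            status = "fail"
        elapsed = clock() - start
        # A lookup that succeeds but takes longer than the nagios limit
        # would still have been reported as a timeout.
        if status == "ok" and elapsed > limit:
            status = "slow"
        results.append((name, status, elapsed))
    return results
```

Running probe(["www.google.com", "yahoo.com"]) once a minute and logging the tuples with a timestamp should show whether the failures for different names line up in time.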

If I restart unbound it clears up the problem, but this is obviously less than ideal. I'm tending to do this once a day now, sometimes more frequently. I've moved one of the resolvers over to other software for now, to avoid the situation where both are broken at the same time.

Server config looks like this from the one which is still being used (minus interface, outgoing-interface and access-control lines):

server:
    verbosity: 1
    statistics-interval: 86400
    num-threads: 2
    outgoing-range: 256
    msg-cache-size: 128m
    num-queries-per-thread: 1024
    rrset-cache-size: 256m
    do-ip6: no
    chroot: ""
    root-hints: /etc/unbound/named.cache
    hide-identity: yes
    hide-version: yes


On the other one I tweaked the socket receive buffer slightly, in case it was using all its sockets, but it didn't make any difference:

server:
    verbosity: 2
    statistics-interval: 86400
    num-threads: 2
    outgoing-range: 462
    so-rcvbuf: 4m
    msg-cache-size: 128m
    num-queries-per-thread: 1024
    rrset-cache-size: 256m
    do-ip6: no
    chroot: ""
    logfile: "/var/log/unbound.log"
    root-hints: /etc/unbound/named.cache
    hide-identity: yes
    hide-version: yes


# cat /proc/sys/net/core/rmem_max
4194304
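
As a sanity check on the receive-buffer tweak, a small Python sketch that compares the so-rcvbuf value from the config with the kernel limit shown above. It assumes unbound reads the k/m/g size suffixes as powers of 1024. parse_size and fits_kernel_limit are hypothetical helper names, not part of unbound.

```python
# Sketch: does the so-rcvbuf request fit within net.core.rmem_max?
# Assumption: size suffixes k/m/g are powers of 1024.

SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a config size such as '4m' or '65536' to bytes."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def fits_kernel_limit(so_rcvbuf, rmem_max):
    """True when the requested buffer does not exceed the kernel maximum."""
    return parse_size(so_rcvbuf) <= rmem_max


if __name__ == "__main__":
    # Values taken from the config and the /proc output above.
    print(parse_size("4m"), fits_kernel_limit("4m", 4194304))
```

Under that assumption 4m works out to 4194304 bytes, the same as rmem_max, so the requested buffer should fit within the kernel limit.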

The logs didn't really show much, and they produce too much data to trawl through easily.

This morning I changed to running one thread, and when the problem occurred just now I dumped the stats, which were as follows:


# unbound-control status
version: 1.4.6
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 3095 seconds
unbound (pid 24730) is running...


# unbound-control stats_noreset
thread0.num.queries=87145
thread0.num.cachehits=38156
thread0.num.cachemiss=48989
thread0.num.prefetch=0
thread0.num.recursivereplies=41036
thread0.requestlist.avg=327.143
thread0.requestlist.max=1091
thread0.requestlist.overwritten=3188
thread0.requestlist.exceeded=18
thread0.requestlist.current.all=1091
thread0.requestlist.current.user=1024
thread0.recursion.time.avg=27.134193
thread0.recursion.time.median=0.00895803
total.num.queries=87145
total.num.cachehits=38156
total.num.cachemiss=48989
total.num.prefetch=0
total.num.recursivereplies=41036
total.requestlist.avg=327.143
total.requestlist.max=1091
total.requestlist.overwritten=3188
total.requestlist.exceeded=18
total.requestlist.current.all=1091
total.requestlist.current.user=1024
total.recursion.time.avg=27.134193
total.recursion.time.median=0.00895803
time.now=1301394872.957300
time.up=3099.708199
time.elapsed=3099.708199
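
To make the dump easier to read, a rough Python sketch that parses the key=value lines and pulls out a few figures. parse_stats and summarise are hypothetical helper names. It assumes, going by my reading of the unbound-control man page, that requestlist.exceeded counts queries dropped because the request list was full and requestlist.overwritten counts entries replaced by newer ones. Comparing requestlist.current.user against num-queries-per-thread is only meaningful per thread, so the sketch reads the thread0 counters.

```python
# Sketch: summarise `unbound-control stats_noreset` output for one thread.
# Helper names are illustrative; counter meanings are as assumed above.

def parse_stats(text):
    """Parse key=value lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            stats[key] = float(value)
    return stats


def summarise(stats, queries_per_thread, prefix="thread0"):
    """Return the cache hit ratio and request-list pressure for one thread."""
    queries = stats[prefix + ".num.queries"]
    hits = stats[prefix + ".num.cachehits"]
    current_user = stats[prefix + ".requestlist.current.user"]
    return {
        "hit_ratio": hits / queries if queries else 0.0,
        "list_full": current_user >= queries_per_thread,
        "exceeded": stats[prefix + ".requestlist.exceeded"],
        "overwritten": stats[prefix + ".requestlist.overwritten"],
    }


# A subset of the dump above.
SAMPLE = """\
thread0.num.queries=87145
thread0.num.cachehits=38156
thread0.requestlist.overwritten=3188
thread0.requestlist.exceeded=18
thread0.requestlist.current.user=1024
"""

if __name__ == "__main__":
    # 1024 is num-queries-per-thread from the config above.
    print(summarise(parse_stats(SAMPLE), queries_per_thread=1024))
```

For the dump above this gives a hit ratio of roughly 0.44 and shows requestlist.current.user sitting at 1024, which is exactly the configured num-queries-per-thread.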


I've read the changelogs of newer versions but can't see anything that looks like this problem. I'd prefer to avoid upgrading to the latest source distribution on the off-chance it will fix it as that just seems like clutching at straws.

Any ideas?

Cheers,

john
