Hi Wouter,
Excellent explanations and fast reply as usual, many thanks.
(W)hat version are you using? Recently the timeout code was changed to
cope with this sort of situation (1.4.7):
http://www.unbound.net/documentation/info_timeout.html
Oops, sorry. I forgot to mention it, but I am using the latest: 1.4.8.
It's running on Centos 5.5 (old 2.6.18 kernel sadly). We built our own
packages. And it should have libevent. A while ago I started a thread
asking for an explicit way to be sure that we have and are
using libevent. And you told me that I was using it, IIRC :)
unbound-libs-1.4.8-2.el5
unbound-1.4.8-2.el5
ldns-1.6.8-1.el5
libevent-1.4.13-1
unbound-control status
version: 1.4.8
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 2773636 seconds
unbound (pid 3952) is running...
Version 1.4.8
linked libs: libevent 1.4.13-stable (it uses epoll), ldns 1.6.8, OpenSSL
0.9.8e-fips-rhel5 01 Jul 2008
linked modules: validator iterator
configured for i386-redhat-linux-gnu on Wed Feb 16 10:26:27 EST 2011
with options: '--build=i386-koji-linux-gnu' '--host=i386-koji-linux-gnu'
'--target=i386-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr'
'--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
'--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
'--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--localstatedir=/var'
'--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--with-ldns=' '--with-libevent'
'--with-pthreads' '--with-ssl' '--disable-rpath' '--enable-debug'
'--disable-static' '--with-conf-file=/etc/unbound/unbound.conf'
'--with-pidfile=/var/run/unbound/unbound.pid' '--disable-gost'
'--enable-sha2'
BSD licensed, see LICENSE in source package for details.
Report bugs to [email protected]
[..] jostle-timeout is triggered when the server is very busy. What defines
'busy' ?
The requestlist is full.
Ok. I think this should be clarified in the documentation. I can send
you a patch if that would save you time.
Your requestlist is the default, so about 1000 and 300 does not fill it
up. I would recommend a recompile with libevent because of your
somewhat high load (then you can increase the requestlist and range to
several thousand, and in recent versions the default increases by
itself, http://www.unbound.net/documentation/howto_optimise.html )
I have read this document many times since I started using unbound (and
I will read it again ;). But which parameter defines the requestlist
size, or actually influences it?
[..]. Could that impact unbound reactivity ?
No, other queries take priority over these older queries.
Ok.
The requestlist is divided into two halves: run-to-completion, and
fast-stuff. The run-to-completion half is just that: its queries are
allowed to finish. The fast-stuff half deletes older queries to make
room for new queries (but not unless the jostle-timeout has expired,
otherwise you could delete everything that comes in immediately under
a DoS).
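
To make the two-halves description above concrete, here is a minimal toy
model in Python. It is only a sketch of the behaviour as described in this
thread, not unbound's actual code: the class names, the even split, and the
"evict the oldest entry" choice are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    arrived_ms: int  # arrival time in milliseconds


@dataclass
class RequestList:
    """Toy model: half the slots run to completion, half can be jostled."""
    capacity: int
    jostle_timeout_ms: int
    run_to_completion: list = field(default_factory=list)
    fast: list = field(default_factory=list)

    def add(self, name: str, now_ms: int) -> bool:
        """Try to admit a new query; return False if it has to be dropped."""
        half = self.capacity // 2
        entry = Entry(name, now_ms)
        if len(self.run_to_completion) < half:
            self.run_to_completion.append(entry)
            return True
        if len(self.fast) < self.capacity - half:
            self.fast.append(entry)
            return True
        if not self.fast:
            return False  # no jostle-able slots at all
        # The list is full: only an entry that has been waiting longer than
        # the jostle timeout may be evicted to make room for the new query.
        oldest = min(self.fast, key=lambda e: e.arrived_ms)
        if now_ms - oldest.arrived_ms > self.jostle_timeout_ms:
            self.fast.remove(oldest)
            self.fast.append(entry)
            return True
        return False
```

With a capacity of 4 and a timeout of 200 ms, four queries arriving at t=0
are all admitted, a fifth at t=100 is dropped, and a sixth at t=300 replaces
the oldest entry in the fast half. The run-to-completion half is never
touched, which is the protection against a flood of new queries.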
Thanks for the explanation. Is this written somewhere as well in the docos ?
Note: jostle-timeout is still set to the default (see my config below).
Yes that should be OK. If you lower it, it will be more likely to drop
the groupinfra stuff.
Ok. I may have some questions about that but I will read the doco first
about jostle-timeout.
I am asking that because sometimes our unbounds have a random hiccup and
I am wondering if it could be due to this or not. The 'hiccup' is very
hard to debug because it's random (once a month or so) on servers doing
something like 500 to 1500 qps each so increasing the verbosity from 1
to 2 is not really possible :)
Ok, so I think I will have to write a script to increase verbosity when
it seems that unbound can't resolve anymore, and hopefully I will be able
to catch this nasty issue (could be network related).
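
A minimal sketch of such a script, in Python. The only unbound-specific
piece is the `unbound-control verbosity <number>` command; everything else
(function names, the threshold of three consecutive failures, the polling
interval) is a hypothetical choice. The probe assumes the system resolver
points at the unbound instance, and a name that is still cached may keep
resolving during a hiccup, so the probe name needs some thought.

```python
import socket
import subprocess
import time


def can_resolve(name: str) -> bool:
    """Return True if the system resolver answers for `name`."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def set_verbosity(level: int) -> None:
    """Change the running daemon's verbosity without a restart."""
    subprocess.run(["unbound-control", "verbosity", str(level)], check=True)


def watch(probe, set_level, checks, failures_needed=3, normal=1, debug=2,
          interval=10.0, sleep=time.sleep):
    """Poll `probe`; raise verbosity after repeated failures, lower it on recovery."""
    failures = 0
    raised = False
    for _ in range(checks):
        if probe():
            failures = 0
            if raised:
                set_level(normal)  # resolution works again, back to quiet logs
                raised = False
        else:
            failures += 1
            if failures >= failures_needed and not raised:
                set_level(debug)  # capture detail while the problem is live
                raised = True
        sleep(interval)
    return raised
```

It could be started with something like
`watch(lambda: can_resolve("example.com"), set_verbosity, checks=8640)`.
Passing the probe and the level setter in as arguments keeps the loop
testable without touching a live server.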
What seems to happen is groupinfra has a lot of servers. And they
sometimes experience outages. When they experience an outage, unbound
gets timeouts and tries to fetch the names, but also the other
nameserver names (and there are a lot of them). Given user demand for
groupinfra, unbound starts to explore all the nameservers for
groupinfra, with timeouts and thus the entries fill up your requestlist.
The dependency structure is like that log excerpt that you show.
Because the thing has timeouts those entries are necessarily pretty old,
and thus (the ones in the fast-stuff list) would be dropped to make room
for new queries (if there was a lack of space, but there is no lack of
space, so these queries are performed: there is interest and there is
capacity to undertake actions to find the answers).
Yep ok, I understand but still it is weird to see unbound trying to
resolve something for almost forever. For instance 143000 secs aka 39
hours :) But we have resources so maybe one day it will work (I reckon
this domain just never works ;).
252 AAAA IN uk-dc007.groupinfra.com. 142994.571268 iterator wants AAAA
IN au-dc012.groupinfra.com. AAAA IN br-dc003.groupinfra.com. AAAA IN
de-dc008.groupinfra.com. AAAA IN my-dc003.groupinfra.com. AAAA IN
nl-dc006.groupinfra.com. AAAA IN ph-dc001.groupinfra.com.
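
For spotting entries like this one automatically, a small Python sketch
that pulls the age out of such a line and converts it to hours. The field
layout (index, type, class, name, age in seconds, then the module state) is
inferred from the excerpt above, so treat it as an assumption; the names
`RequestEntry`, `parse_entry` and `age_hours` are made up for illustration.

```python
import re
from typing import NamedTuple


class RequestEntry(NamedTuple):
    index: int
    qtype: str
    qclass: str
    qname: str
    age_seconds: float
    state: str


# index, type, class, name, age in seconds, then the rest of the line
LINE_RE = re.compile(r"^(\d+) (\S+) (\S+) (\S+) (\d+(?:\.\d+)?) (.*)$")


def parse_entry(text: str) -> RequestEntry:
    """Parse one requestlist entry, tolerating mail-client line wrapping."""
    folded = " ".join(text.split())  # collapse newlines and repeated spaces
    match = LINE_RE.match(folded)
    if match is None:
        raise ValueError("unrecognised requestlist entry: %r" % folded)
    index, qtype, qclass, qname, age, state = match.groups()
    return RequestEntry(int(index), qtype, qclass, qname, float(age), state)


def age_hours(entry: RequestEntry) -> float:
    """Convert the entry's age from seconds to hours."""
    return entry.age_seconds / 3600.0
```

For the entry above, 142994.571268 seconds works out to roughly 39.7 hours.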
-Thomas
_______________________________________________
Unbound-users mailing list
[email protected]
http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users