Hi John,

Your outgoing-range is 256 and your num-queries-per-thread is 1024. The stats below show that you have enough traffic to fill the 1024-entry request list completely. Having 1024 queries compete for only 256 sockets is what causes this problem: they have to wait until a socket becomes free before they can do any work.

Recompile with libevent (if you have not already); without libevent, unbound's builtin select-based event handling is limited to roughly 1024 file descriptors per process. Then set outgoing-range to 1.5x the num-queries-per-thread (or 2x):

    outgoing-range: 8192
    num-queries-per-thread: 4096

I also increased num-queries-per-thread, because the stats below show that you need more. Then there is enough capacity and queries do not have to wait for a socket.
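If you need it, the rebuild is roughly the following (--with-libevent also takes an optional path if libevent is installed in a non-standard location; adjust paths and install steps for your system):

    # rebuild unbound against libevent so it can handle thousands of sockets
    ./configure --with-libevent
    make && make install

Depending on the version, unbound -h prints the linked libs afterwards, which is a quick way to check that the libevent build is the one actually running.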
On 03/29/2011 12:46 PM, john wrote:
> outgoing-range: 256
> msg-cache-size: 128m
> num-queries-per-thread: 1024
> rrset-cache-size: 256m
> The other I tweaked slightly with socket receive buffers in case it was
> using all sockets, but it didn't make any difference:
>
> server:
>     verbosity: 2
>     statistics-interval: 86400
>     num-threads: 2
>     outgoing-range: 462
>     so-rcvbuf: 4m
>     msg-cache-size: 128m
>     num-queries-per-thread: 1024
>     rrset-cache-size: 256m
>     do-ip6: no
>     chroot: ""
>     logfile: "/var/log/unbound.log"
>     root-hints: /etc/unbound/named.cache
>     hide-identity: yes
>     hide-version: yes
>
> # cat /proc/sys/net/core/rmem_max
> 4194304
>
> Logs didn't really show much, and produce too much data to trawl through
> easily.
>
> This morning I've changed to running one thread, and when I received the
> problem just now I dumped stats, which were as follows:
>
> # unbound-control status
> version: 1.4.6
> verbosity: 1
> threads: 1
> modules: 2 [ validator iterator ]
> uptime: 3095 seconds
> unbound (pid 24730) is running...
>
> # unbound-control stats_noreset
> thread0.num.queries=87145
> thread0.num.cachehits=38156
> thread0.num.cachemiss=48989
> thread0.num.prefetch=0
> thread0.num.recursivereplies=41036
> thread0.requestlist.avg=327.143
> thread0.requestlist.max=1091

The max hits your num-queries-per-thread (and exceeds it, because of internally generated queries, usually for nameservers). Time to increase num-queries-per-thread.

> thread0.requestlist.overwritten=3188

And this is what happened as a result: queries are getting dropped. It tries to give priority to your www.google.com query, but you have also configured so few sockets that those queries then had to wait as well.

> thread0.requestlist.exceeded=18

Life was bad, but not that bad: the 3188 queries dropped on the floor above were 'old' queries that had already waited for a while. These 18 were dropped because there was no way to make space at all (it even considered dropping other queries). This number would go up if you were under a very high query rate, with queries coming in by the thousands per millisecond or so.

> thread0.requestlist.current.all=1091
> thread0.requestlist.current.user=1024

Here you see the requestlist is completely full: 1024 of 1024, and 1091 including internally generated entries. It would be good to have 1091 sockets in this situation (outgoing-range: 1091), but since we need to increase the requestlist size anyway, that number needs to increase as well.

> thread0.recursion.time.avg=27.134193

Yes, you had to wait 27 seconds on average.

> thread0.recursion.time.median=0.00895803

But the median is very low, meaning the bulk of the queries were very fast anyway (once a socket was available), while some queries took very long, causing the high average.

> total.num.queries=87145
> total.num.cachehits=38156
> total.num.cachemiss=48989

The cache hits are only 38156 out of 87145, about 44%. That is pretty low; maybe configure more memory for the cache. This could also be because you had just started and the cache has not warmed up yet.
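For example, roughly doubling what you have now, keeping rrset-cache-size at about twice msg-cache-size as in your current config (these particular sizes are just an illustration, tune them to the memory you can spare):

    msg-cache-size: 256m
    rrset-cache-size: 512m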
> total.num.prefetch=0

No prefetch enabled.

> total.num.recursivereplies=41036
> total.requestlist.avg=327.143
> total.requestlist.max=1091
> total.requestlist.overwritten=3188
> total.requestlist.exceeded=18
> total.requestlist.current.all=1091
> total.requestlist.current.user=1024
> total.recursion.time.avg=27.134193
> total.recursion.time.median=0.00895803
> time.now=1301394872.957300
> time.up=3099.708199
> time.elapsed=3099.708199
>
> I've read the changelogs of newer versions but can't see anything that
> looks like this problem. I'd prefer to avoid upgrading to the latest
> source distribution on the off-chance it will fix it, as that just seems
> like clutching at straws.
>
> Any ideas?

Yes: you need more capacity, and the queries are waiting for resources.

Best regards,
Wouter
