Thanks Wouter, the workaround with ip6 is working.

Best regards,
Jakub
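(For readers finding this thread later: the workaround discussed below amounts to leaving IPv6 processing enabled in unbound.conf, even on hosts with no IPv6 connectivity. A minimal sketch of the relevant fragment; `do-ip6` is the option documented in unbound.conf(5), and `yes` is its default, so it only needs stating if it was previously set to `no`:)

```
server:
    # Keep IPv6 enabled even without IPv6 connectivity; disabling it
    # (do-ip6: no, or no IPv6 in the kernel) triggers the 1.4.1 bug
    # described in this thread.
    do-ip6: yes
```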
2010/1/9 W.C.A. Wijngaards <[email protected]>:
> Hi Jakub,
>
> I cannot know for sure, since the log seems to be shortened by syslog,
> but I think you may have the same problem as Haw Loeung reported. I
> have since corresponded with him and I think the issue is resolved. A
> patch is in svn for unbound.
>
> The issue seems to be caused by the parent-child disagreement fix in
> 1.4.1, worsened by IPv6 being unavailable (either not in the kernel or
> a do-ip6: no statement). Just turning IPv6 back on (even if you have no
> IPv6 connectivity) may stop this bug.
>
> Sorry for the inconvenience,
> Wouter
>
> On 01/08/2010 10:36 PM, Jakub Heichman wrote:
>> Hello there,
>>
>> I have recently upgraded from Unbound 1.3.0 to 1.4.1 on a couple of
>> servers. The servers handle about 3-7k requests per second.
>> I have noticed that after some time in operation, the servers
>> intermittently stop answering queries for certain domains from the
>> cache and return SERVFAIL. It particularly dislikes webex.com, and
>> sometimes vmware.com and dell.com.
>>
>> # unbound-host webex.com
>> webex.com has address 64.68.120.99
>> webex.com mail is handled by 50 mx3.webex.com.
>> webex.com mail is handled by 10 mx1.webex2.iphmx.com.
>> webex.com mail is handled by 10 mx2.webex2.iphmx.com.
>> webex.com mail is handled by 50 mx2.webex.com.
>>
>> But the A record doesn't seem to get cached:
>>
>> # dig webex.com @localhost
>>
>> ; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
>> ; (1 server found)
>> ;; global options: printcmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
>> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>>
>> ;; QUESTION SECTION:
>> ;webex.com.                     IN      A
>>
>> ;; Query time: 0 msec
>> ;; SERVER: 127.0.0.1#53(127.0.0.1)
>> ;; WHEN: Fri Jan  8 19:16:01 2010
>> ;; MSG SIZE  rcvd: 27
>>
>> # unbound-control dump_cache | grep ^webex.com
>> webex.com.      4589    IN      NS      ns01.webex.com.
>> webex.com.      4589    IN      NS      ns05.webex.com.
>> webex.com.      4589    IN      NS      ns07.webex.com.
>> webex.com.      4589    IN      NS      ns03.webex.com.
>> webex.com.      4589    IN      NS      ns08.webex.com.
>> webex.com.      4589    IN      NS      ns06.webex.com.
>>
>> It doesn't respond to NS queries either (no A records for
>> nsxx.webex.com in the cache):
>>
>> # dig ns webex.com @localhost
>>
>> ; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
>> ; (1 server found)
>> ;; global options: printcmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
>> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>>
>> ;; QUESTION SECTION:
>> ;webex.com.                     IN      NS
>>
>> ;; Query time: 0 msec
>> ;; SERVER: 127.0.0.1#53(127.0.0.1)
>> ;; WHEN: Fri Jan  8 20:53:18 2010
>> ;; MSG SIZE  rcvd: 27
>>
>> After flushing webex.com I can query the cache again:
>>
>> # unbound-control flush webex.com
>> ok
>> # dig +short webex.com @localhost
>> 64.68.120.99
>>
>> Log with increased verbosity:
>>
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: validator operate: query <webex.com. A IN>
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6 avail)
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: ns08.webex.com. A
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: ns05.webex.com. A
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: ns01.webex.com. A
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: ns06.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2): <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3): <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: processQueryTargets: <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6 avail)
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns08.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns05.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns01.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns06.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: validator operate: query <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion states (8 with reply, 0 detached), 28 waiting replies, 84 recursion replies sent, 0 replies dropped, 0 states jostled out
>> Jan  8 21:05:32 cns unbound: [25052:3] info: average recursion processing time 4.172319 sec
>> Jan  8 21:05:32 cns unbound: [25052:3] info: histogram of recursion processing times
>> Jan  8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789 median[50%]=0.00744727 [75%]=0.0524288
>> Jan  8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000000    0.000001 9
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000256    0.000512 1
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.001024    0.002048 1
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.002048    0.004096 19
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.004096    0.008192 2
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.008192    0.016384 11
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.016384    0.032768 3
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.032768    0.065536 5
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.065536    0.131072 15
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.131072    0.262144 10
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.262144    0.524288 5
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.524288    1.000000 2
>> Jan  8 21:05:32 cns unbound: [25052:3] info:  256.000000  512.000000 1
>>
>> I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
>> I don't see this problem.
>>
>> Unbound 1.4.1 was compiled with libev-3.9, without threads - 4
>> processes running:
>>
>> num-threads: 4
>> outgoing-range: 10000
>> msg-cache-size: 500m
>> msg-cache-slabs: 1
>> num-queries-per-thread: 10000
>> jostle-timeout: 4000
>> so-rcvbuf: 8m
>> rrset-cache-size: 1024m
>> rrset-cache-slabs: 1
>> cache-max-ttl: 86400
>>
>> Any help would be greatly appreciated :-)
>>
>> Thanks,
>>
--
Kuba Heichman
_______________________________________________
Unbound-users mailing list
[email protected]
http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
