Hello there,

I recently upgraded from Unbound 1.3.0 to 1.4.1 on a couple of servers handling about 3-7k requests per second. I have noticed that after some time in operation, the servers intermittently stop answering queries for certain domains from cache and return SERVFAIL instead. It particularly affects webex.com, and sometimes vmware.com and dell.com.
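To spot affected names quickly, I check them with a small script like the following (the domain list, timeouts, and function names are just what I use locally, not anything from Unbound itself; adjust as needed):

```shell
#!/bin/sh
# Probe a list of known-affected domains through the local resolver and
# report any that come back SERVFAIL. Illustrative sketch only.

# Pull the rcode out of dig's ";; ->>HEADER<<- ... status: XXX," line.
rcode_of() {
    printf '%s\n' "$1" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query one domain and print a line if the answer is SERVFAIL.
check_domain() {
    out=$(dig +time=1 +tries=1 "$1" @127.0.0.1 2>/dev/null)
    if [ "$(rcode_of "$out")" = "SERVFAIL" ]; then
        echo "SERVFAIL: $1"
    fi
}

for d in webex.com vmware.com dell.com; do
    check_domain "$d"
done
```

Running it from cron makes it easy to tell how often a given domain gets stuck.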
# unbound-host webex.com
webex.com has address 64.68.120.99
webex.com mail is handled by 50 mx3.webex.com.
webex.com mail is handled by 10 mx1.webex2.iphmx.com.
webex.com mail is handled by 10 mx2.webex2.iphmx.com.
webex.com mail is handled by 50 mx2.webex.com.

But the A record doesn't seem to get cached:

# dig webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com.        IN    A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 8 19:16:01 2010
;; MSG SIZE  rcvd: 27

# unbound-control dump_cache | grep ^webex.com
webex.com.    4589    IN    NS    ns01.webex.com.
webex.com.    4589    IN    NS    ns05.webex.com.
webex.com.    4589    IN    NS    ns07.webex.com.
webex.com.    4589    IN    NS    ns03.webex.com.
webex.com.    4589    IN    NS    ns08.webex.com.
webex.com.    4589    IN    NS    ns06.webex.com.

It doesn't respond to NS queries either (no A records for nsxx.webex.com in cache):

# dig ns webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com.        IN    NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 8 20:53:18 2010
;; MSG SIZE  rcvd: 27

After flushing webex.com I can query the cache again:

# unbound-control flush webex.com
ok
# dig +short webex.com @localhost
64.68.120.99

Log with increased verbosity:

Jan 8 21:05:32 cns01j unbound: [25052:3] info: validator operate: query <webex.com. A IN>
Jan 8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
Jan 8 21:05:32 cns01j unbound: [25052:3] info: DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6 avail)
Jan 8 21:05:32 cns01j unbound: [25052:3] info:   ns08.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info:   ns05.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info:   ns01.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info:   ns06.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2): <webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3): <webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: processQueryTargets: <webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6 avail)
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns08.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns05.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns01.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns06.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: validator operate: query <webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion states (8 with reply, 0 detached), 28 waiting replies, 84 recursion replies sent, 0 replies dropped, 0 states jostled out
Jan 8 21:05:32 cns unbound: [25052:3] info: average recursion processing time 4.172319 sec
Jan 8 21:05:32 cns unbound: [25052:3] info: histogram of recursion processing times
Jan 8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789 median[50%]=0.00744727 [75%]=0.0524288
Jan 8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.000000    0.000001 9
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.000256    0.000512 1
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.001024    0.002048 1
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.002048    0.004096 19
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.004096    0.008192 2
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.008192    0.016384 11
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.016384    0.032768 3
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.032768    0.065536 5
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.065536    0.131072 15
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.131072    0.262144 10
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.262144    0.524288 5
Jan 8 21:05:32 cns unbound: [25052:3] info:    0.524288    1.000000 2
Jan 8 21:05:32 cns unbound: [25052:3] info:  256.000000  512.000000 1

I have downgraded one of the servers back to 1.3.0 (libev 3.53), where I don't see this problem.

Unbound 1.4.1 was compiled with libev-3.9, without threads - 4 processes running:

num-threads: 4
outgoing-range: 10000
msg-cache-size: 500m
msg-cache-slabs: 1
num-queries-per-thread: 10000
jostle-timeout: 4000
so-rcvbuf: 8m
rrset-cache-size: 1024m
rrset-cache-slabs: 1
cache-max-ttl: 86400

Any help would be greatly appreciated :-)

Thanks,
--
Jakub Heichman
_______________________________________________
Unbound-users mailing list
[email protected]
http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
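P.S. Until the root cause is found, I script the flush workaround described above for the zones that keep getting stuck. A sketch (the zone list and function name are just examples; it only prints the unbound-control commands so they can be reviewed before piping to sh):

```shell
#!/bin/sh
# Emit one "unbound-control flush_zone" command per stuck zone.
# Review the output, then run it, e.g.:  ./gen_flush.sh | sh
flush_cmds() {
    for zone in "$@"; do
        echo "unbound-control flush_zone $zone"
    done
}

flush_cmds webex.com vmware.com dell.com
```

flush_zone removes everything at or below the name (including the cached NS set), whereas plain flush only removes the exact records for the name itself.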
