Hello there,

I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
servers. The servers are doing about 3-7k requests per second.
I have noticed that after some time of operation, servers
intermittently stop responding to queries to certain domains from
cache (return SERVFAIL). It particularly doesn't like webex.com,
sometimes vmware.com, dell.com.

# unbound-host webex.com
webex.com has address 64.68.120.99
webex.com mail is handled by 50 mx3.webex.com.
webex.com mail is handled by 10 mx1.webex2.iphmx.com.
webex.com mail is handled by 10 mx2.webex2.iphmx.com.
webex.com mail is handled by 50 mx2.webex.com.

But the A record doesnt seem to get cached:

# dig webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com.                     IN      A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan  8 19:16:01 2010
;; MSG SIZE  rcvd: 27

# unbound-control dump_cache | grep ^webex.com
webex.com.      4589    IN      NS      ns01.webex.com.
webex.com.      4589    IN      NS      ns05.webex.com.
webex.com.      4589    IN      NS      ns07.webex.com.
webex.com.      4589    IN      NS      ns03.webex.com.
webex.com.      4589    IN      NS      ns08.webex.com.
webex.com.      4589    IN      NS      ns06.webex.com.

It doesnt respond to NS queries either (No A records for
nsxx.webex.com in cache)

# dig ns webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com.                     IN      NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan  8 20:53:18 2010
;; MSG SIZE  rcvd: 27



After flushing webex.com I can query the cache again:

# unbound-control flush webex.com
ok
# dig +short webex.com @localhost
64.68.120.99

Log with increased verbosity:

an  8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
query <webex.com. A IN>
Jan  8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
Jan  8 21:05:32 cns01j unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns08.webex.com.  A
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns05.webex.com.  A
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns01.webex.com.  A
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns06.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns08.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns05.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns01.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns06.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info: validator operate: query
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
replies sent, 0 replies dropped, 0 states jostled out
Jan  8 21:05:32 cns unbound: [25052:3] info: average recursion
processing time 4.172319 sec
Jan  8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
processing times
Jan  8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
median[50%]=0.00744727 [75%]=0.0524288
Jan  8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000000    0.000001 9
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000256    0.000512 1
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.001024    0.002048 1
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.002048    0.004096 19
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.004096    0.008192 2
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.008192    0.016384 11
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.016384    0.032768 3
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.032768    0.065536 5
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.065536    0.131072 15
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.131072    0.262144 10
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.262144    0.524288 5
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.524288    1.000000 2
Jan  8 21:05:32 cns unbound: [25052:3] info:  256.000000  512.000000 1



I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
I don't see this problem.

Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running

num-threads: 4
outgoing-range: 10000
msg-cache-size: 500m
msg-cache-slabs: 1
num-queries-per-thread: 10000
jostle-timeout: 4000
so-rcvbuf: 8m
rrset-cache-size: 1024m
rrset-cache-slabs: 1
cache-max-ttl: 86400

Any help would be greatly appreciated :-)

Thanks,

-- 
Jakub Heichman
_______________________________________________
Unbound-users mailing list
[email protected]
http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users

Reply via email to