Following up on an old message (from Sep 2024):

> Unbound does not have a clean strategy for cache.
> Records are not evicted based on their TTL status.
> Instead Unbound will try to fill all of the configured memory with
> data.
> Then when a new entry needs to be cached and there is no space left,
> the earliest used entry (based on an LRU list) will be dropped off to
> free space for the new entry.
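If I understand that description correctly, the behaviour is roughly
the following (a minimal illustrative sketch in Python, not Unbound's
actual code; the class and names are made up):

from collections import OrderedDict

class LruByteCache:
    """Cache capped by total bytes; entries are never expired on TTL."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()        # key -> (data, size)

    def insert(self, key, data, size):
        # Evict least-recently-used entries only when the new entry
        # would not fit within the configured byte limit.
        while self.entries and self.used + size > self.max_bytes:
            _, (_olddata, oldsize) = self.entries.popitem(last=False)
            self.used -= oldsize
        self.entries[key] = (data, size)
        self.used += size

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key][0]
        return None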
OK, that sort of matches up with what I observe, in that the memory
consumption of unbound only increases.

> Unfortunately the size of the cache is not something trivial to solve
> because it heavily depends on client traffic patterns.
>
> Monitoring the cache-hit rate with different memory configurations
> could give a preferred size for specific installations.

Well, there are these configuration knobs that I have tuned for cache
size limitation, in the hope that they would be respected:

# grep cache-size unbound.conf
# rrset-cache-size: 4m
rrset-cache-size: 4G
# msg-cache-size: 4m
msg-cache-size: 3G
# key-cache-size: 4m
key-cache-size: 500m
# neg-cache-size: 1m
#

I relatively recently implemented RFC 9462, and this has the effect of
increasing the amount of DoT and DoH traffic to this unbound instance.
Recently (after that was turned on), unbound was killed for trying to
exceed swap space:

[ 406876.565818] UVM: pid 895.1528 (unbound), uid 1003 killed: out of swap

This host has 24GB memory and 14GB swap, and is almost exclusively used
to provide this DNS recursor service.  Thus, unbound far overshot the
configured cache size limits, at least in terms of memory consumption
by the unbound process.

Right after restart, unbound was below 1GB in size; now, some 3.5 hours
later, it has ballooned to 5.5GB:

load averages:  0.86,  0.94,  0.92;             up 5+00:58:04   14:24:33
86 processes: 83 sleeping, 1 stopped, 2 on CPU
CPU states: 14.8% user,  0.0% nice,  1.3% system,  0.8% interrupt, 83.0% idle
Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
14678 unbound   40    0  5408M 3033M CPU/2    183:17 78.47% 78.47% unbound

Granted, the configured caches add up to 3GB + 4GB = 7GB, which is more
than 5.5GB, so this is not yet a "smoking gun".  However, I suspect
unbound has a memory leak, since it will balloon to a size much larger
than the configured cache sizes + overhead, and it has possibly been
made worse by the increase in DoT and DoH traffic.

At the same time as the above "top" display, unbound-control reports

# unbound-control stats | egrep '^mem'
mem.cache.rrset=338719902
mem.cache.message=333115259
mem.mod.iterator=16756
mem.mod.validator=27072879
mem.mod.respip=0
mem.streamwait=0
mem.http.query_buffer=0
mem.http.response_buffer=0
#

The rrset and message cache sizes are reported in bytes, yes?  If so,
those are respectively 323MB and 317MB, barely 650MB together, and
therefore do not account for the rather rapid growth in unbound's
virtual size (see the quick conversion at the end of this message).

Since the last time I brought this up, unbound has been upgraded to
version 1.22.0, but of course I'm still observing this problem, and I
seek guidance about what to do to find out whether unbound does indeed
have a (rather severe) memory leak.

...
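Quick conversion of the byte counters quoted above, for reference (an
illustrative Python snippet; MB here means 1024*1024 bytes, rounded
down):

# Convert the unbound-control byte counters quoted above to MB.
for name, value in [("mem.cache.rrset", 338719902),
                    ("mem.cache.message", 333115259)]:
    print(f"{name}: {value // (1024 * 1024)}MB")
# prints:
#   mem.cache.rrset: 323MB
#   mem.cache.message: 317MB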
And now, a little more than an hour later, it has grown by another 1.4GB:

load averages:  0.52,  0.53,  0.52;             up 5+02:22:23   15:48:52
85 processes: 82 sleeping, 1 stopped, 2 on CPU
CPU states: 11.4% user,  0.0% nice,  1.8% system,  1.0% interrupt, 85.7% idle
Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
14678 unbound   84    0  6863M 3825M kqueue/0 236:12 39.55% 39.55% unbound

whereas "unbound-control stats" says for the "mem" part:

# unbound-control stats | egrep '^mem'
mem.cache.rrset=392533351
mem.cache.message=412422697
mem.mod.iterator=16756
mem.mod.validator=30732262
mem.mod.respip=0
mem.streamwait=0
mem.http.query_buffer=0
mem.http.response_buffer=0
#

so the growth there isn't nearly as pronounced: the rrset cache is up
to 374MB, and the message cache is up to 393MB.

And now, 2.5 hours later:

load averages:  0.19,  0.35,  0.41;             up 5+04:50:24   18:16:53
85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
CPU states: 11.3% user,  0.0% nice,  1.2% system,  0.0% interrupt, 87.4% idle
Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
14678 unbound   85    0  9358M 5118M RUN/1    319:53 29.30% 29.30% unbound

with

# unbound-control stats | egrep '^mem'
mem.cache.rrset=470135632
mem.cache.message=465387399
mem.mod.iterator=16756
mem.mod.validator=34195072
mem.mod.respip=0
mem.streamwait=0
mem.http.query_buffer=0
mem.http.response_buffer=0
#

So ... what I'm looking for is ... what, if anything, can I do to find
and stop what looks like a massive memory leak?  Or ... is anyone else
observing similar symptoms?

Some peculiarities of our configuration:

We're using "rpz:" functionality for a subset of our clients.
We recently activated DoT and DoH with a "correct" certificate.
As mentioned above, we enabled resolver.arpa resolution, ref. RFC 9462.
We're doing DNSSEC validation (of course).

And lastly, this is on NetBSD/amd64 10.0, using net/unbound packaged
from pkgsrc.

Best regards,

- Håvard