Following up on an old message (from Sep 2024):

> Unbound does not have a clean strategy for cache.
> Records are not evicted based on their TTL status.
> Instead Unbound will try to fill all of the configured memory with
> data.
> Then when a new entry needs to be cached and there is no space left,
> the earliest used entry (based on an LRU list) will be dropped off to
> free space for the new entry.
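If I understand that description correctly, the behaviour is roughly
the following (a minimal illustrative sketch in Python, not Unbound's
actual code; the class and names are made up):

from collections import OrderedDict

class LruByteCache:
    """Cache capped by total bytes; entries are never expired on TTL."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()        # key -> (data, size)

    def insert(self, key, data, size):
        # Evict least-recently-used entries only when the new entry
        # would not fit within the configured byte limit.
        while self.entries and self.used + size > self.max_bytes:
            _, (_olddata, oldsize) = self.entries.popitem(last=False)
            self.used -= oldsize
        self.entries[key] = (data, size)
        self.used += size

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key][0]
        return None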
OK, that sort of matches up with what I observe, in that the memory
consumption of unbound only increases.

> Unfortunately the size of the cache is not something trivial to solve
> because it heavily depends on client traffic patterns.
>
> Monitoring the cache-hit rate with different memory configurations
> could give a preferred size for specific installations.

Well, there are these configuration knobs that I have tuned for cache
size limitation, in the hope that they would be respected:

# grep cache-size unbound.conf
# rrset-cache-size: 4m
rrset-cache-size: 4G
# msg-cache-size: 4m
msg-cache-size: 3G
# key-cache-size: 4m
key-cache-size: 500m
# neg-cache-size: 1m
#

I relatively recently implemented RFC 9462, and this has the effect of
increasing the amount of DoT and DoH traffic to this unbound instance.
Recently (after that was turned on), unbound was killed for trying to
exceed swap space:

[ 406876.565818] UVM: pid 895.1528 (unbound), uid 1003 killed: out of swap

This host has 24GB memory and 14GB swap, and is almost exclusively used
to provide this DNS recursor service.  Thus, unbound far overshot the
configured cache size limits, at least in terms of memory consumption
by the unbound process.

Right after restart, unbound was below 1GB in size; now, some 3.5 hours
later, it has ballooned to 5.5GB:

load averages:  0.86,  0.94,  0.92;             up 5+00:58:04   14:24:33
86 processes: 83 sleeping, 1 stopped, 2 on CPU
CPU states: 14.8% user,  0.0% nice,  1.3% system,  0.8% interrupt, 83.0% idle
Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
14678 unbound   40    0  5408M 3033M CPU/2    183:17 78.47% 78.47% unbound

Granted, the configured caches add up to 3GB + 4GB = 7GB, which is more
than 5.5GB, so this is not yet a "smoking gun".  However, I suspect
unbound has a memory leak, since it will balloon to a size much larger
than the configured cache sizes + overhead, and it has possibly been
made worse by the increase in DoT and DoH traffic.

At the same time as the above "top" display, unbound-control reports

# unbound-control stats | egrep '^mem'
mem.cache.rrset=338719902
mem.cache.message=333115259
mem.mod.iterator=16756
mem.mod.validator=27072879
mem.mod.respip=0
mem.streamwait=0
mem.http.query_buffer=0
mem.http.response_buffer=0
#

The rrset and message cache sizes are reported in bytes, yes?  If so,
those are respectively 323MB and 317MB, barely 650MB together, and
therefore do not account for the rather rapid growth in unbound's
virtual size (see the quick conversion at the end of this message).

Since the last time I brought this up, unbound has been upgraded to
version 1.22.0, but of course I'm still observing this problem, and I
seek guidance about what to do to find out whether unbound does indeed
have a (rather severe) memory leak.

...
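Quick conversion of the byte counters quoted above, for reference (an
illustrative Python snippet; MB here means 1024*1024 bytes, rounded
down):

# Convert the unbound-control byte counters quoted above to MB.
for name, value in [("mem.cache.rrset", 338719902),
                    ("mem.cache.message", 333115259)]:
    print(f"{name}: {value // (1024 * 1024)}MB")
# prints:
#   mem.cache.rrset: 323MB
#   mem.cache.message: 317MB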
And now, a little more than an hour later, it has grown by another 1.4GB:

load averages:  0.52,  0.53,  0.52;             up 5+02:22:23   15:48:52
85 processes: 82 sleeping, 1 stopped, 2 on CPU
CPU states: 11.4% user,  0.0% nice,  1.8% system,  1.0% interrupt, 85.7% idle
Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
14678 unbound   84    0  6863M 3825M kqueue/0 236:12 39.55% 39.55% unbound

whereas "unbound-control stats" says for the "mem" part:

# unbound-control stats | egrep '^mem'
mem.cache.rrset=392533351
mem.cache.message=412422697
mem.mod.iterator=16756
mem.mod.validator=30732262
mem.mod.respip=0
mem.streamwait=0
mem.http.query_buffer=0
mem.http.response_buffer=0
#

so the growth there isn't nearly as pronounced: the rrset cache is up
to 374MB, and the message cache is up to 393MB.

And now, 2.5 hours later:

load averages:  0.19,  0.35,  0.41;             up 5+04:50:24   18:16:53
85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
CPU states: 11.3% user,  0.0% nice,  1.2% system,  0.0% interrupt, 87.4% idle
Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
14678 unbound   85    0  9358M 5118M RUN/1    319:53 29.30% 29.30% unbound

with

# unbound-control stats | egrep '^mem'
mem.cache.rrset=470135632
mem.cache.message=465387399
mem.mod.iterator=16756
mem.mod.validator=34195072
mem.mod.respip=0
mem.streamwait=0
mem.http.query_buffer=0
mem.http.response_buffer=0
#

So ... what I'm looking for is ... what, if anything, can I do to find
and stop what looks like a massive memory leak?  Or ... is anyone else
observing similar symptoms?

Some peculiarities of our configuration:

We're using "rpz:" functionality for a subset of our clients.
We recently activated DoT and DoH with a "correct" certificate.
As mentioned above, we enabled resolver.arpa resolution, ref. RFC 9462.
We're doing DNSSEC validation (of course).

And lastly, this is on NetBSD/amd64 10.0, using net/unbound packaged
from pkgsrc.

Best regards,

- Håvard