Hi, and thanks for the feedback, the general advice, and the pointer to jemalloc. I may look into that a bit later.
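(If and when I do get around to jemalloc, my understanding is that it can be tried without rebuilding unbound, along these lines -- this is a sketch, the library path is a pkgsrc assumption, and the profiling knobs require a jemalloc built with --enable-prof:)

    # Run unbound in the foreground against jemalloc, with a leak
    # report dumped when the process exits (prof_leak + prof_final).
    LD_PRELOAD=/usr/pkg/lib/libjemalloc.so \
    MALLOC_CONF="prof:true,prof_final:true,prof_leak:true" \
        /usr/pkg/sbin/unbound -d -c /usr/pkg/etc/unbound/unbound.conf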
However, in the meantime I have come to the conclusion that there may be a correlation between, on the one hand, my enabling DoH and DoT and using RFC 9462 to direct clients which probe for _dns.resolver.arpa to the DoH and/or DoT endpoints, and, on the other hand, what really does look like a massive memory leak in unbound. If that is true, which malloc() you use should not make much of a difference.

To test this hypothesis, I turned off DoH and DoT (diff to the config attached below; this was only turned on about a month ago), stopped serving resolver.arpa, and restarted unbound. Here are a few "top" displays taken over the span of a few hours, starting shortly after this config change:

    load averages:  0.26,  0.20,  0.25;            up 6+00:57:31    14:24:00
    79 processes: 76 sleeping, 1 stopped, 2 on CPU
    CPU states:  4.5% user,  0.0% nice,  2.2% system,  1.0% interrupt, 92.2% idle
    Memory: 2702M Act, 7948K Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
    Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1574K In, 16

      PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    14982 unbound   43    0   398M  268M CPU/2      6:55 30.22% 30.22% unbound

    load averages:  0.13,  0.17,  0.21;            up 6+01:49:28    15:15:57
    79 processes: 77 sleeping, 1 stopped, 1 on CPU
    CPU states:  2.8% user,  0.0% nice,  2.0% system,  0.6% interrupt, 94.5% idle
    Memory: 2847M Act, 11M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
    Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1234K In, 13

      PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    14982 unbound   85    0   544M  417M kqueue/2  18:13 38.23% 38.23% unbound

    load averages:  0.22,  0.11,  0.10;            up 6+03:55:58    17:22:27
    90 processes: 87 sleeping, 1 stopped, 2 on CPU
    CPU states:  1.2% user,  0.0% nice,  1.1% system,  0.2% interrupt, 97.3% idle
    Memory: 3040M Act, 18M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
    Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 648K In, 700

      PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    14982 unbound   43    0   738M  604M CPU/2     38:45  3.61%  3.61% unbound
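(Aside: these snapshots were grabbed from "top" by hand. A loop along the following lines -- a sketch, with the unbound-control path and the sampling interval as assumptions -- would capture the same trend, together with unbound's own memory accounting from its mem.* counters:)

    #!/bin/sh
    # Sample the OS view of the process (VSZ/RSS) and unbound's
    # internal memory counters every five minutes; both should
    # level off once the caches have filled.
    while true; do
        date
        ps -o pid,vsz,rss,comm -p "$(pgrep -x unbound)"
        /usr/pkg/sbin/unbound-control stats_noreset | grep '^mem\.'
        sleep 300
    done

If the mem.* totals stay flat while RES keeps climbing, the growth is happening outside unbound's own accounting, which would also point towards a leak.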
If we compare this to what I experienced with these options turned on and a number of DoH / DoT clients using those endpoints, quoting from yesterday's e-mail:

    load averages:  0.86,  0.94,  0.92;            up 5+00:58:04    14:24:33
    86 processes: 83 sleeping, 1 stopped, 2 on CPU
    CPU states: 14.8% user,  0.0% nice,  1.3% system,  0.8% interrupt, 83.0% idle
    Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
    Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out

      PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    14678 unbound   40    0  5408M 3033M CPU/2    183:17 78.47% 78.47% unbound

    load averages:  0.52,  0.53,  0.52;            up 5+02:22:23    15:48:52
    85 processes: 82 sleeping, 1 stopped, 2 on CPU
    CPU states: 11.4% user,  0.0% nice,  1.8% system,  1.0% interrupt, 85.7% idle
    Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
    Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19

      PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    14678 unbound   84    0  6863M 3825M kqueue/0 236:12 39.55% 39.55% unbound

    load averages:  0.19,  0.35,  0.41;            up 5+04:50:24    18:16:53
    85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
    CPU states: 11.3% user,  0.0% nice,  1.2% system,  0.0% interrupt, 87.4% idle
    Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
    Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G

      PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    14678 unbound   85    0  9358M 5118M RUN/1    319:53 29.30% 29.30% unbound

You'll notice the difference is quite stark. Not only is the CPU time much lower (OK, crypto costs, I guess), but the trajectory of the virtual size is vastly different:

    5408M -> 6863M (1:24h later) -> 9358M (3:52h after the 0th measurement)

compared to

     398M ->  544M (51m later)  ->  738M (2:58h after the 0th measurement)

And according to "unbound-control stats" the query rate is comparable to what it was yesterday, so per hour that works out to roughly (9358M - 5408M) / 3.87h = ~1020M of growth with DoH/DoT enabled, versus (738M - 398M) / 2.97h = ~115M without -- nearly a factor of nine for a similar workload. So I suspect there is a serious memory leak, possibly in unbound, related to the code which does DoH and/or DoT handling.

The diff to our unbound.conf compared to yesterday's is attached below.

Regards,

- Håvard
rcsdiff -u unbound.conf
===================================================================
RCS file: RCS/unbound.conf,v
retrieving revision 1.9
diff -u -r1.9 unbound.conf
--- unbound.conf	2025/03/03 16:25:44	1.9
+++ unbound.conf	2025/03/13 12:53:24
@@ -12,27 +12,27 @@
 	# 853 = DNS-over-TLS
 	# 443 = DNS-over-HTTPS
 	interface: 158.38.0.2
-	interface: 158.38.0.2@443
-	interface: 158.38.0.2@853
+#	interface: 158.38.0.2@443
+#	interface: 158.38.0.2@853
 	interface: 2001:700:0:ff00::2
-	interface: 2001:700:0:ff00::2@443
-	interface: 2001:700:0:ff00::2@853
+#	interface: 2001:700:0:ff00::2@443
+#	interface: 2001:700:0:ff00::2@853
 	interface: 158.38.0.169
-	interface: 158.38.0.169@443
-	interface: 158.38.0.169@853
+#	interface: 158.38.0.169@443
+#	interface: 158.38.0.169@853
 	interface: 2001:700:0:503::c253
-	interface: 2001:700:0:503::c253@443
-	interface: 2001:700:0:503::c253@853
+#	interface: 2001:700:0:503::c253@443
+#	interface: 2001:700:0:503::c253@853
 	interface: 127.0.0.1
 	interface: ::1
 
 	# TLS key and certificate
-	tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
-	tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
-	tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
+#	tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
+#	tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
+#	tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
 
 	# Enable DNS-over-HTTPS (doh):
-	https-port: 443
+#	https-port: 443
 
 	# These need tuning away from defaults;
 	# the defaults not suitable for TCP-heavy workloads:
@@ -988,9 +988,9 @@
 #	for-upstream: yes
 #	zonefile: "example.org.zone"
 
-auth-zone:
-	name: resolver.arpa
-	zonefile: "pz/resolver.arpa"
+# auth-zone:
+#	name: resolver.arpa
+#	zonefile: "pz/resolver.arpa"
 
 # Views
 # Create named views. Name must be unique. Map views to requests using
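P.S. For anyone unfamiliar with the RFC 9462 part: the resolver.arpa zone that is now commented out served DDR records, i.e. SVCB answers for the _dns.resolver.arpa probe. I'm not reproducing our actual zone here; a minimal sketch with placeholder names and a placeholder SOA would look something like:

    ; DDR sketch per RFC 9461/9462 -- SOA/NS values, the target name
    ; and the dohpath are placeholders, not our real records.
    resolver.arpa.      300 IN SOA  localhost. nobody.invalid. 1 3600 1200 604800 300
    resolver.arpa.      300 IN NS   localhost.
    _dns.resolver.arpa. 300 IN SVCB 1 dns.example.net. alpn=dot
    _dns.resolver.arpa. 300 IN SVCB 2 dns.example.net. alpn=h2 dohpath=/dns-query{?dns}

(Depending on the zone parser, the dohpath parameter may need to be spelled key7="/dns-query{?dns}" instead.)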