Earlier, I wrote: > we're using unbound (1.9.4) on NetBSD 8.0 to provide DNS-over-TLS > service to a diverse set of customers, at the moment mostly as an > experiment. > > What I find is that over time there will pile up a set of TCP > connections in CLOSE_WAIT state, all with 147 bytes stuck in the > receive queue of the socket (according to "netstat -n -f inet").
I've taken a closer look, but not found a solution to this issue. I'm doing testing with "kdig" from the knot package. What I find is that it can take an inordinately long time to get a response, e.g.: % time kdig @158.38.0.1 +timeout=120 +tls vg.no. a ... 0.000u 0.008s 0:31.80 0.0% 0+0k 0+0io 0pf+0w 31 seconds, in other words, and I've seen response times as long as 1minute+. The randomness pool on the server host is full, as is it at the client, so it's not that this is running low. While it's waiting, I can find: tcp 337 0 158.38.0.1.853 158.38.62.77.64808 ESTABLISHED in "netstat -f inet -n | grep 853" output on the server. In other words: the TCP connection is established, but the server isn't consuming the data the client has sent. Other clients will leave a little less data, typically 147 bytes: tcp 147 0 158.38.0.1.853 158.38.204.xxx.43370 ESTABLISHED tcp 147 0 158.38.0.1.853 158.36.214.xx.41054 ESTABLISHED tcp 147 0 158.38.0.1.853 158.38.204.xxx.49226 ESTABLISHED tcp 147 0 158.38.0.1.853 158.38.204.xxx.41550 ESTABLISHED These will in all probability end up as CLOSE_WAIT sockets, because the client times out, and by the looks of it, the server is very slow in cleaning these out as well. Furthermore, I observe this behaviour both with libevent (2.1.11) and when using the unbound-internal "mini-event" code. OpenSSL is the in-tree code from NetBSD, version OpenSSL 1.0.2k 26 Jan 2017. In the process, I ran across https://github.com/libevent/libevent/issues/854 but wedging that into libevent 2.1.11 didn't make any significant improvement that I could see. I added some logging at the top of ssl_handshake() in util/netevent.c, and the latest test I did resulted in: % time kdig @158.38.0.1 +timeout=120 +tls vg.no. a ... ;; Time 2019-11-24 15:06:21 CET ;; From 158.38.0.1@853(TCP) in 11.1 ms 0.000u 0.007s 1:08.18 0.0% 0+0k 0+0io 0pf+0w and the corresponding log entry: Nov 24 15:07:29 oliven unbound: [13367:0] notice: ssl handshake with 158.38.62.77 port 64713 So ... apparently unbound is Very Slow in getting around to handling the socket and handing it off for SSL handshake. The reason for this has yet to be found. Hints gratefully accepted. Regards, - HÃ¥vard
