Hello,
I've tried setting up unbound for the first time (to replace existing dns setup based around venerable djbdns), and seem to have stumbled into following problem. After as little as dozen minutes and as much as several hours, unbound daemon stops responding to queries without any kind of overt errors (e.g. warning/error-level logging) or indication that anything is abnormal. Looking at e.g. "Unbound+request-list+demystified" article, I wasn't able to find description of the state which would match this case, so don't seem to understand what exactly is happening here. ( https://docs.menandmice.com/display/MM/Unbound+request-list+demystified ) One suspicion I get (without looking at the code or knowing much about it), is that maybe unbound marks all forwarders as "bad", but that doesn't seem to explain "debug: Too many queries" message in logs (see below). Config file that I'm using: https://e.var.nz/unbound.bugreport-2020-04-20_141538.conf It's (supposed to be) setup to forward DNS queries over TLS to CloudFlare servers (1.1.1.1 and such), except for couple specific local and stub zones, with latter forwarded to 127.0.0.1 where tinydns (djbdns resolver) is listening. It is setup in a systemd-nspawn container running latest Arch Linux userspace and Linux Kernel 5.4.x, started from systemd unit, which I think is shipped with unbound itself (given how it includes a lot of security options that require good understanding of how daemon works). "unbound -V" output: https://e.var.nz/unbound.bugreport-2020-04-20_141538.version-output.txt As far as I can tell, bulk of the queries to unbound are coming from postfix daemon that is checking various IPs and names for being valid for handling email in various ways (e.g. PTR request to check if IP is dynamic and such). At some point it stops responding to all DNS queries, as mentioned, but otherwise works fine, i.e. it's not a crash or hang in the code afaict. I've tried running special unbound on localhost port 54 and hammering it with similar PTR queries like postfix does, but couldn't reproduce the issue. As far as I can tell, for same PTR query where I can see a lot of THROWAWAY reponses in the log, CloudFlare eventually resolves it (maybe after couple dozen seconds), and unbound starts serving cached responses to it immediately, and all other queries seem to work fine at/during that time too. So basically don't know how to reproduce it outside of my real-world specific setup, but from restart unbound there ~5-10 times, had it become unresponsive within 30min-to-hours in all of them. This does not happen when I switch mandatory TLS off in general request forwarding. As a workaround, I've setup "stubby" DNS-over-TLS resolver, forced to only use TLS, with same exact CloudFlare servers, just like unbound, and made unbound query it over regular TCP/UDP on localhost. This way (unbound + stubby), it's been running for days now without any issues, i.e. problem only seem to manifest with mandatory-TLS upstreams. When problem happens, here are various log/command outputs: - strace.log from "strace -fp $(pgrep -x unbound)" command when querying it over UDP (using "dig ya.ru" in this example iirc): epoll_wait(10, [{EPOLLIN, {u32=5, u64=5}}], 32, -1) = 1 gettimeofday({tv_sec=1587140870, tv_usec=99133}, NULL) = 0 gettimeofday({tv_sec=1587140870, tv_usec=99357}, NULL) = 0 recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(37572), sin_addr=inet_addr("10.8.0.1")}, msg_namelen=128->16, msg_iov=[{iov_base="\361G\1 \0\1\0\0\0\0\0\1\2ya\2ru\0\0\1\0\1\0\0)\20\0\0\0\0\0"..., iov_len=65552}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("host0"), ipi_spec_dst=inet_addr("10.8.0.2"), ipi_addr=inet_addr("10.8.0.2")}}], msg_controllen=32, msg_flags=0}, 0) = 46 recvmsg(5, {msg_namelen=128}, 0) = -1 EAGAIN (Resource temporarily unavailable) epoll_wait(10, Note: there is also no output immediately before/after this for a reasonable time, i.e. unbound receives the request and discards it without trying to do anything with any sockets or files. - "unbound-control dump_requestlist" - empty table, i.e. looks like all requests are immediately discarded for other reason. - "unbound-control ratelimit_list" and "unbound-control ip_ratelimit_list" are empty. - "unbound-control stats" output looks like this: https://e.var.nz/unbound.bugreport-2020-04-20_141538.unbound-control-stats.txt Has total.num.expired=0 total.num.queries_ip_ratelimited=0 total.requestlist.exceeded=34 Don't think I've checked if new requests in this state bump "exceeded" limit, but it looks like they might. - Enabling "unbound-control verbosity 9", get output like the following in the log: http://e.var.nz/unbound.bugreport-2020-04-20_141538.log Note that initially it is set to only get "info" level messages, then switched to show "debug" (after "info: control cmd: verbosity 9"). Before unbound goes into this unresponsive state, there are lot of messages like this one: [1587135862] unbound[520:0] info: resolving 18.96.128.23.in-addr.arpa. PTR IN [1587135867] unbound[520:0] info: resolving 18.96.128.23.in-addr.arpa. PTR IN [1587135873] unbound[520:0] info: response for 18.96.128.23.in-addr.arpa. PTR IN [1587135873] unbound[520:0] info: reply from <.> 2606:4700:4700::1001#853 [1587135873] unbound[520:0] info: query response was THROWAWAY [1587135873] unbound[520:0] info: response for 18.96.128.23.in-addr.arpa. PTR IN [1587135873] unbound[520:0] info: reply from <.> 2606:4700:4700::1111#853 [1587135873] unbound[520:0] info: query response was THROWAWAY (i.e. ones with THROWAWAY response, for more-or-less same or similar PTR queries) Typical log output after stopping responding to queries seem to be: [1587136134] unbound[520:0] info: receive_udp on interface: 3 10.8.0.2 10.8.0.2 [1587136134] unbound[520:0] debug: udp request from ip4 10.8.0.1 port 50658 (len 16) [1587136134] unbound[520:0] debug: Too many queries. dropping incoming query. [1587136134] unbound[520:0] debug: cache memory msg=69740 rrset=70720 infra=8996 val=66352 subnet=74504 Apologies if a bit long-winded, tried to include as much info up-front as possible. Let me know if there's anything I can try or check, it should be very easy to tweak backend and options in this setup, and failures should be relatively low-impact, and noticed quickly enough. Thanks in advance for any information, as well as reading this far and working on this project. Cheers! -- Mike Kazantsev // fraggod.net
