Hi Wolfgang, I observe the same increased SERVFAILs ("misc failure") after updating to Unbound 1.22.0. Also on a low-volume recursor.
I have not had the opportunity to take a closer look, but wanted to provide anecdotal evidence that you are not alone!

Cheers, Otto

Wolfgang Breyha via Unbound-users wrote:
Hi!

I'm operating a small private (low volume) recursor for my own purposes and have been running unbound for years, since about 1.6.x, without (recognized) issues so far. But with 1.22+ I noticed some oddities with unexpected SERVFAILs. Incoming requests are made with DoT on port 853 and locally (classic port 53). My config mostly uses defaults, except [0].

I first noticed it with failed mail reception from GMX, because unbound occasionally was not able to resolve the PTR RRs of their outgoing mail relays. The log (verbosity 1, log-servfail: yes) showed only:

error: SERVFAIL <18.15.227.212.in-addr.arpa. PTR IN>: misc failure

A closer look at the logs showed a lot of rather odd "misc failure"s, e.g.:

error: SERVFAIL <ctldl.windowsupdate.com. AAAA IN>: misc failure
error: SERVFAIL <alexa.amazon.de. A IN>: misc failure
error: SERVFAIL <www.paypal.com. A IN>: misc failure

All of them worked on a later retry, as expected. I searched the source for the "misc failure" message and found the new (at least to me) option "max-global-quota" as one possible cause. Afterwards I raised the verbosity to 3 to see more details. At the same time I added

msg-cache-size: 4m
num-queries-per-thread: 4096
rrset-cache-size: 8m
cache-min-ttl: 10
cache-max-negative-ttl: 3600
infra-cache-min-rtt: 100

to [0], but I still didn't change the "max-global-quota" default. To my surprise this also lowered the "misc failure" rate, and only some "in-addr.arpa" lookups SERVFAILed with it. They all triggered the "request xxxx has exceeded the maximum global quota on number of upstream queries yyy" message in the debug log.

I then removed the modifications from the config again and returned to plain [0], and the raised rate of "misc failure"s, including quite prominent zones, returned as well. E.g.:

debug: request 3.pool.ntp.org. has exceeded the maximum global quota on number of upstream queries 155
debug: return error response SERVFAIL

Searching for the highest "number of upstream queries" gave 180, for:

error: SERVFAIL <at.mirrors.cicku.me. AAAA IN>: misc failure

This one failed again when I retried while writing this mail, with "139". The second try gave the correct answer. Obviously the cache size, and primarily its contents, influences the maximum number of upstream queries needed.

I'm wondering if I'm the only one seeing this? IMO either the default of 128 is simply too low for low-volume recursors, or there is some other oddity with this option.

Greetings, Wolfgang Breyha

[0] config (stripped access, tls keys, common stuff)

outgoing-port-permit: 32768-60999
outgoing-port-avoid: 0-32767
so-rcvbuf: 4m
so-sndbuf: 4m
so-reuseport: yes
ip-transparent: yes
max-udp-size: 4096
log-servfail: yes
harden-glue: yes
harden-dnssec-stripped: yes
harden-below-nxdomain: yes
harden-referral-path: yes
qname-minimisation: yes
aggressive-nsec: yes
use-caps-for-id: no
unwanted-reply-threshold: 10000000
prefetch: yes
prefetch-key: yes
rrset-roundrobin: yes
minimal-responses: no
val-clean-additional: yes
val-permissive-mode: no
serve-expired: no
val-log-level: 1
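For anyone else hitting the same quota messages: since "max-global-quota" is an ordinary server option, the limit can be raised above the 128 default as a workaround while the right default is discussed. A minimal sketch; the value 512 is only an illustrative guess on my part, not a tested or recommended setting:

```
server:
    # max-global-quota caps the number of upstream queries spawned
    # while resolving a single client request (option new in 1.22.0,
    # default 128). 512 here is an illustrative guess, not a
    # recommendation; pick a value based on your own debug logs.
    max-global-quota: 512
```

With verbosity 3 and log-servfail enabled, the "has exceeded the maximum global quota" debug lines show the actual counts reached, which should help in choosing a value.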