Hello, and good day!
This is Andrey Sedletsky, writing on behalf of PJSC MGTS (Moscow City
Telephone Network).
We use your recursive DNS server (the open-source PowerDNS Recursor)
and have several questions for you.
One of our clients contacted us because the domain name "cm.taxi"
could not be resolved.
The query trace on the server shows that PowerDNS does not accept the
response from the authoritative server because the AA (Authoritative
Answer) flag is not set.
Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record 'cm.taxi|A|91.231.114.19' in the answer section without the AA bit set received from cm.taxi
Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record 'cm.taxi|A|91.231.114.18' in the answer section without the AA bit set received from cm.taxi
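To make the check in these log lines concrete, here is a minimal
Python sketch (an illustration only, not PowerDNS code) of how the AA
bit can be read from the header of a raw DNS response, for example one
extracted from a packet dump. The function name has_aa_bit and the two
sample headers are hypothetical; the bit position follows the DNS
header layout in RFC 1035.

```python
import struct

# AA (Authoritative Answer) bit within the 16-bit flags field of the DNS header.
AA_MASK = 0x0400


def has_aa_bit(message: bytes) -> bool:
    """Return True if the raw DNS message has the AA flag set."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    # The header starts with the 16-bit ID followed by the 16-bit flags field.
    _msg_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & AA_MASK)


# Hypothetical headers: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
# 0x8180 = QR + RD + RA (no AA); 0x8580 = the same flags plus AA.
non_authoritative = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 2, 0, 0)
authoritative = struct.pack("!HHHHHH", 0x1234, 0x8580, 1, 2, 0, 0)

print(has_aa_bit(non_authoritative))
print(has_aa_bit(authoritative))
```

A response like the first sample carries answer records but no AA bit,
which matches the situation the log lines above describe.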
The full log is in the attachment, along with a dump file illustrating
the problem.
So, our first question: is this normal behavior for PowerDNS Recursor,
and can it be changed (globally or for specific zones)?
Also, not long ago we had an issue after restarting the pdns-recursor
process. After the restart (around 11 am), the number of SERVFAIL
responses sent to clients began to increase.
The load on the server at that moment was about 300 thousand queries
per second.
By the evening (around 22:00), SERVFAIL responses were approaching
30 percent of all queries, and the call center began receiving a large
number of complaints from subscribers who could not resolve domain
names.
By that time, the load had grown to 400 thousand queries per second
(the normal value for that time of day).
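To put the figures above in perspective, here is a minimal Python
sketch of how such a SERVFAIL percentage can be derived from two
samples of cumulative counters. It is an assumption about how the
ratio could be computed, not a description of the monitoring actually
used; the function name, the counter keys "questions" and
"servfail-answers", and the sample values are all hypothetical.

```python
def servfail_ratio(before: dict, after: dict) -> float:
    """Fraction of queries answered with SERVFAIL between two counter samples."""
    questions = after["questions"] - before["questions"]
    servfails = after["servfail-answers"] - before["servfail-answers"]
    if questions <= 0:
        # No traffic (or a counter reset) between the samples: report zero.
        return 0.0
    return servfails / questions


# Hypothetical samples taken one second apart at about 400 thousand
# queries per second, of which 120 thousand were answered with SERVFAIL.
sample_t0 = {"questions": 1_000_000, "servfail-answers": 50_000}
sample_t1 = {"questions": 1_400_000, "servfail-answers": 170_000}

ratio = servfail_ratio(sample_t0, sample_t1)
print(f"{ratio:.0%}")
```

At 400 thousand queries per second, a ratio of 30 percent corresponds
to roughly 120 thousand failed answers every second.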
Switching to a backup server with an identical configuration (hardware
and software) did not solve the problem: it was reproduced on the
backup server too. Restarting did not help either.
In the end, the problem was solved by reducing the max-threads
parameter from 16 to 8.
This raises a number of questions.
What could cause this behavior? (Until the problem occurred, the
server had been working normally for several months at the same load
and with the same configuration.)
What tests should we run to identify bottlenecks in the system and in
pdns-recursor itself?
Which metrics should we monitor to prevent such situations?
The attachment also contains a screenshot illustrating the situation
at that time.
One last question.
Our company would like to have commercial support for your product. Is
this possible and, if so, what do we need to do?
Below is the link to the attachments:
https://cloud.mail.ru/public/3y53/RzaP6z2a6
Additional information:
> rec_control version
4.3.6
> less /etc/oracle-release
Oracle Linux Server release 8.4
Hardware: 2 CPUs (28 cores, 56 threads), 128 GB RAM
pdns-recursor was installed from the EPEL repository.
> grep -i process recursor.conf
# dnssec DNSSEC mode: off/process-no-validate (default)/process/log-fail/validate
# dnssec=process-no-validate
Best Regards,
Andrey
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users