Re: [Pdns-users] PowerDNS issues
September 22, 2021 3:03 PM, "Andrey Sedletsky via Pdns-users" wrote:

> Good afternoon!

Hi Andrey,

> After restarting the pdns-recursor process, the number of "outgoing
> query timeout" and "over capacity drops" sharply increases, which leads
> to serious degradation of the service.
> This behavior manifests itself at times of high load on the server (more
> than 400 thousand requests per second). With a lower load, restarting
> the process does not lead to such consequences.

Have you considered the possibility that 400 thousand queries per second is a load that is taxing your server to the brink of resource exhaustion? That sure is a lot of queries. According to
https://pc.nanog.org/static/published/meetings/NANOG77/2142/20191029_Spacek_Lightning_Talk_Dns_v2.pdf
they were able to achieve a lot less than that in 2019.

> We are interested in what could be the reason for this behavior

On the hunch that your setup might be in an overload scenario, I followed 'over-capacity-drops' in the code and ended up at
https://github.com/PowerDNS/pdns/blob/97a4cff6fc7b3da1ff44d42b950cfc17d2fd95cf/pdns/pdns_recursor.cc#L3146
so it seems that you have exhausted your thread capacity when that happens.

See https://doc.powerdns.com/recursor/performance.html on how to tune the recursor. However, if that is not benchmark traffic but real-world load, I would strongly suggest getting more servers installed.

The SERVFAIL response is just what I would expect in such a case. See https://www.rfc-editor.org/rfc/rfc1035.html#section-4.1.1 .

Kind regards,
Stefan
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users
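For readers who want to check the header fields on the wire: RFC 1035 section 4.1.1 puts the response code in the low four bits of the 16-bit flags word, and the AA (Authoritative Answer) bit in the same word. The following is only a minimal sketch of decoding those fields from a raw DNS message. The function name `decode_header` and the hand-built sample headers are illustrative assumptions, not output captured from the servers discussed in this thread.

```python
import struct

# Bit positions in the 16-bit flags word, per RFC 1035 section 4.1.1.
QR_MASK = 0x8000      # 1 = response
AA_MASK = 0x0400      # Authoritative Answer
RCODE_MASK = 0x000F   # response code, low four bits

RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def decode_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header at the start of a message."""
    if len(packet) < 12:
        raise ValueError("a DNS message needs at least a 12-byte header")
    # Six big-endian 16-bit fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    ident, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", packet[:12])
    rcode = flags & RCODE_MASK
    return {
        "id": ident,
        "is_response": bool(flags & QR_MASK),
        "aa": bool(flags & AA_MASK),
        "rcode": RCODE_NAMES.get(rcode, str(rcode)),
        "ancount": ancount,
    }


# Hypothetical, hand-built headers (not real captures):
# QR=1, RD=1, RA=1, RCODE=2 -> a SERVFAIL response
servfail = struct.pack("!6H", 0x1234, 0x8182, 1, 0, 0, 0)
# QR=1, RD=1, RA=1, AA=0, RCODE=0, two answers -> an answer without the AA bit
non_aa = struct.pack("!6H", 0x1235, 0x8180, 1, 2, 0, 0)

print(decode_header(servfail))
print(decode_header(non_aa))
```

Applied to a packet capture, the same decoding shows whether a response carried SERVFAIL and whether an upstream server set the AA bit on its answer.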
Re: [Pdns-users] PowerDNS issues
Good afternoon!

After restarting the pdns-recursor process, the number of "outgoing query timeout" and "over capacity drops" sharply increases, which leads to serious degradation of the service.
This behavior manifests itself at times of high load on the server (more than 400 thousand requests per second). With a lower load, restarting the process does not lead to such consequences.

Below are the examples.

Before the restart (data from the telegraf + influxdb bundle):

> select "host","outgoing-timeouts" from powerdns_recursor where "host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < '2021-09-03 08:00:00'
name: powerdns_recursor
time                 host          outgoing-timeouts
----                 ----          -----------------
2021-09-03T07:50:30Z a975-icache01 1463346871
2021-09-03T07:51:02Z a975-icache01 1463354005
2021-09-03T07:51:31Z a975-icache01 1463360230
2021-09-03T07:52:00Z a975-icache01 1463366325
2021-09-03T07:52:30Z a975-icache01 1463372284

> select "host","over-capacity-drops" from powerdns_recursor where "host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < '2021-09-03 08:00:00'
name: powerdns_recursor
time                 host          over-capacity-drops
----                 ----          -------------------
2021-09-03T07:50:30Z a975-icache01 5281536
2021-09-03T07:51:02Z a975-icache01 5281536
2021-09-03T07:51:31Z a975-icache01 5281536
2021-09-03T07:52:00Z a975-icache01 5281536
2021-09-03T07:52:30Z a975-icache01 5281536

And after the restart:

> select "host","outgoing-timeouts" from powerdns_recursor where "host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < '2021-09-03 08:00:00'
name: powerdns_recursor
time                 host          outgoing-timeouts
----                 ----          -----------------
2021-09-03T07:53:30Z a975-icache01 114684
2021-09-03T07:54:01Z a975-icache01 437493
2021-09-03T07:54:31Z a975-icache01 738150
2021-09-03T07:55:03Z a975-icache01 1060959
2021-09-03T07:55:30Z a975-icache01 1327177
...

> select "host","over-capacity-drops" from powerdns_recursor where "host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < '2021-09-03 08:00:00'
name: powerdns_recursor
time                 host          over-capacity-drops
----                 ----          -------------------
2021-09-03T07:53:30Z a975-icache01 100934
2021-09-03T07:54:01Z a975-icache01 457612
2021-09-03T07:54:31Z a975-icache01 572332
2021-09-03T07:55:03Z a975-icache01 742152
2021-09-03T07:55:30Z a975-icache01 803205
...

We are interested in what could be the reason for this behavior.

Thank you in advance.

Additional information:

> rec_control version
4.3.6
> less /etc/oracle-release
Oracle Linux Server release 8.4
> 2 CPUs (28 cores, 56 threads)
> 128 GB RAM

PDNS was installed from EPEL Repo

grep -i process recursor.conf
# dnssec DNSSEC mode: off/process-no-validate (default)/process/log-fail/validate
# dnssec=process-no-validate

Best Regards,
Andrey
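The values in the message above are cumulative counters, so the raw numbers before and after the restart cannot be compared directly; what matters is how fast each counter grows. The following is a minimal sketch of turning the posted outgoing-timeouts samples into per-second rates. The function name `per_second_rates` is an illustrative assumption, and the trailing "Z" is dropped from the timestamps only so that `datetime.fromisoformat` accepts them on older Python versions.

```python
from datetime import datetime


def per_second_rates(samples):
    """Turn (ISO timestamp, cumulative counter) pairs into per-second rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        seconds = (datetime.fromisoformat(t1) - datetime.fromisoformat(t0)).total_seconds()
        delta = v1 - v0
        if seconds <= 0 or delta < 0:
            # A negative delta means the counter was reset (for example by a
            # process restart), so that interval carries no usable rate.
            continue
        rates.append(delta / seconds)
    return rates


# outgoing-timeouts samples copied from the message above
before = [
    ("2021-09-03T07:50:30", 1463346871),
    ("2021-09-03T07:51:02", 1463354005),
    ("2021-09-03T07:51:31", 1463360230),
    ("2021-09-03T07:52:00", 1463366325),
    ("2021-09-03T07:52:30", 1463372284),
]
after = [
    ("2021-09-03T07:53:30", 114684),
    ("2021-09-03T07:54:01", 437493),
    ("2021-09-03T07:54:31", 738150),
    ("2021-09-03T07:55:03", 1060959),
    ("2021-09-03T07:55:30", 1327177),
]

print("before restart:", [round(r) for r in per_second_rates(before)])
print("after restart: ", [round(r) for r in per_second_rates(after)])
```

On these samples the outgoing-timeouts counter grows by roughly 200 to 220 per second before the restart and by roughly 10,000 per second after it. The over-capacity-drops counter does not change at all in the samples taken before the restart.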
Re: [Pdns-users] PowerDNS issues
On 10/09/2021 10:07, Andrey Sedletsky via Pdns-users wrote:

> One last question. Our company would like to have commercial support for
> your product. Is this possible and, if so, what needs to be done for this ?
> Below is the link to the attachments:
> https://cloud.mail.ru/public/3y53/RzaP6z2a6

Beware: I checked that link with curl, and it contains a bunch of JavaScript and a massive embedded base64 binary payload which pretends to be an image/gif:

background-image:url("data:image/gif;base64,R0lGODlhLQAtAOZ/AGm27CmW5Dm

I suppose it *might* be a pdns log wrapped in some cloud fluff, but I don't want to find out - and it couldn't look more suspicious if it tried.

If this is a genuine query, I suggest the OP posts a link to a text log file instead. Otherwise, steer clear.
Re: [Pdns-users] PowerDNS issues
Hi there, and have a good day!

This is Andrey Sedletsky, writing on behalf of PJSC MGTS (Moscow City Telephone Network).

We are using your recursive DNS server (the open-source PowerDNS Recursor) and have a couple of questions for you (actually more).

We were contacted by one of our clients about being unable to resolve the domain name "cm.taxi". The request trace on the server shows that PowerDNS does not accept the response from the authoritative server because the AA (Authoritative Answer) flag is not set:

Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record 'cm.taxi|A|91.231.114.19' in the answer section without the AA bit set received from cm.taxi
Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record 'cm.taxi|A|91.231.114.18' in the answer section without the AA bit set received from cm.taxi

The full log can be found in the attachment, along with a dump file illustrating the problem.

So, our first question: is this normal behavior for PowerDNS Recursor, and can it be changed (in general or for specific zones)?

Also, not long ago we had an issue when restarting the pdns-recursor process. After the restart (around 11 am), the number of SERVFAIL responses sent to clients began to increase. The load on the server at that moment was about 300 thousand requests per second. By the evening (around 22:00), SERVFAIL responses were approaching 30 percent of the total number of requests, and the call center began to receive a large number of complaints from subscribers who could not resolve domain names. By then the load had grown to 400 thousand requests per second (the usual value for that time of day). Switching to a backup server with the same configuration (hardware and software) did not solve the problem: it was reproduced on the backup server too. Another restart did not help either. In the end, the problem was solved by reducing the max-threads parameter from 16 to 8.

This raises a number of questions:

What could be the reason for this behavior? (Until the problem occurred, the server had been working normally for several months at the same load and with the same configuration.)
What tests should be performed to identify bottlenecks in the system and in pdns-recursor itself?
What metrics should be monitored to prevent such situations from occurring?

The attachment also contains a screenshot illustrating the situation at that time.

One last question. Our company would like to have commercial support for your product. Is this possible and, if so, what needs to be done for this?

Below is the link to the attachments:
https://cloud.mail.ru/public/3y53/RzaP6z2a6

Additional information:

> rec_control version
4.3.6
> less /etc/oracle-release
Oracle Linux Server release 8.4
> 2 CPUs (28 cores, 56 threads)
> 128 GB RAM

PDNS was installed from EPEL Repo

grep -i process recursor.conf
# dnssec DNSSEC mode: off/process-no-validate (default)/process/log-fail/validate
# dnssec=process-no-validate

Best Regards,
Andrey
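On the monitoring question, one possible starting point is to watch the share of SERVFAIL answers among incoming questions between two scrapes, since that is the figure that reached about 30 percent in the incident described above. The following is only a sketch under stated assumptions. The counter names `questions` and `servfail-answers` are assumed to match what `rec_control get-all` reports on the installed version and should be verified there. The snapshot values and the 5 percent alert threshold are hypothetical and chosen purely for illustration.

```python
def servfail_share(prev, curr):
    """Share of SERVFAIL answers among questions between two counter snapshots.

    Returns None when the interval is unusable (no new questions, or a counter
    went backwards because the process restarted in between).
    """
    new_questions = curr["questions"] - prev["questions"]
    new_servfails = curr["servfail-answers"] - prev["servfail-answers"]
    if new_questions <= 0 or new_servfails < 0:
        return None
    return new_servfails / new_questions


# Hypothetical snapshots, NOT data from this thread.
prev_snapshot = {"questions": 1_000_000, "servfail-answers": 10_000}
curr_snapshot = {"questions": 1_400_000, "servfail-answers": 130_000}

ALERT_THRESHOLD = 0.05  # hypothetical: alert above 5 percent SERVFAIL

share = servfail_share(prev_snapshot, curr_snapshot)
if share is not None and share > ALERT_THRESHOLD:
    print(f"ALERT: SERVFAIL share is {share:.0%}")
```

Comparing deltas rather than absolute counter values keeps the check meaningful across restarts, when the counters start again from zero.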