Re: [Pdns-users] PowerDNS issues

2021-09-22 Thread Stefan Schmidt via Pdns-users
September 22, 2021 3:03 PM, "Andrey Sedletsky via Pdns-users" wrote:

> Good afternoon!

Hi Andrey,

> After restarting the pdns-recursor process, the number of "outgoing 
> query timeout" and "over capacity drops" sharply increases, which leads 
> to serious degradation of the service.
> This behavior manifests itself at times of high load on the server (more 
> than 400 thousand requests per second). With a lower load, restarting 
> the process does not lead to such consequences.

Have you considered the possibility that 400 thousand queries per second is a 
load that is taxing your server to the brink of resource exhaustion? That sure 
is a lot of queries. According to 
https://pc.nanog.org/static/published/meetings/NANOG77/2142/20191029_Spacek_Lightning_Talk_Dns_v2.pdf
 they were able to achieve a lot less than that in 2019.

> We are interested in what could be the reason for this behavior

On the hunch that your setup might be overloaded, I followed 
'over-capacity-drops' in the code and ended up at 
https://github.com/PowerDNS/pdns/blob/97a4cff6fc7b3da1ff44d42b950cfc17d2fd95cf/pdns/pdns_recursor.cc#L3146
so it seems that you have exhausted your thread capacity when that happens. 
See https://doc.powerdns.com/recursor/performance.html on how to tune the 
recursor. However, if that is real-world traffic rather than benchmark traffic, 
I would strongly suggest getting more servers installed.

The SERVFAIL response is just what I would expect in such a case. See 
https://www.rfc-editor.org/rfc/rfc1035.html#section-4.1.1 .
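
For reference, the RCODE lives in the low four bits of the second 16-bit word 
of the DNS header described in that RFC section, and the AA flag sits in the 
same word. A minimal Python sketch, assuming a raw DNS message is available as 
bytes; the example header is made up for illustration and is not taken from a 
capture:

```python
import struct

# Mnemonics for the RCODE values defined in RFC 1035 section 4.1.1.
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def decode_header_flags(message: bytes) -> dict:
    """Extract the ID, QR bit, AA bit and RCODE from a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    msg_id, flags = struct.unpack("!HH", message[:4])
    rcode = flags & 0x000F
    return {
        "id": msg_id,
        "qr": bool(flags & 0x8000),  # 1 = response
        "aa": bool(flags & 0x0400),  # Authoritative Answer
        "rcode": RCODE_NAMES.get(rcode, str(rcode)),
    }


# Hypothetical header: a response with RD and RA set and RCODE 2.
example = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
print(decode_header_flags(example))
```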

Kind regards,

 Stefan
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] PowerDNS issues

2021-09-22 Thread Andrey Sedletsky via Pdns-users

Good afternoon!
After restarting the pdns-recursor process, the number of "outgoing 
query timeout" and "over capacity drops" sharply increases, which leads 
to serious degradation of the service.
This behavior manifests itself at times of high load on the server (more 
than 400 thousand requests per second). With a lower load, restarting 
the process does not lead to such consequences.

Below are the examples:

Before the restart (data from the telegraf + influxdb bundle)
> select "host","outgoing-timeouts" from powerdns_recursor where 
"host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < 
'2021-09-03 08:00:00'

name: powerdns_recursor
time                 host          outgoing-timeouts
----                 ----          -----------------
2021-09-03T07:50:30Z a975-icache01 1463346871
2021-09-03T07:51:02Z a975-icache01 1463354005
2021-09-03T07:51:31Z a975-icache01 1463360230
2021-09-03T07:52:00Z a975-icache01 1463366325
2021-09-03T07:52:30Z a975-icache01 1463372284

> select "host","over-capacity-drops" from powerdns_recursor where 
"host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < 
'2021-09-03 08:00:00'

name: powerdns_recursor
time                 host          over-capacity-drops
----                 ----          -------------------
2021-09-03T07:50:30Z a975-icache01 5281536
2021-09-03T07:51:02Z a975-icache01 5281536
2021-09-03T07:51:31Z a975-icache01 5281536
2021-09-03T07:52:00Z a975-icache01 5281536
2021-09-03T07:52:30Z a975-icache01 5281536


And after the restart:

> select "host","outgoing-timeouts" from powerdns_recursor where 
"host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < 
'2021-09-03 08:00:00'

name: powerdns_recursor
time                 host          outgoing-timeouts
----                 ----          -----------------
2021-09-03T07:53:30Z a975-icache01 114684
2021-09-03T07:54:01Z a975-icache01 437493
2021-09-03T07:54:31Z a975-icache01 738150
2021-09-03T07:55:03Z a975-icache01 1060959
2021-09-03T07:55:30Z a975-icache01 1327177
...


> select "host","over-capacity-drops" from powerdns_recursor where 
"host"='a975-icache01' and time > '2021-09-03 07:45:00' and time < 
'2021-09-03 08:00:00'

name: powerdns_recursor
time                 host          over-capacity-drops
----                 ----          -------------------
2021-09-03T07:53:30Z a975-icache01 100934
2021-09-03T07:54:01Z a975-icache01 457612
2021-09-03T07:54:31Z a975-icache01 572332
2021-09-03T07:55:03Z a975-icache01 742152
2021-09-03T07:55:30Z a975-icache01 803205
...
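
Both values are cumulative counters that start again from zero when the 
process restarts, so the per-second rate between samples is easier to compare 
than the raw totals. A rough Python sketch using the first two samples of each 
series quoted above; the function name is only illustrative:

```python
from datetime import datetime


def per_second_rates(samples):
    """Turn (ISO timestamp, cumulative counter) pairs into per-second rates."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"), value)
        for ts, value in samples
    ]
    rates = []
    for (t0, v0), (t1, v1) in zip(parsed, parsed[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds <= 0 or v1 < v0:
            # Counter went backwards (process restart) or samples are out
            # of order: skip this interval instead of reporting nonsense.
            continue
        rates.append((v1 - v0) / seconds)
    return rates


timeouts_before = [("2021-09-03T07:50:30Z", 1463346871),
                   ("2021-09-03T07:51:02Z", 1463354005)]
timeouts_after = [("2021-09-03T07:53:30Z", 114684),
                  ("2021-09-03T07:54:01Z", 437493)]
drops_before = [("2021-09-03T07:50:30Z", 5281536),
                ("2021-09-03T07:51:02Z", 5281536)]
drops_after = [("2021-09-03T07:53:30Z", 100934),
               ("2021-09-03T07:54:01Z", 457612)]

for label, samples in [
    ("outgoing-timeouts before", timeouts_before),
    ("outgoing-timeouts after", timeouts_after),
    ("over-capacity-drops before", drops_before),
    ("over-capacity-drops after", drops_after),
]:
    print(label, [round(rate, 1) for rate in per_second_rates(samples)])
```

With these samples the outgoing-timeouts rate goes from roughly 220 per second 
before the restart to roughly 10,400 per second after it, and 
over-capacity-drops goes from zero to roughly 11,500 per second.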

We are interested in what could be the reason for this behavior

Thank you in advance


Additional information:
> rec_control version
4.3.6
> less /etc/oracle-release
Oracle Linux Server release 8.4

2 CPUs (28 cores, 56 threads)
128 GB RAM
PDNS was installed from the EPEL repo

> grep -i process recursor.conf
# dnssec    DNSSEC mode: off/process-no-validate (default)/process/log-fail/validate
# dnssec=process-no-validate

Best Regards,
Andrey


Re: [Pdns-users] PowerDNS issues

2021-09-10 Thread Brian Candler via Pdns-users

On 10/09/2021 10:07, Andrey Sedletsky via Pdns-users wrote:

> One last question.
> Our company would like to have commercial support for your product. 
> Is this possible and, if so, what needs to be done for this ?
>
> Below is the link to the attachments:
> https://cloud.mail.ru/public/3y53/RzaP6z2a6


Beware: I checked that link with curl, and it contains a bunch of 
JavaScript and a massive embedded base64 binary payload which pretends 
to be an image/gif:


background-image:url("data:image/gif;base64,R0lGODlhLQAtAOZ/AGm27CmW5Dm

I suppose it *might* be a pdns log wrapped in some cloud fluff, but I 
don't want to find out - and it couldn't look more suspicious if it tried.


If this is a genuine query, I suggest the OP posts a link to a text log 
file instead.  Otherwise, steer clear.




Re: [Pdns-users] PowerDNS issues

2021-09-10 Thread Andrey Sedletsky via Pdns-users



Hi there, and good day!
This is Andrey Sedletsky, writing on behalf of PJSC MGTS (Moscow City 
Telephone Network).


We use your recursive DNS server (the open-source PowerDNS Recursor) 
and have several questions for you.
One of our clients contacted us because the domain name "cm.taxi" 
could not be resolved.
The query trace on the server shows that PowerDNS does not accept the 
response from the authoritative server because the AA (Authoritative 
Answer) flag is not set.


Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record 
'cm.taxi|A|91.231.114.19' in the answer section without the AA bit set 
received from cm.taxi
Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record 
'cm.taxi|A|91.231.114.18' in the answer section without the AA bit set 
received from cm.taxi
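
To see which records and which servers are affected, log lines like these can 
be summarised with a small script. A sketch in Python, assuming each journal 
entry is on a single line (the mail client wrapped them) and that the message 
wording matches the two entries quoted above:

```python
import re
from collections import Counter

# Matches the "Removing record" message as it appears in the entries above.
DROPPED = re.compile(
    r"Removing record '(?P<name>[^|']+)\|(?P<type>[^|']+)\|(?P<content>[^']*)' "
    r"in the answer section without the AA bit set "
    r"received from (?P<source>\S+)"
)


def dropped_records(lines):
    """Yield one dict per record that was removed for lacking the AA bit."""
    for line in lines:
        match = DROPPED.search(line)
        if match:
            yield match.groupdict()


# The two entries from this mail, rejoined onto single lines.
log_lines = [
    "Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record "
    "'cm.taxi|A|91.231.114.19' in the answer section without the AA bit set "
    "received from cm.taxi",
    "Sep 04 01:47:38 a975-icache02 pdns_recursor[2575]: Removing record "
    "'cm.taxi|A|91.231.114.18' in the answer section without the AA bit set "
    "received from cm.taxi",
]

records = list(dropped_records(log_lines))
per_source = Counter(record["source"] for record in records)
print(records)
print(per_source)
```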


The full log is in the attachment, along with a dump file illustrating 
the problem.
So, our first question: is this normal behavior for PowerDNS Recursor, 
and can it be changed (globally or for specific zones)?



Also, not long ago, we had an issue when restarting the 
pdns-recursor process. After the restart (around 11 am), the number of 
SERVFAIL responses to clients began to increase.
The load on the server at that moment was about 300 thousand requests 
per second.
By the evening (around 22:00), SERVFAIL responses were approaching 
30 percent of all requests, and the call center began to receive a 
large number of complaints from subscribers who could not resolve 
domain names.
By that time, the load had grown to 400 thousand requests per second 
(the usual value for that time of day).
Switching to a backup server with an identical configuration (hardware 
and software) did not solve the problem; it was reproduced on the 
backup server too. Restarting did not help either.
In the end, the problem was solved by reducing the max-threads 
parameter from 16 to 8.

This raises a number of questions.
What could be the reason for this behavior? (Until the problem 
occurred, the server had been working normally for several months at 
the same load and with the same configuration.)
What tests should be performed to identify bottlenecks in the system 
and in pdns-recursor itself?
Which metrics should we monitor to prevent such situations?
The attachment also contains a screenshot illustrating the situation 
at that time.
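
On the monitoring question, one candidate metric is the share of SERVFAIL 
answers among all client queries over an interval. A minimal Python sketch, 
assuming two snapshots of the cumulative Recursor counters "questions" and 
"servfail-answers" have already been collected by whatever gathers the 
statistics; the snapshot values and the 5 percent threshold are made up for 
illustration:

```python
def servfail_ratio(prev, curr):
    """Fraction of questions answered with SERVFAIL between two snapshots.

    Returns None when the interval is unusable, for example when the
    counters went backwards because the process restarted.
    """
    questions = curr["questions"] - prev["questions"]
    servfails = curr["servfail-answers"] - prev["servfail-answers"]
    if questions <= 0 or servfails < 0:
        return None
    return servfails / questions


# Hypothetical snapshots taken 60 seconds apart at about 400k queries/s.
prev = {"questions": 1_000_000_000, "servfail-answers": 10_000_000}
curr = {"questions": 1_024_000_000, "servfail-answers": 17_200_000}

ALERT_THRESHOLD = 0.05  # illustrative value, not a recommendation

ratio = servfail_ratio(prev, curr)
if ratio is not None and ratio > ALERT_THRESHOLD:
    print(f"SERVFAIL ratio {ratio:.1%} exceeds {ALERT_THRESHOLD:.0%}")
```

The same interval arithmetic can be applied to "over-capacity-drops" and 
"outgoing-timeouts" to alert on their rate of change rather than on totals.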


One last question.
Our company would like to have commercial support for your product. Is 
this possible and, if so, what do we need to do?

Below is the link to the attachments:
https://cloud.mail.ru/public/3y53/RzaP6z2a6


Additional information:

> rec_control version
4.3.6
> less /etc/oracle-release
Oracle Linux Server release 8.4

2 CPUs (28 cores, 56 threads)
128 GB RAM
PDNS was installed from the EPEL repo

> grep -i process recursor.conf
# dnssec    DNSSEC mode: off/process-no-validate (default)/process/log-fail/validate
# dnssec=process-no-validate



Best Regards,
Andrey
