Re: [Pdns-users] Recursor getting pegged at 100% CPU

2024-03-15 Thread Otto Moerbeek via Pdns-users
On Fri, Mar 15, 2024 at 05:25:20PM +0100, Otto Moerbeek via Pdns-users wrote:

> > On 15 Mar 2024, at 17:01, Tim Burns via Pdns-users wrote:
> > 
> > Hello all, I’m experiencing a performance degradation while using the 
> > Recursor that I haven’t been able to root cause, and I was hoping to get 
> > some insight on what might be causing it, or some troubleshooting and 
> > diagnostics tips.
> >  
> > The observed issue is that the Recursor would start up and receive and 
> > forward traffic for several minutes, then suddenly the CPU usage would jump 
> > to 100%, rec_control would start timing out when talking to the process, 
> > and Prometheus metrics scrapes would stop succeeding.
> > The Recursor also stopped reporting statistics at the 
> > logging.statistics_interval, although if I added a debug Lua script it 
> > would still log messages from the threads.
> >  
> > I’m running the Recursor as a container in Kubernetes with the 5.0.3 image 
> > from Docker Hub. The replicas receive anywhere from 2-6k QPS nominally, but 
> > this issue seems to occur regardless of query rate. Before the metrics 
> > freeze, the instance only uses 10% of its allocated CPU and about 20% of 
> > allocated memory, and then gets pegged to 100% CPU usage and stops 
> > responding. This happens anywhere from several minutes to an hour after 
> > process start.
> >  
> > Below is the configuration used. We’re using this as a middleman for now to 
> > forward requests upstream to larger recursors and cache the results, and it 
> > is only accessible from within our network.
> >  
> > recursor:
> >   forward_zones_recurse:
> >     - zone: .
> >       forwarders:
> >         - 8.8.8.8
> >         - 1.1.1.1
> >         - 8.8.4.4
> >         - 1.0.0.1
> >         - 208.67.222.222
> >         - 208.67.220.220
> >   threads: 8
> >   security_poll_suffix: ''
> >   server_id: ''
> >   version_string: ''
> >   any_to_tcp: true
> > dnssec:
> >   validation: process-no-validate
> > incoming:
> >   allow_from:
> >     - 0.0.0.0/0
> >   listen:
> >     - 0.0.0.0
> >   port: 2353
> > logging:
> >   disable_syslog: true
> > recordcache:
> >   max_entries: 50
> > webservice:
> >   address: 0.0.0.0
> >   allow_from:
> >     - 10.0.0.0/8
> >     - 172.16.0.0/12
> >     - 127.0.0.1
> >   loglevel: none
> >   port: 8912
> >   webserver: true
> >  
> >  
> >  
> > While attempting to mitigate this live, I took a variety of actions in an 
> > attempt to root cause the issue, but none of the actions made a difference 
> > and we observed the same failure mode each time.
> > Actions taken:
> > - Disabling DNSSEC entirely with dnssec.validation: off
> > - Reducing threads from 8 to 4; we had scaled this up to 8 because the 
> > default of 2 would lead to a high volume of UdpInErrors after some time
> > - Turned off TCP, only receiving queries over UDP
> > - Enabled logging.common_errors to see if there was anything interesting, 
> > but nothing notable happened around the time of freeze
> >  
> > Notably, rec_control would report the error: “Timeout waiting for control 
> > channel data”, and when I attempted to turn on trace logging for SERVFAILs 
> > (logging.trace: fail), logs indicated “Could not dup trace file” with 
> > reason “Bad file descriptor”.
> >  
> > Let me know if there’s any other information I can provide. I haven’t been 
> > able to correlate this behavior to any unusual metrics before the freeze as 
> > everything appeared reasonably nominal, but I’m happy to provide the last 
> > Prometheus scrape we got if it would help. Any ideas?
> >  
> > Thank you!
> > Tim
> 
> When the recursor is pegged, try running
> 
>   perf top
> 
> Or attach gdb and run 
> 
>   thread apply all bt
> 
> on gdb's command line
> 
> This might not be possible in a container (I don’t know), but you might be 
> able to install the perf tools or gdb in the container. Otherwise, try to 
> reproduce on an ordinary VM or on bare metal.
> 
>   -Otto
> 

Oh, and some questions: in addition to collecting Prometheus metrics,
are you periodically querying via rec_control or the web
server/web API for other things? If so, what exactly?
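
For instance, the kind of periodic probing I mean would look something like
the sketch below (the port matches your webservice config; the API key and
addresses are placeholders):

  # control-channel statistics dump
  rec_control get-all

  # JSON statistics via the built-in webserver API (needs webservice.api_key)
  curl -H "X-API-Key: $APIKEY" http://127.0.0.1:8912/api/v1/servers/localhost/statistics

  # Prometheus scrape endpoint on the same webserver
  curl http://127.0.0.1:8912/metrics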

-Otto
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Recursor getting pegged at 100% CPU

2024-03-15 Thread Winfried via Pdns-users
Hi Tim,

Is there anything unusual in the stdout of the recursor container?
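For example, with something along these lines (pod and container names are
placeholders):

  kubectl logs -f pdns-recursor-0 -c recursor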

Winfried 


On 15 March 2024 at 17:01:59 CET, Tim Burns via Pdns-users wrote:
>Hello all, I’m experiencing a performance degradation while using the Recursor 
>that I haven’t been able to root cause, and I was hoping to get some insight 
>on what might be causing it, or some troubleshooting and diagnostics tips.
> 
>The observed issue is that the Recursor would start up and receive and 
>forward traffic for several minutes, then suddenly the CPU usage would jump to 
>100%, rec_control would start timing out when talking to the process, and 
>Prometheus metrics scrapes would stop succeeding.
>The Recursor also stopped reporting statistics at the 
>logging.statistics_interval, although if I added a debug Lua script it would 
>still log messages from the threads.
> 
>I’m running the Recursor as a container in Kubernetes with the 5.0.3 image 
>from Docker Hub. The replicas receive anywhere from 2-6k QPS nominally, but 
>this issue seems to occur regardless of query rate. Before the metrics freeze, 
>the instance only uses 10% of its allocated CPU and about 20% of allocated 
>memory, and then gets pegged to 100% CPU usage and stops responding. This 
>happens anywhere from several minutes to an hour after process start.
> 
>Below is the configuration used. We’re using this as a middleman for now to 
>forward requests upstream to larger recursors and cache the results, and it is 
>only accessible from within our network.
> 
>recursor:
>  forward_zones_recurse:
>    - zone: .
>      forwarders:
>        - 8.8.8.8
>        - 1.1.1.1
>        - 8.8.4.4
>        - 1.0.0.1
>        - 208.67.222.222
>        - 208.67.220.220
>  threads: 8
>  security_poll_suffix: ''
>  server_id: ''
>  version_string: ''
>  any_to_tcp: true
>dnssec:
>  validation: process-no-validate
>incoming:
>  allow_from:
>    - 0.0.0.0/0
>  listen:
>    - 0.0.0.0
>  port: 2353
>logging:
>  disable_syslog: true
>recordcache:
>  max_entries: 50
>webservice:
>  address: 0.0.0.0
>  allow_from:
>    - 10.0.0.0/8
>    - 172.16.0.0/12
>    - 127.0.0.1
>  loglevel: none
>  port: 8912
>  webserver: true
> 
> 
> 
>While attempting to mitigate this live, I took a variety of actions in an 
>attempt to root cause the issue, but none of the actions made a difference and 
>we observed the same failure mode each time.
>Actions taken:
>- Disabling DNSSEC entirely with dnssec.validation: off
>- Reducing threads from 8 to 4; we had scaled this up to 8 because the default 
>of 2 would lead to a high volume of UdpInErrors after some time
>- Turned off TCP, only receiving queries over UDP
>- Enabled logging.common_errors to see if there was anything interesting, but 
>nothing notable happened around the time of freeze
> 
>Notably, rec_control would report the error: “Timeout waiting for control 
>channel data”, and when I attempted to turn on trace logging for SERVFAILs 
>(logging.trace: fail), logs indicated “Could not dup trace file” with reason 
>“Bad file descriptor”.
> 
>Let me know if there’s any other information I can provide. I haven’t been 
>able to correlate this behavior to any unusual metrics before the freeze as 
>everything appeared reasonably nominal, but I’m happy to provide the last 
>Prometheus scrape we got if it would help. Any ideas?
> 
>Thank you!
>Tim
>___
>Pdns-users mailing list
>Pdns-users@mailman.powerdns.com
>https://mailman.powerdns.com/mailman/listinfo/pdns-users
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] Recursor getting pegged at 100% CPU

2024-03-15 Thread Otto Moerbeek via Pdns-users
> On 15 Mar 2024, at 17:01, Tim Burns via Pdns-users wrote:
> 
> Hello all, I’m experiencing a performance degradation while using the 
> Recursor that I haven’t been able to root cause, and I was hoping to get some 
> insight on what might be causing it, or some troubleshooting and diagnostics 
> tips.
>  
> The observed issue is that the Recursor would start up and receive and 
> forward traffic for several minutes, then suddenly the CPU usage would jump to 
> 100%, rec_control would start timing out when talking to the process, and 
> Prometheus metrics scrapes would stop succeeding.
> The Recursor also stopped reporting statistics at the 
> logging.statistics_interval, although if I added a debug Lua script it would 
> still log messages from the threads.
>  
> I’m running the Recursor as a container in Kubernetes with the 5.0.3 image 
> from Docker Hub. The replicas receive anywhere from 2-6k QPS nominally, but 
> this issue seems to occur regardless of query rate. Before the metrics 
> freeze, the instance only uses 10% of its allocated CPU and about 20% of 
> allocated memory, and then gets pegged to 100% CPU usage and stops 
> responding. This happens anywhere from several minutes to an hour after 
> process start.
>  
> Below is the configuration used. We’re using this as a middleman for now to 
> forward requests upstream to larger recursors and cache the results, and it is 
> only accessible from within our network.
>  
> recursor:
>   forward_zones_recurse:
>     - zone: .
>       forwarders:
>         - 8.8.8.8
>         - 1.1.1.1
>         - 8.8.4.4
>         - 1.0.0.1
>         - 208.67.222.222
>         - 208.67.220.220
>   threads: 8
>   security_poll_suffix: ''
>   server_id: ''
>   version_string: ''
>   any_to_tcp: true
> dnssec:
>   validation: process-no-validate
> incoming:
>   allow_from:
>     - 0.0.0.0/0
>   listen:
>     - 0.0.0.0
>   port: 2353
> logging:
>   disable_syslog: true
> recordcache:
>   max_entries: 50
> webservice:
>   address: 0.0.0.0
>   allow_from:
>     - 10.0.0.0/8
>     - 172.16.0.0/12
>     - 127.0.0.1
>   loglevel: none
>   port: 8912
>   webserver: true
>  
>  
>  
> While attempting to mitigate this live, I took a variety of actions in an 
> attempt to root cause the issue, but none of the actions made a difference 
> and we observed the same failure mode each time.
> Actions taken:
> - Disabling DNSSEC entirely with dnssec.validation: off
> - Reducing threads from 8 to 4; we had scaled this up to 8 because the default 
> of 2 would lead to a high volume of UdpInErrors after some time
> - Turned off TCP, only receiving queries over UDP
> - Enabled logging.common_errors to see if there was anything interesting, but 
> nothing notable happened around the time of freeze
>  
> Notably, rec_control would report the error: “Timeout waiting for control 
> channel data”, and when I attempted to turn on trace logging for SERVFAILs 
> (logging.trace: fail), logs indicated “Could not dup trace file” with reason 
> “Bad file descriptor”.
>  
> Let me know if there’s any other information I can provide. I haven’t been 
> able to correlate this behavior to any unusual metrics before the freeze as 
> everything appeared reasonably nominal, but I’m happy to provide the last 
> Prometheus scrape we got if it would help. Any ideas?
>  
> Thank you!
> Tim

When the recursor is pegged, try running

perf top

Or attach gdb and run 

thread apply all bt

on gdb's command line

This might not be possible in a container (I don’t know), but you might be 
able to install the perf tools or gdb in the container. Otherwise, try to 
reproduce on an ordinary VM or on bare metal.
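
If the image does allow installing packages, a rough sketch of what I mean
(assuming a Debian-based image, enough privileges in the pod for ptrace/perf,
and placeholder pod/container names) would be:

  # exec into the recursor container
  kubectl exec -it pdns-recursor-0 -c recursor -- /bin/sh

  # inside the container: install the tools (Debian-based image assumed)
  apt-get update && apt-get install -y gdb linux-perf procps

  # sample the hottest code paths of the running recursor
  perf top -p "$(pgrep -o pdns_recursor)"

  # or dump backtraces of all threads once with gdb
  gdb -p "$(pgrep -o pdns_recursor)" -batch -ex 'thread apply all bt'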

-Otto

___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users


[Pdns-users] Recursor getting pegged at 100% CPU

2024-03-15 Thread Tim Burns via Pdns-users
Hello all, I’m experiencing a performance degradation while using the Recursor 
that I haven’t been able to root cause, and I was hoping to get some insight on 
what might be causing it, or some troubleshooting and diagnostics tips.
 
The observed issue is that the Recursor would start up and receive and forward 
traffic for several minutes, then suddenly the CPU usage would jump to 100%, 
rec_control would start timing out when talking to the process, and Prometheus 
metrics scrapes would stop succeeding.
The Recursor also stopped reporting statistics at the 
logging.statistics_interval, although if I added a debug Lua script it would 
still log messages from the threads.
 
I’m running the Recursor as a container in Kubernetes with the 5.0.3 image from 
Docker Hub. The replicas receive anywhere from 2-6k QPS nominally, but this 
issue seems to occur regardless of query rate. Before the metrics freeze, the 
instance only uses 10% of its allocated CPU and about 20% of allocated memory, 
and then gets pegged to 100% CPU usage and stops responding. This happens 
anywhere from several minutes to an hour after process start.
 
Below is the configuration used. We’re using this as a middleman for now to 
forward requests upstream to larger recursors and cache the results, and it is 
only accessible from within our network.
 
recursor:
  forward_zones_recurse:
    - zone: .
      forwarders:
        - 8.8.8.8
        - 1.1.1.1
        - 8.8.4.4
        - 1.0.0.1
        - 208.67.222.222
        - 208.67.220.220
  threads: 8
  security_poll_suffix: ''
  server_id: ''
  version_string: ''
  any_to_tcp: true
dnssec:
  validation: process-no-validate
incoming:
  allow_from:
    - 0.0.0.0/0
  listen:
    - 0.0.0.0
  port: 2353
logging:
  disable_syslog: true
recordcache:
  max_entries: 50
webservice:
  address: 0.0.0.0
  allow_from:
    - 10.0.0.0/8
    - 172.16.0.0/12
    - 127.0.0.1
  loglevel: none
  port: 8912
  webserver: true
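
For reference, an internal client would query this listener on the
non-standard port along these lines (the address below is a placeholder):

  # test query against the forwarder on port 2353
  dig @10.0.0.53 -p 2353 www.example.com A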
 
 
 
While attempting to mitigate this live, I took a variety of actions in an 
attempt to root cause the issue, but none of the actions made a difference and 
we observed the same failure mode each time.
Actions taken:
- Disabling DNSSEC entirely with dnssec.validation: off
- Reducing threads from 8 to 4; we had scaled this up to 8 because the default 
of 2 would lead to a high volume of UdpInErrors after some time
- Turned off TCP, only receiving queries over UDP
- Enabled logging.common_errors to see if there was anything interesting, but 
nothing notable happened around the time of freeze
 
Notably, rec_control would report the error: “Timeout waiting for control 
channel data”, and when I attempted to turn on trace logging for SERVFAILs 
(logging.trace: fail), logs indicated “Could not dup trace file” with reason 
“Bad file descriptor”.
 
Let me know if there’s any other information I can provide. I haven’t been able 
to correlate this behavior to any unusual metrics before the freeze as 
everything appeared reasonably nominal, but I’m happy to provide the last 
Prometheus scrape we got if it would help. Any ideas?
 
Thank you!
Tim
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users