Re: [dnsdist] dnsdist 1.7.4 Debian Bullseye vs 1.8.4 Bullseye
Hi

> On 05/10/2023 10:41, Aleš Rygl via dnsdist wrote:
>> Thanks for your response. After some deep documentation reading and config tweaking I am nearly back to the previous values regarding CPU load, apart from latency, which is still higher (1.3ms -> 2.3ms). I suspect the latency is now computed in a different way (I noticed a new set of latency counters for TLS, TCP, etc.).
>>
>> The key configuration parameter is setMaxTCPClientThreads(). Changing anything else (cache shards, number of listeners, etc.) has nearly no impact. We had 256 with 1.7.4; now it is 16. Going higher means a rapid increase in CPU load; going below 16 means dropped TCP connections, visible in showTCPStats(), where Queued hits Max Queued. Insane values like 1024 kill the CPU. We have a physical server with 16 physical cores; the OS sees 32 cores.
>
> OK, this is clearly unexpected. I mean, since 1.4.0 you should not need more TCP worker threads than the number of cores, since a single worker can handle a lot (easily thousands) of TCP connections, but having a larger value should not kill the CPU, so I'm wondering if we are busy-looping somewhere. I have not been able to reproduce that so far, so I would be really interested in seeing the perf output if you can get it.

Update: after some testing I can say that dnsdist 1.7.4 on Bookworm has the same issue as 1.8.1. The reason is apparently here: https://github.com/openssl/openssl/issues/17064.

There is a safe workaround: lowering setMaxTCPClientThreads(). Watch out for TCP queueing; use showTCPStats() to check. Improving TLS performance by using a STEK file can help as well.

I'd like to thank Remi for his excellent support.

Ales
___
dnsdist mailing list
dnsdist@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/dnsdist
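For anyone landing on this thread later, the workaround described above might look roughly like the following in a dnsdist configuration file. This is only a sketch under assumptions: the value 16 is the one reported in this thread for a 16-core host, while the listen address, the certificate and key paths, and the ticket key path are hypothetical placeholders and not taken from the poster's setup.

```lua
-- Sketch of the workaround discussed in this thread; NOT the poster's actual configuration.

-- Keep the number of TCP/DoT worker threads at or below the number of cores.
-- 16 is the value reported in the thread for a host with 16 physical cores.
setMaxTCPClientThreads(16)

-- DoT listener that loads TLS session ticket keys (STEK) from a file.
-- The address and all three paths below are hypothetical placeholders.
addTLSLocal('192.0.2.1:853',
            '/etc/dnsdist/tls/fullchain.pem',
            '/etc/dnsdist/tls/privkey.pem',
            { ticketKeyFile = '/run/dnsdist/stek.key' })
```

After restarting with a lower thread count, showTCPStats() in the dnsdist console should be checked under load to confirm that Queued stays below Max Queued, as noted above.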
Re: [dnsdist] dnsdist 1.7.4 Debian Bullseye vs 1.8.4 Bullseye
Hi!

On 05/10/2023 10:41, Aleš Rygl via dnsdist wrote:
> Thanks for your response. After some deep documentation reading and config tweaking I am nearly back to the previous values regarding CPU load, apart from latency, which is still higher (1.3ms -> 2.3ms). I suspect the latency is now computed in a different way (I noticed a new set of latency counters for TLS, TCP, etc.).
>
> The key configuration parameter is setMaxTCPClientThreads(). Changing anything else (cache shards, number of listeners, etc.) has nearly no impact. We had 256 with 1.7.4; now it is 16. Going higher means a rapid increase in CPU load; going below 16 means dropped TCP connections, visible in showTCPStats(), where Queued hits Max Queued. Insane values like 1024 kill the CPU. We have a physical server with 16 physical cores; the OS sees 32 cores.

OK, this is clearly unexpected. I mean, since 1.4.0 you should not need more TCP worker threads than the number of cores, since a single worker can handle a lot (easily thousands) of TCP connections, but having a larger value should not kill the CPU, so I'm wondering if we are busy-looping somewhere. I have not been able to reproduce that so far, so I would be really interested in seeing the perf output if you can get it.

Best regards,
--
Remi Gacogne
PowerDNS.COM BV - https://www.powerdns.com/
Re: [dnsdist] dnsdist 1.7.4 Debian Bullseye vs 1.8.4 Bullseye
Hi Remi,

On 02. 10. 23 13:53, Remi Gacogne via dnsdist wrote:
> Hi Ales,
>
> On 25/09/2023 16:09, Aleš Rygl via dnsdist wrote:
>> I would like to kindly ask for help or advice. I have just upgraded one of our dnsdist instances from 1.7.4 to 1.8.4 together with an OS upgrade (Debian 11.7 to 12.1). Everything works fine, no issues observed apart from some deprecated config references. What is a big surprise to me is CPU usage. The newer version has nearly two times higher CPU consumption in userspace. I am nearly at 80% CPU with 16 physical cores (was about 40%). We have a lot of TLS (DoT) sessions (30k) and 60kqps in total (30k via DoT) here. The latency measured by dnsdist went up as well. We collect all the metrics dnsdist produces via Graphite, so I can check any counters that might show what is wrong.
>
> Wow, that's awful. It's the first time I have heard about such a regression, and I really would like to understand what is going on.
>
> 1/ Are you using our packages, compiling yourself, or perhaps using the Debian ones?
> 2/ Do you think it would be possible for you to try downgrading the instance to 1.7.4 on Debian 12.1? It might help us pinpoint whether the issue is related to a system change (I have seen people complain about the performance of OpenSSL 3.0.x compared to 1.1.1x, for example).
> 3/ Would you mind sharing your configuration?
> 4/ And finally, do you think it would be possible for you to collect a perf trace on this instance? It would require installing linux-perf and, if possible, the debug symbols for dnsdist (dnsdist-dbgsym), then running 'perf record --call-graph dwarf -p [PID of the dnsdist process] -o [output file]' for a few dozen seconds to collect a trace, stopping it with Ctrl+C and finally getting a report with "perf report -i [output file] --stdio". It should tell us where the CPU usage is going.
>
> Best regards,

Thanks for your response. After some deep documentation reading and config tweaking I am nearly back to the previous values regarding CPU load, apart from latency, which is still higher (1.3ms -> 2.3ms). I suspect the latency is now computed in a different way (I noticed a new set of latency counters for TLS, TCP, etc.).

The key configuration parameter is setMaxTCPClientThreads(). Changing anything else (cache shards, number of listeners, etc.) has nearly no impact. We had 256 with 1.7.4; now it is 16. Going higher means a rapid increase in CPU load; going below 16 means dropped TCP connections, visible in showTCPStats(), where Queued hits Max Queued. Insane values like 1024 kill the CPU. We have a physical server with 16 physical cores; the OS sees 32 cores.

Back to your questions:

1/ From your repos.
2/ Yes, I could try it. The thing is that 1.7.4 for Bullseye crashes on Bookworm with TLS enabled, and there are no packages of 1.7.4 for Bookworm in your repo.
3/ Sure, I will do so.
4/ No problem.

Best regards
Ales
Re: [dnsdist] dnsdist 1.7.4 Debian Bullseye vs 1.8.4 Bullseye
Hi Ales,

On 25/09/2023 16:09, Aleš Rygl via dnsdist wrote:
> I would like to kindly ask for help or advice. I have just upgraded one of our dnsdist instances from 1.7.4 to 1.8.4 together with an OS upgrade (Debian 11.7 to 12.1). Everything works fine, no issues observed apart from some deprecated config references. What is a big surprise to me is CPU usage. The newer version has nearly two times higher CPU consumption in userspace. I am nearly at 80% CPU with 16 physical cores (was about 40%). We have a lot of TLS (DoT) sessions (30k) and 60kqps in total (30k via DoT) here. The latency measured by dnsdist went up as well. We collect all the metrics dnsdist produces via Graphite, so I can check any counters that might show what is wrong.

Wow, that's awful. It's the first time I have heard about such a regression, and I really would like to understand what is going on.

1/ Are you using our packages, compiling yourself, or perhaps using the Debian ones?
2/ Do you think it would be possible for you to try downgrading the instance to 1.7.4 on Debian 12.1? It might help us pinpoint whether the issue is related to a system change (I have seen people complain about the performance of OpenSSL 3.0.x compared to 1.1.1x, for example).
3/ Would you mind sharing your configuration?
4/ And finally, do you think it would be possible for you to collect a perf trace on this instance? It would require installing linux-perf and, if possible, the debug symbols for dnsdist (dnsdist-dbgsym), then running 'perf record --call-graph dwarf -p [PID of the dnsdist process] -o [output file]' for a few dozen seconds to collect a trace, stopping it with Ctrl+C and finally getting a report with "perf report -i [output file] --stdio". It should tell us where the CPU usage is going.

Best regards,
--
Remi Gacogne
PowerDNS.COM BV - https://www.powerdns.com/
Re: [dnsdist] dnsdist 1.7.4 Debian Bullseye vs 1.8.4 Bullseye
Ah, I am sorry, the subject should be 1.7.4 Debian Bullseye vs 1.8.1 Bookworm. I am running 1.8.1 on Bookworm...

Ales

On 25. 09. 23 16:01, Aleš Rygl via dnsdist wrote:
> Hello,
>
> I would like to kindly ask for help or advice. I have just upgraded one of our dnsdist instances from 1.7.4 to 1.8.4 together with an OS upgrade (Debian 11.7 to 12.1). Everything works fine, no issues observed apart from some deprecated config references. What is a big surprise to me is CPU usage. The newer version has nearly two times higher CPU consumption in userspace. I am nearly at 80% CPU with 16 physical cores (was about 40%). We have a lot of TLS (DoT) sessions (30k) and 60kqps in total (30k via DoT) here. The latency measured by dnsdist went up as well. We collect all the metrics dnsdist produces via Graphite, so I can check any counters that might show what is wrong.
>
> Thanks in advance
>
> With best regards
> Ales