Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On Thu, Apr 15, 2021 at 07:53:15PM +, Robin H. Johnson wrote:
> But your thought of CPU pinning was good.
> I went to confirm it in the host, and I'm not certain if the cpu-map is
> working right.
Ignore me; long day, and I didn't think to check each thread PID:

# ps -e -T | grep haproxy -w | awk '{print $2}' | xargs -n1 taskset -pc
pid 120689's current affinity list: 0-127
pid 120702's current affinity list: 0
pid 120706's current affinity list: 1
..
pid 120776's current affinity list: 63

-- 
Robin Hugh Johnson
E-Mail : robb...@orbis-terrarum.net
Home Page : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
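[Editor's note] Checking each thread PID by hand is easy to forget, as the message above shows. A small sketch of how the one-liner could be wrapped up for reuse, assuming Linux with procps `ps` and util-linux `taskset`; the `thread_affinity` helper name is made up for illustration:

```shell
#!/bin/sh
# Print the CPU affinity of every thread of a named process, so drift
# from haproxy's cpu-map settings can be spotted quickly.
thread_affinity() {
    # ps -e -T lists one row per thread: column 2 (SPID) is the thread
    # id, and the last column is the command name.
    ps -e -T | awk -v name="$1" '$NF ~ name {print $2}' | while read -r tid; do
        taskset -pc "$tid"
    done
}

thread_affinity haproxy
```

Each output line should read like "pid 120702's current affinity list: 0"; any thread still showing the full 0-127 range was not pinned.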
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On Thu, Apr 15, 2021 at 09:23:07AM +0200, Willy Tarreau wrote:
> On Thu, Apr 15, 2021 at 07:13:53AM +, Robin H. Johnson wrote:
> > Thanks; I will need to catch it faster or automate this, because the
> > watchdog does a MUCH better job restarting it than before, less than 30
> > seconds of 100% CPU before the watchdog reliably kills it.
> I see. Then collecting the watchdog outputs can be instructive to see
> if it always happens at the same place or not. And the core dumps will
> indicate what all threads were doing (and if some were competing on a
> lock for example).

The truncation in the log output for the crash was interesting; I couldn't
see why it was being cut off. I wish we could get a clean reproduction in
our testing environment, because the production core dumps absolutely have
customer data in them.

> > Varnish runs on the same host and is used to cache some of the backends.
> > Plenty of free memory at the moment.
> I'm now thinking about something. Do you have at least as many CPUs as the
> total number of threads used by haproxy and varnish? Otherwise there will
> be some competition and migrations will happen. If neither is bounded, you
> can even end up with two haproxy threads forced to run on the same CPU,
> which is the worst situation, as one could be scheduled out with a lock
> held and the other one spinning waiting for this lock.

Single socket, AMD EPYC 7702P 64-Core Processor, 128 threads! It shows as
a single NUMA node in our present configuration. Hopefully the kernel is
mostly doing the right thing, but read on.

HAProxy is already pinned to the first 64 threads with:
cpu-map 1/1 0
...
cpu-map 1/64 63

Varnish isn't explicitly pinned right now, but uses less than 200% CPU
overall (we know most requests aren't cacheable, so they don't get routed
to Varnish at all).

But your thought of CPU pinning was good. I went to confirm it in the
host, and I'm not certain if the cpu-map is working right.
# pid_haproxy_leader=68839 ; pid_haproxy_follower=68848
# taskset -pc $pid_haproxy_leader
pid 68839's current affinity list: 0-127
# taskset -pc $pid_haproxy_follower
pid 68848's current affinity list: 0

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On Thu, Apr 15, 2021 at 07:13:53AM +, Robin H. Johnson wrote:
> Thanks; I will need to catch it faster or automate this, because the
> watchdog does a MUCH better job restarting it than before, less than 30
> seconds of 100% CPU before the watchdog reliably kills it.

I see. Then collecting the watchdog outputs can be instructive to see
if it always happens at the same place or not. And the core dumps will
indicate what all threads were doing (and if some were competing on a
lock for example).

> > - please also check if your machine is not swapping, as this can have
> >   terrible consequences and could explain why it only happens on certain
> >   machines dealing with certain workloads. I remember having spent several
> >   weeks many years ago chasing a response time issue happening only in the
> >   morning, which was in fact caused by the log upload batch having swapped
> >   and left a good part of the unused memory in the swap, making it very
> >   difficult for the network stack to allocate buffers during send() and
> >   recv(), thus causing losses and retransmits as the load grew. This was
> >   never reproducible in the lab because we were not syncing logs :-)
> 512GiB RAM and no swap configured on the system at all. :-)
> Varnish runs on the same host and is used to cache some of the backends.
> Plenty of free memory at the moment.

I'm now thinking about something. Do you have at least as many CPUs as the
total number of threads used by haproxy and varnish? Otherwise there will
be some competition and migrations will happen. If neither is bounded, you
can even end up with two haproxy threads forced to run on the same CPU,
which is the worst situation, as one could be scheduled out with a lock
held and the other one spinning waiting for this lock. With half a TB of
RAM I guess you have multiple sockets.
Could you at least pin haproxy to the CPUs of a single socket (that's the
bare minimum needed to preserve performance, as atomics and locks over
UPI/QPI are a disaster), and ideally pin varnish to another socket? Or
maybe just enable fewer threads for haproxy if you don't need that many,
and make sure the CPUs it's bound to are not used by varnish?

In such a setup combining several high-performance processes, it's really
important to reserve the resources for them, and you must also count the
number of CPUs needed to deal with network interrupts (and likely disk, if
varnish uses it). Once your resources are cleanly reserved, you'll get the
maximum performance with the lowest latency.

Willy
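[Editor's note] A sketch of what Willy's partitioning advice could look like in practice, assuming a 128-CPU single-socket box like Robin's; the CPU ranges and the systemd approach for varnish are illustrative choices, not from the thread:

```
# haproxy.cfg (sketch): reserve CPUs 0-63 for haproxy's 64 threads
global
    nbthread 64
    cpu-map auto:1/1-64 0-63

# ...and keep varnish off those CPUs, e.g. via a systemd drop-in
# for the varnish unit (hypothetical placement):
#   [Service]
#   CPUAffinity=64-127
```

The `cpu-map auto:` form binds thread 1 to CPU 0, thread 2 to CPU 1, and so on, which matches the explicit 64-line `cpu-map 1/1 0` ... `cpu-map 1/64 63` configuration Robin quotes later in the thread.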
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On Thu, Apr 15, 2021 at 08:59:35AM +0200, Willy Tarreau wrote:
> On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > > nbthread=64, nbproc=1 on both 1.8/2.x
> >
> > It is thus surprising, if it is really a contention issue, that you never
> > observed a slowdown on the 1.8. There is no watchdog, but the thread
> > implementation is a bit awkward on the 1.8. 2.X is better on this point,
> > the best being the 2.4.
>
> Agreed, I'd even say that 64 threads in 1.8 should be way slower than
> a single thread.
>
> A few things are worth having a look at, Robin:
> - please run "perf top" when this happens, and when you see a function
>   reporting a high usage, do not hesitate to navigate through it, pressing
>   enter, and "annotate function ". Then scrolling through it will
>   reveal the percentage of time certain points were met. It's very possible
>   you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
>   that there is a logic error somewhere. Usually if you find a single one
>   you'll end up spotting a lock.

Thanks; I will need to catch it faster or automate this, because the
watchdog does a MUCH better job restarting it than before: less than 30
seconds of 100% CPU before the watchdog reliably kills it.

> - please also check if your machine is not swapping, as this can have
>   terrible consequences and could explain why it only happens on certain
>   machines dealing with certain workloads. I remember having spent several
>   weeks many years ago chasing a response time issue happening only in the
>   morning, which was in fact caused by the log upload batch having swapped
>   and left a good part of the unused memory in the swap, making it very
>   difficult for the network stack to allocate buffers during send() and
>   recv(), thus causing losses and retransmits as the load grew. This was
>   never reproducible in the lab because we were not syncing logs :-)

512GiB RAM and no swap configured on the system at all.
Varnish runs on the same host and is used to cache some of the backends.
Plenty of free memory at the moment.

-- 
Robin Hugh Johnson
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
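[Editor's note] Since the watchdog now kills the process within ~30 seconds, automating core collection is the practical way to capture state, as discussed above. A sketch assuming Linux; the core pattern, binary path, and core file name are illustrative examples, not paths from the thread:

```shell
#!/bin/sh
# Make sure a core is actually written when the watchdog kills haproxy,
# so per-thread state can be inspected after the fact.
ulimit -c unlimited
echo "core limit: $(ulimit -c)"

# Direct cores somewhere with space (requires root); pattern is an example:
#   sysctl -w kernel.core_pattern=/var/crash/core.%e.%p

# After a watchdog kill, dump every thread's backtrace from the core
# (binary path and core file name are illustrative):
#   gdb -batch -ex 'thread apply all bt full' \
#       /usr/sbin/haproxy /var/crash/core.haproxy.12345
```

`thread apply all bt full` is the gdb incantation that shows what every thread was doing, which is exactly what Willy asks for to spot threads competing on a lock.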
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > nbthread=64, nbproc=1 on both 1.8/2.x
>
> It is thus surprising, if it is really a contention issue, that you never
> observed a slowdown on the 1.8. There is no watchdog, but the thread
> implementation is a bit awkward on the 1.8. 2.X is better on this point,
> the best being the 2.4.

Agreed, I'd even say that 64 threads in 1.8 should be way slower than
a single thread.

A few things are worth having a look at, Robin:
- please run "perf top" when this happens, and when you see a function
  reporting a high usage, do not hesitate to navigate through it, pressing
  enter, and "annotate function ". Then scrolling through it will reveal
  the percentage of time certain points were met. It's very possible
  you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
  that there is a logic error somewhere. Usually if you find a single one
  you'll end up spotting a lock.
- please also check if your machine is not swapping, as this can have
  terrible consequences and could explain why it only happens on certain
  machines dealing with certain workloads. I remember having spent several
  weeks many years ago chasing a response time issue happening only in the
  morning, which was in fact caused by the log upload batch having swapped
  and left a good part of the unused memory in the swap, making it very
  difficult for the network stack to allocate buffers during send() and
  recv(), thus causing losses and retransmits as the load grew. This was
  never reproducible in the lab because we were not syncing logs :-)

Willy
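[Editor's note] The swap check in the second bullet can be scripted so it runs alongside any automated capture. A sketch assuming Linux, reading the standard `SwapTotal`/`SwapFree` fields of /proc/meminfo; variable names are invented for illustration:

```shell
#!/bin/sh
# Report configured and in-use swap, per Willy's second suggestion.
# (For the CPU side, "perf top -p <haproxy pid>" is the interactive tool.)
swap_total_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free_kb=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
swap_used_kb=$((swap_total_kb - swap_free_kb))

echo "swap: total=${swap_total_kb} kB used=${swap_used_kb} kB"
if [ "$swap_total_kb" -eq 0 ]; then
    echo "no swap configured"
fi
```

A nonzero "used" value on a box that should never swap is the red flag Willy describes; `vmstat 1` (watching the si/so columns) shows whether swapping is happening right now rather than historically.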
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On 10/04/2021 at 00:34, Robin H. Johnson wrote:
> On Fri, Apr 09, 2021 at 10:14:26PM +0200, Christopher Faulet wrote:
> > It seems you have a blocking call in one of your lua scripts. The threads
> > dump shows many threads blocked in hlua_ctx_init. Many others are
> > executing lua. Unfortunately, for an unknown reason, there is no stack
> > traceback.
> All of our Lua is string handling. Parsing headers, and then building TXN
> variables, as well as a set of applets that return responses in cases
> where we don't want to go to a backend server because the response is
> simple enough to generate inside the LB.
> > For the 2.3 and prior, the lua scripts are executed under a global lock.
> > Thus blocking calls in a lua script are awful, because they do not block
> > only one thread but all the others too. I guess the same issue exists on
> > the 1.8, but there is no watchdog on this version. Thus, from time to
> > time, HAProxy hangs and may report huge latencies but, in the end, it
> > recovers and continues to process data. It is exactly the purpose of the
> > watchdog to report hidden bugs related to spinning loops and deadlocks.
> Nothing in this Lua code should have blocking calls at all. The Lua code
> has zero calls to external services, no sockets, no sleeps, no print, no
> Map.new (a single call in the Lua startup, not inside any applet or
> fetch), no usage of other packages, no file handling, no other IO.
> I'm hoping I can get $work to agree to fully open-source the Lua, so you
> can see this fact and review the code to confirm that it SHOULD be
> non-blocking.

I trust you on this point. If you are not using external components, it
should indeed be ok. So it is probably a contention issue on the global Lua
lock. If you are able to generate and inspect a core file, it should help
you figure out what really happens.

> > However, I may be wrong. It may just be a contention problem because you
> > are executing lua with 64 threads and a huge workload. In this case, you
> > may give the 2.4 (under development) a try.
> > There is a way to have a separate lua context for each thread, loading
> > the scripts with the "lua-load-per-thread" directive. Out of curiosity,
> > on the 1.8, are you running HAProxy with several threads or are you
> > spawning several processes?
> nbthread=64, nbproc=1 on both 1.8/2.x

It is thus surprising, if it is really a contention issue, that you never
observed a slowdown on the 1.8. There is no watchdog, but the thread
implementation is a bit awkward on the 1.8. 2.X is better on this point,
the best being the 2.4.

> Yes, we're hoping to try 2.4.x, just working on some parts to get there.

-- 
Christopher Faulet
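[Editor's note] The per-thread loading Christopher mentions would look something like this in a 2.4 configuration; the script path is illustrative, not the one from this deployment:

```
# haproxy.cfg (sketch, 2.4+): load the Lua script once per thread, so
# each thread runs its own Lua context instead of all 64 contending on
# the single global lock taken by plain lua-load.
global
    nbthread 64
    lua-load-per-thread /etc/haproxy/request_handlers.lua
```

One design consequence worth noting: with per-thread contexts, global Lua state set at load time is duplicated per thread, so scripts that relied on sharing state through the single context need review before switching.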
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On Fri, Apr 09, 2021 at 10:14:26PM +0200, Christopher Faulet wrote:
> It seems you have a blocking call in one of your lua scripts. The threads
> dump shows many threads blocked in hlua_ctx_init. Many others are
> executing lua. Unfortunately, for an unknown reason, there is no stack
> traceback.

All of our Lua is string handling. Parsing headers, and then building TXN
variables, as well as a set of applets that return responses in cases where
we don't want to go to a backend server because the response is simple
enough to generate inside the LB.

> For the 2.3 and prior, the lua scripts are executed under a global lock.
> Thus blocking calls in a lua script are awful, because they do not block
> only one thread but all the others too. I guess the same issue exists on
> the 1.8, but there is no watchdog on this version. Thus, from time to
> time, HAProxy hangs and may report huge latencies but, in the end, it
> recovers and continues to process data. It is exactly the purpose of the
> watchdog to report hidden bugs related to spinning loops and deadlocks.

Nothing in this Lua code should have blocking calls at all. The Lua code
has zero calls to external services, no sockets, no sleeps, no print, no
Map.new (a single call in the Lua startup, not inside any applet or fetch),
no usage of other packages, no file handling, no other IO.

I'm hoping I can get $work to agree to fully open-source the Lua, so you
can see this fact and review the code to confirm that it SHOULD be
non-blocking.

> However, I may be wrong. It may just be a contention problem because you
> are executing lua with 64 threads and a huge workload. In this case, you
> may give the 2.4 (under development) a try. There is a way to have a
> separate lua context for each thread, loading the scripts with the
> "lua-load-per-thread" directive. Out of curiosity, on the 1.8, are you
> running HAProxy with several threads or are you spawning several
> processes?
nbthread=64, nbproc=1 on both 1.8/2.x

Yes, we're hoping to try 2.4.x, just working on some parts to get there.

-- 
Robin Hugh Johnson
E-Mail : robb...@orbis-terrarum.net
Home Page : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
On 09/04/2021 at 19:26, Robin H. Johnson wrote:
> Hi,
>
> Maciej had said they were going to create a new thread, but I didn't see
> one yet.
>
> I want to start by noting the problem was much worse on 2.2.8 & 2.2.9,
> and that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
> thanks for that initial work in fixing the issue.
>
> As I mentioned in my other mail asking for a 1.8.30 release, we're
> experiencing this problem in DigitalOcean's HAProxy instances used to run
> the Spaces product. I've been trying to dig out deeper detail as well
> with a debug threads version, but I have the baseline error output from
> 2.3.9 here to share, after passing redaction review. This dump was
> generated with vbernat's PPA version of 2.3.9, not any internal builds.
>
> We have struggled to reproduce the problem in testing environments; it
> only turns up in the biggest regions, and plotting occurrences of the
> issue over the time dimension suggests that it might have some partial
> correlation with a weird workload input.
>
> The dumps do suggest Lua is implicated as well, and we've got some
> extensive Lua code, so it's impossible to rule it out as contributing to
> the problem (we have been discussing plans to move it to SPOA instead).
> The Lua code in question hasn't changed significantly in nearly 6 months,
> and it was problem-free on the 1.8 series (having a test suite for the
> Lua code has been invaluable).

Hi,

It seems you have a blocking call in one of your lua scripts. The threads
dump shows many threads blocked in hlua_ctx_init. Many others are executing
lua. Unfortunately, for an unknown reason, there is no stack traceback.

For the 2.3 and prior, the lua scripts are executed under a global lock.
Thus blocking calls in a lua script are awful, because they do not block
only one thread but all the others too. I guess the same issue exists on
the 1.8, but there is no watchdog on this version. Thus, from time to time,
HAProxy hangs and may report huge latencies but, in the end, it recovers
and continues to process data.
It is exactly the purpose of the watchdog to report hidden bugs related to
spinning loops and deadlocks.

However, I may be wrong. It may just be a contention problem because you
are executing lua with 64 threads and a huge workload. In this case, you
may give the 2.4 (under development) a try. There is a way to have a
separate lua context for each thread, loading the scripts with the
"lua-load-per-thread" directive. Out of curiosity, on the 1.8, are you
running HAProxy with several threads or are you spawning several processes?

-- 
Christopher Faulet
Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
Hi Robin,

On Fri, 9.04.2021 at 19:26, Robin H. Johnson wrote:
> Maciej had said they were going to create a new thread, but I didn't see
> one yet.
>
> I want to start by noting the problem was much worse on 2.2.8 & 2.2.9,
> and that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
> thanks for that initial work in fixing the issue.

After Christopher's patch, which uses another function (one that does not
allocate memory) to dump the Lua stack trace, the situation is much better.
It used 100% CPU only once, and only on one thread; however, HAProxy didn't
kill itself like in your situation. Since then the issue has not occurred
again, so I don't have anything useful to share. :(

I'm using HAProxy 2.2.11 but will upgrade next week to 2.3; I'll report if
anything interesting pops out.

> We have struggled to reproduce the problem in testing environments; it
> only turns up in the biggest regions, and plotting occurrences of the
> issue over the time dimension suggests that it might have some partial
> correlation with a weird workload input.

On our side the issue occurred on a machine with an unusual workload:
hundreds of thousands of TCP connections and very few HTTP
requests/responses. We also use lua.

Kind regards,
Maciej
Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)
Hi,

Maciej had said they were going to create a new thread, but I didn't see
one yet.

I want to start by noting the problem was much worse on 2.2.8 & 2.2.9, and
that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big thanks
for that initial work in fixing the issue.

As I mentioned in my other mail asking for a 1.8.30 release, we're
experiencing this problem in DigitalOcean's HAProxy instances used to run
the Spaces product. I've been trying to dig out deeper detail as well with
a debug threads version, but I have the baseline error output from 2.3.9
here to share, after passing redaction review. This dump was generated with
vbernat's PPA version of 2.3.9, not any internal builds.

We have struggled to reproduce the problem in testing environments; it only
turns up in the biggest regions, and plotting occurrences of the issue over
the time dimension suggests that it might have some partial correlation
with a weird workload input.

The dumps do suggest Lua is implicated as well, and we've got some
extensive Lua code, so it's impossible to rule it out as contributing to
the problem (we have been discussing plans to move it to SPOA instead). The
Lua code in question hasn't changed significantly in nearly 6 months, and
it was problem-free on the 1.8 series (having a test suite for the Lua code
has been invaluable).

-- 
Robin Hugh Johnson
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

20210409_haproxy-crashlogs.REDACTED.txt.gz
Description: Binary data