Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Robin H. Johnson
On Thu, Apr 15, 2021 at 07:53:15PM +, Robin H. Johnson wrote:
> But your thought of CPU pinning was good.
> I went to confirm it in the host, and I'm not certain if the cpu-map is
> working right.
Ignore me, long day and I didn't think to check each thread PID:

# ps -e -T | grep haproxy -w |awk '{print $2}' |xargs -n1 taskset -pc
pid 120689's current affinity list: 0-127
pid 120702's current affinity list: 0
pid 120706's current affinity list: 1
..
pid 120776's current affinity list: 63

-- 
Robin Hugh Johnson
E-Mail : robb...@orbis-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Robin H. Johnson
On Thu, Apr 15, 2021 at 09:23:07AM +0200, Willy Tarreau wrote:
> On Thu, Apr 15, 2021 at 07:13:53AM +, Robin H. Johnson wrote:
> > Thanks; I will need to catch it faster or automate this, because the
> watchdog does a MUCH better job restarting it than before: less than 30
> > seconds of 100% CPU before the watchdog reliably kills it.
> I see. Then collecting the watchdog outputs can be instructive to see
> if it always happens at the same place or not. And the core dumps will
> indicate what all threads were doing (and if some were competing on a
> lock for example).
The truncation in the log output for the crash was interesting; I couldn't see
why it was being cut off. I wish we could get a clean reproduction in our
testing environment, because the production core dumps absolutely have customer
data in them.
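
(As a side note, a rough sketch of how core dumps could be captured for a
systemd-managed HAProxy; the unit name, paths and threshold here are
assumptions, and the "set-dumpable" keyword should be checked against the
documentation for your build.)

  # let the kernel write cores somewhere predictable
  mkdir -p /var/crash
  sysctl -w kernel.core_pattern='/var/crash/core.%e.%p.%t'

  # lift the core size limit for the service (add LimitCORE=infinity
  # under [Service] in the drop-in)
  systemctl edit haproxy.service
  systemctl restart haproxy.service

  # in haproxy.cfg, the global keyword "set-dumpable" re-enables core dumps
  # that would otherwise be suppressed after the privilege drop / chroot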

> > Varnish runs on the same host and is used to cache some of the backends.
> Plenty of free memory at the moment.
> 
> I'm now thinking about something. Do you have at least as many CPUs as the
> total number of threads used by haproxy and varnish? Otherwise there will
> be some competition and migrations will happen. If neither is bound, you
> can even end up with two haproxy threads forced to run on the same CPU,
> which is the worst situation as one could be scheduled out with a lock
> held and the other one spinning waiting for this lock.
Single socket, AMD EPYC 7702P 64-Core Processor, 128 threads!
Shows as single NUMA node in our present configuration.
Hopefully the kernel is mostly doing the right thing, but read on.
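
(On EPYC the exposed NUMA layout depends on the BIOS "NPS" setting, so it may
be worth double-checking what the kernel actually sees; a quick sketch,
assuming the numactl package is installed:)

  lscpu | grep -E 'Socket|NUMA'
  numactl --hardware    # lists each node and the CPUs/memory attached to it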

HAProxy already pinned to the first 64 threads with:
cpu-map 1/1 0
...
cpu-map 1/64 63
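
(For reference, recent 2.x releases also accept a range form that should be
equivalent to the 64 explicit lines above; a sketch, to be checked against the
cpu-map documentation for your version:)

cpu-map auto:1/1-64 0-63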

Varnish isn't explicitly pinned right now, but it uses less than 200% CPU
overall (we know most requests aren't cacheable, so they don't get routed to
Varnish at all).

But your thought of CPU pinning was good.
I went to confirm it in the host, and I'm not certain if the cpu-map is
working right.

# pid_haproxy_leader=68839 ; pid_haproxy_follower=68848 
# taskset -pc $pid_haproxy_leader
pid 68839's current affinity list: 0-127
# taskset -pc $pid_haproxy_follower
pid 68848's current affinity list: 0

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Willy Tarreau
On Thu, Apr 15, 2021 at 07:13:53AM +, Robin H. Johnson wrote:
> Thanks; I will need to catch it faster or automate this, because the
> watchdog does a MUCH better job restarting it than before: less than 30
> seconds of 100% CPU before the watchdog reliably kills it.

I see. Then collecting the watchdog outputs can be instructive to see
if it always happens at the same place or not. And the core dumps will
indicate what all threads were doing (and if some were competing on a
lock for example).

> >   - please also check if your machine is not swapping, as this can have
> > terrible consequences and could explain why it only happens on certain
> > machines dealing with certain workloads. I remember having spent several
> > weeks many years ago chasing a response time issue happening only in the
> > morning, which was in fact caused by the log upload batch having swapped
> > and left a good part of the unused memory in the swap, making it very
> > difficult for the network stack to allocate buffers during send() and
> > recv(), thus causing losses and retransmits as the load grew. This was
> > never reproducible in the lab because we were not syncing logs :-)
> 512GiB RAM and no swap configured on the system at all.

:-)

> Varnish runs on the same host and is used to cache some of the backends.
> Plenty of free memory at the moment.

I'm now thinking about something. Do you have at least as many CPUs as the
total number of threads used by haproxy and varnish? Otherwise there will
be some competition and migrations will happen. If neither is bound, you
can even end up with two haproxy threads forced to run on the same CPU,
which is the worst situation as one could be scheduled out with a lock
held and the other one spinning waiting for this lock.

With half a TB of RAM I guess you have multiple sockets. Could you at
least pin haproxy to the CPUs of a single socket (that's the bare minimum
to do to preserve performance, as atomics and locks over UPI/QPI are a
disaster), and ideally pin varnish to another socket?

Or maybe just enable fewer threads for haproxy if you don't need that many,
and make sure the CPUs it's bound to are not used by varnish?

In such a setup combining several high-performance processes, it's really
important to reserve resources for them, and you must also account for the
number of CPUs needed to deal with network interrupts (and likely for disk
if varnish uses it). Once your resources are cleanly reserved,
you'll get the maximum performance with the lowest latency.
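
(For illustration, a sketch of what such a split could look like on this
128-thread box, with haproxy staying on CPUs 0-63 and varnish plus NIC
interrupts pushed to 64-127; the unit name, interface name and IRQ numbers
below are assumptions:)

  # pin varnish via a systemd drop-in (systemctl edit varnish.service):
  #   [Service]
  #   CPUAffinity=64-127

  # see which CPUs currently take the NIC's interrupts, then steer them away
  # (irqbalance may need to be stopped or configured with a ban mask):
  grep eth0 /proc/interrupts
  echo 64-127 > /proc/irq/<irq>/smp_affinity_list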

Willy



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Robin H. Johnson
On Thu, Apr 15, 2021 at 08:59:35AM +0200, Willy Tarreau wrote:
> On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > > nbthread=64, nbproc=1 on both 1.8/2.x
> > 
> > It is thus surprising, if it is really a contention issue, that you never
> > observed slowdowns on the 1.8. There is no watchdog, but the thread
> > implementation is a bit awkward on the 1.8. 2.x is better on this point,
> > the best being the 2.4.
> 
> Agreed, I'd even say that 64 threads in 1.8 should be way slower than
> a single thread.
> 
> A few things are worth having a look at, Robin:
>   - please run "perf top" when this happens, and when you see a function
> reporting a high usage, do not hesitate to navigate through it, pressing
> enter, and "annotate function <name>". Then scrolling through it will
> reveal the percentage of time certain points were hit. It's very possible
> you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
> that there is a logic error somewhere. Usually if you find a single one
> you'll end up spotting a lock.
Thanks; I will need to catch it faster or automate this, because the
watchdog does a MUCH better job restarting it than before: less than 30
seconds of 100% CPU before the watchdog reliably kills it.
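
(One way to automate that catch, as a rough sketch only: poll the worker's CPU
usage from /proc and fire perf record as soon as it spikes. The threshold,
interval and output directory below are arbitrary assumptions.)

  #!/bin/bash
  OUT=/var/tmp/haproxy-perf; mkdir -p "$OUT"
  TICKS=$(getconf CLK_TCK)
  while :; do
      # newest haproxy PID, i.e. the worker in master-worker mode (assumption)
      pid=$(pgrep -n -x haproxy) || { sleep 5; continue; }
      t1=$(awk '{print $14+$15}' /proc/$pid/stat 2>/dev/null)  # utime+stime, in ticks
      sleep 1
      t2=$(awk '{print $14+$15}' /proc/$pid/stat 2>/dev/null)
      [ -n "$t1" ] && [ -n "$t2" ] || continue
      cores=$(( (t2 - t1) / TICKS ))          # ~number of cores busy over 1s
      if [ "$cores" -ge 10 ]; then            # e.g. 10+ cores spinning
          perf record -g -p "$pid" -o "$OUT/perf.$(date +%s).data" -- sleep 10
      fi
      sleep 5
  done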

>   - please also check if your machine is not swapping, as this can have
> terrible consequences and could explain why it only happens on certain
> machines dealing with certain workloads. I remember having spent several
> weeks many years ago chasing a response time issue happening only in the
> morning, which was in fact caused by the log upload batch having swapped
> and left a good part of the unused memory in the swap, making it very
> difficult for the network stack to allocate buffers during send() and
> recv(), thus causing losses and retransmits as the load grew. This was
> never reproducible in the lab because we were not syncing logs :-)
512GiB RAM and no swap configured on the system at all.
Varnish runs on the same host and is used to cache some of the backends.
Plenty of free memory at the moment.

-- 
Robin Hugh Johnson
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Willy Tarreau
On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > nbthread=64, nbproc=1 on both 1.8/2.x
> 
> It is thus surprising, if it is really a contention issue, that you never
> observed slowdowns on the 1.8. There is no watchdog, but the thread
> implementation is a bit awkward on the 1.8. 2.x is better on this point,
> the best being the 2.4.

Agreed, I'd even say that 64 threads in 1.8 should be way slower than
a single thread.

A few things are worth having a look at, Robin:
  - please run "perf top" when this happens, and when you see a function
reporting a high usage, do not hesitate to navigate through it, pressing
enter, and "annotate function <name>". Then scrolling through it will
reveal the percentage of time certain points were hit. It's very possible
you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
that there is a logic error somewhere. Usually if you find a single one
you'll end up spotting a lock.

  - please also check if your machine is not swapping, as this can have
terrible consequences and could explain why it only happens on certain
machines dealing with certain workloads. I remember having spent several
weeks many years ago chasing a response time issue happening only in the
morning, which was in fact caused by the log upload batch having swapped
and left a good part of the unused memory in the swap, making it very
difficult for the network stack to allocate buffers during send() and
recv(), thus causing losses and retransmits as the load grew. This was
never reproducible in the lab because we were not syncing logs :-)

Willy



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-14 Thread Christopher Faulet

On 10/04/2021 at 00:34, Robin H. Johnson wrote:
> On Fri, Apr 09, 2021 at 10:14:26PM +0200, Christopher Faulet wrote:
> > It seems you have a blocking call in one of your lua scripts. The threads dump
> > shows many threads blocked in hlua_ctx_init. Many others are executing Lua.
> > Unfortunately, for an unknown reason, there is no stack traceback.
> All of our Lua is string handling. Parsing headers, and then building
> TXN variables as well as a set of applets that return responses in cases
> where we don't want to go to a backend server as the response is simple
> enough to generate inside the LB.
> 
> > For the 2.3 and prior, the lua scripts are executed under a global lock. Thus
> > blocking calls in a lua script are awful, because they do not block only one
> > thread but all the others too. I guess the same issue exists on the 1.8, but
> > there is no watchdog on this version. Thus, from time to time HAProxy hangs
> > and may report huge latencies, but in the end it recovers and continues to
> > process data. It is exactly the purpose of the watchdog to report hidden
> > bugs related to spinning loops and deadlocks.
> Nothing in this Lua code should have blocking calls at all.
> The Lua code has zero calls to external services, no sockets, no sleeps,
> no print, no Map.new (single call in the Lua startup, not inside any
> applet or fetch), no usage of other packages, no file handling, no other
> IO.
> 
> I'm hoping I can get $work to agree to fully open-source the Lua, so you
> can see this fact and review the code to confirm that it SHOULD be
> non-blocking.

I trust you on this point. If you are not using external components, it should
indeed be OK. So, it is probably a contention issue on the global Lua lock. If
you are able to generate and inspect a core file, it should help you figure
out what really happens.

> > However, I may be wrong. It may be just a contention problem because you are
> > executing Lua with 64 threads and a huge workload. In this case, you may give
> > the 2.4 (under development) a try. There is a way to have a separate Lua
> > context for each thread by loading the scripts with the "lua-load-per-thread"
> > directive. Out of curiosity, on the 1.8, are you running HAProxy with several
> > threads or are you spawning several processes?
> nbthread=64, nbproc=1 on both 1.8/2.x

It is thus surprising, if it is really a contention issue, that you never
observed slowdowns on the 1.8. There is no watchdog, but the thread
implementation is a bit awkward on the 1.8. 2.x is better on this point, the
best being the 2.4.

> Yes, we're hoping to try 2.4.x, just working on some parts to get there.

--
Christopher Faulet



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Robin H. Johnson
On Fri, Apr 09, 2021 at 10:14:26PM +0200, Christopher Faulet wrote:
> It seems you have a blocking call in one of your lua scripts. The threads dump
> shows many threads blocked in hlua_ctx_init. Many others are executing Lua.
> Unfortunately, for an unknown reason, there is no stack traceback.
All of our Lua is string handling. Parsing headers, and then building
TXN variables as well as a set of applets that return responses in cases
where we don't want to go to a backend server as the response is simple
enough to generate inside the LB.

> For the 2.3 and prior, the lua scripts are executed under a global lock. Thus
> blocking calls in a lua script are awful, because they do not block only one
> thread but all the others too. I guess the same issue exists on the 1.8, but
> there is no watchdog on this version. Thus, from time to time HAProxy hangs
> and may report huge latencies, but in the end it recovers and continues to
> process data. It is exactly the purpose of the watchdog to report hidden
> bugs related to spinning loops and deadlocks.
Nothing in this Lua code should have blocking calls at all.
The Lua code has zero calls to external services, no sockets, no sleeps,
no print, no Map.new (single call in the Lua startup, not inside any
applet or fetch), no usage of other packages, no file handling, no other
IO.

I'm hoping I can get $work to agree to fully open-source the Lua, so you
can see this fact and review the code to confirm that it SHOULD be
non-blocking.

> However, I may be wrong. It may be just a contention problem because you are
> executing Lua with 64 threads and a huge workload. In this case, you may give
> the 2.4 (under development) a try. There is a way to have a separate Lua
> context for each thread by loading the scripts with the "lua-load-per-thread"
> directive. Out of curiosity, on the 1.8, are you running HAProxy with several
> threads or are you spawning several processes?
nbthread=64, nbproc=1 on both 1.8/2.x

Yes, we're hoping to try 2.4.x, just working on some parts to get there.
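
(For reference, a sketch of what that 2.4 change could look like in the global
section; the file path is made up for illustration, and the usual caveat
applies that per-thread Lua states don't share any data between threads:)

global
    nbthread 64
    # 2.3 and earlier: one Lua state shared by all threads behind a global lock
    #lua-load /etc/haproxy/lua/spaces.lua
    # 2.4: one Lua state per thread, removing that global lock for these scripts
    lua-load-per-thread /etc/haproxy/lua/spaces.lua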

-- 
Robin Hugh Johnson
E-Mail : robb...@orbis-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Christopher Faulet

On 09/04/2021 at 19:26, Robin H. Johnson wrote:
> Hi,
> 
> Maciej had said they were going to create a new thread, but I didn't see
> one yet.
> 
> I want to start by noting the problem was much worse on 2.2.8 & 2.2.9, and
> that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
> thanks for that initial work in fixing the issue.
> 
> As I mentioned in my other mail asking for a 1.8.30 release, we're
> experiencing this problem in DigitalOcean's HAProxy instances used to
> run the Spaces product.
> 
> I've been trying to dig out deeper detail as well with a debug threads
> version, but I have the baseline error output from 2.3.9 here to share,
> after passing redaction review. This dump was generated with vbernat's
> PPA version of 2.3.9, not any internal builds.
> 
> We have struggled to reproduce the problem in testing environments; it
> only turns up in the biggest regions, and plotting occurrences of the
> issue over time suggests that it might have some partial correlation
> with a weird workload input.
> 
> The dumps do suggest Lua is implicated as well, and we've got some
> extensive Lua code, so it's impossible to rule it out as contributing to
> the problem (We have been discussing plans to move it to SPOA instead).
> 
> The Lua code in question hasn't changed significantly in nearly 6
> months, and it was problem-free on the 1.8 series (having a test suite
> for the Lua code has been invaluable).

Hi,

It seems you have a blocking call in one of your lua scripts. The threads dump
shows many threads blocked in hlua_ctx_init. Many others are executing Lua.
Unfortunately, for an unknown reason, there is no stack traceback.

For the 2.3 and prior, the lua scripts are executed under a global lock. Thus
blocking calls in a lua script are awful, because they do not block only one
thread but all the others too. I guess the same issue exists on the 1.8, but
there is no watchdog on this version. Thus, from time to time HAProxy hangs
and may report huge latencies, but in the end it recovers and continues to
process data. It is exactly the purpose of the watchdog to report hidden
bugs related to spinning loops and deadlocks.

However, I may be wrong. It may be just a contention problem because you are
executing Lua with 64 threads and a huge workload. In this case, you may give
the 2.4 (under development) a try. There is a way to have a separate Lua
context for each thread by loading the scripts with the "lua-load-per-thread"
directive. Out of curiosity, on the 1.8, are you running HAProxy with several
threads or are you spawning several processes?


--
Christopher Faulet



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Maciej Zdeb
Hi Robin,

On Fri, 9 Apr 2021 at 19:26, Robin H. Johnson wrote:

> Maciej had said they were going to create a new thread, but I didn't see
> one yet.
>
> I want to start by noting the problem was much worse on 2.2.8 & 2.2.9, and
> that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
> thanks for that initial work in fixing the issue.
>

After Christopher's patch to use another function (one that does not allocate
memory) to dump the Lua stack trace, the situation is much better. It used 100%
CPU only once, and only on one thread; however, HAProxy didn't kill itself like
in your situation. Since then the issue has not occurred again, so I don't have
anything useful to share. :(

I'm using HAProxy 2.2.11 but will upgrade to 2.3 next week; I'll report if
anything interesting pops up.


> We have struggled to reproduce the problem in testing environments; it
> only turns up in the biggest regions, and plotting occurrences of the
> issue over time suggests that it might have some partial correlation
> with a weird workload input.
>

On our side the issue occurred on a machine with an unusual workload: hundreds
of thousands of TCP connections and very few HTTP requests/responses. We also
use Lua.
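
(In case it helps correlate workloads: the runtime API can show those numbers
directly; a sketch, assuming the stats socket lives at /var/run/haproxy.sock:)

  echo "show info" | socat stdio /var/run/haproxy.sock | grep -E 'CurrConns|ConnRate|SessRate'
  echo "show activity" | socat stdio /var/run/haproxy.sock   # per-thread counters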

Kind regards,
Maciej


Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Robin H. Johnson
Hi,

Maciej had said they were going to create a new thread, but I didn't see
one yet.

I want to start by noting the problem was much worse on 2.2.8 & 2.2.9, and
that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
thanks for that initial work in fixing the issue.

As I mentioned in my other mail asking for a 1.8.30 release, we're
experiencing this problem in DigitalOcean's HAProxy instances used to
run the Spaces product.

I've been trying to dig out deeper detail as well with a debug threads
version, but I have the baseline error output from 2.3.9 here to share,
after passing redaction review. This dump was generated with vbernat's
PPA version of 2.3.9, not any internal builds.

We have struggled to reproduce the problem in testing environments; it
only turns up in the biggest regions, and plotting occurrences of the
issue over time suggests that it might have some partial correlation
with a weird workload input.

The dumps do suggest Lua is implicated as well, and we've got some
extensive Lua code, so it's impossible to rule it out as contributing to
the problem (We have been discussing plans to move it to SPOA instead).

The Lua code in question hasn't changed significantly in nearly 6
months, and it was problem-free on the 1.8 series (having a test suite
for the Lua code has been invaluable).

-- 
Robin Hugh Johnson
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


20210409_haproxy-crashlogs.REDACTED.txt.gz
Description: Binary data

