Re: HAProxy performance on OpenBSD

2023-01-24 Thread Willy Tarreau
On Wed, Jan 25, 2023 at 12:04:14AM +0100, Olivier Houchard wrote:
> >   0x0af892c770b0 :   mov    %r12,%rdi
> >   0x0af892c770b3 :   callq  0xaf892c24e40
> >
> >   0x0af892c770b8 :   mov    %rax,%r12
> >   0x0af892c770bb :   test   %rax,%rax
> >   0x0af892c770be :   je     0xaf892c770e5
> >
> >   0x0af892c770c0 :   mov    0x18(%r12),%rax
> > =>0x0af892c770c5 :   mov    0xa0(%rax),%r11
> >   0x0af892c770cc :   test   %r11,%r11
> >   0x0af892c770cf :   je     0xaf892c770b0
> >
> >   0x0af892c770d1 :   mov    %r12,%rdi
> >   0x0af892c770d4 :   mov    %r13d,%esi
> > 
> >   1229  conn = srv_lookup_conn(&srv->per_thr[i].safe_conns, hash);
> >   1230  while (conn) {
> >   1231==>>> if (conn->mux->takeover && 
> > conn->mux->takeover(conn, i) == 0) {
> > ^^^
> > 
> > It's here. %rax==0, which is conn->mux when we're trying to dereference
> > it to retrieve ->takeover. I can't see how it's possible to have a NULL
> > mux here in an idle connection since they're placed there by the muxes
> > themselves. But maybe there's a tiny race somewhere that's impossible to
> > reach except under such an extreme contention between the two sockets.
> > I really have no idea regarding this one for now, maybe Olivier could
> > spot a race here ? Maybe we're missing a barrier somewhere.
>  
> One way I can see that happening, as unlikely as it may be, is if
> h1_takeover() is called, but fails to allocate a task or a tasklet, and
> will call h1_release(), which will set the mux to NULL and call
> conn_free().

:-)  thanks for the analysis!

I would find it strange for the task allocation to fail, as it would
indicate the machine is lacking RAM. Or maybe, in order to avoid extreme
lock contention in malloc() on this platform when running on non-uniform
machines, they decided to use trylocks that can simply fail to return
memory instead of waiting forever. In that case we should be able to
reproduce the same issue by setting the fail-alloc fault injection rate
to a non-zero value.
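For reference, that fault injection is controlled by the tune.fail-alloc
global setting; a minimal sketch, assuming a build with DEBUG_FAIL_ALLOC
(or a recent version started with -dMfail):

```
global
    # Make roughly 2% of pool allocations fail on purpose, to exercise
    # failure paths such as a task allocation failing in h1_takeover().
    # The 2% rate is an arbitrary example value.
    tune.fail-alloc 2
```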

> It was previously ok to happen, because conn_free() would
> unconditionally remove the connection from any list, but since the idle
> connections now stay in a tree, and no longer in a list, that no longer
> happens.

I'm not sure why this has to be different, we could as well decide to
unconditionally remove it from the tree. I'll have a look and try to
figure out why.

> Another possibility of h1_release() being called while still in the
> idle tree is if h1_wake() gets called, but the only way I see that
> happening is from mux_stopping_process(), whatever that is, so that's
> unlikely to be the problem in this case.

Indeed. It's still good to keep this in mind though.

> Anyway, I suggest doing something along the line of the patch attached,
> and add a BUG_ON() in conn_free() to catch any freeing of connection
> still in the idle tree.

I agree, better crash where the problem is caused than where it hurts.
I'll take your patch, thanks!

Willy



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Willy Tarreau
On Tue, Jan 24, 2023 at 11:59:16PM -0600, Marc West wrote:
> On 2023-01-24 23:04:14, Olivier Houchard wrote:
> > On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote:
> > > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > > > > Stupid question but I prefer to ask in order to be certain, are all of
> > > > > these 32 threads located on the same physical CPU ? I just want to be
> > > > > sure that locks (kernel or user) are not traveling between multiple CPU
> > > > > sockets, as that ruins scalability.
> > > > 
> > > > Very good observation. This is a 2 socket system, 20 cores per socket,
> > > > and since there is no affinity in OpenBSD unfortunately the 32 threads
> > > > are not on the same physical CPU.
> > > 
> > > Ah, then that's particularly bad. When you're saying "there is no
> > > affinity", do you mean the OS doesn't implement anything or only that
> > > haproxy doesn't support it ? Could you please check if you have man
> > > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
> > > is the FreeBSD version and the second the Linux version.
> > 
> > Unfortunately, having a quick look at OpenBSD's sources, I don't think
> > they provide anything that could look like that. In fact, I'm not sure
> > there is any way to restrict a process to a specific CPU.
> 
> Unfortunately that is what I found also: no references to affinity in
> the man pages, nor any tools like taskset or NUMA-related knobs.
> It seems to be a design decision.
> 
> https://marc.info/?l=openbsd-misc&m=152507006602422&w=2
> https://marc.info/?l=openbsd-tech&m=133909957708933&w=2
> https://marc.info/?l=openbsd-misc&m=135884346431916&w=2

Then it's really pathetic, because implementing multi-CPU support
without being able to pin tasks to CPUs is counter-productive. I know
that their focus is primarily on code simplicity and auditability but
they already lost half of it by supporting SMP, and not doing it right
only makes the situation worse, and even opens the way to trivial DoS
attacks on running processes. Sadly the response to that message from
Ilya in the last comment you pointed above says it all:

  "The scheduler will know what CPUs are busy and which ones are
   not and will make appropriate decisions."

That's of course an indication of total ignorance of how hardware
works, since the right CPU to use is not directly related to the
ones that are busy or not, but first and foremost to the cost of
the communications with other threads/processes/devices that is
related to the distance between them, and the impacts of cache
line migration. That's precisely why operating systems supporting
multi-processing all offer the application the ability to set its
affinity based on what it knows about its workload. And seeing that
it was already required 10 years ago and still not addressed just
shows a total lack of interest for multicore machines.

> I thought there might be a way to enable just the CPUs that are on the
> first package via the kernel config mechanism but attempting that
> resulted in an immediate reboot. It looks like some kernel hacking would
> be needed to do this. There are no options in the BIOS to disable one of
> the sockets unfortunately and I don't have any single-socket machines to
> test with (aside from a laptop).

Yeah that's annoying. And you're not necessarily interested in running
that OS inside a virtual machine managed by a more capable operating
system.

> Given that, I think we are going to have to plan to switch OS to
> FreeBSD for now.

Yeah in your case I think that's the best plan. In addition, there's a
large community using haproxy on FreeBSD, and it's generally agreed that
it's the second operating system working the best with haproxy. Like
Linux it remains a modern and featureful OS with a powerful network
stack, and I think you shouldn't have any issues seeking help about it
if needed.

> OpenBSD is a great OS and a joy to work with but for this
> particular use case we do need to handle much more HTTPS traffic in
> production.

I agree. I've used it for a long time and really loved its light weight,
simplicity and reliability. It has been working flawlessly for something
like 10 years on my VAX with 24 MB RAM. I had to stop due to disks failing
repeatedly. But while I think that running it on a Geode-LX based firewall
or VPN can still make sense nowadays for an ADSL line, I also think it's
about time to abandon designs stuck in the 90s that try too hard to run
correctly on hardware designed in 2020+ with totally different concepts.
As a rule of thumb it should probably not be installed on a machine which
possesses a fan.

> Thanks again for all the replies and I will have this hardware available
> for the foreseeable future in case there is any more testing that would
> be helpful.

You're welcome, and thanks for sharing your experience, it's always
useful to many others!

Willy



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Marc West
On 2023-01-24 23:04:14, Olivier Houchard wrote:
> On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote:
> > On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > > > Stupid question but I prefer to ask in order to be certain, are all of
> > > > these 32 threads located on the same physical CPU ? I just want to be
> > > > sure that locks (kernel or user) are not traveling between multiple CPU
> > > > sockets, as that ruins scalability.
> > > 
> > > Very good observation. This is a 2 socket system, 20 cores per socket,
> > > and since there is no affinity in OpenBSD unfortunately the 32 threads
> > > are not on the same physical CPU.
> > 
> > Ah, then that's particularly bad. When you're saying "there is no
> > affinity", do you mean the OS doesn't implement anything or only that
> > haproxy doesn't support it ? Could you please check if you have man
> > pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
> > is the FreeBSD version and the second the Linux version.
> 
> Unfortunately, having a quick look at OpenBSD's sources, I don't think
> they provide anything that could look like that. In fact, I'm not sure
> there is any way to restrict a process to a specific CPU.

Unfortunately that is what I found also: no references to affinity in
the man pages, nor any tools like taskset or NUMA-related knobs.
It seems to be a design decision.

https://marc.info/?l=openbsd-misc&m=152507006602422&w=2
https://marc.info/?l=openbsd-tech&m=133909957708933&w=2
https://marc.info/?l=openbsd-misc&m=135884346431916&w=2

I thought there might be a way to enable just the CPUs that are on the
first package via the kernel config mechanism but attempting that
resulted in an immediate reboot. It looks like some kernel hacking would
be needed to do this. There are no options in the BIOS to disable one of
the sockets unfortunately and I don't have any single-socket machines to
test with (aside from a laptop).

Given that, I think we are going to have to plan to switch OS to FreeBSD
for now. OpenBSD is a great OS and a joy to work with but for this
particular use case we do need to handle much more HTTPS traffic in
production.

Thanks again for all the replies and I will have this hardware available
for the foreseeable future in case there is any more testing that would
be helpful.



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Olivier Houchard
On Tue, Jan 24, 2023 at 11:05:37PM +0100, Willy Tarreau wrote:
> On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > > Stupid question but I prefer to ask in order to be certain, are all of
> > > these 32 threads located on the same physical CPU ? I just want to be
> > > sure that locks (kernel or user) are not traveling between multiple CPU
> > > sockets, as that ruins scalability.
> > 
> > Very good observation. This is a 2 socket system, 20 cores per socket,
> > and since there is no affinity in OpenBSD unfortunately the 32 threads
> > are not on the same physical CPU.
> 
> Ah, then that's particularly bad. When you're saying "there is no
> affinity", do you mean the OS doesn't implement anything or only that
> haproxy doesn't support it ? Could you please check if you have man
> pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
> is the FreeBSD version and the second the Linux version.

Unfortunately, having a quick look at OpenBSD's sources, I don't think
they provide anything that could look like that. In fact, I'm not sure
there is any way to restrict a process to a specific CPU.

[...]

> Oh thank you very much. From what I'm seeing we're here:
> 
>   0x0af892c770b0 :   mov    %r12,%rdi
>   0x0af892c770b3 :   callq  0xaf892c24e40
> 
>   0x0af892c770b8 :   mov    %rax,%r12
>   0x0af892c770bb :   test   %rax,%rax
>   0x0af892c770be :   je     0xaf892c770e5
> 
>   0x0af892c770c0 :   mov    0x18(%r12),%rax
> =>0x0af892c770c5 :   mov    0xa0(%rax),%r11
>   0x0af892c770cc :   test   %r11,%r11
>   0x0af892c770cf :   je     0xaf892c770b0
> 
>   0x0af892c770d1 :   mov    %r12,%rdi
>   0x0af892c770d4 :   mov    %r13d,%esi
> 
>   1229  conn = srv_lookup_conn(&srv->per_thr[i].safe_conns, hash);
>   1230  while (conn) {
>   1231==>>> if (conn->mux->takeover && 
> conn->mux->takeover(conn, i) == 0) {
> ^^^
> 
> It's here. %rax==0, which is conn->mux when we're trying to dereference
> it to retrieve ->takeover. I can't see how it's possible to have a NULL
> mux here in an idle connection since they're placed there by the muxes
> themselves. But maybe there's a tiny race somewhere that's impossible to
> reach except under such an extreme contention between the two sockets.
> I really have no idea regarding this one for now, maybe Olivier could
> spot a race here ? Maybe we're missing a barrier somewhere.
 
One way I can see that happening, as unlikely as it may be, is if
h1_takeover() is called, but fails to allocate a task or a tasklet, and
will call h1_release(), which will set the mux to NULL and call
conn_free(). It was previously ok to happen, because conn_free() would
unconditionally remove the connection from any list, but since the idle
connections now stay in a tree, and no longer in a list, that no longer
happens.
Another possibility of h1_release() being called while still in the
idle tree is if h1_wake() gets called, but the only way I see that
happening is from mux_stopping_process(), whatever that is, so that's
unlikely to be the problem in this case.
Anyway, I suggest doing something along the line of the patch attached,
and add a BUG_ON() in conn_free() to catch any freeing of connection
still in the idle tree.

Olivier
From 2a0ac4b84b97aa05cd7befc5fcf45b03795f2e76 Mon Sep 17 00:00:00 2001
From: Olivier Houchard 
Date: Tue, 24 Jan 2023 23:59:32 +0100
Subject: [PATCH] MINOR: Add a BUG_ON() to detect destroying connection in idle
 list

Add a BUG_ON() in conn_free(), to check that when we're freeing a
connection, it is not still in the idle connections tree, otherwise the
next thread that will try to use it will probably crash.
---
 src/connection.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/connection.c b/src/connection.c
index 4a73dbcc8..97619ec26 100644
--- a/src/connection.c
+++ b/src/connection.c
@@ -498,6 +498,10 @@ void conn_free(struct connection *conn)
 	pool_free(pool_head_uniqueid, istptr(conn->proxy_unique_id));
 	conn->proxy_unique_id = IST_NULL;
 
+	/* Make sure the connection is not left in the idle connection tree */
+	if (conn->hash_node != NULL)
+		BUG_ON(conn->hash_node->node.node.leaf_p != NULL);
+
 	pool_free(pool_head_conn_hash_node, conn->hash_node);
 	conn->hash_node = NULL;
 
-- 
2.36.1



Re: HAProxy performance on OpenBSD

2023-01-24 Thread Willy Tarreau
On Tue, Jan 24, 2023 at 02:15:08PM -0600, Marc West wrote:
> > Stupid question but I prefer to ask in order to be certain, are all of
> > these 32 threads located on the same physical CPU ? I just want to be
> > sure that locks (kernel or user) are not traveling between multiple CPU
> > sockets, as that ruins scalability.
> 
> Very good observation. This is a 2 socket system, 20 cores per socket,
> and since there is no affinity in OpenBSD unfortunately the 32 threads
> are not on the same physical CPU.

Ah, then that's particularly bad. When you're saying "there is no
affinity", do you mean the OS doesn't implement anything or only that
haproxy doesn't support it ? Could you please check if you have man
pages about "cpuset_setaffinity" or "sched_setaffinity" ? The former
is the FreeBSD version and the second the Linux version.

> cpu0 reports as core 0 package 0, cpu1 core 0 package 1, and so on, so
> that the odd cpus are on one socket and the even ones on the other.
> Watching threads in top with a 1 sec interval I see them bouncing
> around a lot between sockets.

Yes that must be horrible. In fact, the worst possible scenario!

> > It's great already to be able to rule out that one. Another useful
> > test you could run is to place a reject rule at the connection level
> > in your frontend (it will happen before SSL tries to process traffic):
> > 
> >tcp-request connection reject
> > 
> > (don't do that in production of course). Once this is installed you
> > can compare again the CPU usage with no-thread, and 32-thread without
> > shards and 32 threads with 8 shards.
> 
> This is HTTPS with the reject rule:
> 
> - no nbthread, no shards:
> -- 40  CPUs:  0.3% user,  0.0% nice,  0.7% sys,  0.1% spin,  0.1% intr, 98.9% 
> idle
> -- haproxy peaked at 20% CPU
> 
> - nbthread 32, no shards:
> -- 40  CPUs:  0.5% user,  0.0% nice, 19.9% sys,  4.6% spin,  0.1% intr, 75.0% 
> idle
> -- haproxy peaked at 863% CPU
> 
> - nbthread 32, 8 shards:
> -- 40  CPUs:  0.4% user,  0.0% nice,  1.9% sys,  0.1% spin,  0.1% intr, 97.4% 
> idle
> -- haproxy peaked at 62% CPU
> -- this only used 5 threads according to top FYI 

OK, let's basically say it doesn't work very well... At least it makes
sense now.

> Since the SSL stack has been mentioned a couple times I retested with
> HTTP instead of HTTPS and the results are interesting:
> 
> - HTTP, no nbthread, no shards 
> -- 40  CPUs:  0.6% user,  0.0% nice,  2.0% sys,  0.2% spin,  0.2% intr, 97.0% 
> idle
> -- haproxy peaked at 54% CPU
> -- current conns = 475; current pipes = 0/0; conn rate = 0/sec; bit rate 
> 403.512 Mbps, Running tasks: 0/922; idle = 28 %
> -- First 3 seconds 7001, 14474, 22000 responses. 7-8k/sec throughout
> test. 100% success 0 fail
> 
> - HTTP, nbthread 32, no shards
> -- 40  CPUs:  0.7% user,  0.0% nice, 29.6% sys,  0.2% spin,  0.5% intr, 69.1% 
> idle
> -- haproxy peaked at 800% CPU
> -- current conns = 501; current pipes = 0/0; conn rate = 1/sec; bit rate 
> 410.711 Mbps, Running tasks: 28/967; idle = 64 %
> -- First 3 seconds 7000, 14500, 22002 responses. 7-8k/sec during test. 100% 
> success 0 fail
> 
> - HTTP, nbthread 32, 8 shards
> -- 40  CPUs:  0.7% user,  0.0% nice,  5.0% sys,  0.2% spin,  0.2% intr, 94.0% 
> idle
> -- haproxy peaked at 147% CPUs
> -- current conns = 500; current pipes = 0/0; conn rate = 1/sec; bit rate 
> 401.946 Mbps, Running tasks: 0/746; idle = 93 %
> -- First 3 seconds 7000, 14500, 22000 responses. 7-8k/sec during test. 100% 
> success 0 fail 
>
> Here we still see sys% growing with more threads but even with the waste,
> with HTTP we get a reliable 7-8k responses/sec and 100% success rate
> instead of 400-500/sec for a few second burst and then a total stall /
> nearly 0% success rate with HTTPS. I expect SSL to have some cost but 
> not quite this huge, and the "stall" of traffic/health checks under 
> heavy HTTPS load is a bit puzzling.

It's important to split the problems. Here clearly the test is limited
by the client to ~7.3k/s, but it's still sufficient to exacerbate the
threading issue with 32 threads and no shards. The SSL part is a second
problem, but since openssl uses a lot of locks, it could be a bigger
victim of the apparently poor thread implementation, explaining why it
doesn't scale at all.

> Do you think it would be worth trying to install OpenSSL 3.0.7 from
> ports and manually build haproxy against that to compare with the
> current LibreSSL 3.6.0? Or is the bottleneck likely somewhere else?

No, OpenSSL 3 is much much worse. It's blatant that its performance was
never tested outside of "openssl speed" before being released, because
its extreme abuse of locks makes it totally unusable for any usage in
a network-connected daemon :-(  LibreSSL 3.6 could work, but it could
as well be a victim of the threading problem.

Don't you have at least an equivalent of the linux taskset utility on
OpenBSD, to force all your threads on the same physical package ? Or
maybe you'll find some NUMA tools ? Or in the worst case, 

Re: HAProxy performance on OpenBSD

2023-01-24 Thread Marc West
On 2023-01-24 06:58:57, Willy Tarreau wrote:
> Hi Marc,

Hi Willy,

> See the difference ? There seems to be an insane FD locking cost on this
> system that simply wastes 40% of the CPU there. So I suspect that in your
> first tests you were stressing the locking while in the last ones you
> were stressing the SSL stack.
> 
> Stupid question but I prefer to ask in order to be certain, are all of
> these 32 threads located on the same physical CPU ? I just want to be
> sure that locks (kernel or user) are not traveling between multiple CPU
> sockets, as that ruins scalability.

Very good observation. This is a 2 socket system, 20 cores per socket,
and since there is no affinity in OpenBSD unfortunately the 32 threads
are not on the same physical CPU.

cpu0 reports as core 0 package 0, cpu1 core 0 package 1, and so on, so
that the odd cpus are on one socket and the even ones on the other.
Watching threads in top with a 1 sec interval I see them bouncing
around a lot between sockets.

> It's great already to be able to rule out that one. Another useful
> test you could run is to place a reject rule at the connection level
> in your frontend (it will happen before SSL tries to process traffic):
> 
>tcp-request connection reject
> 
> (don't do that in production of course). Once this is installed you
> can compare again the CPU usage with no-thread, and 32-thread without
> shards and 32 threads with 8 shards.

This is HTTPS with the reject rule:

- no nbthread, no shards:
-- 40  CPUs:  0.3% user,  0.0% nice,  0.7% sys,  0.1% spin,  0.1% intr, 98.9% 
idle
-- haproxy peaked at 20% CPU

- nbthread 32, no shards:
-- 40  CPUs:  0.5% user,  0.0% nice, 19.9% sys,  4.6% spin,  0.1% intr, 75.0% 
idle
-- haproxy peaked at 863% CPU

- nbthread 32, 8 shards:
-- 40  CPUs:  0.4% user,  0.0% nice,  1.9% sys,  0.1% spin,  0.1% intr, 97.4% 
idle
-- haproxy peaked at 62% CPU
-- this only used 5 threads according to top FYI 

Since the SSL stack has been mentioned a couple times I retested with
HTTP instead of HTTPS and the results are interesting:

- HTTP, no nbthread, no shards 
-- 40  CPUs:  0.6% user,  0.0% nice,  2.0% sys,  0.2% spin,  0.2% intr, 97.0% 
idle
-- haproxy peaked at 54% CPU
-- current conns = 475; current pipes = 0/0; conn rate = 0/sec; bit rate 
403.512 Mbps, Running tasks: 0/922; idle = 28 %
-- First 3 seconds 7001, 14474, 22000 responses. 7-8k/sec throughout
test. 100% success 0 fail

- HTTP, nbthread 32, no shards
-- 40  CPUs:  0.7% user,  0.0% nice, 29.6% sys,  0.2% spin,  0.5% intr, 69.1% 
idle
-- haproxy peaked at 800% CPU
-- current conns = 501; current pipes = 0/0; conn rate = 1/sec; bit rate 
410.711 Mbps, Running tasks: 28/967; idle = 64 %
-- First 3 seconds 7000, 14500, 22002 responses. 7-8k/sec during test. 100% 
success 0 fail

- HTTP, nbthread 32, 8 shards
-- 40  CPUs:  0.7% user,  0.0% nice,  5.0% sys,  0.2% spin,  0.2% intr, 94.0% 
idle
-- haproxy peaked at 147% CPUs
-- current conns = 500; current pipes = 0/0; conn rate = 1/sec; bit rate 
401.946 Mbps, Running tasks: 0/746; idle = 93 %
-- First 3 seconds 7000, 14500, 22000 responses. 7-8k/sec during test. 100% 
success 0 fail 

Here we still see sys% growing with more threads but even with the waste,
with HTTP we get a reliable 7-8k responses/sec and 100% success rate
instead of 400-500/sec for a few second burst and then a total stall /
nearly 0% success rate with HTTPS. I expect SSL to have some cost but 
not quite this huge, and the "stall" of traffic/health checks under 
heavy HTTPS load is a bit puzzling.

Do you think it would be worth trying to install OpenSSL 3.0.7 from
ports and manually build haproxy against that to compare with the
current LibreSSL 3.6.0? Or is the bottleneck likely somewhere else?

> Ah, great! Do you have any info on the signal that was received there ?
> If you still have the core, issuing "info registers", then "disassemble
> conn_backend_get" and pressing enter till the end of the function could
> be useful to try to locate what's happening there.

It was signal 11 and yes I do still have the core, attached!
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-unknown-openbsd7.2"...
Core was generated by `haproxy'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libpthread.so.26.2...done.
Loaded symbols for /usr/lib/libpthread.so.26.2
Loaded symbols for /usr/local/sbin/haproxy
Reading symbols from /usr/lib/libz.so.7.0...done.
Loaded symbols for /usr/lib/libz.so.7.0
Symbols already loaded for /usr/lib/libpthread.so.26.2
Reading symbols from /usr/lib/libssl.so.53.0...done.
Loaded symbols for /usr/lib/libssl.so.53.0
Reading symbols from /usr/lib/libcrypto.so.50.0...done.
Loaded 

Re: HAProxy performance on OpenBSD

2023-01-23 Thread Willy Tarreau
Hi Marc,

On Mon, Jan 23, 2023 at 11:36:48PM -0600, Marc West wrote:
(...)
> I tested flooding bogus UDP traffic from two other machines with random
> source ports (nsd listening on 53). Within 1 second PF had ~130k states
> and load was minimal:
(...)

OK at least at this point we can rule out any relation with pf.

> Thanks also to Olivier for the nbthread idea. Here are results from
> various combinations of shards and nbthread, with the haproxy stats line
> taken about in the middle of the 1 minute test runs. haproxy was
> restarted after each test.
> 
> ---
> 
> - nbthread=32, transparent:
> -- 40  CPUs: 29.8% user,  0.0% nice, 35.2% sys,  4.0% spin,  0.3% intr,
> 30.8% idle
> -- haproxy peaked at 2500% CPU
> -- current conns = 26183; current pipes = 0/0; conn rate = 3568/sec; bit
> rate = 21.185 Mbps, Running tasks: 25398/27251; idle = 7 %
> -- First 3 seconds of test had 744, 1523, and 721 responses each second.
> Responses stalled after 5 seconds. 95% total failure rate.
> 
> - nbthread=32, no transparent:
> -- 40  CPUs: 28.4% user,  0.0% nice, 36.1% sys,  4.1% spin,  0.3% intr,
> 31.1% idle
> -- haproxy peaked at 2500% CPU
> -- current conns = 21944; current pipes = 0/0; conn rate = 4270/sec; bit
> rate = 8.053 Mbps, Running tasks: 21090/22313; idle = 2 %
> -- First 3 seconds 564, 546, 187 responses. Stalled after 6 seconds. 96%
> total failure rate. 
> 
> All below tests are not using transparent:
> 
> - nbthread=32, shards=2:
> -- 40  CPUs: 33.4% user,  0.0% nice,  5.0% sys,  0.3% spin,  0.1% intr,
> 61.1% idle
> -- haproxy peaked at 1500% CPU
> -- current conns = 22062; current pipes = 0/0; conn rate = 3282/sec; bit
> rate = 7.609 Mbps, Running tasks: 21261/23067; idle = 50 %
> -- First 3 seconds 74, 647, 126 responses. Stalled after 5 seconds.
> Nearly 100% failure rate, 929 responses to 300k queries (most timed out)
> 
> - nbthread=32, shards=4
> -- 40  CPUs: 17.6% user,  0.0% nice,  2.0% sys,  0.2% spin,  0.1% intr,
> 80.1% idle
> -- haproxy peaked at 757% CPU
> -- current conns = 23581; current pipes = 0/0; conn rate = 2239/sec; bit
> rate = 7.818 Mbps, Running tasks: 22210/24450; idle = 75 %
> -- First 3 seconds 44, 272, 144 responses. Stalled after 5 seconds.
> Nearly 100% failure rate, 568 responses to 300k queries (most timed out)
> 
> - nbthread=32, shards=8
> -- 40  CPUs:  8.7% user,  0.0% nice,  1.5% sys,  0.1% spin,  0.1% intr,
> 89.7% idle
> -- haproxy peaked at 365% CPU
> -- current conns = 26125; current pipes = 0/0; conn rate = 3397/sec; bit
> rate = 3.773 Mbps, Running tasks: 26043/26569; idle = 87 %
> -- First 3 seconds 0, 0, 9 responses. Stalled after 5 seconds. Nearly
> 100% failure rate, 77 responses to 300k queries (most timed out)
> 
> - nbthread=4, no shards
> -- 40  CPUs:  9.0% user,  0.0% nice,  1.4% sys,  0.1% spin,  0.1% intr,
> 89.3% idle
> -- haproxy peaked at 366% CPU
> -- current conns = 37327; current pipes = 0/0; conn rate = 4367/sec; bit
> rate = 1.302 Mbps, Running tasks: 37277/37577; idle = 0 %
> -- First 3 seconds 0, 0, 0 responses. Stalled after 5 seconds. Nearly
> 100% failure rate, 9 responses to 300k queries (most timed out)
> 
> - no nbthread, no shards
> -- 40  CPUs:  2.4% user,  0.0% nice,  0.3% sys,  0.0% spin,  0.1% intr,
> 97.2% idle
> -- haproxy peaked at 96% CPU
> -- current conns = 22496; current pipes = 0/0; conn rate = 697/sec; bit
> rate = 219.976 kbps, Running tasks: 22685/22891; idle = 0 %
> -- First 3 seconds 0, 0, 0 responses. 100% failure rate, 0 successful
> responses to 300k queries (~75% connection timeout, ~25% read timeout)

All these connection rates are *extremely* low for anything network
related on such hardware. Thus I conclude that the tests were run with
TLS connections and systematic renegotiation, which also seems to be
supported by the CPU usage which is almost exclusively userland here,
in which case it can be normal depending on the CPU frequency (usually
you count on roughly 1k cps per core on RSA2048).

But if we look at the numbers with more threads, this is an entirely
different story:

> - nbthread=32, no transparent:
> -- 40  CPUs: 28.4% user,  0.0% nice, 36.1% sys,  4.1% spin,  0.3% intr, 31.1% 
> idle
> - nbthread=32, shards=8
> -- 40  CPUs:  8.7% user,  0.0% nice,  1.5% sys,  0.1% spin,  0.1% intr, 89.7% 
> idle

See the difference ? There seems to be an insane FD locking cost on this
system that simply wastes 40% of the CPU there. So I suspect that in your
first tests you were stressing the locking while in the last ones you
were stressing the SSL stack.

Stupid question but I prefer to ask in order to be certain, are all of
these 32 threads located on the same physical CPU ? I just want to be
sure that locks (kernel or user) are not traveling between multiple CPU
sockets, as that ruins scalability.

> I'm not quite sure what to make of each test initially handling varying
> numbers of responses within the first few seconds but then all tests
> "stalling" after about 5 

Re: HAProxy performance on OpenBSD

2023-01-23 Thread Marc West
On 2023-01-23 07:58:24, Willy Tarreau wrote:
> Hi Marc,
 
Hi Willy,

Thanks for your reply and all of your work on haproxy!

> I think you should try to flood the machine using UDP traffic to see
> the difference between the part that happens in the network stack and
> the part that happens in the rest of the system (haproxy included). If
> a small UDP flood on accepted ports brings the machine on its knees,
> it's definitely related to the network stack and/or filtering/tracking.
> If it does nothing to it, I would tend to say that the lower network
> layers and PF are innocent. This would leave us with TCP and haproxy.

I tested flooding bogus UDP traffic from two other machines with random
source ports (nsd listening on 53). Within 1 second PF had ~130k states
and load was minimal:

40  CPUs:  0.0% user,  0.0% nice,  2.5% sys,  0.1% spin,  0.5% intr,
96.8% idle

Based on your and Olivier's replies I wondered what would happen with
server-count=32 instead of 1 nsd process:

40  CPUs:  0.0% user,  0.0% nice, 13.5% sys,  0.2% spin,  0.5% intr,
85.7% idle

The box overall didn't fall over with that amount of states or
~190Mbps of UDP traffic but there were impacts to some connections.
On the first test with 1 process, nsd was immediately responding to
legit queries about half of the time and taking 5-15 seconds to answer
the other half. haproxy responses seemed normal.

On the second test with 32 processes the nsd query failure rate was 
much higher and haproxy had difficulties serving sites or the stats
page similar to the original issue.

> A SYN flood test could be useful, maybe the listening queues are too
> small and incoming packets are dropped too fast.
 
I ran a SYN flood from two separate test machines and quickly got PF up
to ~600k states. CPU usage was minimal and haproxy had no problem
handling requests during this:

40  CPUs:  0.0% user,  0.0% nice,  1.6% sys,  0.2% spin,  0.1% intr,
98.1% idle

> I don't know if it's on-purpose that you're using transparent proxying
> to the servers, but it's very likely that it will increase the processing
> cost at the lower layers by creating extra states in the network sessions
> table. Again this will only have an effect for traffic between haproxy and
> the servers.

For this use case transparent proxying is needed. More detailed results
are below, but on this system at least there doesn't seem to be much
difference with or without it.

> One thing you can try here is to duplicate that line to have multiple
> listening sockets (or just append "shards X" to specify the number of
> sockets you want). One of the benefits is that it will multiply the
> number of listening sockets hence increase the global queue size. Maybe
> some of your packets are lost in socket queues and this could improve
> the situation.
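Applied to the bind line quoted earlier in the thread, that suggestion would look something like the following (the shard count of 4 is an arbitrary example; "shards" is a bind keyword in recent 2.x releases):

```
listen test_https
    # create 4 listening sockets for the same address, multiplying the
    # global accept queue size; "shards 4" is an example value
    bind ip.ip.ip.ip:443 ssl crt /path/to/cert.pem no-tlsv11 alpn h2,http/1.1 shards 4
```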

On 2023-01-23 11:56:48, Olivier Houchard wrote:
> I wonder if the problem comes from OpenBSD's thread implementation, I
> fear you may have too much contention in the kernel, especially on a
> NUMA machine such as yours. Does using, say, only 4 threads, make
> things
> better?

Thanks also to Olivier for the nbthread idea. Here are results from
various combinations of shards and nbthread, with the haproxy stats line
taken roughly midway through the one-minute test runs. haproxy was
restarted after each test.

---

- nbthread=32, transparent:
-- 40  CPUs: 29.8% user,  0.0% nice, 35.2% sys,  4.0% spin,  0.3% intr, 30.8% idle
-- haproxy peaked at 2500% CPU
-- current conns = 26183; current pipes = 0/0; conn rate = 3568/sec; bit rate = 21.185 Mbps, Running tasks: 25398/27251; idle = 7 %
-- First 3 seconds of test had 744, 1523, and 721 responses each second. Responses stalled after 5 seconds. 95% total failure rate.

- nbthread=32, no transparent:
-- 40  CPUs: 28.4% user,  0.0% nice, 36.1% sys,  4.1% spin,  0.3% intr, 31.1% idle
-- haproxy peaked at 2500% CPU
-- current conns = 21944; current pipes = 0/0; conn rate = 4270/sec; bit rate = 8.053 Mbps, Running tasks: 21090/22313; idle = 2 %
-- First 3 seconds: 564, 546, 187 responses. Stalled after 6 seconds. 96% total failure rate.

All tests below were run without transparent proxying:

- nbthread=32, shards=2:
-- 40  CPUs: 33.4% user,  0.0% nice,  5.0% sys,  0.3% spin,  0.1% intr, 61.1% idle
-- haproxy peaked at 1500% CPU
-- current conns = 22062; current pipes = 0/0; conn rate = 3282/sec; bit rate = 7.609 Mbps, Running tasks: 21261/23067; idle = 50 %
-- First 3 seconds: 74, 647, 126 responses. Stalled after 5 seconds. Nearly 100% failure rate: 929 responses to 300k queries (most timed out).

- nbthread=32, shards=4:
-- 40  CPUs: 17.6% user,  0.0% nice,  2.0% sys,  0.2% spin,  0.1% intr, 80.1% idle
-- haproxy peaked at 757% CPU
-- current conns = 23581; current pipes = 0/0; conn rate = 2239/sec; bit rate = 7.818 Mbps, Running tasks: 22210/24450; idle = 75 %
-- First 3 seconds: 44, 272, 144 responses. Stalled after 5 seconds. Nearly 100% failure rate: 568 responses to 300k queries (most timed out).

Re: HAProxy performance on OpenBSD

2023-01-23 Thread Olivier Houchard
Hi Marc,

On Mon, Jan 23, 2023 at 12:13:13AM -0600, Marc West wrote:
> Hi,
> 
> We have been running HAProxy on OpenBSD for several years (currently
> OpenBSD 7.2 / HAProxy 2.6.7) and everything had been working perfectly
> until a recent event of higher than normal traffic. It was an unexpected
> flood to one site, and above ~1100 concurrent sessions we started to see major
> impacts to all sites behind haproxy in different frontends/backends than 
> the one with heavy traffic. All sites started to respond very slowly or 
> failed to load (including requests that haproxy denies before hitting 
> backend), health checks in various other backends started flapping, and
> once the heavy traffic stopped everything returned to normal. Total 
> bandwidth was less than 50Mbps on a 10G port (Intel 82599 ix fiber NIC).
> 
> After that issue we have been doing some load testing to try to gain
> more info and make any possible tweaks. Using the ddosify load testing
> tool (-d 60 -n 7) from a single machine (different from haproxy) is
> able to reproduce the same issue we saw with real traffic.
> 
> When starting a load test HAProxy handles 400-500 requests per second
> for about 3 seconds. After the first few seconds of heavy traffic, the
> test error rate immediately starts to shoot up to 75%+ connection/read
> timeouts and other haproxy sites/health checks start to be impacted. We
> had to stop the test after only 11 seconds to restore responsiveness to
> other sites. This is a physical server with 2x E5-2698 v4 20 core CPUs
> (hyperthreading disabled) and the haproxy process uses about 1545% CPU
> under this load. Overall CPU utilization is 21% user, 0% nice, 37% sys,
> 18% spin, 0.7% intr, 23.1% idle. There was no noticeable impact on
> connections to other services that this box NATs via PF to other servers
> outside of haproxy. 
> 
> Installing FreeBSD 13.1 on an identical machine and trying the same test
> on the same 2.6.7 with the same config and backend servers, the results
> are more what I would expect out of this hardware - haproxy has no
> problems handling over 10,000 req/sec and 40k connections without
> impacting any traffic/health checks, and only about 30% overall CPU
> usage at that traffic level. 100% success rate on the load tests with
> plenty of headroom on the haproxy box to handle way more. The backend
> servers were actually the bottleneck in the FreeBSD test.
> 
> I understand that raw performance on OpenBSD is sometimes not as high as
> other OSes in some scenarios, but the difference of 500 vs 10,000+
> req/sec and 1100 vs 40,000 connections here is very large so I wanted to
> see if there are any thoughts, known issues, or tunables that could
> possibly help improve HAProxy throughput on OpenBSD?
> 
> The usual OS tunables openfiles-cur/openfiles-max are raised to 200k,
> kern.maxfiles=205000 (openfiles peaked at 15k), and haproxy stats
> reports those as expected. PF state limit is raised to 1 million and
> peaked at 72k in use. BIOS power profile is set to max performance.
> 
> pid = 78180 (process #1, nbproc = 1, nbthread = 32)
> uptime = 1d 19h10m11s
> system limits: memmax = unlimited; ulimit-n = 20
> maxsock = 20; maxconn = 99904; maxpipes = 0
> 
> No errors that I can see in logs about hitting any limits. There is no
> change in results with http vs https, http/1.1 vs h2, with or without
> httplog, or reducing nbthread on this 40 core machine. If there are any
> other details I can provide please let me know.
> 
> Thanks in advance for any input!
> 
> 
> global
>   chroot  /var/haproxy
>   daemon  
>   log  127.0.0.1 local2
>   nbthread  32
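For readers following along, the tunables Marc mentions earlier would map to something like the following (values taken from his message; file locations are the usual OpenBSD defaults):

```
# /etc/pf.conf -- raise the state table ceiling well above the default
set limit states 1000000

# /etc/sysctl.conf
kern.maxfiles=205000
```

The openfiles-cur/openfiles-max limits he cites are per-class capabilities set in /etc/login.conf.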

I wonder if the problem comes from OpenBSD's thread implementation, I
fear you may have too much contention in the kernel, especially on a
NUMA machine such as yours. Does using, say, only 4 threads, make things
better?

Regards,

Olivier



Re: HAProxy performance on OpenBSD

2023-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2023 at 02:22:45PM +0600, Илья Шипицин wrote:
> also, I wonder what is LibreSSL <--> OpenSSL perf.
> I'll try "openssl speed" (I recall LibreSSL has the same feature), but I'm
> not sure I can get OpenBSD machine.

It wouldn't have caused that much system time if it was the cause; the
CPU would be mostly spent in user space.

Willy



Re: HAProxy performance on OpenBSD

2023-01-23 Thread Илья Шипицин
Gmail decided to put the original message in spam; I replied to the
first reply.

Indeed it was mentioned, sorry.

On Mon, 23 Jan 2023 at 14:22, Willy Tarreau wrote:

> Hi Ilya,
>
> On Mon, Jan 23, 2023 at 02:11:56PM +0600, Илья Шипицин wrote:
> > I would start with big picture view
> >
> > 1) are CPUs utilized at 100% ?
> > 2) what is CPU usage in details - fraction of system, user, idle ... ?
> >
> > it will allow us to narrow things and find what is the bottleneck, either
> > kernel space or user space.
>
> This was mentioned:
>
>  the haproxy process uses about 1545% CPU
>  under this load. Overall CPU utilization is 21% user, 0% nice, 37% sys,
>  18% spin, 0.7% intr, 23.1% idle
>
> The %sys is high. The %spin could indicate spinlocks and if so it's
> related to the kernel running in SMP and not necessarily being very
> scalable.
>
> Willy
>


Re: HAProxy performance on OpenBSD

2023-01-23 Thread Willy Tarreau
Hi Ilya,

On Mon, Jan 23, 2023 at 02:11:56PM +0600, Илья Шипицин wrote:
> I would start with big picture view
> 
> 1) are CPUs utilized at 100% ?
> 2) what is CPU usage in details - fraction of system, user, idle ... ?
> 
> it will allow us to narrow things and find what is the bottleneck, either
> kernel space or user space.

This was mentioned:

 the haproxy process uses about 1545% CPU
 under this load. Overall CPU utilization is 21% user, 0% nice, 37% sys,
 18% spin, 0.7% intr, 23.1% idle

The %sys is high. The %spin could indicate spinlocks and if so it's
related to the kernel running in SMP and not necessarily being very
scalable.

Willy



Re: HAProxy performance on OpenBSD

2023-01-23 Thread Илья Шипицин
also, I wonder what is LibreSSL <--> OpenSSL perf.
I'll try "openssl speed" (I recall LibreSSL has the same feature), but I'm
not sure I can get OpenBSD machine.

can you try haproxy + openssl-1.1.1 (it is considered the most performant
these days) ?
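A quick way to run that comparison on both systems is the speed subcommand (assuming the -seconds and -evp flags are available in both OpenSSL's and LibreSSL's builds; output formats differ slightly):

```shell
# Print which TLS library is in use, then benchmark a representative
# symmetric cipher and RSA-2048. Run the same commands on each OS and
# compare the throughput numbers.
openssl version
openssl speed -seconds 1 -evp aes-128-gcm
openssl speed -seconds 1 rsa2048
```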

On Mon, 23 Jan 2023 at 14:17, Илья Шипицин wrote:

> and fun fact from my own experience.
> I used to run load balancer on FreeBSD with OpenSSL built from ports.
> somehow I chose "assembler optimization" to "no" and OpenSSL big numbers
> arith were implemented in slow way
>
> I was able to find big fraction of BN-functions using "perf" tool.
> something like 25% of general impact
>
> later, I used "openssl speed", I compared Linux <--> FreeBSD (on required
> cipher suites)
>
> How can I interpret openssl speed output? - Stack Overflow
> 
>
> On Mon, 23 Jan 2023 at 14:11, Илья Шипицин wrote:
>
>> I would start with big picture view
>>
>> 1) are CPUs utilized at 100% ?
>> 2) what is CPU usage in details - fraction of system, user, idle ... ?
>>
>> it will allow us to narrow things and find what is the bottleneck, either
>> kernel space or user space.
>>
>> On Mon, 23 Jan 2023 at 14:01, Willy Tarreau wrote:
>>
>>> Hi Marc,
>>> (...)

Re: HAProxy performance on OpenBSD

2023-01-23 Thread Илья Шипицин
and fun fact from my own experience.
I used to run load balancer on FreeBSD with OpenSSL built from ports.
somehow I chose "assembler optimization" to "no" and OpenSSL big numbers
arith were implemented in slow way

I was able to find big fraction of BN-functions using "perf" tool.
something like 25% of general impact

later, I used "openssl speed", I compared Linux <--> FreeBSD (on required
cipher suites)

How can I interpret openssl speed output? - Stack Overflow


On Mon, 23 Jan 2023 at 14:11, Илья Шипицин wrote:

> I would start with big picture view
>
> 1) are CPUs utilized at 100% ?
> 2) what is CPU usage in details - fraction of system, user, idle ... ?
>
> it will allow us to narrow things and find what is the bottleneck, either
> kernel space or user space.
>
> On Mon, 23 Jan 2023 at 14:01, Willy Tarreau wrote:
>
>> Hi Marc,
>> (...)

Re: HAProxy performance on OpenBSD

2023-01-23 Thread Илья Шипицин
I would start with big picture view

1) are CPUs utilized at 100% ?
2) what is CPU usage in details - fraction of system, user, idle ... ?

it will allow us to narrow things and find what is the bottleneck, either
kernel space or user space.

On Mon, 23 Jan 2023 at 14:01, Willy Tarreau wrote:

> Hi Marc,
> (...)


Re: HAProxy performance on OpenBSD

2023-01-23 Thread Willy Tarreau
Hi Marc,

On Mon, Jan 23, 2023 at 12:13:13AM -0600, Marc West wrote:
(...)
> I understand that raw performance on OpenBSD is sometimes not as high as
> other OSes in some scenarios, but the difference of 500 vs 10,000+
> req/sec and 1100 vs 40,000 connections here is very large so I wanted to
> see if there are any thoughts, known issues, or tunables that could
> possibly help improve HAProxy throughput on OpenBSD?

Based on my experience a long time ago (~13-14 years), I remember that
PF's connection tracking didn't scale at all with the number of
connections. It was very clear that there was a very high per-packet
lookup cost indicating that a hash table was too small. Unfortunately
I didn't know how to change such settings, and since my home machine
was being an ADSL line anyway, the line would have been filled long
before the hash table so I didn't really care. But I was a bit shocked
by this observation. I supposed that since then it has significantly
evolved, but it would be worth having a look around this.

> The usual OS tunables openfiles-cur/openfiles-max are raised to 200k,
> kern.maxfiles=205000 (openfiles peaked at 15k), and haproxy stats
> reports those as expected. PF state limit is raised to 1 million and
> peaked at 72k in use. BIOS power profile is set to max performance.

I think you should try to flood the machine using UDP traffic to see
the difference between the part that happens in the network stack and
the part that happens in the rest of the system (haproxy included). If
a small UDP flood on accepted ports brings the machine on its knees,
it's definitely related to the network stack and/or filtering/tracking.
If it does nothing to it, I would tend to say that the lower network
layers and PF are innocent. This would leave us with TCP and haproxy.
A SYN flood test could be useful, maybe the listening queues are too
small and incoming packets are dropped too fast.

At the TCP layer, a long time ago OpenBSD used to be a bit extremist
in the way it produces random sequence numbers. I don't know how it
is today nor if this has a significant cost. Similarly, outgoing
connections will need a random source port, and this can be expensive,
particularly when the number of concurrent connections raises and ports
become scarce, though you said that even blocked traffic causes harm
to the machine, so I doubt this is your concern for now.

> pid = 78180 (process #1, nbproc = 1, nbthread = 32)
> uptime = 1d 19h10m11s
> system limits: memmax = unlimited; ulimit-n = 20
> maxsock = 20; maxconn = 99904; maxpipes = 0
> 
> No errors that I can see in logs about hitting any limits. There is no
> change in results with http vs https, http/1.1 vs h2, with or without
> httplog, or reducing nbthread on this 40 core machine. If there are any
> other details I can provide please let me know.

At least I'm seeing you're using kqueue, which is a good point.

>   source  0.0.0.0 usesrc clientip

I don't know if it's on-purpose that you're using transparent proxying
to the servers, but it's very likely that it will increase the processing
cost at the lower layers by creating extra states in the network sessions
table. Again this will only have an effect for traffic between haproxy and
the servers.

> listen test_https
>   bind ip.ip.ip.ip:443 ssl crt /path/to/cert.pem no-tlsv11 alpn h2,http/1.1

One thing you can try here is to duplicate that line to have multiple
listening sockets (or just append "shards X" to specify the number of
sockets you want). One of the benefits is that it will multiply the
number of listening sockets hence increase the global queue size. Maybe
some of your packets are lost in socket queues and this could improve
the situation.

I don't know if you have something roughly equivalent to "perf" on
OpenBSD nowadays, as that could prove extremely useful to figure where
the CPU time is spent. Other than that I'm a bit out of ideas.

Willy