Re: haproxy hung with CPU usage at 100% Heeeelp, please!!!

2021-05-19 Thread Amaury Denoyelle
Oriol Magrane  wrote:

> Amaury,
> Your patch seems to have killed the bug: our newly recompiled haproxy has
> now been running for more than 24 hours without glitches (while the
> unpatched versions never made it past the 4-hour mark).
> I'll let it run for 24 more hours at full load just in case, but things
> are looking good and I don't expect to have to come back here crying.
> So thank you very much for your help!

Great news! I will merge the fix soon in the dev branch and it should
hit the next series of stable releases. Thank you again for your report
and your feedback.

-- 
Amaury Denoyelle



Re: haproxy hung with CPU usage at 100% Heeeelp, please!!!

2021-05-19 Thread Oriol Magrane
Amaury,
Your patch seems to have killed the bug: our newly recompiled haproxy has
now been running for more than 24 hours without glitches (while the
unpatched versions never made it past the 4-hour mark).
I'll let it run for 24 more hours at full load just in case, but things
are looking good and I don't expect to have to come back here crying.
So thank you very much for your help!

Oriol


Message from Amaury Denoyelle  on Mon, 17 May 2021 at 10:32:

> Oriol Magrane  wrote:
>
> > [...]
Re: haproxy hung with CPU usage at 100% Heeeelp, please!!!

2021-05-17 Thread Amaury Denoyelle
Oriol Magrane  wrote:

> [...]

Re: haproxy hung with CPU usage at 100% Heeeelp, please!!!

2021-05-14 Thread Lukas Tribus
The first thing I'd try is to disable multithreading (by putting
nbthread 1 in the global section of the configuration) to see if that
helps.
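
For reference, that is a one-line change in the global section (a minimal
sketch; keep whatever else your global section already contains):

    global
        nbthread 1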


Lukas



haproxy hung with CPU usage at 100% Heeeelp, please!!!

2021-05-14 Thread Oriol Magrane
Hi all,

we need help: after upgrading the OS of one of our haproxy servers to Debian
Buster, which brought haproxy up to the distribution's 1.8.19-1 package, we
started noticing occasional hangs of the process: it gets stuck at 100%
CPU usage and stops accepting connections of any kind (front-end HTTP
connections, unix socket or TCP API queries, and HTTP statistics retrievals).

These hangs happen only from time to time (quite randomly; it may take 10
minutes or several hours after starting haproxy), but we have observed that
they are always triggered by a "set maxconn server" command sent via the API
while haproxy is busy processing normal HTTP requests. (To keep some
backends' capacity up to date, we have a cron job that issues a batch of 30
to 40 of these commands to haproxy's API once per minute.)
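
For illustration, each such update is a one-liner against the stats socket
along these lines (the socket path and the backend/server names here are
made up):

    echo "set maxconn server bk_app/srv1 25" | socat stdio unix-connect:/var/run/haproxy.sock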

We have been unable to establish any relation between the backend, server,
and maxconn values of the "set maxconn server" command that caused a hang
(they seem to differ on every occasion), and we have been equally unable to
relate the hangs to haproxy's workload at the time.

We've also tried source-compiling a more recent version of the 1.8 series
(1.8.29) on the target machine, but to no avail: the problem persists.

The only thing we have been able to identify is where in the code the
process gets stuck: we attached gdb to the haproxy process and let it run
freely, feeding it regular workload and the usual API-based maxconn
updates, until it hung. Interrupting execution and asking the debugger for
a backtrace produced this:

^C
Program received signal SIGINT, Interrupt.
0x563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960) at src/queue.c:177
177             HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
(gdb) bt full
#0  0x563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960) at src/queue.c:177
        p = 0x563d70dcf330
        maxconn = <optimized out>
        remote = 0
#1  0x563d70beda0f in server_parse_maxconn_change_request (maxconn_str=<optimized out>, sv=0x563d70dd1960) at src/server.c:1269
        v = <optimized out>
        end = <optimized out>
        v = <optimized out>
        end = <optimized out>
#2  server_parse_maxconn_change_request (sv=0x563d70dd1960, maxconn_str=<optimized out>) at src/server.c:1247
        v = <optimized out>
        end = 0x563d70d69603 ""
#3  0x563d70bf2db2 in cli_parse_set_maxconn_server (private=<optimized out>, appctx=0x563d7143ff30, args=0x7ffd15cb9d20) at src/server.c:4295
        sv = 0x563d70dd1960
        warning = <optimized out>
        sv = <optimized out>
        warning = <optimized out>
        __pl_r = <optimized out>
        __pl_l = <optimized out>
        ret = <optimized out>
#4  cli_parse_set_maxconn_server (args=0x7ffd15cb9d20, appctx=0x563d7143ff30, private=<optimized out>) at src/server.c:4281
        sv = <optimized out>
        warning = <optimized out>
        __pl_r = <optimized out>
        __pl_l = <optimized out>
        ret = <optimized out>
#5  0x563d70c4e5df in cli_parse_request (appctx=appctx@entry=0x563d7143ff30, line=0x563d70d69603 "") at src/cli.c:479
        args = {0x563d70d695c0 "set", 0x563d70d695c4 "maxconn", 0x563d70d695cc "server", 0x563d70d695d3 "group_b_remote_backend", 0x563d70d69601 "19", 0x563d70d69603 "" }
        kw = <optimized out>
        arg = <optimized out>
        i = <optimized out>
        j = <optimized out>
#6  0x563d70c4ebde in cli_io_handler (appctx=0x563d7143ff30) at src/cli.c:622
        si = 0x563d714a6ba0
        req = 0x563d714a6940
        res = 0x563d714a6980
        bind_conf = <optimized out>
        reql = 68
        len = <optimized out>
#7  0x563d70c8705e in applet_run_active () at src/applet.c:87
        curr = <optimized out>
        next = <optimized out>
        si = 0x563d714a6ba0
        applet_cur_queue = {n = 0x563d7143ff30, p = 0x563d7143ff30}
        max_processed = <optimized out>
#8  0x563d70c31600 in run_poll_loop () at src/haproxy.c:2563
        next = <optimized out>
        exp = <optimized out>
        next = <optimized out>
        exp = <optimized out>
#9  run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2598
        ptif = <optimized out>
        ptdf = <optimized out>
        start_lock = 0
#10 0x563d70b8c6f9 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3201
        tids = 0x563d710bda60
        threads = 0x563d710bda80
        i = <optimized out>
        old_sig = {__val = {2048, 140724969120513, 140724969120352, 179, 94821886072224, 94821884909799, 94821886091488, 94821885435755, 140724969120255, 140724969120576, 140724969120271, 140724969120608, 206158430256, 140724969120576, 140724969120384, 7368483128895714304}}
        blocked_sig = {__val = {1844674406710583, 18446744073709551615 }}
        err = <optimized out>
        retry = <optimized out>
        limit = {rlim_cur = 16497, rlim_max = 16497}
        errmsg = "\000\000\000\000\000\000\000\000P\000\000\000\000\000\000\000\210\334\377\377\377\377\377\377\000\000\000\000\000\000\000\000\003\000\000\000\060", '\000' , "[\000\000\000n\000\000\000|\000\000\000w", '\000' , "\360\006\236\f\335\177\000\000@\274\235\f\335\177\000\000 \000\000"
        pidfd = <optimized out>


... so the server HA_SPIN_LOCK() in line 177 of queue.c seems to be the
culprit.
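
For what it's worth, one plausible reading of the trace (our assumption
only) is that the code path in frame #3 already holds the server lock by
the time frame #0 tries to take it again; since the spinlock is not
recursive, the thread would then spin forever, which would explain both the
hang and the 100% CPU. A minimal standalone sketch of that failure mode
(hypothetical code, not haproxy's actual implementation; build with
cc -pthread demo.c):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_spinlock_t srv_lock;  /* stands in for the per-server lock */

    static void process_queue(void)
    {
        /* Second acquisition: the lock is already held by this same
         * thread, so with a plain non-recursive spinlock this call
         * busy-waits forever (compare frame #0 above). */
        pthread_spin_lock(&srv_lock);
        /* ... dequeue pending connections ... */
        pthread_spin_unlock(&srv_lock);
    }

    static void set_maxconn(int maxconn)
    {
        pthread_spin_lock(&srv_lock);    /* first acquisition succeeds */
        (void)maxconn;                   /* ... update the server's maxconn ... */
        process_queue();                 /* never returns: self-deadlock */
        pthread_spin_unlock(&srv_lock);
    }

    int main(void)
    {
        pthread_spin_init(&srv_lock, PTHREAD_PROCESS_PRIVATE);
        puts("calling set_maxconn(); expect a hang with one core at 100%");
        set_maxconn(19);
        puts("never reached");
        return 0;
    }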



At this point, we don't know what we can do to get rid of this problem. We
tried the new haproxy 2.x series and the issue is still there; the only
difference is that the haproxy process doesn't hang, but it gets