Re: [2.4.2] Thread .. is about to kill the process - Lua-involved
On Fri, Aug 20, 2021 at 03:10:05PM +0200, Willy Tarreau wrote:
> BTW maybe we should arrange to take the Lua lock inside an externally
> visible function that could be resolved. It would more easily show up in
> case of trouble so that the issue becomes more obvious.

And it works pretty well!

- perf top

   86.61%  haproxy        [.] lua_take_global_lock
    9.69%  haproxy        [.] luaV_execute
    2.72%  haproxy        [.] luaG_traceexec
    0.17%  [kernel]       [k] native_io_delay
    0.07%  [i2c_i801]     [k] 0x0cf2
    0.03%  [kernel]       [k] acpi_os_read_port
    0.02%  perf           [.] rb_next
    0.02%  libc-2.30.so   [.] __strcmp_avx2
    0.01%  perf           [.] __symbols__insert
    0.01%  perf           [.] dso__find_symbol
    0.01%  [kernel]       [k] kallsyms_expand_symbol.constprop.0
    0.01%  libc-2.30.so   [.] _int_malloc

- "show threads"

>Thread 4 : id=0x7f561b677700 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
             stuck=1 prof=1 harmless=0 wantrdv=0
             cpu_ns: poll=43427720070 now=43835242784 diff=407522714
             curr_task=0x7f560c02e9c0 (task) calls=1 last=408413622 ns ago
               fct=0x4d3b20(process_stream) ctx=(nil)
             call trace(13):
             | 0x5a0c3b [85 c0 75 19 48 81 c4 c8]: ha_dump_backtrace+0x2b/0x299
             | 0x5a209f [48 83 c4 38 5b 5d 41 5c]: ha_thread_dump+0x22f/0x281
             | 0x5a2166 [48 8b 05 2b 2c 20 00 48]: debug_handler+0x66/0x10b
             | 0x7f5621ef6690 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13690
             | 0x46341a [eb cc 0f 1f 40 00 48 b8]: lua_take_global_lock+0x4a/0x4c
             | 0x46987d [64 48 8b 04 25 00 00 00]: hlua_ctx_destroy+0xcd/0x35a
             | 0x4d77ea [49 83 bf f0 00 00 00 00]: process_stream+0x3cca/0x4e18

- panic:

>Thread 8 : id=0x7f8ec57fa700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
             stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=611690 now=1994658802 diff=1994047112
             curr_task=0x7f8eb402e9c0 (task) calls=1 last=0
               fct=0x4d3b20(process_stream) ctx=0x7f8eb402e550
             strm=0x7f8eb402e550 src=127.0.0.1 fe=fe4 be=fe4 dst=unknown
             rqf=40d08002 rqa=30 rpf=8000 rpa=0 sif=EST,200020 sib=INI,30
             af=(nil),0 csf=0x7f8eb402e500,4000 ab=(nil),0 csb=(nil),0
             cof=0x7f8eb802a2c0,80001300:H1(0x7f8eb4026240)/RAW((nil))/tcpv4(40)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
             Current executing Lua from a stream analyser --
             call trace(20):
             | 0x5a0c3b [85 c0 75 19 48 81 c4 c8]: ha_dump_backtrace+0x2b/0x299
             | 0x5a209f [48 83 c4 38 5b 5d 41 5c]: ha_thread_dump+0x22f/0x281
             | 0x5a2166 [48 8b 05 2b 2c 20 00 48]: debug_handler+0x66/0x10b
             | 0x7f8ece7d3690 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13690
             | 0x46341a [eb cc 0f 1f 40 00 48 b8]: lua_take_global_lock+0x4a/0x4c
             | 0x466c17 [e9 fe fe ff ff 0f 1f 40]: hlua_ctx_init+0x147/0x1cd
             | 0x469592 [85 c0 74 3a 48 8b 44 24]: main+0x3dc92
             | 0x51dacb [85 c0 74 61 48 8b 5d 20]: sample_process+0x4b/0xf7
             | 0x51e55c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
             | 0x525613 [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x2443/0x3cae
             | 0x4af0be [4c 63 e8 4c 03 6d 10 4c]: http_apply_redirect_rule+0xbfe/0xdf8
             | 0x4af523 [83 f8 01 19 c0 83 e0 03]: main+0x83c23
             | 0x4b2326 [83 f8 07 0f 87 99 00 00]: http_process_req_common+0xf6/0x15f6
             | 0x4d5b30 [85 c0 0f 85 9f f5 ff ff]: process_stream+0x2010/0x4e18

And from what I'm seeing, the tune.lua.forced-yield setting doesn't help
against this, because the system is really not making any visible progress
and non-Lua stuff is blocked (the CLI is unresponsive on short loops).
Thus I'm going to commit this. It doesn't show any measurable performance
change considering how long that lock is usually taken!

For reference, the Lua code here does this:

    core.register_fetches("loop", function(txn)
        for i=1,1 do end
        return "/"
    end)

And the haproxy conf is this:

    global
        stats socket /tmp/sock1 level admin
        # tune.lua.forced-yield 10
        # lua-load-per-thread loop.lua
        lua-load loop.lua

    defaults
        mode http
        timeout client 10s
        timeout server 10s
        timeout connect 10s

    frontend fe4
        bind :9003
        http-request redirect location %[lua.loop]

Willy
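The trick discussed in this message (taking the lock inside an externally
visible, non-inlined function so the wait shows up by name in perf and in
backtraces) can be sketched roughly as below. This is a hedged illustration,
not HAProxy's actual code: the function name mirrors the lua_take_global_lock
symbol seen in the traces, but the pthread mutex and the lock_roundtrip()
helper are assumptions made for the example.

```c
#include <pthread.h>

/* Hypothetical global Lua lock, for illustration only; HAProxy's real
 * lock lives in its hlua code and may use a different primitive. */
static pthread_mutex_t hlua_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Marked noinline so that time spent waiting is attributed to this
 * symbol in "perf top" and in watchdog backtraces, instead of being
 * folded into whichever caller happened to take the lock. */
__attribute__((noinline)) void lua_take_global_lock(void)
{
    pthread_mutex_lock(&hlua_global_lock);
}

__attribute__((noinline)) void lua_drop_global_lock(void)
{
    pthread_mutex_unlock(&hlua_global_lock);
}

/* Small helper to exercise the pair once; returns 0 on success. */
int lock_roundtrip(void)
{
    lua_take_global_lock();
    lua_drop_global_lock();
    return 0;
}
```

With something like this in place, a contended lock shows up directly under
its own symbol in profiles, which is exactly what the perf output above
demonstrates.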
Re: [2.4.2] Thread .. is about to kill the process - Lua-involved
Hi Robin,

sorry for the delay, we've been quite busy these last days :-/

On Mon, Aug 09, 2021 at 09:06:36PM +, Robin H. Johnson wrote:
> After months searching, at work we stumbled onto an internally usable-only
> reproduction case using a tool we wrote that made millions of requests:
> Turning it up around ~6K RPS w/ lots of the headers being processed by our
> Lua code triggered the issue, running on a single-sock EPYC 7702P system.

OK!

> We also found a surprising mitigation: enabling multithreaded Lua w/
> "lua-load-per-thread" made the problem go away entirely (and gave a modest
> 10% performance boost, we are mostly limited by backend servers, not
> HAProxy or Lua).

Then I'm sorry that it was not spotted earlier, because this was a known
limitation of Lua in the pre-2.4 versions: the Lua code runs on a single
shared stack, and since by default there is only that one stack, when you
have too many threads some of them wait ages to get access to the CPU, to
the point of possibly spinning more than 2 seconds there (which is an
eternity for a CPU).

BTW maybe we should arrange to take the Lua lock inside an externally
visible function that could be resolved. It would more easily show up in
case of trouble so that the issue becomes more obvious.

And that's exactly why lua-load-per-thread was introduced: it creates one
independent stack per thread so that there is no more locking. I suspect
that limiting the number of Lua instructions executed in a call could have
reduced the probability of keeping Lua on a thread for too long, but that
equates to playing Russian roulette, and if you could switch to
lua-load-per-thread it's way better.

I had started some work a few months ago to implement latency-bounded
locks that avoid the trouble of NUMA systems (and even any system with a
non-totally-uniform L3 cache) where some groups of threads can hinder
other groups for a while. When I'm done with this I guess the Lua lock
will be a good candidate for it!
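For readers hitting the same problem, the fix described here is a
one-keyword change in the global section. A minimal sketch, assuming a
script called myscript.lua (the filename is a placeholder):

```
global
    # Single shared Lua stack, serialized by the global lock:
    #lua-load myscript.lua

    # One independent Lua stack per thread -- no global lock, but the
    # script must be thread-safe (no state shared through globals):
    lua-load-per-thread myscript.lua
```

The trade-off is exactly the one discussed in this thread: per-thread
states remove the contention entirely, at the cost of requiring the script
itself to avoid relying on globals.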
> The Lua script was described in the previous script, and only does complex
> string parsing, used for variables, and driving some applets. It doesn't
> do any blocking operations, sockets, files or rely on globals. It got a
> few cleanups for multi-threaded usage (forcing more variables to be
> explicitly local), but has no other significant changes relevant to this
> discussion (it had some business logic changes to string handling used to
> compute stick table keys, but not really functionality changes).

I'm really glad that you managed to make it thread-safe with limited
changes, as that was our hope when we designed it like this with Thierry!

> The full errors are attached along with decoded core dump, with some
> details redacted per $work security team requirements.
> Repeated the error twice and both attempts are attached, 4 files in total.
> I'll repeat the short form here for interest from just one of the
> occurrences:

Many thanks for sharing all this, it will certainly help others.

(...)

Thanks!
Willy
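The "forcing more variables to be explicitly local" cleanup mentioned above
addresses the classic Lua pitfall: an assignment without `local` creates a
global, which becomes shared mutable state once several invocations run on
the same Lua state. A hypothetical before/after sketch (the function names
are illustrative, not from the actual script):

```lua
-- Before the cleanup: "parts" leaks into the global environment and is
-- shared by every invocation on the same Lua state -- a data race once
-- concurrent requests run the script.
function split_bad(s)
    parts = {}                       -- accidental global
    for w in string.gmatch(s, "[^,]+") do
        parts[#parts + 1] = w
    end
    return parts
end

-- After the cleanup: "local" confines the table to this single call.
local function split_good(s)
    local parts = {}
    for w in string.gmatch(s, "[^,]+") do
        parts[#parts + 1] = w
    end
    return parts
end
```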
[2.4.2] Thread .. is about to kill the process - Lua-involved
=== Resending this, with the threading broken, so that other readers
hopefully see it. It was in the thread previously. ===

Hi,

This is a followup to the prior threads about 100% CPU usage in 2.2.x &
2.3.x, where I referenced heavy workloads causing HAProxy to initially hit
100% CPU; after the watchdog detection was added, they just killed the
process instead.

After months searching, at work we stumbled onto an internally usable-only
reproduction case using a tool we wrote that made millions of requests:
turning it up to around ~6K RPS w/ lots of the headers being processed by
our Lua code triggered the issue, running on a single-sock EPYC 7702P
system.

We also found a surprising mitigation: enabling multithreaded Lua w/
"lua-load-per-thread" made the problem go away entirely (and gave a modest
10% performance boost; we are mostly limited by backend servers, not
HAProxy or Lua).

The Lua script was described in the previous thread, and only does complex
string parsing, used for variables, and driving some applets. It doesn't
do any blocking operations, sockets, or files, nor rely on globals. It got
a few cleanups for multi-threaded usage (forcing more variables to be
explicitly local), but has no other significant changes relevant to this
discussion (it had some business logic changes to string handling used to
compute stick table keys, but not really functionality changes).

The full errors are attached along with the decoded core dumps, with some
details redacted per $work security team requirements. We repeated the
error twice and both attempts are attached, 4 files in total. I'll repeat
the short form here for interest, from just one of the occurrences:

Thread 23 is about to kill the process.
...
*>Thread 23: id=0x7f78d4ff9700 act=1 glob=1 wq=1 rq=1 tl=1 tlsz=8 rqsz=30
             stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=63597658036 now=66142046101 diff=2544388065
             curr_task=0x7f7888c9db60 (task) calls=1 last=0
               fct=0x55e4b662ca30(process_stream) ctx=0x7f78890147e0
             strm=0x7f78890147e0 src=REDACTED-CLIENT-IP fe=tls-ipv4 be=tls-ipv4 dst=unknown
             rqf=40d08002 rqa=30 rpf=8000 rpa=0 sif=EST,200020 sib=INI,30
             af=(nil),0 csf=0x7f7888d02bb0,104000 ab=(nil),0 csb=(nil),0
             cof=0x7f78b0685500,80003300:H1(0x7f78886e1fa0)/SSL(0x7f78886c3270)/tcpv4(3900)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
             Current executing Lua from a stream analyser -- stack traceback:
             call trace(20):
             | 0x55e4b6759008 [eb c5 66 0f 1f 44 00 00]: wdt_handler+0x98/0x129
             | 0x7f78ef9f4980 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12980
             | 0x55e4b65d650a [48 8b 05 2f b9 45 00 48]: main+0x346da
             | 0x55e4b667778d [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
             | 0x55e4b670ba99 [48 85 c0 0f 84 c1 00 00]: main+0x169c69
             | 0x55e4b660a421 [83 f8 07 4c 8b 0c 24 0f]: main+0x685f1
             | 0x55e4b660d436 [83 f8 07 48 8b 4c 24 20]: http_process_req_common+0xf6/0x1659
             | 0x55e4b662e508 [85 c0 0f 85 0a f4 ff ff]:

[NOTICE]   (35377) : haproxy version is 2.4.2-1ppa1~bionic
[NOTICE]   (35377) : path to executable is /usr/sbin/haproxy
[ALERT]    (35377) : Current worker #1 (35404) exited with code 134 (Aborted)
[ALERT]    (35377) : exit-on-failure: killing every processes with SIGTERM
[WARNING]  (35377) : All workers exited. Exiting... (134)

-- 
Robin Hugh Johnson
E-Mail     : robb...@orbis-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

Attachments:
  syslog.35404.gz
  syslog.34872.gz
  core.haproxy.35404.HOSTNAME.1627083123.txt.gz
  core.haproxy.34872.HOSTNAME.1627082514.txt.gz
  signature.asc (PGP signature)