Re: [2.4.22] Segmentation fault when using spoe + disabled keyword

2023-05-15 Thread Maciej Zdeb
Awesome, thanks!

Thu, 11 May 2023 at 09:38, Christopher Faulet wrote:

> On 5/9/23 at 14:29, Maciej Zdeb wrote:
> > Hi Christopher,
> > no problem. :) Yes, I'm using the same spoe backend for multiple
> > frontends. This is my spoe configuration:
> >
> > [abc]
> >
> > spoe-agent abc-agent
> >messages check-abc
> >register-var-names x_abc_request_headers x_headers x_abc_response body
> >option var-prefix abc
> >option set-on-error error
> >option set-total-time total_time
> >timeout hello 100ms
> >timeout idle 41s
> >timeout processing 50ms
> >use-backend spoe-abc
> >
> > spoe-message check-abc
> >args Key=str("123") Key2=str("Haproxy")...
> >acl route_to_abc  path_beg -f /etc/haproxy/configs/acl/path_begin.lst
> >event on-frontend-http-request if route_to_abc
> >
>
> Thanks! I'm able to reproduce the issue.
>
> SPOE filters for disabled proxies are released during startup, but pattern
> expressions may be shared between filters. So if the disabled proxy is defined
> first, it is an issue because these expressions are released while they are
> still in use by other filters.
>
> I pushed a fix in 2.8-DEV. It will be backported to all stable versions. As a
> workaround, the disabled proxies must not be the first ones defined.
>
> --
> Christopher Faulet
>
>
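For illustration, a hedged sketch of the ordering workaround described above
(frontend names, ports and the spoe config path are made up, not taken from
this thread): an enabled frontend that uses the shared spoe engine is declared
before the disabled one, so the shared pattern expressions are not released at
startup.

frontend fe_active
    bind :8080
    filter spoe engine abc config /etc/haproxy/spoe-abc.conf
    default_backend back

# declared after the active frontend (pre-fix workaround), so releasing it at
# startup does not free ACL pattern expressions still shared with fe_active
frontend fe_new
    disabled
    bind :8081
    filter spoe engine abc config /etc/haproxy/spoe-abc.conf
    default_backend back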


Re: [2.4.22] Segmentation fault when using spoe + disabled keyword

2023-05-09 Thread Maciej Zdeb
Hi Christopher,
no problem. :) Yes I'm using the same spoe backend for multiple frontends.
This is my spoe configuration:

[abc]

spoe-agent abc-agent
  messages check-abc
  register-var-names x_abc_request_headers x_headers x_abc_response body
  option var-prefix abc
  option set-on-error error
  option set-total-time total_time
  timeout hello 100ms
  timeout idle 41s
  timeout processing 50ms
  use-backend spoe-abc

spoe-message check-abc
  args Key=str("123") Key2=str("Haproxy")...
  acl route_to_abc  path_beg -f /etc/haproxy/configs/acl/path_begin.lst
  event on-frontend-http-request if route_to_abc

Tue, 9 May 2023 at 09:14, Christopher Faulet wrote:

> On 5/2/23 at 13:58, Maciej Zdeb wrote:
> > Hi,
> > I'm experiencing a segmentation fault caused by adding "disabled"
> > (http://docs.haproxy.org/2.4/configuration.html#4-disabled
> > <http://docs.haproxy.org/2.4/configuration.html#4-disabled>) to the
> frontend
> > section of haproxy configuration file. That frontend does not handle any
> traffic
> > yet.
> >
> > The "disabled" keyword was used on other frontends on the same HAProxy
> instance
> > before without any issue. The difference is that this time it was added
> to
> > frontend with spoe filter. This probably causes a crash when processing
> requests
> > on a different heavily used frontend (also spoe enabled).
> >
> > It crashes after 2-10s under production traffic. I couldn't replicate it
> with a
> > synthetic test, however I have a coredump and packages with HAProxy
> binary and
> > debug symbols.
> >
> > (gdb) bt
> > #0  __pthread_kill_implementation (no_tid=0, signo=6,
> threadid=140527117211200)
> > at ./nptl/pthread_kill.c:44
> > #1  __pthread_kill_internal (signo=6, threadid=140527117211200) at
> > ./nptl/pthread_kill.c:78
> > #2  __GI___pthread_kill (threadid=140527117211200, signo=signo@entry=6)
> at
> > ./nptl/pthread_kill.c:89
> > #3  0x7fcf0881a476 in __GI_raise (sig=sig@entry=6) at
> > ../sysdeps/posix/raise.c:26
> > #4  0x7fcf088007f3 in __GI_abort () at ./stdlib/abort.c:79
> > #5  0x558f48524feb in ha_panic () at src/debug.c:333
> > #6  0x558f486be1af in wdt_handler (sig=14, si=0x7fcf04e4d3f0,
> > arg=0x7fcf04e4d2c0) at src/wdt.c:122
> > #7  <signal handler called>
> > #8  0x558f48469895 in pattern_exec_match (head=0x558f69fe67b8,
> > smp=0x7fcf04e4e080, fill=0) at src/pattern.c:2498
> > #9  0x558f485a0bf2 in acl_exec_cond (cond=0x558f69fe69a0,
> px=0x558f69f67cf0,
> > sess=0x7fce50243cd0, strm=0x7fce5020a4b0, opt=6) at src/acl.c:1123
> > #10 0x558f483b5eea in spoe_encode_message (s=0x7fce5020a4b0,
> > ctx=0x7fce5026ddf0, msg=0x558f69fe2650, dir=0, buf=0x7fcf04e4e1b0,
> > end=0x7fce502a627c "") at src/flt_spoe.c:2185
> > #11 0x558f483b6259 in spoe_encode_messages (s=0x7fce5020a4b0,
> > ctx=0x7fce5026ddf0, messages=0x558f69fe23a0, dir=0, type=1) at
> src/flt_spoe.c:2285
> > #12 0x558f483b75a4 in spoe_process_messages (s=0x7fce5020a4b0,
> > ctx=0x7fce5026ddf0, messages=0x558f69fe23a0, dir=0, type=1) at
> src/flt_spoe.c:2726
> > #13 0x558f483b79d0 in spoe_process_event (s=0x7fce5020a4b0,
> > ctx=0x7fce5026ddf0, ev=SPOE_EV_ON_HTTP_REQ_FE) at src/flt_spoe.c:2842
> > #14 0x558f483b8eeb in spoe_chn_pre_analyze (s=0x7fce5020a4b0,
> > filter=0x7fce5025c920, chn=0x7fce5020a4c0, an_bit=16) at
> src/flt_spoe.c:3324
> > #15 0x558f48549976 in flt_pre_analyze (s=0x7fce5020a4b0,
> chn=0x7fce5020a4c0,
> > an_bit=16) at src/filters.c:764
> > #16 0x558f4836aaff in process_stream (t=0x7fce50242f30,
> > context=0x7fce5020a4b0, state=260) at src/stream.c:1918
> > #17 0x558f4857aa3c in run_tasks_from_lists (budgets=0x7fcf04e4e5a0)
> at
> > src/task.c:597
> > #18 0x558f4857b485 in process_runnable_tasks () at src/task.c:838
> > #19 0x558f484d0eb5 in run_poll_loop () at src/haproxy.c:2630
> > #20 0x558f484d1423 in run_thread_poll_loop (data=0x5) at
> src/haproxy.c:2805
> > #21 0x7fcf0886cb43 in start_thread (arg=) at
> > ./nptl/pthread_create.c:442
> > #22 0x7fcf088fea00 in clone3 () at
> ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
> >
> > haproxy -vv
> > HAProxy version 2.4.22-f8e3218 2023/02/14 - https://haproxy.org/
> > <https://haproxy.org/>
> > Status: long-term supported branch - will stop receiving fixes around Q2
> 2026.
> > Known bugs: http://www.haproxy.org/bugs/bugs-2.4.22.html
> > <http://www.haproxy.org/bugs/bugs-2.4.22.html>
> > Running on: Linux 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29
> UTC 2023

[2.4.22] Segmentation fault when using spoe + disabled keyword

2023-05-02 Thread Maciej Zdeb
Hi,
I'm experiencing a segmentation fault caused by adding "disabled" (
http://docs.haproxy.org/2.4/configuration.html#4-disabled) to a frontend
section of the haproxy configuration file. That frontend does not handle any
traffic yet.

The "disabled" keyword was used on other frontends on the same HAProxy
instance before without any issue. The difference is that this time it was
added to a frontend with an spoe filter. This probably causes a crash when
processing requests on a different, heavily used frontend (also spoe
enabled).

It crashes after 2-10s under production traffic. I couldn't replicate it
with a synthetic test; however, I have a coredump and packages with the
HAProxy binary and debug symbols.

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140527117211200) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140527117211200) at
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140527117211200, signo=signo@entry=6) at
./nptl/pthread_kill.c:89
#3  0x7fcf0881a476 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#4  0x7fcf088007f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x558f48524feb in ha_panic () at src/debug.c:333
#6  0x558f486be1af in wdt_handler (sig=14, si=0x7fcf04e4d3f0,
arg=0x7fcf04e4d2c0) at src/wdt.c:122
#7  <signal handler called>
#8  0x558f48469895 in pattern_exec_match (head=0x558f69fe67b8,
smp=0x7fcf04e4e080, fill=0) at src/pattern.c:2498
#9  0x558f485a0bf2 in acl_exec_cond (cond=0x558f69fe69a0,
px=0x558f69f67cf0, sess=0x7fce50243cd0, strm=0x7fce5020a4b0, opt=6) at
src/acl.c:1123
#10 0x558f483b5eea in spoe_encode_message (s=0x7fce5020a4b0,
ctx=0x7fce5026ddf0, msg=0x558f69fe2650, dir=0, buf=0x7fcf04e4e1b0,
end=0x7fce502a627c "") at src/flt_spoe.c:2185
#11 0x558f483b6259 in spoe_encode_messages (s=0x7fce5020a4b0,
ctx=0x7fce5026ddf0, messages=0x558f69fe23a0, dir=0, type=1) at
src/flt_spoe.c:2285
#12 0x558f483b75a4 in spoe_process_messages (s=0x7fce5020a4b0,
ctx=0x7fce5026ddf0, messages=0x558f69fe23a0, dir=0, type=1) at
src/flt_spoe.c:2726
#13 0x558f483b79d0 in spoe_process_event (s=0x7fce5020a4b0,
ctx=0x7fce5026ddf0, ev=SPOE_EV_ON_HTTP_REQ_FE) at src/flt_spoe.c:2842
#14 0x558f483b8eeb in spoe_chn_pre_analyze (s=0x7fce5020a4b0,
filter=0x7fce5025c920, chn=0x7fce5020a4c0, an_bit=16) at src/flt_spoe.c:3324
#15 0x558f48549976 in flt_pre_analyze (s=0x7fce5020a4b0,
chn=0x7fce5020a4c0, an_bit=16) at src/filters.c:764
#16 0x558f4836aaff in process_stream (t=0x7fce50242f30,
context=0x7fce5020a4b0, state=260) at src/stream.c:1918
#17 0x558f4857aa3c in run_tasks_from_lists (budgets=0x7fcf04e4e5a0) at
src/task.c:597
#18 0x558f4857b485 in process_runnable_tasks () at src/task.c:838
#19 0x558f484d0eb5 in run_poll_loop () at src/haproxy.c:2630
#20 0x558f484d1423 in run_thread_poll_loop (data=0x5) at
src/haproxy.c:2805
#21 0x7fcf0886cb43 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:442
#22 0x7fcf088fea00 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

haproxy -vv
HAProxy version 2.4.22-f8e3218 2023/02/14 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.22.html
Running on: Linux 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC
2023 x86_64
Build options :
  TARGET  = linux-glibc
  CPU = generic
  CC  = cc
  CFLAGS  = -O0 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv
-Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered
-Wno-missing-field-initializers -Wtype-limits -Wshift-negative-value
-Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -DMAX_SESS_STKCTR=12
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_GETADDRINFO=1 USE_OPENSSL=1
USE_LUA=1 USE_ZLIB=1 USE_DL=1 USE_PROMEX=1
  DEBUG   =

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY
+CRYPT_H -DEVICEATLAS +DL +EPOLL -EVPORTS +FUTEX +GETADDRINFO -KQUEUE
+LIBCRYPT +LINUX_SPLICE +LINUX_TPROXY +LUA -MEMORY_PROFILING +NETFILTER +NS
-OBSOLETE_LINKER +OPENSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL
+PRCTL -PRIVATE_CACHE -PROCCTL +PROMEX -PTHREAD_PSHARED -QUIC +RT -SLZ
-STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY
-WURFL +ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1t  7 Feb 2023
Running on OpenSSL version : OpenSSL 1.1.1t  7 Feb 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent 

Re: 2.4.22 and maxconn setting problem

2023-03-22 Thread Maciej Zdeb
Tue, 21 Mar 2023 at 15:03, Willy Tarreau wrote:

> What you *seem* to be missing in these logs is "%bc" and "%bq" to see
> the state of the backend itself. My suspicion is that %bq is not zero
> despite %bc being zero.
>
I will add these to the logs, experiment with balance source, and will return
with more data when the issue occurs again.
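For reference, a hedged sketch of a log-format line carrying the fields Willy
mentions (the surrounding fields are only illustrative; %bc/%bq/%sc/%sq are
the point here):

    # %bc = backend concurrent connections, %bq = backend queue length,
    # %sc = server concurrent connections, %sq = server queue length
    log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B bc=%bc bq=%bq sc=%sc sq=%sq %tsc"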

Thanks Willy!


Re: 2.4.22 and maxconn setting problem

2023-03-21 Thread Maciej Zdeb
Tue, 21 Mar 2023 at 11:39, Willy Tarreau wrote:

> Just to be clear on these last few points, when you say you cannot
> connect, you mean in fact that the connection establishes to haproxy
> but your request cannot reach the server, right ? 503 will indeed
> indicate a failure to find a server or a connection that died in the
> queue.

Correct! The TCP connection from the client to haproxy is established and
haproxy returns 503 (termination state sQ--).


> Does your stats page indicate that for the servers or the
> backend there are still connections in the queue ?

When the traffic was directed to another data center, the stats page on the
affected server showed that there was no connection/traffic and HAProxy
still could not connect to the backend server (sQ--).


> A test could be useful, to force the LB algorithm to something determinist
> (e.g.
> "balance source").
>
I will check if it is possible to balance by source. This frontend serves
around 8000 rps and I'm not sure whether changing the "ratio" algorithm to
"balance source" won't cause any trouble (with customers behind a huge NAT).

Sample log of a request that was made AFTER the traffic switch to another
data center.

{
  "_source": {
    "status_code": "503",
    "time_queue": "1001",                      <- timeout queue 1s
    "time_sess_tot": "1012",
    "ssl_cipher": "TLS_AES_256_GCM_SHA384",
    "memb_conc_conn": "0",                     <- %sc (server concurrent connections)
    "client_port": "49681",
    "time_req_headers": "11",
    "member": "-:-",                           <- member IP was not selected
    "time_data_transmission": "-1",
    "time_req_active": "1001",
    "method": "HEAD",
    "termination_state": "sQ--",               <- looks like the queue is full
    "time_tcp_memb": "-1",
    "time_ssl": "11",
    "time_resp": "-1",
    "time_idle": "0",
    "member_name": "",
    "bytes_read_h": "206",
    "bytes_uploaded_h": "98",
    "time_req_from_firstbyte": "0",
    "ssl_ver": "TLSv1.3",
    "http_ver": "HTTP/2.0"
  }
}


2.4.22 and maxconn setting problem

2023-03-21 Thread Maciej Zdeb
Hi,

I'm observing a strange issue with haproxy 2.4.22 (but it was also present on
previous versions).

I have set maxconn to 20 in the global and defaults configuration sections,
and with the following configuration

frontend front
    mode http
    option http-keep-alive

    bind 10.0.0.10:443 ssl crt /etc/cert/crt.pem alpn h2,http/1.1 process 1/1
    bind 10.0.0.10:443 ssl crt /etc/cert/crt.pem alpn h2,http/1.1 process 1/2
    ...
    bind 10.0.0.10:443 ssl crt /etc/cert/crt.pem alpn h2,http/1.1 process 1/20
    default_backend back

backend back
    option http-keep-alive
    mode http
    http-reuse always
    option httpchk GET /health HTTP/1.0\r\nHost:\ ttt.local
    http-check expect string OK
    timeout queue 1s
    default-server maxconn 2000

    default-server resolve-prefer ipv4 resolvers default-dns
    server slot_0_checker 10.0.0.82:31011 check weight 0 disabled cookie slot_0_checker
    server slot_1_checker 10.0.0.236:31011 check weight 0 disabled cookie slot_1_checker
    server slot_0_0 10.0.0.82:31011 source ${SNAT_741_0} track slot_0_checker weight 50 disabled cookie slot_0_0
    server slot_1_0 10.0.0.236:31011 source ${SNAT_741_0} track slot_1_checker weight 51 disabled cookie slot_1_0

I'm experiencing a situation in which clients cannot connect (termination
state CQ-- or sQ--), which is expected when the traffic is high (maxconn
2000 on each server) and HAProxy is using much more CPU. However, after such
an event, when traffic is lower or when I cut off the traffic completely
(CPU idle is almost 100%), I still cannot connect through the proxy. The
termination state is still sQ-- and I receive 503 in response, while the
stats page and CLI report that there are no connections to the proxy.

Am I missing something or is it a bug?

Kind regards,
Maciej


Re: Peers using heavily single cpu core

2022-05-23 Thread Maciej Zdeb
Hi Christopher,
I've verified that outgoing connections are now spread between multiple
threads! Thank you very much!

23 : st=0x000121(cl heopI W:sRa R:srA) tmask=0x8 umask=0x0
owner=0x56219da29280 iocb=0x56219cdc8730(sock_conn_iocb) back=0
cflg=0x0300 fam=ipv4 lport=1024 rport=52046 fe=hap1 mux=PASS
ctx=0x7f4c2002f410 xprt=RAW
24 : st=0x000121(cl heopI W:sRa R:srA) tmask=0x20 umask=0x0
owner=0x7f4c2802a7a0 iocb=0x56219cdc8730(sock_conn_iocb) back=0
cflg=0x0300 fam=ipv4 lport=1024 rport=55428 fe=hap1 mux=PASS
ctx=0x7f4c2c02f200 xprt=RAW
25 : st=0x000121(cl heopI W:sRa R:srA) tmask=0x1 umask=0x0
owner=0x7f4c2c0260b0 iocb=0x56219cdc8730(sock_conn_iocb) back=0
cflg=0x0300 fam=ipv4 lport=1024 rport=51190 fe=hap1 mux=PASS
ctx=0x56219da24180 xprt=RAW
26 : st=0x010121(cL heopI W:sRa R:srA) tmask=0x20 umask=0x0
owner=0x7f4c2c026ac0 iocb=0x56219cdc8730(sock_conn_iocb) back=1
cflg=0x1300 fam=ipv4 lport=34454 rport=1024 px=hap1 mux=PASS
ctx=0x7f4c2c026610 xprt=RAW
27 : st=0x000121(cl heopI W:sRa R:srA) tmask=0x4 umask=0x0
owner=0x7f4c40026cf0 iocb=0x56219cdc8730(sock_conn_iocb) back=0
cflg=0x0300 fam=ipv4 lport=1024 rport=50226 fe=hap1 mux=PASS
ctx=0x7f4c3002eb60 xprt=RAW

Kind regards,

Tue, 17 May 2022 at 16:25, Christopher Faulet wrote:

> On 4/20/22 at 14:51, Maciej Zdeb wrote:
> > Hi Willy,
> > I saw Christopher changes are now merged. I was wondering how to proceed
> with my
> > issue. Right now in stream_new() I'm able to get cs_endpoint and appctx
> (if
> > endpoint is applet), so I can get thread_mask of appctx to create a
> stream task
> > on the same thread. Is this approach correct?
> >
>
> Hi Maciej,
>
> I've finally finished my applet refactoring. Now it is possible to choose
> where to start an applet. I've also merged your patches. Peer applets must
> now be balanced across threads. So, you may give it a try to be sure it
> solves your issue and also to validate that everything works fine.
>
> --
> Christopher Faulet
>


Re: Session terminated with status IH-- after upgrade to 2.4.16

2022-05-12 Thread Maciej Zdeb
OK, I've found the GitHub issue
https://github.com/haproxy/haproxy/issues/1684; sorry for the duplicate.

Kind regards,
Maciej

Thu, 12 May 2022 at 13:34, Maciej Zdeb wrote:

> Hi,
> after the HAProxy upgrade from 2.4.15 to 2.4.16 we observe in logs that
> very small percent of sessions are terminated with IH-- status.
>
> Docs say that "I" should never happen and should be reported. All the
> terminated sessions used one of the POST/PATCH/PUT methods and all of them
> were on HTTP/2.0.
>
> I'm not able to replicate it manually, but maybe you will know which patch
> might have caused it.
>
> haproxy -v
> HAProxy version 2.4.16-9d532c4 2022/04/29 - https://haproxy.org/
> Status: long-term supported branch - will stop receiving fixes around Q2
> 2026.
> Known bugs: http://www.haproxy.org/bugs/bugs-2.4.16.html
> Running on: Linux 4.15.0-173-generic #182-Ubuntu SMP Fri Mar 18 15:53:46
> UTC 2022 x86_64
>
> Kind regards,
> Maciej
>


Session terminated with status IH-- after upgrade to 2.4.16

2022-05-12 Thread Maciej Zdeb
Hi,
after upgrading HAProxy from 2.4.15 to 2.4.16, we observe in the logs that a
very small percentage of sessions are terminated with the IH-- status.

The docs say that "I" should never happen and should be reported. All the
terminated sessions used one of the POST/PATCH/PUT methods and all of them
were on HTTP/2.0.

I'm not able to replicate it manually, but maybe you will know which patch
might have caused it.

haproxy -v
HAProxy version 2.4.16-9d532c4 2022/04/29 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.16.html
Running on: Linux 4.15.0-173-generic #182-Ubuntu SMP Fri Mar 18 15:53:46
UTC 2022 x86_64

Kind regards,
Maciej


Re: Peers using heavily single cpu core

2022-04-20 Thread Maciej Zdeb
No worries, thanks for the update.

Kind regards,
Maciej

Wed, 20 Apr 2022 at 21:59, Willy Tarreau wrote:

> Hi Maciej,
>
> On Wed, Apr 20, 2022 at 02:51:32PM +0200, Maciej Zdeb wrote:
> > Hi Willy,
> > I saw Christopher changes are now merged. I was wondering how to proceed
> > with my issue. Right now in stream_new() I'm able to get cs_endpoint and
> > appctx (if endpoint is applet), so I can get thread_mask of appctx to
> > create a stream task on the same thread. Is this approach correct?
>
> It's still not possible at this stage. The related modifications still
> have to be done. We've discussed this today and it still seems
> possible, but it's difficult to estimate if it will be done easily or
> if we'll discover new showstoppers. I'd suggest that you wait a bit
> more, I'm afraid.
>
> Cheers,
> Willy
>


Re: Peers using heavily single cpu core

2022-04-20 Thread Maciej Zdeb
Hi Willy,
I saw Christopher's changes are now merged. I was wondering how to proceed
with my issue. Right now in stream_new() I'm able to get the cs_endpoint and
the appctx (if the endpoint is an applet), so I can get the thread_mask of the
appctx to create the stream task on the same thread. Is this approach correct?

Kind regards

Thu, 7 Apr 2022 at 22:32, Maciej Zdeb wrote:

> Hi Willy,
> I adjusted the 3rd patch to use HA_ATOMIC_INC/HA_ATOMIC_DEC. Thanks!
>
> Thu, 7 Apr 2022 at 14:02, Willy Tarreau wrote:
>
>> There are some ongoing changes in that area that are about to be merged
>> (probably early
>> next week) and that will help for this task, so we'll have a closer look,
>> and it's very likely that we could proceed like this soon.
>>
>
> Great! I must admit that I imagined a stream_new_on() function but after
> looking at stream_new() I thought I'd better ask for help. ;-)
>
> Kind regards,
> Maciej
>
>


Re: Peers using heavily single cpu core

2022-04-07 Thread Maciej Zdeb
Hi Willy,
I adjusted the 3rd patch to use HA_ATOMIC_INC/HA_ATOMIC_DEC. Thanks!

Thu, 7 Apr 2022 at 14:02, Willy Tarreau wrote:

> There are some ongoing changes in that area that are about to be merged
> (probably early
> next week) and that will help for this task, so we'll have a closer look,
> and it's very likely that we could proceed like this soon.
>

Great! I must admit that I imagined a stream_new_on() function but after
looking at stream_new() I thought I'd better ask for help. ;-)

Kind regards,
Maciej


0002-MINOR-peers-Track-number-of-applets-run-by-thread.patch
Description: Binary data


0003-MINOR-peers-Balance-applets-across-threads.patch
Description: Binary data


0001-MINOR-applet-Add-function-to-create-applet-on-select.patch
Description: Binary data


Re: Peers using heavily single cpu core

2022-04-06 Thread Maciej Zdeb
Hi Willy,

I've managed to implement the easy part (attached patches). However, this is
not enough to get it working properly. I'm stuck debugging the following
stack trace (the BUG_ON in the task_queue() function):

FATAL: bug condition "task->thread_mask != tid_bit" matched at
include/haproxy/task.h:320
  call trace(12):
  | 0x55f921169be3 [c7 04 25 01 00 00 00 00]: main+0x112f53
  | 0x55f92116c4ee [48 8b 43 08 48 8b 78 10]: si_applet_wake_cb+0x7e/0xf3
  | 0x55f9211cc276 [48 8b 43 08 48 8b 40 10]: task_run_applet+0x196/0x5bd
  | 0x55f92118e6ca [48 89 c5 eb 0f 90 4c 89]:
run_tasks_from_lists+0x3aa/0x871
  | 0x55f92118ef73 [29 44 24 14 8b 7c 24 14]:
process_runnable_tasks+0x3d3/0x892
  | 0x55f92115eb44 [83 3d f9 ef 1d 00 01 0f]: run_poll_loop+0x124/0x3d4
  | 0x55f92115efc9 [48 8b 1d d0 94 12 00 4c]: main+0x108339
  | 0x55f9210589ff [31 c0 e8 ba a6 16 00 31]: main+0x1d6f/0x296a

I'm not sure why it happens. I suspect it is because session_new() in
peer_session_create() creates another task on the current thread. If a
session task tries to wake up a peer applet task, that would explain the
stack trace; however, I've failed to fully understand the "flow" between
applet and session (and how/when task_queue() is called in this flow). I
would appreciate any help or tips on how to proceed.

Kind regards,

Fri, 1 Apr 2022 at 14:34, Willy Tarreau wrote:

> On Fri, Apr 01, 2022 at 11:23:34AM +0200, Maciej Zdeb wrote:
> > > I remember mentioning something about trying to change this in the
> future.
> > > For outgoing connections I think we could try to figure where peers
> > > connections are and always use one least represented thread.
> > >
> > If I understand the code correctly, the peer_session_create() function
> > creates appctx by appctx_new() which allocates all outgoing connections
> on
> > the current thread "appctx->t = task_new_here();"?
> > So changing peer_session_create() into:
> >
> > static struct appctx *peer_session_create(struct peers *peers, struct
> peer
> > *peer, int thread) {
> > ...
> > appctx = appctx_new_on(&peer_applet, thread);
> > if (!appctx)
> > goto out_close;
> > ...
> > }
> >
> > where appctx_new_on() is a new function that creates an applet on a
> > specified thread would fix my issue? I'm worrying it's not that easy and
> > some locking here and there might be needed. ;-)
>
> Indeed, I think it would work without problems! In this case you could
> even add an array of MAXTHREADS to the peers struct to count the number
> of applets per thread, then peer_session_create() could just scan the
> threads to find the least loaded one and use this one. This would easily
> do the job and sounds simple and efficient enough to get into 2.6.
>
> Please, if you go that way, think about splitting your work in at least
> 3 patches:
>   - 1 for appctx_new_on()
>   - 1 to maintain the applet count per peers struct
>   - 1 to use that counter to start the applet on the best thread
>
> You may find that you need to implement more stuff, but you get the idea.
>
> thanks,
> Willy
>
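For illustration only, a hedged C sketch of the per-thread counting Willy
describes above; the struct and function names are made up and this is not
the code that was eventually merged:

#define MAX_THREADS 64

/* Hypothetical: a per-thread counter of peer applets kept alongside the
 * peers struct ("applet_count" is an illustrative field, not a real member). */
struct peers_sketch {
    unsigned int applet_count[MAX_THREADS];  /* applets currently running per thread */
};

/* Scan the threads and return the least loaded one; peer_session_create()
 * would then start the outgoing applet there (e.g. via appctx_new_on()). */
static int peers_pick_thread(const struct peers_sketch *p, int nbthread)
{
    int thr, best = 0;

    for (thr = 1; thr < nbthread; thr++)
        if (p->applet_count[thr] < p->applet_count[best])
            best = thr;
    return best;
}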


0001-MINOR-applet-Add-function-to-create-applet-on-select.patch
Description: Binary data


0002-MINOR-peers-Track-number-of-applets-run-by-thread.patch
Description: Binary data


0003-MINOR-peers-Balance-applets-across-threads.patch
Description: Binary data


Re: Peers using heavily single cpu core

2022-04-01 Thread Maciej Zdeb
Fri, 1 Apr 2022 at 09:06, Willy Tarreau wrote:

> I seem to remember a discussion about this in the past and that the
> conclusion basically was that outgoing connections are sent on the
> "current" thread when the management task creates them, while incoming
> connections are balanced depending on the overall process' load.

Yes, that's what I'm observing. High CPU usage is caused by outgoing
connections (visible in the "show fd" dump: rport=1024), and currently I don't
mind the incoming ones because they are balanced pretty well.


> I remember mentioning something about trying to change this in the future.
> For outgoing connections I think we could try to figure where peers
> connections are and always use one least represented thread.
>
If I understand the code correctly, the peer_session_create() function
creates the appctx via appctx_new(), which allocates all outgoing connections
on the current thread ("appctx->t = task_new_here();")?
So changing peer_session_create() into:

static struct appctx *peer_session_create(struct peers *peers, struct peer
*peer, int thread) {
...
appctx = appctx_new_on(&peer_applet, thread);
if (!appctx)
goto out_close;
...
}

where appctx_new_on() is a new function that creates an applet on a
specified thread, would that fix my issue? I'm worried it's not that easy and
some locking here and there might be needed. ;-)

Kind regards,
Maciej



> Willy
>


Re: Peers using heavily single cpu core

2022-04-01 Thread Maciej Zdeb
Hi Willy,
Thanks for the patch! Now I'll be able to rebalance peers connections
manually. Do you know why so many of them are landing on the same thread?

Kind regards

Thu, 31 Mar 2022 at 15:16, Willy Tarreau wrote:

> On Thu, Mar 31, 2022 at 11:56:13AM +0200, Willy Tarreau wrote:
> > > I've made some "session shutdown" tests. With simple configuration
> > > (attached) it is very easy to reproduce a situation when "shutdown
> session"
> > > works only for the first time.
> >
> > Indeed, thanks for this, I can as well reproduce it. With a bit of luck
> > I'll figure what's happening, hoping I won't unearth an ugly bug!
>
> Now found and fixed. It's been bogus since we've split the list of streams
> to be per-thread in 2.4, patch attached if you need it.
>
> Thanks,
> Willy
>


Re: Peers using heavily single cpu core

2022-03-16 Thread Maciej Zdeb
Hi Willy,
did you find anything?

I've made some "session shutdown" tests. With simple configuration
(attached) it is very easy to reproduce a situation when "shutdown session"
works only for the first time.

1. Kill haproxy to have a clean start:
root@hap-rnd-1a:/etc/haproxy# killall haproxy

2. Show current sessions:
root@hap-rnd-1a:/etc/haproxy# echo "show sess" | socat
unix-connect:/var/run/haproxy/haproxy1.sock stdio
0x562291719560: proto=tcpv4 src=10.0.0.3:5276 fe=hap-rnd-1a be=hap-rnd-1a
srv= ts=00 epoch=0 age=4s calls=1 rate=0 cpu=0 lat=0
rq[f=c48202h,i=0,an=00h,rx=,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,200048h,fd=29,ex=] s1=[8,204058h,fd=-1,ex=] exp=
0x562291719fa0: proto=unix_stream src=unix:1 fe=GLOBAL be= srv=
ts=00 epoch=0 age=0s calls=2 rate=2 cpu=0 lat=0
rq[f=c0c020h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008002h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,280008h,fd=30,ex=] s1=[8,204018h,fd=-1,ex=] exp=30s

3. Kill peer session 0x562291719560
root@hap-rnd-1a:/etc/haproxy# echo "shutdown session 0x562291719560" |
socat unix-connect:/var/run/haproxy/haproxy1.sock stdio

4. Show current sessions - confirm peer connection shutdown
root@hap-rnd-1a:/etc/haproxy# echo "show sess" | socat
unix-connect:/var/run/haproxy/haproxy1.sock stdio
0x7f2628026220: proto=unix_stream src=unix:1 fe=GLOBAL be= srv=
ts=00 epoch=0x1 age=0s calls=1 rate=1 cpu=0 lat=0
rq[f=c08000h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,240008h,fd=29,ex=] s1=[8,204018h,fd=-1,ex=] exp=

5. Show current sessions - confirm the new peer connection (note that the new
connection has the same id 0x7f2628026220 as the unix_stream in the previous
output, but that is probably a coincidence)
root@hap-rnd-1a:/etc/haproxy# echo "show sess" | socat
unix-connect:/var/run/haproxy/haproxy1.sock stdio
0x7f2628026220: proto=tcpv4 src=10.0.0.3:5288 fe=hap-rnd-1a be=hap-rnd-1a
srv= ts=00 epoch=0x2 age=3s calls=1 rate=0 cpu=0 lat=0
rq[f=c48200h,i=0,an=00h,rx=,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,200048h,fd=29,ex=] s1=[8,204058h,fd=-1,ex=] exp=
0x7f263c026220: proto=unix_stream src=unix:1 fe=GLOBAL be= srv=
ts=00 epoch=0x2 age=0s calls=1 rate=1 cpu=0 lat=0
rq[f=c08000h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008002h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,240008h,fd=30,ex=] s1=[8,204018h,fd=-1,ex=] exp=

6. Again kill peer session 0x7f2628026220
root@hap-rnd-1a:/etc/haproxy# echo "shutdown session 0x7f2628026220" |
socat unix-connect:/var/run/haproxy/haproxy1.sock stdio

7. Show current sessions - note that 0x7f2628026220 was not killed in the
previous step
root@hap-rnd-1a:/etc/haproxy# echo "show sess" | socat
unix-connect:/var/run/haproxy/haproxy1.sock stdio
0x7f267c026a60: proto=unix_stream src=unix:1 fe=GLOBAL be= srv=
ts=00 epoch=0x3 age=0s calls=1 rate=1 cpu=0 lat=0
rq[f=c08000h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008000h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,240008h,fd=30,ex=] s1=[8,204018h,fd=-1,ex=] exp=
0x7f2628026220: proto=tcpv4 src=10.0.0.3:5288 fe=hap-rnd-1a be=hap-rnd-1a
srv= ts=00 epoch=0x2 age=17s calls=1 rate=0 cpu=0 lat=0
rq[f=c48202h,i=0,an=00h,rx=,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,200048h,fd=29,ex=] s1=[8,204058h,fd=-1,ex=] exp=

8. Kill peer session 0x7f2628026220
root@hap-rnd-1a:/etc/haproxy# echo "shutdown session 0x7f2628026220" |
socat unix-connect:/var/run/haproxy/haproxy1.sock stdio

9. Show session - again no effect
root@hap-rnd-1a:/etc/haproxy# echo "show sess" | socat
unix-connect:/var/run/haproxy/haproxy1.sock stdio
0x7f2628026220: proto=tcpv4 src=10.0.0.3:5288 fe=hap-rnd-1a be=hap-rnd-1a
srv= ts=00 epoch=0x2 age=22s calls=1 rate=0 cpu=0 lat=0
rq[f=c48202h,i=0,an=00h,rx=,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,200048h,fd=29,ex=] s1=[8,204058h,fd=-1,ex=] exp=
0x7f261c026220: proto=unix_stream src=unix:1 fe=GLOBAL be= srv=
ts=00 epoch=0x4 age=0s calls=1 rate=1 cpu=0 lat=0
rq[f=c08000h,i=0,an=00h,rx=,wx=,ax=] rp[f=80008002h,i=0,an=00h,rx=,wx=,ax=]
s0=[8,240008h,fd=30,ex=] s1=[8,204018h,fd=-1,ex=] exp=

Kind regards,

Sat, 12 Mar 2022 at 00:20, Willy Tarreau wrote:

> On Fri, Mar 11, 2022 at 10:19:09PM +0100, Maciej Zdeb wrote:
> > Hi Willy,
> >
> > Thank you for such useful info! I've checked the worst HAProxy nodes and
> on
> > every such node all outgoing peers connections are run on the same
> thread:
> (...)
> Indeed. I was pretty sure we were setting them to any thread on creation
> but maybe I'm wrong, I'll have to recheck.
>
> > On one node I was able to rebalance it, but on the node above (and other
> > nodes) I'm not able to shutdown the sessions:
> (...)
> > echo "shutdown session 0x7f0aa402e2c0" | socat
> > unix-connect:/var/run/haproxy.sock stdio
> >
> > echo "show sess 0x7f0aa402e2c0" | socat
> unix-connect:/var/run/haproxy.sock
> > stdio
> > 0x7f0aa402e2c0: [11/Mar/2022:07:17:

Re: Peers using heavily single cpu core

2022-03-11 Thread Maciej Zdeb
8200 ctx=(nil)
  req=0x7f0aa402e2d0 (f=0x848202 an=0x0 pipe=0 tofwd=-1 total=9491961149)
  an_exp= rex=5s wex=
  buf=0x7f0aa402e2d8 data=(nil) o=0 p=0 i=0 size=0
  res=0x7f0aa402e330 (f=0x80448202 an=0x0 pipe=0 tofwd=-1 total=9322481446)
  an_exp= rex= wex=
  buf=0x7f0aa402e338 data=(nil) o=0 p=0 i=0 size=0

Fri, 11 Mar 2022 at 18:09, Willy Tarreau wrote:

> Hi Maciej,
>
> On Fri, Mar 04, 2022 at 01:10:37PM +0100, Maciej Zdeb wrote:
> > Hi,
> >
> > I'm experiencing high CPU usage on a single core, idle drops below 40%
> > while other cores are at 80% idle. Peers cluster is quite big (12
> servers,
> > each server running 12 threads) and is used for synchronization of 3
> > stick-tables of 1 million entries size.
> >
> > Is peers protocol single threaded and high usage on single core is
> expected
> > or am I experiencing some kind of bug? I'll keep digging to be sure
> nothing
> > else from my configuration is causing the  issue.
>
> Almost. More precisely, each peers connection runs on a single thread
> at once (like any connection). Some such connections may experience
> heavy protocol parsing, so it may be possible that sometimes you end up
> with an unbalanced number of connections on threads. It's tricky though,
> because we could imagine a mechanism to try to evenly spread the outgoing
> peers connection on threads but the incoming ones will land where they
> land.
>
> That's something you can check with "show sess" and/or "show fd", looking
> for those related to your peers and checking their thread_mask. If you
> find two on the same thread (same thread_mask), you can shut one of them
> down using "shutdown session " and it will reconnect, possibly on
> another thread. That could confirm that it's the root cause of the
> problem you're experiencing.
>
> I'm wondering if we shouldn't introduce a notion of "heavy connection"
> like we already have heavy tasks for the scheduler. These ones would
> be balanced differently from others. Usually they're long-lived and
> can eat massive amounts of CPU so it would make sense. The only ones
> I'm thinking about are the peers ones but the concept would be more
> portable than focusing on peers. Usually such long connections are not
> refreshed often so we probably prefer to spend time carefully picking
> the best thread rather than saving 200ns of processing and having to
> pay it for the whole connection's life time.
>
> Willy
>


Peers using heavily single cpu core

2022-03-04 Thread Maciej Zdeb
Hi,

I'm experiencing high CPU usage on a single core: idle drops below 40%
while other cores are at 80% idle. The peers cluster is quite big (12 servers,
each running 12 threads) and is used for synchronization of 3 stick-tables of
1 million entries each.

Is the peers protocol single-threaded, so that high usage on a single core is
expected, or am I experiencing some kind of bug? I'll keep digging to be sure
nothing else in my configuration is causing the issue.

haproxy -v
HAProxy version 2.4.14-eaa786f 2022/02/25 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.14.html
Running on: Linux 4.15.0-154-generic #161-Ubuntu SMP Fri Jul 30 13:04:17
UTC 2021 x86_64

Kind regards,
Maciej Zdeb


Re: HAPROXY CAN NOT POINT IN TO PORT 5000 OF PATRONI

2021-04-22 Thread Maciej Zdeb
Hi,
try removing these two lines from the config:
option httpchk
http-check expect status 200

It is postgres (a tcp backend); you should not expect an HTTP response on the
health check.
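For illustration, a hedged sketch of the poster's listen section with those
two lines removed, as suggested above (a plain TCP check against port 8008;
addresses and names are copied from the quoted config further down, and
whether this check policy suits Patroni is for the poster to decide):

listen postgres
    mode tcp
    bind *:5000
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_10.128.0.10_5432 10.128.0.10:5432 maxconn 100 check port 8008
    server postgresql_10.128.0.11_5432 10.128.0.11:5432 maxconn 100 check port 8008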

Kind regards,
Maciej

Thu, 22 Apr 2021 at 02:38, thủy bùi wrote:

> I have opened all ports by setting firewall rules, but I still get the same
> error
>
> On Wed, 21 Apr 2021 at 23:01, thủy bùi wrote:
>
>> I have changed my configuration to TCP mode as below and restarted haproxy
>> but still got the same error:
>> global
>>     maxconn 100
>> defaults
>>     log global
>>     mode tcp
>>     retries 2
>>     timeout client 30m
>>     timeout connect 4s
>>     timeout server 30m
>>     timeout check 5s
>> listen stats
>>     mode tcp
>>     bind *:8009
>>     stats enable
>>     stats uri /
>> listen postgres
>>     bind *:5000
>>     option httpchk
>>     http-check expect status 200
>>     default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
>>     server postgresql_10.128.0.10_5432 10.128.0.10:5432 maxconn 100 check port 8008
>>     server postgresql_10.128.0.11_5432 10.128.0.11:5432 maxconn 100 check port 8008
>>
>> On Wed, 21 Apr 2021 at 22:42, Jarno Huuskonen <jarno.huusko...@uef.fi> wrote:
>>
>>> Hi,
>>>
>>> On Wed, 2021-04-21 at 16:27 +0100, Andrew Smalley wrote:
>>> > From the look of your configuration you are using HTTP mode; for
>>> > PostgreSQL, you will need a TCP VIP.
>>> >
>>> > I noted this because of the HTTP check.
>>> >
>>> > Try using "mode tcp".
>>> >
>>>
>>> defaults has mode tcp:
>>>
>>> defaults
>>> log global
>>> mode tcp
>>> ...
>>>
>>> -Jarno
>>>
>>> >
>>> > On Wed, 21 Apr 2021 at 16:25, Jarno Huuskonen 
>>> > wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > On Wed, 2021-04-21 at 21:55 +0700, thủy bùi wrote:
>>> > > > Dear HAproxy dev,I have install all the requirement of HAproxy
>>> into the
>>> > > > system alongside with patroni and etcd, but finding error while
>>> call
>>> > > > into
>>> > > > port 5000.
>>> > > > The information is provided as below.
>>> > > > Please help me find out the issue.
>>> > > > I have running HAproxy successfully
>>> > > >
>>> > > > But unable to connect to my database throught port 5000
>>> > > >
>>> > > >
>>> > > > psql: error: server closed the connection unexpectedly
>>> > > > This probably means the server terminated abnormally
>>> > > > before or while processing the request.
>>> > > > What is your configuration?
>>> > > >
>>> > > ...
>>> > >
>>> > > Does haproxy run when you run it from cli
>>> > > (haproxy -d -f /path/to/yourconfig.cfg) ?
>>> > >
>>> > > And do you have errors in your logs ?
>>> > >
>>> > > > Linux postgre02 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16
>>> 18:28:22
>>> > > > UTC
>>> > > > 2021 x86_64 x86_64 x86_64 GNU/Linux
>>> > >
>>> > > Looks like you're running on CentOS/RHEL 7 ? Do you have selinux
>>> enabled
>>> > > (getenforce) ? You'll probably need to allow haproxy to connect to
>>> all
>>> > > ports
>>> > > (or allow required ports). (setsebool -P haproxy_connect_any=On might
>>> > > help).
>>> > >
>>> > > (Your logs should show if connections are denied).
>>> > >
>>> > > -Jarno
>>> > >
>>> > > --
>>> > > Jarno Huuskonen
>>>
>>>
>>
>> --
>> *BUI THANH THUY*
>> Tel: 0348672994
>> Email: buithuy.13...@gmail.com
>>
> --
> *BUI THANH THUY*
> Tel: 0348672994
> Email: buithuy.13...@gmail.com
>


Re: [2.2.11] 100% CPU again

2021-04-22 Thread Maciej Zdeb
Wed, 21 Apr 2021 at 13:53, Christopher Faulet wrote:

> The fix was merge in upstream :
>
> * BUG/MAJOR: mux-h2: Properly detect too large frames when decoding headers
>(http://git.haproxy.org/?p=haproxy.git;a=commit;h=07f88d75)
>

Thanks! Building 2.2.13 and 2.3.9 with the patches. I'll return with feedback
if anything pops up again.

Kind regards,


Re: [2.2.11] 100% CPU again

2021-04-21 Thread Maciej Zdeb
I'm very happy you managed to reproduce a similar issue! :)

Does it affect 2.3.9? I've experienced 100% CPU on it as well. But on 2.3, ALL
threads loop on _do_poll/epoll_wait and the threads did not hang on a
particular h2 session. :( I'll double-check and return in a new thread for
2.3.9, because for now it looks like a different bug.

Kind regards,
Maciej

Tue, 20 Apr 2021 at 18:38, Christopher Faulet wrote:

> On 19/04/2021 at 19:54, Maciej Zdeb wrote:
> > Hi,
> >
> > After a couple weeks running HAProxy 2.2.11, one server started to loop
> on
> > thread 9. If I'm reading this correctly something went wrong on h2c at
> > 0x7fd7b08d0530.
> >
> [...]
> >
> > ### (gdb) p *((struct h2c *)0x7fd7b08d0530)
> > $1 = {conn = 0x7fd785b87990, st0 = H2_CS_FRAME_P, errcode =
> H2_ERR_NO_ERROR,
> > flags = 8, streams_limit = 100, max_id = 0, rcvd_c = 0, rcvd_s = 0,
> >ddht = 0x7fd7b0820d60, dbuf = {size = 16384,
> >  area = 0x7fd7b1dec0b0
> >
> "¤\226\205\064\f\212jܧâ\201\004A\fN\177jC]t\027\221cÌd°à\033\\+\205µ?¬Ø÷èÏô¥\006êU1\024\235O",
>
> > data = 16384, head = 48},
> >dsi = 1, dfl = 16383, dft = 1 '\001', dff = 1 '\001', dpl = 0 '\000',
> > last_sid = -1, mbuf = {{size = 32,
> >area = 0x2 , data =
> 1, head =
> > 1}, {size = 0, area = 0x0, data = 0, head = 0} , {
> >size = 0, area = 0x0, data = 0, head = 12}, {size = 1249, area =
> > 0x7fd7b080 "ðHX²×\177", data = 140564347486336, head = 0}}, msi =
> -1, mfl = 0,
> >mft = 32 ' ', mff = -96 ' ', miw = 65535, mws = 65535, mfs = 16384,
> timeout =
> > 2, shut_timeout = 2, nb_streams = 0, nb_cs = 0, nb_reserved = 0,
> >stream_cnt = 0, proxy = 0x5557136cbe60, task = 0x7fd792079be0,
> streams_by_id
> > = {b = {0x0, 0x0}}, send_list = {n = 0x7fd7b08d09d8, p = 0x7fd7b08d09d8},
> >fctl_list = {n = 0x7fd7b08d09e8, p = 0x7fd7b08d09e8}, blocked_list =
> {n =
> > 0x7fd7b08d09f8, p = 0x7fd7b08d09f8}, buf_wait = {target = 0x0, wakeup_cb
> = 0x0,
> >  list = {next = 0x7fd7b08d0a18, prev = 0x7fd7b08d0a18}}, wait_event
> =
> > {tasklet = 0x7fd79e235bf0, events = 0}}
> >
>
> Hi Maciej,
>
> I'm able to reproduce a similar bug hacking the nghttp2 client to send at
> most
> 16383 bytes per frame (instead of 16384). By sending too large headers, we
> are
> falling into a wakeup loop, waiting for more data while the buffer is full.
>
> I have a fix but I must check with Willy how to proceed because I'm not
> 100%
> sure for now.
>
> --
> Christopher Faulet
>


[2.2.11] 100% CPU again

2021-04-19 Thread Maciej Zdeb
Hi,

After a couple of weeks running HAProxy 2.2.11, one server started to loop on
thread 9. If I'm reading this correctly, something went wrong in the h2c at
0x7fd7b08d0530.

### show threads
Thread 1 : id=0x7fd8855a2200 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=24099004674002 now=24099004739108 diff=65106
 curr_task=0
  Thread 2 : id=0x7fd883786700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23776498018239 now=23776498104181 diff=85942
 curr_task=0
  Thread 3 : id=0x7fd882b5b700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23713672613057 now=23713672680973 diff=67916
 curr_task=0
  Thread 4 : id=0x7fd881416700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23752667768434 now=23752667910419 diff=141985
 curr_task=0
  Thread 5 : id=0x7fd880c15700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23726859650451 now=23726859726954 diff=76503
 curr_task=0
  Thread 6 : id=0x7fd7d1488700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=2398264056 now=23982641100021 diff=100065
 curr_task=0
* Thread 7 : id=0x7fd7d0c87700 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=0 wantrdv=0
 cpu_ns: poll=23482398415461 now=23482398534664 diff=119203
 curr_task=0x5557150b7400 (task) calls=1 last=0
   fct=0x55570d2c8f40(task_run_applet) ctx=0x7fd7949a1a20()
 strm=0x7fd87c322d70 src=unix fe=GLOBAL be=GLOBAL dst=
 rqf=c48200 rqa=0 rpf=80008000 rpa=0 sif=EST,28
sib=EST,204018
 af=(nil),0 csf=0x555716465060,8200
 ab=0x7fd7949a1a20,9 csb=(nil),0

 cof=0x7fd791941e90,1300:PASS(0x7fd78411c1f0)/RAW((nil))/unix_stream(1271)
 cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)

  Thread 8 : id=0x7fd7d0486700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23504865560248 now=23504865701145 diff=140897
 curr_task=0
  Thread 9 : id=0x7fd7cfc85700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=0 wantrdv=0
 cpu_ns: poll=24731954111541 now=24731954183811 diff=72270
 curr_task=0x7fd79e235bf0 (tasklet) calls=1301021937
   fct=0x55570d1c1390(h2_io_cb) ctx=0x7fd7b08d0530

  Thread 10: id=0x7fd7cf484700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23544763552254 now=23544763716578 diff=164324
 curr_task=0
  Thread 11: id=0x7fd7cec83700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23918013478538 now=23918013624911 diff=146373
 curr_task=0
  Thread 12: id=0x7fd7ce482700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23657530305687 now=23657530468631 diff=162944
 curr_task=0
  Thread 13: id=0x7fd7cdc81700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23491571084184 now=23491571212288 diff=128104
 curr_task=0
  Thread 14: id=0x7fd7cd480700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=24048462813219 now=24048462979499 diff=166280
 curr_task=0
  Thread 15: id=0x7fd7ccc7f700 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23948328115728 now=23948328275800 diff=160072
 curr_task=0
  Thread 16: id=0x7fd7cc47e700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=24133419414190 now=24133419605024 diff=190834
 curr_task=0
  Thread 17: id=0x7fd7cbc7d700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=24124631248817 now=24124631415673 diff=166856
 curr_task=0
  Thread 18: id=0x7fd7cb47c700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23906001641575 now=23906001827864 diff=186289
 curr_task=0
  Thread 19: id=0x7fd7cac7b700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
 stuck=0 prof=0 harmless=1 wantrdv=0
 cpu_ns: poll=23969313722856 now=23969313891707 diff=168851
 curr_task=0
  Thread 20: id=0x7fd7ca47a700 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
   

Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Maciej Zdeb
Hi Robin,

On Fri, 9 Apr 2021 at 19:26, Robin H. Johnson wrote:

> Maciej had said they were going to create a new thread, but I didn't see
> one yet.
>
> I want to start by noting the problem was much worse on 2.2.8 & 2.2.9, and
> that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
> thanks for that initial work in fixing the issue.
>

After Christopher's patch, which uses another function (one that does not
allocate memory) to dump the Lua stack trace, the situation is much better. It
used 100% CPU only once and only on one thread; however, HAProxy didn't kill
itself like in your situation. Since then the issue has not occurred again, so
I don't have anything useful to share. :(

I'm using HAProxy 2.2.11 but will upgrade to 2.3 next week; I'll report if
anything interesting pops up.


> We have struggled to reproduce the problem in testing environments, it
> only turns up at the biggest regions, and plotting occurances of the
> issue over the time dimension suggest that it might have some partial
> correlation w/ a weird workload input.
>

On our side the issue occurred on a machine with an unusual workload: hundreds
of thousands of TCP connections and very few HTTP requests/responses. We also
use Lua.

Kind regards,
Maciej


Re: [2.2.9] 100% CPU usage

2021-04-02 Thread Maciej Zdeb
Hi Christopher,

Yes I know, my issues are always pretty weird. ;) Of course it's not
reproducible. :(

I'll try to collect more data and return to you. I will start a new thread
to not mix those two cases.

Kind regards,

Fri, 2 Apr 2021 at 10:13, Christopher Faulet wrote:

> On 31/03/2021 at 13:28, Maciej Zdeb wrote:
> > Hi,
> >
> > Well it's a bit better situation than earlier because only one thread is
> looping
> > forever and the rest is working properly. I've tried to verify where
> exactly the
> > thread looped but doing "n" in gdb fixed the problem :( After quitting
> gdb
> > session all threads were idle. Before I started gdb it looped about 3h
> not
> > serving any traffic, because I've put it into maintenance as soon as I
> observed
> > abnormal cpu usage.
> >
>
> Hi Maciej,
>
> I'm puzzled. It seems to be a different issue than the first one. You
> reported
> an issue during reloads, on the old processes. There was a loop because of
> a
> deadlock, but the traces showed the watchdog was fired, probably because of
> the lua (this must be confirmed though).
>
> Here, it is a loop on a living process, right ? Reading the trace, it is
> for now
> a bit hard to figure out what happens. If it is reproducible, you may try
> to use
> "perf top" command, selecting the right thread ID with "-t" argument.
>
> In addition, if the loop always falls in the H2 multiplexer, it could be
> good to
> print the h2C structure to have more info on its internal state.
>
> And to be complete, the output of "show activity", "show fd" and "show
> sess all"
> may help. Because it still loops with no traffic, it should be small.
> "show
> threads" may be good, but HAProxy should be compiled with
> USE_THREAD_DUMP=1 to
> have an advanced dump.
>
> Sorry to ask you so much work, it is pretty hard to track this kind of bug.
>
> Thanks !
> --
> Christopher Faulet
>


Re: [2.2.9] 100% CPU usage

2021-03-31 Thread Maciej Zdeb
I forgot to mention that the backtrace is from 2.2.11 built from
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=601704962bc9d82b3b1cc97d90d2763db0ae4479

Wed, 31 Mar 2021 at 13:28, Maciej Zdeb wrote:

> Hi,
>
> Well it's a bit better situation than earlier because only one thread is
> looping forever and the rest is working properly. I've tried to verify
> where exactly the thread looped but doing "n" in gdb fixed the problem :(
> After quitting gdb session all threads were idle. Before I started gdb it
> looped about 3h not serving any traffic, because I've put it into
> maintenance as soon as I observed abnormal cpu usage.
>
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> 0x7f2cf0df6a47 in epoll_wait (epfd=3, events=0x55d7aaa04920,
> maxevents=200, timeout=timeout@entry=39) at
> ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> 30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
> (gdb) thread 11
> [Switching to thread 11 (Thread 0x7f2c3c53d700 (LWP 20608))]
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> 149 if (unlikely(src->state != TRACE_STATE_STOPPED))
> (gdb) bt
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> #1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
> head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
> #2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
> src/mux_h2.c:3329
> #3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
> #4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
> src/mux_h2.c:3624
> #5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
> status=) at src/mux_h2.c:3583
> #6  0x55d7a4381f62 in run_tasks_from_lists 
> (budgets=budgets@entry=0x7f2c3c51a35c)
> at src/task.c:454
> #7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
> #8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
> #9  0x55d7a4339819 in run_thread_poll_loop (data=) at
> src/haproxy.c:3107
> #10 0x7f2cf1e606db in start_thread (arg=0x7f2c3c53d700) at
> pthread_create.c:463
> #11 0x7f2cf0df671f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) bt full
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> No locals.
> #1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
> head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
> h2s = 
> h2s_back = 
> __FUNCTION__ = "h2_resume_each_sending_h2s"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
> src/mux_h2.c:3329
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
> flags = 
> released = 
> buf = 
> conn = 0x7f2bf658b8d0
> done = 0
> sent = 0
> __FUNCTION__ = "h2_send"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> ---Type  to continue, or q  to quit---
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
> src/mux_h2.c:3624
> conn = 0x7f2bf658b8d0
> __FUNCTION__ = "h2_process"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
> status=) at src/mux_h2.c:3583
> conn = 0x7f2bf658b8d0
> tl = 
> conn_in_list = 0
> h2c = 0x7f2c18dca740
> ret = 
> __FUNCTION__ = "h2_io_cb"
> __x = 
> __l = 
> __x = 
> __l = 
> __x 

Re: [2.2.9] 100% CPU usage

2021-03-31 Thread Maciej Zdeb
K_SELF_WAKING|TASK_KILLED);
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
441 activity[tid].ctxsw++;
(gdb)
444 t->calls++;
(gdb)
445 sched->current = t;
(gdb)
447 _HA_ATOMIC_SUB(&tasks_run_queue, 1);
(gdb)
449 if (TASK_IS_TASKLET(t)) {
(gdb)
450 LIST_DEL_INIT(&((struct tasklet *)t)->list);
(gdb)
449 if (TASK_IS_TASKLET(t)) {
(gdb)
451 __ha_barrier_store();
(gdb)
452 state = _HA_ATOMIC_XCHG(&t->state, state);
(gdb)
454 process(t, ctx, state);
(gdb)
456 sched->current = NULL;
(gdb)
455 done++;
(gdb)
456 sched->current = NULL;
(gdb)
457 __ha_barrier_store();
(gdb)
458 continue;
(gdb)
398 if (global.tune.options & GTUNE_SCHED_LOW_LATENCY) {
(gdb)
399 if (unlikely(sched->tl_class_mask & budget_mask & ((1 << queue) - 1))) {
(gdb)
398 if (global.tune.options & GTUNE_SCHED_LOW_LATENCY) {
(gdb)
424 if (LIST_ISEMPTY(&tl_queues[queue])) {
(gdb)
430 if (!budgets[queue]) {
(gdb)
436 budgets[queue]--;
(gdb)
442 ctx = t->context;
(gdb)
443 process = t->process;
(gdb)
436 budgets[queue]--;
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
437 t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *,
list);
(gdb)
438 state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_KILLED);
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
438 state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_KILLED);
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
441 activity[tid].ctxsw++;
(gdb)
444 t->calls++;
(gdb)
445 sched->current = t;
(gdb)
447 _HA_ATOMIC_SUB(&tasks_run_queue, 1);
(gdb)
449 if (TASK_IS_TASKLET(t)) {
(gdb)
450 LIST_DEL_INIT(&((struct tasklet *)t)->list);
(gdb)
449 if (TASK_IS_TASKLET(t)) {
(gdb)
451 __ha_barrier_store();
(gdb)
452 state = _HA_ATOMIC_XCHG(&t->state, state);
(gdb)
454 process(t, ctx, state);
(gdb)
456 sched->current = NULL;
(gdb)
455 done++;
(gdb)
456 sched->current = NULL;
(gdb)
457 __ha_barrier_store();
(gdb)
458 continue;
(gdb)
398 if (global.tune.options & GTUNE_SCHED_LOW_LATENCY) {
(gdb)
399 if (unlikely(sched->tl_class_mask & budget_mask & ((1 << queue) - 1))) {
(gdb)
398 if (global.tune.options & GTUNE_SCHED_LOW_LATENCY) {
(gdb)
424 if (LIST_ISEMPTY(&tl_queues[queue])) {
(gdb)
430 if (!budgets[queue]) {
(gdb)
436 budgets[queue]--;
(gdb)
442 ctx = t->context;
(gdb)
443 process = t->process;
(gdb)
436 budgets[queue]--;
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
437 t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *,
list);
(gdb)
438 state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_KILLED);
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
438 state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_KILLED);
(gdb)
440 ti->flags &= ~TI_FL_STUCK; // this thread is still running
(gdb)
441 activity[tid].ctxsw++;
(gdb)
444 t->calls++;
(gdb)
445 sched->current = t;
(gdb)
447 _HA_ATOMIC_SUB(&tasks_run_queue, 1);
(gdb)
449 if (TASK_IS_TASKLET(t)) {
(gdb)
450 LIST_DEL_INIT(&((struct tasklet *)t)->list);
(gdb)
449 if (TASK_IS_TASKLET(t)) {
(gdb)
451 __ha_barrier_store();
(gdb)
452 state = _HA_ATOMIC_XCHG(&t->state, state);
(gdb)
454 process(t, ctx, state);
(gdb)
456 sched->current = NULL;
(gdb)
455 done++;
(gdb)
456 sched->current = NULL;
(gdb)
457 __ha_barrier_store();
(gdb) q
A debugging session is active.

Inferior 1 [process 20598] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/haproxy, process 20598

Thu, 25 Mar 2021 at 13:51, Christopher Faulet wrote:

> On 25/03/2021 at 13:38, Maciej Zdeb wrote:
> > Hi,
> >
> > I deployed a patched (with volatile hlua_not_dumpable) HAProxy and so
> far so
> > good, no looping. Christopher I saw new patches with hlua_traceback used
> > instead, looks much cleaner to me, should I verify them instead? :)
> >
> > Christopher & Willy I've forgotten to thank you for help!
> >
> Yes please, try the last 2.2 snapshot. It is really a better way to fix
> this
> issue because the Lua traceback is never ignored. And it is really safer
> to not
> allocate memory in the debugger.
>
> So now, we should be able to figure out why the Lua fires the watchdog.
> Because,
> under the hood, it is the true issue :)
>
> --
> Christopher Faulet
>


Re: [2.2.9] 100% CPU usage

2021-03-25 Thread Maciej Zdeb
Hi,

I deployed a patched HAProxy (with volatile hlua_not_dumpable) and so far
so good, no looping. Christopher, I saw the new patches with hlua_traceback
used instead; they look much cleaner to me. Should I verify them instead? :)

Christopher & Willy, I forgot to thank you for your help!

Kind regards,

Wed, 24 Mar 2021 at 10:51, Maciej Zdeb wrote:

> Wed, 24 Mar 2021 at 10:37, Christopher Faulet wrote:
>
>> However, reading the other trace Maciej sent (bussy_thread_peers.txt), it
>> seems
>> possible to stop a memory allocation from other places. Thus, I guess we
>> must
>> find a more global way to prevent the lua stack dump.
>>
>
> I'm not sure which part of the trace you're referring to, but I need to
> clarify that both "bussy_thread_peers.txt" and "free_thread_spoe_lua.txt"
> occurred at the same time but on different threads (free_thread_spoe_lua on
> thread 10 and bussy_thread_peers on the other threads). So if I understand
> it correctly, thread 10 locked itself and the other threads are looping to
> acquire the lock.
>


Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Maciej Zdeb
Wed, 24 Mar 2021 at 10:37, Christopher Faulet wrote:

> However, reading the other trace Maciej sent (bussy_thread_peers.txt), it
> seems
> possible to stop a memory allocation from other places. Thus, I guess we
> must
> find a more global way to prevent the lua stack dump.
>

I'm not sure which part of the trace you're referring to, but I need to
clarify that both "bussy_thread_peers.txt" and "free_thread_spoe_lua.txt"
occurred at the same time but on different threads (free_thread_spoe_lua on
thread 10 and bussy_thread_peers on the other threads). So if I understand it
correctly, thread 10 locked itself and the other threads are looping to
acquire the lock.


Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Maciej Zdeb
Wow, that's it! :)

   0x55d94949e965 <+53>: addl   $0x1,%fs:0xfffdd688
   0x55d94949e96e <+62>: callq  0x55d9494782c0 
   0x55d94949e973 <+67>: subl   $0x1,%fs:0xfffdd688
[...]
   0x55d94949e99f <+111>: ja 0x55d94949e9b0 
   0x55d94949e9a1 <+113>: mov%rbx,%rdi
   0x55d94949e9a4 <+116>: callq  0x55d949477d50 
   0x55d94949e9a9 <+121>: test   %rax,%rax
[...]
   0x55d94949e9c1 <+145>: addl   $0x1,%fs:0xfffdd688
   0x55d94949e9ca <+154>: callq  0x55d949477b50 
   0x55d94949e9cf <+159>: subl   $0x1,%fs:0xfffdd688

Ok I'll make hlua_not_dumpable volatile instead of applying compiler
barriers.

Wed, 24 Mar 2021 at 10:00, Willy Tarreau wrote:

> On Wed, Mar 24, 2021 at 09:52:22AM +0100, Willy Tarreau wrote:
> > So yes, it looks like gcc decides that a function called "malloc" will
> > never use your program's global variables but that "blablalloc" may. I
> > have no explanation to this except "optimization craziness" resulting
> > in breaking valid code, since it means that if I provide my own malloc
> > function it might not work. Pfff
>
> And that's exactly it!
>
> https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins
>
> By default a number of functions including malloc, strchr etc are mapped
> to builtins so it figures it knows better how these work and how to
> optimize
> around. Once built with -fno-builtin-malloc no more probmlem. So the
> volatile
> is the way to go to respect the risk of a signal hitting it.
>
> I found an old report of gcc-4.5 breaking chromium and tcmalloc because of
> this, which put me on the track:
>
> https://github.com/gperftools/gperftools/issues/239
>
> Willy
>
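To make the volatile discussion above concrete, here is a minimal, self-contained C sketch of the allocator-side guard, with illustrative names only (the real flag is hlua_not_dumpable, bumped in hlua_alloc); it is not the actual HAProxy patch. The point is simply that the counter must be volatile, otherwise a compiler that maps malloc() to a builtin and assumes it touches no globals may reorder or drop the increment/decrement around the call, which is exactly what the disassembly above guards against.

/* Illustrative sketch, not HAProxy source. Builds with e.g. "gcc -c". */
#include <stdlib.h>

/* Consulted from a signal handler; volatile so the updates below cannot be
 * moved across the (builtin) malloc() call by the optimizer. */
static volatile __thread unsigned int not_dumpable;

void *guarded_alloc(size_t size)
{
        void *ret;

        not_dumpable++;       /* entering async-signal-unsafe territory */
        ret = malloc(size);   /* may hold the libc arena lock when interrupted */
        not_dumpable--;
        return ret;
}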


Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Maciej Zdeb
Hi,

Tue, 23 Mar 2021 at 18:36, Willy Tarreau wrote:

> > It is most probably because of compiler optimizations. Some compiler
> > barriers are necessary to avoid instruction reordering. It is the purpose
> > of the attached patches. Sorry to ask you again, but could you run some
> > tests?
>
 Sure, I'll verify!

I don't believe in it at all. free(), malloc() etc are free to manipulate
> global variables so the compiler cannot reorder these operations around
> them. We're probably facing other corner cases (or the same but not
> totally addressed maybe).
>

After reading that I wasn't sure anymore whether I had even tested a properly
patched package. :) Fortunately I have a core file, so I verified that the
hlua_not_dumpable variable exists. I don't know how it is possible that
hlua_not_dumpable is 0 in this state (bt) if Christopher's theory about
missing compiler barriers is wrong.

(gdb) p hlua_not_dumpable
$1 = 0

(gdb) bt
#0  __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x7f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at
malloc.c:3063
#2  0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
, ptr=, osize=4, nsize=41) at
src/hlua.c:8210
#3  0x55d949622271 in luaM_realloc_ ()
#4  0x55d949621dbe in luaC_newobj ()
#5  0x55d949626f4d in internshrstr ()
#6  0x55d9496271fc in luaS_new ()
#7  0x55d94961bd03 in lua_pushstring ()
#8  0x55d94962d48a in luaL_traceback ()
#9  0x55d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628,
task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' )
at src/debug.c:227
#10 0x55d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628,
thr=, calling_tid=7) at src/debug.c:91
#11 0x55d9495a2236 in debug_handler (sig=, si=, arg=) at src/debug.c:847
#12 
#13 _int_malloc (av=av@entry=0x7f523420, bytes=bytes@entry=56) at
malloc.c:4100
#14 0x7f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=56) at
malloc.c:3065
#15 0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
, ptr=, osize=5, nsize=56) at
src/hlua.c:8210


Re: [2.2.9] 100% CPU usage

2021-03-23 Thread Maciej Zdeb
Hi Christopher,

Bad news, the patches didn't help. Attaching stack traces; now it looks like
the SPOE that executes Lua scripts (free_thread_spoe_lua.txt) tried to malloc
twice. :(

Kind regards,

Mon, 22 Mar 2021 at 08:39, Maciej Zdeb wrote:

> Hi Christopher,
>
> Thanks! I'm building a patched version and will return with feedback!
>
> Kind regards,
>
> Fri, 19 Mar 2021 at 16:40, Christopher Faulet wrote:
>
> >> On 16/03/2021 at 13:46, Maciej Zdeb wrote:
>> > Sorry for spam. In the last message I said that the old process (after
>> reload)
>> > is consuming cpu for lua processing and that's not true, it is
>> processing other
>> > things also.
>> >
>> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
>> maybe 2.4
>> > branch. For each version I need a week or two to be sure the issue does
>> not
>> > occur. :(
>> >
>> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm
>> if there
>> > is any relation between infinite loops and custom configuration:
>> > - lua scripts (mainly used for header generation/manipulation),
>> > - spoe (used for sending metadata about each request to external
>> service),
>> > - peers (we have a cluster of 12 HAProxy servers connected to each
>> other).
>> >
>> Hi Maciej,
>>
> >> I've read more carefully your backtraces, and indeed, it seems to be
> >> related to lua processing. I don't know if the watchdog is triggered
> >> because of the lua or if it is just a side-effect. But the lua execution
> >> is interrupted inside the memory allocator. And malloc/realloc are not
> >> async-signal-safe. Unfortunately, when the lua stack is dumped, the same
> >> allocator is also used. At this stage, because a lock was not properly
> >> released, HAProxy enters a deadlock.
> >>
> >> On the other threads, we loop in the watchdog, waiting for our turn to
> >> dump the thread information, and that explains the 100% CPU usage you
> >> observe.
> >>
> >> So, to prevent this situation, the lua stack must not be dumped if it was
> >> interrupted inside an unsafe part. It is the easiest way we found to work
> >> around this bug. And because it is pretty rare, it should be good enough.
> >>
> >> However, I'm unable to reproduce the bug. Could you test the attached
> >> patches please? I attached patches for the 2.4, 2.3 and 2.2. Because you
> >> experienced this bug on the 2.2, it is probably easier to test patches
> >> for this version.
>>
>> If fixed, it could be good to figure out why the watchdog is triggered on
>> your
>> old processes.
>>
>> --
>> Christopher Faulet
>>
>
[Switching to thread 10 (Thread 0x7f5246c9d700 (LWP 15807))]
#0  __lll_lock_wait_private () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95  ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or 
directory.
(gdb) bt full
#0  __lll_lock_wait_private () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
No locals.
#1  0x7f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at 
malloc.c:3063
ignore1 = 
ignore2 = 
ignore3 = 
ar_ptr = 0x7f523420
victim = 
hook = 
tbytes = 
tc_idx = 
__x = 
ignore1 = 
ignore2 = 
ignore3 = 
ignore = 
#2  0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0 
, ptr=, osize=4, nsize=41) at 
src/hlua.c:8210
zone = 0x55d9498e90d0 
#3  0x55d949622271 in luaM_realloc_ ()
No symbol table info available.
#4  0x55d949621dbe in luaC_newobj ()
No symbol table info available.
#5  0x55d949626f4d in internshrstr ()
No symbol table info available.
#6  0x55d9496271fc in luaS_new ()
No symbol table info available.
#7  0x55d94961bd03 in lua_pushstring ()
No symbol table info available.
#8  0x55d94962d48a in luaL_traceback ()
No symbol table info available.
#9  0x55d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628, 
task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' ) at 
src/debug.c:227
s = 
appctx = 
hlua = 0x7f52341bdc60
#10 0x55d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628, thr=, calling_tid=7) at src/debug.c:91
thr_bit = 512
p = 74138721
n = 
stuck = 0
#11 0x55d9495a2236 in debug_handler (sig=, si=, arg=) at src/debug.c:847
sig = 
---Type  to continue, or q  to quit---
si = 
arg = 
#12 
No locals.
#13 _int_malloc (av=av@entry=0x7f523420, bytes=bytes@entry=56) at 
malloc.c:4100
p = 
iters = 
nb = 64
idx = 5
bin = 
victim = 0x7f52341cbff0

Re: [2.2.9] 100% CPU usage

2021-03-22 Thread Maciej Zdeb
Hi Christopher,

Thanks! I'm building a patched version and will return with feedback!

Kind regards,

Fri, 19 Mar 2021 at 16:40, Christopher Faulet wrote:

> On 16/03/2021 at 13:46, Maciej Zdeb wrote:
> > Sorry for spam. In the last message I said that the old process (after
> reload)
> > is consuming cpu for lua processing and that's not true, it is
> processing other
> > things also.
> >
> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
> maybe 2.4
> > branch. For each version I need a week or two to be sure the issue does
> not
> > occur. :(
> >
> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if
> there
> > is any relation between infinite loops and custom configuration:
> > - lua scripts (mainly used for header generation/manipulation),
> > - spoe (used for sending metadata about each request to external
> service),
> > - peers (we have a cluster of 12 HAProxy servers connected to each
> other).
> >
> Hi Maciej,
>
> I've read more carefully your backtraces, and indeed, it seems to be
> related to lua processing. I don't know if the watchdog is triggered
> because of the lua or if it is just a side-effect. But the lua execution
> is interrupted inside the memory allocator. And malloc/realloc are not
> async-signal-safe. Unfortunately, when the lua stack is dumped, the same
> allocator is also used. At this stage, because a lock was not properly
> released, HAProxy enters a deadlock.
>
> On the other threads, we loop in the watchdog, waiting for our turn to
> dump the thread information, and that explains the 100% CPU usage you
> observe.
>
> So, to prevent this situation, the lua stack must not be dumped if it was
> interrupted inside an unsafe part. It is the easiest way we found to work
> around this bug. And because it is pretty rare, it should be good enough.
>
> However, I'm unable to reproduce the bug. Could you test the attached
> patches please? I attached patches for the 2.4, 2.3 and 2.2. Because you
> experienced this bug on the 2.2, it is probably easier to test patches
> for this version.
>
> If fixed, it could be good to figure out why the watchdog is triggered on
> your
> old processes.
>
> --
> Christopher Faulet
>
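To make the workaround described above concrete, here is a minimal C sketch of the dump side, with illustrative names only (the real code checks hlua_not_dumpable before calling luaL_traceback); it is not the actual HAProxy patch. The flag set around allocations (see the allocator-side sketch a bit earlier in this archive) is consulted before walking the Lua stack from the watchdog's signal handler.

/* Illustrative dump-side sketch, not HAProxy source. */
volatile unsigned int not_dumpable;     /* bumped around malloc/realloc/free */

/* Called from the watchdog/debug signal handler for the interrupted thread. */
void maybe_dump_lua_stack(void *hlua_ctx)
{
        if (not_dumpable) {
                /* Interrupted inside the allocator: walking the Lua stack
                 * would allocate again and deadlock on the arena lock. */
                return;
        }
        /* Safe to dump here; the real code builds the Lua traceback into a
         * pre-allocated buffer at this point. */
        (void)hlua_ctx;
}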


Re: [2.2.9] 100% CPU usage

2021-03-17 Thread Maciej Zdeb
Hi Christopher,

That's good news! If you need me to test a patch then let me know.

On my side I'm preparing to update HAProxy to 2.3 and solving some simple
issues like missing newlines at the end of the configuration. ;)

Kind regards,

Wed, 17 Mar 2021 at 10:49, Christopher Faulet wrote:

> On 16/03/2021 at 13:46, Maciej Zdeb wrote:
> > Sorry for spam. In the last message I said that the old process (after
> reload)
> > is consuming cpu for lua processing and that's not true, it is
> processing other
> > things also.
> >
> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
> maybe 2.4
> > branch. For each version I need a week or two to be sure the issue does
> not
> > occur. :(
> >
> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if
> there
> > is any relation between infinite loops and custom configuration:
> > - lua scripts (mainly used for header generation/manipulation),
> > - spoe (used for sending metadata about each request to external
> service),
> > - peers (we have a cluster of 12 HAProxy servers connected to each
> other).
> >
> Hi Maciej,
>
> Just to let you know we identified a potential AB/BA lock problem when
> HAProxy
> is stopping a listener. At least on the 2.2. The listeners code has
> changed in
> the 2.3 but we have no idea if the bug can be triggered too on 2.3 or 2.4.
> However, we have not found the time to investigate more for now. So, stay
> tuned.
>
> --
> Christopher Faulet
>
>


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Sorry for the spam. In the last message I said that the old process (after
reload) is consuming CPU for Lua processing, and that's not true; it is
processing other things as well.

I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and maybe
2.4 branch. For each version I need a week or two to be sure the issue does
not occur. :(

If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if
there is any relation between infinite loops and custom configuration:
- lua scripts (mainly used for header generation/manipulation),
- spoe (used for sending metadata about each request to external service),
- peers (we have a cluster of 12 HAProxy servers connected to each other).

Kind regards,

Tue, 16 Mar 2021 at 13:05, Maciej Zdeb wrote:

> Below is the output from perf top - it happens during reload (all threads
> on an old process spike and use 100% cpu, processing lua) and after 15-30
> seconds old process exits. It is probably a different bug, because it
> happens on an old process and I have no idea how it could affect the new
> process. Maybe through peers protocol? Still both issues are related to lua
> scripts (on my initial email in low_cpu_thread.txt it hung on lua script
> responsible for removing http headers on hlua_http_get_headers function).
>
> Samples: 770K of event 'cycles:ppp', Event count (approx.): 336817908209
> Overhead  Shared Object   Symbol
>   36.39%  haproxy [.]
> process_runnable_tasks
>   34.82%  haproxy [.]
> run_tasks_from_lists
>   10.04%  haproxy [.] __task_wakeup
>5.63%  libc-2.27.so[.]
> __memset_avx2_erms
>1.87%  haproxy [.] hlua_action
>0.79%  haproxy [.] hlua_ctx_init
>0.77%  haproxy [.]
> stream_release_buffers
>0.58%  libcrypto.so.1.1[.]
> rsaz_1024_sqr_avx2
>0.56%  haproxy [.]
> stream_int_notify
>0.49%  haproxy [.]
> task_run_applet
>0.42%  haproxy [.]
> hlua_sample_fetch_wrapper
>0.38%  haproxy [.]
> spoe_handle_appctx
>0.38%  [kernel][k] do_syscall_64
>0.37%  haproxy [.] eb32sc_insert
>0.32%  [vdso]  [.]
> __vdso_clock_gettime
>0.27%  haproxy [.]
> hlua_ctx_destroy
>0.22%  haproxy [.]
> stktable_get_entry
>0.20%  haproxy [.]
> process_stream
>0.19%  [kernel][k]
> audit_filter_rules.constprop.13
>0.17%  haproxy [.]
> si_applet_wake_cb
>0.14%  libcrypto.so.1.1[.]
> rsaz_1024_mul_avx2
>
> Tue, 16 Mar 2021 at 12:00, Maciej Zdeb wrote:
>
>> Sure, patch from Christopher attached. :)
>>
>> Tue, 16 Mar 2021 at 10:58, Willy Tarreau wrote:
>>
>>> Hi Maciej,
>>>
>>> On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
>>> > Hi,
>>> >
>>> > I'm returning with bad news, the patch did not help and the issue
>>> occurred
>>> > today (on patched 2.2.10). It is definitely related to reloads,
>>> however it
>>> > is very rare issue it worked flawlessly the whole week.
>>>
>>> OK. Just as a reminder (because I can't find it here), can you please
>>> remind us what this patch was ?
>>>
>>> Thanks,
>>> Willy
>>>
>>


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Below is the output from perf top - it happens during reload (all threads
on the old process spike and use 100% CPU, processing Lua) and after 15-30
seconds the old process exits. It is probably a different bug, because it
happens on the old process and I have no idea how it could affect the new
process. Maybe through the peers protocol? Still, both issues are related to
Lua scripts (in my initial email, in low_cpu_thread.txt, it hung on the Lua
script responsible for removing HTTP headers in the hlua_http_get_headers
function).

Samples: 770K of event 'cycles:ppp', Event count (approx.): 336817908209
Overhead  Shared Object   Symbol
  36.39%  haproxy [.]
process_runnable_tasks
  34.82%  haproxy [.]
run_tasks_from_lists
  10.04%  haproxy [.] __task_wakeup
   5.63%  libc-2.27.so[.]
__memset_avx2_erms
   1.87%  haproxy [.] hlua_action
   0.79%  haproxy [.] hlua_ctx_init
   0.77%  haproxy [.]
stream_release_buffers
   0.58%  libcrypto.so.1.1[.]
rsaz_1024_sqr_avx2
   0.56%  haproxy [.]
stream_int_notify
   0.49%  haproxy [.]
task_run_applet
   0.42%  haproxy [.]
hlua_sample_fetch_wrapper
   0.38%  haproxy [.]
spoe_handle_appctx
   0.38%  [kernel][k] do_syscall_64
   0.37%  haproxy [.] eb32sc_insert
   0.32%  [vdso]  [.]
__vdso_clock_gettime
   0.27%  haproxy [.]
hlua_ctx_destroy
   0.22%  haproxy [.]
stktable_get_entry
   0.20%  haproxy [.] process_stream
   0.19%  [kernel][k]
audit_filter_rules.constprop.13
   0.17%  haproxy [.]
si_applet_wake_cb
   0.14%  libcrypto.so.1.1[.]
rsaz_1024_mul_avx2

Tue, 16 Mar 2021 at 12:00, Maciej Zdeb wrote:

> Sure, patch from Christopher attached. :)
>
> Tue, 16 Mar 2021 at 10:58, Willy Tarreau wrote:
>
>> Hi Maciej,
>>
>> On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
>> > Hi,
>> >
>> > I'm returning with bad news, the patch did not help and the issue
>> occurred
>> > today (on patched 2.2.10). It is definitely related to reloads, however
>> it
>> > is very rare issue it worked flawlessly the whole week.
>>
>> OK. Just as a reminder (because I can't find it here), can you please
>> remind us what this patch was ?
>>
>> Thanks,
>> Willy
>>
>


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Sure, patch from Christopher attached. :)

Tue, 16 Mar 2021 at 10:58, Willy Tarreau wrote:

> Hi Maciej,
>
> On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
> > Hi,
> >
> > I'm returning with bad news, the patch did not help and the issue
> occurred
> > today (on patched 2.2.10). It is definitely related to reloads, however
> it
> > is very rare issue it worked flawlessly the whole week.
>
> OK. Just as a reminder (because I can't find it here), can you please
> remind us what this patch was ?
>
> Thanks,
> Willy
>


0001-BUG-MEDIUM-lists-Avoid-an-infinite-loop-in-MT_LIST_T.patch
Description: Binary data


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Hi,

I'm returning with bad news: the patch did not help and the issue occurred
today (on a patched 2.2.10). It is definitely related to reloads; however, it
is a very rare issue and it worked flawlessly for the whole week.

Tue, 9 Mar 2021 at 09:17, Willy Tarreau wrote:

> On Tue, Mar 09, 2021 at 09:04:43AM +0100, Maciej Zdeb wrote:
> > Hi,
> >
> > After applying the patch, the issue did not occur, however I'm still not
> > sure it is fixed. Unfortunately I don't have a reliable way to trigger
> it.
>
> OK. If it's related, it's very possible that some of the issues we've
> identified there recently are at least responsible for a part of the
> problem. In short, if the CPUs are too fair, some contention can last
> long because two steps are required to complete such operations. We've
> added calls to cpu_relax() in 2.4 and some are queued in 2.3 already.
> After a while, some of them should also be backported to 2.2 as they
> significantly improved the situation with many threads.
>
> > Fri, 5 Mar 2021 at 22:07, Willy Tarreau wrote:
> >
> > > Note, before 2.4, a single thread can execute Lua scripts at once,
> > > with the others waiting behind, and if the Lua load is important, maybe
> > > this can happen (but I've never experienced it yet, and the preemption
> > > interval is short enough not to cause issues in theory).
> > >
> > I'm not sure if related but during every reload, for a couple seconds all
> > 12 threads on OLD pid are using 100% cpu and then one after one core
> return
> > to normal usage, finally the old haproxy process exits. I have no idea
> why
> > it is behaving like that.
>
> It could be related but it's hard to tell. It is also possible that
> for any reason the old threads constantly believe they have something
> to do, for example a health check that doesn't get stopped and that
> keeps ringing.
>
> > > Maciej, if this happens often,
> > > would you be interested in running one machine on 2.4-dev11 ?
> >
> > It is a very rare issue and of course it occurs only in production
> > environment. :(
>
> Obviously!
>
> > I'm very willing to test the 2.4 version, especially with
> > that tasty lua optimization for multiple threads, but I can't do it on
> > production until it's stable.
>
> This makes sense. We'll try to issue 2.3 with some thread fixes this
> week, maybe it will be a possible step for you then.
>
> Cheers,
> Willy
>
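A side note on the contention Willy describes in the quoted reply: the operation in question is a compare-and-swap retry loop with a relax call between attempts, which is also what the spinning threads show later in this archive (HA_ATOMIC_CAS on threads_to_dump followed by ha_thread_relax in the gdb sessions). Below is a hedged, self-contained C11 sketch of that retry pattern only, with invented names and sched_yield() standing in for the relax call; it is not HAProxy code.

#include <sched.h>
#include <stdatomic.h>

static _Atomic unsigned long threads_to_dump_mask;

/* Spin until we can switch the shared mask from "no dump in progress" (0)
 * to "dump requested for all threads". */
static void request_dump(unsigned long all_threads_mask)
{
        unsigned long expected = 0;

        while (!atomic_compare_exchange_weak(&threads_to_dump_mask,
                                             &expected, all_threads_mask)) {
                expected = 0;  /* CAS stored the current value here; reset it */
                sched_yield(); /* back off; a CPU "pause" hint in newer versions */
        }
}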


Re: [2.2.9] 100% CPU usage

2021-03-09 Thread Maciej Zdeb
Hi,

After applying the patch, the issue did not occur, however I'm still not
sure it is fixed. Unfortunately I don't have a reliable way to trigger it.

Fri, 5 Mar 2021 at 22:07, Willy Tarreau wrote:

> Note, before 2.4, a single thread can execute Lua scripts at once,
> with the others waiting behind, and if the Lua load is important, maybe
> this can happen (but I've never experienced it yet, and the preemption
> interval is short enough not to cause issues in theory).
>
I'm not sure if it's related, but during every reload, for a couple of
seconds all 12 threads on the OLD pid are using 100% CPU, and then one after
another the cores return to normal usage; finally the old haproxy process
exits. I have no idea why it is behaving like that.


> Maciej, if this happens often,
> would you be interested in running one machine on 2.4-dev11 ?

It is a very rare issue and of course it occurs only in production
environment. :( I'm very willing to test the 2.4 version, especially with
that tasty lua optimization for multiple threads, but I can't do it on
production until it's stable.


> We'd
> need to have a quick look at your config (off-list if needed) to
> figure what Lua parts could run in multi-thread.
>
I'll provide configs (off the list) shortly!

Kind regards,


Re: [2.2.9] 100% CPU usage

2021-03-05 Thread Maciej Zdeb
Hi Christopher,

Thanks, I'll check but it'll take a couple days because the issue is quite
rare. I'll return with feedback!

Maybe the patch was not backported to 2.2 because the commit message
mentions only the 2.3 branch?

Kind regards,

Thu, 4 Mar 2021 at 22:34, Christopher Faulet wrote:

> On 04/03/2021 at 14:01, Maciej Zdeb wrote:
> > Hi,
> >
> > Sometimes after HAProxy reload it starts to loop infinitely, for example
> 9 of 10
> > threads using 100% CPU (gdb sessions attached). I've also dumped the
> core file
> > from gdb.
> >
> Hi Maciej,
>
> The 2.2.10 is out. But I'm afraid that a fix is missing. Could you test
> with the
> attached patch please? On top of the 2.2.9 or 2.2.10, as you want.
>
> Thanks,
> --
> Christopher Faulet
>


[2.2.9] 100% CPU usage

2021-03-04 Thread Maciej Zdeb
Hi,

Sometimes after HAProxy reload it starts to loop infinitely, for example 9
of 10 threads using 100% CPU (gdb sessions attached). I've also dumped the
core file from gdb.

# haproxy -v
HA-Proxy version 2.2.9-a947cc2 2021/02/06 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.9.html
Running on: Linux 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC
2019 x86_64

thread with high CPU is looping on:
809 if (HA_ATOMIC_CAS(&threads_to_dump, &old, all_threads_mask))
(gdb)
811 ha_thread_relax();
(gdb)
809 if (HA_ATOMIC_CAS(&threads_to_dump, &old, all_threads_mask))
(gdb)
811 ha_thread_relax();

it tried to panic but is failing (from bt full):
#0  0x7fa73309cf37 in sched_yield () at
../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x559253cb7fce in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#2  ha_thread_dump_all_to_trash () at src/debug.c:811
old = 
#3  0x559253cb802f in ha_panic () at src/debug.c:268
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7fa73309cf37 in sched_yield () at ../sysdeps/unix/syscall-template.S:78
78  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt full
#0  0x7fa73309cf37 in sched_yield () at 
../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x559253cb7fce in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#2  ha_thread_dump_all_to_trash () at src/debug.c:811
old = 
#3  0x559253cb802f in ha_panic () at src/debug.c:268
No locals.
#4  0x559253d23f08 in wdt_handler (sig=14, si=, 
arg=) at src/wdt.c:119
n = 
p = 
thr = 0
#5  
No locals.
#6  0x7fa73309cf37 in sched_yield () at 
../sysdeps/unix/syscall-template.S:78
No locals.
#7  0x559253cb9be5 in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#8  debug_handler (sig=, si=, arg=) at src/debug.c:859
sig = 
si = 
arg = 
#9  
No locals.
#10 0x559253bc642f in hlua_action (rule=0x5592596b5ba0, px=0x559259686260, 
sess=, s=0x55925b8f2fb0, flags=2) at src/hlua.c:6711
__pl_r = 4294967300
ret = 
arg = 
dir = 0
act_ret = 0
error = 
#11 0x559253c353c3 in http_req_get_intercept_rule 
(px=px@entry=0x559259686260, rules=rules@entry=0x5592596862b0, 
s=s@entry=0x55925b8f2fb0) at src/http_ana.c:2884
sess = 
txn = 0x55925b8f3790
rule = 0x5592596b5ba0
rule_ret = HTTP_RULE_RES_CONT
act_opts = 2
#12 0x559253c382f8 in http_process_req_common (s=s@entry=0x55925b8f2fb0, 
req=req@entry=0x55925b8f2fc0, an_bit=an_bit@entry=16, px=) at 
src/http_ana.c:501
sess = 
txn = 
msg = 
---Type  to continue, or q  to quit---
htx = 
rule = 
verdict = 
conn = 
#13 0x559253c48669 in process_stream (t=, 
context=0x55925b8f2fb0, state=) at src/stream.c:1781
max_loops = 
ana_list = 409648
ana_back = 409648
flags = 
srv = 
s = 0x55925b8f2fb0
sess = 
rqf_last = 
rpf_last = 2147483648
rq_prod_last = 
rq_cons_last = 0
rp_cons_last = 8
rp_prod_last = 0
req_ana_back = 
req = 0x55925b8f2fc0
res = 0x55925b8f3020
si_f = 0x55925b8f32e8
si_b = 0x55925b8f3340
rate = 
#14 0x559253d0a643 in run_tasks_from_lists 
(budgets=budgets@entry=0x7ffc669f811c) at src/task.c:483
process = 
tl_queues = 
t = 0x55925b8f34a0
budget_mask = 6 '\006'
done = 
queue = 
state = 
ctx = 
__ret = 
__n = 
__p = 
#15 0x559253d0b05d in process_runnable_tasks () at src/task.c:679
tt = 0x55925412f7c0 
lrq = 
grq = 
t = 
---Type  to continue, or q  to quit---
max = {0, 37, 25}
max_total = 
tmp_list = 
queue = 3
max_processed = 
#16 0x559253cc1df7 in run_poll_loop () at src/haproxy.c:2939
next = 
wake = 
#17 0x559253cc21a9 in run_thread_poll_loop (data=) at 
src/haproxy.c:3104
ptaf = 
ptif = 
ptdf = 
ptff = 
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired, 
Robust = No, Shared = No, Protocol = None}
init_cond = pthread_cond_t = {Threads known to still execute a wait 
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#18 0x559253b95840 in main (argc=, argv=0x7ffc669f8658) at 
src/haproxy.c:3803
blocked_sig = {__val = {1844674406710583, 18446744073709551615 
}}
old_sig = {__val = {0, 13925524187434950144, 0, 140722030216352, 8, 32, 
12884901897, 7, 48, 94086972698648, 80, 18446744073709409504, 0, 206158430211, 
0, 0}}
i = 
err = 
retry = 
limit = 

Re: ring buffer + spoe = segfault

2021-02-21 Thread Maciej Zdeb
Great, thanks for the update. :)

On Fri, 19.02.2021 at 18:06, Christopher Faulet wrote:

> On 19/02/2021 at 09:31, Maciej Zdeb wrote:
> > Hi,
> >
> > By accident I discovered that when enabling logging with line "log
> > ring@spoe-ring local2" in SPOE immediate segmentation fault occurs.
> >
> > Steps to reproduce (configuration files attached):
> >
> > # execute spoa-server from contrib directory:
> > ./spoa -d -p 8080
> >
> > # execute haproxy:
> > haproxy -f haproxy.cfg
> >
> > # trigger:
> > curl http://localhost:8000 <http://localhost:8000>
> >
> >
> > # haproxy -v
> > HA-Proxy version 2.2.5-34b2b10 2020/11/05 - https://haproxy.org/
> > <https://haproxy.org/>
> > Status: long-term supported branch - will stop receiving fixes around Q2
> 2025.
> > Known bugs: http://www.haproxy.org/bugs/bugs-2.2.5.html
> > <http://www.haproxy.org/bugs/bugs-2.2.5.html>
> > Running on: Linux 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24
> UTC 2019
> > x86_64
> >
> > Kind regards,
> > Maciej
>
> Maciej,
>
> I pushed a fix (https://github.com/haproxy/haproxy/commit/1d7d0f8). Not
> backported yet.
>
> Thanks !
> --
> Christopher Faulet
>


ring buffer + spoe = segfault

2021-02-19 Thread Maciej Zdeb
Hi,

By accident I discovered that when enabling logging with the line "log
ring@spoe-ring local2" in SPOE, an immediate segmentation fault occurs.

Steps to reproduce (configuration files attached):

# execute spoa-server from contrib directory:
./spoa -d -p 8080

# execute haproxy:
haproxy -f haproxy.cfg

# trigger:
curl http://localhost:8000


# haproxy -v
HA-Proxy version 2.2.5-34b2b10 2020/11/05 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.5.html
Running on: Linux 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC
2019 x86_64

Kind regards,
Maciej


spoa-server.spoe.conf
Description: Binary data


haproxy.cfg
Description: Binary data


Re: [PATCH] DOC: Clarify %HP description in log-format

2020-11-30 Thread Maciej Zdeb
Tue, 1 Dec 2020 at 08:14, Willy Tarreau wrote:

> Ah I'm stupid, sorry! The problem is not caused by the defaults section
> but by the fact that %[path] extracts the path from the request while your
> proposal takes it from the copy of the URI that was kept for logging.
>
No need to be sorry. :) I'm the one that is learning here. I didn't notice
that logging is done from the copy of URI and didn't give you full
information about the difference between %HPR and %[path].

OK, at first glance your patch looks good. However the description of
> the token doesn't seem perfect. Given that we already use the terminology
> "path", I'd rather make it clear that it's the patth. What about, for
> example, "%HPO" for "HTTP path only (without host nor query string)" ?
>
I was scratching my head for a long time with this. "HTTP path only" sounds
much better to me. I've adjusted the patch.

Thank you Willy!


0001-MINOR-log-Logging-HTTP-path-only-with-HPO.patch
Description: Binary data


Re: [PATCH] DOC: Clarify %HP description in log-format

2020-11-30 Thread Maciej Zdeb
Even with mode http:

defaults
mode http
log stdout local0
log-format 'path:%[path]'

it still fails.

What is more even if you use %[path] on http frontend:

defaults
mode http
log stdout local0

frontend front
mode http
# http-request set-var(txn.path) path
# log-format 'path:%[var(txn.path)]'
log-format 'path:%[path]'
bind 127.0.0.1:
default_backend back
backend back
mode http
server srv 127.0.0.1:80 check

it will also fail. The only way I know for using fetches in log-format is
through set-var()/var() (commented lines in above config).

Idea with named defaults is brilliant! :)

Tue, 1 Dec 2020 at 07:26, Willy Tarreau wrote:

> On Tue, Dec 01, 2020 at 07:13:32AM +0100, Maciej Zdeb wrote:
> > For such defaults section:
> > defaults
> > log stdout local0
> > log-format 'path:%[path]'
> >
> > I'm getting following error:
> > [ALERT] 335/060923 (13259) : Parsing [haproxy.cfg:11]: failed to parse
> > log-format : sample fetch <path> may not be reliably used here because
> it
> > needs 'HTTP request headers' which is not available here.
> >
> > What am I doing wrong? :)
>
> You're missing "mode http" in your defaults section. I tend to recommend
> not
> to mix TCP and HTTP defaults, because that tends to make TCP proxies
> inherit
> lots of HTTP settings. The checks are done everywhere (as you can see
> here),
> but this requires a lot of dedicated code to deal with corner cases, and I
> must admit I never know if from time to time an HTTP directive slips into a
> TCP proxy.
>
> The good practice instead is to have one "defaults" section for TCP and
> another one later for HTTP. I've long wanted to have named defaults that
> could be reused in proxies. This would require to extend the "frontend",
> "listen" and "backend" directives to take a section name at the declaration
> moment. For example:
>
>   defaults tcp
>  timeout client 1h
>
>   defaults http
>  mode http
>  timeout client 10s
>
>   frontend f1 from tcp
>  bind :1234
>
>   frontend f2 from http
>  bind :80
>
> The last declared section would always set the current defaults though.
> If anyone is interested in working on this, it's not very difficult, the
> config parser currently knows about a single "defproxy", instead it could
> perform a lookup (or even copy it from the current designated one).
>
> Willy
>


Re: [PATCH] DOC: Clarify %HP description in log-format

2020-11-30 Thread Maciej Zdeb
For such defaults section:
defaults
log stdout local0
log-format 'path:%[path]'

I'm getting following error:
[ALERT] 335/060923 (13259) : Parsing [haproxy.cfg:11]: failed to parse
log-format : sample fetch  may not be reliably used here because it
needs 'HTTP request headers' which is not available here.

What am I doing wrong? :)

Tue, 1 Dec 2020 at 06:56, Willy Tarreau wrote:

> On Tue, Dec 01, 2020 at 06:54:16AM +0100, Maciej Zdeb wrote:
> > There's no difference with %[path] in fact this patch only exposes "path"
> > fetch as %HPR. It's only an usability matter. Usually I'm setting some
> sane
> > log-format in the defaults section (where I can't use %[path]) and only
> for
> > some critically important frontends I'm overwriting it to be more
> verbose.
>
> Why do you say you can't use %[path] in the defaults section ?
>
> Willy
>


Re: [PATCH] DOC: Clarify %HP description in log-format

2020-11-30 Thread Maciej Zdeb
There's no difference with %[path]; in fact this patch only exposes the
"path" fetch as %HPR. It's only a usability matter. Usually I set some sane
log-format in the defaults section (where I can't use %[path]) and only for
some critically important frontends I overwrite it to be more verbose.

Tue, 1 Dec 2020 at 04:12, Willy Tarreau wrote:

> Hi Maciej,
>
> On Mon, Nov 30, 2020 at 08:17:42PM +0100, Maciej Zdeb wrote:
> > Hi
> >
> > I have a small patch which adds a new log-format variable %HPR for
> logging
> > relative path. I hope it is clean and useful! :)
>
> What's the difference with %[path] ?  I'm trying to have a sample-fetch
> equivalent for all individual log tags so I'm a bit worried about seeing
> new ones being introduced if they can already be expressed differently.
>
> Thanks,
> Willy
>


Re: [PATCH] DOC: Clarify %HP description in log-format

2020-11-30 Thread Maciej Zdeb
Hi

I have a small patch which adds a new log-format variable %HPR for logging
relative path. I hope it is clean and useful! :)

Kind regards,



Thu, 26 Nov 2020 at 19:08, Willy Tarreau wrote:

> Hi Maciej,
>
> On Thu, Nov 26, 2020 at 12:39:19PM +0100, Maciej Zdeb wrote:
> > Hi,
> >
> > Since 30ee1efe676e8264af16bab833c621d60a72a4d7 "MEDIUM: h2: use the
> > normalized URI encoding for absolute form requests" we're using absolute
> > URI when possible. As stated in the commit message this also changed the
> > way it is reported in logs, especially for H2.
> >
> > Docs describes %HP as "HTTP request URI without query string (path)" and
> > the word "path" suggests that %HP behaves exactly the same as "path"
> sample
> > fetch (it surprised me after HAProxy update that it's not true).
> Shouldn't
> > we delete the "(path)" from that description? Even better would be to
> > change it to "rel_path/abs_path" like it is in rfc or "relative/absolute
> > URI" but it is too long. :) If I'm the only one that got confused then it
> > may stay. But...
> >
> > Do you think that it might be useful to implement another var, let's say
> > %HRU described as "HTTP request relative URI"? :) Usually in my logs I'm
> > not interested in absolute uri and relative is sufficient.
>
> Few people play with absolute URIs in practice, that's why such issues
> started to pop up very late. Your example above is just yet another proof
> of this. So you're probably not the only one, but the few others might
> have silently fixed the meaning in their mind while you've fixed it in the
> doc, which is better!
>
> Applied now, thank you!
> Willy
>


0001-MINOR-log-Logging-relative-URI-path-with-HPR.patch
Description: Binary data


[PATCH] DOC: Clarify %HP description in log-format

2020-11-26 Thread Maciej Zdeb
Hi,

Since 30ee1efe676e8264af16bab833c621d60a72a4d7 "MEDIUM: h2: use the
normalized URI encoding for absolute form requests" we're using absolute
URI when possible. As stated in the commit message this also changed the
way it is reported in logs, especially for H2.

Docs describes %HP as "HTTP request URI without query string (path)" and
the word "path" suggests that %HP behaves exactly the same as "path" sample
fetch (it surprised me after HAProxy update that it's not true). Shouldn't
we delete the "(path)" from that description? Even better would be to
change it to "rel_path/abs_path" like it is in rfc or "relative/absolute
URI" but it is too long. :) If I'm the only one that got confused then it
may stay. But...

Do you think that it might be useful to implement another var, let's say
%HRU described as "HTTP request relative URI"? :) Usually in my logs I'm
not interested in absolute uri and relative is sufficient.

Kind regards,
Maciej


0001-DOC-Clarify-HP-description-in-log-format.patch
Description: Binary data


Re: [PATCH] Fix lacking init of log-format list

2020-11-24 Thread Maciej Zdeb
Tue, 24 Nov 2020 at 10:34, Willy Tarreau wrote:

> I prefer the former. If there's a list head, it must be valid. We must
> not start to look at list pointers to guess whether or not they are
> properly initialized, there are enough exceptions and special cases
> everywhere in the code not to add new ones.
>
Now that you've pointed it out, I agree that it's easier to remember to
properly init the list than to validate it every time we want to use it.

I'll do my best not to introduce such bugs in the future.

Thank you Tim & Willy!


Re: [PATCH] Fix lacking init of log-format list

2020-11-23 Thread Maciej Zdeb
Hi Tim,

There's a crash report on github:
https://github.com/haproxy/haproxy/issues/909#issuecomment-732241492 with a
reproducer and also a stack trace from Willy. The list init is necessary
because release_http_action() tries to iterate over the list and free its
elements:
http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/http_act.c;h=13a5c370da861c78f21037186808ee16325ab45a;hb=HEAD#l52

Now that I've had more time to look at the code, maybe a cleaner solution is
to
add an additional check to release_http_action() instead of initializing
the list (patch attached).

Mon, 23 Nov 2020 at 20:38, Tim Düsterhus wrote:

> Maciej,
>
> On 23.11.20 at 17:15, Maciej Zdeb wrote:
> > By accident I removed initialization of log-format list during
> > implementation of -m flag for http-request/response del-header. :(
> >
> > This patch restores the init.
> >
>
> Again from my "I am not an expert" viewpoint:
>
> I'm not sure whether this is necessary. As you don't take a log-format
> string you shouldn't need that. The previous existence of that line
> probably was a copy and paste mistake.
>
> Did you encounter any issues with the removal of that line or did you
> just review your diff? If you actually had issues it would be helpful if
> you would put a reproducer into the commit message.
>
> Best regards
> Tim Düsterhus
>


0001-BUG-MINOR-http_act-Fix-segv-on-release_http_action.patch
Description: Binary data


[PATCH] Fix lacking init of log-format list

2020-11-23 Thread Maciej Zdeb
Hi,

By accident I removed the initialization of the log-format list during the
implementation of the -m flag for http-request/response del-header. :(

This patch restores the init.


0001-BUG-MINOR-http_act-Restore-init-of-log-format-list.patch
Description: Binary data


Re: [2.2.5] High cpu usage after switch to threads

2020-11-21 Thread Maciej Zdeb
Sat, 21 Nov 2020 at 07:13, Willy Tarreau wrote:

> So I guess we'll use you as a beta tester once we're starting to see
> promising solutions ;-)
>
I'll test it happily :)


Re: [PATCH] minor bug + http-request del-header supporting -m flag

2020-11-21 Thread Maciej Zdeb
Hi,

Willy, thanks for the clarification about lfs_file; I've removed it from the
patch. As Christopher mentioned, when developing the patch I noticed the
comment saying that when "action_ptr" is set, "action" can be set to any
meaningful value. However, I also had doubts about it. :)

Christopher, I fixed the parsing issue (I hope), good catch and many thanks!



Sat, 21 Nov 2020 at 10:43, Willy Tarreau wrote:

> On Sat, Nov 21, 2020 at 09:42:32AM +0100, Christopher Faulet wrote:
> > On 21/11/2020 at 06:15, Willy Tarreau wrote:
> > > I looked at the patch, it's OK regarding this and most of the other
> > > stuff, except one thing, it mixes ACT_HTTP_* and PAT_MATCH_* and there
> > > is some breakage there, because the HTTP action being set to
> PAT_MATCH_STR
> > > for example sets it to value 5 which is ACT_HTTP_SET_NICE, etc. So you
> > > must only have ACT_* values in your rule->action field.
> > >
> > > Maciej, there are two possibilities here. Either you duplicate the
> delete
> > > action to have DEL_STR, DEL_BEG, DEL_SUB etc, and you just set the
> right
> > > action based on the matching method you found during parsing, or you
> > > extend act_rule.arg.http to add a matching method for the delete
> action.
> > > I tend to consider that the former is cleaner and simpler, given that
> > > we're really performing a few different actions there, which are all
> > > derivatives of the delete action.
> >
> > Hi,
> >
> > rule->action value may be set to any value if an action function is
> defined
> > (rule->action_ptr). ACT_* are only used for action without action
> function,
> > to perform a switch case on it. That's documented this way :
> >
> > unsigned int action; /* ACT_* or any meaningful value if action_ptr is
> defined */
>
> OK, I wasn't aware of this, seeing only ACT_* there. Fine by me then.
>
> Willy
>


0001-BUG-MINOR-http_htx-Fix-searching-headers-by-substrin.patch
Description: Binary data


0002-MINOR-http_act-Add-m-flag-for-del-header-name-matchi.patch
Description: Binary data


Re: [PATCH] minor bug + http-request del-header supporting -m flag

2020-11-20 Thread Maciej Zdeb
Thank you Tim for your help and support! :)

I've fixed some coding-style issues. I hope that all trivial mistakes
(wrong bracing, indentation, etc.) are now fixed.

I have a question about:
free(px->conf.lfs_file);
px->conf.lfs_file = strdup(px->conf.args.file);
px->conf.lfs_line = px->conf.args.line;

It appears in various places in http_act.c, so I've also placed it in my
patch; however, I'm not sure if it's really necessary. I just don't
understand this fragment or why it is there. :(

Fri, 20 Nov 2020 at 19:49, Tim Düsterhus wrote:

> Maciej,
>
> On 20.11.20 at 18:45, Maciej Zdeb wrote:
> >> +# This config tests various http request rules (set/replace)
> >> manipulating the
> >>> +# path, with or without the query-string. It also test path and pathq
> >> sample
> >>> +# fetches.
> >>
> >> I believe this comment is not correct.
> >>
> >>> +http-response del-header x-str2 -m str
> >>> + http-response del-header x-beg -m beg
> >>> +http-response del-header end1 -m end
> >>
> >> Indentation is messed up there, because you mixed tabs and spaces.
> >>
> > True, I was so excited that my code started to work that I didn't notice
> > this garbage. I'm very sorry for wasting your time.
>
> Please do not apologize. You didn't waste anyone's time. You certainly
> didn't waste mine.
>
> Think about the time you saved for others, because you took the effort
> to implement this! My remarks all were very minor.
>
> >> I now also tested your patch. I noticed the following issues:
> >>
> >>> http-response del-header -m beg
> >>
> >> This should probably cause a configuration parsing failure, but is
> >> silently accepted. I am not sure what is does, delete headers called
> >> `-m`, ignoring the `beg`?
> >>
> >  Yes, it deleted the `-m` header and ignored `beg`. I've added an
> > additional check, if an unsupported flag is set.
> >
> > Thank you again Tim!
> >
>
> You're welcome. I can't comment on whether you correctly integrated
> everything, but I can say that the patch now looks good to me based on
> my knowledge of HAProxy. I hope I saved the maintainers some reviewing
> effort and I'll leave the final review to them now (adding Willy to Cc).
>
> You earned this:
>
> Reviewed-by: Tim Duesterhus 
>
> Best regards
> Tim Düsterhus
>


0001-BUG-MINOR-http_htx-Fix-searching-headers-by-substrin.patch
Description: Binary data


0002-MINOR-http_act-Add-m-flag-for-del-header-name-matchi.patch
Description: Binary data


Re: [PATCH] minor bug + http-request del-header supporting -m flag

2020-11-20 Thread Maciej Zdeb
I've added to commit msg "This is related to GitHub issue #909" because
there is still no support for reg matching in this patch.

Fri, 20 Nov 2020 at 17:54, Tim Düsterhus wrote:

> > +#REQUIRE_VERSION=2.2
>
> This should either be 2.4 or be removed. The removal should be safe,
> because it is integrated within a single commit.
>
Removed.

> +# This config tests various http request rules (set/replace)
> manipulating the
> > +# path, with or without the query-string. It also test path and pathq
> sample
> > +# fetches.
>
> I believe this comment is not correct.
>
> > +http-response del-header x-str2 -m str
> > + http-response del-header x-beg -m beg
> > +http-response del-header end1 -m end
>
> Indentation is messed up there, because you mixed tabs and spaces.
>
True, I was so excited that my code started to work that I didn't notice
this garbage. I'm very sorry for wasting your time.

> + * Otherwsize ACT_RET_ERR is returned.
>
> There is a typo there.
>
Fixed.

> + rule->action = PAT_MATCH_STR; // exact matching (-m str)
>
Changed to multiline. :)


> I now also tested your patch. I noticed the following issues:
>
> > http-response del-header -m beg
>
> This should probably cause a configuration parsing failure, but is
> silently accepted. I am not sure what is does, delete headers called
> `-m`, ignoring the `beg`?
>
Yes, it deleted the `-m` header and ignored `beg`. I've added an
additional check for when an unsupported flag is set.

Thank you again Tim!


0001-BUG-MINOR-http_htx-Fix-searching-headers-by-substrin.patch
Description: Binary data


0002-MINOR-http_act-Add-m-flag-for-del-header-name-matchi.patch
Description: Binary data


Re: [PATCH] minor bug + http-request del-header supporting -m flag

2020-11-20 Thread Maciej Zdeb
Fri, 20 Nov 2020 at 16:56, Tim Düsterhus wrote:

> There's a duplicate semicolon there.

Fixed!


> Merge the test into the previous commit, please. This ensures the test
> is always there when the feature is there and absent if it is not.
>
True! Fixed.

Thank you Tim!


0001-BUG-MINOR-http_htx-Fix-searching-headers-by-substrin.patch
Description: Binary data


0002-MINOR-http_act-Add-m-flag-for-del-header-name-matchi.patch
Description: Binary data


[PATCH] minor bug + http-request del-header supporting -m flag

2020-11-20 Thread Maciej Zdeb
Hi,

I've implemented an http-request del-header with an optional -m flag [beg,
sub, str, end]. I'm not sure if my approach is correct, so I'm not adding
support for the "reg" method in these patches. I have doubts about (ab)using
the action field in the act_rule struct for storing the selected matching
method. What if we need additional flags? :)
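For readers skimming the patches, this is roughly what the four -m modes mean when applied to a header name. A hedged, self-contained C sketch with invented names, using case-sensitive string functions for brevity even though header names are compared case-insensitively in practice; it is not the code from the attached patches.

#include <string.h>

enum match_mode { MATCH_STR, MATCH_BEG, MATCH_END, MATCH_SUB };

/* Returns non-zero when the header name matches the pattern under the
 * selected mode: exact, prefix, suffix or substring. */
static int name_matches(const char *name, const char *pat, enum match_mode mode)
{
        size_t nlen = strlen(name);
        size_t plen = strlen(pat);

        switch (mode) {
        case MATCH_STR: return strcmp(name, pat) == 0;
        case MATCH_BEG: return strncmp(name, pat, plen) == 0;
        case MATCH_END: return nlen >= plen && strcmp(name + nlen - plen, pat) == 0;
        case MATCH_SUB: return strstr(name, pat) != NULL;
        }
        return 0;
}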


0001-BUG-MINOR-http_htx-Fix-searching-headers-by-substrin.patch
Description: Binary data


0003-REGTESTS-Add-http-rules-del_header.vtc.patch
Description: Binary data


0002-MINOR-http_act-Add-m-flag-for-del-header-name-matchi.patch
Description: Binary data


Re: [2.2.5] High cpu usage after switch to threads

2020-11-19 Thread Maciej Zdeb
Hi,

Aleksandar, I've looked into the code and... :)

Wed, 18 Nov 2020 at 15:30, Aleksandar Lazic wrote:

> Can you think to respect the '-i'.
>
> http://git.haproxy.org/?p=haproxy.git=search=HEAD=grep=PAT_MF_IGNORE_CASE
>
I'm not sure if I understand you correctly, but in the case of http-request
del-header the case insensitivity must always be enabled, because header
names should be case-insensitive according to the RFC. So we should not
implement the "-i" flag in this scenario.


> Additional Info.
>
> What I have seen in the checking of '-i' (PAT_MF_IGNORE_CASE) is that the
> '-m reg' functions
> do not have the PAT_MF_IGNORE_CASE check.
>
I think you're looking at the regex execution, but the flags are considered
during regex compilation. If you look at the regex_comp function
http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/regex.c;h=45a7e9004e8e4f6ad9604ed9a858aba0060b6204;hb=0217b7b24bb33d746d2bf625f5e894007517d1b0#l312
you'll notice the cs param. The function is called in pattern.c and a couple
of other places. In my opinion, -i with -m reg is perfectly valid and should
work.


Re: [2.2.5] High cpu usage after switch to threads

2020-11-18 Thread Maciej Zdeb
Tim, thanks for the hint!

Aleksandar, I'll do my best; however, I'm still learning HAProxy internals
and refreshing my C skills after a very long break. ;) First, I'll try to
deliver something simple like "-m beg", and after a review from the team
we'll see.

If someone is in a hurry with this issue and wants to implement it asap, then
just let me know.

On Wed, 18.11.2020 at 16:49, Tim Düsterhus wrote:

> Maciej,
>
> On 18.11.20 at 14:22, Maciej Zdeb wrote:
> > I've found an earlier discussion about replacing reqidel (and others) in
> > 2.x: https://www.mail-archive.com/haproxy@formilux.org/msg36321.html
> >
> > So basically we're lacking:
> > http-request del-header x-private-  -m beg
> > http-request del-header x-.*company -m reg
> > http-request del-header -tracea -m end
> >
> > I'll try to implement it in the free time.
>
> Please refer to this issue: https://github.com/haproxy/haproxy/issues/909
>
> Best regards
> Tim Düsterhus
>
>
>


Re: [2.2.5] High cpu usage after switch to threads

2020-11-18 Thread Maciej Zdeb
I've found an earlier discussion about replacing reqidel (and others) in
2.x: https://www.mail-archive.com/haproxy@formilux.org/msg36321.html

So basically we're lacking:
http-request del-header x-private-  -m beg
http-request del-header x-.*company -m reg
http-request del-header -tracea -m end

I'll try to implement it in the free time.

Wed, 18 Nov 2020 at 13:20, Maciej Zdeb wrote:

> Sure, the biggest problem is to delete header by matching prefix:
>
> load_blacklist = function(service)
> local prefix = '/etc/haproxy/configs/maps/header_blacklist'
> local blacklist = {}
>
> blacklist.req = {}
> blacklist.res = {}
> blacklist.req.str = Map.new(string.format('%s_%s_req.map', prefix,
> service), Map._str)
> blacklist.req.beg = Map.new(string.format('%s_%s_req_beg.map', prefix,
> service), Map._beg)
>
> return blacklist
> end
>
> blacklist = {}
> blacklist.testsite = load_blacklist('testsite')
>
> is_denied = function(bl, name)
> return bl ~= nil and (bl.str:lookup(name) ~= nil or
> bl.beg:lookup(name) ~= nil)
> end
>
> req_header_filter = function(txn, service)
> local req_headers = txn.http:req_get_headers()
> for name, _ in pairs(req_headers) do
> if is_denied(blacklist[service].req, name) then
> txn.http:req_del_header(name)
> end
> end
> end
>
> core.register_action('req_header_filter', { 'http-req' },
> req_header_filter, 1)
>
>> Wed, 18 Nov 2020 at 12:46, Julien Pivotto wrote:
>
>> On 18 Nov 12:33, Maciej Zdeb wrote:
>> > Hi again,
>> >
>> > So "# some headers manipulation, nothing different then on other
>> clusters"
>> > was the important factor in config. Under this comment I've hidden from
>> you
>> > one of our LUA scripts that is doing header manipulation like deleting
>> all
>> > headers from request when its name begins with "abc*". We're doing it on
>> > all HAProxy servers, but only here it has such a big impact on the CPU,
>> > because of huge RPS.
>> >
>> > If I understand correctly:
>> > with nbproc = 20, lua interpreter worked on every process
>> > with nbproc=1, nbthread=20, lua interpreter works on single
>> process/thread
>> >
>> > I suspect that running lua on multiple threads is not a trivial task...
>>
>> If you can share your lua script maybe we can see if this is doable
>> more natively in haproxy
>>
>> >
>> >
>> >
>> >
>> > Tue, 17 Nov 2020 at 15:50, Maciej Zdeb wrote:
>> >
>> > > Hi,
>> > >
>> > > We're in a process of migration from HAProxy[2.2.5] working on
>> multiple
>> > > processes to multiple threads. Additional motivation came from the
>> > > announcement that the "nbproc" directive was marked as deprecated and
>> will
>> > > be killed in 2.5.
>> > >
>> > > Mostly the migration went smoothly but on one of our clusters the CPU
>> > > usage went so high that we were forced to rollback to nbproc. There is
>> > > nothing unusual in the config, but the traffic on this particular
>> cluster
>> > > is quite unusual.
>> > >
>> > > With nbproc set to 20 CPU idle drops at most to 70%, with nbthread =
>> 20
>> > > after a couple of minutes at idle 50% it drops to 0%. HAProxy
>> > > processes/threads are working on dedicated/isolated CPU cores.
>> > >
>> > > [image: image.png]
>> > >
>> > > I mentioned that traffic is quite unusual, because most of it are http
>> > > requests with some payload in headers and very very small responses
>> (like
>> > > 200 OK). On multi-proc setup HAProxy handles about 20 to 30k of
>> connections
>> > > (on frontend and backend) and about 10-20k of http requests. Incoming
>> > > traffic is just about 100-200Mbit/s and outgoing 40-100Mbit/s from
>> frontend
>> > > perspective.
>> > >
>> > > Did someone experience similar behavior of HAProxy? I'll try to
>> collect
>> > > more data and generate similar traffic with sample config to show a
>> > > difference in performance between nbproc and nbthread.
>> > >
>> > > I'll greatly appreciate any hints on what I should focus. :)
>> > >
>> > > Current config is close to:
>> > > frontend front
>> > > mode http
>> > > option http-keep-

Re: [2.2.5] High cpu usage after switch to threads

2020-11-18 Thread Maciej Zdeb
Sure, the biggest problem is to delete header by matching prefix:

-- Load the request/response header blacklists for one service from
-- HAProxy map files (exact-match and prefix-match variants).
load_blacklist = function(service)
    local prefix = '/etc/haproxy/configs/maps/header_blacklist'
    local blacklist = {}

    blacklist.req = {}
    blacklist.res = {}
    blacklist.req.str = Map.new(string.format('%s_%s_req.map', prefix, service), Map._str)
    blacklist.req.beg = Map.new(string.format('%s_%s_req_beg.map', prefix, service), Map._beg)

    return blacklist
end

blacklist = {}
blacklist.testsite = load_blacklist('testsite')

-- A header name is denied if it matches either the exact-name map or the
-- prefix map of the given blacklist.
is_denied = function(bl, name)
    return bl ~= nil and (bl.str:lookup(name) ~= nil or bl.beg:lookup(name) ~= nil)
end

-- http-request action: drop every request header whose name is on the
-- blacklist of the given service.
req_header_filter = function(txn, service)
    local req_headers = txn.http:req_get_headers()
    for name, _ in pairs(req_headers) do
        if is_denied(blacklist[service].req, name) then
            txn.http:req_del_header(name)
        end
    end
end

core.register_action('req_header_filter', { 'http-req' }, req_header_filter, 1)
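
For completeness, a minimal sketch of the glue needed to use the action above
(the file name and service label are hypothetical, not taken from the real
config):

global
    lua-load /etc/haproxy/header_filter.lua

frontend front
    mode http
    # the single argument is the service name used to pick the blacklist
    http-request lua.req_header_filter testsite

On HAProxy versions where "http-request del-header" accepts a matching method,
the prefix case could probably be handled natively and avoid the Lua
round-trip, for example:

    http-request del-header x-abc -m beg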

śr., 18 lis 2020 o 12:46 Julien Pivotto  napisał(a):

> On 18 Nov 12:33, Maciej Zdeb wrote:
> > Hi again,
> >
> > So "# some headers manipulation, nothing different then on other
> clusters"
> > was the important factor in config. Under this comment I've hidden from
> you
> > one of our LUA scripts that is doing header manipulation like deleting
> all
> > headers from request when its name begins with "abc*". We're doing it on
> > all HAProxy servers, but only here it has such a big impact on the CPU,
> > because of huge RPS.
> >
> > If I understand correctly:
> > with nbproc = 20, lua interpreter worked on every process
> > with nbproc=1, nbthread=20, lua interpreter works on single
> process/thread
> >
> > I suspect that running lua on multiple threads is not a trivial task...
>
> If you can share your lua script maybe we can see if this is doable
> more natively in haproxy
>
> >
> >
> >
> >
> > wt., 17 lis 2020 o 15:50 Maciej Zdeb  napisał(a):
> >
> > > Hi,
> > >
> > > We're in a process of migration from HAProxy[2.2.5] working on multiple
> > > processes to multiple threads. Additional motivation came from the
> > > announcement that the "nbproc" directive was marked as deprecated and
> will
> > > be killed in 2.5.
> > >
> > > Mostly the migration went smoothly but on one of our clusters the CPU
> > > usage went so high that we were forced to roll back to nbproc. There is
> > > nothing unusual in the config, but the traffic on this particular
> cluster
> > > is quite unusual.
> > >
> > > With nbproc set to 20, CPU idle drops to 70% at worst; with nbthread = 20,
> > > idle sits around 50% for a couple of minutes and then drops to 0%. HAProxy
> > > processes/threads are working on dedicated/isolated CPU cores.
> > >
> > > [image: image.png]
> > >
> > > I mentioned that the traffic is quite unusual because most of it is HTTP
> > > requests with some payload in the headers and very small responses (like
> > > 200 OK). On the multi-process setup HAProxy handles about 20-30k
> > > connections (frontend and backend) and about 10-20k HTTP requests.
> > > Incoming traffic is just about 100-200 Mbit/s and outgoing 40-100 Mbit/s
> > > from the frontend's perspective.
> > >
> > > Did someone experience similar behavior of HAProxy? I'll try to collect
> > > more data and generate similar traffic with sample config to show a
> > > difference in performance between nbproc and nbthread.
> > >
> > > I'll greatly appreciate any hints on what I should focus. :)
> > >
> > > Current config is close to:
> > > frontend front
> > > mode http
> > > option http-keep-alive
> > > http-request add-header X-Forwarded-For %[src]
> > >
> > > # some headers manipulation, nothing different then on other
> clusters
> > >
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 1
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 2
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 3
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 4
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 5
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 6
> > > bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1
> process 7
> > > bind
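
For reference, a rough sketch of the thread-based equivalent of the per-process
binds above (hypothetical; it assumes 2.2+ cpu-map syntax and that the isolated
cores are numbered 0-19; the original config is truncated by the archive):

global
    nbthread 20
    # pin the 20 threads of the single process to the isolated cores
    cpu-map auto:1/1-20 0-19

frontend front
    mode http
    option http-keep-alive
    http-request add-header X-Forwarded-For %[src]
    # a single bind line is shared by all threads
    bind x.x.x.x:443 ssl crt /etc/cert/a.pem.pem alpn h2,http/1.1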

Re: [PATCH] REGTESTS: Add sample_fetches/cook.vtc

2020-11-16 Thread Maciej Zdeb
Great idea Tim! Thank you!

pt., 13 lis 2020 o 19:49 Christopher Faulet 
napisał(a):

> Le 13/11/2020 à 19:36, Tim Duesterhus a écrit :
> > Add a reg-test verifying the fix in
> dea7c209f8a77b471323dd97bdc1ac4d7a17b812.
> >
>
> Thanks Tim! merged now.
>
> --
> Christopher Faulet
>


Re: HTTP fetch req.cook_cnt() always return 0

2020-11-13 Thread Maciej Zdeb
It is great Christopher! :) Thank you for fixing it!

W dniu pt., 13.11.2020 o 14:06 Christopher Faulet 
napisał(a):

> Le 13/11/2020 à 11:27, Maciej Zdeb a écrit :
> > Fixed warnings during compilation and fixed req.cook() which should
> return first
> > and not last value when no cookie name specified.
> >
>
> Thanks Maciej,
>
> I slightly updated your patch though. I changed the commit message to add
> details and simplified smp_fetch_cookie() function. I also updated the
> comments.
> I didn't want to bother you with that. hope it's ok for you ? Let me known
> if
> you want to rework the patch instead.
>
> I also added another fix to support the sample fetches without the
> parentheses.
> 'req.cook()' and 'req.cook' must do the same.
>
> --
> Christopher Faulet
>
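
(For illustration, a small config sketch exercising both spellings once the fix
is applied; hypothetical, assuming req.cook with no name returns the first
cookie value as described above.)

frontend front
    mode http
    http-request set-var(txn.cnt)   req.cook_cnt()
    http-request set-var(txn.first) req.cook          # same as req.cook()
    http-response set-header x-cookie-count %[var(txn.cnt)]
    http-response set-header x-first-cookie %[var(txn.first)]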


Re: HTTP fetch req.cook_cnt() always return 0

2020-11-13 Thread Maciej Zdeb
Fixed warnings during compilation and fixed req.cook() which should return
first and not last value when no cookie name specified.

pt., 13 lis 2020 o 10:48 Maciej Zdeb  napisał(a):

> I've a patch that is fixing the issue (I hope) ;) It's my first patch so
> any hints are very welcome :)
>
> pt., 13 lis 2020 o 09:45 Christopher Faulet 
> napisał(a):
>
>> Le 12/11/2020 à 15:08, Maciej Zdeb a écrit :
>> > Hi,
>> >
>> > With such sample config:
>> >
>> > frontend front
>> >  mode http
>> >  http-request set-var(txn.abc) req.cook_cnt()
>> >  http-response set-header abc %[var(txn.abc)]
>> >  bind x.x.x.x:80
>> >  default_backend back
>> > backend back
>> >  mode http
>> >  server srv y.y.y.y:80 check
>> >
>> > When sending 3 cookies with curl:
>> > curl -I http://x.x.x.x <http://x.x.x.x> -H 'cookie: cook1=0;
>> cook2=123; cook3=22'
>> > I expected the response header abc: 3, instead I got abc: 0.
>> >
>> https://cbonte.github.io/haproxy-dconv/2.2/configuration.html#7.3.6-req.cook_cnt
>> > <
>> https://cbonte.github.io/haproxy-dconv/2.2/configuration.html#7.3.6-req.cook_cnt
>> >
>> >
>> > haproxy -v
>> > HA-Proxy version 2.2.5-34b2b10 2020/11/05 - https://haproxy.org/
>> > <https://haproxy.org/>
>>
>> Hi Maciej,
>>
>> I confirm the bug and it seems to fail on all versions (tested as far as
>> 1.7).
>> And all sample fetches searching for a cookie value are affected too
>> (req.cook(), res.scook()...)
>>
>> Thanks Maciej, we will handle this bug.
>>
>> --
>> Christopher Faulet
>>
>


0001-BUG-MINOR-http-Extract-cookie-value-even-when-no-coo.patch
Description: Binary data


Re: HTTP fetch req.cook_cnt() always return 0

2020-11-13 Thread Maciej Zdeb
I've a patch that is fixing the issue (I hope) ;) It's my first patch so
any hints are very welcome :)

pt., 13 lis 2020 o 09:45 Christopher Faulet 
napisał(a):

> Le 12/11/2020 à 15:08, Maciej Zdeb a écrit :
> > Hi,
> >
> > With such sample config:
> >
> > frontend front
> >  mode http
> >  http-request set-var(txn.abc) req.cook_cnt()
> >  http-response set-header abc %[var(txn.abc)]
> >  bind x.x.x.x:80
> >  default_backend back
> > backend back
> >  mode http
> >  server srv y.y.y.y:80 check
> >
> > When sending 3 cookies with curl:
> > curl -I http://x.x.x.x <http://x.x.x.x> -H 'cookie: cook1=0; cook2=123;
> cook3=22'
> > I expected the response header abc: 3, instead I got abc: 0.
> >
> https://cbonte.github.io/haproxy-dconv/2.2/configuration.html#7.3.6-req.cook_cnt
> > <
> https://cbonte.github.io/haproxy-dconv/2.2/configuration.html#7.3.6-req.cook_cnt
> >
> >
> > haproxy -v
> > HA-Proxy version 2.2.5-34b2b10 2020/11/05 - https://haproxy.org/
> > <https://haproxy.org/>
>
> Hi Maciej,
>
> I confirm the bug and it seems to fail on all versions (tested as far as
> 1.7).
> And all sample fetches searching for a cookie value are affected too
> (req.cook(), res.scook()...)
>
> Thanks Maciej, we will handle this bug.
>
> --
> Christopher Faulet
>


0001-BUG-MINOR-http-Extract-cookie-value-even-when-cookie.patch
Description: Binary data


HTTP fetch req.cook_cnt() always return 0

2020-11-12 Thread Maciej Zdeb
Hi,

With such sample config:

frontend front
mode http
http-request set-var(txn.abc) req.cook_cnt()
http-response set-header abc %[var(txn.abc)]
bind x.x.x.x:80
default_backend back
backend back
mode http
server srv y.y.y.y:80 check

When sending 3 cookies with curl:
curl -I http://x.x.x.x -H 'cookie: cook1=0; cook2=123; cook3=22'
I expected the response header abc: 3, instead I got abc: 0.
https://cbonte.github.io/haproxy-dconv/2.2/configuration.html#7.3.6-req.cook_cnt

haproxy -v
HA-Proxy version 2.2.5-34b2b10 2020/11/05 - https://haproxy.org/


Re: [2.0.17] crash with coredump

2020-11-11 Thread Maciej Zdeb
śr., 11 lis 2020 o 12:53 Willy Tarreau  napisał(a):

> Two months of chasing a non reproducible
> memory corruption with zero initial info is quite an achievement, many
> thanks for doing that!
>

Initially it crashed (once every few hours) only on our most critical
HAProxy servers and with the SPOA from an external vendor, then on less
critical but still production servers. It took a lot of time to analyze our
config without breaking anything. I was almost sure it was something specific
to our configuration, especially since nobody else reported similar problems
with SPOE.

Your patch with additional checks was a game changer; it made the crash easier
to trigger and thus easier to replicate! :)


Re: [2.0.17] crash with coredump

2020-11-11 Thread Maciej Zdeb
Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0
and 2.2 branches. I'll keep testing it for a couple days just to be sure.

So the stack trace I shared before (in the spoe_release_appctx function) was
very lucky... Do you think it would have been possible to find the bug without
the reproduction procedure?

Christopher & Willy, many thanks for your hard work! I'm always impressed by
how fast you're able to narrow down a bug once you finally get proper input
from a reporter. :)

wt., 10 lis 2020 o 22:30 Willy Tarreau  napisał(a):

> Hi Christopher,
>
> On Tue, Nov 10, 2020 at 09:17:15PM +0100, Christopher Faulet wrote:
> > Le 10/11/2020 à 18:12, Maciej Zdeb a écrit :
> > > Hi,
> > >
> > > I'm so happy you're able to replicate it! :)
> > >
> > > With that patch that disabled pool_flush I still can reproduce on my
> r
> > > server and on production, just different places of crash:
> > >
> >
> > Hi Maciej,
> >
> > Could you test the following patch please ? For now I don't know if it
> fully
> > fixes the bug. But it is step forward. I must do a deeper review to be
> sure
> > it covers all cases.
>
> Looks like you got it right this time, not only it doesn't crash anymore
> in my tests, the suspiciously wrong cur_fap that were going negative very
> quickly do not happen anymore either! This is a very good news! Looking
> forward to read about Maciej's tests.
>
> Cheers,
> Willy
>


Re: [2.0.17] crash with coredump

2020-11-10 Thread Maciej Zdeb
Hi,

I'm so happy you're able to replicate it! :)

With that patch that disabled pool_flush I still can reproduce on my r
server and on production, just different places of crash:

on r:
(gdb) bt
#0  tasklet_wakeup (tl=0xd720c300a000) at include/haproxy/task.h:328
#1  h2s_notify_recv (h2s=h2s@entry=0x55d720c2d500) at src/mux_h2.c:1037
#2  0x55d71f44d3a0 in h2s_notify_recv (h2s=0x55d720c2d500) at
include/haproxy/trace.h:150
#3  h2s_close (h2s=0x55d720c2d500) at src/mux_h2.c:1236
#4  0x55d71f450c26 in h2s_frt_make_resp_headers (htx=0x55d720ae4c90,
h2s=0x55d720c2d500) at src/mux_h2.c:4795
#5  h2_snd_buf (cs=0x55d720c31000, buf=0x55d720c2d888, count=182,
flags=) at src/mux_h2.c:5888
#6  0x55d71f4fb9fa in si_cs_send (cs=0x55d720c31000) at
src/stream_interface.c:737
#7  0x55d71f4fc2c0 in si_sync_send (si=si@entry=0x55d720c2db48) at
src/stream_interface.c:914
#8  0x55d71f49ea91 in process_stream (t=,
context=0x55d720c2d810, state=) at src/stream.c:2245
#9  0x55d71f55cfe9 in run_tasks_from_list (list=list@entry=0x55d71f96cb40
, max=max@entry=149) at src/task.c:371
#10 0x55d71f55d7ca in process_runnable_tasks () at src/task.c:519
#11 0x55d71f517c15 in run_poll_loop () at src/haproxy.c:2900
#12 0x55d71f517fc9 in run_thread_poll_loop (data=) at
src/haproxy.c:3065
#13 0x55d71f3ef87e in main (argc=, argv=0x7fff7a4ef218)
at src/haproxy.c:3767

on production:
#0  0x557070df32d6 in h2s_notify_recv (h2s=0x7ff0fc696670) at
src/mux_h2.c:1035
#1  h2s_close (h2s=0x7ff0fc696670) at src/mux_h2.c:1236
#2  0x557070df7922 in h2s_frt_make_resp_data (count=,
buf=0x7ff0ec87ee78, h2s=0x7ff0fc696670) at src/mux_h2.c:5466
#3  h2_snd_buf (cs=0x7ff118af9790, buf=0x7ff0ec87ee78, count=3287,
flags=) at src/mux_h2.c:5903
#4  0x557070ea19fa in si_cs_send (cs=cs@entry=0x7ff118af9790) at
src/stream_interface.c:737
#5  0x557070ea1c5b in stream_int_chk_snd_conn (si=0x7ff0ec87f138) at
src/stream_interface.c:1121
#6  0x557070e9f112 in si_chk_snd (si=0x7ff0ec87f138) at
include/haproxy/stream_interface.h:488
#7  stream_int_notify (si=si@entry=0x7ff0ec87f190) at
src/stream_interface.c:490
#8  0x557070ea1f48 in si_cs_process (cs=cs@entry=0x7ff0fc93d9d0) at
src/stream_interface.c:624
#9  0x557070ea31fa in si_cs_io_cb (t=,
ctx=0x7ff0ec87f190, state=) at src/stream_interface.c:792
#10 0x557070f030ed in run_tasks_from_list (list=list@entry=0x557071312c50
, max=) at src/task.c:348
#11 0x557070f037da in process_runnable_tasks () at src/task.c:523
#12 0x557070ebdc15 in run_poll_loop () at src/haproxy.c:2900
#13 0x557070ebdfc9 in run_thread_poll_loop (data=) at
src/haproxy.c:3065
#14 0x7ff1cc2f16db in start_thread (arg=0x7ff11f7da700) at
pthread_create.c:463
#15 0x7ff1cb287a3f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

wt., 10 lis 2020 o 17:19 Willy Tarreau  napisał(a):

> On Tue, Nov 10, 2020 at 04:14:52PM +0100, Willy Tarreau wrote:
> > Seems like we're getting closer. Will continue digging now.
>
> I found that among the 5 crashes I got, 3 were under pool_flush()
> that is precisely called during the soft stopping. I tried to
> disable that function with the patch below and I can't reproduce
> the problem anymore, it would be nice if you could test it. I'm
> suspecting that either it copes badly with the lockless pools,
> or that pool_gc() itself, called from the signal handler, could
> possibly damage some of the pools and cause some loose objects to
> be used, returned and reused once reallocated. I see no reason
> for the relation with SPOE like this, but maybe it just helps
> trigger the complex condition.
>
> diff --git a/src/pool.c b/src/pool.c
> index 321f8bc67..5e2f41fe9 100644
> --- a/src/pool.c
> +++ b/src/pool.c
> @@ -246,7 +246,7 @@ void pool_flush(struct pool_head *pool)
> void **next, *temp;
> int removed = 0;
>
> -   if (!pool)
> +   //if (!pool)
> return;
> HA_SPIN_LOCK(POOL_LOCK, &pool->lock);
> do {
>
> I'm continuing to investigate.
>
> Willy
>


Re: [2.0.17] crash with coredump

2020-11-10 Thread Maciej Zdeb
Hi,

I'm very sorry that my skills in gdb and knowledge of HAProxy and C are not
sufficient for this debugging process.

With the patch applied I tried again to use spoa from
"contrib/spoa_example/". Example spoa agent does not understand my
spoe-message and silently ignores it, but it doesn't matter.

To trigger the segmentation fault I must reload HAProxy (when using the SPOA
from an external vendor this additional reload wasn't necessary; I just had
to wait a couple of seconds for the crash).

Usually HAProxy crashes on process_stream, but once it crashed at
(long)h2s->subs & 1 check in testcorrupt during spoe_release_appctx
#0  0x5597450c25f9 in testcorrupt (ptr=0x7f4fb8071990) at
src/mux_h2.c:6238
cs = 0x7f4fb8071990
h2s = 0x7f4fe85751f0
#1  0x559745196239 in spoe_release_appctx (appctx=0x7f4fe8324e00) at
src/flt_spoe.c:1294
si = 0x7f4fe82b31f8
spoe_appctx = 0x7f4fe88dd760
agent = 0x559746052580
ctx = 0x7f4fe8380b80
back = 0x559746355b38

Then I tried again to replicate the bug on my r server, this time reloading
HAProxy (multiple times) during the test, and it crashed.

HAProxy was compiled with git HEAD set to
77015abe0bcfde67bff519b1d48393a513015f77 with patch
0001-EXP-try-to-spot-where-h2s-subs-changes-V2.patch applied
and with modified h2s:

diff --git a/src/mux_h2.c b/src/mux_h2.c
index 9928b32c7..3d5187271 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -206,6 +206,8 @@ struct h2s {
  uint16_t status; /* HTTP response status */
  unsigned long long body_len; /* remaining body length according to
content-length if H2_SF_DATA_CLEN */
  struct buffer rxbuf; /* receive buffer, always valid (buf_empty or real
buffer) */
+ struct tasklet *dummy0;
+ struct wait_event *dummy1;
  struct wait_event *subs;  /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
  struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
  struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to send
an RST after we failed to,

Attached:
haproxy.cfg (/etc/haproxy/haproxy.cfg main config)
spoe-example.conf (/etc/haproxy/spoe-example.conf spoe config)

I used spoa from contrib/spoa_example run with command:
"./spoa -p 4545  -c fragmentation -c async -c pipelining"

I used vegeta to generate traffic: https://github.com/tsenart/vegeta with
command:
"cat input | ./vegeta attack -duration=360s -insecure   -keepalive=false
 -http2=true -rate=500/1s > /dev/null"
I used 2 virtual machines to generate traffic and additionally I've
launched vegeta on host with HAProxy

where input file is:
GET https://haproxy-crash.test.local/
zdebek:
sdofijdsoifjodisjfoisdjfoisdovisoivjdfoijvoisdjvopsdijg0934u49032ut09gir09j40g9u0492it093i2g09i0r9bi2490ib094i0b9i09i0924it09biitk42jh09tj4309sdfjdlsjfoadiwe9023i0r92094i4309gi0934ig9034ig093i4g90i3409gi3409gi0394ig0934i0g93jjoujgiurhjgiuerhgiurehgiuerhg89489u098u509u09wrut0923ej23fjjsufdsuf98dusf98u98u2398uf9834uf983u49f8h98huish9fsdu98fusd98uf982u398u3298ru2938uffhsdijhfisdjhiusdhfiu2iuhf2398289823189831893198931udashidsah

I reloaded the HAProxy configuration (multiple times, again and again until
the segmentation fault occurred) with:
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 10608

pon., 9 lis 2020 o 16:01 Maciej Zdeb  napisał(a):

> It crashed now on first test in process_stream:
>
> struct task *process_stream(struct task *t, void *context, unsigned short
> state)
> {
> struct server *srv;
> struct stream *s = context;
> struct session *sess = s->sess;
> unsigned int rqf_last, rpf_last;
> unsigned int rq_prod_last, rq_cons_last;
> unsigned int rp_cons_last, rp_prod_last;
> unsigned int req_ana_back;
> struct channel *req, *res;
> struct stream_interface *si_f, *si_b;
> unsigned int rate;
>
> TEST_STRM(s);
> [...]
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
> src/mux_h2.c:6238
> [Current thread is 1 (Thread 0x7f75a98b9700 (LWP 5860))]
> (gdb) bt full
> #0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
> src/mux_h2.c:6238
> cs = 0x7f75ac1ed990
> h2s = 0x7f7584244510
> #1  0x55f4cdad8993 in process_stream (t=0x7f75ac139d70,
> context=0x7f7588066540, state=260) at src/stream.c:1499
> srv = 0x7f75a9896390
> s = 0x7f7588066540
> sess = 0x7f759c071b80
> rqf_last = 4294967294
> rpf_last = 2217468112
> rq_prod_last = 32629
> rq_cons_last = 2217603024
> rp_cons_last = 32629
> rp_prod_last = 2217182865
> req_ana_back = 2217603025
> req = 0x7f75a9896350
> res = 0x55f4cdbed618 <__task_queue

Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
It crashed now on first test in process_stream:

struct task *process_stream(struct task *t, void *context, unsigned short
state)
{
struct server *srv;
struct stream *s = context;
struct session *sess = s->sess;
unsigned int rqf_last, rpf_last;
unsigned int rq_prod_last, rq_cons_last;
unsigned int rp_cons_last, rp_prod_last;
unsigned int req_ana_back;
struct channel *req, *res;
struct stream_interface *si_f, *si_b;
unsigned int rate;

TEST_STRM(s);
[...]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
src/mux_h2.c:6238
[Current thread is 1 (Thread 0x7f75a98b9700 (LWP 5860))]
(gdb) bt full
#0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
src/mux_h2.c:6238
cs = 0x7f75ac1ed990
h2s = 0x7f7584244510
#1  0x55f4cdad8993 in process_stream (t=0x7f75ac139d70,
context=0x7f7588066540, state=260) at src/stream.c:1499
srv = 0x7f75a9896390
s = 0x7f7588066540
sess = 0x7f759c071b80
rqf_last = 4294967294
rpf_last = 2217468112
rq_prod_last = 32629
rq_cons_last = 2217603024
rp_cons_last = 32629
rp_prod_last = 2217182865
req_ana_back = 2217603025
req = 0x7f75a9896350
res = 0x55f4cdbed618 <__task_queue+92>
si_f = 0x55f4ce03c680 
si_b = 0x7f75842def80
rate = 2217603024
#2  0x55f4cdbeddb2 in run_tasks_from_list (list=0x55f4ce03c6c0
, max=150) at src/task.c:371
process = 0x55f4cdad892d 
t = 0x7f75ac139d70
state = 260
ctx = 0x7f7588066540
done = 3
[...]

subs is 0x like before BUT dummy1 is also changed to 0x

(gdb) p *(struct h2s*)(0x7f7584244510)
$1 = {cs = 0x7f75ac1ed990, sess = 0x55f4ce02be40 , h2c =
0x7f758417abd0, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0,
body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = {
  branches = {b = {0x7f758428e430, 0x7f7584244550}}, node_p =
0x7f758428e431, leaf_p = 0x7f7584244551, bit = 1, pfx = 33828}, key = 23},
id = 23, flags = 16385, sws = 0, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM,
  status = 0, body_len = 0, rxbuf = {size = 16384, area = 0x7f75780a2210
"Ð?", data = 16384, head = 0}, dummy0 = 0x0, dummy1 = 0x, subs =
0x, list = {n = 0x7f75842445c8, p = 0x7f75842445c8},
  shut_tl = 0x7f75842df0d0}

pon., 9 lis 2020 o 15:07 Christopher Faulet 
napisał(a):

> Le 09/11/2020 à 13:10, Maciej Zdeb a écrit :
> > I've played little bit with the patch and it led me to backend.c file
> and
> > connect_server() function
> >
> > int connect_server(struct stream *s)
> > {
> > [...]
> > if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) {
> >  /* set the correct protocol on the output stream
> interface */
> >  if (srv)
> >  conn_prepare(srv_conn,
> > protocol_by_family(srv_conn->dst->ss_family), srv->xprt);
> >  else if (obj_type(s->target) == OBJ_TYPE_PROXY) {
> >  /* proxies exclusively run on raw_sock right
> now */
> >  conn_prepare(srv_conn,
> > protocol_by_family(srv_conn->dst->ss_family), xprt_get(XPRT_RAW));
> >  if (!(srv_conn->ctrl)) {
> >  conn_free(srv_conn);
> >  return SF_ERR_INTERNAL;
> >  }
> >  }
> >  else {
> >  conn_free(srv_conn);
> >  return SF_ERR_INTERNAL;  /* how did we get
> there ? */
> >  }
> > // THIS ONE IS OK
> > TEST_STRM(s);
> > //
> >  srv_cs = si_alloc_cs(&s->si[1], srv_conn);
> > // FAIL
> > TEST_STRM(s);
> > //
> >  if (!srv_cs) {
> >  conn_free(srv_conn);
> >  return SF_ERR_RESOURCE;
> >  }
>
> Hi,
>
> In fact, this crash occurs because of the Willy's patch. It was not design
> to
> handle non-h2 connections. Here the crash happens on a TCP connection,
> used by a
> SPOE applet for instance.
>
> I updated its patch. First, I added some calls to TEST_STRM() in the SPOE
> code,
> to be sure. I also explicitly set the stream task to NULL in stream_free()
> to
> catch late wakeups in the SPOE. Finally, I modified testcorrupt(). I hope
> this
> one is correct. But if I missed something, you may only keep the last
> ABORT_NOW() in testcorrupt() and replace others by a return statement,
> just like
> in the Willy's patch.
>
> --
> Christopher Faulet
>


Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
I've played little bit with the patch and it led me to backend.c file and
connect_server() function

int connect_server(struct stream *s)
{
[...]
if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) {
/* set the correct protocol on the output stream interface
*/
if (srv)
conn_prepare(srv_conn,
protocol_by_family(srv_conn->dst->ss_family), srv->xprt);
else if (obj_type(s->target) == OBJ_TYPE_PROXY) {
/* proxies exclusively run on raw_sock right now */
conn_prepare(srv_conn,
protocol_by_family(srv_conn->dst->ss_family), xprt_get(XPRT_RAW));
if (!(srv_conn->ctrl)) {
conn_free(srv_conn);
return SF_ERR_INTERNAL;
}
}
else {
conn_free(srv_conn);
return SF_ERR_INTERNAL;  /* how did we get there ?
*/
}
// THIS ONE IS OK
TEST_STRM(s);
//
srv_cs = si_alloc_cs(&s->si[1], srv_conn);
// FAIL
TEST_STRM(s);
//
if (!srv_cs) {
conn_free(srv_conn);
return SF_ERR_RESOURCE;
}
[...]
}

pon., 9 lis 2020 o 11:51 Maciej Zdeb  napisał(a):

> Hi,
>
> This time h2s = 0x30 ;)
>
> it crashed here:
> void testcorrupt(void *ptr)
> {
> [...]
> if (h2s->cs != cs)
> return;
> [...]
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
> src/mux_h2.c:6228
> 6228 src/mux_h2.c: No such file or directory.
> [Current thread is 1 (Thread 0x7f99a484d700 (LWP 28658))]
> (gdb) bt full
> #0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
> src/mux_h2.c:6228
> cs = 0x7f99741d85a0
> h2s = 0x30
> #1  0x556b61850b1a in process_stream (t=0x7f99741d8c60,
> context=0x7f99682cd7b0, state=1284) at src/stream.c:2147
> srv = 0x556b622770e0
> s = 0x7f99682cd7b0
> sess = 0x7f9998057170
> rqf_last = 9469954
> rpf_last = 2151677952
> rq_prod_last = 8
> rq_cons_last = 0
> rp_cons_last = 8
> rp_prod_last = 0
> req_ana_back = 0
> req = 0x7f99682cd7c0
> res = 0x7f99682cd820
> si_f = 0x7f99682cdae8
> si_b = 0x7f99682cdb40
> rate = 1
> #2  0x556b61962a5f in run_tasks_from_list (list=0x556b61db1600
> , max=150) at src/task.c:371
> process = 0x556b6184d8e6 
> t = 0x7f99741d8c60
> state = 1284
> ctx = 0x7f99682cd7b0
> done = 2
> [...]
>
>
> pt., 6 lis 2020 o 20:00 Willy Tarreau  napisał(a):
>
>> Maciej,
>>
>> I wrote this ugly patch to try to crash as soon as possible when a corrupt
>> h2s->subs is detected. The patch was written for 2.2. I only instrumented
>> roughly 30 places in process_stream() which is a fairly likely candidate.
>> I just hope it happens within the context of the stream itself otherwise
>> it will become really painful.
>>
>> You can apply this patch on top of your existing changes. It will try to
>> detect the presence of a non-zero lowest bit in the subs pointer (which
>> should never happen). If we're lucky it will crash inside process_stream()
>> between two points and we'll be able to narrow it down. If we're unlucky
>> it will crash when entering it and that will not be fun.
>>
>> If you want to play with it, you can apply TEST_SI() on stream_interface
>> pointers (often called "si"), TEST_STRM() on stream pointers, and
>> TEST_CS()
>> on conn_stream pointers (often called "cs").
>>
>> Please just let me know how it goes. Note, I tested it, it passes all
>> regtests for me so I'm reasonably confident it should not crash by
>> accident. But I can't be sure, I'm just using heuristics, so please do
>> not put it in sensitive production!
>>
>> Thanks,
>> Willy
>>
>


Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
Hi,

This time h2s = 0x30 ;)

it crashed here:
void testcorrupt(void *ptr)
{
[...]
if (h2s->cs != cs)
return;
[...]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
src/mux_h2.c:6228
6228 src/mux_h2.c: No such file or directory.
[Current thread is 1 (Thread 0x7f99a484d700 (LWP 28658))]
(gdb) bt full
#0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
src/mux_h2.c:6228
cs = 0x7f99741d85a0
h2s = 0x30
#1  0x556b61850b1a in process_stream (t=0x7f99741d8c60,
context=0x7f99682cd7b0, state=1284) at src/stream.c:2147
srv = 0x556b622770e0
s = 0x7f99682cd7b0
sess = 0x7f9998057170
rqf_last = 9469954
rpf_last = 2151677952
rq_prod_last = 8
rq_cons_last = 0
rp_cons_last = 8
rp_prod_last = 0
req_ana_back = 0
req = 0x7f99682cd7c0
res = 0x7f99682cd820
si_f = 0x7f99682cdae8
si_b = 0x7f99682cdb40
rate = 1
#2  0x556b61962a5f in run_tasks_from_list (list=0x556b61db1600
, max=150) at src/task.c:371
process = 0x556b6184d8e6 
t = 0x7f99741d8c60
state = 1284
ctx = 0x7f99682cd7b0
done = 2
[...]


pt., 6 lis 2020 o 20:00 Willy Tarreau  napisał(a):

> Maciej,
>
> I wrote this ugly patch to try to crash as soon as possible when a corrupt
> h2s->subs is detected. The patch was written for 2.2. I only instrumented
> roughly 30 places in process_stream() which is a fairly likely candidate.
> I just hope it happens within the context of the stream itself otherwise
> it will become really painful.
>
> You can apply this patch on top of your existing changes. It will try to
> detect the presence of a non-zero lowest bit in the subs pointer (which
> should never happen). If we're lucky it will crash inside process_stream()
> between two points and we'll be able to narrow it down. If we're unlucky
> it will crash when entering it and that will not be fun.
>
> If you want to play with it, you can apply TEST_SI() on stream_interface
> pointers (often called "si"), TEST_STRM() on stream pointers, and TEST_CS()
> on conn_stream pointers (often called "cs").
>
> Please just let me know how it goes. Note, I tested it, it passes all
> regtests for me so I'm reasonably confident it should not crash by
> accident. But I can't be sure, I'm just using heuristics, so please do
> not put it in sensitive production!
>
> Thanks,
> Willy
>
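
(For readers without the attachment, a rough sketch of the kind of check
described above; this is a reconstruction, not the actual patch. It assumes the
2.2-era layout where cs->ctx points back to the h2s and that a corrupted subs
pointer has its lowest bit set, as seen in the dumps. TEST_CS()/TEST_SI()/
TEST_STRM() would be thin wrappers that end up calling this on the conn_stream
pointers reachable from their argument.)

void testcorrupt(void *ptr)
{
	struct conn_stream *cs = ptr;
	struct h2s *h2s;

	if (!cs)
		return;
	h2s = cs->ctx;              /* mux context: the h2 stream */
	if (!h2s)
		return;
	if (h2s->cs != cs)          /* no back-pointer match: not an h2 stream */
		return;
	if ((unsigned long)h2s->subs & 1)
		ABORT_NOW();        /* corrupted subscription pointer */
}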


Re: [2.0.17] crash with coredump

2020-11-03 Thread Maciej Zdeb
I modified h2s struct in 2.2 branch with HEAD set to
f96508aae6b49277dcf142caa35042678cf8e2ca "MEDIUM: mux-h2: merge recv_wait
and send_wait event notifications" like below (subs is in exact place of
removed wait_event):

struct h2s {
[...]
struct tasklet *dummy0;
struct wait_event *dummy1;
struct wait_event *subs;  /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to
send an RST after we failed to,
   * in case there's no other subscription
to do it */
}

it crashed like before with subs = 0x:

(gdb) p *(struct h2s*)(0x7fde7459e9b0)
$1 = {cs = 0x7fde5c02d260, sess = 0x5628283bc740 , h2c =
0x5628295cbb80, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0,
body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node =
{branches = {b = {0x0, 0x7fde3c2c6c60}}, node_p = 0x0,
  leaf_p = 0x5628295cc018, bit = -5624, pfx = 29785}, key = 11}, id =
11, flags = 28673, sws = -4060, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM,
  status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head
= 0}, dummy0 = 0x0, dummy1 = 0x0, subs = 0x, list = {
n = 0x7fde7459ea68, p = 0x7fde7459ea68}, shut_tl = 0x5628297eeaf0}

it crashes like above until commit:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=5c8be272c732e4f42ccd6b3d65f25aa7425a2aba
which alters tasks processing.


pon., 2 lis 2020 o 15:46 Maciej Zdeb  napisał(a):

> I'm wondering, the corrupted address was always at "wait_event" in h2s
> struct, after its removal in:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
> crashes went away.
>
> But with the above patch and after altering h2s struct into:
> struct h2s {
> [...]
> struct tasklet *shut_tl;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct wait_event *send_wait; /* send wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> };
>
> the crash returned.
>
> However after recv_wait and send_wait were merged in:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
> crashes went away again.
>
> In my opinion shut_tl should be corrupted again, but it is not. Maybe the
> last patch fixed it?
>
> pon., 2 lis 2020 o 15:37 Kirill A. Korinsky  napisał(a):
>
>> Maciej,
>>
>> Looks like memory corruption is still here but it corrupt just some
>> another place.
>>
>> Willy do you agree?
>>
>> --
>> wbr, Kirill
>>
>> On 2. Nov 2020, at 15:34, Maciej Zdeb  wrote:
>>
>> So after Kirill suggestion to modify h2s struct in a way that tasklet
>> "shut_tl" is before recv_wait I verified if in 2.2.4 the same crash will
>> occur nd it did not!
>>
>> After the patch that merges recv_wait and send_wait:
>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
>> and witch such h2s (tasklet shut_tl before wait_event subs) the crashes
>> are gone:
>>
>> struct h2s {
>> [...]
>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty
>> or real buffer) */
>> struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry
>> to send an RST after we failed to,
>>* in case there's no other
>> subscription to do it */
>> struct wait_event *subs;  /* recv wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> struct list list; /* To be used when adding in h2c->send_list or
>> h2c->fctl_lsit */
>> };
>>
>>
>>
>> pon., 2 lis 2020 o 12:42 Maciej Zdeb  napisał(a):
>>
>>> Great idea Kirill,
>>>
>>> With such modification:
>>>
>>> struct h2s {
>>> [...]
>>> struct tasklet *shut_tl;
>>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> struct wait_event *send_wait; /* send wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> struct list list; /* To be used when adding in h2c->send_list or
>>> h2c->fctl_lsi

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
I'm wondering, the corrupted address was always at "wait_event" in h2s
struct, after its removal in:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
crashes went away.

But with the above patch and after altering h2s struct into:
struct h2s {
[...]
struct tasklet *shut_tl;
struct wait_event *recv_wait; /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct wait_event *send_wait; /* send wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
};

the crash returned.

However after recv_wait and send_wait were merged in:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
crashes went away again.

In my opinion shut_tl should be corrupted again, but it is not. Maybe the
last patch fixed it?

pon., 2 lis 2020 o 15:37 Kirill A. Korinsky  napisał(a):

> Maciej,
>
> Looks like memory corruption is still here but it corrupt just some
> another place.
>
> Willy do you agree?
>
> --
> wbr, Kirill
>
> On 2. Nov 2020, at 15:34, Maciej Zdeb  wrote:
>
> So, after Kirill's suggestion to modify the h2s struct so that the tasklet
> "shut_tl" sits before recv_wait, I verified whether the same crash occurs
> in 2.2.4, and it did not!
>
> After the patch that merges recv_wait and send_wait:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
> and with such an h2s (tasklet shut_tl before the wait_event subs) the crashes
> are gone:
>
> struct h2s {
> [...]
> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
> real buffer) */
> struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry
> to send an RST after we failed to,
>* in case there's no other subscription
> to do it */
> struct wait_event *subs;  /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> };
>
>
>
> pon., 2 lis 2020 o 12:42 Maciej Zdeb  napisał(a):
>
>> Great idea Kirill,
>>
>> With such modification:
>>
>> struct h2s {
>> [...]
>> struct tasklet *shut_tl;
>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> struct wait_event *send_wait; /* send wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> struct list list; /* To be used when adding in h2c->send_list or
>> h2c->fctl_lsit */
>> };
>>
>> it crashed just like before.
>>
>> pon., 2 lis 2020 o 11:12 Kirill A. Korinsky 
>> napisał(a):
>>
>>> Hi,
>>>
>>> Thanks for update.
>>>
>>> After reading Willy's recommendation and the commit that fixed the issue,
>>> I'm curious: can you "edit" this commit a bit and move `shut_tl` before
>>> `recv_wait`, in place of the removed `wait_event`?
>>>
>>> It is a rather crude way to confirm that the memory corruption is gone, and
>>> not just moved somewhere else.
>>>
>>> --
>>> wbr, Kirill
>>>
>>> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
>>>
>>> Hi,
>>>
>>> Update for people on the list that might be interested in the issue,
>>> because part of discussion was private.
>>>
>>> I wanted to check Willy suggestion and modified h2s struct (added dummy
>>> fields):
>>>
>>> struct h2s {
>>> [...]
>>> uint16_t status; /* HTTP response status */
>>> unsigned long long body_len; /* remaining body length according
>>> to content-length if H2_SF_DATA_CLEN */
>>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty
>>> or real buffer) */
>>> int dummy0;
>>> struct wait_event wait_event; /* Wait list, when we're
>>> attempting to send a RST but we can't send */
>>> int dummy1;
>>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> int dummy2;
>>> struct wait_event *send_wait; /* send wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> int dummy3;
>>>  

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
So, after Kirill's suggestion to modify the h2s struct so that the tasklet
"shut_tl" sits before recv_wait, I verified whether the same crash occurs in
2.2.4, and it did not!

After the patch that merges recv_wait and send_wait:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
and with such an h2s (tasklet shut_tl before the wait_event subs) the crashes
are gone:

struct h2s {
[...]
struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
real buffer) */
struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to
send an RST after we failed to,
   * in case there's no other subscription
to do it */
struct wait_event *subs;  /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
};



pon., 2 lis 2020 o 12:42 Maciej Zdeb  napisał(a):

> Great idea Kirill,
>
> With such modification:
>
> struct h2s {
> [...]
> struct tasklet *shut_tl;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct wait_event *send_wait; /* send wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> };
>
> it crashed just like before.
>
> pon., 2 lis 2020 o 11:12 Kirill A. Korinsky  napisał(a):
>
>> Hi,
>>
>> Thanks for update.
>>
>> After reading Willy's recommendation and the commit that fixed the issue,
>> I'm curious: can you "edit" this commit a bit and move `shut_tl` before
>> `recv_wait`, in place of the removed `wait_event`?
>>
>> It is a rather crude way to confirm that the memory corruption is gone, and
>> not just moved somewhere else.
>>
>> --
>> wbr, Kirill
>>
>> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
>>
>> Hi,
>>
>> Update for people on the list that might be interested in the issue,
>> because part of discussion was private.
>>
>> I wanted to check Willy suggestion and modified h2s struct (added dummy
>> fields):
>>
>> struct h2s {
>> [...]
>> uint16_t status; /* HTTP response status */
>> unsigned long long body_len; /* remaining body length according
>> to content-length if H2_SF_DATA_CLEN */
>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty
>> or real buffer) */
>> int dummy0;
>> struct wait_event wait_event; /* Wait list, when we're attempting
>> to send a RST but we can't send */
>> int dummy1;
>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> int dummy2;
>> struct wait_event *send_wait; /* send wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> int dummy3;
>> struct list list; /* To be used when adding in h2c->send_list or
>> h2c->fctl_lsit */
>> struct list sending_list; /* To be used when adding in
>> h2c->sending_list */
>> };
>>
>> With such modified h2s struct, the crash did not occur.
>>
>> I've checked HAProxy 2.1, it crashes like 2.0.
>>
>> I've also checked 2.2, bisection showed that this commit:
>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
>> fixed the crashes we experienced. I'm not sure how/if it fixed the memory
>> corruption, it is possible that memory is still corrupted but not causing
>> the crash.
>>
>>
>>
>> pt., 25 wrz 2020 o 16:25 Kirill A. Korinsky 
>> napisał(a):
>>
>>> Very interesting.
>>>
>>> Anyway, I can see that this piece of code was refactored some time ago:
>>> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>>>
>>> Maybe it is worth to try 2.2 or 2.3 branch?
>>>
>>> Yes, it is a blind shot and just a guess.
>>>
>>> --
>>> wbr, Kirill
>>>
>>> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
>>>
>>> Yes at the same place with same value:
>>>
>>> (gdb) bt full
>>> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
>>> src/mux_h2.c:783
>>> sw = 0x
>>>
>>>
>>>
>>> pt., 25 wrz 2020 o 15:42 Kirill A. Korinsky 
>>> napisał(a):
>>>
>>>> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
>>>> >
>>>> > I was mailing outside the list with Willy and Christopher but it's
>>>> worth sharing that the problem occurs even with nbthread = 1. I've managed
>>>> to confirm it today.
>>>>
>>>>
>>>> I'm curious is it crashed at the same place with the same value?
>>>>
>>>> --
>>>> wbr, Kirill
>>>>
>>>>
>>>>
>>>
>>


Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
Great idea Kirill,

With such modification:

struct h2s {
[...]
struct tasklet *shut_tl;
struct wait_event *recv_wait; /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct wait_event *send_wait; /* send wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
};

it crashed just like before.

pon., 2 lis 2020 o 11:12 Kirill A. Korinsky  napisał(a):

> Hi,
>
> Thanks for update.
>
> After reading Willy's recommendation and the commit that fixed the issue,
> I'm curious: can you "edit" this commit a bit and move `shut_tl` before
> `recv_wait`, in place of the removed `wait_event`?
>
> It is a rather crude way to confirm that the memory corruption is gone, and
> not just moved somewhere else.
>
> --
> wbr, Kirill
>
> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
>
> Hi,
>
> Update for people on the list that might be interested in the issue,
> because part of discussion was private.
>
> I wanted to check Willy suggestion and modified h2s struct (added dummy
> fields):
>
> struct h2s {
> [...]
> uint16_t status; /* HTTP response status */
> unsigned long long body_len; /* remaining body length according to
> content-length if H2_SF_DATA_CLEN */
> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
> real buffer) */
> int dummy0;
> struct wait_event wait_event; /* Wait list, when we're attempting
> to send a RST but we can't send */
> int dummy1;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> int dummy2;
> struct wait_event *send_wait; /* send wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> int dummy3;
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> struct list sending_list; /* To be used when adding in
> h2c->sending_list */
> };
>
> With such modified h2s struct, the crash did not occur.
>
> I've checked HAProxy 2.1, it crashes like 2.0.
>
> I've also checked 2.2, bisection showed that this commit:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
> fixed the crashes we experienced. I'm not sure how/if it fixed the memory
> corruption, it is possible that memory is still corrupted but not causing
> the crash.
>
>
>
> pt., 25 wrz 2020 o 16:25 Kirill A. Korinsky  napisał(a):
>
>> Very interesting.
>>
>> Anyway, I can see that this piece of code was refactored some time ago:
>> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>>
>> Maybe it is worth to try 2.2 or 2.3 branch?
>>
>> Yes, it is a blind shot and just a guess.
>>
>> --
>> wbr, Kirill
>>
>> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
>>
>> Yes at the same place with same value:
>>
>> (gdb) bt full
>> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
>> src/mux_h2.c:783
>> sw = 0x
>>
>>
>>
>> pt., 25 wrz 2020 o 15:42 Kirill A. Korinsky 
>> napisał(a):
>>
>>> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
>>> >
>>> > I was mailing outside the list with Willy and Christopher but it's
>>> worth sharing that the problem occurs even with nbthread = 1. I've managed
>>> to confirm it today.
>>>
>>>
>>> I'm curious is it crashed at the same place with the same value?
>>>
>>> --
>>> wbr, Kirill
>>>
>>>
>>>
>>
>


Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
Hi,

Update for people on the list that might be interested in the issue,
because part of discussion was private.

I wanted to check Willy suggestion and modified h2s struct (added dummy
fields):

struct h2s {
[...]
uint16_t status; /* HTTP response status */
unsigned long long body_len; /* remaining body length according to
content-length if H2_SF_DATA_CLEN */
struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
real buffer) */
int dummy0;
struct wait_event wait_event; /* Wait list, when we're attempting
to send a RST but we can't send */
int dummy1;
struct wait_event *recv_wait; /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
int dummy2;
struct wait_event *send_wait; /* send wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
int dummy3;
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
struct list sending_list; /* To be used when adding in
h2c->sending_list */
};

With such modified h2s struct, the crash did not occur.

I've checked HAProxy 2.1, it crashes like 2.0.

I've also checked 2.2, bisection showed that this commit:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
fixed the crashes we experienced. I'm not sure how/if it fixed the memory
corruption, it is possible that memory is still corrupted but not causing
the crash.



pt., 25 wrz 2020 o 16:25 Kirill A. Korinsky  napisał(a):

> Very interesting.
>
> Anyway, I can see that this piece of code was refactored some time ago:
> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>
> Maybe it is worth to try 2.2 or 2.3 branch?
>
> Yes, it is a blind shot and just a guess.
>
> --
> wbr, Kirill
>
> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
>
> Yes at the same place with same value:
>
> (gdb) bt full
> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
> src/mux_h2.c:783
> sw = 0x
>
>
>
> pt., 25 wrz 2020 o 15:42 Kirill A. Korinsky  napisał(a):
>
>> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
>> >
>> > I was mailing outside the list with Willy and Christopher but it's
>> worth sharing that the problem occurs even with nbthread = 1. I've managed
>> to confirm it today.
>>
>>
>> I'm curious is it crashed at the same place with the same value?
>>
>> --
>> wbr, Kirill
>>
>>
>>
>


Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
Yes at the same place with same value:

(gdb) bt full
#0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
src/mux_h2.c:783
sw = 0x



pt., 25 wrz 2020 o 15:42 Kirill A. Korinsky  napisał(a):

> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
> >
> > I was mailing outside the list with Willy and Christopher but it's worth
> sharing that the problem occurs even with nbthread = 1. I've managed to
> confirm it today.
>
>
> I'm curious is it crashed at the same place with the same value?
>
> --
> wbr, Kirill
>
>
>


Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
> Here I can suggest to implement Yarrow PRGN (that is very simple to
> implement) with some lua-pure cryptographic hash function.

We're using lrandom because of the Mersenne Twister algorithm and its
well-known strengths and weaknesses.

> In fact I know it's possible to call haproxy's internal sample fetch
> functions from Lua (I never can figure how to do that, I always need
> to lookup an example for this unfortunately). But once you figure out how
> to do it, you can call the "rand()" sample fetch that will call the
> internal thread-safe random number generator.

The rand() sample fetch cannot be seeded (at the time we checked), so on
HAProxy servers with nbproc > 1 every process produced the same sequence of
random numbers; it was one of the reasons we couldn't use it.

I was mailing outside the list with Willy and Christopher but it's worth
sharing that the problem occurs even with nbthread = 1. I've managed to
confirm it today.

pt., 25 wrz 2020 o 15:06 Willy Tarreau  napisał(a):

> Hi Kirill,
>
> On Fri, Sep 25, 2020 at 12:34:16PM +0200, Kirill A. Korinsky wrote:
> > I've extracted a piece of code from lrandom and put it here:
> > https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
> > 
> >
> > You can see that object generator contains N words (and here it is 624),
> and
> > I use an assumption that Maciej's code doesn't create a new generator for
> > each request and share lrandom.
> >
> > Idea of this RNG is initialize each N words via init_genrand and it
> checking
> > that all of them are used, and after one generated a new ones.
> >
> > Let assume that we called genrand_int32 at the same moment from two
> threads.
> > If condition at lines 39 and 43 are true we start to initialize the next
> > words at both threads.
> >
> > You can see that we can easy move outside of v array at line 21 because
> two
> > threads are increasing i field, and put some random number to i field.
>
> Till here your analysis is right but:
>   - the overflow would only be at most the number of extra threads running
> init_genrand() concurrently, or more precisely the distance between
> the most upfront to the latest thread, so in the worst case nbthread-1
> hence there's no way to write a single location into a totally
> unrelated
> structure without writing the nbthread-2 words that are between the end
> of the MT array and the overwritten location ;
>
>   - the lua code is not threaded (there's a "big lock" around lua since
> stacks cannot easily be protected, though Thierry has some ideas for
> this).
>
> > And when the second thread is going to line 27 and nobody knows where it
> put 0x
>
> Actually init_genrand() doesn't actually *write* 0x but only writes
> *something* of 64 bits size and applies a 32-bit mask over it, so it would
> have written 32 bits pseudo-random bits followed by 4 zero bytes.
>
> > How can it be proved / solved?
> >
> > I see a few possible options:
> > 1. Switch off threads inside haproxy
> > 2. Use dedicated lrandom per thread
> > 3. Move away from lrandom
> >
> > As I understand lrandom is using here because it is very fast and
> secure, and
> > reading from /dev/urandom isn't an option.
> >
> > Here I can suggest to implement Yarrow PRGN (that is very simple to
> > implement) with some lua-pure cryptographic hash function.
>
> In fact I know it's possible to call haproxy's internal sample fetch
> functions from Lua (I never can figure how to do that, I always need
> to lookup an example for this unfortunately). But once you figure how
> to do it, you can call the "rand()" sample fetch that will call the
> internal thread-safe random number generator.
>
> Regards,
> Willy
>
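
A minimal sketch of calling the internal thread-safe rand() sample fetch from
Lua (assuming the standard Lua API where sample fetches are exposed on txn.f;
the action name is hypothetical and this does not address the seeding concern
mentioned above):

-- hedged sketch: draw a number from HAProxy's rand() sample fetch
core.register_action("roll", { "http-req" }, function(txn)
    local r = txn.f:rand(100)        -- integer in the range [0, 100)
    txn:set_var("txn.roll", r)       -- usable later as %[var(txn.roll)]
end, 0)

It would then be attached with "http-request lua.roll" in a frontend.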


Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
Hi Kirill,

Thanks for your hints and time! Unfortunately, I think lrandom is not the
cause of the crash. We've been using lrandom with threads for a couple of
months on our other servers without any crash. I think Lua in HAProxy is
executed in a single thread, so your analysis is correct, but the assumption
"Let assume that we called genrand_int32 at the same moment from two threads"
is never true in the HAProxy environment.

I suspect something is going on in SPOE and/or the Lua scripts from the
external vendor. I'll share more details as soon as I confirm it is in SPOE or
Lua.


pt., 25 wrz 2020 o 12:34 Kirill A. Korinsky  napisał(a):

> Good day,
>
> I'd like to share with your my two cents regarding this topic:
>
> lrandom (PRNG for lua, we're using it for 2 or 3 years without any
> problems, and soon we will drop it from our build)
>
>
> Never heard of this last one, not that it would make it suspicious at
> all, just that it might indicate you're having a slightly different
> workload than most common ones and can help spotting directions where
> to look for the problem.
>
>
>
> As far as I know Haproxy is using threads by default for some time and I
> assume that Maciej's setup doesn't change anything and it had enabled
> threads.
>
> If so I believe that lrandom is the root cause of this issue.
>
> I've extracted a piece of code from lrandom and put it here:
> https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
>
> You can see that object generator contains N words (and here it is 624),
> and I use an assumption that Maciej's code doesn't create a new generator
> for each request and share lrandom.
>
> Idea of this RNG is initialize each N words via init_genrand and it
> checking that all of them are used, and after one generated a new ones.
>
> Let assume that we called genrand_int32 at the same moment from two
> threads. If condition at lines 39 and 43 are true we start to initialize
> the next words at both threads.
>
> You can see that we can easy move outside of v array at line 21 because
> two threads are increasing i field, and put some random number to i field.
>
> And when the second thread is going to line 27 and nobody knows where it
> put 0x
>
> Let me quote Willy Tarreau:
>
> In the trace it's said that sw = 0x. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.
>
>
> and base on this I believe that it is the case.
>
> How can it be proved / solved?
>
> I see a few possible options:
> 1. Switch off threads inside haproxy
> 2. Use dedicated lrandom per thread
> 3. Move away from lrandom
>
> As I understand lrandom is using here because it is very fast and secure,
> and reading from /dev/urandom isn't an option.
>
> Here I can suggest to implement Yarrow PRGN (that is very simple to
> implement) with some lua-pure cryptographic hash function.
>
> --
> wbr, Kirill
>
>


Re: [2.0.17] crash with coredump

2020-09-17 Thread Maciej Zdeb
Hi,

Our config is quite complex and I'm trying to narrow it down. It is occurring
only on one production HAProxy cluster (which consists of 6 servers in each of
two data centers) with significant load; crashes occur on random servers, so I
would rule out a hardware memory issue.

I suspect SPOE and/or the Lua script; both are used to send metadata about
each request to an external endpoint. Yesterday I disabled this feature in one
datacenter to verify.

Our build is done in docker (Ubuntu bionic) with kernel 4.9.184-linuxkit,
crash is on Ubuntu bionic 4.15.0-55-generic, using:
haproxy 2.0.17
openssl 1.1.1f
pcre 8.44
lua 5.3.5
lrandom (PRNG for lua, we're using it for 2 or 3 years without any
problems, and soon we will drop it from our build)

compiled in following way:

# LUA
wget http://www.lua.org/ftp/lua-$LUA_VERSION.tar.gz \
&& tar -zxf lua-$LUA_VERSION.tar.gz \
&& cd lua-$LUA_VERSION \
&& make linux test \
&& make install

# LUA LRANDOM
wget http://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/ar/lrandom-100.tar.gz \
&& tar -zxf lrandom-100.tar.gz \
&& make -C lrandom-100 \
&& make -C lrandom-100 install

# PCRE
wget https://ftp.pcre.org/pub/pcre/pcre-$PCRE_VERSION.tar.gz \
&& tar -zxf pcre-$PCRE_VERSION.tar.gz \
&& cd pcre-$PCRE_VERSION \
&& ./configure --prefix=/usr/lib/haproxy/pcre_$PCRE_VERSION \
--enable-jit --enable-utf --enable-unicode-properties \
--disable-silent-rules \
&& make \
&& make install

# OPENSSL
wget https://www.openssl.org/source/openssl-$SSL_VERSION.tar.gz \
&& tar -zxf openssl-$SSL_VERSION.tar.gz \
&& cd openssl-$SSL_VERSION \
&& ./Configure --openssldir=/usr/lib/haproxy/openssl_$SSL_VERSION \
--prefix=/usr/lib/haproxy/openssl_$SSL_VERSION \
-Wl,-rpath=/usr/lib/haproxy/openssl_$SSL_VERSION/lib shared no-idea \
linux-x86_64 \
&& make depend \
&& make \
&& make install_sw

and finally haproxy is compiled using deb builder:

override_dh_auto_build:
make TARGET=$(HAP_TARGET) DEFINE="-DIP_BIND_ADDRESS_NO_PORT=24 \
-DMAX_SESS_STKCTR=12" USE_PCRE=1 USE_PCRE_JIT=1 \
PCRE_INC=/usr/lib/haproxy/pcre_$(PCRE_VERSION)/include \
PCRE_LIB="/usr/lib/haproxy/pcre_$(PCRE_VERSION)/lib \
-Wl,-rpath,/usr/lib/haproxy/pcre_$(PCRE_VERSION)/lib" USE_GETADDRINFO=1 \
USE_OPENSSL=1 SSL_INC=/usr/lib/haproxy/openssl_$(SSL_VERSION)/include \
SSL_LIB="/usr/lib/haproxy/openssl_$(SSL_VERSION)/lib \
-Wl,-rpath,/usr/lib/haproxy/openssl_$(SSL_VERSION)/lib" ADDLIB=-ldl \
USE_ZLIB=1 USE_DL=1 USE_LUA=1 USE_REGPARM=1

-DIP_BIND_ADDRESS_NO_PORT is now obsolete and we'll drop it.
MAX_SESS_STKCTR=12 because we need more stick counters.

Kind regards,


Thu, 17 Sep 2020 at 08:18, Willy Tarreau  wrote:

> Hi guys,
>
> On Thu, Sep 17, 2020 at 11:05:31AM +1000, Igor Cicimov wrote:
> (...)
> > > Coredump fragment from thread1:
> > > (gdb) bt
> > > #0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
> > > src/mux_h2.c:783
>
> So the code is this one:
>
>777  static void __maybe_unused h2s_notify_recv(struct h2s *h2s)
>778  {
>779  struct wait_event *sw;
>780
>781  if (h2s->recv_wait) {
>782  sw = h2s->recv_wait;
>783  sw->events &= ~SUB_RETRY_RECV;
>784  tasklet_wakeup(sw->tasklet);
>785  h2s->recv_wait = NULL;
>786  }
>787  }
>
> In the trace it's said that sw = 0x. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.
>
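For a corruption like this, one generic way to catch the writer in the act
(not something that was done in this thread, just a standard gdb technique) is
to put a hardware watchpoint on the suspect field once a live h2s pointer is
in scope, for example from a breakpoint in h2s_notify_recv():

    (gdb) break h2s_notify_recv
    (gdb) continue
    (gdb) watch -l h2s->recv_wait
    (gdb) continue
    # gdb then stops at the exact instruction that next modifies that memory
    # location, even if the write comes from completely unrelated code.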
> > I'm not one of the devs, but obviously many of us using v2.0 will be
> > interested in the answer. Assuming you do not install from packages, can
> > you please provide some more background on how you produce the binary: if
> > you compile it, what OS and kernel is it compiled on and what OS and
> > kernel does it crash on? Also, are there any other custom-compiled
> > packages in use, like OpenSSL, Lua, etc., that you might be using or have
> > compiled haproxy against?
> >
> > Also, if this is a bug and you have hit some corner case with your config
> > (many are using 2.0 but we have not seen crashes), you should provide a
> > stripped-down version of your config too (not too stripped though, just
> > remove the sensitive data).
>
> I agree with Igor here, any info to try to narrow down a reproducer, both
> in terms of config and operations, would be tremendously helpful!
>
> Thanks,
> Willy
>


[2.0.17] crash with coredump

2020-09-16 Thread Maciej Zdeb
Hi,

Our HAProxy (2.0.14) started to crash, so we first upgraded to 2.0.17, but
it didn't help. Below you'll find traces from the coredump:

Version:
HA-Proxy version 2.0.17 2020/07/31 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU = generic
  CC  = gcc
  CFLAGS  = -O0 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
-Wno-missing-field-initializers -Wno-implicit-fallthrough
-Wno-stringop-overflow -Wtype-limits -Wshift-negative-value
-Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
-DIP_BIND_ADDRESS_NO_PORT=24 -DMAX_SESS_STKCTR=12
  OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_REGPARM=1 USE_GETADDRINFO=1
USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DL=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE
+PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED
+REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE
+LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4
-MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS
-51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.44 2020-02-12
Running on PCRE version : 8.44 2020-02-12
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
        h2 : mode=HTTP       side=FE    mux=H2
        h2 : mode=HTX        side=FE|BE mux=H2
 <default> : mode=HTX        side=FE|BE mux=H1
 <default> : mode=TCP|HTTP   side=FE|BE mux=PASS

Available services : none

Available filters :
[SPOE] spoe
[COMP] compression
[CACHE] cache
[TRACE] trace


Coredump fragment from thread1:
(gdb) bt
#0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
src/mux_h2.c:783
#1  0x55cbbf6edbc7 in h2s_close (h2s=0x7f65b8b55130) at src/mux_h2.c:921
#2  0x55cbbf6f9745 in h2s_htx_make_trailers (h2s=0x7f65b8b55130,
htx=0x7f65a9c34f20) at src/mux_h2.c:5385
#3  0x55cbbf6fa48e in h2_snd_buf (cs=0x7f65b8c48a40,
buf=0x7f65d05291b8, count=2, flags=1) at src/mux_h2.c:5694
#4  0x55cbbf7cdde3 in si_cs_send (cs=0x7f65b8c48a40) at
src/stream_interface.c:762
#5  0x55cbbf7ce839 in stream_int_chk_snd_conn (si=0x7f65d0529478) at
src/stream_interface.c:1145
#6  0x55cbbf7cc9d6 in si_chk_snd (si=0x7f65d0529478) at
include/proto/stream_interface.h:496
#7  0x55cbbf7cd559 in stream_int_notify (si=0x7f65d05294d0) at
src/stream_interface.c:510
#8  0x55cbbf7cda33 in si_cs_process (cs=0x55cbca178f90) at
src/stream_interface.c:644
#9  0x55cbbf7cdfb1 in si_cs_io_cb (t=0x0, ctx=0x7f65d05294d0, state=1)
at src/stream_interface.c:817
#10 0x55cbbf81af32 in process_runnable_tasks () at src/task.c:415
#11 0x55cbbf740cc0 in run_poll_loop () at src/haproxy.c:2701
#12 0x55cbbf741188 in run_thread_poll_loop (data=0x1) at
src/haproxy.c:2840
#13 0x7f667fbdb6db in start_thread (arg=0x7f65d5556700) at
pthread_create.c:463
#14 0x7f667e764a3f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) bt full
#0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
src/mux_h2.c:783
sw = 0x
#1  0x55cbbf6edbc7 in h2s_close (h2s=0x7f65b8b55130) at src/mux_h2.c:921
No locals.
#2  0x55cbbf6f9745 in h2s_htx_make_trailers (h2s=0x7f65b8b55130,
htx=0x7f65a9c34f20) at src/mux_h2.c:5385
list = {{n = {ptr = 0x55cbbf88d18c "", len = 0}, v = {ptr =
0x7f65a9c34f20 "�?", len = 94333580136844}}, {n = {ptr = 0x7f65d5532410 "
Oée\177", len = 94333579742112}, v = {ptr = 0x7f65a9c38e78 "�\001", len =
140074616573728}}, {n = {
  ptr = 0x55cbbf6ea203 
"\211E�\213E�\203�\002t\005\203�\005u\035H\213E�\213@\004\017��H\213E�\213@\004��\b%��\017",
len = 81604378627}, v = {ptr = 0x1fd0001 , len = 140074616589944}}, {n = {ptr =
0x7f65a9c34f20 "�?", len = 140075347420216}, v = {
  ptr = 0x55cbbf82bfc0 

Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-13 Thread Maciej Zdeb
Ok, I'll wait for an update from you! :)

Mon, 13 May 2019 at 08:03, Willy Tarreau  wrote:

> Hi Maciej,
>
> On Mon, May 13, 2019 at 07:21:59AM +0200, Maciej Zdeb wrote:
> > Hi,
> >
> > I'm not observing any issues, so I think it's fixed. :)
> >
> > Willy, Olivier thank you very much!
>
> Great, thanks. I'm going to issue 1.9.8 with the patch I sent you then.
> However, after discussing it with Olivier, we came to the conclusion
> that it is not the cleanest solution, as it addresses the consequence of
> a problem (being flow controlled and being in the send list) instead of
> the cause. Olivier proposed a different one to me that I'd very much
> appreciate you testing after the release; I don't want to take any new
> risk for 1.9.8 for now. We'll keep you updated.
>
> Cheers,
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-12 Thread Maciej Zdeb
Hi,

I'm not observing any issues, so I think it's fixed. :)

Willy, Olivier thank you very much!

Kind regards,

Sun, 12 May 2019 at 12:11, Willy Tarreau  wrote:

> On Sun, May 12, 2019 at 12:00:44PM +0200, Maciej Zdeb wrote:
> > As for now it is stable, no looping. I'll leave it till monday and return
> > with feedback! :)
>
> Many thanks Maciej for this positive feedback. I'll merge it and issue
> 1.9.8
> then. There are definitely enough pending fixes for a release!
>
> Cheers,
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-12 Thread Maciej Zdeb
For now it is stable, no looping. I'll leave it till Monday and return
with feedback! :)

Thanks,

Sat, 11 May 2019, 15:44, Maciej Zdeb  wrote:

> Patch applied, fingers crossed, testing! :-)
>
> Thanks!
>
> Sat, 11 May 2019 at 14:58, Willy Tarreau  wrote:
>
>> On Sat, May 11, 2019 at 11:01:42AM +0200, Willy Tarreau wrote:
>> > On Sat, May 11, 2019 at 10:52:35AM +0200, Willy Tarreau wrote:
>> > > I certainly made a few reasoning mistakes above but I don't see
>> anything
>> > > in the code preventing this case from happening.
>> > >
>> > > Thus I'd like you to try the attached patch which is supposed to
>> prevent
>> > > this scenario from happening. At least I've verified that it doesn't
>> > > break the h2spec test suite.
>> >
>> > While trying to check if it still applied to the latest 1.9 I figured
>> > that it corresponds to what Olivier had also found and fixed in his
>> > latest patch :-/  The positive point is that my analysis was correct.
>> >
>> > So I'm afraid that if it still fails with his fix, we'll need another
>> > core :-(
>>
>> Actually not, Olivier's fix is incomplete regarding the scenario I
>> proposed :
>> - in h2s_frt_make_resp_data() we can set H2_SF_BLK_SFCTL and remove the
>>   element from the list
>> - then in h2_shutr() and h2_shutw(), we check if the list is empty before
>>   subscribing the element, which is true after the case above
>> - then in h2c_update_all_ws() we still have H2_SF_BLK_SFCTL with the item
>>   in the send_list, thus LIST_ADDQ() adds it a second time.
>>
>> Thus the first part of the patch I sent is still required, I'm attaching it
>> again, rebased on top of Olivier's patch and simplified so that we don't
>> detach then re-attach.
>>
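The effect of that second LIST_ADDQ is easy to reproduce on a reduced model of
haproxy's circular lists. The sketch below uses simplified stand-ins for the
real list macros, so it is only an illustration of the mechanism: re-queuing
an element that is already the tail of the ring makes the element point to
itself, and a walk in the style of list_for_each_entry() then never gets back
to the head, which is the 100% CPU symptom seen earlier in this thread.

    #include <stdio.h>

    struct list { struct list *n, *p; };

    static void list_init(struct list *l) { l->n = l->p = l; }

    /* simplified LIST_ADDQ(): append <e> just before the head <h> */
    static void list_addq(struct list *h, struct list *e)
    {
        e->p = h->p;
        e->n = h;
        h->p->n = e;
        h->p = e;
    }

    int main(void)
    {
        struct list head, a;
        struct list *cur;
        int steps = 0;

        list_init(&head);
        list_init(&a);

        list_addq(&head, &a);
        list_addq(&head, &a);   /* second ADDQ of an element that is already
                                 * the tail: h->p is &a itself, so h->p->n = e
                                 * sets a.n = &a and the ring is broken */

        /* a list_for_each_entry()-style walk stops when it is back at the
         * head; here it never is, so bound it to keep the demo finite */
        for (cur = head.n; cur != &head && steps < 10; cur = cur->n)
            steps++;

        printf("walked %d steps without ever reaching the head again\n", steps);
        return 0;
    }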
>> I'm still keeping hope ;-)
>>
>> Willy
>>
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-11 Thread Maciej Zdeb
Patch applied, fingers crossed, testing! :-)

Thanks!

Sat, 11 May 2019 at 14:58, Willy Tarreau  wrote:

> On Sat, May 11, 2019 at 11:01:42AM +0200, Willy Tarreau wrote:
> > On Sat, May 11, 2019 at 10:52:35AM +0200, Willy Tarreau wrote:
> > > I certainly made a few reasoning mistakes above but I don't see
> anything
> > > in the code preventing this case from happening.
> > >
> > > Thus I'd like you to try the attached patch which is supposed to
> prevent
> > > this scenario from happening. At least I've verified that it doesn't
> > > break the h2spec test suite.
> >
> > While trying to check if it still applied to the latest 1.9 I figured
> > that it corresponds to what Olivier had also found and fixed in his
> > latest patch :-/  The positive point is that my analysis was correct.
> >
> > So I'm afraid that if it still fails with his fix, we'll need another
> > core :-(
>
> Actually not, Olivier's fix is incomplete regarding the scenario I
> proposed :
> - in h2s_frt_make_resp_data() we can set H2_SF_BLK_SFCTL and remove the
>   element from the list
> - then in h2_shutr() and h2_shutw(), we check if the list is empty before
>   subscribing the element, which is true after the case above
> - then in h2c_update_all_ws() we still have H2_SF_BLK_SFCTL with the item
>   in the send_list, thus LIST_ADDQ() adds it a second time.
>
> Thus the first part of the patch I sent is still required, I'm attaching it
> again, rebased on top of Olivier's patch and simplified so that we don't
> detach then re-attach.
>
> I'm still keeping hope ;-)
>
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-11 Thread Maciej Zdeb
> What I find very strange is why you're possibly the only one seeing this
> (and maybe also @serimin on github issue #94). If we could figure what
> makes your case specific it could help narrow the issue down. I'm seeing
> that you have a very simple Lua service to respond to health checks, so
> I've been thinking that maybe we do have some remaining bugs when Lua is
> accessed over H2 (e.g. incorrect length or whatever), but it's not the
> case on your connection since there are something like 17 streams so we
> can rule out the hypothesis of a health check connection, and thus that
> Lua was used.
>

Correct, Lua applets are used for health checking, and only for a specific
list of clients (distinguished by source IP). Health checks are made over
HTTP/1.0 (not even 1.1), so they should not affect H2 in HAProxy. This
particular HAProxy instance is used for serving images, so the responses are
quite large and so is the traffic. I'm using HAProxy 1.9.7 on other machines
without that problem, so the traffic pattern or the config matters.

The difference from the other HAProxy 1.9.7 instances I'm using is the
consistent hashing (hash-type consistent, balance hdr(...) and a server id
specified in the config for each server) and also the inter, fastinter,
downinter, slowstart settings.
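For reference, the kind of backend described here would look roughly like the
snippet below. It is illustrative only: the backend name, header name,
addresses and timings are invented, not taken from Maciej's configuration;
only the directives themselves (balance hdr, hash-type consistent, per-server
id, inter/fastinter/downinter/slowstart) mirror what is described above.

    backend images_back
        balance hdr(X-Image-Key)
        hash-type consistent
        default-server inter 5s fastinter 1s downinter 10s slowstart 60s
        server img1 10.0.0.1:8080 id 1 check
        server img2 10.0.0.2:8080 id 2 check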


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-10 Thread Maciej Zdeb
= 0, streams_limit = 100, max_id = 9, rcvd_c = 0, rcvd_s = 0, ddht =
0x2f311a0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 3, dfl
= 4,
  dft = 8 '\b', dff = 0 '\000', dpl = 0 '\000', last_sid = -1, mbuf = {size
= 16384, area = 0x34f0a10 "", data = 9, head = 0}, msi = -1, mfl = 0, mft =
0 '\000', mff = 0 '\000', miw = 65535, mws = 10338036, mfs = 16384, timeout
= 2,
  shut_timeout = 2, nb_streams = 0, nb_cs = 0, nb_reserved = 0,
stream_cnt = 5, proxy = 0x2537fe0, task = 0x308cb70, streams_by_id = {b =
{0x0, 0x0}}, send_list = {n = 0x321d3b8, p = 0x321d3b8}, fctl_list = {n =
0x3174ce8,
p = 0x3174ce8}, sending_list = {n = 0x321d3c8, p = 0x321d3c8}, buf_wait
= {target = 0x0, wakeup_cb = 0x0, list = {n = 0x3174d18, p = 0x3174d18}},
wait_event = {task = 0x30936f0, handle = 0x0, events = 1}}
(gdb) p list

Fri, 10 May 2019 at 15:22, Maciej Zdeb  wrote:

> I've just sent some additional data to Willy. :)
>
> Sure, I'll test your patch!
>
> Fri, 10 May 2019 at 15:11, Olivier Houchard  wrote:
>
>> Hi Maciej,
>>
>> On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote:
>> > Hi again,
>> >
>> > I have bad news, HAProxy 1.9.7-35b44da still looping :/
>> >
>> > gdb session:
>> > h2_process_mux (h2c=0x1432420) at src/mux_h2.c:2609
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb) n
>> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2613            if (!LIST_ISEMPTY(&h2s->sending_list))
>> > (gdb)
>> > 2619            if (!h2s->send_wait) {
>> > (gdb)
>> > 2620                LIST_DEL_INIT(&h2s->list);
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2613            if (!LIST_ISEMPTY(&h2s->sending_list))
>> > (gdb)
>> > 2619            if (!h2s->send_wait) {
>> > (gdb)
>> > 2620                LIST_DEL_INIT(&h2s->list);
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb)
>> > 2613            if (!LIST_ISEMPTY(&h2s->sending_list))
>> > (gdb)
>> > 2619            if (!h2s->send_wait) {
>> > (gdb)
>> > 2620                LIST_DEL_INIT(&h2s->list);
>> > (gdb)
>> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
>> > (gdb) p *h2s
>> > $1 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420,
>> h1m
>> > = {state = H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976,
>> next =
>> > 411, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
>> > {0x13dcf50,
>> >   0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1,
>> pfx
>> > = 0}, key = 11}, id = 11, flags = 28675, mws = 977198, errcode =
>> > H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf =
>> > {size = 0, area = 0x0,
>> > data = 0, head = 0}, wait_event = {task = 0x15077a0, handle = 0x0,
>> > events = 0}, recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p
>> =
>> > 0x15b31a8}, sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
>> 

Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-10 Thread Maciej Zdeb
I've just sent some additional data to Willy. :)

Sure, I'll test your patch!

Fri, 10 May 2019 at 15:11, Olivier Houchard  wrote:

> Hi Maciej,
>
> On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote:
> > Hi again,
> >
> > I have bad news, HAProxy 1.9.7-35b44da still looping :/
> >
> > gdb session:
> > h2_process_mux (h2c=0x1432420) at src/mux_h2.c:2609
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb) n
> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2613            if (!LIST_ISEMPTY(&h2s->sending_list))
> > (gdb)
> > 2619            if (!h2s->send_wait) {
> > (gdb)
> > 2620                LIST_DEL_INIT(&h2s->list);
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2613            if (!LIST_ISEMPTY(&h2s->sending_list))
> > (gdb)
> > 2619            if (!h2s->send_wait) {
> > (gdb)
> > 2620                LIST_DEL_INIT(&h2s->list);
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2613            if (!LIST_ISEMPTY(&h2s->sending_list))
> > (gdb)
> > 2619            if (!h2s->send_wait) {
> > (gdb)
> > 2620                LIST_DEL_INIT(&h2s->list);
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb) p *h2s
> > $1 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420,
> h1m
> > = {state = H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976,
> next =
> > 411, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
> > {0x13dcf50,
> >   0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1,
> pfx
> > = 0}, key = 11}, id = 11, flags = 28675, mws = 977198, errcode =
> > H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf =
> > {size = 0, area = 0x0,
> > data = 0, head = 0}, wait_event = {task = 0x15077a0, handle = 0x0,
> > events = 0}, recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p =
> > 0x15b31a8}, sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
> > (gdb) p *h2s_back
> > $2 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420,
> h1m
> > = {state = H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976,
> next =
> > 411, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
> > {0x13dcf50,
> >   0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1,
> pfx
> > = 0}, key = 11}, id = 11, flags = 28675, mws = 977198, errcode =
> > H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf =
> > {size = 0, area = 0x0,
> > data = 0, head = 0}, wait_event = {task = 0x15077a0, handle = 0x0,
> > events = 0}, recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p =
> > 0x15b31a8}, sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
> > (gdb) p *h2c
> > $3 = {conn = 0x17e3310, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
> > flags = 0, streams_limit = 100, max_id = 13, rcvd_c = 0, rcvd_s = 0,
> ddht =
> > 0x1e99a40, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 13,
> dfl
> > = 4,
> >   dft = 8 '\b', dff = 0 '\000', dpl = 0 '\000', last_sid = -1, mbuf =
> {size
> > = 16384, area = 0x1e573a0 "", data = 13700, head = 0}, msi = -1, mfl = 0,
&

Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-10 Thread Maciej Zdeb
I'm getting old... I forgot to dump the core :( and I already killed
the process. Sorry, but the issue must reoccur first and I can't say how long
that may take.

As soon as I get a core dump I'll return.

Fri, 10 May 2019, 10:35, Willy Tarreau  wrote:

> Hi Maciej,
>
> On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote:
> > Hi again,
> >
> > I have bad news, HAProxy 1.9.7-35b44da still looping :/
>
> Well, it's getting really annoying. Something's definitely wrong in
> this list and I can't figure out what.
>
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb) p *h2s
> > $1 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420,
> h1m
> ^^^
>
> Seeing things like the above makes me doubt the list's integrity; it could
> again hold an element that was already reused somewhere else. Could it be
> possible for you to share your unstripped executable, a core dump and your
> config? (not to the list! just send a private link to Olivier or me).
>
> At this point either we find what's happening or we'll have to issue 1.9.8
> with this bug still alive, which doesn't make me feel comfortable to say
> the least :-/
>
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-09 Thread Maciej Zdeb
x0, list = {n = 0x15b31a8, p =
0x15b31a8}, sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
(gdb) n
2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) n
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) n
2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) n
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) n
2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) n
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb) n
2619            if (!h2s->send_wait) {
(gdb) n
2620                LIST_DEL_INIT(&h2s->list);
(gdb) n
2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) n
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)

Thu, 9 May 2019 at 14:46, Maciej Zdeb  wrote:

> I'm happy to help! :) Checking Olivier's patch.
>
> Thanks!
>
> Thu, 9 May 2019 at 14:34, Willy Tarreau  wrote:
>
>> On Thu, May 09, 2019 at 02:31:58PM +0200, Maciej Zdeb wrote:
> >> > What bad luck :D I must have compiled it just before you pushed that
>> > change (segfault above is from haproxy 1.9.7-9b8ac0f).
>>
>> Great, so there's still some hope. I really appreciate your help and
>> feedback here, such issues are extremely difficult to track down and
>> even to reproduce and your help is invaluable here.
>>
>> Cheers,
>> Willy
>>
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-09 Thread Maciej Zdeb
I'm happy to help! :) Checking Olivier's patch.

Thanks!

Thu, 9 May 2019 at 14:34, Willy Tarreau  wrote:

> On Thu, May 09, 2019 at 02:31:58PM +0200, Maciej Zdeb wrote:
> > What bad luck :D I must have compiled it just before you pushed that
> > change (segfault above is from haproxy 1.9.7-9b8ac0f).
>
> Great, so there's still some hope. I really appreciate your help and
> feedback here, such issues are extremely difficult to track down and
> even to reproduce and your help is invaluable here.
>
> Cheers,
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-09 Thread Maciej Zdeb
Hi Willy,

I've built 1.9 from HEAD; unfortunately something is wrong, and now I've
got a segfault:

Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p
/var/run/haproxy.pid -D -sf 75'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00484ab4 in h2_process_mux (h2c=0x1cda990) at
src/mux_h2.c:2609
2609list_for_each_entry_safe(h2s, h2s_back, >send_list, list) {
(gdb) bt f
#0  0x00484ab4 in h2_process_mux (h2c=0x1cda990) at
src/mux_h2.c:2609
h2s = 0x2
h2s_back = 
#1  h2_send (h2c=0x1cda990) at src/mux_h2.c:2747
flags = 
conn = 0x2d12190
done = 0
sent = 0
#2  0x0048d488 in pool_free (ptr=0x1cdaa68, pool=0x0) at
include/common/memory.h:327
No locals.
#3  __b_drop (buf=) at include/common/buffer.h:112
No locals.
#4  b_drop (buf=0x1cda9b8) at include/common/buffer.h:119
No locals.
#5  b_free (buf=0x1cda9b8) at include/common/buffer.h:125
No locals.
#6  h2_release_buf (h2c=0x1cda990, bptr=0x1cda9b8) at src/mux_h2.c:407
bptr = 0x1cda9b8
#7  h2_process (h2c=0x1cda990) at src/mux_h2.c:2889
conn = 0xfffdd7b8
#8  0x004bf425 in run_thread_poll_loop (data=0x0) at
src/haproxy.c:2710
ptif = 
ptdf = 
start_lock = 0
#9  0x0001 in ?? ()
No symbol table info available.
#10 0x0041fb86 in main (argc=, argv=0x7ffe885af938)
at src/haproxy.c:3354
tids = 0x1924900
threads = 0xfffc
i = 
old_sig = {__val = {0, 140199073187936, 140199073187072,
140199071033905, 0, 140199073173696, 1, 0, 18446603340516163585,
140199073187072, 1, 140199073235400, 0, 140199073236256, 0, 24}}
blocked_sig = {__val = {1844674406710583, 18446744073709551615
}}
err = 
retry = 
limit = {rlim_cur = 100, rlim_max = 100}
errmsg = "\000\000\000\000\000\000\000\000n\000\000\000w", '\000'
,
"p1K\243\202\177\000\000\000\000\000\000\000\000\000\000`'\\\242\202\177\000\000\030\000\000\000\000\000\000\000\200f4\001\000\000\000\000>\001\000\024\000\000\000\000
\352J\243\202\177\000\000`\221\065\001\000\000\000\000\340*(\242\202\177\000\000\370\371Z\210"
pidfd = 

Thu, 9 May 2019 at 10:51, Willy Tarreau  wrote:

> Hi Maciej,
>
> I've just pushed a number of fixes into 1.9-master, including the one
> I was talking about, if you want to try again.
>
> Cheers,
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-08 Thread Maciej Zdeb
Ok, thanks!

Wed, 8 May 2019, 16:50, Willy Tarreau  wrote:

> On Wed, May 08, 2019 at 04:48:55PM +0200, Maciej Zdeb wrote:
> > I'll gladly test Olivier patch after backporting. :)
>
> Thank you Maciej. I still have other stuff to deal with before going back
> to the pending 1.9 backports, and I'll merge it.
>
> Willy
>


Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-08 Thread Maciej Zdeb
I'll gladly test Olivier patch after backporting. :)

Wed, 8 May 2019, 15:29, Willy Tarreau  wrote:

> On Wed, May 08, 2019 at 03:03:23PM +0200, Olivier Houchard wrote:
> > > > I can't seem to remember :)
> > >
> > > Given the number of bugs we've dealt with in the last few weeks, you're
> > > forgiven :-)
> > >
> >
> > I'm afraid I'm getting old :/
>
> Ah! you see how it feels!
>
> > > > I think that patch is safe to backport anyway, but I wouldn't swear
> > > > it solves Maciej's issue.
> > >
> > > OK. I'll take it then so that we can study the situation under better
> > > conditions if the problem happens again. It may be possible that we've
> > > also overlooked other situations where this infinite loop could happen
> > > with SUB_CALL_UNSUBSCRIBE and that were addressed by your patch.
> > >
> >
> > Yes, it is certainly possible it helps, I just can't affirm it with
> > confidence.
>
> This is why I prefer to take it. The amount of changes that went into 1.9
> was significant enough to get such hard-to-debug issues. As long as fixes
> don't harm, I prefer to take them rather than fight for a full week with
> a bug just to discover that it was already fixed upstream.
>
> Thanks,
> Willy
>


[1.9 HEAD] HAProxy using 100% CPU

2019-05-07 Thread Maciej Zdeb
Hi,

I've got another bug with 100% CPU on a HAProxy process; it is built from
the HEAD of the 1.9 branch.

One of the processes is stuck in an infinite loop and the admin socket is not
responsive, so I've got information only from gdb:

0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589
2589            if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb) n
2587            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2586        list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb)
2587            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2589            if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb)
2587            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2586        list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb)
2587            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) p h2c
$1 = (struct h2c *) 0x2e8ff30
(gdb) p *h2c
$2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht
= 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149,
dfl = 0,
  dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size
= 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft =
0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384,
  timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53,
nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0,
streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p =
0x3093c18},
  fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n =
0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list
= {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle =
0x0, events = 1}}
(gdb) n
2589            if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb) p *h2s
$3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m
= {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next =
0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
{0x2a72250,
  0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx
= 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode =
H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size =
0,
area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle
= 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n
= 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}}
(gdb) n
2587            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) p *h2c
$4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht
= 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149,
dfl = 0,
  dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size
= 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft =
0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384,
  timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53,
nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0,
streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p =
0x3093c18},
  fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n =
0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list
= {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle =
0x0, events = 1}}
(gdb) n
2586        list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb) n
2587            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) n
2589            if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)


HAProxy info:
HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/
Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
-Wno-missing-field-initializers -Wtype-limits -DIP_BIND_ADDRESS_NO_PORT=24
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_DL=1
USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.1b  26 Feb 2019
Running on OpenSSL version : OpenSSL 1.1.1b  26 Feb 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : 

Re: [1.9.7] One of haproxy processes using 100% CPU

2019-05-05 Thread Maciej Zdeb
Hi,

I confirm Willy's patch fixed the problem! Thanks!

Tue, 30 Apr 2019 at 13:49, Maciej Zdeb  wrote:

> Hi Olivier,
>
> Thank you very much. I'll test it and get back with feedback!
>
> Regards,
>
> Tue, 30 Apr 2019 at 13:12, Olivier Houchard  wrote:
>
>> Hi Maciej,
>>
>> On Tue, Apr 30, 2019 at 08:43:07AM +0200, Maciej Zdeb wrote:
>> > Filtered results from show fd for that particular virtual server:
>> >
>> > 10 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-3 cprev=-2
>> tmask=0x1
>> > umask=0x0 owner=0x53a5690 iocb=0x59d689(conn_fd_handler) back=0
>> > cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x6502860 h2c.st0=2
>> > .err=0 .maxid=17 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
>> > .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=13 .dbuf=0@(nil)+0/0
>> > .msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x5907040 .id=13 .flg=0x4005
>> > .rxbuf=0@(nil)+0/0
>> > .cs=0x905b1b0 .cs.flg=0x00106a00 .cs.data=0x5d1d228
>> > 98 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-1809 cprev=-2
>> > tmask=0x1 umask=0x0 owner=0xa3bb7f0 iocb=0x59d689(conn_fd_handler)
>> back=0
>> > cflg=0x80201300 fe=virtual-server_front mux=H2 ctx=0xa71f310 h2c.st0=3
>> > .err=0 .maxid=0 .lastid=-1 .flg=0x0008 .nbst=0 .nbcs=0 .fctl_cnt=0
>> > .send_cnt=0 .tree_cnt=0 .orph_cnt=0 .sub=0 .dsi=3
>> > .dbuf=16384@0x5873f10+61/16384
>> > .msi=-1 .mbuf=0@(nil)+0/0
>>
>> I see that it seems to be HTTP/2. Not saying it's your problem, but a bug
>> that would cause haproxy to use 100% of the CPU has been fixed in the
>> HTTP/2
>> code just after the 1.9.7 release was done.
>> Any chance you can see if it still happens with that commit :
>> commit c980b511bfef566e9890eb9a06d607c193d63828
>> Author: Willy Tarreau 
>> Date:   Mon Apr 29 10:20:21 2019 +0200
>>
>> BUG/MEDIUM: mux-h2: properly deal with too large headers frames
>>
>> Regards,
>>
>> Olivier
>>
>


Re: [1.9.7] One of haproxy processes using 100% CPU

2019-04-30 Thread Maciej Zdeb
Hi Olivier,

Thank you very much. I'll test it and get back with feedback!

Regards,

Tue, 30 Apr 2019 at 13:12, Olivier Houchard  wrote:

> Hi Maciej,
>
> On Tue, Apr 30, 2019 at 08:43:07AM +0200, Maciej Zdeb wrote:
> > Filtered results from show fd for that particular virtual server:
> >
> > 10 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-3 cprev=-2 tmask=0x1
> > umask=0x0 owner=0x53a5690 iocb=0x59d689(conn_fd_handler) back=0
> > cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x6502860 h2c.st0=2
> > .err=0 .maxid=17 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
> > .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=13 .dbuf=0@(nil)+0/0
> > .msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x5907040 .id=13 .flg=0x4005
> > .rxbuf=0@(nil)+0/0
> > .cs=0x905b1b0 .cs.flg=0x00106a00 .cs.data=0x5d1d228
> > 98 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-1809 cprev=-2
> > tmask=0x1 umask=0x0 owner=0xa3bb7f0 iocb=0x59d689(conn_fd_handler) back=0
> > cflg=0x80201300 fe=virtual-server_front mux=H2 ctx=0xa71f310 h2c.st0=3
> > .err=0 .maxid=0 .lastid=-1 .flg=0x0008 .nbst=0 .nbcs=0 .fctl_cnt=0
> > .send_cnt=0 .tree_cnt=0 .orph_cnt=0 .sub=0 .dsi=3
> > .dbuf=16384@0x5873f10+61/16384
> > .msi=-1 .mbuf=0@(nil)+0/0
>
> I see that it seems to be HTTP/2. Not saying it's your problem, but a bug
> that would cause haproxy to use 100% of the CPU has been fixed in the
> HTTP/2
> code just after the 1.9.7 release was done.
> Any chance you can see if it still happens with that commit :
> commit c980b511bfef566e9890eb9a06d607c193d63828
> Author: Willy Tarreau 
> Date:   Mon Apr 29 10:20:21 2019 +0200
>
> BUG/MEDIUM: mux-h2: properly deal with too large headers frames
>
> Regards,
>
> Olivier
>


Re: [1.9.7] One of haproxy processes using 100% CPU

2019-04-30 Thread Maciej Zdeb
Filtered results from show fd for that particular virtual server:

10 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-3 cprev=-2 tmask=0x1
umask=0x0 owner=0x53a5690 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x6502860 h2c.st0=2
.err=0 .maxid=17 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
.send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=13 .dbuf=0@(nil)+0/0
.msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x5907040 .id=13 .flg=0x4005
.rxbuf=0@(nil)+0/0
.cs=0x905b1b0 .cs.flg=0x00106a00 .cs.data=0x5d1d228
98 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-1809 cprev=-2
tmask=0x1 umask=0x0 owner=0xa3bb7f0 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80201300 fe=virtual-server_front mux=H2 ctx=0xa71f310 h2c.st0=3
.err=0 .maxid=0 .lastid=-1 .flg=0x0008 .nbst=0 .nbcs=0 .fctl_cnt=0
.send_cnt=0 .tree_cnt=0 .orph_cnt=0 .sub=0 .dsi=3
.dbuf=16384@0x5873f10+61/16384
.msi=-1 .mbuf=0@(nil)+0/0
184 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lC] cnext=-3 cprev=-2
tmask=0x umask=0x0 owner=0x23eb040
iocb=0x57e662(listener_accept) l.st=RDY fe=virtual-server_front
660 : st=0x22(R:pRa W:pRa) ev=0x11(HeopI) [lc] cnext=-3 cprev=-2 tmask=0x1
umask=0x0 owner=0x533d6e0 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x8031b90 h2c.st0=2
.err=0 .maxid=49 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
.send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=49 .dbuf=0@(nil)+0/0
.msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x70f1e80 .id=31 .flg=0x4005
.rxbuf=0@(nil)+0/0
.cs=0x6f373d0 .cs.flg=0x00106a00 .cs.data=0x56bb788
699 : st=0x22(R:pRa W:pRa) ev=0x11(HeopI) [lc] cnext=-87 cprev=-2 tmask=0x1
umask=0x0 owner=0x6694b60 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x56e7b00 h2c.st0=2
.err=0 .maxid=111 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
.send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=111 .dbuf=0@(nil)+0/0
.msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x5931bf0 .id=47 .flg=0x4105
.rxbuf=0@(nil)+0/0
.cs=0x5943120 .cs.flg=0x00106a00 .cs.data=0x77af4c8
970 : st=0x22(R:pRa W:pRa) ev=0x01(heopI) [lc] cnext=-3 cprev=-2 tmask=0x1
umask=0x0 owner=0x67684b0 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x5c90c30 h2c.st0=2
.err=0 .maxid=125 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
.send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=125 .dbuf=0@(nil)+0/0
.msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x7ac8650 .id=43 .flg=0x4005
.rxbuf=0@(nil)+0/0
.cs=0x7901a20 .cs.flg=0x00106a00 .cs.data=0x882c388
1282 : st=0x22(R:pRa W:pRa) ev=0x11(HeopI) [lc] cnext=-3 cprev=-2 tmask=0x1
umask=0x0 owner=0x6f23720 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x6090cf0 h2c.st0=2
.err=0 .maxid=129 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
.send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=129 .dbuf=0@(nil)+0/0
.msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x5cc0890 .id=17 .flg=0x4005
.rxbuf=0@(nil)+0/0
.cs=0x64d33f0 .cs.flg=0x00106a00 .cs.data=0x639a3e8
3041 : st=0x22(R:pRa W:pRa) ev=0x11(HeopI) [lc] cnext=-955 cprev=-2
tmask=0x1 umask=0x0 owner=0x6de8980 iocb=0x59d689(conn_fd_handler) back=0
cflg=0x80243300 fe=virtual-server_front mux=H2 ctx=0x5beca10 h2c.st0=2
.err=0 .maxid=89 .lastid=-1 .flg=0x1 .nbst=0 .nbcs=1 .fctl_cnt=0
.send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=89 .dbuf=0@(nil)+0/0
.msi=-1 .mbuf=0@(nil)+0/0 last_h2s=0x82f5900 .id=15 .flg=0x4005
.rxbuf=0@(nil)+0/0
.cs=0x7e027a0 .cs.flg=0x00106a00 .cs.data=0x6e5d398
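Output like the above is typically obtained by issuing "show fd" on the stats
socket and filtering it, for example as below (the socket path is an
assumption, not something given in this thread):

    echo "show fd" | socat stdio unix-connect:/var/run/haproxy.sock \
        | grep 'fe=virtual-server_front'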

Tue, 30 Apr 2019 at 08:31, Maciej Zdeb  wrote:

> Forgot to attach information about HAProxy (for tracing the issue I
> compiled it with debug symbols and without optimizations):
>
> haproxy -vv
> HA-Proxy version 1.9.7 2019/04/25 - https://haproxy.org/
> Build options :
>   TARGET  = linux2628
>   CPU = generic
>   CC  = gcc
>   CFLAGS  = -O0 -g -fno-strict-aliasing -Wdeclaration-after-statement
> -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
> -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
> -Wno-missing-field-initializers -Wtype-limits -DIP_BIND_ADDRESS_NO_PORT=24
>   OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_DL=1
> USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>
> Built with OpenSSL version : OpenSSL 1.1.1b  26 Feb 2019
> Running on OpenSSL version : OpenSSL 1.1.1b  26 Feb 2019
> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
> Built with Lua version : Lua 5.3.5
> Built with transparent proxy support using: IP_TRANSPARENT
> IPV6_TRANSPARENT IP_FREEBIND
> Built with zlib version : 1.2.8
> Running on zlib version : 1.2.8
> Compression algorithms supported : identity("identity
