Setting retry-on causes segmentation fault with TCP backends

2019-08-30 Thread Louis Chanouha
Hello,

I upgraded to HAProxy 2.0.5 (from 1.9) and found an issue when I tried to add the 
retry-on option. The TCP backend seems to answer one or two requests, and then 
HAProxy crashes:

My simplified conf:

defaults
   [...]
   retries 3
   option abortonclose
   http-reuse  safe
   retry-on conn-failure 0rtt-rejected 503

listen SMTPS2_PROD
   bind    0.0.0.0:587
   mode    tcp
   balance roundrobin

   server  s1 1.1.1.1:586
   server  s2 1.1.1.2:586

I get in logs:

Aug 30 14:48:49 s1 haproxy[3071]: [ALERT] 241/144849 (3071) : Current worker #1 
(3072) exited with code 139 (Segmentation fault)

With the option, I get:

└──╼ openssl s_client -connect server:587  -starttls smtp
CONNECTED(0003)
Didn't find STARTTLS in server response, trying anyway...
write:errno=32
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 23 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)

So only a few requests succeed.

Without the option, the server is stable:

└──╼ openssl s_client -connect server:587 -starttls smtp
CONNECTED(0003)
[...]
---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 3843 bytes and written 483 bytes
Verification: OK
---
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: yyy
    Session-ID-ctx: 
    Master-Key: 5xxx
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1567167549
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: yes
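
Side note: retry-on is documented as an HTTP-only directive, so I suspect the 
trigger is that it is inherited from the shared defaults into this TCP-mode 
listener. An untested workaround sketch would be a second defaults section, so 
that TCP proxies declared before it never inherit retry-on:

defaults                    # inherited by the TCP proxies declared below
   [...]
   retries 3

listen SMTPS2_PROD
   bind    0.0.0.0:587
   mode    tcp
   balance roundrobin
   server  s1 1.1.1.1:586
   server  s2 1.1.1.2:586

defaults                    # only HTTP-mode proxies declared after this
   [...]                    # point inherit the retry / reuse settings
   retries 3
   option abortonclose
   http-reuse  safe
   retry-on conn-failure 0rtt-rejected 503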

Louis

---

--

Louis Chanouha | Infrastructures informatiques
Service Numérique de l'Université de Toulouse
Université Fédérale Toulouse Midi-Pyrénées
Maison de la Recherche et de la Valorisation - MRV
118 route de Narbonne - 31062 Toulouse Cedex 09
Tél. : +33 5 61 10 80 45 /poste int. : 12 80 45
louis.chano...@univ-toulouse.fr
Facebook | Twitter | www.univ-toulouse.fr

Re: 1.9.2: Crash with 300% CPU and stuck agent-checks

2019-03-14 Thread Louis Chanouha
Hello, 

In fact I have two haproxy processes fighting for CPU, but the seamless reload 
time doesn't match the moment CPU usage increased (13/03 at 21:29:20; normal 
average usage is around 5%).

└──╼ ps -eo pid,lstart,%cpu,cmd | grep "[h]aproxy"
 8406 Wed Mar 13 18:06:21 2019 97.5 /usr/sbin/haproxy -Ws -f 
/etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 29691 -x 
/run/haproxy/admin.sock
29604 Wed Mar 13 16:59:49 2019  0.0 /usr/sbin/haproxy -Ws -f 
/etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 29691 -x 
/run/haproxy/admin.sock
29691 Wed Mar 13 16:59:49 2019  180 /usr/sbin/haproxy -Ws -f 
/etc/haproxy/haproxy.cfg -p /run/haproxy.pid

I tried to get what you asked for, but I really have no idea what I'm doing 
(see below) :x. Do you have a secure channel or a GPG key so I can send you the 
core dump? I just extracted the core dump of pid 29691 (the old instance). I 
crashed 8406 with a bad command.
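
If it helps, I could encrypt it with your public key before sending; a sketch 
(the key file name and key id are hypothetical):

└──╼ gzip core.29691
└──╼ gpg --import willy.asc                       # hypothetical key file
└──╼ gpg --encrypt --recipient <key-id> core.29691.gz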

This bug happens about once a week, so next time I can do more.
Production is not impacted; HAProxy is fully functional except, I guess, for 
checks (dead backends are never marked UP).

Louis

└──╼  gdb --pid 29691
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 29691
[New LWP 29692]
[New LWP 29693]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7f1549f8e303 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
(gdb) up
#1  0x5585811a365d in ?? ()
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
(gdb) up
#2  0x5585812485c2 in ?? ()
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
(gdb) up
#3  0x5585811a1102 in main ()
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
(gdb) up
Initial frame selected; you cannot go up.
(gdb) down
#2  0x5585812485c2 in ?? ()
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
(gdb) down
#1  0x5585811a365d in ?? ()
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
(gdb) down
#0  0x7f1549f8e303 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
84      in ../sysdeps/unix/syscall-template.S
(gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'
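
The "?? ()" frames and the empty type name suggest gdb cannot see HAProxy's 
debug symbols. A sketch of what might help on Debian (the dbgsym package name 
is an assumption):

└──╼ apt-get install haproxy-dbgsym      # assumes the Debian debug archive is enabled
└──╼ gdb /usr/sbin/haproxy --pid 29691   # point gdb at the executable explicitly
(gdb) p task_per_thread[0].task_list_size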

March 14, 2019 2:04:52 PM CET, Willy Tarreau wrote:

On Thu, Mar 14, 2019 at 01:22:37PM +0100, Louis Chanouha wrote:
> Hello,
> Thanks !
> 
> Seems OK to me, I don't have symbol errors, but your command doesn't work
(see below).

Oops, sorry, I know what's wrong:

> 0x7f06e3d0663f in __libc_send (fd=6698, buf=0x56295563fe90, n=4344,
> flags=16448) at ../sysdeps/unix/sysv/linux/x86_64/send.c:26
> 26    ../sysdeps/unix/sysv/linux/x86_64/send.c: No such file or directory.
> (gdb) p task_per_thread[0].task_list_size
> cannot subscript something of type `<data variable, no debug info>'

The process was interrupted at a lower layer in the libc; you need
to go up in the stack using the "up" command. Each time you type it,
it goes up one frame, and if you go too far you can come back down using "down".
You can also issue "bt full", which will show the complete backtrace with
each function called and their arguments.
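
For illustration, a session could look like this (frame names, addresses and 
values made up):

(gdb) bt full        # full backtrace with each frame's local variables
(gdb) up             # leave the libc frame
#1  0x... in run_poll_loop ()
(gdb) p task_per_thread[0].task_list_size
$1 = 0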

> You told me how to generate a core dump; I could give you the full file if it
> can be useful to you.

Oh yes, that would be awesome. You can do that from gdb using the
command "generate-core-file". It will dump it into your current
directory, you may need to make sure to have enough room. Please
keep in mind that the executable is needed with the core file so
we'll need both.
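
For example, a session could look like this (the core file name follows the pid):

(gdb) generate-core-file
Saved corefile core.29691
(gdb) quit
$ cp /usr/sbin/haproxy haproxy.bin-for-core    # keep the matching executable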

thanks!
Willy


Re: 1.9.2: Crash with 300% CPU and stuck agent-checks

2019-03-14 Thread Louis Chanouha
Hello,
Did I miss something? Sorry, I have never used GDB.

└──╼ (gdb) p task_per_thread[0].task_list_size
cannot subscript something of type `<data variable, no debug info>'

└──╼ haproxy -vvv
HA-Proxy version 1.9.3-1 2019/01/29 - https://haproxy.org/
Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -O2 -fdebug-prefix-map=/root/haproxy/haproxy-1.9.3=. 
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
-D_FORTIFY_SOURCE=2 -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv 
-Wno-unused-label -Wno-sign-compare -Wno-unused-parameter 
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered 
-Wno-missing-field-initializers -Wtype-limits -Wshift-negative-value 
-Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 
USE_SYSTEMD=1 USE_PCRE2=1 USE_PCRE2_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.1a  20 Nov 2018
Running on OpenSSL version : OpenSSL 1.1.1a  20 Nov 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT 
IP_FREEBIND
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), 
raw-deflate("deflate"), gzip("gzip")
Built with PCRE2 version : 10.22 2016-07-29
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with multi-threading support.

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
  h2 : mode=HTX    side=FE|BE
  h2 : mode=HTTP   side=FE
  <default> : mode=HTX    side=FE|BE
  <default> : mode=TCP|HTTP   side=FE|BE

Available filters :
    [SPOE] spoe
    [COMP] compression
    [CACHE] cache
    [TRACE] trace

March 14, 2019 11:18:39 AM CET, Willy Tarreau wrote:

Louis,

I'd be interested in checking the values of task_per_thread[X].task_list_size
for each value of X between 0 and your number of threads minus 1. Example for
4 threads :

(gdb) p task_per_thread[0].task_list_size
$2 = 0
(gdb) p task_per_thread[1].task_list_size
$3 = 0
(gdb) p task_per_thread[2].task_list_size
$4 = 0
(gdb) p task_per_thread[3].task_list_size
$5 = 0

It will help rule out certain areas which could be setting the negative value.
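
If it's easier, the same can be scripted non-interactively; a sketch for 3 
threads (assumes the worker pid and working debug symbols):

$ gdb --batch --pid $(pidof -s haproxy) \
      -ex 'p task_per_thread[0].task_list_size' \
      -ex 'p task_per_thread[1].task_list_size' \
      -ex 'p task_per_thread[2].task_list_size'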

Thanks,
Willy


Re: 1.9.2: Crash with 300% CPU and stuck agent-checks

2019-03-14 Thread Louis Chanouha
Hello,
It seems that I have the same problem as Mark Janssen.
I did not restart, so I can still debug with gdb.

Louis

└──╼ haproxy -v
HA-Proxy version 1.9.3-1 2019/01/29 - https://haproxy.org/

└──╼ /usr/bin/socat -T 15 -t 5 /run/haproxy/admin.sock - <<< "show info dump" 
|grep Run_queue
Run_queue: 4294967075
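
That Run_queue looks like a small negative counter printed as a 32-bit 
unsigned integer; a quick check:

└──╼ printf '%d\n' $(( 4294967075 - 4294967296 ))   # undo the 2^32 wrap
-221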

└──╼ /usr/bin/socat -T 15 -t 5 /run/haproxy/admin.sock - <<< "show activity" 
thread_id: 0
date_now: 1552556188.687920
loops: 1008276769 3401674410 4164783314
wake_cache: 32084592 30461706 80918252
wake_tasks: 974107386 3368299498 4072703110
wake_signal: 0 0 0
poll_exp: 1006191978 3398761204 4153621362
poll_drop: 290649 290640 408668
poll_dead: 0 0 0
poll_skip: 0 0 0
fd_skip: 0 0 0
fd_lock: 582021 566207 517485
fd_del: 0 0 0
conn_dead: 0 0 0
stream: 2544187 2588658 4177598
empty_rq: 10172386 11302238 74503517
long_rq: 1310868 4117304 2862308
cpust_ms_tot: 6731598 6509804 640930
cpust_ms_1s: 313 219 3
cpust_ms_15s: 3864 3003 142
avg_loop_us: 1 0 43

January 29, 2019 10:45:58 AM CET, Willy Tarreau wrote:

On Tue, Jan 29, 2019 at 10:41:52AM +0100, Louis Chanouha wrote:
> I'm pretty sure this bug is specific to version 1.9. Last week I restarted
> the process because it seemed to be stuck at around 100% CPU, but without
> abnormal behaviour.
> I've never seen that in the 1.7 or 1.8 series. We migrated from 1.8.15 to 1.9.2.
> 
> For 3 years, I've never seen HAProxy use more than 30% CPU on our VM.

OK, that's already a good indication. For example, you could have been running
fine with 1.9.1 and hit the bug only after switching to 1.9.2, which
would have indicated a recent regression.

> As I guess there could be private keys in these files, I will send you the core
> dump privately (master/worker) and/or the haproxy conf file. Hope it will help.

There would indeed be private info. I don't need them right now; I
just suggested that you keep them for a while, in case they are needed
later to validate a hypothesis, for example.

I released 1.9.3 this morning; you should definitely upgrade to it
and see if the issue happens again. It contains the fixes for the suspicious
related bugs I mentioned.

Thanks,
Willy


Re: 1.9.2: Crash with 300% CPU and stuck agent-checks

2019-01-29 Thread Louis Chanouha

January 29, 2019 4:24:57 AM CET, Willy Tarreau wrote:

Hello Louis,

On Mon, Jan 28, 2019 at 10:43:37PM +0100, Louis Chanouha wrote:
> Hello,
> We faced a critical issue this evening where all agent-checks were stuck
> (or retried much more slowly than usual).
> For example I saw "2h39m DOWN 6/15 ↑" for more than 2h on several backend
> servers. So all down servers stayed down until I manually forced the "UP" state.
> The problem seemed to start from a (legitimate) L7 timeout, and I can now see
> a constant 300% CPU usage.
> 
> We use HAProxy 1.9.2 compiled from Debian sources. We do not use any
> 1.9-specific option (no HTX) and never had this kind of bug before. We use
> threading (nbthread = 3) and a lot of custom L7 checks (tcp-check expect
> string ...).

I guess that if it's the first time you're seeing this, it might not happen
often enough to be easily debugged. However, what is the previous version
you used where you feel reasonably confident you would have known if it
had happened?

I'm pretty sure this bug is specific to version 1.9. Last week I restarted the 
process because it seemed to be stuck at around 100% CPU, but without abnormal 
behaviour.
I've never seen that in the 1.7 or 1.8 series. We migrated from 1.8.15 to 1.9.2.

For 3 years, I've never seen HAProxy use more than 30% CPU on our VM.

I suspect it might be related to some of the recent fixes in the checks
code, which is unfortunately still shared with the mailers and which caused
them to loop like crazy. Since then we've found two remaining bugs in
this area that were addressed after 1.9.2 and are already pending in the
maintenance branch, scheduled for release (hopefully today). One of them
(the issue with the task wake-up) could possibly result in this as a side
effect.

> I did not restart our HAProxy, to help debugging. What can I provide to help?
> The logfile isn't useful.

If you want, you can take a core dump of the process using gdb. You attach
it to the process (gdb --pid $(pidof haproxy)) and issue "generate-core-file".
It will produce a core file that may be reused later with your executable
if we figure that we'd possibly need something from it. Please don't forget
to keep a copy of the executable with this core. This way you can safely
kill this process and restart it.

As I guess there could be private keys in these files, I will send you the core 
dump privately (master/worker) and/or the haproxy conf file. Hope it will help. 

> [Sorry for my english]

No problem at all with your english, at least from another frenchie :-)

:) 

Have a good day,
Louis

Thanks,
Willy


1.9.2: Crash with 300% CPU and stuck agent-checks

2019-01-28 Thread Louis Chanouha
Hello,
We faced a critical issue this evening where all agent-checks were stuck 
(or retried much more slowly than usual).
For example I saw "2h39m DOWN 6/15 ↑" for more than 2h on several backend 
servers. So all down servers stayed down until I manually forced the "UP" state.
The problem seemed to start from a (legitimate) L7 timeout, and I can now see a 
constant 300% CPU usage.

We use HAProxy 1.9.2 compiled from Debian sources. We do not use any 
1.9-specific option (no HTX) and never had this kind of bug before. We use 
threading (nbthread = 3) and a lot of custom L7 checks (tcp-check expect string 
...).
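
For context, the checks are of this kind (a made-up example, not our real 
config):

backend smtp_pool
    option tcp-check
    tcp-check connect
    tcp-check expect rstring ^220
    tcp-check send QUIT\r\n
    tcp-check expect rstring ^221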

I did not restart our HAProxy, to help debugging. What can I provide to help? 
The logfile isn't useful.

[Sorry for my english]
Regards,
Louis


h2 & server PUSH

2018-11-11 Thread Louis Chanouha

Hello,

If I'm right (I may have missed some exchanges on the mailing list), the main 
h2 improvement in 1.9 will be end-to-end h2 support. So to have h2 with 
Server Push, we would need h2-enabled backends.


Is a server push initiated by HAProxy based on the "Link" header scheduled 
for 1.9 (like nginx's http2_push_preload and h2o's http2-push-preload)?



Since a lot of CMSes (and other apps) implement this header (and it is easily 
added manually if not), this is IMHO the fastest way to enable server push even 
with h1 backends. It avoids upgrading the backends, and if you have a cache 
server like me (e.g. Varnish, or HAProxy's internal cache), there is no need 
to send the data from the backends again.
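
For reference, the kind of response header such a push would be based on (path 
illustrative):

Link: </assets/app.css>; rel=preload; as=style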


Thanks for your answer & sorry for my english,

Louis



[bug] http-reuse and TCP mode warning when using PROXY protocol

2018-04-26 Thread Louis Chanouha

Hello,

I set http-reuse safe globally.

HAProxy displays a warning for the http-reuse and send-proxy combination on 
TCP-mode backends, but http-reuse is only active on HTTP-mode backends.


[WARNING] 115/135122 (26529) : config : proxy ' SMTPS_SUBMISSION' : 
connections to server 'f1' will have a PROXY protocol header announcing 
the first client's IP address while http-reuse is enabled and allows the 
same connection to be shared between multiple clients. It is strongly 
advised to disable 'send-proxy' and to use the 'forwardfor' option instead.


  defaults
      [...]
      http-reuse  safe

  listen SMTPS_SUBMISSION
      bind    0.0.0.0:587
      mode    tcp
      balance roundrobin
      option  tcplog
      server  f1 1.2.3.4:587 check send-proxy check-send-proxy


Sorry for my english

Louis