Re: 2.0.14 + htx / retry-on all-retryable-errors -> sometimes wrong backend/server used

2020-05-20 Thread Willy Tarreau
Hi Jarno,

On Tue, May 19, 2020 at 03:26:22PM +, Jarno Huuskonen wrote:
> Hi,
> 
> On Tue, 2020-05-19 at 15:58 +0200, Christopher Faulet wrote:
> > It was already reported on github and seems to be fixed. We are just
> > waiting for feedback to be sure it is fixed before backporting the
> > patch. See https://github.com/haproxy/haproxy/issues/623.
> > 
> > If you try the latest 2.2 snapshot, it should be good. You may also
> > try to cherry-pick commit 8cabc9783 onto 2.0.
> 
> Thanks Christopher (and Tim), I'll try with 2.2 snapshot (and/or)
> 8cabc9783 and report how it goes.

Note, regarding this, I'd like us to emit another set of stable versions
but we've got a report of a very tricky situation involving l7 retries
and HTTP reuse which can sometimes lead to a crash in versions prior to
2.0, and I'd really like to understand it fully, reproduce it and fix it.
So in the meantime, it's possible that 2.2 would be more reliable than
2.0 when it comes to l7 retries.

Cheers,
Willy



Re: 2.0.14 + htx / retry-on all-retryable-errors -> sometimes wrong backend/server used

2020-05-19 Thread Jarno Huuskonen
Hi,

On Tue, 2020-05-19 at 15:58 +0200, Christopher Faulet wrote:
> It was already reported on github and seems to be fixed. We are just
> waiting for feedback to be sure it is fixed before backporting the
> patch. See https://github.com/haproxy/haproxy/issues/623.
> 
> If you try the latest 2.2 snapshot, it should be good. You may also
> try to cherry-pick commit 8cabc9783 onto 2.0.

Thanks Christopher (and Tim), I'll try with 2.2 snapshot (and/or)
8cabc9783 and report how it goes.

-Jarno

-- 
Jarno Huuskonen


Re: 2.0.14 + htx / retry-on all-retryable-errors -> sometimes wrong backend/server used

2020-05-19 Thread Christopher Faulet

On 19/05/2020 at 15:36, Jarno Huuskonen wrote:

> Hi,
> 
> I think I found a case when haproxy-2.0.14 with htx and retry-on
> all-retryable-errors sometimes seems to select wrong backend/server
> to retry. (Doesn't happen on every retry).



Hi,

It was already reported on github and seems to be fixed. We are just waiting
for feedback to be sure it is fixed before backporting the patch. See
https://github.com/haproxy/haproxy/issues/623.

If you try the latest 2.2 snapshot, it should be good. You may also try to
cherry-pick commit 8cabc9783 onto 2.0.


--
Christopher Faulet



Re: 2.0.14 + htx / retry-on all-retryable-errors -> sometimes wrong backend/server used

2020-05-19 Thread Tim Düsterhus
Jarno,

On 19.05.20 at 15:36, Jarno Huuskonen wrote:
> I think I found a case when haproxy-2.0.14 with htx and retry-on
> all-retryable-errors sometimes seems to select wrong backend/server
> to retry. (Doesn't happen on every retry).
> 

I believe this is a duplicate of this issue:
https://github.com/haproxy/haproxy/issues/623

Best regards
Tim Düsterhus



2.0.14 + htx / retry-on all-retryable-errors -> sometimes wrong backend/server used

2020-05-19 Thread Jarno Huuskonen
Hi,

I think I found a case when haproxy-2.0.14 with htx and retry-on
all-retryable-errors sometimes seems to select wrong backend/server
to retry. (Doesn't happen on every retry).

I found that sometimes, when our wordpress backend gave a 500 error,
haproxy would retry on the wrong backend. Here's a very simplified,
redacted config (I can provide the full config off-list if needed).

wordpress frontend has http/2 enabled:
frontend FE_wp
 bind ipv4@address-here:443 name wpv4a1 ssl crt /etc/haproxy/ssl/crt1.pem alpn h2,http/1.1 crt /etc/haproxy/ssl/crt2.pem alpn h2,http/1.1 ssl-min-ver TLSv1.2
 ...
 use_backend %[req.hdr(Host),lower,map_dom(/path/to/wordpress_backends.map,BE_wp_blogs)]
 default_backend BE_wp_blogs


# This is the backend that sometimes gave 500 errors
backend BE_wp_blogs2
 ...
 retries 2
 option  redispatch
 option  prefer-last-server
 retry-on all-retryable-errors
 balance roundrobin

 timeout connect 4500ms
 timeout server  40s
 timeout queue   4s
 timeout check   5s

 cookie cookiename insert indirect nocache httponly maxidle 20m
 default-server inter 15s downinter 25s rise 2 error-limit 250 on-error fail-check
 server name1 ip1:8443 id 1 cookie name1 check observe layer7
 server name2 ip2:8443 id 2 cookie name2 check observe layer7
 server name3 ip3:8443 id 3 cookie name3 check observe layer7

# These are the unrelated/wrong backends
backend wrong1
 ...
 server diff1 different_ip:2048 id 1 cookie diff1 maxconn 500 track BE_other/ezauth1
 server diff2 different_ip:2048 id 2 cookie diff2 maxconn 500 track BE_other/ezauth2

backend wrong2
 ...
 server diffssl1 different_ip:2443 id 1 cookie diff1 maxconn 500 track BE_other/ezauth1
 server diffssl2 different_ip:2443 id 2 cookie diff2 maxconn 500 track BE_other/ezauth2

All backends had servers with the same numeric ids: ids 1-3 for the
wordpress backend and ids 1-2 for the "wrong" backend servers. I tried
changing the server ids in all backends, but I still sometimes get the
wrong backend/server.

If I set "no option http-use-htx" or "retry-on conn-failure", then AFAIK
(limited testing) the problem doesn't happen.

I haven't managed to reproduce this with a simple PHP script that just
gives a 500 error, so there could be some timing that triggers this.

Example haproxy log when request goes to wrong backend/server:
haproxy[258199]: client-ipv6-address:42630 [19/May/2020:16:09:29.366]
FE_wp~ BE_wp_blogs2/name3 0/0/89/0/89 400 134 - - --VU 1/1/0/121/2 0/0
{hostheader} "GET /wp-admin/ HTTP/2.0" 443 HTTP/2

(And the wrong backend/server (tomcat in this case) logs this:
haproxy-ip-address - - [19/May/2020:16:09:29 +0300] "-" 400 - 0ms JSESSIONID=-
The 400 error is because haproxy sends http to the tomcat https port.)
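
To pick such lines out of a larger log, the fields around the backend/server
name can be parsed mechanically. Below is a minimal Python sketch, assuming
the default haproxy HTTP log format (frontend, backend/server, timers, status,
bytes, two cookie fields, termination state, connection counters, queues).
The names LOG_RE and parse_line are made up for this illustration and are not
part of haproxy.

```python
import re

# Matches the portion of an HTTP log line that follows the accept date.
# Assumption: the default "option httplog" field order is in use.
LOG_RE = re.compile(
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>[-\d]+/[-\d]+/[-\d]+/[-\d]+/\+?[-\d]+) '
    r'(?P<status>\d{3}|-1) (?P<bytes>\+?\d+) \S+ \S+ (?P<term>\S{4}) '
    r'(?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)/'
    r'(?P<srv_conn>\d+)/(?P<retries>\+?\d+) '
    r'(?P<srv_queue>\d+)/(?P<backend_queue>\d+)'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    raw_retries = fields["retries"]
    # A leading "+" on the retries counter marks a redispatched request.
    fields["redispatched"] = raw_retries.startswith("+")
    fields["retries"] = int(raw_retries.lstrip("+"))
    fields["status"] = int(fields["status"])
    return fields


# The log line from this report, joined back onto one line.
sample = (
    'haproxy[258199]: client-ipv6-address:42630 [19/May/2020:16:09:29.366] '
    'FE_wp~ BE_wp_blogs2/name3 0/0/89/0/89 400 134 - - --VU 1/1/0/121/2 0/0 '
    '{hostheader} "GET /wp-admin/ HTTP/2.0" 443 HTTP/2'
)

fields = parse_line(sample)
print(fields["backend"], fields["server"], fields["status"], fields["retries"])
```

For the line above this gives backend BE_wp_blogs2, server name3, status 400
and a retries counter of 2. The log still names the expected server, so a
retried 400 is only a hint that the request may have gone elsewhere.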

tshark -e text shows this for the wordpress backend 500 response that
can trigger retry on wrong backend:
  "layers": {
    "http.response.code": [
      "500"
    ],
    "http.response.phrase": [
      "Internal Server Error"
    ],
    "text": [
      "Timestamps",
      "HTTP/1.1 500 Internal Server Error\\r\\n",
      "\\r\\n",
      "HTTP chunked response",
      "Data chunk (2892 octets)",
      "End of chunked encoding",
      "\\r\\n",
(I've omitted the response body).
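
For anyone who wants to experiment with a similar response, here is a small
Python sketch of a test backend that always answers with a chunked HTTP/1.1
500, using the 2892-octet chunk size seen in the capture. It is only an
approximation of the wordpress response: the body content is a placeholder,
the names Always500, encode_chunked and make_server are invented for this
example, and since a simple PHP script did not reproduce the problem, this
may well not trigger it either.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY_SIZE = 2892  # size of the single data chunk seen in the capture


def encode_chunked(body):
    """Encode body as one HTTP/1.1 chunk followed by the terminating chunk."""
    return b"%x\r\n%s\r\n0\r\n\r\n" % (len(body), body)


class Always500(BaseHTTPRequestHandler):
    # HTTP/1.1 is needed for chunked transfer encoding.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"x" * BODY_SIZE  # placeholder, the real body was omitted
        self.send_response(500, "Internal Server Error")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        self.wfile.write(encode_chunked(body))


def make_server(host="127.0.0.1", port=0):
    """Create the test server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), Always500)
```

Calling make_server(port=8443).serve_forever() would start it on one port;
pointing a test backend's server lines at it is left to the reader's config.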

Are there any more tests I could run to help figure out what's going on?
Perhaps try with the latest 2.2-dev?

-Jarno

(haproxy -vv:
HA-Proxy version 2.0.14 2020/04/02 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_REGPARM=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_ZLIB=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE
+PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD
-PTHREAD_PSHARED +REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY
+LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO
+OPENSSL -LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO
+NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER
+PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1d  10 Sep 2019
Running on OpenSSL version : OpenSSL 1.1.1d  10 Sep 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT
IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"),