Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9

2019-10-09 Thread Willy Tarreau
Hi Jarno,

On Wed, Oct 09, 2019 at 02:20:00PM +0300, Jarno Huuskonen wrote:
> > There's no real state machine there, it's roughly "if the request was
> > not sent, send it, otherwise try to receive a response; if a response
> > is received, then process it and close, otherwise wait for the
> > response". So we cannot remain in a "waiting for close" state after we
> > receive a response. I'm wondering if that couldn't be achieved using
> > tcp-checks, however. I quickly tried but couldn't find a reliable way
> > of doing this, but I'm thinking that we could possibly extend the
> > tcp-check rules to have a "tcp-check expect close" that would wait for
> > a read0.
> 
> Should there be some kind of timeout for how long to wait for the
> close? I'm thinking about backend servers that have a very
> long/infinite keepalive (and/or don't respect Connection: close).

Yes, we should have this once we rework the checks part. I've already
added this in an issue. The timeout should basically be the regular
check timeout: it would simply terminate the current check and the
connection before starting a new one, just without complaining about it!
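
(For illustration, here is a minimal C sketch of what such a bounded wait
for the server's close could look like once the response has been read.
This only assumes the behaviour described above; it is not haproxy's
actual check code:)

    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* After the response has been read, wait at most timeout_ms (the
     * regular check timeout) for the server's FIN.  Whether the FIN
     * arrives or the timeout expires, the connection is simply closed
     * and no error is reported.
     */
    static void wait_for_close_then_quit(int fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char buf[4096];

        while (timeout_ms > 0 && poll(&pfd, 1, timeout_ms) > 0) {
            ssize_t ret = recv(fd, buf, sizeof(buf), 0);
            if (ret <= 0)      /* 0 = read0 (FIN), <0 = error: stop waiting */
                break;
            /* ret > 0: drain trailing data; a real implementation would
             * also count the elapsed time against an absolute deadline.
             */
        }
        close(fd);             /* give up silently, never log a failure */
    }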

Cheers,
Willy



Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9

2019-10-09 Thread Jarno Huuskonen
Hi,

Thanks, Willy, for looking into this!

On Tue, Oct 08, Willy Tarreau wrote:
> On Fri, Oct 04, 2019 at 07:28:15PM +0300, Jarno Huuskonen wrote:
> > I sent pcap/strace offlist.
> 
> Thanks, that was very useful.
> 
> > (strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
> > port 8080).
> > 
> > I think in the packet capture the second health check causes
> > "client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset by
> > peer [plugins/http/http.c line 917]"
> > (I think uwsgi logs client_port incorrectly; ntohs(2779) gives 56074,
> > and port 56074 is in the packet capture).
> 
> As usual, when you start to troubleshoot one bug you find a myriad of
> other ones around it :-)
> 
> > (haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).
> > 
> > I tried to reproduce with a very minimal flask/uwsgi hello world app
> > and there hr_read() happens very rarely.
> > With the alerta(.io) app this happens more regularly (AFAIK not with
> > every check).
> > So maybe this is a weird timing issue or a bug in uwsgi.
> 
> So it is indeed a timing issue. The check running on port 56062 shows the
> following sequence:
> 
> 17:10:57.877876 IP localhost.56062 > localhost.8080: Flags [S]
> 17:10:57.877898 IP localhost.8080 > localhost.56062: Flags [S.]
> 17:10:57.877909 IP localhost.56062 > localhost.8080: Flags [.]
> 17:10:57.878065 IP localhost.56062 > localhost.8080: Flags [P.]
> 17:10:57.878078 IP localhost.8080 > localhost.56062: Flags [.]
> 17:10:57.879933 IP localhost.8080 > localhost.56062: Flags [P.]
> 17:10:57.879939 IP localhost.56062 > localhost.8080: Flags [.]
> 17:10:57.880008 IP localhost.8080 > localhost.56062: Flags [F.]
> 17:10:57.880333 IP localhost.56062 > localhost.8080: Flags [F.]
> 17:10:57.880341 IP localhost.8080 > localhost.56062: Flags [.]
> 
> Note the FIN sent 75 microseconds after the response. This resulted in
> recvfrom() returning zero and the connection being cleanly closed. Now
> regarding port 56074, which is causing the excess logs:
> 
> 17:11:04.132867 IP localhost.56074 > localhost.8080: Flags [S]
> 17:11:04.132890 IP localhost.8080 > localhost.56074: Flags [S.]
> 17:11:04.132904 IP localhost.56074 > localhost.8080: Flags [.]
> 17:11:04.133083 IP localhost.56074 > localhost.8080: Flags [P.]
> 17:11:04.133098 IP localhost.8080 > localhost.56074: Flags [.]
> 17:11:04.135101 IP localhost.8080 > localhost.56074: Flags [P.]
> 17:11:04.135107 IP localhost.56074 > localhost.8080: Flags [.]
> 17:11:04.135316 IP localhost.56074 > localhost.8080: Flags [R.]
> 
> As you can see, even 215 microseconds after the response there is still
> no FIN. I've checked, and in both cases the headers are the same. There
> is indeed a "connection: close" header emitted by haproxy and none is
> advertised in the response, but clearly it's a matter of delay. And
> this causes recvfrom() to return EAGAIN, so haproxy is forced to hard-
> close the connection (as if it were doing option http-server-close).
> Please also note that the returned data are properly ACKed, so there is
> not much more we can do there.
> 
> Before the patch that you bisected, you didn't face this because of a
> bug that was preventing the hard close from working; instead you were
> accumulating TIME_WAIT sockets on the client side. So I'm afraid that
> for now the best I can say is that you'll still have to live with these
> occasional logs :-/

Thanks. Glad to see this wasn't a bug on the haproxy side.

I tested the alerta app with gunicorn, and gunicorn doesn't seem to mind
that haproxy uses a forced close on the health check, so switching from
uwsgi to gunicorn is another option.

> I'd really like to see a massive rework being done on the checks. Your
> report made me realize that there is one feature we're currently missing,
> which is the ability to wait for a close once we've got a response. At
> the moment we cannot add this because checks work in 2 steps:
> 
>  1) connect and send the "request"
>  2) receive the "response" and close.

And AFAIK this works just fine with most sane backend servers.

> There's no real state machine there, it's roughly "if the request was
> not sent, send it, otherwise try to receive a response; if a response is
> received, then process it and close, otherwise wait for the response".
> So we cannot remain in a "waiting for close" state after we receive a
> response. I'm wondering if that couldn't be achieved using tcp-checks,
> however. I quickly tried but couldn't find a reliable way of doing this,
> but I'm thinking that we could possibly extend the tcp-check rules to
> have a "tcp-check expect close" that would wait for a read0.

Should there be some kind of timeout for how long to wait for the
close? I'm thinking about backend servers that have a very long/infinite
keepalive (and/or don't respect Connection: close).

-Jarno

-- 
Jarno Huuskonen



Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9

2019-10-08 Thread Willy Tarreau
Hi Jarno,

On Fri, Oct 04, 2019 at 07:28:15PM +0300, Jarno Huuskonen wrote:
> I sent pcap/strace offlist.

Thanks, that was very useful.

> (strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
> port 8080).
> 
> I think in the packet capture the second health check causes
> "client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset by
> peer [plugins/http/http.c line 917]"
> (I think uwsgi logs client_port incorrectly; ntohs(2779) gives 56074,
> and port 56074 is in the packet capture).

As usual, when you start to troubleshoot one bug you find a myriad of
other ones around it :-)

> (haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).
> 
> I tried to reproduce with a very minimal flask/uwsgi hello world app
> and there hr_read() happens very rarely.
> With the alerta(.io) app this happens more regularly (AFAIK not with every check).
> So maybe this is a weird timing issue or a bug in uwsgi.

So it is indeed a timing issue. The check running on port 56062 shows the
following sequence:

17:10:57.877876 IP localhost.56062 > localhost.8080: Flags [S]
17:10:57.877898 IP localhost.8080 > localhost.56062: Flags [S.]
17:10:57.877909 IP localhost.56062 > localhost.8080: Flags [.]
17:10:57.878065 IP localhost.56062 > localhost.8080: Flags [P.]
17:10:57.878078 IP localhost.8080 > localhost.56062: Flags [.]
17:10:57.879933 IP localhost.8080 > localhost.56062: Flags [P.]
17:10:57.879939 IP localhost.56062 > localhost.8080: Flags [.]
17:10:57.880008 IP localhost.8080 > localhost.56062: Flags [F.]
17:10:57.880333 IP localhost.56062 > localhost.8080: Flags [F.]
17:10:57.880341 IP localhost.8080 > localhost.56062: Flags [.]

Note the FIN sent 75 microseconds after the response. This resulted in
recvfrom() returning zero and the connection being cleanly closed. Now
regarding port 56074, which is causing the excess logs:

17:11:04.132867 IP localhost.56074 > localhost.8080: Flags [S]
17:11:04.132890 IP localhost.8080 > localhost.56074: Flags [S.]
17:11:04.132904 IP localhost.56074 > localhost.8080: Flags [.]
17:11:04.133083 IP localhost.56074 > localhost.8080: Flags [P.]
17:11:04.133098 IP localhost.8080 > localhost.56074: Flags [.]
17:11:04.135101 IP localhost.8080 > localhost.56074: Flags [P.]
17:11:04.135107 IP localhost.56074 > localhost.8080: Flags [.]
17:11:04.135316 IP localhost.56074 > localhost.8080: Flags [R.]

As you can see, even 215 microseconds after the response there is still
no FIN. I've checked, and in both cases the headers are the same. There
is indeed a "connection: close" header emitted by haproxy and none is
advertised in the response, but clearly it's a matter of delay. And
this causes recvfrom() to return EAGAIN, so haproxy is forced to hard-
close the connection (as if it were doing option http-server-close).
Please also note that the returned data are properly ACKed, so there is
not much more we can do there.
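
(For the record, here is a minimal C sketch, not haproxy's actual code,
of the two close paths being discussed: the clean close used when the
FIN was already seen, and the SO_LINGER-based hard close that makes the
peer log "Connection reset by peer":)

    #include <sys/socket.h>
    #include <unistd.h>

    /* fin_received: whether recv() already returned 0, i.e. the server's
     * FIN was seen together with (or right after) the response.
     */
    static void close_check_connection(int fd, int fin_received)
    {
        if (fin_received) {
            /* Both sides are done; a plain close() finishes the
             * connection cleanly and nobody logs anything.
             */
            close(fd);
            return;
        }

        /* No FIN yet (recv() would return EAGAIN). A normal active close
         * here would leave this side in TIME_WAIT; enabling SO_LINGER
         * with l_linger = 0 makes close() abort the connection with an
         * RST instead, which is what the server then reports as
         * "Connection reset by peer".
         */
        struct linger lin = { .l_onoff = 1, .l_linger = 0 };
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
        close(fd);
    }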

Before the patch that you bisected, you didn't face this because of a
bug that was preventing the hard close from working; instead you were
accumulating TIME_WAIT sockets on the client side. So I'm afraid that
for now the best I can say is that you'll still have to live with these
occasional logs :-/

I'd really like to see a massive rework being done on the checks. Your
report made me realize that there is one feature we're currently missing,
which is the ability to wait for a close once we've got a response. At
the moment we cannot add this because checks work in 2 steps:

 1) connect and send the "request"
 2) receive the "response" and close.

There's no real state machine there, it's roughly "if the request was
not sent, send it, otherwise try to receive a response; if a response is
received, then process it and close, otherwise wait for the response".
So we cannot remain in a "waiting for close" state after we receive a
response. I'm wondering if that couldn't be achieved using tcp-checks,
however. I quickly tried but couldn't find a reliable way of doing this,
but I'm thinking that we could possibly extend the tcp-check rules to
have a "tcp-check expect close" that would wait for a read0.

Regards,
Willy



Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9

2019-10-04 Thread Jarno Huuskonen
Hi Willy,

On Fri, Oct 04, Willy Tarreau wrote:
> Hi Jarno,
> 
> On Wed, Oct 02, 2019 at 01:08:14PM +0300, Jarno Huuskonen wrote:
> > Hello,
> > 
> > I was testing haproxy -> uwsgi (alerta.io) and noticed a possible
> > regression with health checks (httpchk).
> > With 1.9.9 uwsgi logs:
> > [uwsgi-http key: host.name.fi client_addr: 127.0.0.1 client_port: 45715] 
> > hr_read(): Connection reset by peer [plugins/http/http.c line 917]
> > 
> > health checks work
> > (option httpchk GET /_ HTTP/1.1\r\nHost:\ host.name.fi\r\nUser-Agent:\ 
> > haproxy)
> > but uwsgi logs the hr_read() warning/error.
> > 
> > I bisected 1.9.9 and this commit is probably the commit that changes
> > behaviour between 1.9.8 and 1.9.9:
> > 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71 is the first bad commit
> > commit 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71
> > Author: Olivier Houchard 
> > Date:   Fri Jun 14 15:26:06 2019 +0200
> > 
> > BUG/MEDIUM: connections: Don't call shutdown() if we want to disable 
> > linger.
> > 
> > In conn_sock_shutw(), avoid calling shutdown() if linger_risk is set.
> > Not doing so will result in getting sockets in TIME_WAIT for some time.
> > This is particularly observable with health checks.
> > 
> > This should be backported to 1.9.
> > 
> > (cherry picked from commit fe4abe62c7c5206dff1802f42d17014e198b9141)
> > Signed-off-by: Christopher Faulet 
> 
> Hmmm that's annoying, really, because we've opened a huge can of worms
> when fixing the first failed check and we're constantly displacing the
> problem somewhere else :-/
> 
> Yes, please do provide an strace, and a tcpdump, that would be nice.
> I suspect that we'll possibly see a FIN from the server, without the
> equivalent recv()==0 in haproxy, and that the setsockopt() call
> resulting in the RST being sent doesn't ack the FIN. Normally we
> should perform a clean shutdown+close if the FIN was received with
> the response and detected in time.

I sent pcap/strace offlist.
(strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
port 8080).

I think in the packet capture the second health check causes
"client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset by peer
[plugins/http/http.c line 917]"
(I think uwsgi logs client_port incorrectly; ntohs(2779) gives 56074,
and port 56074 is in the packet capture).
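
(A quick way to check the byte-order theory, assuming uwsgi printed the
raw network-byte-order sin_port value without converting it; a tiny C
example:)

    #include <arpa/inet.h>
    #include <stdio.h>

    int main(void)
    {
        /* 2779 is 0x0ADB; swapping the bytes gives 0xDB0A = 56074, so
         * the logged value looks like the real port left in network
         * byte order.
         */
        printf("%u\n", (unsigned)ntohs(2779));  /* prints 56074 on a
                                                 * little-endian host */
        return 0;
    }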

(haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).

I tried to reproduce with a very minimal flask/uwsgi hello world app
and there hr_read() happens very rarely.
With the alerta(.io) app this happens more regularly (AFAIK not with every check).
So maybe this is a weird timing issue or a bug in uwsgi.

> I'm seeing that your check request doesn't contain "connection: close",
> so actually it's possible that your server doesn't send the FIN, in
> which case we really need to close with an RST. Could you please try to
> add "connection: close" to your httpchk line?

I had Connection: close, but removed it after I added
http-check expect string OK
(http-check expect adds Connection: close).

-Jarno

-- 
Jarno Huuskonen



Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9

2019-10-03 Thread Willy Tarreau
Hi Jarno,

On Wed, Oct 02, 2019 at 01:08:14PM +0300, Jarno Huuskonen wrote:
> Hello,
> 
> I was testing haproxy -> uwsgi (alerta.io) and noticed a possible
> regression with health checks (httpchk).
> With 1.9.9 uwsgi logs:
> [uwsgi-http key: host.name.fi client_addr: 127.0.0.1 client_port: 45715] 
> hr_read(): Connection reset by peer [plugins/http/http.c line 917]
> 
> health checks work
> (option httpchk GET /_ HTTP/1.1\r\nHost:\ host.name.fi\r\nUser-Agent:\ 
> haproxy)
> but uwsgi logs the hr_read() warning/error.
> 
> I bisected 1.9.9 and this commit is probably the commit that changes
> behaviour between 1.9.8 and 1.9.9:
> 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71 is the first bad commit
> commit 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71
> Author: Olivier Houchard 
> Date:   Fri Jun 14 15:26:06 2019 +0200
> 
> BUG/MEDIUM: connections: Don't call shutdown() if we want to disable 
> linger.
> 
> In conn_sock_shutw(), avoid calling shutdown() if linger_risk is set. Not
> doing so will result in getting sockets in TIME_WAIT for some time.
> This is particularly observable with health checks.
> 
> This should be backported to 1.9.
> 
> (cherry picked from commit fe4abe62c7c5206dff1802f42d17014e198b9141)
> Signed-off-by: Christopher Faulet 

Hmmm that's annoying, really, because we've opened a huge can of worms
when fixing the first failed check and we're constantly displacing the
problem somewhere else :-/

Yes, please do provide an strace, and a tcpdump, that would be nice.
I suspect that we'll possibly see a FIN from the server, without the
equivalent recv()==0 in haproxy, and that the setsockopt() call
resulting in the RST being sent doesn't ack the FIN. Normally we
should perform a clean shutdown+close if the FIN was received with
the response and detected in time.

I'm seeing that your check request doesn't contain "connection: close",
so actually it's possible that your server doesn't send the FIN, in
which case we really need to close with an RST. Could you please try to
add "connection: close" to your httpchk line?

Regards,
Willy