Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9
Hi Jarno,

On Wed, Oct 09, 2019 at 02:20:00PM +0300, Jarno Huuskonen wrote:
> > There's no real state machine there, it's roughly "if request was not
> > sent, send it, otherwise try to receive a response; if response is
> > received, then process it and close, otherwise wait for response". So
> > we cannot remain in a "waiting for closing" state after we receive a
> > response. I'm wondering if that couldn't be achieved using tcp-checks
> > however. I quickly tried but couldn't find a reliable way of doing
> > this, but I'm thinking that we could possibly extend the tcp-check
> > rules to have "tcp-check expect close" that would wait for a read0.
>
> Should there be some kind of timeout for how long to wait for the
> close? I'm thinking about backend servers that have very long/infinite
> keepalive (and/or don't respect "Connection: close").

Yes, we should have this once we rework the checks part; I've already
added it to an issue. The timeout should basically be the regular check
timeout: it would simply terminate the current check and the connection
before starting a new one, and would simply refrain from complaining!

Cheers,
Willy
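For illustration, the proposed rule might slot into a tcp-check sequence roughly like this. This is purely a sketch: "tcp-check expect close" did not exist in haproxy at the time of this thread, and the send/expect lines shown only approximate the existing tcp-check syntax (hostname and port taken from the config later in the thread):

```
backend BE_alertaapi
    option tcp-check
    tcp-check connect
    tcp-check send GET\ /_\ HTTP/1.1\r\nHost:\ host.name.fi\r\nConnection:\ close\r\n\r\n
    tcp-check expect string OK
    # Proposed rule from this thread (hypothetical): wait for the server's
    # FIN (read0) after the response, bounded by the regular check timeout.
    tcp-check expect close
    timeout check 5s
    server alertaapi1 127.0.0.1:8080 check
```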
Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9
Hi,

Thanks Willy for looking into this!

On Tue, Oct 08, Willy Tarreau wrote:
> On Fri, Oct 04, 2019 at 07:28:15PM +0300, Jarno Huuskonen wrote:
> > I sent pcap/strace offlist.
>
> Thanks, that was very useful.
>
> > (strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
> > port 8080).
> >
> > I think in the packet capture the second health check causes
> > "client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset
> > by peer [plugins/http/http.c line 917]"
> > (I think uwsgi logs client_port incorrectly, ntohs(2779) gives 56074
> > (and port 56074 is in the packet capture)).
>
> It's as usual: when you start to troubleshoot a bug you find a myriad
> of other ones around :-)
>
> > (haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).
> >
> > I tried to reproduce with a very minimal flask/uwsgi hello-world app
> > and there hr_read happens very rarely.
> > With the alerta(.io) app this happens more regularly (AFAIK not with
> > every check).
> > So maybe this is a weird timing issue or a bug in uwsgi.
>
> So it is indeed a timing issue. The check running on port 56062 shows
> the following sequence:
>
> 17:10:57.877876 IP localhost.56062 > localhost.8080: Flags [S]
> 17:10:57.877898 IP localhost.8080 > localhost.56062: Flags [S.]
> 17:10:57.877909 IP localhost.56062 > localhost.8080: Flags [.]
> 17:10:57.878065 IP localhost.56062 > localhost.8080: Flags [P.]
> 17:10:57.878078 IP localhost.8080 > localhost.56062: Flags [.]
> 17:10:57.879933 IP localhost.8080 > localhost.56062: Flags [P.]
> 17:10:57.879939 IP localhost.56062 > localhost.8080: Flags [.]
> 17:10:57.880008 IP localhost.8080 > localhost.56062: Flags [F.]
> 17:10:57.880333 IP localhost.56062 > localhost.8080: Flags [F.]
> 17:10:57.880341 IP localhost.8080 > localhost.56062: Flags [.]
>
> Note the FIN sent 75 microseconds after the response. This resulted in
> recvfrom() returning zero and the connection being cleanly closed.
>
> Now regarding port 56074, which is causing the excess logs:
>
> 17:11:04.132867 IP localhost.56074 > localhost.8080: Flags [S]
> 17:11:04.132890 IP localhost.8080 > localhost.56074: Flags [S.]
> 17:11:04.132904 IP localhost.56074 > localhost.8080: Flags [.]
> 17:11:04.133083 IP localhost.56074 > localhost.8080: Flags [P.]
> 17:11:04.133098 IP localhost.8080 > localhost.56074: Flags [.]
> 17:11:04.135101 IP localhost.8080 > localhost.56074: Flags [P.]
> 17:11:04.135107 IP localhost.56074 > localhost.8080: Flags [.]
> 17:11:04.135316 IP localhost.56074 > localhost.8080: Flags [R.]
>
> As you can see, even 215 microseconds after the response there is still
> no FIN. I've checked, and in both cases the headers are the same. There
> is indeed a "Connection: close" header emitted by haproxy and none is
> advertised in the response, but clearly it's a matter of delay. And
> this causes recvfrom() to return EAGAIN, so haproxy is forced to
> hard-close the connection (as if it were doing "option
> http-server-close"). Please also note that the returned data are
> properly ACKed, so there is not much more we can do there.
>
> Before the patch that you bisected, you didn't face this because of a
> bug that was preventing the hard close from working; instead you were
> accumulating TIME_WAIT sockets on the client side. So I'm afraid the
> best I can say for now is that you'll still have to live with these
> occasional logs :-/

Thanks. Glad to see this wasn't a bug on the haproxy side. I tested the
alerta app with gunicorn, and gunicorn doesn't seem to mind that haproxy
uses a forced close on the health check, so switching uwsgi -> gunicorn
is another option.

> I'd really like to see a massive rework being done on the checks. Your
> report made me realize that there is one feature we're currently
> missing, which is the ability to wait for a close once we've got a
> response. At the moment we cannot add this because checks work in two
> steps:
>
> 1) connect and send the "request"
> 2) receive the "response" and close.

And AFAIK this works just fine with most sane backend servers.

> There's no real state machine there, it's roughly "if request was not
> sent, send it, otherwise try to receive a response; if response is
> received, then process it and close, otherwise wait for response". So
> we cannot remain in a "waiting for closing" state after we receive a
> response. I'm wondering if that couldn't be achieved using tcp-checks
> however. I quickly tried but couldn't find a reliable way of doing
> this, but I'm thinking that we could possibly extend the tcp-check
> rules to have "tcp-check expect close" that would wait for a read0.

Should there be some kind of timeout for how long to wait for the close?
I'm thinking about backend servers that have very long/infinite
keepalive (and/or don't respect "Connection: close").

-Jarno

--
Jarno Huuskonen
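The two-step check behaviour quoted above can be sketched in a few lines. This is illustrative Python pseudologic, not haproxy's actual C implementation: note there is no state after "response received", which is exactly why the check cannot wait for the server's close.

```python
# Illustrative sketch of the two-step check logic described above --
# not haproxy's real code. A check either sends the request or tries
# to read the response; once a response is processed, the connection
# is closed immediately (no "waiting for close" state exists).

def check_step(state):
    """Advance a health check by one step and return the new state."""
    if not state["request_sent"]:
        state["request_sent"] = True          # step 1: connect + send request
        return state
    response = state["inbox"]                 # step 2: try to receive
    if response is None:
        return state                          # nothing yet: keep waiting
    state["result"] = "UP" if b"OK" in response else "DOWN"
    state["closed"] = True                    # process and close right away
    return state

# Simulated run: the request goes out, then a response containing "OK" arrives.
s = {"request_sent": False, "inbox": None, "result": None, "closed": False}
s = check_step(s)                             # sends the request
s["inbox"] = b"HTTP/1.1 200 OK\r\n\r\nOK"
s = check_step(s)                             # receives, processes, closes
```

Extending this to wait for a read0 would require a third state between "response processed" and "closed", which is the gap the proposed "tcp-check expect close" rule would fill.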
Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9
Hi Jarno,

On Fri, Oct 04, 2019 at 07:28:15PM +0300, Jarno Huuskonen wrote:
> I sent pcap/strace offlist.

Thanks, that was very useful.

> (strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
> port 8080).
>
> I think in the packet capture the second health check causes
> "client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset
> by peer [plugins/http/http.c line 917]"
> (I think uwsgi logs client_port incorrectly, ntohs(2779) gives 56074
> (and port 56074 is in the packet capture)).

It's as usual: when you start to troubleshoot a bug you find a myriad of
other ones around :-)

> (haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).
>
> I tried to reproduce with a very minimal flask/uwsgi hello-world app
> and there hr_read happens very rarely.
> With the alerta(.io) app this happens more regularly (AFAIK not with
> every check).
> So maybe this is a weird timing issue or a bug in uwsgi.

So it is indeed a timing issue. The check running on port 56062 shows the
following sequence:

17:10:57.877876 IP localhost.56062 > localhost.8080: Flags [S]
17:10:57.877898 IP localhost.8080 > localhost.56062: Flags [S.]
17:10:57.877909 IP localhost.56062 > localhost.8080: Flags [.]
17:10:57.878065 IP localhost.56062 > localhost.8080: Flags [P.]
17:10:57.878078 IP localhost.8080 > localhost.56062: Flags [.]
17:10:57.879933 IP localhost.8080 > localhost.56062: Flags [P.]
17:10:57.879939 IP localhost.56062 > localhost.8080: Flags [.]
17:10:57.880008 IP localhost.8080 > localhost.56062: Flags [F.]
17:10:57.880333 IP localhost.56062 > localhost.8080: Flags [F.]
17:10:57.880341 IP localhost.8080 > localhost.56062: Flags [.]

Note the FIN sent 75 microseconds after the response. This resulted in
recvfrom() returning zero and the connection being cleanly closed.

Now regarding port 56074, which is causing the excess logs:

17:11:04.132867 IP localhost.56074 > localhost.8080: Flags [S]
17:11:04.132890 IP localhost.8080 > localhost.56074: Flags [S.]
17:11:04.132904 IP localhost.56074 > localhost.8080: Flags [.]
17:11:04.133083 IP localhost.56074 > localhost.8080: Flags [P.]
17:11:04.133098 IP localhost.8080 > localhost.56074: Flags [.]
17:11:04.135101 IP localhost.8080 > localhost.56074: Flags [P.]
17:11:04.135107 IP localhost.56074 > localhost.8080: Flags [.]
17:11:04.135316 IP localhost.56074 > localhost.8080: Flags [R.]

As you can see, even 215 microseconds after the response there is still
no FIN. I've checked, and in both cases the headers are the same. There
is indeed a "Connection: close" header emitted by haproxy and none is
advertised in the response, but clearly it's a matter of delay. And this
causes recvfrom() to return EAGAIN, so haproxy is forced to hard-close
the connection (as if it were doing "option http-server-close"). Please
also note that the returned data are properly ACKed, so there is not
much more we can do there.

Before the patch that you bisected, you didn't face this because of a
bug that was preventing the hard close from working; instead you were
accumulating TIME_WAIT sockets on the client side. So I'm afraid the
best I can say for now is that you'll still have to live with these
occasional logs :-/

I'd really like to see a massive rework being done on the checks. Your
report made me realize that there is one feature we're currently
missing, which is the ability to wait for a close once we've got a
response. At the moment we cannot add this because checks work in two
steps:

1) connect and send the "request"
2) receive the "response" and close.

There's no real state machine there, it's roughly "if request was not
sent, send it, otherwise try to receive a response; if response is
received, then process it and close, otherwise wait for response". So we
cannot remain in a "waiting for closing" state after we receive a
response. I'm wondering if that couldn't be achieved using tcp-checks
however. I quickly tried but couldn't find a reliable way of doing this,
but I'm thinking that we could possibly extend the tcp-check rules to
have "tcp-check expect close" that would wait for a read0.

Regards,
Willy
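The "hard close" above is, in socket terms, a close with lingering disabled. Assuming the usual SO_LINGER mechanism (my reading of the behaviour described; the mail does not spell out the exact call), setting a zero linger timeout makes close() emit an RST instead of the normal FIN, which is what uwsgi then reports as "Connection reset by peer":

```python
import socket
import struct

# Hard-close sketch: SO_LINGER with l_onoff=1, l_linger=0 makes close()
# send an RST instead of a FIN on a connected socket. Here we only set
# the option and read it back, without opening a real connection.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))

onoff, linger = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
s.close()  # on a connected socket this close would now emit RST, not FIN
```

The alternative, a clean shutdown+close, only works when the server's FIN has already been received, which is exactly the race described in the captures above.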
Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9
Hi Willy,

On Fri, Oct 04, Willy Tarreau wrote:
> Hi Jarno,
>
> On Wed, Oct 02, 2019 at 01:08:14PM +0300, Jarno Huuskonen wrote:
> > Hello,
> >
> > I was testing haproxy -> uwsgi (alerta.io) and noticed a possible
> > regression with healthchecks (httpchk).
> > With 1.9.9 uwsgi logs:
> > [uwsgi-http key: host.name.fi client_addr: 127.0.0.1 client_port:
> > 45715] hr_read(): Connection reset by peer [plugins/http/http.c line
> > 917]
> >
> > Health checks work
> > (option httpchk GET /_ HTTP/1.1\r\nHost:\ host.name.fi\r\nUser-Agent:\
> > haproxy)
> > but uwsgi logs the hr_read() warning/error.
> >
> > I bisected 1.9.9 and this commit is probably the one that changes the
> > behaviour between 1.9.8 and 1.9.9:
> > 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71 is the first bad commit
> > commit 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71
> > Author: Olivier Houchard
> > Date: Fri Jun 14 15:26:06 2019 +0200
> >
> >     BUG/MEDIUM: connections: Don't call shutdown() if we want to
> >     disable linger.
> >
> >     In conn_sock_shutw(), avoid calling shutdown() if linger_risk is
> >     set. Not doing so will result in getting sockets in TIME_WAIT for
> >     some time. This is particularly observable with health checks.
> >
> >     This should be backported to 1.9.
> >
> >     (cherry picked from commit fe4abe62c7c5206dff1802f42d17014e198b9141)
> >     Signed-off-by: Christopher Faulet
>
> Hmmm that's annoying, really, because we've opened a huge can of worms
> when fixing the first failed check and we're constantly displacing the
> problem somewhere else :-/
>
> Yes, please do provide an strace and a tcpdump, that would be nice.
> I suspect that we'll possibly see a FIN from the server, without the
> equivalent recv()==0 in haproxy, and that the setsockopt() call
> resulting in the RST being sent doesn't ack the FIN. Normally we
> should perform a clean shutdown+close if the FIN was received with
> the response and detected in time.

I sent pcap/strace offlist.
(strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
port 8080).

I think in the packet capture the second health check causes
"client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset
by peer [plugins/http/http.c line 917]"
(I think uwsgi logs client_port incorrectly, ntohs(2779) gives 56074
(and port 56074 is in the packet capture)).

(haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).

I tried to reproduce with a very minimal flask/uwsgi hello-world app and
there hr_read happens very rarely. With the alerta(.io) app this happens
more regularly (AFAIK not with every check). So maybe this is a weird
timing issue or a bug in uwsgi.

> I'm seeing that your check request doesn't contain "connection: close",
> so actually it's possible that your server doesn't send the FIN, in
> which case we really need to close with RST. Could you please try to
> add "connection: close" to your httpchk line?

I had Connection: close, but removed it after I added
"http-check expect string OK" (http-check expect adds Connection: close).

-Jarno

--
Jarno Huuskonen
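The ntohs(2779) observation above is just a 16-bit byte swap: uwsgi apparently logged the source port still in network byte order. A quick check of the relationship between the logged and real port numbers:

```python
def swap16(n):
    """Swap the two bytes of a 16-bit value -- what ntohs() does on a
    little-endian machine."""
    return ((n & 0xFF) << 8) | ((n >> 8) & 0xFF)

# The port uwsgi logged vs. the real source port seen in the capture:
logged = 2779                 # 0x0ADB as uwsgi printed it
real = swap16(logged)         # 0xDB0A == 56074, matching the pcap
```

This supports the guess in the mail that the port mismatch is a byte-order bug in uwsgi's logging, not anything haproxy does.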
Re: healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9
Hi Jarno,

On Wed, Oct 02, 2019 at 01:08:14PM +0300, Jarno Huuskonen wrote:
> Hello,
>
> I was testing haproxy -> uwsgi (alerta.io) and noticed a possible
> regression with healthchecks (httpchk).
> With 1.9.9 uwsgi logs:
> [uwsgi-http key: host.name.fi client_addr: 127.0.0.1 client_port:
> 45715] hr_read(): Connection reset by peer [plugins/http/http.c line
> 917]
>
> Health checks work
> (option httpchk GET /_ HTTP/1.1\r\nHost:\ host.name.fi\r\nUser-Agent:\
> haproxy)
> but uwsgi logs the hr_read() warning/error.
>
> I bisected 1.9.9 and this commit is probably the one that changes the
> behaviour between 1.9.8 and 1.9.9:
> 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71 is the first bad commit
> commit 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71
> Author: Olivier Houchard
> Date: Fri Jun 14 15:26:06 2019 +0200
>
>     BUG/MEDIUM: connections: Don't call shutdown() if we want to
>     disable linger.
>
>     In conn_sock_shutw(), avoid calling shutdown() if linger_risk is
>     set. Not doing so will result in getting sockets in TIME_WAIT for
>     some time. This is particularly observable with health checks.
>
>     This should be backported to 1.9.
>
>     (cherry picked from commit fe4abe62c7c5206dff1802f42d17014e198b9141)
>     Signed-off-by: Christopher Faulet

Hmmm that's annoying, really, because we've opened a huge can of worms
when fixing the first failed check and we're constantly displacing the
problem somewhere else :-/

Yes, please do provide an strace and a tcpdump, that would be nice.
I suspect that we'll possibly see a FIN from the server, without the
equivalent recv()==0 in haproxy, and that the setsockopt() call
resulting in the RST being sent doesn't ack the FIN. Normally we should
perform a clean shutdown+close if the FIN was received with the response
and detected in time.

I'm seeing that your check request doesn't contain "connection: close",
so actually it's possible that your server doesn't send the FIN, in
which case we really need to close with RST. Could you please try to add
"connection: close" to your httpchk line?

Regards,
Willy
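Applied to the httpchk line from the report, the suggested change would look roughly like the following (header spaces escaped as "option httpchk" requires; whether the extra header helps depends on the server honoring it, as the rest of the thread shows):

```
option httpchk GET /_ HTTP/1.1\r\nHost:\ host.name.fi\r\nUser-Agent:\ haproxy\r\nConnection:\ close
```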
healthchecks (to uwsgi) possible regression 1.9.8 -> 1.9.9
Hello,

I was testing haproxy -> uwsgi (alerta.io) and noticed a possible
regression with healthchecks (httpchk). With 1.9.9 uwsgi logs:

[uwsgi-http key: host.name.fi client_addr: 127.0.0.1 client_port: 45715]
hr_read(): Connection reset by peer [plugins/http/http.c line 917]

Health checks work
(option httpchk GET /_ HTTP/1.1\r\nHost:\ host.name.fi\r\nUser-Agent:\ haproxy)
but uwsgi logs the hr_read() warning/error.

I bisected 1.9.9 and this commit is probably the one that changes the
behaviour between 1.9.8 and 1.9.9:

5d0cb90eb78f869e8801b34eddfdfd5dd8360e71 is the first bad commit
commit 5d0cb90eb78f869e8801b34eddfdfd5dd8360e71
Author: Olivier Houchard
Date: Fri Jun 14 15:26:06 2019 +0200

    BUG/MEDIUM: connections: Don't call shutdown() if we want to disable
    linger.

    In conn_sock_shutw(), avoid calling shutdown() if linger_risk is set.
    Not doing so will result in getting sockets in TIME_WAIT for some
    time. This is particularly observable with health checks.

    This should be backported to 1.9.

    (cherry picked from commit fe4abe62c7c5206dff1802f42d17014e198b9141)
    Signed-off-by: Christopher Faulet

Also 1.9.11, 2.0.7 and 2.1-dev2 have the same problem with uwsgi
hr_read(). If I revert commits 6c7e96a3e1abb331e414d1aabb45d9fedb0254c2
and fe4abe62c7c5206dff1802f42d17014e198b9141 from 2.1-dev2 then the
uwsgi hr_read() error disappears.

If this seems worth digging into I can get packet captures or strace.
(I'm testing this on a rhel8 vm with a 4.18.0-80.11.1.el8_0.x86_64
kernel.)
This is a fairly minimal config for testing:

frontend FE_alerta
    bind ipv4@:8443 name alertav4ssl ssl crt /etc/haproxy/ssl/crtname.pem alpn h2,http/1.1
    mode http
    option dontlognull
    option http-ignore-probes   # ignore "pre-connect" requests
    timeout http-request 8s
    capture request header Host len 40
    option contstats
    option forwardfor except 127.0.0.0/8   # remove incoming X-Forwarded-For headers
    http-request set-header X-Forwarded-Proto https
    default_backend BE_alertaapi

#
# Alerta uwsgi backend
#
backend BE_alertaapi
    option httpchk GET /_ HTTP/1.1\r\nHost:\ demo3.uef.fi\r\nUser-Agent:\ UEFHaproxy
    http-check expect string OK
    http-check disable-on-404
    retries 2
    option redispatch
    option prefer-last-server
    balance roundrobin
    timeout connect 4500ms
    timeout server 30s
    timeout queue 4s
    timeout check 5s
    # uwsgi alerta app expects /alerts (not /api/alerts), strip /api
    #http-request replace-uri ^/api/?(.*) /\1
    # inter fast for uwsgi hr_read() testing
    default-server inter 6s downinter 25s rise 2
    server alertaapi1 127.0.0.1:8080 id 1 check

-Jarno

--
Jarno Huuskonen