Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi Willy, Cyril,

Thank you for your detailed analysis. I still notice 504 errors almost immediately after an haproxy start, and the PID matches the new process:

    [root@frontend2 log]# ps aux | grep haproxy
    haproxy  21242  6.6  0.1 133176 47984 ?  Rs  07:17  0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid
    [root@frontend2 log]# service haproxy stop; service rsyslog stop; rm -f /var/log/haproxy.log; service rsyslog start; service haproxy start
    Stopping haproxy:                  [  OK  ]
    Shutting down system logger:       [  OK  ]
    Starting system logger:            [  OK  ]
    Starting haproxy:                  [  OK  ]
    [root@frontend2 log]# tail -f haproxy.log | grep 504
    Sep 14 07:16:14 localhost haproxy[21178]: 94.197.40.185:3504 [14/Sep/2011:07:16:08.216] main python_8001/python_8001_fe1 80/0/0/-1/6449 502 204 - - SH-- 3375/3375/950/950/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"
    Sep 14 07:16:15 localhost haproxy[21178]: 118.101.95.88:49504 [14/Sep/2011:07:16:10.298] main python_9003/python_9003_fe1 22/0/0/-1/5088 502 204 - - SH-- 3312/3312/386/386/0 0/0 "POST /xxx/chat/message/3/updates HTTP/1.1"
    ^C
    [root@frontend2 log]# ps aux | grep haproxy
    haproxy  21178  5.4  0.2 137268 51480 ?  Ss  07:16  0:01 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

I don't understand how this can be - can you shed any light? (The configuration is the same as in my earlier email, i.e. timeout server at 2 hours.) I will investigate the 502 errors with more benchmarking directly against the backends. Thank you for confirming that this is the source of the problem.

Many thanks,

Alex

On Tue, Sep 13, 2011 at 11:41 PM, Willy Tarreau <w...@1wt.eu> wrote:
> Hi Alex,
>
> On Tue, Sep 13, 2011 at 03:18:54PM +0100, Alex Davies wrote:
>> Hi,
>>
>> Thank you for your observation - indeed I did notice some of those as I
>> was writing my email. I have updated my globals to increase the server
>> timeout (as we are doing long polling), reduce the others, and remove
>> the duplicates:
>>
>>     defaults
>>         mode http
>>         option httplog
>>         #option tcplog
>>         option dontlognull
>>         option dontlog-normal
>>         log global
>>         retries 10
>>         maxconn 5
>>         option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
>>         option httpclose # Required for REMOTE HEADER
>>         option redispatch
>>         timeout connect 1
>>         timeout client 1
>>         timeout server 720
>>
>> I still notice the same errors in the logs! (Slightly fewer 504s, as I
>> would expect from the increase in timeout server - but I still don't
>> understand why I get any at all in the first minute of a new process.)
>
> To complete Cyril's detailed analysis, I'd like to add that you'll only
> see 502s when you restart, and it will take some time before you see
> 504s again (eg: 2 hours with the config above). The 502s mean that the
> server has suddenly aborted the connection (flags SH), while the 504s
> indicate that haproxy was fed up with waiting and closed after timeout
> server had elapsed.
>
> So yes, it's very possible that your server has its own timeout, but it
> should be in the 30s range from what I saw in your logs. It still does
> not explain why some requests never time out on the server - maybe they
> don't wake the same components up?
>
> Regards,
> Willy

--
Alex Davies
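Willy's point about restarts makes it worth double-checking which process each logged error really came from. Note also that a bare "grep 504" matches more than status codes - both lines in the output above are in fact 502s whose client ports (94.197.40.185:3504, 118.101.95.88:49504) happen to contain "504". Below is a rough Python sketch (the default log path and the httplog format shown above are assumptions) that keys on the status-code field instead and tallies status/flags per logging PID:

    import re
    from collections import defaultdict

    # Matches the httplog lines above: PID, then "<status> <bytes> - - <flags>".
    LINE = re.compile(r"haproxy\[(\d+)\]: .* (\d{3}) \d+ - - (\S{4}) ")

    counts = defaultdict(lambda: defaultdict(int))
    with open("/var/log/haproxy.log") as log:
        for line in log:
            m = LINE.search(line)
            if m:
                pid, status, flags = m.groups()
                counts[pid][status + " " + flags] += 1

    for pid in sorted(counts):
        print(pid, dict(counts[pid]))

If an old haproxy process is still draining connections after a restart, its PID will show up here alongside the new one.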
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi Alex,

Sorry, I won't have time to help you now, but...

On Tuesday 13 September 2011 14:26:04, Alex Davies wrote:
> The total time in milliseconds between accept and last close seems to be
> ~300 seconds for most of the requests (although far from all of them, as
> the first entry shows). If I *restart* (not reload) haproxy, I still get
> lines with the fifth of these numbers (Tt in the docs) at ~300,000 (the
> timeout server value in the config at the moment I copied the logs
> above), a few seconds after the haproxy process starts. I also get lots
> that seem to end at almost exactly 300k even when I change both timeout
> client and timeout server to very different numbers.

Have you noticed that your configuration declares the same timeouts several times? It mixes deprecated syntax and new keywords (clitimeout vs timeout client, contimeout vs timeout connect, srvtimeout vs timeout server). If you tried to modify the srvtimeout and clitimeout values, that can explain why you still see those 300s timeouts.

    # haproxy.conf
    defaults
        timeout connect 20
        contimeout 20
        clitimeout 9
        option forwardfor except 127.0.0.1/32 # Apache running https on localhost
        option httpclose # Required for REMOTE HEADER
        option redispatch
        timeout connect 1
        timeout client 30
        timeout server 30

The values declared last will apply:

    timeout connect 1
    timeout client 30
    timeout server 30

Maybe this can help you for the next steps ;-)

--
Cyril Bonté
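To make the last-declared-wins behaviour visible at a glance, here is a rough Python sketch that reports the effective value of each timeout keyword, folding the deprecated spellings into the new ones. The config path is an assumption, and it deliberately ignores section boundaries, so it is only meaningful for a config with a single defaults block like the one above:

    import re

    OLD_TO_NEW = {
        "contimeout": "timeout connect",
        "clitimeout": "timeout client",
        "srvtimeout": "timeout server",
    }

    effective = {}
    with open("/etc/haproxy/haproxy.cfg") as cfg:
        for line in cfg:
            line = line.split("#", 1)[0].strip()
            m = re.match(r"(contimeout|clitimeout|srvtimeout|timeout\s+[\w-]+)\s+(\S+)", line)
            if m:
                keyword = re.sub(r"\s+", " ", m.group(1))
                keyword = OLD_TO_NEW.get(keyword, keyword)
                effective[keyword] = m.group(2)  # later lines overwrite earlier ones

    for keyword in sorted(effective):
        print(keyword, "->", effective[keyword])

Run against the snippet above, it prints timeout connect -> 1, timeout client -> 30 and timeout server -> 30, matching the values haproxy actually uses.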
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi,

Thank you for your observation - indeed I did notice some of those as I was writing my email. I have updated my globals to increase the server timeout (as we are doing long polling), reduce the others, and remove the duplicates:

    defaults
        mode http
        option httplog
        #option tcplog
        option dontlognull
        option dontlog-normal
        log global
        retries 10
        maxconn 5
        option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
        option httpclose # Required for REMOTE HEADER
        option redispatch
        timeout connect 1
        timeout client 1
        timeout server 720

I still notice the same errors in the logs! (Slightly fewer 504s, as I would expect from the increase in timeout server - but I still don't understand why I get any at all in the first minute of a new process.)

Cheers,

Alex

On Tue, Sep 13, 2011 at 1:46 PM, Cyril Bonté <cyril.bo...@free.fr> wrote:
> clitimeout
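Since the thread keeps coming back to long polling, it may help to spell out the timing relationship: the backend deliberately holds each request open until it has data, so haproxy's timeout server must exceed the intended hold time, and the client usually has its own timeout on top (the ~300s jQuery value discussed elsewhere in the thread). A purely hypothetical client loop (host and path are made up to match the logged URLs):

    import time
    import urllib.request

    URL = "http://frontend2/xxx/chat/status/updates"  # hypothetical host/path

    while True:
        try:
            # POST with an empty body, mirroring the logged requests; blocks
            # until the server answers or the client-side timeout fires.
            with urllib.request.urlopen(URL, data=b"", timeout=300) as resp:
                print(resp.read())
        except OSError as exc:  # urllib's URLError subclasses OSError
            print("poll failed:", exc)
            time.sleep(1)  # avoid a tight error loop before re-polling

Any of the three timeouts involved (client-side, haproxy's timeout server, or a timer inside Tornado itself) can end the poll, and whichever is shortest determines the pattern seen in the logs.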
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi again Alex,

On Tuesday 13 September 2011 13:26:04, Alex Davies wrote:
> Hi,
>
> I am not a haproxy expert, but I have been using it in production for
> some time with excellent results, and I wonder if I can seek some expert
> advice on running the fairly fast application server
> http://www.tornadoweb.org/ behind haproxy (haproxy-1.3.23 using the EPEL
> RPM (-1) on RHEL6 x86_64).
>
> Haproxy is working very well for me, but I'm looking for help
> understanding how I can diagnose problems with the Tornado application I
> have running behind it. I have ~8 tornado processes running on two
> servers. It's important that one is active and the other is failover
> (some state is stored in memory). The parts of my haproxy configuration
> relevant to my question are below.
>
> I notice a large number of entries in the logs like this:
>
> 502 errors:
>
> Sep 13 12:42:45 localhost haproxy[15128]: 188.222.50.208:61001 [13/Sep/2011:12:42:43.881] main python_8001/python_8001_fe1 10/0/0/-1/1527 502 204 - - SH-- 6676/6676/2082/2082/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"
> Sep 13 12:42:45 localhost haproxy[15128]: 81.246.46.162:29456 [13/Sep/2011:12:42:14.289] main python_8001/python_8001_fe1 28/0/0/-1/31118 502 204 - - SH-- 6675/6675/2081/2081/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"
>
> 504 errors:
>
> Sep 13 12:43:08 localhost haproxy[15128]: 180.234.122.248:52888 [13/Sep/2011:12:38:08.822] main python_9004/python_9004_fe1 45/0/0/-1/300045 504 194 - - sH-- 6607/6607/697/697/0 0/0 "POST /xxx/chat/message/4/updates HTTP/1.1"
> Sep 13 12:43:09 localhost haproxy[15128]: 82.26.136.198:61758 [13/Sep/2011:12:38:09.071] main python_8001/python_8001_fe1 19/0/0/-1/300020 504 194 - - sH-- 6569/6569/2085/2085/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"
>
> It seems to me that all of these involve 0 seconds waiting in a queue, 0
> seconds to make a connection to the final app server, and then an
> aborted connection to the app server before a complete response could be
> received. The total time in milliseconds between accept and last close
> seems to be ~300 seconds for most of the requests (although far from all
> of them, as the first entry shows).

I wonder if you've not reached a limit on your tornado servers, for example the max number of open files. Do the servers go down when it happens (due to the "check" keyword, which only performs a layer 4 check in your configuration)?

> If I *restart* (not reload) haproxy, I still get lines with the fifth of
> these numbers (Tt in the docs) at ~300,000 (the timeout server value in
> the config at the moment I copied the logs above), a few seconds after
> the haproxy process starts.

What you describe makes me think of an asynchronous syslog configuration. That could explain why you see logs after the restart. When it happens, can you verify that the pid logged is really the pid of the new instance, or whether it's the old one? In your example, your instance has the pid 15128.
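The open-files limit Cyril raises above is easy to check directly. A small sketch (nothing here is specific to the poster's app - run it inside the Tornado process, or standalone under the same user and init settings):

    import resource

    # RLIMIT_NOFILE caps the number of open file descriptors; each
    # long-polling client holds one socket for the whole life of the poll.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("RLIMIT_NOFILE: soft=%d hard=%d" % (soft, hard))

With the typical RHEL default soft limit of 1024, eight Tornado processes could hold at most ~8k concurrent polls before new connections start failing, which could plausibly surface as sudden server-side aborts.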
> I also get lots that seem to end at almost exactly 300k even when I
> change both timeout client and timeout server to very different numbers.
> It is possible that the application (jQuery) has a 300s timeout
> hardcoded, but in any case I do not understand why the haproxy logs show
> connections of ~300k ms failing when I stop haproxy, increase timeout
> server by an order of magnitude, and start it again.
>
> Looking at the next part of the log entries, it seems that bytes_read is
> always 204 for 502 errors and 194 for 504 errors. This does seem to be a
> fairly regular pattern:
>
> [root@frontend2 log]# cat /var/log/haproxy.log | grep "504 194" | wc -l
> 1975
> [root@frontend2 log]# cat /var/log/haproxy.log | grep "502 204" | wc -l
> 12401
> [root@frontend2 log]# cat /var/log/haproxy.log | wc -l
> 18721
>
> My second question is: how do I find out (easily) exactly what is being
> returned, i.e. what are those 194/204 bytes? This might give me a hint
> as to what is going wrong or timing out on the application server. I
> guess I could try tcpdump, but I might struggle to filter down to the
> correct data (there are large numbers of successful connections going
> on).

Those sizes correspond respectively to the 504 and 502 responses sent by haproxy:

    [HTTP_ERR_502] =
        "HTTP/1.0 502 Bad Gateway\r\n"
        "Cache-Control: no-cache\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><h1>502 Bad Gateway</h1>\nThe server returned an invalid or incomplete response.\n</body></html>\n",

    [HTTP_ERR_504] =
        "HTTP/1.0 504 Gateway Time-out\r\n"
        "Cache-Control: no-cache\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n",

(You can find them in src/proto_http.c.)

> The next part of the logs is the most interesting; ignoring the two
> cookie fields, we see that the server-side timeout expired for the 504
> errors and the TCP session was unexpectedly aborted by the server for
> the 502 errors.
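Those two sizes can be cross-checked in a couple of lines of Python; the strings below are simply the canned responses transcribed from above:

    # The canned haproxy error pages account exactly for the logged
    # bytes_read values: 204 for the 502, 194 for the 504.
    ERR_502 = (
        "HTTP/1.0 502 Bad Gateway\r\n"
        "Cache-Control: no-cache\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><h1>502 Bad Gateway</h1>\n"
        "The server returned an invalid or incomplete response.\n"
        "</body></html>\n"
    )

    ERR_504 = (
        "HTTP/1.0 504 Gateway Time-out\r\n"
        "Cache-Control: no-cache\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><h1>504 Gateway Time-out</h1>\n"
        "The server didn't respond in time.\n"
        "</body></html>\n"
    )

    print(len(ERR_502), len(ERR_504))  # 204 194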
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi Alex,

On Tue, Sep 13, 2011 at 03:18:54PM +0100, Alex Davies wrote:
> Hi,
>
> Thank you for your observation - indeed I did notice some of those as I
> was writing my email. I have updated my globals to increase the server
> timeout (as we are doing long polling), reduce the others, and remove
> the duplicates:
>
>     defaults
>         mode http
>         option httplog
>         #option tcplog
>         option dontlognull
>         option dontlog-normal
>         log global
>         retries 10
>         maxconn 5
>         option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
>         option httpclose # Required for REMOTE HEADER
>         option redispatch
>         timeout connect 1
>         timeout client 1
>         timeout server 720
>
> I still notice the same errors in the logs! (Slightly fewer 504s, as I
> would expect from the increase in timeout server - but I still don't
> understand why I get any at all in the first minute of a new process.)

To complete Cyril's detailed analysis, I'd like to add that you'll only see 502s when you restart, and it will take some time before you see 504s again (eg: 2 hours with the config above). The 502s mean that the server has suddenly aborted the connection (flags SH), while the 504s indicate that haproxy was fed up with waiting and closed after timeout server had elapsed.

So yes, it's very possible that your server has its own timeout, but it should be in the 30s range from what I saw in your logs. It still does not explain why some requests never time out on the server - maybe they don't wake the same components up?

Regards,
Willy
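Willy's hypothesis - a timeout owned by the server itself, somewhere around 30s - is straightforward to reproduce with Tornado. The sketch below is purely hypothetical (handler name, port and URL are made up to match the logged paths, and it uses the callback-style Tornado 1.x/2.x API contemporary with this thread): a handler that abruptly closes the socket when its own timer fires would produce exactly the "SH--" 502 pattern in the logs, while one that never answers at all would be the 504 case:

    import time

    import tornado.ioloop
    import tornado.web


    class UpdatesHandler(tornado.web.RequestHandler):
        @tornado.web.asynchronous
        def post(self):
            # Park the long poll; a real app would finish the request from a
            # callback when a new chat message arrives.
            ioloop = tornado.ioloop.IOLoop.instance()
            ioloop.add_timeout(time.time() + 30, self.on_timeout)

        def on_timeout(self):
            # Abrupt close with no HTTP response: haproxy logs 502 / SH--.
            # A clean self.finish() here would end the poll normally, and
            # doing nothing at all would leave haproxy waiting until its
            # 'timeout server' expires (504 / sH--).
            self.request.connection.stream.close()


    application = tornado.web.Application([
        (r"/xxx/chat/status/updates", UpdatesHandler),
    ])

    if __name__ == "__main__":
        application.listen(8001)
        tornado.ioloop.IOLoop.instance().start()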