Re: haproxy / Python 'tornado' framework - digging into 502/504 errors

2011-09-14 Thread Alex Davies
Hi Willy, Cyril,

Thank you for your detailed analysis. I still notice 504 errors almost
immediately after an HAproxy start, and the PID matches the new process:

[root@frontend2 log]# ps aux | grep haproxy
haproxy  21242  6.6  0.1 133176 47984 ?        Rs   07:17   0:00
/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid
[root@frontend2 log]# service haproxy stop; service rsyslog stop; rm -f
/var/log/haproxy.log;  service rsyslog start; service haproxy start;
Stopping haproxy:  [  OK  ]
Shutting down system logger:   [  OK  ]
Starting system logger:[  OK  ]
Starting haproxy:  [  OK  ]
[root@frontend2 log]# tail -f haproxy.log | grep 504
Sep 14 07:16:14 localhost haproxy[21178]:
94.197.40.185:3504 [14/Sep/2011:07:16:08.216] main
python_8001/python_8001_fe1 80/0/0/-1/6449
502 204 - - SH-- 3375/3375/950/950/0 0/0 POST /xxx/chat/status/updates
HTTP/1.1
Sep 14 07:16:15 localhost haproxy[21178]:
118.101.95.88:49504 [14/Sep/2011:07:16:10.298] main
python_9003/python_9003_fe1 22/0/0/-1/5088
502 204 - - SH-- 3312/3312/386/386/0 0/0 POST /xxx/chat/message/3/updates
HTTP/1.1
^C
[root@frontend2 log]# ps aux | grep haproxy
haproxy  21178  5.4  0.2 137268 51480 ?        Ss   07:16   0:01
/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

I don't understand how this can happen - can you shed any light? (The
configuration is the same as in my earlier email, i.e. a server timeout of 2
hours.)

I will investigate the 502 errors with more benchmarking directly against
the backends. Thank you for confirming that this is the source of the
problem.
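
(For the direct-to-backend benchmarking I am thinking of something like
ApacheBench against a single tornado process, bypassing haproxy - the port,
payload file and content type below are placeholders, not our real values:

  ab -n 1000 -c 50 -T 'application/x-www-form-urlencoded' \
     -p payload.txt http://127.0.0.1:8001/xxx/chat/status/updates

If the aborts reproduce without haproxy in the path, that points firmly at
the backend.)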

Many thanks,

Alex


On Tue, Sep 13, 2011 at 11:41 PM, Willy Tarreau w...@1wt.eu wrote:

 Hi Alex,

 On Tue, Sep 13, 2011 at 03:18:54PM +0100, Alex Davies wrote:
  Hi,
 
  Thank you for your observation - indeed I did notice some of those as I
 was
  writing my email - I have updated my globals to increase the server
 timeout
  (as we are doing long polling) and reduce the others, and remove the
  duplicates:
 
  defaults
  mode  http
  optionhttplog
  #option   tcplog
  optiondontlognull
  optiondontlog-normal
 
  log   global
  retries   10
  maxconn   50000
  option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
   option httpclose  # Required for REMOTE HEADER
  option redispatch
 
  timeout connect 10000
  timeout client  10000
  timeout server  7200000
 
  I still notice the same errors in the logs! (slightly fewer 504s, as I
  would expect from the increase in timeout server - but I still don't
  understand why I get any at all in the first minute of a new process).

 To complete Cyril's detailed analysis, I'd like to add that you'll only
 see 502s when you restart, and it will take some time before you see 504s
 again (eg: 2 hours with the config above).

 The 502s mean that the server has suddenly aborted the connection (flags
 SH), while the 504s indicate that haproxy was fed up with waiting and
 closed after timeout server elapsed.

 So yes it's very possible that your server has its own timeout, but it
 should be in the 30s from what I saw in your logs. It still does not
 explain why some requests never time out on the server; maybe they don't
 wake the same components up?

 Regards,
 Willy




-- 
Alex Davies

This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the sender immediately by e-mail and delete this e-mail permanently.


Re: haproxy / Python 'tornado' framework - digging into 502/504 errors

2011-09-13 Thread Cyril Bonté
Hi Alex,

Sorry I won't have time to help you now, but...

On Tuesday 13 September 2011 14:26:04, Alex Davies wrote:
 The total time in milliseconds between accept and last close seems to be
 ~300,000 (i.e. ~300 seconds) for most of the requests (although far from all
 of them, as the first entry shows). If I *restart* (not reload) haproxy, I
 still get lines with the fifth of these numbers (Tt in the docs) as ~300,000
 (the timeout server value in the config at the moment I copied the logs
 above), a few seconds after the haproxy process starts. I also get lots that
 seem to end on almost exactly 300k even when I change both timeout client
 and timeout server to very different numbers.

Have you noticed that your configuration declares the same timeouts several
times? It mixes deprecated syntax and new keywords (clitimeout vs
timeout client, contimeout vs timeout connect, srvtimeout vs timeout server).

If you tried to modify the srvtimeout and clitimeout values, that can explain 
why you still see those 300s timeouts.

 # haproxy.conf
 defaults
 timeout connect   200000
 contimeout        200000
 clitimeout        90000
 option forwardfor except 127.0.0.1/32 # Apache running https on
 localhost
 option httpclose # Required for REMOTE HEADER
 option redispatch
 timeout connect 10000
 timeout client  300000
 timeout server  300000

The latest declared values will apply:
 timeout connect 10000
 timeout client  300000
 timeout server  300000
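
If the intent is a 10s connect timeout and 5 minute client/server timeouts, a
cleaner defaults section would declare each timeout exactly once, using only
the new keywords (a minimal sketch; the values are illustrative, not a
recommendation):

defaults
timeout connect 10000   # 10s, in milliseconds
timeout client  300000  # 5 min
timeout server  300000  # 5 min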

Maybe this can help you for the next steps ;-)

-- 
Cyril Bonté



Re: haproxy / Python 'tornado' framework - digging into 502/504 errors

2011-09-13 Thread Alex Davies
Hi,

Thank you for your observation - indeed I did notice some of those as I was
writing my email - I have updated my globals to increase the server timeout
(as we are doing long polling) and reduce the others, and remove the
duplicates:

defaults
mode  http
optionhttplog
#option   tcplog
optiondontlognull
optiondontlog-normal

log   global
retries   10
maxconn   50000
option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
option httpclose  # Required for REMOTE HEADER
option redispatch

timeout connect 10000
timeout client  10000
timeout server  7200000

I still notice the same errors in the logs! (slightly fewer 504s, as I would
expect from the increase in timeout server - but I still don't understand
why I get any at all in the first minute of a new process).

Cheers,

Alex

On Tue, Sep 13, 2011 at 1:46 PM, Cyril Bonté cyril.bo...@free.fr wrote:

 clitimeout


Re: haproxy / Python 'tornado' framework - digging into 502/504 errors

2011-09-13 Thread Cyril Bonté
Hi again Alex,

On Tuesday 13 September 2011 13:26:04, Alex Davies wrote:
 Hi,
 
 I am not a haproxy expert, but I have been using it in production for some
 time with excellent results, and I wonder if I can seek some expert advice
 on running the fairly fast application server http://www.tornadoweb.org/
 behind HAproxy (haproxy-1.3.23 using the EPEL RPM (-1) on RHEL6 x86_64).
 Haproxy is working very well for me, but I'm looking for help understanding
 how I can diagnose problems with the Tornado application I have running
 behind it.
 
 I have ~8 tornado processes running on two servers. It's important that one
 is active and the other is failover (some state is stored in memory). The
 parts of my haproxy configuration relevant to my question are below. I
 notice a large number of entries in the logs like this:
 
 502 errors:
 Sep 13 12:42:45 localhost haproxy[15128]:
 188.222.50.208:61001 [13/Sep/2011:12:42:43.881] main
 python_8001/python_8001_fe1 10/0/0/-1/1527
 502 204 - - SH-- 6676/6676/2082/2082/0 0/0 POST /xxx/chat/status/updates
 HTTP/1.1
 Sep 13 12:42:45 localhost haproxy[15128]:
 81.246.46.162:29456 [13/Sep/2011:12:42:14.289] main
 python_8001/python_8001_fe1 28/0/0/-1/31118
 502 204 - - SH-- 6675/6675/2081/2081/0 0/0 POST /xxx/chat/status/updates
 HTTP/1.1
 
 504 errors:
 Sep 13 12:43:08 localhost haproxy[15128]:
 180.234.122.248:52888 [13/Sep/2011:12:38:08.822] main
 python_9004/python_9004_fe1 45/0/0/-1/300045
 504 194 - - sH-- 6607/6607/697/697/0 0/0 POST /xxx/chat/message/4/updates
 HTTP/1.1
 Sep 13 12:43:09 localhost haproxy[15128]:
 82.26.136.198:61758 [13/Sep/2011:12:38:09.071] main
 python_8001/python_8001_fe1 19/0/0/-1/300020
 504 194 - - sH-- 6569/6569/2085/2085/0 0/0 POST /xxx/chat/status/updates
 HTTP/1.1
 
 It seems to me that all of these involve 0 seconds waiting in a queue, 0
 seconds to make a connection to the final app server and then an aborted
 connection to the app server before a complete response could be received.
 The total time in milliseconds between accept and last close seems to be
 ~300,000 (i.e. ~300 seconds) for most of the requests (although far from
 all of them, as the first entry shows).

I wonder whether you've reached a limit on your tornado servers, for example
the maximum number of open files. Do the servers go down when it happens (due
to the check keyword, which only performs a layer 4 check in your
configuration)?
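
If you want to check that, something along these lines on the tornado
servers could help (a sketch; the pgrep pattern assumes your processes are
named 'tornado'):

  # per-process file-descriptor limit of one tornado worker
  grep 'open files' /proc/$(pgrep -f tornado | head -1)/limits
  # number of descriptors it currently holds
  ls /proc/$(pgrep -f tornado | head -1)/fd | wc -l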

 If I *restart* (not reload) haproxy, I still get lines
 with the fifth of these numbers (Tt in the docs) as ~300,000 (the
 timeout server value in the config at the moment I copied the logs
 above), a few seconds after the haproxy process starts.

What you describe makes me think of an asynchronous syslog configuration. That
could explain why you see logs after the restart.
When it happens, can you verify whether the pid logged is really the pid of
the new instance, or whether it's the old one?

In your example, your instance has the pid 15128.
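
For example (a sketch using the pid file path from your haproxy command
line):

  cat /var/run/haproxy.pid
  ps -p $(cat /var/run/haproxy.pid) -o pid,lstart,cmd
  tail -1 /var/log/haproxy.log   # does haproxy[pid] here match the pid above?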

 I also get lots
 that seem to end on almost exactly 300k even when I change both timeout
 client and timeout server to very different numbers. It is possible that
 the application (jQuery) has a 300s timeout hardcoded, but in any case I do
 not understand why the haproxy logs show connections with a total time of
 300k failing when I stop haproxy, increase timeout server by an order of
 magnitude and start it again.
 
 Looking at the next part of the log entries it seems that bytes_read is
 always 204 for 502 errors and 194 for 504 errors. This does seem to be a
 fairly regular pattern:
 [root@frontend2 log]# cat /var/log/haproxy.log | grep ' 504 194 ' | wc -l
 1975
 [root@frontend2 log]# cat /var/log/haproxy.log | grep ' 502 204 ' | wc -l
 12401
 [root@frontend2 log]# cat /var/log/haproxy.log | wc -l
 18721
 
 My second question is how do I find out exactly what is being returned
 (easily), i.e. what are those 194/204 bytes? This might give me a hint as to
 what is going wrong or timing out on the application server. I guess I
 could try to tcpdump, but I might struggle to actually filter down the
 correct data (there are large numbers of successful connections going on).

Those sizes correspond respectively to the 504 and 502 responses sent by 
haproxy.

[HTTP_ERR_502] =
"HTTP/1.0 502 Bad Gateway\r\n"
"Cache-Control: no-cache\r\n"
"Connection: close\r\n"
"Content-Type: text/html\r\n"
"\r\n"
"<html><body><h1>502 Bad Gateway</h1>\nThe server returned an invalid or
incomplete response.\n</body></html>\n",

[HTTP_ERR_504] =
"HTTP/1.0 504 Gateway Time-out\r\n"
"Cache-Control: no-cache\r\n"
"Connection: close\r\n"
"Content-Type: text/html\r\n"
"\r\n"
"<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in
time.\n</body></html>\n",

(you can find them in src/proto_http.c)
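
Regarding tcpdump: to avoid drowning in the successful connections, you could
capture only the RST/FIN segments on one backend port (a sketch; port 8001 is
just an example taken from your logs):

  tcpdump -i any -s 0 -w tornado-aborts.pcap \
    'tcp port 8001 and tcp[tcpflags] & (tcp-rst|tcp-fin) != 0'

Since the SH flags indicate a server abort, the interesting packets should be
the resets coming back from tornado.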

 The next part of the logs are most interesting; ignoring the two cookie
 fields we see that the server-side timeout expired for the 504 errors
 and the TCP session was unexpectedly 

Re: haproxy / Python 'tornado' framework - digging into 502/504 errors

2011-09-13 Thread Willy Tarreau
Hi Alex,

On Tue, Sep 13, 2011 at 03:18:54PM +0100, Alex Davies wrote:
 Hi,
 
 Thank you for your observation - indeed I did notice some of those as I was
 writing my email - I have updated my globals to increase the server timeout
 (as we are doing long polling) and reduce the others, and remove the
 duplicates:
 
 defaults
 mode  http
 optionhttplog
 #option   tcplog
 optiondontlognull
 optiondontlog-normal
 
 log   global
 retries   10
 maxconn   50000
 option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
  option httpclose  # Required for REMOTE HEADER
 option redispatch
 
 timeout connect 10000
 timeout client  10000
 timeout server  7200000
 
 I still notice the same errors in the logs! (slightly fewer 504s, as I would
 expect from the increase in timeout server - but I still don't understand
 why I get any at all in the first minute of a new process).

To complete Cyril's detailed analysis, I'd like to add that you'll only
see 502s when you restart, and it will take some time before you see 504s
again (eg: 2 hours with the config above).

The 502s mean that the server has suddenly aborted the connection (flags SH),
while the 504s indicate that haproxy was fed up with waiting and closed after
timeout server elapsed.

So yes it's very possible that your server has its own timeout, but it should
be in the 30s from what I saw in your logs. It still does not explain why some
requests never time out on the server; maybe they don't wake the same
components up?
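
To make that concrete, here is one of your 504 lines read field by field
(annotations mine):

  19/0/0/-1/300020   Tq/Tw/Tc/Tr/Tt: Tr=-1 means no response ever arrived,
                     and Tt sits right at the 300s server timeout
  504 194            the 504 status and its 194-byte body are generated by
                     haproxy itself, not by the server
  sH--               's' = the server-side timeout expired, 'H' = while
                     haproxy was waiting for the response headers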

Regards,
Willy