Re: SNMP Perl script with Centos 6.0
On Tue, Sep 13, 2011 at 10:40:18AM +1000, Dwyer, Simon wrote:
> Issue resolved. I thought I had already turned SELinux to permissive. Apparently not :)

Wow, good to know. Thanks for the feedback on this issue; apparently nobody was able to bring any idea on this point, and now we have it in the ML's archives! Cheers, Willy
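For anyone hitting this later, a quick way to rule SELinux in or out before digging into the application itself (a sketch for a CentOS-style box; the `setenforce`/`getenforce` tools only exist where SELinux tooling is installed, and changing the mode needs root):

```shell
# Check the current SELinux mode before blaming the application.
if command -v getenforce >/dev/null 2>&1; then
    mode=$(getenforce)          # Enforcing | Permissive | Disabled
else
    mode="no-selinux-tooling"
fi
echo "SELinux mode: $mode"
# To relax until the next reboot (needs root):  setenforce 0
# To make it permanent, set SELINUX=permissive in /etc/selinux/config.
```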
Re: Stress test
On Tue, Sep 13, 2011 at 03:13:11PM +1000, Dwyer, Simon wrote:
> Cheers, I will have a look at ab. I more just want to make sure it doesn't crash and burn while it's in test. Doing more of a proof of concept at the moment :)

BTW, you must use different machines for the client, the LB and the server in your tests. Otherwise you'll see very strange patterns because all of them will fight for CPU and your numbers may vary *a lot*. I'd suggest looking at httperf, though it's harder to use than ab. There is the old inject tool on my web page, which has the advantage of reporting measurements in real time instead of running a blind test and giving you numbers at the end without letting you know whether the load was regular or not. It does no keep-alive and doesn't scale well with concurrent connections, however. On the server side, you should probably use something like nginx, which will be much faster than Apache. Apache is generally the bottleneck when used in a benchmark platform. Regards, Willy
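As an illustration of the two tools mentioned, here is a sketch of typical invocations (the LB address 192.0.2.10 and all the numbers are made up; run these from a client machine that is neither the LB nor the backend, as advised above). httperf's fixed `--rate` is what lets you watch latency under a steady load rather than a best-effort blast:

```shell
# Hypothetical benchmark command lines (host and request counts are examples).
LB=http://192.0.2.10/            # assumed load-balancer frontend

# ab: 20000 requests, 200 concurrent, keep-alive enabled
cmd_ab="ab -k -n 20000 -c 200 $LB"

# httperf: 20000 connections at a fixed 500 conns/s, 5 s reply timeout
cmd_httperf="httperf --server 192.0.2.10 --port 80 --uri / --num-conns 20000 --rate 500 --timeout 5"

echo "$cmd_ab"
echo "$cmd_httperf"
```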
Re: Problems with load balancing on cloud servers
Hi,

On Tue, Sep 13, 2011 at 11:02:26AM +0800, Liong Kok Foo wrote:
> Top for server 70 (load problem):
>
> top - 10:51:23 up 32 days, 22:21, 1 user, load average: 3.09, 2.99, 2.50
> Tasks: 115 total, 3 running, 112 sleeping, 0 stopped, 0 zombie
> Cpu(s): 38.5%us, 11.0%sy, 0.0%ni, 48.2%id, 0.0%wa, 0.0%hi, 2.3%si, 0.0%st
> Mem: 2050000k total, 1049708k used, 1000292k free, 264264k buffers
> Swap: 1052248k total, 876k used, 1051372k free, 418272k cached
>
> Sometimes server B's load will shoot up to 20 or more while server A (and the rest) remain at around 5. Would really appreciate any input on this matter.

When you look at the stats, you notice that there is a much higher retransmit count than for the other servers. This almost always translates to connectivity issues. And if there are connectivity issues, then the server has more difficulty pushing responses out to the clients and accumulates more concurrent processes than the other ones, leading to higher load and memory usage. You should run tests between this server and another one: transfer a large file (500 MB) several times. You should reach gigabit speed (118 MB/s). Do this in both directions. Often you'll notice that one direction is approximately OK while the other one is terrible. Do this with other reference servers that work well, and if you find that communications with this server are the only ones affected, ask the provider to replace it. Sometimes it's just a cable issue; sometimes it's a switch port, sometimes a NIC. But those issues are quite common in datacenters. You can also look at the network statistics:

$ netstat -s | grep retrans

I suspect that you'll notice more retransmits on this one than on the other servers. Be careful: those stats are counted since the last boot, so you have to take uptime into account. Regards, Willy
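To make the comparison concrete, here is a small sketch that pulls the TCP retransmit counter out of `netstat -s` output. A saved sample is inlined so the parsing is reproducible; on a live server you would pipe `netstat -s` directly into the awk command. (Note the single-t "retransmited" spelling, which is how the classic net-tools output prints it.)

```shell
# Extract the retransmit counter from a `netstat -s`-style capture.
sample='    12345 segments send out
    678 segments retransmited
    3 bad segments received'
retrans=$(printf '%s\n' "$sample" | awk '/segments retransmited/ {print $1}')
echo "retransmitted segments: $retrans"
# On a real host:  netstat -s | awk '/segments retransmited/ {print $1}'
# Divide by uptime (e.g. from /proc/uptime) before comparing servers,
# since the counters start at zero on every boot.
```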
http_req_first
Can you provide some valid examples of using http_req_first?

    acl aclX http_req_first
    use_backend beX if aclX

http_req_first does not seem to work for me in 1.4.17. Thanks.
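Not an authoritative answer, but one thing worth double-checking against the 1.4 documentation: the ACL keyword there is spelled http_first_req (true for the first request of a connection), not http_req_first, which would explain the config not matching anything. A sketch with hypothetical frontend/backend names:

```
frontend fe_web
    bind :80
    # "http_first_req" is the spelling to verify in the 1.4 docs;
    # backend names below are placeholders.
    acl first_req http_first_req
    use_backend be_first if first_req
    default_backend be_other
```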
Establishing connection lasts long
Hi there, we're using haproxy 1.4.15 on an Ubuntu 10.04 box. This box is virtualised; HW specs: 1 CPU core (Xeon 2.00GHz), 512MB RAM, 2x 1GBit virtual LAN (these are also two different physical NICs in the HV). Now we've got the problem that the initial connect through haproxy seems to be delayed. The HTTP servers behind haproxy are physical ones, and they seem to deliver the page quite a lot faster when accessed directly than through haproxy. Any ideas or recommendations for checking that haproxy is not the source of the delay? Regards, Tim
--
Tim Korves, Administrator, whTec, Teutoburger Straße 309, D-46119 Oberhausen
Fon: +49 (40) 70 97 50 35 -0, Fax: +49 (40) 70 97 50 35 -99, SIP: t.kor...@fon.whtec.net
Service: serv...@whtec.net, Accounting (Buchhaltung): buchhalt...@whtec.net, DNS: d...@whtec.net
NOTE: requests from BOS via b...@whtec.net; requests from NGOs (e.V., gGmbH etc.) via n...@whtec.net
Re: Establishing connection lasts long
Hi, I noticed the same thing; the problem happens on the first call of the page. After that, the result is immediate. Christophe

On 13/09/11 13:22, Tim Korves <t...@whtec.net> wrote: [...]
Re: Establishing connection lasts long
Hi again,
> I noticed the same thing; the problem happens on the first call of the page.
OK, seems to be a bug? Or what do you think?
> After that, the result is immediate.
I can confirm that. Any idea? Thanks, Tim
[...]
Re: Establishing connection lasts long
Hi, I don't know! It's very strange. When I check the server load, it is almost zero. Christophe

On 13/09/11 13:29, Tim Korves <t...@whtec.net> wrote: [...]
Re: Establishing connection lasts long
Hi,
> It's very strange. When I check the server load, it is almost zero.
Same here... Has anyone got information about such an issue? Regards, Tim
[...]
haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi, I am not a haproxy expert, but I have been using it in production for some time with excellent results, and I wonder if I can seek some expert advice on running the fairly fast application server http://www.tornadoweb.org/ behind HAProxy (haproxy-1.3.23 using the EPEL RPM (-1) on RHEL6 x86_64). HAProxy is working very well for me, but I'm looking for help understanding how I can diagnose problems with the Tornado application I have running behind it. I have ~8 Tornado processes running on two servers. It's important that one is active and the other is failover (some state is stored in memory). The parts of my haproxy configuration relevant to my question are below. I notice a large number of entries in the logs like this:

502 errors:
Sep 13 12:42:45 localhost haproxy[15128]: 188.222.50.208:61001 [13/Sep/2011:12:42:43.881] main python_8001/python_8001_fe1 10/0/0/-1/1527 502 204 - - SH-- 6676/6676/2082/2082/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"
Sep 13 12:42:45 localhost haproxy[15128]: 81.246.46.162:29456 [13/Sep/2011:12:42:14.289] main python_8001/python_8001_fe1 28/0/0/-1/31118 502 204 - - SH-- 6675/6675/2081/2081/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"

504 errors:
Sep 13 12:43:08 localhost haproxy[15128]: 180.234.122.248:52888 [13/Sep/2011:12:38:08.822] main python_9004/python_9004_fe1 45/0/0/-1/300045 504 194 - - sH-- 6607/6607/697/697/0 0/0 "POST /xxx/chat/message/4/updates HTTP/1.1"
Sep 13 12:43:09 localhost haproxy[15128]: 82.26.136.198:61758 [13/Sep/2011:12:38:09.071] main python_8001/python_8001_fe1 19/0/0/-1/300020 504 194 - - sH-- 6569/6569/2085/2085/0 0/0 "POST /xxx/chat/status/updates HTTP/1.1"

It seems to me that all of these involve 0 seconds waiting in a queue, 0 seconds to make a connection to the final app server, and then an aborted connection to the app server before a complete response could be received.

The total time in milliseconds between accept and last close seems to be ~300 seconds for most of the requests (although far from all of them, as the first entry shows). If I *restart* (not reload) haproxy, I still get lines with the fifth of these numbers (Tt in the docs) at ~300,000 (the timeout server value in the config at the moment I copied the logs above), a few seconds after the haproxy process starts. I also get lots that seem to end at almost exactly 300k even when I change both timeout client and timeout server to very different numbers. It is possible that the application (jQuery) has a 300s timeout hardcoded, but in any case I do not understand why the haproxy logs show connections failing at 300k when I stop haproxy, increase timeout server by an order of magnitude, and start it again.

Looking at the next part of the log entries, it seems that bytes_read is always 204 for 502 errors and 194 for 504 errors. This does seem to be a fairly regular pattern:

[root@frontend2 log]# cat /var/log/haproxy.log | grep '504 194' | wc -l
1975
[root@frontend2 log]# cat /var/log/haproxy.log | grep '502 204' | wc -l
12401
[root@frontend2 log]# cat /var/log/haproxy.log | wc -l
18721

My second question is how do I find out (easily) exactly what is being returned, i.e. what are those 194/204 bytes? This might give me a hint as to what is going wrong or timing out on the application server. I guess I could try tcpdump, but I might struggle to filter down to the correct data (there are large numbers of successful connections going on).

The next part of the logs is most interesting; ignoring the two cookie fields, we see that the server-side timeout expired for the 504 errors, and that for the 502s the TCP session was unexpectedly aborted by the server, or the server explicitly refused it. Subject to my questions above, I have a theory that the 504s are caused by the long-polling application, but I do not understand why in the case of the 502 haproxy is not retrying the TCP connection before returning a 502; I thought that "option redispatch" and "retries 10" would ensure another go.

If anybody is able to shed some thoughts on my two questions I would be very grateful! Many thanks, Alex

# haproxy.conf
global
    log 127.0.0.1 local4 debug
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 5
    user haproxy
    group haproxy
    daemon

defaults
    mode http
    option httplog
    #option tcplog
    option dontlognull
    option dontlog-normal
    log global
    retries 10
    maxconn 5
    timeout connect 20
    contimeout 20
    clitimeout 9
    option forwardfor except 127.0.0.1/32 # Apache running https on localhost
    option httpclose # Required for REMOTE HEADER
    option redispatch
    timeout connect 1
    timeout client 30
    timeout server 30

frontend main *:80
    acl url_py_8001 path_beg -i /url1
    acl url_py_8002 path_beg -i
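As an aside on the counting question above, the grep patterns need quoting (an unquoted `grep 504 194` treats `194` as a file name). A small awk sketch that tallies responses by status code instead, using field position: in the HTTP log lines shown, the status is field 11 of the `haproxy[pid]: ip:port [date] frontend backend/server Tq/Tw/Tc/Tr/Tt status ...` layout. Sample lines stand in for /var/log/haproxy.log so the sketch is reproducible:

```shell
# Count haproxy log lines per HTTP status code (field 11).
log='Sep 13 12:42:45 localhost haproxy[15128]: 188.222.50.208:61001 [13/Sep/2011:12:42:43.881] main python_8001/python_8001_fe1 10/0/0/-1/1527 502 204 - - SH--
Sep 13 12:43:08 localhost haproxy[15128]: 180.234.122.248:52888 [13/Sep/2011:12:38:08.822] main python_9004/python_9004_fe1 45/0/0/-1/300045 504 194 - - sH--'
printf '%s\n' "$log" | awk '{count[$11]++} END {for (s in count) print s, count[s]}' | sort
# On a real host:  awk '{count[$11]++} END {for (s in count) print s, count[s]}' /var/log/haproxy.log | sort
```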
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi Alex, sorry, I won't have time to help you now, but...

On Tuesday 13 September 2011 14:26:04, Alex Davies wrote:
> The total time in milliseconds between accept and last close seems to be ~300 seconds for most of the requests [...] I also get lots that seem to end at almost exactly 300k even when I change both timeout client and timeout server to very different numbers.

Have you noticed that your configuration declares the same timeouts several times? The configuration mixes deprecated syntax and new keywords (clitimeout vs timeout client, contimeout vs timeout connect, srvtimeout vs timeout server). If you tried to modify the srvtimeout and clitimeout values, that can explain why you still see those 300s timeouts.

# haproxy.conf
defaults
    timeout connect 20
    contimeout 20
    clitimeout 9
    option forwardfor except 127.0.0.1/32 # Apache running https on localhost
    option httpclose # Required for REMOTE HEADER
    option redispatch
    timeout connect 1
    timeout client 30
    timeout server 30

The values declared last will apply:

    timeout connect 1
    timeout client 30
    timeout server 30

Maybe this can help you with the next steps ;-) -- Cyril Bonté
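For reference, a deduplicated version of that defaults section using only the modern keywords, which avoids the old/new aliasing trap entirely (values copied verbatim from the message above; their digits may have been truncated in this archive, so treat them as placeholders):

```
defaults
    timeout connect 1
    timeout client  30
    timeout server  30
    option forwardfor except 127.0.0.1/32
    option httpclose
    option redispatch
```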
Re: Establishing connection lasts long
Heh, this has nothing to do with haproxy, but rather with how your hypervisor manages VMs that aren't doing anything :) Cheers

On Tue, Sep 13, 2011 at 1:35 PM, Tim Korves <t...@whtec.net> wrote: [...]
Re: Establishing connection lasts long
Hi there,
> This has nothing to do with haproxy, but rather with how your hypervisor manages VMs that aren't doing anything :)
Thanks for the information. Do you have any tips regarding VMware ESXi 4.1? Best wishes, Tim
[...]
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi, thank you for your observation. Indeed, I did notice some of those as I was writing my email. I have updated my defaults to increase the server timeout (as we are doing long polling), reduce the others, and remove the duplicates:

defaults
    mode http
    option httplog
    #option tcplog
    option dontlognull
    option dontlog-normal
    log global
    retries 10
    maxconn 5
    option forwardfor except 127.0.0.1/32 # Apache on https://127.0.0.1
    option httpclose # Required for REMOTE HEADER
    option redispatch
    timeout connect 1
    timeout client 1
    timeout server 720

I still notice the same errors in the logs! (Slightly fewer 504s, as I would expect from the increase in timeout server, but I still don't understand why I get any at all in the first minute of a new process.) Cheers, Alex

On Tue, Sep 13, 2011 at 1:46 PM, Cyril Bonté <cyril.bo...@free.fr> wrote: [...]
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi again Alex,

On Tuesday 13 September 2011 13:26:04, Alex Davies wrote:
> [...] The total time in milliseconds between accept and last close seems to be ~300 seconds for most of the requests (although far from all of them, as the first entry shows).

I wonder if you've not reached a limit on your Tornado servers, for example the max number of open files. Do the servers go down when it happens (given that the check keyword only performs a layer 4 check in your configuration)?

> If I *restart* (not reload) haproxy, I still get lines with the fifth of these numbers (Tt in the docs) at ~300,000 (the timeout server value in the config at the moment I copied the logs above), a few seconds after the haproxy process starts.

What you describe makes me think of an asynchronous syslog configuration. That could explain why you see logs after the restart. When it happens, can you verify whether the pid logged is really the pid of the new instance, or whether it's the old one? In your example, your instance has pid 15128.

> My second question is how do I find out (easily) exactly what is being returned, i.e. what are those 194/204 bytes? This might give me a hint as to what is going wrong or timing out on the application server. I guess I could try tcpdump but I might struggle to filter down the correct data (there are large numbers of successful connections going on).

Those sizes correspond respectively to the 504 and 502 responses sent by haproxy:

[HTTP_ERR_502] =
    "HTTP/1.0 502 Bad Gateway\r\n"
    "Cache-Control: no-cache\r\n"
    "Connection: close\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body><h1>502 Bad Gateway</h1>\nThe server returned an invalid or incomplete response.\n</body></html>\n",

[HTTP_ERR_504] =
    "HTTP/1.0 504 Gateway Time-out\r\n"
    "Cache-Control: no-cache\r\n"
    "Connection: close\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n",

(You can find them in src/proto_http.c.)
Re: [Proposal] Concurrency tuning by adding a limit to http-server-close
Hi Willy, a small update on this development.

On Monday 29 August 2011 18:01:23, Willy Tarreau wrote:
> > > If you're interested in doing this, I'd be glad to merge it and to provide help if needed. We need a struct list fe_idle in the struct proxy and add/remove idle connections there.
> > Of course I'm interested. I can't promise I'll be available for it for the next days, but I can start it shortly.
> Nice, thank you! Don't forget to take a rest, you're on holidays ;-)

First, I didn't forget to take a rest ;-) More seriously, I've got a first version that seems to work quite well. I couldn't reach maxconn keep-alive connections, only maxconn - 1, due to the way haproxy pauses the listeners when they are full or when the proxy is. I still have to optimize some paths added to resume the listeners when a connection goes back to the idle list. This matters more for proxies that have lots of listeners. I don't know such configurations, but maybe you've already met some ;-) During a test today, I had 2 minutes of panic due to an unexplained segfault, but gdb quickly reminded me that I had recompiled the sources with DEBUG_DEV enabled! Except for that, it never crashed. -- Cyril Bonté
RE: SNMP Perl script with Centos 6.0
Not a problem, Willy. I should also note in that case that the initial error

    error on subcontainer 'ia_addr' insert (-1)

is due to the fact that I am using keepalived and I have an IP address without an interface. I believe it's a bug in the SNMP libraries for CentOS. Cheers, Simon :)

From: Willy Tarreau [w...@1wt.eu]
Sent: Tuesday, September 13, 2011 4:27 PM
To: Dwyer, Simon
Cc: haproxy@formilux.org
Subject: Re: SNMP Perl script with Centos 6.0
[...]
Re: [Proposal] Concurrency tuning by adding a limit to http-server-close
Hi Cyril,

On Tue, Sep 13, 2011 at 10:13:14PM +0200, Cyril Bonté wrote:
> More seriously, I've got a first version that seems to work quite well. I couldn't reach maxconn keep-alive connections, only maxconn - 1, due to the way haproxy pauses the listeners when they are full or when the proxy is.

I don't know if you have checked 1.5-dev7; there's a function there to wake up the listeners that are waiting for maxconn to be OK again. It's already used to apply maxconn without burning CPU cycles. And I don't see anything there that would prevent you from using up to maxconn connections; in fact, I even ran some tests with maxconn 1 to check that it worked fine :-)

> I still have to optimize some paths added to resume the listeners when a connection goes back to the idle list. This matters more for proxies that have lots of listeners. I don't know such configurations, but maybe you've already met some ;-)

Check in session.c:process_session, you have this:

    if (s->listener->state == LI_FULL)
        resume_listener(s->listener);

I think you can make use of this for your code. Since listeners are individually full or not full, you don't need to scan the whole list of listeners anymore. BTW, the worst config I have ever seen was someone binding to a *large* port range. That means tens of thousands of listeners...

> During a test today, I had 2 minutes of panic due to an unexplained segfault, but gdb quickly reminded me that I had recompiled the sources with DEBUG_DEV enabled!

I think I remember a recent patch from Simon fixing some breakage in DEBUG_DEV, so 1.5-dev7 might be OK. But I've not used DEBUG_DEV for a long time now and I don't know what shape it's in.

> Except for that, it never crashed.

Fine! Cheers, Willy
Re: haproxy / Python 'tornado' framework - digging into 502/504 errors
Hi Alex,

On Tue, Sep 13, 2011 at 03:18:54PM +0100, Alex Davies wrote:
> Thank you for your observation. Indeed, I did notice some of those as I was writing my email. I have updated my defaults to increase the server timeout (as we are doing long polling), reduce the others, and remove the duplicates: [...] I still notice the same errors in the logs! (Slightly fewer 504s, as I would expect from the increase in timeout server, but I still don't understand why I get any at all in the first minute of a new process.)

To complete Cyril's detailed analysis, I'd like to add that you'll only see 502s when you restart, and it will take some time before you see 504s again (e.g. 2 hours with the config above). The 502s mean that the server suddenly aborted the connection (flags SH), while the 504s indicate that haproxy got fed up with waiting and closed after timeout server elapsed. So yes, it's very possible that your server has its own timeout, but it should be around 30s from what I saw in your logs. It still does not explain why some requests never time out on the server; maybe they don't wake the same components up? Regards, Willy