Re: [ANNOUNCE] haproxy-1.5-dev16
On 26.12.2012, at 1:03, Willy Tarreau w...@1wt.eu wrote:

> This fix is still wrong, as it only accepts one add-header rule, so please use the other fix posted in this thread by seri0528 instead.

Thanks a lot! Works now.
Re: Help with cleaning up our error log output and request error counts
Hi Brendon,

On Wed, Dec 26, 2012 at 03:09:37PM -0500, Brendon Colby wrote:

> Greetings! (Apparently GMail IPs are now listed in SORBS, so when I first sent this through the Gmail web interface I got a bounce; I had to use Thunderbird and IMAP.)
>
> We just replaced our old commercial load balancer (end of life) with haproxy running on an HP DL360p G8 server with a quad-core Xeon E5-2609 2.4GHz processor (soon to be two for redundancy). Our site is mainly media: SWF files of movies and games, mp4 movies, audio, art, etc. We are serving approximately 2,330 req/s right now, for example, at over 1Gbps of outgoing traffic (we average 600Mbps to 1Gbps+ daily). Everything has been running great for a couple of weeks, and now we're just tidying up a few things like logging and these request error counts we've been seeing. We have multiple sub-domains: one hosting css, one hosting images, etc.
>
> My goals have been to:
>
> 1. Figure out what's causing the request errors and see if we can tweak something on the server side to stop them
> 2. Reduce the error log output to a manageable level, so we can keep it enabled all the time and monitor it
>
> We haven't been able to keep full error logging enabled due to the volume of errors being logged (200-300 per second). When we first enabled error logging it was flooded with entries like these:
>
>   Dec 21 21:00:01 localhost haproxy[16034]: x.x.x.x:50714 [21/Dec/2012:20:59:56.140] http-in http-in/NOSRV -1/-1/-1/-1/12343 400 212 - - CR-- 1913/1903/0/0/0 0/0 BADREQ
>
> This isn't the exact error, but close: a CR error with a seemingly random timeout. "show errors" reported nothing of value. Here's what our defaults looked like:
>
>   defaults
>       backlog 1
>       mode http
>       # This option helps provide better statistics / graphs.
>       option contstats
>       option splice-auto
>       timeout client 30s
>       timeout connect 10s
>       timeout server 30s
>
> I did some research and it seemed to me that these errors were simply the browser closing connections.
> For example, when I opened our site with Chrome, after about 12 seconds I would see several CR errors in the logs from our office IP. I figured these must be keep-alive connections (or something like that) that were just timing out on the client side (hence the CR, or client-side, error). I put in a "timeout http-request" of five seconds, and haproxy then started logging cR errors with a 5000ms timeout value (essentially the same log values as above, with cR instead of CR). What this told me was that instead of the client disconnecting (CR), haproxy was now proactively disconnecting (cR) and throwing a 408 error, which made sense. Right now "timeout http-request" is set to 15 seconds and we're still seeing hundreds of errors per second.
>
> Next, I read through almost the entire haproxy manual (very good docs!) and found this section under "timeout http-keep-alive":
>
>   There is also another difference between the two timeouts: when a connection expires during timeout http-keep-alive, no error is returned, the connection just closes. If the connection expires in http-request while waiting for a connection to complete, a HTTP 408 error is returned.
>
> I thought for sure using "timeout http-keep-alive 1" would stop the cR/408 errors, but it didn't. "option dontlognull" stops them from being logged, but I see we're still getting 200-300 request errors per second on the frontend. The docs say not to use option dontlognull because it could mask attacks and such. I agree with this and don't want to leave it in. What's puzzling is that these cR/408 errors seem to be coming from regular site traffic and browser behavior, not an attack. Unless I'm mistaken, the way we have haproxy configured right now, we shouldn't be seeing these cR/408 errors. I will post the relevant pieces of our configuration below. Please let me know if I'm missing something here, because at this point I'm stuck!

Thank you for the very well detailed analysis.
I believe that some browsers nowadays tend to proactively establish connections to visited websites, just in case they will need them later. Since Chrome does everything it can to reduce page load time, it very likely is something it's doing. This could explain exactly what you're observing: a new connection over which nothing is transferred, and which is closed when the user leaves the site (hence the random delay).

If you can reproduce the behaviour with your browser, I think that dontlognull will be your only solution, and that we'll have to update the doc to indicate that browsers have adopted such an internet-unfriendly behaviour that it's better to leave the option on.

What I don't like with proactively opened connections is that they're killing servers with 10-100 times the load they would otherwise have to sustain, and that even small sites might experience issues with this. If you see 200 of them per second and they last 5s on average, it means you're constantly holding 5000 idle connections just because of this. Many web servers can't handle this :-/
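For reference, the timeouts Brendon experimented with can be sketched as a minimal config fragment (values are the ones mentioned in the thread; section and option names follow the haproxy 1.5 configuration manual, but this is an illustrative sketch, not his actual config):

```haproxy
defaults
    mode http
    timeout client       30s
    timeout connect      10s
    timeout server       30s
    # Connections that never send a complete request are closed after
    # this delay; haproxy returns 408 and logs the session with cR flags.
    timeout http-request 15s
    # Idle keep-alive connections expiring under this timeout are closed
    # silently, with no error returned.
    timeout http-keep-alive 1s
```

Per the thread, shortening these timeouts reduces idle connection counts but does not stop the errors from being counted; only "option dontlognull" suppresses the log entries.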
Re: Help with cleaning up our error log output and request error counts
We have the same exact problem, only that dontlognull is not working for us either, for some reason (we have 1-byte requests containing a single NULL byte; maybe something caused by a firewall or other device tunneling the traffic to our LB). We also concluded this is something the browsers are doing automatically (especially when option http-server-close is in use). If you find any other solution not involving the dontlognull option, we would love to hear about it.

On Wed, Dec 26, 2012 at 11:38 PM, Willy Tarreau w...@1wt.eu wrote:

[quoted message trimmed]
Re: Help with cleaning up our error log output and request error counts
On Dec 26, 2012, at 4:38 PM, Willy Tarreau w...@1wt.eu wrote:

> Hi Brendon, Thank you for the very well detailed analysis. I believe that some browsers nowadays tend to proactively establish connections to visited websites, just in case they will need them later. Since Chrome does everything it can to reduce page load time, it very likely is something it's doing. This could explain exactly what you're observing: a new connection over which nothing is transferred that is closed when the user leaves the site (hence the random delay).

I was thinking that this is just standard browser behavior too. IE also does this - it just seems to open fewer connections. This is why I was confused and thought I was missing something, because it seems like normal browser behavior even though the docs indicated that it should only happen during an attack or some other anomalous event.

> If you can reproduce the behaviour with your browser, I think that dontlognull will be your only solution and that we'll have to update the doc to indicate that browsers have adopted such an internet-unfriendly behaviour that it's better to leave the option on. What I don't like with proactively opened connections is that they're killing servers with 10-100 times the load they would have to sustain and that even small sites might experience issues with this. If you see 200 of them per second and they last 5s on average, it means you're constantly having 5000 idle connections just because of this. Many web servers can't handle this :-/

I can see smaller sites having a hard time with this! Before we added "timeout http-request" we were seeing over 22K established connections to the haproxy server. That's why we have such a high maxconn on our frontend - we hit several limits once we went live (which reminds me to lower it) and had to keep increasing it. Once we added the timeout, though, established connections plummeted to (as of now, for example) about 5K.
The total TCP connections did NOT go down, however, because now most of them are in TIME_WAIT (92K!). The only thing this seems to affect is our monitoring system, which uses netstat to get TCP stats. It sometimes takes almost a minute to run and uses 100% CPU, but otherwise doesn't seem to affect anything, so we've left it for now.

So if connections are terminated because of timeouts that I have explicitly set, is there any reason to log and count that as a request error? That to me seems like something that could be logged as info for troubleshooting and not counted as an error at all. Just a thought - it's nothing major.

With dontlognull we still see this type of error at a rate of several per second:

  Dec 26 18:38:17 localhost haproxy[32259]: n.n.n.n:33685 [26/Dec/2012:18:38:17.155] http-in webservers-ngfiles/ngwebNN 10/0/0/1/649 200 31671 - - CD-- 4166/4159/139/31/0 0/0 GET /images/14/o-eternal-o_the-vampire-dragon.jpg HTTP/1.1

This is logged when, for example, I am on the site watching a movie and I close the browser. To me, this is another event that could be logged at the info level, but I figure there are probably good reasons why it's logged and counted as an error. It is probably just the perfectionist in me that wants the error rate to be nearly 0. :)

> BTW, your config is really clean, I have nothing to suggest. I wouldn't be surprised if some people reuse it to build their own configs :-)

Thanks! My co-worker is responsible for most of the config, so I will be sure to pass this along to him!

> Best regards, Willy
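The CR--, cR--, and CD-- codes the thread keeps referring to are haproxy's four-character termination state: the first character records what ended the session, the second records which phase it was in. A small sketch of pulling it out of the thread's sample log line (the dictionaries cover only the flags seen in this thread, not the full table from the docs):

```python
import re

# Sample HTTP log line from this thread (fields anonymized as in the post).
line = ("Dec 26 18:38:17 localhost haproxy[32259]: n.n.n.n:33685 "
        "[26/Dec/2012:18:38:17.155] http-in webservers-ngfiles/ngwebNN "
        "10/0/0/1/649 200 31671 - - CD-- 4166/4159/139/31/0 0/0 "
        "GET /images/14/o-eternal-o_the-vampire-dragon.jpg HTTP/1.1")

# The termination state is the 4-character token following the two "-"
# (captured request/response cookie) fields.
flags = re.search(r" - - (\S{4}) ", line).group(1)

# First character: what caused the session to end.
cause = {"C": "client aborted the connection",
         "c": "client-side timeout expired",
         "-": "normal termination"}.get(flags[0], "other")

# Second character: session state at disconnection.
state = {"R": "waiting for a complete request",
         "D": "during data transfer",
         "-": "normal termination"}.get(flags[1], "other")

print(f"{flags}: {cause}, {state}")
```

For this line it reports a CD-- session, i.e. the client aborted mid-transfer, which matches Brendon's "close the browser while watching a movie" scenario; the earlier CR-- and cR-- entries decode as client abort and client timeout while waiting for a request.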
Re: Help with cleaning up our error log output and request error counts
On Dec 26, 2012, at 5:57 PM, SBD sbd@gmail.com wrote:

> We have the same exact problem, only that dontlognull is not working for us either, for some reason (we have 1-byte requests containing a single NULL byte; maybe something caused by a firewall or other device tunneling the traffic to our LB). We also concluded this is something the browsers are doing automatically (especially when option http-server-close is in use). If you find any other solution not involving the dontlognull option, we would love to hear about it.

I remember seeing your e-mail in the archives. Did you get this from sending "show errors" to the socket?

  frontend ** (#1): invalid request
    src **, session #3468, backend NONE (#-1), server NONE (#-1)
    HTTP internal state 26, buffer flags 0x00909002, event #0
    request length 1 bytes, error at position 0:
    0  \x00

I don't see this type of error when I run "show errors", so I'm curious why you're seeing these and I'm not.

Brendon
Re: Help with cleaning up our error log output and request error counts
Yes, though sometimes I get it and sometimes I don't. As I said, this probably has to do with some other device (hopefully). Another interesting thing is that we didn't always have those kinds of requests. They started soon after we changed the configuration from a single "listen" section to the frontend/backend config style (we have not tried switching it back, though).

On Wed, Dec 26, 2012 at 4:30 PM, Brendon Colby bren...@newgrounds.com wrote:

[quoted message trimmed]
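For readers unfamiliar with the two config styles SBD mentions, a minimal sketch (section names and addresses are hypothetical; functionally both route the same traffic):

```haproxy
# Older single-section style: bind, mode and servers in one "listen" block.
listen www
    bind :80
    mode http
    server web1 10.0.0.1:80 check

# Equivalent split style: a "frontend" accepts traffic and hands it to a
# "backend" that holds the servers.
frontend http-in
    bind :80
    mode http
    default_backend webservers

backend webservers
    mode http
    server web1 10.0.0.1:80 check
```

The split style is just a refactoring; it should not by itself change what bytes arrive on the socket, which is why the timing correlation SBD observed is surprising.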
Re: Help with cleaning up our error log output and request error counts
On Wed, Dec 26, 2012 at 07:03:02PM -0500, Brendon Colby wrote:

> I was thinking that this is just standard browser behavior too. IE also does this - it just seems to open fewer connections. This is why I was confused and thought I was missing something, because it seems like normal browser behavior even though the docs indicated that it should only happen during an attack or some other anomalous event.

This stupid behaviour is something very recent. I did not know that IE was doing this too, but after all I'm not surprised, with the browser war...

>> If you can reproduce the behaviour with your browser, I think that dontlognull will be your only solution and that we'll have to update the doc to indicate that browsers have adopted such an internet-unfriendly behaviour that it's better to leave the option on. What I don't like with proactively opened connections is that they're killing servers with 10-100 times the load they would have to sustain and that even small sites might experience issues with this. If you see 200 of them per second and they last 5s on average, it means you're constantly having 5000 idle connections just because of this. Many web servers can't handle this :-/

> I can see smaller sites having a hard time with this! Before we added "timeout http-request" we were seeing over 22K established connections to the haproxy server. That's why we have such a high maxconn on our frontend - we hit several limits once we went live (which reminds me to lower it) and had to keep increasing it.

That's really disgusting. Products such as haproxy or nginx can easily deal with that many concurrent connections, but many other legacy servers cannot. Would you mind reporting this to the HTTP working group at the IETF? The HTTP/1.1 spec is currently being refined and is almost done, but we can still add information there. Good and bad practices can be updated with this experience. The address is ietf-http...@w3.org.
> Once we added the timeout, though, established connections plummeted to (as of now, for example) about 5K. The total TCP connections did NOT go down, however, because now most of them are in TIME_WAIT (92K!).

TIME_WAIT sockets are harmless on the server side. You can easily reach millions without any issues.

> The only thing this seems to affect is our monitoring system which uses netstat to get TCP stats. It sometimes takes almost a minute to run and uses 100% CPU, but otherwise doesn't seem to affect anything so we've left it for now.

There are two commands that you must absolutely never use in a monitoring system:

  - netstat -a
  - ipcs -a

Both of them will saturate the system and considerably slow it down when something starts to go wrong. For the sockets you should use what's in /proc/net/sockstat; you have all the numbers you want there. If you need more details, use "ss -a" instead of "netstat -a": it uses the netlink interface and is several orders of magnitude faster.

> So if connections are terminated because of timeouts that I have explicitly set, is there any reason to log and count that as a request error? That to me seems like something that could be logged as info for troubleshooting and not counted as an error at all. Just a thought - it's nothing major.

This is a real error. It's not because some browsers decided to do stupid things that it's not an error. Attacks aside, the most common reason for not getting a request is that it was blocked by too short an MTU in some VPNs. It's very important to know that a browser could not send a POST or a request with a large cookie due to a short MTU somewhere.
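Willy's monitoring advice in sketch form: /proc/net/sockstat exposes the counters as plain key/value pairs on one line per protocol, so no socket enumeration is needed. The sample text below mimics the file's format with counts like the ones Brendon reported (a real monitor would read the file itself; the helper name is made up for illustration):

```python
# Sample /proc/net/sockstat-style content (figures modeled on the thread:
# ~5K established, ~92K TIME_WAIT).
sample = """\
sockets: used 98000
TCP: inuse 5000 orphan 120 tw 92000 alloc 5200 mem 4100
UDP: inuse 12 mem 3
"""

def tcp_counters(text):
    """Return the key/value pairs of the 'TCP:' line as a dict of ints."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]  # ["inuse", "5000", "orphan", ...]
            return dict(zip(fields[::2], map(int, fields[1::2])))
    return {}

stats = tcp_counters(sample)
print("established:", stats["inuse"], "time-wait:", stats["tw"])
```

In production the text would come from `open("/proc/net/sockstat").read()`; this is O(1) in the number of sockets, unlike `netstat -a`, which walks and resolves every socket.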
> With dontlognull we still see this type of error at a rate of several per second:
>
>   Dec 26 18:38:17 localhost haproxy[32259]: n.n.n.n:33685 [26/Dec/2012:18:38:17.155] http-in webservers-ngfiles/ngwebNN 10/0/0/1/649 200 31671 - - CD-- 4166/4159/139/31/0 0/0 GET /images/14/o-eternal-o_the-vampire-dragon.jpg HTTP/1.1

This one happens more frequently and is also an error, as the transfer was not complete as seen from haproxy.

> This is logged when, for example, I am on the site watching a movie, and I close the browser. To me, this is another event that could be logged at the info level, but I figure there are probably good reasons why it's logged and counted as an error. It is probably just the perfectionist in me that wants the error rate to be nearly 0. :)

In fact, if you want to bring the error rate to zero by hiding all errors, you're cheating :-) It seems to me that you'd like not to log errors for what happens on the client side, am I wrong? This could probably make sense in some environments, and we could probably think about adding an option to do that.

Regards,
Willy