Re: [ANNOUNCE] haproxy-1.5-dev16

2012-12-26 Thread Dmitry Sivachenko

On 26.12.2012, at 1:03, Willy Tarreau w...@1wt.eu wrote:
 
 
 This fix is still wrong, as it only accepts one add-header rule, so
 please use the other fix posted in this thread by seri0528 instead.
 


Thanks a lot! Works now.




Re: Help with cleaning up our error log output and request error counts

2012-12-26 Thread Willy Tarreau
Hi Brendon,

On Wed, Dec 26, 2012 at 03:09:37PM -0500, Brendon Colby wrote:
 Greetings!
 
 (Apparently GMail IPs are now listed in SORBS, so when I first sent this 
 through the Gmail web interface I got a bounce; I had to use Thunderbird 
 and IMAP.)
 
 We just replaced our old commercial load balancer (end of life) with 
 haproxy running on an HP DL360p G8 server with a quad-core Xeon E5-2609 
 2.4GHz processor (soon to be two for redundancy). Our site is mainly 
 media - SWF files of movies and games, mp4 movies, audio, art, etc. We 
 are serving approximately 2,330 req/s right now, for example, at over 
 1Gbps of outgoing traffic (we average 600Mbps to 1Gbps+ daily). 
 Everything has been running great for a couple of weeks, and now we're 
 just tidying up a few things like logging and these request error counts 
 we've been seeing. We have multiple sub-domains: one hosting css, one 
 hosting images, etc.
 
 My goals have been to:
 
 1. Figure out what's causing the request errors and see if we can tweak 
 something on the server side to stop them
 2. Reduce the error log output to a manageable level, so we can keep it 
 enabled all the time and monitor it
 
 We haven't been able to keep full error logging enabled due to the 
 volume of errors being logged (200-300 per second).
 
 When we first enabled error logging it was flooded with entries like these:
 
 Dec 21 21:00:01 localhost haproxy[16034]: x.x.x.x:50714 
 [21/Dec/2012:20:59:56.140] http-in http-in/NOSRV -1/-1/-1/-1/12343 400 
 212 - - CR-- 1913/1903/0/0/0 0/0 BADREQ
 
 This isn't the exact error but close - a CR error with a seemingly 
 random timeout. "show errors" reported nothing of value.
 
 Here's what our defaults looked like:
 
 defaults
 backlog 1
 mode http
 # This option helps provide better statistics / graphs.
 option contstats
 option splice-auto
 timeout client 30s
 timeout connect 10s
 timeout server 30s
 
 I did some research and it seemed to me that these errors were simply 
 the browser closing connections. For example, when I opened our site 
 with Chrome, after about 12 seconds I would see several CR errors in the 
 logs from our office IP. I figured these must be keep-alive connections 
 (or something like that) that were just timing out on the client side 
 (hence the CR or client side error).
 
 I put in a "timeout http-request" of five seconds and haproxy then 
 started logging cR errors with a 5000ms timeout value (the same log 
 values above essentially, with cR instead of CR). What this told me was 
 that instead of the client disconnecting (CR), haproxy was now 
 proactively disconnecting (cR) and returning a 408 error, which made 
 sense. Right now "timeout http-request" is set to 15 seconds and we're 
 still seeing hundreds of errors per second.
 
 Next, I read through almost the entire haproxy manual (very good docs!) 
 and found this section under "timeout http-keep-alive": "There is also 
 another difference between the two timeouts: when a connection expires 
 during "timeout http-keep-alive", no error is returned, the 
 connection just closes. If the connection expires in "http-request" 
 while waiting for a connection to complete, an HTTP 408 error is returned."
 
 I thought for sure using "timeout http-keep-alive 1" would stop the 
 cR/408 errors, but it didn't. "option dontlognull" stops them from being 
 logged, but I see we're still getting 200-300 request errors per second 
 on the frontend.
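
For reference, here is the combination described above, sketched with the
directive names from the haproxy manual (the values are the ones mentioned
in this thread, not a recommendation):

    defaults
        mode http
        option dontlognull           # don't log connections that carried no data
        timeout client 30s
        timeout connect 10s
        timeout server 30s
        timeout http-request 15s     # return a 408 if no complete request arrives within 15s
        timeout http-keep-alive 1    # close idle keep-alive connections after 1ms (default unit is ms)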
 
 The docs say not to use option dontlognull because it could mask 
 attacks and such. I agree with this and don't want to leave this in. 
 What's puzzling is that these cR/408 errors seem to be coming from 
 regular site traffic and browser behavior, not an attack.
 
 Unless I'm mistaken, the way we have haproxy configured right now, we 
 shouldn't be seeing these cR/408 errors. I will post the relevant pieces 
 of our configuration below. Please let me know if I'm missing something 
 here, because at this point I'm stuck!

Thank you for the very well detailed analysis. I believe that some browsers
nowadays tend to proactively establish connections to visited websites, just
in case they will need them later. Since Chrome does everything it can to
reduce page load time, this is very likely something it is doing. This could
explain exactly what you're observing: a new connection over which nothing
is transferred, and which is closed when the user leaves the site (hence the
random delay).

If you can reproduce the behaviour with your browser, I think that dontlognull
will be your only solution and that we'll have to update the doc to indicate
that browsers have adopted such an internet-unfriendly behaviour that it's
better to leave the option on. What I don't like with proactively opened
connections is that they're killing servers with 10-100 times the load they
would have to sustain and that even small sites might experience issues with
this. If you see 200 of them per second and they last 5s on average, it means
you're constantly having 5000 idle connections just because of this. Many
web servers can't handle this :-/
Re: Help with cleaning up our error log output and request error counts

2012-12-26 Thread SBD
We have the exact same problem, except that dontlognull is not working for us
either, for some reason (we get 1-byte requests containing a single byte -
NULL - maybe something introduced by a firewall or other device tunneling
the traffic to our LB).

We also concluded this is something the browsers are doing automatically
(especially when option http-server-close is in use).

If you find any other solution not including the dontlognull option we
would love to hear about it.


On Wed, Dec 26, 2012 at 11:38 PM, Willy Tarreau w...@1wt.eu wrote:

 [...]

Re: Help with cleaning up our error log output and request error counts

2012-12-26 Thread Brendon Colby

On Dec 26, 2012, at 4:38 PM, Willy Tarreau w...@1wt.eu wrote:

 Hi Brendon,
 
 
 Thank you for the very well detailed analysis. I believe that some browsers
 nowadays tend to proactively establish connections to visited websites, just
 in case they will need them later. Since Chrome does everything it can to
 reduce page load time, this is very likely something it is doing. This could
 explain exactly what you're observing: a new connection over which nothing
 is transferred, and which is closed when the user leaves the site (hence the
 random delay).
 

I was thinking that this is just standard browser behavior too. IE also does 
this - it just seems to open fewer connections. This is why I was confused and 
thought I was missing something, because it seems like normal browser behavior 
even though the docs indicated that it should only happen during an attack or 
some other anomalous event.

 If you can reproduce the behaviour with your browser, I think that dontlognull
 will be your only solution and that we'll have to update the doc to indicate
 that browsers have adopted such an internet-unfriendly behaviour that it's
 better to leave the option on. What I don't like with proactively opened
 connections is that they're killing servers with 10-100 times the load they
 would have to sustain and that even small sites might experience issues with
 this. If you see 200 of them per second and they last 5s on average, it means
 you're constantly having 5000 idle connections just because of this. Many
 web servers can't handle this :-/

I can see smaller sites having a hard time with this! Before we added "timeout 
http-request" we were seeing over 22K established connections to the haproxy 
server. That's why we have such a high maxconn on our frontend - we hit several 
limits once we went live and had to keep increasing it (which reminds me, I 
should lower it now).
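
As an illustration only (the actual frontend isn't shown in this excerpt),
that kind of tuning usually looks something like this - the bind line and
the maxconn value here are made up:

    frontend http-in
        bind :80
        maxconn 50000    # must also fit under the global maxconn and the OS limits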

Once we added the timeout though, established connections plummeted to (as of 
now, for example) about 5K. The total number of TCP connections did NOT go 
down, however, because most of them are now in TIME_WAIT (92K!). The only 
thing this seems to affect is our monitoring system, which uses netstat to get 
TCP stats. It sometimes takes almost a minute to run and uses 100% CPU, but 
otherwise doesn't seem to affect anything, so we've left it alone for now.

So if connections are terminated because of timeouts that I have explicitly 
set, is there any reason to log and count that as a request error? That to me 
seems like something that could be logged as info for troubleshooting and not 
counted as an error at all. Just a thought - it's nothing major.

With dontlognull we still see this type of error at a rate of several per 
second:

Dec 26 18:38:17 localhost haproxy[32259]: n.n.n.n:33685 
[26/Dec/2012:18:38:17.155] http-in webservers-ngfiles/ngwebNN 10/0/0/1/649 200 
31671 - - CD-- 4166/4159/139/31/0 0/0 GET 
/images/14/o-eternal-o_the-vampire-dragon.jpg HTTP/1.1
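
For readers following along: per the "session state at disconnection"
section of the haproxy docs, the two meaningful flags in "CD--" are:

    C    the TCP session was aborted by the client
    D    the session was in the DATA phase when it happened

so this line records a client dropping the connection mid-transfer.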

This is logged when, for example, I am on the site watching a movie, and I 
close the browser. To me, this is another event that could be logged at the 
info level, but I figure there are probably good reasons why it's logged and 
counted as an error. It is probably just the perfectionist in me that wants the 
error rate to be nearly 0. :)

 BTW, your config is really clean, I have nothing to suggest. I wouldn't
 be surprised if some people reuse it to build their own configs :-)

Thanks! My co-worker is responsible for most of the config, so I will be sure 
to pass this along to him!

 Best regards,
 Willy
 




Re: Help with cleaning up our error log output and request error counts

2012-12-26 Thread Brendon Colby
On Dec 26, 2012, at 5:57 PM, SBD sbd@gmail.com wrote:

 We have the exact same problem, except that dontlognull is not working for us 
 either, for some reason (we get 1-byte requests containing a single byte - 
 NULL - maybe something introduced by a firewall or other device tunneling the 
 traffic to our LB).
 
 We also concluded this is something the browsers are doing automatically 
 (especially when option http-server-close is in use).
 
 If you find any other solution not involving the dontlognull option we would 
 love to hear about it.

I remember seeing your e-mail in the archives. Did you get this from sending 
"show errors" to the socket?

  frontend ** (#1): invalid request
    src **, session #3468, backend NONE (#-1), server NONE (#-1)
    HTTP internal state 26, buffer flags 0x00909002, event #0
    request length 1 bytes, error at position 0:

    0  \x00

I don't see this type of error when I run "show errors" so I'm curious why 
you're seeing these and I'm not.
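
In case anyone wants to reproduce this, the usual way to issue "show errors"
is through the stats socket, assuming one is configured (the socket path
below is just an example):

    # in the global section of haproxy.cfg:
    #     stats socket /var/run/haproxy.stat
    $ echo "show errors" | socat stdio /var/run/haproxy.stat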

Brendon

Re: Help with cleaning up our error log output and request error counts

2012-12-26 Thread SBD
Yes, sometimes I get it and sometimes I don't. As I said, this probably has
to do with some other device (hopefully).
Another interesting thing is that we didn't always have those kinds of
requests. It started soon after changing the configuration from a single
listen to a frontend-backend config style (we have not tried switching it
back, though).


On Wed, Dec 26, 2012 at 4:30 PM, Brendon Colby bren...@newgrounds.com wrote:

 [...]



Re: Help with cleaning up our error log output and request error counts

2012-12-26 Thread Willy Tarreau
On Wed, Dec 26, 2012 at 07:03:02PM -0500, Brendon Colby wrote:
 I was thinking that this is just standard browser behavior too. IE also does
 this - it just seems to open fewer connections. This is why I was confused
 and thought I was missing something, because it seems like normal browser
 behavior even though the docs indicated that it should only happen during an
 attack or some other anomalous event.

This stupid behaviour is something very recent. I did not know that IE was
doing this too, but after all I'm not surprised, with the browser war...

  If you can reproduce the behaviour with your browser, I think that 
  dontlognull
  will be your only solution and that we'll have to update the doc to indicate
  that browsers have adopted such an internet-unfriendly behaviour that it's
  better to leave the option on. What I don't like with proactively opened
  connections is that they're killing servers with 10-100 times the load they
  would have to sustain and that even small sites might experience issues with
  this. If you see 200 of them per second and they last 5s on average, it 
  means
  you're constantly having 5000 idle connections just because of this. Many
  web servers can't handle this :-/
 
 I can see smaller sites having a hard time with this! Before we added
 timeout http-request we were seeing over 22K established connections to the
 haproxy server. That's why we have such a high maxconn on our frontend - we
 hit several limits once we went live (which reminds me to lower it) and had
 to keep increasing it.

That's really disgusting. Products such as haproxy or nginx can easily deal
with that many concurrent connections, but many other legacy servers cannot.

Would you mind reporting this to the HTTP working group at the IETF? The
HTTP/1.1 spec is currently being refined and is almost done, but we can
still add information there. Good and bad practices can be updated with
this experience. The address is ietf-http...@w3.org.

 Once we added the timeout though, established connections plummeted to (as of
 now, for example) about 5K. The total TCP connections did NOT go down,
 however, because now most of them are now in TIME_WAIT (92K!).

TIME_WAIT sockets are harmless on the server side. You can easily reach
millions of them without any issues.

 The only thing
 this seems to affect is our monitoring system which uses netstat to get TCP
 stats. It sometimes takes almost a minute to run and uses 100% CPU, but
 otherwise doesn't seem to affect anything so we've left it for now.

There are two commands that you must absolutely never use in a monitoring
system:
  - netstat -a
  - ipcs -a

Both of them will saturate the system and considerably slow it down when
something starts to go wrong. For the sockets you should use what's in
/proc/net/sockstat, which has all the numbers you want. If you need more
details, use "ss -a" instead of "netstat -a"; it uses the netlink interface
and is several orders of magnitude faster.
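
A quick sketch of those alternatives (all standard procfs/iproute2
interfaces, nothing haproxy-specific):

    $ cat /proc/net/sockstat            # global socket counters, cheap to read
    $ ss -s                             # per-state summary via netlink
    $ ss -a                             # full socket list, much faster than netstat -a
    $ ss -tn state time-wait | wc -l    # e.g. count only TIME_WAIT sockets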

 So if connections are terminated because of timeouts that I have explicitly
 set, is there any reason to log and count that as a request error? That to me
 seems like something that could be logged as info for troubleshooting and not
 counted as an error at all. Just a thought - it's nothing major.

This is a real error. It's not because some browsers decided to do stupid
things that it's not an error. Setting aside this case and attacks, the
first reason for not getting a request is that it is blocked by too short
an MTU in some VPNs. It's very important to know when a browser could not
send a POST or a request with a large cookie due to a short MTU somewhere.

 With dontlognull we still see this type of error at a rate of several per 
 second:
 
 Dec 26 18:38:17 localhost haproxy[32259]: n.n.n.n:33685
 [26/Dec/2012:18:38:17.155] http-in webservers-ngfiles/ngwebNN 10/0/0/1/649
 200 31671 - - CD-- 4166/4159/139/31/0 0/0 GET
 /images/14/o-eternal-o_the-vampire-dragon.jpg HTTP/1.1

This one happens more frequently and is also an error as the transfer was not
complete, as seen from haproxy.

 This is logged when, for example, I am on the site watching a movie, and I
 close the browser. To me, this is another event that could be logged at the
 info level, but I figure there are probably good reasons why it's logged and
 counted as an error. It is probably just the perfectionist in me that wants
 the error rate to be nearly 0. :)

In fact if you want to bring the error rate to zero by hiding all errors,
you're cheating :-)

It seems to me that you'd like not to log errors for what happens on the
client side, am I wrong? This could probably make sense in some environments,
and we could probably think about adding an option to do that.

Regards,
Willy