Hello, I have the following configuration for my webfarm:
Seven servers, all running CentOS 5. Two (LB1 and LB2) are HAProxy load balancers; Web1-5 are Apache 2 and Lighttpd web servers (I tried both, as an elimination test on the web server side). All of the servers are dual-Xeon Dell PE1850s with 4 GB of RAM or better. I am using HAProxy 1.3.15.2. The webfarm is used for ad serving with OpenX.

Here is the problem I am having: once I get in the neighborhood of 1000 sessions, the stats page on the active LB starts showing the web servers going up and down randomly (all servers, even those set as backup and not receiving any traffic). When I monitor the servers in real time and check the stats page of the standby LB (the hot spare, which still monitors the servers), everything looks fine and all servers are green with no problems. Because HAProxy thinks servers are going up and down, clients are getting errors.

The Apache servers are set up in prefork mode with ServerLimit 2000, MaxClients 2000, and MaxRequestsPerChild 4000. The load per server has never reached 2000, and there is plenty of CPU and RAM to spare on each machine even during heavy load. My guess is that the active HAProxy server is running out of some resource and losing the ability to poll the web servers under this load. There is barely any CPU or RAM usage on the LB itself, so I don't think it's a hardware issue.

Below is my HAProxy config file. Please note that because this is an ad-serving distribution system, I don't need the proxy to keep track of sessions or send the same user to the same web server; each request is a fully self-contained operation.
global
    log 127.0.0.1 local0 notice
    #log 127.0.0.1 local1 notice
    #log loghost local0 info
    maxconn 20000
    #debug
    #quiet
    ulimit-n 25000
    user root     # set because lit says ulimit-n above requires it
    group root    # set because lit says ulimit-n above requires it

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    redispatch
    maxconn 20000
    contimeout 60000
    clitimeout 100000
    srvtimeout 100000

listen webfarm
    bind :80
    mode http
    stats enable
    stats auth ***:***
    balance roundrobin
    #cookie SERVERID insert indirect nocache
    option httpclose
    option forwardfor
    option httpchk GET /check.txt HTTP/1.0
    # web1 commented out, reserved for admin interface:
    #server web1.****.com ***.***.***.***:80 weight 1 check maxconn 200 inter 1000 rise 2 fall 5
    server web2.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5 backup
    server web3.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5
    server web4.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5
    ... and so forth, the same for the rest of the servers.

I have tried using HTTP/1.1 for the httpchk option, but that results in all servers being shown as down in the stats. I have also tried varying the inter parameter from 500 to 4000 with no change in behavior. Please let me know if you can suggest something. I am guessing some operating system variables need to be tweaked.

Thank you,
Michael
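On the HTTP/1.1 check failure: an HTTP/1.1 request requires a Host header, and many servers reject 1.1 requests without one, which would make every check fail and show all servers as down. A minimal sketch of how the header can be appended on the httpchk line (the hostname below is a placeholder, not from my setup):

    # HTTP/1.1 health checks need a Host header; it is appended with \r\n
    # inside the httpchk line, and the space after "Host:" is escaped.
    # Substitute a hostname your web servers actually answer to.
    option httpchk GET /check.txt HTTP/1.1\r\nHost:\ webfarm.example.com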
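Since the active LB shows almost no CPU or RAM usage, one common culprit at around 1000 sessions on CentOS 5 is connection tracking or ephemeral port exhaustion on the balancer rather than hardware. If iptables is loaded, the ip_conntrack table can silently fill and drop new connections, including the health checks. A sketch of sysctl settings to check, assuming iptables/ip_conntrack is in use; the values are illustrative starting points, not tuned recommendations:

    # /etc/sysctl.conf -- illustrative values; verify current usage first with
    #   wc -l /proc/net/ip_conntrack
    #   sysctl net.ipv4.netfilter.ip_conntrack_max

    # Raise the connection-tracking table ceiling (the default can be low)
    net.ipv4.netfilter.ip_conntrack_max = 262144

    # Widen the ephemeral port range used for LB-to-backend connections
    net.ipv4.ip_local_port_range = 1024 65000

    # Allow reuse of TIME_WAIT sockets for new outgoing connections
    net.ipv4.tcp_tw_reuse = 1

Apply with sysctl -p. If the conntrack count is nowhere near the max under load, this is not the bottleneck and the port range and TIME_WAIT behavior are the next things to rule out.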