Hello,

I have the following configuration for my webfarm:

7 servers, all running CentOS 5. Two (LB1 and LB2) are HAProxy load
balancers; Web1-5 are Apache 2 and Lighttpd web servers (I tried both
to rule out problems on the web-server side). All of the servers are
dual-Xeon Dell PE1850s with 4 GB of RAM or better. I am using HAProxy
1.3.15.2. The web farm is used for ad serving with OpenX.

Here is the problem I am having:

Once I get in the neighborhood of 1000 sessions, the stats page on the
active LB starts showing the web servers going up and down at random
(all of them, even those set as backup and not receiving any traffic).
When I monitor the servers in real time and check the stats page of
the standby LB (the hot spare, which still monitors the servers),
everything looks fine and all servers are green with no problems.
Because HAProxy thinks the servers are going up and down, clients are
getting errors.

The Apache servers run in prefork mode with ServerLimit 2000,
MaxClients 2000, and MaxRequestsPerChild 4000. No server has ever
reached 2000 concurrent requests, and there is plenty of CPU and RAM
to spare on each machine, even under heavy load.
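For reference, the relevant part of httpd.conf looks roughly like this
(ServerLimit, MaxClients, and MaxRequestsPerChild are the real values;
the spare-server numbers are only illustrative, not my exact settings):

<IfModule prefork.c>
        StartServers           8
        MinSpareServers        5
        MaxSpareServers       20
        ServerLimit         2000
        MaxClients          2000
        MaxRequestsPerChild 4000
</IfModule>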

My guess is that the active HAProxy server is running out of some
resource and losing the ability to poll the web servers under this
load. There is barely any CPU or RAM usage on the LB itself, so I
don't think it's a hardware issue. Below is my HAProxy config file.
Please note that because this is an ad-distribution system, I don't
need the proxy to keep track of sessions or send the same user to the
same web server; each request is a fully self-contained operation.

global
        log 127.0.0.1   local0 notice
        #log 127.0.0.1   local1 notice
        #log loghost    local0 info
        maxconn 20000
        #debug
        #quiet
        ulimit-n 25000
        user root   <---- set because the docs say ulimit-n above requires it
        group root  <---- set because the docs say ulimit-n above requires it

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        retries 3
        redispatch

        maxconn 20000
        contimeout      60000
        clitimeout      100000
        srvtimeout      100000

listen webfarm
       bind :80
       mode http
       stats enable
       stats auth ***:***
       balance roundrobin
       #cookie SERVERID insert indirect nocache
       option httpclose
       option forwardfor
       option httpchk GET /check.txt HTTP/1.0
      #server web1.****.com ***.***.***.***:80 weight 1 check maxconn 200 inter 1000 rise 2 fall 5   <--- commented out, reserved for admin interface
      server web2.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5 backup
      server web3.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5
      server web4.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5

and so forth, same for the rest of the servers...

I have tried using HTTP/1.1 for the httpchk option, but that results
in all servers being shown as down in the stats. I have also tried
varying the inter value between 500 and 4000 ms with no change in
behavior. Please let me know if you can suggest something; my guess is
that some operating system variables need to be tweaked.
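In case it is useful, these are the limits I was planning to check on
the active LB (names as on a stock CentOS 5 box; I have not changed
any of them from the defaults yet, so this is only a guess at where to
look):

        # file descriptors available system-wide and to the haproxy process
        cat /proc/sys/fs/file-max
        ulimit -n
        # ephemeral ports and TIME_WAIT settings for the outgoing connections
        sysctl net.ipv4.ip_local_port_range
        sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle
        # connection tracking table, only present if ip_conntrack/iptables is loaded
        sysctl net.ipv4.netfilter.ip_conntrack_max
        cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count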


Thank you,
Michael
