Hello,
I am currently running HAProxy 1.6.14-1ppa1~xenial-66af4a1 2018/01/06. There
are many features implemented in 1.8, 1.9, and 2.0 that would benefit my
deployments. I tested 2.0.3-1ppa1~xenial last night but unfortunately found it
to be using excessive amounts of CPU and had to revert. For this deployment, I
have two separate use cases in HAProxy: the first is external HTTP/HTTPS load
balancing from external clients to a cluster; the second is internal HTTP load
balancing between two different applications (for simplicity's sake, call them
front and back). The excessive CPU usage was observed on the second use case,
HTTP between the front and back applications. I previously leveraged nbproc
and cpu-map to isolate the use cases, but in 2.0 I moved to nbthread (the
default) and cpu-map (auto) for the same isolation. The CPU usage was so
excessive that I had to move the second use case to two cores to avoid
utilizing 100% of a single core, and I was still getting timeouts. It took
some time to rewrite the config files from 1.6 to 2.0, but I was able to get
them all configured properly, and I leveraged top and mpstat to ensure the
threads and use cases landed on the proper cores.
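In case it helps, this is roughly how I verified thread placement (standard
tools only; this assumes a single haproxy process, as is the case with
nbthread):

    # per-thread CPU view; enable the 'P' (last used CPU) column from the
    # 'f' (fields) screen
    top -H -p "$(pidof haproxy)"

    # per-core utilization every 5 seconds (the numbers below come from this)
    mpstat -P ALL 5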
Because of the problems with use case #2, I did not even get a chance to
evaluate use case #1, but again, I use cpu-map and 'process' to isolate these
use cases as much as possible. Upon reverting to 1.6 (install and configs),
everything worked as expected.
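For reference, the isolation looked roughly like this in each version (heavily
trimmed and sanitized to illustrate the idea, not my full config):

    # 1.6: one process per core via nbproc + cpu-map, listeners pinned
    # to a process with the bind 'process' keyword
    global
        nbproc 40
        cpu-map 1 0    # process 1 -> core 0
        cpu-map 2 1    # process 2 -> core 1, and so on

    # 2.0: a single process, one thread per core, listeners pinned to a
    # thread with 'process 1/<thread>' on the bind line
    global
        cpu-map auto:1/1-40 0-39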
Here is the CPU usage on 1.6 from mpstat -P ALL 5:
08:33:02 PM  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
08:33:07 PM    0   7.48   0.00  16.63    0.00  0.00   0.00   0.00   0.00   0.00  75.88
Here is the CPU usage on 2.0.3 when using one thread:
08:29:35 PM  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
08:29:40 PM   39  35.28   0.00  55.24    0.00  0.00   0.00   0.00   0.00   0.00   9.48
Here is the CPU usage on 2.0.3 when using two threads (the front application
still experienced timeouts to the back application, even without 100% CPU
utilization on the cores):
08:30:48 PM  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
08:30:53 PM    0  22.93   0.00  19.75    0.00  0.00   0.00   0.00   0.00   0.00  57.32
08:30:53 PM   39  21.60   0.00  25.10    0.00  0.00   0.00   0.00   0.00   0.00  53.29
Also note that our front generally keeps connections open to our back for an
extended period of time, as it pools them internally, so many requests are
sent over each connection via HTTP/1.1 keep-alive. I think we had roughly
1000 connections established during these tests.
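For what it's worth, that number is an estimate from counting established
sockets on the back-end port with something like:

    ss -tn state established '( sport = :8080 or dport = :8080 )' | wc -l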
Here are some configuration settings that might be relevant to your analysis
(there are more, but they are pretty much standard: user, group, stats, log,
chroot, etc.):
global
    cpu-map auto:1/1-40 0-39
    maxconn 50
    spread-checks 2
    server-state-file global
    server-state-base /var/lib/haproxy/

defaults
    option dontlognull
    option dontlog-normal
    option redispatch
    option tcp-smart-accept
    option tcp-smart-connect
    timeout connect 2s
    timeout client 50s
    timeout server 50s
    timeout client-fin 1s
    timeout server-fin 1s
This part has been sanitized and I reduced the number of servers from 14 to 2.
listen back
    bind 10.0.0.251:8080 defer-accept process 1/40
    bind 10.0.0.252:8080 defer-accept process 1/40
    bind 10.0.0.253:8080 defer-accept process 1/40
    bind 10.0.0.254:8080 defer-accept process 1/40
    mode http
    maxconn 65000
    fullconn 65000
    balance leastconn
    http-reuse safe
    source 10.0.1.100
    option httpchk GET /ping HTTP/1.0
    http-check expect string OK
    server s1 10.0.2.1:8080 check agent-check agent-port 8009 agent-inter 250ms inter 500ms fastinter 250ms downinter 1000ms weight 100 source 10.0.1.100
    server s2 10.0.2.2:8080 check agent-check agent-port 8009 agent-inter 250ms inter 500ms fastinter 250ms downinter 1000ms weight 100 source 10.0.1.101
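For completeness, the /ping health endpoint simply returns a body containing
the literal string OK, e.g. (illustrative):

    $ curl -s http://10.0.2.1:8080/ping
    OK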
To configure multiple cores, I changed the bind line to add 'process 1/1'; I
also removed 'process 1/1' from the other use case.
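In other words, the binds for the two-thread test looked something like this
(again sanitized; one bind moved to thread 1, the rest left on thread 40):

    bind 10.0.0.251:8080 defer-accept process 1/1
    bind 10.0.0.252:8080 defer-accept process 1/40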
The OS is Ubuntu 16.04.3 LTS; the processors are 2x E5-2630 with 64GB of RAM.
The output from haproxy -vv looked very typical for both versions: epoll,
OpenSSL 1.0.2g (not used in this case), etc.
Please let me know if there is any additional information I can provide to
assist in isolating the cause of this issue.
Thank you!
Nick