Re: About CPU usage

2015-09-13 Thread Willy Tarreau
Hello Dmitry,

On Thu, Sep 10, 2015 at 06:12:34PM +0300, Dmitry Sivachenko wrote:
> Hello,
> 
> I have a haproxy-1.5.11 with a single frontend passing requests to a single 
> backend in TCP mode (sample config).
> Application establishes several long-living tcp connections and sends a lot 
> of small requests over them.
> 
> In my test case I have 2 simultaneous TCP connections producing about 3 
> MB/sec, 20 kpps input (as shown by netstat on backend machine) and 
> approximately the same output.
> 
> haproxy process consumes about 20% of CPU core (I have a machine with 2 Intel 
> Xeon E5-2650 v2 @ 2.60GHz).
> 
> In my understanding such CPU usage is rather high for the relatively low load.

It's not necessarily high. What you're observing is the system's
capacity with no event aggregation. For each input message, you probably
have one NIC IRQ, one TCP segment processed, one wake-up of haproxy, one
poll() wake-up event, one recv() call, one send() call, one attempt to
send a segment to the NIC, one update of the polled FDs, one call to poll(),
one ACK in return, one NIC IRQ, one wake-up of the TCP stack to empty the
send queue, then one CPU idle call.

In practice at higher loads, the whole cycle above happens once for
multiple messages/packets. That's why it's always very hard to predict
the absolute maximum performance based on CPU usage alone. The graph
of the load per packet rate generally represents an increasing sawtooth
curve with smaller teeth as load increases. Since you're seeing only 20%
CPU, in your case the load is probably low so you're at the beginning of
the curve.

If you want to get a very approximate idea of the maximum load you can
handle with a single event at once, you can simulate this in haproxy (but
you won't be able to do it in the system). You'll have to force the
tune.maxpollevents value to 1, and stress the system. HAProxy will be in
the worst possible case but that will give you an idea of the maximum
performance in this situation, hence where you are on the curve with
your current load.
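
As a sketch of the test setup described above, the setting goes in the
global section of the haproxy configuration. The rest of the configuration
is assumed to stay unchanged, and the value 1 is only meant for this
worst-case measurement, not for production use:

```
global
    # test only: process a single event per call to the poller,
    # which disables event aggregation inside haproxy
    tune.maxpollevents 1
```

Remove the line again once the measurement is done so that haproxy goes
back to handling several events per poll() call.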

> Anything obvious I can tune?

You really need to understand that current systems are designed for
event aggregation all the way down to the hardware. NICs implement
scatter-gather/LRO/TSO and IRQ mitigation. Systems implement NAPI,
softirq and GRO/GSO. poll() supports multiple events. HAProxy uses
this as well. Without any form of event aggregation, the effective
system's capacity can be much lower than the nominal performance
(typically 10-20% of the nominal perf). So that's why the measured
CPU usage doesn't grow linearly with the workload.
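
To make the non-linearity concrete, here is a toy model, not a measurement.
It assumes that each wake-up has a fixed cost (IRQ, poll(), scheduling) that
is shared by all messages handled in that wake-up, plus a small cost per
message. The function names and the two cost values are hypothetical; the
costs were picked only so that 20,000 messages per second with one message
per wake-up lands near 20% of a core:

```python
def cpu_cost_per_message(batch_size, fixed_cost_us=9.0, per_message_cost_us=1.0):
    """CPU microseconds spent per message when `batch_size` messages
    share a single wake-up. Both cost values are made-up assumptions."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return fixed_cost_us / batch_size + per_message_cost_us


def cpu_usage_percent(messages_per_second, batch_size):
    """Share of one CPU core consumed at a given message rate."""
    busy_us_per_second = messages_per_second * cpu_cost_per_message(batch_size)
    return busy_us_per_second / 1_000_000 * 100


def max_rate(batch_size):
    """Message rate at which one core is fully busy for this batch size."""
    return 1_000_000 / cpu_cost_per_message(batch_size)


if __name__ == "__main__":
    # Same message rate, different amounts of aggregation per wake-up.
    for batch in (1, 2, 4, 16, 64):
        print(batch, round(cpu_usage_percent(20_000, batch), 2), round(max_rate(batch)))
```

With these assumed costs, the rate that saturates a core with no aggregation
is a small fraction of the rate reachable with large batches, which is why
extrapolating linearly from a low-load CPU figure is misleading.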

By the way, sometimes even the CPU's ability to switch to idle mode
comes with a latency cost, just like the calls to the scheduler's
idle thread. There are situations where running a busy loop at nice
+19 in the background results in lower latency on proxied HTTP
requests! This explains why when you progressively increase the load,
you'll see the processing time decrease at some points before starting
to increase again.

That's why I can only recommend that you find the limit for your workload
(including the maximum latency you're willing to accept) and trust that
limit.

You may find that past a certain load, running IRQs on certain CPU cores
and haproxy on another core sharing the same cache (L2 or L3) will bring
you more throughput with lower latency. Don't move it to a different CPU
socket though, the IPI (inter-processor interrupt) can be very slow and
the lack of cache sharing will cost a lot of performance and latency.

Willy




About CPU usage

2015-09-10 Thread Dmitry Sivachenko
Hello,

I have a haproxy-1.5.11 with a single frontend passing requests to a single 
backend in TCP mode (sample config).
Application establishes several long-living tcp connections and sends a lot of 
small requests over them.

In my test case I have 2 simultaneous TCP connections producing about 3 MB/sec, 
20 kpps input (as shown by netstat on backend machine) and approximately the 
same output.

haproxy process consumes about 20% of CPU core (I have a machine with 2 Intel 
Xeon E5-2650 v2 @ 2.60GHz).

In my understanding such CPU usage is rather high for the relatively low load.

I tried both FreeBSD and Linux and see similar results (I am interested in 
FreeBSD though).

Anything obvious I can tune?

Thanks.