Hi Florin,

“session enable” does not help.
hping uses raw sockets, so that must be the reason.

Luca



From: Florin Coras <fcoras.li...@gmail.com>
Date: Friday 11 May 2018 at 23:02
To: Luca Muscariello <lumuscar+f...@cisco.com>
Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.

Hi Luca,

Not really sure why the kernel is slow to reply to ping. Maybe it has to do 
with scheduling, but that’s just guesswork.

I’ve never tried hping. Let me see if I understand your scenario: while running 
iperf you tried to hping the stack and got no RST back? Anything 
interesting in the “sh error” counters? If iperf wasn’t running, did you first 
enable the stack with “session enable”?

Florin


On May 11, 2018, at 3:19 AM, Luca Muscariello <lumuscar+f...@cisco.com> wrote:

Florin,

A few more comments about latency.
Some numbers, in ms, in the table below:

This is ping and iperf3 running concurrently. In the VPP case, it is vppctl ping.

          Kernel w/ load   Kernel w/o load   VPP w/ load   VPP w/o load
Min.      0.1920           0.0610            0.0573        0.03480
1st Qu.   0.2330           0.1050            0.2058        0.04640
Median    0.2450           0.1090            0.2289        0.04880
Mean      0.2458           0.1153            0.2568        0.05096
3rd Qu.   0.2720           0.1290            0.2601        0.05270
Max.      0.2800           0.1740            0.6926        0.09420

In short: with VPP, ICMP packets see a somewhat lower latency under load.
I could maybe interpret this as due to vectorization. Also, the Linux kernel
is slower to reply to ping by a factor of 2 (system call latency?): 115 us vs
50 us in VPP without load; with load there is no difference. In this test Linux TCP is using TSO.

While trying to use hping to get a latency sample with TCP instead of ICMP,
we noticed that the VPP TCP stack does not reply with an RST, so we don’t get
any sample. Is that the expected behavior?

Thanks


Luca





From: Luca Muscariello <lumus...@cisco.com>
Date: Thursday 10 May 2018 at 13:52
To: Florin Coras <fcoras.li...@gmail.com>
Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.

MTU had no effect, just statistical fluctuations in the test reports. Sorry for 
misreporting the info.

We are exploiting vectorization: we have a single memif channel
per transport socket, so we can control the size of the batches dynamically.

In theory, the amount of outstanding data from the transport should be controlled
in bytes for batching to be useful rather than harmful, since frame sizes can vary
a lot. But I’m not aware of a queue abstraction from DPDK to control that from VPP.
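
To make the byte-budget idea concrete, here is a minimal sketch of what I mean; all
names are invented for illustration and do not correspond to any existing VPP, DPDK
or memif API:

  /* Hypothetical byte-budget gate: cap outstanding data in bytes rather than
   * frames, so batching stays bounded even when frame sizes vary a lot.
   * None of these names exist in VPP/DPDK; they only illustrate the idea. */
  #include <stdbool.h>
  #include <stdint.h>

  typedef struct
  {
    uint64_t bytes_in_flight; /* enqueued towards the driver, not yet sent */
    uint64_t byte_budget;     /* e.g. a couple of BDPs worth of bytes */
  } tx_budget_t;

  /* Checked before enqueueing a frame of 'len' bytes on the memif channel. */
  static bool
  tx_budget_allows (tx_budget_t *b, uint32_t len)
  {
    return b->bytes_in_flight + len <= b->byte_budget;
  }

  static void
  tx_budget_on_enqueue (tx_budget_t *b, uint32_t len)
  {
    b->bytes_in_flight += len;
  }

  /* Called when the driver reports 'len' bytes actually transmitted. */
  static void
  tx_budget_on_tx_complete (tx_budget_t *b, uint32_t len)
  {
    b->bytes_in_flight = len >= b->bytes_in_flight ? 0 : b->bytes_in_flight - len;
  }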

From: Florin Coras <fcoras.li...@gmail.com>
Date: Wednesday 9 May 2018 at 18:23
To: Luca Muscariello <lumus...@cisco.com>
Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.

Hi Luca,

We don’t yet support PMTU discovery in the stack, so TCP uses a fixed MSS of 1460 
bytes (a 1500-byte MTU minus 20 bytes of IP and 20 bytes of TCP header). Unless you 
changed that, we shouldn’t be generating jumbo packets. If we do, I’ll have to take 
a look at it :)

If you already had your transport protocol, using memif is the natural way to 
go. Using the session layer makes sense only if you can implement your 
transport within vpp in a way that leverages vectorization or if it can 
leverage the existing transports (see for instance the TLS implementation).

Until today [1], the stack allowed excessive batching (generating multiple frames 
in one dispatch loop), but we’re now restricting that to a single frame. This is 
still far from proper pacing, which is on our todo list.
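
Conceptually (just a sketch of the idea, not the code in [1]; all names are
invented), the restriction amounts to a per-connection cap on frames generated per
dispatch loop:

  /* Sketch of capping a connection at one frame's worth of output per
   * dispatch loop. Invented names; the actual change is the one in [1]. */
  #include <stdint.h>

  #define MAX_FRAMES_PER_DISPATCH 1

  typedef struct
  {
    uint32_t last_dispatch_gen;    /* dispatch-loop iteration last seen */
    uint32_t frames_this_dispatch; /* frames generated in that iteration */
  } conn_tx_state_t;

  static int
  conn_may_send_frame (conn_tx_state_t *c, uint32_t dispatch_gen)
  {
    if (c->last_dispatch_gen != dispatch_gen)
      {
        /* New dispatch-loop iteration: reset the per-loop counter. */
        c->last_dispatch_gen = dispatch_gen;
        c->frames_this_dispatch = 0;
      }
    if (c->frames_this_dispatch >= MAX_FRAMES_PER_DISPATCH)
      return 0;
    c->frames_this_dispatch++;
    return 1;
  }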

Florin

[1] https://gerrit.fd.io/r/#/c/12439/





On May 9, 2018, at 4:21 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:

Florin,

Thanks for the slide deck, I’ll check it soon.

BTW, the VPP/DPDK test was using jumbo frames by default, so the TCP stack had a 
little advantage over the Linux TCP stack, which was using a 1500 B MTU by default.

By manually setting the DPDK MTU to 1500 B, the goodput goes down to 8.5 Gbps, which 
compares to 4.5 Gbps for Linux w/o TSO. Also, the congestion window adaptation is 
not the same.

BTW, for what we’re doing it is difficult to reuse the VPP session layer as it 
is.
Our transport stack uses a different kind of namespace and mux/demux is also 
different.

We are using memif as the underlying driver, which does not seem to be a
bottleneck since we can also control batching there. Also, we have our own
shared memory downstream of memif inside VPP, through a plugin.

What we observed is that delay-based congestion control does not like
VPP batching (or batching in general) much, and we are using DBCG.

Linux TSO has the same problem, but TCP pacing limits the bad effects of bursts
on RTT, losses, and the flow control laws.
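
For reference, the pacing Linux relies on boils down to deriving a rate from cwnd
and the smoothed RTT and then spacing departures accordingly. A rough sketch of the
idea (simplified constants, not the exact kernel formula or its fq-based enforcement):

  /* Rough pacing sketch: rate ~ 1.2 * cwnd / srtt, then compute the earliest
   * time the next segment may leave so bursts get spread over the RTT.
   * Simplified on purpose; not the exact Linux formula or mechanism. */
  #include <stdint.h>

  typedef struct
  {
    uint64_t cwnd_bytes;   /* congestion window in bytes */
    uint64_t srtt_usec;    /* smoothed RTT in microseconds */
    uint64_t next_tx_usec; /* earliest departure time for the next segment */
  } pacer_t;

  static uint64_t
  pacing_rate_bps (const pacer_t *p) /* bytes per second */
  {
    if (p->srtt_usec == 0 || p->cwnd_bytes == 0)
      return 0; /* no estimate yet: caller should not pace */
    return (12 * p->cwnd_bytes * 1000000ull) / (10 * p->srtt_usec);
  }

  /* Returns the time (usec) at which a segment of 'len' bytes may be sent. */
  static uint64_t
  pacer_schedule (pacer_t *p, uint64_t now_usec, uint32_t len)
  {
    uint64_t rate = pacing_rate_bps (p);
    uint64_t slot = rate ? ((uint64_t) len * 1000000ull) / rate : 0;
    uint64_t t = p->next_tx_usec > now_usec ? p->next_tx_usec : now_usec;
    p->next_tx_usec = t + slot;
    return t;
  }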

I guess you’re aware of these issues already.

Luca


From: Florin Coras <fcoras.li...@gmail.com>
Date: Monday 7 May 2018 at 22:23
To: Luca Muscariello <lumus...@cisco.com>
Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.

Yes, the whole host stack uses shared memory segments and fifos that the 
session layer manages. For a brief description of the session layer see [1, 2]. 
Apart from that, unfortunately, we don’t have any other dev documentation. 
src/vnet/session/segment_manager.[ch] has some good examples of how to allocate 
segments and fifos. Under application_interface.h check 
app_[send|recv]_[stream|dgram]_raw for examples on how to read/write to the 
fifos.

Now, regarding writing to the fifos: they are lock free, but size 
increments are atomic, since the assumption is that we’ll always have exactly one 
reader and one writer. Still, batching writes helps. VCL doesn’t do it, but iperf 
probably does.
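
To illustrate the single-reader/single-writer assumption, here is a generic sketch
(not the actual svm_fifo code) using C11 atomics: each cursor is owned by one side,
only the occupancy counter is updated atomically, and batching several small writes
into one enqueue amortizes the per-call cost:

  /* Generic SPSC ring sketch: one writer advances 'tail', one reader advances
   * 'head', and only the shared occupancy counter is updated atomically.
   * Illustrative only; not the actual svm_fifo implementation. */
  #include <stdatomic.h>
  #include <stdint.h>

  typedef struct
  {
    uint8_t *data;
    uint32_t size;            /* power of two */
    uint32_t tail;            /* producer cursor, writer-only */
    uint32_t head;            /* consumer cursor, reader-only */
    _Atomic uint32_t cursize; /* bytes currently queued */
  } spsc_fifo_t;

  /* Producer side: returns how many of 'len' bytes were actually enqueued.
   * The consumer mirrors this with a load/copy/fetch_sub on its side. */
  static uint32_t
  spsc_enqueue (spsc_fifo_t *f, const uint8_t *src, uint32_t len)
  {
    uint32_t used = atomic_load_explicit (&f->cursize, memory_order_acquire);
    uint32_t avail = f->size - used;
    uint32_t n = len < avail ? len : avail;
    for (uint32_t i = 0; i < n; i++)
      f->data[(f->tail + i) & (f->size - 1)] = src[i];
    f->tail = (f->tail + n) & (f->size - 1);
    /* Publish the new bytes; pairs with the consumer's acquire load. */
    atomic_fetch_add_explicit (&f->cursize, n, memory_order_release);
    return n;
  }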

Hope this helps,
Florin

[1] https://wiki.fd.io/view/VPP/HostStack/SessionLayerArchitecture
[2] https://wiki.fd.io/images/1/15/Vpp-hoststack-kc-eu-18.pdf





On May 7, 2018, at 11:35 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:

Florin,

So the TCP stack does not connect to VPP using memif.
I’ll check the shared memory you mentioned.

For our transport stack we’re using memif. Nothing to
do with TCP though.

From iperf3 to VPP there must be copies anyway.
There must be some batching, with some timing involved,
while doing these copies, though.

Is there any doc of svm_fifo usage?

Thanks
Luca

On 7 May 2018, at 20:00, Florin Coras <fcoras.li...@gmail.com> wrote:
Hi Luca,

I guess, as you did, that it’s vectorization. VPP is really good at pushing 
packets whereas Linux is good at using all hw optimizations.

The stack uses its own shared memory mechanisms (check svm_fifo_t), but given 
that you did the testing with iperf3, I suspect the edge is not there. That is, 
I guess they’re not abusing syscalls with lots of small writes. Moreover, the 
fifos are not zero-copy: apps do have to write into the fifo and VPP has to 
packetize that data.

Florin





On May 7, 2018, at 10:29 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:

Hi Florin

Thanks for the info.

So, how do you explain the VPP TCP stack beating the Linux
implementation by doubling the goodput?
Does it come from vectorization?
Any special memif optimization underneath?

Luca

On 7 May 2018, at 18:17, Florin Coras <fcoras.li...@gmail.com> wrote:
Hi Luca,

We don’t yet support TSO because it requires support within all of vpp (think 
tunnels). Still, it’s on our list.

As for crypto offload, we do have support for IPsec offload with QAT cards, and 
we’re now working with Ping and Ray from Intel on accelerating the TLS OpenSSL 
engine with QAT cards as well.
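
For context, from an application’s point of view, routing crypto through such an
engine usually looks roughly like the sketch below. The engine id "qat" is an
assumption about the installed Intel QAT OpenSSL engine, and our TLS integration
may of course wire this up differently:

  /* Sketch: asking OpenSSL to route private-key operations through a hardware
   * engine. The id "qat" is an assumption; check `openssl engine` for the
   * actual id of the installed QAT engine. */
  #include <openssl/engine.h>
  #include <stdio.h>

  static ENGINE *
  load_offload_engine (const char *id)
  {
    ENGINE_load_builtin_engines ();
    ENGINE *e = ENGINE_by_id (id);
    if (!e)
      {
        fprintf (stderr, "engine '%s' not found\n", id);
        return NULL;
      }
    if (!ENGINE_init (e))
      {
        fprintf (stderr, "engine '%s' failed to initialize\n", id);
        ENGINE_free (e);
        return NULL;
      }
    /* Make it the default for everything it supports (RSA, EC, ...). */
    ENGINE_set_default (e, ENGINE_METHOD_ALL);
    return e;
  }

  int
  main (void)
  {
    ENGINE *e = load_offload_engine ("qat");
    /* ... set up SSL_CTX / sessions as usual; supported operations are
       now offloaded ... */
    if (e)
      {
        ENGINE_finish (e);
        ENGINE_free (e);
      }
    return 0;
  }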

Regards,
Florin





On May 7, 2018, at 7:53 AM, Luca Muscariello <lumuscar+f...@cisco.com> wrote:

Hi,

A few questions about the TCP stack and HW offloading.
Below is the experiment under test.

  +--------------+                            +--------------+
  | LXC          |                            |          LXC |
  | Iperf3       |      +--------------+      |       Iperf3 |
  |   TCP        |      | Nexus Switch |      |        TCP   |
  |   VPP        +------+              +------+        VPP   |
  +--------------+      +--------------+      +--------------+
                DPDK-10GE              DPDK-10GE


Using the Linux kernel, I get an iperf3 goodput of 9.5 Gbps with TSO and 4.5 Gbps 
without it.
Using the VPP TCP stack I get 9.2 Gbps, i.e. roughly the same maximum goodput as 
Linux with TSO.

Is there any TSO implementation already in VPP one can take advantage of?

Side question: is there any crypto offloading service available in VPP?
Essentially for the computation of RSA-1024/2048 and ECDSA-192/256 signatures.

Thanks
Luca




