Hi Luca, 

That is most probably the reason. We don’t support raw sockets. 

Florin

> On May 14, 2018, at 1:21 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> 
> wrote:
> 
> Hi Florin,
>
> “session enable” does not help.
> hping uses raw sockets, so this must be the reason.
>
> Luca
>
>
>
> From: Florin Coras <fcoras.li...@gmail.com>
> Date: Friday 11 May 2018 at 23:02
> To: Luca Muscariello <lumuscar+f...@cisco.com>
> Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>
> Hi Luca,
>
> Not really sure why the kernel is slow to reply to ping. Maybe it has to do 
> with scheduling, but that’s just guesswork. 
> 
> I’ve never tried hping. Let me see if I understand your scenario: while 
> running iperf you tried to hping the stack and you got no RST back? Anything 
> interesting in the “sh error” counters? If iperf wasn’t running, did you first 
> enable the stack with “session enable”?
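> 
> In case it helps, the checks I have in mind look roughly like this (a sketch; 
> the exact output format varies a bit across versions):
> 
>     vppctl session enable          # turn on the session layer / host stack
>     vppctl show session verbose    # confirm the listener/sessions are there
>     vppctl show errors             # the “sh error” counters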
>
> Florin
> 
> 
>> On May 11, 2018, at 3:19 AM, Luca Muscariello <lumuscar+f...@cisco.com 
>> <mailto:lumuscar+f...@cisco.com>> wrote:
>>
>> Florin,
>>
>> A few more comments about latency.
>> Some numbers (in ms) are in the table below:
>>
>> This is ping and iperf3 running concurrently. In the VPP case it is vppctl ping.
>>
>> Kernel w/ load   Kernel w/o load  VPP w/ load      VPP w/o load
>> Min.   :0.1920   Min.   :0.0610   Min.   :0.0573   Min.   :0.03480
>> 1st Qu.:0.2330   1st Qu.:0.1050   1st Qu.:0.2058   1st Qu.:0.04640
>> Median :0.2450   Median :0.1090   Median :0.2289   Median :0.04880
>> Mean   :0.2458   Mean   :0.1153   Mean   :0.2568   Mean   :0.05096
>> 3rd Qu.:0.2720   3rd Qu.:0.1290   3rd Qu.:0.2601   3rd Qu.:0.05270
>> Max.   :0.2800   Max.   :0.1740   Max.   :0.6926   Max.   :0.09420
>>
>> In short: ICMP packets see a higher latency under load.
>> I could interpret this as due to vectorization, maybe. Also, the Linux kernel
>> is slower to reply to ping by a factor of 2 (system call latency?): 115us vs
>> 50us in VPP. Under load there is no difference. In this test Linux TCP is using TSO.
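>>
>> For reference, the measurements are along these lines (a sketch; addresses, 
>> counts and intervals are illustrative, not the exact ones from our setup):
>>
>>     iperf3 -c 10.0.0.2 -t 60            # background load, run concurrently
>>     ping -c 1000 -i 0.2 10.0.0.2        # kernel case
>>     vppctl ping 10.0.0.2 repeat 1000    # VPP case, from the VPP data plane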
>>
>> While trying to use hping to get a latency sample w/ TCP instead of ICMP, 
>> we noticed that the VPP TCP stack does not reply with an RST, so we don’t get
>> any sample. Is that expected behavior?
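>>
>> What we tried is roughly the following (a sketch; port and address are 
>> illustrative):
>>
>>     hping3 -S -p 5201 -c 10 10.0.0.2    # SYN probes to the iperf3 port
>>     hping3 -S -p 9 -c 10 10.0.0.2       # SYN probes to a closed port, where
>>                                         # a kernel stack would answer with RST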
>>
>> Thanks
>>
>>
>> Luca
>>
>>
>>
>>
>>
>> From: Luca Muscariello <lumus...@cisco.com <mailto:lumus...@cisco.com>>
>> Date: Thursday 10 May 2018 at 13:52
>> To: Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>>
>> Cc: Luca Muscariello <lumuscar+f...@cisco.com 
>> <mailto:lumuscar+f...@cisco.com>>, "vpp-dev@lists.fd.io 
>> <mailto:vpp-dev@lists.fd.io>" <vpp-dev@lists.fd.io 
>> <mailto:vpp-dev@lists.fd.io>>
>> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>>
>> MTU had no effect, just statistical fluctuations in the test reports. Sorry 
>> for misreporting the info.
>>
>> We are exploiting vectorization: we have a single memif channel 
>> per transport socket, so we can control the size of the batches dynamically. 
>>
>> In theory, the amount of outstanding data from the transport should be 
>> controlled in bytes for batching to be useful and not harmful, since frame 
>> sizes can vary a lot. But I’m not aware of a queue abstraction from DPDK 
>> to control that from VPP.
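>>
>> To make the idea concrete, here is a small sketch of what I mean by a byte 
>> budget per dispatch (purely illustrative C; the queue/driver helpers are 
>> made up, this is not an existing VPP or DPDK API):
>>
>> /* Drain a transport TX queue with a per-dispatch byte budget, so that a
>>  * batch of jumbo frames and a batch of small frames add a comparable
>>  * amount of queueing delay. */
>> #include <stdint.h>
>> #include <stddef.h>
>>
>> typedef struct pkt { uint8_t *data; uint32_t len; } pkt_t;
>>
>> /* Hypothetical helpers, assumed to be provided by the application. */
>> extern pkt_t *txq_peek (void *txq);               /* next packet or NULL */
>> extern void   txq_pop  (void *txq);               /* drop head of queue  */
>> extern int    tx_send  (void *driver, pkt_t *p);  /* <0 if TX ring full  */
>>
>> /* Returns the number of packets sent in this dispatch. */
>> static uint32_t
>> tx_dispatch_byte_budget (void *txq, void *driver, uint32_t byte_budget)
>> {
>>   uint32_t sent = 0, bytes = 0;
>>   pkt_t *p;
>>
>>   while ((p = txq_peek (txq)) != NULL)
>>     {
>>       /* Stop once the budget is used up; leave the rest for next loop. */
>>       if (sent > 0 && bytes + p->len > byte_budget)
>>         break;
>>       if (tx_send (driver, p) < 0)
>>         break;                      /* TX ring full, retry next dispatch */
>>       bytes += p->len;
>>       sent++;
>>       txq_pop (txq);
>>     }
>>   return sent;
>> }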
>>
>> From: Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>>
>> Date: Wednesday 9 May 2018 at 18:23
>> To: Luca Muscariello <lumus...@cisco.com <mailto:lumus...@cisco.com>>
>> Cc: Luca Muscariello <lumuscar+f...@cisco.com 
>> <mailto:lumuscar+f...@cisco.com>>, "vpp-dev@lists.fd.io 
>> <mailto:vpp-dev@lists.fd.io>" <vpp-dev@lists.fd.io 
>> <mailto:vpp-dev@lists.fd.io>>
>> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>>
>> Hi Luca,
>>
>> We don’t yet support PMTU discovery in the stack, so TCP uses a fixed MSS of 
>> 1460; unless you changed that, we shouldn’t generate jumbo packets. If we do, 
>> I’ll have to take a look at it :)
>>
>> If you already had your transport protocol, using memif is the natural way 
>> to go. Using the session layer makes sense only if you can implement your 
>> transport within vpp in a way that leverages vectorization or if it can 
>> leverage the existing transports (see for instance the TLS implementation).
>>
>> Until today [1] the stack did allow for excessive batching (generation of 
>> multiple frames in one dispatch loop), but we’re now restricting that to one 
>> frame. This is still far from proper pacing, which is on our todo list. 
>>
>> Florin
>>
>> [1] https://gerrit.fd.io/r/#/c/12439/ <https://gerrit.fd.io/r/#/c/12439/>
>>
>> 
>>> On May 9, 2018, at 4:21 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com 
>>> <mailto:lumus...@cisco.com>> wrote:
>>>
>>> Florin,
>>>
>>> Thanks for the slide deck, I’ll check it soon.
>>>
>>> BTW, the VPP/DPDK test was using jumbo frames by default, so the TCP stack 
>>> had a little advantage over the Linux TCP stack, which was using 1500 B by 
>>> default.
>>>
>>> By manually setting the DPDK MTU to 1500 B, the goodput goes down to 8.5 Gbps, 
>>> which compares to 4.5 Gbps for Linux w/o TSO. Also, the congestion window 
>>> adaptation is not the same.
>>>
>>> BTW, for what we’re doing it is difficult to reuse the VPP session layer as 
>>> it is.
>>> Our transport stack uses a different kind of namespace and mux/demux is 
>>> also different.
>>>
>>> We are using memif as the underlying driver, which does not seem to be a
>>> bottleneck, as we can also control batching there. Also, we have our own
>>> shared memory downstream of memif inside VPP, through a plugin.
>>>
>>> What we observed is that delay-based congestion control does not like
>>> VPP batching (batching in general) much, and we are using DBCG.
>>>
>>> Linux TSO has the same problem, but it has TCP pacing to limit the bad 
>>> effects of bursts on RTT/losses and flow control laws.
>>>
>>> I guess you’re aware of these issues already.
>>>
>>> Luca
>>>
>>>
>>> From: Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>>
>>> Date: Monday 7 May 2018 at 22:23
>>> To: Luca Muscariello <lumus...@cisco.com <mailto:lumus...@cisco.com>>
>>> Cc: Luca Muscariello <lumuscar+f...@cisco.com 
>>> <mailto:lumuscar+f...@cisco.com>>, "vpp-dev@lists.fd.io 
>>> <mailto:vpp-dev@lists.fd.io>" <vpp-dev@lists.fd.io 
>>> <mailto:vpp-dev@lists.fd.io>>
>>> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>>>
>>> Yes, the whole host stack uses shared memory segments and fifos that the 
>>> session layer manages. For a brief description of the session layer see [1, 
>>> 2]. Apart from that, unfortunately, we don’t have any other dev 
>>> documentation. src/vnet/session/segment_manager.[ch] has some good examples 
>>> of how to allocate segments and fifos. Under application_interface.h check 
>>> app_[send|recv]_[stream|dgram]_raw for examples on how to read/write to the 
>>> fifos.
>>>
>>> Now, regarding the writing to the fifos: they are lock free, but size 
>>> increments are atomic, since the assumption is that we’ll always have one 
>>> reader and one writer. Still, batching helps. VCL doesn’t do it, but iperf 
>>> probably does. 
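>>>
>>> To illustrate why one reader plus one writer needs no locks (this is just 
>>> the idea, not the actual svm_fifo API; all names below are made up):
>>>
>>> /* Minimal single-producer/single-consumer byte ring: each side owns its
>>>  * own cursor and only the occupancy counter is updated atomically. */
>>> #include <stdatomic.h>
>>> #include <stdint.h>
>>>
>>> typedef struct
>>> {
>>>   uint8_t *data;
>>>   uint32_t size;            /* power of two */
>>>   uint32_t head;            /* written only by the producer */
>>>   uint32_t tail;            /* written only by the consumer */
>>>   _Atomic uint32_t cursize; /* bytes currently in the fifo */
>>> } toy_fifo_t;
>>>
>>> /* Producer side: returns the number of bytes actually copied in. */
>>> static uint32_t
>>> toy_fifo_enqueue (toy_fifo_t *f, const uint8_t *src, uint32_t len)
>>> {
>>>   uint32_t used = atomic_load_explicit (&f->cursize, memory_order_acquire);
>>>   uint32_t space = f->size - used;
>>>   uint32_t n = len < space ? len : space;
>>>   for (uint32_t i = 0; i < n; i++)
>>>     f->data[(f->head + i) & (f->size - 1)] = src[i];
>>>   f->head = (f->head + n) & (f->size - 1);
>>>   atomic_fetch_add_explicit (&f->cursize, n, memory_order_release);
>>>   return n;
>>> }
>>>
>>> /* Consumer side: returns the number of bytes actually copied out. */
>>> static uint32_t
>>> toy_fifo_dequeue (toy_fifo_t *f, uint8_t *dst, uint32_t len)
>>> {
>>>   uint32_t used = atomic_load_explicit (&f->cursize, memory_order_acquire);
>>>   uint32_t n = len < used ? len : used;
>>>   for (uint32_t i = 0; i < n; i++)
>>>     dst[i] = f->data[(f->tail + i) & (f->size - 1)];
>>>   f->tail = (f->tail + n) & (f->size - 1);
>>>   atomic_fetch_sub_explicit (&f->cursize, n, memory_order_release);
>>>   return n;
>>> }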
>>>
>>> Hope this helps, 
>>> Florin
>>>
>>> [1] https://wiki.fd.io/view/VPP/HostStack/SessionLayerArchitecture 
>>> <https://wiki.fd.io/view/VPP/HostStack/SessionLayerArchitecture>
>>> [2] https://wiki.fd.io/images/1/15/Vpp-hoststack-kc-eu-18.pdf 
>>> <https://wiki.fd.io/images/1/15/Vpp-hoststack-kc-eu-18.pdf>
>>> 
>>>> On May 7, 2018, at 11:35 AM, Luca Muscariello (lumuscar) 
>>>> <lumus...@cisco.com <mailto:lumus...@cisco.com>> wrote:
>>>>
>>>> Florin,
>>>>
>>>> So the TCP stack does not connect to VPP using memif.
>>>> I’ll check the shared memory you mentioned.
>>>>
>>>> For our transport stack we’re using memif. Nothing to 
>>>> do with TCP though.
>>>>
>>>> From iperf3 to VPP there must be copies anyway. 
>>>> There must be some batching with timing, though, 
>>>> while doing these copies.
>>>>
>>>> Is there any doc of svm_fifo usage?
>>>>
>>>> Thanks
>>>> Luca 
>>>> 
>>>> On 7 May 2018, at 20:00, Florin Coras <fcoras.li...@gmail.com 
>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>> 
>>>>> Hi Luca,
>>>>>
>>>>> I guess, as you did, that it’s vectorization. VPP is really good at 
>>>>> pushing packets whereas Linux is good at using all hw optimizations. 
>>>>>
>>>>> The stack uses its own shared memory mechanisms (check svm_fifo_t), but 
>>>>> given that you did the testing with iperf3, I suspect the edge is not 
>>>>> there. That is, I guess they’re not abusing syscalls with lots of small 
>>>>> writes. Moreover, the fifos are not zero-copy: apps do have to write to 
>>>>> the fifo and vpp has to packetize that data. 
>>>>>
>>>>> Florin
>>>>> 
>>>>>> On May 7, 2018, at 10:29 AM, Luca Muscariello (lumuscar) 
>>>>>> <lumus...@cisco.com <mailto:lumus...@cisco.com>> wrote:
>>>>>>
>>>>>> Hi Florin 
>>>>>>
>>>>>> Thanks for the info.
>>>>>>
>>>>>> So, how do you explain that the VPP TCP stack beats the Linux
>>>>>> implementation by doubling the goodput?
>>>>>> Does it come from vectorization? 
>>>>>> Any special memif optimization underneath?
>>>>>>
>>>>>> Luca 
>>>>>> 
>>>>>> On 7 May 2018, at 18:17, Florin Coras <fcoras.li...@gmail.com 
>>>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>>>> 
>>>>>>> Hi Luca, 
>>>>>>>
>>>>>>> We don’t yet support TSO because it requires support within all of vpp 
>>>>>>> (think tunnels). Still, it’s on our list. 
>>>>>>>
>>>>>>> As for crypto offload, we do have support for IPSec offload with QAT 
>>>>>>> cards and we’re now working with Ping and Ray from Intel on 
>>>>>>> accelerating the TLS OpenSSL engine also with QAT cards. 
>>>>>>>
>>>>>>> Regards, 
>>>>>>> Florin
>>>>>>> 
>>>>>>>> On May 7, 2018, at 7:53 AM, Luca Muscariello <lumuscar+f...@cisco.com 
>>>>>>>> <mailto:lumuscar+f...@cisco.com>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> A few questions about the TCP stack and HW offloading.
>>>>>>>> Below is the experiment under test.
>>>>>>>>
>>>>>>>>   +------------+                                        +------------+
>>>>>>>>   |      +-----+                                        +-----+      |
>>>>>>>>   |Iperf3| TCP |  DPDK-10GE  +------------+  DPDK-10GE  | TCP |Iperf3|
>>>>>>>>   |      | VPP +-------------+Nexus Switch+-------------+ VPP |      |
>>>>>>>>   |LXC   +-----+             +------------+             +-----+  LXC |
>>>>>>>>   +------------+                                        +------------+
>>>>>>>>
>>>>>>>>
>>>>>>>> Using the Linux kernel w/ or w/o TSO, I get an iperf3 goodput of 
>>>>>>>> 9.5 Gbps or 4.5 Gbps, respectively.
>>>>>>>> Using the VPP TCP stack I get 9.2 Gbps, i.e. about the same max goodput 
>>>>>>>> as Linux w/ TSO.
>>>>>>>>
>>>>>>>> Is there any TSO implementation already in VPP that one can take 
>>>>>>>> advantage of?
>>>>>>>>
>>>>>>>> Side question: is there any crypto offloading service available in VPP?
>>>>>>>> Essentially for the computation of RSA-1024/2048 and ECDSA 192/256 
>>>>>>>> signatures.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Luca
>>>>>>>>
>>>>>>> 
>>>>>>>
>>>>> 
>>>>>
>>>>> 
