Hi Vijay, 

Yes. Underneath, the fifos maintain a linked list of chunks where the data is 
stored. VCL could provide pointers to those chunks in the form of iovecs, plus 
another API to mark the data as consumed (implicitly releasing the chunks) once 
the app is done reading. But again, apps would have to explicitly use these APIs. 
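
For illustration only, here is a rough sketch of what such a zero-copy read 
path could look like from the application side. The names 
vcl_session_read_iovecs() and vcl_session_consume(), as well as the app-defined 
consume_bytes(), are hypothetical (no such VCL symbols exist today); the point 
is just that the app borrows pointers into the fifo chunks and later tells VCL 
how many bytes it has finished with so the chunks can be recycled:

  #include <stdint.h>
  #include <stddef.h>
  #include <sys/uio.h>   /* struct iovec */

  /* Hypothetical zero-copy read: the VCL calls below do not exist today. */
  static void
  read_zero_copy (uint32_t session_handle)
  {
    struct iovec iov[8];
    size_t consumed = 0;
    int n_iov, i;

    /* VCL would fill iov[] with pointers straight into the rx fifo chunks. */
    n_iov = vcl_session_read_iovecs (session_handle, iov, 8);

    for (i = 0; i < n_iov; i++)
      {
        consume_bytes (iov[i].iov_base, iov[i].iov_len);  /* app-defined */
        consumed += iov[i].iov_len;
      }

    /* Explicitly mark the data as consumed so VCL can release the
     * underlying fifo chunks. */
    vcl_session_consume (session_handle, consumed);
  }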

Regards,
Florin

> On Sep 15, 2020, at 1:46 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
> 
> Hi Florin,
> 
> I got it now.
> 
> Also, I think you mentioned that the VCL library does not yet let the 
> application read/write/free directly from the fifo buffers, but that this 
> could be added with little effort. Is that correct?
> 
> Thanks,
> 
> Vijay
> 
> On Tue, Sep 15, 2020 at 1:31 PM Florin Coras <fcoras.li...@gmail.com 
> <mailto:fcoras.li...@gmail.com>> wrote:
> Hi Vijay, 
> 
> Oh, by no means. Builtin applications, i.e., applications that run within the 
> vpp process, are definitely possible (see plugins/hs_apps/echo_client/server 
> or the proxy). They run “on” the vpp workers and io/ctrl events are delivered 
> by the session layer to those apps using callback functions. However, the 
> session layer exchanges data with them using fifos, not vlib buffers. We 
> might consider offering the option to improve that for low scale and high 
> throughput scenarios, but that’s not possible today. 
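> 
> For orientation, here is a compressed sketch of how a builtin app hooks into 
> the session layer, patterned after plugins/hs_apps (the echo server). Exact 
> struct field and option names may differ slightly between VPP versions, so 
> treat this as approximate rather than definitive:
> 
> /* Callbacks the session layer invokes on the vpp workers. */
> static session_cb_vft_t my_app_cb_vft = {
>   .session_accept_callback = my_accept_callback,
>   .session_disconnect_callback = my_disconnect_callback,
>   .session_connected_callback = my_connected_callback,
>   .builtin_app_rx_callback = my_rx_callback,   /* io (rx) events land here */
> };
> 
> static int
> my_app_attach (void)
> {
>   u64 options[APP_OPTIONS_N_OPTIONS] = { 0 };
>   vnet_app_attach_args_t a = { 0 };
> 
>   a.api_client_index = ~0;              /* builtin apps have no api client */
>   a.name = format (0, "my-builtin-app");
>   a.session_cb_vft = &my_app_cb_vft;
>   a.options = options;
>   a.options[APP_OPTIONS_FLAGS] = APP_OPTIONS_FLAGS_IS_BUILTIN;
>   return vnet_application_attach (&a);
> }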
> 
> Regards,
> Florin
> 
>> On Sep 15, 2020, at 12:23 PM, Vijay Sampath <vsamp...@gmail.com 
>> <mailto:vsamp...@gmail.com>> wrote:
>> 
>> Hi Florin,
>> 
>> Got it. So what you are saying is that TCP applications cannot be linked 
>> directly with VPP. They have to be a separate process and go through the VCL 
>> library, although they can be optimized to avoid 1 extra memcpy. In the 
>> future, the memcpy _may_ be avoided completely, but the applications still 
>> have to reside in a separate process.
>> 
>> Thanks,
>> 
>> Vijay
>> 
>> On Tue, Sep 15, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com 
>> <mailto:fcoras.li...@gmail.com>> wrote:
>> Hi Vijay, 
>> 
>> Currently, builtin applications can only receive data from tcp in a 
>> session’s rx fifo. That’s a deliberate choice because, at scale, out of 
>> order data could end up consuming a lot of buffers, i.e., buffers are queued 
>> but cannot be consumed by the app until the gaps are filled. Still, builtin 
>> apps can avoid the extra memcpy vcl needs to do for traditional apps. 
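>> 
>> As a sketch of that (function names from svm/svm_fifo.h as of recent VPP, so 
>> take them as approximate): the io callback a builtin app registers receives 
>> the session and works on s->rx_fifo directly in shared memory, so an 
>> iperf-like sink can even drop the bytes without any copy at all:
>> 
>> static int
>> my_builtin_rx_callback (session_t * s)
>> {
>>   /* Runs on the vpp worker that owns the session. Data is already in
>>    * the session's rx fifo; no VCL copy into a private app buffer. */
>>   u32 n_avail = svm_fifo_max_dequeue_cons (s->rx_fifo);
>> 
>>   if (n_avail)
>>     /* iperf-style sink: just discard what was received. */
>>     svm_fifo_dequeue_drop (s->rx_fifo, n_avail);
>> 
>>   return 0;
>> }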
>> 
>> Now, there have been talks and we have been considering the option of 
>> linking vlib buffers into the fifos (to avoid the memcpy) but there’s no ETA 
>> for that. 
>> 
>> Regards,
>> Florin
>> 
>>> On Sep 15, 2020, at 11:32 AM, Vijay Sampath <vsamp...@gmail.com 
>>> <mailto:vsamp...@gmail.com>> wrote:
>>> 
>>> Hi Florin,
>>> 
>>> Sure yes, and better still would be for the app to integrate directly with 
>>> VPP to even avoid the shared fifo copy, I assume. It's just that the VCL 
>>> library gives a quick way to get some benchmark numbers with existing 
>>> applications. Thanks for all the help. I have a much better idea now.
>>> 
>>> Thanks,
>>> 
>>> Vijay
>>> 
>>> On Tue, Sep 15, 2020 at 11:25 AM Florin Coras <fcoras.li...@gmail.com 
>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>> Hi Vijay, 
>>> 
>>> Yes, that is the case for this iperf3 test. The data is already in user 
>>> space and could be passed to the app in the form of iovecs to avoid the 
>>> extra memcpy, but the app would need to be changed to release the memory 
>>> whenever it is done reading it. In the case of iperf3 that would be 
>>> immediate, because it simply discards the data. 
>>> 
>>> For completeness, note that we don't currently have VCL APIs to expose the 
>>> fifo chunks as iovecs, but they shouldn't be that difficult to add. 
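>>> 
>>> For contrast, a rough sketch of today's copy-based path: the LD_PRELOAD'ed 
>>> read() that iperf3 calls effectively ends up in vppcom_session_read(), 
>>> which copies from the shared-memory rx fifo into the app's buffer even 
>>> though iperf3 immediately discards the data. The session_handle and the 
>>> scratch buffer below are assumptions of the sketch:
>>> 
>>> #include <vcl/vppcom.h>
>>> 
>>> /* Drain and discard everything received on a session (iperf3 -s style).
>>>  * session_handle would come from an earlier vppcom_session_accept(). */
>>> static void
>>> drain_session (uint32_t session_handle)
>>> {
>>>   char scratch[64 * 1024];
>>>   int n;
>>> 
>>>   while ((n = vppcom_session_read (session_handle, scratch,
>>>                                    sizeof (scratch))) > 0)
>>>     ;  /* data was copied out of the rx fifo only to be dropped */
>>> }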
>>> 
>>> Regards,
>>> Florin
>>> 
>>>> On Sep 15, 2020, at 10:47 AM, Vijay Sampath <vsamp...@gmail.com 
>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>> 
>>>> Hi Florin,
>>>> 
>>>> I just realized that maybe in the VPP case there is an extra copy - once 
>>>> from the mbuf to the shared fifo, and once from the shared fifo to the 
>>>> application buffer. In Linux, there is probably just the one copy from 
>>>> kernel space to user space. Please correct me if I am wrong. If so, what I 
>>>> am doing is not an apples-to-apples comparison.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vijay
>>>> 
>>>> On Tue, Sep 15, 2020 at 8:54 AM Vijay Sampath <vsamp...@gmail.com 
>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>> Hi Florin,
>>>> 
>>>> In the 1 iperf connection test, I get different numbers every time I run. 
>>>> When I ran today:
>>>> 
>>>> - iperf and vpp on the same NUMA node as the PCI device: 50Gbps (although 
>>>> in different runs I also saw 30Gbps)
>>>> - vpp on the same NUMA node as the PCI device, iperf on the other node: 28Gbps
>>>> - vpp and iperf both on the other NUMA node from the PCI device: 36Gbps
>>>> 
>>>> These numbers vary from test to test, but I was never able to get beyond 
>>>> 50G with 10 connections with iperf on the other NUMA node. As I mentioned 
>>>> in the previous email, when I repeat this test with Linux TCP as the 
>>>> server, I am able to get 100G no matter which cores I start iperf on.
>>>> 
>>>> Thanks,
>>>> 
>>>> Vijay
>>>> 
>>>> On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com 
>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>> Hi Vijay, 
>>>> 
>>>> In this sort of setup, with few connections, it's probably inevitable to 
>>>> lose some throughput to the cross-numa memcpy. In your 1 iperf connection 
>>>> test, did you only change iperf's NUMA node, or vpp's worker as well? 
>>>> 
>>>> Regards,
>>>> Florin
>>>> 
>>>>> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com 
>>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>>> 
>>>>> Hi Florin,
>>>>> 
>>>>> I ran some experiments going cross-NUMA, and I see that I am not able to 
>>>>> go beyond 50G. I tried different numbers of worker threads (5, 8 and 10) 
>>>>> and up to 10 iperf servers. I am attaching the show run output with 10 
>>>>> workers. When I run the same experiment in Linux, I don't see a 
>>>>> difference in bandwidth - iperf on both NUMA nodes is able to achieve 
>>>>> 100G. Do you have any suggestions on other experiments to try?
>>>>> 
>>>>> I also tried 1 iperf connection - the bandwidth dropped from 33G to 23G 
>>>>> going from same-NUMA to cross-NUMA placement.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Vijay
>>>>> 
>>>>> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com 
>>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>>> Hi Vijay, 
>>>>> 
>>>>> 
>>>>>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com 
>>>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>>>> 
>>>>>> Hi Florin,
>>>>>> 
>>>>>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com 
>>>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>>>> Hi Vijay, 
>>>>>> 
>>>>>> 
>>>>>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com 
>>>>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> Hi Florin,
>>>>>>> 
>>>>>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com 
>>>>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>>>>> Hi Vijay, 
>>>>>>> 
>>>>>>> Quick replies inline. 
>>>>>>> 
>>>>>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com 
>>>>>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Hi Florin,
>>>>>>>> 
>>>>>>>> Thanks once again for looking at this issue. Please see inline:
>>>>>>>> 
>>>>>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com 
>>>>>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>>>>>> Hi Vijay, 
>>>>>>>> 
>>>>>>>> Inline.
>>>>>>>> 
>>>>>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com 
>>>>>>>>> <mailto:vsamp...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Florin,
>>>>>>>>> 
>>>>>>>>> Thanks for the response. Please see inline:
>>>>>>>>> 
>>>>>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com 
>>>>>>>>> <mailto:fcoras.li...@gmail.com>> wrote:
>>>>>>>>> Hi Vijay, 
>>>>>>>>> 
>>>>>>>>> Cool experiment. More inline. 
>>>>>>>>> 
>>>>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com 
>>>>>>>>> > <mailto:vsamp...@gmail.com>> wrote:
>>>>>>>>> > 
>>>>>>>>> > Hi,
>>>>>>>>> > 
>>>>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine 
>>>>>>>>> > connected to another server running VPP using 100G NICs. Both 
>>>>>>>>> > servers are Intel Xeon with 24 cores.
>>>>>>>>> 
>>>>>>>>> May I ask the frequency for those cores? Also what type of nic are 
>>>>>>>>> you using?
>>>>>>>>> 
>>>>>>>>> 2700 MHz. 
>>>>>>>> 
>>>>>>>> Probably this somewhat limits throughput per single connection 
>>>>>>>> compared to my testbed where the Intel cpu boosts to 4GHz. 
>>>>>>>>  
>>>>>>>> Please see below, I noticed an anomaly. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> The nic is a Pensando DSC100.
>>>>>>>> 
>>>>>>>> Okay, not sure what to expect there. Since this mostly stresses the rx 
>>>>>>>> side, what's the number of rx descriptors? Typically I test with 256; 
>>>>>>>> with more connections and higher throughput you might need more. 
>>>>>>>>  
>>>>>>>> This is the default - the comments seem to suggest it is 1024. I don't 
>>>>>>>> see any rx queue empty errors on the nic, which probably means there 
>>>>>>>> are sufficient buffers. 
>>>>>>> 
>>>>>>> Reasonable. Might want to try to reduce it down to 256 but performance 
>>>>>>> will depend a lot on other things as well. 
>>>>>>> 
>>>>>>> This seems to help, but I do get rx queue empty nic drops. More below.
>>>>>> 
>>>>>> That’s somewhat expected to happen either when 1) the peer tries to 
>>>>>> probe for more throughput and bursts a bit more than we can handle, or 
>>>>>> 2) a full vpp dispatch takes too long, which could happen because of the 
>>>>>> memcpy in tcp-established. 
>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by 
>>>>>>>>> > starting 10 parallel iperf connections on different port numbers 
>>>>>>>>> > and pinning them to different cores on the sender side. On the VPP 
>>>>>>>>> > receiver side I have 10 worker threads and 10 rx-queues in dpdk, 
>>>>>>>>> > and running iperf3 using VCL library as follows
>>>>>>>>> > 
>>>>>>>>> > taskset 0x00400 sh -c 
>>>>>>>>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so 
>>>>>>>>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>>>>> > taskset 0x00800 sh -c 
>>>>>>>>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so 
>>>>>>>>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>>>>> > ...
>>>>>>>>> > 
>>>>>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>>>>>>> > 
>>>>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>>>>> > ...
>>>>>>>>> 
>>>>>>>>> Could you try first with only 1 iperf server/client pair, just to see 
>>>>>>>>> where performance is with that? 
>>>>>>>>> 
>>>>>>>>> These are the numbers I get
>>>>>>>>> rx-fifo-size 65536: ~8G
>>>>>>>>> rx-fifo-size 524288: 22G
>>>>>>>>> rx-fifo-size 4000000: 25G
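>>>>>>>>>
>>>>>>>>> (For reference, these are the values VCL picks up from the file 
>>>>>>>>> pointed to by VCL_CONFIG; a minimal sketch of the relevant vcl.conf 
>>>>>>>>> stanza, with the other keys elided:)
>>>>>>>>>
>>>>>>>>> vcl {
>>>>>>>>>   rx-fifo-size 4000000
>>>>>>>>>   tx-fifo-size 4000000
>>>>>>>>> }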
>>>>>>>> 
>>>>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the 
>>>>>>>> vector rate (and the errors) in this case also?  
>>>>>>>> 
>>>>>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve 
>>>>>>>> performance. Not sure if this is an issue with the dpdk driver for the 
>>>>>>>> nic. Anyway in the "show hardware" flags I see now that tcp and udp 
>>>>>>>> checksum offloads are enabled:
>>>>>>>> 
>>>>>>>> root@server:~# vppctl show hardware
>>>>>>>>               Name                Idx   Link  Hardware
>>>>>>>> eth0                               1     up   dsc1
>>>>>>>>   Link speed: 100 Gbps
>>>>>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>>>>>   ### UNKNOWN ###
>>>>>>>>     carrier up full duplex mtu 9000
>>>>>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>>>>>     Devargs:
>>>>>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 
>>>>>>>> numa 0
>>>>>>>>     max rx packet len: 9208
>>>>>>>>     promiscuous: unicast off all-multicast on
>>>>>>>>     vlan offload: strip off filter off qinq off
>>>>>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum 
>>>>>>>> vlan-filter
>>>>>>>>                        jumbo-frame scatter
>>>>>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame 
>>>>>>>> scatter
>>>>>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum 
>>>>>>>> tcp-tso
>>>>>>>>                        outer-ipv4-cksum multi-segs mbuf-fast-free 
>>>>>>>> outer-udp-cksum
>>>>>>>>     tx offload active: multi-segs
>>>>>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>>     tx burst function: ionic_xmit_pkts
>>>>>>>>     rx burst function: ionic_recv_pkts
>>>>>>>> 
>>>>>>>> With this I get better performance per iperf3 connection - about 
>>>>>>>> 30.5G. Show run output attached (1connection.txt)
>>>>>>> 
>>>>>>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when 
>>>>>>> possible, but it currently (unless some of the pending patches were 
>>>>>>> merged) does not mark the packet appropriately, so ip4-local will 
>>>>>>> recompute/validate the checksum. From your logs, it seems ip4-local 
>>>>>>> needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 
>>>>>>> connections. That’s a lot, so it seems to confirm that the checksum is 
>>>>>>> recomputed. 
>>>>>>> 
>>>>>>> So it’s somewhat counterintuitive that performance improves. How do the 
>>>>>>> show run numbers change? It could be that performance worsens because 
>>>>>>> of tcp’s congestion recovery/flow control, i.e., the packets are 
>>>>>>> processed faster but some component starts dropping or its queues get 
>>>>>>> full. 
>>>>>>> 
>>>>>>> That's interesting. I got confused by the "show hardware" output since 
>>>>>>> it doesn't show any output against "tx offload active". You are right, 
>>>>>>> though it definitely uses fewer cycles without this option present, so I 
>>>>>>> took it out for further tests. I am attaching the show run output for 
>>>>>>> both the 1 connection and 7 connection cases without this option present. 
>>>>>>> With 1 connection, it appears VPP is not loaded at all since there is 
>>>>>>> no batching happening? 
>>>>>> 
>>>>>> That’s probably because you’re using 9kB frames. It’s practically 
>>>>>> equivalent to LRO so vpp doesn’t need to work too much. Did throughput 
>>>>>> increase at all?
>>>>>> 
>>>>>> Throughput varied between 26-30G.
>>>>> 
>>>>> Sounds reasonable for the cpu frequency. 
>>>>> 
>>>>>>  
>>>>>> 
>>>>>>> With 7 connections I do see it getting around 90-92G. When I drop the 
>>>>>>> rx queue to 256, I do see some nic drops, but performance improves and 
>>>>>>> I am getting 99G now. 
>>>>>> 
>>>>>> Awesome!
>>>>>> 
>>>>>>> Can you please explain why this makes a difference? Does it have to do 
>>>>>>> with caches?
>>>>>> 
>>>>>> There are probably several things at play. First of all, we back-pressure 
>>>>>> the sender with minimal cost, i.e., we minimize the data that we queue 
>>>>>> and just drop as soon as we run out of space. So instead of trying to 
>>>>>> buffer large bursts and deal with them later, we force the sender to 
>>>>>> drop its rate. Second, as you already guessed, this probably improves 
>>>>>> cache utilization because we end up touching fewer buffers. 
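>>>>>>
>>>>>> (For anyone reproducing this: the descriptor count is the per-device 
>>>>>> num-rx-desc knob in the dpdk section of startup.conf. A sketch, with the 
>>>>>> PCI address taken from the show hardware output above and the queue 
>>>>>> count matching the 10-worker setup, both to be adjusted as needed:)
>>>>>>
>>>>>> dpdk {
>>>>>>   dev 0000:15:00.0 {
>>>>>>     num-rx-queues 10
>>>>>>     num-rx-desc 256
>>>>>>   }
>>>>>> }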
>>>>>> 
>>>>>> I see. I was trying to accomplish something similar by limiting the 
>>>>>> rx-fifo-size (rmem in linux) for each connection. So there is no issue 
>>>>>> with the ring size being equal to the VPP batch size? While VPP is 
>>>>>> working on a batch, what happens if more packets come in?
>>>>> 
>>>>> They will be dropped. Typically tcp pacing should make sure that packets 
>>>>> are not delivered in bursts; instead they’re spread over an RTT. For 
>>>>> instance, see how small the vector rate is for 1 connection. Even if you 
>>>>> multiply it by 4 (to reach 100Gbps), the vector rate is still small. 
>>>>> 
>>>>>>  
>>>>>> 
>>>>>>> 
>>>>>>> Are the other cores kind of unusable now due to being on a different 
>>>>>>> numa? With Linux TCP, I believe I was able to use most of the cores and 
>>>>>>> scale the number of connections. 
>>>>>> 
>>>>>> They’re all usable but it’s just that cross-numa memcpy is more 
>>>>>> expensive (session layer buffers the data for the apps in the shared 
>>>>>> memory fifos). As the sessions are scaled up, each session will carry 
>>>>>> less data, so moving some of them to the other numa should not be a 
>>>>>> problem. But it all ultimately depends on the efficiency of the UPI 
>>>>>> interconnect. 
>>>>>> 
>>>>>> 
>>>>>> Sure, I will try these experiments.
>>>>> 
>>>>> Sounds good. Let me know how it goes. 
>>>>> 
>>>>> Regards,
>>>>> Florin
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Vijay
>>>>> 
>>>>> <show_run_10_conn_cross_numa.txt>
>>>> 
>>> 
>> 
> 
