Hi Vijay, 

In this sort of setup, with few connections, some throughput loss due to the 
cross-numa memcpy is probably inevitable. In your 1 iperf connection test, did 
you change only iperf’s numa node, or vpp’s worker as well?
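
For reference, the first thing to check is that both the vpp workers and the 
iperf processes sit on the nic’s numa node. A minimal sketch (core ids are 
placeholders; pick them from lscpu and the nic’s numa line in "show hardware"):

  # /etc/vpp/startup.conf: pin main + workers to numa 0 cores
  cpu {
    main-core 1
    corelist-workers 2-6
  }

  # run the iperf server on a numa 0 core as well
  taskset -c 8 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so \
      VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000"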

Regards,
Florin

> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
> 
> Hi Florin,
> 
> I ran some experiments going cross numa, and I see that I am not able to go 
> beyond 50G. I tried different numbers of worker threads (5, 8 and 10), and 
> went up to 10 iperf servers. I am attaching the show run output with 10 
> workers. When I run the same experiment in Linux, I don't see a difference in 
> bandwidth - iperf servers on both numa nodes are able to reach 100G. Do you 
> have any suggestions on other experiments to try?
> 
> I also tried 1 iperf connection - the bandwidth dropped from 33G to 23G when 
> going from a same-numa core to a different-numa one.
> 
> Thanks,
> 
> Vijay
> 
> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Vijay, 
> 
> 
>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>> 
>> Hi Florin,
>> 
>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>> Hi Vijay, 
>> 
>> 
>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>> 
>>> Hi Florin,
>>> 
>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>> Hi Vijay, 
>>> 
>>> Quick replies inline. 
>>> 
>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>> 
>>>> Hi Florin,
>>>> 
>>>> Thanks once again for looking at this issue. Please see inline:
>>>> 
>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>> Hi Vijay, 
>>>> 
>>>> Inline.
>>>> 
>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>> 
>>>>> Hi Florin,
>>>>> 
>>>>> Thanks for the response. Please see inline:
>>>>> 
>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>> Hi Vijay, 
>>>>> 
>>>>> Cool experiment. More inline. 
>>>>> 
>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>> > 
>>>>> > Hi,
>>>>> > 
>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine 
>>>>> > connected to another server running VPP using 100G NICs. Both servers 
>>>>> > are Intel Xeon with 24 cores.
>>>>> 
>>>>> May I ask the frequency for those cores? Also what type of nic are you 
>>>>> using?
>>>>> 
>>>>> 2700 MHz. 
>>>> 
>>>> This probably limits per-connection throughput somewhat compared to my 
>>>> testbed, where the Intel cpu boosts to 4GHz. 
>>>>  
>>>> Please see below, I noticed an anomaly. 
>>>> 
>>>> 
>>>>> The nic is a Pensando DSC100.
>>>> 
>>>> Okay, not sure what to expect there. Since this mostly stresses the rx 
>>>> side, what’s the number of rx descriptors? Typically I test with 256; with 
>>>> more connections and higher throughput you might need more. 
>>>>  
>>>> This is the default - the comments seem to suggest that's 1024. I don't 
>>>> see any rx queue empty errors on the nic, which probably means there are 
>>>> sufficient buffers. 
>>> 
>>> Reasonable. You might want to try reducing it to 256, but performance will 
>>> depend a lot on other things as well. 
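>>> 
>>> In case it helps, the descriptor and queue counts can be set per nic in the 
>>> dpdk stanza of startup.conf. A sketch (pci address as in your show hardware 
>>> output, queue count matching your workers):
>>> 
>>>   dpdk {
>>>     dev 0000:15:00.0 {
>>>       num-rx-queues 10
>>>       num-rx-desc 256
>>>     }
>>>   }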
>>> 
>>> This seems to help, but I do get rx queue empty nic drops. More below.
>> 
>> That’s somewhat expected to happen either when 1) the peer tries to probe 
>> for more throughput and bursts a bit more than we can handle, or 2) a full 
>> vpp dispatch takes too long, which could happen because of the memcpy in 
>> tcp-established. 
>> 
>>>  
>>> 
>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by 
>>>>> > starting 10 parallel iperf connections on different port numbers and 
>>>>> > pinning them to different cores on the sender side. On the VPP receiver 
>>>>> > side I have 10 worker threads and 10 rx-queues in dpdk, and I am running 
>>>>> > iperf3 using the VCL library as follows:
>>>>> > 
>>>>> > taskset 0x00400 sh -c 
>>>>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so 
>>>>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>> > taskset 0x00800 sh -c 
>>>>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so 
>>>>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>> > ...
>>>>> > 
>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>>> > 
>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>> > ...
>>>>> 
>>>>> Could you try first with only 1 iperf server/client pair, just to see 
>>>>> where performance is with that? 
>>>>> 
>>>>> These are the numbers I get
>>>>> rx-fifo-size 65536: ~8G
>>>>> rx-fifo-size 524288: 22G
>>>>> rx-fifo-size 4000000: 25G
>>>> 
>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector 
>>>> rate (and the errors) in this case also?  
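>>>> 
>>>> (For reference, the fifo sizes are just the vcl.conf knobs, e.g. a sketch 
>>>> with the 4MB value:
>>>> 
>>>>   vcl {
>>>>     rx-fifo-size 4000000
>>>>     tx-fifo-size 4000000
>>>>   }
>>>> 
>>>> and the vector rates and errors are what "vppctl show run" and "vppctl 
>>>> show errors" report.)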
>>>> 
>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve 
>>>> performance. Not sure if this is an issue with the dpdk driver for the 
>>>> nic. Anyway in the "show hardware" flags I see now that tcp and udp 
>>>> checksum offloads are enabled:
>>>> 
>>>> root@server:~# vppctl show hardware
>>>>               Name                Idx   Link  Hardware
>>>> eth0                               1     up   dsc1
>>>>   Link speed: 100 Gbps
>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>   ### UNKNOWN ###
>>>>     carrier up full duplex mtu 9000
>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>     Devargs:
>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>>>     max rx packet len: 9208
>>>>     promiscuous: unicast off all-multicast on
>>>>     vlan offload: strip off filter off qinq off
>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum 
>>>> vlan-filter
>>>>                        jumbo-frame scatter
>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
>>>>                        outer-ipv4-cksum multi-segs mbuf-fast-free 
>>>> outer-udp-cksum
>>>>     tx offload active: multi-segs
>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>     tx burst function: ionic_xmit_pkts
>>>>     rx burst function: ionic_recv_pkts
>>>> 
>>>> With this I get better performance per iperf3 connection - about 30.5G. 
>>>> Show run output attached (1connection.txt)
>>> 
>>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when 
>>> possible, but it currently (unless some of the pending patches were merged) 
>>> does not mark the packet appropriately, so ip4-local will 
>>> recompute/validate the checksum. From your logs, it seems ip4-local needs 
>>> ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 connections. 
>>> That’s a lot, so it seems to confirm that the checksum is recomputed. 
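>>> 
>>> (For anyone following along, the option being discussed lives in the dpdk 
>>> stanza of startup.conf, roughly:
>>> 
>>>   dpdk {
>>>     enable-tcp-udp-checksum
>>>   }
>>> 
>>> and the per-node cycle numbers above are the clocks column that "vppctl 
>>> show run" reports.)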
>>> 
>>> So it’s somewhat counterintuitive that performance improves. How do the 
>>> show run numbers change? It could be that performance worsens because of 
>>> tcp’s congestion recovery/flow control, i.e., the packets are processed 
>>> faster but some component starts dropping/queues fill up. 
>>> 
>>> That's interesting. I got confused by the "show hardware" output since it 
>>> doesn't show any output against "tx offload active". You are right, though: 
>>> it definitely uses fewer cycles without this option present, so I took it 
>>> out for further tests. I am attaching the show run output for both the 1 
>>> connection and 7 connection cases without this option present. With 1 
>>> connection, it appears VPP is not loaded at all, since there is no batching 
>>> happening? 
>> 
>> That’s probably because you’re using 9kB frames. It’s practically equivalent 
>> to LRO so vpp doesn’t need to work too much. Did throughput increase at all?
>> 
>> Throughput varied between 26-30G.
> 
> Sounds reasonable for the cpu frequency. 
> 
>>  
>> 
>>> With 7 connections I do see it getting around 90-92G. When I drop the rx 
>>> queue to 256, I do see some nic drops, but performance improves and I am 
>>> getting 99G now. 
>> 
>> Awesome!
>> 
>>> Can you please explain why this makes a difference? Does it have to do with 
>>> caches?
>> 
>> There are probably several things at play. First of all, we back-pressure 
>> the sender with minimal cost, i.e., we minimize the data that we queue and 
>> we just drop as soon as we run out of space. So instead of trying to buffer 
>> large bursts and deal with them later, we force the sender to drop its rate. 
>> Second, as you already guessed, this probably improves cache utilization 
>> because we end up touching fewer buffers. 
>> 
>> I see. I was trying to accomplish something similar by limiting the 
>> rx-fifo-size (rmem in linux) for each connection. So there is no issue with 
>> the ring size being equal to the VPP batch size? While VPP is working on a 
>> batch, what happens if more packets come in?
> 
> They will be dropped. Typically tcp pacing should make sure that packets are 
> not delivered in bursts; instead they’re spread over an rtt. For instance, 
> see how small the vector rate is for 1 connection. Even if you multiply it by 
> 4 (to reach 100Gbps), the vector rate is still small. 
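> 
> Back-of-envelope, assuming ~9kB frames: 25Gbps is roughly 25e9 / 8 / 9000, 
> i.e. about 350k packets/s, and even 100Gbps is only ~1.4M packets/s, so with 
> vpp polling at a few hundred thousand dispatches per second the average 
> vector stays far below the 256 packets/frame limit. 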
> 
>>  
>> 
>>> 
>>> Are the other cores kind of unusable now due to being on a different numa? 
>>> With Linux TCP, I believe I was able to use most of the cores and scale the 
>>> number of connections. 
>> 
>> They’re all usable; it’s just that cross-numa memcpy is more expensive (the 
>> session layer buffers the data for the apps in shared memory fifos). As the 
>> sessions are scaled up, each session will carry less data, so moving some of 
>> them to the other numa should not be a problem. But it all ultimately 
>> depends on the efficiency of the UPI interconnect. 
>> 
>> 
>> Sure, I will try these experiments.
> 
> Sounds good. Let me know how it goes. 
> 
> Regards,
> Florin
> 
>> 
>> Thanks,
>> 
>> Vijay
> 
> <show_run_10_conn_cross_numa.txt>
