Hi Vijay, 

> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
> 
> Hi Florin,
> 
> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Vijay, 
> 
> Quick replies inline. 
> 
>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>> 
>> Hi Florin,
>> 
>> Thanks once again for looking at this issue. Please see inline:
>> 
>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>> Hi Vijay, 
>> 
>> Inline.
>> 
>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>> 
>>> Hi Florin,
>>> 
>>> Thanks for the response. Please see inline:
>>> 
>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>> Hi Vijay, 
>>> 
>>> Cool experiment. More inline. 
>>> 
>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>> > 
>>> > Hi,
>>> > 
>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected 
>>> > to another server running VPP using 100G NICs. Both servers are Intel 
>>> > Xeon with 24 cores.
>>> 
>>> May I ask the frequency for those cores? Also what type of nic are you 
>>> using?
>>> 
>>> 2700 MHz. 
>> 
>> This probably limits per-connection throughput somewhat compared to my 
>> testbed, where the Intel cpu boosts to 4GHz. 
>>  
>> Please see below; I noticed an anomaly. 
>> 
>> 
>>> The nic is a Pensando DSC100.
>> 
>> Okay, not sure what to expect there. Since this mostly stresses the rx side, 
>> what's the number of rx descriptors? Typically I test with 256; with more 
>> connections and higher throughput you might need more. 
>>  
>> This is the default - the comments seem to suggest that it is 1024. I don't 
>> see any rx queue empty errors on the nic, which probably means there are 
>> sufficient buffers. 
> 
> Reasonable. Might want to try reducing it to 256, but performance will depend 
> a lot on other things as well. 
> 
> This seems to help, but I do get rx queue empty nic drops. More below.

That's somewhat expected to happen either when 1) the peer tries to probe for 
more throughput and bursts a bit more than we can handle, or 2) a full vpp 
dispatch takes too long, which can happen because of the memcpy in tcp-established. 
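
In case it helps to tell the two apart: a rough check (assuming the standard 
debug CLI) is to clear the counters, let the test run for a few seconds, and 
then compare the per-node clocks with the drop counters, along the lines of: 

  vppctl clear runtime && vppctl clear errors
  # ... let iperf3 run for a few seconds ...
  vppctl show runtime            # clocks/vectors for tcp4-established, ip4-local
  vppctl show errors             # node-level drops
  vppctl show hardware detail    # nic rx queue empty stats, driver permitting

If tcp4-established clocks spike at the same time the nic reports rx queue 
empty, the long dispatch is the more likely culprit; otherwise it is probably 
just the sender probing. 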

>  
> 
>>> > I am trying to push 100G traffic from the iperf Linux TCP client by 
>>> > starting 10 parallel iperf connections on different port numbers and 
>>> > pinning them to different cores on the sender side. On the VPP receiver 
>>> > side I have 10 worker threads and 10 rx-queues in dpdk, and running 
>>> > iperf3 using VCL library as follows
>>> > 
>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>> > ...
>>> > 
>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>> > 
>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>> > ...
>>> 
>>> Could you try first with only 1 iperf server/client pair, just to see where 
>>> performance is with that? 
>>> 
>>> These are the numbers I get:
>>> rx-fifo-size 65536: ~8G
>>> rx-fifo-size 524288: 22G
>>> rx-fifo-size 4000000: 25G
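
(For reference, those fifo sizes come from the vcl.conf passed via VCL_CONFIG. 
A minimal sketch of the relevant stanza for the 4MB case, with segment sizes 
that are purely illustrative and only need to be large enough to back the 
fifos of all sessions, assuming I remember the option names right: 

  vcl {
    rx-fifo-size 4000000
    tx-fifo-size 4000000
    segment-size 1000000000
    add-segment-size 134217728
  }
) 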
>> 
>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector 
>> rate (and the errors) in this case also?  
>> 
>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve 
>> performance. Not sure if this is an issue with the dpdk driver for the nic. 
>> Anyway, in the "show hardware" flags I now see that tcp and udp checksum 
>> offloads are enabled:
>> 
>> root@server:~# vppctl show hardware
>>               Name                Idx   Link  Hardware
>> eth0                               1     up   dsc1
>>   Link speed: 100 Gbps
>>   Ethernet address 00:ae:cd:03:79:51
>>   ### UNKNOWN ###
>>     carrier up full duplex mtu 9000
>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>     Devargs:
>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>     max rx packet len: 9208
>>     promiscuous: unicast off all-multicast on
>>     vlan offload: strip off filter off qinq off
>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
>>                        jumbo-frame scatter
>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
>>                        outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>     tx offload active: multi-segs
>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>     tx burst function: ionic_xmit_pkts
>>     rx burst function: ionic_recv_pkts
>> 
>> With this I get better performance per iperf3 connection - about 30.5G. Show 
>> run output attached (1connection.txt)
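
(Side note for anyone reproducing this: "enable-tcp-udp-checksum" is a dpdk 
stanza option in startup.conf, so what I assume was toggled here is nothing 
more than: 

  dpdk {
    enable-tcp-udp-checksum
  }
) 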
> 
> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when possible, 
> but currently (unless some of the pending patches were merged) it does not 
> mark the packet appropriately, so ip4-local will recompute/validate the 
> checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the 
> 1-connection setup and ~3.1e3 with 7 connections. That's a lot, so it seems 
> to confirm that the checksum is being recomputed. 
> 
> So it's somewhat counterintuitive that performance improves. How do the show 
> run numbers change? It could be that performance worsens because of tcp's 
> congestion recovery/flow control, i.e., the packets are processed faster but 
> some component starts dropping or queues fill up. 
> 
> That's interesting. I got confused by the "show hardware" output since it 
> doesn't show any output against "tx offload active". You are right, though: it 
> definitely uses fewer cycles without this option present, so I took it out for 
> further tests. I am attaching the show run output for both the 1-connection 
> and 7-connection cases without this option. With 1 connection, it appears VPP 
> is not loaded at all since there is no batching happening?

That’s probably because you’re using 9kB frames. It’s practically equivalent to 
LRO so vpp doesn’t need to work too much. Did throughput increase at all?
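
As a rough back-of-the-envelope check: at ~30 Gbps with ~9000-byte frames that 
is only about 30e9 / (9000 * 8) ≈ 400K packets/s per connection, which a single 
worker drains without ever building up vectors, hence presumably the ~1 
vectors/call you see in show run. 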

> With 7 connections I do see it getting around 90-92G. When I drop the rx 
> queue size to 256, I do see some nic drops, but performance improves and I am 
> getting 99G now.

Awesome!

> Can you please explain why this makes a difference? Does it have to do with 
> caches?

There are probably several things at play. First of all, we back-pressure the 
sender at minimal cost, i.e., we minimize the data that we queue and we just 
drop as soon as we run out of space. So instead of trying to buffer large 
bursts and deal with them later, we force the sender to lower its rate. Second, 
as you already guessed, this probably improves cache utilization because we end 
up touching fewer buffers. 
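
In case it's useful, the descriptor count can be pinned in the dpdk stanza of 
startup.conf; a minimal sketch, with the queue count of 10 matching your worker 
count as an assumption: 

  dpdk {
    dev default {
      num-rx-desc 256
      num-rx-queues 10
    }
  }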

> 
> Are the other cores kind of unusable now due to being on a different numa 
> node? With Linux TCP, I believe I was able to use most of the cores and scale 
> the number of connections.

They're all usable; it's just that cross-numa memcpy is more expensive (the 
session layer buffers the data for the apps in shared-memory fifos). As the 
number of sessions is scaled up, each session will carry less data, so moving 
some of them to the other numa node should not be a problem. But it all 
ultimately depends on the efficiency of the UPI interconnect. 
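
If you do want to keep all workers on the nic's numa node for now, that is just 
the cpu stanza in startup.conf; a minimal sketch, assuming the nic sits on numa 
0 and cores 1-11 are on that socket: 

  cpu {
    main-core 1
    corelist-workers 2-11
  }

"vppctl show threads" should confirm where the workers actually landed. 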

Regards,
Florin 

> Anyway, it is good that I can get close to line rate. I will try more 
> experiments and see. Thanks for your help.
> 
> Thanks,
> 
> Vijay
> <show_run_1connection.txt><show_run_7connection.txt><show_run_7connection_256rxqueue.txt>
