Hi Florin,

On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Vijay,
>
> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi Vijay,
>>
>> Quick replies inline.
>>
>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> Thanks once again for looking at this issue. Please see inline:
>>
>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi Vijay,
>>>
>>> Inline.
>>>
>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> Thanks for the response. Please see inline:
>>>
>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> Cool experiment. More inline.
>>>>
>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>>>>
>>>> May I ask the frequency for those cores? Also, what type of nic are you using?
>>>
>>> 2700 MHz.
>>>
>>> Probably this somewhat limits throughput per single connection compared to my testbed, where the Intel cpu boosts to 4GHz.
>>
>> Please see below, I noticed an anomaly.
>>
>>> The nic is a Pensando DSC100.
>>>
>>> Okay, not sure what to expect there. Since this mostly stresses the rx side, what's the number of rx descriptors? Typically I test with 256; with more connections/higher throughput you might need more.
>>
>> This is the default - comments seem to suggest that is 1024. I don't see any rx queue empty errors on the nic, which probably means there are sufficient buffers.
>>
>> Reasonable. Might want to try to reduce it down to 256, but performance will depend a lot on other things as well.
>
> This seems to help, but I do get rx queue empty nic drops. More below.
>
> That's somewhat expected to happen either when 1) the peer tries to probe for more throughput and bursts a bit more than we can handle, or 2) a full vpp dispatch takes too long, which could happen because of the memcpy in tcp-established.
>
>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side. On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and I am running iperf3 using the VCL library as follows:
>>>> >
>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>> > ...
>>>> >
>>>> > MTU is set to 9216 everywhere, and TCP MSS is set to 8200 on the client:
>>>> >
>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>> > ...
>>>>
>>>> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
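As an aside for anyone reproducing this setup: the rx queue and descriptor counts discussed above come from the dpdk section of startup.conf. A rough sketch of the stanza in use is below; the PCI address matches the show hardware output further down, the core layout is only an assumption for a 10-worker setup, and exact option support can vary with the VPP version.

    dpdk {
      # per-device queue/descriptor settings for the 100G port
      dev 0000:15:00.0 {
        num-rx-queues 10
        num-tx-queues 10
        # default is 1024; 256 is the value suggested above
        num-rx-desc 256
        num-tx-desc 256
      }
      # enable-tcp-udp-checksum
    }

    cpu {
      main-core 1
      # assumption: 10 workers to match the 10 rx queues
      corelist-workers 2-11
    }

Since the default buffer data size is smaller than the 9216-byte MTU, received jumbo frames will span multiple buffers, which is why the maybe-multiseg and scatter flags show up in the show hardware output below.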
>>> These are the numbers I get:
>>> rx-fifo-size 65536: ~8G
>>> rx-fifo-size 524288: 22G
>>> rx-fifo-size 4000000: 25G
>>>
>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?
>>
>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve performance. Not sure if this is an issue with the dpdk driver for the nic. Anyway, in the "show hardware" flags I see now that tcp and udp checksum offloads are enabled:
>>
>> root@server:~# vppctl show hardware
>>               Name                Idx   Link  Hardware
>> eth0                               1     up   dsc1
>>   Link speed: 100 Gbps
>>   Ethernet address 00:ae:cd:03:79:51
>>   ### UNKNOWN ###
>>     carrier up full duplex mtu 9000
>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>     Devargs:
>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>     max rx packet len: 9208
>>     promiscuous: unicast off all-multicast on
>>     vlan offload: strip off filter off qinq off
>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter jumbo-frame scatter
>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>     tx offload active: multi-segs
>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>     tx burst function: ionic_xmit_pkts
>>     rx burst function: ionic_recv_pkts
>>
>> With this I get better performance per iperf3 connection - about 30.5G. Show run output attached (1connection.txt).
>>
>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when possible, but it currently (unless some of the pending patches were merged) does not mark the packet appropriately, so ip4-local will recompute/validate the checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 connections. That's a lot, so it seems to confirm that the checksum is recomputed.
>>
>> So, it's somewhat counterintuitive that performance improves. How do the show run numbers change? Could be that performance worsens because of tcp's congestion recovery/flow control, i.e., the packets are processed faster but some component starts dropping/queues get full.
>
> That's interesting. I got confused by the "show hardware" output since it doesn't show any output against "tx offload active". You are right, though; it definitely uses fewer cycles without this option present, so I took it out for further tests. I am attaching the show run output for both the 1 connection and 7 connection cases without this option present. With 1 connection, it appears VPP is not loaded at all, since there is no batching happening?
>
> That's probably because you're using 9kB frames. It's practically equivalent to LRO, so vpp doesn't need to work too much. Did throughput increase at all?

Throughput varied between 26-30G.

> With 7 connections I do see it getting around 90-92G. When I drop the rx queue to 256, I do see some nic drops, but performance improves and I am getting 99G now.
>
> Awesome!
>
> Can you please explain why this makes a difference? Does it have to do with caches?
>
> There's probably several things at play.
> First of all, we back pressure the sender with minimal cost, i.e., we minimize the data that we queue and we just drop as soon as we run out of space. So instead of us trying to buffer large bursts and deal with them later, we force the sender to drop the rate. Second, as you already guessed, this probably improves cache utilization because we end up touching fewer buffers.

I see. I was trying to accomplish something similar by limiting the rx-fifo-size (rmem in linux) for each connection. So there is no issue with the ring size being equal to the VPP batch size? While VPP is working on a batch, what happens if more packets come in?

> Are the other cores kind of unusable now due to being on a different numa? With Linux TCP, I believe I was able to use most of the cores and scale the number of connections.
>
> They're all usable, but it's just that cross-numa memcpy is more expensive (the session layer buffers the data for the apps in the shared memory fifos). As the sessions are scaled up, each session will carry less data, so moving some of them to the other numa should not be a problem. But it all ultimately depends on the efficiency of the UPI interconnect.

Sure, I will try these experiments.

Thanks,
Vijay
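P.S. In case it is useful to anyone else on the list, here is a minimal example of the vcl.conf knobs behind the rx-fifo-size sweep above, shown with the 4MB values. This is only a sketch: the api socket path and the scope settings are assumptions that depend on how vpp is configured, and option names can differ between VPP versions.

    vcl {
      # 4MB fifos - the sweet spot found above
      rx-fifo-size 4000000
      tx-fifo-size 4000000
      app-scope-local
      app-scope-global
      api-socket-name /run/vpp/api.sock
    }

After changing it, the per-node cost can be rechecked while the test is running with "vppctl clear runtime" followed by "vppctl show runtime" and "vppctl show errors".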