Hi Florin,

On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com>
wrote:

> Hi Vijay,
>
>
> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Hi Vijay,
>>
>> Quick replies inline.
>>
>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> Thanks once again for looking at this issue. Please see inline:
>>
>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Vijay,
>>>
>>> Inline.
>>>
>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> Thanks for the response. Please see inline:
>>>
>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> Cool experiment. More inline.
>>>>
>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine
>>>> connected to another server running VPP using 100G NICs. Both servers are
>>>> Intel Xeon with 24 cores.
>>>>
>>>> May I ask the frequency for those cores? Also what type of nic are you
>>>> using?
>>>>
>>>
>>> 2700 MHz.
>>>
>>>
>>> Probably this somewhat limits throughput per single connection compared
>>> to my testbed where the Intel cpu boosts to 4GHz.
>>>
>>
>> Please see below, I noticed an anomaly.
>>
>>
>>> The nic is a Pensando DSC100.
>>>
>>>
>>> Okay, not sure what to expect there. Since this mostly stresses the rx
>>> side, what’s the number of rx descriptors? Typically I test with 256; with
>>> more connections and higher throughput you might need more.
>>>
>>
>> This is the default - comments seem to suggest that it is 1024. I don't see
>> any rx queue empty errors on the nic, which probably means there are
>> sufficient buffers.
>>
>>
>> Reasonable. Might want to try reducing it to 256, but performance will
>> depend a lot on other things as well.
>>
>
> This seems to help, but I do get rx queue empty nic drops. More below.
>
>
> That’s somewhat expected to happen either when 1) the peer tries to probe
> for more throughput and bursts a bit more than we can handle, or 2) a full
> vpp dispatch takes too long, which could happen because of the memcpy in
> tcp-established.
>
>
>
>>
>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by
>>>> > starting 10 parallel iperf connections on different port numbers and
>>>> > pinning them to different cores on the sender side. On the VPP receiver
>>>> > side I have 10 worker threads and 10 rx-queues in dpdk, and running
>>>> > iperf3 using VCL library as follows
>>>> >
>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>> > ...
>>>> >
>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>> >
>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>> > ...
>>>>
>>>> Could you try first with only 1 iperf server/client pair, just to see
>>>> where performance is with that?
>>>>
>>>
>>> These are the numbers I get
>>> rx-fifo-size 65536: ~8G
>>> rx-fifo-size 524288: 22G
>>> rx-fifo-size 4000000: 25G
>>>
>>>
>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector
>>> rate (and the errors) in this case also?
>>>
>>
>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve
>> performance. Not sure if this is an issue with the dpdk driver for the nic.
>> Anyway, in the "show hardware" flags I now see that tcp and udp checksum
>> offloads are enabled:
>>
>> root@server:~# vppctl show hardware
>>               Name                Idx   Link  Hardware
>> eth0                               1     up   dsc1
>>   Link speed: 100 Gbps
>>   Ethernet address 00:ae:cd:03:79:51
>>   ### UNKNOWN ###
>>     carrier up full duplex mtu 9000
>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>     Devargs:
>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>     max rx packet len: 9208
>>     promiscuous: unicast off all-multicast on
>>     vlan offload: strip off filter off qinq off
>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
>>                        jumbo-frame scatter
>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
>>                        outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>     tx offload active: multi-segs
>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>     tx burst function: ionic_xmit_pkts
>>     rx burst function: ionic_recv_pkts
>>
>> With this I get better performance per iperf3 connection - about 30.5G.
>> Show run output attached (1connection.txt)
>>
>>
>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when
>> possible, but it currently (unless some of the pending patches were merged)
>> does not mark the packet appropriately, so ip4-local will recompute/validate
>> the checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the
>> 1 connection setup and ~3.1e3 for 7 connections. That’s a lot, so it seems
>> to confirm that the checksum is recomputed.
>>
>> So, it’s somewhat counterintuitive that performance improves. How do the
>> show run numbers change? Could be that performance worsens because of tcp’s
>> congestion recovery/flow control, i.e., the packets are processed faster
>> but some component starts dropping/queues get full.
>>
>
> That's interesting. I got confused by the "show hardware" output since it
> doesn't show any output against "tx offload active". You are right, though:
> it definitely uses fewer cycles without this option present, so I took it
> out for further tests. I am attaching the show run output for both the 1
> connection and 7 connection cases without this option present. With 1
> connection, it appears VPP is not loaded at all, since there is no batching
> happening?
>
>
> That’s probably because you’re using 9kB frames. It’s practically
> equivalent to LRO so vpp doesn’t need to work too much. Did throughput
> increase at all?
>

Throughput varied between 26 and 30G.
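
For reference, the "enable-tcp-udp-checksum" knob discussed above is the one
in the dpdk stanza of /etc/vpp/startup.conf. This is roughly what I had
before taking it out, just a sketch of the relevant lines (the rest of my
dpdk config is unchanged):

    dpdk {
      ...
      # removed for the later runs, as discussed above
      enable-tcp-udp-checksum
    }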


>
> With 7 connections I do see it getting around 90-92G. When I drop the rx
> queue size to 256, I do see some nic drops, but performance improves and I
> am getting 99G now.
>
>
> Awesome!
>
> Can you please explain why this makes a difference? Does it have to do
> with caches?
>
>
> There are probably several things at play. First of all, we back pressure
> the sender with minimal cost, i.e., we minimize the data that we queue and
> we just drop as soon as we run out of space. So instead of us trying to
> buffer large bursts and deal with them later, we force the sender to drop
> the rate. Second, as you already guessed, this probably improves cache
> utilization because we end up touching fewer buffers.
>

I see. I was trying to accomplish something similar by limiting the
rx-fifo-size (like rmem in Linux) for each connection. So is there no issue
with the ring size being equal to the VPP batch size? While VPP is working on
a batch, what happens if more packets come in?
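
For completeness, this is roughly how I am setting those two knobs on my
side, with the values from the runs above. It is only a sketch of the
relevant lines; <pci-addr> stands for the NIC's PCI address and everything
else in my configs is unchanged.

/etc/vpp/vcl.conf:

    vcl {
      rx-fifo-size 4000000
    }

/etc/vpp/startup.conf (dpdk stanza only):

    dpdk {
      dev <pci-addr> {
        num-rx-desc 256
      }
    }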


>
>
> Are the other cores kind of unusable now due to being on a different numa?
> With Linux TCP, I believe I was able to use most of the cores and scale the
> number of connections.
>
>
> They’re all usable but it’s just that cross-numa memcpy is more expensive
> (session layer buffers the data for the apps in the shared memory fifos).
> As the sessions are scaled up, each session will carry less data, so moving
> some of them to the other numa should not be a problem. But it all
> ultimately depends on the efficiency of the UPI interconnect.
>


Sure, I will try these experiments.
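
For the cross-numa runs, the plan is to spread the workers over both sockets
along these lines in the cpu stanza of startup.conf. The core numbers below
are just placeholders for my box, not a recommendation:

    cpu {
      main-core 1
      # first range on the NIC's numa node, second range on the other socket
      corelist-workers 2-11,26-31
    }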

Thanks,

Vijay