Hi Florin,

Sure, got it. The options are clear now.

Thanks,

Vijay

On Tue, Sep 15, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Yes. Underneath, the fifos maintain a linked list of chunks where the
> data is stored. VCL could provide pointers to those in the form of iovecs
> and another api to mark the data as consumed (implicitly release the
> chunks) once the app is done reading. But again, the apps would have to
> explicitly use these apis.
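>
> To make that concrete, here is a rough sketch of what such apis could look
> like. Purely illustrative: neither function exists in vcl today and the
> names are made up.
>
> /* Hypothetical zero-copy read apis -- not part of vcl, names invented. */
> #include <stdint.h>
> #include <sys/uio.h>
>
> /* Would fill iovs with pointers into the session's rx fifo chunks (no copy)
>  * and return the number of iovecs filled, or a negative error code. */
> int vcl_session_peek_iovecs (uint32_t session_handle, struct iovec *iovs,
>                              int max_iovs);
>
> /* Would mark n_bytes as read, implicitly releasing the underlying chunks. */
> int vcl_session_consume (uint32_t session_handle, uint32_t n_bytes);
>
> static void
> app_drain_zero_copy (uint32_t sh)
> {
>   struct iovec iovs[8];
>   uint32_t n_done = 0;
>   int i, n_iovs = vcl_session_peek_iovecs (sh, iovs, 8);
>   for (i = 0; i < n_iovs; i++)
>     {
>       /* process iovs[i].iov_base / iovs[i].iov_len in place */
>       n_done += (uint32_t) iovs[i].iov_len;
>     }
>   vcl_session_consume (sh, n_done); /* chunks can only be recycled after this */
> }
>
> The second call is the important one: until the app explicitly marks the
> bytes as consumed, the fifo cannot release those chunks.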
>
> Regards,
> Florin
>
> On Sep 15, 2020, at 1:46 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> I got it now.
>
> Also, I think you mentioned that support in the VCL library for the
> application to read/write/free directly from fifo buffers is not yet
> present, but can be added with little effort. Is that correct?
>
> Thanks,
>
> Vijay
>
> On Tue, Sep 15, 2020 at 1:31 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Hi Vijay,
>>
>> Oh, by no means. Builtin applications, i.e., applications that run within
>> the vpp process, are definitely possible (see
>> plugins/hs_apps/echo_client/server or the proxy). They run “on” the vpp
>> workers and io/ctrl events are delivered by the session layer to those apps
>> using callback functions. However, the session layer exchanges data with
>> them using fifos, not vlib buffers. We might consider offering the option
>> to improve that for low scale and high throughput scenarios, but that’s not
>> possible today.
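>>
>> For reference, a minimal sketch of how a builtin app hooks in, loosely
>> modeled on plugins/hs_apps/echo_server.c (exact callback and option names
>> may differ between vpp versions):
>>
>> #include <vnet/session/application.h>
>> #include <vnet/session/application_interface.h>
>>
>> /* io event: the session layer queued data in s->rx_fifo */
>> static int
>> my_app_rx_callback (session_t * s)
>> {
>>   /* drain s->rx_fifo here, on the vpp worker that owns the session */
>>   return 0;
>> }
>>
>> /* ctrl event: a new connection was accepted on our listener */
>> static int
>> my_app_accept_callback (session_t * s)
>> {
>>   s->session_state = SESSION_STATE_READY;
>>   return 0;
>> }
>>
>> static session_cb_vft_t my_app_cb_vft = {
>>   .session_accept_callback = my_app_accept_callback,
>>   .builtin_app_rx_callback = my_app_rx_callback,
>>   /* disconnect/reset/connected callbacks omitted for brevity */
>> };
>>
>> static int
>> my_app_attach (void)
>> {
>>   u64 options[APP_OPTIONS_N_OPTIONS] = { 0 };
>>   vnet_app_attach_args_t a = { 0 };
>>   a.api_client_index = ~0;
>>   a.name = format (0, "my-builtin-app");
>>   a.session_cb_vft = &my_app_cb_vft;
>>   a.options = options;
>>   a.options[APP_OPTIONS_FLAGS] = APP_OPTIONS_FLAGS_IS_BUILTIN;
>>   /* segment/fifo size options omitted */
>>   return vnet_application_attach (&a);
>> }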
>>
>> Regards,
>> Florin
>>
>> On Sep 15, 2020, at 12:23 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> Got it. So what you are saying is that TCP applications cannot
>> directly be linked with VPP. They have to be a separate process and go
>> through the VCL library, although they can be optimized to avoid 1 extra
>> memcpy. In the future, the memcpy _may_ be avoided completely, but the
>> applications still have to reside in a separate process.
>>
>> Thanks,
>>
>> Vijay
>>
>> On Tue, Sep 15, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Vijay,
>>>
>>> Currently, builtin applications can only receive data from tcp in a
>>> session’s rx fifo. That’s a deliberate choice because, at scale, out of
>>> order data could end up consuming a lot of buffers, i.e., buffers are
>>> queued but cannot be consumed by the app until the gaps are filled. Still,
>>> builtin apps can avoid the extra memcpy vcl needs to do for traditional
>>> apps.
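>>>
>>> For illustration, the rx path of a builtin app looks roughly like this
>>> (a sketch only; signatures approximate, error handling omitted):
>>>
>>> /* Called by the session layer on the owning vpp worker when data lands
>>>  * in the session's rx fifo. */
>>> static int
>>> my_builtin_rx_callback (session_t * s)
>>> {
>>>   u8 buf[8192];
>>>   u32 to_read = svm_fifo_max_dequeue_cons (s->rx_fifo);
>>>   while (to_read)
>>>     {
>>>       int n = svm_fifo_dequeue (s->rx_fifo, clib_min (to_read, sizeof (buf)),
>>>                                 buf);
>>>       if (n <= 0)
>>>         break;
>>>       /* consume buf[0 .. n) */
>>>       to_read -= n;
>>>     }
>>>   return 0;
>>> }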
>>>
>>> Now, there have been talks and we have been considering the option of
>>> linking vlib buffers into the fifos (to avoid the memcpy) but there’s no
>>> ETA for that.
>>>
>>> Regards,
>>> Florin
>>>
>>> On Sep 15, 2020, at 11:32 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> Sure yes, and better still would be for the app to integrate directly
>>> with VPP to even avoid the shared fifo copy, I assume. It's just that the
>>> VCL library gives a quick way to get some benchmark numbers with existing
>>> applications. Thanks for all the help. I have a much better idea now.
>>>
>>> Thanks,
>>>
>>> Vijay
>>>
>>> On Tue, Sep 15, 2020 at 11:25 AM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> Yes, that is the case for this iperf3 test. The data is already in user
>>>> space, and could be passed to the app in the shape of iovecs, to avoid the
>>>> extra memcpy, but the app would need to be changed to have it release the
>>>> memory whenever it’s done reading it. In the case of iperf3 that would be
>>>> immediate, because it just discards the data.
>>>>
>>>> For completeness, note that we don’t currently have vcl apis to expose
>>>> the fifo chunks as iovecs, but they shouldn’t be that difficult to add.
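>>>>
>>>> For contrast, what a vcl app (e.g., iperf3 via LD_PRELOAD) effectively does
>>>> today is supply its own buffer and let vppcom_session_read() copy out of
>>>> the rx fifo; that copy is the one an iovec-style api would remove.
>>>> Simplified sketch, session setup omitted:
>>>>
>>>> #include <vcl/vppcom.h>
>>>>
>>>> static void
>>>> drain_with_copy (uint32_t session_handle)
>>>> {
>>>>   char buf[65536];
>>>>   int n;
>>>>   /* copies from the session's rx fifo into the app-supplied buffer */
>>>>   while ((n = vppcom_session_read (session_handle, buf, sizeof (buf))) > 0)
>>>>     {
>>>>       /* an iperf3-like server simply discards buf[0 .. n) */
>>>>     }
>>>> }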
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> On Sep 15, 2020, at 10:47 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi Florin,
>>>>
>>>> I just realized that maybe in the VPP case there is an extra copy -
>>>> once from mbuf to shared fifo, and once from shared fifo to application
>>>> buffer. In Linux, there is probably just the copy from kernel space to user
>>>> space. Please correct me if I am wrong. If so, what I am doing is not an
>>>> apples to apples comparison.
>>>>
>>>> Thanks,
>>>>
>>>> Vijay
>>>>
>>>> On Tue, Sep 15, 2020 at 8:54 AM Vijay Sampath <vsamp...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Florin,
>>>>>
>>>>> In the 1 iperf connection test, I get different numbers every time I
>>>>> run it. When I ran it today:
>>>>>
>>>>> - iperf and vpp in the same numa core as pci device: 50Gbps (although
>>>>> in different runs I saw 30Gbps also)
>>>>> - vpp in the same numa core as pci device, iperf in the other numa :
>>>>> 28Gbps
>>>>> - vpp and iperf in the other numa as pci device : 36Gbps
>>>>>
>>>>> These numbers vary from test to test, but I was never able to get
>>>>> beyond 50G with 10 connections with iperf on the other numa node. As I
>>>>> mentioned in the previous email, when I repeat this test with Linux TCP as
>>>>> the server, I am able to get 100G no matter which cores I start iperf on.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vijay
>>>>>
>>>>> On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> In this sort of setup, with few connections, it’s probably inevitable
>>>>>> to lose throughput because of the cross-numa memcpy. In your 1 iperf
>>>>>> connection test, did you only change iperf’s numa or vpp’s worker as 
>>>>>> well?
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Florin,
>>>>>>
>>>>>> I ran some experiments by going cross numa, and see that I am not
>>>>>> able to go beyond 50G. I tried with a different number of worker threads
>>>>>> (5, 8 and 10), and going up to 10 iperf servers. I am attaching the show
>>>>>> run
>>>>>> output with 10 workers. When I run the same experiment in Linux, I don't
>>>>>> see a difference in the bandwidth - iperf in both numa nodes are able to
>>>>>> achieve 100G. Do you have any suggestions on other experiments to try?
>>>>>>
>>>>>> I also did try 1 iperf connection - and the bandwidth dropped from
>>>>>> 33G to 23G for the same numa core vs a different one.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vijay,
>>>>>>>
>>>>>>>
>>>>>>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Florin,
>>>>>>>
>>>>>>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <
>>>>>>> fcoras.li...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vijay,
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Florin,
>>>>>>>>
>>>>>>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <
>>>>>>>> fcoras.li...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vijay,
>>>>>>>>>
>>>>>>>>> Quick replies inline.
>>>>>>>>>
>>>>>>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Florin,
>>>>>>>>>
>>>>>>>>> Thanks once again for looking at this issue. Please see inline:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <
>>>>>>>>> fcoras.li...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vijay,
>>>>>>>>>>
>>>>>>>>>> Inline.
>>>>>>>>>>
>>>>>>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Florin,
>>>>>>>>>>
>>>>>>>>>> Thanks for the response. Please see inline:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <
>>>>>>>>>> fcoras.li...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vijay,
>>>>>>>>>>>
>>>>>>>>>>> Cool experiment. More inline.
>>>>>>>>>>>
>>>>>>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi,
>>>>>>>>>>> >
>>>>>>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine
>>>>>>>>>>> connected to another server running VPP using 100G NICs. Both 
>>>>>>>>>>> servers are
>>>>>>>>>>> Intel Xeon with 24 cores.
>>>>>>>>>>>
>>>>>>>>>>> May I ask the frequency for those cores? Also what type of nic
>>>>>>>>>>> are you using?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2700 MHz.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Probably this somewhat limits throughput per single connection
>>>>>>>>>> compared to my testbed where the Intel cpu boosts to 4GHz.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please see below, I noticed an anomaly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The nic is a Pensando DSC100.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Okay, not sure what to expect there. Since this mostly stresses
>>>>>>>>>> the rx side, what’s the number of rx descriptors? Typically I test 
>>>>>>>>>> with
>>>>>>>>>> 256; with more connections and higher throughput you might need more.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is the default - comments seem to suggest that is 1024. I
>>>>>>>>> don't see any rx queue empty errors on the nic, which probably means 
>>>>>>>>> there
>>>>>>>>> are sufficient buffers.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Reasonable. Might want to try reducing it to 256, but
>>>>>>>>> performance will depend a lot on other things as well.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This seems to help, but I do get rx queue empty nic drops. More
>>>>>>>> below.
>>>>>>>>
>>>>>>>>
>>>>>>>> That’s somewhat expected to happen either when 1) the peer tries to
>>>>>>>> probe for more throughput and bursts a bit more than we can handle, or 2) a
>>>>>>>> full vpp dispatch takes too long, which could happen because of the 
>>>>>>>> memcpy
>>>>>>>> in tcp-established.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client
>>>>>>>>>>> by starting 10 parallel iperf connections on different port numbers 
>>>>>>>>>>> and
>>>>>>>>>>> pinning them to different cores on the sender side. On the VPP 
>>>>>>>>>>> receiver
>>>>>>>>>>> side I have 10 worker threads and 10 rx-queues in dpdk, and running 
>>>>>>>>>>> iperf3
>>>>>>>>>>> using VCL library as follows
>>>>>>>>>>> >
>>>>>>>>>>> > taskset 0x00400 sh -c
>>>>>>>>>>> "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so
>>>>>>>>>>> VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>>>>>>> > taskset 0x00800 sh -c
>>>>>>>>>>> "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so
>>>>>>>>>>> VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>>>>>>> > ...
>>>>>>>>>>> >
>>>>>>>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on
>>>>>>>>>>> client:
>>>>>>>>>>> >
>>>>>>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>>>>>>> > ...
>>>>>>>>>>>
>>>>>>>>>>> Could you try first with only 1 iperf server/client pair, just
>>>>>>>>>>> to see where performance is with that?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> These are the numbers I get
>>>>>>>>>> rx-fifo-size 65536: ~8G
>>>>>>>>>> rx-fifo-size 524288: 22G
>>>>>>>>>> rx-fifo-size 4000000: 25G
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the
>>>>>>>>>> vector rate (and the errors) in this case also?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to
>>>>>>>>> improve performance. Not sure if this is an issue with the dpdk 
>>>>>>>>> driver for
>>>>>>>>> the nic. Anyway in the "show hardware" flags I see now that tcp and 
>>>>>>>>> udp
>>>>>>>>> checksum offloads are enabled:
>>>>>>>>>
>>>>>>>>> root@server:~# vppctl show hardware
>>>>>>>>>               Name                Idx   Link  Hardware
>>>>>>>>> eth0                               1     up   dsc1
>>>>>>>>>   Link speed: 100 Gbps
>>>>>>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>>>>>>   ### UNKNOWN ###
>>>>>>>>>     carrier up full duplex mtu 9000
>>>>>>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>>>>>>     Devargs:
>>>>>>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address
>>>>>>>>> 0000:15:00.00 numa 0
>>>>>>>>>     max rx packet len: 9208
>>>>>>>>>     promiscuous: unicast off all-multicast on
>>>>>>>>>     vlan offload: strip off filter off qinq off
>>>>>>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum
>>>>>>>>> vlan-filter
>>>>>>>>>                        jumbo-frame scatter
>>>>>>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame
>>>>>>>>> scatter
>>>>>>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum
>>>>>>>>> tcp-tso
>>>>>>>>>                        outer-ipv4-cksum multi-segs mbuf-fast-free
>>>>>>>>> outer-udp-cksum
>>>>>>>>>     tx offload active: multi-segs
>>>>>>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp
>>>>>>>>> ipv6
>>>>>>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp
>>>>>>>>> ipv6
>>>>>>>>>     tx burst function: ionic_xmit_pkts
>>>>>>>>>     rx burst function: ionic_recv_pkts
>>>>>>>>>
>>>>>>>>> With this I get better performance per iperf3 connection - about
>>>>>>>>> 30.5G. Show run output attached (1connection.txt)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Interesting. Yes, dpdk does request offloaded rx ip/tcp checksum
>>>>>>>>> computation when possible but it currently (unless some of the pending
>>>>>>>>> patches were merged) does not mark the packet appropriately and 
>>>>>>>>> ip4-local
>>>>>>>>> will recompute/validate the checksum. From your logs, it seems 
>>>>>>>>> ip4-local
>>>>>>>>> needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 
>>>>>>>>> connections.
>>>>>>>>> That’s a lot, so it seems to confirm that the checksum is recomputed.
>>>>>>>>>
>>>>>>>>> So, it’s somewhat counterintuitive that performance
>>>>>>>>> improves. How do the show run numbers change? Could be that 
>>>>>>>>> performance
>>>>>>>>> worsens because of tcp’s congestion recovery/flow control, i.e., the
>>>>>>>>> packets are processes faster but some component starts 
>>>>>>>>> dropping/queues get
>>>>>>>>> full.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That's interesting. I got confused by the "show hardware" output
>>>>>>>> since it doesn't show any output against "tx offload active". You are
>>>>>>>> right, though it definitely uses less cycles without this option 
>>>>>>>> present,
>>>>>>>> so I took it out for further tests. I am attaching the show run output 
>>>>>>>> for
>>>>>>>> both 1 connection and 7 connection case without this option present. 
>>>>>>>> With 1
>>>>>>>> connection, it appears VPP is not loaded at all since there is no 
>>>>>>>> batching
>>>>>>>> happening?
>>>>>>>>
>>>>>>>>
>>>>>>>> That’s probably because you’re using 9kB frames. It’s practically
>>>>>>>> equivalent to LRO so vpp doesn’t need to work too much. Did throughput
>>>>>>>> increase at all?
>>>>>>>>
>>>>>>>
>>>>>>> Throughput varied between 26-30G.
>>>>>>>
>>>>>>>
>>>>>>> Sounds reasonable for the cpu frequency.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> With 7 connections I do see it getting around 90-92G. When I drop
>>>>>>>> the rx queue to 256, I do see some nic drops, but performance improves 
>>>>>>>> and
>>>>>>>> I am getting 99G now.
>>>>>>>>
>>>>>>>>
>>>>>>>> Awesome!
>>>>>>>>
>>>>>>>> Can you please explain why this makes a difference? Does it have
>>>>>>>> to do with caches?
>>>>>>>>
>>>>>>>>
>>>>>>>> There’s probably several things at play. First of all, we back
>>>>>>>> pressure the sender with minimal cost, i.e., we minimize the data that 
>>>>>>>> we
>>>>>>>> queue and we just drop as soon as we run out of space. So instead of us
>>>>>>>> trying to buffer large bursts and deal with them later, we force the 
>>>>>>>> sender
>>>>>>>> to drop the rate. Second, as you already guessed, this probably 
>>>>>>>> improves
>>>>>>>> cache utilization because we end up touching fewer buffers.
>>>>>>>>
>>>>>>>
>>>>>>> I see. I was trying to accomplish something similar by limiting the
>>>>>>> rx-fifo-size (rmem in linux) for each connection. So there is no issue 
>>>>>>> with
>>>>>>> the ring size being equal to the VPP batch size? While VPP is working 
>>>>>>> on a
>>>>>>> batch, what happens if more packets come in?
>>>>>>>
>>>>>>>
>>>>>>> They will be dropped. Typically tcp pacing should make sure that
>>>>>>> packets are not delivered in bursts, instead they’re spread over an rtt.
>>>>>>> For instance, see how small the vector rate is for 1 connection. Even if
>>>>>>> you multiply it by 4 (to reach 100Gbps) the vector rate is still small.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Are the other cores kind of unusable now due to being on a
>>>>>>>> different numa? With Linux TCP, I believe I was able to use most of the
>>>>>>>> cores and scale the number of connections.
>>>>>>>>
>>>>>>>>
>>>>>>>> They’re all usable but it’s just that cross-numa memcpy is more
>>>>>>>> expensive (session layer buffers the data for the apps in the shared 
>>>>>>>> memory
>>>>>>>> fifos). As the sessions are scaled up, each session will carry less 
>>>>>>>> data,
>>>>>>>> so moving some of them to the other numa should not be a problem. But 
>>>>>>>> it
>>>>>>>> all ultimately depends on the efficiency of the UPI interconnect.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sure, I will try these experiments.
>>>>>>>
>>>>>>>
>>>>>>> Sounds good. Let me know how it goes.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Florin
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Vijay
>>>>>>>
>>>>>>>
>>>>>>> <show_run_10_conn_cross_numa.txt>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>