Hi Florin, I got it now.
Also, I think you mentioned the support in the VCL library for the application to read/write/free directly from fifo buffers is not yet present, but can be added with little effort. Is that correct?

Thanks,
Vijay

On Tue, Sep 15, 2020 at 1:31 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Oh, by no means. Builtin applications, i.e., applications that run within the vpp process, are definitely possible (see plugins/hs_apps/echo_client/server or the proxy). They run “on” the vpp workers and io/ctrl events are delivered by the session layer to those apps using callback functions. However, the session layer exchanges data with them using fifos, not vlib buffers. We might consider offering the option to improve that for low scale and high throughput scenarios, but that’s not possible today.
>
> Regards,
> Florin
>
> On Sep 15, 2020, at 12:23 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> Got it. So what you are saying is that TCP applications cannot directly be linked with VPP. They have to be a separate process and go through the VCL library, although they can be optimized to avoid 1 extra memcpy. In future, memcpy _may_ be avoided completely, but the applications have to still reside as a separate process.
>
> Thanks,
>
> Vijay
>
> On Tue, Sep 15, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi Vijay,
>>
>> Currently, builtin applications can only receive data from tcp in a session’s rx fifo. That’s a deliberate choice because, at scale, out of order data could end up consuming a lot of buffers, i.e., buffers are queued but cannot be consumed by the app until the gaps are filled. Still, builtin apps can avoid the extra memcpy vcl needs to do for traditional apps.
>>
>> Now, there have been talks and we have been considering the option of linking vlib buffers into the fifos (to avoid the memcpy) but there’s no ETA for that.
>>
>> Regards,
>> Florin
>>
>> On Sep 15, 2020, at 11:32 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> Sure yes, and better still would be for the app to integrate directly with VPP to even avoid the shared fifo copy, I assume. It's just that the VCL library gives a quick way to get some benchmark numbers with existing applications. Thanks for all the help. I have a much better idea now.
>>
>> Thanks,
>>
>> Vijay
>>
>> On Tue, Sep 15, 2020 at 11:25 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi Vijay,
>>>
>>> Yes, that is the case for this iperf3 test. The data is already in user space, and could be passed to the app in the shape of iovecs, to avoid the extra memcpy, but the app would need to be changed to have it release the memory whenever it’s done reading it. In case of iperf3 it would be on the spot, because it discards it.
>>>
>>> For completeness, note that we don’t currently have vcl apis to expose the fifo chunks as iovecs, but they shouldn’t be that difficult.
>>>
>>> Regards,
>>> Florin
>>>
>>> On Sep 15, 2020, at 10:47 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> I just realized that maybe in the VPP case there is an extra copy - once from mbuf to shared fifo, and once from shared fifo to application buffer. In Linux, there is probably just the copy from kernel space to user space. Please correct me if I am wrong. If so, what I am doing is not an apples to apples comparison.
>>>
>>> Thanks,
>>>
>>> Vijay
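[As an aside on the builtin applications Florin mentions above: the hs_apps echo server and client can be driven straight from the VPP CLI, which keeps the whole data path inside vpp (no VCL, no fifo-to-app copy) and gives a useful upper-bound comparison. A rough sketch only — the server address is reused from the thread and the exact option names are assumptions that vary between VPP releases, so check them against your build:

  vppctl test echo server uri tcp://10.1.1.102/9000
  vppctl test echo clients uri tcp://10.1.1.102/9000 nclients 10
]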
>>> On Tue, Sep 15, 2020 at 8:54 AM Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>>> Hi Florin,
>>>>
>>>> In the 1 iperf connection test, I get different numbers every time I run. When I ran today:
>>>>
>>>> - iperf and vpp on the same numa node as the pci device: 50Gbps (although in different runs I saw 30Gbps also)
>>>> - vpp on the same numa node as the pci device, iperf on the other numa node: 28Gbps
>>>> - vpp and iperf both on the other numa node from the pci device: 36Gbps
>>>>
>>>> These numbers vary from test to test, but I was never able to get beyond 50G with 10 connections with iperf on the other numa node. As I mentioned in the previous email, when I repeat this test with Linux TCP as the server, I am able to get 100G no matter which cores I start iperf on.
>>>>
>>>> Thanks,
>>>>
>>>> Vijay
>>>>
>>>> On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi Vijay,
>>>>>
>>>>> In this sort of setup, with few connections, it’s probably inevitable to lose throughput because of the cross-numa memcpy. In your 1 iperf connection test, did you only change iperf’s numa or vpp’s worker as well?
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>
>>>>> Hi Florin,
>>>>>
>>>>> I ran some experiments by going cross numa, and see that I am not able to go beyond 50G. I tried with a different number of worker threads (5, 8 and 10), and going up to 10 iperf servers. I am attaching the show run output with 10 workers. When I run the same experiment in Linux, I don't see a difference in the bandwidth - iperf in both numa nodes is able to achieve 100G. Do you have any suggestions on other experiments to try?
>>>>>
>>>>> I also did try 1 iperf connection - and the bandwidth dropped from 33G to 23G for the same numa node vs a different one.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vijay
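[For the cross-numa runs above, one way to make the placement explicit on the receiver is to pin vpp's workers in startup.conf and start each iperf3 server under numactl rather than a raw taskset mask, so both the cpu and the memory of the app land on the intended node. A minimal sketch; the core lists and node numbers are placeholders, not values from the thread:

  # startup.conf: keep vpp's main thread and workers on the NIC's numa node
  cpu {
    main-core 1
    corelist-workers 2-11
  }

  # check which numa node the NIC sits on (PCI address from the show hardware output below)
  cat /sys/bus/pci/devices/0000:15:00.0/numa_node

  # run one iperf3 server with cpu and memory bound to the other node
  numactl --cpunodebind=1 --membind=1 sh -c \
    "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so \
     VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000"
]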
>>>>> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Florin,
>>>>>>
>>>>>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vijay,
>>>>>>>
>>>>>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Florin,
>>>>>>>
>>>>>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vijay,
>>>>>>>>
>>>>>>>> Quick replies inline.
>>>>>>>>
>>>>>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Florin,
>>>>>>>>
>>>>>>>> Thanks once again for looking at this issue. Please see inline:
>>>>>>>>
>>>>>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vijay,
>>>>>>>>>
>>>>>>>>> Inline.
>>>>>>>>>
>>>>>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Florin,
>>>>>>>>>
>>>>>>>>> Thanks for the response. Please see inline:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vijay,
>>>>>>>>>>
>>>>>>>>>> Cool experiment. More inline.
>>>>>>>>>>
>>>>>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Hi,
>>>>>>>>>> >
>>>>>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>>>>>>>>>>
>>>>>>>>>> May I ask the frequency for those cores? Also what type of nic are you using?
>>>>>>>>>
>>>>>>>>> 2700 MHz.
>>>>>>>>>
>>>>>>>>> Probably this somewhat limits throughput per single connection compared to my testbed where the Intel cpu boosts to 4GHz.
>>>>>>>>
>>>>>>>> Please see below, I noticed an anomaly.
>>>>>>>>
>>>>>>>>> The nic is a Pensando DSC100.
>>>>>>>>>
>>>>>>>>> Okay, not sure what to expect there. Since this mostly stresses the rx side, what’s the number of rx descriptors? Typically I test with 256; with more connections and higher throughput you might need more.
>>>>>>>>
>>>>>>>> This is the default - comments seem to suggest that is 1024. I don't see any rx queue empty errors on the nic, which probably means there are sufficient buffers.
>>>>>>>>
>>>>>>>> Reasonable. Might want to try to reduce it down to 256, but performance will depend a lot on other things as well.
>>>>>>>
>>>>>>> This seems to help, but I do get rx queue empty nic drops. More below.
>>>>>>>
>>>>>>> That’s somewhat expected to happen either when 1) the peer tries to probe for more throughput and bursts a bit more than we can handle or 2) a full vpp dispatch takes too long, which could happen because of the memcpy in tcp-established.
>>>>>>>
>>>>>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side. On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and running iperf3 using the VCL library as follows:
>>>>>>>>>> >
>>>>>>>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>>>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>>>>>> > ...
>>>>>>>>>> >
>>>>>>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>>>>>>>> >
>>>>>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>>>>>> > ...
>>>>>>>>>>
>>>>>>>>>> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
>>>>>>>>>
>>>>>>>>> These are the numbers I get:
>>>>>>>>> rx-fifo-size 65536: ~8G
>>>>>>>>> rx-fifo-size 524288: 22G
>>>>>>>>> rx-fifo-size 4000000: 25G
>>>>>>>>>
>>>>>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?
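[For reference, the rx-fifo-size values compared here live in the vcl.conf that the LD_PRELOAD'ed iperf3 picks up, and the vector rate/error check Florin asks for maps to a couple of vppctl commands. A minimal sketch, assuming the /etc/vpp/vcl.conf path used elsewhere in the thread:

  # /etc/vpp/vcl.conf
  vcl {
    rx-fifo-size 4000000
    tx-fifo-size 4000000
  }

  # clear the counters, run the test, then look at vector rates and drops
  vppctl clear run
  vppctl clear errors
  vppctl show run
  vppctl show errors
]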
>>>>>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve performance. Not sure if this is an issue with the dpdk driver for the nic. Anyway, in the "show hardware" flags I see now that tcp and udp checksum offloads are enabled:
>>>>>>>>
>>>>>>>> root@server:~# vppctl show hardware
>>>>>>>>               Name                Idx   Link  Hardware
>>>>>>>> eth0                               1     up   dsc1
>>>>>>>>   Link speed: 100 Gbps
>>>>>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>>>>>   ### UNKNOWN ###
>>>>>>>>     carrier up full duplex mtu 9000
>>>>>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>>>>>     Devargs:
>>>>>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>>>>>>>     max rx packet len: 9208
>>>>>>>>     promiscuous: unicast off all-multicast on
>>>>>>>>     vlan offload: strip off filter off qinq off
>>>>>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter jumbo-frame scatter
>>>>>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>>>>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>>>>>>>     tx offload active: multi-segs
>>>>>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>>     tx burst function: ionic_xmit_pkts
>>>>>>>>     rx burst function: ionic_recv_pkts
>>>>>>>>
>>>>>>>> With this I get better performance per iperf3 connection - about 30.5G. Show run output attached (1connection.txt)
>>>>>>>>
>>>>>>>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when possible, but it currently (unless some of the pending patches were merged) does not mark the packet appropriately, so ip4-local will recompute/validate the checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 connections. That’s a lot, so it seems to confirm that the checksum is recomputed.
>>>>>>>>
>>>>>>>> So it’s somewhat counterintuitive that performance improves. How do the show run numbers change? It could be that performance worsens because of tcp’s congestion recovery/flow control, i.e., the packets are processed faster but some component starts dropping/queues get full.
>>>>>>>
>>>>>>> That's interesting. I got confused by the "show hardware" output since it doesn't show any output against "tx offload active". You are right, though: it definitely uses fewer cycles without this option present, so I took it out for further tests. I am attaching the show run output for both the 1 connection and 7 connection cases without this option present. With 1 connection, it appears VPP is not loaded at all since there is no batching happening?
>>>>>>>
>>>>>>> That’s probably because you’re using 9kB frames. It’s practically equivalent to LRO, so vpp doesn’t need to work too much. Did throughput increase at all?
>>>>>>
>>>>>> Throughput varied between 26-30G.
>>>>>>
>>>>>> Sounds reasonable for the cpu frequency.
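[The knobs touched on in this exchange (rx checksum offload, rx queue and descriptor counts) are all set in the dpdk stanza of VPP's startup.conf. A sketch of what that might look like for this NIC; the device address comes from the show hardware output above, the queue/descriptor counts mirror the values discussed in the thread, and the exact option names should be checked against the running VPP version:

  dpdk {
    # the option debated above: enables tcp/udp rx checksum offload
    enable-tcp-udp-checksum
    dev 0000:15:00.0 {
      num-rx-queues 10
      num-tx-queues 10
      num-rx-desc 256
    }
  }
]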
>>>>>>> With 7 connections I do see it getting around 90-92G. When I drop the rx queue to 256, I do see some nic drops, but performance improves and I am getting 99G now.
>>>>>>>
>>>>>>> Awesome!
>>>>>>>
>>>>>>> Can you please explain why this makes a difference? Does it have to do with caches?
>>>>>>>
>>>>>>> There’s probably several things at play. First of all, we back pressure the sender with minimal cost, i.e., we minimize the data that we queue and we just drop as soon as we run out of space. So instead of us trying to buffer large bursts and deal with them later, we force the sender to drop the rate. Second, as you already guessed, this probably improves cache utilization because we end up touching fewer buffers.
>>>>>>
>>>>>> I see. I was trying to accomplish something similar by limiting the rx-fifo-size (rmem in linux) for each connection. So there is no issue with the ring size being equal to the VPP batch size? While VPP is working on a batch, what happens if more packets come in?
>>>>>>
>>>>>> They will be dropped. Typically tcp pacing should make sure that packets are not delivered in bursts, instead they’re spread over an rtt. For instance, see how small the vector rate is for 1 connection. Even if you multiply it by 4 (to reach 100Gbps) the vector rate is still small.
>>>>>>
>>>>>>> Are the other cores kind of unusable now due to being on a different numa? With Linux TCP, I believe I was able to use most of the cores and scale the number of connections.
>>>>>>>
>>>>>>> They’re all usable but it’s just that cross-numa memcpy is more expensive (session layer buffers the data for the apps in the shared memory fifos). As the sessions are scaled up, each session will carry less data, so moving some of them to the other numa should not be a problem. But it all ultimately depends on the efficiency of the UPI interconnect.
>>>>>>
>>>>>> Sure, I will try these experiments.
>>>>>>
>>>>>> Sounds good. Let me know how it goes.
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> <show_run_10_conn_cross_numa.txt>
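[For anyone reproducing these numbers, the drop and back-pressure behaviour described above can be watched while the test runs with a few standard vppctl commands; the comments describe what to look for rather than exact output:

  vppctl show hardware-interfaces   # nic counters, e.g. the rx queue empty drops mentioned above
  vppctl show errors                # per-node error/drop counters
  vppctl show session verbose       # per-session state, including rx/tx fifo usage
]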