Hi Florin,

Sure, got it. The options are clear now.
Thanks,
Vijay

On Tue, Sep 15, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Yes. Underneath, the fifos maintain a linked list of chunks where the
> data is stored. VCL could provide pointers to those in the form of iovecs
> and another api to mark the data as consumed (implicitly releasing the
> chunks) once the app is done reading. But again, the apps would have to
> explicitly use these apis.
>
> Regards,
> Florin
>
> On Sep 15, 2020, at 1:46 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> I got it now.
>
> Also, I think you mentioned that support in the VCL library for the
> application to read/write/free directly from fifo buffers is not yet
> present, but can be added with little effort. Is that correct?
>
> Thanks,
>
> Vijay
>
> On Tue, Sep 15, 2020 at 1:31 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Hi Vijay,
>>
>> Oh, by no means. Builtin applications, i.e., applications that run
>> within the vpp process, are definitely possible (see
>> plugins/hs_apps/echo_client/server or the proxy). They run "on" the vpp
>> workers, and io/ctrl events are delivered by the session layer to those
>> apps using callback functions. However, the session layer exchanges data
>> with them using fifos, not vlib buffers. We might consider offering the
>> option to improve that for low-scale, high-throughput scenarios, but
>> that's not possible today.
>>
>> Regards,
>> Florin
>>
>> On Sep 15, 2020, at 12:23 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> Got it. So what you are saying is that TCP applications cannot directly
>> be linked with VPP. They have to be a separate process and go through
>> the VCL library, although they can be optimized to avoid one extra
>> memcpy. In the future, the memcpy _may_ be avoided completely, but the
>> applications still have to reside in a separate process.
>>
>> Thanks,
>>
>> Vijay
>>
>> On Tue, Sep 15, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Vijay,
>>>
>>> Currently, builtin applications can only receive data from tcp in a
>>> session's rx fifo. That's a deliberate choice because, at scale,
>>> out-of-order data could end up consuming a lot of buffers, i.e.,
>>> buffers are queued but cannot be consumed by the app until the gaps
>>> are filled. Still, builtin apps can avoid the extra memcpy vcl needs
>>> to do for traditional apps.
>>>
>>> Now, there have been talks and we have been considering the option of
>>> linking vlib buffers into the fifos (to avoid the memcpy), but there's
>>> no ETA for that.
>>>
>>> Regards,
>>> Florin
>>>
>>> On Sep 15, 2020, at 11:32 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> Sure, yes; and better still would be for the app to integrate directly
>>> with VPP to avoid even the shared fifo copy, I assume. It's just that
>>> the VCL library gives a quick way to get some benchmark numbers with
>>> existing applications. Thanks for all the help. I have a much better
>>> idea now.
>>>
>>> Thanks,
>>>
>>> Vijay
>>>
>>> On Tue, Sep 15, 2020 at 11:25 AM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> Yes, that is the case for this iperf3 test. The data is already in
>>>> user space, and could be passed to the app in the shape of iovecs to
>>>> avoid the extra memcpy, but the app would need to be changed to
>>>> release the memory whenever it's done reading it. In the case of
>>>> iperf3 that would happen on the spot, because it discards the data.
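To make that concrete, here is a rough sketch of what such an iovec-style
read/consume pair could look like from the application's side. The names
and signatures below are invented purely for illustration; as noted in the
next message, no such VCL apis exist at this point, and a real
implementation would likely differ.

/* Hypothetical interface, invented for illustration only -- not VCL API. */
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec */

/* Fill iov[] with pointers into the session's rx fifo chunks (no copy).
 * Returns the number of entries filled, or a negative error code. */
int vcl_session_peek_iovecs (uint32_t session_handle, struct iovec *iov,
                             int max_iov);

/* Tell VCL the app is done with the first n_bytes it was handed, so the
 * underlying fifo chunks can be released and reused. */
int vcl_session_consume (uint32_t session_handle, uint32_t n_bytes);

/* Example rx path for an iperf3-like server that only counts the data. */
static uint64_t
drain_session (uint32_t session_handle)
{
  struct iovec iov[8];
  uint64_t total = 0;
  int i, n;

  n = vcl_session_peek_iovecs (session_handle, iov, 8);
  for (i = 0; i < n; i++)
    total += iov[i].iov_len;   /* "process" the data in place */
  if (total)
    vcl_session_consume (session_handle, (uint32_t) total);
  return total;
}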
>>>>
>>>> For completeness, note that we don't currently have vcl apis to expose
>>>> the fifo chunks as iovecs, but they shouldn't be that difficult to add.
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> On Sep 15, 2020, at 10:47 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi Florin,
>>>>
>>>> I just realized that maybe in the VPP case there is an extra copy:
>>>> once from mbuf to shared fifo, and once from shared fifo to application
>>>> buffer. In Linux, there is probably just the copy from kernel space to
>>>> user space. Please correct me if I am wrong. If so, what I am doing is
>>>> not an apples-to-apples comparison.
>>>>
>>>> Thanks,
>>>>
>>>> Vijay
>>>>
>>>> On Tue, Sep 15, 2020 at 8:54 AM Vijay Sampath <vsamp...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Florin,
>>>>>
>>>>> In the 1 iperf connection test, I get different numbers every time I
>>>>> run. When I ran today:
>>>>>
>>>>> - iperf and vpp on the same numa node as the pci device: 50Gbps
>>>>>   (although in different runs I saw 30Gbps as well)
>>>>> - vpp on the same numa node as the pci device, iperf on the other
>>>>>   numa node: 28Gbps
>>>>> - vpp and iperf both on the other numa node from the pci device: 36Gbps
>>>>>
>>>>> These numbers vary from test to test, but I was never able to get
>>>>> beyond 50G with 10 connections with iperf on the other numa node. As I
>>>>> mentioned in the previous email, when I repeat this test with Linux
>>>>> TCP as the server, I am able to get 100G no matter which cores I start
>>>>> iperf on.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Vijay
>>>>>
>>>>> On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> In this sort of setup, with few connections, it's probably
>>>>>> inevitable to lose throughput because of the cross-numa memcpy. In
>>>>>> your 1 iperf connection test, did you only change iperf's numa or
>>>>>> vpp's worker as well?
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Florin,
>>>>>>
>>>>>> I ran some experiments going cross numa, and see that I am not able
>>>>>> to go beyond 50G. I tried with a different number of worker threads
>>>>>> (5, 8 and 10), and going up to 10 iperf servers. I am attaching the
>>>>>> show run output with 10 workers. When I run the same experiment in
>>>>>> Linux, I don't see a difference in the bandwidth: iperf in both numa
>>>>>> nodes is able to achieve 100G. Do you have any suggestions on other
>>>>>> experiments to try?
>>>>>>
>>>>>> I also tried 1 iperf connection, and the bandwidth dropped from 33G
>>>>>> to 23G for the same numa node vs. a different one.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vijay,
>>>>>>>
>>>>>>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Florin,
>>>>>>>
>>>>>>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras
>>>>>>> <fcoras.li...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vijay,
>>>>>>>>
>>>>>>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Florin,
>>>>>>>>
>>>>>>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras
>>>>>>>> <fcoras.li...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vijay,
>>>>>>>>>
>>>>>>>>> Quick replies inline.
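Since several of the results above hinge on whether iperf and the vpp
workers share a NUMA node with the NIC, it can help to verify placement
before pinning anything. A few standard Linux commands for that (the PCI
address is the one shown in the show hardware output further down; the
core number is only a placeholder):

# NUMA node the NIC is attached to
cat /sys/bus/pci/devices/0000:15:00.0/numa_node

# Which cores belong to which NUMA node
lscpu | grep -i numa
numactl --hardware

# Example: pin an iperf3 server to core 10 on that node, following the
# taskset/LD_PRELOAD pattern used later in the thread
taskset -c 10 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so \
  VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000"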
>>>>>>>>>
>>>>>>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Florin,
>>>>>>>>>
>>>>>>>>> Thanks once again for looking at this issue. Please see inline:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras
>>>>>>>>> <fcoras.li...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vijay,
>>>>>>>>>>
>>>>>>>>>> Inline.
>>>>>>>>>>
>>>>>>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Florin,
>>>>>>>>>>
>>>>>>>>>> Thanks for the response. Please see inline:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras
>>>>>>>>>> <fcoras.li...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vijay,
>>>>>>>>>>>
>>>>>>>>>>> Cool experiment. More inline.
>>>>>>>>>>>
>>>>>>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>>>>>> > wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi,
>>>>>>>>>>> >
>>>>>>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine
>>>>>>>>>>> > connected to another server running VPP using 100G NICs. Both
>>>>>>>>>>> > servers are Intel Xeons with 24 cores.
>>>>>>>>>>>
>>>>>>>>>>> May I ask the frequency of those cores? Also, what type of nic
>>>>>>>>>>> are you using?
>>>>>>>>>>
>>>>>>>>>> 2700 MHz.
>>>>>>>>>>
>>>>>>>>>> Probably this somewhat limits throughput per single connection
>>>>>>>>>> compared to my testbed, where the Intel cpu boosts to 4GHz.
>>>>>>>>>
>>>>>>>>> Please see below, I noticed an anomaly.
>>>>>>>>>
>>>>>>>>>> The nic is a Pensando DSC100.
>>>>>>>>>>
>>>>>>>>>> Okay, not sure what to expect there. Since this mostly stresses
>>>>>>>>>> the rx side, what's the number of rx descriptors? Typically I
>>>>>>>>>> test with 256; with more connections and higher throughput you
>>>>>>>>>> might need more.
>>>>>>>>>
>>>>>>>>> This is the default - comments seem to suggest that it is 1024. I
>>>>>>>>> don't see any rx queue empty errors on the nic, which probably
>>>>>>>>> means there are sufficient buffers.
>>>>>>>>>
>>>>>>>>> Reasonable. Might want to try to reduce it down to 256, but
>>>>>>>>> performance will depend a lot on other things as well.
>>>>>>>>
>>>>>>>> This seems to help, but I do get rx queue empty nic drops. More
>>>>>>>> below.
>>>>>>>>
>>>>>>>> That's somewhat expected to happen either when 1) the peer tries
>>>>>>>> to probe for more throughput and bursts a bit more than we can
>>>>>>>> handle, or 2) a full vpp dispatch takes too long, which could
>>>>>>>> happen because of the memcpy in tcp-established.
>>>>>>>>
>>>>>>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP
>>>>>>>>>>> > client by starting 10 parallel iperf connections on different
>>>>>>>>>>> > port numbers and pinning them to different cores on the sender
>>>>>>>>>>> > side. On the VPP receiver side I have 10 worker threads and 10
>>>>>>>>>>> > rx-queues in dpdk, and I run iperf3 using the VCL library as
>>>>>>>>>>> > follows:
>>>>>>>>>>> >
>>>>>>>>>>> > taskset 0x00400 sh -c
>>>>>>>>>>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so
>>>>>>>>>>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>>>>>>> > taskset 0x00800 sh -c
>>>>>>>>>>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so
>>>>>>>>>>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>>>>>>> > ...
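For context, the VCL_CONFIG variable in the commands above points the
preloaded library at a VCL configuration file. A minimal sketch of what
/etc/vpp/vcl.conf might contain for this kind of test, using the fifo
sizes discussed later in the thread (option names should be checked
against the VCL documentation for the VPP version in use):

vcl {
  rx-fifo-size 4000000
  tx-fifo-size 4000000
  app-scope-local
  app-scope-global
}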
>>>>>>>>>>> >
>>>>>>>>>>> > MTU is set to 9216 everywhere, and the TCP MSS is set to 8200
>>>>>>>>>>> > on the client:
>>>>>>>>>>> >
>>>>>>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>>>>>>> > ...
>>>>>>>>>>>
>>>>>>>>>>> Could you try first with only 1 iperf server/client pair, just
>>>>>>>>>>> to see where performance is with that?
>>>>>>>>>>
>>>>>>>>>> These are the numbers I get:
>>>>>>>>>> rx-fifo-size 65536: ~8G
>>>>>>>>>> rx-fifo-size 524288: 22G
>>>>>>>>>> rx-fifo-size 4000000: 25G
>>>>>>>>>>
>>>>>>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the
>>>>>>>>>> vector rate (and the errors) in this case also?
>>>>>>>>>
>>>>>>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to
>>>>>>>>> improve performance. Not sure if this is an issue with the dpdk
>>>>>>>>> driver for the nic. Anyway, in the "show hardware" flags I see now
>>>>>>>>> that tcp and udp checksum offloads are enabled:
>>>>>>>>>
>>>>>>>>> root@server:~# vppctl show hardware
>>>>>>>>>               Name                Idx   Link  Hardware
>>>>>>>>> eth0                               1     up   dsc1
>>>>>>>>>   Link speed: 100 Gbps
>>>>>>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>>>>>>   ### UNKNOWN ###
>>>>>>>>>     carrier up full duplex mtu 9000
>>>>>>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>>>>>>     Devargs:
>>>>>>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>>>>>>>>     max rx packet len: 9208
>>>>>>>>>     promiscuous: unicast off all-multicast on
>>>>>>>>>     vlan offload: strip off filter off qinq off
>>>>>>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum
>>>>>>>>>                        vlan-filter jumbo-frame scatter
>>>>>>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame
>>>>>>>>>                        scatter
>>>>>>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum
>>>>>>>>>                        tcp-tso outer-ipv4-cksum multi-segs
>>>>>>>>>                        mbuf-fast-free outer-udp-cksum
>>>>>>>>>     tx offload active: multi-segs
>>>>>>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>>>     tx burst function: ionic_xmit_pkts
>>>>>>>>>     rx burst function: ionic_recv_pkts
>>>>>>>>>
>>>>>>>>> With this I get better performance per iperf3 connection - about
>>>>>>>>> 30.5G. Show run output attached (1connection.txt).
>>>>>>>>>
>>>>>>>>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload
>>>>>>>>> when possible, but it currently (unless some of the pending
>>>>>>>>> patches were merged) does not mark the packet appropriately, and
>>>>>>>>> ip4-local will recompute/validate the checksum. From your logs, it
>>>>>>>>> seems ip4-local needs ~1.8e3 cycles in the 1 connection setup and
>>>>>>>>> ~3.1e3 for 7 connections. That's a lot, so it seems to confirm
>>>>>>>>> that the checksum is recomputed.
>>>>>>>>>
>>>>>>>>> So, it's somewhat counterintuitive that performance improves. How
>>>>>>>>> do the show run numbers change? It could be that performance
>>>>>>>>> worsens because of tcp's congestion recovery/flow control, i.e.,
>>>>>>>>> the packets are processed faster but some component starts
>>>>>>>>> dropping/queues get full.
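For reference, the knobs discussed in this exchange (worker count, rx
queues, rx descriptors and the checksum option) are configured in VPP's
startup.conf. A sketch of the relevant stanzas with the values from this
thread (the PCI address is taken from the show hardware output above;
core numbers are placeholders, and option names are worth double-checking
against the startup.conf documentation for the VPP version in use):

cpu {
  main-core 1
  corelist-workers 2-11        # 10 workers, one per rx queue
}

dpdk {
  dev 0000:15:00.00 {
    num-rx-queues 10
    num-rx-desc 256            # reduced from the 1024 default, as tried above
  }
  # enable-tcp-udp-checksum    # the option toggled during these tests
}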
>>>>>>>> That's interesting. I got confused by the "show hardware" output,
>>>>>>>> since it doesn't show any output against "tx offload active". You
>>>>>>>> are right, though: it definitely uses fewer cycles without this
>>>>>>>> option present, so I took it out for further tests. I am attaching
>>>>>>>> the show run output for both the 1 connection and 7 connection
>>>>>>>> cases without this option present. With 1 connection, it appears
>>>>>>>> VPP is not loaded at all, since there is no batching happening?
>>>>>>>>
>>>>>>>> That's probably because you're using 9kB frames. It's practically
>>>>>>>> equivalent to LRO, so vpp doesn't need to work too much. Did
>>>>>>>> throughput increase at all?
>>>>>>>
>>>>>>> Throughput varied between 26-30G.
>>>>>>>
>>>>>>> Sounds reasonable for the cpu frequency.
>>>>>>>
>>>>>>>> With 7 connections I do see it getting around 90-92G. When I drop
>>>>>>>> the rx queue to 256, I do see some nic drops, but performance
>>>>>>>> improves and I am getting 99G now.
>>>>>>>>
>>>>>>>> Awesome!
>>>>>>>>
>>>>>>>> Can you please explain why this makes a difference? Does it have
>>>>>>>> to do with caches?
>>>>>>>>
>>>>>>>> There are probably several things at play. First of all, we back
>>>>>>>> pressure the sender with minimal cost, i.e., we minimize the data
>>>>>>>> that we queue and we just drop as soon as we run out of space. So
>>>>>>>> instead of us trying to buffer large bursts and deal with them
>>>>>>>> later, we force the sender to drop its rate. Second, as you already
>>>>>>>> guessed, this probably improves cache utilization because we end up
>>>>>>>> touching fewer buffers.
>>>>>>>
>>>>>>> I see. I was trying to accomplish something similar by limiting the
>>>>>>> rx-fifo-size (rmem in linux) for each connection. So there is no
>>>>>>> issue with the ring size being equal to the VPP batch size? While
>>>>>>> VPP is working on a batch, what happens if more packets come in?
>>>>>>>
>>>>>>> They will be dropped. Typically tcp pacing should make sure that
>>>>>>> packets are not delivered in bursts; instead, they're spread over an
>>>>>>> rtt. For instance, see how small the vector rate is for 1
>>>>>>> connection. Even if you multiply it by 4 (to reach 100Gbps), the
>>>>>>> vector rate is still small.
>>>>>>>
>>>>>>>> Are the other cores kind of unusable now due to being on a
>>>>>>>> different numa? With Linux TCP, I believe I was able to use most of
>>>>>>>> the cores and scale the number of connections.
>>>>>>>>
>>>>>>>> They're all usable, but it's just that cross-numa memcpy is more
>>>>>>>> expensive (the session layer buffers the data for the apps in the
>>>>>>>> shared memory fifos). As the sessions are scaled up, each session
>>>>>>>> will carry less data, so moving some of them to the other numa
>>>>>>>> should not be a problem. But it all ultimately depends on the
>>>>>>>> efficiency of the UPI interconnect.
>>>>>>>
>>>>>>> Sure, I will try these experiments.
>>>>>>>
>>>>>>> Sounds good. Let me know how it goes.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Florin
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Vijay
>>>>>>>
>>>>>>> <show_run_10_conn_cross_numa.txt>
>>>>>>
>>>>
>>>
>>
>
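The per-node cycle counts, vector rates, and error counters referenced
throughout this thread come from the VPP CLI; a typical way to capture
them between runs (standard vppctl commands, with output redirected only
for convenience):

vppctl clear runtime
vppctl clear errors
# ... run the iperf3 test ...
vppctl show runtime > show_run.txt    # per-node vectors/call and clocks
vppctl show errors                    # per-node error/drop counters
vppctl show hardware                  # nic offload flags, as quoted above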