Hi Vipul,

Did you check the core affinity of the forwarding threads in OVS? For optimal
performance, each forwarding thread should have one dedicated core.
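For example, you can check which core each PMD thread runs on and which rx
queue it polls with:

    ovs-appctl dpif-netdev/pmd-rxq-show

and, if a queue ends up on the wrong core, pin it explicitly (the port name
below is just a placeholder for your DPDK port):

    ovs-vsctl set Interface dpdk-p0 other_config:pmd-rxq-affinity="0:13"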
BRs,
Chenbo

> -----Original Message-----
> From: users <users-boun...@dpdk.org> On Behalf Of Vipul Ujawane
> Sent: Monday, June 29, 2020 4:33 PM
> To: David Christensen <d...@linux.vnet.ibm.com>
> Cc: users@dpdk.org
> Subject: Re: [dpdk-users] Poor performance when using OVS with DPDK
>
> So,
>
> > You don't mention how many different flows you're using in the test.
> > Don't be surprised if throughput drops when you move from 1,000 flows
> > to 1,000,000 flows.
>
> We currently only have one flow, the basic packet forwarding rule. We used
> pktgen's standard built-in packet generation without any pcap or script that
> would change the flows. Therefore, increasing the number of queues (and
> cores/queues) cannot help; that flow will always be handled in one specific
> queue.
>
> Increasing the overall core assignment to DPDK should then help, but it does
> not. On the other hand, we tested the VM-to-VM performance again via iperf
> and the dpdkvhostuser interfaces in the KVM machines, but the performance is
> still poor with the new settings, although a bit higher; it's around 10G now.
> Note again, this is iperf using TCP and MTU-sized packets (but with
> OVS-Kernel, the performance is 20G with a similar setup).
>
> Thanks.
>
> On Sat, Jun 27, 2020 at 3:32 AM David Christensen <d...@linux.vnet.ibm.com>
> wrote:
>
> > > > Why don't you reserve any CPUs for OVS/DPDK or VM usage? All published
> > > > performance white papers recommend settings for CPU isolation, like
> > > > this Mellanox DPDK performance report:
> > > >
> > > > https://fast.dpdk.org/doc/perf/DPDK_19_08_Mellanox_NIC_performance_report.pdf
> > > >
> > > > For their test system:
> > > >
> > > > isolcpus=24-47 intel_idle.max_cstate=0 processor.max_cstate=0
> > > > intel_pstate=disable nohz_full=24-47 rcu_nocbs=24-47 rcu_nocb_poll
> > > > default_hugepagesz=1G hugepagesz=1G hugepages=64 audit=0 nosoftlockup
> > > >
> > > > Using the tuned service (CPU partitioning profile) makes this process
> > > > easier:
> > > >
> > > > https://tuned-project.org/
> > >
> > > Nice tutorial, thanks for sharing. I have checked it and configured our
> > > server like this:
> > >
> > > isolcpus=12-19 intel_idle.max_cstate=0 processor.max_cstate=0
> > > nohz_full=12-19 rcu_nocbs=12-19 intel_pstate=disable
> > > default_hugepagesz=1G hugepagesz=1G hugepages=24 audit=0
> > > nosoftlockup intel_iommu=on iommu=pt rcu_nocb_poll
> > >
> > > Even though our servers are NUMA-capable and NUMA-aware, we only have
> > > one CPU installed in one socket. And one CPU has 20 physical cores
> > > (40 threads), so I decided to use the "top-most" cores for DPDK/OVS;
> > > that's the reason for isolcpus=12-19.
> >
> > You can never have too many cores. On POWER systems I'll sometimes
> > reserve 76 out of 80 available cores to improve overall throughput.
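> > If you stick with the tuned approach, the cpu-partitioning profile should
> > map onto your isolcpus=12-19 layout with roughly the following (adjust the
> > core list to whatever you end up reserving):
> >
> >     # cat /etc/tuned/cpu-partitioning-variables.conf
> >     isolated_cores=12-19
> >     # tuned-adm profile cpu-partitioning
> >     # reboot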
> > > > > ./usertools/dpdk-devbind.py --status
> > > > > Network devices using kernel driver
> > > > > ===================================
> > > > > 0000:b3:00.0 'MT27800 Family [ConnectX-5] 1017' if=ens2
> > > > > drv=mlx5_core unused=igb_uio,vfio-pci
> > > > >
> > > > > Due to the way Mellanox cards and their driver work, I have not
> > > > > bound igb_uio to the interface; however, the uio, igb_uio and
> > > > > vfio-pci kernel modules are loaded.
> > > > >
> > > > > Relevant part of the VM config for QEMU/KVM
> > > > > -------------------------------------------
> > > > > <cputune>
> > > > >   <shares>4096</shares>
> > > > >   <vcpupin vcpu='0' cpuset='4'/>
> > > > >   <vcpupin vcpu='1' cpuset='5'/>
> > > >
> > > > Where did you get these CPU mapping values? x86 systems typically map
> > > > even-numbered CPUs to one NUMA node and odd-numbered CPUs to a
> > > > different NUMA node. You generally want to select CPUs from the same
> > > > NUMA node as the mlx5 NIC you're using for DPDK.
> > > >
> > > > You should have at least 4 CPUs in the VM, selected according to the
> > > > NUMA topology of the system.
> > >
> > > As per my answer above, our system has no secondary NUMA node; all
> > > mappings are to the same socket/CPU.
> > >
> > > > Take a look at this bash script written for Red Hat:
> > > >
> > > > https://github.com/ctrautma/RHEL_NIC_QUALIFICATION/blob/ansible/ansible/get_cpulist.sh
> > > >
> > > > It gives you a good starting reference for which CPUs to select for
> > > > the OVS/DPDK and VM configurations on your particular system. Also
> > > > review the Ansible script pvp_ovsdpdk.yml; it provides a lot of other
> > > > useful steps you might be able to apply to your Debian OS.
> > > >
> > > > > <emulatorpin cpuset='4-5'/>
> > > > > </cputune>
> > > > > <cpu mode='host-model' check='partial'>
> > > > >   <model fallback='allow'/>
> > > > >   <topology sockets='2' cores='1' threads='1'/>
> > > > >   <numa>
> > > > >     <cell id='0' cpus='0-1' memory='4194304' unit='KiB'
> > > > >           memAccess='shared'/>
> > > > >   </numa>
> > > > > </cpu>
> > > > > <interface type='vhostuser'>
> > > > >   <mac address='00:00:00:00:00:aa'/>
> > > > >   <source type='unix' path='/usr/local/var/run/openvswitch/vhostuser' mo$
> > > > >   <model type='virtio'/>
> > > > >   <driver queues='2'>
> > > > >     <host mrg_rxbuf='on'/>
> > > >
> > > > Is there a requirement for mergeable RX buffers? Some PMDs like mlx5
> > > > can take advantage of SSE instructions when this is disabled, yielding
> > > > better performance.
> > >
> > > Good point, there is no requirement; I just took an example config and
> > > thought it was necessary for the driver queues setting.
> >
> > That's how we all learn :-)
> >
> > > > >   </driver>
> > > > >   <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
> > > > >            function='0x0'$
> > > > > </interface>
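> > > >
> > > > For reference, turning mergeable RX buffers off would be a one-line
> > > > change in the interface definition above, roughly:
> > > >
> > > >   <driver queues='2'>
> > > >     <host mrg_rxbuf='off'/>
> > > >   </driver>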
> > > > I don't see hugepage usage in the libvirt XML. Something similar to:
> > > >
> > > > <memory unit='KiB'>8388608</memory>
> > > > <currentMemory unit='KiB'>8388608</currentMemory>
> > > > <memoryBacking>
> > > >   <hugepages>
> > > >     <page size='1048576' unit='KiB' nodeset='0'/>
> > > >   </hugepages>
> > > > </memoryBacking>
> > >
> > > I did not copy this part of the XML, but we have hugepages configured
> > > properly.
> > >
> > > > > -----------------------------------
> > > > > OVS Start Config
> > > > > -----------------------------------
> > > > > ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> > > > > ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096,0"
> > > > > ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0xff
> > > > > ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=0e
> > > >
> > > > These two masks shouldn't overlap:
> > > >
> > > > https://developers.redhat.com/blog/2017/06/28/ovs-dpdk-parameters-dealing-with-multi-numa/
> > >
> > > Thanks, this really did help me understand in which order these commands
> > > should be issued.
> > >
> > > So, the problem now is the following. I did all the changes you shared,
> > > started OVS/DPDK in a proper way, and set these features:
> > >
> > > ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="8192,0"
> > > ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x01000
> > > ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> > >
> > > and, finally, this:
> > >
> > > ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=0x0e000
> > >
> > > The documentation you shared says this last one can even be set at
> > > runtime, so I was playing with it to see whether there is any change.
> > >
> > > I did not start any VM on top of OVS/DPDK, just set up a port-forward
> > > rule (in_port=1, actions=output:IN_PORT), since I only have one physical
> > > port on each Mellanox card. Then, I generated traffic from the other
> > > server towards OVS. Using pktsize 64B, the max throughput Pktgen reports
> > > is 8 Gbps. In particular, I got these metrics:
> > >
> > > Size    Sent_pps  Recv_pps  Recv_Gbps
> > > 64B     93M       11M       ~8
> > > 128B    65M       12.5M     ~15
> > > 256B    42.5M     12.3M     ~27
> > > 512B    23.5M     11.9M     ~51
> > > 1024B   11.9M     10M       ~83
> > > 1280B   9.6M      8.3M      ~86
> > > 1500B   8.3M      6.7M      ~82
> > >
> > > It's quite interesting that for 64B the pps is lower than for greater
> > > sizes, because pps should be the practical limit on throughput, and from
> > > the packet size we can compute the throughput in Gbps.
> >
> > Looking at 64B performance gives you a sense of the per-packet overhead
> > associated with the DPDK framework and your application. At 100Gb/s line
> > rate, 64B frames will arrive every 6.72ns. Since your received PPS is
> > peaking around 12.5MPPS, I'd guess that it's taking about 80ns of CPU time
> > per frame. I don't know how well OVS scales with additional CPUs,
> > something to look at.
> >
> > You don't mention how many different flows you're using in the test.
> > Don't be surprised if throughput drops when you move from 1,000 flows to
> > 1,000,000 flows.
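> >
> > One way to see where that per-frame time goes is OVS's own PMD counters,
> > e.g. something like:
> >
> >     ovs-appctl dpif-netdev/pmd-stats-clear
> >     (run traffic for a while)
> >     ovs-appctl dpif-netdev/pmd-stats-show
> >
> > which reports idle vs. processing cycles and average cycles per packet
> > for each PMD thread.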
> > It's likely that most of your frame loss is due to the NIC's RX buffers
> > overflowing and dropping frames due to back pressure (i.e. DPDK/OVS can't
> > process packets fast enough). Look at the mlx5's hardware statistics to
> > verify.
> >
> > You may be able to improve the performance by increasing the number of RX
> > queues and RX descriptors per queue, and assigning more lcores to match
> > the number of queues, allowing the work to be spread more evenly and
> > reducing buffer overflows. This often works when running testpmd alone,
> > since the app overhead is low, but has less effect on OVS performance.
> > You might consider benchmarking testpmd alone vs OVS/DPDK to understand
> > the OVS overhead.
> >
> > > Anyway, OVS-DPDK has 3 cores to use, but only one rx queue is assigned
> > > to the port (so, basically --- as `top` also shows --- it is the
> > > one-core performance).
> >
> > Increasing the number of RX queues/descriptors and assigning a dedicated
> > lcore to each queue will generally improve performance if your bottleneck
> > is RX in the PMD.
> >
> > > Increasing the cores did not help, and the performance remained the
> > > same. Is this performance normal for OVS/DPDK?
> >
> > That's been my experience, though there are others who have more
> > experience with performance testing OVS. The platform matters. Look for
> > existing whitepapers and compare your system configuration to theirs to
> > see what you need to achieve the performance you're looking for.
> >
> > Dave
>
> --
> Vipul Ujawane <https://vipul999ujawane.github.io/>
> Pre-Final Year Undergraduate
> Department of Industrial and Systems Engineering
> Indian Institute of Technology, Kharagpur
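P.S. Regarding David's suggestion to scale the RX path: on the OVS side that
would look roughly like the following (the port name is a placeholder, and
your pmd-cpu-mask must provide enough cores for the extra queues):

    ovs-vsctl set Interface dpdk-p0 options:n_rxq=4
    ovs-vsctl set Interface dpdk-p0 options:n_rxq_desc=2048

The mlx5 hardware drop counters can be read with ethtool, e.g.
"ethtool -S ens2 | grep rx_out_of_buffer".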