-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Zhang, Fan
Sent: Friday, January 6, 2023 16:04
To: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Slow VPP performance vs. DPDK l2fwd / l3fwd
There was a change in DPDK 21.11 that impacts the no-multi-seg option for VPP.
In VPP's DPDK RX, the original implementation was to fetch 256 packets; if not
enough packets were fetched from the NIC queue, it tried again with a smaller
amount.
DPDK 21.11 introduced a change: when "no-multi-seg" was enabled, the big burst
was no longer sliced into smaller (say 32-packet) bursts with NIC RX performed
multiple times. This caused VPP to always drain the NIC queue in the first
attempt, and the NIC could not keep up and fill enough descriptors into the
queue before the CPU did another RX burst - at least that was the case for
Intel FVL and CVL.
This caused a lot of empty polling in the end, and the VPP vector size was
always 64 instead of 256 (for CVL and FVL).
I addressed the problem for CVL/FVL by letting VPP do only smaller bursts (up
to 32 packets) multiple times manually instead (a9fe20f4b dpdk: improve rx
burst count per loop). However, I didn't test on MLX NICs due to the lack of
the HW.
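For illustration, here is a minimal sketch of that per-loop slicing idea. It
is not the actual VPP dpdk-input code from a9fe20f4b; the function name and
the 256/32 values are just the numbers discussed above:

/* Collect up to a full 256-packet vector in 32-packet sub-bursts,
 * stopping as soon as a sub-burst comes back short, so the NIC gets
 * time to refill descriptors instead of being drained and then
 * empty-polled. Illustrative sketch only. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define VECTOR_SIZE 256 /* packets the graph node would like per frame */
#define SUB_BURST    32 /* smaller per-call burst size */

static uint16_t
rx_burst_sliced (uint16_t port_id, uint16_t queue_id, struct rte_mbuf **mbufs)
{
  uint16_t n_total = 0;

  while (n_total < VECTOR_SIZE)
    {
      uint16_t n = rte_eth_rx_burst (port_id, queue_id,
                                     mbufs + n_total, SUB_BURST);
      n_total += n;
      if (n < SUB_BURST) /* queue drained for now - stop polling */
        break;
    }
  return n_total;
}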
Since different HW has its own sweet spot for the burst size that lets it work
in harmony with the CPU - possibly with its own set of problems as well - this
won't be easily addressed by non-vendor developers.
Regards,
Fan
On 1/6/2023 2:16 PM, r...@gmx.net wrote:
Hi Matt,
thanks a lot. I ended up temporarily solving it by downgrading to
v21.10, where the option `no-multi-seg` provides full line speed of 100
Gbps (tested with a mixed TRex profile, avg. pkt 900 bytes).
Weirdly enough, any v22.xx causes a major performance drop with the MLX5
DPDK PMD enabled. I will open another thread to discuss usage of TRex with
the rdma driver.
Below is the working config for v21.10 with my Mellanox ConnectX-6 Dx
cards:
unix {
exec /etc/vpp/exec.cmd
# l2fwd mode based on mac
# exec /etc/vpp/l2fwd.cmd
nodaemon
log /var/log/vpp/vpp.log
full-coredump
cli-listen /run/vpp/cli.sock
gid vpp
## run vpp in the interactive mode
# interactive
## do not use colors in terminal output
# nocolor
## do not display banner
# nobanner
}
api-trace {
## This stanza controls binary API tracing. Unless there is a very strong reason,
## please leave this feature enabled.
on
## Additional parameters:
##
## To set the number of binary API trace records in the circular buffer, configure nitems
##
## nitems <nnn>
##
## To save the api message table decode tables, configure a filename. Results in /tmp/<filename>
## Very handy for understanding api message changes between versions, identifying missing
## plugins, and so forth.
##
## save-api-table <filename>
}
api-segment {
gid vpp
}
socksvr {
default
}
#memory {
## Set the main heap size, default is 1G
# main-heap-size 8G
## Set the main heap page size. Default page size is OS default page
## which is in most cases 4K. If a different page size is specified VPP
## will try to allocate main heap by using specified page size.
## special keyword 'default-hugepage' will use system default hugepage size
# main-heap-page-size 1G
## Set the default huge page size.
# default-hugepage-size 1G
#}
cpu {
## In VPP there is one main thread and optionally the user can create worker(s)
## The main thread and worker thread(s) can be pinned to CPU core(s) manually or automatically
## Manual pinning of thread(s) to CPU core(s)
## Set logical CPU core where main thread runs, if main core is not set
## VPP will use core 1 if available
main-core 6
# 2,4,6,8,10,12,14,16
## Set logical CPU core(s) where worker threads are running
corelist-workers 12
# find right worker via lscpu and numa assignment, check PCI address of NICs for NUMA match
#corelist-workers 4,6,8,10,12,14,16
## Automatic pinning of thread(s) to CPU core(s)
## Sets number of CPU core(s) to be skipped (1 ... N-1)
## Skipped CPU core(s) are not used for pinning main thread and working thread(s).
## The main thread is automatically pinned to the first available CPU core and worker(s)
## are pinned to next free CPU core(s) after core assigned to main thread
# skip-cores 4
## Specify a number of workers to be created
## Workers are pinned to N consecutive CPU cores while skipping "skip-cores" CPU core(s)
## and main thread's CPU core
# workers 1
## Set scheduling policy and priority of main and worker threads
## Scheduling policy options are: other (SCHED_OTHER), batch (SCHED_BATCH)
## idle (SCHED_IDLE), fifo (SCHED_FIFO), rr (SCHED_RR)
# scheduler-policy fifo
## Scheduling priority is used only for "real-time" policies (fifo and rr),
## and has to be in the range of priorities supported for a particular policy
# scheduler-priority 50
}
#buffers {
## Increase number of buffers allocated, needed only in scenarios with
## large number of interfaces and worker threads. Value is per numa node.
## Default is 16384 (8192 if running unprivileged)
# buffers-per-numa 128000
## Size of buffer data area
## Default is 2048
# default data-size 2048
## Size of the memory pages allocated for buffer data
## Default will try 'default-hugepage' then 'default'
## you can also pass a size in K/M/G e.g. '8M'
# page-size default-hugepage
#}
dpdk {
## Change default settings for all interfaces
dev default {
## Number of receive queues, enables RSS
## Default is 1
# num-rx-queues 3
## Number of transmit queues, Default is equal to number of worker threads or 1 if no worker threads
# num-tx-queues 3
## Number of descriptors in transmit and receive rings
## increasing or reducing number can impact performance
## Default is 1024 for both rx and tx
num-rx-desc 4096
num-tx-desc 4096
## VLAN strip offload mode for interface
## Default is off
# vlan-strip-offload on
## TCP Segment Offload
## Default is off
## To enable TSO, 'enable-tcp-udp-checksum' must be set
# tso on
## Devargs
## device specific init args
## Default is NULL
# devargs safe-mode-support=1,pipeline-mode-support=1
# devargs mprq_en=1,rxqs_min_mprq=1,mprq_log_stride_num=9,txq_inline_mpw=128,rxq_pkt_pad_en=1,dv_flow_en=0
## rss-queues
## set valid rss steering queues
# rss-queues 0,2,5-7
}
## Whitelist specific interface by specifying PCI address
dev 0000:4b:00.0
dev 0000:4b:00.1
## Blacklist specific device type by specifying PCI vendor:device
## Whitelist entries take precedence
# blacklist 8086:10fb
## Set interface name
# dev 0000:02:00.1 {
# name eth0
# }
## Whitelist specific interface by specifying PCI address and in
## addition specify custom parameters for this interface
# dev 0000:02:00.1 {
# num-rx-queues 2
# }
## Change UIO driver used by VPP, Options are: igb_uio, vfio-pci,
## uio_pci_generic or auto (default)
# uio-driver vfio-pci
## Disable multi-segment buffers, improves performance but
## disables Jumbo MTU support
no-multi-seg
## Change hugepages allocation per-socket, needed only if there is need for
## larger number of mbufs. Default is 256M on each detected CPU socket
socket-mem 4096,4096
## Disables UDP / TCP TX checksum offload. Typically needed to use
## faster vector PMDs (together with no-multi-seg)
# no-tx-checksum-offload
## Enable UDP / TCP TX checksum offload
## This is the reversed option of 'no-tx-checksum-offload'
# enable-tcp-udp-checksum
## Enable/Disable AVX-512 vPMDs
#max-simd-bitwidth <256|512>
}
## node variant defaults
#node {
## specify the preferred default variant
# default { variant avx512 }
## specify the preferred variant, for a given node
# ip4-rewrite { variant avx2 }
#}
# plugins {
## Adjusting the plugin path depending on where the VPP plugins are
# path /ws/vpp/build-root/install-vpp-native/vpp/lib/vpp_plugins
## Disable all plugins by default and then selectively enable specific plugins
# plugin default { disable }
# plugin dpdk_plugin.so { enable }
# plugin acl_plugin.so { enable }
## Enable all plugins by default and then selectively disable specific plugins
# plugin dpdk_plugin.so { disable }
# plugin acl_plugin.so { disable }
# }
## Statistics Segment
# statseg {
# socket-name <filename>, name of the stats segment socket
# defaults to /run/vpp/stats.sock
# size <nnn>[KMG], size of the stats segment, defaults to 32mb
# page-size <nnn>, page size, ie. 2m, defaults to 4k
# per-node-counters on | off, defaults to none
# update-interval <f64-seconds>, sets the segment scrape / update interval
# }
## L2 FIB
# l2fib {
## l2fib hash table size.
# table-size 512M
## l2fib hash table number of buckets. Must be power of 2.
# num-buckets 524288
# }
## ipsec
# {
# ip4 {
## ipsec for ipv4 tunnel lookup hash number of buckets.
# num-buckets 524288
# }
# ip6 {
## ipsec for ipv6 tunnel lookup hash number of buckets.
# num-buckets 524288
# }
# }
# logging {
## set default logging level for logging buffer
## logging levels: emerg, alert, crit, error, warn, notice, info, debug, disabled
# default-log-level debug
## set default logging level for syslog or stderr output
# default-syslog-log-level info
## Set per-class configuration
# class dpdk/cryptodev { rate-limit 100 level debug syslog-level error }
# }