Hi Benoit,

Everything I state below is based on our understanding of FVL/CVL, not of MLX NICs.

It is not the HW queue, as the queue size can be bigger than 256. It is an interim buffer (please forgive me, I forgot the official term for it) that the NIC fills with descriptors and the CPU fetches from.

So when VPP requests 256 packets, the FVL/CVL driver actually provides at most 64 descriptors for the CPU to fetch at any given time, and that buffer is depleted in one go. Since today's CPUs are really fast and we are eager for more packets, the CPU keeps asking the NIC - the awkward situation is that the NIC is busy telling the CPU "no more, please come back next time", but never gets to refill the interim buffer. So it becomes a special "deadlock" between the NIC and the CPU.
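To make that pattern concrete, here is a rough sketch in plain DPDK terms of what the single big burst looks like on FVL/CVL (this is not the actual VPP or driver code; the function name and the simplified port/queue handling are just for illustration):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    /* Sketch of the problematic pattern: one oversized burst drains the
     * descriptors the driver has staged, and polling again right away
     * returns nothing because the NIC gets no chance to refill. */
    static uint16_t
    rx_one_big_burst (uint16_t port_id, uint16_t queue_id,
                      struct rte_mbuf **pkts)
    {
      /* Ask for a full 256-packet VPP frame in one call. On FVL/CVL this
       * returns at most ~64 packets - whatever was already staged...    */
      uint16_t n = rte_eth_rx_burst (port_id, queue_id, pkts, 256);

      /* ...and an immediate retry just returns 0 while the interim
       * buffer stays empty.                                             */
      if (n < 256)
        n += rte_eth_rx_burst (port_id, queue_id, pkts + n, 256 - n);

      return n;
    }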

To answer your retry question - I actually wrote code to retry indefinitely, and it went into a 100% real deadlock: the total number of packets fetched was 64 no matter how many times I retried.

The solution is simple: instead of depleting the interim buffer of descriptors, we always ask for half of the 64 packets. On the next rx burst, the NIC is more than happy to give the remaining 32 packets to the CPU while refilling another 32 in the background, with no problem. A rough sketch of the idea is below.
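(Again plain DPDK rather than the exact code in the patch; FRAME_SIZE, RX_CHUNK and the function name are made up for illustration.)

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define FRAME_SIZE 256 /* packets VPP wants per frame     */
    #define RX_CHUNK    32 /* small burst the NIC can sustain */

    /* Sketch of the chunked approach: never ask for more than RX_CHUNK
     * descriptors per call, so the NIC always keeps part of the staged
     * descriptors and can refill the rest in the background. */
    static uint16_t
    rx_chunked (uint16_t port_id, uint16_t queue_id, struct rte_mbuf **pkts)
    {
      uint16_t total = 0;

      while (total < FRAME_SIZE)
        {
          uint16_t want = FRAME_SIZE - total;
          if (want > RX_CHUNK)
            want = RX_CHUNK;

          uint16_t n = rte_eth_rx_burst (port_id, queue_id, pkts + total, want);
          total += n;

          /* A short burst means the queue is momentarily empty; stop and
           * let the NIC refill rather than spinning on it. */
          if (n < want)
            break;
        }
      return total;
    }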

The problem was first found by the CSIT team. You may find more details in "dpdk: improve rx burst count per loop" (I804dce6d) · Gerrit Code Review (fd.io) <https://gerrit.fd.io/r/c/vpp/+/35620>

Regards,

Fan

On 1/6/2023 3:25 PM, Benoit Ganne (bganne) via lists.fd.io wrote:
Interesting! Thanks Fan for bringing that up.
So if I understand correctly, with the previous DPDK behavior we could have say 
128 packets in the rxq, VPP would request 256, get 32, and then request 224
(256-32) again, etc.
While VPP requests more packets, the NIC has the opportunity to add packets to
the rxq, and VPP could end up with 256...
With the new behavior, with the same initial state, VPP requests 256 packets,
gets 128 and calls it a day.
If that's the case, maybe a better heuristic could be to retry up to 8 times 
(256/32) before giving up?

Best
ben

-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Zhang, Fan
Sent: Friday, January 6, 2023 16:04
To: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Slow VPP performance vs. DPDK l2fwd / l3wfd

There was a change in DPDK 21.11 that impacts the no-multi-seg option for VPP.

In VPP's DPDK RX, the original implementation was to fetch 256 packets. If
not enough packets were fetched from the NIC queue, it would try again
with a smaller amount.

DPDK 21.11 changed this: when "no-multi-seg" was enabled, the big burst
size was no longer sliced into smaller (say 32) bursts with NIC RX
performed multiple times. As a result, VPP always drained the NIC queue in
the first attempt, and the NIC could not keep up with filling enough
descriptors into the queue before the CPU did another RX burst - at least
that was the case for Intel FVL and CVL.

This caused a lot of empty polling in the end, and the VPP vector size was
always 64 instead of 256 (for CVL and FVL).


I addressed the problem for CVL/FVL by having VPP manually do smaller
bursts (up to 32) multiple times instead. However, I didn't test on MLX
NICs due to the lack of HW. (a9fe20f4b dpdk: improve rx burst count per
loop)


Since different HW has its own sweet spot for the burst size that lets it
work in harmony with the CPU - and possibly has different problems as well
- this won't be easily addressed by non-vendor developers.




Regards,

Fan





On 1/6/2023 2:16 PM, r...@gmx.net <mailto:r...@gmx.net> wrote:


        Hi Matt,

        thanks a lot. I ended up temporarily solving it via a downgrade to
v21.10, where the option `no-multi-seg` provides full line speed of 100
Gbps (tested with a mixed TRex profile, avg. pkt 900 bytes).
        Weirdly enough, any v22.xx causes a major performance drop with the
MLX5 DPDK PMD enabled. I will open another thread to discuss usage of TRex
with the rdma driver.

        Below the working config for v21.10 with my Mellanox-ConnectX-6-DX
cards:

                unix {
                  exec /etc/vpp/exec.cmd
                # l2fwd mode based on mac
                #  exec /etc/vpp/l2fwd.cmd
                  nodaemon
                  log /var/log/vpp/vpp.log
                  full-coredump
                  cli-listen /run/vpp/cli.sock
                  gid vpp

                  ## run vpp in the interactive mode
                  # interactive

                  ## do not use colors in terminal output
                  # nocolor

                  ## do not display banner
                  # nobanner
                }

                api-trace {
                ## This stanza controls binary API tracing. Unless there is
a very strong reason,
                ## please leave this feature enabled.
                  on
                ## Additional parameters:
                ##
                ## To set the number of binary API trace records in the
circular buffer, configure nitems
                ##
                ## nitems <nnn>
                ##
                ## To save the api message table decode tables, configure a
filename. Results in /tmp/<filename>
                ## Very handy for understanding api message changes between
versions, identifying missing
                ## plugins, and so forth.
                ##
                ## save-api-table <filename>
                }

                api-segment {
                  gid vpp
                }

                socksvr {
                  default
                }

                #memory {
                    ## Set the main heap size, default is 1G
                #   main-heap-size 8G

                    ## Set the main heap page size. Default page size is OS
default page
                    ## which is in most cases 4K. if different page size is
specified VPP
                    ## will try to allocate main heap by using specified
page size.
                    ## special keyword 'default-hugepage' will use system
default hugepage
                    ## size
                    # main-heap-page-size 1G
                    ## Set the default huge page size.
                #   default-hugepage-size 1G
                #}

                cpu {
                    ## In the VPP there is one main thread and optionally
the user can create worker(s)
                    ## The main thread and worker thread(s) can be pinned to
CPU core(s) manually or automatically

                    ## Manual pinning of thread(s) to CPU core(s)

                    ## Set logical CPU core where main thread runs, if main
core is not set
                    ## VPP will use core 1 if available
                    main-core 6
                    # 2,4,6,8,10,12,14,16
                    ## Set logical CPU core(s) where worker threads are
running
                    corelist-workers 12
                    # find the right worker via lscpu and NUMA assignment;
check the NICs' PCI addresses for a NUMA match
                #corelist-workers 4,6,8,10,12,14,16
                    ## Automatic pinning of thread(s) to CPU core(s)

                    ## Sets number of CPU core(s) to be skipped (1 ... N-1)
                    ## Skipped CPU core(s) are not used for pinning main
thread and working thread(s).
                    ## The main thread is automatically pinned to the first
available CPU core and worker(s)
                    ## are pinned to next free CPU core(s) after core
assigned to main thread
                    # skip-cores 4

                    ## Specify a number of workers to be created
                    ## Workers are pinned to N consecutive CPU cores while
skipping "skip-cores" CPU core(s)
                    ## and main thread's CPU core
                #   workers 1

                    ## Set scheduling policy and priority of main and worker
threads

                    ## Scheduling policy options are: other (SCHED_OTHER),
batch (SCHED_BATCH)
                    ## idle (SCHED_IDLE), fifo (SCHED_FIFO), rr (SCHED_RR)
                    # scheduler-policy fifo

                    ## Scheduling priority is used only for "real-time"
policies (fifo and rr),
                    ## and has to be in the range of priorities supported
for a particular policy
                    # scheduler-priority 50
                }

                #buffers {
                    ## Increase number of buffers allocated, needed only in
scenarios with
                    ## large number of interfaces and worker threads. Value
is per numa node.
                    ## Default is 16384 (8192 if running unprivileged)
                #   buffers-per-numa 128000

                    ## Size of buffer data area
                    ## Default is 2048
                    # default data-size 2048

                    ## Size of the memory pages allocated for buffer data
                    ## Default will try 'default-hugepage' then 'default'
                    ## you can also pass a size in K/M/G e.g. '8M'
                #   page-size default-hugepage
                #}

                dpdk {
                    ## Change default settings for all interfaces
                    dev default {
                        ## Number of receive queues, enables RSS
                        ## Default is 1
                        # num-rx-queues 3

                        ## Number of transmit queues, Default is equal
                        ## to number of worker threads or 1 if no worker
threads
                        # num-tx-queues 3

                        ## Number of descriptors in transmit and receive
rings
                        ## increasing or reducing number can impact
performance
                        ## Default is 1024 for both rx and tx
                        num-rx-desc 4096
                        num-tx-desc 4096

                        ## VLAN strip offload mode for interface
                        ## Default is off
                        # vlan-strip-offload on

                        ## TCP Segment Offload
                        ## Default is off
                        ## To enable TSO, 'enable-tcp-udp-checksum' must be
set
                        # tso on

                        ## Devargs
                                ## device specific init args
                                ## Default is NULL
                        # devargs safe-mode-support=1,pipeline-mode-support=1
                        # devargs mprq_en=1,rxqs_min_mprq=1,mprq_log_stride_num=9,txq_inline_mpw=128,rxq_pkt_pad_en=1,dv_flow_en=0
                        ## rss-queues
                        ## set valid rss steering queues
                        # rss-queues 0,2,5-7
                    }

                    ## Whitelist specific interface by specifying PCI
address
                    dev 0000:4b:00.0
                    dev 0000:4b:00.1
                    ## Blacklist specific device type by specifying PCI
vendor:device
                        ## Whitelist entries take precedence
                    # blacklist 8086:10fb

                    ## Set interface name
                    # dev 0000:02:00.1 {
                    #   name eth0
                    # }

                    ## Whitelist specific interface by specifying PCI
address and in
                    ## addition specify custom parameters for this interface
                    # dev 0000:02:00.1 {
                    #   num-rx-queues 2
                    # }

                    ## Change UIO driver used by VPP, Options are: igb_uio,
vfio-pci,
                    ## uio_pci_generic or auto (default)
                    # uio-driver vfio-pci

                    ## Disable multi-segment buffers, improves performance
but
                    ## disables Jumbo MTU support
                    no-multi-seg

                    ## Change hugepages allocation per-socket, needed only
if there is need for
                    ## larger number of mbufs. Default is 256M on each
detected CPU socket
                    socket-mem 4096,4096

                    ## Disables UDP / TCP TX checksum offload. Typically
needed for use
                    ## faster vector PMDs (together with no-multi-seg)
                    # no-tx-checksum-offload

                    ## Enable UDP / TCP TX checksum offload
                    ## This is the reversed option of 'no-tx-checksum-offload'
                    # enable-tcp-udp-checksum

                    ## Enable/Disable AVX-512 vPMDs
                    #max-simd-bitwidth <256|512>
                }

                ## node variant defaults
                #node {

                ## specify the preferred default variant
                #   default { variant avx512 }

                ## specify the preferred variant, for a given node
                #   ip4-rewrite { variant avx2 }

                #}


                # plugins {
                    ## Adjusting the plugin path depending on where the VPP
plugins are
                    #   path /ws/vpp/build-root/install-vpp-native/vpp/lib/vpp_plugins

                    ## Disable all plugins by default and then selectively
enable specific plugins
                    # plugin default { disable }
                    # plugin dpdk_plugin.so { enable }
                    # plugin acl_plugin.so { enable }

                    ## Enable all plugins by default and then selectively
disable specific plugins
                    # plugin dpdk_plugin.so { disable }
                    # plugin acl_plugin.so { disable }
                # }

                ## Statistics Segment
                # statseg {
                    # socket-name <filename>, name of the stats segment
socket
                    #     defaults to /run/vpp/stats.sock
                    # size <nnn>[KMG], size of the stats segment, defaults
to 32mb
                    # page-size <nnn>, page size, ie. 2m, defaults to 4k
                    # per-node-counters on | off, defaults to none
                    # update-interval <f64-seconds>, sets the segment scrape
/ update interval
                # }

                ## L2 FIB
                # l2fib {
                    ## l2fib hash table size.
                    #  table-size 512M

                    ## l2fib hash table number of buckets. Must be power of
2.
                    #  num-buckets 524288
                # }

                ## ipsec
                # {
                   # ip4 {
                   ## ipsec for ipv4 tunnel lookup hash number of buckets.
                   #  num-buckets 524288
                   # }
                   # ip6 {
                   ## ipsec for ipv6 tunnel lookup hash number of buckets.
                   #  num-buckets 524288
                   # }
                # }

                # logging {
                   ## set default logging level for logging buffer
                   ## logging levels: emerg, alert,crit, error, warn,
notice, info, debug, disabled
                   # default-log-level debug
                   ## set default logging level for syslog or stderr output
                   # default-syslog-log-level info
                   ## Set per-class configuration
                   # class dpdk/cryptodev { rate-limit 100 level debug
syslog-level error }
                # }








