On Thu, 3 Jul 2025 20:14:59 +0000
"Lombardo, Ed" <ed.lomba...@netscout.com> wrote:

> Hi,
> I have run out of ideas and thought I would reach out to the dpdk community.
> 
> I have a dual-CPU Sapphire Rapids server and one E810 (also tried X710); both 
> are 4x10G NICs.  When our application pipeline's final stage enqueues mbufs 
> into the tx ring, I expect rte_ring_dequeue_burst() to pull the mbufs from 
> the tx ring and rte_eth_tx_burst() to transmit them at line rate.  What I see 
> is that with one interface receiving 64-byte UDP-in-IPv4 packets, receive and 
> transmit run at line rate (i.e. packets in one port and out another port of 
> the NIC at 14.9 MPPS).
> When I enable a second receive port, the Tx performance on both transmit 
> ports of the NIC drops to 5 MPPS.  The Tx ring then fills faster than the Tx 
> thread can dequeue and transmit mbufs.
> 
> Packets arrive on ports 1 and 3 in my test setup.  NIC is on NUMA Node 1.  
> Hugepage memory (6GB, 1GB page size) is on NUMA Node 1.  The mbuf size is 9KB.
> 
> Rx Port 1 -> Tx Port 2
> Rx Port 3 -> Tx port 4
> 
> I monitor the number of available mbufs; the mempool configuration is:
> *** DPDK Mempool Configuration ***
> Number Sockets      : 1
> Memory/Socket GB    : 6
> Hugepage Size MB    : 1024
> Overhead/socket MB  : 512
> Usable mem/socket MB: 5629
> mbuf size Bytes     : 9216
> nb mbufs per socket : 640455
> total nb mbufs      : 640455
> hugepages/socket GB : 6
> mempool cache size  : 512
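
For reference, this is roughly how a pool with those numbers would be created.
This is a minimal sketch only; the pool name, the function name, and the
assumption that 9216 is the mbuf data room size are all mine, not taken from
your code:

    #include <stdlib.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>
    #include <rte_debug.h>

    /* Sketch: pool sized per the table above ("pkt_pool" is a made-up name). */
    static struct rte_mempool *
    create_pkt_pool(void)
    {
        struct rte_mempool *mp;

        mp = rte_pktmbuf_pool_create("pkt_pool",
                640455,    /* nb mbufs per socket                       */
                512,       /* per-lcore cache size                      */
                0,         /* private area size                         */
                9216,      /* data room size (assumed from "9KB mbuf")  */
                1);        /* socket id = NUMA node 1                   */
        if (mp == NULL)
            rte_exit(EXIT_FAILURE, "cannot create mbuf pool\n");
        return mp;
    }
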
> 
> *** DPDK EAL args ***
> EAL lcore arg       : -l 36   <<< NUMA Node 1
> EAL socket-mem arg  : --socket-mem=0,6144
> 
> There are 16 rings in this configuration, all the same size (16384 * 8), and 
> there is one mempool.
> 
> The Tx rings are created as single-producer, single-consumer (SP/SC).
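
Presumably created along these lines. A sketch only, with the ring name and
the 16384 entry count assumed (I read the "16384 * 8" above as 16384 slots of
8 bytes each):

    #include <stdlib.h>
    #include <rte_ring.h>
    #include <rte_debug.h>

    /* Sketch: one SP/SC Tx ring per port, allocated on the NIC's NUMA node. */
    static struct rte_ring *
    create_tx_ring(const char *name)
    {
        struct rte_ring *r;

        r = rte_ring_create(name,
                16384,                          /* entries, power of 2 (assumed) */
                1,                              /* socket id = NUMA node 1       */
                RING_F_SP_ENQ | RING_F_SC_DEQ); /* single producer / consumer    */
        if (r == NULL)
            rte_exit(EXIT_FAILURE, "cannot create tx ring %s\n", name);
        return r;
    }
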
> 
> There is one Tx thread per NIC port; its only task is to dequeue mbufs from 
> the tx ring and call rte_eth_tx_burst() to transmit them.  The dequeue burst 
> size is 512 and the tx burst size is equal to or less than 512.  
> rte_eth_tx_burst() never returns fewer than the burst size given.
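
Just to confirm we are picturing the same loop, here is a minimal sketch of
such a Tx drain thread. The function and variable names are mine, and it
retries the unsent remainder even though in your case the full burst always
goes out:

    #include <rte_ring.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define TX_BURST 512

    /* Sketch: dequeue up to 512 mbufs from the tx ring and push them to the
     * NIC, retrying whatever rte_eth_tx_burst() leaves behind. */
    static void
    tx_drain_loop(struct rte_ring *txr, uint16_t port_id, uint16_t queue_id)
    {
        struct rte_mbuf *pkts[TX_BURST];
        unsigned int n;
        uint16_t sent, nb;

        for (;;) {
            n = rte_ring_dequeue_burst(txr, (void **)pkts, TX_BURST, NULL);
            if (n == 0)
                continue;

            sent = 0;
            while (sent < n) {
                nb = rte_eth_tx_burst(port_id, queue_id,
                                      pkts + sent, (uint16_t)(n - sent));
                sent += nb;   /* nb == 0 means the Tx descriptor ring is full */
            }
        }
    }
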
> 
> Each Tx thread is on a dedicated CPU core and its sibling is unused.
> We use CPU shielding to keep non-critical threads off the CPUs reserved for 
> the Tx threads.  HTOP shows the Tx threads are the only threads using the 
> carved-out CPUs.
> 
> The Tx thread uses rte_ring_dequeue_burst() to get a burst of up to 512 mbufs.
> I added debug counters to track how many calls to rte_ring_dequeue_burst() 
> return exactly 512 mbufs and how many return fewer than 512.  The dequeue 
> from the tx ring always returns 512, never fewer.
> 
> 
> Note: if I skip the rte_eth_tx_burst() call in the Tx threads and just 
> dequeue and bulk-free the mbufs, the tx ring does not fill up, i.e. the 
> thread frees the mbufs faster than they arrive on the tx ring.
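
That is a useful baseline.  For completeness, the free-only variant presumably
looks roughly like the following sketch (rte_pktmbuf_free_bulk() needs DPDK
19.11 or newer; older releases would free in a loop):

    #include <rte_ring.h>
    #include <rte_mbuf.h>

    #define TX_BURST 512

    /* Sketch: same dequeue path as the Tx thread, but free instead of send. */
    static void
    tx_free_loop(struct rte_ring *txr)
    {
        struct rte_mbuf *pkts[TX_BURST];
        unsigned int n;

        for (;;) {
            n = rte_ring_dequeue_burst(txr, (void **)pkts, TX_BURST, NULL);
            if (n > 0)
                rte_pktmbuf_free_bulk(pkts, n);
        }
    }
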
> 
> So I suspect that rte_eth_tx_burst() is the bottleneck to investigate, which 
> involves the inner workings of DPDK and the Intel NIC architecture.
> 
> 
> 
> Any help to resolve my issue is greatly appreciated.
> 
> Thanks,
> Ed


Do profiling and look at the number of cache misses.
I suspect the extra ring stage is causing lots of cache misses.
Remember that going to memory is really slow on modern processors.
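
For example, something along these lines on the Tx core (36, per the -l 36 EAL
argument above; exact event names vary by kernel and PMU, so adjust as needed):

    perf stat -C 36 -e cycles,instructions,cache-references,cache-misses,LLC-load-misses -- sleep 10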
