Hi Ed,

You say there is only one mempool. Why?
Have you tried using dedicated mempools, one per port pair, i.e. (1,2) and (3,4)?
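
Something along these lines, for example (just a sketch; the pool names, the split of
the mbuf count and the 9216-byte data room are placeholders taken from the numbers in
your mail):

#include <rte_mbuf.h>
#include <rte_errno.h>
#include <rte_debug.h>

/* Sketch only: one mempool per port pair, both allocated on NUMA node 1.
 * Counts and sizes below are placeholders based on your figures. */
static struct rte_mempool *pool_pair_a;   /* for ports 1 -> 2 */
static struct rte_mempool *pool_pair_b;   /* for ports 3 -> 4 */

static void
create_per_pair_pools(void)
{
        pool_pair_a = rte_pktmbuf_pool_create("mbuf_pool_pair_a",
                        320000,      /* roughly half of your 640455 mbufs */
                        512,         /* per-lcore cache size */
                        0,           /* private data size */
                        9216,        /* data room for your 9KB mbufs */
                        1);          /* socket id = NUMA node 1 */
        pool_pair_b = rte_pktmbuf_pool_create("mbuf_pool_pair_b",
                        320000, 512, 0, 9216, 1);
        if (pool_pair_a == NULL || pool_pair_b == NULL)
                rte_panic("mempool creation failed: %s\n",
                          rte_strerror(rte_errno));
}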

Thank you.

On Thu, 3 Jul 2025, Lombardo, Ed wrote:


Hi,

I have run out of ideas and thought I would reach out to the dpdk community.

 

I have a Sapphire Rapids dual-CPU server and one E810 NIC (I also tried an X710);
both are 4x10G NICs.  When the final stage of our application pipeline enqueues
mbufs onto the tx ring, I expect rte_ring_dequeue_burst() to pull the mbufs from
the tx ring and rte_eth_tx_burst() to transmit them at line rate.  What I see is
that with one interface receiving 64-byte UDP-in-IPv4 packets, receive and
transmit run at line rate (i.e. packets in one port and out another port of the
NIC at 14.9 Mpps).

When I enable a second receive port, the Tx performance on both transmit ports of
the NIC drops to 5 Mpps.  The Tx ring fills faster than the Tx thread can dequeue
and transmit mbufs.

 

Packets arrive on ports 1 and 3 in my test setup.  The NIC is on NUMA Node 1, and
the hugepage memory (6 GB, 1 GB page size) is also on NUMA Node 1.  The mbuf size is 9 KB.

 

Rx Port 1 -> Tx Port 2

Rx Port 3 -> Tx Port 4

 

I monitor the available mbufs; the mempool configuration is:

*** DPDK Mempool Configuration ***
Number Sockets       : 1
Memory/Socket GB     : 6
Hugepage Size MB     : 1024
Overhead/socket MB   : 512
Usable mem/socket MB : 5629
mbuf size Bytes      : 9216
nb mbufs per socket  : 640455
total nb mbufs       : 640455
hugepages/socket GB  : 6
mempool cache size   : 512
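
(For reference, the single mempool is created roughly like this; a simplified sketch,
the real code takes the values from the table above:)

struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
                "mbuf_pool",
                640455,   /* nb mbufs per socket */
                512,      /* mempool cache size */
                0,        /* private data size */
                9216,     /* mbuf size Bytes */
                1);       /* socket id = NUMA node 1 */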

 

*** DPDK EAL args ***
EAL lcore arg      : -l 36   <<< NUMA Node 1
EAL socket-mem arg : --socket-mem=0,6144

 

There are 16 rings in this configuration, all the same size (16384 * 8), and there
is one mempool.

 

The Tx rings are created as single-producer, single-consumer (SP/SC).
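
Roughly like this (a simplified sketch; the actual names and the per-port loop differ):

#include <rte_ring.h>

/* One Tx ring per port; RING_F_SP_ENQ | RING_F_SC_DEQ give the
 * single-producer / single-consumer behaviour. */
struct rte_ring *tx_ring = rte_ring_create(
                "tx_ring_port2",
                16384 * 8,                       /* ring size as above (power of two) */
                1,                               /* socket id = NUMA node 1 */
                RING_F_SP_ENQ | RING_F_SC_DEQ);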

 

There is one Tx thread per NIC port; its only task is to dequeue mbufs from the tx
ring and call rte_eth_tx_burst() to transmit them.  The dequeue burst size is 512,
and the tx burst is equal to or less than 512.  rte_eth_tx_burst() never returns
less than the burst size given.

 

Each Tx thread runs on a dedicated CPU core, and its hyper-threading sibling is left unused.

We use CPU shielding to keep non-critical threads off the CPUs reserved for the Tx
threads.  htop shows the Tx threads are the only threads using the carved-out CPUs.

 

The Tx thread uses rte_ring_dequeue_burst() to get a burst of up to 512 mbufs.

I added debug counters to track how often the dequeue from the tx ring returns
exactly 512 mbufs and how often it returns fewer.  The dequeue from the tx ring
always returns 512, never less.
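
Stripped down, the Tx thread loop is essentially this (simplified; the real code
also updates the debug counters mentioned above):

#include <rte_ring.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define TX_BURST 512

/* Simplified Tx thread body: drain the tx ring and hand the mbufs to the NIC. */
static void
tx_thread_loop(struct rte_ring *tx_ring, uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *burst[TX_BURST];
        unsigned int nb_deq;
        uint16_t nb_tx;

        for (;;) {
                nb_deq = rte_ring_dequeue_burst(tx_ring, (void **)burst, TX_BURST, NULL);
                if (nb_deq == 0)
                        continue;

                nb_tx = rte_eth_tx_burst(port_id, queue_id, burst, nb_deq);

                /* rte_eth_tx_burst() has never returned fewer than nb_deq here,
                 * but any unsent mbufs would have to be retried or freed. */
                if (nb_tx < nb_deq)
                        rte_pktmbuf_free_bulk(&burst[nb_tx], nb_deq - nb_tx);
        }
}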

 

Note: if I skip rte_eth_tx_burst() in the Tx threads and just dequeue the mbufs
from the tx ring and bulk-free them, the tx ring does not fill up; i.e., the thread
frees the mbufs faster than they arrive on the tx ring.
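
i.e., the drain-only test is essentially:

nb_deq = rte_ring_dequeue_burst(tx_ring, (void **)burst, 512, NULL);
if (nb_deq > 0)
        rte_pktmbuf_free_bulk(burst, nb_deq);   /* free instead of rte_eth_tx_burst() */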

 

So I suspect rte_eth_tx_burst() is the bottleneck to investigate, which gets into
the inner workings of DPDK and the Intel NIC architecture.

 

Any help to resolve my issue is greatly appreciated.

 

Thanks,

Ed