On Thu, 3 Jul 2025 20:14:59 +0000 "Lombardo, Ed" <ed.lomba...@netscout.com> wrote:
> Hi,
> I have run out of ideas and thought I would reach out to the dpdk community.
>
> I have a Sapphire Rapids dual CPU server and one E810 (also tried X710); both
> are 4x10G NICs. When our application pipeline's final stage enqueues mbufs
> into the tx ring, I expect rte_ring_dequeue_burst() to pull the mbufs from
> the tx ring and rte_eth_tx_burst() to transmit them at line rate. What I see
> is that with one interface receiving 64-byte UDP in IPv4, receive and
> transmit run at line rate (i.e. packets in one port and out another port of
> the NIC at 14.9 Mpps).
> When I turn on a second receive port, both transmit ports of the NIC show Tx
> performance dropping to 5 Mpps. The Tx ring fills faster than the Tx thread
> can dequeue and transmit mbufs.
>
> Packets arrive on ports 1 and 3 in my test setup. The NIC is on NUMA node 1.
> Hugepage memory (6 GB, 1 GB page size) is on NUMA node 1. The mbuf size is
> 9 KB.
>
> Rx Port 1 -> Tx Port 2
> Rx Port 3 -> Tx Port 4
>
> I monitor the mbufs available:
>
> *** DPDK Mempool Configuration ***
> Number Sockets      : 1
> Memory/Socket GB    : 6
> Hugepage Size MB    : 1024
> Overhead/socket MB  : 512
> Usable mem/socket MB: 5629
> mbuf size Bytes     : 9216
> nb mbufs per socket : 640455
> total nb mbufs      : 640455
> hugepages/socket GB : 6
> mempool cache size  : 512
>
> *** DPDK EAL args ***
> EAL lcore arg       : -l 36   <<< NUMA node 1
> EAL socket-mem arg  : --socket-mem=0,6144
>
> There are 16 rings in this configuration, all the same size (16384 * 8), and
> there is one mempool.
>
> The Tx rings are created as SP and SC.
>
> There is one Tx thread per NIC port, whose only task is to dequeue mbufs
> from the tx ring and call rte_eth_tx_burst() to transmit them. The dequeue
> burst size is 512 and the tx burst is equal to or less than 512.
> rte_eth_tx_burst() never returns less than the burst size given.
>
> Each Tx thread is on a dedicated CPU core and its sibling is unused.
> We use CPU shielding to keep non-critical threads off the CPUs reserved for
> the Tx threads. HTOP shows the Tx threads are the only threads using the
> carved-out CPUs.
>
> The Tx thread uses rte_ring_dequeue_burst() to get a burst of up to 512
> mbufs.
> I added debug counters to track how often rte_ring_dequeue_burst() returns
> exactly 512 mbufs from the tx ring and how often it returns fewer. The
> dequeue from the tx ring always returns 512, never fewer.
>
> Note: if I skip rte_eth_tx_burst() in the Tx threads and just dequeue and
> bulk-free the mbufs from the tx ring, the tx ring does not fill up, i.e. the
> thread can free the mbufs faster than they arrive on the tx ring.
>
> So I suspect rte_eth_tx_burst() is the bottleneck to investigate, which
> involves the inner workings of DPDK and the Intel NIC architecture.
>
> Any help to resolve my issue is greatly appreciated.
>
> Thanks,
> Ed

Do profiling, and look at the number of cache misses.
I suspect using an additional ring is causing lots of cache misses.
Remember, going to memory is really slow on modern processors.
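
For reference, here is a minimal sketch of the Tx loop as I understand your
description (tx_loop, tx_ring, port_id, queue_id and BURST are placeholder
names, not your code). The comments mark where I would expect the misses to
show up, since the mbufs were last touched by the pipeline core, not the Tx
core.

#include <rte_ring.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 512

/* Sketch of the dequeue-and-transmit loop as described above. */
static void
tx_loop(struct rte_ring *tx_ring, uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST];

	for (;;) {
		/*
		 * The mbufs were enqueued by the pipeline core, so they are
		 * cold in this core's cache; the PMD then reads each mbuf's
		 * metadata to fill Tx descriptors -> misses per packet.
		 */
		unsigned int n = rte_ring_dequeue_burst(tx_ring,
				(void **)pkts, BURST, NULL);
		if (n == 0)
			continue;

		uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, n);

		/* Free anything the NIC did not accept to avoid leaks. */
		while (sent < n)
			rte_pktmbuf_free(pkts[sent++]);
	}
}

Running something like
	perf stat -e cache-misses,cache-references -p <pid>
in both the one-port and two-port cases should make the difference visible;
if the miss rate scales with the throughput drop, that points at the ring
hand-off rather than the NIC itself.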