Hi,
During a stress test on a ConnectX-6 Dx (CX6) NIC using DPDK, the process
exits abnormally. We have traced the problem to a bug in the DPDK mlx5 driver.
Please check whether the latest firmware and driver fix this core dump.

By default, DPDK enables the vectorized Rx/Tx path (rxtx_vect) and CQE
compression, and the receive ring buffer has 1024 entries. Under load, the
service process receives SIGSEGV (signal 11) and exits.
Call stack information:
    #2  0x0000000000e72437 in signal_captured_function (signo=11, si=0x7f6310f46eb0, ucontext=0x7f6310f46d80) at ../v1/handle_signal.c:499
    #3  <signal handler called>
    #4  _mm_storeu_si128 (__B=..., __P=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/include/emmintrin.h:720
    #5  rxq_cq_decompress_v (elts=0x20217ff394e8, cq=0x20217f8538c0, rxq=0x20217ff36e00) at ../drivers/net/mlx5/mlx5_rxtx_vec_sse.h:159
    #6  rxq_burst_v (no_cq=<synthetic pointer>, err=<synthetic pointer>, pkts_n=9, pkts=0x2004e278c9d8, rxq=0x20217ff36e00) at ../drivers/net/mlx5/mlx5_rxtx_vec.c:349
    #7  mlx5_rx_burst_vec (dpdk_rxq=0x20217ff36e00, pkts=0x2004e278c9d8, pkts_n=128) at ../drivers/net/mlx5/mlx5_rxtx_vec.c:393
    #8  0x0000000001086448 in rte_eth_rx_burst (nb_pkts=128, rx_pkts=0x2004e278c9d8, queue_id=7, port_id=<optimized out>) at ../include/dpdk/rte_ethdev.h:5339

Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Version:
    [root@localhost ~]# ofed_info -s
    MLNX_OFED_LINUX-23.04-0.5.3.3:

    [root@localhost ~]# ethtool -i eth6|grep fir
    firmware-version: 22.37.1014 (MT_0000000359)
DPDK version: DPDK 21.11


../drivers/net/mlx5/mlx5_rxtx_vec_sse.h:159
157:          /* B.1 store rearm data to mbuf. */
158:          _mm_storeu_si128((__m128i *)&elts[pos + 2]->rearm_data, rearm);
159:          _mm_storeu_si128((__m128i *)&elts[pos + 3]->rearm_data, rearm);

Root cause: When processing compressed CQEs, 9 mini CQEs need to be processed
starting at (*rxq->elts)[1021], so entries (*rxq->elts)[1021] through
(*rxq->elts)[1028] are accessed. Only indices [0, 1027] are reserved when the
receive queue is initialized, so the access to index 1028 is out of bounds and
reads a NULL mbuf pointer. The store to that mbuf's rearm_data then faults and
the process dumps core.
(gdb) p elts[0]
$149 = (struct rte_mbuf *) 0x2006945a8000  //first round
(gdb) p elts[1]
$150 = (struct rte_mbuf *) 0x2006945aa1c0
(gdb) p elts[2]
$151 = (struct rte_mbuf *) 0x2006945ac380
(gdb) p elts[3]
$152 = (struct rte_mbuf *) 0x20217ff36f80
(gdb) p elts[4]
$153 = (struct rte_mbuf *) 0x20217ff36f80  //Second round
(gdb) p elts[5]
$154 = (struct rte_mbuf *) 0x20217ff36f80
(gdb) p elts[6]
$155 = (struct rte_mbuf *) 0x20217ff36f80
(gdb) p elts[7]
$156 = (struct rte_mbuf *) 0x0     //coredump
(gdb) p elts - (*rxq->elts)
$157 = 1021
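
To double-check the index arithmetic, here is a small standalone sketch (not
driver code). It assumes the vectorized loop handles four descriptors per
iteration, as the "first round" / "second round" annotations above suggest,
and that the reserved range is the ring size plus four padding entries, which
matches [0, 1027]. The function name first_oob_index and the DESCS_PER_LOOP
macro are made up for illustration.

```c
#include <stdint.h>

/* Assumption: descriptors handled per vectorized loop iteration. */
#define DESCS_PER_LOOP 4

/*
 * Return the first elts[] index that falls outside the reserved range
 * [0, ring_size + pad - 1] when n_mcqe mini CQEs are decompressed starting
 * at index start, or -1 if every access stays in range.
 */
int64_t
first_oob_index(uint32_t ring_size, uint32_t pad, uint32_t start,
                uint32_t n_mcqe)
{
    uint32_t reserved = ring_size + pad;
    uint32_t pos, i;

    /* The loop works on whole groups, so a partial last group still
     * touches all DESCS_PER_LOOP entries. */
    for (pos = 0; pos < n_mcqe; pos += DESCS_PER_LOOP) {
        for (i = 0; i < DESCS_PER_LOOP; i++) {
            uint32_t idx = start + pos + i;

            if (idx >= reserved)
                return (int64_t)idx;
        }
    }
    return -1;
}
```

With ring_size=1024, pad=4, start=1021 and n_mcqe=9 this returns 1028, the
index where gdb shows the NULL entry (elts[7] above). A start of 1020 with 8
mini CQEs stays in range and returns -1.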


