On Fri, 15 Sep 2023 10:24:01 +0200 Gabor LENCSE <[email protected]> wrote:
> Dear DPDK Developers and Users,
>
> I have met the following issue with my RFC 8219 compliant SIIT and
> stateful NAT64/NAT44 tester, siitperf:
> https://github.com/lencsegabor/siitperf
>
> Its main program starts two sender threads and two receiver threads on
> their exclusively used CPU cores using the rte_eal_remote_launch()
> function; e.g., the code is as follows:
>
> // start left sender
> if ( rte_eal_remote_launch(send, &spars1, cpu_left_sender) )
>     std::cout << "Error: could not start Left Sender." << std::endl;
>
> When the test frame sending is finished, the senders check the sending
> time, and if the allowed time was significantly exceeded, the sender
> prints an error message and terminates (itself and also the main
> program) using the rte_exit() function.
>
> This is the code:
>
> elapsed_seconds = (double)(rte_rdtsc()-start_tsc)/hz;
> printf("Info: %s sender's sending took %3.10lf seconds.\n", side,
>        elapsed_seconds);
> if ( elapsed_seconds > duration*TOLERANCE )
>     rte_exit(EXIT_FAILURE, "%s sending exceeded the %3.10lf seconds limit, the test is invalid.\n",
>              side, duration*TOLERANCE);
> printf("%s frames sent: %lu\n", side, sent_frames);
>
> return 0;
>
> The above code worked as I expected when I used siitperf under Debian
> 9.13 with DPDK 16.11.11-1+deb9u2. It always displayed the execution
> time of test frame sending, and if the allowed time was significantly
> exceeded, it printed an error message and terminated, so the sender did
> not print the number of sent frames. The main program was also
> terminated by the call of this function: it did not write out the
> "Info: Test finished." message.
>
> However, when I updated siitperf to use it under Ubuntu 22.04 with DPDK
> version "21.11.3-0ubuntu0.22.04.1 amd64", I experienced something
> rather strange:
>
> In the case when the sending time is significantly exceeded, I get the
> following messages from the program (I copy here the full output, as it
> may be useful):
>
> root@x033:~/siitperf# cat temp.out
> EAL: Detected CPU lcores: 56
> EAL: Detected NUMA nodes: 4
> EAL: Detected shared linkage of DPDK
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> EAL: Selected IOVA mode 'PA'
> EAL: No free 2048 kB hugepages reported on node 0
> EAL: No free 2048 kB hugepages reported on node 1
> EAL: No free 2048 kB hugepages reported on node 2
> EAL: No free 2048 kB hugepages reported on node 3
> EAL: No available 2048 kB hugepages reported
> EAL: VFIO support initialized
> EAL: Probe PCI driver: net_ice (8086:159b) device: 0000:98:00.0 (socket 2)
> ice_load_pkg_type(): Active package is: 1.3.26.0, ICE OS Default Package (single VLAN mode)
> EAL: Probe PCI driver: net_ice (8086:159b) device: 0000:98:00.1 (socket 2)
> ice_load_pkg_type(): Active package is: 1.3.26.0, ICE OS Default Package (single VLAN mode)
> TELEMETRY: No legacy callbacks, legacy socket not created
> ice_set_rx_function(): Using AVX2 Vector Rx (port 0).
> ice_set_rx_function(): Using AVX2 Vector Rx (port 1).
> Info: Left port and Left Sender CPU core belong to the same NUMA node: 2
> Info: Right port and Right Receiver CPU core belong to the same NUMA node: 2
> Info: Right port and Right Sender CPU core belong to the same NUMA node: 2
> Info: Left port and Left Receiver CPU core belong to the same NUMA node: 2
> Info: Testing initiated at 2023-09-15 07:50:17
> EAL: Error - exiting with code: 1
>   Cause: Forward sending exceeded the 60.0006000000 seconds limit, the test is invalid.
> EAL: Error - exiting with code: 1
>   Cause: Reverse sending exceeded the 60.0006000000 seconds limit, the test is invalid.
> root@x033:~/siitperf#
>
> The rte_exit() function seems to work, as the error message appears
> and the number of sent frames is not displayed; however, the
> "Info: ..." message about the sending time (printed earlier in the
> code) is missing! This is rather strange!
>
> What is worse, the program does not stop: *the sender threads and the
> main program remain running (forever)*.
>
> Here is the output of the "top" command:
>
> top - 07:54:24 up 1 day, 14:12, 2 users, load average: 3.02, 2.41, 2.10
> Tasks: 591 total, 2 running, 589 sleeping, 0 stopped, 0 zombie
> %Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni, 94.1 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu9  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>
> CPU0 is the main core; the left sender and the right sender use CPU1
> and CPU9, respectively. (The receivers, which used CPU5 and CPU13,
> have already terminated due to their timeout.)
>
> *Thus, rte_exit() behaves differently now: it used to terminate the
> main program, but now it does not.* (And it also suppresses some
> output that was printed earlier.)
>
> Is it a bug or a new feature?
>
> *How could I achieve the old behavior?* (Or at least the termination
> of the main program by the sender threads?)
>
> Thank you very much for your guidance!
>
> Best regards,
>
> Gábor Lencse

Please get a backtrace. A simple way is to attach gdb to the running
process ("gdb -p <pid>") and run "thread apply all bt".

I suspect that, since rte_exit() calls the internal eal_cleanup
function, and that calls close in the driver, the ice driver's close
function has a bug. Perhaps the ice close function does not correctly
handle the case where the device has not been started.
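
In the meantime, one way to get the whole process terminating again
would be to avoid calling rte_exit() from the worker lcores altogether:
let each sender report failure through its return value, and have the
main lcore call rte_exit() after collecting the workers with
rte_eal_wait_lcore(). A minimal sketch against DPDK 21.11; all names
below are illustrative placeholders, not taken from siitperf:

#include <stdio.h>
#include <stdlib.h>

#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_launch.h>
#include <rte_debug.h>

/* Worker body: reports failure via its return value instead of
 * calling rte_exit() itself. */
static int
sender(void *arg)
{
	(void)arg;
	/* ... send test frames, measure the elapsed time ... */
	int deadline_exceeded = 1;	/* placeholder for the real check */
	if (deadline_exceeded) {
		fprintf(stderr, "sending exceeded the time limit\n");
		return -1;
	}
	return 0;
}

int
main(int argc, char **argv)
{
	unsigned int lcore_id;
	int failures = 0;

	if (rte_eal_init(argc, argv) < 0)
		rte_exit(EXIT_FAILURE, "EAL init failed\n");

	/* Launch one worker on each available worker lcore. */
	RTE_LCORE_FOREACH_WORKER(lcore_id)
		rte_eal_remote_launch(sender, NULL, lcore_id);

	/* rte_eal_wait_lcore() blocks until the given worker finishes
	 * and returns the worker function's return value. */
	RTE_LCORE_FOREACH_WORKER(lcore_id)
		if (rte_eal_wait_lcore(lcore_id) != 0)
			failures++;

	/* Only the main lcore decides to exit, after all workers are
	 * done, so the cleanup in rte_exit() runs with no lcore busy. */
	if (failures)
		rte_exit(EXIT_FAILURE, "%d sender(s) failed\n", failures);

	rte_eal_cleanup();
	return 0;
}

Whether this sidesteps the hang on 21.11 is untested here, but it keeps
rte_exit(), and thus the eal_cleanup/driver-close path, on the main
lcore, which is the pattern the DPDK sample applications follow.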
