** Description changed:
+ [Impact]
+
+ * Crashes on certain Skylake chips
+
+ * Follow upstream by disabling one of the gcc options
+
+ [Test Case]
+
+ * Part of the MRE bug 1817675, following the MRE verification process as
+ defined there.
+
+ [Regression Potential]
+
+ * Code rebuilt against the new DPDK headers will be slightly slower
+ (not using the feature) but avoids the crash. The slowdown should
+ be negligible in most cases, and avoiding the crash outweighs it.
+
+ [Other Info]
+
+ * n/a
+
+ ---
+
Hi, Christian
We recently encountered a weird issue with Ubuntu 18.04 on a Skylake
server. I can always reproduce the crash and have narrowed it down. I
suspect it could be a GCC issue.
-
[1] How to reproduce
- ConnectX-4Lx/ConnectX-5 with mlx5 PMD in DPDK 18.02.1
- Ubuntu 18.04 on Intel Skylake server
- gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
- Testpmd crashes when it starts to forward traffic. Easy to reproduce.
- Only happens on the Skylake server.
- DPDK 18.05 and later don't have this issue. git-bisect gives no clue.
This is because I enabled MEMPOOL_DEBUG and MLX5_DEBUG. As mempool/rte_memcpy
are inlined functions, they should be affected. Now I can see the crash
regardless of version: 18.02, 18.05 and 18.08.
[2] Failure point
The attached patch gives insight into why it crashes. The following is the
output of the patch and of the GDB commands it prints.
In summary, rte_memcpy() doesn't work as expected. In __mempool_generic_put(),
there's an rte_memcpy() that moves the array of objects to the lcore cache. If I
run memcmp() right after rte_memcpy(dst, src, n), the data in dst differs from
the data in src. It looks like some of the data got shifted by a few bytes, as
you can see below.
+ [GDB command]
+ $dst = 0x7ffff4e09ea8
+ $src = 0x7fffce3fb970
+ $n = 256
+ x/32gx 0x7ffff4e09ea8
+ x/32gx 0x7fffce3fb970
+ testpmd: /home/mlnxtest/dpdk/build/include/rte_mempool.h:1140:
__mempool_generic_put: Assertion `0' failed.
+ Thread 4 "lcore-slave-1" received signal SIGABRT, Aborted.
+ [Switching to Thread 0x7fffce3ff700 (LWP 69913)]
+ (gdb) x/32gx 0x7ffff4e09ea8
+ 0x7ffff4e09ea8: 0x00007fffaac38ec0 0x00007fffaac38500
+ 0x7ffff4e09eb8: 0x00007fffaac37b40 0x00007fffaac37180
+ 0x7ffff4e09ec8: 0x850000007fffaac3 0x7b4000007fffaac3
+ 0x7ffff4e09ed8: 0x00007fffaac35440 0x00007fffaac34a80
+ 0x7ffff4e09ee8: 0xaac3850000007fff 0xaac37b4000007fff
+ 0x7ffff4e09ef8: 0x00007fffaac32d40 0x00007fffaac32380
+ 0x7ffff4e09f08: 0x7fffaac385000000 0x7fffaac37b400000
+ 0x7ffff4e09f18: 0x00007fffaac30640 0x00007fffaac2fc80
+ 0x7ffff4e09f28: 0x00007fffaac2f2c0 0x00007fffaac2e900
+ 0x7ffff4e09f38: 0x00007fffaac2df40 0x00007fffaac2d580
+ 0x7ffff4e09f48: 0x00007fffaac2cbc0 0x00007fffaac2c200
+ 0x7ffff4e09f58: 0x00007fffaac2b840 0x00007fffaac2ae80
+ 0x7ffff4e09f68: 0x00007fffaac2a4c0 0x00007fffaac29b00
+ 0x7ffff4e09f78: 0x00007fffaac29140 0x00007fffaac28780
+ 0x7ffff4e09f88: 0x00007fffaac27dc0 0x00007fffaac27400
+ 0x7ffff4e09f98: 0x00007fffaac26a40 0x00007fffaac26080
+ (gdb) x/32gx 0x7fffce3fb970
+ 0x7fffce3fb970: 0x00007fffaac38ec0 0x00007fffaac38500
+ 0x7fffce3fb980: 0x00007fffaac37b40 0x00007fffaac37180
+ 0x7fffce3fb990: 0x00007fffaac367c0 0x00007fffaac35e00
+ 0x7fffce3fb9a0: 0x00007fffaac35440 0x00007fffaac34a80
+ 0x7fffce3fb9b0: 0x00007fffaac340c0 0x00007fffaac33700
+ 0x7fffce3fb9c0: 0x00007fffaac32d40 0x00007fffaac32380
+ 0x7fffce3fb9d0: 0x00007fffaac319c0 0x00007fffaac31000
+ 0x7fffce3fb9e0: 0x00007fffaac30640 0x00007fffaac2fc80
+ 0x7fffce3fb9f0: 0x00007fffaac2f2c0 0x00007fffaac2e900
+ 0x7fffce3fba00: 0x00007fffaac2df40 0x00007fffaac2d580
+ 0x7fffce3fba10: 0x00007fffaac2cbc0 0x00007fffaac2c200
+ 0x7fffce3fba20: 0x00007fffaac2b840 0x00007fffaac2ae80
+ 0x7fffce3fba30: 0x00007fffaac2a4c0 0x00007fffaac29b00
+ 0x7fffce3fba40: 0x00007fffaac29140 0x00007fffaac28780
+ 0x7fffce3fba50: 0x00007fffaac27dc0 0x00007fffaac27400
+ 0x7fffce3fba60: 0x00007fffaac26a40 0x00007fffaac26080
AFAIK, AVX512F support is disabled by default in DPDK as it is still
experimental (CONFIG_RTE_ENABLE_AVX512=n). But with gcc optimization, the AVX2
version of rte_memcpy() seems to be compiled into 512-bit instructions. If I
disable that by adding EXTRA_CFLAGS="-mno-avx512f", then it works fine and
doesn't crash.
Do you have any idea regarding this issue or are you already aware of
it?
-
Thanks,
Yongseok
-
$ git diff
diff --git a/config/common_base b/config/common_base
index ad03cf433..f512b5a88 100644
--- a/config/common_base
+++ b/config/common_base
@@ -275,8 +275,8 @@ CONFIG_RTE_LIBRTE_MLX4_TX_MP_CACHE=8
 #
 # Compile burst-oriented Mellanox ConnectX-4 & ConnectX-5 (MLX5) PMD
 #
-CONFIG_RTE_LIBRTE_MLX5_PMD=n
-CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
+CONFIG_RTE_LIBRTE_MLX5_PMD=y
+CONFIG_RTE_LIBRTE_MLX5_DEBUG=y
 CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=n
 CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8
@@ -597,7 +597,7 @@ CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
 #
 CONFIG_RTE_LIBRTE_MEMPOOL=y
 CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
-CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
+CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y
 #
 # Compile Mempool drivers
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..9f48028d9 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -39,6 +39,7 @@
 #include <errno.h>
 #include <inttypes.h>
 #include <sys/queue.h>
+#include <assert.h>
 #include <rte_config.h>
 #include <rte_spinlock.h>
@@ -1123,6 +1124,22 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
 /* Add elements back into the cache */
 rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+ if(memcmp(&cache_objs[0], obj_table, sizeof(void *) * n)) {
+ printf("[GDB command] \n"
+ "$dst = %p\n"
+ "$src = %p\n"
+ "$n = %ld\n"
+ "x/%ldgx %p\n"
+ "x/%ldgx %p\n",
+ (void *)&cache_objs[0],
+ (const void *)obj_table,
+ sizeof(void *) * n,
+ sizeof(void *) * n / 8, (void *)&cache_objs[0],
+ sizeof(void *) * n / 8, (const void *)obj_table
+ );
+ assert(0);
+ }
+
 cache->len += n;
 if (cache->len >= cache->flushthresh) {
--
https://bugs.launchpad.net/bugs/1799397
Title:
[dpdk]rte_memcpy() moves data incorrectly on Ubuntu 18.04 on Intel
Skylake.