https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90028
Bug ID: 90028
Summary: On Intel Skylake (-march=native) generated avx512 instruction can be wrong
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: ferruh.yigit at intel dot com
Target Milestone: ---

Created attachment 46114
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46114&action=edit
19.05-rc1 default gcc build on skylake

gcc version: gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
binutils: GNU ld version 2.31.1-24.fc29

This is observed in the DPDK project (https://git.dpdk.org/dpdk/tree/?h=v19.05-rc1) on an Intel Skylake CPU.

Full build command (-I and -D options removed):

gcc -Wp,-MD,./.rte_kni.o.d.tmp -m64 -pthread -march=native -W -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wpointer-arith -Wcast-align -Wnested-externs -Wcast-qual -Wformat-nonliteral -Wformat-security -Wundef -Wwrite-strings -Wdeprecated -Werror -Wimplicit-fallthrough=2 -Wno-format-truncation -O3 -fno-strict-aliasing -o rte_kni.o -c /root/development/dpdk-next-net/lib/librte_kni/rte_kni.c

When the related code is built with the "-mno-avx512f" flag, the problem goes away. The output of clang (clang version 7.0.1 (Fedora 7.0.1-6.fc29)) also works fine.

The suspected cause is the use of the 'vpgatherqq' instruction.

The related .c code is (https://git.dpdk.org/dpdk/tree/lib/librte_kni/rte_kni.c?h=v19.05-rc1#n546):

"
static void *
va2pa(struct rte_mbuf *m)
{
        return (void *)((unsigned long)m -
                        ((unsigned long)m->buf_addr -
                         (unsigned long)m->buf_iova));
}

unsigned
rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned num)
{
        void *phy_mbufs[num];
        unsigned int ret;
        unsigned int i;

        for (i = 0; i < num; i++)
                phy_mbufs[i] = va2pa(mbufs[i]);
        ....
"

'm->buf_addr' and 'm->buf_iova' are next to each other in the struct, so their addresses differ by 8 bytes.
Generated asm code:

avx512 enabled code snippet:

    232c:  ba ff ff ff ff          mov    $0xffffffff,%edx
    2331:  48 c1 e0 05             shl    $0x5,%rax
    2335:  31 c9                   xor    %ecx,%ecx
    2337:  c5 f9 92 ca             kmovb  %edx,%k1
    233b:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    2340:  62 f1 fe 28 6f 0c 0e    vmovdqu64 (%rsi,%rcx,1),%ymm1
    2347:  c5 f9 90 d1             kmovb  %k1,%k2
   *234b:  62 f2 fd 2a 91 04 0d    vpgatherqq 0x1(,%ymm1,1),%ymm0{%k2}
    2352:  01 00 00 00
    2356:  c5 f9 90 d9             kmovb  %k1,%k3
    235a:  c5 fd d4 c1             vpaddq %ymm1,%ymm0,%ymm0
    235e:  62 f2 fd 2b 91 14 0d    vpgatherqq 0x0(,%ymm1,1),%ymm2{%k3}
    2365:  00 00 00 00
    2369:  c5 fd fb c2             vpsubq %ymm2,%ymm0,%ymm0
    236d:  62 d1 fe 28 7f 04 08    vmovdqu64 %ymm0,(%r8,%rcx,1)

Same code with avx512 disabled (avx2) code snippet:

    2332:  c5 ed 76 d2             vpcmpeqd %ymm2,%ymm2,%ymm2
    2336:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    233d:  00 00 00
    2340:  c5 fe 6f 0c 0e          vmovdqu (%rsi,%rcx,1),%ymm1
    2345:  c5 fd 6f e2             vmovdqa %ymm2,%ymm4
    2349:  c4 e2 dd 91 04 0d 08    vpgatherqq %ymm4,0x8(,%ymm1,1),%ymm0
    2350:  00 00 00
    2353:  c5 fd 6f ea             vmovdqa %ymm2,%ymm5
    2357:  c4 e2 d5 91 1c 0d 00    vpgatherqq %ymm5,0x0(,%ymm1,1),%ymm3
    235e:  00 00 00
    2361:  c5 fd d4 c1             vpaddq %ymm1,%ymm0,%ymm0
    2365:  c5 fd fb c3             vpsubq %ymm3,%ymm0,%ymm0
    2369:  c4 c1 7e 7f 04 08       vmovdqu %ymm0,(%r8,%rcx,1)

Full asm outputs are attached.

In the avx512 one, for 'vpgatherqq', it looks like the offset should be 0x8 instead of 0x1.