[Bug c++/81602] New: Unnecessary zero-extension after 16 bit popcnt

christoph.diegelmann at gmx dot de Fri, 28 Jul 2017 06:05:42 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602


            Bug ID: 81602
           Summary: Unnecessary zero-extension after 16 bit popcnt
           Product: gcc
           Version: 7.1.1
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: christoph.diegelmann at gmx dot de
  Target Milestone: ---

GCC misses an optimization on this:

 #include <cstdint>
 #include "immintrin.h"

 void test(std::uint16_t* mask, std::uint16_t* data) {
     for (int i = 0; i < 1024; ++i) {
         *data = 0;
         unsigned tmp = *mask++;
         unsigned step = _mm_popcnt_u32(tmp);
         data += step;
     }
 }

g++ -O3 -Wall -std=c++14 -march=skylake generates:

 test(unsigned short*, unsigned short*):
         leaq    2048(%rdi), %rdx
 .L2:
         xorl    %eax, %eax
         addq    $2, %rdi
         movw    %ax, (%rsi)
         popcntw -2(%rdi), %ax
         movzwl  %ax, %eax
         leaq    (%rsi,%rax,2), %rsi
         cmpq    %rdx, %rdi
         jne     .L2
         ret

The rax register is known to be zero when `popcntw -2(%rdi), %ax` executes,
because of the preceding `xorl %eax, %eax`. Since popcntw writes only ax, the
upper bits stay zero. Nevertheless, gcc still clears them afterwards using
`movzwl %ax, %eax`.
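
The argument above can be checked with a small model of the three
instructions. This is only a sketch: `popcount16`,
`zero_extension_is_redundant` and `redundant_for_all_inputs` are hypothetical
helpers written for illustration, and the 32-bit variable `eax` merely stands
in for the register. It is not how GCC reasons about the code internally.

```cpp
#include <cstdint>

// Portable 16-bit population count, standing in for popcntw
// (hypothetical helper, not a GCC builtin or intrinsic).
inline std::uint16_t popcount16(std::uint16_t v) {
    std::uint16_t n = 0;
    while (v != 0) {
        v = static_cast<std::uint16_t>(v & (v - 1));  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Models: xorl %eax,%eax ; popcntw src,%ax ; movzwl %ax,%eax
// Returns true when the final movzwl leaves eax unchanged.
inline bool zero_extension_is_redundant(std::uint16_t src) {
    std::uint32_t eax = 0;                               // xorl %eax, %eax
    eax = (eax & 0xFFFF0000u) | popcount16(src);         // popcntw writes only the low 16 bits
    const std::uint32_t after = eax & 0x0000FFFFu;       // movzwl %ax, %eax
    return after == eax;
}

// Exhaustively checks every possible 16-bit input.
inline bool redundant_for_all_inputs() {
    for (std::uint32_t v = 0; v <= 0xFFFFu; ++v) {
        if (!zero_extension_is_redundant(static_cast<std::uint16_t>(v))) {
            return false;
        }
    }
    return true;
}
```

In this model the check holds for every input, because the upper half of
`eax` is already zero before the 16-bit write and the write never touches it.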

clang takes a different route: it zero-extends on load with
`movzwl (%rdi,%rax,2), %ecx` and uses the 32 bit popcnt, correctly
recognising that there's no need to clear the upper bits afterwards.

clang -O3 -Wall -std=c++14 -march=skylake -fno-unroll-loops generates:

 test(unsigned short*, unsigned short*):
         xorl    %eax, %eax
 .LBB0_1:
         movw    $0, (%rsi)
         movzwl  (%rdi,%rax,2), %ecx
         popcntl %ecx, %ecx
         leaq    (%rsi,%rcx,2), %rsi
         addq    $1, %rax
         cmpl    $1024, %eax             # imm = 0x400
         jne     .LBB0_1
         retq

See https://godbolt.org/g/kgQ7VS
