https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602
            Bug ID: 81602
           Summary: Unnecessary zero-extension after 16 bit popcnt
           Product: gcc
           Version: 7.1.1
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: christoph.diegelmann at gmx dot de
  Target Milestone: ---

GCC misses an optimization on the following code:

    #include <cstdint>
    #include "immintrin.h"

    void test(std::uint16_t* mask, std::uint16_t* data) {
        for (int i = 0; i < 1024; ++i) {
            *data = 0;
            unsigned tmp = *mask++;
            unsigned step = _mm_popcnt_u32(tmp);
            data += step;
        }
    }

g++ -O3 -Wall -std=c++14 -march=skylake generates:

    test(unsigned short*, unsigned short*):
            leaq    2048(%rdi), %rdx
    .L2:
            xorl    %eax, %eax
            addq    $2, %rdi
            movw    %ax, (%rsi)
            popcntw -2(%rdi), %ax
            movzwl  %ax, %eax
            leaq    (%rsi,%rax,2), %rsi
            cmpq    %rdx, %rdi
            jne     .L2
            ret

The rax register is known to be zero at the time of `popcntw -2(%rdi), %ax`,
because `xorl %eax, %eax` cleared it at the top of the loop. Nevertheless, GCC
still clears the upper bits afterwards using `movzwl %ax, %eax`.

Clang takes a different route: it loads the mask with
`movzwl (%rdi,%rax,2), %ecx` and uses the 32-bit popcnt, and it correctly
recognises that there is no need to clear the upper bits of the result.

clang -O3 -Wall -std=c++14 -march=skylake -fno-unroll-loops generates:

    test(unsigned short*, unsigned short*):
            xorl    %eax, %eax
    .LBB0_1:
            movw    $0, (%rsi)
            movzwl  (%rdi,%rax,2), %ecx
            popcntl %ecx, %ecx
            leaq    (%rsi,%rcx,2), %rsi
            addq    $1, %rax
            cmpl    $1024, %eax             # imm = 0x400
            jne     .LBB0_1
            retq

See https://godbolt.org/g/kgQ7VS
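
For illustration only (this is not part of the original testcase), here is a minimal sketch of the value-level fact behind the report: the popcount of a 16-bit operand is at most 16, so masking the result to 16 bits can never change it. The helper names `popcount16`, `zero_extend16` and `zero_extension_is_redundant` are hypothetical, and `std::bitset` is used as a portable stand-in for the popcnt instruction. The sketch does not model partial-register writes; in the GCC output above the upper bits of eax are zero because of the preceding `xorl %eax, %eax`.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: portable stand-in for popcnt on a 16-bit operand.
inline unsigned popcount16(std::uint16_t v) {
    return static_cast<unsigned>(std::bitset<16>(v).count());
}

// Hypothetical helper: models the value effect of `movzwl %ax, %eax`,
// i.e. keep the low 16 bits and clear everything above them.
inline unsigned zero_extend16(unsigned v) {
    return v & 0xFFFFu;
}

// Exhaustively checks every 16-bit input. Returns true if masking the
// popcount result to 16 bits never changes its value.
inline bool zero_extension_is_redundant() {
    for (unsigned v = 0; v <= 0xFFFFu; ++v) {
        const unsigned count = popcount16(static_cast<std::uint16_t>(v));
        assert(count <= 16u);  // at most 16 bits can be set
        if (zero_extend16(count) != count) {
            return false;
        }
    }
    return true;
}
```

Calling `zero_extension_is_redundant()` from a small driver covers all 65536 inputs. It only confirms that the masked and unmasked values agree, which is the property that makes the `movzwl` after the popcnt unnecessary once the upper bits of the destination are known to be zero.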