https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86267

            Bug ID: 86267
           Summary: detect conversions between bitmasks and vector masks
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase (cf. https://godbolt.org/g/gi6f7V):

#include <x86intrin.h>

auto f(__m256i a, __m256i b) {
    __m256i k = a < b;
    long long bitmask = _mm256_movemask_pd((__m256d)k) & 0xf;
    return _mm256_cmpgt_epi64(
        __m256i{bitmask, bitmask, bitmask, bitmask} & __m256i{1, 2, 4, 8},
        __m256i()
    );
}

This should be optimized to "return a < b;".
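
That is, the whole movemask round trip should cancel out. Hand-written
expected result (only a sketch; with -mavx2 this should compile to a
single vpcmpgtq):

#include <x86intrin.h>

auto f_expected(__m256i a, __m256i b) {
    return a < b;   // same per-element -1/0 mask as f above
}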

A more complex case also allows conversion of the vector mask (cf.
https://godbolt.org/g/FLAEgC):

#include <x86intrin.h>

auto f(__m256i a, __m256i b) {
    using V [[gnu::vector_size(16)]] = int;
    __m256i k = a < b;
    int bitmask = _mm256_movemask_pd((__m256d)k) & 0xf;
    return (V{bitmask, bitmask, bitmask, bitmask} & V{1, 2, 4, 8}) != 0;
}
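
The expected result here is the same comparison plus a narrowing of the
4 x 64-bit mask to a 4 x 32-bit mask; semantically it is equivalent to
this hand-written version (again only a sketch):

#include <x86intrin.h>

auto f_expected(__m256i a, __m256i b) {
    using V [[gnu::vector_size(16)]] = int;
    __m256i k = a < b;
    // truncate each -1/0 mask element from 64 to 32 bits
    return V{(int)k[0], (int)k[1], (int)k[2], (int)k[3]};
}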

I believe the most portable and readable strategy would be to introduce new
builtins that convert between bitmasks and vector masks. (This can be
especially helpful with AVX512, where the builtin comparison operators return
vector masks, but Intel intrinsics require bitmasks.)

E.g.:

using W [[gnu::vector_size(32)]] = long long;
using V [[gnu::vector_size(16)]] = int;
V f(W a, W b) {
    unsigned bitmask = __builtin_vector_to_bitmask(a < b);
    return __builtin_bitmask_to_vector(bitmask, V);
}
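
For the AVX512 case mentioned above, such a builtin would let the vector
mask produced by the builtin comparison feed directly into the
mask-taking intrinsics. Sketch (requires -mavx512f and uses the
proposed, not yet existing, builtin):

#include <x86intrin.h>

__m512i g(__m512i a, __m512i b, __m512i c) {
    __mmask8 m = __builtin_vector_to_bitmask(a < b);  // vector mask -> __mmask8
    return _mm512_mask_add_epi64(a, m, a, c);         // add c where a < b, keep a elsewhere
}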

I'd define __builtin_vector_to_bitmask to consider only the MSB of each
element. And, to make optimization simpler, it should be allowed to assume
that all remaining input bits already follow whatever the canonical mask
representation on the target system is.
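
A scalar sketch of those semantics for the W type from above
(vector_to_bitmask is just an illustrative name, not the proposed
builtin itself):

using W [[gnu::vector_size(32)]] = long long;

// Bit i of the result is the MSB of element i; all other input bits
// are ignored here (the builtin would be allowed to assume they already
// match the target's canonical mask representation).
unsigned vector_to_bitmask(W mask) {
    unsigned bits = 0;
    for (int i = 0; i < 4; ++i)
        bits |= (unsigned)((unsigned long long)mask[i] >> 63) << i;
    return bits;
}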
