[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-12 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #15 from Jörn Engel  ---
> int foo (int x) { return __builtin_ctz (x); }
> 
> Without -mbmi, gcc emits:
> xorl%eax, %eax
> rep bsfl%edi, %eax
> ret

That example convinces me.  Code would be broken with a zero-argument,
but if the compiler cannot decide whether that is possible and the
programmer can, it makes sense to generate less/faster code.

Thank you!

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #13 from Jörn Engel  ---
None of those examples convince me.  If you or I know that a zero-argument is
impossible, but the compiler doesn't know, wouldn't that still be UB?  And if
the compiler knows, it can remove the branch either way.

Similar for architectures returning 64 or -1, code could be

asm(...);
if (ret == 64)
return 32;
return ret;

Again, if a null-argument is impossible, the branch can be removed.  And if the
programmer wants to get 64 or -1, that either requires a conditional or invokes
UB.

So far, whichever way I look at it, moving the conditional inside of
__builtin_ctz() and making the result well-defined for any input doesn't have
any downsides.  I cannot even think of existing code that would break unless it
already invoked UB and depended on a lucky roll of the dice to work correctly.

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #11 from Jörn Engel  ---
I stand corrected.  Thank you very much!

Out of curiosity, if the only non-broken way to call __builtin_ctz(foo) is via
"foo ? __builtin_ctz(foo) : 32", why isn't the conditional moved into
__builtin_ctz()?  Is there some hidden advantage from callers having to add the
conditional or getting surprised by undefined behaviour?

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #8 from Jörn Engel  ---
Updated testcase below fails to remove the branch with my gcc-8.

/*
 * usage:
 * gcc -std=gnu11 -Wall -Wextra -g -march=core-avx2 -mbmi -fPIC -O3 % &&
./a.out < /dev/zero
 */
#include 
#include 
#include 
#include 

typedef uint8_t u8_256 __attribute__((vector_size(32), may_alias));
typedef char  c256 __attribute__((vector_size(32), may_alias));
typedef uint8_t  u256u __attribute__((vector_size(32), may_alias, aligned(1)));

static inline  u8_256 read256(const void *buf) { return *(const u256u *)buf; }

static inline int movemask8_256(u8_256 mask)
{
return __builtin_ia32_pmovmskb256((c256)mask);
}

static inline int matchlen32(const void *a, const void *b)
{
int mask = ~movemask8_256(read256(a) == read256(b));
return mask ? __builtin_ctz(mask) : 32;
}

static int ml30(const void *src)
{
int ml = matchlen32(src, src + 1);
if (ml >= 30)
ml += matchlen32(src + 32, src + 1 + 32);
return ml;
}

static int ml32(const void *src)
{
int ml = matchlen32(src, src + 1);
if (ml >= 32)
ml += matchlen32(src + 32, src + 1 + 32);
return ml;
}

int main(void)
{
uint8_t src[256];
ssize_t n;

n = read(0, src, sizeof(src));
if (n != sizeof(src))
return -1;
printf("should be 64: %d\n", ml30(src));
printf("should be 64: %d\n", ml32(src));
return 0;
}

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #6 from Jörn Engel  ---
True for one, but not the other.

return mask ? __builtin_ctz(mask) : 32;
1099:   83 f6 ffxor$0x,%esi
109c:   74 47   je 10e5 
109e:   f3 0f bc f6 tzcnt  %esi,%esi

I used:
gcc-8 -std=gnu11 -Wall -Wextra -g -march=core-avx2 -mbmi -fPIC -O3 %

_tzcnt_u32() works as you said it should.  Nicer than inline asm and allows
type checking.  Thank you for that hint!

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #4 from Jörn Engel  ---
Fair enough.  That means the only way to get tzcnt without a conditional is by
using inline asm.  Annoying, but something I can work with.

Annoying because for CPUs with BMI1, tzcnt is well-defined and I explicitly
tell the compiler to generate code for BMI1.  So while the __builtin_ctz() in
generall is undefined, it is actually well-defined for the case I care about.

But I need to support older compilers anyway, so inline asm it is.  Thank you!

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #2 from Jörn Engel  ---
The input is 32.  Does the "undefined-if-zero" thing give gcc license to remove
code depending on the output?  If it does, why is the code only removed when
comparing against 31/32, not when comparing against 30?

[Bug c/89670] New: __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

Bug ID: 89670
   Summary: __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be
<31 ?
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: joern at purestorage dot com
  Target Milestone: ---

Created attachment 45945
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45945=edit
matchlen testcase extracted from lz compressor

I ran across this while working on a LZ compression library.  One way of
calculating the match length is through vector-comparison, movemask and ctz. 
It is relatively useful because it covers up to 32 equal bytes without branch.

If 32 bytes match, the true match length might be much longer than 32.  So
naturally the code contains a branch
if (ml == 32) {
/* calculate actual match length */
}

That branch was optimized away, which surprised me a bit.  I have reduced the
problem to the attached testcase.  Testcase seems to work fine with gcc 4.8,
but fails with 4.9, 5, 6, 7 and 8.  It also fails with clang 3.5, 3.8, 4.0, 6.0
and 7, fwiw.

System is an old Debian unstable, compilers are from Debian.b