[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-12 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #15 from Jörn Engel  ---
> int foo (int x) { return __builtin_ctz (x); }
> 
> Without -mbmi, gcc emits:
> xorl%eax, %eax
> rep bsfl%edi, %eax
> ret

That example convinces me.  Code would be broken with a zero argument, but if
the compiler cannot decide whether that is possible and the programmer can, it
makes sense to generate smaller, faster code.

Thank you!

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #14 from Jakub Jelinek  ---
(In reply to Jörn Engel from comment #13)
> None of those examples convince me.  If you or I know that a zero-argument
> is impossible, but the compiler doesn't know, wouldn't that still be UB? 
> And if the compiler knows, it can remove the branch either way.

The current design is good.
As has been said, what the various hw instructions do for a zero input varies a
lot: the result can be the bit size of the corresponding type, -1, some larger
value, or completely undefined.  E.g. the x86 bsf instruction leaves the content
of the destination register unmodified when used with 0.

Try:

int foo (int x) { return __builtin_ctz (x); }
int bar (int x) { return x ? __builtin_ctz (x) : 32; }
int baz (int x) { return x ? __builtin_ctz (x) : -1; }

Without -mbmi, gcc emits:
xorl    %eax, %eax
rep bsfl    %edi, %eax
ret
for foo, and
xorl    %eax, %eax
movl    $32, %edx
rep bsfl    %edi, %eax
testl   %edi, %edi
cmove   %edx, %eax
ret
for bar and
testl   %edi, %edi
je  .L8
xorl    %eax, %eax
rep bsfl    %edi, %eax
ret
.L8:
movl    $-1, %eax
ret
for baz.  If __builtin_ctz were well defined for 0, we could not emit the simple
first case unless the optimizers figured out that 0 is not possible; plus, the
choice of what to do for 0 would probably need to be consistent across all
arches, which means generating worse code whenever the chosen value doesn't
match what the hw can do.
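
A minimal sketch of that idiom, assuming the caller wants 32 for a zero
argument (the helper name is made up here, not part of the report); per
comments #5 and #7, with -mbmi it collapses to a single tzcnt, and without
-mbmi it expands to the bsf/cmove sequence shown for bar above:

static inline int ctz_or_32 (unsigned x)
{
  return x ? __builtin_ctz (x) : 32;   /* 32 is the chosen zero-input value */
}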

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #13 from Jörn Engel  ---
None of those examples convince me.  If you or I know that a zero-argument is
impossible, but the compiler doesn't know, wouldn't that still be UB?  And if
the compiler knows, it can remove the branch either way.

Similarly, for architectures returning 64 or -1, the code could be

asm(...);            /* produces ret */
if (ret == 64)
        return 32;
return ret;

Again, if a zero argument is impossible, the branch can be removed.  And if the
programmer wants to get 64 or -1, that either requires a conditional or invokes
UB.

So far, whichever way I look at it, moving the conditional inside
__builtin_ctz() and making the result well-defined for any input doesn't have
any downsides.  I cannot even think of existing code that would break unless it
already invoked UB and depended on a lucky roll of the dice to work correctly.

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #12 from Jakub Jelinek  ---
(In reply to Jörn Engel from comment #11)

> Out of curiosity, if the only non-broken way to call __builtin_ctz(foo) is
> via "foo ? __builtin_ctz(foo) : 32", why isn't the conditional moved into
> __builtin_ctz()?  Is there some hidden advantage from callers having to add
> the conditional or getting surprised by undefined behaviour?

In many cases you know the argument is not zero, so there is no need to write
it that way.
Plus, not everybody wants the value 32 for the case when the argument is 0.
Some CPUs have instructions that return -1 in such cases, others return, say,
64 even when the ctz is 32-bit, and it is up to the user to specify in the
code what they want.
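
A minimal sketch of that point, with made-up helper names: each caller spells
out the zero-input value it wants, and the compiler maps it onto whatever the
target instruction provides (a single instruction, a cmov, or a branch):

static inline int ctz_or_minus1 (unsigned x)
{
  return x ? __builtin_ctz (x) : -1;     /* this caller wants -1 for 0 */
}

static inline int ctzll_or_64 (unsigned long long x)
{
  return x ? __builtin_ctzll (x) : 64;   /* this caller wants 64 for 0 */
}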

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #11 from Jörn Engel  ---
I stand corrected.  Thank you very much!

Out of curiosity, if the only non-broken way to call __builtin_ctz(foo) is via
"foo ? __builtin_ctz(foo) : 32", why isn't the conditional moved into
__builtin_ctz()?  Is there some hidden advantage from callers having to add the
conditional or getting surprised by undefined behaviour?

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #10 from Andrew Pinski  ---
I forgot to list what .L15 was:
.L15:
tzcntl  %eax, %eax
vzeroupper
ret

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #9 from Andrew Pinski  ---
(In reply to Jörn Engel from comment #6)
> True for one, but not the other.
> 
> return mask ? __builtin_ctz(mask) : 32;
> 1099:   83 f6 ffxor$0x,%esi
> 109c:   74 47   je 10e5 
> 109e:   f3 0f bc f6 tzcnt  %esi,%esi
> 

But this is because of jump threading:

int ml = matchlen32(src, src + 1);
if (ml >= 32)
        ml += matchlen32(src + 32, src + 1 + 32);

This does optimize to the correct thing (only one jump rather than two):
.cfi_startproc
vmovdqu 1(%rdi), %ymm0
vpcmpeqd    %ymm1, %ymm1, %ymm1
vpcmpeqb    (%rdi), %ymm0, %ymm0
vpandn  %ymm1, %ymm0, %ymm0
vpmovmskb   %ymm0, %eax
testl   %eax, %eax
jne .L15
vmovdqu 32(%rdi), %ymm0
xorl    %eax, %eax
vpcmpeqb    33(%rdi), %ymm0, %ymm0
vpandn  %ymm1, %ymm0, %ymm0
vpmovmskb   %ymm0, %edx
tzcntl  %edx, %eax
addl$32, %eax
testl   %edx, %edx
movl$64, %edx
cmove   %edx, %eax
vzeroupper
ret

The other one:
.LFB4795:
.cfi_startproc
vmovdqu 1(%rdi), %ymm0
vpcmpeqd    %ymm1, %ymm1, %ymm1
vpcmpeqb    (%rdi), %ymm0, %ymm0
vpandn  %ymm1, %ymm0, %ymm0
vpmovmskb   %ymm0, %eax
testl   %eax, %eax
je  .L5
tzcntl  %eax, %eax
cmpl$29, %eax
jle .L7
.L2:
vmovdqu 32(%rdi), %ymm0
vpcmpeqd    %ymm1, %ymm1, %ymm1
vpcmpeqb    33(%rdi), %ymm0, %ymm0
vpandn  %ymm1, %ymm0, %ymm0
vpmovmskb   %ymm0, %edx
tzcntl  %edx, %edx
addl%edx, %eax
.L7:
vzeroupper
ret
.p2align 4,,10
.p2align 3
.L5:
movl    $32, %eax
jmp .L2
.cfi_endproc

This is due to jump threading too: notice how, after the test against 0, it
jumps to .L5 and then to .L2, skipping the comparison against 29 entirely :).
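
A rough sketch (not from the report) of why jump threading kicks in here: on
the mask == 0 path the compiler already knows ml is exactly 32, so the later
ml >= 30 test is decided on that path and the branch is threaded straight to
the second half, which is why .L5 above jumps to .L2 past the comparison:

static int ml30_sketch (int mask, int tail)
{
  int ml = mask ? __builtin_ctz (mask) : 32;  /* ml == 32 whenever mask == 0 */
  if (ml >= 30)                               /* trivially true on that path */
    ml += tail;
  return ml;
}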

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #8 from Jörn Engel  ---
Updated testcase below fails to remove the branch with my gcc-8.

/*
 * usage:
 * gcc -std=gnu11 -Wall -Wextra -g -march=core-avx2 -mbmi -fPIC -O3 % &&
 * ./a.out < /dev/zero
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef uint8_t u8_256 __attribute__((vector_size(32), may_alias));
typedef char  c256 __attribute__((vector_size(32), may_alias));
typedef uint8_t  u256u __attribute__((vector_size(32), may_alias, aligned(1)));

static inline  u8_256 read256(const void *buf) { return *(const u256u *)buf; }

static inline int movemask8_256(u8_256 mask)
{
        return __builtin_ia32_pmovmskb256((c256)mask);
}

static inline int matchlen32(const void *a, const void *b)
{
        int mask = ~movemask8_256(read256(a) == read256(b));
        return mask ? __builtin_ctz(mask) : 32;
}

static int ml30(const void *src)
{
        int ml = matchlen32(src, src + 1);
        if (ml >= 30)
                ml += matchlen32(src + 32, src + 1 + 32);
        return ml;
}

static int ml32(const void *src)
{
        int ml = matchlen32(src, src + 1);
        if (ml >= 32)
                ml += matchlen32(src + 32, src + 1 + 32);
        return ml;
}

int main(void)
{
        uint8_t src[256];
        ssize_t n;

        n = read(0, src, sizeof(src));
        if (n != sizeof(src))
                return -1;
        printf("should be 64: %d\n", ml30(src));
        printf("should be 64: %d\n", ml32(src));
        return 0;
}

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #7 from Jakub Jelinek  ---
int
foo (int x)
{
  return x ? __builtin_ctz (x) : 32;
}
works without conditionals just fine for me, both in 8.x and trunk, both C and
C++, both -O2 and -O3.

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #6 from Jörn Engel  ---
True for one, but not the other.

return mask ? __builtin_ctz(mask) : 32;
1099:   83 f6 ff                xor    $0xffffffff,%esi
109c:   74 47                   je     10e5
109e:   f3 0f bc f6             tzcnt  %esi,%esi

I used:
gcc-8 -std=gnu11 -Wall -Wextra -g -march=core-avx2 -mbmi -fPIC -O3 %

_tzcnt_u32() works as you said it should.  Nicer than inline asm and allows
type checking.  Thank you for that hint!

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #5 from Jakub Jelinek  ---
(In reply to Jörn Engel from comment #4)
> Fair enough.  That means the only way to get tzcnt without a conditional is
> by using inline asm.

Of course not.
Either you can use _tzcnt_u32, or you can use x ? __builtin_ctz (x) : 32; both
of these expand to tzcnt with -mbmi when optimizing.
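
A minimal sketch of the intrinsic route, assuming BMI1 hardware (the wrapper
name is made up here); _tzcnt_u32 is defined for a zero argument and returns
32, so no conditional is needed in the source:

#include <immintrin.h>

static inline int tzcnt32 (unsigned x)
{
  return (int) _tzcnt_u32 (x);   /* 32 when x == 0 on BMI1 */
}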

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #4 from Jörn Engel  ---
Fair enough.  That means the only way to get tzcnt without a conditional is by
using inline asm.  Annoying, but something I can work with.

Annoying, because for CPUs with BMI1 tzcnt is well-defined and I explicitly
tell the compiler to generate code for BMI1.  So while __builtin_ctz() is
undefined for 0 in general, it is actually well-defined for the case I care about.

But I need to support older compilers anyway, so inline asm it is.  Thank you!
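
A rough sketch of the inline-asm fallback mentioned above, assuming BMI1
hardware where tzcnt of 0 is architecturally defined to return 32 (the
function name and constraints are mine, not from the report):

static inline int tzcnt32_asm (unsigned x)
{
  unsigned r;
  __asm__ ("tzcntl %1, %0" : "=r" (r) : "r" (x) : "cc");
  return (int) r;
}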

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

Jakub Jelinek  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||jakub at gcc dot gnu.org
 Resolution|--- |INVALID

--- Comment #3 from Jakub Jelinek  ---
__builtin_ctz (0) is undefined behavior, anytime you invoke UB, all bets are
off.
The compiler optimizes based on the assumption that UB does not happen.
So, since all valid __builtin_ctz calls return values from 0 to 31, the
compiler optimizes __builtin_ctz (x) == 32 to 0.
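
A minimal illustration of that fold (my example, not from the report): because
every defined result of __builtin_ctz lies in 0..31, gcc may treat the
comparison as always false and return a constant 0:

int always_false (int x)
{
  return __builtin_ctz (x) == 32;   /* folded to 0 when optimizing */
}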

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread joern at purestorage dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #2 from Jörn Engel  ---
The input is 32.  Does the "undefined-if-zero" thing give gcc license to remove
code depending on the output?  If it does, why is the code only removed when
comparing against 31/32, not when comparing against 30?

[Bug c/89670] __builtin_ctz(_mm256_movemask_epi8(foo)) assumed to be <31 ?

2019-03-11 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89670

--- Comment #1 from Andrew Pinski  ---
__builtin_ctz is undefined if the input is 0 as documented.