[Bug target/88918] [meta-bug] x86 intrinsic issues

2019-05-22 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88918
Bug 88918 depends on bug 56253, which changed state.

Bug 56253 Summary: fp-contract does not work with SSE and AVX FMAs (neither 
FMA4 nor FMA3)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56253

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/56253] fp-contract does not work with SSE and AVX FMAs (neither FMA4 nor FMA3)

2019-05-22 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56253

Matthias Kretz  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Matthias Kretz  ---
This is resolved since 5.1: https://godbolt.org/z/_tpStf

[Bug c++/88752] [8 Regression] ICE in enclosing_instantiation_of, at cp/pt.c:13328

2019-05-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88752

Matthias Kretz  changed:

   What|Removed |Added

  Known to work||7.4.0, 9.1.0
  Known to fail||8.3.0

--- Comment #11 from Matthias Kretz  ---
The following link contains a minor modification of the testcase to really make
the code valid again: https://godbolt.org/z/FTe8Ax

At this point, this PR has low priority for me (since GCC 9 is out).

[Bug target/58790] [missed optimization] reduction of masks of builtin vectors not transformed to ptest or movemask instructions

2019-05-16 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58790

Matthias Kretz  changed:

   What|Removed |Added

Version|4.9.0   |10.0

--- Comment #2 from Matthias Kretz  ---
A completely different idea for how to handle mask reduction and create more
potential for optimization:

Add a new builtin "__builtin_is_zero(x)" which takes any __vector(N) type and
returns true if all bits of x are 0.

none_equal(a, b) { return __builtin_is_zero(a == b); }
all_equal(a, b)  { return __builtin_is_zero(~(a == b)); }
any_equal(a, b)  { return !__builtin_is_zero(a == b); }
some_equal(a, b) { return !__builtin_is_zero(a == b) && !__builtin_is_zero(~(a == b)); }

The x86 backend could then translate those to movmsk or ptest/vtestp[sd].
Examples:
with SSE4:
__builtin_is_zero(x) -> ptest(x, x); return ZF
__builtin_is_zero(~x) -> ptest(x, -1); return CF
__builtin_is_zero(integer < 0) -> ptest(integer, signmask); return ZF
__builtin_is_zero(x & k) -> ptest(x, k); return ZF
__builtin_is_zero(~x & k) -> ptest(x, k); return CF
__builtin_is_zero((integer < 0) & k) -> ptest(integer, signmask & k); return ZF

without SSE4:
__builtin_is_zero(x) -> movmsk(x == 0) == "full bitmask"
__builtin_is_zero(mask) -> movmsk(mask) == 0  // i.e. when the argument is known
  // to have only 0 or -1 values
__builtin_is_zero(a == b) -> movmsk(a == b) == 0
__builtin_is_zero(~(a == b)) -> movmsk(a == b) == "full bitmask" // 0x3, 0xf,
  // 0xff, 0xffff, or 0xffffffff depending on the actual movmsk instruction used.

I assume this would make PR90483 a lot more natural to implement.

[Bug target/90483] input to ptest not optimized

2019-05-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90483

--- Comment #1 from Matthias Kretz  ---
https://godbolt.org/z/7BFMdG (for quick verification)

[Bug target/90487] New: optimize SSE & AVX char compares with subsequent movmskb [negation]

2019-05-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90487

Bug ID: 90487
   Summary: optimize SSE & AVX char compares with subsequent
movmskb [negation]
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Testcase (cf. https://godbolt.org/z/7NiU7O):

#include <x86intrin.h>

template <class T, size_t N>
using V [[gnu::vector_size(N)]] = T;

int good0(V<unsigned char, 16> a) { return 0xffff ^ _mm_movemask_epi8(reinterpret_cast<__m128i>(a)); }
int good1(V<unsigned char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(!a)); }

// the following should be optimized to either good0 (prefer e.g. if compared
// against 0xffff) or good1:
int f0(V<unsigned char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(a <= 0x7f)); }
int f1(V<unsigned char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(a <  0x80)); }
int f0(V<  signed char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(a >=  0)); }
int f1(V<  signed char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(a >  -1)); }
int f0(V<         char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(a >=  0)); }
int f1(V<         char, 16> a) { return _mm_movemask_epi8(reinterpret_cast<__m128i>(a >  -1)); }

#ifdef __AVX2__
int good0(V<unsigned char, 32> a) { return 0xffffffff ^ _mm256_movemask_epi8(reinterpret_cast<__m256i>(a)); }
int good1(V<unsigned char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(!a)); }

// the following should be optimized to either good0 (prefer e.g. if compared
// against 0xffffffff) or good1:
int f0(V<unsigned char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= 0x7f)); }
int f1(V<unsigned char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0x80)); }
int f0(V<  signed char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >=  0)); }
int f1(V<  signed char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >  -1)); }
int f0(V<         char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >=  0)); }
int f1(V<         char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >  -1)); }
#endif

Compile with -O2 and either -mavx2 or -msse2. This PR is simply the negation of
PR88152. I failed to cover these cases in the other PR and they are just as
likely to appear as the ones in PR88152.

[Bug target/90483] New: input to ptest not optimized

2019-05-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90483

Bug ID: 90483
   Summary: input to ptest not optimized
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

The (V)PTEST instruction of SSE4.1/AVX produces ZF = `(a & b) == 0` and CF =
`(~a & b) == 0`. Generic usage of PTEST simply sets `b = ~__m128i()` (or
`~__m256i()`), i.e. tests `a` and `~a` for having only zero bits. (cf.
_mm_test_all_ones)

Consequently, if `a` is the result of a vector comparison which only depends on
a bitmask, the compare instruction can be elided and the `~__m128i()` mask
replaced with the corresponding bitmask.

Examples:

// test sign bit
bool bad(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x > 0x7f);
}

Since x > 0x7f can be rewritten as a test for the sign bit, we can optimize to
(with 0x808080... at LC0):
vptest .LC0(%rip), %xmm0
sete %al
ret

// test for zero
bool bad2(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x == 0);
}

This is equivalent to testing the scalars for 0, i.e. we can optimize to:
vptest %xmm0, %xmm0
sete %al
ret

// test for certain bits
bool bad3(__v16qu x, __v16qu k) {
  return __builtin_ia32_ptestz128(~__v16qu(), (x & k) == 0);
}

With the above transformation we already get PTEST(x & k, x & k), which can
consequently be reduced to PTEST(x, k):
vptest %xmm0, %xmm1
sete %al
ret

Further optimization of e.g. `(x & ~k) == 0` using CF instead of ZF might also
be interesting.

And of course, these transformations apply to all vector types, not just
__v16qu.

[Bug tree-optimization/90460] Inefficient vector construction from pieces

2019-05-14 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90460

--- Comment #1 from Matthias Kretz  ---
PR85048 and PR77399 are related

[Bug target/90424] memcpy into vector builtin not optimized

2019-05-13 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424

--- Comment #2 from Matthias Kretz  ---
FWIW, I agree that "bit-inserting into a default-def" isn't a good idea. My
code, in the meantime, looks more like this (https://godbolt.org/z/D-yfZJ):

template <class T>
using V [[gnu::vector_size(16)]] = T;

template <class T, size_t M>
V<T> load(const void *p) {
  V<T> r = {};
  __builtin_memcpy(&r, p, M);
  return r;
}

I can't read the SSA code with certainty, but bit-inserting sounds like what I
want to have. Alternatively, the partial vector load could be implemented like
this - and looks even worse (https://godbolt.org/z/nJuTn-):
template <class T>
using V [[gnu::vector_size(16)]] = T;

template <class T, size_t... I>
V<T> load(const void *p) {
  const T* q = static_cast<const T*>(p);
  V<T> r = {q[I]...};
  return r;
}

// movq or movsd
template V<char>      load<char,      0, 1, 2, 3, 4, 5, 6, 7>(const void *);
template V<short>     load<short,     0, 1, 2, 3>(const void *);
template V<int>       load<int,       0, 1>(const void *);
template V<long long> load<long long, 0>(const void *);
template V<float>     load<float,     0, 1>(const void *);
template V<double>    load<double,    0>(const void *);

// movd or movss
template V<char>  load<char,  0, 1, 2, 3>(const void *);
template V<short> load<short, 0, 1>(const void *);
template V<int>   load<int,   0>(const void *);
template V<float> load<float, 0>(const void *);

[Bug target/90424] New: memcpy into vector builtin not optimized

2019-05-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424

Bug ID: 90424
   Summary: memcpy into vector builtin not optimized
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Testcase (cf. https://godbolt.org/z/LsKcii):

template <class T>
using V [[gnu::vector_size(16)]] = T;

template <class T, size_t M = sizeof(V<T>)>
V<T> load(const void *p) {
  using W = V<T>;
  W r;
  __builtin_memcpy(&r, p, M);
  return r;
}

// movq or movsd
template V<char>      load<char, 8>(const void *);      // bad
template V<short>     load<short, 8>(const void *);     // bad
template V<int>       load<int, 8>(const void *);       // bad
template V<long long> load<long long, 8>(const void *); // good
template V<float>     load<float, 8>(const void *);     // bad
template V<double>    load<double, 8>(const void *);    // good (movsd?)

// movd or movss
template V<char>  load<char, 4>(const void *);  // bad
template V<short> load<short, 4>(const void *); // bad
template V<int>   load<int, 4>(const void *);   // good
template V<float> load<float, 4>(const void *); // good

All of these partial loads should be translated to a single mov[qd] or movs[sd]
instruction. But most of them are not.

[Bug target/88152] optimize SSE & AVX char compares with subsequent movmskb

2019-05-09 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88152

Matthias Kretz  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Matthias Kretz  ---
As far as the testcase on godbolt is concerned, my issue is resolved. GCC 9.1
produces perfect results for the f functions and close to perfect results for
cmp (the movdqa %xmm0, %xmm1 seems unnecessary).

Good work!

[Bug c++/90243] New: diagnostic notes that belong to a suppressed error about an uninitialized variable in a constexpr function are still shown

2019-04-25 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90243

Bug ID: 90243
   Summary: diagnostic notes that belong to a suppressed error
about an uninitialized variable in a constexpr
function are still shown
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Test case (https://godbolt.org/z/34KB20):

struct Z {
  int y;
};

template <class T>
constexpr Z f(const T *data) {
  Z z;
  __builtin_memcpy(&z, data, sizeof(z));
  return z;
}

constexpr Z g(const char *data) { return f(data); }

This prints:
: In instantiation of 'constexpr Z f(const T*) [with T = char]':
:12:48:   required from here
:1:8: note: 'struct Z' has no user-provided default constructor
:2:7: note: and the implicitly-defined constructor does not initialize
'int Z::y'

If f is not a template, `Z z;` is an error and the notes explain the error. But
when f is a template, the error is suppressed (which seems correct). However,
the notes that explain the error are still shown. Whether the notes are shown
should use the same condition as the error.

[Bug libstdc++/88066] [7 Regression] Relative includes in bits/locale_conv.h should be prefixed

2019-03-28 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88066

Matthias Kretz  changed:

   What|Removed |Added

 CC||kretz at kde dot org

--- Comment #9 from Matthias Kretz  ---
Created attachment 46049
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46049&action=edit
test case

Let me present the counterargument. I.e. if I use -I. and have a file named as
used internally by libstdc++, compilation breaks. Nothing in the C++ standard
forbids creating a bits/stl_vector.h file in my source tree, right? *evil
grin*

I'm a vocal fighter for "" includes... ;-)

[Bug c++/89357] alignas for automatic variables with alignment greater than 16 fails

2019-02-19 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357

--- Comment #2 from Matthias Kretz  ---
I agree. The corresponding C test case produces equivalent f0 and f1:

void g(int*);

void f0() {
  __attribute__((aligned(128))) int x;
  g(&x);
}

void f1() {
  _Alignas(128) int x;
  g(&x);
}

And I agree this PR is rejects-valid. Unless there's some very strange rule for
ARM and thus _Alignas is supposed to also reject values greater than 16?

[Bug target/89357] New: alignas for automatic variables with alignment greater than 16 fails

2019-02-14 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357

Bug ID: 89357
   Summary: alignas for automatic variables with alignment greater
than 16 fails
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: aarch64-*-*, arm-*-*

Test case (cf. https://godbolt.org/z/ubJge4):

void g(int &);

auto f0() {
  __attribute__((aligned(128))) int x;
  g(x);
}

auto f1() {
  alignas(128) int x;
  g(x);
}

In f0, x is aligned on 128 Bytes, in f1 it is aligned on 16 Bytes. GCC emits
the warning "requested alignment 128 is larger than 16 [-Wattributes]". The
warning is not helpful as it only states an obvious fact (128 > 16). In any
case, it is unclear why the alignment is rejected in the first place.

http://eel.is/c++draft/dcl.align#2.2 says "[...] the implementation does not
support that alignment in the context of the declaration, the program is
ill-formed". Thus, compiling with alignment of 16 is non-conforming. If the
alignment is unsupported, this needs to be an error.

My preferred solution would be to just align x in f1 to 128 Bytes. Why should
alignas and the aligned attribute behave differently?

On x86, power, mips, and msp430 (tested on Compiler Explorer) f0 and f1 are
equivalent.

[Bug target/89224] New: subscript of NEON intrinsic discards const

2019-02-06 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89224

Bug ID: 89224
   Summary: subscript of NEON intrinsic discards const
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Test case (cf. https://godbolt.org/z/RFrftn):
#include <arm_neon.h>

template <class T>
void g(T &x) {
  x = 1;
}

auto f(const __Int8x8_t &x) {
  g(x[0]);
  //x[0] = 1;  // ill-formed
}

decltype(x[0]) is `signed char&`, which can't be right if decltype(x) is
const-ref.

[Bug target/89189] New: missed optimization for 16/8-bit vector shuffle

2019-02-04 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89189

Bug ID: 89189
   Summary: missed optimization for 16/8-bit vector shuffle
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Testcase `-O2 -msse2`, further missed optimization with SSSE3 / SSE4.1 (cf.
https://godbolt.org/z/Yx6aLo):

using vshort [[gnu::vector_size(16)]] = short;
vshort f(vshort x) {
return vshort{x[3], x[7]};
}

using vchar [[gnu::vector_size(16)]] = char;
vchar g(vchar x) {
return vchar{x[7], x[15]};
}

f is compiled to 2x pextrw, movd, pinsrw + unpacks for zeroing high bits. The
latter unpacks are unnecessary since movd already zeros the high bits [127:32].

With SSE4.1 g is compiled to a similar pattern using pextrb/pinsrb. In this
case movd is used, but note that pextrb zeros the bits [31:8] in the GPR, so
that the unpacks for zeroing are also unnecessary.

Using SSSE3, both functions can also be compiled to a single pshufb instruction
using a suitable constant shuffle vector (6,7,14,15,-1,-1,... and
7,15,-1,-1,...).

[Bug target/24073] (vector float){a, b, 0, 0} code gen is not good

2019-01-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=24073

Matthias Kretz  changed:

   What|Removed |Added

 CC||kretz at kde dot org

--- Comment #8 from Matthias Kretz  ---
I believe the issue is resolved by now. See https://godbolt.org/z/urg7ri (add
-mavx and/or add more variants to test).

[Bug tree-optimization/88854] redundant store after load that would make aliasing UB

2019-01-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88854

--- Comment #7 from Matthias Kretz  ---
(In reply to rguent...@suse.de from comment #5)
> Yeah, we do not perform this kind of "flow-sensitive" TBAA.  So
> when trying to DSE *a = x; we only look at
> 
>  int x = *a;
>  *b = 1;
>  *a = x;
> 
> and do not consider the earlier load from *b at all because it is
> not on the path from the load making the store possibly redundant.

However, if I annotate a and/or b as __restrict__ GCC does the DSE. I don't
think I want another DSE special case but a general case of inducing aliasing
knowledge, which may affect decisions throughout the whole program where the
pointers are used.

[Bug tree-optimization/88854] redundant store after load that would make aliasing UB

2019-01-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88854

--- Comment #6 from Matthias Kretz  ---
Regarding gcc.dg/tree-ssa/ssa-pre-30.c

I'd argue that for `bar`, GCC may assume b == 0, because otherwise f would be
read both via an int and a float pointer, which is UB. So bar can be optimized
accordingly.

`foo` shows a case I forgot in my "General rule". Pointers to integers that
only differ in signedness can do aliasing loads without UB and thus can't
trigger the optimization. The same holds for all the other aliasing exceptions
listed in http://eel.is/c++draft/basic.lval#11.

[Bug tree-optimization/88854] redundant store after load that would make aliasing UB

2019-01-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88854

--- Comment #4 from Matthias Kretz  ---
Another test case, which the patch doesn't optimize:

short f(int *a, short *b) {
short y = *b; // 1
int x = *a;   // 2
*b = 1;
*a = x;
return y;
}

The loads in 1+2 are either UB or a and b must not alias. Consequently the
store to b won't change a and the store to a is dead.

General rule:
Given two pointers a and b of different type, where b is not a pointer to char,
unsigned char, or std::byte, if
- a load of a is followed by a load of b, or
- a store to a is followed by a load of b
then a and b are guaranteed to point to different addresses.

[Bug tree-optimization/88854] New: redundant store after load that would make aliasing UB

2019-01-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88854

Bug ID: 88854
   Summary: redundant store after load that would make aliasing
UB
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

Test cases:

This is optimized at -O1 and with GCC 5 at -O2. -fdisable-tree-fre1 and
-fno-strict-aliasing also remove the store to a.

void f(int *a, float *b) {
int x = *a;
*b = 0;
x = *a;
*a = x;
}

The following is an extension that reloads *a after store to b into a different
variable. Still the store to a must be dead, since otherwise the read of a
would be UB.

int g(int *a, float *b) {
int x = *a;
*b = 0;
int r = *a;
*a = x;
return r;
}

[Bug libstdc++/77776] C++17 std::hypot implementation is poor

2019-01-14 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #10 from Matthias Kretz  ---
Experience from testing my simd implementation:

I had failures (2 ULP deviation from long double result) when using 

auto __xx = abs(__x);
auto __yy = abs(__y);
auto __zz = abs(__z);
auto __hi = max(max(__xx, __yy), __zz);
auto __l0 = min(__zz, max(__xx, __yy));
auto __l1 = min(__yy, __xx);
__l0 /= __hi;
__l1 /= __hi;
auto __lo = __l0 * __l0 + __l1 * __l1;
return __hi * sqrt(1 + __lo);

Where the failures occur depends on whether FMA instructions are used. I have
observed only 1 ULP deviation from long double with my algorithm (independent
of FMAs).

Here are two data points that seem challenging:

hypot(0x1.965372p+125f, 0x1.795c92p+126f, 0x1.d0fc96p+125f) -> 0x1.e79366p+126f

hypot(0x1.235f24p+125f, 0x1.5b88f4p+125f, 0x1.d57828p+124f) -> 0x1.feaa26p+125f

[Bug target/88808] New: bitwise operators on AVX512 masks fail to use the new mask instructions

2019-01-11 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88808

Bug ID: 88808
   Summary: bitwise operators on AVX512 masks fail to use the new
mask instructions
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Test case (https://godbolt.org/z/gyCN12):
#include <x86intrin.h>

using V [[gnu::vector_size(16)]] = float;

auto f(V x) {
auto mask = _mm_fpclass_ps_mask(x, 16) | _mm_fpclass_ps_mask(x, 8);
return _mm_mask_blend_ps(mask, x, x + 1);
}

auto g(V x) {
auto mask = _kor_mask8(_mm_fpclass_ps_mask(x, 16), _mm_fpclass_ps_mask(x,
8));
return _mm_mask_blend_ps(mask, x, x + 1);
}

Function f should compile to the same code as g does, i.e. use korb instead of
kmovb + orl + kmovb. Similar test cases can be constructed for kxor, kand, and
kandn as well as for masks of 8 and 16 bits (likely for 32 and 64 as well, but
I have not tested it). For kand it's a bit trickier to trigger, but e.g. the
following shows it:

__mmask8 foo = 0;
auto f(V x) {
auto mask0 = _mm_fpclass_ps_mask(x, 16);
auto mask1 = _mm_fpclass_ps_mask(x, 8);
foo = mask0 | mask1;
return _mm_mask_blend_ps(mask0 & mask1, x, x + 1);
}

[Bug target/80517] [missed optimization] constant propagation through Intel intrinsics

2019-01-11 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80517

--- Comment #4 from Matthias Kretz  ---
A similar test case showing that something is still missing
(https://gcc.godbolt.org/z/t1DT7E):

#include <x86intrin.h>

inline __m128i cmp(__m128i x, __m128i y) {
return _mm_cmpeq_epi16(x, y);
}
inline unsigned to_bits(__m128i mask0) {
return _pext_u32(_mm_movemask_epi8(mask0), 0xaaaa);
}

inline __m128i to_vmask(unsigned bits) {
__m128i mask = _mm_set1_epi16(bits);
mask = _mm_and_si128(mask, _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128));
mask = _mm_cmpeq_epi16(mask, _mm_setzero_si128());
mask = _mm_xor_si128(mask, _mm_cmpeq_epi16(mask, mask));
return mask;
}

auto f(__m128i x, __m128i y) {
// should be:
// vpcmpeqw %xmm1, %xmm0, %xmm0
// ret
return to_vmask(to_bits(cmp(x, y)));
}

auto f(unsigned bits) {
// should be equivalent to `return 0xff & bits;`
return to_bits(to_vmask(bits));
}

[Bug target/80517] [missed optimization] constant propagation through Intel intrinsics

2019-01-11 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80517

Matthias Kretz  changed:

   What|Removed |Added

Version|8.0 |9.0

--- Comment #3 from Matthias Kretz  ---
GCC 9 almost resolves this. However, for some reason this extended test case is
not fully optimized: https://gcc.godbolt.org/z/jRrHth
i.e. the call to dont_call_me() should be eliminated as dead code

#include <x86intrin.h>

inline __m128i cmp(__m128i x, __m128i y) {
return _mm_cmpeq_epi16(x, y);
}
inline unsigned to_bits(__m128i mask0) {
return _pext_u32(_mm_movemask_epi8(mask0), 0xaaaa);
}

inline __m128i to_vmask(unsigned bits) {
__m128i mask = _mm_set1_epi16(bits);
mask = _mm_and_si128(mask, _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128));
mask = _mm_cmpeq_epi16(mask, _mm_setzero_si128());
mask = _mm_xor_si128(mask, _mm_cmpeq_epi16(mask, mask));
return mask;
}

inline bool is_eq(unsigned bits, __m128i vmask) {
return to_bits(vmask) == bits;
}

extern const auto a = __m128i{0x0001'0002'0004'0003, 0x0009'0008'0007'0006};
extern const auto b = __m128i{0x0001'0002'0005'0003, 0xffff'0008'0007'0006};
extern const auto c = cmp(a, b);
extern const auto d = to_bits(c);

void call_me();
void dont_call_me();
void f() {
if (is_eq(d, cmp(b, a))) {
call_me();
} else {
dont_call_me();
}
}

[Bug libstdc++/77776] C++17 std::hypot implementation is poor

2019-01-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #9 from Matthias Kretz  ---
(In reply to emsr from comment #7)
> What does this do?
> 
>   auto __hi_exp =
> __hi & simd<_T, _Abi>(std::numeric_limits<_T>::infinity()); // no error

component-wise bitwise and of __hi and +inf. Or in other words, it sets all
sign bits and mantissa bits to 0. Consequently `__hi / __hi_exp` returns __hi
with the exponent bits set to 0x3f8 (float) / 0x3ff (double) and the mantissa
bits unchanged.

> Sorry, I have no simd knowlege yet.

It's a very simple idea:
- simd<T, Abi> holds simd::size() many values of type T.
- All operators/operations act component-wise.

See Section 9 in wg21.link/N4793 for the last WD of the TS.

> Anyway, doesn't the large scale risk overflow if a, b are large? I guess I'm
> lost.

It basically has the same underflow risks as your implementation does, and no
risk of overflow.

[Bug target/88794] New: fixupimm intrinsics are unusable [9.0 regression]

2019-01-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88794

Bug ID: 88794
   Summary: fixupimm intrinsics are unusable [9.0 regression]
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Test case:
```
#include 

__m128 f(__m128 x, __m128 &y) {
y = _mm_fixupimm_ps(x, _mm_set1_epi32(0x), 0x00);
return x;
}

__m128 g(__m128 x, __m128 &y) {
y = _mm_mask_fixupimm_ps(y, -1, x, _mm_set1_epi32(0x), 0x00);
return x;
}

__m128 h(__m128 x, __m128 &y, __mmask16 k) {
y = _mm_mask_fixupimm_ps(y, k, x, _mm_set1_epi32(0x), 0x00);
return x;
}
```

The function f (cf. https://godbolt.org/z/f6u-GI) only compiles with GCC 9;
none of GCC 8, clang, or ICC accept f. If one adds `y, ` as the first argument
of the intrinsic call in f, it compiles with the non-GCC-9 compilers (cf.
https://godbolt.org/z/rqPT15).

Besides the incompatibility, fixupimm is unusable on GCC 9, because the
functions f and g produce garbage in y.

This was introduced in r265827.

[Bug libstdc++/77776] C++17 std::hypot implementation is poor

2019-01-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #6 from Matthias Kretz  ---
(In reply to Marc Glisse from comment #4)
> Your "reference" number seems strange. Why not do the computation with
> double (or long double or mpfr) or use __builtin_hypotf? Note that it
> changes the value.

Doh. (I didn't know the builtin exists. But use of (long) double should have
been a no-brainer.) I guess my point was the precision of the input to sqrt not
the result of sqrt. The sqrt makes that error almost irrelevant, though. My
numerical analysis skills are not good enough to argue for what approach is
better. But intuitively, keeping the information of the `amax` mantissa for the
final multiplication around might actually make that approach slightly better
(if the input to the sqrt were precise that wouldn't be true, though - but it
never is).

> How precise is hypot supposed to be? I know it is supposed to try and avoid
> spurious overflow/underflow, but I am not convinced that it should aim for
> correct rounding.

That's a good question for all of <cmath> / <math.h>. Any normative wording on
that question would be (welcome) news to me. AFAIK precision is left completely
as QoI. So, except for the Annex F requirements (which we can drop with
-ffast-math), let's implement all of <cmath> as `return 0;`. ;-)

> (I see that you are using clang in that godbolt link, with gcc I need to
> mark the global variables with "extern const" to get a similar asm)

Thanks for the hint. I switched to clang when GCC started to produce code
instead of constants in the asm. (I also like the unicode identifier support in
clang ;-))

[Bug rtl-optimization/88785] New: ICE in as_a, at machmode.h:353

2019-01-09 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88785

Bug ID: 88785
   Summary: ICE in as_a, at machmode.h:353
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Created attachment 45398
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45398&action=edit
reduced test case

Compile the attached test case with `-g -O2 -std=gnu++17 -march=skylake-avx512
-c`:

during RTL pass: final
ice.cpp: In function ‘void dg() [with dc = a::b]’:
ice.cpp:192:1: internal compiler error: in as_a, at machmode.h:353
  192 | }
  | ^
0x6d31e4 scalar_float_mode as_a<scalar_float_mode>(machine_mode)
../../gcc/machmode.h:353
0x6d31e4 insert_float
../../gcc/dwarf2out.c:19456
0xbbc38e add_const_value_attribute
../../gcc/dwarf2out.c:19548
0xbbd9a3 add_location_or_const_value_attribute
../../gcc/dwarf2out.c:20106
0xbbd9a3 add_location_or_const_value_attribute
../../gcc/dwarf2out.c:20060
0xbd000a gen_variable_die
../../gcc/dwarf2out.c:23880
0xbc4348 gen_decl_die
../../gcc/dwarf2out.c:26371
0xbc14cf decls_for_scope
../../gcc/dwarf2out.c:25858
0xbddff6 gen_inlined_subroutine_die
../../gcc/dwarf2out.c:24219
0xbddff6 gen_block_die
../../gcc/dwarf2out.c:25762
0xbc158a decls_for_scope
../../gcc/dwarf2out.c:25887
0xbddff6 gen_inlined_subroutine_die
../../gcc/dwarf2out.c:24219
0xbddff6 gen_block_die
../../gcc/dwarf2out.c:25762
0xbc158a decls_for_scope
../../gcc/dwarf2out.c:25887
0xbc252f gen_subprogram_die
../../gcc/dwarf2out.c:23328
0xbc3f9c gen_decl_die
../../gcc/dwarf2out.c:26288
0xbc4b3e dwarf2out_decl
../../gcc/dwarf2out.c:26856
0xbc4fbe dwarf2out_function_decl
../../gcc/dwarf2out.c:26871
0xc390bc rest_of_handle_final
../../gcc/final.c:4695
0xc390bc execute
../../gcc/final.c:4737

[Bug c++/88752] ICE in enclosing_instantiation_of, at cp/pt.c:13328

2019-01-09 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88752

Matthias Kretz  changed:

   What|Removed |Added

  Attachment #45376|0   |1
is obsolete||

--- Comment #4 from Matthias Kretz  ---
Created attachment 45385
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45385&action=edit
valid code test case

True, I made an error in the verification script. Better reduction attached.

[Bug c++/88752] ICE in enclosing_instantiation_of, at cp/pt.c:13328

2019-01-08 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88752

Matthias Kretz  changed:

   What|Removed |Added

  Attachment #45375|0   |1
is obsolete||

--- Comment #1 from Matthias Kretz  ---
Created attachment 45376
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45376&action=edit
reduced test case

[Bug c++/88752] New: ICE in enclosing_instantiation_of, at cp/pt.c:13328

2019-01-08 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88752

Bug ID: 88752
   Summary: ICE in enclosing_instantiation_of, at cp/pt.c:13328
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Created attachment 45375
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45375&action=edit
not-reduced test case

Compile the attached test case with `-std=gnu++17 -march=skylake -mrtm ~/ice.cpp`.

/home/mkretz/src/gcc/libstdc++-v3/testsuite/experimental/simd/tests/trigonometric.h:17:895:
internal compiler error: in enclosing_instantiation_of, at cp/pt.c:13328
   17 |MAKE_TESTER(acos), MAKE_TESTER(tan), MAKE_TESTER(acosh),
      |                                      ^
0x624ca1 enclosing_instantiation_of
/home/mkretz/src/gcc/gcc/cp/pt.c:13327
0x986934 tsubst_copy
/home/mkretz/src/gcc/gcc/cp/pt.c:15494
0x9a0023 tsubst_copy
/home/mkretz/src/gcc/gcc/cp/pt.c:15377
0x9a0023 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:19257
0x9a030c tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:18169
0x9a0976 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:18638
0x98ef5f tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:17756
0x992542 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:15346
0x992542 tsubst_init
/home/mkretz/src/gcc/gcc/cp/pt.c:15350
0x9910c4 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:16997
0x98e34d tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:16862
0x98bb21 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:17163
0x98e34d tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:16862
0x98bb21 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:17163
0x9a415e tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:16847
0x9a415e tsubst_lambda_expr(tree_node*, tree_node*, int, tree_node*)
/home/mkretz/src/gcc/gcc/cp/pt.c:18023
0x9a2d53 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:19344
0x9a0976 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:18638
0x9a07bb tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:18346
0x98ef5f tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
/home/mkretz/src/gcc/gcc/cp/pt.c:17756

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-07 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #12 from Matthias Kretz  ---
(In reply to Jakub Jelinek from comment #11)
> [...] though for 8x conversions we
> are e.g. on x86 already outside of the realm of natively supported vectors
> (we don't really want MMX and for 1024 bit and wider generic vectors we
> don't always emit best code).

Creatively thinking, consider constants stored as (u)char arrays (for bandwidth
optimization), converted to double or (u)llong when used. I'd want to use a
half-SSE load + subsequent conversion to AVX-512 vector (e.g. vpmovsxbq +
vcvtqq2pd) or even full SSE load + one shift and two conversions to AVX-512.

Similar motivation for the reverse direction. (Though a lot less likely to be
used in practice, I believe. Hmm, maybe AI applications can prove that
expectation wrong.)

But we should track optimizations in their own issues.

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-05 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #9 from Matthias Kretz  ---
(In reply to Devin Hussey from comment #7)
> Wait, silly me, this isn't about optimizations, this is about patterns.

Regarding optimizations, PR85048 is a first step (it lists all x86
single-instruction SIMD conversions). I also linked my library implementation
in #5, which provides optimizations for all cases on x86.

[Bug libstdc++/77776] C++17 std::hypot implementation is poor

2019-01-05 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #3 from Matthias Kretz  ---
Did you consider the error introduced by scaling with __amax? I made sure that
the division is without error by zeroing the mantissa bits. Here's a motivating
example that shows an error of 1 ulp otherwise: https://godbolt.org/z/_U2K7e

About std::fma, how bad is the performance hit if there's no instruction for
it?

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-02 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #5 from Matthias Kretz  ---
Thank you Jakub! Here's a tested x86 library implementation for all conversions
and different ISA extension support for reference:

https://github.com/mattkretz/gcc/blob/mkretz/simd/libstdc%2B%2B-v3/include/experimental/bits/simd_x86_conversions.h

(I have not looked at the patch yet to see whether I understand enough of the
implementation to optimize conversions myself.)

[Bug libstdc++/84949] -ffast-math bugged with respect to NaNs

2018-12-11 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84949

--- Comment #7 from Matthias Kretz  ---
Example showing the discrepancy: https://godbolt.org/z/D15m71

Also PR83875 is relevant wrt. giving different answers depending on function
attributes.

[Bug libstdc++/84949] -ffast-math bugged with respect to NaNs

2018-12-11 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84949

Matthias Kretz  changed:

   What|Removed |Added

 CC||kretz at kde dot org

--- Comment #6 from Matthias Kretz  ---
I'd like to make a case for numeric_limits::is_iec559 to follow
__STDC_IEC_559__. I.e. the following patch:

--- a/libstdc++-v3/include/std/limits
+++ b/libstdc++-v3/include/std/limits
@@ -1649,7 +1649,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   denorm_min() _GLIBCXX_USE_NOEXCEPT { return __FLT_DENORM_MIN__; }

   static _GLIBCXX_USE_CONSTEXPR bool is_iec559
+#ifdef __STDC_IEC_559__
   = has_infinity && has_quiet_NaN && has_denorm == denorm_present;
+#else
+  = false;
+#endif
   static _GLIBCXX_USE_CONSTEXPR bool is_bounded = true;
   static _GLIBCXX_USE_CONSTEXPR bool is_modulo = false;

@@ -1724,7 +1728,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   denorm_min() _GLIBCXX_USE_NOEXCEPT { return __DBL_DENORM_MIN__; }

   static _GLIBCXX_USE_CONSTEXPR bool is_iec559
+#ifdef __STDC_IEC_559__
   = has_infinity && has_quiet_NaN && has_denorm == denorm_present;
+#else
+  = false;
+#endif
   static _GLIBCXX_USE_CONSTEXPR bool is_bounded = true;
   static _GLIBCXX_USE_CONSTEXPR bool is_modulo = false;

@@ -1799,7 +1807,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   denorm_min() _GLIBCXX_USE_NOEXCEPT { return __LDBL_DENORM_MIN__; }

   static _GLIBCXX_USE_CONSTEXPR bool is_iec559
+#ifdef __STDC_IEC_559__
   = has_infinity && has_quiet_NaN && has_denorm == denorm_present;
+#else
+  = false;
+#endif
   static _GLIBCXX_USE_CONSTEXPR bool is_bounded = true;
   static _GLIBCXX_USE_CONSTEXPR bool is_modulo = false;


The __STDC_IEC_559__ macro is not defined when -ffast-math is active. is_iec559
requires "true if and only if the type adheres to ISO/IEC/IEEE 60559". I don't
have IEC559 at hand, but I believe assuming no NaN, inf, or -0 can occur makes
the floating point types not adhere to IEC559.
And IIUC, if __STDC_IEC_559__ is defined, then has_infinity, has_quiet_NaN and
has_denorm must all be true. So
+#ifdef __STDC_IEC_559__
+  = true;
+#else
+  = false;
+#endif
should be correct.

[Bug libstdc++/77776] C++17 std::hypot implementation is poor

2018-12-06 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

Matthias Kretz  changed:

   What|Removed |Added

 CC||kretz at kde dot org

--- Comment #1 from Matthias Kretz  ---
Testcase for the inf input case
(https://wandbox.org/permlink/wDla0jQOtKFUsHNs):

  constexpr float inf = std::numeric_limits<float>::infinity();
  std::cout << std::hypot(inf, 1.f, 1.f) << '\n';
  std::cout << std::hypot(inf, inf, 1.f) << '\n';
  std::cout << std::hypot(inf, inf, inf) << '\n';

All of these return NaN, but surely should return inf instead. Joseph mentions
this issue in his mails.


Here's what I currently have for std::experimental::simd's hypot overload:

template <class _T, class _Abi>
simd<_T, _Abi> hypot(simd<_T, _Abi> __x, simd<_T, _Abi> __y, simd<_T, _Abi> __z)
{
  using namespace __proposed::float_bitwise_operators;
  __x   = abs(__x);// no error
  __y   = abs(__y);// no error
  __z   = abs(__z);// no error
  auto __hi = max(max(__x, __y), __z); // no error
  auto __l0 = min(__z, max(__x, __y)); // no error
  auto __l1 = min(__y, __x);   // no error
  auto __hi_exp =
__hi & simd<_T, _Abi>(std::numeric_limits<_T>::infinity()); // no error
  where(__hi_exp == 0, __hi_exp) = std::numeric_limits<_T>::min();

  // scale __hi to 0x1.???p0 and __lo by the same factor
  auto __scale = 1 / __hi_exp;   // no error
  auto __h1  = __hi / __hi_exp; // no error
  __l0 *= 1 / __hi_exp; // no error
  __l1 *= 1 / __hi_exp; // no error
  auto __lo = __l0 * __l0 + __l1 * __l1; // add the two smaller values first
  auto __r  = __hi_exp * sqrt(__lo + __h1 * __h1);
  where(__l0 + __l1 == 0, __r) = __hi;
  where(isinf(__x) || isinf(__y) || isinf(__z), __r) =
std::numeric_limits<_T>::infinity();
  return __r;
}

IIUC this might still produce underflow exceptions, even though the answer is
precise. And without FMA in the accumulation inside the sqrt there may be a
larger than .5 ULP error on the result. But that's just to the best of my
knowledge.

[Bug target/88152] optimize SSE & AVX char compares with subsequent movmskb

2018-11-29 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88152

--- Comment #5 from Matthias Kretz  ---
> -fno-signed-zeros isn't a guarantee the operand will not be -0.0 and having
> x < 0.0 behave differently based on whether x is -0.0 or 0.0 (with
> -fno-signed-zeros quite randomly) is IMHO very bad.

I agree this isn't obviously correct. But I stared long and hard at the manpage
before writing that comment. From the manpage:
> Allow optimizations for floating-point arithmetic that ignore the signedness
> of zero. IEEE arithmetic specifies the behavior of distinct +0.0 and -0.0
> values, which then prohibits simplification of expressions such as x+0.0 or
> 0.0*x (even with -ffinite-math-only).  This option implies that the sign of a
> zero result isn't significant.

1. The movmsk(x<0) => movmsk(x) transformation is an "optimization for
floating-point arithmetic that ignores the signedness of zero" - except that
it's not really an arithmetic operation. But then again -ffinite-math-only
turns isnan(x) into false ("Allow optimizations for floating-point arithmetic
that assume that arguments and results are not NaNs")

2. I based my argument mostly on this part, though: "This option implies that
the sign of a zero result isn't significant". I interpret it as "if the result
of f() is zero, its sign is not significant in the expression f()<0".
Consequently, we can assume the sign to be positive.

Feel free to disagree. I just wanted to explain how I read the GCC
documentation on -fno-signed-zeros. If that's not the intended reading, I'd be
happy to help clarify the documentation.

[Bug target/88152] optimize SSE & AVX char compares with subsequent movmskb

2018-11-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88152

--- Comment #2 from Matthias Kretz  ---
I just realized, the movmsk(x<0) => movmsk(x) transformation also applies to
float and double if -ffinite-math-only (i.e. no NaN, it's alright for inf) and
-fno-signed-zeros are active.

[Bug target/88152] New: optimize SSE & AVX char compares with subsequent movmskb

2018-11-22 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88152

Bug ID: 88152
   Summary: optimize SSE & AVX char compares with subsequent
movmskb
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Testcase (https://godbolt.org/z/YNPZyf):

#include <x86intrin.h>

template <class T, size_t N>
using V [[gnu::vector_size(N)]] = T;

// the following should be optimized to:
// vpxor %xmm1, %xmm1, %xmm1
// vpcmpgtb %[xy]mm0, %[xy]mm1, %[xy]mm0
// ret

auto cmp0(V<unsigned char, 16> a) { return a > 0x7f; }
auto cmp0(V<unsigned char, 32> a) { return a > 0x7f; }
auto cmp1(V<unsigned char, 16> a) { return a >= 0x80; }
auto cmp1(V<unsigned char, 32> a) { return a >= 0x80; }
auto cmp0(V<  signed char, 16> a) { return a < 0; }
auto cmp0(V<  signed char, 32> a) { return a < 0; }
auto cmp1(V<  signed char, 16> a) { return a <= -1; }
auto cmp1(V<  signed char, 32> a) { return a <= -1; }
auto cmp0(V< char, 16> a) { return a < 0; }
auto cmp0(V< char, 32> a) { return a < 0; }
auto cmp1(V< char, 16> a) { return a <= -1; }
auto cmp1(V< char, 32> a) { return a <= -1; }

// the following should be optimized to:
// vpmovmskb %[xy]mm0, %eax
// ret

int f0(V<unsigned char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a >  0x7f));
}
int f0(V<unsigned char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >  0x7f));
}
int f1(V<unsigned char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a >= 0x80));
}
int f1(V<unsigned char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >= 0x80));
}
int f0(V<  signed char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <  0));
}
int f0(V<  signed char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0));
}
int f1(V<  signed char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <= -1));
}
int f1(V<  signed char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= -1));
}
int f0(V< char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <  0));
}
int f0(V< char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0));
}
int f1(V< char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <= -1));
}
int f1(V< char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= -1));
}

Compile with `-O2 -mavx2` (the same is relevant with -msse2 if you remove the
AVX overloads).

Motivation:
This pattern is relevant for vectorized UTF-8 decoding, where all bytes with
MSB == 0 can simply be zero extended to UTF-16/32. Such code could just skip
the compare and call movemask directly on `a`. However:
std::experimental::simd doesn't (and any other general purpose SIMD abstraction
shouldn't) expose a "munch sign bits into bitmask integer" function. Such a
function is too ISA specific.
In the interest of making code readable (and thus maintainable) I strongly
believe it should read: `n_ascii_chars = find_first_set(a > 0x7f)` while still
getting the optimization.

Similar test cases can be constructed for movmskp[sd] after 32/64 bit integer
compares.

[Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128

2018-11-19 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551

--- Comment #20 from Matthias Kretz  ---
The original issue I meant to report is fixed. There are many more missed
optimizations in the original example, though.

I.e. https://godbolt.org/z/7P1o3O should compile to:
use_insert_extract():
  vmovdqu DATA+4(%rip), %xmm2
  vmovdqu DATA+20(%rip), %xmm4
  vpaddd DATA(%rip), %xmm2, %xmm0
  vpaddd DATA+16(%rip), %xmm4, %xmm1
  vpaddd %xmm2, %xmm0, %xmm0
  vpaddd %xmm4, %xmm1, %xmm1
  vmovups %xmm0, DATA(%rip)
  vmovups %xmm1, DATA+16(%rip)
  ret

[Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128

2018-11-19 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551

--- Comment #18 from Matthias Kretz  ---
FWIW, the issue is resolved on trunk. GCC8.2 still has the missed optimization:
https://godbolt.org/z/hbgIIi

[Bug c++/87989] [8/9 Regression] Calling operator T() invokes wrong conversion operator overload

2018-11-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87989

--- Comment #4 from Matthias Kretz  ---
Yes, looks like a duplicate of 86246.

[Bug c++/87989] Calling operator T() invokes wrong conversion operator overload

2018-11-12 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87989

Matthias Kretz  changed:

   What|Removed |Added

  Known to work||7.3.0
  Known to fail||8.1.0, 8.2.0

--- Comment #1 from Matthias Kretz  ---
might be related to #86521

[Bug c++/87989] New: Calling operator T() invokes wrong conversion operator overload

2018-11-12 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87989

Bug ID: 87989
   Summary: Calling operator T() invokes wrong conversion operator
overload
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase (https://godbolt.org/z/sStNGV):

struct X {
template <class T> operator T() const;
operator float() const;
};

template <class T>
T f(const X &x) { return x.operator T(); }

template float f<float>(const X &);

Starting with GCC 8, this calls the template `X::operator T() const` (with T =
float) instead of `X::operator float() const`. The behavior is correct if
function f is changed to `{ return x; }`, i.e. an implicit call of the
conversion operator.

I have not double-checked the standard, but clang, EDG, MSVC and GCC <= 7 do
not show this behavior.

[Bug c++/87631] new attribute for passing structures with multiple SIMD data members in registers

2018-10-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87631

--- Comment #2 from Matthias Kretz  ---
My (current) use case is structures (nested) of builtin types and vector types.
These structures have a trivial copy constructor.


Generalization
---

I believe generalization of this approach should be possible, but I'm not sure
how useful it would be. E.g.

  struct [[gnu::pass_via_register]] A {
int a;
std::vector<float> b;
  };

  void f(A);

could call f by "unpacking" A and calling f'(int, std::vector<float>). I
believe supporting types with a non-trivial copy ctor is not worth the effort
(such types are typically passed via const-ref anyway).


What I believe is worthwhile
-

pass_via_register (max_registers)
  This attribute, attached to a struct type definition, specifies that function
arguments and function return values are passed via up to max_registers
registers, thus potentially using a different calling convention.

  If the number of registers required for passing a value exceeds
max_registers, the default calling convention is used instead. Specifically,
`struct S { int a, b, c; } __attribute__((pass_via_register(1)));` may still
pass via two registers  if it would do so without the attribute.

  If a structure has a single non-static data member of a type declared with
the pass_via_register attribute, the attribute is also applied to the outer
structure:

struct S { ... } __attribute__((pass_via_register(4)));
struct inherited { S x; };  // implicit pass_via_register(4)

If a structure has two or more non-static data members the resulting type does
not inherit the pass_via_register attribute.

  You may only specify this attribute on the definition of a struct, not on a
typedef that does not also define the structure.


Example from std::experimental::simd
-

using V = simd<float, simd_abi::fixed_size<8>>;

This essentially asks for { __m128[2] }, similar to `float
__attribute__((vector_size(32)))` when AVX is not available, except that I'd
like to pass arguments and return values via registers:

V f(V x, V y);

Function f reads x from %xmm0 and %xmm1, y from %xmm2 and %xmm3, and returns
via %xmm0 and %xmm1.

The simd class would be defined like this (note that `simd` itself would not
have the attribute):

template <class T> class member_type;
template <int N>
class [[gnu::pass_via_register(4)]] member_type<simd<float, simd_abi::fixed_size<N>>> {
  using V [[gnu::vector_size(16)]] = float;
  V data[N];
};

template <class T, class Abi> class simd {
  member_type<simd<T, Abi>> data;
};

simd inherits the pass_via_register(4) attribute from its data member because
it has only one data member.


ill-formed
---

I'd make the following ill-formed:

struct [[gnu::pass_via_register]] A {
  A(const A &);
};

The non-trivial copy ctor clashes with pass_via_register.


dropping the attribute
---

Example:

struct X {
  simd<float, simd_abi::fixed_size<8>> a;
  int b;
};

a is pass_via_register, b in principle is pass_via_register (on x86_64), but X
is not (two or more non-static data members). The default calling convention
applies.


implementation strategy


I don't see how the frontend could reliably implement the attribute. Does the
frontend know whether a certain type is passed via register (and how many)?
E.g. `void f(int)` passes via the stack on i686. `struct S { int a, b; };`
passes via a single register on x86_64, so unpacking `f(S)` to `f(int, int)`
would be suboptimal.

[Bug other/87631] New: new attribute for passing structures with multiple SIMD data members in registers

2018-10-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87631

Bug ID: 87631
   Summary: new attribute for passing structures with multiple
SIMD data members in registers
   Product: gcc
   Version: 9.0
   URL: https://godbolt.org/z/M-zEpR
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Consider:

using V [[gnu::vector_size(16)]] = float;

struct X1 { V a; };
struct X2 { V a, b; };

X1 f1(X1 x) { return x; }
X2 f2(X2 x) { return x; }

f1 is typically more efficient at the call site and in the function itself,
since it doesn't have to do the stores to & loads from the stack. The user
still has a choice: Using const-ref, the function arguments can still be passed
via memory.

f2 leaves the user no choice, every object of X2 is unconditionally passed via
memory. If the vector_size attribute is removed, however, objects of X2 are
still passed via registers. I propose a new attribute (e.g.
"pass_via_register"), potentially with an argument that limits the number of
registers it may use (useful for generic types), that would modify the ABI of
such types to have them passed via registers. Consequently, f2 could also
compile to a single "ret".

Note, I would like to use this feature in the implementation of
std::experimental::simd.

[Bug target/86896] New: invalid vmovdqa64 instruction for KNL emitted

2018-08-09 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86896

Bug ID: 86896
   Summary: invalid vmovdqa64 instruction for KNL emitted
   Product: gcc
   Version: 8.1.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Test case (unreduced) at
https://web-docs.gsi.de/~mkretz/invalid_knl_instruction.cpp

Compile the test case with `g++ -std=c++17 -O1 -march=knl
invalid_knl_instruction.cpp`

Output when testing with Intel SDE `-knl`:
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (KNL):
0x41eb28: vmovdqa64 xmm4, xmm16
Image: a.out+0x1eb28 (in multi-region image, region# 0)
Function:
_ZN5Tests10operators_IN2Vc2v24simdIyNS2_8simd_abi11__fixed_abiILi31EE3runEv
Instruction bytes are: 62 b1 fd 08 6f e0

[Bug libstdc++/86655] std::assoc_legendre should not constrain the value of m

2018-07-25 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86655

--- Comment #2 from Matthias Kretz  ---
http://eel.is/c++draft/c.math#sf.cmath-1.3 might be the reason why `m <= l` is
enforced. But unless I'm confused the footnote on "mathematically defined"
tells us it should work:

- "(a) if it is explicitly defined for that set of argument values" - does not
hold

- "(b) if its limiting value exists and does not depend on the direction of
approach" - this holds, no?

[Bug libstdc++/86655] New: std::assoc_legendre should not constrain the value of m

2018-07-24 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86655

Bug ID: 86655
   Summary: std::assoc_legendre should not constrain the value of
m
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

https://wg21.link/c.math#sf.cmath.assoc_legendre leaves m unconstrained.
__detail::__assoc_legendre_p documents "@param  m  The order of the associated
Legendre function. @f$ m <= l @f$." and throws a domain error if m > l.

IIUC, m > l simply implies that the result of std::assoc_legendre is 0, which
is why Wikipedia documents "This equation has nonzero solutions that are
nonsingular on [−1, 1] only if ℓ and m are integers with 0 ≤ m ≤ ℓ".

If this is correct, then the m <= l restriction should be removed. Otherwise,
we'd need to file a defect report to the C++ standard to constrain m.
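
For illustration, the behavior I'd expect if the restriction is lifted (a
sketch of my reading; currently this call throws std::domain_error):

#include <cmath>
#include <cassert>

int main() {
    assert(std::assoc_legendre(1, 2, 0.5) == 0.0); // m = 2 > l = 1 => 0
}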

[Bug target/86267] detect conversions between bitmasks and vector masks

2018-07-24 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86267

--- Comment #2 from Matthias Kretz  ---
Sorry for the delay. Vacation...

This pattern appears in many variations in the implementation of
wg21.link/p0214r9. The fixed_size ABI tag used with a simd_mask type
requires a decision from the implementer, whether to store the mask
unconditionally as a bitmask or as one or more vector masks. (array of bools is
another choice, but never a good fit.)
Thanks to AVX512, the native mask representation on x86 "depends". Any choice
for simd_mask<T, fixed_size<N>> leads to bitmask <-> vector mask conversions.
GCC decided to implement compares of vector builtins to unconditionally return
vector masks, even if an AVX512 compare instruction is used. The optimizer then
sometimes recognizes the conversion back to a bitmask and eliminates the
conversions. Consequently, fixed_size simd_masks currently achieve better
optimization when implemented as vector masks. Through this PR, I want to find
out whether using bitmasks is a feasible solution.

I understand the pain involved in making this work generically. That's why I'm
suggesting to only support this optimization when a special conversion builtin
is used. Thus, GCC wouldn't have to recognize all possible patterns to convert
bitmask <-> vector mask. And, through the use of __builtin_vector_to_bitmask
the caller implies that the argument is a vector mask (every other input is
UB).
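
A sketch of the AVX512 case mentioned above with the proposed builtin
(__builtin_vector_to_bitmask does not exist yet; everything else is standard
GCC vector extension syntax):

#include <immintrin.h>

using W [[gnu::vector_size(64)]] = int;

__mmask16 to_bits(W a, W b) {
    // today, `a < b` yields a vector mask and converting it to a bitmask for
    // the _mm512_* mask intrinsics relies on the optimizer recognizing a
    // pattern; with the proposed builtin the intent would be explicit:
    return __builtin_vector_to_bitmask(a < b);
}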

[Bug tree-optimization/86267] New: detect conversions between bitmasks and vector masks

2018-06-21 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86267

Bug ID: 86267
   Summary: detect conversions between bitmasks and vector masks
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase (cf. https://godbolt.org/g/gi6f7V):

#include <immintrin.h>

auto f(__m256i a, __m256i b) {
__m256i k = a < b;
long long bitmask = _mm256_movemask_pd((__m256d)k) & 0xf;
return _mm256_cmpgt_epi64(
__m256i{bitmask, bitmask, bitmask, bitmask} & __m256i{1, 2, 4, 8},
__m256i()
);
}

This should be optimized to "return a < b;".

A more complex case also allows conversion of the vector mask (cf.
https://godbolt.org/g/FLAEgC):

#include <immintrin.h>

auto f(__m256i a, __m256i b) {
using V [[gnu::vector_size(16)]] = int;
__m256i k = a < b;
int bitmask = _mm256_movemask_pd((__m256d)k) & 0xf;
return (V{bitmask, bitmask, bitmask, bitmask} & V{1, 2, 4, 8}) != 0;
}

I believe the most portable and readable strategy would be to introduce new
builtins that convert between bitmasks and vector masks. (This can be
especially helpful with AVX512, where the builtin comparison operators return
vector masks, but Intel intrinsics require bitmasks.)

E.g.:
using W [[gnu::vector_size(32)]] = long long;
using V [[gnu::vector_size(16)]] = int;
V f(W a, W b) {
unsigned bitmask = __builtin_vector_to_bitmask(a < b);
return __builtin_bitmask_to_vector(bitmask, V);
}

I'd define __builtin_vector_to_bitmask to only consider the MSB of each
element. And, to make optimization simpler, consider all remaining input bits
to be whatever the canonical mask representation on the target system is.

[Bug libstdc++/85951] New: make_signed and make_unsigned are incorrect for wchar_t, char16_t, and char32_t

2018-05-28 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85951

Bug ID: 85951
   Summary: make_signed and make_unsigned are incorrect for
wchar_t, char16_t, and char32_t
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

make_(un)signed_t of char16_t, char32_t, or wchar_t should never be
char16_t/char32_t/wchar_t, just like it is the case for make_signed_t<char>
(which is signed char).

According to http://eel.is/c++draft/basic.fundamental#2 and #3 char16_t,
char32_t, and wchar_t are neither _signed integer types_ nor _unsigned
integer types_. Therefore, the third outcome in
http://eel.is/c++draft/meta.trans.sign applies: "type names the (un)signed
integer type with smallest rank for which sizeof(T) == sizeof(type), with the
same cv-qualifiers as T".

cf. https://godbolt.org/g/aG4CnD
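
For illustration, what I'd expect per the quoted wording (assuming the usual
sizeof(char16_t) == 2 and sizeof(char32_t) == 4):

#include <type_traits>

static_assert(std::is_same_v<std::make_signed_t<char16_t>, short>);
static_assert(std::is_same_v<std::make_unsigned_t<char32_t>, unsigned int>);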

[Bug c++/85827] false positive for -Wunused-but-set-variable because of constexpr-if

2018-05-18 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827

--- Comment #3 from Matthias Kretz  ---
But macros are different. They remove the code before the C++ parser sees it
(at least as-if). One great improvement of constexpr-if over macros is that all
the other branches are parsed and their syntax checked. E.g. it requires the
mentioned names to exist. This doesn't compile (cf.
https://godbolt.org/g/iCRPDv):

#ifdef HAVE_FOO
constexpr bool have_foo = true;
void foo();
#else
constexpr bool have_foo = false;
#endif

void f() {
  if constexpr (have_foo) {
foo();
  }
}

So, the frontend parses all branches anyway. It should be able to see that _2
and _3 are referenced in f<1>().

[Bug c++/85827] false positive for -Wunused-but-set-variable because of constexpr-if

2018-05-18 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827

--- Comment #1 from Matthias Kretz  ---
Same issue for -Wunused-variable

[Bug c++/85827] New: false positive for -Wunused-but-set-variable because of constexpr-if

2018-05-18 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827

Bug ID: 85827
   Summary: false positive for -Wunused-but-set-variable because
of constexpr-if
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase `-std=c++17 -Wall` (cf. https://godbolt.org/g/kfgN2V):

template  int f()
{
constexpr bool _1 = N == 1;
constexpr bool _2 = N == 2;
constexpr bool _3 = N == 3;

if constexpr (_1) {
return 5;
} else if constexpr (_2) {
return 1;
} else if constexpr (_3) {
return 7;
}
}

int a() {
return f<1>();
}
int b() {
return f<2>();
}
int c() {
return f<3>();
}

Yes, I see how one can argue that _2 and _3 are unused in f<1>. However, this
makes use of constexpr-if cumbersome. E.g. when a variable is required in all
but one branch, the warning fires for the specialization that takes this one
branch. So you have to reorder the code such that the variable is only defined
inside the else branch where all the other branches are handled. But what if
you have two variables and their uses are disjoint? This leads to code
duplication, just for silencing the warning.

I believe -Wunused-but-set-variable needs to consider all constexpr branches
to be useful here.
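
To illustrate the cumbersome case (a sketch, not from the testcase above): `n`
is needed in every branch except N == 1, so g<1>() warns even though the code
is fine:

template <int N> int g(int x)
{
    const int n = x * N; // warns in g<1>()
    if constexpr (N == 1) {
        return 0;
    } else if constexpr (N == 2) {
        return n;
    } else {
        return 2 * n;
    }
}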

[Bug target/85819] New: conversion from __v[48]su to __v[48]sf should use FMA

2018-05-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85819

Bug ID: 85819
   Summary: conversion from __v[48]su to __v[48]sf should use FMA
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase (cf. https://godbolt.org/g/UoU3zj):

using T = float;
using To [[gnu::vector_size(32)]] = T;
using From [[gnu::vector_size(32)]] = unsigned;

#define A2(I) (T)a[I], (T)a[1+I]
#define A4(I) A2(I), A2(2+I)
#define A8(I) A4(I), A4(4+I)

To f(From a) {
return To{A8(0)};
}

This compiles to:
  vpand .LC0(%rip), %ymm0, %ymm1
  vpsrld $16, %ymm0, %ymm0
  vcvtdq2ps %ymm0, %ymm0
  vcvtdq2ps %ymm1, %ymm1
  vmulps .LC1(%rip), %ymm0, %ymm0
  vaddps %ymm0, %ymm1, %ymm0
  ret

The last vmulps and vaddps can be contracted to vfmadd132ps .LC1(%rip), %ymm1,
%ymm0.

The same is true for vector_size(16).
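
For reference, the per-element scalar equivalent of the emitted sequence (a
sketch; 65536.0f stands in for the .LC1 constant):

float cvt_u32_to_f32(unsigned x) {
    float lo = (float)(x & 0xffffu); // exact, fits the float mantissa
    float hi = (float)(x >> 16);     // exact
    return hi * 65536.0f + lo;       // contractible to fmaf(hi, 65536.0f, lo)
}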

[Bug target/85572] New: faster code for absolute value of __v2di

2018-04-30 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85572

Bug ID: 85572
   Summary: faster code for absolute value of __v2di
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

The absolute value for 64-bit integer SSE vectors is only optimized when
AVX512VL is available. Test case (`-O2 -ffast-math` and one of -mavx512vl,
-msse4, or -msse2):

#include 

__v2di abs(__v2di x) {
return x < 0 ? -x : x;
}

With SSE4 I suggest:

abs(long long __vector(2)):
  pxor %xmm1, %xmm1
  pcmpgtq %xmm0, %xmm1
  pxor %xmm1, %xmm0
  psubq %xmm1, %xmm0
  ret

in C++:
auto neg = reinterpret_cast<__v2di>(x < 0);
return (x ^ neg) - neg;


Without SSE4:

abs(long long __vector(2)):
  movdqa %xmm0, %xmm2
  pxor %xmm1, %xmm1
  psrlq $63, %xmm2
  psubq %xmm2, %xmm1
  pxor %xmm1, %xmm0
  paddq %xmm2, %xmm0
  ret

in C++:
auto neg = -reinterpret_cast<__v2di>(reinterpret_cast<__v2du>(x) >> 63);
return (x ^ neg) - neg;


related issue for scalars: #67510
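
Both suggested sequences, assembled into a compilable unit (a sketch using GCC
vector extensions; with -msse4.2 the compare becomes pcmpgtq):

#include <x86intrin.h>

__v2di abs_sse4(__v2di x) {
    const __v2di neg = (__v2di)(x < 0);            // 0 or -1 per element
    return (x ^ neg) - neg;
}

__v2di abs_sse2(__v2di x) {                        // no 64-bit compare needed
    const __v2di neg = -(__v2di)((__v2du)x >> 63); // 0 or -1 per element
    return (x ^ neg) - neg;
}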

[Bug target/85538] kortest for 32 and 64 bit masks incorrectly uses k0

2018-04-27 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85538

--- Comment #3 from Matthias Kretz  ---
Some more observations:

1. The instruction sequence:

kmovq %k1,-0x8(%rsp)
vmovq -0x8(%rsp),%xmm1
vmovq %xmm1,%rax
kmovq %rax,%k0

   should be a simple `kmovq %k1,%k0` instead.

2. Adding `asm("");` before the compare intrinsic makes the problem go away.

3. Using inline asm in place of the kortest intrinsic shows the same preference
for using the k0 register. Test case:

void bad(__m512i x, __m512i y) {
auto k = _mm512_cmp_epi8_mask(x, y, _MM_CMPINT_EQ);
asm("kmovq %0,%%rax" :: "k"(k));
}

4. The following test case still unnecessarily prefers k0, but does it with a
nicer `kmovq %k1,%0`:

auto almost_good(__m512i x, __m512i y) {
auto k = _mm512_cmp_epi8_mask(x, y, _MM_CMPINT_EQ);
asm("kmovq %0, %0" : "+k"(k));
return k;
}

(cf. https://godbolt.org/g/hZTga4 for 2, 3 and 4)

[Bug target/85538] kortest for 32 and 64 bit masks incorrectly uses k0

2018-04-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85538

--- Comment #1 from Matthias Kretz  ---
Sorry, I was trying to force GCC to use the k1 register and playing with
register asm (which didn't have any effect at all). f8 should actually be (cf.
https://godbolt.org/g/hSkoJV):

bool f8(__m512i x, __m512i y) {
__mmask64 k = _mm512_cmp_epi8_mask(x, y, _MM_CMPINT_EQ);
return _kortestc_mask64_u8(k, k);
}

[Bug target/85538] New: kortest for 32 and 64 bit masks incorrectly uses k0

2018-04-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85538

Bug ID: 85538
   Summary: kortest for 32 and 64 bit masks incorrectly uses k0
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Test case (`-O2 -march=skylake-avx512`, cf. https://godbolt.org/g/ou3oAZ):
#include 

// bad:
bool f8(__m512i x, __m512i y) {
register __mmask64 k asm("%rbx") = _mm512_cmp_epi8_mask(x, y,
_MM_CMPINT_EQ);
return _kortestc_mask64_u8(k, k);
}
bool f16(__m512i x, __m512i y) {
auto k = _mm512_cmp_epi16_mask(x, y, _MM_CMPINT_EQ);
return _kortestc_mask32_u8(k, k);
}

// good:
bool f32(__m512i x, __m512i y) {
auto k = _mm512_cmp_epi32_mask(x, y, _MM_CMPINT_EQ);
return _kortestc_mask16_u8(k, k);
}
bool f64(__m512i x, __m512i y) {
auto k = _mm512_cmp_epi64_mask(x, y, _MM_CMPINT_EQ);
return _kortestc_mask8_u8(k, k);
}

The 32-bit and 64-bit masks are correctly assigned to k1 on vpcmp[bw], but
subsequently GCC does some heroics to get k1 assigned into k0 (which shouldn't
be possible, no?) and then calls `kortest[qd] %k0, %k0`. The f32 and f64
functions show the correct behavior.

[Bug target/85482] unnecessary vmovaps/vmovapd/vmovdqa emitted

2018-04-20 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85482

Matthias Kretz  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-*, i?86-*-*

--- Comment #1 from Matthias Kretz  ---
related to PR85480

[Bug target/85482] New: unnecessary vmovaps/vmovapd/vmovdqa emitted

2018-04-20 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85482

Bug ID: 85482
   Summary: unnecessary vmovaps/vmovapd/vmovdqa emitted
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Test case (cf. https://godbolt.org/g/QkJYSK):
#include 

__m256 zero_extend1(__m128 a) {
return _mm256_insertf128_ps(__m256(), a, 0);
}
__m256d zero_extend1(__m128d a) {
return _mm256_insertf128_pd(__m256d(), a, 0);
}
__m256i zero_extend1(__m128i a) {
return _mm256_insertf128_si256(__m256i(), a, 0);
}

__m512 zero_extend2(__m128 a) {
return _mm512_insertf32x4(__m512(), a, 0);
}
__m512d zero_extend2(__m128d a) {
return _mm512_insertf64x2(__m512d(), a, 0);
}
__m512i zero_extend2(__m128i a) {
return _mm512_inserti32x4(__m512i(), a, 0);
}

__m512 zero_extend3(__m256 a) {
return _mm512_insertf32x8(__m512(), a, 0);
}
__m512d zero_extend3(__m256d a) {
return _mm512_insertf64x4(__m512d(), a, 0);
}
__m512i zero_extend3(__m256i a) {
return _mm512_inserti64x4(__m512i(), a, 0);
}

template <class T> T blackhole;

void test(void *mem) {
blackhole<__m256 > = zero_extend1(_mm_load_ps((float *)mem));
blackhole<__m256d> = zero_extend1(_mm_load_pd((double *)mem));
blackhole<__m256i> = zero_extend1(_mm_load_si128((__m128i *)mem));
blackhole<__m512 > = zero_extend2(_mm_load_ps((float *)mem));
blackhole<__m512d> = zero_extend2(_mm_load_pd((double *)mem));
blackhole<__m512i> = zero_extend2(_mm_load_si128((__m128i *)mem));
blackhole<__m512 > = zero_extend3(_mm256_load_ps((float *)mem));
blackhole<__m512d> = zero_extend3(_mm256_load_pd((double *)mem));
blackhole<__m512i> = zero_extend3(_mm256_load_si256((__m256i *)mem));
}

Between every load and store instruction in the `test` function, the
vmov(aps|apd|dqa) is superfluous. The preceding load instruction already zeroes
the high bits. Instead of the load instruction, there could also be different
instructions, or instruction sequences, that already guarantee that the high
bits are zero.
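
To illustrate for the first pair in `test` (hypothetical asm, not actual
compiler output; the godbolt link has the real thing):

  vmovaps (%rdi), %xmm0            # load; already zeroes bits 128..255
  vmovaps %xmm0, %xmm0             # superfluous
  vmovaps %ymm0, blackhole(%rip)   # store (symbol name abbreviated)

The middle move should simply be dropped.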

[Bug target/85480] New: zero extension from xmm to zmm via _mm512_insert???x? not optimized

2018-04-20 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85480

Bug ID: 85480
   Summary: zero extension from xmm to zmm via _mm512_insert???x?
not optimized
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Test case (cf. https://godbolt.org/g/p4Kt8X):

#include 
__m512 zero_extend2(__m128 a) {
return _mm512_insertf32x4(__m512(), a, 0);
}
__m512d zero_extend2(__m128d a) {
return _mm512_insertf64x2(__m512d(), a, 0);
}
__m512i zero_extend2(__m128i a) {
return _mm512_inserti32x4(__m512i(), a, 0);
}

These 3 functions should compile to `vmovaps %xmm0, %xmm0`, `vmovapd %xmm0,
%xmm0`, and `vmovdqa %xmm0, %xmm0`, respectively.

GCC detects the optimization for the xmm->ymm and ymm->zmm cases already. It's
only missing for the xmm->zmm case.

[Bug target/85324] New: missing constant propagation on SSE/AVX conversion intrinsics

2018-04-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85324

Bug ID: 85324
   Summary: missing constant propagation on SSE/AVX conversion
intrinsics
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

The following test case shows that constant propagation through conversion
intrinsics does not work:

#include <immintrin.h>

template <class T> using V [[gnu::vector_size(16)]] = T;

// missed optimization:
auto a1() { return 1 + (V<  int>)_mm_cvttps_epi32(_mm_set1_ps(1.f)); }
auto b1() { return 1 + (V< long>)_mm_cvttps_epi64(_mm_set1_ps(1.f)); }
auto c1() { return 1 + (V<  int>)_mm_cvttpd_epi32(_mm_set1_pd(1.)); }
auto d1() { return 1 + (V< long>)_mm_cvttpd_epi64(_mm_set1_pd(1.)); }
auto e1() { return 1 + (V<short>)_mm_cvtepi32_epi16(_mm_set1_epi32(1)); }

The resulting asm is (`-O3 -march=skylake-avx512 -std=c++17`):
a1():
  vcvttps2dq .LC0(%rip), %xmm0
  vpaddd %xmm0, %xmm0, %xmm0
  ret
b1():
  vcvttps2qq .LC0(%rip), %xmm0
  vpaddq %xmm0, %xmm0, %xmm0
  ret
c1():
  vmovdqa64 .LC1(%rip), %xmm0
  vcvttpd2dqx .LC5(%rip), %xmm1
  vpaddd %xmm0, %xmm1, %xmm0
  ret
d1():
  vcvttpd2qq .LC5(%rip), %xmm0
  vpaddq %xmm0, %xmm0, %xmm0
  ret
e1():
  vmovdqa64 .LC7(%rip), %xmm1
  vmovdqa64 .LC1(%rip), %xmm0
  vpmovdw %xmm0, %xmm0
  vpaddw %xmm1, %xmm0, %xmm0
  ret

It should be a single load of a constant in each function. (A wrapper using
__builtin_constant_p can work around it; cf. https://godbolt.org/g/8dta7B)
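
A minimal sketch of such a wrapper (assuming SSE2; not the exact code from the
godbolt link):

#include <immintrin.h>

static inline __m128i cvttps_epi32(__m128 x) {
    if (__builtin_constant_p(x[0]) && __builtin_constant_p(x[1])
            && __builtin_constant_p(x[2]) && __builtin_constant_p(x[3]))
        return (__m128i)(__v4si){(int)x[0], (int)x[1], (int)x[2], (int)x[3]};
    return _mm_cvttps_epi32(x); // runtime fallback
}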

[Bug target/85323] SSE/AVX/AVX512 shift by 0 not optimized away

2018-04-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85323

--- Comment #1 from Matthias Kretz  ---
Created attachment 43898
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43898&action=edit
idea for a partial solution

Constant propagation works using the built in shift operators. At least for the
shifts taking a "const int" shift argument, this is a possible fix. However,
the patched intrinsics now accept non-immediate shift arguments, which modifies
the interface (in an unsafe way).
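
Roughly the idea (a hypothetical rewrite, not the actual attachment):

extern __inline __m128i
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_slli_epi32 (__m128i __A, int __B)
{
    return (__m128i)((__v4si)__A << __B); // const-propagates, but silently
                                          // accepts a non-immediate __B
}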

[Bug target/85323] New: SSE/AVX/AVX512 shift by 0 not optimized away

2018-04-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85323

Bug ID: 85323
   Summary: SSE/AVX/AVX512 shift by 0 not optimized away
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

In the following test case, all three functions should compile to just `ret`:

#include <immintrin.h>

__m128i f(__m128i x) {
x = _mm_sll_epi64(x, __m128i());
x = _mm_sll_epi32(x, __m128i());
x = _mm_sll_epi16(x, __m128i());
x = _mm_srl_epi64(x, __m128i());
x = _mm_srl_epi32(x, __m128i());
x = _mm_srl_epi16(x, __m128i());
x = _mm_sra_epi64(x, __m128i());
x = _mm_sra_epi32(x, __m128i());
x = _mm_sra_epi16(x, __m128i());
x = _mm_slli_epi64(x, 0);
x = _mm_slli_epi32(x, 0);
x = _mm_slli_epi16(x, 0);
x = _mm_srli_epi64(x, 0);
x = _mm_srli_epi32(x, 0);
x = _mm_srli_epi16(x, 0);
x = _mm_srai_epi64(x, 0);
x = _mm_srai_epi32(x, 0);
x = _mm_srai_epi16(x, 0);
return x;
}

__m256i f(__m256i x) {
x = _mm256_sll_epi64(x, __m128i());
x = _mm256_sll_epi32(x, __m128i());
x = _mm256_sll_epi16(x, __m128i());
x = _mm256_srl_epi64(x, __m128i());
x = _mm256_srl_epi32(x, __m128i());
x = _mm256_srl_epi16(x, __m128i());
x = _mm256_sra_epi64(x, __m128i());
x = _mm256_sra_epi32(x, __m128i());
x = _mm256_sra_epi16(x, __m128i());
x = _mm256_slli_epi64(x, 0);
x = _mm256_slli_epi32(x, 0);
x = _mm256_slli_epi16(x, 0);
x = _mm256_srli_epi64(x, 0);
x = _mm256_srli_epi32(x, 0);
x = _mm256_srli_epi16(x, 0);
x = _mm256_srai_epi64(x, 0);
x = _mm256_srai_epi32(x, 0);
x = _mm256_srai_epi16(x, 0);
return x;
}

__m512i f(__m512i x) {
x = _mm512_sll_epi64(x, __m128i());
x = _mm512_sll_epi32(x, __m128i());
x = _mm512_sll_epi16(x, __m128i());
x = _mm512_srl_epi64(x, __m128i());
x = _mm512_srl_epi32(x, __m128i());
x = _mm512_srl_epi16(x, __m128i());
x = _mm512_sra_epi64(x, __m128i());
x = _mm512_sra_epi32(x, __m128i());
x = _mm512_sra_epi16(x, __m128i());
x = _mm512_slli_epi64(x, 0);
x = _mm512_slli_epi32(x, 0);
x = _mm512_slli_epi16(x, 0);
x = _mm512_srli_epi64(x, 0);
x = _mm512_srli_epi32(x, 0);
x = _mm512_srli_epi16(x, 0);
x = _mm512_srai_epi64(x, 0);
x = _mm512_srai_epi32(x, 0);
x = _mm512_srai_epi16(x, 0);
return x;
}

[Bug target/85317] New: missing constant propagation on _mm(256)_movemask_*

2018-04-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85317

Bug ID: 85317
   Summary: missing constant propagation on _mm(256)_movemask_*
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

The following test case shows that the movemask intrinsics are a barrier
for constant propagation. All of these functions should have a trivial constant
return value.

#include 

// return 0:
int i0() { return _mm_movemask_epi8( __m128i()); }
int s0() { return _mm_movemask_ps  ( __m128 ()); }
int d0() { return _mm_movemask_pd  ( __m128d()); }
int I0() { return _mm256_movemask_epi8( __m256i()); }
int S0() { return _mm256_movemask_ps  ( __m256 ()); }
int D0() { return _mm256_movemask_pd  ( __m256d()); }

int x2 () { return _mm_movemask_pd ((__m128d)~__m128i()); } // return 0x3
int x4 () { return _mm_movemask_ps ((__m128 )~__m128i()); } // return 0xf
int x4_() { return _mm256_movemask_pd  ((__m256d)~__m256i()); } // return 0xf
int x8 () { return _mm256_movemask_ps  ((__m256 )~__m256i()); } // return 0xff
int x16() { return _mm_movemask_epi8   (~__m128i()); } // return 0xffff
int x32() { return _mm256_movemask_epi8(~__m256i()); } // return 0xffffffff

Clang supports the optimization at -O1: https://godbolt.org/g/e6CVmR

[Bug c++/85077] [8 Regression] V[248][SD]F abs not optimized to

2018-03-27 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85077

--- Comment #8 from Matthias Kretz  ---
Thanks! FWIW my abs implementation now uses:

template <class Storage>
[[gnu::optimize("finite-math-only,no-signed-zeros")]]
constexpr Storage abs(Storage v)
{
return v.d < 0 ? -v.d : v.d;
}

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-27 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #15 from Matthias Kretz  ---
Here's an idea for a test case (https://godbolt.org/g/SjM2HE: it appears fixed
on GCC 8):

typedef unsigned short V __attribute__((vector_size (16)));

V foo (V x, int y)
{
  x <<= y;
  asm volatile (""::"x"(x):"xmm1", "xmm2", "xmm3",
 "xmm4", "xmm5", "xmm6", "xmm7", "xmm8", "xmm9",
 "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");
  return x >> y;
}

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-27 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #14 from Matthias Kretz  ---
I applied both patches to my GCC 7.2 installation and as a result my complete
testsuite passes now. Anything else I can help with?

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-27 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #13 from Matthias Kretz  ---
I'll try to apply it locally and will report my findings.

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #11 from Matthias Kretz  ---
Created attachment 43762
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43762&action=edit
test case that produces incorrect vpsrlw

Compiled with `g++-7 -std=c++17 -O0 -fabi-version=0 -fabi-compat-version=0
-march=knl -o fail fail.cpp`

g++-7 (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0

from objdump -d | grep vpsrlw I get:
  56e364:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  56eaf6:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  56f174:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  56f68c:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  58ef6f:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  58f6f5:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  58fd67:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  590273:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  59585d:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  595fef:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  596664:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  596b6a:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  59cb7a:   62 b1 7d 28 d1 c0   vpsrlw %xmm16,%ymm0,%ymm0
  59d39f:   62 b1 7d 28 d1 c1   vpsrlw %xmm17,%ymm0,%ymm0
  59d9fe:   62 b1 7d 28 d1 c2   vpsrlw %xmm18,%ymm0,%ymm0
  59dfc6:   62 b1 7d 28 d1 c3   vpsrlw %xmm19,%ymm0,%ymm0
  5a407c:   62 b1 7d 28 d1 c0   vpsrlw %xmm16,%ymm0,%ymm0
  5a4895:   62 b1 7d 28 d1 c1   vpsrlw %xmm17,%ymm0,%ymm0
  5a4eeb:   62 b1 7d 28 d1 c2   vpsrlw %xmm18,%ymm0,%ymm0
  5a54ad:   62 b1 7d 28 d1 c3   vpsrlw %xmm19,%ymm0,%ymm0
  5be392:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  5bea0b:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  5bef85:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  5bf3be:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  5d8ae0:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  5d9149:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  5d96b3:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  5d9adc:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  5de3e7:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  5dea62:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  5defd5:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  5df3fe:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  5e3cd2:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  5e431b:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  5e4865:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  5e4c6e:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  5e94bd:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  5e9b18:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  5ea06b:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  5ea474:   62 b1 7d 08 d1 c3   vpsrlw %xmm19,%xmm0,%xmm0
  799710:   c5 f9 d1 c5 vpsrlw %xmm5,%xmm0,%xmm0
  7999a9:   c5 f9 d1 c6 vpsrlw %xmm6,%xmm0,%xmm0
  799e3b:   c5 f9 d1 c7 vpsrlw %xmm7,%xmm0,%xmm0
  79a101:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  79a3c2:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  79a68c:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0
  7a1e51:   c5 f9 d1 c5 vpsrlw %xmm5,%xmm0,%xmm0
  7a20ea:   c5 f9 d1 c6 vpsrlw %xmm6,%xmm0,%xmm0
  7a2579:   c5 f9 d1 c7 vpsrlw %xmm7,%xmm0,%xmm0
  7a283f:   62 b1 7d 08 d1 c0   vpsrlw %xmm16,%xmm0,%xmm0
  7a2b00:   62 b1 7d 08 d1 c1   vpsrlw %xmm17,%xmm0,%xmm0
  7a2dca:   62 b1 7d 08 d1 c2   vpsrlw %xmm18,%xmm0,%xmm0

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #10 from Matthias Kretz  ---
This is all I have right now:
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (KNL):
0x70d281: vpsrlw xmm0, xmm0, xmm16
Image:
/home/travis/build/VcDevel/Vc/build-Experimental/c2dd920concentrateGCC7.2.0Relivy-bridgeknl/tests/mask_knl_vectorbuiltin+0x30d281
(in multi-region image, region# 0)
Function:
_ZN5Tests11load_store_IN2Vc2v29simd_maskIfNS2_6detail7avx_abiILi32EE3runEv
Instruction bytes are: 62 b1 7d 08 d1 c0

See bottom of: http://lxwww53.gsi.de/testDetails.php?test=2016375=14519

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #8 from Matthias Kretz  ---
There seems to be a similar bug for vpsrlw and vpsllw. Do you need a testcase?
(It's hard to hit the bug... just had one occur on a Travis CI build)

[Bug c++/85077] V[248][SD]F abs not optimized to

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85077

--- Comment #4 from Matthias Kretz  ---
Oh, there seems to be a regression in GCC 8. In 7 it works as you say. In 8 I
can't get the andps to show up.

[Bug c++/85077] V[248][SD]F abs not optimized to

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85077

--- Comment #3 from Matthias Kretz  ---
Ouch, right I didn't think of non-finite values.

I.e. -0 < 0 is false...

However, this is what I wanted:
abs(-inf) -> inf
abs( inf) -> inf
abs( nan) -> nan
abs(  -0) -> 0
abs(   0) -> 0

The sign bit manipulation works for all of them. The ternary fails only on the
-0 input, no?

I'm working on an implementation of wg21.link/p0214r9 that I'd like to
contribute to libstdc++, which is why I'm currently looking to remove
workarounds and enable the compiler to do const-prop as much as possible.

I'd be happy to go with an implementation that uses my_abs if that's the way to
go (I guess it is).

FWIW, ICC translates `x < 0 ? -x : x` (on float itself) to sign masking. (But
then again ICC doesn't conform with default flags either)

[Bug target/85077] New: V[248][SD]F abs not optimized to

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85077

Bug ID: 85077
   Summary: V[248][SD]F abs not optimized to
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

The following test case (also at https://godbolt.org/g/XEPk7M) shows that `x <
0 ? -x : x` is not optimized to an efficient abs implementation. This is not
only the case for SSE, but also for AVX and AVX512 vectors.

The my_abs functions show what I'd expect the result to be.

#include <immintrin.h>

template <class T, int N> using V [[gnu::vector_size(N)]] = T;

auto abs(V<float, 16> x) { return x < 0 ? -x : x; }
auto my_abs(V<float, 16> x) {
return _mm_and_ps((__m128)(~V<unsigned, 16>() >> 1), x);
}

auto abs(V<double, 16> x) { return x < 0 ? -x : x; }
auto my_abs(V<double, 16> x) {
return _mm_and_pd((__m128d)(~V<unsigned long long, 16>() >> 1), x);
}

[Bug target/48701] [missed optimization] GCC fails to use aliasing of ymm and xmm registers

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48701

--- Comment #3 from Matthias Kretz  ---
Updated test case at https://godbolt.org/g/D5P1N1.
`testLoad` was fixed with 4.7.
`testStore` still combines via the stack.

[Bug target/43514] use of SSE shift intrinsics introduces unnecessary moves to the stack and back

2018-03-26 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43514

Matthias Kretz  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Matthias Kretz  ---
This issue has been resolved since GCC 5.

[Bug target/85048] [missed optimization] vector conversions

2018-03-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #3 from Matthias Kretz  ---
Just opened PR85052 for tracking __builtin_convertvector support.

[Bug c++/85052] New: Implement support for clang's __builtin_convertvector

2018-03-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

Bug ID: 85052
   Summary: Implement support for clang's __builtin_convertvector
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

clang implements __builtin_convertvector to simplify conversions between
different vector builtins. In contrast to bitcasts, supported through C casts,
this builtin converts element-wise according to the standard type conversion
rules.

Documentation for the builtin:
https://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-convertvector

Related to PR85048.
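
Usage, as documented by clang (element-wise conversion, not a bitcast):

typedef int   v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

v4sf cvt(v4si x) { return __builtin_convertvector(x, v4sf); }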

[Bug target/85048] [missed optimization] vector conversions

2018-03-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #1 from Matthias Kretz  ---
Godbolt link:

[Bug target/85048] New: [missed optimization] vector conversions

2018-03-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Bug ID: 85048
   Summary: [missed optimization] vector conversions
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

The following testcase lists all integer and/or float conversions applied to
vector builtins of the same number of elements. All of those functions can be
compiled to a single instruction (the function's name plus `ret`) when
`-march=skylake-avx512` is active. AFAICS many conversion instructions in the
SSE and AVX ISA extensions are also unsupported.

I would expect this code to compile to optimal conversion sequences even on -O2
(and lower) since the conversion is applied directly on vector builtins. If
this is not in scope, I'd like to open a feature request for something like
clang's __builtin_convertvector (could be even done via static_cast) that
produces optimal conversion instruction sequences on vector builtins without
the auto-vectorizer.

#include <cstddef>
#include <cstdint>

template <class T, size_t N, size_t Size = N * sizeof(T)>
using V [[gnu::vector_size(Size)]] = T;

template <class From, class To> V<To, 2> cvt2(V<From, 2> x) {
return V<To, 2>{To(x[0]), To(x[1])};
}
template <class From, class To> V<To, 4> cvt4(V<From, 4> x) {
return V<To, 4>{To(x[0]), To(x[1]), To(x[2]), To(x[3])};
}
template <class From, class To> V<To, 8> cvt8(V<From, 8> x) {
return V<To, 8>{
To(x[0]), To(x[1]), To(x[2]), To(x[3]),
To(x[4]), To(x[5]), To(x[6]), To(x[7])
};
}
template <class From, class To> V<To, 16> cvt16(V<From, 16> x) {
return V<To, 16>{
To(x[0]), To(x[1]), To(x[2]), To(x[3]),
To(x[4]), To(x[5]), To(x[6]), To(x[7]),
To(x[8]), To(x[9]), To(x[10]), To(x[11]),
To(x[12]), To(x[13]), To(x[14]), To(x[15])
};
}
template <class From, class To> V<To, 32> cvt32(V<From, 32> x) {
return V<To, 32>{
To(x[0]), To(x[1]), To(x[2]), To(x[3]),
To(x[4]), To(x[5]), To(x[6]), To(x[7]),
To(x[8]), To(x[9]), To(x[10]), To(x[11]),
To(x[12]), To(x[13]), To(x[14]), To(x[15]),
To(x[16]), To(x[17]), To(x[18]), To(x[19]),
To(x[20]), To(x[21]), To(x[22]), To(x[23]),
To(x[24]), To(x[25]), To(x[26]), To(x[27]),
To(x[28]), To(x[29]), To(x[30]), To(x[31])
};
}
template <class From, class To> V<To, 64> cvt64(V<From, 64> x) {
return V<To, 64>{
To(x[ 0]), To(x[ 1]), To(x[ 2]), To(x[ 3]),
To(x[ 4]), To(x[ 5]), To(x[ 6]), To(x[ 7]),
To(x[ 8]), To(x[ 9]), To(x[10]), To(x[11]),
To(x[12]), To(x[13]), To(x[14]), To(x[15]),
To(x[16]), To(x[17]), To(x[18]), To(x[19]),
To(x[20]), To(x[21]), To(x[22]), To(x[23]),
To(x[24]), To(x[25]), To(x[26]), To(x[27]),
To(x[28]), To(x[29]), To(x[30]), To(x[31]),
To(x[32]), To(x[33]), To(x[34]), To(x[35]),
To(x[36]), To(x[37]), To(x[38]), To(x[39]),
To(x[40]), To(x[41]), To(x[42]), To(x[43]),
To(x[44]), To(x[45]), To(x[46]), To(x[47]),
To(x[48]), To(x[49]), To(x[50]), To(x[51]),
To(x[52]), To(x[53]), To(x[54]), To(x[55]),
To(x[56]), To(x[57]), To(x[58]), To(x[59]),
To(x[60]), To(x[61]), To(x[62]), To(x[63]),
};
}

#define _(name, from, to, size) \
auto name(V<from, size> x) { return cvt##size<from, to>(x); }
// integral -> integral; truncation
_(vpmovqd , uint64_t, uint32_t,  2)
_(vpmovqd , uint64_t, uint32_t,  4)
_(vpmovqd , uint64_t, uint32_t,  8)
_(vpmovqd ,  int64_t, uint32_t,  2)
_(vpmovqd ,  int64_t, uint32_t,  4)
_(vpmovqd ,  int64_t, uint32_t,  8)
_(vpmovqd_, uint64_t,  int32_t,  2)
_(vpmovqd_, uint64_t,  int32_t,  4)
_(vpmovqd_, uint64_t,  int32_t,  8)
_(vpmovqd_,  int64_t,  int32_t,  2)
_(vpmovqd_,  int64_t,  int32_t,  4)
_(vpmovqd_,  int64_t,  int32_t,  8)

_(vpmovqw , uint64_t, uint16_t,  2)
_(vpmovqw , uint64_t, uint16_t,  4)
_(vpmovqw , uint64_t, uint16_t,  8)
_(vpmovqw ,  int64_t, uint16_t,  2)
_(vpmovqw ,  int64_t, uint16_t,  4)
_(vpmovqw ,  int64_t, uint16_t,  8)
_(vpmovqw_, uint64_t,  int16_t,  2)
_(vpmovqw_, uint64_t,  int16_t,  4)
_(vpmovqw_, uint64_t,  int16_t,  8)
_(vpmovqw_,  int64_t,  int16_t,  2)
_(vpmovqw_,  int64_t,  int16_t,  4)
_(vpmovqw_,  int64_t,  int16_t,  8)

_(vpmovqb , uint64_t,  uint8_t,  2)
_(vpmovqb , uint64_t,  uint8_t,  4)
_(vpmovqb , uint64_t,  uint8_t,  8)
_(vpmovqb ,  int64_t,  uint8_t,  2)
_(vpmovqb ,  int64_t,  uint8_t,  4)
_(vpmovqb ,  int64_t,  uint8_t,  8)
_(vpmovqb_, uint64_t,   int8_t,  2)
_(vpmovqb_, uint64_t,   int8_t,  4)
_(vpmovqb_, uint64_t,   int8_t,  8)
_(vpmovqb_,  int64_t,   int8_t,  2)
_(vpmovqb_,  int64_t,   int8_t,  4)
_(vpmovqb_,  int64_t,   int8_t,  8)

_(vpmovdw , uint32_t, uint16_t,  4)
_(vpmovdw , uint32_t, uint16_t,  8)
_(vpmovdw , uint32_t, uint16_t, 16)
_(vpmovdw ,  int32_t, uint16_t,  4)
_(vpmovdw ,  int32_t, uint16_t,  8)
_(vpmovdw ,  int32_

[Bug target/84786] [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-10 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

--- Comment #2 from Matthias Kretz  ---
Created attachment 43618
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43618&action=edit
unreduced testcase

Compile with `g++ -std=c++17 -O2 -march=knl -o knl-fail knl-fail.cpp`.
The function `Tests::operators_ > >::run()` countains the invalid instructions.

% g++-7 --version
g++-7 (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0

[Bug target/84786] New: [miscompilation] vunpcklpd accessing xmm16-22 targeting KNL

2018-03-09 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84786

Bug ID: 84786
   Summary: [miscompilation] vunpcklpd accessing xmm16-22
targeting KNL
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

I see generated code, such as:

  424821:·   vpxord %zmm17,%zmm17,%zmm17
  424827:·   vpxord %zmm18,%zmm18,%zmm18
[...]
  424855:·   vunpcklpd %xmm17,%xmm0,%xmm1
[...]
  424891:·   vunpcklpd %xmm18,%xmm1,%xmm1

when compiling with `-O2 -march=knl`. Apparently the `_mm_unpacklo_pd`
intrinsic is incorrectly translated to an encoding that allows the upper 16
SIMD registers for the first source operand.

Reducing a test case will take some time.

[Bug target/84781] New: [missed optimization] ignore bitmask after movemask

2018-03-09 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84781

Bug ID: 84781
   Summary: [missed optimization] ignore bitmask after movemask
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase: https://godbolt.org/g/S3tfrL

#include <immintrin.h>

int f(__m128  a) { return _mm_movemask_ps(a)& 0xf; }
int f(__m128d a) { return _mm_movemask_pd(a)& 0x3; }
int f(__m128i a) { return _mm_movemask_epi8(a)  & 0xffffu; }
int f(__m256  a) { return _mm256_movemask_ps(a) & 0xff; }
int f(__m256d a) { return _mm256_movemask_pd(a) & 0xf; }

In all of these functions, the bitmask is a no-op since the movemask cannot
yield bits in any of the masked-off places. Consequently, the bitwise and
should be dropped.

[Bug c++/83875] [feature request] target_clones compatible SIMD capability/length check

2018-01-20 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83875

--- Comment #9 from Matthias Kretz  ---
> inside multi-versioned (target_clones/target) function it depends on the 
> active target

Yes., this part is easy.

> inside a constexpr context (function/variable, your examples) or 
> always_inline function it depends on the caller

Except that a constexpr variable isn't "called". Maybe for variables a
target_clones attribute could be used. But it feels like going a bit too far.

A constexpr function without arguments would implicitly have to take a "target"
argument. I can't judge how much of a problem that would be.

[Bug c++/83875] [feature request] target_clones compatible SIMD capability/length check

2018-01-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83875

--- Comment #7 from Matthias Kretz  ---
Hmm,

what should the following print?

constexpr int native_simd_width = __builtin_target_supports("avx512f") ? 64 :
__builtin_target_supports("avx") ? 32 : __builtin_target_supports("sse") ? 16 :
__builtin_target_supports("mmx") ? 8 : 0;

constexpr int native_simd_width_f() {
return __builtin_target_supports("avx512f") ? 64 :
__builtin_target_supports("avx") ? 32 : __builtin_target_supports("sse") ? 16 :
__builtin_target_supports("mmx") ? 8 : 0;
}

[[gnu::target_clones("default,avx,avx512f")]]
void f() {
std::cout << native_simd_width << ' ' << native_simd_width_f();
}

[Bug c++/83875] [feature request] target_clones compatible SIMD capability/length check

2018-01-17 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83875

Matthias Kretz  changed:

   What|Removed |Added

 CC||kretz at kde dot org

--- Comment #6 from Matthias Kretz  ---
Here's an example I'd like to see supported. Note the missing target_clones
attribute on "min":

inline __m128i min(__m128i a, __m128i b) {
if constexpr (__builtin_target_supports("sse4.1")) {
return _mm_min_epi32(a, b);
} else {
const auto mask = _mm_cmpgt_epi32(a, b);
return _mm_or_si128(_mm_andnot_si128(mask, a), _mm_and_si128(mask, b));
}
}

[[gnu::target_clones("default,sse4.1")]]
__m128i f(__m128i a, __m128i b) {
return min(a, b);
}

[Bug target/83894] [missed optimization] __v16qu shift instruction sequence on x86

2018-01-16 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83894

--- Comment #2 from Matthias Kretz  ---
I compiled with:

g++-7 -march=haswell -std=c++17 -O3 -flax-vector-conversions -o char_shift
char_shift.cpp

[Bug target/83894] [missed optimization] __v16qu shift instruction sequence on x86

2018-01-16 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83894

--- Comment #1 from Matthias Kretz  ---
Created attachment 43149
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43149&action=edit
tsc.h

Header required for the benchmark code.

[Bug target/83894] New: [missed optimization] __v16qu shift instruction sequence on x86

2018-01-16 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83894

Bug ID: 83894
   Summary: [missed optimization] __v16qu shift instruction
sequence on x86
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Created attachment 43148
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43148&action=edit
benchmark

Shifts of vector builtins with 8-bit integral element type can be optimized
better.

I.e. `v << n` can be implemented as

1. load 0x00ff00ff00ff... and 16-bit shift by n
2. xor (1) with 0xff00ff00ff00... to produce a bitmask
3. 16-bit shift v by n
4. bitwise and of (2) and (3)

I'll attach a benchmark with an intrinsics based implementation.
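
A minimal sketch of the 4-step sequence above (assuming SSE2 and a runtime
shift count; the attached benchmark has the tuned version):

#include <emmintrin.h>

__m128i shl_epi8(__m128i v, int n) {
    const __m128i cnt = _mm_cvtsi32_si128(n);
    __m128i m = _mm_sll_epi16(_mm_set1_epi16(0x00ff), cnt); // step 1
    m = _mm_xor_si128(m, _mm_set1_epi16((short)0xff00));    // step 2: byte mask
    v = _mm_sll_epi16(v, cnt);                              // step 3
    return _mm_and_si128(v, m);                             // step 4
}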

[Bug c++/83793] Pack expansion outside of lambda containing the pack incorrectly rejected

2018-01-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83793

--- Comment #4 from Matthias Kretz  ---
(In reply to Jonathan Wakely from comment #2)
> Looks like a dup of PR 47226

Ah, yes. Sorry for missing it, I recall seeing it before. I agree, a backport
would be nice, but an overhaul is not a backportable bugfix. I really need to
switch to GCC 8 just for the lambda fixes. :-)

Thanks!

[Bug c++/83856] New: ICE in tsubst_copy;

2018-01-15 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83856

Bug ID: 83856
   Summary: ICE in tsubst_copy;
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase (cf. https://godbolt.org/g/jFkk7N):
```
#include <array>
#include <cstddef>
#include <initializer_list>
#include <utility>

template <class T> struct simd
{
  static constexpr size_t size() { return 4; }
  template <class F> simd(F &&, decltype(std::declval<F>()(0)) * = nullptr)
{}
};

template <class V, class... F>
void test_tuples(const std::initializer_list<std::array<float, 1>> &data,
 F &&... fun_pack)
{
  auto it = data.begin();
  const int remaining = data.size() % V::size();
  if (remaining > 0) {
[](auto...) {
}((fun_pack(V([&](auto i) { return it[i < remaining ? i : 0][0]; })),
0)...);
  }
}

void f()
{
  test_tuples<simd<float>>({}, [](simd<float> x) {});
}
```

Output when compiling with GCC g++-7 (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0:
```
ice.cpp: In instantiation of ‘test_tuples(const
std::initializer_list<std::array<float, 1> >&, F&& ...)::<lambda(auto:2)> [with
auto:2 = int; V = simd; F = {f()::<lambda(simd)>}]’:
ice.cpp:6:62:   required by substitution of ‘template
simd::simd(F&&, decltype (declval()(0))*) [with F = test_tuples(const
std::initializer_list<std::array<float, 1> >&, F&& ...) [with V = simd;
F = {f()::<lambda(simd)>}]::<lambda(auto:2)> ’
ice.cpp:17:17:   required from ‘void test_tuples(const
std::initializer_list<std::array<float, 1> >&, F&& ...) [with V = simd;
F = {f()::<lambda(simd)>}]’
ice.cpp:23:52:   required from here
ice.cpp:14:25: internal compiler error: in tsubst_copy, at cp/pt.c:14539
   const int remaining = data.size() % V::size();
 ^~~~
0x5eb5c2 tsubst_copy
../../src/gcc/cp/pt.c:14539
0x5ef666 tsubst_copy
../../src/gcc/cp/pt.c:14515
0x5ef666 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:17831
0x5ef450 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:16769
0x5ef549 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:17611
0x5ef086 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:17232
0x5ef748 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:16934
0x5e83c7 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../src/gcc/cp/pt.c:16550
0x5e95f1 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../src/gcc/cp/pt.c:14477
0x5e95f1 tsubst_init
../../src/gcc/cp/pt.c:14483
0x5eb558 tsubst_copy
../../src/gcc/cp/pt.c:14681
0x5ef666 tsubst_copy
../../src/gcc/cp/pt.c:14515
0x5ef666 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:17831
0x5ef765 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:16935
0x5ef81b tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:17477
0x5ef505 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:16967
0x5ef4e9 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool,
bool)
../../src/gcc/cp/pt.c:16965
0x5e83c7 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../src/gcc/cp/pt.c:16550
0x5e7f85 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../src/gcc/cp/pt.c:15811
0x5e810b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
../../src/gcc/cp/pt.c:16027
```

GCC 8 appears to be fixed.

[Bug c++/83793] New: Pack expansion outside of lambda containing the pack incorrectly rejected

2018-01-11 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83793

Bug ID: 83793
   Summary: Pack expansion outside of lambda containing the pack
incorrectly rejected
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kretz at kde dot org
  Target Milestone: ---

Testcase (https://godbolt.org/g/mNhetZ):

#include <array>
#include <utility>

using std::size_t;
template <size_t... Indexes> auto f(std::index_sequence<Indexes...>) {
std::array<int, sizeof...(Indexes)> x = {[&]() -> int { return Indexes;
}()...};
return x;
}

auto g() {
return f(std::make_index_sequence<3>());
}

The pack expansion of `[&]() -> int { return Indexes; }()` is apparently never
considered. ICC, clang, and MSVC accept the code.

[Bug c++/47226] [C++0x] GCC doesn't expand template parameter pack that appears in a lambda-expression

2017-07-06 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47226

Matthias Kretz  changed:

   What|Removed |Added

 CC||kretz at kde dot org

--- Comment #10 from Matthias Kretz  ---
I have the following in my code now:

// this could be much simpler:
//
// return {V([&](auto i) { return x[i + Indexes * V::size()]; })...};   
//
// Sadly GCC has a bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47226. The
// following works around it by placing the pack outside of the code block of
// the lambda:
return {[](size_t j, const datapar<T, A> &y) {
return V([&](auto i) { return y[i + j * V::size()]; });
}(Indexes, x)...};
