[Bug tree-optimization/89653] Missing vectorization of loop containing std::min/std::max and temporary

2022-11-01 Thread moritz.kreutzer at siemens dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89653

--- Comment #11 from Moritz Kreutzer  ---
I am currently out of the office, with limited to no email access. I will be
returning on November 28. For urgent questions regarding ARM64 support please
contact Julian Hornich, for GPGPU-related issues please contact Michael Kuron,
and for compiler- and build-related issues please contact Tom James. For
anything else that is urgent, please reach out to Joel Daniels.


Thanks,
Moritz

-
Siemens Industry Software GmbH; Address: Am Kabellager 9, 51063 Köln;
limited liability company (Gesellschaft mit beschränkter Haftung); Managing
Directors: Klaus Löckel, Alexander Walter; Registered office: Köln;
Registration court: Amtsgericht Köln, HRB 84564; Chairman of the Supervisory
Board: Timo Nentwich

[Bug c++/91819] New: ICE when iterating over enum values

2019-09-19 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91819

Bug ID: 91819
Summary: ICE when iterating over enum values
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

Created attachment 46899
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46899&action=edit
Preprocessed source and backtrace

Hi,

we are getting an ICE with the latest trunk of GCC with the following code:


enum Foo
{
  a,
  b
};

inline Foo operator++(Foo &f, int)
{
  return f = (Foo)(f + 1);
}

int main()
{
  int count = 0;
  for (Foo f = a; f <= b; f++) {
    count++;
  }
  return count;
}


GCC 9 and older seem to be working: https://godbolt.org/z/UO37hz

The preprocessed source and backtrace are attached. Let me know if you need
further information.


Thanks,
Moritz

[Bug tree-optimization/91198] GCC not generating AVX-512 compress/expand instructions

2019-07-19 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

--- Comment #4 from Moritz Kreutzer  ---
> How would a vectorized version with the intrinsic look like?

Something along the lines of (assuming insize is a multiple of 16):


__mmask16 mask;
__m512 vin;
__m512 const thr = _mm512_set1_ps(threshold);
int o = 0;
for (int i = 0; i < insize; i+=16) {
  vin = _mm512_loadu_ps(&input[i]);
  mask = _mm512_cmplt_ps_mask(vin, thr);
  _mm512_mask_compressstoreu_ps(&output[o], mask, vin);
  o += __builtin_popcount(_mm512_mask2int(mask));
}
*outsize = o;



I don't really understand your other two questions, but maybe the intrinsics
code will help.

[Bug tree-optimization/91198] GCC not generating AVX-512 compress/expand instructions

2019-07-18 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

--- Comment #2 from Moritz Kreutzer  ---
Sure, I should have said that I'm talking about auto vectorization. I'm aware
that we could use intrinsics, but of course that'll always be our last resort
for obvious reasons.

[Bug tree-optimization/91198] New: GCC not generating AVX-512 compress/expand instructions

2019-07-18 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

Bug ID: 91198
Summary: GCC not generating AVX-512 compress/expand instructions
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

We have a simple loop to select values based on a condition from one array and
store the selected values contiguously in a second array:


https://godbolt.org/z/T7UXXD

float const threshold = 0.5;
int o = 0;
for (int i = 0; i < size; ++i) {
  if (input[i] < threshold) {
    output[o] = input[i];
    o++;
  }
}


It seems like GCC is not able to generate AVX-512 assembly using vcompressps
instructions for this code. The same holds true for the orthogonal pattern
(expansion using vexpandps). Is this a missed optimization in GCC or is there
another issue in the example code which prevents vectorization?
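
For reference, both patterns can be written as self-contained scalar functions. This is only a sketch: the function names, the `select` flag array and the return values are assumptions made for illustration and are not part of the original report. The first function is the compress loop from above; the second is one possible form of the inverse (expand) pattern that `vexpandps` would correspond to.

```cpp
#include <cassert>

// Compress: copy every input element below `threshold` to the front of
// `output`, preserving order. Returns the number of elements written.
// Assumption: `output` has room for `size` elements.
int compress_below(float const* input, int size, float threshold, float* output)
{
  assert(size >= 0);
  int o = 0;
  for (int i = 0; i < size; ++i) {
    if (input[i] < threshold) {
      output[o] = input[i];
      o++;
    }
  }
  return o;
}

// Expand: consecutive elements of `packed` are written to those positions
// of `output` where `select[i]` is non-zero; all other positions are left
// untouched. Returns the number of elements consumed from `packed`.
int expand_selected(float const* packed, unsigned char const* select,
                    int size, float* output)
{
  assert(size >= 0);
  int p = 0;
  for (int i = 0; i < size; ++i) {
    if (select[i]) {
      output[i] = packed[p];
      p++;
    }
  }
  return p;
}
```

Both loops carry a dependency through the running output or input position (`o` and `p`), which is the part a compress or expand instruction would have to absorb.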

[Bug tree-optimization/89653] Missing vectorization of loop containing std::min/std::max and temporary

2019-03-25 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89653

--- Comment #7 from Moritz Kreutzer  ---
Thanks for taking this up Richard! I just want to check back: Do you need any
assistance with testing or more information from my side?

[Bug c++/89653] New: Missing vectorization of loop containing std::min/std::max and temporary

2019-03-11 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89653

Bug ID: 89653
Summary: Missing vectorization of loop containing std::min/std::max and temporary
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

Godbolt worksheet: https://godbolt.org/z/F6m5hl

GCC (trunk and all earlier versions) fails to vectorize (SSE/AVX2/AVX-512) the
following loop because of a "complicated access pattern" (similarly for
std::max()):

== loop1 - FAIL ==
for (int i = 0; i < end; ++i)
{
  vec[i] = std::min(vec[i], vec[i]/x);
}


If we don't use std::min(), but implement the same loop using a ternary
operator, the vectorization is successful:

== loop2 - OK ==
for (int i = 0; i < end; ++i)
{
  vec[i] = vec[i] < vec[i]/x ? vec[i] : vec[i]/x;
}


However, the problem does not seem to be that GCC is unable to vectorize
std::min() itself, because the following loop _does_ get vectorized (note the
different logic and the absence of an implicit temporary for vec[i]/x):

== loop3 - OK ==
for (int i = 0; i < end; ++i)
{
  vec[i] = std::min(vec[i], x);
}
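
To compare the three variants outside of Godbolt, they can be wrapped in free functions as in the sketch below. The function names, the signatures (a raw `double*` plus a length) and the `loops_agree` helper are assumptions for illustration only; the loop bodies are the ones shown above. Compiling with something like `g++ -O3 -march=native -fopt-info-vec-missed` should report which loops were not vectorized, although the exact diagnostics depend on the GCC version.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// loop1: std::min() applied to vec[i] and the temporary vec[i]/x.
void loop1_std_min(double* vec, int end, double x)
{
  for (int i = 0; i < end; ++i)
  {
    vec[i] = std::min(vec[i], vec[i]/x);
  }
}

// loop2: the same computation written with a ternary operator.
void loop2_ternary(double* vec, int end, double x)
{
  for (int i = 0; i < end; ++i)
  {
    vec[i] = vec[i] < vec[i]/x ? vec[i] : vec[i]/x;
  }
}

// loop3: std::min() against the loop-invariant x, with no temporary.
void loop3_min_scalar(double* vec, int end, double x)
{
  for (int i = 0; i < end; ++i)
  {
    vec[i] = std::min(vec[i], x);
  }
}

// Hypothetical helper: checks that loop1 and loop2 produce identical
// results for the given data (assumes x != 0 and no NaN inputs).
bool loops_agree(std::vector<double> const& data, double x)
{
  assert(x != 0.0);
  std::vector<double> v1 = data;
  std::vector<double> v2 = data;
  loop1_std_min(v1.data(), static_cast<int>(v1.size()), x);
  loop2_ternary(v2.data(), static_cast<int>(v2.size()), x);
  return v1 == v2;
}
```

For ordinary (non-NaN) inputs, loop1 and loop2 compute the same values, so any difference in vectorization comes from how the code is expressed rather than from what it computes.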


The C++ standard prescribes that std::min() returns the result as a const
reference, so an implementation might look like this:

== std::min() ==
double const & min(double const &a, double const &b)
{
if (a

[Bug tree-optimization/89618] Inner loop won't vectorize unless dummy statement is included

2019-03-07 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89618

--- Comment #3 from Moritz Kreutzer  ---
Great, thanks for the quick action Richard!

[Bug c/89618] New: Inner loop won't vectorize unless dummy statement is included

2019-03-07 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89618

Bug ID: 89618
Summary: Inner loop won't vectorize unless dummy statement is included
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

We have a loop in which we are scattering data to an array of length "n", where
we can guarantee the absence of write conflicts only within confined ranges of
length "m". Our implementation splits this loop into an outer and an inner loop
and specifies "#pragma GCC ivdep" for the inner loop:


https://godbolt.org/z/ulnRrk
===
const int m = 32;

for (int j = 0; j < n/m; ++j)
{
  int const start = j*m;
  int const end = (j+1)*m;

  #pragma GCC ivdep
  for (int i = start; i < end; ++i)
  {
    a[off[i]] = a[i] < 0 ? a[i] : 0;
  }

#ifdef VECTORIZE
  // dummy statement required for vectorization
  if (a[0] == 0.) a[0] = 0.; 
#endif
}
===

The issue is that GCC (trunk and any earlier version) won't vectorize the code
unless we add the obviously useless dummy statement (guarded by "#ifdef
VECTORIZE"). This is counterintuitive, involves some overhead which we want to
avoid, and may be cumbersome or even impossible to implement depending on the
specific structure of the inner loop (the body may be passed as a lambda,
etc.).

Without knowing about the internals of GCC, I can imagine that in the absence
of the dummy statement, GCC jams the loops and tries and fails to vectorize the
remaining (outer) loop because it doesn't have an "ivdep" pragma. Can we avoid
this behavior? If my thinking is correct, something like ICC's "#pragma
nounroll_and_jam" could work, but GCC doesn't (officially?) support anything
like it as far as I can see. 
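
For local experiments, the loop nest can be placed in a standalone function, as in the sketch below. The function name `scatter_blocks`, its signature and the added precondition check are assumptions for illustration; the loop body and the optional dummy statement are taken from the snippet above.

```cpp
#include <cassert>

// Hypothetical wrapper around the loop nest from the report.
// Assumptions: `a` and `off` are valid for all indices used, `n` is a
// multiple of m, and within each block of m iterations no store through
// off[i] touches an element that another iteration of that block accesses.
void scatter_blocks(double* a, int const* off, int n)
{
  int const m = 32;
  assert(n >= 0 && n % m == 0);

  for (int j = 0; j < n/m; ++j)
  {
    int const start = j*m;
    int const end = (j+1)*m;

    // Tells GCC to assume no loop-carried dependencies in the inner loop.
    #pragma GCC ivdep
    for (int i = start; i < end; ++i)
    {
      a[off[i]] = a[i] < 0 ? a[i] : 0;
    }

#ifdef VECTORIZE
    // dummy statement required for vectorization
    if (a[0] == 0.) a[0] = 0.;
#endif
  }
}
```

Building this once with and once without `-DVECTORIZE` (for example together with `-O3 -fopt-info-vec`) is one way to check whether the behavior described above still reproduces on a given GCC version.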

Thanks,
Moritz

[Bug tree-optimization/88464] AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"

2018-12-17 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464

--- Comment #16 from Moritz Kreutzer  ---
I can confirm the fix from my side.

Thanks again!

[Bug tree-optimization/88464] AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"

2018-12-14 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464

--- Comment #11 from Moritz Kreutzer  ---
Jakub, I can confirm it's working for masked gathers (we have a similar pattern
elsewhere in our code) with the latest trunk. Thanks for looking at the
scatters as well!

[Bug tree-optimization/88464] AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"

2018-12-13 Thread moritz.kreutzer at siemens dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464

--- Comment #8 from Moritz Kreutzer  ---
Thanks for the input and for confirming that "for conditional ones (both
MASK_LOAD and MASK_STORE) the support for the cases when using a mask register
rather than a vector register with mask either hasn't been done or doesn't work
properly." 

Any idea about a possible way forward to fix or implement those? As stated
above, even though I have no experience with GCC development, I'd be happy to
assist to the best of my resources if at all possible.
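
The reproducer for this bug is not quoted in this thread, so the following is only an illustrative sketch of what conditional (masked) scatter and gather loops look like in scalar code. All names (`masked_scatter`, `masked_gather`, `cond`, `idx`) are hypothetical and not taken from the report.

```cpp
#include <cassert>

// Conditional scatter: store src[i] to dst[idx[i]] only where cond[i] > 0.
// Assumption: the idx values selected by cond are distinct, so the stores
// do not conflict with each other.
void masked_scatter(double* dst, double const* src, int const* idx,
                    double const* cond, int n)
{
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
  {
    if (cond[i] > 0) dst[idx[i]] = src[i];
  }
}

// Conditional gather: load src[idx[i]] into dst[i] only where cond[i] > 0.
void masked_gather(double* dst, double const* src, int const* idx,
                   double const* cond, int n)
{
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
  {
    if (cond[i] > 0) dst[i] = src[idx[i]];
  }
}
```

In both loops the condition acts as the mask: elements for which it is false must not be stored (scatter) or loaded and overwritten (gather).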