[Bug tree-optimization/89653] Missing vectorization of loop containing std::min/std::max and temporary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89653 --- Comment #11 from Moritz Kreutzer --- I am currently out of the office, with limited to no email access. I will be returning on November 28. For urgent questions regarding ARM64 support please contact Julian Hornich, for GPGPU-related issues please contact Michael Kuron, and for compiler- and build-related issues please contact Tom James. For anything else (which is urgent), please reach out to Joel Daniels. Thanks, Moritz
[Bug c++/91819] New: ICE when iterating over enum values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91819

Bug ID: 91819
Summary: ICE when iterating over enum values
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

Created attachment 46899 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46899&action=edit
Preprocessed source and backtrace

Hi,

we are getting an ICE with the latest trunk of GCC with the following code:

    enum Foo { a, b };

    inline Foo operator++(Foo &f, int)
    {
        return f = (Foo)(f + 1);
    }

    int main()
    {
        int count = 0;
        for (Foo f = a; f <= b; f++) {
            count++;
        }
        return count;
    }

GCC 9 and older seem to be working: https://godbolt.org/z/UO37hz

The preprocessed source and backtrace are attached. Let me know if you need further information.

Thanks,
Moritz
[Bug tree-optimization/91198] GCC not generating AVX-512 compress/expand instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

--- Comment #4 from Moritz Kreutzer ---
> How would a vectorized version with the intrinsic look like?

Something along the lines of (assuming insize is a multiple of 16):

    __mmask16 mask;
    __m512 vin;
    __m512 const thr = _mm512_set1_ps(threshold);
    int o = 0;
    for (int i = 0; i < insize; i+=16) {
        vin = _mm512_loadu_ps(&input[i]);
        mask = _mm512_cmplt_ps_mask(vin, thr);
        _mm512_mask_compressstoreu_ps(&output[o], mask, vin);
        o += __builtin_popcount(_mm512_mask2int(mask));
    }
    *outsize = o;

I don't really understand your other two questions, but maybe the intrinsics code will help.
[Bug tree-optimization/91198] GCC not generating AVX-512 compress/expand instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198 --- Comment #2 from Moritz Kreutzer --- Sure, I should have said that I'm talking about auto vectorization. I'm aware that we could use intrinsics, but of course that'll always be our last resort for obvious reasons.
[Bug tree-optimization/91198] New: GCC not generating AVX-512 compress/expand instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

Bug ID: 91198
Summary: GCC not generating AVX-512 compress/expand instructions
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

We have a simple loop to select values based on a condition from one array and store the selected values contiguously in a second array: https://godbolt.org/z/T7UXXD

    float const threshold = 0.5;
    int o = 0;
    for (int i = 0; i < size; ++i) {
        if (input[i] < threshold) {
            output[o] = input[i];
            o++;
        }
    }

It seems like GCC is not able to generate AVX-512 assembly using vcompressps instructions for this code. The same holds true for the orthogonal pattern (expansion using vexpandps).

Is this a missed optimization in GCC or is there another issue in the example code which prevents vectorization?
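
For reference, here is a minimal scalar sketch of both patterns mentioned in the report: the compress loop as posted, and one plausible form of the "orthogonal" expand loop, which the report names but does not show. The function names, the std::vector interface, and the `key` array used as the expand condition are assumptions made for illustration; they are not taken from the Godbolt example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compress: copy every element of `input` that is below `threshold`
// contiguously into `output`. Returns the number of elements written.
// (Hypothetical wrapper around the loop from the report.)
std::size_t compress_below(const std::vector<float>& input, float threshold,
                           std::vector<float>& output)
{
    assert(output.size() >= input.size());
    std::size_t o = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i] < threshold) {
            output[o] = input[i];
            o++;
        }
    }
    return o;
}

// Expand: read contiguously from `packed` and write to every position i
// where key[i] < threshold; other positions of `output` are left untouched.
// Returns the number of elements consumed from `packed`.
// (Assumed form of the expansion pattern; not shown in the report.)
std::size_t expand_below(const std::vector<float>& packed,
                         const std::vector<float>& key, float threshold,
                         std::vector<float>& output)
{
    assert(output.size() >= key.size());
    std::size_t o = 0;
    for (std::size_t i = 0; i < key.size(); ++i) {
        if (key[i] < threshold) {
            assert(o < packed.size());
            output[i] = packed[o];
            o++;
        }
    }
    return o;
}
```

In both loops the running index `o` depends on how many earlier iterations satisfied the condition. A vectorizer would have to turn that dependence into a mask plus a population count, which is what the hand-written intrinsics version of the compress loop does.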
[Bug tree-optimization/89653] Missing vectorization of loop containing std::min/std::max and temporary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89653 --- Comment #7 from Moritz Kreutzer --- Thanks for taking this up Richard! I just want to check back: Do you need any assistance with testing or more information from my side?
[Bug c++/89653] New: Missing vectorization of loop containing std::min/std::max and temporary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89653

Bug ID: 89653
Summary: Missing vectorization of loop containing std::min/std::max and temporary
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

Godbolt worksheet: https://godbolt.org/z/F6m5hl

GCC (trunk and all earlier versions) fails to vectorize (SSE/AVX2/AVX-512) the following loop because of a "complicated access pattern" (similarly for std::max()):

== loop1 - FAIL ==
    for (int i = 0; i < end; ++i) {
        vec[i] = std::min(vec[i], vec[i]/x);
    }

If we don't use std::min(), but implement the same loop using a ternary operator, the vectorization is successful:

== loop2 - OK ==
    for (int i = 0; i < end; ++i) {
        vec[i] = vec[i] < vec[i]/x ? vec[i] : vec[i]/x;
    }

However, the problem does not seem to be that GCC is unable to vectorize std::min() itself, because the following loop _does_ get vectorized (note the different logic and the absence of an implicit temporary for vec[i]/x):

== loop3 - OK ==
    for (int i = 0; i < end; ++i) {
        vec[i] = std::min(vec[i], x);
    }

The C++ standard prescribes that std::min() returns the result as a const reference, so an implementation might look like this:

== std::min() ==
    double const & min(double const &a, double const &b) { if (a
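
To make the role of the const reference concrete, the following is a small sketch of loop1 and loop2 as standalone functions, together with a hypothetical `min_ref` helper that returns its result by const reference the way std::min() does. The helper and the function names are assumptions for illustration; this is not the standard library's actual implementation, and it does not reproduce the listing cut off above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for std::min(): the result is a reference to one of
// the two arguments, not a copy of the smaller value.
inline double const& min_ref(double const& a, double const& b)
{
    return (b < a) ? b : a;
}

// loop1 from the report: the second argument vec[i]/x is a temporary that
// std::min() may return a reference to, so it has to be materialized in memory
// before the result is read back.
void loop1_std_min(std::vector<double>& vec, double x)
{
    assert(x != 0.0);
    int const end = static_cast<int>(vec.size());
    for (int i = 0; i < end; ++i) {
        vec[i] = std::min(vec[i], vec[i] / x);
    }
}

// loop2 from the report: same arithmetic with a ternary operator, so no
// reference to a temporary is involved.
void loop2_ternary(std::vector<double>& vec, double x)
{
    assert(x != 0.0);
    int const end = static_cast<int>(vec.size());
    for (int i = 0; i < end; ++i) {
        vec[i] = vec[i] < vec[i] / x ? vec[i] : vec[i] / x;
    }
}
```

Both loops compute the same values for ordinary (non-NaN) inputs. They differ only in whether the minimum is selected by value or through a reference, which is consistent with the report's suspicion that the implicit temporary is what blocks vectorization.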
[Bug tree-optimization/89618] Inner loop won't vectorize unless dummy statement is included
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89618 --- Comment #3 from Moritz Kreutzer --- Great, thanks for the quick action Richard!
[Bug c/89618] New: Inner loop won't vectorize unless dummy statement is included
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89618

Bug ID: 89618
Summary: Inner loop won't vectorize unless dummy statement is included
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: moritz.kreutzer at siemens dot com
Target Milestone: ---

We have a loop in which we are scattering data to an array of length "n" where we can assure no write conflicts only within confined ranges of length "m". Our implementation includes splitting this loop into an outer and an inner loop and specifying "#pragma GCC ivdep" for the inner loop: https://godbolt.org/z/ulnRrk

===
    const int m = 32;

    for (int j = 0; j < n/m; ++j) {
        int const start = j*m;
        int const end = (j+1)*m;

    #pragma GCC ivdep
        for (int i = start; i < end; ++i) {
            a[off[i]] = a[i] < 0 ? a[i] : 0;
        }

    #ifdef VECTORIZE
        // dummy statement required for vectorization
        if (a[0] == 0.) a[0] = 0.;
    #endif
    }
===

The issue is that GCC (trunk and any earlier version) won't vectorize the code unless we add the obviously useless dummy statement (guarded by "#ifdef VECTORIZE"). This is counterintuitive, involves some overhead which we want to avoid, and may be cumbersome or even impossible to implement depending on the specific structure of the inner loop (the body may be passed as a lambda, etc.).

Without knowing about the internals of GCC, I can imagine that in the absence of the dummy statement, GCC jams the loops and tries and fails to vectorize the remaining (outer) loop because it doesn't have an "ivdep" pragma. Can we avoid this behavior? If my thinking is correct, something like ICC's "#pragma nounroll_and_jam" could work, but GCC doesn't (officially?) support anything like it as far as I can see.

Thanks,
Moritz
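
The blocked scatter described in this report can be sketched as a self-contained function. The function name, the std::vector interface, and passing "m" as a parameter are assumptions for illustration; the Godbolt example may declare "a", "off" and "n" differently. The caller is responsible for the guarantee the report relies on, namely that within one block of length "m" no written element is also read or written by another iteration of that block.

```cpp
#include <cassert>
#include <vector>

// Scatter in blocks of length m. Within one block the iterations are assumed
// to be independent (that is what "#pragma GCC ivdep" asserts to the compiler);
// consecutive blocks are still processed strictly in order.
// (Hypothetical wrapper around the loop nest from the report.)
void blocked_scatter(std::vector<double>& a, const std::vector<int>& off, int m)
{
    assert(m > 0 && off.size() == a.size());
    int const n = static_cast<int>(a.size());

    for (int j = 0; j < n / m; ++j) {
        int const start = j * m;
        int const end = (j + 1) * m;

#pragma GCC ivdep
        for (int i = start; i < end; ++i) {
            // Keep negative values, clamp everything else to zero.
            a[off[i]] = a[i] < 0 ? a[i] : 0;
        }
    }
}
```

As in the original snippet, elements beyond the last full block (when n is not a multiple of m) are not processed.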
[Bug tree-optimization/88464] AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464 --- Comment #16 from Moritz Kreutzer --- I can confirm the fix from my side. Thanks again!
[Bug tree-optimization/88464] AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464 --- Comment #11 from Moritz Kreutzer --- Jakub, I can confirm it's working for masked gathers (we have a similar pattern elsewhere in our code) with the latest trunk. Thanks for looking at the scatters as well!
[Bug tree-optimization/88464] AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464 --- Comment #8 from Moritz Kreutzer --- Thanks for the input and for confirming that "for conditional ones (both MASK_LOAD and MASK_STORE) the support for the cases when using a mask register rather than a vector register with mask either hasn't been done or doesn't work properly." Any idea about a possible way forward to fix or implement those? As stated above, even though I have no experience with GCC development, I'd be happy to assist to the best of my resources if at all possible.