[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #12 from Andrew Pinski ---
(In reply to rguent...@suse.de from comment #11)
> We're lacking a way to say one of the bit_not should be single-used,
> one multi-use would be OK and a fair trade-off - not sure if that
> would be enough here, of course.  That would mean changing to
> a condition with single_use ().

That does not fix it though, because in this case we have:

  c_19 = ~r_16;
  m_20 = ~g_17;
  y_21 = ~b_18;
  tmp_22 = MIN_EXPR <...>;
  k_23 = MIN_EXPR <...>;
  _1 = c_19 - k_23;
  _3 = m_20 - k_23;
  _5 = y_21 - k_23;
  .. = k_23;

So both bit_not are used more than once.

So we have `~a - MIN<MIN<~a, ~b>, ~c>`, which is the same as
`MAX<MAX<a, b>, c> - a`.

Let me file this as a separate bug to continue the discussion there.
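The equivalence mentioned at the end of this comment can be checked by brute force. The sketch below assumes byte-sized operands (as in the test case, where `c = 255 - r` is the same as `~r` on a byte) and reads the identity as `~a - MIN(~a, ~b, ~c) == MAX(a, b, c) - a`. The function names are hypothetical and have nothing to do with GCC internals.

```c
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }
static uint8_t max_u8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Form seen in the dump: c = ~r, k = MIN (c, MIN (m, y)), result c - k. */
static uint8_t sub_min_of_nots(uint8_t r, uint8_t g, uint8_t b)
{
    uint8_t c = (uint8_t)~r, m = (uint8_t)~g, y = (uint8_t)~b;
    uint8_t k = min_u8(c, min_u8(m, y));
    return (uint8_t)(c - k);
}

/* Proposed simpler form: MAX (MAX (r, g), b) - r, with no bit_not at all. */
static uint8_t max_minus_first(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)(max_u8(max_u8(r, g), b) - r);
}

/* Count the byte triples on which the two forms disagree. */
static unsigned long count_sub_mismatches(void)
{
    unsigned long bad = 0;
    for (unsigned r = 0; r < 256; r++)
        for (unsigned g = 0; g < 256; g++)
            for (unsigned b = 0; b < 256; b++)
                if (sub_min_of_nots((uint8_t)r, (uint8_t)g, (uint8_t)b)
                    != max_minus_first((uint8_t)r, (uint8_t)g, (uint8_t)b))
                    bad++;
    return bad;
}
```

A mismatch count of zero only shows that the two forms agree for unsigned 8-bit values; it says nothing about whether the rewrite is profitable.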
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #11 from rguenther at suse dot de ---
On Tue, 28 Nov 2023, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252
>
> Andrew Pinski changed:
>
>   What |Removed |Added
>   CC   |        |pinskia at gcc dot gnu.org
>
> --- Comment #10 from Andrew Pinski ---
> Note there is also a missing scalar optimization here also (which will improve
> the vectorized version in the end too).
>
> Right now we have the following match pattern:
> /* MIN (~X, ~Y) -> ~MAX (X, Y)
>    MAX (~X, ~Y) -> ~MIN (X, Y)  */
> (for minmax (min max)
>      maxmin (max min)
>  (simplify
>   (minmax (bit_not:s@2 @0) (bit_not:s@3 @1))
>   (bit_not (maxmin @0 @1)))
>
> But that does not match here due to the :s. I am not 100% sure but trading 2
> possible bit_not for adding another might end up improving things ...

We're lacking a way to say one of the bit_not should be single-used,
one multi-use would be OK and a fair trade-off - not sure if that
would be enough here, of course.  That would mean changing to
a condition with single_use ().
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Andrew Pinski changed:

  What |Removed |Added
  CC   |        |pinskia at gcc dot gnu.org

--- Comment #10 from Andrew Pinski ---
Note there is also a missing scalar optimization here also (which will improve
the vectorized version in the end too).

Right now we have the following match pattern:
/* MIN (~X, ~Y) -> ~MAX (X, Y)
   MAX (~X, ~Y) -> ~MIN (X, Y)  */
(for minmax (min max)
     maxmin (max min)
 (simplify
  (minmax (bit_not:s@2 @0) (bit_not:s@3 @1))
  (bit_not (maxmin @0 @1)))

But that does not match here due to the :s. I am not 100% sure but trading 2
possible bit_not for adding another might end up improving things ...
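For readers who want to convince themselves of the identity behind that pattern, here is a small exhaustive check on unsigned bytes. It is only a sketch of the arithmetic; the helper names are made up for illustration and it does not model the `:s` single-use restriction, which is about how many statements use each bit_not rather than about the values.

```c
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }
static uint8_t max_u8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Check both directions of the pattern on every pair of bytes:
     MIN (~X, ~Y) == ~MAX (X, Y)
     MAX (~X, ~Y) == ~MIN (X, Y)
   Returns the number of failing comparisons (0 means both hold). */
static unsigned count_minmax_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned x = 0; x < 256; x++) {
        for (unsigned y = 0; y < 256; y++) {
            uint8_t a = (uint8_t)x, b = (uint8_t)y;
            uint8_t na = (uint8_t)~a, nb = (uint8_t)~b;
            if (min_u8(na, nb) != (uint8_t)~max_u8(a, b))
                bad++;
            if (max_u8(na, nb) != (uint8_t)~min_u8(a, b))
                bad++;
        }
    }
    return bad;
}
```

The left-hand side needs two bit_not operations and the right-hand side one, which is why the rewrite only pays off when the original bit_not results are not needed elsewhere.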
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Richard Biener changed:

  What |Removed |Added
  CC   |        |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener ---
We are not optimally vectorizing this yet: we are using SLP to cover out[0],
out[1], out[2] and single element interleaving for out[3].  The stores end up
strided (aka scalar), which is not what the reporter intended.  We also unroll
the loop four times.

The SLP discovery code splits the store group (in the end we should avoid
throwing away such information).  This makes it have a gap, and stores with a
gap are only supported "strided" (we could at least store two and one element,
but ...).  We don't support "merging" back the group from SLP and non-SLP.

With SLP only we might recover here; possibly we shouldn't allow half SLP /
non-SLP for a store group, but it might fail even after discovery so it might
be difficult to force this.  Maybe a good case to "prime" single-lane SLP.
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Martin Liška changed:

  What |Removed |Added
  CC   |        |marxin at gcc dot gnu.org

--- Comment #8 from Martin Liška ---
Can the bug be marked as resolved?
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #7 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Author: kyukhin
Date: Wed Jun 18 07:46:18 2014
New Revision: 211769

URL: https://gcc.gnu.org/viewcvs?rev=211769&root=gcc&view=rev
Log:
gcc/
    * config/i386/i386.c (ix86_reassociation_width): Add alternative for
    vector case.
    * config/i386/i386.h (TARGET_VECTOR_PARALLEL_EXECUTION): New.
    * config/i386/x86-tune.def (X86_TUNE_VECTOR_PARALLEL_EXECUTION): New.
    * tree-vect-data-refs.c (vect_shift_permute_load_chain): New.
    Introduces alternative way of loads group permutations.
    (vect_transform_grouped_load): Try alternative way of permutations.

gcc/testsuite/
    PR tree-optimization/52252
    * gcc.target/i386/pr52252-atom.c: Test on loads group of size 3.
    * gcc.target/i386/pr52252-core.c: Ditto.

    PR tree-optimization/61403
    * gcc.target/i386/pr61403.c: Test on loads and stores group of size 3.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr52252-atom.c
    trunk/gcc/testsuite/gcc.target/i386/pr52252-core.c
    trunk/gcc/testsuite/gcc.target/i386/pr61403.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #6 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Author: kyukhin
Date: Wed Jun 11 08:37:53 2014
New Revision: 211439

URL: http://gcc.gnu.org/viewcvs?rev=211439&root=gcc&view=rev
Log:
gcc/
    * tree-vect-data-refs.c (vect_grouped_store_supported): New
    check for stores group of length 3.
    (vect_permute_store_chain): New permutations for stores group of
    length 3.
    * tree-vect-stmts.c (vect_model_store_cost): Change cost
    of vec_perm_shuffle for the new permutations.

gcc/testsuite/
    PR tree-optimization/52252
    * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr52252-st.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
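For context, a "stores group of length 3" is a loop that writes three adjacent elements per iteration. A minimal sketch of such a loop follows; it assumes planar byte inputs, the function name and signature are hypothetical, and it is not the contents of pr52252-st.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example loop, not taken from the GCC testsuite.
   Each iteration stores to out[0], out[1] and out[2] and then advances
   by 3, so the three stores form one interleaved group of size 3.
   To vectorize it, the three input vectors have to be permuted into
   interleaved order before they are stored. */
void interleave3(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                 uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[0] = r[i];
        out[1] = g[i];
        out[2] = b[i];
        out += 3;
    }
}
```

Before this revision, the vectorizer rejected groups whose size was not a power of 2 (see the dump in comment #2), so a loop of this shape would have stayed scalar on x86.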
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #5 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Author: kyukhin
Date: Wed May 7 12:10:22 2014
New Revision: 210155

URL: http://gcc.gnu.org/viewcvs?rev=210155&root=gcc&view=rev
Log:
gcc/
    * tree-vect-data-refs.c (vect_grouped_load_supported): New
    check for loads group of length 3.
    (vect_permute_load_chain): New permutations for loads group of
    length 3.
    * tree-vect-stmts.c (vect_model_load_cost): Change cost
    of vec_perm_shuffle for the new permutations.

gcc/testsuite/
    PR tree-optimization/52252
    * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #4 from Stupachenko Evgeny <evstupac at gmail dot com> ---
The patch giving the expected 3 times gain has been submitted for discussion at:
http://gcc.gnu.org/ml/gcc-patches/2014-02/msg00670.html
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Richard Guenther <rguenth at gcc dot gnu.org> changed:

  What   |Removed |Added
  Blocks |        |53947

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-13 08:48:18 UTC ---
Link to vectorizer missed-optimization meta-bug.
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #2 from Stupachenko Evgeny <evstupac at gmail dot com> 2012-02-29 12:32:20 UTC ---
The difference between the 2 dumps from

Arm: gcc -O3 -mfpu=neon test.c -S -ftree-vectorizer-verbose=12
X86: gcc -O3 -m32 -msse3 test.c -S -ftree-vectorizer-verbose=12

starts at:

For Arm (can use vec_load_lanes):
6: === vect_make_slp_decision ===
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>
……
6: can use vec_load_lanes<CI><V16QI>
6: vect_model_load_cost: unaligned supported by hardware.
6: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .

For x86 (no array mode for V16QI[3]):
6: === vect_make_slp_decision ===
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>
……
6: no array mode for V16QI[3]
6: the size of the group of strided accesses is not a power of 2
6: not vectorized: relevant stmt not supported: r_8 = *in_35;

As I mentioned before, there is an ability for x86 to handle this (Arm can
shuffle than loads, x86 can use pshufb).
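To make the pshufb remark concrete, here is a portable sketch of how a byte shuffle can de-interleave a group of 3. It emulates the semantics of the SSSE3 `pshufb` instruction in plain C so that it runs anywhere: each mask byte selects a source byte by its low 4 bits, or produces zero when its high bit is set. The helper names and the way the masks are built are my own assumptions for illustration; this is not the instruction sequence GCC generates.

```c
#include <stdint.h>

enum { VEC = 16 };  /* bytes per 128-bit vector */

/* Emulation of pshufb on one 16-byte vector: dst[i] is 0 if the high bit
   of mask[i] is set, otherwise src[mask[i] & 0x0F]. */
static void pshufb_emul(const uint8_t src[VEC], const uint8_t mask[VEC],
                        uint8_t dst[VEC])
{
    for (int i = 0; i < VEC; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Gather channel `ch` (0, 1 or 2) from 48 interleaved bytes into one
   16-byte vector, using three shuffles whose results are ORed together.
   Element j of the channel lives at byte 3*j + ch of the input. */
static void extract_channel(const uint8_t in[3 * VEC], int ch,
                            uint8_t out[VEC])
{
    for (int j = 0; j < VEC; j++)
        out[j] = 0;

    for (int chunk = 0; chunk < 3; chunk++) {
        uint8_t mask[VEC], part[VEC];

        /* Select the bytes that fall inside this 16-byte chunk and
           zero every other lane. */
        for (int j = 0; j < VEC; j++) {
            int idx = 3 * j + ch;
            mask[j] = (idx / VEC == chunk) ? (uint8_t)(idx % VEC) : 0x80;
        }

        pshufb_emul(in + chunk * VEC, mask, part);
        for (int j = 0; j < VEC; j++)
            out[j] |= part[j];
    }
}
```

With real SSSE3 code the masks would be compile-time constants and each `pshufb_emul` call would map to one `_mm_shuffle_epi8`, giving three shuffles and two ORs per output vector under this scheme.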
[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Richard Guenther <rguenth at gcc dot gnu.org> changed:

  What             |Removed     |Added
  Status           |UNCONFIRMED |NEW
  Last reconfirmed |            |2012-02-15
  Component        |target      |tree-optimization
  Version          |unknown     |4.7.0
  Ever Confirmed   |0           |1
  Severity         |normal      |enhancement

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-15 11:53:58 UTC ---
We fail to SLP vectorize this because of

6: Build SLP failed: different operation in stmt k_15 = MIN_EXPR <tmp_14, y_13>;

thus,

    out[0] = c - k;
    out[1] = m - k;
    out[2] = y - k;
    out[3] = k;

isn't detected as equivalent to

    out[0] = c - k;
    out[1] = m - k;
    out[2] = y - k;
    out[3] = magic - k;

or out[3] = k - 0; whatever would be more suitable (the latter would fail
to be detected as induction I guess, the former would fail with a similar
issue for the definition of magic).

With out[3] = y - k; we fail with

6: Load permutation 0 1 2 2 1 1 1 1 0 0 0 0 2 2 2 2
6: Build SLP failed: unsupported load permutation *out_37 = D.1721_16;

we can vectorize

void convert_image(byte *in, byte *out, int size) {
    int i;
    for(i = 0; i < size; i++) {
        byte r = in[0];
        byte g = in[1];
        byte b = in[2];
        byte a = in[3];
        byte c, m, y, k, z, tmp;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        z = 255 - a;
        tmp = MIN(m, y);
        k = MIN(c, tmp);
        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = z - k;
        in += 4;
        out += 4;
    }
}

though.
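The "magic - k" idea can be illustrated for a single pixel. The sketch below assumes magic = 2*k modulo 256, which is one value that makes `magic - k` equal `k` in byte arithmetic; the comment does not say how magic would be defined, so that choice and both function names are hypothetical.

```c
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Store pattern described above: three lanes are "x - k", the fourth
   lane stores k directly, so the four statements are not isomorphic. */
static void cmyk_direct(const uint8_t in[3], uint8_t out[4])
{
    uint8_t c = (uint8_t)(255 - in[0]);
    uint8_t m = (uint8_t)(255 - in[1]);
    uint8_t y = (uint8_t)(255 - in[2]);
    uint8_t k = min_u8(c, min_u8(m, y));

    out[0] = (uint8_t)(c - k);
    out[1] = (uint8_t)(m - k);
    out[2] = (uint8_t)(y - k);
    out[3] = k;
}

/* Same values with four isomorphic lanes of the form "x - k".
   ASSUMPTION: magic = 2*k (mod 256), so that magic - k == k. */
static void cmyk_uniform(const uint8_t in[3], uint8_t out[4])
{
    uint8_t c = (uint8_t)(255 - in[0]);
    uint8_t m = (uint8_t)(255 - in[1]);
    uint8_t y = (uint8_t)(255 - in[2]);
    uint8_t k = min_u8(c, min_u8(m, y));
    uint8_t lanes[4] = { c, m, y, (uint8_t)(k + k) };

    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(lanes[i] - k);
}
```

The second form makes all four stores the same operation, but computing `magic` is itself a different operation from computing c, m and y, which is the "similar issue for the definition of magic" noted in the comment.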