[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2023-11-28 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #12 from Andrew Pinski  ---
(In reply to rguent...@suse.de from comment #11)
> We're lacking a way to say one of the bit_not should be single-used,
> one multi-use would be OK and a fair trade-off - not sure if that
> would be enough here, of course.  That would mean changing to
> a condition with single_use ().

That does not fix it though. Because in this case we have:
  c_19 = ~r_16;
  m_20 = ~g_17;
  y_21 = ~b_18;
  tmp_22 = MIN_EXPR <m_20, y_21>;
  k_23 = MIN_EXPR <c_19, tmp_22>;
  _1 = c_19 - k_23;
  _3 = m_20 - k_23;
  _5 = y_21 - k_23;
  .. = k_23;

So both bit_not are used more than once.

So we have `~a - MIN<MIN<~a, ~b>, ~c>`, which is the same as
`MAX<MAX<a, b>, c> - a`.
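The identity `~a - MIN<MIN<~a, ~b>, ~c> == MAX<MAX<a, b>, c> - a` can be checked exhaustively for 8-bit values. What follows is only a minimal sketch, assuming unsigned 8-bit operands like the `byte` values in this bug; the helper names are made up for illustration and do not come from GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, named for illustration only. */
static uint8_t min_u8(uint8_t x, uint8_t y) { return x < y ? x : y; }
static uint8_t max_u8(uint8_t x, uint8_t y) { return x > y ? x : y; }

/* ~a - MIN<MIN<~a, ~b>, ~c>, computed in 8-bit unsigned arithmetic. */
static uint8_t with_bit_not(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t na = (uint8_t)~a, nb = (uint8_t)~b, nc = (uint8_t)~c;
    return (uint8_t)(na - min_u8(min_u8(na, nb), nc));
}

/* MAX<MAX<a, b>, c> - a, with no bit_not at all. */
static uint8_t without_bit_not(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)(max_u8(max_u8(a, b), c) - a);
}

/* Asserts that both forms agree on every 8-bit (a, b, c) triple. */
static void check_identity(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            for (unsigned c = 0; c < 256; c++)
                assert(with_bit_not((uint8_t)a, (uint8_t)b, (uint8_t)c)
                       == without_bit_not((uint8_t)a, (uint8_t)b, (uint8_t)c));
}
```

The reason it holds is that `~x` equals `255 - x` for an 8-bit unsigned `x`, so the minimum of the complements is the complement of the maximum, and the two `255` terms cancel in the subtraction.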

Let me file this as a separate bug to continue the discussion there.

[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2023-11-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #11 from rguenther at suse dot de  ---
On Tue, 28 Nov 2023, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252
> 
> Andrew Pinski  changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |pinskia at gcc dot gnu.org
> 
> --- Comment #10 from Andrew Pinski  ---
> Note there is also a missing scalar optimization here (which will improve
> the vectorized version in the end too).
> 
> Right now we have the following match pattern:
> /* MIN (~X, ~Y) -> ~MAX (X, Y)
>    MAX (~X, ~Y) -> ~MIN (X, Y)  */
> (for minmax (min max)
>  maxmin (max min)
>  (simplify
>   (minmax (bit_not:s@2 @0) (bit_not:s@3 @1))
>   (bit_not (maxmin @0 @1)))
> 
> 
> But that does not match here due to the :s. I am not 100% sure but trading 2
> possible bit_not for adding another might end up improving things ...

We're lacking a way to say one of the bit_not should be single-used,
one multi-use would be OK and a fair trade-off - not sure if that
would be enough here, of course.  That would mean changing to
a condition with single_use ().

[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2023-11-27 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #10 from Andrew Pinski  ---
Note there is also a missing scalar optimization here (which will improve
the vectorized version in the end too).

Right now we have the following match pattern:
/* MIN (~X, ~Y) -> ~MAX (X, Y)
   MAX (~X, ~Y) -> ~MIN (X, Y)  */
(for minmax (min max)
 maxmin (max min)
 (simplify
  (minmax (bit_not:s@2 @0) (bit_not:s@3 @1))
  (bit_not (maxmin @0 @1)))


But that does not match here due to the :s. I am not 100% sure but trading 2
possible bit_not for adding another might end up improving things ...
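The identity behind the pattern can be spot-checked in plain C. This is a minimal sketch, assuming two's-complement `int`; the function names are invented for illustration and are unrelated to match.pd itself.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helpers, named for illustration only. */
static int min_i(int x, int y) { return x < y ? x : y; }
static int max_i(int x, int y) { return x > y ? x : y; }

/* Left-hand side of the pattern: two bit_not feeding one MIN. */
static int min_of_nots(int x, int y) { return min_i(~x, ~y); }

/* Right-hand side of the pattern: one MAX followed by one bit_not. */
static int not_of_max(int x, int y) { return ~max_i(x, y); }

/* Checks both directions of the pattern on a few boundary values. */
static void check_samples(void)
{
    static const int s[] = { INT_MIN, -7, -1, 0, 1, 42, INT_MAX };
    const int n = (int)(sizeof s / sizeof s[0]);

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            /* MIN (~X, ~Y) == ~MAX (X, Y) */
            assert(min_of_nots(s[i], s[j]) == not_of_max(s[i], s[j]));
            /* MAX (~X, ~Y) == ~MIN (X, Y) */
            assert(max_i(~s[i], ~s[j]) == ~min_i(s[i], s[j]));
        }
}
```

When `~x` and `~y` have no other uses, the right-hand side saves one operation. When both are still needed elsewhere, as in this bug, the rewrite keeps them alive and adds a third bit_not, which is the trade-off the `:s` markers guard against.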

[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2023-08-31 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener  ---
We are not optimally vectorizing this yet: we are using SLP to cover
out[0], out[1], out[2] and single element interleaving for out[3].  The
stores end up strided (aka scalar), which is not what the reporter intended.
We also unroll the loop four times.

The SLP discovery code splits the store group (in the end we should avoid
throwing away such information).  This makes it have a gap and stores with
a gap are only supported "strided" (we could at least store two and one
element, but ...).  We don't support "merging" back the group from SLP
and non-SLP.  With SLP only we might recover here, possibly we shouldn't
allow half SLP / non-SLP for a store group but it might fail even after
discovery so it might be difficult to force this.  Maybe a good case to
"prime" single-lane SLP.

[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2018-11-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Martin Liška  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marxin at gcc dot gnu.org

--- Comment #8 from Martin Liška  ---
Can the bug be marked as resolved?

[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2014-06-18 Thread kyukhin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #7 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Author: kyukhin
Date: Wed Jun 18 07:46:18 2014
New Revision: 211769

URL: https://gcc.gnu.org/viewcvs?rev=211769&root=gcc&view=rev
Log:
gcc/
* config/i386/i386.c (ix86_reassociation_width): Add alternative for
vector case.
* config/i386/i386.h (TARGET_VECTOR_PARALLEL_EXECUTION): New.
* config/i386/x86-tune.def (X86_TUNE_VECTOR_PARALLEL_EXECUTION): New.
* tree-vect-data-refs.c (vect_shift_permute_load_chain): New.
Introduces alternative way of loads group permutations.
(vect_transform_grouped_load): Try alternative way of permutations.

gcc/testsuite/
PR tree-optimization/52252
* gcc.target/i386/pr52252-atom.c: Test on loads group of size 3.
* gcc.target/i386/pr52252-core.c: Ditto.

PR tree-optimization/61403
* gcc.target/i386/pr61403.c: Test on loads and stores group of size 3.


Added:
trunk/gcc/testsuite/gcc.target/i386/pr52252-atom.c
trunk/gcc/testsuite/gcc.target/i386/pr52252-core.c
trunk/gcc/testsuite/gcc.target/i386/pr61403.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/i386.h
trunk/gcc/config/i386/x86-tune.def
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-vect-data-refs.c


[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2014-06-11 Thread kyukhin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #6 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Author: kyukhin
Date: Wed Jun 11 08:37:53 2014
New Revision: 211439

URL: http://gcc.gnu.org/viewcvs?rev=211439&root=gcc&view=rev
Log:
gcc/
* tree-vect-data-refs.c (vect_grouped_store_supported): New
check for stores group of length 3.
(vect_permute_store_chain): New permutations for stores group of
length 3.
* tree-vect-stmts.c (vect_model_store_cost): Change cost
of vec_perm_shuffle for the new permutations.

gcc/testsuite/
PR tree-optimization/52252
* gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.


Added:
trunk/gcc/testsuite/gcc.dg/vect/pr52252-st.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-vect-data-refs.c
trunk/gcc/tree-vect-stmts.c
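To make "permutations for stores group of length 3" more concrete, here is a scalar sketch of interleaving three vectors with two-input permutes. It is not taken from the GCC sources: `vec_perm`, `interleave3`, the selector names and the 16-lane width are all assumptions chosen for illustration. It only shows the general idea that each output vector can be built with one permute of {a, b} followed by one permute that merges in c.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 };  /* assumed vector width in bytes */

/* Scalar model of a two-input permute: lane i of dst receives element
   sel[i] of the 2*LANES-element concatenation {lo, hi}. */
static void vec_perm(const uint8_t *lo, const uint8_t *hi,
                     const uint8_t *sel, uint8_t *dst)
{
    for (int i = 0; i < LANES; i++) {
        assert(sel[i] < 2 * LANES);
        dst[i] = sel[i] < LANES ? lo[sel[i]] : hi[sel[i] - LANES];
    }
}

/* Interleaves a, b, c (LANES elements each) into out[3 * LANES]
   as a0 b0 c0 a1 b1 c1 ...  Hypothetical helper. */
static void interleave3(const uint8_t *a, const uint8_t *b,
                        const uint8_t *c, uint8_t *out)
{
    for (int j = 0; j < 3; j++) {
        uint8_t sel_ab[LANES], sel_c[LANES], t[LANES];

        for (int i = 0; i < LANES; i++) {
            int p = LANES * j + i;  /* position in the interleaved stream */
            int src = p % 3;        /* 0 = a, 1 = b, 2 = c */
            int elt = p / 3;        /* element index within that source */

            /* Step 1 selects from {a, b}; lanes meant for c are don't-care. */
            sel_ab[i] = (uint8_t)(src == 0 ? elt : src == 1 ? LANES + elt : 0);
            /* Step 2 keeps the step-1 lanes and fills the rest from c. */
            sel_c[i] = (uint8_t)(src == 2 ? LANES + elt : i);
        }
        vec_perm(a, b, sel_ab, t);
        vec_perm(t, c, sel_c, out + LANES * j);
    }
}
```

With this scheme three output vectors cost six permutes in total, which is why the cost model for such a store group has to account for the extra shuffles.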


[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2014-05-07 Thread kyukhin at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #5 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Author: kyukhin
Date: Wed May  7 12:10:22 2014
New Revision: 210155

URL: http://gcc.gnu.org/viewcvs?rev=210155&root=gcc&view=rev
Log:
gcc/
* tree-vect-data-refs.c (vect_grouped_load_supported): New
check for loads group of length 3.
(vect_permute_load_chain): New permutations for loads group of
length 3.
* tree-vect-stmts.c (vect_model_load_cost): Change cost
of vec_perm_shuffle for the new permutations.

gcc/testsuite/
PR tree-optimization/52252
* gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.


Added:
trunk/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-vect-data-refs.c
trunk/gcc/tree-vect-stmts.c


[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2014-02-11 Thread evstupac at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #4 from Stupachenko Evgeny <evstupac at gmail dot com> ---
A patch giving the expected 3 times gain has been submitted for discussion at:
http://gcc.gnu.org/ml/gcc-patches/2014-02/msg00670.html


[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2012-07-13 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-13 08:48:18 UTC ---
Link to vectorizer missed-optimization meta-bug.


[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2012-02-29 Thread evstupac at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #2 from Stupachenko Evgeny <evstupac at gmail dot com> 2012-02-29 12:32:20 UTC ---
The difference of 2 dumps from

Arm: gcc -O3 -mfpu=neon test.c -S -ftree-vectorizer-verbose=12
X86: gcc -O3 -m32 -msse3 test.c -S -ftree-vectorizer-verbose=12

Starts at:

For Arm (can use vec_load_lanes):

6: === vect_make_slp_decision === 
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>

……

6: can use vec_load_lanes<CI><V16QI>
6: vect_model_load_cost: unaligned supported by hardware. 
6: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .

For x86 (no array mode for V16QI[3]):

6: === vect_make_slp_decision === 
6: === vect_detect_hybrid_slp === 
6: === vect_analyze_loop_operations === 
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>

……

6: no array mode for V16QI[3] 
6: the size of the group of strided accesses is not a power of 2 
6: not vectorized: relevant stmt not supported: r_8 = *in_35; 

As I mentioned before, x86 is able to handle this as well (Arm can
shuffle on loads, x86 can use pshufb).
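As a rough illustration of how pshufb could de-interleave a load group of size 3, here is a scalar model in plain C. It is a sketch under stated assumptions, not the code GCC generates: `pshufb_model` mimics the byte-shuffle semantics (a mask byte with bit 7 set produces zero, otherwise its low four bits index the source), and `extract_channel` is a hypothetical helper that gathers one channel of 16 packed 3-byte pixels by OR-ing three shuffled vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pshufb on one 16-byte vector: a mask byte with bit 7
   set yields 0, otherwise the low 4 bits select a byte of src. */
static void pshufb_model(const uint8_t src[16], const uint8_t mask[16],
                         uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Gathers channel ch (0, 1 or 2) of 16 packed 3-byte pixels (48 bytes)
   into out[16].  Hypothetical helper, named for illustration only. */
static void extract_channel(const uint8_t in[48], int ch, uint8_t out[16])
{
    assert(ch >= 0 && ch < 3);

    for (int j = 0; j < 16; j++)
        out[j] = 0;

    for (int v = 0; v < 3; v++) {        /* one shuffle per 16-byte input vector */
        uint8_t mask[16], part[16];

        for (int j = 0; j < 16; j++) {
            int pos = 3 * j + ch;        /* byte offset of element j in the stream */
            mask[j] = (pos / 16 == v) ? (uint8_t)(pos % 16) : 0x80;
        }
        pshufb_model(in + 16 * v, mask, part);

        for (int j = 0; j < 16; j++)     /* lanes from other vectors are 0 */
            out[j] |= part[j];
    }
}
```

Each output lane is non-zero in exactly one of the three shuffled vectors, so OR-ing them reassembles the channel. On real hardware the masks would be compile-time constants.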


[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2012-02-15 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-02-15
          Component|target                      |tree-optimization
            Version|unknown                     |4.7.0
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-15 11:53:58 UTC ---
We fail to SLP vectorize this because of

6: Build SLP failed: different operation in stmt k_15 = MIN_EXPR <tmp_14, y_13>;

thus,

out[0] = c - k;
out[1] = m - k;
out[2] = y - k;
out[3] = k;

isn't detected as equivalent to

out[0] = c - k;
out[1] = m - k;
out[2] = y - k;
out[3] = magic - k;

or

out[3] = k - 0;

whatever would be more suitable (the latter would fail to be detected as
induction I guess, the former would fail with a similar issue for the
definition of magic).
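The equivalence between the plain store and the `k - 0` form can be written out in C. This is a minimal sketch: the `byte` typedef and the `MIN` macro are assumed definitions (the thread never shows them), and the per-pixel helpers are invented for illustration. It shows only that the two forms compute the same bytes; whether the vectorizer accepts the uniform form is a separate question, as noted above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t byte;                      /* assumed definition */
#define MIN(a, b) ((a) < (b) ? (a) : (b))  /* assumed definition */

/* Per-pixel body with three subtractions and one plain copy. */
static void cmyk_mixed(byte r, byte g, byte b, byte out[4])
{
    byte c = (byte)(255 - r), m = (byte)(255 - g), y = (byte)(255 - b);
    byte tmp = MIN(m, y);
    byte k = MIN(c, tmp);

    out[0] = (byte)(c - k);
    out[1] = (byte)(m - k);
    out[2] = (byte)(y - k);
    out[3] = k;
}

/* Same result with every lane written as "value - subtrahend",
   so that all four stores use the same operation. */
static void cmyk_uniform(byte r, byte g, byte b, byte out[4])
{
    byte c = (byte)(255 - r), m = (byte)(255 - g), y = (byte)(255 - b);
    byte tmp = MIN(m, y);
    byte k = MIN(c, tmp);
    const byte val[4] = { c, m, y, k };
    const byte sub[4] = { k, k, k, 0 };  /* out[3] = k - 0 */

    for (int i = 0; i < 4; i++)
        out[i] = (byte)(val[i] - sub[i]);
}

/* Asserts that both forms agree for one input pixel. */
static void check_pixel(byte r, byte g, byte b)
{
    byte p[4], q[4];

    cmyk_mixed(r, g, b, p);
    cmyk_uniform(r, g, b, q);
    for (int i = 0; i < 4; i++)
        assert(p[i] == q[i]);
}
```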

With

out[3] = y - k;

we fail with

6: Load permutation 0 1 2 2 1 1 1 1 0 0 0 0 2 2 2 2
6: Build SLP failed: unsupported load permutation *out_37 = D.1721_16;

we can vectorize

void convert_image(byte *in, byte *out, int size) {
    int i;
    for (i = 0; i < size; i++) {
        byte r = in[0];
        byte g = in[1];
        byte b = in[2];
        byte a = in[3];
        byte c, m, y, k, z, tmp;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        z = 255 - a;
        tmp = MIN(m, y);
        k = MIN(c, tmp);
        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = z - k;
        in += 4;
        out += 4;
    }
}

though.