[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2019-04-10 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

--- Comment #7 from Tamar Christina  ---
> Note that ripping out non-SLP support from the vectorizer will turn
> reduction support upside down ... which means the work will heavily
> conflict, either me or you needing to re-do stuff.

Fair enough, thanks for the heads up!

[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2019-04-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

--- Comment #6 from rguenther at suse dot de  ---
On Tue, 9 Apr 2019, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259
> 
> Tamar Christina  changed:
> 
>What|Removed |Added
> 
>  Status|NEW |ASSIGNED
>  CC||tnfchris at gcc dot gnu.org
>Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
> gnu.org
> 
> --- Comment #5 from Tamar Christina  ---
> I'll be taking a look at this one as a part of GCC 10 as well.

Note that ripping out non-SLP support from the vectorizer will turn
reduction support upside down ... which means the work will heavily
conflict, either me or you needing to re-do stuff.

[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2019-04-09 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 CC||tnfchris at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #5 from Tamar Christina  ---
I'll be taking a look at this one as a part of GCC 10 as well.

[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2018-12-07 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

--- Comment #4 from rsandifo at gcc dot gnu.org  
---
(In reply to Richard Biener from comment #3)
> The vectorizer does not like
> 
>[local count: 955630224]:
>   # best_i_25 = PHI 
>   # best_26 = PHI 
>   # i_27 = PHI 
>   _1 = (long unsigned int) i_27;
>   _2 = _1 * 4;
>   _3 = data_18(D) + _2;
>   _4 = *_3;
>   best_i_11 = _4 <= best_26 ? best_i_25 : i_27;
>   best_13 = MAX_EXPR <_4, best_26>;
>   i_20 = i_27 + 1;
>   if (n_17(D) > i_20)
> 
> because for the best MAX reduction we have an additional use of the
> reduction value in the index reduction.  This combination isn't
> magically supported even though in isolation both cases are.
> 
> t.c:4:5: note:   Analyze phi: best_26 = PHI 
> t.c:4:5: missed:   reduction used in loop.
> t.c:4:5: missed:   Unknown def-use cycle pattern.
> t.c:4:5: note:   Analyze phi: best_i_25 = PHI  best_i_16(D)(18)>
> t.c:4:5: note:   detected reduction: need to swap operands: best_i_11 = _4 >
> best_26 ? i_27 : best_i_25;
> t.c:4:5: note:   Detected reduction.
> 
> if we'd been lucky and had analyzed best_i_25 before best_26 then we could
> probably special-case the case of "reduction used in loop" when that appears
> in other reductions.  In general that's of course still not valid I think.
Yeah.  Disabling the check for uses in the loop:

  /* If this isn't a nested cycle or if the nested cycle reduction value
 is used ouside of the inner loop we cannot handle uses of the reduction
 value.  */
  if ((!nested_in_vect_loop || inner_loop_of_double_reduc)
  && (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1))

gives us something like the vector body we want, modulo some
inefficiency:

.L4:
ldr q4, [x2], 16
mov v3.16b, v2.16b
add v2.4s, v2.4s, v6.4s
cmgev5.4s, v0.4s, v4.4s
cmp x3, x2
smaxv0.4s, v0.4s, v4.4s
bif v1.16b, v3.16b, v5.16b
bne .L4

where v0.4s ends up containing the maximum for each individual
lane and v1.s contains the best_i associated with each member
of v0.4s.  We "just" then need to make the epilogue do the
right thing with this information.

Hacking out the condition above (obviously an invalid thing
to do) sets "best" to the maximum of v0.s (good) but also sets
"best_i" to the maximum of v1.s (bad).  We need to restrict the
maximum of v1.s to lanes of v0.s that contain "best" (i.e. the
reduction result of v0.s):

dupv2.4s, best
cmpeq  v2.4s, v2.4s, v0.4s
andv1.4s, v1.4s, v2.4s

and only then take the maximum of v1.4s.

This requires "best" to come from a reassociatve conditional
reduction and would require the "best_i" reduction to be marked
as dependent on the "best" reduction.  Might end up being a bit
messy, since we'd have to be careful to retain the uses check
above for all other cases.

[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2018-11-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
The vectorizer does not like

   [local count: 955630224]:
  # best_i_25 = PHI 
  # best_26 = PHI 
  # i_27 = PHI 
  _1 = (long unsigned int) i_27;
  _2 = _1 * 4;
  _3 = data_18(D) + _2;
  _4 = *_3;
  best_i_11 = _4 <= best_26 ? best_i_25 : i_27;
  best_13 = MAX_EXPR <_4, best_26>;
  i_20 = i_27 + 1;
  if (n_17(D) > i_20)

because for the best MAX reduction we have an additional use of the
reduction value in the index reduction.  This combination isn't
magically supported even though in isolation both cases are.

t.c:4:5: note:   Analyze phi: best_26 = PHI 
t.c:4:5: missed:   reduction used in loop.
t.c:4:5: missed:   Unknown def-use cycle pattern.
t.c:4:5: note:   Analyze phi: best_i_25 = PHI 
t.c:4:5: note:   detected reduction: need to swap operands: best_i_11 = _4 >
best_26 ? i_27 : best_i_25;
t.c:4:5: note:   Detected reduction.

if we'd been lucky and had analyzed best_i_25 before best_26 then we could
probably special-case the case of "reduction used in loop" when that appears
in other reductions.  In general that's of course still not valid I think.

Alternatively the reduction operation could be combined somehow via
pattern detection.

[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2018-11-29 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

Ramana Radhakrishnan  changed:

   What|Removed |Added

 CC||ramana at gcc dot gnu.org

--- Comment #2 from Ramana Radhakrishnan  ---
(In reply to ktkachov from comment #1)
> Confirmed.
> Trying to find just the index (not the max value) vectorises as well:
> void test_vec(int *data, int n) {
> int best_i, best = 0;
> 
> for (int i = 0; i < n; i++) {
> if (data[i] > best) {
> //best = data[i];
> best_i = i;
> }
> }
> 
> data[best_i] = data[0];
> data[0] = best;
> }
> 
> 
> -O3:
> .L4:
> ldr q1, [x2], 16
> mov v3.16b, v2.16b
> add v2.4s, v2.4s, v4.4s
> cmlev1.4s, v1.4s, #0
> cmp x2, x3
> bif v0.16b, v3.16b, v1.16b
> bne .L4
> smaxv   s0, v0.4s
> and w3, w1, -4
> umovw2, v0.s[0]
> cmn w2, #1
> cselw2, w2, wzr, ne
> tst x1, 3
> beq .L2
> .L3:
> 
> But their combination seems like it's throwing the machinery off. I'm
> guessing the index-finding needs some if-conversion and masking to happen in
> the vectoriser

ISTR there is some limit in if conversion around the vectorizer where it only
works on very simple if-blocks. But this is from memory and it's a bit fuzzy
now.

[Bug tree-optimization/88259] vectorization failure for a typical loop for getting max value and index

2018-11-29 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-11-29
 CC||ktkachov at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org
 Blocks||53947
 Ever confirmed|0   |1

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.
Trying to find just the index (not the max value) vectorises as well:
void test_vec(int *data, int n) {
int best_i, best = 0;

for (int i = 0; i < n; i++) {
if (data[i] > best) {
//best = data[i];
best_i = i;
}
}

data[best_i] = data[0];
data[0] = best;
}


-O3:
.L4:
ldr q1, [x2], 16
mov v3.16b, v2.16b
add v2.4s, v2.4s, v4.4s
cmlev1.4s, v1.4s, #0
cmp x2, x3
bif v0.16b, v3.16b, v1.16b
bne .L4
smaxv   s0, v0.4s
and w3, w1, -4
umovw2, v0.s[0]
cmn w2, #1
cselw2, w2, wzr, ne
tst x1, 3
beq .L2
.L3:

But their combination seems like it's throwing the machinery off. I'm guessing
the index-finding needs some if-conversion and masking to happen in the
vectoriser


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations