[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-27 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #8 from Jiu Fu Guo  ---
The code in comment 4 is optimized because there is range info for "_2
= l_m_34 + _54;", where _54 > 0.

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-26 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #7 from Jiu Fu Guo  ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #5)
> > (In reply to Jiu Fu Guo from comment #0)
> > > For the below code:
> > > ---t.c
> > > void
> > > foo (const double* __restrict__ A, const double* __restrict__ B, double*
> > > __restrict__ C,
> > >  int n, int k, int m)
> > > {
> > >   for (unsigned int l_m = 0; l_m < m; l_m++)
> > > C[n + l_m] += A[k + l_m] * B[k];
> > > }
> > 
> > Try using unsigned long instead of unsigned int.
> > I think this is the same as PR 61247.
> 
> Yes, I think we've seen plenty examples in the past where conversions in
> the SCEV chain prevent analysis.

Yes. Thanks for your comments and suggestions!

And for this code (unsigned int), I'm wondering whether we really need a runtime
scev/overflow check before vectorizing it, to guard `n+m<4294967295 &&
m<4294967295`.
Without this guard, I'm not sure the optimization is correct for the code
in comment 4.

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #6 from Richard Biener  ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Jiu Fu Guo from comment #0)
> > For the below code:
> > ---t.c
> > void
> > foo (const double* __restrict__ A, const double* __restrict__ B, double*
> > __restrict__ C,
> >  int n, int k, int m)
> > {
> >   for (unsigned int l_m = 0; l_m < m; l_m++)
> > C[n + l_m] += A[k + l_m] * B[k];
> > }
> 
> Try using unsigned long instead of unsigned int.
> I think this is the same as PR 61247.

Yes, I think we've seen plenty examples in the past where conversions in
the SCEV chain prevent analysis.

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #5 from Andrew Pinski  ---
(In reply to Jiu Fu Guo from comment #0)
> For the below code:
> ---t.c
> void
> foo (const double* __restrict__ A, const double* __restrict__ B, double*
> __restrict__ C,
>  int n, int k, int m)
> {
>   for (unsigned int l_m = 0; l_m < m; l_m++)
> C[n + l_m] += A[k + l_m] * B[k];
> }

Try using unsigned long instead of unsigned int.
I think this is the same as PR 61247.

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-25 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #4 from Jiu Fu Guo  ---
Thanks, Richard!

One interesting thing: the code below is vectorized:

void
foo (const double *__restrict__ A, const double *__restrict__ B,
 double *__restrict__ C, int n, int k, int m)
{
  if (n > 0 && m > 0 && k > 0)
for (unsigned int l_m = 0; l_m < m; l_m++)
  C[n + l_m] += A[k + l_m] * B[k];
}

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

Richard Biener  changed:

   What|Removed |Added

 Blocks||53947
   Keywords||missed-optimization
 Ever confirmed|0   |1
   Severity|normal  |enhancement
 CC||rguenth at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-01-25

--- Comment #3 from Richard Biener  ---
Confirmed.  While niter analysis produces 'assumptions', there's no such
capability in SCEV analysis.  In particular we have accesses like

  _54 = (unsigned int) n_24(D);
...

  _2 = l_m_34 + _54;
  _3 = (long unsigned int) _2;
  _4 = _3 * 8;
  _5 = C_25(D) + _4;
  _6 = *_5;

where SCEV analysis could "look through" the (long unsigned int) cast
in case _2 >= 0 && _2 <= INT_MAX / 8; similar to niter analysis, it could
record this assumption somewhere.

The overflow analysis there possibly also has similar issues as the
split_constant_offset_1 one (which might also benefit from tracking
'assumptions').

Btw, it would be nice to track 'assumptions' not as GENERIC tree expressions
but in a form that is easier to collect & simplify later.  Maybe
for tracking purposes just note the SSA name _4, marking that its producer
is assumed to not overflow, leaving the combining & production of versioning
conditions to other helpers.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-24 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #2 from Jiu Fu Guo  ---
For code:
  for (unsigned int k = 0;  k < BS; k++)
{
  s += A[k] * B[k];
}

PR 48052 handles this, and for this code the additional runtime check does not
seem to be required.

If there is an offset in the code:
  for (unsigned int k = 0;  k < BS; k++)
{
  s += A[k+3] * B[k+3];
}
Then this code is not vectorized.

[Bug tree-optimization/98813] loop is sub-optimized if index is unsigned int with offset

2021-01-24 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98813

--- Comment #1 from Jiu Fu Guo  ---
Since the run-time check has additional cost, we only see a benefit if the
upper bound `m` is large; if the upper bound is small (e.g. < 12), the
vectorized code (from clang) is worse than the un-vectorized binary.