[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-08 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #27 from Alexander Nesterovskiy  ---
The place of interest here is a loop in the mat_times_vec function.
For r253678, autopar creates a single outlined function, mat_times_vec.constprop._loopfn.0.
For r256990, mat_times_vec is inlined into bi_cgstab_block and autopar creates
three functions:
bi_cgstab_block.constprop._loopfn.3
bi_cgstab_block.constprop._loopfn.6
bi_cgstab_block.constprop._loopfn.10
The sum of effective CPU time for these functions across all four threads is
very close for r253678 and r256990.
That looks reasonable, since the same amount of computation is done in both
cases.
But there is a significant difference in spinning/wait time.

Measuring with OMP_WAIT_POLICY=ACTIVE seems to be more informative - threads
never sleep, they are either working or spinning (thanks to Jakub).
r253678 case:
Main thread0:  ~0%  thread time spinning (~100% working)
Worker threads1-3: ~45% thread time spinning (~55% working)
r256990 case:
Main thread0:  ~20% thread time spinning (~80% working)
Worker threads1-3: ~50% thread time spinning (~50% working)

I've attached a second chart comparing CPU time for both cases (r253678 vs
r256990_work_spin); I think it illustrates the difference better than the first
one.
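
For readers unfamiliar with what autopar emits, here is a hand-written sketch
(not compiler output) of the shape of one parallelized region: the outlined
body corresponds to a *._loopfn.N function, GOMP_parallel is the libgomp
fork/join entry point, and the implicit team barrier at the end of each region
is where the ACTIVE-policy spinning shows up.  The struct layout and chunking
below are invented for the illustration.

/* Hand-written sketch (not GCC output) of one autopar region.  Each
   parallelized loop gets its own region like this, and each region ends
   in an implicit team barrier where idle threads spin under
   OMP_WAIT_POLICY=ACTIVE.  */
#include <omp.h>

struct loopfn_data { double *y; long n; };

static void
loop_fn (void *data_)                  /* analogous to *._loopfn.N */
{
  struct loopfn_data *d = (struct loopfn_data *) data_;
  int tid = omp_get_thread_num ();
  int nthreads = omp_get_num_threads ();
  long chunk = (d->n + nthreads - 1) / nthreads;
  long start = (long) tid * chunk;
  long end = start + chunk < d->n ? start + chunk : d->n;
  for (long i = start; i < end; i++)
    d->y[i] = 0.0;
}                                      /* implicit team barrier follows */

/* libgomp's parallel-region entry point.  */
extern void GOMP_parallel (void (*) (void *), void *, unsigned, unsigned);

void
run_region (double *y, long n)
{
  struct loopfn_data data = { y, n };
  GOMP_parallel (loop_fn, &data, 4 /* threads */, 0 /* flags */);
}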

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-08 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #26 from Alexander Nesterovskiy  ---
Created attachment 43361
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43361&action=edit
r253678 vs r256990_work_spin

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-05 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #25 from amker at gcc dot gnu.org ---
(In reply to Alexander Nesterovskiy from comment #24)
> Yes, it looks like more time is being spent on synchronization.
> r256990 really changes the way autopar works:
> For r253679...r256989 most of the work was done in the main thread0 (thread0
> ~91%, threads1-3 ~3% each).
> For r256990 the distribution is the same as for r253678 (thread0 ~34%,
> threads1-3 ~22% each), but a lot of time is spent spinning.
> I've attached a chart comparing r253678 and r256990 on the same time scale
> (~0.5 sec).
> The libgomp.so.1.0.0 code executed in thread1 is wait functions in both
> cases, and for r256990 they are called more often.
> 
> Setting OMP_WAIT_POLICY doesn't change much:
> for ACTIVE - performance is nearly the same as the default
> for PASSIVE - there is a serious performance drop for r256990 (which looks
> reasonable given the many thread sleeps/wake-ups)
> 
> Changing parloops-schedule also has no positive effect:
> r253678 performance is mostly the same for static, guided and dynamic
> r256990 performance is best with static, which is the default

I don't know OpenMP much.  Still not sure where the additional synchronization
comes from.  Do the iterations of the parallelized loop have the same execution
time on average?

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-02 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #24 from Alexander Nesterovskiy  ---
Yes, it looks like more time is being spent on synchronization.
r256990 really changes the way autopar works:
For r253679...r256989 most of the work was done in the main thread0 (thread0
~91%, threads1-3 ~3% each).
For r256990 the distribution is the same as for r253678 (thread0 ~34%,
threads1-3 ~22% each), but a lot of time is spent spinning.
I've attached a chart comparing r253678 and r256990 on the same time scale
(~0.5 sec).
The libgomp.so.1.0.0 code executed in thread1 is wait functions in both cases,
and for r256990 they are called more often.

Setting OMP_WAIT_POLICY doesn't change much:
for ACTIVE - performance is nearly the same as the default
for PASSIVE - there is a serious performance drop for r256990 (which looks
reasonable given the many thread sleeps/wake-ups)

Changing parloops-schedule also has no positive effect:
r253678 performance is mostly the same for static, guided and dynamic
r256990 performance is best with static, which is the default

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-02 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #23 from Alexander Nesterovskiy  ---
Created attachment 43326
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43326&action=edit
r253678 vs r256990

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-30 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #22 from Jakub Jelinek  ---
libgomp has many ways to tweak the wait behavior; look at the OMP_WAIT_POLICY
and GOMP_SPINCOUNT environment variables to adjust spinning vs. futex waiting.
Is the work scheduled for each thread roughly the same or are there threads
that do more work than others?  There is also
--param parloops-schedule={static,dynamic,guided,auto,runtime} with which you
can choose different kinds of OpenMP scheduling, with runtime even using env
vars.
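
For reference, a minimal hand-written OpenMP example of the scheduling kinds
involved; this is not GCC-generated code, the array and loop bodies are
invented, and it assumes compilation with -fopenmp.  With schedule(runtime) the
kind and chunk size come from the OMP_SCHEDULE environment variable, e.g.
OMP_SCHEDULE="guided,8".

/* Minimal sketch: --param parloops-schedule picks between the same kinds
   as the schedule() clauses below; "runtime" defers the choice to the
   OMP_SCHEDULE environment variable.  */
#include <stdio.h>

#define N 100000

int
main (void)
{
  static double y[N];

  /* static (the parloops default): equal contiguous chunks, lowest
     scheduling overhead, but sensitive to per-iteration imbalance.  */
#pragma omp parallel for schedule(static)
  for (int i = 0; i < N; i++)
    y[i] = 0.0;

  /* runtime: kind and chunk size read from OMP_SCHEDULE at startup.  */
#pragma omp parallel for schedule(runtime)
  for (int i = 0; i < N; i++)
    y[i] += (double) i;

  printf ("%g\n", y[N - 1]);
  return 0;
}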

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-30 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #21 from amker at gcc dot gnu.org ---
(In reply to Alexander Nesterovskiy from comment #20)
> I've made test runs on Broadwell and Skylake, RHEL 7.3.
> 410.bwaves became faster after r256990 but not as fast as it was on r253678. 
> Comparing 410.bwaves performance, "-Ofast -funroll-loops -flto
> -ftree-parallelize-loops=4": 
> 
> rev   perf. relative to r253678, %
> r253678   100%
> r253679   54%
> ...
> r256989   54%
> r256990   71%
> 
> CPU time distribution became more flat (~34% thread0, ~22% - threads1-3),
> but a lot of time is spent spinning in 
> libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin
> and
> libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin 
> r253678 spin time is ~10% of CPU time 
> r256990 spin time is ~30% of CPU time

I don't know gomp. Does this mean we spend more time synchronizing threads now?
Thanks.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-30 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #20 from Alexander Nesterovskiy  ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990 but not as fast as it was on r253678. 
Comparing 410.bwaves performance, "-Ofast -funroll-loops -flto
-ftree-parallelize-loops=4": 

rev   perf. relative to r253678, %
r253678   100%
r253679   54%
...
r256989   54%
r256990   71%

CPU time distribution became more flat (~34% thread0, ~22% - threads1-3), but a
lot of time is spent spinning in 
libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin
and
libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin 
r253678 spin time is ~10% of CPU time 
r256990 spin time is ~30% of CPU time

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-29 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

Jeffrey A. Law  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |law at redhat dot com
         Resolution|---                         |FIXED

--- Comment #19 from Jeffrey A. Law  ---
Fixed by Bin's change on the trunk.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-23 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #18 from amker at gcc dot gnu.org ---
Author: amker
Date: Tue Jan 23 16:47:03 2018
New Revision: 256990

URL: https://gcc.gnu.org/viewcvs?rev=256990&root=gcc&view=rev
Log:
PR tree-optimization/82604
* tree-loop-distribution.c (enum partition_kind): New enum item
PKIND_PARTIAL_MEMSET.
(partition_builtin_p): Support above new enum item.
(generate_code_for_partition): Ditto.
(compute_access_range): Differentiate cases that equality can be
proven at all loops, the innermost loops or no loops.
(classify_builtin_st, classify_builtin_ldst): Adjust call to above
function.  Set PKIND_PARTIAL_MEMSET for partition appropriately.
(finalize_partitions, distribute_loop): Don't fuse partition of
PKIND_PARTIAL_MEMSET kind when distributing 3-level loop nest.
(prepare_perfect_loop_nest): Distribute 3-level loop nest only if
parloop is enabled.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-loop-distribution.c

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #17 from rguenther at suse dot de  ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #16 from amker at gcc dot gnu.org ---
> I hope it's possible to break the dependence by reordering passes so that
> graphite/parallelization could be moved earlier.  There are several issues 
> like
> this IIRC.

Certainly not for GCC 8.  Also given that parallelization involves
outlining of code this always introduces issues with dependence
analysis due to extra indirections and us not being able to transition
all relevant information to the outlined copy.  So I'd even move
it later... (after vectorization for example).

For this particular case dependence analysis should be enhanced to
also handle memory builtins of course.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #16 from amker at gcc dot gnu.org ---
I hope it's possible to break the dependence by reordering passes so that
graphite/parallelization could be moved earlier.  There are several issues like
this IIRC.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #15 from rguenther at suse dot de  ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #14 from amker at gcc dot gnu.org ---
> (In reply to rguent...@suse.de from comment #13)
> > On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> > > 
> > > --- Comment #12 from amker at gcc dot gnu.org ---
> > > (In reply to rguent...@suse.de from comment #11)
> > > > 
> > > Yes, this can be done.  For now it's disabled because, without
> > > classifying the zeroing stmt as a builtin partition, it's fused because
> > > of the shared memory reference to y(l,i,j,k).  This step could be
> > > achieved by cost model changes.  The only problem is that the cost model
> > > change doesn't make sense here (without considering the builtin
> > > partition stuff, it should be fused, right?)
> > 
> > It might be profitable to distribute away stores that have no dependent
> > stmts (thus stores from invariants).
> > 
> > Another heuristic would be to never merge builtin partitions with
> > other partitions because loop optimizations do not handle memory
> 
> Together with the last sentence of your comment: IIUC, what we want to do is
> still a builtin partition distribution from the original loop.  The only
> difference is that the loop nest of the zeroing stmt is now distributed into
> an (outer) loop around a memset call, rather than a single memset call.  Of
> course it would be even better if it could be distributed into a single
> memset.  Currently such a loop nest is not classified as a builtin partition.
> 
> > builtins (the data dependence limitation).  Which might also be a reason
> > not to handle those as builtins but revert to a non-builtin
> > classification.
> But I don't quite follow this sentence.  Why not handle it as builtins?  It
> is special, but eventually we want to distribute it into a memset (in a loop
> nest), right?

Do not handle it as builtins because we cannot handle builtins in data
dependence and we do not want to distribute it into a separate nest
because of data locality.

Needs to be measured ;)

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #14 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #13)
> On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> > 
> > --- Comment #12 from amker at gcc dot gnu.org ---
> > (In reply to rguent...@suse.de from comment #11)
> > > 
> > Yes, this can be done.  For now it's disabled because, without classifying
> > the zeroing stmt as a builtin partition, it's fused because of the shared
> > memory reference to y(l,i,j,k).  This step could be achieved by cost model
> > changes.  The only problem is that the cost model change doesn't make sense
> > here (without considering the builtin partition stuff, it should be fused,
> > right?)
> 
> It might be profitable to distribute away stores that have no dependent
> stmts (thus stores from invariants).
> 
> Another heuristic would be to never merge builtin partitions with
> other partitions because loop optimizations do not handle memory

Together with the last sentence of your comment: IIUC, what we want to do is
still a builtin partition distribution from the original loop.  The only
difference is that the loop nest of the zeroing stmt is now distributed into an
(outer) loop around a memset call, rather than a single memset call.  Of course
it would be even better if it could be distributed into a single memset.
Currently such a loop nest is not classified as a builtin partition.

> builtins (the data dependence limitation).  Which might also be a reason
> not to handle those as builtins but revert to a non-builtin
> classification.
But I don't quite follow this sentence.  Why not handle it as builtins?  It is
special, but eventually we want to distribute it into a memset (in a loop
nest), right?

Thanks
> 
> I suppose implementing both and then looking at what distributions
> change due to them on say SPEC CPU 2006, classifying each change
> as either good or bad is the only way we'd know whether such
> cost model change is wanted.
> 
> > > And then do memset replacement in the first loop.
> > I guess this step is just as hard as what I mentioned?  We still need to
> > prove that the loops of the zeroing statement don't leave a bubble in
> > memory.
> 
> No, you'd simply have the i and j loops containing a memset...

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #13 from rguenther at suse dot de  ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #12 from amker at gcc dot gnu.org ---
> (In reply to rguent...@suse.de from comment #11)
> > On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> > 
> > 
> > I think the zeroing stmt can be distributed into a separate loop nest
> > (up to whatever level we choose) and in the then non-parallelized nest
> > the memset can stay at the current level.  So distribute
> > 
> > >  do j=1,ny
> > > jm1=mod(j+ny-2,ny)+1
> > > jp1=mod(j,ny)+1
> > > do i=1,nx
> > >im1=mod(i+nx-2,nx)+1
> > >ip1=mod(i,nx)+1
> > >do l=1,nb
> > >   y(l,i,j,k)=0.0d0
> > >   do m=1,nb
> > >  y(l,i,j,k)=y(l,i,j,k)+
> > >  ;; 
> > >   enddo
> > >enddo
> > > enddo
> > >  enddo
> > 
> > to
> > 
> > >  do j=1,ny
> > > jm1=mod(j+ny-2,ny)+1
> > > jp1=mod(j,ny)+1
> > > do i=1,nx
> > >im1=mod(i+nx-2,nx)+1
> > >ip1=mod(i,nx)+1
> > >do l=1,nb
> > >   y(l,i,j,k)=0.0d0
> > >enddo
> > > enddo
> > >  enddo
> > >  do j=1,ny
> > > jm1=mod(j+ny-2,ny)+1
> > > jp1=mod(j,ny)+1
> > > do i=1,nx
> > >im1=mod(i+nx-2,nx)+1
> > >ip1=mod(i,nx)+1
> > >do l=1,nb
> > >   do m=1,nb
> > >  y(l,i,j,k)=y(l,i,j,k)+
> > >  ;; 
> > >   enddo
> > >enddo
> > > enddo
> > >  enddo
> > 
> Yes, this can be done.  For now it's disabled because, without classifying
> the zeroing stmt as a builtin partition, it's fused because of the shared
> memory reference to y(l,i,j,k).  This step could be achieved by cost model
> changes.  The only problem is that the cost model change doesn't make sense
> here (without considering the builtin partition stuff, it should be fused,
> right?)

It might be profitable to distribute away stores that have no dependent
stmts (thus stores from invariants).

Another heuristic would be to never merge builtin partitions with
other partitions because loop optimizations do not handle memory
builtins (the data dependence limitation).  Which might also be a reason
not to handle those as builtins but revert to a non-builtin
classification.

I suppose implementing both and then looking at what distributions
change due to them on say SPEC CPU 2006, classifying each change
as either good or bad is the only way we'd know whether such
cost model change is wanted.
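
For illustration only, here is a minimal sketch of the second heuristic.  It
assumes the existing partition_builtin_p () predicate from
tree-loop-distribution.c; the helper name and its call site in the fusion code
are hypothetical, this is not the committed change.

/* Hypothetical sketch, not an actual GCC patch: one way to express the
   "never merge builtin partitions with other partitions" heuristic.
   partition_builtin_p () exists in tree-loop-distribution.c; the helper
   below and where it would be called from the fusion code are made up.  */

static bool
fusion_respects_builtins_p (struct partition *p1, struct partition *p2)
{
  /* Keep memset/memcpy-like partitions separate from computational
     partitions so that dependence analysis (and thus autopar) never
     sees a builtin call fused into a loop nest it has to analyze.  */
  return partition_builtin_p (p1) == partition_builtin_p (p2);
}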

> > And then do memset replacement in the first loop.
> I guess this step is just as hard as what I mentioned?  We still need to
> prove that the loops of the zeroing statement don't leave a bubble in memory.

No, you'd simply have the i and j loops containing a memset...

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #12 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #11)
> On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> 
> 
> I think the zeroing stmt can be distributed into a separate loop nest
> (up to whatever level we choose) and in the then non-parallelized nest
> the memset can stay at the current level.  So distribute
> 
> >  do j=1,ny
> > jm1=mod(j+ny-2,ny)+1
> > jp1=mod(j,ny)+1
> > do i=1,nx
> >im1=mod(i+nx-2,nx)+1
> >ip1=mod(i,nx)+1
> >do l=1,nb
> >   y(l,i,j,k)=0.0d0
> >   do m=1,nb
> >  y(l,i,j,k)=y(l,i,j,k)+
> >  ;; 
> >   enddo
> >enddo
> > enddo
> >  enddo
> 
> to
> 
> >  do j=1,ny
> > jm1=mod(j+ny-2,ny)+1
> > jp1=mod(j,ny)+1
> > do i=1,nx
> >im1=mod(i+nx-2,nx)+1
> >ip1=mod(i,nx)+1
> >do l=1,nb
> >   y(l,i,j,k)=0.0d0
> >enddo
> > enddo
> >  enddo
> >  do j=1,ny
> > jm1=mod(j+ny-2,ny)+1
> > jp1=mod(j,ny)+1
> > do i=1,nx
> >im1=mod(i+nx-2,nx)+1
> >ip1=mod(i,nx)+1
> >do l=1,nb
> >   do m=1,nb
> >  y(l,i,j,k)=y(l,i,j,k)+
> >  ;; 
> >   enddo
> >enddo
> > enddo
> >  enddo
> 
Yes, this can be done.  For now it's disabled because, without classifying the
zeroing stmt as a builtin partition, it's fused because of the shared memory
reference to y(l,i,j,k).  This step could be achieved by cost model changes.
The only problem is that the cost model change doesn't make sense here (without
considering the builtin partition stuff, it should be fused, right?)

> And then do memset replacement in the first loop.
I guess this step is just as hard as what I mentioned?  We still need to prove
that the loops of the zeroing statement don't leave a bubble in memory.
> 
> I think the current cost modeling doesn't consider this because
> of the re-use of y.  IIRC this is what my original nest distribution
> patches did.
> 
> This might be doable by just cost model changes?

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #11 from rguenther at suse dot de  ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #10 from amker at gcc dot gnu.org ---
> For the record, there is another possible fix.  Quoted loop nest from
> gcc/testsuite/gfortran.dg/pr81303.f:
> 
>  do j=1,ny
> jm1=mod(j+ny-2,ny)+1
> jp1=mod(j,ny)+1
> do i=1,nx
>im1=mod(i+nx-2,nx)+1
>ip1=mod(i,nx)+1
>do l=1,nb
>   y(l,i,j,k)=0.0d0
>   do m=1,nb
>  y(l,i,j,k)=y(l,i,j,k)+
>  ;; 
>   enddo
>enddo
> enddo
>  enddo
> 
> Originally GCC could parallelize the loop nest at the i loop, but now GCC
> only parallelizes it at the l loop because the stmt "y(l,i,j,k)=0.0d0" is
> distributed into a memset inside the i loop.  As a result the distributed
> memset call can't be analyzed by the data reference analyzer.
> An idea is to distribute the stmt to the outer loop j, so that we can at
> least parallelize at loop level i as before.
> 
> Unfortunately this is not easy.  To distribute it into a memset at loop
> level j, we have to prove that the memory range set to ZERO at each loop
> level doesn't leave any bubble (gap) in it.
> Given that the array bounds and loop niters are not constant, we need to
> prove non-trivial equality of difference expressions.  This needs to be done
> in function tree-loop-distribution.c:compute_access_range.  Specifically, in
> this function we have:
> 
> :
>   _1 = *nb_113(D);
>   ubound.86_114 = (integer(kind=8)) _1;
>   stride.88_115 = MAX_EXPR ;
> 
> ...
> 
> :  // thus in loop nest we have _1 > 0
>   if (_1 <= 0)
> goto ; [15.00%]
>   else
> goto ; [85.00%]
> 
> ...
> 
> And in the end, we need to prove:
> 
> ((sizetype) ((unsigned int) _1 + 4294967295) + 1) * 8
>  == (sizetype) stride.88_115 * 8
> 
> We first need to prove:
> ((sizetype) ((unsigned int) _1 + 4294967295) + 1)
>   == (sizetype) _1
> using pre-condition "_1 > 0"
> 
> Then need to prove: MAX_EXPR  == ubound.86_114 also because
> of "_1 > 0".
> 
> I doubt this can be done (without heavy messy code) in GCC now.  Or there 
> might
> be another way out of this?

I think the zeroing stmt can be distributed into a separate loop nest
(up to whatever level we choose) and in the then non-parallelized nest
the memset can stay at the current level.  So distribute

>  do j=1,ny
> jm1=mod(j+ny-2,ny)+1
> jp1=mod(j,ny)+1
> do i=1,nx
>im1=mod(i+nx-2,nx)+1
>ip1=mod(i,nx)+1
>do l=1,nb
>   y(l,i,j,k)=0.0d0
>   do m=1,nb
>  y(l,i,j,k)=y(l,i,j,k)+
>  ;; 
>   enddo
>enddo
> enddo
>  enddo

to

>  do j=1,ny
> jm1=mod(j+ny-2,ny)+1
> jp1=mod(j,ny)+1
> do i=1,nx
>im1=mod(i+nx-2,nx)+1
>ip1=mod(i,nx)+1
>do l=1,nb
>   y(l,i,j,k)=0.0d0
>enddo
> enddo
>  enddo
>  do j=1,ny
> jm1=mod(j+ny-2,ny)+1
> jp1=mod(j,ny)+1
> do i=1,nx
>im1=mod(i+nx-2,nx)+1
>ip1=mod(i,nx)+1
>do l=1,nb
>   do m=1,nb
>  y(l,i,j,k)=y(l,i,j,k)+
>  ;; 
>   enddo
>enddo
> enddo
>  enddo

And then do memset replacement in the first loop.
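
Written as C for illustration (the Fortran y(l,i,j,k) is shown as a flattened
column-major array and the parameter names are assumptions), the first
distributed nest after memset replacement at the current level would look like
the following sketch.

/* Sketch only: the j and i loops of the first distributed nest remain,
   and each (i,j) iteration zeroes one contiguous run of nb doubles,
   i.e. y(1:nb,i,j,k) = 0.0d0 becomes one memset per (i,j).  Array layout
   and parameter names are assumed for the example.  */
#include <string.h>

void
zero_first_nest (double *y, long nb, long nx, long ny, long k)
{
  for (long j = 0; j < ny; j++)
    for (long i = 0; i < nx; i++)
      memset (&y[((k * ny + j) * nx + i) * nb], 0, nb * sizeof (double));
}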

I think the current cost modeling doesn't consider this because
of the re-use of y.  IIRC this is what my original nest distribution
patches did.

This might be doable by just cost model changes?

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #10 from amker at gcc dot gnu.org ---
For the record, there is another possible fix.  Quoted loop nest from
gcc/testsuite/gfortran.dg/pr81303.f:

 do j=1,ny
jm1=mod(j+ny-2,ny)+1
jp1=mod(j,ny)+1
do i=1,nx
   im1=mod(i+nx-2,nx)+1
   ip1=mod(i,nx)+1
   do l=1,nb
  y(l,i,j,k)=0.0d0
  do m=1,nb
 y(l,i,j,k)=y(l,i,j,k)+
 ;; 
  enddo
   enddo
enddo
 enddo

Originally GCC could parallelize the loop nest at the i loop, but now GCC only
parallelizes it at the l loop because the stmt "y(l,i,j,k)=0.0d0" is distributed
into a memset inside the i loop.  As a result the distributed memset call can't
be analyzed by the data reference analyzer.
An idea is to distribute the stmt to the outer loop j, so that we can at least
parallelize at loop level i as before.

Unfortunately this is not easy.  To distribute it into a memset at loop level j,
we have to prove that the memory range set to ZERO at each loop level doesn't
leave any bubble (gap) in it.
Given that the array bounds and loop niters are not constant, we need to prove
non-trivial equality of difference expressions.  This needs to be done in
function tree-loop-distribution.c:compute_access_range.  Specifically, in this
function we have:

:
  _1 = *nb_113(D);
  ubound.86_114 = (integer(kind=8)) _1;
  stride.88_115 = MAX_EXPR ;

...

:  // thus in loop nest we have _1 > 0
  if (_1 <= 0)
goto ; [15.00%]
  else
goto ; [85.00%]

...

And in the end, we need to prove:

((sizetype) ((unsigned int) _1 + 4294967295) + 1) * 8
 == (sizetype) stride.88_115 * 8

We first need to prove:
((sizetype) ((unsigned int) _1 + 4294967295) + 1)
  == (sizetype) _1
using pre-condition "_1 > 0"

Then need to prove: MAX_EXPR  == ubound.86_114 also because
of "_1 > 0".

I doubt this can be done (without heavy messy code) in GCC now.  Or there might
be another way out of this?

Thanks,

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-01-10
     Ever confirmed|0                           |1

--- Comment #9 from Richard Biener  ---
Confirmed at least.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-11-27 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #8 from amker at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #7)
> The #c5 approach sounds better to me; we can have memsets in the IL even from
> the user, so it would be nice if we handled those in the dr analysis too.

Yeah, I plan to look into that approach after the interchange stuff, unless
Richi gets to it earlier :)

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-11-27 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek  ---
The #c5 approach sounds better to me; we can have memsets in the IL even from
the user, so it would be nice if we handled those in the dr analysis too.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-11-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #6 from Richard Biener  ---
Failed loop-distribution hack (still needs dependence analysis fixes).
We could even preserve TBAA if we use a {} of the correct element array type.
For constant sizes this should always be a win.

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c    (revision 254858)
+++ gcc/tree-loop-distribution.c    (working copy)
@@ -1006,9 +1006,22 @@ generate_memset_builtin (struct loop *lo
       val = tem;
     }
 
-  fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_MEMSET));
-  fn_call = gimple_build_call (fn, 3, mem, val, nb_bytes);
-  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+  if (! integer_zerop (val))
+    {
+      fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_MEMSET));
+      fn_call = gimple_build_call (fn, 3, mem, val, nb_bytes);
+      gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+    }
+  else
+    {
+      tree arrt = build_array_type (char_type_node, NULL_TREE);
+      gassign *ass = gimple_build_assign (build2 (MEM_REF, arrt, mem,
+                                                  build_zero_cst (ptr_type_node)),
+                                          build2 (WITH_SIZE_EXPR, arrt,
+                                                  build_constructor (arrt, NULL),
+                                                  nb_bytes));
+      gsi_insert_after (&gsi, ass, GSI_CONTINUE_LINKING);
+    }
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-11-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #5 from Richard Biener  ---
Something like the following, which only works up to the point of dependence
analysis trying to disambiguate this ref against others...  I suppose some
dr_may_alias_p tweak to consider niter information and step/size and do
a "simple" offset-based overlap test is missing.

Index: gcc/tree-data-ref.c
===================================================================
--- gcc/tree-data-ref.c (revision 254858)
+++ gcc/tree-data-ref.c (working copy)
@@ -4827,7 +4827,7 @@ get_references_in_stmt (gimple *stmt, ve
             clobbers_memory = true;
             break;
           }
-      else
+      else if (! gimple_call_builtin_p (stmt, BUILT_IN_MEMSET))
         clobbers_memory = true;
     }
   else if (stmt_code == GIMPLE_ASM
@@ -4888,6 +4888,12 @@ get_references_in_stmt (gimple *stmt, ve
           default:
             break;
           }
+      else if (gimple_call_builtin_p (stmt, BUILT_IN_MEMSET))
+        {
+          ref.ref = fold_build2 (MEM_REF, build_array_type (char_type_node, build_index_type (gimple_call_arg (stmt, 2))), gimple_call_arg (stmt, 0), build_zero_cst (ptr_type_node));
+          references->safe_push (ref);
+          return false;
+        }
 
       op0 = gimple_call_lhs (stmt);
       n = gimple_call_num_args (stmt);

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-11-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #4 from Richard Biener  ---
So it still parallelizes the loop(s) but at one level deeper (line 176 vs.
173).
This is because dependence analysis does not handle calls and loop distribution
distributed a memset.  ISL dependence analysis will have the same issue.

I've quickly tried replacing the memset with *p = {}, which somewhat works
when wrapping the LHS with a WITH_SIZE_EXPR.  It later ICEs, but well.
RTL expansion suggests we expect it on the RHS, so *p = WITH_SIZE_EXPR <{}, n>,
but that again doesn't parallelize (dependence analysis is confused).

I think the proper fix is in dependence analysis, in get_references_in_stmt,
similar to how we handle masked loads/stores.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-10-19 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #3 from rguenther at suse dot de  ---
On Thu, 19 Oct 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #2 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #1)
> > I suppose loop distribution inserted a version copy turning this into a
> > non-perfect nest for outer loops and thus disabling autopar there.
> > 
> > What probably makes sense is to run autoparallelization earlier or to try
> Given that passes like split/distribution/interchange are not meant to help
> GRAPHITE/autopar, another choice is to push them later, after GRAPHITE and
> autopar?

I suppose sometimes they will help autopar, given that without the help of
GRAPHITE autopar can only handle perfect nests.

I think it would be nice to rewrite autopar in terms of GRAPHITE
only, thus instruct ISL to tile the parallel loops in the region
so the outermost parallel loops in a SCOP can be OMP-ized by the code 
generator if requested (that's similar to the idea of tiling the
inner loop for autovectorizing).

But yes, if autopar weren't restricted we'd move it before all
high-level loop opts.  And we'd have a split pipeline between
a graphite pass and our own high-level loop opts if graphite
were good enough.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-10-19 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #2 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> I suppose loop distribution inserted a version copy turning this into a
> non-perfect nest for outer loops and thus disabling autopar there.
> 
> What probably makes sense is to run autoparallelization earlier or to try
Given that passes like split/distribution/interchange are not meant to help
GRAPHITE/autopar, another choice is to push them later, after GRAPHITE and
autopar?

> -fparallelize-loops-all (using ISL dependence analysis).
> 
> Just guesses of course, autopar isn't one of our most elaborate passes...

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-10-19 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |amker at gcc dot gnu.org,
                   |                            |matz at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
   Target Milestone|---                         |8.0

--- Comment #1 from Richard Biener  ---
I suppose loop distribution inserted a version copy turning this into a
non-perfect nest for outer loops and thus disabling autopar there.

What probably makes sense is to run autoparallelization earlier or to try
-fparallelize-loops-all (using ISL dependence analysis).

Just guesses of course, autopar isn't one of our most elaborate passes...