https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #27 from Alexander Nesterovskiy ---
The place of interest here is a loop in the mat_times_vec function.
For r253678 a mat_times_vec.constprop._loopfn.0 is created with autopar.
For r256990 the mat_times_vec is inlined into bi_cgstab_block
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #26 from Alexander Nesterovskiy ---
Created attachment 43361
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43361&action=edit
r253678 vs r256990_work_spin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #25 from amker at gcc dot gnu.org ---
(In reply to Alexander Nesterovskiy from comment #24)
> Yes, it looks like more time is being spent in synchronizing.
> r256990 really changes the way autopar works:
> For r253679...r256989 the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #24 from Alexander Nesterovskiy ---
Yes, it looks like more time is being spent in synchronizing.
r256990 really changes the way autopar works:
For r253679...r256989 most of the work was done in the main thread (thread0
~91%,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #23 from Alexander Nesterovskiy ---
Created attachment 43326
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43326&action=edit
r253678 vs r256990
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #22 from Jakub Jelinek ---
libgomp has many ways to tweak the wait behavior, look for OMP_WAIT_POLICY and
GOMP_SPINCOUNT environment variables to tweak the spinning vs. futex waiting.
Is the work scheduled for each thread roughly the
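
A minimal sketch of how the two variables named above might be set for a benchmark run. The helper name libgomp_wait_env and the ./bwaves binary in the comment are hypothetical and not from the report; only OMP_WAIT_POLICY and GOMP_SPINCOUNT come from the comment itself.

```python
import os


def libgomp_wait_env(policy="passive", spincount=None, base=None):
    """Return a copy of the environment with libgomp wait settings applied.

    policy    -- "active" (prefer spinning) or "passive" (prefer sleeping)
    spincount -- optional integer spin iteration count for GOMP_SPINCOUNT
    base      -- environment to copy; defaults to os.environ
    """
    if policy.lower() not in ("active", "passive"):
        raise ValueError("OMP_WAIT_POLICY must be 'active' or 'passive'")
    env = dict(os.environ if base is None else base)
    env["OMP_WAIT_POLICY"] = policy.upper()
    if spincount is not None:
        env["GOMP_SPINCOUNT"] = str(int(spincount))
    return env


# Hypothetical usage, assuming a locally built benchmark binary:
#   import subprocess
#   subprocess.run(["./bwaves"], env=libgomp_wait_env("passive", 0))
env = libgomp_wait_env("passive", spincount=0, base={})
print(env)
```

Comparing run times under the passive and active settings is one way to check whether the extra time is really spent spinning at synchronization points.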
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #21 from amker at gcc dot gnu.org ---
(In reply to Alexander Nesterovskiy from comment #20)
> I've made test runs on Broadwell and Skylake, RHEL 7.3.
> 410.bwaves became faster after r256990 but not as fast as it was on r253678.
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #20 from Alexander Nesterovskiy ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990 but not as fast as it was on r253678.
Comparing 410.bwaves performance, "-Ofast -funroll-loops -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
Jeffrey A. Law changed:
What|Removed |Added
Status|NEW |RESOLVED
CC|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #18 from amker at gcc dot gnu.org ---
Author: amker
Date: Tue Jan 23 16:47:03 2018
New Revision: 256990
URL: https://gcc.gnu.org/viewcvs?rev=256990&root=gcc&view=rev
Log:
PR tree-optimization/82604
* tree-loop-distribution.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #17 from rguenther at suse dot de ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
>
> --- Comment #16 from amker at gcc dot gnu.org ---
> I hope it's possible to break
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #16 from amker at gcc dot gnu.org ---
I hope it's possible to break the dependence by reordering passes so that
graphite/parallelization could be moved earlier. There are several issues like
this IIRC.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #15 from rguenther at suse dot de ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
>
> --- Comment #14 from amker at gcc dot gnu.org ---
> (In reply to rguent...@suse.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #14 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #13)
> On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> >
> > --- Comment #12 from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #13 from rguenther at suse dot de ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
>
> --- Comment #12 from amker at gcc dot gnu.org ---
> (In reply to rguent...@suse.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #12 from amker at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #11)
> On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
>
>
> I think the zeroing stmt can be distributed into a separate loop nest
> (up to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #11 from rguenther at suse dot de ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
>
> --- Comment #10 from amker at gcc dot gnu.org ---
> For the record, there is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #10 from amker at gcc dot gnu.org ---
For the record, there is another possible fix. Quoted loop nest from
gcc/testsuite/gfortran.dg/pr81303.f:
do j=1,ny
jm1=mod(j+ny-2,ny)+1
jp1=mod(j,ny)+1
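
For readers unfamiliar with the idiom in the quoted lines: the two mod expressions compute the previous and next 1-based index on a periodic grid of size ny. A small sketch of the same arithmetic follows; the function name periodic_neighbours and the value ny = 4 are chosen here for illustration and are not part of the test case.

```python
def periodic_neighbours(j, ny):
    """Return (jm1, jp1) for a 1-based index j on a periodic grid of size ny.

    Mirrors the Fortran expressions mod(j+ny-2,ny)+1 and mod(j,ny)+1.
    All operands are non-negative, so Python's % agrees with Fortran's mod.
    """
    jm1 = (j + ny - 2) % ny + 1  # j-1, wrapping 1 -> ny
    jp1 = j % ny + 1             # j+1, wrapping ny -> 1
    return jm1, jp1


ny = 4  # illustrative grid size
table = [(j, *periodic_neighbours(j, ny)) for j in range(1, ny + 1)]
for j, jm1, jp1 in table:
    print(j, jm1, jp1)
```

With ny = 4 the first row has jm1 = 4 and the last row has jp1 = 1, which is the wrap-around at the grid boundaries.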
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
Richard Biener changed:
What|Removed |Added
Status|UNCONFIRMED |NEW
Last reconfirmed|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #8 from amker at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #7)
> The #c5 approach sounds better to me, we can have memsets in the IL even
> from the user, so would be nice if we handled those in the dr analysis too.
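
As background for the memset discussion, here is a simplified sketch of what distributing a zeroing statement out of a loop nest means. It is not the bwaves code or GCC's implementation; the function names and the small matrix are made up for illustration. The first version zeroes and accumulates inside one nest, the second splits the zeroing into its own loop (the part a compiler can recognize as a memset) and leaves a perfect nest behind.

```python
def row_sums_fused(a):
    """Zeroing and accumulation share the outer loop (an imperfect nest)."""
    y = [None] * len(a)
    for i in range(len(a)):
        y[i] = 0.0                  # zeroing statement
        for j in range(len(a[i])):
            y[i] += a[i][j]
    return y


def row_sums_distributed(a):
    """Same computation after distributing the zeroing statement."""
    y = [None] * len(a)
    # Partition 1: zeroing only; this is the memset-like loop.
    for i in range(len(a)):
        y[i] = 0.0
    # Partition 2: accumulation only; now a perfect two-level nest.
    for i in range(len(a)):
        for j in range(len(a[i])):
            y[i] += a[i][j]
    return y


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # illustrative input
print(row_sums_fused(a), row_sums_distributed(a))
```

Both versions produce the same result. The point of the thread is that once the zeroing becomes a call, later dependence analysis has to understand that call for the surrounding loops to remain analyzable.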
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
Jakub Jelinek changed:
What|Removed |Added
CC||jakub at gcc dot gnu.org
--- Comment #7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #6 from Richard Biener ---
failed loop-distribution hack: (still needs dependence analysis fixes)
Could even preserve TBAA if we use a {} of correct element array type.
For constant sizes this should be always a win.
Index:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #5 from Richard Biener ---
Something like the following which only works up to the point of dependence analysis
trying to disambiguate this ref against others ... I suppose some
dr_may_alias_p tweaks to consider niter information and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #4 from Richard Biener ---
So it still parallelizes the loop(s) but at one level deeper (line 176 vs.
173).
This is because dependence analysis does not handle calls and loop distribution
distributed a memset. ISL dependence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #3 from rguenther at suse dot de ---
On Thu, 19 Oct 2017, amker at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
>
> --- Comment #2 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #2 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> I suppose loop distribution inserted a version copy turning this into a
> non-perfect nest for outer loops and thus disabling autopar there.
>
> What
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
Richard Biener changed:
What|Removed |Added
Keywords||missed-optimization