https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304
--- Comment #37 from Evandro ---
Here's what I had in mind:
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01787.html
Feedback is welcome.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304
--- Comment #36 from Evandro ---
(In reply to Ramana Radhakrishnan from comment #35)
> (In reply to Evandro from comment #32)
> > Because of side effects of the Haifa scheduler, the loads now pile up, and
> > the ADRPs may affect the load issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304
--- Comment #30 from Evandro ---
The performance impact of always referring to constants as if they were far
away is significant on targets which do not fuse ADRP and LDR together. What's
the status of the solution that evaluates the function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304
--- Comment #32 from Evandro ---
(In reply to Ramana Radhakrishnan from comment #31)
> (In reply to Evandro from comment #30)
> > The performance impact of always referring to constants as if they were far
> > away is significant on targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304
--- Comment #34 from Evandro ---
(In reply to Wilco from comment #33)
> (In reply to Evandro from comment #32)
> ADRP latency to load-address should be zero on any OoO core - ADRP is
> basically a move-immediate, so can execute early and hide
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58623
--- Comment #6 from Evandro e.menezes at samsung dot com ---
What's the PR of the fwprop issue?
Thank you.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #20 from Evandro e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #19)
To my mind it seems like 407 fmoves is just a bit too berserk and regardless
of how efficient your core is, there is no point
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #21 from Evandro e.menezes at samsung dot com ---
(In reply to ramana.radhakrish...@arm.com from comment #20)
What's the kind of performance delta you see if you managed to unroll
the loop just a wee bit ? Probably not much looking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #23 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #22)
Unrolling alone isn't good enough in sum reductions. As I mentioned before,
GCC doesn't enable any of the useful loop optimizations by default
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #11 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #9)
The performance cost is a much bigger issue than codesize. The problem is
that when register pressure is high, the register allocator decides
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #12 from Evandro e.menezes at samsung dot com ---
(In reply to Evandro from comment #11)
Do you have an idea of the performance impact of this patch?
At least in Dhrystone, it improved by over 2% on A57.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #14 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #10)
Note currently it is not possible to use FP registers for spilling using the
hooks - basically you still end up with int-fp moves for every
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #14 from Evandro Menezes e.menezes at samsung dot com ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on target
using perf.
The binary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #16 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #15)
Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized, however
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #17 from Evandro e.menezes at samsung dot com ---
Created attachment 33785
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit
Simple matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Evandro e.menezes at samsung dot com changed:
What|Removed |Added
Attachment #33774|0 |1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #12 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33774
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit
Simple test-case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #8 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #7)
As Evandro doesn't mention flags it's hard to say whether there really is a
problem here or not.
Both GCC and LLVM were
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Wilco from comment #6)
I ran the assembler examples on A57 hardware with identical input. The FMADD
code is ~20% faster irrespective of the size of the input
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: e.menezes at samsung dot com
CC: spop at gcc dot gnu.org
Target: aarch64-*
Curious why Geekbench's {D,S}GEMM by GCC were 8-9% slower than by LLVM, I
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
The other question here is: are there denormals happening? That might cause
some performance differences between using fmadd and fmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #4 from Evandro Menezes e.menezes at samsung dot com ---
Here's a simplified code to reproduce these results:
double sum(double *A, double *B, int n)
{
int i;
double res = 0;
for (i = 0; i < n; i++)
res += A [i] * B [i
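The listing cuts the snippet off mid-statement. As a minimal sketch only, assuming the truncated loop body just accumulates the product and the function returns `res` (everything past the truncation point is an assumption, not the text of the original comment), a self-contained version of such a sum reduction might look like:

```c
#include <stddef.h>

/* Dot-product style sum reduction, following the signature shown in the
   snippet. The closing of the loop body and the return statement are
   assumed, since the archived snippet is truncated. */
double sum(double *A, double *B, int n)
{
    int i;
    double res = 0;

    for (i = 0; i < n; i++)
        res += A[i] * B[i];   /* candidate for a fused multiply-add */

    return res;               /* assumed: not visible in the snippet */
}
```

Each iteration multiplies and then accumulates into `res`, which is the pattern where a compiler may choose between a fused multiply-add and a separate multiply and add.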
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #7 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Vladimir Makarov from comment #6)
Evandro, thanks for reporting this. Sorry, I am busy with other things these
days. I'll start to work on this PR in September
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
Evandro Menezes e.menezes at samsung dot com changed:
What|Removed |Added
Status|WAITING |RESOLVED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #5 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33249
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit
Dhrystone, part 2 of 3
I first observed this issue when looking into Dhrystone
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
It seems to me that it's the LRA which is forcing the use of FP registers, so,
even if the patterns are fixed, I believe that in the end the combiner would
just give up and ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
--- Comment #11 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to ktkachov from comment #10)
What we really need here is a preprocessed testcase showing the problem.
It should be fairly easy to lock down on the problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
--- Comment #13 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33253
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33253&action=edit
Test-case
This test-case is a stripped-down version of Dhrystone, where the issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
Evandro Menezes e.menezes at samsung dot com changed:
What|Removed |Added
Attachment #33246|0 |1
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: e.menezes at samsung dot com
Created attachment 33245
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33245&action=edit
This patch should fix this issue, though it needs a test-case.
In some cases, when the LRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
--- Comment #2 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
+ /* Do not spill into FP registers when -mgeneral-regs-only is
specified. *
You are missing a / in your comment.
Ermahgerd!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014
Evandro Menezes e.menezes at samsung dot com changed:
What|Removed |Added
Attachment #33245|0 |1
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: e.menezes at samsung dot com
The issue that I observed in code size due to the default use of the LRA
results in the spilling of the FP register used to spill variables into, which
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915
--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
In Opteron, there was a path from FP to the GP registers, but not the other way
around. That path was eventually made symmetric in Barcelona only.