https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Thomas Preud'homme changed:
What|Removed |Added
Status|WAITING |RESOLVED
CC|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #25 from Thomas Preud'homme thopre01 at gcc dot gnu.org ---
Author: thopre01
Date: Tue Apr 28 08:10:44 2015
New Revision: 222512
URL: https://gcc.gnu.org/viewcvs?rev=222512&root=gcc&view=rev
Log:
2015-04-28 Thomas Preud'homme
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #21 from Evandro e.menezes at samsung dot com ---
(In reply to ramana.radhakrish...@arm.com from comment #20)
What's the kind of performance delta you see if you managed to unroll
the loop just a wee bit? Probably not much looking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #22 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #21)
(In reply to ramana.radhakrish...@arm.com from comment #20)
What's the kind of performance delta you see if you managed to unroll
the loop just a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #23 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #22)
Unrolling alone isn't good enough in sum reductions. As I mentioned before,
GCC doesn't enable any of the useful loop optimizations by default.
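To illustrate the point about sum reductions: a plain unroll still funnels every addition through one accumulator, so the additions stay serialized. The following is a minimal C sketch, assuming a hypothetical helper sum_unroll2 that is not taken from the bug report, which splits the sum across two independent accumulators. Doing so changes the order of the floating-point additions, which is why a compiler may only make this transformation on its own under options such as -ffast-math.

```c
#include <assert.h>

/* Hypothetical helper (not from the bug report): dot product unrolled 2x
   with two independent accumulators, so consecutive multiply-adds do not
   have to wait on each other's result. */
double sum_unroll2(const double *A, const double *B, int n)
{
    assert(n >= 0);

    double acc0 = 0.0, acc1 = 0.0;
    int i = 0;

    /* Main loop: two elements per iteration, one per accumulator. */
    for (; i + 1 < n; i += 2) {
        acc0 += A[i] * B[i];
        acc1 += A[i + 1] * B[i + 1];
    }

    /* Tail: handle the last element when n is odd. */
    if (i < n)
        acc0 += A[i] * B[i];

    /* Combine the partial sums once, after the loop. */
    return acc0 + acc1;
}
```

When every partial sum is exactly representable, this returns the same value as the straightforward loop; in general the two can differ in the last bits because the additions are reassociated.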
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #24 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #23)
(In reply to Wilco from comment #22)
Unrolling alone isn't good enough in sum reductions. As I mentioned before,
GCC doesn't enable any of the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #20 from ramana.radhakrishnan at arm dot com ---
On 23/10/14 00:28, e.menezes at samsung dot com wrote:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #16 from Evandro e.menezes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #13 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #11)
(In reply to Wilco from comment #10)
The loops shown are not the correct inner loops for those options - with
-ffast-math they are
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #14 from Evandro Menezes e.menezes at samsung dot com ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on target
using perf.
The binary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #15 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro Menezes from comment #14)
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #16 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #15)
Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized, however
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #17 from Evandro e.menezes at samsung dot com ---
Created attachment 33785
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit
Simple matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Evandro e.menezes at samsung dot com changed:
What|Removed |Added
Attachment #33774 is|0 |1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #19 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #16)
(In reply to Wilco from comment #15)
Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #10 from Wilco wdijkstr at arm dot com ---
The loops shown are not the correct inner loops for those options - with
-ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
question is why GCC doesn't unroll vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #11 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #10)
The loops shown are not the correct inner loops for those options - with
-ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #12 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33774
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit
Simple test-case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #8 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #7)
As Evandro doesn't mention flags it's hard to say whether there really is a
problem here or not.
Both GCC and LLVM were
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Wilco from comment #6)
I ran the assembler examples on A57 hardware with identical input. The FMADD
code is ~20% faster irrespective of the size of the input.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Wilco wdijkstr at arm dot com changed:
What|Removed |Added
CC||wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Ramana Radhakrishnan ramana at gcc dot gnu.org changed:
What|Removed |Added
Status|UNCONFIRMED |WAITING
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #1 from Andrew Pinski pinskia at gcc dot gnu.org ---
This might be true for A57 but for our chip (ThunderX), using fused
multiply-add is better.
The other question here: are there denormals happening? That might cause some
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #2 from Andrew Pinski pinskia at gcc dot gnu.org ---
The other option is that the fusion of the cmp and branch is what is causing
the improvement.
Can you manually edit the assembly and swap the cmp and fmadd in the GCC output
and try
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
The other question here: are there denormals happening? That might cause
some performance differences between using fmadd and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #4 from Evandro Menezes e.menezes at samsung dot com ---
Here's a simplified code to reproduce these results:
double sum(double *A, double *B, int n)
{
  int i;
  double res = 0;
  for (i = 0; i < n; i++)
    res += A[i] * B[i];
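For reference, a complete, compilable version of this reduction might look like the sketch below. The return statement, the closing brace and the assert are assumptions filled in for illustration, since the quoted snippet is cut off before the end of the function.

```c
#include <assert.h>

/* Dot-product reduction following the snippet from comment #4.
   Everything after the loop body is assumed, not quoted. */
double sum(double *A, double *B, int n)
{
    assert(n >= 0);   /* added precondition, not in the original */

    int i;
    double res = 0;

    /* Each iteration multiplies one pair and adds it to the running sum;
       this multiply-then-add is what a compiler can map to a fused
       multiply-add instruction. */
    for (i = 0; i < n; i++)
        res += A[i] * B[i];

    return res;
}
```

Called with A = {1, 2, 3}, B = {4, 5, 6} and n = 3, the function returns 32.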
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #5 from Andrew Pinski pinskia at gcc dot gnu.org ---
Also how sure are you that it is the fused multiply-add and not the scheduling
of the instructions? As I mentioned, try swapping the cmp and fmadd; you might
get a performance