[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2016-03-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Thomas Preud'homme changed:
What   | Removed | Added
Status | WAITING | RESOLVED
CC     |

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2015-04-28 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #25 from Thomas Preud'homme <thopre01 at gcc dot gnu.org> --- Author: thopre01 Date: Tue Apr 28 08:10:44 2015 New Revision: 222512 URL: https://gcc.gnu.org/viewcvs?rev=222512&root=gcc&view=rev Log: 2015-04-28 Thomas Preud'homme

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #21 from Evandro <e.menezes at samsung dot com> --- (In reply to ramana.radhakrish...@arm.com from comment #20) What's the kind of performance delta you see if you managed to unroll the loop just a wee bit? Probably not much looking

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #22 from Wilco <wdijkstr at arm dot com> --- (In reply to Evandro from comment #21) (In reply to ramana.radhakrish...@arm.com from comment #20) What's the kind of performance delta you see if you managed to unroll the loop just a

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #23 from Evandro <e.menezes at samsung dot com> --- (In reply to Wilco from comment #22) Unrolling alone isn't good enough in sum reductions. As I mentioned before, GCC doesn't enable any of the useful loop optimizations by default.
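The point about sum reductions can be sketched in C. In a plain dot-product loop every addition depends on the previous one, so unrolling the loop body without also splitting the accumulator leaves the same serial dependency chain. The sketch below is an illustration only, not code from the bug report: the function names `dot_single` and `dot_two_acc` are hypothetical, and the two-accumulator form reassociates the floating-point sum, which is why a compiler would normally need something like -ffast-math before performing this transformation itself.

```c
#include <stddef.h>

/* Baseline reduction: each iteration's add depends on the previous
   value of res, forming one serial dependency chain. */
double dot_single(const double *a, const double *b, size_t n)
{
    double res = 0.0;
    for (size_t i = 0; i < n; i++)
        res += a[i] * b[i];
    return res;
}

/* Hypothetical hand-unrolled variant: two independent accumulators,
   so consecutive multiply-adds do not wait on each other.
   Note: this changes the order of the additions, so results may
   differ in the last bits from dot_single for general inputs. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    double res0 = 0.0, res1 = 0.0;
    size_t i = 0;

    /* Main loop handles two elements per iteration. */
    for (; i + 1 < n; i += 2) {
        res0 += a[i] * b[i];
        res1 += a[i + 1] * b[i + 1];
    }

    /* Tail: one leftover element when n is odd. */
    if (i < n)
        res0 += a[i] * b[i];

    return res0 + res1;
}
```

Whether the second form is actually faster depends on the core and on how the compiler schedules it, so it would need measuring on the target.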

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #24 from Wilco <wdijkstr at arm dot com> --- (In reply to Evandro from comment #23) (In reply to Wilco from comment #22) Unrolling alone isn't good enough in sum reductions. As I mentioned before, GCC doesn't enable any of the

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-23 Thread ramana.radhakrishnan at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #20 from ramana.radhakrishnan at arm dot com <ramana.radhakrishnan at arm dot com> --- On 23/10/14 00:28, e.menezes at samsung dot com wrote: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #16 from Evandro e.menezes

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #13 from Wilco <wdijkstr at arm dot com> --- (In reply to Andrew Pinski from comment #11) (In reply to Wilco from comment #10) The loops shown are not the correct inner loops for those options - with -ffast-math they are

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #14 from Evandro Menezes <e.menezes at samsung dot com> --- Compiling the test-case above with just -O2, I can reproduce the code I mentioned initially and easily measure the cycle count to run it on target using perf. The binary

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #15 from Wilco <wdijkstr at arm dot com> --- (In reply to Evandro Menezes from comment #14) Compiling the test-case above with just -O2, I can reproduce the code I mentioned initially and easily measure the cycle count to run it on

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #16 from Evandro <e.menezes at samsung dot com> --- (In reply to Wilco from comment #15) Using -Ofast is not any different from -O3 -ffast-math when compiling non-Fortran code. As comment 10 shows, both loops are vectorized, however

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #17 from Evandro <e.menezes at samsung dot com> --- Created attachment 33785 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit Simple matrix multiplication

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Evandro <e.menezes at samsung dot com> changed:
What              | Removed | Added
Attachment #33774 | 0       | 1
is

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #19 from Wilco <wdijkstr at arm dot com> --- (In reply to Evandro from comment #16) (In reply to Wilco from comment #15) Using -Ofast is not any different from -O3 -ffast-math when compiling non-Fortran code. As comment 10 shows,

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #10 from Wilco <wdijkstr at arm dot com> --- The loops shown are not the correct inner loops for those options - with -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the question is why GCC doesn't unroll vectorized

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> --- (In reply to Wilco from comment #10) The loops shown are not the correct inner loops for those options - with -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #12 from Evandro Menezes <e.menezes at samsung dot com> --- Created attachment 33774 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit Simple test-case

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #8 from Evandro Menezes <e.menezes at samsung dot com> --- (In reply to Ramana Radhakrishnan from comment #7) As Evandro doesn't mention flags it's hard to say whether there really is a problem here or not. Both GCC and LLVM were

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #9 from Evandro Menezes <e.menezes at samsung dot com> --- (In reply to Wilco from comment #6) I ran the assembler examples on A57 hardware with identical input. The FMADD code is ~20% faster irrespective of the size of the input.

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Wilco <wdijkstr at arm dot com> changed:
What | Removed | Added
CC   |         | wdijkstr at arm dot com

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-10 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
What   | Removed     | Added
Status | UNCONFIRMED | WAITING

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> --- This might be true for A57 but for our chip (ThunderX), using fused multiply-add is better. The other question here: are there denormals happening? That might cause some

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> --- The other option is that it is the fusion of the cmp and branch which is causing the improvement. Can you manually edit the assembly and swap the cmp and fmadd in the GCC output and try

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #3 from Evandro Menezes <e.menezes at samsung dot com> --- (In reply to Andrew Pinski from comment #1) The other question here: are there denormals happening? That might cause some performance differences between using fmadd and

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #4 from Evandro Menezes <e.menezes at samsung dot com> --- Here's a simplified code to reproduce these results: double sum(double *A, double *B, int n) { int i; double res = 0; for (i = 0; i < n; i++) res += A [i] * B [i];

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> --- Also how sure are you that it is the fused multiply-add and not the scheduling of the instructions? As I mentioned, try swapping the cmp and fmadd; you might get a performance