[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-13 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #37 from Evandro --- Here's what I had in mind: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01787.html Feedback is welcome.

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #36 from Evandro --- (In reply to Ramana Radhakrishnan from comment #35) > (In reply to Evandro from comment #32) > > Because of side effects of the Haifa scheduler, the loads now pile up, and > > the ADRPs may affect the load issue

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #30 from Evandro --- The performance impact of always referring to constants as if they were far away is significant on targets which do not fuse ADRP and LDR together. What's the status of the solution that evaluates the function

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #32 from Evandro --- (In reply to Ramana Radhakrishnan from comment #31) > (In reply to Evandro from comment #30) > > The performance impact of always referring to constants as if they were far > > away is significant on targets

[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #34 from Evandro --- (In reply to Wilco from comment #33) > (In reply to Evandro from comment #32) > ADRP latency to load-address should be zero on any OoO core - ADRP is > basically a move-immediate, so can execute early and hide

[Bug target/58623] lack of ldp/stp optimization

2014-12-15 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58623 --- Comment #6 from Evandro e.menezes at samsung dot com --- What's the PR of the fwprop issue? Thank you.

[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost

2014-10-31 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #20 from Evandro e.menezes at samsung dot com --- (In reply to Ramana Radhakrishnan from comment #19) To my mind it seems like 407 fmoves is just a bit too berserk and regardless of how efficient your core is, there is no point

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #21 from Evandro e.menezes at samsung dot com --- (In reply to ramana.radhakrish...@arm.com from comment #20) What's the kind of performance delta you see if you managed to unroll the loop just a wee bit ? Probably not much looking

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #23 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #22) Unrolling alone isn't good enough in sum reductions. As I mentioned before, GCC doesn't enable any of the useful loop optimizations by default

[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #11 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #9) The performance cost is a much bigger issue than codesize. The problem is that when register pressure is high, the register allocator decides

[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #12 from Evandro e.menezes at samsung dot com --- (In reply to Evandro from comment #11) Do you have an idea of the performance impact of this patch? At least in Dhrystone, it improved by over 2% on A57.

[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-10-24 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #14 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #10) Note currently it is not possible to use FP registers for spilling using the hooks - basically you still end up with int-fp moves for every

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #14 from Evandro Menezes e.menezes at samsung dot com --- Compiling the test-case above with just -O2, I can reproduce the code I mentioned initially and easily measure the cycle count to run it on target using perf. The binary

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #16 from Evandro e.menezes at samsung dot com --- (In reply to Wilco from comment #15) Using -Ofast is not any different from -O3 -ffast-math when compiling non-Fortran code. As comment 10 shows, both loops are vectorized, however

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #17 from Evandro e.menezes at samsung dot com --- Created attachment 33785 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit Simple matrix multiplication

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Evandro e.menezes at samsung dot com changed: What|Removed |Added Attachment #33774|0 |1

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #12 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33774 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit Simple test-case

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #8 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Ramana Radhakrishnan from comment #7) As Evandro doesn't mention flags it's hard to say whether there really is a problem here or not. Both GCC and LLVM were

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #9 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Wilco from comment #6) I ran the assembler examples on A57 hardware with identical input. The FMADD code is ~20% faster irrespectively of the size of the input

[Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: e.menezes at samsung dot com CC: spop at gcc dot gnu.org Target: aarch64-* Curious why Geekbench's {D,S}GEMM by GCC were 8-9% slower than by LLVM, I

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #3 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Andrew Pinski from comment #1) The other question here are there denormals happening? That might cause some performance differences between using fmadd and fmul

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #4 from Evandro Menezes e.menezes at samsung dot com --- Here's simplified code to reproduce these results: double sum(double *A, double *B, int n) { int i; double res = 0; for (i = 0; i < n; i++) res += A [i] * B [i
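
The reduction loop quoted in comment #4 is cut off by the archive. For readers who want to try a loop of the same shape, here is a minimal sketch. The end of the loop body and the return statement are assumptions, since the original is truncated, and the const qualifiers are added for illustration only. It is not the attached test-case.

```c
/* Sum reduction of the shape discussed in PR 63503: every iteration
   multiplies one element of A by one element of B and adds the product
   to a single accumulator.  A compiler may turn the multiply and add
   into a fused multiply-add (FMADD) on AArch64.
   Assumption: the truncated original ends by returning res. */
double sum(const double *A, const double *B, int n)
{
    double res = 0;

    for (int i = 0; i < n; i++)
        res += A[i] * B[i];

    return res;
}
```

Because every iteration depends on the previous value of res, the loop forms one serial dependency chain, which is the situation the thread is concerned with.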

[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #7 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Vladimir Makarov from comment #6) Evandro, thanks for reporting this. Sorry, I am busy with other thing these days. I'll start to work on this PR in September

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-06 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 Evandro Menezes e.menezes at samsung dot com changed: What|Removed |Added Status|WAITING |RESOLVED

[Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #5 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33249 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit Dhrystone, part 2 of 3 I first observed this issue when looking into Dhrystone

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #9 from Evandro Menezes e.menezes at samsung dot com --- It seems to me that it's the LRA which is forcing the use of FP registers, so, even if the patterns are fixed, I believe that in the end the combiner would just give up and ICE

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #11 from Evandro Menezes e.menezes at samsung dot com --- (In reply to ktkachov from comment #10) What we really need here is a preprocessed testcase showing the problem. It should be fairly easy to lock down on the problem

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #13 from Evandro Menezes e.menezes at samsung dot com --- Created attachment 33253 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33253&action=edit Test-case This test-case is a stripped-down version of Dhrystone, where the issue

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-05 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 Evandro Menezes e.menezes at samsung dot com changed: What|Removed |Added Attachment #33246|0 |1

[Bug target/62014] New: [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-04 Thread e.menezes at samsung dot com
Component: target Assignee: unassigned at gcc dot gnu.org Reporter: e.menezes at samsung dot com Created attachment 33245 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33245&action=edit This patch should fix this issue, though it needs a test-case. In some cases, when the LRA

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-04 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 --- Comment #2 from Evandro Menezes e.menezes at samsung dot com --- (In reply to Andrew Pinski from comment #1) + /* Do not spill into FP registers when -mgeneral-regs-only is specified. * You are missing a / in your comment. Ermahgerd!

[Bug target/62014] [AArch64] Using -mgeneral-regs-only may lead to ICE

2014-08-04 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62014 Evandro Menezes e.menezes at samsung dot com changed: What|Removed |Added Attachment #33245|0 |1

[Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size

2014-07-25 Thread e.menezes at samsung dot com
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: e.menezes at samsung dot com The issue that I observed in code size due to the default use of the LRA results in the spilling of the FP register used to spill variables into, which

[Bug target/61915] [AArch64] Default use of the LRA results in extra code size

2014-07-25 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915 --- Comment #3 from Evandro Menezes e.menezes at samsung dot com --- In Opteron, there was a path from FP to the GP registers, but not the other way around. That path was eventually made symmetric in Barcelona only.