[Bug target/82989] [7/8 regression] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread sudi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #18 from sudi at gcc dot gnu.org ---
Created bug 84467 to continue discussions about the early expand phase
decisions to choose or reject NEON operations

[Bug target/82989] [7/8 regression] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread sudi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #17 from sudi at gcc dot gnu.org ---
Since this looks like a pretty invasive problem, according to my discussions
with Wilco and Kyrill, I think I will try to propose a smaller, but temporary
fix using the ?s and special casing 32 for this PR (which could go in sooner).
I will also open a new PR to handle this at the expand phase and clean up the
code aimed at gcc 9.

[Bug target/82989] [7/8 regression] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #16 from Wilco  ---
(In reply to Matthijs van Duin from comment #15)
> (In reply to Wilco from comment #14)
> > Yes on older cores it can be a bad idea to allow accidental use of Neon
> > instructions. The simplest workaround is to switch off Neon, just use
> > -mfpu=vfp.
> 
> Sure, though that's not exactly ideal since it is very desirable for the
> Neon unit to be used for single-precision floats. (9-10 cycles for add/sub
> in VFP, 10-12 cycles for mul in VFP, 1 cycle for two add/sub/mul ops in Neon)

That's why it's a workaround until the bug is fixed - assuming the performance
critical functions use both 64-bit integer and 32-bit fp operations, you could
benchmark both options and choose the one that works best for now.

[Bug target/82989] [7/8 regression] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread matthijsvanduin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #15 from Matthijs van Duin  ---
(In reply to Wilco from comment #14)
> Yes on older cores it can be a bad idea to allow accidental use of Neon
> instructions. The simplest workaround is to switch off Neon, just use
> -mfpu=vfp.

Sure, though that's not exactly ideal since it is very desirable for the Neon
unit to be used for single-precision floats. (9-10 cycles for add/sub in VFP,
10-12 cycles for mul in VFP, 1 cycle for two add/sub/mul ops in Neon)

[Bug target/82989] [7/8 regression] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #14 from Wilco  ---
(In reply to Matthijs van Duin from comment #13)
> In case it's of interest, I did a quick benchmark of my testcase executed in
> a loop on a cortex-a8:
> 
> Without neon:
> 12 instructions/iteration
> 14 cycles/iteration
> 
> With neon:
> 14 instructions/iteration
> 35.2-35.3 cycles/iteration
> 
> (This includes 4 instructions for the loop itself.)
> 
> When using neon, the majority of the time is spent in a nasty pipeline stall
> for moving data from neon registers to arm registers, which takes a minimum
> of 20 cycles according to the cortex-a8 TRM.

Yes on older cores it can be a bad idea to allow accidental use of Neon
instructions. The simplest workaround is to switch off Neon, just use
-mfpu=vfp.

We probably also need to block the register allocator from spilling integer
registers to the FP register file as that would have the same stall (another
thing it really seems to insist on for some odd reason). There are AArch64
patches for this that could be ported to Arm.

[Bug target/82989] [7/8 regression] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread matthijsvanduin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #13 from Matthijs van Duin  ---
In case it's of interest, I did a quick benchmark of my testcase executed in a
loop on a cortex-a8:

Without neon:
12 instructions/iteration
14 cycles/iteration

With neon:
14 instructions/iteration
35.2-35.3 cycles/iteration

(This includes 4 instructions for the loop itself.)

When using neon, the majority of the time is spent in a nasty pipeline stall
for moving data from neon registers to arm registers, which takes a minimum of
20 cycles according to the cortex-a8 TRM.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #12 from Wilco  ---
(In reply to Jakub Jelinek from comment #11)
> From my completely ARM unaware POV, if NEON is available, it isn't a strict
> requirement that NEON must never be used for this, just a matter of
> preferences.  So perhaps one or two further ?s on the avoid_neon_for_64bits
> alternatives to let the RA know it should prefer the GPR alternative which
> already contains a single ? and therefore right now is equal preference to
> the avoid_neon_for_64bits.

The costing model for preferencing is extremely complex and suboptimal. There
are too many bugs in it, it all appears to be written for a CISC ISA (eg. it
assumes it can just replace any register operand with a MEM at no extra cost,
even if such a pattern doesn't exist).

So hacking in some more ?'s is never going to result in good code. The issue is
also made more complex by the fact that expanding 64-bit operations after
register allocation is an extremely bad idea - see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308.

So today the choice is between generating inefficient ARM code or inefficient
Neon code... 

By making this a hard choice early on (only generate ARM, or use Neon when
beneficial) we can actually generate high quality code. I've got patches that
give huge gains and still allow a user to prefer Neon instructions with
-mneon-for-64bits.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Jakub Jelinek  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek  ---
>From my completely ARM unaware POV, if NEON is available, it isn't a strict
requirement that NEON must never be used for this, just a matter of
preferences.  So perhaps one or two further ?s on the avoid_neon_for_64bits
alternatives to let the RA know it should prefer the GPR alternative which
already contains a single ? and therefore right now is equal preference to the
avoid_neon_for_64bits.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #10 from Wilco  ---
The only solution to this is to make a hard choice between Neon and integer
registers before register allocation.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-16 Thread sudi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

sudi at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-15 Thread matthijsvanduin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #9 from Matthijs van Duin  ---
I can still reproduce the same test case with:  arm-linux-gnueabihf-gcc-8
(Debian 8-20180207-2) 8.0.1 20180207 (experimental) [trunk revision 257435]


-mfloat-abi=hard is implicit for arm-linux-gnueabihf, although overriding that
to -mfloat-abi=softfp produces exactly the same result for me.

(-mfloat-abi=soft completely disables all use of VFP/Neon hence obviously
suppresses this problem.)

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-15 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #8 from Jakub Jelinek  ---
In both adddi3_neon and lshrdi3_neon the GPR alternatives use exactly the same
number of ?s as the NEON ones with avoid_neon_for_64bits, just the
neon_for_64bits alternatives don't have any.  So I don't see how it is a
preference of not using NEON, when TARGET_PREFER_NEON_64BITS is true, neon is
clearly preferred, otherwise there is no preferencing at all.  And indeed, for
DImode shifts by 32 the GPR regs should be prefered.  Anyway, leaving this to
ARM maintainers, don't know about the arch enough to do something.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-15 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #7 from Jakub Jelinek  ---
Indeed, with -march=armv7-a -mcpu=cortex-a8 -mfpu=neon -O2 pr82989.c 
-mfloat-abi=hard -mprint-tune-info
I can reproduce.  prefer_neon_for_64bits seems to be false, and LRA first says:
Choosing alt 4 in insn 7:  (0) ?  (1) r  (2) i  (3) X  (4) X  (5) X
{lshrdi3_neon}
which I assume would be with GPRs, but then decides to pick:
Choosing alt 6 in insn 7:  (0) ?w  (1) w  (2) i  (3) X  (4) X  (5) X
{lshrdi3_neon}
instead.  That alternative has avoid_neon_for_64bits arch, even when it uses
neon registers, just has ? to disparage it slightly.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-15 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #5)
> Just double-checked this on current trunk and I still can't reproduce this.
> Nor with r248000, r245631 nor 244212 I have around (all crosses).
> 
> If anyone can reproduce it, can you please have a look at it?

It needs an -mfloat-abi=hard to reproduce.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-02-15 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Jakub Jelinek  changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org,
   ||nickc at gcc dot gnu.org,
   ||rearnsha at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek  ---
Just double-checked this on current trunk and I still can't reproduce this.
Nor with r248000, r245631 nor 244212 I have around (all crosses).

If anyone can reproduce it, can you please have a look at it?

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2018-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|7.3 |7.4

--- Comment #4 from Richard Biener  ---
GCC 7.3 is being released, adjusting target milestone.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2017-12-20 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek  ---
Can't reproduce with current trunk (admittedly cross from x86_64-linux):
./cc1 -quiet -nostdinc -march=armv7-a -mcpu=cortex-a8 -mfpu=neon -O2 pr82989.c
ldrdr2, [r0]
mov r1, #0
addsr2, r3, r2
adc r3, r1, r3
strdr2, [r0]
bx  lr

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2017-12-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |7.3

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2017-11-23 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

Ramana Radhakrishnan  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-11-23
 CC||ramana at gcc dot gnu.org
Summary|Inexplicable use of NEON|[7/8 regression ]
   |for 64-bit math |Inexplicable use of NEON
   ||for 64-bit math
 Ever confirmed|0   |1
  Known to fail||7.2.1, 8.0

--- Comment #1 from Ramana Radhakrishnan  ---
Confirmed.

[Bug target/82989] [7/8 regression ] Inexplicable use of NEON for 64-bit math

2017-11-23 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #2 from Ramana Radhakrishnan  ---
Works as expected in GCC 6.