[Bug target/77308] surprisingly large stack usage for sha512 on arm

2019-08-23 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Wilco  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #68 from Wilco  ---
Now also fixed when Neon is enabled (r274823, r274824, r274825). Softfp, vfp
and neon all generate similar instruction counts and stack size, all below 300
bytes with -O3.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2018-11-19 Thread edlinger at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Bernd Edlinger  changed:

   What|Removed |Added

  Known to work||8.1.0

--- Comment #67 from Bernd Edlinger  ---
Okay, with the patches I installed for gcc-8
the stack usage is at least a bit less "surprising" than before.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2018-11-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Martin Liška  changed:

   What|Removed |Added

 CC||marxin at gcc dot gnu.org

--- Comment #66 from Martin Liška  ---
Bernd: Can you please update Known to work?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2017-09-06 Thread edlinger at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #65 from Bernd Edlinger  ---
Author: edlinger
Date: Wed Sep  6 07:47:52 2017
New Revision: 251752

URL: https://gcc.gnu.org/viewcvs?rev=251752&root=gcc&view=rev
Log:
2017-09-06  Bernd Edlinger  

PR target/77308
* config/arm/predicates.md (arm_general_adddi_operand): Create new
non-vfp predicate.
* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/arm/arm.md
trunk/gcc/config/arm/predicates.md

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2017-09-04 Thread edlinger at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #64 from Bernd Edlinger  ---
Author: edlinger
Date: Mon Sep  4 15:25:59 2017
New Revision: 251663

URL: https://gcc.gnu.org/viewcvs?rev=251663&root=gcc&view=rev
Log:
2017-09-04  Bernd Edlinger  

PR target/77308
* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for
TARGET_NEON and TARGET_IWMMXT.
(anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for
TARGET_NEON and TARGET_IWMMXT.
(*one_cmpldi2_insn): Moved the body of one_cmpldi2 here.

testsuite:
2017-09-04  Bernd Edlinger  

PR target/77308
* gcc.target/arm/pr77308-1.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/arm/pr77308-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/arm/arm.md
trunk/gcc/testsuite/ChangeLog

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-17 Thread edlinger at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #63 from Bernd Edlinger  ---
Author: edlinger
Date: Thu Nov 17 13:47:24 2016
New Revision: 242549

URL: https://gcc.gnu.org/viewcvs?rev=242549&root=gcc&view=rev
Log:
2016-11-17  Bernd Edlinger  

PR target/77308
* config/arm/arm.md (*thumb2_ldrd, *thumb2_ldrd_base,
*thumb2_ldrd_base_neg, *thumb2_strd, *thumb2_strd_base,
*thumb2_strd_base_neg): Recognize insn regardless of
current_tune->prefer_ldrd_strd.
* config/arm/ldrdstrd.md: Enable all ldrd/strd peephole rules
whenever possible.

testsuite:
2016-11-17  Bernd Edlinger  

PR target/77308
* gcc.target/arm/pr53447-5.c: New test.
* lib/target-supports.exp
(check_effective_target_arm_prefer_ldrd_strd): Adjust.

Added:
trunk/gcc/testsuite/gcc.target/arm/pr53447-5.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/arm/arm.md
trunk/gcc/config/arm/ldrdstrd.md
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/lib/target-supports.exp

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-09 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #62 from Bernd Edlinger  ---
Both parts of the patch are now posted for review:

https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00523.html

https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00830.html

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #61 from Bernd Edlinger  ---
Created attachment 39958
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39958&action=edit
patch for enabling ldrdstrd peephole

And this is what I will bootstrap in the next cycle.

It will enable all ldrd/strd peephole rules, as suggested
by Wilco.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Bernd Edlinger  changed:

   What|Removed |Added

  Attachment #39940 is obsolete|0   |1

--- Comment #60 from Bernd Edlinger  ---
Created attachment 39957
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39957&action=edit
patch for di-mode early splitting

This is the di-mode early splitting patch, which is the same as the
previous one with the ldrdstrd part removed.

So I am currently bootstrapping with this patch.

It tries to reduce the stack usage with -msoft-float
to 272 bytes, and does nothing in thumb1 and vfp/neon modes.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #59 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #58)
> (In reply to wilco from comment #57)
> > (In reply to Bernd Edlinger from comment #56)
> > > Agreed, I can split the patch.
> > > 
> > > From what I understand, we should never emit ldrd/strd out of
> > > the memmovdi2 pattern when optimizing for speed and disable
> > > the peephole in the way I proposed it in the patch.
> > 
> > No that's incorrect. Not generating LDRD when optimizing for speed means a
> > slowdown on most cores, so it is essential we keep generating LDRD whenever
> > possible.
> 
> But if that is true, the current setting of prefer_lrdr_strd is wrong
> in most cores, and should be fixed.

The meaning is really: "prefer using ldrd/strd over ldm/stm in function
prolog/epilog and inlined memcpy". So it says something about performance of
large LDMs vs multiple LDRDs, rather than about performance of a single LDRD vs
2x LDR (basically LDRD doubles available memory bandwidth so is pretty much
always a good idea). And yes I wouldn't be surprised if the setting is
non-optimal for some cores.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #57 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #56)
> (In reply to wilco from comment #55)
> > (In reply to Bernd Edlinger from comment #39)
> > > Created attachment 39940 [details]
> > > proposed patch, v2
> > > 
> > > last upload was accidentally truncated.
> > > uploaded the right patch.
> > 
> > Right so looking at your patch, I think we should make the LDRD peephole
> > change in a separate patch. I tried your foo example on all combinations of
> > ARM, Thumb-2, VFP, NEON on various CPUs with both settings of
> > prefer_ldrd_strd.
> > 
> > In all cases the current GCC generates LDRD/STRD, even for zero offsets.
> > CPUs where prefer_ldrd_strd=false emit LDR/STR for the shifts with
> > -msoft-float or -mfpu=vfp (but not -mfpu=neon). This is clearly incorrect
> > given that LDRD/STRD is used in all other cases, and prefer_ldrd_strd seems
> > to imply whether to prefer using LDRD/STRD in prolog/epilog and inlined
> > memcpy.
> > 
> > So that means we should remove the odd checks for codesize and
> > current_tune->prefer_ldrd_strd from all the peepholes.
> 
> Agreed, I can split the patch.
> 
> From what I understand, we should never emit ldrd/strd out of
> the memmovdi2 pattern when optimizing for speed and disable
> the peephole in the way I proposed it in the patch.

No that's incorrect. Not generating LDRD when optimizing for speed means a
slowdown on most cores, so it is essential we keep generating LDRD whenever
possible.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #58 from Bernd Edlinger  ---
(In reply to wilco from comment #57)
> (In reply to Bernd Edlinger from comment #56)
> > Agreed, I can split the patch.
> > 
> > From what I understand, we should never emit ldrd/strd out of
> > the memmovdi2 pattern when optimizing for speed and disable
> > the peephole in the way I proposed it in the patch.
> 
> No that's incorrect. Not generating LDRD when optimizing for speed means a
> slowdown on most cores, so it is essential we keep generating LDRD whenever
> possible.

But if that is true, the current setting of prefer_lrdr_strd is wrong
in most cores, and should be fixed.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #56 from Bernd Edlinger  ---
(In reply to wilco from comment #55)
> (In reply to Bernd Edlinger from comment #39)
> > Created attachment 39940 [details]
> > proposed patch, v2
> > 
> > last upload was accidentally truncated.
> > uploaded the right patch.
> 
> Right so looking at your patch, I think we should make the LDRD peephole
> change in a separate patch. I tried your foo example on all combinations of
> ARM, Thumb-2, VFP, NEON on various CPUs with both settings of
> prefer_ldrd_strd.
> 
> In all cases the current GCC generates LDRD/STRD, even for zero offsets.
> CPUs where prefer_ldrd_strd=false emit LDR/STR for the shifts with
> -msoft-float or -mfpu=vfp (but not -mfpu=neon). This is clearly incorrect
> given that LDRD/STRD is used in all other cases, and prefer_ldrd_strd seems
> to imply whether to prefer using LDRD/STRD in prolog/epilog and inlined
> memcpy.
> 
> So that means we should remove the odd checks for codesize and
> current_tune->prefer_ldrd_strd from all the peepholes.

Agreed, I can split the patch.

From what I understand, we should never emit ldrd/strd out of
the memmovdi2 pattern when optimizing for speed and disable
the peephole in the way I proposed it in the patch.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #55 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #39)
> Created attachment 39940 [details]
> proposed patch, v2
> 
> last upload was accidentally truncated.
> uploaded the right patch.

Right so looking at your patch, I think we should make the LDRD peephole change
in a separate patch. I tried your foo example on all combinations of ARM,
Thumb-2, VFP, NEON on various CPUs with both settings of prefer_ldrd_strd.

In all cases the current GCC generates LDRD/STRD, even for zero offsets. CPUs
where prefer_ldrd_strd=false emit LDR/STR for the shifts with -msoft-float or
-mfpu=vfp (but not -mfpu=neon). This is clearly incorrect given that LDRD/STRD
is used in all other cases, and prefer_ldrd_strd seems to imply whether to
prefer using LDRD/STRD in prolog/epilog and inlined memcpy.

So that means we should remove the odd checks for codesize and
current_tune->prefer_ldrd_strd from all the peepholes.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #54 from Bernd Edlinger  ---
(In reply to richard.earnshaw from comment #53)
> On 02/11/16 11:57, bernd.edlinger at hotmail dot de wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
> > 
> > --- Comment #52 from Bernd Edlinger  ---
> > (In reply to wilco from comment #51)
> >>
> >> Indeed, that's the reason behind the existing check. However it disables 
> >> all
> >> profitable bswap cases while still generating unaligned accesses if no 
> >> bswap
> >> is needed. So I am looking for a callback that gives the correct answer. It
> >> would need to check -mno-unaligned-access and the target capabilities (eg.
> >> if unaligned accesses are supported in hardware but really expensive we 
> >> want
> >> to avoid them).
> > 
> > Yes.  I think ARM is becoming a non-strict-alignment platform.
> > While x86_64 is moving in the opposite direction.
> 
> It can never be a non-strict alignment platform while some memory access
> instructions do not support unaligned accesses.
> 
> However, it is progressively becoming a less slow unaligned access platform.
> 


But isn't that exactly the same situation for x86_64:
Most instructions support unaligned memory accesses,
and a few data types need a movmisalign_optab ?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread richard.earnshaw at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #53 from richard.earnshaw at arm dot com ---
On 02/11/16 11:57, bernd.edlinger at hotmail dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
> 
> --- Comment #52 from Bernd Edlinger  ---
> (In reply to wilco from comment #51)
>>
>> Indeed, that's the reason behind the existing check. However it disables all
>> profitable bswap cases while still generating unaligned accesses if no bswap
>> is needed. So I am looking for a callback that gives the correct answer. It
>> would need to check -mno-unaligned-access and the target capabilities (eg.
>> if unaligned accesses are supported in hardware but really expensive we want
>> to avoid them).
> 
> Yes.  I think ARM is becoming a non-strict-alignment platform.
> While x86_64 is moving in the opposite direction.

It can never be a non-strict alignment platform while some memory access
instructions do not support unaligned accesses.

However, it is progressively becoming a less slow unaligned access platform.

R.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #52 from Bernd Edlinger  ---
(In reply to wilco from comment #51)
> 
> Indeed, that's the reason behind the existing check. However it disables all
> profitable bswap cases while still generating unaligned accesses if no bswap
> is needed. So I am looking for a callback that gives the correct answer. It
> would need to check -mno-unaligned-access and the target capabilities (eg.
> if unaligned accesses are supported in hardware but really expensive we want
> to avoid them).

Yes.  I think ARM is becoming a non-strict-alignment platform.
While x86_64 is moving in the opposite direction.

Would it be possible to make STRICT_ALIGNMENT switchable,
like in the rs6000 port?  In that case you would also have more
flexibility in the handling of the SLOW_UNALIGNED_ACCESS macro.
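
As a hypothetical sketch of what such a switchable definition might look
like (these are not the actual arm.h definitions; unaligned_access is the
variable behind -m[no-]unaligned-access):

/* Hypothetical, rs6000-style: let strictness follow the command line.  */
#define STRICT_ALIGNMENT (!unaligned_access)

/* Hypothetical: report an access as slow only when the hardware
   cannot do it unaligned.  */
#define SLOW_UNALIGNED_ACCESS(MODE, ALIGN) \
  (!unaligned_access && (ALIGN) < GET_MODE_ALIGNMENT (MODE))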

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #51 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #49)
> (In reply to Bernd Edlinger from comment #48)
> > (In reply to wilco from comment #22)
> > > 
> > > Anyway, there is another bug: on AArch64 we correctly recognize there are 
> > > 8
> > > 1-byte loads, shifts and orrs which can be replaced by a single 8-byte 
> > > load
> > > and a byte reverse. Although it is recognized on ARM and works correctly 
> > > if
> > > it is a little endian load, it doesn't perform the optimization if a byte
> > > reverse is needed. As a result there are lots of 64-bit shifts and orrs
> > > which create huge register pressure if not expanded early.
> > 
> > Hmm...
> > 
> > I think the test case does something invalid here:
> > 
> > const SHA_LONG64 *W = in;
> > 
> > T1 = X[0] = PULL64(W[0]);
> > 
> > 
> > in is not aligned, but it is cast to a 8-byte aligned type.
> > 
> > If the bswap pass assumes with your proposed patch
> > it is OK to merge 4 byte accesses into an aligned word access,
> > it may likely break openssl on -mno-unaligned targets.
> > Even on our cortex-a9 the O/S will trap on unaligned accesses.
> > I have checked that openssl still works on arm-none-eabi 
> > with my patch, but I am not sure about your patch.
> 
> I tried it out.  Although the code is bogus the code generation
> does not use the wrong alignment.
> 
> With -mno-unaligned-access the ldr is split out into 4 ldrb and
> the result is fed into the rev.
>
> At least in this configuration that is not profitable though.

Indeed, that's the reason behind the existing check. However it disables all
profitable bswap cases while still generating unaligned accesses if no bswap is
needed. So I am looking for a callback that gives the correct answer. It would
need to check -mno-unaligned-access and the target capabilities (eg. if
unaligned accesses are supported in hardware but really expensive we want to
avoid them).

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #50 from Richard Earnshaw  ---
(In reply to wilco from comment #47)
> (In reply to Richard Earnshaw from comment #46)
> > (In reply to wilco from comment #44)
> > > (In reply to Bernd Edlinger from comment #38)
> > > > Created attachment 39939 [details]
> > > > proposed patch, v2
> > > > 
> > > 
> > > > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > > > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> > > 
> > > We can split into a new pattern that contains adds/adc together. Splitting
> > > should help Thumb-1 the most as it has just 3 allocatable DI mode
> > > registers...
> > 
> > Not on Thumb-1 we can't.  Because of register allocation limitations, we
> > cannot expose the flags until after register allocation has completed. 
> > (Since the register allocator needs to be able to insert loads, adds and
> > copy instructions between any two insns.  The add and copy instructions
> > clobber the flags, making early splitting impossible.
> 
> What I meant is splitting into a single new instruction using SI mode
> registers rather than DI mode registers so that register allocation is more
> efficient.

You couldn't do that before combine, since the pattern would have to describe
setting both 'result' registers independently.  That would create a pattern
that combine couldn't handle (more than one non-flag result) and so that in
turn would stop the compiler being able to optimize such a pattern properly. 
Note the pattern would probably end up looking like a parallel that set high
and low parts to the result of the 64-bit operation.

It might help to rewrite the pattern that way after combine, but before
register allocation.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #49 from Bernd Edlinger  ---
(In reply to Bernd Edlinger from comment #48)
> (In reply to wilco from comment #22)
> > 
> > Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> > 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> > and a byte reverse. Although it is recognized on ARM and works correctly if
> > it is a little endian load, it doesn't perform the optimization if a byte
> > reverse is needed. As a result there are lots of 64-bit shifts and orrs
> > which create huge register pressure if not expanded early.
> 
> Hmm...
> 
> I think the test case does something invalid here:
> 
> const SHA_LONG64 *W = in;
> 
> T1 = X[0] = PULL64(W[0]);
> 
> 
> in is not aligned, but it is cast to a 8-byte aligned type.
> 
> If the bswap pass assumes with your proposed patch
> it is OK to merge 4 byte accesses into an aligned word access,
> it may likely break openssl on -mno-unaligned targets.
> Even on our cortex-a9 the O/S will trap on unaligned accesses.
> I have checked that openssl still works on arm-none-eabi 
> with my patch, but I am not sure about your patch.

I tried it out.  Although the code is bogus the code generation
does not use the wrong alignment.

With -mno-unaligned-access the ldr is split out into 4 ldrb and
the result is fed into the rev.

At least in this configuration that is not profitable though.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-02 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #48 from Bernd Edlinger  ---
(In reply to wilco from comment #22)
> 
> Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> and a byte reverse. Although it is recognized on ARM and works correctly if
> it is a little endian load, it doesn't perform the optimization if a byte
> reverse is needed. As a result there are lots of 64-bit shifts and orrs
> which create huge register pressure if not expanded early.

Hmm...

I think the test case does something invalid here:

const SHA_LONG64 *W = in;

T1 = X[0] = PULL64(W[0]);


in is not aligned, but it is cast to a 8-byte aligned type.

If the bswap pass assumes with your proposed patch
it is OK to merge 4 byte accesses into an aligned word access,
it may likely break openssl on -mno-unaligned targets.
Even on our cortex-a9 the O/S will trap on unaligned accesses.
I have checked that openssl still works on arm-none-eabi 
with my patch, but I am not sure about your patch.
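
For reference, a self-contained sketch of the shape of the OpenSSL code in
question (assumed, following sha512.c; pull64 stands in for the real
PULL64 macro):

typedef unsigned long long SHA_LONG64;

/* Stand-in for OpenSSL's PULL64: eight byte loads combined with
   shifts and ors into a big-endian 64-bit value.  */
static SHA_LONG64 pull64 (const unsigned char *p)
{
  SHA_LONG64 r = 0;
  for (int i = 0; i < 8; i++)
    r = (r << 8) | p[i];
  return r;
}

SHA_LONG64 first_word (const void *in)
{
  /* 'in' carries no alignment guarantee, yet the cast claims 8-byte
     alignment, so merging the byte loads into one aligned 64-bit
     load would trap on a strict-alignment target.  */
  const SHA_LONG64 *W = (const SHA_LONG64 *) in;
  return pull64 ((const unsigned char *) &W[0]);
}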

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #47 from wilco at gcc dot gnu.org ---
(In reply to Richard Earnshaw from comment #46)
> (In reply to wilco from comment #44)
> > (In reply to Bernd Edlinger from comment #38)
> > > Created attachment 39939 [details]
> > > proposed patch, v2
> > > 
> > 
> > > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> > 
> > We can split into a new pattern that contains adds/adc together. Splitting
> > should help Thumb-1 the most as it has just 3 allocatable DI mode
> > registers...
> 
> Not on Thumb-1 we can't.  Because of register allocation limitations, we
> cannot expose the flags until after register allocation has completed. 
> (Since the register allocator needs to be able to insert loads, adds and
> copy instructions between any two insns.  The add and copy instructions
> clobber the flags, making early splitting impossible.

What I meant is splitting into a single new instruction using SI mode registers
rather than DI mode registers so that register allocation is more efficient.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #45 from Bernd Edlinger  ---
(In reply to wilco from comment #44)
> (In reply to Bernd Edlinger from comment #38)
> > Created attachment 39939 [details]
> > proposed patch, v2
> > 
> 
> > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> 
> We can split into a new pattern that contains adds/adc together. Splitting
> should help Thumb-1 the most as it has just 3 allocatable DI mode
> registers...

But we need to split the adds and the adc into two separate
pattern, then it can happen that the adc instruction's result
is unused, and that propagates to the inputs.

But since I read this comment in thumb1.md I have doubts:

;; Beware of splitting Thumb1 patterns that output multiple
;; assembly instructions, in particular instruction such as SBC and
;; ADC which consume flags.  For example, in the pattern thumb_subdi3
;; below, the output SUB implicitly sets the flags (assembled to SUBS)
;; and then the Carry flag is used by SBC to compute the correct
;; result.  If we split thumb_subdi3 pattern into two separate RTL
;; insns (using define_insn_and_split), the scheduler might place
;; other RTL insns between SUB and SBC, possibly modifying the Carry
;; flag used by SBC.  This might happen because most Thumb1 patterns
;; for flag-setting instructions do not have explicit RTL for setting
;; or clobbering the flags.  Instead, they have the attribute "conds"
;; with value "set" or "clob".  However, this attribute is not used to
;; identify dependencies and therefore the scheduler might reorder
;; these instructions.  Currently, this problem cannot happen because
;; there are no separate Thumb1 patterns for individual instruction
;; that consume flags (except conditional execution, which is treated
;; differently).  In particular there is no Thumb1 armv6-m pattern for
;; sbc or adc.


Disabling the adddi3 pattern worked with control flow instead of
passing the Carry flag from the adds to the adc pattern.

In the sha512 test case that was still profitable, but I think
that will not be the case in general.
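
To illustrate the hazard the thumb1.md comment describes, here is a sketch
in C with inline assembly (hypothetical illustration, ARM/Thumb-2 syntax,
not the actual pattern; Thumb-1 would spell the second insn adcs):

void add64 (unsigned long long *p, unsigned long long x)
{
  unsigned int lo = (unsigned int) *p;
  unsigned int hi = (unsigned int) (*p >> 32);
  /* Keeping adds/adc inside one asm keeps the carry alive; as two
     separate RTL insns the scheduler could move a flag-clobbering
     insn between them and corrupt the result.  */
  __asm__ ("adds %0, %0, %2\n\t"
           "adc %1, %1, %3"
           : "+r" (lo), "+r" (hi)
           : "r" ((unsigned int) x), "r" ((unsigned int) (x >> 32))
           : "cc");
  *p = ((unsigned long long) hi << 32) | lo;
}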

I can live with the state of thumb1 for the moment.

I am more interested in early expansion of di patterns
in vfp / avoid_neon_for_64bits and so on.

Maybe if the user explicitly wants neon_for_64bits, so be it.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #46 from Richard Earnshaw  ---
(In reply to wilco from comment #44)
> (In reply to Bernd Edlinger from comment #38)
> > Created attachment 39939 [details]
> > proposed patch, v2
> > 
> 
> > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> 
> We can split into a new pattern that contains adds/adc together. Splitting
> should help Thumb-1 the most as it has just 3 allocatable DI mode
> registers...

Not on Thumb-1 we can't.  Because of register allocation limitations, we cannot
expose the flags until after register allocation has completed.  (Since the
register allocator needs to be able to insert loads, adds and copy instructions
between any two insns.  The add and copy instructions clobber the flags, making
early splitting impossible.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #44 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #38)
> Created attachment 39939 [details]
> proposed patch, v2
> 

> Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> because thumb1 cannot split the adddi3 pattern, once it is emitted.

We can split into a new pattern that contains adds/adc together. Splitting
should help Thumb-1 the most as it has just 3 allocatable DI mode registers...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #43 from Bernd Edlinger  ---
(In reply to wilco from comment #41)
> 
> ARM only uses the 2nd alternative (set_attr "arch" "any,a,t2,t2"), so this
> is correct. There is no need to support this pattern for ARM as ARM doesn't
> have ORN, and once we expand early the whole pattern becomes redundant.

Oh I see.  Thanks for clarifying that.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #42 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #40)
> BTW: I found something strange in this pattern in neon.md:
> 
> (define_insn_and_split "orndi3_neon"
>   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r")
> (ior:DI (not:DI (match_operand:DI 2 "s_register_operand" "w,0,0,r"))
> (match_operand:DI 1 "s_register_operand" "w,r,r,0")))]

Also it would be easy to support ",r,0" by doing op0 = ~(op0 = op2 & ~op0) so
there was no need to have ARM and Thumb-2 specific alternatives...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #41 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #40)
> BTW: I found something strange in this pattern in neon.md:
> 
> (define_insn_and_split "orndi3_neon"
>   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r")
> (ior:DI (not:DI (match_operand:DI 2 "s_register_operand" "w,0,0,r"))
> (match_operand:DI 1 "s_register_operand" "w,r,r,0")))]
>   "TARGET_NEON"
>   "@
>vorn\t%P0, %P1, %P2
>#
>#
>#"
>   "reload_completed && 
>(TARGET_NEON && !(IS_VFP_REGNUM (REGNO (operands[0]"
>   [(set (match_dup 0) (ior:SI (not:SI (match_dup 2)) (match_dup 1)))
>(set (match_dup 3) (ior:SI (not:SI (match_dup 4)) (match_dup 5)))]
>   "
>   {
> if (TARGET_THUMB2)
>   {
> operands[3] = gen_highpart (SImode, operands[0]);
> operands[0] = gen_lowpart (SImode, operands[0]);
> operands[4] = gen_highpart (SImode, operands[2]);
> operands[2] = gen_lowpart (SImode, operands[2]);
> operands[5] = gen_highpart (SImode, operands[1]);
> operands[1] = gen_lowpart (SImode, operands[1]);
>   }
> else
>   {
> emit_insn (gen_one_cmpldi2 (operands[0], operands[2]));
> emit_insn (gen_iordi3 (operands[0], operands[1], operands[0]));
> DONE;
>   }
>   }"
>   [(set_attr "type" "neon_logic,multiple,multiple,multiple")
>(set_attr "length" "*,16,8,8")
>(set_attr "arch" "any,a,t2,t2")]
> )
> 
> 
> I think in alternative #4 we have operands[0] == operands[1]
> and operands[2] != operands[0]
> 
> and then gen_one_cmpldi2 (operands[0], operands[2])
> will overwrite the operand[1] before it is used in
> gen_iordi3 (operands[0], operands[1], operands[0]) ??

ARM only uses the 2nd alternative (set_attr "arch" "any,a,t2,t2"), so this is
correct. There is no need to support this pattern for ARM as ARM doesn't have
ORN, and once we expand early the whole pattern becomes redundant.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #40 from Bernd Edlinger  ---
BTW: I found something strange in this pattern in neon.md:

(define_insn_and_split "orndi3_neon"
  [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r")
(ior:DI (not:DI (match_operand:DI 2 "s_register_operand" "w,0,0,r"))
(match_operand:DI 1 "s_register_operand" "w,r,r,0")))]
  "TARGET_NEON"
  "@
   vorn\t%P0, %P1, %P2
   #
   #
   #"
  "reload_completed && 
   (TARGET_NEON && !(IS_VFP_REGNUM (REGNO (operands[0]"
  [(set (match_dup 0) (ior:SI (not:SI (match_dup 2)) (match_dup 1)))
   (set (match_dup 3) (ior:SI (not:SI (match_dup 4)) (match_dup 5)))]
  "
  {
if (TARGET_THUMB2)
  {
operands[3] = gen_highpart (SImode, operands[0]);
operands[0] = gen_lowpart (SImode, operands[0]);
operands[4] = gen_highpart (SImode, operands[2]);
operands[2] = gen_lowpart (SImode, operands[2]);
operands[5] = gen_highpart (SImode, operands[1]);
operands[1] = gen_lowpart (SImode, operands[1]);
  }
else
  {
emit_insn (gen_one_cmpldi2 (operands[0], operands[2]));
emit_insn (gen_iordi3 (operands[0], operands[1], operands[0]));
DONE;
  }
  }"
  [(set_attr "type" "neon_logic,multiple,multiple,multiple")
   (set_attr "length" "*,16,8,8")
   (set_attr "arch" "any,a,t2,t2")]
)


I think in alternative #4 we have operands[0] == operands[1]
and operands[2] != operands[0]

and then gen_one_cmpldi2 (operands[0], operands[2])
will overwrite the operand[1] before it is used in
gen_iordi3 (operands[0], operands[1], operands[0]) ??

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Bernd Edlinger  changed:

   What|Removed |Added

  Attachment #39939 is obsolete|0   |1

--- Comment #39 from Bernd Edlinger  ---
Created attachment 39940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39940&action=edit
proposed patch, v2

last upload was accidentally truncated.
uploaded the right patch.

Sorry.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Bernd Edlinger  changed:

   What|Removed |Added

  Attachment #39898 is obsolete|0   |1

--- Comment #38 from Bernd Edlinger  ---
Created attachment 39939
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39939&action=edit
proposed patch, v2

Hi,

this is a new version that tries to fix the fallout of
the previous attempt.

I will attempt a bootstrap and reg-test later this week.

It splits the logical di3 patterns right at expansion, but only
when !TARGET_HARD_FLOAT or !TARGET_IWMMXT, in order not to
break the neon/iwmmxt patterns that seem to depend on them.

Simply disabling the logical di3 pattern made it impossible
to merge the ldrd/strd later because the ldr/str got expanded
too far away from each other.

It splits the adddi3/subdi3 in the split1 pass but only when
!TARGET_HARD_FLOAT, because other hard-float patterns seem
to depend on them.

Note that the setting of the out register in the shift
expansion is only necessary in the case of -mfpu=vfp -mhard-float;
in all other configurations it is now unnecessary.

So far I have only benchmarked with the sha512 test case
and a modified sha512 with the Sigma blocks decorated with bit-not (~).

Checked that the pr53447-*.c test cases work again.

Checked that this test case emits all ldrd/strd where expected:

cat test.c
void foo(long long* p)
{
  p[1] |= 0x10001;
  p[2] &= 0x10001;
  p[3] ^= 0x10001;
  p[4] += 0x10001;
  p[5] -= 0x10001;
  p[6] = ~p[6];
  p[7] <<= 5;
  p[8] >>= 5;
  p[9] -= p[10];
}

At -Os -mthumb -march=armv7-a -msoft-float / -mhard-float this patch
improves the number of ldrd/strd to 100%.

I wonder if it is OK to emit ldrd at all when optimizing
for speed, given that they are considered slower than ldm / 2x ldr?

With -Os -mfpu=neon / -mfpu=vfp / -march=iwmmxt: checked that the stack usage
is still the same, around 2328 bytes.

With -Os -marm / thumb2: made sure that the stack usage is still 272 bytes.

Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
because thumb1 cannot split the adddi3 pattern, once it is emitted.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #37 from Richard Earnshaw  ---
(In reply to Bernd Edlinger from comment #34)
> (In reply to Richard Earnshaw from comment #33)

> > The logic is certainly strange.  Some cores run LDRD less quickly than they
> > can do LDM, or even two independent loads.  I suspect the logic is meant to
> > be: use LDRD if available and not (optimizing for speed on a slow
> > LDRD-device).
> 
> Ok, so instead of removing this completely I should change it to:
>TARGET_LDRD
>&& (current_tune->prefer_ldrd_strd
>|| optimize_function_for_size_p (cfun))

That sounds about right.  Note that the original patch, back in 2013, said:

"* does not attempt to generate LDRD/STRD when optimizing for size and non of
the LDM/STM patterns match (but it would be easy to add),"
(https://gcc.gnu.org/ml/gcc-patches/2013-02/msg00604.html)

So it appears that this case was not attempted at the time.

I think when LDRD is not preferred we'd want to try to use LDM by preference if
the address offsets support it, even when optimizing for size.  Otherwise, use
LDRD if it supports the operation.
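
In pseudo-C, that policy might read (an assumption about the intent, with
hypothetical helper names; this is not existing arm.c code):

/* ldrd_ok, ldm_ok, emit_ldrd and emit_ldm are hypothetical helpers.  */
static void
emit_paired_access (rtx *operands)
{
  if (TARGET_LDRD && current_tune->prefer_ldrd_strd && ldrd_ok (operands))
    emit_ldrd (operands);
  else if (ldm_ok (operands))  /* prefer LDM when LDRD is not preferred */
    emit_ldm (operands);
  else if (TARGET_LDRD && ldrd_ok (operands))
    emit_ldrd (operands);      /* otherwise LDRD if it can do the job */
}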

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #36 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #34)
> (In reply to Richard Earnshaw from comment #33)
> > (In reply to Wilco from comment #32)
> > > (In reply to Bernd Edlinger from comment #31)
> > > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > > But one ldrd must be shorter than two ldr ?
> > > > 
> > > > That seems wrong...
> > > 
> > > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > > offset is in range will never increase code size so should always be 
> > > enabled.
> > 
> > The logic is certainly strange.  Some cores run LDRD less quickly than they
> > can do LDM, or even two independent loads.  I suspect the logic is meant to
> > be: use LDRD if available and not (optimizing for speed on a slow
> > LDRD-device).
> 
> Ok, so instead of removing this completely I should change it to:
>TARGET_LDRD
>&& (current_tune->prefer_ldrd_strd
>|| optimize_function_for_size_p (cfun))

That's better but still won't emit LDRD as it seems most cores have
prefer_ldrd_strd disabled... Given that we currently always emit LDRD/STRD for
DI mode accesses, this should just check TARGET_LDRD.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #35 from wilco at gcc dot gnu.org ---
(In reply to Richard Earnshaw from comment #30)
> (In reply to wilco from comment #29)
> >  Combine could help with
> > merging 2 loads/stores into a single instruction.
> 
> No, combine works strictly on dataflow dependencies.  Two stores cannot be
> dataflow related so won't be combined.  Loads would only be dataflow related
> if both loads fed into *exactly* one data-processing instruction after the
> split.  That's unlikely to happen so I very much doubt it would happen there
> either.

Right, so then either we need to look further when creating ldm/ldrd or when
splitting use a parallel of 2 SI mode loads.(In reply to Richard Earnshaw from
comment #33)
> (In reply to Wilco from comment #32)
> > (In reply to Bernd Edlinger from comment #31)
> > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > But one ldrd must be shorter than two ldr ?
> > > 
> > > That seems wrong...
> > 
> > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > offset is in range will never increase code size so should always be 
> > enabled.
> 
> The logic is certainly strange.  Some cores run LDRD less quickly than they
> can do LDM, or even two independent loads.  I suspect the logic is meant to
> be: use LDRD if available and not (optimizing for speed on a slow
> LDRD-device).

The issue is that the behaviour is not consistent. If DI mode accesses are
split early, LDRD is not used, but if not split, LDRD is used even on cores
where LDRD is not preferred or slow.

Selecting -mcpu=cortex-a57 while splitting early gives:

t0p:
ldrd r3, r2, [r0]
adds r3, r3, #1
adc r2, r2, #0
strd r3, r2, [r0]
bx  lr

But with -mcpu=cortex-a53 (with -O2 or -Os):

t0p:
ldr r3, [r0]
ldr r2, [r0, #4]
adds r3, r3, #1
str r3, [r0]
adc r2, r2, #0
str r2, [r0, #4]
bx  lr

GCC currently emits LDRD for both cases - so clearly LDRD was preferred...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #34 from Bernd Edlinger  ---
(In reply to Richard Earnshaw from comment #33)
> (In reply to Wilco from comment #32)
> > (In reply to Bernd Edlinger from comment #31)
> > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > But one ldrd must be shorter than two ldr ?
> > > 
> > > That seems wrong...
> > 
> > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > offset is in range will never increase code size so should always be 
> > enabled.
> 
> The logic is certainly strange.  Some cores run LDRD less quickly than they
> can do LDM, or even two independent loads.  I suspect the logic is meant to
> be: use LDRD if available and not (optimizing for speed on a slow
> LDRD-device).

Ok, so instead of removing this completely I should change it to:
   TARGET_LDRD
   && (current_tune->prefer_ldrd_strd
   || optimize_function_for_size_p (cfun))

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-11-01 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #33 from Richard Earnshaw  ---
(In reply to Wilco from comment #32)
> (In reply to Bernd Edlinger from comment #31)
> > Furthermore, if I want to do -Os the third condition is FALSE too.
> > But one ldrd must be shorter than two ldr ?
> > 
> > That seems wrong...
> 
> Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> offset is in range will never increase code size so should always be enabled.

The logic is certainly strange.  Some cores run LDRD less quickly than they can
do LDM, or even two independent loads.  I suspect the logic is meant to be: use
LDRD if available and not (optimizing for speed on a slow LDRD-device).

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-31 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #32 from Wilco  ---
(In reply to Bernd Edlinger from comment #31)
> Sure, combine can't help, especially because it runs before split1.
> 
> But I wondered why this peephole2 is not enabled:
> 
> (define_peephole2 ; ldrd
>   [(set (match_operand:SI 0 "arm_general_register_operand" "")
> (match_operand:SI 2 "memory_operand" ""))
>(set (match_operand:SI 1 "arm_general_register_operand" "")
> (match_operand:SI 3 "memory_operand" ""))]
>   "TARGET_LDRD
>  && current_tune->prefer_ldrd_strd
>  && !optimize_function_for_size_p (cfun)"
>   [(const_int 0)]
> 
> 
> I have -march=armv7-a / -mcpu=cortex-a9 and thus for me
> current_tune->prefer_ldrd_strd is FALSE.
> 
> Furthermore, if I want to do -Os the third condition is FALSE too.
> But one ldrd must be shorter than two ldr ?
> 
> That seems wrong...

Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
should only be tried on Thumb-1. Emitting LDRD from a peephole when the offset
is in range will never increase code size so should always be enabled.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-31 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #31 from Bernd Edlinger  ---
Sure, combine can't help, especially because it runs before split1.

But I wondered why this peephole2 is not enabled:

(define_peephole2 ; ldrd
  [(set (match_operand:SI 0 "arm_general_register_operand" "")
(match_operand:SI 2 "memory_operand" ""))
   (set (match_operand:SI 1 "arm_general_register_operand" "")
(match_operand:SI 3 "memory_operand" ""))]
  "TARGET_LDRD
 && current_tune->prefer_ldrd_strd
 && !optimize_function_for_size_p (cfun)"
  [(const_int 0)]


I have -march=armv7-a / -mcpu=cortex-a9 and thus for me
current_tune->prefer_ldrd_strd is FALSE.

Furthermore, if I want to do -Os the third condition is FALSE too.
But one ldrd must be shorter than two ldr ?

That seems wrong...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-31 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #30 from Richard Earnshaw  ---
(In reply to wilco from comment #29)
>  Combine could help with
> merging 2 loads/stores into a single instruction.

No, combine works strictly on dataflow dependencies.  Two stores cannot be
dataflow related so won't be combined.  Loads would only be dataflow related if
both loads fed into *exactly* one data-processing instruction after the split. 
That's unlikely to happen so I very much doubt it would happen there either.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-31 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #29 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #28)
> With my latest patch I bootstrapped a configuration with
> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
> --with-float=hard
> 
> I noticed a single regression in gcc.target/arm/pr53447-*.c
> 
> That is caused by disabling the adddi3 expansion.
> 
> void t0p(long long * p)
> {
>   *p += 0x10001;
> }
> 
> used to get compiled to this at -O2:
> 
>   ldrd r2, [r0]
>   adds r2, r2, #1
>   adc r3, r3, #1
>   strd r2, [r0]
>   bx  lr
> 
> but without the adddi3 pattern I have at -O2:
> 
>   ldr r3, [r0]
>   ldr r1, [r0, #4]
>   cmn r3, #1
>   add r3, r3, #1
>   movcc   r2, #0
>   movcs   r2, #1
>   add r1, r1, #1
>   str r3, [r0]
>   add r3, r2, r1
>   str r3, [r0, #4]
>   bx  lr

That's because your patch disables adddi3 completely, which is not correct. We
want to use the existing integer sequence, just expanded earlier. Instead of
your change, removing the "&& reload_completed" from the arm_adddi3 instruction
means we expand before register allocation:

ldr r3, [r0]
ldr r2, [r0, #4]
adds r3, r3, #1
str r3, [r0]
adc r2, r2, #16
str r2, [r0, #4]
bx  lr

> Note that also the ldrd instructions are not there.

Yes that's yet another bug...

> I think this is the effect on the ldrd that you already mentioned,
> and it gets worse when the expansion breaks the di registers up
> into two si registers.

Indeed, splitting early means we end up with 2 loads. However in most cases we
should be able to gather the loads and emit LDRD/STRD on Thumb-2 (ARM's
LDRD/STRD is far more limited so not as useful). Combine could help with
merging 2 loads/stores into a single instruction.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-31 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #28 from Bernd Edlinger  ---
With my latest patch I bootstrapped a configuration with
--with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
--with-float=hard

I noticed a single regression in gcc.target/arm/pr53447-*.c

That is caused by disabling the adddi3 expansion.

void t0p(long long * p)
{
  *p += 0x10001;
}

used to get compiled to this at -O2:

ldrd r2, [r0]
adds r2, r2, #1
adc r3, r3, #1
strd r2, [r0]
bx  lr

but without the adddi3 pattern I have at -O2:

ldr r3, [r0]
ldr r1, [r0, #4]
cmn r3, #1
add r3, r3, #1
movcc   r2, #0
movcs   r2, #1
add r1, r1, #1
str r3, [r0]
add r3, r2, r1
str r3, [r0, #4]
bx  lr

Note that also the ldrd instructions are not there.

Unfortunately the other di3 patterns also make the ldrd go away:

void t0p(long long * p)
{
  *p |= 0x10001;
}

was with iordi3 like this:

ldrd r2, [r0]
orr r2, r2, #1
orr r3, r3, #1
strd r2, [r0]
bx  lr

and without iordi3:

ldm r0, {r2, r3}
orr r2, r2, #1
orr r3, r3, #1
stm r0, {r2, r3}
bx  lr

but 

void t0p(long long * p)
{
  p[1] |= 0x10001;
}


gets two loads instead:

ldr r2, [r0, #8]
ldr r3, [r0, #12]
orr r2, r2, #1
orr r3, r3, #1
str r2, [r0, #8]
str r3, [r0, #12]
bx  lr

however:

void t0p(long long * p)
{
  p[1] <<= 11;
}

gets compiled into this:

ldr r3, [r0, #12]
ldr r2, [r0, #8]
lsl r3, r3, #11
lsl r1, r2, #11
orr r3, r3, r2, lsr #21
str r1, [r0, #8]
str r3, [r0, #12]
bx  lr


already before my patch.

I think this is the effect on the ldrd that you already mentioned,
and it gets worse when the expansion breaks the di registers up
into two si registers.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #27 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #26)
> (In reply to wilco from comment #25)
> > 
> > Alternatives can be disabled, there are flags, eg:
> > 
> > (set_attr "arch" "neon_for_64bits,*,*,avoid_neon_for_64bits")
> > 
> 
> Ok I see, thanks.
> 
> Still lots of insns could be simplified into less alternatives,
> when the attribs are identical of course.  Doesn't that create
> smaller tables and use less memory to compile?

Quite likely yes, but more importantly it will generate better code by not
accidentally allocating to FP registers. There are many integer variants that
have a '?' which seems incorrect.

> > I think the NEON alternatives not only do not help at all, they actually
> > cause CQ to be far worse even when no NEON instructions are ever
> > generated.
> 
> Yes.  At least when no fpu is available, there can't possibly be
> any benefit from these insns.  Sure there will be cases when
> the fpu instructions can improve something, just not here.

It's hard to imagine cases where using NEON for 64-bit operations is beneficial
- the extra moves to and from NEON registers are expensive on many cores, and
NEON operations have higher latencies than equivalent integer instructions.

Note I tried setting SLOW_UNALIGNED_ACCESS to false in addition to your patch.
This emits the rev instruction as expected, and size goes down to ~4400
instructions with -mthumb -march=armv7 -mfloat-abi=soft -Os. With -O2 -mthumb
-mfpu=vfp the reduction is even more dramatic from around 7300 instructions to
just 5000...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-27 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #26 from Bernd Edlinger  ---
(In reply to wilco from comment #25)
> 
> Alternatives can be disabled, there are flags, eg:
> 
> (set_attr "arch" "neon_for_64bits,*,*,avoid_neon_for_64bits")
> 

Ok I see, thanks.

Still, lots of insns could be simplified into fewer alternatives,
when the attributes are identical of course.  Doesn't that create
smaller tables and use less memory to compile?

> I think the NEON alternatives not only do not help at all, they actually
> generate cause CQ to be far worse even when no NEON instructions are ever
> generated.

Yes.  At least when no fpu is available, there can't possibly be
any benefit from these insns.  Sure there will be cases when
the fpu instructions can improve something, just not here.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #25 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #24)
> (In reply to Bernd Edlinger from comment #23)
> > @@ -5020,7 +5020,7 @@
> >  (define_insn_and_split "one_cmpldi2"
> >[(set (match_operand:DI 0 "s_register_operand" "=w,&r,&r,?w")
> > (not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
> 
> BTW: who knows what it is good for to have
> =w
>  w
> 
> on the first alternative, and on the 4th alternative
> ?w
>  w
> 
> "?" does only make the alternative less attractive, but it is otherwise
> exactly the same as the first one, right?

Alternatives can be disabled, there are flags, eg:

(set_attr "arch" "neon_for_64bits,*,*,avoid_neon_for_64bits")

I think the NEON alternatives not only do not help at all, they actually
cause CQ to be far worse even when no NEON instructions are ever
generated.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-27 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #24 from Bernd Edlinger  ---
(In reply to Bernd Edlinger from comment #23)
> @@ -5020,7 +5020,7 @@
>  (define_insn_and_split "one_cmpldi2"
>[(set (match_operand:DI 0 "s_register_operand"  "=w,&r,&r,?w")
>   (not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]

BTW: who knows what it is good for to have
=w
 w

on the first alternative, and on the 4th alternative
?w
 w

"?" does only make the alternative less attractive, but it is otherwise
exactly the same as the first one, right?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-27 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #23 from Bernd Edlinger  ---
(In reply to wilco from comment #22)
> 
> What I meant is that your patch still makes a large difference on the
> original test case despite making no difference in simple cases like the
> above.

For sure it is papering over something, as does the complete init-regs
pass; it is even documented to do that:

/* Check all of the uses of pseudo variables.  If any use that is MUST
   uninitialized, add a store of 0 immediately before it.  For
   subregs, this makes combine happy.  For full word regs, this makes
   other optimizations, like the register allocator and the reg-stack
   happy as well as papers over some problems on the arm and other
   processors where certain isa constraints cannot be handled by gcc.

I have seen the DI = 0 decay to two SI = 0 and finally be removed
well before the init-regs pass runs, and the init-regs pass finds
nothing to do in this test case.  Nevertheless they have a very
positive influence on the lra pass.  At the moment I do not see
what could replace this.  Magic.

> Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> and a byte reverse. Although it is recognized on ARM and works correctly if
> it is a little endian load, it doesn't perform the optimization if a byte
> reverse is needed. As a result there are lots of 64-bit shifts and orrs
> which create huge register pressure if not expanded early.
> 
> This testcase is turning out to be a goldmine of bugs...


Yes, and the test case can be modified to exercise other insns too.

For instance I just added di-mode ~ to the sigma blocks:

#define Sigma0(x)   ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
#define Sigma1(x)   ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
#define sigma0(x)   ~(ROTR((x),1)  ^ ROTR((x),8)  ^ ((x)>>7))
#define sigma1(x)   ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))

and saw the stack usage double to 528 with -marm -mfpu=vfp -msoft-float
-Os, and when I disable the one_cmpldi2 pattern it goes
back to 278 again:

thus I will add this to the second patch:

@@ -5020,7 +5020,7 @@
 (define_insn_and_split "one_cmpldi2"
   [(set (match_operand:DI 0 "s_register_operand" "=w,&r,&r,?w")
(not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && TARGET_HARD_FLOAT"
   "@
vmvn\t%P0, %P1
#


Not every di2 pattern is harmful; unary minus, for instance, does nothing.
It is mostly the patterns that mix =w and =r alternatives.
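
For reference, ROTR as used in the Sigma macros above (assumed to match
OpenSSL's sha512.c):

/* Assumed definition: rotate a 64-bit value right by s, 0 < s < 64.  */
#define ROTR(x, s) (((x) >> (s)) | ((x) << (64 - (s))))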

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #22 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #21)
> (In reply to wilco from comment #20)
> > > Wilco, where have you seen the additional registers used with my
> > > previous patch, maybe we can try to fix that somehow?
> > 
> > What happens is that the move of zero causes us to use extra registers in
> > shifts as both source and destination are now always live at the same time.
> > We generate worse code for simple examples like x | (y << 3):
> > 
> > -mfpu=vfp:
> > push    {r4, r5}
> > lsls    r5, r1, #3
> > orr     r5, r5, r0, lsr #29
> > lsls    r4, r0, #3
> > orr     r0, r4, r2
> > orr     r1, r5, r3
> > pop     {r4, r5}
> > bx      lr
> > -mfpu=neon:
> > lsls    r1, r1, #3
> > orr     r1, r1, r0, lsr #29
> > lsls    r0, r0, #3
> > orrs    r0, r0, r2
> > orrs    r1, r1, r3
> > bx      lr
> > 
> 
> hmm. I think with my patch reverted the code is the same.
> 
> I tried -O2 -marm -mfpu=vfp -mhard-float and get the first variant
> with and without the patch.

Yes that's what I get.

> For -O2 -marm -mfpu=vfp -msoft-float I get the second variant
> with and without the patch.

This still gives the first variant for me.

> For -O2 -marm -mfpu=neon -mhard-float I get the second variant.

Right.

> With -O2 -marm -mfpu=neon -msoft-float I get a third variant
> again with and without patch:
> 
> lsl r1, r1, #3
> mov ip, r0
> orr r0, r2, r0, lsl #3
> orr r1, r1, ip, lsr #29
> orr r1, r1, r3
> bx  lr

I don't see this...

> Am I missing something?

What I meant is that your patch still makes a large difference on the original
test case despite making no difference in simple cases like the above.

Anyway, there is another bug: on AArch64 we correctly recognize there are 8
1-byte loads, shifts and orrs which can be replaced by a single 8-byte load and
a byte reverse. Although it is recognized on ARM and works correctly if it is a
little endian load, it doesn't perform the optimization if a byte reverse is
needed. As a result there are lots of 64-bit shifts and orrs which create huge
register pressure if not expanded early.
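
For illustration, the pattern being described (eight 1-byte loads, shifts
and orrs) looks roughly like this in C; a sketch, with a hypothetical
function name modeled on the sha512 test case:

#include <stdint.h>

/* Hypothetical example: on a little-endian target this should become a
   single 8-byte load plus a byte reverse (rev) once the optimization
   also handles the byte-reversed case.  */
static uint64_t load_be64 (const unsigned char *p)
{
  return ((uint64_t) p[0] << 56) | ((uint64_t) p[1] << 48)
       | ((uint64_t) p[2] << 40) | ((uint64_t) p[3] << 32)
       | ((uint64_t) p[4] << 24) | ((uint64_t) p[5] << 16)
       | ((uint64_t) p[6] << 8)  | (uint64_t) p[7];
}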

This testcase is turning out to be a goldmine of bugs...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-27 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #21 from Bernd Edlinger  ---
(In reply to wilco from comment #20)
> > Wilco, where have you seen the additional registers used with my
> > previous patch, maybe we can try to fix that somehow?
> 
> What happens is that the move of zero causes us to use extra registers in
> shifts as both source and destination are now always live at the same time.
> We generate worse code for simple examples like x | (y << 3):
> 
> -mfpu=vfp:
>   push    {r4, r5}
>   lsls    r5, r1, #3
>   orr     r5, r5, r0, lsr #29
>   lsls    r4, r0, #3
>   orr     r0, r4, r2
>   orr     r1, r5, r3
>   pop     {r4, r5}
>   bx      lr
> -mfpu=neon:
>   lsls    r1, r1, #3
>   orr     r1, r1, r0, lsr #29
>   lsls    r0, r0, #3
>   orrs    r0, r0, r2
>   orrs    r1, r1, r3
>   bx      lr
> 

hmm. I think with my patch reverted the code is the same.

I tried -O2 -marm -mfpu=vfp -mhard-float and get the first variant
with and without the patch.

For -O2 -marm -mfpu=vfp -msoft-float I get the second variant
with and without the patch.

For -O2 -marm -mfpu=neon -mhard-float I get the second variant.

With -O2 -marm -mfpu=neon -msoft-float I get a third variant
again with and without patch:

lsl r1, r1, #3
mov ip, r0
orr r0, r2, r0, lsl #3
orr r1, r1, ip, lsr #29
orr r1, r1, r3
bx  lr



Am I missing something?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #20 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #19)
> I think the problem with the anddi, iordi and xordi instructions is that
> they obscure the data flow between the low and high half words.
> When they are not enabled, we have the low and high parts
> expanded independently, but in the case of the di mode instructions
> it is not clear which of the half words propagates from input to output.
> 
> With my new patch, we have 2328 bytes of stack for hard float,
> and only 272 bytes for arm-none-eabi, which is a target I care about.
> 
> 
> This is still not perfect, but certainly a big improvement.
> 
> Wilco, where have you seen the additional registers used with my
> previous patch, maybe we can try to fix that somehow?

What happens is that the move of zero causes us to use extra registers in
shifts as both source and destination are now always live at the same time. We
generate worse code for simple examples like x | (y << 3):

-mfpu=vfp:
push    {r4, r5}
lsls    r5, r1, #3
orr     r5, r5, r0, lsr #29
lsls    r4, r0, #3
orr     r0, r4, r2
orr     r1, r5, r3
pop     {r4, r5}
bx      lr
-mfpu=neon:
lsls    r1, r1, #3
orr     r1, r1, r0, lsr #29
lsls    r0, r0, #3
orrs    r0, r0, r2
orrs    r1, r1, r3
bx      lr
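
For reference, a C version of the example (a sketch; the 64-bit operand
types and the function name are assumptions, implied by the DI mode
discussion):

#include <stdint.h>

/* Hypothetical test function for the code above: x | (y << 3).  */
uint64_t or_shift (uint64_t x, uint64_t y)
{
  return x | (y << 3);
}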

So that means this is not a solution.

Note init_regs already does insert moves of zero before expanded shifts (I get
the same code with -mfpu=vfp with or without your previous patch), so it
shouldn't be necessary. Why does it still make a difference? Presumably
init_regs doesn't find all cases or doesn't insert the moves at the right
place, so we should fix that rather than do it in the shift expansion.

However the underlying issue is that DI mode operations are not all split at
exactly the same time, and that is what needs to be fixed.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-26 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #19 from Bernd Edlinger  ---
I think the problem with the anddi, iordi and xordi instructions is that
they obscure the data flow between the low and high half words.
When they are not enabled, we have the low and high parts
expanded independently, but in the case of the di mode instructions
it is not clear which of the half words propagates from input to output.
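
Sketched in C (hypothetical names): without the DI-mode pattern, the AND
decomposes into two SI-mode ANDs whose halves are fully independent:

#include <stdint.h>

/* Hypothetical sketch: the low halves never feed the high result and vice
   versa, so each half word can be allocated, spilled and die on its own.  */
static uint64_t and64 (uint32_t a_lo, uint32_t a_hi,
                       uint32_t b_lo, uint32_t b_hi)
{
  uint32_t lo = a_lo & b_lo;
  uint32_t hi = a_hi & b_hi;
  return ((uint64_t) hi << 32) | lo;
}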

With my new patch, we have 2328 bytes of stack for hard float,
and only 272 bytes for arm-none-eabi, which is a target I care about.


This is still not perfect, but certainly a big improvement.

Wilco, where have you seen the additional registers used with my
previous patch, maybe we can try to fix that somehow?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-26 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #18 from Bernd Edlinger  ---
Created attachment 39898
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39898&action=edit
proposed patch

This disables the problematic di patterns when no fpu is used and there
is absolutely no chance that the di patterns can improve anything.

This is what I am going to test now, but any help with additional testing
would be highly welcome.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

wilco at gcc dot gnu.org changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #16)
> Wow.

> 
> Maybe I am dreaming, or something is completely wrong now...

Well I can reproduce it so it is real - thanks for the insight! Basically
you're forcing Thumb-2 to split early like Thumb-1. This allows the top and
bottom halves to be independently optimized (unused halves or zeroes are very
common in DI mode operations), resulting in much lower register pressure.

Interestingly the subreg removal phase works fine if you split *all* DI mode
operations early. This explains the bad code for shifts as they are split early
but can't have their subregs removed due to other DI mode operations being
split after reload.

This means the correct approach is to split all operations before register
allocation. Looking at the phase list, Expand appears best as there isn't any
constant propagation or dead code elimination done after Split1. Interestingly
it will be a major simplification as we can get rid of a huge number of complex
patterns and do exactly the same for ARM, Thumb-1, Thumb-2, VFP and NEON.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-25 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #16 from Bernd Edlinger  ---
Wow.

look at this:

Index: arm.md
===================================================================
--- arm.md  (revision 241539)
+++ arm.md  (working copy)
@@ -448,7 +448,7 @@
  (plus:DI (match_operand:DI 1 "s_register_operand" "")
   (match_operand:DI 2 "arm_adddi_operand"  "")))
 (clobber (reg:CC CC_REGNUM))])]
-  "TARGET_EITHER"
+  "TARGET_EITHER && !TARGET_THUMB2"
   "
   if (TARGET_THUMB1)
 {
@@ -2256,7 +2256,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "")
(and:DI (match_operand:DI 1 "s_register_operand" "")
(match_operand:DI 2 "neon_inv_logic_op2" "")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && !TARGET_THUMB2"
   ""
 )

@@ -3310,7 +3310,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "")
(xor:DI (match_operand:DI 1 "s_register_operand" "")
(match_operand:DI 2 "arm_xordi_operand" "")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && !TARGET_THUMB2"
   ""
 )


Thus disabling only adddi3, anddi3 and xordi3,
but not ashldi3 and ashrdi3, because they are not part of the problem,

reduces
arm-linux-gnueabihf-gcc -mthumb -march=armv7 -mfloat-abi=soft -Os -c pr77308.c
to:

frame = 272 (about the same as aarch64 had!)
code  = 0x374C

sha512_block_data_order:
@ args = 0, pretend = 0, frame = 272
@ frame_needed = 0, uses_anonymous_args = 0


Maybe I am dreaming, or something is completely wrong now...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-25 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #15 from Bernd Edlinger  ---
(In reply to Wilco from comment #14)
> (In reply to Bernd Edlinger from comment #13)
> > I am still trying to understand why thumb1 seems to outperform thumb2.
> > 
> > Obviously thumb1 does not have the shiftdi3 pattern,
> > but even if I remove these from thumb2, the result is still
> > not on par with thumb1.  Apparently other patterns that are not
> > enabled with thumb1 still produce di values; they are
> > xordi3 and anddi3, and these are often used.  Then there is
> > adddi3, which is enabled in thumb1 and thumb2; I also disabled
> > this one, and now the sha512 gets down to an incredible 1152
> > bytes of frame (-Os -march=armv7 -mthumb -mfloat-abi=soft):
> > 
> > I know this is a hack, but 1K stack is what we should expect...
> > 
> > --- arm.md  2016-10-25 19:54:16.425736721 +0200
> > +++ arm.md.orig 2016-10-17 19:46:59.0 +0200
> > @@ -448,7 +448,7 @@
> >   (plus:DI (match_operand:DI 1 "s_register_operand" "")
> >(match_operand:DI 2 "arm_adddi_operand"  "")))
> >  (clobber (reg:CC CC_REGNUM))])]
> > -  "TARGET_EITHER && !TARGET_THUMB2"
> > +  "TARGET_EITHER"
> 
> So you're actually turning these instructions off for Thumb-2? What does
> it do instead then? Does the number of instructions go down?
> 
> I noticed that with or without -mfpu=neon, using -marm is significantly
> smaller than -mthumb. Most of the extra instructions appear to be moves,
> which means something is wrong (I would expect Thumb-2 to do better as it
> supports LDRD with larger offsets than ARM).

The LDRD may be another detail that contributes to this mess.
Maybe, just a guess, LDRD simply does not work with DI registers, but
only with two SI registers; at least the pattern looks like it targets
two SI moves?

I would expect the n DI mode registers to fall apart into 2n SI mode
registers; that should happen when the expansion finds no DI pattern and
falls back to SI patterns, and then each SI mode register can be spilled
independently and can be dead independently of the other half word.

And frankly I am still puzzled by what my brutal patch did to the stack
size; it also reduced the code size:

frame         2328 ->   1152
code size   0x4188 -> 0x3ab8


I have not tested whether the code works, but I assume that it should,
or it would fail with an ICE, which is apparently not the case.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #14 from Wilco  ---
(In reply to Bernd Edlinger from comment #13)
> I am still trying to understand why thumb1 seems to outperform thumb2.
> 
> Obviously thumb1 does not have the shiftdi3 pattern,
> but even if I remove these from thumb2, the result is still
> not on par with thumb1.  Apparently other patterns that are not
> enabled with thumb1 still produce di values; they are
> xordi3 and anddi3, and these are often used.  Then there is
> adddi3, which is enabled in thumb1 and thumb2; I also disabled
> this one, and now the sha512 gets down to an incredible 1152
> bytes of frame (-Os -march=armv7 -mthumb -mfloat-abi=soft):
> 
> I know this is a hack, but 1K stack is what we should expect...
> 
> --- arm.md  2016-10-25 19:54:16.425736721 +0200
> +++ arm.md.orig 2016-10-17 19:46:59.0 +0200
> @@ -448,7 +448,7 @@
>   (plus:DI (match_operand:DI 1 "s_register_operand" "")
>(match_operand:DI 2 "arm_adddi_operand"  "")))
>  (clobber (reg:CC CC_REGNUM))])]
> -  "TARGET_EITHER && !TARGET_THUMB2"
> +  "TARGET_EITHER"

So you're actually turning these instructions off for Thumb-2? What does it
do instead then? Does the number of instructions go down?

I noticed that with or without -mfpu=neon, using -marm is significantly smaller
than -mthumb. Most of the extra instructions appear to be moves, which means
something is wrong (I would expect Thumb-2 to do better as it supports LDRD
with larger offsets than ARM).

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-25 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #13 from Bernd Edlinger  ---
I am still trying to understand why thumb1 seems to outperform thumb2.

Obviously thumb1 does not have the shiftdi3 pattern,
but even if I remove these from thumb2, the result is still
not on par with thumb1.  Apparently other patterns that are not
enabled with thumb1 still produce di values; they are
xordi3 and anddi3, and these are often used.  Then there is
adddi3, which is enabled in thumb1 and thumb2; I also disabled
this one, and now the sha512 gets down to an incredible 1152
bytes of frame (-Os -march=armv7 -mthumb -mfloat-abi=soft):

I know this is a hack, but 1K stack is what we should expect...

--- arm.md  2016-10-25 19:54:16.425736721 +0200
+++ arm.md.orig 2016-10-17 19:46:59.0 +0200
@@ -448,7 +448,7 @@
  (plus:DI (match_operand:DI 1 "s_register_operand" "")
   (match_operand:DI 2 "arm_adddi_operand"  "")))
 (clobber (reg:CC CC_REGNUM))])]
-  "TARGET_EITHER && !TARGET_THUMB2"
+  "TARGET_EITHER"
   "
   if (TARGET_THUMB1)
 {
@@ -465,7 +465,7 @@
(plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
(clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && !TARGET_NEON && !TARGET_THUMB2"
+  "TARGET_32BIT && !TARGET_NEON"
   "#"
   "TARGET_32BIT && reload_completed
&& ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
@@ -2256,7 +2256,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "")
(and:DI (match_operand:DI 1 "s_register_operand" "")
(match_operand:DI 2 "neon_inv_logic_op2" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   ""
 )

@@ -2264,7 +2264,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "=w,w
,?w,?w")
 (and:DI (match_operand:DI 1 "s_register_operand" "%w,0 ,0 ,r ,0 ,r
,w ,0")
 (match_operand:DI 2 "arm_anddi_operand_neon" "w ,DL,r ,r
,De,De,w ,DL")))]
-  "TARGET_32BIT && !TARGET_IWMMXT && !TARGET_THUMB2"
+  "TARGET_32BIT && !TARGET_IWMMXT"
 {
   switch (which_alternative)
 {
@@ -3310,7 +3310,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "")
(xor:DI (match_operand:DI 1 "s_register_operand" "")
(match_operand:DI 2 "arm_xordi_operand" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   ""
 )

@@ -3318,7 +3318,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "=w,?w")
(xor:DI (match_operand:DI 1 "s_register_operand" "%w ,0,r ,0 ,r ,w")
(match_operand:DI 2 "arm_xordi_operand"  "w ,r ,r ,Dg,Dg,w")))]
-  "TARGET_32BIT && !TARGET_IWMMXT && !TARGET_THUMB2"
+  "TARGET_32BIT && !TARGET_IWMMXT"
 {
   switch (which_alternative)
 {
@@ -3983,7 +3983,7 @@
   [(set (match_operand:DI 0 "s_register_operand" "")
 (ashift:DI (match_operand:DI 1 "s_register_operand" "")
(match_operand:SI 2 "general_operand" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   "
   if (TARGET_NEON)
 {
@@ -4058,7 +4058,7 @@
   [(set (match_operand:DI  0 "s_register_operand" "")
 (ashiftrt:DI (match_operand:DI 1 "s_register_operand" "")
  (match_operand:SI 2 "reg_or_int_operand" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   "
   if (TARGET_NEON)
 {


What do you think?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-20 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #12 from Wilco  ---
It looks like we need a different approach: I've seen the extra SETs use up
more registers in some cases, and in other cases they are optimized away early
on...

Doing shift expansion at the same time as all other DI mode operations should
result in the same stack size as -mfpu=neon.  However that's still well behind
Thumb-1, and I would expect ARM/Thumb-2 to beat Thumb-1 easily with 6 extra
registers.

The spill code for Thumb-2 seems incorrect:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
(plus:SI (reg/f:SI 13 sp)
(const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
 (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
(mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64]))
sha512.c:147 170 {*arm_movdi}
 (nil))

LDRD has a range of 1020 bytes on Thumb-2, so I would expect this to be a
single instruction.
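
That would be something like this (a sketch, not compiler output):

ldrd    r2, r3, [sp, #480]   @ offset 480 is well within the 1020-byte range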

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-17 Thread edlinger at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #11 from Bernd Edlinger  ---
Author: edlinger
Date: Mon Oct 17 17:46:59 2016
New Revision: 241273

URL: https://gcc.gnu.org/viewcvs?rev=241273&root=gcc&view=rev
Log:
2016-10-17  Bernd Edlinger  

PR target/77308
* config/arm/arm.c (arm_emit_coreregs_64bit_shift): Clear the result
register explicitly.
* config/arm/arm.md (ashldi3, ashrdi3, lshrdi3): Don't FAIL if
optimizing for size.

testsuite:
2016-10-17  Bernd Edlinger  

PR target/77308
* gcc.target/arm/pr77308.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/arm/pr77308.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/arm/arm.c
trunk/gcc/config/arm/arm.md
trunk/gcc/testsuite/ChangeLog

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-23 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #10 from Wilco  ---
The register allocator doesn't correctly track life ranges for SETs with a
subreg, and this can cause terrible spilling indeed.

Also why aren't DI mode values split into native SI registers on 32-bit
targets? Using DI mode registers means the register allocator has a much harder
problem to solve as it can only use even/odd register pairs (and must allocate
both registers even if one is dead).

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-23 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-08-23
 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #9 from ktkachov at gcc dot gnu.org ---
Note that the fpu option plays a role here as well

When I compile with -O3 -S -mfloat-abi=hard -march=armv7-a -mthumb
-mtune=cortex-a8 -mfpu=neon

I get:
sha512_block_data_order:
@ args = 0, pretend = 0, frame = 2384
@ frame_needed = 0, uses_anonymous_args = 0
push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
subw    sp, sp, #2388
subs    r4, r2, #1


whereas if you leave out the -mfpu you get the default, which is probably 'vfp'
if you didn't configure gcc with an explicit --with-fpu.  This is usually not a
good fit for recent targets.
With -mfpu=vfp I get the terrible:
sha512_block_data_order:
@ args = 0, pretend = 0, frame = 3568
@ frame_needed = 0, uses_anonymous_args = 0
push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
subw    sp, sp, #3572
subs    r4, r2, #1

That said, I bet there's still room for improvement.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-22 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #8 from Bernd Edlinger  ---
Analyzing the different thumb1/thumb2 reload dumps,
I see t2 often uses code like this to access spill slots:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
(plus:SI (reg/f:SI 13 sp)
(const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
 (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
(mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64]))
sha512.c:147 170 {*arm_movdi}
 (nil))

while t1 often does it this way:

(insn 8221 8219 4591 6 (set (reg/v:DI 4 r4 [orig:1450 fD.4102 ] [1450])
(mem/c:DI (plus:SI (reg/f:SI 13 sp)
(const_int 152 [0x98])) [5 %sfpD.4164+-1432 S8 A64]))
sha512.c:155 748 {*thumb1_movdi_insn}
 (nil))


grep "plus.*reg.*sp" t1/sha512.c.260r.reload |grep -v mem |wc -l
110

grep "plus.*reg.*sp" t2/sha512.c.260r.reload |grep -v mem |wc -l
602


I think in thumb1 the 110 memory accesses are all for frame objects,
like X, a-h, etc., while thumb2 also uses pointers to access spill slots.

It must be pretty expensive for thumb1 to access something using
a pointer register.  But for thumb2 this strategy creates rather
significant register pressure.
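
In assembly terms the two strategies are roughly (a sketch, not taken from
the dumps):

@ t2: materialize a pointer first, then load through it (needs a scratch reg)
add     r3, sp, #480
ldrd    r2, r3, [r3]

@ t1: plain sp-relative loads of the frame object
ldr     r4, [sp, #152]
ldr     r5, [sp, #156]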

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-22 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #7 from Bernd Edlinger  ---
Even more surprising is this:

While thumb2 code (-march=armv6t2 -mthumb) has about the same stack size
as arm code (-marm), thumb1 code uses only 1588 bytes of stack, and it does
not change with -fno-schedule-insns nor with the patch above:

arm-unknown-eabi-gcc -O3 -march=armv6 -msoft-float -mthumb -S sha512.c

=>

sha512_block_data_order:
push    {r4, r5, r6, r7, lr}
mov     r6, r10
mov     r4, r8
mov     r7, fp
mov     r5, r9
ldr     r3, .L11
push    {r4, r5, r6, r7}
ldr     r4, .L11+4
add     sp, sp, r4

...

.L11:
.word   1580
.word   -1588

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-21 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Bernd Edlinger  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #6 from Bernd Edlinger  ---
Vladimir,

is this an lra problem,
or should this be fixed in the target?

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-21 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #5 from Bernd Edlinger  ---
Now I try to clear the out register also when the shift count is less than 32:


Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c    (revision 239624)
+++ gcc/config/arm/arm.c    (working copy)
@@ -29159,6 +29159,7 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
  /* Shifts by a constant less than 32.  */
  rtx reverse_amount = GEN_INT (32 - INTVAL (amount));

+ emit_insn (SET (out, const0_rtx));
  emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
  emit_insn (SET (out_down,
  ORR (REV_LSHIFT (code, in_up, reverse_amount),
@@ -29170,12 +29171,11 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
  /* Shifts by a constant greater than 31.  */
  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);

+ emit_insn (SET (out, const0_rtx));
  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
  if (code == ASHIFTRT)
emit_insn (gen_ashrsi3 (out_up, in_up,
GEN_INT (31)));
- else
-   emit_insn (SET (out_up, const0_rtx));
}
 }
   else


Result: the stack use goes down from 2960 to 2320 bytes.
Note: this SET is simply redundant, but it makes the out register
well-defined from the beginning.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-21 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #4 from Bernd Edlinger  ---
hmm, when I compare aarch64 vs. arm sha512.c.260r.reload
with -O3 -fno-schedule-insns

I see a big difference:

aarch64 has only a few spill regs

subreg regs:
  Slot 0 regnos (width = 8): 856
  Slot 1 regnos (width = 8): 857
  Slot 2 regnos (width = 8): 858
  Slot 3 regnos (width = 8): 859
  Slot 4 regnos (width = 8): 860
  Slot 5 regnos (width = 8): 861
  Slot 6 regnos (width = 8): 862
  Slot 7 regnos (width = 8): 2117
  Slot 8 regnos (width = 8): 1164
  Slot 9 regnos (width = 8): 1052


but arm has 415 (8 bytes each),
and the line "subreg regs:" before the Spill Slots contains ~1500 regs.

and while aarch64 does not have a single subreg in any pass,
arm has lots of subregs before lra eliminates all of them.

like this, in sha512.c.217r.expand:

(insn 85 84 86 5 (set (subreg:SI (reg:DI 1670) 4)
(ashift:SI (subreg:SI (reg:DI 1669) 0)
(const_int 24 [0x18]))) sha512.c:98 -1
 (nil))
(insn 86 85 87 5 (set (subreg:SI (reg:DI 1670) 0)
(const_int 0 [0])) sha512.c:98 -1
 (nil))

This funny instruction is generated in arm_emit_coreregs_64bit_shift:

  /* Shifts by a constant greater than 31.  */
  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);

  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
  if (code == ASHIFTRT)
emit_insn (gen_ashrsi3 (out_up, in_up,
GEN_INT (31)));
  else
emit_insn (SET (out_up, const0_rtx));
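
In C terms, for a logical right shift by a constant n > 31 this computes
roughly the following (a sketch; the names mirror the out_up/out_down
operands above and are otherwise hypothetical):

#include <stdint.h>

static uint64_t lshr64_by_big (uint32_t in_down, uint32_t in_up, int n)
{
  uint32_t out_down = in_up >> (n - 32);  /* SET (out_down, SHIFT (code, in_up, adj_amount)) */
  uint32_t out_up = 0;                    /* SET (out_up, const0_rtx), the subreg write */
  (void) in_down;                         /* the low input word cannot reach the result */
  return ((uint64_t) out_up << 32) | out_down;
}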

From my past experience, I assume that using a subreg to write
one half of the out register creates more problems than it solves.

So I tried this:

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c    (revision 239624)
+++ gcc/config/arm/arm.c    (working copy)
@@ -29170,12 +29170,11 @@
  /* Shifts by a constant greater than 31.  */
  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);

+ emit_insn (SET (out, const0_rtx));
  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
  if (code == ASHIFTRT)
emit_insn (gen_ashrsi3 (out_up, in_up,
GEN_INT (31)));
- else
-   emit_insn (SET (out_up, const0_rtx));
}
 }
   else

and it reduced the stack from 3472->2960

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-21 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

Bernd Edlinger  changed:

   What|Removed |Added

 CC||bernd.edlinger at hotmail dot 
de

--- Comment #3 from Bernd Edlinger  ---
(In reply to Andrew Pinski from comment #1)
> Does -fno-schedule-insns help?  Sometimes the scheduler before the register
> allocator causes register pressure and forces more register spills.

Yes, with -fno-schedule-insns it is a little bit smaller:

sha512_block_data_order:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 3472
@ frame_needed = 0, uses_anonymous_args = 0
push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
sub sp, sp, #3472
sub sp, sp, #4


but it could be worse :-)

with -fno-ira-share-spill-slots:

sha512_block_data_order:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 6640
@ frame_needed = 0, uses_anonymous_args = 0
push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
sub sp, sp, #6592
sub sp, sp, #52

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-21 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #2 from Andrew Pinski  ---
For aarch64, the stack size is just 208 bytes.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-08-21 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #1 from Andrew Pinski  ---
Does -fno-schedule-insns help?  Sometimes the scheduler before the register
allocator causes register pressure and forces more register spills.