les), but this simple patch
fixes it:
If strchr can't be folded in gimple-fold.c, break so folding code in builtins.c
is also called.
OK for commit?
2016-09-28 Wilco Dijkstra <wdijk...@arm.com>
* gimple-fold.c (gimple_fold_builtin): After failing to fold
strchr, also try the generi
Prathamesh Kulkarni wrote:
> Hi Wilco,
> I am working on LTO varpool partitioning to improve performance for
> section anchors.
> I posted a preliminary patch posted at:
> https://gcc.gnu.org/ml/gcc/2016-07/msg00033.html
> Unfortunately I haven't yet been able to
Markus Trippelsdorf wrote:
> On 2016.09.26 at 09:42 +0200, Richard Biener wrote:
> > On Sat, Sep 24, 2016 at 10:52 AM, Markus Trippelsdorf
> > wrote:
> > > On 2016.09.23 at 15:29 +0200, Richard Biener wrote:
> > > If for example you limit partitions to 4 on a 4-core
Markus Trippelsdorf wrote:
> What I wanted to point out is that you of course lose the speedup you'll
> get from parallel running backends with only a single partition.
Absolutely. For every possible value of min-lto-partition you can find an
application that will build with more parallelism if
Richard Biener wrote:
>On Fri, Sep 23, 2016 at 3:02 PM, Markus Trippelsdorf
>wrote:
> > And tramp3d only uses ten partitions (lto-min-partition=1).
> > > With lto-min-partition=5 (current patch) this decreases to only two
> > > partitions. As a result we lose the
:
2016-09-23 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* gcc/gimple-fold.c (gimple_fold_builtin_strchr):
New function to optimize strchr (s, 0) to strlen.
(gimple_fold_builtin): Add BUILT_IN_STRCHR case.
testsuite/
* gcc/testsuite/gcc.dg/strlenopt-20.c: Updat
ARM server.
ChangeLog:
2016-09-22 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* params.def (MIN_PARTITION_SIZE): Increase to 5.
--
diff --git a/gcc/params.def b/gcc/params.def
index 79b7dd4cca9ec1bb67a64725fb1a596b6e937419..da8fd1825e15f2aa800b1c8b680985776c1080ed 100644
---
ChangeLog:
2016-09-12 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_classify_symbol):
Apply reasonable limit to symbol offsets.
testsuite/
* gcc.target/aarch64/symbol-range.c (foo): Set new limit.
* gcc.target/aarch64/symb
ping
From: Wilco Dijkstra
Sent: 23 August 2016 15:49
To: Richard Earnshaw; GCC Patches
Cc: nd
Subject: Re: [PATCH][AArch64] Improve stack adjustment
ping
Richard Earnshaw wrote:
> I see you've added a default argument for your new parameter. I think
> doing that is fine, but
ping
From: Wilco Dijkstra
Sent: 08 September 2016 14:35
To: GCC Patches
Cc: nd
Subject: [PATCH][AArch64] Align FP callee-saves
If the number of integer callee-saves is odd, the FP callee-saves use 8-byte
aligned LDP/STP. Since 16-byte alignment may be faster on some CPUs, align the FP
ping
From: Wilco Dijkstra
Sent: 02 September 2016 12:31
To: Ramana Radhakrishnan; GCC Patches
Cc: nd
Subject: Re: [PATCH][AArch64 - v3] Simplify eh_return implementation
Ramana Radhakrishnan wrote:
> Can you please file a PR for this and add some testcases ? This sounds like
> a s
Bernd Schmidt wrote:
> On 09/15/2016 03:38 PM, Wilco Dijkstra wrote:
> > __rawmemchr is not the fastest on any target I tried, including x86,
>
> Interesting. Care to share your test program? I just looked at the libc
> sources and strlen/rawmemchr are practically identical cod
Jakub Jelinek wrote:
>
> Those are the generic definitions, all targets that care about performance
> obviously should replace them with assembly code.
No, that's exactly my point: it is not true that it is always best to write
assembly code. For example there is absolutely no
From: Jakub Jelinek <ja...@redhat.com>
> On Thu, Sep 15, 2016 at 02:55:48PM +0000, Wilco Dijkstra wrote:
>> stpcpy is not conceptually the same, but for mempcpy, yes. By default
>> it's converted into memcpy in the GLIBC headers and the generic
>> implementation
Jakub Jelinek wrote:
On Thu, Sep 15, 2016 at 03:16:52PM +0100, Szabolcs Nagy wrote:
> >
> > from libc point of view, rawmemchr is a rarely used
> > nonstandard function that should be optimized for size.
> > (glibc does not do this now, but it should in my opinion.)
>
> rawmemchr with 0 is to
Richard Biener wrote:
> On Wed, Sep 14, 2016 at 3:45 PM, Jakub Jelinek wrote:
> > On Wed, Sep 14, 2016 at 03:41:33PM +0200, Richard Biener wrote:
> >> > We've seen several different proposals for where/how to do this
> >> > simplification, why did you
> >> > say strlenopt is
Tamar Christina wrote:
> On 13/09/16 13:43, Joseph Myers wrote:
> > On Tue, 13 Sep 2016, Tamar Christina wrote:
>>
> >> On 12/09/16 23:33, Joseph Myers wrote:
> >>> Why is this boolean false for ieee_quad_format, mips_quad_format and
> >>> ieee_half_format? They should meet your description (even
Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, May 18, 2016 at 2:29 PM, Wilco Dijkstra <wilco.dijks...@arm.com>
> wrote:
>> Richard Biener wrote:
>>
>>> Yeah ;) I'm currently bootstrapping/testing the patch that makes it
>>> poss
Jakub wrote:
> On Mon, Sep 12, 2016 at 04:19:32PM +, Tamar Christina wrote:
> > This patch adds an optimized route to the fpclassify builtin
> > for floating point numbers which are similar to IEEE-754 in format.
> >
> > The goal is to make it faster by:
> > 1. Trying to determine the most
2016-09-12 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_classify_symbol):
Apply reasonable limit to symbol offsets.
testsuite/
* gcc.target/aarch64/symbol-range.c (foo): Set new limit.
* gcc.target/aarch64/symbol-ra
s various bugs in aarch64_final_eh_return_addr, which does not work with
-fomit-frame-pointer, alloca or outgoing arguments.
Bootstrap OK, GCC Regression OK, OK for trunk? Would it be useful to backport
this to GCC6.x?
ChangeLog:
2016-09-02 Wilco Dijkstra <wdijk...@arm.com>
callee-saves, the generated code
doesn't change.
Bootstrap and regression pass, OK for commit?
ChangeLog:
2016-09-08 Wilco Dijkstra <wdijk...@arm.com>
* config/aarch64/aarch64.c (aarch64_layout_frame):
Align FP callee-saves.
--
diff --git a/gcc/config/aarch64/aarch64.c
Christophe Lyon wrote:
> After this patch, I've noticed a regression:
> FAIL: gcc.target/aarch64/ldp_vec_64_1.c scan-assembler ldp\td[0-9]+, d[0-9]
> You probably need to adjust the scan pattern.
The original code was better and what we should generate. IVOpt chooses to use
indexing even though it
]
ldr w0, [x0, 3424]
add w0, w1, w0
ret
OK for trunk?
ChangeLog:
2016-09-06 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_legitimize_address):
Avoid use of base_offset if offset already in range.
--
diff --git a/gcc/
ion using simple wrapper functions and no
default parameters:
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_add_constant_internal):
Add extra argument to allow emitting the move immediate.
Use add/su
ious bugs in aarch64_final_eh_return_addr, which does not work with
-fomit-frame-pointer, alloca or outgoing arguments.
Bootstrap OK, GCC Regression OK, OK for trunk? Would it be useful to backport
this to GCC6.x?
ChangeLog:
2016-09-02 Wilco Dijkstra <wdijk...@arm.com>
PR77455
gcc/
or outgoing arguments.
Bootstrap OK, GCC Regression OK, OK for trunk? Would it be useful to backport
this to GCC6.x?
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.md (eh_return): Remove pattern and splitter.
* config/aarch64/aar
mostly QoI, what we have here is a case where legal
code fails to link due to optimization. The original example is from GCC itself,
the fixed_regs array is small but due to optimization we can end up with
_regs + 0x.
Wilco
> ChangeLog:
> 2016-08-23 Wilco Dijkstra &
Jiong Wang wrote:
> The -fomit-frame-pointer is really broken on aarch64_find_eh_return_addr
Yes, that's a good conclusion. However, even with the frame pointer there are
cases that fail, for example it will access LR off SP even after alloca. In fact
we're lucky it
trap, GCC regression OK.
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_legitimize_address_displacement):
New function.
(TARGET_LEGITIMIZE_ADDRESS_DISPLACEMENT): Define.
--
diff --git a/gcc/config/aarch64/aarch64
ion using simple wrapper functions and no
default parameters:
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_add_constant_internal):
Add extra argument to allow emitting the move immediate.
Use add/su
arguments.
Bootstrap OK, GCC Regression OK, OK for trunk? Would it be useful to backport
this to GCC6.x?
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.md (eh_return): Remove pattern and splitter.
* config/aarch64/aar
failing with symbol out of
range.
OK for commit? As this is a latent bug, OK to backport to GCC6.x?
ChangeLog:
2016-08-23 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_classify_symbol):
Apply reasonable limit to symbol offsets.
tes
egression OK.
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_legitimize_address_displacement):
New function.
(TARGET_LEGITIMIZE_ADDRESS_DISPLACEMENT): Define.
--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc
arguments.
Bootstrap OK, GCC Regression OK, OK for trunk? Would it be useful to backport
this to GCC6.x?
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.md (eh_return): Remove pattern and splitter.
* config/aarch64/aar
mple wrapper functions and no
default parameters:
ChangeLog:
2016-08-10 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_add_constant_internal):
Add extra argument to allow emitting the move immediate.
Use add/sub with positive immediate
implementation also
fixes various bugs in aarch64_final_eh_return_addr, which does not work with
-fomit-frame-pointer, alloca or outgoing arguments.
Bootstrap OK, GCC Regression OK, OK for trunk? Would it be useful to backport
this to GCC6.x?
ChangeLog:
2016-08-04 Wilco Dijkstra <wdijk...@arm.
ret
Passes GCC regression tests.
ChangeLog:
2016-08-04 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_add_constant):
Add extra argument to allow emitting the move immediate.
Use add/sub with positive imm
add x1, sp, x1
str wzr, [x1]
add x1, sp, x2
str wzr, [x1]
add x1, sp, x3
str wzr, [x1]
ldr w0, [sp, w0, sxtw 2]
add sp, sp, 32768
ret
Bootstrap, GCC regression OK.
ChangeLog:
2016-08-04 Wilco Dijkstra
(if frame_pointer_needed)
4. stp reg3, reg4, [sp, callee_offset + N*16] (store remaining callee-saves)
5. sub sp, sp, final_adjust
The epilog reverses this, and may omit step 3 if alloca wasn't used.
Bootstrap, GCC & gdb regression OK.
ChangeLog:
2016-07-29 Wilco Dijk
This patch improves the readability of the prolog and epilog code by moving
some code into separate functions. There is no difference in generated code.
OK for commit?
ChangeLog:
2016-07-26 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_pushwb_pa
Richard Earnshaw wrote:
> So under what circumstances does it lead to sub-optimal code?
If the cost is incorrect Combine can make the wrong decision, for example
whether to emit a multiply-add or not. I'm not sure whether this still happens
as Kyrill fixed several issues in Combine since this
Richard Earnshaw wrote:
> Both of which look reasonable to me.
Yes the code we generate for these examples is fine, I don't believe this
example ever went bad. It's just the cost calculation that is incorrect with
the outer check.
Wilco
Richard Earnshaw wrote:
> Why does combine care what the cost is if the instruction isn't valid?
No idea. Combine does lots of odd things that don't make sense to me.
Unfortunately the costs we give for cases like this need to be accurate or
they negatively affect code quality. The reason for
Richard Earnshaw wrote:
> I'm not sure about this, while rtx_cost is called recursively as it
> walks the RTL, I'd normally expect the outer levels of the recursion to
> catch the cases where zero-extend is folded into a more complex
> operation. Hitting a case like this suggests that something
where UBFM has
the same performance as AND, and minor speedups across several
benchmarks on an implementation where UBFM is slower than AND.
Bootstrapped and tested on aarch64-none-elf.
2016-07-19 Kristina Martsenko <kristina.martse...@arm.com>
2016-07-19 Wilco Dijkstra <wdijk..
When zero extending a 32-bit value to 64 bits, there should always be a
SET operation on the outside, according to the patterns in aarch64.md.
However, the mid-end can also ask for the cost of a made-up instruction,
where the zero-extend is part of another operation, not SET.
In this case we
This patchset improves zero extend costs and code generation.
When zero extending a 32-bit register, we emit a "mov", but currently
report the cost of the "mov" incorrectly.
In terms of speed, we currently say the cost is that of an extend
operation. But the cost of a "mov" is the cost of 1
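The zero-extension being costed can be seen at the source level; a minimal sketch (the function name is hypothetical, and the AArch64 codegen claim is the behaviour described in the thread):

```c
#include <stdint.h>

/* Zero-extend a 32-bit value to 64 bits.  On AArch64 any write to a
   w-register already clears the upper 32 bits, so this typically
   compiles to a single "mov w0, w0" rather than a separate extend
   instruction -- hence the cost should be that of a move, not of an
   extend operation.  */
uint64_t zero_extend (uint32_t x)
{
  return (uint64_t) x;
}
```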
Fix prototype in vst1Q_laneu64-1.c to unsigned char* so it passes.
Committed as trivial fix.
ChangeLog
2016-07-06 Wilco Dijkstra <wdijk...@arm.com>
gcc/testsuite/
* gcc.target/arm/vst1Q_laneu64-1.c (foo): Use unsigned char*.
---
diff --git a/gcc/testsuite/gcc.targ
This patch improves the accuracy of the Cortex-A53 integer scheduler,
resulting in performance gains across a wide range of benchmarks.
OK for commit?
ChangeLog:
2016-07-05 Wilco Dijkstra <wdijk...@arm.com>
* config/arm/cortex-a53.md: Use final_presence_set for in
Evandro Menezes wrote:
On 06/29/16 07:59, James Greenhalgh wrote:
> On Tue, Jun 21, 2016 at 02:39:23PM +0100, Wilco Dijkstra wrote:
>> ping
>>
>>
>> From: Wilco Dijkstra
>> Sent: 03 June 2016 11:51
>> To: GCC Patches
>> Cc: nd; philipp.toms...@theobrom
code is now more similar as well as more
optimal
across Cortex cores.
Regress passes, OK for commit?
ChangeLog:
2016-06-29 Wilco Dijkstra <wdijk...@arm.com>
* config/aarch64/aarch64.c (cortexa35_tunings):
Enable AES fusion. Use cortexa57_branch_cost.
(cortexa53_t
number
never matches in big-endian.
The test gcc.dg/optimize-bswapsi-4.c now passes on AArch64, no other changes.
OK for commit?
ChangeLog:
2016-06-23 Wilco Dijkstra <wdijk...@arm.com>
* gcc/tree-ssa-math-opts.c (find_bswap_or_nop_1): Adjust bitnumbering
for big-
to return
true for AArch64 so these tests are run on AArch64 too.
Committed as trivial patch in r237653.
ChangeLog:
2016-06-21 Wilco Dijkstra <wdijk...@arm.com>
gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/vrnd.c
(dg-require-effective-target): Use arm_v8_n
Fix tree-ssa/attr-hotcold-2.c failures now that the test runs.
GCC dumps the blocks 3 times so update count to 3 and the test passes.
ChangeLog:
2016-06-21 Wilco Dijkstra <wdijk...@arm.com>
gcc/testsuite/
* gcc.dg/tree-ssa/attr-hotcold-2.c (scan-tree-dump-times):
Set to 3 s
Due to recent improvements to the vectorizer, the number of vectorized
loops needs to be increased to 21 in gfortran.dg/vect/vect-8.f90.
Confirmed this test now passes on AArch64.
Committed as trivial patch in r237650.
ChangeLog:
2016-06-21 Wilco Dijkstra <wdijk...@arm.
ping
From: Wilco Dijkstra
Sent: 03 June 2016 11:51
To: GCC Patches
Cc: nd; philipp.toms...@theobroma-systems.com; pins...@gmail.com;
jim.wil...@linaro.org; benedikt.hu...@theobroma-systems.com; Evandro Menezes
Subject: [PATCH][AArch64] Increase code alignment
Increase loop alignment
This patch cleans up the -mpc-relative-loads option processing. Rename to
avoid the
"no*" name and confusing !no* expressions. Fix the option processing code to
implement
-mno-pc-relative-loads rather than ignore it.
OK for commit?
ChangeLog:
2016-06-03 Wilco Dijkstra <wdi
agree on alignment of
16 for functions, and 8 for loops and branches, so we should change
-mcpu=generic as well if there is no disagreement - feedback welcome.
OK for commit?
ChangeLog:
2016-05-03 Wilco Dijkstra <wdijk...@arm.com>
* gcc/config/aarch64/aarch64.c (cortexa53_t
ping
From: Wilco Dijkstra
Sent: 17 May 2016 17:08
To: James Greenhalgh
Cc: gcc-patches@gcc.gnu.org; nd
Subject: Re: [PATCH][AArch64] Improve aarch64_modes_tieable_p
James Greenhalgh wrote:
> It would be handy if you could raise something in bugzi
The Cortex-A57 scheduler is missing fcsel, so add it.
OK for commit?
ChangeLog:
2016-06-02 Wilco Dijkstra <wdijk...@arm.com>
* config/arm/cortex-a57.md (cortex_a57_fp_cpys): Add fcsel.
---
diff --git a/gcc/config/arm/cortex-a57.md b/gcc/config/arm/cortex-a57.md
other fixes in that bug.
Sure.
>
> ChangeLog:
> 2016-05-19 Wilco Dijkstra <wdijk...@arm.com>
>
> * gcc/config/aarch64/aarch64.h
> (CANNOT_CHANGE_MODE_CLASS): Remove.
> * gcc/config/aarch64/aarch64.c
> (aarch64_cannot_change_mode_class): Remove fun
James Greenhalgh wrote:
> I really don't like [1][2][3] this technique of attempting to work around
> register allocator issues using the disparaging mechanisms.
I don't see the issue as it is a standard mechanism to describe higher cost
to the register allocator. On the other hand the use of '*'
Jim Wilson wrote:
> It looks like a slight lose on qdf24xx on SPEC CPU2006 at -O3. I see
> about a 0.37% loss on the integer benchmarks, and no significant
> change on the FP benchmarks. The integer loss is mainly due to
> 458.sjeng which drops 2%. We had tried various values for
>
Remove aarch64_cannot_change_mode_class as the underlying issue
(PR67609) has been resolved. This avoids a few unnecessary lane
widening operations like:
faddp d18, v18.2d
mov d18, v18.d[0]
Passes regress, OK for commit?
ChangeLog:
2016-05-19 Wilco Dijkstra <wdijk...@arm.
Richard Biener wrote:
>
> Yeah ;) I'm currently bootstrapping/testing the patch that makes it possible
> to
> write all this in match.pd.
So what was the conclusion? Improving match.pd to be able to handle more cases
like this seems like a nice thing.
Wilco
James Greenhalgh wrote:
> It would be handy if you could raise something in bugzilla for the
> register allocator deficiency.
The register allocation issues are well known and we have multiple
workarounds for this in place. When you allow modes to be tieable
the workarounds are not as effective.
ping
From: Wilco Dijkstra
Sent: 22 April 2016 16:35
To: gcc-patches@gcc.gnu.org
Cc: nd
Subject: [PATCH][AArch64] Adjust SIMD integer preference
SIMD operations like combine prefer to have their operands in FP registers,
so increase the cost of integer
James Greenhalgh wrote:
> As this change will change code generation for all cores (except
> Exynos-M1), I'd like to hear from those with more detailed knowledge of
> ThunderX, X-Gene and qdf24xx before I take this patch.
>
> Let's give it another week or so for comments, and expand the CC list.
ping
From: Wilco Dijkstra
Sent: 22 April 2016 17:15
To: gcc-patches@gcc.gnu.org
Cc: nd
Subject: [PATCH][AArch64] Improve aarch64_case_values_threshold setting
GCC expands switch statements in a very simplistic way and tries to use a table
expansion even
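For context, a sparse switch like the hypothetical one below is the kind of case where a jump table is usually a poor choice and a short sequence of conditional branches wins (the function and case values here are illustrative, not from the patch):

```c
/* A switch with few, spread-out case values: a jump table for this
   would be mostly holes, so a threshold on the number of case values
   decides between table expansion and a conditional-branch sequence.  */
int classify (int c)
{
  switch (c)
    {
    case 0:   return 1;
    case 10:  return 2;
    case 100: return 3;
    default:  return 0;
    }
}
```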
ping
From: Wilco Dijkstra
Sent: 27 April 2016 17:39
To: James Greenhalgh
Cc: gcc-patches@gcc.gnu.org; nd
Subject: Re: [PATCH][AArch64] print_operand should not fallthrough from
register operand into generic operand
James Greenhalgh wrote:
> So the p
>> The new version does not seem better, as it adds a branch on the path
>> and it is not smaller.
>
> That looks like bb-reorder isn't doing its job? Maybe it thinks that
> pop is too expensive to copy?
It relies on static branch probabilities, which are set completely wrong in GCC,
so it ends
Ramana Radhakrishnan wrote:
>
> Can you file a bugzilla entry with a testcase that folks can look at please ?
I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961. Unfortunately
I don't have a simple testcase that I can share.
Wilco
Bernd Schmidt wrote:
> On 05/04/2016 03:25 PM, Ramana Radhakrishnan wrote:
>> On ARM / AArch32 I haven't seen any performance data yet - the one place we
>> are concerned
>> about the impact is on Thumb2 code size as regrename may end up
>> inadvertently putting more
>> things in high
Kyrill Tkachov wrote:
> On 25/04/16 20:21, Wilco Dijkstra wrote:
> > The GCC switch expansion is awful, so
> > even with a good indirect predictor it is better to use conditional
> > branches.
>
> In what way is it awful? If there's something we can do better at
James Greenhalgh wrote:
> So the part of this patch removing the fallthrough to general operand
> is not OK for trunk.
>
> The other parts look reasonable to me, please resubmit just those.
Right, I removed the removal of the fallthrough. Here is the revised version:
ChangeLog:
2016-
James Greenhalgh wrote:
> So this is off for all cores currently supported by GCC?
>
> I'm not sure I understand why we should take this if it will immediately
> be dead code?
I presume it was meant to have the vector variants enabled with -mcpu=exynos-m1
as that is where you can get a good gain
Evandro Menezes wrote:
>
> True, but the results when running on A53 could be quite different.
GCC is ~1.2% faster on Cortex-A53 built for generic, but there is no
difference in perlbench.
Wilco
Evandro Menezes wrote:
>On 03/10/16 10:37, James Greenhalgh wrote:
>> Thanks for sticking with it. This is OK for GCC 7 when development
>> opens.
>>
>> Remember to mention the most recent changes in your Changelog entry
>> (Remove "fp" attribute from *movhf_aarch64 and *movtf_aarch64).
>
>
> OK
Evandro Menezes wrote:
> I agree with your assessment, but I'm more curious to understand how
> this change affects code built with the default -mcpu=generic when run
> on both A53 and A57, the typical configuration of big.LITTLE machines.
I wouldn't expect the result to be any different as the
Evandro Menezes wrote:
> I assume that you mean that such improvements are true for
> -mcpu=generic, yes? On which target, A53 or A57 or other?
It's true for any CPU setting. The SPEC results are for Cortex-A57
however I wrote a microbenchmark that shows improvements on
all targets I have
ex that automatically does the right thing?
>> +Enable PC relative literal loads. With this option literal pools are
Fixed, new version below:
2016-04-22 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* gcc/doc/invoke.texi (AArch64 Options): Update.
--
diff --git a/gcc/doc/invok
in performance (1-2%) as well as size
(0.5-1% smaller).
OK for trunk?
ChangeLog:
2016-04-22 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.c (aarch64_case_values_threshold):
Return a better case_values_threshold when optimizing.
--
diff --git a/gcc/config/a
SIMD operations like combine prefer to have their operands in FP registers,
so increase the cost of integer registers slightly to avoid unnecessary int<->FP
moves. This improves register allocation of scalar SIMD operations.
OK for trunk?
ChangeLog:
2016-04-22 Wilco Dijkstra <wdijk..
://gcc.gnu.org/ml/gcc-patches/2016-04/msg01265.html to
be committed first)
ChangeLog:
2016-04-22 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* config/aarch64/aarch64.md
(add3_compareC_cconly_imm): Remove use of %w for immediate.
(add3_compareC_imm): Likewise.
(si
This patch fixes the attributes of integer immediate shifts which were
incorrectly modelled as register controlled shifts. Also change EXTR attribute to being a
rotate.
OK for trunk?
ChangeLog:
2016-04-22 Wilco Dijkstra <wdijk...@arm.com>
* gcc/config/aarch64/aarc
orr v0.8b, v1.8b, v0.8b
OK for trunk?
ChangeLog:
2016-04-22 Wilco Dijkstra <wdijk...@arm.com>
* gcc/config/aarch64/aarch64.c (aarch64_modes_tieable_p):
Allow scalar/single vector modes to be tieable.
---
gcc/config/aarch64/aarch64.c | 18 ++
Update documentation of AArch64 options for GCC6 to be more accurate,
fix a few minor mistakes and remove some duplication.
Tested with "make info dvi pdf html" and checked resulting PDF is as expected.
OK for trunk and backport to GCC6.1 branch?
ChangeLog:
2016-04-21 Wilco Dijkst
Jakub Jelinek wrote:
> On Wed, Apr 20, 2016 at 11:17:06AM +0000, Wilco Dijkstra wrote:
>> Can you quantify "don't like"? I benchmarked rawmemchr on a few targets
>> and it's slower than strlen, so it's hard to guess what you don't like about
>> it.
>
> This is
Richard Biener wrote:
> Better - comments below. Jakub objections to the usefulness of the transform
> remain - we do have the strlen pass that uses some global knowledge to decide
> on profitability. One could argue that for -Os doing the reverse transform is
> profitable?
In what way would it
Jakub Jelinek wrote:
> I still don't like this transformation and would very much prefer to see
> using rawmemchr instead on targets that provide it, and also this is
> something that IMHO should be done in the tree-ssa-strlen.c pass together
> with the other optimizations in there. Similarly to
this kind of thing either.
Wilco
ChangeLog:
2016-04-19 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* gcc/gimple-fold.c (gimple_fold_builtin_strchr):
New function to optimize strchr (s, 0) to strlen.
(gimple_fold_builtin): Add BUILT_IN_STRCHR case.
testsuite/
Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Apr 18, 2016 at 05:00:45PM +0000, Wilco Dijkstra wrote:
>> Optimize strchr (s, 0) to s + strlen (s). strchr (s, 0) appears a common
>> idiom for finding the end of a string, however it is not a very efficient
>> way of
twice as fast as strchr on strings of 1KB).
OK for trunk?
ChangeLog:
2016-04-18 Wilco Dijkstra <wdijk...@arm.com>
gcc/
* gcc/builtins.c (fold_builtin_strchr): Optimize strchr (s, 0) into
strlen.
testsuite/
* gcc/testsuite/gcc.dg/strlenopt-20.c: Update test.
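The equivalence this fold relies on can be shown directly; a minimal sketch (function names are mine, for illustration):

```c
#include <string.h>

/* strchr (s, 0) returns a pointer to the terminating NUL, which is
   exactly s + strlen (s).  The fold rewrites the former into the
   latter, since strlen is typically the faster routine.  */
const char *end_via_strchr (const char *s)
{
  return strchr (s, '\0');
}

const char *end_via_strlen (const char *s)
{
  return s + strlen (s);
}
```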
Evandro Menezes wrote:
>
> I hope that this gets in the ballpark of what's been discussed previously.
Yes that's very close to what I had in mind. A minor issue is that the vector
modes cannot work as they start at MAX_MODE_FLOAT (which is > 32):
+/* Control approximate alternatives to certain
Evandro Menezes wrote:
> However, I don't think that there's the need to handle any special case
> for division. The only case when the approximation differs from
> division is when the numerator is infinity and the denominator, zero,
> when the approximation returns infinity and the division,
Evandro Menezes wrote:
> > The division variant should use the same latency reduction trick I
> > mentioned for sqrt.
>
> I don't think that it applies here, since it doesn't have to deal with
> special cases.
No it applies as it's exactly the same calculation: x * rsqrt(y) and x *
recip(y). In
Evandro Menezes wrote:
On 03/23/16 11:24, Evandro Menezes wrote:
> On 03/17/16 15:09, Evandro Menezes wrote:
>> This patch implements FP division by an approximation using the Newton
>> series.
>>
>> With this patch, DF division is sped up by over 100% and SF division,
>> zilch, both on A57 and on
Evandro Menezes wrote:
>
> Ping^1
I haven't seen a newer version that incorporates my feedback. To recap what
I'd like to see is a more general way to select approximations based on mode.
I don't believe that looking at the inner mode works in general, and it doesn't
make sense to add internal
Hi Evandro,
> For example, though this approximation improves the performance
> noticeably for DF on A57, for SF, not so much, if at all.
I'm still skeptical that you ever can get any gain on scalars. I bet the only
gain is on
4x vectorized floats.
So what I would like to see is this