Re: [PATCH 11/24] Disable year 2038 support on 32-bit hosts by default

2023-08-08 Thread Luis Machado via Gcc-patches
On 8/7/23 18:45, Jeff Law wrote:
> 
> 
> On 8/7/23 04:32, Arsen Arsenović via Gcc-patches wrote:
>> From: Luis Machado 
>>
>> With a recent import of gnulib, code has been pulled that tests and enables
>> 64-bit time_t by default on 32-bit hosts that support it.
>>
>> Although gdb can use the gnulib support, bfd doesn't use gnulib and currently
>> doesn't do these checks.
>>
>> As a consequence, if we have a 32-bit host that supports 64-bit time_t, we'll
>> have a mismatch between gdb's notion of time_t and bfd's notion of time_t.
>>
>> This will lead to mismatches in the struct stat size, leading to memory
>> corruption and crashes.
>>
>> This patch disables the year 2038 check for now, which makes things work
>> reliably again.
>>
>> I'd consider this a temporary fix until we have proper bfd checks for the
>> year 2038, if it makes sense.  64-bit hosts seem to be more common these
>> days, so I'm not sure how important it is to have this support enabled and
>> how soon we want to enable it.
>>
>> Thoughts?
>>
>> ChangeLog:
>>
>> * configure.ac: Disable year2038 by default on 32-bit hosts.
>> * configure: Regenerate.
> OK.  Though I thought gnulib ended up reverting some of this work.
> 
> Jeff

I sent this a little while ago. It's possible that things have changed in 
gnulib since then.

This was mostly addressing issues building binutils-gdb on 32-bit hosts with 
all targets enabled.
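To illustrate the mismatch described above, here is a minimal sketch (not from
the patch; the split into two files is hypothetical, and the compile flags are
the standard glibc 64-bit time/offset knobs that gnulib turns on):

/* a.c is built the gnulib way, like gdb:
     gcc -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 -c a.c
   b.c is built with no such checks, like bfd:
     gcc -c b.c
   On a 32-bit host the two halves then disagree on sizeof (struct stat),
   so passing the struct across the boundary reads and writes past the
   smaller layout.  */
#include <sys/stat.h>
#include <stdio.h>

void
print_mtime (struct stat *st)   /* imagine this in b.c (32-bit time_t view) */
{
  printf ("mtime: %lld\n", (long long) st->st_mtime);
}

int
main (void)                     /* and this in a.c (64-bit time_t view) */
{
  struct stat st;
  if (stat ("/", &st) == 0)
    print_mtime (&st);          /* layouts differ: fields past the smaller
                                   struct's end are effectively garbage */
  return 0;
}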


Re: [PATCH 1/3] GCC: Pass --plugin to AR and RANLIB

2021-01-11 Thread Luis Machado via Gcc-patches

This seems to have broken the builds on AArch64-Linux Ubuntu 18.04.

make[2]: Entering directory 'binutils-gdb-master-bionic/libiberty'
rm -f ./libiberty.a pic/./libiberty.a noasan/./libiberty.a
ar --plugin /usr/lib/gcc/aarch64-linux-gnu/7/liblto_plugin.so rc 
./libiberty.a \
  ./regex.o ./cplus-dem.o ./cp-demangle.o ./md5.o ./sha1.o ./alloca.o 
./argv.o ./bsearch_r.o ./choose-temp.o ./concat.o ./cp-demint.o 
./crc32.o ./d-demangle.o ./dwarfnames.o ./dyn-string.o ./fdmatch.o 
./fibheap.o ./filedescriptor.o ./filename_cmp.o ./floatformat.o 
./fnmatch.o ./fopen_unlocked.o ./getopt.o ./getopt1.o ./getpwd.o 
./getruntime.o ./hashtab.o ./hex.o ./lbasename.o ./lrealpath.o 
./make-relative-prefix.o ./make-temp-file.o ./objalloc.o ./obstack.o 
./partition.o ./pexecute.o ./physmem.o ./pex-common.o ./pex-one.o 
./pex-unix.o ./vprintf-support.o ./rust-demangle.o ./safe-ctype.o 
./simple-object.o ./simple-object-coff.o ./simple-object-elf.o 
./simple-object-mach-o.o ./simple-object-xcoff.o ./sort.o ./spaces.o 
./splay-tree.o ./stack-limit.o ./strerror.o ./strsignal.o 
./timeval-utils.o ./unlink-if-ordinary.o ./xasprintf.o ./xatexit.o 
./xexit.o ./xmalloc.o ./xmemdup.o ./xstrdup.o ./xstrerror.o ./xstrndup.o 
./xvasprintf.o  ./setproctitle.o

ar: no operation specified
Makefile:252: recipe for target 'libiberty.a' failed
make[2]: *** [libiberty.a] Error 1

Reverting that patch makes the build OK again.

On 10/29/20 4:11 PM, H.J. Lu via Binutils wrote:

Detect GCC LTO plugin.  Pass --plugin to AR and RANLIB to support LTO
build.

* Makefile.tpl (AR): Add @AR_PLUGIN_OPTION@
(RANLIB): Add @RANLIB_PLUGIN_OPTION@.
* configure.ac: Include config/gcc-plugin.m4.
AC_SUBST AR_PLUGIN_OPTION and RANLIB_PLUGIN_OPTION.
* libtool.m4 (_LT_CMD_OLD_ARCHIVE): Pass --plugin to AR and
RANLIB if possible.
* Makefile.in: Regenerated.
* configure: Likewise.

config/

* gcc-plugin.m4 (GCC_PLUGIN_OPTION): New.

libiberty/

* Makefile.in (AR): Add @AR_PLUGIN_OPTION@
(RANLIB): Add @RANLIB_PLUGIN_OPTION@.
(configure_deps): Depend on ../config/gcc-plugin.m4.
* aclocal.m4: Include ../config/gcc-plugin.m4.
* configure.ac: AC_SUBST AR_PLUGIN_OPTION and
RANLIB_PLUGIN_OPTION.
* configure: Regenerated.

zlib/

* configure: Regenerated.
---
  Makefile.in|   5 +-
  Makefile.tpl   |   5 +-
  config/gcc-plugin.m4   |  28 ++
  configure  |  39 
  configure.ac   |  15 +++
  libiberty/Makefile.in  |   5 +-
  libiberty/aclocal.m4   |   1 +
  libiberty/configure|  37 
  libiberty/configure.ac |  12 +++
  libtool.m4 |  25 -
  zlib/configure | 206 -
  11 files changed, 368 insertions(+), 10 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index fe34132f9e..978e777338 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -387,7 +387,7 @@ MAKEINFOFLAGS = --split-size=500
  # -
  
  AS = @AS@

-AR = @AR@
+AR = @AR@ @AR_PLUGIN_OPTION@
  AR_FLAGS = rc
  CC = @CC@
  CXX = @CXX@
@@ -396,7 +396,7 @@ LD = @LD@
  LIPO = @LIPO@
  NM = @NM@
  OBJDUMP = @OBJDUMP@
-RANLIB = @RANLIB@
+RANLIB = @RANLIB@ @RANLIB_PLUGIN_OPTION@
  READELF = @READELF@
  STRIP = @STRIP@
  WINDRES = @WINDRES@
@@ -52633,6 +52633,7 @@ AUTOCONF = autoconf
  $(srcdir)/configure: @MAINT@ $(srcdir)/configure.ac $(srcdir)/config/acx.m4 \
$(srcdir)/config/override.m4 $(srcdir)/config/proginstall.m4 \
$(srcdir)/config/elf.m4 $(srcdir)/config/isl.m4 \
+   $(srcdir)/config/gcc-plugin.m4 \
$(srcdir)/libtool.m4 $(srcdir)/ltoptions.m4 $(srcdir)/ltsugar.m4 \
$(srcdir)/ltversion.m4 $(srcdir)/lt~obsolete.m4
cd $(srcdir) && $(AUTOCONF)
diff --git a/Makefile.tpl b/Makefile.tpl
index 5b118a8ba4..a280a1498c 100644
--- a/Makefile.tpl
+++ b/Makefile.tpl
@@ -390,7 +390,7 @@ MAKEINFOFLAGS = --split-size=500
  # -
  
  AS = @AS@

-AR = @AR@
+AR = @AR@ @AR_PLUGIN_OPTION@
  AR_FLAGS = rc
  CC = @CC@
  CXX = @CXX@
@@ -399,7 +399,7 @@ LD = @LD@
  LIPO = @LIPO@
  NM = @NM@
  OBJDUMP = @OBJDUMP@
-RANLIB = @RANLIB@
+RANLIB = @RANLIB@ @RANLIB_PLUGIN_OPTION@
  READELF = @READELF@
  STRIP = @STRIP@
  WINDRES = @WINDRES@
@@ -1967,6 +1967,7 @@ AUTOCONF = autoconf
  $(srcdir)/configure: @MAINT@ $(srcdir)/configure.ac $(srcdir)/config/acx.m4 \
$(srcdir)/config/override.m4 $(srcdir)/config/proginstall.m4 \
$(srcdir)/config/elf.m4 $(srcdir)/config/isl.m4 \
+   $(srcdir)/config/gcc-plugin.m4 \
$(srcdir)/libtool.m4 $(srcdir)/ltoptions.m4 $(srcdir)/ltsugar.m4 \
$(srcdir)/ltversion.m4 $(srcdir)/lt~obsolete.m4
cd $(srcdir) && $(AUTOCONF)
diff --git a/config/gcc-plugin.m4 b/config/gcc-plugin.m4
index 8f27871911..c5b72e9a13 100644
--- a/config/gcc-plugin.m4
+++ b/config/gcc-plugin.m4
@@ -124,3 +124,31 @@ 

Re: [stage1][PATCH] Add gcc_assert that _options are not dirty modified.

2020-06-18 Thread Luis Machado via Gcc-patches

On 6/18/20 3:34 PM, Martin Liška wrote:

On 6/18/20 7:18 PM, Luis Machado wrote:

On 6/18/20 1:02 PM, Martin Liška wrote:

On 6/18/20 5:47 PM, Luis Machado wrote:
That's another one I noticed alongside the first one I reported. It's 
good that you managed to reproduce it.


Can you please send me your .config so I can reproduce it locally?

Thanks,
Martin


Here it is. It is a defconfig with a gas that supports aarch64 memtag 
(binutils-gdb master will do). I hope this helps. Otherwise I can 
compile the information and dump it in a ticket later.


In that case, please attach the output of the -E option for the problematic 
compilation unit.


Martin


Here it is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95753

Sorry for the delay. I was busy with something else.


Re: [stage1][PATCH] Add gcc_assert that _options are not dirty modified.

2020-06-18 Thread Luis Machado via Gcc-patches

On 6/18/20 1:02 PM, Martin Liška wrote:

On 6/18/20 5:47 PM, Luis Machado wrote:
That's another one I noticed alongside the first one I reported. It's 
good that you managed to reproduce it.


Can you please send me your .config so I can reproduce it locally?

Thanks,
Martin


Here it is. It is a defconfig with a gas that supports aarch64 memtag 
(binutils-gdb master will do). I hope this helps. Otherwise I can 
compile the information and dump it in a ticket later.


.config.gz
Description: application/gzip


Re: [stage1][PATCH] Add gcc_assert that _options are not dirty modified.

2020-06-18 Thread Luis Machado via Gcc-patches

On 6/18/20 12:40 PM, Martin Liška wrote:

I see the following ICE for aarch64 kernel build:

$ cat neon.i
#pragma GCC push_options
#pragma GCC target "arch=armv8.2-a+bf16"
#pragma GCC pop_options

$ ./xgcc -B. ~/Programming/testcases/neon.i -c -mbranch-protection=pac-ret
/home/marxin/Programming/testcases/neon.i:3:9: internal compiler error: 
‘global_options’ are modified in local context

     3 | #pragma GCC pop_options
   | ^~~
0xf73 cl_optimization_compare(gcc_options*, gcc_options*)
 /dev/shm/objdir3/gcc/options-save.c:11996
0xb02ff4 handle_pragma_pop_options
 /home/marxin/Programming/gcc/gcc/c-family/c-pragma.c:1090
0xb03953 c_invoke_pragma_handler(unsigned int)
 /home/marxin/Programming/gcc/gcc/c-family/c-pragma.c:1512
0xa5ae39 c_parser_pragma
 /home/marxin/Programming/gcc/gcc/c/c-parser.c:12544
0xa3f9fc c_parser_external_declaration
 /home/marxin/Programming/gcc/gcc/c/c-parser.c:1754
0xa3f5c8 c_parser_translation_unit
 /home/marxin/Programming/gcc/gcc/c/c-parser.c:1646
0xa7db4d c_parse_file()
 /home/marxin/Programming/gcc/gcc/c/c-parser.c:21822
0xafd0b6 c_common_parse_file()
 /home/marxin/Programming/gcc/gcc/c-family/c-opts.c:1190
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

#1  0x0f74 in cl_optimization_compare (ptr1=0x2d5e3f0, 
ptr2=0x2cc7760 <global_options>) at options-save.c:11996
11996	  internal_error ("%<global_options%> are modified in local context");

(gdb) p ptr2->x_aarch64_branch_protection_string
$2 = 0x2cf52e0 "pac-ret"
(gdb) p ptr1->x_aarch64_branch_protection_string
$3 = 0x2d3c190 "pac-ret"


    │11995	  if (ptr1->x_aarch64_branch_protection_string != ptr2->x_aarch64_branch_protection_string)
   >│11996	    internal_error ("%<global_options%> are modified in local context");


This is bogus as these are 2 strings that are equal. Let me fix it.

Martin


That's another one I noticed alongside the first one I reported. It's 
good that you managed to reproduce it.
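To illustrate the class of bug Martin describes above, a minimal sketch (not
the actual options-save.c code): two heap strings with equal contents compare
unequal by pointer, so an identity comparison flags a spurious modification
where a content comparison would not.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (void)
{
  const char *a = strdup ("pac-ret");   /* two copies, equal contents */
  const char *b = strdup ("pac-ret");
  if (a != b)                           /* pointer identity: always "modified" */
    printf ("pointer compare: differs\n");
  if (strcmp (a, b) == 0)               /* content equality: unchanged */
    printf ("strcmp: equal\n");
  free ((void *) a);
  free ((void *) b);
  return 0;
}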


Re: [stage1][PATCH] Add gcc_assert that _options are not dirty modified.

2020-06-18 Thread Luis Machado via Gcc-patches
FTR, I'm running into this ICE when attempting to build the Linux Kernel 
for arm64.


More specifically:

/repos/linux-arm/kernel/bpf/core.c:1368:1: internal compiler error: 
‘global_options’ are modified in local context

 1368 | {
  | ^
0xc0554b cl_optimization_compare(gcc_options*, gcc_options*)
/build/gcc-master/gcc/options-save.c:9786
0x7812df handle_optimize_attribute
../../../repos/gcc/gcc/c-family/c-attribs.c:4475
0x666477 decl_attributes(tree_node**, tree_node*, int, tree_node*)
../../../repos/gcc/gcc/attribs.c:714
0x688c9b start_function(c_declspecs*, c_declarator*, tree_node*)
../../../repos/gcc/gcc/c/c-decl.c:9177
0x6f85f3 c_parser_declaration_or_fndef
../../../repos/gcc/gcc/c/c-parser.c:2434
0x7038af c_parser_external_declaration
../../../repos/gcc/gcc/c/c-parser.c:1773
0x7044c7 c_parser_translation_unit
../../../repos/gcc/gcc/c/c-parser.c:1646
0x7044c7 c_parse_file()
../../../repos/gcc/gcc/c/c-parser.c:21822
0x764897 c_common_parse_file()
../../../repos/gcc/gcc/c-family/c-opts.c:1190
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

I don't have a reduced testcase for this, but I thought I'd check if 
this is known/being worked on before filing a bug.


On 6/9/20 6:01 PM, Jeff Law via Gcc-patches wrote:

On Thu, 2020-05-21 at 14:04 +0200, Martin Liška wrote:

PING^1

On 3/20/20 4:55 PM, Martin Liška wrote:

Ok, it would be possible, but if you take a look at options-save.c there's no
function that will leverage that. It's generated code, so I guess we can
live with that?

Given the size increase is under control now, I think this is fine for the 
trunk.

jeff



Re: [PATCH 3/4] libgcc: fix the handling of return address mangling [PR94891]

2020-06-08 Thread Luis Machado via Gcc-patches

Hi Szabolcs,

Just to confirm, this is a "unwinder debugger hook ABI" change only in 
the sense that the generated DWARF will be changed, right? So no further 
action from DWARF consumers will be needed. Is that understanding correct?


On 6/5/20 1:51 PM, Szabolcs Nagy wrote:

Mangling, currently only used on AArch64 for return address signing,
is an internal representation that should not be exposed via

   __builtin_return_address return value,
   __builtin_eh_return handler argument,
   _Unwind_DebugHook handler argument.

Note that a mangled address might not even fit into a void *, e.g.
with the AArch64 ilp32 ABI the return address is stored as 64 bits, so
the mangled return address cannot be accessed via _Unwind_GetPtr.
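A small self-contained illustration of that point (the PAC bit positions below
are hypothetical, picked only for the demo): truncating a mangled 64-bit
return address to a 32-bit pointer silently drops the signature bits.

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  /* Hypothetical mangled return address: the PAC lives in the upper bits
     of the 64-bit word, beyond what a 32-bit pointer can represent.  */
  uint64_t mangled_ra = 0x002d000000401234ULL;
  uint32_t as_ilp32_ptr = (uint32_t) mangled_ra;   /* a void * round-trip */
  printf ("mangled: %#018llx\n", (unsigned long long) mangled_ra);
  printf ("as ptr:  %#010x (signature bits lost)\n", as_ilp32_ptr);
  return 0;
}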

This patch changes the unwinder hooks as follows:

MD_POST_EXTRACT_ROOT_ADDR is removed: root address comes from
__builtin_return_address which is not mangled.

MD_POST_EXTRACT_FRAME_ADDR is renamed to MD_DEMANGLE_RETURN_ADDR,
it now operates on _Unwind_Word instead of void *, so the hook
should work when return address signing is enabled on AArch64 ilp32.
(But for that __builtin_aarch64_autia1716 should be fixed to operate
on a 64-bit input instead of a void *.)

MD_POST_FROB_EH_HANDLER_ADDR is removed: it is the responsibility of
__builtin_eh_return to do the mangling if necessary.

libgcc/ChangeLog:

2020-06-04  Szabolcs Nagy  

* config/aarch64/aarch64-unwind.h (MD_POST_EXTRACT_ROOT_ADDR): Remove.
(MD_POST_FROB_EH_HANDLER_ADDR): Remove.
(MD_POST_EXTRACT_FRAME_ADDR): Rename to ...
(MD_DEMANGLE_RETURN_ADDR): This.
(aarch64_post_extract_frame_addr): Rename to ...
(aarch64_demangle_return_addr): This.
(aarch64_post_frob_eh_handler_addr): Remove.
* unwind-dw2.c (uw_update_context): Demangle return address.
(uw_frob_return_addr): Remove.
---
  libgcc/config/aarch64/aarch64-unwind.h | 34 --
  libgcc/unwind-dw2.c| 34 ++
  2 files changed, 13 insertions(+), 55 deletions(-)

diff --git a/libgcc/config/aarch64/aarch64-unwind.h 
b/libgcc/config/aarch64/aarch64-unwind.h
index ed84a96db41..b1d732e0b2d 100644
--- a/libgcc/config/aarch64/aarch64-unwind.h
+++ b/libgcc/config/aarch64/aarch64-unwind.h
@@ -27,11 +27,8 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
  
  #define DWARF_REGNUM_AARCH64_RA_STATE 34
  
-#define MD_POST_EXTRACT_ROOT_ADDR(addr)  __builtin_aarch64_xpaclri (addr)

-#define MD_POST_EXTRACT_FRAME_ADDR(context, fs, addr) \
-  aarch64_post_extract_frame_addr (context, fs, addr)
-#define MD_POST_FROB_EH_HANDLER_ADDR(current, target, addr) \
-  aarch64_post_frob_eh_handler_addr (current, target, addr)
+#define MD_DEMANGLE_RETURN_ADDR(context, fs, addr) \
+  aarch64_demangle_return_addr (context, fs, addr)
  #define MD_FROB_UPDATE_CONTEXT(context, fs) \
aarch64_frob_update_context (context, fs)
  
@@ -57,9 +54,10 @@ aarch64_cie_signed_with_b_key (struct _Unwind_Context *context)

 using CFA of current frame.  */
  
  static inline void *

-aarch64_post_extract_frame_addr (struct _Unwind_Context *context,
-_Unwind_FrameState *fs, void *addr)
+aarch64_demangle_return_addr (struct _Unwind_Context *context,
+ _Unwind_FrameState *fs, _Unwind_Word addr_word)
  {
+  void *addr = (void *)addr_word;
if (context->flags & RA_SIGNED_BIT)
  {
_Unwind_Word salt = (_Unwind_Word) context->cfa;
@@ -71,28 +69,6 @@ aarch64_post_extract_frame_addr (struct _Unwind_Context 
*context,
  return addr;
  }
  
-/* Do AArch64 private frob on exception handler's address HANDLER_ADDR before

-   installing it into current context CURRENT.  TARGET is currently not used.
-   We need to sign exception handler's address if CURRENT itself is signed.  */
-
-static inline void *
-aarch64_post_frob_eh_handler_addr (struct _Unwind_Context *current,
-  struct _Unwind_Context *target
-  ATTRIBUTE_UNUSED,
-  void *handler_addr)
-{
-  if (current->flags & RA_SIGNED_BIT)
-{
-  if (aarch64_cie_signed_with_b_key (current))
-   return __builtin_aarch64_pacib1716 (handler_addr,
-   (_Unwind_Word) current->cfa);
-  return __builtin_aarch64_pacia1716 (handler_addr,
-   (_Unwind_Word) current->cfa);
-}
-  else
-return handler_addr;
-}
-
  /* Do AArch64 private initialization on CONTEXT based on frame info FS.  Mark
 CONTEXT as return address signed if bit 0 of DWARF_REGNUM_AARCH64_RA_STATE 
is
 set.  */
diff --git a/libgcc/unwind-dw2.c b/libgcc/unwind-dw2.c
index 62d4a3d29a2..fe896565d2e 100644
--- a/libgcc/unwind-dw2.c
+++ b/libgcc/unwind-dw2.c
@@ -1538,11 +1538,14 @@ uw_update_context (struct _Unwind_Context *context, 
_Unwind_FrameState *fs)
  {
/* Compute the return address 

[PATCH] [OBVIOUS] Fix old file reference in gcc/cp/cp-gimplify.c

2019-10-16 Thread Luis Machado
I've found this stale reference while looking at cp-gimplify.c. tree-gimplify.c
no longer exists and its contents were merged into gimple.c.

Seems obvious enough.

gcc/cp/ChangeLog:

2019-10-16  Luis Machado  

* cp-gimplify.c: Fix reference to non-existing tree-gimplify.c file.

Signed-off-by: Luis Machado 
---
 gcc/cp/cp-gimplify.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/cp/cp-gimplify.c b/gcc/cp/cp-gimplify.c
index 154fa70ec06..0ab0438f601 100644
--- a/gcc/cp/cp-gimplify.c
+++ b/gcc/cp/cp-gimplify.c
@@ -1,4 +1,4 @@
-/* C++-specific tree lowering bits; see also c-gimplify.c and tree-gimple.c.
+/* C++-specific tree lowering bits; see also c-gimplify.c and gimple.c.
 
Copyright (C) 2002-2019 Free Software Foundation, Inc.
Contributed by Jason Merrill 
-- 
2.17.1



Re: [PATCH] [AArch64, Falkor] Adjust Falkor's sign extend reg+reg address cost

2018-08-27 Thread Luis Machado

Hi,

On 08/08/2018 04:46 AM, Siddhesh Poyarekar wrote:

On 08/01/2018 04:24 AM, James Greenhalgh wrote:

OK if this is what is best for your subtarget.



I have pushed this on behalf of Luis since he is on holiday.

Thanks,
Siddhesh


Similarly to the vector cost changes, we've also noticed a non-trivial 
performance impact from this change, on top of fixing a testsuite failure.


This is also Falkor-specific. Would it be OK to push it to GCC 8?

Thanks,
Luis


Re: [PATCH] [AArch64, Falkor] Switch to using Falkor-specific vector costs

2018-08-27 Thread Luis Machado

Hi,

On 08/08/2018 04:54 AM, Siddhesh Poyarekar wrote:

On 08/01/2018 04:23 AM, James Greenhalgh wrote:

On Wed, Jul 25, 2018 at 01:10:34PM -0500, Luis Machado wrote:
The adjusted vector costs give Falkor a reasonable boost in performance 
for FP benchmarks (both CPU2017 and CPU2006) and don't change INT 
benchmarks that much. About 0.7% for CPU2017 FP and 1.54% for CPU2006 FP.

OK for trunk?


OK if this is what works best for your subtarget.



I have pushed this for Luis since he is on holiday.

Thanks,
Siddhesh


Given this has a significant performance impact on Falkor and the 
changes are very localized and specific to Falkor, it would be great to 
have them on GCC 8. Would it be ok to push it?


Thanks,
Luis


Re: [PATCH] [AArch64, Falkor] Switch to using Falkor-specific vector costs

2018-07-26 Thread Luis Machado

Hi Kyrill,

On 07/26/2018 11:34 AM, Kyrill Tkachov wrote:

Hi Luis,

On 25/07/18 19:10, Luis Machado wrote:
The adjusted vector costs give Falkor a reasonable boost in performance 
for FP benchmarks (both CPU2017 and CPU2006) and don't change INT 
benchmarks that much. About 0.7% for CPU2017 FP and 1.54% for CPU2006 FP.

OK for trunk?



The patch looks ok and safe to me (though you'll need approval from the 
maintainers).


I'd be interested to see what workloads in CPU2017 were affected by this.
Any chance you could post the breakdown in numbers from CPU2017?



Sure. Here it is (speed):

605.mcf_s: -1.8%
620.omnetpp_s: -2% (tends to be noisy)
623.xalancbmk_s: 2%
654.roms_s: 7%

INT mean: -0.09%
FP mean: 0.70%

It is worth mentioning I noticed bigger improvements in CPU2017 rate, 
but I did not record those numbers for the final run. The speed 
benchmarks seem to have a slightly different performance profile.


Here's a breakdown of the biggest changes from CPU2006 in case you're 
interested:


410.bwaves: 5.4%
434.zeusmp: 9.7%
436.cactusADM: -12.3%
437.leslie3d: 5.2%
459.GemsFDTD: 16.9%

cactusADM seems to have a pretty big loop that is a win if vectorized, 
but experimentation showed me it is tricky to get GCC to vectorize that 
specific loop without also vectorizing particular loops from the other 
benchmarks.


It would be nice to get cactusADM back up though.


[PATCH] [AArch64, Falkor] Adjust Falkor's sign extend reg+reg address cost

2018-07-25 Thread Luis Machado
Adjust Falkor's register_sextend cost from 4 to 3.  This fixes a testsuite
failure in gcc.target/aarch64/extend.c:ldr_sxtw where GCC was generating
a sbfiz instruction rather than a load with sign extension.

No performance changes.
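For context, a hedged sketch of the kind of access the test exercises
(paraphrased from memory, not the exact testsuite source): a 32-bit index
sign-extended into a reg+reg address. With register_sextend costed at 4, GCC
preferred materializing the scaled index with sbfiz plus a plain load over a
single load with an sxtw-extended offset.

/* Expected with the cost fix:      ldr   x0, [x0, w1, sxtw #3]
   Reconstructed pre-fix codegen:   sbfiz x1, x1, 3, 32
                                    ldr   x0, [x0, x1]          */
long long
load_sxtw (long long *arr, int i)
{
  return arr[i];
}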

gcc/ChangeLog:

2018-07-25  Luis Machado  

* config/aarch64/aarch64.c (qdf24xx_addrcost_table)
<register_sextend>: Set to 3.
---
 gcc/config/aarch64/aarch64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index fa01475..ea39272 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -329,7 +329,7 @@ static const struct cpu_addrcost_table 
qdf24xx_addrcost_table =
   1, /* pre_modify  */
   1, /* post_modify  */
   3, /* register_offset  */
-  4, /* register_sextend  */
+  3, /* register_sextend  */
   3, /* register_zextend  */
   2, /* imm_offset  */
 };
-- 
2.7.4



[PATCH] [AArch64, Falkor] Switch to using Falkor-specific vector costs

2018-07-25 Thread Luis Machado
The adjusted vector costs give Falkor a reasonable boost in performance for FP
benchmarks (both CPU2017 and CPU2006) and don't change INT benchmarks that
much. About 0.7% for CPU2017 FP and 1.54% for CPU2006 FP.

OK for trunk?

gcc/ChangeLog:

2018-07-25  Luis Machado  

* config/aarch64/aarch64.c (qdf24xx_vector_cost): New.
(qdf24xx_tunings) <vector_cost>: Set to qdf24xx_vector_cost.
---
 gcc/config/aarch64/aarch64.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index fa01475..d443aee 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -430,6 +430,26 @@ static const struct cpu_vector_cost generic_vector_cost =
   1 /* cond_not_taken_branch_cost  */
 };
 
+/* Qualcomm QDF24xx costs for vector insn classes.  */
+static const struct cpu_vector_cost qdf24xx_vector_cost =
+{
+  1, /* scalar_int_stmt_cost  */
+  1, /* scalar_fp_stmt_cost  */
+  1, /* scalar_load_cost  */
+  1, /* scalar_store_cost  */
+  1, /* vec_int_stmt_cost  */
+  3, /* vec_fp_stmt_cost  */
+  2, /* vec_permute_cost  */
+  1, /* vec_to_scalar_cost  */
+  1, /* scalar_to_vec_cost  */
+  1, /* vec_align_load_cost  */
+  1, /* vec_unalign_load_cost  */
+  1, /* vec_unalign_store_cost  */
+  1, /* vec_store_cost  */
+  3, /* cond_taken_branch_cost  */
+  1  /* cond_not_taken_branch_cost  */
+};
+
 /* ThunderX costs for vector insn classes.  */
 static const struct cpu_vector_cost thunderx_vector_cost =
 {
@@ -890,7 +910,7 @@ static const struct tune_params qdf24xx_tunings =
   &qdf24xx_extra_costs,
   &qdf24xx_addrcost_table,
   &qdf24xx_regmove_cost,
-  &generic_vector_cost,
+  &qdf24xx_vector_cost,
   &generic_branch_cost,
   &generic_approx_modes,
   4, /* memmov_cost  */
-- 
2.7.4



Re: [PATCH] [AArch64, Falkor] Falkor address costs tuning

2018-05-24 Thread Luis Machado

On 05/23/2018 12:17 PM, James Greenhalgh wrote:

On Tue, May 22, 2018 at 12:04:38PM -0500, Luis Machado wrote:

Switch from using generic address costs to using Falkor-specific ones, which
give Falkor better results overall.

OK for trunk?

Given this is a Falkor-specific adjustment, would this be an acceptable
backport for GCC 8 as well?


OK for trunk.

It doesn't fix a regression, so it wouldn't really fit the definition of
a backport patch. That said, if it is important to you to have it in GCC 8,
it is sufficiently low-risk for non-Falkor targets that we can take it. So,
it is your call if you want to backport it or not.

Thanks,
James


Thanks. For now I've pushed it to mainline as r260675.


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-24 Thread Luis Machado

On 05/23/2018 10:57 PM, Jeff Law wrote:

On 05/23/2018 04:50 PM, Luis Machado wrote:

On 05/23/2018 07:42 PM, H.J. Lu wrote:

On Wed, May 23, 2018 at 3:41 PM, H.J. Lu <hjl.to...@gmail.com> wrote:

On Wed, May 23, 2018 at 3:35 PM, H.J. Lu <hjl.to...@gmail.com> wrote:



Sorry. Does the following fix it for i686?



Index: gcc/tree-ssa-loop-prefetch.c
===
--- gcc/tree-ssa-loop-prefetch.c (revision 260625)
+++ gcc/tree-ssa-loop-prefetch.c (working copy)
@@ -1014,7 +1014,8 @@
    fprintf (dump_file,
    "Step for reference %u:%u (%ld) is less than the mininum "
   ^^^ Please use HOST_WIDE_INT_PRINT_DEC
    "required stride of %d\n",
- ref->group->uid, ref->uid, int_cst_value (ref->group->step),
+ ref->group->uid, ref->uid,
+ (HOST_WIDE_INT) int_cst_value (ref->group->step),
    PREFETCH_MINIMUM_STRIDE);




Something like this.

--
H.J.
---
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index c3e7fd1e529..949a67f360e 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1012,8 +1012,8 @@ should_issue_prefetch_p (struct mem_ref *ref)
   {
     if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file,
-  "Step for reference %u:%u (%ld) is less than the mininum "
-  "required stride of %d\n",
+  "Step for reference %u:%u (" HOST_WIDE_INT_PRINT_C
+  ") is less than the mininum required stride of %d\n",
     ref->group->uid, ref->uid, int_cst_value (ref->group->step),
     PREFETCH_MINIMUM_STRIDE);
     return false;


I meant:

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index c3e7fd1e529..e34b78dc186 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1012,8 +1012,8 @@ should_issue_prefetch_p (struct mem_ref *ref)
   {
     if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file,
-  "Step for reference %u:%u (%ld) is less than the mininum "
-  "required stride of %d\n",
+  "Step for reference %u:%u (" HOST_WIDE_INT_PRINT_DEC
+  ") is less than the mininum required stride of %d\n",
     ref->group->uid, ref->uid, int_cst_value (ref->group->step),
     PREFETCH_MINIMUM_STRIDE);
     return false;




Pushed now. Sorry for the breakage.

For future reference, is there an i686 system on the gcc farm that I can
use to validate this?

My tester uses gcc45.  In theory you could probably also do it in an
i686 chroot on an x86_64 system.  My tester does that for testing ppc32
on a ppc64 system.


Got it. Thanks for the info.


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-23 Thread Luis Machado

On 05/23/2018 07:42 PM, H.J. Lu wrote:

On Wed, May 23, 2018 at 3:41 PM, H.J. Lu  wrote:

On Wed, May 23, 2018 at 3:35 PM, H.J. Lu  wrote:



Sorry. Does the following fix it for i686?



Index: gcc/tree-ssa-loop-prefetch.c
===
--- gcc/tree-ssa-loop-prefetch.c (revision 260625)
+++ gcc/tree-ssa-loop-prefetch.c (working copy)
@@ -1014,7 +1014,8 @@
   fprintf (dump_file,
   "Step for reference %u:%u (%ld) is less than the mininum "
   ^^^ Please use HOST_WIDE_INT_PRINT_DEC
   "required stride of %d\n",
- ref->group->uid, ref->uid, int_cst_value (ref->group->step),
+ ref->group->uid, ref->uid,
+ (HOST_WIDE_INT) int_cst_value (ref->group->step),
   PREFETCH_MINIMUM_STRIDE);




Something like this.

--
H.J.
---
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index c3e7fd1e529..949a67f360e 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1012,8 +1012,8 @@ should_issue_prefetch_p (struct mem_ref *ref)
  {
if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file,
-  "Step for reference %u:%u (%ld) is less than the mininum "
-  "required stride of %d\n",
+  "Step for reference %u:%u (" HOST_WIDE_INT_PRINT_C
+  ") is less than the mininum required stride of %d\n",
ref->group->uid, ref->uid, int_cst_value (ref->group->step),
PREFETCH_MINIMUM_STRIDE);
return false;


I meant:

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index c3e7fd1e529..e34b78dc186 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1012,8 +1012,8 @@ should_issue_prefetch_p (struct mem_ref *ref)
  {
if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file,
-  "Step for reference %u:%u (%ld) is less than the mininum "
-  "required stride of %d\n",
+  "Step for reference %u:%u (" HOST_WIDE_INT_PRINT_DEC
+  ") is less than the mininum required stride of %d\n",
ref->group->uid, ref->uid, int_cst_value (ref->group->step),
PREFETCH_MINIMUM_STRIDE);
return false;




Pushed now. Sorry for the breakage.

For future reference, is there an i686 system on the gcc farm that I can 
use to validate this?


Thanks,
Luis


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-23 Thread Luis Machado



On 05/23/2018 05:01 PM, H.J. Lu wrote:

On Tue, May 22, 2018 at 11:55 AM, Luis Machado <luis.mach...@linaro.org> wrote:



On 05/16/2018 08:18 AM, Luis Machado wrote:




On 05/16/2018 06:08 AM, Kyrill Tkachov wrote:



On 15/05/18 12:12, Luis Machado wrote:


Hi,

On 05/15/2018 06:37 AM, Kyrill Tkachov wrote:


Hi Luis,

On 14/05/18 22:18, Luis Machado wrote:


Hi,

Here's an updated version of the patch (now reverted) that addresses
the previous bootstrap problem (signedness and long long/int conversion).

I've checked that it bootstraps properly on both aarch64-linux and
x86_64-linux and that tests look sane.

James, would you please give this one a try to see if you can still
reproduce PR85682? I couldn't reproduce it in multiple attempts.



The patch doesn't hit the regressions in PR85682 from what I can see.
I have a comment on the patch below.



Great. Thanks for checking Kyrill.


--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -992,6 +992,23 @@ prune_by_reuse (struct mem_ref_group *groups)
   static bool
   should_issue_prefetch_p (struct mem_ref *ref)
   {
+  /* Some processors may have a hardware prefetcher that may conflict
with
+ prefetch hints for a range of strides.  Make sure we don't issue
+ prefetches for such cases if the stride is within this particular
+ range.  */
+  if (cst_and_fits_in_hwi (ref->group->step)
+  && abs_hwi (int_cst_value (ref->group->step)) <
+  (HOST_WIDE_INT) PREFETCH_MINIMUM_STRIDE)
+{

The '<' should go on the line below together with
PREFETCH_MINIMUM_STRIDE.



I've fixed this locally now.



Thanks. I haven't followed the patch in detail, are you looking for
midend changes approval since the last version?
Or do you need aarch64 approval?



The changes are not substantial, but midend approval is what I was aiming
at.

Also the confirmation that PR85682 is no longer happening.



James confirmed PR85682 is no longer reproducible with the updated patch and
the bootstrap issue is fixed now. So I take it this should be OK to push to
mainline?

Also, I'd like to discuss the possibility of having these two options
backported to GCC 8. As is, the changes don't alter code generation by
default, but they allow better tuning of the software prefetcher for targets
that benefit from it.

Maybe after letting the changes bake on mainline enough to be confirmed
stable?


It breaks GCC bootstrap on i686:

../../src-trunk/gcc/tree-ssa-loop-prefetch.c: In function ‘bool
should_issue_prefetch_p(mem_ref*)’:
../../src-trunk/gcc/tree-ssa-loop-prefetch.c:1015:4: error: format
‘%ld’ expects argument of type ‘long int’, but argument 5 has type
‘long long int’ [-Werror=format=]
 "Step for reference %u:%u (%ld) is less than the mininum "
 ^~
 "required stride of %d\n",
 ~
 ref->group->uid, ref->uid, int_cst_value (ref->group->step),




Sorry. Does the following fix it for i686?

I had bootstrapped it for x86_64.
2018-05-22  Luis Machado  <luis.mach...@linaro.org>

	* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Cast value
	to HOST_WIDE_INT.

Index: gcc/tree-ssa-loop-prefetch.c
===
--- gcc/tree-ssa-loop-prefetch.c	(revision 260625)
+++ gcc/tree-ssa-loop-prefetch.c	(working copy)
@@ -1014,7 +1014,8 @@
 	fprintf (dump_file,
 		 "Step for reference %u:%u (%ld) is less than the mininum "
 		 "required stride of %d\n",
-		 ref->group->uid, ref->uid, int_cst_value (ref->group->step),
+		 ref->group->uid, ref->uid,
+		 (HOST_WIDE_INT) int_cst_value (ref->group->step),
 		 PREFETCH_MINIMUM_STRIDE);
   return false;
 }
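For background, an illustration (not GCC code): HOST_WIDE_INT is a 64-bit type
that is plain long on LP64 hosts but long long on 32-bit hosts such as i686,
which is why a hard-coded %ld bootstrapped on x86_64 yet warns under
-Werror=format= on i686. The standard-C analogue of HOST_WIDE_INT_PRINT_DEC is
an inttypes.h format macro:

#include <inttypes.h>
#include <stdio.h>

int
main (void)
{
  int64_t step = 2048;                  /* stands in for HOST_WIDE_INT */
  /* printf ("step %ld\n", step);         would warn where int64_t is
                                          long long, as on i686 */
  printf ("step %" PRId64 "\n", step);  /* expands to the right length
                                           modifier on either host */
  return 0;
}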


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-22 Thread Luis Machado



On 05/16/2018 08:18 AM, Luis Machado wrote:



On 05/16/2018 06:08 AM, Kyrill Tkachov wrote:


On 15/05/18 12:12, Luis Machado wrote:

Hi,

On 05/15/2018 06:37 AM, Kyrill Tkachov wrote:

Hi Luis,

On 14/05/18 22:18, Luis Machado wrote:

Hi,

Here's an updated version of the patch (now reverted) that 
addresses the previous bootstrap problem (signedness and long 
long/int conversion).


I've checked that it bootstraps properly on both aarch64-linux and 
x86_64-linux and that tests look sane.


James, would you please give this one a try to see if you can still 
reproduce PR85682? I couldn't reproduce it in multiple attempts.




The patch doesn't hit the regressions in PR85682 from what I can see.
I have a comment on the patch below.



Great. Thanks for checking Kyrill.


--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -992,6 +992,23 @@ prune_by_reuse (struct mem_ref_group *groups)
  static bool
  should_issue_prefetch_p (struct mem_ref *ref)
  {
+  /* Some processors may have a hardware prefetcher that may 
conflict with

+ prefetch hints for a range of strides.  Make sure we don't issue
+ prefetches for such cases if the stride is within this particular
+ range.  */
+  if (cst_and_fits_in_hwi (ref->group->step)
+  && abs_hwi (int_cst_value (ref->group->step)) <
+  (HOST_WIDE_INT) PREFETCH_MINIMUM_STRIDE)
+    {

The '<' should go on the line below together with 
PREFETCH_MINIMUM_STRIDE.


I've fixed this locally now.


Thanks. I haven't followed the patch in detail, are you looking for 
midend changes approval since the last version?

Or do you need aarch64 approval?


The changes are not substantial, but midend approval is what I was aiming 
at.


Also the confirmation that PR85682 is no longer happening.


James confirmed PR85682 is no longer reproducible with the updated patch 
and the bootstrap issue is fixed now. So I take it this should be OK to 
push to mainline?


Also, I'd like to discuss the possibility of having these two options 
backported to GCC 8. As is, the changes don't alter code generation by 
default, but they allow better tuning of the software prefetcher for 
targets that benefit from it.


Maybe after letting the changes bake on mainline enough to be confirmed 
stable?


Thanks,
Luis


[PATCH] [AArch64, Falkor] Falkor address costs tuning

2018-05-22 Thread Luis Machado
Switch from using generic address costs to using Falkor-specific ones, which
give Falkor better results overall.

OK for trunk?

Given this is a Falkor-specific adjustment, would this be an acceptable
backport for GCC 8 as well?

gcc/ChangeLog:

2018-05-22  Luis Machado  <luis.mach...@linaro.org>

* config/aarch64/aarch64.c (qdf24xx_addrcost_table): New static
global.
(qdf24xx_tunings) <addrcost_table>: Set to qdf24xx_addrcost_table.
---
 gcc/config/aarch64/aarch64.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f60e0ad..548d87a 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -314,6 +314,22 @@ static const struct cpu_addrcost_table 
thunderx2t99_addrcost_table =
   0, /* imm_offset  */
 };
 
+static const struct cpu_addrcost_table qdf24xx_addrcost_table =
+{
+{
+  1, /* hi  */
+  1, /* si  */
+  1, /* di  */
+  2, /* ti  */
+},
+  1, /* pre_modify  */
+  1, /* post_modify  */
+  3, /* register_offset  */
+  4, /* register_sextend  */
+  3, /* register_zextend  */
+  2, /* imm_offset  */
+};
+
 static const struct cpu_regmove_cost generic_regmove_cost =
 {
   1, /* GP2GP  */
@@ -856,7 +872,7 @@ static const struct tune_params xgene1_tunings =
 static const struct tune_params qdf24xx_tunings =
 {
   &qdf24xx_extra_costs,
-  &generic_addrcost_table,
+  &qdf24xx_addrcost_table,
   &qdf24xx_regmove_cost,
   &generic_vector_cost,
   &generic_branch_cost,
-- 
2.7.4



Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-16 Thread Luis Machado



On 05/16/2018 06:08 AM, Kyrill Tkachov wrote:


On 15/05/18 12:12, Luis Machado wrote:

Hi,

On 05/15/2018 06:37 AM, Kyrill Tkachov wrote:

Hi Luis,

On 14/05/18 22:18, Luis Machado wrote:

Hi,

Here's an updated version of the patch (now reverted) that addresses 
the previous bootstrap problem (signedness and long long/int 
conversion).


I've checked that it bootstraps properly on both aarch64-linux and 
x86_64-linux and that tests look sane.


James, would you please give this one a try to see if you can still 
reproduce PR85682? I couldn't reproduce it in multiple attempts.




The patch doesn't hit the regressions in PR85682 from what I can see.
I have a comment on the patch below.



Great. Thanks for checking Kyrill.


--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -992,6 +992,23 @@ prune_by_reuse (struct mem_ref_group *groups)
  static bool
  should_issue_prefetch_p (struct mem_ref *ref)
  {
+  /* Some processors may have a hardware prefetcher that may 
conflict with

+ prefetch hints for a range of strides.  Make sure we don't issue
+ prefetches for such cases if the stride is within this particular
+ range.  */
+  if (cst_and_fits_in_hwi (ref->group->step)
+  && abs_hwi (int_cst_value (ref->group->step)) <
+  (HOST_WIDE_INT) PREFETCH_MINIMUM_STRIDE)
+    {

The '<' should go on the line below together with 
PREFETCH_MINIMUM_STRIDE.


I've fixed this locally now.


Thanks. I haven't followed the patch in detail, are you looking for 
midend changes approval since the last version?

Or do you need aarch64 approval?


The changes are not substantial, but midend approval is what I was aiming at.

Also the confirmation that PR85682 is no longer happening.

Luis


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-15 Thread Luis Machado

Hi,

On 05/15/2018 06:37 AM, Kyrill Tkachov wrote:

Hi Luis,

On 14/05/18 22:18, Luis Machado wrote:

Hi,

Here's an updated version of the patch (now reverted) that addresses 
the previous bootstrap problem (signedness and long long/int conversion).


I've checked that it bootstraps properly on both aarch64-linux and 
x86_64-linux and that tests look sane.


James, would you please give this one a try to see if you can still 
reproduce PR85682? I couldn't reproduce it in multiple attempts.




The patch doesn't hit the regressions in PR85682 from what I can see.
I have a comment on the patch below.



Great. Thanks for checking Kyrill.


--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -992,6 +992,23 @@ prune_by_reuse (struct mem_ref_group *groups)
  static bool
  should_issue_prefetch_p (struct mem_ref *ref)
  {
+  /* Some processors may have a hardware prefetcher that may conflict with
+ prefetch hints for a range of strides.  Make sure we don't issue
+ prefetches for such cases if the stride is within this particular
+ range.  */
+  if (cst_and_fits_in_hwi (ref->group->step)
+  && abs_hwi (int_cst_value (ref->group->step)) <
+  (HOST_WIDE_INT) PREFETCH_MINIMUM_STRIDE)
+    {

The '<' should go on the line below together with PREFETCH_MINIMUM_STRIDE.


I've fixed this locally now.


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-14 Thread Luis Machado

Hi,

Here's an updated version of the patch (now reverted) that addresses the 
previous bootstrap problem (signedness and long long/int conversion).


I've checked that it bootstraps properly on both aarch64-linux and 
x86_64-linux and that tests look sane.


James, would you please give this one a try to see if you can still 
reproduce PR85682? I couldn't reproduce it in multiple attempts.


Thanks,
Luis

On 01/22/2018 11:46 AM, Luis Machado wrote:

This patch adds a new option to control the minimum stride, for a memory
reference, above which the loop prefetch pass may issue software prefetch
hints. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger strides
that are more likely to benefit from prefetching. I've noticed a case in cpu2017
where we were issuing thousands of hints, for example.

* For processors that have a hardware prefetcher, like Falkor, it allows the
loop prefetch pass to defer prefetching of smaller (less than the threshold)
strides to the hardware prefetcher instead. This prevents conflicts between
the software prefetcher and the hardware prefetcher.

I've noticed considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms of
prefetch behavior for Falkor.

The default settings should guarantee no changes for existing targets. Those
are free to tweak the settings as necessary.

No regressions in the testsuite and bootstrapped ok on aarch64-linux.

Ok?
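To make the motivation concrete, a small sketch (the loops are made up; the
2048-byte threshold matches the Falkor setting quoted later in this thread):
with --param prefetch-minimum-stride=2048, the first loop's 8-byte stride is
left to the hardware prefetcher, while the second loop's 4096-byte stride
remains eligible for software prefetch hints.

#define N 1024

double a[N], b[N * 512];

double
sum_strides (void)
{
  double s = 0.0;
  for (int i = 0; i < N; i++)
    s += a[i];          /* stride 8 bytes: below the threshold, skipped */
  for (int i = 0; i < N; i++)
    s += b[i * 512];    /* stride 4096 bytes: above the threshold, the
                           loop prefetch pass may still emit hints */
  return s;
}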



From e0207950a6d7793cdaceaa71fc5ada05a93dc1b3 Mon Sep 17 00:00:00 2001
From: Luis Machado <luis.mach...@linaro.org>
Date: Thu, 21 Dec 2017 15:23:59 -0200
Subject: [PATCH 1/2] Introduce prefetch-minimum stride option

This patch adds a new option to control the minimum stride, for a memory
reference, above which the loop prefetch pass may issue software prefetch
hints. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger strides
that are more likely to benefit from prefetching. I've noticed a case in cpu2017
where we were issuing thousands of hints, for example.

* For processors that have a hardware prefetcher, like Falkor, it allows the
loop prefetch pass to defer prefetching of smaller (less than the threshold)
strides to the hardware prefetcher instead. This prevents conflicts between
the software prefetcher and the hardware prefetcher.

I've noticed considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms of
prefetch behavior for Falkor.

The default settings should guarantee no changes for existing targets. Those
are free to tweak the settings as necessary.

No regressions in the testsuite and bootstrapped ok on aarch64-linux and
x86_64-linux.

Ok?

2018-05-14  Luis Machado  <luis.mach...@linaro.org>

	Introduce option to limit software prefetching to known constant
	strides above a specific threshold with the goal of preventing
	conflicts with a hardware prefetcher.

	gcc/
	* config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
	<minimum_stride>: New const int field.
	* config/aarch64/aarch64.c (generic_prefetch_tune): Update to include
	minimum_stride field.
	(exynosm1_prefetch_tune): Likewise.
	(thunderxt88_prefetch_tune): Likewise.
	(thunderx_prefetch_tune): Likewise.
	(thunderx2t99_prefetch_tune): Likewise.
	(qdf24xx_prefetch_tune): Likewise. Set minimum_stride to 2048.
	<default_opt_level>: Set to 3.
	(aarch64_override_options_internal): Update to set
	PARAM_PREFETCH_MINIMUM_STRIDE.
	* doc/invoke.texi (prefetch-minimum-stride): Document new option.
	* params.def (PARAM_PREFETCH_MINIMUM_STRIDE): New.
	* params.h (PARAM_PREFETCH_MINIMUM_STRIDE): Define.
	* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Return false if
	stride is constant and is below the minimum stride threshold.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +++
 gcc/config/aarch64/aarch64.c| 13 -
 gcc/doc/invoke.texi | 15 +++
 gcc/params.def  |  9 +
 gcc/params.h|  2 ++
 gcc/tree-ssa-loop-prefetch.c| 17 +
 6 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index cda2895..5d3b9d7 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -230,6 +230,9 @@ struct cpu_prefetch_tune
   const int l1_cache_size;
   const int l1_cache_line_size;
   const int l2_cache_size;
+  /* The minimum constant stride beyond which we should use prefetch
+ hints for.  */
+  const int minimum_stride;
   const int default_opt_level;
 };
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a2003fe..5215deb 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -547,6 +547,7 @@ static const cpu_prefetch_tune generic_prefetch_tune =
   -1,			/* l1_cache

Re: [PATCH, v2] Recognize a missed usage of a sbfiz instruction

2018-05-14 Thread Luis Machado

On 05/11/2018 06:46 AM, Kyrill Tkachov wrote:

Hi Luis,

On 10/05/18 11:31, Luis Machado wrote:


On 05/09/2018 10:44 AM, Kyrill Tkachov wrote:


On 09/05/18 13:30, Luis Machado wrote:

Hi Kyrill,

On 05/08/2018 11:09 AM, Kyrill Tkachov wrote:

Hi Luis,

On 07/05/18 15:28, Luis Machado wrote:

Hi,

On 02/08/2018 10:45 AM, Luis Machado wrote:

Hi Kyrill,

On 02/08/2018 09:48 AM, Kyrill Tkachov wrote:

Hi Luis,

On 06/02/18 15:04, Luis Machado wrote:
Thanks for the feedback Kyrill. I've adjusted the v2 patch 
based on your

suggestions and re-tested the changes. Everything is still sane.


Thanks! This looks pretty good to me.

Since this is ARM-specific and fairly specific, I wonder if it would be 
reasonable to consider it for inclusion at the current stage.


It is true that the target maintainers can choose to take
such patches at any stage. However, any patch at this stage 
increases

the risk of regressions being introduced and these regressions
can come bite us in ways that are very hard to anticipate.

Have a look at some of the bugs in bugzilla (or a quick scan of 
the gcc-bugs list)
for examples of the ways that things can go wrong with any of 
the myriad of GCC components

and the unexpected ways in which they can interact.

For example, I am now working on what I initially thought was a 
one-liner fix for
PR 84164 but it has expanded into a 3-patch series with a midend 
component and

target-specific changes for 2 ports.

These issues are very hard to catch during review and normal 
testing, and can sometimes take months of deep testing by
fuzzing and massive codebase rebuilds to expose, so the closer 
the commit is to a release
the higher the risk is that an obscure edge case will be 
unnoticed and unfixed in the release.


So the priority at this stage is to minimise the risk of 
destabilising the codebase,
as opposed to taking in new features and desirable performance 
improvements (like your patch!)


That is the rationale for delaying committing such changes until 
the start
of GCC 9 development. But again, this is up to the aarch64 
maintainers.
I'm sure the patch will be a perfectly fine and desirable commit 
for GCC 9.

This is just my perspective as maintainer of the arm port.


Thanks. Your explanation makes the situation pretty clear and it 
sounds very reasonable. I'll put the patch on hold until 
development is open again.


Regards,
Luis


With GCC 9 development open, I take it this patch is worth 
considering again?




Yes, I believe the latest version is at:
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00239.html ?

+(define_insn "*ashift_extv_bfiz"
+  [(set (match_operand:GPI 0 "register_operand" "=r")
+    (ashift:GPI (sign_extract:GPI (match_operand:GPI 1 
"register_operand" "r")
+  (match_operand 2 
"aarch64_simd_shift_imm_offset_" "n")
+  (match_operand 3 
"aarch64_simd_shift_imm_" "n"))

+ (match_operand 4 "aarch64_simd_shift_imm_" "n")))]
+  ""
+  "sbfiz\\t%0, %1, %4, %2"
+  [(set_attr "type" "bfx")]
+)
+


Indeed.



Can you give a bit more information about what are the values for 
operands 2,3 and 4 in your example testcases?


For sbfiz32 we have 3, 0 and 19 respectively. For sbfiz64 we have 6, 
0 and 38.
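As an aside, here is the kind of C that produces such RTL (my reconstruction
from the operand values, not necessarily the exact testcase): x << 29 >> 10
becomes a sign_extract of width 3 (32 - 29) at position 0, shifted left by
19 (29 - 10); likewise width 6 = 64 - 58 and shift 38 = 58 - 20 for the
64-bit case.

/* Combine canonicalizes (x << 29) >> 10 (arithmetic shift) into
   (ashift (sign_extract x 3 0) 19), which the new pattern emits as
   sbfiz w0, w0, 19, 3.  */
int
sbfiz32 (int x)
{
  return x << 29 >> 10;
}

/* (ashift (sign_extract x 6 0) 38)  ->  sbfiz x0, x0, 38, 6  */
long long
sbfiz64 (long long x)
{
  return x << 58 >> 20;
}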


I'm trying to understand why the value of operand 3 (the bit 
position the sign-extract starts from) doesn't get validated

in any way and doesn't play any role in the output...


This may be an oversight. It seems operand 3 will always be 0 in 
this particular case I'm covering. It starts from 0, gets shifted x 
bits to the left and then y < x bits to the right. The operation is 
essentially an ashift of the bitfield followed by a sign-extension 
of the msb of the bitfield being extracted.


Having a non-zero operand 3 from RTL means the shift amount won't 
translate directly to operand 3 of sbfiz (the position). Then we'd 
need to do a calculation where we take into account operand 3 from RTL.


I'm wondering when such a RTL pattern, with a non-zero operand 3, 
would be generated though.


I think it's best to enforce that operand 3 is a zero. Maybe just 
match const_int 0 here directly.

Better safe than sorry with these things.


Indeed. I've updated the original patch with that change now.

Bootstrapped and regtested on aarch64-linux.



I think you might also want to enforce that the sum of operands 2 and 3 
doesn't exceed the width of the mode
(see other patterns in aarch64.md that generate bfx-style instructions 
for similar restrictions).


Updated now in the attached patch. Everything still sane with bootstrap 
and testsuite on aarch64-linux.




Otherwise the patch looks good to me (it will still need approval from a 
maintainer).


For the future I think there's also an opportunity to match the SImode 
version of this pattern that zero-extend

Re: [PATCH, v2] Recognize a missed usage of a sbfiz instruction

2018-05-10 Thread Luis Machado


On 05/09/2018 10:44 AM, Kyrill Tkachov wrote:


On 09/05/18 13:30, Luis Machado wrote:

Hi Kyrill,

On 05/08/2018 11:09 AM, Kyrill Tkachov wrote:

Hi Luis,

On 07/05/18 15:28, Luis Machado wrote:

Hi,

On 02/08/2018 10:45 AM, Luis Machado wrote:

Hi Kyrill,

On 02/08/2018 09:48 AM, Kyrill Tkachov wrote:

Hi Luis,

On 06/02/18 15:04, Luis Machado wrote:
Thanks for the feedback Kyrill. I've adjusted the v2 patch based 
on your

suggestions and re-tested the changes. Everything is still sane.


Thanks! This looks pretty good to me.

Since this is ARM-specific and fairly specific, I wonder if it would be 
reasonable to consider it for inclusion at the current stage.


It is true that the target maintainers can choose to take
such patches at any stage. However, any patch at this stage increases
the risk of regressions being introduced and these regressions
can come bite us in ways that are very hard to anticipate.

Have a look at some of the bugs in bugzilla (or a quick scan of 
the gcc-bugs list)
for examples of the ways that things can go wrong with any of the 
myriad of GCC components

and the unexpected ways in which they can interact.

For example, I am now working on what I initially thought was a 
one-liner fix for
PR 84164 but it has expanded into a 3-patch series with a midend 
component and

target-specific changes for 2 ports.

These issues are very hard to catch during review and normal 
testing, and can sometimes take months of deep testing by
fuzzing and massive codebase rebuilds to expose, so the closer the 
commit is to a release
the higher the risk is that an obscure edge case will be unnoticed 
and unfixed in the release.


So the priority at this stage is to minimise the risk of 
destabilising the codebase,
as opposed to taking in new features and desirable performance 
improvements (like your patch!)


That is the rationale for delaying committing such changes until 
the start
of GCC 9 development. But again, this is up to the aarch64 
maintainers.
I'm sure the patch will be a perfectly fine and desirable commit 
for GCC 9.

This is just my perspective as maintainer of the arm port.


Thanks. Your explanation makes the situation pretty clear and it 
sounds very reasonable. I'll put the patch on hold until 
development is open again.


Regards,
Luis


With GCC 9 development open, I take it this patch is worth 
considering again?




Yes, I believe the latest version is at:
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00239.html ?

+(define_insn "*ashift_extv_bfiz"
+  [(set (match_operand:GPI 0 "register_operand" "=r")
+    (ashift:GPI (sign_extract:GPI (match_operand:GPI 1 
"register_operand" "r")
+  (match_operand 2 
"aarch64_simd_shift_imm_offset_" "n")
+  (match_operand 3 
"aarch64_simd_shift_imm_" "n"))

+ (match_operand 4 "aarch64_simd_shift_imm_" "n")))]
+  ""
+  "sbfiz\\t%0, %1, %4, %2"
+  [(set_attr "type" "bfx")]
+)
+


Indeed.



Can you give a bit more information about what are the values for 
operands 2,3 and 4 in your example testcases?


For sbfiz32 we have 3, 0 and 19 respectively. For sbfiz64 we have 6, 0 
and 38.


I'm trying to understand why the value of operand 3 (the bit position 
the sign-extract starts from) doesn't get validated

in any way and doesn't play any role in the output...


This may be an oversight. It seems operand 3 will always be 0 in this 
particular case I'm covering. It starts from 0, gets shifted x bits to 
the left and then y < x bits to the right. The operation is 
essentially an ashift of the bitfield followed by a sign-extension of 
the msb of the bitfield being extracted.


Having a non-zero operand 3 from RTL means the shift amount won't 
translate directly to operand 3 of sbfiz (the position). Then we'd 
need to do a calculation where we take into account operand 3 from RTL.


I'm wondering when such a RTL pattern, with a non-zero operand 3, 
would be generated though.


I think it's best to enforce that operand 3 is a zero. Maybe just match 
const_int 0 here directly.

Better safe than sorry with these things.


Indeed. I've updated the original patch with that change now.

Bootstrapped and regtested on aarch64-linux.

Thanks,
Luis
2018-05-10  Luis Machado  <luis.mach...@linaro.org>

	gcc/
	* config/aarch64/aarch64.md (*ashift_extv_bfiz): New pattern.

	gcc/testsuite/
	* gcc.target/aarch64/lsl_asr_sbfiz.c: New test.
---
 gcc/config/aarch64/aarch64.md| 13 +
 gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c | 24 
 2 files changed, 37 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 32a0e1f..1f943e6 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/conf

Re: [PATCH, v2] Recognize a missed usage of a sbfiz instruction

2018-05-09 Thread Luis Machado

Hi Kyrill,

On 05/08/2018 11:09 AM, Kyrill Tkachov wrote:

Hi Luis,

On 07/05/18 15:28, Luis Machado wrote:

Hi,

On 02/08/2018 10:45 AM, Luis Machado wrote:

Hi Kyrill,

On 02/08/2018 09:48 AM, Kyrill Tkachov wrote:

Hi Luis,

On 06/02/18 15:04, Luis Machado wrote:
Thanks for the feedback Kyrill. I've adjusted the v2 patch based on 
your

suggestions and re-tested the changes. Everything is still sane.


Thanks! This looks pretty good to me.

Since this is ARM-specific and fairly specific, I wonder if it would be 
reasonable to consider it for inclusion at the current stage.


It is true that the target maintainers can choose to take
such patches at any stage. However, any patch at this stage increases
the risk of regressions being introduced and these regressions
can come bite us in ways that are very hard to anticipate.

Have a look at some of the bugs in bugzilla (or a quick scan of the 
gcc-bugs list)
for examples of the ways that things can go wrong with any of the 
myriad of GCC components

and the unexpected ways in which they can interact.

For example, I am now working on what I initially thought was a 
one-liner fix for
PR 84164 but it has expanded into a 3-patch series with a midend 
component and

target-specific changes for 2 ports.

These issues are very hard to catch during review and normal 
testing, and can sometimes take months of deep testing by
fuzzing and massive codebase rebuilds to expose, so the closer the 
commit is to a release
the higher the risk is that an obscure edge case will be unnoticed 
and unfixed in the release.


So the priority at this stage is to minimise the risk of 
destabilising the codebase,
as opposed to taking in new features and desirable performance 
improvements (like your patch!)


That is the rationale for delaying committing such changes until the 
start

of GCC 9 development. But again, this is up to the aarch64 maintainers.
I'm sure the patch will be a perfectly fine and desirable commit for 
GCC 9.

This is just my perspective as maintainer of the arm port.


Thanks. Your explanation makes the situation pretty clear and it 
sounds very reasonable. I'll put the patch on hold until development 
is open again.


Regards,
Luis


With GCC 9 development open, I take it this patch is worth considering 
again?




Yes, I believe the latest version is at:
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00239.html ?

+(define_insn "*ashift_extv_bfiz"
+  [(set (match_operand:GPI 0 "register_operand" "=r")
+    (ashift:GPI (sign_extract:GPI (match_operand:GPI 1 
"register_operand" "r")
+  (match_operand 2 
"aarch64_simd_shift_imm_offset_" "n")
+  (match_operand 3 "aarch64_simd_shift_imm_" 
"n"))

+ (match_operand 4 "aarch64_simd_shift_imm_" "n")))]
+  ""
+  "sbfiz\\t%0, %1, %4, %2"
+  [(set_attr "type" "bfx")]
+)
+


Indeed.



Can you give a bit more information about what are the values for 
operands 2,3 and 4 in your example testcases?


For sbfiz32 we have 3, 0 and 19 respectively. For sbfiz64 we have 6, 0 
and 38.


I'm trying to understand why the value of operand 3 (the bit position 
the sign-extract starts from) doesn't get validated in any way and 
doesn't play any role in the output...


This may be an oversight. It seems operand 3 will always be 0 in this 
particular case I'm covering: the bitfield starts at bit 0, gets shifted x 
bits to the left and then y < x bits to the right. The operation is 
essentially an ashift of the bitfield followed by a sign-extension of the 
msb of the bitfield being extracted.
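
To make that mapping concrete, here is a minimal standalone sketch (mine, 
not part of the patch) checking the identity the pattern relies on for the 
sbfiz32 case, using the 3/0/19 operand values above. Like the testcase, it 
relies on GCC's arithmetic-shift behavior for signed types:

/* (x << A) >> B  ==  sign_extract (x, N - A, 0) << (A - B), i.e. an sbfiz
   with width = N - A (operand 2) and lsb = A - B (operand 4).  */
#include <assert.h>

static int sbfiz32_ref (int x) { return x << 29 >> 10; }

static int sbfiz32_sim (int x)
{
  int width = 32 - 29;	/* operand 2: 3 */
  int lsb = 29 - 10;	/* operand 4: 19 */
  /* sign_extract of WIDTH bits at position 0 (operand 3).  */
  int field = (x << (32 - width)) >> (32 - width);
  return field << lsb;
}

int main (void)
{
  for (int x = -4096; x < 4096; x++)
    assert (sbfiz32_ref (x) == sbfiz32_sim (x));
  return 0;
}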


Having a non-zero operand 3 from RTL means the shift amount won't 
translate directly to operand 3 of sbfiz (the position). Then we'd need 
to do a calculation where we take into account operand 3 from RTL.


I'm wondering when such an RTL pattern, with a non-zero operand 3, would 
be generated though.


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-07 Thread Luis Machado



On 05/07/2018 12:15 PM, H.J. Lu wrote:

On Mon, May 7, 2018 at 7:09 AM, Luis Machado <luis.mach...@linaro.org> wrote:



On 05/01/2018 03:30 PM, Jeff Law wrote:


On 01/22/2018 06:46 AM, Luis Machado wrote:


This patch adds a new option to control the minimum stride, for a memory
reference, above which the loop prefetch pass may issue software prefetch
hints. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger
strides
that are more likely to benefit from prefetching. I've noticed a case in
cpu2017
where we were issuing thousands of hints, for example.

* For processors that have a hardware prefetcher, like Falkor, it allows
the
loop prefetch pass to defer prefetching of smaller (less than the
threshold)
strides to the hardware prefetcher instead. This prevents conflicts
between
the software prefetcher and the hardware prefetcher.

I've noticed considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms
of
prefetch behavior for Falkor.

The default settings should guarantee no changes for existing targets.
Those
are free to tweak the settings as necessary.

No regressions in the testsuite and bootstrapped ok on aarch64-linux.

Ok?

2018-01-22  Luis Machado  <luis.mach...@linaro.org>

 Introduce option to limit software prefetching to known constant
 strides above a specific threshold with the goal of preventing
 conflicts with a hardware prefetcher.

 gcc/
 * config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
<minimum_stride>: New const int field.
 * config/aarch64/aarch64.c (generic_prefetch_tune): Update to
include
 minimum_stride field.
 (exynosm1_prefetch_tune): Likewise.
 (thunderxt88_prefetch_tune): Likewise.
 (thunderx_prefetch_tune): Likewise.
 (thunderx2t99_prefetch_tune): Likewise.
 (qdf24xx_prefetch_tune): Likewise. Set minimum_stride to 2048.
 (aarch64_override_options_internal): Update to set
 PARAM_PREFETCH_MINIMUM_STRIDE.
 * doc/invoke.texi (prefetch-minimum-stride): Document new option.
 * params.def (PARAM_PREFETCH_MINIMUM_STRIDE): New.
 * params.h (PARAM_PREFETCH_MINIMUM_STRIDE): Define.
 * tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Return
false if
 stride is constant and is below the minimum stride threshold.


OK for the trunk.
jeff



Thanks. Committed as revision 259995 now.


This breaks bootstrap on x86:

../../src-trunk/gcc/tree-ssa-loop-prefetch.c: In function ‘bool
should_issue_prefetch_p(mem_ref*)’:
../../src-trunk/gcc/tree-ssa-loop-prefetch.c:1010:54: error:
comparison of integer expressions of different signedness: ‘long long
unsigned int’ and ‘int’ [-Werror=sign-compare]
&& absu_hwi (int_cst_value (ref->group->step)) < 
PREFETCH_MINIMUM_STRIDE)
../../src-trunk/gcc/tree-ssa-loop-prefetch.c:1014:4: error: format
‘%d’ expects argument of type ‘int’, but argument 5 has type ‘long
long int’ [-Werror=format=]
 "Step for reference %u:%u (%d) is less than the mininum "
 ^
 "required stride of %d\n",
 ~
 ref->group->uid, ref->uid, int_cst_value (ref->group->step),




I've reverted this for now while I address the bootstrap problem.


Re: [PATCH, v2] Recognize a missed usage of a sbfiz instruction

2018-05-07 Thread Luis Machado

Hi,

On 02/08/2018 10:45 AM, Luis Machado wrote:

Hi Kyrill,

On 02/08/2018 09:48 AM, Kyrill Tkachov wrote:

Hi Luis,

On 06/02/18 15:04, Luis Machado wrote:

Thanks for the feedback Kyrill. I've adjusted the v2 patch based on your
suggestions and re-tested the changes. Everything is still sane.


Thanks! This looks pretty good to me.


Since this is ARM-specific and fairly specific, I wonder if it would be
reasonable to consider it for inclusion at the current stage.


It is true that the target maintainers can choose to take
such patches at any stage. However, any patch at this stage increases
the risk of regressions being introduced and these regressions
can come bite us in ways that are very hard to anticipate.

Have a look at some of the bugs in bugzilla (or a quick scan of the 
gcc-bugs list) for examples of the ways that things can go wrong with any 
of the myriad of GCC components and the unexpected ways in which they can 
interact.

For example, I am now working on what I initially thought was a one-liner 
fix for PR 84164 but it has expanded into a 3-patch series with a midend 
component and target-specific changes for 2 ports.

These issues are very hard to catch during review and normal testing, 
and can sometimes take months of deep testing by
fuzzing and massive codebase rebuilds to expose, so the closer the 
commit is to a release
the higher the risk is that an obscure edge case will be unnoticed and 
unfixed in the release.


So the priority at this stage is to minimise the risk of destabilising 
the codebase,
as opposed to taking in new features and desirable performance 
improvements (like your patch!)


That is the rationale for delaying committing such changes until the start
of GCC 9 development. But again, this is up to the aarch64 maintainers.
I'm sure the patch will be a perfectly fine and desirable commit for 
GCC 9.

This is just my perspective as maintainer of the arm port.


Thanks. Your explanation makes the situation pretty clear and it sounds 
very reasonable. I'll put the patch on hold until development is open 
again.


Regards,
Luis


With GCC 9 development open, I take it this patch is worth considering 
again?


Thanks,
Luis


Re: [PATCH 2/2] Introduce prefetch-dynamic-strides option.

2018-05-07 Thread Luis Machado



On 05/01/2018 03:30 PM, Jeff Law wrote:

On 01/22/2018 06:46 AM, Luis Machado wrote:

The following patch adds an option to control software prefetching of memory
references with non-constant/unknown strides.

Currently we prefetch these references if the pass thinks there is benefit to
doing so. But, since this is all based on heuristics, it's not always the case
that we end up with better performance.

For Falkor there is also the problem of conflicts with the hardware prefetcher,
so we need to be more conservative in terms of what we issue software prefetch
hints for.

This also aligns GCC with what LLVM does for Falkor.

Similarly to the previous patch, the defaults guarantee no change in behavior
for other targets and architectures.

I've regression-tested and bootstrapped it on aarch64-linux. No problems found.

Ok?

2018-01-22  Luis Machado  <luis.mach...@linaro.org>

Introduce option to control whether the software prefetch pass should
issue prefetch hints for non-constant strides.

gcc/
* config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
<prefetch_dynamic_strides>: New const unsigned int field.
* config/aarch64/aarch64.c (generic_prefetch_tune): Update to include
prefetch_dynamic_strides.
(exynosm1_prefetch_tune): Likewise.
(thunderxt88_prefetch_tune): Likewise.
(thunderx_prefetch_tune): Likewise.
(thunderx2t99_prefetch_tune): Likewise.
(qdf24xx_prefetch_tune): Likewise. Set prefetch_dynamic_strides to 0.
(aarch64_override_options_internal): Update to set
PARAM_PREFETCH_DYNAMIC_STRIDES.
* doc/invoke.texi (prefetch-dynamic-strides): Document new option.
* params.def (PARAM_PREFETCH_DYNAMIC_STRIDES): New.
* params.h (PARAM_PREFETCH_DYNAMIC_STRIDES): Define.
* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Account for
prefetch-dynamic-strides setting.

OK for the trunk.
jeff



Thanks. Committed as r259996.


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-05-07 Thread Luis Machado



On 05/01/2018 03:30 PM, Jeff Law wrote:

On 01/22/2018 06:46 AM, Luis Machado wrote:

This patch adds a new option to control the minimum stride, for a memory
reference, above which the loop prefetch pass may issue software prefetch
hints. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger strides
that are more likely to benefit from prefetching. I've noticed a case in cpu2017
where we were issuing thousands of hints, for example.

* For processors that have a hardware prefetcher, like Falkor, it allows the
loop prefetch pass to defer prefetching of smaller (less than the threshold)
strides to the hardware prefetcher instead. This prevents conflicts between
the software prefetcher and the hardware prefetcher.

I've noticed considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms of
prefetch behavior for Falkor.

The default settings should guarantee no changes for existing targets. Those
are free to tweak the settings as necessary.

No regressions in the testsuite and bootstrapped ok on aarch64-linux.

Ok?

2018-01-22  Luis Machado  <luis.mach...@linaro.org>

Introduce option to limit software prefetching to known constant
strides above a specific threshold with the goal of preventing
conflicts with a hardware prefetcher.

gcc/
* config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
<minimum_stride>: New const int field.
* config/aarch64/aarch64.c (generic_prefetch_tune): Update to include
minimum_stride field.
(exynosm1_prefetch_tune): Likewise.
(thunderxt88_prefetch_tune): Likewise.
(thunderx_prefetch_tune): Likewise.
(thunderx2t99_prefetch_tune): Likewise.
(qdf24xx_prefetch_tune): Likewise. Set minimum_stride to 2048.
(aarch64_override_options_internal): Update to set
PARAM_PREFETCH_MINIMUM_STRIDE.
* doc/invoke.texi (prefetch-minimum-stride): Document new option.
* params.def (PARAM_PREFETCH_MINIMUM_STRIDE): New.
* params.h (PARAM_PREFETCH_MINIMUM_STRIDE): Define.
* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Return false if
stride is constant and is below the minimum stride threshold.

OK for the trunk.
jeff



Thanks. Committed as revision 259995 now.


Re: [PATCH] [AArch64] Update L2 cache size on Falkor's prefetch tuning structure.

2018-03-08 Thread Luis Machado

On 03/08/2018 10:30 AM, James Greenhalgh wrote:

On Thu, Mar 01, 2018 at 06:45:21PM +, Luis Machado wrote:

Falkor's prefetch tuning structure still carries the L2 cache size value from
early support code. This patch updates it to match the specifications.

Even though the prefetcher is currently disabled for Falkor, we have a patch
waiting for GCC development to reopen that re-enables it, so I take it this
update should be trivial enough to go in before development reopens?


This is OK for trunk now. I'm happy for you to take the risk of the change
in code-gen for qdf24xx if you are.

Thanks,
James



Thanks. There is no risk yet given prefetching is still disabled for 
qdf24xx and will be re-enabled once development reopens.


Re: [PATCH] [AArch64] Update L2 cache size on Falkor's prefetch tuning structure.

2018-03-08 Thread Luis Machado



On 03/01/2018 03:45 PM, Luis Machado wrote:

Falkor's prefetch tuning structure still carries the L2 cache size value from
early support code. This patch updates it to match the specifications.

Even though the prefetcher is currently disabled for Falkor, we have a patch
waiting for GCC development to reopen that re-enables it, so I take it this
update should be trivial enough to go in before development reopens?

Thanks,
Luis

2018-03-01  Luis Machado  <luis.mach...@linaro.org>

* config/aarch64/aarch64.c (qdf24xx_prefetch_tune) <l2_cache_size>: Set
to 512.
---
  gcc/config/aarch64/aarch64.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2f98a21..5a732b1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -564,7 +564,7 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
4,  /* num_slots  */
32, /* l1_cache_size  */
64, /* l1_cache_line_size  */
-  1024,/* l2_cache_size  */
+  512, /* l2_cache_size  */
-1  /* default_opt_level  */
  };
  



Ping?


[PATCH] [AArch64] Update L2 cache size on Falkor's prefetch tuning structure.

2018-03-01 Thread Luis Machado
Falkor's prefetch tuning structure still carries the L2 cache size value from
early support code. This patch updates it to match the specifications.

Even though the prefetcher is currently disabled for Falkor, we have a patch
waiting for GCC development to reopen that re-enables it, so I take it this
update should be trivial enough to go in before development reopens?

Thanks,
Luis

2018-03-01  Luis Machado  <luis.mach...@linaro.org>

* config/aarch64/aarch64.c (qdf24xx_prefetch_tune) <l2_cache_size>: Set
to 512.
---
 gcc/config/aarch64/aarch64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2f98a21..5a732b1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -564,7 +564,7 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   4,   /* num_slots  */
   32,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */
-  1024,/* l2_cache_size  */
+  512, /* l2_cache_size  */
   -1   /* default_opt_level  */
 };
 
-- 
2.7.4



Re: [PATCH, v2] Recognize a missed usage of a sbfiz instruction

2018-02-08 Thread Luis Machado

Hi Kyrill,

On 02/08/2018 09:48 AM, Kyrill Tkachov wrote:

Hi Luis,

On 06/02/18 15:04, Luis Machado wrote:

Thanks for the feedback Kyrill. I've adjusted the v2 patch based on your
suggestions and re-tested the changes. Everything is still sane.


Thanks! This looks pretty good to me.


Since this is ARM-specific and fairly specific, I wonder if it would be
reasonable to consider it for inclusion at the current stage.


It is true that the target maintainers can choose to take
such patches at any stage. However, any patch at this stage increases
the risk of regressions being introduced and these regressions
can come bite us in ways that are very hard to anticipate.

Have a look at some of the bugs in bugzilla (or a quick scan of the 
gcc-bugs list) for examples of the ways that things can go wrong with any 
of the myriad of GCC components and the unexpected ways in which they can 
interact.

For example, I am now working on what I initially thought was a one-liner 
fix for PR 84164 but it has expanded into a 3-patch series with a midend 
component and target-specific changes for 2 ports.

These issues are very hard to catch during review and normal testing, 
and can sometimes take months of deep testing by
fuzzing and massive codebase rebuilds to expose, so the closer the 
commit is to a release
the higher the risk is that an obscure edge case will be unnoticed and 
unfixed in the release.


So the priority at this stage is to minimise the risk of destabilising 
the codebase,
as opposed to taking in new features and desirable performance 
improvements (like your patch!)


That is the rationale for delaying committing such changes until the start
of GCC 9 development. But again, this is up to the aarch64 maintainers.
I'm sure the patch will be a perfectly fine and desirable commit for GCC 9.
This is just my perspective as maintainer of the arm port.


Thanks. Your explanation makes the situation pretty clear and it sounds 
very reasonable. I'll put the patch on hold until development is open again.


Regards,
Luis


[PATCH, v2] Recognize a missed usage of a sbfiz instruction

2018-02-06 Thread Luis Machado
Thanks for the feedback Kyrill. I've adjusted the v2 patch based on your
suggestions and re-tested the changes. Everything is still sane.

Since this is ARM-specific and fairly specific, I wonder if it would be
reasonable to consider it for inclusion at the current stage.

Regards,
Luis

Changes in v2:

- Added more restrictive predicates to operands 2, 3 and 4.
- Removed pattern conditional.
- Switched to long long for 64-bit signed integer for the testcase.

---

A customer reported the following missed opportunities to combine a couple
instructions into a sbfiz.

int sbfiz32 (int x)
{
  return x << 29 >> 10;
}

long long sbfiz64 (long long x)
{
  return x << 58 >> 20;
}

This gets converted to the following pattern:

(set (reg:SI 98)
(ashift:SI (sign_extend:SI (reg:HI 0 x0 [ xD.3334 ]))
(const_int 6 [0x6])))

Currently, gcc generates the following:

sbfiz32:
lsl x0, x0, 29
asr x0, x0, 10
ret

sbfiz64:
lsl x0, x0, 58
asr x0, x0, 20
ret

It could generate this instead:

sbfiz32:
sbfiz   w0, w0, 19, 3
ret

sbfiz64:
sbfiz   x0, x0, 38, 6
ret

The unsigned versions already generate ubfiz for the same code, so the lack of
a sbfiz pattern may have been an oversight.

This particular sbfiz pattern shows up in both CPU2006 (~ 80 hits) and
CPU2017 (~ 280 hits). It's not a lot, but seems beneficial in any case. No
significant performance differences, probably due to the small number of
occurrences.

2018-02-06  Luis Machado  <luis.mach...@linaro.org>

gcc/
* config/aarch64/aarch64.md (*ashift_extv_bfiz): New pattern.

2018-02-06  Luis Machado  <luis.mach...@linaro.org>

gcc/testsuite/
* gcc.target/aarch64/lsl_asr_sbfiz.c: New test.
---
 gcc/config/aarch64/aarch64.md                    | 13 +++++++++++++
 gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c | 24 ++++++++++++++++++++++++
 2 files changed, 37 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 5a2a930..e8284ae 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4828,6 +4828,19 @@
   [(set_attr "type" "bfx")]
 )
 
+;; Match sbfiz pattern in a shift left + shift right operation.
+
+(define_insn "*ashift_extv_bfiz"
+  [(set (match_operand:GPI 0 "register_operand" "=r")
+   (ashift:GPI (sign_extract:GPI (match_operand:GPI 1 "register_operand" 
"r")
+ (match_operand 2 
"aarch64_simd_shift_imm_offset_" "n")
+ (match_operand 3 
"aarch64_simd_shift_imm_" "n"))
+(match_operand 4 "aarch64_simd_shift_imm_" "n")))]
+  ""
+  "sbfiz\\t%0, %1, %4, %2"
+  [(set_attr "type" "bfx")]
+)
+
 ;; When the bit position and width of the equivalent extraction add up to 32
 ;; we can use a W-reg LSL instruction taking advantage of the implicit
 ;; zero-extension of the X-reg.
diff --git a/gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c b/gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c
new file mode 100644
index 000..106433d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+/* Check that a LSL followed by an ASR can be combined into a single SBFIZ
+   instruction.  */
+
+/* Using W-reg */
+
+int
+sbfiz32 (int x)
+{
+  return x << 29 >> 10;
+}
+
+/* Using X-reg */
+
+long long
+sbfiz64 (long long x)
+{
+  return x << 58 >> 20;
+}
+
+/* { dg-final { scan-assembler "sbfiz\tw" } } */
+/* { dg-final { scan-assembler "sbfiz\tx" } } */
-- 
2.7.4



[PATCH] Recognize a missed usage of a sbfiz instruction

2018-02-02 Thread Luis Machado
A customer reported the following missed opportunities to combine a couple
instructions into a sbfiz.

int sbfiz32 (int x)
{
  return x << 29 >> 10;
}

long sbfiz64 (long x)
{
  return x << 58 >> 20;
}

This gets converted to the following pattern:

(set (reg:SI 98)
(ashift:SI (sign_extend:SI (reg:HI 0 x0 [ xD.3334 ]))
(const_int 6 [0x6])))

Currently, gcc generates the following:

sbfiz32:
lsl x0, x0, 29
asr x0, x0, 10
ret

sbfiz64:
lsl x0, x0, 58
asr x0, x0, 20
ret

It could generate this instead:

sbfiz32:
sbfiz   w0, w0, 19, 3
ret

sbfiz64:
sbfiz   x0, x0, 38, 6
ret

The unsigned versions already generate ubfiz for the same code, so the lack of
a sbfiz pattern may have been an oversight.

This particular sbfiz pattern shows up in both CPU2006 (~ 80 hits) and
CPU2017 (~ 280 hits). It's not a lot, but seems beneficial in any case. No
significant performance differences, probably due to the small number of
occurrences or cases outside hot areas.

Regression-tested and bootstrapped ok on aarch64-linux. Validated with
CPU2017 and CPU2006 runs.

I thought I'd put this up for review. I know we're not in development
mode yet.

2018-02-02  Luis Machado  <luis.mach...@linaro.org>

gcc/
* config/aarch64/aarch64.md (*ashift_extv_bfiz): New pattern.
* testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c: New test.
---
 gcc/config/aarch64/aarch64.md                    | 13 +++++++++++++
 gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c | 24 ++++++++++++++++++++++++
 2 files changed, 37 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 5a2a930..d336bf0 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4828,6 +4828,19 @@
   [(set_attr "type" "bfx")]
 )
 
+;; Match sbfiz pattern in a shift left + shift right operation.
+
+(define_insn "*ashift_extv_bfiz"
+  [(set (match_operand:GPI 0 "register_operand" "=r")
+   (ashift:GPI (sign_extract:GPI (match_operand:GPI 1 "register_operand" 
"r")
+ (match_operand 2 "const_int_operand" "n")
+ (match_operand 3 "const_int_operand" "n"))
+(match_operand 4 "const_int_operand" "n")))]
+  "UINTVAL (operands[2]) < "
+  "sbfiz\\t%0, %1, %4, %2"
+  [(set_attr "type" "bfx")]
+)
+
 ;; When the bit position and width of the equivalent extraction add up to 32
 ;; we can use a W-reg LSL instruction taking advantage of the implicit
 ;; zero-extension of the X-reg.
diff --git a/gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c b/gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c
new file mode 100644
index 000..931f8f4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/lsl_asr_sbfiz.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+/* Check that a LSL followed by an ASR can be combined into a single SBFIZ
+   instruction.  */
+
+/* Using W-reg */
+
+int
+sbfiz32 (int x)
+{
+  return x << 29 >> 10;
+}
+
+/* Using X-reg */
+
+long
+sbfiz64 (long x)
+{
+  return x << 58 >> 20;
+}
+
+/* { dg-final { scan-assembler "sbfiz\tw" } } */
+/* { dg-final { scan-assembler "sbfiz\tx" } } */
-- 
2.7.4



Re: [PATCH 2/2] Introduce prefetch-dynamic-strides option.

2018-01-23 Thread Luis Machado

Hi,

On 01/23/2018 07:40 AM, Kyrill Tkachov wrote:

Hi Luis,

On 22/01/18 13:46, Luis Machado wrote:
The following patch adds an option to control software prefetching of 
memory references with non-constant/unknown strides.

Currently we prefetch these references if the pass thinks there is 
benefit to doing so. But, since this is all based on heuristics, it's 
not always the case that we end up with better performance.

For Falkor there is also the problem of conflicts with the hardware 
prefetcher, so we need to be more conservative in terms of what we issue 
software prefetch hints for.

This also aligns GCC with what LLVM does for Falkor.

Similarly to the previous patch, the defaults guarantee no change in 
behavior for other targets and architectures.

I've regression-tested and bootstrapped it on aarch64-linux. No 
problems found.


Ok?



This also looks like a sensible approach to me with a caveat inline.
The same general comments as for patch [1/2] apply.
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 8736bd9..22bd9ae 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -230,6 +230,9 @@ struct cpu_prefetch_tune
   const int l1_cache_size;
   const int l1_cache_line_size;
   const int l2_cache_size;
+  /* Whether software prefetch hints should be issued for non-constant
+ strides.  */
+  const unsigned int prefetch_dynamic_strides;


I understand that the midend PARAMs are defined as integers, but I think
the backend tuning option here is better represented as a boolean as it 
really is just a yes/no decision.

I started off with a boolean to be honest. Then I noticed the midend 
only used integers, which I restricted to the range of 0..1.


I'll change this locally to use booleans again.

Thanks!
Luis


Re: [PATCH 1/2] Introduce prefetch-minimum stride option

2018-01-23 Thread Luis Machado

Hi Kyrill,

On 01/23/2018 07:32 AM, Kyrill Tkachov wrote:

Hi Luis,

On 22/01/18 13:46, Luis Machado wrote:

This patch adds a new option to control the minimum stride, for a memory
reference, above which the loop prefetch pass may issue software prefetch
hints. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger 
strides that are more likely to benefit from prefetching. I've noticed a 
case in cpu2017 where we were issuing thousands of hints, for example.



I've noticed a large amount of prefetch hints being issued as well, but 
had not analysed it further.



I've gathered some numbers for this. Some of the most extreme cases 
before both patches:


CPU2017

xalancbmk_s: 3755 hints
wrf_s: 10950 hints
parest_r: 8521 hints

CPU2006

gamess: 11377 hints
wrf: 3238 hints

After both patches:

CPU2017

xalancbmk_s: 1 hint
wrf_s: 20 hints
parest_r: 0 hints

CPU2006

gamess: 44 hints
wrf: 16 hints


* For processors that have a hardware prefetcher, like Falkor, it allows 
the loop prefetch pass to defer prefetching of smaller (less than the 
threshold) strides to the hardware prefetcher instead. This prevents 
conflicts between the software prefetcher and the hardware prefetcher.

I've noticed considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms 
of prefetch behavior for Falkor.


Do you, by any chance, have a link to the LLVM review that implemented 
that behavior?

It's okay if you don't, but I think it would be useful context.



I've dug it up. The base change was implemented here:

review: https://reviews.llvm.org/D17945
RFC: http://lists.llvm.org/pipermail/llvm-dev/2015-December/093514.html

And then target-specific changes were introduced later for specific 
processors.


One small difference in LLVM is the fact that the second parameter, 
prefetching of non-constant strides, is implicitly switched off if one 
sets the minimum stride length. My approach here makes that second 
parameter adjustable.


I've seen big gains due to prefetching of non-constant strides, but it 
tends to be tricky to control and usually comes together with 
significant regressions as well.


The fact that we potentially unroll loops along with issuing prefetch 
hints also makes things a bit erratic.




The default settings should guarantee no changes for existing targets. 
Those are free to tweak the settings as necessary.

No regressions in the testsuite and bootstrapped ok on aarch64-linux.

Ok?



Are there any benchmark numbers you can share?
I think this approach is sensible.



Comparing the previous, more aggressive, pass behavior with the new one 
I've seen a slight improvement for CPU2006, 0.15% for both INT and FP.


For CPU2017 the previous behavior was actually a bit harmful, regressing 
performance by about 1.2% in intspeed. The new behavior kept intspeed 
stable and slightly improved fpspeed by 0.15%.


The motivation for the future is to have better control of software 
prefetching so we can fine-tune the pass, either through generic loop 
prefetch code or by using the target-specific parameters.



Since your patch touches generic code as well as AArch64 code you'll need 
an approval from a midend maintainer as well as an AArch64 maintainer.
Also, GCC development is now in the regression fixing stage, so unless 
this fixes a regression it may have to wait until GCC 9 development is 
opened.


That is my understanding. I thought I'd put this up for review anyway so 
people can chime in and provide their thoughts.


Thanks for the review.

Luis


[PATCH 1/2] Introduce prefetch-minimum stride option

2018-01-22 Thread Luis Machado
This patch adds a new option to control the minimum stride, for a memory
reference, above which the loop prefetch pass may issue software prefetch
hints. There are two motivations:

* Make the pass less aggressive, only issuing prefetch hints for bigger strides
that are more likely to benefit from prefetching. I've noticed a case in cpu2017
where we were issuing thousands of hints, for example.

* For processors that have a hardware prefetcher, like Falkor, it allows the
loop prefetch pass to defer prefetching of smaller (less than the threshold)
strides to the hardware prefetcher instead. This prevents conflicts between
the software prefetcher and the hardware prefetcher.
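
On the generic side, the new check in should_issue_prefetch_p boils down to 
roughly the following sketch (a sketch only; the exact guards and types are 
in the tree-ssa-loop-prefetch.c hunk of this patch):

/* Sketch: skip software prefetching for small constant strides.  */
if (PREFETCH_MINIMUM_STRIDE > 0
    && cst_and_fits_in_hwi (ref->group->step)
    && absu_hwi (int_cst_value (ref->group->step))
       < (unsigned HOST_WIDE_INT) PREFETCH_MINIMUM_STRIDE)
  return false;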

I've noticed considerable reduction in the number of prefetch hints and
slightly positive performance numbers. This aligns GCC and LLVM in terms of
prefetch behavior for Falkor.

The default settings should guarantee no changes for existing targets. Those
are free to tweak the settings as necessary.

No regressions in the testsuite and bootstrapped ok on aarch64-linux.

Ok?

2018-01-22  Luis Machado  <luis.mach...@linaro.org>

Introduce option to limit software prefetching to known constant
strides above a specific threshold with the goal of preventing
conflicts with a hardware prefetcher.

gcc/
* config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
<minimum_stride>: New const int field.
* config/aarch64/aarch64.c (generic_prefetch_tune): Update to include
minimum_stride field.
(exynosm1_prefetch_tune): Likewise.
(thunderxt88_prefetch_tune): Likewise.
(thunderx_prefetch_tune): Likewise.
(thunderx2t99_prefetch_tune): Likewise.
(qdf24xx_prefetch_tune): Likewise. Set minimum_stride to 2048.
(aarch64_override_options_internal): Update to set
PARAM_PREFETCH_MINIMUM_STRIDE.
* doc/invoke.texi (prefetch-minimum-stride): Document new option.
* params.def (PARAM_PREFETCH_MINIMUM_STRIDE): New.
* params.h (PARAM_PREFETCH_MINIMUM_STRIDE): Define.
* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Return false if
stride is constant and is below the minimum stride threshold.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +++
 gcc/config/aarch64/aarch64.c        | 13 ++++++++++++-
 gcc/doc/invoke.texi                 | 15 +++++++++++++++
 gcc/params.def                      |  9 +++++++++
 gcc/params.h                        |  2 ++
 gcc/tree-ssa-loop-prefetch.c        | 16 ++++++++++++++++
 6 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ef1b0bc..8736bd9 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -230,6 +230,9 @@ struct cpu_prefetch_tune
   const int l1_cache_size;
   const int l1_cache_line_size;
   const int l2_cache_size;
+  /* The minimum constant stride beyond which we should use prefetch
+ hints for.  */
+  const int minimum_stride;
   const int default_opt_level;
 };
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 174310c..0ed9f14 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -547,6 +547,7 @@ static const cpu_prefetch_tune generic_prefetch_tune =
   -1,  /* l1_cache_size  */
   -1,  /* l1_cache_line_size  */
   -1,  /* l2_cache_size  */
+  -1,  /* minimum_stride */
   -1   /* default_opt_level  */
 };
 
@@ -556,6 +557,7 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune =
   -1,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */
   -1,  /* l2_cache_size  */
+  -1,  /* minimum_stride */
   -1   /* default_opt_level  */
 };
 
@@ -565,7 +567,8 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   32,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */
   1024,/* l2_cache_size  */
-  -1   /* default_opt_level  */
+  2048,/* minimum_stride */
+  3/* default_opt_level  */
 };
 
 static const cpu_prefetch_tune thunderxt88_prefetch_tune =
@@ -574,6 +577,7 @@ static const cpu_prefetch_tune thunderxt88_prefetch_tune =
   32,  /* l1_cache_size  */
   128, /* l1_cache_line_size  */
   16*1024, /* l2_cache_size  */
+  -1,  /* minimum_stride */
   3/* default_opt_level  */
 };
 
@@ -583,6 +587,7 @@ static const cpu_prefetch_tune thunderx_prefetch_tune =
   32,  /* l1_cache_size  */
   128, /* l1_cache_line_size  */
   -1,  /* l2_cache_size  */
+  -1,  /* minimum_stride */
   -1   /* default_opt_level  */
 };
 
@@ -592,6 +597,7 @@ static

[PATCH 0/2] Add a couple new options to control loop prefetch pass

2018-01-22 Thread Luis Machado
The following couple patches add options to better control the loop prefetch
pass.

With the current settings the pass tends to be very aggressive, issuing a lot
of prefetch hints. Most of these don't translate to better performance. Some
of these issued hints may even cause more cache evictions.

Luis Machado (2):
  Introduce prefetch-minimum stride option
  Introduce prefetch-dynamic-strides option.

 gcc/config/aarch64/aarch64-protos.h |  6 ++++++
 gcc/config/aarch64/aarch64.c        | 24 +++++++++++++++++++++++-
 gcc/doc/invoke.texi                 | 25 +++++++++++++++++++++++++
 gcc/params.def                      | 18 ++++++++++++++++++
 gcc/params.h                        |  4 ++++
 gcc/tree-ssa-loop-prefetch.c        | 26 ++++++++++++++++++++++++++
 6 files changed, 102 insertions(+), 1 deletion(-)

-- 
2.7.4



[PATCH 2/2] Introduce prefetch-dynamic-strides option.

2018-01-22 Thread Luis Machado
The following patch adds an option to control software prefetching of memory
references with non-constant/unknown strides.

Currently we prefetch these references if the pass thinks there is benefit to
doing so. But, since this is all based on heuristics, it's not always the case
that we end up with better performance.

For Falkor there is also the problem of conflicts with the hardware prefetcher,
so we need to be more conservative in terms of what we issue software prefetch
hints for.

This also aligns GCC with what LLVM does for Falkor.

Similarly to the previous patch, the defaults guarantee no change in behavior
for other targets and architectures.
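
The corresponding should_issue_prefetch_p change is, roughly, the following 
(a sketch; the exact form is in the tree-ssa-loop-prefetch.c hunk of this 
patch):

/* Sketch: bail out on non-constant strides when the target has
   prefetch_dynamic_strides disabled.  */
if (!cst_and_fits_in_hwi (ref->group->step)
    && !PREFETCH_DYNAMIC_STRIDES)
  return false;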

I've regression-tested and bootstrapped it on aarch64-linux. No problems found.

Ok?

2018-01-22  Luis Machado  <luis.mach...@linaro.org>

Introduce option to control whether the software prefetch pass should
issue prefetch hints for non-constant strides.

gcc/
* config/aarch64/aarch64-protos.h (cpu_prefetch_tune)
<prefetch_dynamic_strides>: New const unsigned int field.
* config/aarch64/aarch64.c (generic_prefetch_tune): Update to include
prefetch_dynamic_strides.
(exynosm1_prefetch_tune): Likewise.
(thunderxt88_prefetch_tune): Likewise.
(thunderx_prefetch_tune): Likewise.
(thunderx2t99_prefetch_tune): Likewise.
(qdf24xx_prefetch_tune): Likewise. Set prefetch_dynamic_strides to 0.
(aarch64_override_options_internal): Update to set
PARAM_PREFETCH_DYNAMIC_STRIDES.
* doc/invoke.texi (prefetch-dynamic-strides): Document new option.
* params.def (PARAM_PREFETCH_DYNAMIC_STRIDES): New.
* params.h (PARAM_PREFETCH_DYNAMIC_STRIDES): Define.
* tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Account for
prefetch-dynamic-strides setting.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +++
 gcc/config/aarch64/aarch64.c        | 11 +++++++++++
 gcc/doc/invoke.texi                 | 10 ++++++++++
 gcc/params.def                      |  9 +++++++++
 gcc/params.h                        |  2 ++
 gcc/tree-ssa-loop-prefetch.c        | 10 ++++++++++
 6 files changed, 45 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 8736bd9..22bd9ae 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -230,6 +230,9 @@ struct cpu_prefetch_tune
   const int l1_cache_size;
   const int l1_cache_line_size;
   const int l2_cache_size;
+  /* Whether software prefetch hints should be issued for non-constant
+ strides.  */
+  const unsigned int prefetch_dynamic_strides;
   /* The minimum constant stride beyond which we should use prefetch
  hints for.  */
   const int minimum_stride;
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0ed9f14..713b230 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -547,6 +547,7 @@ static const cpu_prefetch_tune generic_prefetch_tune =
   -1,  /* l1_cache_size  */
   -1,  /* l1_cache_line_size  */
   -1,  /* l2_cache_size  */
+  1,   /* prefetch_dynamic_strides */
   -1,  /* minimum_stride */
   -1   /* default_opt_level  */
 };
@@ -557,6 +558,7 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune =
   -1,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */
   -1,  /* l2_cache_size  */
+  1,   /* prefetch_dynamic_strides */
   -1,  /* minimum_stride */
   -1   /* default_opt_level  */
 };
@@ -567,6 +569,7 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   32,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */
   1024,/* l2_cache_size  */
+  0,   /* prefetch_dynamic_strides */
   2048,/* minimum_stride */
   3/* default_opt_level  */
 };
@@ -577,6 +580,7 @@ static const cpu_prefetch_tune thunderxt88_prefetch_tune =
   32,  /* l1_cache_size  */
   128, /* l1_cache_line_size  */
   16*1024, /* l2_cache_size  */
+  1,   /* prefetch_dynamic_strides */
   -1,  /* minimum_stride */
   3/* default_opt_level  */
 };
@@ -587,6 +591,7 @@ static const cpu_prefetch_tune thunderx_prefetch_tune =
   32,  /* l1_cache_size  */
   128, /* l1_cache_line_size  */
   -1,  /* l2_cache_size  */
+  1,   /* prefetch_dynamic_strides */
   -1,  /* minimum_stride */
   -1   /* default_opt_level  */
 };
@@ -597,6 +602,7 @@ static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
   32,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */

Re: [PATCH, obv?] Fix missing newlines from local-pure-const pass dump

2017-12-04 Thread Luis Machado



On 12/04/2017 02:01 PM, Jeff Law wrote:

On 12/01/2017 11:42 AM, Luis Machado wrote:

I noticed the debugging output from local-pure-const pass is missing a
newline in a couple places, leading to this:

  local analysis of main
scanning: i ={v} 0;
 Volatile stmt is not const/pure
 Volatile operand is not const/pure  scanning: j ={v} 20;
 Volatile stmt is not const/pure
 Volatile operand is not const/pure  scanning: vol.0_10 ={v} i;
 Volatile stmt is not const/pure

It should've been:

  local analysis of main
scanning: i ={v} 0;
 Volatile stmt is not const/pure
 Volatile operand is not const/pure
scanning: j ={v} 20;
 Volatile stmt is not const/pure
 Volatile operand is not const/pure
scanning: vol.0_10 ={v} i;
 Volatile stmt is not const/pure

Seems fairly obvious. OK?

gcc/ChangeLog:

2017-12-01  Luis Machado  <luis.mach...@linaro.org>

* ipa-pure-const.c (check_decl): Add missing newline.
(state_from_flags): Likewise.

OK.
jeff



Thanks. Pushed now.


[PATCH, obv?] Fix missing newlines from local-pure-const pass dump

2017-12-01 Thread Luis Machado
I noticed the debugging output from local-pure-const pass is missing a
newline in a couple places, leading to this:

 local analysis of main
   scanning: i ={v} 0;
Volatile stmt is not const/pure
Volatile operand is not const/pure  scanning: j ={v} 20;
Volatile stmt is not const/pure
Volatile operand is not const/pure  scanning: vol.0_10 ={v} i;
Volatile stmt is not const/pure

It should've been:

 local analysis of main
   scanning: i ={v} 0;
Volatile stmt is not const/pure
Volatile operand is not const/pure
   scanning: j ={v} 20;
Volatile stmt is not const/pure
Volatile operand is not const/pure
   scanning: vol.0_10 ={v} i;
Volatile stmt is not const/pure

Seems fairly obvious. OK?

gcc/ChangeLog:

2017-12-01  Luis Machado  <luis.mach...@linaro.org>

* ipa-pure-const.c (check_decl): Add missing newline.
(state_from_flags): Likewise.
---
 gcc/ipa-pure-const.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/ipa-pure-const.c b/gcc/ipa-pure-const.c
index bdc7522..22f92fc 100644
--- a/gcc/ipa-pure-const.c
+++ b/gcc/ipa-pure-const.c
@@ -332,7 +332,7 @@ check_decl (funct_state local,
 {
   local->pure_const_state = IPA_NEITHER;
   if (dump_file)
-fprintf (dump_file, "Volatile operand is not const/pure");
+fprintf (dump_file, "Volatile operand is not const/pure\n");
   return;
 }
 
@@ -446,7 +446,7 @@ state_from_flags (enum pure_const_state_e *state, bool *looping,
 {
   *looping = true;
   if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file, " looping");
+   fprintf (dump_file, " looping\n");
 }
   if (flags & ECF_CONST)
 {
-- 
2.7.4



Re: [PATCH, AArch64] Adjust tuning parameters for Falkor

2017-11-17 Thread Luis Machado

On 11/17/2017 01:48 PM, James Greenhalgh wrote:

On Wed, Nov 15, 2017 at 03:00:53AM +, Luis Machado wrote:

I think the best thing is to leave this tuning structure in place and
just change default_opt_level   to -1 to disable it at -O3.

Thanks,
Andrew



Indeed that seems to be more appropriate if re-enabling prefetches in the
future is a possibility.

How about the following?


This is OK.

Thanks,
James


Thanks James. I've pushed this now.

Luis


Re: [PATCH, AArch64] Adjust tuning parameters for Falkor

2017-11-17 Thread Luis Machado

On 11/17/2017 07:25 AM, Kyrill Tkachov wrote:

Hi Luis,

[cc'ing aarch64 maintainers, it's quicker to get review that way]

On 15/11/17 03:00, Luis Machado wrote:

> I think the best thing is to leave this tuning structure in place and
> just change default_opt_level   to -1 to disable it at -O3.
>
> Thanks,
> Andrew
>

Indeed that seems to be more appropriate if re-enabling prefetches in the
future is a possibility.

How about the following?



This looks correct to me to achieve what you want to achieve,
but I can't approve it myself so you'll need an ok from an aarch64 
maintainer.


Thanks,
Kyrill



Thanks for the feedback. I'll wait for one of the maintainers to OK it.

Luis


Re: [PATCH, AArch64] Adjust tuning parameters for Falkor

2017-11-16 Thread Luis Machado
On 15 November 2017 at 01:00, Luis Machado <luis.mach...@linaro.org> wrote:

> > I think the best thing is to leave this tuning structure in place and
> > just change default_opt_level   to -1 to disable it at -O3.
> >
> > Thanks,
> > Andrew
> >
>
> Indeed that seems to be more appropriate if re-enabling prefetches in the
> future is a possibility.
>
> How about the following?
>
> Thanks,
> Luis
>

Ping?


Re: [PATCH, AArch64] Adjust tuning parameters for Falkor

2017-11-14 Thread Luis Machado
> I think the best thing is to leave this tuning structure in place and
> just change default_opt_level   to -1 to disable it at -O3.
> 
> Thanks,
> Andrew
> 

Indeed that seems to be more appropriate if re-enabling prefetches in the
future is a possibility.

How about the following?

Thanks,
Luis

2017-11-15  Luis Machado  <luis.mach...@linaro.org>

gcc/
* config/aarch64/aarch64.c
(qdf24xx_prefetch_tune) <default_opt_level>: Set to -1.
(qdf24xx_tunings) <autoprefetcher_model>: Set to
tune_params::AUTOPREFETCHER_WEAK.
---
 gcc/ChangeLog                | 7 +++++++
 gcc/config/aarch64/aarch64.c | 4 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index b80a421..0e05f9e 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2017-11-15  Luis Machado  <luis.mach...@linaro.org>
+
+   * config/aarch64/aarch64.c
+   (qdf24xx_prefetch_tune) <default_opt_level>: Set to -1.
+   (qdf24xx_tunings) <autoprefetcher_model>: Set to
+   tune_params::AUTOPREFETCHER_WEAK.
+
 2017-11-14  Carl Love  <c...@us.ibm.com>
 
* config/rs6000/rs6000.c (swap_endian_selector_for_mode): Remove
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0c67e2b..8779cad 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -508,7 +508,7 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   32,  /* l1_cache_size  */
   64,  /* l1_cache_line_size  */
   1024,/* l2_cache_size  */
-  3/* default_opt_level  */
+  -1   /* default_opt_level  */
 };
 
 static const cpu_prefetch_tune thunderxt88_prefetch_tune =
@@ -817,7 +817,7 @@ static const struct tune_params qdf24xx_tunings =
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
   0,   /* max_case_values.  */
-  tune_params::AUTOPREFETCHER_STRONG,  /* autoprefetcher_model.  */
+  tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
  &qdf24xx_prefetch_tune
 };
-- 
2.7.4



[PATCH, AArch64] Adjust tuning parameters for Falkor

2017-11-14 Thread Luis Machado
Disabling software prefetching and switching the autoprefetcher to weak improves
CPU2017 rate and speed benchmarks for both int and fp sets on Falkor.

SPECrate 2017 fp is up 0.38%
SPECspeed 2017 fp is up 0.54%
SPECrate 2017 int is up 3.02%
SPECspeed 2017 int is up 3.16%

There are only a couple of individual regressions, the biggest being about 
4% in parest.

For SPEC2006, we've noticed the following:

SPECint is up 0.91%
SPECfp is stable

In the case of SPEC2006 we noticed both a big regression in mcf (about 20%)
and a big improvement for hmmer (about 40%).

Since the overall result is positive, we would like to make these new tuning
settings the default for Falkor.

We may revisit the software prefetcher setting in the future, in case we
can adjust it enough so it provides us a good balance between improvements and
regressions (mcf). But for now it is best if it stays off.

I understand the freeze is happening soon, so it would be great to have this
in before then.

OK?

Thanks,
Luis

2017-11-14  Luis Machado  <luis.mach...@linaro.org>

* config/aarch64/aarch64.c (qdf24xx_prefetch_tune): Remove.
(qdf24xx_tunings): Replace qdf24xx_prefetch_tune with
generic_prefetch_tune and tune_params::AUTOPREFETCHER_STRONG with
tune_params::AUTOPREFETCHER_WEAK.
---
 gcc/ChangeLog                |  7 +++++++
 gcc/config/aarch64/aarch64.c | 13 ++-----------
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index b80a421..4dbfda0 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2017-11-14  Luis Machado  <luis.mach...@linaro.org>
+
+   * config/aarch64/aarch64.c (qdf24xx_prefetch_tune): Remove.
+   (qdf24xx_tunings): Replace qdf24xx_prefetch_tune with
+   generic_prefetch_tune and tune_params::AUTOPREFETCHER_STRONG with
+   tune_params::AUTOPREFETCHER_WEAK.
+
 2017-11-14  Carl Love  <c...@us.ibm.com>
 
* config/rs6000/rs6000.c (swap_endian_selector_for_mode): Remove
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0c67e2b..171a230 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -502,15 +502,6 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune =
   -1   /* default_opt_level  */
 };
 
-static const cpu_prefetch_tune qdf24xx_prefetch_tune =
-{
-  4,   /* num_slots  */
-  32,  /* l1_cache_size  */
-  64,  /* l1_cache_line_size  */
-  1024,/* l2_cache_size  */
-  3/* default_opt_level  */
-};
-
 static const cpu_prefetch_tune thunderxt88_prefetch_tune =
 {
   8,   /* num_slots  */
@@ -817,9 +808,9 @@ static const struct tune_params qdf24xx_tunings =
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
   0,   /* max_case_values.  */
-  tune_params::AUTOPREFETCHER_STRONG,  /* autoprefetcher_model.  */
+  tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
-  &qdf24xx_prefetch_tune
+  &generic_prefetch_tune
 };
 
 /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
-- 
2.7.4



Re: [PATCH,doc] Fix latency in pipeline description example

2017-11-13 Thread Luis Machado

On 11/10/2017 08:30 PM, Jim Wilson wrote:

On 11/09/2017 03:44 AM, Luis Machado wrote:
Am I missing something, or is this example incorrect? Should it either 
have a latency of 9 (patch attached) or a different resource utilization 
description, say, containing "div*6" instead?


The numbers are somewhat arbitrary, but you are right that there is an 
inconsistency here that should be fixed, and it doesn't really matter 
how we fix it.



2017-11-09  Luis Machado  <luis.mach...@linaro.org>

* doc/md.texi (Specifying processor pipeline description): Fix
incorrect latency for the div instruction example.


OK.

Jim



Thanks. I've checked this in now.


[PATCH,doc] Fix latency in pipeline description example

2017-11-09 Thread Luis Machado
While reading through section 16.19.9 of the internals manual, I ran into this
example that looked slightly odd:

(define_insn_reservation "div" 8 (eq_attr "type" "div")
 "i1_pipeline, div*7, div + (port0 | port1)")

The reservation string occupies units for 1 (i1_pipeline) + 7 (div*7) + 1
(div + (port0 | port1)) = 9 cycles, yet the declared latency is 8. Am I
missing something, or is this example incorrect? Should it either have a
latency of 9 (patch attached) or a different resource utilization
description, say, containing "div*6" instead?

Regards,
Luis

2017-11-09  Luis Machado  <luis.mach...@linaro.org>

* doc/md.texi (Specifying processor pipeline description): Fix
incorrect latency for the div instruction example.
---
 gcc/ChangeLog   | 5 +++++
 gcc/doc/md.texi | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 3cb3b9e..34e5daf 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2017-11-09  Luis Machado  <luis.mach...@linaro.org>
+
+   * doc/md.texi (Specifying processor pipeline description): Fix
+   incorrect latency for the div instruction example.
+
 2017-11-09  Martin Liska  <mli...@suse.cz>
 
PR tree-optimization/82669
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index c4c1138..9806b65 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -9617,7 +9617,7 @@ their result is ready in two cycles.  The simple integer 
insns are
 issued into the first pipeline unless it is reserved, otherwise they
 are issued into the second pipeline.  Integer division and
 multiplication insns can be executed only in the second integer
-pipeline and their results are ready correspondingly in 8 and 4
+pipeline and their results are ready correspondingly in 9 and 4
 cycles.  The integer division is not pipelined, i.e.@: the subsequent
 integer division insn can not be issued until the current division
 insn finished.  Floating point insns are fully pipelined and their
@@ -9634,7 +9634,7 @@ incurred.  To describe all of this we could specify
 (define_insn_reservation "mult" 4 (eq_attr "type" "mult")
  "i1_pipeline, nothing*2, (port0 | port1)")
 
-(define_insn_reservation "div" 8 (eq_attr "type" "div")
+(define_insn_reservation "div" 9 (eq_attr "type" "div")
  "i1_pipeline, div*7, div + (port0 | port1)")
 
 (define_insn_reservation "float" 3 (eq_attr "type" "float")
-- 
2.7.4