[Bug tree-optimization/88440] size optimization of memcpy-like code

2020-03-09 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Mel Chen  changed:

   What|Removed |Added

 CC||bina2374 at gmail dot com

--- Comment #27 from Mel Chen  ---
Related new bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-27 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #26 from Christophe Lyon  ---
Author: clyon
Date: Mon May 27 13:37:57 2019
New Revision: 271662

URL: https://gcc.gnu.org/viewcvs?rev=271662&root=gcc&view=rev
Log:
[testsuite,aarch64,arm] PR88440: Fix testcases

2019-05-27  Christophe Lyon  

PR tree-optimization/88440
gcc/testsuite/
* gcc.target/aarch64/sve/index_offset_1.c: Add
-fno-tree-loop-distribute-patterns.
* gcc.target/aarch64/sve/single_1.c: Likewise.
* gcc.target/aarch64/sve/single_2.c: Likewise.
* gcc.target/aarch64/sve/single_3.c: Likewise.
* gcc.target/aarch64/sve/single_4.c: Likewise.
* gcc.target/aarch64/sve/vec_init_1.c: Likewise.
* gcc.target/aarch64/vect-fmovd-zero.c: Likewise.
* gcc.target/aarch64/vect-fmovf-zero.c: Likewise.
* gcc.target/arm/ivopts.c: Likewise.


Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.target/aarch64/sve/index_offset_1.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/single_1.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/single_2.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/single_3.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/single_4.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/vec_init_1.c
trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovd-zero.c
trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovf-zero.c
trunk/gcc/testsuite/gcc.target/arm/ivopts.c
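For illustration, this is the kind of memcpy-like loop the adjusted testcases contain (a hypothetical sketch, not one of the files listed above): with -ftree-loop-distribute-patterns now enabled at -O2/-Os, GCC may replace such a loop with a single memcpy call, so tests that scan the assembly for the loop's instructions must pass -fno-tree-loop-distribute-patterns to keep the loop in place.

```c
#include <string.h>

#define N 64

/* A memcpy-like copy loop.  Under -ftree-loop-distribute-patterns the
   loop body can be recognized and replaced by memcpy (dst, src, ...),
   which is why assembly-scanning testcases need the transform disabled.  */
void
copy_like_memcpy (double *restrict dst, const double *restrict src)
{
  for (int i = 0; i < N; i++)
    dst[i] = src[i];
}
```

Either form is semantically a 512-byte copy; the testsuite change only pins down which form the compiler emits.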

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-24 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #25 from Richard Biener  ---
Author: rguenth
Date: Fri May 24 08:48:14 2019
New Revision: 271595

URL: https://gcc.gnu.org/viewcvs?rev=271595&root=gcc&view=rev
Log:
2019-05-23  Richard Biener  

PR tree-optimization/88440
* opts.c (default_options_table): Enable
-ftree-loop-distribute-patterns
at -O[2s]+.
* tree-loop-distribution.c (generate_memset_builtin): Fold the
generated call.
(generate_memcpy_builtin): Likewise.
(distribute_loop): Pass in whether to only distribute patterns.
(prepare_perfect_loop_nest): Also allow size optimization.
(pass_loop_distribution::execute): When optimizing a loop
nest for size allow pattern replacement.

* gcc.dg/tree-ssa/ldist-37.c: New testcase.
* gcc.dg/tree-ssa/ldist-38.c: Likewise.
* gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
* gcc.dg/tree-ssa/ldist-37.c: Adjust.
* gcc.dg/tree-ssa/ldist-38.c: Likewise.
* g++.dg/tree-ssa/pr78847.C: Likewise.
* gcc.dg/autopar/pr39500-1.c: Likewise.
* gcc.dg/autopar/reduc-1char.c: Likewise.
* gcc.dg/autopar/reduc-7.c: Likewise.
* gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
* gcc.dg/tree-ssa/prefetch-7.c: Likewise.
* gcc.dg/tree-ssa/prefetch-8.c: Likewise.
* gcc.dg/tree-ssa/prefetch-9.c: Likewise.
* gcc.dg/tree-ssa/scev-11.c: Likewise.
* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
* gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
* gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
* gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
* gcc.target/i386/pr30970.c: Likewise.
* gcc.target/i386/vect-double-1.c: Likewise.
* gcc.target/i386/vect-double-2.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
* gfortran.dg/vect/vect-5.f90: Likewise.
* gfortran.dg/vect/vect-8.f90: Likewise.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-37.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-38.c

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #24 from Richard Biener  ---
Fixed on trunk.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #23 from Richard Biener  ---
Author: rguenth
Date: Thu May 23 11:35:16 2019
New Revision: 271553

URL: https://gcc.gnu.org/viewcvs?rev=271553&root=gcc&view=rev
Log:
2019-05-23  Richard Biener  

PR tree-optimization/88440
* opts.c (default_options_table): Enable
-ftree-loop-distribute-patterns
at -O[2s]+.
* tree-loop-distribution.c (generate_memset_builtin): Fold the
generated call.
(generate_memcpy_builtin): Likewise.
(distribute_loop): Pass in whether to only distribute patterns.
(prepare_perfect_loop_nest): Also allow size optimization.
(pass_loop_distribution::execute): When optimizing a loop
nest for size allow pattern replacement.

* gcc.dg/tree-ssa/ldist-37.c: New testcase.
* gcc.dg/tree-ssa/ldist-38.c: Likewise.
* gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
* gcc.dg/tree-ssa/ldist-37.c: Adjust.
* gcc.dg/tree-ssa/ldist-38.c: Likewise.
* g++.dg/tree-ssa/pr78847.C: Likewise.
* gcc.dg/autopar/pr39500-1.c: Likewise.
* gcc.dg/autopar/reduc-1char.c: Likewise.
* gcc.dg/autopar/reduc-7.c: Likewise.
* gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
* gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
* gcc.dg/tree-ssa/prefetch-7.c: Likewise.
* gcc.dg/tree-ssa/prefetch-8.c: Likewise.
* gcc.dg/tree-ssa/prefetch-9.c: Likewise.
* gcc.dg/tree-ssa/scev-11.c: Likewise.
* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
* gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
* gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
* gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
* gcc.target/i386/pr30970.c: Likewise.
* gcc.target/i386/vect-double-1.c: Likewise.
* gcc.target/i386/vect-double-2.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
* gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
* gfortran.dg/vect/vect-5.f90: Likewise.
* gfortran.dg/vect/vect-8.f90: Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/opts.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/g++.dg/tree-ssa/pr78847.C
trunk/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
trunk/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
trunk/gcc/testsuite/gcc.dg/autopar/reduc-7.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-26.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-28.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-32.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-1.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-3.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-4.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-7.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-8.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-9.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/scev-11.c
trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-31.c
trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-33.c
trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c
trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c
trunk/gcc/testsuite/gcc.dg/vect/vect.exp
trunk/gcc/testsuite/gcc.target/i386/pr30970.c
trunk/gcc/testsuite/gcc.target/i386/vect-double-1.c
trunk/gcc/testsuite/gcc.target/i386/vect-double-2.c
trunk/gcc/testsuite/gfortran.dg/vect/vect-5.f90
trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
trunk/gcc/tree-loop-distribution.c

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #22 from Richard Biener  ---
The code in question was originally added in r202721 by Vlad and likely
became more costly after the target macro was converted into a hook (no
inlining anymore).

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #21 from Richard Biener  ---
Ick.

static inline void
check_pseudos_live_through_calls (int regno,
  HARD_REG_SET last_call_used_reg_set,
  rtx_insn *call_insn)
{
...
  for (hr = 0; HARD_REGISTER_NUM_P (hr); hr++)
if (targetm.hard_regno_call_part_clobbered (call_insn, hr,
PSEUDO_REGNO_MODE (regno)))
  add_to_hard_reg_set (&lra_reg_info[regno].conflict_hard_regs,
   PSEUDO_REGNO_MODE (regno), hr);

this loop is repeatedly computing an implicit hard-reg set for
which hard-regs are partly clobbered by the call for the _same_
actual instruction since check_pseudos_live_through_calls is called
via

  /* Mark each defined value as live.  We need to do this for
 unused values because they still conflict with quantities
 that are live at the time of the definition.  */
  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
{
  if (reg->type != OP_IN)
{
  update_pseudo_point (reg->regno, curr_point, USE_POINT);
  mark_regno_live (reg->regno, reg->biggest_mode);
  check_pseudos_live_through_calls (reg->regno,
last_call_used_reg_set,
call_insn);
...
}

and

  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, j)
{
  IOR_HARD_REG_SET (lra_reg_info[j].actual_call_used_reg_set,
this_call_used_reg_set);

  if (flush)
check_pseudos_live_through_calls (j,
  last_call_used_reg_set,
  last_call_insn);
}

and

  /* Mark each used value as live.  */
  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
if (reg->type != OP_OUT)
  {
if (reg->type == OP_IN)
  update_pseudo_point (reg->regno, curr_point, USE_POINT);
mark_regno_live (reg->regno, reg->biggest_mode);
check_pseudos_live_through_calls (reg->regno,
  last_call_used_reg_set,
  call_insn);
  }

and

  EXECUTE_IF_SET_IN_BITMAP (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, j, bi)
{
  if (sparseset_cardinality (pseudos_live_through_calls) == 0)
break;
  if (sparseset_bit_p (pseudos_live_through_calls, j))
check_pseudos_live_through_calls (j, last_call_used_reg_set,
call_insn);
}

the pseudo's mode may change, but I guess usually it doesn't.  I also wonder
why the target hook doesn't return a hard-reg-set ...

That said, the above code doesn't scale well for functions with a lot of
calls at least; also, the passed call_insn isn't the current insn and
might even be NULL.  All targets but aarch64 do not even look at the actual
instruction (even more of an argument for re-designing the hook with its use
in mind).

I guess an artificial testcase with a lot of calls and a lot of live
pseudos (even single-BB) should show this issue easily.

Samples: 579  of event 'cycles:ppp', Event count (approx.): 257134187434191 
Overhead  Command  Shared Object Symbol 
  22.26%  f951 f951  [.] process_bb_lives
  15.06%  f951 f951  [.] ix86_hard_regno_call_part_clobbered
   8.55%  f951 f951  [.] concat
   6.88%  f951 f951  [.] find_base_term
   3.60%  f951 f951  [.] get_ref_base_and_extent
   3.27%  f951 f951  [.] find_base_term
   2.95%  f951 f951  [.] make_hard_regno_dead
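The redundant recomputation described above could be avoided by memoizing the hook's answer per call, since it depends on the call and the mode rather than on the asking pseudo. The following is only an illustrative sketch with invented names (query_count, part_clobbered_set); it is not the actual LRA code or any real GCC patch.

```c
#include <stdint.h>

#define NUM_HARD_REGS 64

static int query_count;  /* how often the "expensive hook" ran */

/* Stand-in for targetm.hard_regno_call_part_clobbered: its answer here
   depends only on the call and the mode, not on the asking pseudo.  */
static int
hard_regno_call_part_clobbered (int call_id, int hr, int mode)
{
  query_count++;
  return ((hr ^ call_id ^ mode) & 3) == 0;  /* arbitrary but deterministic */
}

/* Memoized variant: compute the whole partly-clobbered hard-reg set once
   per (call, mode) pair and reuse it for every live pseudo, instead of
   re-running the hook per pseudo as process_bb_lives effectively does.  */
static uint64_t
part_clobbered_set (int call_id, int mode)
{
  static int cached_call = -1, cached_mode = -1;
  static uint64_t cached_set;

  if (call_id != cached_call || mode != cached_mode)
    {
      cached_set = 0;
      for (int hr = 0; hr < NUM_HARD_REGS; hr++)
        if (hard_regno_call_part_clobbered (call_id, hr, mode))
          cached_set |= (uint64_t) 1 << hr;
      cached_call = call_id;
      cached_mode = mode;
    }
  return cached_set;
}
```

With, say, 1000 live pseudos of the same mode at one call, the uncached loop issues 64000 hook queries while the cached variant issues 64.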

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #20 from Richard Biener  ---
(In reply to rguent...@suse.de from comment #11)
> On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> > 
> > --- Comment #10 from Martin Liška  ---
> > > So the only significant offender is module_configure.fppized.f90 file. Let
> > > me profile it.
> > 
> > Time profile before/after:
> > 
> > ╔══════════════════════════╤════════╤════════╤═════════╗
> > ║ PASS                     │ Before │ After  │ Change  ║
> > ╠══════════════════════════╪════════╪════════╪═════════╣
> > ║ backwards jump threading │ 6.29   │ 6.16   │ 97.93%  ║
> > ║ integrated RA            │ 6.76   │ 6.41   │ 94.82%  ║
> > ║ tree SSA incremental     │ 9.01   │ 11.16  │ 123.86% ║
> > ║ LRA create live ranges   │ 15.68  │ 40.02  │ 255.23% ║
> > ║ PRE                      │ 23.24  │ 32.32  │ 139.07% ║
> > ║ alias stmt walking       │ 27.69  │ 28.75  │ 103.83% ║
> > ║ phase opt and generate   │ 124.13 │ 163.95 │ 132.08% ║
> > ║ TOTAL                    │ 125.39 │ 165.17 │ 131.73% ║
> > ╚══════════════════════════╧════════╧════════╧═════════╝
> > 
> > Richi, do you want a perf report or do you come up with a patch that will
> > introduce the aforementioned params?
> 
> Can you share -fopt-report-loop differences?  From the above I would
> guess we split a lot of loops, meaning the memcpy/memmove/memset
> calls are in the "middle" and we have to split loops (how many
> calls are detected here?).  If that's true another way would be
> to only allow calls at head or tail position, thus a single
> non-builtin partition.

Some analysis shows, focusing on LRA lives, that unpatched we have

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 3231 BBs for initial_config
lra live on 3231 BBs for initial_config

while patched

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 465 BBs for initial_config
lra live on 465 BBs for initial_config

so it's the initial_config function.  We need 8 DF worklist iterations
in both cases, but presumably the amount of local work is larger, or the
local work isn't linear in the size of the BBs.  The "work"
it does to avoid updating hardregs by anding ~all_hard_regs_bitmap seems
somewhat pointless unless the functions do not handle those correctly.
But that's micro-optimizing, likewise adding a bitmap_ior_and_compl_and_compl
function to avoid the temporary bitmap in live_trans_fun.

perf tells us most time is spent in process_bb_lives, not in the dataflow
problem though, and there in ix86_hard_regno_call_part_clobbered
(the function has a _lot_ of calls...).

Also w/o pattern detection the lra_simple_p heuristic kicks in since
we have a lot more BBs.

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
 pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
 use simplified and faster algorithms in LRA.  */
  lra_simple_p
= (ira_use_lra_p
   && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

The code is auto-generated and large (I have a single source file using
no modules now but still too large and similar to SPEC to attach here),
so I wouldn't worry too much here.  The above magic constant should be
a --param though.
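The cut-off quoted above trips once pseudos times basic blocks reaches 2^26. A quick restatement of the published formula (the helper name below is invented; the numbers in the usage note come from the BB counts reported earlier in this comment):

```c
/* Model of the lra_simple_p cut-off quoted above: simplified LRA
   algorithms are used when
     max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun),
   i.e. roughly when pseudos * blocks >= 67108864.  */
static int
lra_simple_heuristic (int pseudos, int blocks)
{
  return pseudos >= (1 << 26) / blocks;
}
```

With the 3231 BBs seen without pattern detection, about 20770 pseudos already trip the cut-off; with the 465 BBs after patching, about 144320 would be needed, which matches the observation that the heuristic only kicks in without pattern detection.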

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 22 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> Martin Liška  changed:
> 
>What|Removed |Added
> 
>  Status|ASSIGNED|NEW
>Assignee|marxin at gcc dot gnu.org  |unassigned at gcc dot 
> gnu.org
> 
> --- Comment #17 from Martin Liška  ---
> > 
> > Hmm, so then it might be we run into some CFG complexity cut-off
> > before for PRE and RA but not after since the CFG should simplify
> > a lot if we make memcpy from all of the above loops...
> 
> I guess so. Note that even without the patch the file takes 2 minutes to
> compile. It's somewhat weird.
> 
> I'm done with the patch measurements. Richi, do you see it as beneficial to
> enable it with -O2?

OK, let's do it.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org

--- Comment #18 from Richard Biener  ---
OK, let's do it.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Martin Liška  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|marxin at gcc dot gnu.org  |unassigned at gcc dot 
gnu.org

--- Comment #17 from Martin Liška  ---
> 
> Hmm, so then it might be we run into some CFG complexity cut-off
> before for PRE and RA but not after since the CFG should simplify
> a lot if we make memcpy from all of the above loops...

I guess so. Note that even without the patch the file takes 2 minutes to
compile. It's somewhat weird.

I'm done with the patch measurements. Richi, do you see it as beneficial to
enable it with -O2?

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-22 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #16 from Martin Liška  ---
Created attachment 46393
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46393&action=edit
SPEC2006 and SPEC2017 report

The report presents the difference between master (first gray column) and
Richi's patch (the last 2 columns, since the tests were run twice).

There are both some significant improvements and some regressions. Note that
436.cactusADM jumps around (21%)!

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #15 from rguenther at suse dot de  ---
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #14 from Martin Liška  ---
> (In reply to rguent...@suse.de from comment #13)
> > On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> > > 
> > > --- Comment #12 from Martin Liška  ---
> > > > 
> > > > Can you share -fopt-report-loop differences?  From the above I would
> > > > guess we split a lot of loops, meaning the memcpy/memmove/memset
> > > > calls are in the "middle" and we have to split loops (how many
> > > > calls are detected here?).  If that's true another way would be
> > > > to only allow calls at head or tail position, thus a single
> > > > non-builtin partition.
> > > 
> > > I newly see ~1400 lines:
> > > 
> > > module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: 
> > > split to 0
> > > loops and 1 library calls.
> > > module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: 
> > > split to 0
> > > loops and 1 library calls.
> > > module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: 
> > > split to 0
> > > loops and 1 library calls.
> > > module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: 
> > > split to 0
> > > loops and 1 library calls.
> > > module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: 
> > > split to 0
> > > loops and 1 library calls.
> > > module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: 
> > > split to
> > > 0 loops and 1 library calls.
> > > module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: 
> > > split to
> > > 0 loops and 1 library calls.
> > > module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: 
> > > split to
> > > 0 loops and 1 library calls.
> > > module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: 
> > > split to
> > > 0 loops and 1 library calls.
> > > module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: 
> > > split to
> > > 0 loops and 1 library calls.
> > > module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: 
> > > split to
> > > 0 loops and 1 library calls.
> > > ...
> > 
> > All with "0 loops"?  That disputes my theory :/
> 
> Yep. All these are in a form of:
> 
>[local count: 118163158]:
>   # S.1565_41079 = PHI <1(2028), S.1565_32687(3351)>
>   # ivtmp_38850 = PHI <11(2028), ivtmp_38848(3351)>
>   _3211 = S.1565_41079 + -1;
>   _3212 = fire_ignition_start_y1[_3211];
>   MEM[(real(kind=4)[11] *)_config_rec + 101040B][_3211] = _3212;
>   S.1565_32687 = S.1565_41079 + 1;
>   ivtmp_38848 = ivtmp_38850 - 1;
>   if (ivtmp_38848 == 0)
> goto ; [9.09%]
>   else
> goto ; [90.91%]
> 
>[local count: 107425740]:
>   goto ; [100.00%]
> 
>[local count: 10737418]:
> 
>[local count: 118163158]:
>   # S.1566_41080 = PHI <1(2027), S.1566_32689(3350)>
>   # ivtmp_38856 = PHI <11(2027), ivtmp_38854(3350)>
>   _3213 = S.1566_41080 + -1;
>   _3214 = fire_ignition_end_x1[_3213];
>   MEM[(real(kind=4)[11] *)_config_rec + 101084B][_3213] = _3214;
>   S.1566_32689 = S.1566_41080 + 1;
>   ivtmp_38854 = ivtmp_38856 - 1;
>   if (ivtmp_38854 == 0)
> goto ; [9.09%]
>   else
> goto ; [90.91%]
> 
>[local count: 107425740]:
>   goto ; [100.00%]
> 
>[local count: 10737418]:
> 
>[local count: 118163158]:
>   # S.1567_41081 = PHI <1(2026), S.1567_32691(3349)>
>   # ivtmp_38860 = PHI <11(2026), ivtmp_38858(3349)>
>   _3215 = S.1567_41081 + -1;
>   _3216 = fire_ignition_end_y1[_3215];
>   MEM[(real(kind=4)[11] *)_config_rec + 101128B][_3215] = _3216;
>   S.1567_32691 = S.1567_41081 + 1;
>   ivtmp_38858 = ivtmp_38860 - 1;
>   if (ivtmp_38858 == 0)
> goto ; [9.09%]
>   else
> goto ; [90.91%]
> 
>[local count: 107425740]:
>   goto ; [100.00%]
> 
>[local count: 10737418]:
> ...
> 
> 
> It's a configure module, so it probably contains that many loops for the
> various configs.

Hmm, so then it might be we run into some CFG complexity cut-off
before for PRE and RA but not after since the CFG should simplify
a lot if we make memcpy from all of the above loops...

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #14 from Martin Liška  ---
(In reply to rguent...@suse.de from comment #13)
> On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> > 
> > --- Comment #12 from Martin Liška  ---
> > > 
> > > Can you share -fopt-report-loop differences?  From the above I would
> > > guess we split a lot of loops, meaning the memcpy/memmove/memset
> > > calls are in the "middle" and we have to split loops (how many
> > > calls are detected here?).  If that's true another way would be
> > > to only allow calls at head or tail position, thus a single
> > > non-builtin partition.
> > 
> > I newly see ~1400 lines:
> > 
> > module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split 
> > to 0
> > loops and 1 library calls.
> > module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: split 
> > to 0
> > loops and 1 library calls.
> > module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: split 
> > to 0
> > loops and 1 library calls.
> > module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: split 
> > to 0
> > loops and 1 library calls.
> > module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: split 
> > to 0
> > loops and 1 library calls.
> > module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: split 
> > to
> > 0 loops and 1 library calls.
> > module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: split 
> > to
> > 0 loops and 1 library calls.
> > module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: split 
> > to
> > 0 loops and 1 library calls.
> > module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: split 
> > to
> > 0 loops and 1 library calls.
> > module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: split 
> > to
> > 0 loops and 1 library calls.
> > module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: split 
> > to
> > 0 loops and 1 library calls.
> > ...
> 
> All with "0 loops"?  That disputes my theory :/

Yep. All these are in a form of:

   [local count: 118163158]:
  # S.1565_41079 = PHI <1(2028), S.1565_32687(3351)>
  # ivtmp_38850 = PHI <11(2028), ivtmp_38848(3351)>
  _3211 = S.1565_41079 + -1;
  _3212 = fire_ignition_start_y1[_3211];
  MEM[(real(kind=4)[11] *)_config_rec + 101040B][_3211] = _3212;
  S.1565_32687 = S.1565_41079 + 1;
  ivtmp_38848 = ivtmp_38850 - 1;
  if (ivtmp_38848 == 0)
goto ; [9.09%]
  else
goto ; [90.91%]

   [local count: 107425740]:
  goto ; [100.00%]

   [local count: 10737418]:

   [local count: 118163158]:
  # S.1566_41080 = PHI <1(2027), S.1566_32689(3350)>
  # ivtmp_38856 = PHI <11(2027), ivtmp_38854(3350)>
  _3213 = S.1566_41080 + -1;
  _3214 = fire_ignition_end_x1[_3213];
  MEM[(real(kind=4)[11] *)_config_rec + 101084B][_3213] = _3214;
  S.1566_32689 = S.1566_41080 + 1;
  ivtmp_38854 = ivtmp_38856 - 1;
  if (ivtmp_38854 == 0)
goto ; [9.09%]
  else
goto ; [90.91%]

   [local count: 107425740]:
  goto ; [100.00%]

   [local count: 10737418]:

   [local count: 118163158]:
  # S.1567_41081 = PHI <1(2026), S.1567_32691(3349)>
  # ivtmp_38860 = PHI <11(2026), ivtmp_38858(3349)>
  _3215 = S.1567_41081 + -1;
  _3216 = fire_ignition_end_y1[_3215];
  MEM[(real(kind=4)[11] *)_config_rec + 101128B][_3215] = _3216;
  S.1567_32691 = S.1567_41081 + 1;
  ivtmp_38858 = ivtmp_38860 - 1;
  if (ivtmp_38858 == 0)
goto ; [9.09%]
  else
goto ; [90.91%]

   [local count: 107425740]:
  goto ; [100.00%]

   [local count: 10737418]:
...


It's a configure module, so it probably contains that many loops for the
various configs.
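Each GIMPLE block above is an 11-iteration copy from a small array into a field of one large config record. In C the source pattern looks roughly like this; the struct and field names are invented by analogy with the dump (only fire_ignition_start_y1 appears in it literally):

```c
#define NML 11

/* Stand-in for the large auto-generated config record.  */
struct config_rec
{
  float fire_ignition_start_y[NML];
};

static float fire_ignition_start_y1[NML];
static struct config_rec model_config_rec;

/* Loop distribution recognizes this fixed-trip-count copy loop and
   replaces it with one memcpy call, which is what the opt-report lines
   "split to 0 loops and 1 library calls" describe.  */
static void
initial_config (void)
{
  for (int s = 0; s < NML; s++)
    model_config_rec.fire_ignition_start_y[s] = fire_ignition_start_y1[s];
}
```

A configure module like this contains thousands of such loops, so replacing each with a library call collapses the CFG dramatically (3231 BBs down to 465 in the analysis above).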

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #13 from rguenther at suse dot de  ---
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #12 from Martin Liška  ---
> > 
> > Can you share -fopt-report-loop differences?  From the above I would
> > guess we split a lot of loops, meaning the memcpy/memmove/memset
> > calls are in the "middle" and we have to split loops (how many
> > calls are detected here?).  If that's true another way would be
> > to only allow calls at head or tail position, thus a single
> > non-builtin partition.
> 
> I newly see ~1400 lines:
> 
> module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split to 0
> loops and 1 library calls.
> module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: split to 0
> loops and 1 library calls.
> module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: split to 0
> loops and 1 library calls.
> module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: split to 0
> loops and 1 library calls.
> module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: split to 0
> loops and 1 library calls.
> module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: split to
> 0 loops and 1 library calls.
> module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: split to
> 0 loops and 1 library calls.
> module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: split to
> 0 loops and 1 library calls.
> module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: split to
> 0 loops and 1 library calls.
> module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: split to
> 0 loops and 1 library calls.
> module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: split to
> 0 loops and 1 library calls.
> ...

All with "0 loops"?  That disputes my theory :/

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #12 from Martin Liška  ---
> 
> Can you share -fopt-report-loop differences?  From the above I would
> guess we split a lot of loops, meaning the memcpy/memmove/memset
> calls are in the "middle" and we have to split loops (how many
> calls are detected here?).  If that's true another way would be
> to only allow calls at head or tail position, thus a single
> non-builtin partition.

I newly see ~1400 lines:

module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split to 0
loops and 1 library calls.
module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: split to 0
loops and 1 library calls.
module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: split to 0
loops and 1 library calls.
module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: split to 0
loops and 1 library calls.
module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: split to 0
loops and 1 library calls.
module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: split to
0 loops and 1 library calls.
module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: split to
0 loops and 1 library calls.
module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: split to
0 loops and 1 library calls.
module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: split to
0 loops and 1 library calls.
module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: split to
0 loops and 1 library calls.
module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: split to
0 loops and 1 library calls.
...

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #11 from rguenther at suse dot de  ---
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #10 from Martin Liška  ---
> > So the only significant offender is module_configure.fppized.f90 file. Let
> > me profile it.
> 
> Time profile before/after:
> 
> ╔══════════════════════════╤════════╤════════╤═════════╗
> ║ PASS                     │ Before │ After  │ Change  ║
> ╠══════════════════════════╪════════╪════════╪═════════╣
> ║ backwards jump threading │ 6.29   │ 6.16   │ 97.93%  ║
> ║ integrated RA            │ 6.76   │ 6.41   │ 94.82%  ║
> ║ tree SSA incremental     │ 9.01   │ 11.16  │ 123.86% ║
> ║ LRA create live ranges   │ 15.68  │ 40.02  │ 255.23% ║
> ║ PRE                      │ 23.24  │ 32.32  │ 139.07% ║
> ║ alias stmt walking       │ 27.69  │ 28.75  │ 103.83% ║
> ║ phase opt and generate   │ 124.13 │ 163.95 │ 132.08% ║
> ║ TOTAL                    │ 125.39 │ 165.17 │ 131.73% ║
> ╚══════════════════════════╧════════╧════════╧═════════╝
> 
> Richi, do you want a perf report or do you come up with a patch that will
> introduce the aforementioned params?

Can you share -fopt-report-loop differences?  From the above I would
guess we split a lot of loops, meaning the memcpy/memmove/memset
calls are in the "middle" and we have to split loops (how many
calls are detected here?).  If that's true another way would be
to only allow calls at head or tail position, thus a single
non-builtin partition.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #10 from Martin Liška  ---
> So the only significant offender is module_configure.fppized.f90 file. Let
> me profile it.

Time profile before/after:

╔══════════════════════════╤════════╤════════╤═════════╗
║ PASS                     │ Before │ After  │ Change  ║
╠══════════════════════════╪════════╪════════╪═════════╣
║ backwards jump threading │ 6.29   │ 6.16   │ 97.93%  ║
║ integrated RA            │ 6.76   │ 6.41   │ 94.82%  ║
║ tree SSA incremental     │ 9.01   │ 11.16  │ 123.86% ║
║ LRA create live ranges   │ 15.68  │ 40.02  │ 255.23% ║
║ PRE                      │ 23.24  │ 32.32  │ 139.07% ║
║ alias stmt walking       │ 27.69  │ 28.75  │ 103.83% ║
║ phase opt and generate   │ 124.13 │ 163.95 │ 132.08% ║
║ TOTAL                    │ 125.39 │ 165.17 │ 131.73% ║
╚══════════════════════════╧════════╧════════╧═════════╝

Richi, do you want a perf report, or will you come up with a patch that
introduces the aforementioned params?

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-17 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #9 from Martin Liška  ---
So comparing all files in wrf, I've got:

╔═════════════════════════════════════╤════════╤════════╤═════════╗
║ Filename                            │ Before │ After  │ Change  ║
╠═════════════════════════════════════╪════════╪════════╪═════════╣
║ module_configure.fppized.f90        │ 127.81 │ 163.39 │ 127.84% ║
║ d1fgkb.fppized.f90                  │ 0.21   │ 0.23   │ 109.52% ║
║ solve_interface.fppized.f90         │ 0.35   │ 0.38   │ 108.57% ║
║ module_ltng_crmpr92.fppized.f90     │ 0.28   │ 0.3    │ 107.14% ║
║ module_cu_kf.fppized.f90            │ 1.42   │ 1.51   │ 106.34% ║
║ mradbg.fppized.f90                  │ 0.32   │ 0.34   │ 106.25% ║
║ module_sf_pxlsm.fppized.f90         │ 0.55   │ 0.58   │ 105.45% ║
║ module_domain_type.fppized.f90      │ 0.19   │ 0.2    │ 105.26% ║
║ module_shallowcu_driver.fppized.f90 │ 0.19   │ 0.2    │ 105.26% ║
║ module_bl_gfs.fppized.f90           │ 0.78   │ 0.82   │ 105.13% ║
║ module_bl_myjurb.fppized.f90        │ 0.78   │ 0.82   │ 105.13% ║
║ Meat.fppized.f90                    │ 0.39   │ 0.41   │ 105.13% ║
...

So the only significant offender is the module_configure.fppized.f90 file.
Let me profile it.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-16 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #8 from rguenther at suse dot de  ---
On Thu, 16 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #7 from Martin Liška  ---
> (In reply to Richard Biener from comment #6)
> > Created attachment 45313 [details]
> > patch
> > 
> > This enables distribution of patterns at -O[2s]+ and optimizes the testcase
> > at -Os by adjusting the guards in loop distribution.
> > 
> > Note that the interesting bits are compile-time, binary-size and performance
> > at mainly -O2, eventually size at -Os.
> > 
> > I suspect that at -O2 w/o profiling most loops would be
> > optimize_loop_for_speed
> > anyways so changing the heuristics isn't so bad but of course enabling
> distribution at -O2 might incur a penalty.
> 
> I have so far build numbers on a Zen machine with -j16:
... 
> There's only one difference:
> 
> 521.wrf_r: 310 -> 346s

Ick.  I currently see no limiting of loop size in loop
distribution.  One easy fix would be to limit the worklist
size in find_seed_stmts_for_distribution with a --param
we can lower at -O[2s]; another would be to similarly limit
the loop nest depth.

A profile might be interesting here as well...

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-05-16 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #7 from Martin Liška  ---
(In reply to Richard Biener from comment #6)
> Created attachment 45313 [details]
> patch
> 
> This enables distribution of patterns at -O[2s]+ and optimizes the testcase
> at -Os by adjusting the guards in loop distribution.
> 
> Note that the interesting bits are compile-time, binary-size and performance
> at mainly -O2, eventually size at -Os.
> 
> I suspect that at -O2 w/o profiling most loops would be
> optimize_loop_for_speed
> anyways so changing the heuristics isn't so bad but of course enabling
> distribution at -O2 might incur a penalty.

I have so far build numbers on a Zen machine with -j16:

SPEC2006:

  Elapsed compile for '400.perlbench': 00:00:05 (5)
  Elapsed compile for '401.bzip2': 00:00:02 (2)
  Elapsed compile for '403.gcc': 00:00:11 (11)
  Elapsed compile for '429.mcf': 00:00:01 (1)
  Elapsed compile for '445.gobmk': 00:00:04 (4)
  Elapsed compile for '456.hmmer': 00:00:01 (1)
  Elapsed compile for '458.sjeng': 00:00:01 (1)
  Elapsed compile for '462.libquantum': 00:00:01 (1)
  Elapsed compile for '464.h264ref': 00:00:04 (4)
  Elapsed compile for '471.omnetpp': 00:00:05 (5)
  Elapsed compile for '473.astar': 00:00:01 (1)
  Elapsed compile for '483.xalancbmk': 00:00:21 (21)
  Elapsed compile for '410.bwaves': 00:00:01 (1)
  Elapsed compile for '416.gamess': 00:00:20 (20)
  Elapsed compile for '433.milc': 00:00:02 (2)
  Elapsed compile for '434.zeusmp': 00:00:02 (2)
  Elapsed compile for '435.gromacs': 00:00:06 (6)
  Elapsed compile for '436.cactusADM': 00:00:04 (4)
  Elapsed compile for '437.leslie3d': 00:00:04 (4)
  Elapsed compile for '444.namd': 00:00:09 (9)
  Elapsed compile for '447.dealII': 00:00:15 (15)
  Elapsed compile for '450.soplex': 00:00:03 (3)
  Elapsed compile for '453.povray': 00:00:04 (4)
  Elapsed compile for '454.calculix': 00:00:06 (6)
  Elapsed compile for '459.GemsFDTD': 00:00:09 (9)
  Elapsed compile for '465.tonto': 00:00:53 (53)
  Elapsed compile for '470.lbm': 00:00:02 (2)
  Elapsed compile for '481.wrf': 00:00:38 (38)
  Elapsed compile for '482.sphinx3': 00:00:01 (1)

All differences before and after are within 1s, which is the measurement
granularity.

SPEC 2017:

  Elapsed compile for '503.bwaves_r': 00:00:01 (1)
  Elapsed compile for '507.cactuBSSN_r': 00:00:25 (25)
  Elapsed compile for '508.namd_r': 00:00:09 (9)
  Elapsed compile for '510.parest_r': 00:00:46 (46)
  Elapsed compile for '511.povray_r': 00:00:04 (4)
  Elapsed compile for '519.lbm_r': 00:00:01 (1)
  Elapsed compile for '521.wrf_r': 00:05:46 (346)
  Elapsed compile for '526.blender_r': 00:00:25 (25)
  Elapsed compile for '527.cam4_r': 00:00:37 (37)
  Elapsed compile for '538.imagick_r': 00:00:11 (11)
  Elapsed compile for '544.nab_r': 00:00:01 (1)
  Elapsed compile for '549.fotonik3d_r': 00:00:07 (7)
  Elapsed compile for '554.roms_r': 00:00:06 (6)
  Elapsed compile for '500.perlbench_r': 00:00:09 (9)
  Elapsed compile for '502.gcc_r': 00:00:44 (44)
  Elapsed compile for '505.mcf_r': 00:00:01 (1)
  Elapsed compile for '520.omnetpp_r': 00:00:12 (12)
  Elapsed compile for '523.xalancbmk_r': 00:00:25 (25)
  Elapsed compile for '525.x264_r': 00:00:09 (9)
  Elapsed compile for '531.deepsjeng_r': 00:00:02 (2)
  Elapsed compile for '541.leela_r': 00:00:03 (3)
  Elapsed compile for '548.exchange2_r': 00:00:04 (4)
  Elapsed compile for '557.xz_r': 00:00:01 (1)

There's only one difference:

521.wrf_r: 310 -> 346s

[Bug tree-optimization/88440] size optimization of memcpy-like code

2019-01-02 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #6 from Richard Biener  ---
Created attachment 45313
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45313&action=edit
patch

This enables distribution of patterns at -O[2s]+ and optimizes the testcase
at -Os by adjusting the guards in loop distribution.

Note that the interesting bits are compile-time, binary-size and performance
at mainly -O2, eventually size at -Os.

I suspect that at -O2 w/o profiling most loops would be optimize_loop_for_speed
anyways so changing the heuristics isn't so bad but of course enabling
distribution at -O2 might incur a penalty.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2018-12-27 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Martin Liška  changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |marxin at gcc dot gnu.org

--- Comment #5 from Martin Liška  ---
Sure, I can help with measurements during the next stage1. Richi, can you
please attach a patch that enables the optimization for -O[2s]?

[Bug tree-optimization/88440] size optimization of memcpy-like code

2018-12-27 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Martin Liška  changed:

           What    |Removed |Added
----------------------------------------------------------------------------
   Target Milestone|---     |10.0

[Bug tree-optimization/88440] size optimization of memcpy-like code

2018-12-12 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #4 from rguenther at suse dot de  ---
On Wed, 12 Dec 2018, hoganmeier at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #3 from krux  ---
> Adding -ftree-loop-distribute-patterns to -Os does not seem to make a
> difference though.

Possibly because of

  /* Don't distribute multiple exit edges loop, or cold loop.  */
  if (!single_exit (loop)
  || !optimize_loop_for_speed_p (loop))
continue;
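To illustrate the first of those guards: a loop with more than one exit edge is skipped by the pass at any optimization level. A hypothetical sketch (not this PR's testcase):

```c
/* Hypothetical memset-like loop with a second exit edge (the early
   break): single_exit() fails for it, so loop distribution skips it
   regardless of the optimize_loop_for_speed_p() check.  */
int clear_until_marker(int *a, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] == -1)  /* second exit edge out of the loop */
            break;
        a[i] = 0;        /* memset-like store */
    }
    return i;            /* number of elements cleared */
}
```

For krux's single-exit copy loop, the likelier culprit at -Os is the second guard, optimize_loop_for_speed_p, since -Os optimizes for size.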

[Bug tree-optimization/88440] size optimization of memcpy-like code

2018-12-11 Thread hoganmeier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #3 from krux  ---
Adding -ftree-loop-distribute-patterns to -Os does not seem to make a
difference though.

[Bug tree-optimization/88440] size optimization of memcpy-like code

2018-12-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener  changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
           Keywords|            |missed-optimization
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2018-12-11
                 CC|            |mliska at suse dot cz,
                   |            |rguenth at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #2 from Richard Biener  ---
I think distributing patterns is reasonable for -O[2s].  I think we kept it at
-O3 because of compile-time concerns (dependence analysis is quadratic).

But then we still don't do basic vectorization at -O[2s] either... (same
compile-time for not too much gain issue).

So if somebody is willing to do some compile-time / effect numbers
(bootstrap / SPEC?) then we can consider enabling loop-distribute-patterns
for -O[2s] for GCC 10.
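For context, the loops this PR is about are plain open-coded copy loops; a minimal sketch (the PR's actual testcase is not reproduced in this excerpt):

```c
#include <stddef.h>

/* Memcpy-like loop: with -ftree-loop-distribute-patterns (enabled by
   default at -O3) GCC's loop-distribution pass recognizes the pattern
   and emits a single memcpy call, which is typically both smaller and
   faster than the open-coded loop -- hence the request to enable the
   transform at -O2 and -Os as well.  */
void copy_bytes(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```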

[Bug tree-optimization/88440] size optimization of memcpy-like code

2018-12-10 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #1 from Andrew Pinski  ---
I thought I had a dup of this bug somewhere which was asking for this
optimization to be moved to -O2 (and -Os) and above rather than keeping it at
-O3 and above.