[PATCH PR85720/partial]Support runtime loop versioning if loop can be distributed into builtin functions

2018-05-22 Thread Bin Cheng
Hi,
This patch partially improves loop distribution for PR85720.  It now supports
runtime loop versioning if the loop can be distributed into builtin functions.
Note that for the moment only a coarse-grain runtime alias check is performed;
different overlapping cases for different dependence relations are not
supported yet.
Note the changes in break_alias_scc_partitions and version_loop_by_alias_check
do not strictly match each other, with the latter being more restricted,
because it's hard to pass information around.  Hopefully this will be resolved
when classifying the distributor.
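
For reference, the targeted transformation looks roughly like the following
hand-written C sketch (illustrative only, not compiler output; the real check
is emitted via IFN_LOOP_DIST_ALIAS and folded later by the vectorizer):

/* Runtime loop versioning when the body distributes into builtins.  */
void fill (char *a, char *b, unsigned n)
{
  if (a + n <= b || b + n <= a)   /* coarse-grain runtime alias check */
    {
      __builtin_memset (a, 0, n); /* distributed builtin partitions */
      __builtin_memset (b, 1, n); /* b[i] = a[i] + 1 == 1 when disjoint */
    }
  else
    for (unsigned i = 0; i < n; i++)  /* original loop as fallback */
      {
        a[i] = 0;
        b[i] = a[i] + 1;
      }
}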

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

2018-05-22  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (break_alias_scc_partitions): Don't merge
SCC if all partitions are builtins.
(version_loop_by_alias_check): New parameter.  Generate cancelable
runtime alias check if all partitions are builtins.
(distribute_loop): Update call to above function.

gcc/testsuite
2018-05-22  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/pr85720.c: New test.
* gcc.target/i386/avx256-unaligned-store-2.c: Disable loop pattern
distribution.

From 2518709d31440525010fa6692b531419fc81b426 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 21 May 2018 15:49:55 +0100
Subject: [PATCH] pr85720-20180520

---
 gcc/testsuite/gcc.dg/tree-ssa/pr85720.c| 13 +++
 .../gcc.target/i386/avx256-unaligned-store-2.c |  2 +-
 gcc/tree-loop-distribution.c   | 40 +-
 3 files changed, 45 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr85720.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr85720.c b/gcc/testsuite/gcc.dg/tree-ssa/pr85720.c
new file mode 100644
index 000..18d8be9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr85720.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target size32plus } } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist" } */
+
+void fill(char* A, char* B, unsigned n)
+{
+for (unsigned i = 0; i < n; i++)
+{
+A[i] = 0;
+B[i] = A[i] + 1;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "_builtin_memset" 2 "ldist" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c
index 87285c6..1e7969b 100644
--- a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c
+++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O3 -mtune-ctrl=sse_typeless_stores -dp -mavx -mavx256-split-unaligned-store -mno-prefer-avx128" } */
+/* { dg-options "-O3 -mtune-ctrl=sse_typeless_stores -dp -mavx -mavx256-split-unaligned-store -mno-prefer-avx128 -fno-tree-loop-distribute-patterns" } */
 
 #define N 1024
 
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 5e327f4..c6e0a60 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2268,21 +2268,26 @@ break_alias_scc_partitions (struct graph *rdg,
	  for (j = 0; partitions->iterate (j, &first); ++j)
 	if (pg->vertices[j].component == i)
 	  break;
+
+	  bool same_type = true, all_builtins = partition_builtin_p (first);
	  for (++j; partitions->iterate (j, &partition); ++j)
 	{
 	  if (pg->vertices[j].component != i)
 		continue;
 
-	  /* Note we Merge partitions of parallel type on purpose, though
-		 the result partition is sequential.  The reason is vectorizer
-		 can do more accurate runtime alias check in this case.  Also
-		 it results in more conservative distribution.  */
 	  if (first->type != partition->type)
 		{
-		  bitmap_clear_bit (sccs_to_merge, i);
+		  same_type = false;
 		  break;
 		}
+	  all_builtins &= partition_builtin_p (partition);
 	}
+	  /* Merge SCC if all partitions in SCC have the same type, though the
+	     result partition is sequential, because vectorizer can do better
+	     runtime alias check.  One exception is when all partitions in SCC
+	     are builtins.  */
+	  if (!same_type || all_builtins)
+	    bitmap_clear_bit (sccs_to_merge, i);
 	}
 
   /* Initialize callback data for traversing.  */
@@ -2458,7 +2463,8 @@ compute_alias_check_pairs (struct loop *loop, vec<ddr_p> *alias_ddrs,
checks and version LOOP under condition of these runtime alias checks.  */
 
 static void
-version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
+version_loop_by_alias_check (vec<struct partition *> *partitions,
+			     struct loop *loop, vec<ddr_p> *alias_ddrs)
 {
   profile_probability prob;
   basic_block cond_bb;
@@ -2481,9 +2487,25 @@ version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
   is_gimple_val, NULL_TREE);
 
   /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
-  if (f

[PATCH PR85804]Fix wrong code by correcting bump step computation in vector(1) load of single-element group access

2018-05-21 Thread Bin Cheng
Hi,
As reported in PR85804, bump step is wrongly computed for vector(1) load of
single-element group access.  This patch fixes the issue by correcting bump
step computation for the specific VMAT_CONTIGUOUS case.
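
To make the bug concrete, here is a hand-written sketch of the access pattern
involved (numbers follow the new test case below): d[c * 5 + 1] forms a
single-element group of size 5, so each vector(1) load must bump the data
pointer by group_size * sizeof (long) = 40 bytes, not by sizeof (long) alone:

long d[64];
long f (void)
{
  long b = 0;
  for (int c = 0; c <= 5; c++)
    b ^= d[c * 5 + 1];   /* touches d[1], d[6], d[11], ...  */
  return b;
}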

Bootstrap and test on x86_64 and AArch64 ongoing, is it OK?

Thanks,
bin

2018-05-17  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/85804
* tree-vect-stmts.c (vectorizable_load): Compute correct bump step
for vector(1) load in single-element group access.

gcc/testsuite
2018-05-17  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/85804
* gcc.c-torture/execute/pr85804.c: New test.

From 502bcd1e445186a56b6ea254a0cd2406fb62f08c Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 18 May 2018 14:18:14 +0100
Subject: [PATCH] pr85804-20180517

---
 gcc/testsuite/gcc.c-torture/execute/pr85804.c | 22 ++
 gcc/tree-vect-stmts.c | 15 +++
 2 files changed, 37 insertions(+)
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr85804.c

diff --git a/gcc/testsuite/gcc.c-torture/execute/pr85804.c b/gcc/testsuite/gcc.c-torture/execute/pr85804.c
new file mode 100644
index 000..b8929b1
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr85804.c
@@ -0,0 +1,22 @@
+/* { dg-options "-O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+long d[64] = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+};
+
+void abort ();
+void __attribute__((noinline)) foo (int b)
+{
+  if (b)
+abort ();
+}
+int main() {
+  int b = 0;
+  for (int c = 0; c <= 5; c++)
+b ^= d[c * 5 + 1];
+  foo (b);
+  return 0;
+}
+
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 64a157d..e6b95b3 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -7262,6 +7262,7 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
   gphi *phi = NULL;
  vec<tree> dr_chain = vNULL;
   bool grouped_load = false;
+  bool single_element = false;
   gimple *first_stmt;
   gimple *first_stmt_for_drptr = NULL;
   bool inv_p;
@@ -7822,6 +7823,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
   first_stmt = stmt;
   first_dr = dr;
   group_size = vec_num = 1;
+  /* For single-element vector in a single-element group, record the group
+	 size in order to compute correct bump size.  */
+  if (!slp
+	  && memory_access_type == VMAT_CONTIGUOUS
+	  && STMT_VINFO_GROUPED_ACCESS (stmt_info))
+	{
+	  single_element = true;
+	  group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
+	}
   group_gap_adj = 0;
   ref_type = reference_alias_ptr_type (DR_REF (first_dr));
 }
@@ -7992,6 +8002,11 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
   else
 	aggr_type = vectype;
   bump = vect_get_data_ptr_increment (dr, aggr_type, memory_access_type);
+  /* Multiply bump size by group size for single-element vector in single-
+	 element group.  */
+  if (single_element && group_size > 1)
+	bump = fold_build2 (MULT_EXPR, TREE_TYPE (bump), bump,
+			build_int_cst (TREE_TYPE (bump), group_size));
 }
 
   tree vec_mask = NULL_TREE;
-- 
1.9.1



[PATCH PR85793]Fix ICE by loading vector(1) scalar_type for 1 element-wise case

2018-05-16 Thread Bin Cheng
Hi,
This patch fixes an ICE by loading vector(1) scalar_type when the load is
1 element-wise for VMAT_ELEMENTWISE.
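
In other words (a sketch using the names from the hunk below): when a single
element-wise load already covers the whole vector, the load type should be the
vector type itself rather than the scalar element type:

/* VMAT_ELEMENTWISE with a vector(1) vectype: one load per vector.  */
if (nloads == 1)
  ltype = vectype;   /* load vector(1) scalar_type directly */
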
Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK?

Thanks,
bin
2018-05-16  Bin Cheng  <bin.ch...@arm.com>
Richard Biener  <rguent...@suse.de>

PR tree-optimization/85793
* tree-vect-stmts.c (vectorizable_load): Handle 1 element-wise load
for VMAT_ELEMENTWISE.

gcc/testsuite
2018-05-16  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/85793
* gcc.dg/vect/pr85793.c: New test.

From 85ef7f0c6ee0cb89804f1cd9d5a39ba26f8aaba3 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 16 May 2018 14:30:06 +0100
Subject: [PATCH] pr85793-20180515

---
 gcc/testsuite/gcc.dg/vect/pr85793.c | 12 
 gcc/tree-vect-stmts.c   |  4 
 2 files changed, 16 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr85793.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr85793.c b/gcc/testsuite/gcc.dg/vect/pr85793.c
new file mode 100644
index 000..9b5d518
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr85793.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_perm } */
+
+int a, c, d;
+long b[6];
+void fn1() {
+  for (; a < 2; a++) {
+c = 0;
+for (; c <= 5; c++)
+  d &= b[a * 3];
+  }
+}
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1e8ccbc..64a157d 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -7662,6 +7662,10 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 	}
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (TREE_TYPE (vectype)));
 	}
+  /* Load vector(1) scalar_type if it's 1 element-wise vectype.  */
+  else if (nloads == 1)
+	ltype = vectype;
+
   if (slp)
 	{
 	  /* For SLP permutation support we need to load the whole group,
-- 
1.9.1



[PATCH GCC][6/6]Restrict predcom using register pressure information

2018-05-04 Thread Bin Cheng
Hi,
This patch restricts the predcom pass using register pressure information.
In case of high register pressure, we now prune additional chains as well
as disable unrolling in predcom.  In general, I think this patch set is
useful.
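
As a rough illustration of the heuristic (hand-written sketch, hypothetical
numbers; the pruning rule itself is in the hunk below):

/* Predcom rewrites
     for (i = 0; i < n; i++)
       c[i] = a[i] + a[i + 3];
   so that a[i + 3] loaded in one iteration is reused three iterations
   later.  Such a chain keeps chain->length - 1 extra values live across
   the loop back edge, adding that amount to the pressure of its register
   class; the chain is pruned once
     max_pressure[cl] + chain->length - 1
   would exceed target_avail_regs[cl] * REG_RELAX_RATIO, e.g. 16 * 2 = 32
   for a hypothetical 16-register class.  */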

Bootstrap and test on x86_64 ongoing.  Any comments?

Thanks,
bin
2018-04-27  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c (stor-layout.h, tree-ssa-live.h): Include.
(REG_RELAX_RATIO, prune_chains): New.
(tree_predictive_commoning_loop): Compute reg pressure using class
region.  Prune chains based on reg pressure.  Force to not unroll
if reg pressure is high.

From 1b488665f8fea619c4ce35f71650c342df69de2f Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 25 Apr 2018 16:30:41 +0100
Subject: [PATCH 6/6] pcom-reg-pressure-20180423

---
 gcc/tree-predcom.c | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index aeadbf7..d0c18b3 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -217,6 +217,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "ssa.h"
 #include "gimple-pretty-print.h"
+#include "stor-layout.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "cfgloop.h"
@@ -227,6 +228,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop-manip.h"
 #include "tree-ssa-loop-niter.h"
+#include "tree-ssa-live.h"
 #include "tree-ssa-loop.h"
 #include "tree-into-ssa.h"
 #include "tree-dfa.h"
@@ -242,6 +244,10 @@ along with GCC; see the file COPYING3.  If not see
 
 #define MAX_DISTANCE (target_avail_regs[GENERAL_REGS] < 16 ? 4 : 8)
 
+/* The ratio by which register pressure check is relaxed.  */
+
+#define REG_RELAX_RATIO (2)
+
 /* Data references (or phi nodes that carry data reference values across
loop iterations).  */
 
@@ -3156,6 +3162,59 @@ insert_init_seqs (struct loop *loop, vec chains)
   }
 }
 
+/* Prune chains causing high register pressure.  */
+
+static void
+prune_chains (vec<chain_p> *chains, unsigned *max_pressure)
+{
+  bool pruned_p = false;
+  machine_mode mode;
+  enum reg_class cl;
+  unsigned i, new_pressure;
+
+  for (i = 0; i < chains->length ();)
+{
+  chain_p chain = (*chains)[i];
+  /* Always allow combined chain and zero-length chain.  */
+  if (chain->combined || chain->type == CT_COMBINATION
+	  || chain->length == 0 || chain->type == CT_STORE_STORE)
+	{
+	  i++;
+	  continue;
+	}
+
+  gcc_assert (chain->refs.length () > 0);
+  mode = TYPE_MODE (TREE_TYPE (chain->refs[0]->ref->ref));
+  /* Bypass chain that doesn't contribute to any reg_class, although
+	 something could be wrong when mapping type mode to reg_class.  */
+  if (ira_mode_classes[mode] == NO_REGS)
+	{
+	  i++;
+	  continue;
+	}
+
+  cl = ira_pressure_class_translate[ira_mode_classes[mode]];
+  /* Prune chain if it causes higher register pressure than available
+	 registers; otherwise keep the chain and update register pressure
+	 information.  */
+  new_pressure = max_pressure[cl] + chain->length - 1;
+  if (new_pressure <= target_avail_regs[cl] * REG_RELAX_RATIO)
+	{
+	  i++;
+	  max_pressure[cl] = new_pressure;
+	}
+  else
+	{
+	  release_chain (chain);
+	  chains->unordered_remove (i);
+	  pruned_p = true;
+	}
+}
+
+  if (pruned_p && dump_file && (dump_flags & TDF_DETAILS))
+fprintf (dump_file, "Prune chain because of high reg pressure\n");
+}
+
 /* Performs predictive commoning for LOOP.  Sets bit 1<<0 of return value
if LOOP was unrolled; Sets bit 1<<1 of return value if loop closed ssa
form was corrupted.  */
@@ -3171,6 +3230,9 @@ tree_predictive_commoning_loop (struct loop *loop)
   struct tree_niter_desc desc;
   bool unroll = false, loop_closed_ssa = false;
   edge exit;
+  lr_region *region;
+  unsigned max_pressure[N_REG_CLASSES];
+  bool high_pressure_p;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
 fprintf (dump_file, "Processing loop %d\n",  loop->num);
@@ -3239,6 +3301,11 @@ tree_predictive_commoning_loop (struct loop *loop)
   /* Try to combine the chains that are always worked with together.  */
  try_combine_chains (loop, &chains);
 
+  region = new lr_region (loop);
+  high_pressure_p = region->calculate_pressure (max_pressure);
+  delete region;
+  prune_chains (&chains, max_pressure);
+
   insert_init_seqs (loop, chains);
 
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -3250,6 +3317,13 @@ tree_predictive_commoning_loop (struct loop *loop)
   /* Determine the unroll factor, and if the loop should be unrolled, ensure
  that its number of iter

[PATCH GCC][5/6]implement live range, reg pressure computation class

2018-05-04 Thread Bin Cheng
Hi,
Based on the previous patch, this one implements a live range and register
pressure computation class in tree-ssa-live.c.  The user only needs to
instantiate the class and call the computation interface, as in the next
patch.  During this work, I think it's also worthwhile to classify all live
range and coalesce data structures and algorithms in the future.
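
A minimal usage sketch (mirroring how patch 6/6 calls the new interface; the
constructor and calculate_pressure calls are taken from the hunks below):

unsigned max_pressure[N_REG_CLASSES];
lr_region *region = new lr_region (loop);
bool high_pressure_p = region->calculate_pressure (max_pressure);
delete region;
/* max_pressure[cl] now holds the estimated pressure for class cl.  */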

Bootstrap and test on x86_64 and AArch64 ongoing.  Any comments?

Thanks,
bin
2018-04-27  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-live.c (memmodel.h, ira.h, tree-ssa-coalesce.h): Include.
(struct stmt_lr_info, free_stmt_lr_info): New.
(lr_region::lr_region, lr_region::~lr_region): New.
(lr_region::create_stmt_lr_info): New.
(lr_region::update_live_range_by_stmt): New.
(lr_region::calculate_coalesced_pressure): New.
(lr_region::calculate_pressure): New.
* tree-ssa-live.h (struct stmt_lr_info): New declaration.
(class lr_region): New class.

From 5c16db5672a4f0826d2a164823759a9ffb12c349 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 4 May 2018 09:42:04 +0100
Subject: [PATCH 5/6] region-reg-pressure-20180428

---
 gcc/tree-ssa-live.c | 157 
 gcc/tree-ssa-live.h |  49 
 2 files changed, 206 insertions(+)

diff --git a/gcc/tree-ssa-live.c b/gcc/tree-ssa-live.c
index ccb0d99..e51cd15 100644
--- a/gcc/tree-ssa-live.c
+++ b/gcc/tree-ssa-live.c
@@ -23,6 +23,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "backend.h"
 #include "rtl.h"
+#include "memmodel.h"
+#include "ira.h"
 #include "tree.h"
 #include "gimple.h"
 #include "timevar.h"
@@ -34,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-dfa.h"
 #include "dumpfile.h"
 #include "tree-ssa-live.h"
+#include "tree-ssa-coalesce.h"
 #include "debug.h"
 #include "tree-ssa.h"
 #include "ipa-utils.h"
@@ -1204,6 +1207,160 @@ calculate_live_ranges (var_map map, bool want_livein)
 }
 
 
+/* Live range information for a gimple stmt.  */
+struct stmt_lr_info
+{
+  /*  ID of the stmt.  */
+  unsigned id;
+  gimple *stmt;
+  /* Live ranges after the stmt.  */
+  bitmap lr_after_stmt;
+};
+
+/* Call back function to free live range INFO of gimple STMT.  */
+
+bool
+free_stmt_lr_info (gimple *const &stmt, stmt_lr_info *const &info, void *)
+{
+  gcc_assert (info->stmt == stmt);
+  if (info->lr_after_stmt != NULL)
+BITMAP_FREE (info->lr_after_stmt);
+
+  free (info);
+  return true;
+}
+
+lr_region::lr_region (struct loop *loop)
+  : m_loop (loop),
+m_varmap (NULL),
+m_liveinfo (NULL),
+    m_stmtmap (new hash_map<gimple *, stmt_lr_info *> (13))
+{
+  memset (m_pressure, 0, sizeof (unsigned) * N_REG_CLASSES);
+}
+
+lr_region::~lr_region ()
+{
+  m_stmtmap->traverse <void *, free_stmt_lr_info> (NULL);
+  delete m_stmtmap;
+}
+
+struct stmt_lr_info *
+lr_region::create_stmt_lr_info (gimple *stmt)
+{
+  bool exist_p;
+  struct stmt_lr_info **slot = &m_stmtmap->get_or_insert (stmt, &exist_p);
+
+  gcc_assert (!exist_p);
+  *slot = XCNEW (struct stmt_lr_info);
+  (*slot)->stmt = stmt;
+  (*slot)->lr_after_stmt = NULL;
+  return *slot;
+}
+
+void
+lr_region::update_live_range_by_stmt (gimple *stmt, bitmap live_ranges,
+  unsigned *pressure)
+{
+  int p;
+  tree var;
+  ssa_op_iter iter;
+
+  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
+{
+  p = var_to_partition (m_varmap, var);
+  gcc_assert (p != NO_PARTITION);
+  if (bitmap_clear_bit (live_ranges, p))
+	pressure[ira_mode_classes[TYPE_MODE (TREE_TYPE (var))]]--;
+}
+  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
+{
+  p = var_to_partition (m_varmap, var);
+  gcc_assert (p != NO_PARTITION);
+  if (bitmap_set_bit (live_ranges, p))
+	pressure[ira_mode_classes[TYPE_MODE (TREE_TYPE (var))]]++;
+}
+}
+
+void
+lr_region::calculate_coalesced_pressure ()
+{
+  unsigned i, j, reg_class, pressure[N_REG_CLASSES];
+  bitmap_iterator bi, bj;
+  gimple_stmt_iterator bsi;
+  auto_bitmap live_ranges;
+  bitmap bbs = get_bbs ();
+
+  EXECUTE_IF_SET_IN_BITMAP (bbs, 0, i, bi)
+{
+  basic_block bb = BASIC_BLOCK_FOR_FN (cfun, i);
+  bitmap_copy (live_ranges, &m_liveinfo->liveout[bb->index]);
+
+  memset (pressure, 0, sizeof (unsigned) * N_REG_CLASSES);
+  EXECUTE_IF_SET_IN_BITMAP (live_ranges, 0, j, bj)
+	{
+	  tree var = partition_to_var (m_varmap, j);
+	  reg_class = ira_mode_classes[TYPE_MODE (TREE_TYPE (var))];
+	  pressure[reg_class]++;
+	}
+
+  for (bsi = gsi_last_bb (bb); !gsi_end_p (bsi); gsi_prev (&bsi))
+	{
+	  gimple *stmt = gsi_stmt (bsi);
+	  struct stmt_lr_info *stmt_info = create_stmt_lr_info (stmt);
+	  /* No need to compute live range information for debug stmt.  */
+	  if (is_gimple_debug (stmt))
+	continue;
+
+	 

[PATCH GCC][4/6]Support regional coalesce and live range computation

2018-05-04 Thread Bin Cheng
Hi,
Following Jeff's suggestion, I am now using existing tree-ssa-live.c and
tree-ssa-coalesce.c to compute register pressure, rather than inventing
another live range solver.

The major change is to record the region's basic blocks in var_map and use
that information in the computation, rather than FOR_EACH_BB_FN.  For now
only loop and function type regions are supported.  The default is the
function type region, which is used in out-of-ssa.  The loop type region
will be used in the next patch to compute information for a loop.
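
A usage sketch (the whole-function call is taken verbatim from the
tree-outof-ssa.c hunk below; the loop-region call is my assumption of how the
next patch uses the new parameters):

/* Whole-function region, as out-of-ssa uses it.  */
var_map map = coalesce_ssa_name (NULL, flag_tree_coalesce_vars);
/* Loop region, for regional register pressure (assumed form).  */
var_map lmap = coalesce_ssa_name (loop, false);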

Bootstrap and test on x86_64 and AArch64 ongoing.  Any comments?

Thanks,
bin
2018-04-27  Bin Cheng  <bin.ch...@arm.com>

* tree-outof-ssa.c (remove_ssa_form): Update use.
* tree-ssa-coalesce.c (build_ssa_conflict_graph): Support regional
coalesce.
(coalesce_with_default): Update comment.
(create_outofssa_var_map): Support regional coalesce.  Rename to...
(create_var_map): ...this.
(coalesce_partitions): Support regional coalesce.
(gimple_can_coalesce_p, compute_optimized_partition_bases): Ditto.
(coalesce_ssa_name): Ditto.
* tree-ssa-coalesce.h (coalesce_ssa_name, gimple_can_coalesce_p):
Add parameter in declarations.
* tree-ssa-live.c (init_var_map, delete_var_map): Support regional
coalesce.
(new_tree_live_info, loe_visit_block, set_var_live_on_entry): Ditto.
(calculate_live_on_exit, verify_live_on_entry): Ditto.
* tree-ssa-live.h (enum region_type): New.
(struct _var_map): New fields.
(init_var_map): Add parameter in declaration.
(function_region_p, region_contains_p): New.
* tree-ssa-uncprop.c (uncprop_into_successor_phis): Update uses.

From 6b7b80eb40c0bd08c25c14b3f7c33937941bdfaa Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 4 May 2018 09:39:17 +0100
Subject: [PATCH 4/6] liverange-support-region-20180427

---
 gcc/tree-outof-ssa.c|  2 +-
 gcc/tree-ssa-coalesce.c | 77 ++-
 gcc/tree-ssa-coalesce.h |  4 +--
 gcc/tree-ssa-live.c | 80 +++--
 gcc/tree-ssa-live.h | 51 ++-
 gcc/tree-ssa-uncprop.c  |  5 ++--
 6 files changed, 163 insertions(+), 56 deletions(-)

diff --git a/gcc/tree-outof-ssa.c b/gcc/tree-outof-ssa.c
index 59bdcd6..81edbc5 100644
--- a/gcc/tree-outof-ssa.c
+++ b/gcc/tree-outof-ssa.c
@@ -945,7 +945,7 @@ remove_ssa_form (bool perform_ter, struct ssaexpand *sa)
   bitmap values = NULL;
   var_map map;
 
-  map = coalesce_ssa_name ();
+  map = coalesce_ssa_name (NULL, flag_tree_coalesce_vars);
 
   /* Return to viewing the variable list as just all reference variables after
  coalescing has been performed.  */
diff --git a/gcc/tree-ssa-coalesce.c b/gcc/tree-ssa-coalesce.c
index 5cc0aca..7269eb1 100644
--- a/gcc/tree-ssa-coalesce.c
+++ b/gcc/tree-ssa-coalesce.c
@@ -869,7 +869,7 @@ build_ssa_conflict_graph (tree_live_info_p liveinfo)
  coalesce variables from different base variables, including
  different parameters, so we have to make sure default defs live
  at the entry block conflict with each other.  */
-  if (flag_tree_coalesce_vars)
+  if (liveinfo->map->coalesce_vars_p)
 entry = single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun));
   else
 entry = NULL;
@@ -879,7 +879,7 @@ build_ssa_conflict_graph (tree_live_info_p liveinfo)
 
   live = new_live_track (map);
 
-  FOR_EACH_BB_FN (bb, cfun)
+  for (unsigned i = 0; liveinfo->map->vec_bbs->iterate (i, &bb); ++i)
 {
   /* Start with live on exit temporaries.  */
   live_track_init (live, live_on_exit (liveinfo, bb));
@@ -944,6 +944,8 @@ build_ssa_conflict_graph (tree_live_info_p liveinfo)
 	{
 	  gphi *phi = gsi.phi ();
 	  tree result = PHI_RESULT (phi);
+	  if (virtual_operand_p (result))
+	continue;
 	  if (live_track_live_p (live, result))
 	live_track_process_def (live, result, graph);
 	}
@@ -1071,14 +1073,18 @@ coalesce_with_default (tree var, coalesce_list *cl, bitmap used_in_copy)
   add_cost_one_coalesce (cl, SSA_NAME_VERSION (ssa), SSA_NAME_VERSION (var));
   bitmap_set_bit (used_in_copy, SSA_NAME_VERSION (var));
   /* Default defs will have their used_in_copy bits set at the end of
- create_outofssa_var_map.  */
+ create_var_map.  */
 }
 
-/* This function creates a var_map for the current function as well as creating
-   a coalesce list for use later in the out of ssa process.  */
+/* This function creates a var_map for a region indicated by BBS in the current
+   function as well as creating a coalesce list for use later in the out of ssa
+   process.  Region is a loop if LOOP is not NULL, otherwise the function.
+   COALESCE_VARS_P is true if we coalesce version of different user-defined
+   variables.  */
 
 static var_map
-create_outofssa_var_map (coalesce_list *cl, bitmap used_in_copy)
+create_var_map (struct loop *loop

[PATCH GCC][3/6]Delete unnecessary function live_merge_and_clear

2018-05-04 Thread Bin Cheng
HI,
This is an obvious patch removing the unnecessary function.

Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK?

Thanks,
bin
2018-04-27  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-live.h (live_merge_and_clear): Delete.

From ba6e47da7faba9a31c776a6d06ef052b1ed392a8 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 2 May 2018 11:37:34 +0100
Subject: [PATCH 3/6] remove-live_merge_and_clear.txt

---
 gcc/tree-ssa-live.h | 12 
 1 file changed, 12 deletions(-)

diff --git a/gcc/tree-ssa-live.h b/gcc/tree-ssa-live.h
index e62293b..448aaf9 100644
--- a/gcc/tree-ssa-live.h
+++ b/gcc/tree-ssa-live.h
@@ -289,18 +289,6 @@ live_var_map (tree_live_info_p live)
 }
 
 
-/* Merge the live on entry information in LIVE for partitions P1 and P2. Place
-   the result into P1.  Clear P2.  */
-
-static inline void
-live_merge_and_clear (tree_live_info_p live, int p1, int p2)
-{
-  gcc_checking_assert (&live->livein[p1] && &live->livein[p2]);
-  bitmap_ior_into (&live->livein[p1], &live->livein[p2]);
-  bitmap_clear (&live->livein[p2]);
-}
-
-
 /* Mark partition P as live on entry to basic block BB in LIVE.  */
 
 static inline void
-- 
1.9.1



[PATCH GCC][2/6]Compute available register for each register classes

2018-05-04 Thread Bin Cheng
Hi,
This is the second patch, computing available/clobbered registers for each
register class.  It's the same as the original patch posted at
https://gcc.gnu.org/ml/gcc-patches/2017-05/msg01022.html

Bootstrap and test on x86_64 and AArch64 ongoing.  Any comments?

Thanks,
bin
2017-04-27  Bin Cheng  <bin.ch...@arm.com>

* cfgloop.h (struct target_cfgloop): Change x_target_avail_regs and
x_target_clobbered_regs into array fields.
(init_avail_clobber_regs): New declaration.
* cfgloopanal.c (memmodel.h, ira.h): Include header files.
(init_set_costs): Remove computation for old x_target_avail_regs and
x_target_clobbered_regs fields.
(init_avail_clobber_regs): New function.
(estimate_reg_pressure_cost): Update the uses.
* toplev.c (cfgloop.h): Update comment why the header file is needed.
(backend_init_target): Call init_avail_clobber_regs.
* tree-predcom.c (memmodel.h, ira.h): Include header files.
(MAX_DISTANCE): Update the use.
* tree-ssa-loop-ivopts.c (AVAILABLE_REGS, CLOBBERED_REGS): New macro.
(ivopts_estimate_reg_pressure, determine_set_costs): Update the uses.

From 47a2074d21f4b28b4c38233628e94bcaef9ed40d Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Thu, 19 Apr 2018 15:54:14 +0100
Subject: [PATCH 2/6] init-avail_clob-regs-20180428.txt

---
 gcc/cfgloop.h  | 10 ---
 gcc/cfgloopanal.c  | 68 +++---
 gcc/toplev.c   |  3 +-
 gcc/tree-predcom.c |  4 ++-
 gcc/tree-ssa-loop-ivopts.c | 14 ++
 5 files changed, 72 insertions(+), 27 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index af9bfab..3d06e1c 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -773,11 +773,12 @@ loop_iterator::~loop_iterator ()
 
 /* The properties of the target.  */
 struct target_cfgloop {
-  /* Number of available registers.  */
-  unsigned x_target_avail_regs;
+  /* Number of available registers per register pressure class.  */
+  unsigned x_target_avail_regs[N_REG_CLASSES];
 
-  /* Number of available registers that are call-clobbered.  */
-  unsigned x_target_clobbered_regs;
+  /* Number of available registers that are call-clobbered, per register
+ pressure class.  */
+  unsigned x_target_clobbered_regs[N_REG_CLASSES];
 
   /* Number of registers reserved for temporary expressions.  */
   unsigned x_target_res_regs;
@@ -812,6 +813,7 @@ extern struct target_cfgloop *this_target_cfgloop;
invariant motion.  */
 extern unsigned estimate_reg_pressure_cost (unsigned, unsigned, bool, bool);
 extern void init_set_costs (void);
+extern void init_avail_clobber_regs (void);
 
 /* Loop optimizer initialization.  */
 extern void loop_optimizer_init (unsigned);
diff --git a/gcc/cfgloopanal.c b/gcc/cfgloopanal.c
index 3af0b2d..20010bb 100644
--- a/gcc/cfgloopanal.c
+++ b/gcc/cfgloopanal.c
@@ -22,6 +22,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "backend.h"
 #include "rtl.h"
+#include "memmodel.h"
+#include "ira.h"
 #include "tree.h"
 #include "predict.h"
 #include "memmodel.h"
@@ -344,20 +346,6 @@ init_set_costs (void)
   rtx reg2 = gen_raw_REG (SImode, LAST_VIRTUAL_REGISTER + 2);
   rtx addr = gen_raw_REG (Pmode, LAST_VIRTUAL_REGISTER + 3);
   rtx mem = validize_mem (gen_rtx_MEM (SImode, addr));
-  unsigned i;
-
-  target_avail_regs = 0;
-  target_clobbered_regs = 0;
-  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
-if (TEST_HARD_REG_BIT (reg_class_contents[GENERAL_REGS], i)
-	&& !fixed_regs[i])
-  {
-	target_avail_regs++;
-	if (call_used_regs[i])
-	  target_clobbered_regs++;
-  }
-
-  target_res_regs = 3;
 
   for (speed = 0; speed < 2; speed++)
  {
@@ -387,6 +375,54 @@ init_set_costs (void)
   default_rtl_profile ();
 }
 
+/* Initialize available and clobbered registers for each register class.  */
+
+void
+init_avail_clobber_regs (void)
+{
+  int j;
+  unsigned i;
+  bool general_regs_presented_p = false;
+
+  /* Check if GENERAL_REGS is one of pressure classes.  */
+  for (j = 0; j < ira_pressure_classes_num; j++)
+{
+  target_avail_regs[j] = 0;
+  target_clobbered_regs[j] = 0;
+  if (ira_pressure_classes[j] == GENERAL_REGS)
+	general_regs_presented_p = true;
+}
+  target_avail_regs[GENERAL_REGS] = 0;
+  target_clobbered_regs[GENERAL_REGS] = 0;
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+{
+  if (fixed_regs[i])
+	continue;
+
+  bool call_used = call_used_regs[i];
+
+  for (j = 0; j < ira_pressure_classes_num; j++)
+	if (TEST_HARD_REG_BIT (reg_class_contents[ira_pressure_classes[j]], i))
+	  {
+	target_avail_regs[ira_pressure_classes[j]]++;
+	if (call_used)
+	  target_clobbered_regs[ira_pressure_classes[j]]++;
+	  }
+
+  /* Compute pressure information for GENERAL_

[PATCH GCC][1/6]Compute type mode and register class mapping

2018-05-04 Thread Bin Cheng
Hi,
This is the updated version of the patch set computing register pressure on
TREE SSA and using that information to direct other loop optimizers (predcom
only for now).  This version follows Jeff's comment that we should reuse the
existing tree-ssa-live.c infrastructure for live range computation, rather
than inventing another one.
Jeff had another concern about exposing ira.h and low-level register stuff in
the GIMPLE world.  Unfortunately I haven't got a clear solution to it.  I find
it a bit hard to relate a type/type mode with a register class and with the
available registers without exposing this information, especially as there
are multiple possible register classes for vector types and the choice isn't
fixed.  I am open to any suggestions here.

This is the first patch, estimating the map from type mode to register class.
It doesn't need updating and is the same as the original version patch at
https://gcc.gnu.org/ml/gcc-patches/2017-05/msg01021.html
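
A sketch of how the mapping is consumed on GIMPLE (this mirrors its use in
patch 6/6; the names come from this patch):

machine_mode mode = TYPE_MODE (TREE_TYPE (expr));
if (ira_mode_classes[mode] != NO_REGS)
  {
    enum reg_class cl = ira_pressure_class_translate[ira_mode_classes[mode]];
    pressure[cl]++;   /* coarse-grained per-class pressure estimate */
  }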

Bootstrap and test on x86_64 and AArch64 ongoing.  Any comments?

Thanks,
bin
2018-04-27  Bin Cheng  <bin.ch...@arm.com>

* ira.c (setup_mode_classes): New function.
(find_reg_classes): Call above function.
* ira.h (struct target_ira): New field x_ira_mode_classes.
(ira_mode_classes): New macro.

From d65c160a37f785cff29172f1335e87d01fc260ba Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 24 Apr 2017 14:41:28 +0100
Subject: [PATCH 1/6] ira-mode-reg_class-map-20170316.txt

---
 gcc/ira.c | 77 +++
 gcc/ira.h |  7 ++
 2 files changed, 84 insertions(+)

diff --git a/gcc/ira.c b/gcc/ira.c
index b7bcc15..f132a7a 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -1154,6 +1154,82 @@ setup_class_translate (void)
 			   ira_pressure_classes_num, ira_pressure_classes);
 }
 
+/* Find desired register class for machine mode from information about
+   register pressure class.  On RTL level, we can compute preferred
+   register class information for each pseudo register or allocno.  On
+   GIMPLE level, we need to infer register class from variable's type,
+   i.e, we need map from type mode to register class.
+
+   The map information is computed by simple guess, it's good enough
+   for use on GIMPLE.  */
+void
+setup_mode_classes (void)
+{
+  int i, j;
+  machine_mode mode;
+  enum reg_class vector_class = NO_REGS;
+
+  for (i = 0; i < NUM_MACHINE_MODES; i++)
+{
+  mode = (machine_mode) i;
+  ira_mode_classes[mode] = NO_REGS;
+
+  /* Only care about integer, float and vector modes on GIMPLE.  */
+  if (!INTEGRAL_MODE_P (mode)
+	  && !FLOAT_MODE_P (mode) && !VECTOR_MODE_P (mode))
+	continue;
+
+  /* Integers must be in GENERAL_REGS by default.  */
+  if (SCALAR_INT_MODE_P (mode))
+	{
+	  ira_mode_classes[mode] = GENERAL_REGS;
+	  continue;
+	}
+
+  /* Iterate over pressure classes and find the most appropriate
+	 one for this mode.  */
+  for (j = 0; j < ira_pressure_classes_num; j++)
+	{
+	  HARD_REG_SET valid_for_cl;
+	  enum reg_class cl = ira_pressure_classes[j];
+
+	  if (!contains_reg_of_mode[cl][mode])
+	continue;
+
+	  COPY_HARD_REG_SET (valid_for_cl, reg_class_contents[cl]);
+	  AND_COMPL_HARD_REG_SET (valid_for_cl,
+  ira_prohibited_class_mode_regs[cl][mode]);
+	  AND_COMPL_HARD_REG_SET (valid_for_cl, ira_no_alloc_regs);
+	  if (hard_reg_set_empty_p (valid_for_cl))
+	continue;
+
+	  if (ira_mode_classes[mode] == NO_REGS)
+	{
+	  ira_mode_classes[mode] = cl;
+
+	  /* Record reg_class for vector mode.  */
+	  if (VECTOR_MODE_P (mode) && cl != NO_REGS)
+		vector_class = cl;
+
+	  continue;
+	}
+	  /* Prefer non GENERAL_REGS for floating points.  */
+	  if ((FLOAT_MODE_P (mode) || VECTOR_MODE_P (mode))
+	  && cl != GENERAL_REGS && ira_mode_classes[mode] == GENERAL_REGS)
+	ira_mode_classes[mode] = cl;
+	}
+}
+
+  /* Setup vector modes that are missed previously.  */
+  if (vector_class != NO_REGS)
+for (i = 0; i < NUM_MACHINE_MODES; i++)
+  {
+	mode = (machine_mode) i;
+	if (ira_mode_classes[mode] == NO_REGS && VECTOR_MODE_P (mode))
+	  ira_mode_classes[mode] = vector_class;
+  }
+}
+
 /* Order numbers of allocno classes in original target allocno class
array, -1 for non-allocno classes.  */
 static int allocno_class_order[N_REG_CLASSES];
@@ -1430,6 +1506,7 @@ find_reg_classes (void)
   setup_class_translate ();
   reorder_important_classes ();
   setup_reg_class_relations ();
+  setup_mode_classes ();
 }
 
 
diff --git a/gcc/ira.h b/gcc/ira.h
index 9df983c..3471d4c 100644
--- a/gcc/ira.h
+++ b/gcc/ira.h
@@ -66,6 +66,11 @@ struct target_ira
  class.  */
   enum reg_class x_ira_pressure_class_translate[N_REG_CLASSES];
 
+  /* Map of machine mode to register pressure class.  With this map,
+ coarse-grained register pressure can be computed on GIMPLE, where
+ w

[PATCH PR85190]Adjust pointer for aligned access

2018-04-10 Thread Bin Cheng
Hi,
Pointer q in gcc.dg/vect/pr81196.c is not aligned after vectorization,
resulting in test failures on some targets.  This simple patch adjusts it
so that it's aligned.

Is it OK?

Hi Rainer, could you please help me double check that this solves the issue?

Thanks,
bin

gcc/testsuite
2018-04-10  Bin Cheng  <bin.ch...@arm.com>

PR testsuite/85190
* gcc.dg/vect/pr81196.c: Adjust pointer for aligned access.

diff --git a/gcc/testsuite/gcc.dg/vect/pr81196.c b/gcc/testsuite/gcc.dg/vect/pr81196.c
index 46d7a9e..15320ae 100644
--- a/gcc/testsuite/gcc.dg/vect/pr81196.c
+++ b/gcc/testsuite/gcc.dg/vect/pr81196.c
@@ -4,14 +4,14 @@
 
 void f(short*p){
   p=(short*)__builtin_assume_aligned(p,64);
-  short*q=p+256;
+  short*q=p+255;
   for(;p!=q;++p,--q){
 short t=*p;*p=*q;*q=t;
   }
 }
 void b(short*p){
   p=(short*)__builtin_assume_aligned(p,64);
-  short*q=p+256;
+  short*q=p+255;
   for(;p<q;++p,--q){
 short t=*p;*p=*q;*q=t;
   }


[wwwdocs]Mention -ftree-loop-distribution

2018-04-03 Thread Bin Cheng
Hi,

Option -ftree-loop-distribution has been improved and enabled by default at
-O3 for GCC 8.  This patch describes the change; is it OK?

Thanks,
bin

Index: htdocs/gcc-8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-8/changes.html,v
retrieving revision 1.51
diff -u -r1.51 changes.html
--- htdocs/gcc-8/changes.html   3 Apr 2018 06:52:04 -   1.51
+++ htdocs/gcc-8/changes.html   3 Apr 2018 14:26:31 -
@@ -101,6 +101,13 @@
 are enabled by default at -O3 and above.
   
   
+Classical loop nest optimization pass -ftree-loop-distribution
+has been improved and enabled by default at -O3 and above.
+It supports loop nest distribution in some restricted scenarios; it also
+supports cancellable innermost loop distribution with loop versioning
+under runtime alias checks.
+  
+  
 The new option -fstack-clash-protection causes the
 compiler to insert probes whenever stack space is allocated
 statically or dynamically to reliably detect stack overflows and


[PATCH testsuite]Fix pr83126.c failure for bare-metal toolchains

2018-03-22 Thread Bin Cheng
Hi,
The new test pr83126.c requires pthread for compiling; this simple patch
skips it for bare-metal toolchains.

Test checked.  Is it OK?

Thanks,
bin

gcc/testsuite
2018-03-22  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/graphite/pr83126.c: Require pthread for the test.

diff --git a/gcc/testsuite/gcc.dg/graphite/pr83126.c b/gcc/testsuite/gcc.dg/graphite/pr83126.c
index 663d059..36bf5d5 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr83126.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr83126.c
@@ -1,3 +1,4 @@
+/* { dg-do compile { target pthread } }  */
/* { dg-additional-options "-w -ftree-parallelize-loops=2 -floop-parallelize-all -O1" }  */
 
 void


[PATCH PR84969]Don't reorder builtin memsets if they set different rhs values

2018-03-20 Thread Bin Cheng
Hi,
As noted in PR84969, fuse_memset_builtins breaks dependences between different
memsets.  Specifically, it reorders different builtin memset partitions even
though it doesn't merge them in the end.  This simple patch fixes the
wrong-code issue by checking whether any two builtin memsets set the same rhs
value.  Note we don't need to check whether two memsets intersect with each
other.

Of course, this misses the opportunity of merging S1/S3 in the case below:
  memset(p+12, 0, 12);   //<-S1
  memset(p+17, 1, 10);
  memset(p, 0, 12);  //<-S3
In my opinion, this should be resolved in a more general way, maximizing
parallelism as well as merging opportunities when sorting partitions into
topological order from the dependence graph, which isn't a GCC 8 task.
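
A hand-written illustration of the wrong-code risk (not taken from the PR):

char p[20];
void g (void)
{
  __builtin_memset (p, 0, 12);      /* S1 */
  __builtin_memset (p + 8, 1, 12);  /* S2: overlaps S1, different value */
  /* After S1;S2, p[8..11] == 1; executing S2;S1 instead leaves
     p[8..11] == 0, so partitions setting different values must not be
     reordered across each other.  */
}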

Bootstrap and test on x86_64 and AArch64 ongoing.  Okay if no failures?

Thanks,
bin

2018-03-20  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/84969
* tree-loop-distribution.c (fuse_memset_builtins): Don't reorder
builtin memset partitions if they set different rhs values.

gcc/testsuite
2018-03-20  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/84969
* gcc.dg/tree-ssa/pr84969.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84969.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84969.c
new file mode 100644
index 000..e15c3d9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84969.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribute-patterns" } */
+
+static void
+__attribute__((noipa, noinline))
+foo (char **values, int ndim, char *needquotes, int *dims)
+{
+  int i;
+  int j = 0;
+  int k = 0;
+  char *retval = (char *)__builtin_malloc(1000); 
+  char *p = retval;
+  char *tmp;
+
+  int indx[111];
+
+#define APPENDSTR(str) (__builtin_strcpy(p, (str)), p += __builtin_strlen(p))
+#define APPENDCHAR(ch) (*p++ = (ch), *p = '\0')
+
+   APPENDCHAR('{');
+   for (i = 0; i < ndim; i++)
+   indx[i] = 0;
+   do
+   {
+   for (i = j; i < ndim - 1; i++)
+   APPENDCHAR('{');
+
+   APPENDSTR(values[k]);
+   k++;
+
+   for (i = ndim - 1; i >= 0; i--)
+   {
+   indx[i] = (indx[i] + 1) % dims[i];
+   if (indx[i])
+   {
+   APPENDCHAR(',');
+   break;
+   }
+   else
+   APPENDCHAR('}');
+   }
+   j = i;
+   } while (j != -1);
+
+   if (__builtin_strcmp (retval, "{{{0,1},{2,3}}}") != 0)
+ __builtin_abort ();
+}
+
+int main()
+{
+  char* array[4] = {"0", "1", "2", "3"};
+  char f[] = {0, 0, 0, 0, 0, 0, 0, 0};
+  int dims[] = {1, 2, 2};
+  foo (array, 3, f, dims);
+
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 67f27ba..5e327f4 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2569,6 +2569,7 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
 {
   unsigned i, j;
   struct partition *part1, *part2;
+  tree rhs1, rhs2;
 
  for (i = 0; partitions->iterate (i, &part1);)
 {
@@ -2586,6 +2587,12 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
  || !operand_equal_p (part1->builtin->dst_base_base,
   part2->builtin->dst_base_base, 0))
break;
+
+ /* Memset calls setting different values can't be merged.  */
+ rhs1 = gimple_assign_rhs1 (DR_STMT (part1->builtin->dst_dr));
+ rhs2 = gimple_assign_rhs1 (DR_STMT (part2->builtin->dst_dr));
+ if (!operand_equal_p (rhs1, rhs2, 0))
+   break;
}
 
   /* Stable sort is required in order to avoid breaking dependence.  */
@@ -2617,8 +2624,8 @@ fuse_memset_builtins (vec *partitions)
  i++;
  continue;
}
-  tree rhs1 = gimple_assign_rhs1 (DR_STMT (part1->builtin->dst_dr));
-  tree rhs2 = gimple_assign_rhs1 (DR_STMT (part2->builtin->dst_dr));
+  rhs1 = gimple_assign_rhs1 (DR_STMT (part1->builtin->dst_dr));
+  rhs2 = gimple_assign_rhs1 (DR_STMT (part2->builtin->dst_dr));
   int bytev1 = const_with_all_bytes_same (rhs1);
   int bytev2 = const_with_all_bytes_same (rhs2);
   /* Only merge memset partitions of the same value.  */


[PATCH AArch64]Fix test failure for pr84682-2.c

2018-03-16 Thread Bin Cheng
Hi,
This simple patch fixes the test case failure for pr84682-2.c by returning
false on a wrong-mode rtx in aarch64_classify_address, rather than asserting.

Bootstrap and test on aarch64.  Is it OK?

Thanks,
bin

2018-03-16  Bin Cheng  <bin.ch...@arm.com>

* config/aarch64/aarch64.c (aarch64_classify_address): Return false
on wrong mode rtx, rather than assert.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 07c55b1..8790902 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5674,8 +5674,10 @@ aarch64_classify_address (struct aarch64_address_info *info,
   && (code != POST_INC && code != REG))
 return false;
 
-  gcc_checking_assert (GET_MODE (x) == VOIDmode
-  || SCALAR_INT_MODE_P (GET_MODE (x)));
+  /* Wrong mode for an address expr.  */
+  if (GET_MODE (x) != VOIDmode
+  && ! SCALAR_INT_MODE_P (GET_MODE (x)))
+return false;
 
   switch (code)
 {


[PATCH PR82965/PR83991]Fix invalid profile count in vectorization peeling

2018-01-31 Thread Bin Cheng
Hi,
This patch fixes invalid profile count information in vectorization peeling.
The current implementation is a bit confusing to me since it tries to compute
an overall probability based on the scaling probability and the change of
estimated niters.  This patch does it in two steps.  First it does the
scaling; then it adjusts to the new estimated niters by simply adjusting the
loop latch's count information and scaling the loop's count information by
the proportion new_estimated_niters/old_estimated_niters.  Of course we have
to adjust the loop latch's count information back after scaling.
Bootstrap and test on x86_64 and AArch64.  gcc.dg/vect/pr79347.c is fixed
for both PR82965 and PR83991.  Is this OK?

Thanks,
bin

2018-01-30  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82965
PR tree-optimization/83991
* cfgloopmanip.c (scale_loop_profile): Further scale loop's profile
information if the loop was predicted to iterate too many times.diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index b9b76d8..1f560b8 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -509,7 +509,7 @@ scale_loop_profile (struct loop *loop, profile_probability p,
gcov_type iteration_bound)
 {
   gcov_type iterations = expected_loop_iterations_unbounded (loop);
-  edge e;
+  edge e, preheader_e;
   edge_iterator ei;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -521,77 +521,66 @@ scale_loop_profile (struct loop *loop, profile_probability p,
   (int)iteration_bound, (int)iterations);
 }
 
+  /* Scale the probabilities.  */
+  scale_loop_frequencies (loop, p);
+
   /* See if loop is predicted to iterate too many times.  */
-  if (iteration_bound && iterations > 0
-  && p.apply (iterations) > iteration_bound)
+  if (iteration_bound == 0 || iterations <= 0
+  || p.apply (iterations) <= iteration_bound)
+return;
+
+  e = single_exit (loop);
+  preheader_e = loop_preheader_edge (loop);
+  profile_count count_in = preheader_e->count ();
+  if (e && preheader_e
+  && count_in > profile_count::zero ()
+  && loop->header->count.initialized_p ())
 {
-  /* Fixing loop profile for different trip count is not trivial; the exit
-probabilities has to be updated to match and frequencies propagated down
-to the loop body.
-
-We fully update only the simple case of loop with single exit that is
-either from the latch or BB just before latch and leads from BB with
-simple conditional jump.   This is OK for use in vectorizer.  */
-  e = single_exit (loop);
-  if (e)
-   {
- edge other_e;
- profile_count count_delta;
+  edge other_e;
+  profile_count count_delta;
 
-  FOR_EACH_EDGE (other_e, ei, e->src->succs)
-   if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
-   && e != other_e)
- break;
+  FOR_EACH_EDGE (other_e, ei, e->src->succs)
+   if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
+   && e != other_e)
+ break;
 
- /* Probability of exit must be 1/iterations.  */
- count_delta = e->count ();
- e->probability = profile_probability::always ()
+  /* Probability of exit must be 1/iterations.  */
+  count_delta = e->count ();
+  e->probability = profile_probability::always ()
.apply_scale (1, iteration_bound);
- other_e->probability = e->probability.invert ();
- count_delta -= e->count ();
-
- /* If latch exists, change its count, since we changed
-probability of exit.  Theoretically we should update everything from
-source of exit edge to latch, but for vectorizer this is enough.  */
- if (loop->latch
- && loop->latch != e->src)
-   {
- loop->latch->count += count_delta;
-   }
-   }
+  other_e->probability = e->probability.invert ();
 
   /* Roughly speaking we want to reduce the loop body profile by the
 difference of loop iterations.  We however can do better if
 we look at the actual profile, if it is available.  */
-  p = p.apply_scale (iteration_bound, iterations);
-
-  if (loop->header->count.initialized_p ())
-   {
- profile_count count_in = profile_count::zero ();
+  p = profile_probability::always ();
 
- FOR_EACH_EDGE (e, ei, loop->header->preds)
-   if (e->src != loop->latch)
- count_in += e->count ();
-
- if (count_in > profile_count::zero () )
-   {
- p = count_in.probability_in (loop->header->count.apply_scale
-(iteration_bound, 1));
-   }
-   }
+  cou

[PATCH PR82604]Fix regression in ftree-parallelize-loops

2018-01-19 Thread Bin Cheng
Hi,
This patch is supposed to fix a regression caused by loop distribution with
-ftree-parallelize-loops.  The reason is that the distributed memset call
can't be understood/analyzed by data reference analysis; as a result, parloop
can only parallelize the innermost 2-level loop nest.  Before the
distribution change, parloop could parallelize the innermost 3-level loop
nest, i.e., more parallelization.
As commented in the PR, ideally loop distribution should be able to
distribute the memset call for a 3-level loop nest.  Unfortunately this
requires sophisticated work proving equality between tree expressions, which
GCC is not good at now.
Another fix would be to improve data reference analysis so that memset calls
can be supported.  We don't know how big that change is, and it's definitely
not a GCC 8 task.
So this patch fixes the regression in a somewhat hacky way.  It first enables
3-level loop nest distribution when flag_tree_parloops > 1.  Secondly, it
supports 3-level loop nest distribution for a ZERO-ing stmt which can only
be distributed as a loop (nest) of memsets, but can't be distributed as a
single memset.  The overall effect is that the ZERO-ing stmt will be
distributed one loop level deeper than now, so parloop can parallelize as
before.
Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK if no errors?

Thanks,
bin
2018-01-19  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82604
* tree-loop-distribution.c (enum partition_kind): New enum item
PKIND_PARTIAL_MEMSET.
(partition_builtin_p): Support above new enum item.
(generate_code_for_partition): Ditto.
(compute_access_range): Differentiate cases that equality can be
proven at all loops, the innermost loops or no loops.
(classify_builtin_st, classify_builtin_ldst): Adjust call to above
function.  Set PKIND_PARTIAL_MEMSET for partition appropriately.
(finalize_partitions, distribute_loop): Don't fuse partition of
PKIND_PARTIAL_MEMSET kind when distributing 3-level loop nest.
(prepare_perfect_loop_nest): Distribute 3-level loop nest only if
parloop is enabled.diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index a3d76e4..807fd07 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -584,7 +584,19 @@ build_rdg (struct loop *loop, control_dependences *cd)
 
 /* Kind of distributed loop.  */
 enum partition_kind {
-PKIND_NORMAL, PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
+PKIND_NORMAL,
+/* Partial memset stands for a partition that can be distributed into a loop
+   of memset calls, rather than a single memset call.  It's handled just
+   like a normal partition, i.e., distributed as a separate loop; no memset
+   call is generated.
+
+   Note: This is a hacking fix trying to distribute ZERO-ing stmt in a
+   loop nest as deep as possible.  As a result, parloop achieves better
+   parallelization by parallelizing deeper loop nest.  This hack should
+   be unnecessary and removed once distributed memset can be understood
+   and analyzed in data reference analysis.  See PR82604 for more.  */
+PKIND_PARTIAL_MEMSET,
+PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
 };
 
 /* Type of distributed loop.  */
@@ -659,7 +671,7 @@ partition_free (partition *partition)
 static bool
 partition_builtin_p (partition *partition)
 {
-  return partition->kind != PKIND_NORMAL;
+  return partition->kind > PKIND_PARTIAL_MEMSET;
 }
 
 /* Returns true if the partition contains a reduction.  */
@@ -1127,6 +1139,7 @@ generate_code_for_partition (struct loop *loop,
   switch (partition->kind)
 {
 case PKIND_NORMAL:
+case PKIND_PARTIAL_MEMSET:
   /* Reductions all have to be in the last partition.  */
   gcc_assert (!partition_reduction_p (partition)
  || !copy_p);
@@ -1399,17 +1412,22 @@ find_single_drs (struct loop *loop, struct graph *rdg, partition *partition,
 
 /* Given data reference DR in LOOP_NEST, this function checks the enclosing
loops from inner to outer to see if loop's step equals to access size at
-   each level of loop.  Return true if yes; record access base and size in
-   BASE and SIZE; save loop's step at each level of loop in STEPS if it is
-   not null.  For example:
+   each level of loop.  Return 2 if we can prove this at all level loops;
+   record access base and size in BASE and SIZE; save loop's step at each
+   level of loop in STEPS if it is not null.  For example:
 
  int arr[100][100][100];
  for (i = 0; i < 100; i++)   ;steps[2] = 4
for (j = 100; j > 0; j--) ;steps[1] = -400
 for (k = 0; k < 100; k++)   ;steps[0] = 4
-  arr[i][j - 1][k] = 0; ;base = &arr, size = 400.  */
+  arr[i][j - 1][k] = 0; ;base = &arr, size = 400
 
-static bool
+   Return 1 if we can prove the equality at the innermost loop, but not all
+   level loops.  In this case, no informati

[PATCH PR83695]Fix ICE by resetting cached scev info after interchange.

2018-01-11 Thread Bin Cheng
Hi,
As explained in the comment of PR83695, outdated cached scev info could be
referred to by a later interchange of outer loops in the nest.  This simple
patch fixes the ICE by resetting cached scev info after each interchange.
Resetting all scev information is expensive, but that might not be a problem
here given we only interchange in limited cases.

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin

2018-01-11  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/83695
* gimple-loop-interchange.cc
(tree_loop_interchange::interchange_loops): Call scev_reset_htab to
reset cached scev information after interchange.
(pass_linterchange::execute): Remove call to scev_reset_htab.

gcc/testsuite
2018-01-11  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/83695
* gcc.dg/tree-ssa/pr83695.c: New test.diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 01a26c0..eb35263 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -1119,6 +1119,10 @@ tree_loop_interchange::interchange_loops (loop_cand &iloop, loop_cand &oloop)
   oloop.m_loop->any_likely_upper_bound = false;
   free_numbers_of_iterations_estimates (oloop.m_loop);
 
+  /* Clear all cached scev information.  This is expensive but shouldn't be
+ a problem given we interchange in very limited times.  */
+  scev_reset_htab ();
+
   /* ???  The association between the loop data structure and the
  CFG changed, so what was loop N at the source level is now
  loop M.  We should think of retaining the association or breaking
@@ -2070,9 +2074,6 @@ pass_linterchange::execute (function *fun)
   loop_nest.release ();
 }
 
-  if (changed_p)
-scev_reset_htab ();
-
   return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
 }
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr83695.c b/gcc/testsuite/gcc.dg/tree-ssa/pr83695.c
new file mode 100644
index 000..af56a31
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr83695.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a[3][3][3], b, d;
+short c;
+unsigned char e;
+
+static void f ()
+{
+  for (c = 0; c < 2; c++)
+  for (e = 0; e < 3; e++)
+for (b = 0; b < 3; b++)
+  a[b][e][b] = 0;
+  while (1)
+;
+}
+
+int main ()
+{
+  if (d)
+f ();
+  return 0;
+}


[PATCH BACKPORT]Backport r254778 and test case in r244815 to GCC6

2017-12-19 Thread Bin Cheng
Hi,
This patch backports r254778 and the test case from r244815 to GCC 6.
Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

2017-12-18  Bin Cheng  <bin.ch...@arm.com>

Backport from mainline
2017-11-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
PR tree-optimization/70754
* tree-predcom.c (order_drefs_by_pos): New function.
(combine_chains): Move code setting has_max_use_after to...
(try_combine_chains): ...here.  New parameter.  Sort combined chains
according to position information.
(tree_predictive_commoning_loop): Update call to above function.
(update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.

gcc/testsuite
2017-12-18  Bin Cheng  <bin.ch...@arm.com>

Backport from mainline
        2017-11-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
* gcc.dg/tree-ssa/pr82726.c: New test.

Backport from mainline
    2017-01-23  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/70754
* gfortran.dg/pr70754.f90: New test.

Index: gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr82726.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr82726.c (working copy)
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param tree-reassoc-width=4" } */
+/* { dg-additional-options "-mavx2" { target { x86_64-*-* i?86-*-* } } } */
+
+#define N 40
+#define M 128
+unsigned int in[N+M];
+unsigned short out[N];
+
+/* Outer-loop vectorization. */
+
+void
+foo (){
+  int i,j;
+  unsigned int diff;
+
+  for (i = 0; i < N; i++) {
+diff = 0;
+for (j = 0; j < M; j+=8) {
+  diff += in[j+i];
+}
+out[i]=(unsigned short)diff;
+  }
+
+  return;
+}
Index: gcc/testsuite/gfortran.dg/pr70754.f90
===
--- gcc/testsuite/gfortran.dg/pr70754.f90   (revision 0)
+++ gcc/testsuite/gfortran.dg/pr70754.f90   (working copy)
@@ -0,0 +1,35 @@
+! { dg-do compile }
+! { dg-options "-Ofast" }
+module m
+  implicit none
+  private
+  save
+
+  integer, parameter, public :: &
+ii4  = selected_int_kind(6), &
+rr8  = selected_real_kind(13)
+
+  integer (ii4), dimension(40,40,199), public :: xyz
+  public :: foo
+contains
+  subroutine foo(a)
+real (rr8), dimension(40,40), intent(out) :: a
+real (rr8), dimension(40,40) :: b
+integer (ii4), dimension(40,40) :: c
+integer  i, j
+
+do i=1,20
+  b(i,j) = 123 * a(i,j) + 34 * a(i,j+1) &
+ + 34 * a(i,j-1) + a(i+1,j+1) &
+ + a(i+1,j-1) + a(i-1,j+1) &
+ + a(i-1,j-1)
+  c(i,j) = 123
+end do
+
+where ((xyz(:,:,2) /= 0) .and. (c /= 0))
+  a = b/real(c)
+elsewhere
+  a = 456
+endwhere
+ end subroutine foo
+end module m
Index: gcc/tree-predcom.c
===
--- gcc/tree-predcom.c  (revision 255817)
+++ gcc/tree-predcom.c  (working copy)
@@ -943,6 +943,17 @@
   return (*da)->pos - (*db)->pos;
 }
 
+/* Compares two drefs A and B by their position.  Callback for qsort.  */
+
+static int
+order_drefs_by_pos (const void *a, const void *b)
+{
+  const dref *const da = (const dref *) a;
+  const dref *const db = (const dref *) b;
+
+  return (*da)->pos - (*db)->pos;
+}
+
 /* Returns root of the CHAIN.  */
 
 static inline dref
@@ -2250,7 +2261,6 @@
   bool swap = false;
   chain_p new_chain;
   unsigned i;
-  gimple *root_stmt;
   tree rslt_type = NULL_TREE;
 
   if (ch1 == ch2)
@@ -2292,31 +2302,55 @@
   new_chain->refs.safe_push (nw);
 }
 
-  new_chain->has_max_use_after = false;
-  root_stmt = get_chain_root (new_chain)->stmt;
-  for (i = 1; new_chain->refs.iterate (i, &nw); i++)
-    {
-      if (nw->distance == new_chain->length
-	  && !stmt_dominates_stmt_p (nw->stmt, root_stmt))
-	{
-	  new_chain->has_max_use_after = true;
-	  break;
-	}
-    }
-
   ch1->combined = true;
   ch2->combined = true;
   return new_chain;
 }
 
-/* Try to combine the CHAINS.  */
+/* Recursively update position information of all offspring chains to ROOT
+   chain's position information.  */
 
 static void
-try_combine_chains (vec<chain_p> *chains)
+update_pos_for_combined_chains (chain_p root)
 {
+  chain_p ch1 = root->ch1, ch2 = root->ch2;
+  dref ref, ref1, ref2;
+  for (unsigned j = 0; (root->refs.iterate (j, &ref)
+			&& ch1->refs.iterate (j, &ref1)
+			&& ch2->refs.iterate (j, &ref2)); ++j)
+    ref1->pos = ref2->pos = ref->pos;
+
+  if (ch1->type == CT_COMBINATION)
+    update_pos_for_combined_chains (ch1);
+  if (ch2->type == CT_COMBINATION)
+    update_pos_for_combined_chains (ch2);
+}

[GCC BACKPORT]Backport revision 254777 and 254778 to GCC 7 branch

2017-12-19 Thread Bin Cheng
Hi,
This patch backports revision 254777 and 254778 to GCC 7 branch.
Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

2017-12-18  Bin Cheng  <bin.ch...@arm.com>

Backport from mainline
2017-11-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
PR tree-optimization/70754
* tree-predcom.c (order_drefs_by_pos): New function.
(combine_chains): Move code setting has_max_use_after to...
(try_combine_chains): ...here.  New parameter.  Sort combined chains
according to position information.
(tree_predictive_commoning_loop): Update call to above function.
(update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.

2017-11-15  Bin Cheng  <bin.ch...@arm.com>


PR tree-optimization/82726
Revert
        2017-01-23  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/70754
* tree-predcom.c (stmt_combining_refs): New parameter INSERT_BEFORE.
(reassociate_to_the_same_stmt): New parameter INSERT_BEFORE.  Insert
combined stmt before it if not NULL.
(combine_chains): Process refs reversely and compute dominance point
for root ref.

Revert
    2017-02-23  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/79663
* tree-predcom.c (combine_chains): Process refs in reverse order
only for ZERO length chains, and add explaining comment.

gcc/testsuite
2017-12-18  Bin Cheng  <bin.ch...@arm.com>

Backport from mainline
2017-11-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
* gcc.dg/tree-ssa/pr82726.c: New test.

From 3f9a8b53738aded4ead32fb97c251527cbf31ea7 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 18 Dec 2017 11:23:21 +
Subject: [PATCH] backport-pr70754.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c |  26 +
 gcc/tree-predcom.c  | 200 +---
 2 files changed, 160 insertions(+), 66 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
new file mode 100644
index 000..22bc59d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param tree-reassoc-width=4" } */
+/* { dg-additional-options "-mavx2" { target { x86_64-*-* i?86-*-* } } } */
+
+#define N 40
+#define M 128
+unsigned int in[N+M];
+unsigned short out[N];
+
+/* Outer-loop vectorization. */
+
+void
+foo (){
+  int i,j;
+  unsigned int diff;
+
+  for (i = 0; i < N; i++) {
+    diff = 0;
+    for (j = 0; j < M; j+=8) {
+      diff += in[j+i];
+    }
+    out[i]=(unsigned short)diff;
+  }
+
+  return;
+}
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 57d8f7d..a2bb676 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -943,6 +943,17 @@ order_drefs (const void *a, const void *b)
   return (*da)->pos - (*db)->pos;
 }
 
+/* Compares two drefs A and B by their position.  Callback for qsort.  */
+
+static int
+order_drefs_by_pos (const void *a, const void *b)
+{
+  const dref *const da = (const dref *) a;
+  const dref *const db = (const dref *) b;
+
+  return (*da)->pos - (*db)->pos;
+}
+
 /* Returns root of the CHAIN.  */
 
 static inline dref
@@ -2164,11 +2175,10 @@ remove_name_from_operation (gimple *stmt, tree op)
 }
 
 /* Reassociates the expression in that NAME1 and NAME2 are used so that they
-   are combined in a single statement, and returns this statement.  Note the
-   statement is inserted before INSERT_BEFORE if it's not NULL.  */
+   are combined in a single statement, and returns this statement.  */
 
 static gimple *
-reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
+reassociate_to_the_same_stmt (tree name1, tree name2)
 {
   gimple *stmt1, *stmt2, *root1, *root2, *s1, *s2;
   gassign *new_stmt, *tmp_stmt;
@@ -2225,12 +2235,6 @@ reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
   var = create_tmp_reg (type, "predreastmp");
   new_name = make_ssa_name (var);
   new_stmt = gimple_build_assign (new_name, code, name1, name2);
-  if (insert_before && stmt_dominates_stmt_p (insert_before, s1))
-bsi = gsi_for_stmt (insert_before);
-  else
-bsi = gsi_for_stmt (s1);
-
-  gsi_insert_before (, new_stmt, GSI_SAME_STMT);
 
   var = create_tmp_reg (type, "predreastmp");
   tmp_name = make_ssa_name (var);
@@ -2247,6 +2251,7 @@ reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
   s1 = gsi_stmt (bsi);
   update_stmt (s1);
 
+  gsi_insert_before (, new_stmt, GSI_SAME_STMT);
   gsi_insert_before (, tmp_stmt, GSI_SAME_STMT);
 
   return new_stmt;
@@ -2255,11 +2260,10 @@ reassociate_to_the_same_stmt (tree na

[PATCH PR81740]Enforce dependence check for outer loop vectorization

2017-12-15 Thread Bin Cheng
Hi,
As explained in the PR, given the test case below:
int a[8][10] = { [2][5] = 4 }, c;

int
main ()
{
  short b;
  int i, d;
  for (b = 4; b >= 0; b--)
    for (c = 0; c <= 6; c++)
      a[c + 1][b + 2] = a[c][b + 1];
  for (i = 0; i < 8; i++)
    for (d = 0; d < 10; d++)
      if (a[i][d] != (i == 3 && d == 6) * 4)
        __builtin_abort ();
  return 0;
}

the loop nest is illegal for vectorization without reversing the inner loop.
The issue is in the vectorizer's data dependence checking; I believe the
mentioned revision just exposed it.  Previously, vectorization was skipped
because of an unsupported memory operation.  Outer loop vectorization unrolls
the outer loop into:

  for (b = 4; b > 0; b -= 4)
  {
    for (c = 0; c <= 6; c++)
      a[c + 1][6] = a[c][5];
    for (c = 0; c <= 6; c++)
      a[c + 1][5] = a[c][4];
    for (c = 0; c <= 6; c++)
      a[c + 1][4] = a[c][3];
    for (c = 0; c <= 6; c++)
      a[c + 1][3] = a[c][2];
  }
Then the four inner loops are fused into:
  for (b = 4; b > 0; b -= 4)
  {
    for (c = 0; c <= 6; c++)
    {
      a[c + 1][6] = a[c][5];  // S1
      a[c + 1][5] = a[c][4];  // S2
      a[c + 1][4] = a[c][3];
      a[c + 1][3] = a[c][2];
    }
  }
Loop fusion needs to meet the dependence requirement.  Basically, GCC's data
dependence analyzer does not model dependences between references in sibling
loops, but in practice the fusion requirement can be checked by analyzing all
data references after fusion and requiring that there is no backward data
dependence.

Apparently, the requirement is violated here because we have a backward data
dependence between the references (a[c][5], a[c+1][5]) in S1/S2.  Note that if
we reversed the inner loop, the outer loop would become legal for vectorization.
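
For reference, a minimal sketch of the legal form after reversing the inner
loop (this is essentially what the new test pr81740-2.c below exercises):

  /* Reversing the inner loop removes the backward dependence between the
     fused statements, so outer loop vectorization becomes legal.  */
  for (b = 4; b >= 0; b--)
    for (c = 6; c >= 0; c--)
      a[c + 1][b + 2] = a[c][b + 1];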

This patch fixes the issue by enforcing the dependence check.  It also adds
two tests, one that shouldn't be vectorized and one that should.  Bootstrap
and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-12-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81740
* tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case
of outer loop vectorization, check backward dependence at inner loop
if dependence at outer loop is reversed.

gcc/testsuite
2017-12-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81740
* gcc.dg/vect/pr81740-1.c: New test.
* gcc.dg/vect/pr81740-2.c: Refine test.

From c0c8cfae08c0bde2cec41a8d3abcbfea0bd2e211 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Thu, 14 Dec 2017 15:32:02 +
Subject: [PATCH] pr81740-20171212.txt

---
 gcc/testsuite/gcc.dg/vect/pr81740-1.c | 17 +
 gcc/testsuite/gcc.dg/vect/pr81740-2.c | 21 +
 gcc/tree-vect-data-refs.c | 11 +++
 3 files changed, 49 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr81740-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr81740-2.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr81740-1.c b/gcc/testsuite/gcc.dg/vect/pr81740-1.c
new file mode 100644
index 000..d90aba5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr81740-1.c
@@ -0,0 +1,17 @@
+/* { dg-do run } */
+int a[8][10] = { [2][5] = 4 }, c;
+
+int
+main ()
+{
+  short b;
+  int i, d;
+  for (b = 4; b >= 0; b--)
+    for (c = 0; c <= 6; c++)
+      a[c + 1][b + 2] = a[c][b + 1];
+  for (i = 0; i < 8; i++)
+    for (d = 0; d < 10; d++)
+      if (a[i][d] != (i == 3 && d == 6) * 4)
+        __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr81740-2.c b/gcc/testsuite/gcc.dg/vect/pr81740-2.c
new file mode 100644
index 000..fb5b300
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr81740-2.c
@@ -0,0 +1,21 @@
+/* { dg-do run } */
+/* { dg-require-effective-target vect_int } */
+
+int a[8][10] = { [2][5] = 4 }, c;
+
+int
+main ()
+{
+  short b;
+  int i, d;
+  for (b = 4; b >= 0; b--)
+    for (c = 6; c >= 0; c--)
+      a[c + 1][b + 2] = a[c][b + 1];
+  for (i = 0; i < 8; i++)
+    for (d = 0; d < 10; d++)
+      if (a[i][d] != (i == 3 && d == 6) * 4)
+        __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect"  } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 996d156..3b780cf1 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -435,6 +435,17 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 			     "dependence distance negative.\n");
+	  /* When doing outer loop vectorization, we need to check if there is
+	     backward dependence at inner loop level if dependence at the outer
+	     loop is reversed.  See PR81740 for more information.  */
+ if (nested_in_vect_loop_p (lo

[PATCH PR83320]Fix new/free mismatch issue

2017-12-08 Thread Bin Cheng
Hi,
While I am still trying to reproduce and verify the issue (valgrind checking
runs very slowly for me), it's clear I made a mistake by calling free on a
vector allocated with new.  This simple patch fixes it.
Bootstrap and test ongoing.  Is it OK?
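
To illustrate the bug pattern (a hedged sketch, not the pass's exact code;
the concrete allocation site is in gimple-loop-interchange.cc, where dr->aux
holds a vector allocated with new):

  /* Memory from operator new must be released with delete; calling free
     skips the destructor and mismatches the allocator.  */
  vec<tree> *strides = new vec<tree> ();
  /* ... use *strides ... */
  strides->release ();
  delete strides;      /* correct */
  /* free (strides);      wrong: new/free mismatch, undefined behavior */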

Thanks,
bin
2017-12-06  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/83320
* gimple-loop-interchange.cc (free_data_refs_with_aux): Use delete.
(prune_datarefs_not_in_loop): Ditto.

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 3f7c54f..92c96a3 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -945,7 +945,7 @@ free_data_refs_with_aux (vec<data_reference_p> datarefs)
     if (dr->aux != NULL)
       {
 	DR_ACCESS_STRIDE (dr)->release ();
-	free (dr->aux);
+	delete (vec<tree> *) dr->aux;
       }
 
   free_data_refs (datarefs);
@@ -1843,7 +1843,7 @@ prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
 	  if (dr->aux)
 	    {
 	      DR_ACCESS_STRIDE (dr)->release ();
-	      free (dr->aux);
+	      delete (vec<tree> *) dr->aux;
 	    }
 	  free_data_ref (dr);
 	}


[PATCH GCC]More conservative interchanging small loops with const initialized simple reduction

2017-12-08 Thread Bin Cheng
Hi,
This simple patch makes interchange even more conservative for small loops
with a constant-initialized simple reduction.  The reason is that undoing such
a reduction introduces a new data reference and a cond_expr, which could cost
too much in a small loop.
Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if the
test passes?
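
For illustration, a hypothetical loop of the shape this targets (a sketch,
not taken from the patch or testsuite):

  /* sum is a simple reduction initialized with a constant.  Undoing it for
     interchange introduces a memory reference for the partial results plus
     a ?: select on the first iteration -- too costly for such a small body.  */
  int
  reduc (int a[100][100], int n, int m)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        sum += a[j][i];
    return sum;
  }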

Thanks,
bin
2017-12-08  Bin Cheng  <bin.ch...@arm.com>

* gimple-loop-interchange.cc (struct loop_cand): New field.
(loop_cand::loop_cand): Init new field in constructor.
(loop_cand::classify_simple_reduction): Record simple reduction
initialized with constant value.
(should_interchange_loops): New parameter.  Skip interchange if loop
has few data references and constant initialized simple reduction.
(tree_loop_interchange::interchange): Update call to above function.
(should_interchange_loop_nest): Ditto.

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 6554a42..f45f7dc 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -199,13 +199,16 @@ struct loop_cand
   edge m_exit;
   /* Basic blocks of this loop.  */
   basic_block *m_bbs;
+  /* Number of constant-initialized simple reductions.  */
+  unsigned m_num_const_init_simple_reduc;
 };
 
 /* Constructor.  */
 
 loop_cand::loop_cand (struct loop *loop, struct loop *outer)
   : m_loop (loop), m_outer (outer),
-    m_exit (single_exit (loop)), m_bbs (get_loop_body (loop))
+    m_exit (single_exit (loop)), m_bbs (get_loop_body (loop)),
+    m_num_const_init_simple_reduc (0)
 {
 m_inductions.create (3);
 m_reductions.create (3);
@@ -440,7 +443,9 @@ loop_cand::classify_simple_reduction (reduction_p re)
 
   re->init_ref = gimple_assign_rhs1 (producer);
 }
-  else if (!CONSTANT_CLASS_P (re->init))
+  else if (CONSTANT_CLASS_P (re->init))
+    m_num_const_init_simple_reduc++;
+  else
     return;
 
   /* Check how reduction variable is used.  */
@@ -1422,6 +1427,7 @@ dump_access_strides (vec<data_reference_p> datarefs)
 static bool
 should_interchange_loops (unsigned i_idx, unsigned o_idx,
 			  vec<data_reference_p> datarefs,
+			  unsigned num_const_init_simple_reduc,
 			  bool innermost_loops_p, bool dump_info_p = true)
 {
   unsigned HOST_WIDE_INT ratio;
@@ -1522,6 +1528,12 @@ should_interchange_loops (unsigned i_idx, unsigned o_idx,
   if (num_unresolved_drs != 0 || num_resolved_not_ok_drs != 0)
 return false;
 
+  /* Conservatively skip interchange in cases that have only a few data
+     references and a constant-initialized simple reduction, since undoing
+     the reduction introduces a new data reference as well as a ?: operation.  */
+  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
+    return false;
+
   /* We use different stride comparison ratio for interchanging innermost
  two loops or not.  The idea is to be conservative in interchange for
  the innermost loops.  */
@@ -1576,6 +1588,7 @@ tree_loop_interchange::interchange (vec<data_reference_p> datarefs,
 
   /* Check profitability for loop interchange.  */
   if (should_interchange_loops (i_idx, o_idx, datarefs,
+				iloop.m_num_const_init_simple_reduc,
 				iloop.m_loop->inner == NULL))
     {
  if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1764,7 +1779,7 @@ should_interchange_loop_nest (struct loop *loop_nest, struct loop *innermost,
   /* Check if any two adjacent loops should be interchanged.  */
   for (struct loop *loop = innermost;
loop != loop_nest; loop = loop_outer (loop), idx--)
-if (should_interchange_loops (idx, idx - 1, datarefs,
+if (should_interchange_loops (idx, idx - 1, datarefs, 0,
  loop == innermost, false))
   return true;
 


[PATCH GCC]Introduce loop interchange pass and enable it at -O3

2017-12-07 Thread Bin Cheng
Hi,
This is the overall loop interchange patch on the gimple-linterchange branch.
Note the new pass is enabled at the -O3 level by default.  Bootstrap and
regtest on x86_64 and AArch64 (ongoing).  Note that after the cost model
change it is now far more conservative than the original version: it only
interchanges 11 loops in spec2k6 (416 doesn't build at the moment), vs ~250
for the original version.  I will collect compilation time data, though there
shouldn't be any surprise given that few loops are actually interchanged.  I
will also collect spec2k6 data; it shouldn't affect cases other than bwaves
either.
So is it OK?
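
For readers new to the pass, the classic shape it targets looks like this
(an illustrative sketch; the array and bounds are hypothetical):

  /* Before: the inner loop walks down columns, so successive accesses are
     N elements apart in row-major memory.  */
  for (int j = 0; j < N; j++)
    for (int i = 0; i < M; i++)
      a[i][j] += b[i][j];

  /* After interchange: successive inner accesses are contiguous (stride 1),
     which helps the cache and enables vectorization.  */
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      a[i][j] += b[i][j];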

Thanks,
bin
2017-12-07  Bin Cheng  <bin.ch...@arm.com>
Richard Biener  <rguent...@suse.de>

PR tree-optimization/81303
* Makefile.in (gimple-loop-interchange.o): New object file.
* common.opt (floop-interchange): Reuse the option from graphite.
* doc/invoke.texi (-floop-interchange): Ditto.  New document for
-floop-interchange and mention it for -O3.
* opts.c (default_options_table): Enable -floop-interchange at -O3.
* gimple-loop-interchange.cc: New file.
* params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
(PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
* passes.def (pass_linterchange): New pass.
* timevar.def (TV_LINTERCHANGE): New time var.
* tree-pass.h (make_pass_linterchange): New declaration.
* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
interchange.  Record IV before/after increment in new parameters.
* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
* tree-vect-loop.c (vect_is_simple_reduction): Factor out reduction
path check into...
(check_reduction_path): ...New function here.
* tree-vectorizer.h (check_reduction_path): New declaration.

gcc/testsuite
2017-12-07  Bin Cheng  <bin.ch...@arm.com>
Richard Biener  <rguent...@suse.de>

PR tree-optimization/81303
* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
* gcc.dg/tree-ssa/loop-interchange-1b.c: New test.
* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
* gcc.dg/tree-ssa/loop-interchange-10.c: New test.
* gcc.dg/tree-ssa/loop-interchange-11.c: New test.
* gcc.dg/tree-ssa/loop-interchange-12.c: New test.
* gcc.dg/tree-ssa/loop-interchange-13.c: New test.

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index db43fc1..3297437 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1302,6 +1302,7 @@ OBJS = \
gimple-iterator.o \
gimple-fold.o \
gimple-laddress.o \
+   gimple-loop-interchange.o \
gimple-low.o \
gimple-pretty-print.o \
gimple-ssa-backprop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index ffcbf85..6b9e4ea 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1504,8 +1504,8 @@ Common Alias(floop-nest-optimize)
 Enable loop nest transforms.  Same as -floop-nest-optimize.
 
 floop-interchange
-Common Alias(floop-nest-optimize)
-Enable loop nest transforms.  Same as -floop-nest-optimize.
+Common Report Var(flag_loop_interchange) Optimization
+Enable loop interchange on trees.
 
 floop-block
 Common Alias(floop-nest-optimize)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index b8c8083..cebc465 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7401,6 +7401,7 @@ by @option{-O2} and also turns on the following optimization flags:
 -ftree-loop-vectorize @gol
 -ftree-loop-distribution @gol
 -ftree-loop-distribute-patterns @gol
+-floop-interchange @gol
 -fsplit-paths @gol
 -ftree-slp-vectorize @gol
 -fvect-cost-model @gol
@@ -8500,12 +8501,10 @@ Perform loop optimizations on trees.  This flag is enabled by default
 at @option{-O} and higher.
 
 @item -ftree-loop-linear
-@itemx -floop-interchange
 @itemx -floop-strip-mine
 @itemx -floop-block
 @itemx -floop-unroll-and-jam
 @opindex ftree-loop-linear
-@opindex floop-interchange
 @opindex floop-strip-mine
 @opindex floop-block
 @opindex floop-unroll-and-jam
@@ -8600,6 +8599,25 @@ ENDDO
 @end smallexample
 and the initialization loop is transformed into a call to memset zero.
 
+@item -floop-interchange
+@opindex floop-interchange
+Perform loop interchange outside of graphite.  This flag can improve cache
+performance on loop nests and allow further loop optimizations, like
+vectorization, to take place.  For example, the loop
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int j = 0; j < N; j++)

[PATCH TEST]Adjust GRAPHITE tests in preparation for loop interchange

2017-12-06 Thread Bin Cheng
Hi,
The loop interchange pass reuses the option -floop-interchange from GRAPHITE,
so this patch adjusts all affected GRAPHITE tests by changing the option to
-floop-nest-optimize.  Test results were checked with and without loop
interchange.  Is it OK?

Thanks,
bin
gcc/testsuite
2017-12-06  Bin Cheng  <bin.ch...@arm.com>

* g++.dg/graphite/pr41305.C: Refine test option.
* gcc.dg/graphite/pr42205-1.c: Ditto.
* gcc.dg/graphite/pr42205-2.c: Ditto.
* gcc.dg/graphite/pr42211.c: Ditto.
* gcc.dg/graphite/pr46185.c: Ditto.
* gcc.dg/graphite/pr46966.c: Ditto.
* gcc.dg/graphite/pr59817-1.c: Ditto.
* gcc.dg/graphite/pr59817-2.c: Ditto.
* gcc.dg/graphite/pr60740.c: Ditto.
* gcc.dg/graphite/pr60785.c: Ditto.
* gcc.dg/graphite/pr68715-2.c: Ditto.
* gcc.dg/graphite/pr68715.c: Ditto.
* gcc.dg/graphite/pr70045.c: Ditto.
* gfortran.dg/graphite/pr14741.f90: Ditto.
* gfortran.dg/graphite/pr40982.f90: Ditto.
* gfortran.dg/graphite/pr42285.f90: Ditto.
* gfortran.dg/graphite/pr42334-1.f: Ditto.
* gfortran.dg/graphite/pr42334.f90: Ditto.
* gfortran.dg/graphite/pr43349.f: Ditto.
* gfortran.dg/graphite/pr59817.f: Ditto.

diff --git a/gcc/testsuite/g++.dg/graphite/pr41305.C b/gcc/testsuite/g++.dg/graphite/pr41305.C
index 756b126..afab30a 100644
--- a/gcc/testsuite/g++.dg/graphite/pr41305.C
+++ b/gcc/testsuite/g++.dg/graphite/pr41305.C
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O3 -floop-interchange -Wno-conversion-null -Wno-return-type" 
}
+// { dg-options "-O3 -floop-nest-optimize -Wno-conversion-null 
-Wno-return-type" }
 
 void __throw_bad_alloc ();
 
diff --git a/gcc/testsuite/gcc.dg/graphite/pr42205-1.c b/gcc/testsuite/gcc.dg/graphite/pr42205-1.c
index 9cca6de..f08bbec 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr42205-1.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr42205-1.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target int32plus } */
-/* { dg-options "-O1 -ffast-math -floop-interchange" } */
+/* { dg-options "-O1 -ffast-math -floop-nest-optimize" } */
 
 int adler32(int adler, char *buf, int n)
 {
diff --git a/gcc/testsuite/gcc.dg/graphite/pr42205-2.c b/gcc/testsuite/gcc.dg/graphite/pr42205-2.c
index 595cedb..9ceb1ce 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr42205-2.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr42205-2.c
@@ -1,4 +1,4 @@
-/* { dg-options "-O1 -funsafe-math-optimizations -floop-interchange" } */
+/* { dg-options "-O1 -funsafe-math-optimizations -floop-nest-optimize" } */
 
 double f(double x)
 {
diff --git a/gcc/testsuite/gcc.dg/graphite/pr42211.c b/gcc/testsuite/gcc.dg/graphite/pr42211.c
index d8fb915..06bae27 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr42211.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr42211.c
@@ -1,4 +1,4 @@
-/* { dg-options "-O3 -floop-interchange" } */
+/* { dg-options "-O3 -floop-nest-optimize" } */
 
 typedef unsigned char uint8_t;
 
diff --git a/gcc/testsuite/gcc.dg/graphite/pr46185.c b/gcc/testsuite/gcc.dg/graphite/pr46185.c
index 7f9ae07..ecb99f5 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr46185.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr46185.c
@@ -1,7 +1,7 @@
 /* { dg-do run } */
 /* { dg-require-effective-target size32plus } */
 /* { dg-require-effective-target int32plus } */
-/* { dg-options "-O2 -floop-interchange -ffast-math -fno-ipa-cp" } */
+/* { dg-options "-O2 -floop-nest-optimize -ffast-math -fno-ipa-cp" } */
 
 #define DEBUG 0
 #if DEBUG
diff --git a/gcc/testsuite/gcc.dg/graphite/pr46966.c b/gcc/testsuite/gcc.dg/graphite/pr46966.c
index bb55b71..7bc82ee 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr46966.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr46966.c
@@ -2,7 +2,7 @@
 /* { dg-do compile } */
 /* This test is too big for small targets.  */
 /* { dg-require-effective-target size32plus } */
-/* { dg-options "-O -floop-interchange -ffast-math -fno-tree-copy-prop 
-fno-tree-loop-im" } */
+/* { dg-options "-O -floop-nest-optimize -ffast-math -fno-tree-copy-prop 
-fno-tree-loop-im" } */
 
 int a[1000][1000];
 
diff --git a/gcc/testsuite/gcc.dg/graphite/pr59817-1.c b/gcc/testsuite/gcc.dg/graphite/pr59817-1.c
index 175fa16..f5f2a63 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr59817-1.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr59817-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -floop-interchange" } */
+/* { dg-options "-O2 -floop-nest-optimize" } */
 
 int kd;
 
diff --git a/gcc/testsuite/gcc.dg/graphite/pr59817-2.c b/gcc/testsuite/gcc.dg/graphite/pr59817-2.c
index 1395007..064136e 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr59817-2.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr59817-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -floop-interchange" } */
+/* { dg-options "-O2 -floop-nest-optimize" } */
 
 void
 xl(void)
diff --git a/gc

[PATCH branch/gimple-interchange]obvious cleanup

2017-12-05 Thread Bin Cheng
Hi,
This is an obvious cleanup patch doing variable renaming and function inlining.
Is it OK?

Thanks,
bin
2017-12-05  Bin Cheng  <bin.ch...@arm.com>

* gimple-loop-interchange.cc (struct induction): Rename fields.
(dump_induction, loop_cand::analyze_induction_var): Update uses.
(loop_cand::undo_simple_reduction): Ditto.
(tree_loop_interchange::map_inductions_to_loop): Ditto.
(tree_loop_interchange::can_interchange_loops): Delete.
(tree_loop_interchange::interchange): Inline can_interchange_loops.

From 734217d0879c246a139d4c55ecd65837b2fa6077 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 5 Dec 2017 10:00:06 +
Subject: [PATCH] cleanup-1

---
 gcc/gimple-loop-interchange.cc | 49 +-
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 1f46509..8c396ab 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -100,10 +100,11 @@ typedef struct induction
 {
   /* IV itself.  */
   tree var;
-  /* Initializer.  */
-  tree init;
-  /* IV's base and step part of SCEV.  */
-  tree base;
+  /* IV's initializing value, which is the init arg of the IV PHI node.  */
+  tree init_val;
+  /* IV's initializing expr, which is (the expanded result of) init_val.  */
+  tree init_expr;
+  /* IV's step.  */
   tree step;
 }* induction_p;
 
@@ -161,7 +162,7 @@ dump_induction (struct loop *loop, induction_p iv)
   fprintf (dump_file, "  Induction:  ");
   print_generic_expr (dump_file, iv->var, TDF_SLIM);
   fprintf (dump_file, " = {");
-  print_generic_expr (dump_file, iv->base, TDF_SLIM);
+  print_generic_expr (dump_file, iv->init_expr, TDF_SLIM);
   fprintf (dump_file, ", ");
   print_generic_expr (dump_file, iv->step, TDF_SLIM);
   fprintf (dump_file, "}_%d\n", loop->num);
@@ -742,8 +743,8 @@ loop_cand::analyze_induction_var (tree var, tree chrec)
 {
   struct induction *iv = XCNEW (struct induction);
   iv->var = var;
-  iv->init = init;
-  iv->base = chrec;
+  iv->init_val = init;
+  iv->init_expr = chrec;
   iv->step = build_int_cst (TREE_TYPE (chrec), 0);
   m_inductions.safe_push (iv);
   return true;
@@ -757,8 +758,8 @@ loop_cand::analyze_induction_var (tree var, tree chrec)
 
   struct induction *iv = XCNEW (struct induction);
   iv->var = var;
-  iv->init = init;
-  iv->base = CHREC_LEFT (chrec);
+  iv->init_val = init;
+  iv->init_expr = CHREC_LEFT (chrec);
   iv->step = CHREC_RIGHT (chrec);
 
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -938,14 +939,14 @@ loop_cand::undo_simple_reduction (reduction_p re, bitmap dce_seeds)
   /* Find all stmts on which expression "MEM_REF[idx]" depends.  */
   find_deps_in_bb_for_stmt (&stmts, gimple_bb (re->consumer), re->consumer);
   /* Because we generate new stmt loading from the MEM_REF to TMP.  */
-  tree tmp = copy_ssa_name (re->var);
+  tree cond, tmp = copy_ssa_name (re->var);
   stmt = gimple_build_assign (tmp, re->init_ref);
   gimple_seq_add_stmt (&stmts, stmt);
 
   /* Init new_var to MEM_REF or CONST depending on if it is the first
 	 iteration.  */
   induction_p iv = m_inductions[0];
-  tree cond = fold_build2 (NE_EXPR, boolean_type_node, iv->var, iv->init);
+  cond = fold_build2 (NE_EXPR, boolean_type_node, iv->var, iv->init_val);
   new_var = copy_ssa_name (re->var);
   stmt = gimple_build_assign (new_var, COND_EXPR, cond, tmp, re->init);
   gimple_seq_add_stmt (&stmts, stmt);
@@ -1007,7 +1008,6 @@ public:
 private:
   void update_data_info (unsigned, unsigned, vec<data_reference_p>, vec<ddr_p>);
   bool valid_data_dependences (unsigned, unsigned, vec<ddr_p>);
-  bool can_interchange_loops (loop_cand &, loop_cand &);
   void interchange_loops (loop_cand &, loop_cand &);
   void map_inductions_to_loop (loop_cand &, loop_cand &);
   void move_code_to_inner_loop (struct loop *, struct loop *, basic_block *);
@@ -1099,21 +1099,6 @@ tree_loop_interchange::valid_data_dependences (unsigned i_idx, unsigned o_idx,
   return true;
 }
 
-/* Return true if ILOOP and OLOOP can be interchanged in terms of code
-   transformation.  */
-
-bool
-tree_loop_interchange::can_interchange_loops (loop_cand &iloop,
-					      loop_cand &oloop)
-{
-  return (iloop.analyze_carried_vars (NULL)
-	  && iloop.analyze_lcssa_phis ()
-	  && oloop.analyze_carried_vars ()
-	  && oloop.analyze_lcssa_phis ()
-	  && iloop.can_interchange_p (NULL)
-	  && oloop.can_interchange_p ());
-}
-
 /* Interchange two loops specified by ILOOP and OLOOP.  */
 
 void
@@ -1227,7 +1212,8 @@ tree_loop_interchange::map_inductions_to_loop (loop_cand &src, loop_cand &tgt)
 	{
 	  /* Map the IV by creating the same one in target

[PATCH branch/gimple-linterchange]Use dyn_cast instead of is_a<> and as_a<>

2017-12-01 Thread Bin Cheng
Hi,
This is a simple patch using dyn_cast instead of is_a<> and as_a<>, as
suggested by review.  This is for branches/gimple-linterchange; bootstrapped
and tested as of when the branch was created.  Is it OK?
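
The pattern change, sketched (simplified from the diff below):

  /* Before: a separate type query plus checked cast.  */
  gphi *phi = NULL;
  if (is_a <gphi *> (stmt))
    phi = as_a <gphi *> (stmt);

  /* After: one call; dyn_cast yields NULL when stmt is not a PHI.  */
  phi = dyn_cast <gphi *> (stmt);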

Thanks,
bin
2017-11-30  Bin Cheng  <bin.ch...@arm.com>

* gimple-loop-interchange.cc (is-a.h): New header file.
(loop_cand::find_reduction_by_stmt): Use dyn_cast instead of is_a<>
and as_a<>.
(loop_cand::analyze_iloop_reduction_var): Ditto.
(loop_cand::analyze_oloop_reduction_var): Ditto.  Check gimple stmt
against phi node directly.

From 88ddf90ee183f2e58bb5d4b38d14733412603b44 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 29 Nov 2017 11:23:52 +
Subject: [PATCH 40/42] dyn_cast

---
 gcc/gimple-loop-interchange.cc | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 7afafb8..e999822 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -22,6 +22,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "is-a.h"
 #include "tree.h"
 #include "gimple.h"
 #include "tree-pass.h"
@@ -270,12 +271,9 @@ unsupported_edge (edge e)
 reduction_p
 loop_cand::find_reduction_by_stmt (gimple *stmt)
 {
-  gphi *phi = NULL;
+  gphi *phi = dyn_cast <gphi *> (stmt);
   reduction_p re;
 
-  if (is_a <gphi *> (stmt))
-    phi = as_a <gphi *> (stmt);
-
   for (unsigned i = 0; m_reductions.iterate (i, &re); ++i)
 if ((phi != NULL && phi == re->lcssa_phi)
 	|| (stmt == re->producer || stmt == re->consumer))
@@ -591,10 +589,8 @@ loop_cand::analyze_iloop_reduction_var (tree var)
 	continue;
 
   /* Or else it's used in PHI itself.  */
-  use_phi = NULL;
-  if (is_a <gphi *> (stmt)
-      && (use_phi = as_a <gphi *> (stmt)) != NULL
-      && use_phi == phi)
+  use_phi = dyn_cast <gphi *> (stmt);
+  if (use_phi == phi)
 	continue;
 
   if (use_phi != NULL
@@ -684,12 +680,7 @@ loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var)
   if (is_gimple_debug (stmt))
 	continue;
 
-  if (!flow_bb_inside_loop_p (m_loop, gimple_bb (stmt)))
-	return false;
-
-  if (! is_a <gphi *> (stmt)
-      || (use_phi = as_a <gphi *> (stmt)) == NULL
-      || use_phi != inner_re->phi)
+  if (stmt != inner_re->phi)
 	return false;
 }
 
@@ -701,10 +692,8 @@ loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var)
 	continue;
 
   /* Or else it's used in PHI itself.  */
-  use_phi = NULL;
-  if (is_a <gphi *> (stmt)
-      && (use_phi = as_a <gphi *> (stmt)) != NULL
-      && use_phi == phi)
+  use_phi = dyn_cast <gphi *> (stmt);
+  if (use_phi == phi)
 	continue;
 
   if (lcssa_phi == NULL
-- 
1.9.1



[PATCH GCC]Rename and make remove_dead_inserted_code a simple dce interface

2017-11-28 Thread Bin Cheng
Hi,
This patch renames remove_dead_inserted_code to simple_dce_from_worklist,
moves it to tree-ssa-dce.c and makes it a simple public DCE interface.
Bootstrapped and tested along with loop interchange; it's required by the
interchange pass.  Is it OK?
BTW, I will push this along with interchange to branch:
gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
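
A hypothetical caller would look like this (a sketch; which SSA names to
seed depends entirely on the pass):

  /* Seed the worklist with SSA name versions that may have become dead,
     then let the cheap DCE remove them and anything that dies with them.  */
  bitmap seeds = BITMAP_ALLOC (NULL);
  bitmap_set_bit (seeds, SSA_NAME_VERSION (old_ssa_name));
  simple_dce_from_worklist (seeds);
  BITMAP_FREE (seeds);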

Thanks,
bin
2017-11-27  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-dce.c (simple_dce_from_worklist): Move and rename from
tree-ssa-pre.c::remove_dead_inserted_code.
* tree-ssa-dce.h: New file.
* tree-ssa-pre.c (tree-ssa-dce.h): Include new header file.
(remove_dead_inserted_code): Move and rename to function
tree-ssa-dce.c::simple_dce_from_worklist.
(pass_pre::execute): Update use.

diff --git a/gcc/tree-ssa-dce.c b/gcc/tree-ssa-dce.c
index a5f0edf..227e55d 100644
--- a/gcc/tree-ssa-dce.c
+++ b/gcc/tree-ssa-dce.c
@@ -1723,3 +1723,56 @@ make_pass_cd_dce (gcc::context *ctxt)
 {
   return new pass_cd_dce (ctxt);
 }
+
+
+/* A cheap DCE interface starting from a seed set of possibly dead stmts.  */
+
+void
+simple_dce_from_worklist (bitmap seeds)
+{
+  /* ???  Re-use seeds as worklist not only as initial set.  This may end up
+     removing more code as well.  If we keep seeds unchanged we could restrict
+     new worklist elements to members of seed.  */
+  bitmap worklist = seeds;
+  while (! bitmap_empty_p (worklist))
+{
+  /* Pop item.  */
+  unsigned i = bitmap_first_set_bit (worklist);
+  bitmap_clear_bit (worklist, i);
+
+  tree def = ssa_name (i);
+  /* Removed by somebody else or still in use.  */
+  if (! def || ! has_zero_uses (def))
+   continue;
+
+  gimple *t = SSA_NAME_DEF_STMT (def);
+  if (gimple_has_side_effects (t))
+   continue;
+
+  /* Add uses to the worklist.  */
+  ssa_op_iter iter;
+  use_operand_p use_p;
+  FOR_EACH_PHI_OR_STMT_USE (use_p, t, iter, SSA_OP_USE)
+   {
+ tree use = USE_FROM_PTR (use_p);
+ if (TREE_CODE (use) == SSA_NAME
+ && ! SSA_NAME_IS_DEFAULT_DEF (use))
+   bitmap_set_bit (worklist, SSA_NAME_VERSION (use));
+   }
+
+  /* Remove stmt.  */
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf (dump_file, "Removing dead stmt:");
+ print_gimple_stmt (dump_file, t, 0);
+   }
+  gimple_stmt_iterator gsi = gsi_for_stmt (t);
+  if (gimple_code (t) == GIMPLE_PHI)
+   remove_phi_node (, true);
+  else
+   {
+ gsi_remove (, true);
+ release_defs (t);
+   }
+}
+}
diff --git a/gcc/tree-ssa-dce.h b/gcc/tree-ssa-dce.h
new file mode 100644
index 000..2adb086
--- /dev/null
+++ b/gcc/tree-ssa-dce.h
@@ -0,0 +1,22 @@
+/* Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef TREE_SSA_DCE_H
+#define TREE_SSA_DCE_H
+extern void simple_dce_from_worklist (bitmap);
+#endif
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 281f100..c19d486 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -49,6 +49,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "dbgcnt.h"
 #include "domwalk.h"
 #include "tree-ssa-propagate.h"
+#include "tree-ssa-dce.h"
 #include "tree-cfgcleanup.h"
 #include "alias.h"
 
@@ -4038,64 +4039,6 @@ compute_avail (void)
   free (worklist);
 }
 
-/* Cheap DCE of a known set of possibly dead stmts.
-
-   Because we don't follow exactly the standard PRE algorithm, and decide not
-   to insert PHI nodes sometimes, and because value numbering of casts isn't
-   perfect, we sometimes end up inserting dead code.   This simple DCE-like
-   pass removes any insertions we made that weren't actually used.  */
-
-static void
-remove_dead_inserted_code (void)
-{
-  /* ???  Re-use inserted_exprs as worklist not only as initial set.
-     This may end up removing non-inserted code as well.  If we
-     keep inserted_exprs unchanged we could restrict new worklist
-     elements to members of inserted_exprs.  */
-  bitmap worklist = inserted_exprs;
-  while (! bitmap_empty_p (worklist))
-{
-  /* Pop item.  */
-  unsigned i = bitmap_first_set_bit (worklist);
-  bitmap_clear_bit (worklist, i);
-
-  tree def = ssa_name

[PATCH GCC]Support load in CT_STORE_STORE chain if dominated by store in the same loop iteration

2017-11-17 Thread Bin Cheng
Hi,
I previously introduced CT_STORE_STORE chains in predcom.  This patch further
supports load references in a CT_STORE_STORE chain if the load is dominated by
a store reference in the same loop iteration.  For example, as in the added
test case:

  for (i = 0; i < len; i++)
    {
      a[i] = t1;
      a[i + 3] = t2;
      a[i + 1] = -1;
      sum = sum + a[i] + a[i + 3];
    }
can be transformed into:
  for (i = 0; i < len; i++)
    {
      a[i] = t1;
      a[i + 3] = t2;
      a[i + 1] = -1;
      sum = sum + t1 + t2;
    }
Before this patch, we couldn't eliminate the load because no load references
are allowed in a CT_STORE_STORE chain.

This patch only supports it if the load is dominated by a store reference in
the same loop iteration.  If we generalized this to loads/stores in arbitrary
loop iterations, it would basically generalize CT_STORE_LOAD/CT_STORE_STORE
chains into arbitrary mixed chains.  That would need a fundamental rewrite of
the pass, and I am not sure how useful it would be.
Bootstrap and test on x86_64 and AArch64.  Is it OK?
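
For contrast, a hypothetical case the patch deliberately does not handle:
here the load happens before the store to the same location in the same
iteration, so it is not dominated by that store and cannot simply be replaced
by the stored value:

  for (i = 0; i < len; i++)
    {
      sum = sum + a[i + 3];   /* load precedes the store to a[i + 3] */
      a[i] = t1;
      a[i + 3] = t2;
    }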

Thanks,
bin
2017-11-15  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c: Add general comment on Store-Store chains.
(split_data_refs_to_components): Postpone clearing eliminate_store_p
flag in component.
(get_chain_last_ref_at): Rename into...
(get_chain_last_write_at): ...this.
(get_chain_last_write_before_load): New function.
(add_ref_to_chain): Promote type of chain from CT_STORE_LOAD to
CT_STORE_STORE when write reference is added.
(determine_roots_comp): Support load ref in CT_STORE_STORE chains.
(is_inv_store_elimination_chain): Update get_chain_last_write_at call.
(initialize_root_vars_store_elim_1): Ditto.
(initialize_root_vars_store_elim_2): Ditto.  Replace rhs once default
definition is created.
(execute_pred_commoning_chain): Support load ref in CT_STORE_STORE
chain by replacing it with dominant stored value.

gcc/testsuite/ChangeLog
2017-11-15  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/predcom-dse-12.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-12.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-12.c
new file mode 100644
index 000..510c600
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-12.c
@@ -0,0 +1,67 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fno-inline -fpredictive-commoning 
-fdump-tree-pcom-details" } */
+
+int arr[105] = {2, 3, 5, 7, 11};
+int result0[10] = {2, 3, 5, 7, 11};
+int result1[10] = {0, -1, 5, -2, 11, 0};
+int result2[10] = {0, 0, -1, -2, -2, 0};
+int result3[10] = {0, 0, 0, -1, -2, -2, 0};
+int result4[10] = {0, 0, 0, 0, -1, -2, -2, 0};
+int result100[105] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -2, -2, 0};
+
+extern void abort (void);
+int sum;
+
+void __attribute__((noinline)) foo (int * restrict a, int len, int t1, int t2)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    {
+      a[i] = t1;
+      a[i + 3] = t2;
+      a[i + 1] = -1;
+      sum = sum + a[i] + a[i + 3];
+    }
+}
+
+void check (int *a, int *res, int len, int sval)
+{
+  int i;
+
+  if (sum != sval)
+    abort ();
+
+  for (i = 0; i < len; i++)
+    if (a[i] != res[i])
+      abort ();
+}
+
+int main (void)
+{
+  foo (arr, 0, 0, -2);
+  check (arr, result0, 10, 0);
+
+  foo (arr, 1, 0, -2);
+  check (arr, result1, 10, -2);
+
+  foo (arr, 2, 0, -2);
+  check (arr, result2, 10, -6);
+
+  foo (arr, 3, 0, -2);
+  check (arr, result3, 10, -12);
+
+  foo (arr, 4, 0, -2);
+  check (arr, result4, 10, -20);
+
+  foo (arr, 100, 0, -2);
+  check (arr, result100, 105, -220);
+
+  return 0;
+}
+/* { dg-final { scan-tree-dump "Store-stores chain" "pcom"} } */
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index d078b96..7725941 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -192,6 +192,10 @@ along with GCC; see the file COPYING3.  If not see
The interesting part is this can be viewed either as general store motion
or general dead store elimination in either intra/inter-iterations way.
 
+   With trivial effort, we also support loads inside Store-Store chains if the
+   load is dominated by a store statement in the same iteration of the loop.
+   You can see this as a restricted Store-Mixed-Load-Store chain.
+
TODO: For now, we don't support store-store chains in multi-exit loops.  We
force to not unroll in case of store-store chain even if other chains might
ask for unroll.
@@ -902,8 +906,6 @@ split_data_refs_to_components (struct loop *loop,
  

[PATCH Obvious]Remove redundant check on component distance

2017-11-17 Thread Bin Cheng
Hi,
This is an obvious patch removing a redundant check on component distance in
tree-predcom.c.  Bootstrapped and tested along with the next patch.  Is it OK?

Thanks,
bin
2017-11-15  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c (add_ref_to_chain): Remove check on distance.

From 8b62802309b2d14a2fca4446b9f6f8f8670a450b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 20 Oct 2017 15:56:03 +0100
Subject: [PATCH 1/2] redundant-dist-check-20171102.txt

---
 gcc/tree-predcom.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 747c1b8..499cedb 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1063,11 +1063,6 @@ add_ref_to_chain (chain_p chain, dref ref)
 
   gcc_assert (wi::les_p (root->offset, ref->offset));
   widest_int dist = ref->offset - root->offset;
-  if (wi::leu_p (MAX_DISTANCE, dist))
-    {
-      free (ref);
-      return;
-    }
   gcc_assert (wi::fits_uhwi_p (dist));
 
   chain->refs.safe_push (ref);
-- 
1.9.1



[PATCH PR82726/PR70754][2/2]New fix by finding correct root reference in combined chains

2017-11-03 Thread Bin Cheng
Hi,
As described in the message of the previous patch:

This patch set fixes both PRs in the opposite way: instead of finding a
dominance insertion position for the root reference, we re-sort the
zero-distance references of the combined chain by their position information
so that the new root reference must dominate the others.  This should be more
efficient because we avoid the function call to stmt_dominates_stmt_p.
Bootstrap and test on x86_64 and AArch64 in patch set.  Is it OK?

Thanks,
bin
2017-11-02  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
PR tree-optimization/70754
* tree-predcom.c (<map>, INCLUDE_ALGORITHM): New headers.
(order_drefs_by_pos): New function.
(combine_chains): Move code setting has_max_use_after to...
(try_combine_chains): ...here.  New parameter.  Sort combined chains
according to position information.
(tree_predictive_commoning_loop): Update call to above function.
(update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.

gcc/testsuite
2017-11-02  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
* gcc.dg/tree-ssa/pr82726.c: New test.

From 843cef544a46236e40063416cebc8037736ad18a Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 1 Nov 2017 17:43:55 +
Subject: [PATCH 2/2] pr82726-20171102.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c |  26 ++
 gcc/tree-predcom.c  | 159 
 2 files changed, 169 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
new file mode 100644
index 000..179f93a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param tree-reassoc-width=4" } */
+/* { dg-additional-options "-mavx2" { target avx2_runtime } } */
+
+#define N 40
+#define M 128
+unsigned int in[N+M];
+unsigned short out[N];
+
+/* Outer-loop vectorization. */
+
+void
+foo (){
+  int i,j;
+  unsigned int diff;
+
+  for (i = 0; i < N; i++) {
+    diff = 0;
+    for (j = 0; j < M; j+=8) {
+      diff += in[j+i];
+    }
+    out[i]=(unsigned short)diff;
+  }
+
+  return;
+}
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 24d7c9c..a243bce 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -201,6 +201,8 @@ along with GCC; see the file COPYING3.  If not see
i * i with ii_last + 2 * i + 1), to generalize strength reduction.  */
 
 #include "config.h"
+#include <map>
+#define INCLUDE_ALGORITHM /* std::sort */
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
@@ -1020,6 +1022,14 @@ order_drefs (const void *a, const void *b)
   return (*da)->pos - (*db)->pos;
 }
 
+/* Compares two drefs A and B by their position.  Callback for std::sort.  */
+
+static bool
+order_drefs_by_pos (dref a, dref b)
+{
+  return a->pos < b->pos;
+}
+
 /* Returns root of the CHAIN.  */
 
 static inline dref
@@ -2633,7 +2643,6 @@ combine_chains (chain_p ch1, chain_p ch2)
   bool swap = false;
   chain_p new_chain;
   unsigned i;
-  gimple *root_stmt;
   tree rslt_type = NULL_TREE;
 
   if (ch1 == ch2)
@@ -2675,31 +2684,56 @@ combine_chains (chain_p ch1, chain_p ch2)
   new_chain->refs.safe_push (nw);
 }
 
-  new_chain->has_max_use_after = false;
-  root_stmt = get_chain_root (new_chain)->stmt;
-  for (i = 1; new_chain->refs.iterate (i, ); i++)
-{
-  if (nw->distance == new_chain->length
-	  && !stmt_dominates_stmt_p (nw->stmt, root_stmt))
-	{
-	  new_chain->has_max_use_after = true;
-	  break;
-	}
-}
-
   ch1->combined = true;
   ch2->combined = true;
   return new_chain;
 }
 
-/* Try to combine the CHAINS.  */
+/* Recursively update position information of all offspring chains to ROOT
+   chain's position information.  */
+
+static void
+update_pos_for_combined_chains (chain_p root)
+{
+  chain_p ch1 = root->ch1, ch2 = root->ch2;
+  dref ref, ref1, ref2;
+  for (unsigned j = 0; (root->refs.iterate (j, &ref)
+			&& ch1->refs.iterate (j, &ref1)
+			&& ch2->refs.iterate (j, &ref2)); ++j)
+    ref1->pos = ref2->pos = ref->pos;
+
+  if (ch1->type == CT_COMBINATION)
+update_pos_for_combined_chains (ch1);
+  if (ch2->type == CT_COMBINATION)
+update_pos_for_combined_chains (ch2);
+}
+
+/* Returns true if statement S1 dominates statement S2.  */
+
+static bool
+pcom_stmt_dominates_stmt_p (std::map<gimple *, int> &stmts_map,
+			    gimple *s1, gimple *s2)
+{
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  if (!bb1 || s1 == s2)
+    return true;
+
+  if (bb1 == bb2)
+    return stmts_map[s1] < stmts_map[s2];
+
+  return dominated_by_p (CDI_DOMINATORS, bb2, bb1);
+}
+
+/* Try to combine the CHAINS in LOOP.  */
 
 static v

[PATCH PR82726][1/2]Revert previous fixes for PR70754 and PR79663

2017-11-03 Thread Bin Cheng
Hi,
When fixing PR70754, I thought the issue only happened for zero-length chains.
Well, that's apparently not true with PR82726.
The whole story is that, with chain combination/re-association, new stmts may
be created/inserted at positions not dominating their following uses.  This
happens in two scenarios:
  1) Zero-length chains, as in PR70754.
  2) Non-zero-length chains with multiple zero-distance references.
PR82726 falls in case 2).  Because zero-distance references are roots of the
chain, they don't inherit values from loop-carried PHIs.  In code generation,
we still need to be careful not to insert uses before definitions.

The previous fix for PR70754 tried to find a dominating position for insertion
when combining all references.  I could do a similar thing on top of that fix,
but it would be inefficient/complicated because we should only do that for
zero-distance references in a non-zero-length combined chain.

This patch set fixes both PRs in the opposite way: instead of finding a
dominance insertion position for the root reference, we re-sort the
zero-distance references of the combined chain by their position information
so that the new root reference must dominate the others.  This should be more
efficient because we avoid the function call to stmt_dominates_stmt_p.

This is the first patch reverting r244815 and r245689.

Bootstrap and test on x86_64 and AArch64 in patch set.  Is it OK?

Thanks,
bin
2017-11-02  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82726
Revert
2017-01-23  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/70754
* tree-predcom.c (stmt_combining_refs): New parameter INSERT_BEFORE.
(reassociate_to_the_same_stmt): New parameter INSERT_BEFORE.  Insert
combined stmt before it if not NULL.
(combine_chains): Process refs reversely and compute dominance point
for root ref.

Revert
2017-02-23  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/79663
* tree-predcom.c (combine_chains): Process refs in reverse order
only for ZERO length chains, and add explaining comment.

From 408c86c33670ce64e9872fa9d4cc66fe0b3bffa4 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 1 Nov 2017 12:53:43 +
Subject: [PATCH 1/2] revert-244815-245689.txt

---
 gcc/tree-predcom.c | 64 +++---
 1 file changed, 13 insertions(+), 51 deletions(-)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index fdb32f1..24d7c9c 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -2520,11 +2520,10 @@ remove_name_from_operation (gimple *stmt, tree op)
 }
 
 /* Reassociates the expression in that NAME1 and NAME2 are used so that they
-   are combined in a single statement, and returns this statement.  Note the
-   statement is inserted before INSERT_BEFORE if it's not NULL.  */
+   are combined in a single statement, and returns this statement.  */
 
 static gimple *
-reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
+reassociate_to_the_same_stmt (tree name1, tree name2)
 {
   gimple *stmt1, *stmt2, *root1, *root2, *s1, *s2;
   gassign *new_stmt, *tmp_stmt;
@@ -2581,12 +2580,6 @@ reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
   var = create_tmp_reg (type, "predreastmp");
   new_name = make_ssa_name (var);
   new_stmt = gimple_build_assign (new_name, code, name1, name2);
-  if (insert_before && stmt_dominates_stmt_p (insert_before, s1))
-bsi = gsi_for_stmt (insert_before);
-  else
-bsi = gsi_for_stmt (s1);
-
-  gsi_insert_before (, new_stmt, GSI_SAME_STMT);
 
   var = create_tmp_reg (type, "predreastmp");
   tmp_name = make_ssa_name (var);
@@ -2603,6 +2596,7 @@ reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
   s1 = gsi_stmt (bsi);
   update_stmt (s1);
 
+  gsi_insert_before (, new_stmt, GSI_SAME_STMT);
   gsi_insert_before (, tmp_stmt, GSI_SAME_STMT);
 
   return new_stmt;
@@ -2611,11 +2605,10 @@ reassociate_to_the_same_stmt (tree name1, tree name2, gimple *insert_before)
 /* Returns the statement that combines references R1 and R2.  In case R1
and R2 are not used in the same statement, but they are used with an
associative and commutative operation in the same expression, reassociate
-   the expression so that they are used in the same statement.  The combined
-   statement is inserted before INSERT_BEFORE if it's not NULL.  */
+   the expression so that they are used in the same statement.  */
 
 static gimple *
-stmt_combining_refs (dref r1, dref r2, gimple *insert_before)
+stmt_combining_refs (dref r1, dref r2)
 {
   gimple *stmt1, *stmt2;
   tree name1 = name_for_ref (r1);
@@ -2626,7 +2619,7 @@ stmt_combining_refs (dref r1, dref r2, gimple *insert_before)
   if (stmt1 == stmt2)
 return stmt1;
 
-  return reassociate_to_the_same_stmt (name1, name2, insert_before);
+  return reassociate_to_the_same_stmt (name1, name2);

[PATCH OBVIOUS]Fix memory leak in tree-predcom.c

2017-11-03 Thread Bin Cheng
Hi,
I ran into this memory leak issue in tree-predcom.c when investigating other
PRs.  This is the obvious fix: free the reference of a trivial component.
Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin
2017-11-02  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c (determine_roots_comp): Avoid memory leak by freeing
reference of trivial component.

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index a243bce..e493dcd 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1341,7 +1341,14 @@ determine_roots_comp (struct loop *loop,
 
   /* Trivial component.  */
   if (comp->refs.length () <= 1)
-    return;
+    {
+      if (comp->refs.length () == 1)
+	{
+	  free (comp->refs[0]);
+	  comp->refs.truncate (0);
+	}
+      return;
+    }
 
   comp->refs.qsort (order_drefs);
   FOR_EACH_VEC_ELT (comp->refs, i, a)


[PATCH PR82776]Exploit more undefined pointer overflow behavior in loop niter analysis

2017-11-03 Thread Bin Cheng
Hi,
This is a simple patch exploiting more undefined pointer overflow behavior in
loop niter analysis.  Originally, it only supported POINTER_PLUS_EXPR if the
offset part is an IV.  This patch also handles the case where the pointer is
an IV.  With this patch, the while(true) loop in the test can now be removed
by the cddce pass.
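
A minimal sketch of the two POINTER_PLUS_EXPR shapes (function and variable
names here are hypothetical):

  /* Offset is the IV: the case handled before this patch.  */
  void old_case (int *base, unsigned long n)
  {
    for (unsigned long i = 0; i < n; i++)
      *(base + i) = 0;
  }

  /* The pointer itself is the IV: the newly handled case.  Since pointer
     arithmetic must not overflow, the increment bounds the iteration count,
     which is what lets cddce prove the loop finite.  */
  void new_case (int *p, unsigned long n)
  {
    while (n--)
      *p++ = 0;
  }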

Bootstrap and test on x86_64 and AArch64.  This patch introduces two failures:
FAIL: g++.dg/pr79095-1.C  -std=gnu++98 (test for excess errors)
FAIL: g++.dg/pr79095-2.C  -std=gnu++11 (test for excess errors)
I believe this exposes an issue with inaccurate value range information.  For
the code below:
/* { dg-do compile } */
/* { dg-options "-Wall -O3" } */

typedef long unsigned int size_t;

inline void
fill (int *p, size_t n, int)
{
  while (n--)
    *p++ = 0;
}

struct B
{
  int* p0, *p1, *p2;

  size_t size () const {
    return size_t (p1 - p0);
  }

  void resize (size_t n) {
    if (n > size())
      append (n - size());
  }

  void append (size_t n)
  {
    if (size_t (p2 - p1) >= n)   {
      fill (p1, n, 0);
    }
  }
};

void foo (B &b)
{
  if (b.size () != 0)
    b.resize (b.size () - 1);
}

GCC gives below warning with this patch:
pr79095-1.C: In function ‘void foo(B&)’:
pr79095-1.C:10:7: warning: iteration 4611686018427387903 invokes undefined 
behavior [-Waggressive-loop-optimizations]
 *p++ = 0;
  ~^~
pr79095-1.C:9:11: note: within this loop
   while (n--)
   ^~

The problem is that VRP should understand that such a huge iteration count
can never happen given the condition:
  (size_t (p2 - p1) >= n)
in function B::append.

So, any comment?

Thanks,
bin
2017-11-02  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82776
* tree-ssa-loop-niter.c (infer_loop_bounds_from_pointer_arith): Handle
POINTER_PLUS_EXPR in which the pointer is an IV.
(infer_loop_bounds_from_signedness): Refine comment.

gcc/testsuite
2017-11-02  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82776
* g++.dg/pr82776.C: New test.
* gcc.dg/tree-ssa/split-path-6.c: Refine test.

diff --git a/gcc/testsuite/g++.dg/pr82776.C b/gcc/testsuite/g++.dg/pr82776.C
new file mode 100644
index 000..2a66817
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr82776.C
@@ -0,0 +1,78 @@
+// PR tree-optimization/82776
+// { dg-do compile }
+// { dg-options "-O2 -std=c++14 -fdump-tree-cddce2-details" }
+
+#include <array>
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+
+
+unsigned baz (unsigned);
+
+struct Chunk {
+  std::array<uint8_t,14> tags_;
+  uint8_t control_;
+
+  bool eof() const {
+    return (control_ & 1) != 0;
+  }
+
+  static constexpr unsigned kFullMask = (1 << 14) - 1;
+
+  unsigned occupiedMask() const {
+    return baz (kFullMask);
+  }
+};
+
+#define LIKELY(x) __builtin_expect((x), true)
+#define UNLIKELY(x) __builtin_expect((x), false)
+
+struct Iter {
+  Chunk* chunk_;
+  std::size_t index_;
+
+  void advance() {
+    // common case is packed entries
+    while (index_ > 0) {
+      --index_;
+      if (LIKELY(chunk_->tags_[index_] != 0)) {
+        return;
+      }
+    }
+
+    // bar only skips the work of advance() if this loop can
+    // be guaranteed to terminate
+#ifdef ENABLE_FORLOOP
+    for (std::size_t i = 1; i != 0; ++i) {
+#else
+    while (true) {
+#endif
+      // exhausted the current chunk
+      if (chunk_->eof()) {
+        chunk_ = nullptr;
+        break;
+      }
+      ++chunk_;
+      auto m = chunk_->occupiedMask();
+      if (m != 0) {
+        index_ = 31 - __builtin_clz(m);
+        break;
+      }
+    }
+  }
+};
+
+static Iter foo(Iter iter) {
+  puts("hello");
+  iter.advance();
+  return iter;
+}
+
+void bar(Iter iter) {
+  foo(iter);
+}
+
+// { dg-final { scan-tree-dump-not "can not prove finiteness of loop" "cddce2" } }
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/split-path-6.c b/gcc/testsuite/gcc.dg/tree-ssa/split-path-6.c
index 682166f..2206d05 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/split-path-6.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/split-path-6.c
@@ -53,10 +53,11 @@ oof ()
 }
 
 void
-lookharder (string)
+lookharder (string, res)
  char *string;
+ char *res;
 {
-  register char *g;
+  register char *g = res;
   register char *s;
   for (s = string; *s != '\0'; s++)
 {
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 6efe67a..7c1ac61 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -3422,7 +3422,7 @@ static void
 infer_loop_bounds_from_pointer_arith (struct loop *loop, gimple *stmt)
 {
   tree def, base, step, scev, type, low, high;
-  tree var, ptr;
+  tree rhs2, rhs1;
 
   if (!is_gimple_assign (stmt)
   || gimple_assign_rhs_code (stmt) != POINTER_PLUS_EXPR)
@@ -3436,12 +3436,13 @@ infer_loop_bounds_from_pointer_arith (struct loop *loop, gimple *stmt)
   if (!nowrap_type_p (type))
 return;
 
-  ptr = gimple_assign_rhs1 (stmt);
-  if (!expr_invariant_in_loop_p (loop, ptr))

Re: [PATCH GCC][3/3]Refine CFG and bound information for split loops

2017-10-20 Thread Bin Cheng

From: Richard Biener <richard.guent...@gmail.com>
Sent: 20 October 2017 12:24
To: Bin Cheng
Cc: gcc-patches@gcc.gnu.org; nd
Subject: Re: [PATCH GCC][3/3]Refine CFG and bound information for split loops
    
On Thu, Oct 19, 2017 at 3:26 PM, Bin Cheng <bin.ch...@arm.com> wrote:
> Hi,
> This is a rework of patch at  
> https://gcc.gnu.org/ml/gcc-patches/2017-06/msg01037.html.
> The new patch doesn't try to handle all cases, instead, it only handles 
> obvious cases.
> It also tries to add tests illustrating different cases handled.
> Bootstrap and test for patch set on x86_64 and AArch64.  Comments?

ENOPATCH

Sorry for the mistake, here is the one.

Thanks,
bin

> Thanks,
> bin
> 2017-10-16  Bin Cheng  <bin.ch...@arm.com>
>
> * tree-ssa-loop-split.c (compute_new_first_bound): New parameter.
> Compute and return bound information for the second split loop.
> (adjust_loop_split): New function.
> (split_loop): Update use and call above function.
>
> gcc/testsuite/ChangeLog
> 2017-10-16  Bin Cheng  <bin.ch...@arm.com>
>
> * gcc.dg/loop-split-1.c: New test.
> * gcc.dg/loop-split-2.c: New test.
> * gcc.dg/loop-split-3.c: New test.
From 3bf8b382682b6a6c6aedf6f085d663e6379f003a Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 2 Aug 2017 14:57:27 +0100
Subject: [PATCH 3/3] lsplit-refine-cfg-niter-bound-20171017.txt

---
 gcc/testsuite/gcc.dg/loop-split-1.c |  40 
 gcc/testsuite/gcc.dg/loop-split-2.c |  34 +++
 gcc/testsuite/gcc.dg/loop-split-3.c |  40 
 gcc/tree-ssa-loop-split.c   | 179 +---
 4 files changed, 282 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/loop-split-1.c
 create mode 100644 gcc/testsuite/gcc.dg/loop-split-2.c
 create mode 100644 gcc/testsuite/gcc.dg/loop-split-3.c

diff --git a/gcc/testsuite/gcc.dg/loop-split-1.c b/gcc/testsuite/gcc.dg/loop-split-1.c
new file mode 100644
index 000..7cf6a37
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split-1.c
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+#define NUM (100)
+int x[NUM] = {0, 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
+int y[NUM] = {0, 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
+int r[NUM] = {0, 2, 4, 6, 8, 12, 14, 18, 20, 24, 29, 31};
+
+extern void abort (void);
+int __attribute__((noinline)) foo (int *a, int *b, int len)
+{
+  int k;
+  for (k = 1; k <= len; k++)
+{
+  a[k]++;
+
+  if (k < len)
+	b[k]++;
+}
+}
+
+int main (void)
+{
+  int i;
+
+  foo (x, y, 9);
+
+  for (i = 0; i < NUM; ++i)
+{
+  if (i != 9
+	  && (x[i] != r[i] || y[i] != r[i]))
+	abort ();
+  if (i == 9
+	  && (x[i] != r[i] || y[i] != r[i] - 1))
+	abort ();
+}
+
+  return 0;
+}
+/* { dg-final { scan-tree-dump "The second split loop iterates at 0 latch times." "lsplit" } } */
diff --git a/gcc/testsuite/gcc.dg/loop-split-2.c b/gcc/testsuite/gcc.dg/loop-split-2.c
new file mode 100644
index 000..3659a7a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split-2.c
@@ -0,0 +1,34 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+#define NUM (100)
+int x[NUM] = {0, 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
+int y[NUM] = {0, 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
+int r[NUM] = {1, 1, 4, 5, 8, 11, 14, 17, 20, 23, 29, 31};
+
+extern void abort (void);
+int __attribute__((noinline)) foo (int *a, int *b, int len)
+{
+  int k, i;
+  for (k = 0, i = 1; k < len; k += 2, i += 2)
+{
+  a[k]++;
+
+  if (i < 1 + len)
+	b[k]++;
+}
+}
+
+int main (void)
+{
+  int i;
+
+  foo (x, y, 9);
+
+  for (i = 0; i < NUM; ++i)
+if (x[i] != r[i] || y[i] != r[i])
+  abort ();
+
+  return 0;
+}
+/* { dg-final { scan-tree-dump "The second split loop is never executed." "lsplit" } } */
diff --git a/gcc/testsuite/gcc.dg/loop-split-3.c b/gcc/testsuite/gcc.dg/loop-split-3.c
new file mode 100644
index 000..10e7cfd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split-3.c
@@ -0,0 +1,40 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+#define NUM (100)
+int x[NUM] = {0, 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
+int y[NUM] = {0, 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
+int r[NUM] = {0, 2, 4, 6, 8, 12, 14, 18, 20, 24, 30, 31};
+
+extern void abort (void);
+int __attribute__((noinline)) foo (int *a, int *b, int start, int end)
+{
+  int k;
+  for (k = start; k >= end; k--)
+{
+  a[k]++;
+
+  if (k > end)
+	b[k]++;
+}
+}
+
+int main (void)
+{
+  int i;
+
+  foo (x, y, 10, 1);
+
+  for (i = 0; i < NUM; ++i)
+{
+  if (i != 1
+	  && (x[i] != r[i] || y[i] != r[i]))
+	abort ();
+

[PATCH GCC][3/3]Refine CFG and bound information for split loops

2017-10-19 Thread Bin Cheng
Hi,
This is a rework of patch at 
https://gcc.gnu.org/ml/gcc-patches/2017-06/msg01037.html.
The new patch doesn't try to handle all cases; instead, it only handles
obvious cases.
It also tries to add tests illustrating different cases handled.
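For example, the new loop-split-1.c below is conceptually transformed like
this (illustrative; the actual split keeps both statements and folds the
guard):

  /* Before: the guard k < len is monotonic in k.  */
  for (k = 1; k <= len; k++)
    {
      a[k]++;
      if (k < len)
        b[k]++;
    }

  /* After -fsplit-loops, conceptually:  */
  for (k = 1; k < len; k++)   /* first split loop: guard always true */
    {
      a[k]++;
      b[k]++;
    }
  for (; k <= len; k++)       /* second split loop: 0 latch iterations */
    a[k]++;
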
Bootstrap and test for patch set on x86_64 and AArch64.  Comments?

Thanks,
bin
2017-10-16  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-loop-split.c (compute_new_first_bound): New parameter.
Compute and return bound information for the second split loop.
(adjust_loop_split): New function.
(split_loop): Update use and call above function.

gcc/testsuite/ChangeLog
2017-10-16  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/loop-split-1.c: New test.
* gcc.dg/loop-split-2.c: New test.
* gcc.dg/loop-split-3.c: New test.

[PATCH GCC][2/3]Simplify ((A +- CST1 CMP A +- CST2)) for undefined overflow type

2017-10-19 Thread Bin Cheng
Hi,
This patch adds a pattern simplifying (A +- CST1 CMP A +- CST2) for types
with undefined overflow behavior.
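For instance (illustrative):

  /* For signed x, assuming no overflow, x + 3 < x - 2 folds to the
     constant comparison 3 < -2, i.e. false.  */
  int cmp (int x) { return x + 3 < x - 2; }   /* folds to return 0 */
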
Bootstrap and test for patch set on x86_64 and AArch64.  Comments?

Thanks,
bin
2017-10-16  Bin Cheng  <bin.ch...@arm.com>

* match.pd (A +- CST1 CMP A +- CST2): New pattern.
From 6e31cde6560366242c15039a5b3032f5425750e0 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Thu, 10 Aug 2017 17:29:22 +0100
Subject: [PATCH 2/3] simplify-AopCst1-cmp-AopCst2-20170806.txt

---
 gcc/match.pd | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 64b023d..dae0f1c 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3485,7 +3485,30 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (if (cmp == LE_EXPR)
 	 (ge (convert:st @0) { build_zero_cst (st); })
 	 (lt (convert:st @0) { build_zero_cst (st); }))
- 
+
+/* A +- CST1 CMP A +- CST2 in type with undefined overflow behavior.  */
+(for cmp  (lt gt le ge)
+ (for xop (plus minus)
+  (for yop (plus minus)
+   (simplify
+(cmp (xop @0 INTEGER_CST@1) (yop @0 INTEGER_CST@2))
+(if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+	 && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0))
+	 && types_compatible_p (TREE_TYPE (@1), TREE_TYPE (@2)))
+ (with
+  {
+	tree cst1 = @1, cst2 = @2, zero = build_zero_cst (TREE_TYPE (@1));
+	if (xop == MINUS_EXPR)
+	  cst1 = int_const_binop (MINUS_EXPR, zero, cst1);
+	if (yop == MINUS_EXPR)
+	  cst2 = int_const_binop (MINUS_EXPR, zero, cst2);
+
+fold_overflow_warning (("assuming signed overflow does not occur "
+"when simplifying A +- CST cmp A +- CST"),
+			   WARN_STRICT_OVERFLOW_CONDITIONAL);
+  }
+  (cmp { cst1; } { cst2; })))
+
 (for cmp (unordered ordered unlt unle ungt unge uneq ltgt)
  /* If the second operand is NaN, the result is constant.  */
  (simplify
-- 
1.9.1



[PATCH GCC][1/3]Simplify (A + CST cmp A -> CST cmp zero) for undefined overflow type

2017-10-19 Thread Bin Cheng
Hi,
This is a rework of patch set at 
https://gcc.gnu.org/ml/gcc-patches/2017-06/msg01036.html
and https://gcc.gnu.org/ml/gcc-patches/2017-06/msg01037.html.  The patch set
improves niters bound analysis for split loops.  Instead of feeding bound
computation to the generic folder, this patch simplifies
(A + CST cmp A  ->  CST cmp zero) for types with undefined overflow behavior.
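For instance (illustrative):

  /* For signed i, assuming no overflow, i + 7 > i folds to 7 > 0,
     i.e. constant true, without going through the generic folder.  */
  int f (int i) { return i + 7 > i; }   /* folds to return 1 */
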
Bootstrap and test for patch set on x86_64 and AArch64.  Comments?

Thanks,
bin
2017-10-16  Bin Cheng  <bin.ch...@arm.com>

* match.pd (A + CST cmp A  ->  CST cmp zero): New simplification
for undefined overflow types in (A + CST CMP A  ->  A CMP' CST').
From 9eb5d484235b97ed6e4e5f153dd7f159d7365f38 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 16 Oct 2017 14:24:10 +0100
Subject: [PATCH 1/3] simplify-AopCst-cmp-A-20171006.txt

---
 gcc/match.pd | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index f2c4373..64b023d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3518,7 +3518,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* When one argument is a constant, overflow detection can be simplified.
Currently restricted to single use so as not to interfere too much with
ADD_OVERFLOW detection in tree-ssa-math-opts.c.
-   A + CST CMP A  ->  A CMP' CST' */
+   A + CST CMP A  ->  A CMP' CST'
+
+   For type with undefined overflow behavior, the expression can also be
+   simplified by assuming overflow won't happen.
+   A + CST cmp A  -> CST cmp zero.  */
 (for cmp (lt le ge gt)
  out (gt gt le le)
  (simplify
@@ -3530,7 +3534,18 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(with { unsigned int prec = TYPE_PRECISION (TREE_TYPE (@0)); }
 (out @0 { wide_int_to_tree (TREE_TYPE (@0),
 			wi::max_value (prec, UNSIGNED)
-- wi::to_wide (@1)); })
+- wi::to_wide (@1)); }))
+   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+&& TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)))
+(with
+ {
+   tree zero = build_zero_cst (TREE_TYPE (@1));
+
+   fold_overflow_warning (("assuming signed overflow does not occur "
+			   "when simplifying A + CST cmp A"),
+			  WARN_STRICT_OVERFLOW_CONDITIONAL);
+ }
+ (cmp @1 { zero; }))
 
 /* To detect overflow in unsigned A - B, A < B is simpler than A - B > A.
However, the detection logic for SUB_OVERFLOW in tree-ssa-math-opts.c
-- 
1.9.1



[PATCH PR82574]Check that datref must be executed exactly once per iteration against outermost loop in nest

2017-10-17 Thread Bin Cheng
Hi,
The patch fixes the ICE reported in PR82574.  In order to distribute a
builtin partition, we need to check that the data reference is executed
exactly once per iteration.  In distribution for a loop nest, this has to be
checked against each loop in the nest.  One optimization we can make is that,
for a perfect nest, we only need to check against the outermost loop.
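The check itself is a single dominance test (sketch mirroring the hunk
below, where LOOP is the outermost loop of the nest):

  /* STMT is executed exactly once per iteration only if its bb
     dominates the latch of the outermost loop; for a perfect nest this
     implies it dominates every inner loop's latch as well.  */
  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, gimple_bb (stmt)))
    return false;
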
Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin
2017-10-17  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82574
* tree-loop-distribution.c (find_single_drs): New parameter.  Check
that data reference must be executed exactly once per iteration
against the outermost loop in nest.
(classify_partition): Update call to above function.

gcc/testsuite
2017-10-17  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82574
* gcc.dg/tree-ssa/pr82574.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82574.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82574.c
new file mode 100644
index 000..8fc4596
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82574.c
@@ -0,0 +1,19 @@
+/* { dg-do run } */
+/* { dg-options "-O3" } */
+
+unsigned char a, b, c, d[200][200];
+
+void abort (void);
+
+int main ()
+{
+  for (; a < 200; a++)
+for (b = 0; b < 200; b++)
+  if (c)
+   d[a][b] = 1;
+
+  if ((c && d[0][0] != 1) || (!c && d[0][0] != 0))
+abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 5e835be..d029f98 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1283,12 +1283,12 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v)
   return partition;
 }
 
-/* Given PARTITION of RDG, record single load/store data references for
-   builtin partition in SRC_DR/DST_DR, return false if there is no such
+/* Given PARTITION of LOOP and RDG, record single load/store data references
+   for builtin partition in SRC_DR/DST_DR, return false if there is no such
data references.  */
 
 static bool
-find_single_drs (struct graph *rdg, partition *partition,
+find_single_drs (struct loop *loop, struct graph *rdg, partition *partition,
 data_reference_p *dst_dr, data_reference_p *src_dr)
 {
   unsigned i;
@@ -1344,10 +1344,12 @@ find_single_drs (struct graph *rdg, partition *partition,
   && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 return false;
 
-  /* Data reference must be executed exactly once per iteration.  */
+  /* Data reference must be executed exactly once per iteration of each
+ loop in the loop nest.  We only need to check dominant information
+ against the outermost one in a perfect loop nest because a bb can't
+ dominate outermost loop's latch without dominating inner loop's.  */
   basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  struct loop *inner = bb_st->loop_father;
-  if (!dominated_by_p (CDI_DOMINATORS, inner->latch, bb_st))
+  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
 return false;
 
   if (single_ld)
@@ -1365,14 +1367,16 @@ find_single_drs (struct graph *rdg, partition *partition,
 
   /* Load and store must be in the same loop nest.  */
   basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-  if (inner != bb_ld->loop_father)
+  if (bb_st->loop_father != bb_ld->loop_father)
return false;
 
-  /* Data reference must be executed exactly once per iteration.  */
-  if (!dominated_by_p (CDI_DOMINATORS, inner->latch, bb_ld))
+  /* Data reference must be executed exactly once per iteration.
+Same as single_st, we only need to check against the outermost
+loop.  */
+  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
return false;
 
-  edge e = single_exit (inner);
+  edge e = single_exit (bb_st->loop_father);
   bool dom_ld = dominated_by_p (CDI_DOMINATORS, e->src, bb_ld);
   bool dom_st = dominated_by_p (CDI_DOMINATORS, e->src, bb_st);
   if (dom_ld != dom_st)
@@ -1611,7 +1615,7 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition,
 return;
 
   /* Find single load/store data references for builtin partition.  */
-  if (!find_single_drs (rdg, partition, &single_st, &single_ld))
+  if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
 return;
 
   /* Classify the builtin kind.  */


[PATCH GCC]Introduce qsort_range interface for GCC vector

2017-10-16 Thread Bin Cheng
Hi,
I was asked by Richi to replace the insertion sort with qsort_range in the
loop nest distribution patch.  Although I believe a stable sort (thus
insertion sort) is needed in that case, I have also added the qsort_range
interface to vec.h.  The new interface might be useful in other places.
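A hypothetical use of the new interface (cmp_int and v are illustrative):

  /* Comparator following the usual qsort contract, written to avoid
     signed overflow.  */
  static int
  cmp_int (const void *a, const void *b)
  {
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
  }

  /* Sort elements [2, 5] of v in place; elements outside the range are
     untouched.  */
  v.qsort_range (2, 5, cmp_int);
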
Bootstrap and test on x86_64 and AArch64 with other patches.  Is it OK?

Thanks,
bin
2017-10-13  Bin Cheng  <bin.ch...@arm.com>

* vec.h (struct GTY((user)) vec<T, A, vl_embed>::qsort_range): New
member function.
(struct vec<T, va_heap, vl_ptr>): New member function.
From a6aa2866fb067628f63f508e9314c3a092b6055c Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 13 Oct 2017 13:55:03 +0100
Subject: [PATCH 7/8] vec-qsort_range-20171013.txt

---
 gcc/vec.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/gcc/vec.h b/gcc/vec.h
index cbdd439..f49177d 100644
--- a/gcc/vec.h
+++ b/gcc/vec.h
@@ -497,6 +497,7 @@ public:
   void unordered_remove (unsigned);
   void block_remove (unsigned, unsigned);
   void qsort (int (*) (const void *, const void *));
+  void qsort_range (unsigned, unsigned, int (*) (const void *, const void *));
   T *bsearch (const void *key, int (*compar)(const void *, const void *));
   unsigned lower_bound (T, bool (*)(const T &, const T &)) const;
   bool contains (const T &search) const;
@@ -974,6 +975,20 @@ vec<T, A, vl_embed>::qsort (int (*cmp) (const void *, const void *))
 }
 
 
+/* Sort the contents within range [S, E] of this vector with qsort.  Both
+   S and E should be within [0, length).  CMP is the comparison function
+   to pass to qsort.  */
+
+template<typename T, typename A>
+inline void
+vec<T, A, vl_embed>::qsort_range (unsigned s, unsigned e,
+  int (*cmp) (const void *, const void *))
+{
+  if (e > s && length () > e)
+::qsort (&(*this)[s], e - s + 1, sizeof (T), cmp);
+}
+
+
 /* Search the contents of the sorted vector with a binary search.
CMP is the comparison function to pass to bsearch.  */
 
@@ -1260,6 +1275,7 @@ public:
   void unordered_remove (unsigned);
   void block_remove (unsigned, unsigned);
   void qsort (int (*) (const void *, const void *));
+  void qsort_range (unsigned, unsigned, int (*) (const void *, const void *));
   T *bsearch (const void *key, int (*compar)(const void *, const void *));
   unsigned lower_bound (T, bool (*)(const T &, const T &)) const;
   bool contains (const T &search) const;
@@ -1736,6 +1752,20 @@ vec<T, va_heap, vl_ptr>::qsort (int (*cmp) (const void *, const void *))
 }
 
 
+/* Sort the contents within range [S, E] of this vector with qsort.  Both
+   S and E should be within [0, length).  CMP is the comparison function
+   to pass to qsort.  */
+
+template<typename T>
+inline void
+vec<T, va_heap, vl_ptr>::qsort_range (unsigned s, unsigned e,
+  int (*cmp) (const void *, const void *))
+{
+  if (m_vec)
+m_vec->qsort_range (s, e, cmp);
+}
+
+
 /* Search the contents of the sorted vector with a binary search.
CMP is the comparison function to pass to bsearch.  */
 
-- 
1.9.1



[PATCH GCC]Try harder to find base object by expanding base address

2017-10-13 Thread Bin Cheng
Hi,
I ran into this when investigating PR82369, in which we failed to find the
base object.  This simple patch tries harder to find the base object by
expanding the base address in alloc_iv.  In general, we don't want to do
aggressive expansion, but this case is fine because finding the base object
means a reduction happened during the expansion.  And it's good to have a
base object for address type iv_uses.
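The new test below is a reduced shape of the problem; the key access is:

  /* src_dst_offset == srcu - 2 * dstu, so expanding the base
     "dstu * 2 + src_dst_offset" through the name-expansion cache reduces
     it to srcu, revealing "src" as the base object; without the
     expansion no base object is found for this address iv_use.  */
  unsigned char v0 = *(unsigned char *) (dstu * 2 + src_dst_offset);
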
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-10-12  Bin Cheng  <bin.ch...@arm.com>

* tree-scalar-evolution.c (alloc_iv): New parameter controlling
base expansion for finding base object.
(find_interesting_uses_address): Adjust call to alloc_iv.

gcc/testsuite
2017-10-12  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ivopt_6.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_6.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_6.c
new file mode 100644
index 000..de94b88
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_6.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+typedef unsigned long int uintptr_t;
+typedef long unsigned int size_t;
+typedef long int ptrdiff_t;
+
+void foo (unsigned char *restrict dst, unsigned char *restrict src, size_t bytes)
+{
+  uintptr_t end_dst = (uintptr_t) (dst + bytes);
+  uintptr_t srcu = (uintptr_t) src, dstu = (uintptr_t) dst;
+  ptrdiff_t src_dst_offset = srcu - 2 * dstu;
+
+  do {
+ unsigned char v0 = *(unsigned char *) (dstu * 2 + src_dst_offset);
+ unsigned char v1 = *(unsigned char *) ((dstu * 2 + src_dst_offset) + 1);
+ unsigned char res = v1 + v0;
+
+ *((unsigned char*) dstu) = res;
+ dstu += 16;
+  } while (dstu < end_dst);
+}
+/* { dg-final { scan-tree-dump-times "Type:\tADDRESS" 3 "ivopts" } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index bbea619..4ccdf32 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -1160,11 +1160,12 @@ contain_complex_addr_expr (tree expr)
 }
 
 /* Allocates an induction variable with given initial value BASE and step STEP
-   for loop LOOP.  NO_OVERFLOW implies the iv doesn't overflow.  */
+   for loop LOOP.  NO_OVERFLOW implies the iv doesn't overflow.  If EXPAND_P
+   is true, this function expands base address to find base object.  */
 
 static struct iv *
 alloc_iv (struct ivopts_data *data, tree base, tree step,
- bool no_overflow = false)
+ bool no_overflow = false, bool expand_p = false)
 {
   tree expr = base;
   struct iv *iv = (struct iv*) obstack_alloc (&data->iv_obstack,
@@ -1185,8 +1186,22 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
   base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
 }
 
+  tree base_object = determine_base_object (base);
+  /* Try harder to find base object by expanding base.  */
+  if (expand_p && base_object == NULL_TREE)
+{
+  aff_tree comb;
+  expr = unshare_expr (base);
+      tree_to_aff_combination_expand (base, TREE_TYPE (base), &comb,
+				      &data->name_expansion_cache);
+      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
+  base_object = determine_base_object (base);
+  /* Fall back to unexpanded base if no base object is found.  */
+  if (!base_object)
+   base = expr;
+}
   iv->base = base;
-  iv->base_object = determine_base_object (base);
+  iv->base_object = base_object;
   iv->step = step;
   iv->biv_p = false;
   iv->nonlin_use = NULL;
@@ -2365,7 +2380,7 @@ find_interesting_uses_address (struct ivopts_data *data, gimple *stmt,
}
 }
 
-  civ = alloc_iv (data, base, step);
+  civ = alloc_iv (data, base, step, false, true);
   /* Fail if base object of this memory reference is unknown.  */
   if (civ->base_object == NULL_TREE)
 goto fail;


[PATCH GCC]Refine comment and set type for partition merged from SCC

2017-10-11 Thread Bin Cheng
Hi,
When reading the code I found it could be confusing without a comment.
This patch adds a comment explaining why we want to merge PARALLEL type
partitions in an SCC, even though the result partition can no longer
be executed in parallel.  It also sets the type of the result partition
to sequential.
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-10-10  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (break_alias_scc_partitions): Add comment
and set PTYPE_SEQUENTIAL for merged partition.
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 9ffac53..dc429cf 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2062,7 +2062,7 @@ break_alias_scc_partitions (struct graph *rdg,
   auto_vec<enum partition_type> scc_types;
   struct partition *partition, *first;
 
-  /* If all paritions in a SCC has the same type, we can simply merge the
+  /* If all partitions in a SCC have the same type, we can simply merge the
 SCC.  This loop finds out such SCCS and record them in bitmap.  */
   bitmap_set_range (sccs_to_merge, 0, (unsigned) num_sccs);
   for (i = 0; i < num_sccs; ++i)
@@ -2075,6 +2075,10 @@ break_alias_scc_partitions (struct graph *rdg,
  if (pg->vertices[j].component != i)
continue;
 
+ /* Note we Merge partitions of parallel type on purpose, though
+the result partition is sequential.  The reason is vectorizer
+can do more accurate runtime alias check in this case.  Also
+it results in more conservative distribution.  */
  if (first->type != partition->type)
{
  bitmap_clear_bit (sccs_to_merge, i);
@@ -2096,7 +2100,7 @@ break_alias_scc_partitions (struct graph *rdg,
   if (bitmap_count_bits (sccs_to_merge) != (unsigned) num_sccs)
{
  /* Run SCC finding algorithm again, with alias dependence edges
-skipped.  This is to topologically sort paritions according to
+skipped.  This is to topologically sort partitions according to
 compilation time known dependence.  Note the topological order
 is stored in the form of pg's post order number.  */
  num_sccs_no_alias = graphds_scc (pg, NULL, pg_skip_alias_edge);
@@ -2139,6 +2143,8 @@ break_alias_scc_partitions (struct graph *rdg,
  data = (struct pg_vdata *)pg->vertices[k].data;
  gcc_assert (data->id == k);
  data->partition = NULL;
+ /* The result partition of merged SCC must be sequential.  */
+ first->type = PTYPE_SEQUENTIAL;
}
}
 }


[PATCH PR82472]Update postorder number for merged partition.

2017-10-11 Thread Bin Cheng
Hi,
This patch fixes the reported ICE.  The root cause is that the postorder
number is not updated after merging partitions in an SCC.  As a result, the
reduction partition may not be scheduled last, because partitions are sorted
in descending postorder.
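Concretely (sketch of the hunk below): the reduction partition must be
emitted last, so when a non-reduction partition FIRST absorbs a reduction
PARTITION with a smaller postorder number, FIRST inherits that number:

  /* Keep the merged reduction partition sorted after all others.  */
  if (!partition_reduction_p (first) && partition_reduction_p (partition))
    pg->vertices[j].post = pg->vertices[k].post;
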
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-10-10  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82472
* tree-loop-distribution.c (sort_partitions_by_post_order): Refine
comment.
(break_alias_scc_partitions): Update postorder number.

gcc/testsuite
2017-10-10  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82472
* gcc.dg/tree-ssa/pr82472.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82472.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82472.c
new file mode 100644
index 000..445c95f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82472.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution" } */
+
+long int xj;
+
+int
+cx (long int *ox, short int mk, char tf)
+{
+  int si, f9;
+  char *p4 = (char *) &xj;
+  short int *rm = (tf != 0) ? (short int *) &xj : &mk;
+
+  for (f9 = 0; f9 < 2; ++f9)
+{
+  *rm = 0;
+  *p4 = *ox;
+  si = mk;
+  xj = 0;
+  while (p4 < (char *)rm)
+++p4;
+}
+
+  return si;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 26b8b9a..9ffac53 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1939,7 +1939,8 @@ build_partition_graph (struct graph *rdg,
   return pg;
 }
 
-/* Sort partitions in PG by post order and store them in PARTITIONS.  */
+/* Sort partitions in PG in descending post order and store them in
+   PARTITIONS.  */
 
 static void
 sort_partitions_by_post_order (struct graph *pg,
@@ -1948,7 +1949,7 @@ sort_partitions_by_post_order (struct graph *pg,
   int i;
   struct pg_vdata *data;
 
-  /* Now order the remaining nodes in postorder.  */
+  /* Now order the remaining nodes in descending postorder.  */
   qsort (pg->vertices, pg->n_vertices, sizeof (vertex), pgcmp);
   partitions->truncate (0);
   for (i = 0; i < pg->n_vertices; ++i)
@@ -2044,7 +2045,7 @@ break_alias_scc_partitions (struct graph *rdg,
vec<struct partition *> *partitions,
vec<ddr_p> *alias_ddrs)
 {
-  int i, j, num_sccs, num_sccs_no_alias;
+  int i, j, k, num_sccs, num_sccs_no_alias;
   /* Build partition dependence graph.  */
   graph *pg = build_partition_graph (rdg, partitions, false);
 
@@ -2117,18 +2118,26 @@ break_alias_scc_partitions (struct graph *rdg,
  for (j = 0; partitions->iterate (j, ); ++j)
if (cbdata.vertices_component[j] == i)
  break;
- for (++j; partitions->iterate (j, ); ++j)
+ for (k = j + 1; partitions->iterate (k, ); ++k)
{
  struct pg_vdata *data;
 
- if (cbdata.vertices_component[j] != i)
+ if (cbdata.vertices_component[k] != i)
continue;
 
+ /* Update postorder number so that merged reduction partition is
+sorted after other partitions.  */
+ if (!partition_reduction_p (first)
+ && partition_reduction_p (partition))
+   {
+ gcc_assert (pg->vertices[k].post < pg->vertices[j].post);
+ pg->vertices[j].post = pg->vertices[k].post;
+   }
  partition_merge_into (NULL, first, partition, FUSE_SAME_SCC);
- (*partitions)[j] = NULL;
+ (*partitions)[k] = NULL;
  partition_free (partition);
- data = (struct pg_vdata *)pg->vertices[j].data;
- gcc_assert (data->id == j);
+ data = (struct pg_vdata *)pg->vertices[k].data;
+ gcc_assert (data->id == k);
  data->partition = NULL;
}
}


[PATCH GCC][7/7]Merge adjacent memset builtin partitions

2017-10-05 Thread Bin Cheng
Hi,
This patch merges adjacent memset builtin partitions if possible.  It is
a useful special-case optimization transforming the code below:

#define M (256)
#define N (512)

struct st
{
  int a[M][N];
  int c[M];
  int b[M][N];
};

void
foo (struct st *p)
{
  for (unsigned i = 0; i < M; ++i)
{
  p->c[i] = 0;
  for (unsigned j = N; j > 0; --j)
{
  p->a[i][j - 1] = 0;
  p->b[i][j - 1] = 0;
}
}

into a single memset function call, rather than three calls initializing
the structure field by field.
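
For M = 256, N = 512 and 4-byte int, the fused call covers the whole struct;
this is exactly the size the new ldist-32.c test below scans for:

  __builtin_memset (p, 0, 1049600);   /* 2 * 256*512*4 + 256*4 bytes */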

Bootstrap and test in patch set on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (tree-ssa-loop-ivopts.h): New header file.
(struct builtin_info): New fields.
(classify_builtin_1): Compute and record base and offset parts for
memset builtin partition by calling strip_offset.
(fuse_memset_builtins): New function.
(finalize_partitions): Fuse adjacent memset partitions by calling
above function.
* tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
Expose the interface.
* tree-ssa-loop-ivopts.h (strip_offset): New declaration.

gcc/testsuite/ChangeLog
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
* gcc.dg/tree-ssa/ldist-32.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
index 4efc0a4..b3617f6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
@@ -45,5 +45,5 @@ mad_synth_mute (struct mad_synth *synth)
   return;
 }
 
-/* { dg-final { scan-tree-dump "distributed: split to 0 loops and 4 library calls" "ldist" } } */
-/* { dg-final { scan-tree-dump-times "generated memset zero" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 0 loops and 1 library calls" "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated memset zero" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
new file mode 100644
index 000..477d222
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st *p)
+{
+  for (unsigned i = 0; i < M; ++i)
+{
+  p->c[i] = 0;
+  for (unsigned j = N; j > 0; --j)
+   {
+ p->a[i][j - 1] = 0;
+ p->b[i][j - 1] = 0;
+   }
+}
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 1 library" 1 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_memset \\(.*, 0, 1049600\\);" 1 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 237474f..ac1903d 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -106,6 +106,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stor-layout.h"
 #include "tree-cfg.h"
 #include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop.h"
 #include "tree-into-ssa.h"
 #include "tree-ssa.h"
@@ -604,6 +605,10 @@ struct builtin_info
   tree dst_base;
   tree src_base;
   tree size;
+  /* Base and offset part of dst_base after stripping constant offset.  This
+ is only used in memset builtin distribution for now.  */
+  tree dst_base_base;
+  unsigned HOST_WIDE_INT dst_base_offset;
 };
 
 /* Partition for loop distribution.  */
@@ -1500,7 +1505,11 @@ classify_builtin_1 (loop_p loop, partition *partition, 
data_reference_p dr)
   if (!compute_access_range (loop, dr, , ))
 return;
 
-  partition->builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  struct builtin_info *builtin;
+  builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  builtin->dst_base_base = strip_offset (builtin->dst_base,
+>dst_base_offset);
+  partition->builtin = builtin;
   partition->kind = PKIND_MEMSET;
 }
 
@@ -2461,6 +2470,115 @@ version_for_distribution_p (vec<struct partition *> *partitions,
   return (alias_ddrs->length () > 0);
 }
 
+/* Fuse adjacent memset builtin PARTITIONS if possible.  This is a special
+   case optimization transforming below code:
+
+ __builtin_memset (&a, 0, 100);
+ _1 = &a + 100;
+ __builtin_memset (_1, 0, 200);
+ _2 = &a + 300;
+ __builtin_memset (_2, 0, 100);
+
+   into:
+
+ __builtin_memset (&a, 0, 400);
+
+   Note we do

[PATCH GCC][6/7]Support loop nest distribution for builtin partition

2017-10-05 Thread Bin Cheng
Hi,
This patch rewrites the classification part of builtin partitions so that
nested builtin partitions are supported.  With this extension, the loop nest
below:
void
foo (void)
{
  for (unsigned i = 0; i < M; ++i)
    for (unsigned j = 0; j < N; ++j)
      arr[i][j] = 0;
}

will be distributed into a single memset, rather than a loop of memsets.
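Assuming arr is a contiguous int arr[M][N], the result is roughly:

  __builtin_memset (arr, 0, (size_t) M * N * sizeof (int));
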
Bootstrap and test in patch set on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (struct builtin_info): New struct.
(struct partition): Refactor fields into struct builtin_info.
(partition_free): Free struct builtin_info.
(build_size_arg_loc, build_addr_arg_loc): Delete.
(generate_memset_builtin, generate_memcpy_builtin): Get memory range
information from struct builtin_info.
(find_single_drs): New function refactored from classify_partition.
Also moved builtin validity checks to this function.
(compute_access_range, alloc_builtin): New functions.
(classify_builtin_1, classify_builtin_2): New functions.
(classify_partition): Refactor code into functions find_single_drs,
classify_builtin_1 and classify_builtin_2.
(distribute_loop): Don't do runtime alias check when distributing
loop nest.
(find_seed_stmts_for_distribution): New function.
(pass_loop_distribution::execute): Refactor code finding seed
stmts into above function.  Support distribution for the innermost
two-level loop nest.  Adjust dump information.

gcc/testsuite/ChangeLog
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-28.c: New test.
* gcc.dg/tree-ssa/ldist-29.c: New test.
* gcc.dg/tree-ssa/ldist-30.c: New test.
* gcc.dg/tree-ssa/ldist-31.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c
new file mode 100644
index 000..4420139
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (1024)
+int arr[M][N];
+
+void
+foo (void)
+{
+  for (unsigned i = 0; i < M; ++i)
+for (unsigned j = 0; j < N; ++j)
+  arr[i][j] = 0;
+}
+
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 0 loops and 1 library" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c
new file mode 100644
index 000..9ce93e8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+int arr[M][N];
+
+void
+foo (void)
+{
+  for (unsigned i = 0; i < M; ++i)
+for (unsigned j = 0; j < N - 1; ++j)
+  arr[i][j] = 0;
+}
+
+/* { dg-final { scan-tree-dump-not "Loop nest . distributed: split to" "ldist" } } */
+/* { dg-final { scan-tree-dump-times "Loop . distributed: split to 0 loops and 1 library" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c
new file mode 100644
index 000..f31860a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+int a[M][N], b[M][N];
+
+void
+foo (void)
+{
+  for (unsigned i = 0; i < M; ++i)
+for (unsigned j = N; j > 0; --j)
+  a[i][j - 1] = b[i][j - 1];
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-31.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-31.c
new file mode 100644
index 000..60a9f74
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-31.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+int a[M][N], b[M][N], c[M];
+
+void
+foo (void)
+{
+  for (int i = M - 1; i >= 0; --i)
+{
+  c[i] = 0;
+  for (unsigned j = N; j > 0; --j)
+   a[i][j - 1] = b[i][j - 1];
+}
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 2 library" 1 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 59a968c..237474f 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -581,72 +581,82 @@ build_rdg (struct loop *loop, control_dependences *cd)

[PATCH GCC][5/7]Extend loop distribution for two-level innermost loop nest

2017-10-05 Thread Bin Cheng
Hi,
For now the distribution pass only handles innermost loops.  This patch
extends the pass to cover two-level innermost loop nests.  It also refactors
code in pass_loop_distribution::execute for better readability.  Note I
restrict it to two-level loop nests on purpose because of the high cost of
data dependence computation.  Some compilation time optimizations, like
reusing the data reference finding and data dependence computation, would
require a rewrite of this pass along the lines of the proposed loop
interchange implementation.  But that's another task.

This patch introduces a temporary TODO for loop nest builtin partitions,
which is covered by the next two patches.

With this patch, the kernel loop in bwaves can now be distributed, and thus
exposed for further interchange.  This patch adds a new test for matrix
multiplication, as well as adjusting the test strings of existing tests.
Bootstrap and test in patch set on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c: Adjust the general comment.
(NUM_PARTITION_THRESHOLD): New macro.
(ssa_name_has_uses_outside_loop_p): Support loop nest distribution.
(classify_partition): Skip builtin pattern of loop nest's inner loop.
(merge_dep_scc_partitions): New parameter ignore_alias_p and use it
in call to build_partition_graph.
(finalize_partitions): New parameter.  Make loop distribution more
conservative by fusing more partitions.
(distribute_loop): Don't do runtime alias check in case of loop nest
distribution.
(find_seed_stmts_for_distribution): New function.
(pass_loop_distribution::execute): Refactor code finding seed stmts
into above function.  Support loop nest distribution for two-level
innermost loop nest.  Adjust dump information.

gcc/testsuite/ChangeLog
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-7.c: Adjust test string.
* gcc.dg/tree-ssa/ldist-16.c: Ditto.
* gcc.dg/tree-ssa/ldist-25.c: Ditto.
* gcc.dg/tree-ssa/ldist-33.c: Ditto.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-16.c
index f43b64e..f4f3a44 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-16.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-16.c
@@ -16,5 +16,5 @@ void foo (int n)
 
 /* We should not apply loop distribution and not generate a memset (0).  */
 
-/* { dg-final { scan-tree-dump "Loop 1 is the same" "ldist" } } */
+/* { dg-final { scan-tree-dump "Loop 1 not distributed" "ldist" } } */
 /* { dg-final { scan-tree-dump-times "generated memset zero" 0 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-25.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-25.c
index 699bf38..c0b95fc 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-25.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-25.c
@@ -22,4 +22,4 @@ foo (void)
 }
 }
 
-/* { dg-final { scan-tree-dump "Loop . is the same" "ldist" } } */
+/* { dg-final { scan-tree-dump "Loop . not distributed" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-33.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-33.c
new file mode 100644
index 000..24d27fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-33.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define N (1024)
+double a[N][N], b[N][N], c[N][N];
+
+void
+foo (void)
+{
+  unsigned i, j, k;
+
+  for (i = 0; i < N; ++i)
+for (j = 0; j < N; ++j)
+  {
+   c[i][j] = 0.0;
+   for (k = 0; k < N; ++k)
+ c[i][j] += a[i][k] * b[k][j];
+  }
+}
+
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops and 1 library" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-7.c
index f31d051..2eb1f74 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-7.c
@@ -28,4 +28,4 @@ int loop1 (int k)
   return a[1000-2] + b[1000-1] + c[1000-2] + d[1000-2];
 }
 
-/* { dg-final { scan-tree-dump-times "distributed" 0 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "distributed: " 0 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 999b32e..59a968c 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -83,8 +83,8 @@ along with GCC; see the file COPYING3.  If not see
loops and recover to the original one.
 
TODO:
- 1) We only distribute innermost loops now.  This pass should handle loop
-   nests in the future.
+ 1) We only distribute innermost two-level loop nest now.  We should
+   extend it for arbitrary loop nests in the future.
  

[PATCH GCC][4/7]Choose exit edge/path when removing inner loop's exit statement

2017-10-05 Thread Bin Cheng
Hi,
Function generate_loops_for_partition chooses an arbitrary path when removing
an exit condition not in the partition.  This is fine for now because it's
impossible to have a loop exit condition in case of innermost-only
distribution.  After extending to loop nest distribution, we must choose the
exit edge/path for an inner loop's exit condition, otherwise an infinite
empty loop would be generated.  Test case added.
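
The fix forces such a condition to take the exit edge (sketch of the hunk
below):

  /* If the removed control stmt is the inner loop's exit condition,
     make it always exit; picking an arbitrary edge could leave an
     infinite empty loop behind.  */
  if (inner_exit && (inner_exit->flags & EDGE_TRUE_VALUE))
    gimple_cond_make_true (cond_stmt);
  else
    gimple_cond_make_false (cond_stmt);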

Bootstrap and test in patch set on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (generate_loops_for_partition): Remove
inner loop's exit stmt by making it always exit the loop, otherwise
we would generate an infinite empty loop.

gcc/testsuite/ChangeLog
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-27.c: New test.
From 29f15d5a166b139d8d2dad2ee798c4d0a338f820 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 25 Sep 2017 16:52:42 +0100
Subject: [PATCH 4/7] loop_nest-exit-cond-distribution.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-27.c | 38 
 gcc/tree-loop-distribution.c | 16 +++---
 2 files changed, 51 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-27.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-27.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-27.c
new file mode 100644
index 000..3580c65
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-27.c
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (300)
+#define N (200)
+
+struct st
+{
+  double a[M];
+  double b[M];
+  double c[M][N];
+};
+
+int __attribute__ ((noinline)) foo (struct st *s)
+{
+  int i, j;
+  for (i = 0; i != M;)
+{
+  s->a[i] = 0.0;
+  s->b[i] = 1.0;
+  for (j = 0; 1; ++j)
+	{
+	  if (j == N) goto L2;
+	  s->c[i][j] = 0.0;
+	}
+L2:
+  ++i;
+}
+  return 0;
+}
+
+int main (void)
+{
+  struct st s;
+  return foo ();
+}
+
+/* { dg-final { scan-tree-dump "distributed: split to " "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 3db3d6e..999b32e 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -830,6 +830,10 @@ generate_loops_for_partition (struct loop *loop, partition *partition,
   for (i = 0; i < loop->num_nodes; i++)
 {
   basic_block bb = bbs[i];
+  edge inner_exit = NULL;
+
+  if (loop != bb->loop_father)
+	inner_exit = single_exit (bb->loop_father);
 
   for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);)
 	{
@@ -848,11 +852,17 @@ generate_loops_for_partition (struct loop *loop, partition *partition,
 	  && !is_gimple_debug (stmt)
 	  && !bitmap_bit_p (partition->stmts, gimple_uid (stmt)))
 	{
-	  /* Choose an arbitrary path through the empty CFG part
-		 that this unnecessary control stmt controls.  */
+	  /* In distribution of loop nest, if bb is inner loop's exit_bb,
+		 we choose its exit edge/path in order to avoid generating
+		 infinite loop.  For all other cases, we choose an arbitrary
+		 path through the empty CFG part that this unnecessary
+		 control stmt controls.  */
 	  if (gcond *cond_stmt = dyn_cast  (stmt))
 		{
-		  gimple_cond_make_false (cond_stmt);
+		  if (inner_exit && inner_exit->flags & EDGE_TRUE_VALUE)
+		gimple_cond_make_true (cond_stmt);
+		  else
+		gimple_cond_make_false (cond_stmt);
 		  update_stmt (stmt);
 		}
 	  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-- 
1.9.1



[PATCH GCC][3/7]Don't skip renaming PHIs in loop nest with only one inner loop

2017-10-05 Thread Bin Cheng
Hi,
Function rename_variables_in_bb skips renaming PHI nodes in a loop nest if
the outer loop has only one inner loop.  This breaks loop nest distribution
when the inner loop has a PHI node initialized from the outer loop's
variable.  Unfortunately, I lost the original C code illustrating the issue.
Now it is only triggered in building spec2006/416.gamess with loop nest
distribution, but I failed to reduce a test from it.
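A hypothetical shape of the failure (reconstructed, since the original test
was lost):

  /* x_1 is defined in the outer loop and feeds the inner loop's PHI:

       outer loop body:
         x_1 = ...;
       inner loop header:
         p_2 = PHI <x_1 (preheader), p_3 (latch)>

     Skipping PHI renaming while copying the nest leaves the copied PHI
     argument referring to the original x_1 rather than its copy.  */
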
Bootstrap and test in patch set on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-vect-loop-manip.c (rename_variables_in_bb): Rename PHI nodes
when copying loop nest with only one inner loop.
From fa3cc2014278d672db94bad5c6a606cb6888d79a Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 25 Sep 2017 16:40:11 +0100
Subject: [PATCH 3/7] rename-variables-in-loop-nest.txt

---
 gcc/tree-vect-loop-manip.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 2c724a2..9fd65a7 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -117,8 +117,6 @@ rename_variables_in_bb (basic_block bb, bool rename_from_outer_loop)
 		  || single_pred (e->src) != outer_loop->header)
 		continue;
 		}
-	  else
-		continue;
 	}
 	}
   for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
-- 
1.9.1



[PATCH GCC][2/7]Don't rename variables for deleted new preheader

2017-10-05 Thread Bin Cheng
Hi,
I noticed that the new_preheader basic block could be deleted if the copied
loop is added at the entry in function slpeel_tree_duplicate_loop_to_edge_cfg.
This simple patch skips new_preheader during variable renaming if it is
deleted.
Bootstrap and test in patch set on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-vect-loop-manip.c (slpeel_tree_duplicate_loop_to_edge_cfg): Skip
renaming variables in new preheader if it's deleted.
From 9c7719402c9528b517d8408419c2e9b930708772 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 22 Sep 2017 16:50:40 +0100
Subject: [PATCH 2/7] skip-new_preheader.txt

---
 gcc/tree-vect-loop-manip.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index f78e4b4..2c724a2 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -496,7 +496,8 @@ slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *loop,
 			   loop_preheader_edge (new_loop)->src);
 }
 
-  for (unsigned i = 0; i < scalar_loop->num_nodes + 1; i++)
+  /* Skip new preheader since it's deleted if copy loop is added at entry.  */
+  for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
 rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
 
   if (scalar_loop != loop)
-- 
1.9.1



[PATCH GCC][1/7]Delete unused field of struct partition in loop distribution

2017-10-05 Thread Bin Cheng
Hi,
This patch set implements distribution and builtin pattern distribution for
loop nests.  It consists of the patches below:
  Patches [1~4]: Cleanup and (latent) bug fixes.
  Patch [5]: Loop nest distribution of two-level innermost loop nest.
  Patches [6,7]: Loop nest builtin pattern distribution.

This is the first simple patch, deleting an unused field of struct partition
in loop distribution.  It's an obvious change.
Bootstrap and test in patch set on x86_64 and AArch64.

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (struct partition): Remove unused field
loops of the structure.
(partition_alloc, partition_free): Ditto.
(build_rdg_partition_for_vertex): Ditto.
From 6f1b39f7bea2e77fc320bc70829b3e1445633d1b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 26 Sep 2017 16:54:35 +0100
Subject: [PATCH 1/7] struct-partition-20170925.txt

---
 gcc/tree-loop-distribution.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 26b8b9a..3db3d6e 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -593,8 +593,6 @@ struct partition
 {
   /* Statements of the partition.  */
   bitmap stmts;
-  /* Loops of the partition.  */
-  bitmap loops;
   /* True if the partition defines variable which is used outside of loop.  */
   bool reduction_p;
   /* For builtin partition, true if it executes one iteration more than
@@ -619,7 +617,6 @@ partition_alloc (void)
 {
   partition *partition = XCNEW (struct partition);
   partition->stmts = BITMAP_ALLOC (NULL);
-  partition->loops = BITMAP_ALLOC (NULL);
   partition->reduction_p = false;
   partition->kind = PKIND_NORMAL;
   partition->datarefs = BITMAP_ALLOC (NULL);
@@ -632,7 +629,6 @@ static void
 partition_free (partition *partition)
 {
   BITMAP_FREE (partition->stmts);
-  BITMAP_FREE (partition->loops);
   BITMAP_FREE (partition->datarefs);
   free (partition);
 }
@@ -1279,8 +1275,6 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v)
   FOR_EACH_VEC_ELT (nodes, i, x)
 {
   bitmap_set_bit (partition->stmts, x);
-  bitmap_set_bit (partition->loops,
-		  loop_containing_stmt (RDG_STMT (rdg, x))->num);
 
   for (j = 0; RDG_DATAREFS (rdg, x).iterate (j, ); ++j)
 	{
-- 
1.9.1



[PATCH PR82163/V2]New interface checking LCSSA for single loop

2017-09-22 Thread Bin Cheng
Hi,
This is the V2 patch fixing PR82163.  It rewrites verify_loop_closed_ssa to
check the uses of all definitions inside the loop.  This gives the advantage
that we can check loop closed ssa form for a specific loop, rather than for
the whole function.  The interface is used in fixing this issue.
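Call sites then look like this (sketch, assuming the new parameter defaults
to NULL):

  verify_loop_closed_ssa (true);        /* check every loop, as before */
  verify_loop_closed_ssa (true, loop);  /* check only LOOP */
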
Bootstrap and test on x86_64, is it OK?

Thanks,
bin
2017-09-21  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82163
* tree-ssa-loop-manip.h (verify_loop_closed_ssa): New parameter.
(checking_verify_loop_closed_ssa): New parameter.
* tree-ssa-loop-manip.c (check_loop_closed_ssa_use): Delete.
(check_loop_closed_ssa_stmt): Delete.
(check_loop_closed_ssa_def, check_loop_closed_ssa_bb): New functions.
(verify_loop_closed_ssa): Check loop closed ssa form for LOOP.
(tree_transform_and_unroll_loop): Check loop closed ssa form only for
changed loops.

gcc/testsuite
2017-09-21  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82163
* gcc.dg/tree-ssa/pr82163.c: New test.
From ef756285db3685bd97bbe7d144d58af477251416 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Thu, 21 Sep 2017 12:41:32 +0100
Subject: [PATCH] pr82163-20170921.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr82163.c | 23 +
 gcc/tree-ssa-loop-manip.c   | 91 +++--
 gcc/tree-ssa-loop-manip.h   |  6 +--
 3 files changed, 80 insertions(+), 40 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr82163.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c
new file mode 100644
index 000..fef2b1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a, b, c[4], d, e, f, g;
+
+void h ()
+{
+  for (; a; a++)
+{
+  c[a + 3] = g;
+  if (b)
+c[a] = f;
+  else
+{
+  for (; d; d++)
+c[d + 3] = c[d];
+  for (e = 1; e == 2; e++)
+;
+  if (e)
+break;
+}
+}
+}
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index d6ba305..6ad0b75 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -690,48 +690,62 @@ rewrite_virtuals_into_loop_closed_ssa (struct loop *loop)
   rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_VIRTUAL_USES, loop);
 }
 
-/* Check invariants of the loop closed ssa form for the USE in BB.  */
+/* Check invariants of the loop closed ssa form for the def in DEF_BB.  */
 
 static void
-check_loop_closed_ssa_use (basic_block bb, tree use)
+check_loop_closed_ssa_def (basic_block def_bb, tree def)
 {
-  gimple *def;
-  basic_block def_bb;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, def)
+{
+  if (is_gimple_debug (USE_STMT (use_p)))
+   continue;
 
-  if (TREE_CODE (use) != SSA_NAME || virtual_operand_p (use))
-return;
+  basic_block use_bb = gimple_bb (USE_STMT (use_p));
+  if (is_a <gphi *> (USE_STMT (use_p)))
+   use_bb = EDGE_PRED (use_bb, PHI_ARG_INDEX_FROM_USE (use_p))->src;
 
-  def = SSA_NAME_DEF_STMT (use);
-  def_bb = gimple_bb (def);
-  gcc_assert (!def_bb
- || flow_bb_inside_loop_p (def_bb->loop_father, bb));
+  gcc_assert (flow_bb_inside_loop_p (def_bb->loop_father, use_bb));
+}
 }
 
-/* Checks invariants of loop closed ssa form in statement STMT in BB.  */
+/* Checks invariants of loop closed ssa form in BB.  */
 
 static void
-check_loop_closed_ssa_stmt (basic_block bb, gimple *stmt)
+check_loop_closed_ssa_bb (basic_block bb)
 {
-  ssa_op_iter iter;
-  tree var;
+  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+       gsi_next (&bsi))
+{
+  gphi *phi = bsi.phi ();
 
-  if (is_gimple_debug (stmt))
-return;
+  if (!virtual_operand_p (PHI_RESULT (phi)))
+   check_loop_closed_ssa_def (bb, PHI_RESULT (phi));
+}
+
+  for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+       gsi_next (&bsi))
+{
+  ssa_op_iter iter;
+  tree var;
+  gimple *stmt = gsi_stmt (bsi);
+
+  if (is_gimple_debug (stmt))
+   continue;
 
-  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
-check_loop_closed_ssa_use (bb, var);
+  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
+   check_loop_closed_ssa_def (bb, var);
+}
 }
 
 /* Checks that invariants of the loop closed ssa form are preserved.
-   Call verify_ssa when VERIFY_SSA_P is true.  */
+   Call verify_ssa when VERIFY_SSA_P is true.  Note all loops are checked
+   if LOOP is NULL, otherwise, only LOOP is checked.  */
 
 DEBUG_FUNCTION void
-verify_loop_closed_ssa (bool verify_ssa_p)
+verify_loop_closed_ssa (bool verify_ssa_p, struct loop *loop)
 {
-  basic_block bb;
-  edge e;
-  edge_iterator ei;
-
   if (number_of_loops (cfun) <= 1)

[PATCH PR82163]Rewrite loop into lcssa form instantly

2017-09-14 Thread Bin Cheng
Hi,
The current pcom implementation rewrites into lcssa form after all loops are
transformed; this is not enough because unrolling of a later loop checks
lcssa form in function tree_transform_and_unroll_loop.  This simple patch
rewrites a loop into lcssa form as soon as its store-store chain is handled.
I think it doesn't affect compilation time since
rewrite_into_loop_closed_ssa_1 is only called for the store-store chain
transformation and only the transformed loop is rewritten.
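For reference, loop closed ssa form means every value defined in a loop and
used after it flows through a dedicated single-argument PHI in the exit
block (illustrative GIMPLE dump):

  /* in the loop */
  x_2 = x_1 + 1;
  ...
  /* in the exit block; all uses after the loop refer to x_3 */
  x_3 = PHI <x_2 (loop-exit edge)>
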
Bootstrap and test ongoing on x86_64.  is it OK if no failures?

Thanks,
bin
2017-09-14  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82163
* tree-predcom.c (tree_predictive_commoning_loop): Rewrite into
loop closed ssa instantly.  Return boolean true if loop is unrolled.
(tree_predictive_commoning): Return TODO_cleanup_cfg if loop is
unrolled.

gcc/testsuite
2017-09-14  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/82163
* gcc.dg/tree-ssa/pr82163.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c
new file mode 100644
index 000..fef2b1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a, b, c[4], d, e, f, g;
+
+void h ()
+{
+  for (; a; a++)
+{
+  c[a + 3] = g;
+  if (b)
+c[a] = f;
+  else
+{
+  for (; d; d++)
+c[d + 3] = c[d];
+  for (e = 1; e == 2; e++)
+;
+  if (e)
+break;
+}
+}
+}
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index e7b10cb..ffbe332 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -3014,11 +3014,10 @@ insert_init_seqs (struct loop *loop, vec<chain_p> chains)
   }
 }
 
-/* Performs predictive commoning for LOOP.  Sets bit 1<<0 of return value
-   if LOOP was unrolled; Sets bit 1<<1 of return value if loop closed ssa
-   form was corrupted.  */
+/* Performs predictive commoning for LOOP.  Returns true if LOOP was
+   unrolled.  */
 
-static unsigned
+static bool
 tree_predictive_commoning_loop (struct loop *loop)
 {
   vec<data_reference_p> datarefs;
@@ -3154,7 +3153,13 @@ end: ;
 
   free_affine_expand_cache (_expansions);
 
-  return (unroll ? 1 : 0) | (loop_closed_ssa ? 2 : 0);
+  /* Rewrite loop into loop closed ssa form if necessary.  We can not do it
+ after all loops are transformed because unrolling of later loop checks
+ loop closed ssa form.  */
+  if (loop_closed_ssa)
+rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_USE, loop);
+
+  return unroll;
 }
 
 /* Runs predictive commoning.  */
@@ -3163,7 +3168,7 @@ unsigned
 tree_predictive_commoning (void)
 {
   struct loop *loop;
-  unsigned ret = 0, changed = 0;
+  bool changed = 0;
 
   initialize_original_copy_tables ();
   FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
     if (optimize_loop_for_speed_p (loop))
       {
	 changed |= tree_predictive_commoning_loop (loop);
       }
   free_original_copy_tables ();
 
-  if (changed > 0)
+  if (changed)
 {
   scev_reset ();
-
-  if (changed > 1)
-   rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
-
-  ret = TODO_cleanup_cfg;
+  return TODO_cleanup_cfg;
 }
 
-  return ret;
+  return 0;
 }
 
 /* Predictive commoning Pass.  */


[PATCH GCC]A simple implementation of loop interchange

2017-08-30 Thread Bin Cheng
Hi,
This patch implements a simple loop interchange pass in GCC, as described by 
its comments:
+/* This pass performs loop interchange: for example, the loop nest
+
+   for (int j = 0; j < N; j++)
+ for (int k = 0; k < N; k++)
+   for (int i = 0; i < N; i++)
+c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   is transformed to
+
+   for (int i = 0; i < N; i++)
+ for (int j = 0; j < N; j++)
+   for (int k = 0; k < N; k++)
+c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   This pass implements loop interchange in the following steps:
+
+ 1) Find perfect loop nest for each innermost loop and compute data
+   dependence relations for it.  For above example, loop nest is
+   <loop_j, loop_k, loop_i>.
+ 2) From innermost to outermost loop, this pass tries to interchange
+   each loop pair.  For above case, it firstly tries to interchange
+   <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
+   Then it tries to interchange <loop_j, loop_i> and loop nest becomes
+   <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
+   loop to the outermost position.  For loop pair <loop_i, loop_j>
+   to be interchanged, we:
+ 3) Check if data dependence relations are valid for loop interchange.
+ 4) Check if both loops can be interchanged in terms of transformation.
+ 5) Check if interchanging the two loops is profitable.
+ 6) Interchange the two loops by mapping induction variables.
+
+   This pass also handles reductions in loop nest.  So far we only support
+   simple reduction of inner loop and double reduction of the loop nest.  */
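
To make the reduction support concrete, here is a hedged C sketch (mine,
not from the patch) of the two supported shapes: a simple reduction carried
by the inner loop only, and a double reduction carried across the whole
nest.

/* Simple reduction of the inner loop: t accumulates within one execution
   of the j loop only, so interchange has to preserve its per-i semantics.
   Double reduction: total accumulates across the whole nest.  */
double
reductions (double a[8][8], double s[8])
{
  double total = 0.0;
  for (int i = 0; i < 8; i++)
    {
      double t = 0.0;
      for (int j = 0; j < 8; j++)
	t += a[j][i];		/* simple (inner-loop) reduction */
      s[i] = t;
    }
  for (int i = 0; i < 8; i++)
    for (int j = 0; j < 8; j++)
      total += a[j][i];		/* double reduction over the nest */
  return total;
}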

Actually, this pass only does loop shifting, which moves the innermost loop
to an outer position, rather than general permutation.  Also, as a
traditional loop optimizer, it only works for perfect loop nests.  I put it
just after loop distribution so that ideally loop split/distribution can
create perfect nests for it.  Unfortunately, we don't get any perfect nests
from distribution for now because it only works for innermost loops.  For
example, the motivating case in spec2k6/bwaves is not handled by this pass
alone.  I have a patch extending distribution to (innermost) loop nests,
and with that patch the bwaves case can be handled.
Another point is that I deliberately made both the cost model and the code
transformation (very) conservative.  We can support more cases, or more
transformations, with great care once they are known for sure to be
beneficial.  IMHO, we already hit over-baked issues quite often and don't
want to introduce more.
As for code generation, this patch has an issue that invariant code in an
outer loop could be moved into an inner loop.  For the moment, we rely on
the last lim pass to handle all INV code generated during interchange.  In
the future, we may need to avoid that in interchange itself, or add another
lim pass just like the one after the graphite optimizations.

Bootstrap and test on x86_64 and AArch64.  Various benchmarks built and ran
successfully.  Note this pass is disabled in the patch, while the code is
exercised by bootstrapping/building programs with it enabled by default.
Any comments?

Thanks,
bin
2017-08-29  Bin Cheng  <bin.ch...@arm.com>

* Makefile.in (tree-ssa-loop-interchange.o): New object file.
* common.opt (ftree-loop-interchange): New option.
* doc/invoke.texi (-ftree-loop-interchange): Document new option.
* passes.def (pass_linterchange): New pass.
* timevar.def (TV_LINTERCHANGE): New time var.
* tree-pass.h (make_pass_linterchange): New declaration.
* tree-ssa-loop-interchange.cc: New file.
* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.

gcc/testsuite
2017-08-29  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
* gcc.dg/tree-ssa/loop-interchange-10.c: New test.

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 0bde7ac..5002598 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1522,6 +1522,7 @@ OBJS = \
tree-ssa-live.o \
tree-ssa-loop-ch.o \
tree-ssa-loop-im.o \
+   tree-ssa-loop-interchange.o \
tree-ssa-loop-ivcanon.o \
tree-ssa-loop-ivopts.o \
tree-ssa-loop-manip.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 1331008..e7efa09 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2524,6 +2524,10 @@ ftree-loop-distrib

[PATCH PR81913]Skip niter analysis if either IV in exit condition can wrap

2017-08-24 Thread Bin Cheng
Hi,
I added code to handle exit conditions like "IV1 le/lt IV2" by changing
them into "IV1' le/lt INV".  Unfortunately, wrapping behavior has a subtle
impact on the transformation.  For now, this patch skips niter analysis if
either IV1 or IV2 can wrap.  We can still handle the pointer case as
reported in PR81196, but unsigned types need more work.  The patch also
includes two XFAIL tests showing what should be improved here.
Bootstrap and test on AArch64.  Is it OK?
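
As a worked illustration of the wrapping pitfall (my example, independent
of the XFAIL tests below): folding a two-IV exit test into a test of one
combined IV against an invariant is only valid while nothing wraps.

#include <stdio.h>

/* Both i and j are IVs in the exit test.  Modelling "i < j" via the
   combined IV i - j (step 2) assumes neither side wraps; with unsigned
   char IVs that assumption can fail for other start values, so niter
   analysis has to stay conservative.  */
int
main (void)
{
  unsigned char i = 0, j = 10;
  unsigned iters = 0;
  while (i < j)
    {
      i += 1;
      j -= 1;
      iters++;
    }
  printf ("%u\n", iters);	/* 5 for these start values */
  return 0;
}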

Thanks,
bin
2017-08-24  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81913
* tree-ssa-loop-niter.c (number_of_iterations_cond): Skip niter
analysis when either IVs in condition can wrap.

gcc/testsuite
2017-08-24  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81913
* gcc.c-torture/execute/pr81913.c: New test.
* gcc.dg/tree-ssa/loop-niter-1.c: New test.
* gcc.dg/tree-ssa/loop-niter-2.c: New test.

From 58262ff795e2c2f4cff2982dc8c7aecc240d3227 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 23 Aug 2017 10:04:01 +0100
Subject: [PATCH] pr81913-20170817.txt

---
 gcc/testsuite/gcc.c-torture/execute/pr81913.c | 27 +++
 gcc/testsuite/gcc.dg/tree-ssa/loop-niter-1.c  | 31 +++
 gcc/testsuite/gcc.dg/tree-ssa/loop-niter-2.c  | 31 +++
 gcc/tree-ssa-loop-niter.c |  6 --
 4 files changed, 93 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr81913.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/loop-niter-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/loop-niter-2.c

diff --git a/gcc/testsuite/gcc.c-torture/execute/pr81913.c 
b/gcc/testsuite/gcc.c-torture/execute/pr81913.c
new file mode 100644
index 000..11eec4e
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr81913.c
@@ -0,0 +1,27 @@
+/* PR tree-optimization/81913 */
+
+typedef unsigned char u8;
+typedef unsigned int u32;
+
+static u32
+b (u8 d, u32 e, u32 g)
+{
+  do
+{
+  e += g + 1;
+  d--;
+}
+  while (d >= (u8) e);
+
+  return e;
+}
+
+int
+main (void)
+{
+  u32 x = b (1, -0x378704, ~0xba64fc);
+  if (x != 0xd93190d0)
+__builtin_abort ();
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-niter-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/loop-niter-1.c
new file mode 100644
index 000..16c76fe
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-niter-1.c
@@ -0,0 +1,31 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-sccp-details" } */
+
+typedef unsigned char u8;
+typedef unsigned int u32;
+
+static u32
+b (u8 d, u32 e, u32 g)
+{
+  do
+{
+  e += g + 1;
+  d--;
+}
+  while (d >= (u8) e);
+
+  return e;
+}
+
+int
+main (void)
+{
+  u32 x = b (200, -0x378704, ~0xba64fc);
+  if (x != 0xe1ee4ca0)
+__builtin_abort ();
+
+  return 0;
+}
+
+/* Niter analyzer should be able to compute niters for the loop.  */
+/* { dg-final { scan-tree-dump "Replacing uses of: .* with: 3790490784" "sccp" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-niter-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/loop-niter-2.c
new file mode 100644
index 000..2377e6c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-niter-2.c
@@ -0,0 +1,31 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-sccp-details" } */
+
+typedef unsigned char u8;
+typedef unsigned int u32;
+
+static u32
+b (u8 d, u32 e, u32 g)
+{
+  do
+{
+  e += g + 1;
+  d--;
+}
+  while (d >= (u8) e);
+
+  return e;
+}
+
+int
+main (void)
+{
+  u32 x = b (1, -0x378704, ~0xba64fc);
+  if (x != 0xd93190d0)
+__builtin_abort ();
+  return 0;
+}
+
+/* Niter analyzer should be able to compute niters for the loop even though
+   IV:d wraps.  */
+/* { dg-final { scan-tree-dump "Replacing uses of: .* with: 3643904208" "sccp" { xfail *-*-* } } } */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 0d6d101..27244eb 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1728,7 +1728,7 @@ number_of_iterations_cond (struct loop *loop,
  provided that either below condition is satisfied:
 
a) the test is NE_EXPR;
-   b) iv0.step - iv1.step is positive integer.
+   b) iv0.step - iv1.step is integer and iv0/iv1 don't overflow.
 
  This rarely occurs in practice, but it is simple enough to manage.  */
   if (!integer_zerop (iv0->step) && !integer_zerop (iv1->step))
@@ -1739,7 +1739,9 @@ number_of_iterations_cond (struct loop *loop,
 
   /* No need to check sign of the new step since below code takes care
 of this well.  */
-  if (code != NE_EXPR && TREE_CODE (step) != INTEGER_CST)
+  if (code != NE_EXPR
+ && (TREE_CODE (step) != INTEGER_CST
+ || !iv0->no_overflow || !iv1->no_overflow))
return false;
 
   iv0->step = step;
-- 
1.9.1



[GCC RFC]Expensive internal function calls.

2017-08-18 Thread Bin Cheng
Hi,
As a followup to the fix for PR81832, this patch considers internal
function calls to IFN_LOOP_DIST_ALIAS and IFN_LOOP_VECTORIZED as expensive.
Or, interpreted in another way: it returns false since we shouldn't be
asked the question in the first place?  Any comments?
BTW, I have no problem dropping this if it's not appropriate.
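
For readers unfamiliar with these internal functions, a rough C-level
sketch of the versioned-loop shape they guard; the marker function is
illustrative, not the real GIMPLE representation.

/* The internal call is a placeholder that the vectorizer later folds to a
   constant, keeping the two loop versions apart until then.  Treating it
   as inexpensive would invite other passes to duplicate or hoist the
   marker, breaking that contract.  */
extern int loop_vectorized_marker (int scalar_loop_id, int vector_loop_id);

void
versioned_copy (int *a, const int *b, int n)
{
  if (loop_vectorized_marker (1, 2))
    for (int i = 0; i < n; i++)	/* version to be vectorized */
      a[i] = b[i];
  else
    for (int i = 0; i < n; i++)	/* scalar fallback version */
      a[i] = b[i];
}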

Thanks,
bin
2017-08-16  Bin Cheng  <bin.ch...@arm.com>

* gimple.c (gimple_inexpensive_call_p): Consider IFN_LOOP_DIST_ALIAS
and IFN_LOOP_VECTORIZED as expensive calls.

diff --git a/gcc/gimple.c b/gcc/gimple.c
index c4e6f81..6d4e376 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -3038,7 +3038,16 @@ bool
 gimple_inexpensive_call_p (gcall *stmt)
 {
   if (gimple_call_internal_p (stmt))
-return true;
+{
+  /* Some internal function calls are only meant to indicate temporary
+arrangement of optimization and are never used in code generation.
+We always consider these calls expensive.  */
+  if (gimple_call_internal_p (stmt, IFN_LOOP_DIST_ALIAS)
+ || gimple_call_internal_p (stmt, IFN_LOOP_VECTORIZED))
+   return false;
+
+  return true;
+}
   tree decl = gimple_call_fndecl (stmt);
   if (decl && is_inexpensive_builtin (decl))
 return true;


[PATCH PR81832]Skip copying loop header if inner loop is distributed

2017-08-15 Thread Bin Cheng
Hi,
This patch fixes PR81832.  The root cause of the ICE is:
  1) The loop has a distributed inner loop.
  2) The guarding function call IFN_LOOP_DIST_ALIAS happens to be in the
 loop's header.
  3) IFN_LOOP_DIST_ALIAS (in the loop's header) is duplicated by
 pass_ch_vect and thus not eliminated.

Given pass_ch_vect copies the loop header to enable more vectorization, we
should skip the loop in this case because a distributed inner loop means
this loop cannot be vectorized anyway.  One point to mention is that the
name inner_loop_distributed_p is a little misleading.  The name indicates
that each basic block is checked, but the patch only checks the loop's
header for simplicity/efficiency reasons.
Any comments?
Bootstrap and test on x86_64.

Thanks,
bin
2017-08-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81832
* tree-ssa-loop-ch.c (inner_loop_distributed_p): New function.
(pass_ch_vect::process_loop_p): Call above function.

gcc/testsuite
2017-08-15  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81832
* gcc.dg/tree-ssa/pr81832.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81832.c b/gcc/testsuite/gcc.dg/tree-ssa/pr81832.c
new file mode 100644
index 000..893124e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81832.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a, b, *c;
+void d(void)
+{
+int **e;
+for(;;)
+for(int f = 1; f <= 6; f++)
+{
+b = 0;
+if(a)
+g:
+while(a++);
+if (**e);
+else
+{
+*c = a;
+goto g;
+}
+}
+}
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 14cc6d8d..3c217d4 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -143,6 +143,27 @@ should_duplicate_loop_header_p (basic_block header, struct 
loop *loop,
   return true;
 }
 
+/* Return TRUE if LOOP's inner loop is versioned by loop distribution and
+   the guarding internal function call happens to be in LOOP's header.
+   Given loop distribution is placed between pass_ch and pass_ch_vect,
+   this function only returns true in pass_ch_vect.  When it returns TRUE,
+   it's known that copying LOOP's header is meaningless.  */
+
+static bool
+inner_loop_distributed_p (struct loop *loop)
+{
+  gimple *stmt = last_stmt (loop->header);
+  if (stmt == NULL || gimple_code (stmt) != GIMPLE_COND)
+return false;
+
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+  gsi_prev (&gsi);
+  if (gsi_end_p (gsi))
+return false;
+
+  return (gimple_call_internal_p (gsi_stmt (gsi), IFN_LOOP_DIST_ALIAS));
+}
+
 /* Checks whether LOOP is a do-while style loop.  */
 
 static bool
@@ -442,6 +463,9 @@ pass_ch_vect::process_loop_p (struct loop *loop)
   if (loop->dont_vectorize)
 return false;
 
+  if (inner_loop_distributed_p (loop))
+return false;
+
   if (!do_while_loop_p (loop))
 return true;
 


[PATCH PR81799]Fix ICE by forcing to is_gimple_val

2017-08-14 Thread Bin Cheng
Hi,
This patch fixes the ICE reported in PR81799.  It simply uses is_gimple_val
rather than is_gimple_condexpr when gimplifying the versioning condition.
Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin
2017-08-11  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81799
* tree-loop-distribution.c (version_loop_by_alias_check): Force
cond_expr to simple gimple operand.

gcc/testsuite
2017-08-11  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81799
* gcc.dg/tree-ssa/pr81799.c: New.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81799.c b/gcc/testsuite/gcc.dg/tree-ssa/pr81799.c
new file mode 100644
index 000..aad01232
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81799.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int printf (const char *, ...);
+
+int a, c[1], d, e, **f;
+
+void fn1 (int h)
+{
+  int *i = 0;
+  for (d = 0; d < 1; d++)
+{
+  if (d)
+continue;
+  for (; e; e++)
+{
+  a = c[*i];
+  if (h)
+printf ("0");
+}
+  return;
+}
+  f = &i;
+}
+
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 8d80ccc..b1b2934 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2263,7 +2263,7 @@ version_loop_by_alias_check (struct loop *loop, vec<ddr_p> *alias_ddrs)
   compute_alias_check_pairs (loop, alias_ddrs, &comp_alias_pairs);
   create_runtime_alias_checks (loop, &comp_alias_pairs, &cond_expr);
   cond_expr = force_gimple_operand_1 (cond_expr, &cond_stmts,
-				      is_gimple_condexpr, NULL_TREE);
+				      is_gimple_val, NULL_TREE);
 
   /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
   if (flag_tree_loop_vectorize)


[PATCH GCC][02/06]New field in struct dependence_info indicating fixed length access

2017-08-14 Thread Bin Cheng
Hi,
This simple patch adds a new field to struct dependence_info.  The new
field indicates whether the non-dependence information is only valid for a
fixed memory access length of this reference.  There is a concern that this
costs an additional byte for all tree nodes, but I don't know an easy way
out because we need to differentiate dependence_info derived from runtime
alias checks from that derived from restrict pointers.
Bootstrap and test in series.  Any comments?
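
To illustrate the distinction the new field draws, a hedged C sketch
(mine): both sources of non-dependence use the same <clique, base>
encoding, but only one of them survives copying the reference.

/* restrict-derived: holds for the whole function, so copies of the refs
   stay valid.  */
void
f1 (int *restrict p, int *restrict q, int n)
{
  for (int i = 0; i < n; i++)
    p[i] = q[i];
}

/* runtime-check-derived: holds only inside the versioned loop and only for
   the checked access length; fixed_length_p marks this case so copy_node
   can conservatively drop the info.  */
void
f2 (int *p, int *q, int n)
{
  if (p + n <= q || q + n <= p)
    for (int i = 0; i < n; i++)
      p[i] = q[i];
  else
    for (int i = 0; i < n; i++)
      p[i] = q[i];
}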

Thanks,
bin
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* tree-core.h (struct tree_base.dependence_info): New field.
* tree.c (copy_node): Reset dependence info for fixed length
memory access.
* tree.h (MR_DEPENDENCE_FIXED_LENGTH_P): New macro.

From 52073da1294a43723a5ee6a244c561f9b495f5b6 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 13 Jun 2017 15:56:42 +0100
Subject: [PATCH 2/6] fixed-length-dep-info-20170801.txt

---
 gcc/tree-core.h | 10 --
 gcc/tree.c  |  8 
 gcc/tree.h  |  3 +++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index 278d0c9..6200cb5 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -981,14 +981,20 @@ struct GTY(()) tree_base {
 /* Internal function code.  */
 enum internal_fn ifn;
 
-/* The following two fields are used for MEM_REF and TARGET_MEM_REF
+/* The first two fields are used for MEM_REF and TARGET_MEM_REF
expression trees and specify known data non-dependences.  For
two memory references in a function they are known to not
alias if dependence_info.clique are equal and dependence_info.base
-   are distinct.  */
+   are distinct.  The third field is used for marking that data
+   non-dependences info only holds within the fixed access length
+   of this reference.  In other words, we should reset this info
+   whenever the MEM_REF and TARGET_MEM_REF are copied because we
+   don't know if it's used to build data reference accessing out-
+   side of fixed length.  */
 struct {
   unsigned short clique;
   unsigned short base;
+  bool fixed_length_p;
 } dependence_info;
   } GTY((skip(""))) u;
 };
diff --git a/gcc/tree.c b/gcc/tree.c
index c493edd..9c4f248 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -1211,6 +1211,14 @@ copy_node (tree node MEM_STAT_DECL)
 	memcpy (TREE_OPTIMIZATION (t), TREE_OPTIMIZATION (node),
 		sizeof (struct cl_optimization));
   }
+  else if ((code == MEM_REF || code == TARGET_MEM_REF)
+	   && MR_DEPENDENCE_FIXED_LENGTH_P (t))
+{
+  /* Reset dependence information for copying.  */
+  MR_DEPENDENCE_CLIQUE (t) = 0;
+  MR_DEPENDENCE_BASE (t) = 0;
+  MR_DEPENDENCE_FIXED_LENGTH_P (t) = false;
+}
 
   return t;
 }
diff --git a/gcc/tree.h b/gcc/tree.h
index 46debc1..641b7ce 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1211,6 +1211,9 @@ extern void protected_set_expr_location (tree, location_t);
   (TREE_CHECK2 (NODE, MEM_REF, TARGET_MEM_REF)->base.u.dependence_info.clique)
 #define MR_DEPENDENCE_BASE(NODE) \
   (TREE_CHECK2 (NODE, MEM_REF, TARGET_MEM_REF)->base.u.dependence_info.base)
+#define MR_DEPENDENCE_FIXED_LENGTH_P(NODE) \
+  (TREE_CHECK2 (NODE, MEM_REF, \
+		TARGET_MEM_REF)->base.u.dependence_info.fixed_length_p)
 
 /* The operands of a BIND_EXPR.  */
 #define BIND_EXPR_VARS(NODE) (TREE_OPERAND (BIND_EXPR_CHECK (NODE), 0))
-- 
1.9.1



[PATCH GCC][01/06]New interface returning all adjacent vertices in graph

2017-08-14 Thread Bin Cheng
Hi,
This simple patch adds a new interface returning the adjacent vertices of a
vertex in a graph.
Bootstrap and test in series.  Is it OK?
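
A hypothetical caller, to show the intended usage (the consuming loop and
the visit function are illustrative, not part of the patch):

/* Hypothetical helper standing in for pass-specific handling.  */
extern void visit (int);

/* Collect all vertices adjacent to V in G and visit them.  */
auto_vec<int> adj;
adjacent_vertices (g, v, &adj);
for (unsigned i = 0; i < adj.length (); i++)
  visit (adj[i]);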

Thanks,
bin
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* graphds.c (adjacent_vertices): New function.
* graphds.h (adjacent_vertices): New declaration.

From d84e4dd5b840d5f34a619a0f89e502fccf24326f Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 13 Jun 2017 15:51:54 +0100
Subject: [PATCH 1/6] graphds-adjacent_vertices-20170801.txt

---
 gcc/graphds.c | 19 +++
 gcc/graphds.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/gcc/graphds.c b/gcc/graphds.c
index 2951349..5618074 100644
--- a/gcc/graphds.c
+++ b/gcc/graphds.c
@@ -338,6 +338,25 @@ for_each_edge (struct graph *g, graphds_edge_callback callback, void *data)
   callback (g, e, data);
 }
 
+/* Given graph G, record V's adjacent vertices in ADJ.  Do nothing if
+   ADJ is NULL.  */
+
+void
+adjacent_vertices (struct graph *g, int v, vec<int> *adj)
+{
+  struct graph_edge *e;
+
+  if (!adj)
+return;
+
+  e = dfs_fst_edge (g, v, true, NULL, NULL);
+  while (e != NULL)
+{
+  adj->safe_push (e->dest);
+  e = dfs_next_edge (e, true, NULL, NULL);
+}
+}
+
 /* Releases the memory occupied by G.  */
 
 void
diff --git a/gcc/graphds.h b/gcc/graphds.h
index 9f9fc10..86172a2 100644
--- a/gcc/graphds.h
+++ b/gcc/graphds.h
@@ -63,6 +63,7 @@ void graphds_domtree (struct graph *, int, int *, int *, int *);
 typedef void (*graphds_edge_callback) (struct graph *,
    struct graph_edge *, void *);
 void for_each_edge (struct graph *, graphds_edge_callback, void *);
+void adjacent_vertices (struct graph *, int, vec<int> *);
 void free_graph (struct graph *g);
 
 #endif /* GCC_GRAPHDS_H */
-- 
1.9.1



[PATCH GCC][06/06]Record runtime alias info in struct dependence_info and pass it along

2017-08-14 Thread Bin Cheng
Hi,
This is the main patch recording runtime alias check information in struct
dependence_info and passing it along to later optimizers.  It models the
graph of runtime alias checks with some approximation; then sets
<clique, base> on the original data references and records them in a hash
map; at last, it sets the <clique, base> info on the vectorized memory
references.

Given simple data structure (dependence_info) isn't capable of modeling
arbitrary graph, we take approximation as described by comment:

/*
   ...

   Runtime alias checks can be modeled as an arbitrary graph with data
   references as vertices.  In theory we can record the graph and pass
   it to later optimizers in order to enable more transformations.  In
   practice, it's not feasible for high space/time cost in handling
   arbitrary graph.  GCC uses simple data structure in which two fields
   (clique/base) are set for each memory reference.  Two references are
   known to not alias each other if dependence_info.clique are equal and
   dependence_info.base are distinct.  This simple data structure can
   not accurately model graph with diameter bigger than 2.  For example,
   below graph can't be modeled.

	  v0 --- v1 --- v2 --- v3

   The data structure can't model some graph with diameter equal to 2
   if the graph contains clique with more than 2 vertices, as below:

	  v0 --- v1 --- v4
	    \    /
	     \  /
	      v2

   This function only handles (diameter <= 2) graphs.  Fortunately, we
   can reduce arbitrary graph by simply discarding edges to sub-graph
   which can be modeled.  For example, we can discard edge <v0, v2> in
   above graph.  This approximation results in less optimizations but
   correct code.  */

Bootstrap and test in series on x86_64 and AArch64.  Is it OK?
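
To see why the path graph above defeats the encoding, a small
self-contained check (all numbers made up; one clique is assumed, and the
base assignment shown is forced, up to renaming, by the path's edges):

#include <stdio.h>

/* Edges are the proven non-alias pairs of v0 -- v1 -- v2 -- v3.
   "Non-alias" is claimed iff cliques are equal and bases differ, so any
   bases matching the three edges also claim the unproven pair <v0, v3>.  */
int
main (void)
{
  int base[4] = { 1, 2, 1, 2 };
  int edge[4][4] = { { 0 } };
  edge[0][1] = edge[1][2] = edge[2][3] = 1;

  for (int i = 0; i < 4; i++)
    for (int j = i + 1; j < 4; j++)
      if (base[i] != base[j] && !edge[i][j])
	printf ("bogus non-alias claim for <v%d, v%d>\n", i, j);
  return 0;
}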

Thanks,
bin
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* tree-vectorizer.h (struct rt_alias_clique): New.
(free_rt_alias_clique): New declaration.
(struct _loop_vec_info): New member rt_alias_clique_map.
(LOOP_VINFO_RT_ALIAS_CLIQUE_MAP): New macro.
* tree-vect-loop-manip.c (struct edge_callback_data): New.
(free_rt_alias_clique, alias_graphd_edge_callback): New function.
(data_ref_with_dep_info_p): New function.
(record_runtime_alias_for_data_refs): New function.
(vect_create_cond_for_alias_checks): Call above function.
(_loop_vec_info::_loop_vec_info): Initialize rt_alias_clique_map.
(_loop_vec_info::~_loop_vec_info): Release rt_alias_clique_map.
* tree-vect-stmts.c (set_runtime_alias_dependence_info): New func.
(vectorizable_store, vectorizable_load): Call above function.

gcc/testsuite
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/vect/vect-rt-alias-info.c: New test.

From 147dc1eb8a835a70411ac8e11b6c9bc63695c431 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 9 Aug 2017 11:47:52 +0100
Subject: [PATCH 6/6] vect-rt-alias-dep-info-20170801.txt

---
 gcc/testsuite/gcc.dg/vect/vect-rt-alias-info.c |  11 ++
 gcc/tree-vect-loop-manip.c | 210 +
 gcc/tree-vect-loop.c   |   8 +
 gcc/tree-vect-stmts.c  |  32 
 gcc/tree-vectorizer.h  |  12 ++
 5 files changed, 273 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-rt-alias-info.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-rt-alias-info.c b/gcc/testsuite/gcc.dg/vect/vect-rt-alias-info.c
new file mode 100644
index 000..bb20e786
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-rt-alias-info.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-vect-alias -fdump-tree-loopdone-alias" } */
+
+void foo (int *a, int *c, int *d)
+{
+  for (int i = 0; i < 1024; ++i)
+a[i] = c[i] + d[i];
+}
+/* { dg-final { scan-tree-dump "clique . base . fixed" "vect" } } */
+/* { dg-final { scan-tree-dump-not "clique . base . fixed" "loopdone" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index f78e4b4..4518069 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2091,6 +2091,215 @@ vect_create_cond_for_unequal_addrs (loop_vec_info loop_vinfo, tree *cond_expr)
 }
 }
 
+/* Hash map callback function freeing rt_alias_clique.  */
+
+bool
+free_rt_alias_clique (data_reference_p const &,
+		      rt_alias_clique_p const &clique, void *)
+{
+  free (clique);
+  return true;
+}
+
+/* Private data for callback function traversing graph edges.  */
+
+struct edge_callback_data
+{
+  vec *datarefs;
+  hash_map<data_reference_p, struct rt_alias_clique *> *clique_map;
+  bool valid_p;
+};
+
+/* Callback function for traversing graph edges.  */
+
+static void
+alias_graphd_edge_callback (struct graph *, struct graph_edge 

[PATCH GCC][03/06]Dump dependence information

2017-08-14 Thread Bin Cheng
Hi,
This simple patch adds code dumping struct dependence_info.
Bootstrap and test in series.  Is it OK?

Thanks,
bin
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* tree-pretty-print.c (dump_generic_node): Dump fixed length
tag in MEM_REF.  Dump dependence info in TARGET_MEM_REF.

From 60c58afac71860bdef2bbc7ca579642172865e7f Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 9 Aug 2017 15:38:18 +0100
Subject: [PATCH 3/6] dump-dependence_info-20170801.txt

---
 gcc/tree-pretty-print.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/gcc/tree-pretty-print.c b/gcc/tree-pretty-print.c
index 4d8177c..7457334 100644
--- a/gcc/tree-pretty-print.c
+++ b/gcc/tree-pretty-print.c
@@ -1551,6 +1551,9 @@ dump_generic_node (pretty_printer *pp, tree node, int spc, dump_flags_t flags,
 		pp_unsigned_wide_integer (pp, MR_DEPENDENCE_CLIQUE (node));
 		pp_string (pp, " base ");
 		pp_unsigned_wide_integer (pp, MR_DEPENDENCE_BASE (node));
+
+		if (MR_DEPENDENCE_FIXED_LENGTH_P (node))
+		  pp_string (pp, " fixed");
 	  }
 	pp_right_bracket (pp);
 	  }
@@ -1611,6 +1614,17 @@ dump_generic_node (pretty_printer *pp, tree node, int spc, dump_flags_t flags,
 	pp_string (pp, "offset: ");
 	dump_generic_node (pp, tmp, spc, flags, false);
 	  }
+	if ((flags & TDF_ALIAS)
+	&& MR_DEPENDENCE_CLIQUE (node) != 0)
+	  {
+	pp_string (pp, " clique ");
+	pp_unsigned_wide_integer (pp, MR_DEPENDENCE_CLIQUE (node));
+	pp_string (pp, " base ");
+	pp_unsigned_wide_integer (pp, MR_DEPENDENCE_BASE (node));
+
+	if (MR_DEPENDENCE_FIXED_LENGTH_P (node))
+	  pp_string (pp, " fixed");
+	  }
 	pp_right_bracket (pp);
   }
   break;
-- 
1.9.1



[PATCH GCC][05/06]An interface clear all dependence_info with fixed access length tag

2017-08-14 Thread Bin Cheng
Hi,
Given that tree nodes (and thus struct dependence_info) are kept and
shadow-copied on RTL, it is unsafe to pass non-dependence info to RTL in
case of loop unrolling etc.  This patch adds an interface clearing all
dependence_info tagged with fixed access length before entering the RTL
world.  We could do it just before expanding, but for now it is done in
loopdone given that predcom is the only motivating pass that I know of.
Bootstrap and test in series.  Is it OK?
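
A hedged sketch of the hazard (my example): the runtime check covers
exactly the range the GIMPLE loops touch, and nothing is proven about
accesses an RTL transformation might synthesize beyond it.

/* The alias check covers a[0..n) and b[0..n) only, so tagging the refs in
   the guarded loop with a fixed-length non-dependence is sound.  An
   RTL-level copy of such a MEM that ended up addressing outside that range
   would inherit a non-dependence that was never checked, hence the tagged
   info is cleared before leaving GIMPLE.  */
void
scale (float *a, const float *b, int n)
{
  if (a + n <= b || b + n <= a)	/* runtime alias check over n elements */
    for (int i = 0; i < n; i++)
      a[i] = 2.0f * b[i];	/* refs provably independent here */
  else
    for (int i = 0; i < n; i++)
      a[i] = 2.0f * b[i];	/* fallback: may alias */
}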

Thanks,
bin
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-address.c (clear_dependence_info): New function.
(clear_all_dependence_info): New function.
* tree-ssa-address.h (clear_all_dependence_info): New declaration.
* tree-ssa-loop.c: Include tree-ssa-address.h.
(tree_ssa_loop_done): Call clear_all_dependence_info.

From b54fa9745b3bd84cb44baf9dbee2379fa9e28362 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 9 Aug 2017 15:40:38 +0100
Subject: [PATCH 5/6] clear-fixed-dep_info-20170801.txt

---
 gcc/tree-ssa-address.c | 40 
 gcc/tree-ssa-address.h |  1 +
 gcc/tree-ssa-loop.c|  2 ++
 3 files changed, 43 insertions(+)

diff --git a/gcc/tree-ssa-address.c b/gcc/tree-ssa-address.c
index aea1730..33cfab1 100644
--- a/gcc/tree-ssa-address.c
+++ b/gcc/tree-ssa-address.c
@@ -975,6 +975,46 @@ copy_dependence_info (tree to, tree from)
   MR_DEPENDENCE_FIXED_LENGTH_P (to) = MR_DEPENDENCE_FIXED_LENGTH_P (from);
 }
 
+/* Clear dependence information in REF if it's for fixed access length.  */
+
+static inline void
+clear_dependence_info (tree ref)
+{
+  if ((TREE_CODE (ref) != MEM_REF && TREE_CODE (ref) != TARGET_MEM_REF)
+  || !MR_DEPENDENCE_FIXED_LENGTH_P (ref))
+return;
+
+  MR_DEPENDENCE_CLIQUE (ref) = 0;
+  MR_DEPENDENCE_BASE (ref) = 0;
+  MR_DEPENDENCE_FIXED_LENGTH_P (ref) = false;
+}
+
+/* Clear all dependence information which is for fixed access length.  */
+
+void
+clear_all_dependence_info ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB_FN (bb, cfun)
+{
+  for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
+	   !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (!is_gimple_assign (stmt))
+	continue;
+
+	  enum tree_code code = gimple_assign_rhs_code (stmt);
+	  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+	continue;
+
+	  clear_dependence_info (gimple_assign_lhs (stmt));
+	  clear_dependence_info (gimple_assign_rhs1 (stmt));
+	}
+}
+}
+
 /* Copies the reference information from OLD_REF to NEW_REF, where
NEW_REF should be either a MEM_REF or a TARGET_MEM_REF.  */
 
diff --git a/gcc/tree-ssa-address.h b/gcc/tree-ssa-address.h
index ebba5ad..71d24a9 100644
--- a/gcc/tree-ssa-address.h
+++ b/gcc/tree-ssa-address.h
@@ -37,6 +37,7 @@ extern void move_fixed_address_to_symbol (struct mem_address *,
 tree create_mem_ref (gimple_stmt_iterator *, tree,
 		 struct aff_tree *, tree, tree, tree, bool);
 extern void copy_dependence_info (tree, tree);
+extern void clear_all_dependence_info ();
 extern void copy_ref_info (tree, tree);
 tree maybe_fold_tmr (tree);
 
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 1e84917..0de00eb 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-loop-manip.h"
 #include "tree-ssa-loop-niter.h"
 #include "tree-ssa-loop.h"
+#include "tree-ssa-address.h"
 #include "cfgloop.h"
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
@@ -517,6 +518,7 @@ tree_ssa_loop_done (void)
   free_numbers_of_iterations_estimates (cfun);
   scev_finalize ();
   loop_optimizer_finalize ();
+  clear_all_dependence_info ();
   return 0;
 }
 
-- 
1.9.1



[PATCH GCC][04/06]Add copying interface for dependence_info

2017-08-14 Thread Bin Cheng
Hi,
This patch adds a copying interface for dependence_info.  The methodology
is that we don't copy such information by default; this interface should be
called explicitly when it is safe and necessary to do so, as this patch
does in ivopts.
Bootstrap and test in series.  Is it OK?
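
A hypothetical call site mirroring the ivopts hunk (new_ref/old_ref are
illustrative names): the new reference must denote exactly the same access
as the old one for the copy to be safe.

/* NEW_REF is the MEM_REF/TARGET_MEM_REF built for the same access OLD_REF
   denoted; only then may <clique, base> be carried over.  */
copy_ref_info (new_ref, old_ref);		/* existing ref-info copy */
copy_dependence_info (new_ref, old_ref);	/* now also dependence info */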

Thanks,
bin
2017-08-10  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-address.c (copy_dependence_info): New function.
* tree-ssa-address.h (copy_dependence_info): New declaration.
* tree-ssa-loop-ivopts.c (rewrite_use_address): Call above func.

From 3cf0275d0db7d3e240bc7a010c6de68f15f46ce7 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 13 Jun 2017 15:57:24 +0100
Subject: [PATCH 4/6] copy-dep-fino-20170801.txt

---
 gcc/tree-ssa-address.c | 17 +
 gcc/tree-ssa-address.h |  1 +
 gcc/tree-ssa-loop-ivopts.c |  3 +++
 3 files changed, 21 insertions(+)

diff --git a/gcc/tree-ssa-address.c b/gcc/tree-ssa-address.c
index 8257fde..aea1730 100644
--- a/gcc/tree-ssa-address.c
+++ b/gcc/tree-ssa-address.c
@@ -958,6 +958,23 @@ get_address_description (tree op, struct mem_address *addr)
   addr->offset = TMR_OFFSET (op);
 }
 
+/* Copy data non-dependences info from FROM to TO which both are MEM_REF or
+   TARGET_MEM_REF.  */
+
+void
+copy_dependence_info (tree to, tree from)
+{
+  if ((TREE_CODE (from) != MEM_REF && TREE_CODE (from) != TARGET_MEM_REF)
+  || MR_DEPENDENCE_CLIQUE (from) == 0)
+return;
+
+  gcc_assert (to != NULL_TREE);
+  gcc_assert (TREE_CODE (to) == MEM_REF || TREE_CODE (to) == TARGET_MEM_REF);
+  MR_DEPENDENCE_CLIQUE (to) = MR_DEPENDENCE_CLIQUE (from);
+  MR_DEPENDENCE_BASE (to) = MR_DEPENDENCE_BASE (from);
+  MR_DEPENDENCE_FIXED_LENGTH_P (to) = MR_DEPENDENCE_FIXED_LENGTH_P (from);
+}
+
 /* Copies the reference information from OLD_REF to NEW_REF, where
NEW_REF should be either a MEM_REF or a TARGET_MEM_REF.  */
 
diff --git a/gcc/tree-ssa-address.h b/gcc/tree-ssa-address.h
index cd62ed9..ebba5ad 100644
--- a/gcc/tree-ssa-address.h
+++ b/gcc/tree-ssa-address.h
@@ -36,6 +36,7 @@ extern void move_fixed_address_to_symbol (struct mem_address *,
 	  struct aff_tree *);
 tree create_mem_ref (gimple_stmt_iterator *, tree,
 		 struct aff_tree *, tree, tree, tree, bool);
+extern void copy_dependence_info (tree, tree);
 extern void copy_ref_info (tree, tree);
 tree maybe_fold_tmr (tree);
 
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index b65cd96..6b1efc1 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -7023,6 +7023,9 @@ rewrite_use_address (struct ivopts_data *data,
 			 iv, base_hint, data->speed);
 
   copy_ref_info (ref, *use->op_p);
+  /* Copy dependence information from the original reference.  */
+  copy_dependence_info (ref, *use->op_p);
+
   *use->op_p = ref;
 }
 
-- 
1.9.1



[PATCH GCC][OBVIOUS]Handle boundary case for last iv candidate

2017-08-08 Thread Bin Cheng
Hi,
While investigating issues, I ran into this obvious problem: the last
candidate is not related to compare-type iv_uses.  This patch fixes it.
Will apply later.

Thanks,
bin
2017-08-08  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-loop-ivopts.c (relate_compare_use_with_all_cands): Handle
boundary case for the last candidate.

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1cbff04..b65cd96 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5284,13 +5284,13 @@ set_autoinc_for_original_candidates (struct ivopts_data 
*data)
 static void
 relate_compare_use_with_all_cands (struct ivopts_data *data)
 {
-  unsigned i, max_id = data->vcands.length () - 1;
+  unsigned i, count = data->vcands.length ();
   for (i = 0; i < data->vgroups.length (); i++)
 {
   struct iv_group *group = data->vgroups[i];
 
   if (group->type == USE_COMPARE)
-   bitmap_set_range (group->related_cands, 0, max_id);
+   bitmap_set_range (group->related_cands, 0, count);
 }
 }
 


[PATCH PR81744]Fix ICE by deep copying expression of loop's number of iterations

2017-08-08 Thread Bin Cheng
Hi,
This is an obvious patch.  It fixes the ICE in PR81744 by deep copying the
expression for the loop's number of iterations.
Test results checked.  Is it OK?

Thanks,
bin
2017-08-07  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81744
* tree-predcom.c (prepare_finalizers_chain): Deep copy expr of
loop's number of iterations.

gcc/testsuite/ChangeLog
2017-08-07  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81744
* gcc.dg/tree-ssa/pr81744.c: New.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81744.c b/gcc/testsuite/gcc.dg/tree-ssa/pr81744.c
new file mode 100644
index 000..b0f5d38f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81744.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-tree-loop-vectorize -fno-tree-slp-vectorize -fno-inline -fdump-tree-pcom-details" } */
+
+typedef struct {
+  int a, b;
+} CompandSegment;
+int a;
+CompandSegment *b;
+void fn1() {
+  for (; a; a++)
+b[a].a = b[a].b = b[a - 1].a = b[a - 1].b = 0;
+}
+/* { dg-final { scan-tree-dump-times "Store-stores chain" 2 "pcom"} } */
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 4538773..e7b10cb 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -2940,7 +2940,7 @@ prepare_finalizers_chain (struct loop *loop, chain_p chain)
 
   if (TREE_CODE (niters) != INTEGER_CST && TREE_CODE (niters) != SSA_NAME)
{
- niters = copy_node (niters);
+ niters = unshare_expr (niters);
	  niters = force_gimple_operand (niters, &stmts, true, NULL);
  if (stmts)
{


[PATCH PR81627]Rewrite into loop closed ssa form in case of any store-store chain

2017-07-31 Thread Bin Cheng
Hi,
This simple patch fixes the ICE by rewriting into loop closed ssa form in
case of any store-store chain.  We may be able to avoid that for some cases
where the eliminated stores only store loop invariant values into memory,
but only with more checks when inserting the final store instructions.
Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK?

Thanks,
bin
2017-07-31  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81627
* tree-predcom.c (prepare_finalizers): Always rewrite into loop
closed ssa form for store-store chain.

gcc/testsuite/ChangeLog
2017-07-31  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81627
* gcc.dg/tree-ssa/pr81627.c: New.

From d366015187de926a8fe3248325b229bed99b27b5 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 31 Jul 2017 11:16:44 +0100
Subject: [PATCH 2/2] pr81627-20170731.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr81627.c | 28 
 gcc/tree-predcom.c  | 10 +-
 2 files changed, 33 insertions(+), 5 deletions(-)
 create mode 100755 gcc/testsuite/gcc.dg/tree-ssa/pr81627.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81627.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr81627.c
new file mode 100755
index 000..7421c49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81627.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-tree-loop-vectorize -fdump-tree-pcom-details" } */
+
+int a, b, c, d[6], e = 3, f;
+
+void abort (void);
+void fn1 ()
+{
+  for (b = 1; b < 5; b++)
+{
+  for (c = 0; c < 5; c++)
+d[b] = e;
+  if (a)
+f++;
+  d[b + 1] = 1;
+}
+}
+
+int main ()
+{
+  fn1 ();
+  if (d[0] != 0 || d[1] != 3 || d[2] != 3
+  || d[3] != 3 || d[4] != 3 || d[5] != 1)
+abort ();
+
+  return 0;
+}
+/* { dg-final { scan-tree-dump-times "Store-stores chain" 1 "pcom" } } */
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index f7a57a4..4538773 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -2983,11 +2983,11 @@ prepare_finalizers (struct loop *loop, vec<chain_p> chains)
   if (prepare_finalizers_chain (loop, chain))
{
  i++;
- /* We don't corrupt loop closed ssa form for store elimination
-chain if eliminated stores only store loop invariant values
-into memory.  */
- if (!chain->inv_store_elimination)
-   loop_closed_ssa |= (!chain->inv_store_elimination);
+ /* Be conservative, assume loop closed ssa form is corrupted
+by store-store chain.  Though it's not always the case if
+eliminated stores only store loop invariant values into
+memory.  */
+ loop_closed_ssa = true;
}
   else
{
-- 
1.9.1



[PATCH PR81620]Don't set has_max_use_after flag for store-store chain

2017-07-31 Thread Bin Cheng
Hi,
This simple patch fixes the ICE by not setting the has_max_use_after flag
for store-store chains, because there is no use at all.
Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK if there are no
failures?

Thanks,
bin
2017-07-31  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81620
* tree-predcom.c (add_ref_to_chain): Don't set has_max_use_after
for store-store chain.

gcc/testsuite/ChangeLog
2017-07-31  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81620
* gcc.dg/tree-ssa/pr81620-1.c: New.
* gcc.dg/tree-ssa/pr81620-2.c: New.

From 4e8f67bb1cc09ef475f9cfbb8e847f9f422c3e44 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 31 Jul 2017 10:24:07 +0100
Subject: [PATCH 1/2] pr81620-20170731.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr81620-1.c | 20 
 gcc/testsuite/gcc.dg/tree-ssa/pr81620-2.c | 25 +
 gcc/tree-predcom.c|  4 +++-
 3 files changed, 48 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr81620-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr81620-2.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81620-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr81620-1.c
new file mode 100644
index 000..f8f2dd8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81620-1.c
@@ -0,0 +1,20 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-tree-loop-vectorize -fdump-tree-pcom-details" } */
+
+int a[7];
+char b;
+void abort (void);
+
+int main() {
+  b = 4;
+  for (; b; b--) {
+a[b] = b;
+a[b + 2] = 1;
+  }
+  if (a[0] != 0 || a[1] != 1 || a[2] != 2
+  || a[3] != 1 || a[4] != 1 || a[5] != 1 || a[6] != 1)
+abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Store-stores chain" 1 "pcom" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81620-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr81620-2.c
new file mode 100644
index 000..85a8e35
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81620-2.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-tree-loop-vectorize -fdump-tree-pcom-details" } */
+
+int a[200];
+char b;
+void abort (void);
+
+int main() {
+  int i;
+  b = 100;
+  for (; b; b--) {
+a[b] = 2;
+a[b + 2] = 1;
+  }
+
+  if (a[0] != 0 || a[1] != 2 || a[2] != 2)
+abort ();
+  for (i = 3; i < 103; i++)
+if (a[i] != 1)
+abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Store-stores chain" 1 "pcom" } } */
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index a4011bf..f7a57a4 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1069,7 +1069,9 @@ add_ref_to_chain (chain_p chain, dref ref)
   chain->has_max_use_after = false;
 }
 
-  if (ref->distance == chain->length
+  /* Don't set the flag for store-store chain since there is no use.  */
+  if (chain->type != CT_STORE_STORE
+  && ref->distance == chain->length
   && ref->pos > root->pos)
 chain->has_max_use_after = true;
 
-- 
1.9.1



[PATCH PR81228]Fixes ICE by adding LTGT in vec_cmp.

2017-07-28 Thread Bin Cheng
Hi,
This simple patch fixes the ICE by adding LTGT handling to the
vec_cmp<mode><v_cmp_result> pattern.  I also modified the original test
case into a compilation one, since -fno-trapping-math should not be used in
general.
Bootstrap and test on AArch64, test results checked for x86_64.  Is it OK?
I would also need to backport it to gcc-7-branch.

Thanks,
bin
2017-07-27  Bin Cheng  <bin.ch...@arm.com>

PR target/81228
* config/aarch64/aarch64-simd.md (vec_cmp<mode><v_cmp_result>): Add
LTGT.

gcc/testsuite/ChangeLog
2017-07-27  Bin Cheng  <bin.ch...@arm.com>

PR target/81228
* gcc.dg/pr81228.c: New.

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 011fcec0..9cd67a2 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2524,6 +2524,7 @@
 case EQ:
       comparison = gen_aarch64_cmeq<mode>;
   break;
+case LTGT:
 case UNEQ:
 case ORDERED:
 case UNORDERED:
@@ -2571,6 +2572,7 @@
   emit_insn (comparison (operands[0], operands[2], operands[3]));
   break;
 
+case LTGT:
 case UNEQ:
   /* We first check (a > b ||  b > a) which is !UNEQ, inverting
 this result will then give us (a == b || a UNORDERED b).  */
@@ -2578,7 +2580,8 @@
 operands[2], operands[3]));
   emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
   emit_insn (gen_ior3 (operands[0], operands[0], tmp));
-  emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
+  if (code == UNEQ)
+   emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
   break;
 
 case UNORDERED:
diff --git a/gcc/testsuite/gcc.dg/pr81228.c b/gcc/testsuite/gcc.dg/pr81228.c
new file mode 100644
index 000..3334299
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr81228.c
@@ -0,0 +1,47 @@
+/* PR target/81228 */
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-trapping-math" } */
+/* { dg-options "-O3 -fno-trapping-math -mavx" { target avx_runtime } } */
+
+double s1[4], s2[4], s3[64];
+
+int
+main (void)
+{
+  int i;
+  asm volatile ("" : : : "memory");
+  for (i = 0; i < 4; i++)
+s3[0 * 4 + i] = __builtin_isgreater (s1[i], s2[i]) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[1 * 4 + i] = (!__builtin_isgreater (s1[i], s2[i])) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[2 * 4 + i] = __builtin_isgreaterequal (s1[i], s2[i]) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[3 * 4 + i] = (!__builtin_isgreaterequal (s1[i], s2[i])) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[4 * 4 + i] = __builtin_isless (s1[i], s2[i]) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[5 * 4 + i] = (!__builtin_isless (s1[i], s2[i])) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[6 * 4 + i] = __builtin_islessequal (s1[i], s2[i]) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[7 * 4 + i] = (!__builtin_islessequal (s1[i], s2[i])) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[8 * 4 + i] = __builtin_islessgreater (s1[i], s2[i]) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[9 * 4 + i] = (!__builtin_islessgreater (s1[i], s2[i])) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[10 * 4 + i] = __builtin_isunordered (s1[i], s2[i]) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[11 * 4 + i] = (!__builtin_isunordered (s1[i], s2[i])) ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[12 * 4 + i] = s1[i] > s2[i] ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[13 * 4 + i] = s1[i] >= s2[i] ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[14 * 4 + i] = s1[i] < s2[i] ? -1.0 : 0.0;
+  for (i = 0; i < 4; i++)
+s3[15 * 4 + i] = s1[i] <= s2[i] ? -1.0 : 0.0;
+  asm volatile ("" : : : "memory");
+  return 0;
+}


[PATCH TEST]Require vect_perm in gcc.dg/vect/pr80815-3.c

2017-07-24 Thread Bin Cheng
Hi,
The test has negative step in memory access, thus can't be vectorized on
target like sparc-sun-solaris2.12.  This patch adds vect_perm requirement
for it.  Test result checked.  Is it OK?

Thanks,
bin
gcc/testsuite/ChangeLog
2017-07-20  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/vect/pr80815-3.c: Require vect_perm.

diff --git a/gcc/testsuite/gcc.dg/vect/pr80815-3.c b/gcc/testsuite/gcc.dg/vect/pr80815-3.c
index dae01fa..50392ab 100644
--- a/gcc/testsuite/gcc.dg/vect/pr80815-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr80815-3.c
@@ -42,4 +42,4 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "improved number of alias checks from \[0-9\]* to 1" "vect" } } */
+/* { dg-final { scan-tree-dump "improved number of alias checks from \[0-9\]* to 1" "vect" { target vect_perm } } } */


[PATCH GCC]Make pointer overflow always undefined and remove the macro

2017-07-24 Thread Bin Cheng
Hi,
This is a followup patch to the fix for PR81388.  According to Richi,
POINTER_TYPE_OVERFLOW_UNDEFINED was added during the -fstrict-overflow
warning work.  Given:
  A) strict-overflow was removed;
  B) memory objects cannot wrap in the address space;
  C) existing code doesn't take it into consideration, as in nowrap_type_p;
this patch makes it always true and thus removes the definition and usage
of the macro.
Bootstrap and test on x86_64 and AArch64.  Is it OK?
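
Two folds this enables unconditionally, as a sketch; the first mirrors the
no-strict-overflow-7.c hunk below.

/* With pointer overflow always undefined, a pointer plus a positive offset
   can be assumed not to wrap.  */
int
beyond (char *p)
{
  return p + 1000 < p;		/* may now fold to 0 unconditionally */
}

int
cmp_offsets (char *p, long i, long j)
{
  return p + i < p + j;		/* may fold to i < j */
}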

Thanks,
bin
2017-07-20  Bin Cheng  <bin.ch...@arm.com>

* tree.h (POINTER_TYPE_OVERFLOW_UNDEFINED): Delete.
* fold-const.c (fold_comparison, fold_binary_loc): Delete use of
above macro.
* match.pd: Ditto in address comparison pattern.

gcc/testsuite/ChangeLog
2017-07-20  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/no-strict-overflow-7.c: Revise comment and test string.
* gcc.dg/tree-ssa/pr81388-1.c: Ditto.

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 1bcbbb5..78bb326 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -8505,14 +8505,9 @@ fold_comparison (location_t loc, enum tree_code code, 
tree type,
{
  /* We can fold this expression to a constant if the non-constant
 offset parts are equal.  */
- if ((offset0 == offset1
-  || (offset0 && offset1
-  && operand_equal_p (offset0, offset1, 0)))
- && (equality_code
- || (indirect_base0
- && (DECL_P (base0) || CONSTANT_CLASS_P (base0)))
- || POINTER_TYPE_OVERFLOW_UNDEFINED))
-
+ if (offset0 == offset1
+ || (offset0 && offset1
+ && operand_equal_p (offset0, offset1, 0)))
{
  if (!equality_code
  && bitpos0 != bitpos1
@@ -8547,11 +8542,7 @@ fold_comparison (location_t loc, enum tree_code code, 
tree type,
 because pointer arithmetic is restricted to retain within an
 object and overflow on pointer differences is undefined as of
 6.5.6/8 and /9 with respect to the signed ptrdiff_t.  */
- else if (bitpos0 == bitpos1
-  && (equality_code
-  || (indirect_base0
-  && (DECL_P (base0) || CONSTANT_CLASS_P (base0)))
-  || POINTER_TYPE_OVERFLOW_UNDEFINED))
+ else if (bitpos0 == bitpos1)
{
  /* By converting to signed sizetype we cover middle-end pointer
 arithmetic which operates on unsigned pointer types of size
@@ -9651,7 +9642,7 @@ fold_binary_loc (location_t loc,
 
  /* With undefined overflow prefer doing association in a type
 which wraps on overflow, if that is one of the operand types.  */
- if ((POINTER_TYPE_P (type) && POINTER_TYPE_OVERFLOW_UNDEFINED)
+ if (POINTER_TYPE_P (type)
  || (INTEGRAL_TYPE_P (type) && !TYPE_OVERFLOW_WRAPS (type)))
{
  if (INTEGRAL_TYPE_P (TREE_TYPE (arg0))
@@ -9665,7 +9656,7 @@ fold_binary_loc (location_t loc,
 
  /* With undefined overflow we can only associate constants with one
 variable, and constants whose association doesn't overflow.  */
- if ((POINTER_TYPE_P (atype) && POINTER_TYPE_OVERFLOW_UNDEFINED)
+ if (POINTER_TYPE_P (atype)
  || (INTEGRAL_TYPE_P (atype) && !TYPE_OVERFLOW_WRAPS (atype)))
{
  if (var0 && var1)
diff --git a/gcc/match.pd b/gcc/match.pd
index 979085a..b89aed3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3129,14 +3129,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
|| TREE_CODE (base1) == STRING_CST))
  equal = (base0 == base1);
  }
- (if (equal == 1
- && (cmp == EQ_EXPR || cmp == NE_EXPR
- /* If the offsets are equal we can ignore overflow.  */
- || off0 == off1
- || POINTER_TYPE_OVERFLOW_UNDEFINED
- /* Or if we compare using pointers to decls or strings.  */
- || (POINTER_TYPE_P (TREE_TYPE (@2))
- && (DECL_P (base0) || TREE_CODE (base0) == STRING_CST
+ (if (equal == 1)
   (switch
(if (cmp == EQ_EXPR)
{ constant_boolean_node (off0 == off1, type); })
diff --git a/gcc/testsuite/gcc.dg/no-strict-overflow-7.c 
b/gcc/testsuite/gcc.dg/no-strict-overflow-7.c
index 19e1b55..0e73d48 100644
--- a/gcc/testsuite/gcc.dg/no-strict-overflow-7.c
+++ b/gcc/testsuite/gcc.dg/no-strict-overflow-7.c
@@ -3,8 +3,8 @@
 
 /* Source: Ian Lance Taylor.  Dual of strict-overflow-6.c.  */
 
-/* We can only simplify the conditional when using strict overflow
-   semantics.  */
+/* We can simplify the conditional because pointer overflow always has
+   undefined semantics.  */
 
 int
 foo (char* p)
@@ -12,4 +12,4 @@ foo (char* p)
   return p + 1000 < p;
 }

[PATCH PR81388]Revert change in revision 238585

2017-07-20 Thread Bin Cheng
Hi,
I removed the computation of may_be_zero in revision 238585 by assuming
that "pointer + 2 < pointer" can be folded.  This is false when the pointer
could overflow, as well as for unsigned types (I don't know why this hasn't
been exposed for a long time in the unsigned case).  As for the issue
itself, any fix would require may_be_zero to be computed, which basically
means reverting the patch.
Bootstrap and test on x86_64 and AArch64, is it OK?
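
To spell out why the fold is invalid for unsigned types, a tiny standalone
illustration (mine, not from the patch):

#include <stdio.h>

/* "x + 2 < x" is not identically false for unsigned x: it is true exactly
   when x + 2 wraps, which is what may_be_zero has to account for.  */
int
main (void)
{
  unsigned x = 0xFFFFFFFFu;
  printf ("%d\n", x + 2 < x);	/* prints 1, not the "folded" 0 */
  return 0;
}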

Thanks,
bin
2017-07-20  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81388
Revert r238585:
        2016-07-21  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-loop-niter.c (number_of_iterations_lt_to_ne): Clean up
by removing computation of may_be_zero.

gcc/testsuite/ChangeLog
2017-07-20  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81388
* gcc.dg/tree-ssa/pr81388-1.c: New test.
* gcc.dg/tree-ssa/pr81388-2.c: New test.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81388-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr81388-1.c
new file mode 100644
index 000..ecfe129
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81388-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-strict-overflow -fdump-tree-ivcanon-details" } */
+
+void bar();
+void foo(char *dst)
+{
+  char *const end = dst;
+  do {
+bar();
+dst += 2;
+  } while (dst < end);
+}
+
+/* { dg-final { scan-tree-dump-times " zero if " 1 "ivcanon" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81388-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr81388-2.c
new file mode 100644
index 000..71fd289
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81388-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivcanon-details" } */
+
+void bar();
+void foo(unsigned dst)
+{
+  unsigned end = dst;
+  do {
+bar();
+dst += 2;
+  } while (dst < end);
+}
+
+/* { dg-final { scan-tree-dump-times " zero if " 1 "ivcanon" } } */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 5a7cab5..a872f5f 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1142,8 +1142,12 @@ number_of_iterations_lt_to_ne (tree type, affine_iv 
*iv0, affine_iv *iv1,
   tree niter_type = TREE_TYPE (step);
   tree mod = fold_build2 (FLOOR_MOD_EXPR, niter_type, *delta, step);
   tree tmod;
-  tree assumption = boolean_true_node, bound;
-  tree type1 = (POINTER_TYPE_P (type)) ? sizetype : type;
+  mpz_t mmod;
+  tree assumption = boolean_true_node, bound, noloop;
+  bool ret = false, fv_comp_no_overflow;
+  tree type1 = type;
+  if (POINTER_TYPE_P (type))
+type1 = sizetype;
 
   if (TREE_CODE (mod) != INTEGER_CST)
 return false;
@@ -1151,51 +1155,96 @@ number_of_iterations_lt_to_ne (tree type, affine_iv 
*iv0, affine_iv *iv1,
 mod = fold_build2 (MINUS_EXPR, niter_type, step, mod);
   tmod = fold_convert (type1, mod);
 
+  mpz_init (mmod);
+  wi::to_mpz (mod, mmod, UNSIGNED);
+  mpz_neg (mmod, mmod);
+
   /* If the induction variable does not overflow and the exit is taken,
- then the computation of the final value does not overflow.  There
- are three cases:
-   1) The case if the new final value is equal to the current one.
-   2) Induction varaible has pointer type, as the code cannot rely
- on the object to that the pointer points being placed at the
- end of the address space (and more pragmatically,
- TYPE_{MIN,MAX}_VALUE is not defined for pointers).
-   3) EXIT_MUST_BE_TAKEN is true, note it implies that the induction
- variable does not overflow.  */
-  if (!integer_zerop (mod) && !POINTER_TYPE_P (type) && !exit_must_be_taken)
+ then the computation of the final value does not overflow.  This is
+ also obviously the case if the new final value is equal to the
+ current one.  Finally, we postulate this for pointer type variables,
+ as the code cannot rely on the object to that the pointer points being
+ placed at the end of the address space (and more pragmatically,
+ TYPE_{MIN,MAX}_VALUE is not defined for pointers).  */
+  if (integer_zerop (mod) || POINTER_TYPE_P (type))
+fv_comp_no_overflow = true;
+  else if (!exit_must_be_taken)
+fv_comp_no_overflow = false;
+  else
+fv_comp_no_overflow =
+   (iv0->no_overflow && integer_nonzerop (iv0->step))
+   || (iv1->no_overflow && integer_nonzerop (iv1->step));
+
+  if (integer_nonzerop (iv0->step))
 {
-  if (integer_nonzerop (iv0->step))
+  /* The final value of the iv is iv1->base + MOD, assuming that this
+computation does not overflow, and that
+iv0->base <= iv1->base + MOD.  */
+  if (!fv_comp_no_overflow)
{
- /* The final value of the iv is iv1->base + MOD, assuming
-that this computation does not overflow, and
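For illustration, a hypothetical loop (not one of the patch's tests) where the
final-value computation may overflow unless the exit is known to be taken:

  /* mod != 0 and the type is not a pointer here, so without
     exit_must_be_taken, or a no-overflow IV with a nonzero step, the
     rewritten exit bound n + mod could wrap around unsigned char.  */
  extern void g (void);
  void
  f (unsigned char n)
  {
    unsigned char i;
    for (i = 0; i < n; i += 3)
      g ();
  }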

[GCC ARM]Remove unused variable in arm

2017-07-18 Thread Bin Cheng
Hi,
This leftover unused variable breaks arm bootstrap.  Simply remove it.

Thanks,
bin
2017-07-18  Bin Cheng  <bin.ch...@arm.com>

* config/arm/arm.c (arm_emit_store_exclusive): Remove unused var.
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 1b7b382..139ab70 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28268,8 +28268,6 @@ arm_emit_store_exclusive (machine_mode mode, rtx bval, rtx rval,
 static void
 emit_unlikely_jump (rtx insn)
 {
-  int very_unlikely = REG_BR_PROB_BASE / 100 - 1;
-
   rtx_insn *jump = emit_jump_insn (insn);
   add_reg_br_prob_note (jump, profile_probability::very_unlikely ());
 }


[PATCH PR81408]Turn TREE level unsafe loop optimizations warning to missed optimization message

2017-07-18 Thread Bin Cheng
Hi,
I removed unsafe loop optimizations on TREE level last year, so GCC doesn't do
unsafe loop optimizations on TREE now.  All "unsafe loop optimizations" warnings
reported by TREE optimizers are simply missed optimizations.  This patch turns
such warnings into missed-optimization messages.  I didn't change when this is
dumped; for now it is when called from ivopts.
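For example, with a hypothetical file t.c containing the loop below, the
message now appears in the ivopts dump (requested with
-fdump-tree-ivopts-missed, as in the new test) rather than being emitted as a
-Wunsafe-loop-optimizations warning:

  /* Compile with: gcc -O2 -fdump-tree-ivopts-missed t.c  */
  extern void g (void);
  void
  f (unsigned n)
  {
    unsigned k;
    /* k <= n never becomes false when n == UINT_MAX, so niter analysis
       ends up with assumptions and the dump records the missed
       optimization.  */
    for (k = 0; k <= n; k++)
      g ();
  }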
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-07-13  Bin Cheng  <bin.ch...@arm.com>

PR target/81408
* tree-ssa-loop-niter.c (number_of_iterations_exit): Dump missed
optimization for loop niter analysis.

gcc/testsuite/ChangeLog
2017-07-13  Bin Cheng  <bin.ch...@arm.com>

PR target/81408
* g++.dg/tree-ssa/pr81408.C: New.
* gcc.dg/tree-ssa/pr19210-1.c: Check dump message rather than warning.
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr81408.C b/gcc/testsuite/g++.dg/tree-ssa/pr81408.C
new file mode 100644
index 000..354d362
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr81408.C
@@ -0,0 +1,93 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -std=gnu++11 -fdump-tree-ivopts-missed 
-Wunsafe-loop-optimizations" } */
+
+namespace a {
+void b () __attribute__ ((__noreturn__));
+template  struct d;
+template  struct d
+{
+  typedef e f;
+};
+struct g
+{
+  template  using i = h *;
+};
+}
+using a::d;
+template  class k
+{
+  j l;
+
+public:
+  typename d::f operator* () {}
+  void operator++ () { ++l; }
+  j
+  aa ()
+  {
+return l;
+  }
+};
+template 
+bool
+operator!= (k<m, ab> o, k<n, ab> p2)
+{
+  return o.aa () != p2.aa ();
+}
+struct p;
+namespace a {
+struct F
+{
+  struct q
+  {
+using ai = g::i;
+  };
+  using r = q::ai;
+};
+class H
+{
+public:
+  k<F::r, int> begin ();
+  k<F::r, int> end ();
+};
+int s;
+class I
+{
+public:
+  void
+  aq (char)
+  {
+if (s)
+  b ();
+  }
+};
+class u : public I
+{
+public:
+  void
+  operator<< (u o (u))
+  {
+o (*this);
+  }
+  u operator<< (void *);
+};
+template 
+at
+av (au o)
+{
+  o.aq ('\n');
+}
+u ax;
+}
+struct p
+{
+  char *ay;
+};
+a::H t;
+void
+ShowHelpListCommands ()
+{
+  for (auto c : t)
+a::ax << c.ay << a::av;
+}
+
+/* { dg-final { scan-tree-dump "note: missed loop optimization: niters 
analysis ends up with assumptions." "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c
index 3c8ee06..3c18470 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c
@@ -1,15 +1,15 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -Wunsafe-loop-optimizations" } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details -Wunsafe-loop-optimizations" } 
*/
 extern void g(void);
 
 void
 f (unsigned n)
 {
   unsigned k;
-  for(k = 0;k <= n;k++) /* { dg-warning "missed loop optimization.*overflow" } */
+  for(k = 0;k <= n;k++) /* missed optimization for this loop.  */
 g();
 
-  for(k = 0;k <= n;k += 4) /* { dg-warning "missed loop 
optimization.*overflow" } */
+  for(k = 0;k <= n;k += 4) /* missed optimization for this loop.  */
 g();
 
   /* We used to get warning for this loop.  However, since then # of iterations
@@ -21,9 +21,14 @@ f (unsigned n)
 g();
 
   /* So we need the following loop, instead.  */
-  for(k = 4;k <= n;k += 5) /* { dg-warning "missed loop 
optimization.*overflow" } */
+  for(k = 4;k <= n;k += 5) /* missed optimization for this loop.  */
 g();
   
-  for(k = 15;k >= n;k--) /* { dg-warning "missed loop optimization.*overflow" } */
+  for(k = 15;k >= n;k--) /* missed optimization for this loop.  */
 g();
 }
+
+/* { dg-final { scan-tree-dump "pr19210-1.c:9:.*: missed loop optimization: 
niters analysis ends up with assumptions." "ivopts" } } */
+/* { dg-final { scan-tree-dump "pr19210-1.c:12:.*: missed loop optimization: 
niters analysis ends up with assumptions." "ivopts" } } */
+/* { dg-final { scan-tree-dump "pr19210-1.c:24:.*: missed loop optimization: 
niters analysis ends up with assumptions." "ivopts" } } */
+/* { dg-final { scan-tree-dump "pr19210-1.c:27:.*: missed loop optimization: 
niters analysis ends up with assumptions." "ivopts" } } */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 5a7cab5..1421002 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2378,9 +2378,9 @@ number_of_iterations_exit (struct loop *loop, edge exit,
 return true;
 
   if (warn)
-warning_at (gimple_location_safe (stmt),
-   OPT_Wunsafe_loop_optimizations,
-   "missed loop optimization, the loop counter may overflow");
+dump_printf_loc (MSG_MISSED_OPTIMIZATION, gimple_location_safe (stmt),
+"missed loop optimization: niters analysis ends up "
+"with assumptions.\n");
 
   return false;
 }


[PATCH PR81369/02]Conservatively not distribute loop with unknown niters

2017-07-14 Thread Bin Cheng
Hi,
This is a followup patch for the previous fix to PR81369.  In that test case,
GCC tried to distribute an infinite loop, which doesn't make much sense.  This
patch works conservatively by skipping loops with unknown niters.  It also
simplifies the code a bit; a minimal sketch of a skipped loop follows below.
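A minimal sketch (hypothetical, in the spirit of the PR testcase) of a loop the
pass now skips, since number_of_latch_executions returns chrec_dont_know:

  void
  f (int *a, int *b)
  {
    /* No computable niters, so distributing this loop is pointless.  */
    for (;;)
      {
        *a++ = 0;
        *b++ = 1;
      }
  }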
Bootstrap and test on x86_64 and AArch64, is it OK?

Thanks,
bin
2017-07-12  Bin Cheng  <bin.ch...@arm.com>

PR target/81369
* tree-loop-distribution.c (classify_partition): Only assert on
number of iterations.
(merge_dep_scc_partitions): Delete parameter.  Update function call.
(distribute_loop): Remove code handling loop with unknown niters.
(pass_loop_distribution::execute): Skip loop with unknown niters.
From b96c0053b79fd457df1fdb91c4401a1a7ccace7d Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 12 Jul 2017 12:30:12 +0100
Subject: [PATCH 2/2] skip-loop-with-unknown-niters.txt

---
 gcc/tree-loop-distribution.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index fe678a5..497e6a9 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1412,8 +1412,7 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition,
 return;
 
   nb_iter = number_of_latch_executions (loop);
-  if (!nb_iter || nb_iter == chrec_dont_know)
-return;
+  gcc_assert (nb_iter && nb_iter != chrec_dont_know);
   if (dominated_by_p (CDI_DOMINATORS, single_exit (loop)->src,
 		  gimple_bb (DR_STMT (single_store
 plus_one = true;
@@ -1962,18 +1961,16 @@ sort_partitions_by_post_order (struct graph *pg,
 }
 
 /* Given reduced dependence graph RDG merge strong connected components
-   of PARTITIONS.  If IGNORE_ALIAS_P is true, data dependence caused by
-   possible alias between references is ignored, as if it doesn't exist
-   at all; otherwise all depdendences are considered.  */
+   of PARTITIONS.  In this function, data dependence caused by possible
+   alias between references is ignored, as if it doesn't exist at all.  */
 
 static void
merge_dep_scc_partitions (struct graph *rdg,
-			  vec<struct partition *> *partitions,
-			  bool ignore_alias_p)
+			  vec<struct partition *> *partitions)
 {
   struct partition *partition1, *partition2;
   struct pg_vdata *data;
-  graph *pg = build_partition_graph (rdg, partitions, ignore_alias_p);
+  graph *pg = build_partition_graph (rdg, partitions, true);
   int i, j, num_sccs = graphds_scc (pg, NULL);
 
   /* Strong connected compoenent means dependence cycle, we cannot distribute
@@ -2420,9 +2417,6 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
   auto_vec partitions;
   rdg_build_partitions (rdg, stmts, &partitions);
 
-  /* Can't do runtime alias check if loop niter is unknown.  */
-  tree niters = number_of_latch_executions (loop);
-  bool rt_alias_check_p = (niters != NULL_TREE && niters != chrec_dont_know);
   auto_vec alias_ddrs;
 
   auto_bitmap stmt_in_all_partitions;
@@ -2511,9 +2505,9 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
   /* Build the partition dependency graph.  */
   if (partitions.length () > 1)
 {
-  merge_dep_scc_partitions (rdg, &partitions, rt_alias_check_p);
+  merge_dep_scc_partitions (rdg, &partitions);
   alias_ddrs.truncate (0);
-  if (rt_alias_check_p && partitions.length () > 1)
+  if (partitions.length () > 1)
 	break_alias_scc_partitions (rdg, &partitions, &alias_ddrs);
 }
 
@@ -2653,6 +2647,11 @@ pass_loop_distribution::execute (function *fun)
   if (!optimize_loop_for_speed_p (loop))
 	continue;
 
+  /* Don't distribute loop if niters is unknown.  */
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+	continue;
+
   /* Initialize the worklist with stmts we seed the partitions with.  */
   bbs = get_loop_body_in_dom_order (loop);
   for (i = 0; i < loop->num_nodes; ++i)
-- 
1.9.1



[PATCH PR81369/01]Sort partitions by post order for all cases

2017-07-14 Thread Bin Cheng
Hi,
This patch fixes ICE reported by PR81369.  It simply sinks call to
sort_partitions_by_post_order so that it's executed for all cases.
This is necessary to schedule reduction partition as the last one.
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-07-12  Bin Cheng  <bin.ch...@arm.com>

PR target/81369
* tree-loop-distribution.c (merge_dep_scc_partitions): Sink call to
function sort_partitions_by_post_order.

gcc/testsuite/ChangeLog
2017-07-12  Bin Cheng  <bin.ch...@arm.com>

PR target/81369
* gcc.dg/tree-ssa/pr81369.c: New.
From 685bab237e38544375dcc9a950ae2816b8e4385b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 12 Jul 2017 12:04:49 +0100
Subject: [PATCH 1/2] pr81369-1.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr81369.c | 23 +++
 gcc/tree-loop-distribution.c|  3 ++-
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr81369.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr81369.c b/gcc/testsuite/gcc.dg/tree-ssa/pr81369.c
new file mode 100644
index 000..b40477b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr81369.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution" } */
+
+typedef __PTRDIFF_TYPE__ intptr_t;
+int wo;
+
+void
+sy (long int *as)
+{
+  for (;;)
+{
+  *as = wo;
+  while (as < (long int *) (void *) 2)
+{
+  int *y9;
+
+  if (wo != 0)
+*y9 = (int) (intptr_t) 
+  wo /= (wo != 0 && *y9 != 0);
+  ++as;
+}
+}
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index be0a660..fe678a5 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1997,8 +1997,9 @@ merge_dep_scc_partitions (struct graph *rdg,
 		data->partition = NULL;
 	  }
 	}
-  sort_partitions_by_post_order (pg, partitions);
 }
+
+  sort_partitions_by_post_order (pg, partitions);
   gcc_assert (partitions->length () == (unsigned)num_sccs);
   free_partition_graph_vdata (pg);
   free_graph (pg);
-- 
1.9.1



[PATCH AArch64]Fix ICE in cortex-a57 fma steering pass

2017-07-12 Thread Bin Cheng
Hi,
After change @236817, the AArch64 backend can avoid unnecessary conversion
instructions for registers between different modes.  As a result, GCC may
initialize a register in a larger mode and use it later in a smaller mode.
Such a def-use chain is not supported by the current regrename.c analyzer,
as described by its comment:

  /* Process the insn, determining its effect on the def-use
 chains and live hard registers.  We perform the following
 steps with the register references in the insn, simulating
 its effect:
 ...
 We cannot deal with situations where we track a reg in one mode
 and see a reference in another mode; these will cause the chain
 to be marked unrenamable or even cause us to abort the entire
 basic block.  */

In this case, the regrename.c analyzer doesn't create a chain for the use of
the register.
OTOH, cortex-a57-fma-steering.c has below code:

@@ -973,10 +973,14 @@ func_fma_steering::analyze ()
break;
}
 
- /* We didn't find a chain with a def for this instruction.  */
- gcc_assert (i < dest_op_info->n_chains);
-
- this->analyze_fma_fmul_insn (forest, chain, head);

It asserts via gcc_assert that a chain must be found for the dest register of
fmul/fmac instructions.  According to the above analysis, this is not always
true if the dest reg is reused from one of its source registers.

This patch fixes the issue by skipping such instructions if no du chain is
found.  Bootstrap and test on AArch64/cortex-a57.  Is it OK?  If it's fine, I
would also need to backport it to the 7/6 branches.

Thanks,
bin
2017-07-12  Bin Cheng  <bin.ch...@arm.com>

PR target/81414
* config/aarch64/cortex-a57-fma-steering.c (analyze): Skip fmul/fmac
instructions if no du chain is found.

gcc/testsuite/ChangeLog
2017-07-12  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

PR target/81414
* gcc.target/aarch64/pr81414.C: New.
From ef2bc842993210a4399205d83fa46435eec5d7cd Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 12 Jul 2017 15:16:53 +0100
Subject: [PATCH] tmp

---
 gcc/config/aarch64/cortex-a57-fma-steering.c | 12 
 gcc/testsuite/gcc.target/aarch64/pr81414.C   | 10 ++
 2 files changed, 18 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr81414.C

diff --git a/gcc/config/aarch64/cortex-a57-fma-steering.c b/gcc/config/aarch64/cortex-a57-fma-steering.c
index 1bf804b..b2ee398 100644
--- a/gcc/config/aarch64/cortex-a57-fma-steering.c
+++ b/gcc/config/aarch64/cortex-a57-fma-steering.c
@@ -973,10 +973,14 @@ func_fma_steering::analyze ()
break;
}
 
- /* We didn't find a chain with a def for this instruction.  */
- gcc_assert (i < dest_op_info->n_chains);
-
- this->analyze_fma_fmul_insn (forest, chain, head);
+ /* Due to implementation of regrename, dest register can slip away
+from regrename's analysis.  As a result, there is no chain for
+the destination register of insn.  We simply skip the insn even
+if it is a fmul/fmac instruction.  This case can happen when the
+dest register is also a source register of insn and the source
+reg is set up in a larger mode than this insn.  */
+ if (i < dest_op_info->n_chains)
+   this->analyze_fma_fmul_insn (forest, chain, head);
}
 }
   free (bb_dfs_preorder);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr81414.C b/gcc/testsuite/gcc.target/aarch64/pr81414.C
new file mode 100644
index 000..13666a3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr81414.C
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57" } */
+
+typedef __Float32x2_t float32x2_t;
+__inline float32x2_t vdup_n_f32(float) {}
+ 
+float32x2_t vfma_lane_f32(float32x2_t __a, float32x2_t __b) {
+  int __lane;
+  return __builtin_aarch64_fmav2sf(__b, vdup_n_f32(__lane), __a);
+}
-- 
1.9.1



[PATCH PR81374]Record the max index of basic block, rather than # of basic blocks

2017-07-10 Thread Bin Cheng
Hi,
This patch fixes an ICE in the new loop distribution code.  When computing the
topological order for basic blocks, it should record the max index of basic
blocks, rather than the number of basic blocks.  I didn't add a new test
because existing tests can catch the ICE as well.  A short sketch of the point
follows below.
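The point, as a short sketch using the patch's own names: basic-block indices
need not be dense, so the inverse map must be sized by the maximum block index,
not by the count of blocks the RPO walk returns:

  /* rpo_num counts reachable blocks, but the values rpo[i] are basic
     block indices, which can be as large as
     last_basic_block_for_fn (cfun) - 1.  */
  int rpo_num = pre_and_rev_post_order_compute_fn (cfun, NULL, rpo, true);
  bb_top_order_index_size = last_basic_block_for_fn (cfun);
  for (int i = 0; i < rpo_num; i++)
    bb_top_order_index[rpo[i]] = i;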

Bootstrap and test on x86_64.  Is it OK?
Thanks,
bin
2017-07-10  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81374
* tree-loop-distribution.c (pass_loop_distribution::execute): Record
the max index of basic blocks, rather than number of basic blocks.
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index be0a660..5c8f29d 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2614,12 +2614,13 @@ pass_loop_distribution::execute (function *fun)
  lexicographical order.  */
   if (bb_top_order_index == NULL)
 {
+  int rpo_num;
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (cfun));
 
   bb_top_order_index = XNEWVEC (int, last_basic_block_for_fn (cfun));
-  bb_top_order_index_size
-   = pre_and_rev_post_order_compute_fn (cfun, NULL, rpo, true);
-  for (int i = 0; i < bb_top_order_index_size; i++)
+  bb_top_order_index_size = last_basic_block_for_fn (cfun);
+  rpo_num = pre_and_rev_post_order_compute_fn (cfun, NULL, rpo, true);
+  for (int i = 0; i < rpo_num; i++)
bb_top_order_index[rpo[i]] = i;
 
   free (rpo);


[PATCH PR81196]Analyze niters for loop with exit condition comparing induction variables

2017-06-28 Thread Bin Cheng
Hi,
This patch picks up a missed-optimization case in loop niter analysis.  With
this patch, niters information for loops like the ones in the added test can
be analyzed; a worked example follows below.  Bootstrap and test on x86_64 and
AArch64.  Is it OK?
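Worked example on the second loop of the added test (function b; sizes in
bytes, short being 2 bytes):

  /* p starts at P, q at P + 512 (256 shorts).  The exit test p < q
     compares two IVs:
       iv0 = {P, +2},  iv1 = {P + 512, -2}
     which the patch rewrites as {P, +4} < {P + 512, 0}.  The combined
     step 4 is a positive integer constant, so
     niters = 512 / 4 = 128 iterations.  */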

Thanks,
bin
2017-06-27  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81196
* tree-ssa-loop-niter.c (number_of_iterations_cond): Handle loop
exit condition comparing two IVs.

gcc/testsuite/ChangeLog
2017-06-27  Bin Cheng  <bin.ch...@arm.com>

PR tree-optimization/81196
* gcc.dg/vect/pr81196.c: New.
From e11856e20b64bacaa4c5ebc3ea08f875160161dc Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 26 Jun 2017 15:06:31 +0100
Subject: [PATCH] pr81196-20170626.txt

---
 gcc/testsuite/gcc.dg/vect/pr81196.c | 19 +++
 gcc/tree-ssa-loop-niter.c   | 30 +++---
 2 files changed, 42 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr81196.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr81196.c b/gcc/testsuite/gcc.dg/vect/pr81196.c
new file mode 100644
index 000..46d7a9e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr81196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_perm_short } */
+
+void f(short*p){
+  p=(short*)__builtin_assume_aligned(p,64);
+  short*q=p+256;
+  for(;p!=q;++p,--q){
+short t=*p;*p=*q;*q=t;
+  }
+}
+void b(short*p){
+  p=(short*)__builtin_assume_aligned(p,64);
+  short*q=p+256;
+  for(;p<q;++p,--q){
+short t=*p;*p=*q;*q=t;
+  }
+}
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 848e812..934e3b7 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1668,18 +1668,34 @@ number_of_iterations_cond (struct loop *loop,
 	exit_must_be_taken = true;
 }
 
-  /* We can handle the case when neither of the sides of the comparison is
- invariant, provided that the test is NE_EXPR.  This rarely occurs in
- practice, but it is simple enough to manage.  */
+  /* We can handle cases in which neither of the sides of the comparison is
+ invariant:
+
+   {iv0.base, iv0.step} cmp_code {iv1.base, iv1.step}
+ as if:
+   {iv0.base, iv0.step - iv1.step} cmp_code {iv1.base, 0}
+
+ provided that either below condition is satisfied:
+
+   a) the test is NE_EXPR;
+   b) iv0.step - iv1.step is positive integer.
+
+ This rarely occurs in practice, but it is simple enough to manage.  */
   if (!integer_zerop (iv0->step) && !integer_zerop (iv1->step))
 {
   tree step_type = POINTER_TYPE_P (type) ? sizetype : type;
-  if (code != NE_EXPR)
+  tree step = fold_binary_to_constant (MINUS_EXPR, step_type,
+	   iv0->step, iv1->step);
+
+  /* No need to check sign of the new step since below code takes care
+	 of this well.  */
+  if (code != NE_EXPR && TREE_CODE (step) != INTEGER_CST)
 	return false;
 
-  iv0->step = fold_binary_to_constant (MINUS_EXPR, step_type,
-	   iv0->step, iv1->step);
-  iv0->no_overflow = false;
+  iv0->step = step;
+  if (!POINTER_TYPE_P (type))
+	iv0->no_overflow = false;
+
   iv1->step = build_int_cst (step_type, 0);
   iv1->no_overflow = true;
 }
-- 
1.9.1



[PATCH GCC][4/4]Better handle store-stores chain if eliminated stores only store loop invariant

2017-06-27 Thread Bin Cheng
Hi,
This is a followup patch to better handle the case below:
 for (i = 0; i < n; i++)
   {
 a[i] = 1;
 a[i+2] = 2;
   }
Instead of generating root variables by loading from memory and propagating
them with PHI nodes, like:
 t0 = a[0];
 t1 = a[1];
 for (i = 0; i < n; i++)
   {
 a[i] = 1;
 t2 = 2;
 t0 = t1;
 t1 = t2;
   }
 a[n] = t0;
 a[n+1] = t1;
We can simply store loop-invariant values after the loop body if we know the
loop iterates more than chain->length times, like:
 for (i = 0; i < n; i++)
   {
 a[i] = 1;
   }
 a[n] = 2;
 a[n+1] = 2;

Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-21  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c: (struct chain): Handle store-store chain in which
stores for elimination only store loop invariant values.
(execute_pred_commoning_chain): Ditto.
(prepare_initializers_chain_store_elim): Ditto.
(prepare_finalizers): Ditto.
(is_inv_store_elimination_chain): New function.
(initialize_root_vars_store_elim_1): New function.
From 16603c31d42e44f93a5de0faa0354629e669c5d0 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 21 Jun 2017 16:18:43 +0100
Subject: [PATCH 6/6] inv-store-elimination-20170621.txt

---
 gcc/tree-predcom.c | 131 ++---
 1 file changed, 125 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 9be93e4..8e38be4 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -327,6 +327,10 @@ typedef struct chain
 
   /* True if this chain was combined together with some other chain.  */
   unsigned combined : 1;
+
+  /* True if this is store elimination chain and eliminated stores store
+ loop invariant value into memory.  */
+  unsigned inv_store_elimination : 1;
 } *chain_p;
 
 
@@ -1630,6 +1634,98 @@ initialize_root_vars (struct loop *loop, chain_p chain, bitmap tmp_vars)
 }
 }
 
+/* For inter-iteration store elimination CHAIN in LOOP, returns true if
+   all stores to be eliminated store loop invariant values into memory.
+   In this case, we can use these invariant values directly after LOOP.  */
+
+static bool
+is_inv_store_elimination_chain (struct loop *loop, chain_p chain)
+{
+  if (chain->length == 0 || chain->type != CT_STORE_STORE)
+return false;
+
+  gcc_assert (!chain->has_max_use_after);
+
+  /* If loop iterates for unknown times or fewer times than chain->length,
+ we still need to setup root variable and propagate it with PHI node.  */
+  tree niters = number_of_latch_executions (loop);
+  if (TREE_CODE (niters) != INTEGER_CST || wi::leu_p (niters, chain->length))
+return false;
+
+  /* Check stores in chain for elimination if they only store loop invariant
+ values.  */
+  for (unsigned i = 0; i < chain->length; i++)
+{
+  dref a = get_chain_last_ref_at (chain, i);
+  if (a == NULL)
+   continue;
+
+  gimple *def_stmt, *stmt = a->stmt;
+  if (!gimple_assign_single_p (stmt))
+   return false;
+
+  tree val = gimple_assign_rhs1 (stmt);
+  if (TREE_CLOBBER_P (val))
+   return false;
+
+  if (TREE_CODE (val) == INTEGER_CST || TREE_CODE (val) == REAL_CST)
+   continue;
+
+  if (TREE_CODE (val) != SSA_NAME)
+   return false;
+
+  def_stmt = SSA_NAME_DEF_STMT (val);
+  if (gimple_nop_p (def_stmt))
+   continue;
+
+  if (flow_bb_inside_loop_p (loop, gimple_bb (def_stmt)))
+   return false;
+}
+  return true;
+}
+
+/* Creates root variables for store elimination CHAIN in which stores for
+   elimination only store loop invariant values.  In this case, we neither
+   need to load root variables before loop nor propagate it with PHI nodes.  */
+
+static void
+initialize_root_vars_store_elim_1 (chain_p chain)
+{
+  tree var;
+  unsigned i, n = chain->length;
+
+  chain->vars.create (n);
+  chain->vars.safe_grow_cleared (n);
+
+  /* Initialize root value for eliminated stores at each distance.  */
+  for (i = 0; i < n; i++)
+{
+  dref a = get_chain_last_ref_at (chain, i);
+  if (a == NULL)
+   continue;
+
+  var = gimple_assign_rhs1 (a->stmt);
+  chain->vars[a->distance] = var;
+}
+
+  /* We don't propagate values with PHI nodes, so manually propagate value
+ to bubble positions.  */
+  var = chain->vars[0];
+  for (i = 1; i < n; i++)
+{
+  if (chain->vars[i] != NULL_TREE)
+   {
+ var = chain->vars[i];
+ continue;
+   }
+  chain->vars[i] = var;
+}
+
+  /* Reverse the vector.  */
+  for (i = 0; i < n / 2; i++)
+std::swap (chain->vars[i], chain->vars[n - i - 1]);
+}
+
 /* Creates root variables for store elimination CHAIN in which stores for
elimination store loop va

[PATCH GCC][3/4]Generalize dead store elimination (or store motion) across loop iterations in predcom

2017-06-27 Thread Bin Cheng
Hi,
For the moment, tree-predcom.c only supports invariant/load-load/store-load
chains.  This patch generalizes dead store elimination (or store motion) across
loop iterations in the predictive commoning pass by supporting store-store
chains.  As the comment in the patch says:

   Apart from predictive commoning on Load-Load and Store-Load chains, we
   also support Store-Store chains -- stores killed by other store can be
   eliminated.  Given below example:

 for (i = 0; i < n; i++)
   {
 a[i] = 1;
 a[i+2] = 2;
   }

   It can be replaced with:

 t0 = a[0];
 t1 = a[1];
 for (i = 0; i < n; i++)
   {
 a[i] = 1;
 t2 = 2;
 t0 = t1;
 t1 = t2;
   }
 a[n] = t0;
 a[n+1] = t1;

   If the loop runs more than 1 iterations, it can be further simplified into:

 for (i = 0; i < n; i++)
   {
 a[i] = 1;
   }
 a[n] = 2;
 a[n+1] = 2;

   The interesting part is this can be viewed either as general store motion
   or general dead store elimination in either intra/inter-iterations way.

There are a number of interesting facts about this enhancement:
a) This patch supports dead store elimination for both the across-iteration
 case and the single-iteration case.  For the latter, it is plain dead store
 elimination.
b) There are advantages to supporting dead store elimination in predcom; for
 example, it has complete information about memory addresses.  On the
 contrary, the DSE pass can only handle memory references with exactly the
 same memory address expression.
c) It's cheap to support store-store chains in predcom based on existing code.
d) As commented, the enhancement can be viewed as either generalized dead store
 elimination or generalized store motion.  I prefer DSE here.
Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-21  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c: Revise general description of pass.
(enum chain_type): New enum type for store elimination.
(struct chain): New field supporting store elimination.
(dump_chain): Dump store-stores chain.
(release_chain): Release resources.
(split_data_refs_to_components): Compute and create component
contains only stores for elimination.
(get_chain_last_ref_at): New function.
(make_invariant_chain): Initialization.
(make_rooted_chain): Specify chain type in parameter.
(add_looparound_copies): Skip for store-stores chain.
(determine_roots_comp): Compute type of chain and pass it to
make_rooted_chain.
(initialize_root_vars_store_elim_2): New function.
(finalize_eliminated_stores): New function.
(remove_stmt): Handle store for elimination.
(execute_pred_commoning_chain): Execute predictive commoning on
store-store chains.
(determine_unroll_factor): Skip unroll for store-stores chain.
(prepare_initializers_chain_store_elim): New function.
(prepare_initializers_chain): Handle store-store chain.
(prepare_finalizers_chain, prepare_finalizers): New function.
(tree_predictive_commoning_loop): Return integer value indicating
if loop is unrolled or lcssa form is corrupted.
(tree_predictive_commoning): Rewrite for lcssa form if necessary.

gcc/testsuite/ChangeLog
2017-06-21  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/predcom-dse-1.c: New test.
* gcc.dg/tree-ssa/predcom-dse-2.c: New test.
* gcc.dg/tree-ssa/predcom-dse-3.c: New test.
* gcc.dg/tree-ssa/predcom-dse-4.c: New test.
* gcc.dg/tree-ssa/predcom-dse-5.c: New test.
* gcc.dg/tree-ssa/predcom-dse-6.c: New test.
* gcc.dg/tree-ssa/predcom-dse-7.c: New test.
* gcc.dg/tree-ssa/predcom-dse-8.c: New test.
* gcc.dg/tree-ssa/predcom-dse-9.c: New test.
From 937905df58fa8d167b5ae371474f84e7c9de976a Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Mon, 26 Jun 2017 11:38:50 +0100
Subject: [PATCH 5/6] predcom-generalize-dse-20170622.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-1.c |  62 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-2.c |  62 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-3.c | 108 ++
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-4.c |  61 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-5.c |  63 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-6.c |  65 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-7.c |  63 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-8.c |  60 
 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-9.c |  90 +
 gcc/tree-predcom.c| 478 +++---
 10 files changed, 1066 insertions(+), 46 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-2.c
 create mode 100644 gcc/testsuite/gcc.

[PATCH GCC][1/4]Extend interface ref_at_iteration to compute ref @ (NITERS + ITERS)-th iteration

2017-06-27 Thread Bin Cheng
Hi,
This is a simple patch extending the interface ref_at_iteration in order to
compute a memory reference at the (NITERS + ITERS)-th iteration.  This is for
the following predictive commoning pass enhancement, in which we need to
compute the reference at ITERS iterations beyond the loop's NITERS iterations;
a sketch of the intended use follows below.
Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
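A sketch of the intended use in the follow-up predcom patch (variable names
here are hypothetical):

  /* Build a reference to DR at the (NITERS + 1)-th iteration, e.g. the
     a[n + 1] store materialized after the loop.  */
  gimple_seq stmts = NULL;
  tree ref = ref_at_iteration (dr, 1, &stmts, niters);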

Thanks,
bin
2017-06-21  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c (ref_at_iteration): Add parameter NITERS.  Compute
memory reference to DR at (NITERS + ITERS)-th iteration of loop.
From ed1df7daca3d2dc2c3ba1c504d5431fba96d0887 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 20 Jun 2017 16:21:34 +0100
Subject: [PATCH 3/6] ref-at-iteration-20170620.txt

---
 gcc/tree-predcom.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index b7b1083..0238e87 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1370,11 +1370,12 @@ replace_ref_with (gimple *stmt, tree new_tree, bool set, bool in_lhs)
   gsi_insert_after (, new_stmt, GSI_NEW_STMT);
 }
 
-/* Returns a memory reference to DR in the ITER-th iteration of
-   the loop it was analyzed in.  Append init stmts to STMTS.  */
+/* Returns a memory reference to DR in the (NITERS + ITER)-th iteration
+   of the loop it was analyzed in.  Append init stmts to STMTS.  */
 
 static tree
-ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
+ref_at_iteration (data_reference_p dr, int iter,
+ gimple_seq *stmts, tree niters = NULL_TREE)
 {
   tree off = DR_OFFSET (dr);
   tree coff = DR_INIT (dr);
@@ -1383,14 +1384,27 @@ ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
   tree ref_type = NULL_TREE;
   tree ref_op1 = NULL_TREE;
   tree ref_op2 = NULL_TREE;
-  if (iter == 0)
-;
-  else if (TREE_CODE (DR_STEP (dr)) == INTEGER_CST)
-coff = size_binop (PLUS_EXPR, coff,
-  size_binop (MULT_EXPR, DR_STEP (dr), ssize_int (iter)));
-  else
-off = size_binop (PLUS_EXPR, off,
- size_binop (MULT_EXPR, DR_STEP (dr), ssize_int (iter)));
+  tree new_offset;
+
+  if (iter != 0)
+{
+  new_offset = size_binop (MULT_EXPR, DR_STEP (dr), ssize_int (iter));
+  if (TREE_CODE (new_offset) == INTEGER_CST)
+   coff = size_binop (PLUS_EXPR, coff, new_offset);
+  else
+   off = size_binop (PLUS_EXPR, off, new_offset);
+}
+
+  if (niters != NULL_TREE)
+{
+  niters = fold_convert (ssizetype, niters);
+  new_offset = size_binop (MULT_EXPR, DR_STEP (dr), niters);
+  if (TREE_CODE (niters) == INTEGER_CST)
+   coff = size_binop (PLUS_EXPR, coff, new_offset);
+  else
+   off = size_binop (PLUS_EXPR, off, new_offset);
+}
+
   /* While data-ref analysis punts on bit offsets it still handles
  bitfield accesses at byte boundaries.  Cope with that.  Note that
  if the bitfield object also starts at a byte-boundary we can simply
-- 
1.9.1



[PATCH GCC][2/4]Remove interface initialize_root in predcom

2017-06-27 Thread Bin Cheng
Hi,
This simple patch removes the interface initialize_root.  It's simple enough
and called only once.
Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-21  Bin Cheng  <bin.ch...@arm.com>

* tree-predcom.c (initialize_root): Delete.
(execute_pred_commoning_chain): Initialize root vars and replace
reference of non-combined chain directly, rather than call above
function.
From 5670159613b9582437d3713aa69578e1a6b2cf0c Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 20 Jun 2017 16:31:43 +0100
Subject: [PATCH 4/6] remove-initialize_root-20170620.txt

---
 gcc/tree-predcom.c | 30 +-
 1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 0238e87..1c5944d 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1536,23 +1536,6 @@ initialize_root_vars (struct loop *loop, chain_p chain, bitmap tmp_vars)
 }
 }
 
-/* Create the variables and initialization statement for root of chain
-   CHAIN.  Uids of the newly created temporary variables are marked
-   in TMP_VARS.  */
-
-static void
-initialize_root (struct loop *loop, chain_p chain, bitmap tmp_vars)
-{
-  dref root = get_chain_root (chain);
-  bool in_lhs = (chain->type == CT_STORE_LOAD
-|| chain->type == CT_COMBINATION);
-
-  initialize_root_vars (loop, chain, tmp_vars);
-  replace_ref_with (root->stmt,
-   chain->vars[chain->length],
-   true, in_lhs);
-}
-
 /* Initializes a variable for load motion for ROOT and prepares phi nodes and
initialization on entry to LOOP if necessary.  The ssa name for the variable
is stored in VARS.  If WRITTEN is true, also a phi node to copy its value
@@ -1749,6 +1732,7 @@ execute_pred_commoning_chain (struct loop *loop, chain_p chain,
   unsigned i;
   dref a;
   tree var;
+  bool in_lhs;
 
   if (chain->combined)
 {
@@ -1758,10 +1742,14 @@ execute_pred_commoning_chain (struct loop *loop, chain_p chain,
 }
   else
 {
-  /* For non-combined chains, set up the variables that hold its value,
-and replace the uses of the original references by these
-variables.  */
-  initialize_root (loop, chain, tmp_vars);
+  /* For non-combined chains, set up the variables that hold its value.  */
+  initialize_root_vars (loop, chain, tmp_vars);
+  a = get_chain_root (chain);
+  in_lhs = (chain->type == CT_STORE_LOAD
+   || chain->type == CT_COMBINATION);
+  replace_ref_with (a->stmt, chain->vars[chain->length], true, in_lhs);
+
+  /* Replace the uses of the original references by these variables.  */
   for (i = 1; chain->refs.iterate (i, &a); i++)
{
  var = chain->vars[chain->length - a->distance];
-- 
1.9.1



[PATCH GCC][2/2]Refine CFG and bound information for split loops

2017-06-14 Thread Bin Cheng
Hi,
Loop split currently generates below control flow graph for split loops:
+
+   .-- guard1  --.
+   v v
+ pre1(loop1).-->pre2(loop2)
+  | ||
+.--->h1 |   h2<.
+| | || |
+|ex1---.|   .---ex2|
+|/ v|   | \|
+'---l1 guard2---'   | l2---'
+   ||
+   ||
+   '--->join<---'
+
In which,
+   LOOP2 is the second loop after split, GUARD1 and GUARD2 are the two bbs
+   controlling if loop2 should be executed.

Take the added test case as an example: the second loop iterates only once;
as a result, the CFG and loop niter bound information can be refined.  In
general, guard1/guard2 can be folded to true/false if loop2's niters is known
at compilation time.  This patch makes that improvement by analyzing and
refining the niters of loop2, as well as using that information to simplify
the CFG.  With this patch, the second split loop in the test can be completely
unrolled by later passes.
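Worked on the added test:

  /* for (k = 1; k <= len; k++) with guard (k < len) splits at border
     len: loop1 runs k = 1 .. len-1 with its guard folded to true;
     loop2 is entered only for k == len, i.e. it iterates at 0 latch
     times, so guard2 folds to false and loop2 fully unrolls.  */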

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-12  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-loop-split.c (compute_new_first_bound): Compute and
return bound information for the second split loop.
(adjust_loop_split): New function.
(split_loop): Update calls to above functions.

gcc/testsuite/ChangeLog
2017-06-12  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/loop-split-1.c: New test.
From 61855c74c7db6178008f007198aedee9a03f13e6 Mon Sep 17 00:00:00 2001
From: amker <amker@amker-laptop.(none)>
Date: Sun, 4 Jun 2017 02:26:34 +0800
Subject: [PATCH 2/2] lsplit-refine-cfg-niter-bound-20170601.txt

---
 gcc/testsuite/gcc.dg/loop-split-1.c |  16 +++
 gcc/tree-ssa-loop-split.c   | 197 
 2 files changed, 194 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/loop-split-1.c

diff --git a/gcc/testsuite/gcc.dg/loop-split-1.c b/gcc/testsuite/gcc.dg/loop-split-1.c
new file mode 100644
index 000..2fd3e04
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/loop-split-1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
+
+int foo (int *a, int *b, int len)
+{
+  int k;
+  for (k = 1; k <= len; k++)
+{
+  a[k]++;
+
+  if (k < len)
+   b[k]++;
+}
+}
+
+/* { dg-final { scan-tree-dump "The second split loop iterates at 0 latch 
times." "lsplit" } } */
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index f8fe8e6..73c7dc2 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -364,10 +364,9 @@ connect_loops (struct loop *loop1, struct loop *loop2)
 
 /* This returns the new bound for iterations given the original iteration
space in NITER, an arbitrary new bound BORDER, assumed to be some
-   comparison value with a different IV, the initial value GUARD_INIT of
-   that other IV, and the comparison code GUARD_CODE that compares
-   that other IV with BORDER.  We return an SSA name, and place any
-   necessary statements for that computation into *STMTS.
+   comparison value with a different IV, the initial value GUARD_INIT and
+   STEP of the other IV, and the comparison code GUARD_CODE that compares
+   that other IV with BORDER.
 
For example for such a loop:
 
@@ -387,28 +386,41 @@ connect_loops (struct loop *loop1, struct loop *loop2)
 
Depending on the direction of the IVs and if the exit tests
are strict or non-strict we need to use MIN or MAX,
-   and add or subtract 1.  This routine computes newend above.  */
+   and add or subtract 1.  This routine computes newend above.
+
+   After computing the new bound (on j), we may be able to compare the
+   first loop's iteration space against the original loop's.  If it is
+   comparable at compilation time, we may be able to compute the niter
+   bound of the second loop.  Record the second loop's iteration bound
+   to SECOND_LOOP_NITER_BOUND which has below meaning:
+
+ -3: Don't know anything about the second loop;
+ -2: The second loop must not be executed;
+ -1: The second loop must be executed, but niter bound is unknown;
+  n: The second loop must be executed, niter bound is n (>= 0);
+
+   Note we compute niter bound for the second loop's latch.  */
 
 static tree
-compute_new_first_bound (gimple_seq *stmts, struct tree_niter_desc *niter,
-tree border,
-enum tree_code guard_code, tree guard_init)
+compute_new_first_bound (struct tree_niter_desc *niter, tree border,
+enum tree_code guard_code, tree guard_init,
+tree step, HOST_WIDE_INT *second_loop_niter_boun

[PATCH GCC][1/2]Feed bound computation to folder in loop split

2017-06-14 Thread Bin Cheng
Hi,
Loop split forces intermediate computation to gimple operands all the time when
computing bound information.  This is not good since folding opportunities are
missed.  This patch fixes the issue by feeding all computation to the folder
and only forcing to a gimple operand at the end; a sketch follows below.
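The shape of the change, as a simplified sketch (non-pointer case only; the
single force_gimple_operand at the end stands in for what the caller does):

  /* Keep the bound in tree form so fold_build2 can simplify it, then
     gimplify once at the end.  */
  tree enddiff = fold_build2 (MINUS_EXPR, type, end, base);
  tree newbound = fold_build2 (PLUS_EXPR, type, guard_init, enddiff);
  newbound = force_gimple_operand (newbound, &stmts, true, NULL_TREE);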

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-12  Bin Cheng  <bin.ch...@arm.com>

* tree-ssa-loop-split.c (compute_new_first_bound): Feed bound
computation to folder, rather than force to gimple operands too
early.
From 372dc98aa91fd495c98c2326f854eb5f2c76500b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 2 Jun 2017 18:05:03 +0100
Subject: [PATCH 1/2] feed-bound-computation-to-folder-20170601.txt

---
 gcc/tree-ssa-loop-split.c | 77 ++-
 1 file changed, 30 insertions(+), 47 deletions(-)

diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index e77f2bf..f8fe8e6 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -396,53 +396,38 @@ compute_new_first_bound (gimple_seq *stmts, struct tree_niter_desc *niter,
 {
   /* The niter structure contains the after-increment IV, we need
  the loop-enter base, so subtract STEP once.  */
-  tree controlbase = force_gimple_operand (niter->control.base,
-  stmts, true, NULL_TREE);
+  tree controlbase = niter->control.base;
   tree controlstep = niter->control.step;
-  tree enddiff;
+  tree enddiff, end = niter->bound;
+  tree type;
+
+  /* Compute end-beg.  */
   if (POINTER_TYPE_P (TREE_TYPE (controlbase)))
 {
-  controlstep = gimple_build (stmts, NEGATE_EXPR,
- TREE_TYPE (controlstep), controlstep);
-  enddiff = gimple_build (stmts, POINTER_PLUS_EXPR,
- TREE_TYPE (controlbase),
- controlbase, controlstep);
+  controlstep = fold_build1 (NEGATE_EXPR,
+TREE_TYPE (controlstep), controlstep);
+  enddiff = fold_build_pointer_plus (controlbase, controlstep);
+
+  type = unsigned_type_for (enddiff);
+  enddiff = fold_build1 (NEGATE_EXPR, type, fold_convert (type, enddiff));
+  end = fold_convert (type, end);
+  enddiff = fold_build2 (PLUS_EXPR, type, end, enddiff);
+  enddiff = fold_convert (sizetype, enddiff);
 }
   else
-enddiff = gimple_build (stmts, MINUS_EXPR,
-   TREE_TYPE (controlbase),
-   controlbase, controlstep);
-
-  /* Compute end-beg.  */
-  gimple_seq stmts2;
-  tree end = force_gimple_operand (niter->bound, ,
-   true, NULL_TREE);
-  gimple_seq_add_seq_without_update (stmts, stmts2);
-  if (POINTER_TYPE_P (TREE_TYPE (enddiff)))
 {
-  tree tem = gimple_convert (stmts, sizetype, enddiff);
-  tem = gimple_build (stmts, NEGATE_EXPR, sizetype, tem);
-  enddiff = gimple_build (stmts, POINTER_PLUS_EXPR,
- TREE_TYPE (enddiff),
- end, tem);
+  enddiff = fold_build2 (MINUS_EXPR, TREE_TYPE (controlbase),
+controlbase, controlstep);
+  enddiff = fold_build2 (MINUS_EXPR, TREE_TYPE (enddiff), end, enddiff);
 }
-  else
-enddiff = gimple_build (stmts, MINUS_EXPR, TREE_TYPE (enddiff),
-   end, enddiff);
 
   /* Compute guard_init + (end-beg).  */
   tree newbound;
-  enddiff = gimple_convert (stmts, TREE_TYPE (guard_init), enddiff);
   if (POINTER_TYPE_P (TREE_TYPE (guard_init)))
-{
-  enddiff = gimple_convert (stmts, sizetype, enddiff);
-  newbound = gimple_build (stmts, POINTER_PLUS_EXPR,
-  TREE_TYPE (guard_init),
-  guard_init, enddiff);
-}
+newbound = fold_build_pointer_plus (guard_init, enddiff);
   else
-newbound = gimple_build (stmts, PLUS_EXPR, TREE_TYPE (guard_init),
-guard_init, enddiff);
+newbound = fold_build2 (PLUS_EXPR, TREE_TYPE (guard_init), guard_init,
+   fold_convert (TREE_TYPE (guard_init), enddiff));
 
   /* Depending on the direction of the IVs the new bound for the first
  loop is the minimum or maximum of old bound and border.
@@ -467,20 +452,18 @@ compute_new_first_bound (gimple_seq *stmts, struct tree_niter_desc *niter,
 
   if (addbound)
 {
-  tree type2 = TREE_TYPE (newbound);
-  if (POINTER_TYPE_P (type2))
-   type2 = sizetype;
-  newbound = gimple_build (stmts,
-  POINTER_TYPE_P (TREE_TYPE (newbound))
-  ? POINTER_PLUS_EXPR : PLUS_EXPR,
-  TREE_TYPE (newbound),
-  newbound,
-  build_int_cst (type2, addbound));
+  type = TREE_TYPE (newbound)

[PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check

2017-06-12 Thread Bin Cheng
Hi,
This is the main patch rewriting loop distribution in order to handle hmmer.
It improves loop distribution by versioning the loop under runtime alias check
conditions.  As described in the comments, the patch basically implements
distribution in the following steps:

 1) Seed partitions with specific type statements.  For now we support
two types seed statements: statement defining variable used outside
of loop; statement storing to memory.
 2) Build reduced dependence graph (RDG) for loop to be distributed.
The vertices (RDG:V) model all statements in the loop and the edges
(RDG:E) model flow and control dependencies between statements.
 3) Apart from RDG, compute data dependencies between memory references.
 4) Starting from seed statement, build up partition by adding depended
statements according to RDG's dependence information.  Partition is
classified as parallel type if it can be executed paralleled; or as
sequential type if it can't.  Parallel type partition is further
classified as different builtin kinds if it can be implemented as
builtin function calls.
 5) Build partition dependence graph (PG) based on data dependencies.
The vertices (PG:V) model all partitions and the edges (PG:E) model
all data dependencies between every pair of partitions.  In general,
data dependence is either compilation-time known or unknown.  In C
family languages, there exist quite a lot of compilation-time-unknown
dependencies because of possible alias relations between data references.
We categorize PG's edge to two types: "true" edge that represents
compilation time known data dependencies; "alias" edge for all other
data dependencies.
 6) Traverse subgraph of PG as if all "alias" edges don't exist.  Merge
partitions in each strong connected component (SCC) correspondingly.
Build new PG for merged partitions.
 7) Traverse PG again and this time with both "true" and "alias" edges
included.  We try to break SCCs by removing some edges.  Because
SCCs by "true" edges are all fused in step 6), we can break SCCs
by removing some "alias" edges.  It's NP-hard to choose optimal
edge set, fortunately simple approximation is good enough for us
given the small problem scale.
 8) Collect all data dependencies of the removed "alias" edges.  Create
runtime alias checks for collected data dependencies.
 9) Version loop under the condition of runtime alias checks.  Given
loop distribution generally introduces additional overhead, it is
only useful if vectorization is achieved in distributed loop.  We
version loop with internal function call IFN_LOOP_DIST_ALIAS.  If
no distributed loop can be vectorized, we simply remove distributed
loops and recover to the original one; a small example follows below.
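A hypothetical input illustrating steps 5) through 9): a and b may alias, so
their dependence becomes an "alias" edge; the loop is versioned under the
runtime check and distributed only in the aliasing-free version:

  void
  f (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;        /* becomes a memset partition */
        b[i] = b[i] + 1; /* stays a separate, vectorizable loop */
      }
  }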

Also, there is more to improve in the future (which isn't difficult, I think):
   TODO:
 1) We only distribute innermost loops now.  This pass should handle loop
nests in the future.
 2) We only fuse partitions in SCCs now.  A better fusion algorithm is
desired to minimize loop overhead, maximize parallelism and maximize

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c: Add general explanation on the pass.
(pg_add_dependence_edges): New parameter.  Handle alias data
dependence specially and record it in the parameter if asked.
(struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
(init_partition_graph_vertices, add_partition_graph_edge): New.
(pg_skip_alias_edge, free_partition_graph_edata_cb): New.
(free_partition_graph_vdata, build_partition_graph): New.
(sort_partitions_by_post_order, merge_dep_scc_partitions): New.
(pg_collect_alias_ddrs, break_alias_scc_partitions): New.
(data_ref_segment_size, latch_dominated_by_data_ref): New.
(compute_alias_check_pairs, version_loop_by_alias_check): New.
(version_for_distribution_p, finalize_partitions): New.
(distribute_loop): Handle alias data dependence specially.  Factor
out loop fustion code as functions.

From ab801ff60a3a92d7ee90b422b89d9a96a003b7ba Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 14:24:52 +0100
Subject: [PATCH 13/13] loop-distribution-20170607.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-12.c |   3 +-
 gcc/testsuite/gcc.dg/tree-ssa/ldist-13.c |   5 +-
 gcc/testsuite/gcc.dg/tree-ssa/ldist-14.c |   5 +-
 gcc/testsuite/gcc.dg/tree-ssa/ldist-4.c  |   4 +-
 gcc/tree-loop-distribution.c | 843 +++
 5 files changed, 743 i

[PATCH GCC][12/13]Workaround reduction statements for distribution

2017-06-12 Thread Bin Cheng
Hi,
For now, loop distribution handles variables used outside of the loop as
reductions.  This is inaccurate because all partitions contain the statements
defining induction vars.  Ideally we should factor out scev-propagation as a
standalone interface which can be called when necessary.  Before that, this
patch simply works around the reduction issue by checking if the statement
belongs to all partitions.  If yes, the reduction must be computed in the last
partition no matter how the loop is distributed; see the sketch below.
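A hypothetical case: i below is defined in every partition (it is the
induction variable), so its use after the loop must not force the whole loop
into a single reduction partition:

  int
  f (int *a, int *b, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        a[i] = 0;
        b[i] = 1;
      }
    return i;  /* scalar use outside the loop */
  }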
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (classify_partition): New parameter and
better handle reduction statement.
(rdg_build_partitions): New parameter and record statements belonging
to all partitions.
(distribute_loop): Update use of above functions.
From 51764e6a377cf21ef13ffc36928c9f2b8932aac2 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 13:21:07 +0100
Subject: [PATCH 12/14] reduction-workaround-20170607.txt

---
 gcc/tree-loop-distribution.c | 40 +++-
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 7e31fee8..167155e 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1276,17 +1276,18 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v)
 }
 
 /* Classifies the builtin kind we can generate for PARTITION of RDG and LOOP.
-   For the moment we detect only the memset zero pattern.  */
+   For the moment we detect memset, memcpy and memmove patterns.  Bitmap
+   STMT_IN_ALL_PARTITIONS contains statements belonging to all partitions.  */
 
 static void
-classify_partition (loop_p loop, struct graph *rdg, partition *partition)
+classify_partition (loop_p loop, struct graph *rdg, partition *partition,
+   bitmap stmt_in_all_partitions)
 {
   bitmap_iterator bi;
   unsigned i;
   tree nb_iter;
   data_reference_p single_load, single_store;
-  bool volatiles_p = false;
-  bool plus_one = false;
+  bool volatiles_p = false, plus_one = false, has_reduction = false;
 
   partition->kind = PKIND_NORMAL;
   partition->main_dr = NULL;
@@ -1301,16 +1302,24 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition)
   if (gimple_has_volatile_ops (stmt))
volatiles_p = true;
 
-  /* If the stmt has uses outside of the loop mark it as reduction.  */
+  /* If the stmt is not included by all partitions and there are uses
+outside of the loop, then mark the partition as a reduction.  */
   if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
{
- partition->reduction_p = true;
- return;
+ if (!bitmap_bit_p (stmt_in_all_partitions, i))
+   {
+ partition->reduction_p = true;
+ return;
+   }
+ has_reduction = true;
}
 }
 
   /* Perform general partition disqualification for builtins.  */
   if (volatiles_p
+  /* Simple workaround to prevent classifying the partition as builtin
+if it contains any use outside of loop.  */
+  || has_reduction
   || !flag_tree_loop_distribute_patterns)
 return;
 
@@ -1540,14 +1549,16 @@ share_memory_accesses (struct graph *rdg,
   return false;
 }
 
-/* Aggregate several components into a useful partition that is
-   registered in the PARTITIONS vector.  Partitions will be
-   distributed in different loops.  */
+/* For each seed statement in STARTING_STMTS, this function builds
+   a partition for it by adding dependent statements according to RDG.
+   All partitions are recorded in PARTITIONS.  Statements belonging
+   to all partitions are recorded in STMT_IN_ALL_PARTITIONS.  */
 
 static void
 rdg_build_partitions (struct graph *rdg,
  vec starting_stmts,
- vec *partitions)
+ vec *partitions,
+ bitmap stmt_in_all_partitions)
 {
   auto_bitmap processed;
   int i;
@@ -1568,6 +1579,7 @@ rdg_build_partitions (struct graph *rdg,
 
   partition *partition = build_rdg_partition_for_vertex (rdg, v);
   bitmap_ior_into (processed, partition->stmts);
+  bitmap_and_into (stmt_in_all_partitions, partition->stmts);
 
   if (dump_file && (dump_flags & TDF_DETAILS))
{
@@ -1814,13 +1826,15 @@ distribute_loop (struct loop *loop, vec stmts,
   ddrs_vec = new vec<ddr_p> ();
   ddrs_table = new hash_table<ddr_entry_hasher> (389);
 
+  auto_bitmap stmt_in_all_partitions;
   auto_vec partitions;
-  rdg_build_partitions (rdg, stmts, &partitions);
+  bitmap_set_range (stmt_in_all_partitions, 0, rdg->n_vertices);
+  rdg_build_partitions (rdg, stmts, &partitions, stmt_in_all_partitions);
 
   any_builtin = false;
   FOR_EACH_VEC_ELT (partitions, i, partition)
 {
-  classify_partition (loop, rdg, par

[PATCH GCC][08/13]Refactoring structure partition for distribution

2017-06-12 Thread Bin Cheng
Hi,
This patch refactors struct partition for later distribution.  It records
bitmaps of data references in struct partition rather than in the vertex data
of the partition dependence graph.  This simplifies the code as well as
enabling the following rewriting.
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin

2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (struct partition): New fields recording
its data references.
(partition_alloc, partition_free): Init and release data refs.
(partition_merge_into): Merge data refs.
(build_rdg_partition_for_vertex): Collect data refs for partition.
(distribute_loop): Remove data refs from vertex data of partition
graph.
From 1d3607e7aad7855e456b7a569f242703451911ab Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 12:29:24 +0100
Subject: [PATCH 08/14] struct-partition-refactoring-20170607.txt

---
 gcc/tree-loop-distribution.c | 180 ---
 1 file changed, 101 insertions(+), 79 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 0b16024..9a0e101 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -496,30 +496,43 @@ enum partition_kind {
 PKIND_NORMAL, PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
 };
 
+/* Partition for loop distribution.  */
 struct partition
 {
+  /* Statements of the partition.  */
   bitmap stmts;
+  /* Loops of the partition.  */
   bitmap loops;
+  /* True if the partition defines variable which is used outside of loop.  */
   bool reduction_p;
+  /* For builtin partition, true if it executes one iteration more than
+ number of loop (latch) iterations.  */
   bool plus_one;
   enum partition_kind kind;
   /* data-references a kind != PKIND_NORMAL partition is about.  */
   data_reference_p main_dr;
   data_reference_p secondary_dr;
+  /* Number of loop (latch) iterations.  */
   tree niter;
+  /* Read data references in the partition.  */
+  bitmap reads;
+  /* Write data references in the partition.  */
+  bitmap writes;
 };
 
 
 /* Allocate and initialize a partition from BITMAP.  */
 
 static partition *
-partition_alloc (bitmap stmts, bitmap loops)
+partition_alloc (void)
 {
   partition *partition = XCNEW (struct partition);
-  partition->stmts = stmts ? stmts : BITMAP_ALLOC (NULL);
-  partition->loops = loops ? loops : BITMAP_ALLOC (NULL);
+  partition->stmts = BITMAP_ALLOC (NULL);
+  partition->loops = BITMAP_ALLOC (NULL);
   partition->reduction_p = false;
   partition->kind = PKIND_NORMAL;
+  partition->reads = BITMAP_ALLOC (NULL);
+  partition->writes = BITMAP_ALLOC (NULL);
   return partition;
 }
 
@@ -530,6 +543,8 @@ partition_free (partition *partition)
 {
   BITMAP_FREE (partition->stmts);
   BITMAP_FREE (partition->loops);
+  BITMAP_FREE (partition->reads);
+  BITMAP_FREE (partition->writes);
   free (partition);
 }
 
@@ -577,6 +592,9 @@ partition_merge_into (partition *dest, partition *partition, enum fuse_type ft)
   if (partition_reduction_p (partition))
 dest->reduction_p = true;
 
+  bitmap_ior_into (dest->reads, partition->reads);
+  bitmap_ior_into (dest->writes, partition->writes);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
   fprintf (dump_file, "Fuse partitions because %s:\n", fuse_message[ft]);
@@ -1050,10 +1068,11 @@ generate_code_for_partition (struct loop *loop,
 static partition *
 build_rdg_partition_for_vertex (struct graph *rdg, int v)
 {
-  partition *partition = partition_alloc (NULL, NULL);
+  partition *partition = partition_alloc ();
   auto_vec<int, 3> nodes;
-  unsigned i;
+  unsigned i, j;
   int x;
+  data_reference_p dr;
 
  graphds_dfs (rdg, &v, 1, &nodes, false, NULL);
 
@@ -1062,6 +1081,18 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v)
   bitmap_set_bit (partition->stmts, x);
   bitmap_set_bit (partition->loops,
  loop_containing_stmt (RDG_STMT (rdg, x))->num);
+
+  for (j = 0; RDG_DATAREFS (rdg, x).iterate (j, &dr); ++j)
+   {
+ int *slot = datarefs_map->get (dr);
+
+ gcc_assert (slot != NULL);
+
+ if (DR_IS_READ (dr))
+   bitmap_set_bit (partition->reads, *slot);
+ else
+   bitmap_set_bit (partition->writes, *slot);
+   }
 }
 
   return partition;
@@ -1426,63 +1457,69 @@ partition_contains_all_rw (struct graph *rdg,
 
 static int
 pg_add_dependence_edges (struct graph *rdg, int dir,
-vec drs1,
-vec drs2)
+bitmap drs1, bitmap drs2)
 {
-  data_reference_p dr1, dr2;
+  unsigned i, j;
+  bitmap_iterator bi, bj;
+  data_reference_p dr1, dr2, saved_dr1;
 
   /* dependence direction - 0 is no dependence, -1 is back,
 1 is forth, 2 is both (we can stop then, merging will occur).  */

[PATCH GCC][10/13]Compute and cache data dependence relation

2017-06-12 Thread Bin Cheng
Hi,
This patch computes and caches data dependence relations in a hash table so
that they can be queried multiple times later for the partition dependence
check.
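
Stripped of GCC specifics, the caching is just memoization of a pairwise
query.  A stand-alone C sketch (hypothetical names; a fixed-size
open-addressing table in place of GCC's hash_table):

#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE 64

struct dep_entry { const void *a, *b; int dep; int valid; };
static struct dep_entry cache[CACHE_SIZE];

/* Placeholder for the expensive dependence computation.  */
static int
compute_dep (const void *a, const void *b)
{
  return a == b;
}

/* Return the (cached) dependence answer for the pair (A, B).  */
static int
get_dep (const void *a, const void *b)
{
  uintptr_t h = ((uintptr_t) a ^ (uintptr_t) b) % CACHE_SIZE;
  for (unsigned i = 0; i < CACHE_SIZE; i++)
    {
      struct dep_entry *e = &cache[(h + i) % CACHE_SIZE];
      if (!e->valid)
        {
          /* First query for this pair: compute once and cache.  */
          e->a = a; e->b = b; e->dep = compute_dep (a, b); e->valid = 1;
          return e->dep;
        }
      if (e->a == a && e->b == b)
        return e->dep;  /* cache hit: no recomputation */
    }
  return compute_dep (a, b);  /* table full: fall back to recomputing */
}

int
main (void)
{
  int x, y;
  printf ("%d %d %d\n", get_dep (&x, &y), get_dep (&x, &y), get_dep (&x, &x));
  return 0;
}
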
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin

2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (struct ddr_entry, ddr_entry_hasher): New.
(ddr_entry_hasher::hash, ::equal, get_data_dependence): New function.
(ddrs_vec, ddrs_table): New.
(classify_partition): Call get_data_dependence.
(pg_add_dependence_edges): Ditto.
(distribute_loop): Initialize data dependence global variables.

From 6075d1db94b7f130a91bba53125bed5754a46f59 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 13:02:09 +0100
Subject: [PATCH 10/14] cache-data-dependence-20170607.txt

---
 gcc/tree-loop-distribution.c | 118 +--
 1 file changed, 92 insertions(+), 26 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 90dc8ea..eacd9a1 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -66,6 +66,43 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 
 
+/* Hashtable entry for data reference relation.  */
+struct ddr_entry
+{
+  data_reference_p a;
+  data_reference_p b;
+  ddr_p ddr;
+  hashval_t hash;
+};
+
+/* Hashtable helpers.  */
+
+struct ddr_entry_hasher : delete_ptr_hash <ddr_entry>
+{
+  static inline hashval_t hash (const ddr_entry *);
+  static inline bool equal (const ddr_entry *, const ddr_entry *);
+};
+
+/* Hash function for data reference relation.  */
+
+inline hashval_t
+ddr_entry_hasher::hash (const ddr_entry *entry)
+{
+  return entry->hash;
+}
+
+/* Hash table equality function for data reference relation.  */
+
+inline bool
+ddr_entry_hasher::equal (const ddr_entry *entry1, const ddr_entry *entry2)
+{
+  return (entry1->hash == entry2->hash
+ && DR_STMT (entry1->a) == DR_STMT (entry2->a)
+ && DR_STMT (entry1->b) == DR_STMT (entry2->b)
+ && operand_equal_p (DR_REF (entry1->a), DR_REF (entry2->a), 0)
+ && operand_equal_p (DR_REF (entry1->b), DR_REF (entry2->b), 0));
+}
+
 /* The loop (nest) to be distributed.  */
 static vec<loop_p> *loop_nest;
 
@@ -75,6 +112,13 @@ static vec<data_reference_p> *datarefs_vec;
 /* Map of data reference in the loop to a unique id.  */
 static hash_map<data_reference_p, int> *datarefs_map;
 
+/* Vector of data dependence relations.  */
+static vec<ddr_p> *ddrs_vec;
+
+/* Hash table for data dependence relation in the loop to be distributed.  */
+static hash_table<ddr_entry_hasher> *ddrs_table;
+
+
 /* A Reduced Dependence Graph (RDG) vertex representing a statement.  */
 struct rdg_vertex
 {
@@ -1061,6 +1105,41 @@ generate_code_for_partition (struct loop *loop,
   return false;
 }
 
+/* Return data dependence relation for data references A and B.  The two
+   data references must be in lexicographic order wrt the reduced dependence
+   graph RDG.  We first try to find the ddr in the global ddr hash table.
+   If it doesn't exist, compute the ddr and cache it.  */
+
+static data_dependence_relation *
+get_data_dependence (struct graph *rdg, data_reference_p a, data_reference_p b)
+{
+  struct ddr_entry ent, **slot;
+  struct data_dependence_relation *ddr;
+
+  gcc_assert (DR_IS_WRITE (a) || DR_IS_WRITE (b));
+  gcc_assert (rdg_vertex_for_stmt (rdg, DR_STMT (a))
+ <= rdg_vertex_for_stmt (rdg, DR_STMT (b)));
+  ent.a = a;
+  ent.b = b;
+  ent.hash = iterative_hash_expr (DR_REF (a), 0);
+  ent.hash = iterative_hash_expr (DR_REF (b), ent.hash);
+  slot = ddrs_table->find_slot (&ent, INSERT);
+  if (*slot == NULL)
+{
+  ddr = initialize_data_dependence_relation (a, b, *loop_nest);
+  compute_affine_dependence (ddr, (*loop_nest)[0]);
+
+  ddrs_vec->safe_push (ddr);
+
+  *slot = new ddr_entry ();
+  (*slot)->a = a;
+  (*slot)->b = b;
+  (*slot)->ddr = ddr;
+  (*slot)->hash = ent.hash;
+}
+
+  return (*slot)->ddr;
+}
 
 /* Returns a partition with all the statements needed for computing
the vertex V of the RDG, also including the loop exit conditions.  */
@@ -1231,44 +1310,27 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition)
return;
   /* Now check that if there is a dependence this dependence is
  of a suitable form for memmove.  */
-  vec<loop_p> loops = vNULL;
-  ddr_p ddr;
-  loops.safe_push (loop);
-  ddr = initialize_data_dependence_relation (single_load, single_store,
-loops);
-  compute_affine_dependence (ddr, loop);
+  ddr_p ddr = get_data_dependence (rdg, single_load, single_store);
   if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
-   {
- free_dependence_relation (ddr);
- 

[PATCH GCC][11/13]Annotate partition by its parallelism execution type

2017-06-12 Thread Bin Cheng
Hi,
This patch checks and records whether a partition can be executed in parallel
by looking for data dependence cycles.  The information is needed for
distribution because the idea is to distribute parallel type partitions away
from sequential ones.  I believe the current distribution doesn't work very
well because it does blind distribution/fusion.
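
In plain C the distinction looks like this (my illustration, not from the
patch): the first loop below has no cross-iteration dependence and its
partition would be PTYPE_PARALLEL, while the recurrence in the second forms a
distance-1 dependence cycle and forces PTYPE_SEQUENTIAL:

void
example (int *a, int *b, int n)
{
  for (int i = 0; i < n; i++)
    b[i] = 0;                /* parallel: iterations are independent */

  for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + 1;     /* sequential: a[i] depends on a[i - 1] */
}
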
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (alias.h): Include header file.
(enum partition_type): New.
(struct partition): New field type.
(partition_merge_into): Update partition type.
(data_dep_in_cycle_p): New function.
(build_rdg_partition_for_vertex): Compute partition type.
(rdg_build_partitions): Dump partition type.

From 63a21f07ac97d1e93086110d2564900417a2af5a Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 13:11:59 +0100
Subject: [PATCH 11/14] partition-type-20170607.txt

---
 gcc/tree-loop-distribution.c | 107 +--
 1 file changed, 103 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index eacd9a1..7e31fee8 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -51,6 +51,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "ssa.h"
 #include "gimple-pretty-print.h"
+#include "alias.h"
 #include "fold-const.h"
 #include "cfganal.h"
 #include "gimple-iterator.h"
@@ -535,11 +536,19 @@ build_rdg (struct loop *loop, control_dependences *cd)
 }
 
 
-
+/* Kind of distributed loop.  */
 enum partition_kind {
 PKIND_NORMAL, PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
 };
 
+/* Type of distributed loop.  */
+enum partition_type {
+/* The distributed loop can be executed in parallel.  */
+PTYPE_PARALLEL = 0,
+/* The distributed loop has to be executed sequentially.  */
+PTYPE_SEQUENTIAL
+};
+
 /* Partition for loop distribution.  */
 struct partition
 {
@@ -553,6 +562,7 @@ struct partition
  number of loop (latch) iterations.  */
   bool plus_one;
   enum partition_kind kind;
+  enum partition_type type;
   /* data-references a kind != PKIND_NORMAL partition is about.  */
   data_reference_p main_dr;
   data_reference_p secondary_dr;
@@ -632,6 +642,9 @@ static void
 partition_merge_into (partition *dest, partition *partition, enum fuse_type ft)
 {
   dest->kind = PKIND_NORMAL;
+  if (dest->type == PTYPE_PARALLEL)
+dest->type = partition->type;
+
   bitmap_ior_into (dest->stmts, partition->stmts);
   if (partition_reduction_p (partition))
 dest->reduction_p = true;
@@ -1141,6 +1154,47 @@ get_data_dependence (struct graph *rdg, data_reference_p a, data_reference_p b)
   return (*slot)->ddr;
 }
 
+/* In reduced dependence graph RDG for loop distribution, return true if
+   dependence between references DR1 and DR2 leads to a dependence cycle
+   and such dependence cycle can't be resolved by runtime alias check.  */
+
+static bool
+data_dep_in_cycle_p (struct graph *rdg,
+data_reference_p dr1, data_reference_p dr2)
+{
+  struct data_dependence_relation *ddr;
+
+  /* Re-shuffle data-refs to be in topological order.  */
+  if (rdg_vertex_for_stmt (rdg, DR_STMT (dr1))
+  > rdg_vertex_for_stmt (rdg, DR_STMT (dr2)))
+std::swap (dr1, dr2);
+
+  ddr = get_data_dependence (rdg, dr1, dr2);
+
+  /* In case of no data dependence.  */
+  if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+return false;
+  /* Or the data dependence can be resolved by compilation time alias
+ check.  */
+  else if (!alias_sets_conflict_p (get_alias_set (DR_REF (dr1)),
+				   get_alias_set (DR_REF (dr2))))
+return false;
+  /* For unknown data dependence, or known data dependence which can't be
+ expressed in a classic distance vector, we check if it can be resolved
+ by a runtime alias check.  If yes, we consider that the data dependence
+ won't introduce a data dependence cycle.  */
+  else if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
+  || DDR_NUM_DIST_VECTS (ddr) == 0)
+return !runtime_alias_check_p (ddr, NULL, true);
+  else if (DDR_NUM_DIST_VECTS (ddr) > 1)
+return true;
+  else if (DDR_REVERSED_P (ddr)
+  || lambda_vector_zerop (DDR_DIST_VECT (ddr, 0), 1))
+return false;
+
+  return true;
+}
+
 /* Returns a partition with all the statements needed for computing
the vertex V of the RDG, also including the loop exit conditions.  */
 
@@ -1151,7 +1205,8 @@ build_rdg_partition_for_vertex (struct graph *rdg, int v)
   auto_vec<int, 3> nodes;
   unsigned i, j;
   int x;
-  data_reference_p dr;
+  data_reference_p dr, dr1, dr2;
+  bitmap_iterator bi, bj;
 
  graphds_dfs (rdg, &v, 1, &nodes, false, NULL);

[PATCH GCC][09/13]Simply cost model merges partitions with the same references

2017-06-12 Thread Bin Cheng
Hi,
The current primitive cost model merges partitions with data references sharing
the same base address.  I believe it's designed to maximize data reuse in
distribution, but that should be done by a dedicated data-reuse algorithm.  At
this stage of merging, we should be conservative and only merge partitions with
the same references.
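
A plain-C illustration (my example, not from the patch): the two statements
below form two partitions.  The old heuristic fused them because both touch
the base address a; the stricter share_memory_accesses only fuses when some
pair of references agrees on base, offset, init and step, and a[i] versus
a[i + n] does not, so the partitions now stay separate:

void
example (int *a, int *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = b[i] + 1;   /* partition 1: references a[i] and b[i] */
      a[i + n] = 3;      /* partition 2: same base a, different reference */
    }
}
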
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (ref_base_address): Delete.
(similar_memory_accesses): Rename ...
(share_memory_accesses): ... to this.  Check if partitions access
the same memory reference.
(distribute_loop): Call share_memory_accesses.

From ce94bbb382eacb8d170a8349415b7d2c88528d74 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 12:41:36 +0100
Subject: [PATCH 09/14] share-memory-access-20170607.txt

---
 gcc/tree-loop-distribution.c | 126 ++-
 1 file changed, 88 insertions(+), 38 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 9a0e101..90dc8ea 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1276,30 +1276,16 @@ classify_partition (loop_p loop, struct graph *rdg, partition *partition)
 }
 }
 
-/* For a data reference REF, return the declaration of its base
-   address or NULL_TREE if the base is not determined.  */
-
-static tree
-ref_base_address (data_reference_p dr)
-{
-  tree base_address = DR_BASE_ADDRESS (dr);
-  if (base_address
-  && TREE_CODE (base_address) == ADDR_EXPR)
-return TREE_OPERAND (base_address, 0);
-
-  return base_address;
-}
-
-/* Returns true when PARTITION1 and PARTITION2 have similar memory
-   accesses in RDG.  */
+/* Returns true when PARTITION1 and PARTITION2 access the same memory
+   object in RDG.  */
 
 static bool
-similar_memory_accesses (struct graph *rdg, partition *partition1,
-partition *partition2)
+share_memory_accesses (struct graph *rdg,
+  partition *partition1, partition *partition2)
 {
-  unsigned i, j, k, l;
+  unsigned i, j;
   bitmap_iterator bi, bj;
-  data_reference_p ref1, ref2;
+  data_reference_p dr1, dr2;
 
   /* First check whether in the intersection of the two partitions are
  any loads or stores.  Common loads are the situation that happens
@@ -1309,24 +1295,88 @@ similar_memory_accesses (struct graph *rdg, partition *partition1,
|| RDG_MEM_READS_STMT (rdg, i))
   return true;
 
-  /* Then check all data-references against each other.  */
-  EXECUTE_IF_SET_IN_BITMAP (partition1->stmts, 0, i, bi)
-if (RDG_MEM_WRITE_STMT (rdg, i)
-   || RDG_MEM_READS_STMT (rdg, i))
-  EXECUTE_IF_SET_IN_BITMAP (partition2->stmts, 0, j, bj)
-   if (RDG_MEM_WRITE_STMT (rdg, j)
-   || RDG_MEM_READS_STMT (rdg, j))
- {
-   FOR_EACH_VEC_ELT (RDG_DATAREFS (rdg, i), k, ref1)
- {
-   tree base1 = ref_base_address (ref1);
-   if (base1)
- FOR_EACH_VEC_ELT (RDG_DATAREFS (rdg, j), l, ref2)
-   if (base1 == ref_base_address (ref2))
- return true;
- }
- }
+  /* Then check whether the two partitions access the same memory object.  */
+  EXECUTE_IF_SET_IN_BITMAP (partition1->reads, 0, i, bi)
+{
+  gcc_assert (i < datarefs_vec->length ());
+  dr1 = (*datarefs_vec)[i];
+
+  if (!DR_BASE_ADDRESS (dr1)
+ || !DR_OFFSET (dr1) || !DR_INIT (dr1) || !DR_STEP (dr1))
+   continue;
+
+  EXECUTE_IF_SET_IN_BITMAP (partition2->reads, 0, j, bj)
+   {
+ gcc_assert (j < datarefs_vec->length ());
+ dr2 = (*datarefs_vec)[j];
+
+ if (!DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr2) || !DR_INIT (dr2) || !DR_STEP (dr2))
+   continue;
 
+ if (operand_equal_p (DR_BASE_ADDRESS (dr1), DR_BASE_ADDRESS (dr2), 0)
+ && operand_equal_p (DR_OFFSET (dr1), DR_OFFSET (dr2), 0)
+ && operand_equal_p (DR_INIT (dr1), DR_INIT (dr2), 0)
+ && operand_equal_p (DR_STEP (dr1), DR_STEP (dr2), 0))
+   return true;
+   }
+  EXECUTE_IF_SET_IN_BITMAP (partition2->writes, 0, j, bj)
+   {
+ gcc_assert (j < datarefs_vec->length ());
+ dr2 = (*datarefs_vec)[j];
+
+ if (!DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr2) || !DR_INIT (dr2) || !DR_STEP (dr2))
+   continue;
+
+ if (operand_equal_p (DR_BASE_ADDRESS (dr1), DR_BASE_ADDRESS (dr2), 0)
+ && operand_equal_p (DR_OFFSET (dr1), DR_OFFSET (dr2), 0)
+ && operand_equal_p (DR_INIT (dr1), DR_INIT (dr2), 0)
+ && operand_equal_p (DR_STEP (dr1), DR_STEP (dr2), 0))
+   return true;
+   }
+}
+
+  EXECU

[PATCH GCC][06/13]Preserve loop nest in whole distribution life time

2017-06-12 Thread Bin Cheng
Hi,
This simple patch computes and preserves the loop nest vector for the whole
distribution life time.  The loop nest will be used multiple times in
on-demand data dependence computation.

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (loop_nest): New global var.
(build_rdg): Use loop directly, rather than loop nest.
(pg_add_dependence_edges): Remove loop nest parameter.  Use global
variable directly.
(distribute_loop): Compute global variable loop nest.  Update use.

From ea3c198138036676334063226b6c1535e45dd4b2 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 11:56:28 +0100
Subject: [PATCH 06/14] loop-nest-20170607.txt

---
 gcc/tree-loop-distribution.c | 45 +++-
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index ce6db66..e1f5bce 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -66,6 +66,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 
 
+/* The loop (nest) to be distributed.  */
+static vec<loop_p> *loop_nest;
+
 /* A Reduced Dependence Graph (RDG) vertex representing a statement.  */
 struct rdg_vertex
 {
@@ -454,22 +457,22 @@ free_rdg (struct graph *rdg)
   free_graph (rdg);
 }
 
-/* Build the Reduced Dependence Graph (RDG) with one vertex per
-   statement of the loop nest LOOP_NEST, and one edge per data dependence or
-   scalar dependence.  */
+/* Build the Reduced Dependence Graph (RDG) with one vertex per statement of
+   LOOP, and one edge per flow dependence or control dependence from control
+   dependence CD.  */
 
 static struct graph *
-build_rdg (vec<loop_p> loop_nest, control_dependences *cd)
+build_rdg (struct loop *loop, control_dependences *cd)
 {
   struct graph *rdg;
   vec<data_reference_p> datarefs;
 
   /* Create the RDG vertices from the stmts of the loop nest.  */
   auto_vec<gimple *> stmts;
-  stmts_from_loop (loop_nest[0], &stmts);
+  stmts_from_loop (loop, &stmts);
   rdg = new_graph (stmts.length ());
   datarefs.create (10);
-  if (!create_rdg_vertices (rdg, stmts, loop_nest[0], &datarefs))
+  if (!create_rdg_vertices (rdg, stmts, loop, &datarefs))
 {
   datarefs.release ();
   free_rdg (rdg);
@@ -479,7 +482,7 @@ build_rdg (vec<loop_p> loop_nest, control_dependences *cd)
 
   create_rdg_flow_edges (rdg);
   if (cd)
-create_rdg_cd_edges (rdg, cd, loop_nest[0]);
+create_rdg_cd_edges (rdg, cd, loop);
 
   datarefs.release ();
 
@@ -1421,7 +1424,7 @@ partition_contains_all_rw (struct graph *rdg,
and DRS2 and modify and return DIR according to that.  */
 
 static int
-pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
+pg_add_dependence_edges (struct graph *rdg, int dir,
 			 vec<data_reference_p> drs1,
 			 vec<data_reference_p> drs2)
 {
@@ -1442,8 +1445,8 @@ pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
std::swap (dr1, dr2);
this_dir = -this_dir;
  }
-   ddr = initialize_data_dependence_relation (dr1, dr2, loops);
-   compute_affine_dependence (ddr, loops[0]);
+   ddr = initialize_data_dependence_relation (dr1, dr2, *loop_nest);
+   compute_affine_dependence (ddr, (*loop_nest)[0]);
if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
  this_dir = 2;
else if (DDR_ARE_DEPENDENT (ddr) == NULL_TREE)
@@ -1511,11 +1514,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
 
   *destroy_p = false;
   *nb_calls = 0;
-  auto_vec<loop_p, 3> loop_nest;
-  if (!find_loop_nest (loop, &loop_nest))
-    return 0;
+  loop_nest = new vec<loop_p> ();
+  if (!find_loop_nest (loop, loop_nest))
+{
+  loop_nest->release ();
+  delete loop_nest;
+  return 0;
+}
 
-  rdg = build_rdg (loop_nest, cd);
+  rdg = build_rdg (loop, cd);
   if (!rdg)
 {
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1523,6 +1530,8 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
 "Loop %d not distributed: failed to build the RDG.\n",
 loop->num);
 
+  loop_nest->release ();
+  delete loop_nest;
   return 0;
 }
 
@@ -1646,15 +1655,15 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
/* dependence direction - 0 is no dependence, -1 is back,
   1 is forth, 2 is both (we can stop then, merging will occur).  */
int dir = 0;
-   dir = pg_add_dependence_edges (rdg, loop_nest, dir,
+   dir = pg_add_dependence_edges (rdg, dir,
   PGDATA(i)->writes,
   PGDATA(j)->reads);
if (dir != 2)
- dir = pg_add_dependence_edges (rdg, loop_nest, dir,
+ dir = pg_add_dependence_edges (rdg, dir,
 PGDATA

[PATCH GCC][04/13]Sort statements in topological order for loop distribution

2017-06-12 Thread Bin Cheng
Hi,
During the work I ran into a latent bug in distribution.  For the moment we
sort statements in dominance order, but that's not enough because basic blocks
may be sorted in reverse order of the execution flow.  This results in a wrong
data dependence direction later.  This patch fixes the issue by sorting in
topological order.
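
A plain-C shape that can trigger it (my illustration, not the original test
case): the join block holding S2 is dominated only by the loop header, so
get_loop_body_in_dom_order may legally return it before the two arm blocks
even though S1/S1' execute first, and the dependence between the a[i] accesses
is then recorded in the wrong direction.  A reverse-post-order (topological)
walk always visits S1/S1' before S2:

void
example (int *a, const int *c, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (c[i])
        a[i] = 1;        /* S1  (then-arm)  */
      else
        a[i] = 2;        /* S1' (else-arm)  */
      a[i] += 3;         /* S2: must be seen after S1/S1' */
    }
}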

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (bb_top_order_index): New.
(bb_top_order_index_size, bb_top_order_cmp): New.
(stmts_from_loop): Use topological order.
(pass_loop_distribution::execute): Compute topological order for
basic blocks.

From 4bb233239e080eca956b3db7836cdf64da486dbf Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 7 Jun 2017 13:47:52 +0100
Subject: [PATCH 04/14] sort-stmts-in-top-order-20170607.txt

---
 gcc/tree-loop-distribution.c | 58 +++-
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index b0b9d66..a32253c 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -373,16 +373,39 @@ create_rdg_vertices (struct graph *rdg, vec<gimple *> stmts, loop_p loop,
   return true;
 }
 
-/* Initialize STMTS with all the statements of LOOP.  The order in
-   which we discover statements is important as
-   generate_loops_for_partition is using the same traversal for
-   identifying statements in loop copies.  */
+/* Array mapping basic block's index to its topological order.  */
+static int *bb_top_order_index;
+/* And size of the array.  */
+static int bb_top_order_index_size;
+
+/* If X has a smaller topological sort number than Y, returns -1;
+   if greater, returns 1.  */
+
+static int
+bb_top_order_cmp (const void *x, const void *y)
+{
+  basic_block bb1 = *(const basic_block *) x;
+  basic_block bb2 = *(const basic_block *) y;
+
+  gcc_assert (bb1->index < bb_top_order_index_size
+ && bb2->index < bb_top_order_index_size);
+  gcc_assert (bb1 == bb2
+ || bb_top_order_index[bb1->index]
+!= bb_top_order_index[bb2->index]);
+
+  return (bb_top_order_index[bb1->index] - bb_top_order_index[bb2->index]);
+}
+
+/* Initialize STMTS with all the statements of LOOP.  We use topological
+   order to discover all statements.  The order is important because
+   generate_loops_for_partition is using the same traversal for identifying
+   statements in loop copies.  */
 
 static void
 stmts_from_loop (struct loop *loop, vec<gimple *> *stmts)
 {
   unsigned int i;
-  basic_block *bbs = get_loop_body_in_dom_order (loop);
+  basic_block *bbs = get_loop_body_in_custom_order (loop, bb_top_order_cmp);
 
   for (i = 0; i < loop->num_nodes; i++)
 {
@@ -1764,6 +1787,22 @@ pass_loop_distribution::execute (function *fun)
   if (number_of_loops (fun) <= 1)
 return 0;
 
+  /* Compute topological order for basic blocks.  Topological order is
+ needed because data dependence is computed for data references in
+ lexicographical order.  */
+  if (bb_top_order_index == NULL)
+{
+  int *rpo = XNEWVEC (int, last_basic_block_for_fn (cfun));
+
+  bb_top_order_index = XNEWVEC (int, last_basic_block_for_fn (cfun));
+  bb_top_order_index_size
+   = pre_and_rev_post_order_compute_fn (cfun, NULL, rpo, true);
+  for (int i = 0; i < bb_top_order_index_size; i++)
+   bb_top_order_index[rpo[i]] = i;
+
+  free (rpo);
+}
+
   FOR_ALL_BB_FN (bb, fun)
 {
   gimple_stmt_iterator gsi;
@@ -1881,13 +1920,20 @@ out:
   if (cd)
 delete cd;
 
+  if (bb_top_order_index != NULL)
+{
+  free (bb_top_order_index);
+  bb_top_order_index = NULL;
+  bb_top_order_index_size = 0;
+}
+
   if (changed)
 {
   /* Destroy loop bodies that could not be reused.  Do this late as we
 otherwise can end up referring to stale data in control dependences.  */
   unsigned i;
   FOR_EACH_VEC_ELT (loops_to_be_destroyed, i, loop)
- destroy_loop (loop);
+   destroy_loop (loop);
 
   /* Cached scalar evolutions now may refer to wrong or non-existing
 loops.  */
-- 
1.9.1



[PATCH GCC][07/13]Preserve data references for whole distribution life time

2017-06-12 Thread Bin Cheng
Hi,
This patch collects and preserves all data references in the loop for the
whole distribution life time.  They will be used afterwards.
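
The datarefs_vec/datarefs_map pairing amounts to interning every reference
once and handing out a small id.  A stand-alone C sketch (hypothetical names;
linear search in place of the hash map) that also shows the cheap bail-out
once a fixed budget is exceeded, as the patch does beyond 64 references:

#include <stdio.h>

#define MAX_REFS 64

static const void *refs[MAX_REFS];
static int n_refs;

/* Return the unique id of REF, interning it on first sight;
   -1 means "too many references, give up".  */
static int
ref_id (const void *ref)
{
  for (int i = 0; i < n_refs; i++)
    if (refs[i] == ref)
      return i;
  if (n_refs == MAX_REFS)
    return -1;
  refs[n_refs] = ref;
  return n_refs++;
}

int
main (void)
{
  int a, b;
  printf ("%d %d %d\n", ref_id (&a), ref_id (&b), ref_id (&a)); /* 0 1 0 */
  return 0;
}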

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (datarefs_vec, datarefs_map): New
global var.
(create_rdg_vertices): Use datarefs_vec directly.
(free_rdg): Don't free data references.
(build_rdg): Update use.  Don't free data references.
(distribute_loop): Compute global variable for data references.
Bail out if there are too many data references.

From 78dd9322e9c3e5af2c736997fdbd2f71285eb5c0 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 12:09:03 +0100
Subject: [PATCH 07/14] preserve-datarefs-20170607.txt

---
 gcc/tree-loop-distribution.c | 58 +---
 1 file changed, 44 insertions(+), 14 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index e1f5bce..0b16024 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -69,6 +69,12 @@ along with GCC; see the file COPYING3.  If not see
 /* The loop (nest) to be distributed.  */
 static vec *loop_nest;
 
+/* Vector of data references in the loop to be distributed.  */
+static vec<data_reference_p> *datarefs_vec;
+
+/* Map of data reference in the loop to a unique id.  */
+static hash_map<data_reference_p, int> *datarefs_map;
+
 /* A Reduced Dependence Graph (RDG) vertex representing a statement.  */
 struct rdg_vertex
 {
@@ -339,8 +345,7 @@ create_rdg_cd_edges (struct graph *rdg, control_dependences *cd, loop_p loop)
if that failed.  */
 
 static bool
-create_rdg_vertices (struct graph *rdg, vec<gimple *> stmts, loop_p loop,
-		     vec<data_reference_p> *datarefs)
+create_rdg_vertices (struct graph *rdg, vec<gimple *> stmts, loop_p loop)
 {
   int i;
   gimple *stmt;
@@ -360,12 +365,12 @@ create_rdg_vertices (struct graph *rdg, vec<gimple *> stmts, loop_p loop,
   if (gimple_code (stmt) == GIMPLE_PHI)
continue;
 
-  unsigned drp = datarefs->length ();
-  if (!find_data_references_in_stmt (loop, stmt, datarefs))
+  unsigned drp = datarefs_vec->length ();
+  if (!find_data_references_in_stmt (loop, stmt, datarefs_vec))
return false;
-  for (unsigned j = drp; j < datarefs->length (); ++j)
+  for (unsigned j = drp; j < datarefs_vec->length (); ++j)
{
- data_reference_p dr = (*datarefs)[j];
+ data_reference_p dr = (*datarefs_vec)[j];
  if (DR_IS_READ (dr))
RDGV_HAS_MEM_READS (v) = true;
  else
@@ -449,7 +454,7 @@ free_rdg (struct graph *rdg)
   if (v->data)
{
  gimple_set_uid (RDGV_STMT (v), -1);
- free_data_refs (RDGV_DATAREFS (v));
+ (RDGV_DATAREFS (v)).release ();
  free (v->data);
}
 }
@@ -459,22 +464,20 @@ free_rdg (struct graph *rdg)
 
 /* Build the Reduced Dependence Graph (RDG) with one vertex per statement of
LOOP, and one edge per flow dependence or control dependence from control
-   dependence CD.  */
+   dependence CD.  During visiting each statement, data references are also
+   collected and recorded in global data DATAREFS_VEC.  */
 
 static struct graph *
 build_rdg (struct loop *loop, control_dependences *cd)
 {
   struct graph *rdg;
-  vec<data_reference_p> datarefs;
 
   /* Create the RDG vertices from the stmts of the loop nest.  */
   auto_vec<gimple *> stmts;
   stmts_from_loop (loop, &stmts);
   rdg = new_graph (stmts.length ());
-  datarefs.create (10);
-  if (!create_rdg_vertices (rdg, stmts, loop, &datarefs))
+  if (!create_rdg_vertices (rdg, stmts, loop))
 {
-  datarefs.release ();
   free_rdg (rdg);
   return NULL;
 }
@@ -484,8 +487,6 @@ build_rdg (struct loop *loop, control_dependences *cd)
   if (cd)
 create_rdg_cd_edges (rdg, cd, loop);
 
-  datarefs.release ();
-
   return rdg;
 }
 
@@ -1522,6 +1523,7 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
   return 0;
 }
 
+  datarefs_vec = new vec<data_reference_p> ();
   rdg = build_rdg (loop, cd);
   if (!rdg)
 {
@@ -1532,8 +1534,33 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
 
   loop_nest->release ();
   delete loop_nest;
+  free_data_refs (*datarefs_vec);
+  delete datarefs_vec;
   return 0;
 }
+  if (datarefs_vec->length () > 64)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file,
+"Loop %d not distributed: more than 64 memory references.\n",
+loop->num);
+
+  free_rdg (rdg);
+  loop_nest->release ();
+  delete loop_nest;
+  free_data_refs (*datarefs_vec);
+  delete datarefs_vec;
+  return 0;
+}
+
+  data_reference_p dref;
+  datarefs_map = new hash_map<data_reference_p, int>;
+  for (i = 0; datarefs_vec->iterate (i, &dref); ++i)
+{
+  int *slot = datarefs_m

[PATCH GCC][05/13]Refactoring partition merge

2017-06-12 Thread Bin Cheng
Hi,
This simple patch refactors partition merge code and dump information.

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (enum fuse_type, fuse_message): New.
(partition_merge_into): New parameter.  Dump reason for fusion.
(distribute_loop): Update use of partition_merge_into.

From d0b2d528931f6a8057bd0ac442fe9a0e7158044c Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 7 Jun 2017 14:16:21 +0100
Subject: [PATCH 05/14] partition-merge-dump-20170607.txt

---
 gcc/tree-loop-distribution.c | 66 +---
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index a32253c..ce6db66 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -545,15 +545,42 @@ partition_reduction_p (partition *partition)
   return partition->reduction_p;
 }
 
+/* Partitions are fused because of different reasons.  */
+enum fuse_type
+{
+  FUSE_NON_BUILTIN = 0,
+  FUSE_REDUCTION = 1,
+  FUSE_SHARE_REF = 2,
+  FUSE_SAME_SCC = 3,
+  FUSE_FINALIZE = 4
+};
+
+/* Description on different fusing reason.  */
+static const char *fuse_message[] = {
+  "they are non-builtins",
+  "they have reductions",
+  "they have shared memory refs",
+  "they are in the same dependence scc",
+  "there is no point to distribute loop"};
+
 /* Merge PARTITION into the partition DEST.  */
 
 static void
-partition_merge_into (partition *dest, partition *partition)
+partition_merge_into (partition *dest, partition *partition, enum fuse_type ft)
 {
   dest->kind = PKIND_NORMAL;
   bitmap_ior_into (dest->stmts, partition->stmts);
   if (partition_reduction_p (partition))
 dest->reduction_p = true;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+{
+  fprintf (dump_file, "Fuse partitions because %s:\n", fuse_message[ft]);
+  fprintf (dump_file, "  Part 1: ");
+  dump_bitmap (dump_file, dest->stmts);
+  fprintf (dump_file, "  Part 2: ");
+  dump_bitmap (dump_file, partition->stmts);
+}
 }
 
 
@@ -1534,13 +1561,7 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
    for (++i; partitions.iterate (i, &partition); ++i)
if (!partition_builtin_p (partition))
  {
-   if (dump_file && (dump_flags & TDF_DETAILS))
- {
-   fprintf (dump_file, "fusing non-builtin partitions\n");
-   dump_bitmap (dump_file, into->stmts);
-   dump_bitmap (dump_file, partition->stmts);
- }
-   partition_merge_into (into, partition);
+   partition_merge_into (into, partition, FUSE_NON_BUILTIN);
partitions.unordered_remove (i);
partition_free (partition);
i--;
@@ -1556,14 +1577,7 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
   for (i = i + 1; partitions.iterate (i, &partition); ++i)
 if (partition_reduction_p (partition))
   {
-   if (dump_file && (dump_flags & TDF_DETAILS))
- {
-   fprintf (dump_file, "fusing partitions\n");
-   dump_bitmap (dump_file, into->stmts);
-   dump_bitmap (dump_file, partition->stmts);
-   fprintf (dump_file, "because they have reductions\n");
- }
-   partition_merge_into (into, partition);
+   partition_merge_into (into, partition, FUSE_REDUCTION);
partitions.unordered_remove (i);
partition_free (partition);
i--;
@@ -1581,15 +1595,7 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
{
  if (similar_memory_accesses (rdg, into, partition))
{
- if (dump_file && (dump_flags & TDF_DETAILS))
-   {
- fprintf (dump_file, "fusing partitions\n");
- dump_bitmap (dump_file, into->stmts);
- dump_bitmap (dump_file, partition->stmts);
- fprintf (dump_file, "because they have similar "
-  "memory accesses\n");
-   }
- partition_merge_into (into, partition);
+ partition_merge_into (into, partition, FUSE_SHARE_REF);
  partitions.unordered_remove (j);
  partition_free (partition);
  j--;
@@ -1681,15 +1687,7 @@ distribute_loop (struct loop *loop, vec<gimple *> stmts,
  for (j = j + 1; partitions.iterate (j, &partition); ++j)
if (pg->vertices[j].component == i)
  {
-   if (dump_file && (dump_flags & TDF_DETAILS))
- {
-   fprintf (dump_file, "fusing partitions\n");
-   dump_bitmap (dump_file, first->

[PATCH GCC][02/13]Skip distribution if there is no loop

2017-06-12 Thread Bin Cheng
Hi,
This is a simple patch skipping distribution if there is no loop at all.

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin

2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (pass_loop_distribution::execute): Skip if
no loops.

From eb6a795331efde92fd6df1c6e612fb1ffa9f482f Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 09:30:40 +0100
Subject: [PATCH 02/14] fast-return-20170607.txt

---
 gcc/tree-loop-distribution.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index a60454b..9f0c801 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1758,6 +1758,9 @@ pass_loop_distribution::execute (function *fun)
   control_dependences *cd = NULL;
   auto_vec<loop_p> loops_to_be_destroyed;
 
+  if (number_of_loops (fun) <= 1)
+return 0;
+
   FOR_ALL_BB_FN (bb, fun)
 {
   gimple_stmt_iterator gsi;
-- 
1.9.1



[PATCH GCC][03/13]Mark and skip distributed loops

2017-06-12 Thread Bin Cheng
Hi,
This simple patch marks distributed loops and skips them in the following
distribution.
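
The mechanism is the usual aux-marking idiom; a generic stand-alone C sketch
(made-up types, not GCC's struct loop):

#include <stddef.h>

struct loop_s { struct loop_s *next; void *aux; };

static void
distribute_all (struct loop_s *loops)
{
  for (struct loop_s *l = loops; l; l = l->next)
    l->aux = NULL;              /* clear marks up front */

  for (struct loop_s *l = loops; l; l = l->next)
    {
      if (l->aux != NULL)
        continue;               /* skip loops this pass created itself */
      /* ... distribute l; every new loop copy gets copy->aux = copy ... */
    }

  for (struct loop_s *l = loops; l; l = l->next)
    l->aux = NULL;              /* leave aux clean for later passes */
}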

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (generate_loops_for_partition): Mark
distributed loops.
(pass_loop_distribution::execute): Skip distributed loops.

From 705ad383bb8a806eb8b0fcd6faa298938dd3176b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 7 Jun 2017 13:20:08 +0100
Subject: [PATCH 03/14] record-and-skip-distributed-loop-20170607.txt

---
 gcc/tree-loop-distribution.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 9f0c801..b0b9d66 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -618,8 +618,11 @@ generate_loops_for_partition (struct loop *loop, partition *partition,
 
   if (copy_p)
 {
+  int ldist_alias_id = loop->num;
   loop = copy_loop_before (loop);
   gcc_assert (loop != NULL);
+  loop->ldist_alias_id = ldist_alias_id;
+  loop->aux = (void *)loop;
   create_preheader (loop, CP_SIMPLE_PREHEADERS);
   create_bb_after_loop (loop);
 }
@@ -1770,6 +1773,9 @@ pass_loop_distribution::execute (function *fun)
gimple_set_uid (gsi_stmt (gsi), -1);
 }
 
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+loop->aux = NULL;
+
   /* We can at the moment only distribute non-nested loops, thus restrict
  walking to innermost loops.  */
   FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
@@ -1779,6 +1785,10 @@ pass_loop_distribution::execute (function *fun)
   int num = loop->num;
   unsigned int i;
 
+  /* Skip distributed loops.  */
+  if (loop->aux != NULL)
+   continue;
+
   /* If the loop doesn't have a single exit we will fail anyway,
 so do that early.  */
   if (!single_exit (loop))
@@ -1865,6 +1875,9 @@ out:
fprintf (dump_file, "Loop %d is the same.\n", num);
 }
 
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+loop->aux = NULL;
+
   if (cd)
 delete cd;
 
-- 
1.9.1



[PATCH GCC][01/13]Introduce internal function IFN_LOOP_DIST_ALIAS

2017-06-12 Thread Bin Cheng
Hi,
I was asked by upstream to split the loop distribution patch into small ones.
It is hard because the data structures and the algorithm are closely coupled.
Anyway, this is the patch series with smaller patches.  Basically I tried to
separate the data structure and bug-fix changes out, with one patch as the
main one.  Note I only did the code refactoring necessary to separate the
patches; apart from that, there is no change against the last version.

This is the first patch, introducing the new internal function
IFN_LOOP_DIST_ALIAS.  GCC will distribute loops under the condition of this
function call.
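
For illustration, the versioned control flow has the shape of the following
plain-C sketch.  loop_dist_alias_p is a stand-in for the internal call, not a
real API; per the changelog below, the vectorizer later folds the real call
depending on whether a distributed loop was vectorized, so exactly one branch
survives:

/* Stand-in for IFN_LOOP_DIST_ALIAS; not a real GCC function.  */
static int
loop_dist_alias_p (int loop_id, int alias_check)
{
  (void) loop_id;     /* the id only identifies the original loop */
  return alias_check;
}

void
example (char *a, char *b, int n, int no_overlap)
{
  if (loop_dist_alias_p (1, no_overlap))
    {
      for (int i = 0; i < n; i++)   /* distributed copies */
        a[i] = 0;
      for (int i = 0; i < n; i++)
        b[i] = (char) i;
    }
  else
    for (int i = 0; i < n; i++)     /* original loop, recovered if no
                                       distributed loop vectorizes */
      {
        a[i] = 0;
        b[i] = (char) i;
      }
}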

Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.ch...@arm.com>

* cfgloop.h (struct loop): New field ldist_alias_id.
* cfgloopmanip.c (lv_adjust_loop_entry_edge): Comment change.
* internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
* internal-fn.def (LOOP_DIST_ALIAS): New.
* tree-vectorizer.c (vect_loop_dist_alias_call): New function.
(fold_loop_dist_alias_call): New function.
(vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on
successful vectorization or not.

From 3598491598e0b425f1cfa4b7bb4c180886a08bef Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 7 Jun 2017 13:04:03 +0100
Subject: [PATCH 01/14] ifn_loop_dist_alias-20170607.txt

---
 gcc/cfgloop.h |  9 +++
 gcc/cfgloopmanip.c|  3 ++-
 gcc/internal-fn.c |  8 ++
 gcc/internal-fn.def   |  1 +
 gcc/tree-vectorizer.c | 75 ++-
 5 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index a8bec1d..be4187a 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -225,6 +225,15 @@ struct GTY ((chain_next ("%h.next"))) loop {
  builtins.  */
   tree simduid;
 
+  /* For loops generated by distribution with runtime alias checks, this
+ is a unique identifier of the original distributed loop.  Generally
+ it is the number of the original loop.  The IFN_LOOP_DIST_ALIAS builtin
+ uses this id as its first argument.  Given a loop with an id, we can
+ look upward in the dominance tree for the corresponding
+ IFN_LOOP_DIST_ALIAS builtin.  Note this id has no meaning after
+ IFN_LOOP_DIST_ALIAS is folded and eliminated.  */
+  int ldist_alias_id;
+
   /* Upper bound on number of iterations of a loop.  */
   struct nb_iter_bound *bounds;
 
diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index d764ab9..adb2f65 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -1653,7 +1653,8 @@ force_single_succ_latches (void)
 
   THEN_PROB is the probability of then branch of the condition.
   ELSE_PROB is the probability of else branch. Note that they may be both
-  REG_BR_PROB_BASE when condition is IFN_LOOP_VECTORIZED.  */
+  REG_BR_PROB_BASE when condition is IFN_LOOP_VECTORIZED or
+  IFN_LOOP_DIST_ALIAS.  */
 
 static basic_block
 lv_adjust_loop_entry_edge (basic_block first_head, basic_block second_head,
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 75fe027..96e40cb 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2250,6 +2250,14 @@ expand_LOOP_VECTORIZED (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+/* This should get folded in tree-vectorizer.c.  */
+
+static void
+expand_LOOP_DIST_ALIAS (internal_fn, gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* Expand MASK_LOAD call STMT using optab OPTAB.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index e162d81..79c19fb 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -158,6 +158,7 @@ DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_ORDERED_START, ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_ORDERED_END, ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (LOOP_VECTORIZED, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (LOOP_DIST_ALIAS, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (ANNOTATE,  ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (UBSAN_NULL, ECF_LEAF | ECF_NOTHROW, ".R.")
 DEF_INTERNAL_FN (UBSAN_BOUNDS, ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 1bef2e4..0d83d33 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -469,6 +469,63 @@ fold_loop_vectorized_call (gimple *g, tree value)
 }
 }
 
+/* If LOOP has been versioned during loop distribution, return the internal
+   call guarding it.  */
+
+static gimple *
+vect_loop_dist_alias_call (struct loop *loop)
+{
+  gimple_stmt_iterator gsi;
+  gimple *g;
+  basic_block bb = loop_preheader_edge (loop)->src;
+  struct loop *outer_loop = bb->loop_father;
+
+  /* Look upward in dominance tree.  */
+  for (; bb != ENTRY_BLOCK_PTR_FOR_FN (cfun) && bb->loop_father == outer_loop;
+   bb = get_immediate_dominator (CDI

[PATCH GCC][5/5]Enable tree loop distribution at -O3 and above optimization levels.

2017-06-02 Thread Bin Cheng
Hi,
This patch enables -ftree-loop-distribution by default at -O3 and above
optimization levels.
Bootstrap and test at O2/O3 on x86_64 and AArch64.  Is it OK?

Note I don't have a strong opinion here and am fine with it being either
accepted or rejected.

Thanks,
bin
2017-05-31  Bin Cheng  <bin.ch...@arm.com>

* opts.c (default_options_table): Enable OPT_ftree_loop_distribution
for -O3 and above levels.

From e7f43d62eb8aa8d29700e5ed1cb737eec813860f Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 30 May 2017 15:02:36 +0100
Subject: [PATCH 5/5] enable-loop-distribution-O3-20170525.txt

---
 gcc/opts.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/opts.c b/gcc/opts.c
index ffedb10..e2427b3 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -525,6 +525,7 @@ static const struct default_options default_options_table[] =
 
 /* -O3 optimizations.  */
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
+{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribution, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fpredictive_commoning, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fsplit_paths, NULL, 1 },
 /* Inlining of functions reducing size is a good idea with -Os
-- 
1.9.1



[PATCH GCC][4/5]Improve loop distribution to handle hmmer

2017-06-02 Thread Bin Cheng
Hi,
This is the main patch of the change.  It improves loop distribution by
versioning the loop under runtime alias check conditions, as well as by better
partition fusion.  As described in the comments, the patch implements
distribution in the following steps:

 1) Seed partitions with specific type statements.  For now we support
two types of seed statements: statements defining variables used outside
of the loop, and statements storing to memory.
 2) Build reduced dependence graph (RDG) for the loop to be distributed.
The vertices (RDG:V) model all statements in the loop and the edges
(RDG:E) model flow and control dependences between statements.
 3) Apart from the RDG, compute data dependences between memory references.
 4) Starting from a seed statement, build up a partition by adding
dependent statements according to the RDG's dependence information.
A partition is classified as parallel type if it can be executed in
parallel, or as sequential type if it can't.  A parallel type partition
is further classified as different builtin kinds if it can be
implemented as builtin function calls.
 5) Build partition dependence graph (PG) based on data dependences.
The vertices (PG:V) model all partitions and the edges (PG:E) model
all data dependences between every pair of partitions.  In general,
a data dependence is either known or unknown at compilation time.  In C
family languages there exist quite a few compilation-time-unknown
dependences because of possible alias relations between data references.
We categorize PG's edges into two types: "true" edges, which represent
compilation-time-known data dependences, and "alias" edges, for all
other data dependences.
 6) Traverse the subgraph of PG as if all "alias" edges didn't exist.
Merge partitions in each strongly connected component (SCC)
correspondingly.  Build a new PG for the merged partitions.
 7) Traverse PG again, this time with both "true" and "alias" edges
included.  We try to break SCCs by removing some edges.  Because
SCCs formed by "true" edges were all fused in step 6), we can break
SCCs by removing some "alias" edges.  It's NP-hard to choose an
optimal edge set; fortunately a simple approximation is good enough
given the small problem scale.
 8) Collect all data dependences of the removed "alias" edges.  Create
runtime alias checks for the collected data dependences.
 9) Version the loop under the condition of the runtime alias checks.
Since loop distribution generally introduces additional overhead, it is
only useful if vectorization is achieved in a distributed loop.  We
version the loop with the internal function call IFN_LOOP_DIST_ALIAS.
If no distributed loop can be vectorized, we simply remove the
distributed loops and recover to the original one, as sketched below.
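
To make the scheme concrete, here is a minimal hand-written C illustration
(my example under the stated assumptions, not the hmmer kernel; the runtime
check is written out as the address comparison the compiler would emit).  The
two statements form two partitions linked only by a compile-time-unknown
"alias" dependence between a and b:

void
f_original (int *a, int *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = 0;       /* partition 1: memset-like */
      b[i] = 2 * i;   /* partition 2: only an "alias" edge links them */
    }
}

void
f_distributed (int *a, int *b, int n)
{
  if (a + n <= b || b + n <= a)     /* runtime alias check, step 8) */
    {
      __builtin_memset (a, 0, sizeof (int) * (unsigned) n); /* partition 1 */
      for (int i = 0; i < n; i++)   /* partition 2, now vectorizable */
        b[i] = 2 * i;
    }
  else
    for (int i = 0; i < n; i++)     /* original loop kept, step 9) */
      {
        a[i] = 0;
        b[i] = 2 * i;
      }
}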

Also, there is more to improve in the future (which shouldn't be difficult):
   TODO:
 1) We only distribute innermost loops now.  This pass should handle loop
nests in the future.
 2) We only fuse partitions in SCC now.  A better fusion algorithm is
desired to minimize loop overhead, maximize parallelism and maximize
data reuse.

This patch also fixes a couple of latent bugs in the original implementation.

After this change, the kernel loop in hmmer can be distributed and vectorized
as a result.  This gives an obvious performance improvement.  There is still
an inefficient code generation issue which I will try to fix in loop split.
Apart from this, the next opportunity in hmmer is to eliminate a number of
dead stores given proper alias information.
Bootstrap and test at O2/O3 on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-05-31  Bin Cheng  <bin.ch...@arm.com>

* cfgloop.h (struct loop): New field ldist_alias_id.
* cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for
new internal function.
* internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
* internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function.
* tree-loop-distribution.c: Add general explanation of the pass.
Include header file.
(struct ddr_entry, struct ddr_entry_hasher): New structs.
(ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions.
(bb_top_order_index, bb_top_order_index_size): New static vars.
(bb_top_order_cmp): New function.
(stmts_from_loop): Get basic blocks in topological order.  Don't
free data references.
(build_rdg): New parameter pointing to vector of data references.
Store data references in it.
(enum partition_type): New enum.
(enum partition_kind, struct partition): Add comments.  New fields.
(partition_alloc, partition_free): Handle new fields of partition.
(enu

[PATCH GCC][2/5]Extend graph data structure

2017-06-02 Thread Bin Cheng
Hi,
This patch extends the graph data structure in two ways:
  1) Passes private data to the callback function of for_each_edge.
  2) Adds a new callback function to graph traversing functions like
graphds_scc and graphds_dfs.  The callback function acts as a supplementary
constraint for edges on top of the subgraph constraint.  With this change,
the traversing function not only skips vertices/edges not belonging to the
subgraph, but also skips an edge whenever the callback function returns true
on it.  As a result, a pass like loop distribution can traverse the
dependence graph with some dependence edges skipped, as in the sketch below.
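
Here is a stand-alone C sketch of that idea (an adjacency matrix and made-up
names rather than the graphds structures): a DFS that consults a skip-edge
predicate and treats matching edges as absent.

#include <stdio.h>

#define N 4

typedef int (*skip_edge_fn) (int from, int to);

static int adj[N][N];    /* adj[f][t] != 0 means there is an edge f->t */
static int visited[N];

static void
dfs (int v, skip_edge_fn skip_p)
{
  visited[v] = 1;
  printf ("visit %d\n", v);
  for (int t = 0; t < N; t++)
    if (adj[v][t] && !visited[t] && (!skip_p || !skip_p (v, t)))
      dfs (t, skip_p);
}

/* Example predicate: treat the edge 0->2 as an "alias" edge to skip.  */
static int
skip_alias (int f, int t)
{
  return f == 0 && t == 2;
}

int
main (void)
{
  adj[0][1] = adj[0][2] = adj[1][3] = 1;
  dfs (0, skip_alias);   /* visits 0, 1, 3 but never reaches 2 */
  return 0;
}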

Bootstrap and test at O2/O3 on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-05-31  Bin Cheng  <bin.ch...@arm.com>

* graphds.c (add_edge): Initialize edge's attached data.
(foll_in_subgraph, dfs_fst_edge, dfs_next_edge): New function
pointer parameter.  Call pointed function on each edge during
graph traversing.  Skip traversing the edge when the function
returns true.
(graphds_dfs, graphds_scc): Ditto.
(for_each_edge): New parameter.  Pass the new parameter to callback
function.
* graphds.h (skip_edge_callback): New function pointer type.
(graphds_dfs, graphds_scc): New function pointer parameter.
(graphds_edge_callback, for_each_edge): New parameter.

From 46d7f7e90144bb3878ed4b807ee572fa6d6a2915 Mon Sep 17 00:00:00 2001
From: amker <amker@amker-laptop.(none)>
Date: Mon, 29 May 2017 21:25:18 +0800
Subject: [PATCH 2/5] extend-graph-data-struct-20170525.txt

---
 gcc/graphds.c | 66 ++-
 gcc/graphds.h | 10 +
 2 files changed, 49 insertions(+), 27 deletions(-)

diff --git a/gcc/graphds.c b/gcc/graphds.c
index e7cb19f..2951349 100644
--- a/gcc/graphds.c
+++ b/gcc/graphds.c
@@ -81,6 +81,7 @@ add_edge (struct graph *g, int f, int t)
   e->succ_next = vf->succ;
   vf->succ = e;
 
+  e->data = NULL;
   return e;
 }
 
@@ -133,20 +134,28 @@ dfs_edge_dest (struct graph_edge *e, bool forward)
 }
 
 /* Helper function for graphds_dfs.  Returns the first edge after E (including
-   E), in the graph direction given by FORWARD, that belongs to SUBGRAPH.  */
+   E), in the graph direction given by FORWARD, that belongs to SUBGRAPH.  If
+   SKIP_EDGE_P is not NULL, it points to a callback function.  Edge E will be
+   skipped if callback function returns true.  */
 
 static inline struct graph_edge *
-foll_in_subgraph (struct graph_edge *e, bool forward, bitmap subgraph)
+foll_in_subgraph (struct graph_edge *e, bool forward, bitmap subgraph,
+		  skip_edge_callback skip_edge_p)
 {
   int d;
 
-  if (!subgraph)
+  if (!e)
+return e;
+
+  if (!subgraph && (!skip_edge_p || !skip_edge_p (e)))
 return e;
 
   while (e)
 {
   d = dfs_edge_dest (e, forward);
-  if (bitmap_bit_p (subgraph, d))
+  /* Return edge if it belongs to subgraph and shouldn't be skipped.  */
+  if ((!subgraph || bitmap_bit_p (subgraph, d))
+	  && (!skip_edge_p || !skip_edge_p (e)))
 	return e;
 
   e = forward ? e->succ_next : e->pred_next;
@@ -156,36 +165,45 @@ foll_in_subgraph (struct graph_edge *e, bool forward, bitmap subgraph)
 }
 
 /* Helper function for graphds_dfs.  Select the first edge from V in G, in the
-   direction given by FORWARD, that belongs to SUBGRAPH.  */
+   direction given by FORWARD, that belongs to SUBGRAPH.  If SKIP_EDGE_P is not
+   NULL, it points to a callback function.  Edge E will be skipped if callback
+   function returns true.  */
 
 static inline struct graph_edge *
-dfs_fst_edge (struct graph *g, int v, bool forward, bitmap subgraph)
+dfs_fst_edge (struct graph *g, int v, bool forward, bitmap subgraph,
+	  skip_edge_callback skip_edge_p)
 {
   struct graph_edge *e;
 
   e = (forward ? g->vertices[v].succ : g->vertices[v].pred);
-  return foll_in_subgraph (e, forward, subgraph);
+  return foll_in_subgraph (e, forward, subgraph, skip_edge_p);
 }
 
 /* Helper function for graphds_dfs.  Returns the next edge after E, in the
-   graph direction given by FORWARD, that belongs to SUBGRAPH.  */
+   graph direction given by FORWARD, that belongs to SUBGRAPH.  If SKIP_EDGE_P
+   is not NULL, it points to a callback function.  Edge E will be skipped if
+   callback function returns true.  */
 
 static inline struct graph_edge *
-dfs_next_edge (struct graph_edge *e, bool forward, bitmap subgraph)
+dfs_next_edge (struct graph_edge *e, bool forward, bitmap subgraph,
+	   skip_edge_callback skip_edge_p)
 {
   return foll_in_subgraph (forward ? e->succ_next : e->pred_next,
-			   forward, subgraph);
+			   forward, subgraph, skip_edge_p);
 }
 
 /* Runs dfs search over vertices of G, from NQ vertices in queue QS.
The vertices in postorder are stored into QT.  If FORWARD is false,
backward dfs is run.  If SUBGRAPH is not NULL, it specifies the
subgraph of G to run DFS 

[PATCH GCC][3/5]Move pass ivcanon upward in compilation process

2017-06-02 Thread Bin Cheng
Hi,
This patch moves pass ivcanon before loop distribution.  Pass loop split could
create loops with limited niters.  Such loops should be unrolled before loop
distribution (or graphite), rather than after.

Bootstrap and test at O2/O3 on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-05-31  Bin Cheng  <bin.ch...@arm.com>

* passes.def (pass_iv_canon): Move before pass_loop_distribution.

From 1698cc3e552a17e84719dba1ff2fbe4a8890e6be Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Tue, 30 May 2017 17:56:05 +0100
Subject: [PATCH 3/5] move-ivcanon-pass-20170529.txt

---
 gcc/passes.def | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index 10a18bf..beb350b 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -277,6 +277,7 @@ along with GCC; see the file COPYING3.  If not see
 	 empty loops.  Remove them now.  */
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_record_bounds);
+	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_loop_distribution);
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_graphite);
@@ -286,7 +287,6 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_dce);
 	  POP_INSERT_PASSES ()
-	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_parallelize_loops, false /* oacc_kernels_p */);
 	  NEXT_PASS (pass_expand_omp_ssa);
 	  NEXT_PASS (pass_ch_vect);
-- 
1.9.1


