Re: [PATCH] avoid transform for run-until-wrap comparison

2021-09-02 Thread Bin.Cheng via Gcc-patches
On Thu, Sep 2, 2021 at 6:18 PM Richard Biener  wrote:
>
> On Thu, 2 Sep 2021, Jiufu Guo wrote:
>
> > When transform
> >   {iv0.base, iv0.step} LT/LE {iv1.base, iv1.step}
> > to
> >   {iv0.base, iv0.step - iv1.step} LT/LE {iv1.base, 0}
> >
> > There would be an error if 'iv0.step - iv1.step' is negative,
> > which means the loop runs until wrap/overflow.
> >
> > For example:
> >{1, +, 1} <= {4, +, 3} => {1, +, -2} <= {4, +, 0}
> >
> > This patch avoids this kind of transform.
> >
> > Bootstrap and regtest pass on ppc64le.
> > Is this ok for trunk?
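
A minimal runnable sketch of that divergence (editorial illustration in
8-bit unsigned arithmetic; this program is mine, not part of the patch):

  #include <stdio.h>
  /* {1, +, 1} <= {4, +, 3} stays true as the gap grows, but the folded
     IV {1, +, -2} wraps immediately and flips the result.  */
  int main (void)
  {
    unsigned char iv0 = 1, iv1 = 4, folded = 1;  /* folded step: 1 - 3 = -2 */
    for (int i = 0; i < 3; i++)
      {
        printf ("iter %d: original %d, folded %d\n",
                i, iv0 <= iv1, folded <= 4);
        iv0 += 1; iv1 += 3; folded -= 2;
      }
    return 0;
  }

Already at iteration 1 the original form compares 2 <= 7 (true) while
the folded form compares 255 <= 4 (false).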
>
> This looks like PR100740, see the discussion starting at
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571570.html
>
> We seem to be at a dead end figuring what's exactly required
> to make the transform valid and I have my doubts that arguing
> with overflow we're not running in circles.
>
> My last attempt was
>
> +  if (code != NE_EXPR)
> +   {
> + if (TREE_CODE (step) != INTEGER_CST)
> +   return false;
> + if (!iv0->no_overflow || !iv1->no_overflow)
> +   return false;
> + /* The new IV does not overflow if the step has the same
> +sign and is less in magnitude.  */
> + if (tree_int_cst_sign_bit (iv0->step) != tree_int_cst_sign_bit (step)
> + || wi::geu_p (wi::abs (wi::to_wide (step)),
> +   wi::abs (wi::to_wide (iv0->step))))
> +   return false;
> +   }
>
> where your patch at least misses { 1, +, -1 } < { -3, + , -3 }
> transforming to { 1, +, +2 } < -3, thus a positive step but
> we're still iterating until wrap?  That is, the sign of the
> combined step cannot really be the issue at hand.
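
To make that counterexample concrete, a runnable 8-bit sketch (editorial
illustration; in unsigned char, -1 and -3 read as 255 and 253):

  #include <stdio.h>
  int main (void)
  {
    int orig = 0, fold = 0;
    /* {1, +, -1} < {-3, +, -3}: exits after 2 iterations, when iv0 wraps.  */
    for (unsigned char i0 = 1, i1 = 253; i0 < i1; i0 -= 1, i1 -= 3)
      orig++;
    /* folded {1, +, +2} < -3: a positive step, yet 126 iterations.  */
    for (unsigned char c = 1; c < 253; c += 2)
      fold++;
    printf ("original: %d, folded: %d\n", orig, fold);  /* prints 2 and 126 */
    return 0;
  }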
>
> Bin argued my condition is too strict and I agreed, see
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574334.html
> which is the last mail in the thread so far (still without a fix :/)
>
> Somewhere it was said that we eventually should simply put
> preconditions into ->assumptions rather than trying to
> check ->no_overflow and friends, not sure if that's really
I did some experiments which can fix the PRs; however, it causes new
failures in graphite, possibly because of missing cases.  I didn't
have time to look into the graphite tests back then.

Thanks,
bin
> applicable here.
>
> Richard.
>
> > BR.
> > Jiufu
> >
> > gcc/ChangeLog:
> >
> > 2021-09-02  Jiufu Guo  
> >
> >   PR tree-optimization/102131
> >   * tree-ssa-loop-niter.c (number_of_iterations_cond):
> >   Avoid the transform for until-wrap conditions.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2021-09-02  Jiufu Guo  
> >
> >   PR tree-optimization/102131
> >   * gcc.dg/pr102131.c: New test.
> >
> > ---
> >  gcc/tree-ssa-loop-niter.c   | 18 +
> >  gcc/testsuite/gcc.dg/pr102131.c | 69 +
> >  2 files changed, 87 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.dg/pr102131.c
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index 7af92d1c893..52ce517af89 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -1866,6 +1866,24 @@ number_of_iterations_cond (class loop *loop,
> > || !iv0->no_overflow || !iv1->no_overflow))
> >   return false;
> >
> > +  /* GT/GE has been transformed to LT/LE already.
> > + cmp_code could be LT, LE or NE
> > +
> > + For LE/LT transform
> > + {iv0.base, iv0.step} LT/LE {iv1.base, iv1.step}
> > + to
> > + {iv0.base, iv0.step - iv1.step} LT/LE {iv1.base, 0}
> > + A negative iv0.step - iv1.step means the IV decreases until wrap,
> > + so the transform is not accurate.
> > +
> > + For example:
> > + {1, +, 1} <= {4, +, 3}
> > + is not the same as
> > + {1, +, -2} <= {4, +, 0}
> > +   */
> > +  if ((code == LE_EXPR || code == LT_EXPR) && tree_int_cst_sign_bit (step))
> > + return false;
> > +
> >iv0->step = step;
> >if (!POINTER_TYPE_P (type))
> >   iv0->no_overflow = false;
> > diff --git a/gcc/testsuite/gcc.dg/pr102131.c b/gcc/testsuite/gcc.dg/pr102131.c
> > new file mode 100644
> > index 000..0fcfaa132da
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr102131.c
> > @@ -0,0 +1,69 @@
> > +/* { dg-do run } */
> > +
> > +unsigned int a;
> > +int
> > +fun1 ()
> > +{
> > +  unsigned b = 0;
> > +  int c = 1;
> > +  for (; b < 3; b++)
> > +{
> > +  while (c < b)
> > + __builtin_abort ();
> > +  for (a = 0; a < 3; a++)
> > + c++;
> > +}
> > +  return 0;
> > +}
> > +
> > +unsigned b;
> > +int
> > +fun2 ()
> > +{
> > +  unsigned c = 0;
> > +  for (a = 0; a < 2; a++)
> > +for (b = 0; b < 2; b++)
> > +  if (++c < a)
> > + __builtin_abort ();
> > +  return 0;
> > +}
> > +
> > +int
> > +fun3 (void)
> > +{
> > +  unsigned a, b, c = 0;
> > +  for (a = 0; a < 10; a++)
> > +{
> > +  for (b = 0; b < 2; b++)
> > + {
> > +   c++;
> > +   if (c < a)
> > + {
> > +   __builtin_abort 

Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-24 Thread Bin.Cheng via Gcc-patches
On Wed, Aug 25, 2021 at 11:26 AM guojiufu  wrote:
>
> On 2021-08-16 09:33, Bin.Cheng wrote:
> > On Wed, Aug 4, 2021 at 10:42 AM guojiufu  wrote:
> >>
> ...
> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.inc b/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> new file mode 100644
> >> >> index 000..6eed3fa8aca
> >> >> --- /dev/null
> >> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.inc
> >> >> @@ -0,0 +1,63 @@
> >> >> +TYPE __attribute__ ((noinline))
> >> >> +foo_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
> >> >> +{
> >> >> +  for (l = L_BASE; n < l; l += C)
> >> >> +*a++ = *b++ + 1;
> >> >> +  return l;
> >> >> +}
> >> >> +
> >> >> +TYPE __attribute__ ((noinline))
> >> >> +bar_sign (int *__restrict__ a, int *__restrict__ b, TYPE l, TYPE n)
> >> >> +{
> >> >> +  for (l = L_BASE_DOWN; l < n; l -= C)
> > I noticed that both L_BASE and L_BASE_DOWN are defined as l, which
> > makes this test a bit confusing.  Could you clean the use of l, for
> > example, by using an auto var for the loop index variable?
> > Otherwise the patch looks good to me.  Thanks very much for the work.
>
> Hi,
>
> Sorry for bothering you here.
> I feel this would be an approval (with the comment) already :)
>
> With the code changed to make it a little clearer, as in:
>TYPE i;
>for (i = l; n < i; i += C)
>
> it may be ok to commit the patch to the trunk, right?
Yes please.  Thanks again for working on this.

Thanks,
bin
>
> BR,
> Jiufu
>
> >
> > Thanks,
> > bin
> >> >> +*a++ = *b++ + 1;
> >> >> +  return l;
> >> >> +}
> >> >> +
> >> >> +int __attribute__ ((noinline)) neq (int a, int b) { return a != b; }
> >> >> +
> >> >> +int a[1000], b[1000];
> >> >> +int fail;
> >> >> +
> >> >> +int
> ...
> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145_1.c b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> new file mode 100644
> >> >> index 000..94f6b99b893
> >> >> --- /dev/null
> >> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >> >> @@ -0,0 +1,15 @@
> >> >> +/* { dg-require-effective-target vect_int } */
> >> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> >> +#define TYPE signed char
> >> >> +#define MIN -128
> >> >> +#define MAX 127
> >> >> +#define N_BASE (MAX - 32)
> >> >> +#define N_BASE_DOWN (MIN + 32)
> >> >> +
> >> >> +#define C 3
> >> >> +#define L_BASE l
> >> >> +#define L_BASE_DOWN l
> >> >> +


Re: Ping: [PATCH v2] Analyze niter for until-wrap condition [PR101145]

2021-08-15 Thread Bin.Cheng via Gcc-patches
On Wed, Aug 4, 2021 at 10:42 AM guojiufu  wrote:
>
> Hi,
>
> I would like to have a ping on this.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574596.html
Sorry for being late in replying.

>
> BR,
> Jiufu
>
> On 2021-07-15 08:17, guojiufu via Gcc-patches wrote:
> > Hi,
> >
> > I would like to have an early ping on this with more mail addresses.
> >
> > BR,
> > Jiufu.
> >
> > On 2021-07-07 20:47, Jiufu Guo wrote:
> >> Changes since v1:
> >> * Update assumptions for niter, add more test case checks
> >> * Use widest_int/wide_int instead of mpz to do +-/
> >> * Move some early check for quick return
> >>
> >> For code like:
> >> unsigned foo(unsigned val, unsigned start)
> >> {
> >>   unsigned cnt = 0;
> >>   for (unsigned i = start; i > val; ++i)
> >> cnt++;
> >>   return cnt;
> >> }
> >>
> >> The number of iterations should be about UINT_MAX - start.
> >>
> >> There is function adjust_cond_for_loop_until_wrap which
> >> handles similar work for const bases.
> >> Like adjust_cond_for_loop_until_wrap, this patch enhances
> >> function number_of_iterations_cond/number_of_iterations_lt
> >> to analyze number of iterations for this kind of loop.
> >>
> >> Bootstrap and regtest pass on powerpc64le, x86_64 and aarch64.
> >> Is this ok for trunk?
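
An 8-bit analogue of foo above (editorial sketch; foo8 and the constants
are mine, not from the patch), confirming the expected until-wrap trip
count:

  #include <assert.h>
  #include <limits.h>
  static unsigned foo8 (unsigned char val, unsigned char start)
  {
    unsigned cnt = 0;
    for (unsigned char i = start; i > val; ++i)
      cnt++;
    return cnt;
  }
  int main (void)
  {
    /* i runs 10, 11, ..., UCHAR_MAX, then wraps to 0 and the loop exits:
       UCHAR_MAX - 10 + 1 == 246 iterations.  */
    assert (foo8 (0, 10) == UCHAR_MAX - 10 + 1);
    return 0;
  }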
> >>
> >> gcc/ChangeLog:
> >>
> >> 2021-07-07  Jiufu Guo  
> >>
> >>  PR tree-optimization/101145
> >>  * tree-ssa-loop-niter.c (number_of_iterations_until_wrap):
> >>  New function.
> >>  (number_of_iterations_lt): Invoke above function.
> >>  (adjust_cond_for_loop_until_wrap):
> >>  Merge to number_of_iterations_until_wrap.
> >>  (number_of_iterations_cond): Update invokes for
> >>  adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> 2021-07-07  Jiufu Guo  
> >>
> >>  PR tree-optimization/101145
> >>  * gcc.dg/vect/pr101145.c: New test.
> >>  * gcc.dg/vect/pr101145.inc: New test.
> >>  * gcc.dg/vect/pr101145_1.c: New test.
> >>  * gcc.dg/vect/pr101145_2.c: New test.
> >>  * gcc.dg/vect/pr101145_3.c: New test.
> >>  * gcc.dg/vect/pr101145inf.c: New test.
> >>  * gcc.dg/vect/pr101145inf.inc: New test.
> >>  * gcc.dg/vect/pr101145inf_1.c: New test.
> >> ---
> >>  gcc/testsuite/gcc.dg/vect/pr101145.c  | 187
> >> ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145.inc|  63 
> >>  gcc/testsuite/gcc.dg/vect/pr101145_1.c|  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_2.c|  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_3.c|  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145inf.c   |  25 +++
> >>  gcc/testsuite/gcc.dg/vect/pr101145inf.inc |  28 
> >>  gcc/testsuite/gcc.dg/vect/pr101145inf_1.c |  23 +++
> >>  gcc/tree-ssa-loop-niter.c | 157 ++
> >>  9 files changed, 463 insertions(+), 65 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf.inc
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145inf_1.c
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> new file mode 100644
> >> index 000..74031b031cf
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> @@ -0,0 +1,187 @@
> >> +/* { dg-require-effective-target vect_int } */
> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> +#include <limits.h>
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned
> >> n)
> >> +{
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l,
> >> unsigned n)
> >> +{
> >> +  while (UINT_MAX - 64 < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned
> >> n)
> >> +{
> >> +  l = UINT_MAX - 32;
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_3 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned
> >> n)
> >> +{
> >> +  while (n <= ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_4 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned
> >> n)
> >> +{  // infinite
> >> +  while (0 <= ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_5 (int 

[PATCH AArch64]Fix expanding of %w for *extend... pattern

2021-08-09 Thread bin.cheng via Gcc-patches
Hi,
When playing with std::experimental::simd, I found a bug newly introduced in
AArch64 backend.  As commit message describes:
  Pattern "*extend2_aarch64" is duplicated
  from the corresponding zero_extend pattern, however % needs
  to be expanded according to its mode iterator because the smov
  instruction is different from umov.

Without the fix, it forces expanding to the w register no matter what the
mode is.  This simple patch fixes the issue, which should be obvious.

Bootstrap and test on AArch64.  Any comments?

Thanks,
bin

pattern-signed-extend-20210809.patch
Description: Binary data


Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-27 Thread Bin.Cheng via Gcc-patches
On Mon, Jul 26, 2021 at 11:07 PM Jeff Law  wrote:
>
>
>
> On 7/25/2021 7:47 PM, Bin.Cheng wrote:
> > On Sat, Jul 24, 2021 at 12:30 AM Jeff Law via Gcc-patches
> >  wrote:
> >>
> >>
> >> On 7/14/2021 3:14 AM, bin.cheng via Gcc-patches wrote:
> >>> Hi,
> >>> I ran into a wrong code bug in code with deep template instantiation when 
> >>> working on sdx::simd.
> >>> The root cause as described in commit summary is we skip prologue insns 
> >>> in init_alias_analysis.
> >>> This simple patch fixes the issue, however, it's hard to reduce a case 
> >>> because of heavy use of
> >>> templates.
> >>> Bootstrap and test on x86_64, is it OK?
> >> It's a clear correctness improvement, but what's unclear to me is why
> >> we'd want to skip them in the epilogue either.
> > I can only guess: there is nothing to initialize for the epilogue because
> > no code follows.
> Yea, but couldn't the lack of analysis of the epilogue lead to a pass
> mis-optimizing code within the epilogue itself?  It's not terribly
> likely, but it just seems wrong to skip the epilogue like this.
> Remember, the aliasing bits are just an analysis phase to find the
> aliasing relationships that exist and we don't necessarily know how that
> data is going to be used.  It may in fact be safe now, but may not be
> safe in the future if someone added a late RTL pass that used the
> aliasing info in a new way.
>
> The more I think about it, the more I think we should remove this
> hunk of code completely.  There is some chance for fallout, but I think
> it's unlikely.
Hi Jeff,
Thanks for the suggestion, here is the simple patch removing all of it.
diff --git a/gcc/alias.c b/gcc/alias.c
index 69e1eb89ac6..099acabca6b 100644
--- a/gcc/alias.c
+++ b/gcc/alias.c
@@ -3406,14 +3406,6 @@ init_alias_analysis (void)
   rpo = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
   rpo_cnt = pre_and_rev_post_order_compute (NULL, rpo, false);

-  /* The prologue/epilogue insns are not threaded onto the
- insn chain until after reload has completed.  Thus,
- there is no sense wasting time checking if INSN is in
- the prologue/epilogue until after reload has completed.  */
-  bool could_be_prologue_epilogue = ((targetm.have_prologue ()
- || targetm.have_epilogue ())
-&& reload_completed);
-
   pass = 0;
   do
 {
@@ -3459,10 +3451,6 @@ init_alias_analysis (void)
{
  rtx note, set;

- if (could_be_prologue_epilogue
- && prologue_epilogue_contains (insn))
-   continue;
-
  /* If this insn has a noalias note, process it,  Otherwise,
 scan for sets.  A simple set will have no side effects
 which could change the base value of any other
register.  */

No fallouts in bootstrap/test on x86_64.  Is it OK?

Thanks,
bin
>
> Jeff
>


Re: [PATCH][RFC] tree-optimization/100499 - multiple_of_p bad behavior wrt niter analysis

2021-07-25 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 22, 2021 at 6:41 PM Richard Biener  wrote:
>
> This avoids using multiple_of_p in niter analysis when its behavior
Hmm, but this patch actually introduces one more call to
multiple_of_p; also, it doesn't touch the use below:
if (niter->control.no_overflow && multiple_of_p (type, c, s))
  {
niter->niter = fold_build2 (FLOOR_DIV_EXPR, niter_type, c, s);
return true;
  }

> to assume the tested expression does not invoke integer overflow
> produces wrong niter analysis results.  For the cases multiple_of_p
> handles power-of-two values of bottom look safe which should also be
> the majority of cases we care about in addition to the constant case
> now handled explicitely.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> I'm unsure how important a "perfect" solution is (rewriting
> multiple_of_p), and wonder whether this solution is preferable
I am still for this approach now, it only needs to be conservative,
rather than perfect, especially if there are not many breakages with a
conservative version of multiple_of_p?

> for now (and especially for branches).  I've not yet tried
> to sanitize multiple_of_p plus use range info to prove
> no-overflow where TYPE_OVERFLOW_UNDEFINED doesn't tell us
> immediately.
>
> 2021-07-22  Richard Biener  
>
> PR tree-optimization/100499
> * tree-ssa-loop-niter.c (number_of_iterations_ne): Restrict
> multiple_of_p queries to power-of-two bottoms, handle
> the all constant case inline.
>
> * gcc.dg/torture/pr100499-1.c: New testcase.
> * gcc.dg/torture/pr100499-2.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/torture/pr100499-1.c | 28 +++
>  gcc/testsuite/gcc.dg/torture/pr100499-2.c | 16 +
>  gcc/tree-ssa-loop-niter.c |  8 ++-
>  3 files changed, 51 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr100499-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr100499-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr100499-1.c b/gcc/testsuite/gcc.dg/torture/pr100499-1.c
> new file mode 100644
> index 000..97ab6051554
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr100499-1.c
> @@ -0,0 +1,28 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target int32plus } */
> +
> +typedef __UINT16_TYPE__ uint16_t;
> +static uint16_t g_2823 = 0xEC75L;
> +static uint16_t g_116 = 0xBC07L;
> +
> +static uint16_t
> +safe_mul_func_uint16_t_u_u(uint16_t ui1, uint16_t ui2)
> +{
> +  return ((unsigned int)ui1) * ((unsigned int)ui2);
> +}
> +
> +int main (int argc, char* argv[])
> +{
> +  uint16_t l_2815 = 65535UL;
> +  uint16_t *l_2821 = &g_116;
> +  uint16_t *l_2822 = &g_2823;
> +
> +lbl_2826:
> +  l_2815 &= 0x9DEF1EAEL;
> +  if (+(safe_mul_func_uint16_t_u_u(((*l_2821) = l_2815), (--(*l_2822)))))
> +goto lbl_2826;
> +
> +  if (g_2823 != 32768)
> +__builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/torture/pr100499-2.c b/gcc/testsuite/gcc.dg/torture/pr100499-2.c
> new file mode 100644
> index 000..999f931806a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr100499-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do run } */
> +
> +unsigned char ag = 55;
> +unsigned i;
> +int main()
> +{
> +  unsigned char c;
> +  unsigned char a = ag;
> +d:
> +  c = a-- * 52;
> +  if (c)
> +goto d;
> +  if (a != 255)
> +__builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index 1b5605c26b8..c6b953c5316 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -1050,7 +1050,13 @@ number_of_iterations_ne (class loop *loop, tree type, affine_iv *iv,
>   Note, for NE_EXPR, base equals to FINAL is a special case, in
>   which the loop exits immediately, and the iv does not overflow.  */
>if (!niter->control.no_overflow
> -  && (integer_onep (s) || multiple_of_p (type, c, s)))
> +  && (integer_onep (s)
> + || (poly_int_tree_p (c)
> + && multiple_p (wi::to_poly_widest (c), wi::to_poly_widest (s)))
> + /* ???  multiple_of_p assumes the expression 'c' does not overflow
> +but that cannot be guaranteed, so we restrict 's' to power of
> +two values where that should not be an issue.  See PR100499.  */
> + || (integer_pow2p (s) && multiple_of_p (type, c, s))))
>  {
>tree t, cond, new_c, relaxed_cond = boolean_false_node;
I (to be blamed) added this part of the code to specially handle cases like
pr34114; now I feel it's in the wrong direction.  Ideally this part of the
code is unnecessary and the conditions would be (they now are) incorporated
into niter->assumptions, which should be simplified to 1/0
correspondingly.  The only problem is that the assumptions are not
appropriately simplified.
Is it possible to incorporate a more powerful solver (like Z3) into GCC
for such cases, e.g., assumption simplification, multiple_of_p, etc.?
Oh, we don't do SCEV analysis 

Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-25 Thread Bin.Cheng via Gcc-patches
On Sat, Jul 24, 2021 at 12:30 AM Jeff Law via Gcc-patches
 wrote:
>
>
>
> On 7/14/2021 3:14 AM, bin.cheng via Gcc-patches wrote:
> > Hi,
> > I ran into a wrong code bug in code with deep template instantiation when 
> > working on sdx::simd.
> > The root cause as described in commit summary is we skip prologue insns in 
> > init_alias_analysis.
> > This simple patch fixes the issue, however, it's hard to reduce a case 
> > because of heavy use of
> > templates.
> > Bootstrap and test on x86_64, is it OK?
> It's a clear correctness improvement, but what's unclear to me is why
> we'd want to skip them in the epilogue either.
I can only guess: there is nothing to initialize for the epilogue because
no code follows.

Thanks,
bin
>
> Jeff


Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-23 Thread Bin.Cheng via Gcc-patches
On Fri, Jul 23, 2021 at 7:53 AM Segher Boessenkool
 wrote:
>
> On Wed, Jul 14, 2021 at 05:14:16PM +0800, bin.cheng via Gcc-patches wrote:
> > Hi,
> > I ran into a wrong code bug in code with deep template instantiation when 
> > working on sdx::simd.
> > The root cause as described in commit summary is we skip prologue insns in 
> > init_alias_analysis.
> > This simple patch fixes the issue, however, it's hard to reduce a case 
> > because of heavy use of
> > templates.
>
> > Subject: [PATCH 1/2] Don't skip prologue instructions as it could affect 
> > alias
> >  analysis
> >
> > In init_alias_analysis, we skip prologue instructions but this is
> > wrong as it could affect base value analysis for registers as well
> > as following analysis/transformations like cselib and dse:
> >   prologue:
> > x12 = 0x1810
> > sp = sp - x12
> >   ...
> > ...
> > x12 = SYMBOL_REF(.LC89)
> > Here reg_base_value[12] is set to ".LC89", however this is only true
> > after the second instruction setting x12.  The patch fixes the issue
> > by just handling prologue instructions as normal.  Though ideally it
> > can be improved in context-sensitive way.
>
> In what pass do you get the bad behaviour?  dse2?  postreload?  Or what
> else?
Hi Segher,
It is dse2 deleting non-dead stores based on wrong cselib info, which is
in turn based on wrong alias info.
The code flow is quite tricky, given insn list like:
 x12 = 0x1810
 sp = sp - x12
 ...
 [sp + offset] = x
 y = [sp + offset]
 ...
 [sp + offset] = z
 ...
 x12 = SYMBOL_REF(.LC89)

1. alias analysis computes a wrong reg_base_value[12] = symbol_ref(.LC89)
2. when dse2 processes "y = [sp + offset]", the calling sequence is:
  scan_insn
-> check_mem_read_rtx
  -> check_mem_read_rtx
-> canon_true_dependence(group_id == -1)
  -> true_dependence_1 which has below code:
--
  base = find_base_term (x_addr);
  if (base && (GET_CODE (base) == LABEL_REF
	       || (GET_CODE (base) == SYMBOL_REF
		   && CONSTANT_POOL_ADDRESS_P (base))))
    return 0;
x_addr is "sp - x12"; sp has no base term, but x12 has the
symbol_ref(.LC89) base term, so the code returns 0.  As a result,
dse2 considers the store of x dead when processing "[sp + offset] = z".

Sorry I can't reproduce a case out of it.

Thanks,
bin

>
> Your patch looks correct, but I'd like to know why it has seemed to work
> for so long :-)
>
>
> Segher


Re: 0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-22 Thread Bin.Cheng via Gcc-patches
Gentle ping.  Any suggestions would be appreciated.

Thanks,
bin

On Wed, Jul 14, 2021 at 5:15 PM bin.cheng via Gcc-patches
 wrote:
>
> Hi,
> I ran into a wrong code bug in code with deep template instantiation when 
> working on sdx::simd.
> The root cause as described in commit summary is we skip prologue insns in 
> init_alias_analysis.
> This simple patch fixes the issue, however, it's hard to reduce a case 
> because of heavy use of
> templates.
> Bootstrap and test on x86_64, is it OK?
>
> Thanks,
> bin


Re: [committed] add test for PR 86650

2021-07-20 Thread Bin.Cheng via Gcc-patches
On Wed, Jul 7, 2021 at 5:39 AM Martin Sebor via Gcc-patches
 wrote:
>
> The recent patch series to improve warning suppression for inlined
> functions [PR98512] also implicitly includes the inlining context
> in all warning messages for inlined code.  In r12-2091 I have
> committed the attached test to verify that -Warray-bounds too
> includes this context (its absence is the subject of PR 86650).
Hi,
It seems this patch exposes/causes an uninitialized warning in arm_neon.h,
like the following one:

__extension__ extern __inline void
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vst2q_s32 (int32_t * __a, int32x4x2_t __val)
{
  __builtin_aarch64_simd_oi __o;
  __o = __builtin_aarch64_set_qregoiv4si (__o, (int32x4_t) __val.val[0], 0);
  __o = __builtin_aarch64_set_qregoiv4si (__o, (int32x4_t) __val.val[1], 1);
  __builtin_aarch64_st2v4si ((__builtin_aarch64_simd_si *) __a, __o);
}

Thanks,
bin


[PATCH AArch64]Use stable sort in generating ldp/stp

2021-07-14 Thread bin.cheng via Gcc-patches
Hi,
Like the previous patch, this was found when I was playing with stx::simd.
It's an obvious change, as described in the commit summary.  Also, the dead
store in the code should be optimized away, but I guess there is no
guarantee, so here is a simple patch fixing it.


Is it OK?

Thanks,
bin

0002-AArch64-use-stable-sorting-in-generating-ldp-stp.patch
Description: Binary data


0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch

2021-07-14 Thread bin.cheng via Gcc-patches
Hi,
I ran into a wrong-code bug in code with deep template instantiation when
working on sdx::simd.  The root cause, as described in the commit summary,
is that we skip prologue insns in init_alias_analysis.  This simple patch
fixes the issue; however, it's hard to reduce a test case because of the
heavy use of templates.
Bootstrap and test on x86_64, is it OK?

Thanks,
bin

0001-Don-t-skip-prologue-instructions-as-it-could-affect-.patch
Description: Binary data


Re: [PATCH] New hook adjust_iv_update_pos

2021-07-06 Thread Bin.Cheng via Gcc-patches
On Mon, Jun 28, 2021 at 4:26 PM Richard Biener
 wrote:
>
> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo  wrote:
> >
> >
> >
> > On 2021/6/25 18:02, Richard Biener wrote:
> > > On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo  wrote:
> > >>
> > >>
> > >>
> > >> On 2021/6/25 16:54, Richard Biener wrote:
> > >>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> > >>>  wrote:
> > 
> >  From: Xiong Hu Luo 
> > 
> >  adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >  on Power.  For example, it generates mismatched address offsets after
> >  adjusting the IV update statement position:
> > 
> >   [local count: 70988443]:
> >  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >  ivtmp.30_415 = ivtmp.30_414 + 1;
> >  _34 = ref_180 + 18446744073709551615;
> >  _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >  if (_84 == _86)
> >  goto ; [94.50%]
> >  else
> >  goto ; [5.50%]
> > 
> >  Disable it will produce:
> > 
> >   [local count: 70988443]:
> >  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >  _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >  ivtmp.30_415 = ivtmp.30_414 + 1;
> >  if (_84 == _86)
> >  goto ; [94.50%]
> >  else
> >  goto ; [5.50%]
> > 
> >  Then the later loop unroll pass could benefit from the same address offset
> >  with different base addresses, which reduces register dependency.
> >  This patch could improve performance by 10% for a typical case on Power;
> >  no performance change is observed for X86 or AArch64, since small loops are
> >  not unrolled on these platforms.  Any comments?
> > >>>
> > >>> The case you quote is special in that if we hoisted the IV update before
> > >>> the other MEM _also_ used in the condition it would be fine again.
> > >>
> > >> Thanks.  I tried to hoist the IV update statement before the first MEM
> > >> (Fix 2); it shows even worse performance due to not unrolling (two more
> > >> "base-1" instructions are generated in gimple, so loop->ninsns is 11 and
> > >> small loops are not unrolled).  Changing the threshold from 10 to 12 in
> > >> rs6000_loop_unroll_adjust would make it also unroll 2 times, and the
> > >> performance is the SAME as with the IV update statement in the *MIDDLE*
> > >> (trunk).  From the ASM, we can see the index register %r4 is used in two
> > >> iterations, which may be a bottleneck for hiding instruction latency?
> > >>
> > >> Then it seems reasonable that the performance would be better if we keep
> > >> the IV update statement *LAST* (Fix 1).
> > >>
> > >> (Fix 2):
> > >> [local count: 70988443]:
> > >>ivtmp.30_415 = ivtmp.30_414 + 1;
> > >>_34 = ip_229 + 18446744073709551615;
> > >>_84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> > >>_33 = ref_180 + 18446744073709551615;
> > >>_86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> > >>if (_84 == _86)
> > >>  goto ; [94.50%]
> > >>else
> > >>  goto ; [5.50%]
> > >>
> > >>
> > >> .L67:
> > >>  lbzx %r12,%r24,%r4
> > >>  lbzx %r25,%r7,%r4
> > >>  cmpw %cr0,%r12,%r25
> > >>  bne %cr0,.L11
> > >>  mr %r26,%r4
> > >>  addi %r4,%r4,1
> > >>  lbzx %r12,%r24,%r4
> > >>  lbzx %r25,%r7,%r4
> > >>  mr %r6,%r26
> > >>  cmpw %cr0,%r12,%r25
> > >>  bne %cr0,.L11
> > >>  mr %r26,%r4
> > >> .L12:
> > >>  cmpdi %cr0,%r10,1
> > >>  addi %r4,%r26,1
> > >>  mr %r6,%r26
> > >>  addi %r10,%r10,-1
> > >>  bne %cr0,.L67
> > >>
> > >>>
> > >>> Now, adjust_iv_update_pos doesn't seem to check that the
> > >>> condition actually uses the IV use stmt def, so it likely applies to
> > >>> too many cases.
> > >>>
> > >>> Unfortunately the introducing rev didn't come with a testcase,
> > >>> but still I think fixing up adjust_iv_update_pos is better than
> > >>> introducing a way to short-cut it per target decision.
> > >>>
> > >>> One "fix" might be to add a check that either the condition
> > >>> lhs or rhs is the def of the IV use and the other operand
> > >>> is invariant.  Or if it's of similar structure hoist across the
> > >>> other iv-use as well.  Not that I understand the argument
> > >>> about the overlapping life-range.
> > >>>
> > >>> You also don't provide a complete testcase ...
> > >>>
> > >>
> > >> Attached the test code, will also add it to the patch in a future version.
> > >> The issue comes from a very small hot loop:
> > >>
> > >>  do {
> > >>len++;
> > >>  } while(len < maxlen && ip[len] == ref[len]);
> > >
> > > unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int 
> > > maxlen)
> > > {
> > >unsigned int len = 2;
> > >do {
> > >len++;
> > >}while(len < maxlen && ip[len] == ref[len]);
> > >return len;
> > > }
> > >
> > > I can see the effect on this loop on x86_64 as well, we end 

Re: [PATCH PR100740]Fix overflow check in simplifying exit cond comparing two IVs.

2021-07-01 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 1, 2021 at 8:19 PM Richard Biener
 wrote:
>
> On Mon, Jun 7, 2021 at 4:35 PM Richard Biener
>  wrote:
> >
> > On Sun, Jun 6, 2021 at 12:01 PM Bin.Cheng  wrote:
> > >
> > > On Wed, Jun 2, 2021 at 3:28 PM Richard Biener via Gcc-patches
> > >  wrote:
> > > >
> > > > On Tue, Jun 1, 2021 at 4:00 PM bin.cheng via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > > As described in patch summary, this fixes the wrong code issue by 
> > > > > adding overflow-ness
> > > > > check for iv1.step - iv2.step.
> > > > >
> > > > > Bootstrap and test on x86_64.  Any comments?
> > > >
> > > > + bool wrap_p = TYPE_OVERFLOW_WRAPS (step_type);
> > > > + if (wrap_p)
> > > > +   {
> > > > + tree t = fold_binary_to_constant (GE_EXPR, step_type,
> > > > +   iv0->step, iv1->step);
> > > > + wrap_p = integer_zerop (t);
> > > > +   }
> > > >
> > > > I think we can't use TYPE_OVERFLOW_WRAPS/TYPE_OVERFLOW_UNDEFINED since
> > > > that's only relevant for expressions written by the user - we're
> > > > computing iv0.step - iv1.step
> > > > which can even overflow when TYPE_OVERFLOW_UNDEFINED (in fact we may not
> > > > even generate this expression then!).  So I think we have to do sth like
> > > >
> > > >/* If the iv0->step - iv1->step wraps, fail.  */
> > > >if (!operand_equal_p (iv0->step, iv1->step)
> > > >&& (TREE_CODE (iv0->step) != INTEGER_CST || TREE_CODE
> > > > (iv1->step) != INTEGER_CST)
> > > >&& !wi::gt (wi::to_widest (iv0->step), wi::to_widest (iv1->step))
> > > >  return false;
> > > >
> > > > which only handles equality and all integer constant steps. You could
> > > Thanks for the suggestion.  I realized that we have LE/LT/NE
> > > conditions here, and for LE/LT what we need to check is that iv0/iv1
> > > converge to each other, rather than diverge.  Also steps here can only
> > > be constants, so there is no need to use range information.
> >
> > Ah, that simplifies things.
> >
> > + if (tree_int_cst_lt (iv0->step, iv1->step))
> > +   return false;
> >
> > so it looks to me that iv?->step can be negative which means we should
> > verify that abs(iv0->step - iv1->step) <= abs (iv0->step), correct?
> >
> >tree step = fold_binary_to_constant (MINUS_EXPR, step_type,
> >iv0->step, iv1->step);
> > ...
> > + if (TREE_CODE (step) != INTEGER_CST)
> > +   return false;
> >
> > note fold_binary_to_constant will return NULL if the result is not
> > TREE_CONSTANT (which would also include symbolic constants
> > like  - ).  It wasn't checked before, of course but since we're
> > touching the code we might very well be checking for NULL step ...
> > (or assert it is not for documentation purposes).
> >
> > That said, if iv0->step and iv1->step are known INTEGER_CSTs
> > (I think they indeed are given the constraints we impose on
> > simple_iv_with_niters).
> >
> > That said, with just a quick look again it looks to me the
> > IV1 {<=,<} IV2 transform to IV1 - IV2step {<=,<} IV2base
> > is OK whenever the effective step magnitude on the IV1'
> > decreases, thus abs(IV1.step - IV2.step) <= abs(IV1.step)
> > since then IV1 is still guaranteed to not overflow.  But
> > for example {0, +, 1} and {10, -, 1} also converge if the
> > number of iterations is less than 10 but they would not pass
> > this criteria.  So I'm not sure "convergence" is a good wording
> > here - but maybe I'm missing some critical piece of understanding
> > here.
> >
> > But in any case it looks like we're on the right track ;)
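
A quick check of that example (editorial sketch): {0, +, 1} and {10, -, 1}
meet after 5 iterations with no wrap involved, even though
|1 - (-1)| = 2 > |1|; the magnitude criterion is thus sufficient but not
necessary:

  #include <stdio.h>
  int main (void)
  {
    int iters = 0;
    for (unsigned char a = 0, b = 10; a < b; a++, b--)
      iters++;
    printf ("%d\n", iters);  /* prints 5 */
    return 0;
  }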
>
> Just to pick up things where we left them (and seeing the patch to
> unti-wrap which reminded me), I've digged in myself a bit and
> came to the following conclusion.
>
> The b0 + s0 < b1 + s1 -> b0 + s0 - s1 < b1 transform is only
> valid if b0 + s0 - s1 does not wrap which we can only ensure
> by ensuring that b0 + s0 and b1 + s1 do not wrap (which we
> already check) but also - what we're missing - that (s0 - s1)
> makes b0 still evolve in the same direction

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 1, 2021 at 10:15 PM guojiufu via Gcc-patches
 wrote:
>
> On 2021-07-01 20:35, Richard Biener wrote:
> > On Thu, 1 Jul 2021, Jiufu Guo wrote:
> >
> >> For code like:
> >> unsigned foo(unsigned val, unsigned start)
> >> {
> >>   unsigned cnt = 0;
> >>   for (unsigned i = start; i > val; ++i)
> >> cnt++;
> >>   return cnt;
> >> }
> >>
> >> The number of iterations should be about UINT_MAX - start.
> >
> > For
> >
> > unsigned foo(unsigned val, unsigned start)
> > {
> >   unsigned cnt = 0;
> >   for (unsigned i = start; i >= val; ++i)
> > cnt++;
> >   return cnt;
> > }
> >
> > and val == 0 the loop never terminates.  I don't see anywhere
> > in the patch that you disregard GE_EXPR and I remember
> > the code handles GE as well as GT?  From a quick look this is
> > also not covered by a testcase you add - not exactly sure
> > how it would materialize in a miscompilation.
>
> In number_of_iterations_cond, there is code:
> if (code == GE_EXPR || code == GT_EXPR
>     || (code == NE_EXPR && integer_zerop (iv0->step)))
>   {
>     std::swap (iv0, iv1);
>     code = swap_tree_comparison (code);
>   }
> It converts "GT/GE" (i >= val) to "LT/LE" (val <= i),
> and LE (val <= i) is converted to LT (val - 1 < i).
> So, the code is added to number_of_iterations_lt.
>
> But, this patch leads to mis-compilation for unsigned "i >= val" via the
> above transforms: converting LE (val <= i) to LT (val - 1 < i)
> is not appropriate (e.g. where val=0).
I don't know where the exact code is, but IIRC, the number-of-iterations code
handles boundary conditions when transforming <= into <.  You may
check it out.

> Thanks for pointing out this!!!
>
> I would investigate a way to handle this correctly.
> A possible way may be just to return false for this kind of LE.
IIRC, it checks the boundary conditions, either returns false or
simply introduces more assumptions.
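
A tiny sketch of that boundary hazard (editorial illustration; the values
are mine): rewriting the unsigned condition val <= i as val - 1 < i breaks
exactly at val == 0, because 0 - 1 wraps to UINT_MAX:

  #include <assert.h>
  int main (void)
  {
    unsigned val = 0, i = 5;
    assert (val <= i);        /* original condition: always true for val == 0 */
    assert (!(val - 1 < i));  /* rewritten: UINT_MAX < 5 is false */
    return 0;
  }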
>
> Any suggestions?
>
> >
> >> There is function adjust_cond_for_loop_until_wrap which
> >> handles similar work for const bases.
> >> Like adjust_cond_for_loop_until_wrap, this patch enhances
> >> function number_of_iterations_cond/number_of_iterations_lt
> >> to analyze number of iterations for this kind of loop.
> >>
> >> Bootstrap and regtest pass on powerpc64le, is this ok for trunk?
> >>
> >> gcc/ChangeLog:
> >>
> >>  PR tree-optimization/101145
> >>  * tree-ssa-loop-niter.c
> >>  (number_of_iterations_until_wrap): New function.
> >>  (number_of_iterations_lt): Invoke above function.
> >>  (adjust_cond_for_loop_until_wrap):
> >>  Merge to number_of_iterations_until_wrap.
> >>  (number_of_iterations_cond): Update invokes for
> >>  adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>  PR tree-optimization/101145
> >>  * gcc.dg/vect/pr101145.c: New test.
> >>  * gcc.dg/vect/pr101145.inc: New test.
> >>  * gcc.dg/vect/pr101145_1.c: New test.
> >>  * gcc.dg/vect/pr101145_2.c: New test.
> >>  * gcc.dg/vect/pr101145_3.c: New test.
> >> ---
> >>  gcc/testsuite/gcc.dg/vect/pr101145.c   | 187
> >> +
> >>  gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
> >>  gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
> >>  gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
> >>  gcc/tree-ssa-loop-niter.c  | 150 +++-
> >>  6 files changed, 380 insertions(+), 65 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr101145.c b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> new file mode 100644
> >> index 000..74031b031cf
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/vect/pr101145.c
> >> @@ -0,0 +1,187 @@
> >> +/* { dg-require-effective-target vect_int } */
> >> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> >> +#include <limits.h>
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned
> >> n)
> >> +{
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_1 (int *__restrict__ a, int *__restrict__ b, unsigned l,
> >> unsigned n)
> >> +{
> >> +  while (UINT_MAX - 64 < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_2 (int *__restrict__ a, int *__restrict__ b, unsigned l, unsigned
> >> n)
> >> +{
> >> +  l = UINT_MAX - 32;
> >> +  while (n < ++l)
> >> +*a++ = *b++ + 1;
> >> +  return l;
> >> +}
> >> +
> >> +unsigned __attribute__ ((noinline))
> >> +foo_3 (int *__restrict__ a, int *__restrict__ b, 

Re: [PATCH] Analyze niter for until-wrap condition [PR101145]

2021-07-01 Thread Bin.Cheng via Gcc-patches
On Thu, Jul 1, 2021 at 10:06 AM Jiufu Guo via Gcc-patches
 wrote:
>
> For code like:
> unsigned foo(unsigned val, unsigned start)
> {
>   unsigned cnt = 0;
>   for (unsigned i = start; i > val; ++i)
> cnt++;
>   return cnt;
> }
>
> The number of iterations should be about UINT_MAX - start.
>
> There is function adjust_cond_for_loop_until_wrap which
> handles similar work for const bases.
> Like adjust_cond_for_loop_until_wrap, this patch enhances
> function number_of_iterations_cond/number_of_iterations_lt
> to analyze number of iterations for this kind of loop.
>
> Bootstrap and regtest pass on powerpc64le, is this ok for trunk?
>
> gcc/ChangeLog:
>
> PR tree-optimization/101145
> * tree-ssa-loop-niter.c
> (number_of_iterations_until_wrap): New function.
> (number_of_iterations_lt): Invoke above function.
> (adjust_cond_for_loop_until_wrap):
> Merge to number_of_iterations_until_wrap.
> (number_of_iterations_cond): Update invokes for
> adjust_cond_for_loop_until_wrap and number_of_iterations_lt.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/101145
> * gcc.dg/vect/pr101145.c: New test.
> * gcc.dg/vect/pr101145.inc: New test.
> * gcc.dg/vect/pr101145_1.c: New test.
> * gcc.dg/vect/pr101145_2.c: New test.
> * gcc.dg/vect/pr101145_3.c: New test.
> ---
>  gcc/testsuite/gcc.dg/vect/pr101145.c   | 187 +
>  gcc/testsuite/gcc.dg/vect/pr101145.inc |  63 +
>  gcc/testsuite/gcc.dg/vect/pr101145_1.c |  15 ++
>  gcc/testsuite/gcc.dg/vect/pr101145_2.c |  15 ++
>  gcc/testsuite/gcc.dg/vect/pr101145_3.c |  15 ++
>  gcc/tree-ssa-loop-niter.c  | 150 +++-
>  6 files changed, 380 insertions(+), 65 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145.inc
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr101145_3.c
>

> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index b5add827018..06db6a36ef8 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -1473,6 +1473,86 @@ assert_loop_rolls_lt (tree type, affine_iv *iv0, affine_iv *iv1,
>  }
>  }
>
> +/* Determines number of iterations of loop whose ending condition
> +   is IV0 < IV1, which looks like:  {base, -C} < n,  or n < {base, C}.
> +   The number of iterations is stored to NITER.  */
> +
> +static bool
> +number_of_iterations_until_wrap (class loop *, tree type, affine_iv *iv0,
> +affine_iv *iv1, class tree_niter_desc *niter)
> +{
> +  tree niter_type = unsigned_type_for (type);
> +  tree max, min;
> +
> +  if (POINTER_TYPE_P (type))
> +{
> +  max = fold_convert (type, TYPE_MAX_VALUE (niter_type));
> +  min = fold_convert (type, TYPE_MIN_VALUE (niter_type));
> +}
> +  else
> +{
> +  max = TYPE_MAX_VALUE (type);
> +  min = TYPE_MIN_VALUE (type);
> +}
> +
> +  tree high = max, low = min, one = build_int_cst (niter_type, 1);
> +  tree step;
> +
> +  /* n < {base, C}. */
> +  if (integer_zerop (iv0->step) && TREE_CODE (iv1->step) == INTEGER_CST
> +  && !tree_int_cst_sign_bit (iv1->step))
> +{
> +  step = iv1->step;
> +  niter->niter = fold_build2 (MINUS_EXPR, niter_type, max, iv1->base);
max/iv1->base could be of pointer type, not sure if this is canonical though.

> +  if (TREE_CODE (iv1->base) == INTEGER_CST)
> +   low = fold_build2 (MINUS_EXPR, type, iv1->base, one);
> +  else if (TREE_CODE (iv0->base) == INTEGER_CST)
> +   low = iv0->base;
> +}
> +  /* {base, -C} < n. */
> +  else if (TREE_CODE (iv0->step) == INTEGER_CST
> +  && tree_int_cst_sign_bit (iv0->step) && integer_zerop (iv1->step))
> +{
> +  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (iv0->step), iv0->step);
> +  niter->niter = fold_build2 (MINUS_EXPR, niter_type, iv0->base, min);
> +  if (TREE_CODE (iv0->base) == INTEGER_CST)
> +   high = fold_build2 (PLUS_EXPR, type, iv0->base, one);
> +  else if (TREE_CODE (iv1->base) == INTEGER_CST)
> +   high = iv1->base;
> +}
> +  else
> +return false;
> +
> +  /* (delta + step - 1) / step */
> +  step = fold_convert (niter_type, step);
> +  niter->niter = fold_convert (niter_type, niter->niter);
> +  niter->niter = fold_build2 (PLUS_EXPR, niter_type, niter->niter, step);
> +  niter->niter = fold_build2 (FLOOR_DIV_EXPR, niter_type, niter->niter, 
> step);
> +
> +  tree m = fold_build2 (MINUS_EXPR, niter_type, high, low);
> +  m = fold_convert (niter_type, m);
> +  mpz_t mstep, tmp, mmax;
> +  mpz_init (mstep);
> +  mpz_init (tmp);
> +  mpz_init (mmax);
> +  wi::to_mpz (wi::to_wide (step), mstep, UNSIGNED);
> +  wi::to_mpz (wi::to_wide (m), mmax, UNSIGNED);
> +  mpz_add (tmp, mmax, mstep);
> +  mpz_sub_ui (tmp, 
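
A quick 8-bit sanity check of the computation above (editorial sketch; the
concrete values are mine): for n < {250, +, 3} with max == 255, delta is
255 - 250 = 5 and (5 + 3) / 3 == 2 under floor division, matching the loop:

  #include <assert.h>
  int main (void)
  {
    unsigned char i = 250, n = 0;
    unsigned cnt = 0;
    while (n < i && cnt < 10)   /* i: 250, 253, then wraps to 0 */
      {
        cnt++;
        i += 3;
      }
    assert (cnt == 2);
    return 0;
  }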

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-06-07 Thread Bin.Cheng via Gcc-patches
On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
 wrote:
>
> Hi,
>
> This RFC is motivated by the IV sharing RFC in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
> need to have the IVOPTS pass be able to clean up IV's shared between
> multiple loops. When creating a similar problem with C code I noticed
> IVOPTs treated IV's with uses outside the loop differently; this didn't
> even require multiple loops. Take for instance the following example
> using SVE intrinsics:
>
> #include <arm_sve.h>
> #include <limits.h>
> extern void use (char *);
> void bar (char  * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned n)
> {
>  svbool_t all_true = svptrue_b8 ();
>unsigned i = 0;
>if (n < (UINT_MAX - svcntb() - 1))
>  {
>  for (; i < n; i += svcntb())
>  {
>  svuint8_t va = svld1 (all_true, (uint8_t*)a);
>  svuint8_t vb = svld1 (all_true, (uint8_t*)b);
>  svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
>  a += svcntb();
>  b += svcntb();
>  c += svcntb();
>  }
>  }
>use (a);
> }
>
> IVOPTs tends to generate a shared IV for SVE memory accesses, as we
> don't have a post-increment for SVE load/stores. If we had not included
> 'use (a);' in this example, IVOPTs would have replaced the IV's for a, b
> and c with a single one (also used for the loop-control). See:
>
> [local count: 955630225]:
># ivtmp.7_8 = PHI 
>va_14 = MEM <__SVUint8_t> [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
>vb_15 = MEM <__SVUint8_t> [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
>MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
>ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
>i_23 = (unsigned int) ivtmp.7_25;
>if (n_9(D) > i_23)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
>   However, due to the 'use (a);' it will create two IVs: one for
> loop-control, b and c, and one for a. See:
>
>[local count: 955630225]:
># a_28 = PHI 
># ivtmp.7_25 = PHI 
>va_15 = MEM <__SVUint8_t> [(unsigned char *)a_28];
>vb_16 = MEM <__SVUint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>a_18 = a_28 + POLY_INT_CST [16, 16];
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>i_8 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_8)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs
> to not cost uses outside of the loop. This makes IVOPTs generate a
> single IV, but unfortunately it decides to create the variable for the
> use inside the loop and it also seems to use the pre-increment value of
> the shared-IV and add the [16,16] to it. See:
>
> [local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM <__SVUint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM <__SVUint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>_8 = (unsigned long) a_11(D);
>_7 = _8 + ivtmp.7_25;
>_6 = _7 + POLY_INT_CST [16, 16];
>a_18 = (char * restrict) _6;
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>i_5 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_5)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> With the patch 'var_after.patch' I make get_computation_aff_1 use
> 'cand->var_after' for outside uses thus using the post-increment var of
> the candidate IV. This means I have to insert it in a different place
> and make sure to delete the old use->stmt. I'm sure there is a better
> way to do this using IVOPTs current framework, but I didn't find one
> yet. See the result:
>
>[local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM <__SVUint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM <__SVUint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>_8 = (unsigned long) a_11(D);
>_7 = _8 + ivtmp.7_24;
>a_18 = (char * restrict) _7;
>i_6 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_6)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
>
> This is still not optimal as we are still doing the update inside the
> loop and there is absolutely no need for that. I found that running sink
> would solve it and it seems someone has added a second sink pass, so
> that saves me a third patch :) see after sink2:
>
> [local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM <__SVUint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM <__SVUint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, 

Re: [PATCH PR100740]Fix overflow check in simplifying exit cond comparing two IVs.

2021-06-06 Thread Bin.Cheng via Gcc-patches
On Wed, Jun 2, 2021 at 3:28 PM Richard Biener via Gcc-patches
 wrote:
>
> On Tue, Jun 1, 2021 at 4:00 PM bin.cheng via Gcc-patches
>  wrote:
> >
> > Hi,
> > As described in patch summary, this fixes the wrong code issue by adding 
> > overflow-ness
> > check for iv1.step - iv2.step.
> >
> > Bootstrap and test on x86_64.  Any comments?
>
> + bool wrap_p = TYPE_OVERFLOW_WRAPS (step_type);
> + if (wrap_p)
> +   {
> + tree t = fold_binary_to_constant (GE_EXPR, step_type,
> +   iv0->step, iv1->step);
> + wrap_p = integer_zerop (t);
> +   }
>
> I think we can't use TYPE_OVERFLOW_WRAPS/TYPE_OVERFLOW_UNDEFINED since
> that's only relevant for expressions written by the user - we're
> computing iv0.step - iv1.step
> which can even overflow when TYPE_OVERFLOW_UNDEFINED (in fact we may not
> even generate this expression then!).  So I think we have to do sth like
>
>/* If the iv0->step - iv1->step wraps, fail.  */
>if (!operand_equal_p (iv0->step, iv1->step)
>&& (TREE_CODE (iv0->step) != INTEGER_CST || TREE_CODE
> (iv1->step) != INTEGER_CST)
>&& !wi::gt (wi::to_widest (iv0->step), wi::to_widest (iv1->step))
>  return false;
>
> which only handles equality and all integer constant steps. You could
Thanks for the suggestion.  I realized that we have LE/LT/NE
conditions here, and for LE/LT what we need to check is that iv0/iv1
converge to each other, rather than diverge.  Also steps here can only
be constants, so there is no need to use range information.

> also use ranges
> like
>
>  wide_int min0, max0, min1, max1;
>   if (!operand_equal_p (iv->step, iv1->step)
>   && (determine_value_range (iv0->step, , ) != VR_RANGE
>  || determine_value_range (iv1->step, , ) != VR_RANGE
>  || !wi::ge (min0, max1)))
>return false;
>
> Note I'm not sure why
>
>iv0->step = step;
>if (!POINTER_TYPE_P (type))
> iv0->no_overflow = false;
I don't exactly remember, this was added sometime when no_overflow was
introduced.  Note we only do various checks for non NE_EXPR so the
step isn't always less in absolute value?  I will check if we should
reset it in all cases.

Patch updated.  test ongoing.

Thanks,
bin
>
> here the no_overflow reset does not happen for pointer types?  Or
> rather why does
> it happen at all?  Don't we strictly make the step less in absolute value?
>
> > Thanks,
> > bin
From 395dd6595cabebb7fd3e71a5fbfe84544d598608 Mon Sep 17 00:00:00 2001
Author: Bin Cheng 
Date: Fri, 28 May 2021 16:49:54 +0800
Subject: [PATCH] Simplify exit cond comparing two IVs only if they converge.

We should also check that iv1.step >= iv2.step so that the two IVs
converge to each other under comparison condition LE_EXPR/LT_EXPR.

gcc:
PR tree-optimization/100740
* tree-ssa-loop-niter.c (number_of_iterations_cond): Check
  whether the IVs converge or not.

gcc/testsuite:
* gcc.c-torture/execute/pr100740.c: New test.
---
 .../gcc.c-torture/execute/pr100740.c  | 11 ++
 gcc/tree-ssa-loop-niter.c | 20 +--
 2 files changed, 25 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr100740.c

diff --git a/gcc/testsuite/gcc.c-torture/execute/pr100740.c b/gcc/testsuite/gcc.c-torture/execute/pr100740.c
new file mode 100644
index 000..8fcdaffef3b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr100740.c
@@ -0,0 +1,11 @@
+/* PR tree-optimization/100740 */
+
+unsigned a, b;
+int main() {
+  unsigned c = 0;
+  for (a = 0; a < 2; a++)
+for (b = 0; b < 2; b++)
+  if (++c < a)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 325bd978609..6240084782a 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1782,7 +1782,9 @@ number_of_iterations_cond (class loop *loop,
  provided that either below condition is satisfied:
 
a) the test is NE_EXPR;
-   b) iv0.step - iv1.step is integer and iv0/iv1 don't overflow.
+   b) iv0.step and iv1.step are integers and:
+ - iv0 and iv1 don't overflow.
+ - iv0 and iv1 converge to each other, under cond LE_EXPR/LT_EXPR.
 
  This rarely occurs in practice, but it is simple enough to manage.  */
   if (!integer_zerop (iv0->step) && !integer_zerop (iv1->step))
@@ -1790,14 +1792,20 @@ number_of_iterations_cond (class loop *loop,
   tree step_type = POINTER_TYPE_P (type) ? sizetype : type;
   tree step = fold_binary_to_constant (MINUS_EXPR, step_type,
   iv0-&g

[PATCH PR100740]Fix overflow check in simplifying exit cond comparing two IVs.

2021-06-01 Thread bin.cheng via Gcc-patches
Hi,
As described in the patch summary, this fixes the wrong-code issue by adding
an overflow check for iv1.step - iv2.step.

Bootstrap and test on x86_64.  Any comments?

Thanks,
bin

pr100740-20210525.txt
Description: Binary data


[PATCH PR100499]Fix wrong niter info caused by overflow behavior

2021-05-26 Thread bin.cheng via Gcc-patches
Hi,
As described in PR100499, loop niters analysis for "!=" now relies on
multiple_of_p, which so far is mostly implemented for no-overflow scenarios.
This patch fixes the issue by:
1. adding a new parameter to multiple_of_p indicating no-wrapping behavior
in the top expression.
2. passing the new argument to help multiple_of_p, for example, for the case
in stor-layout.c.
3. adjusting calls in number_of_iterations_ne by checking multiple_of_p
without dependence on wrapping/no-wrapping behavior.  It also corrects one
issue in PR72817.

I noticed there are some improvements in ranger that could help the analysis
of multiple_of_p.  For a TOP expression like "left op right", we would like
to check whether this specific op wraps or not.
For example:
1. left (unsigned, multiple of 3) + 0xfffd: the result is multiple of 3 if 
"+" wraps even if
right operand is not multiple of 3.
2. left (unsigned, multiple of 3) + 6: the result is not multiple of 3 if "+" 
wraps even if the
right operand is multiple of 3.
So in the end, we might need ranger to tell whether such an operator wraps
or not, i.e., must_overflow, must_not_overflow, may_overflow.  Of course,
we need to be conservative in the last case.
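
An 8-bit analogue of those two cases (hedged sketch; 0xfd == 2^8 - 3 plays
the role of the large constant, and the concrete values are mine):

  #include <assert.h>
  int main (void)
  {
    unsigned char left = 252;       /* a multiple of 3 */
    unsigned char a = left + 0xfd;  /* wraps to 252 - 3 = 249, still % 3 == 0 */
    unsigned char b = left + 6;     /* wraps to 2, no longer % 3 == 0 */
    assert (0xfd % 3 != 0 && a % 3 == 0);
    assert (6 % 3 == 0 && b % 3 != 0);
    return 0;
  }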

Bootstrap and test on x86_64.  Any comments?

Thanks,
bin

pr100499-20210520.txt
Description: Binary data


Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

2021-05-07 Thread Bin.Cheng via Gcc-patches
On Fri, Apr 30, 2021 at 1:20 PM Feng Xue OS via Gcc-patches
 wrote:
>
> >> This patch implements a new loop optimization according to the proposal
> >> in RFC given at
> >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html.
> >> So I will not repeat the idea in this mail. I hope for your comments on it.
> >
> > With the caveat that I'm not an optimization expert (but no one else
> > seems to have replied), here are some thoughts.
> >
> > [...snip...]
> >
> >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
> >>  [PR98598]
> >
> > BTW, did you mean to also post patches 2 and 3?
> >
>
> Not yet, but they are ready. Since this is a special kind of optimization
> that uses the heap as temporary storage, not a common approach in gcc, we do
> not know the community's basic attitude towards it. So only the first patch
> was sent out for initial comments, in that it implements a generic MGO
> framework and is complete and self-contained. The other 2 patches comprise
> some enhancements for a specific code pattern and a dynamic alias check. If
> this proposal is accepted in principle, we will submit the other 2 for
> review.
>
> >
> >> In nested loops, if scattered memory accesses inside the inner loop remain
> >> unchanged in the outer loop, we can sequentialize these loads by caching
> >> their values into a temporary memory region the first time, and
> >> reuse the cached data in following iterations.  This can improve the
> >> efficiency of the cpu cache subsystem by reducing its unpredictable activities.
> >
> > I don't think you've cited any performance numbers so far.  Does the
> > optimization show a measurable gain on some benchmark(s)?  e.g. is this
> > ready to run SPEC yet, and how does it do?
>
> > Yes, we have done that.  Minor improvements of several percentage points
> > can be gained for some real applications.  To be specific, we also see a
> > major improvement of more than 30% for a certain benchmark in SPEC2017.
Hi Feng Xue,
Could you help point out which benchmark it is?  I failed to observe
the improvement in spec2017 on a local x86 machine.  I am running at
the O3 level.

Thanks,
bin
>
> >
> >> To illustrate what the optimization will do, two pieces of pseudo code,
> >> before and after transformation, are given. Suppose all loads and
> >> "iter_count" are invariant in outer loop.
> >>
> >> From:
> >>
> >>   outer-loop ()
> >> {
> >>   inner-loop (iter, iter_count)
> >> {
> >>   Type1 v1 = LOAD (iter);
> >>   Type2 v2 = LOAD (v1);
> >>   Type3 v3 = LOAD (v2);
> >>   ...
> >>   iter = NEXT (iter);
> >> }
> >> }
> >>
> >> To:
> >>
> >>   typedef struct cache_elem
> >> {
> >>   bool   init;
> >>   Type1  c_v1;
> >>   Type2  c_v2;
> >>   Type3  c_v3;
> >
> > Putting the "bool init;" at the front made me think "what about
> > packing?" but presumably the idea is that every element is accessed in
> > order, so it presumably benefits speed to have "init" at the top of the
> > element, right?
>
> Yes, the struct layout could be optimized in terms of size by
> some means, such as:
>   o. packing "init" into a padding hole after a certain field
>   o. if a certain field is a pointer type, the field can take the role of "init"
>   (non-NULL implies "initialized")
> For now this simple scheme is straightforward; it will be enhanced
> in various aspects later.
>
> >> } cache_elem;
> >>
> >>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
>
> > What if the allocation fails at runtime?  Do we keep an unoptimized
> > copy of the nested loops around as a fallback and have an unlikely
> > branch to that copy?
>
> Yes, we should, but in a different way: a flag is added into the original
> nested loop to control a runtime switch between optimized and
> unoptimized execution.  This definitely incurs runtime cost, but
> avoids possible code size bloat.  A better handling, as a TODO, is
> to apply the dynamic switch for large loops, and loop cloning for small ones.
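
(A sketch of the described fallback, in the thread's pseudo-code
notation; the exact shape is assumed, not taken from the actual patch.)

  cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
  bool use_cache = (cache_arr != NULL);   /* runtime fallback flag */

  outer-loop ()
    {
      inner-loop (iter, iter_count)
        {
          if (use_cache)
            { /* cached loads, as in the "To:" form */ }
          else
            { /* original scattered loads */ }
          iter = NEXT (iter);
        }
    }
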
>
> > I notice that you're using calloc, presumably to clear all of the
> > "init" flags (and the whole buffer).
> >
> > FWIW, this feels like a case where it would be nice to have a thread-
> > local heap allocation, perhaps something like an obstack implemented in
> > the standard library - but that's obviously scope creep for this.
>
> Yes, that's good, especially for multi-threaded applications.
>
> > Could it make sense to use alloca for small allocations?  (or is that
> > scope creep?)
>
> We did consider using alloca as you said.  But if we cannot determine an
> upper limit for a non-constant size, we have to place the alloca inside a
> loop that encloses the nested loop.  Without a corresponding free
> operation, this kind of alloca-in-loop might cause stack overflow.  So it
> becomes another TODO.
>
> >>   outer-loop ()
> >> {
> >>   size_t cache_idx = 0;
> >>
> >>   inner-loop (iter, iter_count)
> >> {
> >>   if 
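(The quoted pseudo code is cut off in the archive; below is a plausible
completion of the cached inner loop, reconstructed from the cache_elem
definition above.  This is an assumed reconstruction, not the author's
exact text.)

          if (!cache_arr[cache_idx].init)
            {
              cache_arr[cache_idx].c_v1 = LOAD (iter);
              cache_arr[cache_idx].c_v2 = LOAD (cache_arr[cache_idx].c_v1);
              cache_arr[cache_idx].c_v3 = LOAD (cache_arr[cache_idx].c_v2);
              cache_arr[cache_idx].init = true;
            }
          Type1 v1 = cache_arr[cache_idx].c_v1;
          Type2 v2 = cache_arr[cache_idx].c_v2;
          Type3 v3 = cache_arr[cache_idx].c_v3;
          ...
          cache_idx++;
          iter = NEXT (iter);
        }
    }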

Re: [PATCH RFC][PR98736]Compute and use programming order preserved RPO in loop distribution

2021-03-23 Thread bin.cheng via Gcc-patches
> > In the patch, I just duplicated pre_and_rev_post_order_compute_fn and
> > created the new function loop_first_rev_post_order_compute_fn.
> > I am not sure whether I should change the original function instead
> > (maybe not at this stage).  I am not sure about the name either, though
> > I haven't got a better one.
> > Any comments are appreciated.
> 
> So your new function seems to do what rev_post_order_and_mark_dfs_back_seme 
> does
> when you specify for_iteration = true.  Specifically that function then does
> 
>If FOR_ITERATION is true then compute an RPO where SCCs form a
>contiguous region in the RPO array.

Right, this is exactly what I need.  Attached is the updated patch.
Bootstrapped and tested on x86_64.
> 
> it in particular should handle the situation well where there are multiple 
> exits
> from a loop to different outer loops (consider a switch stmt exiting to
> the immediate outer or to the next outer loop) - something your patch
> likely still handles "randomly" (though we of course generally do not handle
> such exits well).
> 
> The idea of the rev_post_order_and_mark_dfs_back_seme algorithm is to
> treat SCCs as single nodes for the "outermost" RPO walk and then
> continuously expand SCCs from outer to inner.
> 
> So for the testcase I'm quite sure using the this function for computing
> the RPO would work but extra thoughts are appreciated.
Maybe another more straightforward or easier to understand function name?

Thanks,
bin
> 
> Thanks,
> Richard.
>

pr98736-20210323.txt


[PATCH RFC][PR98736]Compute and use programming order preserved RPO in loop distribution

2021-03-22 Thread bin.cheng via Gcc-patches
Hi,

This is the patch for PR98736.  The root cause is as follows:

Use programming order preserved RPO in loop distribution.

Tree loop distribution uses an RPO to build the reduced dependence graph;
it's important that the RPO preserves the original program order, and
usually it does.  However, when distributing a loop nest, the exit bb
could be placed before some loop basic blocks even though it comes after
the loop header.

This patch fixes the issue by preferring the loop exit edge in the DFS
that computes the RPO, as sketched below.
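
A minimal sketch of the idea in plain C with hypothetical data
structures (the real implementation is the new
loop_first_rev_post_order_compute_fn in cfganal.c):

  #include <stdbool.h>

  struct bb { int nsucc; int succ[2]; bool exit_edge[2]; };

  /* Post-order DFS that walks loop-exit successors first (pass 0) and
     the remaining successors after (pass 1).  The exit bb then finishes
     first, gets the smallest post-order number, and therefore lands
     after the loop body in the reverse post order.  */
  static void
  dfs (struct bb *bbs, int v, bool *visited, int *post, int *n)
  {
    visited[v] = true;
    for (int pass = 0; pass < 2; pass++)
      for (int i = 0; i < bbs[v].nsucc; i++)
        if (bbs[v].exit_edge[i] == (pass == 0) && !visited[bbs[v].succ[i]])
          dfs (bbs, bbs[v].succ[i], visited, post, n);
    post[(*n)++] = v;   /* read POST backwards to get the RPO */
  }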

In the patch, I just duplicated pre_and_rev_post_order_compute_fn and
created the new function loop_first_rev_post_order_compute_fn.  I am not
sure whether I should change the original function instead (maybe not at
this stage).  I am not sure about the name either, though I haven't got
a better one.  Any comments are appreciated.

Bootstrap and test on x86_64.

Thanks,
bin

gcc/ChangeLog:

PR tree-optimization/98736
* cfganal.c (loop_first_rev_post_order_compute_fn): New function.
* cfganal.h (loop_first_rev_post_order_compute_fn): New decl.
* tree-loop-distribution.c (loop_distribution::bb_top_order_init):
Compute RPO with programming order preserved by calling the above.

gcc/testsuite/ChangeLog:

PR tree-optimization/98736
* gcc.c-torture/execute/pr98736.c: New test.

pr98736-20210322.txt


Re: [PATCH PR97627]Avoid computing niters info for fake edges

2021-01-29 Thread Bin.Cheng via Gcc-patches
On Fri, Jan 29, 2021 at 3:55 PM Richard Biener
 wrote:
>
> On Thu, Jan 28, 2021 at 11:31 AM Bin.Cheng  wrote:
> >
> > On Thu, Jan 28, 2021 at 5:08 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Jan 28, 2021 at 3:49 AM bin.cheng via Gcc-patches
> > >  wrote:
> > > >
> > > > Hi,
> > > > As described in commit message, we need to avoid computing niters info 
> > > > for fake
> > > > edges.  This simple patch does this by two changes.
> > > >
> > > > Bootstrap and test on X86_64, is it ok?
> > >
> > > Hmm, so I think the patch is a bit complicated, and avoiding niter compute
> > > for fake edges would be easier by just returning false for
> > > fake edges in number_of_iterations_exit_assumptions?
> > I just grepped calls to get_loop_exit_edges, and thought there might
> > be cases other than niters analysis that would also like to skip fake
> > edges.  But I didn't check the calls one by one.
>
> My hunch is that the usual APIs always want to ignore them, but let's
> do a minimal fix that we can backport easily.
Yeah, please apply the trivial patch.

Thanks,
bin
>
> > >
> > > Which pass was the problematic one that had infinite loops connected to
> > > exit?
> > >
> > > I guess the cfgloop code should simply ignore fake exits - they mostly
> > > exist to make reverse CFG walks easy.  Specifically single_exit
> > > and single_likely_exit but also exit edge recording should ignore them.
> > >
> > > That said, the testcase seems to be fixed with just
> > >
> > > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > > index 7d61ef080eb..7775bc7275c 100644
> > > --- a/gcc/tree-ssa-loop-niter.c
> > > +++ b/gcc/tree-ssa-loop-niter.c
> > > @@ -2407,6 +2407,11 @@ number_of_iterations_exit_assumptions (class
> > > loop *loop, edge exit,
> > >affine_iv iv0, iv1;
> > >bool safe;
> > >
> > > +  /* The condition at a fake exit (if it exists) does not control its
> > > + execution.  */
> > > +  if (exit->flags & EDGE_FAKE)
> > > +return false;
> > > +
> > >/* Nothing to analyze if the loop is known to be infinite.  */
> > >if (loop_constraint_set_p (loop, LOOP_C_INFINITE))
> > >  return false;
> > >
> > > Your dfs_find_deadend change likely "breaks" post-dominance DFS order
> > > (this is a very fragile area).
> > >
> > > So any objection to just simplify the patch to the above hunk?
> > Considering we are in late stage3, no objection to this change.  But I
> > do think dfs_find_deadend needs to be improved, if not in the way this
> > patch does it.  For a loop nest with the outermost loop as the infinite
> > one, the function adds fake (exit) edges for inner loops, which is
> > counter-intuitive.
>
> Sure, but then this is independent of the PR.  As said, the fake exits
> only exist to make reverse CFG walkers easier - yes, for natural
> infinite loops we'd like to have "intuitive" post-dom behavior but for
> example for irreducible regions there's not much to do.
>
> Richard.
>
> > Thanks,
> > bin
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > bin


Re: [PATCH PR97627]Avoid computing niters info for fake edges

2021-01-28 Thread Bin.Cheng via Gcc-patches
On Thu, Jan 28, 2021 at 5:08 PM Richard Biener via Gcc-patches
 wrote:
>
> On Thu, Jan 28, 2021 at 3:49 AM bin.cheng via Gcc-patches
>  wrote:
> >
> > Hi,
> > As described in commit message, we need to avoid computing niters info for 
> > fake
> > edges.  This simple patch does this by two changes.
> >
> > Bootstrap and test on X86_64, is it ok?
>
> Hmm, so I think the patch is a bit complicated, and avoiding niter compute
> for fake edges would be easier by just returning false for
> fake edges in number_of_iterations_exit_assumptions?
I just grepped calls to get_loop_exit_edges, and thought there might
be cases other than niters analysis that would also like to skip fake
edges.  But I didn't check the calls one by one.
>
> Which pass was the problematic one that had infinite loops connected to exit?
>
> I guess the cfgloop code should simply ignore fake exits - they mostly
> exist to make reverse CFG walks easy.  Specifically single_exit
> and single_likely_exit but also exit edge recording should ignore them.
>
> That said, the testcase seems to be fixed with just
>
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index 7d61ef080eb..7775bc7275c 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -2407,6 +2407,11 @@ number_of_iterations_exit_assumptions (class
> loop *loop, edge exit,
>affine_iv iv0, iv1;
>bool safe;
>
> +  /* The condition at a fake exit (if it exists) does not control its
> + execution.  */
> +  if (exit->flags & EDGE_FAKE)
> +return false;
> +
>/* Nothing to analyze if the loop is known to be infinite.  */
>if (loop_constraint_set_p (loop, LOOP_C_INFINITE))
>  return false;
>
> Your dfs_find_deadend change likely "breaks" post-dominance DFS order
> (this is a very fragile area).
>
> So any objection to just simplify the patch to the above hunk?
Considering we are in late stage3, no objection to this change.  But I
do think dfs_find_deadend needs to be improved, if not in the way this
patch does it.  For a loop nest with the outermost loop as the infinite
one, the function adds fake (exit) edges for inner loops, which is
counter-intuitive.
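
For illustration, a minimal sketch (an assumed shape, not the PR
testcase) of such a loop nest:

  /* The outermost loop is infinite, so the CFG only reaches the exit
     block through fake edges added for reverse CFG walks.  */
  extern void use (int);

  void
  foo (void)
  {
    for (;;)                        /* infinite outer loop, no real exit */
      for (int i = 0; i < 16; i++)  /* finite inner loop */
        use (i);
  }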

Thanks,
bin
>
> Thanks,
> Richard.
>
> > Thanks,
> > bin


[PATCH PR97627]Avoid computing niters info for fake edges

2021-01-27 Thread bin.cheng via Gcc-patches
Hi,
As described in commit message, we need to avoid computing niters info for fake
edges.  This simple patch does this by two changes.  

Bootstrap and test on X86_64, is it ok?

Thanks,
bin

pr97627-20210128.patch


Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-22 Thread Bin.Cheng via Gcc-patches
On Tue, Sep 22, 2020 at 10:30 AM HAO CHEN GUI  wrote:
>
> Bin,
>
> I just tested your patch on current trunk.  Here is my summary.
>
> 1. About some ivs that aren't moved out of the inner loop (Lijia
> mentioned this in his last email)
>
>[local count: 955630226]:
># l_32 = PHI <1(12), l_54(21)>
># ivtmp_165 = PHI <_446(12), ivtmp_155(21)>
>_26 = (integer(kind=8)) l_32;
>_27 = _25 + _26;
>y__I_lsm.119_136 = (*y_135(D))[_27];
>y__I_lsm.119_90 = m_55 != 1 ? y__I_lsm.119_136 : 0.0;
>_37 = _36 * stride.88_111;
>_38 = _35 + _37;
>_39 = _26 + _38;
>_40 = (*a_137(D))[_39];
>
> The offset _39 is not loop independent, as it relies on _26.  But _38 and
> _37 should be loop independent, so Lijia thought they should be moved
> out of the loop.
>
> I checked the following passes and found that these statements are
> eliminated after vectorizing and dce.
>
> In vect dump,
>
> simple.F:27:23: note:  -->vectorizing statement: _37 = _36 *
> stride.88_111;
> simple.F:27:23: note:  -->vectorizing statement: _38 = _35 + _37;
> simple.F:27:23: note:  -->vectorizing statement: _39 = _26 + _38;
> simple.F:27:23: note:  -->vectorizing statement: _40 = (*a_137(D))[_39];
> simple.F:27:23: note:  transform statement.
> simple.F:27:23: note:  transform load. ncopies = 1
> simple.F:27:23: note:  create vector_type-pointer variable to type:
> vector(2) real(kind=8)  vectorizing an array ref: (*a_137(D))
> simple.F:27:23: note:  created vectp_a.131_383
> simple.F:27:23: note:  add new stmt: vect__40.132_374 = MEM  real(kind=8)> [(real(kind=8) *)vectp_a.130_376];
>
> In dce dump,
>
> Deleting : _39 = _26 + _38;
>
> Deleting : _38 = _35 + _37;
>
> Deleting : _37 = _36 * stride.88_111;
>
> So it's reasonable to only consider data references after loop
> interchange.  Other statements may be eliminated or moved out of the loop
> in the last lim pass if they're really expensive.
>
> 2. I tested SPEC on powerpc64le-linux-gnu.  503.bwaves_r got a 6.77%
> performance improvement with this patch.  It has no impact on other
> benchmarks.
>
> 3. The patch passed bootstrap and regression testing on
> powerpc64le-linux-gnu.
>
> I think the patch works fine. Could you please add it into trunk? Thanks
> a lot.
Hmm, IIRC the patch was intended to show what the missing transform
is, and I think it has latent bugs which I haven't had time to refine.
As Richard mentioned, could you please explore this with the existing
LIM facility, rather than introducing new code implementing existing
transforms?

Thanks,
bin
>
>
> On 8/9/2020 6:18 PM, Bin.Cheng wrote:
> > On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
> >> Hi,
> >>
> >> I want to follow Lijia's work as I gained the performance benefit on
> >> some SPEC workloads by adding a im pass after loop interchange.  Could
> >> you send me the latest patches? I could do further testing. Thanks a lot.
> > Hi,
> > Hmm, not sure if this refers to me?  I only provided an example patch
> > (which isn't complete) before Lijia's.  Unfortunately I don't have any
> > latest patch about this either.
> > As Richard suggested, maybe you (if you work on this) can simplify the
> > implementation.  Anyway, we only need to hoist memory references here.
> >
> > Thanks,
> > bin
> >> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
> >>


[PATCH PR93334][RFC]Skip output dep if values stored are byte wise the same

2020-09-15 Thread bin.cheng via Gcc-patches
Hi,
As suggested by the PR93334 comments, this patch adds an interface
identifying output dependences which can be skipped in terms of
reordering, and skips them in loop distribution.  It also adds a new
test case.  Any comments?

Thanks,
bin

0001-Skip-output-dependence-if-values-stored-are-bytewise.patch


Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-08 Thread Bin.Cheng via Gcc-patches
On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
>
> Hi,
>
> I want to follow Lijia's work, as I gained a performance benefit on
> some SPEC workloads by adding an im (invariant motion) pass after loop
> interchange.  Could you send me the latest patches?  I could do further
> testing.  Thanks a lot.
Hi,
Hmm, I am not sure if this refers to me?  I only provided an example patch
(which isn't complete) before Lijia's.  Unfortunately I don't have any
newer patch for this either.
As Richard suggested, maybe you (if you work on this) can simplify the
implementation.  Anyway, we only need to hoist memory references here.

Thanks,
bin
>
> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
>


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-04 Thread Bin.Cheng via Gcc-patches
On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool
 wrote:
>
> On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> > > on 2020/9/2 6:25 PM, Segher Boessenkool wrote:
> > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> > > >> on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
> > >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >  1) Currently address_cost hook on rs6000 always return zero, but at 
> >  least
> >  from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >  have to take the address update into account (scalar normal operation).
> > >>>
> > >>> From Power4 on already (not sure about Power6, but does anyone care?)
> > >>
> > >> Thanks for the information, it looks this issue exists for a long time.
> > >
> > > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > > For example, an
> > >   ldu 3,16(4)
> > > is cracked to an
> > >   ld 3,16(4)
> > > and an
> > >   addi 4,4,16
> > > (the addi is not on the critical path of the load).  So it seems to me
> > > this shouldn't increase the addressing cost at all?  (The instruction of
> > > course is really two insns in one.)
> >
> > Good question!  I agree that they can execute in parallel, but it depends
> > on how we interpret the addressing cost.  If it's for required execution
> > resources, I think it's off, since compared with ld, the ldu has two iops
> > and an extra ALU requirement.
>
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).
>
> > I'm not sure about its usage elsewhere, but in the
> > context of IVOPTs on Power, for one normal candidate, its step cost is 4
> > and the cost for group (1) is zero, so the total cost is 4 for this
> > combination, for a scenario like:
> > ldx rx, iv // (1)
> > ...
> > iv = iv + step // (2)
> >
> > While for an ainc_use candidate (like ldu), its step cost is 4, but the
> > cost for group (1) is -4 (minus the step cost), so the total cost is 0.
> > That seems to say the step update is free.
>
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?
>
> > We can also see that (1) and (2) can execute in parallel (same iteration).
> > If we consider the next iteration, it will have the dependency, but it's
> > the same for ldu.  So basically they are similar, and I think it's unfair
> > to have this difference in the cost modeling.  The cracked addi should
> > have its cost here.  Does that make sense?
>
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.
Hi Segher,
In simple cases, yes, and that is also the (rough) idea of modeling the
auto-inc addressing mode in ivopts; however, things are different if the
loop gets complicated.  Consider the case of choosing 10 auto-inc
addressing-mode candidates vs. [base_x + iv_index]: the latter only
needs one add instruction, while the former needs 10 embedded auto-inc
operations.
Another issue is register pressure: choosing auto-inc candidates could
result in more IVs, while choosing iv_index results in one IV (and more
base pointers); however, spilling a base pointer (which is loop
invariant) is usually cheaper than spilling an IV.
Another issue is that auto-inc candidates probably lead to more bloated
setup code in the preheader BB, due to problems in expression
canonicalization, CSE, etc.

So it's not that easy to answer the question for complicated cases.
As for simple cases, the current model works fine, with auto-inc
(somehow) preferred.
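
An illustrative sketch of the trade-off (hypothetical code, scaled down
from the ten-access scenario above):

  /* With [base + index] addressing, all accesses share one index IV and
     one add per iteration; auto-inc forms (*p0++, *p1++, ...) would
     instead use one pointer IV per array, each with an embedded update,
     raising register pressure.  */
  long
  sum (long *a0, long *a1, long *a2, long n)  /* imagine ten such arrays */
  {
    long s = 0;
    for (long i = 0; i < n; i++)   /* single shared index IV */
      s += a0[i] + a1[i] + a2[i];
    return s;
  }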

Thanks,
bin
>
> > Apart from that, one P9-specific point is that the update-form load isn't
> > preferred; the reason is that the instruction cannot retire until both
> > parts complete, so it can hold up subsequent instructions from retiring.
> > If the addi stalls (starvation), the instruction cannot retire and can
> > cause things to get stuck.  It seems like something we can also model here?
>
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.
>
>
> Segher


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Bin.Cheng via Gcc-patches
On Wed, Sep 2, 2020 at 11:50 AM Kewen.Lin  wrote:
>
> Hi Bin,
>
> >> 2) This case makes me think we should exclude ainc candidates in function
> >> mark_reg_offset_candidates.  The justification is that: ainc candidate
> >> handles step update itself and when we calculate the cost for it against
> >> its ainc_use, the cost_step has been reduced. When unrolling happens,
> >> the ainc computations are replicated and it doesn't save step updates
> >> like normal reg_offset_p candidates.
> > Though auto-inc candidate embeds stepping operation into memory
> > access, we might want to avoid it in case of unroll if there are many
> > sequences of memory accesses, and if the unroll factor is big.  The
> > rationale is embedded stepping is a u-arch operation and does have its
> > cost.
> >
>
> Thanks for the comments!  Agreed!  Excluding them from reg_offset_p
> candidates here is consistent with this expectation; it makes us
> consider the unroll-factor effect when checking the corresponding
> step cost and the embedded stepping cost (in the group/candidate cost,
> minus the step cost, using the cost from the address_cost hook).
>
> >>
> >> I've updated the patch to punt ainc_use candidates as below:
> >>
> >>> + /* Skip AINC candidate since it contains address update itself,
> >>> +the replicated AINC computations when unrolling still have
> >>> +updates, unlike reg_offset_p candidates can save step updates
> >>> +when unrolling.  */
> >>> + if (cand->ainc_use)
> >>> +   continue;
> >>> +
> >>
> >> For this new attached patch, it's bootstrapped and regress-tested without
> >> explicit unrolling, while the only one failure has been identified as
> >> rs6000 specific address_cost issue in bootstrapping and regression testing
> >> with explicit unrolling.
> >>
> >> By the way, with above simple hack of address_cost, I also did one
> >> bootstrapping and regression testing with explicit unrolling, the above
> >> sms-4.c did pass as I expected but had two failures instead:
> >>
> >>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
> >>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts 
> >> "PHI" 2
> >>
> >> By further investigation, the 2nd one is expected due to the adddress_cost 
> >> hook
> >> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
> >> sequence starts to off from sms, just some NOTE_INSN_DELETED positions 
> >> change.
> >> I believe it's just exposed by this combination unluckily/luckily ;-) I 
> >> will
> >> send a patch separately for it once I got time to look into it, but it 
> >> should
> >> be unrelated to this patch series for sure.
> > This is the kind of situation I intended to avoid before.  IMHO, this
> > isn't a neat change (it can't be given we are predicting the future
> > transformation in compilation pipeline), accumulation of such changes
> > could make IVOPTs break in one way or another.  So as long as you make
> > sure it doesn't have functional impact in case of no-rtl_unroll, I am
> > fine.
>
> Yeah, I admit it's not neat, but the proposals in the previous discussions
> that avoid predicting the unroll factor cannot work well for all scenarios
> with different unroll factors; they could over-blame some kinds of
> candidates.  For the case of no rtl unrolling, the unroll factor
> estimation should set loop->estimated_unroll to zero, so none of these
> changes take effect.  The estimation function follows the same logic as
> the RTL unroll factor calculation; I did test with explicit unrolling
> disabled before, and it worked as expected.
Thanks for working on this, and sorry for nitpicking.

Thanks,
bin


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-01 Thread Bin.Cheng via Gcc-patches
On Tue, Aug 25, 2020 at 8:47 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> >>
> >> For one particular case like:
> >>
> >> for (i = 0; i < SIZE; i++)
> >>   y[i] = a * x[i] + z[i];
> >>
> >> we will mark reg_offset_p for IV candidates on x as below:
> >>- (unsigned long) (x_18(D) + 8)// only mark this before.
> >>- x_18(D) + 8
> >>- (unsigned long) (x_18(D) + 24)
> >>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 
> >> 18446744073709551600)
> >>...
> >>
> >> Do you mind to have a review again?  Thanks in advance!
> > I trust you with the change.
>
> Thanks again!  Sorry for the delay; it took some time to investigate
> the exposed issues.
>
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
> >>
> >> SPEC2017 P9 performance run has no remarkable degradations/improvements.
> > Is this run with unroll-loops?
>
> Yes, the options what I used were:
>
>-Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops
>
> I also re-tested the newly attached patch, nothing changes for SPEC2017 data.
>
> > Could you exercise the code with unroll-loops enabled when
> > bootstrap/regtest please?  It doesn't matter if cases fail with
> > unroll-loops, just making sure there is no fallout.  Otherwise it's
> > fine with me.
>
> Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
> but the regression testing did show one failure (the only one):
>
>   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
>
> It exposes two issues:
>
> 1) Currently the address_cost hook on rs6000 always returns zero, but at
> least from Power7 on, pre_inc/pre_dec kind instructions are cracked, which
> means we have to take the address update into account (a normal scalar
> operation).  Since IVOPTs reduces the cost_step for ainc candidates, it
> makes us prefer ainc candidates.  In this case, the cand/group cost is -4
> (minus cost_step); with scaling up, it gets far off.  With one simple hack
> for pre_inc/pre_dec in rs6000 address_cost, the case passed.  It should be
> handled as a separate issue.
>
> 2) This case makes me think we should exclude ainc candidates in the
> function mark_reg_offset_candidates.  The justification is that an ainc
> candidate handles the step update itself, and when we calculate the cost
> for it against its ainc_use, the cost_step has been reduced.  When
> unrolling happens, the ainc computations are replicated, so it doesn't
> save step updates the way normal reg_offset_p candidates do.
Though an auto-inc candidate embeds the stepping operation into the
memory access, we might want to avoid it in the case of unrolling if
there are many sequences of memory accesses and the unroll factor is
big.  The rationale is that embedded stepping is a u-arch operation and
does have its cost.

>
> I've updated the patch to punt ainc_use candidates as below:
>
> > + /* Skip AINC candidate since it contains address update itself,
> > +the replicated AINC computations when unrolling still have
> > +updates, unlike reg_offset_p candidates can save step updates
> > +when unrolling.  */
> > + if (cand->ainc_use)
> > +   continue;
> > +
>
> For this new attached patch, it's bootstrapped and regress-tested without
> explicit unrolling, while the only one failure has been identified as
> rs6000 specific address_cost issue in bootstrapping and regression testing
> with explicit unrolling.
>
> By the way, with above simple hack of address_cost, I also did one
> bootstrapping and regression testing with explicit unrolling, the above
> sms-4.c did pass as I expected but had two failures instead:
>
>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>
> By further investigation, the 2nd one is expected due to the address_cost
> hook hack, while the 1st one exposed a -fcompare-debug issue in sms.  The
> RTL sequence starts to diverge in sms; just some NOTE_INSN_DELETED
> positions change.  I believe it's just exposed by this combination
> unluckily/luckily ;-) I will send a patch separately for it once I get
> time to look into it, but it should be unrelated to this patch series
> for sure.
This is the kind of situation I intended to avoid before.  IMHO, this
isn't a neat change (it can't be, given we are predicting a future
transformation in the compilation pipeline); the accumulation of such
changes could make IVOPTs break in one way or another.  So as long as
you make sure it doesn't have a functional impact in the case of no rtl
unrolling, I am fine.
>
> In a word, bootstrapping/regress-testing with unroll-loops enabled shows this
> patch looks fine.
>
> BR,
> Kewen
> -
> gcc/ChangeLog:
>
> * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (mark_reg_offset_candidates): New function.
> (add_candidate_1): Set reg_offset_p to false for new candidate.

Re: [PATCH 3/4 v2] ivopts: Consider cost_step on different forms during unrolling

2020-08-21 Thread Bin.Cheng via Gcc-patches
On Tue, Aug 18, 2020 at 5:03 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> > I see, it's similar to the auto-increment case where cost should be
> > recorded only once.  So this is okay given 1) fine predicting
> > rtl-unroll is likely impossible here; 2) the patch has very limited
> > impact.
> >
> Really appreciate your time and patience!
>
> I extended the previous version to address Richard S.'s comments on
> candidates with the same base/step but different offsets here:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547014.html.
>
> The previous version only allows the candidate derived from the group
> of interest, this updated patch extends it to those ones which have the
> same bases/steps and same/different offsets but in the acceptable range
> by considering unrolling.
>
> For one particular case like:
>
> for (i = 0; i < SIZE; i++)
>   y[i] = a * x[i] + z[i];
>
> we will mark reg_offset_p for IV candidates on x as below:
>- (unsigned long) (x_18(D) + 8)// only mark this before.
>- x_18(D) + 8
>- (unsigned long) (x_18(D) + 24)
>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 
> 18446744073709551600)
>...
>
> Do you mind to have a review again?  Thanks in advance!
I trust you with the change.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
>
> SPEC2017 P9 performance run has no remarkable degradations/improvements.
Is this run with unroll-loops?
Could you exercise the code with unroll-loops enabled when
bootstrap/regtest please?  It doesn't matter if cases fail with
unroll-loops, just making sure there is no fallout.  Otherwise it's
fine with me.

Thanks,
bin
>
> BR,
> Kewen
> -
> gcc/ChangeLog:
>
> * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (mark_reg_offset_candidates): New function.
> (add_candidate_1): Set reg_offset_p to false for new candidate.
> (set_group_iv_cost): Scale up group cost with estimate_unroll_factor 
> if
> consider_reg_offset_for_unroll_p.
> (determine_iv_cost): Increase step cost with estimate_unroll_factor if
> consider_reg_offset_for_unroll_p.
> (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
> consider_reg_offset_for_unroll_p.
>


Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

2020-08-15 Thread Bin.Cheng via Gcc-patches
On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> on 2020/8/10 8:38 PM, Bin.Cheng wrote:
> > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin  wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for the review!!
> >>
> >> on 2020/8/8 4:01 PM, Bin.Cheng wrote:
> >>> Hi Kewen,
> >>> Sorry for the late reply.
> >>> The patch's most important change is below cost computation:
> >>>
> >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, 
> >>>> struct iv_cand *cand)
> >>>> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>>>
> >>>> +  /* Consider additional step updates during unrolling.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >>>> +cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> >>> This is a bit strange, to me the add instructions are additional
> >>> computation caused by unrolling+addressing_mode, rather than a native
> >>> part in candidate itself.  Specifically, an additional cost is needed
> >>> if candidates (without reg_offset_p) are chosen for the address type
> >>> group/uses.
> >>
> >> Good point, ideally it should be one additional cost for each cand set,
> >> when we select one cand for one group, we need to check this pair need
> >> more (estimated_unroll - 1) step costs, we probably need to care about
> >> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> >> doesn't support this and it could increase the selection complexity and
> >> time.  I hesitated to do it and put it to cand step cost initially instead.
> >>
> >> I was thinking those candidates with reg_offset_p should be only used for
> >> those reg_offset_p groups in most cases (very limited) meanwhile the others
> >> are simply scaled up like before.  But indeed this can cover some similar
> >> cases like one cand is only used for the compare type group which is for
> >> loop closing, then it doesn't need more step costs for unrolling.
> >>
> >> Do you prefer me to improve the current cost framework?
> > No, I don't think it's relevant to the candidate selecting algorithm.
> > I was thinking about adjusting cost somehow in
> > determine_group_iv_cost_address. Given we don't expose selected
> > addressing mode in this function, you may need to do it in
> > get_address_cost, either way.
> >
>
> Thanks for your suggestion!
>
> Sorry, I may be missing something, but I still think the additional cost
> is per candidate.  The justification is that we fail to model the iv
> candidate step well in the context of unrolling; the step cost is part
> of the candidate cost, which is per candidate.
>
> Initializing it in determine_iv_cost isn't perfect, as you pointed out;
> ideally we should check whether any use of the candidate requires an iv
> update after each replicated iteration, take the extra step costs into
> account if at least one does, and meanwhile scale up all the computation
> cost to reflect the nature of unrolling cost.
I see, it's similar to the auto-increment case where the cost should be
recorded only once.  So this is okay, given that 1) precisely predicting
rtl unrolling is likely impossible here; 2) the patch has very limited
impact.

Thanks,
bin
>
> Besides, the reg_offset desirable pair already takes zero cost for the
> cand/group cost; IIRC negative cost isn't preferred in IVOPTs.  Are you
> suggesting increasing the cost for non-reg_offset pairs?  If so, and per
> pair, the extra cost could end up being computed several times
> unexpectedly.
>
> >>
> >>>> +
> >>>>   /* Prefer the original ivs unless we may gain something by replacing 
> >>>> it.
> >>>>  The reason is to make debugging simpler; so this is not relevant for
> >>>>  artificial ivs created by other optimization passes.  */
> >>>>
> >>>
> >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>>>   return;
> >>>> }
> >>>>
> >>>> +  /* Since we priced more on non reg_offset IV cand step cost, we 
> >>>> should scale
> >>>> + up the appropriate IV group costs.  Simply consider USE_COMPARE at 
> >>>> the
> >>>> + loop exit, FIXME if multiple exits supported or no loop exit 
> >>>> + comparisons matter.  */

Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

2020-08-10 Thread Bin.Cheng via Gcc-patches
On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> Thanks for the review!!
>
> on 2020/8/8 4:01 PM, Bin.Cheng wrote:
> > Hi Kewen,
> > Sorry for the late reply.
> > The patch's most important change is below cost computation:
> >
> >> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct 
> >> iv_cand *cand)
> >> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>
> >> +  /* Consider additional step updates during unrolling.  */
> >> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >> +cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> > This is a bit strange, to me the add instructions are additional
> > computation caused by unrolling+addressing_mode, rather than a native
> > part in candidate itself.  Specifically, an additional cost is needed
> > if candidates (without reg_offset_p) are chosen for the address type
> > group/uses.
>
> Good point; ideally it should be one additional cost for each cand set.
> When we select one cand for one group, we need to check whether this pair
> needs (estimated_unroll - 1) more step costs, and we probably need to
> care about this during remove/replace etc.  IIUC the current IVOPTs cost
> framework doesn't support this, and it could increase the selection
> complexity and time.  I hesitated to do it and put it into the cand step
> cost initially instead.
>
> I was thinking those candidates with reg_offset_p should be only used for
> those reg_offset_p groups in most cases (very limited) meanwhile the others
> are simply scaled up like before.  But indeed this can cover some similar
> cases like one cand is only used for the compare type group which is for
> loop closing, then it doesn't need more step costs for unrolling.
>
> Do you prefer me to improve the current cost framework?
No, I don't think it's relevant to the candidate selecting algorithm.
I was thinking about adjusting cost somehow in
determine_group_iv_cost_address. Given we don't expose selected
addressing mode in this function, you may need to do it in
get_address_cost, either way.

>
> >> +
> >>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>  The reason is to make debugging simpler; so this is not relevant for
> >>  artificial ivs created by other optimization passes.  */
> >>
> >
> >> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>   return;
> >> }
> >>
> >> +  /* Since we priced more on non reg_offset IV cand step cost, we should 
> >> scale
> >> + up the appropriate IV group costs.  Simply consider USE_COMPARE at 
> >> the
> >> + loop exit, FIXME if multiple exits supported or no loop exit 
> >> comparisons
> >> + matter.  */
> >> +  if (data->consider_reg_offset_for_unroll_p
> >> +  && group->vuses[0]->type != USE_COMPARE)
> >> +cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> > Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> > doesn't make much sense to me.  Also why generic type uses are not
> > skipped?  We want to model the cost required for address computation,
> > however, for generic type uses there is no way to save the computation
> > in "address expression".  Once unrolled, the computation is always
> > there?
> >
>
> The main intention is to scale up the group/cand cost for unrolling since
> we have scaled up the step costs.  The assumption is that the original
If we adjust cost appropriately in function *group_iv_cost_address,
this would become unnecessary, right?  And naturally.
> costing (without this patch) can be viewed as being either for all
> unrolled iterations or for just one single iteration.  Since IVOPTs
> doesn't support fractional costing, I interpreted it as a single
> iteration; to emulate the unrolling scenario based on the previous step
> cost scaling, we need to scale up the cost for all computation.
>
> In most cases, the compare type use is for loop closing; there is still
> only one computation even when unrolling happens, so I excluded it here.
> As the "FIXME" says, if we find some cases are off, we can further
> restrict it to those USE_COMPARE uses which are exactly for loop closing.
>
> > And what's the impact on targets supporting [base + index + offset]
> > addressing mode?
>
> Good question, I didn't notice it since power doesn't support it.
> I noticed the comment

Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

2020-08-08 Thread Bin.Cheng via Gcc-patches
Hi Kewen,
Sorry for the late reply.
The patch's most important change is below cost computation:

> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct 
> iv_cand *cand)
> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>
> +  /* Consider additional step updates during unrolling.  */
> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> +cost += (data->current_loop->estimated_unroll - 1) * cost_step;
This is a bit strange; to me the add instructions are additional
computation caused by unrolling + addressing mode, rather than a native
part of the candidate itself.  Specifically, an additional cost is needed
if candidates (without reg_offset_p) are chosen for the address type
groups/uses.
> +
>   /* Prefer the original ivs unless we may gain something by replacing it.
>  The reason is to make debugging simpler; so this is not relevant for
>  artificial ivs created by other optimization passes.  */
>

> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>   return;
> }
>
> +  /* Since we priced more on non reg_offset IV cand step cost, we should 
> scale
> + up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> + loop exit, FIXME if multiple exits supported or no loop exit comparisons
> + matter.  */
> +  if (data->consider_reg_offset_for_unroll_p
> +  && group->vuses[0]->type != USE_COMPARE)
> +cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
I don't quite follow here; "pricing more on non-reg_offset IV cand"
doesn't make much sense to me.  Also, why are generic type uses not
skipped?  We want to model the cost required for address computation;
however, for generic type uses there is no way to save the computation
in an "address expression".  Once unrolled, the computation is always
there?

And what's the impact on targets supporting [base + index + offset]
addressing mode?

Given the patch is not likely to do harm, because rtl loop unrolling is
(or was?) disabled by default, I am OK once the above comments are
addressed.

I wonder if it's possible to get 10% of (all the loops which should be
unrolled) unrolled (conservatively) on gimple and enable that by default
at O3, rather than teaching ivopts to model a future pass which is not
likely to be used outside of benchmarks?

Thanks,
bin

> +
>   if (data->consider_all_candidates)
> {
>   group->cost_map[cand->id].cand = cand;

On Thu, May 28, 2020 at 8:24 PM Kewen.Lin  wrote:
>
>
> gcc/ChangeLog
>
> 2020-MM-DD  Kewen Lin  
>
> * tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
> (struct iv_cand): New field reg_offset_p.
> (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> (dump_groups): Dump group with reg_offset_p.
> (record_group): Initialize reg_offset_p.
> (mark_reg_offset_groups): New function.
> (find_interesting_uses): Call mark_reg_offset_groups.
> (add_candidate_1): Update reg_offset_p if derived from reg_offset_p 
> group.
> (set_group_iv_cost): Scale up group cost with estimate_unroll_factor 
> if
> consider_reg_offset_for_unroll_p.
> (determine_iv_cost): Increase step cost with estimate_unroll_factor if
> consider_reg_offset_for_unroll_p.
> (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
> consider_reg_offset_for_unroll_p.
>
> 


Re: [PATCH v2] genemit.c (main): split insn-emit.c for compiling parallelly

2020-07-21 Thread Bin.Cheng via Gcc-patches
On Tue, Jul 21, 2020 at 11:14 AM Jojo  wrote:
>
> gcc/ChangeLog:
>
> * genemit.c (main): Print 'split line'.
> * Makefile.in (insn-emit.c): Define split count and file
>

Thanks for working on this; the following comments are based on the
assumption that the approach is feasible after your investigation.

It's great to accelerate compilation; do you have any numbers
showing how much this can achieve on a typical machine with
reasonable parallelism?

> ---
>  gcc/Makefile.in | 10 ++
>  gcc/genemit.c   | 86 +++--
>  2 files changed, 58 insertions(+), 38 deletions(-)
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 2ba76656dbf..f805050a119 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1253,6 +1253,13 @@ ANALYZER_OBJS = \
>  # We put the *-match.o and insn-*.o files first so that a parallel make
>  # will build them sooner, because they are large and otherwise tend to be
>  # the last objects to finish building.
> +
> +insn-generated-split-num = 15
A hardcoded number 15 looks strange here; how shall we know it's enough?
Or, one step further, can we use some kind of general match "*" when
writing the make rules here?
> +
> +insn-emit-split-c = $(foreach o, $(shell for i in 
> {1..$(insn-generated-split-num)}; do echo $$i; done), insn-emit$(o).c)
> +insn-emit-split-obj = $(patsubst %.c,%.o, $(insn-emit-split-c))
> +$(insn-emit-split-c): insn-emit.c
> +
>  OBJS = \
> gimple-match.o \
> generic-match.o \
> @@ -1260,6 +1267,7 @@ OBJS = \
> insn-automata.o \
> insn-dfatab.o \
> insn-emit.o \
> +   $(insn-emit-split-obj) \
> insn-extract.o \
> insn-latencytab.o \
> insn-modes.o \
> @@ -2367,6 +2375,8 @@ $(simple_generated_c:insn-%.c=s-%): s-%: 
> build/gen%$(build_exeext)
> $(RUN_GEN) build/gen$*$(build_exeext) $(md_file) \
>   $(filter insn-conditions.md,$^) > tmp-$*.c
> $(SHELL) $(srcdir)/../move-if-change tmp-$*.c insn-$*.c
> +   -csplit insn-$*.c /i\ am\ split\ line/ -k -s 
> {$(insn-generated-split-num)} -f insn-$* -b "%d.c"
> +   -( [ ! -s insn-$*0.c ] && for i in {1..$(insn-generated-split-num)}; 
> do touch insn-$*$$i.c; done && echo "" > insn-$*.c)
I am not sure if this is the first time that csplit/coreutils is used;
shall we mention it at
https://gcc.gnu.org/install/prerequisites.html, or even check for it in
configure?
> $(STAMP) s-$*
>
>  # gencheck doesn't read the machine description, and the file produced
> diff --git a/gcc/genemit.c b/gcc/genemit.c
> index 84d07d388ee..fd60cdeeb96 100644
> --- a/gcc/genemit.c
> +++ b/gcc/genemit.c
> @@ -847,6 +847,45 @@ handle_overloaded_gen (overloaded_name *oname)
>  }
>  }
>
> +#define printf_include() \
> +  printf ("/* Generated automatically by the program `genemit'\n\
> +from the machine description file `md'.  */\n\n"); \
> +  printf ("#define IN_TARGET_CODE 1\n"); \
> +  printf ("#include \"config.h\"\n"); \
> +  printf ("#include \"system.h\"\n"); \
> +  printf ("#include \"coretypes.h\"\n"); \
> +  printf ("#include \"backend.h\"\n"); \
> +  printf ("#include \"predict.h\"\n"); \
> +  printf ("#include \"tree.h\"\n"); \
> +  printf ("#include \"rtl.h\"\n"); \
> +  printf ("#include \"alias.h\"\n"); \
> +  printf ("#include \"varasm.h\"\n"); \
> +  printf ("#include \"stor-layout.h\"\n"); \
> +  printf ("#include \"calls.h\"\n"); \
> +  printf ("#include \"memmodel.h\"\n"); \
> +  printf ("#include \"tm_p.h\"\n"); \
> +  printf ("#include \"flags.h\"\n"); \
> +  printf ("#include \"insn-config.h\"\n"); \
> +  printf ("#include \"expmed.h\"\n"); \
> +  printf ("#include \"dojump.h\"\n"); \
> +  printf ("#include \"explow.h\"\n"); \
> +  printf ("#include \"emit-rtl.h\"\n"); \
> +  printf ("#include \"stmt.h\"\n"); \
> +  printf ("#include \"expr.h\"\n"); \
> +  printf ("#include \"insn-codes.h\"\n"); \
> +  printf ("#include \"optabs.h\"\n"); \
> +  printf ("#include \"dfp.h\"\n"); \
> +  printf ("#include \"output.h\"\n"); \
> +  printf ("#include \"recog.h\"\n"); \
> +  printf ("#include \"df.h\"\n"); \
> +  printf ("#include \"resource.h\"\n"); \
> +  printf ("#include \"reload.h\"\n"); \
> +  printf ("#include \"diagnostic-core.h\"\n"); \
> +  printf ("#include \"regs.h\"\n"); \
> +  printf ("#include \"tm-constrs.h\"\n"); \
> +  printf ("#include \"ggc.h\"\n"); \
> +  printf ("#include \"target.h\"\n\n"); \

Can you use the do {} while (0) style for defining this code-block macro?
The trailing '\' is also strange here.
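
A sketch of the suggested style, abbreviated to a few of the printf
calls (the real macro prints many more #include lines):

  #define printf_include() \
    do { \
      printf ("/* Generated automatically by the program `genemit'\n"); \
      printf ("   from the machine description file `md'.  */\n\n"); \
      printf ("#define IN_TARGET_CODE 1\n"); \
      printf ("#include \"config.h\"\n"); \
      printf ("#include \"system.h\"\n"); \
      /* ... remaining #include lines as in the patch ... */ \
    } while (0)
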
> +
>  int
>  main (int argc, const char **argv)
>  {
> @@ -862,49 +901,19 @@ main (int argc, const char **argv)
>/* Assign sequential codes to all entries in the machine description
>   in parallel with the tables in insn-output.c.  */
>
> -  printf ("/* Generated automatically by the program `genemit'\n\
> -from the machine description file `md'.  */\n\n");
> -
> -  printf ("#define IN_TARGET_CODE 1\n");
> -  printf ("#include \"config.h\"\n");
> -  

[PATCH PR95804]Force reduction partition to be scheduled in the last

2020-07-07 Thread bin.cheng via Gcc-patches
Hi,
This is a followup fix for PR95638, which changed the way post order
numbers are maintained for the partition graph.  It missed one case:
when the SCC containing the reduction partition is broken by runtime
alias checks, we still need to make sure the reduction partition is
scheduled last.  This patch does this by forcing a negative post order
on it, as sketched below.
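
A sketch of the idea with hypothetical names and types (the actual
change is in tree-loop-distribution.c):

  #include <stdbool.h>

  struct pg_vertex { int post; bool has_reduction; };

  /* After runtime alias checks break the SCC, give the reduction
     partition a post order below every real (non-negative) one, so a
     schedule emitting partitions in decreasing post order puts it
     last.  */
  static void
  force_reduction_last (struct pg_vertex *v, int n)
  {
    for (int i = 0; i < n; i++)
      if (v[i].has_reduction)
        v[i].post = -1;
  }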

Bootstrap and test on x86_64, is it OK?

Thanks,
bin

pr95804.patch


[PATCH PR95638]Record/restore postorder, rather than update it

2020-06-15 Thread bin.cheng via Gcc-patches
Hi,
This simple patch fixes the wrong-code issue as reported.  I tried to
update the postorder information after the second call to graphds_scc
with alias dependence edges skipped.  This wasn't working, and I
realized it's hard to do.  This patch simply records the postorder
information before the call and restores it afterwards, as sketched
below.  It also fixes a memory leak.
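
A minimal sketch of the record/restore approach, with hypothetical names
and types (the real change lives in break_alias_scc_partitions):

  #include <stdlib.h>

  struct vertex { int post; };
  struct graph  { int n_vertices; struct vertex *vertices; };

  extern void compute_scc_skipping_alias_edges (struct graph *); /* hypothetical */

  static void
  scc_with_post_order_preserved (struct graph *pg)
  {
    int n = pg->n_vertices;
    int *saved = (int *) malloc (n * sizeof (int));
    for (int i = 0; i < n; i++)
      saved[i] = pg->vertices[i].post;          /* record */

    compute_scc_skipping_alias_edges (pg);      /* clobbers post orders */

    for (int i = 0; i < n; i++)
      pg->vertices[i].post = saved[i];          /* restore */
    free (saved);                               /* and no leak */
  }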

Bootstrap and test on x86_64. Comments?

Thanks,
bin

2020-06-15  Bin Cheng  

gcc/
PR tree-optimization/95638
* tree-loop-distribution.c (pg_edge_callback_data): New field.
(loop_distribution::break_alias_scc_partitions): Record and restore
postorder information.  Fix memory leak.

gcc/testsuite/
PR tree-optimization/95638
* g++.dg/tree-ssa/pr95638.C: New test.

pr95638.txt


Re: [PATCH PR94969]Add unit distance vector to DDR in case of invariant access functions

2020-05-13 Thread Bin.Cheng via Gcc-patches
On Thu, May 14, 2020 at 1:46 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> On Wed, May 13, 2020 at 02:00:11PM +0200, Christophe Lyon via Gcc-patches 
> wrote:
> > > > 2020-05-11  Bin Cheng  
> > > >
> > > > PR tree-optimization/94969
> > > > * gcc.dg/tree-ssa/pr94969.c: New test.
> >
> > The new test fails on arm and aarch64 and probably everywhere:
> > gcc.dg/tree-ssa/pr94969.c: dump file does not exist
> > UNRESOLVED: gcc.dg/tree-ssa/pr94969.c scan-tree-dump-not Loop 1
> > distributed: split to 3 loops "ldist"
> >
> > Can you fix this?
>
> Seems a mere swapping of the scan-tree-dump-not args; I've verified the
> test passes after this change and fails with older trunk.  Committed to
> trunk as obvious.
Oh, sorry for the breakage, and thanks for fixing this.

Thanks,
bin
>
> 2020-05-13  Jakub Jelinek  
>
> PR testsuite/95110
> * gcc.dg/tree-ssa/pr94969.c: Swap scan-tree-dump-not arguments.
>
> --- gcc/testsuite/gcc.dg/tree-ssa/pr94969.c.jj  2020-05-13 09:24:36.959012780 
> +0200
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr94969.c 2020-05-13 19:13:53.664499322 
> +0200
> @@ -25,4 +25,4 @@ int main()
>  __builtin_abort ();
>  }
>
> -/* { dg-final { scan-tree-dump-not "ldist" "Loop 1 distributed: split to 3 
> loops"} } */
> +/* { dg-final { scan-tree-dump-not "Loop 1 distributed: split to 3 loops" 
> "ldist" } } */
>
> Jakub
>


[PATCH PR94969]Add unit distance vector to DDR in case of invariant access functions

2020-05-10 Thread bin.cheng via Gcc-patches
Hi,
As analyzed in PR94969, data dependence analysis currently misses the
dependence vector for a specific case in which the DRs in a DDR have the
same invariant access functions.  This simple patch fixes the issue by
also covering invariant cases.  Bootstrapped and tested on x86_64, is it
OK?
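
A sketch of the shape in question (illustrative; the committed testcase
may differ): both references use the same loop-invariant subscript, so
the dependence carried by the i loop must still get a unit distance
vector even though i does not appear in the access functions:

  int a[16];

  void
  foo (int k, int n)
  {
    for (int i = 0; i < n; i++)
      a[k] = a[k] + 1;   /* read and write with an invariant subscript */
  }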

Thanks,
bin

2020-05-11  Bin Cheng  

PR tree-optimization/94969
* tree-data-ref.c (constant_access_functions): Rename to...
(invariant_access_functions): ...this.  Add parameter.  Check for
invariant access function, rather than constant.
(build_classic_dist_vector): Call above function.
* tree-loop-distribution.c (pg_add_dependence_edges): Add comment.

gcc/testsuite
2020-05-11  Bin Cheng  

PR tree-optimization/94969
* gcc.dg/tree-ssa/pr94969.c: New test.

pr94969-20200511.txt


Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of -fstrict-enums

2020-04-08 Thread bin.cheng via Gcc-patches
--
Sender:Richard Biener 
Sent At:2020 Mar. 20 (Fri.) 18:12
Recipient:bin.cheng 
Cc:Andrew Pinski ; GCC Patches 
Subject:Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of 
-fstrict-enums


On Fri, Mar 20, 2020 at 10:27 AM bin.cheng  wrote:
>
> --
> Sender:Richard Biener 
> Sent At:2020 Mar. 3 (Tue.) 17:36
> Recipient:Andrew Pinski 
> Cc:bin.cheng ; GCC Patches 
> 
> Subject:Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of 
> -fstrict-enums
>
>
> On Mon, Mar 2, 2020 at 6:14 PM Andrew Pinski  wrote:
> >
> > On Mon, Mar 2, 2020 at 1:40 AM Richard Biener
> >  wrote:
> > >
> > > On Mon, Mar 2, 2020 at 9:07 AM bin.cheng  
> > > wrote:
> > > >
> > > > Hi,
> > > > This is a simple fix for PR93674.  It adds cand carefully for enumeral 
> > > > type iv_use in
> > > > case of -fstrict-enums, it also avoids computing, replacing iv_use with 
> > > > the candidate
> > > > so that no IV of enumeral type is introduced with -fstrict-enums option.
> > > >
> > > > Testcase is also added.  Bootstrap and test on x86_64.  Any comment?
> > >
> > > I think we should avoid enum-typed (or bool-typed) IVs in general, not 
> > > just
> > > with -fstrict-enums.  That said, the ENUMERAL_TYPE checks should be
> > > !(INTEGER_TYPE || POINTER_TYPE_P) checks.
> >
> > Maybe even check type_has_mode_precision_p or
> > TYPE_MIN_VALUE/TYPE_MAX_VALUE have the same as the min/max for that
> > precision/signedness.
>
> Indeed we don't want non-mode precision INTEGER_TYPE IVs either.  I wouldn't
> check TYPE_MIN/MAX_VALUE here though.
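
For reference, a minimal sketch (illustrative, not the PR testcase) of
the kind of IV being discussed, an enum-typed induction variable that
ivopts should avoid introducing, especially under -fstrict-enums:

  enum E { A, B, C, D };

  int
  count (enum E from)
  {
    int n = 0;
    /* Stepping an enum-typed IV past D relies on values outside the
       declared range, which -fstrict-enums lets the compiler assume
       never happen.  */
    for (enum E e = from; e <= D; e = (enum E) (e + 1))
      n++;
    return n;
  }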
>
> Here is the updated patch with more checks.  I also removed the 
> computation/elimination part for now.

+  if ((TREE_CODE (basetype) != INTEGER_TYPE && !POINTER_TYPE_P (basetype))
+  || !type_has_mode_precision_p (basetype))
+{
+  if (!CONVERT_EXPR_P (iv->base))
+   return;
+
+  /* Specially handle such iv_use if it's converted from other ones by
+adding candidate for the source iv.  */
+  base = TREE_OPERAND (iv->base, 0);
+  basetype = TREE_TYPE (base);
+  if (!INTEGRAL_TYPE_P (basetype))
+   return;

I think this is too lax - you'll happily add another non-INTEGER_TYPE
or non-mode-precision IV through this handling.  If the conversion
is not value-preserving the IV may also be useless (also the step
might see truncation in its type adjustment).  With a widening
conversion there's the issue with different wrap-around behavior.
I guess in most cases we'll not be able to use the IV if it is "bogus"
so all this might be harmless.

The original idea was to rely on the condition that the result won't be
a legal IV in the case of widening.  For narrowing, it won't be a
problem except for what you mentioned about the non-INTEGER_TYPE or
non-mode-precision issue.

Was there any reason to handle this special case?  Why not simply
do sth like:
The reason is to strength-reduce the (conversion) operation in the RHS by
introducing an appropriate candidate.
And right, your proposed code looks better and more general.

+  if ((TREE_CODE (basetype) != INTEGER_TYPE && !POINTER_TYPE_P (basetype))
+      || !type_has_mode_precision_p (basetype))
+{
 basetype = lang_hooks.type_for_mode (TYPE_MODE (basetype),
TYPE_UNSIGNED (basetype));
 tree step = fold_convert (basetype, iv->step);
 add_candidate (...);

?

The following also looks bogus:

+  if (basetype == generic_type_for (basetype))
+   step = fold_convert (basetype, step);

don't we add IVs in generic_type_for () and thus the conversion is always
necessary anyways?  (it has the wrong type from the conversion we just
stripped)

This is now discarded.
Patch updated and tested on x86_64.  Is it OK?
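For reference, a filled-in version of Richard's sketch might look like the
following (simplified and untested; the add_candidate arguments are guessed
from tree-ssa-loop-ivopts.c rather than copied from the patch):

  if ((TREE_CODE (basetype) != INTEGER_TYPE && !POINTER_TYPE_P (basetype))
      || !type_has_mode_precision_p (basetype))
    {
      /* Fall back to an unsigned integer type with the same mode, so
         the candidate has well-defined wrap-around behavior.  */
      basetype = lang_hooks.types.type_for_mode (TYPE_MODE (basetype),
                                                 TYPE_UNSIGNED (basetype));
      tree step = fold_convert (basetype, iv->step);
      add_candidate (data, fold_convert (basetype, iv->base), step,
                     false, use);
      return;
    }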

Thanks,
bin

2020-04-09  Bin Cheng  
Richard Biener  

PR tree-optimization/93674
* tree-ssa-loop-ivopts.c (langhooks.h): New include.
(add_iv_candidate_for_use): For an iv_use of non-integer or non-pointer
type, or non-mode-precision type, add a candidate in an unsigned type
with the same precision.

gcc/testsuite
2020-04-09  Bin Cheng  

PR tree-optimization/93674
* g++.dg/pr93674.C: New test.

0001-Add-unsigned-type-iv_cand-for-iv_use-with-non-mode-p.patch


Re: [PING PATCH coroutines] Do not strip cleanup_point when promote temporaries out of current stmt

2020-04-08 Thread Bin.Cheng via Gcc-patches
On Tue, Apr 7, 2020 at 11:45 PM Iain Sandoe  wrote:
>
> Bin.Cheng  wrote:
>
> > On Mon, Mar 16, 2020 at 5:34 PM Bin.Cheng  wrote:
> >> On Wed, Mar 11, 2020 at 5:07 PM Iain Sandoe  wrote:
> >>> Bin.Cheng  wrote:
> >>>
> >>>> On Thu, Mar 5, 2020 at 10:18 PM Iain Sandoe  wrote:
>
> >>> If it helps, I could push a branch to users/iains/ on the FSF git server 
> >>> with this sequence.
> >> Sorry for being slow replying.  This is weird, were you testing against 
> >> trunk?
>
> 1/ @Jun Ma
>
>   - I think your observation is correct, we should ensure that cleanups are
> applied where needed to any temps that remain after we’ve removed those
> captured by reference.
>
>   However, I would prefer that we do this as required, rather than assuming
> it will be needed, so I have an updated patch below.
Sorry, it's been quite a while and I don't remember the details of this
change.  Since this new version fixes the ICE too, please go ahead and
commit it.
>
> 2/ @ Bin / Jun Ma
>
>   - The problem is that the testcase seems to be fragile, perhaps it was a 
> c-reduced one?
It's strange; however, I don't see how being c-reduced has an impact in
this scenario.
>
> So… I do not propose to add this test-case at this point (perhaps we could 
> construct one that actually
> tests the required behaviour - that cleanups are still  run correctly for 
> temps that are not promoted by
> capture)?
>
> Anyway to avoid further delay, I think we should apply the patch below (I 
> have other fixes on top of this
> for open PRs)
>
> OK for master?
Yes, please.  Also, one of the approved patches is waiting for this one.

Thanks,
bin
> thanks
> Iain
>
> coroutines: Add cleanups, where required, to statements with captured 
> references.
>
> When we promote captured temporaries to local variables, we also
> remove their initializers from the relevant call expression.  This
> means that we should recompute the need for a cleanup expression
> once the set of temporaries that remains becomes known.
>
> gcc/cp/ChangeLog:
>
> 2020-04-07  Iain Sandoe  
> Jun Ma  
>
> * coroutines.cc (maybe_promote_captured_temps): Add a
> cleanup expression, if needed, to any call from which
> we promoted temporaries captured by reference.
>
> diff --git a/gcc/cp/coroutines.cc b/gcc/cp/coroutines.cc
> index 983fa650b55..936be06c336 100644
> --- a/gcc/cp/coroutines.cc
> +++ b/gcc/cp/coroutines.cc
> @@ -2798,11 +2798,13 @@ maybe_promote_captured_temps (tree *stmt, void *d)
>location_t sloc = EXPR_LOCATION (*stmt);
>tree aw_bind
> = build3_loc (sloc, BIND_EXPR, void_type_node, NULL, NULL, NULL);
> -  tree aw_statement_current;
> -  if (TREE_CODE (*stmt) == CLEANUP_POINT_EXPR)
> -   aw_statement_current = TREE_OPERAND (*stmt, 0);
> -  else
> -   aw_statement_current = *stmt;
> +
> +  /* Any cleanup point expression might no longer be necessary, since we
> +are removing one or more temporaries.  */
> +  tree aw_statement_current = *stmt;
> +  if (TREE_CODE (aw_statement_current) == CLEANUP_POINT_EXPR)
> +   aw_statement_current = TREE_OPERAND (aw_statement_current, 0);
> +
>/* Collected the scope vars we need move the temps to regular. */
>tree aw_bind_body = push_stmt_list ();
>tree varlist = NULL_TREE;
> @@ -2843,8 +2845,12 @@ maybe_promote_captured_temps (tree *stmt, void *d)
>   /* Replace all instances of that temp in the original expr.  */
>   cp_walk_tree (&aw_statement_current, replace_proxy, &pr, NULL);
> }
> -  /* What's left should be the original statement with any temporaries
> -broken out.  */
> +
> +  /* What's left should be the original statement with any co_await
> +captured temporaries broken out.  Other temporaries might remain
> +so see if we need to wrap the revised statement in a cleanup.  */
> +  aw_statement_current =
> +   maybe_cleanup_point_expr_void (aw_statement_current);
>add_stmt (aw_statement_current);
>BIND_EXPR_BODY (aw_bind) = pop_stmt_list (aw_bind_body);
>awpts->captured_temps.empty ();
>


Re: [PING PATCH coroutines] Do not strip cleanup_point when promote temporaries out of current stmt

2020-03-26 Thread Bin.Cheng via Gcc-patches
On Mon, Mar 16, 2020 at 5:34 PM Bin.Cheng  wrote:
>
> On Wed, Mar 11, 2020 at 5:07 PM Iain Sandoe  wrote:
> >
> > Bin.Cheng  wrote:
> >
> > > On Thu, Mar 5, 2020 at 10:18 PM Iain Sandoe  wrote:
> >
> > >> With current trunk + Bin’s two approved patches.
> > >>
> > >> I see no change in the testcase (lambda-09-capture-object.C) before / 
> > >> after
> > >> the patch
> > >>  (it fails for me at -O0 only - in both cases).
> > >
> >
> > > I tried exactly what you did, however, the result is different.
> >
> > I am a bit concerned that we get different results - yesterday I retested 
> > this on:
> >   Linux : x86_64-linux (cfarm gcc122)
> >   Darwin (x86_64-darwin16),
> >  with the same results on both platforms.
> >
> > 1) I applied the two testcases (note I have renamed 
> > lambda-09-capture-object.C => lambda-11-capture-object.C so that the test 
> > numbers are sequential).  However, I have not changed the test in any other 
> > way.
> >
> > Native configuration is x86_64-pc-linux-gnu
> >
> > === g++ tests ===
> >
> > Schedule of variations:
> > unix/-m32
> > unix/-m64
> >
> > Running target unix/-m32
> > Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> > target.
> > Using /usr/share/dejagnu/config/unix.exp as generic interface file for 
> > target.
> > Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp 
> > as tool-and-target-specific interface file.
> > Running 
> > /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
> >  ...
> > FAIL: g++.dg/coroutines/co-await-syntax-10.C (internal compiler error)
> > FAIL: g++.dg/coroutines/co-await-syntax-10.C (test for excess errors)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal 
> > compiler error)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> > errors)
> > Running 
> > /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
> >  ...
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> > compiler error)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> > excess errors)
> >
> > Running target unix/-m64
> > Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> > target.
> > Using /usr/share/dejagnu/config/unix.exp as generic interface file for 
> > target.
> > Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp 
> > as tool-and-target-specific interface file.
> > Running 
> > /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
> >  ...
> > FAIL: g++.dg/coroutines/co-await-syntax-10.C (internal compiler error)
> > FAIL: g++.dg/coroutines/co-await-syntax-10.C (test for excess errors)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal 
> > compiler error)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> > errors)
> > Running 
> > /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
> >  ...
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> > compiler error)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> > excess errors)
> >
> > ^^ so, this shows that both tests fail (co-await-syntax-10.C is expected)
> >
> > 2) I apply Bin’s patch (Pickup more CO_AWAIT_EXPR expanding cases) (which 
> > is approved)
> >
> > Native configuration is x86_64-pc-linux-gnu
> >
> > === g++ tests ===
> >
> > Schedule of variations:
> > unix/-m32
> > unix/-m64
> >
> > Running target unix/-m32
> > Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> > target.
> > Using /usr/share/dejagnu/config/unix.exp as generic interface file for 
> > target.
> > Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp 
> > as tool-and-target-specific interface file.
> > Running 
> > /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
> >  ...
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal 
> > compiler error)
> > FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> >

Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of -fstrict-enums

2020-03-20 Thread bin.cheng via Gcc-patches
--
Sender:Richard Biener 
Sent At:2020 Mar. 3 (Tue.) 17:36
Recipient:Andrew Pinski 
Cc:bin.cheng ; GCC Patches 

Subject:Re: [PATCH PR93674]Avoid introducing IV of enumeral type in case of 
-fstrict-enums


On Mon, Mar 2, 2020 at 6:14 PM Andrew Pinski  wrote:
>
> On Mon, Mar 2, 2020 at 1:40 AM Richard Biener
>  wrote:
> >
> > On Mon, Mar 2, 2020 at 9:07 AM bin.cheng  
> > wrote:
> > >
> > > Hi,
> > > This is a simple fix for PR93674.  It adds cand carefully for enumeral 
> > > type iv_use in
> > > case of -fstrict-enums, it also avoids computing, replacing iv_use with 
> > > the candidate
> > > so that no IV of enumeral type is introduced with -fstrict-enums option.
> > >
> > > Testcase is also added.  Bootstrap and test on x86_64.  Any comment?
> >
> > I think we should avoid enum-typed (or bool-typed) IVs in general, not just
> > with -fstrict-enums.  That said, the ENUMERAL_TYPE checks should be
> > !(INTEGER_TYPE || POINTER_TYPE_P) checks.
>
> Maybe even check type_has_mode_precision_p or
> TYPE_MIN_VALUE/TYPE_MAX_VALUE have the same as the min/max for that
> precision/signedness.

Indeed we don't want non-mode precision INTEGER_TYPE IVs either.  I wouldn't
check TYPE_MIN/MAX_VALUE here though.

Here is the updated patch with more checks.  I also removed the 
computation/elimination part for now.

Thanks,
bin
2020-03-19  Bin Cheng  

PR tree-optimization/93674
* tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Don't add an iv_cand
for an iv_use which is not of integer or pointer type, or not of mode
precision.  For such cases, add an iv_cand for the source operand if it's
converted from another one.
(get_computation_cost, may_eliminate_iv): Avoid compute, eliminate
iv_use with enumeral type iv_cand in case of -fstrict-enums.

gcc/testsuite
2020-03-19  Bin Cheng  

PR tree-optimization/93674
* g++.dg/pr93674.C: New test.

0001-Don-t-add-iv_cand-for-iv_use-with-non-mode-precision.patch


Re: [PING PATCH coroutines] Do not strip cleanup_point when promote temporaries out of current stmt

2020-03-16 Thread Bin.Cheng via Gcc-patches
On Wed, Mar 11, 2020 at 5:07 PM Iain Sandoe  wrote:
>
> Bin.Cheng  wrote:
>
> > On Thu, Mar 5, 2020 at 10:18 PM Iain Sandoe  wrote:
>
> >> With current trunk + Bin’s two approved patches.
> >>
> >> I see no change in the testcase (lambda-09-capture-object.C) before / after
> >> the patch
> >>  (it fails for me at -O0 only - in both cases).
> >
>
> > I tried exactly what you did, however, the result is different.
>
> I am a bit concerned that we get different results - yesterday I retested 
> this on:
>   Linux : x86_64-linux (cfarm gcc122)
>   Darwin (x86_64-darwin16),
>  with the same results on both platforms.
>
> 1) I applied the two testcases (note I have renamed 
> lambda-09-capture-object.C => lambda-11-capture-object.C so that the test 
> numbers are sequential).  However, I have not changed the test in any other 
> way.
>
> Native configuration is x86_64-pc-linux-gnu
>
> === g++ tests ===
>
> Schedule of variations:
> unix/-m32
> unix/-m64
>
> Running target unix/-m32
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (internal compiler error)
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (test for excess errors)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> compiler error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> excess errors)
>
> Running target unix/-m64
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (internal compiler error)
> FAIL: g++.dg/coroutines/co-await-syntax-10.C (test for excess errors)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> compiler error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> excess errors)
>
> ^^ so, this shows that both tests fail (co-await-syntax-10.C is expected)
>
> 2) I apply Bin’s patch (Pickup more CO_AWAIT_EXPR expanding cases) (which is 
> approved)
>
> Native configuration is x86_64-pc-linux-gnu
>
> === g++ tests ===
>
> Schedule of variations:
> unix/-m32
> unix/-m64
>
> Running target unix/-m32
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> tool-and-target-specific interface file.
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/coroutines.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (internal compiler 
> error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C (test for excess 
> errors)
> Running 
> /home/iains/gcc-master/src-patched/gcc/testsuite/g++.dg/coroutines/torture/coro-torture.exp
>  ...
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (internal 
> compiler error)
> FAIL: g++.dg/coroutines/torture/lambda-11-capture-object.C   -O0  (test for 
> excess errors)
>
> Running target unix/-m64
> Using /usr/share/dejagnu/baseboards/unix.exp as board description file for 
> target.
> Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
> Using /home/iains/gcc-master/src-patched/gcc/testsuite/config/default.exp as 
> to

[PATCH PR94125]Update post order number for merged SCC

2020-03-15 Thread bin.cheng via Gcc-patches
Hi,
This simple patch fixes PR94125 by updating the post order number for the
merged SCC.  The root cause is that after computing SCCs with runtime alias
edges skipped, the post order info is changed, and it's possible for a
partition to be scheduled after another one that it should be scheduled
before.  Note that updating to the minimal post order number is safe: only
partitions separated from the SCC because of skipping runtime alias edges
can be assigned a larger post order than the original preceding one.
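In terms of the graphds structures, the idea is roughly the following
(illustrative sketch only, not the actual hunk):

  /* After merging the partitions of SCC number I, give them the minimal
     postorder number seen in that SCC, so the merged partition still
     sorts before every partition that must follow it.  */
  int min_post = INT_MAX;
  for (int k = 0; k < pg->n_vertices; ++k)
    if (pg->vertices[k].component == i)
      min_post = MIN (min_post, pg->vertices[k].post);
  for (int k = 0; k < pg->n_vertices; ++k)
    if (pg->vertices[k].component == i)
      pg->vertices[k].post = min_post;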

Bootstrap and test on x86_64, test case also added.

Thanks,
bin

2020-03-15  Bin Cheng  

PR tree-optimization/94125
* tree-loop-distribution.c
(loop_distribution::break_alias_scc_partitions): Update post order
number for merged SCC.

gcc/testsuite
2020-03-15  Bin Cheng  

PR tree-optimization/94125
* gcc.dg/tree-ssa/pr94125.c: New test.

0001-pr94125.patch


Re: [PING PATCH coroutines] Do not strip cleanup_point when promote temporaries out of current stmt

2020-03-10 Thread Bin.Cheng
On Thu, Mar 5, 2020 at 10:18 PM Iain Sandoe  wrote:
>
> Hello JunMa,
>
> JunMa  wrote:
>
> > Ping
>
> Once again, sorry for taking time to review this.
>
> > On 2020/2/27 10:18 AM, JunMa wrote:
> >> On 2020/2/11 10:14 AM, JunMa wrote:
> >> Kindly ping
> >>
> >> Regards
> >> JunMa
> >>> Hi
> >>> In maybe_promote_captured_temps, the cleanup_point_stmt has been
> >>> stripped when handle temporaries captured by reference. However, maybe
> >>> there are non-reference temporaries in current stmt which cause ice in
> >>> gimpilify pass.
> >>>
> >>> This patch fix this. The testcase comes from cppcoro and is reduced by
> >>> creduce.
>
> With current trunk + Bin’s two approved patches.
>
> I see no change in the testcase (lambda-09-capture-object.C) before / after
> the patch
>   (it fails for me at -O0 only - in both cases).

Hi Iain,
I tried exactly what you did; however, the result is different.
With current trunk (cb2c60206f4f2218f84ccde21663b00de068d8c7) plus my
approved patch, the case (lambda-09-capture-object.C) still causes an ICE.
Actually, the same ICE happens in the testcase (co-await-syntax-11.C) added
by my patch.
So this one is a prerequisite for my approved patch.

Thanks,
bin
>
> please could you check?
> thanks
> Iain
>


Fwd: [PATCH PR93674]Avoid introducing IV of enumeral type in case of -fstrict-enums

2020-03-09 Thread Bin.Cheng
Forwarding to public list.

-- Forwarded message -
From: Bin.Cheng 
Date: Mon, Mar 9, 2020 at 5:07 PM
Subject: Re: [PATCH PR93674]Avoid introducing IV of enumeral type in
case of -fstrict-enums
To: Richard Biener 


On Tue, Mar 3, 2020 at 5:36 PM Richard Biener
 wrote:
>
> On Mon, Mar 2, 2020 at 6:14 PM Andrew Pinski  wrote:
> >
> > On Mon, Mar 2, 2020 at 1:40 AM Richard Biener
> >  wrote:
> > >
> > > On Mon, Mar 2, 2020 at 9:07 AM bin.cheng  
> > > wrote:
> > > >
> > > > Hi,
> > > > This is a simple fix for PR93674.  It adds cand carefully for enumeral 
> > > > type iv_use in
> > > > case of -fstrict-enums, it also avoids computing, replacing iv_use with 
> > > > the candidate
> > > > so that no IV of enumeral type is introduced with -fstrict-enums option.
> > > >
> > > > Testcase is also added.  Bootstrap and test on x86_64.  Any comment?
> > >
> > > I think we should avoid enum-typed (or bool-typed) IVs in general, not 
> > > just
> > > with -fstrict-enums.  That said, the ENUMERAL_TYPE checks should be
> > > !(INTEGER_TYPE || POINTER_TYPE_P) checks.
> >
> > Maybe even check type_has_mode_precision_p or
> > TYPE_MIN_VALUE/TYPE_MAX_VALUE have the same as the min/max for that
> > precision/signedness.
>
> Indeed we don't want non-mode precision INTEGER_TYPE IVs either.  I wouldn't
> check TYPE_MIN/MAX_VALUE here though.

Thanks to all of you for the suggestions; I will update the change.
>
> Richard.
>
> > Thanks,
> > Andrew
> >
> > >
> > > +  /* Check if cand can represent values of use for strict enums.  */
> > > +  else if (TREE_CODE (ctype) == ENUMERAL_TYPE && flag_strict_enums)
> > > +{
> > >
> > > if we don't have enum-typed IV candidates then the computation should
> > > be carried out in INTEGER_TYPE and then be converted to enum type.
> > > So why's this and the may_eliminate_iv hunks necessary?
I was wondering about the case where an enum variable is used directly as
the loop variable?  Note the above check only happens when adding a
candidate for a derived iv use.

Thanks,
bin
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > bin
> > > > 2020-03-02  Bin Cheng  
> > > >
> > > > PR tree-optimization/93674
> > > > * tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Add 
> > > > candidate
> > > > for enumeral type iv_use converted from other iv.
> > > > (get_computation_cost, may_eliminate_iv): Avoid compute, 
> > > > eliminate
> > > > iv_use with enumeral type iv_cand in case of -fstrict-enums.
> > > >
> > > > gcc/testsuite
> > > > 2020-03-02  Bin Cheng  
> > > >
> > > > PR tree-optimization/93674
> > > > * g++.dg/pr93674.C: New test.


[PATCH PR93674]Avoid introducing IV of enumeral type in case of -fstrict-enums

2020-03-02 Thread bin.cheng
Hi,
This is a simple fix for PR93674.  It adds the candidate carefully for an
enumeral-type iv_use in case of -fstrict-enums; it also avoids computing or
replacing the iv_use with the candidate, so that no IV of enumeral type is
introduced with the -fstrict-enums option.
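A hypothetical reduction of the problem class (not the PR testcase):

  /* With -fstrict-enums, values of E are assumed to stay within the
     range implied by its enumerators, so IVOPTs must not introduce an
     IV of type E whose intermediate values could leave that range.  */
  enum E { A = 0, B = 64 };
  int sum (const int *p)
  {
    int s = 0;
    for (E e = A; e != B; e = (E) (e + 1))
      s += p[e];
    return s;
  }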

Testcase is also added.  Bootstrap and test on x86_64.  Any comment?

Thanks,
bin
2020-03-02  Bin Cheng  

PR tree-optimization/93674
* tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Add candidate
for enumeral type iv_use converted from other iv.
(get_computation_cost, may_eliminate_iv): Avoid compute, eliminate
iv_use with enumeral type iv_cand in case of -fstrict-enums.

gcc/testsuite
2020-03-02  Bin Cheng  

PR tree-optimization/93674
* g++.dg/pr93674.C: New test.

pr93674-20200302.txt
Description: Binary data


Re: [PATCH Coroutines]Insert the default return_void call at correct position

2020-02-10 Thread Bin.Cheng
Ping.

Thanks,
bin

On Mon, Feb 3, 2020 at 1:55 PM bin.cheng  wrote:
>
> Hi,
>
> Exceptions in coroutines are not correctly handled because the default
> return_void call is currently inserted before the final suspend point,
> rather than at the end of the original coroutine body.  This patch
> fixes the issue by generating the following code:
>   co_await promise.initial_suspend();
>   try {
> // The original coroutine body
>
> promise.return_void(); // The default return_void call.
>   } catch (...) {
> promise.unhandled_exception();
>   }
>   final_suspend:
>   // ...
>
> Bootstrap and test on x86_64.  Is it OK?
>
> Thanks,
> bin
>
> gcc/cp
> 2020-02-03  Bin Cheng  
>
> * coroutines.cc (build_actor_fn): Factor out code inserting the
> default return_void call to...
> (morph_fn_to_coro): ...here, also hoist local var declarations.
>
> gcc/testsuite
> 2020-02-03  Bin Cheng  
>
> * g++.dg/coroutines/torture/co-ret-15-default-return_void.C: New.


Re: [PATCH Coroutines]Pickup more CO_AWAIT_EXPR expanding cases

2020-02-10 Thread bin.cheng
Hi,

We found more ICEs because of unexpanded CO_AWAIT_EXPRs; it turned out we
can fix these issues with more simplification in function co_await_expander.
Here is the patch with a new test.

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

gcc/cp
2020-02-10  Bin Cheng  

* coroutines.cc (co_await_expander): Simplify.

gcc/testsuite
2020-02-10  Bin Cheng  

* g++.dg/coroutines/co-await-syntax-10.C: New test.
* g++.dg/coroutines/co-await-syntax-11.C: New test.


--
Sender:bin.cheng 
Sent At:2020 Feb. 5 (Wed.) 09:34
Recipient:GCC Patches 
Cc:Iain Sandoe ; Nathan Sidwell 
Subject:[PATCH Coroutines]Pickup more CO_AWAIT_EXPR expanding cases


Hi,
This simple patch actually is a missing part of the previous one at
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01451.html
It picks up more CO_AWAIT_EXPR expanding cases.  A test is also added.

Bootstrap and test on x86_64. Is it OK?

Thanks,
bin

0002-Simplify-co_await_expander.patch


Re: [PATCH coroutines] Change lowering behavior of alias variable from copy to substitute

2020-02-06 Thread Bin.Cheng
On Thu, Feb 6, 2020 at 5:12 PM Iain Sandoe  wrote:
>
> Hi JunMa,
>
> JunMa  wrote:
>
> > On 2020/2/4 8:17 PM, JunMa wrote:
> >> Hi
> >> When testing coroutines with lambda function, I find we copy each captured
> >> variable to frame. However, according to gimplify pass, for each
> >> declaration
> >> that is an alias for another expression(DECL_VALUE_EXPR), we can
> >> substitute them directly.
> >>
> >> Since lambda captured variables is one of this kind. It is better to
> >> replace them
> >> rather than copy them, This can reduce frame size (all of the captured
> >> variables
> >> are field of closure class) and avoid extra copy behavior as well.
> >>
> >> This patch remove all of the code related to copy captured variable.
> >> Instead, we first rewrite DECL_VALUE_EXPR with frame field, then we check
> >> every variable whether it has DECL_VALUE_EXPR, and then substitute it,
> >> this
> >> patch does not create frame field for such variables.
> >>
> >> Bootstrap and test on X86_64, is it OK?
>
> > minor update: only handle var_decl when iterate BIND_EXPR_VARS
> > in register_local_var_uses.
>
> Do you have any other local patches applied along with this?
>
> Testing locally (on Darwin), I see regressions with optimisation  O2/O3/Os
> e.g:
>
> class-05-lambda-capture-copy-local.C   -O2  (internal compiler error)
> class-06-lambda-capture-ref.C   -O2  (internal compiler error)
> lambda-05-capture-copy-local.C   -O2  (internal compiler error)
> lambda-06-multi-capture.C   -O2  (internal compiler error)
> lambda-07-multi-capture.C   -O2  (internal compiler error)
> lambda-08-co-ret-parm-ref.C   -O3 -g  (internal compiler error)
>
> I have applied this to master, and on top of the patches posted by you and
> Bin, but the results are the same.
Hi Iain,

Thanks for helping.
Yes, there will be another patch fixing the O2/O3 issues.  Will send
it out for review soon.

Thanks,
bin
>
> thanks
> Iain
>
> >> gcc/cp
> >> 2020-02-04  Jun Ma 
> >>
> >> * coroutines.cc (morph_fn_to_coro): Remove code related to
> >> copy captured variable.
> >> (register_local_var_uses):  Ditto.
> >> (register_param_uses):  Collect use of parameters inside
> >> DECL_VALUE_EXPR.
> >> (transform_local_var_uses): Substitute the alias variable
> >> with DECL_VALUE_EXPR if it has one.
> >>
> >>
> >> gcc/testsuite
> >> 2020-02-04  Jun Ma 
> >>
> >> * g++.dg/coroutines/lambda-07-multi-capture.C: New test.
> >
> >
> > <0001-fix-alias-variable.patch>
>
>


[PATCH Coroutines][OBVIOUS]Fix typo of temporary variable name

2020-02-05 Thread bin.cheng
Hi,
This is an obvious patch fixing a typo in maybe_promote_captured_temps;
previously, the same vnum (== 0) was used for all captured temps...

Will commit later.

Thanks,
bin

gcc/cp
2020-02-05  Bin Cheng  

* coroutines.cc (maybe_promote_captured_temps): Increase the index
number for temporary variables' name.

obvious-vnum.patch


[PATCH Coroutines]Pickup more CO_AWAIT_EXPR expanding cases

2020-02-04 Thread bin.cheng
Hi,
This simple patch actually is a missing part of the previous one at
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01451.html
It picks up more CO_AWAIT_EXPR expanding cases.  A test is also added.

Bootstrap and test on x86_64. Is it OK?

Thanks,
bin

gcc/cp
2020-02-05  Bin Cheng  

* coroutines.cc (co_await_expander): Handle more CO_AWAIT_EXPR
expanding cases.

gcc/testsuite
2020-02-05  Bin Cheng  

* g++.dg/coroutines/co-await-syntax-10.C: New test.

0001-Pickup-more-CO_AWAIT_EXPR-expanding-cases.patch


[PATCH Coroutines]Insert the default return_void call at correct position

2020-02-02 Thread bin.cheng
Hi,

Exceptions in coroutines are not correctly handled because the default
return_void call is currently inserted before the final suspend point,
rather than at the end of the original coroutine body.  This patch
fixes the issue by generating the following code:
  co_await promise.initial_suspend();
  try {
// The original coroutine body

promise.return_void(); // The default return_void call.
  } catch (...) {
promise.unhandled_exception();
  }
  final_suspend:
  // ...

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

gcc/cp
2020-02-03  Bin Cheng  

* coroutines.cc (build_actor_fn): Factor out code inserting the
default return_void call to...
(morph_fn_to_coro): ...here, also hoist local var declarations.

gcc/testsuite
2020-02-03  Bin Cheng  

* g++.dg/coroutines/torture/co-ret-15-default-return_void.C: New.

0001-Insert-default-return_void-at-the-end-of-coroutine-b.patch


Re: [PATCH, ivopts] Fix fast-math-pr55281.c ICE

2020-01-30 Thread Bin.Cheng
On Thu, Jan 30, 2020 at 8:53 PM Andrew Stubbs  wrote:
>
> On 29/01/2020 08:24, Richard Biener wrote:
> > On Tue, Jan 28, 2020 at 5:53 PM Andrew Stubbs  wrote:
> >>
> >> This patch fixes an ICE compiling fast-math-pr55281.c for amdgcn.
> >>
> >> The problem is that an "iv" is created in which both base and step are
> >> pointer types,
> >
> > How did you get a POINTER_TYPE step?  That's where the issue lies
> > I think.
>
> It can come from "find_inv_vars_cb":
>
>set_iv (idata, op, op, build_int_cst (TREE_TYPE (op), 0), true);

This is recording an invariant with a zero step.  It seems we are using the
wrong type when building the zero step.  How about detecting that op has
pointer type and using an integer type here?
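Something along the lines of the following sketch, assuming the zero-step
registration in find_inv_vars_cb (untested):

  tree steptype = TREE_TYPE (op);
  if (POINTER_TYPE_P (steptype))
    steptype = sizetype;  /* Steps of pointer IVs should be integral.  */
  set_iv (idata, op, op, build_int_cst (steptype, 0), true);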

Thanks,
bin
>
> whenever "op" has a pointer type.
>
> Similarly for "get_iv":
>
>set_iv (data, var, var, build_int_cst (type, 0), true);
>
> whenever "var" has a pointer type.
>
> In this particular case, I traced the origin back to the second one, I
> think (but it's somewhat hard to unpick).
>
> I could change one or both of those, but I don't understand enough about
> the consequences of that to be sure it's the right thing to do. I can
> confirm that converting the step at this point does appear to have the
> desired effect in this instance.
>
> At least at the point of writing it to gimple I can determine what is
> definitely malformed.
>
> Andrew


[PATCH PR93334][GCC11]Refine data dependence of two refs storing the same constant with the same bytes

2020-01-29 Thread bin.cheng
Hi,

As discussed in the PR, this simple patch refines the data dependence of two
write references storing the same constant with the same bytes.  It simply
detects the case, with some restrictions, and treats it as no dependence.
For now the added interface in tree-data-ref.c is only used in loop
distribution; it might be generalized in the future.
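A hypothetical illustration of the case (not the added testcase):

  void f (char *a, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;
        a[i + 1] = 0;  /* Overlaps a[i], but stores identical bytes, so
                          the write-write dependence can be ignored and
                          the loop distributed, e.g. into memsets.  */
      }
  }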

Bootstrap and test on x86_64.  Any comments?

Thanks,
bin

2020-01-29  Bin Cheng  

* tree-data-ref.c (refine_affine_dependence): New function.
* tree-data-ref.h (refine_affine_dependence): New declaration.
* tree.h (const_with_all_bytes_same): External declaration.
* tree.c (const_with_all_bytes_same): Moved from...
* tree-loop-distribution.c (const_with_all_bytes_same): ...here.  Call
refine_affine_dependence.

gcc/testsuite
2020-01-29  Bin Cheng  

* gcc.dg/tree-ssa/pr93334.c: New test.

0001-pr93334.patch


[PATCH Coroutines]Fix an ICE case in expanding co_await expression

2020-01-22 Thread bin.cheng
Hi,

Though function co_await_expander may need to be further revised, this simple
patch fixes an ICE case in co_await_expander.

Handle CO_AWAIT_EXPR in conversion in co_await_expander.

Function co_await_expander expands CO_AWAIT_EXPR and inserts the expanded
code before the result of co_await is used; however, it doesn't cover the
type conversion case, which leads to a gimplify ICE.  This patch fixes it.
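The failing shape is roughly the following (a sketch assuming a coroutine
type task<T> along the lines of the testsuite helpers, not a self-contained
testcase):

  task<long> f (task<int> t)
  {
    long v = co_await t;  /* The int result is consumed through an
                             implicit conversion, so the CO_AWAIT_EXPR
                             sits under a conversion node that the
                             expander previously did not handle.  */
    co_return v;
  }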

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

gcc/cp
2020-01-22  Bin Cheng  

* coroutines.cc (co_await_expander): Handle type conversion case.

gcc/testsuite
2020-01-22  Bin Cheng  

* g++.dg/coroutines/co-await-syntax-09-convert.C: New test.

0001-Handle-CO_AWAIT_EXPR-in-conversion-in-co_await_expan.patch


Re: [PATCH Coroutines]Access promise via actor function's frame pointer argument

2020-01-20 Thread bin.cheng
On Mon, Jan 20, 2020 at 10:59 PM Iain Sandoe  wrote:
>
> Hi Bin,
>
> bin.cheng  wrote:
>
> > By standard, coroutine body should be encapsulated in try-catch block
> > as following:
> >  try {
> >    // coroutine body
> >  } catch(...) {
> >    promise.unhandled_exception();
> >  }
> > Given above try-catch block is implemented in the coroutine actor
> > function called by coroutine ramp function, so the promise should
> > be accessed via actor function's coroutine frame pointer argument,
> > rather than the ramp function's coroutine frame variable.
>
> thanks, good catch!
> it has not triggered for me (including on some more complex test-cases that 
> are not useable
> in the testsuite).
>
> > This patch also refactored code to make the fix easy, unfortunately,
>
> see below,
>
> > I failed to reduce a test case from cppcoro.
>
> it would be good if we could try to find a reproducer.  I’m happy to try and 
> throw
> creduce at a preprocessed file, if you have one.
>
> > gcc/cp
> > 2020-01-20  Bin Cheng  
> >
> >        * coroutines.cc (act_des_fn, wrap_coroutine_body): New.
> >        (morph_fn_to_coro): Call act_des_fn to build actor/destroy decls, as
> >        well as access promise via actor function's frame pointer argument.
> >        Refactor code into above functions.
> >        (build_actor_fn, build_destroy_fn): Use frame pointer argument
> +      /* We still try to look for the promise method and warn if it's not
> +        present.  */
> +      tree ueh_meth
> +       = lookup_promise_method (orig, coro_unhandled_exception_identifier,
> +                                loc, /*musthave=*/ false);
> +      if (!ueh_meth || ueh_meth == error_mark_node)
> +       warning_at (loc, 0, "no member named %qE in %qT",
> +                   coro_unhandled_exception_identifier,
> +                   get_coroutine_promise_type (orig));
> +    }
> +  /* Else we don't check and don't care if the method is missing.  */
> +
> +  return fnbody;
> +}
>
> IMO this ^^^ obfuscates the fix, and I don’t think it should be done at the 
> same time.
Sure, given the stage we are in, I split the patch and left the refactoring
for the future.  IMHO, the function is too long, so self-contained operations
are worth outlining into a function even if it's called only once.

Patch updated as attached.

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  
        * coroutines.cc (act_des_fn): New.
        (morph_fn_to_coro): Call act_des_fn to build actor/destroy decls.
        Access promise via actor function's frame pointer argument.
        (build_actor_fn, build_destroy_fn): Use frame pointer argument.

0001-Use-promise-in-coroutine-frame-in-actor-function.patch


[PATCH Coroutines]Access promise via actor function's frame pointer argument

2020-01-20 Thread bin.cheng
Hi,

By the standard, the coroutine body should be encapsulated in a try-catch
block as follows:
  try {
// coroutine body
  } catch(...) {
promise.unhandled_exception();
  }
Given that the above try-catch block is implemented in the coroutine actor
function called by the coroutine ramp function, the promise should be
accessed via the actor function's coroutine frame pointer argument,
rather than the ramp function's coroutine frame variable.
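A sketch of the intended access path inside build_actor_fn (simplified;
the promise_name handle is illustrative):

  /* Dereference the actor's frame pointer parameter and fetch the
     promise member from the frame, instead of using the ramp's local
     frame variable.  */
  tree actor_frame = build1 (INDIRECT_REF, coro_frame_type, actor_fp);
  tree mem = lookup_member (coro_frame_type, promise_name,
                            /*protect=*/1, /*want_type=*/0,
                            tf_warning_or_error);
  tree promise = build_class_member_access_expr (actor_frame, mem,
                                                 NULL_TREE, false,
                                                 tf_warning_or_error);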

This patch also refactored the code to make the fix easy; unfortunately,
I failed to reduce a test case from cppcoro.  Bootstrap and test ongoing.
Is it OK?

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  

* coroutines.cc (act_des_fn, wrap_coroutine_body): New.
(morph_fn_to_coro): Call act_des_fn to build actor/destroy decls, as
well as access promise via actor function's frame pointer argument.
Refactor code into above functions.
(build_actor_fn, build_destroy_fn): Use frame pointer argument.

0001-Use-promise-in-coroutine-frame-in-actor-function.patch


Re: [PATCH Coroutines]Fix ICE when co_awaiting on void type

2020-01-20 Thread Bin.Cheng
On Mon, Jan 20, 2020 at 6:31 PM Iain Sandoe  wrote:
>
> Hi Bin,
>
> Bin.Cheng  wrote:
>
> > On Mon, Jan 20, 2020 at 5:28 PM Jakub Jelinek  wrote:
> >> On Mon, Jan 20, 2020 at 08:59:20AM +, Iain Sandoe wrote:
> >>> Hi Bin,
> >>>
> >>> bin.cheng  wrote:
> >>>
> >>>> gcc/cp
> >>>> 2020-01-20  Bin Cheng  
> >>>>
> >>>>   * coroutines.cc (build_co_await): Skip getting complete type for 
> >>>> void.
> >>>>
> >>>> gcc/testsuite
> >>>> 2020-01-20  Bin Cheng  
> >>>>
> >>>>   * g++.dg/coroutines/co-await-void_type.C: New 
> >>>> test.
> >>>
> >>> This LGTM, (borderline obvious, in fact) but you will need to wait for
> >>> a C++
> >>> maintainer to approve,
> >>
> >> The patch actually looks wrong to me, there is nothing magic about void
> >> type here, it will ICE on any other incomplete type, because
> >> complete_type_or_else returns NULL if the type is incomplete.
> >>
> >> So, I think you want instead:
> >>   tree o_type = complete_type_or_else (TREE_TYPE (o), o);
> >> +  if (o_type == NULL_TREE)
> >> +return error_mark_node;
> >>   if (TREE_CODE (o_type) != RECORD_TYPE)
> >>
> >> or similar (the diagnostics should be emitted already, so I don't see
> >> further point to emit it once again).
> >
> > Thanks very much for pointing out.  I was trying to bypass generic
> > void error message to make it more related to coroutine, like:
> > e1.cc: In function ‘resumable foo(int)’:
> > e1.cc:45:3: error: awaitable type ‘void’ is not a structure
> >   45 |   co_yield n + x + y;
> >  |   ^~~~
> >
> > vs.
> >
> > e1.cc: In function ‘resumable foo(int)’:
> > e1.cc:45:20: error: invalid use of ‘void’
> >   45 |   co_yield n + x + y;
> >  |^
> >
> > Anyway I will update the patch.
>
> ...I had already made a start...
Thanks very much!

>
> The patch below gives the improved diagnostic while checking for NULL
> returns fom complete_type_or_else.
>
> OK?
> Iain
>
>
> gcc/cp
>
> 2020-01-20  Bin Cheng  
You can remove me from the ChangeLog or keep it as "bin.cheng"  :-)
>   Iain Sandoe  
>
> * coroutines.cc (coro_promise_type_found_p): Check for NULL return 
> from
> complete_type_or_else.
> (register_param_uses): Likewise.
> (build_co_await): Do not try to use complete_type_or_else for void 
> types,
> otherwise for incomplete types, check for NULL return from 
> complete_type_or_else.
>
> gcc/testsuite
> 2020-01-20  Bin Cheng  
>
>* g++.dg/coroutines/co-await-void_type.C: New 
> test.
>
>
> diff --git a/gcc/cp/coroutines.cc b/gcc/cp/coroutines.cc
> index f0febfe0c8a..4e1ba81fb46 100644
> --- a/gcc/cp/coroutines.cc
> +++ b/gcc/cp/coroutines.cc
> @@ -428,8 +428,9 @@ coro_promise_type_found_p (tree fndecl, location_t loc)
>
> /* Complete this, we're going to use it.  */
> coro_info->handle_type = complete_type_or_else (handle_type, fndecl);
> +
> /* Diagnostic would be emitted by complete_type_or_else.  */
> -  if (coro_info->handle_type == error_mark_node)
> +  if (!coro_info->handle_type)
> return false;
>
> /* Build a proxy for a handle to "self" as the param to
> @@ -651,8 +652,11 @@ build_co_await (location_t loc, tree a,
> suspend_point_kind suspend_kind)
>   o = a; /* This is most likely about to fail anyway.  */
>
> tree o_type = TREE_TYPE (o);
> -  if (!VOID_TYPE_P (o_type))
> -o_type = complete_type_or_else (TREE_TYPE (o), o);
> +  if (o_type && !VOID_TYPE_P (o_type) && !COMPLETE_TYPE_P (o_type))
IIUC, Jakub doesn't like checking void type specially here?

> +o_type = complete_type_or_else (o_type, o);
> +
> +  if (!o_type)
> +return error_mark_node;
>
> if (TREE_CODE (o_type) != RECORD_TYPE)
>   {
> @@ -2746,6 +2750,10 @@ register_param_uses (tree *stmt, int *do_subtree
> ATTRIBUTE_UNUSED, void *d)
> if (!COMPLETE_TYPE_P (actual_type))
> actual_type = complete_type_or_else (actual_type, *stmt);
>
> +  if (actual_type == NULL_TREE)
> +   /* Diagnostic emitted by complete_type_or_else.  */
> +   actual_type = error_mark_node;
> +
> if (TREE_CODE (actual_type) == REFERENCE_TYPE)
> actual_type = build_pointer_type (TREE_TYPE (actual_type));
>
>

Thanks,
bin


Re: [PATCH Coroutines]Fix ICE when co_awaiting on void type

2020-01-20 Thread Bin.Cheng
On Mon, Jan 20, 2020 at 5:28 PM Jakub Jelinek  wrote:
>
> On Mon, Jan 20, 2020 at 08:59:20AM +, Iain Sandoe wrote:
> > Hi Bin,
> >
> > bin.cheng  wrote:
> >
> > > gcc/cp
> > > 2020-01-20  Bin Cheng  
> > >
> > >* coroutines.cc (build_co_await): Skip getting complete type for 
> > > void.
> > >
> > > gcc/testsuite
> > > 2020-01-20  Bin Cheng  
> > >
> > >* g++.dg/coroutines/co-await-void_type.C: New 
> > > test.
> >
> > This LGTM, (borderline obvious, in fact) but you will need to wait for a C++
> > maintainer to approve,
>
> The patch actually looks wrong to me, there is nothing magic about void
> type here, it will ICE on any other incomplete type, because
> complete_type_or_else returns NULL if the type is incomplete.
>
> So, I think you want instead:
>tree o_type = complete_type_or_else (TREE_TYPE (o), o);
> +  if (o_type == NULL_TREE)
> +return error_mark_node;
>if (TREE_CODE (o_type) != RECORD_TYPE)
>
> or similar (the diagnostics should be emitted already, so I don't see
> further point to emit it once again).

Thanks very much for pointing that out.  I was trying to bypass the generic
void error message to make it more related to coroutines, like:
e1.cc: In function ‘resumable foo(int)’:
e1.cc:45:3: error: awaitable type ‘void’ is not a structure
   45 |   co_yield n + x + y;
  |   ^~~~

vs.

e1.cc: In function ‘resumable foo(int)’:
e1.cc:45:20: error: invalid use of ‘void’
   45 |   co_yield n + x + y;
  |^

Anyway I will update the patch.

Thanks,
bin
>
> But it also means that other uses of complete_type_or_else look broken:
>
>   /* Complete this, we're going to use it.  */
>   coro_info->handle_type = complete_type_or_else (handle_type, fndecl);
>   /* Diagnostic would be emitted by complete_type_or_else.  */
>   if (coro_info->handle_type == error_mark_node)
> return false;
>
> or
>
>   if (!COMPLETE_TYPE_P (actual_type))
> actual_type = complete_type_or_else (actual_type, *stmt);
>
>   if (TREE_CODE (actual_type) == REFERENCE_TYPE)
> actual_type = build_pointer_type (TREE_TYPE (actual_type));
>
> where I think the first one should check for handle_type being NULL_TREE
> rather than error_mark_node, and the latter should handle the case of
> it returning NULL_TREE.
>
> Right below the last hunk, I've noticed:
>
>   size_t namsize = sizeof ("__parm.") + IDENTIFIER_LENGTH (pname) + 1;
>   char *buf = (char *) alloca (namsize);
>   snprintf (buf, namsize, "__parm.%s", IDENTIFIER_POINTER (pname));
>
> Do you really need one byte extra?  I mean, sizeof ("__parm.") already
> includes +1 for the terminating '\0', so unless you append something to
> the buffer later, I don't see the point for the + 1.
> And the middle line could be written as
>   char *buf = XALLOCAVEC (char, namsize);
>
> Jakub
>


[PATCH Coroutines]Fix ICE when co_awaiting on void type

2020-01-19 Thread bin.cheng
Hi,

This simple patch skips calling complete_type_or_else for the void type,
which fixes the corresponding ICE.

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  

* coroutines.cc (build_co_await): Skip getting complete type for void.

gcc/testsuite
2020-01-20  Bin Cheng  

* g++.dg/coroutines/co-await-void_type.C: New test.

0001-Fix-ICE-when-co_awaiting-on-void-type.patch


[PATCH Coroutines]Fix false warning message about missing return

2020-01-19 Thread bin.cheng
Hi,
The patch sets the current_function_returns_value flag in templates for all
co_return/co_yield/co_await cases, as well as for the ramp function.  This
fixes the false warning for cases like the one added, as well as various
cases in cppcoro.
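A shape sketch of the kind of false positive this avoids (assuming a
coroutine type 'resumable' as in other tests; not the added testcase):

  template <typename T>
  resumable count (T n)
  {
    for (T i = 0; i < n; ++i)
      co_yield i;  /* Without the fix, instantiating this template did
                      not mark the function as returning a value, so
                      g++ emitted a bogus missing-return warning.  */
  }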

Bootstrap and test on X86_64, is it OK?

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  

* coroutines.cc (finish_co_await_expr): Set return value flag.
(finish_co_yield_expr, morph_fn_to_coro): Ditto.

gcc/testsuite
2020-01-20  Bin Cheng  

* g++.dg/coroutines/co-return-warning-1.C: New test.

0001-Fix-false-warning-messages-about-missing-return-in-c.patch


Re: [C++ coroutines 4/6] Middle end expanders and transforms.

2020-01-15 Thread Bin.Cheng
> > +   gassign *get_res = gimple_build_assign (lhs, done);
> > +   gsi_replace (gsi, get_res, true);
> > +   *handled_ops_p = true;
> > + }
> > + break;
> > +   }
> > +}
> > +  return NULL_TREE;
> > +}
> > +
> > +/* Main entry point for lowering coroutine FE builtins.  */
> > +
> > +static unsigned int
> > +execute_lower_coro_builtins (void)
> > +{
> > +  struct walk_stmt_info wi;
> > +  gimple_seq body;
> > +
> > +  body = gimple_body (current_function_decl);
> > +  memset (&wi, 0, sizeof (wi));
> > +  walk_gimple_seq_mod (&body, lower_coro_builtin, NULL, &wi);
> > +  gimple_set_body (current_function_decl, body);
>
> it would be nice to elide the function walk for functions not
> containing any CO* stuff (you noted that below for other parts).
> We can spend a bit in struct function noting functions with
> coroutine code inside and set the bit from frontends or from
> the gimplifier for example.  Since it's behind the flag_coroutines
> paywall this can be addressed as followup.

Yes, this bit flag is necessary for the following optimization passes; I
wonder which bit you would suggest?
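For example, a one-bit flag along these lines (the field name is only
illustrative):

  /* In struct function (gcc/function.h): set by the front end whenever
     it emits any coroutine construct for this function.  */
  unsigned int coroutine_component : 1;

which would let execute_lower_coro_builtins bail out early with

  if (!cfun->coroutine_component)
    return 0;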

Thanks,
bin


Re: [PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2020-01-14 Thread Bin.Cheng
On Wed, Jan 15, 2020 at 5:04 AM Jeff Law  wrote:
>
> On Thu, 2020-01-09 at 14:20 +0800, Bin.Cheng wrote:
> > On Fri, Dec 20, 2019 at 3:10 PM Richard Biener
> >  wrote:
> > > On December 20, 2019 2:13:47 AM GMT+01:00, "Bin.Cheng" 
> > >  wrote:
> > > > On Fri, Dec 13, 2019 at 11:26 AM bin.cheng
> > > >  wrote:
> > > > > Hi,
> > > > >
> > > > > As reported in PR92926, constant ctor is shared translation unit wide
> > > > because of constexpr_call_table,
> > > > > however, during gimplify, the shared ctor could be modified.  This
> > > > patch fixes the issue by unsharing
> > > > > it before modification in gimplify.  A test is reduced from cppcoro
> > > > library and added.
> > > > > Bootstrap and test ongoing.  Not sure if this is the correct fix
> > > > though, any comments?
> > > > Ping.  Any comment?
> > >
> > > Looks reasonable to me.
> > Given PR92926 is marked as duplicate of PR93143, I updated test case
> > of the patch.
> >
> > Thanks,
> > bin
> >
> > 2019-12-13  Bin Cheng  
> >
> > PR tree-optimization/93143
> > * gimplify.c (gimplify_init_constructor): Unshare ctor node before
> > clearing.
> >
> > gcc/testsuite
> > 2019-12-13  Bin Cheng  
> >
> > PR tree-optimization/93143
> > * g++.dg/pr93143.C: New test.
> Is this still needed?  I think Jason fixed the outstanding sharing
> problems a couple days ago.
Yes, the issue is fixed and this can be discarded.

Thanks,
bin
>
> jeff
>


Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-12 Thread bin.cheng
--
Sender:Richard Biener 
Sent At:2020 Jan. 9 (Thu.) 20:01
Recipient:Bin.Cheng 
Cc:bin.cheng ; GCC Patches 

Subject:Re: [PATCH GCC11]Improve uninitialized warning with value range info


On Thu, Jan 9, 2020 at 11:17 AM Bin.Cheng  wrote:
> > I am not quite follow here. Do you mean we collect three cases "i <
> > j", "i < min(j)", "max(i) < j" then
> > call prune_uninit_phi_opnds for all three conditions?
> 
> No, I've meant to somehow arrange that the 'preds' passed to
> use_pred_not_overlap_with_undef_path_pred contain all three predicates
> rather than just i < j, thus "expand" fully symbolic predicates.

Seems this would require non-trivial refactoring of the original code.

> > This is another question? because now we simply break out of for loop
> > for finding such condition:
> > 
> > -  if ((gimple_code (flag_def) == GIMPLE_PHI)
> > - && (gimple_bb (flag_def) == gimple_bb (phi))
> > - && find_matching_predicate_in_rest_chains (the_pred, preds,
> > -num_preds))
> > -   break;
> > 
> > It's always possible that this flag_def can't prune use predicates
> > against undefined path predicates, while a later one can prune but is
> > skipped?
> 
> I don't follow but I also don't want to understand the code too much ;)
> 
> I'm fine with the idea and if the patch cannot introduce extra bogus warnings
> let's go with it.  Can you amend the comment before the two find_var_cmp_const
> calls?  I wonder whether eliding the second sweep when the first didn't find
> any predicate it skipped is worth the trouble.

Thanks for the comments, I updated the patch as attached.

Thanks,
bin

2020-01-08  Bin Cheng  

* tree-ssa-uninit.c (find_var_cmp_const): New function.
(use_pred_not_overlap_with_undef_path_pred): Call above.
(find_matching_predicate_in_rest_chains): Remove param.

0001-Fix-false-uninitialized-warning-message.patch


Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-09 Thread Bin.Cheng
On Wed, Jan 8, 2020 at 7:50 PM Richard Biener
 wrote:
>
> On Wed, Jan 8, 2020 at 12:30 PM Bin.Cheng  wrote:
> >
> > On Wed, Jan 8, 2020 at 6:31 PM Richard Biener
> >  wrote:
> > >
> > > On Wed, Jan 8, 2020 at 6:01 AM bin.cheng  
> > > wrote:
> > > >
> > > > Sorry, here is the patch.
> > > > --
> > > > Sender:bin.cheng 
> > > > Sent At:2020 Jan. 8 (Wed.) 12:58
> > > > Recipient:GCC Patches 
> > > > Subject:[PATCH GCC11]Improve uninitialized warning with value range info
> > > >
> > > >
> > > > Hi,
> > > >
> > > > Function use_pred_not_overlap_with_undef_path_pred of 
> > > > pass_late_warn_uninitialized
> > > > checks if predicate of variable use overlaps with predicate of 
> > > > undefined control flow path.
> > > > For now, it only checks ssa_var comparing against constant, this can be 
> > > > improved where
> > > > ssa_var compares against another ssa_var with value range info, as 
> > > > described in comment:
> > > >
> > > > + /* Check value range info of rhs, do following transforms:
> > > > +  flag_var < [min, max]  ->  flag_var < max
> > > > +  flag_var > [min, max]  ->  flag_var > min
> > > > +
> > > > +We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
> > > > +  flag_var <= [min, max] ->  flag_var < [min, max+1]
> > > > +  flag_var >= [min, max] ->  flag_var > [min-1, max]
> > > > +if no overflow/wrap.  */
> > > >
> > > > This change can avoid some false warning.  Bootstrap and test on 
> > > > x86_64, any comment?
> > >
> > > Definitely a good idea - the refactoring makes the patch hard to
> > > follow though.  The
> > > original code seems to pick any (the "first") compare against a constant. 
> > >  You
> > > return the "first" but maybe from range info that might also be
> > > [-INF,+INF].  It seems
> > > that we'd want to pick the best so eventually sort the predicate chain
> > > so that compares
> > > against constants come first at least?  Not sure if it really makes a
> > > difference but
> > I don't know either, but I simply tried to not break existing code in
> > the patch.
> > Function prune_uninit_phi_opnds is called for the first compares against
> > constant, actually it should be called for each comparison, but I guess it's
> > just avoiding O(M*N) complexity here.
>
> Yeah.  I'm just worried finding a "bad" value-range predicate cuts the search
> in a way that causes extra bogus warnings?
hmm, the code is now:
+  cmp_code = find_var_cmp_const (preds, phi, &flag_def, &boundary_cst, false);
+  if (cmp_code == ERROR_MARK)
+    cmp_code = find_var_cmp_const (preds, phi, &flag_def, &boundary_cst, true);
+  if (cmp_code == ERROR_MARK)
     return false;

The first call to find_var_cmp_const preserves the original behavior, while
the second call only looks for the value-range case if no constant comparison
is found by the first call.  So there are no extra bogus warnings?

>
> >
> > > even currently we could have i < 5, i < 1 so the "better" one later?
> > > It might also make
> > > sense to simply push three predicates for i < j, namely i < j (if ever
> > > useful), i < min(j)
> > > and max(i) < j to avoid repeatedly doing the range computations.
> > IIUC, with current implementation, it's not useful to check value range
> > info for both sides of comparison because prune_uninit_phi_opnds
> > requires the flag_var be defined by PHI node in the same basic block
> > as PHI parameter.
>
> Yes, but without remembering the code very well my suggestion allows
> "new" predicates to be gathered during collecting phase while your patch
> adjusts the query phase?
I don't quite follow here.  Do you mean we collect the three cases "i < j",
"i < min(j)", "max(i) < j", and then call prune_uninit_phi_opnds for all
three conditions?
This is another question, because now we simply break out of the for loop
when we find such a condition:

-  if ((gimple_code (flag_def) == GIMPLE_PHI)
- && (gimple_bb (flag_def) == gimple_bb (phi))
- && find_matching_predicate_in_rest_chains (the_pred, preds,
-num_preds))
-   break;

It's always possible that this flag_def can't prune use predicates
against undefined path predicates, while a later one can prune but is
skipped?

Thanks,
bin
>
> Richard.
>
> > Thanks,
> > bin
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > bin
> > > >
> > > > 2020-01-08  Bin Cheng  
> > > >
> > > >  * tree-ssa-uninit.c (find_var_cmp_const): New function.
> > > >  (use_pred_not_overlap_with_undef_path_pred): Call above.


Re: [PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2020-01-08 Thread Bin.Cheng
On Fri, Dec 20, 2019 at 3:10 PM Richard Biener
 wrote:
>
> On December 20, 2019 2:13:47 AM GMT+01:00, "Bin.Cheng" 
>  wrote:
> >On Fri, Dec 13, 2019 at 11:26 AM bin.cheng
> > wrote:
> >>
> >> Hi,
> >>
> >> As reported in PR92926, constant ctor is shared translation unit wide
> >because of constexpr_call_table,
> >> however, during gimplify, the shared ctor could be modified.  This
> >patch fixes the issue by unsharing
> >> it before modification in gimplify.  A test is reduced from cppcoro
> >library and added.
> >>
> >> Bootstrap and test ongoing.  Not sure if this is the correct fix
> >though, any comments?
> >Ping.  Any comment?
>
> Looks reasonable to me.
Given PR92926 is marked as a duplicate of PR93143, I updated the test case
of the patch.

Thanks,
bin

2019-12-13  Bin Cheng  

PR tree-optimization/93143
* gimplify.c (gimplify_init_constructor): Unshare ctor node before
clearing.

gcc/testsuite
2019-12-13  Bin Cheng  

PR tree-optimization/93143
* g++.dg/pr93143.C: New test.
From 77252c3bb41887af1daa9e83615a8aa32dc330f9 Mon Sep 17 00:00:00 2001
From: "bin.cheng" 
Date: Thu, 9 Jan 2020 14:13:08 +0800
Subject: [PATCH] Fix pr93143.

---
 gcc/gimplify.c |  2 ++
 gcc/testsuite/g++.dg/pr93143.C | 73 ++
 2 files changed, 75 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/pr93143.C

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 73fb2e7..55d7a93 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -5083,6 +5083,8 @@ gimplify_init_constructor (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p,
 	/* Zap the CONSTRUCTOR element list, which simplifies this case.
 	   Note that we still have to gimplify, in order to handle the
 	   case of variable sized types.  Avoid shared tree structures.  */
+	ctor = unshare_expr (ctor);
+	TREE_OPERAND (*expr_p, 1) = ctor;
 	CONSTRUCTOR_ELTS (ctor) = NULL;
 	TREE_SIDE_EFFECTS (ctor) = 0;
 	object = unshare_expr (object);
diff --git a/gcc/testsuite/g++.dg/pr93143.C b/gcc/testsuite/g++.dg/pr93143.C
new file mode 100644
index 000..40710cf
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr93143.C
@@ -0,0 +1,73 @@
+// { dg-do run }
+// { dg-options "-O3 -std=c++14" }
+
+struct array
+{
+constexpr unsigned char operator[](int i) const noexcept
+{
+return arr[i];
+}
+
+unsigned char arr[16];
+};
+
+
+class v6 {
+public:
+using bytes_type = array;
+constexpr v6(bytes_type const & bytes);
+constexpr bool is_loopback() const noexcept;
+static constexpr v6 loopback() noexcept
+{
+return v6(v6::bytes_type{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1});
+}
+private:
+bytes_type bytes_;
+};
+
+
+
+constexpr v6::v6(bytes_type const & bytes)
+: bytes_(bytes)
+{}
+
+constexpr
+bool v6::is_loopback() const noexcept
+{
+return bytes_[0] == 0 &&
+bytes_[1] == 0 &&
+bytes_[2] == 0 &&
+bytes_[3] == 0 &&
+bytes_[4] == 0 &&
+bytes_[5] == 0 &&
+bytes_[6] == 0 &&
+bytes_[7] == 0 &&
+bytes_[8] == 0 &&
+bytes_[9] == 0 &&
+bytes_[10] == 0 &&
+bytes_[11] == 0 &&
+bytes_[12] == 0 &&
+bytes_[13] == 0 &&
+bytes_[14] == 0 &&
+bytes_[15] == 1;
+}
+
+void v6_type()
+{
+[[maybe_unused]] constexpr auto loopback = v6::loopback();
+}
+
+int main()
+{
+v6_type();
+
+constexpr auto a = v6::loopback();
+if (!a.is_loopback())
+__builtin_abort();
+
+auto b = v6::loopback();
+if (!b.is_loopback())
+__builtin_abort();
+
+return 0;
+}
-- 
1.8.3.1



Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-08 Thread Bin.Cheng
On Wed, Jan 8, 2020 at 6:31 PM Richard Biener
 wrote:
>
> On Wed, Jan 8, 2020 at 6:01 AM bin.cheng  wrote:
> >
> > Sorry, here is the patch.
> > --
> > Sender:bin.cheng 
> > Sent At:2020 Jan. 8 (Wed.) 12:58
> > Recipient:GCC Patches 
> > Subject:[PATCH GCC11]Improve uninitialized warning with value range info
> >
> >
> > Hi,
> >
> > Function use_pred_not_overlap_with_undef_path_pred of 
> > pass_late_warn_uninitialized
> > checks if predicate of variable use overlaps with predicate of undefined 
> > control flow path.
> > For now, it only checks ssa_var comparing against constant, this can be 
> > improved where
> > ssa_var compares against another ssa_var with value range info, as 
> > described in comment:
> >
> > + /* Check value range info of rhs, do following transforms:
> > +  flag_var < [min, max]  ->  flag_var < max
> > +  flag_var > [min, max]  ->  flag_var > min
> > +
> > +We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
> > +  flag_var <= [min, max] ->  flag_var < [min, max+1]
> > +  flag_var >= [min, max] ->  flag_var > [min-1, max]
> > +if no overflow/wrap.  */
> >
> > This change can avoid some false warning.  Bootstrap and test on x86_64, 
> > any comment?
>
> Definitely a good idea - the refactoring makes the patch hard to
> follow though.  The
> original code seems to pick any (the "first") compare against a constant.  You
> return the "first" but maybe from range info that might also be
> [-INF,+INF].  It seems
> that we'd want to pick the best so eventually sort the predicate chain
> so that compares
> against constants come first at least?  Not sure if it really makes a
> difference but
I don't know either, but I simply tried not to break existing code in
the patch.
Function prune_uninit_phi_opnds is called only for the first compare against
a constant; ideally it should be called for each comparison, but I guess that's
just to avoid O(M*N) complexity here.


> even currently we could have i < 5, i < 1 so the "better" one later?
> It might also make
> sense to simply push three predicates for i < j, namely i < j (if ever
> useful), i < min(j)
> and max(i) < j to avoid repeatedly doing the range computations.
IIUC, with the current implementation it's not useful to check value range
info for both sides of the comparison, because prune_uninit_phi_opnds
requires the flag_var to be defined by a PHI node in the same basic block
as the PHI parameter.
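
To make this concrete, here is a hand-made sketch (not from a PR; the range
annotation is an assumption) of the kind of false positive the improvement
can silence:

extern int foo (void);
extern void use (int);

void
f (int n, int k)
{
  int v;
  int m = (k & 7) + 8;   /* value range info gives m = [8, 15] */
  if (n > 8)
    v = foo ();          /* v is defined only under n > 8 */
  if (n > m)             /* with the patch: n > [8, 15] -> n > 8 */
    use (v);             /* implied by the definition guard, so no warning */
}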

Thanks,
bin
>
> Thanks,
> Richard.
>
> > Thanks,
> > bin
> >
> > 2020-01-08  Bin Cheng  
> >
> >  * tree-ssa-uninit.c (find_var_cmp_const): New function.
> >  (use_pred_not_overlap_with_undef_path_pred): Call above.


Re: [RFC] IVOPTs select cand with preferred D-form access

2020-01-08 Thread Bin.Cheng
On Tue, Jan 7, 2020 at 6:48 PM Kewen.Lin  wrote:
>
> on 2020/1/7 5:14 PM, Richard Biener wrote:
> > On Mon, 6 Jan 2020, Kewen.Lin wrote:
> >
> >> We are thinking whether it can be handled in IVOPTs instead of one RTL 
> >> pass.
> >>
> >> During IVOPTs selecting IV cands, it doesn't know the loop will be 
> >> unrolled so
> >> it doesn't count the possible step cost in with X-form.  If we can teach 
> >> it to
> >> consider the case, the IV cands which plays with D-form can be preferred.
> >> Currently unrolling (incomplete) happens in RTL, it looks we have to 
> >> predict
> >> the loop whether unroll in IVOPTs.  Since there is some parameter checks 
> >> on RTL
> >> insn counts and target hooks, it seems not easy to get that.  Besides, we 
> >> need
> >> to check the step is valid to put into D-form field (eg: DQ-form requires 
> >> divide
> >> 16 exactly), to ensure no extra ADDIs needed.
> >>
> >> I'm not sure whether it's a good idea to implement in IVOPTs, but I did 
> >> some
> >> changes in IVOPTs to prove it's doable to get expected codes, the patch is 
> >> attached.
> >>
> >> Any comments/suggestions are highly appreiciated!
> >
> > Is the unrolled code better than the not unrolled code (assuming
> > optimal IV choice)?  Then IMHO IVOPTs should drive the unrolling,
> > either by actually doing it or by forcing it via the loop->unroll
> > setting.  I don't think second-guessing the RTL unroller at this
> > point is going to work.  Alternatively turn X-form into D-form during
> > RTL unrolling?
> >
>
> Hi Richard,
>
> Thanks for the comments!
>
> Yes, unrolled version is better on Power9 for both forms, but D-form unrolled 
> is better
> than X-form unrolled.  If we drive unrolling in IVOPTs, not sure it will be a 
> concern
> that IVOPTs becomes too heavy? or too rude with forced UF if imprecise? Do we 
> still
> have the plan to introduce one middle-end unroll pass, does it help if yes?
I am a bit worried that would make IVOPTs heavy too; it might be
possible to compute heuristics for whether the loop should be unrolled in a
post-IVOPTs transformation.  Of course that transformation needs to do
more work than simply unrolling in order to take advantage of the
aforementioned addressing mode.
BTW, an unrolled loop won't perform as well as on ppc if the target doesn't
support the [base + register + offset] addressing mode, will it?

Another point: in case multiple passes do unrolling, the
"already unrolled" information may need to be recorded as a flag in the
loop's properties.
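
To make the discussion concrete, a minimal sketch (illustrative only) of the
loop shape in question:

/* After unrolling by 4, a D-form-friendly IV choice lets each copy use
   [base + const] addressing (x[i], x[i+1], ... become 0(reg), 8(reg), ...)
   with a single IV update per unrolled body, while an X-form choice needs
   a [base + index] access plus an index update per copy.  */
void
daxpy (double *restrict y, const double *restrict x, double a, long n)
{
  for (long i = 0; i < n; i++)
    y[i] += a * x[i];
}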

Thanks,
bin
> The quoted RTL patch is to propose one RTL pass after RTL loop passes, it 
> also sounds
> good to check whether RTL unrolling is a good place!
>
>
> BR,
> Kewen
>


Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-07 Thread bin.cheng
Sorry, here is the patch.
--
Sender:bin.cheng 
Sent At:2020 Jan. 8 (Wed.) 12:58
Recipient:GCC Patches 
Subject:[PATCH GCC11]Improve uninitialized warning with value range info


Hi,

Function use_pred_not_overlap_with_undef_path_pred of
pass_late_warn_uninitialized
checks whether the predicate of a variable use overlaps with the predicate of
an undefined control flow path.
For now, it only handles ssa_var compared against a constant; this can be
improved for the case where ssa_var compares against another ssa_var with
value range info, as described in the comment:

+ /* Check value range info of rhs, do following transforms:
+  flag_var < [min, max]  ->  flag_var < max
+  flag_var > [min, max]  ->  flag_var > min
+
+We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
+  flag_var <= [min, max] ->  flag_var < [min, max+1]
+  flag_var >= [min, max] ->  flag_var > [min-1, max]
+if no overflow/wrap.  */
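
For reference, a rough sketch (illustrative names, not the actual patch) of
how the rhs range could be folded into a constant bound so the existing
constant-comparison handling applies:

  /* Sketch only: rhs is the SSA name compared against flag_var, cmp_code is
     the comparison code, and bound feeds the existing constant logic.  */
  tree bound = NULL_TREE;
  wide_int min, max;
  if (TREE_CODE (rhs) == SSA_NAME
      && !POINTER_TYPE_P (TREE_TYPE (rhs))
      && get_range_info (rhs, &min, &max) == VR_RANGE)
    {
      if (cmp_code == LT_EXPR)
        /* flag_var < [min, max]  ->  flag_var < max  */
        bound = wide_int_to_tree (TREE_TYPE (rhs), max);
      else if (cmp_code == GT_EXPR)
        /* flag_var > [min, max]  ->  flag_var > min  */
        bound = wide_int_to_tree (TREE_TYPE (rhs), min);
      /* LE_EXPR/GE_EXPR additionally need the max+1 / min-1 adjustment
         from the comment above, guarded against overflow/wrap.  */
    }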

This change can avoid some false warnings.  Bootstrap and test on x86_64; any
comments?

Thanks,
bin

2020-01-08  Bin Cheng  

 * tree-ssa-uninit.c (find_var_cmp_const): New function.
 (use_pred_not_overlap_with_undef_path_pred): Call above.

check-uninit_warning-with-vrinfo-20200108.txt
Description: Binary data


[PATCH GCC11]Improve uninitialized warning with value range info

2020-01-07 Thread bin.cheng
Hi,

Function use_pred_not_overlap_with_undef_path_pred of
pass_late_warn_uninitialized
checks whether the predicate of a variable use overlaps with the predicate of
an undefined control flow path.
For now, it only handles ssa_var compared against a constant; this can be
improved for the case where ssa_var compares against another ssa_var with
value range info, as described in the comment:

+ /* Check value range info of rhs, do following transforms:
+  flag_var < [min, max]  ->  flag_var < max
+  flag_var > [min, max]  ->  flag_var > min
+
+We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
+  flag_var <= [min, max] ->  flag_var < [min, max+1]
+  flag_var >= [min, max] ->  flag_var > [min-1, max]
+if no overflow/wrap.  */

This change can avoid some false warnings.  Bootstrap and test on x86_64; any
comments?

Thanks,
bin

2020-01-08  Bin Cheng  

* tree-ssa-uninit.c (find_var_cmp_const): New function.
(use_pred_not_overlap_with_undef_path_pred): Call above.

Re: [PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2019-12-19 Thread Bin.Cheng
On Fri, Dec 13, 2019 at 11:26 AM bin.cheng  wrote:
>
> Hi,
>
> As reported in PR92926, constant ctor is shared translation unit wide because 
> of constexpr_call_table,
> however, during gimplify, the shared ctor could be modified.  This patch 
> fixes the issue by unsharing
> it before modification in gimplify.  A test is reduced from cppcoro library 
> and added.
>
> Bootstrap and test ongoing.  Not sure if this is the correct fix though, any 
> comments?
Ping.  Any comment?

Thanks,
bin
>
> Thanks,
> bin
>
> 2019-12-13  Bin Cheng  
>
> PR tree-optimization/92926
> * gimplify.c (gimplify_init_constructor): Unshare ctor node before
> clearing.
>
> gcc/testsuite
> 2019-12-13  Bin Cheng  
>
> PR tree-optimization/92926
> * g++.dg/pr92926.C: New test.


[PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2019-12-12 Thread bin.cheng
Hi,

As reported in PR92926, a constant ctor is shared translation-unit wide because
of constexpr_call_table;
however, during gimplification the shared ctor could be modified.  This patch
fixes the issue by unsharing
it before modification in gimplify.  A test reduced from the cppcoro library is
added.

Bootstrap and test ongoing.  Not sure if this is the correct fix though, any 
comments?

Thanks,
bin

2019-12-13  Bin Cheng  

PR tree-optimization/92926
* gimplify.c (gimplify_init_constructor): Unshare ctor node before
clearing.

gcc/testsuite
2019-12-13  Bin Cheng  

PR tree-optimization/92926
* g++.dg/pr92926.C: New test.

0001-pr92926.patch
Description: Binary data


Re: [PATCH] Improve debug info in ivopts optimized loops (PR debug/90231)

2019-10-23 Thread Bin.Cheng
On Tue, Oct 22, 2019 at 3:32 PM Jakub Jelinek  wrote:
>
> On Mon, Oct 21, 2019 at 01:24:30PM +0200, Jakub Jelinek wrote:
> > So I wonder if for correctness I don't need to add:
> >
> >   if (!use->iv->no_overflow
> >   && !cand->iv->no_overflow
> >   && !integer_pow2p (cstep))
> > return NULL_TREE;
> >
> > with some of the above as comment explaining why.
> >
> > On the other side, if cand->iv->no_overflow, couldn't we bypass the extra
> > precision test?
>
> Here are these two in patch form.
>
> 2019-10-22  Jakub Jelinek  
>
> PR debug/90231
> * tree-ssa-loop-ivopts.c (get_debug_computation_at): New function.
> (remove_unused_ivs): Use it instead of get_computation_at.  When
> choosing best candidate, only consider candidates where
> get_debug_computation_at actually returns non-NULL.
>
> --- gcc/tree-ssa-loop-ivopts.c.jj   2019-10-21 14:17:57.598198162 +0200
> +++ gcc/tree-ssa-loop-ivopts.c  2019-10-22 09:30:09.782238157 +0200
> @@ -4089,6 +4089,94 @@ get_computation_at (class loop *loop, gi
>    return fold_convert (type, aff_combination_to_tree (&aff));
>  }
>
> +/* Like get_computation_at, but try harder, even if the computation
> +   is more expensive.  Intended for debug stmts.  */
> +
> +static tree
> +get_debug_computation_at (class loop *loop, gimple *at,
> + struct iv_use *use, struct iv_cand *cand)
> +{
> +  if (tree ret = get_computation_at (loop, at, use, cand))
> +return ret;
> +
> +  tree ubase = use->iv->base, ustep = use->iv->step;
> +  tree cbase = cand->iv->base, cstep = cand->iv->step;
> +  tree var;
> +  tree utype = TREE_TYPE (ubase), ctype = TREE_TYPE (cbase);
> +  widest_int rat;
> +
> +  /* We must have a precision to express the values of use.  */
> +  if (TYPE_PRECISION (utype) >= TYPE_PRECISION (ctype))
> +return NULL_TREE;
> +
> +  /* Try to handle the case that get_computation_at doesn't,
> + try to express
> + use = ubase + (var - cbase) / ratio.  */
> +  if (!constant_multiple_of (cstep, fold_convert (TREE_TYPE (cstep), ustep),
> +    &rat))
> +return NULL_TREE;
> +
> +  bool neg_p = false;
> +  if (wi::neg_p (rat))
> +{
> +  if (TYPE_UNSIGNED (ctype))
> +   return NULL_TREE;
> +  neg_p = true;
> +  rat = wi::neg (rat);
> +}
> +
> +  /* If both IVs can wrap around and CAND doesn't have a power of two step,
> + it is unsafe.  Consider uint16_t CAND with step 9, when wrapping around,
> + the values will be ... 0xfff0, 0xfff9, 2, 11 ... and when use is say
> + uint8_t with step 3, those values divided by 3 cast to uint8_t will be
> + ... 0x50, 0x53, 0, 3 ... rather than expected 0x50, 0x53, 0x56, 0x59.  
> */
Interesting, so we can still get correct debug info for iter in such
special cases.
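
For the record, the comment's numbers check out; a tiny standalone program
(my own check, not part of the patch) that reproduces them:

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  uint16_t cand = 0xfff0;
  for (int i = 0; i < 4; i++, cand += 9)
    /* Prints 0x50, 0x53, 0, 3 -- not the expected 0x50, 0x53, 0x56, 0x59,
       showing why a wrapping candidate with a non-power-of-two step cannot
       simply be divided to recover the use IV.  */
    printf ("cand = %#x  use = %#x\n", cand, (uint8_t) (cand / 3));
  return 0;
}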

> +  if (!use->iv->no_overflow
> +  && !cand->iv->no_overflow
> +  && !integer_pow2p (cstep))
> +return NULL_TREE;
> +
> +  int bits = wi::exact_log2 (rat);
> +  if (bits == -1)
> +bits = wi::floor_log2 (rat) + 1;
> +  if (!cand->iv->no_overflow
> +  && TYPE_PRECISION (utype) + bits > TYPE_PRECISION (ctype))
> +return NULL_TREE;
The patch is fine with me.

Just for the record, I guess in the future we may try to find (by recording
information in an early phase) the correct/bijective candidate for
computing the IV in debug info; then those checks would be
unnecessary.

Thanks,
bin
> +
> +  var = var_at_stmt (loop, cand, at);
> +
> +  if (POINTER_TYPE_P (ctype))
> +{
> +  ctype = unsigned_type_for (ctype);
> +  cbase = fold_convert (ctype, cbase);
> +  cstep = fold_convert (ctype, cstep);
> +  var = fold_convert (ctype, var);
> +}
> +
> +  ubase = unshare_expr (ubase);
> +  cbase = unshare_expr (cbase);
> +  if (stmt_after_increment (loop, cand, at))
> +var = fold_build2 (MINUS_EXPR, TREE_TYPE (var), var,
> +  unshare_expr (cstep));
> +
> +  var = fold_build2 (MINUS_EXPR, TREE_TYPE (var), var, cbase);
> +  var = fold_build2 (EXACT_DIV_EXPR, TREE_TYPE (var), var,
> +wide_int_to_tree (TREE_TYPE (var), rat));
> +  if (POINTER_TYPE_P (utype))
> +{
> +  var = fold_convert (sizetype, var);
> +  if (neg_p)
> +   var = fold_build1 (NEGATE_EXPR, sizetype, var);
> +  var = fold_build2 (POINTER_PLUS_EXPR, utype, ubase, var);
> +}
> +  else
> +{
> +  var = fold_convert (utype, var);
> +  var = fold_build2 (neg_p ? MINUS_EXPR : PLUS_EXPR, utype,
> +ubase, var);
> +}
> +  return var;
> +}
> +
>  /* Adjust the cost COST for being in loop setup rather than loop body.
> If we're optimizing for space, the loop setup overhead is constant;
> if we're optimizing for speed, amortize it over the per-iteration cost.
> @@ -7523,6 +7611,7 @@ remove_unused_ivs (struct ivopts_data *d
>   struct iv_use dummy_use;
>   struct iv_cand *best_cand = NULL, *cand;
>   unsigned i, 

Re: [PATCH] Improve debug info in ivopts optimized loops (PR debug/90231)

2019-10-21 Thread Bin.Cheng
On Sat, Oct 19, 2019 at 2:27 PM Jakub Jelinek  wrote:
>
> Hi!
>
> As mentioned in the PR, the following patch attempts to address two issues.
> In remove_unused_ivs we first find the best iv_cand (we prefer primarily the
> same step, next same mode and lastly constant base) and only then call
> get_computation_at to determine the replacement expression.  Unfortunately,
> in various cases get_computation_at can return NULL_TREE and in that case we
> don't try any other candidate and just leave the vars for debug as is, which
> results in => NULL and the IVs .
>
> The following patch will only consider candidates for which
> get_computation_at succeeds, while it can be slower, it can handle more
> cases.  Perhaps alternative would be to have two passes, pick up the best
> candidate without calling get_computation_at, calling it on the best
> candidate and if that fails, retry the best candidate search with calling
> get_computation_at.
>
> Another thing is that get_computation_at can only handle cases where
> use can be expressed as ubase + (var - cbase) * ratio for integral ratio.
> In the PR testcase we don't have any, one (the one we pick as best)
> has a ratio 1/15 and the other 1/4.  Using a division in ivopts at runtime
> is hardly ever desirable, but for debug info we don't mind that, all we need
> to ensure is that we don't result in wrong-debug.
> The patch implements expressing use as ubase + (var - cbase) / ratio for
> these debug info only uses, but so far requires that if the use IV couldn't
> wrap around, that the candidate (var - cbase) can't wrap around either,
> which should be true whenever the candidate type is IMHO at least
> ceil_log2 (ratio) bits larger than the use type.  Do I need to punt if
> !use->iv->no_overflow, or is
> ubase + (utype) ((var - cbase) / ratio) computation safe even if it wraps?
> And as questioned in the PR, are there other cases where we can safely
> assume no wrap (e.g. use it if use->iv->no_overflow && cand->iv->no_overflow
> without those extra precision checks)?
>
> Anyway, bootstrapped/regtested successfully on x86_64-linux and i686-linux.
Hi Jakub, thanks very much for working on this.

As for choosing the best candidate, I was thinking of reusing the existing
get_computation_cost interface, but it looks like that requires non-trivial
modification.  Your patch is simpler.

The overflow check is difficult.  IIUC, the checks on bit precision in the
patch are not enough?  Consider an unsigned 64-bit candidate whose start
value is unknown at compile time.  My original idea was to use the
no_overflow flag, but apparently this information is not well computed and
is inaccurate in most cases?  Lastly, for now there is no link between a
candidate and its original IV.  IIRC, when a candidate is derived from
a no_overflow pointer candidate, it's not safe to mark the candidate
as no_overflow.  I am not sure whether I remember wrong; unfortunately, I
don't have examples either.

Thanks,
bin
>
> Comparing bootstrapped cc1plus without and with this patch (the former with
> the patch applied after bootstrap and stage3 rebuilt, so that .text etc. is
> identical) shows:
> -  [33] .debug_info   PROGBITS 22a4788 749275e 00 
>  0   0  1
> -  [34] .debug_abbrev PROGBITS 9736ee6 204aad 00  
> 0   0  1
> -  [35] .debug_line   PROGBITS 993b993 1688464 00 
>  0   0  1
> -  [36] .debug_strPROGBITS afc3df7 6f65aa 01  
> MS  0   0  1
> -  [37] .debug_locPROGBITS b6ba3a1 71a2dde 00 
>  0   0  1
> -  [38] .debug_ranges PROGBITS 1285d17f 16414d0 
> 00  0   0  1
> -  [39] .symtab   SYMTAB   13e9e650 166ff8 18 
> 40 38193  8
> -  [40] .strtab   STRTAB   14005648 2ad809 00 
>  0   0  1
> -  [41] .shstrtab STRTAB   142b2e51 0001a0 00 
>  0   0  1
> +  [33] .debug_info   PROGBITS 22a4788 749365e 00 
>  0   0  1
> +  [34] .debug_abbrev PROGBITS 9737de6 204a9f 00  
> 0   0  1
> +  [35] .debug_line   PROGBITS 993c885 1688f0c 00 
>  0   0  1
> +  [36] .debug_strPROGBITS afc5791 6f65aa 01  
> MS  0   0  1
> +  [37] .debug_locPROGBITS b6bbd3b 71cd404 00 
>  0   0  1
> +  [38] .debug_ranges PROGBITS 1288913f 16414b0 
> 00  0   0  1
> +  [39] .symtab   SYMTAB   13eca5f0 166ff8 18 
> 40 38193  8
> +  [40] .strtab   STRTAB   140315e8 2ad809 00 
>  0   0  1
> +  [41] .shstrtab STRTAB   142dedf1 0001a0 00 
>  0   0  1
> so .debug_info is 3840 bytes larger and .debug_loc is 173606 bytes larger
> (0.15%), so there are some changes with this, but not 

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

2019-08-23 Thread Bin.Cheng
On Fri, Aug 23, 2019 at 4:27 PM Kewen.Lin  wrote:
>
> Hi Bin
>
> on 2019/8/23 10:19 AM, Bin.Cheng wrote:
> > On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin  wrote:
> >>
> >> Hi Bin,
> >>
> >> on 2019/8/22 1:46 PM, Bin.Cheng wrote:
> >>> On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin  wrote:
> >>>>
> >>>> Hi Bin,
> >>>>
> >>>> Thanks for your time!
> >>>>
> >>>> on 2019/8/21 8:32 PM, Bin.Cheng wrote:
> >>>>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin  wrote:
> >>>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> Comparing to the previous versions of implementation mainly based on 
> >>>>>> the
> >>>>>> existing IV cands but zeroing the related group/use cost, this new one 
> >>>>>> is based
> >>>>>> on Richard and Segher's suggestion introducing one doloop dedicated IV 
> >>>>>> cand.
> >>>>>>
> >>>>>> Some key points are listed below:
> >>>>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated 
> >>>>>> IV cand.
> >>>>>>   2) Special name "doloop" assigned.
> >>>>>>   3) Doloop IV cand with form (niter+1, +, -1)
> >>>>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost 
> >>>>>> for step.
> >>>>>>   5) Support may_be_zero (regressed PR is in this case), the base of 
> >>>>>> doloop IV
> >>>>>>  can be COND_EXPR, add handlings in cand_value_at and 
> >>>>>> may_eliminate_iv.
> >>>>>>   6) Add more expr support in force_expr_to_var_cost for reasonable 
> >>>>>> cost
> >>>>>>  calculation on the IV base with may_be_zero (like COND_EXPR).
> >>>>>>   7) Set zero cost when using doloop IV cand for doloop use.
> >>>>>>   8) Add three hooks (should we merge _generic and _address?).
> >>>>>> *) have_count_reg_decr_p, is to indicate the target has special 
> >>>>>> hardware
> >>>>>>count register, we shouldn't consider the impact of doloop IV 
> >>>>>> when
> >>>>>>calculating register pressures.
> >>>>>> *) doloop_cost_for_generic, is the extra cost when using doloop IV 
> >>>>>> cand for
> >>>>>>generic type IV use.
> >>>>>> *) doloop_cost_for_address, is the extra cost when using doloop IV 
> >>>>>> cand for
> >>>>>>address type IV use.
>
> >> The new patch addressing the comments is attached.
> >> Could you please have a look again?  Thanks in advance!
> > Thanks for working on this.  A bit more nit-pickings.
> >
> > -add_candidate_1 (data, base, step, important,
> > -IP_NORMAL, use, NULL, orig_iv);
> > -  if (ip_end_pos (data->current_loop)
> > +add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, 
> > doloop,
> > +orig_iv);
> > +  if (!doloop && ip_end_pos (data->current_loop)
> > Could you add some comments elaborating why ip_end_pos candidate
> > shouldn't be added for doloop case?  Because the increment position is
> > wrong.
> >
> > Also if you make doloop the last default parameter of add_candidate_1,
> > you can save more unnecessary changes to calls to add_candidate?
> >
> > -cost = get_computation_cost (data, use, cand, false,
> > -        &inv_vars, NULL, &inv_expr);
> > +{
> > +  cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
> > +  &inv_expr);
> > +  if (cand->doloop_p)
> > +   cost += targetm.doloop_cost_for_generic;
> > +}
> > This adjustment
> >
> >    cost = get_computation_cost (data, use, cand, true,
> >    &inv_vars, &can_autoinc, &inv_expr);
> >
> > +  if (cand->doloop_p)
> > +cost += targetm.doloop_cost_for_address;
> > +
> > and this adjustment can be moved into get_computation_cost where all
> > cost adjustments are done.
> >
> > +  /* For doloop candidate/use pair, adjust to zero cost.  */
> > +  if (group->doloop_p 

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

2019-08-22 Thread Bin.Cheng
On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin  wrote:
>
> Hi Bin,
>
> on 2019/8/22 1:46 PM, Bin.Cheng wrote:
> > On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin  wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for your time!
> >>
> >> on 2019/8/21 8:32 PM, Bin.Cheng wrote:
> >>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin  wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> Comparing to the previous versions of implementation mainly based on the
> >>>> existing IV cands but zeroing the related group/use cost, this new one 
> >>>> is based
> >>>> on Richard and Segher's suggestion introducing one doloop dedicated IV 
> >>>> cand.
> >>>>
> >>>> Some key points are listed below:
> >>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated 
> >>>> IV cand.
> >>>>   2) Special name "doloop" assigned.
> >>>>   3) Doloop IV cand with form (niter+1, +, -1)
> >>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost 
> >>>> for step.
> >>>>   5) Support may_be_zero (regressed PR is in this case), the base of 
> >>>> doloop IV
> >>>>  can be COND_EXPR, add handlings in cand_value_at and 
> >>>> may_eliminate_iv.
> >>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>>>  calculation on the IV base with may_be_zero (like COND_EXPR).
> >>>>   7) Set zero cost when using doloop IV cand for doloop use.
> >>>>   8) Add three hooks (should we merge _generic and _address?).
> >>>> *) have_count_reg_decr_p, is to indicate the target has special 
> >>>> hardware
> >>>>count register, we shouldn't consider the impact of doloop IV when
> >>>>calculating register pressures.
> >>>> *) doloop_cost_for_generic, is the extra cost when using doloop IV 
> >>>> cand for
> >>>>generic type IV use.
> >>>> *) doloop_cost_for_address, is the extra cost when using doloop IV 
> >>>> cand for
> >>>>address type IV use.
> >>> What will happen if doloop IV cand be used for generic/address type iv
> >>> use?  Can RTL doloop can still perform doloop optimization in this
> >>> case?
> >>>
> >>
> >> On Power, we put the iteration count into hardware count register, it 
> >> takes very
> >> high cost to move the count to GPR, so the cost is set as INF to make it 
> >> impossible
> >> to use it for generic/address type iv use.  But as some discussion before, 
> >> on some
> >> targets using GPR instead of hardware count register, they probably want 
> >> to use this
> >> doloop iv used for other uses if profitable.  These two hooks offer the 
> >> possibility.
> >> In that case, I think RTL doloop can still perform since it can still get 
> >> the
> >> pattern and transform.  The generic/address uses can still use it.
> >>>>
> >>>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed 
> >>>> excepting
> >>>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is 
> >>>> tracked
> >>>> by PR89983.
> >>>>
> >>>> Any comments and suggestions are highly appreciated.  Thanks!
> >>> Not sure if I understand the patch correctly, some comments embedded.
> >>>
> >>> +  /* The number of doloop candidate in the set.  */
> >>> +  unsigned n_doloop_cands;
> >>> +
> >>> This is unnecessary.  See below comments.
> >>>
> >>> -add_candidate_1 (data, base, step, important,
> >>> -IP_NORMAL, use, NULL, orig_iv);
> >>> +add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, 
> >>> doloop,
> >>> +orig_iv);
> >>>if (ip_end_pos (data->current_loop)
> >>>&& allow_ip_end_pos_p (data->current_loop))
> >>> -add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> >>> orig_iv);
> >>> +add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> >>> doloop,
> >>> +orig_iv);
> >>> Do we need to skip ip_end_pos

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

2019-08-21 Thread Bin.Cheng
On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin  wrote:
>
> Hi Bin,
>
> Thanks for your time!
>
> on 2019/8/21 8:32 PM, Bin.Cheng wrote:
> > On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin  wrote:
> >>
> >> Hi!
> >>
> >> Comparing to the previous versions of implementation mainly based on the
> >> existing IV cands but zeroing the related group/use cost, this new one is 
> >> based
> >> on Richard and Segher's suggestion introducing one doloop dedicated IV 
> >> cand.
> >>
> >> Some key points are listed below:
> >>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV 
> >> cand.
> >>   2) Special name "doloop" assigned.
> >>   3) Doloop IV cand with form (niter+1, +, -1)
> >>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for 
> >> step.
> >>   5) Support may_be_zero (regressed PR is in this case), the base of 
> >> doloop IV
> >>  can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
> >>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>  calculation on the IV base with may_be_zero (like COND_EXPR).
> >>   7) Set zero cost when using doloop IV cand for doloop use.
> >>   8) Add three hooks (should we merge _generic and _address?).
> >> *) have_count_reg_decr_p, is to indicate the target has special 
> >> hardware
> >>count register, we shouldn't consider the impact of doloop IV when
> >>calculating register pressures.
> >> *) doloop_cost_for_generic, is the extra cost when using doloop IV 
> >> cand for
> >>generic type IV use.
> >> *) doloop_cost_for_address, is the extra cost when using doloop IV 
> >> cand for
> >>address type IV use.
> > What will happen if doloop IV cand be used for generic/address type iv
> > use?  Can RTL doloop can still perform doloop optimization in this
> > case?
> >
>
> On Power, we put the iteration count into hardware count register, it takes 
> very
> high cost to move the count to GPR, so the cost is set as INF to make it 
> impossible
> to use it for generic/address type iv use.  But as some discussion before, on 
> some
> targets using GPR instead of hardware count register, they probably want to 
> use this
> doloop iv used for other uses if profitable.  These two hooks offer the 
> possibility.
> In that case, I think RTL doloop can still perform since it can still get the
> pattern and transform.  The generic/address uses can still use it.
> >>
> >> Bootstrapped on powerpc64le-linux-gnu and regression testing passed 
> >> excepting
> >> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is 
> >> tracked
> >> by PR89983.
> >>
> >> Any comments and suggestions are highly appreciated.  Thanks!
> > Not sure if I understand the patch correctly, some comments embedded.
> >
> > +  /* The number of doloop candidate in the set.  */
> > +  unsigned n_doloop_cands;
> > +
> > This is unnecessary.  See below comments.
> >
> > -add_candidate_1 (data, base, step, important,
> > -IP_NORMAL, use, NULL, orig_iv);
> > +add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, 
> > doloop,
> > +orig_iv);
> >if (ip_end_pos (data->current_loop)
> >&& allow_ip_end_pos_p (data->current_loop))
> > -add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> > orig_iv);
> > +add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> > doloop,
> > +orig_iv);
> > Do we need to skip ip_end_pos case for doloop candidate?  Because the
> > candidate increment will be inserted in latch, i.e, increment position
> > is after exit condition.
> >
>
> Yes, we should skip it.  Currently function find_doloop_use has the check on 
> an
> empty latch and gimple_cond to latch, partially excluding it.  But it's still 
> good
> to guard it directly here.
>
> > -  tree_to_aff_combination (iv->base, type, val);
> > +  tree base = iv->base;
> > +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to 
> > extract
> > + the value under !may_be_zero to get the compact bound which also well 
> > fits
> > + for may_be_zero since we ensure the value for it is const one.  */
> > +  if (cand->doloop_p && desc->may_be_zero && !i

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

2019-08-21 Thread Bin.Cheng
On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin  wrote:
>
> Hi!
>
> Comparing to the previous versions of implementation mainly based on the
> existing IV cands but zeroing the related group/use cost, this new one is 
> based
> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>
> Some key points are listed below:
>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV 
> cand.
>   2) Special name "doloop" assigned.
>   3) Doloop IV cand with form (niter+1, +, -1)
>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for 
> step.
>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>  can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>  calculation on the IV base with may_be_zero (like COND_EXPR).
>   7) Set zero cost when using doloop IV cand for doloop use.
>   8) Add three hooks (should we merge _generic and _address?).
> *) have_count_reg_decr_p, is to indicate the target has special hardware
>count register, we shouldn't consider the impact of doloop IV when
>calculating register pressures.
> *) doloop_cost_for_generic, is the extra cost when using doloop IV cand 
> for
>generic type IV use.
> *) doloop_cost_for_address, is the extra cost when using doloop IV cand 
> for
>address type IV use.
What will happen if the doloop IV cand is used for a generic/address type IV
use?  Can RTL doloop still perform the doloop optimization in this
case?

>
> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is 
> tracked
> by PR89983.
>
> Any comments and suggestions are highly appreciated.  Thanks!
I am not sure I understand the patch correctly; some comments are embedded below.

+  /* The number of doloop candidate in the set.  */
+  unsigned n_doloop_cands;
+
This is unnecessary.  See below comments.

-add_candidate_1 (data, base, step, important,
-IP_NORMAL, use, NULL, orig_iv);
+add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+orig_iv);
   if (ip_end_pos (data->current_loop)
   && allow_ip_end_pos_p (data->current_loop))
-add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
+add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
+orig_iv);
Do we need to skip the ip_end_pos case for the doloop candidate?  The
candidate increment will be inserted in the latch, i.e., the increment
position is after the exit condition.

-  tree_to_aff_combination (iv->base, type, val);
+  tree base = iv->base;
+  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
+ the value under !may_be_zero to get the compact bound which also well fits
+ for may_be_zero since we ensure the value for it is const one.  */
+  if (cand->doloop_p && desc->may_be_zero && !integer_zerop (desc->may_be_zero))
+base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
+   unshare_expr (rewrite_to_non_trapping_overflow (niter)),
+   build_int_cst (TREE_TYPE (niter), 1));
+  tree_to_aff_combination (base, type, val);
I don't quite follow here.  iv->base is computed from niter; I suppose the
compact bound is for cheaper candidate initialization?  Why is it possible to
extract the !may_be_zero niter for the may_be_zero case here?  The niter
under !may_be_zero says nothing about the real niter under may_be_zero.

-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, );
If I understand correctly, the doloop use/cand will only be
identified/added for a single-exit loop, and there will be only one
cond (doloop) iv_use and only one doloop cand for a doloop loop.  So the
cand_value at niter at the use position would be 0.  If that's the case,
we can skip calling cand_value_at here for the doloop cand, and the change
to cand_value_at would be unnecessary as well.

-  expensive.  */
-  if (!integer_zerop (desc->may_be_zero))
+  expensive.
+
+ For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
+ support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
+  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
 return iv_elimination_compare_lt (data, cand, comp, desc);
And we can early return before this?

+  if (may_be_zero)
+{
+  if (COMPARISON_CLASS_P (may_be_zero))
+   {
+ niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
+  build_int_cst (ntype, 0),
+  rewrite_to_non_trapping_overflow (niter));
+   }
+  /* Don't try to obtain the iteration count expression when may_be_zero is
+integer_nonzerop (actually iteration count is one) or else.  */
+  else
+   return;
+}

Re: [PATCH v3 3/3] PR80791 Consider doloop cmp use in ivopts

2019-07-20 Thread Bin.Cheng
On Wed, Jun 19, 2019 at 7:47 PM Kewen.Lin  wrote:
>
> Hi all,
>
> This is the following patch after 
> https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00910.html
>
> Main steps:
>   1) Identify the doloop cmp type iv use and record its bind_cand (explain it 
> later).
>   2) Set zero cost for pairs between this use and any iv cand.
>   3) IV cand set selecting algorithm runs as usual.
>   4) Fix up the selected iv cand for doloop use if need.
>
> It only focuses on the targets like Power which has specific count register.
> target hook have_count_reg_decr_p is proposed for it.
>
> Some notes:
>
> *) Why we need zero cost?  How about just decrease the cost for the pair
>between doloop use and its original iv cand?  How about just decrease
>the cost for the pair between doloop use and one selected iv cand?
>
>Since some target supports hardware count register for decrement and
>branch, it doesn't need the general instruction sequence for decr, cmp and
>branch in general registers.  The cost of moving count register to GPR
>is generally high, so it's standalone and can't be shared with other iv
>uses.  It means IVOPTs can take doloop use as invisible (zero cost).
>
>Let's take a look at PR80791 for example.
>
> original biv (cand 4)  use derived iv (cand 6)
>  generic use:   4  0
>  comp use (doloop use): 0 infinite
>
> For iv cost, original biv has cost 4 while use derived iv has cost 5.
> When IVOPTs considers doloop use, the optimal cost is 8 (original biv
> iv cost 4 + use cost 4).  Unfortunately it's not actually optimal, since
> later doloop transformation updates loop closing by count register,
> original biv (and its update) won't be needed in loop closing any more.
> The generic use become the only use for original biv.  That means, if we
> know the doloop will perform later, we shouldn't consider the doloop use
> when determining IV set.  If we don't consider it, the algorithm will
> choose iv cand 6 with total cost 5 (iv cost 5 + use cost 0).
>
> From the above, we can see that to decrease the cost for the pair between
> doloop use and original biv doesn't work.  Meanwhile it's hard to predict
> one good iv cand in final optimal set here and pre-update the cost
> between it and doloop use.  The analysis would be heavy and imperfect.
>
> *) Why we need bind_cand?
>
> As above, we assign zero cost for pairs between doloop use and each iv
> cand.  It's possible that doloop use gets assigned one iv cand which is
> invalid to be used during later rewrite.  Then we have to fix it up with 
> iv
> cand originally used for it.  It's fine whatever this iv cand exists in
> final iv cand set or not, even if it's not in the set, it will be
> eliminated in doloop transformation.
>
> By the way, I was thinking whether we can replace the hook 
> have_count_reg_decr_p
> with flag_branch_on_count_reg.  As the description of the "no-" option, 
> "Disable
> the optimization pass that scans for opportunities to use 'decrement and 
> branch'
> instructions on a count register instead of instruction sequences that 
> decrement
> a register, compare it against zero, and then branch based upon the result.", 
> it
> implicitly says it has count register support.  But I noticed that the gate of
> doloop_optimize checks this flag, as what I got from the previous 
> discussions, some
> targets which can perform doloop_optimize don't have specific count register, 
> so it
> sounds we can't make use of the flag, is it correct?
>
> Bootstrapped on powerpcle, also ran regression testing on powerpcle, got one 
> failure
> which is exposed by this patch and the root cause is duplicate of PR62147.
> case is gcc.target/powerpc/20050830-1.c
>
> Is it OK for trunk?
Sorry for the delay.

I am not much in favor of the approach.  When rewriting the pass
last time, we tried to reuse as much code as possible between cost
computation and iv_use rewriting.  We also followed the guideline that when
a finite cost is computed for a cand/use pair, the use should be rewritable
using the cand successfully.  However, the patch adjusts infinite cost
to zero cost, causing a cand that can't be used to rewrite the iv_use to be
selected; this is a backward step IMHO.

I am not sure whether this is only useful for doloop cases, or for general cases too?

The comment mentioned the point is to give more chances to consider other
IV cands instead of the BIV.  If the current algorithm relies on zeroing the
cost of an impossible cand/use pair to select the optimal result, I suspect
it's a bug which should be fixed in the candidate selection algorithm.  Do you
have a test case showing the issue?  We should fix it as a standalone
problem; the current approach papers over the problem and is not that
sound.

However, I think the patch can be changed so that only finite costs are
adjusted to zero, thus guaranteeing that any cand selected is

[PATCH PR91137]Find base object for ivopts via walk_tree

2019-07-17 Thread bin.cheng
Hi,
This patch fixes PR91137 by finding base objects with the walk_tree utility.
Note we specially return integer_zero_node when a tree expression contains
multiple base objects.  This works since the special node compares unequal to
any real base object and is thus skipped in candidate selection.  This is
intended to avoid propagating multiple base objects (which may be introduced
by the programmer).
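
For reference, a rough sketch (hypothetical and simplified, not the actual
patch body) of the walk_tree callback shape the ChangeLog's
determine_base_object_1 describes:

/* Callback for walk_tree.  DATA points to the base object found so far;
   integer_zero_node marks "multiple distinct base objects".  */
static tree
determine_base_object_1 (tree *tp, int *walk_subtrees, void *data)
{
  tree *base = (tree *) data;
  if (TREE_CODE (*tp) == ADDR_EXPR || TREE_CODE (*tp) == SSA_NAME)
    {
      if (*base == NULL_TREE)
        *base = *tp;                    /* First base object seen.  */
      else if (!operand_equal_p (*base, *tp, 0))
        {
          *base = integer_zero_node;    /* Multiple bases: poison marker.  */
          return *tp;                   /* Non-NULL result stops the walk.  */
        }
      *walk_subtrees = 0;
    }
  return NULL_TREE;
}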

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin
2019-07-15  Bin Cheng  

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize):
Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

gcc/testsuite
2019-07-15  Bin Cheng  

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

0001-pr91137-20190715.patch
Description: Binary data


Re: [PING^1][PATCH v4 3/3] PR80791 Consider doloop cmp use in ivopts

2019-07-15 Thread Bin.Cheng
On Fri, Jul 12, 2019 at 8:11 PM Richard Biener  wrote:
>
> On Wed, 10 Jul 2019, Kewen.Lin wrote:
>
> > Hi all,
> >
> > I'd like to gentle ping the below patch:
> > https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01225.html
> >
> > The previous version for more context/background:
> > https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01126.html
> >
> > Thanks a lot in advance!
>
> Again I would have hoped Bin to chime in here.
Sorry for missing this one, will get to the patch this week.  Sorry
again for the inconvenience.

Thanks,
bin
>
> Am I correct that doloop HW implementations are constrainted
> by a decrement of one?  I see no code in the patch to constrain
> things this way.  I'm not familiar with the group code at all
> but I would have expected the patch to only affect costing
> of IVs of the appropriate form (decrement one and possibly
> no uses besides the one in the compare/decrement).  Since
> ivcanon already adds a canonical counter IV it's not
> necessary to generate an artificial candidate IV of the
> wanted style (that's something I might have expected as well).
>
> The rest should be just magic from the IVOPTs side?
>
> There might be the need to only consider at most one counter IV
> in the costing code.
>
> Richard.
>
> >
> > on 2019/6/20 8:16 PM, Kewen.Lin wrote:
> > > Hi,
> > >
> > > Sorry, the previous patch is incomplete.
> > > New one attached.  Sorry for inconvenience.
> > >
> > > on 2019/6/20 8:08 PM, Kewen.Lin wrote:
> > >> Hi Segher,
> > >>
> > >>> On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
> >  +/* Return true if count register for branch is supported.  */
> >  +
> >  +static bool
> >  +rs6000_have_count_reg_decr_p ()
> >  +{
> >  +  return flag_branch_on_count_reg;
> >  +}
> > >>>
> > >>> rs6000 unconditionally supports these instructions, not just when that
> > >>> flag is set.  If you need to look at the flag, the *caller* of this new
> > >>> hook should, not every implementation of the hook.  So just "return 
> > >>> true"
> > >>> here?
> > >>
> > >> Good point!  Updated it as hookpod.
> > >>
> >  +/* For doloop use, if the algothrim selects some candidate which 
> >  invalid for
> > >>>
> > >>> "algorithm", "which is invalid".
> > >>
> >  +   some cost like zero rather than original inifite cost.  The point 
> >  is to
> > >>>
> > >>> "infinite"
> > >>>
> > >>
> > >> Thanks for catching!  I should run spelling check next time.  :)
> > >>
> > >> New version attached with comments addressed.
> > >>
> > >>
> > >> Thanks,
> > >> Kewen
> > >>
> >
> >
>
> --
> Richard Biener 
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)


Re: [PATCH v3 2/3] Add predict_doloop_p target hook

2019-05-21 Thread Bin.Cheng
On Tue, May 21, 2019 at 1:50 PM Kewen.Lin  wrote:
>
> on 2019/5/20 10:43 PM, Jeff Law wrote:
> > On 5/20/19 4:24 AM, Segher Boessenkool wrote:
> >> Let me try to answer a bit here...
> >>
> >> On Mon, May 20, 2019 at 11:28:26AM +0200, Richard Biener wrote:
> >>> On Thu, 16 May 2019, li...@linux.ibm.com wrote:
> >>
> >>> So the better way would be to expose that via a target hook somehow.
> >>> Or simply restrict IVOPTs processing to innermost loops for now.
> >>
> >> I think we should have two hooks: one is called with the struct loop as
> >> parameter; and the other is called for every statement in the loop, if
> >> the hook isn't null anyway.  Or perhaps we do not need that second one.
> > I'd wait to see a compelling example from real world code where we need
> > to scan the statements.  Otherwise we're just dragging in more target
> > specific decisions which in fact we want to minimize target stuff.
>
> The scan is trying to do similar thing like default_invalid_within_doloop.
> It scans for hardware counter register clobbering.  I think it's important
> and valuable to scan especially for call since it's common.
>
>   if (CALL_P (insn))
> return "Function call in loop.";
>
>   if (tablejump_p (insn, NULL, NULL) || computed_jump_p (insn))
> return "Computed branch in the loop.";
>
> But it's a question whether to make it as part of generic.  I double checked
> that most of the doloop targets use this default behavior, only 5 targets are
> using their own TARGET_INVALID_WITHIN_DOLOOP, so it might be a good thing to
> make it common to share.
I was not following the previous rounds of discussion.  But again, is it
possible to make all the code (including scanning every stmt in every bb in
the loop, as well as the niter checks) generic, and export a target hook that
checks a single gimple stmt?  In this case, backends could still make their
own decisions.
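
Roughly something like this sketch (the hook name predict_doloop_stmt_ok_p is
made up for illustration; the generic niter checks would sit alongside it):

static bool
generic_predict_doloop_p (struct loop *loop)
{
  basic_block *body = get_loop_body (loop);
  for (unsigned i = 0; i < loop->num_nodes; i++)
    for (gimple_stmt_iterator gsi = gsi_start_bb (body[i]);
         !gsi_end_p (gsi); gsi_next (&gsi))
      /* The target only judges individual statements.  */
      if (!targetm.predict_doloop_stmt_ok_p (gsi_stmt (gsi)))
        {
          free (body);
          return false;
        }
  free (body);
  return true;
}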

Thanks,
bin
>
>
> Thanks,
> Kewen
>


Re: [PATCH PR57534]Support strength reduction for MEM_REF in slur

2019-05-16 Thread Bin.Cheng
On Thu, May 16, 2019 at 11:50 PM Bill Schmidt  wrote:
>
> Thanks, Bin and Richard -- I am out of the office until Tuesday, so will 
> review
> when I get back.  Bin, please CC me on SLSR patches as otherwise I might miss
> them.  Thanks!
Thanks for helping.  Will do it next time.

Thanks,
bin
>
> Bill
>
>
> On 5/16/19 6:37 AM, Richard Biener wrote:
> > On Wed, May 15, 2019 at 6:30 AM bin.cheng  
> > wrote:
> >> Hi,
> >> As noted in PR57534 comment #33, SLSR currently doesn't strength reduce 
> >> memory
> >> references in reported cases, which conflicts with its comment at the 
> >> beginning of file.
> >> The main reason is in functions slsr_process_ref and restructure_reference 
> >> which
> >> reject MEM_REF via handled_component_p in the first place.  This patch
> >> identifies
> >> and creates CAND_REF for MEM_REF by restructuring base/offset.
> >>
> >> Note the patch only affects cases with multiple reducible MEM_REF.
> >>
> >> Also note, with this patch, [base + cst_offset] addressing mode would be 
> >> generally
> >> preferred.  I need to adjust three existing tests:
> >> * gcc.dg/tree-ssa/isolate-5.c
> >> * gcc.dg/tree-ssa/ssa-hoist-4.c
> >> Though address computation is reduced out of memory reference, the 
> >> generated
> >> assembly is not worsened.
> >>
> >> * gcc.dg/tree-ssa/slsr-3.c
> >> The generated assembly has two more instructions:
> >> <   movslq  %edx, %rcx
> >> <   movl(%rsi,%rcx,4), %r9d
> >> <   leaq0(,%rcx,4), %rax
> >> <   leal2(%r9), %r8d
> >> <   movl%r8d, (%rdi,%rcx,4)
> >> <   movl4(%rsi,%rax), %ecx
> >> <   addl$2, %ecx
> >> <   movl%ecx, 4(%rdi,%rax)
> >> <   movl8(%rsi,%rax), %ecx
> >> <   addl$2, %ecx
> >> <   movl%ecx, 8(%rdi,%rax)
> >> <   movl12(%rsi,%rax), %ecx
> >> <   addl$2, %ecx
> >> <   movl%ecx, 12(%rdi,%rax)
> >> ---
> >>>   movslq  %edx, %rax
> >>>   salq$2, %rax
> >>>   addq%rax, %rsi
> >>>   addq%rax, %rdi
> >>>   movl(%rsi), %eax
> >>>   addl$2, %eax
> >>>   movl%eax, (%rdi)
> >>>   movl4(%rsi), %eax
> >>>   addl$2, %eax
> >>>   movl%eax, 4(%rdi)
> >>>   movl8(%rsi), %eax
> >>>   addl$2, %eax
> >>>   movl%eax, 8(%rdi)
> >>>   movl12(%rsi), %eax
> >>>   addl$2, %eax
> >>>   movl%eax, 12(%rdi)
> >> Seems to me this is not deteriorating and "salq" can be saved by two 
> >> forward propagations.
> >>
> >> Bootstrap and test on x86_64, any comments?
> > The idea is good I think and the result above as well.  Leaving for Bill
> > to have a look as well, otherwise OK.
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> bin
> >>
> >> 2019-05-15  Bin Cheng  
> >>
> >> PR tree-optimization/57534
> >> * gimple-ssa-strength-reduction.c (restructure_base_offset): New.
> >> (restructure_reference): Call restructure_base_offset when offset 
> >> is
> >> NULL.
> >> (slsr_process_ref): Handle MEM_REF.
> >>
> >> 2018-05-15  Bin Cheng  
> >>
> >> PR tree-optimization/57534
> >> * gcc.dg/tree-ssa/pr57534.c: New test.
> >> * gcc.dg/tree-ssa/isolate-5.c: Adjust checking strings.
> >> * gcc.dg/tree-ssa/slsr-3.c: Ditto.
> >> * gcc.dg/tree-ssa/ssa-hoist-4.c: Ditto.
>


Re: [PATCH v2 1/3] Move prepare_decl_rtl to expr.[ch] as extern

2019-05-15 Thread Bin.Cheng
On Wed, May 15, 2019 at 5:51 PM Kewen.Lin  wrote:
>
> on 2019/5/15 4:50 PM, Bin.Cheng wrote:
> > On Wed, May 15, 2019 at 4:38 PM Kewen.Lin  wrote:
> >>
> >> on 2019/5/15 10:11 AM, Kewen.Lin wrote:
> >>>
> >>> on 2019/5/14 3:18 PM, Richard Biener wrote:
> >>>> Hum.  The function is somewhat of a hack, trying to produce
> >>>> "reasonable" DECL_RTL, exposing it in expr.[ch] with this
> >>>> name is eventually misleading.  Also you fail to "outline"
> >>>> the most important part:
> >>>>
> >>>>   FOR_EACH_VEC_ELT (decl_rtl_to_reset, i, obj)
> >>>> SET_DECL_RTL (obj, NULL_RTX);
> >>>>
> >>>> which IMHO would warrant making this machinery a class
> >>>> with the above done in its destructor?
> >>>>
> >>>
> >>> Good suggestion!  In the IVOPTS implementation, it has one
> >>> interface free_loop_data to clean up some data structures
> >>> including this decl_rtl_to_reset.  While for the use in
> >>> "PATCH v2 2/3", we have to clean it artificially once
> >>> expanding finishes.
> >>>
> >>> It's better to make it as a class and ensure the clean of
> >>> the vector in its destructor.
> >>>
> >>
> >> Hi Bin,
> >>
> >> Could you kindly comment this?  For the use in "PATCH v2 2/3",
> >> it's still the same with destructor change.  But for the use
> >> in IVOPTs, I don't know the historical reason why to clean
> >> up in free_loop_data instead of in place?  If it's possible to
> > This is quite old code in ivopts, I suppose it's for caching results
> > as you guessed.  I didn't measure how much the caching affects
> > performance.  It can be simply removed if not as useful as expected.
> > Of course we need experiment data here.
>
> Got it, thanks!
>
> >
> > Please correct me if I am wrong.  This is factored out of ivopts in
> > order to reuse it in the new target hook?  As discussed in another
> > thread, most target hook code can be moved to ivopts, so this
> > refactoring and in-place freeing would be unnecessary, and the new
> > code can also take advantage of caching?
> >
>
> Yes, you are absolutely right.  If we move the generic part (
> especially costly_iter_for_doloop_p) to IVOPTs, the factoring isn't
> necessary any more.
>
> I thought it may be still nice to refactor it even if we will go
> with that moving of generic part.  But the code exists for a long
> time but there is no requests to factor it out all the time, it
> looks useless to expose it speculatively?
Just do it if you think it necessary.  The criterion is simple: whether it
results in better code.  You don't need any approval to improve GCC.
:-)

Thanks,
bin
>
> OK, I'll cancel factoring it then.
>
>
> Thanks,
> Kewen
>
> > Thanks,
> > bin
>
>
> >> reuse, for example some object was prepared and we don't need
> >> to prepare for it again during the same loop handling, then
> >> the destructor proposal will clean it up too early and
> >> inefficient.  If so, one flag in constructor to control this
> >> might be required.
> >>
> >> Thanks a lot in advance!
> >>
> >> Kewen
> >>
> >>
>


Re: [PATCH v2 1/3] Move prepare_decl_rtl to expr.[ch] as extern

2019-05-15 Thread Bin.Cheng
On Wed, May 15, 2019 at 4:38 PM Kewen.Lin  wrote:
>
> on 2019/5/15 10:11 AM, Kewen.Lin wrote:
> >
> > on 2019/5/14 3:18 PM, Richard Biener wrote:
> >> Hum.  The function is somewhat of a hack, trying to produce
> >> "reasonable" DECL_RTL, exposing it in expr.[ch] with this
> >> name is eventually misleading.  Also you fail to "outline"
> >> the most important part:
> >>
> >>   FOR_EACH_VEC_ELT (decl_rtl_to_reset, i, obj)
> >> SET_DECL_RTL (obj, NULL_RTX);
> >>
> >> which IMHO would warrant making this machinery a class
> >> with the above done in its destructor?
> >>
> >
> > Good suggestion!  In the IVOPTS implementation, it has one
> > interface free_loop_data to clean up some data structures
> > including this decl_rtl_to_reset.  While for the use in
> > "PATCH v2 2/3", we have to clean it artificially once
> > expanding finishes.
> >
> > It's better to make it as a class and ensure the clean of
> > the vector in its destructor.
> >
>
> Hi Bin,
>
> Could you kindly comment this?  For the use in "PATCH v2 2/3",
> it's still the same with destructor change.  But for the use
> in IVOPTs, I don't know the historical reason why to clean
> up in free_loop_data instead of in place?  If it's possible to
This is quite old code in ivopts, I suppose it's for caching results
as you guessed.  I didn't measure how much the caching affects
performance.  It can be simply removed if not as useful as expected.
Of course we need experiment data here.

Please correct me if I am wrong.  This is factored out of ivopts in
order to reuse it in the new target hook?  As discussed in another
thread, most target hook code can be moved to ivopts, so this
refactoring and in-place freeing would be unnecessary, and the new
code can also take advantage of caching?
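
For what it's worth, a minimal sketch of the destructor idea (using the
guessed_decl_rtl name Richard suggests further down in the quote; simplified,
and assuming the usual GCC vec/rtl headers):

class guessed_decl_rtl
{
  auto_vec<tree, 20> m_decls;

public:
  /* Record OBJ after its DECL_RTL has been set speculatively.  */
  void record (tree obj) { m_decls.safe_push (obj); }

  /* Reset every speculatively set DECL_RTL on scope exit, instead of
     deferring that to free_loop_data.  */
  ~guessed_decl_rtl ()
  {
    unsigned i;
    tree obj;
    FOR_EACH_VEC_ELT (m_decls, i, obj)
      SET_DECL_RTL (obj, NULL_RTX);
  }
};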

Thanks,
bin
> reuse, for example some object was prepared and we don't need
> to prepare for it again during the same loop handling, then
> the destructor proposal will clean it up too early and
> inefficient.  If so, one flag in constructor to control this
> might be required.
>
> Thanks a lot in advance!
>
> Kewen
>
>
> >> Maybe name the functions prepare_guessed_decl_rtl ()
> >> and the new class guessed_decl_rtl?
> >>
> > OK.  Or "tmp" instead of "guessed"?
> >
> >
> > Thanks,
> > Kewen
> >
> >> Now looking how you'll end up using this...
> >>
> >> Richard.
> >>
> >
>


Re: [PATCH v2 2/3] Add predict_doloop_p target hook

2019-05-15 Thread Bin.Cheng
On Wed, May 15, 2019 at 1:20 PM Kewen.Lin  wrote:
>
> on 2019/5/15 11:34 AM, Kewen.Lin wrote:
> >
> > on 2019/5/15 10:40 AM, Bin.Cheng wrote:
> >> I wonder if you can factor out generic part into GIMPLE and leave
> >> target hook as specific as possible?
> >>
> >
> > Good suggestion! Yes, most of the checks are common as you
> > pointed out.  I hope the other targets won't have more
> > customization needs excepting for that GIMPLE stmt hook
> > check.
> >
> > I am thinking IVOPTs is a best place to factor to? Or
> > somewhere to put common GIMPLE query interface?
> >
>
> Or move it into targhooks.cpp as a possible base process
> of [target]_predict_doloop_p?   The target owner can
> decide whether to use generic_predict_doloop_p in its
> own target hook.
> It seems more flexible and allow them to have a new
> implementation for their own targets.  Like:
>
> rs6000_predict_doloop_p:
> 
> if (generic_predict_doloop_p(loop))
> ...
>
> special_target_predict_doloop_p:
Hmm, I thought the target hook would not need to take a loop parameter
after moving the generic part into ivopts?  Also the moved part would be a
local function (calling the new hook on each stmt) in ivopts?

Thanks,
bin
> 
> 
>
>
> Thanks,
> Kewen
>
> >>> +
> >>> +  /* Similar to doloop_optimize, check whether iteration count too small
> >>> + and not profitable.  */
> >>> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> >>> +  if (est_niter == -1)
> >>> +est_niter = get_likely_max_loop_iterations_int (loop);
> >>> +  if (est_niter >= 0 && est_niter < 3)
> >> The only probably target dependent is the magic number 3 here.  After
> >> moving all generic part to ivopts, we can use a macro for the moment,
> >> and improve it as a parameter if there are more doloop targets.
> >
> > The magic number 3 is from function doloop_optimize, not a
> > target dependent value.  But thanks for your tips with
> > macro mechanism!
> >
> >
> > Thanks,
> > Kewen
> >
> >> Overall most of the above checks can be moved out of the backend, leaving
> >> only more specific target hook checking on gimple stmts?  And even that
> >> can be made generic to some extent.
> >>
> >> Thanks,
> >> bin
> >
>


[PATCH GCC]Correct cand_chain and stmt_cand_map for copy/cast

2019-05-14 Thread bin.cheng
Hi,
I noticed that cand_chain (first_interp/next_interp) is not maintained correctly
in slsr_process_copy/slsr_process_cast (now slsr_process_copycast).  This one
fixes the issue, as well as records the "first" cand in stmt_cand_map.

Hi Bill, is this correct, or did I misunderstand the code?  Bootstrap and test
on x86_64.

Thanks,
bin

2019-05-15  Bin Cheng  

* gimple-ssa-strength-reduction.c (slsr_process_copycast): Record
information about next_interp and the first cand.

0003-Correct-cand_chain-and-cand_stmt_map-for-copy-cast.patch
Description: Binary data


[PATCH GCC]refactor process of copy and cast in slsr

2019-05-14 Thread bin.cheng
Hi,
I noticed that slsr_process_copy and slsr_process_cast are almost identical
and called only once.  This patch refactors the two functions into one.
For copies, it also passes the lhs' type to the creation of the new cand,
assuming that lhs, rhs and base_cand have the same/compatible type.  I will
keep an eye on this in case of any fallout.

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

2019-05-15  Bin Cheng  

* gimple-ssa-strength-reduction.c (slsr_process_cast): Refactor into
(slsr_process_copycast): ...this.
(slsr_process_copy): Delete.
(find_candidates_dom_walker::before_dom_children): Call the refactor
function slsr_process_copycast.

0002-Factor-slsr-process-of-copy-and-cast.patch
Description: Binary data


[PATCH PR57534]Support strength reduction for MEM_REF in slur

2019-05-14 Thread bin.cheng
Hi,
As noted in PR57534 comment #33, SLSR currently doesn't strength reduce memory
references in the reported cases, which conflicts with its comment at the
beginning of the file.
The main reason is in functions slsr_process_ref and restructure_reference,
which reject MEM_REF via handled_component_p in the first place.  This patch
identifies MEM_REF and creates CAND_REF for it by restructuring base/offset.

Note the patch only affects cases with multiple reducible MEM_REFs.
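
For reference, the affected code looks roughly like the snippet below,
reconstructed from the slsr-3.c assembly shown further down; it is an
illustration, not the exact testcase:

/* Several MEM_REFs sharing a reducible base plus small constant
   offsets -- the pattern the new CAND_REF handling targets.  */
void
f (int *dst, int *src, int i)
{
  dst[i] = src[i] + 2;
  dst[i + 1] = src[i + 1] + 2;
  dst[i + 2] = src[i + 2] + 2;
  dst[i + 3] = src[i + 3] + 2;
}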

Also note, with this patch, the [base + cst_offset] addressing mode would
generally be preferred.  I need to adjust three existing tests:
* gcc.dg/tree-ssa/isolate-5.c
* gcc.dg/tree-ssa/ssa-hoist-4.c
Though the address computation is reduced out of the memory reference, the
generated assembly is not worsened.

* gcc.dg/tree-ssa/slsr-3.c
The generated assembly has two more instructions:
<   movslq  %edx, %rcx
<   movl    (%rsi,%rcx,4), %r9d
<   leaq    0(,%rcx,4), %rax
<   leal    2(%r9), %r8d
<   movl    %r8d, (%rdi,%rcx,4)
<   movl    4(%rsi,%rax), %ecx
<   addl    $2, %ecx
<   movl    %ecx, 4(%rdi,%rax)
<   movl    8(%rsi,%rax), %ecx
<   addl    $2, %ecx
<   movl    %ecx, 8(%rdi,%rax)
<   movl    12(%rsi,%rax), %ecx
<   addl    $2, %ecx
<   movl    %ecx, 12(%rdi,%rax)
---
>   movslq  %edx, %rax
>   salq    $2, %rax
>   addq    %rax, %rsi
>   addq    %rax, %rdi
>   movl    (%rsi), %eax
>   addl    $2, %eax
>   movl    %eax, (%rdi)
>   movl    4(%rsi), %eax
>   addl    $2, %eax
>   movl    %eax, 4(%rdi)
>   movl    8(%rsi), %eax
>   addl    $2, %eax
>   movl    %eax, 8(%rdi)
>   movl    12(%rsi), %eax
>   addl    $2, %eax
>   movl    %eax, 12(%rdi)

Seems to me this is not a deterioration, and the "salq" can be saved by two
forward propagations.

Bootstrap and test on x86_64, any comments?

Thanks,
bin

2019-05-15  Bin Cheng  

PR tree-optimization/57534
* gimple-ssa-strength-reduction.c (restructure_base_offset): New.
(restructure_reference): Call restructure_base_offset when offset is
NULL.
(slsr_process_ref): Handle MEM_REF.

2019-05-15  Bin Cheng  

PR tree-optimization/57534
* gcc.dg/tree-ssa/pr57534.c: New test.
* gcc.dg/tree-ssa/isolate-5.c: Adjust checking strings.
* gcc.dg/tree-ssa/slsr-3.c: Ditto.
* gcc.dg/tree-ssa/ssa-hoist-4.c: Ditto.

0001-pr57534.patch
Description: Binary data


Re: [PATCH v2 2/3] Add predict_doloop_p target hook

2019-05-14 Thread Bin.Cheng
I wonder if you can factor out the generic part into GIMPLE and leave
the target hook as specific as possible?

On Tue, May 14, 2019 at 11:10 AM  wrote:
>
> From: Kewen Lin 
>
> Previous version link for background:
> https://gcc.gnu.org/ml/gcc-patches/2019-04/msg00912.html
>
> This hook is to predict whether one loop in gimple will
> be transformed to a low-overhead loop later in RTL, and is
> designed to be called in ivopts.
>
> "Since the low-overhead loop optimize transformation is
> based on RTL, some of those checks are hard to be imitated
> on gimple, so it's not possible to predict the current
> loop will be transformed exactly in middle-end.  But if we
> can have most loop predicted precisely, it would be helpful.
> It highly depends on target hook fine tuning. It's
> acceptable to have some loops which can be transformed to
> low-overhead loop but we don't catch.  But we should try
> our best to avoid to predict some loop as low-overhead loop
> but which isn't."
>
> Bootstrapped and regression testing passed on powerpc64le.
>
> Is it ok for trunk?
>
> gcc/ChangeLog
>
> 2019-05-13  Kewen Lin  
>
> PR middle-end/80791
> * target.def (predict_doloop_p): New hook.
> * targhooks.h (default_predict_doloop_p): New declaration.
> * targhooks.c (default_predict_doloop_p): New function.
> * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New hook.
> * doc/tm.texi: Regenerate.
> * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New function.
> (costly_iter_for_doloop_p): Likewise.
> (rs6000_predict_doloop_p): Likewise.
> (TARGET_PREDICT_DOLOOP_P): New macro.
>
> ---
>  gcc/config/rs6000/rs6000.c | 174 
> -
>  gcc/doc/tm.texi|   8 +++
>  gcc/doc/tm.texi.in |   2 +
>  gcc/target.def |   9 +++
>  gcc/targhooks.c|  13 
>  gcc/targhooks.h|   1 +
>  6 files changed, 206 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index a21f4f7..1c1c8eb 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -83,6 +83,9 @@
>  #include "tree-ssa-propagate.h"
>  #include "tree-vrp.h"
>  #include "tree-ssanames.h"
> +#include "tree-ssa-loop-niter.h"
> +#include "tree-cfg.h"
> +#include "tree-scalar-evolution.h"
>
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -1914,6 +1917,9 @@ static const struct attribute_spec 
> rs6000_attribute_table[] =
>  #undef TARGET_CAN_USE_DOLOOP_P
>  #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
>
> +#undef TARGET_PREDICT_DOLOOP_P
> +#define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
> +
>  #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
>  #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
>
> @@ -39436,7 +39442,173 @@ rs6000_mangle_decl_assembler_name (tree decl, tree id)
>return id;
>  }
>
> -
> +/* Check whether there are some instructions preventing doloop transformation
> +   inside loop body, mainly for instructions which are possible to kill CTR.
> +
> +   Return true if some invalid insn exists, otherwise return false.  */
> +
> +static bool
> +invalid_insn_for_doloop_p (struct loop *loop)
> +{
> +  basic_block *body = get_loop_body (loop);
> +  gimple_stmt_iterator gsi;
> +
> +  for (unsigned i = 0; i < loop->num_nodes; i++)
> +    for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
The loops over bbs/stmts seem generic and can be put in GIMPLE.  So a
target hook taking a gimple stmt parameter and returning whether the stmt
blocks doloop is enough?
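
For concreteness, such a stmt-level hook could be declared along these lines
in target.def; the name and wording here are hypothetical, not the committed
interface:

DEFHOOK
(stmt_invalid_for_doloop_p,
 "Return true if @var{stmt} would prevent the innermost loop containing\n\
it from being transformed into a low-overhead loop.",
 bool, (gimple *stmt),
 default_stmt_invalid_for_doloop_p)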

> +  {
> +   gimple *stmt = gsi_stmt (gsi);
> +   if (is_gimple_call (stmt) && !gimple_call_internal_p (stmt)
> +   && !is_inexpensive_builtin (gimple_call_fndecl (stmt)))
> + {
> +   if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> +  "predict doloop failure due to finding call.\n");
> +   return true;
> + }
> +   if (computed_goto_p (stmt))
> + {
> +   if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "predict doloop failure due to"
> + "finding computed jump.\n");
> +   return true;
> + }
> +   /* TODO: Now this hook is expected to be called in ivopts, which is
> +      before switchlower1/switchlower2.  Checking for SWITCH at this point
> +      will eliminate some good candidates.  But since there are only a few
> +      cases having _a_ switch statement in the innermost loop, it can be a
> +      low priority enhancement.  */
> +
> +   if (gimple_code (stmt) == GIMPLE_SWITCH)
> + {
> +   if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> +  "predict doloop failure due to finding switch.\n");
> +   return true;
> + 

[PATCH GCC]Simplify if-then-else in slsr cand_vect

2019-05-13 Thread bin.cheng
Hi,
While working on another PR, I noticed that we can save lots of if-then-else
in accessing cand_vec by placing an additional NULL element at the front of
it.
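
The effect on lookup is roughly the following; a simplified sketch rather
than the exact SLSR code:

/* With a NULL sentinel stored at index 0 of cand_vec, index 0 (meaning
   "no candidate") and real indices are handled uniformly, so callers no
   longer need "idx ? lookup_cand (idx) : NULL" guards.  */
static slsr_cand_t
lookup_cand (cand_idx idx)
{
  /* idx == 0 now simply yields the NULL sentinel element.  */
  return cand_vec[idx];
}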

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

2019-05-13  Bin Cheng  

* gimple-ssa-strength-reduction.c (lookup_cand): Adjust index by 1.
(alloc_cand_and_find_basis): Ditto.
(backtrace_base_for_ref, create_mul_ssa_cand): Remove if-then-else.
(create_mul_imm_cand, create_add_ssa_cand): Ditto.
(create_add_imm_cand, slsr_process_cast): Ditto.
(slsr_process_copy, replace_mult_candidate): Ditto.
(replace_rhs_if_not_dup, replace_one_candidate): Ditto.
(dump_cand_vec, analyze_candidates_and_replace): Skip NULL element.
(pass_strength_reduction::execute): Init the first NULL element.

0001-Simplify-cand_vec-access-in-slsr.patch
Description: Binary data


Re: [PATCH][PR90106] Builtin call transformation changes in cdce pass

2019-05-08 Thread Bin.Cheng
On Thu, May 9, 2019 at 5:31 AM Jeff Law  wrote:
>
> On 5/8/19 6:28 AM, Richard Biener wrote:
> > On Wed, May 8, 2019 at 12:09 PM JunMa  wrote:
> >>
> >> Hi
> >>
> >> As described in PR90106 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106),
> >> when gcc meets a builtin function call like:
> >>
> >>    y = sqrt (x);
> >>
> >> The cdce pass tries to transform the call into an internal function
> >> call and conditionally executes the call, with a simple range check on
> >> the arguments which can detect most cases where errno does not need
> >> to be set.  It looks like:
> >>
> >>    y = IFN_SQRT (x);
> >>    if (__builtin_isless (x, 0))
> >>      sqrt (x);
> >>
> >> However, if the call is in tail position, for example:
> >>
> >>    y = sqrt (x);
> >>    return y;
> >>
> >> will become:
> >>
> >>    y = IFN_SQRT (x);
> >>    if (__builtin_isless (x, 0))
> >>      sqrt (x);
> >>    return y;
> >>
> >> This transformation breaks the tailcall pattern, and prevents
> >> later tailcall optimizations.
> >>
> >> So this patch transforms a builtin call with a return value into
> >> an if-then-else form, which looks like:
> >>
> >>    y = sqrt (x);
> >> ==>
> >>    if (__builtin_isless (x, 0))
> >>      y = sqrt (x);
> >>    else
> >>      y = IFN_SQRT (x);
> >>
> >> BTW, y = sqrt (x) can also be transformed like:
> >>
> >>    y = IFN_SQRT (x);
> >>    if (__builtin_isless (x, 0))
> >>      y = sqrt (x);
> >>
> >> We don't choose this pattern because it emits worse assembly
> >> code (more move instructions and more register usage) on x86_64.
> >>
> >> Bootstrapped/regtested on x86_64-linux, ok for trunk?
> >
> > OK.
> JunMa -- do you have a copyright assignment on file and write access to
> the repository?  If not we should take care of that before proceeding
> further.
Hi Jeff,
Thanks very much for helping.
Yes, he is under Alibaba's copyright assignment.  What else should we
do other than noting it in this mailing list message?
BTW, I think JunMa needs to apply for write access though.

Thanks,
bin
>
> Jeff


Re: [PATCH PR90078]Capping comp_cost computation in ivopts

2019-05-06 Thread Bin.Cheng
On Mon, May 6, 2019 at 6:11 PM Richard Biener
 wrote:
>
> On Sun, May 5, 2019 at 8:03 AM bin.cheng  wrote:
> >
> > > --
> > > Sender:Jakub Jelinek 
> > > Sent At:2019 Apr. 17 (Wed.) 19:27
> > > Recipient:Bin.Cheng 
> > > Cc:bin.cheng ; GCC Patches 
> > > 
> > > Subject:Re: [PATCH PR90078]Capping comp_cost computation in ivopts
> > >
> > >
> > > On Wed, Apr 17, 2019 at 07:14:05PM +0800, Bin.Cheng wrote:
> > > > > As
> > > > > #define INFTY 1000
> > > > > what is the reason to keep the previous condition as well?
> > > > > I mean, if cost1.cost == INFTY or cost2.cost == INFTY,
> > > > > cost1.cost + cost2.cost >= INFTY too.
> > > > > Unless costs can go negative.
> > > > It's a bit complicated, but in general, costs can go negative.
> > >
> > > Ok, no objections from me then (but as I don't know anything about it,
> > > not an ack either; you are ivopts maintainer, so you don't need one).
> >
> > Hi,
> > The previous patch was reverted on GCC-9 because of PR90240.  PR90240 is now
> > fixed by another patch.  This is the updated patch for PR90078.  It promotes
> > the type of the ivopts cost from int to int64_t, and changes the behavior of
> > infinite_cost overflow from saturation to an assertion.
> > Please note, implicit conversions are kept in cost computation as before,
> > without introducing any narrowing.
> >
> > Bootstrap/test on x86_64 along with PR90240 patch.  Is it OK?
>
> Do not include system headers in .c files, instead those need to be
> (and are already)
> included via system.h.
>
>  /* The infinite cost.  */
> -#define INFTY 1000
> +#define INFTY 10L
>
> do we actually need this?  What happens on an ilp32 host?  That is, I believe
> you can drop the 'L' (it fits into an int anyways)
Yeah, now I wonder whether int64_t is necessary at all, given the scaling
bound and assertions.
>
> @@ -256,6 +259,7 @@ operator- (comp_cost cost1, comp_cost cost2)
>  return infinite_cost;
>
>gcc_assert (!cost2.infinite_cost_p ());
> +  gcc_assert (cost1.cost - cost2.cost < infinite_cost.cost);
>
>cost1.cost -= cost2.cost;
>cost1.complexity -= cost2.complexity;
>
> probably a pre-existing issue, but we do not seem to handle underflow
> here in general, nor check that underflow doesn't get us below -INFTY.
>
> I guess we really don't want negative costs?  That doesn't seem to be
> documented and I was also wondering why the cost isn't unsigned...
>
> @@ -638,7 +646,7 @@ struct iv_ca
>comp_cost cand_use_cost;
>
>/* Total cost of candidates.  */
> -  unsigned cand_cost;
> +  int64_t cand_cost;
>
>/* Number of times each invariant variable is used.  */
>unsigned *n_inv_var_uses;
>
> shows this "issue".  Can't we use uint64_t throughout the patch?
Oh, it's actually explained in a previous message,
https://gcc.gnu.org/ml/gcc-patches/2019-04/msg00697.html
In short, yes the cost can be negative.  The negative cost should be
small in absolute value, and we don't check -INFTY.  With the scaling
bound change, I don't think -INFTY is possible IIUC.

Thanks,
bin
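
On the underflow point, a symmetric guard in operator- would be
straightforward; a sketch based on the hunk quoted above, where only the
second assert is new and purely illustrative:

comp_cost
operator- (comp_cost cost1, comp_cost cost2)
{
  if (cost1.infinite_cost_p ())
    return infinite_cost;

  gcc_assert (!cost2.infinite_cost_p ());
  gcc_assert (cost1.cost - cost2.cost < infinite_cost.cost);
  /* Hypothetical extra check: also reject underflow below -INFTY.  */
  gcc_assert (cost1.cost - cost2.cost > -infinite_cost.cost);

  cost1.cost -= cost2.cost;
  cost1.complexity -= cost2.complexity;

  return cost1;
}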
>
> Otherwise this looks OK.
>
> Thanks,
> Richard.
>
> > Thanks,
> > bin
> > 2019-05-05  Bin Cheng  
> >
> > PR tree-optimization/90078
> > * tree-ssa-loop-ivopts.c (inttypes.h): Include new header file.
> > (INFTY): Increase the value for infinite cost.
> > (struct comp_cost): Promote type of members to int64_t.
> > (infinite_cost): Don't set complexity in initialization.
> > (comp_cost::operator +,-,+=,-=,/=,*=): Assert when cost computation
> > overflows to infinite_cost.
> > (adjust_setup_cost): Promote type of parameter and cost computation
> > to int64_t.
> > (struct ainc_cost_data, struct iv_ca): Promote type of member to
> > int64_t.
> > (get_scaled_computation_cost_at, determine_iv_cost): Promote type of
> > cost computation to int64_t.
> > (determine_group_iv_costs, iv_ca_dump, find_optimal_iv_set): Use
> > int64_t's format specifier in dump.
> >
> > 2019-05-05  Bin Cheng  
> >
> > PR tree-optimization/90078
> > * g++.dg/tree-ssa/pr90078.C: New test.


Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts

2019-05-05 Thread Bin.Cheng
On Sun, May 5, 2019 at 2:02 PM Kewen.Lin  wrote:
>
> > on 2019/5/5 12:04 PM, Bin.Cheng wrote:
> > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin  wrote:
> >>>> +  /* Some compare iv_use is probably useless once the doloop
> >>>> + optimization performs.  */
> >>>> +  if (tailor_cmp_p)
> >>>> +    tailor_cmp_uses (data);
> >>> Function tailor_cmp_uses sets iv_use->zero_cost_p under some
> >>> conditions.  Once the flag is set, though the iv_use participates in
> >>> cost computation in determine_group_iv_cost_cond, the result cost is
> >>> hard-wired to ZERO (which means cost computation for such an iv_use is
> >>> a waste of time).
> >>
> >> Yes, it can be improved by some early check and return.
> >> But it's still helpful to make the call with may_eliminate_iv.
> >> gcc.dg/no-strict-overflow-6.c is one example; with may_eliminate_iv
> >> considered, it exposes more opportunities for downstream optimization.
> > Hmm, I wasn't suggesting an early check and return; on the contrary, we
> > can better integrate the doloop/cost stuff in the overall model.  See more
> > in the following comments.
>
> Sorry, I didn't state it clearly; the previous comment was to point out that
> the call with may_eliminate_iv is not completely a "waste of time", and an
> early return can save time.  :)
>
> And yes, it's not an issue any more with your proposed idea.
>
> >>
> >>> Also the iv_use rewriting process is skipped for related
> >>> ivs preserved explicitly by preserve_ivs_for_use.
> >>> Note IVOPTs already adds candidates for original ivs.  So why not
> >>> detect the doloop iv_use, adjust its cost with the corresponding
> >>> original iv candidate, then let the general cost model make the
> >>> decision?  I believe this better fits existing infrastructure and
> >>> requires less change, for example, preserve_ivs_for_use won't be
> >>> necessary.
> >> I agree adjusting the cost of the original iv candidate of the iv_use
> >> requires less change, but on one hand, I thought removing the interesting
> >> cmp iv_use or making it zero cost is closer to the fact.  Since it's
> >> eliminated eventually in doloop optimization, it should not be
> >> considered in cost modeling.  This way looks more exact.
> > Whether doloop transformation should be considered or be bypassed in
> > cost model isn't a problem.  Actually we can bind doloop iv_cand to
> > cmp iv_use in order to force the transformation. My concern is the
> > patch specially handles doloop by setting the special flag, then
> > checking it.  We generally avoid such special-case handling since it
> > hurts long-term maintenance.  The pass was very unstable in the past
> > because of such issues.
> >
>
> OK, I understand your concerns now. Thanks for explanation!
>
> >> On the other hand, I assumed your suggestion is to increase the
> >> cost for the pair (original iv cand, cmp iv use); the increased cost
> >> seems to be a heuristic value?  It seems it's possible to sacrifice
> > Decrease the cost so that the iv_cand is preferred?  The comment
> > wasn't made on top of implementing doloop in ivopts.  Anyway, you can
> > still bypass cost model by binding the "correct" iv_cand to cmp
> > iv_use.
> >
>
> Decreasing the cost isn't workable for this case; it makes the original
> iv cand preferred over the other iv cand for the memory iv_use,
> so the desirable memory-based iv cand won't be chosen.
> If we increase the cost, one basic iv cand is chosen for the cmp use and a
> memory-based iv cand for the memory use, instead of the original iv for both.
Sorry for the mistake, I meant to decrease the use cost of whatever "correct"
iv_cand for the cmp iv_use that could enable doloop optimization; it doesn't
have to be the original iv_cand.

>
> Could you help to explain the "binding" more?  Does it mean cost modeling
> decision making can simply bypass the cmp iv_use (we have artificially
> assigned one "correct" cand iv to it) and also doesn't count the "correct"
> iv cand cost into total iv cost? Is my understanding correct?
For example, if the heuristic finds out the "correct" doloop iv_cand, we can
force choosing that iv_cand for cmp iv_use and bypass the candidate choosing
algorithm:
struct iv_group {
  //...
  struct iv_cand *bind_cand;
};
then set this bind_cand directly in struct iv_ca::cand_for_group.  As a result,
the iv_use rewriting code takes care of everything; no special handling (such as
preserve_ivs_for_use) is needed.
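
To make the binding concrete, it could be consulted at the point where a
candidate is chosen for a group; everything below beyond the bind_cand field
sketched above is hypothetical:

/* Short-circuit candidate selection for a group with a forced binding.  */
static struct iv_cand *
cand_for_group (struct iv_group *group)
{
  if (group->bind_cand)
    /* Forced binding, e.g. the doloop iv_cand for the cmp iv_use:
       bypass the cost-based choosing algorithm entirely.  */
    return group->bind_cand;

  /* Hypothetical fallback into the normal cost-based selection.  */
  return choose_best_cand_for_group (group);
}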


Re: [PATCH PR90240][RFC]Avoid scaling cost overflow by introducing scaling bound

2019-05-05 Thread bin.cheng
Hmm, mis-attached the old version patch.  Here is the updated one.

Thanks,
bin

--
Sender:bin.cheng 
Sent At:2019 May 5 (Sun.) 13:54
Recipient:Richard Biener 
Cc:GCC Patches 
Subject:Re: [PATCH PR90240][RFC]Avoid scaling cost overflow by introducing scaling bound


> --
> Sender:Richard Biener 
> Sent At:2019 Apr. 29 (Mon.) 20:01
> Recipient:bin.cheng 
> Cc:GCC Patches ; mliska 
> Subject:Re: [PATCH PR90240][RFC]Avoid scaling cost overflow by introducing scaling bound
> 
> 
> On Sat, Apr 27, 2019 at 6:13 AM bin.cheng  wrote:
> >
> > > Hi,
> >
> > This is the draft patch avoiding scaling cost overflow by introducing a
> > scaling bound in IVOPTs.  For now the bound is 20, and the scaling factor
> > will be further scaled wrt this bound.  For example, scaling factors like
> > 1, 1000, 2000 (max) would be scaled to 1, 10, 20 correspondingly.
> >
> > Hi Martin, I remember you introduced comp_cost/cost_scaling to improve one
> > case in spec2017.  Unfortunately I don't have access to the benchmark now;
> > could you help verify whether this patch has a performance issue on it,
> > please?  Thanks
> >
> > Bootstrap and test on x86_64, and this is for further comment/discussion.
> 
> + float factor = 1.0f * bfreq / lfreq;
> + if (adjust_factor_p)
> +   {
> + factor *= COST_SCALING_FACTOR_BOUND;
> + factor = factor * lfreq / max_freq;
> +   }
> + body[i]->aux = (void *)(intptr_t)(int) factor;
> 
> float computations on the host shouldn't be done.
Fixed.
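
A host-integer-only version of that computation might look like the sketch
below, assuming the frequencies are available as integers; only
COST_SCALING_FACTOR_BOUND is taken from the changelog, the rest is
illustrative:

/* Map a basic block frequency into [1, COST_SCALING_FACTOR_BOUND]
   without host floating point.  */
#define COST_SCALING_FACTOR_BOUND 20

static int64_t
scaled_cost_factor (int64_t bfreq, int64_t max_freq)
{
  if (max_freq == 0)
    return 1;

  int64_t factor = (bfreq * COST_SCALING_FACTOR_BOUND) / max_freq;
  /* Never scale a cost down to zero.  */
  return factor > 0 ? factor : 1;
}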
> 
> I wonder if changing comp_cost::cost to int64_t would help instead?  See
> also my suggestion to use gmp.
Next patch for PR90078 changes the cost to int64_t.
> 
> Otherwise the approach looks sane to me - can we then even assert that no
> overflows happen and thus INFTY only appears if we manually set such cost?
Next patch for PR90078 changes to assert on cost overflow.

Attachment is the updated patch.  Bootstrap/test on x86_64 along with the
PR90078 patch.
Is it OK?

Thanks,
bin
2019-05-05  Bin Cheng  

PR tree-optimization/90240
* tree-ssa-loop-ivopts.c (get_scaled_computation_cost_at): Scale cost
with respect to scaling factor pre-computed for each basic block.
(try_improve_iv_set): Return bool if best_cost equals to iv_ca cost.
(find_optimal_iv_set_1): Free iv_ca set if it has infinite_cost.
(COST_SCALING_FACTOR_BOUND, determine_scaling_factor): New.
(tree_ssa_iv_optimize_loop): Call determine_scaling_factor.  Extend
live range for array of loop's basic blocks.  Cleanup aux field of
loop's basic blocks.

2019-05-05  Bin Cheng  

PR tree-optimization/90240
* gfortran.dg/graphite/pr90240.f: New test.

0001-pr90240.patch
Description: Binary data


Re: [PATCH PR90078]Capping comp_cost computation in ivopts

2019-05-05 Thread bin.cheng
> --
> Sender:Jakub Jelinek 
> Sent At:2019 Apr. 17 (Wed.) 19:27
> Recipient:Bin.Cheng 
> Cc:bin.cheng ; GCC Patches 
> 
> Subject:Re: [PATCH PR90078]Capping comp_cost computation in ivopts
> 
> 
> On Wed, Apr 17, 2019 at 07:14:05PM +0800, Bin.Cheng wrote:
> > > As
> > > #define INFTY 1000
> > > what is the reason to keep the previous condition as well?
> > > I mean, if cost1.cost == INFTY or cost2.cost == INFTY,
> > > cost1.cost + cost2.cost >= INFTY too.
> > > Unless costs can go negative.
> > It's a bit complicated, but in general, costs can go negative.
> 
> Ok, no objections from me then (but as I don't know anything about it,
> not an ack either; you are ivopts maintainer, so you don't need one).

Hi,
The previous patch was reverted on GCC-9 because of PR90240.  PR90240 is now
fixed by another patch.  This is the updated patch for PR90078.  It promotes
the type of the ivopts cost from int to int64_t, and changes the behavior of
infinite_cost overflow from saturation to an assertion.
Please note, implicit conversions are kept in cost computation as before,
without introducing any narrowing.

Bootstrap/test on x86_64 along with PR90240 patch.  Is it OK?

Thanks,
bin
2019-05-05  Bin Cheng  

PR tree-optimization/90078
* tree-ssa-loop-ivopts.c (inttypes.h): Include new header file.
(INFTY): Increase the value for infinite cost.
(struct comp_cost): Promote type of members to int64_t.
(infinite_cost): Don't set complexity in initialization.
(comp_cost::operator +,-,+=,-=,/=,*=): Assert when cost computation
overflows to infinite_cost.
(adjust_setup_cost): Promote type of parameter and cost computation
to int64_t.
(struct ainc_cost_data, struct iv_ca): Promote type of member to
int64_t.
(get_scaled_computation_cost_at, determine_iv_cost): Promote type of
cost computation to int64_t.
(determine_group_iv_costs, iv_ca_dump, find_optimal_iv_set): Use
int64_t's format specifier in dump.

2019-05-05  Bin Cheng  

PR tree-optimization/90078
* g++.dg/tree-ssa/pr90078.C: New test.

0002-Fix-pr90078-by-promoting-type-of-IVOPTs-cost-to-int6.patch
Description: Binary data


  1   2   3   4   5   6   7   8   9   10   >