[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus

2020-02-13 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552

--- Comment #13 from Li Jia He  ---
In this optimization we assume n is either positive or divisible by the nth
power of 2, so that the result of the % is non-negative.  However, translating
(a % 32) to (a & 31) is not valid in general.  If a is a signed int with value
-1, then (a % 32) = (-1 % 32) = -1, whereas (a & 31) = (-1 & 31) = 31.  So this
conversion is not valid as it stands.
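
The counterexample above can be checked with a small C sketch.  The function
names mod32, mask31 and check_non_negative are hypothetical and only serve the
illustration; the sketch assumes a two's complement int and C99 (or later)
truncating division.

```c
#include <assert.h>

/* Remainder as C defines it: division truncates toward zero, so the
   result takes the sign of the dividend.  */
int mod32 (int a)
{
  return a % 32;
}

/* The proposed replacement: keep only the low five bits.  */
int mask31 (int a)
{
  return a & 31;
}

/* For non-negative values the two forms agree; assert it on a sample
   range.  For a == -1, mod32 gives -1 while mask31 gives 31.  */
void check_non_negative (void)
{
  for (int a = 0; a < 4096; a++)
    assert (mod32 (a) == mask31 (a));
}
```

So the rewrite looks safe only when the left operand of % is known to be
non-negative (or is unsigned).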

[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus

2020-02-10 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552

--- Comment #11 from Li Jia He  ---
The reason is that the divisor is a power of 2.  In x >> (n % 32), 32 is the
fifth power of 2.  The hexadecimal representation of 32 is 0x20.  Taking the
remainder modulo 0x20 gives a value in the range 0 ~ 0x1f (for non-negative n),
so the result is the same as x >> (n & 0x1f).

[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus

2020-02-10 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552

--- Comment #9 from Li Jia He  ---
(In reply to Andrew Pinski from comment #8)
> (In reply to Andrew Pinski from comment #7)
> > (In reply to Andrew Pinski from comment #6)
> > > (In reply to Li Jia He from comment #5)
> > > > Could we consider doing this optimization on gimple? I use the following
> > > > code on gimple to produce optimized results on powerpc64.
> > > 
> > > It might make sense.  But fold-const.c might not be the correct location;
> > > match.pd might be a better place for it.
> > 
> > Something like:
> > (simplify
> >  (rshift @0 (mod @1 integer_pow2p@2))
> >  (rshift @0 (bit_and @1 (minus @1 { build_int_cst (TREE_TYPE (@1), 1); }))))
> 
> Some typos:
> (simplify
>  (rshift @0 (mod @1 integer_pow2p@2))
>  (rshift @0 (bit_and @1 (minus @2 { build_int_cst (TREE_TYPE (@2), 1); }))))
> 
> This would be under the for:
> (for mod (ceil_mod floor_mod round_mod trunc_mod)

Thank you for your suggestions. Let's try it on gcc11 stage1 ^ _ ^.

[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus

2020-02-10 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552

Li Jia He  changed:

   What|Removed |Added

 CC||helijia at gcc dot gnu.org

--- Comment #5 from Li Jia He  ---
Could we consider doing this optimization on gimple? I use the following code
on gimple to produce optimized results on powerpc64.

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index aefa91666e2..a40681b271f 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -11131,7 +11131,6 @@ fold_binary_loc (location_t loc, enum tree_code code,
tree type,
   WARN_STRICT_OVERFLOW_MISC);
  return fold_convert_loc (loc, type, tem);
}
-
   return NULL_TREE;

 case CEIL_MOD_EXPR:
@@ -11191,6 +11190,22 @@ fold_binary_loc (location_t loc, enum tree_code code,
tree type,
 prec) == 0)
return fold_convert_loc (loc, type, TREE_OPERAND (arg0, 0));

+  if (code == RSHIFT_EXPR
+      && (TREE_CODE (arg1) == CEIL_MOD_EXPR
+          || TREE_CODE (arg1) == FLOOR_MOD_EXPR
+          || TREE_CODE (arg1) == ROUND_MOD_EXPR
+          || TREE_CODE (arg1) == TRUNC_MOD_EXPR)
+      && TREE_CODE (TREE_OPERAND (arg1, 1)) == INTEGER_CST
+      && integer_pow2p (TREE_OPERAND (arg1, 1)))
+    {
+      tree arg10 = TREE_OPERAND (arg1, 0);
+      tree arg11 = TREE_OPERAND (arg1, 1);
+      return fold_build2_loc (loc, code, type, arg0,
+               fold_build2_loc (loc, BIT_AND_EXPR, TREE_TYPE (arg10), arg10,
+                 fold_build2_loc (loc, MINUS_EXPR, TREE_TYPE (arg11), arg11,
+                   build_one_cst (TREE_TYPE (arg11)))));
+    }
+
   return NULL_TREE;

 case MIN_EXPR:

[Bug testsuite/92398] [10 regression] error in update of gcc.target/powerpc/pr72804.c in r277872

2019-12-09 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92398

Li Jia He  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||helijia at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #12 from Li Jia He  ---
Fixed on trunk together with r278918.  Closing the issue on behalf of Xiong Hu,
since his account cannot do so.

[Bug target/92098] [9 Regression] After r262333, the following code cannot be vectorized on powerpc64le.

2019-12-01 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92098

--- Comment #3 from Li Jia He  ---
Author: helijia
Date: Mon Dec  2 06:23:56 2019
New Revision: 278892

URL: https://gcc.gnu.org/viewcvs?rev=278892&root=gcc&view=rev
Log:
[rs6000]Fix PR92098 by backporting vec_cmp and vcond_mask supports to
gcc-9-branch

As PR92132 added vec_cmp_* and vcond_mask_* supports on trunk.  This is a
partial backport of vec_{cmp,cmpu} interface and related expand
to gcc-9-branch to fix PR92098.

gcc/ChangeLog

2019-12-02  Li Jia He  

Partial backport from mainline
PR target/92098
2019-11-08  Kewen Lin  

PR target/92132
* config/rs6000/predicates.md
(signed_or_equality_comparison_operator): New predicate.
(unsigned_or_equality_comparison_operator): Likewise.
* config/rs6000/rs6000.md (one_cmpl2): Remove expand.
(one_cmpl3_internal): Rename to one_cmpl2.
* config/rs6000/vector.md
(vcond_mask_ for VEC_I and VEC_I): New expand.
(vec_cmp for VEC_I and VEC_I): Likewise.
(vec_cmpu for VEC_I and VEC_I): Likewise.

gcc/testsuite/ChangeLog

2019-12-02  Li Jia He  

Partial backport from trunk
PR target/92098
2019-11-08  Kewen Lin  

PR target/92132
* gcc.target/powerpc/pr92132-fp-1.c: New test.
* gcc.target/powerpc/pr92132-fp-2.c: New test.


Added:
branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-1.c
branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-2.c
Modified:
branches/gcc-9-branch/gcc/ChangeLog
branches/gcc-9-branch/gcc/config/rs6000/predicates.md
branches/gcc-9-branch/gcc/config/rs6000/rs6000.md
branches/gcc-9-branch/gcc/config/rs6000/vector.md
branches/gcc-9-branch/gcc/testsuite/ChangeLog

[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067

2019-12-01 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132

--- Comment #7 from Li Jia He  ---
Author: helijia
Date: Mon Dec  2 06:23:56 2019
New Revision: 278892

URL: https://gcc.gnu.org/viewcvs?rev=278892&root=gcc&view=rev
Log:
[rs6000]Fix PR92098 by backporting vec_cmp and vcond_mask supports to
gcc-9-branch

As PR92132 added vec_cmp_* and vcond_mask_* supports on trunk.  This is a
partial backport of vec_{cmp,cmpu} interface and related expand
to gcc-9-branch to fix PR92098.

gcc/ChangeLog

2019-12-02  Li Jia He  

Partial backport from mainline
PR target/92098
2019-11-08  Kewen Lin  

PR target/92132
* config/rs6000/predicates.md
(signed_or_equality_comparison_operator): New predicate.
(unsigned_or_equality_comparison_operator): Likewise.
* config/rs6000/rs6000.md (one_cmpl2): Remove expand.
(one_cmpl3_internal): Rename to one_cmpl2.
* config/rs6000/vector.md
(vcond_mask_ for VEC_I and VEC_I): New expand.
(vec_cmp for VEC_I and VEC_I): Likewise.
(vec_cmpu for VEC_I and VEC_I): Likewise.

gcc/testsuite/ChangeLog

2019-12-02  Li Jia He  

Partial backport from trunk
PR target/92098
2019-11-08  Kewen Lin  

PR target/92132
* gcc.target/powerpc/pr92132-fp-1.c: New test.
* gcc.target/powerpc/pr92132-fp-2.c: New test.


Added:
branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-1.c
branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-2.c
Modified:
branches/gcc-9-branch/gcc/ChangeLog
branches/gcc-9-branch/gcc/config/rs6000/predicates.md
branches/gcc-9-branch/gcc/config/rs6000/rs6000.md
branches/gcc-9-branch/gcc/config/rs6000/vector.md
branches/gcc-9-branch/gcc/testsuite/ChangeLog

[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067

2019-11-14 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132

Li Jia He  changed:

   What|Removed |Added

 CC||helijia at gcc dot gnu.org

--- Comment #6 from Li Jia He  ---
*** Bug 92098 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/92098] [10 Regression] After r262333, the following code cannot be vectorized on powerpc64le.

2019-11-14 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92098

Li Jia He  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Li Jia He  ---
We can solve this issue by supporting full condition reduction vectorization,
which PR92132 adds for PowerPC.

*** This bug has been marked as a duplicate of bug 92132 ***

[Bug tree-optimization/92098] New: After r262333, the following code cannot be vectorized on powerpc64le.

2019-10-14 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92098

Bug ID: 92098
   Summary: After r262333, the following code cannot be vectorized
on powerpc64le.
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: helijia at gcc dot gnu.org
  Target Milestone: ---

Created attachment 47035
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47035&action=edit
dump file(Includes dump files that can be vectorized and not vectorized)

For the following code
---
#define NIL 0

typedef struct {
  unsigned int hash_size;
  unsigned short * head, * prev;
  unsigned int w_size;
} deflate_state;

void slide_hash(deflate_state *s)
{
    unsigned n, m;
    unsigned short *p;
    unsigned int wsize = s->w_size;

    n = s->hash_size;
    p = &s->head[n];
    do {
        m = *--p;
        *p = (unsigned short)(m >= wsize ? m - wsize : NIL);
    } while (--n);
}
---

The compile command I used is 
cc1 a.c -Ofast  -fdump-tree-vect-details-all -fdump-tree-slp-details-all

We found that r262333 prevents this loop from being vectorized, because:

a.c:20:5: note:   vect_is_simple_use: vectype vector(4) unsigned intD.4
a.c:20:5: note:   not vectorized: relevant stmt not supported: patt_37 =
wsize_12 <= m_16;
a.c:20:5: note:  bad operation or unsupported loop bound.

Before that commit, this code could be vectorized.

The attachment contains the files I dumped.

[Bug target/80834] PowerPC gcc -mcpu=power9 seems to turn off vectorization that -mcpu=power8 enables

2019-08-18 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80834

Li Jia He  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Li Jia He  ---
This issue has been resolved on the trunk.

[Bug target/80834] PowerPC gcc -mcpu=power9 seems to turn off vectorization that -mcpu=power8 enables

2019-08-18 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80834

Li Jia He  changed:

   What|Removed |Added

 CC||helijia at gcc dot gnu.org

--- Comment #4 from Li Jia He  ---
This issue may not be as complicated as described.  It may mainly depend on the
vect-cost-model values (rs6000_builtin_vectorization_cost).  The loop is
already vectorized on the current trunk (subversion id 274560).

If we use the code that Mike mentioned (subversion id 248266) with the compile
options
```
-mcpu=power9 -O3 -ffast-math -fdump-tree-vect-details-all
-fdump-tree-slp-details-all
```
We can see the following analysis of vect-cost-model
```
m_amatvec.c:114:5: note: density 96%, cost 87 exceeds threshold, penalizing
loop body cost by 10%m_amatvec.c:114:5: note: Cost model analysis:
  Vector inside of loop cost: 92
  Vector prologue cost: 5
  Vector epilogue cost: 36
  Scalar iteration cost: 36
  Scalar outside cost: 1
  Vector outside cost: 41
  prologue iterations: 0
  epilogue iterations: 1
m_amatvec.c:114:5: note: cost model: the vector iteration cost = 92 divided by
the scalar iteration cost = 36 is greater or equal to the vectorization factor
= 2.
m_amatvec.c:114:5: note: not vectorized: vectorization not profitable.
m_amatvec.c:114:5: note: not vectorized: vector version will never be
profitable.
```
Here the value of ‘Vector inside of loop cost’ is 92.  With integer division,
92 / 36 = 2, which is >= the vectorization factor of 2, so vect-cost-model
concludes that the vector version will never be profitable.

If we use the current trunk code (subversion id 274560) with the compile options
```
-mcpu=power9 -O3 -ffast-math -fdump-tree-vect-details-all
-fdump-tree-slp-details-all
```
We can see the following analysis of vect-cost-model
```
m_amatvec.c:114:5: note:  Cost model analysis:
  Vector inside of loop cost: 60
  Vector prologue cost: 5
  Vector epilogue cost: 36
  Scalar iteration cost: 36
  Scalar outside cost: 1
  Vector outside cost: 41
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
m_amatvec.c:114:5: note:Runtime profitability threshold = 2
m_amatvec.c:114:5: note:Static estimate profitability threshold = 2
```
At this point, the value of 'Vector inside of loop cost' is 60.  With integer
division, 60 / 36 = 1, which is < 2, so the cost model considers vectorization
profitable.
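
The comparison described in the two dumps can be sketched in a few lines of C.
This mirrors the arithmetic in this comment, not the vectorizer's actual
implementation; the function name never_profitable is hypothetical, and the
numbers are the ones printed in the dumps above.

```c
#include <assert.h>

/* Returns 1 when the check quoted in the dump fires: the vector
   iteration cost divided by the scalar iteration cost (integer
   division) is greater than or equal to the vectorization factor.  */
int never_profitable (int vec_inside_cost, int scalar_iter_cost, int vf)
{
  assert (scalar_iter_cost > 0);
  return vec_inside_cost / scalar_iter_cost >= vf;
}
```

With the old costs, never_profitable (92, 36, 2) is true; with the trunk costs,
never_profitable (60, 36, 2) is false.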

The change in the ‘Vector inside of loop cost’ value (92 - 60 = 32) consists of
2 parts:
  (1) The value of unaligned_store is reduced by ((3-1)*12)=24.
  (2) The rs6000_density_test penalty is reduced by 8.

The change in the unaligned_store part comes from the following patch.
```
commit 01cabe21e4ecae1e9c53fe12d7c0aa654143a3d2
Author: pthaugen 
Date:   Fri Oct 13 16:05:53 2017 +

* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
Remove
TARGET_P9_VECTOR code for unaligned_load case.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253731
138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index fefac6e0c95..00be94fe349 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2017-10-13  Pat Haugen  
+
+   * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Remove
+   TARGET_P9_VECTOR code for unaligned_load case.
+
 2017-10-13  Jan Hubicka  

* cfghooks.c (verify_flow_info): Check that edge probabilities are
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index e6e254ac041..b08cd316e68 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5419,9 +5419,6 @@ rs6000_builtin_vectorization_cost (enum
vect_cost_for_stmt type_of_cost,
 return 3;

   case unaligned_load:
-   if (TARGET_P9_VECTOR)
- return 3;
-
if (TARGET_EFFICIENT_UNALIGNED_VSX)
  return 1;

```
The analysis of the change in the rs6000_density_test part is as follows.
As the code below shows, the density penalty fixup **depends on** the vec_cost.
```
  if (density_pct > DENSITY_PCT_THRESHOLD
      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
    {
      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
      if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "density %d%%, cost %d exceeds threshold, penalizing "
                         "loop body cost by %d%%", density_pct,
                         vec_cost + not_vec_cost, DENSITY_PENALTY);
    }
```
With commit 253731, vec_cost is reduced by 24 as you mentioned, so
`vec_cost + not_vec_cost` is less than DENSITY_SIZE_THRESHOLD and no penalty is
applied.  (btw, not_vec_cost can be calculated as 3 from the previous dump.)
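
The arithmetic can be replayed with a minimal sketch.  The three threshold
values (85, 70, 10) and the density_pct formula are assumptions on my side,
since the snippet above does not show them; vec_cost = 84 is inferred from the
dump (cost 87 minus not_vec_cost 3).  The function name
body_cost_after_density is hypothetical.

```c
#include <assert.h>

/* Assumed values; the snippet quoted above does not state them.  */
enum
{
  DENSITY_PCT_THRESHOLD = 85,
  DENSITY_SIZE_THRESHOLD = 70,
  DENSITY_PENALTY = 10
};

/* Loop body cost after applying the density fixup quoted above.
   The density formula is an assumption for this illustration.  */
int body_cost_after_density (int vec_cost, int not_vec_cost)
{
  assert (vec_cost + not_vec_cost > 0);
  int density_pct = vec_cost * 100 / (vec_cost + not_vec_cost);
  if (density_pct > DENSITY_PCT_THRESHOLD
      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
    return vec_cost * (100 + DENSITY_PENALTY) / 100;
  return vec_cost;
}
```

Under these assumptions, body_cost_after_density (84, 3) gives 92 (density 96%,
size 87, penalized), and body_cost_after_density (60, 3) gives 60 (size 63, no
penalty), which matches the two dumps: 24 from the cost change plus 8 from the
dropped penalty.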

By the way, if we use this option -fvect-cost-model=unlimited, with the
‘unlimited’ model the vectorized code-path is assumed to b

[Bug middle-end/88784] Middle end is missing some optimizations about unsigned

2019-06-18 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784

--- Comment #27 from Li Jia He  ---
Created attachment 46495
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46495&action=edit
[v2] try to fix this issue in ifcombine(and_comparisons_1 and or_comparisons_1)

This patch is similar to the previous patch and tries to fix this issue in
ifcombine (and_comparisons_1 and or_comparisons_1).  Could you please take
another look and tell me what problems this patch might have?

[Bug middle-end/88784] Middle end is missing some optimizations about unsigned

2019-06-11 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784

--- Comment #25 from Li Jia He  ---
Indeed, this patch cannot catch all the variants that appear.

I found that the optimize_vec_cond_expr function in tree-ssa-reassoc.c calls
maybe_fold_and_comparisons and maybe_fold_or_comparisons, so this patch alone
can also handle the non-branchy cases without adding those patterns to
match.pd.

Indeed, if we add the corresponding patterns to match.pd, it would be better to
let ifcombine identify them.  I will try rewriting ifcombine to identify these
patterns.

[Bug middle-end/88784] Middle end is missing some optimizations about unsigned

2019-06-11 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784

--- Comment #23 from Li Jia He  ---
Created attachment 46477
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46477&action=edit
try to fix this issue in ifcombine(and_comparisons_1 and or_comparisons_1)

I am trying to solve this issue directly in ifcombine.  Could you please take a
look and tell me what problems this patch might have?

[Bug other/90381] New test case gcc.dg/tree-ssa/pr88676-2.c fails with its introduction in r270934

2019-05-08 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90381

--- Comment #3 from Li Jia He  ---
Author: helijia
Date: Wed May  8 07:52:26 2019
New Revision: 271002

URL: https://gcc.gnu.org/viewcvs?rev=271002&root=gcc&view=rev
Log:
PR other/90381
* gcc.dg/tree-ssa/pr88676-2.c: Add 'target le' option to limit the
test case to run on the little endian machine.

Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88676-2.c

[Bug other/90381] New test case gcc.dg/tree-ssa/pr88676-2.c fails with its introduction in r270934

2019-05-08 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90381

--- Comment #1 from Li Jia He  ---
Thanks for pointing this out.
I used the following code:

struct foo1 {
  int i:1;
};

int test1 (struct foo1 *x)
{
  if (x->i == 0)
    return 1;
  else if (x->i == 1)
    return 1;
  return 0;
}

to dump the output of the pass just before phiopt1.
On a be machine:

test1 (struct foo1 * x)
{
  unsigned char _1;
  int _3;
  signed char _6;

  <bb 2> :
  _1 = BIT_FIELD_REF <*x_5(D), 8, 0>;
  _6 = (signed char) _1;
  if (_6 >= 0)
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  // predicted unlikely by early return (on trees) predictor.

  <bb 4> :
  # _3 = PHI <1(3), 0(2)>
  return _3;

}

but, on le machine:

test1 (struct foo1 * x)
{
  unsigned char _1;
  unsigned char _2;
  int _3;

  <bb 2> :
  _1 = BIT_FIELD_REF <*x_5(D), 8, 0>;
  _2 = _1 & 1;
  if (_2 == 0)
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  // predicted unlikely by early return (on trees) predictor.

  <bb 4> :
  # _3 = PHI <1(3), 0(2)>
  return _3;

}
The difference is the comparison code in the if statement; however,
two_value_replacement only optimizes EQ_EXPR or NE_EXPR.
Can we limit this test case to le machines?  Thanks.

[Bug target/88100] no warning reported when value for vec_splat_{su}{8,16} would overflow

2019-02-28 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88100

Li Jia He  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||helijia at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #6 from Li Jia He  ---
This has been patched on trunk and GCC 8, so I am marking this issue as fixed.

[Bug target/88100] no warning reported when value for vec_splat_{su}{8,16} would overflow

2019-02-27 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88100

--- Comment #5 from Li Jia He  ---
Author: helijia
Date: Thu Feb 28 06:24:57 2019
New Revision: 269272

URL: https://gcc.gnu.org/viewcvs?rev=269272&root=gcc&view=rev
Log:
Backport from trunk
2019-02-20  Li Jia He  

PR target/88100
* gcc/config/rs6000/rs6000.c (rs6000_gimple_fold_builtin)
: Don't convert the operand before
range checking it.

* gcc/testsuite/gcc.target/powerpc/pr88100.c: New testcase.

Added:
branches/gcc-8-branch/gcc/testsuite/gcc.target/powerpc/pr88100.c
Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/config/rs6000/rs6000.c
branches/gcc-8-branch/gcc/testsuite/ChangeLog

[Bug target/88100] no warning reported when value for vec_splat_{su}{8,16} would overflow

2019-02-19 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88100

--- Comment #4 from Li Jia He  ---
Author: helijia
Date: Wed Feb 20 02:35:39 2019
New Revision: 269033

URL: https://gcc.gnu.org/viewcvs?rev=269033&root=gcc&view=rev
Log:
[rs6000] fix PR 88100, range check for vec_splat_{su}{8,16,32}

GCC revision 259524 implemented range check for the vec_splat_{su}{8,16,32}
builtins.  However, as a consequence of the implementation, the range check
is not done correctly for the expected vspltis[bhw] instructions.  The result
is that we may not get a valid error message if the valid range of the data
is exceeded.

Although the input of the function prototype of vec_splat_{su}{8,16,32} is
const int, the actual data usage range is limited to the data range of 5 bits
signed.  We should limit the int_cst.val[0] data to the 5 bit signed data range
without any modification in the input arg0 parameter.  However, the sext_hwi
function intercepts the data of TREE_INT_CST_LOW (arg0) as size bits in the
sext_hwi (TREE_INT_CST_LOW (arg0), size) statement.  This will cause some of
the excess data to fall within the range of 5 bits signed, so that the correct
diagnostic information cannot be generated, we need to remove the sext_hwi to
ensure that the input data has not been modified.

This patch fix range check for the vec_splat_s[8,16,32] builtins.  The argument
must be a 5-bit const int as specified for the vspltis[bhw] instructions.
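
The effect described in the log can be illustrated with a small sketch.  The
helper sext_low_bits is a hypothetical stand-in for a sign extension of the low
`size` bits (it is not the GCC function itself), and size = 8 in the usage note
is an assumed value chosen only for the illustration.

```c
#include <assert.h>

/* Hypothetical stand-in: keep the low BITS bits of VALUE and
   sign-extend them.  */
long long sext_low_bits (long long value, unsigned bits)
{
  assert (bits >= 1 && bits < 64);
  unsigned long long sign = 1ULL << (bits - 1);
  unsigned long long low = (unsigned long long) value & ((1ULL << bits) - 1);
  return (long long) (low ^ sign) - (long long) sign;
}

/* The 5-bit signed range accepted by the vspltis[bhw] instructions.  */
int in_5bit_signed_range (long long v)
{
  return v >= -16 && v <= 15;
}

/* Range check performed on the truncated value (the old behaviour
   as described in the log).  */
int check_after_sext (long long arg, unsigned size)
{
  return in_5bit_signed_range (sext_low_bits (arg, size));
}

/* Range check performed on the unmodified argument.  */
int check_unmodified (long long arg)
{
  return in_5bit_signed_range (arg);
}
```

For example, with size = 8 the argument 261 is truncated to 5, so
check_after_sext (261, 8) accepts it, while check_unmodified (261) rejects it
and lets the diagnostic be issued.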

for gcc/ChangeLog

PR target/88100
* gcc/config/rs6000/rs6000.c (rs6000_gimple_fold_builtin)
: Don't convert the operand before
range checking it.

for gcc/testsuite/ChangeLog

PR target/88100
* gcc/testsuite/gcc.target/powerpc/pr88100.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.target/powerpc/pr88100.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/rs6000/rs6000.c
trunk/gcc/testsuite/ChangeLog

[Bug middle-end/88784] New: Middle end is missing some optimizations about unsigned

2019-01-09 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784

Bug ID: 88784
   Summary: Middle end is missing some optimizations about
unsigned
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: helijia at gcc dot gnu.org
  Target Milestone: ---

When both operands are unsigned, the following optimizations are valid but
missing:
1. X > Y && X != 0 --> X > Y
2. X > Y || X != 0 --> X != 0
3. X <= Y || X != 0 --> true
4. X <= Y || X == 0 --> X <= Y
5. X > Y && X == 0 --> false

unsigned foo(unsigned x, unsigned y) { return x > y && x != 0; }
should fold to x > y, but I found that this is not done at present.
I compiled the code with the following command:
g++ unsigned.cpp -Ofast -c -S -o unsigned.s -fdump-tree-all
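
The five identities listed above can be sanity-checked with a short C sketch.
The function names folds_hold and check_samples are hypothetical, and the
sample values are arbitrary; this is only a spot check, not a proof.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if all five rewrites listed above give the same result
   as the original expression for the unsigned pair (x, y).  */
int folds_hold (unsigned x, unsigned y)
{
  return ((x > y && x != 0) == (x > y))
         && ((x > y || x != 0) == (x != 0))
         && ((x <= y || x != 0) == 1)
         && ((x <= y || x == 0) == (x <= y))
         && ((x > y && x == 0) == 0);
}

/* Spot-check the identities on a few values, including the edges of
   the unsigned range.  */
void check_samples (void)
{
  static const unsigned sample[] = { 0u, 1u, 2u, 7u, UINT_MAX - 1u, UINT_MAX };
  const int n = (int) (sizeof sample / sizeof sample[0]);
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      assert (folds_hold (sample[i], sample[j]));
}
```

All five rely on the same fact: for unsigned operands, x > y implies x != 0,
and x == 0 implies x <= y.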

[Bug tree-optimization/88767] 'unroll and jam' not optimizing some loops

2019-01-09 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767

Li Jia He  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #9 from Li Jia He  ---
(In reply to Richard Biener from comment #1)
> What's the room for improvement?  Why's unrolling the innermost loop not
> profitable?

Hi Richard, I want to achieve the effect of the following code:
__attribute__((noinline)) void calculate(const double* __restrict__ A, const
double* __restrict__ B, double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n += 3 ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) { // loop 2
      C[(l_n*10)+l_m] = 0.0;
      C[(l_n*10)+l_m+10] = 0.0;
      C[(l_n*10)+l_m+20] = 0.0;
    }

    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m+10] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+20];
        C[(l_n*10)+l_m+20] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+40];
      }
    }
  }
}

#define SIZE 36
double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main()
{
  long r, i, j;

  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }

  for (r=0; r < 100; r++) {
    calculate(&A[0][0], &B[0][0], &C[0][0]);
  }

  return 0;
}
In the original code, the cunrolli pass completely unrolls loop 2 and loop 4,
leaving unroll-and-jam no chance to apply.  In my tests, the performance
ranking is: expected code > cunrolli enabled > cunrolli disabled.
Sorry for the late reply.
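
As a sanity check that the expected code computes the same result as the
original loop nest, here is a minimal self-contained sketch.  The names
calc_original, calc_jammed and versions_agree, the flat array sizes, and the
small integer test data are all assumptions made for the illustration; the
loop bounds and index expressions follow the code above.

```c
#include <assert.h>
#include <string.h>

enum { N = 9, M = 10, K = 17, LD = 20 };

/* Original loop nest from the report.  */
void calc_original (const double *A, const double *B, double *C)
{
  for (int n = 0; n < N; n++)
    {
      for (int m = 0; m < M; m++)
        C[n * M + m] = 0.0;
      for (int k = 0; k < K; k++)
        for (int m = 0; m < M; m++)
          C[n * M + m] += A[k * LD + m] * B[n * LD + k];
    }
}

/* Outer loop unrolled by 3 with the three copies jammed together.  */
void calc_jammed (const double *A, const double *B, double *C)
{
  assert (N % 3 == 0);
  for (int n = 0; n < N; n += 3)
    {
      for (int m = 0; m < M; m++)
        {
          C[n * M + m] = 0.0;
          C[n * M + m + M] = 0.0;
          C[n * M + m + 2 * M] = 0.0;
        }
      for (int k = 0; k < K; k++)
        for (int m = 0; m < M; m++)
          {
            C[n * M + m] += A[k * LD + m] * B[n * LD + k];
            C[n * M + m + M] += A[k * LD + m] * B[n * LD + k + LD];
            C[n * M + m + 2 * M] += A[k * LD + m] * B[n * LD + k + 2 * LD];
          }
    }
}

/* Runs both versions on small integer-valued data (so every sum is
   exact) and returns 1 if the outputs are bitwise identical.  */
int versions_agree (void)
{
  static double A[K * LD], B[N * LD], C1[N * M], C2[N * M];
  for (int i = 0; i < K * LD; i++)
    A[i] = i % 7;
  for (int i = 0; i < N * LD; i++)
    B[i] = i % 5;
  calc_original (A, B, C1);
  calc_jammed (A, B, C2);
  return memcmp (C1, C2, sizeof C1) == 0;
}
```

Each element of C is still accumulated over l_k in the same order, so the
jammed version only changes how the work is interleaved, not the values.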

[Bug tree-optimization/88767] New: 'unroll and jam' not optimizing some loops

2019-01-09 Thread helijia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767

Bug ID: 88767
   Summary: 'unroll and jam' not optimizing some loops
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: helijia at gcc dot gnu.org
  Target Milestone: ---

The test source is as follows:
__attribute__((noinline)) void calculate(const double* __restrict__ A, const
double* __restrict__ B, double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n++ ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; } // loop 2

    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }
}

#define SIZE 36
double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main()
{
  long r, i, j;

  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }

  for (r=0; r < 100; r++) {
    calculate(&A[0][0], &B[0][0], &C[0][0]);
  }

  return 0;
}

First, I compiled the test case with the following command:
g++ unroll_jam_bug.cpp -O3  -funroll-loops -floop-unroll-and-jam -o unroll_jam_bug
-fdump-tree-unrolljam-details
In the generated file unroll_jam_bug.cpp.143t.unrolljam, I found that no unroll
and jam optimization was applied to the loops in the calculate function.

Second, I added the -fdump-tree-all option to the command line.  I found that
the innermost loops (loop 3 and 4) are completely unrolled because the
pass_data_complete_unrolli pass considers them small.  Once the inner loops are
fully unrolled, the enclosing loop becomes large.  Before unrolling, the
pass_loop_jam pass checks whether unroll_factor * (number of loop instructions)
> 200.  If so, the optimization is abandoned; otherwise it proceeds.

Based on the second analysis, I tried disabling the cunrolli optimization with
the following command line:
g++ unroll_jam_bug.cpp -O3 -mcpu=power8
-fdisable-tree-cunrolli -floop-unroll-and-jam -o unroll_jam_bug
-fdump-tree-unrolljam-details
With this command, the unroll and jam optimization is performed, but there
still seems to be room for improvement.

Original code:
for ( l_n = 0; l_n < 9; l_n++ ) {
  for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }

  for ( l_k = 0; l_k < 17; l_k++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
    }
  }
}
After unroll and jam pass:
for ( l_n = 0; l_n < 9; l_n++ ) {
  for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }

  for ( l_k = 0; l_k < 17; l_k += 2 ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      C[(l_n*10)+l_m] += A[(l_k*20 + 20)+l_m] * B[(l_n*20)+l_k + 1];
    }
  }
}
  }