[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552 --- Comment #13 from Li Jia He --- In this optimization we assume that n is either positive or divisible by the power of 2 in question, so that the result of the % is non-negative. In general, however, it is not valid to translate (a % 32) into (a & 31). If a is a signed int with the value -1, then (a % 32) = (-1 % 32) = -1, whereas (a & 31) = (-1 & 31) = 31. So this conversion is not valid as it stands.
[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552 --- Comment #11 from Li Jia He --- The reason is that the divisor is a power of 2. In x >> (n % 32), 32 is the fifth power of 2. The hexadecimal representation of 32 is 0x20. Taking the remainder modulo 0x20, the data range is 0 ~ 0x1f, and the result is the same as x >> (n & 0x1f).
[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552 --- Comment #9 from Li Jia He --- (In reply to Andrew Pinski from comment #8) > (In reply to Andrew Pinski from comment #7) > > (In reply to Andrew Pinski from comment #6) > > > (In reply to Li Jia He from comment #5) > > > > Could we consider doing this optimization on gimple? I use the following > > > > code on gimple to produce optimized results on powerpc64. > > > > > > It might make sense. But fold-const.c might not be the correct location; > > > match.pd might be a better place for it. > > > > Something like: > > (simplify > > (rshift @0 (mod @1 integer_pow2p@2)) > > (rshift @0 (bit_and @1 (minus @1 { build_int_cst (TREE_TYPE (@1), 1); } > > Some typos: > (simplify > (rshift @0 (mod @1 integer_pow2p@2)) > (rshift @0 (bit_and @1 (minus @2 { build_int_cst (TREE_TYPE (@2), 1); } > > This would be under the for: > (for mod (ceil_mod floor_mod round_mod trunc_mod) Thank you for your suggestions. Let's try it on gcc11 stage1 ^ _ ^.
[Bug rtl-optimization/66552] Missed optimization when shift amount is result of signed modulus
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66552 Li Jia He changed: What|Removed |Added CC||helijia at gcc dot gnu.org --- Comment #5 from Li Jia He --- Could we consider doing this optimization on gimple? I use the following code on gimple to produce optimized results on powerpc64. diff --git a/gcc/fold-const.c b/gcc/fold-const.c index aefa91666e2..a40681b271f 100644 --- a/gcc/fold-const.c +++ b/gcc/fold-const.c @@ -11131,7 +11131,6 @@ fold_binary_loc (location_t loc, enum tree_code code, tree type, WARN_STRICT_OVERFLOW_MISC); return fold_convert_loc (loc, type, tem); } - return NULL_TREE; case CEIL_MOD_EXPR: @@ -11191,6 +11190,22 @@ fold_binary_loc (location_t loc, enum tree_code code, tree type, prec) == 0) return fold_convert_loc (loc, type, TREE_OPERAND (arg0, 0)); + if (code == RSHIFT_EXPR + && (TREE_CODE (arg1) == CEIL_MOD_EXPR + || TREE_CODE (arg1) == FLOOR_MOD_EXPR + || TREE_CODE (arg1) == ROUND_MOD_EXPR + || TREE_CODE (arg1) == TRUNC_MOD_EXPR) + && TREE_CODE (TREE_OPERAND (arg1, 1)) == INTEGER_CST + && integer_pow2p (TREE_OPERAND (arg1, 1))) +{ + tree arg10 = TREE_OPERAND (arg1, 0); + tree arg11 = TREE_OPERAND (arg1, 1); + return fold_build2_loc (loc, code, type, arg0, + fold_build2_loc (loc, BIT_AND_EXPR, TREE_TYPE(arg10), arg10, +fold_build2_loc (loc, MINUS_EXPR, TREE_TYPE(arg11), arg11, + build_one_cst(TREE_TYPE(arg11); +} + return NULL_TREE; case MIN_EXPR:
[Bug testsuite/92398] [10 regression] error in update of gcc.target/powerpc/pr72804.c in r277872
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92398 Li Jia He changed: What|Removed |Added Status|NEW |RESOLVED CC||helijia at gcc dot gnu.org Resolution|--- |FIXED --- Comment #12 from Li Jia He --- Fixed on trunk together with r278918. Closing the issue on behalf of Xiong Hu, since his account cannot do so.
[Bug target/92098] [9 Regression] After r262333, the following code cannot be vectorized on powerpc64le.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92098 --- Comment #3 from Li Jia He --- Author: helijia Date: Mon Dec 2 06:23:56 2019 New Revision: 278892 URL: https://gcc.gnu.org/viewcvs?rev=278892=gcc=rev Log: [rs6000]Fix PR92098 by backporting vec_cmp and vcond_mask supports to gcc-9-branch As PR92132 added vec_cmp_* and vcond_mask_* supports on trunk. This is a partial backport of vec_{cmp,cmpu} interface and related expand to gcc-9-branch to fix PR92098. gcc/ChangeLog 2019-12-02 Li Jia He Partial backport from mainline PR target/92098 2019-11-08 Kewen Lin PR target/92132 * config/rs6000/predicates.md (signed_or_equality_comparison_operator): New predicate. (unsigned_or_equality_comparison_operator): Likewise. * config/rs6000/rs6000.md (one_cmpl2): Remove expand. (one_cmpl3_internal): Rename to one_cmpl2. * config/rs6000/vector.md (vcond_mask_ for VEC_I and VEC_I): New expand. (vec_cmp for VEC_I and VEC_I): Likewise. (vec_cmpu for VEC_I and VEC_I): Likewise. gcc/testsuite/ChangeLog 2019-12-02 Li Jia He Partial backport from trunk PR target/92098 2019-11-08 Kewen Lin PR target/92132 * gcc.target/powerpc/pr92132-fp-1.c: New test. * gcc.target/powerpc/pr92132-fp-2.c: New test. Added: branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-1.c branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-2.c Modified: branches/gcc-9-branch/gcc/ChangeLog branches/gcc-9-branch/gcc/config/rs6000/predicates.md branches/gcc-9-branch/gcc/config/rs6000/rs6000.md branches/gcc-9-branch/gcc/config/rs6000/vector.md branches/gcc-9-branch/gcc/testsuite/ChangeLog
[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132 --- Comment #7 from Li Jia He --- Author: helijia Date: Mon Dec 2 06:23:56 2019 New Revision: 278892 URL: https://gcc.gnu.org/viewcvs?rev=278892=gcc=rev Log: [rs6000]Fix PR92098 by backporting vec_cmp and vcond_mask supports to gcc-9-branch As PR92132 added vec_cmp_* and vcond_mask_* supports on trunk. This is a partial backport of vec_{cmp,cmpu} interface and related expand to gcc-9-branch to fix PR92098. gcc/ChangeLog 2019-12-02 Li Jia He Partial backport from mainline PR target/92098 2019-11-08 Kewen Lin PR target/92132 * config/rs6000/predicates.md (signed_or_equality_comparison_operator): New predicate. (unsigned_or_equality_comparison_operator): Likewise. * config/rs6000/rs6000.md (one_cmpl2): Remove expand. (one_cmpl3_internal): Rename to one_cmpl2. * config/rs6000/vector.md (vcond_mask_ for VEC_I and VEC_I): New expand. (vec_cmp for VEC_I and VEC_I): Likewise. (vec_cmpu for VEC_I and VEC_I): Likewise. gcc/testsuite/ChangeLog 2019-12-02 Li Jia He Partial backport from trunk PR target/92098 2019-11-08 Kewen Lin PR target/92132 * gcc.target/powerpc/pr92132-fp-1.c: New test. * gcc.target/powerpc/pr92132-fp-2.c: New test. Added: branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-1.c branches/gcc-9-branch/gcc/testsuite/gcc.target/powerpc/pr92098-int-2.c Modified: branches/gcc-9-branch/gcc/ChangeLog branches/gcc-9-branch/gcc/config/rs6000/predicates.md branches/gcc-9-branch/gcc/config/rs6000/rs6000.md branches/gcc-9-branch/gcc/config/rs6000/vector.md branches/gcc-9-branch/gcc/testsuite/ChangeLog
[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132 Li Jia He changed: What|Removed |Added CC||helijia at gcc dot gnu.org --- Comment #6 from Li Jia He --- *** Bug 92098 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/92098] [10 Regression] After r262333, the following code cannot be vectorized on powerpc64le.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92098 Li Jia He changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Li Jia He --- We can solve this issue by supporting full condition reduction vectorization, and on PowerPC full condition reduction vectorization is supported by PR92132. *** This bug has been marked as a duplicate of bug 92132 ***
[Bug tree-optimization/92098] New: After r262333, the following code cannot be vectorized on powerpc64le.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92098
Bug ID: 92098
Summary: After r262333, the following code cannot be vectorized on powerpc64le.
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: helijia at gcc dot gnu.org
Target Milestone: ---

Created attachment 47035 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47035=edit dump file (includes dump files for both the vectorized and the non-vectorized case)

For the following code
---
#define NIL 0
typedef struct {
    unsigned int hash_size;
    unsigned short * head, * prev;
    unsigned int w_size;
} deflate_state;

void slide_hash(deflate_state *s)
{
    unsigned n, m;
    unsigned short *p;
    unsigned int wsize = s->w_size;

    n = s->hash_size;
    p = &s->head[n];
    do {
        m = *--p;
        *p = (unsigned short)(m >= wsize ? m - wsize : NIL);
    } while (--n);
}
---
the compile command I used is
cc1 a.c -Ofast -fdump-tree-vect-details-all -fdump-tree-slp-details-all
We found that after r262333 this loop can no longer be vectorized, because:
a.c:20:5: note: vect_is_simple_use: vectype vector(4) unsigned intD.4
a.c:20:5: note: not vectorized: relevant stmt not supported: patt_37 = wsize_12 <= m_16;
a.c:20:5: note: bad operation or unsupported loop bound.
Before that commit this code could be vectorized. The attachment contains the files I dumped.
[Bug target/80834] PowerPC gcc -mcpu=power9 seems to turn off vectorization that -mcpu=power8 enables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80834 Li Jia He changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #5 from Li Jia He --- This issue has been resolved on the trunk.
[Bug target/80834] PowerPC gcc -mcpu=power9 seems to turn off vectorization that -mcpu=power8 enables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80834 Li Jia He changed: What|Removed |Added CC||helijia at gcc dot gnu.org --- Comment #4 from Li Jia He --- This question may not be as complicated as described. It may mainly be related to the vect-cost-model values set in rs6000_builtin_vectorization_cost, and the loop is already vectorized on the current trunk (subversion id 274560). If we use the code that Mike mentioned (subversion id 248266) with the compile options
```
-mcpu=power9 -O3 -ffast-math -fdump-tree-vect-details-all -fdump-tree-slp-details-all
```
we can see the following analysis from vect-cost-model
```
m_amatvec.c:114:5: note: density 96%, cost 87 exceeds threshold, penalizing loop body cost by 10%
m_amatvec.c:114:5: note: Cost model analysis:
  Vector inside of loop cost: 92
  Vector prologue cost: 5
  Vector epilogue cost: 36
  Scalar iteration cost: 36
  Scalar outside cost: 1
  Vector outside cost: 41
  prologue iterations: 0
  epilogue iterations: 1
m_amatvec.c:114:5: note: cost model: the vector iteration cost = 92 divided by the scalar iteration cost = 36 is greater or equal to the vectorization factor = 2.
m_amatvec.c:114:5: note: not vectorized: vectorization not profitable.
m_amatvec.c:114:5: note: not vectorized: vector version will never be profitable.
```
The value of 'Vector inside of loop cost' is 92; since (92 / 36 = 2) >= 2 in integer arithmetic, vect-cost-model concludes that the vector version will never be profitable.
If we use the current trunk code (subversion id 274560) with the compile options
```
-mcpu=power9 -O3 -ffast-math -fdump-tree-vect-details-all -fdump-tree-slp-details-all
```
we can see the following analysis from vect-cost-model
```
m_amatvec.c:114:5: note: Cost model analysis:
  Vector inside of loop cost: 60
  Vector prologue cost: 5
  Vector epilogue cost: 36
  Scalar iteration cost: 36
  Scalar outside cost: 1
  Vector outside cost: 41
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
m_amatvec.c:114:5: note:Runtime profitability threshold = 2
m_amatvec.c:114:5: note:Static estimate profitability threshold = 2
```
Here the value of 'Vector inside of loop cost' is 60. Since (60 / 36 = 1) < 2, vectorization is now considered profitable. The change in 'Vector inside of loop cost' from 92 to 60 consists of 2 parts: (1) the cost of unaligned_load is reduced by ((3-1)*12)=24; (2) the rs6000_density_test penalty of 8 no longer applies. The change in the unaligned_load cost comes from the following patch.
```
commit 01cabe21e4ecae1e9c53fe12d7c0aa654143a3d2
Author: pthaugen
Date: Fri Oct 13 16:05:53 2017 +

    * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Remove
    TARGET_P9_VECTOR code for unaligned_load case.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253731 138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index fefac6e0c95..00be94fe349 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2017-10-13 Pat Haugen
+
+	* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Remove
+	TARGET_P9_VECTOR code for unaligned_load case.
+
 2017-10-13 Jan Hubicka

 	* cfghooks.c (verify_flow_info): Check that edge probabilities are
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index e6e254ac041..b08cd316e68 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5419,9 +5419,6 @@ rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
         return 3;

       case unaligned_load:
-	if (TARGET_P9_VECTOR)
-	  return 3;
-
 	if (TARGET_EFFICIENT_UNALIGNED_VSX)
 	  return 1;
```
The analysis of the change in the rs6000_density_test part is as follows. As the code below shows, the density penalty fixup **depends on** the vec_cost.
```
if (density_pct > DENSITY_PCT_THRESHOLD
    && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
  {
    data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
    if (dump_enabled_p ())
      dump_printf_loc (MSG_NOTE, vect_location,
                       "density %d%%, cost %d exceeds threshold, penalizing "
                       "loop body cost by %d%%", density_pct,
                       vec_cost + not_vec_cost, DENSITY_PENALTY);
  }
```
With commit 253731, the vec_cost is reduced by 24 as you mentioned, so `vec_cost + not_vec_cost` is less than DENSITY_SIZE_THRESHOLD and no penalty is applied. (btw, not_vec_cost can be calculated as 3 from the previous dump.) By the way, if we use this option -fvect-cost-model=unlimited, with the ‘unlimited’ model the vectorized code-path is assumed to b
[Bug middle-end/88784] Middle end is missing some optimizations about unsigned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784 --- Comment #27 from Li Jia He --- Created attachment 46495 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46495=edit [v2] try to fix this issue in ifcombine(and_comparisons_1 and or_comparisons_1) This patch is similar to the previous one and tries to fix this issue in ifcombine (and_comparisons_1 and or_comparisons_1). Could you please take another look and tell me what problems this patch might have?
[Bug middle-end/88784] Middle end is missing some optimizations about unsigned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784 --- Comment #25 from Li Jia He --- Indeed, this patch cannot catch all variants that appear. I found that the optimize_vec_cond_expr function in tree-ssa-reassoc.c calls maybe_fold_and_comparisons and maybe_fold_or_comparisons, so this patch alone can also handle the non-branchy cases without adding those patterns to match.pd. If we do add the corresponding patterns to match.pd, it would be better to let ifcombine identify them. I will try rewriting ifcombine to identify these patterns.
[Bug middle-end/88784] Middle end is missing some optimizations about unsigned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784 --- Comment #23 from Li Jia He --- Created attachment 46477 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46477=edit try to fix this issue in ifcombine(and_comparisons_1 and or_comparisons_1) I am trying to solve this issue directly in ifcombine. Could you please take a look and tell me what problems this patch might have?
[Bug other/90381] New test case gcc.dg/tree-ssa/pr88676-2.c fails with its introduction in r270934
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90381 --- Comment #3 from Li Jia He --- Author: helijia Date: Wed May 8 07:52:26 2019 New Revision: 271002 URL: https://gcc.gnu.org/viewcvs?rev=271002=gcc=rev Log: PR other/90381 * gcc.dg/tree-ssa/pr88676-2.c: Add 'target le' option to limit the test case to run on the little endian machine. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88676-2.c
[Bug other/90381] New test case gcc.dg/tree-ssa/pr88676-2.c fails with its introduction in r270934
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90381 --- Comment #1 from Li Jia He --- Thanks for pointing this out. I used the following code: struct foo1 { int i:1; }; int test1 (struct foo1 *x) { if (x->i == 0) return 1; else if (x->i == 1) return 1; return 0; } to dump the output of the pass in front of phiopt1 on a be machine: test1 (struct foo1 * x) { unsigned char _1; int _3; signed char _6; : _1 = BIT_FIELD_REF <*x_5(D), 8, 0>; _6 = (signed char) _1; if (_6 >= 0) goto ; [INV] else goto ; [INV] : // predicted unlikely by early return (on trees) predictor. : # _3 = PHI <1(3), 0(2)> return _3; } but on a le machine: test1 (struct foo1 * x) { unsigned char _1; unsigned char _2; int _3; : _1 = BIT_FIELD_REF <*x_5(D), 8, 0>; _2 = _1 & 1; if (_2 == 0) goto ; [INV] else goto ; [INV] : // predicted unlikely by early return (on trees) predictor. : # _3 = PHI <1(3), 0(2)> return _3; } The difference is the comparison code in the if statement: be uses >=, le uses ==, and two_value_replacement only optimizes EQ_EXPR or NE_EXPR. Can we limit this test case to le machines? Thanks.
[Bug target/88100] no warning reported when value for vec_splat_{su}{8,16} would overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88100 Li Jia He changed: What|Removed |Added Status|NEW |RESOLVED CC||helijia at gcc dot gnu.org Resolution|--- |FIXED --- Comment #6 from Li Jia He --- It has been patched on trunk and on the gcc 8 branch, so I am marking this issue as fixed.
[Bug target/88100] no warning reported when value for vec_splat_{su}{8,16} would overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88100 --- Comment #5 from Li Jia He --- Author: helijia Date: Thu Feb 28 06:24:57 2019 New Revision: 269272 URL: https://gcc.gnu.org/viewcvs?rev=269272=gcc=rev Log: Backport from trunk 2019-02-20 Li Jia He PR target/88100 * gcc/config/rs6000/rs6000.c (rs6000_gimple_fold_builtin) : Don't convert the operand before range checking it. * gcc/testsuite/gcc.target/powerpc/pr88100.c: New testcase. Added: branches/gcc-8-branch/gcc/testsuite/gcc.target/powerpc/pr88100.c Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/config/rs6000/rs6000.c branches/gcc-8-branch/gcc/testsuite/ChangeLog
[Bug target/88100] no warning reported when value for vec_splat_{su}{8,16} would overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88100 --- Comment #4 from Li Jia He --- Author: helijia Date: Wed Feb 20 02:35:39 2019 New Revision: 269033 URL: https://gcc.gnu.org/viewcvs?rev=269033=gcc=rev Log: [rs6000] fix PR 88100, range check for vec_splat_{su}{8,16,32} GCC revision 259524 implemented a range check for the vec_splat_{su}{8,16,32} builtins. However, as a consequence of the implementation, the range check is not done correctly for the expected vspltis[bhw] instructions. The result is that we may not get an error message when the valid range of the data is exceeded. Although the function prototype of vec_splat_{su}{8,16,32} takes a const int, the usable range is limited to that of a 5-bit signed value. We should check int_cst.val[0] against the 5-bit signed range without modifying the input arg0 parameter. However, the sext_hwi (TREE_INT_CST_LOW (arg0), size) statement truncates TREE_INT_CST_LOW (arg0) to size bits. This causes some out-of-range values to fall within the 5-bit signed range, so the correct diagnostic cannot be generated. We need to remove the sext_hwi to ensure that the input data is not modified. This patch fixes the range check for the vec_splat_s[8,16,32] builtins. The argument must be a 5-bit const int as specified for the vspltis[bhw] instructions. for gcc/ChangeLog PR target/88100 * gcc/config/rs6000/rs6000.c (rs6000_gimple_fold_builtin) : Don't convert the operand before range checking it. for gcc/testsuite/ChangeLog PR target/88100 * gcc/testsuite/gcc.target/powerpc/pr88100.c: New testcase. Added: trunk/gcc/testsuite/gcc.target/powerpc/pr88100.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/rs6000/rs6000.c trunk/gcc/testsuite/ChangeLog
[Bug middle-end/88784] New: Middle end is missing some optimizations about unsigned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88784
Bug ID: 88784
Summary: Middle end is missing some optimizations about unsigned
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: helijia at gcc dot gnu.org
Target Milestone: ---

When both operands are unsigned, the following optimizations are valid but missing:
1. X > Y && X != 0 --> X > Y
2. X > Y || X != 0 --> X != 0
3. X <= Y || X != 0 --> true
4. X <= Y || X == 0 --> X <= Y
5. X > Y && X == 0 --> false

For example,
unsigned foo(unsigned x, unsigned y) { return x > y && x != 0; }
should fold to x > y, but I found that we do not do this at the moment. I compiled the code with the following command:
g++ unsigned.cpp -Ofast -c -S -o unsigned.s -fdump-tree-all
[Bug tree-optimization/88767] 'unroll and jam' not optimizing some loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767 Li Jia He changed: What|Removed |Added CC||amker at gcc dot gnu.org --- Comment #9 from Li Jia He --- (In reply to Richard Biener from comment #1) > What's the room for improvement? Why's unrolling the innermost loop not > profitable? Hi Richard, I want to achieve the effect of the following code:

__attribute__((noinline))
void calculate(const double* __restrict__ A, const double* __restrict__ B, double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n += 3 ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) { // loop 2
      C[(l_n*10)+l_m] = 0.0;
      C[(l_n*10)+l_m+10] = 0.0;
      C[(l_n*10)+l_m+20] = 0.0;
    }
    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m+10] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+20];
        C[(l_n*10)+l_m+20] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+40];
      }
    }
  }
}

#define SIZE 36
double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main() {
  long r, i, j;
  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }
  for (r=0; r < 100; r++) {
    calculate(&A[0][0], &B[0][0], &C[0][0]);
  }
  return 0;
}

In the original code, the cunrolli pass completely unrolls loop 2 and loop 4, so unroll-and-jam has no chance to apply. In my tests, the performance ordering is: the expected code above > cunrolli enabled > cunrolli disabled. Sorry for not responding in time.
[Bug tree-optimization/88767] New: 'unroll and jam' not optimizing some loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767
Bug ID: 88767
Summary: 'unroll and jam' not optimizing some loops
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: helijia at gcc dot gnu.org
Target Milestone: ---

The test source is as follows:

__attribute__((noinline))
void calculate(const double* __restrict__ A, const double* __restrict__ B, double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n++ ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; } // loop 2
    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }
}

#define SIZE 36
double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main() {
  long r, i, j;
  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }
  for (r=0; r < 100; r++) {
    calculate(&A[0][0], &B[0][0], &C[0][0]);
  }
  return 0;
}

First, I compiled the test case with the following command:
g++ unroll_jam_bug.cpp -O3 -funroll-loops -floop-unroll-and-jam -o unroll_jam_bug -fdump-tree-unrolljam-details
In the generated file unroll_jam_bug.cpp.143t.unrolljam, I found that no unroll-and-jam optimization was applied to the loops in the calculate function.

Second, I added the -fdump-tree-all parameter to the command line. I found that the innermost loop (loop 3 and 4) is completely unrolled because the pass_data_complete_unrolli pass considers the innermost loop small. Once the inner loop is fully unrolled, the enclosing loop becomes large.
When pass_loop_jam considers unrolling the loop, it checks whether unroll_factor * (number of loop instructions) > 200. If so, the optimization is abandoned; otherwise it proceeds.

Based on this second analysis, I tried disabling the cunrolli optimization with the following command line:
g++ unroll_jam_bug.cpp -O3 -mcpu=power8 -fdisable-tree-cunrolli -floop-unroll-and-jam -o unroll_jam_bug -fdump-tree-unrolljam-details
With this command, the unroll-and-jam optimization is performed, but there still seems to be room for optimization.

Original code:
for ( l_n = 0; l_n < 9; l_n++ ) {
  for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }
  for ( l_k = 0; l_k < 17; l_k++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
    }
  }
}

After unroll and jam pass:
for ( l_n = 0; l_n < 9; l_n++ ) {
  for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }
  for ( l_k = 0; l_k < 17; l_k += 2 ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      C[(l_n*10)+l_m] += A[(l_k*20 + 20)+l_m] * B[(l_n*20)+l_k + 1];
    }
  }
}