[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2023-07-06 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka at gcc dot gnu.org>:

https://gcc.gnu.org/g:3a61ca1b9256535e1bfb19b2d46cde21f3908a5d

commit r14-2369-g3a61ca1b9256535e1bfb19b2d46cde21f3908a5d
Author: Jan Hubicka 
Date:   Thu Jul 6 18:56:22 2023 +0200

Improve profile updates after loop-ch and cunroll

Extend loop-ch and loop unrolling to fix the profile in case the loop is
known to not iterate at all (or to iterate only a few times) while the
profile claims it iterates more.  While this is kind of a symptomatic fix, it
is the best we can do in case the profile was originally estimated incorrectly.
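
The idea, as a minimal sketch (illustrative C only, with made-up names; not
the actual tree-ssa-loop-ch.cc code):

/* Illustrative sketch: if the profile implies more iterations than the
   statically proven bound, return a factor by which to scale the counts
   of blocks inside the loop so the implied trip count matches the bound.  */
double
profile_scale_for_bound (double header_count, double preheader_count,
                         double max_iterations)
{
  if (preheader_count <= 0)
    return 1.0;                         /* Nothing sensible to compute.  */
  /* Average iteration count implied by the current profile.  */
  double profile_iterations = header_count / preheader_count;
  if (profile_iterations <= max_iterations)
    return 1.0;                         /* Profile already consistent.  */
  return max_iterations / profile_iterations;
}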

In the testcase the problematic loop is produced by the vectorizer, and I
think the vectorizer should know, and account for in its costs, that the
vectorized loop and/or epilogue is not going to loop after the
transformation.  So it would be nice to fix it on that side, too.

The patch avoids about half of the profile mismatches caused by cunroll.

Pass dump id and name    |static mismatch|dynamic mismatch
                         |in count       |in count
107t cunrolli            |  3   +3       |    17251   +17251
115t threadfull          |  3            |    14376    -2875
116t vrp                 |  5   +2       |    30908   +16532
117t dse                 |  5            |    30908
118t dce                 |  3   -2       |    17251   -13657
127t ch                  | 13  +10       |    17251
131t dom                 | 39  +26       |    17251
133t isolate-paths       | 47   +8       |    17251
134t reassoc             | 49   +2       |    17251
136t forwprop            | 53   +4       |   202501  +185250
159t cddce               | 61   +8       |   216211   +13710
161t ldist               | 62   +1       |   216211
172t ifcvt               | 66   +4       |   373711  +157500
173t vect                |143  +77       |  9802097 +9428386
176t cunroll             |221  +78       | 15639591 +5837494
183t loopdone            |218   -3       | 15577640   -61951
195t fre                 |214   -4       | 15577640
197t dom                 |213   -1       | 16671606 +1093966
199t threadfull          |215   +2       | 16879581  +207975
200t vrp                 |217   +2       | 17077750  +198169
204t dce                 |215   -2       | 17004486   -73264
206t sink                |213   -2       | 17004486
211t cddce               |219   +6       | 17005926    +1440
255t optimized           |217   -2       | 17005926
256r expand              |210   -7       | 19571573 +2565647
258r into_cfglayout      |208   -2       | 19571573
275r loop2_unroll        |212   +4       | 22992432 +3420859
291r ce2                 |210   -2       | 23011838
312r pro_and_epilogue    |230  +20       | 23073776   +61938
315r jump2               |236   +6       | 27110534 +4036758
323r bbro                |229   -7       | 21826835 -5283699

Without the patch cunroll does:

176t cunroll             |294 +151       |126548439 +116746342

and we end up with 291 mismatches at bbro.

Bootstrapped/regtested x86_64-linux. Plan to commit it after the
scale_loop_frequency patch.

gcc/ChangeLog:

PR middle-end/25623
* tree-ssa-loop-ch.cc (ch_base::copy_headers): Scale loop frequency
to maximal number of iterations determined.
* tree-ssa-loop-ivcanon.cc (try_unroll_loop_completely): Likewise.

gcc/testsuite/ChangeLog:

PR middle-end/25623
* gfortran.dg/pr25623-2.f90: New test.

[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2023-07-01 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka at gcc dot gnu.org>:

https://gcc.gnu.org/g:ee4d85b3a8b76328df6bccc1026d62dff5f827ce

commit r14-2231-gee4d85b3a8b76328df6bccc1026d62dff5f827ce
Author: Jan Hubicka 
Date:   Sat Jul 1 13:44:46 2023 +0200

Add testcase from PR25623

gcc/testsuite/ChangeLog:

PR tree-optimization/25623
* gfortran.dg/pr25623.f90: New test.

[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2023-07-01 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The testcase from Comment #1 is wontfix (there is really not much to do at
threading time, since the profile was not estimated realistically).
The original Fortran testcase now works
(after fix g:7e904d6c7f252ee947c237ed32dd43b2c248384d).
We do one threading in the thread2 pass:

 Registering killing_def (path_oracle) _1
 Registering killing_def (path_oracle) ubound.4_14
Checking profitability of path (backwards):  
  [1] Registering jump thread: (2, 3) incoming edge;  (3, 6) nocopy;
path: 2->3->6 SUCCESS
Checking profitability of path (backwards):  bb:4 (6 insns) bb:10 (latch)
  Control statement insns: 2
  Overall: 4 insns


and give up on two because they cross a loop boundary.

Checking profitability of path (backwards):  bb:3 (2 insns) bb:4 (latch)
  Control statement insns: 2
  Overall: 0 insns

 Registering killing_def (path_oracle) S.6_56
path: 4->3->xx REJECTED
Checking profitability of path (backwards):  bb:6 (2 insns) bb:7 (latch)
  Control statement insns: 2
  Overall: 0 insns

 Registering killing_def (path_oracle) i_68
path: 7->6->xx REJECTED
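
For reference, a rough sketch (hypothetical code, not GCC's actual
profitability logic) of the two rejection reasons visible in the dumps above:

#include <stdbool.h>

/* Hypothetical sketch: a candidate path is rejected when it crosses a
   loop boundary, and otherwise accepted only when threading it would
   not duplicate too many statements.  */
bool
path_profitable_p (bool crosses_loop_boundary, int n_insns, int max_insns)
{
  if (crosses_loop_boundary)
    return false;              /* "REJECTED" above.  */
  return n_insns <= max_insns; /* else: "would copy too many statements".  */
}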

One path is the usual loop entry condition known to be true (which I think
early opts should handle) and is eventually dealt with by the copy-header
pass.  The other path eventually gets a reason for the failure dumped:

Checking profitability of path (backwards):  bb:4 (16 insns) bb:6 (latch)
  Control statement insns: 2
  Overall: 14 insns
  FAIL: Did not thread around loop and would copy too many statements.
__attribute__((fn spec (". w w w w ")))

This is due to the fact that the loop is known to iterate at least once
(there is an explicit +1).  It may be interesting to peel for this.
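
A hypothetical example of the pattern (not the actual Fortran testcase): the
explicit +1 in the bound guarantees at least one iteration, so peeling one
iteration would leave a loop whose entry test can actually fail:

/* Hypothetical sketch: for any n >= 0 the body executes at least once,
   because the entry check "0 < n + 1" is always true on entry.  */
void
scale_array (double *a, int n)
{
  for (int i = 0; i < n + 1; i++)
    a[i] *= 2.0;
}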

With -O3 we vectorize the loop and unroll the epilogue.  However we get:

;;   basic block 14, loop depth 1, count 668941153 (estimated locally), maybe hot
;;    prev block 16, next block 15, flags: (NEW, REACHABLE, VISITED)
;;    pred:       15 [always]  count:595357627 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                16 [always]  count:73583526 (estimated locally) (FALLTHRU)
  # i_34 = PHI 
  _2 = i_34 + -1;
  _17 = (integer(kind=8)) _2;
  _18 = (*a_19(D))[_17];
  tmp_45 = __builtin_pow (_18,
3.33314829616256247390992939472198486328125e-1);
  tmp2_44 = tmp_45 * tmp_45;
  tmp4_43 = tmp2_44 * tmp2_44;
  _42 = (*b_24(D))[_17];
  _41 = _42 + tmp4_43;
  (*b_24(D))[_17] = _41;
  _39 = (*c_16(D))[_17];
  _38 = _39 + tmp2_44;
  (*c_16(D))[_17] = _38;
  i_31 = i_34 + 1;
  if (_1 < i_31)
    goto ; [11.00%]
  else
    goto ; [89.00%]

Cunrolli unloops it without fixing the profile, resulting in an inconsistent
profile:

;;   basic block 16, loop depth 0, count 668941153 (estimated locally), maybe hot
;;   Invalid sum of incoming counts 73583527 (estimated locally), should be 668941153 (estimated locally)
;;    prev block 13, next block 17, flags: (NEW, REACHABLE, VISITED)
;;    pred:       13 [66.7% (guessed)]  count:63071594 (estimated locally) (FALSE_VALUE)
;;                7 [10.0% (guessed)]  count:10511933 (estimated locally) (TRUE_VALUE)
  # i_29 = PHI 
  _2 = i_29 + -1;
  _17 = (integer(kind=8)) _2;
  _18 = (*a_19(D))[_17];
  tmp_45 = __builtin_pow (_18,
3.33314829616256247390992939472198486328125e-1);
  tmp2_44 = tmp_45 * tmp_45;
  tmp4_43 = tmp2_44 * tmp2_44;
  _42 = (*b_24(D))[_17];
  _41 = _42 + tmp4_43;
  (*b_24(D))[_17] = _41;
  _39 = (*c_16(D))[_17];
  _38 = _39 + tmp2_44;
  (*c_16(D))[_17] = _38;
  i_31 = i_29 + 1;
;;    succ:       17 [always (guessed)]  count:668941153 (estimated locally) (FALLTHRU)

;;   basic block 17, loop depth 0, count 105119324 (estimated locally), maybe hot
;;   Invalid sum of incoming counts 700476950 (estimated locally), should be 105119324 (estimated locally)
;;    prev block 16, next block 5, flags: (NEW, VISITED)
;;    pred:       16 [always (guessed)]  count:668941153 (estimated locally) (FALLTHRU)
;;                13 [33.3% (guessed)]  count:31535797 (estimated locally) (TRUE_VALUE)
;;    succ:       5 [always]  count:105119324 (estimated locally) (FALLTHRU,EXECUTABLE)
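
The two "Invalid sum" complaints are plain arithmetic over the dumped edge
counts: bb 16 receives 63071594 + 10511933 = 73583527 but still carries its
pre-unlooping count of 668941153, and bb 17 receives 668941153 + 31535797 =
700476950 while claiming only 105119324.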

So I guess unlooping should fix the profile after itself, but does vect really
need to produce loops iterating precisely once?

[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2023-06-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
In this testcase:

void rs6000_emit_move (int mode, int t, int tt)
{
  if (t == 1)
if (mode != 2)
  t = ();
  if (t == 1)
if (mode != 2)
      __builtin_abort ();
}

The profile update cannot go right.

At branch prediction time, the first two conditionals are predicted with some
probability close to 50%, since the branch predictor cannot derive much about
them.

However, the last conditional is predicted with a very small probability,
since it guards a call to __builtin_abort.  The call to () then gets a higher
frequency than the call to __builtin_abort.

Later we discover by jump threading that __builtin_abort happens iff () is
called, so the profile was inconsistent from the start, just not in an
obvious way.  update_bb_profile_for_threading should print out the reason
into the dump file.
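
To make the inconsistency concrete (illustrative numbers, not taken from a
dump): if each of the first two conditionals is predicted at about 50%, the
call to () is reached with relative frequency roughly 0.5 * 0.5 = 0.25, while
the __builtin_abort branch is predicted near 0.  Threading later proves the
abort executes exactly as often as the call, so both estimates cannot
simultaneously be preserved by any local count update.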

[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2021-08-23 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 22401 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2021-08-08 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623
Bug 25623 depends on bug 26602, which changed state.

Bug 26602 Summary: cfg cleanup can mess up incoming frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26602

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE

[Bug tree-optimization/25623] jump threading/cfg cleanup messes up "incoming counts" for some BBs

2021-08-08 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25623

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 26602 has been marked as a duplicate of this bug. ***