[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 Roman Zhuykov changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED Assignee|unassigned at gcc dot gnu.org |zhroma at gcc dot gnu.org --- Comment #8 from Roman Zhuykov --- Fixed. Yes, I'm closing the PR, I suppose this should not be backported. It's a pity we don't have an ability to create regression test for such situation.
[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 --- Comment #7 from Arseny Solokha --- There are no backporting pending, I believe? If so, should this PR be closed now?
[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 --- Comment #6 from Roman Zhuykov --- Author: zhroma Date: Fri Dec 13 17:02:53 2019 New Revision: 279375 URL: https://gcc.gnu.org/viewcvs?rev=279375=gcc=rev Log: modulo-sched: speed up DDG analysis (PR90001) PR rtl-optimization/90001 * ddg.c (create_ddg): Init max_dist array for each node. (free_ddg): Free max_dist array. (create_ddg_edge): Use bool field instead of aux union. (set_recurrence_length): Use prepared max_dist information instead of calling longest_simple_path. (create_scc): Remove graph argument, fill node's aux.count with SCC id, and move set_recurrence_length call to... (create_ddg_all_sccs): ...here, after filling all max_dist arrays using Floyd–Warshall-like algorithm. (update_dist_to_successors): Remove the whole function. (longest_simple_path): Likewise. * ddg.h (struct ddg_node): Add max_dist pointer. (struct ddg_edge): Use bool field instead of unused aux union. Modified: trunk/gcc/ChangeLog trunk/gcc/ddg.c trunk/gcc/ddg.h
[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 --- Comment #5 from Roman Zhuykov --- Retested patch separately, everything works. Have found 2 more slow Fortran examples on (obsolete) spu platform and with additional options like -O1/O2 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions gfortran.dg/sms-2.f90 gfortran.dg/forall_10.f90 Compilation| time, sec |unpatched vs patched| sms options forall_10 | 35.02 0.66 | -fmodulo-sched forall_10 | 87.44 0.52 | -fmodulo-sched -fmodulo-sched-allow-regmoves sms-2 | 34.58 0.44 | -fmodulo-sched sms-2 | 86.77 0.27 | -fmodulo-sched -fmodulo-sched-allow-regmoves
[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 --- Comment #4 from Roman Zhuykov --- Thanks for testcase. 2-3 weeks ago I already caught and fixed this on my local branch, see some info in the bottom. Current algorithm which finds recurrence_length for all DDG strongly connected components works in like O(N^6) time, where N in the number of nodes in DDG. The time is so bad mostly for graphs with lots of edges, like almost N^2 edges. Richard's suggestion is right - it will be still something like O(N^5) in worst case even without bitmap overhead for such graphs. My proposed algorithm works in O(N^3). Algorithm of finding SCCs itself is also not optimal (maybe up to O(N^4)), but here it left untouched. For some situations, when amount of edges is smaller (like equal to N), new algorithm can be unfortunately slower than old one. But I think it's better here to add some bail-out when we got more than 1000 nodes for example. Before creating this patch, I tested special version of it, where both approaches were in action and asserts were inserted to check that algorithms results (longest_simple_path values) are absolutely the same. I can publish this special version if needed. I wonder how regression test can be created for such a situation? [Testing] Proposed patch with a bunch of ~25 other patches was tested a lot: *(1) Bootstrapped and regtested on x86_64 *(2) Cross-compiler regtest on aarch64, arm, powerpc, powerpc64, ia64 and s390 *(3) Also done (1) and (2) with -fmodulo-sched enabled by default *(4) Also done (1) and (2) with -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default *(5) Moreover, all (1-4) was also done with 4.9, 5, 6, 7, and 8 branches, on active branches an trunk date was 20190327. More than 250 compiler instances built and tested in total (counting both "unpatched" vs "patched"). No new failures which can relate to this algorithm were found. "Special version" was also tested in practically same scenarios (and one more week before, like 20190320), but not exactly all of them. But still have to retest it separately without all my stuff :) [PS] Last month I spent a lot of time updating my patches described here https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html and have locally added several other patches, including this fix. My updated branches are not published yet, because there are still some unsolved issues, I can't fix some bugzilla PRs also. I'll try to add comments in another modulo-sched-related PRs soon.
[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 Roman Zhuykov changed: What|Removed |Added CC||zhroma at ispras dot ru --- Comment #3 from Roman Zhuykov --- Created attachment 46099 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46099=edit Proposed patch Untested on trunk yet
[Bug rtl-optimization/90001] Compile-time hog in swing modulo scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90001 Richard Biener changed: What|Removed |Added Target|powerpc-*-linux-gnu |powerpc-*-linux-gnu ||powerpc64le-*-* Status|UNCONFIRMED |NEW Last reconfirmed||2019-04-08 Component|target |rtl-optimization Ever confirmed|0 |1 --- Comment #2 from Richard Biener --- Confirmed also on ppc64le (thus likely on any doloop capable target which SMS needs). Note that SMS is unmaintained. The issue is that set_recurrence_length is called with a graph with 18662 back-arcs and the longest_simple_path does work on the order of the graph size because it uses sbitmaps. Using bitmaps doesn't help here though. I guess the underlying algorithm for set_recurrence_length simply isn't scalable and needs to be replaced.