[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #8 from Freddie Witherden ---
(In reply to rguent...@suse.de from comment #7)
> Instead of [[gnu::flatten]] you could use the
> __attribute__((always_inline)) attribute on the foo function definition
> if you didn't simplify the outline above too much to make that
> infeasible.  IIRC we do not have sth like
>
>   [[gnu::inline]] foo(i, ...);
>
> to force inlining of a specific call, nor [[gnu::noinline]] foo(i, ...);
> both which seem useful.  Not sure if the C++ syntax would support
> such placement of an attribute of course.

So this is exactly what we had in the pre-flatten version of the code:

https://github.com/PyFR/Polyquad/commit/f24366c059d2d693222985cdd9333238bd909ad3

The issue was that while GCC would inline the annotated functions, it would go
no further.  As such, if I recall correctly, all of the constructor calls to
the relatively simple Eigen vector types were no longer inlined.  Thus a line
of code which should translate into a few register-to-memory mov instructions
results in a constructor call, an assignment call, and some cleanup.  Since I
could not add the force-inline attribute to the library types, I went in search
of an alternative.

For the T = bfloat eval_orthob instance, is the
"if (std::is_fundamental::value)" considered before the body is inlined?
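To illustrate the difference described above, here is a minimal sketch that assumes a GNU-compatible compiler (GCC or Clang). The names Vec3, make_point and sum_points are hypothetical stand-ins and not taken from Polyquad or Eigen. always_inline is written on a callee and governs calls of that function, whereas flatten is written on a caller and asks for the calls inside its body to be inlined, including ones to types that cannot be annotated.

```cpp
#include <cstddef>

// Hypothetical stand-in for a small library vector type whose constructor
// cannot be annotated by the user (in the thread these are Eigen types).
struct Vec3 {
    double v[3];
    Vec3(double a, double b, double c) : v{a, b, c} {}
};

// always_inline affects calls *of* make_point. The Vec3 constructor call
// inside its body is still subject to the usual inlining heuristics.
[[gnu::always_inline]] inline Vec3 make_point(double t)
{
    return Vec3(t, 2.0 * t, 3.0 * t);
}

// flatten affects calls *inside* sum_points: the compiler is asked to inline
// every call in the body, and the calls those bodies make, where possible.
[[gnu::flatten]] double sum_points(std::size_t n)
{
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3 p = make_point(static_cast<double>(i));
        acc += p.v[0] + p.v[1] + p.v[2];
    }
    return acc;
}
```

Both attributes leave the computed result unchanged; only the generated code differs, so any benefit has to be confirmed by inspecting the assembly or by measurement.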
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #7 from rguenther at suse dot de ---
On Fri, 22 May 2020, freddie at witherden dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264
>
> --- Comment #6 from Freddie Witherden ---
> (In reply to Richard Biener from comment #3)
> > So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> > compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> > -O3 is the same as -O2.
> >
> > Maybe instead of [[gnu::flatten]] you want to bump --param
> > inline-unit-growth or --param large-function-growth more moderately in
> > case you can measure an effect on runtime.
> >
> > Note multiple [[gnu::flatten]] can really exponentially grow program size
> > since it is not apparent which functions might be used from another
> > translation unit until you can use -fwhole-program (single CU program)
> > or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> > growth - something we might want to fix).  Placing things not used from
> > outside in anonymous namespaces might help.
>
> The [[gnu::flatten]] was added to get GCC's performance in the case of T =
> double on a par with Clang's.  (We don't care about performance with T =
> bfloat as it is just used as a final polishing pass.)  I can understand why
> GCC does not want to inline it in the case of T = bfloat which is a complex
> type, but for T = double the function is basically just a sequence of mov's
> to populate an array.
>
> As the function is of the form
>
>   for (int i = 0; i < N; i++)       // N = template arg
>     for (int j = 0; j < p[i]; j++)  // runtime trip count
>       foo(i, ...);                  // static polymorphism
>
> with foo being a large switch-case on its first argument the expectation was
> for the compiler to inline foo, unroll the outer loop, and then prune the dead
> cases such that we have something similar to
>
>   for (int j = 0; j < p[0]; j++)
>     foo(0, ...);  // inline i = 0 case
>   for (int j = 0; j < p[1]; j++)
>     foo(1, ...);  // inline i = 1 case
>   // ...

Ah, interesting.  This kind of static polymorphism should be handled by
IPA-CP already, but it's of course possible we're confused about a detail
in this very testcase.  Honza?

Instead of [[gnu::flatten]] you could use the
__attribute__((always_inline)) attribute on the foo function definition
if you didn't simplify the outline above too much to make that
infeasible.  IIRC we do not have something like

  [[gnu::inline]] foo(i, ...);

to force inlining of a specific call, nor [[gnu::noinline]] foo(i, ...);
both of which seem useful.  Not sure if the C++ syntax would support
such placement of an attribute, of course.
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #6 from Freddie Witherden ---
(In reply to Richard Biener from comment #3)
> So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> -O3 is the same as -O2.
>
> Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
> or --param large-function-growth more moderately in case you can measure an
> effect on runtime.
>
> Note multiple [[gnu::flatten]] can really exponentially grow program size
> since it is not apparent which functions might be used from another
> translation unit until you can use -fwhole-program (single CU program)
> or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> growth - something we might want to fix).  Placing things not used from
> outside in anonymous namespaces might help.

The [[gnu::flatten]] was added to get GCC's performance in the case of T =
double on a par with Clang's.  (We don't care about performance with T = bfloat
as it is just used as a final polishing pass.)  I can understand why GCC does
not want to inline it in the case of T = bfloat, which is a complex type, but
for T = double the function is basically just a sequence of mov's to populate
an array.

As the function is of the form

  for (int i = 0; i < N; i++)       // N = template arg
    for (int j = 0; j < p[i]; j++)  // runtime trip count
      foo(i, ...);                  // static polymorphism

with foo being a large switch-case on its first argument, the expectation was
for the compiler to inline foo, unroll the outer loop, and then prune the dead
cases such that we have something similar to

  for (int j = 0; j < p[0]; j++)
    foo(0, ...);  // inline i = 0 case
  for (int j = 0; j < p[1]; j++)
    foo(1, ...);  // inline i = 1 case
  // ...
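One way to obtain the unrolled shape above without relying on the inliner is to make the case index a template parameter, so that each instantiation only ever contains its own case. The following is a sketch under several assumptions: it requires C++17, the names foo, run_case and expand are hypothetical and not the Polyquad code, the three cases are arbitrary placeholders, and the inner bound is assumed to be indexed by the outer counter as in the unrolled form.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the large switch-case: the case index is a
// template parameter, so the untaken branches are discarded at compile time.
template <std::size_t I>
inline double foo(double x)
{
    if constexpr (I == 0)
        return x;
    else if constexpr (I == 1)
        return 2.0 * x;
    else
        return x * x;
}

// Inner loop with a runtime trip count for one fixed outer index I.
template <std::size_t I, std::size_t N>
inline double run_case(const std::array<int, N>& p, double x)
{
    double acc = 0.0;
    for (int j = 0; j < p[I]; ++j)
        acc += foo<I>(x);
    return acc;
}

// Expands to run_case<0>(p, x) + run_case<1>(p, x) + ... via a fold expression.
template <std::size_t N, std::size_t... Is>
inline double expand_impl(const std::array<int, N>& p, double x,
                          std::index_sequence<Is...>)
{
    return (run_case<Is>(p, x) + ... + 0.0);
}

// Entry point: the outer loop over i = 0..N-1 is unrolled by the template
// machinery rather than by the optimizer.
template <std::size_t N>
double expand(const std::array<int, N>& p, double x)
{
    return expand_impl(p, x, std::make_index_sequence<N>{});
}
```

For example, with p = {1, 2, 3} and x = 2.0 the call expand(p, 2.0) runs case 0 once, case 1 twice and case 2 three times. The trade-off is that the unrolling is now unconditional, so code size grows with N whether or not it helps at run time.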
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #5 from Richard Biener ---
So confirmed we eventually blow up at -O1:

++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Command exited with non-zero status 1
3015.48user 45.01system 1:08:57elapsed 73%CPU (0avgtext+0avgdata 30682104maxresident)k
1549456inputs+47040outputs (2343major+9807077minor)pagefaults 0swaps

I didn't manage to catch where in the process of compilation that was, though;
during PTA it hovered at ~12GB.
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #4 from Richard Biener ---
Clang's documentation mentions that it supports [[gnu::flatten]]; whether the
implementations match here is of course another question.

I guess for a convoluted cgraph our flatten implementation leaves something to
be desired: if there are two calls to the same function, we inline it fully
twice and have to reap the benefits of inlining all calls (recursively) in
them twice, rather than producing an optimized body for the flatten inlining
first.

One could envision some early cloning for the purpose of flattening, pushing
down the flattening attribute to the clones that end up being inlined multiple
times.  Not sure how easy that would be - Honza?
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2020-05-22
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener ---
So with the [[gnu::flatten]] attributes removed, -O1 needs 80 seconds to
compile and about 3GB of memory, -O2 needs around 2 minutes (same memory), and
-O3 is the same as -O2.

Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
or --param large-function-growth more moderately, in case you can measure an
effect on runtime.

Note multiple [[gnu::flatten]] can really exponentially grow program size,
since it is not apparent which functions might be used from another
translation unit until you can use -fwhole-program (single CU program)
or -flto (but there [[gnu::flatten]] is applied too early to avoid such
growth - something we might want to fix).  Placing things not used from
outside in anonymous namespaces might help.
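As a small illustration of the anonymous-namespace suggestion, the sketch below uses hypothetical names (scale, apply_twice) that do not come from the test case. A function in an anonymous namespace has internal linkage, so the compiler can see that no other translation unit calls it and may drop the out-of-line copy once every call has been inlined.

```cpp
namespace {

// Internal linkage: only this translation unit can call scale(), so no
// standalone copy has to be kept for outside callers after inlining.
double scale(double x)
{
    return 3.0 * x;
}

}  // namespace

// Externally visible entry point; its two calls to scale() are candidates
// for inlining, after which scale() itself can be discarded.
double apply_twice(double x)
{
    return scale(scale(x));
}
```

Whether this measurably reduces the growth seen in this report is not established in the thread; it is offered there only as something that might help.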
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #2 from Richard Biener ---
We're then inlining some more, costing another ~5GB on top of the early
optimization memory use of ~5GB (this might be IPA transforms other than
inlining as well).  The big function is meanwhile 2 million basic blocks...
update-SSA and friends are no fun here (the function with 2 million BBs is
eval_orthob).

Ah, you use [[gnu::flatten]] on that - so isn't it just what you asked for?
I wonder if Clang implements that at all.

Note the issue with -fvar-tracking* and -g and large functions is known...
[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #1 from Richard Biener ---
Confirmed.  We do have (a) huuuge function here, containing 539237 basic
blocks after early inlining, which is

void polyquad::BaseDomain::expand(const VectorXT&, polyquad::BaseDomain::MatrixPtsT&) const [with Derived = polyquad::TetDomain > >; T = boost::multiprecision::number >; int Ndim = 3; int Norbits = 5]

Obviously every IL walk will be bad here.  Didn't yet find the actual wall it
runs into, still runs...