[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread freddie at witherden dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #8 from Freddie Witherden  ---
(In reply to rguent...@suse.de from comment #7)
> 
> Instead of [[gnu::flatten]] you could use the 
> __attribute__((always_inline)) attribute on the foo function definition
> if you didn't simplify the outline above too much to make that
> infeasible.  IIRC we do not have sth like
> 
>   [[gnu::inline]] foo(i, ...);
> 
> to force inlining of a specific call, nor [[gnu::noinline]] foo(i, ...);
> both of which seem useful.  Not sure if the C++ syntax would support
> such placement of an attribute of course.

So this is exactly what we had in the pre-flatten version of the code:

https://github.com/PyFR/Polyquad/commit/f24366c059d2d693222985cdd9333238bd909ad3

The issue was that while GCC would inline the annotated functions, it would go
no further.  As such, if I recall correctly, all of the constructor calls to the
relatively simple Eigen vector types were no longer inlined.  Thus a line of
code which should translate into a few register-to-memory mov instructions
resulted in a constructor call, an assignment call, and some cleanup.  Since
I could not add the force inline attribute to the library types, I went in
search of an alternative.

For the T = bfloat eval_orthob instance, is the "if
(std::is_fundamental<T>::value)" considered before the body is inlined?

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #7 from rguenther at suse dot de  ---
On Fri, 22 May 2020, freddie at witherden dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264
> 
> --- Comment #6 from Freddie Witherden  ---
> (In reply to Richard Biener from comment #3)
> > So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> > compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> > -O3 is the same as -O2.
> > 
> > Maybe instead of [[gnu::flatten]] you want to bump --param
> > inline-unit-growth or --param large-function-growth more moderately in
> > case you can measure an effect on runtime.
> > 
> > Note multiple [[gnu::flatten]] can really exponentially grow program size
> > since it is not apparent which functions might be used from another
> > translation unit until you can use -fwhole-program (single CU program)
> > or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> > growth - sth we might want to fix).  Placing things not used from outside
> > in anonymous namespaces might help.
> 
> The [[gnu::flatten]] was added to get GCC's performance in the case of T =
> double on a par with Clang's.  (We don't care about performance with T =
> bfloat as it is just used as a final polishing pass.)  I can understand why
> GCC does not want to inline it in the case of T = bfloat which is a complex
> type, but for T = double the function is basically just a sequence of mov's
> to populate an array.
> 
> As the function is of the form
> 
> for (int i = 0; i < N; i++) // N = template arg
>   for (int j = 0; j < p[i]; j++) // runtime trip count
>     foo(i, ...); // static polymorphism
> 
> with foo being a large switch-case on its first argument the expectation was
> for the compiler to inline foo, unroll the outer loop, and then prune the dead
> cases such that we have something similar to
> 
> for (int j = 0; j < p[0]; j++)
> foo(0, ...); // inline i = 0 case
> for (int j = 0; j < p[1]; j++)
> foo(1, ...); // inline i = 1 case
> // ...

Ah, interesting.  This kind of static polymorphism should be handled
by IPA-CP already but it's of course possible we're confused about
a detail in this very testcase.  Honza?

Instead of [[gnu::flatten]] you could use the 
__attribute__((always_inline)) attribute on the foo function definition
if you didn't simplify the outline above too much to make that
infeasible.  IIRC we do not have sth like

  [[gnu::inline]] foo(i, ...);

to force inlining of a specific call, nor [[gnu::noinline]] foo(i, ...);
both of which seem useful.  Not sure if the C++ syntax would support
such placement of an attribute of course.
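
As a rough illustration of the first suggestion, the attribute goes on the
definition of foo and therefore applies to every call of it.  This is only a
sketch: the body of foo and the caller repeat_case0 are hypothetical, and
constexpr is used merely so the result can be checked at compile time.

```cpp
// Hypothetical foo: a switch on its first argument, as in the outline above.
// The attribute sits on the definition, so it affects every call of foo.
__attribute__((always_inline)) inline constexpr int foo(int i, int x)
{
    switch (i)
    {
        case 0: return x + 1;
        case 1: return 2 * x;
        default: return 0;
    }
}

// Hypothetical caller: once foo is inlined with a constant first argument,
// the switch can be folded down to the single live case.
constexpr int repeat_case0(int n, int x)
{
    int acc = 0;
    for (int j = 0; j < n; j++)
        acc += foo(0, x);
    return acc;
}
```

Note that this forces inlining of foo itself only; calls made from inside foo
remain subject to the usual inlining heuristics.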

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread freddie at witherden dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #6 from Freddie Witherden  ---
(In reply to Richard Biener from comment #3)
> So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> -O3 is the same as -O2.
> 
> Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
> or --param large-function-growth more moderately in case you can measure an
> effect on runtime.
> 
> Note multiple [[gnu::flatten]] can really exponentially grow program size
> since it is not apparent which functions might be used from another
> translation unit until you can use -fwhole-program (single CU program)
> or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> growth - sth we might want to fix).  Placing things not used from outside
> in anonymous namespaces might help.

The [[gnu::flatten]] was added to get GCC's performance in the case of T =
double on a par with Clang's.  (We don't care about performance with T = bfloat
as it is just used as a final polishing pass.)  I can understand why GCC does
not want to inline it in the case of T = bfloat which is a complex type, but
for T = double the function is basically just a sequence of mov's to populate
an array.

As the function is of the form

for (int i = 0; i < N; i++) // N = template arg
  for (int j = 0; j < p[i]; j++) // runtime trip count
    foo(i, ...); // static polymorphism

with foo being a large switch-case on its first argument the expectation was
for the compiler to inline foo, unroll the outer loop, and then prune the dead
cases such that we have something similar to

for (int j = 0; j < p[0]; j++)
foo(0, ...); // inline i = 0 case
for (int j = 0; j < p[1]; j++)
foo(1, ...); // inline i = 1 case
// ...
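
One way to obtain the expanded form without depending on the inliner and
unroller is to make the outer index a template parameter.  The following is
only a sketch under assumptions: the body of foo, the helpers run_orbit and
run_all, and the trip-count array p are hypothetical stand-ins, not the
Polyquad code.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical stand-in for foo.  The index is a template parameter, so each
// instantiation switches on a compile-time constant.
template <std::size_t I>
constexpr double foo(double x)
{
    switch (I)
    {
        case 0: return x + 1.0;
        case 1: return 2.0 * x;
        case 2: return x * x;
        default: return 0.0;
    }
}

// Inner loop for one fixed outer index I; p[I] is still a runtime trip count.
template <std::size_t I>
constexpr double run_orbit(const int* p, double x)
{
    double acc = 0.0;
    for (int j = 0; j < p[I]; j++)
        acc += foo<I>(x);
    return acc;
}

// Expand the outer loop over I = 0 .. N-1 at compile time (C++14 style).
template <std::size_t... Is>
constexpr double run_all_impl(const int* p, double x, std::index_sequence<Is...>)
{
    double acc = 0.0;
    // Braced initializer lists evaluate left to right, so the orbits run in order.
    int dummy[] = {0, (acc += run_orbit<Is>(p, x), 0)...};
    (void)dummy;
    return acc;
}

template <std::size_t N>
constexpr double run_all(const int* p, double x)
{
    return run_all_impl(p, x, std::make_index_sequence<N>{});
}
```

A call such as run_all<3>(p, x) then instantiates one inner loop per value of
I, which mirrors the hand-expanded loops above.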

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #5 from Richard Biener  ---
So confirmed we eventually blow up at -O1:

++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Command exited with non-zero status 1
3015.48user 45.01system 1:08:57elapsed 73%CPU (0avgtext+0avgdata 30682104maxresident)k
1549456inputs+47040outputs (2343major+9807077minor)pagefaults 0swaps

didn't manage to catch where in the process of compilation that was though,
during PTA it hovered at ~12GB.

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #4 from Richard Biener  ---
clang documentation mentions they support [[gnu::flatten]], whether
implementations match here is of course another question.

I guess for a convoluted cgraph our flatten implementation leaves sth to be
desired - if there's two calls to the same function we inline it fully
twice and have to reap benefits of inlining all calls (recursively) in them
twice rather than producing an optimized body for the flatten inlining
first.  One could envision some early cloning for the purpose of flattening,
pushing down the flattening attribute to the clones that end up being
inlined multiple times.  Not sure how easy that would be - Honza?
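
To make the duplication concrete, here is a minimal sketch of such a call
graph.  The functions leaf, mid and top are hypothetical, and constexpr is
used only so the result can be checked at compile time.

```cpp
// Hypothetical three-level call graph.
constexpr int leaf(int x) { return x * x + 1; }
constexpr int mid(int x) { return leaf(x) + leaf(x + 1); }

// flatten requests that every call in top's body be inlined, recursively.
// mid is called twice here, so its whole call tree is inlined at both call
// sites, giving four inlined copies of leaf in total.
[[gnu::flatten]] constexpr int top(int x)
{
    return mid(x) + mid(x + 2);
}
```

Each additional level that calls its callee more than once multiplies the
number of inlined copies again.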

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2020-05-22
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener  ---
So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to compile
and about 3GB of memory, -O2 needs around 2 minutes (same memory), -O3
is the same as -O2.

Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
or --param large-function-growth more moderately in case you can measure an
effect on runtime.

Note multiple [[gnu::flatten]] can really exponentially grow program size
since it is not apparent which functions might be used from another
translation unit until you can use -fwhole-program (single CU program)
or -flto (but there [[gnu::flatten]] is applied too early to avoid such
growth - sth we might want to fix).  Placing things not used from outside
in anonymous namespaces might help.
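
A minimal sketch of the anonymous-namespace suggestion follows.  The names
helper and entry are hypothetical, and constexpr is used only so the result
can be checked at compile time; the linkage point is the same for ordinary
functions.

```cpp
namespace
{
// Internal linkage: no other translation unit can call helper, so the
// compiler can see every use of it within this file.
constexpr int helper(int x) { return 3 * x + 1; }
}

// Hypothetical entry point that remains visible outside this file.
constexpr int entry(int x)
{
    return helper(x) + helper(x + 1);
}
```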

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #2 from Richard Biener  ---
We're then inlining some more, costing another ~5GB on top of the early
optimization memory use of ~5GB (might be other IPA transforms than inlining
as well).  The big function is meanwhile 2 million basic blocks...
update-SSA and friends are no fun here (the function with 2 million BBs is
eval_orthob).

Ah, you use [[gnu::flatten]] on that - so isn't it just what you asked for?

I wonder if Clang implements that at all.

Note the issue with -fvar-tracking* and -g and large functions is known...

[Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above

2020-05-22 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #1 from Richard Biener  ---
Confirmed.  We do have (a) huuuge function here, containing 539237 basic blocks
after early inlining which is

void polyquad::BaseDomain::expand(const VectorXT&,
polyquad::BaseDomain::MatrixPtsT&) const [with
Derived =
polyquad::TetDomain
> >; T =
boost::multiprecision::number
>; int Ndim = 3; int Norbits = 5]

obviously every IL walk will be bad here.  Didn't yet find the actual wall it
runs into, still runs...