Re: LTO: link-time optimization effect on proton performance

2014-10-02 Thread Andrew Stitcher
On Wed, 2014-10-01 at 21:54 -0400, Michael Goulish wrote:
 Link-time optimization can be turned on by adding the
  -flto flag to the proton library build, in both compilation
 and linking steps.  It offers the possibility of optimizations
 using deeper knowledge of the whole program than is available 
 before linking.

Now that you've got lto going you could try PGO (profile-guided
optimisations); they're supposed to be good!

They do take a significant amount of work to set up though, because you
need to profile a representative workload, then optimise the code again
based on the results from that.

So the flow is more like:

1. Build with appropriate options (probably still lto)
2. Run a profiling run against a representative workload
3. Build again, feeding the profile data back in as optimisation input.

 ...So!  The LTO technology really works, but it's not as
 good as manual inlining based on profiling.  In fact
 it slows that down a little, probably because it is choosing 
 some inlining candidates that don't help enough to offset
 cache thrash due to code size increase.

You could perhaps test this by using -Os -flto rather than -O2 -flto;
that should attempt to keep the code size down.

Also, the gcc 4.8 versions apparently have better heuristics for
inlining than the 4.7 versions (and the 4.9 versions will be better
still).

Andrew




LTO: link-time optimization effect on proton performance

2014-10-01 Thread Michael Goulish

Link-time optimization can be turned on by adding the
 -flto flag to the proton library build, in both compilation
and linking steps.  It offers the possibility of optimizations
using deeper knowledge of the whole program than is available 
before linking.

I have also been trying to get some extra performance by hand-
inlining functions that I select based on valgrind/callgrind
profiling data.

My test procedure has been to run 50 trials, where each trial 
is a run of two programs: my psend and precv proton-C clients
written at the Engine level.  Each trial involves sending and 
receiving 5 million small messages.  

The result from each trial is a single high-resolution timing 
number.  (From just before the sender sends the first message, 
to just after the receiver receives the last message.)
The result of each test is a list of 50 of those numbers.
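A trial loop in that style can be sketched like this. The psend/precv
invocations are placeholders (their real arguments aren't given in the
post), with a short sleep standing in so the loop itself is runnable:

```shell
TRIALS=5                       # the post used 50
: > trial_times.txt
for i in $(seq 1 "$TRIALS"); do
    start=$(date +%s.%N)       # high-resolution wall-clock start
    # ./precv ... &            # start receiver (placeholder)
    # ./psend ...              # run sender to completion (placeholder)
    sleep 0.01                 # stand-in workload
    end=$(date +%s.%N)
    # One timing number per trial, appended to the result list.
    echo "$end $start" | awk '{ printf "%.6f\n", $1 - $2 }' >> trial_times.txt
done
cat trial_times.txt
```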

I compare tests using an online Student's T-test calculator.
(Student was the pen-name of the guy who invented it.
His real name was Gosset, and he was working at the Guinness 
Brewery in Dublin when he invented it. I am not making this up.)  
The t-test gives a number that indicates the likelihood 
that the difference between two tests could have happened randomly.

A small t-test result indicates that the difference between 
two tests is unlikely to have happened randomly.  For example,
a t-test result of 0.01 means that the difference between your 
two tests should only happen 1 time out of 100 times due to 
random chance.  Smaller results are better.
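For the curious, the t statistic behind that calculator can be computed
with nothing fancier than awk (Welch's two-sample form, which doesn't
assume equal variances). This shows only the t value on two toy
samples; turning it into the probability quoted above needs the t
distribution's CDF, which is what the online calculator supplies:

```shell
printf '%s\n' 1 2 3 > a.txt    # toy sample A
printf '%s\n' 4 5 6 > b.txt    # toy sample B
awk '
    # Reads one number per line; leaves mean and var in globals.
    function stats(file,   x, n, s, ss) {
        while ((getline x < file) > 0) { n++; s += x; ss += x * x }
        close(file)
        mean = s / n
        var  = (ss - s * s / n) / (n - 1)   # sample variance
        return n
    }
    BEGIN {
        na = stats("a.txt"); ma = mean; va = var
        nb = stats("b.txt"); mb = mean; vb = var
        t = (ma - mb) / sqrt(va / na + vb / nb)
        printf "t = %.3f\n", t
    }' > t_out.txt
cat t_out.txt
```

A |t| this far from zero on 50-sample tests is what produces the tiny
probabilities reported below.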

With 50 sample-points in each test, you can get nice high 
certainty as to whether you are seeing real or random results.
All of the results below are hyper-significant.  The *worst*
t-test result was 2.9e-8, i.e. 3 chances out of 100 million 
that the difference between the two tests could happen randomly.



So .. here are the results.   (in seconds)


( builds used throughout are normal release-with-debug-info,
with -O2 optimization. )


1. Proton code as of 0800 EDT yesterday, with no changes. 
   mean 41.267825   sigma 0.834826

2. LTO build
   mean 40.073661   sigma 1.108513   improvement: 2.9%

3. manual inlining changes
   mean 39.011794   sigma 1.056831   improvement: 5.5%

4. LTO build plus my changes
   mean 39.211283   sigma 1.041303   improvement: 5.0%




So!  The LTO technology really works, but it's not as
good as manual inlining based on profiling.  In fact
adding LTO on top of the manual inlining slows it down 
a little, probably because LTO is choosing some inlining 
candidates that don't help enough to offset the cache 
thrash from the increased code size.


so there you go.