On Fri, Jan 13, 2017 at 8:46 AM, Roland Haas <rh...@illinois.edu> wrote:
> Hello Ian,
>> I have never been able to get anything like realistic FLOPS numbers
>> from PAPI. I have not tried recently.  I think I heard that the
>> hardware counter interfaces were only ever originally intended as
>> debugging tools to be used by the processor manufacturers themselves,
>> and were quite unreliable.  This might have changed in recent CPUs.
>> Do you get numbers consistent with what you expect?  BlueWaters
>> doesn't use very recent CPUs.  I didn't know that the PAPI thorn
>> tests this; that is nice!
> BW is likely one of the better candidates for good numbers since PAPI
> is supported by Cray (https://bluewaters.ncsa.illinois.edu/papi). There
> are even (non user accessible) counters that always count how many
> flops are used in a job and that one can query (unless the user code
> uses PAPI in which case the counters are not usable by the system).
> PAPI (or at least the counters) is pretty much my only hope as counting
> the number of flops in a GR+Hydro+AMR+Neutrino simulation is quite
> hopeless.

You mean "counting the number of flops in a Hydro+AMR+Neutrino
simulation is quite hopeless". Counting in GR is not hopeless since
Kranc can do that for us. (Okay, it doesn't count the stencil
operations yet.)

I just thought I'd wedge in another advertisement for using code generators.

Modern Intel CPUs no longer have hardware counters for Flops, because
the measure that is of interest to users ("how many operations did my
Fortran code contain?") is irrelevant to the CPU. Since the
floating-point unit is idle most of the time (often ~90% of the time),
the CPU aggressively executes floating-point operations speculatively.
The number of speculatively executed and then discarded (!) operations
can be several times higher than the number of "useful" operations.
This is good for overall performance, but it makes the counter values
basically impossible to interpret.

In addition, the larger vector sizes (e.g. 4 for AVX, now 8 for
AVX-512) mean that there are often unused vector lanes. If you count
hardware instructions, then these wasted lanes are still included.

Finally, compilers can transform code in ways that increase the
number of operations. This is called "rematerialization". If there is
an intermediate result that is used multiple times, then the compiler
needs to choose between (a) storing it and (b) re-calculating it. If
there are no free registers available (e.g. because there are already
too many local variables), then re-calculating (1 cycle) is cheaper
than loading/storing (several cycles each time). So e.g. the code

tmp = A + B;
x += tmp;
y += tmp;

can be transformed to

x += A + B;
y += A + B;

which has one more operation, but one fewer variable.

In other words, using a hardware performance counter to count
operations is about as accurate as counting steps to measure distance.
There's a correlation, but it's difficult to quantify the error.


Erik Schnetter <schnet...@cct.lsu.edu>
Users mailing list
