Re: [Users] PAPI thorn and threads

2017-01-13 Thread Roland Haas
Hello Erik, all,

> You mean "counting the number of flops in a Hydro+AMR+Neutrino
> simulation is quite hopeless". Counting in GR is not hopeless since
> Kranc can do that for us. (Okay, it doesn't count the stencil
> operations yet.)
Well, it's worse than that, really. Even *if* I had a small enough code
(or code generator) that I could try to count the number of FLOPs for a
point in the interior of the grid on a single level, ignoring buffer
zones, ghost zones and the issue that some values are recomputed in
buffer zones while others are interpolated, the hydro codes almost
always take different code paths depending on the values on the grid
(e.g. in the reconstruction, the Riemann solvers, or the non-linear
root finding in con2prim, which is usually ~1/3 of the computational
cost), so the number of FLOPs used for a grid point is not predictable
unless one knows exactly what data will be present at that grid point.

So the best I can hope for is some rough estimate, which the counters
on BW should give me.

> Modern Intel CPUs don't have hardware counters for Flops any more, as
This being BW, it's neither Intel nor a strictly modern CPU :-), but
AMD Interlagos CPUs instead. My laptop, on the other hand, with a
modern Intel CPU, simply does not provide any PAPI numbers at all.

> In other words, using a hardware performance counter to count
> operations is about as accurate as counting steps to measure distance.
> There's a correlation, but it's difficult to quantify the error.
True.

Yours,
Roland

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.




Re: [Users] PAPI thorn and threads

2017-01-13 Thread Erik Schnetter
On Fri, Jan 13, 2017 at 8:46 AM, Roland Haas  wrote:
> Hello Ian,
>
>> I have never been able to get anything like realistic FLOPS numbers
>> from PAPI. I have not tried recently.  I think I heard that the
>> hardware counter interfaces were only ever originally intended as
>> debugging tools to be used by the processor manufacturers themselves,
>> and were quite unreliable.  This might have changed in recent CPUs.
>> Do you get numbers consistent with what you expect?  BlueWaters
>> doesn't use very recent CPUs.  I didn't know that the PAPI thorn
>> tests this; that is nice!
> BW is likely one of the better candidates for good numbers since PAPI
> is supported by Cray (https://bluewaters.ncsa.illinois.edu/papi). There
> are even (non-user-accessible) counters that always count how many
> flops are used in a job and that one can query (unless the user code
> uses PAPI, in which case the counters are not usable by the system).
>
> PAPI (or at least the counters) is pretty much my only hope as counting
> the number of flops in a GR+Hydro+AMR+Neutrino simulation is quite
> hopeless.

You mean "counting the number of flops in a Hydro+AMR+Neutrino
simulation is quite hopeless". Counting in GR is not hopeless since
Kranc can do that for us. (Okay, it doesn't count the stencil
operations yet.)

I just thought I'd wedge in another advertisement for using code generators.

Modern Intel CPUs don't have hardware counters for Flops any more, as
the measure that is of interest to users ("how many operations did my
Fortran code contain?") is irrelevant to the CPU. Since the
floating-point unit is idle almost all the time (often ~90% of the
time), the CPU aggressively uses speculative execution for
floating-point operations. The number of speculatively executed and
then discarded (!) operations can be several times higher than the
number of "useful" operations. This is good for overall performance,
but it makes the numbers basically impossible to interpret.

In addition, the larger vector sizes (e.g. 4 for AVX, now 8 for
AVX-512) mean that there are often unused vector lanes. If you count
flops via hardware instructions, these unused lanes are still included.

Finally, compilers can transform code in ways that increase the
number of operations. This is called "rematerialization". If there is
an intermediate result that is used multiple times, then the compiler
needs to choose between (a) storing it and (b) re-calculating it. If
there are no free registers available (e.g. because there are already
too many local variables), then re-calculating (1 cycle) is cheaper
than loading/storing (several cycles each time). So e.g. the code

tmp = A + B;
x += tmp;
...
y += tmp;

can be transformed to

x += A + B;
...
y += A + B;

which has one more operation, but one fewer variable.

In other words, using a hardware performance counter to count
operations is about as accurate as counting steps to measure distance.
There's a correlation, but it's difficult to quantify the error.

-erik

-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [Users] PAPI thorn and threads

2017-01-13 Thread Roland Haas
Hello Ian,

> I have never been able to get anything like realistic FLOPS numbers
> from PAPI. I have not tried recently.  I think I heard that the
> hardware counter interfaces were only ever originally intended as
> debugging tools to be used by the processor manufacturers themselves,
> and were quite unreliable.  This might have changed in recent CPUs.
> Do you get numbers consistent with what you expect?  BlueWaters
> doesn't use very recent CPUs.  I didn't know that the PAPI thorn
> tests this; that is nice!
BW is likely one of the better candidates for good numbers since PAPI
is supported by Cray (https://bluewaters.ncsa.illinois.edu/papi). There
are even (non-user-accessible) counters that always count how many
flops are used in a job and that one can query (unless the user code
uses PAPI, in which case the counters are not usable by the system).

PAPI (or at least the counters) is pretty much my only hope as counting
the number of flops in a GR+Hydro+AMR+Neutrino simulation is quite
hopeless. 

Yours,
Roland

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.




Re: [Users] PAPI thorn and threads

2017-01-12 Thread Ian Hinder

On 12 Jan 2017, at 21:22, Roland Haas  wrote:

> Hello Frank,
> 
> thanks. That is somewhat reassuring. I also did some experiments and
> consulted the PAPI thorn's source code (its clocks file), and it seems
> as if it always accumulates counter values over all threads when
> reading out PAPI counters, so things do in fact work as hoped for
> (namely, the flop counter counts all flops in an MPI rank and not just
> on thread zero). My threads were bound to kernel-level threads (and
> cores, for that matter) since I ran my tests on Blue Waters.

Hi Roland,

I have never been able to get anything like realistic FLOPS numbers from PAPI. 
I have not tried recently.  I think I heard that the hardware counter 
interfaces were only ever originally intended as debugging tools to be used by 
the processor manufacturers themselves, and were quite unreliable.  This might 
have changed in recent CPUs.  Do you get numbers consistent with what you 
expect?  BlueWaters doesn't use very recent CPUs.  I didn't know that the PAPI 
thorn tests this; that is nice!

--
Ian Hinder
http://members.aei.mpg.de/ianhin





Re: [Users] PAPI thorn and threads

2017-01-12 Thread Roland Haas
Hello Frank,

thanks. That is somewhat reassuring. I also did some experiments and
consulted the PAPI thorn's source code (its clocks file), and it seems
as if it always accumulates counter values over all threads when
reading out PAPI counters, so things do in fact work as hoped for
(namely, the flop counter counts all flops in an MPI rank and not just
on thread zero). My threads were bound to kernel-level threads (and
cores, for that matter) since I ran my tests on Blue Waters.
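
For reference, here is a minimal sketch of the pattern I believe is at
work, written from scratch rather than taken from the thorn's clocks
file: each OpenMP thread counts its own PAPI_FP_OPS events and the
per-thread values are summed into a per-rank total. The event choice,
loop and variable names are only illustrative, and it assumes
PAPI_FP_OPS is actually available on the machine.

/* Sketch only (not the PAPI thorn's actual code): sum per-thread
 * PAPI flop counts over all OpenMP threads of one MPI rank. */
#include <stdio.h>
#include <omp.h>
#include <papi.h>

int main(void)
{
  long long rank_total = 0;  /* flops accumulated over all threads */

  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    return 1;
  /* Tell PAPI how to identify threads; here: OpenMP thread ids. */
  if (PAPI_thread_init((unsigned long (*)(void)) omp_get_thread_num) != PAPI_OK)
    return 1;

#pragma omp parallel reduction(+:rank_total)
  {
    int evset = PAPI_NULL;
    long long count = 0;
    volatile double x = 1.0;  /* volatile keeps the FP work from being optimized away */

    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_FP_OPS);  /* floating-point operation count */
    PAPI_start(evset);

    for (int i = 0; i < 1000000; ++i)    /* some per-thread FP work */
      x = x * 1.0000001 + 1.0e-9;

    PAPI_stop(evset, &count);            /* this thread's count */
    rank_total += count;                 /* summed over threads via the reduction */
  }

  printf("flops on this rank (all threads): %lld\n", rank_total);
  return 0;
}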

Yours,
Roland

> On Thu, Jan 12, 2017 at 11:43:43AM -0600, Roland Haas wrote:
> >does anyone know if the floating point event counts reported by PAPI
> >are summed over all threads inside of a MPI rank? Or is it only the
> >count on thread 0?  
> 
> From the documentation the answer seems to be "it depends":
> 
> In order to support threaded operation, the operating system must save and 
> restore the counter hardware upon context switches among different threads or 
> processes. However, OpenMP hides the concept of user and kernel level threads 
> from the user. As a result, unless the user explicitly takes action to bind 
> their thread to a kernel thread (sometimes called a Light Weight Process or 
> LWP), the counts returned by PAPI will not necessarily be accurate.
> 
> To address this situation, PAPI treats every platform as if it is running on 
> top of kernel threads.
> 
> Unbound, user level threads that call PAPI will function properly, but will 
> most likely return unreliable or inaccurate event counts.
> 
> Fortunately, in the batch environments of the HPC community, there is no 
> significant advantage to user level threads and thus kernel level threads are 
> the default. Frank
> 



-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.




Re: [Users] PAPI thorn and threads

2017-01-12 Thread Erik Schnetter
Roland

PAPI has a mechanism to work either on a whole process or on a single
thread. I believe Cactus switches PAPI to threaded mode by default.
This works only for operating-system threads (OpenMP), not for
user-level threads (FunHPC). I don't recall the details, but the PAPI
API documentation should describe this.

In Cactus, we initially run some tests (probably a DGEMM) to check
whether PAPI's numbers are consistent with what we expect. This might
help answer this question.
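
To illustrate the idea (a sketch of the concept only, not the actual
self-test code in Cactus): run a kernel whose flop count you know
analytically, e.g. a naive matrix multiply with roughly 2*N^3
operations, and compare that against what PAPI reports; the ratio will
typically deviate from 1 due to things like speculation and
vectorization. Names and sizes below are illustrative.

/* Sketch of the consistency-check idea (not the actual Cactus test):
 * run a kernel with a known flop count and compare with PAPI. */
#include <stdio.h>
#include <papi.h>

#define N 200   /* naive N x N matrix multiply: ~2*N^3 flops */

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
  int evset = PAPI_NULL;
  long long measured = 0;

  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    return 1;
  PAPI_create_eventset(&evset);
  PAPI_add_event(evset, PAPI_FP_OPS);

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) { a[i][j] = 1.0; b[i][j] = 2.0; }

  PAPI_start(evset);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        c[i][j] += a[i][k] * b[k][j];   /* 1 multiply + 1 add per iteration */
  PAPI_stop(evset, &measured);

  long long expected = 2LL * N * N * N;
  printf("expected ~%lld flops, PAPI reports %lld (ratio %.2f)\n",
         expected, measured, (double) measured / expected);
  return 0;
}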

I think that handling multi-threading correctly requires the operating
system to cooperate. On an HPC system, the kernel might have been
modified in a way that causes problems. This is just a wild guess, though.

-erik


On Thu, Jan 12, 2017 at 12:43 PM, Roland Haas  wrote:
> Hello all,
>
> does anyone know if the floating point event counts reported by PAPI
> are summed over all threads inside of a MPI rank? Or is it only the
> count on thread 0?
>
> I would hope for the former but suspect the latter.
>
> That is, if I were to run the same job using ncores cores, once with
> nranks MPI ranks and nthreads threads per rank and once with ncores
> MPI ranks and 1 thread per rank, would the sum over all *reported*
> event counts of all ranks (roughly, neglecting ghost zones etc.)
> agree?
>
> Yours,
> Roland
>
> --
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://keys.gnupg.net.
>
> ___
> Users mailing list
> Users@cactuscode.org
> http://cactuscode.org/mailman/listinfo/users
>



-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [Users] PAPI thorn and threads

2017-01-12 Thread Frank Loeffler

On Thu, Jan 12, 2017 at 11:43:43AM -0600, Roland Haas wrote:

> does anyone know if the floating point event counts reported by PAPI
> are summed over all threads inside of a MPI rank? Or is it only the
> count on thread 0?


From the documentation the answer seems to be "it depends":

In order to support threaded operation, the operating system must save 
and restore the counter hardware upon context switches among different 
threads or processes. However, OpenMP hides the concept of user and 
kernel level threads from the user. As a result, unless the user 
explicitly takes action to bind their thread to a kernel thread 
(sometimes called a Light Weight Process or LWP), the counts returned by 
PAPI will not necessarily be accurate.


To address this situation, PAPI treats every platform as if it is 
running on top of kernel threads.


Unbound, user level threads that call PAPI will function properly, but 
will most likely return unreliable or inaccurate event counts.


Fortunately, in the batch environments of the HPC community, there is no 
significant advantage to user level threads and thus kernel level 
threads are the default. 


Frank


