Re: [Users] PAPI thorn and threads
Hello Erik, all,

> You mean "counting the number of flops in a Hydro+AMR+Neutrino
> simulation is quite hopeless". Counting in GR is not hopeless since
> Kranc can do that for us. (Okay, it doesn't count the stencil
> operations yet.)

Well, it's worse than that, really. Even *if* the code (or code
generator) were small enough that I would try to count the number of
FLOPs for a point in the interior of the grid on a single level
(ignoring buffer zones, ghost zones, and the fact that some values are
recomputed in buffer zones while others are interpolated), the hydro
codes almost always have different code paths depending on the values
on the grid (e.g. in the reconstruction, the Riemann solvers, or the
non-linear root finding in con2prim, which is usually ~1/3 of the
computational cost). So the number of FLOPs used for a grid point is
not predictable unless one knows exactly what data will be present at
that grid point. The best I can hope for is a rough estimate, which
the counters on BW should give me.

> Modern Intel CPUs don't have hardware counters for Flops any more, as

This being BW, it is neither Intel nor, strictly speaking, modern
CPUs :-), but AMD Interlagos CPUs instead. My laptop, on the other
hand, with a modern Intel CPU, simply does not provide any PAPI
numbers at all.

> In other words, using a hardware performance counter to count
> operations is about as accurate as counting steps to measure distance.
> There's a correlation, but it's difficult to quantify the error.

True.

Yours,
Roland
--
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.
___
Users mailing list
Users@cactuscode.org
http://cactuscode.org/mailman/listinfo/users
Re: [Users] PAPI thorn and threads
On Fri, Jan 13, 2017 at 8:46 AM, Roland Haas wrote:
> Hello Ian,
>
>> I have never been able to get anything like realistic FLOPS numbers
>> from PAPI. I have not tried recently. I think I heard that the
>> hardware counter interfaces were only ever originally intended as
>> debugging tools to be used by the processor manufacturers themselves,
>> and were quite unreliable. This might have changed in recent CPUs.
>> Do you get numbers consistent with what you expect? BlueWaters
>> doesn't use very recent CPUs. I didn't know that the PAPI thorn
>> tests this; that is nice!
>
> BW is likely one of the better candidates for good numbers since PAPI
> is supported by Cray (https://bluewaters.ncsa.illinois.edu/papi). There
> are even (non-user-accessible) counters that always count how many
> flops are used in a job and that one can query (unless the user code
> uses PAPI, in which case the counters are not usable by the system).
>
> PAPI (or at least the counters) is pretty much my only hope as counting
> the number of flops in a GR+Hydro+AMR+Neutrino simulation is quite
> hopeless.

You mean "counting the number of flops in a Hydro+AMR+Neutrino
simulation is quite hopeless". Counting in GR is not hopeless since
Kranc can do that for us. (Okay, it doesn't count the stencil
operations yet.) I just thought I'd wedge in another advertisement for
using code generators.

Modern Intel CPUs don't have hardware counters for Flops any more, as
the measure that is of interest to users ("how many operations did my
Fortran code contain?") is irrelevant for the CPU. Since the
floating-point unit is idle almost all the time (often ~90% of the
time), the CPU aggressively uses speculative execution for
floating-point operations. The number of speculatively executed and
then discarded (!) operations can be several times higher than the
number of "useful" operations. This is good for overall performance,
but it makes the numbers basically impossible to interpret.

In addition, the larger vector sizes (e.g. 4 for AVX, now 8 for
AVX-512) mean that there are often unused vector lanes. If you count
hardware instructions, then these are still included.

Finally, compilers can transform code in ways that increase the number
of operations. One such transformation is called "rematerialization":
if there is an intermediate result that is used multiple times, the
compiler needs to choose between (a) storing it and (b) re-calculating
it. If there are no free registers available (e.g. because there are
already too many local variables), then re-calculating (1 cycle) is
cheaper than loading/storing (several cycles each time). So e.g. the
code

    tmp = A + B; x += tmp; ... y += tmp;

can be transformed to

    x += A + B; ... y += A + B;

which has one more operation, but uses one fewer variable.

In other words, using a hardware performance counter to count
operations is about as accurate as counting steps to measure distance.
There's a correlation, but it's difficult to quantify the error.

-erik

--
Erik Schnetter
http://www.perimeterinstitute.ca/personal/eschnetter/
Re: [Users] PAPI thorn and threads
Hello Ian,

> I have never been able to get anything like realistic FLOPS numbers
> from PAPI. I have not tried recently. I think I heard that the
> hardware counter interfaces were only ever originally intended as
> debugging tools to be used by the processor manufacturers themselves,
> and were quite unreliable. This might have changed in recent CPUs.
> Do you get numbers consistent with what you expect? BlueWaters
> doesn't use very recent CPUs. I didn't know that the PAPI thorn
> tests this; that is nice!

BW is likely one of the better candidates for good numbers since PAPI
is supported by Cray (https://bluewaters.ncsa.illinois.edu/papi). There
are even (non-user-accessible) counters that always count how many
flops are used in a job and that one can query (unless the user code
uses PAPI, in which case the counters are not usable by the system).

PAPI (or at least the counters) is pretty much my only hope as counting
the number of flops in a GR+Hydro+AMR+Neutrino simulation is quite
hopeless.

Yours,
Roland
Re: [Users] PAPI thorn and threads
On 12 Jan 2017, at 21:22, Roland Haas wrote:
> Hello Frank,
>
> thanks. That is somewhat reassuring. I also did some experiments and
> consulted the PAPI thorn's source code (its clocks file) and it seems
> as if it always accumulates counter values over all threads when
> reading out PAPI counters, so things do in fact work as hoped for
> (namely the flop counter counts all flops in an MPI rank and not just
> on thread zero). My threads were bound to kernel-level threads (and
> cores, for that matter) since I ran my tests on Blue Waters.

Hi Roland,

I have never been able to get anything like realistic FLOPS numbers
from PAPI. I have not tried recently. I think I heard that the
hardware counter interfaces were only ever originally intended as
debugging tools to be used by the processor manufacturers themselves,
and were quite unreliable. This might have changed in recent CPUs.
Do you get numbers consistent with what you expect? BlueWaters
doesn't use very recent CPUs. I didn't know that the PAPI thorn
tests this; that is nice!

--
Ian Hinder
http://members.aei.mpg.de/ianhin
Re: [Users] PAPI thorn and threads
Hello Frank,

thanks. That is somewhat reassuring. I also did some experiments and
consulted the PAPI thorn's source code (its clocks file) and it seems
as if it always accumulates counter values over all threads when
reading out PAPI counters, so things do in fact work as hoped for
(namely the flop counter counts all flops in an MPI rank and not just
on thread zero). My threads were bound to kernel-level threads (and
cores, for that matter) since I ran my tests on Blue Waters.

Yours,
Roland

> On Thu, Jan 12, 2017 at 11:43:43AM -0600, Roland Haas wrote:
> > does anyone know if the floating point event counts reported by PAPI
> > are summed over all threads inside of an MPI rank? Or is it only the
> > count on thread 0?
>
> From the documentation the answer seems to be "it depends":
>
> In order to support threaded operation, the operating system must save
> and restore the counter hardware upon context switches among different
> threads or processes. However, OpenMP hides the concept of user and
> kernel level threads from the user. As a result, unless the user
> explicitly takes action to bind their thread to a kernel thread
> (sometimes called a Light Weight Process or LWP), the counts returned
> by PAPI will not necessarily be accurate.
>
> To address this situation, PAPI treats every platform as if it is
> running on top of kernel threads.
>
> Unbound, user level threads that call PAPI will function properly, but
> will most likely return unreliable or inaccurate event counts.
>
> Fortunately, in the batch environments of the HPC community, there is
> no significant advantage to user level threads and thus kernel level
> threads are the default.
>
> Frank
Re: [Users] PAPI thorn and threads
Roland,

PAPI has a mechanism to work either on a process or on a single
thread. I believe Cactus switches PAPI to threaded mode by default.
This works only for operating-system threads (OpenMP), not for
user-level threads (FunHPC). I don't recall the details, but the PAPI
documentation should describe this in its API documentation.

In Cactus, we initially run some tests (probably a DGEMM) to check
whether PAPI's numbers are consistent with what we expect. This might
help answer this question.

I think that handling multi-threading correctly requires the operating
system to cooperate. On an HPC system, the kernel might have been
modified and cause problems. This is just a wild guess, though.

-erik

On Thu, Jan 12, 2017 at 12:43 PM, Roland Haas wrote:
> Hello all,
>
> does anyone know if the floating point event counts reported by PAPI
> are summed over all threads inside of an MPI rank? Or is it only the
> count on thread 0?
>
> I would hope for the former but suspect the latter.
>
> That is, if I were to run the same job using ncores cores, once with
> nranks MPI ranks and nthreads threads per rank and once with ncores
> MPI ranks and 1 thread per rank, would the sum over all *reported*
> event counts of all ranks (roughly, neglecting ghost zones etc.)
> agree?
>
> Yours,
> Roland
Re: [Users] PAPI thorn and threads
On Thu, Jan 12, 2017 at 11:43:43AM -0600, Roland Haas wrote:
> does anyone know if the floating point event counts reported by PAPI
> are summed over all threads inside of an MPI rank? Or is it only the
> count on thread 0?

From the documentation the answer seems to be "it depends":

In order to support threaded operation, the operating system must save
and restore the counter hardware upon context switches among different
threads or processes. However, OpenMP hides the concept of user and
kernel level threads from the user. As a result, unless the user
explicitly takes action to bind their thread to a kernel thread
(sometimes called a Light Weight Process or LWP), the counts returned
by PAPI will not necessarily be accurate.

To address this situation, PAPI treats every platform as if it is
running on top of kernel threads.

Unbound, user level threads that call PAPI will function properly, but
will most likely return unreliable or inaccurate event counts.

Fortunately, in the batch environments of the HPC community, there is
no significant advantage to user level threads and thus kernel level
threads are the default.

Frank