Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-27 Thread Jonathan D. Halverson
Hi Szilárd,

Our OS is RHEL 7.6.

Thank you for your test results. It's nice to see consistent results on a 
POWER9 system.

Your suggestion of allocating the whole node was a good one. I did this in two 
ways. The first was to bypass the Slurm scheduler by ssh-ing to an empty node 
and running the benchmark. The second way was through Slurm using the 
--exclusive directive (which allocates the entire node independent of job size). 
In both cases, which used 32 hardware threads and one V100 GPU for ADH (PME, 
cubic, 40k steps), the performance was about 132 ns/day which is significantly 
better than the 90 ns/day from before (without --exclusive). Links to the 
md.log files are below. Here is the Slurm script with --exclusive:

--
#!/bin/bash
#SBATCH --job-name=gmx   # create a short name for your job
#SBATCH --nodes=1# node count
#SBATCH --ntasks=1   # total number of tasks across all nodes
#SBATCH --cpus-per-task=32   # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=8G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00  # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1 # number of gpus per node
#SBATCH --exclusive  # TASK AFFINITIES SET CORRECTLY BUT ENTIRE NODE ALLOCATED TO JOB

module purge
module load cudatoolkit/10.2

BCH=../adh_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
srun gmx mdrun -nsteps 40000 -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
--

Here are the log files:

md.log with --exclusive:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.with-exclusive

md.log without --exclusive:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.without-exclusive

Szilárd, what is your reading of these two files?

This is a shared cluster so I can't use --exclusive for all jobs. Our nodes 
have four GPUs and 128 hardware threads (SMT4 so 32 cores over 2 sockets). Any 
thoughts on how to make a job behave like it is being run with --exclusive? The 
task affinities are apparently not being set properly in that case.

To solve this I tried experimenting with the --cpu-bind settings. When 
--exclusive is not used, I find a slight performance gain by using 
--cpu-bind=cores:
srun --cpu-bind=cores gmx mdrun -nsteps 40000 -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr

In this case it still gives "NOTE: Thread affinity was not set" and performance 
is still poor.
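
To see what the scheduler is actually handing the job when --exclusive is not
used, something like this (an untested sketch) run from inside the same
allocation prints the allowed-CPU list of each task:

srun --cpu-bind=cores bash -c 'echo task $SLURM_PROCID on $(hostname):; grep Cpus_allowed_list /proc/self/status'

If the cgroup/affinity setup is working, this should show only the 32 hardware
threads assigned to the job rather than the full 0-127 range.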

The --exclusive result suggests that the failed hardware unit test can be 
ignored, I believe.

Here's a bit about our Slurm configuration:
$ grep -i affinity /etc/slurm/slurm.conf
TaskPlugin=affinity,cgroup

ldd shows that gmx is linked against libhwloc.so.5.

I have not heard from my contact at ORNL. All I can find online is that they 
offer GROMACS 5.1 (https://www.olcf.ornl.gov/software_package/gromacs/) and 
apparently nothing special is done about thread affinities.

Jon




Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-24 Thread Szilárd Páll
Hi,

Affinity settings on the Talos II with Ubuntu 18.04, kernel 5.0, work fine.
I get threads pinned where they should be (hwloc confirmed) and consistent
results. I also get reasonable thread placement even without pinning (i.e.
the kernel scatters first until #threads <= #hwthreads). I see only a minor
penalty to not pinning -- not too surprising given that I have a single
NUMA node and the kernel is doing its job.

Here are my quick test results, run on an 8-core Talos II Power9 + a GPU,
using the adh_cubic input:

$ grep Perf *.log
test_1x1_rep1.log:Performance:   16.617
test_1x1_rep2.log:Performance:   16.479
test_1x1_rep3.log:Performance:   16.520
test_1x2_rep1.log:Performance:   32.034
test_1x2_rep2.log:Performance:   32.389
test_1x2_rep3.log:Performance:   32.340
test_1x4_rep1.log:Performance:   62.341
test_1x4_rep2.log:Performance:   62.569
test_1x4_rep3.log:Performance:   62.476
test_1x8_rep1.log:Performance:   97.049
test_1x8_rep2.log:Performance:   96.653
test_1x8_rep3.log:Performance:   96.889


This seems to point towards some issue with the OS or setup on the IBM
machines you have -- and the unit test error may be one of the symptoms of
it (as it suggests something is off with the hardware topology and a NUMA
node is missing from it). I'd still suggest checking whether a full node
allocation, with all threads, memory, etc. passed to the job, results in
successful affinity settings i) in mdrun, ii) in some other tool.
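
For (ii), a rough sketch that does not rely on GROMACS at all is to inspect
the affinity mask of every mdrun thread while the job is running (assuming a
single gmx process owned by the user):

pid=$(pgrep -u $USER -n gmx)
grep Cpus_allowed_list /proc/$pid/task/*/status

With working pinning each thread should report a single CPU (or a narrow
range), not the full 0-127 list.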

Please update this thread if you have further findings.

Cheers,
--
Szilárd



Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-24 Thread Szilárd Páll
> The following lines are found in md.log for the POWER9/V100 run:
>
> Overriding thread affinity set outside gmx mdrun
> Pinning threads with an auto-selected logical core stride of 128
> NOTE: Thread affinity was not set.
>
> The full md.log is available here:
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log


I glanced over that at first, will see if I can reproduce it, though I only
have access to a Raptor Talos running Ubuntu, not an IBM machine.

What OS are you using?




Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-24 Thread Szilárd Páll
On Fri, Apr 24, 2020 at 5:55 AM Alex  wrote:

> Hi Kevin,
>
> We've been having issues with Power9/V100 very similar to what Jon
> described and basically settled on what I believe is sub-par
> performance. We tested it on systems with ~30-50K particles and threads
> simply cannot be pinned.


What does that mean, and how did you verify it?
The Linux kernel can in general set affinities on ppc64el, whether that's
requested by mdrun or some other tool, so if you have observed that the
affinity mask is not respected (or does not change), that is more likely an
OS / setup issue, I'd think.
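
A trivial way to test whether the kernel honours an externally set mask on
your nodes, independent of both GROMACS and Slurm (just a sketch):

taskset -c 4 bash -c 'grep Cpus_allowed_list /proc/self/status'

should report 4; if it shows the full CPU list instead, the problem is below
GROMACS.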

What is different compared to x86 is that the hardware thread layout is
different on Power9 (with default Linux kernel configs) and hardware
threads are exposed as consecutive "CPUs" by the OS rather than strided by
#cores.

I could try to sum up some details on how to set affinities (with mdrun or
external tools), if that is of interest. However, it really should be
something that's possible to do even through the job scheduler (given a
reasonable system configuration).
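
As a quick sketch (assuming the default consecutive SMT4 numbering, i.e. CPUs
0-3 are the four hardware threads of core 0), one OpenMP thread per physical
core on a 32-core node would be something like

gmx mdrun -ntmpi 1 -ntomp 32 -pin on -pinoffset 0 -pinstride 4 -s bench.tpr

and -ntomp 128 -pinstride 1 would use all four hardware threads of every core;
please double-check the stride against the actual CPU numbering (lscpu -e or
hwloc).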


> As far as Gromacs is concerned, our brand-new
> Power9 nodes operate as if they were based on Intel CPUs (two threads
> per core)


Unless the hardware thread layout has been changed, that's perhaps not the
case, see above.


> and zero advantage of IBM parallelization is being taken.
>

You mean the SMT4?


> Other users of the same nodes reported similar issues with other
> software, which to me suggests that our sysadmins don't really know how
> to set these nodes up.
>
> At this point, if someone could figure out a clear set of build
> instructions in combination with slurm/mdrun inputs, it would be very
> much appreciated.
>

Have you checked the public documentation on ORNL's sites? GROMACS has been
used successfully on Summit. What about IBM support?

--
Szilárd


>
> Alex
>
Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-24 Thread Jonathan D. Halverson
I cannot force the pinning via GROMACS so I will look at what can be done with 
hwloc.
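
As a first sketch (assuming the hwloc command-line tools are installed),
binding the whole mdrun process to a block of cores from outside GROMACS could
look like:

hwloc-bind core:0-7 -- gmx mdrun -ntomp 8 -pin off -s bench.tpr

with -pin off so mdrun does not try to override the externally set mask.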

On the POWER9 the hardware appears to be detected correctly (only the Intel
system prints a note):
Running on 1 node with total 128 cores, 128 logical cores, 1 compatible GPU

But during the build it fails the HardwareUnitTests:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log#L3338


Here are more benchmarks based on Kevin and Szilárd's suggestions:

ADH (134177 atoms, 
ftp://ftp.gromacs.org/pub/benchmarks/ADH_bench_systems.tar.gz)
2019.6, PME and cubic box
nsteps = 40000

Intel Broadwell-NVIDIA P100
ntomp (rate, wall time)
1 (21 ns/day, 323 s)
4 (56 ns/day, 123 s)
8 (69 ns/day, 100 s)

IBM POWER9-NVIDIA V100
ntomp (rate, wall time)
 1 (14 ns/day, 500 s)
 1 (14 ns/day, 502 s)
 1 (14 ns/day, 510 s)
 4 (19 ns/day, 357 s)
 4 (17 ns/day, 397 s)
 4 (20 ns/day, 346 s)
 8 (30 ns/day, 232 s)
 8 (24 ns/day, 288 s)
 8 (31 ns/day, 222 s)
16 (59 ns/day, 117 s)
16 (65 ns/day, 107 s)
16 (63 ns/day, 110 s) [md.log on GitHub is https://bit.ly/3aCm1gw]
32 (89 ns/day,  76 s)
32 (93 ns/day,  75 s)
32 (89 ns/day,  78 s)
64 (57 ns/day, 122 s)
64 (43 ns/day, 159 s)
64 (46 ns/day, 152 s)

Yes, there is variability between identical runs for POWER9/V100.

For the Intel case, ntomp equals the number of physical cores. For the IBM
case, ntomp is equal to the number of hardware threads (4 hardware threads per
physical core). On a per-physical-core basis these numbers look better, but
clearly there are still problems.

I tried different values for -pinoffset but didn't see performance gains that
couldn't be explained by the run-to-run variation.

I've written to contacts at ORNL and IBM.

Jon



Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-24 Thread Szilárd Páll
Using a single thread per GPU, as the linked log files show, is not
sufficient for GROMACS (and any modern machine should have more than that
anyway), but I take it from your mail that this was only meant to debug the
performance instability?

Your performance variations on Power9 may be related to the fact that you are
either not setting affinities or the affinity settings are not correct.
However, you also have a job scheduler in the way (which I suspect is either
not configured well or is not passed the required options to correctly assign
resources to jobs) that obfuscates the machine layout and makes things look
weird to mdrun [1].

I suggest simplifying the problem and debugging it step by step. Start by
allocating full nodes and test that you can pin (either with mdrun -pin on or
with hwloc) and avoid [1], then work out what you should expect from the node
sharing that does not seem to work correctly. Building GROMACS with hwloc may
help, as you get better reporting in the log.
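
If I recall correctly the relevant CMake switch is GMX_HWLOC (treat that as an
assumption and check the cmake output), e.g. added to the existing cmake3
invocation:

cmake3 .. -DGMX_HWLOC=ON    # plus the options already used; requires hwloc headers/libraries to be findable

The hardware detection report in md.log should then contain more detail about
the core/NUMA/GPU topology.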

[1]
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100#L58

--
Szilárd



Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-23 Thread Alex

Hi Kevin,

We've been having issues with Power9/V100 very similar to what Jon 
described and basically settled on what I believe is sub-par 
performance. We tested it on systems with ~30-50K particles and threads 
simply cannot be pinned. As far as Gromacs is concerned, our brand-new 
Power9 nodes operate as if they were based on Intel CPUs (two threads 
per core) and zero advantage of IBM parallelization is being taken. 
Other users of the same nodes reported similar issues with other 
software, which to me suggests that our sysadmins don't really know how 
to set these nodes up.


At this point, if someone could figure out a clear set of build 
instructions in combination with slurm/mdrun inputs, it would be very 
much appreciated.


Alex


Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-23 Thread Kevin Boyd
I'm not entirely sure how thread-pinning plays with slurm allocations on
partial nodes. I always reserve the entire node when I use thread pinning,
and run a bunch of simulations by pinning to different cores manually,
rather than relying on slurm to divvy up resources for multiple jobs.

Looking at both logs now, a few more points:

* Your benchmarks are short enough that little things like cores spinning up
their frequencies can matter. I suggest running longer (increase nsteps in the
mdp or at the command line) and throwing away your initial benchmark data (see
-resetstep and -resethway) to avoid artifacts; a sketch command follows after
this list.
* Your benchmark system is quite small for such a powerful GPU. I might
expect better performance running multiple simulations per-GPU if the
workflows being run can rely on replicates, and a larger system would
probably scale better to the V100.
* The P100/Intel system appears to have pinned cores properly; it's unclear
whether that had a real impact on these benchmarks.
* It looks like the CPU-based computations were the primary contributors to
the observed difference in performance. That should decrease or go away with
increased core counts and shifting the update phase to the GPU. It may be (I
have no prior experience to indicate either way) that the Intel cores are
simply better on a 1-1 basis than the Power cores. If you have 4-8 cores per
simulation (try -ntomp 4 and increasing the allocation of your slurm job), the
individual core performance shouldn't matter too much; at the moment you're
almost certainly bottlenecked on one CPU core per GPU, which can emphasize
per-core performance differences.
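
As a rough sketch of what I mean by a longer run with the timers reset
(double-check the flag spellings with gmx mdrun -h):

gmx mdrun -s bench.tpr -nsteps 100000 -resethway -noconfout -pin on -ntomp 8

-resethway discards the first half of the run from the performance accounting;
-resetstep N does the same from an explicit step.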

Kevin


Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-23 Thread Jonathan D. Halverson
Hi Kevin,

md.log for the Intel run is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100

Thanks for the info on constraints with 2020. I'll try some runs with different 
values of -pinoffset for 2019.6.

I know a group at NIST is having the same or similar problems with POWER9/V100.

Jon


Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-23 Thread Kevin Boyd
Hi,

Can you post the full log for the Intel system? I typically find the real
cycle and time accounting section a better place to start debugging
performance issues.

A couple of quick notes, but I need a side-by-side comparison for more useful
analysis, and these points may apply to both systems, so they may not be your
root cause:
* At first glance, your Power system spends 1/3 of its time in constraint
calculation, which is unusual. This can be reduced in 2 ways - first, by
adding more CPU cores. It doesn't make a ton of sense to benchmark on one
core if your applications will use more. Second, if you upgrade to Gromacs
2020 you can probably put the constraint calculation on the GPU with
-update gpu.
* The Power system log has this line:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
indicating that threads perhaps were not actually pinned. Try adding
-pinoffset 0 (or some other core) to specify where you want the process
pinned.
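
For example, something along these lines (adjust the offset and thread count
to your allocation; the -update gpu part applies only if you move to GROMACS
2020):

gmx mdrun -s bench.tpr -ntmpi 1 -ntomp 4 -pin on -pinoffset 0              # 2019.6
gmx mdrun -s bench.tpr -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -update gpu  # 2020.x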

Kevin

On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
halver...@princeton.edu> wrote:

> *Message sent from a system outside of UConn.*
>
>
> We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
> IBM POWER9/V100 node versus an Intel Broadwell/P100. Both are running RHEL
> 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel
> nodes. Everything below is about the POWER9/V100 node.
>
> We ran the RNASE benchmark with 2019.6 with PME and cubic box using 1
> CPU-core and 1 GPU (
> ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
> found that the Broadwell/P100 gives 144 ns/day while POWER9/V100 gives 102
> ns/day. The difference in performance is roughly the same for the larger
> ADH benchmark and when different numbers of CPU-cores are used. GROMACS is
> always underperforming on our POWER9/V100 nodes. We have pinning turned on
> (see Slurm script at bottom).
>
> Below is our build procedure on the POWER9/V100 node:
>
> version_gmx=2019.6
> wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
> tar zxvf gromacs-${version_gmx}.tar.gz
> cd gromacs-${version_gmx}
> mkdir build && cd build
>
> module purge
> module load rh/devtoolset/7
> module load cudatoolkit/10.2
>
> OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
>
> cmake3 .. -DCMAKE_BUILD_TYPE=Release \
> -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
> -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
> -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
> -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
> -DGMX_BUILD_OWN_FFTW=ON \
> -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
> -DGMX_OPENMP_MAX_THREADS=128 \
> -DCMAKE_INSTALL_PREFIX=$HOME/.local \
> -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
>
> make -j 10
> make check
> make install
>
> 45 of the 46 tests pass with the exception being HardwareUnitTests. There
> are several posts about this and apparently it is not a concern. The full
> build log is here:
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
>
>
>
> Here is more info about our POWER9/V100 node:
>
> $ lscpu
> Architecture:  ppc64le
> Byte Order:Little Endian
> CPU(s):128
> On-line CPU(s) list:   0-127
> Thread(s) per core:4
> Core(s) per socket:16
> Socket(s): 2
> NUMA node(s):  6
> Model: 2.3 (pvr 004e 1203)
> Model name:POWER9, altivec supported
> CPU max MHz:   3800.0000
> CPU min MHz:   2300.0000
>
> You see that we have 4 hardware threads per physical core. If we use 4
> hardware threads on the RNASE benchmark instead of 1 the performance goes
> to 119 ns/day which is still about 20% less than the Broadwell/P100 value.
> When using multiple CPU-cores on the POWER9/V100 there is significant
> variation in the execution time of the code.
>
> There are four GPUs per POWER9/V100 node:
>
> $ nvidia-smi -q
> Driver Version  : 440.33.01
> CUDA Version: 10.2
> GPU 0004:04:00.0
> Product Name: Tesla V100-SXM2-32GB
>
> The GPUs have been shown to perform as expected on other applications.
>
>
>
>
> The following lines are found in md.log for the POWER9/V100 run:
>
> Overriding thread affinity set outside gmx mdrun
> Pinning threads with an auto-selected logical core stride of 128
> NOTE: Thread affinity was not set.
>
> The full md.log is available here:
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
>
>
>
>
> Below are the MegaFlops Accounting for the POWER9/V100 versus
> Broadwell/P100:
>
>  IBM POWER9 WITH NVIDIA V100 
> Computing:                           M-Number       M-Flops   % Flops
> -----------------------------------------------------------------------------
>  Pair Search distance check        297.763872      2679.875       0.0
>  NxN Ewald Elec. + LJ [F]       244214.215808