Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-27 Thread Jonathan D. Halverson
Hi Szilárd,

Our OS is RHEL 7.6.

Thank you for your test results. It's nice to see consistent results on a 
POWER9 system.

Your suggestion of allocating the whole node was a good one. I did this in two
ways. The first was to bypass the Slurm scheduler by ssh-ing to an empty node
and running the benchmark. The second was through Slurm using the --exclusive
directive (which allocates the entire node independent of job size). Both cases
used 32 hardware threads and one V100 GPU for ADH (PME, cubic, 40k steps), and
the performance was about 132 ns/day, which is significantly better than the
90 ns/day from before (without --exclusive). Links to the md.log files are
below. Here is the Slurm script with --exclusive:

--
#!/bin/bash
#SBATCH --job-name=gmx           # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=32       # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=8G                 # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --exclusive              # TASK AFFINITIES SET CORRECTLY BUT ENTIRE NODE ALLOCATED TO JOB

module purge
module load cudatoolkit/10.2

BCH=../adh_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
srun gmx mdrun -nsteps 40000 -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
--

Here are the log files:

md.log with --exclusive:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.with-exclusive

md.log without --exclusive:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.without-exclusive

Szilárd, what is your reading of these two files?

This is a shared cluster so I can't use --exclusive for all jobs. Our nodes
have four GPUs and 128 hardware threads (SMT4, so 32 physical cores across 2
sockets). Any thoughts on how to make a job behave like it is being run with
--exclusive? The task affinities are apparently not being set properly without it.

To solve this I tried experimenting with the --cpu-bind settings. When 
--exclusive is not used, I find a slight performance gain by using 
--cpu-bind=cores:
srun --cpu-bind=cores gmx mdrun -nsteps 40000 -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr

In this case it still gives "NOTE: Thread affinity was not set" and performance 
is still poor.
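
Next I plan to try forcing the placement explicitly on the GROMACS side rather
than relying on the auto-selected stride, roughly along these lines (an
untested sketch; the thread count, offset and stride are just examples for one
GPU and 8 physical cores of the SMT4 node):

srun --cpu-bind=verbose,cores gmx mdrun -ntmpi 1 -ntomp 32 \
     -pin on -pinoffset 0 -pinstride 1 -s bench.tpr -nsteps 40000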

The --exclusive result suggests that the failed hardware unit test can be 
ignored, I believe.

Here's a bit about our Slurm configuration:
$ grep -i affinity /etc/slurm/slurm.conf
TaskPlugin=affinity,cgroup

ldd shows that gmx is linked against libhwloc.so.5.
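
For what it's worth, one way I plan to cross-check the topology that is
actually visible is to compare the full-node view with the view from inside a
regular (non-exclusive) job, along these lines (a sketch; hwloc-ls and numactl
are the stock tools):

hwloc-ls                   # topology as hwloc sees it on the full node
numactl --hardware         # NUMA nodes and their CPUs/memory
srun hwloc-ls              # the same, from inside a cgroup-restricted job step
srun numactl --hardware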

I have not heard from my contact at ORNL. All I can find online is that they 
offer GROMACS 5.1 (https://www.olcf.ornl.gov/software_package/gromacs/) and 
apparently nothing special is done about thread affinities.

Jon



From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se on behalf of Szilárd Páll
Sent: Friday, April 24, 2020 6:06 PM
To: Discussion list for GROMACS users 
Cc: gromacs.org_gmx-users@maillist.sys.kth.se 

Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

Hi,

Affinity settings on the Talos II with Ubuntu 18.04, kernel 5.0, work fine.
I get threads pinned where they should be (confirmed with hwloc) and consistent
results. I also get reasonable thread placement even without pinning (i.e.
the kernel scatters first until #threads <= #hwthreads). I see only a minor
penalty from not pinning -- not too surprising given that I have a single
NUMA node and the kernel is doing its job.

Here are my quick test results, run on an 8-core Talos II POWER9 + a GPU,
using the adh_cubic input (performance in ns/day):

$ grep Perf *.log
test_1x1_rep1.log:Performance:   16.617
test_1x1_rep2.log:Performance:   16.479
test_1x1_rep3.log:Performance:   16.520
test_1x2_rep1.log:Performance:   32.034
test_1x2_rep2.log:Performance:   32.389
test_1x2_rep3.log:Performance:   32.340
test_1x4_rep1.log:Performance:   62.341
test_1x4_rep2.log:Performance:   62.569
test_1x4_rep3.log:Performance:   62.476
test_1x8_rep1.log:Performance:   97.049
test_1x8_rep2.log:Performance:   96.653
test_1x8_rep3.log:Performance:   96.889
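
For reference, a scan like this can be scripted along the following lines (a
sketch; the .tpr name and step count are illustrative):

for nt in 1 2 4 8; do
  for rep in 1 2 3; do
    gmx mdrun -ntmpi 1 -ntomp $nt -pin on -nb gpu -noconfout \
              -s adh_cubic.tpr -nsteps 40000 -g test_1x${nt}_rep${rep}.log
  done
done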


This seems to point towards some issue with the OS or setup on the IBM
machines you have -- and the unit test error may be one of the symptoms of
it (as it suggests something is off with the hardware topology and a NUMA
node is missing from it). I'd still suggest checking whether a full node
allocation, with all threads, memory, etc. passed to the job, results in
successful affinity settings.

Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-24 Thread Jonathan D. Halverson
I cannot force the pinning via GROMACS so I will look at what can be done with 
hwloc.
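
The first thing I intend to try is launching mdrun under an explicit hwloc
binding and letting GROMACS leave the affinity alone, roughly (an untested
sketch; the core range is just an example):

hwloc-bind core:0-7 -- gmx mdrun -ntmpi 1 -ntomp 32 -pin off -s bench.tpr
# this only constrains the process mask; individual OpenMP threads can still
# migrate within it unless e.g. OMP_PLACES=cores is also set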

On the POWER9 the hardware appears to be detected correctly (only the Intel
log gives a note):
Running on 1 node with total 128 cores, 128 logical cores, 1 compatible GPU

But during the build it fails the HardwareUnitTests:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log#L3338


Here are more benchmarks based on Kevin and Szilárd's suggestions:

ADH (134177 atoms, ftp://ftp.gromacs.org/pub/benchmarks/ADH_bench_systems.tar.gz)
2019.6, PME and cubic box
nsteps = 40000
Intel Broadwell-NVIDIA P100
ntomp (rate, wall time)
1 (21 ns/day, 323 s)
4 (56 ns/day, 123 s)
8 (69 ns/day, 100 s)

IBM POWER9-NVIDIA V100
ntomp (rate, wall time)
 1 (14 ns/day, 500 s)
 1 (14 ns/day, 502 s)
 1 (14 ns/day, 510 s)
 4 (19 ns/day, 357 s)
 4 (17 ns/day, 397 s)
 4 (20 ns/day, 346 s)
 8 (30 ns/day, 232 s)
 8 (24 ns/day, 288 s)
 8 (31 ns/day, 222 s)
16 (59 ns/day, 117 s)
16 (65 ns/day, 107 s)
16 (63 ns/day, 110 s) [md.log on GitHub is https://bit.ly/3aCm1gw]
32 (89 ns/day,  76 s)
32 (93 ns/day,  75 s)
32 (89 ns/day,  78 s)
64 (57 ns/day, 122 s)
64 (43 ns/day, 159 s)
64 (46 ns/day, 152 s)

Yes, there is variability between identical runs for POWER9/V100.

For the Intel case, ntomp equals the number of physical cores. For the IBM
case, ntomp equals the number of hardware threads (4 hardware threads per
physical core). On a per-physical-core basis these numbers look better --
e.g., the 32-thread POWER9 runs use 8 physical cores and reach roughly 90
ns/day versus 69 ns/day on 8 Broadwell cores -- but clearly there are still
problems.

I tried different values for -pinoffset but didn't see performance gains that
couldn't be explained by the run-to-run variation.

I've written to contacts at ORNL and IBM.

Jon


From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se on behalf of Szilárd Páll
Sent: Friday, April 24, 2020 10:23 AM
To: Discussion list for GROMACS users 
Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

Using a single thread per GPU, as the linked log files show, is not
sufficient for GROMACS (and any modern machine should have more than that
anyway), but I infer from your mail that this was only meant to debug the
performance instability?

Your performance variations with POWER9 may be related to the fact that you
are either not setting affinities or the affinity settings are not correct.
However, you also have a job scheduler in the way (which I suspect is either
not configured well or is not passed the required options to correctly assign
resources to jobs); it obfuscates the machine layout and makes things look
weird to mdrun [1].

I suggest simplifying the problem and debugging it step by step. Start by
allocating full nodes and test that you can pin (either with mdrun -pin on or
with hwloc) so that you avoid [1], and get an understanding of what you should
expect from the node sharing that seems not to work correctly. Building
GROMACS with hwloc may also help, as you get better reporting in the log.
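
Concretely, something along these lines is what I have in mind (a sketch;
adjust to your scheduler setup):

salloc --nodes=1 --exclusive --gres=gpu:1 --time=00:30:00   # grab a whole node
srun gmx mdrun -ntmpi 1 -ntomp 32 -pin on -s bench.tpr      # let mdrun pin
hwloc-ps -t      # from another shell while mdrun runs, to see thread placement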

[1]
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100#L58

--
Szilárd


On Fri, Apr 24, 2020 at 3:43 AM Jonathan D. Halverson <
halver...@princeton.edu> wrote:

> Hi Kevin,
>
> md.log for the Intel run is here:
>
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
>
> Thanks for the info on constraints with 2020. I'll try some runs with
> different values of -pinoffset for 2019.6.
>
> I know a group at NIST is having the same or similar problems with
> POWER9/V100.
>
> Jon
> 
> From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se <
> gromacs.org_gmx-users-boun...@maillist.sys.kth.se> on behalf of Kevin
> Boyd 
> Sent: Thursday, April 23, 2020 9:08 PM
> To: gmx-us...@gromacs.org 
> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
>
> Hi,
>
> Can you post the full log for the Intel system? I typically find the real
> cycle and time accounting section a better place to start debugging
> performance issues.
>
> A couple quick notes, but need a side-by-side comparison for more useful
> analysis, and these points may apply to both systems so may not be your
> root cause:
> * At first glance, your Power system spends 1/3 of its time in constraint
> calculation, which is unusual. This can be reduced 2 ways - first, by
> adding more CPU cores. It doesn't make a ton of sense to benchmark on one
> core if your applications will use more. Second, if you upgrade to Gromacs
> 2020 you can probably put the constraint calculation on the GPU with
> -update GPU.
> * The Power system log has this line:
>
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
> indicating
> that threads perhaps were not actually pinned. Try adding -pinoffset 0 (or
> some other core) to specify where you want the process pinned.
>

Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-23 Thread Jonathan D. Halverson
Hi Kevin,

md.log for the Intel run is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100

Thanks for the info on constraints with 2020. I'll try some runs with different 
values of -pinoffset for 2019.6.

I know a group at NIST is having the same or similar problems with POWER9/V100.

Jon

From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se on behalf of Kevin Boyd
Sent: Thursday, April 23, 2020 9:08 PM
To: gmx-us...@gromacs.org 
Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

Hi,

Can you post the full log for the Intel system? I typically find the real
cycle and time accounting section a better place to start debugging
performance issues.

A couple of quick notes -- a side-by-side comparison is needed for a more
useful analysis, and these points may apply to both systems, so they may not
be your root cause:
* At first glance, your Power system spends 1/3 of its time in constraint
calculation, which is unusual. This can be reduced in two ways: first, by
adding more CPU cores -- it doesn't make a ton of sense to benchmark on one
core if your applications will use more. Second, if you upgrade to Gromacs
2020 you can probably put the constraint calculation on the GPU with
-update gpu.
* The Power system log has this line:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
indicating that threads perhaps were not actually pinned. Try adding
-pinoffset 0 (or some other core) to specify where you want the process
pinned.
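
For example, something like this (a sketch -- adjust the offset/stride to
wherever your job's cores actually are; -update gpu needs Gromacs 2020 and a
constraint setup it supports):

# 2019.6: pin explicitly, one thread per physical core on SMT4 (stride 4)
gmx mdrun -ntmpi 1 -ntomp 8 -pin on -pinoffset 0 -pinstride 4 \
          -nb gpu -pme gpu -s bench.tpr
# 2020: additionally offload the constraint/update work
gmx mdrun -ntmpi 1 -ntomp 8 -pin on -pinoffset 0 -pinstride 4 \
          -nb gpu -pme gpu -bonded gpu -update gpu -s bench.tpr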

Kevin

On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
halver...@princeton.edu> wrote:

>
> We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
> IBM POWER9/V100 node versus an Intel Broadwell/P100. Both are running RHEL
> 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel
> nodes. Everything below is about of the POWER9/V100 node.
>
> We ran the RNASE benchmark with 2019.6 with PME and cubic box using 1
> CPU-core and 1 GPU (
> ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
> found that the Broadwell/P100 gives 144 ns/day while POWER9/V100 gives 102
> ns/day. The difference in performance is roughly the same for the larger
> ADH benchmark and when different numbers of CPU-cores are used. GROMACS is
> always underperforming on our POWER9/V100 nodes. We have pinning turned on
> (see Slurm script at bottom).
>
> Below is our build procedure on the POWER9/V100 node:
>
> version_gmx=2019.6
> wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
> tar zxvf gromacs-${version_gmx}.tar.gz
> cd gromacs-${version_gmx}
> mkdir build && cd build
>
> module purge
> module load rh/devtoolset/7
> module load cudatoolkit/10.2
>
> OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
>
> cmake3 .. -DCMAKE_BUILD_TYPE=Release \
> -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
> -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
> -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
> -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
> -DGMX_BUILD_OWN_FFTW=ON \
> -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
> -DGMX_OPENMP_MAX_THREADS=128 \
> -DCMAKE_INSTALL_PREFIX=$HOME/.local \
> -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
>
> make -j 10
> make check
> make install
>
> 45 of the 46 tests pass with the exception being HardwareUnitTests. There
> are several posts about this and apparently it is not a concern. The full
> build log is here:
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
>
>
>
> Here is more info about our POWER9/V100 node:
>
> $ lscpu
> Architecture:  ppc64le
> Byte Order:Little Endian
> CPU(s):128
> On-line CPU(s) list:   0-127
> Thread(s) per core:4
> Core(s) per socket:16
> Socket(s): 2
> NUMA node(s):  6
> Model: 2.3 (pvr 004e 1203)
> Model name:POWER9, altivec supported
> CPU max MHz:   3800.
> CPU min MHz:   2300.
>
> You see that we have 4 hardware threads per physical core. If we use 4
> hardware threads on the RNASE benchmark instead of 1 the performance goes
> to 119 ns/day which is still about 20% less than the Broadwell/P100 value.
> When using multiple CPU-cores on the POWER9/V100 there is significant
> variation in the execution time of the code.
>
> There are four GPUs per POWER9/V100 node:
>
> $ nvidia-smi -q
> Driver Version  : 440.33.01
> CUDA Version: 10.2
> GPU 0004:04:00.0
> Product Name: Tesla V100-SXM2-32GB

[gmx-users] GROMACS performance issues on POWER9/V100 node

2020-04-23 Thread Jonathan D. Halverson
We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an IBM 
POWER9/V100 node versus an Intel Broadwell/P100. Both are running RHEL 7.7 and 
Slurm 19.05.5. We have no concerns about GROMACS on our Intel nodes. Everything
below is about the POWER9/V100 node.

We ran the RNASE benchmark with 2019.6 with PME and cubic box using 1 CPU-core 
and 1 GPU (ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and 
found that the Broadwell/P100 gives 144 ns/day while POWER9/V100 gives 102 
ns/day. The difference in performance is roughly the same for the larger ADH 
benchmark and when different numbers of CPU-cores are used. GROMACS is always 
underperforming on our POWER9/V100 nodes. We have pinning turned on (see Slurm 
script at bottom).
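
For reference, each benchmark run is essentially the following (a sketch; the
directory and .mdp file names are from the benchmark tarball as I recall them,
and the step count is illustrative):

tar xzf rnase_bench_systems.tar.gz
cd rnase_cubic
gmx grompp -f pme_verlet.mdp -c conf.gro -p topol.top -o bench.tpr
gmx mdrun -ntmpi 1 -ntomp 1 -pin on -nb gpu -s bench.tpr -nsteps 40000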

Below is our build procedure on the POWER9/V100 node:

version_gmx=2019.6
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
tar zxvf gromacs-${version_gmx}.tar.gz
cd gromacs-${version_gmx}
mkdir build && cd build

module purge
module load rh/devtoolset/7
module load cudatoolkit/10.2

OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"

cmake3 .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
-DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
-DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
-DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
-DGMX_OPENMP_MAX_THREADS=128 \
-DCMAKE_INSTALL_PREFIX=$HOME/.local \
-DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON

make -j 10
make check
make install
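
As a sanity check after installation, the options that ended up in the binary
can be inspected with something like:

$HOME/.local/bin/gmx --version | grep -iE 'simd|cuda|hwloc'

which should report IBM_VSX SIMD, CUDA support, and whether hwloc was detected.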

45 of the 46 tests pass with the exception being HardwareUnitTests. There are 
several posts about this and apparently it is not a concern. The full build log 
is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log



Here is more info about our POWER9/V100 node:

$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          6
Model:                 2.3 (pvr 004e 1203)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000

You see that we have 4 hardware threads per physical core. If we use 4 hardware 
threads on the RNASE benchmark instead of 1 the performance goes to 119 ns/day 
which is still about 20% less than the Broadwell/P100 value. When using 
multiple CPU-cores on the POWER9/V100 there is significant variation in the 
execution time of the code.
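
To rule out placement relative to the GPUs, the NUMA layout and the CPU-GPU
affinity can be cross-checked with standard tools, e.g.:

numactl --hardware     # lists the 6 NUMA nodes and their CPUs/memory
nvidia-smi topo -m     # shows which CPUs are local to each GPU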

There are four GPUs per POWER9/V100 node:

$ nvidia-smi -q
Driver Version  : 440.33.01
CUDA Version: 10.2
GPU 0004:04:00.0
Product Name: Tesla V100-SXM2-32GB

The GPUs have been shown to perform as expected on other applications.




The following lines are found in md.log for the POWER9/V100 run:

Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 128
NOTE: Thread affinity was not set.

The full md.log is available here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
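
My understanding is that with a correctly detected topology the pinning should
instead use a small logical-core stride, something like (a sketch; on SMT4 a
stride of 4 places one thread per physical core, a stride of 1 packs all
hardware threads):

gmx mdrun -ntmpi 1 -ntomp 32 -pin on -pinoffset 0 -pinstride 4 -s bench.tpr
gmx mdrun -ntmpi 1 -ntomp 128 -pin on -pinoffset 0 -pinstride 1 -s bench.tpr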




Below are the MegaFlops Accounting for the POWER9/V100 versus Broadwell/P100:

 IBM POWER9 WITH NVIDIA V100 
 Computing:                          M-Number        M-Flops   % Flops
-----------------------------------------------------------------------------
 Pair Search distance check        297.763872       2679.875       0.0
 NxN Ewald Elec. + LJ [F]       244214.215808   16118138.243      98.0
 NxN Ewald Elec. + LJ [V]         2483.565760     265741.536       1.6
 1,4 nonbonded interactions         53.415341       4807.381       0.0
 Shift-X                             3.029040         18.174       0.0
 Angles                             37.043704       6223.342       0.0
 Propers                            55.825582      12784.058       0.1
 Impropers                           4.220422        877.848       0.0
 Virial                              2.432585         43.787       0.0
 Stop-CM                             2.452080         24.521       0.0
 Calc-Ekin                          48.128080       1299.458       0.0
 Lincs                              20.536159       1232.170       0.0
 Lincs-Mat                         444.613344       1778.453       0.0
 Constraint-V                      261.192228       2089.538       0.0
 Constraint-Vir                      2.430161         58.324       0.0
 Settle                             73.382008      23702.389       0.1
-----------------------------------------------------------------------------
 Total                                          16441499.096     100.0