Re: [gmx-users] simulation on 2 gpus

2019-09-06 Thread Szilárd Páll
On Fri, Sep 6, 2019 at 3:47 PM Stefano Guglielmo
 wrote:
>
> Hi Szilard,
>
> thanks for suggestions.
>
>
> As for the strange crash, the workstation works fine using only the cpu; the
> problem seems to be related to gpu usage: when both cards draw roughly 200 W
> out of 250 W, the workstation turns off. It does not seem to be the PSU
> (even in the "offending" case we are well below the maximum power),

How far below? Note that PSU efficiency and quality also affect
stability at high load.

> and it
> does not seem related to temperature either (it happens even when gpu temp
> is around 55-60 °C). The vendor ran some tests and, according to those, the
> hardware seems to be ok. Do you (or anyone else on the list) have any
> particular test to suggest that could help diagnose the problem more
> specifically?

I suggest the following for load testing:
https://github.com/ComputationalRadiationPhysics/cuda_memtest
and for memory stress testing:
https://github.com/ComputationalRadiationPhysics/cuda_memtest
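To correlate the shutdowns with load, it can also help to log GPU power draw
and temperature while reproducing the crash; a minimal sketch (the query field
names assume a reasonably recent nvidia-smi, and the log file name is
arbitrary):

# log per-GPU power, temperature and utilization once per second
nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu,utilization.gpu \
           --format=csv -l 1 > gpu_load_log.csv &
# then launch the two mdrun jobs as usual and inspect the last lines of the
# log after the machine comes back up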

Cheers,
--

Szilárd

>
> Any opinion is appreciated,
>
> thanks
>
> On Wednesday 21 August 2019, Szilárd Páll
> wrote:
>
> > Hi Stefano,
> >
> >
> > On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo
> >  wrote:
> > >
> > > Dear Szilard,
> > >
> > > thanks for the very clear answer.
> > > Following your suggestion I tried to run without DD; for the same system
> > I
> > > run two simulations on two gpus:
> > >
> > > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > > -gputasks 00 -pin on -pinoffset 0 -pinstride 1
> > >
> > > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > > -gputasks 11 -pin on -pinoffset 28 -pinstride 1
> > >
> > > but again the system crashed; with this I mean that after few minutes the
> > > machine goes off (power off) without any error message, even without
> > using
> > > all the threads.
> >
> > That is not normal and I strongly recommend investigating it as it
> > could be a sign of an underlying system/hardware instability or fault
> > which could ultimately lead to incorrect simulation results.
> >
> > Are you sure that:
> > - your machine is stable and reliable at high loads; is the PSU sufficient?
> > - your hardware has been thoroughly stress-tested and it does not show
> > instabilities?
> >
> > Does the crash also happen with GROMACS running on the CPU only (using
> > all cores)?
> > I'd recommend running some stress-tests that fully load the machine
> > for a few hours to see if the error persists.
> >
> > > I then tried running the two simulations on the same gpu without DD:
> > >
> > > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > > -gputasks 00 -pin on -pinoffset 0 -pinstride 1
> > >
> > > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > > -gputasks 00 -pin on -pinoffset 28 -pinstride 1
> > >
> > > and I obtained better performance (about 70 ns/day) with a massive use of
> > > the gpu (around 90%), comparing to the two runs on two gpus I reported in
> > > the previous post
> > > (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > -gputasks
> > > 000 -pin on -pinoffset 0 -pinstride 1
> > >  gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > > -gputasks 111 -pin on -pinoffset 28 -pinstride 1).
> >
> > That is expected; domain-decomposition on a single GPU is unnecessary
> > and introduces overheads that limit performance.
> >
> > > As for pinning, cpu topology according to log file is:
> > > hardware topology: Basic
> > > Sockets, cores, and logical processors:
> >   Socket  0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39]
> >              [16 48] [17 49] [18 50] [19 51] [20 52] [21 53] [22 54] [23 55]
> >              [ 8 40] [ 9 41] [10 42] [11 43] [12 44] [13 45] [14 46] [15 47]
> >              [24 56] [25 57] [26 58] [27 59] [28 60] [29 61] [30 62] [31 63]
> > > If I understand well (absolutely not sure) it should not be that
> > convenient
> > > to pin to consecutive threads,
> >
> > On the contrary, pinning to consecutive threads is the recommended
> > behavior. More generally, application threads are expected to be
> > pinned to consecutive cores (as threading parallelization will benefit
> > from the resulting cache access patterns). Now, CPU cores can have
> > multiple hardware threads, and whether it makes sense (performance-wise)
> > to use one or both threads of a core determines whether a stride of 1 or
> > 2 is best. Typically, when most work is offloaded to a GPU and many CPU
> > cores are available, 1 thread/core is best.
> >
> > Note that the above topology mapping simply means that the indexed
> > entities that the operating system calls "CPU" grouped in "[]"
> > correspond to hardware threads of the same core, i.e. core 0 is [0
> > 32], core 1 [1 33], etc. Pinning with a stride happens into this map:
> > - with a -pinstride 1 

Re: [gmx-users] simulation on 2 gpus

2019-09-06 Thread Stefano Guglielmo
Hi Szilard,

thanks for suggestions.


As for the strange crash, the workstation works fine using only the cpu; the
problem seems to be related to gpu usage: when both cards draw roughly 200 W
out of 250 W, the workstation turns off. It does not seem to be the PSU
(even in the "offending" case we are well below the maximum power), and it
does not seem related to temperature either (it happens even when gpu temp is
around 55-60 °C). The vendor ran some tests and, according to those, the
hardware seems to be ok. Do you (or anyone else on the list) have any
particular test to suggest that could help diagnose the problem more
specifically?


Any opinion is appreciated,

thanks

On Wednesday 21 August 2019, Szilárd Páll
wrote:

> Hi Stefano,
>
>
> On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo
>  wrote:
> >
> > Dear Szilard,
> >
> > thanks for the very clear answer.
> > Following your suggestion I tried to run without DD; for the same system
> I
> > run two simulations on two gpus:
> >
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > -gputasks 00 -pin on -pinoffset 0 -pinstride 1
> >
> > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > -gputasks 11 -pin on -pinoffset 28 -pinstride 1
> >
> > but again the system crashed; with this I mean that after few minutes the
> > machine goes off (power off) without any error message, even without
> using
> > all the threads.
>
> That is not normal and I strongly recommend investigating it as it
> could be a sign of an underlying system/hardware instability or fault
> which could ultimately lead to incorrect simulation results.
>
> Are you sure that:
> - your machine is stable and reliable at high loads; is the PSU sufficient?
> - your hardware has been thoroughly stress-tested and it does not show
> instabilities?
>
> Does the crash also happen with GROMACS running on the CPU only (using
> all cores)?
> I'd recommend running some stress-tests that fully load the machine
> for a few hours to see if the error persists.
>
> > I then tried running the two simulations on the same gpu without DD:
> >
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > -gputasks 00 -pin on -pinoffset 0 -pinstride 1
> >
> > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > -gputasks 00 -pin on -pinoffset 28 -pinstride 1
> >
> > and I obtained better performance (about 70 ns/day) with a massive use of
> > the gpu (around 90%), comparing to the two runs on two gpus I reported in
> > the previous post
> > (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> -gputasks
> > 000 -pin on -pinoffset 0 -pinstride 1
> >  gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > -gputasks 111 -pin on -pinoffset 28 -pinstride 1).
>
> That is expected; domain-decomposition on a single GPU is unnecessary
> and introduces overheads that limit performance.
>
> > As for pinning, cpu topology according to log file is:
> > hardware topology: Basic
> > Sockets, cores, and logical processors:
> >   Socket  0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39]
> >              [16 48] [17 49] [18 50] [19 51] [20 52] [21 53] [22 54] [23 55]
> >              [ 8 40] [ 9 41] [10 42] [11 43] [12 44] [13 45] [14 46] [15 47]
> >              [24 56] [25 57] [26 58] [27 59] [28 60] [29 61] [30 62] [31 63]
> > If I understand well (absolutely not sure) it should not be that
> convenient
> > to pin to consecutive threads,
>
> On the contrary, pinning to consecutive threads is the recommended
> behavior. More generally, application threads are expected to be
> pinned to consecutive cores (as threading parallelization will benefit
> from the resulting cache access patterns). Now, CPU cores can have
> multiple hardware threads, and whether it makes sense (performance-wise)
> to use one or both threads of a core determines whether a stride of 1 or
> 2 is best. Typically, when most work is offloaded to a GPU and many CPU
> cores are available, 1 thread/core is best.
>
> Note that the above topology mapping simply means that the indexed
> entities that the operating system calls "CPU" grouped in "[]"
> correspond to hardware threads of the same core, i.e. core 0 is [0
> 32], core 1 [1 33], etc. Pinning with a stride happens into this map:
> - with a -pinstride 1 thread mapping will be (app thread->hardware
> thread): 0->0, 1->32, 2->1, 3->33,...
> - with a -pinstride 2 thread mapping will be (-||-): 0->0, 1->1, 2->2,
> 3->3, ...
>
> > and indeed I found a subtle degradation of
> > performance for a single simulation, switching from:
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks
> > 00 -pin on
> > to
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks
> > 00 -pin on -pinoffset 0 -pinstride 1.
>
> If you compare the log files of the two, you should notice that the
> former 

Re: [gmx-users] simulation on 2 gpus

2019-08-21 Thread Szilárd Páll
Hi Stefano,


On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo
 wrote:
>
> Dear Szilard,
>
> thanks for the very clear answer.
> Following your suggestion I tried to run without DD; for the same system I
> run two simulations on two gpus:
>
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 11 -pin on -pinoffset 28 -pinstride 1
>
> but again the system crashed; with this I mean that after few minutes the
> machine goes off (power off) without any error message, even without using
> all the threads.

That is not normal and I strongly recommend investigating it as it
could be a sign of an underlying system/hardware instability or fault
which could ultimately lead to incorrect simulation results.

Are you sure that:
- your machine is stable and reliable at high loads; is the PSU sufficient?
- your hardware has been thoroughly stress-tested and it does not show
instabilities?

Does the crash also happen with GROMACS running on the CPU only (using
all cores)?
I'd recommend running some stress-tests that fully load the machine
for a few hours to see if the error persists.
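For example, something along these lines can serve as a CPU-side burn-in (a
sketch; it assumes the stress-ng tool is installed, and any comparable stress
tester would do), ideally combined with a long CPU-only mdrun to reproduce a
realistic load:

# load all 64 hardware threads for four hours and report a brief summary
stress-ng --cpu 64 --timeout 4h --metrics-brief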

> I then tried running the two simulations on the same gpu without DD:
>
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 28 -pinstride 1
>
> and I obtained better performance (about 70 ns/day) with a massive use of
> the gpu (around 90%), comparing to the two runs on two gpus I reported in
> the previous post
> (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
> 000 -pin on -pinoffset 0 -pinstride 1
>  gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> -gputasks 111 -pin on -pinoffset 28 -pinstride 1).

That is expected; domain-decomposition on a single GPU is unnecessary
and introduces overheads that limit performance.

> As for pinning, cpu topology according to log file is:
> hardware topology: Basic
> Sockets, cores, and logical processors:
>   Socket  0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39]
>              [16 48] [17 49] [18 50] [19 51] [20 52] [21 53] [22 54] [23 55]
>              [ 8 40] [ 9 41] [10 42] [11 43] [12 44] [13 45] [14 46] [15 47]
>              [24 56] [25 57] [26 58] [27 59] [28 60] [29 61] [30 62] [31 63]
> If I understand well (absolutely not sure) it should not be that convenient
> to pin to consecutive threads,

On the contrary, pinning to consecutive threads is the recommended
behavior. More generally, application threads are expected to be
pinned to consecutive cores (as threading parallelization will benefit
from the resulting cache access patterns). Now, CPU cores can have
multiple hardware threads, and whether it makes sense (performance-wise)
to use one or both threads of a core determines whether a stride of 1 or
2 is best. Typically, when most work is offloaded to a GPU and many CPU
cores are available, 1 thread/core is best.

Note that the above topology mapping simply means that the indexed
entities the operating system calls "CPUs", grouped in "[]",
correspond to hardware threads of the same core, i.e. core 0 is [0 32],
core 1 is [1 33], etc. Pinning with a stride maps onto this list:
- with -pinstride 1 the thread mapping will be (app thread -> hardware
thread): 0->0, 1->32, 2->1, 3->33, ...
- with -pinstride 2 the thread mapping will be (likewise): 0->0, 1->1, 2->2, 3->3, ...
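As a concrete sketch for this topology (hardware-thread order 0, 32, 1, 33, ...),
launching 28 threads on one rank:

# stride 1: threads land on 0,32,1,33,... i.e. both hardware threads of 14 cores
gmx mdrun -deffnm run -ntmpi 1 -ntomp 28 -pin on -pinoffset 0 -pinstride 1
# stride 2: threads land on 0,1,2,...,27 i.e. one hardware thread on 28 cores
gmx mdrun -deffnm run -ntmpi 1 -ntomp 28 -pin on -pinoffset 0 -pinstride 2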

> and indeed I found a subtle degradation of
> performance for a single simulation, switching from:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks
> 00 -pin on
> to
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks
> 00 -pin on -pinoffset 0 -pinstride 1.

If you compare the log files of the two, you should notice that the
former used a pin stride of 2, resulting in the use of 28 cores, while the
latter used only 14 cores; the likely reason for only a small
difference is that there is not enough CPU work to scale to 28 cores
and, additionally, these specific Threadripper CPUs are tricky to scale
across with wide multi-threaded parallelization.

Cheers,
--
Szilárd


>
> Thanks again
> Stefano
>
>
>
>
> On Fri, 16 Aug 2019 at 17:48, Szilárd Páll <
> pall.szil...@gmail.com> wrote:
>
> > On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo
> >  wrote:
> > >
> > > Dear Paul,
> > > thanks for suggestions. Following them I managed to run 91 ns/day for the
> > > system I referred to in my previous post with the configuration:
> > > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > -gputasks
> > > 111 -pin on (still 28 threads seems to be the best choice)
> > >
> > > and 56 ns/day for two independent runs:
> > > gmx 

Re: [gmx-users] simulation on 2 gpus

2019-08-20 Thread Stefano Guglielmo
Dear Szilard,

thanks for the very clear answer.
Following your suggestion I tried to run without DD; for the same system I
run two simulations on two gpus:

gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
-gputasks 00 -pin on -pinoffset 0 -pinstride 1

gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
-gputasks 11 -pin on -pinoffset 28 -pinstride 1

but again the system crashed; by this I mean that after a few minutes the
machine powers off without any error message, even without using
all the threads.

I then tried running the two simulations on the same gpu without DD:

gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
-gputasks 00 -pin on -pinoffset 0 -pinstride 1

gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
-gputasks 00 -pin on -pinoffset 28 -pinstride 1

and I obtained better performance (about 70 ns/day) with heavy use of
the gpu (around 90%), compared to the two runs on two gpus I reported in
the previous post
(gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
000 -pin on -pinoffset 0 -pinstride 1
 gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
-gputasks 111 -pin on -pinoffset 28 -pinstride 1).

As for pinning, cpu topology according to log file is:
hardware topology: Basic
Sockets, cores, and logical processors:
  Socket  0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39]
             [16 48] [17 49] [18 50] [19 51] [20 52] [21 53] [22 54] [23 55]
             [ 8 40] [ 9 41] [10 42] [11 43] [12 44] [13 45] [14 46] [15 47]
             [24 56] [25 57] [26 58] [27 59] [28 60] [29 61] [30 62] [31 63]
If I understand correctly (absolutely not sure), it should not be advantageous
to pin to consecutive threads, and indeed I found a slight degradation of
performance for a single simulation, switching from:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks
00 -pin on
to
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks
00 -pin on -pinoffset 0 -pinstride 1.

Thanks again
Stefano




On Fri, 16 Aug 2019 at 17:48, Szilárd Páll <
pall.szil...@gmail.com> wrote:

> On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo
>  wrote:
> >
> > Dear Paul,
> > thanks for suggestions. Following them I managed to run 91 ns/day for the
> > system I referred to in my previous post with the configuration:
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> -gputasks
> > 111 -pin on (still 28 threads seems to be the best choice)
> >
> > and 56 ns/day for two independent runs:
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> -gputasks
> > 000 -pin on -pinoffset 0 -pinstride 1
> > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> -gputasks
> > 111 -pin on -pinoffset 28 -pinstride 1
> > which is a fairly good result.
>
> Use no DD in single-GPU runs, i.e. for the latter, just simply
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> You can also have mdrun's multidir functionality manage an ensemble of
> jobs (related or not) so you don't have to manually start, calculate
> pinning, etc.
>
>
> > I am still wondering if somehow I should pin the threads in some
> different
> > way in order to reflect the cpu topology and if this can influence
> > performance (if I remember well NAMD allows the user to indicate
> explicitly
> > the cpu core/threads to use in a computation).
>
> Your pinning does reflect the CPU topology -- the 4x7=28 threads are
> pinned to consecutive hardware threads (because of -pinstride 1, i.e.
> don't skip the second hardware thread of the core). The mapping of
> software to hardware threads happens based on a the topology-based
> hardware thread indexing, see the hardware detection report in the log
> file.
>
> > When I tried to run two simulations with the following configuration:
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1
> -gputasks
> >  -pin on -pinoffset 0 -pinstride 1
> > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1
> -gputasks
> >  -pin on -pinoffset 0 -pinstride 32
> > the system crashed down. Probably this is normal and I am missing
> something
> > quite obvious.
>
> Not really. What do you mean by "crashed down", the machine should not
> crash, nor should the simulation. Even though your machine has 32
> cores / 64 threads, using all of these may not always be beneficial as
> using more threads where there is too little work to scale will have
> an overhead. Have you tried using all cores but only 1 thread / core
> (i.e. 32 threads in total with pinstride 2)?
>
> Cheers,
> --
> Szilárd
>
> >
> > Thanks again for the valuable advices
> > Stefano
> >
> >
> >
> > On Sun, 4 Aug 2019 at 01:40, paul buscemi
> > wrote:

Re: [gmx-users] simulation on 2 gpus

2019-08-16 Thread Szilárd Páll
On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo
 wrote:
>
> Dear Paul,
> thanks for suggestions. Following them I managed to run 91 ns/day for the
> system I referred to in my previous post with the configuration:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
> 111 -pin on (still 28 threads seems to be the best choice)
>
> and 56 ns/day for two independent runs:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
> 000 -pin on -pinoffset 0 -pinstride 1
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
> 111 -pin on -pinoffset 28 -pinstride 1
> which is a fairly good result.

Use no DD in single-GPU runs, i.e. for the latter, just simply
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
-gputasks 00 -pin on -pinoffset 0 -pinstride 1

You can also have mdrun's multidir functionality manage an ensemble of
jobs (related or not) so you don't have to manually start, calculate
pinning, etc.
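For reference, a minimal sketch of the multidir approach (assumes an MPI build,
i.e. gmx_mpi, and a prepared topol.tpr in each listed directory; GPU assignment
is left to mdrun here and may need explicit tuning):

# one simulation per directory, one rank each, pinned automatically
mpirun -np 2 gmx_mpi mdrun -multidir sim1 sim2 -ntomp 28 -nb gpu -pme gpu -pin on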


> I am still wondering if somehow I should pin the threads in some different
> way in order to reflect the cpu topology and if this can influence
> performance (if I remember well NAMD allows the user to indicate explicitly
> the cpu core/threads to use in a computation).

Your pinning does reflect the CPU topology -- the 4x7=28 threads are
pinned to consecutive hardware threads (because of -pinstride 1, i.e.
don't skip the second hardware thread of the core). The mapping of
software to hardware threads happens based on the topology-based
hardware thread indexing; see the hardware detection report in the log
file.

> When I tried to run two simulations with the following configuration:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks
>  -pin on -pinoffset 0 -pinstride 1
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks
>  -pin on -pinoffset 0 -pinstride 32
> the system crashed down. Probably this is normal and I am missing something
> quite obvious.

Not really. What do you mean by "crashed down"? The machine should not
crash, nor should the simulation. Even though your machine has 32
cores / 64 threads, using all of these may not always be beneficial, as
using more threads where there is too little work to scale will add
overhead. Have you tried using all cores but only 1 thread / core
(i.e. 32 threads in total with pinstride 2)?
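A sketch of that suggested layout (one thread per core across all 32 cores;
the GPU task mapping here is just an example):

gmx mdrun -deffnm run -nb gpu -pme gpu -ntmpi 1 -ntomp 32 -npme 0 \
    -gputasks 00 -pin on -pinoffset 0 -pinstride 2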

Cheers,
--
Szilárd

>
> Thanks again for the valuable advices
> Stefano
>
>
>
> On Sun, 4 Aug 2019 at 01:40, paul buscemi
> wrote:
>
> > Stefano,
> >
> > A recent run with 14 atoms, including 1 isopropanol  molecules on
> > top of  an end restrained PDMS surface of  74000 atoms  in a 20 20 30 nm
> > box ran at 67 ns/d nvt with the mdrun conditions I posted. It took 120 ns
> > for 100 molecules of an adsorbate  to go from solution to the surface.   I
> > don't think this will set the world ablaze with any benchmarks but it is
> > acceptable to get some work done.
> >
> > Linux Mint Mate 18, AMD Threadripper 32 core 2990wx 4.2Ghz, 32GB DDR4, 2x
> > RTX 2080TI gmx2019 in the simplest gmx configuration for gpus,  CUDA
> > version 10, Nvidia 410.7p loaded  from the repository
> >
> > Paul
> >
> > > On Aug 3, 2019, at 12:58 PM, paul buscemi  wrote:
> > >
> > > Stefano,
> > >
> > > Here is a typical run
> > >
> > > fpr minimization mdrun -deffnm   grofile. -nn gpu
> > >
> > > and for other runs for a 32 core
> > >
> > > gmx -deffnm grofile.nvt  -nb gpu -pme gpu -ntomp  8  -ntmpi 8  -npme 1
> > -gputasks   -pin on
> > >
> > > Depending on the molecular system/model   -ntomp -4 -ntmpi 16  may be
> > faster   - of course adjusting -gputasks
> > >
> > > Rarely do I find that not using ntomp and ntpmi is faster, but it is
> > never bad
> > >
> > > Let me know how it goes.
> > >
> > > Paul
> > >
> > >> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo <
> > stefano.guglie...@unito.it> wrote:
> > >>
> > >> Hi Paul,
> > >> thanks for the reply. Would you mind posting the command you used or
> > >> telling how did you balance the work between cpu and gpu?
> > >>
> > >> What about pinning? Does anyone know how to deal with a cpu topology
> > like
> > >> the one reported in my previous post and if it is relevant for
> > performance?
> > >> Thanks
> > >> Stefano
> > >>
> > >> On Saturday 3 August 2019, Paul Buscemi
> > wrote:
> > >>
> > >>> I run the same system and setup but no nvlink. Maestro runs both gpus
> > at
> > >>> 100 percent. Gromacs typically 50 --60 percent can do 600ns/d on 2
> > >>> atoms
> > >>>
> > >>> PB
> > >>>
> >  On Jul 25, 2019, at 9:30 PM, Kevin Boyd  wrote:
> > 
> >  Hi,
> > 
> >  I've done a lot of research/experimentation on this, so I can maybe
> > get
> > >>> you
> >  started - if anyone has any questions about the essay to follow, feel
> > >>> free
> >  to email me personally, and I'll link it to the email thread if it
> 

Re: [gmx-users] simulation on 2 gpus

2019-08-05 Thread Stefano Guglielmo
Dear Paul,
thanks for suggestions. Following them I managed to run 91 ns/day for the
system I referred to in my previous post with the configuration:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
111 -pin on (still 28 threads seems to be the best choice)

and 56 ns/day for two independent runs:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
000 -pin on -pinoffset 0 -pinstride 1
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
111 -pin on -pinoffset 28 -pinstride 1
which is a fairly good result.
I am still wondering whether I should pin the threads in some different
way in order to reflect the cpu topology, and whether this can influence
performance (if I remember correctly, NAMD allows the user to indicate
explicitly the cpu cores/threads to use in a computation).

When I tried to run two simulations with the following configuration:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks
 -pin on -pinoffset 0 -pinstride 1
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks
 -pin on -pinoffset 0 -pinstride 32
the system crashed down. Probably this is normal and I am missing something
quite obvious.

Thanks again for the valuable advices
Stefano



On Sun, 4 Aug 2019 at 01:40, paul buscemi
wrote:

> Stefano,
>
> A recent run with 14 atoms, including 1 isopropanol  molecules on
> top of  an end restrained PDMS surface of  74000 atoms  in a 20 20 30 nm
> box ran at 67 ns/d nvt with the mdrun conditions I posted. It took 120 ns
> for 100 molecules of an adsorbate  to go from solution to the surface.   I
> don't think this will set the world ablaze with any benchmarks but it is
> acceptable to get some work done.
>
> Linux Mint Mate 18, AMD Threadripper 32 core 2990wx 4.2Ghz, 32GB DDR4, 2x
> RTX 2080TI gmx2019 in the simplest gmx configuration for gpus,  CUDA
> version 10, Nvidia 410.7p loaded  from the repository
>
> Paul
>
> > On Aug 3, 2019, at 12:58 PM, paul buscemi  wrote:
> >
> > Stefano,
> >
> > Here is a typical run
> >
> > fpr minimization mdrun -deffnm   grofile. -nn gpu
> >
> > and for other runs for a 32 core
> >
> > gmx -deffnm grofile.nvt  -nb gpu -pme gpu -ntomp  8  -ntmpi 8  -npme 1
> -gputasks   -pin on
> >
> > Depending on the molecular system/model   -ntomp -4 -ntmpi 16  may be
> faster   - of course adjusting -gputasks
> >
> > Rarely do I find that not using ntomp and ntpmi is faster, but it is
> never bad
> >
> > Let me know how it goes.
> >
> > Paul
> >
> >> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo <
> stefano.guglie...@unito.it> wrote:
> >>
> >> Hi Paul,
> >> thanks for the reply. Would you mind posting the command you used or
> >> telling how did you balance the work between cpu and gpu?
> >>
> >> What about pinning? Does anyone know how to deal with a cpu topology
> like
> >> the one reported in my previous post and if it is relevant for
> performance?
> >> Thanks
> >> Stefano
> >>
> >> On Saturday 3 August 2019, Paul Buscemi
> wrote:
> >>
> >>> I run the same system and setup but no nvlink. Maestro runs both gpus
> at
> >>> 100 percent. Gromacs typically 50 --60 percent can do 600ns/d on 2
> >>> atoms
> >>>
> >>> PB
> >>>
>  On Jul 25, 2019, at 9:30 PM, Kevin Boyd  wrote:
> 
>  Hi,
> 
>  I've done a lot of research/experimentation on this, so I can maybe
> get
> >>> you
>  started - if anyone has any questions about the essay to follow, feel
> >>> free
>  to email me personally, and I'll link it to the email thread if it
> ends
> >>> up
>  being pertinent.
> 
>  First, there's some more internet resources to checkout. See Mark's
> talk
> >>> at
>  -
>  https://bioexcel.eu/webinar-performance-tuning-and-
> >>> optimization-of-gromacs/
>  Gromacs development moves fast, but a lot of it is still relevant.
> 
>  I'll expand a bit here, with the caveat that Gromacs GPU development
> is
>  moving very fast and so the correct commands for optimal performance
> are
>  both system-dependent and a moving target between versions. This is a
> >>> good
>  thing - GPUs have revolutionized the field, and with each iteration we
> >>> make
>  better use of them. The downside is that it's unclear exactly what
> sort
> >>> of
>  CPU-GPU balance you should look to purchase to take advantage of
> future
>  developments, though the trend is certainly that more and more
> >>> computation
>  is being offloaded to the GPUs.
> 
>  The most important consideration is that to get maximum total
> throughput
>  performance, you should be running not one but multiple simulations
>  simultaneously. You can do this through the -multidir option, but I
> don't
>  recommend that in this case, as it requires compiling with MPI and
> limits
>  some of your options. My run scripts usually use "gmx mdrun ... &" 

Re: [gmx-users] simulation on 2 gpus

2019-08-03 Thread paul buscemi
Stefano,

Here is a typical run

for minimization: gmx mdrun -deffnm grofile -nb gpu

and for other runs for a 32 core

gmx mdrun -deffnm grofile.nvt -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1
-gputasks   -pin on

Depending on the molecular system/model, -ntomp 4 -ntmpi 16 may be faster
- of course adjusting -gputasks accordingly

Rarely do I find that not using -ntomp and -ntmpi is faster, but it is never bad

Let me know how it goes.

Paul

> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo  
> wrote:
> 
> Hi Paul,
> thanks for the reply. Would you mind posting the command you used or
> telling how did you balance the work between cpu and gpu?
> 
> What about pinning? Does anyone know how to deal with a cpu topology like
> the one reported in my previous post and if it is relevant for performance?
> Thanks
> Stefano
> 
> On Saturday 3 August 2019, Paul Buscemi wrote:
> 
>> I run the same system and setup but no nvlink. Maestro runs both gpus at
>> 100 percent. Gromacs typically 50 --60 percent can do 600ns/d on 2
>> atoms
>> 
>> PB
>> 
>>> On Jul 25, 2019, at 9:30 PM, Kevin Boyd  wrote:
>>> 
>>> Hi,
>>> 
>>> I've done a lot of research/experimentation on this, so I can maybe get
>> you
>>> started - if anyone has any questions about the essay to follow, feel
>> free
>>> to email me personally, and I'll link it to the email thread if it ends
>> up
>>> being pertinent.
>>> 
>>> First, there's some more internet resources to checkout. See Mark's talk
>> at
>>> -
>>> https://bioexcel.eu/webinar-performance-tuning-and-
>> optimization-of-gromacs/
>>> Gromacs development moves fast, but a lot of it is still relevant.
>>> 
>>> I'll expand a bit here, with the caveat that Gromacs GPU development is
>>> moving very fast and so the correct commands for optimal performance are
>>> both system-dependent and a moving target between versions. This is a
>> good
>>> thing - GPUs have revolutionized the field, and with each iteration we
>> make
>>> better use of them. The downside is that it's unclear exactly what sort
>> of
>>> CPU-GPU balance you should look to purchase to take advantage of future
>>> developments, though the trend is certainly that more and more
>> computation
>>> is being offloaded to the GPUs.
>>> 
>>> The most important consideration is that to get maximum total throughput
>>> performance, you should be running not one but multiple simulations
>>> simultaneously. You can do this through the -multidir option, but I don't
>>> recommend that in this case, as it requires compiling with MPI and limits
>>> some of your options. My run scripts usually use "gmx mdrun ... &" to
>>> initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
>>> -pinoffset, and -gputasks. I can give specific examples if you're
>>> interested.
>>> 
>>> Another important point is that you can run more simulations than the
>>> number of GPUs you have. Depending on CPU-GPU balance and quality, you
>>> won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
>>> you might increase it up to 1.5x. This would involve targeting the same
>> GPU
>>> with -gputasks.
>>> 
>>> Within a simulation, you should set up a benchmarking script to figure
>> out
>>> the best combination of thread-mpi ranks and open-mp threads - this can
>>> have pretty drastic effects on performance. For example, if you want to
>> use
>>> your entire machine for one simulation (not recommended for maximal
>> 
>> 
> 
> 
> -- 
> Stefano GUGLIELMO PhD
> Assistant Professor of Medicinal Chemistry
> Department of Drug Science and Technology
> Via P. Giuria 9
> 10125 Turin, ITALY
> ph. +39 (0)11 6707178



Re: [gmx-users] simulation on 2 gpus

2019-08-02 Thread Paul Buscemi
I run the same system and setup but no nvlink. Maestro runs both gpus at 100 
percent. Gromacs typically 50 --60 percent can do 600ns/d on 2 atoms 

PB

> On Jul 25, 2019, at 9:30 PM, Kevin Boyd  wrote:
> 
> Hi,
> 
> I've done a lot of research/experimentation on this, so I can maybe get you
> started - if anyone has any questions about the essay to follow, feel free
> to email me personally, and I'll link it to the email thread if it ends up
> being pertinent.
> 
> First, there's some more internet resources to checkout. See Mark's talk at
> -
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> Gromacs development moves fast, but a lot of it is still relevant.
> 
> I'll expand a bit here, with the caveat that Gromacs GPU development is
> moving very fast and so the correct commands for optimal performance are
> both system-dependent and a moving target between versions. This is a good
> thing - GPUs have revolutionized the field, and with each iteration we make
> better use of them. The downside is that it's unclear exactly what sort of
> CPU-GPU balance you should look to purchase to take advantage of future
> developments, though the trend is certainly that more and more computation
> is being offloaded to the GPUs.
> 
> The most important consideration is that to get maximum total throughput
> performance, you should be running not one but multiple simulations
> simultaneously. You can do this through the -multidir option, but I don't
> recommend that in this case, as it requires compiling with MPI and limits
> some of your options. My run scripts usually use "gmx mdrun ... &" to
> initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
> -pinoffset, and -gputasks. I can give specific examples if you're
> interested.
> 
> Another important point is that you can run more simulations than the
> number of GPUs you have. Depending on CPU-GPU balance and quality, you
> won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
> you might increase it up to 1.5x. This would involve targeting the same GPU
> with -gputasks.
> 
> Within a simulation, you should set up a benchmarking script to figure out
> the best combination of thread-mpi ranks and open-mp threads - this can
> have pretty drastic effects on performance. For example, if you want to use
> your entire machine for one simulation (not recommended for maximal



Re: [gmx-users] simulation on 2 gpus

2019-08-02 Thread Stefano Guglielmo
ce) performance
> benefit over MPI.
>
> Kevin
>
> On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon 
> wrote:
>
> > Hi Kevin,
> > Thanks for your very useful post.  Could you give a few command line
> > examples on how to start multiple runs at different times (e.g.,
> allocate a
> > subset of CPU/GPU to one run, and start another run later using another
> > unsubset of yet-unallocated CPU/GPU).  Also, could you elaborate on the
> > drawbacks of the MPI compilation that you hinted at?
> > Gregory
> >
> > From: Kevin Boyd<mailto:kevin.b...@uconn.edu>
> > Sent: Thursday, July 25, 2019 10:31 PM
> > To: gmx-us...@gromacs.org<mailto:gmx-us...@gromacs.org>
> > Subject: Re: [gmx-users] simulation on 2 gpus
> >
> > Hi,
> >
> > I've done a lot of research/experimentation on this, so I can maybe get
> you
> > started - if anyone has any questions about the essay to follow, feel
> free
> > to email me personally, and I'll link it to the email thread if it ends
> up
> > being pertinent.
> >
> > First, there's some more internet resources to checkout. See Mark's talk
> at
> > -
> >
> >
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> > Gromacs development moves fast, but a lot of it is still relevant.
> >
> > I'll expand a bit here, with the caveat that Gromacs GPU development is
> > moving very fast and so the correct commands for optimal performance are
> > both system-dependent and a moving target between versions. This is a
> good
> > thing - GPUs have revolutionized the field, and with each iteration we
> make
> > better use of them. The downside is that it's unclear exactly what sort
> of
> > CPU-GPU balance you should look to purchase to take advantage of future
> > developments, though the trend is certainly that more and more
> computation
> > is being offloaded to the GPUs.
> >
> > The most important consideration is that to get maximum total throughput
> > performance, you should be running not one but multiple simulations
> > simultaneously. You can do this through the -multidir option, but I don't
> > recommend that in this case, as it requires compiling with MPI and limits
> > some of your options. My run scripts usually use "gmx mdrun ... &" to
> > initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
> > -pinoffset, and -gputasks. I can give specific examples if you're
> > interested.
> >
> > Another important point is that you can run more simulations than the
> > number of GPUs you have. Depending on CPU-GPU balance and quality, you
> > won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
> > you might increase it up to 1.5x. This would involve targeting the same
> GPU
> > with -gputasks.
> >
> > Within a simulation, you should set up a benchmarking script to figure
> out
> > the best combination of thread-mpi ranks and open-mp threads - this can
> > have pretty drastic effects on performance. For example, if you want to
> use
> > your entire machine for one simulation (not recommended for maximal
> > efficiency), you have a lot of decomposition options (ignoring PME -
> which
> > is important, see below):
> >
> > -ntmpi 2 -ntomp 32 -gputasks 01
> > -ntmpi 4 -ntomp 16 -gputasks 0011
> > -ntmpi 8 -ntomp 8  -gputasks 
> > -ntmpi 16 -ntomp 4 -gputasks 111
> > (and a few others - note that ntmpi * ntomp = total threads available)
> >
> > In my experience, you need to scan the options in a benchmarking script
> for
> > each simulation size/content you want to simulate, and the difference
> > between the best and the worst can be up to a factor of 2-4 in terms of
> > performance. If you're splitting your machine among multiple
> simulations, I
> > suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your
> > benchmarking suggests that the optimal performance lies elsewhere.
> >
> > Things get more complicated when you start putting PME on the GPUs. For
> the
> > machines I work on, putting PME on GPUs absolutely improves performance,
> > but I'm not fully confident in that assessment without testing your
> > specific machine - you have a lot of cores with that threadripper, and
> this
> > is another area

Re: [gmx-users] simulation on 2 gpus

2019-07-30 Thread Stefano Guglielmo
le
...
Using 1 MPI thread
Using 32 OpenMP threads

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 32."

Two runs can be carried out with the command:
gmx mdrun -deffnm run1 -gpu_id 1 -pin on -pinstride 1 -pinoffset 14 -ntmpi
1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -pinstride 1 -pinoffset 0 -ntmpi 1
-ntomp 28

"Using 1 MPI thread
Using 28 OpenMP threads

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 14
Pinning threads with a user-specified logical core stride of 1"

or
gmx mdrun -deffnm run1 -gpu_id 1 -pin on -ntmpi 1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -ntmpi 1 -ntomp 28

"Using 1 MPI thread
Using 28 OpenMP threads

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Pinning threads with an auto-selected logical core stride of 2"

To my disappointment, in both situations there was a substantial
degradation of performance: about 35-40 ns/day for the same system, with a
gpu usage of 25-30%, compared to 50-55% for the single run on a single gpu,
and well below the power cap.

I hope I have not been confusing and would be grateful for any suggestions.

Thanks
Stefano


On Fri, 26 Jul 2019 at 15:00, Kevin Boyd
wrote:

> Sure - you can do it 2 ways with normal Gromacs. Either run the simulations
> in separate terminals, or use ampersands to run them in the background of 1
> terminal.
>
> I'll give a concrete example for your threadripper, using 32 of your cores,
> so that you could run some other computation on the other 32. I typically
> make a bash variable with all the common arguments.
>
> Given tprs run1.tpr ...run4.tpr
>
> gmx_common="gmx mdrun -ntomp 8 -ntmpi 1 -pme gpu -nb gpu -pin on -pinstride
> 1"
> $gmx_common -deffnm run1 -pinoffset 32 -gputasks 00 &
> $gmx_common -deffnm run2 -pinoffset 40 -gputasks 00 &
> $gmx_common -deffnm run3 -pinoffset 48 -gputasks 11 &
> $gmx_common -deffnm run4 -pinoffset 56 -gputasks 11
>
> So run1 will run on cores 32-39, on GPU 0, run2 on cores 40-47 on the same
> GPU, and the other 2 runs will use GPU 1. Note the ampersands on the first
> 3 runs, so they'll go off in the background
>
> I should also have mentioned one peculiarity with running with -ntmpi 1 and
> -pme gpu, in that even though there's now only one rank (with nonbonded and
> PME both running on it), you still need 2 gpu tasks for that one rank, one
> for each type of interaction.
>
> As for multidir, I forget what troubles I ran into exactly, but I was
> unable to run some subset of simulations. Anyhow if you aren't running on a
> cluster, I see no reason to compile with MPI and have to use srun or slurm,
> and need to use gmx_mpi rather than gmx. The built-in thread-mpi gives you
> up to 64 threads, and can have a minor (<5% in my experience) performance
> benefit over MPI.
>
> Kevin
>
> On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon 
> wrote:
>
> > Hi Kevin,
> > Thanks for your very useful post.  Could you give a few command line
> > examples on how to start multiple runs at different times (e.g.,
> allocate a
> > subset of CPU/GPU to one run, and start another run later using another
> > unsubset of yet-unallocated CPU/GPU).  Also, could you elaborate on the
> > drawbacks of the MPI compilation that you hinted at?
> > Gregory
> >
> > From: Kevin Boyd<mailto:kevin.b...@uconn.edu>
> > Sent: Thursday, July 25, 2019 10:31 PM
> > To: gmx-us...@gromacs.org<mailto:gmx-us...@gromacs.org>
> > Subject: Re: [gmx-users] simulation on 2 gpus
> >
> > Hi,
> >
> > I've done a lot of research/experimentation on this, so I can maybe get
> you
> > started - if anyone has any questions about the essay to follow, feel
> free
> > to email me personally, and I'll link it to the email thread if it ends
> up
> > being pertinent.
> >
> > First, there's some more internet resources to checkout. See Mark's talk
> at
> > -
> >
> >
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/

Re: [gmx-users] simulation on 2 gpus

2019-07-26 Thread Kevin Boyd
Sure - you can do it 2 ways with normal Gromacs. Either run the simulations
in separate terminals, or use ampersands to run them in the background of 1
terminal.

I'll give a concrete example for your threadripper, using 32 of your cores,
so that you could run some other computation on the other 32. I typically
make a bash variable with all the common arguments.

Given tprs run1.tpr ...run4.tpr

gmx_common="gmx mdrun -ntomp 8 -ntmpi 1 -pme gpu -nb gpu -pin on -pinstride
1"
$gmx_common -deffnm run1 -pinoffset 32 -gputasks 00 &
$gmx_common -deffnm run2 -pinoffset 40 -gputasks 00 &
$gmx_common -deffnm run3 -pinoffset 48 -gputasks 11 &
$gmx_common -deffnm run4 -pinoffset 56 -gputasks 11

So run1 will run on cores 32-39, on GPU 0, run2 on cores 40-47 on the same
GPU, and the other 2 runs will use GPU 1. Note the ampersands on the first
3 runs, so they'll go off in the background
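If you want to double-check that each job really landed on the intended cores,
one way (a sketch; assumes standard Linux tools) is to inspect the CPU affinity
of the running mdrun processes:

# print the allowed-core list of every running mdrun process
for pid in $(pgrep -f "gmx mdrun"); do taskset -cp "$pid"; done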

I should also have mentioned one peculiarity with running with -ntmpi 1 and
-pme gpu, in that even though there's now only one rank (with nonbonded and
PME both running on it), you still need 2 gpu tasks for that one rank, one
for each type of interaction.
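In other words, with a single rank the two digits of -gputasks map to that
rank's PP and PME tasks respectively; as a sketch, they can even point at
different GPUs (whether splitting them helps is system-dependent):

# one rank: short-ranged (PP) work on GPU 0, PME on GPU 1
gmx mdrun -deffnm run1 -ntmpi 1 -ntomp 8 -nb gpu -pme gpu -pin on -gputasks 01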

As for multidir, I forget what troubles I ran into exactly, but I was
unable to run some subset of simulations. Anyhow if you aren't running on a
cluster, I see no reason to compile with MPI and have to use srun or slurm,
and need to use gmx_mpi rather than gmx. The built-in thread-mpi gives you
up to 64 threads, and can have a minor (<5% in my experience) performance
benefit over MPI.

Kevin

On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon  wrote:

> Hi Kevin,
> Thanks for your very useful post.  Could you give a few command line
> examples on how to start multiple runs at different times (e.g., allocate a
> subset of CPU/GPU to one run, and start another run later using another
> unsubset of yet-unallocated CPU/GPU).  Also, could you elaborate on the
> drawbacks of the MPI compilation that you hinted at?
> Gregory
>
> From: Kevin Boyd<mailto:kevin.b...@uconn.edu>
> Sent: Thursday, July 25, 2019 10:31 PM
> To: gmx-us...@gromacs.org<mailto:gmx-us...@gromacs.org>
> Subject: Re: [gmx-users] simulation on 2 gpus
>
> Hi,
>
> I've done a lot of research/experimentation on this, so I can maybe get you
> started - if anyone has any questions about the essay to follow, feel free
> to email me personally, and I'll link it to the email thread if it ends up
> being pertinent.
>
> First, there's some more internet resources to checkout. See Mark's talk at
> -
>
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> Gromacs development moves fast, but a lot of it is still relevant.
>
> I'll expand a bit here, with the caveat that Gromacs GPU development is
> moving very fast and so the correct commands for optimal performance are
> both system-dependent and a moving target between versions. This is a good
> thing - GPUs have revolutionized the field, and with each iteration we make
> better use of them. The downside is that it's unclear exactly what sort of
> CPU-GPU balance you should look to purchase to take advantage of future
> developments, though the trend is certainly that more and more computation
> is being offloaded to the GPUs.
>
> The most important consideration is that to get maximum total throughput
> performance, you should be running not one but multiple simulations
> simultaneously. You can do this through the -multidir option, but I don't
> recommend that in this case, as it requires compiling with MPI and limits
> some of your options. My run scripts usually use "gmx mdrun ... &" to
> initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
> -pinoffset, and -gputasks. I can give specific examples if you're
> interested.
>
> Another important point is that you can run more simulations than the
> number of GPUs you have. Depending on CPU-GPU balance and quality, you
> won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
> you might increase it up to 1.5x. This would involve targeting the same GPU
> with -gputasks.
>
> Within a simulation, you should set up a benchmarking script to figure out
> the best combination of thread-mpi ranks and open-mp threads - this can
> have pretty drastic effects on performance. For example, if you want to use
> your entire machine for one simulation (not recommended for maximal
> efficiency), you have a lot of decomposition options (ignoring PME - which
> is important, see below):
>
> -ntmpi 2 -ntomp 32 -gpu

Re: [gmx-users] simulation on 2 gpus

2019-07-26 Thread Mark Abraham
Hi,

It's rather like the example at
http://manual.gromacs.org/current/user-guide/mdrun-performance.html#examples-for-mdrun-on-one-node
where instead of

gmx mdrun -nt 6 -pin on -pinoffset 0 -pinstride 1
gmx mdrun -nt 6 -pin on -pinoffset 6 -pinstride 1

to run on a machine with 12 hardware threads, you want to adapt the number
of threads and also specify disjoint GPU sets, e.g.

gmx mdrun -nt 32 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0
gmx mdrun -nt 32 -pin on -pinoffset 32 -pinstride 1 -gpu_id 1

That lets mdrun choose the mix of thread-MPI ranks vs OpenMP threads on
those ranks, but you could replace -nt 32 with -ntmpi N -ntomp M so long as
the product of N and M is 32.
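For instance, one possible explicit split (a sketch; which N and M perform
best depends on the system, so it is worth benchmarking a few combinations):

gmx mdrun -ntmpi 4 -ntomp 8 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0
gmx mdrun -ntmpi 4 -ntomp 8 -pin on -pinoffset 32 -pinstride 1 -gpu_id 1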

Mark

On Fri, 26 Jul 2019 at 14:22, Gregory Man Kai Poon  wrote:

> Hi Kevin,
> Thanks for your very useful post.  Could you give a few command line
> examples on how to start multiple runs at different times (e.g., allocate a
> subset of CPU/GPU to one run, and start another run later using another
> unsubset of yet-unallocated CPU/GPU).  Also, could you elaborate on the
> drawbacks of the MPI compilation that you hinted at?
> Gregory
>
> From: Kevin Boyd<mailto:kevin.b...@uconn.edu>
> Sent: Thursday, July 25, 2019 10:31 PM
> To: gmx-us...@gromacs.org<mailto:gmx-us...@gromacs.org>
> Subject: Re: [gmx-users] simulation on 2 gpus
>
> Hi,
>
> I've done a lot of research/experimentation on this, so I can maybe get you
> started - if anyone has any questions about the essay to follow, feel free
> to email me personally, and I'll link it to the email thread if it ends up
> being pertinent.
>
> First, there's some more internet resources to checkout. See Mark's talk at
> -
>
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> Gromacs development moves fast, but a lot of it is still relevant.
>
> I'll expand a bit here, with the caveat that Gromacs GPU development is
> moving very fast and so the correct commands for optimal performance are
> both system-dependent and a moving target between versions. This is a good
> thing - GPUs have revolutionized the field, and with each iteration we make
> better use of them. The downside is that it's unclear exactly what sort of
> CPU-GPU balance you should look to purchase to take advantage of future
> developments, though the trend is certainly that more and more computation
> is being offloaded to the GPUs.
>
> The most important consideration is that to get maximum total throughput
> performance, you should be running not one but multiple simulations
> simultaneously. You can do this through the -multidir option, but I don't
> recommend that in this case, as it requires compiling with MPI and limits
> some of your options. My run scripts usually use "gmx mdrun ... &" to
> initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
> -pinoffset, and -gputasks. I can give specific examples if you're
> interested.
>
> Another important point is that you can run more simulations than the
> number of GPUs you have. Depending on CPU-GPU balance and quality, you
> won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
> you might increase it up to 1.5x. This would involve targeting the same GPU
> with -gputasks.
>
> Within a simulation, you should set up a benchmarking script to figure out
> the best combination of thread-mpi ranks and open-mp threads - this can
> have pretty drastic effects on performance. For example, if you want to use
> your entire machine for one simulation (not recommended for maximal
> efficiency), you have a lot of decomposition options (ignoring PME - which
> is important, see below):
>
> -ntmpi 2 -ntomp 32 -gputasks 01
> -ntmpi 4 -ntomp 16 -gputasks 0011
> -ntmpi 8 -ntomp 8  -gputasks 
> -ntmpi 16 -ntomp 4 -gputasks 111
> (and a few others - note that ntmpi * ntomp = total threads available)
>
> In my experience, you need to scan the options in a benchmarking script for
> each simulation size/content you want to simulate, and the difference
> between the best and the worst can be up to a factor of 2-4 in terms of
> performance. If you're splitting your machine among multiple simulations, I
> suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your
> benchmarking suggests that the optimal performance lies elsewhere.
>
> Things get more complicated when you start putting PME on the GPUs. For the
> machines I work on, putting PME on GPUs absolutely improves performance,
> but I'm not fully confident in 

Re: [gmx-users] simulation on 2 gpus

2019-07-26 Thread Gregory Man Kai Poon
Hi Kevin,
Thanks for your very useful post.  Could you give a few command line examples
on how to start multiple runs at different times (e.g., allocate a subset of
CPU/GPU to one run, and start another run later using another, not-yet-allocated
subset of CPU/GPU)?  Also, could you elaborate on the drawbacks of the
MPI compilation that you hinted at?
Gregory

From: Kevin Boyd <kevin.b...@uconn.edu>
Sent: Thursday, July 25, 2019 10:31 PM
To: gmx-us...@gromacs.org
Subject: Re: [gmx-users] simulation on 2 gpus

Hi,

I've done a lot of research/experimentation on this, so I can maybe get you
started - if anyone has any questions about the essay to follow, feel free
to email me personally, and I'll link it to the email thread if it ends up
being pertinent.

First, there are some more internet resources to check out. See Mark's talk at
-
https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
Gromacs development moves fast, but a lot of it is still relevant.

I'll expand a bit here, with the caveat that Gromacs GPU development is
moving very fast and so the correct commands for optimal performance are
both system-dependent and a moving target between versions. This is a good
thing - GPUs have revolutionized the field, and with each iteration we make
better use of them. The downside is that it's unclear exactly what sort of
CPU-GPU balance you should look to purchase to take advantage of future
developments, though the trend is certainly that more and more computation
is being offloaded to the GPUs.

The most important consideration is that to get maximum total throughput
performance, you should be running not one but multiple simulations
simultaneously. You can do this through the -multidir option, but I don't
recommend that in this case, as it requires compiling with MPI and limits
some of your options. My run scripts usually use "gmx mdrun ... &" to
initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
-pinoffset, and -gputasks. I can give specific examples if you're
interested.

Another important point is that you can run more simulations than the
number of GPUs you have. Depending on CPU-GPU balance and quality, you
won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
you might increase it up to 1.5x. This would involve targeting the same GPU
with -gputasks.

Within a simulation, you should set up a benchmarking script to figure out
the best combination of thread-MPI ranks and OpenMP threads - this can
have pretty drastic effects on performance. For example, if you want to use
your entire machine for one simulation (not recommended for maximal
efficiency), you have a lot of decomposition options (ignoring PME - which
is important, see below):

-ntmpi 2 -ntomp 32 -gputasks 01
-ntmpi 4 -ntomp 16 -gputasks 0011
-ntmpi 8 -ntomp 8  -gputasks 00001111
-ntmpi 16 -ntomp 4 -gputasks 0000000011111111
(and a few others - note that ntmpi * ntomp = total threads available)

In my experience, you need to scan the options in a benchmarking script for
each simulation size/content you want to simulate, and the difference
between the best and the worst can be up to a factor of 2-4 in terms of
performance. If you're splitting your machine among multiple simulations, I
suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your
benchmarking suggests that the optimal performance lies elsewhere.

Things get more complicated when you start putting PME on the GPUs. For the
machines I work on, putting PME on GPUs absolutely improves performance,
but I'm not fully confident in that assessment without testing your
specific machine - you have a lot of cores with that threadripper, and this
is another area where I expect Gromacs 2020 might shift the GPU-CPU optimal
balance.

The issue with PME on GPUs is that we can (currently) only have one rank
doing GPU PME work. So, if we have a machine with say 20 cores and 2 gpus,
if I run the following

gmx mdrun  -ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01

then two ranks will be started: one, using cores 0-9, will work on the
short-range interactions, offloading what it can to GPU 0, and the PME
rank (cores 10-19) will offload to GPU 1. There is one significant problem
(and one minor problem) with this setup. First, it is massively inefficient
in terms of load balance. In a typical system (there are exceptions), PME
takes roughly 1/3 as much computation as the short-range interactions. So we
are offloading about 1/4 of the total GPU work to one GPU and 3/4 to the
other, which leads to imbalance. In this specific case (2 GPUs and sufficient
cores), the optimal solution is often (but not always) to run with
-ntmpi 4 (in

Re: [gmx-users] simulation on 2 gpus

2019-07-25 Thread Kevin Boyd
Hi,

I've done a lot of research/experimentation on this, so I can maybe get you
started - if anyone has any questions about the essay to follow, feel free
to email me personally, and I'll link it to the email thread if it ends up
being pertinent.

First, there are some more internet resources to check out. See Mark's talk at
-
https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
Gromacs development moves fast, but a lot of it is still relevant.

I'll expand a bit here, with the caveat that Gromacs GPU development is
moving very fast and so the correct commands for optimal performance are
both system-dependent and a moving target between versions. This is a good
thing - GPUs have revolutionized the field, and with each iteration we make
better use of them. The downside is that it's unclear exactly what sort of
CPU-GPU balance you should look to purchase to take advantage of future
developments, though the trend is certainly that more and more computation
is being offloaded to the GPUs.

The most important consideration is that to get maximum total throughput
performance, you should be running not one but multiple simulations
simultaneously. You can do this through the -multidir option, but I don't
recommend that in this case, as it requires compiling with MPI and limits
some of your options. My run scripts usually use "gmx mdrun ... &" to
initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
-pinoffset, and -gputasks. I can give specific examples if you're
interested.
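
For instance, a minimal sketch of two concurrent runs, assuming a 64-thread,
2-GPU box and two hypothetical inputs run1.tpr and run2.tpr (adjust thread
counts and pin offsets to your hardware):

# run 1 on GPU 0, pinned to hardware threads 0-31
gmx mdrun -deffnm run1 -nb gpu -pme gpu -ntmpi 1 -ntomp 32 \
          -gputasks 00 -pin on -pinoffset 0 -pinstride 1 &
# run 2 on GPU 1, pinned to hardware threads 32-63
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntmpi 1 -ntomp 32 \
          -gputasks 11 -pin on -pinoffset 32 -pinstride 1 &
wait

Each single-rank run has two GPU tasks (short-range and PME), hence the
two-character -gputasks string, and the pin offsets keep the jobs on
separate cores.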

Another important point is that you can run more simulations than the
number of GPUs you have. Depending on CPU-GPU balance and quality, you
won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
you might increase it up to 1.5x. This would involve targeting the same GPU
with -gputasks.
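
A rough sketch of what that could look like - two hypothetical runs sharing
GPU 0 (two more could target GPU 1 with -gputasks 11 and higher pin offsets):

gmx mdrun -deffnm sim1 -nb gpu -pme gpu -ntmpi 1 -ntomp 16 \
          -gputasks 00 -pin on -pinoffset 0 -pinstride 1 &
gmx mdrun -deffnm sim2 -nb gpu -pme gpu -ntmpi 1 -ntomp 16 \
          -gputasks 00 -pin on -pinoffset 16 -pinstride 1 &
wait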

Within a simulation, you should set up a benchmarking script to figure out
the best combination of thread-MPI ranks and OpenMP threads - this can
have pretty drastic effects on performance. For example, if you want to use
your entire machine for one simulation (not recommended for maximal
efficiency), you have a lot of decomposition options (ignoring PME - which
is important, see below):

-ntmpi 2 -ntomp 32 -gputasks 01
-ntmpi 4 -ntomp 16 -gputasks 0011
-ntmpi 8 -ntomp 8  -gputasks 00001111
-ntmpi 16 -ntomp 4 -gputasks 0000000011111111
(and a few others - note that ntmpi * ntomp = total threads available)

In my experience, you need to scan the options in a benchmarking script for
each simulation size/content you want to simulate, and the difference
between the best and the worst can be up to a factor of 2-4 in terms of
performance. If you're splitting your machine among multiple simulations, I
suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your
benchmarking suggests that the optimal performance lies elsewhere.
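
A bare-bones version of such a scan, assuming a short hypothetical benchmark
input bench.tpr and 64 hardware threads, could be:

# scan rank/thread splits on a short run, then compare the ns/day numbers
for ntmpi in 2 4 8 16; do
    ntomp=$((64 / ntmpi))
    gmx mdrun -s bench.tpr -deffnm bench_${ntmpi}x${ntomp} \
              -ntmpi ${ntmpi} -ntomp ${ntomp} -nsteps 20000 -resethway -pin on
done
grep -h "Performance:" bench_*.log

The -resethway flag restarts the timing counters halfway through the run, so
initial load balancing and pair-list tuning don't skew the comparison.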

Things get more complicated when you start putting PME on the GPUs. For the
machines I work on, putting PME on GPUs absolutely improves performance,
but I'm not fully confident in that assessment without testing your
specific machine - you have a lot of cores with that threadripper, and this
is another area where I expect Gromacs 2020 might shift the GPU-CPU optimal
balance.
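
If you want to check on your own machine, a quick sketch is to time a short
run both ways (bench.tpr again being a hypothetical benchmark input) and
compare the Performance: lines in the two log files:

gmx mdrun -s bench.tpr -deffnm bench_pme_gpu -nb gpu -pme gpu \
          -ntmpi 1 -ntomp 32 -nsteps 20000 -resethway -pin on
gmx mdrun -s bench.tpr -deffnm bench_pme_cpu -nb gpu -pme cpu \
          -ntmpi 1 -ntomp 32 -nsteps 20000 -resethway -pin on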

The issue with PME on GPUs is that we can (currently) only have one rank
doing GPU PME work. So, if we have a machine with say 20 cores and 2 gpus,
if I run the following

gmx mdrun  -ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01

then two ranks will be started: one, using cores 0-9, will work on the
short-range interactions, offloading what it can to GPU 0, and the PME
rank (cores 10-19) will offload to GPU 1. There is one significant problem
(and one minor problem) with this setup. First, it is massively inefficient
in terms of load balance. In a typical system (there are exceptions), PME
takes roughly 1/3 as much computation as the short-range interactions. So we
are offloading about 1/4 of the total GPU work to one GPU and 3/4 to the
other, which leads to imbalance. In this specific case (2 GPUs and sufficient
cores), the optimal solution is often (but not always) to run with
-ntmpi 4 (in this example, then -ntomp 5), as the PME rank then gets 1/4 of
the GPU instructions, proportional to the computation needed.
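
Concretely, that suggestion might look something like the line below (a
sketch, not tested on your machine - with -ntmpi 4 and -npme 1 there are
four GPU tasks to map, three short-range and one PME):

gmx mdrun -ntmpi 4 -ntomp 5 -pme gpu -npme 1 -gputasks 0011 -pin on

Here two short-range ranks go to GPU 0 and the remaining short-range rank
plus the PME rank go to GPU 1; whether that or another mapping balances
best is worth benchmarking.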

The second (less critical - don't worry about this unless you're
CPU-limited) problem is that GPU PME ranks only use 1 CPU core in their
calculations. So, with a node of 20 cores and 2 GPUs, if I run a simulation
with -ntmpi 4 -ntomp 5 -pme gpu -npme 1, each one of those ranks will
have 5 CPUs, but the PME rank will only use one of them. You can
specify the number of OpenMP threads for PME ranks with -ntomp_pme. This is useful in
restricted cases. For example, given the above architecture setup (20
cores, 2 GPUs), I could maximally exploit my CPUs with the following
commands:

gmx mdrun  -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 0000 -pin on -pinoffset 0 &
gmx mdrun  -ntmpi 4 

[gmx-users] simulation on 2 gpus

2019-07-25 Thread Stefano Guglielmo
Dear all,
I am trying to run simulations with Gromacs 2019.2 on a workstation with an
AMD Threadripper CPU (32 cores, 64 threads), 128 GB RAM and two RTX 2080
Ti cards with an NVLink bridge. I read the user's guide section on performance
and I am exploring some possible combinations of CPU/GPU work to run as
fast as possible. I was wondering if any of you have experience running
on more than one GPU with several cores and can give some hints as a
starting point.
Thanks
Stefano


-- 
Stefano GUGLIELMO PhD
Assistant Professor of Medicinal Chemistry
Department of Drug Science and Technology
Via P. Giuria 9
10125 Turin, ITALY
ph. +39 (0)11 6707178