Re: [gmx-users] simulation on 2 gpus
On Fri, Sep 6, 2019 at 3:47 PM Stefano Guglielmo wrote:
>
> Hi Szilard,
>
> thanks for suggestions.
>
> As for the strange crash, the workstation works fine using only the cpu; the problem seems to be related to gpu usage: when both cards draw around 200 W out of 250 W, the workstation turns off. It is not about the PSU (even in the "offending" case we are quite below the maximum power),

How far below? Note that PSU efficiency and quality also affect stability at high load.

> and it is not related to temperature either (it happens even if gpu temp is around 55-60 °C). The vendor did some tests, and according to those the hardware seems to be ok. Do you (or anyone else on the list) have any particular test to suggest that can more specifically help to diagnose the problem?

For GPU load and memory stress testing I suggest:
https://github.com/ComputationalRadiationPhysics/cuda_memtest

Cheers,
--
Szilárd

> Any opinion is appreciated,
>
> thanks
>
> Il giorno mercoledì 21 agosto 2019, Szilárd Páll ha scritto:
>
>> Hi Stefano,
>>
>> On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo wrote:
>>>
>>> Dear Szilard,
>>>
>>> thanks for the very clear answer.
>>> Following your suggestion I tried to run without DD; for the same system I ran two simulations on two gpus:
>>>
>>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>>>
>>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 11 -pin on -pinoffset 28 -pinstride 1
>>>
>>> but again the system crashed; by this I mean that after a few minutes the machine goes off (power off) without any error message, even without using all the threads.
>>
>> That is not normal and I strongly recommend investigating it, as it could be a sign of an underlying system/hardware instability or fault which could ultimately lead to incorrect simulation results.
>>
>> Are you sure that:
>> - your machine is stable and reliable at high loads; is the PSU sufficient?
>> - your hardware has been thoroughly stress-tested and does not show instabilities?
>>
>> Does the crash also happen with GROMACS running on the CPU only (using all cores)?
>> I'd recommend running some stress-tests that fully load the machine for a few hours to see if the error persists.
>>
>>> I then tried running the two simulations on the same gpu without DD:
>>>
>>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>>>
>>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 28 -pinstride 1
>>>
>>> and I obtained better performance (about 70 ns/day) with heavy use of the gpu (around 90%), compared to the two runs on two gpus I reported in the previous post
>>> (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
>>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1).
>>
>> That is expected; domain decomposition on a single GPU is unnecessary and introduces overheads that limit performance.
>>
>>> As for pinning, the cpu topology according to the log file is:
>>> hardware topology: Basic
>>> Sockets, cores, and logical processors:
>>> Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
>>> If I understand well (absolutely not sure) it should not be that convenient to pin to consecutive threads,
>>
>> On the contrary, pinning to consecutive threads is the recommended behavior. More generally, application threads are expected to be pinned to consecutive cores (as threading parallelization will benefit from the resulting cache access patterns); now, CPU cores can have multiple hardware threads, and whether using one or multiple threads per core makes sense (performance-wise) determines whether a stride of 1 or 2 is best. Typically, when most work is offloaded to a GPU and many CPU cores are available, 1 thread/core is best.
>>
>> Note that the above topology mapping simply means that the indexed entities the operating system calls "CPU", grouped in "[]", correspond to hardware threads of the same core, i.e. core 0 is [0 32], core 1 is [1 33], etc. Pinning with a stride happens onto this map:
>> - with a -pinstride 1, thread mapping will be (app thread -> hardware thread): 0->0, 1->32, 2->1, 3->33, ...
>> - with a -pinstride 2, thread mapping will be (-||-): 0->0, 1->1, 2->2, 3->3, ...
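To make the stride mapping concrete, the two pinning choices for a 28-thread single-rank run on this machine would look something like the following (an illustrative sketch based on the commands quoted above, not a tested recommendation):

gmx mdrun -deffnm run -nb gpu -pme gpu -ntmpi 1 -ntomp 28 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 2   # one thread per core: hardware threads 0,1,...,27
gmx mdrun -deffnm run -nb gpu -pme gpu -ntmpi 1 -ntomp 28 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1   # both threads per core: 0,32,1,33,... i.e. cores 0-13

Given the [0 32] [1 33] ... map above, the stride-2 variant leaves the second hardware thread of each core idle, while the stride-1 variant packs the 28 threads onto just 14 physical cores.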
Re: [gmx-users] simulation on 2 gpus
Hi Szilard,

thanks for suggestions.

As for the strange crash, the workstation works fine using only the cpu; the problem seems to be related to gpu usage: when both cards draw around 200 W out of 250 W, the workstation turns off. It is not about the PSU (even in the "offending" case we are quite below the maximum power), and it is not related to temperature either (it happens even if gpu temp is around 55-60 °C). The vendor did some tests, and according to those the hardware seems to be ok. Do you (or anyone else on the list) have any particular test to suggest that can more specifically help to diagnose the problem?

Any opinion is appreciated,

thanks

Il giorno mercoledì 21 agosto 2019, Szilárd Páll ha scritto:

> Hi Stefano,
>
> On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo wrote:
>>
>> Dear Szilard,
>>
>> thanks for the very clear answer.
>> Following your suggestion I tried to run without DD; for the same system I ran two simulations on two gpus:
>>
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>>
>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 11 -pin on -pinoffset 28 -pinstride 1
>>
>> but again the system crashed; by this I mean that after a few minutes the machine goes off (power off) without any error message, even without using all the threads.
>
> That is not normal and I strongly recommend investigating it, as it could be a sign of an underlying system/hardware instability or fault which could ultimately lead to incorrect simulation results.
>
> Are you sure that:
> - your machine is stable and reliable at high loads; is the PSU sufficient?
> - your hardware has been thoroughly stress-tested and does not show instabilities?
>
> Does the crash also happen with GROMACS running on the CPU only (using all cores)?
> I'd recommend running some stress-tests that fully load the machine for a few hours to see if the error persists.
>
>> I then tried running the two simulations on the same gpu without DD:
>>
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>>
>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 28 -pinstride 1
>>
>> and I obtained better performance (about 70 ns/day) with heavy use of the gpu (around 90%), compared to the two runs on two gpus I reported in the previous post
>> (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1).
>
> That is expected; domain decomposition on a single GPU is unnecessary and introduces overheads that limit performance.
>
>> As for pinning, the cpu topology according to the log file is:
>> hardware topology: Basic
>> Sockets, cores, and logical processors:
>> Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
>> If I understand well (absolutely not sure) it should not be that convenient to pin to consecutive threads,
>
> On the contrary, pinning to consecutive threads is the recommended behavior. More generally, application threads are expected to be pinned to consecutive cores (as threading parallelization will benefit from the resulting cache access patterns); now, CPU cores can have multiple hardware threads, and whether using one or multiple threads per core makes sense (performance-wise) determines whether a stride of 1 or 2 is best. Typically, when most work is offloaded to a GPU and many CPU cores are available, 1 thread/core is best.
>
> Note that the above topology mapping simply means that the indexed entities the operating system calls "CPU", grouped in "[]", correspond to hardware threads of the same core, i.e. core 0 is [0 32], core 1 is [1 33], etc. Pinning with a stride happens onto this map:
> - with a -pinstride 1, thread mapping will be (app thread -> hardware thread): 0->0, 1->32, 2->1, 3->33, ...
> - with a -pinstride 2, thread mapping will be (-||-): 0->0, 1->1, 2->2, 3->3, ...
>
>> and indeed I found a subtle degradation of performance for a single simulation, switching from:
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on
>> to
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1.
>
> If you compare the log files of the two, you should notice that the former used a pinstride of 2, resulting in the use of 28 cores, while the latter used only 14 cores; the likely reason for only a small difference is that there is not enough CPU work to scale to 28 cores and, additionally, these specific TR CPUs are tricky to scale across using wide multi-threaded parallelization.
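If in doubt about which pinning was actually applied, the log states it explicitly; a quick way to check (plain grep, nothing GROMACS-specific) is:

grep -i "pinning threads" run.log

which should print a line like "Pinning threads with an auto-selected logical core stride of 2" or "Pinning threads with a user-specified logical core stride of 1" (see the log excerpts quoted later in this thread).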
Re: [gmx-users] simulation on 2 gpus
Hi Stefano,

On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo wrote:
>
> Dear Szilard,
>
> thanks for the very clear answer.
> Following your suggestion I tried to run without DD; for the same system I ran two simulations on two gpus:
>
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 11 -pin on -pinoffset 28 -pinstride 1
>
> but again the system crashed; by this I mean that after a few minutes the machine goes off (power off) without any error message, even without using all the threads.

That is not normal and I strongly recommend investigating it, as it could be a sign of an underlying system/hardware instability or fault which could ultimately lead to incorrect simulation results.

Are you sure that:
- your machine is stable and reliable at high loads; is the PSU sufficient?
- your hardware has been thoroughly stress-tested and does not show instabilities?

Does the crash also happen with GROMACS running on the CPU only (using all cores)?
I'd recommend running some stress-tests that fully load the machine for a few hours to see if the error persists.

> I then tried running the two simulations on the same gpu without DD:
>
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 28 -pinstride 1
>
> and I obtained better performance (about 70 ns/day) with heavy use of the gpu (around 90%), compared to the two runs on two gpus I reported in the previous post
> (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1).

That is expected; domain decomposition on a single GPU is unnecessary and introduces overheads that limit performance.

> As for pinning, the cpu topology according to the log file is:
> hardware topology: Basic
> Sockets, cores, and logical processors:
> Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
> If I understand well (absolutely not sure) it should not be that convenient to pin to consecutive threads,

On the contrary, pinning to consecutive threads is the recommended behavior. More generally, application threads are expected to be pinned to consecutive cores (as threading parallelization will benefit from the resulting cache access patterns); now, CPU cores can have multiple hardware threads, and whether using one or multiple threads per core makes sense (performance-wise) determines whether a stride of 1 or 2 is best. Typically, when most work is offloaded to a GPU and many CPU cores are available, 1 thread/core is best.

Note that the above topology mapping simply means that the indexed entities the operating system calls "CPU", grouped in "[]", correspond to hardware threads of the same core, i.e. core 0 is [0 32], core 1 is [1 33], etc. Pinning with a stride happens onto this map:
- with a -pinstride 1, thread mapping will be (app thread -> hardware thread): 0->0, 1->32, 2->1, 3->33, ...
- with a -pinstride 2, thread mapping will be (-||-): 0->0, 1->1, 2->2, 3->3, ...

> and indeed I found a subtle degradation of performance for a single simulation, switching from:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on
> to
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1.

If you compare the log files of the two, you should notice that the former used a pinstride of 2, resulting in the use of 28 cores, while the latter used only 14 cores; the likely reason for only a small difference is that there is not enough CPU work to scale to 28 cores and, additionally, these specific TR CPUs are tricky to scale across using wide multi-threaded parallelization.

Cheers,
--
Szilárd

> Thanks again
> Stefano
>
> Il giorno ven 16 ago 2019 alle ore 17:48 Szilárd Páll <pall.szil...@gmail.com> ha scritto:
>
>> On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo wrote:
>>>
>>> Dear Paul,
>>> thanks for suggestions. Following them I managed to run 91 ns/day for the system I referred to in my previous post with the configuration:
>>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on (still 28 threads seems to be the best choice)
>>>
>>> and 56 ns/day for two independent runs:
>>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
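For reference, the multidir route mentioned elsewhere in this thread could look something like this (a sketch, not tested here; it assumes an MPI-enabled gmx_mpi build and one tpr in each of the directories sim1/ and sim2/):

mpirun -np 2 gmx_mpi mdrun -multidir sim1 sim2 -ntomp 28 -nb gpu -pme gpu -pin on

mdrun then runs one simulation per directory and works out the pinning and GPU assignment for the whole ensemble, instead of the manual -pinoffset bookkeeping shown above.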
Re: [gmx-users] simulation on 2 gpus
Dear Szilard,

thanks for the very clear answer.
Following your suggestion I tried to run without DD; for the same system I ran two simulations on two gpus:

gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1

gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 11 -pin on -pinoffset 28 -pinstride 1

but again the system crashed; by this I mean that after a few minutes the machine goes off (power off) without any error message, even without using all the threads.

I then tried running the two simulations on the same gpu without DD:

gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1

gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 28 -pinstride 1

and I obtained better performance (about 70 ns/day) with heavy use of the gpu (around 90%), compared to the two runs on two gpus I reported in the previous post
(gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1).

As for pinning, the cpu topology according to the log file is:
hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
If I understand well (absolutely not sure) it should not be that convenient to pin to consecutive threads, and indeed I found a subtle degradation of performance for a single simulation, switching from:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on
to
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1.

Thanks again
Stefano

Il giorno ven 16 ago 2019 alle ore 17:48 Szilárd Páll <pall.szil...@gmail.com> ha scritto:

> On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo wrote:
>>
>> Dear Paul,
>> thanks for suggestions. Following them I managed to run 91 ns/day for the system I referred to in my previous post with the configuration:
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on (still 28 threads seems to be the best choice)
>>
>> and 56 ns/day for two independent runs:
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1
>> which is a fairly good result.
>
> Use no DD in single-GPU runs, i.e. for the latter, just simply
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> You can also have mdrun's multidir functionality manage an ensemble of jobs (related or not) so you don't have to manually start them, calculate pinning, etc.
>
>> I am still wondering if somehow I should pin the threads in some different way in order to reflect the cpu topology and if this can influence performance (if I remember well NAMD allows the user to indicate explicitly the cpu cores/threads to use in a computation).
>
> Your pinning does reflect the CPU topology -- the 4x7=28 threads are pinned to consecutive hardware threads (because of -pinstride 1, i.e. don't skip the second hardware thread of the core). The mapping of software to hardware threads happens based on the topology-based hardware thread indexing; see the hardware detection report in the log file.
>
>> When I tried to run two simulations with the following configuration:
>> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks -pin on -pinoffset 0 -pinstride 1
>> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks -pin on -pinoffset 0 -pinstride 32
>> the system crashed down. Probably this is normal and I am missing something quite obvious.
>
> Not really. What do you mean by "crashed down"? The machine should not crash, nor should the simulation. Even though your machine has 32 cores / 64 threads, using all of these may not always be beneficial, as using more threads where there is too little work to scale will have an overhead. Have you tried using all cores but only 1 thread / core (i.e. 32 threads in total with pinstride 2)?
>
> Cheers,
> --
> Szilárd
>
>> Thanks again for the valuable advices
>> Stefano
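For the record, that last suggestion would look roughly like this on this machine (an untested sketch; flags as used elsewhere in the thread):

gmx mdrun -deffnm run -nb gpu -pme gpu -ntmpi 1 -ntomp 32 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 2

i.e. 32 threads, one per physical core, skipping the second hardware thread of each core.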
Re: [gmx-users] simulation on 2 gpus
On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo wrote:
>
> Dear Paul,
> thanks for suggestions. Following them I managed to run 91 ns/day for the system I referred to in my previous post with the configuration:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on (still 28 threads seems to be the best choice)
>
> and 56 ns/day for two independent runs:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1
> which is a fairly good result.

Use no DD in single-GPU runs, i.e. for the latter, just simply
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks 00 -pin on -pinoffset 0 -pinstride 1

You can also have mdrun's multidir functionality manage an ensemble of jobs (related or not) so you don't have to manually start them, calculate pinning, etc.

> I am still wondering if somehow I should pin the threads in some different way in order to reflect the cpu topology and if this can influence performance (if I remember well NAMD allows the user to indicate explicitly the cpu cores/threads to use in a computation).

Your pinning does reflect the CPU topology -- the 4x7=28 threads are pinned to consecutive hardware threads (because of -pinstride 1, i.e. don't skip the second hardware thread of the core). The mapping of software to hardware threads happens based on the topology-based hardware thread indexing; see the hardware detection report in the log file.

> When I tried to run two simulations with the following configuration:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks -pin on -pinoffset 0 -pinstride 1
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks -pin on -pinoffset 0 -pinstride 32
> the system crashed down. Probably this is normal and I am missing something quite obvious.

Not really. What do you mean by "crashed down"? The machine should not crash, nor should the simulation. Even though your machine has 32 cores / 64 threads, using all of these may not always be beneficial, as using more threads where there is too little work to scale will have an overhead. Have you tried using all cores but only 1 thread / core (i.e. 32 threads in total with pinstride 2)?

Cheers,
--
Szilárd

> Thanks again for the valuable advices
> Stefano
>
> Il giorno dom 4 ago 2019 alle ore 01:40 paul buscemi ha scritto:
>
>> Stefano,
>>
>> A recent run with 14 atoms, including 1 isopropanol molecules on top of an end-restrained PDMS surface of 74000 atoms in a 20 20 30 nm box ran at 67 ns/d nvt with the mdrun conditions I posted. It took 120 ns for 100 molecules of an adsorbate to go from solution to the surface. I don't think this will set the world ablaze with any benchmarks but it is acceptable to get some work done.
>>
>> Linux Mint Mate 18, AMD Threadripper 32 core 2990wx 4.2Ghz, 32GB DDR4, 2x RTX 2080TI, gmx 2019 in the simplest gmx configuration for gpus, CUDA version 10, Nvidia 410.7p loaded from the repository
>>
>> Paul
>>
>>> On Aug 3, 2019, at 12:58 PM, paul buscemi wrote:
>>>
>>> Stefano,
>>>
>>> Here is a typical run
>>>
>>> for minimization: mdrun -deffnm grofile -nb gpu
>>>
>>> and for other runs for a 32 core
>>>
>>> gmx mdrun -deffnm grofile.nvt -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -gputasks -pin on
>>>
>>> Depending on the molecular system/model -ntomp 4 -ntmpi 16 may be faster - of course adjusting -gputasks
>>>
>>> Rarely do I find that not using ntomp and ntmpi is faster, but it is never bad
>>>
>>> Let me know how it goes.
>>>
>>> Paul
>>>
>>>> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo <stefano.guglie...@unito.it> wrote:
>>>>
>>>> Hi Paul,
>>>> thanks for the reply. Would you mind posting the command you used or telling how you balanced the work between cpu and gpu?
>>>>
>>>> What about pinning? Does anyone know how to deal with a cpu topology like the one reported in my previous post and if it is relevant for performance?
>>>> Thanks
>>>> Stefano
>>>>
>>>> Il giorno sabato 3 agosto 2019, Paul Buscemi ha scritto:
>>>>
>>>>> I run the same system and setup but no nvlink. Maestro runs both gpus at 100 percent. Gromacs typically 50-60 percent; can do 600 ns/d on 2 atoms
>>>>>
>>>>> PB
>>>>>
>>>>>> On Jul 25, 2019, at 9:30 PM, Kevin Boyd wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
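A note on the -gputasks values above: the string needs one digit per GPU task, i.e. one per rank here (each PP rank has one short-range task, and with -npme 1 the PME rank has one PME task). So for Paul's -ntmpi 8 -npme 1 layout an explicit mapping would be something like -gputasks 00001111 — a hypothetical split of the 8 ranks across GPUs 0 and 1, not a value from the original message.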
Re: [gmx-users] simulation on 2 gpus
Dear Paul,

thanks for suggestions. Following them I managed to run 91 ns/day for the system I referred to in my previous post with the configuration:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on (still 28 threads seems to be the best choice)

and 56 ns/day for two independent runs:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 000 -pin on -pinoffset 0 -pinstride 1
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks 111 -pin on -pinoffset 28 -pinstride 1
which is a fairly good result.

I am still wondering if somehow I should pin the threads in some different way in order to reflect the cpu topology and if this can influence performance (if I remember well NAMD allows the user to indicate explicitly the cpu cores/threads to use in a computation).

When I tried to run two simulations with the following configuration:
gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks -pin on -pinoffset 0 -pinstride 1
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1 -gputasks -pin on -pinoffset 0 -pinstride 32
the system crashed down. Probably this is normal and I am missing something quite obvious.

Thanks again for the valuable advices
Stefano

Il giorno dom 4 ago 2019 alle ore 01:40 paul buscemi ha scritto:

> Stefano,
>
> A recent run with 14 atoms, including 1 isopropanol molecules on top of an end-restrained PDMS surface of 74000 atoms in a 20 20 30 nm box ran at 67 ns/d nvt with the mdrun conditions I posted. It took 120 ns for 100 molecules of an adsorbate to go from solution to the surface. I don't think this will set the world ablaze with any benchmarks but it is acceptable to get some work done.
>
> Linux Mint Mate 18, AMD Threadripper 32 core 2990wx 4.2Ghz, 32GB DDR4, 2x RTX 2080TI, gmx 2019 in the simplest gmx configuration for gpus, CUDA version 10, Nvidia 410.7p loaded from the repository
>
> Paul
>
>> On Aug 3, 2019, at 12:58 PM, paul buscemi wrote:
>>
>> Stefano,
>>
>> Here is a typical run
>>
>> for minimization: mdrun -deffnm grofile -nb gpu
>>
>> and for other runs for a 32 core
>>
>> gmx mdrun -deffnm grofile.nvt -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -gputasks -pin on
>>
>> Depending on the molecular system/model -ntomp 4 -ntmpi 16 may be faster - of course adjusting -gputasks
>>
>> Rarely do I find that not using ntomp and ntmpi is faster, but it is never bad
>>
>> Let me know how it goes.
>>
>> Paul
>>
>>> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo <stefano.guglie...@unito.it> wrote:
>>>
>>> Hi Paul,
>>> thanks for the reply. Would you mind posting the command you used or telling how you balanced the work between cpu and gpu?
>>>
>>> What about pinning? Does anyone know how to deal with a cpu topology like the one reported in my previous post and if it is relevant for performance?
>>> Thanks
>>> Stefano
>>>
>>> Il giorno sabato 3 agosto 2019, Paul Buscemi ha scritto:
>>>
>>>> I run the same system and setup but no nvlink. Maestro runs both gpus at 100 percent. Gromacs typically 50-60 percent; can do 600 ns/d on 2 atoms
>>>>
>>>> PB
>>>>
>>>>> On Jul 25, 2019, at 9:30 PM, Kevin Boyd wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>>>>>
>>>>> First, there's some more internet resources to check out. See Mark's talk at:
>>>>> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
>>>>> Gromacs development moves fast, but a lot of it is still relevant.
>>>>>
>>>>> I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
>>>>>
>>>>> The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
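A rule of thumb for the offsets in the two-run setup above: the second run's -pinoffset must equal the total thread count of the first run, so that the two jobs pin to disjoint hardware threads. As a sketch in plain shell (the fully written-out -gputasks string, one digit per rank, is an assumed example):

ntmpi=7; ntomp=4
offset=$((ntmpi * ntomp))   # = 28, where the first run's threads end
gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntmpi $ntmpi -ntomp $ntomp -npme 1 -gputasks 1111111 -pin on -pinoffset $offset -pinstride 1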
Re: [gmx-users] simulation on 2 gpus
Stefano,

Here is a typical run

for minimization: mdrun -deffnm grofile -nb gpu

and for other runs for a 32 core

gmx mdrun -deffnm grofile.nvt -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -gputasks -pin on

Depending on the molecular system/model -ntomp 4 -ntmpi 16 may be faster - of course adjusting -gputasks

Rarely do I find that not using ntomp and ntmpi is faster, but it is never bad

Let me know how it goes.

Paul

> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo <stefano.guglie...@unito.it> wrote:
>
> Hi Paul,
> thanks for the reply. Would you mind posting the command you used or telling how you balanced the work between cpu and gpu?
>
> What about pinning? Does anyone know how to deal with a cpu topology like the one reported in my previous post and if it is relevant for performance?
> Thanks
> Stefano
>
> Il giorno sabato 3 agosto 2019, Paul Buscemi ha scritto:
>
>> I run the same system and setup but no nvlink. Maestro runs both gpus at 100 percent. Gromacs typically 50-60 percent; can do 600 ns/d on 2 atoms
>>
>> PB
>>
>>> On Jul 25, 2019, at 9:30 PM, Kevin Boyd wrote:
>>>
>>> Hi,
>>>
>>> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>>>
>>> First, there's some more internet resources to check out. See Mark's talk at:
>>> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
>>> Gromacs development moves fast, but a lot of it is still relevant.
>>>
>>> I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
>>>
>>> The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
>>>
>>> Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks.
>>>
>>> Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency)
>
> --
> Stefano GUGLIELMO PhD
> Assistant Professor of Medicinal Chemistry
> Department of Drug Science and Technology
> Via P. Giuria 9
> 10125 Turin, ITALY
> ph. +39 (0)11 6707178
Re: [gmx-users] simulation on 2 gpus
I run the same system and setup but no nvlink. Maestro runs both gpus at 100 percent. Gromacs typically 50-60 percent; can do 600 ns/d on 2 atoms

PB

> On Jul 25, 2019, at 9:30 PM, Kevin Boyd wrote:
>
> Hi,
>
> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>
> First, there's some more internet resources to check out. See Mark's talk at:
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> Gromacs development moves fast, but a lot of it is still relevant.
>
> I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
>
> The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
>
> Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks.
>
> Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency)
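A benchmarking script of the kind Kevin describes can be as simple as the following sketch (untested; assumes a bench.tpr and a 64-thread machine; -resethway restarts mdrun's performance counters halfway through so startup cost does not skew the numbers):

for ntmpi in 2 4 8 16; do
  ntomp=$((64 / ntmpi))
  gmx mdrun -s bench.tpr -deffnm bench_${ntmpi}x${ntomp} -ntmpi $ntmpi -ntomp $ntomp -nb gpu -pme gpu -npme 1 -pin on -nsteps 20000 -resethway
done
grep -H "Performance" bench_*.log

The -gputasks mapping is omitted here and would need one digit per rank, as noted earlier in the thread.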
Re: [gmx-users] simulation on 2 gpus
> ... The built-in thread-mpi gives you up to 64 threads, and can have a minor (<5% in my experience) performance benefit over MPI.
>
> Kevin
>
> On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon wrote:
>>
>> Hi Kevin,
>> Thanks for your very useful post. Could you give a few command line examples on how to start multiple runs at different times (e.g., allocate a subset of CPU/GPU to one run, and start another run later using another subset of yet-unallocated CPU/GPU). Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at?
>> Gregory
>>
>> From: Kevin Boyd <kevin.b...@uconn.edu>
>> Sent: Thursday, July 25, 2019 10:31 PM
>> To: gmx-us...@gromacs.org
>> Subject: Re: [gmx-users] simulation on 2 gpus
>>
>> Hi,
>>
>> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>>
>> First, there's some more internet resources to check out. See Mark's talk at:
>> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
>> Gromacs development moves fast, but a lot of it is still relevant.
>>
>> I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
>>
>> The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
>>
>> Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks.
>>
>> Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME - which is important, see below):
>>
>> -ntmpi 2 -ntomp 32 -gputasks 01
>> -ntmpi 4 -ntomp 16 -gputasks 0011
>> -ntmpi 8 -ntomp 8 -gputasks
>> -ntmpi 16 -ntomp 4 -gputasks 111
>> (and a few others - note that ntmpi * ntomp = total threads available)
>>
>> In my experience, you need to scan the options in a benchmarking script for each simulation size/content you want to simulate, and the difference between the best and the worst can be up to a factor of 2-4 in terms of performance. If you're splitting your machine among multiple simulations, I suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your benchmarking suggests that the optimal performance lies elsewhere.
>>
>> Things get more complicated when you start putting PME on the GPUs. For the machines I work on, putting PME on GPUs absolutely improves performance, but I'm not fully confident in that assessment without testing your specific machine - you have a lot of cores with that threadripper, and this is another area where I expect Gromacs 2020 might shift the GPU-CPU optimal balance.
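Testing that caveat on a given machine is cheap: run the same tpr once with PME offloaded and once with PME on the CPU, and compare the Performance lines in the logs (a sketch; thread counts chosen for the 64-thread machine discussed here):

gmx mdrun -deffnm test_pmegpu -s bench.tpr -nb gpu -pme gpu -npme 1 -ntmpi 4 -ntomp 16 -pin on
gmx mdrun -deffnm test_pmecpu -s bench.tpr -nb gpu -pme cpu -ntmpi 4 -ntomp 16 -pin on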
Re: [gmx-users] simulation on 2 gpus
... "Using 1 MPI thread
Using 32 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 32."

Two runs can be carried out with the commands:

gmx mdrun -deffnm run1 -gpu_id 1 -pin on -pinstride 1 -pinoffset 14 -ntmpi 1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -pinstride 1 -pinoffset 0 -ntmpi 1 -ntomp 28

"Using 1 MPI thread
Using 28 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 14
Pinning threads with a user-specified logical core stride of 1"

or

gmx mdrun -deffnm run1 -gpu_id 1 -pin on -ntmpi 1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -ntmpi 1 -ntomp 28

"Using 1 MPI thread
Using 28 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Pinning threads with an auto-selected logical core stride of 2"

With some disappointment, in both situations there was a substantial degradation of performance: about 35-40 ns/day for the same system, with a gpu usage of 25-30%, compared to 50-55% for the single run on a single gpu, and much below the power cap. I hope not to have been confusing and will be grateful for any suggestions.

Thanks
Stefano

Il giorno ven 26 lug 2019 alle ore 15:00 Kevin Boyd ha scritto:

> Sure - you can do it 2 ways with normal Gromacs. Either run the simulations in separate terminals, or use ampersands to run them in the background of 1 terminal.
>
> I'll give a concrete example for your threadripper, using 32 of your cores, so that you could run some other computation on the other 32. I typically make a bash variable with all the common arguments.
>
> Given tprs run1.tpr ... run4.tpr
>
> gmx_common="gmx mdrun -ntomp 8 -ntmpi 1 -pme gpu -nb gpu -pin on -pinstride 1"
> $gmx_common -deffnm run1 -pinoffset 32 -gputasks 00 &
> $gmx_common -deffnm run2 -pinoffset 40 -gputasks 00 &
> $gmx_common -deffnm run3 -pinoffset 48 -gputasks 11 &
> $gmx_common -deffnm run4 -pinoffset 56 -gputasks 11
>
> So run1 will run on cores 32-39, on GPU 0, run2 on cores 40-47 on the same GPU, and the other 2 runs will use GPU 1. Note the ampersands on the first 3 runs, so they'll go off in the background.
>
> I should also have mentioned one peculiarity with running with -ntmpi 1 and -pme gpu: even though there's now only one rank (with nonbonded and PME both running on it), you still need 2 gpu tasks for that one rank, one for each type of interaction.
>
> As for multidir, I forget what troubles I ran into exactly, but I was unable to run some subset of simulations. Anyhow if you aren't running on a cluster, I see no reason to compile with MPI and have to use srun or slurm, and need to use gmx_mpi rather than gmx. The built-in thread-mpi gives you up to 64 threads, and can have a minor (<5% in my experience) performance benefit over MPI.
>
> Kevin
>
> On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon wrote:
>>
>> Hi Kevin,
>> Thanks for your very useful post. Could you give a few command line examples on how to start multiple runs at different times (e.g., allocate a subset of CPU/GPU to one run, and start another run later using another subset of yet-unallocated CPU/GPU). Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at?
>> Gregory
>>
>> From: Kevin Boyd <kevin.b...@uconn.edu>
>> Sent: Thursday, July 25, 2019 10:31 PM
>> To: gmx-us...@gromacs.org
>> Subject: Re: [gmx-users] simulation on 2 gpus
>>
>> Hi,
>>
>> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>>
>> First, there's some more internet resources to check out. See Mark's talk at:
>> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
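For GPU-usage figures like the ones quoted above, something along these lines gives a continuously updated readout (standard nvidia-smi; the 1-second interval is arbitrary):

nvidia-smi --query-gpu=index,utilization.gpu,power.draw,temperature.gpu --format=csv -l 1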
Re: [gmx-users] simulation on 2 gpus
Sure - you can do it 2 ways with normal Gromacs. Either run the simulations in separate terminals, or use ampersands to run them in the background of 1 terminal.

I'll give a concrete example for your threadripper, using 32 of your cores, so that you could run some other computation on the other 32. I typically make a bash variable with all the common arguments.

Given tprs run1.tpr ... run4.tpr

gmx_common="gmx mdrun -ntomp 8 -ntmpi 1 -pme gpu -nb gpu -pin on -pinstride 1"
$gmx_common -deffnm run1 -pinoffset 32 -gputasks 00 &
$gmx_common -deffnm run2 -pinoffset 40 -gputasks 00 &
$gmx_common -deffnm run3 -pinoffset 48 -gputasks 11 &
$gmx_common -deffnm run4 -pinoffset 56 -gputasks 11

So run1 will run on cores 32-39, on GPU 0, run2 on cores 40-47 on the same GPU, and the other 2 runs will use GPU 1. Note the ampersands on the first 3 runs, so they'll go off in the background.

I should also have mentioned one peculiarity with running with -ntmpi 1 and -pme gpu: even though there's now only one rank (with nonbonded and PME both running on it), you still need 2 gpu tasks for that one rank, one for each type of interaction.

As for multidir, I forget what troubles I ran into exactly, but I was unable to run some subset of simulations. Anyhow if you aren't running on a cluster, I see no reason to compile with MPI and have to use srun or slurm, and need to use gmx_mpi rather than gmx. The built-in thread-mpi gives you up to 64 threads, and can have a minor (<5% in my experience) performance benefit over MPI.

Kevin

On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon wrote:
>
> Hi Kevin,
> Thanks for your very useful post. Could you give a few command line examples on how to start multiple runs at different times (e.g., allocate a subset of CPU/GPU to one run, and start another run later using another subset of yet-unallocated CPU/GPU). Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at?
> Gregory
>
> From: Kevin Boyd <kevin.b...@uconn.edu>
> Sent: Thursday, July 25, 2019 10:31 PM
> To: gmx-us...@gromacs.org
> Subject: Re: [gmx-users] simulation on 2 gpus
>
> Hi,
>
> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>
> First, there's some more internet resources to check out. See Mark's talk at:
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> Gromacs development moves fast, but a lot of it is still relevant.
>
> I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
>
> The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
>
> Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks.
>
> Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME - which is important, see below):
>
> -ntmpi 2 -ntomp 32 -gputasks 01
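If the four jobs above are launched from a script rather than an interactive shell, one common variant (plain bash, nothing GROMACS-specific) is to background the last run as well and finish with wait, so the script only returns when every mdrun has exited:

$gmx_common -deffnm run4 -pinoffset 56 -gputasks 11 &
wait   # blocks until all backgrounded runs have finished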
Re: [gmx-users] simulation on 2 gpus
Hi,

It's rather like the example at http://manual.gromacs.org/current/user-guide/mdrun-performance.html#examples-for-mdrun-on-one-node where instead of

gmx mdrun -nt 6 -pin on -pinoffset 0 -pinstride 1
gmx mdrun -nt 6 -pin on -pinoffset 6 -pinstride 1

to run on a machine with 12 hardware threads, you want to adapt the number of threads and also specify disjoint GPU sets, e.g.

gmx mdrun -nt 32 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0
gmx mdrun -nt 32 -pin on -pinoffset 32 -pinstride 1 -gpu_id 1

That lets mdrun choose the mix of thread-MPI ranks vs OpenMP threads on those ranks, but you could replace -nt 32 with -ntmpi N -ntomp M so long as the product of N and M is 32.

Mark

On Fri, 26 Jul 2019 at 14:22, Gregory Man Kai Poon wrote:
>
> Hi Kevin,
> Thanks for your very useful post. Could you give a few command line examples on how to start multiple runs at different times (e.g., allocate a subset of CPU/GPU to one run, and start another run later using another subset of yet-unallocated CPU/GPU). Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at?
> Gregory
>
> From: Kevin Boyd <kevin.b...@uconn.edu>
> Sent: Thursday, July 25, 2019 10:31 PM
> To: gmx-us...@gromacs.org
> Subject: Re: [gmx-users] simulation on 2 gpus
>
> Hi,
>
> I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
>
> First, there's some more internet resources to check out. See Mark's talk at:
> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> Gromacs development moves fast, but a lot of it is still relevant.
>
> I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
>
> The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
>
> Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks.
>
> Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME - which is important, see below):
>
> -ntmpi 2 -ntomp 32 -gputasks 01
> -ntmpi 4 -ntomp 16 -gputasks 0011
> -ntmpi 8 -ntomp 8 -gputasks
> -ntmpi 16 -ntomp 4 -gputasks 111
> (and a few others - note that ntmpi * ntomp = total threads available)
>
> In my experience, you need to scan the options in a benchmarking script for each simulation size/content you want to simulate, and the difference between the best and the worst can be up to a factor of 2-4 in terms of performance. If you're splitting your machine among multiple simulations, I suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your benchmarking suggests that the optimal performance lies elsewhere.
>
> Things get more complicated when you start putting PME on the GPUs. For the machines I work on, putting PME on GPUs absolutely improves performance, but I'm not fully confident in that assessment without testing your specific machine.
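Concretely, the N x M = 32 constraint Mark mentions admits, e.g., either of these as a replacement for -nt 32 (illustrative only):

gmx mdrun -ntmpi 1 -ntomp 32 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0
gmx mdrun -ntmpi 4 -ntomp 8 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0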
Re: [gmx-users] simulation on 2 gpus
Hi Kevin, Thanks for your very useful post. Could you give a few command line examples on how to start multiple runs at different times (e.g., allocate a subset of CPU/GPU to one run, and start another run later using another unsubset of yet-unallocated CPU/GPU). Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at? Gregory From: Kevin Boyd<mailto:kevin.b...@uconn.edu> Sent: Thursday, July 25, 2019 10:31 PM To: gmx-us...@gromacs.org<mailto:gmx-us...@gromacs.org> Subject: Re: [gmx-users] simulation on 2 gpus Hi, I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent. First, there's some more internet resources to checkout. See Mark's talk at - https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioexcel.eu%2Fwebinar-performance-tuning-and-optimization-of-gromacs%2Fdata=02%7C01%7Cgpoon%40gsu.edu%7Cfd42b6ec3efa41d855b608d711714bdd%7C515ad73d8d5e4169895c9789dc742a70%7C0%7C0%7C636997050628368338sdata=%2BaUIuI63M7HRo%2B2VSUs0WIr0nYB10jE7lxnHW6gM8Os%3Dreserved=0 Gromacs development moves fast, but a lot of it is still relevant. I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs. The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin -pinoffset, and -gputasks. I can give specific examples if you're interested. Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks. Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME - which is important, see below): -ntmpi 2 -ntomp 32 -gputasks 01 -ntmpi 4 -ntomp 16 -gputasks 0011 -ntmpi 8 -ntomp 8 -gputasks -ntmpi 16 -ntomp 4 -gputasks 111 (and a few others - note that ntmpi * ntomp = total threads available) In my experience, you need to scan the options in a benchmarking script for each simulation size/content you want to simulate, and the difference between the best and the worst can be up to a factor of 2-4 in terms of performance. 
Re: [gmx-users] simulation on 2 gpus
Hi,

I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.

First, there are a few more internet resources to check out. See Mark's talk at https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/ - Gromacs development moves fast, but a lot of it is still relevant.

I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast, so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.

The most important consideration is that to get maximum total throughput, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend it in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to launch subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.

Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This involves targeting the same GPU with -gputasks.

Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-MPI ranks and OpenMP threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME, which is important - see below):

-ntmpi 2 -ntomp 32 -gputasks 01
-ntmpi 4 -ntomp 16 -gputasks 0011
-ntmpi 8 -ntomp 8 -gputasks 00001111
-ntmpi 16 -ntomp 4 -gputasks 0000000011111111

(and a few others - note that ntmpi * ntomp must equal the total threads available)

In my experience, you need to scan these options in a benchmarking script for each simulation size/composition you want to simulate, and the difference between the best and the worst can be up to a factor of 2-4 in performance.
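For instance, a minimal sketch of such a scan might look like the following (bench.tpr, the step count, and the output names are placeholders, not from the original post; the -gputasks strings assume PME stays on the CPU, so there is one nonbonded GPU task per rank):

#!/bin/bash
# Scan thread-MPI rank / OpenMP thread decompositions on a 64-thread, 2-GPU box.
# Assumes a prepared bench.tpr; -resethway restarts the timing counters halfway
# through the run so the reported ns/day is less polluted by startup costs.
for combo in "2 32 01" "4 16 0011" "8 8 00001111" "16 4 0000000011111111"; do
    set -- $combo    # $1 = ntmpi, $2 = ntomp, $3 = gputasks string
    gmx mdrun -s bench.tpr -deffnm bench_${1}x${2} -nb gpu \
        -ntmpi $1 -ntomp $2 -gputasks $3 \
        -pin on -nsteps 20000 -resethway
done
grep Performance bench_*.log    # first number on each line is ns/day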
If you're splitting your machine among multiple simulations, I suggest running 1 MPI thread (-ntmpi 1) per simulation, unless your benchmarking suggests that the optimal performance lies elsewhere.

Things get more complicated when you start putting PME on the GPUs. For the machines I work on, putting PME on GPUs absolutely improves performance, but I'm not fully confident in that assessment without testing your specific machine - you have a lot of cores with that Threadripper, and this is another area where I expect Gromacs 2020 might shift the optimal GPU-CPU balance.

The issue with PME on GPUs is that we can (currently) only have one rank doing GPU PME work. So, if we have a machine with, say, 20 cores and 2 GPUs, and I run

gmx mdrun -ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01

then two ranks will be started: one, with cores 0-9, will work on the short-range interactions, offloading what it can to GPU 0, and the PME rank (cores 10-19) will offload to GPU 1. There is one significant problem (and one minor problem) with this setup. First, it is massively inefficient in terms of load balance. In a typical system (there are exceptions), PME takes up ~1/3 of the computation that the short-range interactions take, so we are offloading 1/4 of our interactions to one GPU and 3/4 to the other, which leads to imbalance. In this specific case (2 GPUs and sufficient cores), the best solution is often (but not always) to run with -ntmpi 4 (in this example, then -ntomp 5), as the PME rank then gets 1/4 of the GPU tasks, proportional to the computation needed.

The second (less critical - don't worry about this unless you're CPU-limited) problem is that GPU PME ranks only use 1 CPU core in their calculations. So, with a node of 20 cores and 2 GPUs, if I run a simulation with -ntmpi 4 -ntomp 5 -pme gpu -npme 1, each of those ranks will have 5 cores, but the PME rank will only use one of them. You can set the number of OpenMP threads on the PME rank with -ntomp_pme, which is useful in restricted cases. For example, given the above architecture (20 cores, 2 GPUs), I could maximally exploit my CPUs with the following commands:

gmx mdrun -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 0000 -pin on -pinoffset 0 &
gmx mdrun -ntmpi 4
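The archived copy of the message breaks off in the middle of the second command. By symmetry with the first run, the companion command would plausibly pin to the remaining 10 cores and target the second GPU - the -gputasks string and -pinoffset below are a reconstruction, not from the original:

# Hypothetical completion of the truncated second command: mirror of the first
# run, placed on GPU 1 and on the second block of 10 cores.
gmx mdrun -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 1111 -pin on -pinoffset 10 &

Each run occupies 10 cores (3 PP ranks x 3 OpenMP threads + 1 PME rank x 1 thread), which is why the second run starts at -pinoffset 10. Since explicit pinning keeps the two runs on disjoint cores, the second command need not be launched at the same time as the first: it can be started later against whichever cores and GPUs are still unallocated.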
[gmx-users] simulation on 2 gpus
Dear all,

I am trying to run simulations with Gromacs 2019.2 on a workstation with an AMD Threadripper CPU (32 cores, 64 threads), 128 GB of RAM, and two RTX 2080 Ti cards with an NVLink bridge. I have read the user's guide section on performance and I am exploring some possible combinations of CPU/GPU work to run as fast as possible. I was wondering if any of you have experience of running on more than one GPU with several cores and can give some hints as a starting point.

Thanks
Stefano

--
Stefano GUGLIELMO
PhD, Assistant Professor of Medicinal Chemistry
Department of Drug Science and Technology
Via P. Giuria 9
10125 Turin, ITALY
ph. +39 (0)11 6707178