Re: [gmx-users] Help on MD performance, GPU has less load than CPU.
So the state of the art in the whole field is to just blindly copy .mdp settings from web pages rather than review the literature of related work? Nice. :-D

Mark

On Thu, Jul 13, 2017 at 7:18 PM Téletchéa Stéphane <stephane.teletc...@univ-nantes.fr> wrote:

> As defaults are defaults ... who knows :-) Putting numbers behind these
> assumptions is hard, and probably nobody wants to do it on a large
> scale ... But I'm too close to the holidays to argue the point now!
>
> Stéphane

--
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-requ...@gromacs.org.
Re: [gmx-users] Help on MD performance, GPU has less load than CPU.
On 12/07/2017 at 18:15, Mark Abraham wrote:
> Sure. But who has data that shows that e.g. a free-energy calculation
> with the defaults produces lower quality observables than you get with
> the tightened settings?

Hi,

As defaults are defaults ... who knows :-) Putting numbers behind these assumptions is hard, and probably nobody wants to do it on a large scale ... But I'm too close to the holidays to argue the point now!

Stéphane

--
Assistant Professor in BioInformatics, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 Nantes cedex 03, France
Tél : +33 251 125 636 / Fax : +33 251 125 632
http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
Re: [gmx-users] Help on MD performance, GPU has less load than CPU.
Hi,

Sure. But who has data that shows that e.g. a free-energy calculation with the defaults produces lower quality observables than you get with the tightened settings?

Mark

On Wed, Jul 12, 2017 at 5:59 PM Téletchéa Stéphane <stephane.teletc...@univ-nantes.fr> wrote:

> Since this is a wild guess, I'd like to add some guesses myself. I
> remember, some time back, having used a lower Ewald tolerance for AMBER
> simulations (around AMBER 4/5/6), and it was more common at that time,
> I presume. [...]
>
> Stéphane
Re: [gmx-users] Help on MD performance, GPU has less load than CPU.
On 11/07/2017 at 15:24, Mark Abraham wrote:
> Guessing wildly, the cost of your simulation is probably at least double
> what the defaults would give, and for that cost, I'd want to know why.

Esteemed colleague,

Since this is a wild guess, I'd like to add some guesses myself. I remember, some time back, having used a lower Ewald tolerance for AMBER simulations (around AMBER 4/5/6), and it was more common at that time, I presume. This may also be linked to the fact that AMBER has a short 8 Å cut-off for electrostatics. Someone already raised this in 2009:

http://gromacs.org_gmx-users.maillist.sys.kth.narkive.com/vTjpMdwU/gromacs-preformance-versus-amber

From memory, I remembered using 1e-6 for the Ewald tolerance in AMBER, and this is mentioned here:

http://ambermd.org/Questions/ewald.html

... apparently linked to DNA simulation, as found in JACS 117, 4193 (1995).

In short, this value may keep coming back for "historical" reasons (and misuse, of course).

Others may have additional comments :-)

Best,

Stéphane
Re: [gmx-users] Help on MD performance, GPU has less load than CPU.
Hi,

I'm genuinely curious about why people set ewald_rtol smaller, and thus pme_order to large values - this is the second time I've seen it in 24 hours. It is unlikely to be useful, because the accumulation of forces in single precision has round-off error, which means the approximation to the correct sum is not reliably accurate to better than about 1 part in 1e5. Is there data somewhere that shows this is useful?

In any case, it a) causes a lot more work on the CPU, and b) only pme_order 4 (and to a lesser extent, 5) is optimized for performance (because there's no data that shows a higher order is useful). And for a free-energy calculation, that extra expense accrues for each lambda state. See the "PME mesh" parts of the performance report.

Guessing wildly, the cost of your simulation is probably at least double what the defaults would give, and for that cost, I'd want to know why.

Mark

On Mon, Jul 10, 2017 at 5:02 PM Davide Bonanni wrote:

> I don't know how I can improve the load on the CPU more than this, or
> how I can decrease the load on the GPU. Do you have any suggestions?
>
> [full message and log below]
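Mark's point about the cost of a tighter ewald_rtol can be sketched numerically. GROMACS documents ewald-rtol as the relative strength of the Ewald-shifted direct potential at rcoulomb, i.e. the splitting parameter beta satisfies erfc(beta * rc) = ewald-rtol; a tighter tolerance forces a larger beta, which pushes more of the electrostatics into the reciprocal-space (PME mesh) part. A minimal illustrative sketch (not GROMACS code; the function name ewald_beta and the bisection solver are mine):

```python
import math

def ewald_beta(rc, rtol):
    """Find the Ewald splitting parameter beta (nm^-1) such that
    erfc(beta * rc) = rtol, by simple bisection. This mirrors the
    documented meaning of ewald-rtol: the relative strength of the
    Ewald-shifted direct potential at rcoulomb."""
    lo, hi = 0.0, 100.0          # bracket for beta; erfc(0) = 1 > rtol
    for _ in range(200):         # bisect until the bracket is negligible
        mid = 0.5 * (lo + hi)
        if math.erfc(mid * rc) > rtol:
            lo = mid             # direct term still too strong: raise beta
        else:
            hi = mid
    return 0.5 * (lo + hi)

rc = 1.0                         # nm, an example rcoulomb
print(ewald_beta(rc, 1e-5))      # at the default ewald-rtol
print(ewald_beta(rc, 1e-6))      # tighter tolerance -> larger beta,
                                 # i.e. more PME mesh work for the CPU
```

The larger beta at 1e-6 means the direct-space sum decays faster, so the mesh part must represent more of the interaction - which shows up as the inflated "PME mesh" timings Mark refers to.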
[gmx-users] Help on MD performance, GPU has less load than CPU.
Hi,

I am working on a node with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16 physical cores, 32 logical cores) and 1 NVIDIA GeForce GTX 980 Ti GPU. I am launching a series of 2 ns molecular dynamics simulations of a system of 6 atoms. I tried various setting combinations, but obtained the best performance with the command:

"gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on"

which uses 32 OpenMP threads, 1 MPI thread, and the GPU. At the end of the .log file of the molecular dynamics production I get this message:

"NOTE: The GPU has >25% less load than the CPU. This imbalance causes performance loss."

I don't know how I can improve the load on the CPU more than this, or how I can decrease the load on the GPU. Do you have any suggestions?

Thank you in advance.

Cheers,

Davide Bonanni

Initial and final parts of the LOG file here:

Log file opened on Sun Jul 9 04:02:44 2017
Host: bigblue  pid: 16777  rank ID: 0  number of ranks: 1
:-) GROMACS - gmx mdrun, VERSION 5.1.4 (-:

GROMACS:      gmx mdrun, VERSION 5.1.4
Executable:   /usr/bin/gmx
Data prefix:  /usr/local/gromacs
Command line:
  gmx mdrun -deffnm md_fluo_7 -cpt 1 -cpo restart1.cpt -pin on

GROMACS version:    VERSION 5.1.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Tue 8 Nov 12:26:14 CET 2016
Built by:           root@bigblue [CMAKE]
Build OS/arch:      Linux 3.10.0-327.el7.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Build CPU family:   6  Model: 63  Stepping: 2
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /bin/cc GNU 4.8.5
C compiler flags:   -march=core-avx2 -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler:       /bin/c++ GNU 4.8.5
C++ compiler flags: -march=core-avx2 -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version:      1.55.0 (internal)
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2016 NVIDIA Corporation; Built on Sun_Sep__4_22:14:01_CDT_2016; Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        8.0
CUDA runtime:       8.0

Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    Family: 6  model: 63  stepping: 2
    CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 980 Ti, compute cap.: 5.2, ECC: no, stat: compatible

Changing nstlist from 20 to 40, rlist from 1.2 to 1.2

Input Parameters:
   integrator      = sd
   tinit           = 0
   dt              = 0.002
   nsteps          = 100
   init-step       = 0
   simulation-part = 1
   comm-mode       = Linear
   nstcomm         = 100
   bd-fric         = 0
   ld-seed         = 57540858
   emtol           = 10
   emstep          = 0.01
   niter           = 20
   fcstep          = 0
   nstcgsteep      = 1000
   nbfgscorr       = 10
   rtpi            = 0.05
   nstxout
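For readers landing on this thread: Mark's advice amounts to leaving the PME-related .mdp settings at their defaults. A sketch of the relevant options follows (the values shown are the documented GROMACS defaults of this era; rcoulomb is only an example and should come from your force field, not from this sketch):

```
; PME electrostatics at the defaults -- keeps CPU-side "PME mesh" work modest
coulombtype     = PME
rcoulomb        = 1.0     ; example cut-off; use the value your force field prescribes
fourierspacing  = 0.12    ; default maximum grid spacing (nm)
pme-order       = 4       ; default; the only interpolation order heavily optimized
ewald-rtol      = 1e-5    ; default; tightening this inflates the PME mesh cost
```

Per Mark's remarks above, single-precision force accumulation limits the attainable accuracy anyway, so tightening ewald-rtol below the default mostly buys extra CPU time, multiplied per lambda state in free-energy runs.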