Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
We did test that. Our cluster has 11 GPU nodes in total and I ran 20 tests across all of them. 7 out of the 20 tests had the potential energy jump issue, and they were running on 5 different nodes. So I tend to believe this issue can happen on any of those nodes.

On Wed, Dec 4, 2019 at 1:14 PM Szilárd Páll wrote:
> The fact that you are observing errors, and the energies to be off by so
> much, and that it reproduces with multiple inputs, suggests that this may
> not be a code issue. Did you do all runs that failed on the same hardware?
> Have you excluded the option that one of those GeForce cards may be flaky?
>
> --
> Szilárd
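For reference, a minimal sketch of that kind of per-node sweep, assuming a SLURM cluster with GPU nodes named gpu01-gpu11 and a short test .tpr; the node names, partition, resource numbers, and input file are placeholders, not the actual cluster setup:

```
#!/bin/bash
# Submit one short, identical mdrun job to each GPU node so that a
# reproducible energy blow-up can be traced back to specific hardware.
# Node names, partition, resources, and test.tpr are placeholders.
for node in gpu01 gpu02 gpu03 gpu04 gpu05 gpu06 gpu07 gpu08 gpu09 gpu10 gpu11; do
    sbatch --job-name="triage-${node}" --partition=gpu --nodelist="${node}" \
           --gres=gpu:4 --ntasks=1 --cpus-per-task=24 --time=01:00:00 \
           --wrap "gmx mdrun -v -s test.tpr -deffnm triage_${node} \
                   -ntmpi 8 -ntomp 3 -pin on -nb gpu -pme gpu -pmefft gpu \
                   -npme 1 -gputasks 00112233 -nsteps 50000 -noappend"
done
```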
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
We tried the same gmx settings in 2019.4 with different protein systems. And we got the same weird potential energy jump within 1000 steps.

```
           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    2.08204e+04    9.92358e+04    6.53063e+04    1.06706e+03   -2.75672e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.50031e+04   -4.86857e+04    3.10386e+04   -1.09745e+06    4.81832e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -9.09123e+05    2.80635e+05   -6.28487e+05   -6.28428e+05    3.04667e+02
 Pressure (bar)   Constr. rmsd
   -1.56013e+00    3.60634e-06

DD  step 999  load imb.: force 14.6%  pme mesh/force 0.581

           Step           Time
           1000        2.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    2.04425e+04    9.92768e+04    6.52873e+04    1.02016e+03   -2.45851e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.49863e+04   -4.91092e+04    3.10572e+04   -1.09508e+06    4.97942e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.77598e+05    1.35726e+35    1.35726e+35    3.01370e+02
 Pressure (bar)   Constr. rmsd
   -7.55250e+01    3.63239e-06

DD  step 1999  load imb.: force 16.1%  pme mesh/force 0.598

           Step           Time
           2000        4.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    1.99521e+04    9.97482e+04    6.49595e+04    1.00798e+03   -2.42567e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.50156e+04   -4.85324e+04    3.01944e+04   -1.09620e+06    4.82958e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.79206e+05    1.35726e+35    1.35726e+35    3.03115e+02
 Pressure (bar)   Constr. rmsd
   -5.50508e+01    3.64353e-06

DD  step 2999  load imb.: force 16.6%  pme mesh/force 0.602

           Step           Time
           3000        6.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    1.98590e+04    9.88100e+04    6.50934e+04    1.07048e+03   -2.38831e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.49609e+04   -4.93079e+04    3.12273e+04   -1.09582e+06    4.83209e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.79438e+05    1.35726e+35    1.35726e+35    3.03367e+02
 Pressure (bar)   Constr. rmsd
    7.62438e+01    3.61574e-06
```

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham wrote:
> Hi,
>
> What driver version is reported in the respective log files? Does the error
> persist if mdrun -notunepme is used?
>
> Mark
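For pinning down exactly where a run blows up, a small sketch using gmx energy to pull the potential-energy time series out of the run's .edr file; the file name and the 1e30 threshold are placeholders for "obviously broken", not values from this thread:

```
# Extract the potential energy time series and report the first frame where
# it blows up. File name and the 1e30 threshold are placeholders.
echo "Potential" | gmx energy -f md_seed_fixed.edr -o potential.xvg

awk '!/^[@#]/ && ($2 > 1e30 || $2 < -1e30) { print "first bad frame: t =", $1, "ps, E_pot =", $2; exit }' potential.xvg
```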
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
```
  #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
  #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
  #2: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
  #3: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
```

Note that the simulation ran for about 2.8 ns and we got a weirdly high potential energy at the end of it.

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham wrote:
> Hi,
>
> What driver version is reported in the respective log files? Does the error
> persist if mdrun -notunepme is used?
>
> Mark
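Since flaky GeForce cards were raised as a possibility, a few generic checks one might run on a suspect node; these are standard NVIDIA/Linux tools rather than anything GROMACS-specific, and dmesg may need elevated privileges:

```
# Quick hardware checks on a suspect GPU node (run on the node itself).
nvidia-smi                                  # overall state of the four cards
nvidia-smi -q -d TEMPERATURE,CLOCK,POWER    # look for throttling or power anomalies
dmesg | grep -i xid                         # NVIDIA Xid errors logged by the driver
```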
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
For the error:
```
^Mstep 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
^Mstep 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
/var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
```

this is the driver info I got:
```
GROMACS:      gmx mdrun, version 2019.4
Executable:   /home/rsexton2/Library/gromacs/2019.4/test1/bin/gmx
Data prefix:  /home/rsexton2/Library/gromacs/2019.4/test1
Working dir:  /scratch/czhan178/project/NapA-2019.4/gromacs_test_1/test_9
Process ID:   29866
Command line:
  gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh 2 -cpt 60 -cpi -noappend

GROMACS version:    2019.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.2
Tracing support:    disabled
C compiler:         /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0
C compiler flags:   -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /packages/7x/cuda/9.2.88.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2018 NVIDIA Corporation; Built on Wed_Apr_11_23:16:29_CDT_2018; Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        9.20
CUDA runtime:       9.20
```

I'll run the -notunepme option and get you updated.

Chenou

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham wrote:
> Hi,
>
> What driver version is reported in the respective log files? Does the error
> persist if mdrun -notunepme is used?
>
> Mark
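The segfault in the slurm log appears right after the PME grid tuning steps ("timed with pme grid ..."), which is presumably part of why disabling the tuning was suggested as a test. With tuning off, the run command is simply the original one with -notunepme added:

```
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 \
    -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 \
    -cpi -noappend -notunepme
```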
[gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
Hi Gromacs developers,

I'm currently running gromacs 2019.4 on our university's HPC cluster. To fully utilize the GPU nodes, I followed the notes at
http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html.

Here is the command I used for my runs:
```
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
```

Some of those runs fail with the following error:
```
---
Program:     gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank:    3 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
```

We also had a different error from the slurm system:
```
^Mstep 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
^Mstep 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
/var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
```

We first thought this could be due to a compiler issue and tried different settings, as follows:

===test1===
module load cuda/9.2.88.1
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC

===test2===
module load cuda/9.2.88.1
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC

===test3===
module load cuda/9.2.148
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC

===test4===
module load cuda/9.2.148
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC

===test5===
module load cuda/9.1.85
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC

===test6===
module load cuda/9.0.176
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC

===test7===
module load cuda/9.2.88.1
module load gccgpu/7.4.0
. /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC

However, we still ended up with the same errors shown above. Does anyone know where the cudaStreamSynchronize error comes from? Or am I using those gmx GPU options incorrectly?

Any input will be appreciated!

Thanks!
Chenou
--
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-requ...@gromacs.org.
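As a reference for the flag choices above, a sketch of a SLURM batch script around that command with the rank/GPU mapping spelled out; the partition, module names, wall time, and HOURS value are placeholders, not the actual job script from this thread:

```
#!/bin/bash
#SBATCH --job-name=md_seed_fixed
#SBATCH --partition=gpu            # placeholder partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24         # 8 thread-MPI ranks x 3 OpenMP threads
#SBATCH --gres=gpu:4               # one node with 4 GTX 1080 Ti
#SBATCH --time=24:00:00            # placeholder wall time

# Placeholder module/build choice; any of the test1-test7 builds would do.
module load cuda/9.2.88.1
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC

TPR=md_seed_fixed.tpr
HOURS=23.5                         # placeholder; keep below the SLURM limit

# -ntmpi 8              : 8 thread-MPI ranks (7 PP + 1 separate PME rank via -npme 1)
# -ntomp 3              : 3 OpenMP threads per rank (8 x 3 = 24 cores)
# -gputasks 00112233    : one digit per rank's GPU task; ranks 0-1 -> GPU 0,
#                         ranks 2-3 -> GPU 1, ranks 4-5 -> GPU 2, ranks 6-7 -> GPU 3
# -pme gpu -pmefft gpu  : run PME (including its FFTs) on the PME rank's GPU
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 \
    -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 \
    -cpi -noappend
```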