Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
Can you please file an issue on redmine.gromacs.org and attach the inputs that reproduce the described behavior?

--
Szilárd

On Wed, Dec 4, 2019, 21:35 Chenou Zhang wrote:
> [...]
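A minimal way to bundle such a reproducer for the tracker, assuming the file names used in the runs discussed further down in this thread (adjust to the actual run), might be:

```
# collect the run input plus the failing logs/checkpoints into one archive
# (names assumed from -deffnm md_seed_fixed; -noappend adds .partNNNN suffixes)
tar czf cuda-sync-failure-repro.tar.gz \
    md_seed_fixed.tpr \
    md_seed_fixed*.log \
    md_seed_fixed*.cpt
```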
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
We did test that. Our cluster has 11 GPU nodes in total and I ran 20 tests over all of them. 7 out of the 20 tests had the potential energy jump issue, and they were running on 5 different nodes. So I tend to believe this issue can happen on any of those nodes.

On Wed, Dec 4, 2019 at 1:14 PM Szilárd Páll wrote:
> [...]
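For reference, spreading identical repeats over the GPU nodes is easiest with a job array; a minimal sketch, assuming a Slurm partition named gpu and two-hour test runs (the partition name and resource requests are placeholders, not the actual cluster configuration):

```
#!/bin/bash
#SBATCH --job-name=gmx-repro
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --gres=gpu:4
#SBATCH --array=1-20             # 20 identical repeats spread over the nodes
#SBATCH --time=02:00:00

# each repeat runs in its own directory with the same tpr
mkdir -p test_${SLURM_ARRAY_TASK_ID} && cd test_${SLURM_ARRAY_TASK_ID}
gmx mdrun -v -s ../md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on \
    -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 \
    -maxh 2 -cpt 60 -cpi -noappend
```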
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
The fact that you are observing errors and also energies that are off by so much, and that it reproduces with multiple inputs, suggests that this may not be a code issue. Did you do all the runs that failed on the same hardware? Have you excluded the option that one of those GeForce cards may be flaky?

--
Szilárd

On Wed, Dec 4, 2019 at 7:47 PM Chenou Zhang wrote:
> [...]
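One quick, GROMACS-independent way to look for a flaky board is to compare the kernel log and GPU state across the nodes; a rough sketch (query field names as in current nvidia-smi, which may vary with driver version):

```
# Xid messages in the kernel log are the usual signature of GPU faults
# such as illegal memory accesses
dmesg | grep -i xid

# per-GPU state; GeForce boards have no ECC counters, so driver version,
# temperature and clocks are the main things to compare between nodes
nvidia-smi --query-gpu=index,name,driver_version,temperature.gpu,clocks.sm \
           --format=csv
```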
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
We tried the same gmx settings in 2019.4 with different protein systems, and we got the same weird potential energy jump within 1000 steps.

```
           Step           Time
              0            0.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    2.08204e+04    9.92358e+04    6.53063e+04    1.06706e+03   -2.75672e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.50031e+04   -4.86857e+04    3.10386e+04   -1.09745e+06    4.81832e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -9.09123e+05    2.80635e+05   -6.28487e+05   -6.28428e+05    3.04667e+02
 Pressure (bar)   Constr. rmsd
   -1.56013e+00    3.60634e-06

DD step 999 load imb.: force 14.6% pme mesh/force 0.581
           Step           Time
           1000            2.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    2.04425e+04    9.92768e+04    6.52873e+04    1.02016e+03   -2.45851e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.49863e+04   -4.91092e+04    3.10572e+04   -1.09508e+06    4.97942e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.77598e+05    1.35726e+35    1.35726e+35    3.01370e+02
 Pressure (bar)   Constr. rmsd
   -7.55250e+01    3.63239e-06

DD step 1999 load imb.: force 16.1% pme mesh/force 0.598
           Step           Time
           2000            4.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    1.99521e+04    9.97482e+04    6.49595e+04    1.00798e+03   -2.42567e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.50156e+04   -4.85324e+04    3.01944e+04   -1.09620e+06    4.82958e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.79206e+05    1.35726e+35    1.35726e+35    3.03115e+02
 Pressure (bar)   Constr. rmsd
   -5.50508e+01    3.64353e-06

DD step 2999 load imb.: force 16.6% pme mesh/force 0.602
           Step           Time
           3000            6.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    1.98590e+04    9.88100e+04    6.50934e+04    1.07048e+03   -2.38831e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.49609e+04   -4.93079e+04    3.12273e+04   -1.09582e+06    4.83209e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.79438e+05    1.35726e+35    1.35726e+35    3.03367e+02
 Pressure (bar)   Constr. rmsd
    7.62438e+01    3.61574e-06
```

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham wrote:
> [...]
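To pinpoint where the potential blows up, the offending term can be pulled from the energy file; a sketch, assuming the default -deffnm output name (with -noappend the actual file carries a .partNNNN suffix):

```
# write Potential vs. time to an .xvg file for plotting
echo "Potential" | gmx energy -f md_seed_fixed.part0001.edr -o potential.xvg
```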
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
Hi,

I've run 30 tests with the -notunepme option. I got the following error from one of them (which is still the same *cudaStreamSynchronize failed* error):

```
DD step 1422999 vol min/aver 0.639 load imb.: force 1.1% pme mesh/force 1.079
           Step           Time
        1423000         2846.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.79755e+04    1.78943e+05    1.22798e+05    2.83835e+03   -9.19303e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.56547e+04    5.11714e+05    9.77218e+03   -2.07148e+06    8.64504e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    7.64126e+13    4.79398e+05    7.64126e+13    7.64126e+13    3.58009e+02
 Pressure (bar)   Constr. rmsd
   -6.03201e+01    4.56399e-06

---
Program:     gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank:    2 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
---
```

Here is the command and the driver info:

```
Command line:
  gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -notunepme -npme 1 -gputasks 00112233 -maxh 2 -cpt 60 -cpi -noappend

GROMACS version:    2019.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.2
Tracing support:    disabled
C compiler:         /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0
C compiler flags:   -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /packages/7x/cuda/9.2.88.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Wed_Apr_11_23:16:29_CDT_2018;Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        9.20
CUDA runtime:       9.20

Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Full, with devices
    Sockets, cores, and logical processors:
      Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11]
      Socket 1: [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23]
    Numa nodes:
      Node 0 (34229563392 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11
      Node 1 (34359738368 bytes mem): 12 13 14 15 16 17 18 19 20 21 22 23
      Latency:
               0     1
         0  1.00  2.10
         1  2.10  1.00
    Caches:
      L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L3: 31457280 bytes, linesize 64 bytes, assoc. 20, shared 12 ways
    PCI devices:
      :01:00.0  Id: 15b3:1007  Class: 0x0200  Numa: 0
      :02:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 0
      :03:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 0
      :00:11.4  Id: 8086:8d62  Class: 0x0106  Numa: 0
      :06:00.0  Id: 1a03:2000  Class: 0x0300  Numa: 0
      :00:1f.2  Id: 8086:8d02  Class: 0x0106  Numa: 0
      :81:00.0  Id: 8086:1521  Class: 0x0200  Numa: 1
      :81:00.1  Id: 8086:1521  Class: 0x0200  Numa: 1
      :82:00.0  Id: 15b3:1007  Class: 0x0280  Numa: 1
      :83:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 1
      :84:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 1
  GPU info:
    Number of GPUs detected: 4
    #0: NVIDIA GeForce
```
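If a single flaky board is suspected, the same tpr can be pinned to one GPU at a time and cycled through the four cards; a sketch (device numbering follows CUDA_VISIBLE_DEVICES, which may not match GROMACS' own ordering):

```
# run everything on GPU 0 only; repeat with 1, 2 and 3 to isolate a bad card
CUDA_VISIBLE_DEVICES=0 gmx mdrun -v -s md_seed_fixed.tpr -deffnm gpu0_test \
    -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -notunepme \
    -npme 1 -gputasks 00000000 -maxh 2
```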
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
For the error:

```
^Mstep 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
^Mstep 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
/var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault      gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
```

this is the driver info I got:

```
GROMACS:      gmx mdrun, version 2019.4
Executable:   /home/rsexton2/Library/gromacs/2019.4/test1/bin/gmx
Data prefix:  /home/rsexton2/Library/gromacs/2019.4/test1
Working dir:  /scratch/czhan178/project/NapA-2019.4/gromacs_test_1/test_9
Process ID:   29866
Command line:
  gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh 2 -cpt 60 -cpi -noappend

GROMACS version:    2019.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.2
Tracing support:    disabled
C compiler:         /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0
C compiler flags:   -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /packages/7x/cuda/9.2.88.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Wed_Apr_11_23:16:29_CDT_2018;Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        9.20
CUDA runtime:       9.20
```

I'll run with the -notunepme option and keep you updated.

Chenou

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham wrote:
> [...]
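Note that the "CUDA driver: 9.20" line in the log reports the CUDA version exposed by the driver, not the NVIDIA kernel driver release that Mark asked about; the latter can be read directly on a node, e.g.:

```
# NVIDIA kernel driver release on the node (e.g. a 396.xx series for CUDA 9.2)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# or, equivalently
cat /proc/driver/nvidia/version
```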
Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
Hi,

What driver version is reported in the respective log files? Does the error persist if mdrun -notunepme is used?

Mark

On Mon., 2 Dec. 2019, 21:18 Chenou Zhang, wrote:
> [...]
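For reference, those lines can be grepped straight out of the mdrun log (log file name assumed from -deffnm plus the .partNNNN suffix that -noappend adds):

```
grep -E "CUDA (driver|runtime)" md_seed_fixed.part0001.log
```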
[gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
Hi Gromacs developers,

I'm currently running gromacs 2019.4 on our university's HPC cluster. To fully utilize the GPU nodes, I followed the notes on
http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html.

And here is the command I used for my runs:
```
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
```

Some of those runs fail with the following error:
```
---
Program:     gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank:    3 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
```

We also had a different error from the Slurm system:
```
^Mstep 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
^Mstep 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
/var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault      gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
```

We first thought this could be due to a compiler issue and tried the following different build settings:

===test1===
module load cuda/9.2.88.1
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC

===test2===
module load cuda/9.2.88.1
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC

===test3===
module load cuda/9.2.148
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC

===test4===
module load cuda/9.2.148
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC

===test5===
module load cuda/9.1.85
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC

===test6===
module load cuda/9.0.176
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC

===test7===
module load cuda/9.2.88.1
module load gccgpu/7.4.0
. /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC

However, we still ended up with the same errors shown above. Does anyone know where this cudaStreamSynchronize failure comes from? Or am I using those gmx GPU options incorrectly?

Any input will be appreciated!

Thanks!
Chenou
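As for where cudaStreamSynchronize comes in: it is simply the point where mdrun waits on GPU work launched asynchronously earlier, so the illegal memory access usually originates in a previously launched kernel rather than in the synchronize call itself. Two standard CUDA-side ways to localize it, sketched with the command from this thread (both slow the run down considerably, so use a short test case):

```
# serialize kernel launches so the failing launch is reported at its call site
CUDA_LAUNCH_BLOCKING=1 gmx mdrun -v -s md_seed_fixed.tpr -deffnm sync_test \
    -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 \
    -gputasks 00112233 -maxh 1

# or run under cuda-memcheck to catch the out-of-bounds access directly
cuda-memcheck gmx mdrun -v -s md_seed_fixed.tpr -deffnm memcheck_test \
    -ntmpi 8 -nb gpu -pme gpu -maxh 1
```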