Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-05 Thread Szilárd Páll
Can you please file an issue on redmine.gromacs.org and attach the inputs
that reproduce the behavior described?

--
Szilárd

On Wed, Dec 4, 2019, 21:35 Chenou Zhang  wrote:

> We did test that.
> Our cluster has 11 GPU nodes in total, and I ran 20 tests across all of
> them. 7 of the 20 tests showed the potential energy jump issue, and they
> were running on 5 different nodes.
> So I tend to believe this issue can happen on any of those nodes.

Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-04 Thread Chenou Zhang
We did test that.
Our cluster has 11 GPU nodes in total, and I ran 20 tests across all of
them. 7 of the 20 tests showed the potential energy jump issue, and they
were running on 5 different nodes.
So I tend to believe this issue can happen on any of those nodes.
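In case it is useful, the sweep can be scripted by submitting the same job
once per node; a rough sketch (the node names, GPU count, and job script name
are placeholders, not our actual setup):

```
# Submit the same MD job to each GPU node so that failures can be tied back
# to a specific node/card. Node names and run_md.sh are placeholders.
for node in gpu01 gpu02 gpu03; do
    sbatch --nodelist="$node" --gres=gpu:4 run_md.sh
done
```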

On Wed, Dec 4, 2019 at 1:14 PM Szilárd Páll  wrote:

> The fact that you are observing errors along with energies that are off by
> so much, and that it reproduces with multiple inputs, suggests that this
> may not be a code issue. Did you do all the runs that failed on the same
> hardware? Have you excluded the possibility that one of those GeForce
> cards may be flaky?
>
> --
> Szilárd
>

Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-04 Thread Szilárd Páll
The fact that you are observing errors along with energies that are off by
so much, and that it reproduces with multiple inputs, suggests that this
may not be a code issue. Did you do all the runs that failed on the same
hardware? Have you excluded the possibility that one of those GeForce
cards may be flaky?
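
A quick first sanity check on the nodes that hosted the failed runs could be
something along these lines (generic commands, not specific to GROMACS):

```
# NVIDIA "Xid" messages in the kernel log often accompany flaky GPUs or
# driver faults; nvidia-smi shows the current per-card state at a glance.
dmesg | grep -i xid
nvidia-smi
```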

--
Szilárd


On Wed, Dec 4, 2019 at 7:47 PM Chenou Zhang  wrote:

> We tried the same gmx settings in 2019.4 with different protein systems,
> and we got the same weird potential energy jump within 1000 steps.

Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-04 Thread Chenou Zhang
We tried the same gmx settings in 2019.4 with different protein systems,
and we got the same weird potential energy jump within 1000 steps.

```

           Step           Time
              0            0.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    2.08204e+04    9.92358e+04    6.53063e+04    1.06706e+03   -2.75672e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.50031e+04   -4.86857e+04    3.10386e+04   -1.09745e+06    4.81832e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -9.09123e+05    2.80635e+05   -6.28487e+05   -6.28428e+05    3.04667e+02
 Pressure (bar)   Constr. rmsd
   -1.56013e+00    3.60634e-06

DD  step 999 load imb.: force 14.6%  pme mesh/force 0.581
           Step           Time
           1000            2.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    2.04425e+04    9.92768e+04    6.52873e+04    1.02016e+03   -2.45851e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.49863e+04   -4.91092e+04    3.10572e+04   -1.09508e+06    4.97942e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.77598e+05    1.35726e+35    1.35726e+35    3.01370e+02
 Pressure (bar)   Constr. rmsd
   -7.55250e+01    3.63239e-06

DD  step 1999 load imb.: force 16.1%  pme mesh/force 0.598
           Step           Time
           2000            4.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    1.99521e+04    9.97482e+04    6.49595e+04    1.00798e+03   -2.42567e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.50156e+04   -4.85324e+04    3.01944e+04   -1.09620e+06    4.82958e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.79206e+05    1.35726e+35    1.35726e+35    3.03115e+02
 Pressure (bar)   Constr. rmsd
   -5.50508e+01    3.64353e-06

DD  step 2999 load imb.: force 16.6%  pme mesh/force 0.602
           Step           Time
           3000            6.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    1.98590e+04    9.88100e+04    6.50934e+04    1.07048e+03   -2.38831e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    1.49609e+04   -4.93079e+04    3.12273e+04   -1.09582e+06    4.83209e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    1.35726e+35    2.79438e+05    1.35726e+35    1.35726e+35    3.03367e+02
 Pressure (bar)   Constr. rmsd
    7.62438e+01    3.61574e-06

```
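
The jump is easy to spot by pulling the potential-energy time series out of
the .edr file; a sketch (filenames are placeholders):

```
# Select the "Potential" term non-interactively, write it to an .xvg file,
# then print the first frames where it has blown up.
echo "Potential" | gmx energy -f md_seed_fixed.edr -o potential.xvg
awk '!/^[@#]/ && ($2 > 1e10 || $2 < -1e10)' potential.xvg | head
```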

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham 
wrote:

> Hi,
>
> What driver version is reported in the respective log files? Does the error
> persist if mdrun -notunepme is used?
>
> Mark
>

Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-03 Thread Chenou Zhang
Hi,

I've run 30 tests with the -notunepme option and got the following error
from one of them (still the same *cudaStreamSynchronize failed* error):


```
DD  step 1422999  vol min/aver 0.639  load imb.: force  1.1%  pme mesh/force 1.079
           Step           Time
        1423000         2846.0

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.79755e+04    1.78943e+05    1.22798e+05    2.83835e+03   -9.19303e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.56547e+04    5.11714e+05    9.77218e+03   -2.07148e+06    8.64504e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    7.64126e+13    4.79398e+05    7.64126e+13    7.64126e+13    3.58009e+02
 Pressure (bar)   Constr. rmsd
   -6.03201e+01    4.56399e-06

---
Program: gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank: 2 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
---
```

Here is the command and the driver info:

```
Command line:

  gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on
-nb gpu -ntomp 3 -pme gpu -pmefft gpu -notunepme -npme 1 -gputasks 00112233
-maxh 2 -cpt 60 -cpi -noappend


GROMACS version:2019.4

Precision:  single

Memory model:   64 bit

MPI library:thread_mpi

OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)

GPU support:CUDA

SIMD instructions:  AVX2_256

FFT library:fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512

RDTSCP usage:   enabled

TNG support:enabled

Hwloc support:  hwloc-1.11.2

Tracing support:disabled

C compiler: /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0

C compiler flags:-mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler:   /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0

C++ compiler flags:  -mavx2 -mfma-std=c++11   -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler:  /packages/7x/cuda/9.2.88.1/bin/nvcc nvcc: NVIDIA (R)
Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
Wed_Apr_11_23:16:29_CDT_2018;Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;;
;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:9.20

CUDA runtime:   9.20





Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs

Hardware detected:

  CPU info:

Vendor: Intel

Brand:  Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz

Family: 6   Model: 79   Stepping: 1

Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Full, with devices

Sockets, cores, and logical processors:

  Socket  0: [   0] [   1] [   2] [   3] [   4] [   5] [   6] [   7] [
  8] [   9] [  10] [  11]
  Socket  1: [  12] [  13] [  14] [  15] [  16] [  17] [  18] [  19] [
 20] [  21] [  22] [  23]
Numa nodes:

  Node  0 (34229563392 bytes mem):   0   1   2   3   4   5   6   7   8
  9  10  11
  Node  1 (34359738368 bytes mem):  12  13  14  15  16  17  18  19  20
 21  22  23
  Latency:
           0     1
     0  1.00  2.10
     1  2.10  1.00

Caches:

  L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways

  L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways

  L3: 31457280 bytes, linesize 64 bytes, assoc. 20, shared 12 ways
 PCI devices:

  :01:00.0  Id: 15b3:1007  Class: 0x0200  Numa: 0

  :02:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 0

  :03:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 0

  :00:11.4  Id: 8086:8d62  Class: 0x0106  Numa: 0

  :06:00.0  Id: 1a03:2000  Class: 0x0300  Numa: 0

  :00:1f.2  Id: 8086:8d02  Class: 0x0106  Numa: 0

  :81:00.0  Id: 8086:1521  Class: 0x0200  Numa: 1

  :81:00.1  Id: 8086:1521  Class: 0x0200  Numa: 1

  :82:00.0  Id: 15b3:1007  Class: 0x0280  Numa: 1

  :83:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 1

  :84:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 1

  GPU info:

Number of GPUs detected: 4

#0: NVIDIA GeForce 

Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-02 Thread Chenou Zhang
For the error:
```
^Mstep 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9
M-cycles
^Mstep 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4
M-cycles
/var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation
fault  gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb
gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS
-cpt 60 -cpi -noappend
```
I got this driver info:
```
GROMACS:  gmx mdrun, version 2019.4

Executable:   /home/rsexton2/Library/gromacs/2019.4/test1/bin/gmx

Data prefix:  /home/rsexton2/Library/gromacs/2019.4/test1

Working dir:  /scratch/czhan178/project/NapA-2019.4/gromacs_test_1/test_9

Process ID:   29866

Command line:

  gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on
-nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh 2
-cpt 60 -cpi -noappend


GROMACS version:2019.4

Precision:  single

Memory model:   64 bit

MPI library:thread_mpi

OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)

GPU support:CUDA

SIMD instructions:  AVX2_256

FFT library:fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512

RDTSCP usage:   enabled

TNG support:enabled

Hwloc support:  hwloc-1.11.2

Tracing support:disabled

C compiler: /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0

C compiler flags:-mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler:   /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0

C++ compiler flags:  -mavx2 -mfma-std=c++11   -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler:  /packages/7x/cuda/9.2.88.1/bin/nvcc nvcc: NVIDIA (R)
Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
Wed_Apr_11_23:16:29_CDT_2018;Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;;
;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:9.20

CUDA runtime:   9.20
```
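
Note that the "CUDA driver: 9.20" line above is the CUDA driver API version
that GROMACS queries; the NVIDIA kernel-module driver version itself can be
read on a node with, for example:

```
# Either command reports the installed NVIDIA driver version on the node.
nvidia-smi --query-gpu=driver_version,name --format=csv
cat /proc/driver/nvidia/version
```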

I'll run with the -notunepme option and keep you updated.

Chenou

On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham 
wrote:

> Hi,
>
> What driver version is reported in the respective log files? Does the error
> persist if mdrun -notunepme is used?
>
> Mark
>

Re: [gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-02 Thread Mark Abraham
Hi,

What driver version is reported in the respective log files? Does the error
persist if mdrun -notunepme is used?

Mark



[gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

2019-12-02 Thread Chenou Zhang
Hi Gromacs developers,

I'm currently running GROMACS 2019.4 on our university's HPC cluster. To
fully utilize the GPU nodes, I followed the notes at
http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html.


And here is the command I used for my runs.
```
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp
3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi
-noappend
```
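
The runs are submitted through Slurm; a minimal sketch of such a batch script
(the job name, partition, walltime, and resource lines are placeholders, not
our exact script):

```
#!/bin/bash
# Placeholders: job name, partition, and walltime are illustrative only;
# the GPU nodes have 24 cores and 4 cards each.
#SBATCH --job-name=md_seed_fixed
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:4
#SBATCH --time=02:10:00

module load cuda/9.2.88.1
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC

TPR=md_seed_fixed.tpr
HOURS=2

# 8 thread-MPI ranks with 3 OpenMP threads each fill the 24 cores;
# -gputasks 00112233 puts two ranks on each of the 4 GPUs, and -npme 1
# dedicates one of those ranks to PME (also run on its GPU).
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 \
    -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 \
    -cpi -noappend
```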

And some of those runs fail with the following error:
```
---
Program: gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank: 3 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
```

We also had a different error from the Slurm system:
```
^Mstep 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9
M-cycles
^Mstep 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4
M-cycles
/var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation
fault  gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb
gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS
-cpt 60 -cpi -noappend
```

We first thought this could be due to a compiler issue and tried the
following build configurations (see the build sketch after the list):
===test1===

module load cuda/9.2.88.1
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC

===test2===

module load cuda/9.2.88.1
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC

===test3===

module load cuda/9.2.148
module load gcc/7.3.0
. /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC

===test4===

module load cuda/9.2.148
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC

===test5===

module load cuda/9.1.85
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC

===test6===

module load cuda/9.0.176
module load gcc/6x
. /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC

===test7===

module load cuda/9.2.88.1
module load gccgpu/7.4.0
. /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC
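
Each test above is a separate GROMACS build against the listed CUDA/GCC
modules; a minimal sketch of one such build (the cmake options are
assumptions based on the version info in our logs, not the exact invocation):

```
# Sketch for test1; the other tests only swap the cuda/gcc modules and the
# install prefix.
module load cuda/9.2.88.1
module load gcc/7.3.0
tar xf gromacs-2019.4.tar.gz
cd gromacs-2019.4
mkdir build && cd build
cmake .. \
    -DGMX_GPU=ON \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DCMAKE_INSTALL_PREFIX="$HOME/Library/gromacs/2019.4/test1"
make -j "$(nproc)"
make install
```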


However, we still ended up with the same errors shown above. Does anyone
know where this cudaStreamSynchronize call comes in? Or am I using those
gmx GPU options incorrectly?

Any input will be appreciated!

Thanks!
Chenou
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.