Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)

2013-10-15 Thread Mark Abraham
On Thu, Oct 10, 2013 at 2:34 PM, James jamesresearch...@gmail.com wrote:

 Dear Mark,

 Thanks again for your response.

 Many of the regression tests seem to have passed:

 All 16 simple tests PASSED
 All 19 complex tests PASSED
 All 142 kernel tests PASSED
 All 9 freeenergy tests PASSED
 All 0 extra tests PASSED
 Error not all 42 pdb2gmx tests have been done successfully
 Only 0 energies in the log file
 pdb2gmx tests FAILED

 I'm not sure why pdb2gmx failed but I suppose it will not impact the
 crashing I'm experiencing.


No, that's fine. Those tests probably just lack sufficiently explicit
guards to stop people from running the energy minimization with a
more-than-useful number of OpenMP threads.
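
If you want to double-check that, re-running just the pdb2gmx set with a
single thread should tell you. A rough sketch, assuming the 4.6
regressiontests package with its gmxtest.pl driver and a bash shell (adjust
the path to wherever you unpacked the tests):

  cd /path/to/regressiontests
  export OMP_NUM_THREADS=1
  perl gmxtest.pl pdb2gmx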


 Regarding the stack trace showing line numbers, what is the best way to go
 about this, in this context? I'm not really experienced in that aspect.


That's a matter of compiling in debug mode (use cmake ..
-DCMAKE_BUILD_TYPE=Debug) and hopefully observing the same crash, now with
an error message that carries more useful information. Debug mode annotates
the executable so that a finger can be pointed at the code line that caused
the segfault. Hopefully the compiler does this properly, but support for
this inside OpenMP regions is a corner compiler writers might cut ;-)
Depending on the details, loading a core dump in a debugger can also be
necessary, but your local sysadmins are the people to talk to about that.
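
A rough sketch of what that can look like (the build directory, install
prefix and debugger below are only guesses for your machine; a Fujitsu
system may provide its own debugger rather than gdb):

  cd /path/to/gromacs-4.6.3/build-debug
  cmake .. -DCMAKE_BUILD_TYPE=Debug -DGMX_MPI=ON -DGMX_DOUBLE=ON
  make install
  # reproduce the crash, then open the core file (if one gets written):
  gdb /home/username/Gromacs463/bin/mdrun_mpi_d core
  (gdb) bt        # backtrace, now with source files and line numbers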

Mark

Thanks again for your help!

 Best regards,

 James


 On 21 September 2013 23:12, Mark Abraham mark.j.abra...@gmail.com wrote:

  On Sat, Sep 21, 2013 at 2:45 PM, James jamesresearch...@gmail.com
 wrote:
   Dear Mark and the rest of the Gromacs team,
  
   Thanks a lot for your response. I have been trying to isolate the
 problem
   and have also been in discussion with the support staff. They suggested
  it
   may be a bug in the gromacs code, and I have tried to isolate the
 problem
   more precisely.
 
  First, do the GROMACS regression tests for Verlet kernels pass? (Run
  them all, but those with nbnxn prefix are of interest here.) They
  likely won't scale to 16 OMP threads, but you can vary OMP_NUM_THREADS
  environment variable to see what you can see.
 
   Considering that the calculation is run under MPI with 16 OpenMP cores
  per
   MPI node, the error seems to occur under the following conditions:
  
   A few thousand atoms: 1 or 2 MPI nodes: OK
   Double the number of atoms (~15,000): 1 MPI node: OK, 2 MPI nodes:
  SIGSEGV
   error described below.
  
   So it seems that the error occurs for relatively large systems which
 use
   MPI.
 
  ~500 atoms per core (thread) is a system in the normal GROMACS scaling
  regime. 16 OMP threads is more than is useful on other HPC systems,
  but since we don't know what your hardware is, whether you are
  investigating something useful is your decision.
 
   The crash mentions the calc_cell_indices function (see below). Is
 this
   somehow a problem with memory not being sufficient at the MPI interface
  at
   this function? I'm not sure how to proceed further. Any help would be
   greatly appreciated.
 
  If there is a problem with GROMACS (which so far I doubt), we'd need a
  stack trace that shows a line number (rather than addresses) in order
  to start to locate it.
 
  Mark
 
   Gromacs version is 4.6.3.
  
   Thank you very much for your time.
  
   James
  
  
   On 4 September 2013 16:05, Mark Abraham mark.j.abra...@gmail.com
  wrote:
  
   On Sep 4, 2013 7:59 AM, James jamesresearch...@gmail.com wrote:
   
Dear all,
   
I'm trying to run Gromacs on a Fujitsu supercomputer but the
 software
  is
crashing.
   
I run grompp:
   
grompp_mpi_d -f parameters.mdp -c system.pdb -p overthe.top
   
and it produces the error:
   
jwe1050i-w The hardware barrier couldn't be used and continues
  processing
using the software barrier.
taken to (standard) corrective action, execution continuing.
error summary (Fortran)
error number error level error count
jwe1050i w 1
total error count = 1
   
but still outputs topol.tpr so I can continue.
  
   There's no value in compiling grompp with MPI or in double precision.
  
I then run with
   
export FLIB_FASTOMP=FALSE
source /home/username/Gromacs463/bin/GMXRC.bash
mpiexec mdrun_mpi_d -ntomp 16 -v
   
but it crashes:
   
starting mdrun 'testrun'
5 steps, 100.0 ps.
jwe0019i-u The program was terminated abnormally with signal number
   SIGSEGV.
signal identifier = SEGV_MAPERR, address not mapped to object
error occurs at calc_cell_indices._OMP_1 loc 00233474 offset
03b4
calc_cell_indices._OMP_1 at loc 002330c0 called from loc
02088fa0 in start_thread
start_thread at loc 02088e4c called from loc
 029d19b4
  in
__thread_start
__thread_start at loc 029d1988 called from o.s.
error summary (Fortran)
error number error level error count
jwe0019i 

Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)

2013-10-10 Thread James
Dear Mark,

Thanks again for your response.

Many of the regression tests seem to have passed:

All 16 simple tests PASSED
All 19 complex tests PASSED
All 142 kernel tests PASSED
All 9 freeenergy tests PASSED
All 0 extra tests PASSED
Error not all 42 pdb2gmx tests have been done successfully
Only 0 energies in the log file
pdb2gmx tests FAILED

I'm not sure why pdb2gmx failed but I suppose it will not impact the
crashing I'm experiencing.

Regarding the stack trace showing line numbers, what is the best way to go
about this, in this context? I'm not really experienced in that aspect.

Thanks again for your help!

Best regards,

James


On 21 September 2013 23:12, Mark Abraham mark.j.abra...@gmail.com wrote:

 On Sat, Sep 21, 2013 at 2:45 PM, James jamesresearch...@gmail.com wrote:
  Dear Mark and the rest of the Gromacs team,
 
  Thanks a lot for your response. I have been trying to isolate the problem
  and have also been in discussion with the support staff. They suggested
 it
  may be a bug in the gromacs code, and I have tried to isolate the problem
  more precisely.

 First, do the GROMACS regression tests for Verlet kernels pass? (Run
 them all, but those with nbnxn prefix are of interest here.) They
 likely won't scale to 16 OMP threads, but you can vary OMP_NUM_THREADS
 environment variable to see what you can see.

  Considering that the calculation is run under MPI with 16 OpenMP cores
 per
  MPI node, the error seems to occur under the following conditions:
 
  A few thousand atoms: 1 or 2 MPI nodes: OK
  Double the number of atoms (~15,000): 1 MPI node: OK, 2 MPI nodes:
 SIGSEGV
  error described below.
 
  So it seems that the error occurs for relatively large systems which use
  MPI.

 ~500 atoms per core (thread) is a system in the normal GROMACS scaling
 regime. 16 OMP threads is more than is useful on other HPC systems,
 but since we don't know what your hardware is, whether you are
 investigating something useful is your decision.

  The crash mentions the calc_cell_indices function (see below). Is this
  somehow a problem with memory not being sufficient at the MPI interface
 at
  this function? I'm not sure how to proceed further. Any help would be
  greatly appreciated.

 If there is a problem with GROMACS (which so far I doubt), we'd need a
 stack trace that shows a line number (rather than addresses) in order
 to start to locate it.

 Mark

  Gromacs version is 4.6.3.
 
  Thank you very much for your time.
 
  James
 
 
  On 4 September 2013 16:05, Mark Abraham mark.j.abra...@gmail.com
 wrote:
 
  On Sep 4, 2013 7:59 AM, James jamesresearch...@gmail.com wrote:
  
   Dear all,
  
   I'm trying to run Gromacs on a Fujitsu supercomputer but the software
 is
   crashing.
  
   I run grompp:
  
   grompp_mpi_d -f parameters.mdp -c system.pdb -p overthe.top
  
   and it produces the error:
  
   jwe1050i-w The hardware barrier couldn't be used and continues
 processing
   using the software barrier.
   taken to (standard) corrective action, execution continuing.
   error summary (Fortran)
   error number error level error count
   jwe1050i w 1
   total error count = 1
  
   but still outputs topol.tpr so I can continue.
 
  There's no value in compiling grompp with MPI or in double precision.
 
   I then run with
  
   export FLIB_FASTOMP=FALSE
   source /home/username/Gromacs463/bin/GMXRC.bash
   mpiexec mdrun_mpi_d -ntomp 16 -v
  
   but it crashes:
  
   starting mdrun 'testrun'
   5 steps, 100.0 ps.
   jwe0019i-u The program was terminated abnormally with signal number
  SIGSEGV.
   signal identifier = SEGV_MAPERR, address not mapped to object
   error occurs at calc_cell_indices._OMP_1 loc 00233474 offset
   03b4
   calc_cell_indices._OMP_1 at loc 002330c0 called from loc
   02088fa0 in start_thread
   start_thread at loc 02088e4c called from loc 029d19b4
 in
   __thread_start
   __thread_start at loc 029d1988 called from o.s.
   error summary (Fortran)
   error number error level error count
   jwe0019i u 1
   jwe1050i w 1
   total error count = 2
   [ERR.] PLE 0014 plexec The process terminated
  
 
 
 abnormally.(rank=1)(nid=0x03060006)(exitstatus=240)(CODE=2002,1966080,61440)
   [ERR.] PLE The program that the user specified may be illegal or
   inaccessible on the node.(nid=0x03060006)
  
   Any ideas what could be wrong? It works on my local intel machine.
 
  Looks like it wasn't compiled correctly for the target machine. What was
  the cmake command, what does mdrun -version output? Also, if this is
 the K
  computer, probably we can't help, because the compiler docs are
 officially
  unavailable to us. National secret, and all ;-)
 
  Mark
 
  
   Thanks in advance,
  
   James

Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)

2013-09-21 Thread James
Dear Mark and the rest of the Gromacs team,

Thanks a lot for your response. I have been trying to isolate the problem
and have also been in discussion with the support staff. They suggested it
may be a bug in the gromacs code, and I have tried to isolate the problem
more precisely.

Considering that the calculation is run under MPI with 16 OpenMP cores per
MPI node, the error seems to occur under the following conditions:

A few thousand atoms: 1 or 2 MPI nodes: OK
Double the number of atoms (~15,000): 1 MPI node: OK, 2 MPI nodes: SIGSEGV
error described below.

So it seems that the error occurs for relatively large systems which use
MPI.

The crash mentions the calc_cell_indices function (see below). Is this
somehow a problem with memory not being sufficient at the MPI interface at
this function? I'm not sure how to proceed further. Any help would be
greatly appreciated.

Gromacs version is 4.6.3.

Thank you very much for your time.

James


On 4 September 2013 16:05, Mark Abraham mark.j.abra...@gmail.com wrote:

 On Sep 4, 2013 7:59 AM, James jamesresearch...@gmail.com wrote:
 
  Dear all,
 
  I'm trying to run Gromacs on a Fujitsu supercomputer but the software is
  crashing.
 
  I run grompp:
 
  grompp_mpi_d -f parameters.mdp -c system.pdb -p overthe.top
 
  and it produces the error:
 
  jwe1050i-w The hardware barrier couldn't be used and continues processing
  using the software barrier.
  taken to (standard) corrective action, execution continuing.
  error summary (Fortran)
  error number error level error count
  jwe1050i w 1
  total error count = 1
 
  but still outputs topol.tpr so I can continue.

 There's no value in compiling grompp with MPI or in double precision.

  I then run with
 
  export FLIB_FASTOMP=FALSE
  source /home/username/Gromacs463/bin/GMXRC.bash
  mpiexec mdrun_mpi_d -ntomp 16 -v
 
  but it crashes:
 
  starting mdrun 'testrun'
  5 steps, 100.0 ps.
  jwe0019i-u The program was terminated abnormally with signal number
 SIGSEGV.
  signal identifier = SEGV_MAPERR, address not mapped to object
  error occurs at calc_cell_indices._OMP_1 loc 00233474 offset
  03b4
  calc_cell_indices._OMP_1 at loc 002330c0 called from loc
  02088fa0 in start_thread
  start_thread at loc 02088e4c called from loc 029d19b4 in
  __thread_start
  __thread_start at loc 029d1988 called from o.s.
  error summary (Fortran)
  error number error level error count
  jwe0019i u 1
  jwe1050i w 1
  total error count = 2
  [ERR.] PLE 0014 plexec The process terminated
 

 abnormally.(rank=1)(nid=0x03060006)(exitstatus=240)(CODE=2002,1966080,61440)
  [ERR.] PLE The program that the user specified may be illegal or
  inaccessible on the node.(nid=0x03060006)
 
  Any ideas what could be wrong? It works on my local intel machine.

 Looks like it wasn't compiled correctly for the target machine. What was
 the cmake command, what does mdrun -version output? Also, if this is the K
 computer, probably we can't help, because the compiler docs are officially
 unavailable to us. National secret, and all ;-)

 Mark

 
  Thanks in advance,
 
  James



Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)

2013-09-21 Thread Mark Abraham
On Sat, Sep 21, 2013 at 2:45 PM, James jamesresearch...@gmail.com wrote:
 Dear Mark and the rest of the Gromacs team,

 Thanks a lot for your response. I have been trying to isolate the problem
 and have also been in discussion with the support staff. They suggested it
 may be a bug in the gromacs code, and I have tried to isolate the problem
 more precisely.

First, do the GROMACS regression tests for Verlet kernels pass? (Run
them all, but those with the nbnxn prefix are of interest here.) They
likely won't scale to 16 OpenMP threads, but you can vary the
OMP_NUM_THREADS environment variable to see what you can see.
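
A sketch of how to sweep that, assuming a bash shell and the gmxtest.pl
driver from the regressiontests package (the driver name and arguments are
from memory, so check what your copy of the tests provides):

  cd /path/to/regressiontests
  for n in 1 2 4 8 16; do
    OMP_NUM_THREADS=$n perl gmxtest.pl all   # watch the nbnxn_* cases in particular
  done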

 Considering that the calculation is run under MPI with 16 OpenMP cores per
 MPI node, the error seems to occur under the following conditions:

 A few thousand atoms: 1 or 2 MPI nodes: OK
 Double the number of atoms (~15,000): 1 MPI node: OK, 2 MPI nodes: SIGSEGV
 error described below.

 So it seems that the error occurs for relatively large systems which use
 MPI.

At ~500 atoms per core (thread), your system is in the normal GROMACS
scaling regime. 16 OpenMP threads per rank is more than is useful on other
HPC systems, but since we don't know what your hardware is, only you can
judge whether you are investigating something useful.
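
If your two nodes really do have 16 cores each, more MPI ranks with fewer
OpenMP threads per rank is usually the safer split. A sketch only, since
the rank-count option of your mpiexec may differ on a Fujitsu system:

  export OMP_NUM_THREADS=4
  mpiexec -n 8 mdrun_mpi_d -ntomp 4 -v   # 8 ranks x 4 threads = 32 cores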

 The crash mentions the calc_cell_indices function (see below). Is this
 somehow a problem with memory not being sufficient at the MPI interface at
 this function? I'm not sure how to proceed further. Any help would be
 greatly appreciated.

If there is a problem with GROMACS (which so far I doubt), we'd need a
stack trace that shows a line number (rather than addresses) in order
to start to locate it.

Mark

 Gromacs version is 4.6.3.

 Thank you very much for your time.

 James


 On 4 September 2013 16:05, Mark Abraham mark.j.abra...@gmail.com wrote:

 On Sep 4, 2013 7:59 AM, James jamesresearch...@gmail.com wrote:
 
  Dear all,
 
  I'm trying to run Gromacs on a Fujitsu supercomputer but the software is
  crashing.
 
  I run grompp:
 
  grompp_mpi_d -f parameters.mdp -c system.pdb -p overthe.top
 
  and it produces the error:
 
  jwe1050i-w The hardware barrier couldn't be used and continues processing
  using the software barrier.
  taken to (standard) corrective action, execution continuing.
  error summary (Fortran)
  error number error level error count
  jwe1050i w 1
  total error count = 1
 
  but still outputs topol.tpr so I can continue.

 There's no value in compiling grompp with MPI or in double precision.

  I then run with
 
  export FLIB_FASTOMP=FALSE
  source /home/username/Gromacs463/bin/GMXRC.bash
  mpiexec mdrun_mpi_d -ntomp 16 -v
 
  but it crashes:
 
  starting mdrun 'testrun'
  5 steps, 100.0 ps.
  jwe0019i-u The program was terminated abnormally with signal number
 SIGSEGV.
  signal identifier = SEGV_MAPERR, address not mapped to object
  error occurs at calc_cell_indices._OMP_1 loc 00233474 offset
  03b4
  calc_cell_indices._OMP_1 at loc 002330c0 called from loc
  02088fa0 in start_thread
  start_thread at loc 02088e4c called from loc 029d19b4 in
  __thread_start
  __thread_start at loc 029d1988 called from o.s.
  error summary (Fortran)
  error number error level error count
  jwe0019i u 1
  jwe1050i w 1
  total error count = 2
  [ERR.] PLE 0014 plexec The process terminated
 

 abnormally.(rank=1)(nid=0x03060006)(exitstatus=240)(CODE=2002,1966080,61440)
  [ERR.] PLE The program that the user specified may be illegal or
  inaccessible on the node.(nid=0x03060006)
 
  Any ideas what could be wrong? It works on my local intel machine.

 Looks like it wasn't compiled correctly for the target machine. What was
 the cmake command, what does mdrun -version output? Also, if this is the K
 computer, probably we can't help, because the compiler docs are officially
 unavailable to us. National secret, and all ;-)

 Mark

 
  Thanks in advance,
 
  James

Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)

2013-09-04 Thread Mark Abraham
On Sep 4, 2013 7:59 AM, James jamesresearch...@gmail.com wrote:

 Dear all,

 I'm trying to run Gromacs on a Fujitsu supercomputer but the software is
 crashing.

 I run grompp:

 grompp_mpi_d -f parameters.mdp -c system.pdb -p overthe.top

 and it produces the error:

 jwe1050i-w The hardware barrier couldn't be used and continues processing
 using the software barrier.
 taken to (standard) corrective action, execution continuing.
 error summary (Fortran)
 error number error level error count
 jwe1050i w 1
 total error count = 1

 but still outputs topol.tpr so I can continue.

There's no value in compiling grompp with MPI or in double precision.

 I then run with

 export FLIB_FASTOMP=FALSE
 source /home/username/Gromacs463/bin/GMXRC.bash
 mpiexec mdrun_mpi_d -ntomp 16 -v

 but it crashes:

 starting mdrun 'testrun'
 5 steps, 100.0 ps.
 jwe0019i-u The program was terminated abnormally with signal number
SIGSEGV.
 signal identifier = SEGV_MAPERR, address not mapped to object
 error occurs at calc_cell_indices._OMP_1 loc 00233474 offset
 03b4
 calc_cell_indices._OMP_1 at loc 002330c0 called from loc
 02088fa0 in start_thread
 start_thread at loc 02088e4c called from loc 029d19b4 in
 __thread_start
 __thread_start at loc 029d1988 called from o.s.
 error summary (Fortran)
 error number error level error count
 jwe0019i u 1
 jwe1050i w 1
 total error count = 2
 [ERR.] PLE 0014 plexec The process terminated

abnormally.(rank=1)(nid=0x03060006)(exitstatus=240)(CODE=2002,1966080,61440)
 [ERR.] PLE The program that the user specified may be illegal or
 inaccessible on the node.(nid=0x03060006)

 Any ideas what could be wrong? It works on my local intel machine.

Looks like it wasn't compiled correctly for the target machine. What was
the cmake command, and what does mdrun -version output? Also, if this is
the K computer, we probably can't help, because the compiler docs are
officially unavailable to us. National secret, and all ;-)
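
Concretely, the useful things to paste back are the configure line and the
version header, e.g. (the flags below are placeholders, not a
recommendation for your machine):

  cmake .. -DGMX_MPI=ON -DGMX_DOUBLE=ON   # plus whatever compiler settings were actually used
  mdrun_mpi_d -version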

Mark


 Thanks in advance,

 James