Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)
On Thu, Oct 10, 2013 at 2:34 PM, James jamesresearch...@gmail.com wrote:

> Dear Mark,
>
> Thanks again for your response. Many of the regression tests seem to have passed:
>
>   All 16 simple tests PASSED
>   All 19 complex tests PASSED
>   All 142 kernel tests PASSED
>   All 9 freeenergy tests PASSED
>   All 0 extra tests PASSED
>   Error not all 42 pdb2gmx tests have been done successfully
>   Only 0 energies in the log file
>   pdb2gmx tests FAILED
>
> I'm not sure why pdb2gmx failed, but I suppose it will not impact the crashing I'm experiencing.

No, that's fine. Probably they don't have sufficiently explicit guards to stop people running the energy minimization with a more-than-useful number of OpenMP threads.

> Regarding the stack trace showing line numbers, what is the best way to go about this, in this context? I'm not really experienced in that aspect.

That's a matter of compiling in debug mode (use cmake .. -DCMAKE_BUILD_TYPE=Debug) and hopefully observing the same crash with an error message that contains more useful information. The debug build annotates the executable so that a finger can be pointed at the line of code that caused the segfault. Hopefully the compiler does this properly, but support for this inside OpenMP regions is a corner compiler writers might cut ;-) Depending on the details, loading a core dump in a debugger can also be necessary, but your local sysadmins are the people to talk to there.

Mark
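For reference, a minimal sketch of that debug workflow on a generic Linux cluster (the install prefix, source path, and core-file handling are placeholders, and gdb is assumed to be available where the job runs; none of this is Fujitsu-specific):

    # configure and build GROMACS 4.6.3 with debug symbols (out-of-source build)
    cd gromacs-4.6.3 && mkdir build-debug && cd build-debug
    cmake .. -DCMAKE_BUILD_TYPE=Debug -DGMX_MPI=ON -DGMX_DOUBLE=ON \
             -DCMAKE_INSTALL_PREFIX=$HOME/Gromacs463-debug
    make && make install

    # reproduce the crash and keep the core file
    ulimit -c unlimited
    mpiexec mdrun_mpi_d -ntomp 16 -v

    # load the core dump and ask for a backtrace with file:line information
    gdb $HOME/Gromacs463-debug/bin/mdrun_mpi_d core
    (gdb) bt

With a Debug build the backtrace should name the source file and line inside calc_cell_indices rather than bare addresses, which is what would be needed to take this further.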
Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)
Dear Mark,

Thanks again for your response. Many of the regression tests seem to have passed:

  All 16 simple tests PASSED
  All 19 complex tests PASSED
  All 142 kernel tests PASSED
  All 9 freeenergy tests PASSED
  All 0 extra tests PASSED
  Error not all 42 pdb2gmx tests have been done successfully
  Only 0 energies in the log file
  pdb2gmx tests FAILED

I'm not sure why pdb2gmx failed, but I suppose it will not impact the crashing I'm experiencing.

Regarding the stack trace showing line numbers, what is the best way to go about this, in this context? I'm not really experienced in that aspect.

Thanks again for your help!

Best regards,

James
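As an aside, a quick way to check whether the pdb2gmx test failure is only a thread-count effect would be to re-run just that test set with a single OpenMP thread (a sketch, assuming the gmxtest.pl driver from the regression-test tarball; the directory name is a placeholder):

    cd regressiontests-4.6.3
    export OMP_NUM_THREADS=1
    perl gmxtest.pl pdb2gmx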
Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)
Dear Mark and the rest of the Gromacs team,

Thanks a lot for your response. I have been trying to isolate the problem and have also been in discussion with the support staff. They suggested it may be a bug in the Gromacs code, and I have tried to isolate the problem more precisely.

The calculation is run under MPI with 16 OpenMP threads per MPI rank, and the error seems to occur under the following conditions:

  A few thousand atoms: 1 or 2 MPI ranks: OK
  Double the number of atoms (~15,000): 1 MPI rank: OK; 2 MPI ranks: the SIGSEGV error described below

So it seems that the error occurs for relatively large systems which use MPI. The crash mentions the calc_cell_indices function (see below). Is this somehow a problem with memory not being sufficient at the MPI interface in this function? I'm not sure how to proceed further. Any help would be greatly appreciated.

The Gromacs version is 4.6.3.

Thank you very much for your time.

James
Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)
On Sat, Sep 21, 2013 at 2:45 PM, James jamesresearch...@gmail.com wrote:

> Dear Mark and the rest of the Gromacs team,
>
> Thanks a lot for your response. I have been trying to isolate the problem and have also been in discussion with the support staff. They suggested it may be a bug in the Gromacs code, and I have tried to isolate the problem more precisely.

First, do the GROMACS regression tests for the Verlet kernels pass? (Run them all, but those with the nbnxn prefix are of interest here.) They likely won't scale to 16 OpenMP threads, but you can vary the OMP_NUM_THREADS environment variable to see what you can see.

> The calculation is run under MPI with 16 OpenMP threads per MPI rank, and the error seems to occur under the following conditions:
>
>   A few thousand atoms: 1 or 2 MPI ranks: OK
>   Double the number of atoms (~15,000): 1 MPI rank: OK; 2 MPI ranks: the SIGSEGV error described below
>
> So it seems that the error occurs for relatively large systems which use MPI.

~500 atoms per core (thread) is a system in the normal GROMACS scaling regime. 16 OpenMP threads per rank is more than is useful on other HPC systems, but since we don't know what your hardware is, whether you are investigating something useful is your decision.

> The crash mentions the calc_cell_indices function (see below). Is this somehow a problem with memory not being sufficient at the MPI interface in this function? I'm not sure how to proceed further. Any help would be greatly appreciated.

If there is a problem with GROMACS (which so far I doubt), we'd need a stack trace that shows a line number (rather than addresses) in order to start to locate it.

Mark
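A sketch of the kind of scan suggested above (assuming the gmxtest.pl driver from the 4.6.3 regression-test tarball; the -n rank count shown for mpiexec is illustrative, since the Fujitsu job scheduler may set ranks differently):

    # kernel (nbnxn_*) regression tests with a smaller OpenMP thread count
    cd regressiontests-4.6.3
    export OMP_NUM_THREADS=2
    perl gmxtest.pl kernel

    # the failing ~15,000-atom system with different rank x thread decompositions
    export OMP_NUM_THREADS=8
    mpiexec -n 2 mdrun_mpi_d -ntomp 8 -v    # 2 MPI ranks x 8 OpenMP threads
    export OMP_NUM_THREADS=4
    mpiexec -n 4 mdrun_mpi_d -ntomp 4 -v    # 4 MPI ranks x 4 OpenMP threads

If the segfault tracks the 2-rank, large-system combination regardless of thread count, that points toward the domain-decomposition path rather than the OpenMP parallelism as such.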
Re: [gmx-users] jwe1050i + jwe0019i errors = SIGSEGV (Fujitsu)
On Sep 4, 2013 7:59 AM, James jamesresearch...@gmail.com wrote:

> Dear all,
>
> I'm trying to run Gromacs on a Fujitsu supercomputer but the software is crashing.
>
> I run grompp:
>
>     grompp_mpi_d -f parameters.mdp -c system.pdb -p overthe.top
>
> and it produces the error:
>
>     jwe1050i-w The hardware barrier couldn't be used and continues processing using the software barrier.
>     taken to (standard) corrective action, execution continuing.
>     error summary (Fortran)
>     error number  error level  error count
>     jwe1050i      w            1
>     total error count = 1
>
> but it still outputs topol.tpr so I can continue.

There's no value in compiling grompp with MPI or in double precision.

> I then run with
>
>     export FLIB_FASTOMP=FALSE
>     source /home/username/Gromacs463/bin/GMXRC.bash
>     mpiexec mdrun_mpi_d -ntomp 16 -v
>
> but it crashes:
>
>     starting mdrun 'testrun'
>     5 steps, 100.0 ps.
>     jwe0019i-u The program was terminated abnormally with signal number SIGSEGV.
>       signal identifier = SEGV_MAPERR, address not mapped to object
>       error occurs at calc_cell_indices._OMP_1 loc 00233474 offset 03b4
>       calc_cell_indices._OMP_1 at loc 002330c0 called from loc 02088fa0 in start_thread
>       start_thread at loc 02088e4c called from loc 029d19b4 in __thread_start
>       __thread_start at loc 029d1988 called from o.s.
>     error summary (Fortran)
>     error number  error level  error count
>     jwe0019i      u            1
>     jwe1050i      w            1
>     total error count = 2
>     [ERR.] PLE 0014 plexec The process terminated abnormally.(rank=1)(nid=0x03060006)(exitstatus=240)(CODE=2002,1966080,61440)
>     [ERR.] PLE The program that the user specified may be illegal or inaccessible on the node.(nid=0x03060006)
>
> Any ideas what could be wrong? It works on my local Intel machine.

Looks like it wasn't compiled correctly for the target machine. What was the cmake command, and what does mdrun -version output?

Also, if this is the K computer, we probably can't help, because the compiler docs are officially unavailable to us. National secret, and all ;-)

Mark

> Thanks in advance,
>
> James
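For completeness, checking what the installed binary was actually built with, and a typical 4.6.x configure line for an MPI, double-precision mdrun, might look roughly like this (a sketch only: the Fujitsu compiler wrapper names are placeholders for whatever the site documentation specifies, not verified settings):

    # report precision, MPI support, compiler and flags of the installed binary
    mdrun_mpi_d -version

    # reconfigure for the target machine (out-of-source build)
    cmake .. -DGMX_MPI=ON -DGMX_DOUBLE=ON \
             -DCMAKE_C_COMPILER=mpifcc -DCMAKE_CXX_COMPILER=mpiFCC \
             -DCMAKE_INSTALL_PREFIX=$HOME/Gromacs463

As noted above, grompp and the analysis tools gain nothing from MPI or double precision, so they can come from a separate plain serial build; only mdrun needs the MPI-enabled configuration here.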