Hi,

With DDT you can do offline debugging, just to find out where the program crashes:

ddt -n 8 --offline a.out ...

You might also want to try the reverse-connect feature.
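For example (a command sketch; exact flags can vary between DDT/Arm Forge versions, and the binary and parameter file names are taken from the job script quoted below):

```shell
# Offline mode: run the job non-interactively under DDT and write a
# crash report (stack traces, locals) instead of opening the GUI.
ddt --offline -o crash_report.html mpirun -np 8 \
    ./carp.debug.petsc.pt +F parameters_ECG_adjust.par

# Reverse connect: prepend "ddt --connect" in the batch script; when the
# job starts on the compute nodes it connects back to a DDT front-end
# already running on the login node (or the Forge remote client).
ddt --connect mpirun -np 8 \
    ./carp.debug.petsc.pt +F parameters_ECG_adjust.par
```

Both forms avoid `mpirun --debug`, so they sidestep the disabled-option error discussed below.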
Cheers,

Gilles

"Beheshti, Mohammadali" <mohammadali.behes...@uhn.ca> wrote:
>Hi Gus,
>
>Thank you very much for your prompt response. The myjob.sh script is as
>follows:
>
>#!/bin/bash
>#PBS -N myjob
>#PBS -l nodes=1:ppn=8
>#PBS -l walltime=120:00:00
>#PBS -l pvmem=2000MB
>module load openmpi/2.0.0
>cd /cluster/home/t48263uhn/Carp/PlosOneData/
>mpirun -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par
>
>I am a user of a cardiac modeling software package named "CARP". I tried
>to attach a parallel debugger to my job as you suggested. First I tried
>TotalView by adding the -tv option to the mpirun command:
>
>mpirun -tv -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par
>
>but in the output file I get the following error:
>
>"This version of Open MPI is known to have a problem using the "--debug"
>option to mpirun, and has therefore disabled it. This functionality will
>be restored in a future version of Open MPI.
>
>Please see https://github.com/open-mpi/ompi/issues/1225 for details."
>
>Then I tried DDT by using the --debug option after mpirun, which gives me
>the same error.
>
>I believe there is an older version of Open MPI on the system, but the
>system admin asked me not to use it.
>
>I may try that and report the results. I have also attached the missing
>files in gzip format.
>
>Thanks,
>
>Ali
>________________________________________
>From: users [users-boun...@lists.open-mpi.org] on behalf of Gus Correa [g...@ldeo.columbia.edu]
>Sent: Tuesday, November 15, 2016 5:42 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator compute with errorcode 59
>
>Hi Mohammadali
>
>Signal number 11 (SEGV) is the Unix/Linux signal for a memory violation
>(a.k.a. segmentation violation or segmentation fault).
>This normally happens when the program tries to read or write in a
>memory area that it did not allocate, has already freed, or that belongs
>to another process.
>That is most likely a programming error in the FEM code, probably not an
>MPI error, and probably not a PETSc error either.
>
>The "errorcode 59" seems to be the PETSc error message issued when PETSc
>receives a signal (in this case a segmentation fault signal, I guess)
>from the operating system (Linux, probably).
>Apparently it simply prints the error message and calls MPI_Abort, and
>the program stops.
>This is what the petscerror.h include file has for error code 59:
>
>#define PETSC_ERR_SIG 59 /* signal received */
>
>One suggestion is to compile the code with debugging flags (-g) and
>attach a debugger to it. That is not an easy task if you have many
>processes/ranks in your program and your debugger is the default Linux
>gdb, but it is not impossible either.
>Depending on the computer you have, you may have a parallel debugger,
>such as TotalView or DDT, which are more user-friendly.
>
>You could also compile with the -traceback flag (or -fbacktrace; the
>syntax depends on the compiler, so check the compiler man page).
>This will at least tell you the location in the program where the
>segmentation fault happened (in the STDERR file of your job).
>
>I hope this helps.
>Gus Correa
>
>PS - The zip attachment with your "myjob.sh" script was removed from the
>email. Many email servers remove zip attachments for safety.
>Files with a ".sh" suffix are also removed in general.
>You could compress the file with gzip or bzip2 instead.
>
>On 11/15/2016 02:40 PM, Beheshti, Mohammadali wrote:
>> Hi,
>>
>> I am running simulations in a software package which uses Open MPI to
>> solve an FEM problem. From time to time I receive the error
>> "MPI_ABORT was invoked on rank 0 in communicator compute with errorcode 59"
>> in the output file, and for larger simulations (with a larger FEM mesh)
>> I almost always get this error. I don't have any idea what the cause of
>> this error is. The error file contains a PETSc error: "caught signal
>> number 11 SEGV". I am running my jobs on an HPC system which has Open
>> MPI version 2.0.0. I am also using a bash file (myjob.sh), which is
>> attached. The outputs of the "ompi_info --all" and "ifconfig" commands
>> are also attached. I appreciate any help in this regard.
>>
>> Thanks
>>
>> Ali
>>
>> **************************
>> Mohammadali Beheshti
>> Post-Doctoral Fellow
>> Department of Medicine (Cardiology)
>> Toronto General Research Institute
>> University Health Network
>> Tel: 416-340-4800 ext. 6837
>> **************************
>>
>> This e-mail may contain confidential and/or privileged information for
>> the sole use of the intended recipient.
>> Any review or distribution by anyone other than the person for whom it
>> was originally intended is strictly prohibited.
>> If you have received this e-mail in error, please contact the sender
>> and delete all copies.
>> Opinions, conclusions or other information contained in this e-mail may
>> not be that of the organization.
>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
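As a footnote to Gus's advice above (compile with -g and attach a debugger), here is a plain-gdb sketch for a small rank count. This assumes an X11-forwarded interactive session; the program and argument names are taken from the job script in this thread:

```shell
# Rebuild the code with debug info and no optimization (e.g. -g -O0),
# then start one xterm+gdb pair per rank. Type "run" in each window;
# the rank that segfaults stops in gdb, where "bt" prints a backtrace.
mpirun -np 8 xterm -e gdb --args \
    ./carp.debug.petsc.pt +F parameters_ECG_adjust.par

# Without X11, let the crashing rank dump a core file instead and
# inspect it afterwards ("bt" again shows where the SIGSEGV happened).
ulimit -c unlimited
mpirun -np 8 ./carp.debug.petsc.pt +F parameters_ECG_adjust.par
gdb ./carp.debug.petsc.pt core   # core file name may include the PID
```

The xterm approach only scales to a handful of ranks, which is why TotalView or DDT are recommended for larger jobs.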