Hi,

With DDT, you can do offline (non-interactive) debugging just to find where the program crashes:
ddt -n 8 --offline a.out ...
You might also want to try the reverse-connect feature.
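
A hedged sketch of what an offline run could look like from the batch script (the report file name is illustrative, and I am assuming DDT is on the PATH of the compute nodes):

```shell
# Run all 8 ranks under DDT non-interactively; instead of opening a GUI,
# DDT writes a report showing where each rank was when the program crashed.
ddt --offline -o ddt_report.html -n 8 \
    carp.debug.petsc.pt +F parameters_ECG_adjust.par
```

After the job finishes, open ddt_report.html to see the per-rank stacks at the point of the crash.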

Cheers,

Gilles

"Beheshti, Mohammadali" <mohammadali.behes...@uhn.ca> wrote:
>Hi Gus,
>
>Thank you very much for your prompt response. The myjob.sh script is as 
>follows:
>
>#!/bin/bash
>#PBS -N myjob
>#PBS -l nodes=1:ppn=8
>#PBS -l walltime=120:00:00
>#PBS -l pvmem=2000MB
>module load openmpi/2.0.0
>cd /cluster/home/t48263uhn/Carp/PlosOneData/
>mpirun -np 8 carp.debug.petsc.pt +F 
>/cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par 
>
>I am a user of a cardiac modeling software package named "CARP".  I tried to attach a 
>parallel debugger to my job as you suggested. First I tried TotalView by 
>adding the -tv option to the mpirun command:
>
>mpirun -tv -np 8 carp.debug.petsc.pt +F 
>/cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par 
>
>but in the output file I get the following error:
>
>"This version of Open MPI is known to have a problem using the "--debug"
>option to mpirun, and has therefore disabled it. This functionality will
>be restored in a future version of Open MPI.
>
>Please see https://github.com/open-mpi/ompi/issues/1225 for details."
>
>Then I tried DDT by adding the --debug option after mpirun, which gives the same 
>error:
>"This version of Open MPI is known to have a problem using the "--debug"
>option to mpirun, and has therefore disabled it. This functionality will
>be restored in a future version of Open MPI.
>
>Please see https://github.com/open-mpi/ompi/issues/1225 for details."
>
>I believe there is an older version of Open MPI on the system, but the system 
>admin asked me not to use it.
>
>I may try that and report the results. I have also attached the missing files 
>in gzip format.
>
>Thanks,
>
>
>Ali
>________________________________________
>From: users [users-boun...@lists.open-mpi.org] on behalf of Gus Correa 
>[g...@ldeo.columbia.edu]
>Sent: Tuesday, November 15, 2016 5:42 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator 
>compute with errorcode 59
>
>Hi Mohammadali
>
>"Signal number 11 SEGV" is the Unix/Linux signal for a memory
>violation (a.k.a. segmentation violation or segmentation fault).
>This normally happens when the program tries to read from
>or write to a memory area that it did not allocate, has already
>freed, or that belongs to another process.
>That is most likely a programming error in the FEM code,
>probably not an MPI error, and probably not a PETSc error either.
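>
>As an aside, one common way to see where such a crash happens is to
>let the crashing rank write a core file and inspect it afterwards.
>This is only a sketch: it assumes core dumps are enabled on the
>compute nodes and the binary was built with -g.
>
>```shell
># Allow core files to be written (often disabled by default)
>ulimit -c unlimited
>
># Run the job as usual; a crashing rank should leave a core file behind
>mpirun -np 8 carp.debug.petsc.pt +F parameters_ECG_adjust.par
>
># Open the core file in gdb; the "bt" command then prints the
># stack of the rank at the moment of the SEGV
>gdb carp.debug.petsc.pt core
>```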
>
>The "errorcode 59" seems to be the PETSc error message
>issued when it receives a signal (in this case a
>segmentation fault signal, I guess) from the operating
>system (Linux, probably).
>Apparently it simply prints the error message,
>calls MPI_Abort, and the program stops.
>This is what the petscerror.h include file has for error code 59:
>
>#define PETSC_ERR_SIG              59   /* signal received */
>
>**
>
>One suggestion is to compile the code with debugging flags (-g)
>and attach a debugger to it. That is not an easy task if you have
>many processes/ranks and your debugger is the default
>Linux gdb, but it is not impossible either.
>Depending on the computer you have, you may have a parallel debugger,
>such as TotalView or DDT, which is more user friendly.
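>
>For example, with plain gdb you can attach to a single running rank
>by process id. This is a sketch; the process name is taken from the
>job above, and <pid> is a placeholder you have to fill in yourself:
>
>```shell
># On the compute node, find the process ids of the MPI ranks
>ps -ef | grep carp.debug.petsc.pt
>
># Attach gdb to one of them; type "continue" to let the rank run on,
># and "bt" once it hits the segmentation fault
>gdb -p <pid>
>```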
>
>You could also compile it with the flag -traceback
>(or -fbacktrace; the syntax depends on the compiler, so check the
>compiler man page).
>This at least will tell you the location in the program where the
>segmentation fault happened (in the STDERR file of your job).
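>
>As a sketch of what that might look like (the file names below are
>placeholders; -traceback is the Intel compiler spelling, -fbacktrace
>is the gfortran one, and both should be combined with -g):
>
>```shell
># Intel compilers: -g for symbols, -traceback for a stack trace on crash
>mpicc  -g -traceback  -O0 -c mysource.c
>
># GNU Fortran equivalent (gfortran only; the GNU C/C++ compilers have
># no direct counterpart, so use -g plus a debugger or core file there)
>mpif90 -g -fbacktrace -O0 -c mysource.f90
>```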
>
>I hope this helps.
>Gus Correa
>
>PS - The zip attachment with your "myjob.sh" script
>was removed from the email.
>Many email servers remove zip attachments for safety.
>Files with a ".sh" suffix are also generally removed.
>You could compress it with gzip or bzip2 instead.
>
>On 11/15/2016 02:40 PM, Beheshti, Mohammadali wrote:
>> Hi,
>>
>>
>>
>> I am running simulations in software that uses Open MPI to solve an FEM
>> problem.  From time to time I receive the error
>> “MPI_ABORT was invoked on rank 0 in communicator compute with errorcode
>> 59” in the output file, and for larger simulations (with a larger FEM
>> mesh) I almost always get this error. I have no idea what the
>> cause of this error is. The error file contains a PETSc error: “caught
>> signal number 11 SEGV”. I am running my jobs on an HPC system which has
>> Open MPI version 2.0.0.  I am also using a bash script (myjob.sh), which is
>> attached. The outputs of the ompi_info --all and ifconfig commands
>> are also attached. I appreciate any help in this regard.
>>
>>
>>
>> Thanks
>>
>>
>>
>> Ali
>>
>>
>>
>>
>>
>> **************************
>>
>> Mohammadali Beheshti
>>
>> Post-Doctoral Fellow
>>
>> Department of Medicine (Cardiology)
>>
>> Toronto General Research Institute
>>
>> University Health Network
>>
>> Tel: 416-340-4800 ext. 6837
>>
>>
>>
>> **************************
>>
>>
>>
>>
>> This e-mail may contain confidential and/or privileged information for
>> the sole use of the intended recipient.
>> Any review or distribution by anyone other than the person for whom it
>> was originally intended is strictly prohibited.
>> If you have received this e-mail in error, please contact the sender and
>> delete all copies.
>> Opinions, conclusions or other information contained in this e-mail may
>> not be that of the organization.
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>