Hi Gus,

Thank you very much for your prompt response. The myjob.sh script is as follows:

#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1:ppn=8
#PBS -l walltime=120:00:00
#PBS -l pvmem=2000MB
module load openmpi/2.0.0
cd /cluster/home/t48263uhn/Carp/PlosOneData/
mpirun -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par

I am a user of a cardiac modeling package named "CARP".  I tried to attach a 
parallel debugger to my job as you suggested. First I tried TotalView by adding 
the -tv option to the mpirun command:

mpirun -tv -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par

but in the output file I get the following error:

"This version of Open MPI is known to have a problem using the "--debug"
option to mpirun, and has therefore disabled it. This functionality will
be restored in a future version of Open MPI.

Please see https://github.com/open-mpi/ompi/issues/1225 for details."

Then I tried DDT by passing the --debug option to mpirun, which gives me the 
same error:
"This version of Open MPI is known to have a problem using the "--debug"
option to mpirun, and has therefore disabled it. This functionality will
be restored in a future version of Open MPI.

Please see https://github.com/open-mpi/ompi/issues/1225 for details."

I believe there is an older version of Open MPI on the system, but the system 
admin asked me not to use it.

I may try that and report the results. I have also attached the missing files 
in gzip format.

Thanks,


Ali
________________________________________
From: users [users-boun...@lists.open-mpi.org] on behalf of Gus Correa 
[g...@ldeo.columbia.edu]
Sent: Tuesday, November 15, 2016 5:42 PM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator 
compute with errorcode 59

Hi Mohammadali

"Signal number 11 SEGV" is the Unix/Linux signal for a memory
violation (a.k.a. segmentation violation or segmentation fault).
This normally happens when the program tries to read
or write a memory area that it did not allocate, has already
freed, or that belongs to another process.
That is most likely a programming error in the FEM code,
probably not an MPI error, probably not a PETSC error either.
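
As a small illustration of the signal described above, the following sketch
shows how a shell reports a process killed by signal 11. It does not touch
CARP, MPI or PETSC; the child process is a throwaway bash instance that sends
SIGSEGV to itself. The variable names are only for illustration.

```shell
#!/bin/bash
# Signal 11 is SIGSEGV on Linux; the shell can translate the number to its name.
sig_name=$(kill -l 11)
echo "signal 11 = SIG${sig_name}"

# A process killed by signal N is reported by the shell with exit status 128+N.
# Here a child shell sends SIGSEGV to itself (core dumps disabled to keep it tidy).
bash -c 'ulimit -c 0; kill -SEGV $$' 2>/dev/null
status=$?
echo "exit status = ${status}"
```

An exit status of 139 (128 + 11) from a job step is therefore another hint
that the process died from a segmentation fault.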

The "errorcode 59" seems to be the PETSC error code
issued when it receives a signal (in this case a
segmentation fault signal, I guess) from the operating
system (Linux, probably).
Apparently it simply throws the error message and
calls MPI_Abort, and the program stops.
This is what the petscerror.h include file has about error code 59:

#define PETSC_ERR_SIG              59   /* signal received */
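
A numeric code like this can be looked up in the header with grep. The sketch
below is only an illustration: petsc_err_lookup is a hypothetical helper name,
and the location of petscerror.h depends on the PETSC installation. The demo
runs against a scratch file containing the line quoted above, so it works
without PETSC installed.

```shell
#!/bin/bash
# Hypothetical helper: print the #define line for a numeric PETSc error code.
# $1 = error code, $2 = path to petscerror.h (location varies by installation).
petsc_err_lookup() {
    grep -E "^#define[[:space:]]+PETSC_ERR_[A-Z0-9_]+[[:space:]]+$1([[:space:]]|\$)" "$2"
}

# Demo on a scratch header holding the line quoted in this message.
demo_header=$(mktemp)
printf '%s\n' '#define PETSC_ERR_SIG              59   /* signal received */' > "$demo_header"
result=$(petsc_err_lookup 59 "$demo_header")
echo "$result"
rm -f "$demo_header"
```

On a real system the second argument would be the installed header, for
example something like "$PETSC_DIR/include/petscerror.h", assuming PETSC_DIR
is set in your environment.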

**

One suggestion is to compile the code with debugging flags (-g),
and attach a debugger to it. This is not an easy task if you have many
processes/ranks in your program and your debugger is the default
Linux gdb, but it is not impossible either.
Depending on the computer you have, you may have a parallel debugger,
such as TotalView or DDT, which is more user friendly.
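
For the plain gdb route, one commonly used approach is to let mpirun start gdb
in batch mode for every rank, so that each rank runs the program and prints a
backtrace if it crashes. The sketch below only builds and prints such a
command line; build_gdb_cmd is a hypothetical helper, and the program name and
arguments are placeholders taken from this thread. Whether this works for a
given job depends on gdb being available on the compute nodes.

```shell
#!/bin/bash
# Hypothetical helper: build an mpirun command that wraps each rank in gdb.
# gdb options used: -batch (exit when done), -ex CMD (run a gdb command),
# --args (everything after it is the program and its arguments).
build_gdb_cmd() {
    local nranks=$1; shift
    # "$@" is the program followed by its own arguments
    echo mpirun -np "$nranks" gdb -batch -ex run -ex bt --args "$@"
}

cmd=$(build_gdb_cmd 8 carp.debug.petsc.pt +F parameters_ECG_adjust.par)
echo "$cmd"
```

The printed line can be pasted into the job script in place of the original
mpirun line; the backtraces from all ranks then end up interleaved in the
job's output files.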

You could also compile it with the flag -traceback
(or -fbacktrace, the syntax depends on the compiler, check the compiler
man page).
This at least will tell you the location in the program where the
segmentation fault happened (in the STDERR file of your job).
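
As a rough sketch of the flag choice above: -fbacktrace is the gfortran
spelling and -traceback is the Intel compiler spelling. debug_flags_for is a
hypothetical helper that only maps a compiler name to flags. Which compiler
was used to build your code is an assumption you would need to check with
whoever built it.

```shell
#!/bin/bash
# Hypothetical helper: choose debug/backtrace flags from the compiler name.
# -g is common to both; -fbacktrace is gfortran's option, -traceback is Intel's.
debug_flags_for() {
    case "$1" in
        gfortran*) echo "-g -fbacktrace" ;;
        ifort*|icc*|icpc*) echo "-g -traceback" ;;
        *) echo "-g" ;;
    esac
}

flags=$(debug_flags_for gfortran)
echo "gfortran: $flags"
echo "ifort:    $(debug_flags_for ifort)"
```

The resulting flags would go into the compile line or the FFLAGS/CFLAGS
setting of the build, depending on how the build system is organized.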

I hope this helps.
Gus Correa

PS - The zip attachment with your "myjob.sh" script
was removed from the email.
Many email server programs remove zip for safety.
Files with ".sh" suffix are also removed in general.
You could compress it with gzip or bzip2 instead.

On 11/15/2016 02:40 PM, Beheshti, Mohammadali wrote:
> Hi,
>
>
>
> I am running simulations in software which uses ompi to solve an FEM
> problem.  From time to time I receive the error
>
> "MPI_ABORT was invoked on rank 0 in communicator compute with errorcode
> 59" in the output file, while for the larger simulations (with a larger FEM
> mesh) I almost always get this error. I have no idea what causes
> this error. The error file contains a PETSC error: "caught
> signal number 11 SEGV". I am running my jobs on an HPC system which has
> Open MPI version 2.0.0.  I am also using a bash file (myjob.sh) which is
> attached. The outputs of the ompi_info --all and ifconfig commands
> are also attached. I appreciate any help in this regard.
>
>
>
> Thanks
>
>
>
> Ali
>
>
>
>
>
> **************************
>
> Mohammadali Beheshti
>
> Post-Doctoral Fellow
>
> Department of Medicine (Cardiology)
>
> Toronto General Research Institute
>
> University Health Network
>
> Tel: 416-340-4800 ext. 6837
>
>
>
> **************************
>
>
>
>
> This e-mail may contain confidential and/or privileged information for
> the sole use of the intended recipient.
> Any review or distribution by anyone other than the person for whom it
> was originally intended is strictly prohibited.
> If you have received this e-mail in error, please contact the sender and
> delete all copies.
> Opinions, conclusions or other information contained in this e-mail may
> not be that of the organization.
>
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>

Attachment: openmpi.tar.gz
Description: openmpi.tar.gz