Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator compute with errorcode 59

2016-11-16 Thread Gilles Gouaillardet
Hi,

With ddt, you can do offline debugging just to find out where the program crashes:
ddt -n 8 --offline a.out ...
You might also want to try the reverse connect feature.
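
For example, a rough sketch of both modes, reusing the job's own launch line (flag names as in ARM/Allinea Forge DDT; check the local ddt man page):

# offline: run unattended and get a report showing where the crash occurred
ddt -n 8 --offline carp.debug.petsc.pt +F parameters_ECG_adjust.par
# reverse connect: launch from the job script, then accept the connection
# in a DDT GUI already running on the login node
ddt --connect mpirun -np 8 carp.debug.petsc.pt +F parameters_ECG_adjust.par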

Cheers,

Gilles

"Beheshti, Mohammadali"  wrote:
>Hi Gus,
>
>Thank you very much for your prompt response. The myjob.sh script is as 
>follows:
>
>#!/bin/bash
>#PBS -N myjob
>#PBS -l nodes=1:ppn=8
>#PBS -l walltime=120:00:00
>#PBS -l pvmem=2000MB
>module load openmpi/2.0.0
>cd /cluster/home/t48263uhn/Carp/PlosOneData/
>mpirun -np 8 carp.debug.petsc.pt +F 
>/cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par 
>
>I am user of a cardiac modeling software named "CARP".  I tried to attach a 
>parallel debugger to my job as you suggested. First I tried TotalView by 
>adding -tv option to mpirun command:
>
>mpirun -tv -np 8 carp.debug.petsc.pt +F 
>/cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par 
>
>but in output file I get following error:
>
>"This version of Open MPI is known to have a problem using the "--debug"
>option to mpirun, and has therefore disabled it. This functionality will
>be restored in a future version of Open MPI.
>
>Please see https://github.com/open-mpi/ompi/issues/1225 for details."
>
>Then I tried DDT by using --debug option after mpirun which gives me a similar 
>error:
>"This version of Open MPI is known to have a problem using the "--debug"
>option to mpirun, and has therefore disabled it. This functionality will
>be restored in a future version of Open MPI.
>
>Please see https://github.com/open-mpi/ompi/issues/1225 for details."
>
>I believe there is an older version Open MPI on the system, but the system 
>admin asked me not to use it.
>
>I may try that and report the results. I have also attached the missing files 
>in gzip format.
>
>Thanks,
>
>
>Ali
>
>From: users [users-boun...@lists.open-mpi.org] on behalf of Gus Correa 
>[g...@ldeo.columbia.edu]
>Sent: Tuesday, November 15, 2016 5:42 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator 
>compute with errorcode 59
>
>Hi Mohammadali
>
>"Signal number 11 SEGV", is the Unix/Linux signal for a memory
>violation (a.k.a. segmentation violation or segmentation fault).
>This normally happens when the program tries to read
>or write in a memory area that it did not allocate, already
>freed, or belongs to another process.
>That is most likely a programming error on the FEM code,
>probably not an MPI error, probably not a PETSC error either.
>
>The "errorcode 59" seems to be the PETSC error message
>issued when it receives a signal (in this case a
>segmentation fault signal, I guess) from the operational
>system (Linux, probably).
>Apparently it simply throws the error message and
>calls MPI_Abort, and the program stops.
>This is what petscerror.h include file has about error code 59:
>
>#define PETSC_ERR_SIG  59   /* signal received */
>
>**
>
>One suggestion is to compile the code with debugging flags (-g),
>and attach a debugger to it. Not an easy task if you have many
>processes/ranks in your program, if your debugger is the default
>Linux gdb, but it is not impossible to do either.
>Depending on the computer you have, you may have a parallel debugger,
>such as TotalView or DDT, which are more user friendly.
>
>You could also compile it with the flag -traceback
>(or -fbacktrace, the syntax depends on the compiler, check the compiler
>man page).
>This at least will tell you the location in the program where the
>segmentation fault happened (in the STDERR file of your job).
>
>I hope this helps.
>Gus Correa
>
>PS - The zip attachment with your "myjob.sh" script
>was removed from the email.
>Many email server programs remove zip for safety.
>Files with ".sh" suffix are also removed in general.
>You could compress it with gzip or bzip2 instead.
>
>On 11/15/2016 02:40 PM, Beheshti, Mohammadali wrote:
>> Hi,
>>
>>
>>
>> I am running simulations in a software which uses ompi to solve an FEM
>> problem.  From time to time I receive the error “
>>
>> MPI_ABORT was invoked on rank 0 in communicator compute with errorcode
>> 59” in the output file while for the larger simulations (with larger FEM
>> mesh) I almost always get this error. I don’t have any idea what is the
>> cause of this error. The error file contains a PETSC error: ”caught
>> signal number 11 SEGV”. I am running my jobs on a HPC system which has
>> Open MPI version 2.0.0.  I am also using a bash file (myjob.sh) which is
>> attached. The ompi_info - - all  command and ifconfig command outputs
>> are also attached. I appreciate any help in this regard.
>>
>>
>>
>> Thanks
>>
>>
>>
>> Ali
>>
>>
>>
>>
>>
>> **
>>
>> Mohammadali Beheshti
>>
>> Post-Doctoral Fellow
>>
>> Department of Medicine (Cardiology)
>>
>> Toronto General Research Institute
>>
>> University Health Network
>>
>> Tel: 416-340-4800  ext. 6837
>>
>>
>>
>> 

Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator compute with errorcode 59

2016-11-16 Thread Beheshti, Mohammadali
Hi Gus,

Thank you very much for your prompt response. The myjob.sh script is as follows:

#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1:ppn=8
#PBS -l walltime=120:00:00
#PBS -l pvmem=2000MB
module load openmpi/2.0.0
cd /cluster/home/t48263uhn/Carp/PlosOneData/
mpirun -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par

I am a user of a cardiac modeling software named "CARP". I tried to attach a 
parallel debugger to my job as you suggested. First I tried TotalView by adding 
the -tv option to the mpirun command:

mpirun -tv -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par

but in the output file I get the following error:

"This version of Open MPI is known to have a problem using the "--debug"
option to mpirun, and has therefore disabled it. This functionality will
be restored in a future version of Open MPI.

Please see https://github.com/open-mpi/ompi/issues/1225 for details."

Then I tried DDT by using the --debug option after mpirun, which gives me a similar 
error:
"This version of Open MPI is known to have a problem using the "--debug"
option to mpirun, and has therefore disabled it. This functionality will
be restored in a future version of Open MPI.

Please see https://github.com/open-mpi/ompi/issues/1225 for details."

I believe there is an older version of Open MPI on the system, but the system 
admin asked me not to use it.

I may try that older version and report the results. I have also attached the missing files 
in gzip format.

Thanks,


Ali

From: users [users-boun...@lists.open-mpi.org] on behalf of Gus Correa 
[g...@ldeo.columbia.edu]
Sent: Tuesday, November 15, 2016 5:42 PM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator 
compute with errorcode 59

Hi Mohammadali

"Signal number 11 SEGV", is the Unix/Linux signal for a memory
violation (a.k.a. segmentation violation or segmentation fault).
This normally happens when the program tries to read
or write in a memory area that it did not allocate, already
freed, or belongs to another process.
That is most likely a programming error on the FEM code,
probably not an MPI error, probably not a PETSC error either.

The "errorcode 59" seems to be the PETSC error message
issued when it receives a signal (in this case a
segmentation fault signal, I guess) from the operational
system (Linux, probably).
Apparently it simply throws the error message and
calls MPI_Abort, and the program stops.
This is what petscerror.h include file has about error code 59:

#define PETSC_ERR_SIG  59   /* signal received */
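
Since the abort comes through PETSc's error handler, PETSc's own runtime debug options may also help; a sketch, assuming a stock PETSc build and that the application forwards its command line to PetscInitialize (option names from the PETSc users manual):

# attach a debugger only when an error/signal is caught, or from the very start
mpirun -np 8 carp.debug.petsc.pt +F parameters_ECG_adjust.par -on_error_attach_debugger
mpirun -np 8 carp.debug.petsc.pt +F parameters_ECG_adjust.par -start_in_debugger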

**

One suggestion is to compile the code with debugging flags (-g)
and attach a debugger to it. That is not an easy task if you have many
processes/ranks in your program and your debugger is the default
Linux gdb, but it is not impossible either.
Depending on the computer you have, you may have a parallel debugger,
such as TotalView or DDT, which is more user friendly.
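
For a run as small as -np 8, one common workaround with plain gdb is to give each rank its own terminal, or to attach to a rank that is already running; a minimal sketch (it assumes X forwarding and xterm are available; paths are illustrative):

# one gdb session per rank, each in its own xterm window
mpirun -np 8 xterm -e gdb --args carp.debug.petsc.pt +F parameters_ECG_adjust.par
# or attach to one rank that is already running, by process id
gdb -p <pid_of_one_carp_process>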

You could also compile it with the flag -traceback
(or -fbacktrace; the syntax depends on the compiler, so check the
compiler man page).
This will at least tell you the location in the program where the
segmentation fault happened (in the STDERR file of your job).
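
For example (illustrative compile lines; the source file name is hypothetical, only the flags matter):

gfortran -g -fbacktrace -O0 -c solver.f90   # GNU Fortran
ifort    -g -traceback  -O0 -c solver.f90   # Intel Fortran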

I hope this helps.
Gus Correa

PS - The zip attachment with your "myjob.sh" script
was removed from the email.
Many email servers remove zip attachments for safety.
Files with the ".sh" suffix are also generally removed.
You could compress it with gzip or bzip2 instead.

On 11/15/2016 02:40 PM, Beheshti, Mohammadali wrote:
> Hi,
>
>
>
> I am running simulations in software which uses Open MPI to solve an FEM
> problem. From time to time I receive the error "MPI_ABORT was invoked on
> rank 0 in communicator compute with errorcode 59" in the output file, and
> for the larger simulations (with a larger FEM mesh) I almost always get
> this error. I have no idea what the cause of this error is. The error file
> contains a PETSc error: "caught signal number 11 SEGV". I am running my
> jobs on an HPC system which has Open MPI version 2.0.0. I am also using a
> bash file (myjob.sh), which is attached. The outputs of the ompi_info --all
> command and the ifconfig command are also attached. I appreciate any help
> in this regard.
>
>
>
> Thanks
>
>
>
> Ali
>
>
>
>
>
> **
>
> Mohammadali Beheshti
>
> Post-Doctoral Fellow
>
> Department of Medicine (Cardiology)
>
> Toronto General Research Institute
>
> University Health Network
>
> Tel: 416-340-4800  ext. 6837
>
>
>
> **
>

Re: [OMPI users] symbol lookup error for a "hello world" fortran script

2016-11-16 Thread Julien de Troullioud de Lanversin
Gilles,


I went for the radical solution and completely removed Intel Parallel Studio
from my system (I used it for ifort but I prefer to use gfortran now).

The script works now.
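
For anyone hitting the same mix-up, a quick way to confirm that both the wrapper and the runtime now resolve to Open MPI (--showme is the standard Open MPI wrapper option):

mpifort --showme            # prints the underlying compiler and the Open MPI link line
mpifort -o test test.f90
ldd ./test | grep -i mpi    # the libmpi* entries should now all come from the Open MPI install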

Thanks a lot for your help Gilles.


Julien

2016-11-16 1:24 GMT-05:00 Gilles Gouaillardet :

> Julien,
>
> The Fortran lib is in /usr/lib/libmpi_mpifh.so.12;
> the C lib is the one from Intel MPI.
> I guess the C lib is in /usr/lib, and not /usr/lib/openmpi/lib.
> Prepending /usr/lib is never recommended, so I suggest you simply remove
> /opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/lib
> from your LD_LIBRARY_PATH.
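
A minimal sketch of that clean-up, assuming bash (the pattern matches the Intel MPI directories in the echo output quoted further down):

# rebuild LD_LIBRARY_PATH without the Intel MPI entries, keeping the rest in order
export LD_LIBRARY_PATH="$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' \
    | grep -v 'compilers_and_libraries_2016.1.150/linux/mpi' | paste -sd: -)"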
>
> Cheers,
>
> Gilles
>
> On Tue, Nov 15, 2016 at 10:33 PM, Julien de Troullioud de Lanversin
>  wrote:
> > Gilles,
> >
> >
> > Thank you for your fast reply.
> >
> > When I type whereis mpifort I have the following: mpifort:
> > /usr/bin/mpifort.openmpi /usr/bin/mpifort /usr/share/man/man1/mpifort.1.gz
> >
> > I made sure I exported the LD_LIBRARY_PATH after I prepended
> > /usr/lib/openmpi/lib. The same error is produced.
> >
> > If I type ldd ./test I get the following:
> >
> > linux-vdso.so.1 =>  (0x7ffde08ef000)
> > libmpi_mpifh.so.12 => /usr/lib/libmpi_mpifh.so.12 (0x2b5a2132c000)
> > libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x2b5a21585000)
> > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x2b5a218b)
> > libmpi.so.12 => /opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/lib/libmpi.so.12 (0x2b5a21c7a000)
> > libopen-pal.so.13 => /usr/lib/libopen-pal.so.13 (0x2b5a2243c000)
> > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x2b5a226d9000)
> > libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x2b5a228f7000)
> > libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x2b5a22b36000)
> > libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x2b5a22e3f000)
> > /lib64/ld-linux-x86-64.so.2 (0x55689175a000)
> > librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x2b5a23056000)
> > libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x2b5a2325e000)
> > libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x2b5a23462000)
> > libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x2b5a23666000)
> > libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x2b5a238a)
> > libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x2b5a23aac000)
> >
> >
> > Thanks.
> >
> >
> >
> > Julien
> >
> >
> >
> > 2016-11-15 23:52 GMT-05:00 Gilles Gouaillardet
> > :
> >>
> >> Julien,
> >>
> >> First, make sure you are using the Open MPI wrapper:
> >> which mpifort
> >> should point to /usr/lib/openmpi/bin if I understand correctly.
> >> Then make sure you exported your LD_LIBRARY_PATH *after* you prepended
> >> the path to the Open MPI lib.
> >> in your .bashrc you can either
> >> LD_LIBRARY_PATH=/usr/lib/openmpi/lib:$LD_LIBRARY_PATH
> >> export LD_LIBRARY_PATH
> >> or directly
> >> export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:$LD_LIBRARY_PATH
> >>
> >> then you can
> >> ldd ./test
> >> and confirm all MPI libs (both C and Fortran) are pointing to Open MPI
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On Tue, Nov 15, 2016 at 9:41 PM, Julien de Troullioud de Lanversin
> >>  wrote:
> >> > Hi all,
> >> >
> >> >
> >> > I am completely new to MPI (and relatively new to linux). I am sorry
> if
> >> > the
> >> > problem I encountered is obvious to solve.
> >> >
> >> > When I run the following simple test with mpirun:
> >> >
> >> > program hello_world
> >> >
> >> >   use mpi
> >> >   integer ierr
> >> >
> >> >   call MPI_INIT ( ierr )
> >> >   print *, "Hello world"
> >> >   call MPI_FINALIZE ( ierr )
> >> >
> >> > end program hello_world
> >> >
> >> >
> >> > I get the following error:
> >> > ./test: symbol lookup error: /usr/lib/libmpi_mpifh.so.12: undefined
> >> > symbol: mpi_fortran_weights_empty
> >> >
> >> > I compiled the source code like this: mpifort -o test test.f90
> >> >
> >> > I looked it up on the internet and I understand that it is a problem with
> >> > the shared library of Open MPI. But I think I correctly added the Open MPI
> >> > lib to LD_LIBRARY_PATH (I added the first directory -- /usr/lib/openmpi/lib
> >> > -- via .bashrc). Here is an echo $LD_LIBRARY_PATH:
> >> >
> >> >
> >> > $/usr/lib/openmpi/lib:/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2016.1.150/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64:/opt/

Re: [OMPI users] Open MPI State of the Union BOF at SC'16 next week

2016-11-16 Thread Jeff Squyres (jsquyres)
Will do!  It will be tonight at the earliest (the BOF is tonight at 5:15pm US 
Mountain time), but tomorrow morning is more likely.



> On Nov 15, 2016, at 12:48 PM, Sean Ahern  wrote:
> 
> Makes sense. Thanks for making the slides available. Would you mind posting to 
> the list when the rest of us can get them?
> 
> -Sean
> 
> --
> Sean Ahern
> Computational Engineering International
> 919-363-0883
> 
> On Tue, Nov 15, 2016 at 10:53 AM, Jeff Squyres (jsquyres) 
>  wrote:
> On Nov 10, 2016, at 9:31 AM, Jeff Squyres (jsquyres)  
> wrote:
> >
> > The slides will definitely be available afterwards.  We'll see if we can 
> > make some flavor of recording available as well.
> 
> After poking around a bit, it looks like the SC rules prohibit us from 
> recording the BOF (which is not unreasonable, actually).
> 
> The slides will definitely be available on the Open MPI web site.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
