Re: [OMPI users] rdmacm and udcm failure in 2.0.1 on RoCE

2016-12-14 Thread Nathan Hjelm
Can you configure with --enable-debug and run with --mca btl_base_verbose 100 and 
provide the output? It may indicate why neither udcm nor rdmacm is available.
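For reference, the rebuild and run Nathan describes would look something like this (the install prefix and the test binary name are placeholders, not from the original report):

```shell
# Rebuild Open MPI with debugging support (install prefix is an example)
./configure --enable-debug --prefix=/opt/openmpi/2.0.1-debug
make -j 8 && make install

# Re-run the job with verbose BTL output and capture it for the list
mpirun --mca btl_base_verbose 100 -np 2 ./ring_c 2>&1 | tee btl_verbose.log
```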

-Nathan


> On Dec 14, 2016, at 2:47 PM, Dave Turner  wrote:
> 
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:   elf22
>   Local device: mlx4_2
>   Local port:   1
>   CPCs attempted:   rdmacm, udcm
> --------------------------------------------------------------------------
> 
> We have had no problems using 1.10.4 on RoCE but 2.0.1 fails to
> find either connection manager.  I've read that rdmacm may have
> issues under 2.0.1 so udcm may be the only one working.  Are there
> any known issues with that on RoCE?  Or does this just mean we
> don't have RoCE configured correctly?
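One quick sanity check for the RoCE side, using the standard OFED/libibverbs tools and the device/port named in the warning above, is to confirm the port is active and running over Ethernet, and that the rdma_cm module is loaded:

```shell
# Show the state and link layer for the device named in the warning;
# for RoCE the link_layer should read "Ethernet" and the state PORT_ACTIVE
ibv_devinfo -d mlx4_2 -i 1

# rdmacm additionally needs an IP address on the RoCE interface and the
# rdma_cm kernel module loaded
lsmod | grep rdma_cm
```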
> 
>   Dave Turner
> 
> -- 
> Work: davetur...@ksu.edu (785) 532-7791
>  2219 Engineering Hall, Manhattan KS  66506
> Home:drdavetur...@gmail.com
>   cell: (785) 770-5929
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-12-14 Thread Paul Kapinos

Hello all,
we seem to run into the same issue: 'mpif90' segfaults immediately for Open MPI 
1.10.4 compiled using Intel compilers 16.0.4.258 and 16.0.3.210, while it works 
fine when compiled with 16.0.2.181.


It seems to be a compiler issue (more exactly: a library issue in the libs delivered 
with the 16.0.4.258 and 16.0.3.210 versions). Changing the loaded compiler version 
back to 16.0.2.181 (=> a change of the dynamically loaded libs) lets the 
previously-failing binary (compiled with the newer compilers) work properly.


Compiling with -O0 does not help. As the issue is likely in the Intel libs (as 
said, swapping these out solves/raises the issue), we will fall back to the 
16.0.2.181 compiler version. We will try to open a case with Intel - let's see...
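A minimal way to reproduce the swap Paul describes (the module names here are assumptions about his site's environment-module setup):

```shell
# See which Intel runtime libraries the failing mpif90 currently resolves to
ldd $(which mpif90) | grep -i intel

# Swap only the dynamically loaded runtime back to the known-good version;
# the binary itself stays the one built with the newer compiler
module switch intel/16.0.4.258 intel/16.0.2.181
mpif90 -v
```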


Have a nice day,

Paul Kapinos



On 05/06/16 14:10, Jeff Squyres (jsquyres) wrote:

Ok, good.

I asked that question because typically when we see errors like this, it is 
usually either a busted compiler installation or inadvertently mixing the 
run-times of multiple different compilers in some kind of incompatible way.  
Specifically, the mpifort (aka mpif90) application is a fairly simple program 
-- there's no reason it should segv, especially with a stack trace like the one 
you sent, which implies that it's dying early in startup, potentially even before 
it has hit any Open MPI code (i.e., it could even be pre-main).

BTW, you might be able to get a more complete stack trace from the debugger 
that comes with the Intel compiler (idb?  I don't remember offhand).

Since you are able to run simple programs compiled by this compiler, it sounds 
like the compiler is working fine.  Good!

The next thing to check is to see if somehow the compiler and/or run-time 
environments are getting mixed up.  E.g., the apps were compiled for one 
compiler/run-time but are being used with another.  Also ensure that any 
compiler/linker flags that you are passing to Open MPI's configure script are 
native and correct for the platform for which you're compiling (e.g., don't 
pass in flags that optimize for a different platform; that may result in 
generating machine code instructions that are invalid for your platform).
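A few commands that often expose this kind of compiler/run-time mix-up (output fields are from standard Open MPI tooling; paths are illustrative):

```shell
# Which compiler wrappers are first in PATH, and what do they link against?
which mpifort
ldd $(which mpifort)

# Compare against the compiler the MPI library itself was built with
ompi_info | grep -i 'compiler'

# Look for stray runtime paths from a different compiler installation
echo $LD_LIBRARY_PATH | tr ':' '\n'
```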

Try recompiling/re-installing Open MPI from scratch, and if it still doesn't 
work, then send all the information listed here:

https://www.open-mpi.org/community/help/



On May 6, 2016, at 3:45 AM, Giacomo Rossi  wrote:

Yes, I've tried three simple "Hello world" programs in Fortran, C and C++, and 
they compile and run with Intel 16.0.3. The problem is with the Open MPI compiled 
from source.

Giacomo Rossi Ph.D., Space Engineer

Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza" 
University of Rome
p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com

Member of Fortran-FOSS-programmers


2016-05-05 11:15 GMT+02:00 Giacomo Rossi :
 gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no 
debugging symbols found)...done.
(gdb) r -v
Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v

Program received signal SIGSEGV, Segmentation fault.
0x76858f38 in ?? ()
(gdb) bt
#0  0x76858f38 in ?? ()
#1  0x77de5828 in _dl_relocate_object () from 
/lib64/ld-linux-x86-64.so.2
#2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
#3  0x77df029c in _dl_sysdep_start () from /lib64/ld-linux-x86-64.so.2
#4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
#5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
#6  0x0002 in ?? ()
#7  0x7fffaa8a in ?? ()
#8  0x7fffaab6 in ?? ()
#9  0x in ?? ()
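Since the crash happens inside `_dl_relocate_object`, glibc's built-in loader tracing may show which shared object is being relocated when it dies, without needing a debugger (LD_DEBUG is a standard glibc ld.so feature):

```shell
# "reloc,libs" traces both library search/load order and per-object
# relocation processing; the last lines before the crash name the culprit
LD_DEBUG=reloc,libs /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v 2>&1 | tail -40
```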

Giacomo Rossi Ph.D., Space Engineer

Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza" 
University of Rome
p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com

Member of Fortran-FOSS-programmers


2016-05-05 10:44 GMT+02:00 Giacomo Rossi :
Here is the result of the ldd command:
ldd /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
linux-vdso.so.1 (0x7ffcacbbe000)
libopen-pal.so.13 => 
/opt/openmpi/1.10.2/intel/16.0.3/lib/libopen-pal.so.13