Re: [OMPI users] Problem running with UCX/oshmem on single node?

2018-05-14 Thread Michael Di Domenico
On Wed, May 9, 2018 at 9:45 PM, Howard Pritchard wrote:
>
> You either need to go and buy a ConnectX-4/5 HCA from Mellanox (and maybe a
> switch) and install that on your system, or else install xpmem
> (https://github.com/hjelmn/xpmem). Note there is a bug right now in UCX
> that you may hit if you try to go the xpmem-only route:

How stringent is the ConnectX-4/5 requirement?  I have ConnectX-3
cards; will they work?  During the configure step it seems to yell at
me that mlx5 won't compile because I don't have Mellanox OFED v3.1
installed. Is that also a requirement? (I'm using the RHEL 7.4 bundled
version of OFED, not the vendor version.)


Re: [OMPI users] Problem running with UCX/oshmem on single node?

2018-05-09 Thread Howard Pritchard
Hi Craig,

You are experiencing problems because you don't have a transport installed
that UCX can use for oshmem.

You either need to go and buy a ConnectX-4/5 HCA from Mellanox (and maybe a
switch) and install that on your system, or else install xpmem
(https://github.com/hjelmn/xpmem). Note there is a bug right now in UCX
that you may hit if you try to go the xpmem-only route:

https://github.com/open-mpi/ompi/issues/5083
and
https://github.com/openucx/ucx/issues/2588
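
To see which transports a given UCX build actually offers on a node, one option is the `ucx_info` tool that ships with UCX. The following is a minimal sketch, assuming `ucx_info` is on the PATH and that its device listing prints lines of the form `#   Transport: <name>`; the helper name `list_ucx_transports` is made up for illustration, and the output format may differ between UCX versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): reduce `ucx_info -d` style output,
# read from stdin, to a sorted, de-duplicated list of transport names.
# Assumption: transport lines look like "#   Transport: posix".
list_ucx_transports() {
    sed -n 's/^#*[[:space:]]*Transport:[[:space:]]*//p' | sort -u
}

# Only query UCX if the tool is installed; otherwise say so and move on.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | list_ucx_transports
else
    echo "ucx_info not found; is the UCX bin directory on PATH?" >&2
fi
```

If the list contains only shared-memory, tcp and self entries, that would be consistent with the situation described here; after installing xpmem and rebuilding UCX against it, one would expect an xpmem entry to appear as well.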

If you are just running on a single node and want to experiment with the
OpenSHMEM programming model, and do not have Mellanox mlx5 equipment
installed on the node, you are much better off trying to use SOS over
OFI libfabric:

https://github.com/Sandia-OpenSHMEM/SOS
https://github.com/ofiwg/libfabric/releases

For SOS you will need to install the hydra launcher as well:

http://www.mpich.org/downloads/

I really wish Google would do a better job at hitting my responses about
this type of problem.  I seem to respond every couple of months to this
exact problem on this mailing list.


Howard


2018-05-09 13:11 GMT-06:00 Craig Reese:

>
> I'm trying to play with oshmem on a single node (just to have a way to do
> some simple
> experimentation and playing around) and having spectacular problems:
>
> CentOS 6.9 (gcc 4.4.7)
> built and installed ucx 1.3.0
> built and installed openmpi-3.1.0
>
> [cfreese]$ cat oshmem.c
>
> #include <shmem.h>
> int
> main() {
> shmem_init();
> }
>
> [cfreese]$ mpicc oshmem.c -loshmem
>
> [cfreese]$ shmemrun -np 2 ./a.out
>
> [ucs1l:30118] mca: base: components_register: registering framework spml
> components
> [ucs1l:30118] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: registering framework spml
> components
> [ucs1l:30119] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30118] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30119] mca: base: components_open: opening spml components
> [ucs1l:30119] mca: base: components_open: found loaded component ucx
> [ucs1l:30118] mca: base: components_open: opening spml components
> [ucs1l:30118] mca: base: components_open: found loaded component ucx
> [ucs1l:30119] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30118] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized 
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED 
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized 
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED 
>
> here's where I think the real issue is
>
> [1525891910.424102] [ucs1l:30119:0] select.c:316  UCX  ERROR no
> remote registered memory access transport to : mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
> [1525891910.424104] [ucs1l:30118:0] select.c:316  UCX  ERROR no
> remote registered memory access transport to : mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
>
> [ucs1l:30119] Error 

[OMPI users] Problem running with UCX/oshmem on single node?

2018-05-09 Thread Craig Reese


I'm trying to play with oshmem on a single node (just to have a way to
do some simple experimentation and playing around) and having
spectacular problems:

CentOS 6.9 (gcc 4.4.7)
built and installed ucx 1.3.0
built and installed openmpi-3.1.0

   [cfreese]$ cat oshmem.c

   #include <shmem.h>
   int
   main() {
    shmem_init();
   }

   [cfreese]$ mpicc oshmem.c -loshmem

   [cfreese]$ shmemrun -np 2 ./a.out

   [ucs1l:30118] mca: base: components_register: registering framework
   spml components
   [ucs1l:30118] mca: base: components_register: found loaded component ucx
   [ucs1l:30119] mca: base: components_register: registering framework
   spml components
   [ucs1l:30119] mca: base: components_register: found loaded component ucx
   [ucs1l:30119] mca: base: components_register: component ucx register
   function successful
   [ucs1l:30118] mca: base: components_register: component ucx register
   function successful
   [ucs1l:30119] mca: base: components_open: opening spml components
   [ucs1l:30119] mca: base: components_open: found loaded component ucx
   [ucs1l:30118] mca: base: components_open: opening spml components
   [ucs1l:30118] mca: base: components_open: found loaded component ucx
   [ucs1l:30119] mca: base: components_open: component ucx open
   function successful
   [ucs1l:30118] mca: base: components_open: component ucx open
   function successful
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
   mca_spml_base_select() select: initializing spml component ucx
   [ucs1l:30119]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 -
   mca_spml_ucx_component_init() in ucx, my priority is 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
   mca_spml_base_select() select: initializing spml component ucx
   [ucs1l:30118]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 -
   mca_spml_ucx_component_init() in ucx, my priority is 21
   [ucs1l:30118]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 -
   mca_spml_ucx_component_init() *** ucx initialized 
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
   mca_spml_base_select() select: init returned priority 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
   mca_spml_base_select() selected ucx best priority 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
   mca_spml_base_select() select: component ucx selected
   [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
   mca_spml_ucx_enable() *** ucx ENABLED 
   [ucs1l:30119]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 -
   mca_spml_ucx_component_init() *** ucx initialized 
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
   mca_spml_base_select() select: init returned priority 21
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
   mca_spml_base_select() selected ucx best priority 21
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
   mca_spml_base_select() select: component ucx selected
   [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
   mca_spml_ucx_enable() *** ucx ENABLED 

here's where I think the real issue is

   [1525891910.424102] [ucs1l:30119:0] select.c:316  UCX  ERROR no
   remote registered memory access transport to :
   mm/posix - Destination is unreachable, mm/sysv - Destination is
   unreachable, tcp/eth0 - no put short, self/self - Destination is
   unreachable
   [1525891910.424104] [ucs1l:30118:0] select.c:316  UCX ERROR
   no remote registered memory access transport to :
   mm/posix - Destination is unreachable, mm/sysv - Destination is
   unreachable, tcp/eth0 - no put short, self/self - Destination is
   unreachable

   [ucs1l:30119] Error
   ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
   mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is
   unreachable
   [ucs1l:30118] Error
   ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
   mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is
   unreachable
   *** glibc detected *** ./a.out: double free or corruption (!prev):
   0x00bb0f10 ***
   *** glibc detected *** ./a.out: double free or corruption (!prev):
   0x00f98ef0 ***
   === Backtrace: =
   === Backtrace: =
   /lib64/libc.so.6[0x338d875dee]
   /lib64/libc.so.6[0x338d875dee]
   /lib64/libc.so.6[0x338d878c80]
   /lib64/libc.so.6[0x338d878c80]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7fea58e4637c]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7f1dc261437c]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7fea58e07833]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7f1dc25d5833]