Re: [OMPI users] Problem running with UCX/oshmem on single node?

2018-05-09 Thread Howard Pritchard
Hi Craig,

You are experiencing problems because you don't have a transport installed
that UCX can use for oshmem.

You either need to buy a ConnectX-4/5 HCA from Mellanox (and maybe a
switch) and install it in your system, or else install xpmem (https://github.com/hjelmn/xpmem).
Note that there is currently a bug in UCX that you may hit if you try to go the
xpmem-only route:

https://github.com/open-mpi/ompi/issues/5083
and
https://github.com/openucx/ucx/issues/2588

If you are just running on a single node and want to experiment with the
OpenSHMEM programming model, and do not have Mellanox mlx5 equipment
installed on the node, you are much better off trying to use SOS
over OFI libfabric:

https://github.com/Sandia-OpenSHMEM/SOS
https://github.com/ofiwg/libfabric/releases

For SOS you will need to install the hydra launcher as well:

http://www.mpich.org/downloads/
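
Whichever route you take, a minimal OpenSHMEM test program is enough to verify
that the PEs can reach each other. The sketch below (the file name
shmem_hello.c is just an example) uses only standard OpenSHMEM calls:

```
/* shmem_hello.c -- minimal sketch of an OpenSHMEM sanity test */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();                       /* start the OpenSHMEM runtime */
    int me   = shmem_my_pe();           /* this PE's index */
    int npes = shmem_n_pes();           /* total number of PEs */
    printf("Hello from PE %d of %d\n", me, npes);
    shmem_barrier_all();                /* everyone reached this point */
    shmem_finalize();                   /* clean shutdown */
    return 0;
}
```

With Open MPI you would build and launch it the way Craig does below
(mpicc ... -loshmem and shmemrun); SOS ships its own oshcc/oshrun wrappers.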

I really wish Google would do a better job of surfacing my responses about
this type of problem.  I seem to answer this exact question on this mailing
list every couple of months.


Howard


2018-05-09 13:11 GMT-06:00 Craig Reese :

>
> I'm trying to play with oshmem on a single node (just to have a way to do
> some simple
> experimentation and playing around) and having spectacular problems:
>
> CentOS 6.9 (gcc 4.4.7)
> built and installed ucx 1.3.0
> built and installed openmpi-3.1.0
>
> [cfreese]$ cat oshmem.c
>
> #include <shmem.h>
> int
> main() {
> shmem_init();
> }
>
> [cfreese]$ mpicc oshmem.c -loshmem
>
> [cfreese]$ shmemrun -np 2 ./a.out
>
> [ucs1l:30118] mca: base: components_register: registering framework spml
> components
> [ucs1l:30118] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: registering framework spml
> components
> [ucs1l:30119] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30118] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30119] mca: base: components_open: opening spml components
> [ucs1l:30119] mca: base: components_open: found loaded component ucx
> [ucs1l:30118] mca: base: components_open: opening spml components
> [ucs1l:30118] mca: base: components_open: found loaded component ucx
> [ucs1l:30119] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30118] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized 
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED 
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized 
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED 
>
> here's where I think the real issue is
>
> [1525891910.424102] [ucs1l:30119:0] select.c:316  UCX  ERROR no
> remote registered memory access transport to : mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
> [1525891910.424104] [ucs1l:30118:0] select.c:316  UCX  ERROR no
> remote registered memory access transport to : mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
>
> [ucs1l:30119] Error 

Re: [OMPI users] MPI-3 RMA on Cray XC40

2018-05-09 Thread Nathan Hjelm
Thanks for confirming that it works for you as well. I have a PR open on v3.1.x 
that brings osc/rdma up to date with master. I will also be bringing some code 
that greatly improves the multi-threaded RMA performance on Aries systems (at 
least with benchmarks -- github.com/hpc/rma-mt). That will not make it into 
v3.1.x but will be in v4.0.0.

-Nathan

> On May 9, 2018, at 1:26 AM, Joseph Schuchart  wrote:
> 
> Nathan,
> 
> Thank you, I can confirm that it works as expected with master on our system. 
> I will stick to this version then until 3.1.1 is out.
> 
> Joseph
> 
> On 05/08/2018 05:34 PM, Nathan Hjelm wrote:
>> Looks like it doesn't fail with master so at some point I fixed this bug. 
>> The current plan is to bring all the master changes into v3.1.1. This 
>> includes a number of bug fixes.
>> -Nathan
>> On May 08, 2018, at 08:25 AM, Joseph Schuchart  wrote:
>>> Nathan,
>>> 
>>> Thanks for looking into that. My test program is attached.
>>> 
>>> Best
>>> Joseph
>>> 
>>> On 05/08/2018 02:56 PM, Nathan Hjelm wrote:
 I will take a look today. Can you send me your test program?
 
 -Nathan
 
> On May 8, 2018, at 2:49 AM, Joseph Schuchart  wrote:
> 
> All,
> 
> I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
> (Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
> Unfortunately, a simple (single-threaded) test case consisting of two 
> processes performing an MPI_Rget+MPI_Wait hangs when running on two 
> nodes. It succeeds if both processes run on a single node.
> 
> For completeness, I am attaching the config.log. The build environment 
> was set up to build Open MPI for the login nodes (I wasn't sure how to 
> properly cross-compile the libraries):
> 
> ```
> # this seems necessary to avoid a linker error during build
> export CRAYPE_LINK_TYPE=dynamic
> module swap PrgEnv-cray PrgEnv-intel
> module sw craype-haswell craype-sandybridge
> module unload craype-hugepages16M
> module unload cray-mpich
> ```
> 
> I am using mpirun to launch the test code. Below is the BTL debug log 
> (with tcp disabled for clarity, turning it on makes no difference):
> 
> ```
> mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
> [nid03060:36184] mca: base: components_register: registering framework 
> btl components
> [nid03060:36184] mca: base: components_register: found loaded component 
> self
> [nid03060:36184] mca: base: components_register: component self register 
> function successful
> [nid03060:36184] mca: base: components_register: found loaded component sm
> [nid03061:36208] mca: base: components_register: registering framework 
> btl components
> [nid03061:36208] mca: base: components_register: found loaded component 
> self
> [nid03060:36184] mca: base: components_register: found loaded component 
> ugni
> [nid03061:36208] mca: base: components_register: component self register 
> function successful
> [nid03061:36208] mca: base: components_register: found loaded component sm
> [nid03061:36208] mca: base: components_register: found loaded component 
> ugni
> [nid03060:36184] mca: base: components_register: component ugni register 
> function successful
> [nid03060:36184] mca: base: components_register: found loaded component 
> vader
> [nid03061:36208] mca: base: components_register: component ugni register 
> function successful
> [nid03061:36208] mca: base: components_register: found loaded component 
> vader
> [nid03060:36184] mca: base: components_register: component vader register 
> function successful
> [nid03060:36184] mca: base: components_open: opening btl components
> [nid03060:36184] mca: base: components_open: found loaded component self
> [nid03060:36184] mca: base: components_open: component self open function 
> successful
> [nid03060:36184] mca: base: components_open: found loaded component ugni
> [nid03060:36184] mca: base: components_open: component ugni open function 
> successful
> [nid03060:36184] mca: base: components_open: found loaded component vader
> [nid03060:36184] mca: base: components_open: component vader open 
> function successful
> [nid03060:36184] select: initializing btl component self
> [nid03060:36184] select: init of component self returned success
> [nid03060:36184] select: initializing btl component ugni
> [nid03061:36208] mca: base: components_register: component vader register 
> function successful
> [nid03061:36208] mca: base: components_open: opening btl components
> [nid03061:36208] mca: base: components_open: found loaded component self
> [nid03061:36208] mca: base: components_open: component self open function 
> successful

[OMPI users] Problem running with UCX/oshmem on single node?

2018-05-09 Thread Craig Reese


I'm trying to play with oshmem on a single node (just to have a way to do some
simple experimentation and playing around) and having spectacular problems:

CentOS 6.9 (gcc 4.4.7)
built and installed ucx 1.3.0
built and installed openmpi-3.1.0

   [cfreese]$ cat oshmem.c

   #include <shmem.h>
   int
   main() {
    shmem_init();
   }

   [cfreese]$ mpicc oshmem.c -loshmem

   [cfreese]$ shmemrun -np 2 ./a.out

   [ucs1l:30118] mca: base: components_register: registering framework
   spml components
   [ucs1l:30118] mca: base: components_register: found loaded component ucx
   [ucs1l:30119] mca: base: components_register: registering framework
   spml components
   [ucs1l:30119] mca: base: components_register: found loaded component ucx
   [ucs1l:30119] mca: base: components_register: component ucx register
   function successful
   [ucs1l:30118] mca: base: components_register: component ucx register
   function successful
   [ucs1l:30119] mca: base: components_open: opening spml components
   [ucs1l:30119] mca: base: components_open: found loaded component ucx
   [ucs1l:30118] mca: base: components_open: opening spml components
   [ucs1l:30118] mca: base: components_open: found loaded component ucx
   [ucs1l:30119] mca: base: components_open: component ucx open
   function successful
   [ucs1l:30118] mca: base: components_open: component ucx open
   function successful
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
   mca_spml_base_select() select: initializing spml component ucx
   [ucs1l:30119]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 -
   mca_spml_ucx_component_init() in ucx, my priority is 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
   mca_spml_base_select() select: initializing spml component ucx
   [ucs1l:30118]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 -
   mca_spml_ucx_component_init() in ucx, my priority is 21
   [ucs1l:30118]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 -
   mca_spml_ucx_component_init() *** ucx initialized 
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
   mca_spml_base_select() select: init returned priority 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
   mca_spml_base_select() selected ucx best priority 21
   [ucs1l:30118]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
   mca_spml_base_select() select: component ucx selected
   [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
   mca_spml_ucx_enable() *** ucx ENABLED 
   [ucs1l:30119]
   ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 -
   mca_spml_ucx_component_init() *** ucx initialized 
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
   mca_spml_base_select() select: init returned priority 21
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
   mca_spml_base_select() selected ucx best priority 21
   [ucs1l:30119]
   ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
   mca_spml_base_select() select: component ucx selected
   [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
   mca_spml_ucx_enable() *** ucx ENABLED 

Here's where I think the real issue is:

   [1525891910.424102] [ucs1l:30119:0] select.c:316  UCX  ERROR no
   remote registered memory access transport to :
   mm/posix - Destination is unreachable, mm/sysv - Destination is
   unreachable, tcp/eth0 - no put short, self/self - Destination is
   unreachable
   [1525891910.424104] [ucs1l:30118:0] select.c:316  UCX ERROR
   no remote registered memory access transport to :
   mm/posix - Destination is unreachable, mm/sysv - Destination is
   unreachable, tcp/eth0 - no put short, self/self - Destination is
   unreachable

   [ucs1l:30119] Error
   ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
   mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is
   unreachable
   [ucs1l:30118] Error
   ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
   mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is
   unreachable
   *** glibc detected *** ./a.out: double free or corruption (!prev):
   0x00bb0f10 ***
   *** glibc detected *** ./a.out: double free or corruption (!prev):
   0x00f98ef0 ***
   === Backtrace: =
   === Backtrace: =
   /lib64/libc.so.6[0x338d875dee]
   /lib64/libc.so.6[0x338d875dee]
   /lib64/libc.so.6[0x338d878c80]
   /lib64/libc.so.6[0x338d878c80]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7fea58e4637c]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[0x7f1dc261437c]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7fea58e07833]
   /opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_init+0x273)[0x7f1dc25d5833]

[OMPI users] shmem

2018-05-09 Thread Michael Di Domenico
Before I debug UCX further (because it's totally not working for me), I
figured I'd check whether it's *really* required to use shmem inside
of Open MPI.  I'm pretty sure the answer is yes, but I wanted to
double-check.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] problem

2018-05-09 Thread Jeff Squyres (jsquyres)
It looks like you're getting a segv when calling MPI_Comm_rank().

This is quite unusual -- MPI_Comm_rank() is just a local lookup / return of an 
integer.  If MPI_Comm_rank() is seg faulting, it usually indicates that there's 
some other kind of memory error in the application, and this seg fault you're 
seeing is just a symptom -- it's not the real problem.  It may have worked with 
Intel MPI by chance, or for some reason, Intel MPI has a different memory 
pattern than Open MPI and it didn't happen to trigger this exact problem.

You might want to run your application through a memory-checking debugger.
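
(As a quick sanity check, a trivial program along these lines -- call it
rank_check.c, just as an example -- exercises nothing but MPI_Init /
MPI_Comm_rank / MPI_Finalize.  If it runs cleanly under Open MPI while your
application still crashes in MPI_Comm_rank(), that is a strong hint that the
corruption is elsewhere in the application, which a tool like valgrind can
usually pinpoint.)

```
/* rank_check.c -- minimal sketch, standard MPI calls only */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* a purely local lookup */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```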



> On May 9, 2018, at 11:39 AM, Ankita m  wrote:
> 
> yes. Because previously i was using intel-mpi. That time the program was 
> running perfectly. Now when i use openmpi this shows this error 
> files...Though i am not quite sure. I just thought if the issue will be for 
> Openmpi then i could get some help here.
> 
> On Wed, May 9, 2018 at 6:47 PM, Gilles Gouaillardet 
>  wrote:
> Ankita,
> 
> Do you have any reason to suspect the root cause of the crash is Open MPI ?
> 
> Cheers,
> 
> Gilles
> 
> 
> On Wednesday, May 9, 2018, Ankita m  wrote:
> MPI "Hello World" program is also working 
> 
> please see this error file attached below. its of a different program
> 
> On Wed, May 9, 2018 at 4:10 PM, John Hearns via users 
>  wrote:
> Ankita, looks like your program is not launching correctly.
> I would try the following:   
> define two hosts in a machinefile.  Use mpirun -np 2  machinefile  date
> Ie can you use mpirun just to run the command 'date'
> 
> Secondly compile up and try to run an MPI 'Hello World' program
> 
> 
> On 9 May 2018 at 12:28, Ankita m  wrote:
> I am using ompi -3.1.0 version in my program and compiler is mpicc
> 
> its a parallel program which uses multiple nodes with 16 cores in each node. 
> 
> but its not working and generates a error file . i Have attached the error 
> file below.
> 
> can anyone please tell what is the issue actually
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] problem

2018-05-09 Thread Ankita m
Yes. Previously I was using Intel MPI, and at that time the program was
running perfectly. Now when I use Open MPI it produces these error
files... though I am not quite sure. I just thought that if the issue is with
Open MPI then I could get some help here.

On Wed, May 9, 2018 at 6:47 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Ankita,
>
> Do you have any reason to suspect the root cause of the crash is Open MPI ?
>
> Cheers,
>
> Gilles
>
>
> On Wednesday, May 9, 2018, Ankita m  wrote:
>
>> MPI "Hello World" program is also working
>>
>> please see this error file attached below. its of a different program
>>
>> On Wed, May 9, 2018 at 4:10 PM, John Hearns via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Ankita, looks like your program is not launching correctly.
>>> I would try the following:
>>> define two hosts in a machinefile.  Use mpirun -np 2  machinefile  date
>>> Ie can you use mpirun just to run the command 'date'
>>>
>>> Secondly compile up and try to run an MPI 'Hello World' program
>>>
>>>
>>> On 9 May 2018 at 12:28, Ankita m  wrote:
>>>
 I am using ompi -3.1.0 version in my program and compiler is mpicc

 its a parallel program which uses multiple nodes with 16 cores in each
 node.

 but its not working and generates a error file . i Have attached the
 error file below.

 can anyone please tell what is the issue actually

 ___
 users mailing list
 users@lists.open-mpi.org
 https://lists.open-mpi.org/mailman/listinfo/users

>>>
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] problem

2018-05-09 Thread Gilles Gouaillardet
Ankita,

Do you have any reason to suspect the root cause of the crash is Open MPI?

Cheers,

Gilles

On Wednesday, May 9, 2018, Ankita m  wrote:

> MPI "Hello World" program is also working
>
> please see this error file attached below. its of a different program
>
> On Wed, May 9, 2018 at 4:10 PM, John Hearns via users <
> users@lists.open-mpi.org> wrote:
>
>> Ankita, looks like your program is not launching correctly.
>> I would try the following:
>> define two hosts in a machinefile.  Use mpirun -np 2  machinefile  date
>> Ie can you use mpirun just to run the command 'date'
>>
>> Secondly compile up and try to run an MPI 'Hello World' program
>>
>>
>> On 9 May 2018 at 12:28, Ankita m  wrote:
>>
>>> I am using ompi -3.1.0 version in my program and compiler is mpicc
>>>
>>> its a parallel program which uses multiple nodes with 16 cores in each
>>> node.
>>>
>>> but its not working and generates a error file . i Have attached the
>>> error file below.
>>>
>>> can anyone please tell what is the issue actually
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] problem

2018-05-09 Thread Ankita m
MPI "Hello World" program is also working

please see this error file attached below. its of a different program

On Wed, May 9, 2018 at 4:10 PM, John Hearns via users <
users@lists.open-mpi.org> wrote:

> Ankita, looks like your program is not launching correctly.
> I would try the following:
> define two hosts in a machinefile.  Use mpirun -np 2  machinefile  date
> Ie can you use mpirun just to run the command 'date'
>
> Secondly compile up and try to run an MPI 'Hello World' program
>
>
> On 9 May 2018 at 12:28, Ankita m  wrote:
>
>> I am using ompi -3.1.0 version in my program and compiler is mpicc
>>
>> its a parallel program which uses multiple nodes with 16 cores in each
>> node.
>>
>> but its not working and generates a error file . i Have attached the
>> error file below.
>>
>> can anyone please tell what is the issue actually
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>


bicgstab_Test.e88
Description: Binary data
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] problem

2018-05-09 Thread John Hearns via users
Ankita, it looks like your program is not launching correctly.
I would try the following:
define two hosts in a machinefile, then use: mpirun -np 2  machinefile  date
I.e., can you use mpirun just to run the command 'date'?

Secondly, compile and try to run an MPI 'Hello World' program.


On 9 May 2018 at 12:28, Ankita m  wrote:

> I am using ompi -3.1.0 version in my program and compiler is mpicc
>
> its a parallel program which uses multiple nodes with 16 cores in each
> node.
>
> but its not working and generates a error file . i Have attached the error
> file below.
>
> can anyone please tell what is the issue actually
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] problem

2018-05-09 Thread Ankita m
I am using OMPI 3.1.0 in my program, and the compiler is mpicc.

It is a parallel program which uses multiple nodes with 16 cores in each
node.

But it is not working and generates an error file. I have attached the error
file below.

Can anyone please tell me what the issue actually is?


bicgstab_Test.e61
Description: Binary data
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI-3 RMA on Cray XC40

2018-05-09 Thread Joseph Schuchart

Nathan,

Thank you, I can confirm that it works as expected with master on our 
system. I will stick to this version then until 3.1.1 is out.


Joseph

On 05/08/2018 05:34 PM, Nathan Hjelm wrote:


Looks like it doesn't fail with master so at some point I fixed this 
bug. The current plan is to bring all the master changes into v3.1.1. 
This includes a number of bug fixes.


-Nathan

On May 08, 2018, at 08:25 AM, Joseph Schuchart  wrote:


Nathan,

Thanks for looking into that. My test program is attached.

Best
Joseph

On 05/08/2018 02:56 PM, Nathan Hjelm wrote:

I will take a look today. Can you send me your test program?

-Nathan


On May 8, 2018, at 2:49 AM, Joseph Schuchart  wrote:

All,

I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
(Haswell-based nodes, Aries interconnect) for multi-threaded MPI 
RMA. Unfortunately, a simple (single-threaded) test case consisting 
of two processes performing an MPI_Rget+MPI_Wait hangs when running 
on two nodes. It succeeds if both processes run on a single node.
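
(The attached test program is not reproduced in the archive; the sketch below
is only a reconstruction of the kind of test case described above, not the
actual attachment.  Run with two processes, one per node: rank 0 issues an
MPI_Rget from rank 1 inside a passive-target epoch and then waits on the
request.)

```
/* rget_test.c -- hypothetical reconstruction of the described test case */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes one int through an RMA window. */
    int *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);

    MPI_Win_lock_all(0, win);      /* passive-target epoch (required for MPI_Rget) */
    *buf = 42 + rank;              /* local store into the window ...           */
    MPI_Win_sync(win);             /* ... made visible for remote access        */
    MPI_Barrier(MPI_COMM_WORLD);   /* target value is in place before the get   */

    if (rank == 0) {
        int val = -1;
        MPI_Request req;
        MPI_Rget(&val, 1, MPI_INT, 1 /* target */, 0, 1, MPI_INT, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the reported hang is in Rget+Wait */
        printf("rank 0 read %d from rank 1\n", val);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```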


For completeness, I am attaching the config.log. The build 
environment was set up to build Open MPI for the login nodes (I 
wasn't sure how to properly cross-compile the libraries):


```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```

I am using mpirun to launch the test code. Below is the BTL debug 
log (with tcp disabled for clarity, turning it on makes no difference):


```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 
./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering 
framework btl components
[nid03060:36184] mca: base: components_register: found loaded 
component self
[nid03060:36184] mca: base: components_register: component self 
register function successful
[nid03060:36184] mca: base: components_register: found loaded 
component sm
[nid03061:36208] mca: base: components_register: registering 
framework btl components
[nid03061:36208] mca: base: components_register: found loaded 
component self
[nid03060:36184] mca: base: components_register: found loaded 
component ugni
[nid03061:36208] mca: base: components_register: component self 
register function successful
[nid03061:36208] mca: base: components_register: found loaded 
component sm
[nid03061:36208] mca: base: components_register: found loaded 
component ugni
[nid03060:36184] mca: base: components_register: component ugni 
register function successful
[nid03060:36184] mca: base: components_register: found loaded 
component vader
[nid03061:36208] mca: base: components_register: component ugni 
register function successful
[nid03061:36208] mca: base: components_register: found loaded 
component vader
[nid03060:36184] mca: base: components_register: component vader 
register function successful

[nid03060:36184] mca: base: components_open: opening btl components
[nid03060:36184] mca: base: components_open: found loaded component self
[nid03060:36184] mca: base: components_open: component self open 
function successful

[nid03060:36184] mca: base: components_open: found loaded component ugni
[nid03060:36184] mca: base: components_open: component ugni open 
function successful
[nid03060:36184] mca: base: components_open: found loaded component 
vader
[nid03060:36184] mca: base: components_open: component vader open 
function successful

[nid03060:36184] select: initializing btl component self
[nid03060:36184] select: init of component self returned success
[nid03060:36184] select: initializing btl component ugni
[nid03061:36208] mca: base: components_register: component vader 
register function successful

[nid03061:36208] mca: base: components_open: opening btl components
[nid03061:36208] mca: base: components_open: found loaded component self
[nid03061:36208] mca: base: components_open: component self open 
function successful

[nid03061:36208] mca: base: components_open: found loaded component ugni
[nid03061:36208] mca: base: components_open: component ugni open 
function successful
[nid03061:36208] mca: base: components_open: found loaded component 
vader
[nid03061:36208] mca: base: components_open: component vader open 
function successful

[nid03061:36208] select: initializing btl component self
[nid03061:36208] select: init of component self returned success
[nid03061:36208] select: initializing btl component ugni
[nid03061:36208] select: init of component ugni returned success
[nid03061:36208] select: initializing btl component vader
[nid03061:36208] select: init of component vader returned failure
[nid03061:36208] mca: base: close: component vader closed
[nid03061:36208] mca: base: close: unloading component vader
[nid03060:36184] select: init of component ugni returned success
[nid03060:36184] select: initializing btl component vader

[OMPI users] process binding error for openmpi-master-201805080348-b39bbfb

2018-05-09 Thread Siegmar Gross

Hi,

I've installed openmpi-master-201805080348-b39bbfb on my "SUSE Linux
Enterprise Server 12.3 (x86_64)" with gcc-6.4.0. Unfortunately I get
an error if I use process binding.


loki config_files 137 mpiexec -report-bindings -np 4 -rf rf_loki_nfs1 hostname
[loki:17301] OPAL dss:unpack: got type 27 when expecting type 3
[loki:17301] [[56361,0],0] ORTE_ERROR_LOG: Pack data mismatch in file 
../../../../openmpi-master-201805080348-b39bbfb/orte/mca/odls/base/odls_base_default_fns.c 
at line 612

[nfs1:05786] OPAL dss:unpack: got type 27 when expecting type 3
[nfs1:05786] [[56361,0],1] ORTE_ERROR_LOG: Pack data mismatch in file 
../../../../openmpi-master-201805080348-b39bbfb/orte/mca/odls/base/odls_base_default_fns.c 
at line 612

loki config_files 138 which mpiexec
/usr/local/openmpi-master_64_gcc/bin/mpiexec
loki config_files 139


loki config_files 139 more rf_loki_nfs1
rank 0=loki slot=0:0-3;1:0-1
rank 1=loki slot=1:2-5
rank 2=nfs1 slot=0:4
rank 3=nfs1 slot=1:5
loki config_files 140


Everything works as expected with openmpi-3.1.0.

loki config_files 110 mpiexec -report-bindings -np 4 -rf rf_loki_nfs1 hostname
loki
[loki:18003] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 1[core 6[hwt 
0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/BB/BB/../..][BB/BB/../../../..]
[loki:18003] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 
0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
[../../../../../..][../../BB/BB/BB/BB]

loki
[nfs1:07399] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]: 
[../../../../BB/..][../../../../../..]
[nfs1:07399] MCW rank 3 bound to socket 1[core 11[hwt 0-1]]: 
[../../../../../..][../../../../../BB]

nfs1
nfs1
loki config_files 111 which mpiexec
/usr/local/openmpi-3.1.0_64_gcc/bin/mpiexec
loki config_files 112


I would be grateful if somebody could fix the problem. Do you need anything
else? Thank you very much in advance for any help.


Kind regards

Siegmar
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users