I will take a look today. Can you send me your test program?

-Nathan

> On May 8, 2018, at 2:49 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> 
> All,
> 
> I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
> (Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
> Unfortunately, a simple (single-threaded) test case consisting of two 
> processes performing an MPI_Rget+MPI_Wait hangs when running on two nodes. It 
> succeeds if both processes run on a single node.
> 
> For completeness, I am attaching the config.log. The build environment was 
> set up to build Open MPI for the login nodes (I wasn't sure how to properly 
> cross-compile the libraries):
> 
> ```
> # this seems necessary to avoid a linker error during build
> export CRAYPE_LINK_TYPE=dynamic
> module swap PrgEnv-cray PrgEnv-intel
> module sw craype-haswell craype-sandybridge
> module unload craype-hugepages16M
> module unload cray-mpich
> ```
> 
> I am using mpirun to launch the test code. Below is the BTL debug log (with 
> tcp disabled for clarity; enabling it makes no difference):
> 
> ```
> mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
> [nid03060:36184] mca: base: components_register: registering framework btl components
> [nid03060:36184] mca: base: components_register: found loaded component self
> [nid03060:36184] mca: base: components_register: component self register function successful
> [nid03060:36184] mca: base: components_register: found loaded component sm
> [nid03061:36208] mca: base: components_register: registering framework btl components
> [nid03061:36208] mca: base: components_register: found loaded component self
> [nid03060:36184] mca: base: components_register: found loaded component ugni
> [nid03061:36208] mca: base: components_register: component self register function successful
> [nid03061:36208] mca: base: components_register: found loaded component sm
> [nid03061:36208] mca: base: components_register: found loaded component ugni
> [nid03060:36184] mca: base: components_register: component ugni register function successful
> [nid03060:36184] mca: base: components_register: found loaded component vader
> [nid03061:36208] mca: base: components_register: component ugni register function successful
> [nid03061:36208] mca: base: components_register: found loaded component vader
> [nid03060:36184] mca: base: components_register: component vader register function successful
> [nid03060:36184] mca: base: components_open: opening btl components
> [nid03060:36184] mca: base: components_open: found loaded component self
> [nid03060:36184] mca: base: components_open: component self open function successful
> [nid03060:36184] mca: base: components_open: found loaded component ugni
> [nid03060:36184] mca: base: components_open: component ugni open function successful
> [nid03060:36184] mca: base: components_open: found loaded component vader
> [nid03060:36184] mca: base: components_open: component vader open function successful
> [nid03060:36184] select: initializing btl component self
> [nid03060:36184] select: init of component self returned success
> [nid03060:36184] select: initializing btl component ugni
> [nid03061:36208] mca: base: components_register: component vader register function successful
> [nid03061:36208] mca: base: components_open: opening btl components
> [nid03061:36208] mca: base: components_open: found loaded component self
> [nid03061:36208] mca: base: components_open: component self open function successful
> [nid03061:36208] mca: base: components_open: found loaded component ugni
> [nid03061:36208] mca: base: components_open: component ugni open function successful
> [nid03061:36208] mca: base: components_open: found loaded component vader
> [nid03061:36208] mca: base: components_open: component vader open function successful
> [nid03061:36208] select: initializing btl component self
> [nid03061:36208] select: init of component self returned success
> [nid03061:36208] select: initializing btl component ugni
> [nid03061:36208] select: init of component ugni returned success
> [nid03061:36208] select: initializing btl component vader
> [nid03061:36208] select: init of component vader returned failure
> [nid03061:36208] mca: base: close: component vader closed
> [nid03061:36208] mca: base: close: unloading component vader
> [nid03060:36184] select: init of component ugni returned success
> [nid03060:36184] select: initializing btl component vader
> [nid03060:36184] select: init of component vader returned failure
> [nid03060:36184] mca: base: close: component vader closed
> [nid03060:36184] mca: base: close: unloading component vader
> [nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node nid03061
> [nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node nid03060
> [nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node (null)
> [nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on node (null)
> ```
> 
> It looks like the UGNI btl is being initialized correctly but then fails to 
> find the node to communicate with? Is there a way to get more information? 
> There doesn't seem to be an MCA parameter to increase verbosity specifically 
> of the UGNI btl.
> 
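
A minimal sketch for pulling just the per-peer BTL selection out of a log like the one quoted above, assuming the raw mpirun output (one record per line, not wrapped by a mail client) was saved to a file. The function name `summarize_bml` and the file name `btl_debug.log` are made up for illustration, and the pattern is based only on the `mca: bml: Using ...` lines shown in this log, so other Open MPI versions may format them differently.

```shell
# Hypothetical helper (not part of Open MPI): print one line per
# "mca: bml: Using <btl> btl for send to <peer> on node <node>" record.
# Assumes the verbose output was captured unwrapped, for example with
#   mpirun --mca btl_base_verbose 100 ... ./mpi_test_loop 2>&1 | tee btl_debug.log
# where btl_debug.log is an assumed file name.
summarize_bml() {
    # Output columns: <host:pid> <btl> <peer> <node>
    sed -n -E 's/^\[([^]]+)\] mca: bml: Using ([A-Za-z0-9_]+) btl for send to (\[\[[0-9]+,[0-9]+\],[0-9]+\]) on node (.*)$/\1 \2 \3 \4/p' "$1"
}

# Usage (assumed file name):
#   summarize_bml btl_debug.log
```

Rows whose last column is `(null)` correspond to the `on node (null)` entries that prompted the question about the UGNI btl; the sketch only filters the log and does not explain why the node name is missing.
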
> Any help would be appreciated!
> 
> Cheers
> Joseph
> <config.log.tgz>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users