I will take a look today. Can you send me your test program?

-Nathan

> On May 8, 2018, at 2:49 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>
> All,
>
> I have been experimenting with using Open MPI 3.1.0 on our Cray XC40
> (Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA.
> Unfortunately, a simple (single-threaded) test case consisting of two
> processes performing an MPI_Rget+MPI_Wait hangs when running on two nodes. It
> succeeds if both processes run on a single node.
>
> For completeness, I am attaching the config.log. The build environment was
> set up to build Open MPI for the login nodes (I wasn't sure how to properly
> cross-compile the libraries):
>
> ```
> # this seems necessary to avoid a linker error during build
> export CRAYPE_LINK_TYPE=dynamic
> module swap PrgEnv-cray PrgEnv-intel
> module sw craype-haswell craype-sandybridge
> module unload craype-hugepages16M
> module unload cray-mpich
> ```
>
> I am using mpirun to launch the test code. Below is the BTL debug log (with
> tcp disabled for clarity, turning it on makes no difference):
>
> ```
> mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
> [nid03060:36184] mca: base: components_register: registering framework btl components
> [nid03060:36184] mca: base: components_register: found loaded component self
> [nid03060:36184] mca: base: components_register: component self register function successful
> [nid03060:36184] mca: base: components_register: found loaded component sm
> [nid03061:36208] mca: base: components_register: registering framework btl components
> [nid03061:36208] mca: base: components_register: found loaded component self
> [nid03060:36184] mca: base: components_register: found loaded component ugni
> [nid03061:36208] mca: base: components_register: component self register function successful
> [nid03061:36208] mca: base: components_register: found loaded component sm
> [nid03061:36208] mca: base: components_register: found loaded component ugni
> [nid03060:36184] mca: base: components_register: component ugni register function successful
> [nid03060:36184] mca: base: components_register: found loaded component vader
> [nid03061:36208] mca: base: components_register: component ugni register function successful
> [nid03061:36208] mca: base: components_register: found loaded component vader
> [nid03060:36184] mca: base: components_register: component vader register function successful
> [nid03060:36184] mca: base: components_open: opening btl components
> [nid03060:36184] mca: base: components_open: found loaded component self
> [nid03060:36184] mca: base: components_open: component self open function successful
> [nid03060:36184] mca: base: components_open: found loaded component ugni
> [nid03060:36184] mca: base: components_open: component ugni open function successful
> [nid03060:36184] mca: base: components_open: found loaded component vader
> [nid03060:36184] mca: base: components_open: component vader open function successful
> [nid03060:36184] select: initializing btl component self
> [nid03060:36184] select: init of component self returned success
> [nid03060:36184] select: initializing btl component ugni
> [nid03061:36208] mca: base: components_register: component vader register function successful
> [nid03061:36208] mca: base: components_open: opening btl components
> [nid03061:36208] mca: base: components_open: found loaded component self
> [nid03061:36208] mca: base: components_open: component self open function successful
> [nid03061:36208] mca: base: components_open: found loaded component ugni
> [nid03061:36208] mca: base: components_open: component ugni open function successful
> [nid03061:36208] mca: base: components_open: found loaded component vader
> [nid03061:36208] mca: base: components_open: component vader open function successful
> [nid03061:36208] select: initializing btl component self
> [nid03061:36208] select: init of component self returned success
> [nid03061:36208] select: initializing btl component ugni
> [nid03061:36208] select: init of component ugni returned success
> [nid03061:36208] select: initializing btl component vader
> [nid03061:36208] select: init of component vader returned failure
> [nid03061:36208] mca: base: close: component vader closed
> [nid03061:36208] mca: base: close: unloading component vader
> [nid03060:36184] select: init of component ugni returned success
> [nid03060:36184] select: initializing btl component vader
> [nid03060:36184] select: init of component vader returned failure
> [nid03060:36184] mca: base: close: component vader closed
> [nid03060:36184] mca: base: close: unloading component vader
> [nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node nid03061
> [nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node nid03060
> [nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node (null)
> [nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on node (null)
> ```
>
> It looks like the UGNI btl is being initialized correctly but then fails to
> find the node to communicate with? Is there a way to get more information?
> There doesn't seem to be an MCA parameter to increase verbosity specifically
> of the UGNI btl.
>
> Any help would be appreciated!
>
> Cheers
> Joseph
> <config.log.tgz>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users