Thanks. The verbose output is:

[kahan01.upvnet.upv.es:29732] mca: base: components_register: registering framework btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component self register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component sm
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component openib register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component vader register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component tcp register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: opening btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component self open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component openib open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component vader open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component tcp open function successful
[kahan01.upvnet.upv.es:29732] select: initializing btl component self
[kahan01.upvnet.upv.es:29732] select: init of component self returned success
[kahan01.upvnet.upv.es:29732] select: initializing btl component openib
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr0
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[2]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[3]=16
[kahan01.upvnet.upv.es:29732] ibv_obj->type set to NULL
[kahan01.upvnet.upv.es:29732] Process is bound: distance to device is 0.000000
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr1
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[2]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[3]=16
[kahan01.upvnet.upv.es:29732] ibv_obj->type set to NULL
[kahan01.upvnet.upv.es:29732] Process is bound: distance to device is 0.000000
[kahan01.upvnet.upv.es:29732] openib BTL: rdmacm CPC unavailable for use on qedr0:1; skipped
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           kahan01
  Local device:         qedr0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
[kahan01.upvnet.upv.es:29732] select: init of component openib returned failure
[kahan01.upvnet.upv.es:29732] mca: base: close: component openib closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component openib
[kahan01.upvnet.upv.es:29732] select: initializing btl component vader
[kahan01.upvnet.upv.es:29732] select: init of component vader returned failure
[kahan01.upvnet.upv.es:29732] mca: base: close: component vader closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component vader
[kahan01.upvnet.upv.es:29732] select: initializing btl component tcp
[kahan01.upvnet.upv.es:29732] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[kahan01.upvnet.upv.es:29732] btl: tcp: Found match: 127.0.0.1 (lo)
[kahan01.upvnet.upv.es:29732] btl:tcp: Attempting to bind to AF_INET port 1024
[kahan01.upvnet.upv.es:29732] btl:tcp: Successfully bound to AF_INET port 1024
[kahan01.upvnet.upv.es:29732] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[kahan01.upvnet.upv.es:29732] btl:tcp: examining interface eno1
[kahan01.upvnet.upv.es:29732] btl:tcp: using ipv6 interface eno1
[kahan01.upvnet.upv.es:29732] btl:tcp: examining interface eno5
[kahan01.upvnet.upv.es:29732] btl:tcp: using ipv6 interface eno5
[kahan01.upvnet.upv.es:29732] select: init of component tcp returned success
[kahan01.upvnet.upv.es:29732] mca: bml: Using self btl for send to [[45435,1],0] on node kahan01
Hello world from process 0 of 1, provided=1
[kahan01.upvnet.upv.es:29732] mca: base: close: component self closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component self
[kahan01.upvnet.upv.es:29732] mca: base: close: component tcp closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component tcp


Regarding UCX, at some point I tried it, but IIRC the UCX installation on this
machine does not work for some reason. Is there an easy way to check that UCX is
working properly before installing Open MPI with UCX support?
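For instance, would something along these lines be enough to verify it? (Assuming
the UCX command-line tools are installed; the exact options may differ.)

   ucx_info -v                       # UCX version and build configuration
   ucx_info -d                       # detected devices/transports (qedr should appear if UCX can use it)
   ucx_perftest -t tag_lat           # start a server on one node...
   ucx_perftest kahan01 -t tag_lat   # ...and run the client from another node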

Jose



> On 3 Feb 2022, at 16:38, Pritchard Jr., Howard <howa...@lanl.gov> wrote:
> 
> Hello Jose,
> 
> I suspect the issue here is that the OpenIB BTL isn't finding a connection 
> module when you are requesting MPI_THREAD_MULTIPLE.
> The rdmacm connection module is deselected if the MPI_THREAD_MULTIPLE thread 
> support level is being requested.
> 
> If you run the test in a shell with
> 
> export OMPI_MCA_btl_base_verbose=100
> 
> there may be some more info to help diagnose what's going on.
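> For example, assuming a single-node run (the parameter can also be given on the 
> mpirun command line), something like:
> 
>    mpirun --mca btl_base_verbose 100 -np 1 ./hellow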
> 
> Another option would be to build Open MPI with UCX support.  That's the 
> better way to use Open MPI over IB/RoCE.
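> For reference, UCX support is enabled at configure time with something along 
> these lines (the paths are placeholders):
> 
>    ./configure --with-ucx=/path/to/ucx --prefix=/path/to/openmpi-install
>    make -j && make install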
> 
> Howard
> 
> On 2/2/22, 10:52 AM, "users on behalf of Jose E. Roman via users" 
> <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> 
> wrote:
> 
>    Hi.
> 
>    I am using Open MPI 4.1.1 with the openib BTL on a 4-node cluster with 
> Ethernet 10/25Gb (RoCE). It is using libibverbs from Ubuntu 18.04 (kernel 
> 4.15.0-166-generic).
> 
>    With this hello world example:
> 
>    #include <stdio.h>
>    #include <mpi.h>
>    int main (int argc, char *argv[])
>    {
>     int rank, size, provided;
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("Hello world from process %d of %d, provided=%d\n", rank, size, 
> provided);
>     MPI_Finalize();
>     return 0;
>    }
> 
>    I get the following output when run on one node:
> 
>    $ ./hellow
>    --------------------------------------------------------------------------
>    No OpenFabrics connection schemes reported that they were able to be
>    used on a specific port.  As such, the openib BTL (OpenFabrics
>    support) will be disabled for this port.
> 
>     Local host:           kahan01
>     Local device:         qedr0
>     Local port:           1
>     CPCs attempted:       rdmacm, udcm
>    --------------------------------------------------------------------------
>    Hello world from process 0 of 1, provided=1
> 
> 
>    The message does not appear if I run on the front-end (which does not have the 
> RoCE network), or if I run it on the node using either MPI_Init() instead of 
> MPI_Init_thread() or MPI_THREAD_SINGLE instead of MPI_THREAD_FUNNELED.
> 
>    Is there any reason why MPI_Init_thread() behaves differently from 
> MPI_Init()? Note that I am not using threads, and I am running just one MPI process.
> 
> 
>    The question has a second part: is there a way to determine (without 
> running an MPI program) that MPI_Init_thread() won't work but MPI_Init() will 
> work? I am asking this because PETSc programs default to using 
> MPI_Init_thread() when PETSc's configure script finds the MPI_Init_thread() 
> symbol in the MPI library. But in situations like the one reported here, it 
> would be better to revert to MPI_Init() since MPI_Init_thread() will not work 
> as expected. [The configure script cannot run an MPI program due to batch 
> systems.]
> 
>    Thanks for your help.
>    Jose
> 
