Thanks. The verbose output is:

[kahan01.upvnet.upv.es:29732] mca: base: components_register: registering framework btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component self register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component sm
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component openib register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component vader register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component tcp register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: opening btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component self open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component openib open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component vader open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component tcp open function successful
[kahan01.upvnet.upv.es:29732] select: initializing btl component self
[kahan01.upvnet.upv.es:29732] select: init of component self returned success
[kahan01.upvnet.upv.es:29732] select: initializing btl component openib
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr0
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[2]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[3]=16
[kahan01.upvnet.upv.es:29732] ibv_obj->type set to NULL
[kahan01.upvnet.upv.es:29732] Process is bound: distance to device is 0.000000
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr1
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[2]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[3]=16
[kahan01.upvnet.upv.es:29732] ibv_obj->type set to NULL
[kahan01.upvnet.upv.es:29732] Process is bound: distance to device is 0.000000
[kahan01.upvnet.upv.es:29732] openib BTL: rdmacm CPC unavailable for use on qedr0:1; skipped
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
  Local host:           kahan01
  Local device:         qedr0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
[kahan01.upvnet.upv.es:29732] select: init of component openib returned failure
[kahan01.upvnet.upv.es:29732] mca: base: close: component openib closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component openib
[kahan01.upvnet.upv.es:29732] select: initializing btl component vader
[kahan01.upvnet.upv.es:29732] select: init of component vader returned failure
[kahan01.upvnet.upv.es:29732] mca: base: close: component vader closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component vader
[kahan01.upvnet.upv.es:29732] select: initializing btl component tcp
[kahan01.upvnet.upv.es:29732] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[kahan01.upvnet.upv.es:29732] btl: tcp: Found match: 127.0.0.1 (lo)
[kahan01.upvnet.upv.es:29732] btl:tcp: Attempting to bind to AF_INET port 1024
[kahan01.upvnet.upv.es:29732] btl:tcp: Successfully bound to AF_INET port 1024
[kahan01.upvnet.upv.es:29732] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[kahan01.upvnet.upv.es:29732] btl:tcp: examining interface eno1
[kahan01.upvnet.upv.es:29732] btl:tcp: using ipv6 interface eno1
[kahan01.upvnet.upv.es:29732] btl:tcp: examining interface eno5
[kahan01.upvnet.upv.es:29732] btl:tcp: using ipv6 interface eno5
[kahan01.upvnet.upv.es:29732] select: init of component tcp returned success
[kahan01.upvnet.upv.es:29732] mca: bml: Using self btl for send to [[45435,1],0] on node kahan01
Hello world from process 0 of 1, provided=1
[kahan01.upvnet.upv.es:29732] mca: base: close: component self closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component self
[kahan01.upvnet.upv.es:29732] mca: base: close: component tcp closed
[kahan01.upvnet.upv.es:29732] mca: base: close: unloading component tcp

Regarding UCX: I tried it at some point, but IIRC the UCX installation on this machine does not work for some reason. Is there an easy way to check that UCX works correctly before installing Open MPI with UCX support?

Jose
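One lightweight sanity check, assuming the ucx_info and ucx_perftest utilities that ship with UCX are installed alongside the library (exact flags can differ between UCX releases), is to ask UCX what it detects and run a loopback test on one node:

    ucx_info -v                        # version and build configuration
    ucx_info -d                        # devices/transports UCX detects (look for the RoCE ports)

    # quick loopback test on a single node:
    ucx_perftest &                     # server side
    ucx_perftest localhost -t tag_lat  # client side, tag-matching latency test

If ucx_info -d shows no usable transport for the RoCE devices, or the loopback test fails, an Open MPI build against that UCX installation is unlikely to work either.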
> On 3 Feb 2022, at 16:38, Pritchard Jr., Howard <howa...@lanl.gov> wrote:
>
> Hello Jose,
>
> I suspect the issue here is that the openib BTL isn't finding a connection
> module when you are requesting MPI_THREAD_MULTIPLE. The rdmacm connection
> module is deselected if the MPI_THREAD_MULTIPLE thread support level is
> being requested.
>
> If you run the test in a shell with
>
> export OMPI_MCA_btl_base_verbose=100
>
> there may be some more info to help diagnose what's going on.
>
> Another option would be to build Open MPI with UCX support. That's the
> better way to use Open MPI over IB/RoCE.
>
> Howard
>
> On 2/2/22, 10:52 AM, "users on behalf of Jose E. Roman via users"
> <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org>
> wrote:
>
> Hi.
>
> I am using Open MPI 4.1.1 with the openib BTL on a 4-node cluster with
> Ethernet 10/25Gb (RoCE). It is using libibverbs from Ubuntu 18.04
> (kernel 4.15.0-166-generic).
>
> With this hello world example:
>
> #include <stdio.h>
> #include <mpi.h>
> int main (int argc, char *argv[])
> {
>   int rank, size, provided;
>   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>   printf("Hello world from process %d of %d, provided=%d\n", rank, size, provided);
>   MPI_Finalize();
>   return 0;
> }
>
> I get the following output when run on one node:
>
> $ ./hellow
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>   Local host:           kahan01
>   Local device:         qedr0
>   Local port:           1
>   CPCs attempted:       rdmacm, udcm
> --------------------------------------------------------------------------
> Hello world from process 0 of 1, provided=1
>
> The message does not appear if I run on the front-end (which does not have
> the RoCE network), or if I run it on the node using either MPI_Init()
> instead of MPI_Init_thread(), or MPI_THREAD_SINGLE instead of
> MPI_THREAD_FUNNELED.
>
> Is there any reason why MPI_Init_thread() behaves differently from
> MPI_Init()? Note that I am not using threads, and there is just one MPI
> process.
>
> The question has a second part: is there a way to determine (without
> running an MPI program) that MPI_Init_thread() won't work but MPI_Init()
> will? I am asking because PETSc programs default to using MPI_Init_thread()
> when PETSc's configure script finds the MPI_Init_thread() symbol in the MPI
> library. But in situations like the one reported here, it would be better
> to fall back to MPI_Init(), since MPI_Init_thread() will not work as
> expected. [The configure script cannot run an MPI program due to batch
> systems.]
>
> Thanks for your help.
> Jose
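Since the configure script cannot run an MPI program, a configure-time test cannot observe this behaviour, which only appears at run time on nodes with the RoCE device. One possible workaround is to choose the initialization call at run time and report the thread level actually granted. Below is a minimal sketch; the environment variable name FORCE_PLAIN_MPI_INIT is hypothetical, not a PETSc or Open MPI setting:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/*
 * Sketch of a runtime fallback (not PETSc's actual mechanism): the
 * FORCE_PLAIN_MPI_INIT variable is a made-up name used here as a switch.
 * When it is set, plain MPI_Init() is called; otherwise MPI_Init_thread()
 * is tried and the granted thread level is checked.
 */
int main(int argc, char *argv[])
{
    int provided = MPI_THREAD_SINGLE;

    if (getenv("FORCE_PLAIN_MPI_INIT") != NULL) {
        MPI_Init(&argc, &argv);        /* old-style initialization */
        MPI_Query_thread(&provided);   /* level the library actually provides */
    } else {
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    }

    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "note: MPI provided only thread level %d\n", provided);

    /* ... application code ... */

    MPI_Finalize();
    return 0;
}

With a switch like this, a user who hits the openib warning could rerun with FORCE_PLAIN_MPI_INIT=1 instead of rebuilding the application.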