Hi,

When we ran Open MPI v4.0.0 on a cluster with InfiniBand, we got the following 
warning and error messages. Older versions (< 3.x) work fine on the same cluster.


####################################

$ mpirun -n 4 ./a.out

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              t02n34
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   t02n34
  Local device: mlx5_0
--------------------------------------------------------------------------
libibcm: couldn't read ABI version
[1546869563.579350] [t02n34:28160:0]       cm_iface.c:309  UCX  ERROR 
ib_cm_open_device() failed: No such file or directory. Check if ib_ucm.ko 
module is loaded.
libibcm: couldn't read ABI version
[1546869563.580315] [t02n34:28159:0]       cm_iface.c:309  UCX  ERROR 
ib_cm_open_device() failed: No such file or directory. Check if ib_ucm.ko 
module is loaded.
libibcm: couldn't read ABI version
[1546869563.580620] [t02n34:28161:0]       cm_iface.c:309  UCX  ERROR 
ib_cm_open_device() failed: No such file or directory. Check if ib_ucm.ko 
module is loaded.
libibcm: couldn't read ABI version
[1546869563.581113] [t02n34:28158:0]       cm_iface.c:309  UCX  ERROR 
ib_cm_open_device() failed: No such file or directory. Check if ib_ucm.ko 
module is loaded.
[t02n34:28159] ../../../../../openmpi-4.0.0/ompi/mca/pml/ucx/pml_ucx.c:212 
Error: Failed to create UCP worker
[t02n34:28160] ../../../../../openmpi-4.0.0/ompi/mca/pml/ucx/pml_ucx.c:212 
Error: Failed to create UCP worker
[t02n34:28158] ../../../../../openmpi-4.0.0/ompi/mca/pml/ucx/pml_ucx.c:212 
Error: Failed to create UCP worker
[t02n34:28161] ../../../../../openmpi-4.0.0/ompi/mca/pml/ucx/pml_ucx.c:212 
Error: Failed to create UCP worker
Hello world from processor t02n34, rank 3 out of 4 processors
Hello world from processor t02n34, rank 0 out of 4 processors
Hello world from processor t02n34, rank 2 out of 4 processors
Hello world from processor t02n34, rank 1 out of 4 processors
[t02n34:28151] 3 more processes have sent help message help-mpi-btl-openib.txt 
/ ib port not selected
[t02n34:28151] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
[t02n34:28151] 3 more processes have sent help message help-mpi-btl-openib.txt 
/ error in device init



If we set the MCA parameter "btl_openib_allow_ib" to 1, we get different errors:


t02n34$ mpirun -n 4 --mca btl_openib_allow_ib 1 ./a.out
[t02n34:28232:0:28232] Caught signal 11 (Segmentation fault: invalid 
permissions for mapped object at address 0x7fef6749e7e0)
[t02n34:28234:0:28234] Caught signal 11 (Segmentation fault: invalid 
permissions for mapped object at address 0x7fc2e8f4d7e0)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[t02n34:28233:0:28233] Caught signal 11 (Segmentation fault: invalid 
permissions for mapped object at address 0x7f981ee0e7e0)
[t02n34:28235:0:28235] Caught signal 11 (Segmentation fault: invalid 
permissions for mapped object at address 0x7fdc778c07e0)
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node t02n34 exited on signal 
11 (Segmentation fault).
--------------------------------------------------------------------------


############################


The configuration flags to build this version are:


$ ../openmpi-4.0.0/configure --prefix=/vol/openmpi/4.0.0/ \
    --with-ucx=/vol/openmpi/4.0.0/ucx/1.4.0

(We also tried configuring with --without-verbs, but got the same errors.)



Thanks a lot.


Regards, Jing

