Hi, I tried to install Open MPI v4.0.1 on our Debian cluster, which uses an InfiniBand network with the following dev_info: hca_id: mlx4_0
1) I first tried to install Open MPI without the UCX framework, and it runs perfectly as before; I only need to add

     export OMPI_MCA_btl_openib_allow_ib=1

   to remove the warning about the use of the deprecated openib BTL.

2) I then installed Open MPI using UCX v1.5.1 (the latest release), and it crashes with the following errors:

     uct_iface.c:57 UCX WARN got active message id 5, but no handler installed

   and, at the level of the UCS libraries:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:    lxbk0196
  Local adapter: mlx4_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   lxbk0196
  Local device: mlx4_0
--------------------------------------------------------------------------
[lxbk0195:17255] 39 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[lxbk0195:17255] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[lxbk0195:17255] 39 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[lxbk0195:17271:0:17271] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
 0 /lustre/hebe/rz/denis/mpi/ucx/lib/libucs.so.0(+0x1d540) [0x7f4838796540]
 1 /lustre/hebe/rz/denis/mpi/ucx/lib/libucs.so.0(+0x1d79b) [0x7f483879679b]
 2 /lustre/hebe/rz/denis/mpi/ucx/lib/libucp.so.0(ucp_rndv_ats_handler+0x4) [0x7f4838e30c34]
 3 /lustre/hebe/rz/denis/mpi/ucx/lib/libuct.so.0(+0x24e41) [0x7f4838be7e41]
 4 /lustre/hebe/rz/denis/mpi/ucx/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7f4838e23a02]
 5 /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x7f4839ca0207]
 6 /lustre/hebe/rz/denis/mpi/openmpi/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7f484a88794c]
 7 /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi.so.40(ompi_request_default_wait_all+0x289) [0x7f484bea0709]
 8 /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi.so.40(PMPI_Waitall+0xc7) [0x7f484bee4837]
 9 /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(ADIOI_Calc_others_req+0x31c) [0x7f4828b0bc8c]
10 /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(ADIOI_GEN_WriteStridedColl+0x397) [0x7f4828b1f9a7]
11 /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(MPIOI_File_write_all+0x1b2) [0x7f4828b04b92]
12 /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(mca_io_romio_dist_MPI_File_write_all+0x23) [0x7f4828b04ca3]
13 /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(mca_io_romio321_file_write_all+0x22) [0x7f4828afe902]
14 /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi.so.40(PMPI_File_write_all+0xde) [0x7f484bec7ece]
15 /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi_mpifh.so.40(mpi_file_write_all+0x5b) [0x7f484c1b5fcb]
16 epoch3d() [0x575da5]
17 epoch3d() [0x576660]
18 epoch3d() [0x497ea6]
19 epoch3d() [0x4b10ee]
20 epoch3d() [0x4be358]
21 epoch3d() [0x40395d]
22 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f484ae3cb45]
23 epoch3d() [0x40398d]

Any idea what went wrong? Thanks in advance,
Denis Bertini
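P.S. In case the build and launch details matter, this is roughly how the two installations are configured and how the job is started. The configure flags and the rank count below are approximate reconstructions from the paths and messages above, not copied from my build logs:

    # Build 1: Open MPI 4.0.1 without UCX -- works, warning silenced via:
    ./configure --prefix=/lustre/hebe/rz/denis/mpi/openmpi --without-ucx
    make -j && make install
    export OMPI_MCA_btl_openib_allow_ib=1   # removes the deprecated openib BTL warning

    # Build 2: Open MPI 4.0.1 built against UCX 1.5.1 -- crashes as shown above
    ./configure --prefix=/lustre/hebe/rz/denis/mpi/openmpi \
                --with-ucx=/lustre/hebe/rz/denis/mpi/ucx
    make -j && make install

    # Launch: ~40 ranks; epoch3d writes via MPI_File_write_all (ROMIO), and the
    # UCX PML (mca_pml_ucx) is selected automatically with this build.
    mpirun -np 40 ./epoch3d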