Hi,

I tried to install Open MPI v4.0.1 on our Debian cluster, which uses an InfiniBand 
network with the following dev_info:
hca_id: mlx4_0
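
(The id above is the hca_id line of the ibv_devinfo output, e.g. obtained with 
something like:)

>ibv_devinfo | grep hca_id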

1) I first tried to install Open MPI without the UCX framework, and it runs perfectly, 
as before. I just need to add

>export  OMPI_MCA_btl_openib_allow_ib=1

to silence the warning about the use of the deprecated openib BTL.
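
(Setting the same MCA parameter on the mpirun command line also works; the 
executable name here is just a placeholder:)

>mpirun --mca btl_openib_allow_ib 1 ./my_app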

2) I then installed Open MPI built against UCX v 1.5.1 (the latest release), and this 
build crashes with the errors quoted further below.
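
For reference, the UCX-enabled build was configured roughly like this (the paths 
are only placeholders for my install prefixes):

>./configure --prefix=/path/to/openmpi --with-ucx=/path/to/ucx
>make && make install

At run time I first get: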

>>>uct_iface.c:57   UCX  WARN  got active message id 5, but no handler installed

and at the level of the UCS libraries:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              lxbk0196
  Local adapter:           mlx4_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   lxbk0196
  Local device: mlx4_0
--------------------------------------------------------------------------
[lxbk0195:17255] 39 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[lxbk0195:17255] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[lxbk0195:17255] 39 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[lxbk0195:17271:0:17271] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
    0  /lustre/hebe/rz/denis/mpi/ucx/lib/libucs.so.0(+0x1d540) [0x7f4838796540]
    1  /lustre/hebe/rz/denis/mpi/ucx/lib/libucs.so.0(+0x1d79b) [0x7f483879679b]
    2  /lustre/hebe/rz/denis/mpi/ucx/lib/libucp.so.0(ucp_rndv_ats_handler+0x4) [0x7f4838e30c34]
    3  /lustre/hebe/rz/denis/mpi/ucx/lib/libuct.so.0(+0x24e41) [0x7f4838be7e41]
    4  /lustre/hebe/rz/denis/mpi/ucx/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7f4838e23a02]
    5  /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x7f4839ca0207]
    6  /lustre/hebe/rz/denis/mpi/openmpi/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7f484a88794c]
    7  /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi.so.40(ompi_request_default_wait_all+0x289) [0x7f484bea0709]
    8  /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi.so.40(PMPI_Waitall+0xc7) [0x7f484bee4837]
    9  /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(ADIOI_Calc_others_req+0x31c) [0x7f4828b0bc8c]
   10  /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(ADIOI_GEN_WriteStridedColl+0x397) [0x7f4828b1f9a7]
   11  /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(MPIOI_File_write_all+0x1b2) [0x7f4828b04b92]
   12  /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(mca_io_romio_dist_MPI_File_write_all+0x23) [0x7f4828b04ca3]
   13  /lustre/hebe/rz/denis/mpi/openmpi/lib/openmpi/mca_io_romio321.so(mca_io_romio321_file_write_all+0x22) [0x7f4828afe902]
   14  /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi.so.40(PMPI_File_write_all+0xde) [0x7f484bec7ece]
   15  /lustre/hebe/rz/denis/mpi/openmpi/lib/libmpi_mpifh.so.40(mpi_file_write_all+0x5b) [0x7f484c1b5fcb]
   16  epoch3d() [0x575da5]
   17  epoch3d() [0x576660]
   18  epoch3d() [0x497ea6]
   19  epoch3d() [0x4b10ee]
   20  epoch3d() [0x4be358]
   21  epoch3d() [0x40395d]
   22  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f484ae3cb45]
   23  epoch3d() [0x40398d]


Any idea what went wrong?

Thanks in advance
Denis Bertini