Hi all,

I am trying to run Open MPI in distributed mode. The setup is an 8-machine cluster running Debian 8 (Jessie), with Intel Xeon E5-2609 2.40 GHz CPUs and Mellanox QDR InfiniBand HCAs. My Open MPI version is 3.0.4. I can successfully run a simple command on all nodes that doesn't use InfiniBand, but when I run my experiments I receive the following error from one of the nodes:

--------------------------------------------------------------------------
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error: Invalid argument
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error: Invalid argument
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--------------------------------------------------------------------------
[euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp] error modifing QP to RTS errno says Invalid argument; errno=22

Note that I compile Open MPI from source on a shared NFS directory using the commands:

    ./configure --prefix=/path/to/NFS/
    make
    make install

The configuration is the same on all nodes of the cluster. I run my job using:

    /path/to/NFS/mpirun --hostfile hostfile ./executable_name

I do not get any error when I exclude this host (euler04). Is this a hardware error? Should I try a different MPI version?

Any help would be appreciated.

Thanks very much in advance for your help,
Dimitris