Hi all,

I am trying to run MPI in distributed mode. The setup is an 8-machine cluster
running Debian 8 (Jessie), with Intel Xeon E5-2609 2.40 GHz CPUs and Mellanox
QDR InfiniBand HCAs. My Open MPI version is 3.0.4. I can successfully run a
simple command on all nodes that doesn't use InfiniBand, but when I run my
experiments I receive the following error from one of the nodes:
-------------------------------------------------------------------------
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error:    Invalid argument
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error:    Invalid argument
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--------------------------------------------------------------------------
[euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
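
To double-check that every failing process sits on the same node, a minimal
sketch like the one below could tally the udcm error lines by hostname. The
function name failing_ranks_by_host is my own, and I am assuming that the last
number in the bracketed process name (e.g. the 29 in [[29717,1],29]) is the
process rank.

```python
import re
from collections import defaultdict

# Matches the start of lines such as:
# [euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:...]
# Assumption: the number after "]," is the rank of the reporting process.
LINE_RE = re.compile(r"^\[(?P<host>[^\]]+)\]\[\[\d+,\d+\],(?P<rank>\d+)\]\[")


def failing_ranks_by_host(log_text):
    """Return {hostname: sorted ranks} for bracketed error lines in log_text."""
    ranks = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            ranks[match.group("host")].add(int(match.group("rank")))
    return {host: sorted(found) for host, found in ranks.items()}


# A few lines copied from the output above, used as sample input.
sample = """\
[euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
"""

if __name__ == "__main__":
    print(failing_ranks_by_host(sample))
```

Run against the full output, this should list only euler04, which matches what
I see: the other seven nodes report nothing.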

Note that I am compiling MPI from source on a shared NFS mount using the commands:
./configure --prefix=/path/to/NFS/
make
make install
The cluster configuration is the same on all of the nodes. I am running my job
using /path/to/NFS/mpirun --hostfile hostfile ./executable_name. I do not
receive any error when I exclude this host. Is this a hardware error? Should I
try a different MPI version? Any help would be appreciated.

Thanks very much in advance for your help,
Dimitris
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
