Re: [OMPI users] MPI failing on Infiniband (queue pair error)

2019-05-09 Thread Jeff Squyres (jsquyres) via users
You might want to try two things:

1. Upgrade to Open MPI v4.0.1.
2. Use the UCX PML instead of the openib BTL.

You may need to download/install UCX first.

Then configure Open MPI:

./configure --with-ucx --without-verbs --enable-mca-no-build=btl-uct ...

This will build the UCX PML, which should then be used by default when you 
run mpirun.
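
A rough sketch of how one might check this (not part of the original reply): ompi_info lists the components that were built, and "--mca pml ucx" asks mpirun for the UCX PML explicitly instead of relying on the default selection. The hostfile and executable names are the ones from the original post; the helper function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: confirm that UCX support was built, then request the UCX PML explicitly.
# "hostfile" and "./executable_name" are the names used in the original post.

# Print the mpirun command line that forces the UCX PML.
# $1 = hostfile, $2 = executable (hypothetical helper, for illustration only)
ucx_mpirun_cmd() {
    printf 'mpirun --mca pml ucx --hostfile %s %s\n' "$1" "$2"
}

# List UCX-related components if ompi_info is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i ucx || echo "no UCX components found in this build"
else
    echo "ompi_info not found on PATH; skipping component check"
fi

# Show the command to run; copy it, or pass it to "sh -c" to launch the job.
ucx_mpirun_cmd hostfile ./executable_name
```

If the UCX PML cannot be selected, mpirun should abort with an error rather than silently falling back, which makes a missing or broken UCX build easier to spot.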

Note that the "--enable-mca-no-build=btl-uct" option is needed because it 
looks like one of our plugins in the v4.0.1 release (the BTL UCT plugin, to be 
specific) does not compile successfully with the latest version of UCX.  This 
will be fixed in a subsequent Open MPI v4.0.x release.


> On May 9, 2019, at 10:17 AM, Koutsoukos Dimitrios via users wrote:
> 
> Hi all,
> 
> I am trying to run MPI in distributed mode. The cluster is an 8-machine 
> setup running Debian 8 (Jessie), with Intel Xeon E5-2609 2.40 GHz CPUs and 
> Mellanox QDR InfiniBand HCAs. My Open MPI version is 3.0.4. I can 
> successfully run a simple command on all nodes that doesn't use InfiniBand, 
> but when I run my experiments I receive the following error from one of the 
> nodes:
> --
> Failed to modify the attributes of a queue pair (QP):
> 
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error:Invalid argument
> --
> --
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
> 
> Hostname: euler04
> --
> --
> Failed to modify the attributes of a queue pair (QP):
> 
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error:Invalid argument
> --
> --
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
> 
> Hostname: euler04
> --
> [euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> 
> Note that I am compiling Open MPI from source on a shared NFS filesystem 
> using the commands:
> ./configure --prefix=/path/to/NFS/
> make
> make install
> The cluster configuration is the same on all nodes. I am running my job 
> using /path/to/NFS/mpirun --hostfile hostfile ./executable_name. I do not 
> receive any error when I exclude this host (euler04). Is this a hardware 
> error? Should I try a different MPI version? Any help would be appreciated.
> 
> Thanks very much in advance for your help,
> Dimitris
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com

[OMPI users] MPI failing on Infiniband (queue pair error)

2019-05-09 Thread Koutsoukos Dimitrios via users
Hi all,

I am trying to run MPI in distributed mode. The cluster is an 8-machine setup 
running Debian 8 (Jessie), with Intel Xeon E5-2609 2.40 GHz CPUs and Mellanox 
QDR InfiniBand HCAs. My Open MPI version is 3.0.4. I can successfully run a 
simple command on all nodes that doesn't use InfiniBand, but when I run my 
experiments I receive the following error from one of the nodes:
--
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error:Invalid argument
--
--
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--
--
Failed to modify the attributes of a queue pair (QP):

Hostname: euler04
Mask for QP attributes to be modified: 65537
Error:Invalid argument
--
--
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: euler04
--
[euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
[euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
 error modifing QP to RTS errno says Invalid argument; errno=22
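
To see at a glance which hosts are affected, the bracketed prefix of the udcm lines above can be tallied. This is only a sketch that assumes the line format shown in this log; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: count udcm listen-QP failures per host in captured mpirun output.
# Assumes lines of the form "[host][[jobid,1],rank][...udcm_module_create_listen_qp]".

count_qp_failures() {
    # Split on "[" and "]" so that field 2 is the hostname, then count per host.
    awk -F'[][]' '/udcm_module_create_listen_qp/ { n[$2]++ }
                  END { for (h in n) print h, n[h] }' | sort
}

# Example (hostfile and executable names as in this post):
#   mpirun --hostfile hostfile ./executable_name 2>&1 | count_qp_failures
```

For the log above, every matching line starts with [euler04], so the tally would list that single host.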

Note that I am compiling Open MPI from source on a shared NFS filesystem using 
the commands:
./configure --prefix=/path/to/NFS/
make
make install
The cluster configuration is the same on all nodes. I am running my job using 
/path/to/NFS/mpirun --hostfile hostfile ./executable_name. I do not receive 
any error when I exclude this host (euler04). Is this a hardware error? Should 
I try a different MPI version? Any help would be appreciated.
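
One check that might help narrow down the hardware question is to compare the InfiniBand port state on euler04 with the other nodes. The sketch below assumes that ibv_devinfo (from the libibverbs utilities) is installed on every node and prints its usual "state: PORT_..." lines; the function name and the ssh loop are illustrative assumptions, and an active port does not rule out other causes such as a driver or library mismatch.

```shell
#!/bin/sh
# Sketch: flag InfiniBand ports that are not in the PORT_ACTIVE state.
# Reads ibv_devinfo output on stdin; returns non-zero if any port is not active.

inactive_ports() {
    awk '$1 == "state:" && $2 != "PORT_ACTIVE" { print "not active: " $2; bad = 1 }
         END { exit bad + 0 }'
}

# Example (hypothetical; assumes passwordless ssh and one host per hostfile line):
#   for h in $(awk '{ print $1 }' hostfile); do
#       echo "== $h"
#       ssh "$h" ibv_devinfo | inactive_ports
#   done
```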

Thanks very much in advance for your help,
Dimitris