You might want to try two things:
1. Upgrade to Open MPI v4.0.1.
2. Use the UCX PML instead of the openib BTL.
You may need to download/install UCX first.
Then configure Open MPI:
./configure --with-ucx --without-verbs --enable-mca-no-build=btl-uct ...
This will build the UCX PML, and that should get used by default when you
mpirun.
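To confirm that the UCX PML was actually built, something like the following
may help. It is only a sketch: it assumes the ompi_info from the new install is
first in your PATH and that it lists components in the usual
"MCA pml: ucx (...)" form. The function name have_ucx_pml is made up for
illustration.

```shell
#!/bin/sh
# Sketch: report whether this Open MPI install includes the UCX PML.
# Assumes ompi_info from the new install is first in PATH.
have_ucx_pml() {
    # Return 2 if Open MPI's ompi_info cannot be found at all.
    command -v ompi_info >/dev/null 2>&1 || return 2
    # Return 0 if a "MCA pml: ucx" component line is listed, 1 otherwise.
    ompi_info 2>/dev/null | grep 'MCA pml: ucx' >/dev/null
}

rc=0
have_ucx_pml || rc=$?
case $rc in
    0) echo "UCX PML available" ;;
    2) echo "ompi_info not found in PATH" ;;
    *) echo "UCX PML not built" ;;
esac
```

If the component is there, you can also request it explicitly with
"mpirun --mca pml ucx --hostfile hostfile ./executable_name". As far as I know,
that makes the job abort rather than quietly pick another PML if UCX cannot be
used.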
Note that the "--enable-mca-no-build..." option is because it looks like we
have a plugin (the BTL UCT plugin, to be specific) in the v4.0.1 release that
does not compile successfully with the latest version of UCX. This will be
fixed in a subsequent Open MPI v4.0.x release.
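If you script your builds, the flag choice above could be captured roughly like
this. It is a sketch under assumptions: the helper name ompi_configure_args and
the /opt/ucx install prefix are hypothetical, and the btl-uct workaround is
added for exactly v4.0.1 only, since the v4.0.x release that carries the fix is
not known here.

```shell
#!/bin/sh
# Print configure arguments for building Open MPI against UCX.
# $1: Open MPI version being built; $2: UCX install prefix (hypothetical path).
ompi_configure_args() {
    version=$1
    ucx_prefix=$2
    args="--with-ucx=$ucx_prefix --without-verbs"
    # Workaround described above: skip the BTL UCT plugin on v4.0.1 only.
    if [ "$version" = "4.0.1" ]; then
        args="$args --enable-mca-no-build=btl-uct"
    fi
    echo "$args"
}

ompi_configure_args 4.0.1 /opt/ucx
# prints: --with-ucx=/opt/ucx --without-verbs --enable-mca-no-build=btl-uct
```

The output is meant to be spliced into the configure line, for example
"./configure $(ompi_configure_args 4.0.1 /opt/ucx) ...".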
> On May 9, 2019, at 10:17 AM, Koutsoukos Dimitrios via users
> wrote:
>
> Hi all,
>
> I am trying to run Open MPI in distributed mode. The cluster setup is an
> 8-machine cluster with Debian 8 (Jessie), Intel Xeon E5-2609 2.40 GHz and
> Mellanox QDR InfiniBand HCAs. My Open MPI version is 3.0.4. I can
> successfully run a simple command on all nodes that doesn’t use InfiniBand,
> but when I run my experiments I receive the following error from one of the
> nodes:
> -
> Failed to modify the attributes of a queue pair (QP):
>
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error:Invalid argument
> --
> --
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly. This may
> indicate a problem on this system.
>
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
>
> Hostname: euler04
> --
> --
> Failed to modify the attributes of a queue pair (QP):
>
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error:Invalid argument
> --
> --
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly. This may
> indicate a problem on this system.
>
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
>
> Hostname: euler04
> --
> [euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
>
> Note that I am compiling MPI from source on a shared NFS using the commands:
> ./configure --prefix=/path/to/NFS/
> make
> make install
> My cluster configuration is also the same on all of the nodes. I am
> running my job using /path/to/NFS/mpirun --hostfile hostfile
> ./executable_name. I do not receive any error when I exclude this
> host. Is this a hardware error? Should I try a different MPI version? Any
> help would be appreciated.
>
> Thanks very much in advance for your help,
> Dimitris
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
--
Jeff Squyres
jsquy...@cisco.com