Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead.
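
A minimal check, assuming the same reservation and test binary as in your run:

export OMPI_MCA_pml=ob1
srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6

ob1 is the PML that drives the BTL components rather than UCX, so if this runs 
cleanly it confirms the UCX PML is the piece that is failing.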



> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski <novos...@rutgers.edu> wrote:
> 
> Thanks, Ralph. This /does/ change things, but not by much. I was not under 
> the impression that I needed to do that, since when I ran without having 
> built against UCX, it warned me that the openib method was deprecated. Does 
> OpenMPI no longer use either by default, so that I have to ask for UCX 
> specifically? That seems strange.
> 
> Anyhow, I still have some variables defined, in addition to your 
> suggestion, for verbosity:
> 
> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
> OMPI_MCA_pml=ucx
> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
> OMPI_MCA_pml_ucx_verbose=100
> 
> Here goes:
> 
> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
> ./mpihello-gcc-8-openmpi-4.0.6
> srun: job 13995650 queued and waiting for resources
> srun: job 13995650 has been allocated resources
> --
> WARNING: There was an error initializing an OpenFabrics device.
> 
>  Local host:   gpu004
>  Local device: mlx4_0
> --
> --
> WARNING: There was an error initializing an OpenFabrics device.
> 
>  Local host:   gpu004
>  Local device: mlx4_0
> --
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
> memory hooks as external events
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
> memory hooks as external events
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
> UCX version 1.5.2
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
> UCX version 1.5.2
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
> rc/mlx4_0:1: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
> ud/mlx4_0:1: did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29823] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support 
> level is none
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
> rc/mlx4_0:1: did not match transport list
> --
> No components were able to be opened in the pml framework.
> 
> This typically means that either no components of this type were
> installed, or none of the installed components can be loaded.
> Sometimes this means that shared libraries required by these
> components are unable to be found/loaded.
> 
>  Host:  gpu004
>  Framework: pml
> --
> [gpu004.amarel.rutgers.edu:29823] PML ucx cannot be selected
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
> ud/mlx4_0:1: did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: 
> did not match transport list
> 

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ryan Novosielski via users
Thanks, Ralph. This /does/ change things, but not by much. I was not under the 
impression that I needed to do that, since when I ran without having built 
against UCX, it warned me that the openib method was deprecated. Does OpenMPI 
no longer use either by default, so that I have to ask for UCX specifically? 
That seems strange.

Anyhow, I still have some variables defined, in addition to your suggestion, 
for verbosity:

[novosirj@amarel-test2 ~]$ env | grep ^OMPI
OMPI_MCA_pml=ucx
OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
OMPI_MCA_pml_ucx_verbose=100

Here goes:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
./mpihello-gcc-8-openmpi-4.0.6
srun: job 13995650 queued and waiting for resources
srun: job 13995650 has been allocated resources
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu004
  Local device: mlx4_0
--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu004
  Local device: mlx4_0
--
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:29823] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:  gpu004
  Framework: pml
--
[gpu004.amarel.rutgers.edu:29823] PML ucx cannot be selected
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:29824] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
--
No components 

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
Ryan - I suspect what Sergey was trying to say was that you need to ensure OMPI 
doesn't try to use the OpenIB driver, or at least that it doesn't attempt to 
initialize it. Try adding

OMPI_MCA_pml=ucx

to your environment.
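
For example, assuming a bash shell, something like:

export OMPI_MCA_pml=ucx

before launching with srun; under mpirun, the equivalent is passing 
--mca pml ucx on the command line.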


On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users 
<users@lists.open-mpi.org> wrote:

Hi

This issue comes from the openib BTL; it is not related to UCX.

From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski via 
users <users@lists.open-mpi.org>
Date: Thursday, 29 July 2021, 08:25
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Ryan Novosielski <novos...@rutgers.edu>
Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There 
was an error initializing an OpenFabrics device."

Hi there,

I'm new to using UCX; I came to it after building OpenMPI without it, running 
tests, and getting warnings. I installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI. It built fine. However, I'm getting some pretty unhelpful 
messages about the IB card not being used. I looked around the internet and set 
a couple of environment variables to get a little more information:

export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
./mpihello-gcc-8-openmpi-4.0.6 
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Sergey Oblomov via users
Hi

This issue comes from the openib BTL; it is not related to UCX.
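
If the goal is to keep OMPI from initializing it at all, a common approach (a 
sketch, assuming a bash shell; not something the warning itself prescribes) is 
to exclude the openib BTL by name:

export OMPI_MCA_btl=^openib

The leading ^ tells the MCA framework to exclude the listed components instead 
of selecting them.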

From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski 
via users <users@lists.open-mpi.org>
Date: Thursday, 29 July 2021, 08:25
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Ryan Novosielski <novos...@rutgers.edu>
Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There 
was an error initializing an OpenFabrics device."
Hi there,

I'm new to using UCX; I came to it after building OpenMPI without it, running 
tests, and getting warnings. I installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI. It built fine. However, I'm getting some pretty unhelpful 
messages about the IB card not being used. I looked around the internet and set 
a couple of environment variables to get a little more information:

export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
./mpihello-gcc-8-openmpi-4.0.6
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147