Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ralph Castain via users
I'd suggest opening a ticket on the UCX repo itself. This looks to me like UCX 
not recognizing a Mellanox device, or at least not initializing it correctly.
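
Before opening the ticket, it is worth confirming what UCX itself detects on the node. A minimal sketch of the usual checks, assuming the UCX command-line tools and the standard InfiniBand utilities are installed on gpu004:

$ ucx_info -v                       # version of the UCX library actually in use
$ ucx_info -d | grep -B1 -A6 mlx4   # transports and devices UCX detects for the HCA
$ ibv_devinfo -d mlx4_0             # confirm the Verbs layer sees the device and its port state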


> On Aug 11, 2021, at 8:21 AM, Ryan Novosielski  wrote:
> 
> Thanks. That /is/ one solution, and what I’ll do in the interim since this 
> has to work in at least some fashion, but I would actually like to use UCX if 
> OpenIB is going to be deprecated. How do I find out what’s actually wrong?
> 
> --
> #BlackLivesMatter
> 
> || \\UTGERS, |---*O*---
> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>  `'
> 
>> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users 
>>  wrote:
>> 
>> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
>> 
>>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski  wrote:
>>> 
>>> Thanks, Ralph. This /does/ change things, but not very much. I was not 
>>> under the impression that I needed to do that, since when I ran without 
>>> having built against UCX, it warned me about the openib method being 
>>> deprecated. By default, does OpenMPI not use either anymore, and I need to 
>>> specifically call for UCX? Seems strange.
>>> 
>>> Anyhow, I’ve got some variables defined still, in addition to your 
>>> suggestion, for verbosity:
>>> 
>>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>>> OMPI_MCA_pml=ucx
>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>> OMPI_MCA_pml_ucx_verbose=100
>>> 
>>> Here goes:
>>> 
>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>>> srun: job 13995650 queued and waiting for resources
>>> srun: job 13995650 has been allocated resources
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> 
>>> Local host:   gpu004
>>> Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> 
>>> Local host:   gpu004
>>> Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] 

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ryan Novosielski via users
Thanks. That /is/ one solution, and what I’ll do in the interim since this has 
to work in at least some fashion, but I would actually like to use UCX if 
OpenIB is going to be deprecated. How do I find out what’s actually wrong?
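
One way to see what is actually going wrong is to make UCX itself report why each transport is rejected. A sketch using standard UCX environment variables (nothing here is specific to this cluster):

$ export UCX_LOG_LEVEL=debug          # UCX logs its transport-selection decisions
$ export UCX_NET_DEVICES=mlx4_0:1     # restrict UCX to the HCA port that is failing
$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6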

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'
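
For the interim, the ob1 fallback suggested below is just a matter of overriding the PML selection; a minimal sketch with the same launch line as before:

$ export OMPI_MCA_pml=ob1             # use the ob1 PML instead of UCX
$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6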

> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users 
>  wrote:
> 
> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
> 
>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski  wrote:
>> 
>> Thanks, Ralph. This /does/ change things, but not very much. I was not under 
>> the impression that I needed to do that, since when I ran without having 
>> built against UCX, it warned me about the openib method being deprecated. By 
>> default, does OpenMPI not use either anymore, and I need to specifically 
>> call for UCX? Seems strange.
>> 
>> Anyhow, I’ve got some variables defined still, in addition to your 
>> suggestion, for verbosity:
>> 
>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>> OMPI_MCA_pml=ucx
>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>> OMPI_MCA_pml_ucx_verbose=100
>> 
>> Here goes:
>> 
>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>> srun: job 13995650 queued and waiting for resources
>> srun: job 13995650 has been allocated resources
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>> Local host:   gpu004
>> Local device: mlx4_0
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>> Local host:   gpu004
>> Local device: mlx4_0
>> --------------------------------------------------------------------------
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>> --------------------------------------------------------------------------
>> No components were able to be opened in the pml framework.
>> 
>> This typically means that either no components of this type were
>> installed, or none of the installed components can be loaded.
>> Sometimes this means that shared libraries required by these
>> components are unable to be found/loaded.
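
As a quick sanity check on the "No components were able to be opened" error, ompi_info can list which pml components this installation can actually load (a generic check, not output captured from this system):

$ ompi_info | grep "MCA pml"          # should list the PMLs that were built, e.g. ob1 and ucx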