Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ryan Novosielski via users
c:304 tcp/ib0: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>> rc/mlx4_0:1: did not match transport list
>> --
>> No components were able to be opened in the pml framework.
>> 
>> This typically means that either no components of this type were
>> installed, or none of the installed components can be loaded.
>> Sometimes this means that shared libraries required by these
>> components are unable to be found/loaded.
>> 
>> Host:  gpu004
>> Framework: pml
>> --
>> [gpu004.amarel.rutgers.edu:29823] PML ucx cannot be selected
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>> ud/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support 
>> level is none
>> --
>> No components were able to be opened in the pml framework.
>> 
>> This typically means that either no components of this type were
>> installed, or none of the installed components can be loaded.
>> Sometimes this means that shared libraries required by these
>> components are unable to be found/loaded.
>> 
>> Host:  gpu004
>> Framework: pml
>> --
>> [gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
>> slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 
>> 2021-07-29T11:31:19 ***
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: gpu004: tasks 0-1: Exited with exit code 1
>> 
>> --
>> #BlackLivesMatter
>> 
>> || \\UTGERS, |---*O*---
>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>>  `'
>> 
>>> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users 
>>>  wrote:
>>> 
>>> Ryan - I suspect what Sergey was trying to say was that you need to ensure 
>>> OMPI doesn't try to use the OpenIB driver, or at least that it doesn't 
>>> attempt to initialize it. Try adding
>>> 
>>> OMPI_MCA_pml=ucx
>>> 
>>> to your environment.
>>> 
>>> 
>>>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users 
>>>>  wrote:
>>>> 
>>>> Hi
>>>> 
>>>> This issue comes from the openib BTL; it is not related to UCX.
>>>> 
>>>> From: users  on behalf of Ryan 
>>>> Novosielski via users 
>>>> Date: Thursday, 29 July 2021, 08:25
>>>> To: users@lists.open-mpi.org 
>>>> Cc: Ryan Novosielski 
>>>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: 
>>>> There was an error initializing an OpenFabrics device."
>>>> 
>>>> Hi there,
>>>> 
>>>> I'm new to using UCX: I originally built OpenMPI without it, and running 
>>>> tests produced warnings about it. I installed UCX from the distribution:
>>>> 
>>>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>>>> ucx-1.5.2-1.el7.x86_64
>>>> 
>>>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty 
>>>> unhelpful messages about not using the IB card. I looked around the 
>>>> internet some and set a couple of environment variables to get a little 
>>>> more information:
>>>> 
>>>> export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>>> export OMPI_MCA_pml_ucx_verbose=100
>>>> 
>>>> Here’s what happens:
>>>> 
>

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ryan Novosielski via users
 were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:  gpu004
  Framework: pml
--
[gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 
2021-07-29T11:31:19 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: gpu004: tasks 0-1: Exited with exit code 1

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users 
>  wrote:
> 
> Ryan - I suspect what Sergey was trying to say was that you need to ensure 
> OMPI doesn't try to use the OpenIB driver, or at least that it doesn't 
> attempt to initialize it. Try adding
> 
> OMPI_MCA_pml=ucx
> 
> to your environment.
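[Ralph's suggestion above can be applied like this; a sketch, where the `mpirun` form shown in the comment is the command-line equivalent if you launch with mpirun rather than srun:]

```shell
# Force Open MPI to select the UCX PML instead of silently falling
# back to ob1/openib when UCX initialization fails.
export OMPI_MCA_pml=ucx

# Command-line equivalent when launching with mpirun:
#   mpirun --mca pml ucx ./your_mpi_program
```

With the PML pinned this way, a UCX failure aborts the job with a clear error instead of producing the fallback warnings.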
> 
> 
>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users 
>>  wrote:
>> 
>> Hi
>>  
>> This issue comes from the openib BTL; it is not related to UCX.
>>  
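[If the warning itself is the problem, one common workaround on 4.0.x is to exclude the openib BTL so it never initializes; a sketch, and an assumption on my part rather than something stated in the thread:]

```shell
# Exclude the openib BTL so it is never initialized; UCX (or TCP)
# then carries the traffic. The ^ prefix means "everything except".
export OMPI_MCA_btl=^openib

# mpirun equivalent:
#   mpirun --mca btl ^openib ./your_mpi_program
```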
>> From: users  on behalf of Ryan Novosielski 
>> via users 
>> Date: Thursday, 29 July 2021, 08:25
>> To: users@lists.open-mpi.org 
>> Cc: Ryan Novosielski 
>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There 
>> was an error initializing an OpenFabrics device."
>> 
>> Hi there,
>> 
>> I'm new to using UCX: I originally built OpenMPI without it, and running 
>> tests produced warnings about it. I installed UCX from the distribution:
>> 
>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>> ucx-1.5.2-1.el7.x86_64
>> 
>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful 
>> messages about not using the IB card. I looked around the internet some and 
>> set a couple of environment variables to get a little more information:
>> 
>> export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>> export OMPI_MCA_pml_ucx_verbose=100
>> 
>> Here’s what happens:
>> 
>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
>> ./mpihello-gcc-8-openmpi-4.0.6 
>> srun: job 13993927 queued and waiting for resources
>> srun: job 13993927 has been allocated resources
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>>  Local host:   gpu004
>>  Local device: mlx4_0
>> --
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>>  Local host:   gpu004
>>  Local device: mlx4_0
>> --
>> [gpu004.amarel.rutgers.edu:02327] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02327] 
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>> rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.ed

[OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-28 Thread Ryan Novosielski via users
Hi there,

I'm new to using UCX: I originally built OpenMPI without it, and running 
tests produced warnings about it. I installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful 
messages about not using the IB card. I looked around the internet some and set 
a couple of environment variables to get a little more information:

export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
./mpihello-gcc-8-openmpi-4.0.6 
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02326] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2 processors
Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out