Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."
I'd suggest opening a ticket on the UCX repo itself. This looks to me like UCX not recognizing a Mellanox device, or at least not initializing it correctly.

> On Aug 11, 2021, at 8:21 AM, Ryan Novosielski wrote:
>
> Thanks. That /is/ one solution, and what I’ll do in the interim since this has to work in at least some fashion, but I would actually like to use UCX if OpenIB is going to be deprecated. How do I find out what’s actually wrong?
>
> --
> #BlackLivesMatter
>
> || \\UTGERS,      |---*O*---
> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
> || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
>> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users wrote:
>>
>> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
>>
>>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski wrote:
>>>
>>> Thanks, Ralph. This /does/ change things, but not very much. I was not under the impression that I needed to do that, since when I ran without having built against UCX, it warned me about the openib method being deprecated. By default, does OpenMPI not use either anymore, and I need to specifically call for UCX? Seems strange.
>>>
>>> Anyhow, I’ve got some variables defined still, in addition to your suggestion, for verbosity:
>>>
>>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>>> OMPI_MCA_pml=ucx
>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>> OMPI_MCA_pml_ucx_verbose=100
>>>
>>> Here goes:
>>>
>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>>> srun: job 13995650 queued and waiting for resources
>>> srun: job 13995650 has been allocated resources
>>> --
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: gpu004
>>> Local device: mlx4_0
>>> --
>>> --
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: gpu004
>>> Local device: mlx4_0
>>> --
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824]
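
For reference, a quick way to see whether UCX itself recognizes the mlx4_0 HCA, per the suggestion above, is UCX's own ucx_info diagnostic (a minimal sketch, assuming the ucx_info binary from the same UCX 1.5.2 installation is on the PATH):

  # Confirm which UCX build is actually being picked up
  ucx_info -v

  # List the transports and devices UCX detects; mlx4_0 should appear
  # under the rc/ud transports if the HCA initialized correctly
  ucx_info -d | grep -i mlx4

If mlx4_0 is missing or reported with an error there, the failure is below OpenMPI, which would support raising the issue on the UCX repo.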
Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."
Thanks. That /is/ one solution, and what I’ll do in the interim since this has to work in at least some fashion, but I would actually like to use UCX if OpenIB is going to be deprecated. How do I find out what’s actually wrong?

--
#BlackLivesMatter

|| \\UTGERS,      |---*O*---
||_// the State   | Ryan Novosielski - novos...@rutgers.edu
|| \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users wrote:
>
> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
>
>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski wrote:
>>
>> Thanks, Ralph. This /does/ change things, but not very much. I was not under the impression that I needed to do that, since when I ran without having built against UCX, it warned me about the openib method being deprecated. By default, does OpenMPI not use either anymore, and I need to specifically call for UCX? Seems strange.
>>
>> Anyhow, I’ve got some variables defined still, in addition to your suggestion, for verbosity:
>>
>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>> OMPI_MCA_pml=ucx
>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>> OMPI_MCA_pml_ucx_verbose=100
>>
>> Here goes:
>>
>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>> srun: job 13995650 queued and waiting for resources
>> srun: job 13995650 has been allocated resources
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>> Local host: gpu004
>> Local device: mlx4_0
>> --
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>> Local host: gpu004
>> Local device: mlx4_0
>> --
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>> --
>> No components were able to be opened in the pml framework.
>>
>> This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that
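
As an interim workaround while the UCX problem is tracked down, the ob1 fallback suggested earlier in the thread can be selected per job (a sketch, reusing the srun invocation from the run above):

  # Fall back to the ob1 PML instead of UCX for this run
  export OMPI_MCA_pml=ob1
  srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6

For mpirun-launched jobs the equivalent is "--mca pml ob1"; with srun, exporting the OMPI_MCA_* variable is the simplest route.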