Re: [OMPI users] (Fedora 34, x86_64-pc-linux-gnu, openmpi-4.1.1.tar.gz): PML ucx cannot be selected

2021-05-28 Thread Gilles Gouaillardet via users
Jorge,

pml/ucx used to be selected even when no fast interconnect was detected
(since UCX provides drivers for both TCP and shared memory).
These providers are now disabled by default, so unless your machine
has a supported fast interconnect (such as InfiniBand),
pml/ucx cannot be used out of the box anymore.
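
You can double check which transports UCX detects on your node with
the ucx_info tool that ships with UCX; on a notebook without IB you
would typically only see something like tcp, shared memory (posix,
sysv, cma) and self:

$ ucx_info -d | grep Transport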

If you really want to use pml/ucx on your notebook, you need to
manually re-enable these providers.
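
For example, with Open MPI 4.1.x the filtering is done via the
pml_ucx_tls and pml_ucx_devices MCA parameters, so (if I recall
correctly) setting both to "any" should restore the previous behavior:

$ mpirun --mca pml ucx --mca pml_ucx_tls any --mca pml_ucx_devices any \
    --np 2 hello_usempi_f08.exe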

That being said, your best choice here is really not to force any pml
and to let Open MPI use pml/ob1, which supports both TCP and shared
memory.
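
For example, simply dropping "--mca pml ucx" from your command line
should be enough:

$ mpirun --mca btl self,vader,tcp --np 2 hello_usempi_f08.exe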

Cheers,

Gilles

On Sat, May 29, 2021 at 11:19 AM Jorge D'Elia via users
 wrote:
>
> Hi,
>
> We routinely build Open MPI from source with gcc on
> x86_64-pc-linux-gnu machines, and usually everything works fine.
>
> We recently installed Fedora 34 from scratch on an ASUS G53SX
> notebook (Intel Core i7-2630QM CPU 2.00GHz, 4 cores, no IB device),
> then built Open MPI from openmpi-4.1.1.tar.gz with the GCC 12.0.0
> 20210524 (experimental) compiler.
>
> However, when trying to run a simple test with Open MPI using UCX,
> we get these runtime errors:
>
>   No components were able to be opened in the btl framework.
>   PML ucx cannot be selected
>
> The same test worked fine up to Fedora 33 on the same machine with
> the same Open MPI configuration.
>
> Any clues on where to check, or is something missing?
> Thanks in advance.
>
> Regards
> Jorge.
>
> [... system info and logs trimmed; see the original message below ...]

[OMPI users] (Fedora 34, x86_64-pc-linux-gnu, openmpi-4.1.1.tar.gz): PML ucx cannot be selected

2021-05-28 Thread Jorge D'Elia via users
Hi,

We routinely build Open MPI from source with gcc on
x86_64-pc-linux-gnu machines, and usually everything works fine.

We recently installed Fedora 34 from scratch on an ASUS G53SX
notebook (Intel Core i7-2630QM CPU 2.00GHz, 4 cores, no IB device),
then built Open MPI from openmpi-4.1.1.tar.gz with the GCC 12.0.0
20210524 (experimental) compiler.

However, when trying to run a simple test with Open MPI using UCX,
we get these runtime errors:

  No components were able to be opened in the btl framework.
  PML ucx cannot be selected

The same test worked fine up to Fedora 33 on the same machine
with the same Open MPI configuration.

Some info from a simple test run is attached below.

Any clues on where to check, or is something missing?
Thanks in advance.

Regards
Jorge.

--
$ cat /proc/version
Linux version 5.12.7-300.fc34.x86_64 (mockbu...@bkernel01.iad2.fedoraproject.org) (gcc (GCC) 11.1.1 20210428 (Red Hat 11.1.1-1), GNU ld version 2.35.1-41.fc34) #1 SMP Wed May 26 12:58:58 UTC 2021

$ mpifort --version
GNU Fortran (GCC) 12.0.0 20210524 (experimental)
Copyright (C) 2021 Free Software Foundation, Inc.

$ which mpifort
/usr/beta/openmpi/bin/mpifort

$ mpifort -o hello_usempi_f08.exe hello_usempi_f08.f90

$ mpirun --mca orte_base_help_aggregate 0 --mca btl self,vader,tcp --map-by node --report-bindings --machinefile ~/machi-openmpi.dat --np 2 hello_usempi_f08.exe
[verne:200650] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.]
[verne:200650] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.]
Hello, world, I am  0 of  2: Open MPI v4.1.1, package: Open MPI bigpack@verne Distribution, ident: 4.1.1, repo rev: v4.1.1, Apr 24, 2021
Hello, world, I am  1 of  2: Open MPI v4.1.1, package: Open MPI bigpack@verne Distribution, ident: 4.1.1, repo rev: v4.1.1, Apr 24, 2021

$ mpirun --mca orte_base_help_aggregate 0 --mca pml ucx --mca btl ^self,vader,tcp --map-by node --report-bindings --machinefile ~/machi-openmpi.dat --np 2 hello_usempi_f08.exe
[verne:200772] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.]
[verne:200772] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.]
--
No components were able to be opened in the btl framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      verne
  Framework: btl
--
--
No components were able to be opened in the btl framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      verne
  Framework: btl
--
--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      verne
  Framework: pml
--
[verne:200777] PML ucx cannot be selected
--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      verne
  Framework: pml
--
[verne:200772] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198


$ ompi_info | grep ucx
  Configure command line: '--enable-ipv6' '--enable-sparse-groups' '--enable-mpi-ext' '--enable-mpi-cxx' '--enable-oshmem' '--with-libevent=internal' '--with-ucx' '--with-pmix=internal' '--without-libfabric' '--prefix=/usr/beta/openmpi'
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.1)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.1)

$ ompi_info --param all all --level 9 | grep ucx
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.1)