Hi Matt,

A few comments/questions:

- If your cluster has Omni-Path, you won’t need UCX. Instead you can run using 
PSM2, or alternatively OFI (a.k.a. Libfabric); see the command sketches below.

- With the command you shared below (4 ranks on the local node), I think a 
shared-memory transport is being selected (vader?). So if the job is not 
starting, this looks like a runtime issue rather than a transport one... PMIx? 
Slurm? (See the command sketches just below.)
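
A couple of command sketches for the two points above (MCA component names as 
in Open MPI 4.x; the binary name is taken from Matt's mail below):

  # Omni-Path: run over PSM2, or over OFI/libfabric, instead of UCX
  mpirun --mca pml cm --mca mtl psm2 -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
  mpirun --mca pml cm --mca mtl ofi -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

  # pin the single-node case to the shared-memory BTL explicitly
  mpirun --mca pml ob1 --mca btl self,vader -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
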
Thanks
_MAC


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Friday, January 18, 2019 10:27 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org> wrote:
On Jan 18, 2019, at 12:43 PM, Matt Thompson 
<fort...@gmail.com> wrote:
>
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that embed a default path for finding the libraries at run time (which 
can/will in most cases override LD_LIBRARY_PATH -- though you can still 
override these linked-in default paths if you want/need to).
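
If you want to see exactly what the wrappers add, you can ask them directly 
(illustrative output only; the exact paths depend on your install prefix):

  mpicc --showme:link
  # with rpath/runpath enabled this typically includes something like
  #   -Wl,-rpath -Wl,<prefix>/lib ... -lmpi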

I've had these in my Open MPI builds for a while now. The reason was that one 
of the libraries I need for the climate model I work on went nuts if both of 
them weren't there. It was originally just the rpath one, but eventually (Open 
MPI 3?) I had to add the runpath one as well. I have been updating the 
libraries more aggressively recently (due to OS upgrades), though, so it's 
possible this is no longer needed.


> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.
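
A quick way to confirm the PSM2 component actually made it into the install 
(output wording may differ slightly between versions):

  ompi_info | grep -i psm2
  # should show an "MCA mtl: psm2 ..." component line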

> --with-slurm

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.
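
If UCX support does get built but you want to steer around it (or force it) at 
run time, a sketch (component names as in Open MPI 4.x; binary name is the one 
from later in this mail):

  mpirun --mca pml ucx -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
  # force the UCX PML

  mpirun --mca pml ^ucx -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
  # exclude UCX and let ob1/cm be selected instead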

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

Ah. I did not know that. I figured that if our SLURM was built against a 
specific PMIx v2, I should build Open MPI with the same PMIx. I'll build an 
Open MPI 4 without specifying this.
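
For what it's worth, a quick way to see which PMI flavors your Slurm provides 
(and therefore whether its pmix plugin is present at all) is the following, 
assuming your Slurm is recent enough to have the option:

  srun --mpi=list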


> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.
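
If you ever want to double-check which compilers an existing build actually 
used, ompi_info reports them (the grep pattern is just a convenience; output 
wording may vary between versions):

  ompi_info | grep -i compiler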

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
>
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
>
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?
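
One way to narrow down where the hang is, inside the allocation (a sketch; bump 
the verbosity level if 10 is not chatty enough):

  mpirun -np 4 hostname
  # no MPI_Init involved, so this exercises only the launcher/PMIx path

  mpirun --mca plm_base_verbose 10 -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
  # verbose output from the process-launch framework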

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
(as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy 
on this cluster. I once managed to get Open MPI 3.0.1 working (a few months 
ago), and it had some interesting startup scaling I liked (slow at low core 
count, but getting close to Intel MPI at high core count), though it seemed to 
not work after about 100 nodes (4000 processes) or so.

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
