Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-23 Thread Matt Thompson
_MAC, et al,

Things are looking up. After rebuilding with --with-verbs=no, I can run
helloworld. But in a new-for-me wrinkle, I can only run on *more* than one
node. Not sure I've ever seen that. Using 40-core nodes,
this:

mpirun -np 41 ./helloWorld.mpi3.SLES12.OMPI400.exe

works, and -np 40 fails:

(1027)(master) $ mpirun -np 40 ./helloWorld.mpi3.SLES12.OMPI400.exe
[borga033:05598] *** An error occurred in MPI_Barrier
[borga033:05598] *** reported by process [140735567101953,140733193388034]
[borga033:05598] *** on communicator MPI_COMM_WORLD
[borga033:05598] *** MPI_ERR_OTHER: known error not in list
[borga033:05598] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borga033:05598] ***    and potentially your MPI job)
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover21
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
forrtl: error (78): process killed (SIGTERM)
Image              PC            Routine            Line     Source
helloWorld.mpi3.S  0040A38E      for__signal_handl  Unknown  Unknown
libpthread-2.22.s  2B9CCB20      Unknown            Unknown  Unknown
libpthread-2.22.s  2B9CC3ED      __nanosleep        Unknown  Unknown
libopen-rte.so.40  2C3C5854      orte_show_help_no  Unknown  Unknown
libopen-rte.so.40  2C3C5595      orte_show_help     Unknown  Unknown
libmpi.so.40.20.0  2B3BADC5      ompi_mpi_errors_a  Unknown  Unknown
libmpi.so.40.20.0  2B3B99D9      ompi_errhandler_i  Unknown  Unknown
libmpi.so.40.20.0  2B3E4586      MPI_Barrier        Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53      MPI_Barrier_f08    Unknown  Unknown
libmpi_usempif08.  2ACE7742      mpi_barrier_f08_   Unknown  Unknown
helloWorld.mpi3.S  0040939F      Unknown            Unknown  Unknown
helloWorld.mpi3.S  0040915E      Unknown            Unknown  Unknown
libc-2.22.so       2BBF96D5      __libc_start_main  Unknown  Unknown
helloWorld.mpi3.S  00409069      Unknown            Unknown  Unknown

So, I'm getting closer but I have to admit I've never built an MPI stack
before where running on a single node was the broken bit!
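
For anyone following along, the boundary between the failing and working cases is just a node-count calculation. Here is a minimal sketch, assuming ranks fill one 40-core node before spilling onto the next; the helper name nodes_needed is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: number of nodes a job spans when ranks are packed
# onto nodes with a fixed core count (40 on this cluster).
# $1 = number of ranks (-np), $2 = cores per node.
nodes_needed() {
    # Integer ceiling of ranks / cores-per-node.
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 40 40   # prints 1: the single-node case that fails
nodes_needed 41 40   # prints 2: the multi-node case that works
```

So -np 40 stays entirely on one node, while -np 41 is the smallest job that has to cross nodes.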

On Tue, Jan 22, 2019 at 1:31 PM Cabral, Matias A wrote:

> Hi Matt,
>
> There seem to be two different issues here:
>
> a)  The warning message comes from the openib btl. Given that
> Omnipath has a verbs API and you have the necessary libraries in your system,
> openib btl finds itself as a potential transport and prints the warning
> during its init (openib btl is on its way to deprecation). You may try to
> explicitly ask for vader btl given you are running on shared mem: -mca btl
> self,vader -mca pml ob1. Or better, explicitly build without openib:
> ./configure --with-verbs=no …
>
> b)  Not my field of expertise, but you may be having some conflict
> with the external components you are using:
> --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr . You may try not
> specifying these and using the ones provided by OMPI.
>
> _MAC
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Matt
> Thompson
> *Sent:* Tuesday, January 22, 2019 6:04 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Help Getting Started with Open MPI and PMIx
> and UCX
>
> Well,
>
> By turning off UCX compilation per Howard, things get a bit better in that
> something happens! It's not a good something, as it seems to die with an
> infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
> libverbs somewhere and compiling it in? To wit:
>
> (1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> --
> By default, for Open MPI 4.0 and later, infiniband ports on a device
> are not used by default.  The intent is to use UCX for these devices.
> You can override this policy by setting the btl_openib_allow_ib MCA parameter
> to true.
>
>   Local host:  borgc129
>   Local adapter:   hfi1_0
>   Local port:  1
>
> --
> --
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   borgc129
>   Local device: hfi1_0
> --
> Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
> running on Intel(R) 64, Version 18.0.5.274 Build 20180823
> MPI Version: 3.1
> MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23
> Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
> [borgc129:260830] *** An error 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Cabral, Matias A
Hi Matt,

There seem to be two different issues here:

a)  The warning message comes from the openib btl. Given that Omnipath has a 
verbs API and you have the necessary libraries in your system, openib btl finds 
itself as a potential transport and prints the warning during its init (openib 
btl is on its way to deprecation). You may try to explicitly ask for vader btl 
given you are running on shared mem: -mca btl self,vader -mca pml ob1. Or 
better, explicitly build without openib: ./configure --with-verbs=no …

b)  Not my field of expertise, but you may be having some conflict with the 
external components you are using: --with-pmix=/usr/nlocal/pmix/2.1 
--with-libevent=/usr . You may try not specifying these and using the ones 
provided by OMPI.
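
Spelled out as command lines, the two options in (a) might look like the sketch below. It only prints the commands so they can be reviewed first. The executable name and -np 4 are taken from your earlier run, and any other configure flags you need are deliberately left out:

```shell
#!/bin/sh
exe=./helloWorld.mpi3.SLES12.OMPI400.exe

# Run-time option from (a): restrict Open MPI to the self and vader
# (shared-memory) BTLs with the ob1 PML.
run_cmd="mpirun -np 4 -mca btl self,vader -mca pml ob1 $exe"

# Build-time option from (a): configure without verbs (openib) support.
# Per (b), leaving out --with-pmix and --with-libevent lets Open MPI use
# its bundled copies.
configure_cmd="./configure --with-verbs=no"

# Print rather than execute.
echo "$run_cmd"
echo "$configure_cmd"
```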

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Tuesday, January 22, 2019 6:04 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

Well,

By turning off UCX compilation per Howard, things get a bit better in that 
something happens! It's not a good something, as it seems to die with an 
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing 
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications 
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23 
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgc129:260830] ***    and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image              PC            Routine            Line     Source
helloWorld.mpi3.S  0040A38E      for__signal_handl  Unknown  Unknown
libpthread-2.22.s  2B9CCB20      Unknown            Unknown  Unknown
libpthread-2.22.s  2B9C90CD      pthread_cond_wait  Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_Abort         Unknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort        Unknown  Unknown
mca_ess_pmi.so     2AAAB1724BC0  Unknown            Unknown  Unknown
libopen-rte.so.40  2C3E941C      orte_errmgr_base_  Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown            Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4      ompi_mpi_abort     Unknown  Unknown
libmpi.so.40.20.0  2B3BB1EF      ompi_mpi_errors_a  Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9      ompi_errhandler_i  Unknown  Unknown
libmpi.so.40.20.0  2B3E4576      MPI_Barrier        Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53      MPI_Barrier_f08    Unknown  Unknown
libmpi_usempif08.  2ACE7732      mpi_barrier_f08_   Unknown  Unknown
helloWorld.mpi3.S  0040939F      Unknown            Unknown  Unknown
helloWorld.mpi3.S  0040915E      Unknown            Unknown  Unknown
libc-2.22.so       2BBF96D5      __libc_start_main  Unknown  Unknown
helloWorld.mpi3.S  00409069      Unknown            Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard wrote:
Hi Matt

Definitely do not include the ucx option for an Omni-Path cluster. Actually, if 
UCX was accidentally installed in its default location on the system, switch to 
this configure option:

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750

Howard


Gilles Gouaillardet wrote on Sat, Jan 19, 2019 at 18:41:
Matt,

There are two ways of using PMIx

- if you use mpirun, then the MPI app (i.e. the PMIx client) will talk
to mpirun and orted daemons (i.e. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the
PMIx server provided by SLURM. 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Matt Thompson
Well,

By turning off UCX compilation per Howard, things get a bit better in that
something happens! It's not a good something, as it seems to die with an
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgc129:260830] ***    and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image              PC            Routine            Line     Source
helloWorld.mpi3.S  0040A38E      for__signal_handl  Unknown  Unknown
libpthread-2.22.s  2B9CCB20      Unknown            Unknown  Unknown
libpthread-2.22.s  2B9C90CD      pthread_cond_wait  Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_Abort         Unknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort        Unknown  Unknown
mca_ess_pmi.so     2AAAB1724BC0  Unknown            Unknown  Unknown
libopen-rte.so.40  2C3E941C      orte_errmgr_base_  Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown            Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4      ompi_mpi_abort     Unknown  Unknown
libmpi.so.40.20.0  2B3BB1EF      ompi_mpi_errors_a  Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9      ompi_errhandler_i  Unknown  Unknown
libmpi.so.40.20.0  2B3E4576      MPI_Barrier        Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53      MPI_Barrier_f08    Unknown  Unknown
libmpi_usempif08.  2ACE7732      mpi_barrier_f08_   Unknown  Unknown
helloWorld.mpi3.S  0040939F      Unknown            Unknown  Unknown
helloWorld.mpi3.S  0040915E      Unknown            Unknown  Unknown
libc-2.22.so       2BBF96D5      __libc_start_main  Unknown  Unknown
helloWorld.mpi3.S  00409069      Unknown            Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard wrote:

> Hi Matt
>
> Definitely do not include the ucx option for an Omni-Path cluster.
> Actually, if UCX was accidentally installed in its default location on
> the system, switch to this configure option:
>
> --with-ucx=no
>
> Otherwise you will hit
>
> https://github.com/openucx/ucx/issues/750
>
> Howard
>
>
> Gilles Gouaillardet wrote on Sat, Jan 19, 2019 at 18:41:
>
>> Matt,
>>
>> There are two ways of using PMIx
>>
>> - if you use mpirun, then the MPI app (i.e. the PMIx client) will talk
>> to mpirun and orted daemons (i.e. the PMIx server)
>> - if you use SLURM srun, then the MPI app will directly talk to the
>> PMIx server provided by SLURM. (note you might have to srun
>> --mpi=pmix_v2 or something)
>>
>> In the former case, it does not matter whether you use the embedded or
>> external PMIx.
>> In the latter case, Open MPI and SLURM have to use compatible PMIx
>> libraries, and you can either check the cross-version compatibility
>> matrix,
>> or build Open MPI with the same PMIx used by SLURM to be on the safe
>> side (not a bad idea IMHO).
>>
>>
>> Regarding the hang, I suggest you try different things:
>> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
>> runs on a compute node rather than on a frontend node)
>> - try something even simpler such as mpirun hostname (both with sbatch
>> and salloc)
>> - explicitly specify the network to be used for the wire-up. You can,
>> for example, run mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
>> the network subnet by which all the nodes (e.g. compute nodes and
>> frontend node if you use salloc) communicate.
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-20 Thread Howard Pritchard
Hi Matt

Definitely do not include the ucx option for an Omni-Path cluster. Actually,
if UCX was accidentally installed in its default location on the system,
switch to this configure option:

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750
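
To make that concrete, here is a rough sketch of a configure line with UCX disabled. Note the option starts with two ASCII hyphens. The helper name and the install prefix are placeholders for illustration; the other flags are the ones Matt listed in his original configure line:

```shell
#!/bin/sh
# Hypothetical helper that prints a configure line for an Omni-Path system.
# $1 is the install prefix (an assumed path, not something from this thread).
ompi_configure_line() {
    echo "./configure --prefix=$1 --with-psm2 --with-slurm --with-ucx=no" \
         "CC=icc CXX=icpc FC=ifort"
}

# Print the line for review instead of running configure directly.
ompi_configure_line "$HOME/opt/openmpi-4.0.0"
```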

Howard


Gilles Gouaillardet wrote on Sat, Jan 19, 2019 at 18:41:

> Matt,
>
> There are two ways of using PMIx
>
> - if you use mpirun, then the MPI app (i.e. the PMIx client) will talk
> to mpirun and orted daemons (i.e. the PMIx server)
> - if you use SLURM srun, then the MPI app will directly talk to the
> PMIx server provided by SLURM. (note you might have to srun
> --mpi=pmix_v2 or something)
>
> In the former case, it does not matter whether you use the embedded or
> external PMIx.
> In the latter case, Open MPI and SLURM have to use compatible PMIx
> libraries, and you can either check the cross-version compatibility
> matrix,
> or build Open MPI with the same PMIx used by SLURM to be on the safe
> side (not a bad idea IMHO).
>
>
> Regarding the hang, I suggest you try different things:
> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
> runs on a compute node rather than on a frontend node)
> - try something even simpler such as mpirun hostname (both with sbatch
> and salloc)
> - explicitly specify the network to be used for the wire-up. You can,
> for example, run mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
> the network subnet by which all the nodes (e.g. compute nodes and
> frontend node if you use salloc) communicate.
>
>
> Cheers,
>
> Gilles
>
> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
> >
> > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >> >
> >> > With some help, I managed to build an Open MPI 4.0.0 with:
> >>
> >> We can discuss each of these params to let you know what they are.
> >>
> >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
> >>
> >> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
> >
> >
> > I've had these in my Open MPI builds for a while now. The reason was one
> of the libraries I need for the climate model I work on went nuts if both
> of them weren't there. It was originally the rpath one but then eventually
> (Open MPI 3?) I had to add the runpath one. But I have been updating the
> libraries more aggressively recently (due to OS upgrades) so it's possible
> this is no longer needed.
> >
> >>
> >>
> >> > --with-psm2
> >>
> >> Ensure that Open MPI can include support for the PSM2 library, and
> abort configure if it cannot.
> >>
> >> > --with-slurm
> >>
> >> Ensure that Open MPI can include support for SLURM, and abort configure
> if it cannot.
> >>
> >> > --enable-mpi1-compatibility
> >>
> >> Add support for MPI_Address and other MPI-1 functions that have since
> been deleted from the MPI 3.x specification.
> >>
> >> > --with-ucx
> >>
> >> Ensure that Open MPI can include support for UCX, and abort configure
> if it cannot.
> >>
> >> > --with-pmix=/usr/nlocal/pmix/2.1
> >>
> >> Tells Open MPI to use the PMIx that is installed at
> /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally
> to Open MPI's source code tree/expanded tarball).
> >>
> >> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
> >
> >
> > Ah. I did not know that. I figured if our SLURM was built linked to a
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
> build an Open MPI 4 without specifying this.
> >
> >>
> >>
> >> > --with-libevent=/usr
> >>
> >> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
> >>
> >> > CC=icc CXX=icpc FC=ifort
> >>
> >> Specify the exact compilers to use.
> >>
> >> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
> >>
> >> Might as well remove the --with-libevent if you don't need it.
> >>
> >> > However, I seem to have built an Open MPI that doesn't work:
> >> >
> >> > (1099)(master) $ mpirun --version
> >> > mpirun (Open MPI) 4.0.0
> >> >
> >> > Report bugs to http://www.open-mpi.org/community/help/
> >> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >> >
> >> > It just sits there...forever. Can the gurus here help me figure out
> what I managed to break? Perhaps I added too much to my configure 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-19 Thread Gilles Gouaillardet
Matt,

There are two ways of using PMIx

- if you use mpirun, then the MPI app (i.e. the PMIx client) will talk
to mpirun and orted daemons (i.e. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the
PMIx server provided by SLURM. (note you might have to srun
--mpi=pmix_v2 or something)

In the former case, it does not matter whether you use the embedded or
external PMIx.
In the latter case, Open MPI and SLURM have to use compatible PMIx
libraries, and you can either check the cross-version compatibility
matrix,
or build Open MPI with the same PMIx used by SLURM to be on the safe
side (not a bad idea IMHO).


Regarding the hang, I suggest you try different things:
- use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
runs on a compute node rather than on a frontend node)
- try something even simpler such as mpirun hostname (both with sbatch
and salloc)
- explicitly specify the network to be used for the wire-up. You can,
for example, run mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
the network subnet by which all the nodes (e.g. compute nodes and
frontend node if you use salloc) communicate.
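
The checks above could be lined up roughly as follows. This is only a sketch that prints the commands in order of increasing complexity. The subnet is the example value from this message and has to be replaced with your real one, and the function name diag_commands is made up:

```shell
#!/bin/sh
# Example subnet from the message above; substitute the cluster's own.
subnet=192.168.0.0/24
exe=./helloWorld.mpi3.SLES12.OMPI400.exe

# Print each diagnostic command instead of running it.
diag_commands() {
    # 1. Simplest launch: a non-MPI program, so only the runtime is tested.
    echo "mpirun hostname"
    # 2. Same launch with the out-of-band wire-up pinned to one subnet.
    echo "mpirun --mca oob_tcp_if_include $subnet hostname"
    # 3. The real MPI program with the same wire-up restriction.
    echo "mpirun --mca oob_tcp_if_include $subnet -np 4 $exe"
}

diag_commands
```

Each line would be tried both from within an sbatch script and from a salloc session, as suggested above.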


Cheers,

Gilles

On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
>
> On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
>  wrote:
>>
>> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
>> >
>> > With some help, I managed to build an Open MPI 4.0.0 with:
>>
>> We can discuss each of these params to let you know what they are.
>>
>> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>>
>> Did you have a reason for disabling these?  They're generally good things.  
>> What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
>> friends) that basically put a default path to find libraries at run time 
>> (that can/will in most cases override LD_LIBRARY_PATH -- but you can 
>> override these linked-in-default-paths if you want/need to).
>
>
> I've had these in my Open MPI builds for a while now. The reason was one of 
> the libraries I need for the climate model I work on went nuts if both of 
> them weren't there. It was originally the rpath one but then eventually (Open 
> MPI 3?) I had to add the runpath one. But I have been updating the libraries 
> more aggressively recently (due to OS upgrades) so it's possible this is no 
> longer needed.
>
>>
>>
>> > --with-psm2
>>
>> Ensure that Open MPI can include support for the PSM2 library, and abort 
>> configure if it cannot.
>>
>> > --with-slurm
>>
>> Ensure that Open MPI can include support for SLURM, and abort configure if 
>> it cannot.
>>
>> > --enable-mpi1-compatibility
>>
>> Add support for MPI_Address and other MPI-1 functions that have since been 
>> deleted from the MPI 3.x specification.
>>
>> > --with-ucx
>>
>> Ensure that Open MPI can include support for UCX, and abort configure if it 
>> cannot.
>>
>> > --with-pmix=/usr/nlocal/pmix/2.1
>>
>> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
>> (instead of using the PMIx that is bundled internally to Open MPI's source 
>> code tree/expanded tarball).
>>
>> Unless you have a reason to use the external PMIx, the internal/bundled PMIx 
>> is usually sufficient.
>
>
> Ah. I did not know that. I figured if our SLURM was built linked to a 
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll build 
> an Open MPI 4 without specifying this.
>
>>
>>
>> > --with-libevent=/usr
>>
>> Same as previous; change "pmix" to "libevent" (i.e., use the external 
>> libevent instead of the bundled libevent).
>>
>> > CC=icc CXX=icpc FC=ifort
>>
>> Specify the exact compilers to use.
>>
>> > The MPI 1 is because I need to build HDF5 eventually and I added psm2 
>> > because it's an Omnipath cluster. The libevent was probably a red herring 
>> > as libevent-devel wasn't installed on the system. It was eventually, and I 
>> > just didn't remove the flag. And I saw no errors in the build!
>>
>> Might as well remove the --with-libevent if you don't need it.
>>
>> > However, I seem to have built an Open MPI that doesn't work:
>> >
>> > (1099)(master) $ mpirun --version
>> > mpirun (Open MPI) 4.0.0
>> >
>> > Report bugs to http://www.open-mpi.org/community/help/
>> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>> >
>> > It just sits there...forever. Can the gurus here help me figure out what I 
>> > managed to break? Perhaps I added too much to my configure line? Not 
>> > enough?
>>
>> There could be a few things going on here.
>>
>> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an 
>> "sbatch" script?
>
>
> I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
> (as you'd hope on an Omnipath cluster), but for some reason Open MPI is 
> twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a few 
> months ago), and it had some interesting startup scaling I liked (slow at low 
> core count, but getting 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Cabral, Matias A
Hi Matt,

A few comments/questions:

-  If your cluster has Omni-Path, you won’t need UCX. Instead you can 
run using PSM2, or alternatively OFI (a.k.a. Libfabric).

-  With the command you shared below (4 ranks on the local node), I think 
a shared-memory transport (vader?) is being selected. So if the job is not 
starting, this looks like a runtime issue rather than a transport issue. 
PMIx? SLURM?
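
For reference, a sketch of how PSM2 or OFI might be requested explicitly at run time, assuming the corresponding MTL components were built into your Open MPI. It prints the command lines rather than running them, and the helper name mtl_cmd is made up:

```shell
#!/bin/sh
exe=./helloWorld.mpi3.SLES12.OMPI400.exe

# Print an mpirun line that requests a given MTL through the cm PML.
# $1 is the MTL component name: psm2 or ofi.
mtl_cmd() {
    echo "mpirun -np 4 --mca pml cm --mca mtl $1 $exe"
}

mtl_cmd psm2
mtl_cmd ofi
```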
Thanks
_MAC


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Friday, January 18, 2019 10:27 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users wrote:
On Jan 18, 2019, at 12:43 PM, Matt Thompson wrote:
>
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

I've had these in my Open MPI builds for a while now. The reason was one of the 
libraries I need for the climate model I work on went nuts if both of them 
weren't there. It was originally the rpath one but then eventually (Open MPI 
3?) I had to add the runpath one. But I have been updating the libraries more 
aggressively recently (due to OS upgrades) so it's possible this is no longer 
needed.


> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

Ah. I did not know that. I figured if our SLURM was built linked to a specific 
PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 
4 without specifying this.


> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
>
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
>
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
(as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy 
on this cluster. I once managed to get Open MPI 3.0.1 working (a few months 
ago), and it had some interesting startup scaling I liked (slow at low core 
count, but getting close to Intel MPI at high core count), though it seemed to 
not work after about 100 nodes (4000 processes) or so.

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >
> > With some help, I managed to build an Open MPI 4.0.0 with:
>
> We can discuss each of these params to let you know what they are.
>
> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>
> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
>

I've had these in my Open MPI builds for a while now. The reason was one of
the libraries I need for the climate model I work on went nuts if both of
them weren't there. It was originally the rpath one but then eventually
(Open MPI 3?) I had to add the runpath one. But I have been updating the
libraries more aggressively recently (due to OS upgrades) so it's possible
this is no longer needed.


>
> > --with-psm2
>
> Ensure that Open MPI can include support for the PSM2 library, and abort
> configure if it cannot.
>
> > --with-slurm
>
> Ensure that Open MPI can include support for SLURM, and abort configure if
> it cannot.
>
> > --enable-mpi1-compatibility
>
> Add support for MPI_Address and other MPI-1 functions that have since been
> deleted from the MPI 3.x specification.
>
> > --with-ucx
>
> Ensure that Open MPI can include support for UCX, and abort configure if
> it cannot.
>
> > --with-pmix=/usr/nlocal/pmix/2.1
>
> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1
> (instead of using the PMIx that is bundled internally to Open MPI's source
> code tree/expanded tarball).
>
> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
>

Ah. I did not know that. I figured if our SLURM was built linked to a
specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
build an Open MPI 4 without specifying this.


>
> > --with-libevent=/usr
>
> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
>
> > CC=icc CXX=icpc FC=ifort
>
> Specify the exact compilers to use.
>
> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
>
> Might as well remove the --with-libevent if you don't need it.
>
> > However, I seem to have built an Open MPI that doesn't work:
> >
> > (1099)(master) $ mpirun --version
> > mpirun (Open MPI) 4.0.0
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >
> > It just sits there...forever. Can the gurus here help me figure out what
> I managed to break? Perhaps I added too much to my configure line? Not
> enough?
>
> There could be a few things going on here.
>
> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an
> "sbatch" script?
>

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just
fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI
is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a
few months ago), and it had some interesting startup scaling I liked (slow
at low core count, but getting close to Intel MPI at high core count),
though it seemed to not work after about 100 nodes (4000 processes) or so.

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better
   Anna Rampton
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Jeff Squyres (jsquyres) via users
On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> 
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm 

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.
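
As a rough sketch of how the suggestions above might combine (an assumption, not a tested recipe for this cluster): the configure line from this thread with --with-libevent=/usr and --with-pmix=/usr/nlocal/pmix/2.1 dropped, so the bundled libevent and PMIx are used. The variable name configure_args is only illustrative, and the command is printed for review, not run.

```shell
#!/bin/sh
# Sketch only: the flags from this thread, minus the external libevent and
# PMIx options. Assumption: nothing on the cluster requires the external
# copies. The wrapper rpath/runpath flags are kept as originally given.
configure_args="--disable-wrapper-rpath --disable-wrapper-runpath \
--with-psm2 --with-slurm --enable-mpi1-compatibility --with-ucx \
CC=icc CXX=icpc FC=ifort"

# Print the command so it can be checked before running it by hand.
echo "./configure ${configure_args}"
```

Printing first makes it easy to confirm that no stale flag is left over from an earlier attempt.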

> However, I seem to have built an Open MPI that doesn't work:
> 
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> 
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?
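
One quick way to answer that question from the shell, as a minimal sketch: SLURM exports SLURM_JOB_ID (and SLURM_JOB_NUM_NODES) into the environment of salloc and sbatch jobs, so checking for it shows whether the current shell is inside an allocation. The function name slurm_job_status is hypothetical.

```shell
#!/bin/sh
# Report whether this shell is inside a SLURM allocation (salloc or sbatch).
# slurm_job_status is an illustrative helper, not part of SLURM or Open MPI.
slurm_job_status() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside SLURM job ${SLURM_JOB_ID} (nodes: ${SLURM_JOB_NUM_NODES:-unknown})"
    else
        echo "not inside a SLURM job"
    fi
}

slurm_job_status
```

If this reports that the shell is not inside a job, mpirun is being started outside of any allocation.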

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
All,

With some help, I managed to build an Open MPI 4.0.0 with:

./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2
--with-slurm --enable-mpi1-compatibility --with-ucx
--with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc
FC=ifort

The MPI 1 is because I need to build HDF5 eventually and I added psm2
because it's an Omnipath cluster. The libevent was probably a red herring
as libevent-devel wasn't installed on the system. It was eventually, and I
just didn't remove the flag. And I saw no errors in the build!

However, I seem to have built an Open MPI that doesn't work:

(1099)(master) $ mpirun --version
mpirun (Open MPI) 4.0.0

Report bugs to http://www.open-mpi.org/community/help/
(1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

It just sits there...forever. Can the gurus here help me figure out what I
managed to break? Perhaps I added too much to my configure line? Not enough?

Thanks,
Matt

On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson  wrote:

> Dear Open MPI Gurus,
>
> A cluster I use recently updated their SLURM to have support for UCX and
> PMIx. These are names I've seen and heard often at SC BoFs and posters, but
> now is my first time to play with them.
>
> So, my first question is how exactly should I build Open MPI to try these
> features out. I'm guessing I'll need things like "--with-ucx" to test UCX,
> but is anything needed for PMIx?
>
> Second, when it comes to running Open MPI, are there new MCA parameters I
> need to look out for when testing?
>
> Sorry for the generic questions, but I'm more on the user end of the
> cluster than the administrator end, so I tend to get lost in the detailed
> presentations, etc. I see online.
>
> Thanks,
> Matt
> --
> Matt Thompson
>    “The fact is, this is about us identifying what we do best and
>    finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better
   Anna Rampton