Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Charles A Taylor via users
Sure…

+ ./configure 
  --build=x86_64-redhat-linux-gnu \
  --host=x86_64-redhat-linux-gnu \
  --program-prefix= \
  --disable-dependency-tracking \
  --prefix=/apps/mpi/intel/2019.1.144/openmpi/4.0.1 \
  --exec-prefix=/apps/mpi/intel/2019.1.144/openmpi/4.0.1 \
  --bindir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/bin \
  --sbindir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/sbin \
  --sysconfdir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/etc \
  --datadir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/share \
  --includedir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/include \
  --libdir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/lib64 \
  --libexecdir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/libexec \
  --localstatedir=/var \
  --sharedstatedir=/var/lib \
  --mandir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/share/man \
  --infodir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/share/info \
  CC=icc CXX=icpc FC=ifort 'FFLAGS=-O2 -g -warn -m64' LDFLAGS= \
  --enable-static \
  --enable-orterun-prefix-by-default \
  --with-slurm=/opt/slurm \
  --with-pmix=/opt/pmix/3.1.2 \
  --with-pmi=/opt/slurm \
  --with-libevent=external \
  --with-hwloc=external \
  --without-verbs \
  --with-libfabric \
  --with-ucx \
  --with-mxm=no \
  --with-cuda=no \
  --enable-openib-udcm \
  --enable-openib-rdmacm
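
For reference, the CPPFLAGS workaround Jeff suggests below, folded into this same
invocation, would look something like the following (a sketch only, assuming SLURM's
pmi.h lives under /usr/include/slurm; adjust the path to your install):

  ./configure CPPFLAGS=-I/usr/include/slurm \
    --prefix=/apps/mpi/intel/2019.1.144/openmpi/4.0.1 \
    --with-slurm=/opt/slurm \
    --with-pmi=/opt/slurm \
    ...   # remaining options exactly as above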


> On Jun 20, 2019, at 12:49 PM, Jeff Squyres (jsquyres) via users 
>  wrote:
> 
> Ok.
> 
> Perhaps we still missed something in the configury.
> 
> Worst case, you can:
> 
> $ ./configure CPPFLAGS=-I/usr/include/slurm ...rest of your configure 
> params...
> 
> That will add the -I to CPPFLAGS, and it will preserve that you set that 
> value in the top few lines of config.log.
> 
> 
> 
> On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S  
> wrote:
>> 
>> As of recent versions you need to use both --with-slurm and --with-pmi2.
>> 
>> While the configure output indicates that it picks up pmi2 as part of slurm, that is 
>> not in fact true; you need to tell it about pmi2 explicitly.
>> 
>> From: users  On Behalf Of Noam Bernstein 
>> via users
>> Sent: Thursday, June 20, 2019 9:16 AM
>> To: Jeff Squyres (jsquyres) 
>> Cc: Noam Bernstein ; Open MPI User's List 
>> 
>> Subject: Re: [OMPI users] OpenMPI 4 and pmi2 support
>> 
>> 
>> 
>> 
>> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users 
>>  wrote:
>> 
>> 
>> Hi Jeff - do you remember this issue from a couple of months ago?  
>> 
>> Noam: I'm sorry, I totally missed this email.  My INBOX is a continual 
>> disaster.  :-(
>> 
>> No problem.  We’re running with mpirun for now.
>> 
>> 
>> 
>> Unfortunately, the failure to find pmi.h is still happening.  I just tried 
>> with 4.0.1 (not rc), and I still run into the same error (failing to find 
>> #include <pmi.h> when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo):
>> make[2]: Entering directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>> CC   mca_pmix_s1_la-pmix_s1.lo
>> pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
>> #include <pmi.h>
>>^
>> compilation terminated.
>> make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1
>> make[2]: Leaving directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal'
>> make: *** [all-recursive] Error 1
>> 
>> I looked back earlier in this thread, and I don't see the version of SLURM 
>> that you're using.  What version is it?
>> 
>> 18.08, provided for our CentOS 7.6-based Rocks through the slurm roll, so 
>> not compiled by me.
>> 
>> 
>> 
>> Is there a pmi2.h in the SLURM installation (i.e., not pmi.h)?
>> 
>> Or is the problem that -I/usr/include/slurm is not passed to the compile 
>> line (per your output, below)?
>> 
>> /usr/include/slurm has both pmi.h and pmi2.h, but (from what I could tell 
>> when trying to manually reproduce what make is doing)
>> -I/usr/include/slurm 
>> is not being passed when compiling those files.
>> 
>> 
>> 
>> When I dig into what libtool is trying to do, I get (once I remove the 
>> --silent flag):
>> 
>> (FWIW, you can also "make V=1" to have it show you all this detail)
>> 
>> I’ll check that, to confirm that I’m correct about it not being passed.
>> 
>>  
>>  Noam
>> 
>> 
>> U.S. NAVAL RESEARCH LABORATORY
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil
> 
> 
> -- 
> Jeff Squyres
> 

Re: [OMPI users] Intel Compilers

2019-06-20 Thread Charles A Taylor via users


> On Jun 20, 2019, at 12:10 PM, Carlson, Timothy S  
> wrote:
> 
> I’ve never seen that error and have built some flavor of this combination 
> dozens of times.  What version of Intel Compiler and what version of OpenMPI 
> are you trying to build?

[chasman@login4 gizmo-mufasa]$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, 
Version 19.0.1.144 Build 20181018
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

OpenMPI 4.0.1 

It is probably something I/we are doing that is throwing the configure script 
and macros off.  We include some version (7.3.0 in this case) of gcc in our 
command and library paths because icpc needs the gnu headers for certain 
things.  Perhaps the configure script is picking that up and thinks we are 
using gnu.   

I’ll have to look more closely now that I know I’m the only one seeing it.  :(

Charlie Taylor
UF Research Computing


>  
> Tim
>  
> From: users  <mailto:users-boun...@lists.open-mpi.org>> On Behalf Of Charles A Taylor via 
> users
> Sent: Thursday, June 20, 2019 8:55 AM
> To: Open MPI Users  <mailto:users@lists.open-mpi.org>>
> Cc: Charles A Taylor mailto:chas...@ufl.edu>>
> Subject: [OMPI users] Intel Compilers
>  
> OpenMPI probably has one of the largest and most complete configure+build 
> systems I’ve ever seen.  
>  
> I’m surprised however that it doesn’t pick up the use of the intel compilers 
> and modify the command line
> parameters as needed.
>  
> ifort: command line warning #10006: ignoring unknown option '-pipe'
> ifort: command line warning #10157: ignoring option '-W'; argument is of 
> wrong type
> ifort: command line warning #10006: ignoring unknown option 
> '-fparam=ssp-buffer-size=4'
> ifort: command line warning #10006: ignoring unknown option '-pipe'
> ifort: command line warning #10157: ignoring option '-W'; argument is of 
> wrong type
> ifort: command line warning #10006: ignoring unknown option 
> '-fparam=ssp-buffer-size=4'
> ifort: command line warning #10006: ignoring unknown option '-pipe'
> ifort: command line warning #10157: ignoring option '-W'; argument is of 
> wrong type
> ifort: command line warning #10006: ignoring unknown option 
> '-fparam=ssp-buffer-size=4’
>  
> Maybe I’m missing something.
>  
> Regards,
>  
> Charlie Taylor
> UF Research Computing


[OMPI users] Intel Compilers

2019-06-20 Thread Charles A Taylor via users
OpenMPI probably has one of the largest and most complete configure+build 
systems I’ve ever seen.  

I’m surprised however that it doesn’t pick up the use of the intel compilers 
and modify the command line
parameters as needed.

ifort: command line warning #10006: ignoring unknown option '-pipe'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong 
type
ifort: command line warning #10006: ignoring unknown option 
'-fparam=ssp-buffer-size=4'
ifort: command line warning #10006: ignoring unknown option '-pipe'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong 
type
ifort: command line warning #10006: ignoring unknown option 
'-fparam=ssp-buffer-size=4'
ifort: command line warning #10006: ignoring unknown option '-pipe'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong 
type
ifort: command line warning #10006: ignoring unknown option 
'-fparam=ssp-buffer-size=4’

Maybe I’m missing something.
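
Those unknown options look like the distribution's default RPM optflags leaking into the
Intel build rather than anything configure picked on its own. If the build goes through
the rpmbuild wrapper shown later in this digest, one way to keep them away from
icc/ifort (a sketch, not verified against every spec) is to drop the inherited optflags
and pass Intel-friendly flags explicitly:

  # in the rpmbuild invocation (macro name as used in the spec elsewhere in this digest):
  --define 'use_default_rpm_opt_flags 0' \

  # or, for a plain build, set the flags yourself:
  ./configure CC=icc CXX=icpc FC=ifort CFLAGS="-O2 -g -m64" CXXFLAGS="-O2 -g -m64" ...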

Regards,

Charlie Taylor
UF Research Computing

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Charles A Taylor via users
This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought the fix 
was landed in 4.0.0 but you might
want to check the code to be sure there wasn’t a regression in 4.1.x.  Most of 
our codes are still running
3.1.2 so I haven’t built anything beyond 4.0.0 which definitely included the 
fix.

See…

- Apply patch for memory leak associated with UCX PML.
- https://github.com/openucx/ucx/issues/2921
- https://github.com/open-mpi/ompi/pull/5878
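
To confirm a given run is hitting the same leak before chasing patches, watching the
resident set size of the ranks is usually enough; a rough sketch (the process name is
just an example):

  watch -n 10 'ps -C gizmo -o pid,rss,vsz,comm --sort=-rss | head -20'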

Charles Taylor
UF Research Computing


> On Jun 19, 2019, at 2:26 PM, Noam Bernstein via users 
>  wrote:
> 
>> On Jun 19, 2019, at 2:00 PM, John Hearns via users > <mailto:users@lists.open-mpi.org>> wrote:
>> 
>> Noam, it may be a stupid question. Could you try running slabtop as the 
>> program executes?
> 
> The top SIZE usage is this line
>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 5937540 5937540 100%    0.09K 141370       42    565480K kmalloc-96
> which seems to be growing continuously. However, it’s much smaller than the 
> drop in free memory.  It gets to around 1 GB after tens of seconds (500 MB 
> here), but the overall free memory is dropping by about 1 GB / second, so 
> tens of GB over the same time.
> 
>> 
>> Also, 'watch cat /proc/meminfo' is also a good diagnostic
> 
> Other than MemFree dropping, I don’t see much. Here’s a diff, 10 seconds 
> apart:
> 2,3c2,3
> < MemFree:54229400 kB
> < MemAvailable:   54271804 kB
> ---
> > MemFree:45010772 kB
> > MemAvailable:   45054200 kB
> 19c19
> < AnonPages:  22063260 kB
> ---
> > AnonPages:  22526300 kB
> 22,24c22,24
> < Slab: 851380 kB
> < SReclaimable:  87100 kB
> < SUnreclaim:   764280 kB
> ---
> > Slab:1068208 kB
> > SReclaimable:  89148 kB
> > SUnreclaim:   979060 kB
> 31c31
> < Committed_AS:   34976896 kB
> ---
> > Committed_AS:   34977680 kB
> 
> MemFree has dropped by 9 GB, but as far as I can tell nothing else has 
> increased by anything near as much, so I don’t know where the memory is going.
> 
>   Noam
> 
> 
> 
> U.S. NAVAL RESEARCH LABORATORY
> 
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil

[OMPI users] Error initializing an UCX / OpenFabrics device. #6300

2019-03-22 Thread Charles A Taylor
Anyone else running into the issue below with OpenMPI 4.0.0?

   https://github.com/open-mpi/ompi/issues/6300
   (Error initializing an UCX / OpenFabrics device)

I’m hitting it and don’t really see why.  I posted to the bug but maybe I need 
to just open a new issue.

Charlie Taylor
UF Research Computing

Re: [OMPI users] Memory Leak in 3.1.2 + UCX

2018-10-17 Thread Charles A Taylor
Just to follow up…

This turned out to be a bug in OpenMPI+UCX.

   https://github.com/openucx/ucx/issues/2921
   https://github.com/open-mpi/ompi/pull/5878

I cherry-picked the patch from the github master and applied it to 3.1.2.  The 
gadget/gizmo
test case has been running since yesterday without the previously observed 
growth in RSS.

Thanks to Yossi Itigin (yos...@mellanox.com) for 
the fix.
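
For anyone else still on 3.1.2, the fix can be pulled straight from the pull request
above; a minimal sketch (GitHub serves a patch for every PR, but whether it applies
cleanly to your exact tarball is not guaranteed):

  cd openmpi-3.1.2
  curl -L https://github.com/open-mpi/ompi/pull/5878.patch | patch -p1
  ./configure ...            # your usual options
  make -j 8 && make install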

Charlie Taylor
UF Research Computing

> On Oct 4, 2018, at 5:39 PM, Charles A Taylor  wrote:
> 
> 
> We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for 
> that matter) built with UCX support.   The leak shows up
> whether the “ucx” PML is specified for the run or not.  The applications in 
> question are arepo and gizmo, but I have no reason to believe
> that others are not affected as well.
> 
> Basically the MPI processes grow without bound until SLURM kills the job or 
> the host memory is exhausted.  
> If I configure and build with “--without-ucx” the problem goes away.
> 
> I didn’t see anything about this on the UCX github site so I thought I’d ask 
> here.  Anyone else seeing the same or similar?
> 
> What version of UCX is OpenMPI 3.1.x tested against?
> 
> Regards,
> 
> Charlie Taylor
> UF Research Computing
> 
> Details:
> —
> RHEL7.5
> OpenMPI 3.1.2 (and any other version I’ve tried).
> ucx 1.2.2-1.el7 (RH native)
> RH native IB stack
> Mellanox FDR/EDR IB fabric
> Intel Parallel Studio 2018.1.163
> 
> Configuration Options:
> —
> CFG_OPTS=""
> CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" 
> LDFLAGS=\"\" "
> CFG_OPTS="$CFG_OPTS --enable-static"
> CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
> CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
> CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
> CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
> CFG_OPTS="$CFG_OPTS --with-libevent=external"
> CFG_OPTS="$CFG_OPTS --with-hwloc=external"
> CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
> CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
> CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
> CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
> CFG_OPTS="$CFG_OPTS --with-mxm=no"
> CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
> CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
> CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
> CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"
> 
> rpmbuild --ba \
> --define '_name openmpi' \
> --define "_version $OMPI_VER" \
> --define "_release ${RELEASE}" \
> --define "_prefix $PREFIX" \
> --define '_mandir %{_prefix}/share/man' \
> --define '_defaultdocdir %{_prefix}' \
> --define 'mflags -j 8' \
> --define 'use_default_rpm_opt_flags 1' \
> --define 'use_check_files 0' \
> --define 'install_shell_scripts 1' \
> --define 'shell_scripts_basename mpivars' \
> --define "configure_options $CFG_OPTS " \
> openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log
> 
> 
> 
> 

Re: [OMPI users] issue compiling openmpi 3.2.1 with pmi and slurm

2018-10-10 Thread Charles A Taylor
In our config the "--with-pmi" points to the slurm “prefix” dir not the slurm 
libdir.  The options below work for us with SLURM installed in “/opt/slurm”.

I’ll note that after sharing this config with regard to another issue, it was 
recommended to drop the “/usr” in the “--with-foo=/usr” options and simply
use “--with-foo”.  That said, the configuration builds, runs, and works with 
both pmi2 and pmix_v2 under SLURM (17.11.5).

Hope it helps,

Charlie Taylor
UF Research Computing

CFG_OPTS=""
CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" 
LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm”




> On Oct 10, 2018, at 9:44 AM, Ross, Daniel B. via users 
>  wrote:
> 
> I have been able to configure without issue using the following options:
> ./configure --prefix=/usr/local/ --with-cuda --with-slurm 
> --with-pmi=/usr/local/slurm/include/slurm 
> --with-pmi-libdir=/usr/local/slurm/lib64
>  
> Everything compiles just fine until I get this error:
>  
> make[3]: Leaving directory 
> `/usr/local/src/openmpi/openmpi-3.1.2/opal/mca/pmix/pmix2x'
> make[2]: Leaving directory 
> `/usr/local/src/openmpi/openmpi-3.1.2/opal/mca/pmix/pmix2x'
> Making all in mca/pmix/s1
> make[2]: Entering directory 
> `/usr/local/src/openmpi/openmpi-3.1.2/opal/mca/pmix/s1'
>   CC   mca_pmix_s1_la-pmix_s1.lo
> pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
> #include <pmi.h>
>  ^
> compilation terminated.
> make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1
> make[2]: Leaving directory 
> `/usr/local/src/openmpi/openmpi-3.1.2/opal/mca/pmix/s1'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/usr/local/src/openmpi/openmpi-3.1.2/opal'
> make: *** [all-recursive] Error 1
>  
>  
> any ideas why I am getting this error?
> Thanks
>  

Re: [OMPI users] Memory Leak in 3.1.2 + UCX

2018-10-06 Thread Charles A Taylor


> On Oct 6, 2018, at 6:06 AM,   wrote:
> 
> Charles,
> 
> ucx has a higher priority than ob1, that is why it is used by default 
> when available.

Good to know.  Thanks.

> 
> If you can provide simple instructions on how to build and test one of 
> the apps that experiment
> a memory leak, that would greatly help us and the UCX folks reproduce, 
> troubleshoot and diagnose this issue.

I’ll be happy to do that.  Is it better to post it here or on the OpenMPI 
github site?

Regards,

Charlie

> 
> Cheers,
> 
> Gilles
> 
> - Original Message -
>> 
>>> On Oct 5, 2018, at 11:31 AM, Gilles Gouaillardet  gouaillar...@gmail.com> wrote:
>>> 
>>> are you saying that even if you
>>> 
>>>mpirun --mca pml ob1 ...
>>> 
>>> (e.g. force the ob1 component of the pml framework) the memory leak 
> is
>>> still present ?
>> 
>> No, I do not mean to say that - at least not in the current 
> incarnation.  Running with the following parameters avoids the leak…
>> 
>>export OMPI_MCA_pml="ob1"
>>export OMPI_MCA_btl_openib_eager_limit=1048576
>>export OMPI_MCA_btl_openib_max_send_size=1048576
>> 
>> as does building OpenMPI without UCX support (i.e. --without-ucx).   
>> 
>> However, building _with_ UCX support (including the current github 
> source) and running with the following parameters produces
>> the leak (note that no PML was explicitly requested).  
>> 
>>   export OMPI_MCA_oob_tcp_listen_mode="listen_thread"
>>   export OMPI_MCA_btl_openib_eager_limit=1048576
>>   export OMPI_MCA_btl_openib_max_send_size=1048576
>>   export OMPI_MCA_btl="self,vader,openib"
>> 
>> The eager_limit and send_size limits are needed with this app to 
> prevent a deadlock that I’ve posted about previously. 
>> 
>> Also, explicitly requesting the UCX PML with,
>> 
>> export OMPI_MCA_pml="ucx"
>> 
>> produces the leak.
>> 
>> I’m continuing to try to find exactly what I’m doing wrong to produce 
> this behavior but have been unable to arrive at 
>> a solution other than excluding UCX which seems like a bad idea since 
> Jeff (Squyres) pointed out that it is the
>> Mellanox-recommended way to run on Mellanox hardware.  Interestingly, 
> using the UCX PML framework avoids
>> the deadlock that results when running with the default parameters and 
> not limiting the message sizes - another
>> reason we’d like to be able to use it.
>> 
>> I can read your mind at this point - “Wow, these guys have really 
> horked their cluster”.  Could be.   But we run
>> thousands of jobs every day including many other OpenMPI jobs (vasp, 
> gromacs, raxml, lammps, namd, etc).
>> Also the users of the Arepo and Gadget code are currently running with 
> MVAPICH2 without issue.  I installed
>> it specifically to get them past these OpenMPI problems.  We don’t 
> normally build anything with MPICH/MVAPICH/IMPI
>> since we have never had any real reason to - until now.
>> 
>> That may have to be the solution but the memory leak is so readily 
> reproducible that I thought I’d ask about it.
>> Since it appears that others are not seeing this issue, I’ll continue 
> to try to figure it out and if I do, I’ll be sure to post back.
>> 
>>> As a side note, we strongly recommend to avoid
>>> configure --with-FOO=/usr
>>> instead
>>> configure --with-FOO
>>> should be used (otherwise you will end up with -I/usr/include
>>> -L/usr/lib64 and that could silently hide third party libraries
>>> installed in a non standard directory). If --with-FOO fails for you,
>>> then this is a bug we will appreciate you report.
>> 
>> Noted and logged.  We’ve been using the --with-FOO=/usr for a long time 
> (since 1.x days).  There was a reason we started doing
>> it but I’ve long since forgotten what it was but I think it was to _
> avoid_ what you describe - not cause it.  Regardless,
>> I’ll heed your warning and remove it from future builds and file a bug 
> if there are any problems.
>> 
>> However, I did post about a similar problem previously when 
> configuring against an external PMIx library.  The configure
>> script produces (or did) a "-L/usr/lib” instead of a "-L/usr/lib64” 
> resulting in unresolved PMIx routines when linking.
>> That was with OpenMPI 2.1.2.  We now include a lib -> lib64 symlink in 
> our /opt/pmix/x.y.z directories so I haven’t looked to 
>> see if that was fixed for 3.x or not.
>> 
>> I should have also mentioned in my previou

Re: [OMPI users] Memory Leak in 3.1.2 + UCX

2018-10-06 Thread Charles A Taylor

> On Oct 5, 2018, at 11:31 AM, Gilles Gouaillardet 
>  wrote:
> 
> are you saying that even if you
> 
> mpirun --mca pml ob1 ...
> 
> (e.g. force the ob1 component of the pml framework) the memory leak is
> still present ?

No, I do not mean to say that - at least not in the current incarnation.  
Running with the following parameters avoids the leak…

export OMPI_MCA_pml="ob1"
export OMPI_MCA_btl_openib_eager_limit=1048576
export OMPI_MCA_btl_openib_max_send_size=1048576

as does building OpenMPI without UCX support (i.e. --without-ucx).   

However, building _with_ UCX support (including the current github source) and 
running with the following parameters produces
the leak (note that no PML was explicitly requested).  

   export OMPI_MCA_oob_tcp_listen_mode="listen_thread"
   export OMPI_MCA_btl_openib_eager_limit=1048576
   export OMPI_MCA_btl_openib_max_send_size=1048576
   export OMPI_MCA_btl="self,vader,openib"

The eager_limit and send_size limits are needed with this app to prevent a 
deadlock that I’ve posted about previously. 

Also, explicitly requesting the UCX PML with,

 export OMPI_MCA_pml="ucx"

produces the leak.
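
For a quick A/B check on the same input, the PML can also be forced per run on the
mpirun command line rather than through the environment (the application name and
arguments here are placeholders):

  mpirun --mca pml ob1 ./gizmo params.txt     # avoids the leak in our runs (with the limits above exported)
  mpirun --mca pml ucx ./gizmo params.txt     # reproduces the leak described above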

I’m continuing to try to find exactly what I’m doing wrong to produce this 
behavior but have been unable to arrive at 
a solution other than excluding UCX which seems like a bad idea since Jeff 
(Squyres) pointed out that it is the
Mellanox-recommended way to run on Mellanox hardware.  Interestingly, using the 
UCX PML framework avoids
the deadlock that results when running with the default parameters and not 
limiting the message sizes - another
reason we’d like to be able to use it.

I can read your mind at this point - “Wow, these guys have really horked their 
cluster”.  Could be.   But we run
thousands of jobs every day including many other OpenMPI jobs (vasp, gromacs, 
raxml, lammps, namd, etc).
Also the users of the Arepo and Gadget code are currently running with MVAPICH2 
without issue.  I installed
it specifically to get them past these OpenMPI problems.  We don’t normally 
build anything with MPICH/MVAPICH/IMPI
since we have never had any real reason to - until now.

That may have to be the solution but the memory leak is so readily reproducible 
that I thought I’d ask about it.
Since it appears that others are not seeing this issue, I’ll continue to try to 
figure it out and if I do, I’ll be sure to post back.

> As a side note, we strongly recommend to avoid
> configure --with-FOO=/usr
> instead
> configure --with-FOO
> should be used (otherwise you will end up with -I/usr/include
> -L/usr/lib64 and that could silently hide third party libraries
> installed in a non standard directory). If --with-FOO fails for you,
> then this is a bug we will appreciate you report.

Noted and logged.  We’ve been using the --with-FOO=/usr for a long time (since 
1.x days).  There was a reason we started doing
it, but I’ve long since forgotten what it was; I think it was to _avoid_ what 
you describe - not cause it.  Regardless,
I’ll heed your warning and remove it from future builds and file a bug if there 
are any problems.

However, I did post about a similar problem previously when configuring against 
an external PMIx library.  The configure
script produces (or did) a "-L/usr/lib” instead of a "-L/usr/lib64” resulting 
in unresolved PMIx routines when linking.
That was with OpenMPI 2.1.2.  We now include a lib -> lib64 symlink in our 
/opt/pmix/x.y.z directories so I haven’t looked to 
see if that was fixed for 3.x or not.

I should have also mentioned in my previous post that HPC_CUDA_DIR=NO meaning 
that CUDA support has
been excluded from these builds (in case anyone was wondering).

Thanks for the feedback,

Charlie

> 
> Cheers,
> 
> Gilles
> On Fri, Oct 5, 2018 at 6:42 AM Charles A Taylor  wrote:
>> 
>> 
>> We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for 
>> that matter) built with UCX support.   The leak shows up
>> whether the “ucx” PML is specified for the run or not.  The applications in 
>> question are arepo and gizmo, but I have no reason to believe
>> that others are not affected as well.
>> 
>> Basically the MPI processes grow without bound until SLURM kills the job or 
>> the host memory is exhausted.
>> If I configure and build with “--without-ucx” the problem goes away.
>> 
>> I didn’t see anything about this on the UCX github site so I thought I’d ask 
>> here.  Anyone else seeing the same or similar?
>> 
>> What version of UCX is OpenMPI 3.1.x tested against?
>> 
>> Regards,
>> 
>> Charlie Taylor
>> UF Research Computing
>> 
>> Details:
>> —
>> RHEL7.5
>> OpenMPI 3.1.2 (and any other version I’ve tried).
>> ucx 1.2.2-1.el7 (

[OMPI users] Memory Leak in 3.1.2 + UCX

2018-10-04 Thread Charles A Taylor

We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for 
that matter) built with UCX support.   The leak shows up
whether the “ucx” PML is specified for the run or not.  The applications in 
question are arepo and gizmo, but I have no reason to believe
that others are not affected as well.

Basically the MPI processes grow without bound until SLURM kills the job or the 
host memory is exhausted.  
If I configure and build with “--without-ucx” the problem goes away.

I didn’t see anything about this on the UCX github site so I thought I’d ask 
here.  Anyone else seeing the same or similar?

What version of UCX is OpenMPI 3.1.x tested against?

Regards,

Charlie Taylor
UF Research Computing

Details:
—
RHEL7.5
OpenMPI 3.1.2 (and any other version I’ve tried).
ucx 1.2.2-1.el7 (RH native)
RH native IB stack
Mellanox FDR/EDR IB fabric
Intel Parallel Studio 2018.1.163

Configuration Options:
—
CFG_OPTS=""
CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" 
LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"

rpmbuild --ba \
 --define '_name openmpi' \
 --define "_version $OMPI_VER" \
 --define "_release ${RELEASE}" \
 --define "_prefix $PREFIX" \
 --define '_mandir %{_prefix}/share/man' \
 --define '_defaultdocdir %{_prefix}' \
 --define 'mflags -j 8' \
 --define 'use_default_rpm_opt_flags 1' \
 --define 'use_check_files 0' \
 --define 'install_shell_scripts 1' \
 --define 'shell_scripts_basename mpivars' \
 --define "configure_options $CFG_OPTS " \
 openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log





Re: [OMPI users] OpenMPI + PMIx + SLURM

2018-07-01 Thread Charles A Taylor
Just wanted to follow up on my own post.

Turns out there was a missing symlink (much embarrassment) on my build host.   
That’s why you don’t see “pmix_v1” in the “srun --mpi=list” output (previous 
post).
Once I fixed that and rebuilt SLURM, I was able to launch existing OpenMPI 3.x 
apps with,

  srun --mpi=pmix_v1

Apologies for the wasted bandwidth.

Regards,

Charlie

> On Jun 28, 2018, at 8:14 AM, Charles A Taylor  wrote:
> 
> There is a name for my pain and it is “OpenMPI + PMIx”.  :)
> 
> I’m looking at upgrading SLURM from 16.05.11 to 17.11.05 (bear with me, this 
> is not a SLURM question).
> 
> After building SLURM 17.11.05 with 
> ‘--with-pmix=/opt/pmix/1.1.5:/opt/pmix/2.1/1’ and installing a test instance, 
> I see
> 
> $ srun --mpi=list
> srun: MPI types are...
> srun: pmix
> srun: pmi2
> srun: pmix_v2
> srun: none
> srun: openmpi
> 
> Seems reasonable.
> 
> Now, we have applications built with OpenMPI 3.0.0 and 3.1.0 linked against 
> /opt/pmix/1.1.5 (--with-pmix=/opt/pmix/1.1.5).  When I attempt to launch 
> these applications using,
> 
>  srun --mpi=pmix 
> 
> I get the following ...
> 
> [c1a-s18.ufhpc:17995] Security mode none is not available
> [c1a-s18.ufhpc:17995] PMIX ERROR: UNREACHABLE in file 
> src/client/pmix_client.c at line 199
> --
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
> 
>  version 16.05 or later: you can use SLURM's PMIx support. This
>  requires that you configure and build SLURM --with-pmix.
> 
>  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>  install PMI-2. You must then build Open MPI using --with-pmi pointing
>  to the SLURM PMI library location.
> 
> Please configure as appropriate and try again.
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> ———
> 
> So slurm/srun appear to have library support for both pmix and pmix_v2 and 
> OpenMPI 3.0.0 and OpenMPI 3.1.0 both have pmix support (1.1.5) since we 
> launch them every day with “srun --mpi=pmix” under slurm 16.05.11.
> 
> Is this a bug?   Am I overlooking something?  Is it possible to transition to 
> OpenMPI 3.x + PMIx 2.x + SLURM 17.x without rebuilding (essentially) 
> everything (including all applications)?
> 
> Charlie Taylor
> UF Research Computing
> 


[OMPI users] OpenMPI + PMIx + SLURM

2018-07-01 Thread Charles A Taylor
There is a name for my pain and it is “OpenMPI + PMIx”.  :)

I’m looking at upgrading SLURM from 16.05.11 to 17.11.05 (bear with me, this is 
not a SLURM question).

After building SLURM 17.11.05 with 
‘--with-pmix=/opt/pmix/1.1.5:/opt/pmix/2.1/1’ and installing a test instance, I 
see

$ srun --mpi=list
srun: MPI types are...
srun: pmix
srun: pmi2
srun: pmix_v2
srun: none
srun: openmpi

Seems reasonable.

Now, we have applications built with OpenMPI 3.0.0 and 3.1.0 linked against 
/opt/pmix/1.1.5 (--with-pmix=/opt/pmix/1.1.5).  When I attempt to launch these 
applications using,

  srun --mpi=pmix 

I get the following ...

[c1a-s18.ufhpc:17995] Security mode none is not available
[c1a-s18.ufhpc:17995] PMIX ERROR: UNREACHABLE in file src/client/pmix_client.c 
at line 199
--
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
———

So slurm/srun appear to have library support for both pmix and pmix_v2 and 
OpenMPI 3.0.0 and OpenMPI 3.1.0 both have pmix support (1.1.5) since we launch 
them every day with “srun --mpi=pmix” under slurm 16.05.11.
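
One way to double-check which PMIx a given Open MPI install actually carries (as opposed
to what it was configured with) is to ask ompi_info and srun directly; a small sketch:

  ompi_info | grep -i pmix      # which pmix MCA component was built in
  srun --mpi=list               # which PMI flavors the installed SLURM can speak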

Is this a bug?   Am I overlooking something?  Is it possible to transition to 
OpenMPI 3.x + PMIx 2.x + SLURM 17.x without rebuilding (essentially) everything 
(including all applications)?

Charlie Taylor
UF Research Computing


Re: [OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
Aw, sheesh.  Thanks.  Somehow I missed that despite being on the page - lack of 
focus,  I guess.

Best,

Charlie

> On Jun 14, 2018, at 4:38 PM, Pavel Shamis  wrote:
> 
> You just have to switch PML to UCX.
> You have some example of the command line here: 
> https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
> Best,
> P.
> 
> 
> On Thu, Jun 14, 2018 at 3:25 PM Charles A Taylor  <mailto:chas...@ufl.edu>> wrote:
> Hmmm.  ompi_info only shows the ucx pml.  I don’t see any “transports”.   
> Will they show up somewhere or are they documented.   Right now it looks like 
> the only UCX related thing I can do with openmpi 3.1.0 is
> 
> export OMPI_MCA_pml=ucx
> mpiexec ….
> 
> From ompi_info…
> 
> $ ompi_info --param all all  | more | grep ucx
>  MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v3.1.0)
>  MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.0)
> 
> I’m assuming there is more to it than that.
> 
> Regards,
> 
> Charlie
> 
> 
> > On Jun 14, 2018, at 1:18 PM, Jeff Squyres (jsquyres) via users 
> > mailto:users@lists.open-mpi.org>> wrote:
> > 
> > Charles --
> > 
> > It may have gotten lost in the middle of this thread, but the 
> > vendor-recommended way of running on InfiniBand these days is with UCX.  
> > I.e., install OpenUCX and use one of the UCX transports in Open MPI.  
> > Unless you have special requirements, you should likely give this a try and 
> > see if it works for you.
> > 
> > The libfabric / verbs combo *may* work, but I don't know how robust the 
> > verbs libfabric support was in the v1.5 release series.
> > 
> > 
> >> On Jun 14, 2018, at 10:01 AM, Charles A Taylor  >> <mailto:chas...@ufl.edu>> wrote:
> >> 
> >> Hi Matias,
> >> 
> >> Thanks for the response.  
> >> 
> >> As of a couple of hours ago we are running: 
> >> 
> >>   libfabric-devel-1.5.3-1.el7.x86_64
> >>   libfabric-1.5.3-1.el7.x86_64
> >> 
> >> As for the provider, I saw that one but just listed “verbs”.  I’ll go with 
> >> the “verbs;ofi_rxm” going forward.
> >> 
> >> Regards,
> >> 
> >> Charlie
> >> 
> >> 
> >>> On Jun 14, 2018, at 12:49 PM, Cabral, Matias A  >>> <mailto:matias.a.cab...@intel.com>> wrote:
> >>> 
> >>> Hi Charles,
> >>> 
> >>> What version of libfabric do you have installed? To run OMPI using the 
> >>> verbs provider you need to pair it with the ofi_rxm provider. fi_info 
> >>> should list it like:
> >>> …
> >>> provider: verbs;ofi_rxm
> >>> …
> >>> 
> >>> So in your command line you have to specify:
> >>> mpirun -mca pml cm -mca mtl ofi -mca mtl_ofi_provider_include 
> >>> “verbs;ofi_rxm”  ….
> >>> 
> >>> (don’t skip the quotes)
> >>> 
> >> 
> > 
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com <mailto:jsquy...@cisco.com>
> > 

Re: [OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
Hmmm.  ompi_info only shows the ucx pml.  I don’t see any “transports”.   Will 
they show up somewhere, or are they documented?   Right now it looks like the 
only UCX related thing I can do with openmpi 3.1.0 is

export OMPI_MCA_pml=ucx
mpiexec ….

From ompi_info…

$ ompi_info --param all all  | more | grep ucx
 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v3.1.0)
 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.0)

I’m assuming there is more to it than that.
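
UCX ships its own tool for listing the transports and devices it can drive, independent
of what ompi_info reports; a sketch (UCX_TLS is a UCX environment variable and the
transport list below is only an example):

  ucx_info -d | grep -E 'Transport|Device'
  export OMPI_MCA_pml=ucx
  export UCX_TLS=rc,sm,self      # optionally restrict UCX to specific transports
  mpiexec ...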

Regards,

Charlie


> On Jun 14, 2018, at 1:18 PM, Jeff Squyres (jsquyres) via users 
>  wrote:
> 
> Charles --
> 
> It may have gotten lost in the middle of this thread, but the 
> vendor-recommended way of running on InfiniBand these days is with UCX.  
> I.e., install OpenUCX and use one of the UCX transports in Open MPI.  Unless 
> you have special requirements, you should likely give this a try and see if 
> it works for you.
> 
> The libfabric / verbs combo *may* work, but I don't know how robust the verbs 
> libfabric support was in the v1.5 release series.
> 
> 
>> On Jun 14, 2018, at 10:01 AM, Charles A Taylor  wrote:
>> 
>> Hi Matias,
>> 
>> Thanks for the response.  
>> 
>> As of a couple of hours ago we are running: 
>> 
>>   libfabric-devel-1.5.3-1.el7.x86_64
>>   libfabric-1.5.3-1.el7.x86_64
>> 
>> As for the provider, I saw that one but just listed “verbs”.  I’ll go with 
>> the “verbs;ofi_rxm” going forward.
>> 
>> Regards,
>> 
>> Charlie
>> 
>> 
>>> On Jun 14, 2018, at 12:49 PM, Cabral, Matias A  
>>> wrote:
>>> 
>>> Hi Charles,
>>> 
>>> What version of libfabric do you have installed? To run OMPI using the 
>>> verbs provider you need to pair it with the ofi_rxm provider. fi_info 
>>> should list it like:
>>> …
>>> provider: verbs;ofi_rxm
>>> …
>>> 
>>> So in your command line you have to specify:
>>> mpirun -mca pml cm -mca mtl ofi -mca mtl_ofi_provider_include 
>>> “verbs;ofi_rxm”  ….
>>> 
>>> (don’t skip the quotes)
>>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 

Re: [OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
Thank you, Jeff.

The ofi MTL with the verbs provider seems to be working well at the moment.  
I’ll need to let it run a day or so before I know whether we can avoid the 
deadlocks experienced with the straight openib BTL.

I’ve also built-in UCX support so I’ll be trying that next.  

Again, thanks for the response.

Oh, before I forget and I hope this doesn’t sound snarky, but how does the 
community find out that things like UCX and libfabric exist as well as how to 
use them when the FAQs on open-mpi.org 

don’t have much information beyond the now ancient 1.8 series?  After all, 
this is hardly your typical “mpiexec” command line…

 mpirun -mca pml cm -mca mtl ofi -mca mtl_ofi_provider_include "verbs;ofi_rxm" ...,

if you get my drift.  Even google doesn’t seem to know all that much about 
these things.  I’m feeling more than a little ignorant these days.  :)

Thanks to all for the responses.  It has been a huge help.

Charlie

> On Jun 14, 2018, at 1:18 PM, Jeff Squyres (jsquyres) via users 
>  wrote:
> 
> Charles --
> 
> It may have gotten lost in the middle of this thread, but the 
> vendor-recommended way of running on InfiniBand these days is with UCX.  
> I.e., install OpenUCX and use one of the UCX transports in Open MPI.  Unless 
> you have special requirements, you should likely give this a try and see if 
> it works for you.
> 
> The libfabric / verbs combo *may* work, but I don't know how robust the verbs 
> libfabric support was in the v1.5 release series.


Re: [OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
Hi Matias,

Thanks for the response.  

As of a couple of hours ago we are running: 

   libfabric-devel-1.5.3-1.el7.x86_64
   libfabric-1.5.3-1.el7.x86_64

As for the provider, I saw that one but just listed “verbs”.  I’ll go with the 
“verbs;ofi_rxm” going forward.
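
Before a run, libfabric's own fi_info can confirm the paired provider is actually
available on the node; a sketch (the -p filter syntax can vary a little between
libfabric releases):

  fi_info -p 'verbs;ofi_rxm' | head
  fi_info | grep -i 'verbs;ofi_rxm'     # or just list everything and look for the pair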

Regards,

Charlie


> On Jun 14, 2018, at 12:49 PM, Cabral, Matias A  
> wrote:
> 
> Hi Charles,
>  
> What version of libfabric do you have installed? To run OMPI using the verbs 
> provider you need to pair it with the ofi_rxm provider. fi_info should list 
> it like:
> …
> provider: verbs;ofi_rxm
> …
>  
> So in your command line you have to specify:
> mpirun -mca pml cm -mca mtl ofi -mca mtl_ofi_provider_include "verbs;ofi_rxm" 
>  ….
>  
> (don’t skip the quotes)
>  


Re: [OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
FYI…

GIZMO: prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c:443: 
fi_ibv_rdm_tagged_release_remote_sbuff: Assertion `0' failed.

GIZMO:10405 terminated with signal 6 at PC=2add5835c1f7 SP=7fff8071b008.  
Backtrace:
/usr/lib64/libc.so.6(gsignal+0x37)[0x2add5835c1f7]
/usr/lib64/libc.so.6(abort+0x148)[0x2add5835d8e8]
/usr/lib64/libc.so.6(+0x2e266)[0x2add58355266]
/usr/lib64/libc.so.6(+0x2e312)[0x2add58355312]
/lib64/libfabric.so.1(+0x4df43)[0x2add5b87df43]
/lib64/libfabric.so.1(+0x43af2)[0x2add5b873af2]
/lib64/libfabric.so.1(+0x43ea9)[0x2add5b873ea9]


> On Jun 14, 2018, at 7:48 AM, Howard Pritchard  wrote:
> 
> Hello Charles
> 
> You are heading in the right direction.
> 
> First you might want to run the libfabric fi_info command to see what 
> capabilities you picked up from the libfabric RPMs.
> 
> Next you may well not actually be using the OFI  mtl.
> 
> Could you run your app with
> 
> export OMPI_MCA_mtl_base_verbose=100
> 
> and post the output?
> 
> It would also help if you described the system you are using :  OS 
> interconnect cpu type etc. 
> 
> Howard
> 
> Charles A Taylor <chas...@ufl.edu> wrote on Thu, Jun 14, 2018 at 06:36:
> Because of the issues we are having with OpenMPI and the openib BTL 
> (questions previously asked), I’ve been looking into what other transports 
> are available.  I was particularly interested in OFI/libfabric support but 
> cannot find any information on it more recent than a reference to the usNIC 
> BTL from 2015 (Jeff Squyres, Cisco).  Unfortunately, the openmpi-org website 
> FAQ’s covering OpenFabrics support don’t mention anything beyond OpenMPI 1.8. 
>  Given that 3.1 is the current stable version, that seems odd.
> 
> That being the case, I thought I’d ask here. After laying down the 
> libfabric-devel RPM and building (3.1.0) with --with-libfabric=/usr, I end up 
> with an “ofi” MTL but nothing else.   I can run with OMPI_MCA_mtl=ofi and 
> OMPI_MCA_btl=“self,vader,openib” but it eventually crashes in libopen-pal.so. 
>   (mpi_waitall() higher up the stack).
> 
> GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0.  
> Backtrace:
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f]
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d]
> 
> Questions: Am I using the OFI MTL as intended?   Should there be an “ofi” 
> BTL?   Does anyone use this?
> 
> Thanks,
> 
> Charlie Taylor
> UF Research Computing
> 
> PS - If you could use some help updating the FAQs, I’d be willing to put in 
> some time.  I’d probably learn a lot.

Re: [OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
ude list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01464] select: init returned failure for component ofi
[c29a-s2.ufhpc:01464] select: no component selected
[c29a-s2.ufhpc:01464] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01464] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01465] select: init returned failure for component ofi
[c29a-s2.ufhpc:01465] select: no component selected
[c29a-s2.ufhpc:01465] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01465] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:269: mtl:ofi:provider_include = 
"psm,psm2,gni"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = 
"(null)"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include 
list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in 
include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:410: select_ofi_provider: no provider 
found
[c29a-s2.ufhpc:01463] select: init returned failure for component ofi
[c29a-s2.ufhpc:01463] select: no component selected
[c29a-s2.ufhpc:01466] select: init returned failure for component ofi
[c29a-s2.ufhpc:01466] select: no component selected
[c29a-s2.ufhpc:01466] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01466] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01463] mca: base: close: unloading component ofi
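
The verbose output above also shows why the OFI MTL gives up: the default provider
include list is "psm,psm2,gni", so the verbs provider is filtered out before selection.
A sketch of overriding that for a run, mirroring the -mca form suggested earlier in the
thread:

  export OMPI_MCA_mtl_ofi_provider_include="verbs;ofi_rxm"
  export OMPI_MCA_pml=cm
  export OMPI_MCA_mtl=ofi
  mpiexec ...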


> On Jun 14, 2018, at 7:48 AM, Howard Pritchard  wrote:
> 
> Hello Charles
> 
> You are heading in the right direction.
> 
> First you might want to run the libfabric fi_info command to see what 
> capabilities you picked up from the libfabric RPMs.
> 
> Next you may well not actually be using the OFI  mtl.
> 
> Could you run your app with
> 
> export OMPI_MCA_mtl_base_verbose=100
> 
> and post the output?
> 
> It would also help if you described the system you are using :  OS 
> interconnect cpu type etc. 
> 
> Howard
> 
> Charles A Taylor <chas...@ufl.edu> wrote on Thu, Jun 14, 2018 at 06:36:
> Because of the issues we are having with OpenMPI and the openib BTL 
> (questions previously asked), I’ve been looking into what other transports 
> are available.  I was particularly interested in OFI/libfabric support but 
> cannot find any information on it more recent than a reference to the usNIC 
> BTL from 2015 (Jeff Squyres, Cisco).  Unfortunately, the openmpi-org website 
> FAQ’s covering OpenFabrics support don’t mention anything 

[OMPI users] A couple of general questions

2018-06-14 Thread Charles A Taylor
Because of the issues we are having with OpenMPI and the openib BTL (questions 
previously asked), I’ve been looking into what other transports are available.  
I was particularly interested in OFI/libfabric support but cannot find any 
information on it more recent than a reference to the usNIC BTL from 2015 (Jeff 
Squyres, Cisco).  Unfortunately, the openmpi-org website FAQ’s covering 
OpenFabrics support don’t mention anything beyond OpenMPI 1.8.  Given that 3.1 
is the current stable version, that seems odd.

That being the case, I thought I’d ask here. After laying down the 
libfabric-devel RPM and building (3.1.0) with --with-libfabric=/usr, I end up 
with an “ofi” MTL but nothing else.   I can run with OMPI_MCA_mtl=ofi and 
OMPI_MCA_btl=“self,vader,openib” but it eventually crashes in libopen-pal.so.   
(mpi_waitall() higher up the stack).

GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0.  
Backtrace:
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d]
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754]
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f]
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d]

Questions: Am I using the OFI MTL as intended?   Should there be an “ofi” BTL?  
 Does anyone use this?

Thanks,

Charlie Taylor
UF Research Computing

PS - If you could use some help updating the FAQs, I’d be willing to put in 
some time.  I’d probably learn a lot.

[OMPI users] OpenMPI + gadget/gizmo/arepo

2018-05-23 Thread Charles A Taylor
I feel a little funny posting this but I have observed this problem now over 
three different versions of OpenMPI (1.10.2, 2.0.3, 3.0.0) and have refrained 
from asking about it before now because we always had a work-around.  That may 
not be the case now and I feel like I’m missing something obvious.

I’ve tried to summarize our system configuration as succinctly as possible 
below but it is a pretty standard Linux cluster with an IB interconnect 
(mellanox).

In short, we run many MPI applications (LAMMPS, VASP, NAMD, AMBER, ENZO, etc) 
successfully.  However, the astrophysical galaxy modeling codes Arepo and Gizmo 
(both Gadget derivatives) seem to give us fits - deadlocking randomly after 
running for hours or days.  I’ve tracked this down to a deadlock with some 
processes in MPI_Waitall() and others in MPI_Sendrecv().  I’ve looked at the 
code where the processes deadlock and can’t see any obvious issue.  I also know 
that the same versions of the same codes are run on other, similar platforms at 
other sites (TACC, NASA, for example).  
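
When a run does wedge, grabbing stacks from a couple of the hung ranks on one node is
the quickest way to confirm who is stuck where; a rough sketch (the PID is whatever ps
reports for a stuck rank, and /proc access may need root):

  gdb -p <PID> -batch -ex 'thread apply all bt'    # all-thread backtrace of one rank
  cat /proc/<PID>/stack                            # lighter-weight kernel-side view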

While trying various things over the last few days I have learned that setting

 export 
OMPI_MCA_btl_openib_flags="send,fetching-atomics,need-ack,need-csum,hetero-rdma"

seems to avoid the deadlocks.  In other words, disabling RDMA read/write seems 
to avoid the deadlocks.  Perhaps some RDMA read/write tuning is in order but 
I’ve had no success with that so far.

There are a couple of MPI related ifdefs in the code with regard to 
MPI_IN_PLACE and async sendrecv().  I’ve experimented with both.  Prior to 
OpenMPI 3.0.0 the gizmo code would run without deadlocking if 
-DNO_ISEND_IRECV_IN_DOMAIN was used at build time.  Under OpenMPI 3.0.0 that is 
no longer the case.  

FWIW, I also know that gizmo runs (on our system) using intel mpi (5.1.1) but 
I’m trying to avoid making that generally available since every other app we 
have works just fine with OpenMPI.

Anyone else have experience with these codes using OpenMPI (or otherwise)?  Any 
comments or suggestions would be appreciated. 

Regards,

Charlie Taylor
UF Research Computing



Applications: Gadget derivatives gizmo and arepo
Problem:   Random Deadlocks in MPI_waitall, MPI_sendrecv
Platform:   RedHat EL7 (and RedHat EL6 previously)
Systems:  Dell SOS6320, Haswell (2 x CPU E5-2698 v3 @ 2.30GHz)
Interconnect: Mellanox ConnectX-3 FDR (OpenSM fabric manager)
IB Stack:   RedHat EL7.4 native
OpenMPI: 3.0.0 (currently - see configure options below, but the problem 
has been persistent across versions)
Compilers:   Intel Suite (various versions - 2016, 2017, 2018)

Build time configure options.
--
CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" 
LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] openmpi/slurm/pmix

2018-04-24 Thread Charles A Taylor
Hi Gilles,

Yes, I did.  It was ignored AFAICT.  I did not look for the reason - only so 
many hours in the day.  

Regards,

Charlie


> On Apr 24, 2018, at 8:07 AM,   wrote:
> 
> Charles,
> 
> have you tried to configure --with-pmix-libdir=/.../lib64 ?
> 
> Cheers,
> 
> Gilles
> 
> - Original Message -
>> I'll add that when building OpenMPI 3.0.0 with an external PMIx, I 
> found that the OpenMPI configure script only looks in “lib” for the 
> pmix library but the pmix configure/build uses “lib64” (as it should on 
> a 64-bit system) so the configure script falls back to the internal PMIx.
>  As Robert suggested, check your config.log for “not found” messages.  
>> 
>> In my case, I simply added a “lib -> lib64” symlink in the PMIx 
> installation directory rather than alter the configure script and that 
> did the trick.
>> 
>> Good luck,
>> 
>> Charlie
>> 
>>> On Apr 23, 2018, at 6:07 PM, r...@open-mpi.org wrote:
>>> 
>>> Hi Michael
>>> 
>>> Looks like the problem is that you didn't wind up with the external 
> PMIx. The component listed in your error is the internal PMIx one which 
> shouldn't have built given that configure line.
>>> 
>>> Check your config.out and see what happened. Also, ensure that your 
> LD_LIBRARY_PATH is properly pointing to the installation, and that you 
> built into a “clean” prefix.
>>> 
>>> 
 On Apr 23, 2018, at 12:01 PM, Michael Di Domenico  gmail.com> wrote:
 
 i'm trying to get slurm 17.11.5 and openmpi 3.0.1 working with pmix.
 
 everything compiled, but when i run something i get
 
 : symbol lookup error: /openmpi/mca_pmix_pmix2x.so: undefined 
> symbol:
 opal_libevent2022_evthread_use_pthreads
 
 i'm more than sure i did something wrong, but i'm not sure what, here's what i did
 
 compile libevent 2.1.8
 
 ./configure --prefix=/libevent-2.1.8
 
 compile pmix 2.1.0
 
 ./configure --prefix=/pmix-2.1.0 --with-psm2
 --with-munge=/munge-0.5.13 --with-libevent=/libevent-2.1.8
 
 compile openmpi
 
 ./configure --prefix=/openmpi-3.0.1 --with-slurm=/slurm-17.11.5
 --with-hwloc=external --with-mxm=/opt/mellanox/mxm
 --with-cuda=/usr/local/cuda --with-pmix=/pmix-2.1.0
 --with-libevent=/libevent-2.1.8
 
 when i look at the symbols in the mca_pmix_pmix2x.so library the
 function is indeed undefined (U) in the output, but checking ldd
 against the library doesn't show any missing
 
 any thoughts?
 ___
 users mailing list
 users@lists.open-mpi.org
 https://lists.open-mpi.org/mailman/listinfo/users
> 
>>> 
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] openmpi/slurm/pmix

2018-04-24 Thread Charles A Taylor
I’ll add that when building OpenMPI 3.0.0 with an external PMIx, I found that 
the OpenMPI configure script only looks in “lib” for the pmix library but 
the pmix configure/build uses “lib64” (as it should on a 64-bit system) so the 
configure script falls back to the internal PMIx.  As Robert suggested, check 
your config.log for “not found” messages.  

In my case, I simply added a “lib -> lib64” symlink in the PMIx installation 
directory rather than alter the configure script and that did the trick.
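
Concretely, the workaround was just something like the following (a sketch, 
assuming a PMIx prefix of /opt/pmix as in our builds):

  # let the OpenMPI configure script find libpmix under "lib"
  cd /opt/pmix
  ln -s lib64 lib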

Good luck,

Charlie

> On Apr 23, 2018, at 6:07 PM, r...@open-mpi.org wrote:
> 
> Hi Michael
> 
> Looks like the problem is that you didn’t wind up with the external PMIx. The 
> component listed in your error is the internal PMIx one which shouldn’t have 
> built given that configure line.
> 
> Check your config.out and see what happened. Also, ensure that your 
> LD_LIBRARY_PATH is properly pointing to the installation, and that you built 
> into a “clean” prefix.
> 
> 
>> On Apr 23, 2018, at 12:01 PM, Michael Di Domenico  
>> wrote:
>> 
>> i'm trying to get slurm 17.11.5 and openmpi 3.0.1 working with pmix.
>> 
>> everything compiled, but when i run something i get
>> 
>> : symbol lookup error: /openmpi/mca_pmix_pmix2x.so: undefined symbol:
>> opal_libevent2022_evthread_use_pthreads
>> 
>> i'm more than sure i did something wrong, but i'm not sure what, here's what i 
>> did
>> 
>> compile libevent 2.1.8
>> 
>> ./configure --prefix=/libevent-2.1.8
>> 
>> compile pmix 2.1.0
>> 
>> ./configure --prefix=/pmix-2.1.0 --with-psm2
>> --with-munge=/munge-0.5.13 --with-libevent=/libevent-2.1.8
>> 
>> compile openmpi
>> 
>> ./configure --prefix=/openmpi-3.0.1 --with-slurm=/slurm-17.11.5
>> --with-hwloc=external --with-mxm=/opt/mellanox/mxm
>> --with-cuda=/usr/local/cuda --with-pmix=/pmix-2.1.0
>> --with-libevent=/libevent-2.1.8
>> 
>> when i look at the symbols in the mca_pmix_pmix2x.so library the
>> function is indeed undefined (U) in the output, but checking ldd
>> against the library doesn't show any missing
>> 
>> any thoughts?
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
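
For anyone hitting the same undefined-symbol error, a quick sketch of the checks 
described above (the paths are assumptions based on the prefixes in the configure 
lines):

  # show the undefined (U) libevent symbol in the component
  nm -D /openmpi-3.0.1/lib/openmpi/mca_pmix_pmix2x.so | grep opal_libevent

  # see which libevent library the component resolves at run time
  ldd /openmpi-3.0.1/lib/openmpi/mca_pmix_pmix2x.so | grep -i event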

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] ARM/Allinea DDT

2018-04-12 Thread Charles A Taylor
Understood.  But since the OpenMPI versions in question are not listed as 
supported on the ARM/Allinea product page, I thought I’d ask if it is 
_supposed_ to work before bothering folks with details.

In the meantime, ARM/Allinea has responded so I’ll provide them with the 
details.  But, in short, DDT can’t seem to attach to the 3.0 processes/ranks 
and just hangs trying.  Using the same code, that doesn’t happen with OpenMPI 
1.10.2 nor IntelMPI 5.1.1.  This is under RHEL 7.4 and launching with srun 
under SLURM 16.05.11 (for those who want to know).

Thanks for the replies,

Charlie


> On Apr 11, 2018, at 3:15 PM, r...@open-mpi.org wrote:
> 
> You probably should provide a little more info here. I know the MPIR attach 
> was broken in the v2.x series, but we fixed that - could be something remains 
> broken in OMPI 3.x.
> 
> FWIW: I doubt it's an Allinea problem.
> 
>> On Apr 11, 2018, at 11:54 AM, Charles A Taylor <chas...@ufl.edu> wrote:
>> 
>> 
>> Contacting ARM seems a bit difficult so I thought I would ask here.  We rely 
>> on DDT for debugging but it doesn’t work with OpenMPI 3.x and I can’t find 
>> anything about them having plans to support it.
>> 
>> Anyone know if ARM DDT has plans to support newer versions of OpenMPI?
>> 
>> Charlie Taylor
>> UF Research Computing
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] ARM/Allinea DDT

2018-04-11 Thread Charles A Taylor

Contacting ARM seems a bit difficult so I thought I would ask here.  We rely on 
DDT for debugging but it doesn’t work with OpenMPI 3.x and I can’t find 
anything about them having plans to support it.

Anyone know if ARM DDT has plans to support newer versions of OpenMPI?

Charlie Taylor
UF Research Computing
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI 3.0.0 on RHEL-7

2018-03-08 Thread Charles A Taylor
Thanks for the replies.  Very helpful.  I thought that --with-libevent=/usr was 
equivalent to, if not better than, --with-libevent=external, but I was really 
just shooting myself in the foot.

Changing the pmix, libevent, and hwloc config options to,

--with-pmix=/opt/pmix --with-libevent=external --with-hwloc=external

resulted in a working config and working libraries.  One note - I _did_ have to 
create a lib -> lib64 symlink in /opt/pmix so that is something the configure 
script maintainers may want to take a look at.  Doesn’t seem like that should 
be necessary on an x86_64 platform.

Thanks again for setting me straight.

Charlie

> On Mar 7, 2018, at 7:08 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Charles,
> 
> First, you should really
> 
> configure --with-libevent=external
> 
> Then could you please describe the issue ?
> configure ? make ?
> 
> if configure fails, please compress and attach your config.log so we
> can have a look
> 
> 
> Cheers,
> 
> Gilles
> 
> On Thu, Mar 8, 2018 at 9:02 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> So long as it is libevent-devel-2.0.22, you should be okay. You might want 
>> to up PMIx to v1.2.5 as Slurm 16.05 should handle that okay. OMPI v3.0.0 has 
>> PMIx 2.0 in it, but should be okay with 1.2.5 last I checked (but it has 
>> been awhile and I can’t swear to it).
>> 
>> 
>>> On Mar 7, 2018, at 2:03 PM, Charles A Taylor <chas...@ufl.edu> wrote:
>>> 
>>> Hi
>>> 
>>> Distro: RHEL-7 (7.4)
>>> SLURM: 16.05.11
>>> PMIx: 1.1.5
>>> 
>>> Trying to build OpenMPI 3.0.0 for our RHEL7 systems but running into what 
>>> might be a configure script issue more than a real incompatibility problem. 
>>>  Configuring with the following,
>>> 
>>>  --with-slurm=/opt/slurm --with-pmix=/opt/pmix 
>>> --with-external-libpmix=/opt/pmix/lib64 --with-libevent=/usr
>>> 
>>> It seems happy enough with PMIx and there appears to be PMIx 1.x support.  
>>> However, libevent is another issue and my take so far is that the EL7 
>>> native libevent-devel-2.0.x is the crux of the problem.  It doesn’t seem 
>>> like the configure script expects that and I’m a little reluctant to start 
>>> patching things together with symlinks or configure script changes.
>>> 
>>> My real question is what is intended?   Should this be working or was 3.0.0 
>>> really intended for PMIx 2.x?  I imagine we will go to PMIx 2.x with SLURM 
>>> 17 but aren’t quite ready for that yet.
>>> 
>>> Regards,
>>> 
>>> Charlie Taylor
>>> UF Research Computing
>>> 
>>> 
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] OpenMPI 3.0.0 on RHEL-7

2018-03-07 Thread Charles A Taylor
Hi 

Distro: RHEL-7 (7.4)
SLURM: 16.05.11
PMIx: 1.1.5

Trying to build OpenMPI 3.0.0 for our RHEL7 systems but running into what might 
be a configure script issue more than a real incompatibility problem.  
Configuring with the following,

   --with-slurm=/opt/slurm --with-pmix=/opt/pmix 
--with-external-libpmix=/opt/pmix/lib64 --with-libevent=/usr 

It seems happy enough with PMIx and there appears to be PMIx 1.x support.  
However, libevent is another issue and my take so far is that the EL7 native 
libevent-devel-2.0.x is the crux of the problem.  It doesn’t seem like the 
configure script expects that and I’m a little reluctant to start patching 
things together with symlinks or configure script changes.

My real question is what is intended?   Should this be working or was 3.0.0 
really intended for PMIx 2.x?  I imagine we will go to PMIx 2.x with SLURM 17 
but aren’t quite ready for that yet.

Regards,

Charlie Taylor
UF Research Computing


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-19 Thread Charles A Taylor

> Or one could tell OMPI to do what you really want it to do using map-by and 
> bind-to options, perhaps putting them in the default MCA param file.

Nod.  Agreed, but far too complicated for 98% of our users.
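
For reference, this is the sort of thing that would go in the default MCA param 
file; a sketch with example values, not our actual settings:

  # $PREFIX/etc/openmpi-mca-params.conf
  rmaps_base_mapping_policy = core
  hwloc_base_binding_policy = core
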
> 
> Or you could enable cgroups in slurm so that OMPI sees the binding envelope - 
> it will respect it.

We’ve configured cgroups from the beginning.

> The problem is that OMPI isn’t seeing the requested binding envelope and 
> thinks resources are available that really aren’t, and so it gets confused 
> about how to map things. Slurm expresses that envelope in an envar, but the 
> name and syntax keep changing over the releases, and we just can’t track it 
> all the time.

Understood.

> I’m not sure what “slurm_nodeid” is - where does this come from?

Sorry, it was S_JOB_NODEID from spank.h.  I ended up changing my approach to 
the tmpdir creation because of this and the fact that the job’s UID/GID were not 
available in the SPANK routine where I needed them.  I would hope that this 
maps to the exported env variable SLURM_NODEID but I don’t know that for sure.
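
A quick way to see what each host actually reports, run inside an allocation 
(a sketch; assumes a two-node, two-task allocation):

  # print the per-host node id that slurm exports to each task
  srun -N2 -n2 bash -c 'echo $(hostname): $SLURM_NODEID'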

Thanks for the feedback,

Charlie

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-19 Thread Charles A Taylor
Hi All,

I’m glad to see this come up.  We’ve used OpenMPI for a long time and switched 
to SLURM (from torque+moab) about 2.5 years ago.  At the time, I had a lot of 
questions about running MPI jobs under SLURM and good information seemed to be 
scarce - especially regarding “srun”.   I’ll just briefly share my/our 
observations.  For those who are interested, there are examples of our 
suggested submission scripts at 
https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#MPI_job 
 (as I type this I’m 
hoping that page is up-to-date).  Feel free to comment or make suggestions if 
you have had different experiences or know better (very possible).

1. We initially ignored srun since mpiexec _seemed_ to work fine (more below).

2. We soon started to get user complaints of MPI apps running at 1/2 to 1/3 of 
their expected or previously observed speeds - but only sporadically - meaning 
that sometimes the same job, submitted the same way would run at full speed and 
sometimes at 1/2 or 1/3 (almost exactly) speed.

Investigation showed that some MPI ranks in the job were time-slicing across 
one or more of the cores allocated by SLURM.  It turns out that if the slurm 
allocation is not consistent with the default OMPI core/socket mapping, this 
can easily happen.  It can be avoided by a) using “srun --mpi=pmi2” or, as of 
2.x, “srun --mpi=pmix”, or b) more carefully crafting your slurm resource request 
to be consistent with the OMPI default core/socket mapping.

So beware of resource requests that specify only the number of tasks 
(—ntasks=64) and then launch with “mpiexec”.  Slurm will happily allocate those 
tasks anywhere it can (on a busy cluster) and you will get some very 
non-optimal core mappings/bindings and, possibly, core sharing.
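
By contrast, here is a minimal sketch of a request that keeps the slurm 
allocation consistent with the launch; the values are examples only and the 
real templates are on the wiki page above:

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=32
  #SBATCH --cpus-per-task=1

  # launch inside the slurm-provided binding envelope
  srun --mpi=pmix ./my_mpi_app   # or --mpi=pmi2, depending on the slurm build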

3. While doing some spank development for a local, per-job (not per step) 
temporary directory, I noticed that when launching multi-host MPI jobs with 
mpiexec vs srun, you end up with more than one host with “slurm_nodeid=1”.  I’m 
not sure if this is a bug (it was 15.08.x) or not and it didn’t seem to cause 
issues but I also don’t think that it is ideal for two nodes in the same job to 
have the same numeric nodeid.   When launching with “srun”, that didn’t happen.

Anyway, that is what we have observed.  Generally speaking, I try to get users 
to use “srun” but many of them still use “mpiexec” out of habit.  You know what 
they say about old habits.  

Comments, suggestions, or just other experiences are welcome.  Also, if anyone 
is interested in the tmpdir spank plugin, you can contact me.  We are happy to 
share.

Best and Merry Christmas to all,

Charlie Taylor
UF Research Computing



> On Dec 18, 2017, at 8:12 PM, r...@open-mpi.org wrote:
> 
> We have had reports of applications running faster when executing under 
> OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear, 
> but are likely related to differences in mapping/binding options (OMPI 
> provides a very large range compared to srun) and optimization flags provided 
> by mpiexec that are specific to OMPI.
> 
> OMPI uses PMIx for wireup support (starting with the v2.x series), which 
> provides a faster startup than other PMI implementations. However, that is 
> also available with Slurm starting with the 16.05 release, and some further 
> PMIx-based launch optimizations were recently added to the Slurm 17.11 
> release. So I would expect that launch via srun with the latest Slurm release 
> and PMIx would be faster than mpiexec - though that still leaves the faster 
> execution reports to consider.
> 
> HTH
> Ralph
> 
> 
>> On Dec 18, 2017, at 2:18 PM, Prentice Bisbal  wrote:
>> 
>> Greeting OpenMPI users and devs!
>> 
>> We use OpenMPI with Slurm as our scheduler, and a user has asked me this: 
>> should they use mpiexec/mpirun or srun to start their MPI jobs through Slurm?
>> 
>> My inclination is to use mpiexec, since that is the only method that's 
>> (somewhat) defined in the MPI standard and therefore the most portable, and 
>> the examples in the OpenMPI FAQ use mpirun. However, the Slurm documentation 
>> on the schedmd website say to use srun with the --mpi=pmi option. (See links 
>> below)
>> 
>> What are the pros/cons of using these two methods, other than the 
>> portability issue I already mentioned? Does srun+pmi use a different method 
>> to wire up the connections? Some things I read online seem to indicate that. 
>> If slurm was built with PMI support, and OpenMPI was built with Slurm 
>> support, does it really make any difference?
>> 
>> https://www.open-mpi.org/faq/?category=slurm
>>  
>> 

Re: [OMPI users] PMIx + OpenMPI

2017-08-07 Thread Charles A Taylor
Many thanks to all who replied and especially to Artem Polyakov of Mellanox who 
provided a slurm-15.08.13 specific pmix patch.  That patch applied and built 
cleanly against the 15.08.13 tarball and better yet, it works.  

Regards,

Charles A. Taylor
UF Research Computing

> On Aug 6, 2017, at 9:14 AM, Charles A Taylor <chas...@ufl.edu> wrote:
> 
> HI Gilles,
> 
> I tried both “--with-pmix=/opt/pmix” and “--with-pmix=internal” and got the 
> same “UNREACHABLE” error both ways.  I tried the “external” first since that 
> is what SLURM was built against.
> 
> I’m missing something simple/basic - just not sure what it is.  
> 
> Thanks,
> 
> Charlie
> 
>> On Aug 6, 2017, at 7:43 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> 
>> Charles,
>> 
>> did you build Open MPI with the external PMIx ?
>> iirc, Open MPI 2.0.x does not support cross version PMIx
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Sun, Aug 6, 2017 at 7:59 PM, Charles A Taylor <chas...@ufl.edu> wrote:
>>> 
>>>> On Aug 6, 2017, at 6:53 AM, Charles A Taylor <chas...@ufl.edu> wrote:
>>>> 
>>>> 
>>>> Anyone successfully using PMIx with OpenMPI and SLURM?  I have,
>>>> 
>>>> 1. Installed an “external” version (1.1.5) of PMIx.
>>>> 2. Patched SLURM 15.08.13 with the SchedMD-provided PMIx patch (results in 
>>>> an mpi_pmix plugin along the lines of mpi_pmi2).
>>>> 3. Built OpenMPI 2.0.1 (tried 2.0.3 as well).
>>>> 
>>>> However, when attempting to launch MPI apps (LAMMPS in this case), I get
>>>> 
>>>>  [c9a-s2.ufhpc:08914] PMIX ERROR: UNREACHABLE in file 
>>>> src/client/pmix_client.c at line 199
>>>> 
>>> I should have mentioned that I’m launching with
>>> 
>>>  srun --mpi=pmix …
>>> 
>>> If I launch with
>>> 
>>> srun --mpi=pmi2 ...
>>> 
>>> the app starts and runs without issue.
>>> 
>>> 
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] PMIx + OpenMPI

2017-08-06 Thread Charles A Taylor
HI Gilles,

I tried both “--with-pmix=/opt/pmix” and “--with-pmix=internal” and got the same 
“UNREACHABLE” error both ways.  I tried the “external” first since that is what 
SLURM was built against.

I’m missing something simple/basic - just not sure what it is.  

Thanks,

Charlie

> On Aug 6, 2017, at 7:43 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Charles,
> 
> did you build Open MPI with the external PMIx ?
> iirc, Open MPI 2.0.x does not support cross version PMIx
> 
> Cheers,
> 
> Gilles
> 
> On Sun, Aug 6, 2017 at 7:59 PM, Charles A Taylor <chas...@ufl.edu> wrote:
>> 
>>> On Aug 6, 2017, at 6:53 AM, Charles A Taylor <chas...@ufl.edu> wrote:
>>> 
>>> 
>>> Anyone successfully using PMIx with OpenMPI and SLURM?  I have,
>>> 
>>> 1. Installed an “external” version (1.1.5) of PMIx.
>>> 2. Patched SLURM 15.08.13 with the SchedMD-provided PMIx patch (results in 
>>> an mpi_pmix plugin along the lines of mpi_pmi2).
>>> 3. Built OpenMPI 2.0.1 (tried 2.0.3 as well).
>>> 
>>> However, when attempting to launch MPI apps (LAMMPS in this case), I get
>>> 
>>>   [c9a-s2.ufhpc:08914] PMIX ERROR: UNREACHABLE in file 
>>> src/client/pmix_client.c at line 199
>>> 
>> I should have mentioned that I’m launching with
>> 
>>   srun --mpi=pmix …
>> 
>> If I launch with
>> 
>>  srun --mpi=pmi2 ...
>> 
>> the app starts and runs without issue.
>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] PMIx + OpenMPI

2017-08-06 Thread Charles A Taylor

> On Aug 6, 2017, at 6:53 AM, Charles A Taylor <chas...@ufl.edu> wrote:
> 
> 
> Anyone successfully using PMIx with OpenMPI and SLURM?  I have,
> 
> 1. Installed an “external” version (1.1.5) of PMIx.
> 2. Patched SLURM 15.08.13 with the SchedMD-provided PMIx patch (results in an 
> mpi_pmix plugin along the lines of mpi_pmi2).
> 3. Built OpenMPI 2.0.1 (tried 2.0.3 as well).
> 
> However, when attempting to launch MPI apps (LAMMPS in this case), I get
> 
>[c9a-s2.ufhpc:08914] PMIX ERROR: UNREACHABLE in file 
> src/client/pmix_client.c at line 199
> 
I should have mentioned that I’m launching with

    srun --mpi=pmix …

If I launch with 

  srun --mpi=pmi2 ...

the app starts and runs without issue.
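
If it helps anyone else debugging this, a quick way to confirm which MPI plugin 
types the slurm build actually provides (a sketch):

  # should list pmix and/or pmi2 among the available plugins
  srun --mpi=list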


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] PMIx + OpenMPI

2017-08-06 Thread Charles A Taylor

Anyone successfully using PMIx with OpenMPI and SLURM?  I have,

1. Installed an “external” version (1.1.5) of PMIx.
2. Patched SLURM 15.08.13 with the SchedMD-provided PMIx patch (results in an 
mpi_pmix plugin along the lines of mpi_pmi2).
3. Built OpenMPI 2.0.1 (tried 2.0.3 as well).

However, when attempting to launch MPI apps (LAMMPS in this case), I get

[c9a-s2.ufhpc:08914] PMIX ERROR: UNREACHABLE in file 
src/client/pmix_client.c at line 199

This comes from,

if (PMIX_SUCCESS != (ret = usock_connect((struct sockaddr *)address, ))) {
    PMIX_ERROR_LOG(ret);
    return ret;
}
I’ve googled and looked at the archives and don’t see any other references to 
this error.  Don’t really see much about using OpenMPI with pmix at all.  I 
assumed the “server” side was embedded in orted or some such but maybe not.

What am I missing?  Is there some server that needs to be started separately as 
with mpd?

Thanks,

Charlie Taylor
Research Computing
University of Florida
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users