Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:

2023-11-06 Thread Matt Thompson via users
I have built Open MPI 5 (well, 5.0.0rc12) with Intel oneAPI under Rosetta2
with:

 $ lt_cv_ld_force_load=no ../configure --disable-wrapper-rpath
--disable-wrapper-runpath \
CC=clang CXX=clang++ FC=ifort \
--with-hwloc=internal --with-libevent=internal --with-pmix=internal

I'm fairly sure the two wrapper flags are not needed; I just have them for
historical reasons (long ago I needed them, and until they cause an issue I
keep all my flags around).

Maybe it works for me because I'm using clang instead of icc? I can "get
away" with that because the code I work on is nearly all Fortran so the C
compiler is not as important to us. And all the libraries we care about
seem happy with mixed ifort-clang as well.

If you don't have a driving need for icc, maybe this will let things work?
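A quick way to sanity-check what the resulting wrappers picked up is the
wrappers' -showme option (a minimal sketch; the install prefix here is a
placeholder):

  $ $PREFIX/bin/mpicc -showme    # should report clang plus the Open MPI flags
  $ $PREFIX/bin/mpifort -showme  # should report ifort plus the Open MPI flags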

On Mon, Nov 6, 2023 at 8:55 AM Volker Blum via users <
users@lists.open-mpi.org> wrote:

> I don’t have a solution to this but am interested in finding one.
>
> There is an issue with some include statements between oneAPI and Xcode on
> MacOS 14.x, at least for C++ (the example below seems to be C?). It
> appears that many standard headers are not being found.
>
> I did not encounter this problem with OpenMPI, though, since I got stuck
> at an earlier point. My workaround, OpenMPI 4.1.6, compiled fine.
>
> While compiling a different C++ code, these missing headers struck me, too.
>
> Many of the include-related error messages went away after installing
> Xcode 15.1 beta 2 - however, not all of them. That’s as far as I got …
> sorry about the experience.
>
> Best wishes
> Volker
>
>
> Volker Blum
> Vinik Associate Professor, Duke MEMS & Chemistry
> https://aims.pratt.duke.edu
> https://bsky.app/profile/aimsduke.bsky.social
>
> > On Nov 6, 2023, at 4:25 AM, Christophe Peyret via users <
> users@lists.open-mpi.org> wrote:
> >
> > Hello,
> >
> > I am trying to compile openmpi 5.0.0 on MacOS 14.1 with Intel oneAPI
> Version 2021.9.0 Build 20230302_00.
> >
> > I enter the command:
> >
> > lt_cv_ld_force_load=no  ../openmpi-5.0.0/configure
> --prefix=$APP_DIR/openmpi-5.0.0 F77=ifort FC=ifort CC=icc CXX=icpc
> --with-pmix=internal  --with-libevent=internal --with-hwloc=internal
> >
> > Then
> >
> > make
> >
> > And compilation stops with error message :
> >
> >
> /Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c(55):
> catastrophic error: cannot open source file
> "/Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c"
> >  #include 
> > ^
> >
> > compilation aborted for
> /Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c
> (code 4)
> > make[4]: *** [pmix_path.lo] Error 1
> > make[3]: *** [all-recursive] Error 1
> > make[2]: *** [all-recursive] Error 1
> > make[1]: *** [all-recursive] Error 1
> > make: *** [all-recursive] Error 1
> >
>
>

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:

2023-10-28 Thread Matt Thompson via users
On my Mac I build Open MPI 5 with (among other flags):

--with-hwloc=internal --with-libevent=internal --with-pmix=internal

In my case, I should have had libevent through brew, but configure didn't
seem to find it. So I figured I might as well let Open MPI build its own for
convenience.
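If you did want configure to use the Homebrew libevent instead, something
along these lines might work (a sketch only, not tested here;
--with-libevent accepts an install prefix):

  $ ./configure --with-libevent=$(brew --prefix libevent) \
      --with-hwloc=internal --with-pmix=internal ...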

Matt

On Fri, Oct 27, 2023 at 7:51 PM Volker Blum via users <
users@lists.open-mpi.org> wrote:

> OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:
>
> In an ostensibly clean system, the following configure on MacOS ends
> without a viable pmix build:
>
> configure: WARNING: Either libevent or libev support is required, but
> neither
> configure: WARNING: was found. Please use the configure options to point us
> configure: WARNING: to where we can find one or the other library
> configure: error: Cannot continue
> configure: = done with 3rd-party/openpmix configure =
> checking for pmix pkg-config name... pmix
> checking if pmix pkg-config module exists... yes
> checking for pmix pkg-config cflags...
> -I/usr/local/Cellar/open-mpi/4.1.5/include
> checking for pmix pkg-config ldflags...
> -L/usr/local/Cellar/open-mpi/4.1.5/lib
> checking for pmix pkg-config static ldflags...
> -L/usr/local/Cellar/open-mpi/4.1.5/lib
> checking for pmix pkg-config libs... -lpmix -lz
> checking for pmix pkg-config static libs... -lpmix -lz
> checking for pmix.h... no
> configure: error: Could not find viable pmix build.
>
> configure command used was:
>
> lt_cv_ld_force_load=no ./configure --prefix=/usr/local/openmpi/5.0.0
> FC=ifort F77=ifort CC=icc CXX=icpc
>
> ***
>
> The same command works (up to the end of the configure stage) with OpenMPI
> 4.1.6.
>
> My guess is that this is related to some earlier pmix-related issues that
> can be found via Google, but I wanted to report it anyway.
>
> Thank you!
> Best wishes
> Volker
>
>
> Volker Blum
> Associate Professor, Duke MEMS & Chemistry
> https://aims.pratt.duke.edu
> https://bsky.app/profile/aimsduke.bsky.social
>
>
>
>

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


[OMPI users] Building Open MPI without zlib: what might go wrong/different?

2022-01-31 Thread Matt Thompson via users
Open MPI List,

Recently in trying to build some libraries with NVHPC + Open MPI, I hit an
error building HDF5 where it died at configure time saying that the zlib
that Open MPI wanted to link to (my system one) was incompatible with the
zlib I built in my libraries leading up to HDF5. So, in the end I "fixed"
my issue by adding:

--without-zlib

to my configure line for Open MPI and rebuilt. And hey, it worked. HDF5
built. And Hello world still works as well.

But I'm now wondering: what might I be missing now? Zlib isn't required by
the MPI Standard (as far as I can tell), so I'm guessing it's not
functionality but rather performance?
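One quick sanity check after such a rebuild is to confirm that the new
install no longer pulls in zlib at all (a sketch; ldd on Linux, otool -L on
macOS, run against mpirun or libopen-pal):

  $ ldd $(which mpirun) | grep -i libz       # expect no output after --without-zlib
  $ otool -L $(which mpirun) | grep -i libz  # macOS equivalent (direct deps only)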

Just curious,
Matt

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-30 Thread Matt Thompson via users
Jeff,

I'll take a look when I'm back at work next week. I work with someone on
the Fortran Standards Committee, so if I can find the code, we can probably
figure out how to fix it.

That said, I know just enough Autotools to cause massive damage and fix
a minor bug. Can you give me a pointer as to where to look for the Fortran
tests the configure scripts runs? conftest.f90 is the "generic" name I
assume Autotools uses for tests, so I'm guessing there is an... m4 script
somewhere generating it? In config/ maybe?
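A sketch of how one might hunt for the generating macros (the exact file
names are guesses, but the Fortran configure fragments do live under
config/ in the Open MPI source tree):

  $ ls config/ompi_fortran_check_*.m4    # individual Fortran feature probes
  $ grep -rl "mpi_f08" config/*.m4       # fragments that build the mpi_f08 conftests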

Matt

On Thu, Dec 30, 2021 at 10:27 AM Jeff Squyres (jsquyres) 
wrote:

> Snarky comments from the NAG tech support people aside, if they could be a
> little more specific about what non-conformant Fortran code they're
> referring to, we'd be happy to work with them to get it fixed.
>
> I'm one of the few people in the Open MPI dev community who has a clue
> about Fortran, and I'm *very far* from being a Fortran expert.  Modern
> Fortran is a legitimately complicated language.  So it doesn't surprise me
> that we might have some code in our configure tests that isn't quite right.
>
> Let's also keep in mind that the state of F2008 support varies widely
> across compilers and versions.  The current Open MPI configure tests
> straddle the line of trying to find *enough* F2008 support in a given
> compiler to be sufficient for the mpi_f08 module without being so overly
> proscriptive as to disqualify compilers that aren't fully F2008-compliant.
> Frankly, the state of F2008 support across the various Fortran compilers
> was a mess when we wrote those configure tests; we had to cobble together a
> variety of complicated tests to figure out if any given compiler supported
> enough F2008 support for some / all of the mpi_f08 module.  That's why the
> configure tests are... complicated.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Matt Thompson
> via users 
> Sent: Thursday, December 23, 2021 11:41 AM
> To: Wadud Miah
> Cc: Matt Thompson; Open MPI Users
> Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
>
> I heard back from NAG:
>
> Regarding OpenMPI, we have attempted the build ourselves but cannot make
> sense of the configure script. Only the OpenMPI maintainers can do
> something about that and it looks like they assume that all compilers will
> just swallow non-conforming Fortran code. The error downgrading options for
> NAG compiler remain "-dusty", "-mismatch" and "-mismatch_all" and none of
> them seem to help with the mpi_f08 module of OpenMPI. If there is a bug in
> the NAG Fortran Compiler that is responsible for this, we would love to
> hear about it, but at the moment we are not aware of such.
>
> So it might mean the configure script itself might need to be altered to
> use F2008 conforming code?
>
> On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah  wrote:
> You can contact NAG support at supp...@nag.co.uk
> but they will look into this in the new year.
>
> Regards,
>
> On Thu, 23 Dec 2021, 13:18 Matt Thompson via users, <
> users@lists.open-mpi.org> wrote:
> Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1
> on it...mainly because I haven't asked for it. Until NAG fix the bug we are
> seeing, I figured why bother the admins.
>
> Still, it does *seem* like it should work. I might ask NAG support about
> it.
>
> On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky  wrote:
> On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky  wrote:
> >
> > On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson  wrote:
> > >
> > > All,
> > >
> > > When I build Open MPI with NAG, I have to pass in:
> > >
> > >   FCFLAGS"=-mismatch_all -fpp"
> > >
> > > this flag tells nagfor to downgrade some errors with interfaces to
> warnings:
> > >
> > >-mismatch_all
> > >  Further downgrade consistency checking of procedure
> argument lists so that calls to routines in the same file which are
> > >  incorrect will produce warnings instead of error
> messages.  This option disables -C=calls.
> > >
> > > The fpp flag is how you tell NAG to do preprocessing (it doesn't
> automatically do it with .F90 files).
> > >
> > > I also have to pass in a lot of other flags as seen here:
> > >
> > >
> https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
>

Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

2021-12-30 Thread Matt Thompson via users
Jeff,

I'm not sure it'll happen. For understandable reasons (for Intel), I think
Intel is not putting too much emphasis on supporting macOS. I guess since I
had a workaround I didn't press them. (Maybe the workaround has performance
issues? I don't know, but I only ever run with macOS on laptops, so
performance isn't primary for me yet.)

On Thu, Dec 30, 2021 at 10:15 AM Jeff Squyres (jsquyres) 
wrote:

> The conclusion we came to on that issue was that this was an issue with
> Intel ifort.  Was anyone able to raise this with Intel ifort tech support?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Matt Thompson
> via users 
> Sent: Thursday, December 30, 2021 9:56 AM
> To: Open MPI Users
> Cc: Matt Thompson; Christophe Peyret
> Subject: Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi
>
> Oh yeah. I know that error. This is due to a long standing issue with
> Intel on macOS and Open MPI:
>
> https://github.com/open-mpi/ompi/issues/7615
>
> You need to configure Open MPI with "lt_cv_ld_force_load=no" at the
> beginning. (You can see an example at the top of my modulefile here:
> https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/intel-clang-2022.0.0/openmpi/4.1.2.lua
> )
>
> Matt
>
> On Thu, Dec 30, 2021 at 5:47 AM Christophe Peyret via users <
> users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I have built openmpi-4.1.2 with the latest Intel oneAPI compilers, including
> Fortran,
>
> but I am facing problems at compile time:
>
>
> mpif90 toto.f90
>
> Undefined symbols for architecture x86_64:
>
>   "_ompi_buffer_detach_f08", referenced from:
>
>   import-atom in libmpi_usempif08.dylib
>
> ld: symbol(s) not found for architecture x86_64
>
> library libmpi_usempif08.dylib is present in $MPI_DIR/lib
>
>
> mpif90 -showme
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib -lmpi_usempif08
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
>
>
> if I remove -lmpi_usempif08 from that command line it works !
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi toto.f90
>
>
> And program runs:
>
> mpirun -n 4 a.out
>
> rank=2/4
>
> rank=3/4
>
> rank=0/4
>
> rank=1/4
>
>
> Annexe the Program
>
> program toto
>
>   use mpi
>
>   implicit none
>
>   integer :: i
>
>   integer :: comm,rank,size,ierror
>
>   call mpi_init(ierror)
>
>   comm=MPI_COMM_WORLD
>
>   call mpi_comm_rank(comm, rank, ierror)
>
>   call mpi_comm_size(comm, size, ierror)
>
>   print '("rank=",i0,"/",i0)',rank,size
>
>   call mpi_finalize(ierror)
>
> end program toto
>
>
> --
>
> Christophe Peyret
>
> ONERA/DAAA/NFLU
>
> 29 ave de la Division Leclerc
> F92322 Châtillon Cedex
>
>
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi

2021-12-30 Thread Matt Thompson via users
Oh yeah. I know that error. This is due to a long standing issue with Intel
on macOS and Open MPI:

https://github.com/open-mpi/ompi/issues/7615

You need to configure Open MPI with "lt_cv_ld_force_load=no" at the
beginning. (You can see an example at the top of my modulefile here:
https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/intel-clang-2022.0.0/openmpi/4.1.2.lua
)
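A minimal sketch of that in practice (prefix is a placeholder; the key bit
is the lt_cv_ld_force_load=no prefix on the configure line):

  $ lt_cv_ld_force_load=no ./configure --prefix=$HOME/installed/openmpi-4.1.2 \
      CC=icc CXX=icpc FC=ifort
  $ make -j4 && make install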

Matt

On Thu, Dec 30, 2021 at 5:47 AM Christophe Peyret via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> I have built openmpi-4.1.2 with the latest Intel oneAPI compilers, including
> Fortran,
>
> but I am facing problems at compile time:
>
>
> *mpif90 toto.f90*
>
> Undefined symbols for architecture x86_64:
>
>   "_ompi_buffer_detach_f08", referenced from:
>
>   import-atom in libmpi_usempif08.dylib
>
> ld: symbol(s) not found for architecture x86_64
>
> library libmpi_usempif08.dylib is present in $MPI_DIR/lib
>
>
> *mpif90 -showme*
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib -lmpi_usempif08
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
>
>
> if I remove -lmpi_usempif08 from that command line it works !
>
> *ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi toto.f90*
>
>
> And program runs:
>
> *mpirun -n 4 a.out*
>
> rank=2/4
>
> rank=3/4
>
> rank=0/4
>
> rank=1/4
>
>
> Annexe the Program
>
> program toto
>
>   use mpi
>
>   implicit none
>
>   integer :: i
>
>   integer :: comm,rank,size,ierror
>
>   call mpi_init(ierror)
>
>   comm=MPI_COMM_WORLD
>
>   call mpi_comm_rank(comm, rank, ierror)
>
>   call mpi_comm_size(comm, size, ierror)
>
>   print '("rank=",i0,"/",i0)',rank,size
>
>   call mpi_finalize(ierror)
>
> end program toto
>
>
> --
>
> *Christophe Peyret*
>
> *ONERA/DAAA/NFLU*
>
> 29 ave de la Division Leclerc
> F92322 Châtillon Cedex
>
>

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-23 Thread Matt Thompson via users
I heard back from NAG:

Regarding OpenMPI, we have attempted the build ourselves but cannot make
sense of the configure script. Only the OpenMPI maintainers can do
something about that and it looks like they assume that all compilers will
just swallow non-conforming Fortran code. The error downgrading options for
NAG compiler remain "-dusty", "-mismatch" and "-mismatch_all" and none of
them seem to help with the mpi_f08 module of OpenMPI. If there is a bug in
the NAG Fortran Compiler that is responsible for this, we would love to
hear about it, but at the moment we are not aware of such.

So it might mean the configure script itself might need to be altered to
use F2008 conforming code?

On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah  wrote:

> You can contact NAG support at supp...@nag.co.uk but they will look into
> this in the new year.
>
> Regards,
>
> On Thu, 23 Dec 2021, 13:18 Matt Thompson via users, <
> users@lists.open-mpi.org> wrote:
>
>> Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1
>> on it...mainly because I haven't asked for it. Until NAG fix the bug we are
>> seeing, I figured why bother the admins.
>>
>> Still, it does *seem* like it should work. I might ask NAG support about
>> it.
>>
>> On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky  wrote:
>>
>>> On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky 
>>> wrote:
>>> >
>>> > On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson 
>>> wrote:
>>> > >
>>> > > All,
>>> > >
>>> > > When I build Open MPI with NAG, I have to pass in:
>>> > >
>>> > >   FCFLAGS"=-mismatch_all -fpp"
>>> > >
>>> > > this flag tells nagfor to downgrade some errors with interfaces to
>>> warnings:
>>> > >
>>> > >-mismatch_all
>>> > >  Further downgrade consistency checking of procedure
>>> argument lists so that calls to routines in the same file which are
>>> > >  incorrect will produce warnings instead of error
>>> messages.  This option disables -C=calls.
>>> > >
>>> > > The fpp flag is how you tell NAG to do preprocessing (it doesn't
>>> automatically do it with .F90 files).
>>> > >
>>> > > I also have to pass in a lot of other flags as seen here:
>>> > >
>>> > >
>>> https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
>>> > >
>>> > > Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a
>>> bug with a library I depend on, but it does promise better F2008 support.
>>> To see what happens, I tried myself and added --enable-mpi-fortran=all, but:
>>> > >
>>> > > checking if building Fortran 'use mpi_f08' bindings... no
>>> > > configure: error: Cannot build requested Fortran bindings, aborting
>>> > >
>>> > > Unfortunately, the NAG Fortran guru I work with is off until the new
>>> year. When he comes back, I might ask him about this. He might know
>>> something we can do to make NAG happy with mpif08.
>>> > >
>>> >
>>> > The very curious thing about this is that with NAG 7.1, mpif08
>>> > configured properly with the macOS (Intel architecture) flavor of
>>> > it.  But as this thread seems to indicate, it barfs on Linux.  Just
>>> > an extra data point.
>>> >
>>>
>>> I'd like to recall that statement, I was not looking at the config.log
>>> carefully enough.  I see this still, even on macOS
>>>
>>> checking if building Fortran 'use mpi_f08' bindings... no
>>>
>>
>>
>> --
>> Matt Thompson
>>“The fact is, this is about us identifying what we do best and
>>finding more ways of doing less of it better” -- Director of Better
>> Anna Rampton
>>
>

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-23 Thread Matt Thompson via users
Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1 on
it...mainly because I haven't asked for it. Until NAG fix the bug we are
seeing, I figured why bother the admins.

Still, it does *seem* like it should work. I might ask NAG support about it.

On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky  wrote:

> On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky  wrote:
> >
> > On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson  wrote:
> > >
> > > All,
> > >
> > > When I build Open MPI with NAG, I have to pass in:
> > >
> > >   FCFLAGS"=-mismatch_all -fpp"
> > >
> > > this flag tells nagfor to downgrade some errors with interfaces to
> warnings:
> > >
> > >-mismatch_all
> > >  Further downgrade consistency checking of procedure
> argument lists so that calls to routines in the same file which are
> > >  incorrect will produce warnings instead of error
> messages.  This option disables -C=calls.
> > >
> > > The fpp flag is how you tell NAG to do preprocessing (it doesn't
> automatically do it with .F90 files).
> > >
> > > I also have to pass in a lot of other flags as seen here:
> > >
> > >
> https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
> > >
> > > Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a bug
> with a library I depend on, but it does promise better F2008 support. To
> see what happens, I tried myself and added --enable-mpi-fortran=all, but:
> > >
> > > checking if building Fortran 'use mpi_f08' bindings... no
> > > configure: error: Cannot build requested Fortran bindings, aborting
> > >
> > > Unfortunately, the NAG Fortran guru I work with is off until the new
> year. When he comes back, I might ask him about this. He might know
> something we can do to make NAG happy with mpif08.
> > >
> >
> > The very curious thing about this is that with NAG 7.1, mpif08
> > configured properly with the macOS (Intel architecture) flavor of
> > it.  But as this thread seems to indicate, it barfs on Linux.  Just
> > an extra data point.
> >
>
> I'd like to recall that statement, I was not looking at the config.log
> carefully enough.  I see this still, even on macOS
>
> checking if building Fortran 'use mpi_f08' bindings... no
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2

2021-12-22 Thread Matt Thompson via users
All,

When I build Open MPI with NAG, I have to pass in:

  FCFLAGS"=-mismatch_all -fpp"

this flag tells nagfor to downgrade some errors with interfaces to warnings:

   -mismatch_all
 Further downgrade consistency checking of procedure
argument lists so that calls to routines in the same file which are
 incorrect will produce warnings instead of error
messages.  This option disables -C=calls.

The fpp flag is how you tell NAG to do preprocessing (it doesn't
automatically do it with .F90 files).

I also have to pass in a lot of other flags as seen here:

https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua

Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a bug with
a library I depend on, but it does promise better F2008 support. To see
what happens, I tried myself and added --enable-mpi-fortran=all, but:

checking if building Fortran 'use mpi_f08' bindings... no
configure: error: Cannot build requested Fortran bindings, aborting

Unfortunately, the NAG Fortran guru I work with is off until the new year.
When he comes back, I might ask him about this. He might know something we
can do to make NAG happy with mpif08.

Matt

On Wed, Dec 22, 2021 at 3:44 PM Tom Kacvinsky via users <
users@lists.open-mpi.org> wrote:

> On Wed, Dec 22, 2021 at 8:54 AM Tom Kacvinsky  wrote:
> >
> > On Wed, Dec 22, 2021 at 8:48 AM Wadud Miah via users
> >  wrote:
> > >
> > > Hi,
> > >
> > > I tried using the NAG compiler 7.2 which is fully Fortran 2008
> compliant, but the Open MPI configure script shows that it will not build
> the Fortran 2008 MPI bindings:
> > >
> > > $ FC=nagfor ./configure --prefix=/usr/local/openmpi-4.1.2
> > > [ ... ]
> > > checking if building Fortran 'use mpi_f08' bindings... no
> > > Build MPI Fortran bindings: mpif.h, use mpi
> > >
> > > Could someone please look into this?
> >
> > Would you provide the config.log from running configure?  That would
> > help diagnose the problem.  Oftentimes, you will see what the error is
> > when checking for certain features.
> >
>
> I was sent the config.log off list, and I spotted this:
>
> configure:69595: result: no
> configure:69673: checking for Fortran compiler support of !$PRAGMA
> IGNORE_TKR
> configure:69740: nagfor -c -f2008 -dusty -mismatch  conftest.f90 >&5
> NAG Fortran Compiler Release 7.1(Hanzomon) Build 7101
> Evaluation trial version of NAG Fortran Compiler Release 7.1(Hanzomon)
> Build 7101
> Questionable: conftest.f90, line 52: Variable A set but never referenced
> Warning: conftest.f90, line 52: Pointer PTR never dereferenced
> Error: conftest.f90, line 39: Incorrect data type REAL (expected
> CHARACTER) for argument BUFFER (no. 1) of FOO
> Error: conftest.f90, line 50: Incorrect data type INTEGER (expected
> CHARACTER) for argument BUFFER (no. 1) of FOO
> [NAG Fortran Compiler error termination, 2 errors, 2 warnings]
> configure:69740: $? = 2
>
> So I suspect this makes the Fortran checks unhappy so that the
> configure logic (as far as I could see) wouldn't check for 2018
> binding support.
>
> So apparently, the new NAG Fortran compiler is really fussy.
>
> There is not much more I can do with this as I am nowhere near
> competent in Fortran coding.
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)

2021-10-29 Thread Matt Thompson via users
As for the static build, this:

/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/bin/mpicc
~/MPITests/helloWorld.c  -lopen-orted-mpir

does work, but there is no way I could change all our model scripting to
add that flag all over the place. Is there a way to "stuff"
-lopen-orted-mpir into the main wrappers:

❯
/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/bin/mpicc
-show
gcc
-I/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/include
-L/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/lib
-lmpi -lopen-rte -lopen-pal -lm -lz

so that I don't need to remember it every time?
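One possibility (hedged; not something tried in this thread) is to bake the
extra library into the wrappers, either at configure time or by editing the
installed wrapper data file afterwards:

  # at configure time:
  $ ./configure --with-wrapper-libs="-lopen-orted-mpir" ...other flags...
  # or, post-install, append it to the libs= line in the wrapper data file:
  $ vi $PREFIX/share/openmpi/mpicc-wrapper-data.txt   # $PREFIX is a placeholder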

On Fri, Oct 29, 2021 at 10:32 AM Matt Thompson  wrote:

> So, an update. Nothing I seem to do with libtool seems to help, but I'm
> trying various things. For example, I tried editing libtool to use:
>
> -Wl,-Wl,,-dynamiclib
>
> and also:
>
> -Wl,-Wl,,-install_name -Wl,-Wl,,\$rpath/\$soname
>
> as that also threw an error (not a NAG flag). And when I did that, I get
> this:
>
> clang: error: no such file or directory:
> '/Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib'
>
> Somehow it's looking for a file in the install path and I think it's
> because of an extra space. When I run with make V=1:
>
> ,-Wl,,-install_name -Wl,-Wl,,
> /Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib
>
> That space between the -Wl,, and the /Users... install path is causing
> issues.
>
> Sigh. I guess it's time to try and figure out how to get the static build
> to work.
>
> On Fri, Oct 29, 2021 at 8:24 AM Matt Thompson  wrote:
>
>> Gilles,
>>
>> I tried both NAG 7.0.7062 and 7.0.7048. Both fail in the same way. And I
>> was using the official tarball from the Open MPI website. I downloaded it
>> long ago and then kept it around.
>>
>> And I didn't run autogen.pl, no, but I could try that.
>>
>> And...I do see CC=nagfor in the libtool file. But, why would libtool
>> *ever* use CC="nagfor"? I mean, I see it in the file, but I specifically
>> told configure that CC=gcc (or CC=clang). Only FC should be nagfor.
>>
>> On Thu, Oct 28, 2021 at 8:51 PM Gilles Gouaillardet via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Matt,
>>>
>>> did you build the same Open MPI 4.1.1 from an official tarball with the
>>> previous NAG Fortran?
>>> did you run autogen.pl (--force) ?
>>>
>>> Just to be sure, can you rerun the same test with the previous NAG
>>> version?
>>>
>>>
>>> When using static libraries, you can try manually linking with
>>> -lopen-orted-mpir and see if it helps.
>>> If you want to use shared libraries, I would try to run configure,
>>> and then edit the generated libtool file:
>>> look for a line like
>>>
>>> CC="nagfor"
>>>
>>> and then edit the next line
>>>
>>>
>>> # Commands used to build a shared archive.
>>>
>>> archive_cmds="\$CC -dynamiclib \$allow_undef ..."
>>>
>>> simply manually remove "-dynamiclib" here and see if it helps
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>> On Fri, Oct 29, 2021 at 12:30 AM Matt Thompson via users <
>>> users@lists.open-mpi.org> wrote:
>>>
>>>> Dear Open MPI Gurus,
>>>>
>>>> This is a...confusing one. For some reason, I cannot build a working
>>>> Open MPI with NAG 7.0.7062 and clang on my MacBook running macOS 11.6.1.
>>>> The thing is, I could do this back in July with NAG 7.0.7048. So my fear is
>>>> that something changed with macOS, or clang/xcode, or something in between.
>>>>
>>>> So here are the symptoms, I usually build with a few extra flags that
>>>> I've always carried around but for now I'm going to go basic. First, I try
>>>> to build Open MPI in a basic way:
>>>>
>>>> ../configure FCFLAGS"=-mismatch_all -fpp" CC=clang CXX=clang++
>>>> FC=nagfor
>>>> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-basic |& tee
>>>> configure.log
>>>>
>>>> Note that the FCFLAGS are needed for NAG since it doesn't preprocess
>>>> .F90 files by default (so -fpp) and it can be *very* strict with interfaces
>>>> and any slight interface difference is an error so we use -mismatch_all.
>>>>
>>>> Now with this configure line, I the

Re: [OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)

2021-10-29 Thread Matt Thompson via users
So, an update. Nothing I seem to do with libtool seems to help, but I'm
trying various things. For example, I tried editing libtool to use:

-Wl,-Wl,,-dynamiclib

and also:

-Wl,-Wl,,-install_name -Wl,-Wl,,\$rpath/\$soname

as that also threw an error (not a NAG flag). And when I did that, I get
this:

clang: error: no such file or directory:
'/Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib'

Somehow it's looking for a file in the install path and I think it's
because of an extra space. When I run with make V=1:

,-Wl,,-install_name -Wl,-Wl,,
/Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib

That space between the -Wl,, and the /Users... install path is causing
issues.

Sigh. I guess it's time to try and figure out how to get the static build
to work.

On Fri, Oct 29, 2021 at 8:24 AM Matt Thompson  wrote:

> Gilles,
>
> I tried both NAG 7.0.7062 and 7.0.7048. Both fail in the same way. And I
> was using the official tarball from the Open MPI website. I downloaded it
> long ago and then kept it around.
>
> And I didn't run autogen.pl, no, but I could try that.
>
> And...I do see CC=nagfor in the libtool file. But, why would libtool
> *ever* use CC="nagfor"? I mean, I see it in the file, but I specifically
> told configure that CC=gcc (or CC=clang). Only FC should be nagfor.
>
> On Thu, Oct 28, 2021 at 8:51 PM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
>
>> Matt,
>>
>> did you build the same Open MPI 4.1.1 from an official tarball with the
>> previous NAG Fortran?
>> did you run autogen.pl (--force) ?
>>
>> Just to be sure, can you rerun the same test with the previous NAG
>> version?
>>
>>
>> When using static libraries, you can try manually linking with
>> -lopen-orted-mpir and see if it helps.
>> If you want to use shared libraries, I would try to run configure,
>> and then edit the generated libtool file:
>> look for a line like
>>
>> CC="nagfor"
>>
>> and then edit the next line
>>
>>
>> # Commands used to build a shared archive.
>>
>> archive_cmds="\$CC -dynamiclib \$allow_undef ..."
>>
>> simply manually remove "-dynamiclib" here and see if it helps
>>
>>
>> Cheers,
>>
>> Gilles
>> On Fri, Oct 29, 2021 at 12:30 AM Matt Thompson via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Dear Open MPI Gurus,
>>>
>>> This is a...confusing one. For some reason, I cannot build a working
>>> Open MPI with NAG 7.0.7062 and clang on my MacBook running macOS 11.6.1.
>>> The thing is, I could do this back in July with NAG 7.0.7048. So my fear is
>>> that something changed with macOS, or clang/xcode, or something in between.
>>>
>>> So here are the symptoms, I usually build with a few extra flags that
>>> I've always carried around but for now I'm going to go basic. First, I try
>>> to build Open MPI in a basic way:
>>>
>>> ../configure FCFLAGS"=-mismatch_all -fpp" CC=clang CXX=clang++ FC=nagfor
>>> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-basic |& tee
>>> configure.log
>>>
>>> Note that the FCFLAGS are needed for NAG since it doesn't preprocess
>>> .F90 files by default (so -fpp) and it can be *very* strict with interfaces
>>> and any slight interface difference is an error so we use -mismatch_all.
>>>
>>> Now with this configure line, I then build and:
>>>
>>> Making all in mpi/fortran/use-mpi-tkr
>>> make[2]: Entering directory
>>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr'
>>>   FCLD libmpi_usempi.la
>>> NAG Fortran Compiler Release 7.0(Yurakucho) Build 7062
>>> Option error: Unrecognised option -dynamiclib
>>> make[2]: *** [Makefile:1966: libmpi_usempi.la] Error 2
>>> make[2]: Leaving directory
>>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr'
>>> make[1]: *** [Makefile:3555: all-recursive] Error 1
>>> make[1]: Leaving directory
>>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi'
>>> make: *** [Makefile:1901: all-recursive] Error 1
>>>
>>> For some reason, the make system is trying to pass a clang option,
>>> -dynamiclib, to nagfor and it fails. With verbose on:
>>>
>>> libtool: link: nagfor -dynamiclib -Wl,-Wl,,-undefined
>>> -Wl,-Wl,,dynamic_lookup -o .libs/libmpi_usempi.40.dylib  .libs/mpi.o
>>> .libs/mp

Re: [OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)

2021-10-29 Thread Matt Thompson via users
Gilles,

I tried both NAG 7.0.7062 and 7.0.7048. Both fail in the same way. And I
was using the official tarball from the Open MPI website. I downloaded it
long ago and then kept it around.

And I didn't run autogen.pl, no, but I could try that.

And...I do see CC=nagfor in the libtool file. But, why would libtool *ever*
use CC="nagfor"? I mean, I see it in the file, but I specifically told
configure that CC=gcc (or CC=clang). Only FC should be nagfor.

On Thu, Oct 28, 2021 at 8:51 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Matt,
>
> did you build the same Open MPI 4.1.1 from an official tarball with the
> previous NAG Fortran?
> did you run autogen.pl (--force) ?
>
> Just to be sure, can you rerun the same test with the previous NAG version?
>
>
> When using static libraries, you can try manually linking with
> -lopen-orted-mpir and see if it helps.
> If you want to use shared libraries, I would try to run configure,
> and then edit the generated libtool file:
> look for a line like
>
> CC="nagfor"
>
> and then edit the next line
>
>
> # Commands used to build a shared archive.
>
> archive_cmds="\$CC -dynamiclib \$allow_undef ..."
>
> simply manually remove "-dynamiclib" here and see if it helps
>
>
> Cheers,
>
> Gilles
> On Fri, Oct 29, 2021 at 12:30 AM Matt Thompson via users <
> users@lists.open-mpi.org> wrote:
>
>> Dear Open MPI Gurus,
>>
>> This is a...confusing one. For some reason, I cannot build a working Open
>> MPI with NAG 7.0.7062 and clang on my MacBook running macOS 11.6.1. The
>> thing is, I could do this back in July with NAG 7.0.7048. So my fear is
>> that something changed with macOS, or clang/xcode, or something in between.
>>
>> So here are the symptoms, I usually build with a few extra flags that
>> I've always carried around but for now I'm going to go basic. First, I try
>> to build Open MPI in a basic way:
>>
>> ../configure FCFLAGS"=-mismatch_all -fpp" CC=clang CXX=clang++ FC=nagfor
>> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-basic |& tee
>> configure.log
>>
>> Note that the FCFLAGS are needed for NAG since it doesn't preprocess .F90
>> files by default (so -fpp) and it can be *very* strict with interfaces and
>> any slight interface difference is an error so we use -mismatch_all.
>>
>> Now with this configure line, I then build and:
>>
>> Making all in mpi/fortran/use-mpi-tkr
>> make[2]: Entering directory
>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr'
>>   FCLD libmpi_usempi.la
>> NAG Fortran Compiler Release 7.0(Yurakucho) Build 7062
>> Option error: Unrecognised option -dynamiclib
>> make[2]: *** [Makefile:1966: libmpi_usempi.la] Error 2
>> make[2]: Leaving directory
>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr'
>> make[1]: *** [Makefile:3555: all-recursive] Error 1
>> make[1]: Leaving directory
>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi'
>> make: *** [Makefile:1901: all-recursive] Error 1
>>
>> For some reason, the make system is trying to pass a clang option,
>> -dynamiclib, to nagfor and it fails. With verbose on:
>>
>> libtool: link: nagfor -dynamiclib -Wl,-Wl,,-undefined
>> -Wl,-Wl,,dynamic_lookup -o .libs/libmpi_usempi.40.dylib  .libs/mpi.o
>> .libs/mpi_aint_add_f90.o .libs/mpi_aint_diff_f90.o
>> .libs/mpi_comm_spawn_multiple_f90.o .libs/mpi_testall_f90.o
>> .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o .libs/mpi_waitsome_f90.o
>> .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.o .libs/mpi-tkr-sizeof.o...
>>
>> As a test, I tried the same thing with NAG 7.0.7048 (which worked in
>> July) and I get the same issue:
>>
>> Option error: Unrecognised option -dynamiclib
>>
>> Note, that Intel Fortran and Gfortran *do* support this flag, but NAG has
>> something like:
>>
>>-Bbinding Specify  static  or  dynamic binding.  This only has
>> effect if specified during the link phase.  The default is dynamic binding.
>>
>> but maybe the Open MPI system doesn't know NAG?
>>
>> So I say to myself, okay, dynamiclib is a shared library sounding thing,
>> so let's try static library build! So, following the documentation I try:
>>
>> ../configure --enable-static -disable-shared FCFLAGS"=-mismatch_all -fpp"
>> CC=gcc CXX=g++ FC=nagfor
>> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static |& tee
>> configure.log
>>
>> and it builds! Yay! And then I try to b

[OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)

2021-10-28 Thread Matt Thompson via users
ugger_init_after_spawn in libopen-rte.a(orted_submit.o)
  "_MPIR_server_arguments", referenced from:
  _orte_debugger_init_after_spawn in libopen-rte.a(orted_submit.o)
  _setup_debugger_job in libopen-rte.a(orted_submit.o)
  _run_debugger in libopen-rte.a(orted_submit.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see
invocation)

So...yeah. ¯\_(ツ)_/¯ Maybe this needs -Bstatic??

But again, all this worked with shared a few months ago (I've never tried
static until now) and NAG has *never* supported -dynamiclib as far as I
know.

I do see references to -Bstatic and -Bdynamic in the source code, but
apparently I'm not triggering the configure step to use them?

Anyone else out there encounter this?

NOTE: I did try doing an Intel Fortran + Clang shared build today and that
seemed to work. I think that's because Intel Fortran recognizes -dynamiclib
so it can get past that FCLD step.
-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] [External] Help with MPI and macOS Firewall

2021-03-19 Thread Matt Thompson via users
Gilles,

For some odd reason, 'self, vader' didn't seem as effective as "^tcp". Not
sure why, but at least I have something that seems to work.

I suppose I don't really need tcp sockets on a single laptop :D

Matt

On Thu, Mar 18, 2021 at 8:46 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Matt,
>
> you can either
>
> mpirun --mca btl self,vader ...
>
> or
>
> export OMPI_MCA_btl=self,vader
> mpirun ...
>
> you may also add
> btl = self,vader
> in your /etc/openmpi-mca-params.conf
> and then simply
>
> mpirun ...
>
> Cheers,
>
> Gilles
>
> On Fri, Mar 19, 2021 at 5:44 AM Matt Thompson via users
>  wrote:
> >
> > Prentice,
> >
> > Ooh. The first one seems to work. The second one apparently is not liked
> by zsh and I had to do:
> > ❯ mpirun -mca btl '^tcp' -np 6 ./helloWorld.mpi3.exe
> > Compiler Version: GCC version 10.2.0
> > MPI Version: 3.1
> > MPI Library Version: Open MPI v4.1.0, package: Open MPI
> mathomp4@gs6101-parcel.local Distribution, ident: 4.1.0, repo rev:
> v4.1.0, Dec 18, 2020
> >
> > Next question: is this:
> >
> > OMPI_MCA_btl='self,vader'
> >
> > the right environment variable translation of that command-line option?
> >
> > On Thu, Mar 18, 2021 at 3:40 PM Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> OpenMPI should only be using shared memory on the local host
> automatically, but maybe you need to force it.
> >>
> >> I think
> >>
> >> mpirun -mca btl self,vader ...
> >>
> >> should do that.
> >>
> >> or you can exclude tcp instead
> >>
> >> mpirun -mca btl ^tcp
> >>
> >> See
> >>
> >> https://www.open-mpi.org/faq/?category=sm
> >>
> >> for more info.
> >>
> >> Prentice
> >>
> >> On 3/18/21 12:28 PM, Matt Thompson via users wrote:
> >>
> >> All,
> >>
> >> This isn't specifically an Open MPI issue, but as that is the MPI stack
> I use on my laptop, I'm hoping someone here might have a possible solution.
> (I am pretty sure something like MPICH would trigger this as well.)
> >>
> >> Namely, my employer recently did something somewhere so that now *any*
> MPI application I run will throw popups like this one:
> >>
> >>
> https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png
> >>
> >> though for me it's asking about "orterun" and "helloworld.mpi3.exe",
> etc. I essentially get one-per-process.
> >>
> >> If I had sudo access, I suppose I could just keep clicking "Allow" for
> every program, but I don't and I compile lots of programs with different
> names.
> >>
> >> So, I was hoping maybe an Open MPI guru out there knew of an MCA thing
> I could use to avoid them? This is all isolated on-my-laptop MPI I'm doing,
> so at most an "mpirun --oversubscribe -np 12" or something. It'll never go
> over my network to anything, etc.
> >>
> >> --
> >> Matt Thompson
> >>“The fact is, this is about us identifying what we do best and
> >>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
> >
> >
> >
> > --
> > Matt Thompson
> >“The fact is, this is about us identifying what we do best and
> >finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] [External] Help with MPI and macOS Firewall

2021-03-18 Thread Matt Thompson via users
Prentice,

Ooh. The first one seems to work. The second one apparently is not liked by
zsh and I had to do:
❯ mpirun -mca btl '^tcp' -np 6 ./helloWorld.mpi3.exe
Compiler Version: GCC version 10.2.0
MPI Version: 3.1
MPI Library Version: Open MPI v4.1.0, package: Open MPI
mathomp4@gs6101-parcel.local Distribution, ident: 4.1.0, repo rev: v4.1.0,
Dec 18, 2020

Next question: is this:

OMPI_MCA_btl='self,vader'

the right environment variable translation of that command-line option?
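For reference, Open MPI's convention is that an MCA parameter named foo can
be set in the environment as OMPI_MCA_foo, so the two forms below should be
equivalent (executable name reused from above):

  $ export OMPI_MCA_btl=self,vader
  $ mpirun -np 6 ./helloWorld.mpi3.exe
  # ...should behave the same as:
  $ mpirun -mca btl self,vader -np 6 ./helloWorld.mpi3.exe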

On Thu, Mar 18, 2021 at 3:40 PM Prentice Bisbal via users <
users@lists.open-mpi.org> wrote:

> OpenMPI should only be using shared memory on the local host
> automatically, but maybe you need to force it.
>
> I think
>
> mpirun -mca btl self,vader ...
>
> should do that.
>
> or you can exclude tcp instead
>
> mpirun -mca btl ^tcp
>
> See
>
> https://www.open-mpi.org/faq/?category=sm
>
> for more info.
>
> Prentice
>
> On 3/18/21 12:28 PM, Matt Thompson via users wrote:
>
> All,
>
> This isn't specifically an Open MPI issue, but as that is the MPI stack I
> use on my laptop, I'm hoping someone here might have a possible solution.
> (I am pretty sure something like MPICH would trigger this as well.)
>
> Namely, my employer recently did something somewhere so that now *any* MPI
> application I run will throw popups like this one:
>
>
> https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png
>
> though for me it's asking about "orterun" and "helloworld.mpi3.exe", etc.
> I essentially get one-per-process.
>
> If I had sudo access, I suppose I could just keep clicking "Allow" for
> every program, but I don't and I compile lots of programs with different
> names.
>
> So, I was hoping maybe an Open MPI guru out there knew of an MCA thing I
> could use to avoid them? This is all isolated on-my-laptop MPI I'm doing,
> so at most an "mpirun --oversubscribe -np 12" or something. It'll never go
> over my network to anything, etc.
>
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>
>

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


[OMPI users] Help with MPI and macOS Firewall

2021-03-18 Thread Matt Thompson via users
All,

This isn't specifically an Open MPI issue, but as that is the MPI stack I
use on my laptop, I'm hoping someone here might have a possible solution.
(I am pretty sure something like MPICH would trigger this as well.)

Namely, my employer recently did something somewhere so that now *any* MPI
application I run will throw popups like this one:

https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png

though for me it's asking about "orterun" and "helloworld.mpi3.exe", etc. I
essentially get one-per-process.

If I had sudo access, I suppose I could just keep clicking "Allow" for
every program, but I don't and I compile lots of programs with different
names.

So, I was hoping maybe an Open MPI guru out there knew of an MCA thing I
could use to avoid them? This is all isolated on-my-laptop MPI I'm doing,
so at most an "mpirun --oversubscribe -np 12" or something. It'll never go
over my network to anything, etc.

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton


Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-25 Thread Matt Thompson via users
Adam,

A couple of questions. First, is seccomp the reason you think I have the
MPI_THREAD_MULTIPLE error? Or is it more for the vader error? If so, the
environment variable Nathan provided is probably enough. These are unit
tests and should execute in seconds at most (building them takes 10x-100x
more time).
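If that environment variable does turn out to be the fix, a sketch of
making it stick without exporting it in every shell, via the MCA params
file mechanism (the prefix matches the one used in this thread):

  $ echo "btl_vader_single_copy_mechanism = none" >> \
      /opt/openmpi-4.0.2/etc/openmpi-mca-params.conf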

But if it can help with the MPI_THREAD_MULTIPLE error, can you help
translate that to "Fortran programmer who really can only do docker
build/run/push/cp" for me? I found this page:
https://docs.docker.com/engine/security/seccomp/ that I'm trying to read
through and understand, but I'm mainly learning I should be looking at
taking some Docker training soon!
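A rough sketch of the custom-seccomp-profile route at the docker run level
(assumptions: jq is available, and the big allow-list is the first entry in
the profile's syscalls array, as it is in current versions of that file):

  $ curl -L -o default.json \
      https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json
  $ jq '.syscalls[0].names += ["process_vm_readv","process_vm_writev"]' \
      default.json > custom.json
  $ docker run --security-opt seccomp=custom.json ...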

On Mon, Feb 24, 2020 at 8:24 PM Adam Simpson  wrote:

> Calls to process_vm_readv() and process_vm_writev() are disabled in the
> default Docker seccomp profile
> <https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
> You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the
> seccomp profile so that process_vm_readv and process_vm_writev are
> whitelisted, by adding them to the syscalls.names list.
>
> You can also disable seccomp, and several other confinement and security
> features, if you prefer a heavy handed approach:
>
> $ docker run --privileged --security-opt label=disable --security-opt
> seccomp=unconfined --security-opt apparmor=unconfined --ipc=host
> --network=host ...
>
> If you're still having trouble after fixing the above you may need to
> check yama on the host. You can check with "sysctl
> kernel.yama.ptrace_scope"; if it returns a value other than 0 you may
> need to disable it with "sysctl -w kernel.yama.ptrace_scope=0".
>
> Adam
>
> --
> *From:* users  on behalf of Matt
> Thompson via users 
> *Sent:* Monday, February 24, 2020 5:15 PM
> *To:* Open MPI Users 
> *Cc:* Matt Thompson 
> *Subject:* Re: [OMPI users] Help with One-Sided Communication: Works in
> Intel MPI, Fails in Open MPI
>
> *External email: Use caution opening links or attachments*
> Nathan,
>
> The reproducer would be that code that's on the Intel website. That is
> what I was running. You could pull my image if you like but...since you are
> the genius:
>
> [root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none
> -np 2 ./a.out
>
> Rank 0 running on adac3ce0cf32
> Rank 1 running on adac3ce0cf32
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 0 has new data in the shared memory: 00 01 02 03
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 1 has new data in the shared memory: 10 11 12 13
>
> And knowing this led to: https://github.com/open-mpi/ompi/issues/4948
>
> So, good news is that setting export
> OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work.
> The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not
> like it:
>
> Start 2: pFIO_tests_mpi
>
> 2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe"
> "/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso"
> "6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
> 2: Test timeout computed to be: 1500
> 2:
> --
> 2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
> release.
> 2: Workarounds are to run on a single node, or to use a system with an RDMA
> 2: capable network such as Infiniband.
> 2:
> --
> 2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
> 2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
> 2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
> 2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
> 2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> 2: [adac3ce0cf32:03619] ***and potentially your MPI job)
> 2: [adac3ce0cf32:03587] 17 more processes have sent help message
> help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
> 2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> 2: [adac3ce0cf32:03587] 17 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> 2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec
>
> 40% tests passed, 3 tests failed out of 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Matt Thompson via users
Nathan,

The reproducer would be that code that's on the Intel website. That is what
I was running. You could pull my image if you like but...since you are the
genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none
-np 2 ./a.out

Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work.
The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not
like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe"
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso"
"6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2:
--
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2:
--
2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
2: [adac3ce0cf32:03619] ***and potentially your MPI job)
2: [adac3ce0cf32:03587] 17 more processes have sent help message
help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
2: [adac3ce0cf32:03587] 17 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec

40% tests passed, 3 tests failed out of 5

Total Test time (real) =   1.08 sec

The following tests FAILED:
  2 - pFIO_tests_mpi (Failed)
  3 - pFIO_tests_simple (Failed)
  4 - pFIO_tests_hybrid (Failed)
Errors while running CTest

The weird thing is, I *am* running on one node (it's all I have, I'm not
fancy enough at AWS to try more yet) and ompi_info does mention
MPI_THREAD_MULTIPLE:

[root@adac3ce0cf32 build]# ompi_info | grep -i mult
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support:
yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)

Any ideas on this one?

On Mon, Feb 24, 2020 at 7:24 PM Nathan Hjelm via users <
users@lists.open-mpi.org> wrote:

> The error is from btl/vader. CMA is not functioning as expected. It might
> work if you set btl_vader_single_copy_mechanism=none
>
> Performance will suffer though. It would be worth understanding with
> process_readv is failing.
>
> Can you send a simple reproducer?
>
> -Nathan
>
> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users <
> users@lists.open-mpi.org> wrote:
>
> 
>
> I am not an expert for the one-sided code in Open MPI, I wanted to comment
> briefly on the potential MPI -IO related item. As far as I can see, the
> error message
>
>
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file
> system did you use for these tests?
>
>
>
> Thanks
>
> Edgar
>
>
>
> *From:* users  *On Behalf Of *Matt
> Thompson via users
> *Sent:* Monday, February 24, 2020 1:20 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Matt Thompson 
> *Subject:* [OMPI users] Help with One-Sided Communication: Works in Intel
> MPI, Fails in Open MPI
>
>
>
> All,
>
>
>
> My guess is this is a "I built Open MPI incorrectly" sort of issue, but
> I'm not sure how to fix it. Namely, I'm currently trying to get an MPI
> project's CI working on CircleCI using Open MPI to run some unit tests (on
> a single node, so need some oversubscribe). I can build everything just
> fine, but when I try to run, things just...blow up:
>
>
>
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6
> -nso 6 -ngo 1 -ngi 1 -v T,U

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Matt Thompson via users
On Mon, Feb 24, 2020 at 4:57 PM Gabriel, Edgar 
wrote:

> I am not an expert on the one-sided code in Open MPI, but I wanted to comment
> briefly on the potential MPI-IO related item. As far as I can see, the
> error message
>
>
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file
> system did you use for these tests?
>

I am not sure. It was happening in a Docker image running on an AWS EC2
instance, so I guess whatever ebs is? I'm sort of a neophyte at both AWS
and Docker, so combine the two and...

Matt
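
(For what it's worth, errno = 1 is EPERM. Docker's default seccomp profile
blocks the process_vm_readv/process_vm_writev calls that vader's CMA
single-copy path relies on unless CAP_SYS_PTRACE is granted, which would be
consistent with the workaround above. A hedged sketch, with "my-mpi-image"
as a placeholder for the actual image:

  docker run --rm -it --cap-add=SYS_PTRACE my-mpi-image /bin/bash

With that capability in place, the default vader single-copy mechanism may
work without being disabled.)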


Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-23 Thread Matt Thompson
_MAC, et al,

Things are looking up. By specifying --with-verbs=no, I can now run
helloworld. But in a new-for-me wrinkle, I can only run on *more* than one
node. Not sure I've ever seen that. Using 40-core nodes, this:

mpirun -np 41 ./helloWorld.mpi3.SLES12.OMPI400.exe

works, and -np 40 fails:

(1027)(master) $ mpirun -np 40 ./helloWorld.mpi3.SLES12.OMPI400.exe
[borga033:05598] *** An error occurred in MPI_Barrier
[borga033:05598] *** reported by process [140735567101953,140733193388034]
[borga033:05598] *** on communicator MPI_COMM_WORLD
[borga033:05598] *** MPI_ERR_OTHER: known error not in list
[borga033:05598] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[borga033:05598] ***    and potentially your MPI job)
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover21
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9CC3ED  __nanosleep   Unknown  Unknown
libopen-rte.so.40  2C3C5854  orte_show_help_no Unknown  Unknown
libopen-rte.so.40  2C3C5595  orte_show_helpUnknown  Unknown
libmpi.so.40.20.0  2B3BADC5  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99D9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4586  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7742  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

So, I'm getting closer but I have to admit I've never built an MPI stack
before where running on a single node was the broken bit!
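
A possible single-node experiment (a sketch, not a verified fix) would be to
bypass the Omni-Path/PSM2 path entirely and force the shared-memory
transports, just to see whether the failure is specific to the high-speed
interconnect when all ranks are local:

  mpirun -np 40 --mca pml ob1 --mca btl self,vader \
      ./helloWorld.mpi3.SLES12.OMPI400.exe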

On Tue, Jan 22, 2019 at 1:31 PM Cabral, Matias A 
wrote:

> Hi Matt,
>
>
>
> There seem to be two different issues here:
>
> a)  The warning message comes from the openib btl. Given that
> Omnipath has verbs API and you have the necessary libraries in your system,
> openib btl finds itself as a potential transport and prints the warning
> during its init (openib btl is on its way to deprecation). You may try to
> explicitly ask for vader btl given you are running on shared mem: -mca btl
> self,vader -mca pml ob1. Or better, explicitly build without openib:
> ./configure --with-verbs=no …
>
> b)  Not my field of expertise, but you may be having some conflict
> with the external components you are using:
> --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr . You may try not
> specifying these and using the ones provided by OMPI.
>
>
>
> _MAC
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Matt
> Thompson
> *Sent:* Tuesday, January 22, 2019 6:04 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Help Getting Started with Open MPI and PMIx
> and UCX
>
>
>
> Well,
>
>
>
> By turning off UCX compilation per Howard, things get a bit better in that
> something happens! It's not a good something, as it seems to die with an
> infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
> libverbs somewhere and compiling it in? To wit:
>
>
>
> (1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> --
>
> By default, for Open MPI 4.0 and later, infiniband ports on a device
>
> are not used by default.  The intent is to use UCX for these devices.
>
> You can override this policy by setting the btl_openib_allow_ib MCA
> parameter
>
> to true.
>
>
>
>   Local host:  borgc129
>
>   Local adapter:   hfi1_0
>
>   Local port:  1
>
>
>
> --
>
> --
>
> WARNING: There was an error initializing an OpenFabrics device.
>
>
>
>   Local host:   borgc129
>
>   Local device: hfi1_0
>
> --
>
> Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
> ru

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Matt Thompson
Well,

By turning off UCX compilation per Howard, things get a bit better in that
something happens! It's not a good something, as it seems to die with an
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[borgc129:260830] ***    and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9C90CD  pthread_cond_wait Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_Abort        Unknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort   Unknown  Unknown
mca_ess_pmi.so 2AAAB1724BC0  Unknown   Unknown  Unknown
libopen-rte.so.40  2C3E941C  orte_errmgr_base_ Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown   Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4  ompi_mpi_abort    Unknown  Unknown
libmpi.so.40.20.0  2B3BB1EF  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4576  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7732  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown
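
One way to check the "is libverbs being compiled in" theory (a sketch;
component names are as reported by ompi_info) is to ask the installed build
directly and, if openib shows up, exclude it at run time:

  ompi_info | grep -i openib
  mpirun --mca btl ^openib -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe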

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard 
wrote:

> Hi Matt
>
> Definitely do not include the ucx option for an omnipath cluster.
> Actually, if you accidentally installed UCX in its default location on the
> system, switch to this config option:
>
> --with-ucx=no
>
> Otherwise you will hit
>
> https://github.com/openucx/ucx/issues/750
>
> Howard
>
>
> Gilles Gouaillardet  schrieb am Sa. 19.
> Jan. 2019 um 18:41:
>
>> Matt,
>>
>> There are two ways of using PMIx
>>
>> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
>> to mpirun and orted daemons (e.g. the PMIx server)
>> - if you use SLURM srun, then the MPI app will directly talk to the
>> PMIx server provided by SLURM. (note you might have to srun
>> --mpi=pmix_v2 or something)
>>
>> In the former case, it does not matter whether you use the embedded or
>> external PMIx.
>> In the latter case, Open MPI and SLURM have to use compatible PMIx
>> libraries, and you can either check the cross-version compatibility
>> matrix,
>> or build Open MPI with the same PMIx used by SLURM to be on the safe
>> side (not a bad idea IMHO).
>>
>>
>> Regarding the hang, I suggest you try different things
>> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
>> runs on a compute node rather than on a frontend node)
>> - try something even simpler such as mpirun hostname (both with sbatch
>> and salloc)
>> - explicitly specify the network to be used for the wire-up. you can
>> for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 
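
A minimal sketch of the two launch modes Gilles describes (the SLURM PMIx
plugin name is site-dependent; srun --mpi=list shows what the local SLURM
actually provides):

  # Launched by Open MPI's own runtime (mpirun/orted act as the PMIx server):
  mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

  # Launched directly under SLURM's PMIx server:
  srun --mpi=pmix_v2 -n 4 ./helloWorld.mpi3.SLES12.OMPI400.exe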

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >
> > With some help, I managed to build an Open MPI 4.0.0 with:
>
> We can discuss each of these params to let you know what they are.
>
> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>
> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
>

I've had these in my Open MPI builds for a while now. The reason was one of
the libraries I need for the climate model I work on went nuts if both of
them weren't there. It was originally the rpath one but then eventually
(Open MPI 3?) I had to add the runpath one. But I have been updating the
libraries more aggressively recently (due to OS upgrades) so it's possible
this is no longer needed.


>
> > --with-psm2
>
> Ensure that Open MPI can include support for the PSM2 library, and abort
> configure if it cannot.
>
> > --with-slurm
>
> Ensure that Open MPI can include support for SLURM, and abort configure if
> it cannot.
>
> > --enable-mpi1-compatibility
>
> Add support for MPI_Address and other MPI-1 functions that have since been
> deleted from the MPI 3.x specification.
>
> > --with-ucx
>
> Ensure that Open MPI can include support for UCX, and abort configure if
> it cannot.
>
> > --with-pmix=/usr/nlocal/pmix/2.1
>
> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1
> (instead of using the PMIx that is bundled internally to Open MPI's source
> code tree/expanded tarball).
>
> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
>

Ah. I did not know that. I figured if our SLURM was built linked to a
specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
build an Open MPI 4 without specifying this.


>
> > --with-libevent=/usr
>
> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
>
> > CC=icc CXX=icpc FC=ifort
>
> Specify the exact compilers to use.
>
> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
>
> Might as well remove the --with-libevent if you don't need it.
>
> > However, I seem to have built an Open MPI that doesn't work:
> >
> > (1099)(master) $ mpirun --version
> > mpirun (Open MPI) 4.0.0
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >
> > It just sits there...forever. Can the gurus here help me figure out what
> I managed to break? Perhaps I added too much to my configure line? Not
> enough?
>
> There could be a few things going on here.
>
> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an
> "sbatch" script?
>

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just
fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI
is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a
few months ago), and it had some interesting startup scaling I liked (slow
at low core count, but getting close to Intel MPI at high core count),
though it seemed to not work after about 100 nodes (4000 processes) or so.
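
Putting Jeff's suggestions together, a sketch of the trimmed configure line
(prefix and other site-specific options omitted; the Omni-Path follow-ups
elsewhere in this thread also ended up dropping --with-ucx):

  ../configure --with-psm2 --with-slurm --enable-mpi1-compatibility \
      CC=icc CXX=icpc FC=ifort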

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
All,

With some help, I managed to build an Open MPI 4.0.0 with:

./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2
--with-slurm --enable-mpi1-compatibility --with-ucx
--with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc
FC=ifort

The MPI 1 is because I need to build HDF5 eventually and I added psm2
because it's an Omnipath cluster. The libevent was probably a red herring
as libevent-devel wasn't installed on the system. It was eventually, and I
just didn't remove the flag. And I saw no errors in the build!

However, I seem to have built an Open MPI that doesn't work:

(1099)(master) $ mpirun --version
mpirun (Open MPI) 4.0.0

Report bugs to http://www.open-mpi.org/community/help/
(1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

It just sits there...forever. Can the gurus here help me figure out what I
managed to break? Perhaps I added too much to my configure line? Not enough?

Thanks,
Matt
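
A couple of first-step diagnostics one might try when mpirun hangs silently
(standard MCA verbosity knobs; the exact output will vary):

  mpirun -np 1 hostname
  mpirun --mca plm_base_verbose 10 --mca oob_base_verbose 10 -np 4 hostname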

On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson  wrote:

> Dear Open MPI Gurus,
>
> A cluster I use recently updated their SLURM to have support for UCX and
> PMIx. These are names I've seen and heard often at SC BoFs and posters, but
> now is my first time to play with them.
>
> So, my first question is how exactly should I build Open MPI to try these
> features out. I'm guessing I'll need things like "--with-ucx" to test UCX,
> but is anything needed for PMIx?
>
> Second, when it comes to running Open MPI, are there new MCA parameters I
> need to look out for when testing?
>
> Sorry for the generic questions, but I'm more on the user end of the
> cluster than the administrator end, so I tend to get lost in the detailed
> presentations, etc. I see online.
>
> Thanks,
> Matt
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-17 Thread Matt Thompson
Dear Open MPI Gurus,

A cluster I use recently updated their SLURM to have support for UCX and
PMIx. These are names I've seen and heard often at SC BoFs and posters, but
now is my first time to play with them.

So, my first question is how exactly should I build Open MPI to try these
features out. I'm guessing I'll need things like "--with-ucx" to test UCX,
but is anything needed for PMIx?

Second, when it comes to running Open MPI, are there new MCA parameters I
need to look out for when testing?

Sorry for the generic questions, but I'm more on the user end of the
cluster than the administrator end, so I tend to get lost in the detailed
presentations, etc. I see online.

Thanks,
Matt
-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] 3.1.1 Bindings Change

2018-07-03 Thread Matt Thompson
Dear Open MPI Gurus,

In the latest 3.1.1 announcement, I saw:

- Fix dummy variable names for the mpi and mpi_f08 Fortran bindings to
  match the MPI standard.  This may break applications which use
  name-based parameters in Fortran which used our internal names
  rather than those documented in the MPI standard.

Is there an example of this change somewhere (in the Git issues or another
place)? I don't think we have anything in our software that would be hit by
this (since we test/run our code with Intel MPI, MPT as well as Open MPI),
but I want to be sure we don't have some hidden #ifdef OPENMPI somewhere.

Matt

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] NAS benchmark

2018-02-03 Thread Matt Thompson
Well, whenever I see a "relocation truncated to fit" error, my first
thought is to add "-mcmodel=medium" to the compile flags. I'm surprised NAS
Benchmarks need it, though.
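
For reference, a sketch of what trying that flag would look like here (for
the NPB build it would normally go into the FFLAGS/FLINKFLAGS entries of
config/make.def rather than onto a single command line, and it assumes a
GCC- or Intel-compatible compiler behind mpif90):

  mpif90 -c -I/usr/local/include -O -mcmodel=medium x_solve.f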

On Sat, Feb 3, 2018 at 3:48 AM, Mahmood Naderan <mahmood...@gmail.com>
wrote:

> Hi,
> Any body has tried NAS benchmark with ompi? I get the following linker
> error while building one of the benchmarks.
>
> [mahmood@rocks7 NPB3.3-MPI]$ make BT NPROCS=4 CLASS=D
>=
>=  NAS Parallel Benchmarks 3.3  =
>=  MPI/F77/C=
>=
>
> cd BT; make NPROCS=4 CLASS=D SUBTYPE= VERSION=
> make[1]: Entering directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/BT'
> make[2]: Entering directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/sys'
> cc -g  -o setparams setparams.c
> make[2]: Leaving directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/sys'
> ../sys/setparams bt 4 D
> make[2]: Entering directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/BT'
> make.def modified. Rebuilding npbparams.h just in case
> rm -f npbparams.h
> ../sys/setparams bt 4 D
> mpif90 -c -I/usr/local/include -O bt.f
> mpif90 -c -I/usr/local/include -O make_set.f
> mpif90 -c -I/usr/local/include -O initialize.f
> mpif90 -c -I/usr/local/include -O exact_solution.f
> mpif90 -c -I/usr/local/include -O exact_rhs.f
> mpif90 -c -I/usr/local/include -O set_constants.f
> mpif90 -c -I/usr/local/include -O adi.f
> mpif90 -c -I/usr/local/include -O define.f
> mpif90 -c -I/usr/local/include -O copy_faces.f
> mpif90 -c -I/usr/local/include -O rhs.f
> mpif90 -c -I/usr/local/include -O solve_subs.f
> mpif90 -c -I/usr/local/include -O x_solve.f
> mpif90 -c -I/usr/local/include -O y_solve.f
> mpif90 -c -I/usr/local/include -O z_solve.f
> mpif90 -c -I/usr/local/include -O add.f
> mpif90 -c -I/usr/local/include -O error.f
> mpif90 -c -I/usr/local/include -O verify.f
> mpif90 -c -I/usr/local/include -O setup_mpi.f
> make[3]: Entering directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/BT'
> mpif90 -c -I/usr/local/include -O btio.f
> mpif90 -O -o ../bin/bt.D.4 bt.o make_set.o initialize.o exact_solution.o
> exact_rhs.o set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o
> x_solve.o y_solve.o z_solve.o add.o error.o verify.o setup_mpi.o
> ../common/print_results.o ../common/timers.o btio.o -L/usr/local/lib -lmpi
> x_solve.o: In function `x_solve_cell_':
> x_solve.f:(.text+0x77a): relocation truncated to fit: R_X86_64_32 against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x77f): relocation truncated to fit: R_X86_64_32 against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x946): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x94e): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x958): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x962): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x96c): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x9ab): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x9c6): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0x9f3): relocation truncated to fit: R_X86_64_32S against
> symbol `work_lhs_' defined in COMMON section in x_solve.o
> x_solve.f:(.text+0xa21): additional relocation overflows omitted from the
> output
> collect2: error: ld returned 1 exit status
> make[3]: *** [bt-bt] Error 1
> make[3]: Leaving directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/BT'
> make[2]: *** [exec] Error 2
> make[2]: Leaving directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/BT'
> make[1]: *** [../bin/bt.D.4] Error 2
> make[1]: Leaving directory `/home/mahmood/Downloads/NPB3.
> 3.1/NPB3.3-MPI/BT'
> make: *** [bt] Error 2
>
>
> There is a good guide about that
> (https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html) but I
> don't know which compiler flag I should use to fix that.
>
> Any idea?
>
> Regards,
> Mahmood
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.

[OMPI users] mpi_f08 interfaces in man3 pages?

2017-08-10 Thread Matt Thompson
OMPI Users,

I know from a while back when I was scanning git to find some other thing,
I saw a kind user (Gilles Gouaillardet?) added the F08 interfaces into the
man pages. As I am lazy, 'man mpi_send' would be nicer than me pulling out
my Big Blue Book to look it up. (Second in laziness is using Google, so I'd
love to see the F08 interfaces in the man 3 web page docs as well.)

Is there a timeline as to when these might get into the Open MPI releases?
I'm not the best at nroff, but I'll gladly help out in any way I can.

Matt

PS: An effort is happening where I work to incorporate one-sided MPI
combined with OpenMP/threads into our code. So far, Open MPI is the only
stack we've tried where the code doesn't weirdly die in odd places, so I
might be coming back here with more questions when we try to improve the
performance/encounter problems.

-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Tuning vader for MPI_Wait Halt?

2017-06-07 Thread Matt Thompson
Nathan,

Sadly, I'm not sure I can provide a reproducer, as it's currently our full
earth system model and is accessing terabytes of background files, etc.
That said, I'll work on it. I have a tiny version of the model, but that
usually always works everywhere (and I can only reproduce the issue at a
rather high resolution).

We do have a couple of code testers that duplicate functionality around
that MPI_Wait call, but, and this is the fun part, it seems to be a very
specific type of that call (only if you are doing a daily time-averaged
collection!). Still, I'll try and test that tester with Open MPI 2.1.0.
Maybe it'll hang!

As for kernel, my desktop is 3.10.0-514.16.1.el7.x86_64 (RHEL 7)  and the
cluster compute node is on 3.0.101-0.47.90-default (SLES11 SP3). If I run
'lsmod' I see xpmem on the cluster, but my desktop does not have it. So,
perhaps not XPMEM related?

Matt

On Mon, Jun 5, 2017 at 1:00 PM, Nathan Hjelm <hje...@me.com> wrote:

> Can you provide a reproducer for the hang? What kernel version are you
> using? Is xpmem installed?
>
> -Nathan
>
> On Jun 05, 2017, at 10:53 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> OMPI Users,
>
> I was wondering if there is a best way to "tune" vader to get around an
> intermittent MPI_Wait halt?
>
> I ask because I recently found that if I use Open MPI 2.1.x on either my
> desktop or on the supercomputer I have access to, if vader is enabled, the
> model seems to "deadlock" at an MPI_Wait call. If I run as:
>
>   mpirun --mca btl self,sm,tcp
>
> on my desktop it works. When I moved to my cluster, I tried the more
> generic:
>
>   mpirun --mca btl ^vader
>
> since it uses openib, and with it things work. Well, I hope that's how one
> would turn off vader in MCA speak. (Note: this deadlock seems a bit
> sporadic, but I do now have a case which seems to cause it reproducibly).
>
> Now, I know vader is supposed to be the "better" sm communication tech, so
> I'd rather use it and thought maybe I could twiddle some tuning knobs. So I
> looked at:
>
>   https://www.open-mpi.org/faq/?category=sm
>
> and there I saw question 6 "How do I know what MCA parameters are
> available for tuning MPI performance?". But when I try the commands listed
> (minus the HTML/CSS tags):
>
> (1081) $ ompi_info --param btl sm
>  MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0)
> (1082) $ ompi_info --param mpool sm
> (1083) $
>
> Huh. I expected more, but searching around the Open MPI FAQs made me think
> I should use:
>
>   ompi_info --param btl sm --level 9
>
> which does spit out a lot, though the equivalent for mpool sm does not.
>
> Any ideas on which of the many knobs is best to try and turn? Something
> that, by default, perhaps is one thing for sm but different for vader? I
> tried to see if "ompi_info --param btl vader --level 9" did something, but
> it doesn't put anything out.
>
> I will note that this code runs just fine with Open MPI 2.0.2 as well as
> with Intel MPI and SGI MPT, so I'm thinking the code itself is okay, but
> something from Open MPI 2.0.x to Open MPI 2.1.x changed. I see two entries
> in the Open MPI 2.1.0 announcement about vader, but nothing specific about
> how to "revert" if they are even causing the problem:
>
> - Fix regression that lowered the memory maximum message bandwidth for
>   large messages on some BTL network transports, such as openib, sm,
>   and vader.
>
>
> - The vader BTL is now more efficient in terms of memory usage when
>   using XPMEM.
>
>
> Thanks for any help,
> Matt
>
>
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Tuning vader for MPI_Wait Halt?

2017-06-05 Thread Matt Thompson
OMPI Users,

I was wondering if there is a best way to "tune" vader to get around an
intermittent MPI_Wait halt?

I ask because I recently found that if I use Open MPI 2.1.x on either my
desktop or on the supercomputer I have access to, if vader is enabled, the
model seems to "deadlock" at an MPI_Wait call. If I run as:

  mpirun --mca btl self,sm,tcp

on my desktop it works. When I moved to my cluster, I tried the more
generic:

  mpirun --mca btl ^vader

since it uses openib, and with it things work. Well, I hope that's how one
would turn off vader in MCA speak. (Note: this deadlock seems a bit
sporadic, but I do now have a case which seems to cause it reproducibly).

Now, I know vader is supposed to be the "better" sm communication tech, so
I'd rather use it and thought maybe I could twiddle some tuning knobs. So I
looked at:

  https://www.open-mpi.org/faq/?category=sm

and there I saw question 6 "How do I know what MCA parameters are available
for tuning MPI performance?". But when I try the commands listed (minus the
HTML/CSS tags):

(1081) $ ompi_info --param btl sm
 MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0)
(1082) $ ompi_info --param mpool sm
(1083) $

Huh. I expected more, but searching around the Open MPI FAQs made me think
I should use:

  ompi_info --param btl sm --level 9

which does spit out a lot, though the equivalent for mpool sm does not.

Any ideas on which of the many knobs is best to try and turn? Something
that, by default, perhaps is one thing for sm but different for vader? I
tried to see if "ompi_info --param btl vader --level 9" did something, but
it doesn't put anything out.
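
A fallback that sometimes helps when the per-component query prints nothing
is to dump every registered parameter and filter for the component name:

  ompi_info --all | grep -i vader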

I will note that this code runs just fine with Open MPI 2.0.2 as well as
with Intel MPI and SGI MPT, so I'm thinking the code itself is okay, but
something from Open MPI 2.0.x to Open MPI 2.1.x changed. I see two entries
in the Open MPI 2.1.0 announcement about vader, but nothing specific about
how to "revert" if they are even causing the problem:

- Fix regression that lowered the memory maximum message bandwidth for
  large messages on some BTL network transports, such as openib, sm,
  and vader.


- The vader BTL is now more efficient in terms of memory usage when
  using XPMEM.


Thanks for any help,
Matt


-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Matt Thompson
> >>>>>>>>>> On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov
> >>>>>>>>>> <mailto:pbis...@pppl.gov>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>Greeting Open MPI users! After being off this list for
> several
> >>>>>>>>>>years, I'm back! And I need help:
> >>>>>>>>>>
> >>>>>>>>>>I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
> >>>>>>>>>>version 17.3. I'm using the following configure options:
> >>>>>>>>>>
> >>>>>>>>>>./configure \
> >>>>>>>>>> --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
> >>>>>>>>>>  --disable-silent-rules \
> >>>>>>>>>>  --enable-shared \
> >>>>>>>>>>  --enable-static \
> >>>>>>>>>>  --enable-mpi-thread-multiple \
> >>>>>>>>>>  --with-pmi=/usr/pppl/slurm/15.08.8 \
> >>>>>>>>>>  --with-hwloc \
> >>>>>>>>>>  --with-verbs \
> >>>>>>>>>>  --with-slurm \
> >>>>>>>>>>  --with-psm \
> >>>>>>>>>>  CC=pgcc \
> >>>>>>>>>>  CFLAGS="-tp x64 -fast" \
> >>>>>>>>>>  CXX=pgc++ \
> >>>>>>>>>>  CXXFLAGS="-tp x64 -fast" \
> >>>>>>>>>>  FC=pgfortran \
> >>>>>>>>>>  FCFLAGS="-tp x64 -fast" \
> >>>>>>>>>>  2>&1 | tee configure.log
> >>>>>>>>>>
> >>>>>>>>>>Which leads to this error  from libtool during make:
> >>>>>>>>>>
> >>>>>>>>>>pgcc-Error-Unknown switch: -pthread
> >>>>>>>>>>
> >>>>>>>>>>I've searched the archives, which ultimately lead to this
> work
> >>>>>>>>>>around from 2009:
> >>>>>>>>>>
> >>>>>>>>>> https://www.open-mpi.org/community/lists/users/2009/04/8724.php
> >>>>>>>>>> <https://www.open-mpi.org/community/lists/users/2009/04/
> 8724.php>
> >>>>>>>>>>
> >>>>>>>>>>Interestingly, I participated in the discussion that lead to
> that
> >>>>>>>>>>workaround, stating that I had no problem compiling Open MPI
> with
> >>>>>>>>>>PGI v9. I'm assuming the problem now is that I'm specifying
> >>>>>>>>>>--enable-mpi-thread-multiple, which I'm doing because a user
> >>>>>>>>>>requested that feature.
> >>>>>>>>>>
> >>>>>>>>>>It's been exactly 8 years and 2 days since that workaround
> was
> >>>>>>>>>>posted to the list. Please tell me a better way of dealing
> with
> >>>>>>>>>>this issue than writing a 'fakepgf90' script. Any
> suggestions?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>--
> >>>>>>>>>>Prentice
> >>>>>>>>>>
> >>>>>>>>>> ___
> >>>>>>>>>>users mailing list
> >>>>>>>>>>users@lists.open-mpi.org
> >>>>>>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >>>>>>>>>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/users>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ___
> >>>>>>>>>> users mailing list
> >>>>>>>>>> users@lists.open-mpi.org
> >>>>>>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ___
> >>>>>>>>> users mailing list
> >>>>>>>>> users@lists.open-mpi.org
> >>>>>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>> ___
> >>>> users mailing list
> >>>> users@lists.open-mpi.org
> >>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >>>
> >>
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >
>
> -BEGIN PGP SIGNATURE-
> Comment: GPGTools - https://gpgtools.org
>
> iEYEARECAAYFAljiu0YACgkQo/GbGkBRnRpGowCgha3O1wvYyQQOrsYuUqSGJq2B
> qHEAnRyT0PHY75NmmI9Efv4CkM7aJjVp
> =f5Xk
> -END PGP SIGNATURE-
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
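
For reference, the 'fakepgf90' workaround referenced in the quoted thread is
just a wrapper that strips the offending switch before invoking the real
compiler. A minimal, hypothetical sketch of the same idea for the C compiler
(call it fakepgcc; untested, and the cleaner fix is of course for
libtool/configure to stop passing -pthread to pgcc):

  #!/bin/bash
  # Drop the -pthread switch, which pgcc rejects, and pass everything
  # else through to the real compiler.
  args=()
  for a in "$@"; do
    [[ "$a" == "-pthread" ]] || args+=("$a")
  done
  exec pgcc "${args[@]}"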

[OMPI users] Issues with PGI 16.10, OpenMPI 2.1.0 on macOS: Fortran issues with hello world (running and dylib)

2017-03-24 Thread Matt Thompson
v2.1.0, package: Open MPI
> user@computer.concealed Distribution, ident: 2.1.0, repo rev:
> v2.0.1-696-g1cd1edf, Mar 20, 2017
>
> Hello, world, I am  2 of  4: Open MPI v2.1.0, package: Open MPI
> user@computer.concealed Distribution, ident: 2.1.0, repo rev:
> v2.0.1-696-g1cd1edf, Mar 20, 2017
>


PGI 16.10:

(831) $ mpifort -o hello_usempif08.exe hello_usempif08.f90
> (832) $ mpirun -np 4 ./hello_usempif08.exe
> [computer.concealed:87920] mca_base_component_repository_open: unable to
> open mca_patcher_overwrite: File not found (ignored)
> [computer.concealed:87920] mca_base_component_repository_open: unable to
> open mca_shmem_mmap: File not found (ignored)
> [computer.concealed:87920] mca_base_component_repository_open: unable to
> open mca_shmem_posix: File not found (ignored)
> [computer.concealed:87920] mca_base_component_repository_open: unable to
> open mca_shmem_sysv: File not found (ignored)
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> ------


Well okay then. I didn't see any errors during the build for PGI:

(835) $ grep Error make.pgi-16.10.log
>   SED  mpi/man/man3/MPI_Error_class.3
>   SED  mpi/man/man3/MPI_Error_string.3


Any ideas/help?

-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-24 Thread Matt Thompson
Gilles,

The library I have having issues linking is ESMF and it is a C++/Fortran
application. From
http://www.earthsystemmodeling.org/esmf_releases/non_public/ESMF_7_0_0/ESMF_usrdoc/node9.html#SECTION00092000
:

The following compilers and utilities *are required* for compiling, linking
> and testing the ESMF software:
> Fortran90 (or later) compiler;
> C++ compiler;
> MPI implementation compatible with the above compilers (but see below);
> GNU's gcc compiler - for a standard cpp preprocessor implementation;
> GNU make;
> Perl - for running test scripts.


(Emphasis mine)

This is why I am concerned. For now, I'll build Open MPI with the (possibly
useless) C++ support for PGI and move on to the Fortran issue (which I'll
detail in another email).

But, as I *need* ESMF for my application, it would be good to get an mpicxx
that I can have confidence in with PGI.

Matt


On Thu, Mar 23, 2017 at 9:05 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Matt,
>
> a C++ compiler is required to configure Open MPI.
> That being said, the C++ compiler is only used if you build the C++
> bindings (which were removed in MPI-3).
> And unless you plan to use the mpic++ wrapper (with or without the C++
> bindings),
> a valid C++ compiler is not required at all.
> /* configure still requires one, and that could be improved */
>
> My point is you should not worry too much about configure messages related
> to C++,
> and you should instead focus on the Fortran issue.
>
> Cheers,
>
> Gilles
>
> On Thursday, March 23, 2017, Matt Thompson <fort...@gmail.com> wrote:
>
>> All, I'm hoping one of you knows what I might be doing wrong here.  I'm
>> trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS.
>> Now, I built it a la:
>>
>> http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105
>>
>> and found that it built, but the resulting mpifort, etc were just not
>> good. Couldn't even do Hello World.
>>
>> So, I thought I'd start from the beginning. I tried running:
>>
>> configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
>> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0
>> but when I did I saw this:
>>
>> *** C++ compiler and preprocessor
>> checking whether we are using the GNU C++ compiler... yes
>> checking whether pgc++ accepts -g... yes
>> checking dependency style of pgc++... none
>> checking how to run the C++ preprocessor... pgc++ -E
>> checking for the C++ compiler vendor... gnu
>>
>> Well, that's not the right vendor. So, I took a look at configure and I
>> saw that at least some detection for PGI was a la:
>>
>>   pgCC* | pgcpp*)
>> # Portland Group C++ compiler
>> case `$CC -V` in
>> *pgCC\ [1-5].* | *pgcpp\ [1-5].*)
>>
>>   pgCC* | pgcpp*)
>> # Portland Group C++ compiler
>> lt_prog_compiler_wl_CXX='-Wl,'
>> lt_prog_compiler_pic_CXX='-fpic'
>> lt_prog_compiler_static_CXX='-Bstatic'
>> ;;
>>
>> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked
>> configure so that references to pgCC (nonexistent on macOS) are gone and
>> all pgcpp became pgc++, but:
>>
>> *** C++ compiler and preprocessor
>> checking whether we are using the GNU C++ compiler... yes
>> checking whether pgc++ accepts -g... yes
>> checking dependency style of pgc++... none
>> checking how to run the C++ preprocessor... pgc++ -E
>> checking for the C++ compiler vendor... gnu
>>
>> Well, at this point, I think I'm stopping until I get help. Will this
>> chunk of configure always return gnu for PGI? I know the C part returns
>> 'portland group':
>>
>> *** C compiler and preprocessor
>> checking for gcc... (cached) pgcc
>> checking whether we are using the GNU C compiler... (cached) no
>> checking whether pgcc accepts -g... (cached) yes
>> checking for pgcc option to accept ISO C89... (cached) none needed
>> checking whether pgcc understands -c and -o together... (cached) yes
>> checking for pgcc option to accept ISO C99... none needed
>> checking for the C compiler vendor... portland group
>>
>> so I thought the C++ section would as well. I also tried passing in
>> --enable-mpi-cxx, but that did nothing.
>>
>> Is this just a red herring? My real concern is with pgfortran/mpifort,
>> but I thought I'd start with this. If this is okay, I'll move on and detail
>> the fortran issues I'm having.
>>
>> Matt
>> --
>> Matt Thompson
>>
>> Man Among Men
>> Fulcrum of History
>>
>>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-22 Thread Matt Thompson
All, I'm hoping one of you knows what I might be doing wrong here.  I'm
trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS.
Now, I built it a la:

http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105

and found that it built, but the resulting mpifort, etc were just not good.
Couldn't even do Hello World.

So, I thought I'd start from the beginning. I tried running:

configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
--prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0
but when I did I saw this:

*** C++ compiler and preprocessor
checking whether we are using the GNU C++ compiler... yes
checking whether pgc++ accepts -g... yes
checking dependency style of pgc++... none
checking how to run the C++ preprocessor... pgc++ -E
checking for the C++ compiler vendor... gnu

Well, that's not the right vendor. So, I took a look at configure and I saw
that at least some detection for PGI was a la:

  pgCC* | pgcpp*)
# Portland Group C++ compiler
case `$CC -V` in
*pgCC\ [1-5].* | *pgcpp\ [1-5].*)

  pgCC* | pgcpp*)
# Portland Group C++ compiler
lt_prog_compiler_wl_CXX='-Wl,'
lt_prog_compiler_pic_CXX='-fpic'
lt_prog_compiler_static_CXX='-Bstatic'
;;

Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked
configure so that references to pgCC (nonexistent on macOS) are gone and
all pgcpp became pgc++, but:

*** C++ compiler and preprocessor
checking whether we are using the GNU C++ compiler... yes
checking whether pgc++ accepts -g... yes
checking dependency style of pgc++... none
checking how to run the C++ preprocessor... pgc++ -E
checking for the C++ compiler vendor... gnu

Well, at this point, I think I'm stopping until I get help. Will this chunk
of configure always return gnu for PGI? I know the C part returns 'portland
group':

*** C compiler and preprocessor
checking for gcc... (cached) pgcc
checking whether we are using the GNU C compiler... (cached) no
checking whether pgcc accepts -g... (cached) yes
checking for pgcc option to accept ISO C89... (cached) none needed
checking whether pgcc understands -c and -o together... (cached) yes
checking for pgcc option to accept ISO C99... none needed
checking for the C compiler vendor... portland group

so I thought the C++ section would as well. I also tried passing in
--enable-mpi-cxx, but that did nothing.

Is this just a red herring? My real concern is with pgfortran/mpifort, but
I thought I'd start with this. If this is okay, I'll move on and detail the
fortran issues I'm having.

Matt
-- 
Matt Thompson

Man Among Men
Fulcrum of History
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Issues building Open MPI 2.0.1 with PGI 16.10 on macOS

2016-11-30 Thread Matt Thompson
Well, jtull over at PGI seemed to have the "magic sauce":

http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105

Namely, I think it's the siterc file. I'm not sure which of the adaptations
fixes the issue yet, though.

On Mon, Nov 28, 2016 at 3:11 PM, Jeff Hammond <jeff.scie...@gmail.com>
wrote:

> attached config.log that contains the details of the following failures is
> the best way to make forward-progress here.  that none of the system
> headers are detected suggests a rather serious compiler problem that may
> not have anything to do with headers.
>
> checking for sys/types.h... no
> checking for sys/stat.h... no
> checking for stdlib.h... no
> checking for string.h... no
> checking for memory.h... no
> checking for strings.h... no
> checking for inttypes.h... no
> checking for stdint.h... no
> checking for unistd.h... no
>
>
> On Mon, Nov 28, 2016 at 9:49 AM, Matt Thompson <fort...@gmail.com> wrote:
>
>> Hmm. Well, I definitely have /usr/include/stdint.h, as I was previously
>> trying to work with clang as the compiler stack. And as near as I can tell, Open
>> MPI's configure is seeing /usr/include as oldincludedir, but maybe that's
>> not how it finds it?
>>
>> If I check my configure output:
>>
>> 
>> 
>> == Configuring Open MPI
>> 
>> 
>>
>> *** Startup tests
>> checking build system type... x86_64-apple-darwin15.6.0
>> 
>> checking for sys/types.h... yes
>> checking for sys/stat.h... yes
>> checking for stdlib.h... yes
>> checking for string.h... yes
>> checking for memory.h... yes
>> checking for strings.h... yes
>> checking for inttypes.h... yes
>> checking for stdint.h... yes
>> checking for unistd.h... yes
>>
>> So, the startup saw it. But:
>>
>> --- MCA component event:libevent2022 (m4 configuration macro, priority 80)
>> checking for MCA component event:libevent2022 compile mode... static
>> checking libevent configuration args... --disable-dns --disable-http
>> --disable-rpc --disable-openssl --enable-thread-support --d
>> isable-evport
>> configure: OPAL configuring in opal/mca/event/libevent2022/libevent
>> configure: running /bin/sh './configure' --disable-dns --disable-http
>> --disable-rpc --disable-openssl --enable-thread-support --
>> disable-evport  '--disable-wrapper-rpath' 'CC=pgcc' 'CXX=pgc++'
>> 'FC=pgfortran' 'CFLAGS=-m64' 'CXXFLAGS=-m64' 'FCFLAGS=-m64' '--w
>> ithout-verbs' '--prefix=/Users/mathomp4/inst
>> alled/Compiler/pgi-16.10/openmpi/2.0.1' 'CPPFLAGS=-I/Users/mathomp4/sr
>> c/MPI/openmpi-
>> 2.0.1 -I/Users/mathomp4/src/MPI/openmpi-2.0.1
>> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/include
>> -I/Users/mathomp4/src/MPI/o
>> penmpi-2.0.1/opal/mca/hwloc/hwloc1112/hwloc/include
>> -Drandom=opal_random' --cache-file=/dev/null --srcdir=. --disable-option-che
>> cking
>> checking for a BSD-compatible install... /usr/bin/install -c
>> 
>> checking for sys/types.h... no
>> checking for sys/stat.h... no
>> checking for stdlib.h... no
>> checking for string.h... no
>> checking for memory.h... no
>> checking for strings.h... no
>> checking for inttypes.h... no
>> checking for stdint.h... no
>> checking for unistd.h... no
>>
>> So, it's like whatever magic found stdint.h for the startup isn't passed
>> down to libevent when it builds? As I scan the configure output, PMIx sees
>> stdint.h in its section and ROMIO sees it as well, but not libevent2022.
>> The Makefiles inside of libevent2022 do have 'oldincludedir =
>> /usr/include'. Hmm.
>>
>>
>>
>> On Mon, Nov 28, 2016 at 11:39 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>
>>> I think PGI uses installed GCC components for some parts of standard C
>>> (at least for some things on Linux, it does; and I imagine it is
>>> similar for Mac).  If you look at the post at
>>>
>>> http://www.pgroup.com/userforum/viewtopic.php?t=5147=17f
>>> 3afa2cd0eec05b0f4e54a60f50479
>>>
>>> The problem seems to have been one with the Xcode configuration:
>>>
>>> "It turns out my Xcode was messed up as I was missing /usr/include/.
>>>  After rerunning xcode-select --install it works now."
>>>
>>> On my OS X 10.11.6, I have /usr/include/stdint.h without having the
>>> PGI compilers.  This may be related to the GNU command line tools
>>> installation...?  I think t
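
Two quick sanity checks for the missing-header theory on macOS (the
xcode-select step is the fix reported in the quoted forum post above; both
commands are harmless to rerun):

  ls -l /usr/include/stdint.h
  xcode-select --install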

Re: [OMPI users] Issues building Open MPI 2.0.1 with PGI 16.10 on macOS

2016-11-28 Thread Matt Thompson
types of exactly 64, 32, 16, and 8 bits
> >
> >  *  respectively.
> >
> >  *ev_uintptr_t, ev_intptr_t
> >
> >  *  unsigned/signed integers large enough
> >
> >  *  to hold a pointer without loss of bits.
> >
> >  *ev_ssize_t
> >
> >  *  A signed type of the same size as size_t
> >
> >  *ev_off_t
> >
> >  *  A signed type typically used to represent offsets within a
> >
> >  *  (potentially large) file
> >
> >  *
> >
> >  * @{
> >
> >  */
> >
> > #ifdef _EVENT_HAVE_UINT64_T
> >
> > #define ev_uint64_t uint64_t
> >
> > #define ev_int64_t int64_t
> >
> > #elif defined(WIN32)
> >
> > #define ev_uint64_t unsigned __int64
> >
> > #define ev_int64_t signed __int64
> >
> > #elif _EVENT_SIZEOF_LONG_LONG == 8
> >
> > #define ev_uint64_t unsigned long long
> >
> > #define ev_int64_t long long
> >
> > #elif _EVENT_SIZEOF_LONG == 8
> >
> > #define ev_uint64_t unsigned long
> >
> > #define ev_int64_t long
> >
> > #elif defined(_EVENT_IN_DOXYGEN)
> >
> > #define ev_uint64_t ...
> >
> > #define ev_int64_t ...
> >
> > #else
> >
> > #error "No way to define ev_uint64_t"
> >
> > #endif
> >
> >
> > On Mon, Nov 28, 2016 at 5:04 AM, Matt Thompson <fort...@gmail.com>
> wrote:
> >>
> >> All,
> >>
> >> I recently tried building Open MPI 2.0.1 with the new Community Edition
> of
> >> PGI on macOS. My first mistake was I was configuring with a configure
> line
> >> I'd cribbed from Linux that had -fPIC. Apparently -fPIC was removed
> from the
> >> macOS build. Okay, I can remove that and I configured with:
> >>
> >> ./configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
> >> CFLAGS='-m64' CXXFLAGS='-m64' FCFLAGS='-m64' --without-verbs
> >> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.0.1 |
> & tee
> >> configure.pgi-16.10.log
> >>
> >> But, now, when I try to actually build, I get an error pretty quick
> inside
> >> the make:
> >>
> >>   CC   printf.lo
> >>   CC   proc.lo
> >>   CC   qsort.lo
> >>
> >> PGC-F-0249-#error --  "No way to define ev_uint64_t"
> >> (/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/
> libevent2022/libevent/include/event2/util.h:
> >> 126)
> >> PGC/x86-64 OSX 16.10-0: compilation aborted
> >>   CC   show_help.lo
> >> make[3]: *** [proc.lo] Error 1
> >> make[3]: *** Waiting for unfinished jobs
> >> make[2]: *** [all-recursive] Error 1
> >> make[1]: *** [all-recursive] Error 1
> >> make: *** [all-recursive] Error 1
> >>
> >> This was done with -j2, so if I remake with 'make V=1' I see:
> >>
> >> source='proc.c' object='proc.lo' libtool=yes \
> >> DEPDIR=.deps depmode=pgcc /bin/sh ../../config/depcomp \
> >> /bin/sh ../../libtool  --tag=CC   --mode=compile pgcc -DHAVE_CONFIG_H
> -I.
> >> -I../../opal/include -I../../ompi/include -I../../oshmem/include
> >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
> >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
> >> -I../../ompi/mpiext/cuda/c   -I../.. -I../../orte/include
> >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/
> hwloc1112/hwloc/include
> >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/
> libevent2022/libevent
> >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/
> libevent2022/libevent/include
> >> -O -DNDEBUG -m64  -c -o proc.lo proc.c
> >> libtool: compile:  pgcc -DHAVE_CONFIG_H -I. -I../../opal/include
> >> -I../../ompi/include -I../../oshmem/include
> >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
> >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
> >> -I../../ompi/mpiext/cuda/c -I../.. -I../../orte/include
> >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/
> hwloc1112/hwloc/include
> >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/
> libevent2022/libevent
> >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/
> libevent2022/libevent/include
> >> -O -DNDEBUG -m64 -c proc.c -MD -o proc.o
> >> PGC-F-0249-#error --  "No way to define ev_uint64_t"
> >> (/Users/mathomp4/src

[OMPI users] Issues building Open MPI 2.0.1 with PGI 16.10 on macOS

2016-11-28 Thread Matt Thompson
All,

I recently tried building Open MPI 2.0.1 with the new Community Edition of
PGI on macOS. My first mistake was I was configuring with a configure line
I'd cribbed from Linux that had -fPIC. Apparently -fPIC was removed from
the macOS build. Okay, I can remove that and I configured with:

./configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
CFLAGS='-m64' CXXFLAGS='-m64' FCFLAGS='-m64' --without-verbs
--prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.0.1 | & tee
configure.pgi-16.10.log

But, now, when I try to actually build, I get an error pretty quick inside
the make:

  CC   printf.lo
  CC   proc.lo
  CC   qsort.lo

PGC-F-0249-#error --  "No way to define ev_uint64_t"
(/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include/event2/util.h:
126)
PGC/x86-64 OSX 16.10-0: compilation aborted
  CC   show_help.lo
make[3]: *** [proc.lo] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

This was done with -j2, so if I remake with 'make V=1' I see:

source='proc.c' object='proc.lo' libtool=yes \
DEPDIR=.deps depmode=pgcc /bin/sh ../../config/depcomp \
/bin/sh ../../libtool  --tag=CC   --mode=compile pgcc -DHAVE_CONFIG_H -I.
-I../../opal/include -I../../ompi/include -I../../oshmem/include
-I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
-I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
-I../../ompi/mpiext/cuda/c   -I../.. -I../../orte/include
-I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/hwloc1112/hwloc/include
-I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent
-I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include
 -O -DNDEBUG -m64  -c -o proc.lo proc.c
libtool: compile:  pgcc -DHAVE_CONFIG_H -I. -I../../opal/include
-I../../ompi/include -I../../oshmem/include
-I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
-I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
-I../../ompi/mpiext/cuda/c -I../.. -I../../orte/include
-I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/hwloc1112/hwloc/include
-I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent
-I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include
-O -DNDEBUG -m64 -c proc.c -MD -o proc.o
PGC-F-0249-#error --  "No way to define ev_uint64_t"
(/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include/event2/util.h:
126)
PGC/x86-64 OSX 16.10-0: compilation aborted
make[3]: *** [proc.lo] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

I guess my question is whether this is an issue with PGI or Open MPI or
both? I'm not too sure. I've also asked about this on the PGI forums as
well (http://www.pgroup.com/userforum/viewtopic.php?t=5413=0) since
I'm not sure. But, no matter what, does anyone have thoughts on how to
solve this?

Thanks,
Matt

-- 
Matt Thompson
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] mpi_f08 Question: set comm on declaration error, and other questions

2016-08-20 Thread Matt Thompson
On Fri, Aug 19, 2016 at 8:54 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> On Aug 19, 2016, at 6:32 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > > that the comm == MPI_COMM_WORLD evaluates to .TRUE.? I discovered that
> once when I was printing some stuff.
> >
> > That might well be a coincidence.  type(MPI_Comm) is not a boolean type,
> so I'm not sure how you compared it to .true.
> >
> > Well, I made a program like:
> >
> > (208) $ cat test2.F90
> > program whoami
> >use mpi_f08
> >implicit none
> >type(MPI_Comm) :: comm
> >if (comm == MPI_COMM_WORLD) write (*,*) "I am MPI_COMM_WORLD"
> >if (comm == MPI_COMM_NULL) write (*,*) "I am MPI_COMM_NULL"
> > end program whoami
> > (209) $ mpifort test2.F90
> > (210) $ mpirun -np 4 ./a.out
> >  I am MPI_COMM_WORLD
> >  I am MPI_COMM_WORLD
> >  I am MPI_COMM_WORLD
> >  I am MPI_COMM_WORLD
> >
> > I think if you print comm, you get 0 and MPI_COMM_WORLD=0 and
> MPI_COMM_NULL=2 so...I guess I'm surprised. I'd have thought MPI_Comm would
> have been undefined until defined.
>
> I don't know the rules here for what happens in Fortran when comparing an
> uninitialized derived type.  The results could be undefined...?
>
> > Instead you can write a program like this:
> >
> > (226) $ cat helloWorld.mpi3.F90
> > program hello_world
> >
> >use mpi_f08
> >
> >implicit none
> >
> >type(MPI_Comm) :: comm
> >integer :: myid, npes, ierror
> >integer :: name_length
> >
> >character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name
> >
> >call mpi_init(ierror)
> >
> >call MPI_Comm_Rank(comm,myid,ierror)
> >write (*,*) 'ierror: ', ierror
> >call MPI_Comm_Size(comm,npes,ierror)
> >call MPI_Get_Processor_Name(processor_name,name_length,ierror)
> >
> >write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes,
> "is on", trim(processor_name)
> >
> >call MPI_Finalize(ierror)
> >
> > end program hello_world
> > (227) $ mpifort helloWorld.mpi3.F90
> > (228) $ mpirun -np 4 ./a.out
> >  ierror:0
> >  ierror:0
> >  ierror:0
> >  ierror:0
> > Process2 of4 is on compy
> > Process1 of4 is on compy
> > Process3 of4 is on compy
> > Process0 of4 is on copy
>
> That does seem to be odd output.  What is the hostname on your machine?


Oh well, I (badly) munged the hostname on the computer I ran on because it
had the IP address within. I figured better safe than sorry and not
broadcast that out there. :)


> FWIW, I changed your write statement to:
>
> print *, "Process", myid, "of", npes, "is on", trim(processor_name)
>
> and after I added a "comm = MPI_COMM_WORLD" before the call to
> MPI_COMM_RANK, the output prints properly for me (i.e., I see my hostname).
>
> > This seems odd to me. I haven't passed in MPI_COMM_WORLD as the
> communicator to MPI_Comm_Rank, and yet, it worked and the error code was 0
> (which I'd take as success). Even if you couldn't detect this at compile
> time, I'm surprised it doesn't trigger a run-time error.  Is this the
> correct behavior according to the Standard?
>
> I think you're passing an undefined value, so the results will be
> undefined.
>
> It's quite possible that the comm%mpi_val inside the comm is (randomly?)
> assigned to 0, which is the same value as mpif.f's MPI_COMM_WORLD, and
> therefore your comm is effectively the same as mpi_f08's MPI_COMM_WORLD --
> which is why MPI_COMM_RANK and MPI_COMM_SIZE worked for you.
>
> Indeed, when I run your program, I get:
>
> -
> $ ./foo
> [savbu-usnic-a:31774] *** An error occurred in MPI_Comm_rank
> [savbu-usnic-a:31774] *** reported by process [756088833,0]
> [savbu-usnic-a:31774] *** on communicator MPI_COMM_WORLD
> [savbu-usnic-a:31774] *** MPI_ERR_COMM: invalid communicator
> [savbu-usnic-a:31774] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [savbu-usnic-a:31774] ***and potentially your MPI job)
> -
>
> I.e., MPI_COMM_RANK is aborting because the communicator being passed in
> is invalid.
>
>
Huh. I guess I'd assumed that the MPI Standard would have made sure a
declared communicator that hasn't been filled would have been an error to
use.
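
At least the practical takeaway for my own code is clear enough: make the
"unset" state explicit instead of relying on whatever an undefined handle
happens to contain. A minimal sketch of the pattern (nothing Open
MPI-specific about it):

program comm_guard
   use mpi_f08
   implicit none
   type(MPI_Comm) :: comm
   integer :: myid, ierror

   call MPI_Init(ierror)
   comm = MPI_COMM_NULL          ! explicit "not set yet" marker
   ! ... code that may or may not assign a real communicator to comm ...
   if (comm == MPI_COMM_NULL) then
      comm = MPI_COMM_WORLD      ! or treat this as an error
   end if
   call MPI_Comm_rank(comm, myid, ierror)
   call MPI_Finalize(ierror)
end program comm_guard

With that, forgetting to set the communicator is at least detectable (it
compares equal to MPI_COMM_NULL, or aborts with the MPI_ERR_COMM message
Jeff showed) rather than silently aliasing MPI_COMM_WORLD.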

When I get back on Monday, I'll try out some other compilers as well as try
different compiler options (e.g., -g -O0, say).

Re: [OMPI users] mpi_f08 Question: set comm on declaration error, and other questions

2016-08-19 Thread Matt Thompson
On Fri, Aug 19, 2016 at 2:55 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> On Aug 19, 2016, at 2:30 PM, Matt Thompson <fort...@gmail.com> wrote:
> >
> > I'm slowly trying to learn and transition to 'use mpi_f08'. So, I'm
> writing various things and I noticed that this triggers an error:
> >
> > program hello_world
> >use mpi_f08
> >implicit none
> >type(MPI_Comm) :: comm = MPI_COMM_NULL
> > end program hello_world
> >
> > when compiled (Open MPI 2.0.0 with GCC 6.1):
> >
> > (380) $ mpifort test1.F90
> > test1.F90:7:27:
> >
> > type(MPI_Comm) :: comm = MPI_COMM_NULL
> >1
> > Error: Parameter ‘mpi_comm_null’ at (1) has not been declared or is a
> variable, which does not reduce to a constant expression
> >
> > Why is that? Obviously, I can just do:
> >
> >type(MPI_Comm) :: comm
> >comm = MPI_COMM_NULL
> >
> > and that works just fine (note MPI_COMM_NULL doesn't seem to be special
> as MPI_COMM_WORLD triggers the same error).
>
> I am *not* a Fortran expert, but I believe the difference between the two
> is:
>
> 1. The first one is a compile-time assignment.  And you can only do those
> with constants.  MPI_COMM_NULL is not a compile-time constant, hence, you
> get an error.
>
> 2. The second one is a run-time assignment.  You can do that between any
> compatible entities, and so that works.
>

Okay. This makes sense. I guess I was surprised that MPI_COMM_NULL wasn't a
constant (or parameter, I guess). But maybe a type() cannot be constant...
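
Thinking about it a bit more: a derived type certainly can be a named
constant in Fortran; the catch is just that an initialization in a
declaration must be a constant expression, and the mpi_f08 handles (at least
in this Open MPI build) evidently aren't declared as PARAMETERs. A tiny
non-MPI sketch of the language rule, with made-up names:

program init_rule
   implicit none
   type :: handle
      integer :: val
   end type handle
   type(handle), parameter :: NULL_HANDLE = handle(-1)   ! named constant
   type(handle) :: not_a_constant                        ! ordinary variable
   type(handle) :: a = NULL_HANDLE   ! fine: initializer is a constant expression
!  type(handle) :: b = not_a_constant   ! rejected, same complaint as "= MPI_COMM_NULL"

   not_a_constant = NULL_HANDLE
   print *, a%val, not_a_constant%val
end program init_rule

So the compiler message really just says the predefined handles are
variables in this implementation, not that derived types can't be constants.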


>
> > I'm just wondering why the first doesn't work, for my own edification. I
> tried reading through the Standard, but my eyes started watering after a
> bit (though that might have been the neon green cover). Is it related to
> the fact that when one declares:
> >
> >type(MPI_Comm) :: comm
> >
> > that the comm == MPI_COMM_WORLD evaluates to .TRUE.? I discovered that
> once when I was printing some stuff.
>
> That might well be a coincidence.  type(MPI_Comm) is not a boolean type,
> so I'm not sure how you compared it to .true.


Well, I made a program like:

(208) $ cat test2.F90
program whoami
   use mpi_f08
   implicit none
   type(MPI_Comm) :: comm
   if (comm == MPI_COMM_WORLD) write (*,*) "I am MPI_COMM_WORLD"
   if (comm == MPI_COMM_NULL) write (*,*) "I am MPI_COMM_NULL"
end program whoami
(209) $ mpifort test2.F90
(210) $ mpirun -np 4 ./a.out
 I am MPI_COMM_WORLD
 I am MPI_COMM_WORLD
 I am MPI_COMM_WORLD
 I am MPI_COMM_WORLD

I think if you print comm, you get 0 and MPI_COMM_WORLD=0 and
MPI_COMM_NULL=2 so...I guess I'm surprised. I'd have thought MPI_Comm would
have been undefined until defined. Instead you can write a program like
this:

(226) $ cat helloWorld.mpi3.F90
program hello_world

   use mpi_f08

   implicit none

   type(MPI_Comm) :: comm
   integer :: myid, npes, ierror
   integer :: name_length

   character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name

   call mpi_init(ierror)

   call MPI_Comm_Rank(comm,myid,ierror)
   write (*,*) 'ierror: ', ierror
   call MPI_Comm_Size(comm,npes,ierror)
   call MPI_Get_Processor_Name(processor_name,name_length,ierror)

   write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is
on", trim(processor_name)

   call MPI_Finalize(ierror)

end program hello_world
(227) $ mpifort helloWorld.mpi3.F90
(228) $ mpirun -np 4 ./a.out
 ierror:0
 ierror:0
 ierror:0
 ierror:0
Process2 of4 is on compy
Process1 of4 is on compy
Process3 of4 is on compy
Process0 of4 is on compy

This seems odd to me. I haven't passed in MPI_COMM_WORLD as the
communicator to MPI_Comm_Rank, and yet, it worked and the error code was 0
(which I'd take as success). Even if you couldn't detect this at compile
time, I'm surprised it doesn't trigger a run-time error.  Is this the
correct behavior according to the Standard?

Matt

-- 
Matt Thompson

[OMPI users] mpi_f08 Question: set comm on declaration error, and other questions

2016-08-19 Thread Matt Thompson
Oh great Open MPI Gurus,

I'm slowly trying to learn and transition to 'use mpi_f08'. So, I'm writing
various things and I noticed that this triggers an error:

program hello_world
   use mpi_f08
   implicit none
   type(MPI_Comm) :: comm = MPI_COMM_NULL
end program hello_world

when compiled (Open MPI 2.0.0 with GCC 6.1):

(380) $ mpifort test1.F90
test1.F90:7:27:

type(MPI_Comm) :: comm = MPI_COMM_NULL
   1
Error: Parameter ‘mpi_comm_null’ at (1) has not been declared or is a
variable, which does not reduce to a constant expression

Why is that? Obviously, I can just do:

   type(MPI_Comm) :: comm
   comm = MPI_COMM_NULL

and that works just fine (note MPI_COMM_NULL doesn't seem to be special as
MPI_COMM_WORLD triggers the same error).

I'm just wondering why the first doesn't work, for my own edification. I
tried reading through the Standard, but my eyes started watering after a
bit (though that might have been the neon green cover). Is it related to
the fact that when one declares:

   type(MPI_Comm) :: comm

that the comm == MPI_COMM_WORLD evaluates to .TRUE.? I discovered that once
when I was printing some stuff.

Thanks for helping me learn,
Matt

-- 
Matt Thompson

Re: [OMPI users] Error with Open MPI 2.0.0: error obtaining device attributes for mlx5_0 errno says Cannot allocate memory

2016-07-13 Thread Matt Thompson
On Wed, Jul 13, 2016 at 9:50 AM, Nathan Hjelm <hje...@me.com> wrote:

> As of 2.0.0 we now support experimental verbs. It looks like one of the
> calls is failing:
>
> #if HAVE_DECL_IBV_EXP_QUERY_DEVICE
> device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
> if(ibv_exp_query_device(device->ib_dev_context,
> &device->ib_exp_dev_attr)){
> BTL_ERROR(("error obtaining device attributes for %s errno says
> %s",
> ibv_get_device_name(device->ib_dev), strerror(errno)));
> goto error;
> }
> #endif
>
> Do you know what OFED or MOFED version you are running?
>

Per one of our gurus, answers from your IB page:

1. Which OpenFabrics version are you running? Please specify where you got
the software from (e.g., from the OpenFabrics community web site, from a
vendor, or it was already included in your Linux distribution).
   Mellanox OFED 3.1-1.0.3 (soon to be 3.3-1.0.0)

2. What distro and version of Linux are you running? What is your kernel
version?
   SLES11 SP3 (LTSS); 3.0.101-0.47.71-default (soon to be
3.0.101-0.47.79-default)

3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific
subnet manager, etc.)
   Mellanox UFM (OpenSM under the covers)

-- 
Matt Thompson

Man Among Men
Fulcrum of History


[OMPI users] Error with Open MPI 2.0.0: error obtaining device attributes for mlx5_0 errno says Cannot allocate memory

2016-07-13 Thread Matt Thompson
All,

I've been struggling here at NASA Goddard trying to get PGI 16.5 + Open MPI
1.10.3 working on the Discover cluster. What was happening was I'd run our
climate model at, say, 4x24 and it would work sometimes. Most of the time.
Every once in a while, it'd throw a segfault. If we changed the layout or
number of processors, more (and sometimes different) segfaults are triggered.

As we could build with PGI 15.7 + Open MPI 1.10.3 (where Open MPI is built
exactly the same) and run perfectly, I was focusing on the Open MPI build.
I tried compiling it at -O3, -O, -O0, all sorts of things and was about to
throw in the towel as all failed.

But, I saw Open MPI 2.0.0 was out and figured, may as well try the latest
before reporting to the mailing list. I built it and, huzzah!, it works!
I'm happy! Except that every time I execute 'mpirun' I get odd errors:

(1034) $ mpirun -np 4 ./helloWorld.mpi2.exe
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgr074
  Local device: mlx5_0
--
[borgr074][[35244,1],1][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],3][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],0][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],2][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
MPI Version: 3.1
MPI Library Version: Open MPI v2.0.0, package: Open MPI mathomp4@borg01z239
Distribution, ident: 2.0.0, repo rev: v2.x-dev-1570-g0a4a5d7, Jul 12, 2016
Process0 of4 is on borgr074
Process3 of4 is on borgr074
Process1 of4 is on borgr074
Process2 of4 is on borgr074
[borgr074:29032] 3 more processes have sent help message
help-mpi-btl-openib.txt / error in device init
[borgr074:29032] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages

If I run with --mca btl_base_verbose 1 and use more than one node, I see
that the openib/verbs (still not sure what to call this) btl isn't being
used, but rather tcp:

[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node
borgr074
[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node
borgr074

which makes sense since it can't find an Infiniband device.

My first thought is that the build/configure procedure of the past doesn't
quite jibe with what Open MPI 2.0.0 is expecting? I build Open MPI as:

export CC=pgcc
export CXX=pgc++
export FC=pgfortran

export CFLAGS="-fpic -m64"
export CXXFLAGS="-fpic -m64"
export FCFLAGS="-m64 -fpic"
export PREFIX=/discover/swdev/mathomp4/MPI/openmpi/2.0.0/pgi-16.5-k40

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/slurm/lib64
export LDFLAGS="-L/usr/slurm/lib64"
export CPPFLAGS="-I/usr/slurm/include"

export LIBS="-lpciaccess"

build() {
  echo `pwd`
  ./configure --with-slurm --disable-wrapper-rpath --enable-shared
--prefix=${PREFIX}
  make -j8
  make install
}

echo "calling build"
build
echo "exiting"

This is a build script built over time; it might have things unnecessary
for an Open MPI 2.0 build, but perhaps now it needs more info? I can say
that in the past (say with 1.10.3) it definitely found the openib/verbs btl
and used it!

Per the website, I'm attaching links to my config.log and "ompi_info --all"
information:

https://dl.dropboxusercontent.com/u/61696/Open%20MPI/config.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/build.pgi16.5.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/ompi_info.txt.gz

I tried to run "ompi_info -v ompi full --parsable" as asked but that
doesn't seem possible anymore:

(1053) $ ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
Type 'ompi_info --help' for usage.

I am asking our machine gurus about the Infiniband network per:
https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
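
In the meantime, one purely diagnostic experiment (it gives up InfiniBand
entirely, so it only helps isolate the problem): exclude the openib BTL so
the experimental-verbs device query is never attempted, and see whether the
warnings disappear while the program still runs:

mpirun --mca btl ^openib -np 4 ./helloWorld.mpi2.exe

If that comes up clean, the issue is isolated to the ibv_exp_query_device()
path rather than anything else in the 2.0.0 build.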
-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Issues Building Open MPI static with Intel Fortran 16

2016-01-22 Thread Matt Thompson
Howard,

Welp. That worked! I'm assuming oshmem = OpenSHMEM, right? If so, yeah, for
now, not important on my wee workstation. (If it isn't, is it something I
should work on getting to work?)
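
For anyone who trips over the same '_end' link error later: that should mean
the working recipe is just the static configure line from my original post
below with Howard's flag added, i.e. roughly

./configure --disable-shared --enable-static --disable-oshmem --disable-wrapper-rpath \
    CC=gcc CXX=g++ FC=ifort \
    CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
    --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static

(paths obviously specific to my workstation).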

Matt

On Fri, Jan 22, 2016 at 2:47 PM, Howard Pritchard <hpprit...@gmail.com>
wrote:

> HI Matt,
>
> If you don't need oshmem, you could try again with --disable-oshmem added
> to the config line
>
> Howard
>
>
> 2016-01-22 12:15 GMT-07:00 Matt Thompson <fort...@gmail.com>:
>
>> All,
>>
>> I'm trying to duplicate an issue I had with ESMF long ago (not sure if I
>> reported it here or at ESMF, but...). It had been a while, so I started
>> from scratch. I first built Open MPI 1.10.2 with Intel Fortran 16.0.0.109
>> and my system GCC (4.8.5 from RHEL7) with mostly defaults:
>>
>> # ./configure --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \
>> #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
>> #
>> --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-shared
>> | & tee configure.intel16.0.0.109-shared.log
>>
>> This built and checked just fine. Huzzah! And, indeed, it died in ESMF
>> during a link in an odd way (ESMF is looking at it).
>>
>> As a thought, I decided to see if building Open MPI statically might help
>> or not. So, I tried to build Open MPI with:
>>
>> # ./configure --disable-shared --enable-static --disable-wrapper-rpath
>> CC=gcc CXX=g++ FC=ifort \
>> #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
>> #
>> --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static
>> | & tee configure.intel16.0.0.109-static.log
>>
>> I just added --disable-shared --enable-static being lazy. But, when I do
>> this, I get this (when built with make V=1):
>>
>> Making all in tools/oshmem_info
>> make[2]: Entering directory
>> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
>> /bin/sh ../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -O3
>> -DNDEBUG -fPIC -m64 -finline-functions -fno-strict-aliasing -pthread   -o
>> oshmem_info oshmem_info.o param.o ../../../ompi/libmpi.la
>> ../../../oshmem/liboshmem.la ../../../orte/libopen-rte.la ../../../opal/
>> libopen-pal.la -lrt -lm -lutil
>> libtool: link: gcc -std=gnu99 -O3 -DNDEBUG -fPIC -m64 -finline-functions
>> -fno-strict-aliasing -pthread -o oshmem_info oshmem_info.o param.o
>>  ../../../ompi/.libs/libmpi.a ../../../oshmem/.libs/liboshmem.a
>> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/ompi/.libs/libmpi.a
>> -libverbs
>> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/orte/.libs/libopen-rte.a
>> ../../../orte/.libs/libopen-rte.a
>> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/opal/.libs/libopen-pal.a
>> ../../../opal/.libs/libopen-pal.a -lnuma -ldl -lrt -lm -lutil -pthread
>> /usr/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o):
>> undefined reference to symbol '_end'
>> /usr/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200
>> so try adding it to the linker command line
>> /lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
>> collect2: error: ld returned 1 exit status
>> make[2]: *** [oshmem_info] Error 1
>> make[2]: Leaving directory
>> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem'
>> make: *** [all-recursive] Error 1
>>
>> So, what did I do wrong? Or is there something I need to add to the
>> configure line? I have built static versions of Open MPI in the past (say
>> 1.8.7 era with Intel Fortran 15), but this is a new OS (RHEL 7 instead of
>> 6) so I can see issues possible.
>>
>> Anyone seen this before? As I said, the "usual" build way is just fine.
>> Perhaps I need an extra RPM that isn't installed? I do have libnl-devel
>> installed.
>>
>> --
>> Matt Thompson
>>
>> Man Among Men
>> Fulcrum of History
>>
>>
>>
>
>
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History


[OMPI users] Issues Building Open MPI static with Intel Fortran 16

2016-01-22 Thread Matt Thompson
All,

I'm trying to duplicate an issue I had with ESMF long ago (not sure if I
reported it here or at ESMF, but...). It had been a while, so I started
from scratch. I first built Open MPI 1.10.2 with Intel Fortran 16.0.0.109
and my system GCC (4.8.5 from RHEL7) with mostly defaults:

# ./configure --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \
#CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
#
--prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-shared
| & tee configure.intel16.0.0.109-shared.log

This built and checked just fine. Huzzah! And, indeed, it died in ESMF
during a link in an odd way (ESMF is looking at it).

As a thought, I decided to see if building Open MPI statically might help
or not. So, I tried to build Open MPI with:

# ./configure --disable-shared --enable-static --disable-wrapper-rpath
CC=gcc CXX=g++ FC=ifort \
#CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
#
--prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static
| & tee configure.intel16.0.0.109-static.log

I just added --disable-shared --enable-static being lazy. But, when I do
this, I get this (when built with make V=1):

Making all in tools/oshmem_info
make[2]: Entering directory
`/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
/bin/sh ../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -O3
-DNDEBUG -fPIC -m64 -finline-functions -fno-strict-aliasing -pthread   -o
oshmem_info oshmem_info.o param.o ../../../ompi/libmpi.la ../../../oshmem/
liboshmem.la ../../../orte/libopen-rte.la ../../../opal/libopen-pal.la -lrt
-lm -lutil
libtool: link: gcc -std=gnu99 -O3 -DNDEBUG -fPIC -m64 -finline-functions
-fno-strict-aliasing -pthread -o oshmem_info oshmem_info.o param.o
 ../../../ompi/.libs/libmpi.a ../../../oshmem/.libs/liboshmem.a
/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/ompi/.libs/libmpi.a
-libverbs
/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/orte/.libs/libopen-rte.a
../../../orte/.libs/libopen-rte.a
/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/opal/.libs/libopen-pal.a
../../../opal/.libs/libopen-pal.a -lnuma -ldl -lrt -lm -lutil -pthread
/usr/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o):
undefined reference to symbol '_end'
/usr/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200 so
try adding it to the linker command line
/lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make[2]: *** [oshmem_info] Error 1
make[2]: Leaving directory
`/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem'
make: *** [all-recursive] Error 1

So, what did I do wrong? Or is there something I need to add to the
configure line? I have built static versions of Open MPI in the past (say
1.8.7 era with Intel Fortran 15), but this is a new OS (RHEL 7 instead of
6) so I can see issues possible.

Anyone seen this before? As I said, the "usual" build way is just fine.
Perhaps I need an extra RPM that isn't installed? I do have libnl-devel
installed.

-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] MPI, Fortran, and GET_ENVIRONMENT_VARIABLE

2016-01-15 Thread Matt Thompson
Ralph,

Sounds good. I'll keep my eyes out. I figured it probably wasn't possible.
Of course, it's simple enough to run a script ahead of time that can build
a table that could be read in-program. I was just hoping perhaps I could do
it in one-step instead of two!
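
For what it's worth, the per-node half of this doesn't need environment
variables at all: MPI-3's MPI_Comm_split_type can build that communicator
directly. A minimal sketch (untested on the clusters above):

program node_comm_demo
   use mpi_f08
   implicit none
   type(MPI_Comm) :: node_comm
   integer :: world_rank, node_rank, ierror

   call MPI_Init(ierror)
   call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierror)
   ! Ranks that can share memory (i.e., are on the same node) end up together:
   call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                            MPI_INFO_NULL, node_comm, ierror)
   call MPI_Comm_rank(node_comm, node_rank, ierror)
   write (*,*) 'world rank', world_rank, 'is node-local rank', node_rank
   call MPI_Finalize(ierror)
end program node_comm_demo

The per-switch grouping is the part that still needs SLURM_TOPOLOGY_ADDR (or
the query Ralph mentions) as the color for a plain MPI_Comm_split.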

And, well, I'm slowly learning that whatever I knew about switches in an
Ethernet way means nothing in an Infiniband situation!

On Fri, Jan 15, 2016 at 11:27 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Yes, we don’t propagate envars ourselves other than MCA params. You can
> ask mpirun to forward specific envars to every proc, but that would only
> push the same value to everyone, and that doesn’t sound like what you are
> looking for.
>
> FWIW: we are working on adding the ability to directly query the info you
> are seeking - i.e., to ask for things like “which procs are on the same
> switch as me?”. Hoping to have it later this year, perhaps in the summer.
>
>
> On Jan 15, 2016, at 7:56 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> That doesn't help:
>
> (1004) $ mpirun -map-by node -np 8 ./hostenv.x | sort -g -k2
> Process0 of8 is on host borgo086
> Process0 of8 is on processor borgo086
> Process1 of8 is on host borgo086
> Process1 of8 is on processor borgo140
> Process2 of8 is on host borgo086
> Process2 of8 is on processor borgo086
> Process3 of8 is on host borgo086
> Process3 of8 is on processor borgo140
> Process4 of8 is on host borgo086
> Process4 of8 is on processor borgo086
> Process5 of8 is on host borgo086
> Process5 of8 is on processor borgo140
> Process6 of8 is on host borgo086
> Process6 of8 is on processor borgo086
> Process7 of8 is on host borgo086
> Process7 of8 is on processor borgo140
>
> But it was doing the right thing before. It saw my SLURM_* bits and
> correctly put 4 processes on the first node and 4 on the second (see the
> processor line which is from MPI, not the environment), and I only asked
> for 4 tasks per node:
>
> SLURM_NODELIST=borgo[086,140]
> SLURM_NTASKS_PER_NODE=4
> SLURM_NNODES=2
> SLURM_NTASKS=8
> SLURM_TASKS_PER_NODE=4(x2)
>
> My guess is no MPI stack wants to propagate an environment variable to
> every process. I'm picturing a 1000 node/28000 core job...and poor Open
> MPI (or MPT or Intel MPI) would have to marshal 28000xN environment
> variables around and keep track of who gets what...
>
> Matt
>
>
> On Fri, Jan 15, 2016 at 10:48 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Actually, the explanation is much simpler. You probably have more than 8
>> slots on borgj020, and so your job is simply small enough that we put it
>> all on one host. If you want to force the job to use both hosts, add
>> “-map-by node” to your cmd line
>>
>>
>> On Jan 15, 2016, at 7:02 AM, Jim Edwards <jedwa...@ucar.edu> wrote:
>>
>>
>>
>> On Fri, Jan 15, 2016 at 7:53 AM, Matt Thompson <fort...@gmail.com> wrote:
>>
>>> All,
>>>
>>> I'm not too sure if this is an MPI issue, a Fortran issue, or something
>>> else but I thought I'd ask the MPI gurus here first since my web search
>>> failed me.
>>>
>>> There is a chance in the future I might want/need to query an
>>> environment variable in a Fortran program, namely to figure out what switch
>>> a currently running process is on (via SLURM_TOPOLOGY_ADDR in my case) and
>>> perhaps make a "per-switch" communicator.[1]
>>>
>>> So, I coded up a boring Fortran program whose only exciting lines are:
>>>
>>>call MPI_Get_Processor_Name(processor_name,name_length,ierror)
>>>call get_environment_variable("HOST",host_name)
>>>
>>>write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes,
>>> "is on processor", trim(processor_name)
>>>write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes,
>>> "is on host", trim(host_name)
>>>
>>> I decided to try out with the HOST environment variable first because
>>> it is simple and different per node (I didn't want to take many, many nodes
>>> to find the point when a switch is traversed). I then grabbed two nodes
>>> with 4 processes per node and...:
>>>
>>> (1046) $ echo "$SLURM_NODELIST"
>>> borgj[020,036]
>>> (1047) $ pdsh -w "$SLURM_NODELIST" echo '$HOST'
>>> borgj036: borgj036
>>> borgj020: borgj020
>>> (1048) $

Re: [OMPI users] MPI, Fortran, and GET_ENVIRONMENT_VARIABLE

2016-01-15 Thread Matt Thompson
Ralph,

That doesn't help:

(1004) $ mpirun -map-by node -np 8 ./hostenv.x | sort -g -k2
Process0 of8 is on host borgo086
Process0 of8 is on processor borgo086
Process1 of8 is on host borgo086
Process1 of8 is on processor borgo140
Process2 of8 is on host borgo086
Process2 of8 is on processor borgo086
Process3 of8 is on host borgo086
Process3 of8 is on processor borgo140
Process4 of8 is on host borgo086
Process4 of8 is on processor borgo086
Process5 of8 is on host borgo086
Process5 of8 is on processor borgo140
Process6 of8 is on host borgo086
Process6 of8 is on processor borgo086
Process7 of8 is on host borgo086
Process7 of8 is on processor borgo140

But it was doing the right thing before. It saw my SLURM_* bits and
correctly put 4 processes on the first node and 4 on the second (see the
processor line which is from MPI, not the environment), and I only asked
for 4 tasks per node:

SLURM_NODELIST=borgo[086,140]
SLURM_NTASKS_PER_NODE=4
SLURM_NNODES=2
SLURM_NTASKS=8
SLURM_TASKS_PER_NODE=4(x2)

My guess is no MPI stack wants to propagate an environment variable to
every process. I'm picturing a 1000 node/28000 core job...and poor Open
MPI (or MPT or Intel MPI) would have to marshal 28000xN environment
variables around and keep track of who gets what...

Matt


On Fri, Jan 15, 2016 at 10:48 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Actually, the explanation is much simpler. You probably have more than 8
> slots on borgj020, and so your job is simply small enough that we put it
> all on one host. If you want to force the job to use both hosts, add
> “-map-by node” to your cmd line
>
>
> On Jan 15, 2016, at 7:02 AM, Jim Edwards <jedwa...@ucar.edu> wrote:
>
>
>
> On Fri, Jan 15, 2016 at 7:53 AM, Matt Thompson <fort...@gmail.com> wrote:
>
>> All,
>>
>> I'm not too sure if this is an MPI issue, a Fortran issue, or something
>> else but I thought I'd ask the MPI gurus here first since my web search
>> failed me.
>>
>> There is a chance in the future I might want/need to query an environment
>> variable in a Fortran program, namely to figure out what switch a currently
>> running process is on (via SLURM_TOPOLOGY_ADDR in my case) and perhaps make
>> a "per-switch" communicator.[1]
>>
>> So, I coded up a boring Fortran program whose only exciting lines are:
>>
>>call MPI_Get_Processor_Name(processor_name,name_length,ierror)
>>call get_environment_variable("HOST",host_name)
>>
>>write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is
>> on processor", trim(processor_name)
>>write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is
>> on host", trim(host_name)
>>
>> I decided to try out with the HOST environment variable first because it
>> is simple and different per node (I didn't want to take many, many nodes to
>> find the point when a switch is traversed). I then grabbed two nodes with 4
>> processes per node and...:
>>
>> (1046) $ echo "$SLURM_NODELIST"
>> borgj[020,036]
>> (1047) $ pdsh -w "$SLURM_NODELIST" echo '$HOST'
>> borgj036: borgj036
>> borgj020: borgj020
>> (1048) $ mpifort -o hostenv.x hostenv.F90
>> (1049) $ mpirun -np 8 ./hostenv.x | sort -g -k2
>> Process0 of8 is on host borgj020
>> Process0 of8 is on processor borgj020
>> Process1 of8 is on host borgj020
>> Process1 of8 is on processor borgj020
>> Process2 of8 is on host borgj020
>> Process2 of8 is on processor borgj020
>> Process3 of8 is on host borgj020
>> Process3 of8 is on processor borgj020
>> Process4 of8 is on host borgj020
>> Process4 of8 is on processor borgj036
>> Process5 of8 is on host borgj020
>> Process5 of8 is on processor borgj036
>> Process6 of8 is on host borgj020
>> Process6 of8 is on processor borgj036
>> Process7 of8 is on host borgj020
>> Process7 of8 is on processor borgj036
>>
>> It looks like MPI_Get_Processor_Name is doing its thing, but the HOST one
>> seems to only be reflecting the first host. My guess is that OpenMPI
>> doesn't export every process's environment separately to every process so
>> it is reflecting HOST from process 0.
>>
>>
>
> ​I would guess that what is actually happening is that slurm is exporting
> all of the variables from the host node including the $HOST variable and
> overwriting the def

[OMPI users] MPI, Fortran, and GET_ENVIRONMENT_VARIABLE

2016-01-15 Thread Matt Thompson
All,

I'm not too sure if this is an MPI issue, a Fortran issue, or something
else but I thought I'd ask the MPI gurus here first since my web search
failed me.

There is a chance in the future I might want/need to query an environment
variable in a Fortran program, namely to figure out what switch a currently
running process is on (via SLURM_TOPOLOGY_ADDR in my case) and perhaps make
a "per-switch" communicator.[1]

So, I coded up a boring Fortran program whose only exciting lines are:

   call MPI_Get_Processor_Name(processor_name,name_length,ierror)
   call get_environment_variable("HOST",host_name)

   write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is
on processor", trim(processor_name)
   write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is
on host", trim(host_name)

I decided to try out with the HOST environment variable first because it is
simple and different per node (I didn't want to take many, many nodes to
find the point when a switch is traversed). I then grabbed two nodes with 4
processes per node and...:

(1046) $ echo "$SLURM_NODELIST"
borgj[020,036]
(1047) $ pdsh -w "$SLURM_NODELIST" echo '$HOST'
borgj036: borgj036
borgj020: borgj020
(1048) $ mpifort -o hostenv.x hostenv.F90
(1049) $ mpirun -np 8 ./hostenv.x | sort -g -k2
Process0 of8 is on host borgj020
Process0 of8 is on processor borgj020
Process1 of8 is on host borgj020
Process1 of8 is on processor borgj020
Process2 of8 is on host borgj020
Process2 of8 is on processor borgj020
Process3 of8 is on host borgj020
Process3 of8 is on processor borgj020
Process4 of8 is on host borgj020
Process4 of8 is on processor borgj036
Process5 of8 is on host borgj020
Process5 of8 is on processor borgj036
Process6 of8 is on host borgj020
Process6 of8 is on processor borgj036
Process7 of8 is on host borgj020
Process7 of8 is on processor borgj036

It looks like MPI_Get_Processor_Name is doing its thing, but the HOST one
seems to only be reflecting the first host. My guess is that OpenMPI
doesn't export every process's environment separately to every process so
it is reflecting HOST from process 0.

So, I guess my question is: can this be done? Is there an option to Open
MPI that might do it? Or is this just something MPI doesn't do? Or is my
Google-fu just too weak to figure out the right search-phrase to find the
answer to this probable FAQ?

Matt

[1] Note, this might be unnecessary, but I got to the point where I wanted
to see if I *could* do it, rather than *should*.

-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question

2016-01-06 Thread Matt Thompson
On Wed, Jan 6, 2016 at 7:20 PM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> FWIW,
>
> there has been one attempt to set the OMP_* environment variables within
> OpenMPI, and that was aborted
> because that caused crashes with a prominent commercial compiler.
>
> also, i'd like to clarify that OpenMPI does bind MPI tasks (e.g.
> processes), and it is up to the OpenMP runtime to bind the OpenMP threads
> to the resources made available by OpenMPI to the MPI task.
>
> in this case, that means OpenMPI will bind a MPI tasks to 7 cores (for
> example cores 7 to 13), and it is up to the OpenMP runtime to bind each 7
> OpenMP threads to one core previously allocated by OpenMPI
> (for example, OMP thread 0 to core 7, OMP thread 1 to core 8, ...)
>

Indeed. Hybrid programming is a two-step tango. The harder task (in some
ways) is the placing MPI processes where I want. With omplace I could just
force things (though probably not with Open MPI...haven't tried it yet),
but I'd rather have a more "formulaic" way to place processes since then
you can script it. Now that I know about the ppr: syntax, I can see it'll
be quite useful!

The other task is to get the OpenMP threads in the "right way". I was
pretty sure KMP_AFFINITY=compact was correct (worked once...and, yeah,
using Intel at present. Figured start there, then expand to figure out GCC
and PGI). I'll do some experimenting with the OMP_* versions as a
more-respected standard is always a good thing.
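
Concretely, something along these lines ought to do it (untested as typed
here, and the exact spelling of the pe= modifier may differ a bit between
Open MPI versions, so treat it as a sketch):

env OMP_NUM_THREADS=7 OMP_PLACES=cores OMP_PROC_BIND=close \
    mpirun -np 4 --map-by ppr:2:socket:pe=7 ./hello-hybrid.x

The idea is that ppr:2:socket:pe=7 hands each of the 4 ranks its own block
of 7 cores, and OMP_PLACES/OMP_PROC_BIND (the OpenMP 4.0 analogue of
KMP_AFFINITY=compact) then pins one thread per core inside that block.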

For others with inquiries into this, I highly recommend this page I found
after my query was answered here:

https://www.olcf.ornl.gov/kb_articles/parallel-job-execution-on-commodity-clusters/

At this point, I'm thinking I should start up an MPI+OpenMP wiki to map all
the combinations of compiler+mpistack.

Or pray the MPI Forum and OpenMP combine and I can just look in a Standard.
:D

Thanks,
Matt
-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question

2016-01-06 Thread Matt Thompson
On Wed, Jan 6, 2016 at 2:48 PM, Erik Schnetter <schnet...@gmail.com> wrote:

> Setting KMP_AFFINITY will probably override anything that OpenMPI
> sets. Can you try without?
>
> -erik
>
> On Wed, Jan 6, 2016 at 2:46 PM, Matt Thompson <fort...@gmail.com> wrote:
> > Hello Open MPI Gurus,
> >
> > As I explore MPI-OpenMP hybrid codes, I'm trying to figure out how to do
> > things to get the same behavior in various stacks. For example, I have a
> > 28-core node (2 14-core Haswells), and I'd like to run 4 MPI processes
> and 7
> > OpenMP threads. Thus, I'd like the processes to be 2 processes per socket
> > with the OpenMP threads laid out on them. Using a "hybrid Hello World"
> > program, I can achieve this with Intel MPI (after a lot of testing):
> >
> > (1097) $ env OMP_NUM_THREADS=7 KMP_AFFINITY=compact mpirun -np 4
> > ./hello-hybrid.x | sort -g -k 18
> > srun.slurm: cluster configuration lacks support for cpu binding
> > Hello from thread 0 out of 7 from process 2 out of 4 on borgo035 on CPU 0
> > Hello from thread 1 out of 7 from process 2 out of 4 on borgo035 on CPU 1
> > Hello from thread 2 out of 7 from process 2 out of 4 on borgo035 on CPU 2
> > Hello from thread 3 out of 7 from process 2 out of 4 on borgo035 on CPU 3
> > Hello from thread 4 out of 7 from process 2 out of 4 on borgo035 on CPU 4
> > Hello from thread 5 out of 7 from process 2 out of 4 on borgo035 on CPU 5
> > Hello from thread 6 out of 7 from process 2 out of 4 on borgo035 on CPU 6
> > Hello from thread 0 out of 7 from process 3 out of 4 on borgo035 on CPU 7
> > Hello from thread 1 out of 7 from process 3 out of 4 on borgo035 on CPU 8
> > Hello from thread 2 out of 7 from process 3 out of 4 on borgo035 on CPU 9
> > Hello from thread 3 out of 7 from process 3 out of 4 on borgo035 on CPU
> 10
> > Hello from thread 4 out of 7 from process 3 out of 4 on borgo035 on CPU
> 11
> > Hello from thread 5 out of 7 from process 3 out of 4 on borgo035 on CPU
> 12
> > Hello from thread 6 out of 7 from process 3 out of 4 on borgo035 on CPU
> 13
> > Hello from thread 0 out of 7 from process 0 out of 4 on borgo035 on CPU
> 14
> > Hello from thread 1 out of 7 from process 0 out of 4 on borgo035 on CPU
> 15
> > Hello from thread 2 out of 7 from process 0 out of 4 on borgo035 on CPU
> 16
> > Hello from thread 3 out of 7 from process 0 out of 4 on borgo035 on CPU
> 17
> > Hello from thread 4 out of 7 from process 0 out of 4 on borgo035 on CPU
> 18
> > Hello from thread 5 out of 7 from process 0 out of 4 on borgo035 on CPU
> 19
> > Hello from thread 6 out of 7 from process 0 out of 4 on borgo035 on CPU
> 20
> > Hello from thread 0 out of 7 from process 1 out of 4 on borgo035 on CPU
> 21
> > Hello from thread 1 out of 7 from process 1 out of 4 on borgo035 on CPU
> 22
> > Hello from thread 2 out of 7 from process 1 out of 4 on borgo035 on CPU
> 23
> > Hello from thread 3 out of 7 from process 1 out of 4 on borgo035 on CPU
> 24
> > Hello from thread 4 out of 7 from process 1 out of 4 on borgo035 on CPU
> 25
> > Hello from thread 5 out of 7 from process 1 out of 4 on borgo035 on CPU
> 26
> > Hello from thread 6 out of 7 from process 1 out of 4 on borgo035 on CPU
> 27
> >
> > Other than the odd fact that Process #0 seemed to start on Socket #1
> (this
> > might be an artifact of how I'm trying to detect the CPU I'm on), this
> looks
> > reasonable. 14 threads on each socket and each process is laying out its
> > threads in a nice orderly fashion.
> >
> > I'm trying to figure out how to do this with Open MPI (version 1.10.0)
> and
> > apparently I am just not quite good enough to figure it out. The closest
> > I've gotten is:
> >
> > (1155) $ env OMP_NUM_THREADS=7 KMP_AFFINITY=compact mpirun -np 4 -map-by
> > ppr:2:socket ./hello-hybrid.x | sort -g -k 18
> > Hello from thread 0 out of 7 from process 0 out of 4 on borgo035 on CPU 0
> > Hello from thread 0 out of 7 from process 1 out of 4 on borgo035 on CPU 0
> > Hello from thread 1 out of 7 from process 0 out of 4 on borgo035 on CPU 1
> > Hello from thread 1 out of 7 from process 1 out of 4 on borgo035 on CPU 1
> > Hello from thread 2 out of 7 from process 0 out of 4 on borgo035 on CPU 2
> > Hello from thread 2 out of 7 from process 1 out of 4 on borgo035 on CPU 2
> > Hello from thread 3 out of 7 from process 0 out of 4 on borgo035 on CPU 3
> > Hello from thread 3 out of 7 from process 1 out of 4 on borgo035 on CPU 3
> > Hello from thread 4 out of 7 from process 0 out of 4 on borgo035 on CPU 4
> > Hello from thread 4 out of 7 from process 1 out of 4 on borgo035

[OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question

2016-01-06 Thread Matt Thompson
4 out of 7 from process 3 out of 4 on borgo035 on CPU 18
Hello from thread 5 out of 7 from process 2 out of 4 on borgo035 on CPU 19
Hello from thread 5 out of 7 from process 3 out of 4 on borgo035 on CPU 19
Hello from thread 6 out of 7 from process 2 out of 4 on borgo035 on CPU 20
Hello from thread 6 out of 7 from process 3 out of 4 on borgo035 on CPU 20

Obviously not right. Any ideas on how to help me learn? The man mpirun page
is a bit formidable in the pinning part, so maybe I've missed an obvious
answer.

Matt
-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Matt Thompson
Ralph,

Huh. That isn't in the Open MPI 1.8.8 mpirun man page. It is in Open MPI
1.10, so I'm guessing someone noticed it wasn't there. Explains why I
didn't try it out. I'm assuming this option is respected on all nodes?

Note: a SmarterManThanI™ here at Goddard thought up this:

#!/bin/bash
rank=0
for node in $(srun uname -n | sort); do
echo "rank $rank=$node slots=1:*"
let rank+=1
done

It does seem to work in synthetic tests so I'm trying it now in my real
job. I had to hack a few run scripts so I'll probably spend the next hour
debugging something dumb I did.

What I'm wondering about all this is: can this be done with --slot-list?
Or, perhaps, does --slot-list even work?

I have tried about 20 different variations of it, e.g., --slot-list 1:*,
--slot-list '1:*', --slot-list 1:0,1,2,3,4,5,6,7, --slot-list
1:8,9,10,11,12,13,14,15, --slot-list 8-15, and every time I seem to
trigger an error via help-rmaps_rank_file.txt. I tried to read
through opal_hwloc_base_slot_list_parse in the source, but my C isn't great
(see my gmail address name) so that didn't help. Might not even be the
right function, but I was just acking the code.
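
For the archives, the concrete form of Ralph's --cpu-set suggestion would be
something like the following, assuming lstopo's numbering puts socket 1 at
cores 8-15 on every one of these nodes (worth double-checking per node):

mpirun -np 8 --cpu-set 8,9,10,11,12,13,14,15 --bind-to core -report-bindings ./helloWorld.exe

with -report-bindings left on so the socket maps above will show whether the
ranks really landed on the second socket.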

Thanks,
Matt


On Mon, Dec 21, 2015 at 10:51 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Try adding --cpu-set a,b,c,… where the a,b,c… are the core IDs of your
> second socket. I’m working on a cleaner option as this has come up before.
>
>
> On Dec 21, 2015, at 5:29 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Dear Open MPI Gurus,
>
> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty
> sure is possible, but I'm just not smart enough to figure out. Namely, I'm
> seeing some odd GPU timings and I think it's because I was dumb and assumed
> the GPU was on the PCI bus next to Socket #0 as some older GPU nodes I ran
> on were like that.
>
> But, a trip through lspci and lstopo has shown me that the GPU is actually
> on Socket #1. These are dual socket Sandy Bridge nodes and I'd like to do
> some tests where I run 8 processes per node and those processes all land
> on Socket #1.
>
> So, what I'm trying to figure out is how to have Open MPI bind processes
> like that. My first thought as always is to run a helloworld job with
> -report-bindings on. I can manage to do this:
>
> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
> [././././B/././.][./././././././.]
> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
> [./././././B/./.][./././././././.]
> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
> [././././././B/.][./././././././.]
> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
> [./././././././B][./././././././.]
> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.]
> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
> [./B/./././././.][./././././././.]
> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
> [././B/././././.][./././././././.]
> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
> [./././B/./././.][./././././././.]
> Process7 of8 is on borg01z205
> Process5 of8 is on borg01z205
> Process2 of8 is on borg01z205
> Process3 of8 is on borg01z205
> Process4 of8 is on borg01z205
> Process6 of8 is on borg01z205
> Process0 of8 is on borg01z205
> Process1 of8 is on borg01z205
>
> Great...but wrong socket! Is there a way to tell it to use Socket 1
> instead?
>
> Note I'll be running under SLURM, so I will only have 8 processes per
> node, so it shouldn't need to use Socket 0.
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
>
>
>
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History


[OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Matt Thompson
Dear Open MPI Gurus,

I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty
sure is possible, but I'm just not smart enough to figure out. Namely, I'm
seeing some odd GPU timings and I think it's because I was dumb and assumed
the GPU was on the PCI bus next to Socket #0 as some older GPU nodes I ran
on were like that.

But, a trip through lspci and lstopo has shown me that the GPU is actually
on Socket #1. These are dual socket Sandy Bridge nodes and I'd like to do
some tests where I run 8 processes per node and those processes all land
on Socket #1.

So, what I'm trying to figure out is how to have Open MPI bind processes
like that. My first thought as always is to run a helloworld job with
-report-bindings on. I can manage to do this:

(1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
[borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
[././././B/././.][./././././././.]
[borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
[./././././B/./.][./././././././.]
[borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
[././././././B/.][./././././././.]
[borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
[./././././././B][./././././././.]
[borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
[borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
[./B/./././././.][./././././././.]
[borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
[././B/././././.][./././././././.]
[borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
[./././B/./././.][./././././././.]
Process7 of8 is on borg01z205
Process5 of8 is on borg01z205
Process2 of8 is on borg01z205
Process3 of8 is on borg01z205
Process4 of8 is on borg01z205
Process6 of8 is on borg01z205
Process0 of8 is on borg01z205
Process1 of8 is on borg01z205

Great...but wrong socket! Is there a way to tell it to use Socket 1
instead?

Note I'll be running under SLURM, so I will only have 8 processes per node,
so it shouldn't need to use Socket 0.
-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Open MPI 1.10.0: Works on one Sandybridge Node, not on another: tcp_peer_send_blocking

2015-09-24 Thread Matt Thompson
On Thu, Sep 24, 2015 at 12:10 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Ah, sorry - wrong param. It’s the out-of-band that is having the problem.
> Try adding --mca oob_tcp_if_include 
>

Ooh. Okay. Look at this:

(13) $ mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x
Process 1 of 2 is on r509i2n17
Process 0 of 2 is on r509i2n17

So that is nice. Now the spin up if I have 8 or so nodes is rather...slow.
But at this point I'll take working over efficient. Quick startup can come
later.
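
To avoid typing that every time, the setting can also live in an MCA
parameters file; a sketch, assuming ib0 is the IPoIB interface on every node
(the btl_tcp line is the same idea applied to the data path and may not be
needed here):

# ~/.openmpi/mca-params.conf  (or <prefix>/etc/openmpi-mca-params.conf)
oob_tcp_if_include = ib0
btl_tcp_if_include = ib0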

Matt



>
>
> On Sep 24, 2015, at 8:56 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> I believe these nodes might have both an Ethernet and Infiniband port
> where the Ethernet port is not the one to use. Is there a way to tell Open
> MPI to ignore any ethernet devices it sees? I've tried:
>
> --mca btl sm,openib,self
>
> and (based on the advice of the much more intelligent support at NAS):
>
> --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1
>
> But neither worked.
>
> Matt
>
>
> On Thu, Sep 24, 2015 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Starting in the 1.7 series, OMPI by default launches daemons on all nodes
>> in the allocation during startup. This is done so we can “probe” the
>> topology of the nodes and use that info during the process mapping
>> procedure - e.g., if you want to map-by NUMA regions.
>>
>> What is happening here is that some of the nodes in your allocation
>> aren’t allowing those daemons to callback to mpirun. Either a firewall is
>> in the way, or something is preventing it.
>>
>> If you don’t want to launch on those other nodes, you could just add
>> --novm to your cmd line, or use the --host option to restrict us to your
>> local node. However, I imagine you got the bigger allocation so you could
>> use it :-)
>>
>> In which case, you need to remove the obstacle. You might check for
>> firewall, or check to see if multiple NICs are on the non-maia nodes (this
>> can sometimes confuse things, especially if someone put the NICs on the
>> same IP subnet)
>>
>> HTH
>> Ralph
>>
>>
>>
>> On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> Open MPI Users,
>>
>> I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7
>> using this configure string:
>>
>>  ./configure --disable-vt --with-tm=/PBS --with-verbs
>> --disable-wrapper-rpath \
>> CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
>> CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
>> --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee
>> configure.pgi15.7.log
>>
>> It seemed to pass 'make check'.
>>
>> I'm working at pleiades at NAS, and there they have both Sandy Bridge
>> nodes with GPUs (maia) and regular Sandy Bridge compute nodes (hereafter
>> called Sandy) without. To be extra careful (since PGI compiles to the
>> architecture you build on) I took a Westmere node and built Open MPI there
>> just in case.
>>
>> So, as I said, all seems to work with a test. I now grab a maia node,
>> maia1, of an allocation of 4 I had:
>>
>> (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>> (103) $ mpirun -np 2 ./helloWorld.x
>> Process 0 of 2 is on maia1
>> Process 1 of 2 is on maia1
>>
>> Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, of an
>> allocation of 8 I had:
>>
>> (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>> (50) $ mpirun -np 2 ./helloWorld.x
>> [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket
>> 9 failed: Broken pipe (32)
>> [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket
>> 9 failed: Broken pipe (32)
>> [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>>
>> Hmm. Let's try turning off tcp (often my first thought when on an
>> Infiniband system):
>>
>> (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
>> [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9

Re: [OMPI users] Open MPI 1.10.0: Works on one Sandybridge Node, not on another: tcp_peer_send_blocking

2015-09-24 Thread Matt Thompson
Ralph,

I believe these nodes might have both an Ethernet and Infiniband port where
the Ethernet port is not the one to use. Is there a way to tell Open MPI to
ignore any ethernet devices it sees? I've tried:

--mca btl sm,openib,self

and (based on the advice of the much more intelligent support at NAS):

--mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1

But neither worked.

Matt


On Thu, Sep 24, 2015 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Starting in the 1.7 series, OMPI by default launches daemons on all nodes
> in the allocation during startup. This is done so we can “probe” the
> topology of the nodes and use that info during the process mapping
> procedure - e.g., if you want to map-by NUMA regions.
>
> What is happening here is that some of the nodes in your allocation aren’t
> allowing those daemons to callback to mpirun. Either a firewall is in the
> way, or something is preventing it.
>
> If you don’t want to launch on those other nodes, you could just add --novm
> to your cmd line, or use the --host option to restrict us to your local
> node. However, I imagine you got the bigger allocation so you could use it
> :-)
>
> In which case, you need to remove the obstacle. You might check for
> firewall, or check to see if multiple NICs are on the non-maia nodes (this
> can sometimes confuse things, especially if someone put the NICs on the
> same IP subnet)
>
> HTH
> Ralph
>
>
>
> On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Open MPI Users,
>
> I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7
> using this configure string:
>
>  ./configure --disable-vt --with-tm=/PBS --with-verbs
> --disable-wrapper-rpath \
> CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
> CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
> --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee
> configure.pgi15.7.log
>
> It seemed to pass 'make check'.
>
> I'm working at pleiades at NAS, and there they have both Sandy Bridge
> nodes with GPUs (maia) and regular Sandy Bridge compute nodes (hereafter
> called Sandy) without. To be extra careful (since PGI compiles to the
> architecture you build on) I took a Westmere node and built Open MPI there
> just in case.
>
> So, as I said, all seems to work with a test. I now grab a maia node,
> maia1, of an allocation of 4 I had:
>
> (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
> (103) $ mpirun -np 2 ./helloWorld.x
> Process 0 of 2 is on maia1
> Process 1 of 2 is on maia1
>
> Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, of an
> allocation of 8 I had:
>
> (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
> (50) $ mpirun -np 2 ./helloWorld.x
> [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
>
> Hmm. Let's try turning off tcp (often my first thought when on an
> Infiniband system):
>
> (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
> [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
> [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9
> failed: Broken pipe (32)
>
> Now, the nodes reporting the issue seem to be the "other" nodes on the
> allocation that are in a different rack:
>
> (52) $ cat $PBS_NODEFILE | uniq
> r321i7n16
> r321i7n17
> r323i5n6
> r323i5n7
> r323i5n8
> r323i5n9
> r323i5n10
> r323i5n11
>
> Maybe that's a clue? I didn't think this would matter if I only ran two
> processes...and it works on the multi-node maia allocation.
>
> I've tried searching the web, but the only place I've seen
> tcp_peer_send_blocking is in a PDF whe

[OMPI users] Open MPI 1.10.0: Works on one Sandybridge Node, not on another: tcp_peer_send_blocking

2015-09-24 Thread Matt Thompson
Open MPI Users,

I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7
using this configure string:

 ./configure --disable-vt --with-tm=/PBS --with-verbs
--disable-wrapper-rpath \
CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
--prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee
configure.pgi15.7.log

It seemed to pass 'make check'.

I'm working at pleiades at NAS, and there they have both Sandy Bridge nodes
with GPUs (maia) and regular Sandy Bridge compute nodes (hereafter called
Sandy) without. To be extra careful (since PGI compiles to the architecture
you build on) I took a Westmere node and built Open MPI there just in case.

So, as I said, all seems to work with a test. I now grab a maia node,
maia1, of an allocation of 4 I had:

(102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
(103) $ mpirun -np 2 ./helloWorld.x
Process 0 of 2 is on maia1
Process 1 of 2 is on maia1

Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, of an
allocation of 8 I had:

(49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
(50) $ mpirun -np 2 ./helloWorld.x
[r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)

Hmm. Let's try turning off tcp (often my first thought when on an
Infiniband system):

(51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
[r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)
[r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9
failed: Broken pipe (32)

Now, the nodes reporting the issue seem to be the "other" nodes on the
allocation that are in a different rack:

(52) $ cat $PBS_NODEFILE | uniq
r321i7n16
r321i7n17
r323i5n6
r323i5n7
r323i5n8
r323i5n9
r323i5n10
r323i5n11

Maybe that's a clue? I didn't think this would matter if I only ran two
processes...and it works on the multi-node maia allocation.

I've tried searching the web, but the only place I've seen
tcp_peer_send_blocking is in a PDF where they say it's an error that can be
seen:

http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf

Any ideas for what this error can mean?
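
One thing I may try next is restricting the TCP traffic (both the
out-of-band wireup and the TCP BTL) to a single, known-shared interface.
This is a pure guess on my part, and "ib0" below is just a placeholder for
whatever interface name those nodes actually have in common:

  mpirun --mca oob_tcp_if_include ib0 --mca btl_tcp_if_include ib0 -np 2 ./helloWorld.x

in case the out-of-band setup is picking a network that the two racks
can't actually reach each other on.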

-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-15 Thread Matt Thompson
Looking at the Open MPI 1.10.0 man page:

  https://www.open-mpi.org/doc/v1.10/man1/mpirun.1.php

it looks like perhaps -oversubscribe (which was an option) is now the
default behavior. Instead we have:

*-nooversubscribe, --nooversubscribe*: Do not oversubscribe any nodes; error
(without starting any processes) if the requested number of processes would
cause oversubscription. This option implicitly sets "max_slots" equal to
the "slots" value for each node.

It also looks like -map-by has a way to implement it as well (see man page).
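
For example (I haven't tested these exact spellings, so treat them as a
sketch of what the man page describes rather than gospel):

  mpirun -np 8 --bind-to core --nooversubscribe my_application
  mpirun -np 8 --map-by core:NOOVERSUBSCRIBE --bind-to core my_application

should both make mpirun error out rather than stack two processes on a core.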

Thanks for letting me/us know about this. On a system of mine I sort of
depend on the -nooversubscribe behavior!

Matt



On Tue, Sep 15, 2015 at 11:17 AM, Patrick Begou <
patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi,
>
> I'm runing OpenMPI 1.10.0 built with Intel 2015 compilers on a Bullx
> System.
> I've some troubles with the bind-to core option when using cpuset.
> If the cpuset is less than all the cores of a cpu (ex: 4 cores allowed on
> an 8-core cpu), OpenMPI 1.10.0 allows these cores to be overloaded, up to the
> maximum number of cores of the cpu.
> With this config and because the cpuset only allows 4 cores, I can reach 2
> processes/core if I use:
>
> mpirun -np 8 --bind-to core my_application
>
> OpenMPI 1.7.3 doesn't show the problem with the same situation:
> mpirun -np 8 --bind-to-core my_application
> returns:
> A request was made to bind to that would result in binding more
> processes than cpus on a resource
> and that's okay of course.
>
>
> Is there a way to avoid this overloading with OpenMPI 1.10.0?
>
> Thanks
>
> Patrick
>
> --
> ===
> |  Equipe M.O.S.T. |  |
> |  Patrick BEGOU   | mailto:patrick.be...@grenoble-inp.fr 
> <patrick.be...@grenoble-inp.fr> |
> |  LEGI|  |
> |  BP 53 X | Tel 04 76 82 51 35   |
> |  38041 GRENOBLE CEDEX| Fax 04 76 82 52 71   |
> ===
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27575.php
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff,

Some limited testing shows that that srun does seem to work where the
quote-y one did not. I'm working with our admins now to make sure it lets
the prolog work as expected as well.

I'll keep you informed,
Matt


On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Try this (typed in editor, not tested!):
>
> #! /usr/bin/perl -w
>
> use strict;
> use warnings;
>
> use FindBin;
>
> # Specify the path to the prolog.
> my $prolog = '--task-prolog=/gpfsm//.task.prolog';
>
> # Build the path to the SLURM srun command.
> my $srun_slurm = "${FindBin::Bin}/srun.slurm";
>
> # Add the prolog option, but abort if the user specifies a prolog option.
> my @command = split(/ /, "$srun_slurm $prolog");
> foreach (@ARGV) {
>     if (/^--task-prolog=/) {
>         print("The --task-prolog option is unsupported at . Please " .
>               "contact the  for assistance.\n");
>         exit(1);
>     } else {
>         push(@command, $_);
>     }
> }
> system(@command);
>
>
>
> On Sep 4, 2014, at 1:21 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Jeff,
> >
> > Here is the script (with a bit of munging for safety's sake):
> >
> > #! /usr/bin/perl -w
> >
> > use strict;
> > use warnings;
> >
> > use FindBin;
> >
> > # Specify the path to the prolog.
> > my $prolog = '--task-prolog=/gpfsm//.task.prolog';
> >
> > # Build the path to the SLURM srun command.
> > my $srun_slurm = "${FindBin::Bin}/srun.slurm";
> >
> > # Add the prolog option, but abort if the user specifies a prolog option.
> > my $command = "$srun_slurm $prolog";
> > foreach (@ARGV) {
> > if (/^--task-prolog=/) {
> > print("The --task-prolog option is unsupported at . Please "
> .
> >   "contact the  for assistance.\n");
> > exit(1);
> > } else {
> > $command .= " $_";
> > }
> > }
> > system($command);
> >
> > Ideas?
> >
> >
> >
> > On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain <r...@open-mpi.org> wrote:
> > Still begs the bigger question, though, as others have used script
> wrappers before - and I'm not sure we (OMPI) want to be in the business of
> dictating the scripting language they can use. :-)
> >
> > Jeff and I will argue that one out
> >
> >
> > On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
> >
> >> Ah, if it's perl, it might be easy. It might just be the difference
> between system("...string...") and system(@argv).
> >>
> >> Sent from my phone. No type good.
> >>
> >> On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote:
> >>
> >>> Jeff,
> >>>
> >>> I actually misspoke earlier. It turns out our srun is a *Perl* script
> around the SLURM srun. I'll speak with our admins to see if they can
> massage the script to not interpret the arguments. If possible, I'll ask
> them if I can share the script with you (privately or on the list) and
> maybe you can see how it is affecting Open MPI's argument passage.
> >>>
> >>> Matt
> >>>
> >>>
> >>> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >>> On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>>
> >>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to
> be a wrapper around the regular srun that runs a --task-prolog. What it
> does...that's beyond my ken, but I could ask. My guess is that it probably
> does something that helps keep our old PBS scripts running (sets
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
> admins would, of course, prefer all future scripts be SLURM-native scripts,
> but there are a lot of production runs that use many, many PBS scripts.
> Converting that would need slow, careful QC to make sure any "pure SLURM"
> versions act as expected.
> >>>
> >>> Ralph and I haven't had a chance to discuss this in detail yet, but I
> have thought about this quite a bit.
> >>>
> >>> What is happening is that one of the $argv OMPI passes is of the form
> "foo;bar".  Your srun script is interpreting the ";" as the end of the
> command and the "bar" as the beginning of a new command, 

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff,

Here is the script (with a bit of munging for safety's sake):

#! /usr/bin/perl -w

use strict;
use warnings;

use FindBin;

# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';

# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";

# Add the prolog option, but abort if the user specifies a prolog option.
my $command = "$srun_slurm $prolog";
foreach (@ARGV) {
    if (/^--task-prolog=/) {
        print("The --task-prolog option is unsupported at . Please " .
              "contact the  for assistance.\n");
        exit(1);
    } else {
        $command .= " $_";
    }
}
system($command);

Ideas?
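
(For what it's worth, here is a tiny shell-level sketch of the hazard being
described below -- it has nothing to do with our actual wrapper, it just
illustrates what happens when argv is rebuilt into one string and re-parsed
by a shell; the "uri=..." value is made up:

  $ set -- 'uri=1234.0;tcp://10.1.2.3:4567'
  $ printf '[%s]\n' "$@"    # argv passed through intact: one argument
  [uri=1234.0;tcp://10.1.2.3:4567]
  $ sh -c "echo $*"         # rebuilt as one string: the ';' now splits it
  uri=1234.0
  sh: tcp://10.1.2.3:4567: No such file or directory

Perl behaves the same way: system() with a single string containing shell
metacharacters goes through /bin/sh, while the list form system(@command)
execs the program directly.)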



On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Still begs the bigger question, though, as others have used script
> wrappers before - and I'm not sure we (OMPI) want to be in the business of
> dictating the scripting language they can use. :-)
>
> Jeff and I will argue that one out
>
>
> On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
>
>  Ah, if it's perl, it might be easy. It might just be the difference
> between system("...string...") and system(@argv).
>
> Sent from my phone. No type good.
>
> On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote:
>
>   Jeff,
>
>  I actually misspoke earlier. It turns out our srun is a *Perl* script
> around the SLURM srun. I'll speak with our admins to see if they can
> massage the script to not interpret the arguments. If possible, I'll ask
> them if I can share the script with you (privately or on the list) and
> maybe you can see how it is affecting Open MPI's argument passage.
>
>  Matt
>
>
> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be
>> a wrapper around the regular srun that runs a --task-prolog. What it
>> does...that's beyond my ken, but I could ask. My guess is that it probably
>> does something that helps keep our old PBS scripts running (sets
>> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
>> admins would, of course, prefer all future scripts be SLURM-native scripts,
>> but there are a lot of production runs that use many, many PBS scripts.
>> Converting that would need slow, careful QC to make sure any "pure SLURM"
>> versions act as expected.
>>
>>  Ralph and I haven't had a chance to discuss this in detail yet, but I
>> have thought about this quite a bit.
>>
>> What is happening is that one of the $argv OMPI passes is of the form
>> "foo;bar".  Your srun script is interpreting the ";" as the end of the
>> command and the "bar" as the beginning of a new command, and mayhem ensues.
>>
>> Basically, your srun script is violating what should be a very safe
>> assumption: that the $argv we pass to it will not be interpreted by a
>> shell.  Put differently: your "srun" script behaves differently than
>> SLURM's "srun" executable.  This violates OMPI's expectations of how srun
>> should behave.
>>
>> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing
>> all other SLURM installations out there that *don't* violate this
>> assumption (i.e., all of them).  Ralph may disagree with me on this point,
>> BTW -- like I said, we haven't talked about this in detail since Tuesday.
>> :-)
>>
>> So here's my question: is there any chance you can change your "srun"
>> script to a script language that doesn't recombine $argv?  This is a common
>> problem, actually -- sh/csh/etc. script languages tend to recombine $argv,
>> but other languages such as perl and python do not (e.g.,
>> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a
>> ).
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>  Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/09/25263.php
>>
>
>
>
>  --
>  "And, isn't sanity really just a one-trick pony 

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff,

I actually misspoke earlier. It turns out our srun is a *Perl* script
around the SLURM srun. I'll speak with our admins to see if they can
massage the script to not interpret the arguments. If possible, I'll ask
them if I can share the script with you (privately or on the list) and
maybe you can see how it is affecting Open MPI's argument passage.

Matt


On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be
> a wrapper around the regular srun that runs a --task-prolog. What it
> does...that's beyond my ken, but I could ask. My guess is that it probably
> does something that helps keep our old PBS scripts running (sets
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
> admins would, of course, prefer all future scripts be SLURM-native scripts,
> but there are a lot of production runs that use many, many PBS scripts.
> Converting that would need slow, careful QC to make sure any "pure SLURM"
> versions act as expected.
>
> Ralph and I haven't had a chance to discuss this in detail yet, but I have
> thought about this quite a bit.
>
> What is happening is that one of the $argv OMPI passes is of the form
> "foo;bar".  Your srun script is interpreting the ";" as the end of the
> command and the "bar" as the beginning of a new command, and mayhem ensues.
>
> Basically, your srun script is violating what should be a very safe
> assumption: that the $argv we pass to it will not be interpreted by a
> shell.  Put differently: your "srun" script behaves differently than
> SLURM's "srun" executable.  This violates OMPI's expectations of how srun
> should behave.
>
> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing
> all other SLURM installations out there that *don't* violate this
> assumption (i.e., all of them).  Ralph may disagree with me on this point,
> BTW -- like I said, we haven't talked about this in detail since Tuesday.
> :-)
>
> So here's my question: is there any chance you can change your "srun"
> script to a script language that doesn't recombine $argv?  This is a common
> problem, actually -- sh/csh/etc. script languages tend to recombine $argv,
> but other languages such as perl and python do not (e.g.,
> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a
> ).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25263.php
>



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) 
wrote:

> Matt: Random thought -- is your "srun" a shell script, perchance?  (it
> shouldn't be, but perhaps there's some kind of local override...?)
>
> Ralph's point on the call today is that it doesn't matter *how* this
> problem is happening.  It *is* happening to real users, and so we need to
> account for it.
>
> But it really bothers me that we don't understand *how/why* this is
> happening (e.g., is this OMPI's fault somehow?  I don't think so, but then
> again, we don't understand how it's happening).  *Somewhere* in there, a
> shell is getting invoked.  But "srun" shouldn't be invoking a shell on the
> remote side -- it should be directly fork/exec'ing the tokens with no shell
> interpretation at all.
>

Jeff,

Just saw this, sorry. Our srun is indeed a shell script. It seems to be a
wrapper around the regular srun that runs a --task-prolog. What it
does...that's beyond my ken, but I could ask. My guess is that it probably
does something that helps keep our old PBS scripts running (sets
$PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
admins would, of course, prefer all future scripts be SLURM-native scripts,
but there are a lot of production runs that use many, many PBS scripts.
Converting that would need slow, careful QC to make sure any "pure SLURM"
versions act as expected.

Matt


-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
Jeff,

I tried your script and I saw:

(1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun
-np 8 ./script.sh
(1028) $

Now, the very first time I ran it, I think I might have noticed a blip of
orted on the nodes, but it disappeared fast. When I re-run the same
command, it just seems to exit immediately with nothing showing up.

If I use my "debug-patch" version, I see:

(1028) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch//bin/mpirun
-np 8 ./script.sh
hello world
hello world
hello world
hello world
hello world
hello world
hello world
hello world

And, well, it's there for 10 minutes, I'm guessing. If I ssh to another of
the nodes in my allocation:

(1005) $ ps aux | grep openmpi
mathomp4 20317  0.0  0.0  59952  4256 ?S09:17   0:00
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/orted
-mca orte_ess_jobid 1842544640 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
6 -mca orte_hnp_uri 1842544640.0;tcp://10.1.24.169,172.31.1.254,
10.12.24.169:41684
mathomp4 20389  0.0  0.0   5524   844 pts/0S+   09:19   0:00 grep
--color=auto openmpi


Matt


On Tue, Sep 2, 2014 at 5:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Matt --
>
> We were discussing this issue on our weekly OMPI engineering call today.
>
> Can you check one thing for me?  With the un-edited 1.8.2 tarball
> installation, I see that you're getting no output for commands that you run
> -- but also no errors.
>
> Can you verify and see if your commands are actually *running*?  E.g, try:
>
> $ cat > script.sh <<EOF
> #!/bin/sh
> echo hello world
> sleep 600
> echo goodbye world
> EOF
> $ chmod +x script.sh
> $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun
> -np 8 script.sh
>
> and then go "ps" on the back-end nodes and see if there is an "orted"
> process and N "sleep 600" processes running on them.
>
> I'm *assuming* you won't see the "hello world" output.
>
> The purpose of this test is that I want to see if OMPI is just totally
> erring out and not even running your job (which is quite unlikely; OMPI
> should be much more noisy when this happens), or whether we're simply not
> seeing the stdout from the job.
>
> Thanks.
>
>
>
> On Sep 2, 2014, at 9:36 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> > On that machine, it would be SLES 11 SP1. I think it's soon
> transitioning to SLES 11 SP3.
> >
> > I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
> >
> >
> > On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > Thanks - I expect we'll have to release 1.8.3 soon to fix this in case
> others have similar issues. Out of curiosity, what OS are you using?
> >
> >
> > On Sep 1, 2014, at 9:00 AM, Matt Thompson <fort...@gmail.com> wrote:
> >
> >> Ralph,
> >>
> >> Okay that seems to have done it here (well, minus the usual
> shmem_mmap_enable_nfs_warning that our system always generates):
> >>
> >> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> >> (1034) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
> -np 8 ./helloWorld.182-debug-patch.x
> >> Process 7 of 8 is on borg01w218
> >> Process 5 of 8 is on borg01w218
> >> Process 1 of 8 is on borg01w218
> >> Process 3 of 8 is on borg01w218
> >> Process 0 of 8 is on borg01w218
> >> Process 2 of 8 is on borg01w218
> >> Process 4 of 8 is on borg01w218
> >> Process 6 of 8 is on borg01w218
> >>
> >> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I
> suppose.
> >>
> >> Thanks,
> >> Matt
> >>
> >> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org>
> wrote:
> >> Hmmm... I may see the problem. Would you be so kind as to apply the
> attached patch to your 1.8.2 code, rebuild, and try again?
> >>
> >> Much appreciate the help. Everyone's system is slightly different, and
> I think you've uncovered one of those differences.
> >> Ralph
> >>
> >>
> >>
> >> On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>
> >>> Ralph,
> >>>
> >>> Sorry it took me a bit of time. Here you go:
> >>>
> >>> (1002) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 1

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Matt Thompson
Ralph,

Okay that seems to have done it here (well, minus the
usual shmem_mmap_enable_nfs_warning that our system always generates):

(1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1034) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
-np 8 ./helloWorld.182-debug-patch.x
Process 7 of 8 is on borg01w218
Process 5 of 8 is on borg01w218
Process 1 of 8 is on borg01w218
Process 3 of 8 is on borg01w218
Process 0 of 8 is on borg01w218
Process 2 of 8 is on borg01w218
Process 4 of 8 is on borg01w218
Process 6 of 8 is on borg01w218

I'll ask the admin to apply the patch locally...and wait for 1.8.3, I
suppose.

Thanks,
Matt

On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hmmm... I may see the problem. Would you be so kind as to apply the
> attached patch to your 1.8.2 code, rebuild, and try again?
>
> Much appreciate the help. Everyone's system is slightly different, and I
> think you've uncovered one of those differences.
> Ralph
>
>
>
> On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> Sorry it took me a bit of time. Here you go:
>
> (1002) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
> [borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
> [borg01w063:03815] mca:base:select:(  plm) Query of component [isolated]
> set priority to 0
> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
> path NULL
> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set
> priority to 10
> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set
> priority to 75
> [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash
> 1757783593
> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
> [borg01w063:03815] mca: base: components_register: registering oob
> components
> [borg01w063:03815] mca: base: components_register: found loaded component
> tcp
> [borg01w063:03815] mca: base: components_register: component tcp register
> function successful
> [borg01w063:03815] mca: base: components_open: opening oob components
> [borg01w063:03815] mca: base: components_open: found loaded component tcp
> [borg01w063:03815] mca: base: components_open: component tcp open function
> successful
> [borg01w063:03815] mca:oob:select: checking available component tcp
> [borg01w063:03815] mca:oob:select: Querying component [tcp]
> [borg01w063:03815] oob:tcp: component_available called
> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our
> list of V4 connections
> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our
> list of V4 connections
> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our
> list of V4 connections
> [borg01w063:03815] [[49163,0],0] TCP STARTUP
> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
> [borg01w063:03815] mca:oob:select: Adding component to end
> [borg01w063:03815] mca:oob:select: Found 1 active transports
> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
> [borg01w063:03815] [[49163,0],0] plm:base:setup_job
> [borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],1]
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
> [[49163,0],1] to node borg01w064
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],2]
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
> [[49163,0],2] to node borg01w065
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],3]
> [borg01w063:03815] [[49163,0],0] plm:base:setup

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Matt Thompson
routed_binomial.c at line 498
[borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH
SIGNAL 9 ***
srun.slurm: error: borg01w069: task 2: Exited with exit code 213
srun.slurm: error: borg01w065: task 1: Exited with exit code 213
srun.slurm: error: borg01w071: task 4: Exited with exit code 213
srun.slurm: error: borg01w070: task 3: Exited with exit code 213
sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or
directory
[borg01w063:03815] [[49163,0],0] plm:slurm: primary daemons complete!
[borg01w063:03815] [[49163,0],0] plm:base:receive stop comm
[borg01w063:03815] [[49163,0],0] TCP SHUTDOWN
[borg01w063:03815] mca: base: close: component tcp closed
[borg01w063:03815] mca: base: close: unloading component tcp



On Fri, Aug 29, 2014 at 3:18 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd
> line being executed. Can you add it?
>
>
> On Aug 29, 2014, at 11:16 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> Here you go:
>
> (1080) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8
> ./helloWorld.182-debug.x
> [borg01x142:29232] mca: base: components_register: registering oob
> components
> [borg01x142:29232] mca: base: components_register: found loaded component
> tcp
> [borg01x142:29232] mca: base: components_register: component tcp register
> function successful
> [borg01x142:29232] mca: base: components_open: opening oob components
> [borg01x142:29232] mca: base: components_open: found loaded component tcp
> [borg01x142:29232] mca: base: components_open: component tcp open function
> successful
> [borg01x142:29232] mca:oob:select: checking available component tcp
> [borg01x142:29232] mca:oob:select: Querying component [tcp]
> [borg01x142:29232] oob:tcp: component_available called
> [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our
> list of V4 connections
> [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our
> list of V4 connections
> [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our
> list of V4 connections
> [borg01x142:29232] [[52298,0],0] TCP STARTUP
> [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
> [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
> [borg01x142:29232] mca:oob:select: Adding component to end
> [borg01x142:29232] mca:oob:select: Found 1 active transports
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> [borg01x153:01290] mca: base: components_register: registering oob
> components
> [borg01x153:01290] mca: base: components_register: found loaded component
> tcp
> [borg01x1

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
base/ess_base_std_orted.c at line 539
srun.slurm: error: borg01x143: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2332583.24
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to
finish.
srun.slurm: error: borg01x153: task 3: Exited with exit code 213
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
srun.slurm: error: borg01x144: task 1: Exited with exit code 213
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
srun.slurm: error: borg01x154: task 4: Exited with exit code 213
srun.slurm: error: borg01x145: task 2: Exited with exit code 213
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or
directory
[borg01x142:29232] [[52298,0],0] TCP SHUTDOWN
[borg01x142:29232] mca: base: close: component tcp closed
[borg01x142:29232] mca: base: close: unloading component tcp

Note, if I can get the allocation today, I want to try doing all this on a
single SandyBridge node, rather than on 6. It might make comparing various
runs a bit easier!

Matt



On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, something quite weird is happening here. I can't replicate using the
> 1.8.2 release tarball on a slurm machine, so my guess is that something
> else is going on here.
>
> Could you please rebuild the 1.8.2 code with --enable-debug on the
> configure line (assuming you haven't already done so), and then rerun that
> version as before but adding "--mca oob_base_verbose 10" to the cmd line?
>
>
> On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> For 1.8.2rc4 I get:
>
> (1003) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun
> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for
> commands!
> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for
> commands!
> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for
> commands!
> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for
> commands!
> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for
> commands!
> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,

[OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Matt Thompson
Open MPI List,

I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our
cluster (reported on this list), and decided to try it with 1.8.2. However,
we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder,
Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no stdout
with Open MPI 1.8.2. That is, HelloWorld doesn't work.

To wit, our sysadmin has two tarballs:

(1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
(1442) $ sha1sum openmpi-1.8.2.tar.gz
cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz

I then build each with a script in the manner our sysadmin usually does:

#!/bin/sh
> set -x
> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
> build() {
>   echo `pwd`
>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared
> --enable-mca-no-build=btl-usnic \
>   CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>   CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC
> -m64" FFLAGS="-mtune=generic -fPIC -m64" \
>   F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC
> -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>   LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>  --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>   make 2>&1 | tee make.1.8.2.log
>   make check 2>&1 | tee makecheck.1.8.2.log
>   make install 2>&1 | tee makeinstall.1.8.2.log
> }
> echo "calling build"
> build
> echo "exiting"


The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX
and log file tees.  Now, let us test. First, I grab some nodes with slurm:

$ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00
> --account=g0620 --mail-type=BEGIN


Once I get my nodes, I run with 1.8.2rc4:

(1142) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o
> helloWorld.182rc4.x helloWorld.F90
> (1143) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8
> ./helloWorld.182rc4.x
> Process 0 of 8 is on borg01w044
> Process 5 of 8 is on borg01w044
> Process 3 of 8 is on borg01w044
> Process 7 of 8 is on borg01w044
> Process 1 of 8 is on borg01w044
> Process 2 of 8 is on borg01w044
> Process 4 of 8 is on borg01w044
> Process 6 of 8 is on borg01w044


Now 1.8.2:

(1144) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o
> helloWorld.182.x helloWorld.F90
> (1145) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8
> ./helloWorld.182.x
> (1146) $


No output at all. But, if I take the helloWorld.x from 1.8.2 and run it
with 1.8.2rc4's mpirun:

(1146) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8
> ./helloWorld.182.x
> Process 5 of 8 is on borg01w044
> Process 7 of 8 is on borg01w044
> Process 2 of 8 is on borg01w044
> Process 4 of 8 is on borg01w044
> Process 1 of 8 is on borg01w044
> Process 3 of 8 is on borg01w044
> Process 6 of 8 is on borg01w044
> Process 0 of 8 is on borg01w044


So...any idea what is happening here? There did seem to be a few
SLURM-related changes between the two tarballs involving /dev/null, but
deciphering them is a bit above me.

You can find the ompi_info, build, make, config, etc logs at these links
(they are ~300kB which is over the mailing list limit according to the Open
MPI web page):

https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2

Thank you for any help and please let me know if you need more information,
Matt

-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


Re: [OMPI users] Intermittent, somewhat architecture-dependent hang with Open MPI 1.8.1

2014-08-16 Thread Matt Thompson
Jeff,

I've tried moving the backing file and it doesn't matter. I can say that
PGI 14.7 + Open MPI 1.8.1 does not show this issue. I can run that on 96
cores just fine. Heck, I've run it on a few hundred.

As for the 96, they are either on 8 Westmere nodes (8 nodes with 2 6-core
sockets) or 6 Sandy Bridge nodes (6 nodes with 2 8-core sockets). I think
each set is on a different Infiniband fabric, but I'm not sure of that.
However, since the PGI 14.7/Open MPI 1.8.1 combination works just fine on the exact
same sets of nodes (grabbed via an interactive SLURM job), I can't see how
the Infiniband fabric would matter.

I also tried various combinations of:

  mpirun --np
  mpirun --map-by core -np
  mpirun --map-by socket -np

and maybe a few -bind-to variations as well, all with --report-bindings on to
make sure it was doing what I expected, and it was. It wasn't putting 96 processes on a
single node, for example, or all on the same socket or core by some freak
accident.
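
A typical full command line (reconstructed from memory, so the exact flag
spellings may be slightly off) looked like:

  mpirun -np 96 --map-by socket --bind-to core --report-bindings ./mpi_reproducer.x 4 24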

The only difference between the Open MPI installs are the compilers they
were built with (I'm pretty sure the admins just downloaded the source
once). Looking at "mpif90 -showme" I can see that the PGI 14.7 compile
built the mpi_f90 and mpi modules while it looks like the GCC 4.9.1 did
not, but our main code and this reproducer only use mpif.h, so that
shouldn't matter.

Matt


On Sat, Aug 16, 2014 at 7:33 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Have you tried moving your shared memory backing file directory, like the
> warning message suggests?
>
> I haven't seen a shared memory file on a network share cause correctness
> issues before (just performance issues), but I could see how that could be
> in the realm of possibility...
>
> Also, are you running 96 processes on a single machine, or spread across
> multiple machines?
>
> Note that Open MPI 1.8.x binds each MPI process to a core by default, so
> if you're oversubscribing the machine, it could be fairly disastrous...?
>
>
> On Aug 14, 2014, at 1:29 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Open MPI Users,
> >
> > I work on a large climate model called GEOS-5 and we've recently managed
> to get it to compile with gfortran 4.9.1 (our usual compilers are Intel and
> PGI for performance). In doing so, we asked our admins to install Open MPI
> 1.8.1 as the MPI stack instead of MVAPICH2 2.0 mainly because we figure the
> gfortran port is more geared to a desktop.
> >
> > So, the model builds just fine but when we run it, it stalls in our
> "History" component whose job is to write out netCDF files of output. The
> odd thing is, though, this stall seems to happen more on our Sandy Bridge
> nodes than on our Westmere nodes, but both hang.
> >
> > A colleague has made a single-file code that emulates our History
> component (the MPI traffic part) that we've used to report bugs to MVAPICH
> and I asked him to try it with this issue and it seems to duplicate it.
> >
> > To wit, a "successful" run of the code is:
> >
> > (1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> >
> --
> > WARNING: Open MPI will create a shared memory backing file in a
> > directory that appears to be mounted on a network filesystem.
> > Creating the shared memory backup file on a network file system, such
> > as NFS or Lustre is not recommended -- it may cause excessive network
> > traffic to your file servers and/or cause shared memory traffic in
> > Open MPI to be much slower than expected.
> >
> > You may want to check what the typical temporary directory is on your
> > node.  Possible sources of the location of this temporary directory
> > include the $TEMPDIR, $TEMP, and $TMP environment variables.
> >
> > Note, too, that system administrators can set a list of filesystems
> > where Open MPI is disallowed from creating temporary files by setting
> > the MCA parameter "orte_no_session_dir".
> >
> >   Local host: borg01s026
> >   Fileame:
> /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0
> /60464/1/shared_mem_pool.borg01s026
> >
> > You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
> > disable this message.
> >
> --
> >  nx:4
> >  ny:   24
> >  comm size is   96
> >  local array sizes are  12  12
> >  filling local arrays
> >  creating requests
> >  ig

[OMPI users] Intermittent, somewhat architecture-dependent hang with Open MPI 1.8.1

2014-08-14 Thread Matt Thompson
;place" around a collective wait.

Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all
help/error messages) I usually just "hang" with no error message at all
(additionally turning off the warning):

(1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
(1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
 nx:4
 ny:   24
 comm size is   96
 local array sizes are  12  12
 filling local arrays
 creating requests
 igather
 before collective wait

Note, this problem doesn't seem to appear at lower number of processes (16,
24, 32) but does seem pretty consistent at 96, especially on Sandy Bridges.

Also, yes, we get that weird srun.slurm warning but we always seem to get
that (Open MPI, MVAPICH) so while our admins are trying to correct that, at
present it is not our worry.

The MPI stack was compiled with (per our admins):

export CFLAGS="-fPIC -m64"
export CXXFLAGS="-fPIC -m64"
export FFLAGS="-fPIC"
export FCFLAGS="-fPIC"
export F90FLAGS="-fPIC"

export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64

../configure --with-slurm --disable-wrapper-rpath --enable-shared
--enable-mca-no-build=btl-usnic --prefix=${PREFIX}

The output of "ompi_info --all" is found:

  https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out

The reproducer code can be found here:


https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90

The reproducer is easily built with just 'mpif90' and to run it:

  mpirun -np NPROCS ./mpi_reproducer.x NX NY

where NX*NY has to equal NPROCS and it's best to keep them even numbers.
(There might be a few more restrictions and the code will die if you
violate them.)
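
For instance, a small run that satisfies the NX*NY = NPROCS rule would be
something like:

  mpirun -np 16 ./mpi_reproducer.x 4 4

(the hanging runs above used -np 96 with a 4 x 24 decomposition).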

Thanks,
Matt Thompson

-- 
Matt Thompson  SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712  Fax: 301-614-6246


Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran

2014-03-24 Thread Matt Thompson
Jeff,

I ran these commands:

$ make clean
$ make distclean

(wanted to be extra sure!)

$ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64'
CXXFLAGS='-m64' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
--prefix=/Users/fortran/AutomakeBug/autobug14 | & tee configure.log
$ make V=1 install |& tee makeV1install.log

So find attached the config.log, configure.log, and makeV1install.log which
should have all the info you asked about.

Matt

PS: I just tried configure/make/make install with Open MPI 1.7.5, but the
same error occurs as expected. Hope springs eternal, you know?


On Mon, Mar 24, 2014 at 6:48 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> On Mar 24, 2014, at 6:34 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Sorry for the late reply. The answer is: No, 1.14.1 has not fixed the
> problem (and indeed, that's what my Mac is running):
> >
> > (28) $ make install | & tee makeinstall.log
> > Making install in src
> >  ../config/install-sh -c -d '/Users/fortran/AutomakeBug/autobug14/lib'
> >  /bin/sh ../libtool   --mode=install /usr/bin/install -c
> libfortran_stuff.la '/Users/fortran/AutomakeBug/autobug14/lib'
> > libtool: install: /usr/bin/install -c .libs/libfortran_stuff.0.dylib
> /Users/fortran/AutomakeBug/autobug14/lib/libfortran_stuff.0.dylib
> > install: .libs/libfortran_stuff.0.dylib: No such file or directory
> > make[2]: *** [install-libLTLIBRARIES] Error 71
> > make[1]: *** [install-am] Error 2
> > make: *** [install-recursive] Error 1
> >
> > This is the output from either the am12 or am14 test. If you have any
> options you'd like me to try with this, let me know. (For example, is there
> a way to make autotools *more* verbose? I've always tried to make it less
> so!)
>
> Ok.  With the am14 tarball, please run:
>
> make clean
>
> And then run this:
>
> make V=1 install
>
> And then send the following:
>
> - configure stdout
> - config.log file
> - stdout/stderr from "make V=1 install"
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


config.log
Description: Binary data


configure.log
Description: Binary data


makeV1install.log
Description: Binary data


Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran

2014-03-24 Thread Matt Thompson
Jeff,

Sorry for the late reply. The answer is: No, 1.14.1 has not fixed the
problem (and indeed, that's what my Mac is running):

(28) $ make install | & tee makeinstall.log
Making install in src
 ../config/install-sh -c -d '/Users/fortran/AutomakeBug/autobug14/lib'
 /bin/sh ../libtool   --mode=install /usr/bin/install -c
libfortran_stuff.la '/Users/fortran/AutomakeBug/autobug14/lib'
libtool: install: /usr/bin/install -c .libs/libfortran_stuff.0.dylib
/Users/fortran/AutomakeBug/autobug14/lib/libfortran_stuff.0.dylib
install: .libs/libfortran_stuff.0.dylib: No such file or directory
make[2]: *** [install-libLTLIBRARIES] Error 71
make[1]: *** [install-am] Error 2
make: *** [install-recursive] Error 1

This is the output from either the am12 or am14 test. If you have any
options you'd like me to try with this, let me know. (For example, is there
a way to make autotools *more* verbose? I've always tried to make it less
so!)

Matt


On Fri, Mar 21, 2014 at 11:02 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

>  This is starting to smell like a Libtool and/or Automake bug -- it
> created libmpi_usempi_ignore_tkr.dylib, but it tried to install
> libmpi_usempi_ignore_tkr.0.dylib (notice the extra ".0").  :-\
>
> This is both good and bad.
>
> Good: I can think of 2 ways to work around this issue off the top of my
> head:
>
> 1. "make -k install" and ignore the error as it flashes by.  The rest of
> OMPI will install properly.  Then cd into
> build_dir/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs. Copy
> libmpi_usempi_ignore_tkr.* to $libdir (i.e.,
> /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib, in your example below).
> And you should be good to go.
>
> ...although you may need to do a similar thing in the
> ompi/mpi/fortran/use-mpi-f08/.libs directory.
>
> 2. Somewhere in ompi/mpi/fortran/use-mpi-ignore-tkr/Makefile will be the
> filename "libmpi_usempi_ignore_tkr.0.dylib".  Edit it to remove the ".0".
> Then "make install" should work fine.  (you might need to do the same in
> use-mpi-f08/Makefile)
>
> Bad: we can't really fix this error if it really is a bug in Automake
> and/or Libtool, but we can at least report it upstream.
>
> I've made a trivial Autotools test project (
> https://github.com/jsquyres/pgi-autotool-bug) to see if we can nail this
> down a little more, and possibly use the results to report upstream.
>
> Here's the versions of Autotools that we use to make the OMPI 1.7.x series:
>
> Autoconf 2.69
> Automake 1.12.2
> Libtool 2.4.2
> m4 1.4.16
>
> Attached is a tarball I made of the sample project using those versions.
> Can you try building and installing this tarball on your system with the
> same kinds of options you used with OMPI?  Hopefully, you should see the
> same error.  If not, I need to tweak this project a bit more to make it
> more like OMPI's build system behavior.
>
> If you can replicate the error, then also try the second attached tarball:
> it's the same project, but bootstrapped with the latest versions of GNU
> Automake (the others are already the most recent):
>
> Automake 1.14.1
>
> This will let us see if automake 1.14.1 has fixed the issue.
>
>
>
>
> On Mar 20, 2014, at 1:16 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Jeff, here you go:
> >
> > (3) $ cd ompi/mpi/fortran/use-mpi-ignore-tkr
> > total 2888
> > -rw-r--r--  1 fortran  staff   1.7K Apr 13  2013 Makefile.am
> > -rw-r--r--  1 fortran  staff   215K Dec 17 21:09
> mpi-ignore-tkr-interfaces.h.in
> > -rw-r--r--  1 fortran  staff39K Dec 17 21:09
> mpi-ignore-tkr-file-interfaces.h.in
> > -rw-r--r--  1 fortran  staff   1.5K Jan 27 19:04 mpi-ignore-tkr.F90
> > -rw-r--r--  1 fortran  staff80K Feb  4 17:53 Makefile.in
> > -rw-r--r--  1 fortran  staff   208K Mar 18 20:37
> mpi-ignore-tkr-interfaces.h
> > -rw-r--r--  1 fortran  staff38K Mar 18 20:37
> mpi-ignore-tkr-file-interfaces.h
> > -rw-r--r--  1 fortran  staff75K Mar 18 20:37 Makefile
> > -rw-r--r--  1 fortran  staff   765K Mar 18 20:47 mpi.mod
> > -rw-r--r--  1 fortran  staff   280B Mar 18 20:47 mpi-ignore-tkr.lo
> > -rw-r--r--  1 fortran  staff   1.0K Mar 18 20:47
> libmpi_usempi_ignore_tkr.la
> > Directory:
> /Users/fortran/MPI/src/openmpi-1.7.4/ompi/mpi/fortran/use-mpi-ignore-tkr
> > (4) $ make clean
> > test -z "*~ .#*" || rm -f *~ .#*
> > test -z "libmpi_usempi_ignore_tkr.la" || rm -f
> libmpi_usempi_ignore_tkr.la
> > rm -f ./so_locations
> > rm -rf .libs _libs
> > rm -f *.o
> > test -z "*.mod" || rm -f *.mod
> > rm -f *.lo
> > (5) $ make V=1
> > /bin/sh ../../../../libtool  --tag=FC   --m

Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran

2014-03-20 Thread Matt Thompson
checking Fortran compiler ignore TKR syntax... 1:real, dimension(*):!DIR$
> IGNORE_TKR
> checking if building Fortran 'use mpi' bindings... yes
> -
>
> And then the make logs indicate that it did, indeed, build the ignore TKR
> mpi module.
>
> -
> Making all in mpi/fortran/use-mpi-ignore-tkr
>   PPFC mpi-ignore-tkr.lo
>   FCLD libmpi_usempi_ignore_tkr.la
> -
>
> And then make install fails:
>
> -
> Making install in mpi/fortran/use-mpi-ignore-tkr
>  ../../../../config/install-sh -c -d
> '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib'
>  /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c
> libmpi_usempi_ignore_tkr.la '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib'
> libtool: install: /usr/bin/install -c
> .libs/libmpi_usempi_ignore_tkr.0.dylib
> /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib/libmpi_usempi_ignore_tkr.0.dylib
> install: .libs/libmpi_usempi_ignore_tkr.0.dylib: No such file or directory
> -
>
> Can you do the following:
>
> -
> cd ompi_build_dir/ompi/mpi/fortran/use-mpi-ignore-tkr
> make clean
> make V=1
> find .
> make install
> -
>
>
> On Mar 20, 2014, at 7:44 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Jeff,
> >
> > It does not:
> >
> > Directory: /Users/fortran/MPI/src/openmpi-1.7.4/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs
> > (106) $ ls -ltr
> > total 1560
> > -rw-r--r--  1 fortran  staff  784824 Mar 18 20:47 mpi-ignore-tkr.o
> > -rw-r--r--  1 fortran  staff    1021 Mar 18 20:47 libmpi_usempi_ignore_tkr.lai
> > lrwxr-xr-x  1 fortran  staff      30 Mar 18 20:47 libmpi_usempi_ignore_tkr.la@ -> ../libmpi_usempi_ignore_tkr.la
> > lrwxr-xr-x  1 fortran  staff      32 Mar 18 20:47 libmpi_usempi_ignore_tkr.dylib@ -> libmpi_usempi_ignore_tkr.0.dylib
> >
> > which I guess makes sense.
> >
> > I'm attaching the logfiles from my compile attempt. This is the "basic"
> attempt as can be seen from the config.log file.
> >
> > Thanks,
> > Matt
> >
> >
> >
> > On Thu, Mar 20, 2014 at 6:45 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Sorry for the delay; we're working on releasing 1.7.5 and that's
> consuming all my time...
> >
> > That's a strange error.  Can you confirm whether
> ompi_build_dir/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs/libmpi_usempi_ignore_tkr.0.dylib
> exists or not?
> >
> > Can you send all the info listed here:
> >
> > http://www.open-mpi.org/community/help/
> >
> >
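That help page asks for, among other things, the config.log file and the output of configure and make. One way to bundle whatever was captured for the list (the file names here are assumptions; substitute the logs actually produced):

  $ tar cjf ompi-logfiles.tar.bz2 config.log configure.log make.log make-install.log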
> > On Mar 18, 2014, at 8:59 PM, Matt Thompson <fort...@gmail.com> wrote:
> >
> > > All,
> > >
> > > I recently downloaded PGI's Free OS X Fortran compiler:
> > >
> > > http://www.pgroup.com/products/freepgi/
> > >
> > > in the hope of potentially using it to compile a weather model I work
> with, GEOS-5. That model requires an MPI stack and I usually start (and end)
> with Open MPI on a desktop.
> > >
> > > So, I grabbed Open MPI 1.7.4 and tried compiling it in a few ways. In
> each case, my C and C++ compilers were the built-in clang-y gcc and g++
> from Xcode, while pgfortran was the Fortran compiler. I tried a few
> different configures from the basic:
> > >
> > > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64'
> CXXFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
> --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3
> > >
> > > all the way to the "let's try every flag Google says I might use"
> version of:
> > >
> > > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64
> -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8'
> CXXFLAGS='-m64 -Xclang -target-feature -Xclang -aes
> -mmacosx-version-min=10.8' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
> --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx
> > >
> > > In every case, the configure, make, and make check worked well without
> error, but running a 'make install' led to:
> > >
> > > Making install in mpi/fortran/use-mpi-ignore-tkr
> > >  ../../../../config/install-sh -c -d
> '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
> > >  /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c
> libmpi_usempi_ignore_tkr.la '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
> > > libtool: install: /usr/bin/install -c
> .libs/libmpi_usempi_ignore_tkr.0.dylib
> /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib/libmpi_usempi_ignore_tkr.0.dylib
> > > install: .

Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran

2014-03-20 Thread Matt Thompson
Jeff,

It does not:

Directory: /Users/fortran/MPI/src/openmpi-1.7.4/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs
(106) $ ls -ltr
total 1560
-rw-r--r--  1 fortran  staff  784824 Mar 18 20:47 mpi-ignore-tkr.o
-rw-r--r--  1 fortran  staff    1021 Mar 18 20:47 libmpi_usempi_ignore_tkr.lai
lrwxr-xr-x  1 fortran  staff      30 Mar 18 20:47 libmpi_usempi_ignore_tkr.la@ -> ../libmpi_usempi_ignore_tkr.la
lrwxr-xr-x  1 fortran  staff      32 Mar 18 20:47 libmpi_usempi_ignore_tkr.dylib@ -> libmpi_usempi_ignore_tkr.0.dylib

which I guess makes sense.
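One detail worth noting in that listing: the symlink libmpi_usempi_ignore_tkr.dylib@ points at libmpi_usempi_ignore_tkr.0.dylib, but the versioned library itself never shows up in .libs, so the link is dangling and 'make install' has nothing to copy. A quick check along those lines (just a sketch; adjust the path to the actual build tree):

  $ cd ompi/mpi/fortran/use-mpi-ignore-tkr/.libs
  $ ls -l libmpi_usempi_ignore_tkr*
  $ test -f libmpi_usempi_ignore_tkr.0.dylib || echo "versioned dylib was never created"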

I'm attaching the logfiles from my compile attempt. This is the "basic"
attempt as can be seen from the config.log file.

Thanks,
Matt



On Thu, Mar 20, 2014 at 6:45 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Sorry for the delay; we're working on releasing 1.7.5 and that's consuming
> all my time...
>
> That's a strange error.  Can you confirm whether
> ompi_build_dir/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs/libmpi_usempi_ignore_tkr.0.dylib
> exists or not?
>
> Can you send all the info listed here:
>
> http://www.open-mpi.org/community/help/
>
>
> On Mar 18, 2014, at 8:59 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > All,
> >
> > I recently downloaded PGI's Free OS X Fortran compiler:
> >
> > http://www.pgroup.com/products/freepgi/
> >
> > in the hope of potentially using it to compile a weather model I work
> with, GEOS-5. That model requires an MPI stack and I usually start (and end)
> with Open MPI on a desktop.
> >
> > So, I grabbed Open MPI 1.7.4 and tried compiling it in a few ways. In
> each case, my C and C++ compilers were the built-in clang-y gcc and g++
> from Xcode, while pgfortran was the Fortran compiler. I tried a few
> different configures from the basic:
> >
> > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64'
> CXXFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
> --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3
> >
> > all the way to the "let's try every flag Google says I might use"
> version of:
> >
> > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64
> -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8'
> CXXFLAGS='-m64 -Xclang -target-feature -Xclang -aes
> -mmacosx-version-min=10.8' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
> --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx
> >
> > In every case, the configure, make, and make check worked well without
> error, but running a 'make install' led to:
> >
> > Making install in mpi/fortran/use-mpi-ignore-tkr
> >  ../../../../config/install-sh -c -d
> '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
> >  /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c
> libmpi_usempi_ignore_tkr.la '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
> > libtool: install: /usr/bin/install -c
> .libs/libmpi_usempi_ignore_tkr.0.dylib
> /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib/libmpi_usempi_ignore_tkr.0.dylib
> > install: .libs/libmpi_usempi_ignore_tkr.0.dylib: No such file or
> directory
> > make[3]: *** [install-libLTLIBRARIES] Error 71
> > make[2]: *** [install-am] Error 2
> > make[1]: *** [install-recursive] Error 1
> > make: *** [install-recursive] Error 1
> >
> > Any ideas on how to overcome this?
> >
> > Thanks,
> > Matt Thompson
> > --
> > "And, isn't sanity really just a one-trick pony anyway? I mean all you
> >  get is one trick: rational thinking. But when you're good and crazy,
> >  oooh, oooh, oooh, the sky is the limit!" -- The Tick
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


OMPI-1.7.4-Logfiles.tar.bz2
Description: BZip2 compressed data


[OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran

2014-03-18 Thread Matt Thompson
All,

I recently downloaded PGI's Free OS X Fortran compiler:

http://www.pgroup.com/products/freepgi/

in the hope of potentially using it to compile a weather model I work with,
GEOS-5. That model requires an MPI stack and I usually start (and end) with
Open MPI on a desktop.

So, I grabbed Open MPI 1.7.4 and tried compiling it in a few ways. In each
case, my C and C++ compilers were the built-in clang-y gcc and g++ from
Xcode, while pgfortran was the Fortran compiler. I tried a few different
configures from the basic:

$ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64'
> CXXFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
> --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3


all the way to the "let's try every flag Google says I might use" version
of:

$ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64
> -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8'
> CXXFLAGS='-m64 -Xclang -target-feature -Xclang -aes
> -mmacosx-version-min=10.8' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64'
> --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx


In every case, the configure, make, and make check worked well without
error, but running a 'make install' led to:

Making install in mpi/fortran/use-mpi-ignore-tkr
>  ../../../../config/install-sh -c -d
> '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
>  /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c
> libmpi_usempi_ignore_tkr.la '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
> libtool: install: /usr/bin/install -c
> .libs/libmpi_usempi_ignore_tkr.0.dylib
> /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib/libmpi_usempi_ignore_tkr.0.dylib
> install: .libs/libmpi_usempi_ignore_tkr.0.dylib: No such file or directory
> make[3]: *** [install-libLTLIBRARIES] Error 71
> make[2]: *** [install-am] Error 2
> make[1]: *** [install-recursive] Error 1
> make: *** [install-recursive] Error 1


Any ideas on how to overcome this?

Thanks,
Matt Thompson
-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
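One way to dig further into a failure like this (a sketch of a diagnostic, not a known fix): re-run just the failing libtool install step by hand with tracing enabled, from the directory where 'make install' stopped, and compare which files libtool expects to ship with what actually exists in .libs. The $PREFIX below is a placeholder for whatever --prefix was passed to configure:

  $ cd ompi/mpi/fortran/use-mpi-ignore-tkr
  $ ls -l .libs/
  $ /bin/sh ../../../../libtool --debug --mode=install /usr/bin/install -c \
        libmpi_usempi_ignore_tkr.la "$PREFIX/lib" 2>&1 | tee libtool-install.log

The trace should show whether libtool decided a versioned .0.dylib ought to exist (and why), which helps narrow down whether the link step or the install step is at fault.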