Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:
I have built Open MPI 5 (well, 5.0.0rc12) with Intel oneAPI under Rosetta 2 with:

$ lt_cv_ld_force_load=no ../configure --disable-wrapper-rpath --disable-wrapper-runpath \
    CC=clang CXX=clang++ FC=ifort \
    --with-hwloc=internal --with-libevent=internal --with-pmix=internal

I'm fairly sure the two wrapper flags are not needed; I just have them for historical reasons (long ago I needed them, and until they cause an issue I keep all my flags around). Maybe it works for me because I'm using clang instead of icc? I can "get away" with that because the code I work on is nearly all Fortran, so the C compiler is not as important to us. And all the libraries we care about seem happy with mixed ifort-clang as well. If you don't have a driving need for icc, maybe this will let things work?

On Mon, Nov 6, 2023 at 8:55 AM Volker Blum via users <users@lists.open-mpi.org> wrote:
> I don't have a solution to this but am interested in finding one.
>
> There is an issue with some include statements between oneAPI and Xcode on
> macOS 14.x, at least for C++ (the example below seems to be C?). It
> appears that many standard headers are not being found.
>
> I did not encounter this problem with Open MPI, though, since I got stuck
> at an earlier point. My workaround, Open MPI 4.1.6, compiled fine.
>
> While compiling a different C++ code, these missing headers struck me, too.
>
> Many of the include-related error messages went away after installing
> Xcode 15.1 beta 2 -- however, not all of them. That's as far as I got ...
> sorry about the experience.
>
> Best wishes
> Volker
>
> Volker Blum
> Vinik Associate Professor, Duke MEMS & Chemistry
> https://aims.pratt.duke.edu
> https://bsky.app/profile/aimsduke.bsky.social
>
> > On Nov 6, 2023, at 4:25 AM, Christophe Peyret via users
> > <users@lists.open-mpi.org> wrote:
> >
> > Hello,
> >
> > I am trying to compile Open MPI 5.0.0 on macOS 14.1 with Intel oneAPI
> > Version 2021.9.0 Build 20230302_00.
> > I enter the command:
> >
> > lt_cv_ld_force_load=no ../openmpi-5.0.0/configure --prefix=$APP_DIR/openmpi-5.0.0 \
> >     F77=ifort FC=ifort CC=icc CXX=icpc \
> >     --with-pmix=internal --with-libevent=internal --with-hwloc=internal
> >
> > Then:
> >
> > make
> >
> > And compilation stops with the error message:
> >
> > /Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c(55):
> > catastrophic error: cannot open source file
> > "/Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c"
> > #include
> > ^
> >
> > compilation aborted for
> > /Users/christophe/Developer/openmpi-5.0.0/3rd-party/openpmix/src/util/pmix_path.c (code 4)
> > make[4]: *** [pmix_path.lo] Error 1
> > make[3]: *** [all-recursive] Error 1
> > make[2]: *** [all-recursive] Error 1
> > make[1]: *** [all-recursive] Error 1
> > make: *** [all-recursive] Error 1

--
Matt Thompson
"The fact is, this is about us identifying what we do best and
finding more ways of doing less of it better" -- Director of Better Anna Rampton
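For reference, the working recipe described at the top of this message can be collected into one script. This is a hedged sketch, not a definitive build procedure: the install prefix is a placeholder, and as noted above the two wrapper flags are probably optional.

```shell
# Sketch of the configure recipe from the message above (Open MPI 5.x,
# oneAPI ifort + Apple clang under Rosetta 2). The --prefix value is a
# placeholder; --disable-wrapper-rpath/-runpath are likely optional.
lt_cv_ld_force_load=no ../configure \
    --prefix="$HOME/opt/openmpi-5.0.0" \
    --disable-wrapper-rpath --disable-wrapper-runpath \
    CC=clang CXX=clang++ FC=ifort \
    --with-hwloc=internal --with-libevent=internal --with-pmix=internal
make -j"$(sysctl -n hw.ncpu)" && make install
```

The key points are the libtool cache variable up front and clang (not icc) as the C compiler; the internal hwloc/libevent/pmix flags keep configure from picking up an installed Homebrew Open MPI.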
Re: [OMPI users] OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:
On my Mac I build Open MPI 5 with (among other flags):

--with-hwloc=internal --with-libevent=internal --with-pmix=internal

In my case, I should have had libevent through brew, but it didn't seem to see it. But then I figured I might as well let Open MPI build its own for convenience.

Matt

On Fri, Oct 27, 2023 at 7:51 PM Volker Blum via users <users@lists.open-mpi.org> wrote:
> OpenMPI 5.0.0 & Intel OneAPI 2023.2.0 on MacOS 14.0:
>
> In an ostensibly clean system, the following configure on macOS ends
> without a viable pmix build:
>
> configure: WARNING: Either libevent or libev support is required, but neither
> configure: WARNING: was found. Please use the configure options to point us
> configure: WARNING: to where we can find one or the other library
> configure: error: Cannot continue
> configure: = done with 3rd-party/openpmix configure =
> checking for pmix pkg-config name... pmix
> checking if pmix pkg-config module exists... yes
> checking for pmix pkg-config cflags... -I/usr/local/Cellar/open-mpi/4.1.5/include
> checking for pmix pkg-config ldflags... -L/usr/local/Cellar/open-mpi/4.1.5/lib
> checking for pmix pkg-config static ldflags... -L/usr/local/Cellar/open-mpi/4.1.5/lib
> checking for pmix pkg-config libs... -lpmix -lz
> checking for pmix pkg-config static libs... -lpmix -lz
> checking for pmix.h... no
> configure: error: Could not find viable pmix build.
>
> The configure command used was:
>
> lt_cv_ld_force_load=no ./configure --prefix=/usr/local/openmpi/5.0.0 \
>     FC=ifort F77=ifort CC=icc CXX=icpc
>
> ***
>
> The same command works (up to the end of the configure stage) with Open MPI 4.1.6.
>
> My guess is that this is related to some earlier pmix-related issues that
> can be found by Google, but I wanted to report it.
>
> Thank you!
> Best wishes
> Volker
>
> Volker Blum
> Associate Professor, Duke MEMS & Chemistry
> https://aims.pratt.duke.edu
> https://bsky.app/profile/aimsduke.bsky.social

--
Matt Thompson
[OMPI users] Building Open MPI without zlib: what might go wrong/different?
Open MPI List,

Recently, in trying to build some libraries with NVHPC + Open MPI, I hit an error building HDF5: it died at configure time saying that the zlib Open MPI wanted to link to (my system one) was incompatible with the zlib I built in my library stack leading up to HDF5.

So, in the end, I "fixed" my issue by adding:

--without-zlib

to my configure line for Open MPI and rebuilt. And hey, it worked. HDF5 built, and hello world still works as well.

But I'm now wondering: what might I be missing now? Zlib isn't required by the MPI Standard (as far as I can tell), so I'm guessing it's not functionality but rather performance?

Just curious,
Matt

--
Matt Thompson
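As far as I understand (an assumption worth verifying against the Open MPI/PMIx docs for your version), zlib is used by Open MPI's bundled PMIx mainly to compress launch-time job data, so dropping it should cost some startup efficiency at scale rather than any MPI functionality. A sketch of the rebuild plus a quick linkage check, with placeholder paths:

```shell
# Sketch, not a definitive recipe: rebuild Open MPI without zlib, then
# confirm the installed runtime library no longer links libz.
# The prefix is a placeholder; library name/extension vary by platform.
./configure --prefix="$HOME/opt/openmpi-nozlib" --without-zlib
make -j8 install
# Linux check (on macOS, use `otool -L` on the .dylib instead):
ldd "$HOME/opt/openmpi-nozlib/lib/libopen-pal.so" | grep -i libz \
    || echo "no zlib dependency"
```

If the grep comes back empty, the incompatible-system-zlib conflict that broke the HDF5 configure should be gone.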
Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
Jeff,

I'll take a look when I'm back at work next week. I work with someone on the Fortran Standards Committee, so if I can find the code, we can probably figure out how to fix it. That said, I know just enough Autotools to cause massive damage and fix minor bugs. Can you give me a pointer as to where to look for the Fortran tests the configure script runs? conftest.f90 is the "generic" name I assume Autotools uses for tests, so I'm guessing there is an... m4 script somewhere generating it? In config/ maybe?

Matt

On Thu, Dec 30, 2021 at 10:27 AM Jeff Squyres (jsquyres) wrote:
> Snarky comments from the NAG tech support people aside, if they could be a
> little more specific about what non-conformant Fortran code they're
> referring to, we'd be happy to work with them to get it fixed.
>
> I'm one of the few people in the Open MPI dev community who has a clue
> about Fortran, and I'm *very far* from being a Fortran expert. Modern
> Fortran is a legitimately complicated language. So it doesn't surprise me
> that we might have some code in our configure tests that isn't quite right.
>
> Let's also keep in mind that the state of F2008 support varies widely
> across compilers and versions. The current Open MPI configure tests
> straddle the line of trying to find *enough* F2008 support in a given
> compiler to be sufficient for the mpi_f08 module without being so overly
> proscriptive as to disqualify compilers that aren't fully F2008-compliant.
> Frankly, the state of F2008 support across the various Fortran compilers
> was a mess when we wrote those configure tests; we had to cobble together a
> variety of complicated tests to figure out if any given compiler supported
> enough F2008 for some / all of the mpi_f08 module. That's why the
> configure tests are... complicated.
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> From: users on behalf of Matt Thompson via users
> Sent: Thursday, December 23, 2021 11:41 AM
> To: Wadud Miah
> Cc: Matt Thompson; Open MPI Users
> Subject: Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
>
> I heard back from NAG:
>
> Regarding OpenMPI, we have attempted the build ourselves but cannot make
> sense of the configure script. Only the OpenMPI maintainers can do
> something about that, and it looks like they assume that all compilers will
> just swallow non-conforming Fortran code. The error-downgrading options for
> the NAG compiler remain "-dusty", "-mismatch" and "-mismatch_all", and none
> of them seem to help with the mpi_f08 module of OpenMPI. If there is a bug
> in the NAG Fortran Compiler that is responsible for this, we would love to
> hear about it, but at the moment we are not aware of such.
>
> So it might mean the configure script itself might need to be altered to
> use F2008-conforming code?
>
> On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah <wmiah...@gmail.com> wrote:
> You can contact NAG support at supp...@nag.co.uk but they will look into
> this in the new year.
>
> Regards,
>
> On Thu, 23 Dec 2021, 13:18 Matt Thompson via users,
> <users@lists.open-mpi.org> wrote:
> Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1
> on it... mainly because I haven't asked for it. Until NAG fix the bug we are
> seeing, I figured why bother the admins.
>
> Still, it does *seem* like it should work. I might ask NAG support about it.
> On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky <tkacv...@gmail.com> wrote:
> On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky <tkacv...@gmail.com> wrote:
> >
> > On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson <fort...@gmail.com> wrote:
> > >
> > > All,
> > >
> > > When I build Open MPI with NAG, I have to pass in:
> > >
> > > FCFLAGS="-mismatch_all -fpp"
> > >
> > > This flag tells nagfor to downgrade some errors with interfaces to warnings:
> > >
> > >   -mismatch_all
> > >     Further downgrade consistency checking of procedure argument lists
> > >     so that calls to routines in the same file which are incorrect will
> > >     produce warnings instead of error messages. This option disables -C=calls.
> > >
> > > The fpp flag is how you tell NAG to do preprocessing (it doesn't
> > > automatically do it with .F90 files).
> > >
> > > I also have to pass in a lot of other flags, as seen here:
> > >
> > > https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi
Jeff,

I'm not sure it'll happen. For understandable reasons (for Intel), I think Intel is not putting too much emphasis on supporting macOS. I guess since I had a workaround, I didn't press them. (Maybe the workaround has performance issues? I don't know, but I only ever run with macOS on laptops, so performance isn't primary for me yet.)

On Thu, Dec 30, 2021 at 10:15 AM Jeff Squyres (jsquyres) wrote:
> The conclusion we came to on that issue was that this was an issue with
> Intel ifort. Was anyone able to raise this with Intel ifort tech support?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> From: users on behalf of Matt Thompson via users
> Sent: Thursday, December 30, 2021 9:56 AM
> To: Open MPI Users
> Cc: Matt Thompson; Christophe Peyret
> Subject: Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi
>
> Oh yeah, I know that error. This is due to a long-standing issue with
> Intel on macOS and Open MPI:
>
> https://github.com/open-mpi/ompi/issues/7615
>
> You need to configure Open MPI with "lt_cv_ld_force_load=no" at the
> beginning.
(You can see an example at the top of my modulefile here:
https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/intel-clang-2022.0.0/openmpi/4.1.2.lua )

Matt

On Thu, Dec 30, 2021 at 5:47 AM Christophe Peyret via users
<users@lists.open-mpi.org> wrote:

> Hello,
>
> I have built openmpi-4.1.2 with the latest Intel oneAPI compilers, including
> Fortran, but I am facing problems at compilation:
>
> mpif90 toto.f90
>
> Undefined symbols for architecture x86_64:
>   "_ompi_buffer_detach_f08", referenced from:
>     import-atom in libmpi_usempif08.dylib
> ld: symbol(s) not found for architecture x86_64
>
> The library libmpi_usempif08.dylib is present in $MPI_DIR/lib.
>
> mpif90 -showme
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib -lmpi_usempif08
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
>
> If I remove -lmpi_usempif08 from that command line, it works!
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi toto.f90
>
> And the program runs:
>
> mpirun -n 4 a.out
> rank=2/4
> rank=3/4
> rank=0/4
> rank=1/4
>
> Annex: the program
>
> program toto
>   use mpi
>   implicit none
>   integer :: i
>   integer :: comm,rank,size,ierror
>   call mpi_init(ierror)
>   comm=MPI_COMM_WORLD
>   call mpi_comm_rank(comm, rank, ierror)
>   call mpi_comm_size(comm, size, ierror)
>   print '("rank=",i0,"/",i0)',rank,size
>   call mpi_finalize(ierror)
> end program toto
>
> --
> Christophe Peyret
> ONERA/DAAA/NFLU
> 29 ave de la Division Leclerc
> F92322 Châtillon Cedex

--
Matt Thompson
Re: [OMPI users] Mac OS + openmpi-4.1.2 + intel oneapi
Oh yeah, I know that error. This is due to a long-standing issue with Intel on macOS and Open MPI:

https://github.com/open-mpi/ompi/issues/7615

You need to configure Open MPI with "lt_cv_ld_force_load=no" at the beginning. (You can see an example at the top of my modulefile here:
https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/intel-clang-2022.0.0/openmpi/4.1.2.lua )

Matt

On Thu, Dec 30, 2021 at 5:47 AM Christophe Peyret via users
<users@lists.open-mpi.org> wrote:

> Hello,
>
> I have built openmpi-4.1.2 with the latest Intel oneAPI compilers, including
> Fortran, but I am facing problems at compilation:
>
> mpif90 toto.f90
>
> Undefined symbols for architecture x86_64:
>   "_ompi_buffer_detach_f08", referenced from:
>     import-atom in libmpi_usempif08.dylib
> ld: symbol(s) not found for architecture x86_64
>
> The library libmpi_usempif08.dylib is present in $MPI_DIR/lib.
>
> mpif90 -showme
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib -lmpi_usempif08
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
>
> If I remove -lmpi_usempif08 from that command line, it works!
>
> ifort -I/Users/chris/Applications/Intel/openmpi-4.1.2/include
> -Wl,-flat_namespace -Wl,-commons,use_dylibs
> -I/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -L/Users/chris/Applications/Intel/openmpi-4.1.2/lib
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi toto.f90
>
> And the program runs:
>
> mpirun -n 4 a.out
> rank=2/4
> rank=3/4
> rank=0/4
> rank=1/4
>
> Annex: the program
>
> program toto
>   use mpi
>   implicit none
>   integer :: i
>   integer :: comm,rank,size,ierror
>   call mpi_init(ierror)
>   comm=MPI_COMM_WORLD
>   call mpi_comm_rank(comm, rank, ierror)
>   call mpi_comm_size(comm, size, ierror)
>   print '("rank=",i0,"/",i0)',rank,size
>   call mpi_finalize(ierror)
> end program toto
>
> --
> Christophe Peyret
> ONERA/DAAA/NFLU
> 29 ave de la Division Leclerc
> F92322 Châtillon Cedex

--
Matt Thompson
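The workaround named in this thread can be written out as a complete configure invocation. A hedged sketch, assuming the Intel classic compilers and a placeholder prefix:

```shell
# Sketch of the workaround above for ompi issue #7615: set the libtool cache
# variable lt_cv_ld_force_load=no when configuring Open MPI with the Intel
# compilers on macOS, so libtool avoids the -force_load behavior that trips
# them. Prefix and job count are placeholders.
lt_cv_ld_force_load=no ./configure --prefix="$HOME/opt/openmpi-4.1.2" \
    CC=icc CXX=icpc FC=ifort F77=ifort
make -j4 && make install
```

Note the variable must be set in the environment of the configure run itself (it is an Autoconf/libtool cache variable, not a configure flag), which is why it appears before `./configure` on the command line.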
Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
I heard back from NAG:

Regarding OpenMPI, we have attempted the build ourselves but cannot make sense of the configure script. Only the OpenMPI maintainers can do something about that, and it looks like they assume that all compilers will just swallow non-conforming Fortran code. The error-downgrading options for the NAG compiler remain "-dusty", "-mismatch" and "-mismatch_all", and none of them seem to help with the mpi_f08 module of OpenMPI. If there is a bug in the NAG Fortran Compiler that is responsible for this, we would love to hear about it, but at the moment we are not aware of such.

So it might mean the configure script itself might need to be altered to use F2008-conforming code?

On Thu, Dec 23, 2021 at 8:31 AM Wadud Miah wrote:
> You can contact NAG support at supp...@nag.co.uk but they will look into
> this in the new year.
>
> Regards,
>
> On Thu, 23 Dec 2021, 13:18 Matt Thompson via users,
> <users@lists.open-mpi.org> wrote:
>
>> Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1
>> on it... mainly because I haven't asked for it. Until NAG fix the bug we are
>> seeing, I figured why bother the admins.
>>
>> Still, it does *seem* like it should work. I might ask NAG support about it.
>>
>> On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky wrote:
>>
>>> On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky wrote:
>>> >
>>> > On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson wrote:
>>> > >
>>> > > All,
>>> > >
>>> > > When I build Open MPI with NAG, I have to pass in:
>>> > >
>>> > > FCFLAGS="-mismatch_all -fpp"
>>> > >
>>> > > This flag tells nagfor to downgrade some errors with interfaces to warnings:
>>> > >
>>> > >   -mismatch_all
>>> > >     Further downgrade consistency checking of procedure argument lists
>>> > >     so that calls to routines in the same file which are incorrect will
>>> > >     produce warnings instead of error messages. This option disables -C=calls.
>>> > > The fpp flag is how you tell NAG to do preprocessing (it doesn't
>>> > > automatically do it with .F90 files).
>>> > >
>>> > > I also have to pass in a lot of other flags, as seen here:
>>> > >
>>> > > https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
>>> > >
>>> > > Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a
>>> > > bug with a library I depend on, but it does promise better F2008 support.
>>> > > To see what happens, I tried myself and added --enable-mpi-fortran=all, but:
>>> > >
>>> > > checking if building Fortran 'use mpi_f08' bindings... no
>>> > > configure: error: Cannot build requested Fortran bindings, aborting
>>> > >
>>> > > Unfortunately, the NAG Fortran guru I work with is off until the new
>>> > > year. When he comes back, I might ask him about this. He might know
>>> > > something we can do to make NAG happy with mpif08.
>>> >
>>> > The very curious thing about this is that with NAG 7.1, mpif08
>>> > configured properly with the macOS (Intel architecture) flavor of
>>> > it. But as this thread seems to indicate, it barfs on Linux. Just
>>> > an extra data point.
>>>
>>> I'd like to recall that statement; I was not looking at the config.log
>>> carefully enough. I see this still, even on macOS:
>>>
>>> checking if building Fortran 'use mpi_f08' bindings... no

--
Matt Thompson
Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
Oh. Yes, I am on macOS. The Linux cluster I work on doesn't have NAG 7.1 on it... mainly because I haven't asked for it. Until NAG fix the bug we are seeing, I figured why bother the admins.

Still, it does *seem* like it should work. I might ask NAG support about it.

On Wed, Dec 22, 2021 at 6:28 PM Tom Kacvinsky wrote:
> On Wed, Dec 22, 2021 at 5:45 PM Tom Kacvinsky wrote:
> >
> > On Wed, Dec 22, 2021 at 4:11 PM Matt Thompson wrote:
> > >
> > > All,
> > >
> > > When I build Open MPI with NAG, I have to pass in:
> > >
> > > FCFLAGS="-mismatch_all -fpp"
> > >
> > > This flag tells nagfor to downgrade some errors with interfaces to warnings:
> > >
> > >   -mismatch_all
> > >     Further downgrade consistency checking of procedure argument lists
> > >     so that calls to routines in the same file which are incorrect will
> > >     produce warnings instead of error messages. This option disables -C=calls.
> > >
> > > The fpp flag is how you tell NAG to do preprocessing (it doesn't
> > > automatically do it with .F90 files).
> > >
> > > I also have to pass in a lot of other flags, as seen here:
> > >
> > > https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua
> > >
> > > Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a bug
> > > with a library I depend on, but it does promise better F2008 support. To
> > > see what happens, I tried myself and added --enable-mpi-fortran=all, but:
> > >
> > > checking if building Fortran 'use mpi_f08' bindings... no
> > > configure: error: Cannot build requested Fortran bindings, aborting
> > >
> > > Unfortunately, the NAG Fortran guru I work with is off until the new
> > > year. When he comes back, I might ask him about this. He might know
> > > something we can do to make NAG happy with mpif08.
> >
> > The very curious thing about this is that with NAG 7.1, mpif08
> > configured properly with the macOS (Intel architecture) flavor of
> > it. But as this thread seems to indicate, it barfs on Linux. Just
> > an extra data point.
>
> I'd like to recall that statement; I was not looking at the config.log
> carefully enough. I see this still, even on macOS:
>
> checking if building Fortran 'use mpi_f08' bindings... no

--
Matt Thompson
Re: [OMPI users] NAG Fortran 2018 bindings with Open MPI 4.1.2
All,

When I build Open MPI with NAG, I have to pass in:

FCFLAGS="-mismatch_all -fpp"

This flag tells nagfor to downgrade some errors with interfaces to warnings:

   -mismatch_all
     Further downgrade consistency checking of procedure argument lists
     so that calls to routines in the same file which are incorrect will
     produce warnings instead of error messages. This option disables -C=calls.

The fpp flag is how you tell NAG to do preprocessing (it doesn't automatically do it with .F90 files).

I also have to pass in a lot of other flags, as seen here:

https://github.com/mathomp4/parcelmodulefiles/blob/main/Compiler/nag-7.1_7101/openmpi/4.1.2.lua

Now I hadn't yet tried NAG 7.1 with Open MPI because NAG 7.1 has a bug with a library I depend on, but it does promise better F2008 support. To see what happens, I tried myself and added --enable-mpi-fortran=all, but:

checking if building Fortran 'use mpi_f08' bindings... no
configure: error: Cannot build requested Fortran bindings, aborting

Unfortunately, the NAG Fortran guru I work with is off until the new year. When he comes back, I might ask him about this. He might know something we can do to make NAG happy with mpif08.

Matt

On Wed, Dec 22, 2021 at 3:44 PM Tom Kacvinsky via users <users@lists.open-mpi.org> wrote:
> On Wed, Dec 22, 2021 at 8:54 AM Tom Kacvinsky wrote:
> >
> > On Wed, Dec 22, 2021 at 8:48 AM Wadud Miah via users wrote:
> > >
> > > Hi,
> > >
> > > I tried using the NAG compiler 7.2, which is fully Fortran 2008
> > > compliant, but the Open MPI configure script shows that it will not
> > > build the Fortran 2008 MPI bindings:
> > >
> > > $ FC=nagfor ./configure --prefix=/usr/local/openmpi-4.1.2
> > > [ ... ]
> > > checking if building Fortran 'use mpi_f08' bindings... no
> > > Build MPI Fortran bindings: mpif.h, use mpi
> > >
> > > Could someone please look into this?
> >
> > Would you provide the config.log from running configure? That would
> > help diagnose the problem. Oftentimes, you will see what the error is
> > when checking for certain features.
>
> I was sent the config.log off list, and I spotted this:
>
> configure:69595: result: no
> configure:69673: checking for Fortran compiler support of !$PRAGMA IGNORE_TKR
> configure:69740: nagfor -c -f2008 -dusty -mismatch conftest.f90 >&5
> NAG Fortran Compiler Release 7.1(Hanzomon) Build 7101
> Evaluation trial version of NAG Fortran Compiler Release 7.1(Hanzomon) Build 7101
> Questionable: conftest.f90, line 52: Variable A set but never referenced
> Warning: conftest.f90, line 52: Pointer PTR never dereferenced
> Error: conftest.f90, line 39: Incorrect data type REAL (expected CHARACTER)
> for argument BUFFER (no. 1) of FOO
> Error: conftest.f90, line 50: Incorrect data type INTEGER (expected CHARACTER)
> for argument BUFFER (no. 1) of FOO
> [NAG Fortran Compiler error termination, 2 errors, 2 warnings]
> configure:69740: $? = 2
>
> So I suspect this makes the Fortran checks unhappy, so that the
> configure logic (as far as I could see) wouldn't check for 2018
> binding support.
>
> So apparently, the new NAG Fortran compiler is really fussy.
>
> There is not much more I can do with this, as I am nowhere near
> competent in Fortran coding.

--
Matt Thompson
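The config.log inspection described in this thread can be scripted. This is an illustrative sketch only: a miniature stand-in config.log is fabricated here so the commands are runnable; in a real build the file sits in the top of the build directory and the line numbers differ.

```shell
# Sketch: locate the failed nagfor conftest compilations in config.log.
# We fabricate a tiny stand-in config.log (contents abridged from the
# thread above) so the search commands below actually run.
cat > config.log <<'EOF'
configure:69673: checking for Fortran compiler support of !$PRAGMA IGNORE_TKR
configure:69740: nagfor -c -f2008 -dusty -mismatch conftest.f90 >&5
Error: conftest.f90, line 39: Incorrect data type REAL (expected CHARACTER) for argument BUFFER (no. 1) of FOO
Error: conftest.f90, line 50: Incorrect data type INTEGER (expected CHARACTER) for argument BUFFER (no. 1) of FOO
configure:69740: $? = 2
EOF
# Show each compiler invocation together with the errors that follow it:
grep -n -E '^(configure:[0-9]+: nagfor|Error:)' config.log
```

Pairing each `configure:NNNNN: nagfor ...` invocation with the `Error:` lines that follow it is usually enough to identify which conftest.f90 the compiler rejected.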
Re: [OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)
As for the static build, this:

/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/bin/mpicc ~/MPITests/helloWorld.c -lopen-orted-mpir

does work, but there is no way I could change all our model scripting to add that flag all over the place. Is there a way to "stuff" -lopen-orted-mpir into the main wrappers:

❯ /Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/bin/mpicc -show
gcc -I/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/include -L/Users/mathomp4/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static/lib -lmpi -lopen-rte -lopen-pal -lm -lz

so that I don't need to remember it every time?

On Fri, Oct 29, 2021 at 10:32 AM Matt Thompson wrote:
> So, an update. Nothing I seem to do with libtool seems to help, but I'm
> trying various things. For example, I tried editing libtool to use:
>
> -Wl,-Wl,,-dynamiclib
>
> and also:
>
> -Wl,-Wl,,-install_name -Wl,-Wl,,\$rpath/\$soname
>
> as that also threw an error (not a NAG flag). And when I did that, I get this:
>
> clang: error: no such file or directory:
> '/Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib'
>
> Somehow it's looking for a file in the install path, and I think it's
> because of an extra space. When I run with make V=1:
>
> ,-Wl,,-install_name -Wl,-Wl,,
> /Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib
>
> That space between the -Wl,, and the /Users... install path is causing issues.
>
> Sigh. I guess it's time to try and figure out how to get the static build to work.
>
> On Fri, Oct 29, 2021 at 8:24 AM Matt Thompson wrote:
>
>> Gilles,
>>
>> I tried both NAG 7.0.7062 and 7.0.7048. Both fail in the same way. And I
>> was using the official tarball from the Open MPI website. I downloaded it
>> long ago and then kept it around.
>>
>> And I didn't run autogen.pl, no, but I could try that.
>>
>> And... I do see CC=nagfor in the libtool file.
>> But, why would libtool *ever* use CC="nagfor"? I mean, I see it in the
>> file, but I specifically told configure that CC=gcc (or CC=clang). Only
>> FC should be nagfor.
>>
>> On Thu, Oct 28, 2021 at 8:51 PM Gilles Gouaillardet via users
>> <users@lists.open-mpi.org> wrote:
>>
>>> Matt,
>>>
>>> did you build the same Open MPI 4.1.1 from an official tarball with the
>>> previous NAG Fortran?
>>> did you run autogen.pl (--force)?
>>>
>>> Just to be sure, can you rerun the same test with the previous NAG version?
>>>
>>> When using static libraries, you can try manually linking with
>>> -lopen-orted-mpir and see if it helps.
>>> If you want to use shared libraries, I would try to run configure,
>>> and then edit the generated libtool file:
>>> look for a line like
>>>
>>> CC="nagfor"
>>>
>>> and then edit the next line
>>>
>>> # Commands used to build a shared archive.
>>> archive_cmds="\$CC -dynamiclib \$allow_undef ..."
>>>
>>> simply manually remove "-dynamiclib" here and see if it helps
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Fri, Oct 29, 2021 at 12:30 AM Matt Thompson via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>>> Dear Open MPI Gurus,
>>>>
>>>> This is a... confusing one. For some reason, I cannot build a working
>>>> Open MPI with NAG 7.0.7062 and clang on my MacBook running macOS 11.6.1.
>>>> The thing is, I could do this back in July with NAG 7.0.7048. So my fear is
>>>> that something changed with macOS, or clang/xcode, or something in between.
>>>>
>>>> So here are the symptoms. I usually build with a few extra flags that
>>>> I've always carried around, but for now I'm going to go basic. First, I try
>>>> to build Open MPI in a basic way:
>>>>
>>>> ../configure FCFLAGS="-mismatch_all -fpp" CC=clang CXX=clang++ FC=nagfor
>>>> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-basic |& tee configure.log
>>>>
>>>> Note that the FCFLAGS are needed for NAG since it doesn't preprocess
>>>> .F90 files by default (so -fpp) and it can be *very* strict with interfaces;
>>>> any slight interface difference is an error, so we use -mismatch_all.
>>>>
>>>> Now with this configure line, I the
Re: [OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)
So, an update. Nothing I seem to do with libtool seems to help, but I'm trying various things. For example, I tried editing libtool to use:

-Wl,-Wl,,-dynamiclib

and also:

-Wl,-Wl,,-install_name -Wl,-Wl,,\$rpath/\$soname

as that also threw an error (not a NAG flag). And when I did that, I get this:

clang: error: no such file or directory:
'/Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib'

Somehow it's looking for a file in the install path, and I think it's because of an extra space. When I run with make V=1:

,-Wl,,-install_name -Wl,-Wl,,
/Users/mathomp4/installed/Compiler/nag-7.0_7048/openmpi/4.1.1-basic-gilles/lib/libmpi_usempi.40.dylib

That space between the -Wl,, and the /Users... install path is causing issues.

Sigh. I guess it's time to try and figure out how to get the static build to work.

On Fri, Oct 29, 2021 at 8:24 AM Matt Thompson wrote:

> Gilles,
>
> I tried both NAG 7.0.7062 and 7.0.7048. Both fail in the same way. And I
> was using the official tarball from the Open MPI website. I downloaded it
> long ago and then kept it around.
>
> And I didn't run autogen.pl, no, but I could try that.
>
> And... I do see CC=nagfor in the libtool file. But, why would libtool
> *ever* use CC="nagfor"? I mean, I see it in the file, but I specifically
> told configure that CC=gcc (or CC=clang). Only FC should be nagfor.
>
> On Thu, Oct 28, 2021 at 8:51 PM Gilles Gouaillardet via users
> <users@lists.open-mpi.org> wrote:
>
>> Matt,
>>
>> did you build the same Open MPI 4.1.1 from an official tarball with the
>> previous NAG Fortran?
>> did you run autogen.pl (--force)?
>>
>> Just to be sure, can you rerun the same test with the previous NAG version?
>>
>> When using static libraries, you can try manually linking with
>> -lopen-orted-mpir and see if it helps.
>> If you want to use shared libraries, I would try to run configure, >> and then edit the generated libtool file: >> look a line like >> >> CC="nagfor" >> >> and then edit the next line >> >> >> # Commands used to build a shared archive. >> >> archive_cmds="\$CC -dynamiclib \$allow_undef ..." >> >> simply manually remove "-dynamiclib" here and see if it helps >> >> >> Cheers, >> >> Gilles >> On Fri, Oct 29, 2021 at 12:30 AM Matt Thompson via users < >> users@lists.open-mpi.org> wrote: >> >>> Dear Open MPI Gurus, >>> >>> This is a...confusing one. For some reason, I cannot build a working >>> Open MPI with NAG 7.0.7062 and clang on my MacBook running macOS 11.6.1. >>> The thing is, I could do this back in July with NAG 7.0.7048. So my fear is >>> that something changed with macOS, or clang/xcode, or something in between. >>> >>> So here are the symptoms, I usually build with a few extra flags that >>> I've always carried around but for now I'm going to go basic. First, I try >>> to build Open MPI in a basic way: >>> >>> ../configure FCFLAGS"=-mismatch_all -fpp" CC=clang CXX=clang++ FC=nagfor >>> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-basic |& tee >>> configure.log >>> >>> Note that the FCFLAGS are needed for NAG since it doesn't preprocess >>> .F90 files by default (so -fpp) and it can be *very* strict with interfaces >>> and any slight interface difference is an error so we use -mismatch_all. 
>>> >>> Now with this configure line, I then build and: >>> >>> Making all in mpi/fortran/use-mpi-tkr >>> make[2]: Entering directory >>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr' >>> FCLD libmpi_usempi.la >>> NAG Fortran Compiler Release 7.0(Yurakucho) Build 7062 >>> Option error: Unrecognised option -dynamiclib >>> make[2]: *** [Makefile:1966: libmpi_usempi.la] Error 2 >>> make[2]: Leaving directory >>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr' >>> make[1]: *** [Makefile:3555: all-recursive] Error 1 >>> make[1]: Leaving directory >>> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi' >>> make: *** [Makefile:1901: all-recursive] Error 1 >>> >>> For some reason, the make system is trying to pass a clang option, >>> -dynamiclib, to nagfor and it fails. With verbose on: >>> >>> libtool: link: nagfor -dynamiclib -Wl,-Wl,,-undefined >>> -Wl,-Wl,,dynamic_lookup -o .libs/libmpi_usempi.40.dylib .libs/mpi.o >>> .libs/mp
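Gilles's libtool edit quoted above can be sketched concretely (hedged: `libtool.demo` is a stand-in file for illustration; on a real build tree you would edit the `archive_cmds` line in the generated `./libtool` script, ideally only in its Fortran tag section, and the exact line varies between libtool versions):

```shell
# Demo of the suggested workaround on a one-line stand-in for the generated
# libtool script: strip the clang-only -dynamiclib flag so it is never
# handed to nagfor when libtool links the Fortran shared library.
printf 'archive_cmds="$CC -dynamiclib $allow_undef ..."\n' > libtool.demo

# Remove the offending flag in place (keeps a .bak copy of the original).
sed -i.bak 's/ -dynamiclib//g' libtool.demo

cat libtool.demo   # -> archive_cmds="$CC $allow_undef ..."
```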
Re: [OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)
Gilles, I tried both NAG 7.0.7062 and 7.0.7048. Both fail in the same way. And I was using the official tarball from the Open MPI website. I downloaded it long ago and then kept it around. And I didn't run autogen.pl, no, but I could try that. And...I do see CC=nagfor in the libtool file. But, why would libtool *ever* use CC="nagfor"? I mean, I see it in the file, but I specifically told configure that CC=gcc (or CC=clang). Only FC should be nagfor. On Thu, Oct 28, 2021 at 8:51 PM Gilles Gouaillardet via users < users@lists.open-mpi.org> wrote: > Matt, > > did you build the same Open MPI 4.1.1 from an official tarball with the > previous NAG Fortran? > did you run autogen.pl (--force) ? > > Just to be sure, can you rerun the same test with the previous NAG version? > > > When using static libraries, you can try manually linking with > -lopen-orted-mpir and see if it helps. > If you want to use shared libraries, I would try to run configure, > and then edit the generated libtool file: > look a line like > > CC="nagfor" > > and then edit the next line > > > # Commands used to build a shared archive. > > archive_cmds="\$CC -dynamiclib \$allow_undef ..." > > simply manually remove "-dynamiclib" here and see if it helps > > > Cheers, > > Gilles > On Fri, Oct 29, 2021 at 12:30 AM Matt Thompson via users < > users@lists.open-mpi.org> wrote: > >> Dear Open MPI Gurus, >> >> This is a...confusing one. For some reason, I cannot build a working Open >> MPI with NAG 7.0.7062 and clang on my MacBook running macOS 11.6.1. The >> thing is, I could do this back in July with NAG 7.0.7048. So my fear is >> that something changed with macOS, or clang/xcode, or something in between. >> >> So here are the symptoms, I usually build with a few extra flags that >> I've always carried around but for now I'm going to go basic. 
First, I try >> to build Open MPI in a basic way: >> >> ../configure FCFLAGS"=-mismatch_all -fpp" CC=clang CXX=clang++ FC=nagfor >> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-basic |& tee >> configure.log >> >> Note that the FCFLAGS are needed for NAG since it doesn't preprocess .F90 >> files by default (so -fpp) and it can be *very* strict with interfaces and >> any slight interface difference is an error so we use -mismatch_all. >> >> Now with this configure line, I then build and: >> >> Making all in mpi/fortran/use-mpi-tkr >> make[2]: Entering directory >> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr' >> FCLD libmpi_usempi.la >> NAG Fortran Compiler Release 7.0(Yurakucho) Build 7062 >> Option error: Unrecognised option -dynamiclib >> make[2]: *** [Makefile:1966: libmpi_usempi.la] Error 2 >> make[2]: Leaving directory >> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi/mpi/fortran/use-mpi-tkr' >> make[1]: *** [Makefile:3555: all-recursive] Error 1 >> make[1]: Leaving directory >> '/Users/mathomp4/src/MPI/openmpi-4.1.1/build-basic/ompi' >> make: *** [Makefile:1901: all-recursive] Error 1 >> >> For some reason, the make system is trying to pass a clang option, >> -dynamiclib, to nagfor and it fails. With verbose on: >> >> libtool: link: nagfor -dynamiclib -Wl,-Wl,,-undefined >> -Wl,-Wl,,dynamic_lookup -o .libs/libmpi_usempi.40.dylib .libs/mpi.o >> .libs/mpi_aint_add_f90.o .libs/mpi_aint_diff_f90.o >> .libs/mpi_comm_spawn_multiple_f90.o .libs/mpi_testall_f90.o >> .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o .libs/mpi_waitsome_f90.o >> .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.o .libs/mpi-tkr-sizeof.o... 
>> >> As a test, I tried the same thing with NAG 7.0.7048 (which worked in >> July) and I get the same issue: >> >> Option error: Unrecognised option -dynamiclib >> >> Note, that Intel Fortran and Gfortran *do* support this flag, but NAG has >> something like: >> >>-Bbinding Specify static or dynamic binding. This only has >> effect if specified during the link phase. The default is dynamic binding. >> >> but maybe the Open MPI system doesn't know NAG? >> >> So I say to myself, okay, dynamiclib is a shared library sounding thing, >> so let's try static library build! So, following the documentation I try: >> >> ../configure --enable-static -disable-shared FCFLAGS"=-mismatch_all -fpp" >> CC=gcc CXX=g++ FC=nagfor >> --prefix=$HOME/installed/Compiler/nag-7.0_7062/openmpi/4.1.1-static |& tee >> configure.log >> >> and it builds! Yay! And then I try to b
[OMPI users] Cannot build working Open MPI 4.1.1 with NAG Fortran/clang on macOS (but I could before!)
ugger_init_after_spawn in libopen-rte.a(orted_submit.o) "_MPIR_server_arguments", referenced from: _orte_debugger_init_after_spawn in libopen-rte.a(orted_submit.o) _setup_debugger_job in libopen-rte.a(orted_submit.o) _run_debugger in libopen-rte.a(orted_submit.o) ld: symbol(s) not found for architecture x86_64 clang: error: linker command failed with exit code 1 (use -v to see invocation) So...yeah. ¯\_(ツ)_/¯ Maybe this needs -Bstatic?? But again, all this worked with shared a few months ago (I've never tried static until now) and NAG has *never* supported -dynamiclib as far as I know. I do see references to -Bstatic and -Bdynamic in the source code, but apparently I'm not triggering the configure step to use them? Anyone else out there encounter this? NOTE: I did try doing an Intel Fortran + Clang shared build today and that seemed to work. I think that's because Intel Fortran recognizes -dynamiclib so it can get past that FCLD step. -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
Re: [OMPI users] [External] Help with MPI and macOS Firewall
Gilles, For some odd reason, 'self, vader' didn't seem as effective as "^tcp". Not sure why, but at least I have something that seems to work. I suppose I don't really need tcp sockets on a single laptop :D Matt On Thu, Mar 18, 2021 at 8:46 PM Gilles Gouaillardet via users < users@lists.open-mpi.org> wrote: > Matt, > > you can either > > mpirun --mca btl self,vader ... > > or > > export OMPI_MCA_btl=self,vader > mpirun ... > > you may also add > btl = self,vader > in your /etc/openmpi-mca-params.conf > and then simply > > mpirun ... > > Cheers, > > Gilles > > On Fri, Mar 19, 2021 at 5:44 AM Matt Thompson via users > wrote: > > > > Prentice, > > > > Ooh. The first one seems to work. The second one apparently is not liked > by zsh and I had to do: > > ❯ mpirun -mca btl '^tcp' -np 6 ./helloWorld.mpi3.exe > > Compiler Version: GCC version 10.2.0 > > MPI Version: 3.1 > > MPI Library Version: Open MPI v4.1.0, package: Open MPI > mathomp4@gs6101-parcel.local Distribution, ident: 4.1.0, repo rev: > v4.1.0, Dec 18, 2020 > > > > Next question: is this: > > > > OMPI_MCA_btl='self,vader' > > > > the right environment variable translation of that command-line option? > > > > On Thu, Mar 18, 2021 at 3:40 PM Prentice Bisbal via users < > users@lists.open-mpi.org> wrote: > >> > >> OpenMPI should only be using shared memory on the local host > automatically, but maybe you need to force it. > >> > >> I think > >> > >> mpirun -mca btl self,vader ... > >> > >> should do that. > >> > >> or you can exclude tcp instead > >> > >> mpirun -mca btl ^tcp > >> > >> See > >> > >> https://www.open-mpi.org/faq/?category=sm > >> > >> for more info. > >> > >> Prentice > >> > >> On 3/18/21 12:28 PM, Matt Thompson via users wrote: > >> > >> All, > >> > >> This isn't specifically an Open MPI issue, but as that is the MPI stack > I use on my laptop, I'm hoping someone here might have a possible solution. > (I am pretty sure something like MPICH would trigger this as well.) 
> >> > >> Namely, my employer recently did something somewhere so that now *any* > MPI application I run will throw popups like this one: > >> > >> > https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png > >> > >> though for me it's asking about "orterun" and "helloworld.mpi3.exe", > etc. I essentially get one-per-process. > >> > >> If I had sudo access, I suppose I could just keep clicking "Allow" for > every program, but I don't and I compile lots of programs with different > names. > >> > >> So, I was hoping maybe an Open MPI guru out there knew of an MCA thing > I could use to avoid them? This is all isolated on-my-laptop MPI I'm doing, > so at most an "mpirun --oversubscribe -np 12" or something. It'll never go > over my network to anything, etc. > >> > >> -- > >> Matt Thompson > >>“The fact is, this is about us identifying what we do best and > >>finding more ways of doing less of it better” -- Director of Better > Anna Rampton > > > > > > > > -- > > Matt Thompson > >“The fact is, this is about us identifying what we do best and > >finding more ways of doing less of it better” -- Director of Better > Anna Rampton > -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
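For reference, the three equivalent ways Gilles describes of setting an MCA parameter, collected as one sketch (hedged: `vader` is the shared-memory BTL name in the 3.x/4.x series — it was renamed `sm` in 5.x — and the binary name is the one from the thread):

```shell
# 1. Per-invocation, on the command line:
mpirun --mca btl self,vader -np 6 ./helloWorld.mpi3.exe

# 2. Per-shell, via the OMPI_MCA_<param> environment-variable convention:
export OMPI_MCA_btl=self,vader
mpirun -np 6 ./helloWorld.mpi3.exe

# 3. Persistently, in an MCA parameter file (per-user shown; the system-wide
#    file is <prefix>/etc/openmpi-mca-params.conf):
echo 'btl = self,vader' >> "$HOME/.openmpi/mca-params.conf"
```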
Re: [OMPI users] [External] Help with MPI and macOS Firewall
Prentice, Ooh. The first one seems to work. The second one apparently is not liked by zsh and I had to do: ❯ mpirun -mca btl '^tcp' -np 6 ./helloWorld.mpi3.exe Compiler Version: GCC version 10.2.0 MPI Version: 3.1 MPI Library Version: Open MPI v4.1.0, package: Open MPI mathomp4@gs6101-parcel.local Distribution, ident: 4.1.0, repo rev: v4.1.0, Dec 18, 2020 Next question: is this: OMPI_MCA_btl='self,vader' the right environment variable translation of that command-line option? On Thu, Mar 18, 2021 at 3:40 PM Prentice Bisbal via users < users@lists.open-mpi.org> wrote: > OpenMPI should only be using shared memory on the local host > automatically, but maybe you need to force it. > > I think > > mpirun -mca btl self,vader ... > > should do that. > > or you can exclude tcp instead > > mpirun -mca btl ^tcp > > See > > https://www.open-mpi.org/faq/?category=sm > > for more info. > > Prentice > > On 3/18/21 12:28 PM, Matt Thompson via users wrote: > > All, > > This isn't specifically an Open MPI issue, but as that is the MPI stack I > use on my laptop, I'm hoping someone here might have a possible solution. > (I am pretty sure something like MPICH would trigger this as well.) > > Namely, my employer recently did something somewhere so that now *any* MPI > application I run will throw popups like this one: > > > https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png > > though for me it's asking about "orterun" and "helloworld.mpi3.exe", etc. > I essentially get one-per-process. > > If I had sudo access, I suppose I could just keep clicking "Allow" for > every program, but I don't and I compile lots of programs with different > names. > > So, I was hoping maybe an Open MPI guru out there knew of an MCA thing I > could use to avoid them? This is all isolated on-my-laptop MPI I'm doing, > so at most an "mpirun --oversubscribe -np 12" or something. It'll never go > over my network to anything, etc. 
> > -- > Matt Thompson >“The fact is, this is about us identifying what we do best and >finding more ways of doing less of it better” -- Director of Better > Anna Rampton > > -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
[OMPI users] Help with MPI and macOS Firewall
All, This isn't specifically an Open MPI issue, but as that is the MPI stack I use on my laptop, I'm hoping someone here might have a possible solution. (I am pretty sure something like MPICH would trigger this as well.) Namely, my employer recently did something somewhere so that now *any* MPI application I run will throw popups like this one: https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png though for me it's asking about "orterun" and "helloworld.mpi3.exe", etc. I essentially get one-per-process. If I had sudo access, I suppose I could just keep clicking "Allow" for every program, but I don't and I compile lots of programs with different names. So, I was hoping maybe an Open MPI guru out there knew of an MCA thing I could use to avoid them? This is all isolated on-my-laptop MPI I'm doing, so at most an "mpirun --oversubscribe -np 12" or something. It'll never go over my network to anything, etc. -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI
Adam, A couple questions. First, is seccomp the reason you think I have the MPI_THREAD_MULTIPLE error? Or is it more for the vader error? If so, the environment variable Nathan provided is probably enough. These are unit tests and should execute in seconds at most (building them takes 10x-100x more time). But if it can help with the MPI_THREAD_MULTIPLE error, can you help translate that to "Fortran programmer who really can only do docker build/run/push/cp" for me? I found this page: https://docs.docker.com/engine/security/seccomp/ that I'm trying to read through and understand, but I'm mainly learning I should be looking at taking some Docker training soon! On Mon, Feb 24, 2020 at 8:24 PM Adam Simpson wrote: > Calls to process_vm_readv() and process_vm_writev() are disabled in the > default Docker seccomp profile > <https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>. > You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the > seccomp profile so that process_vm_readv and process_vm_writev are > whitelisted, by adding them to the syscalls.names list. > > You can also disable seccomp, and several other confinement and security > features, if you prefer a heavy handed approach: > > $ docker run --privileged --security-opt label=disable --security-opt > seccomp=unconfined --security-opt apparmor=unconfined --ipc=host > --network=host ... > > If you're still having trouble after fixing the above you may need to > check yama on the host. You can check with "sysctl -w > kernel.yama.ptrace_scope", if it returns a value other than 0 you may > need to disable it with "sysctl -w kernel.yama.ptrace_scope=0". 
> > Adam > > -- > *From:* users on behalf of Matt > Thompson via users > *Sent:* Monday, February 24, 2020 5:15 PM > *To:* Open MPI Users > *Cc:* Matt Thompson > *Subject:* Re: [OMPI users] Help with One-Sided Communication: Works in > Intel MPI, Fails in Open MPI > > *External email: Use caution opening links or attachments* > Nathan, > > The reproducer would be that code that's on the Intel website. That is > what I was running. You could pull my image if you like but...since you are > the genius: > > [root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none > -np 2 ./a.out > > Rank 0 running on adac3ce0cf32 > Rank 1 running on adac3ce0cf32 > Rank 0 sets data in the shared memory: 00 01 02 03 > Rank 1 sets data in the shared memory: 10 11 12 13 > Rank 0 gets data from the shared memory: 10 11 12 13 > Rank 0 has new data in the shared memory: 00 01 02 03 > Rank 1 gets data from the shared memory: 00 01 02 03 > Rank 1 has new data in the shared memory: 10 11 12 13 > > And knowing this led to: https://github.com/open-mpi/ompi/issues/4948 > > So, good news is that setting export > OMPI_MCA_btl_vader_single_copy_mechanism=none let's a lot of stuff work. > The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not > like it: > > Start 2: pFIO_tests_mpi > > 2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe" > "/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso" > "6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi" > 2: Test timeout computed to be: 1500 > 2: > -- > 2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this > release. > 2: Workarounds are to run on a single node, or to use a system with an RDMA > 2: capable network such as Infiniband. 
> 2: > -- > 2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create > 2: [adac3ce0cf32:03619] *** reported by process [270073857,16] > 2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3 > 2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window > 2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > 2: [adac3ce0cf32:03619] ***and potentially your MPI job) > 2: [adac3ce0cf32:03587] 17 more processes have sent help message > help-osc-pt2pt.txt / mpi-thread-multiple-not-supported > 2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0 > to see all help / error messages > 2: [adac3ce0cf32:03587] 17 more processes have sent help message > help-mpi-errors.txt / mpi_errors_are_fatal > 2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec > > 40% tests passed, 3 tests failed out of
Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI
Nathan, The reproducer would be that code that's on the Intel website. That is what I was running. You could pull my image if you like but...since you are the genius: [root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none -np 2 ./a.out Rank 0 running on adac3ce0cf32 Rank 1 running on adac3ce0cf32 Rank 0 sets data in the shared memory: 00 01 02 03 Rank 1 sets data in the shared memory: 10 11 12 13 Rank 0 gets data from the shared memory: 10 11 12 13 Rank 0 has new data in the shared memory: 00 01 02 03 Rank 1 gets data from the shared memory: 00 01 02 03 Rank 1 has new data in the shared memory: 10 11 12 13 And knowing this led to: https://github.com/open-mpi/ompi/issues/4948 So, the good news is that setting export OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work. The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not like it: Start 2: pFIO_tests_mpi 2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe" "/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso" "6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi" 2: Test timeout computed to be: 1500 2: -- 2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release. 2: Workarounds are to run on a single node, or to use a system with an RDMA 2: capable network such as Infiniband. 
2: -- 2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create 2: [adac3ce0cf32:03619] *** reported by process [270073857,16] 2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3 2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window 2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, 2: [adac3ce0cf32:03619] *** and potentially your MPI job) 2: [adac3ce0cf32:03587] 17 more processes have sent help message help-osc-pt2pt.txt / mpi-thread-multiple-not-supported 2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages 2: [adac3ce0cf32:03587] 17 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal 2/5 Test #2: pFIO_tests_mpi ... ***Failed 0.18 sec 40% tests passed, 3 tests failed out of 5 Total Test time (real) = 1.08 sec The following tests FAILED: 2 - pFIO_tests_mpi (Failed) 3 - pFIO_tests_simple (Failed) 4 - pFIO_tests_hybrid (Failed) Errors while running CTest The weird thing is, I *am* running on one node (it's all I have, I'm not fancy enough at AWS to try more yet) and ompi_info does mention MPI_THREAD_MULTIPLE: [root@adac3ce0cf32 build]# ompi_info | grep -i mult Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes) Any ideas on this one? On Mon, Feb 24, 2020 at 7:24 PM Nathan Hjelm via users < users@lists.open-mpi.org> wrote: > The error is from btl/vader. CMA is not functioning as expected. It might > work if you set btl_vader_single_copy_mechanism=none > > Performance will suffer though. It would be worth understanding why > process_vm_readv is failing. > > Can you send a simple reproducer? > > -Nathan > > On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users < > users@lists.open-mpi.org> wrote: > > > > I am not an expert for the one-sided code in Open MPI, I wanted to comment > briefly on the potential MPI-IO related item. 
As far as I can see, the > error message > > > > “Read -1, expected 48, errno = 1” > > does not stem from MPI I/O, at least not from the ompio library. What file > system did you use for these tests? > > > > Thanks > > Edgar > > > > *From:* users *On Behalf Of *Matt > Thompson via users > *Sent:* Monday, February 24, 2020 1:20 PM > *To:* users@lists.open-mpi.org > *Cc:* Matt Thompson > *Subject:* [OMPI users] Help with One-Sided Communication: Works in Intel > MPI, Fails in Open MPI > > > > All, > > > > My guess is this is a "I built Open MPI incorrectly" sort of issue, but > I'm not sure how to fix it. Namely, I'm currently trying to get an MPI > project's CI working on CircleCI using Open MPI to run some unit tests (on > a single node, so need some oversubscribe). I can build everything just > fine, but when I try to run, things just...blow up: > > > > [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 > -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 > -nso 6 -ngo 1 -ngi 1 -v T,U
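Whether CMA (the `process_vm_readv` path behind vader's single-copy mechanism) can work at all is often decided by host settings rather than by Open MPI. A hedged diagnostic sketch, assuming Linux `/proc` paths:

```shell
# Two host-side settings that commonly break CMA inside containers.

# 1. yama ptrace scope: a value other than 0 can make process_vm_readv
#    fail with EPERM even between processes of the same user.
yama=$(cat /proc/sys/kernel/yama/ptrace_scope 2>/dev/null || echo "absent")
echo "yama ptrace_scope: $yama"

# 2. seccomp mode of the current process (2 = filtered, as under the
#    default Docker profile; the field is absent on non-Linux systems).
grep -i '^Seccomp:' /proc/self/status 2>/dev/null || echo "no seccomp field"
```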
Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI
On Mon, Feb 24, 2020 at 4:57 PM Gabriel, Edgar wrote: > I am not an expert for the one-sided code in Open MPI, I wanted to comment > briefly on the potential MPI-IO related item. As far as I can see, the > error message > > > > “Read -1, expected 48, errno = 1” > > does not stem from MPI I/O, at least not from the ompio library. What file > system did you use for these tests? > I am not sure. It was happening in a Docker image running on an AWS EC2 instance, so I guess whatever EBS is? I'm sort of a neophyte at both AWS and Docker, so combine the two and... Matt
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
_MAC, et al, Things are looking up. By specifying --with-verbs=no, I can run helloworld. But in a new-for-me wrinkle, I can only run on *more* than one node. Not sure I've ever seen that. Using 40 core nodes, this: mpirun -np 41 ./helloWorld.mpi3.SLES12.OMPI400.exe works, and -np 40 fails: (1027)(master) $ mpirun -np 40 ./helloWorld.mpi3.SLES12.OMPI400.exe [borga033:05598] *** An error occurred in MPI_Barrier [borga033:05598] *** reported by process [140735567101953,140733193388034] [borga033:05598] *** on communicator MPI_COMM_WORLD [borga033:05598] *** MPI_ERR_OTHER: known error not in list [borga033:05598] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [borga033:05598] *** and potentially your MPI job) Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.5.274 Build 20180823 MPI Version: 3.1 MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover21 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018 forrtl: error (78): process killed (SIGTERM) Image              PC                Routine            Line        Source helloWorld.mpi3.S 0040A38E for__signal_handl Unknown Unknown libpthread-2.22.s 2B9CCB20 Unknown Unknown Unknown libpthread-2.22.s 2B9CC3ED __nanosleep Unknown Unknown libopen-rte.so.40 2C3C5854 orte_show_help_no Unknown Unknown libopen-rte.so.40 2C3C5595 orte_show_help Unknown Unknown libmpi.so.40.20.0 2B3BADC5 ompi_mpi_errors_a Unknown Unknown libmpi.so.40.20.0 2B3B99D9 ompi_errhandler_i Unknown Unknown libmpi.so.40.20.0 2B3E4586 MPI_Barrier Unknown Unknown libmpi_mpifh.so.4 2B15EE53 MPI_Barrier_f08 Unknown Unknown libmpi_usempif08. 
2ACE7742 mpi_barrier_f08_ Unknown Unknown helloWorld.mpi3.S 0040939F Unknown Unknown Unknown helloWorld.mpi3.S 0040915E Unknown Unknown Unknown libc-2.22.so 2BBF96D5 __libc_start_main Unknown Unknown helloWorld.mpi3.S 00409069 Unknown Unknown Unknown So, I'm getting closer but I have to admit I've never built an MPI stack before where running on a single node was the broken bit! On Tue, Jan 22, 2019 at 1:31 PM Cabral, Matias A wrote: > Hi Matt, > > > > There seem to be two different issues here: > > a) The warning message comes from the openib btl. Given that > Omnipath has verbs API and you have the necessary libraries in your system, > openib btl finds itself as a potential transport and prints the warning > during its init (openib btl is its way to deprecation). You may try to > explicitly ask for vader btl given you are running on shared mem: -mca btl > self,vader -mca pml ob1. Or better, explicitly build without openib: > ./configure --with-verbs=no … > > b) Not my field of expertise, but you may be having some conflict > with the external components you are using: > --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr . You may try not > specifying these and using the ones provided by OMPI. > > > > _MAC > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Matt > Thompson > *Sent:* Tuesday, January 22, 2019 6:04 AM > *To:* Open MPI Users > *Subject:* Re: [OMPI users] Help Getting Started with Open MPI and PMIx > and UCX > > > > Well, > > > > By turning off UCX compilation per Howard, things get a bit better in that > something happens! It's not a good something, as it seems to die with an > infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing > libverbs somewhere and compiling it in? To wit: > > > > (1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe > > -- > > By default, for Open MPI 4.0 and later, infiniband ports on a device > > are not used by default. 
The intent is to use UCX for these devices. > > You can override this policy by setting the btl_openib_allow_ib MCA > parameter > > to true. > > > > Local host: borgc129 > > Local adapter: hfi1_0 > > Local port: 1 > > > > -- > > -- > > WARNING: There was an error initializing an OpenFabrics device. > > > > Local host: borgc129 > > Local device: hfi1_0 > > -- > > Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications > ru
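Matias's two suggestions, as concrete commands (hedged sketch: the run-time variant assumes a single-node shared-memory run, and the ellipsis stands for the rest of the configure line from the thread):

```shell
# Run-time: force the shared-memory path and the ob1 PML, bypassing openib
mpirun --mca btl self,vader --mca pml ob1 -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

# Build-time: rebuild without the (deprecated) openib btl entirely
../configure --with-verbs=no ...
```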
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
Well, By turning off UCX compilation per Howard, things get a bit better in that something happens! It's not a good something, as it seems to die with an infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing libverbs somewhere and compiling it in? To wit: (1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe -- By default, for Open MPI 4.0 and later, infiniband ports on a device are not used by default. The intent is to use UCX for these devices. You can override this policy by setting the btl_openib_allow_ib MCA parameter to true. Local host: borgc129 Local adapter: hfi1_0 Local port: 1 -- -- WARNING: There was an error initializing an OpenFabrics device. Local host: borgc129 Local device: hfi1_0 -- Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.5.274 Build 20180823 MPI Version: 3.1 MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018 [borgc129:260830] *** An error occurred in MPI_Barrier [borgc129:260830] *** reported by process [140736833716225,46909632806913] [borgc129:260830] *** on communicator MPI_COMM_WORLD [borgc129:260830] *** MPI_ERR_OTHER: known error not in list [borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [borgc129:260830] *** and potentially your MPI job) forrtl: error (78): process killed (SIGTERM) Image              PC                Routine            Line        Source helloWorld.mpi3.S 0040A38E for__signal_handl Unknown Unknown libpthread-2.22.s 2B9CCB20 Unknown Unknown Unknown libpthread-2.22.s 2B9C90CD pthread_cond_wait Unknown Unknown libpmix.so.2.1.11 2AAAB1D780A1 PMIx_Abort Unknown Unknown mca_pmix_ext2x.so 2AAAB1B3AA75 ext2x_abort Unknown Unknown mca_ess_pmi.so 2AAAB1724BC0 Unknown Unknown Unknown libopen-rte.so.40 2C3E941C orte_errmgr_base_ Unknown Unknown mca_errmgr_defaul 2AAABC401668 Unknown Unknown Unknown libmpi.so.40.20.0 2B3CDBC4 ompi_mpi_abort Unknown Unknown 
libmpi.so.40.20.0 2B3BB1EF ompi_mpi_errors_a Unknown Unknown libmpi.so.40.20.0 2B3B99C9 ompi_errhandler_i Unknown Unknown libmpi.so.40.20.0 2B3E4576 MPI_Barrier Unknown Unknown libmpi_mpifh.so.4 2B15EE53 MPI_Barrier_f08 Unknown Unknown libmpi_usempif08. 2ACE7732 mpi_barrier_f08_ Unknown Unknown helloWorld.mpi3.S 0040939F Unknown Unknown Unknown helloWorld.mpi3.S 0040915E Unknown Unknown Unknown libc-2.22.so 2BBF96D5 __libc_start_main Unknown Unknown helloWorld.mpi3.S 00409069 Unknown Unknown Unknown On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard wrote: > Hi Matt > > Definitely do not include the ucx option for an omnipath cluster. > Actually if you accidentally installed ucx in it’s default location use on > the system Switch to this config option > > —with-ucx=no > > Otherwise you will hit > > https://github.com/openucx/ucx/issues/750 > > Howard > > > Gilles Gouaillardet schrieb am Sa. 19. > Jan. 2019 um 18:41: > >> Matt, >> >> There are two ways of using PMIx >> >> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk >> to mpirun and orted daemons (e.g. the PMIx server) >> - if you use SLURM srun, then the MPI app will directly talk to the >> PMIx server provided by SLURM. (note you might have to srun >> --mpi=pmix_v2 or something) >> >> In the former case, it does not matter whether you use the embedded or >> external PMIx. >> In the latter case, Open MPI and SLURM have to use compatible PMIx >> libraries, and you can either check the cross-version compatibility >> matrix, >> or build Open MPI with the same PMIx used by SLURM to be on the safe >> side (not a bad idea IMHO). >> >> >> Regarding the hang, I suggest you try different things >> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun >> runs on a compute node rather than on a frontend node) >> - try something even simpler such as mpirun hostname (both with sbatch >> and salloc) >> - explicitly specify the network to be used for the wire-up. 
You can, >> for example, run: mpirun --mca oob_tcp_if_include 192.168.0.0/24
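Since this is an Omnipath machine, one common workaround is to keep Open MPI from probing the verbs stack at all and let PSM2 carry the traffic. A sketch of the usual spellings — the binary name is just the one from the log above, and whether this silences both warnings depends on the build:

```shell
# Per-run: exclude the openib BTL so the hfi1 device is never probed
mpirun --mca btl ^openib -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

# Persistent per-user form of the same exclusion
mkdir -p ~/.openmpi
echo 'btl = ^openib' >> ~/.openmpi/mca-params.conf

# Gilles' wire-up suggestion, spelled out in full
mpirun --mca oob_tcp_if_include 192.168.0.0/24 -np 4 hostname
```

These are command/configuration sketches; they require an Open MPI install and a cluster allocation to actually run.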
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users < users@lists.open-mpi.org> wrote: > On Jan 18, 2019, at 12:43 PM, Matt Thompson wrote: > > > > With some help, I managed to build an Open MPI 4.0.0 with: > > We can discuss each of these params to let you know what they are. > > > ./configure --disable-wrapper-rpath --disable-wrapper-runpath > > Did you have a reason for disabling these? They're generally good > things. What they do is add linker flags to the wrapper compilers (i.e., > mpicc and friends) that basically put a default path to find libraries at > run time (that can/will in most cases override LD_LIBRARY_PATH -- but you > can override these linked-in-default-paths if you want/need to). > I've had these in my Open MPI builds for a while now. The reason was one of the libraries I need for the climate model I work on went nuts if both of them weren't there. It was originally the rpath one but then eventually (Open MPI 3?) I had to add the runpath one. But I have been updating the libraries more aggressively recently (due to OS upgrades) so it's possible this is no longer needed. > > > --with-psm2 > > Ensure that Open MPI can include support for the PSM2 library, and abort > configure if it cannot. > > > --with-slurm > > Ensure that Open MPI can include support for SLURM, and abort configure if > it cannot. > > > --enable-mpi1-compatibility > > Add support for MPI_Address and other MPI-1 functions that have since been > deleted from the MPI 3.x specification. > > > --with-ucx > > Ensure that Open MPI can include support for UCX, and abort configure if > it cannot. > > > --with-pmix=/usr/nlocal/pmix/2.1 > > Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 > (instead of using the PMIx that is bundled internally to Open MPI's source > code tree/expanded tarball). > > Unless you have a reason to use the external PMIx, the internal/bundled > PMIx is usually sufficient. > Ah. I did not know that. 
I figured if our SLURM was built linked to a specific PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 4 without specifying this. > > > --with-libevent=/usr > > Same as previous; change "pmix" to "libevent" (i.e., use the external > libevent instead of the bundled libevent). > > > CC=icc CXX=icpc FC=ifort > > Specify the exact compilers to use. > > > The MPI 1 is because I need to build HDF5 eventually and I added psm2 > because it's an Omnipath cluster. The libevent was probably a red herring > as libevent-devel wasn't installed on the system. It was eventually, and I > just didn't remove the flag. And I saw no errors in the build! > > Might as well remove the --with-libevent if you don't need it. > > > However, I seem to have built an Open MPI that doesn't work: > > > > (1099)(master) $ mpirun --version > > mpirun (Open MPI) 4.0.0 > > > > Report bugs to http://www.open-mpi.org/community/help/ > > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe > > > > It just sits there...forever. Can the gurus here help me figure out what > I managed to break? Perhaps I added too much to my configure line? Not > enough? > > There could be a few things going on here. > > Are you running inside a SLURM job? E.g., in a "salloc" job, or in an > "sbatch" script? > I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a few months ago), and it had some interesting startup scaling I liked (slow at low core count, but getting close to Intel MPI at high core count), though it seemed to not work after about 100 nodes (4000 processes) or so. 
-- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
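A quick compatibility check before choosing embedded vs. external PMIx — a sketch; both commands assume SLURM and the new Open MPI build are on PATH:

```shell
# PMIx flavors this SLURM can offer to srun --mpi=...
srun --mpi=list

# PMIx version this Open MPI was built against
ompi_info | grep -i pmix
```

If srun lists a pmix_v2 plugin and ompi_info reports a matching (or cross-compatible) PMIx, direct-launch via srun should be on the table; these are command sketches requiring the actual installs.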
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
All, With some help, I managed to build an Open MPI 4.0.0 with: ./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2 --with-slurm --enable-mpi1-compatibility --with-ucx --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc FC=ifort The MPI 1 is because I need to build HDF5 eventually and I added psm2 because it's an Omnipath cluster. The libevent was probably a red herring as libevent-devel wasn't installed on the system. It was eventually, and I just didn't remove the flag. And I saw no errors in the build! However, I seem to have built an Open MPI that doesn't work: (1099)(master) $ mpirun --version mpirun (Open MPI) 4.0.0 Report bugs to http://www.open-mpi.org/community/help/ (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe It just sits there...forever. Can the gurus here help me figure out what I managed to break? Perhaps I added too much to my configure line? Not enough? Thanks, Matt On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson wrote: > Dear Open MPI Gurus, > > A cluster I use recently updated their SLURM to have support for UCX and > PMIx. These are names I've seen and heard often at SC BoFs and posters, but > now is my first time to play with them. > > So, my first question is how exactly should I build Open MPI to try these > features out. I'm guessing I'll need things like "--with-ucx" to test UCX, > but is anything needed for PMIx? > > Second, when it comes to running Open MPI, are there new MCA parameters I > need to look out for when testing? > > Sorry for the generic questions, but I'm more on the user end of the > cluster than the administrator end, so I tend to get lost in the detailed > presentations, etc. I see online. 
> > Thanks, > Matt > -- > Matt Thompson > “The fact is, this is about us identifying what we do best and > finding more ways of doing less of it better” -- Director of Better > Anna Rampton > -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
[OMPI users] Help Getting Started with Open MPI and PMIx and UCX
Dear Open MPI Gurus, A cluster I use recently updated their SLURM to have support for UCX and PMIx. These are names I've seen and heard often at SC BoFs and posters, but now is my first time to play with them. So, my first question is how exactly should I build Open MPI to try these features out. I'm guessing I'll need things like "--with-ucx" to test UCX, but is anything needed for PMIx? Second, when it comes to running Open MPI, are there new MCA parameters I need to look out for when testing? Sorry for the generic questions, but I'm more on the user end of the cluster than the administrator end, so I tend to get lost in the detailed presentations, etc. I see online. Thanks, Matt -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
[OMPI users] 3.1.1 Bindings Change
Dear Open MPI Gurus, In the latest 3.1.1 announcement, I saw: - Fix dummy variable names for the mpi and mpi_f08 Fortran bindings to match the MPI standard. This may break applications which use name-based parameters in Fortran which used our internal names rather than those documented in the MPI standard. Is there an example of this change somewhere (in the Git issues or another place)? I don't think we have anything in our software that would be hit by this (since we test/run our code with Intel MPI, MPT as well as Open MPI), but I want to be sure we don't have some hidden #ifdef OPENMPI somewhere. Matt -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
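The change only bites Fortran calls that pass MPI dummy arguments by keyword, so a source-tree scan is a reasonable audit. A rough heuristic sketch — demo.f90 is a made-up file for illustration, and the regex is not exhaustive:

```shell
# Create a throwaway example file containing one keyword-style call
cat > demo.f90 <<'EOF'
call MPI_Send(buf=a, count=n, datatype=MPI_REAL, dest=1, tag=0, &
              comm=MPI_COMM_WORLD, ierror=ierr)
call MPI_Barrier(MPI_COMM_WORLD, ierr)
EOF
# Flag MPI calls that use name-based (keyword) arguments; positional
# calls like the MPI_Barrier line are unaffected and not matched
grep -niE 'call +mpi_[a-z_0-9]+ *\(.*[a-z_]+ *=' demo.f90
```

Running the same grep recursively over `--include='*.f90' --include='*.F90'` in a real source tree would surface the only call sites that could be hit by the rename.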
Re: [OMPI users] NAS benchmark
Well, whenever I see a "relocation truncated to fit" error, my first thought is to add "-mcmodel=medium" to the compile flags. I'm surprised NAS Benchmarks need it, though. On Sat, Feb 3, 2018 at 3:48 AM, Mahmood Naderan <mahmood...@gmail.com> wrote: > Hi, > Any body has tried NAS benchmark with ompi? I get the following linker > error while building one of the benchmarks. > > [mahmood@rocks7 NPB3.3-MPI]$ make BT NPROCS=4 CLASS=D >= >= NAS Parallel Benchmarks 3.3 = >= MPI/F77/C= >= > > cd BT; make NPROCS=4 CLASS=D SUBTYPE= VERSION= > make[1]: Entering directory `/home/mahmood/Downloads/NPB3. > 3.1/NPB3.3-MPI/BT' > make[2]: Entering directory `/home/mahmood/Downloads/NPB3. > 3.1/NPB3.3-MPI/sys' > cc -g -o setparams setparams.c > make[2]: Leaving directory `/home/mahmood/Downloads/NPB3. > 3.1/NPB3.3-MPI/sys' > ../sys/setparams bt 4 D > make[2]: Entering directory `/home/mahmood/Downloads/NPB3. > 3.1/NPB3.3-MPI/BT' > make.def modified. Rebuilding npbparams.h just in case > rm -f npbparams.h > ../sys/setparams bt 4 D > mpif90 -c -I/usr/local/include -O bt.f > mpif90 -c -I/usr/local/include -O make_set.f > mpif90 -c -I/usr/local/include -O initialize.f > mpif90 -c -I/usr/local/include -O exact_solution.f > mpif90 -c -I/usr/local/include -O exact_rhs.f > mpif90 -c -I/usr/local/include -O set_constants.f > mpif90 -c -I/usr/local/include -O adi.f > mpif90 -c -I/usr/local/include -O define.f > mpif90 -c -I/usr/local/include -O copy_faces.f > mpif90 -c -I/usr/local/include -O rhs.f > mpif90 -c -I/usr/local/include -O solve_subs.f > mpif90 -c -I/usr/local/include -O x_solve.f > mpif90 -c -I/usr/local/include -O y_solve.f > mpif90 -c -I/usr/local/include -O z_solve.f > mpif90 -c -I/usr/local/include -O add.f > mpif90 -c -I/usr/local/include -O error.f > mpif90 -c -I/usr/local/include -O verify.f > mpif90 -c -I/usr/local/include -O setup_mpi.f > make[3]: Entering directory `/home/mahmood/Downloads/NPB3. 
> 3.1/NPB3.3-MPI/BT' > mpif90 -c -I/usr/local/include -O btio.f > mpif90 -O -o ../bin/bt.D.4 bt.o make_set.o initialize.o exact_solution.o > exact_rhs.o set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o > x_solve.o y_solve.o z_solve.o add.o error.o verify.o setup_mpi.o > ../common/print_results.o ../common/timers.o btio.o -L/usr/local/lib -lmpi > x_solve.o: In function `x_solve_cell_': > x_solve.f:(.text+0x77a): relocation truncated to fit: R_X86_64_32 against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x77f): relocation truncated to fit: R_X86_64_32 against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x946): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x94e): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x958): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x962): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x96c): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x9ab): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x9c6): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0x9f3): relocation truncated to fit: R_X86_64_32S against > symbol `work_lhs_' defined in COMMON section in x_solve.o > x_solve.f:(.text+0xa21): additional relocation overflows omitted from the > output > collect2: error: ld returned 1 exit status > make[3]: *** [bt-bt] Error 1 > make[3]: Leaving directory `/home/mahmood/Downloads/NPB3. 
> 3.1/NPB3.3-MPI/BT' > make[2]: *** [exec] Error 2 > make[2]: Leaving directory `/home/mahmood/Downloads/NPB3. > 3.1/NPB3.3-MPI/BT' > make[1]: *** [../bin/bt.D.4] Error 2 > make[1]: Leaving directory `/home/mahmood/Downloads/NPB3. > 3.1/NPB3.3-MPI/BT' > make: *** [bt] Error 2 > > > There is a good guide about that (https://www.technovelty.org/ > c/relocation-truncated-to-fit-wtf.html) but I don't know which compiler > flag should I fix to fix that. > > Any idea? > > Regards, > Mahmood > > > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.
[OMPI users] mpi_f08 interfaces in man3 pages?
OMPI Users, I know from a while back when I was scanning git to find some other thing, I saw a kind user (Gilles Gouaillardet?) added the F08 interfaces into the man pages. As I am lazy, 'man mpi_send' would be nicer than me pulling out my Big Blue Book to look it up. (Second in laziness is using Google, so I'd love to see the F08 interfaces in the man 3 web page docs as well.) Is there a timeline as to when these might get into the Open MPI releases? I'm not the best at nroff, but I'll gladly help out in any way I can. Matt PS: An effort is happening where I work to incorporate one-sided MPI combined with OpenMP/threads into our code. So far, Open MPI is the only stack we've tried where the code doesn't weirdly die in odd places, so I might be coming back here with more questions when we try to improve the performance/encounter problems. -- Matt Thompson Man Among Men Fulcrum of History
Re: [OMPI users] Tuning vader for MPI_Wait Halt?
Nathan, Sadly, I'm not sure I can provide a reproducer, as it's currently our full earth system model and is accessing terabytes of background files, etc. That said, I'll work on it. I have a tiny version of the model, but that usually always works everywhere (and I can only reproduce the issue at a rather high resolution). We do have a couple of code testers that duplicate functionality around that MPI_Wait call, but, and this is the fun part, it seems to be a very specific type of that call (only if you are doing a daily time-averaged collection!). Still, I'll try and test that tester with Open MPI 2.1.0. Maybe it'll hang! As for kernel, my desktop is 3.10.0-514.16.1.el7.x86_64 (RHEL 7) and the cluster compute node is on 3.0.101-0.47.90-default (SLES11 SP3). If I run 'lsmod' I see xpmem on the cluster, but my desktop does not have it. So, perhaps not XPMEM related? Matt On Mon, Jun 5, 2017 at 1:00 PM, Nathan Hjelm <hje...@me.com> wrote: > Can you provide a reproducer for the hang? What kernel version are you > using? Is xpmem installed? > > -Nathan > > On Jun 05, 2017, at 10:53 AM, Matt Thompson <fort...@gmail.com> wrote: > > OMPI Users, > > I was wondering if there is a best way to "tune" vader to get around an > intermittent MPI_Wait halt? > > I ask because I recently found that if I use Open MPI 2.1.x on either my > desktop or on the supercomputer I have access to, if vader is enabled, the > model seems to "deadlock" at an MPI_Wait call. If I run as: > > mpirun --mca btl self,sm,tcp > > on my desktop it works. When I moved to my cluster, I tried the more > generic: > > mpirun --mca btl ^vader > > since it uses openib, and with it things work. Well, I hope that's how one > would turn off vader in MCA speak. (Note: this deadlock seems a bit > sporadic, but I do now have a case which seems to cause it reproducibly). 
> > Now, I know vader is supposed to be the "better" sm communication tech, so > I'd rather use it and thought maybe I could twiddle some tuning knobs. So I > looked at: > > https://www.open-mpi.org/faq/?category=sm > > and there I saw question 6 "How do I know what MCA parameters are > available for tuning MPI performance?". But when I try the commands listed > (minus the HTML/CSS tags): > > (1081) $ ompi_info --param btl sm > MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0) > (1082) $ ompi_info --param mpool sm > (1083) $ > > Huh. I expected more, but searching around the Open MPI FAQs made me think > I should use: > > ompi_info --param btl sm --level 9 > > which does spit out a lot, though the equivalent for mpool sm does not. > > Any ideas on which of the many knobs is best to try and turn? Something > that, by default, perhaps is one thing for sm but different for vader? I > tried to see if "ompi_info --param btl vader --level 9" did something, but > it doesn't put anything out. > > I will note that this code runs just fine with Open MPI 2.0.2 as well as > with Intel MPI and SGI MPT, so I'm thinking the code itself is okay, but > something from Open MPI 2.0.x to Open MPI 2.1.x changed. I see two entries > in the Open MPI 2.1.0 announcement about vader, but nothing specific about > how to "revert" if they are even causing the problem: > > - Fix regression that lowered the memory maximum message bandwidth for > large messages on some BTL network transports, such as openib, sm, > and vader. > > > - The vader BTL is now more efficient in terms of memory usage when > using XPMEM. 
> > > Thanks for any help, > Matt > > > -- > Matt Thompson > > Man Among Men > Fulcrum of History > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > -- Matt Thompson Man Among Men Fulcrum of History
[OMPI users] Tuning vader for MPI_Wait Halt?
OMPI Users, I was wondering if there is a best way to "tune" vader to get around an intermittent MPI_Wait halt? I ask because I recently found that if I use Open MPI 2.1.x on either my desktop or on the supercomputer I have access to, if vader is enabled, the model seems to "deadlock" at an MPI_Wait call. If I run as: mpirun --mca btl self,sm,tcp on my desktop it works. When I moved to my cluster, I tried the more generic: mpirun --mca btl ^vader since it uses openib, and with it things work. Well, I hope that's how one would turn off vader in MCA speak. (Note: this deadlock seems a bit sporadic, but I do now have a case which seems to cause it reproducibly). Now, I know vader is supposed to be the "better" sm communication tech, so I'd rather use it and thought maybe I could twiddle some tuning knobs. So I looked at: https://www.open-mpi.org/faq/?category=sm and there I saw question 6 "How do I know what MCA parameters are available for tuning MPI performance?". But when I try the commands listed (minus the HTML/CSS tags): (1081) $ ompi_info --param btl sm MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0) (1082) $ ompi_info --param mpool sm (1083) $ Huh. I expected more, but searching around the Open MPI FAQs made me think I should use: ompi_info --param btl sm --level 9 which does spit out a lot, though the equivalent for mpool sm does not. Any ideas on which of the many knobs is best to try and turn? Something that, by default, perhaps is one thing for sm but different for vader? I tried to see if "ompi_info --param btl vader --level 9" did something, but it doesn't put anything out. I will note that this code runs just fine with Open MPI 2.0.2 as well as with Intel MPI and SGI MPT, so I'm thinking the code itself is okay, but something from Open MPI 2.0.x to Open MPI 2.1.x changed. 
I see two entries in the Open MPI 2.1.0 announcement about vader, but nothing specific about how to "revert" if they are even causing the problem: - Fix regression that lowered the memory maximum message bandwidth for large messages on some BTL network transports, such as openib, sm, and vader. - The vader BTL is now more efficient in terms of memory usage when using XPMEM. Thanks for any help, Matt -- Matt Thompson Man Among Men Fulcrum of History
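For later readers, the spellings involved — a sketch assuming Open MPI 2.1.x, with ./model.exe as an illustrative binary name:

```shell
# Per-run: exclude vader; self/sm/tcp (or self/openib) get selected instead
mpirun --mca btl ^vader -np 4 ./model.exe

# Persistent per-user form of the same exclusion
echo 'btl = ^vader' >> ~/.openmpi/mca-params.conf

# vader's tunables only appear at higher verbosity levels, which is why
# the bare "ompi_info --param btl vader" printed nothing
ompi_info --param btl vader --level 9
```

These are command/configuration sketches; they need an Open MPI 2.1.x install to run.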
Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread
t;>>>>>> > >>>>>>>>>> On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov > >>>>>>>>>> <mailto:pbis...@pppl.gov>> wrote: > >>>>>>>>>> > >>>>>>>>>>Greeting Open MPI users! After being off this list for > several > >>>>>>>>>>years, I'm back! And I need help: > >>>>>>>>>> > >>>>>>>>>>I'm trying to compile OpenMPI 1.10.3 with the PGI compilers, > >>>>>>>>>>version 17.3. I'm using the following configure options: > >>>>>>>>>> > >>>>>>>>>>./configure \ > >>>>>>>>>> --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \ > >>>>>>>>>> --disable-silent-rules \ > >>>>>>>>>> --enable-shared \ > >>>>>>>>>> --enable-static \ > >>>>>>>>>> --enable-mpi-thread-multiple \ > >>>>>>>>>> --with-pmi=/usr/pppl/slurm/15.08.8 \ > >>>>>>>>>> --with-hwloc \ > >>>>>>>>>> --with-verbs \ > >>>>>>>>>> --with-slurm \ > >>>>>>>>>> --with-psm \ > >>>>>>>>>> CC=pgcc \ > >>>>>>>>>> CFLAGS="-tp x64 -fast" \ > >>>>>>>>>> CXX=pgc++ \ > >>>>>>>>>> CXXFLAGS="-tp x64 -fast" \ > >>>>>>>>>> FC=pgfortran \ > >>>>>>>>>> FCFLAGS="-tp x64 -fast" \ > >>>>>>>>>> 2>&1 | tee configure.log > >>>>>>>>>> > >>>>>>>>>>Which leads to this error from libtool during make: > >>>>>>>>>> > >>>>>>>>>>pgcc-Error-Unknown switch: -pthread > >>>>>>>>>> > >>>>>>>>>>I've searched the archives, which ultimately lead to this > work > >>>>>>>>>>around from 2009: > >>>>>>>>>> > >>>>>>>>>> https://www.open-mpi.org/community/lists/users/2009/04/8724.php > >>>>>>>>>> <https://www.open-mpi.org/community/lists/users/2009/04/ > 8724.php> > >>>>>>>>>> > >>>>>>>>>>Interestingly, I participated in the discussion that lead to > that > >>>>>>>>>>workaround, stating that I had no problem compiling Open MPI > with > >>>>>>>>>>PGI v9. I'm assuming the problem now is that I'm specifying > >>>>>>>>>>--enable-mpi-thread-multiple, which I'm doing because a user > >>>>>>>>>>requested that feature. > >>>>>>>>>> > >>>>>>>>>>It's been exactly 8 years and 2 days since that workaround > was > >>>>>>>>>>posted to the list. 
Please tell me a better way of dealing with this issue than writing a 'fakepgf90' script. Any suggestions? > >>>>>>>>>> -- > >>>>>>>>>> Prentice -- Matt Thompson Man Among Men Fulcrum of History
[OMPI users] Issues with PGI 16.10, OpenMPI 2.1.0 on macOS: Fortran issues with hello world (running and dylib)
v2.1.0, package: Open MPI > user@computer.concealed Distribution, ident: 2.1.0, repo rev: > v2.0.1-696-g1cd1edf, Mar 20, 2017 > > Hello, world, I am 2 of 4: Open MPI v2.1.0, package: Open MPI > user@computer.concealed Distribution, ident: 2.1.0, repo rev: > v2.0.1-696-g1cd1edf, Mar 20, 2017 > PGI 16.10: (831) $ mpifort -o hello_usempif08.exe hello_usempif08.f90 > (832) $ mpirun -np 4 ./hello_usempif08.exe > [computer.concealed:87920] mca_base_component_repository_open: unable to > open mca_patcher_overwrite: File not found (ignored) > [computer.concealed:87920] mca_base_component_repository_open: unable to > open mca_shmem_mmap: File not found (ignored) > [computer.concealed:87920] mca_base_component_repository_open: unable to > open mca_shmem_posix: File not found (ignored) > [computer.concealed:87920] mca_base_component_repository_open: unable to > open mca_shmem_sysv: File not found (ignored) > -- > It looks like opal_init failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during opal_init; some of which are due to configuration or > environment problems. This failure appears to be an internal failure; > here's some additional information (which may only be relevant to an > Open MPI developer): > opal_shmem_base_select failed > --> Returned value -1 instead of OPAL_SUCCESS > ------ Well okay then. I didn't see any errors during the build for PGI: (835) $ grep Error make.pgi-16.10.log > SED mpi/man/man3/MPI_Error_class.3 > SED mpi/man/man3/MPI_Error_string.3 Any ideas/help? -- Matt Thompson Man Among Men Fulcrum of History ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
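On macOS, "unable to open mca_shmem_mmap: File not found (ignored)" usually means the dynamic loader could not resolve a dylib the plugin depends on, not that the plugin file itself is missing. One way to inspect this — a sketch; the install prefix below is an assumption:

```shell
PREFIX=$HOME/installed/Compiler/pgi-16.10/openmpi/2.1.0   # assumed prefix
# List each shmem plugin's dylib dependencies; any dependency the loader
# cannot find at the recorded path produces the "File not found" messages.
for p in "$PREFIX"/lib/openmpi/mca_shmem_*.so; do
  echo "== $p"
  otool -L "$p"
done
```

If a listed dependency points at a nonexistent path, rebuilding with correct rpath settings (or adjusting install names) is the usual remedy; this is macOS-only and needs the actual install to run.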
Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++
Gilles, The library I have having issues linking is ESMF and it is a C++/Fortran application. From http://www.earthsystemmodeling.org/esmf_releases/non_public/ESMF_7_0_0/ESMF_usrdoc/node9.html#SECTION00092000 : The following compilers and utilities *are required* for compiling, linking > and testing the ESMF software: > Fortran90 (or later) compiler; > C++ compiler; > MPI implementation compatible with the above compilers (but see below); > GNU's gcc compiler - for a standard cpp preprocessor implementation; > GNU make; > Perl - for running test scripts. (Emphasis mine) This is why I am concerned. For now, I'll build Open MPI with the (possibly useless) C++ support for PGI and move on to the Fortran issue (which I'll detail in another email). But, as I *need* ESMF for my application, it would be good to get an mpicxx that I can have confidence in with PGI. Matt On Thu, Mar 23, 2017 at 9:05 AM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > Matt, > > a C++ compiler is required to configure Open MPI. > That being said, C++ compiler is only used if you build the C++ bindings > (That were removed from MPI-3) > And unless you plan to use the mpic++ wrapper (with or without the C++ > bindings), > a valid C++ compiler is not required at all. > /* configure still requires one, and that could be improved */ > > My point is you should not worry too much about configure messages related > to C++, > and you should instead focus on the Fortran issue. > > Cheers, > > Gilles > > On Thursday, March 23, 2017, Matt Thompson <fort...@gmail.com> wrote: > >> All, I'm hoping one of you knows what I might be doing wrong here. I'm >> trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS. >> Now, I built it a la: >> >> http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105 >> >> and found that it built, but the resulting mpifort, etc were just not >> good. Couldn't even do Hello World. >> >> So, I thought I'd start from the beginning. 
I tried running: >> >> configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran >> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0 >> but when I did I saw this: >> >> *** C++ compiler and preprocessor >> checking whether we are using the GNU C++ compiler... yes >> checking whether pgc++ accepts -g... yes >> checking dependency style of pgc++... none >> checking how to run the C++ preprocessor... pgc++ -E >> checking for the C++ compiler vendor... gnu >> >> Well, that's not the right vendor. So, I took a look at configure and I >> saw that at least some detection for PGI was a la: >> >> pgCC* | pgcpp*) >> # Portland Group C++ compiler >> case `$CC -V` in >> *pgCC\ [1-5].* | *pgcpp\ [1-5].*) >> >> pgCC* | pgcpp*) >> # Portland Group C++ compiler >> lt_prog_compiler_wl_CXX='-Wl,' >> lt_prog_compiler_pic_CXX='-fpic' >> lt_prog_compiler_static_CXX='-Bstatic' >> ;; >> >> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked >> configure so that references to pgCC (nonexistent on macOS) are gone and >> all pgcpp became pgc++, but: >> >> *** C++ compiler and preprocessor >> checking whether we are using the GNU C++ compiler... yes >> checking whether pgc++ accepts -g... yes >> checking dependency style of pgc++... none >> checking how to run the C++ preprocessor... pgc++ -E >> checking for the C++ compiler vendor... gnu >> >> Well, at this point, I think I'm stopping until I get help. Will this >> chunk of configure always return gnu for PGI? I know the C part returns >> 'portland group': >> >> *** C compiler and preprocessor >> checking for gcc... (cached) pgcc >> checking whether we are using the GNU C compiler... (cached) no >> checking whether pgcc accepts -g... (cached) yes >> checking for pgcc option to accept ISO C89... (cached) none needed >> checking whether pgcc understands -c and -o together... (cached) yes >> checking for pgcc option to accept ISO C99... none needed >> checking for the C compiler vendor... 
portland group >> >> so I thought the C++ section would as well. I also tried passing in >> --enable-mpi-cxx, but that did nothing. >> >> Is this just a red herring? My real concern is with pgfortran/mpifort, >> but I thought I'd start with this. If this is okay, I'll move on and detail >> the fortran issues I'm having. >> >> Matt >> -- >> Matt Thompson >> >> Man Among Men >> Fulcrum of History >> >> > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > -- Matt Thompson Man Among Men Fulcrum of History ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
[OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++
All,

I'm hoping one of you knows what I might be doing wrong here. I'm trying to build Open MPI 2.1.0 with PGI 16.10 (Community Edition) on macOS. Now, I built it a la:

http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105

and found that it built, but the resulting mpifort, etc. were just not good. They couldn't even do Hello World. So, I thought I'd start from the beginning. I tried running:

configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0

but when I did I saw this:

*** C++ compiler and preprocessor
checking whether we are using the GNU C++ compiler... yes
checking whether pgc++ accepts -g... yes
checking dependency style of pgc++... none
checking how to run the C++ preprocessor... pgc++ -E
checking for the C++ compiler vendor... gnu

Well, that's not the right vendor. So, I took a look at configure and I saw that at least some of the detection for PGI was a la:

pgCC* | pgcpp*)
  # Portland Group C++ compiler
  case `$CC -V` in
  *pgCC\ [1-5].* | *pgcpp\ [1-5].*)

pgCC* | pgcpp*)
  # Portland Group C++ compiler
  lt_prog_compiler_wl_CXX='-Wl,'
  lt_prog_compiler_pic_CXX='-fpic'
  lt_prog_compiler_static_CXX='-Bstatic'
  ;;

Ah. PGI 16.9+ now uses pgc++ to do C++ compiling, not pgcpp. So, I hacked configure so that references to pgCC (nonexistent on macOS) are gone and all pgcpp became pgc++, but:

*** C++ compiler and preprocessor
checking whether we are using the GNU C++ compiler... yes
checking whether pgc++ accepts -g... yes
checking dependency style of pgc++... none
checking how to run the C++ preprocessor... pgc++ -E
checking for the C++ compiler vendor... gnu

Well, at this point, I think I'm stopping until I get help. Will this chunk of configure always return gnu for PGI? I know the C part returns 'portland group':

*** C compiler and preprocessor
checking for gcc... (cached) pgcc
checking whether we are using the GNU C compiler... (cached) no
checking whether pgcc accepts -g... (cached) yes
checking for pgcc option to accept ISO C89... (cached) none needed
checking whether pgcc understands -c and -o together... (cached) yes
checking for pgcc option to accept ISO C99... none needed
checking for the C compiler vendor... portland group

so I thought the C++ section would as well. I also tried passing in --enable-mpi-cxx, but that did nothing. Is this just a red herring?

My real concern is with pgfortran/mpifort, but I thought I'd start with this. If this is okay, I'll move on and detail the Fortran issues I'm having.

Matt

--
Matt Thompson
Man Among Men
Fulcrum of History

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Issues building Open MPI 2.0.1 with PGI 16.10 on macOS
Well, jtull over at PGI seemed to have the "magic sauce":

http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105

Namely, I think it's the siterc file. I'm not sure which of the adaptations fixes the issue yet, though.

On Mon, Nov 28, 2016 at 3:11 PM, Jeff Hammond <jeff.scie...@gmail.com> wrote:

> attached config.log that contains the details of the following failures is
> the best way to make forward-progress here. that none of the system
> headers are detected suggests a rather serious compiler problem that may
> not have anything to do with headers.
>
> checking for sys/types.h... no
> checking for sys/stat.h... no
> checking for stdlib.h... no
> checking for string.h... no
> checking for memory.h... no
> checking for strings.h... no
> checking for inttypes.h... no
> checking for stdint.h... no
> checking for unistd.h... no
>
> On Mon, Nov 28, 2016 at 9:49 AM, Matt Thompson <fort...@gmail.com> wrote:
>
>> Hmm. Well, I definitely have /usr/include/stdint.h as I previously was
>> trying to work with clang as the compiler stack. And as near as I can tell,
>> Open MPI's configure is seeing /usr/include as oldincludedir, but maybe
>> that's not how it finds it?
>>
>> If I check my configure output:
>>
>> == Configuring Open MPI
>>
>> *** Startup tests
>> checking build system type... x86_64-apple-darwin15.6.0
>>
>> checking for sys/types.h... yes
>> checking for sys/stat.h... yes
>> checking for stdlib.h... yes
>> checking for string.h... yes
>> checking for memory.h... yes
>> checking for strings.h... yes
>> checking for inttypes.h... yes
>> checking for stdint.h... yes
>> checking for unistd.h... yes
>>
>> So, the startup saw it. But:
>>
>> --- MCA component event:libevent2022 (m4 configuration macro, priority 80)
>> checking for MCA component event:libevent2022 compile mode... static
>> checking libevent configuration args... --disable-dns --disable-http
>> --disable-rpc --disable-openssl --enable-thread-support --disable-evport
>> configure: OPAL configuring in opal/mca/event/libevent2022/libevent
>> configure: running /bin/sh './configure' --disable-dns --disable-http
>> --disable-rpc --disable-openssl --enable-thread-support --disable-evport
>> '--disable-wrapper-rpath' 'CC=pgcc' 'CXX=pgc++' 'FC=pgfortran'
>> 'CFLAGS=-m64' 'CXXFLAGS=-m64' 'FCFLAGS=-m64' '--without-verbs'
>> '--prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.0.1'
>> 'CPPFLAGS=-I/Users/mathomp4/src/MPI/openmpi-2.0.1
>> -I/Users/mathomp4/src/MPI/openmpi-2.0.1
>> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/include
>> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/hwloc1112/hwloc/include
>> -Drandom=opal_random' --cache-file=/dev/null --srcdir=. --disable-option-checking
>> checking for a BSD-compatible install... /usr/bin/install -c
>>
>> checking for sys/types.h... no
>> checking for sys/stat.h... no
>> checking for stdlib.h... no
>> checking for string.h... no
>> checking for memory.h... no
>> checking for strings.h... no
>> checking for inttypes.h... no
>> checking for stdint.h... no
>> checking for unistd.h... no
>>
>> So, it's like whatever magic found stdint.h for the startup isn't passed
>> down to libevent when it builds? As I scan the configure output, PMIx sees
>> stdint.h in its section and ROMIO sees it as well, but not libevent2022.
>> The Makefiles inside of libevent2022 do have 'oldincludedir = /usr/include'. Hmm.
>>
>> On Mon, Nov 28, 2016 at 11:39 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>
>>> I think PGI uses installed GCC components for some parts of standard C
>>> (at least for some things on Linux, it does; and I imagine it is
>>> similar for Mac). If you look at the post at
>>>
>>> http://www.pgroup.com/userforum/viewtopic.php?t=5147=17f3afa2cd0eec05b0f4e54a60f50479
>>>
>>> The problem seems to have been one with the Xcode configuration:
>>>
>>> "It turns out my Xcode was messed up as I was missing /usr/include/.
>>> After rerunning xcode-select --install it works now."
>>>
>>> On my OS X 10.11.6, I have /usr/include/stdint.h without having the
>>> PGI compilers. This may be related to the GNU command line tools
>>> installation...? I think t
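For anyone hitting the same wall, the quoted advice boils down to verifying the command-line tools headers exist before suspecting the compiler. A hedged sketch (the xcode-select step is macOS-only and simply never runs elsewhere):

```shell
# If the command-line tools headers are missing, the libevent
# sub-configure will fail every single header probe, no matter which
# compiler is in use. Check the header directly, and reinstall the
# tools if it is gone.
ls /usr/include/stdint.h 2>/dev/null || xcode-select --install
```

It is also worth diffing the failing sub-configure's config.log (under opal/mca/event/libevent2022/libevent/ in the build tree) against the top-level one, to see whether the compiler invocation differs between the two.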
Re: [OMPI users] Issues building Open MPI 2.0.1 with PGI 16.10 on macOS
> > > > * types of exactly 64, 32, 16, and 8 bits
> > > > * respectively.
> > > > *
> > > > *ev_uintptr_t, ev_intptr_t
> > > > * unsigned/signed integers large enough
> > > > * to hold a pointer without loss of bits.
> > > > *
> > > > *ev_ssize_t
> > > > * A signed type of the same size as size_t
> > > > *
> > > > *ev_off_t
> > > > * A signed type typically used to represent offsets within a
> > > > * (potentially large) file
> > > > *
> > > > * @{
> > > > */
> > > > #ifdef _EVENT_HAVE_UINT64_T
> > > > #define ev_uint64_t uint64_t
> > > > #define ev_int64_t int64_t
> > > > #elif defined(WIN32)
> > > > #define ev_uint64_t unsigned __int64
> > > > #define ev_int64_t signed __int64
> > > > #elif _EVENT_SIZEOF_LONG_LONG == 8
> > > > #define ev_uint64_t unsigned long long
> > > > #define ev_int64_t long long
> > > > #elif _EVENT_SIZEOF_LONG == 8
> > > > #define ev_uint64_t unsigned long
> > > > #define ev_int64_t long
> > > > #elif defined(_EVENT_IN_DOXYGEN)
> > > > #define ev_uint64_t ...
> > > > #define ev_int64_t ...
> > > > #else
> > > > #error "No way to define ev_uint64_t"
> > > > #endif
>
> On Mon, Nov 28, 2016 at 5:04 AM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> All,
>>
>> I recently tried building Open MPI 2.0.1 with the new Community Edition of
>> PGI on macOS. My first mistake was I was configuring with a configure line
>> I'd cribbed from Linux that had -fPIC. Apparently -fPIC was removed from the
>> macOS build.
Okay, I can remove that and I configured with: > >> > >> ./configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran > >> CFLAGS='-m64' CXXFLAGS='-m64' FCFLAGS='-m64' --without-verbs > >> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.0.1 | > & tee > >> configure.pgi-16.10.log > >> > >> But, now, when I try to actually build, I get an error pretty quick > inside > >> the make: > >> > >> CC printf.lo > >> CC proc.lo > >> CC qsort.lo > >> > >> PGC-F-0249-#error -- "No way to define ev_uint64_t" > >> (/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/ > libevent2022/libevent/include/event2/util.h: > >> 126) > >> PGC/x86-64 OSX 16.10-0: compilation aborted > >> CC show_help.lo > >> make[3]: *** [proc.lo] Error 1 > >> make[3]: *** Waiting for unfinished jobs > >> make[2]: *** [all-recursive] Error 1 > >> make[1]: *** [all-recursive] Error 1 > >> make: *** [all-recursive] Error 1 > >> > >> This was done with -j2, so if I remake with 'make V=1' I see: > >> > >> source='proc.c' object='proc.lo' libtool=yes \ > >> DEPDIR=.deps depmode=pgcc /bin/sh ../../config/depcomp \ > >> /bin/sh ../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H > -I. > >> -I../../opal/include -I../../ompi/include -I../../oshmem/include > >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen > >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen > >> -I../../ompi/mpiext/cuda/c -I../.. -I../../orte/include > >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/ > hwloc1112/hwloc/include > >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/ > libevent2022/libevent > >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/ > libevent2022/libevent/include > >> -O -DNDEBUG -m64 -c -o proc.lo proc.c > >> libtool: compile: pgcc -DHAVE_CONFIG_H -I. 
-I../../opal/include > >> -I../../ompi/include -I../../oshmem/include > >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen > >> -I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen > >> -I../../ompi/mpiext/cuda/c -I../.. -I../../orte/include > >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/ > hwloc1112/hwloc/include > >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/ > libevent2022/libevent > >> -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/ > libevent2022/libevent/include > >> -O -DNDEBUG -m64 -c proc.c -MD -o proc.o > >> PGC-F-0249-#error -- "No way to define ev_uint64_t" > >> (/Users/mathomp4/src
[OMPI users] Issues building Open MPI 2.0.1 with PGI 16.10 on macOS
All,

I recently tried building Open MPI 2.0.1 with the new Community Edition of PGI on macOS. My first mistake was I was configuring with a configure line I'd cribbed from Linux that had -fPIC. Apparently -fPIC was removed from the macOS build. Okay, I can remove that, and I configured with:

./configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran \
  CFLAGS='-m64' CXXFLAGS='-m64' FCFLAGS='-m64' --without-verbs \
  --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.0.1 |& tee configure.pgi-16.10.log

But, now, when I try to actually build, I get an error pretty quickly inside the make:

  CC       printf.lo
  CC       proc.lo
  CC       qsort.lo
PGC-F-0249-#error --  "No way to define ev_uint64_t" (/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include/event2/util.h: 126)
PGC/x86-64 OSX 16.10-0: compilation aborted
  CC       show_help.lo
make[3]: *** [proc.lo] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

This was done with -j2, so if I remake with 'make V=1' I see:

source='proc.c' object='proc.lo' libtool=yes \
DEPDIR=.deps depmode=pgcc /bin/sh ../../config/depcomp \
/bin/sh ../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H -I.
  -I../../opal/include -I../../ompi/include -I../../oshmem/include
  -I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
  -I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
  -I../../ompi/mpiext/cuda/c -I../.. -I../../orte/include
  -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/hwloc1112/hwloc/include
  -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent
  -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include
  -O -DNDEBUG -m64 -c -o proc.lo proc.c
libtool: compile:  pgcc -DHAVE_CONFIG_H -I.
  -I../../opal/include -I../../ompi/include -I../../oshmem/include
  -I../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
  -I../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
  -I../../ompi/mpiext/cuda/c -I../.. -I../../orte/include
  -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/hwloc/hwloc1112/hwloc/include
  -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent
  -I/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include
  -O -DNDEBUG -m64 -c proc.c -MD -o proc.o
PGC-F-0249-#error --  "No way to define ev_uint64_t" (/Users/mathomp4/src/MPI/openmpi-2.0.1/opal/mca/event/libevent2022/libevent/include/event2/util.h: 126)
PGC/x86-64 OSX 16.10-0: compilation aborted
make[3]: *** [proc.lo] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

I guess my question is whether this is an issue with PGI or Open MPI or both? I'm not too sure. I've also asked about this on the PGI forums as well (http://www.pgroup.com/userforum/viewtopic.php?t=5413=0) since I'm not sure. But, no matter what, does anyone have thoughts on how to solve this?

Thanks,
Matt

--
Matt Thompson

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] mpi_f08 Question: set comm on declaration error, and other questions
On Fri, Aug 19, 2016 at 8:54 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com > wrote: > On Aug 19, 2016, at 6:32 PM, Matt Thompson <fort...@gmail.com> wrote: > > > > that the comm == MPI_COMM_WORLD evaluates to .TRUE.? I discovered that > once when I was printing some stuff. > > > > That might well be a coincidence. type(MPI_Comm) is not a boolean type, > so I'm not sure how you compared it to .true. > > > > Well, I made a program like: > > > > (208) $ cat test2.F90 > > program whoami > >use mpi_f08 > >implicit none > >type(MPI_Comm) :: comm > >if (comm == MPI_COMM_WORLD) write (*,*) "I am MPI_COMM_WORLD" > >if (comm == MPI_COMM_NULL) write (*,*) "I am MPI_COMM_NULL" > > end program whoami > > (209) $ mpifort test2.F90 > > (210) $ mpirun -np 4 ./a.out > > I am MPI_COMM_WORLD > > I am MPI_COMM_WORLD > > I am MPI_COMM_WORLD > > I am MPI_COMM_WORLD > > > > I think if you print comm, you get 0 and MPI_COMM_WORLD=0 and > MPI_COMM_NULL=2 so...I guess I'm surprised. I'd have thought MPI_Comm would > have been undefined until defined. > > I don't know the rules here for what happens in Fortran when comparing an > uninitialized derived type. The results could be undefined...? 
> > > Instead you can write a program like this: > > > > (226) $ cat helloWorld.mpi3.F90 > > program hello_world > > > >use mpi_f08 > > > >implicit none > > > >type(MPI_Comm) :: comm > >integer :: myid, npes, ierror > >integer :: name_length > > > >character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name > > > >call mpi_init(ierror) > > > >call MPI_Comm_Rank(comm,myid,ierror) > >write (*,*) 'ierror: ', ierror > >call MPI_Comm_Size(comm,npes,ierror) > >call MPI_Get_Processor_Name(processor_name,name_length,ierror) > > > >write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, > "is on", trim(processor_name) > > > >call MPI_Finalize(ierror) > > > > end program hello_world > > (227) $ mpifort helloWorld.mpi3.F90 > > (228) $ mpirun -np 4 ./a.out > > ierror:0 > > ierror:0 > > ierror:0 > > ierror:0 > > Process2 of4 is on compy > > Process1 of4 is on compy > > Process3 of4 is on compy > > Process0 of4 is on copy > > That does seem to be odd output. What is the hostname on your machine? Oh well, I (badly) munged the hostname on the computer I ran on because it had the IP address within. I figured better safe than sorry and not broadcast that out there. :) > FWIW, I changed your write statement to: > > print *, "Process", myid, "of", npes, "is on", trim(processor_name) > > and after I added a "comm = MPI_COMM_WORLD" before the call to > MPI_COMM_RANK, the output prints properly for me (i.e., I see my hostname). > > > This seems odd to me. I haven't passed in MPI_COMM_WORLD as the > communicator to MPI_Comm_Rank, and yet, it worked and the error code was 0 > (which I'd take as success). Even if you couldn't detect this at compile > time, I'm surprised it doesn't trigger a run-time error. Is this the > correct behavior according to the Standard? > > I think you're passing an undefined value, so the results will be > undefined. > > It's quite possible that the comm%mpi_val inside the comm is (randomly?) 
> assigned to 0, which is the same value as mpif.f's MPI_COMM_WORLD, and > therefore your comm is effectively the same as mpi_f08's MPI_COMM_WORLD -- > which is why MPI_COMM_RANK and MPI_COMM_SIZE worked for you. > > Indeed, when I run your program, I get: > > - > $ ./foo > [savbu-usnic-a:31774] *** An error occurred in MPI_Comm_rank > [savbu-usnic-a:31774] *** reported by process [756088833,0] > [savbu-usnic-a:31774] *** on communicator MPI_COMM_WORLD > [savbu-usnic-a:31774] *** MPI_ERR_COMM: invalid communicator > [savbu-usnic-a:31774] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [savbu-usnic-a:31774] ***and potentially your MPI job) > - > > I.e., MPI_COMM_RANK is aborting because the communicator being passed in > is invalid. > > Huh. I guess I'd assumed that the MPI Standard would have made sure a declared communicator that hasn't been filled would have been an error to use. When I get back on Monday, I'll try out some other compilers as well as try different compiler options (e.g., -g -O0, say).
Re: [OMPI users] mpi_f08 Question: set comm on declaration error, and other questions
On Fri, Aug 19, 2016 at 2:55 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> On Aug 19, 2016, at 2:30 PM, Matt Thompson <fort...@gmail.com> wrote:
> >
> > I'm slowly trying to learn and transition to 'use mpi_f08'. So, I'm
> > writing various things and I noticed that this triggers an error:
> >
> > program hello_world
> >    use mpi_f08
> >    implicit none
> >    type(MPI_Comm) :: comm = MPI_COMM_NULL
> > end program hello_world
> >
> > when compiled (Open MPI 2.0.0 with GCC 6.1):
> >
> > (380) $ mpifort test1.F90
> > test1.F90:7:27:
> >
> >    type(MPI_Comm) :: comm = MPI_COMM_NULL
> >                           1
> > Error: Parameter ‘mpi_comm_null’ at (1) has not been declared or is a
> > variable, which does not reduce to a constant expression
> >
> > Why is that? Obviously, I can just do:
> >
> >    type(MPI_Comm) :: comm
> >    comm = MPI_COMM_NULL
> >
> > and that works just fine (note MPI_COMM_NULL doesn't seem to be special
> > as MPI_COMM_WORLD triggers the same error).
>
> I am *not* a Fortran expert, but I believe the difference between the two is:
>
> 1. The first one is a compile-time assignment. And you can only do those
> with constants. MPI_COMM_NULL is not a compile-time constant, hence, you
> get an error.
>
> 2. The second one is a run-time assignment. You can do that between any
> compatible entities, and so that works.

Okay. This makes sense. I guess I was surprised that MPI_COMM_NULL wasn't a constant (or parameter, I guess). But maybe a type() cannot be constant...

> > I'm just wondering why the first doesn't work, for my own edification. I
> > tried reading through the Standard, but my eyes started watering after a
> > bit (though that might have been the neon green cover). Is it related to
> > the fact that when one declares:
> >
> >    type(MPI_Comm) :: comm
> >
> > that the comm == MPI_COMM_WORLD evaluates to .TRUE.? I discovered that
> > once when I was printing some stuff.
>
> That might well be a coincidence.
> type(MPI_Comm) is not a boolean type,
> so I'm not sure how you compared it to .true.

Well, I made a program like:

(208) $ cat test2.F90
program whoami
   use mpi_f08
   implicit none
   type(MPI_Comm) :: comm
   if (comm == MPI_COMM_WORLD) write (*,*) "I am MPI_COMM_WORLD"
   if (comm == MPI_COMM_NULL) write (*,*) "I am MPI_COMM_NULL"
end program whoami
(209) $ mpifort test2.F90
(210) $ mpirun -np 4 ./a.out
 I am MPI_COMM_WORLD
 I am MPI_COMM_WORLD
 I am MPI_COMM_WORLD
 I am MPI_COMM_WORLD

I think if you print comm, you get 0 and MPI_COMM_WORLD=0 and MPI_COMM_NULL=2 so... I guess I'm surprised. I'd have thought MPI_Comm would have been undefined until defined.

Instead you can write a program like this:

(226) $ cat helloWorld.mpi3.F90
program hello_world

   use mpi_f08

   implicit none

   type(MPI_Comm) :: comm
   integer :: myid, npes, ierror
   integer :: name_length

   character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name

   call mpi_init(ierror)

   call MPI_Comm_Rank(comm,myid,ierror)
   write (*,*) 'ierror: ', ierror
   call MPI_Comm_Size(comm,npes,ierror)
   call MPI_Get_Processor_Name(processor_name,name_length,ierror)

   write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is on", trim(processor_name)

   call MPI_Finalize(ierror)

end program hello_world
(227) $ mpifort helloWorld.mpi3.F90
(228) $ mpirun -np 4 ./a.out
 ierror:0
 ierror:0
 ierror:0
 ierror:0
Process2 of4 is on compy
Process1 of4 is on compy
Process3 of4 is on compy
Process0 of4 is on compy

This seems odd to me. I haven't passed in MPI_COMM_WORLD as the communicator to MPI_Comm_Rank, and yet, it worked and the error code was 0 (which I'd take as success). Even if you couldn't detect this at compile time, I'm surprised it doesn't trigger a run-time error. Is this the correct behavior according to the Standard?

Matt

--
Matt Thompson

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
[OMPI users] mpi_f08 Question: set comm on declaration error, and other questions
Oh great Open MPI Gurus,

I'm slowly trying to learn and transition to 'use mpi_f08'. So, I'm writing various things and I noticed that this triggers an error:

program hello_world
   use mpi_f08
   implicit none
   type(MPI_Comm) :: comm = MPI_COMM_NULL
end program hello_world

when compiled (Open MPI 2.0.0 with GCC 6.1):

(380) $ mpifort test1.F90
test1.F90:7:27:

   type(MPI_Comm) :: comm = MPI_COMM_NULL
                          1
Error: Parameter ‘mpi_comm_null’ at (1) has not been declared or is a variable, which does not reduce to a constant expression

Why is that? Obviously, I can just do:

   type(MPI_Comm) :: comm
   comm = MPI_COMM_NULL

and that works just fine (note MPI_COMM_NULL doesn't seem to be special as MPI_COMM_WORLD triggers the same error).

I'm just wondering why the first doesn't work, for my own edification. I tried reading through the Standard, but my eyes started watering after a bit (though that might have been the neon green cover). Is it related to the fact that when one declares:

   type(MPI_Comm) :: comm

that the comm == MPI_COMM_WORLD evaluates to .TRUE.? I discovered that once when I was printing some stuff.

Thanks for helping me learn,
Matt

--
Matt Thompson

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
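The compile-time versus run-time distinction discussed in this thread can be seen without MPI at all, since it is a plain Fortran rule about constant expressions. A hedged sketch (my_comm and NULL_COMM are made-up stand-ins for mpi_f08's MPI_Comm and MPI_COMM_NULL, which is a module variable, not a parameter; requires gfortran):

```shell
cat > init_demo.f90 <<'EOF'
module comm_mod
  type :: my_comm
     integer :: val = 0
  end type
  ! A module VARIABLE, like mpi_f08's MPI_COMM_NULL, is not a
  ! constant expression even though it has a fixed-looking value.
  type(my_comm) :: NULL_COMM = my_comm(2)
end module

program demo
  use comm_mod
  implicit none
  ! type(my_comm) :: comm = NULL_COMM   ! rejected: initializer must
  !                                     ! be a constant expression
  type(my_comm) :: comm
  comm = NULL_COMM                      ! fine: run-time assignment
  print *, 'comm%val =', comm%val
end program
EOF
if command -v gfortran >/dev/null; then
  gfortran -o init_demo init_demo.f90 && ./init_demo
fi
```

Uncommenting the declaration-time initializer reproduces the same class of error mpifort gave for `type(MPI_Comm) :: comm = MPI_COMM_NULL`.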
Re: [OMPI users] Error with Open MPI 2.0.0: error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
On Wed, Jul 13, 2016 at 9:50 AM, Nathan Hjelm <hje...@me.com> wrote:

> As of 2.0.0 we now support experimental verbs. It looks like one of the
> calls is failing:
>
> #if HAVE_DECL_IBV_EXP_QUERY_DEVICE
>     device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
>     if (ibv_exp_query_device(device->ib_dev_context, &device->ib_exp_dev_attr)) {
>         BTL_ERROR(("error obtaining device attributes for %s errno says %s",
>                    ibv_get_device_name(device->ib_dev), strerror(errno)));
>         goto error;
>     }
> #endif
>
> Do you know what OFED or MOFED version you are running?

Per one of our gurus, answers from your IB page:

1. Which OpenFabrics version are you running? Please specify where you got the software from (e.g., from the OpenFabrics community web site, from a vendor, or it was already included in your Linux distribution).

Mellanox OFED 3.1-1.0.3 (soon to be 3.3-1.0.0)

2. What distro and version of Linux are you running? What is your kernel version?

SLES11 SP3 (LTSS); 3.0.101-0.47.71-default (soon to be 3.0.101-0.47.79-default)

3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)

Mellanox UFM (OpenSM under the covers)

--
Matt Thompson
Man Among Men
Fulcrum of History
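For anyone needing to gather the same answers on their own cluster, a hedged sketch of the usual commands (ofed_info and ibv_devinfo ship with the OFED stack and will simply be absent on machines without it):

```shell
# OFED/MOFED release string, if the stack is installed:
command -v ofed_info >/dev/null && ofed_info -s || true
# List of HCAs (e.g. mlx5_0), if libibverbs utilities are present:
command -v ibv_devinfo >/dev/null && ibv_devinfo -l || true
# Kernel and distro:
uname -r
cat /etc/os-release 2>/dev/null | head -2 || true
```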
[OMPI users] Error with Open MPI 2.0.0: error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
All,

I've been struggling here at NASA Goddard trying to get PGI 16.5 + Open MPI 1.10.3 working on the Discover cluster. What was happening was I'd run our climate model at, say, 4x24 and it would work sometimes. Most of the time. Every once in a while, it'd throw a segfault. If we changed the layout or number of processors, more (and sometimes different) segfaults were triggered. As we could build with PGI 15.7 + Open MPI 1.10.3 (where Open MPI is built exactly the same) and run perfectly, I was focusing on the Open MPI build. I tried compiling it at -O3, -O, -O0, all sorts of things, and was about to throw in the towel as all failed.

But, I saw Open MPI 2.0.0 was out and figured, may as well try the latest before reporting to the mailing list. I built it and, huzzah!, it works! I'm happy! Except that every time I execute 'mpirun' I get odd errors:

(1034) $ mpirun -np 4 ./helloWorld.mpi2.exe
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgr074
  Local device: mlx5_0
--
[borgr074][[35244,1],1][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],3][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],0][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],2][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
MPI Version: 3.1
MPI Library Version: Open MPI v2.0.0, package: Open MPI mathomp4@borg01z239 Distribution, ident: 2.0.0, repo rev: v2.x-dev-1570-g0a4a5d7, Jul 12, 2016
Process0 of4 is on borgr074
Process3 of4 is on borgr074
Process1 of4 is on borgr074
Process2 of4 is on borgr074
[borgr074:29032] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[borgr074:29032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

If I run with --mca btl_base_verbose 1 and use more than one node, I see that the openib/verbs (still not sure what to call this) btl isn't being used, but rather tcp:

[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node borgr074
[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node borgr074

which makes sense since it can't find an Infiniband device. My first thought is that the build/configure procedure of the past doesn't quite jibe with what Open MPI 2.0.0 is expecting? I build Open MPI as:

export CC=pgcc
export CXX=pgc++
export FC=pgfortran
export CFLAGS="-fpic -m64"
export CXXFLAGS="-fpic -m64"
export FCFLAGS="-m64 -fpic"
export PREFIX=/discover/swdev/mathomp4/MPI/openmpi/2.0.0/pgi-16.5-k40
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/slurm/lib64
export LDFLAGS="-L/usr/slurm/lib64"
export CPPFLAGS="-I/usr/slurm/include"
export LIBS="-lpciaccess"

build() {
   echo `pwd`
   ./configure --with-slurm --disable-wrapper-rpath --enable-shared --prefix=${PREFIX}
   make -j8
   make install
}

echo "calling build"
build
echo "exiting"

This is a build script built over time; it might have things unnecessary for an Open MPI 2.0 build, but perhaps now it needs more info? I can say that in the past (say with 1.10.3) it definitely found the openib/verbs btl and used it!

Per the website, I'm attaching links to my config.log and "ompi_info --all" information:

https://dl.dropboxusercontent.com/u/61696/Open%20MPI/config.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/build.pgi16.5.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/ompi_info.txt.gz

I tried to run "ompi_info -v ompi full --parsable" as asked, but that doesn't seem possible anymore:

(1053) $ ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
Type 'ompi_info --help' for usage.

I am asking our machine gurus about the Infiniband network per:

https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot

--
Matt Thompson
Man Among Men
Fulcrum of History
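Since a failed openib init quietly degrades to tcp, it can help to make the failure loud instead. A hedged sketch of MCA parameters to try (component names from the 1.10/2.0 series; these run on a cluster, not locally):

```shell
# Crank verbosity to watch per-peer BTL selection:
mpirun --mca btl_base_verbose 100 -np 4 ./helloWorld.mpi2.exe 2>&1 | grep 'mca: bml'

# Restrict the BTL list to openib (plus sm for on-node and self for
# loopback) so a device-init failure aborts the job instead of
# silently falling back to tcp:
mpirun --mca btl openib,sm,self -np 4 ./helloWorld.mpi2.exe
```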
Re: [OMPI users] Issues Building Open MPI static with Intel Fortran 16
Howard, Welp. That worked! I'm assuming oshmem = OpenSHMEM, right? If so, yeah, for now, not important on my wee workstation. (If it isn't, is it something I should work on getting to work?) Matt On Fri, Jan 22, 2016 at 2:47 PM, Howard Pritchard <hpprit...@gmail.com> wrote: > HI Matt, > > If you don't need oshmem, you could try again with --disable-oshmem added > to the config line > > Howard > > > 2016-01-22 12:15 GMT-07:00 Matt Thompson <fort...@gmail.com>: > >> All, >> >> I'm trying to duplicate an issue I had with ESMF long ago (not sure if I >> reported it here or at ESMF, but...). It had been a while, so I started >> from scratch. I first built Open MPI 1.10.2 with Intel Fortran 16.0.0.109 >> and my system GCC (4.8.5 from RHEL7) with mostly defaults: >> >> # ./configure --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \ >> #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \ >> # >> --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-shared >> | & tee configure.intel16.0.0.109-shared.log >> >> This built and checked just fine. Huzzah! And, indeed, it died in ESMF >> during a link in an odd way (ESMF is looking at it). >> >> As a thought, I decided to see if building Open MPI statically might help >> or not. So, I tried to build Open MPI with: >> >> # ./configure --disable-shared --enable-static --disable-wrapper-rpath >> CC=gcc CXX=g++ FC=ifort \ >> #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \ >> # >> --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static >> | & tee configure.intel16.0.0.109-static.log >> >> I just added --disable-shared --enable-static being lazy. 
But, when I do >> this, I get this (when built with make V=1): >> >> Making all in tools/oshmem_info >> make[2]: Entering directory >> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info' >> /bin/sh ../../../libtool --tag=CC --mode=link gcc -std=gnu99 -O3 >> -DNDEBUG -fPIC -m64 -finline-functions -fno-strict-aliasing -pthread -o >> oshmem_info oshmem_info.o param.o ../../../ompi/libmpi.la >> ../../../oshmem/liboshmem.la ../../../orte/libopen-rte.la ../../../opal/ >> libopen-pal.la -lrt -lm -lutil >> libtool: link: gcc -std=gnu99 -O3 -DNDEBUG -fPIC -m64 -finline-functions >> -fno-strict-aliasing -pthread -o oshmem_info oshmem_info.o param.o >> ../../../ompi/.libs/libmpi.a ../../../oshmem/.libs/liboshmem.a >> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/ompi/.libs/libmpi.a >> -libverbs >> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/orte/.libs/libopen-rte.a >> ../../../orte/.libs/libopen-rte.a >> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/opal/.libs/libopen-pal.a >> ../../../opal/.libs/libopen-pal.a -lnuma -ldl -lrt -lm -lutil -pthread >> /usr/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o): >> undefined reference to symbol '_end' >> /usr/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200 >> so try adding it to the linker command line >> /lib64/libnl-route-3.so.200: could not read symbols: Invalid operation >> collect2: error: ld returned 1 exit status >> make[2]: *** [oshmem_info] Error 1 >> make[2]: Leaving directory >> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info' >> make[1]: *** [all-recursive] Error 1 >> make[1]: Leaving directory >> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem' >> make: *** [all-recursive] Error 1 >> >> So, what did I do wrong? Or is there something I need to add to the >> configure line? 
I have built static versions of Open MPI in the past (say >> 1.8.7 era with Intel Fortran 15), but this is a new OS (RHEL 7 instead of >> 6) so I can see issues possible. >> >> Anyone seen this before? As I said, the "usual" build way is just fine. >> Perhaps I need an extra RPM that isn't installed? I do have libnl-devel >> installed. >> >> -- >> Matt Thompson >> >> Man Among Men >> Fulcrum of History >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2016/01/28344.php >> > > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/01/28345.php > -- Matt Thompson Man Among Men Fulcrum of History
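Putting Howard's suggestion together with the original configure line, a hedged sketch of the working invocation (paths from the original post; assumes nothing on this workstation needs OpenSHMEM):

```shell
./configure --disable-shared --enable-static --disable-oshmem \
    --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \
    CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
    --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static
```

With --disable-oshmem, oshmem_info is never built, so the liboshmem.a '_end' link failure against libnl-route-3 never comes up.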
[OMPI users] Issues Building Open MPI static with Intel Fortran 16
All, I'm trying to duplicate an issue I had with ESMF long ago (not sure if I reported it here or at ESMF, but...). It had been a while, so I started from scratch. I first built Open MPI 1.10.2 with Intel Fortran 16.0.0.109 and my system GCC (4.8.5 from RHEL7) with mostly defaults: # ./configure --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \ #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \ # --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-shared | & tee configure.intel16.0.0.109-shared.log This built and checked just fine. Huzzah! And, indeed, it died in ESMF during a link in an odd way (ESMF is looking at it). As a thought, I decided to see if building Open MPI statically might help or not. So, I tried to build Open MPI with: # ./configure --disable-shared --enable-static --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \ #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \ # --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static | & tee configure.intel16.0.0.109-static.log I just added --disable-shared --enable-static being lazy. 
But when I do this, I get the following (built with make V=1):

Making all in tools/oshmem_info
make[2]: Entering directory `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
/bin/sh ../../../libtool --tag=CC --mode=link gcc -std=gnu99 -O3 -DNDEBUG -fPIC -m64 -finline-functions -fno-strict-aliasing -pthread -o oshmem_info oshmem_info.o param.o ../../../ompi/libmpi.la ../../../oshmem/liboshmem.la ../../../orte/libopen-rte.la ../../../opal/libopen-pal.la -lrt -lm -lutil
libtool: link: gcc -std=gnu99 -O3 -DNDEBUG -fPIC -m64 -finline-functions -fno-strict-aliasing -pthread -o oshmem_info oshmem_info.o param.o ../../../ompi/.libs/libmpi.a ../../../oshmem/.libs/liboshmem.a /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/ompi/.libs/libmpi.a -libverbs /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/orte/.libs/libopen-rte.a ../../../orte/.libs/libopen-rte.a /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/opal/.libs/libopen-pal.a ../../../opal/.libs/libopen-pal.a -lnuma -ldl -lrt -lm -lutil -pthread
/usr/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o): undefined reference to symbol '_end'
/usr/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200 so try adding it to the linker command line
/lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make[2]: *** [oshmem_info] Error 1
make[2]: Leaving directory `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem'
make: *** [all-recursive] Error 1

So, what did I do wrong? Or is there something I need to add to the configure line? I have built static versions of Open MPI in the past (say the 1.8.7 era with Intel Fortran 15), but this is a new OS (RHEL 7 instead of 6), so I can see possible issues. Anyone seen this before? As I said, the "usual" build way is just fine.
Perhaps I need an extra RPM that isn't installed? I do have libnl-devel installed.

--
Matt Thompson
Man Among Men
Fulcrum of History
Re: [OMPI users] MPI, Fortran, and GET_ENVIRONMENT_VARIABLE
Ralph,

Sounds good. I'll keep my eyes out. I figured it probably wasn't possible. Of course, it's simple enough to run a script ahead of time that can build a table that could be read in-program. I was just hoping perhaps I could do it in one step instead of two! And, well, I'm slowly learning that whatever I knew about switches in an Ethernet way means nothing in an Infiniband situation!

On Fri, Jan 15, 2016 at 11:27 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Yes, we don’t propagate envars ourselves other than MCA params. You can
> ask mpirun to forward specific envars to every proc, but that would only
> push the same value to everyone, and that doesn’t sound like what you are
> looking for.
>
> FWIW: we are working on adding the ability to directly query the info you
> are seeking - i.e., to ask for things like “which procs are on the same
> switch as me?”. Hoping to have it later this year, perhaps in the summer.
Re: [OMPI users] MPI, Fortran, and GET_ENVIRONMENT_VARIABLE
Ralph,

That doesn't help:

(1004) $ mpirun -map-by node -np 8 ./hostenv.x | sort -g -k2
Process 0 of 8 is on host borgo086
Process 0 of 8 is on processor borgo086
Process 1 of 8 is on host borgo086
Process 1 of 8 is on processor borgo140
Process 2 of 8 is on host borgo086
Process 2 of 8 is on processor borgo086
Process 3 of 8 is on host borgo086
Process 3 of 8 is on processor borgo140
Process 4 of 8 is on host borgo086
Process 4 of 8 is on processor borgo086
Process 5 of 8 is on host borgo086
Process 5 of 8 is on processor borgo140
Process 6 of 8 is on host borgo086
Process 6 of 8 is on processor borgo086
Process 7 of 8 is on host borgo086
Process 7 of 8 is on processor borgo140

But it was doing the right thing before. It saw my SLURM_* bits and correctly put 4 processes on the first node and 4 on the second (see the processor line, which is from MPI, not the environment), and I only asked for 4 tasks per node:

SLURM_NODELIST=borgo[086,140]
SLURM_NTASKS_PER_NODE=4
SLURM_NNODES=2
SLURM_NTASKS=8
SLURM_TASKS_PER_NODE=4(x2)

My guess is no MPI stack wants to propagate an environment variable to every process. I'm picturing a 1000-node/28000-core job... and poor Open MPI (or MPT or Intel MPI) would have to marshal 28000xN environment variables around and keep track of who gets what...

Matt

On Fri, Jan 15, 2016 at 10:48 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Actually, the explanation is much simpler. You probably have more than 8
> slots on borgj020, and so your job is simply small enough that we put it
> all on one host. If you want to force the job to use both hosts, add
> “-map-by node” to your cmd line
>
> On Jan 15, 2016, at 7:02 AM, Jim Edwards <jedwa...@ucar.edu> wrote:
>
> I would guess that what is actually happening is that slurm is exporting
> all of the variables from the host node including the $HOST variable and
> overwriting the
> def
[OMPI users] MPI, Fortran, and GET_ENVIRONMENT_VARIABLE
All,

I'm not too sure if this is an MPI issue, a Fortran issue, or something else, but I thought I'd ask the MPI gurus here first since my web search failed me.

There is a chance in the future I might want/need to query an environment variable in a Fortran program, namely to figure out what switch a currently running process is on (via SLURM_TOPOLOGY_ADDR in my case) and perhaps make a "per-switch" communicator.[1]

So, I coded up a boring Fortran program whose only exciting lines are:

   call MPI_Get_Processor_Name(processor_name,name_length,ierror)
   call get_environment_variable("HOST",host_name)

   write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is on processor", trim(processor_name)
   write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') "Process", myid, "of", npes, "is on host", trim(host_name)

I decided to try out the HOST environment variable first because it is simple and different per node (I didn't want to take many, many nodes to find the point when a switch is traversed). I then grabbed two nodes with 4 processes per node and...:

(1046) $ echo "$SLURM_NODELIST"
borgj[020,036]
(1047) $ pdsh -w "$SLURM_NODELIST" echo '$HOST'
borgj036: borgj036
borgj020: borgj020
(1048) $ mpifort -o hostenv.x hostenv.F90
(1049) $ mpirun -np 8 ./hostenv.x | sort -g -k2
Process 0 of 8 is on host borgj020
Process 0 of 8 is on processor borgj020
Process 1 of 8 is on host borgj020
Process 1 of 8 is on processor borgj020
Process 2 of 8 is on host borgj020
Process 2 of 8 is on processor borgj020
Process 3 of 8 is on host borgj020
Process 3 of 8 is on processor borgj020
Process 4 of 8 is on host borgj020
Process 4 of 8 is on processor borgj036
Process 5 of 8 is on host borgj020
Process 5 of 8 is on processor borgj036
Process 6 of 8 is on host borgj020
Process 6 of 8 is on processor borgj036
Process 7 of 8 is on host borgj020
Process 7 of 8 is on processor borgj036

It looks like MPI_Get_Processor_Name is doing its thing, but the HOST one seems to only be reflecting the first host.
My guess is that Open MPI doesn't export every process's environment separately to every process, so it is reflecting HOST from process 0.

So, I guess my question is: can this be done? Is there an option to Open MPI that might do it? Or is this just something MPI doesn't do? Or is my Google-fu just too weak to figure out the right search phrase to find the answer to this probable FAQ?

Matt

[1] Note, this might be unnecessary, but I got to the point where I wanted to see if I *could* do it, rather than *should*.

--
Matt Thompson
Man Among Men
Fulcrum of History
Re: [OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question
On Wed, Jan 6, 2016 at 7:20 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> FWIW, there has been one attempt to set the OMP_* environment variables
> within OpenMPI, and that was aborted because it caused crashes with a
> prominent commercial compiler.
>
> Also, I'd like to clarify that OpenMPI binds MPI tasks (e.g. processes),
> and it is up to the OpenMP runtime to bind the OpenMP threads to the
> resources made available by OpenMPI to the MPI task.
>
> In this case, that means OpenMPI will bind an MPI task to 7 cores (for
> example, cores 7 to 13), and it is up to the OpenMP runtime to bind each
> of the 7 OpenMP threads to one of the cores previously allocated by
> OpenMPI (for example, OMP thread 0 to core 7, OMP thread 1 to core 8, ...).

Indeed. Hybrid programming is a two-step tango. The harder task (in some ways) is placing the MPI processes where I want them. With omplace I could just force things (though probably not with Open MPI... haven't tried it yet), but I'd rather have a more "formulaic" way to place processes, since then you can script it. Now that I know about the ppr: syntax, I can see it'll be quite useful!

The other task is to get the OpenMP threads laid out the "right way". I was pretty sure KMP_AFFINITY=compact was correct (worked once... and, yeah, I'm using Intel at present. Figured I'd start there, then expand to figure out GCC and PGI). I'll do some experimenting with the OMP_* versions, as a more-respected standard is always a good thing.

For others with inquiries into this, I highly recommend this page I found after my query was answered here: https://www.olcf.ornl.gov/kb_articles/parallel-job-execution-on-commodity-clusters/

At this point, I'm thinking I should start up an MPI+OpenMP wiki to map all the combinations of compiler+mpistack. Or pray the MPI Forum and OpenMP combine and I can just look in a Standard. :D

Thanks,
Matt

--
Matt Thompson
Man Among Men
Fulcrum of History
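For the OMP_* experiments mentioned above: the portable OpenMP 4.0 counterparts of Intel's KMP_AFFINITY=compact are OMP_PLACES and OMP_PROC_BIND, which the GCC and PGI runtimes also honor. A sketch of the settings (exact thread ordering can still vary by runtime, and the mpirun mapping modifier shown in the comment varies by Open MPI version, so check mpirun(1) for yours):

```shell
# Portable OpenMP 4.0 equivalents of KMP_AFFINITY=compact
export OMP_NUM_THREADS=7
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # pack each rank's threads onto adjacent places
# Pair this with an Open MPI mapping that reserves 7 cores per rank, e.g.
# (the pe= modifier needs a reasonably recent 1.8/1.10-series mpirun):
#   mpirun -np 4 --map-by ppr:2:socket:pe=7 ./hello-hybrid.x
```

The advantage over KMP_AFFINITY is that the same three variables carry over when switching between Intel, GCC, and PGI builds.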
Re: [OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question
, 2016 at 2:48 PM, Erik Schnetter <schnet...@gmail.com> wrote:
> Setting KMP_AFFINITY will probably override anything that OpenMPI sets. Can you try without?
>
> -erik
>
> On Wed, Jan 6, 2016 at 2:46 PM, Matt Thompson <fort...@gmail.com> wrote:
> > Hello Open MPI Gurus,
> >
> > As I explore MPI-OpenMP hybrid codes, I'm trying to figure out how to do things to get the same behavior in various stacks. For example, I have a 28-core node (2 14-core Haswells), and I'd like to run 4 MPI processes and 7 OpenMP threads. Thus, I'd like the processes to be 2 processes per socket with the OpenMP threads laid out on them. Using a "hybrid Hello World" program, I can achieve this with Intel MPI (after a lot of testing):
> >
> > (1097) $ env OMP_NUM_THREADS=7 KMP_AFFINITY=compact mpirun -np 4 ./hello-hybrid.x | sort -g -k 18
> > srun.slurm: cluster configuration lacks support for cpu binding
> > Hello from thread 0 out of 7 from process 2 out of 4 on borgo035 on CPU 0
> > Hello from thread 1 out of 7 from process 2 out of 4 on borgo035 on CPU 1
> > Hello from thread 2 out of 7 from process 2 out of 4 on borgo035 on CPU 2
> > Hello from thread 3 out of 7 from process 2 out of 4 on borgo035 on CPU 3
> > Hello from thread 4 out of 7 from process 2 out of 4 on borgo035 on CPU 4
> > Hello from thread 5 out of 7 from process 2 out of 4 on borgo035 on CPU 5
> > Hello from thread 6 out of 7 from process 2 out of 4 on borgo035 on CPU 6
> > Hello from thread 0 out of 7 from process 3 out of 4 on borgo035 on CPU 7
> > Hello from thread 1 out of 7 from process 3 out of 4 on borgo035 on CPU 8
> > Hello from thread 2 out of 7 from process 3 out of 4 on borgo035 on CPU 9
> > Hello from thread 3 out of 7 from process 3 out of 4 on borgo035 on CPU 10
> > Hello from thread 4 out of 7 from process 3 out of 4 on borgo035 on CPU 11
> > Hello from thread 5 out of 7 from process 3 out of 4 on borgo035 on CPU 12
> > Hello from thread 6 out of 7 from process 3 out of 4 on borgo035 on CPU 13
> > Hello from thread 0 out of 7 from process 0 out of 4 on borgo035 on CPU 14
> > Hello from thread 1 out of 7 from process 0 out of 4 on borgo035 on CPU 15
> > Hello from thread 2 out of 7 from process 0 out of 4 on borgo035 on CPU 16
> > Hello from thread 3 out of 7 from process 0 out of 4 on borgo035 on CPU 17
> > Hello from thread 4 out of 7 from process 0 out of 4 on borgo035 on CPU 18
> > Hello from thread 5 out of 7 from process 0 out of 4 on borgo035 on CPU 19
> > Hello from thread 6 out of 7 from process 0 out of 4 on borgo035 on CPU 20
> > Hello from thread 0 out of 7 from process 1 out of 4 on borgo035 on CPU 21
> > Hello from thread 1 out of 7 from process 1 out of 4 on borgo035 on CPU 22
> > Hello from thread 2 out of 7 from process 1 out of 4 on borgo035 on CPU 23
> > Hello from thread 3 out of 7 from process 1 out of 4 on borgo035 on CPU 24
> > Hello from thread 4 out of 7 from process 1 out of 4 on borgo035 on CPU 25
> > Hello from thread 5 out of 7 from process 1 out of 4 on borgo035 on CPU 26
> > Hello from thread 6 out of 7 from process 1 out of 4 on borgo035 on CPU 27
> >
> > Other than the odd fact that Process #0 seemed to start on Socket #1 (this might be an artifact of how I'm trying to detect the CPU I'm on), this looks reasonable. 14 threads on each socket, and each process is laying out its threads in a nice, orderly fashion.
> >
> > I'm trying to figure out how to do this with Open MPI (version 1.10.0), and apparently I am just not quite good enough to figure it out. The closest I've gotten is:
> >
> > (1155) $ env OMP_NUM_THREADS=7 KMP_AFFINITY=compact mpirun -np 4 -map-by ppr:2:socket ./hello-hybrid.x | sort -g -k 18
> > Hello from thread 0 out of 7 from process 0 out of 4 on borgo035 on CPU 0
> > Hello from thread 0 out of 7 from process 1 out of 4 on borgo035 on CPU 0
> > Hello from thread 1 out of 7 from process 0 out of 4 on borgo035 on CPU 1
> > Hello from thread 1 out of 7 from process 1 out of 4 on borgo035 on CPU 1
> > Hello from thread 2 out of 7 from process 0 out of 4 on borgo035 on CPU 2
> > Hello from thread 2 out of 7 from process 1 out of 4 on borgo035 on CPU 2
> > Hello from thread 3 out of 7 from process 0 out of 4 on borgo035 on CPU 3
> > Hello from thread 3 out of 7 from process 1 out of 4 on borgo035 on CPU 3
> > Hello from thread 4 out of 7 from process 0 out of 4 on borgo035 on CPU 4
> > Hello from thread 4 out of 7 from process 1 out of 4 on borgo035
[OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question
4 out of 7 from process 3 out of 4 on borgo035 on CPU 18
Hello from thread 5 out of 7 from process 2 out of 4 on borgo035 on CPU 19
Hello from thread 5 out of 7 from process 3 out of 4 on borgo035 on CPU 19
Hello from thread 6 out of 7 from process 2 out of 4 on borgo035 on CPU 20
Hello from thread 6 out of 7 from process 3 out of 4 on borgo035 on CPU 20

Obviously not right. Any ideas on how to help me learn? The man mpirun page is a bit formidable in the pinning part, so maybe I've missed an obvious answer.

Matt

--
Matt Thompson
Man Among Men
Fulcrum of History
Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket
Ralph,

Huh. That isn't in the Open MPI 1.8.8 mpirun man page. It is in Open MPI 1.10, so I'm guessing someone noticed it wasn't there. Explains why I didn't try it out. I'm assuming this option is respected on all nodes?

Note: a SmarterManThanI™ here at Goddard thought up this:

#!/bin/bash
rank=0
for node in $(srun uname -n | sort); do
   echo "rank $rank=$node slots=1:*"
   let rank+=1
done

It does seem to work in synthetic tests, so I'm trying it now in my real job. I had to hack a few run scripts, so I'll probably spend the next hour debugging something dumb I did.

What I'm wondering about all this is: can this be done with --slot-list? Or, perhaps, does --slot-list even work? I have tried about 20 different variations of it, e.g., --slot-list 1:*, --slot-list '1:*', --slot-list 1:0,1,2,3,4,5,6,7, --slot-list 1:8,9,10,11,12,13,14,15, --slot-list 8-15, and every time I seem to trigger an error via help-rmaps_rank_file.txt. I tried to read through opal_hwloc_base_slot_list_parse in the source, but my C isn't great (see my gmail address name), so that didn't help. Might not even be the right function, but I was just acking the code.

Thanks,
Matt

On Mon, Dec 21, 2015 at 10:51 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Try adding —cpu-set a,b,c,… where the a,b,c… are the core id’s of your
> second socket. I’m working on a cleaner option as this has come up before.

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/12/28190.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/12/28195.php

--
Matt Thompson
Man Among Men
Fulcrum of History
[OMPI users] Help with Binding in 1.8.8: Use only second socket
Dear Open MPI Gurus,

I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty sure is possible, but I'm just not smart enough to figure out. Namely, I'm seeing some odd GPU timings, and I think it's because I was dumb and assumed the GPU was on the PCI bus next to Socket #0, as some older GPU nodes I ran on were like that.

But a trip through lspci and lstopo has shown me that the GPU is actually on Socket #1. These are dual-socket Sandy Bridge nodes, and I'd like to do some tests where I run 8 processes per node and those processes all land on Socket #1.

So, what I'm trying to figure out is how to have Open MPI bind processes like that. My first thought, as always, is to run a helloworld job with -report-bindings on. I can manage to do this:

(1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
[borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/././.][./././././././.]
[borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B/./.][./././././././.]
[borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [././././././B/.][./././././././.]
[borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]: [./././././././B][./././././././.]
[borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.]
[borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.]
Process 7 of 8 is on borg01z205
Process 5 of 8 is on borg01z205
Process 2 of 8 is on borg01z205
Process 3 of 8 is on borg01z205
Process 4 of 8 is on borg01z205
Process 6 of 8 is on borg01z205
Process 0 of 8 is on borg01z205
Process 1 of 8 is on borg01z205

Great... but wrong socket! Is there a way to tell it to use Socket 1 instead?
Note I'll be running under SLURM, so I will only have 8 processes per node, so it shouldn't need to use Socket 0.

--
Matt Thompson
Man Among Men
Fulcrum of History
Re: [OMPI users] Open MPI 1.10.0: Works on one Sandybridge Node, not on another: tcp_peer_send_blocking
On Thu, Sep 24, 2015 at 12:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Ah, sorry - wrong param. It’s the out-of-band that is having the problem.
> Try adding —mca oob_tcp_if_include

Ooh. Okay. Look at this:

(13) $ mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x
Process 1 of 2 is on r509i2n17
Process 0 of 2 is on r509i2n17

So that is nice. Now the spin-up if I have 8 or so nodes is rather... slow. But at this point I'll take working over efficient. Quick startup can come later.

Matt
Re: [OMPI users] Open MPI 1.10.0: Works on one Sandybridge Node, not on another: tcp_peer_send_blocking
Ralph, I believe these nodes might have both an Ethernet and Infiniband port where the Ethernet port is not the one to use. Is there a way to tell Open MPI to ignore any ethernet devices it sees? I've tried: --mca btl sm,openib,self and (based on the advice of the much more intelligent support at NAS): --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1 But neither worked. Matt On Thu, Sep 24, 2015 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote: > Starting in the 1.7 series, OMPI by default launches daemons on all nodes > in the allocation during startup. This is done so we can “probe” the > topology of the nodes and use that info during the process mapping > procedure - e.g., if you want to map-by NUMA regions. > > What is happening here is that some of the nodes in your allocation aren’t > allowing those daemons to callback to mpirun. Either a firewall is in the > way, or something is preventing it. > > If you don’t want to launch on those other nodes, you could just add —novm > to your cmd line, or use the —host option to restrict us to your local > node. However, I imagine you got the bigger allocation so you could use it > :-) > > In which case, you need to remove the obstacle. You might check for > firewall, or check to see if multiple NICs are on the non-maia nodes (this > can sometimes confuse things, especially if someone put the NICs on the > same IP subnet) > > HTH > Ralph > > > > On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote: > > Open MPI Users, > > I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7 > using this configure string: > > ./configure --disable-vt --with-tm=/PBS --with-verbs > --disable-wrapper-rpath \ > CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \ > CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \ > --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee > configure.pgi15.7.log > > It seemed to pass 'make check'. 
> > I'm working at pleiades at NAS, and there they have both Sandy Bridge > nodes with GPUs (maia) and regular Sandy Bridge compute nodes (here after > called Sandy) without. To be extra careful (since PGI compiles to the > architecture you build on) I took a Westmere node and built Open MPI there > just in case. > > So, as I said, all seems to work with a test. I now grab a maia node, > maia1, of an allocation of 4 I had: > > (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c > (103) $ mpirun -np 2 ./helloWorld.x > Process 0 of 2 is on maia1 > Process 1 of 2 is on maia1 > > Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, of an > allocation of 8 I had: > > (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c > (50) $ mpirun -np 2 ./helloWorld.x > [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > > Hmm. 
Let's try turning off tcp (often my first thought when on an > Infiniband system): > > (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x > [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 > failed: Broken pipe (32) > > Now, the nodes reporting the issue seem to be the "other" nodes on the > allocation that are in a different rack: > > (52) $ cat $PBS_NODEFILE | uniq > r321i7n16 > r321i7n17 > r323i5n6 > r323i5n7 > r323i5n8 > r323i5n9 > r323i5n10 > r323i5n11 > > Maybe that's a clue? I didn't think this would matter if I only ran two > processes...and it works on the multi-node maia allocation. > > I've tried searching the web, but the only place I've seen > tcp_peer_send_blocking is in a PDF where they say it's an error that can be seen: > http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf > > Any ideas for what this error can mean? > > -- Matt Thompson
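The btl flags above cannot silence these messages: tcp_peer_send_blocking comes from ORTE's out-of-band (OOB) TCP channel, which the daemons use to call back to mpirun, and that channel is selected independently of the MPI byte-transfer layers. A hedged sketch of the parameters that do apply to the OOB channel (the interface names below are placeholders, not taken from this system):

```shell
# Restrict the daemons' OOB callback traffic to the IPoIB interface:
mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x

# Or exclude a misconfigured Ethernet device instead:
mpirun --mca oob_tcp_if_exclude eth0 -np 2 ./helloWorld.x
```

If multiple NICs on the non-maia nodes share an IP subnet (as Ralph suggests above), excluding the offending interface on the OOB channel is the usual first thing to try.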
[OMPI users] Open MPI 1.10.0: Works on one Sandybridge Node, not on another: tcp_peer_send_blocking
Open MPI Users, I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7 using this configure string: ./configure --disable-vt --with-tm=/PBS --with-verbs --disable-wrapper-rpath \ CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \ CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \ --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee configure.pgi15.7.log It seemed to pass 'make check'. I'm working at pleiades at NAS, and there they have both Sandy Bridge nodes with GPUs (maia) and regular Sandy Bridge compute nodes (here after called Sandy) without. To be extra careful (since PGI compiles to the architecture you build on) I took a Westmere node and built Open MPI there just in case. So, as I said, all seems to work with a test. I now grab a maia node, maia1, of an allocation of 4 I had: (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c (103) $ mpirun -np 2 ./helloWorld.x Process 0 of 2 is on maia1 Process 1 of 2 is on maia1 Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, of an allocation of 8 I had: (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c (50) $ mpirun -np 2 ./helloWorld.x [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) Hmm. 
Let's try turning off tcp (often my first thought when on an Infiniband system): (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) Now, the nodes reporting the issue seem to be the "other" nodes on the allocation that are in a different rack: (52) $ cat $PBS_NODEFILE | uniq r321i7n16 r321i7n17 r323i5n6 r323i5n7 r323i5n8 r323i5n9 r323i5n10 r323i5n11 Maybe that's a clue? I didn't think this would matter if I only ran two processes...and it works on the multi-node maia allocation. I've tried searching the web, but the only place I've seen tcp_peer_send_blocking is in a PDF where they say it's an error that can be seen: http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf Any ideas for what this error can mean? -- Matt Thompson Man Among Men Fulcrum of History
Re: [OMPI users] OpenMPI-1.10.0 bind-to core error
Looking at the Open MPI 1.10.0 man page: https://www.open-mpi.org/doc/v1.10/man1/mpirun.1.php it looks like perhaps -oversubscribe (which was an option) is now the default behavior. Instead we have: *-nooversubscribe, --nooversubscribe* Do not oversubscribe any nodes; error (without starting any processes) if the requested number of processes would cause oversubscription. This option implicitly sets "max_slots" equal to the "slots" value for each node. It also looks like -map-by has a way to implement it as well (see man page). Thanks for letting me/us know about this. On a system of mine I sort of depend on the -nooversubscribe behavior! Matt On Tue, Sep 15, 2015 at 11:17 AM, Patrick Begou < patrick.be...@legi.grenoble-inp.fr> wrote: > Hi, > > I'm running OpenMPI 1.10.0 built with Intel 2015 compilers on a Bullx > System. > I've some troubles with the bind-to core option when using cpuset. > If the cpuset is less than all the cores of a cpu (ex: 4 cores allowed on > an 8-core cpu) OpenMPI 1.10.0 allows these cores to be overloaded up to the > maximum number of cores of the cpu. > With this config and because the cpuset only allows 4 cores, I can reach 2 > processes/core if I use: > > mpirun -np 8 --bind-to core my_application > > OpenMPI 1.7.3 doesn't show the problem with the same situation: > mpirun -np 8 --bind-to-core my_application > returns: > *A request was made to bind to that would result in binding more* > *processes than cpus on a resource* > and that's okay of course. > > > Is there a way to avoid this overloading with OpenMPI 1.10.0 ? > > Thanks > > Patrick > > -- > === > | Equipe M.O.S.T. 
| | > | Patrick BEGOU | mailto:patrick.be...@grenoble-inp.fr > <patrick.be...@grenoble-inp.fr> | > | LEGI| | > | BP 53 X | Tel 04 76 82 51 35 | > | 38041 GRENOBLE CEDEX| Fax 04 76 82 52 71 | > === > > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/09/27575.php > -- Matt Thompson Man Among Men Fulcrum of History
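Following the man-page excerpt above, the slots cap can also be expressed directly in a hostfile. A sketch under assumed names (node01/node02, hosts.txt, and my_application are placeholders, not taken from this system):

```shell
# hosts.txt -- cap each node at the 4 cores the cpuset allows:
#   node01 slots=4 max_slots=4
#   node02 slots=4 max_slots=4

# With 1.10.x, refuse to start rather than overload those cores:
mpirun -np 8 --hostfile hosts.txt --nooversubscribe --bind-to core my_application
```

Either --nooversubscribe or an explicit max_slots should restore the 1.7.3-style "more processes than cpus" abort described in Patrick's report.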
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
Jeff, Some limited testing shows that that srun does seem to work where the quote-y one did not. I'm working with our admins now to make sure it lets the prolog work as expected as well. I'll keep you informed, Matt On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote: > Try this (typed in editor, not tested!): > > #! /usr/bin/perl -w > > use strict; > use warnings; > > use FindBin; > > # Specify the path to the prolog. > my $prolog = '--task-prolog=/gpfsm//.task.prolog'; > > # Build the path to the SLURM srun command. > my $srun_slurm = "${FindBin::Bin}/srun.slurm"; > > # Add the prolog option, but abort if the user specifies a prolog option. > my @command = split(/ /, "$srun_slurm $prolog"); > foreach (@ARGV) { > if (/^--task-prolog=/) { > print("The --task-prolog option is unsupported at . Please " . > "contact the for assistance.\n"); > exit(1); > } else { > push(@command, $_); > } > } > system(@command); > > > > On Sep 4, 2014, at 1:21 PM, Matt Thompson <fort...@gmail.com> wrote: > > Jeff, > > Here is the script (with a bit of munging for safety's sake): > > #! /usr/bin/perl -w > > use strict; > > use warnings; > > use FindBin; > > # Specify the path to the prolog. > > my $prolog = '--task-prolog=/gpfsm//.task.prolog'; > > # Build the path to the SLURM srun command. > > my $srun_slurm = "${FindBin::Bin}/srun.slurm"; > > # Add the prolog option, but abort if the user specifies a prolog option. > > my $command = "$srun_slurm $prolog"; > > foreach (@ARGV) { > > if (/^--task-prolog=/) { > > print("The --task-prolog option is unsupported at . Please " > . > > "contact the for assistance.\n"); > > exit(1); > > } else { > > $command .= " $_"; > > } > > } > > system($command); > > > > Ideas? 
> > > > > > > > On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain <r...@open-mpi.org> wrote: > > Still begs the bigger question, though, as others have used script > wrappers before - and I'm not sure we (OMPI) want to be in the business of > dictating the scripting language they can use. :-) > > > > Jeff and I will argue that one out > > > > > > On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> > wrote: > > > >> Ah, if it's perl, it might be easy. It might just be the difference > between system("...string...") and system(@argv). > >> > >> Sent from my phone. No type good. > >> > >> On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote: > >> > >>> Jeff, > >>> > >>> I actually misspoke earlier. It turns out our srun is a *Perl* script > around the SLURM srun. I'll speak with our admins to see if they can > massage the script to not interpret the arguments. If possible, I'll ask > them if I can share the script with you (privately or on the list) and > maybe you can see how it is affecting Open MPI's argument passage. > >>> > >>> Matt > >>> > >>> > >>> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > >>> On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote: > >>> > >>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to > be a wrapper around the regular srun that runs a --task-prolog. What it > does...that's beyond my ken, but I could ask. My guess is that it probably > does something that helps keep our old PBS scripts running (sets > $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The > admins would, of course, prefer all future scripts be SLURM-native scripts, > but there are a lot of production runs that uses many, many PBS scripts. > Converting that would need slow, careful QC to make sure any "pure SLURM" > versions act as expected. 
> >>> > >>> Ralph and I haven't had a chance to discuss this in detail yet, but I > have thought about this quite a bit. > >>> > >>> What is happening is that one of the $argv OMPI passes is of the form > "foo;bar". Your srun script is interpreting the ";" as the end of the > command and the "bar" as the beginning of a new command, and mayhem ensues.
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
Jeff, Here is the script (with a bit of munging for safety's sake): #! /usr/bin/perl -w use strict; use warnings; use FindBin; # Specify the path to the prolog. my $prolog = '--task-prolog=/gpfsm//.task.prolog'; # Build the path to the SLURM srun command. my $srun_slurm = "${FindBin::Bin}/srun.slurm"; # Add the prolog option, but abort if the user specifies a prolog option. my $command = "$srun_slurm $prolog"; foreach (@ARGV) { if (/^--task-prolog=/) { print("The --task-prolog option is unsupported at . Please " . "contact the for assistance.\n"); exit(1); } else { $command .= " $_"; } } system($command); Ideas? On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain <r...@open-mpi.org> wrote: > Still begs the bigger question, though, as others have used script > wrappers before - and I'm not sure we (OMPI) want to be in the business of > dictating the scripting language they can use. :-) > > Jeff and I will argue that one out > > > On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> > wrote: > > Ah, if it's perl, it might be easy. It might just be the difference > between system("...string...") and system(@argv). > > Sent from my phone. No type good. > > On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote: > > Jeff, > > I actually misspoke earlier. It turns out our srun is a *Perl* script > around the SLURM srun. I'll speak with our admins to see if they can > massage the script to not interpret the arguments. If possible, I'll ask > them if I can share the script with you (privately or on the list) and > maybe you can see how it is affecting Open MPI's argument passage. > > Matt > > > On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > >> On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote: >> >> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be >> a wrapper around the regular srun that runs a --task-prolog. 
What it >> does...that's beyond my ken, but I could ask. My guess is that it probably >> does something that helps keep our old PBS scripts running (sets >> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The >> admins would, of course, prefer all future scripts be SLURM-native scripts, >> but there are a lot of production runs that uses many, many PBS scripts. >> Converting that would need slow, careful QC to make sure any "pure SLURM" >> versions act as expected. >> >> Ralph and I haven't had a chance to discuss this in detail yet, but I >> have thought about this quite a bit. >> >> What is happening is that one of the $argv OMPI passes is of the form >> "foo;bar". Your srun script is interpreting the ";" as the end of the >> command and the "bar" as the beginning of a new command, and mayhem ensues. >> >> Basically, your srun script is violating what should be a very safe >> assumption: that the $argv we pass to it will not be interpreted by a >> shell. Put differently: your "srun" script behaves differently than >> SLURM's "srun" executable. This violates OMPI's expectations of how srun >> should behave. >> >> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing >> all other SLURM installations out there that *don't* violate this >> assumption (i.e., all of them). Ralph may disagree with me on this point, >> BTW -- like I said, we haven't talked about this in detail since Tuesday. >> :-) >> >> So here's my question: is there any chance you can change your "srun" >> script to a script language that doesn't recombine $argv? This is a common >> problem, actually -- sh/csh/etc. script languages tend to recombine $argv, >> but other languages such as perl and python do not (e.g., >> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a >> ). 
>> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> ___ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2014/09/25263.php >> > > > > -- > "And, isn't sanity really just a one-trick pony
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
Jeff, I actually misspoke earlier. It turns out our srun is a *Perl* script around the SLURM srun. I'll speak with our admins to see if they can massage the script to not interpret the arguments. If possible, I'll ask them if I can share the script with you (privately or on the list) and maybe you can see how it is affecting Open MPI's argument passage. Matt On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote: > On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote: > > > Just saw this, sorry. Our srun is indeed a shell script. It seems to be > a wrapper around the regular srun that runs a --task-prolog. What it > does...that's beyond my ken, but I could ask. My guess is that it probably > does something that helps keep our old PBS scripts running (sets > $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The > admins would, of course, prefer all future scripts be SLURM-native scripts, > but there are a lot of production runs that uses many, many PBS scripts. > Converting that would need slow, careful QC to make sure any "pure SLURM" > versions act as expected. > > Ralph and I haven't had a chance to discuss this in detail yet, but I have > thought about this quite a bit. > > What is happening is that one of the $argv OMPI passes is of the form > "foo;bar". Your srun script is interpreting the ";" as the end of the > command and the "bar" as the beginning of a new command, and mayhem ensues. > > Basically, your srun script is violating what should be a very safe > assumption: that the $argv we pass to it will not be interpreted by a > shell. Put differently: your "srun" script behaves differently than > SLURM's "srun" executable. This violates OMPI's expectations of how srun > should behave. > > My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing > all other SLURM installations out there that *don't* violate this > assumption (i.e., all of them). 
Ralph may disagree with me on this point, > BTW -- like I said, we haven't talked about this in detail since Tuesday. > :-) > > So here's my question: is there any chance you can change your "srun" > script to a script language that doesn't recombine $argv? This is a common > problem, actually -- sh/csh/etc. script languages tend to recombine $argv, > but other languages such as perl and python do not (e.g., > http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a > ). > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/09/25263.php > -- "And, isn't sanity really just a one-trick pony anyway? I mean all you get is one trick: rational thinking. But when you're good and crazy, oooh, oooh, oooh, the sky is the limit!" -- The Tick
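Jeff's recombination failure can be reproduced without SLURM at all. A minimal sketch (not the site's actual wrapper): `eval` below stands in for any wrapper that flattens its arguments into a single string before re-invoking a command, as Perl's system("$cmd $_") effectively does once the string contains shell metacharacters:

```shell
#!/bin/sh
# A token of the form Open MPI passes to srun:
arg='foo;bar'

# Passed along as a distinct argument -- as Perl's system(@list) or
# SLURM's real srun would receive it -- the token survives intact:
listed="$arg"
echo "list form: $listed"

# Re-parsed by a shell, the ';' terminates the command early and 'bar'
# is executed as a separate command (its failure is sent to /dev/null):
stringed=$(eval "echo $arg" 2>/dev/null)
echo "string form: $stringed"
```

The list form prints `foo;bar`; the string form prints only `foo`, which is exactly the "mayhem" described above when the lost token was half of an OMPI daemon argument.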
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) wrote: > Matt: Random thought -- is your "srun" a shell script, perchance? (it > shouldn't be, but perhaps there's some kind of local override...?) > > Ralph's point on the call today is that it doesn't matter *how* this > problem is happening. It *is* happening to real users, and so we need to > account for it. > > But it really bothers me that we don't understand *how/why* this is > happening (e.g., is this OMPI's fault somehow? I don't think so, but then > again, we don't understand how it's happening). *Somewhere* in there, a > shell is getting invoked. But "srun" shouldn't be invoking a shell on the > remote side -- it should be directly fork/exec'ing the tokens with no shell > interpretation at all. > Jeff, Just saw this, sorry. Our srun is indeed a shell script. It seems to be a wrapper around the regular srun that runs a --task-prolog. What it does...that's beyond my ken, but I could ask. My guess is that it probably does something that helps keep our old PBS scripts running (sets $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The admins would, of course, prefer all future scripts be SLURM-native scripts, but there are a lot of production runs that uses many, many PBS scripts. Converting that would need slow, careful QC to make sure any "pure SLURM" versions act as expected. Matt -- "And, isn't sanity really just a one-trick pony anyway? I mean all you get is one trick: rational thinking. But when you're good and crazy, oooh, oooh, oooh, the sky is the limit!" -- The Tick
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
Jeff, I tried your script and I saw: (1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./script.sh (1028) $ Now, the very first time I ran it, I think I might have noticed a blip of orted on the nodes, but it disappeared fast. When I re-run the same command, it just seems to exit immediately with nothing showing up. If I use my "debug-patch" version, I see: (1028) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch//bin/mpirun -np 8 ./script.sh hello world hello world hello world hello world hello world hello world hello world hello world And, well, it's there for 10 minutes, I'm guessing. If I ssh to another of the nodes in my allocation: (1005) $ ps aux | grep openmpi mathomp4 20317 0.0 0.0 59952 4256 ?S09:17 0:00 /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/orted -mca orte_ess_jobid 1842544640 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 6 -mca orte_hnp_uri 1842544640.0;tcp://10.1.24.169,172.31.1.254, 10.12.24.169:41684 mathomp4 20389 0.0 0.0 5524 844 pts/0S+ 09:19 0:00 grep --color=auto openmpi Matt On Tue, Sep 2, 2014 at 5:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote: > Matt -- > > We were discussing this issue on our weekly OMPI engineering call today. > > Can you check one thing for me? With the un-edited 1.8.2 tarball > installation, I see that you're getting no output for commands that you run > -- but also no errors. > > Can you verify and see if your commands are actually *running*? E.g, try: > > $ cat > script.sh < #!/bin/sh > echo hello world > sleep 600 > echo goodbye world > EOF > $ chmod +x script.sh > $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0 > $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun > -np 8 script.sh > > and then go "ps" on the back-end nodes and see if there is an "orted" > process and N "sleep 600" processes running on them. > > I'm *assuming* you won't see the "hello world" output. 
> > The purpose of this test is that I want to see if OMPI is just totally > erring out and not even running your job (which is quite unlikely; OMPI > should be much more noisy when this happens), or whether we're simply not > seeing the stdout from the job. > > Thanks. > > > > On Sep 2, 2014, at 9:36 AM, Matt Thompson <fort...@gmail.com> wrote: > > > On that machine, it would be SLES 11 SP1. I think it's soon > transitioning to SLES 11 SP3. > > > > I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7). > > > > > > On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain <r...@open-mpi.org> wrote: > > Thanks - I expect we'll have to release 1.8.3 soon to fix this in case > others have similar issues. Out of curiosity, what OS are you using? > > > > > > On Sep 1, 2014, at 9:00 AM, Matt Thompson <fort...@gmail.com> wrote: > > > >> Ralph, > >> > >> Okay that seems to have done it here (well, minus the usual > shmem_mmap_enable_nfs_warning that our system always generates): > >> > >> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0 > >> (1034) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun > -np 8 ./helloWorld.182-debug-patch.x > >> Process7 of8 is on borg01w218 > >> Process5 of8 is on borg01w218 > >> Process1 of8 is on borg01w218 > >> Process3 of8 is on borg01w218 > >> Process0 of8 is on borg01w218 > >> Process2 of8 is on borg01w218 > >> Process4 of8 is on borg01w218 > >> Process6 of8 is on borg01w218 > >> > >> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I > suppose. > >> > >> Thanks, > >> Matt > >> > >> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> > wrote: > >> Hmmm... I may see the problem. Would you be so kind as to apply the > attached patch to your 1.8.2 code, rebuild, and try again? > >> > >> Much appreciate the help. Everyone's system is slightly different, and > I think you've uncovered one of those differences. 
> >> Ralph > >> > >> > >> > >> On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote: > >> > >>> Ralph, > >>> > >>> Sorry it took me a bit of time. Here you go: > >>> > >>> (1002) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun > --leave-session-attached --debug-daemons --mca oob_base_verbose 1
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
Ralph, Okay that seems to have done it here (well, minus the usual shmem_mmap_enable_nfs_warning that our system always generates): (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0 (1034) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun -np 8 ./helloWorld.182-debug-patch.x Process7 of8 is on borg01w218 Process5 of8 is on borg01w218 Process1 of8 is on borg01w218 Process3 of8 is on borg01w218 Process0 of8 is on borg01w218 Process2 of8 is on borg01w218 Process4 of8 is on borg01w218 Process6 of8 is on borg01w218 I'll ask the admin to apply the patch locally...and wait for 1.8.3, I suppose. Thanks, Matt On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> wrote: > Hmmm... I may see the problem. Would you be so kind as to apply the > attached patch to your 1.8.2 code, rebuild, and try again? > > Much appreciate the help. Everyone's system is slightly different, and I > think you've uncovered one of those differences. > Ralph > > > > On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote: > > Ralph, > > Sorry it took me a bit of time. 
Here you go: > > (1002) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun > --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca > plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x > [borg01w063:03815] mca:base:select:( plm) Querying component [isolated] > [borg01w063:03815] mca:base:select:( plm) Query of component [isolated] > set priority to 0 > [borg01w063:03815] mca:base:select:( plm) Querying component [rsh] > [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh > path NULL > [borg01w063:03815] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [borg01w063:03815] mca:base:select:( plm) Querying component [slurm] > [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection > [borg01w063:03815] mca:base:select:( plm) Query of component [slurm] set > priority to 75 > [borg01w063:03815] mca:base:select:( plm) Selected component [slurm] > [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash > 1757783593 > [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163 > [borg01w063:03815] mca: base: components_register: registering oob > components > [borg01w063:03815] mca: base: components_register: found loaded component > tcp > [borg01w063:03815] mca: base: components_register: component tcp register > function successful > [borg01w063:03815] mca: base: components_open: opening oob components > [borg01w063:03815] mca: base: components_open: found loaded component tcp > [borg01w063:03815] mca: base: components_open: component tcp open function > successful > [borg01w063:03815] mca:oob:select: checking available component tcp > [borg01w063:03815] mca:oob:select: Querying component [tcp] > [borg01w063:03815] oob:tcp: component_available called > [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > 
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our > list of V4 connections > [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our > list of V4 connections > [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our > list of V4 connections > [borg01w063:03815] [[49163,0],0] TCP STARTUP > [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0 > [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373 > [borg01w063:03815] mca:oob:select: Adding component to end > [borg01w063:03815] mca:oob:select: Found 1 active transports > [borg01w063:03815] [[49163,0],0] plm:base:receive start comm > [borg01w063:03815] [[49163,0],0] plm:base:setup_job > [borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],1] > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon > [[49163,0],1] to node borg01w064 > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],2] > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon > [[49163,0],2] to node borg01w065 > [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],3] > [borg01w063:03815] [[49163,0],0] plm:base:setup
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
routed_binomial.c at line 498
[borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
srun.slurm: error: borg01w069: task 2: Exited with exit code 213
srun.slurm: error: borg01w065: task 1: Exited with exit code 213
srun.slurm: error: borg01w071: task 4: Exited with exit code 213
srun.slurm: error: borg01w070: task 3: Exited with exit code 213
sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or directory
[borg01w063:03815] [[49163,0],0] plm:slurm: primary daemons complete!
[borg01w063:03815] [[49163,0],0] plm:base:receive stop comm
[borg01w063:03815] [[49163,0],0] TCP SHUTDOWN
[borg01w063:03815] mca: base: close: component tcp closed
[borg01w063:03815] mca: base: close: unloading component tcp

On Fri, Aug 29, 2014 at 3:18 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line being executed. Can you add it?
>
> On Aug 29, 2014, at 11:16 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> Here you go:
>
> (1080) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 ./helloWorld.182-debug.x
> [borg01x142:29232] mca: base: components_register: registering oob components
> [borg01x142:29232] mca: base: components_register: found loaded component tcp
> [borg01x142:29232] mca: base: components_register: component tcp register function successful
> [borg01x142:29232] mca: base: components_open: opening oob components
> [borg01x142:29232] mca: base: components_open: found loaded component tcp
> [borg01x142:29232] mca: base: components_open: component tcp open function successful
> [borg01x142:29232] mca:oob:select: checking available component tcp
> [borg01x142:29232] mca:oob:select: Querying component [tcp]
> [borg01x142:29232] oob:tcp: component_available called
> [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our list of V4 connections
> [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
> [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our list of V4 connections
> [borg01x142:29232] [[52298,0],0] TCP STARTUP
> [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
> [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
> [borg01x142:29232] mca:oob:select: Adding component to end
> [borg01x142:29232] mca:oob:select: Found 1 active transports
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> [borg01x153:01290] mca: base: components_register: registering oob components
> [borg01x153:01290] mca: base: components_register: found loaded component tcp
> [borg01x1
Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
base/ess_base_std_orted.c at line 539
srun.slurm: error: borg01x143: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2332583.24
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun.slurm: error: borg01x153: task 3: Exited with exit code 213
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
srun.slurm: error: borg01x144: task 1: Exited with exit code 213
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
srun.slurm: error: borg01x154: task 4: Exited with exit code 213
srun.slurm: error: borg01x145: task 2: Exited with exit code 213
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or directory
[borg01x142:29232] [[52298,0],0] TCP SHUTDOWN
[borg01x142:29232] mca: base: close: component tcp closed
[borg01x142:29232] mca: base: close: unloading component tcp

Note, if I can get the allocation today, I want to try doing all this on a single SandyBridge node, rather than on 6. It might make comparing various runs a bit easier!

Matt

On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, something quite weird is happening here. I can't replicate using the 1.8.2 release tarball on a slurm machine, so my guess is that something else is going on here.
>
> Could you please rebuild the 1.8.2 code with --enable-debug on the configure line (assuming you haven't already done so), and then rerun that version as before but adding "--mca oob_base_verbose 10" to the cmd line?
>
> On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> For 1.8.2rc4 I get:
>
> (1003) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for commands!
> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for commands!
> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for commands!
> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for commands!
> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for commands!
> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,
[OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs
Open MPI List,

I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our cluster (reported on this list), and decided to try it with 1.8.2. However, we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder, Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no stdout with Open MPI 1.8.2. That is, HelloWorld doesn't work.

To wit, our sysadmin has two tarballs:

(1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
(1442) $ sha1sum openmpi-1.8.2.tar.gz
cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz

I then build each with a script in the method our sysadmin usually does:

> #!/bin/sh
> set -x
> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
> build() {
>   echo `pwd`
>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
>     CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>     CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
>     F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>     LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>     --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>   make 2>&1 | tee make.1.8.2.log
>   make check 2>&1 | tee makecheck.1.8.2.log
>   make install 2>&1 | tee makeinstall.1.8.2.log
> }
> echo "calling build"
> build
> echo "exiting"

The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX and log file tees.

Now, let us test. First, I grab some nodes with slurm:

$ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 --account=g0620 --mail-type=BEGIN

Once I get my nodes, I run with 1.8.2rc4:

(1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
(1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
Process 0 of 8 is on borg01w044
Process 5 of 8 is on borg01w044
Process 3 of 8 is on borg01w044
Process 7 of 8 is on borg01w044
Process 1 of 8 is on borg01w044
Process 2 of 8 is on borg01w044
Process 4 of 8 is on borg01w044
Process 6 of 8 is on borg01w044

Now 1.8.2:

(1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
(1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
(1146) $

No output at all. But, if I take the helloWorld.x from 1.8.2 and run it with 1.8.2rc4's mpirun:

(1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
Process 5 of 8 is on borg01w044
Process 7 of 8 is on borg01w044
Process 2 of 8 is on borg01w044
Process 4 of 8 is on borg01w044
Process 1 of 8 is on borg01w044
Process 3 of 8 is on borg01w044
Process 6 of 8 is on borg01w044
Process 0 of 8 is on borg01w044

So... any idea what is happening here? There did seem to be a few SLURM-related changes between the two tarballs involving /dev/null, but it's a bit above me to decipher. You can find the ompi_info, build, make, config, etc. logs at these links (they are ~300kB, which is over the mailing list limit according to the Open MPI web page):

https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2

Thank you for any help, and please let me know if you need more information,
Matt
--
"And, isn't sanity really just a one-trick pony anyway? I mean all you get is one trick: rational thinking. But when you're good and crazy, oooh, oooh, oooh, the sky is the limit!" -- The Tick
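The build script in this report funnels every stage through the same "2>&1 | tee LOG" idiom so each stage is both shown and logged. A minimal, hypothetical sketch of that pattern (the stage and file names here are illustrative stand-ins, not from the thread):

```shell
# "2>&1 | tee LOG" merges stderr into stdout, displays the combined
# stream, and writes a copy to the log file at the same time.
{ echo "configure: checking compilers"; echo "configure: demo warning" 1>&2; } 2>&1 | tee configure.demo.log

# Both the stdout and stderr lines ended up in the log:
wc -l < configure.demo.log
```

The same wrapper applied to the real configure/make/make-check/make-install stages yields the per-stage logs the poster later uploads.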
Re: [OMPI users] Intermittent, somewhat architecture-dependent hang with Open MPI 1.8.1
Jeff,

I've tried moving the backing file and it doesn't matter. I can say that PGI 14.7 + Open MPI 1.8.1 does not show this issue. I can run that on 96 cores just fine. Heck, I've run it on a few hundred.

As for the 96, they are either on 8 Westmere nodes (8 nodes with 2 6-core sockets) or 6 Sandy Bridge nodes (6 nodes with 2 8-core sockets). I think each set is on a different Infiniband fabric, but I'm not sure of that. However, since the PGI 14.7/Open MPI 1.8.1 build works just fine on the exact same sets of nodes (grabbed via an interactive SLURM job), I can't see how the Infiniband fabric would matter.

I also tried various combinations of:

mpirun --np
mpirun --map-by core -np
mpirun --map-by socket -np

and maybe a few -bind-to as well, all with --report-bindings on to make sure it was doing what I expected, and it was. It wasn't putting 96 processes on a single node, for example, or all on the same socket or core by some freak accident.

The only difference between the Open MPI installs is the compilers they were built with (I'm pretty sure the admins just downloaded the source once). Looking at "mpif90 -showme" I can see that the PGI 14.7 compile built the mpi_f90 and mpi modules while it looks like the GCC 4.9.1 one did not, but our main code and this reproducer only use mpif.h, so that shouldn't matter.

Matt

On Sat, Aug 16, 2014 at 7:33 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Have you tried moving your shared memory backing file directory, like the warning message suggests?
>
> I haven't seen a shared memory file on a network share cause correctness issues before (just performance issues), but I could see how that could be in the realm of possibility...
>
> Also, are you running 96 processes on a single machine, or spread across multiple machines?
>
> Note that Open MPI 1.8.x binds each MPI process to a core by default, so if you're oversubscribing the machine, it could be fairly disastrous...?
>
> On Aug 14, 2014, at 1:29 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Open MPI Users,
> >
> > I work on a large climate model called GEOS-5 and we've recently managed to get it to compile with gfortran 4.9.1 (our usual compilers are Intel and PGI for performance). In doing so, we asked our admins to install Open MPI 1.8.1 as the MPI stack instead of MVAPICH2 2.0, mainly because we figure the gfortran port is more geared to a desktop.
> >
> > So, the model builds just fine but when we run it, it stalls in our "History" component whose job is to write out netCDF files of output. The odd thing is, though, this stall seems to happen more on our Sandy Bridge nodes than on our Westmere nodes, but both hang.
> >
> > A colleague has made a single-file code that emulates our History component (the MPI traffic part) that we've used to report bugs to MVAPICH, and I asked him to try it with this issue and it seems to duplicate it.
> >
> > To wit, a "successful" run of the code is:
> >
> > (1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> > --
> > WARNING: Open MPI will create a shared memory backing file in a
> > directory that appears to be mounted on a network filesystem.
> > Creating the shared memory backup file on a network file system, such
> > as NFS or Lustre is not recommended -- it may cause excessive network
> > traffic to your file servers and/or cause shared memory traffic in
> > Open MPI to be much slower than expected.
> >
> > You may want to check what the typical temporary directory is on your
> > node. Possible sources of the location of this temporary directory
> > include the $TEMPDIR, $TEMP, and $TMP environment variables.
> >
> > Note, too, that system administrators can set a list of filesystems
> > where Open MPI is disallowed from creating temporary files by setting
> > the MCA parameter "orte_no_session_dir".
> >
> > Local host: borg01s026
> > Filename: /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0/60464/1/shared_mem_pool.borg01s026
> >
> > You can set the MCA parameter shmem_mmap_enable_nfs_warning to 0 to
> > disable this message.
> > --
> > nx: 4
> > ny: 24
> > comm size is 96
> > local array sizes are 12 12
> > filling local arrays
> > creating requests
> > ig
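The mapping/binding sweep described earlier in this thread (varying --map-by and -bind-to while watching --report-bindings) can be driven by a loop over option sets. This is only a sketch: mpirun is not actually invoked here (no MPI installation is assumed), the loop merely enumerates the command variants that would be run.

```shell
# Enumerate mpirun placement variants from the thread; in a real sweep,
# each line would be executed and its --report-bindings output compared.
for opts in "--map-by core" "--map-by socket" "--bind-to core"; do
  echo "mpirun $opts --report-bindings -np 96 ./mpi_reproducer.x 4 24"
done
```

Comparing the --report-bindings output across variants is what rules out accidental oversubscription of one node, socket, or core.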
[OMPI users] Intermittent, somewhat architecture-dependent hang with Open MPI 1.8.1
;place" around a collective wait.

Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all help/error messages) I usually just "hang" with no error message at all (additionally turning off the warning):

(1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
(1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
nx: 4
ny: 24
comm size is 96
local array sizes are 12 12
filling local arrays
creating requests
igather before collective wait

Note, this problem doesn't seem to appear at lower numbers of processes (16, 24, 32) but does seem pretty consistent at 96, especially on Sandy Bridge nodes. Also, yes, we get that weird srun.slurm warning, but we always seem to get that (Open MPI, MVAPICH), so while our admins are trying to correct it, at present it is not our worry.

The MPI stack was compiled with (per our admins):

export CFLAGS="-fPIC -m64"
export CXXFLAGS="-fPIC -m64"
export FFLAGS="-fPIC"
export FCFLAGS="-fPIC"
export F90FLAGS="-fPIC"
export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64
../configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic --prefix=${PREFIX}

The output of "ompi_info --all" is found here:
https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out

The reproducer code can be found here:
https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90

The reproducer is easily built with just 'mpif90', and to run it:

mpirun -np NPROCS ./mpi_reproducer.x NX NY

where NX*NY has to equal NPROCS and it's best to keep them even numbers. (There might be a few more restrictions and the code will die if you violate them.)
Thanks, Matt Thompson -- Matt Thompson SSAI, Sr Software Test Engr NASA GSFC, Global Modeling and Assimilation Office Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771 Phone: 301-614-6712 Fax: 301-614-6246
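For sh/bash users, the two csh "setenv" lines in the report correspond to exported OMPI_MCA_-prefixed environment variables, and the NX*NY == NPROCS restriction on the reproducer can be enforced by deriving -np arithmetically. A small sketch (mpirun itself is only echoed here, since no MPI installation is assumed):

```shell
# Open MPI reads MCA parameters from OMPI_MCA_-prefixed environment
# variables; these two mirror the csh "setenv" lines in the report.
export OMPI_MCA_orte_base_help_aggregate=0
export OMPI_MCA_shmem_mmap_enable_nfs_warning=0

# Derive -np from NX and NY so the NX*NY == NPROCS rule always holds.
NX=4; NY=24
echo "mpirun -np $((NX * NY)) ./mpi_reproducer.x $NX $NY"
```

Computing -np from NX and NY makes it impossible to violate the reproducer's process-count restriction when sweeping grid shapes.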
Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran
Jeff,

I ran these commands:

$ make clean
$ make distclean    (wanted to be extra sure!)
$ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64' CXXFLAGS='-m64' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' --prefix=/Users/fortran/AutomakeBug/autobug14 |& tee configure.log
$ make V=1 install |& tee makeV1install.log

So find attached the config.log, configure.log, and makeV1install.log, which should have all the info you asked about.

Matt

PS: I just tried configure/make/make install with Open MPI 1.7.5, but the same error occurs, as expected. Hope springs eternal, you know?

On Mon, Mar 24, 2014 at 6:48 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> On Mar 24, 2014, at 6:34 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Sorry for the late reply. The answer is: No, 1.14.1 has not fixed the problem (and indeed, that's what my Mac is running):
> >
> > (28) $ make install |& tee makeinstall.log
> > Making install in src
> > ../config/install-sh -c -d '/Users/fortran/AutomakeBug/autobug14/lib'
> > /bin/sh ../libtool --mode=install /usr/bin/install -c libfortran_stuff.la '/Users/fortran/AutomakeBug/autobug14/lib'
> > libtool: install: /usr/bin/install -c .libs/libfortran_stuff.0.dylib /Users/fortran/AutomakeBug/autobug14/lib/libfortran_stuff.0.dylib
> > install: .libs/libfortran_stuff.0.dylib: No such file or directory
> > make[2]: *** [install-libLTLIBRARIES] Error 71
> > make[1]: *** [install-am] Error 2
> > make: *** [install-recursive] Error 1
> >
> > This is the output from either the am12 or am14 test. If you have any options you'd like me to try with this, let me know. (For example, is there a way to make autotools *more* verbose? I've always tried to make it less so!)
>
> Ok.
With the am14 tarball, please run: > > make clean > > And then run this: > > make V=1 install > > And then send the following: > > - configure stdout > - config.log file > - stdout/stderr from "make V=1 install" > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- "And, isn't sanity really just a one-trick pony anyway? I mean all you get is one trick: rational thinking. But when you're good and crazy, oooh, oooh, oooh, the sky is the limit!" -- The Tick config.log Description: Binary data configure.log Description: Binary data makeV1install.log Description: Binary data
Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran
Jeff,

Sorry for the late reply. The answer is: No, 1.14.1 has not fixed the problem (and indeed, that's what my Mac is running):

(28) $ make install |& tee makeinstall.log
Making install in src
../config/install-sh -c -d '/Users/fortran/AutomakeBug/autobug14/lib'
/bin/sh ../libtool --mode=install /usr/bin/install -c libfortran_stuff.la '/Users/fortran/AutomakeBug/autobug14/lib'
libtool: install: /usr/bin/install -c .libs/libfortran_stuff.0.dylib /Users/fortran/AutomakeBug/autobug14/lib/libfortran_stuff.0.dylib
install: .libs/libfortran_stuff.0.dylib: No such file or directory
make[2]: *** [install-libLTLIBRARIES] Error 71
make[1]: *** [install-am] Error 2
make: *** [install-recursive] Error 1

This is the output from either the am12 or am14 test. If you have any options you'd like me to try with this, let me know. (For example, is there a way to make autotools *more* verbose? I've always tried to make it less so!)

Matt

On Fri, Mar 21, 2014 at 11:02 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> This is starting to smell like a Libtool and/or Automake bug -- it created libmpi_usempi_ignore_tkr.dylib, but it tried to install libmpi_usempi_ignore_tkr.0.dylib (notice the extra ".0"). :-\
>
> This is both good and bad.
>
> Good: I can think of 2 ways to work around this issue off the top of my head:
>
> 1. "make -k install" and ignore the error as it flashes by. The rest of OMPI will install properly. Then cd into build_dir/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs. Copy libmpi_usempi_ignore_tkr.* to $libdir (i.e., /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib, in your example below). And you should be good to go.
>
> ...although you may need to do a similar thing in the ompi/mpi/fortran/use-mpi-f08/.libs directory.
>
> 2. Somewhere in ompi/mpi/fortran/use-mpi-ignore-tkr/Makefile will be the filename "libmpi_usempi_ignore_tkr.0.dylib". Edit it to remove the ".0". Then "make install" should work fine. (You might need to do the same in use-mpi-f08/Makefile.)
>
> Bad: we can't really fix this error if it really is a bug in Automake and/or Libtool, but we can at least report it upstream.
>
> I've made a trivial Autotools test project (https://github.com/jsquyres/pgi-autotool-bug) to see if we can nail this down a little more, and possibly use the results to report upstream.
>
> Here are the versions of Autotools that we use to make the OMPI 1.7.x series:
>
> Autoconf 2.69
> Automake 1.12.2
> Libtool 2.4.2
> m4 1.4.16
>
> Attached is a tarball I made of the sample project using those versions. Can you try building and installing this tarball on your system with the same kinds of options you used with OMPI? Hopefully, you should see the same error. If not, I need to tweak this project a bit more to make it more like OMPI's build system behavior.
>
> If you can replicate the error, then also try the second attached tarball: it's the same project, but bootstrapped with the latest version of GNU Automake (the others are already the most recent):
>
> Automake 1.14.1
>
> This will let us see if automake 1.14.1 has fixed the issue.
> > > > > On Mar 20, 2014, at 1:16 PM, Matt Thompson <fort...@gmail.com> wrote: > > > Jeff, here you go: > > > > (3) $ cd ompi/mpi/fortran/use-mpi-ignore-tkr > > total 2888 > > -rw-r--r-- 1 fortran staff 1.7K Apr 13 2013 Makefile.am > > -rw-r--r-- 1 fortran staff 215K Dec 17 21:09 > mpi-ignore-tkr-interfaces.h.in > > -rw-r--r-- 1 fortran staff39K Dec 17 21:09 > mpi-ignore-tkr-file-interfaces.h.in > > -rw-r--r-- 1 fortran staff 1.5K Jan 27 19:04 mpi-ignore-tkr.F90 > > -rw-r--r-- 1 fortran staff80K Feb 4 17:53 Makefile.in > > -rw-r--r-- 1 fortran staff 208K Mar 18 20:37 > mpi-ignore-tkr-interfaces.h > > -rw-r--r-- 1 fortran staff38K Mar 18 20:37 > mpi-ignore-tkr-file-interfaces.h > > -rw-r--r-- 1 fortran staff75K Mar 18 20:37 Makefile > > -rw-r--r-- 1 fortran staff 765K Mar 18 20:47 mpi.mod > > -rw-r--r-- 1 fortran staff 280B Mar 18 20:47 mpi-ignore-tkr.lo > > -rw-r--r-- 1 fortran staff 1.0K Mar 18 20:47 > libmpi_usempi_ignore_tkr.la > > Directory: > /Users/fortran/MPI/src/openmpi-1.7.4/ompi/mpi/fortran/use-mpi-ignore-tkr > > (4) $ make clean > > test -z "*~ .#*" || rm -f *~ .#* > > test -z "libmpi_usempi_ignore_tkr.la" || rm -f > libmpi_usempi_ignore_tkr.la > > rm -f ./so_locations > > rm -rf .libs _libs > > rm -f *.o > > test -z "*.mod" || rm -f *.mod > > rm -f *.lo > > (5) $ make V=1 > > /bin/sh ../../../../libtool --tag=FC --m
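Workaround 2 from this thread (edit the generated Makefile to strip the spurious ".0" from the dylib name) can be scripted. This is a hedged sketch against a stand-in file; the real edit would target ompi/mpi/fortran/use-mpi-ignore-tkr/Makefile in the build tree, and sed's -i.bak suffix keeps a backup of the original.

```shell
# Create a stand-in Makefile fragment containing the broken name.
printf 'lib = libmpi_usempi_ignore_tkr.0.dylib\n' > Makefile.demo

# Strip the spurious ".0" in place, keeping the original as Makefile.demo.bak.
sed -i.bak 's/libmpi_usempi_ignore_tkr\.0\.dylib/libmpi_usempi_ignore_tkr.dylib/' Makefile.demo

cat Makefile.demo
```

The -i.bak form (no space before the suffix) works with both GNU sed on Linux and BSD sed on OS X, which matters since this thread is about an OS X build.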
Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran
cking Fortran compiler ignore TKR syntax... 1:real, dimension(*):!DIR$ > IGNORE_TKR > checking if building Fortran 'use mpi' bindings... yes > - > > And then the make logs indicate that it did, indeed, build the ignore TKR > mpi module. > > - > Making all in mpi/fortran/use-mpi-ignore-tkr > PPFC mpi-ignore-tkr.lo > FCLD libmpi_usempi_ignore_tkr.la > - > > And then make install fails: > > - > Making install in mpi/fortran/use-mpi-ignore-tkr > ../../../../config/install-sh -c -d > '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib' > /bin/sh ../../../../libtool --mode=install /usr/bin/install -c > libmpi_usempi_ignore_tkr.la'/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib' > libtool: install: /usr/bin/install -c > .libs/libmpi_usempi_ignore_tkr.0.dylib > /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc/lib/libmpi_usempi_ignore_tkr.0.dylib > install: .libs/libmpi_usempi_ignore_tkr.0.dylib: No such file or directory > - > > Can you do the following: > > - > cd ompi_build_dir/ompi/mpi/fortran/use-mpi-ignore-tkr > make clean > make V=1 > find . > make install > - > > > On Mar 20, 2014, at 7:44 AM, Matt Thompson <fort...@gmail.com> wrote: > > > Jeff, > > > > It does not: > > > > Directory: > /Users/fortran/MPI/src/openmpi-1.7.4/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs > > (106) $ ls -ltr > > total 1560 > > -rw-r--r-- 1 fortran staff 784824 Mar 18 20:47 mpi-ignore-tkr.o > > -rw-r--r-- 1 fortran staff1021 Mar 18 20:47 > libmpi_usempi_ignore_tkr.lai > > lrwxr-xr-x 1 fortran staff 30 Mar 18 20:47 > libmpi_usempi_ignore_tkr.la@ -> ../libmpi_usempi_ignore_tkr.la > > lrwxr-xr-x 1 fortran staff 32 Mar 18 20:47 > libmpi_usempi_ignore_tkr.dylib@ -> libmpi_usempi_ignore_tkr.0.dylib > > > > which I guess makes sense. > > > > I'm attaching the logfiles from my compile attempt. This is the "basic" > attempt as can be seen from the config.log file. 
> > > > Thanks, > > Matt > > > > > > > > On Thu, Mar 20, 2014 at 6:45 AM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > > Sorry for the delay; we're working on releasing 1.7.5 and that's > consuming all my time... > > > > That's a strange error. Can you confirm whether > ompi_buil_dir/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs/libmpi_usempi_ignore_tkr.0.dylib > exists or not? > > > > Can you send all the info listed here: > > > > http://www.open-mpi.org/community/help/ > > > > > > On Mar 18, 2014, at 8:59 PM, Matt Thompson <fort...@gmail.com> wrote: > > > > > All, > > > > > > I recently downloaded PGI's Free OS X Fortran compiler: > > > > > > http://www.pgroup.com/products/freepgi/ > > > > > > in the hope of potentially using it to compile a weather model I work > with GEOS-5. That model requires an MPI stack and I usually start (and end) > with Open MPI on a desktop. > > > > > > So, I grabbed Open MPI 1.7.4 and tried compiling it in a few ways. In > each case, my C and C++ compilers were the built-in clang-y gcc and g++ > from Xcode, while pgfortran was the Fortran compiler. 
I tried a few > different configures from the basic: > > > > > > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64' > CXXFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' > --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3 > > > > > > all the way to the "let's try every flag Google says I might use" > version of: > > > > > > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64 > -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8' > CXXFLAGS='-m64 -Xclang -target-feature -Xclang -aes > -mmacosx-version-min=10.8' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' > --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx > > > > > > In every case, the configure, make, and make check worked well without > error, but running a 'make install' led to: > > > > > > Making install in mpi/fortran/use-mpi-ignore-tkr > > > ../../../../config/install-sh -c -d > '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib' > > > /bin/sh ../../../../libtool --mode=install /usr/bin/install -c > libmpi_usempi_ignore_tkr.la'/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib' > > > libtool: install: /usr/bin/install -c > .libs/libmpi_usempi_ignore_tkr.0.dylib > /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib/libmpi_usempi_ignore_tkr.0.dylib > > > install: .
Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran
Jeff, It does not: Directory: /Users/fortran/MPI/src/openmpi-1.7.4/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs (106) $ ls -ltr total 1560 -rw-r--r-- 1 fortran staff 784824 Mar 18 20:47 mpi-ignore-tkr.o -rw-r--r-- 1 fortran staff1021 Mar 18 20:47 libmpi_usempi_ignore_tkr.lai lrwxr-xr-x 1 fortran staff 30 Mar 18 20:47 libmpi_usempi_ignore_tkr.la@ -> ../libmpi_usempi_ignore_tkr.la lrwxr-xr-x 1 fortran staff 32 Mar 18 20:47 libmpi_usempi_ignore_tkr.dylib@ -> libmpi_usempi_ignore_tkr.0.dylib which I guess makes sense. I'm attaching the logfiles from my compile attempt. This is the "basic" attempt as can be seen from the config.log file. Thanks, Matt On Thu, Mar 20, 2014 at 6:45 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com > wrote: > Sorry for the delay; we're working on releasing 1.7.5 and that's consuming > all my time... > > That's a strange error. Can you confirm whether > ompi_buil_dir/ompi/mpi/fortran/use-mpi-ignore-tkr/.libs/libmpi_usempi_ignore_tkr.0.dylib > exists or not? > > Can you send all the info listed here: > > http://www.open-mpi.org/community/help/ > > > On Mar 18, 2014, at 8:59 PM, Matt Thompson <fort...@gmail.com> wrote: > > > All, > > > > I recently downloaded PGI's Free OS X Fortran compiler: > > > > http://www.pgroup.com/products/freepgi/ > > > > in the hope of potentially using it to compile a weather model I work > with GEOS-5. That model requires an MPI stack and I usually start (and end) > with Open MPI on a desktop. > > > > So, I grabbed Open MPI 1.7.4 and tried compiling it in a few ways. In > each case, my C and C++ compilers were the built-in clang-y gcc and g++ > from Xcode, while pgfortran was the Fortran compiler. 
I tried a few > different configures from the basic: > > > > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64' > CXXFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' > --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3 > > > > all the way to the "let's try every flag Google says I might use" > version of: > > > > $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64 > -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8' > CXXFLAGS='-m64 -Xclang -target-feature -Xclang -aes > -mmacosx-version-min=10.8' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' > --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx > > > > In every case, the configure, make, and make check worked well without > error, but running a 'make install' led to: > > > > Making install in mpi/fortran/use-mpi-ignore-tkr > > ../../../../config/install-sh -c -d > '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib' > > /bin/sh ../../../../libtool --mode=install /usr/bin/install -c > libmpi_usempi_ignore_tkr.la'/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib' > > libtool: install: /usr/bin/install -c > .libs/libmpi_usempi_ignore_tkr.0.dylib > /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib/libmpi_usempi_ignore_tkr.0.dylib > > install: .libs/libmpi_usempi_ignore_tkr.0.dylib: No such file or > directory > > make[3]: *** [install-libLTLIBRARIES] Error 71 > > make[2]: *** [install-am] Error 2 > > make[1]: *** [install-recursive] Error 1 > > make: *** [install-recursive] Error 1 > > > > Any ideas on how to overcome this? > > > > Thanks, > > Matt Thompson > > -- > > "And, isn't sanity really just a one-trick pony anyway? I mean all you > > get is one trick: rational thinking. But when you're good and crazy, > > oooh, oooh, oooh, the sky is the limit!" 
> > -- The Tick
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

--
"And, isn't sanity really just a one-trick pony anyway? I mean all you
get is one trick: rational thinking. But when you're good and crazy,
oooh, oooh, oooh, the sky is the limit!"
-- The Tick

OMPI-1.7.4-Logfiles.tar.bz2
Description: BZip2 compressed data
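[The `ls -ltr` listing in the reply above shows the root cause: `libmpi_usempi_ignore_tkr.dylib` is a symlink to `libmpi_usempi_ignore_tkr.0.dylib`, but the `.0.dylib` itself was never built, leaving a dangling link for `make install` to trip over. A generic way to sweep a libtool build tree for such dangling symlinks before installing — a sketch added for this archive, not part of the original thread; `BUILD_DIR` is just an environment variable I made up:

```shell
#!/bin/sh
# Print every symbolic link under BUILD_DIR whose target is missing.
# A dangling link in a .libs directory usually means libtool created
# the link but the library it points at was never actually built.
BUILD_DIR="${BUILD_DIR:-.}"
find "$BUILD_DIR" -type l ! -exec test -e {} \; -print
```

Run from the top of the openmpi-1.7.4 build tree, this would have flagged the `.libs/libmpi_usempi_ignore_tkr.dylib` link shown above.]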
[OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran
All,

I recently downloaded PGI's Free OS X Fortran compiler:

    http://www.pgroup.com/products/freepgi/

in the hope of potentially using it to compile a weather model I work with, GEOS-5. That model requires an MPI stack, and I usually start (and end) with Open MPI on a desktop.

So, I grabbed Open MPI 1.7.4 and tried compiling it in a few ways. In each case, my C and C++ compilers were the built-in clang-y gcc and g++ from Xcode, while pgfortran was the Fortran compiler. I tried a few different configures, from the basic:

$ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64' \
    CXXFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' \
    --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3

all the way to the "let's try every flag Google says I might use" version of:

$ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran \
    CFLAGS='-m64 -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8' \
    CXXFLAGS='-m64 -Xclang -target-feature -Xclang -aes -mmacosx-version-min=10.8' \
    LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' \
    --prefix=/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx

In every case, the configure, make, and make check worked well without error, but running a 'make install' led to:

Making install in mpi/fortran/use-mpi-ignore-tkr
../../../../config/install-sh -c -d '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
/bin/sh ../../../../libtool --mode=install /usr/bin/install -c libmpi_usempi_ignore_tkr.la '/Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib'
libtool: install: /usr/bin/install -c .libs/libmpi_usempi_ignore_tkr.0.dylib /Users/fortran/MPI/openmpi_1.7.4-pgi_14.3-gcc-mmacosx/lib/libmpi_usempi_ignore_tkr.0.dylib
install: .libs/libmpi_usempi_ignore_tkr.0.dylib: No such file or directory
make[3]: *** [install-libLTLIBRARIES] Error 71
make[2]: *** [install-am] Error 2
make[1]: *** [install-recursive] Error 1
make: *** [install-recursive] Error 1

Any ideas on how to overcome this?
Thanks,
Matt Thompson
--
"And, isn't sanity really just a one-trick pony anyway? I mean all you
get is one trick: rational thinking. But when you're good and crazy,
oooh, oooh, oooh, the sky is the limit!"
-- The Tick
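[The `make install` failure above is libtool trying to install a `.0.dylib` that the build never produced. Each libtool `.la` file records, in its `library_names=` line, exactly which files libtool will later install from the adjacent `.libs` directory, so a mismatch can be caught before `make install` runs. A hedged sketch of such a pre-install check — `check_la_files` is a helper name invented here, not an Open MPI or libtool command:

```shell
#!/bin/sh
# For every libtool .la file in the tree, verify that each library
# named on its library_names= line actually exists in the adjacent
# .libs directory, and report any that are missing.
check_la_files() {
    dir="${1:-.}"
    find "$dir" -name '*.la' -not -path '*/.libs/*' | while read -r la; do
        libdir="$(dirname "$la")/.libs"
        # .la files contain e.g.: library_names='libfoo.0.dylib libfoo.dylib'
        names=$(sed -n "s/^library_names='\(.*\)'/\1/p" "$la")
        for n in $names; do
            [ -e "$libdir/$n" ] || echo "MISSING: $libdir/$n"
        done
    done
}

check_la_files "${1:-.}"
```

Run from the top of the build tree after `make`, this would have reported `libmpi_usempi_ignore_tkr.0.dylib` as missing before `make install` ever failed.]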