Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2
On 20 April 2017 at 12:58, r...@open-mpi.org wrote:

> Fully expected - if ORTE can’t start one or more daemons, then the MPI job
> itself will never be executed.
>
> There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC
> it didn’t quite make the 2.0.2 release. In fact, I just checked and it did
> indeed miss that release.
>
> You have three choices:
>
> 1. you could apply the patch to the 2.0.2 source code yourself - it is at
> https://github.com/open-mpi/ompi/pull/3162
>
> 2. download a copy of the latest nightly 2.0.3 tarball - hasn’t been
> officially released yet, but includes the patch
>
> 3. upgrade to the nightly 2.1.1 tarball - expected to be officially released
> soon and also includes the patch
>
> Hopefully, one of those options will fix the problem
> Ralph

OK, so I went with Choice 2 and downloaded the nightly 2.0.3 from the 19th April.

  openmpi-v2.0.x-201704190318-24b5b83.tar.bz2

Having altered the top-level directory inside it to be just

  openmpi-2.0.x

I then used that tarball as a source for the SPEC file from 2.0.2's SRPM and ... it's all built and my application is now running inside the SGE environment.

Deep joy!

Thanks for the suggestions, I'm looking forward to the official releases of 2.0.3 and/or 2.1.1.

Kevin
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2
On 20 April 2017 at 12:58, r...@open-mpi.org wrote:

> Fully expected - if ORTE can’t start one or more daemons, then the MPI job
> itself will never be executed.
>
> There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC
> it didn’t quite make the 2.0.2 release. In fact, I just checked and it did
> indeed miss that release.
>
> You have three choices:
>
> 1. you could apply the patch to the 2.0.2 source code yourself - it is at
> https://github.com/open-mpi/ompi/pull/3162
>
> 2. download a copy of the latest nightly 2.0.3 tarball - hasn’t been
> officially released yet, but includes the patch
>
> 3. upgrade to the nightly 2.1.1 tarball - expected to be officially released
> soon and also includes the patch
>
> Hopefully, one of those options will fix the problem
> Ralph

This observation may not belong here, but as there are some eyes on this issue, I might as well raise it, as I came across it in the wake of going with Choice 2.

If you take a nightly tarball and try to use the existing version's SRPM build infrastructure to create an RPM, then you will fall foul of the dashes in the nightly tarball names.

That is, if you try putting in

  Version: v2.0.x-201704190318-24b5b83

instead of the original

  Version: 2.0.2

you'll be told, when coming to do an rpmbuild (your line number may differ),

  error: line 188: Illegal char '-' in: Version: v2.0.x-201704190318-24b5b83

You can get round this by

  1) Un-tar-ing the nightly tarball
  2) Renaming it so that the dashes become dots
  3) Recreating the newly named nightly tarball

If you then use

  Version: v2.0.x.201704190318.24b5b83

the rpmbuild runs (as in, it's running as I write!).

As to whether that observation might inform a different convention for naming the nightly tarballs, that is left up to the real developers.
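The three steps above could be scripted roughly as follows. This is only a sketch: the helper names rpm_safe_version and repack_nightly are made up for illustration, and it assumes that the nightly tarball unpacks into a directory carrying the tarball's base name, which may not hold for every nightly.

```shell
# Hypothetical helper: make a nightly version string acceptable to the
# spec file's Version: tag by turning every dash into a dot.
rpm_safe_version() {
    printf '%s\n' "$1" | sed 's/-/./g'
}

# Hypothetical helper: unpack a nightly tarball, rename its top-level
# directory and re-create the tarball under the dotted name.
# Assumption: the tarball unpacks into a directory named after the tarball.
repack_nightly() {
    tarball=$1                                   # e.g. openmpi-v2.0.x-...tar.bz2
    base=$(basename "$tarball" .tar.bz2)
    version=$(rpm_safe_version "${base#openmpi-}")
    tar xjf "$tarball" &&
    mv "$base" "openmpi-$version" &&
    tar cjf "openmpi-$version.tar.bz2" "openmpi-$version"
}

# Print the dotted form of the version string from the message above.
rpm_safe_version v2.0.x-201704190318-24b5b83
```

The dotted string is what would then go into the Version: tag, so that the spec file's Source: openmpi-%{version}.tar.bz2 line resolves to the re-created tarball.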
Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2
On 20 April 2017 at 12:58, r...@open-mpi.org wrote:

> Fully expected - if ORTE can’t start one or more daemons, then the MPI job
> itself will never be executed.
>
> There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC
> it didn’t quite make the 2.0.2 release. In fact, I just checked and it did
> indeed miss that release.

Just to say that the original "bodge fix" mentioned here

  https://github.com/open-mpi/ompi/issues/2947

that is, adding

  -mca plm_rsh_agent foo

on the mpirun command line, does allow my 2.0.2 application to run.

> You have three choices:
> ...

Now I just have to make one.

Cheers,
Kevin
Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2
On 19 April 2017 at 18:35, Kevin Buckley wrote:

> If I compile against 2.0.2 the same command works at the command line
> but not in the "SGE" job submission, where I see a complaint about
>
> =
> Host key verification failed.
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> blah, blah, blah ...
> =

Just to add that if I add in some basic debugging

  --mca btl_base_verbose 30

then when running at the command line, I get a swathe of info from the MCA; within the SGE environment, however, I still only get the "ORTE was unable .." message?
[OMPI devel] Program which runs with 1.8.3, fails with 2.0.2
I have source code for MrBayes.

If I compile against OpenMPI 1.8.3, then an

  mpirun -np 4 mb < somefile.txt

works at both the command line and in an "SGE" job submission where I'm targeting 4 cores on the same node.

If I compile against 2.0.2 the same command works at the command line but not in the "SGE" job submission, where I see a complaint about

  =
  Host key verification failed.
  --
  ORTE was unable to reliably start one or more daemons.
  This usually is caused by:
  blah, blah, blah ...
  =

Even if I try to force the SM transport via the MCA BTL flags, I get the same thing across the four attempts.

Both the 1.8.3 and 2.0.2 versions were built from the SRPM and, I believe, were built in the same way.

So what might I have missed? SGE integration related??

Kevin M. Buckley
eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
On 5 April 2017 at 13:01, Kevin Buckley wrote:

> I also note that as things stand, the Relocation is used for all
> files except the Environment Module file, resulting from the
> rpmbuild beig done as follows
>
> --define 'install_shell_scripts 1' \
> --define 'install_modulefile 1' \
> --define 'use_mpi_selector 1' \
> --define 'scl openmpi-openmpi' \
> -bc openmpi-2.0.2.spec
>
> remains in the "system" tree, as
>
> /usr/share/Modules/modulefiles/openmpi-openmpi-openmpi/2.0.2
>
> although, because of the ACL naming, it would no longer clash with
> the path to a non-SCL built module file.

Furthermore, that modulefile contains the right paths

---8<-8<-8<-8<-8<-8<-8<--
#%Module
# NOTE: This is an automatically-generated file! (generated by the
# Open MPI/SHMEM RPM). Any changes made here will be lost a) if the RPM is
# uninstalled, or b) if the RPM is upgraded or uninstalled.

proc ModulesHelp { } {
   puts stderr "This module adds Open MPI/SHMEM v2.0.2 to various paths"
}

module-whatis "Sets up Open MPI/SHMEM v2.0.2 in your enviornment"

prepend-path PATH "/opt/OpenMPI/openmpi-openmpi/root/usr/bin/"
prepend-path LD_LIBRARY_PATH /opt/OpenMPI/openmpi-openmpi/root/usr/lib64
prepend-path MANPATH /opt/OpenMPI/openmpi-openmpi/root/usr/share/man
---8<-8<-8<-8<-8<-8<-8<--
Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
Just in case anyone is interested in following this, I'll try and document what I'm doing here.

I have a forked repo and added a branch here

  https://github.com/vuw-ecs-kevin/ompi/tree/make-specfile-scl-capable

and have applied a series of small changes that allow for the building of an RPM that appears to be an SCL-enabled one.

There are a couple of small things to iron out, mostly as regards naming, to wit:

An SCL rpmbuild, using

  --define 'scl openmpi-openmpi'

produces

  openmpi-openmpi-openmpi-2.0.2-1.el6.x86_64.rpm

where the rpm -qip info shows

==
Name        : openmpi-openmpi-openmpi   Relocations: /opt/OpenMPI/openmpi-openmpi/root/usr
Version     : 2.0.2                     Vendor: redhat
Release     : 1.el6                     Build Date: Wed 05 Apr 2017 12:20:45 PM NZST
Install Date: (not installed)           Build Host: lochostname.vuw.ac.nz
Group       : Development/Libraries     Source RPM: openmpi-openmpi-openmpi-2.0.2-1.el6.src.rpm
Size        : 50459297                  License: BSD
Signature   : (none)
Packager    : redhat

Note, with reference to:

  https://www.softwarecollections.org/en/docs/guide/

that this assumes that Section 2.3's

  The Software Collection Root Directory

is

  /opt/OpenMPI/

and that Section 2.4's

  The Software Collection Prefix

is

  openmpi-openmpi

because the "myorganisation" is openmpi, as is the package name.

I am less than convinced that those SCL guidelines add anything extra that might be lost if the Relocation simply occurred at

  /opt/OpenMPI/openmpi/root/usr

which I believe would result from building from the same SPEC-file with just

  --define 'scl openmpi'

but I'm not claiming to be an SCL expert.
I also note that as things stand, the Relocation is used for all files except the Environment Module file, which, with the rpmbuild being done as follows

  --define 'install_shell_scripts 1' \
  --define 'install_modulefile 1' \
  --define 'use_mpi_selector 1' \
  --define 'scl openmpi-openmpi' \
  -bc openmpi-2.0.2.spec

remains in the "system" tree, as

  /usr/share/Modules/modulefiles/openmpi-openmpi-openmpi/2.0.2

although, because of the SCL naming, it would no longer clash with the path to a non-SCL built module file.

Some thought is needed on the module file location, which is perhaps not all that surprising, given that was where this thread started.

Anyroad, some initial progress, which might elicit some comments?
Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
On 31 March 2017 at 23:35, Jeff Squyres (jsquyres) wrote:

and Gilles, who said,

>> you should only use the tarballs from www.open-mpi.org

> The GitHub tarballs are simple tars of the git repo at a given hash (e.g.,
> the v2.0.2 tag in git). ...

Yep, I'm aware of the way that GitHub tarballs can be created, even when a lot of folk don't bother creating them. I think it's good that you do.

> We can't turn off these tarballs from being generated and from being
> available on the GitHub site.
> We actually don't do anything with these tarballs at all.

OK, even simpler.

> The short version is that the process looks like this (this skips a lot of
> detail, but you get the point):
>
> -
> git clone ...the ompi repo...
> cd ompi
> git checkout ...the desired hash...
> ./autogen.pl
> ./configure
> make distcheck
> -

Given that any changes, as regards an SCL-capable RPM build, are only likely to be made to

  contrib/dist/linux/openmpi.spec

which is a file that exists in both GitHub and the bootstrapped ones, it all seems OK.

I think some of my confusion comes from seeing files that only appear to be in the GitHub tarball when compared against the bootstrapped tarball, after doing

  tar xf ompi-2.0.2.tar.gz
  tar xf openmpi-2.0.2.tar.bz2
  diff -r ompi-2.0.2/ openmpi-2.0.2/ > OpenMPI-202-gh_website.diff

(I attach the latter) which shows, along with

  Only in openmpi-2.0.2/: configure
  Only in openmpi-2.0.2/contrib: Makefile.in

which I appreciate are the results of the Autotools bootstrapping, stuff like

  Only in ompi-2.0.2/contrib/dist: find-copyrights.pl
  Only in ompi-2.0.2/contrib/dist/linux: README
  Only in ompi-2.0.2/contrib/dist/linux: README.ompi-spec-generator
  Only in ompi-2.0.2/contrib/dist/linux: buildrpm.sh
  Only in ompi-2.0.2/contrib/dist/linux: buildswitcherrpm.sh
  Only in ompi-2.0.2/contrib/dist/linux: ompi-spec-generator.py
  Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.spec
  Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.tcl

which suggests some things have been removed?

Apologies for dragging this out,
Kevin

Only in ompi-2.0.2/: .gitignore
Only in ompi-2.0.2/: .mailmap
Only in ompi-2.0.2/: .travis.yml
Only in ompi-2.0.2/: HACKING
Only in ompi-2.0.2/: ISSUE_TEMPLATE.md
Only in openmpi-2.0.2/: Makefile.in
diff -r ompi-2.0.2/VERSION openmpi-2.0.2/VERSION
27c27
< greek=rc5
---
> greek=
34c34
< repo_rev=
---
> repo_rev=v2.0.1-348-ge291d0e
44c44
< tarball_version=gitclone
---
> tarball_version=
48c48
< date="Unreleased developer copy"
---
> date="Jan 31, 2017"
Only in openmpi-2.0.2/: aclocal.m4
Only in openmpi-2.0.2/config: Makefile.in
Only in ompi-2.0.2/config: Makefile.options
Only in openmpi-2.0.2/config: autogen_found_items.m4
Only in openmpi-2.0.2/config: compile
Only in openmpi-2.0.2/config: config.guess
Only in openmpi-2.0.2/config: config.sub
Only in openmpi-2.0.2/config: depcomp
Only in openmpi-2.0.2/config: install-sh
Only in openmpi-2.0.2/config: libtool.m4
Only in openmpi-2.0.2/config: ltmain.sh
Only in openmpi-2.0.2/config: ltoptions.m4
Only in openmpi-2.0.2/config: ltsugar.m4
Only in openmpi-2.0.2/config: ltversion.m4
Only in openmpi-2.0.2/config: lt~obsolete.m4
Only in openmpi-2.0.2/config: missing
Only in ompi-2.0.2/config: ompi_check_udapl.m4
Only in ompi-2.0.2/config: ompi_microsoft.m4
Only in ompi-2.0.2/config: opal_setup_component_package.m4
Only in ompi-2.0.2/config: orte_setup_java.m4
Only in openmpi-2.0.2/config: test-driver
Only in openmpi-2.0.2/config: ylwrap
Only in openmpi-2.0.2/: configure
Only in openmpi-2.0.2/contrib: Makefile.in
Only in ompi-2.0.2/contrib/amca-param-sets: ft-enable-cr
Only in ompi-2.0.2/contrib/amca-param-sets: ft-enable-cr-recovery
Only in ompi-2.0.2/contrib: annual-maintenance
Only in ompi-2.0.2/contrib: authors-to-cvsimport.pl
Only in ompi-2.0.2/contrib: build-mca-comps-outside-of-tree
Only in ompi-2.0.2/contrib: build-server
Only in ompi-2.0.2/contrib: check-btl-sm-diffs.pl
Only in ompi-2.0.2/contrib: check-help-strings.pl
Only in ompi-2.0.2/contrib: check-ob1-pml-diffs.pl
Only in ompi-2.0.2/contrib: check-ob1-revision.pl
Only in ompi-2.0.2/contrib: check-owner.pl
Only in ompi-2.0.2/contrib: check_unnecessary_headers.sh
Only in ompi-2.0.2/contrib: code_counter.pl
Only in ompi-2.0.2/contrib: coverity
Only in ompi-2.0.2/contrib/dist: find-copyrights.pl
Only in ompi-2.0.2/contrib/dist/linux: README
Only in ompi-2.0.2/contrib/dist/linux: README.ompi-spec-generator
Only in ompi-2.0.2/contrib/dist/linux: buildrpm.sh
Only in ompi-2.0.2/contrib/dist/linux: buildswitcherrpm.sh
Only in ompi-2.0.2/contrib/dist/linux: ompi-spec-generator.py
Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.spec
Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.tcl
Only in ompi-2.0.2/contrib/dist: make-authors.pl
Only in ompi-2.0.2/contrib/dist: make-html-man-pages.pl
Only in ompi-2.0.2/contrib/dist: make_tarball
Only in openmpi-2.0.2/contrib/dist/mofed/debian: changelog
Only in openmpi-2.0.2/con
Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
On 29 March 2017 at 13:49, Jeff Squyres (jsquyres) wrote:

> I have no objections to this.
>
> Unfortunately, I don't have the time to work on it, but we'd be glad to look
> at pull requests to introduce this functionality. :-)

Yes, yes, alright.

I am though slightly confused, following the move to GitHub.

If I go to the GitHub repo, the available release tarballs appear, for example, as:

  ompi-2.0.2.tar.gz

and unpack as

  drwxrwxr-x root/root         0 2017-01-31 05:54 ompi-2.0.2/
  -rw-rw-r-- root/root     15724 2017-01-31 05:54 ompi-2.0.2/.gitignore
  -rw-rw-r-- root/root      4212 2017-01-31 05:54 ompi-2.0.2/.mailmap
  -rw-rw-r-- root/root      3415 2017-01-31 05:54 ompi-2.0.2/.travis.yml
  -rw-rw-r-- root/root      9867 2017-01-31 05:54 ompi-2.0.2/AUTHORS
  ..

that is, using the GitHub subproject/repo name.

However, the tarball referred to in the SRPM's SPEC-file has the "does what it says on the tin" name:

  Source: openmpi-%{version}.tar.bz2

which is also the version available from the open-mpi.org website, and which unpacks as

  drwxrwxr-x mpiteam/mpiteam   0 2017-02-01 04:21 openmpi-2.0.2/
  -rw-rw-r-- mpiteam/mpiteam 9867 2017-01-27 08:14 openmpi-2.0.2/AUTHORS
  ...

Are the GitHub tarballs and the "available from the open-mpi.org website" tarballs the same? (except for the git stuff)

Should they be the same, now that the development is managed at GitHub?

What is your process behind taking the GitHub release tarball and turning it into the open-mpi.org one? (or vice-versa?)
Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
Another thing that occurred to me whilst looking around this was whether the OpenMPI SRPM might benefit from being given proper "Software Collections" package capability, as opposed to having the "install in opt" option.

I don't claim to have enough insight to say either way here; however, the Software Collections documentation

  https://www.softwarecollections.org/en/docs/guide/#sect-Converting_a_Conventional_Spec_File

suggests that it's possible to use a traditional specfile for both system and SCL builds, and I could see that, were "OpenMPI" to be used as the SCLo "provider" tag, then things would end up living under

  /opt/OpenMPI/openmpi-202/

which isn't that far away from what you have with the "install in opt" now.

Very much a Request for Comments rather than a suggestion though,
Kevin
Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
On 23 March 2017 at 23:41, Jeff Squyres (jsquyres) wrote:

> Yoinks. Looks like this was an oversight. :-(
>
> Yes, I agree that install_in_opt should put the modulefile in /opt as well.

Actually, I have since read the SPEC file from top to bottom and seen a Changelog entry (from you Jeff, from you!) dating back to 2007 that says

  * Fri Feb 9 2007 Jeff Squyres
  ...
  - Make shell script and modulefile installation indepdendent of
    %{install_in_opt} (they're really separate issues)
  ...

so are you sure you agree? (*see below)

> If you could offer a patch or pull request, that would be awesome.

I should get around to cloning a GitHub branch I guess, but for now it appears to be a simple one-liner

  $ diff openmpi-2.0.2.spec openmpi-2.0.2_vuw_gcc.spec
  165a166
  > %define modulefile_path /opt/%{name}/%{version}/share/openmpi/modulefiles
  $

which needs adding at the bottom of the

  %if %{install_in_opt}

stanza. A diff -u patch is attached.

* I think the reasoning, for me, is this: if the SRPM is being used to create an OpenMPI that will supersede the RHEL/CentOS system's version, then adding the module into a system area is the right thing. If, however, an admin goes to the trouble of "installing in opt", then clearly they want things "out of the way of the system", in general usage anyway, and so having the modulefile in a place that requires them to take action for it to become available to the users is a "good thing".

All the best,
Kevin

--- SPECS/openmpi-2.0.2.spec.orig 2017-02-01 13:05:57.0 +1300
+++ SPECS/openmpi-2.0.2.spec 2017-03-23 17:11:50.033903233 +1300
@@ -163,6 +163,7 @@
 # bets are off. So feel free to install it anywhere in your tree. He
 # suggests $prefix/doc.
 %define _defaultdocdir /opt/%{name}/%{version}/doc
+%define modulefile_path /opt/%{name}/%{version}/share/openmpi/modulefiles
 %endif
 
 %if !%{build_debuginfo_rpm}
[OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt
Just came to rehash some old attempts to build previous OpenMPIs for an RPM-based system and noticed that, despite specifying

  --define 'install_in_opt 1' \

as part of this full "config" rpmbuild stage

(Note: the SPEC-file Release tag is altered so as not to have the RPM clash with any system MPI packages, hence the name change)

  rpmbuild \
    --define '_topdir /path/to/my/rpmbuild' \
    --define 'install_in_opt 1' \
    --define 'install_shell_scripts 1' \
    --define 'install_modulefile 1' \
    --define 'use_mpi_selector 0' \
    --define 'build_all_in_one_rpm 1' \
    -bb openmpi-2.0.2_vuw_gcc.spec

I still get a non-/opt hierarchy file

  /usr/share/Modules/modulefiles/openmpi/2.0.2

which isn't what is needed.

I'd have thought the module files for an "install_in_opt" configuration would have been installed below the

  /opt/openmpi/2.0.2/share

tree, leaving the system admin to then place them wherever they needed to end up within the local "system dir" hierarchies.

(Note: I use /etc/modulefiles/ in addition to /usr/share/Modules/modulefiles/, which I consider to be the home of the module files from the modules package itself)

I note in the SPEC-file that there are possible overrides to any related defaults, vis:

  # type: string (root path to install modulefiles)
  %{!?modulefile_path: %define modulefile_path /usr/share/Modules/modulefiles}
  # type: string (subdir to install modulefile)
  %{!?modulefile_subdir: %define modulefile_subdir %{name}}
  # type: string (name of modulefile)
  %{!?modulefile_name: %define modulefile_name %{version}}

but would suggest that

  install_modulefile 1

when combined with

  install_in_opt 1

should alter the default, without any extra user configuration, just as if an AutoTools config had done a

  --prefix=/opt

in which case everything would be expected to get the prefix.

Just a suggestion, though I'll look to offer a patch, now that I see that my notes from 2012 about this (where I first found the issue!) still apply but, before I look to offer the patch ... whaddyathink of the suggestion itself?

Kevin M. Buckley
eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
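Until the default changes, the modulefile_path override quoted from the SPEC-file above can presumably be set by hand on the rpmbuild command line. The following is a sketch only: opt_modulefile_path is a made-up helper name, and the path layout is simply the one this message says it would have expected under /opt, not something the SPEC-file itself defines.

```shell
# Hypothetical helper: build the modulefile root that an install_in_opt
# build might be expected to use, from the package name and version.
opt_modulefile_path() {
    printf '/opt/%s/%s/share/openmpi/modulefiles\n' "$1" "$2"
}

# Print the path for the package discussed in this thread.
opt_modulefile_path openmpi 2.0.2

# Assumed usage (not executed here): pass the result as an extra define
# alongside the others.
#   rpmbuild \
#     --define 'install_in_opt 1' \
#     --define 'install_modulefile 1' \
#     --define "modulefile_path $(opt_modulefile_path openmpi 2.0.2)" \
#     -bb openmpi-2.0.2_vuw_gcc.spec
```

With the modulefile_subdir and modulefile_name defaults left alone, the module file itself would then be expected to land in an openmpi/2.0.2 entry below that root.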
[OMPI devel] OpenMPI 1.10.0: Arch Linux PkgSrc build suggests configure script patch
Watcha,

we recently updated the OpenMPI installation on our School's ArchLinux machines, where OpenMPI is built as a PkgSrc package, to 1.10.0.

In running through the build, we were told that PkgSrc wasn't too keen on the use of the == with a single "if test" construct and so I needed to apply the following patch

--- configure.orig 2015-08-24 23:33:14.0 +
+++ configure
@@ -60570,8 +60570,8 @@ _ACEOF
 $as_echo "$MPI_OFFSET_DATATYPE" >&6; }
 
-if test "$ompi_fortran_happy" == "1" && \
- test "$OMPI_WANT_FORTRAN_BINDINGS" == "1"; then
+if test "$ompi_fortran_happy" = "1" && \
+ test "$OMPI_WANT_FORTRAN_BINDINGS" = "1"; then
 # Get the kind value for Fortran MPI_INTEGER_KIND (corresponding
 # to whatever is the same size as a F77 INTEGER -- for the

Seem to recall that this is "good practice" and indeed, can see that other "if test" stanzas in the configure script have been fixed to match, so perhaps this one has just slipped through the net and/or not been reported by anyone else as yet.

--
Kevin M. Buckley
eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
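For anyone wanting to see the patched form in isolation, here is a minimal sketch of the same comparison written with the single "=" that POSIX test specifies. The function name and its yes/no output are made up for illustration; only the two variable names come from the configure script.

```shell
# Hypothetical stand-alone version of the patched stanza.
# $1 stands in for $ompi_fortran_happy, $2 for $OMPI_WANT_FORTRAN_BINDINGS.
fortran_bindings_wanted() {
    if test "$1" = "1" && \
       test "$2" = "1"; then
        echo yes
    else
        echo no
    fi
}

fortran_bindings_wanted 1 1
fortran_bindings_wanted 1 0
```

Because "=" is the operator every test implementation has to support, this form should behave the same under bash, dash and the other shells PkgSrc cares about.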
Re: [OMPI devel] (no subject)
On 9 December 2014 at 03:29, Howard Pritchard wrote:

> Hello Kevin,
>
> Could you try testing with Open MPI 1.8.3? There was a bug in 1.8.1
> that you are likely hitting in your testing.
>
> Thanks,
>
> Howard

Bingo!

Seems to have got rid of those messages.

Thanks.
Re: [OMPI devel] (no subject)
Apologies for the lack of a subject line: cut and pasted the body before the subject!

Should have been

  Removing "registering part of your physical memory" warning message

Dunno if anyone can fix that in the mailing list?
[OMPI devel] (no subject)
Watcha,

have recently come to install the PISM package on top of PETSc, which, in turn, is built against OpenMPI 1.8.1 on our Science Faculty HPC Facility, which has SGI C2112 compute nodes with 64GB RAM running on top of CentOS 6.

In testing the PETSc deployment out and when running PISM itself, I am seeing the

  " ...OpenFabrics subsystem is configured to only allow registering part of your physical memory ..."

message telling me

  Registerable memory: 32768 MiB
  Total memory:        524285 MiB

Oh yeah, that's the 512GB big memory node, not a 64GB compute node, which says

  Registerable memory: 32768 MiB
  Total memory:        65534 MiB

but still suggests a default for allowing the use of 32GB.

So, having followed my nose to the OpenMPI FAQ sections, and the Mellanox community page,

  http://community.mellanox.com/docs/DOC-1120

which suggests the defaults for the two parameters in need of a tweak are

  log_num_mtt      20
  log_mtts_per_seg 0

I came to try and tweak those Mellanox driver parameters.

What I see on my compute nodes is

  # cat /sys/module/mlx4_core/parameters/log_num_mtt
  0
  # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
  3
  #

so something that doesn't match the defaults the Mellanox page suggests I should be seeing.

Furthermore, having "done the math" and realised that I probably want

  log_num_mtt      22
  log_mtts_per_seg 3

to allow OpenMPI to use double the memory (128GB - because giving it 1 TB on the big memory node seems excessive!), when I come to alter those values, I can't seem to.

Trying to add a module load option

  options mlx4_core log_num_mtt=22

via modifying the file

  /etc/modprobe.d/mlx4.conf

never sees that value honoured after a full node reboot.

It also appears that the

  /sys/module/mlx4_core/parameters/

are nearly all read-only, including the ones it's suggested that I tweak, vis:

  # echo 22 > /sys/module/mlx4_core/parameters/log_num_mtt
  -bash: /sys/module/mlx4_core/parameters/log_num_mtt: Permission denied
  # ls -l /sys/module/mlx4_core/parameters/log_num_mtt
  -r--r--r--. 1 root root 4096 Dec 5 13:08 /sys/module/mlx4_core/parameters/log_num_mtt

so I'm getting the impression that the Mellanox driver doesn't really want the defaults altered?

OK, so if I can't tell my nodes to allow OpenMPI to use any more than 32GB, how do I turn off the OpenMPI message that is telling me about it?

Kevin M. Buckley
eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
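The "done the math" step can be written out as below. This is a sketch under stated assumptions: it uses the formula usually quoted in the Open MPI FAQ (registerable memory = 2^log_num_mtt x 2^log_mtts_per_seg x page size), a 4096-byte page size that has not been checked on these nodes, and a made-up function name, max_reg_mib.

```shell
# Hypothetical helper: registerable memory in MiB for a given pair of
# mlx4_core parameters.
# $1: log_num_mtt, $2: log_mtts_per_seg, $3: page size in bytes (assumed 4096)
max_reg_mib() {
    echo $(( (1 << $1) * (1 << $2) * $3 / 1048576 ))
}

max_reg_mib 20 3 4096    # 32768 MiB, the figure in the warning message
max_reg_mib 22 3 4096    # 131072 MiB, i.e. the 128GB aimed for above
```

That the first call reproduces the reported 32768 MiB suggests the driver is effectively running with log_num_mtt at 20 and log_mtts_per_seg at 3, whatever the 0 shown in /sys means.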
[OMPI devel] PkgSrc build of 1.8.1 gives a portability error
Hello again OpenMPI folk, been a while.

Have just come to build OpenMPI 1.8.1 within a PkgSrc environment for our ArchLinux machines (yes, we used to be NetBSD, yes).

Latest PkgSrc build was for 1.6.4.

The 1.6.4 PkgSrc build required 4 patches, 3 of which were PkgSrc-specific and just defined a

  sysconfexampledir = $(pkgdatadir)/examples

so that the PkgSrc build could

  "Install configuration files into example directory"

Those patches affected

  orte/etc/Makefile.in
  opal/etc/Makefile.in
  ompi/etc/Makefile.in

The 4th patch, affecting

  opal/tools/wrappers/opal_wrapper.c

added some

  "Missing RPATH support"

I fixed up those four patches so that they applied cleanly to 1.8.1; however, I have been informed, by the PkgSrc build process, of the following

---8<-8<-8<-8<-8<-8<-8<-8<-8<--
=> Checking for portability problems in extracted files
ERROR: [check-portability.awk] => Found test ... == ...:
ERROR: [check-portability.awk] configure: if test "$enable_oshmem" == "yes" -a "$ompi_fortran_happy" == "1" -a \

Explanation:
===
The "test" command, as well as the "[" command, are not required to know
the "==" operator. Only a few implementations like bash and some
versions of ksh support it.

When you run "test foo == foo" on a platform that does not support the
"==" operator, the result will be "false" instead of "true". This can
lead to unexpected behavior.

There are two ways to fix this error message. If the file that contains
the "test ==" is needed for building the package, you should create a
patch for it, replacing the "==" operator with "=". If the file is not
needed, add its name to the CHECK_PORTABILITY_SKIP variable in the
package Makefile.
===
---8<-8<-8<-8<-8<-8<-8<-8<-8<--

Obviously, the file that needs to be patched is really

  configure.ac

and not

  configure

but anyroad, the place at which the oshmem stanza has used the "non-portable" double-equals construct is shown in the following attempted patch

---8<-8<-8<-8<-8<-8<-8<-8<-8<--
--- configure.ac.orig 2014-04-22 14:51:44.0 +
+++ configure.ac
@@ -611,8 +611,8 @@ m4_ifdef([project_ompi], [OMPI_SETUP_MPI
 ])
 
 AM_CONDITIONAL(OSHMEM_BUILD_FORTRAN_BINDINGS,
-[test "$enable_oshmem" == "yes" -a "$ompi_fortran_happy" == "1" -a \
- "$OMPI_WANT_FORTRAN_BINDINGS" == "1" -a \
+[test "$enable_oshmem" = "yes" -a "$ompi_fortran_happy" = "1" -a \
+ "$OMPI_WANT_FORTRAN_BINDINGS" = "1" -a \
 "$enable_oshmem_fortran" != "no"])
 
 # checkpoint results
---8<-8<-8<-8<-8<-8<-8<-8<-8<--

Someone may wish to give that the "once over" ahead of the 1.8.2 release, in light of what PkgSrc considers to be portable.

All the best,
Kevin M. Buckley
eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
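A rough way of hunting for other occurrences ahead of a release might look like the sketch below. It is not PkgSrc's check-portability.awk, just a grep with a made-up wrapper name, find_double_equals, and it only catches a "==" that sits on the same line as the test or [ that uses it, so continuation lines like the second one in the patch above would be missed.

```shell
# Hypothetical helper: list, with line numbers, the lines of a file in which
# "test" or "[" is followed on the same line by a "==" operator.
find_double_equals() {
    grep -nE '(^|[^[:alnum:]_])(test|\[)[[:space:]].*[[:space:]]==[[:space:]]' "$1"
}

# Assumed usage, run from the top of the source tree:
#   find_double_equals configure.ac
```

Each hit would still need reading by eye before swapping "==" for "=", since the pattern cannot tell a shell test from, say, a quoted string that happens to contain one.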
[OMPI devel] rpmbuild of 1.6.2 fails when build_all_in_one_rpm is 0, works when 1
I was also thinking of building a version against the PGI compilers; however, rest assured I saw the failure to build separated RPMs with a vanilla spec-file as well.

Obviously no show stopper, as I can build the "all_in_one_rpm", but thought to feed the experience back.

Kevin Buckley
ECS, VUW, NZ
Re: [OMPI devel] autogen.sh improvements
> > 5. ompi_mca.m4 has been cleaned up a bit, allowing autogen.pl to be a
> > little dumber than autogen.sh
>
> So you are dumbing down in search of improvements ?
>

Apologies. That was only meant to go to Jeff Squyres.

Kevin
Re: [OMPI devel] autogen.sh improvements
> 5. ompi_mca.m4 has been cleaned up a bit, allowing autogen.pl to be a
> little dumber than autogen.sh

So you are dumbing down in search of improvements ?
Re: [OMPI devel] v1.5: thumbs up or down? - Thumbs Down
> > OK, I humbly withdraw (a) above but now, equally humbly, suggest
> > that instead of using a list, those things be turned into standard,
> > single-target, configure options, vis:
> >
> > --with-=
> >
> > --enable-=
>
> True, this would be better. I believe that Brian didn't initially do it
> this way for some subtle reasons, but I confess that I don't remember
> exactly why.
>
> Patches would be welcome here. :-)
>
> --
> Jeff Squyres

Here you go, as per our offlist discussion, a patch that allows one to deselect addons individually,

  --enable-addon1=no --disable-addon3

which is required because you can't build up

  --enable-contrib-no-build=addon1 --enable-contrib-no-build=addon2

options, as the build system only honours the last one.

What this does is use the value of

  --enable-addon1      (yes)
  --enable-addon1=yes  (yes)
  --enable-addon1=no   (no)
  --disable-addon1     (no)

or sets it to "yes" if you don't give anything, thereby maintaining the default of building all the contribs unless told otherwise.

$ diff -u openmpi-1.5rc3{-vanilla,}/config/ompi_contrib.m4
--- openmpi-1.5rc3-vanilla/config/ompi_contrib.m4 2009-12-09 10:33:28.0 +1300
+++ openmpi-1.5rc3/config/ompi_contrib.m4 2010-07-11 15:43:56.0 +1200
@@ -99,6 +99,16 @@
 
 ompi_show_subsubsubtitle "$1 (m4 configuration macro)"
 
+AC_ARG_ENABLE([$1],
+[AS_HELP_STRING([--disable-$1],
+ [disable support for contributed package $1])],
+[],
+[enable_$1=yes])
+
+if test "x$enable_$1" != xyes ; then
+DISABLE_contrib_$1=yes
+fi
+
 OMPI_CONTRIB_HAPPY=0
 if test "$DISABLE_contrib_$1" = "" -a "$DISABLE_contrib_all" = ""; then

--
Kevin M. Buckley                  Room:  CO327
School of Engineering and         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
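The effect of the patch on a given command line can be mimicked in plain shell, as in the sketch below. The function contrib_decision and its build/skip output are made up for illustration, and it only imitates how the enable_<name> value described above ends up being set; it is not what Autoconf generates.

```shell
# Hypothetical helper: given a contrib name and a list of configure options,
# report whether that contrib would be built or skipped under the patch.
contrib_decision() {
    name=$1
    shift
    enable_value=yes                 # default when no option is given
    for opt in "$@"; do
        case $opt in
            --enable-"$name")    enable_value=yes ;;
            --enable-"$name"=*)  enable_value=${opt#*=} ;;
            --disable-"$name")   enable_value=no ;;
        esac
    done
    # Same comparison as in the patch: anything other than "yes" disables.
    if test "x$enable_value" != xyes; then
        echo skip
    else
        echo build
    fi
}

contrib_decision addon1 --enable-addon1=no --disable-addon3
contrib_decision addon2 --enable-addon1=no --disable-addon3
contrib_decision addon3 --enable-addon1=no --disable-addon3
```

Since each contrib gets its own option, deselecting two of them no longer runs into the "only the last --enable-contrib-no-build= is honoured" problem.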
Re: [OMPI devel] v1.5: thumbs up or down? - Thumbs Down
> That contribution needs to be
>
> a) brought under the control of --enable-contrib-no-build=
>
> b) possibly renamed (it would seem to be an MPI specific thing)
> so maybe, libmpitrace ?

I'd like to qualify that, in the light of some more digging, though (b) is still an issue.

It seems that the "libtrace" contribution IS under the control of

  --enable-contrib-no-build=

(not that you'd know that from configure --help) but the NetBSD handling of such options ended up running configure with:

  --enable-contrib-no-build=libtrace --enable-contrib-no-build=vt

and not:

  --enable-contrib-no-build=vt,libtrace

so the configure was only honouring the last one listed.

OK, I humbly withdraw (a) above but now, equally humbly, suggest that instead of using a list, those things be turned into standard, single-target, configure options, vis:

  --with-=

  --enable-=

humbly yours,
Kevin

--
Kevin M. Buckley                  Room:  CO327
School of Engineering and         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] v1.5: thumbs up or down? - Thumbs Down
Something I have just noticed on the NetBSD platform build that I
think goes further than just that platform.

There is a NetBSD packaging clash between the libtrace.la from

  ompi/contrib/libtrace/

and that from an already existing package

  libtrace-3.0.6
  (Homepage: http://research.wand.net.nz/software/libtrace.php)

As the OpenMPI one is just a contribution, I've tried turning off the
building of the OpenMPI libtrace using the time-honoured

  --enable-contrib-no-build=libtrace

but it still builds.

Is this a new contribution (was not in 1.4.x ?) that is not controlled
by the --enable-contrib-no-build= mechanism?

The "configure --help" output only mentions "libnbc" and "vt" though
there doesn't seem to be a "libnbc" to control anymore. Indeed a
top-level,

  find . -name \*nbc\* -print

returns empty handed.

That contribution needs to be

a) brought under the control of --enable-contrib-no-build=

b) possibly renamed (it would seem to be an MPI specific thing)
   so maybe, libmpitrace ?

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] v1.5: thumbs up or down?
> 4) The other thing that comes to mind is the mountain of WARNINGs
> because of the "redefinition" of
>
> #define CACHE_LINE_SIZE 128
>
> in
>
> opal/include/opal/sys/cache.h
>
> although it's a bit "chicken and egg" because NetBSD's definition,
> in:
>
> /usr/include/sys/param.h
>
> obviously allows one to redefine it, viz:
>
> #ifndef CACHE_LINE_SIZE
> #define CACHE_LINE_SIZE 64
> #endif
>
> so that's probably not an issue but at least you know about it.

But here's a patch:

--- opal/include/opal/sys/cache.h.orig 2010-07-06 14:29:44.0 +1200
+++ opal/include/opal/sys/cache.h 2010-07-06 14:32:34.0 +1200
@@ -30,7 +30,9 @@
  *
  * For now hardwire this to a reasonable value, and automate later - RLG
  */
+#ifndef CACHE_LINE_SIZE
 #define CACHE_LINE_SIZE 128
+#endif

 #endif /* OPAL_SYS_CACHE_H */

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] v1.5: thumbs up or down?
> Can we get a thumbs up / down from each organization about where you
> think we are with v1.5? Cisco and HLRS obviously give a thumbs up.
>

I can't claim to speak for NetBSD but, for info, I have just managed
to COMPILE 1.5rc3 on a NetBSD platform.

Notes:
======

1) The patch NetBSD was applying to opal/util/if.c

   Rewrite network interface configuration using getifaddrs(3) for BSD,

   is no longer needed, so that's obviously now in the OpenMPI source.

2) A couple of other NetBSD patches needed altering but mostly for
   line numbers with an extra couple of changes where OpenMPI has
   changed a define from

   #if OMPI_WANT_LIBLTDL

   to

   #if OPAL_WANT_LIBLTDL

3) I also had to disable the Vampire Trace toolkit build

   --enable-contrib-no-build=vt

   but then I had to do that for OpenMPI 1.4.2.

4) The other thing that comes to mind is the mountain of WARNINGs
   because of the "redefinition" of

   #define CACHE_LINE_SIZE 128

   in

   opal/include/opal/sys/cache.h

   although it's a bit "chicken and egg" because NetBSD's definition,
   in:

   /usr/include/sys/param.h

   obviously allows one to redefine it, viz:

   #ifndef CACHE_LINE_SIZE
   #define CACHE_LINE_SIZE 64
   #endif

   so that's probably not an issue but at least you know about it.

   That OpenMPI #define should probably be wrapped in an #ifndef
   to defeat the warnings but, as to which should be defined first,
   by virtue of the order of the include files ... ?

I'll try actually installing and running the thing when I get a chance
but thought you might appreciate some "progress feedback" from a NetBSD
platform trial.

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD
> I added several FAQ items -- how do they look?
>
> http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
> http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
> http://www.open-mpi.org/faq/?category=building#install-overwrite

> "This is due to some deep run time linker voodoo"

From what I have come to understand about this: I think that pretty
much covers it !

Seriously, this is good stuff to have "out there" though, because,
as you point out, the info an installer/user gets back, and through
which they might then first look to diagnose such issues, may not
steer them in the direction it should.

Kevin

PS A style as opposed to substance thing: I did notice that the last
one of the three seems to be using a fixed size width, whereas text
in the first and second flows into the browser window.

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD
Cc'd Aleksej as I'm not sure he's on the "devel" list, and Mark Davies,
as he is certainly not.

I'll also post this back onto the R HPC SIG list which is where I came in.

Jeff Squyres wrote:

> Now, all this being said, IIRC (and I very well may not!), the real
> underlying issue here is that R is dlopening libmpi.so, which, in turn, is
> dlopening its own DSOs. Given the global linker scoping issues, OMPI's
> DSOs are unable to find the symbols they need to resolve in the process
> (because libmpi.so's was opened in a private scope).
>
> This probably is unfortunately larger than us (Open MPI) -- it's really a
> POSIX issue. What would be ideal is if different linker namespaces could
> be something more fine-grained than "global" or "private" within a
> process. E.g., if the private namespace of libmpi.so in the process could
> selectively make its symbol namespace available to the DSOs that it
> dlopens. Right now, the only option libmpi.so has is to be opened
> with a public scope, which somewhat defeats the point of private
> scoping.
>

Tying in with the suggestions you make above, there would seem to be a
work-around fix for this, in the case of the Rmpi package on NetBSD
anyway. Furthermore, the fix does not require any alterations to OpenMPI.

Apparently, there has been a similar issue, symbol visibility when
chaining shared library loading, within PAM on NetBSD.

Mark Davies has now determined a way to force the Rmpi package to
load libmpi.so, ahead of loading the Rmpi shared library itself, so
that what appear to be the missing symbols are then available, for
any future loads of the OpenMPI component libraries.
On the version of Rmpi that I have been using, 0.5-8, the "fix" can be
effected by the following one-line patch

--- Rmpi/R/zzz.R 2009-02-04 05:27:08.0 +1300
+++ Rmpi.local/R/zzz.R 2010-05-17 14:25:27.0 +1200
@@ -7,6 +7,7 @@
 #cat(vertxt)
 # Check if lam-mpi is running
+dyn.load("/usr/pkg/lib/libmpi.so", local=FALSE)
 library.dynam("Rmpi", pkg, lib)
 if (!TRUE) stop("Fail to load Rmpi dynamic library.")

Note that this currently hard codes the path to the libmpi.so, which
for our system is in the standard NetBSD PkgSrc location, though there
are probably "nicer" ways to achieve the same end, and greater
flexibility, using R internals.

Having said that, this "fix" does not seem to be needed on platforms
that have a global scope for shared library symbols, so attempts
to make it generic may be pointless.

Thanks for everyone's time on this issue.

I'll certainly be watching attempts to resolve the "larger than us
(Open MPI)" issue,

Kevin

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
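On the hard-coded path: one possible way round it, short of using R internals, would be to locate libmpi.so when the package is built and substitute the result into the patch. The following is only a sketch under that assumption; find_libmpi is a hypothetical helper and the directory list in the usage note is a guess, with only /usr/pkg/lib coming from the message above.

```shell
#!/bin/sh
# Hypothetical helper: print the first libmpi.so found in the given
# directories, searched in the order listed.
# Returns non-zero (and prints nothing) if none of them has one.
find_libmpi() {
    for dir in "$@"; do
        if [ -f "$dir/libmpi.so" ]; then
            echo "$dir/libmpi.so"
            return 0
        fi
    done
    return 1
}
```

It could be called as, say, find_libmpi /usr/pkg/lib /usr/local/lib /usr/lib, with the printed path then used in place of the literal one in the dyn.load() line.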
Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD
Jeff,

> So the error message is at least *somewhat* better than a totally
> misleading "file not found" message -- but it still only speculates
> on the real reason that libltdl failed to load the DSO.
>
> 2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an
> OMPI-specific change to libltdl that avoids the incorrect error message
> altogether. So now OMPI should print out the *real* reason libltdl
> failed to load the DSO.
>
> It does not look like this patch made it over into the v1.4 series;
> it is awaiting review before it moves to the v1.5 branch
> (https://svn.open-mpi.org/trac/ompi/ticket/2337).
>
> Hope that all made sense!

Great insight. You'll appreciate I have some idea as to what's going
on but not the completed jigsaw view as to how all the pieces I find
fit into the whole, so thank you.

Not sure it explains away the inability of my libtool test program
to open the shared-library in question but it certainly moves things
forwards.

> Have you tried building Open MPI with the --disable-dlopen configure flag?
> This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> dlopening at run-time. Hence, your app (R) can dlopen libmpi.so, but then
> libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> physically located in libmpi.so.

Given your reasoning, that's gotta be worth a shot: wilco.

Thanks once again for your time on this,
Kevin

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD
> Which libltdl version is that NetBSD ltdl.h from? Which version is
> in opal/libltdl? Have you tried not doing the above change?
>
> libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> as well as in the header, as well as (I think) in preloaded modules.

Hey Ralf,

The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.

An ldd of mpirun shows

  -lltdl.7 => /usr/pkg/lib/libltdl.so.7

I do need to attempt a build of 1.4.2 here in ECS, so I'll try building
without the patches but I seem to recall that if those libtool-related
patches

  opal/Makefile.in
  configure
  opal/mca/base/mca_base_component_find.c
  opal/mca/base/mca_base_component_repository.c
  test/support/components.h
  test/support/components.c

were not applied, it did not even build. But we'll see.

And if you are reading this, Aleksej, have you, as the real
"OpenMPI on NetBSD" man, built a 1.4.2 as yet ?

Kevin

--
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
[OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD
Hi there,

this is an issue that I started a while ago on the R HPC SIG mailing
list and which then moved into an off-list conversation with Jeff Squyres
but on which no progress has been made.

I believe that the issue is less with Rmpi than with something that
Rmpi is exposing in OpenMPI specifically on NetBSD, hence posting here.

(FWIW, I have since had an Rmpi/R/SGE/OpenMPI stack running on
RHEL/Vmware, once I realised that I had to exclude the virbr0
interfaces that OpenMPI seemed to take quite a liking to!)

I appreciate that few on the list are running OpenMPI on NetBSD but,
as detailed below, I found the OpenMPI thread "[OMPI devel] Missing
Symbol" that seems to tie in with the problem I am seeing and, more
importantly, originated away from a NetBSD implementation.

I thus thought I'd stick the guts of the off-list conversation onto
the OpenMPI list and see if anyone else who may have been involved
with the "Missing Symbol" thread has any ideas.

There would seem to have been four emails of relevance from that
off-list conversation, so eyes down, looking for a full house:

=== Part 1 ===

Basically, when I come to load the Rmpi library

> library(Rmpi, lib.loc="/local/scratch/kevin/Pkgs/R/")

I get a swathe of OpenMPI errors (attached below)

[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open /usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open /usr/pkg/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

=== Part 2 ===

> An off the wall question -- do you have multiple versions of Open MPI
> installed on the system, perchance? I wonder if you compiled Rmpi.so with
> one version of OMPI and it's picking up libmpi.so from the other version
> (or something along those lines).
> Mismatches between the versions might
> well cause issues like this...?

That's not it. Everything is from 1.4.1.

I have once again delved deeper into the innards of the OpenMPI source
than I would have expected and seen that the error message is coming
from just after

File: opal/mca/base/mca_base_component_find.c

Routine: static int open_component(component_file_item_t *target_file,
                                   opal_list_t *found_components)

Code:

  #if OPAL_HAVE_LTDL_ADVISE
    component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise);
  #else
    component_handle = lt_dlopenext(target_file->filename);
  #endif

where there's a bit of ferkling going on so as to check for a given
file existing, hence the "slightly better error message".

We have

  ./opal/include/opal_config.h:#define OPAL_HAVE_LTDL_ADVISE 0

so we are invoking the lt_dlopenext clause.

That is a file that gets patched in the NetBSD build as follows

$ diff opal/mca/base/mca_base_component_find.c{.orig,}
44,46d43
< #ifndef __WINDOWS__
< #include "opal/libltdl/ltdl.h"
< #else
48d44
< #endif

ie we have taken out the inclusion of opal/libltdl/ltdl.h to force the
use of the NetBSD "ltdl.h" one, which I guess might point to something
underlying the issue but as to what ...

OK, from what I can see, I have

$ ls -l /usr/pkg/lib/openmpi/mca_carto_auto_detect*
-rw-r--r-- 1 root wheel 3892 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.a
-rwxr-xr-x 1 root wheel 1105 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.la*
-rwxr-xr-x 1 root wheel 7078 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.so*

however there are no "versioned" links for the .so file (.so.0,
.so.0.0.0 etc) but would that be an issue - probably not.

Furthermore, the Autobook (yes, I read some of that too!) says:

Function: lt_dlhandle lt_dlopenext (const char *filename)

This function is used in precisely the same way as lt_dlopen.
However, if the search for the named module by exact match against
filename fails, it will try again with a `.la' extension, and then
the native shared library extension (`.sl' on HP-UX, for example).

so the file that will end up being referenced obviously exists, so
why would lt_dlopenext not be able to open the library there?

It would seem (from the error message) that what's being passed to the
routine as

  target_file->filename

is

  /usr/pkg/lib/openmpi/mca_carto_file

and so lt_dlopenext should at least find the .la and the .so rather
than punt, no ?

I am at a loss as to how to debug this further as my experience
of adding flags to openmpi invocations is zero.

In case you speak libtool (?) I enclose the .la file but nothing looks
"wrong" to me.

Kevin

# mca_carto_auto_detect.la - a libtool library file
# Generated by ltmain.sh (GNU libtool) 2.2.6b
#
# Please DO NOT delete this file!
# It is necessary for linking the library.

# The name that we can dlopen(3).
dlname='mca_carto_auto_detect.so'

# Names of this library.
library_names='mca_carto_auto_detect.so mca_carto_auto_dete