Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2

2017-04-19 Thread Kevin Buckley
On 20 April 2017 at 12:58, r...@open-mpi.org  wrote:
> Fully expected - if ORTE can’t start one or more daemons, then the MPI job
> itself will never be executed.
>
> There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC
> it didn’t quite make the 2.0.2 release. In fact, I just checked and it did
> indeed miss that release.
>
> You have three choices:
>
> 1. you could apply the patch to the 2.0.2 source code yourself - it is at
> https://github.com/open-mpi/ompi/pull/3162
>
> 2. download a copy of the latest nightly 2.0.3 tarball - hasn’t been
> officially released yet, but includes the patch
>
> 3. upgrade to the nightly 2.1.1 tarball - expected to be officially released
> soon and also includes the patch
>
> Hopefully, one of those options will fix the problem
> Ralph

OK, so I went with Choice 2 and downloaded the nightly 2.0.3 from
the 19th April.

openmpi-v2.0.x-201704190318-24b5b83.tar.bz2

Having altered the top-level directory inside it to be just

openmpi-2.0.x

I then used that tarball as a source for the SPEC file from 2.0.2's
SRPM and ...

   it's all built and my application is now running inside the SGE environment.

Deep joy !

Thanks for the suggestions, I'm looking forward to the official releases of
2.0.3 and/or 2.1.1.

Kevin
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2

2017-04-19 Thread Kevin Buckley
On 20 April 2017 at 12:58, r...@open-mpi.org  wrote:
> Fully expected - if ORTE can’t start one or more daemons, then the MPI job
> itself will never be executed.
>
> There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC
> it didn’t quite make the 2.0.2 release. In fact, I just checked and it did
> indeed miss that release.
>
> You have three choices:
>
> 1. you could apply the patch to the 2.0.2 source code yourself - it is at
> https://github.com/open-mpi/ompi/pull/3162
>
> 2. download a copy of the latest nightly 2.0.3 tarball - hasn’t been
> officially released yet, but includes the patch
>
> 3. upgrade to the nightly 2.1.1 tarball - expected to be officially released
> soon and also includes the patch
>
> Hopefully, one of those options will fix the problem
> Ralph

This observation may not belong here, but as there are some eyes on this
issue, I might as well raise it here, as I came across it in the wake of going
with Choice 2


If you wish to take a nightly tarball and try to use the
existing version's SRPM build infrastructure to create an RPM,
then you will fall foul of the dashes in the nightly tarball
names.

That is, if you try putting in

 Version: v2.0.x-201704190318-24b5b83

instead of the original

 Version: 2.0.2

you'll be told, when coming to do an rpmbuild, (your line may differ)

 error: line 188: Illegal char '-' in: Version: v2.0.x-201704190318-24b5b83

You can get round this by

  1) Un-tar-ing the nightly tarball
  2) Renaming it so that the dashes become dots
  3) Recreating the newly named nightly tarball

if you then use

 Version: v2.0.x.201704190318.24b5b83

the rpmbuild runs (as in it's running as I write !).
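
The three steps above can be sketched as a small shell helper. This is only a sketch: the function names `dashes_to_dots` and `repack_nightly` are made up for illustration, and it assumes the nightly tarball unpacks to a top-level directory named `openmpi-<version>`, matching the tarball name.

```shell
# Turn a nightly version string into one that rpmbuild accepts in
# "Version:" by replacing every dash with a dot.
dashes_to_dots() {
    printf '%s\n' "$1" | sed 's/-/./g'
}

# Hypothetical helper: unpack openmpi-<version>.tar.bz2, rename the
# top-level directory, and re-create the tarball under the new name.
# Assumes the tarball unpacks to a directory named openmpi-<version>.
repack_nightly() {
    old_version=$1
    new_version=$(dashes_to_dots "$old_version")
    tar xjf "openmpi-${old_version}.tar.bz2" &&
    mv "openmpi-${old_version}" "openmpi-${new_version}" &&
    tar cjf "openmpi-${new_version}.tar.bz2" "openmpi-${new_version}"
}

dashes_to_dots v2.0.x-201704190318-24b5b83   # v2.0.x.201704190318.24b5b83
```

Running `repack_nightly v2.0.x-201704190318-24b5b83` in the directory holding the nightly tarball would then leave a dotted-name tarball alongside it, ready to be referenced from the SPEC file.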

As to whether that observation might inform a different convention
for naming the nightly tarballs, that is left up to the real developers.

Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2

2017-04-19 Thread Kevin Buckley
On 20 April 2017 at 12:58, r...@open-mpi.org  wrote:
> Fully expected - if ORTE can’t start one or more daemons, then the MPI job
> itself will never be executed.
>
> There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC
> it didn’t quite make the 2.0.2 release. In fact, I just checked and it did
> indeed miss that release.

Just to say that the original "bodge fix" mentioned here

https://github.com/open-mpi/ompi/issues/2947

that is, adding

-mca plm_rsh_agent foo

on the mpirun command line, does allow my 2.0.2 application to run.

> You have three choices:
> ...

Now I just have to make one.

Cheers,
Kevin

Re: [OMPI devel] Program which runs with 1.8.3, fails with 2.0.2

2017-04-19 Thread Kevin Buckley
On 19 April 2017 at 18:35, Kevin Buckley
 wrote:

> If I compile against 2.0.2 the same command works at the command line
> but not in the "SGE" job submission, where I see a complaint about
>
> =
> Host key verification failed.
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>  blah, blah, blah ...
> =

Just to add that if I add in some basic debugging

--mca btl_base_verbose 30

then when running at the command line, I get a swathe of info
from the MCA, however within the SGE environment, I still only
get the "ORTE was unable .." message ?


[OMPI devel] Program which runs with 1.8.3, fails with 2.0.2

2017-04-18 Thread Kevin Buckley
I have source code for MrBayes.

If I compile against OpenMPI 1.8.3, then an

mpirun -np4 mb < somefile.txt

works at both the command line and in an "SGE" job submission where
I'm targeting 4 cores on the same node.

If I compile against 2.0.2 the same command works at the command line
but not in the "SGE" job submission, where I see a complaint about

=
Host key verification failed.
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
 blah, blah, blah ...
=

Even if I try to force the SM transport via the MCA BTL flags,
same thing across the four attempts.

Both the 1.8.3 and 2.0.2 version were built from the SRPM and,
I believe, were built in the same way.

So what might I have missed ?

SGE integration related ??


Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand


Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-04-04 Thread Kevin Buckley
On 5 April 2017 at 13:01, Kevin Buckley
 wrote:

> I also note that as things stand, the Relocation is used for all
> files except the Environment Module file, resulting from the
> rpmbuild being done as follows
>
> --define 'install_shell_scripts 1' \
> --define 'install_modulefile 1' \
> --define 'use_mpi_selector 1' \
> --define 'scl openmpi-openmpi' \
> -bc openmpi-2.0.2.spec
>
> remains in the "system" tree, as
>
> /usr/share/Modules/modulefiles/openmpi-openmpi-openmpi/2.0.2
>
> although, because of the SCL naming, it would no longer clash with
> the path to a non-SCL built module file.

Furthermore, that modulefile contains the right paths

---8<-8<-8<-8<-8<-8<-8<--
#%Module

# NOTE: This is an automatically-generated file!  (generated by the
# Open MPI/SHMEM RPM).  Any changes made here will be lost a) if the RPM is
# uninstalled, or b) if the RPM is upgraded or uninstalled.

proc ModulesHelp { } {
   puts stderr "This module adds Open MPI/SHMEM v2.0.2 to various paths"
}

module-whatis   "Sets up Open MPI/SHMEM v2.0.2 in your enviornment"

prepend-path PATH "/opt/OpenMPI/openmpi-openmpi/root/usr/bin/"
prepend-path LD_LIBRARY_PATH /opt/OpenMPI/openmpi-openmpi/root/usr/lib64
prepend-path MANPATH /opt/OpenMPI/openmpi-openmpi/root/usr/share/man
---8<-8<-8<-8<-8<-8<-8<--


Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-04-04 Thread Kevin Buckley
Just in case anyone is interested in following this, I'll
try and document what I'm doing here

I have a forked repo and added a branch here

https://github.com/vuw-ecs-kevin/ompi/tree/make-specfile-scl-capable

and have applied a series of small changes that allow for the building
of an RPM that appears to be an SCL-enabled one.

Couple of small things to iron out, mostly as regards naming, to wit:

An SCL rpmbuild, using

--define 'scl openmpi-openmpi'

produces

openmpi-openmpi-openmpi-2.0.2-1.el6.x86_64.rpm

where the rpm -qip info shows

==
Name        : openmpi-openmpi-openmpi   Relocations: /opt/OpenMPI/openmpi-openmpi/root/usr
Version     : 2.0.2                     Vendor: redhat
Release     : 1.el6                     Build Date: Wed 05 Apr 2017 12:20:45 PM NZST
Install Date: (not installed)           Build Host: lochostname.vuw.ac.nz
Group       : Development/Libraries     Source RPM: openmpi-openmpi-openmpi-2.0.2-1.el6.src.rpm
Size        : 50459297                  License: BSD
Signature   : (none)
Packager    : redhat


Note, with reference to:

https://www.softwarecollections.org/en/docs/guide/

that this assumes that

Section 2.3's The Software Collection Root Directory

is

/opt/OpenMPI/

and that

Section 2.4's The Software Collection Prefix

is

openmpi-openmpi

because the "myorganisation" is openmpi

as is the package name.


I am less than convinced that those SCL guidelines add anything extra
that might be lost if the Relocation simply occurred at

/opt/OpenMPI/openmpi/root/usr

which I believe would result from building from the same SPEC-file
with just

--define 'scl openmpi'

but I'm not claiming to be an SCL expert.
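
To make the naming concrete, here is a minimal sketch of how the package name and the Relocation path described above seem to be composed from the `scl` value. The function names are hypothetical, and the `/opt/OpenMPI` root is simply the value assumed in this message, not something the SCL guide mandates.

```shell
# Software Collection Root Directory, as assumed in this thread.
scl_root=/opt/OpenMPI

# $1 = value passed as --define 'scl ...', $2 = base package name
scl_package_name() {
    printf '%s-%s\n' "$1" "$2"
}

# $1 = value passed as --define 'scl ...'
scl_relocation() {
    printf '%s/%s/root/usr\n' "$scl_root" "$1"
}

scl_package_name openmpi-openmpi openmpi   # openmpi-openmpi-openmpi
scl_relocation   openmpi-openmpi           # /opt/OpenMPI/openmpi-openmpi/root/usr
scl_relocation   openmpi                   # /opt/OpenMPI/openmpi/root/usr
```

The last line corresponds to the shorter `scl openmpi` variant, which, as said, is only what I believe the same SPEC file would produce.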


I also note that as things stand, the Relocation is used for all
files except the Environment Module file, resulting from the
rpmbuild being done as follows

--define 'install_shell_scripts 1' \
--define 'install_modulefile 1' \
--define 'use_mpi_selector 1' \
--define 'scl openmpi-openmpi' \
-bc openmpi-2.0.2.spec

remains in the "system" tree, as

/usr/share/Modules/modulefiles/openmpi-openmpi-openmpi/2.0.2

although, because of the SCL naming, it would no longer clash with
the path to a non-SCL built module file.

Some thought needed on the module file location, perhaps not all
that surprising, given that was where this thread started.


Anyroad, some initial progress, which might elicit some comments ?


Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-04-02 Thread Kevin Buckley
On 31 March 2017 at 23:35, Jeff Squyres (jsquyres)  wrote:

and Gilles, who said,

>> you should only use the tarballs from www.open-mpi.org

> The GitHub tarballs are simple tars of the git repo at a given hash (e.g., 
> the v2.0.2 tag in git).  ...

Yep I'm aware of the way that GitHub tarballs can be created, even when a lot
of folk don't bother creating them. I think it's good that you do.

> We can't turn off these tarballs from being generated and from being 
> available on the GitHub site.
> We actually don't do anything with these tarballs at all.

OK, even simpler.

>  The short version is that the process looks like this (this skips a lot of 
> detail, but you get the point):
>
> -
> git clone ...the ompi repo...
> cd ompi
> git checkout ...the desired hash...
> ./autogen.pl
> ./configure
> make distcheck
> -

Given that any changes, as regards an SCL-capable RPM build, are
only likely to be made to

contrib/dist/linux/openmpi.spec

which is a file that exists in both GitHub and the bootstrapped ones, it
all seems OK.

I think some of my confusion comes from seeing files that only appear to
be in the GitHub tarball when compared against the bootstrapped tarball,
after doing

tar xf ompi-2.0.2.tar.gz
tar xf openmpi-2.0.2.tar.bz2
diff -r ompi-2.0.2/  openmpi-2.0.2/ > OpenMPI-202-gh_website.diff

(I attach the latter)

which shows, along with

Only in openmpi-2.0.2/: configure
Only in openmpi-2.0.2/contrib: Makefile.in

which I appreciate are the results of the Autotools bootstrapping, stuff like

Only in ompi-2.0.2/contrib/dist: find-copyrights.pl
Only in ompi-2.0.2/contrib/dist/linux: README
Only in ompi-2.0.2/contrib/dist/linux: README.ompi-spec-generator
Only in ompi-2.0.2/contrib/dist/linux: buildrpm.sh
Only in ompi-2.0.2/contrib/dist/linux: buildswitcherrpm.sh
Only in ompi-2.0.2/contrib/dist/linux: ompi-spec-generator.py
Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.spec
Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.tcl

which suggests some things have been removed?
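
A quick way to split such a `diff -r` listing by tree is sketched below. The helper names are made up for illustration, and the tree names are used as plain grep patterns (so the dots match any character), which is good enough for a rough tally.

```shell
# Show the "Only in <tree>/..." lines of a `diff -r` listing.
# $1 = top-level directory name (e.g. ompi-2.0.2), $2 = diff output file
only_in() {
    grep "^Only in $1/" "$2"
}

# Count those lines instead of printing them.
count_only_in() {
    grep -c "^Only in $1/" "$2"
}

# Example use against the listing created above:
#   count_only_in ompi-2.0.2    OpenMPI-202-gh_website.diff
#   count_only_in openmpi-2.0.2 OpenMPI-202-gh_website.diff
```

Entries only in the `openmpi-2.0.2` tree should mostly be Autotools output, whereas entries only in the `ompi-2.0.2` tree are the ones that look to have been left out of the website tarball.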

Apologies for dragging this out,
Kevin
Only in ompi-2.0.2/: .gitignore
Only in ompi-2.0.2/: .mailmap
Only in ompi-2.0.2/: .travis.yml
Only in ompi-2.0.2/: HACKING
Only in ompi-2.0.2/: ISSUE_TEMPLATE.md
Only in openmpi-2.0.2/: Makefile.in
diff -r ompi-2.0.2/VERSION openmpi-2.0.2/VERSION
27c27
< greek=rc5
---
> greek=
34c34
< repo_rev=
---
> repo_rev=v2.0.1-348-ge291d0e
44c44
< tarball_version=gitclone
---
> tarball_version=
48c48
< date="Unreleased developer copy"
---
> date="Jan 31, 2017"
Only in openmpi-2.0.2/: aclocal.m4
Only in openmpi-2.0.2/config: Makefile.in
Only in ompi-2.0.2/config: Makefile.options
Only in openmpi-2.0.2/config: autogen_found_items.m4
Only in openmpi-2.0.2/config: compile
Only in openmpi-2.0.2/config: config.guess
Only in openmpi-2.0.2/config: config.sub
Only in openmpi-2.0.2/config: depcomp
Only in openmpi-2.0.2/config: install-sh
Only in openmpi-2.0.2/config: libtool.m4
Only in openmpi-2.0.2/config: ltmain.sh
Only in openmpi-2.0.2/config: ltoptions.m4
Only in openmpi-2.0.2/config: ltsugar.m4
Only in openmpi-2.0.2/config: ltversion.m4
Only in openmpi-2.0.2/config: lt~obsolete.m4
Only in openmpi-2.0.2/config: missing
Only in ompi-2.0.2/config: ompi_check_udapl.m4
Only in ompi-2.0.2/config: ompi_microsoft.m4
Only in ompi-2.0.2/config: opal_setup_component_package.m4
Only in ompi-2.0.2/config: orte_setup_java.m4
Only in openmpi-2.0.2/config: test-driver
Only in openmpi-2.0.2/config: ylwrap
Only in openmpi-2.0.2/: configure
Only in openmpi-2.0.2/contrib: Makefile.in
Only in ompi-2.0.2/contrib/amca-param-sets: ft-enable-cr
Only in ompi-2.0.2/contrib/amca-param-sets: ft-enable-cr-recovery
Only in ompi-2.0.2/contrib: annual-maintenance
Only in ompi-2.0.2/contrib: authors-to-cvsimport.pl
Only in ompi-2.0.2/contrib: build-mca-comps-outside-of-tree
Only in ompi-2.0.2/contrib: build-server
Only in ompi-2.0.2/contrib: check-btl-sm-diffs.pl
Only in ompi-2.0.2/contrib: check-help-strings.pl
Only in ompi-2.0.2/contrib: check-ob1-pml-diffs.pl
Only in ompi-2.0.2/contrib: check-ob1-revision.pl
Only in ompi-2.0.2/contrib: check-owner.pl
Only in ompi-2.0.2/contrib: check_unnecessary_headers.sh
Only in ompi-2.0.2/contrib: code_counter.pl
Only in ompi-2.0.2/contrib: coverity
Only in ompi-2.0.2/contrib/dist: find-copyrights.pl
Only in ompi-2.0.2/contrib/dist/linux: README
Only in ompi-2.0.2/contrib/dist/linux: README.ompi-spec-generator
Only in ompi-2.0.2/contrib/dist/linux: buildrpm.sh
Only in ompi-2.0.2/contrib/dist/linux: buildswitcherrpm.sh
Only in ompi-2.0.2/contrib/dist/linux: ompi-spec-generator.py
Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.spec
Only in ompi-2.0.2/contrib/dist/linux: openmpi-switcher-modulefile.tcl
Only in ompi-2.0.2/contrib/dist: make-authors.pl
Only in ompi-2.0.2/contrib/dist: make-html-man-pages.pl
Only in ompi-2.0.2/contrib/dist: make_tarball
Only in openmpi-2.0.2/contrib/dist/mofed/debian: changelog
Only in openmpi-2.0.2/con

Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-03-30 Thread Kevin Buckley
On 29 March 2017 at 13:49, Jeff Squyres (jsquyres)  wrote:
> I have no objections to this.
>
> Unfortunately, I don't have the time to work on it, but we'd be glad to look 
> at pull requests to introduce this functionality.  :-)

Yes, yes, alright.

I am though slightly confused, following the move to GitHub.

If I go to the GitHub repo, the available release tarballs appear, for
example, as: ompi-2.0.2.tar.gz
and unpack as

drwxrwxr-x root/root 0 2017-01-31 05:54 ompi-2.0.2/
-rw-rw-r-- root/root 15724 2017-01-31 05:54 ompi-2.0.2/.gitignore
-rw-rw-r-- root/root  4212 2017-01-31 05:54 ompi-2.0.2/.mailmap
-rw-rw-r-- root/root  3415 2017-01-31 05:54 ompi-2.0.2/.travis.yml
-rw-rw-r-- root/root  9867 2017-01-31 05:54 ompi-2.0.2/AUTHORS
..

that is, using the GitHub subproject/repo name.


However, the tarball referred to in the SRPM's SPEC-file has the
"does what it says on the tin" name:

Source: openmpi-%{version}.tar.bz2

which is also the version available from the open-mpi.org website, and which
unpacks as

drwxrwxr-x mpiteam/mpiteam   0 2017-02-01 04:21 openmpi-2.0.2/
-rw-rw-r-- mpiteam/mpiteam 9867 2017-01-27 08:14 openmpi-2.0.2/AUTHORS
...

Are the GitHub tarballs and the "available from the open-mpi.org website"
tarballs the same ? (except for the git stuff)

Should they be the same, now that the development is managed at GitHub ?

What is your process behind taking the GitHub release tarball and turning it
into the open-mpi.org one ? (or vice-versa ?)


Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-03-23 Thread Kevin Buckley
Another thing that occurred to me whilst looking around this
was whether the OpenMPI SRPM might benefit from
being given proper  "Software Collections" package
capability, as opposed to having the "install in opt"
option.

I don't claim to have enough insight to say either way
here, however the Software Collections documentation

https://www.softwarecollections.org/en/docs/guide/#sect-Converting_a_Conventional_Spec_File

suggests that it's possible to use a traditional specfile
for both system and SCL builds, and I could see that
were "OpenMPI" to be used as the SCLo "provider"
tag then things would end up living under

/opt/OpenMPI/openmpi-202/

which isn't that far away from what you have, with the
"install in opt"  now.


Very much more a Request for Comments than a suggestion though,
Kevin


Re: [OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-03-23 Thread Kevin Buckley
On 23 March 2017 at 23:41, Jeff Squyres (jsquyres)  wrote:
> Yoinks.  Looks like this was an oversight.  :-(
>
> Yes, I agree that install_in_opt should put the modulefile in /opt as well.

Actually, I have since read the SPEC file from top to bottom and seen a
Changelog entry (from you Jeff. from you!) dating back to 2007 that says

* Fri Feb  9 2007 Jeff Squyres 
...
- Make shell script and modulefile installation indepdendent of
  %{install_in_opt} (they're really separate issues)
...

so are you sure you agree ? (*see below)

> If you could offer a patch or pull request, that would be awesome.

I should get around to cloning a GitHub branch I guess but for now,
it appears to be a simple one-liner

$ diff openmpi-2.0.2.spec openmpi-2.0.2_vuw_gcc.spec
165a166
> %define modulefile_path /opt/%{name}/%{version}/share/openmpi/modulefiles
$

which needs adding at the  bottom of the

%if %{install_in_opt}

stanza. A diff -u patch is attached.


* I think the reasoning for me, is that if the SRPM is being used to create
an OpenMPI that will supersede the RHEL/CentOS system's version
then adding the module into a system area is the right thing, but if an
admin goes to the trouble of "installing in opt" then clearly they want
things "out of the way of the system", in general usage anyway and so
having the modulefile in a place that requires them to take action for
it to become available to the users is a "good thing".

All the best,
Kevin
--- SPECS/openmpi-2.0.2.spec.orig   2017-02-01 13:05:57.0 +1300
+++ SPECS/openmpi-2.0.2.spec        2017-03-23 17:11:50.033903233 +1300
@@ -163,6 +163,7 @@
 # bets are off.  So feel free to install it anywhere in your tree.  He
 # suggests $prefix/doc.
 %define _defaultdocdir /opt/%{name}/%{version}/doc
+%define modulefile_path /opt/%{name}/%{version}/share/openmpi/modulefiles
 %endif
 
 %if !%{build_debuginfo_rpm}

[OMPI devel] 2.0.2 SRPM in "install_in_opt" configuration creates file outside /opt

2017-03-22 Thread Kevin Buckley
Just came to rehash some old attempts to build previous OpenMPIs
for an RPM-based system and noticed that, despite specifying

--define 'install_in_opt 1'  \

as part of this full "config" rpmbuild stage

(Note: SPEC-file Release tag is altered so as not to have the RPM clash with
 any system MPI packages, hence the name change)


rpmbuild \
  --define '_topdir /path/to/my/rpmbuild' \
  --define 'install_in_opt 1' \
  --define 'install_shell_scripts 1' \
  --define 'install_modulefile 1' \
  --define 'use_mpi_selector 0' \
  --define 'build_all_in_one_rpm 1' \
  -bb openmpi-2.0.2_vuw_gcc.spec

I still get a non /opt hierarchy file

/usr/share/Modules/modulefiles/openmpi/2.0.2

which isn't what is needed.

I'd have thought the module files for an "install_in_opt" configuration
would have  been installed below the

/opt/openmpi/2.0.2/share

tree, leaving the system admin to then place them wherever they needed
to end up within the local "system dir" hierarchies.

(Note: I use /etc/modulefiles/ in addition to /usr/share/Modules/modulefiles/,
 which I consider to be the module files from the modules package itself)



I note in the SPEC-file that there are possible overrides to any related
defaults, viz:

# type: string (root path to install modulefiles)
%{!?modulefile_path: %define modulefile_path /usr/share/Modules/modulefiles}
# type: string (subdir to install modulefile)
%{!?modulefile_subdir: %define modulefile_subdir %{name}}
# type: string (name of modulefile)
%{!?modulefile_name: %define modulefile_name %{version}}

but would suggest that the

install_modulefile 1

when combined with

install_in_opt 1

should alter the default, without any extra user configuration, just as if
an AutoTools config had done a

--prefix=/opt

in which case everything would be expected to get the prefix.
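
Until the default changes, the override mentioned above can presumably be given on the command line, since the SPEC file only defines `modulefile_path` when it is not already set. The sketch below merely prints such a command rather than running it; the function name is hypothetical, and the `share/openmpi/modulefiles` subdirectory is my own choice of location under the /opt tree, not something the SPEC file prescribes.

```shell
# Print an rpmbuild command line that overrides modulefile_path so that
# the modulefile lands under the /opt tree. Nothing is executed here.
# $1 = package name, $2 = version, $3 = SPEC file
opt_rpmbuild_cmd() {
    # Assumed location for modulefiles below the /opt install tree.
    modpath="/opt/$1/$2/share/openmpi/modulefiles"
    printf '%s\n' "rpmbuild --define 'install_in_opt 1' --define 'install_modulefile 1' --define 'modulefile_path ${modpath}' -bb $3"
}

opt_rpmbuild_cmd openmpi 2.0.2 openmpi-2.0.2.spec
```

That still counts as "extra user configuration", of course, which is exactly what the suggestion is trying to avoid.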


Just a suggestion, though I'll look to offer a patch, now that I see
that my notes from 2012 about this (where I first found the issue!)
still apply but, before I look to offer the patch ...

whaddyathink of the suggestion itself?


Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand


[OMPI devel] OpenMPI 1.10.0: Arch Linux PkgSrc build suggests configure script patch

2015-09-21 Thread Kevin Buckley
Watcha,

we recently updated the OpenMPI installation on our School's ArchLinux
machines, where OpenMPI is built as a PkgSrc package, to 1.10.0

In running through the build, we were told that PkgSrc wasn't too keen on
the use of the == with a single "if test" construct and so I needed to apply
the following patch

--- configure.orig  2015-08-24 23:33:14.0 +
+++ configure
@@ -60570,8 +60570,8 @@ _ACEOF
 $as_echo "$MPI_OFFSET_DATATYPE" >&6; }


-if test "$ompi_fortran_happy" == "1" && \
-   test "$OMPI_WANT_FORTRAN_BINDINGS" == "1"; then
+if test "$ompi_fortran_happy" = "1" && \
+   test "$OMPI_WANT_FORTRAN_BINDINGS" = "1"; then

 # Get the kind value for Fortran MPI_INTEGER_KIND (corresponding
 # to whatever is the same size as a F77 INTEGER -- for the


Seem to recall that this is "good practice" and indeed, can see that
other "if test" stanzas in the configure script have been fixed to match,
so perhaps this one has just slipped through the net and/or not been
reported by anyone else as yet.
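
For anyone unsure of the difference, a minimal sketch of the portable form follows. The function name is made up, and its two arguments merely stand in for `$ompi_fortran_happy` and `$OMPI_WANT_FORTRAN_BINDINGS`.

```shell
# Portable string comparison: POSIX test(1) defines "=" for string
# equality, whereas "==" is an extension of bash and some ksh versions.
fortran_bindings_wanted() {
    if test "$1" = "1" && \
       test "$2" = "1"; then
        echo yes
    else
        echo no
    fi
}

fortran_bindings_wanted 1 1   # yes
fortran_bindings_wanted 1 0   # no
```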

--
Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand


Re: [OMPI devel] (no subject)

2014-12-09 Thread Kevin Buckley
On 9 December 2014 at 03:29, Howard Pritchard  wrote:
> Hello Kevin,
>
> Could you try testing with Open MPI 1.8.3?  There was a bug in 1.8.1
> that you are likely hitting in your testing.
>
> Thanks,
>
> Howard

Bingo!

Seems to have got rid of those messages.

Thanks.


Re: [OMPI devel] (no subject)

2014-12-07 Thread Kevin Buckley
Apologies for the lack of a subject line: cut and pasted the body
before the subject !

Should have been

Removing "registering part of your physical memory" warning message

Dunno if anyone can fix that in the mailing list?


[OMPI devel] (no subject)

2014-12-07 Thread Kevin Buckley
Watcha,

have recently come to install the PISM package on top of PETSc, which, in turn,
is built against OpenMPI 1.8.1 on our Science Faculty HPC Facility, which has SGI
C2112 compute nodes with 64GB RAM running on top of CentOS 6.

In testing the PETSc deployment out, and when running PISM itself, I am seeing the

" ...OpenFabrics subsystem is configured to only allow registering part of
 your physical memory ..."

message telling me

  Registerable memory: 32768 MiB
  Total memory:        524285 MiB

Oh yeah, that's the 512GB big memory node, not a 64GB compute node, which says

  Registerable memory: 32768 MiB
  Total memory:        65534 MiB

but still suggests a default  for allowing the use of 32GB.

So, having followed my nose to the OpenMPI FAQ sections, and the Mellanox
community page,

http://community.mellanox.com/docs/DOC-1120

which suggests the defaults for the two parameters in need of a tweak are

log_num_mtt  20
log_mtts_per_seg 0

I came to try and tweak  those Mellanox driver parameters.

What I see on my compute nodes is

# cat /sys/module/mlx4_core/parameters/log_num_mtt
0
# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
3
#

so something that doesn't match the defaults the Mellanox page suggests
I should be seeing.

Furthermore, having "done the math" and realised that I probably want

log_num_mtt  22
log_mtts_per_seg 3

to allow OpenMPI to use double the memory (128GB - because giving it 1 TB
on the big memory node seems excessive!) when I come to alter those values,
I can't seem to.
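
For the record, "the math" can be sketched as below, using the formula as I understand it from the OpenMPI FAQ: registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size. The function name is made up, and the 4096-byte page size is an assumption about these nodes.

```shell
# Maximum registerable memory in MiB for a given pair of mlx4_core
# parameters.
# $1 = log_num_mtt, $2 = log_mtts_per_seg,
# $3 = page size in bytes (assumed to be 4096 if not given)
max_reg_mib() {
    page_size=${3:-4096}
    echo $(( (1 << ($1 + $2)) * page_size / 1048576 ))
}

max_reg_mib 20 3   # 32768  (32GB, the figure the warning reports)
max_reg_mib 22 3   # 131072 (128GB)
```

Note that the 20 in the first call is only inferred from the reported 32768 MiB; the node itself shows 0 for log_num_mtt.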

Trying to add a module load option

options mlx4_core log_num_mtt=22

via modifying the file

/etc/modprobe.d/mlx4.conf

never sees that value honoured after a full node reboot.

It also appears that the

/sys/module/mlx4_core/parameters/

are nearly all read-only, including the ones it's suggested that I tweak, viz:

# echo 22 > /sys/module/mlx4_core/parameters/log_num_mtt
-bash: /sys/module/mlx4_core/parameters/log_num_mtt: Permission denied
#  ls -l /sys/module/mlx4_core/parameters/log_num_mtt
-r--r--r--. 1 root root 4096 Dec  5 13:08 /sys/module/mlx4_core/parameters/log_num_mtt

so I'm getting the impression that the Mellanox driver doesn't really want
the defaults altered ?

OK, so if I can't tell my nodes to allow OpenMPI to use any more than 32GB,
how do I turn off the OpenMPI message that is telling me about it?


Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand


[OMPI devel] PkgSrc build of 1.8.1 gives a portability error

2014-07-17 Thread Kevin Buckley
Hello again OpenMPI folk,  been a while.

Have just come to build OpenMPI 1.8.1 within a PkgSrc environment for
our ArchLinux machines (yes, we used to be NetBSD, yes).

Latest PkgSrc build was for 1.6.4.

The 1.6.4 PkgSrc build required 4 patches, 3 of which were PkgSrc-specific
and just defined a

sysconfexampledir = $(pkgdatadir)/examples

so that the PkgSrc build could "Install configuration files into
example directory"

Those patches affected

orte/etc/Makefile.in
opal/etc/Makefile.in
ompi/etc/Makefile.in

The 4th patch, affecting

opal/tools/wrappers/opal_wrapper.c

added some "Missing RPATH support"

I fixed up those four patches so that they applied cleanly to 1.8.1; however,
I have been informed, by the PkgSrc build process, of the following

---8<-8<-8<-8<-8<-8<-8<-8<-8<--
=> Checking for portability problems in extracted files
ERROR: [check-portability.awk] => Found test ... == ...:
ERROR: [check-portability.awk] configure:  if test "$enable_oshmem" == "yes" -a "$ompi_fortran_happy" == "1" -a \

Explanation:
===
The "test" command, as well as the "[" command, are not required to know
the "==" operator. Only a few implementations like bash and some
versions of ksh support it.

When you run "test foo == foo" on a platform that does not support the
"==" operator, the result will be "false" instead of "true". This can
lead to unexpected behavior.

There are two ways to fix this error message. If the file that contains
the "test ==" is needed for building the package, you should create a
patch for it, replacing the "==" operator with "=". If the file is not
needed, add its name to the CHECK_PORTABILITY_SKIP variable in the
package Makefile.
===

---8<-8<-8<-8<-8<-8<-8<-8<-8<--

Obviously, the file that needs to be patched is really

configure.ac

and not

configure

but anyroad, the place at which the oshmem stanza has used the "non-portable"
double-equals construct is shown in the following attempted patch


---8<-8<-8<-8<-8<-8<-8<-8<-8<--
--- configure.ac.orig   2014-04-22 14:51:44.0 +
+++ configure.ac
@@ -611,8 +611,8 @@ m4_ifdef([project_ompi], [OMPI_SETUP_MPI
 ])

 AM_CONDITIONAL(OSHMEM_BUILD_FORTRAN_BINDINGS,
-[test "$enable_oshmem" == "yes" -a "$ompi_fortran_happy" == "1" -a \
-  "$OMPI_WANT_FORTRAN_BINDINGS" == "1" -a \
+[test "$enable_oshmem" = "yes" -a "$ompi_fortran_happy" = "1" -a \
+  "$OMPI_WANT_FORTRAN_BINDINGS" = "1" -a \
   "$enable_oshmem_fortran" != "no"])

 # checkpoint results
---8<-8<-8<-8<-8<-8<-8<-8<-8<--

Someone may wish to give that the "once over" ahead of the 1.8.2 release, in
light of what PkgSrc considers to be portable.
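
A rough way to look for other occurrences ahead of a release is sketched below. It is a plain grep approximation with a made-up function name, not the PkgSrc check-portability.awk script itself, so it may well flag or miss cases that the real check treats differently.

```shell
# Print the lines (with line numbers) of a file in which "test" or "["
# is followed, later on the same line, by a "==" operator.
# $1 = file to scan, e.g. configure.ac
find_test_double_equals() {
    grep -nE '(test|\[)[[:space:]].*[[:space:]]==[[:space:]]' "$1"
}

# Example use:
#   find_test_double_equals configure.ac
```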

All the best,
Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand


[OMPI devel] rpmbuild of 1.6.2 fails when build_all_in_one_rpm is 0, works when 1

2012-10-10 Thread Kevin Buckley
 I was also thinking of building a version against the PGI compilers; however,
rest assured I saw the failure to build separated RPMs with a vanilla
spec-file as well.

Obviously no show stopper, as I can build the "all_in_one_rpm" but thought
to feed the experience back.

Kevin Buckley
ECS, VUW, NZ


Re: [OMPI devel] autogen.sh improvements

2010-08-16 Thread Kevin . Buckley

> > 5. ompi_mca.m4 has been cleaned up a bit, allowing autogen.pl to be a
> > little dumber than autogen.sh
>
> So you are dumbing down in search of improvements ?
>

Apologies. That was only meant to go to Jeff Squyres.

Kevin


Re: [OMPI devel] autogen.sh improvements

2010-08-16 Thread Kevin . Buckley
> 5. ompi_mca.m4 has been cleaned up a bit, allowing autogen.pl to be a
> little dumber than autogen.sh

So you are dumbing down in search of improvements ?




Re: [OMPI devel] v1.5: thumbs up or down? - Thumbs Down

2010-07-11 Thread Kevin . Buckley
> > OK, I humbly withdraw (a) above but now, equally humbly, suggest
> > that instead of using a list, those things be turned into standard,
> > single-target, configure options, viz:
> >
> > --with-=
> >
> > --enable-=
>
> True, this would be better.  I believe that Brian didn't initially do it
> this way for some subtle reasons, but I confess that I don't remember
> exactly why.
>
> Patches would be welcome here.  :-)
>
> --
> Jeff Squyres


Here you go, as per our offlist discussion, a patch that allows
one to deselect addons individually,

--enable-addon1=no --disable-addon3

which is required because you can't build up

--enable-contrib-no-build=addon1 --enable-contrib-no-build=addon2

options as the build system only honours the last one.

What this does is use the value of

--enable-addon1       (yes)
--enable-addon1=yes   (yes)
--enable-addon1=no    (no)
--disable-addon1      (no)

or sets it to "yes" if you don't give anything, thereby maintaining
the default of build all the contribs unless told otherwise.
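
In plain shell terms, the logic the patch adds amounts to something like the sketch below. The function name is made up; its argument stands for the value of `enable_<name>` that configure would hold after option parsing, left empty when no option was given.

```shell
# Decide whether a contrib package ends up disabled.
# An unset value defaults to "yes"; anything other than "yes"
# marks the package as disabled.
contrib_disabled() {
    enable=${1:-yes}
    if test "x$enable" != xyes; then
        echo yes
    else
        echo no
    fi
}

contrib_disabled        # no  (no option given)
contrib_disabled yes    # no  (--enable-addon1 or --enable-addon1=yes)
contrib_disabled no     # yes (--enable-addon1=no or --disable-addon1)
```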


$ diff -u openmpi-1.5rc3{-vanilla,}/config/ompi_contrib.m4
--- openmpi-1.5rc3-vanilla/config/ompi_contrib.m4   2009-12-09 10:33:28.0 +1300
+++ openmpi-1.5rc3/config/ompi_contrib.m4   2010-07-11 15:43:56.0 +1200
@@ -99,6 +99,16 @@

 ompi_show_subsubsubtitle "$1 (m4 configuration macro)"

+AC_ARG_ENABLE([$1],
+[AS_HELP_STRING([--disable-$1],
+  [disable support for contributed package $1])],
+[],
+[enable_$1=yes])
+
+if test "x$enable_$1" != xyes ; then
+DISABLE_contrib_$1=yes
+fi
+
 OMPI_CONTRIB_HAPPY=0
 if test "$DISABLE_contrib_$1" = "" -a "$DISABLE_contrib_all" = ""; then



-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] v1.5: thumbs up or down? - Thumbs Down

2010-07-06 Thread Kevin . Buckley

> That contribution needs to be
>
> a) brought under the control of --enable-contrib-no-build=
>
> b) possibly renamed (it would seem to be an MPI specific thing)
> so maybe, libmpitrace ?

I'd like to qualify that, in the light of some more digging,
though (b) is still an issue.

It seems that the "libtrace" contribution IS under the control of

--enable-contrib-no-build=

(not that you'd know that from configure --help) but the NetBSD
handling of such options ended up running configure with:

 --enable-contrib-no-build=libtrace --enable-contrib-no-build=vt

and not:

--enable-contrib-no-build=vt,libtrace

so the configure was only honouring the last one listed.
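
A rough Python sketch of the "last one wins" behaviour just described.
It is only an illustration under stated assumptions: the function name
no_build_list is hypothetical and this is not the actual configure
logic, merely a model in which each repeated option overwrites the
previous value and the final value is split on commas.

```python
PREFIX = "--enable-contrib-no-build="


def no_build_list(argv):
    """Return the contribs that end up excluded from the build.

    Illustrative model: a repeated option replaces the earlier value
    rather than appending to it.
    """
    value = ""
    for arg in argv:
        if arg.startswith(PREFIX):
            value = arg[len(PREFIX):]  # a later occurrence overwrites
    return [name for name in value.split(",") if name]


if __name__ == "__main__":
    print(no_build_list([PREFIX + "libtrace", PREFIX + "vt"]))
    print(no_build_list([PREFIX + "vt,libtrace"]))
```

In this model the two separate options leave only "vt" excluded, while
the single comma-separated option excludes both.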


OK, I humbly withdraw (a) above but now, equally humbly, suggest
that instead of using a list, those things be turned into standard,
single-target, configure options, viz:

--with-=

--enable-=

humbly yours,
Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] v1.5: thumbs up or down? - Thumbs Down

2010-07-06 Thread Kevin . Buckley

Something I have just noticed on the NetBSD platform build
that I think goes further than just that platform.


There is a NetBSD packaging clash between the

libtrace.la

from

ompi/contrib/libtrace/

and that from an already existing package

libtrace-3.0.6

(Homepage:http://research.wand.net.nz/software/libtrace.php)


As the OpenMPI one is just a contribution, I've tried turning off
building the OpenMPI libtrace using the time-honoured

 --enable-contrib-no-build=libtrace

but it still builds.

Is this a new contribution (was not in 1.4x ?) that is not
controlled by the

 --enable-contrib-no-build=

mechanism?

The "configure --help" output only mentions "libnbc" and "vt"
though there doesn't seem to be a "libnbc" to control anymore.

Indeed a top-level,

 find . -name \*nbc\* -print

returns empty handed.

That contribution needs to be

a) brought under the control of --enable-contrib-no-build=

b) possibly renamed (it would seem to be an MPI specific thing)
so maybe, libmpitrace ?

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] v1.5: thumbs up or down?

2010-07-05 Thread Kevin . Buckley
> 4) The other thing that comes to mind are the mountain of WARNINGs
> because of the "redefinition" of
>
> #define CACHE_LINE_SIZE 128
>
> in
>
> opal/include/opal/sys/cache.h
>
> although it's a bit "chicken and egg" because NetBSD's definition,
> in:
>
> /usr/include/sys/param.h
>
> obviously allows one to redefine it, vis:
>
> #ifndef CACHE_LINE_SIZE
> #define CACHE_LINE_SIZE 64
> #endif
>
> so that's probably not an issue but at least you know about it.


But here's a patch:

--- opal/include/opal/sys/cache.h.orig  2010-07-06 14:29:44.0 +1200
+++ opal/include/opal/sys/cache.h   2010-07-06 14:32:34.0 +1200
@@ -30,7 +30,9 @@
  *
  * For now hardwire this to a reasonable value, and automate later - RLG
  */
+#ifndef CACHE_LINE_SIZE
 #define CACHE_LINE_SIZE 128
+#endif


 #endif /* OPAL_SYS_CACHE_H */


-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] v1.5: thumbs up or down?

2010-07-04 Thread Kevin . Buckley
> Can we get a thumbs up / down from each organization about where you
> think we are with v1.5?  Cisco and HLRS obviously give a thumbs up.
>

I can't claim to speak for NetBSD but, for info, I have just managed
to COMPILE 1.5rc3 on a NetBSD platform.

Notes:
======

1) The patch NetBSD was applying to opal/util/if.c

  Rewrite network interface configuration using getifaddrs(3) for BSD,

is no longer needed, so that's obviously now in the OpenMPI source.

2) A couple of other NetBSD patches needed altering but mostly for
   line numbers with an extra couple of changes where OpenMPI has
   changed a define from

#if OMPI_WANT_LIBLTDL

to

#if OPAL_WANT_LIBLTDL

3) I also had to disable the Vampire Trace toolkit build

--enable-contrib-no-build=vt

but then I had to do that for OpenMPI 1.4.2.


4) The other thing that comes to mind is the mountain of WARNINGs
because of the "redefinition" of

#define CACHE_LINE_SIZE 128

in

opal/include/opal/sys/cache.h

although it's a bit "chicken and egg" because NetBSD's definition,
in:

/usr/include/sys/param.h

obviously allows one to redefine it, vis:

#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif

so that's probably not an issue but at least you know about it.


That OpenMPI #define should probably be wrapped in an #ifndef
to defeat the warnings but, as to which should be defined first,
by virtue of the order of the include files ... ?


I'll try actually installing and running the thing when I get
a chance but thought you might appreciate some "progress feedback"
from a NetBSD platform trial.

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Kevin . Buckley
> I added several FAQ items -- how do they look?
>
> http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
> http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
> http://www.open-mpi.org/faq/?category=building#install-overwrite
>

  "This is due to some deep run time linker voodoo"

From what I have come to understand about this: I think that pretty
much covers it !

Seriously, this is good stuff to have "out there" though, because,
as you point out, the info an installer/user gets back, and through
which they might then first look to diagnose such issues, may not
steer them in the direction it should.

Kevin

PS
A style as opposed to substance thing:

I did notice that the last one of the three seems to be using a
fixed size width, whereas text in the first and second flows
into the browser window.

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD

2010-05-16 Thread Kevin . Buckley
Cc'd Aleksej as I'm not sure he's on the "devel" list, and Mark
Davies, as he is certainly not.

I'll also post this back onto the R HPC SIG list which is
where I came in.

Jeff Squyres wrote:

> Now, all this being said, IIRC (and I very well may not!), the real
> underlying issue here is that R is dlopening libmpi.so, which, in turn, is
> dlopening its own DSOs.  Given the global linker scoping issues, OMPI's
> DSOs are unable to find the symbols they need to resolve in the process
> (because libmpi.so's was opened in a private scope).
>
> This probably is unfortunately larger than us (Open MPI) -- it's really a
> POSIX issue.  What would be ideal is if different linker namespaces could
> be something more fine-grained than "global" or "private" within a
> process.  E.g., if the private namespace of libmpi.so in the process could
> selectively make its symbol namespace available to the DSOs that it
> dlopens.  Right now, the only option libmpi.so has is to be opened
> with a public scope, which somewhat defeats the point of private
> scoping.
>

Tying in with the suggestions you make above, there would seem to
be a work-around fix for this, in the case of the Rmpi package
on NetBSD anyway.

Furthermore, the fix does not require any alterations to OpenMPI.

Apparently, there has been a similar issue, symbol visibility
when chaining shared library loading, within PAM on NetBSD.

Mark Davies has now determined a way to force the Rmpi package
to load libmpi.so, ahead of loading the Rmpi shared library itself,
so that what appear to be the missing symbols are then available,
for any future loads of the OpenMPI component libraries.


On the version of Rmpi that I have been using, 0.5-8, the "fix"
can be effected by the following one-line patch

--- Rmpi/R/zzz.R        2009-02-04 05:27:08.0 +1300
+++ Rmpi.local/R/zzz.R  2010-05-17 14:25:27.0 +1200
@@ -7,6 +7,7 @@
 #cat(vertxt)

 # Check if lam-mpi is running
+dyn.load("/usr/pkg/lib/libmpi.so", local=FALSE)
 library.dynam("Rmpi", pkg, lib)
 if (!TRUE)
stop("Fail to load Rmpi dynamic library.")


Note that this currently hard codes the path to the libmpi.so,
which for our system is in the standard NetBSD PkgSrc location,
though there are probably "nicer" ways to achieve the same end,
and greater flexibility, using R internals.

Having said that, this "fix" does not seem to be needed on
platforms that have a global scope for shared library symbols,
so maybe attempts to make it generic may be pointless.
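
For comparison, the same "preload with global symbol visibility" idea
can be sketched in Python with ctypes. This is only an analogy to the
R-level dyn.load(..., local=FALSE) call in the patch above, not part of
the Rmpi fix: the function name preload_global is hypothetical, and the
libmpi.so path in the comment is the PkgSrc location quoted above, which
will differ on other systems.

```python
import ctypes


def preload_global(path):
    """Load a shared library so its symbols join the global scope.

    RTLD_GLOBAL asks the dynamic linker to make the library's symbols
    available to libraries loaded afterwards, which is what plugin
    DSOs need when they are dlopen'ed later.
    Passing None opens the running program itself (POSIX behaviour).
    """
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


# Hypothetical usage; the path is an assumption taken from the patch:
#     preload_global("/usr/pkg/lib/libmpi.so")
# ... and only then load the extension that depends on libmpi.so.
```

As with the R patch, the point is ordering: the library providing the
symbols is loaded first, with global visibility, before anything that
needs to resolve against it.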

Thanks for everyone's time on this issue. I'll certainly be
watching attempts to resolve the "larger than us (Open MPI)"
issue,

Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD

2010-05-16 Thread Kevin . Buckley
Jeff,

> So the error message is at least *somewhat* better than a totally
> misleading "file not found" message -- but it still only speculates
> on the real reason that libltdl failed to load the DSO.
>
> 2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an
> OMPI-specific change to libltdl that avoids the incorrect error message
> altogether.  So now OMPI should print out the *real* reason libltdl
> failed to load the DSO.
>
> It does not look like this patch made it over into the v1.4 series;
> it is awaiting review before it moves to the v1.5 branch
> (https://svn.open-mpi.org/trac/ompi/ticket/2337).
>
> Hope that all made sense!

Great insight. You'll appreciate I have some idea as to what's going on
but not the completed jigsaw view as to how all the pieces I find fit
into the whole, so thank you.

Not sure it explains away the inability of my libtool test program to
open the shared-library in question but it certainly moves things
forwards.

> Have you tried building Open MPI with the --disable-dlopen configure flag?
>  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> physically located in libmpi.so.

Given your reasoning, that's gotta be worth a shot: wilco.

Thanks once again for your time on this,
Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Kevin . Buckley

> Which libltdl version is that NetBSD ltdl.h from?  Which version is
> in opal/libltdl?  Have you tried not doing the above change?
>
> libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> as well as in the header, as well as (I think) in preloaded modules.

Hey Ralf,

The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.

An ldd of mpirun shows  -lltdl.7 => /usr/pkg/lib/libltdl.so.7


I do need to attempt a build of 1.4.2 here in ECS, so I'll try
building without the patches but I seem to recall that if those
libtool-related patches

opal/Makefile.in
configure
opal/mca/base/mca_base_component_find.c
opal/mca/base/mca_base_component_repository.c
test/support/components.h
test/support/components.c

were not applied, it did not even build. But we'll see.


And if you are reading this, Aleksej, have you, as the real
"OpenMPI on NetBSD" man, built a 1.4.2 as yet ?

Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



[OMPI devel] The" Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Kevin . Buckley
Hi there,

this is an issue that I started a while ago on the R HPC SIG mailing
list and which then moved into an off-list conversation with Jeff
Squyres but on which no progress has been made.

I believe that the issue is less with Rmpi than with something
that Rmpi is exposing in OpenMPI specifically on NetBSD, hence
posting here.

(FWIW, I have since had an Rmpi/R/SGE/OpenMPI stack running on
 RHEL/Vmware, once I realised that I had to exclude the virbr0
 interfaces that OpenMPI seemed to take quite a liking to!)

I appreciate that few on the list are running OpenMPI on NetBSD
but, as detailed below, I found the OpenMPI thread

"[OMPI devel] Missing Symbol"

that seems to tie in with the problem I am seeing and, more
importantly, originated away from a NetBSD implementation.

I thus thought I'd stick the guts of the off-list conversation
onto the OpenMPI list and see if anyone else who may have been
involved with the "Missing Symbol" thread has any ideas.

There would seem to have been four emails of relevance from that
off-list conversation, so eyes down, looking for a full house:



=== Part 1 ===

Basically, when I come to load the Rmpi library

> library(Rmpi, lib.loc="/local/scratch/kevin/Pkgs/R/")

I get a swathe of OpenMPI errors (attached below)


[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or
compiled for a different version of Open MPI? (ignored)
[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled
for a different version of Open MPI? (ignored)


=== Part 2 ===

> An off the wall question -- do you have multiple versions of Open MPI
> installed on the system, perchance?  I wonder if you compiled Rmpi.so with
> one version of OMPI and it's picking up libmpi.so from the other version
> (or something along those lines).  Mismatches between the versions might
> well cause issues like this...?

That's not it. Everything is from 1.4.1.

I have once again delved deeper into the innards of the OpenMPI
source than I would have expected and seen that the error message
is coming from just after

File:
opal/mca/base/mca_base_component_find.c

Routine:
static int open_component(component_file_item_t *target_file,
   opal_list_t *found_components)

Code:
#if OPAL_HAVE_LTDL_ADVISE
  component_handle = lt_dlopenadvise(target_file->filename,
                                     opal_mca_dladvise);
#else
  component_handle = lt_dlopenext(target_file->filename);
#endif


where there's a bit of ferkling going on so as to check for
a given file existing, hence the "slightly better error message".

We have

./opal/include/opal_config.h:#define OPAL_HAVE_LTDL_ADVISE 0

so we are invoking the lt_dlopenext clause.

That is a file that gets patched in the NetBSD build as follows

$diff opal/mca/base/mca_base_component_find.c{.orig,}
44,46d43
<   #ifndef __WINDOWS__
< #include "opal/libltdl/ltdl.h"
<   #else
48d44
<   #endif

ie we have taken out the inclusion of

opal/libltdl/ltdl.h

to force the use of the NetBSD "ltdl.h" one, which I guess might point
to something underlying the issue but as to what ...

OK, from what I can see, I have

$ls -l /usr/pkg/lib/openmpi/mca_carto_auto_detect*
-rw-r--r-- 1 root wheel 3892 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.a
-rwxr-xr-x 1 root wheel 1105 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.la*
-rwxr-xr-x 1 root wheel 7078 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.so*


however there are no "versioned" links for the .so file
(.so.0, .so.0.0.0 etc) but would that be an issue - probably not.


Furthermore, the Autobook (yes, I read some of that too!) says:

Function: lt_dlhandle lt_dlopenext (const char *filename)
This function is used in precisely the same way as lt_dlopen. However,
if the search for the named module by exact match against filename
fails, it will try again with a `.la' extension, and then the native
shared library extension (`.sl' on HP-UX, for example).

so the file that will end up being referenced obviously exists,
so why would

lt_dlopenext

not be able to open the library there?

It would seem (from the error message) that what's being passed
to the routine as

target_file->filename

is

/usr/pkg/lib/openmpi/mca_carto_file

and so lt_dlopenext should at least find the .la and the .so
rather than punt, no ?
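
The search order quoted from the Autobook can be modelled in a few lines
of Python, just to make the expectation concrete. This is a sketch of
the documented behaviour only, not libltdl's actual code: the function
names are hypothetical, and ".so" as the native extension is an
assumption for this platform.

```python
import os


def dlopenext_candidates(filename, native_ext=".so"):
    """Names tried, in order, per the Autobook description:
    the exact name, then with '.la', then the native extension."""
    return [filename, filename + ".la", filename + native_ext]


def first_existing(filename, native_ext=".so"):
    """Return the first candidate present on disk, or None."""
    for candidate in dlopenext_candidates(filename, native_ext):
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    for name in dlopenext_candidates("/usr/pkg/lib/openmpi/mca_carto_file"):
        print(name)
```

A file merely existing does not mean it can be loaded, of course: the
open can still fail later, for example on unresolved symbols, which is
what the error message hints at.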

I am at a loss as to how to debug this further as my experience of
adding flags to openmpi invocations is zero.


In case you speak libtool (?) I enclose the .la file but nothing
looks "wrong" to me.

Kevin


# mca_carto_auto_detect.la - a libtool library file
# Generated by ltmain.sh (GNU libtool) 2.2.6b
#
# Please DO NOT delete this file!
# It is necessary for linking the library.

# The name that we can dlopen(3).
dlname='mca_carto_auto_detect.so'

# Names of this library.
library_names='mca_carto_auto_detect.so mca_carto_auto_dete