[OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Kevin . Buckley
Hi there,

this is an issue that I started a while ago on the R HPC SIG mailing
list and which then moved into an off-list conversation with Jeff
Squyres but on which no progress has been made.

I believe that the issue is less with Rmpi than with something
that Rmpi is exposing in OpenMPI specifically on NetBSD, hence
posting here.

(FWIW, I have since had an Rmpi/R/SGE/OpenMPI stack running on
 RHEL/Vmware, once I realised that I had to exclude the virbr0
 interfaces that OpenMPI seemed to take quite a liking to!)

I appreciate that few on the list are running OpenMPI on NetBSD
but, as detailed below, I found the OpenMPI thread

"[OMPI devel] Missing Symbol"

that seems to tie in with the problem I am seeing and, more
importantly, originated away from a NetBSD implementation.

I thus thought I'd stick the guts of the off-list conversation
onto the OpenMPI list and see if anyone else who may have been
involved with the "Missing Symbol" thread has any ideas.

There would seem to have been four emails of relevance from that
off-list conversation, so eyes down, looking for a full house:



=== Part 1 ===

Basically, when I come to load the Rmpi library

> library(Rmpi, lib.loc="/local/scratch/kevin/Pkgs/R/")

I get a swathe of OpenMPI errors (attached below)


[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or
compiled for a different version of Open MPI? (ignored)
[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled
for a different version of Open MPI? (ignored)


=== Part 2 ===

> An off the wall question -- do you have multiple versions of Open MPI
> installed on the system, perchance?  I wonder if you compiled Rmpi.so with
> one version of OMPI and it's picking up libmpi.so from the other version
> (or something along those lines).  Mismatches between the versions might
> well cause issues like this...?

That's not it. Everything is from 1.4.1.

I have once again delved deeper into the innards of the OpenMPI
source than I would have expected and seen that the error message
is coming from just after

File:
opal/mca/base/mca_base_component_find.c

Routine:
static int open_component(component_file_item_t *target_file,
   opal_list_t *found_components)

Code:
#if OPAL_HAVE_LTDL_ADVISE
  component_handle = lt_dlopenadvise(target_file->filename,
                                     opal_mca_dladvise);
#else
  component_handle = lt_dlopenext(target_file->filename);
#endif


where there's a bit of ferkling going on so as to check for
a given file existing, hence the "slightly better error message".

We have

./opal/include/opal_config.h:#define OPAL_HAVE_LTDL_ADVISE 0

so we are invoking the lt_dlopenext clause.

That is a file that gets patched in the NetBSD build as follows

$ diff opal/mca/base/mca_base_component_find.c{.orig,}
44,46d43
<   #ifndef __WINDOWS__
< #include "opal/libltdl/ltdl.h"
<   #else
48d44
<   #endif

ie we have taken out the inclusion of

opal/libltdl/ltdl.h

to force the use of the NetBSD "ltdl.h" one, which I guess might point
to something underlying the issue but as to what ...

OK, from what I can see, I have

$ ls -l /usr/pkg/lib/openmpi/mca_carto_auto_detect*
-rw-r--r-- 1 root wheel 3892 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.a
-rwxr-xr-x 1 root wheel 1105 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.la*
-rwxr-xr-x 1 root wheel 7078 Mar 22 16:21 /usr/pkg/lib/openmpi/mca_carto_auto_detect.so*


however there are no "versioned" links for the .so file
(.so.0, .so.0.0.0 etc) but would that be an issue - probably not.


Furthermore, the Autobook (yes, I read some of that too!) says:

Function: lt_dlhandle lt_dlopenext (const char *filename)
This function is used in precisely the same way as lt_dlopen. However,
if the search for the named module by exact match against filename
fails, it will try again with a `.la' extension, and then the native
shared library extension (`.sl' on HP-UX, for example).

so the file that will end up being referenced obviously exists,
so why would

lt_dlopenext

not be able to open the library there?

It would seem (from the error message) that what's being passed
to the routine as

target_file->filename

is

/usr/pkg/lib/openmpi/mca_carto_file

and so lt_dlopenext should at least find the .la and the .so
rather than punt, no ?
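
To make that expectation concrete, here is a rough sketch of the lookup
order as the Autobook text above describes it: the exact name, then `.la',
then the native shared library extension. It only builds the candidate
names; it is not libltdl code, and whether a given libltdl build really
follows this order is not something the sketch verifies. The helper name
candidate_name() and the use of ".so" as the native extension are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of libltdl: write the idx-th name from the
 * order described in the quoted text (0: exact name, 1: ".la" appended,
 * 2: native shared library extension appended).
 * native_ext is an assumption, e.g. ".so" on an ELF system.
 * Returns 0 on success, -1 if idx is out of range or buf is too small. */
int candidate_name(const char *base, const char *native_ext, int idx,
                   char *buf, size_t len)
{
    const char *suffix;

    assert(base != NULL && native_ext != NULL && buf != NULL);

    switch (idx) {
    case 0:  suffix = "";         break;
    case 1:  suffix = ".la";      break;
    case 2:  suffix = native_ext; break;
    default: return -1;
    }

    /* Refuse to truncate: a shortened path would name a different file. */
    if (strlen(base) + strlen(suffix) + 1 > len)
        return -1;

    snprintf(buf, len, "%s%s", base, suffix);
    return 0;
}
```

Called with "/usr/pkg/lib/openmpi/mca_carto_file" and ".so", indices 1 and 2
give the .la and .so names that the ls output shows are present for the
sibling component, which is why a plain "cannot open" looks so odd here.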

I am at a loss as to how to debug this further as my experience of
adding flags to openmpi invocations is zero.


In case you speak libtool (?) I enclose the .la file but nothing
looks "wrong" to me.

Kevin


# mca_carto_auto_detect.la - a libtool library file
# Generated by ltmain.sh (GNU libtool) 2.2.6b
#
# Please DO NOT delete this file!
# It is necessary for linking the library.

# The name that we can dlopen(3).
dlname='mca_carto_auto_detect.so'

# Names of this library.
library_names='mca_carto_auto_detect.so mca_carto_auto_dete

Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Ralf Wildenhues
Hello Kevin,

* kevin.buck...@ecs.vuw.ac.nz wrote on Tue, May 11, 2010 at 06:42:01AM CEST:
> That is a file that gets patched in the NetBSD build as follows
> 
> $diff opal/mca/base/mca_base_component_find.c{.orig,}
> 44,46d43
> <   #ifndef __WINDOWS__
> < #include "opal/libltdl/ltdl.h"
> <   #else
> 48d44
> <   #endif
> 
> ie we have taken out the inclusion of
> 
> opal/libltdl/ltdl.h
> 
> to force the use of the NetBSD "ltdl.h" one, which I guess might point
> to something underlying the issue but as to what ...

Which libltdl version is that NetBSD ltdl.h from?  Which version is in
opal/libltdl?  Have you tried not doing the above change?

libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
as well as in the header, as well as (I think) in preloaded modules.

Cheers,
Ralf


Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Kevin . Buckley

> Which libltdl version is that NetBSD ltdl.h from?  Which version is
> in opal/libltdl?  Have you tried not doing the above change?
>
> libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> as well as in the header, as well as (I think) in preloaded modules.

Hey Ralf,

The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.

An ldd of mpirun shows  -lltdl.7 => /usr/pkg/lib/libltdl.so.7


I do need to attempt a build of 1.4.2 here in ECS, so I'll try
building without the patches but I seem to recall that if those
libtool-related patches

opal/Makefile.in
configure
opal/mca/base/mca_base_component_find.c
opal/mca/base/mca_base_component_repository.c
test/support/components.h
test/support/components.c

were not applied, it did not even build. But we'll see.


And if you are reading this, Aleksej, have you, as the real
"OpenMPI on NetBSD" man, built a 1.4.2 as yet?

Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-15 Thread Jeff Squyres
Sorry for the delay in replying.

I think that the issue here is the well-known libltdl "reporting the wrong 
error message" issue.  

Specifically, sometimes libltdl fails to load a DSO for a good reason, but then 
libltdl fails to report the right reason as to why it failed to load the DSO.  
Open MPI uses the function lt_dlerror() to get a printable string reason for 
why a DSO fails to load.  But sometimes that string reason is *wrong* (i.e., 
the DSO didn't load, but the reason OMPI printed out as to *why* it didn't load 
is incorrect).  And therefore what OMPI prints out is misleading, at best.

Over time, we have tried two things to make this error message better:

1. When we detect the "wrong" error message (i.e., if lt_dlerror() returns 
"file not found"), we actually use stat() to check for the presence of the file 
we were trying to open.  If we find the file, then we don't print the 
lt_dlerror(), but instead print the message you see:

[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or
compiled for a different version of Open MPI? (ignored)

So the error message is at least *somewhat* better than a totally misleading 
"file not found" message -- but it still only speculates on the real reason 
that libltdl failed to load the DSO.
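
As a rough sketch of the heuristic in item 1 (not the actual Open MPI
source): if the loader's error string is "file not found" but stat() can
see a file with a plausible extension, swap in the more cautious message.
The helper names, the extension list (".la", ".so") and passing the loader
error in as a plain string are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Return 1 if "<base><ext>" exists according to stat(), else 0. */
static int exists_with_ext(const char *base, const char *ext)
{
    char path[4096];
    struct stat sb;

    assert(base != NULL && ext != NULL);
    if (strlen(base) + strlen(ext) + 1 > sizeof(path))
        return 0;
    snprintf(path, sizeof(path), "%s%s", base, ext);
    return stat(path, &sb) == 0;
}

/* Hypothetical helper: pick the message to print for a failed load.
 * loader_err stands in for whatever lt_dlerror() returned. */
const char *describe_load_failure(const char *base, const char *loader_err)
{
    if (loader_err == NULL)
        return "loader gave no error string";

    /* "file not found" is only believable if the file really is absent. */
    if (strcmp(loader_err, "file not found") == 0 &&
        (exists_with_ext(base, ".la") || exists_with_ext(base, ".so")))
        return "perhaps a missing symbol, or compiled for a different "
               "version of Open MPI?";

    return loader_err;
}
```

Note that the replacement text is still a guess: the check only rules out
"the file is absent", it does not discover the real cause.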

2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an OMPI-specific 
change to libltdl that avoids the incorrect error message altogether.  So now 
OMPI should print out the *real* reason libltdl failed to load the DSO.

It does not look like this patch made it over into the v1.4 series; it is 
awaiting review before it moves to the v1.5 branch 
(https://svn.open-mpi.org/trac/ompi/ticket/2337).  

Hope that all made sense!

-

Now, all this being said, IIRC (and I very well may not!), the real underlying 
issue here is that R is dlopening libmpi.so, which, in turn, is dlopening its 
own DSOs.  Given the global linker scoping issues, OMPI's DSOs are unable to 
find the symbols they need to resolve in the process (because libmpi.so was 
opened in a private scope).

This probably is unfortunately larger than us (Open MPI) -- it's really a POSIX 
issue.  What would be ideal is if different linker namespaces could be 
something more fine-grained than "global" or "private" within a process.  E.g., 
if the private namespace of libmpi.so in the process could selectively make its 
symbol namespace available to the DSOs that it dlopens.  Right now, the only 
option libmpi.so has is to be opened with a public scope, which somewhat 
defeats the point of private scoping.
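
For illustration only, the "global" versus "private" choice corresponds to
the RTLD_GLOBAL and RTLD_LOCAL flags of plain POSIX dlopen(). The sketch
below assumes direct use of dlopen() rather than libltdl, and the helper
name open_with_scope() is made up for the example. Depending on the
platform it may need to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: open a shared object with the requested scope.
 * global != 0: its symbols become available to objects loaded later.
 * global == 0: its symbols stay private to the returned handle.
 * Returns the handle, or NULL, in which case dlerror() describes why. */
void *open_with_scope(const char *path, int global)
{
    assert(path != NULL);
    return dlopen(path, RTLD_NOW | (global ? RTLD_GLOBAL : RTLD_LOCAL));
}
```

A host application that opens libmpi.so with global set to 1 would be
taking the "public scope" option described above; with 0, any DSO that
libmpi.so later dlopens cannot resolve symbols against it by name.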

Have you tried building Open MPI with the --disable-dlopen configure flag?  
This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no dlopening 
at run-time.  Hence, your app (R) can dlopen libmpi.so, but then libmpi.so 
doesn't dlopen anything else -- all of OMPI's plugins are physically located in 
libmpi.so.






-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-16 Thread Kevin . Buckley
Jeff,

> So the error message is at least *somewhat* better than a totally
> misleading "file not found" message -- but it still only speculates
> on the real reason that libltdl failed to load the DSO.
>
> 2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an
> OMPI-specific change to libltdl that avoids the incorrect error message
> altogether.  So now OMPI should print out the *real* reason libltdl
> failed to load the DSO.
>
> It does not look like this patch made it over into the v1.4 series;
> it is awaiting review before it moves to the v1.5 branch
> (https://svn.open-mpi.org/trac/ompi/ticket/2337).
>
> Hope that all made sense!

Great insight. You'll appreciate I have some idea as to what's going on
but not the completed jigsaw view as to how all the pieces I find fit
into the whole, so thank you.

Not sure it explains away the inability of my libtool test program to
open the shared-library in question but it certainly moves things
forwards.

> Have you tried building Open MPI with the --disable-dlopen configure flag?
>  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> physically located in libmpi.so.

Given your reasoning, that's gotta be worth a shot: wilco.

Thanks once again for your time on this,
Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-16 Thread Kevin . Buckley
Cc'd Aleksej as I'm not sure he's on the "devel" list, and Mark
Davies, as he is certainly not.

I'll also post this back onto the R HPC SIG list which is
where I came in.

Jeff Squyres wrote:

> Now, all this being said, IIRC (and I very well may not!), the real
> underlying issue here is that R is dlopening libmpi.so, which, in turn, is
> dlopening its own DSOs.  Given the global linker scoping issues, OMPI's
> DSOs are unable to find the symbols they need to resolve in the process
> (because libmpi.so was opened in a private scope).
>
> This probably is unfortunately larger than us (Open MPI) -- it's really a
> POSIX issue.  What would be ideal is if different linker namespaces could
> be something more fine-grained than "global" or "private" within a
> process.  E.g., if the private namespace of libmpi.so in the process could
> selectively make its symbol namespace available to the DSOs that it
> dlopens.  Right now, the only option libmpi.so has is to be opened
> with a public scope, which somewhat defeats the point of private
> scoping.
>

Tying in with the suggestions you make above, there would seem to
be a work-around fix for this, in the case of the Rmpi package
on NetBSD anyway.

Furthermore, the fix does not require any alterations to OpenMPI.

Apparently, there has been a similar issue, symbol visibility
when chaining shared library loading, within PAM on NetBSD.

Mark Davies has now determined a way to force the Rmpi package
to load libmpi.so, ahead of loading the Rmpi shared library itself,
so that what appear to be the missing symbols are then available,
for any future loads of the OpenMPI component libraries.


On the version of Rmpi that I have been using, 0.5-8, the "fix"
can be effected by the following one-line patch

--- Rmpi/R/zzz.R  2009-02-04 05:27:08.0 +1300
+++ Rmpi.local/R/zzz.R  2010-05-17 14:25:27.0 +1200
@@ -7,6 +7,7 @@
 #cat(vertxt)

 # Check if lam-mpi is running
+dyn.load("/usr/pkg/lib/libmpi.so", local=FALSE)
 library.dynam("Rmpi", pkg, lib)
 if (!TRUE)
     stop("Fail to load Rmpi dynamic library.")


Note that this currently hard codes the path to the libmpi.so,
which for our system is in the standard NetBSD PkgSrc location,
though there are probably "nicer" ways to achieve the same end,
and greater flexibility, using R internals.

Having said that, this "fix" does not seem to be needed on
platforms that have a global scope for shared library symbols,
so maybe attempts to make it generic may be pointless.

Thanks for everyone's time on this issue. I'll certainly be
watching attempts to resolve the "larger than us (Open MPI)"
issue,

Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-17 Thread Jeff Squyres
On May 16, 2010, at 5:56 PM, Kevin Buckley wrote:

> > Have you tried building Open MPI with the --disable-dlopen configure flag?
> >  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> > dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> > libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> > physically located in libmpi.so.
> 
> Given your reasoning, that's gotta be worth a shot: wilco.

This issue has come up a few times on the list; I will add something to the FAQ 
about this.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Jeff Squyres
I added several FAQ items -- how do they look?

http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
http://www.open-mpi.org/faq/?category=building#install-overwrite




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Kevin . Buckley
> I added several FAQ items -- how do they look?
>
> http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
> http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
> http://www.open-mpi.org/faq/?category=building#install-overwrite
>

  "This is due to some deep run time linker voodoo"

From what I have come to understand about this: I think that pretty
much covers it!

Seriously, this is good stuff to have "out there" though, because,
as you point out, the info an installer/user gets back, and through
which they might then first look to diagnose such issues, may not
steer them in the direction it should.

Kevin

PS
A style as opposed to substance thing:

I did notice that the last one of the three seems to be using a
fixed size width, whereas text in the first and second flows
into the browser window.

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-19 Thread Jeff Squyres
On May 18, 2010, at 9:49 PM, Kevin Buckley wrote:

> > http://www.open-mpi.org/faq/?category=building#install-overwrite
> I did notice that the last one of the three seems to be using a
> fixed size width, whereas text in the first and second flows
> into the browser window.

I used some fixed-width font words in the entries, but your text makes it sound 
like a mistakenly-unterminated tag or somesuch.

Can you send me a screenshot showing what you're seeing, or can you cite 
specifically where you see the HTML problem?  When I view the above FAQ entry, 
it looks fine -- it flows to the width of the browser window, etc.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/