Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-19 Thread Jeff Squyres
On May 18, 2010, at 9:49 PM, Kevin Buckley wrote:

> > http://www.open-mpi.org/faq/?category=building#install-overwrite
> I did notice that the last one of the three seems to be using a
> fixed width, whereas text in the first and second flows
> into the browser window.

I used some fixed-width font words in the entries, but your text makes it sound
like a mistakenly unterminated HTML tag or somesuch.

Can you send me a screenshot showing what you're seeing, or can you cite 
specifically where you see the HTML problem?  When I view the above FAQ entry, 
it looks fine -- it flows to the width of the browser window, etc.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Kevin . Buckley
> I added several FAQ items -- how do they look?
>
> http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
> http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
> http://www.open-mpi.org/faq/?category=building#install-overwrite
>

  "This is due to some deep run time linker voodoo"

From what I have come to understand about this, I think that pretty
much covers it!

Seriously, this is good stuff to have "out there" though, because,
as you point out, the info an installer/user gets back, and through
which they might then first look to diagnose such issues, may not
steer them in the direction it should.

Kevin

PS
A style, as opposed to substance, thing:

I did notice that the last one of the three seems to be using a
fixed width, whereas text in the first and second flows
into the browser window.

-- 
Kevin M. Buckley
Room: CO327, Phone: +64 4 463 5971
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Jeff Squyres
I added several FAQ items -- how do they look?

http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
http://www.open-mpi.org/faq/?category=building#install-overwrite


On May 17, 2010, at 9:15 AM, Jeff Squyres (jsquyres) wrote:

> On May 16, 2010, at 5:56 PM, Kevin Buckley wrote:
> 
> > > Have you tried building Open MPI with the --disable-dlopen configure flag?
> > >  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> > > dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> > > libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> > > physically located in libmpi.so.
> >
> > Given your reasoning, that's gotta be worth a shot: wilco.
> 
> This issue has come up a few times on the list; I will add something to the 
> FAQ about this.






Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-17 Thread Jeff Squyres
On May 16, 2010, at 5:56 PM, Kevin Buckley wrote:

> > Have you tried building Open MPI with the --disable-dlopen configure flag?
> >  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> > dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> > libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> > physically located in libmpi.so.
> 
> Given your reasoning, that's gotta be worth a shot: wilco.

This issue has come up a few times on the list; I will add something to the FAQ 
about this.





Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-16 Thread Kevin . Buckley
Cc'd Aleksej as I'm not sure he's on the "devel" list, and Mark
Davies, as he is certainly not.

I'll also post this back onto the R HPC SIG list which is
where I came in.

Jeff Squyres wrote:

> Now, all this being said, IIRC (and I very well may not!), the real
> underlying issue here is that R is dlopening libmpi.so, which, in turn, is
> dlopening its own DSOs.  Given the global linker scoping issues, OMPI's
> DSOs are unable to find the symbols they need to resolve in the process
> (because libmpi.so was opened in a private scope).
>
> This probably is unfortunately larger than us (Open MPI) -- it's really a
> POSIX issue.  What would be ideal is if different linker namespaces could
> be something more fine-grained than "global" or "private" within a
> process.  E.g., if the private namespace of libmpi.so in the process could
> selectively make its symbol namespace available to the DSOs that it
> dlopens.  Right now, the only option libmpi.so has is to be opened
> with a public scope, which somewhat defeats the point of private
> scoping.
>

Tying in with the suggestions you make above, there would seem to
be a work-around fix for this, in the case of the Rmpi package
on NetBSD anyway.

Furthermore, the fix does not require any alterations to OpenMPI.

Apparently, there has been a similar issue, symbol visibility
when chaining shared library loading, within PAM on NetBSD.

Mark Davies has now determined a way to force the Rmpi package
to load libmpi.so, ahead of loading the Rmpi shared library itself,
so that what appear to be the missing symbols are then available,
for any future loads of the OpenMPI component libraries.


On the version of Rmpi that I have been using, 0.5-8, the "fix"
can be effected by the following one-line patch:

--- Rmpi/R/zzz.R        2009-02-04 05:27:08.0 +1300
+++ Rmpi.local/R/zzz.R  2010-05-17 14:25:27.0 +1200
@@ -7,6 +7,7 @@
 #cat(vertxt)
 
 # Check if lam-mpi is running
+dyn.load("/usr/pkg/lib/libmpi.so", local=FALSE)
 library.dynam("Rmpi", pkg, lib)
 if (!TRUE)
     stop("Fail to load Rmpi dynamic library.")


Note that this currently hard codes the path to the libmpi.so,
which for our system is in the standard NetBSD PkgSrc location,
though there are probably "nicer" ways to achieve the same end,
and greater flexibility, using R internals.

Having said that, this "fix" does not seem to be needed on
platforms that have a global scope for shared library symbols,
so attempts to make it generic may be pointless.
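
For anyone wanting to avoid the hard-coded path, here is a rough sketch of
the "search a few directories, then preload with global scope" idea. It is
written in Python with ctypes purely as an illustration, not as R code and
not as anything Rmpi actually does. The function names and the directory
list are my own assumptions; only /usr/pkg/lib comes from the patch above.

```python
import ctypes
import os

# Hypothetical search list: /usr/pkg/lib is the pkgsrc location used in the
# patch above; the other entries are illustrative assumptions.
DEFAULT_DIRS = ("/usr/pkg/lib", "/usr/local/lib", "/usr/lib")


def find_library_file(filename, search_dirs=DEFAULT_DIRS):
    """Return the first existing <dir>/<filename> in search_dirs, or None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


def preload_global(filename, search_dirs=DEFAULT_DIRS):
    """Open the library with global symbol scope, so that libraries loaded
    later in the same process can resolve symbols against it."""
    path = find_library_file(filename, search_dirs)
    if path is None:
        raise OSError("%s not found in: %s" % (filename, ", ".join(search_dirs)))
    # RTLD_GLOBAL plays the role that local=FALSE plays in dyn.load().
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

A caller would run preload_global("libmpi.so") before loading the wrapper
library that depends on it, which mirrors the ordering the patch enforces.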

Thanks for everyone's time on this issue. I'll certainly be
watching attempts to resolve the "larger than us (Open MPI)"
issue,

Kevin




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-16 Thread Kevin . Buckley
Jeff,

> So the error message is at least *somewhat* better than a totally
> misleading "file not found" message -- but it still only speculates
> on the real reason that libltdl failed to load the DSO.
>
> 2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an
> OMPI-specific change to libltdl that avoids the incorrect error message
> altogether.  So now OMPI should print out the *real* reason libltdl
> failed to load the DSO.
>
> It does not look like this patch made it over into the v1.4 series;
> it is awaiting review before it moves to the v1.5 branch
> (https://svn.open-mpi.org/trac/ompi/ticket/2337).
>
> Hope that all made sense!

Great insight. You'll appreciate I have some idea as to what's going on
but not the completed jigsaw view as to how all the pieces I find fit
into the whole, so thank you.

Not sure it explains away the inability of my libtool test program to
open the shared library in question, but it certainly moves things
forward.

> Have you tried building Open MPI with the --disable-dlopen configure flag?
>  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> physically located in libmpi.so.

Given your reasoning, that's gotta be worth a shot: wilco.

Thanks once again for your time on this,
Kevin




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-15 Thread Jeff Squyres
Sorry for the delay in replying.

I think that the issue here is the well-known libltdl "reporting the wrong 
error message" issue.  

Specifically, sometimes libltdl fails to load a DSO for a good reason, but then 
libltdl fails to report the right reason as to why it failed to load the DSO.  
Open MPI uses the function lt_dlerror() to get a printable string reason for
why a DSO fails to load.  But sometimes that string reason is *wrong* (i.e., 
the DSO didn't load, but the reason OMPI printed out as to *why* it didn't load 
is incorrect).  And therefore what OMPI prints out is misleading, at best.

Over time, we have tried two things to make this error message better:

1. When we detect the "wrong" error message (i.e., if lt_dlerror() returns 
"file not found"), we actually use stat() to check for the presence of the file 
we were trying to open.  If we find the file, then we don't print the 
lt_dlerror(), but instead print the message you see:

[europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open
/usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or
compiled for a different version of Open MPI? (ignored)

So the error message is at least *somewhat* better than a totally misleading 
"file not found" message -- but it still only speculates on the real reason 
that libltdl failed to load the DSO.

2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an OMPI-specific 
change to libltdl that avoids the incorrect error message altogether.  So now 
OMPI should print out the *real* reason libltdl failed to load the DSO.

It does not look like this patch made it over into the v1.4 series; it is 
awaiting review before it moves to the v1.5 branch 
(https://svn.open-mpi.org/trac/ompi/ticket/2337).  
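
To make the stat()-based check from item 1 concrete, here is a minimal
sketch in Python. It is not the OMPI source: describe_load_failure() and the
MISLEADING_ERROR constant are hypothetical names, and loader_error simply
stands in for whatever string lt_dlerror() handed back.

```python
import os

# Assumption: the misleading text the loader returns, as described in item 1.
MISLEADING_ERROR = "file not found"


def describe_load_failure(path, loader_error):
    """Pick the message to print after a component at `path` failed to open.

    `loader_error` stands in for the string returned by lt_dlerror().
    """
    # os.path.exists() performs a stat() on the path under the hood.
    if loader_error == MISLEADING_ERROR and os.path.exists(path):
        # The file is on disk, so "file not found" cannot be the real reason;
        # fall back to the speculative wording quoted above.
        return ("unable to open %s: perhaps a missing symbol, or compiled "
                "for a different version of Open MPI? (ignored)" % path)
    # Otherwise pass the loader's own explanation through unchanged.
    return "unable to open %s: %s (ignored)" % (path, loader_error)
```

Note that this can only rule out one wrong answer; it still cannot say what
the real cause was, which is why the second change matters.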

Hope that all made sense!

-

Now, all this being said, IIRC (and I very well may not!), the real underlying
issue here is that R is dlopening libmpi.so, which, in turn, is dlopening its
own DSOs.  Given the global linker scoping issues, OMPI's DSOs are unable to
find the symbols they need to resolve in the process (because libmpi.so was
opened in a private scope).

This probably is unfortunately larger than us (Open MPI) -- it's really a POSIX 
issue.  What would be ideal is if different linker namespaces could be 
something more fine-grained than "global" or "private" within a process.  E.g., 
if the private namespace of libmpi.so in the process could selectively make its 
symbol namespace available to the DSOs that it dlopens.  Right now, the only 
option libmpi.so has is to be opened with a public scope, which somewhat 
defeats the point of private scoping.
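
As a small illustration of the two scopes, here is a sketch using Python's
ctypes, which exposes the dlopen() flags as ctypes.RTLD_LOCAL and
ctypes.RTLD_GLOBAL.  The C math library is only a stand-in for libmpi.so,
and open_library() and cosine() are hypothetical helper names.  It assumes a
Unix-like system.

```python
import ctypes
import ctypes.util


def open_library(path, public):
    """dlopen `path` in public (RTLD_GLOBAL) or private (RTLD_LOCAL) scope."""
    mode = ctypes.RTLD_GLOBAL if public else ctypes.RTLD_LOCAL
    return ctypes.CDLL(path, mode=mode)


def cosine(handle, x):
    """Call cos() through an explicit library handle."""
    handle.cos.restype = ctypes.c_double
    handle.cos.argtypes = [ctypes.c_double]
    return handle.cos(x)


# Stand-in library: the C math library.  If find_library() returns None,
# ctypes opens the running program itself, which normally exports cos() too.
libm_path = ctypes.util.find_library("m")

private_handle = open_library(libm_path, public=False)
public_handle = open_library(libm_path, public=True)

# A lookup through the handle works in either scope.  The scope only decides
# whether libraries loaded *later* can resolve their undefined symbols
# against this one, which is exactly what OMPI's plugins need from libmpi.so.
private_result = cosine(private_handle, 0.0)
public_result = cosine(public_handle, 0.0)
```

The caller (R, in this thread) chooses the flag, which is why libmpi.so has
no way to fix this from the inside.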

Have you tried building Open MPI with the --disable-dlopen configure flag?  
This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no dlopening 
at run-time.  Hence, your app (R) can dlopen libmpi.so, but then libmpi.so 
doesn't dlopen anything else -- all of OMPI's plugins are physically located in 
libmpi.so.




On May 11, 2010, at 8:33 PM, Kevin Buckley wrote:

> 
> > Which libltdl version is that NetBSD ltdl.h from?  Which version is
> > in opal/libltdl?  Have you tried not doing the above change?
> >
> > libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> > as well as in the header, as well as (I think) in preloaded modules.
> 
> Hey Ralf,
> 
> The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.
> 
> An ldd of mpirun shows  -lltdl.7 => /usr/pkg/lib/libltdl.so.7
> 
> 
> I do need to attempt a build of 1.4.2 here in ECS, so I'll try
> building without the patches but I seem to recall that if those
> libtool-related patches
> 
> opal/Makefile.in
> configure
> opal/mca/base/mca_base_component_find.c
> opal/mca/base/mca_base_component_repository.c
> test/support/components.h
> test/support/components.c
> 
> were not applied, it did not even build. But we'll see.
> 
> 
> And if you are reading this, Aleksej, have you, as the real
> "OpenMPI on NetBSD" man, built a 1.4.2 yet?
> 
> Kevin






Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Kevin . Buckley

> Which libltdl version is that NetBSD ltdl.h from?  Which version is
> in opal/libltdl?  Have you tried not doing the above change?
>
> libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> as well as in the header, as well as (I think) in preloaded modules.

Hey Ralf,

The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.

An ldd of mpirun shows  -lltdl.7 => /usr/pkg/lib/libltdl.so.7


I do need to attempt a build of 1.4.2 here in ECS, so I'll try
building without the patches but I seem to recall that if those
libtool-related patches

opal/Makefile.in
configure
opal/mca/base/mca_base_component_find.c
opal/mca/base/mca_base_component_repository.c
test/support/components.h
test/support/components.c

were not applied, it did not even build. But we'll see.


And if you are reading this, Aleksej, have you, as the real
"OpenMPI on NetBSD" man, built a 1.4.2 yet?

Kevin




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-11 Thread Ralf Wildenhues
Hello Kevin,

* kevin.buck...@ecs.vuw.ac.nz wrote on Tue, May 11, 2010 at 06:42:01AM CEST:
> That is a file that gets patched in the NetBSD build as follows
> 
> $ diff opal/mca/base/mca_base_component_find.c{.orig,}
> 44,46d43
> < #ifndef __WINDOWS__
> <   #include "opal/libltdl/ltdl.h"
> < #else
> 48d44
> < #endif
> 
> i.e., we have taken out the inclusion of
> 
> opal/libltdl/ltdl.h
> 
> to force the use of the NetBSD "ltdl.h" one, which I guess might point
> to something underlying the issue but as to what ...

Which libltdl version is that NetBSD ltdl.h from?  Which version is in
opal/libltdl?  Have you tried not doing the above change?

libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
as well as in the header, as well as (I think) in preloaded modules.

Cheers,
Ralf