Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
On May 18, 2010, at 9:49 PM, wrote:

> > http://www.open-mpi.org/faq/?category=building#install-overwrite
>
> I did notice that the last one of the three seems to be using a
> fixed size width, whereas text in the first and second flow
> into the browser window.

I used some fixed-width font words in the entries, but your text makes it sound like a mistakenly-unterminated tag or somesuch. Can you send me a screenshot showing what you're seeing, or can you cite specifically where you see the HTML problem?

When I view the above FAQ entry, it looks fine -- it flows to the width of the browser window, etc.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
> I added several FAQ items -- how do they look?
>
> http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
> http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
> http://www.open-mpi.org/faq/?category=building#install-overwrite

> "This is due to some deep run time linker voodoo"

From what I have come to understand about this: I think that pretty much covers it!

Seriously, this is good stuff to have "out there" though, because, as you point out, the info an installer/user gets back, and through which they might then first look to diagnose such issues, may not steer them in the direction it should.

Kevin

PS A style as opposed to substance thing: I did notice that the last one of the three seems to be using a fixed size width, whereas text in the first and second flow into the browser window.

--
Kevin M. Buckley                  Room:  CO327
School of Engineering and         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
I added several FAQ items -- how do they look?

http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
http://www.open-mpi.org/faq/?category=building#install-overwrite

On May 17, 2010, at 9:15 AM, Jeff Squyres (jsquyres) wrote:

> On May 16, 2010, at 5:56 PM, wrote:
>
> > > Have you tried building Open MPI with the --disable-dlopen configure flag?
> > > This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> > > dlopening at run-time. Hence, your app (R) can dlopen libmpi.so, but then
> > > libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> > > physically located in libmpi.so.
> >
> > Given your reasoning, that's gotta be worth a shot: wilco.
>
> This issue has come up a few times on the list; I will add something to the
> FAQ about this.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
On May 16, 2010, at 5:56 PM, wrote:

> > Have you tried building Open MPI with the --disable-dlopen configure flag?
> > This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> > dlopening at run-time. Hence, your app (R) can dlopen libmpi.so, but then
> > libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> > physically located in libmpi.so.
>
> Given your reasoning, that's gotta be worth a shot: wilco.

This issue has come up a few times on the list; I will add something to the FAQ about this.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
Cc'd Aleksej as I'm not sure he's on the "devel" list, and Mark Davies, as he is certainly not. I'll also post this back onto the R HPC SIG list, which is where I came in.

Jeff Squyres wrote:

> Now, all this being said, IIRC (and I very well may not!), the real
> underlying issue here is that R is dlopening libmpi.so, which, in turn, is
> dlopening its own DSOs. Given the global linker scoping issues, OMPI's
> DSOs are unable to find the symbols they need to resolve in the process
> (because libmpi.so was opened in a private scope).
>
> This probably is unfortunately larger than us (Open MPI) -- it's really a
> POSIX issue. What would be ideal is if different linker namespaces could
> be something more fine-grained than "global" or "private" within a
> process. E.g., if the private namespace of libmpi.so in the process could
> selectively make its symbol namespace available to the DSOs that it
> dlopens. Right now, the only option libmpi.so has is to be opened
> with a public scope, which somewhat defeats the point of private
> scoping.

Tying in with the suggestions you make above, there would seem to be a work-around fix for this, in the case of the Rmpi package on NetBSD anyway. Furthermore, the fix does not require any alterations to OpenMPI.

Apparently, there has been a similar issue, symbol visibility when chaining shared library loading, within PAM on NetBSD.

Mark Davies has now determined a way to force the Rmpi package to load libmpi.so ahead of loading the Rmpi shared library itself, so that what appear to be the missing symbols are then available for any future loads of the OpenMPI component libraries.

On the version of Rmpi that I have been using, 0.5-8, the "fix" can be effected by the following one-line patch:

--- Rmpi/R/zzz.R 2009-02-04 05:27:08.0 +1300
+++ Rmpi.local/R/zzz.R 2010-05-17 14:25:27.0 +1200
@@ -7,6 +7,7 @@
 #cat(vertxt)
 # Check if lam-mpi is running
+dyn.load("/usr/pkg/lib/libmpi.so", local=FALSE)
 library.dynam("Rmpi", pkg, lib)
 if (!TRUE) stop("Fail to load Rmpi dynamic library.")

Note that this currently hard-codes the path to libmpi.so, which for our system is in the standard NetBSD PkgSrc location, though there are probably "nicer" ways to achieve the same end, and greater flexibility, using R internals.

Having said that, this "fix" does not seem to be needed on platforms that have a global scope for shared library symbols, so attempts to make it generic may be pointless.

Thanks for everyone's time on this issue. I'll certainly be watching attempts to resolve the "larger than us (Open MPI)" issue,

Kevin

--
Kevin M. Buckley                  Room:  CO327
School of Engineering and         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
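The effect of the one-line patch above (load the library into the global symbol scope first, then load whatever depends on it) can be sketched outside R. What follows is a minimal Python illustration using ctypes, which wraps dlopen(); it is not part of Rmpi or Open MPI. The helper name preload_global is hypothetical, and libc stands in for libmpi.so so that the sketch runs on a machine without Open MPI. It also looks the library up by name instead of hard-coding a path.

```python
import ctypes
import ctypes.util


def preload_global(name):
    """Load a shared library into the process-wide (global) symbol scope.

    `name` is a bare library name such as "mpi" or "c"; the file is
    located with ctypes.util.find_library() instead of a fixed path.
    (Hypothetical helper, for illustration only.)
    """
    path = ctypes.util.find_library(name)
    if path is None:
        raise OSError(f"could not locate a shared library for {name!r}")
    # RTLD_GLOBAL makes the library's symbols visible to libraries that
    # are loaded afterwards, the same intent as local=FALSE in dyn.load().
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


# libc is a stand-in for libmpi.so; any later dlopen() in this process
# can now resolve symbols against it.
libc = preload_global("c")
print(libc.abs(-3))
```

With Open MPI installed, preload_global("mpi") would be the analogous call, assuming the library can be found on the default search path.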
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
Jeff,

> So the error message is at least *somewhat* better than a totally
> misleading "file not found" message -- but it still only speculates
> on the real reason that libltdl failed to load the DSO.
>
> 2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an
> OMPI-specific change to libltdl that avoids the incorrect error message
> altogether. So now OMPI should print out the *real* reason libltdl
> failed to load the DSO.
>
> It does not look like this patch made it over into the v1.4 series;
> it is awaiting review before it moves to the v1.5 branch
> (https://svn.open-mpi.org/trac/ompi/ticket/2337).
>
> Hope that all made sense!

Great insight. You'll appreciate I have some idea as to what's going on but not the completed jigsaw view as to how all the pieces I find fit into the whole, so thank you.

Not sure it explains away the inability of my libtool test program to open the shared library in question, but it certainly moves things forwards.

> Have you tried building Open MPI with the --disable-dlopen configure flag?
> This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> dlopening at run-time. Hence, your app (R) can dlopen libmpi.so, but then
> libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> physically located in libmpi.so.

Given your reasoning, that's gotta be worth a shot: wilco.

Thanks once again for your time on this,

Kevin

--
Kevin M. Buckley                  Room:  CO327
School of Engineering and         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
Sorry for the delay in replying.

I think that the issue here is the well-known libltdl "reporting the wrong error message" issue. Specifically, sometimes libltdl fails to load a DSO for a good reason, but then libltdl fails to report the right reason as to why it failed to load the DSO.

Open MPI uses the function lt_dlerror() to get a printable string reason for why a DSO fails to load. But sometimes that string reason is *wrong* (i.e., the DSO didn't load, but the reason OMPI printed out as to *why* it didn't load is incorrect). And therefore what OMPI prints out is misleading, at best.

Over time, we have tried two things to make this error message better:

1. When we detect the "wrong" error message (i.e., if lt_dlerror() returns "file not found"), we actually use stat() to check for the presence of the file we were trying to open. If we find the file, then we don't print the lt_dlerror(), but instead print the message you see:

   [europa.ecs.vuw.ac.nz:09687] mca: base: component_find: unable to open /usr/pkg/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

   So the error message is at least *somewhat* better than a totally misleading "file not found" message -- but it still only speculates on the real reason that libltdl failed to load the DSO.

2. https://svn.open-mpi.org/trac/ompi/changeset/22806 put in an OMPI-specific change to libltdl that avoids the incorrect error message altogether. So now OMPI should print out the *real* reason libltdl failed to load the DSO.

   It does not look like this patch made it over into the v1.4 series; it is awaiting review before it moves to the v1.5 branch (https://svn.open-mpi.org/trac/ompi/ticket/2337).

Hope that all made sense!

Now, all this being said, IIRC (and I very well may not!), the real underlying issue here is that R is dlopening libmpi.so, which, in turn, is dlopening its own DSOs. Given the global linker scoping issues, OMPI's DSOs are unable to find the symbols they need to resolve in the process (because libmpi.so was opened in a private scope).

This probably is unfortunately larger than us (Open MPI) -- it's really a POSIX issue. What would be ideal is if different linker namespaces could be something more fine-grained than "global" or "private" within a process. E.g., if the private namespace of libmpi.so in the process could selectively make its symbol namespace available to the DSOs that it dlopens. Right now, the only option libmpi.so has is to be opened with a public scope, which somewhat defeats the point of private scoping.

Have you tried building Open MPI with the --disable-dlopen configure flag? This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no dlopening at run-time. Hence, your app (R) can dlopen libmpi.so, but then libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are physically located in libmpi.so.

On May 11, 2010, at 8:33 PM, wrote:

> > Which libltdl version is that NetBSD ltdl.h from? Which version is
> > in opal/libltdl? Have you tried not doing the above change?
> >
> > libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> > as well as in the header, as well as (I think) in preloaded modules.
>
> Hey Ralf,
>
> The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.
>
> An ldd of mpirun shows -lltdl.7 => /usr/pkg/lib/libltdl.so.7
>
> I do need to attempt a build of 1.4.2 here in ECS, so I'll try
> building without the patches but I seem to recall that if those
> libtool-related patches
>
> opal/Makefile.in
> configure
> opal/mca/base/mca_base_component_find.c
> opal/mca/base/mca_base_component_repository.c
> test/support/components.h
> test/support/components.c
>
> were not applied, it did not even build. But we'll see.
>
> And if you are reading this, Aleksej, have you, as the real
> "OpenMPI on NetBSD" man, built a 1.4.2 as yet?
>
> Kevin
>
> --
> Kevin M. Buckley                  Room:  CO327
> School of Engineering and         Phone: +64 4 463 5971
>  Computer Science
> Victoria University of Wellington
> New Zealand

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
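The stat() fallback described in point 1 of the message above can be sketched as follows. This is a Python illustration of the decision logic only, not Open MPI's actual C code: the function name describe_load_failure is hypothetical, os.path.exists() plays the role of the stat() call, the loader_message argument plays the role of the string returned by lt_dlerror(), and the wording of the second return value is an assumption. Only the "perhaps a missing symbol" text is taken from the log line quoted in the thread.

```python
import os
import tempfile

NOT_FOUND = "file not found"


def describe_load_failure(path, loader_message):
    """Choose the message to print when a component fails to load.

    If the loader claims the file is missing but the file is really
    there, the claim is wrong, so print a hedged guess instead of
    repeating it. Otherwise pass the loader's own message through.
    """
    if loader_message == NOT_FOUND and os.path.exists(path):
        return (f"unable to open {path}: perhaps a missing symbol, or "
                "compiled for a different version of Open MPI? (ignored)")
    # Assumed format for the pass-through case (not from the thread).
    return f"unable to open {path}: {loader_message} (ignored)"


with tempfile.NamedTemporaryFile() as component:
    # The file exists, so "file not found" cannot be the real reason.
    print(describe_load_failure(component.name, NOT_FOUND))
```

As the message notes, this only replaces a wrong reason with a guess; the changeset mentioned in point 2 is what reports the real one.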
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
> Which libltdl version is that NetBSD ltdl.h from? Which version is
> in opal/libltdl? Have you tried not doing the above change?
>
> libltdl 2.2.x has incompatible changes over 1.5.x, both in the library
> as well as in the header, as well as (I think) in preloaded modules.

Hey Ralf,

The libtool distinfo file implies NetBSD currently uses libtool-2.2.6b.

An ldd of mpirun shows -lltdl.7 => /usr/pkg/lib/libltdl.so.7

I do need to attempt a build of 1.4.2 here in ECS, so I'll try building without the patches, but I seem to recall that if those libtool-related patches

opal/Makefile.in
configure
opal/mca/base/mca_base_component_find.c
opal/mca/base/mca_base_component_repository.c
test/support/components.h
test/support/components.c

were not applied, it did not even build. But we'll see.

And if you are reading this, Aleksej, have you, as the real "OpenMPI on NetBSD" man, built a 1.4.2 as yet?

Kevin

--
Kevin M. Buckley                  Room:  CO327
School of Engineering and         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD
Hello Kevin,

* kevin.buck...@ecs.vuw.ac.nz wrote on Tue, May 11, 2010 at 06:42:01AM CEST:
> That is a file that gets patched in the NetBSD build as follows
>
> $diff opal/mca/base/mca_base_component_find.c{.orig,}
> 44,46d43
> < #ifndef __WINDOWS__
> < #include "opal/libltdl/ltdl.h"
> < #else
> 48d44
> < #endif
>
> ie we have taken out the inclusion of
>
> opal/libltdl/ltdl.h
>
> to force the use of the NetBSD "ltdl.h" one, which I guess might point
> to something underlying the issue but as to what ...

Which libltdl version is that NetBSD ltdl.h from? Which version is in opal/libltdl? Have you tried not doing the above change?

libltdl 2.2.x has incompatible changes over 1.5.x, both in the library as well as in the header, as well as (I think) in preloaded modules.

Cheers,
Ralf
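One way to see which installed libltdl a binary actually picks up is to read its ldd output, as was done for mpirun elsewhere in this thread. The sketch below is a Python illustration under stated assumptions: the helper name linked_libraries is hypothetical, it parses text that has already been captured, not ldd itself, and the second line of the sample is made up for contrast. The number it extracts is the shared-library major version from the file name, which is not the libltdl release number (1.5.x or 2.2.x), so it only identifies which copy is in use.

```python
import re

# Matches "name => path" lines, e.g. the NetBSD form
# "-lltdl.7 => /usr/pkg/lib/libltdl.so.7".
ARROW = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)")


def linked_libraries(ldd_output, stem):
    """Return (path, major) pairs for libraries whose file name starts with `stem`."""
    found = []
    for line in ldd_output.splitlines():
        match = ARROW.match(line)
        if not match:
            continue
        path = match.group(2)
        filename = path.rsplit("/", 1)[-1]
        if not filename.startswith(stem + ".so"):
            continue
        # The first number after ".so." is the shared-library major version.
        version = re.search(r"\.so\.(\d+)", filename)
        found.append((path, int(version.group(1)) if version else None))
    return found


# First line is from the thread; the libc line is invented for contrast.
sample = (
    "-lltdl.7 => /usr/pkg/lib/libltdl.so.7\n"
    "-lc.12 => /usr/lib/libc.so.12\n"
)
print(linked_libraries(sample, "libltdl"))
```

Comparing the path reported here with the ltdl.h that was used at compile time is one way to check that the header and the library come from the same libltdl installation.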