Brian -

Sorry for the slow reply, I've been on vacation for a while and am still
digging out from all the back e-mail.

Anyway, that makes sense.  Open MPI's default build mode is to dlopen()
the driver components needed for things like the various interconnects
and process starters we support.  Since libmpi was itself dlopen()'ed
with RTLD_LOCAL, the symbols those components need from libmpi were
not available when OMPI tried to dlopen() them.
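
To make this concrete, here's a minimal ctypes sketch of the failure
mode (the component path is copied from your error output and will
vary with the install prefix):

from ctypes import CDLL, RTLD_LOCAL

# libmpi's exported symbols stay private to this handle instead of
# entering the global namespace
mpi = CDLL('libmpi.0.dylib', mode=RTLD_LOCAL)

# A component's undefined symbols (e.g. _ompi_free_list_item_t_class)
# then cannot be resolved against libmpi, so opening it fails just as
# in your output:
CDLL('/usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so')
# -> OSError: ... Symbol not found: _ompi_free_list_item_t_class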

I was a little confused initially about why the symbols in our other
support libraries were found (everything seemed to work up to the MPI
level -- the run-time stuff initialized properly).  But apparently
this makes sense as well: shared libraries that are dependencies of
the dlopen()'ed object are loaded in such a way that their symbols
end up in the global namespace.

One solution, of course, is to specify RTLD_GLOBAL when dlopen()'ing
libmpi.  The other possibility is to build Open MPI with the
--disable-dlopen option, which causes all the components to be built
directly into libmpi, avoiding the whole namespacing issue.
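
In ctypes terms, the first workaround is just (library name as in your
example):

from ctypes import CDLL, RTLD_GLOBAL

# make libmpi's symbols visible to the components that Open MPI
# dlopen()s later
mpi = CDLL('libmpi.0.dylib', mode=RTLD_GLOBAL)

The second happens at build time, along the lines of:

./configure --disable-dlopen ...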

We'll add some information to the FAQ on this issue.  Thanks for
bringing it to our attention.

Brian

On Fri, 2006-09-08 at 10:51 -0600, Brian E Granger wrote:
> Brian,
> 
> I think I have figured this one out.  By default ctypes calls dlopen
> with mode = RTLD_LOCAL (except on Mac OS 10.3).  When I instruct
> ctypes to set mode = RTLD_GLOBAL it works fine on 10.4.  From the
> dlopen man page:
> 
>      RTLD_GLOBAL   Symbols exported from this image (dynamic library
>                    or bundle) will be available to any images built
>                    with the -flat_namespace option to ld(1) or to
>                    calls to dlsym() when using a special handle.
> 
>      RTLD_LOCAL    Symbols exported from this image (dynamic library
>                    or bundle) are generally hidden and only available
>                    to dlsym() when directly using the handle returned
>                    by this call to dlopen().  If neither RTLD_GLOBAL
>                    nor RTLD_LOCAL is specified, the default is
>                    RTLD_GLOBAL.
> 
> This behavior makes sense.  Thus the following works on 10.4:
> 
> from ctypes import *
> 
> # load libmpi with its symbols exported to the global namespace
> mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
> 
> # recover argc/argv from the interpreter to hand to MPI_Init
> f = pythonapi.Py_GetArgcArgv
> argc = c_int()
> argv = POINTER(c_char_p)()
> f(byref(argc), byref(argv))
> 
> mpi.MPI_Init(byref(argc), byref(argv))
> mpi.MPI_Finalize()
> 
> So I am not sure this is a defect in OpenMPI, but it sure is a subtle
> aspect of using it.  I will probably document this somewhere in the
> package I am creating.  
> 
> Thanks
> 
> Brian
> 
> On Sep 6, 2006, at 9:00 AM, Brian Barrett wrote:
> 
> > Thanks for the information.  I've filed a bug in our bug tracker on
> > this issue.  It appears that, for some reason, when libmpi is
> > dlopen()'ed by python, the objects it then dlopens are not able to
> > find symbols in libmpi.  It will probably take me a bit of time to
> > track this issue down, but you will be notified by the bug tracker
> > when the issue is resolved.
> > 
> > Brian
> > 
> > On Thu, 2006-08-31 at 17:27 -0600, Brian E Granger wrote:
> > > Brian,
> > > 
> > > Sure, but my example will probably seem a little odd.  I am calling
> > > the mpi shared library from Python using ctypes.
> > > 
> > > The dependencies for doing things this way are:
> > > 
> > > 1. Python built with --enable-shared
> > > 2. The ctypes python package
> > > 3. OpenMPI configured with --enable-shared
> > > 
> > > Once you have this, the following python script will cause the
> > > problem on Mac OS X:
> > > 
> > > from ctypes import *
> > > 
> > > f = pythonapi.Py_GetArgcArgv
> > > argc = c_int()
> > > argv = POINTER(c_char_p)()
> > > f(byref(argc), byref(argv))
> > > mpi = cdll.LoadLibrary('libmpi.0.dylib')
> > > mpi.MPI_Init(byref(argc), byref(argv))
> > > 
> > > I will try this on Linux as well to see if I get the same error.
> > > One important piece of the puzzle is that if I configure openmpi
> > > with the --disable-dlopen flag, I don't have the problem.  I will
> > > do some further testing on different systems and get back to you.
> > > 
> > > Thanks for looking at this.
> > > 
> > > Brian
> > > 
> > > On Aug 31, 2006, at 4:20 PM, Brian Barrett wrote:
> > > 
> > > > This is quite strange, and we're having some trouble figuring out
> > > > exactly why the opening is failing.  Do you have a (somewhat?)
> > > > easy list of instructions so that I can try to reproduce this?
> > > > 
> > > > Thanks,
> > > > 
> > > > Brian
> > > > 
> > > > On Tue, 2006-08-22 at 20:58 -0600, Brian Granger wrote:
> > > > > Hi,
> > > > > 
> > > > > I am trying to dynamically load libmpi.dylib on Mac OS X (using
> > > > > ctypes in python).  It seems to load fine, but when I call
> > > > > MPI_Init(), I get the error shown below.  I can call other
> > > > > functions just fine (like MPI_Initialized).
> > > > > 
> > > > > Also, my mpi install is seeing all the needed components and I
> > > > > can load them myself without error using dlopen.  I can also
> > > > > compile and run mpi programs, and I built openmpi with shared
> > > > > library support.
> > > > > 
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so, 9): Symbol not found: _ompi_free_list_item_t_class
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_allocator_basic.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so, 9): Symbol not found: _ompi_free_list_item_t_class
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_rcache_rb.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so, 9): Symbol not found: _mca_allocator_base_components
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_mpool_sm.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so, 9): Symbol not found: _ompi_free_list_item_t_class
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_pml_ob1.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so, 9): Symbol not found: _mca_pml
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_basic.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so, 9): Symbol not found: _ompi_mpi_op_max
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_hierarch.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_sm.so, 9): Symbol not found: _ompi_mpi_local_convertor
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_sm.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_coll_tuned.so, 9): Symbol not found: _mca_pml
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_coll_tuned.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > [localhost:00973] mca: base: component_find: unable to open: dlopen(/usr/local/openmpi-1.1/lib/openmpi/mca_osc_pt2pt.so, 9): Symbol not found: _ompi_request_t_class
> > > > >   Referenced from: /usr/local/openmpi-1.1/lib/openmpi/mca_osc_pt2pt.so
> > > > >   Expected in: flat namespace
> > > > >   (ignored)
> > > > > --------------------------------------------------------------------------
> > > > > No available pml components were found!
> > > > > 
> > > > > This means that there are no components of this type installed on your
> > > > > system or all the components reported that they could not be used.
> > > > > 
> > > > > This is a fatal error; your MPI process is likely to abort.  Check the
> > > > > output of the "ompi_info" command and ensure that components of this
> > > > > type are available on your system.  You may also wish to check the
> > > > > value of the "component_path" MCA parameter and ensure that it has at
> > > > > least one directory that contains valid MCA components.
> > > > > --------------------------------------------------------------------------
> > > > > [localhost:00973] PML ob1 cannot be selected
> > > > > 
> > > > > Any ideas?
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > Brian Granger
> > > 
> > > Brian E Granger, Ph.D.
> > > Research Scientist
> > > Tech-X Corporation
> > > phone:  720-974-1850
> > > bgran...@txcorp.com
> 
> Brian E Granger, Ph.D.
> Research Scientist
> Tech-X Corporation
> phone:  720-974-1850
> bgran...@txcorp.com