Gilles,

You described the problem exactly. I think we were able to nail down a solution to this one through judicious use of the -rpath $MPI_DIR/lib linker flag, allowing the runtime linker to properly find OpenMPI's symbols at runtime. We're operational. Thanks for your help.
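For anyone who hits the same problem, our link line now looks roughly like the following. This is a sketch rather than our actual build: the object name is a placeholder, and $MPI_DIR is just our Open MPI install prefix. The key part is the rpath pointing at the Open MPI lib directory, so the runtime linker can resolve libmpi and friends when the transport is dlopen'd:

    mpicc -shared -o libtransport_mpi.so transport_mpi.o \
        -Wl,-rpath,$MPI_DIR/lib

(Through the compiler wrapper the flag is spelled -Wl,-rpath,...; passed directly to ld it is just -rpath $MPI_DIR/lib.) A sketch of the RTLD_GLOBAL dlopen call Gilles suggested is at the bottom of this message, below the quoted thread.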
-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883

On Mon, Oct 17, 2016 at 9:45 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Sean,
>
> If I understand correctly, you built a libtransport_mpi.so library that
> depends on Open MPI, and your main program dlopens libtransport_mpi.so.
>
> In this case, and at least for the time being, you need to use
> RTLD_GLOBAL in your dlopen flags.
>
> Cheers,
>
> Gilles
>
> On 10/18/2016 4:53 AM, Sean Ahern wrote:
>
> Folks,
>
> For our code, we have a communication layer that abstracts the code that
> does the actual transfer of data. We call these "transports", and we link
> them as shared libraries. We have created an MPI transport that
> compiles/links against OpenMPI 2.0.1 using the compiler wrappers. When I
> compile OpenMPI with the --disable-dlopen option (thus cramming all of
> OpenMPI's plugins into the MPI library directly), things work great with
> our transport shared library. But when I have a "normal" OpenMPI (without
> --disable-dlopen) and create the same transport shared library, things
> fail. Upon launch, it appears that OpenMPI is unable to find the
> appropriate plugins:
>
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
> open mca_patcher_overwrite:
> /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_patcher_overwrite.so:
> undefined symbol: mca_patcher_base_patch_t_class (ignored)
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
> open mca_shmem_mmap:
> /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_mmap.so:
> undefined symbol: opal_show_help (ignored)
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
> open mca_shmem_posix:
> /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_posix.so:
> undefined symbol: opal_show_help (ignored)
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
> open mca_shmem_sysv:
> /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_sysv.so:
> undefined symbol: opal_show_help (ignored)
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_init failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Error" (-1) instead of "Success" (0)
>
> If I skip our shared libraries and instead write a standard MPI-based
> "hello, world" program that links against MPI directly (without
> --disable-dlopen), everything is again fine.
>
> It seems that having the double dlopen is causing problems for OpenMPI
> finding its own shared libraries.
>
> Note: I do have LD_LIBRARY_PATH pointing to …"openmpi-2.0.1/lib", as well
> as OPAL_PREFIX pointing to …"openmpi-2.0.1".
>
> Any thoughts about how I can try to tease out what's going wrong here?
>
> -Sean
>
> --
> Sean Ahern
> Computational Engineering International
> 919-363-0883
>
> _______________________________________________
> users mailing list
> us...@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
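And here is a minimal sketch of the RTLD_GLOBAL loading Gilles describes above. It assumes a loader shaped roughly like ours rather than showing our actual code; the library name comes from this thread, and RTLD_NOW versus RTLD_LAZY is incidental. RTLD_GLOBAL is the part that matters, since it makes the Open MPI symbols pulled in by the transport visible to the MCA components that Open MPI later dlopens itself:

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical loader: open a transport plugin so that the symbols of the
 * libraries it depends on (libmpi, libopen-pal, ...) land in the global
 * symbol namespace.  Without RTLD_GLOBAL, Open MPI's own dlopen'd MCA
 * components cannot resolve symbols such as opal_show_help. */
static void *load_transport(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL)
        fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
    return handle;
}

/* e.g. load_transport("libtransport_mpi.so");  -- link with -ldl if needed */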
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users