Re: [OMPI users] Problem with double shared library

2016-10-28 Thread Sean Ahern
Gilles,

You described the problem exactly. I think we were able to nail down a
solution to this one through judicious use of the -rpath $MPI_DIR/lib
linker flag, which lets the runtime linker properly resolve OpenMPI's
symbols. We're operational. Thanks for your help.
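
For anyone hitting the same thing, a minimal sketch of such a link line
(assuming the Open MPI wrapper compiler mpicc and an MPI_DIR variable
pointing at the Open MPI install prefix; our real build differs):

    mpicc -shared -fPIC -o libtransport_mpi.so transport_mpi.o \
        -Wl,-rpath,$MPI_DIR/lib

With the rpath baked into libtransport_mpi.so, the runtime linker can
locate the Open MPI libraries regardless of how the transport itself is
loaded.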

-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883

On Mon, Oct 17, 2016 at 9:45 PM, Gilles Gouaillardet wrote:

> Sean,
>
>
> if I understand correctly, you built a libtransport_mpi.so library that
> depends on Open MPI, and your main program dlopens libtransport_mpi.so.
>
> In this case, and at least for the time being, you need to use
> RTLD_GLOBAL in your dlopen flags.
>
>
> Cheers,
>
>
> Gilles
>
> On 10/18/2016 4:53 AM, Sean Ahern wrote:
>
> Folks,
>
> For our code, we have a communication layer that abstracts the code that
> does the actual transfer of data. We call these "transports", and we link
> them as shared libraries. We have created an MPI transport that
> compiles/links against OpenMPI 2.0.1 using the compiler wrappers. When I
> compile OpenMPI with the --disable-dlopen option (thus cramming all of
> OpenMPI's plugins into the MPI library directly), things work great with
> our transport shared library. But when I have a "normal" OpenMPI (without
> --disable-dlopen) and create the same transport shared library, things
> fail. Upon launch, it appears that OpenMPI is unable to find the
> appropriate plugins:
>
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: *mca_patcher_base_patch_t_class* (ignored)
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_mmap: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_mmap.so: undefined symbol: *opal_show_help* (ignored)
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_posix: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_posix.so: undefined symbol: *opal_show_help* (ignored)
> [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_sysv: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_sysv.so: undefined symbol: *opal_show_help* (ignored)
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_init failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Error" (-1) instead of "Success" (0)
>
>
> If I skip our shared libraries and instead write a standard MPI-based
> "hello, world" program that links against MPI directly (without
> --disable-dlopen), everything is again fine.
>
> It seems that having the double dlopen is causing problems for OpenMPI
> finding its own shared libraries.
>
> Note: I do have LD_LIBRARY_PATH pointing to …"openmpi-2.0.1/lib", as well
> as OPAL_PREFIX pointing to …"openmpi-2.0.1".
>
> Any thoughts about how I can try to tease out what's going wrong here?
>
> -Sean
>
> --
> Sean Ahern
> Computational Engineering International
> 919-363-0883
>
>

Re: [OMPI users] Problem with double shared library

2016-10-17 Thread Gilles Gouaillardet

Sean,


if I understand correctly, you built a libtransport_mpi.so library that
depends on Open MPI, and your main program dlopens libtransport_mpi.so.


In this case, and at least for the time being, you need to use
RTLD_GLOBAL in your dlopen flags.
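
A minimal sketch of what that looks like in the host program (the library
name follows your description; the entry-point lookup is only
illustrative):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* RTLD_GLOBAL makes the Open MPI symbols pulled in by
     * libtransport_mpi.so visible to the MCA components that
     * Open MPI itself dlopen()s later (mca_shmem_mmap.so, ...). */
    void *handle = dlopen("libtransport_mpi.so", RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* ... dlsym() the transport's entry points and run as usual ... */

    dlclose(handle);
    return 0;
}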



Cheers,


Gilles


On 10/18/2016 4:53 AM, Sean Ahern wrote:

Folks,

For our code, we have a communication layer that abstracts the code 
that does the actual transfer of data. We call these "transports", and 
we link them as shared libraries. We have created an MPI transport 
that compiles/links against OpenMPI 2.0.1 using the compiler wrappers. 
When I compile OpenMPI with the --disable-dlopen option (thus cramming 
all of OpenMPI's plugins into the MPI library directly), things work 
great with our transport shared library. But when I have a "normal" 
OpenMPI (without --disable-dlopen) and create the same transport 
shared library, things fail. Upon launch, it appears that OpenMPI is 
unable to find the appropriate plugins:


[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: *mca_patcher_base_patch_t_class* (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_mmap: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_mmap.so: undefined symbol: *opal_show_help* (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_posix: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_posix.so: undefined symbol: *opal_show_help* (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_sysv: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_sysv.so: undefined symbol: *opal_show_help* (ignored)
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)


If I skip our shared libraries and instead write a standard MPI-based 
"hello, world" program that links against MPI directly (without 
--disable-dlopen), everything is again fine.


It seems that having the double dlopen is causing problems for OpenMPI 
finding its own shared libraries.


Note: I do have LD_LIBRARY_PATH pointing to …"openmpi-2.0.1/lib", as 
well as OPAL_PREFIX pointing to …"openmpi-2.0.1".


Any thoughts about how I can try to tease out what's going wrong here?

-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Problem with double shared library

2016-10-17 Thread Sean Ahern
Folks,

For our code, we have a communication layer that abstracts the code that
does the actual transfer of data. We call these "transports", and we link
them as shared libraries. We have created an MPI transport that
compiles/links against OpenMPI 2.0.1 using the compiler wrappers. When I
compile OpenMPI with the --disable-dlopen option (thus cramming all of
OpenMPI's plugins into the MPI library directly), things work great with
our transport shared library. But when I have a "normal" OpenMPI (without
--disable-dlopen) and create the same transport shared library, things
fail. Upon launch, it appears that OpenMPI is unable to find the
appropriate plugins:

[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: *mca_patcher_base_patch_t_class* (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_mmap: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_mmap.so: undefined symbol: *opal_show_help* (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_posix: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_posix.so: undefined symbol: *opal_show_help* (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_sysv: /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_sysv.so: undefined symbol: *opal_show_help* (ignored)
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)


If I skip our shared libraries and instead write a standard MPI-based
"hello, world" program that links against MPI directly (without
--disable-dlopen), everything is again fine.

It seems that having the double dlopen is causing problems for OpenMPI
finding its own shared libraries.
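
To make the loading pattern concrete, here is a stripped-down sketch of
what our main program effectively does (names like transport_init are
illustrative, not our real API):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Default dlopen flags imply RTLD_LOCAL, so the libmpi/libopen-pal
     * symbols dragged in by the transport stay private to it. */
    void *transport = dlopen("libtransport_mpi.so", RTLD_NOW);
    if (transport == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* The transport initializes MPI; at that point Open MPI dlopen()s
     * its MCA plugins (mca_shmem_mmap.so, ...), and those fail to
     * resolve symbols such as opal_show_help. */
    void (*transport_init)(void) =
        (void (*)(void)) dlsym(transport, "transport_init");
    if (transport_init != NULL) {
        transport_init();
    }

    dlclose(transport);
    return 0;
}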

Note: I do have LD_LIBRARY_PATH pointing to …"openmpi-2.0.1/lib", as well
as OPAL_PREFIX pointing to …"openmpi-2.0.1".

Any thoughts about how I can try to tease out what's going wrong here?

-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users