Hi Gilles,

Adding *,usnic* made it work :) so --mca pml ob1 turned out not to be needed.
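
For the record, the full command that works now looks like this (same script, process count, and knem workaround as in my earlier mail):

mpirun --mca btl ^openib,usnic --mca btl_sm_use_knem 0 -np 5 myscript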

Does disabling InfiniBand make MPI very slow? (And what exactly does --mca pml ob1 do?)

Regarding the version mismatch, everything seems to be right: when only one version is loaded, the PATH and the LD_LIBRARY_PATH point to that version only, and with strings everything references the right version.
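
Concretely, I checked it more or less like this (the grep patterns are only illustrative):

which mpirun
echo $PATH | tr ':' '\n' | grep -i mpi
echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -i mpi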

Thanks a lot for the quick answers!
On 22/08/16 13:09, Gilles Gouaillardet wrote:
Juan,

can you try to
mpirun --mca btl ^openib,usnic --mca pml ob1 ...

note this simply disables native infiniband. from a performance point of view, you should have your sysadmin fix the infiniband fabric.

about the version mismatch, please double check your environment
(e.g. $PATH and $LD_LIBRARY_PATH); it is likely v2.0 is in your environment when you are using v1.8, or the other way around.
also, make sure orted is launched with the right environment.
if you are using ssh, then
ssh node env
should not contain any reference to the version you are not using
(that typically occurs when the environment is set in the .bashrc, directly or via modules)
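
for example, something like this (node is a placeholder for one of your hostnames):

ssh node env | grep -i -E 'openmpi|path'

should only show paths belonging to the version you intend to use.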

last but not least, you can run
strings /.../bin/orted
strings /.../lib/libmpi.so
and check they do not reference the wrong version
(that can happen if a library was built and then moved)
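
for example, with the 1.8.1 prefix that appears in your backtrace:

strings /opt/openmpi-1.8.1/bin/orted | grep -i 2.0.0
strings /opt/openmpi-1.8.1/lib/libmpi.so | grep -i 2.0.0

any output here would mean the wrong version is being referenced.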


Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq <bioinformatica-i...@us.es> wrote:

    Dear Ralph,

    The existence of the two versions does not seem to be the source
    of the problems, since they are in different locations. I uninstalled
    the most recent version and tried again with no luck, getting the
    same warnings/errors. However, after a deep search I found a
    couple of hints, and executed this:

    mpirun *-mca btl ^openib -mca btl_sm_use_knem 0* -np 5 myscript

    and got only a fraction of the previous errors (before, I had run
    the same command without the arguments in bold); the remaining
    ones are related to OpenFabrics:

    Open MPI failed to open an OpenFabrics device.  This is an unusual
    error; the system reported the OpenFabrics device as being present,
    but then later failed to access it successfully. This usually
    indicates either a misconfiguration or a failed OpenFabrics hardware
    device.

    All OpenFabrics support has been disabled in this MPI process; your
    job may or may not continue.

      Hostname:    MYMACHINE
      Device name: mlx4_0
      Errror (22): Invalid argument
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    [[52062,1],0]: A high-performance Open MPI point-to-point messaging module
    was unable to find any relevant network interfaces:

    Module: usNIC
      Host: MYMACHINE

    Another transport will be used instead, although this may result in
    lower performance.
    --------------------------------------------------------------------------

    Do you have any idea why this could happen?


    Thanks a lot

    On 19/08/16 17:11, r...@open-mpi.org wrote:
    The rdma error sounds like something isn’t right with your
    machine’s Infiniband installation.

    The cross-version problem sounds like you installed both OMPI
    versions into the same location - did you do that?? If so, then
    that might be the root cause of both problems. You need to
    install them in totally different locations. Then you need to
    _prefix_ your PATH and LD_LIBRARY_PATH with the location of the
    version you want to use.
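
    For example, to select the 2.0.0 install (assuming it lives under
    /opt/openmpi-2.0.0, analogous to the 1.8.1 prefix in your trace):

    export PATH=/opt/openmpi-2.0.0/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-2.0.0/lib:$LD_LIBRARY_PATH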

    HTH
    Ralph

    On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq
    <bioinformatica-i...@us.es> wrote:

    Dear users,

    I am totally stuck using Open MPI. I have two versions on my
    machine, 1.8.1 and 2.0.0, and neither of them works. When I use the
    *1.8.1 version* of mpirun, I get the following error:

    librdmacm: Fatal: unable to open RDMA device
    librdmacm: Fatal: unable to open RDMA device
    librdmacm: Fatal: unable to open RDMA device
    librdmacm: Fatal: unable to open RDMA device
    librdmacm: Fatal: unable to open RDMA device
    --------------------------------------------------------------------------
    Open MPI failed to open the /dev/knem device due to a local error.
    Please check with your system administrator to get the problem fixed,
    or set the btl_sm_use_knem MCA parameter to 0 to run without /dev/knem
    support.

      Local host: MYMACHINE
      Errno:      2 (No such file or directory)
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    Open MPI failed to open an OpenFabrics device.  This is an unusual
    error; the system reported the OpenFabrics device as being present,
    but then later failed to access it successfully.  This usually
    indicates either a misconfiguration or a failed OpenFabrics hardware
    device.

    All OpenFabrics support has been disabled in this MPI process; your
    job may or may not continue.

      Hostname:    MYMACHINE
      Device name: mlx4_0
      Errror (22): Invalid argument
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    [[60527,1],4]: A high-performance Open MPI point-to-point messaging module
    was unable to find any relevant network interfaces:

    Module: usNIC
      Host: MYMACHINE

    When I use the *2.0.0 version*, I get something strange; it looks
    as if openmpi-2.0.0 were loading the openmpi-1.8.1 libraries:

    A requested component was not found, or was unable to be opened.  This
    means that this component is either not installed or is unable to be
    used on your system (e.g., sometimes this means that shared libraries
    that the component requires are unable to be found/loaded).  Note that
    Open MPI stopped checking at the first component that it did not find.

    Host:      MYMACHINE
    Framework: ess
    Component: pmi
    --------------------------------------------------------------------------
    [MYMACHINE:126820] *** Process received signal ***
    [MYMACHINE:126820] Signal: Segmentation fault (11)
    [MYMACHINE:126820] Signal code: Address not mapped (1)
    [MYMACHINE:126820] Failing at address: 0x1c0
    [MYMACHINE:126820] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f39b2ec4cb0]
    [MYMACHINE:126820] [ 1] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f39b23e7430]
    [MYMACHINE:126820] [ 2] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f39b2676a57]
    [MYMACHINE:126820] [ 3] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f39b2676fb7]
    [MYMACHINE:126820] [ 4] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f39b267718f]
    [MYMACHINE:126820] [ 5] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f39b23c5f2a]
    [MYMACHINE:126820] [ 6] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f39b23c70c3]
    [MYMACHINE:126820] [ 7] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f39b23c8278]
    [MYMACHINE:126820] [ 8] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f39b23d1e6c]
    [MYMACHINE:126820] [ 9] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f39b2666e21]
    [MYMACHINE:126820] [10] /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f39b3115c92]
    [MYMACHINE:126820] [11] /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f39b31387bb]
    [MYMACHINE:126820] [12] mb[0x402024]
    [MYMACHINE:126820] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f39b2b187ed]
    [MYMACHINE:126820] [14] mb[0x402111]
    [MYMACHINE:126820] *** End of error message ***
    --------------------------------------------------------------------------
    A requested component was not found, or was unable to be opened.  This
    means that this component is either not installed or is unable to be
    used on your system (e.g., sometimes this means that shared libraries
    that the component requires are unable to be found/loaded).  Note that
    Open MPI stopped checking at the first component that it did not find.

    Host:      MYMACHINE
    Framework: ess
    Component: pmi
    --------------------------------------------------------------------------
    [MYMACHINE:126821] *** Process received signal ***
    [MYMACHINE:126821] Signal: Segmentation fault (11)
    [MYMACHINE:126821] Signal code: Address not mapped (1)
    [MYMACHINE:126821] Failing at address: 0x1c0
    [MYMACHINE:126821] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fed834bbcb0]
    [MYMACHINE:126821] [ 1] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7fed829de430]
    [MYMACHINE:126821] [ 2] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7fed82c6da57]
    [MYMACHINE:126821] [ 3] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7fed82c6dfb7]
    [MYMACHINE:126821] [ 4] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7fed82c6e18f]
    [MYMACHINE:126821] [ 5] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7fed829bcf2a]
    [MYMACHINE:126821] [ 6] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7fed829be0c3]
    [MYMACHINE:126821] [ 7] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7fed829bf278]
    [MYMACHINE:126821] [ 8] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7fed829c8e6c]
    [MYMACHINE:126821] [ 9] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7fed82c5de21]
    [MYMACHINE:126821] [10] /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7fed8370cc92]
    [MYMACHINE:126821] [11] /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7fed8372f7bb]
    [MYMACHINE:126821] [12] mb[0x402024]
    [MYMACHINE:126821] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fed8310f7ed]
    [MYMACHINE:126821] [14] mb[0x402111]
    [MYMACHINE:126821] *** End of error message ***
    --------------------------------------------------------------------------
    A requested component was not found, or was unable to be opened.  This
    means that this component is either not installed or is unable to be
    used on your system (e.g., sometimes this means that shared libraries
    that the component requires are unable to be found/loaded).  Note that
    Open MPI stopped checking at the first component that it did not find.

    Host:      MYMACHINE
    Framework: ess
    Component: pmi
    --------------------------------------------------------------------------
    [MYMACHINE:126822] *** Process received signal ***
    [MYMACHINE:126822] Signal: Segmentation fault (11)
    [MYMACHINE:126822] Signal code: Address not mapped (1)
    [MYMACHINE:126822] Failing at address: 0x1c0
    [MYMACHINE:126822] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f0174bc0cb0]
    [MYMACHINE:126822] [ 1] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f01740e3430]
    [MYMACHINE:126822] [ 2] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f0174372a57]
    [MYMACHINE:126822] [ 3] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f0174372fb7]
    [MYMACHINE:126822] [ 4] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f017437318f]
    [MYMACHINE:126822] [ 5] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f01740c1f2a]
    [MYMACHINE:126822] [ 6] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f01740c30c3]
    [MYMACHINE:126822] [ 7] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f01740c4278]
    [MYMACHINE:126822] [ 8] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f01740cde6c]
    [MYMACHINE:126822] [ 9] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f0174362e21]
    [MYMACHINE:126822] [10] /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f0174e11c92]
    [MYMACHINE:126822] [11] /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f0174e347bb]
    [MYMACHINE:126822] [12] mb[0x402024]
    [MYMACHINE:126822] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f01748147ed]
    [MYMACHINE:126822] [14] mb[0x402111]
    [MYMACHINE:126822] *** End of error message ***
    --------------------------------------------------------------------------
    A requested component was not found, or was unable to be opened.  This
    means that this component is either not installed or is unable to be
    used on your system (e.g., sometimes this means that shared libraries
    that the component requires are unable to be found/loaded).  Note that
    Open MPI stopped checking at the first component that it did not find.

    Host:      MYMACHINE
    Framework: ess
    Component: pmi
    --------------------------------------------------------------------------
    [MYMACHINE:126823] *** Process received signal ***
    [MYMACHINE:126823] Signal: Segmentation fault (11)
    [MYMACHINE:126823] Signal code: Address not mapped (1)
    [MYMACHINE:126823] Failing at address: 0x1c0
    --------------------------------------------------------------------------
    A requested component was not found, or was unable to be opened.  This
    means that this component is either not installed or is unable to be
    used on your system (e.g., sometimes this means that shared libraries
    that the component requires are unable to be found/loaded).  Note that
    Open MPI stopped checking at the first component that it did not find.

    Host:      MYMACHINE
    Framework: ess
    Component: pmi
    --------------------------------------------------------------------------
    [MYMACHINE:126824] *** Process received signal ***
    [MYMACHINE:126824] Signal: Segmentation fault (11)
    [MYMACHINE:126824] Signal code: Address not mapped (1)
    [MYMACHINE:126824] Failing at address: 0x1c0
    [MYMACHINE:126823] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fcd9cb58cb0]
    [MYMACHINE:126823] [ 1] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7fcd9c07b430]
    [MYMACHINE:126823] [ 2] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7fcd9c30aa57]
    [MYMACHINE:126823] [ 3] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7fcd9c30afb7]
    [MYMACHINE:126823] [ 4] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7fcd9c30b18f]
    [MYMACHINE:126823] [ 5] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7fcd9c059f2a]
    [MYMACHINE:126823] [ 6] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7fcd9c05b0c3]
    [MYMACHINE:126823] [ 7] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7fcd9c05c278]
    [MYMACHINE:126823] [ 8] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7fcd9c065e6c]
    [MYMACHINE:126823] [ 9] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7fcd9c2fae21]
    [MYMACHINE:126823] [10] /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7fcd9cda9c92]
    [MYMACHINE:126823] [11] /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7fcd9cdcc7bb]
    [MYMACHINE:126823] [12] mb[0x402024]
    [MYMACHINE:126823] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fcd9c7ac7ed]
    [MYMACHINE:126823] [14] mb[0x402111]
    [MYMACHINE:126823] *** End of error message ***
    [MYMACHINE:126824] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f2f0c611cb0]
    [MYMACHINE:126824] [ 1] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f2f0bb34430]
    [MYMACHINE:126824] [ 2] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f2f0bdc3a57]
    [MYMACHINE:126824] [ 3] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f2f0bdc3fb7]
    [MYMACHINE:126824] [ 4] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f2f0bdc418f]
    [MYMACHINE:126824] [ 5] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f2f0bb12f2a]
    [MYMACHINE:126824] [ 6] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f2f0bb140c3]
    [MYMACHINE:126824] [ 7] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f2f0bb15278]
    [MYMACHINE:126824] [ 8] /opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f2f0bb1ee6c]
    [MYMACHINE:126824] [ 9] /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f2f0bdb3e21]
    [MYMACHINE:126824] [10] /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f2f0c862c92]
    [MYMACHINE:126824] [11] /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f2f0c8857bb]
    [MYMACHINE:126824] [12] mb[0x402024]
    [MYMACHINE:126824] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f2f0c2657ed]
    [MYMACHINE:126824] [14] mb[0x402111]
    [MYMACHINE:126824] *** End of error message ***
    --------------------------------------------------------------------------
    mpirun noticed that process rank 2 with PID 0 on node MYMACHINE
    exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

    I am running my script with *mpirun* on a *single node of an SGE
    cluster*.
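
    In case it helps: mb is the name of my executable, so I can check
    which libmpi it actually resolves to with something like:

    ldd ./mb | grep -i mpi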

    I would be very grateful if somebody could give me some hints to
    solve this issue.

    Thanks a lot in advance
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users