Ralph,

Now I remember this part ...
IIRC, LD_LIBRARY_PATH was never forwarded when remote-starting orted.
I simply avoided this issue by using the GNU compilers, or gcc/g++/ifort if I
need Intel Fortran
/* you already mentioned this is not officially supported by Intel */

What about adding a new configure option:
--orted-rpath=...

If we configure with LDFLAGS=-Wl,-rpath,... then orted will find the Intel runtime (good), but the user binary will also use that runtime, regardless of the $LD_LIBRARY_PATH set at mpicc/mpirun time
(not so good, since the user might want to use a different runtime).
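For illustration, a rough sketch of that trade-off, reusing the Intel runtime path from Andy's setup below (abbreviated configure arguments, not a tested recipe):

$ ./configure LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ...
# Assuming the linker emits the classic DT_RPATH (no --enable-new-dtags):
# orted and the Open MPI libraries now resolve libimf.so through that embedded
# path, so no LD_LIBRARY_PATH is needed on the remote node, but the same path
# is searched ahead of $LD_LIBRARY_PATH, which is why the user's job ends up
# pinned to that runtime as well.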

Any thoughts?

Cheers,

Gilles


On 4/15/2015 12:10 PM, Ralph Castain wrote:
I think Gilles may be correct here. In reviewing the code, it appears we have never (going back to the 1.6 series, at least) forwarded the local LD_LIBRARY_PATH to the remote node when exec’ing the orted. The only thing we have done is to set the PATH and LD_LIBRARY_PATH to support the OMPI prefix - not any supporting libs.

What we have required, therefore, is that your path be set up properly in the remote .bashrc (or pick your shell) to handle the libraries.

As I indicated, the -x option only forwards envars to the application procs themselves, not the orted. I could try to add another cmd line option to forward things for the orted, but the concern we’ve had in the past (and still harbor) is that the ssh cmd line is limited in length. Thus, adding some potentially long paths to support this option could overwhelm it and cause failures.
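For example, with a command like Andy's below (hypothetical invocation, shown only to illustrate where -x applies):

$ shmemrun -x LD_LIBRARY_PATH -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
# -x exports LD_LIBRARY_PATH into the environment of the mic.out processes,
# but the orted that ssh starts on mic1 is launched first, so it still fails
# with "libimf.so: cannot open shared object file".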

I’d try the static method first, or perhaps the LDFLAGS Gilles suggested.
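The static method would amount to something like this (sketch only; -static-intel makes icc link the Intel-provided runtime libraries statically, so orted no longer depends on libimf.so at run time):

$ ./configure CC="icc -mmic" CXX="icpc -mmic" LDFLAGS="-static-intel" ...
$ make && make install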


On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet <gil...@rist.or.jp <mailto:gil...@rist.or.jp>> wrote:

Andy,

What about reconfiguring Open MPI with LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"?

IIRC, another option is: LDFLAGS="-static-intel"

Last but not least, you can always replace orted with a simple script that sets LD_LIBRARY_PATH and execs the original orted.
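A minimal sketch of such a wrapper (the orted.real name is an assumption; the library path is the one from Andy's MIC environment):

#!/bin/sh
# Rename the real daemon to orted.real and install this script in its place,
# e.g. as /home/ariebs/mic/mpi-nightly/bin/orted.
LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
exec "$(dirname "$0")/orted.real" "$@"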

Do you see the same behaviour on non-MIC hardware when Open MPI is compiled with the Intel compilers? If it works on non-MIC hardware, the root cause could be that the sshd_config on the MIC does not
accept a forwarded LD_LIBRARY_PATH.
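One quick way to test that hypothesis (assuming stock OpenSSH config files; this only checks the theory, it is not a confirmed fix):

$ ssh mic1 grep -i AcceptEnv /etc/ssh/sshd_config
# Forwarding an environment variable over ssh needs "AcceptEnv LD_LIBRARY_PATH"
# in the MIC's sshd_config plus a matching "SendEnv LD_LIBRARY_PATH" on the
# client side; if either is missing, the variable is silently dropped.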

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:
Hmmm…certainly looks that way. I’ll investigate.

On Apr 14, 2015, at 6:06 AM, Andy Riebs <andy.ri...@hp.com <mailto:andy.ri...@hp.com>> wrote:

Hi Ralph,

Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting propagated?

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
        linux-vdso.so.1 => (0x00007fffa1d3b000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002ab6ce464000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002ab6ce7d3000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ab6cebbd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ab6ceded000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ab6ceff1000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ab6cf1f9000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab6cf3fc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab6cf60f000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ab6cf82c000)
        libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002ab6cfb84000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002ab6cffd6000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002ab6d086f000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002ab6d0a82000)
        /lib64/ld-linux-k1om.so.2 (0x00002ab6ce243000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 --leave-session-attached --mca mca_component_show_load_errors 1 $PWD/mic.out
--------------------------------------------------------------------------
A deprecated MCA variable value was specified in the environment or
on the command line.  Deprecated MCA variables should be avoided;
they may disappear in future releases.

  Deprecated variable: mca_component_show_load_errors
  New variable: mca_base_component_show_load_errors
--------------------------------------------------------------------------
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
[atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [isolated]
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [slurm]
[atl1-02-mic0:16183] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-02-mic0:16183] mca:base:select:(  plm) Selected component [rsh]
[atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 nodename hash 4238360777
[atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
[atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
[atl1-02-mic0:16183] [[33630,0],0] using dash_host
[atl1-02-mic0:16183] [[33630,0],0] checking node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new daemon [[33630,0],1] to node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote shell as local shell
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a child of mine
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to launch list
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
[atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127
[atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm


On 04/13/2015 07:47 PM, Ralph Castain wrote:
Weird. I’m not sure what to try at that point - IIRC, building static won’t resolve this problem (but you could try and see). You could add the following to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see where it is looking for libimf, since it (and not mic.out) is the one complaining.
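For instance (the readelf check is an extra suggestion, not something already in this thread; run it wherever a readelf that can read the k1om binaries is available):

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep libimf
$ readelf -d /home/ariebs/mic/mpi-nightly/bin/orted | grep -i -E 'rpath|runpath'
# If neither RPATH nor RUNPATH points at the Intel lib directory, orted can
# only find libimf.so through whatever LD_LIBRARY_PATH is set on the node
# where it starts.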


On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com <mailto:andy.ri...@hp.com>> wrote:

Ralph and Nathan,

The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I've demonstrated that the problem library is where it belongs on the remote system:

$ ldd mic.out
        linux-vdso.so.1 => (0x00007fffb83ff000)
        liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x00002b059cfbb000)
        libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x00002b059d35a000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002b059d7e3000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002b059db53000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
        libimf.so => */opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so* (0x00002b059ef04000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002b059f356000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002b059fbef000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002b059fe02000)
        /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)
$ echo $LD_LIBRARY_PATH
*/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic*:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ ssh mic1 file */opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so*
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: *error while loading shared libraries: libimf.so*: cannot open shared object file: No such file or directory
...


On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.

That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
    Progress!  I can run my trivial program on the local PHI, but not the
    other PHI, on the system. Here are the interesting parts:

    A pretty good recipe with last night's nightly master:

    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
    CXX="icpc -mmic" \
        --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
         AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib
    LD=x86_64-k1om-linux-ld \
         --enable-mpirun-prefix-by-default --disable-io-romio
    --disable-mpi-fortran \
         --enable-orterun-prefix-by-default \
         --enable-debug
    $ make && make install
    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
    yoda --mca btl sm,self,tcp $PWD/mic.out
    Hello World from process 0 of 2
    Hello World from process 1 of 2
    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
    yoda --mca btl openib,sm,self $PWD/mic.out
    Hello World from process 0 of 2
    Hello World from process 1 of 2
    $

    However, I can't seem to cross the fabric. I can ssh freely back and forth
    between mic0 and mic1. However, running the next 2 tests from mic0, it
    certainly seems like the second one should work, too:

    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda
    --mca btl sm,self,tcp $PWD/mic.out
    Hello World from process 0 of 2
    Hello World from process 1 of 2
    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda
    --mca btl sm,self,tcp $PWD/mic.out
    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
    libraries: libimf.so: cannot open shared object file: No such file or
    directory
    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
    (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to
    use.

    *  compilation of the orted with dynamic libraries when static are
    required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
     ...
    $

    (Note that I get the same results with "--mca btl openib,sm,self"....)

    $ ssh mic1 file
    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF
    64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1
    (SYSV), dynamically linked, not stripped
    $ shmemrun -x
    LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
    libraries: libimf.so: cannot open shared object file: No such file or
    directory
    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
    (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to
    use.

    *  compilation of the orted with dynamic libraries when static are
    required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).

    Following here is
    - IB information
    - Running the failing case with lots of debugging information. (As you
    might imagine, I've tried 17 ways from Sunday to try to ensure that
    libimf.so is found.)

    $ ibv_devices
        device                 node GUID
        ------              ----------------
        mlx4_0              24be05ffffa57160
        scif0               4c79bafffe4402b6
    $ ibv_devinfo
    hca_id: mlx4_0
            transport:                      InfiniBand (0)
            fw_ver:                         2.11.1250
            node_guid:                      24be:05ff:ffa5:7160
            sys_image_guid:                 24be:05ff:ffa5:7163
            vendor_id:                      0x02c9
            vendor_part_id:                 4099
            hw_ver:                         0x0
            phys_port_cnt:                  2
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                2048 (4)
                            active_mtu:             2048 (4)
                            sm_lid:                 8
                            port_lid:               86
                            port_lmc:               0x00
                            link_layer:             InfiniBand

                    port:   2
                            state:                  PORT_DOWN (1)
                            max_mtu:                2048 (4)
                            active_mtu:             2048 (4)
                            sm_lid:                 0
                            port_lid:               0
                            port_lmc:               0x00
                            link_layer:             InfiniBand

    hca_id: scif0
            transport:                      SCIF (2)
            fw_ver:                         0.0.1
            node_guid:                      4c79:baff:fe44:02b6
            sys_image_guid:                 4c79:baff:fe44:02b6
            vendor_id:                      0x8086
            vendor_part_id:                 0
            hw_ver:                         0x1
            phys_port_cnt:                  1
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                4096 (5)
                            active_mtu:             4096 (5)
                            sm_lid:                 1
                            port_lid:               1001
                            port_lmc:               0x00
                            link_layer:             SCIF

    $ shmemrun -x
    LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose
    5 --mca memheap_base_verbose 100 $PWD/mic.out
    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [rsh]
    [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
    rsh path NULL
    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component [rsh] set
    priority to 10
    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component
    [isolated]
    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component
    [isolated] set priority to 0
    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [slurm]
    [atl1-01-mic0:191024] mca:base:select:(  plm) Skipping component [slurm].
    Query failed to return a module
    [atl1-01-mic0:191024] mca:base:select:(  plm) Selected component [rsh]
    [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename
    hash 4121194178
    [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path
    NULL
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
    [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
    [atl1-01-mic0:191024] [[29012,0],0] using dash_host
    [atl1-01-mic0:191024] [[29012,0],0] checking node mic1
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon
    [[29012,0],1]
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon
    [[29012,0],1] to node mic1
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as
    local shell
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
            /usr/bin/ssh <template>
    PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
    LD_LIBRARY_PATH ;
    DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
    orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca
    orte_ess_num_procs "2" -mca orte_hnp_uri
    
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
    rmaps_ppr_n_pernode "2"
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of
    mine
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch
    list
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon
    [[29012,0],1]
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh)
    [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ;
    export PATH ;
    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
    LD_LIBRARY_PATH ;
    DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
    orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs
    "2" -mca orte_hnp_uri
    
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
    rmaps_ppr_n_pernode "2"]
    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
    libraries: libimf.so: cannot open shared object file: No such file or
    directory
    [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit
    commands
    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
    (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to
    use.

    *  compilation of the orted with dynamic libraries when static are
    required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
    --------------------------------------------------------------------------
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm

    On 04/13/2015 08:50 AM, Andy Riebs wrote:

      Hi Ralph,

      Here are the results with last night's "master" nightly,
      openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose
      option (yes, it looks like the "ERROR_LOG" problem has gone away):

      $ cat /proc/sys/kernel/shmmax
      33554432
      $ cat /proc/sys/kernel/shmall
      2097152
      $ cat /proc/sys/kernel/shmmni
      4096
      $ export SHMEM_SYMMETRIC_HEAP=1M
      $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5
      --mca memheap_base_verbose 100 $PWD/mic.out
      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [rsh]
      [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
      rsh path NULL
      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component [rsh]
      set priority to 10
      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
      [isolated]
      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
      [isolated] set priority to 0
      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [slurm]
      [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
      [slurm]. Query failed to return a module
      [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component [rsh]
      [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
      nodename hash 4121194178
      [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
      [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh
      path NULL
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
      [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
      allocation
      [atl1-01-mic0:190439] [[31875,0],0] using dash_host
      [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
      [atl1-01-mic0:190439] [[31875,0],0] ignoring myself
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
      allocation
      [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
      [31875,1]
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for
      job [31875,1]
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not
      a dynamic spawn
      [atl1-01-mic0:190441] mca: base: components_register: registering
      memheap components
      [atl1-01-mic0:190441] mca: base: components_register: found loaded
      component buddy
      [atl1-01-mic0:190441] mca: base: components_register: component buddy
      has no register or open function
      [atl1-01-mic0:190442] mca: base: components_register: registering
      memheap components
      [atl1-01-mic0:190442] mca: base: components_register: found loaded
      component buddy
      [atl1-01-mic0:190442] mca: base: components_register: component buddy
      has no register or open function
      [atl1-01-mic0:190442] mca: base: components_register: found loaded
      component ptmalloc
      [atl1-01-mic0:190442] mca: base: components_register: component ptmalloc
      has no register or open function
      [atl1-01-mic0:190441] mca: base: components_register: found loaded
      component ptmalloc
      [atl1-01-mic0:190441] mca: base: components_register: component ptmalloc
      has no register or open function
      [atl1-01-mic0:190441] mca: base: components_open: opening memheap
      components
      [atl1-01-mic0:190441] mca: base: components_open: found loaded component
      buddy
      [atl1-01-mic0:190441] mca: base: components_open: component buddy open
      function successful
      [atl1-01-mic0:190441] mca: base: components_open: found loaded component
      ptmalloc
      [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc
      open function successful
      [atl1-01-mic0:190442] mca: base: components_open: opening memheap
      components
      [atl1-01-mic0:190442] mca: base: components_open: found loaded component
      buddy
      [atl1-01-mic0:190442] mca: base: components_open: component buddy open
      function successful
      [atl1-01-mic0:190442] mca: base: components_open: found loaded component
      ptmalloc
      [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc
      open function successful
      [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
      segments by method: 1
      [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
      segments by method: 1
      [atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments()
      add: 00600000-00601000 rw-p 00000000 00:11
      6029314                            /home/ariebs/bench/hello/mic.out
      [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments()
      add: 00600000-00601000 rw-p 00000000 00:11
      6029314                            /home/ariebs/bench/hello/mic.out
      [atl1-01-mic0:190442] base/memheap_base_static.c:75 -
      mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
      segments
      [atl1-01-mic0:190442] base/memheap_base_register.c:39 -
      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
      270532608 bytes type=0x1 id=0xFFFFFFFF
      [atl1-01-mic0:190441] base/memheap_base_static.c:75 -
      mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
      segments
      [atl1-01-mic0:190441] base/memheap_base_register.c:39 -
      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
      270532608 bytes type=0x1 id=0xFFFFFFFF
      [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
      _reg_segment() Failed to register segment
      [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
      _reg_segment() Failed to register segment
      [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
      failed to initialize - aborting
      [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
      failed to initialize - aborting
      --------------------------------------------------------------------------
      It looks like SHMEM_INIT failed for some reason; your parallel process
      is
      likely to abort.  There are many reasons that a parallel process can
      fail during SHMEM_INIT; some of which are due to configuration or
      environment
      problems.  This failure appears to be an internal failure; here's some
      additional information (which may only be relevant to an Open SHMEM
      developer):

        mca_memheap_base_select() failed
        --> Returned "Error" (-1) instead of "Success" (0)
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with
      errorcode -1.
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A SHMEM process is aborting at a time when it cannot guarantee that all
      of its peer processes in the job will be killed properly.  You should
      double check that everything has shut down cleanly.

      Local host: atl1-01-mic0
      PID:        190441
      --------------------------------------------------------------------------
      -------------------------------------------------------
      Primary job  terminated normally, but 1 process returned
      a non-zero exit code.. Per user-direction, the job has been aborted.
      -------------------------------------------------------
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
      orted_exit commands
      --------------------------------------------------------------------------
      shmemrun detected that one or more processes exited with non-zero
      status, thus causing
      the job to be terminated. The first process to do so was:

        Process name: [[31875,1],0]
        Exit code:    255
      --------------------------------------------------------------------------
      [atl1-01-mic0:190439] 1 more process has sent help message
      help-shmem-runtime.txt / shmem_init:startup:internal-failure
      [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0
      to see all help / error messages
      [atl1-01-mic0:190439] 1 more process has sent help message
      help-shmem-api.txt / shmem-abort
      [atl1-01-mic0:190439] 1 more process has sent help message
      help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm

      On 04/12/2015 03:09 PM, Ralph Castain wrote:

        Sorry about that - I hadn't brought it over to the 1.8 branch yet.
        I've done so now, which means the ERROR_LOG shouldn't show up any
        more. It won't fix the memheap problem, though.
        You might try adding "--mca memheap_base_verbose 100" to your cmd line
        so we can see why none of the memheap components are being selected.

          On Apr 12, 2015, at 11:30 AM, Andy Riebs<andy.ri...@hp.com>  wrote:
          Hi Ralph,

          Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

          $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
          plm_base_verbose 5 $PWD/mic.out
          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
          [rsh]
          [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent
          ssh : rsh path NULL
          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
          [rsh] set priority to 10
          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
          [isolated]
          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
          [isolated] set priority to 0
          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
          [slurm]
          [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component
          [slurm]. Query failed to return a module
          [atl1-01-mic0:190189] mca:base:select:(  plm) Selected component
          [rsh]
          [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189
          nodename hash 4121194178
          [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
          [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh
          path NULL
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
          [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
          allocation
          [atl1-01-mic0:190189] [[32137,0],0] using dash_host
          [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
          [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in
          allocation
          [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
          [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in
          file base/plm_base_launch_support.c at line 440
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job
          [32137,1]
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof
          for job [32137,1]
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
          registered
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is
          not a dynamic spawn
          
--------------------------------------------------------------------------
          It looks like SHMEM_INIT failed for some reason; your parallel
          process is
          likely to abort.  There are many reasons that a parallel process can
          fail during SHMEM_INIT; some of which are due to configuration or
          environment
          problems.  This failure appears to be an internal failure; here's
          some
          additional information (which may only be relevant to an Open SHMEM
          developer):

            mca_memheap_base_select() failed
            --> Returned "Error" (-1) instead of "Success" (0)
          
--------------------------------------------------------------------------
          [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM
          failed to initialize - aborting
          [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM
          failed to initialize - aborting
          
--------------------------------------------------------------------------
          SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
          with errorcode -1.
          
--------------------------------------------------------------------------
          
--------------------------------------------------------------------------
          A SHMEM process is aborting at a time when it cannot guarantee that
          all
          of its peer processes in the job will be killed properly.  You
          should
          double check that everything has shut down cleanly.

          Local host: atl1-01-mic0
          PID:        190192
          
--------------------------------------------------------------------------
          -------------------------------------------------------
          Primary job  terminated normally, but 1 process returned
          a non-zero exit code.. Per user-direction, the job has been aborted.
          -------------------------------------------------------
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
          orted_exit commands
          
--------------------------------------------------------------------------
          shmemrun detected that one or more processes exited with non-zero
          status, thus causing
          the job to be terminated. The first process to do so was:

            Process name: [[32137,1],0]
            Exit code:    255
          
--------------------------------------------------------------------------
          [atl1-01-mic0:190189] 1 more process has sent help message
          help-shmem-runtime.txt / shmem_init:startup:internal-failure
          [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate"
          to 0 to see all help / error messages
          [atl1-01-mic0:190189] 1 more process has sent help message
          help-shmem-api.txt / shmem-abort
          [atl1-01-mic0:190189] 1 more process has sent help message
          help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
          killed
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm

          On 04/11/2015 07:41 PM, Ralph Castain wrote:

            Got it - thanks. I fixed that ERROR_LOG issue (I think- please
            verify). I suspect the memheap issue relates to something else,
            but I probably need to let the OSHMEM folks comment on it

              On Apr 11, 2015, at 9:52 AM, Andy Riebs<andy.ri...@hp.com>
              wrote:
              Everything is built on the Xeon side, with the icc "-mmic"
              switch. I then ssh into one of the PHIs, and run shmemrun from
              there.

              On 04/11/2015 12:00 PM, Ralph Castain wrote:

                Let me try to understand the setup a little better. Are you
                running shmemrun on the PHI itself? Or is it running on the
                host processor, and you are trying to spawn a process onto the
                Phi?

                  On Apr 11, 2015, at 7:55 AM, Andy Riebs<andy.ri...@hp.com>
                  wrote:
                  Hi Ralph,

                  Yes, this is attempting to get OSHMEM to run on the Phi.

                  I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured
                  it with

                  $ ./configure --prefix=/home/ariebs/mic/mpi-nightly
                   CC="icc -mmic" CXX="icpc -mmic"    \
                      --build=x86_64-unknown-linux-gnu
                  --host=x86_64-k1om-linux    \
                       AR=x86_64-k1om-linux-ar
                  RANLIB=x86_64-k1om-linux-ranlib  LD=x86_64-k1om-linux-ld   \
                       --enable-mpirun-prefix-by-default
                  --disable-io-romio     --disable-mpi-fortran    \
                       --enable-debug
                  --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

                  (Note that I had to add "oob-ud" to the
                  "--enable-mca-no-build" option, as the build complained that
                  mca oob/ud needed mca common-verbs.)

                  With that configuration, here is what I am seeing now...

                  $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
                  $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
                  plm_base_verbose 5 $PWD/mic.out
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                  component [rsh]
                  [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on
                  agent ssh : rsh path NULL
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
                  component [rsh] set priority to 10
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                  component [isolated]
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
                  component [isolated] set priority to 0
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                  component [slurm]
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
                  component [slurm]. Query failed to return a module
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Selected
                  component [rsh]
                  [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias
                  189895 nodename hash 4121194178
                  [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam
                  32419
                  [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent
                  ssh : rsh path NULL
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start
                  comm
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
                  creating map
                  [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working
                  unmanaged allocation
                  [atl1-01-mic0:189895] [[32419,0],0] using dash_host
                  [atl1-01-mic0:189895] [[32419,0],0] checking node
                  atl1-01-mic0
                  [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only
                  HNP in allocation
                  [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job
                  [32419,1]
                  [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not
                  found in file base/plm_base_launch_support.c at line 440
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for
                  job [32419,1]
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring
                  up iof for job [32419,1]
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
                  [32419,1] registered
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job
                  [32419,1] is not a dynamic spawn
                  [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init()
                  SHMEM failed to initialize - aborting
                  [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init()
                  SHMEM failed to initialize - aborting
                  
--------------------------------------------------------------------------
                  It looks like SHMEM_INIT failed for some reason; your
                  parallel process is
                  likely to abort.  There are many reasons that a parallel
                  process can
                  fail during SHMEM_INIT; some of which are due to
                  configuration or environment
                  problems.  This failure appears to be an internal failure;
                  here's some
                  additional information (which may only be relevant to an
                  Open SHMEM
                  developer):

                    mca_memheap_base_select() failed
                    --> Returned "Error" (-1) instead of "Success" (0)
                  
--------------------------------------------------------------------------
                  
--------------------------------------------------------------------------
                  SHMEM_ABORT was invoked on rank 1 (pid 189899,
                  host=atl1-01-mic0) with errorcode -1.
                  
--------------------------------------------------------------------------
                  
--------------------------------------------------------------------------
                  A SHMEM process is aborting at a time when it cannot
                  guarantee that all
                  of its peer processes in the job will be killed properly.
                  You should
                  double check that everything has shut down cleanly.

                  Local host: atl1-01-mic0
                  PID:        189899
                  
--------------------------------------------------------------------------
                  -------------------------------------------------------
                  Primary job  terminated normally, but 1 process returned
                  a non-zero exit code.. Per user-direction, the job has been
                  aborted.
                  -------------------------------------------------------
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd
                  sending orted_exit commands
                  
--------------------------------------------------------------------------
                  shmemrun detected that one or more processes exited with
                  non-zero status, thus causing
                  the job to be terminated. The first process to do so was:

                    Process name: [[32419,1],1]
                    Exit code:    255
                  
--------------------------------------------------------------------------
                  [atl1-01-mic0:189895] 1 more process has sent help message
                  help-shmem-runtime.txt / shmem_init:startup:internal-failure
                  [atl1-01-mic0:189895] Set MCA parameter
                  "orte_base_help_aggregate" to 0 to see all help / error
                  messages
                  [atl1-01-mic0:189895] 1 more process has sent help message
                  help-shmem-api.txt / shmem-abort
                  [atl1-01-mic0:189895] 1 more process has sent help message
                  help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee
                  all killed
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop
                  comm

                  On 04/10/2015 06:37 PM, Ralph Castain wrote:

                    Andy - could you please try the current 1.8.5 nightly
                    tarball and see if it helps? The error log indicates that
                     it is failing to get the topology from some daemon, I'm
                     assuming the one on the Phi?
                     You might also add --enable-debug to that configure line
                    and then put -mca plm_base_verbose on the shmemrun cmd to
                    get more help

                      On Apr 10, 2015, at 11:55 AM, Andy Riebs
                      <andy.ri...@hp.com>  wrote:
                      Summary: MPI jobs work fine, SHMEM jobs work just often
                      enough to be tantalizing, on an Intel Xeon Phi/MIC
                      system.

                      Longer version

                      Thanks to the excellent write-up last June
                      
(<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>),
                      I have been able to build a version of Open MPI for the
                      Xeon Phi coprocessor that runs MPI jobs on the Phi
                      coprocessor with no problem, but not SHMEM jobs.  Just
                      at the point where I was about to document the problems
                      I was having with SHMEM, my trivial SHMEM job worked.
                      And then failed when I tried to run it again,
                      immediately afterwards. I have a feeling I may be in
                      uncharted  territory here.

                      Environment
                        * RHEL 6.5
                        * Intel Composer XE 2015
                        * Xeon Phi/MIC
                      ----------------

                      Configuration

                      $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
                      $ source
                      /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
                      intel64
                      $ ./configure --prefix=/home/ariebs/mic/mpi \
                         CC="icc -mmic" CXX="icpc -mmic" \
                         --build=x86_64-unknown-linux-gnu
                      --host=x86_64-k1om-linux \
                          AR=x86_64-k1om-linux-ar
                      RANLIB=x86_64-k1om-linux-ranlib \
                          LD=x86_64-k1om-linux-ld \
                          --enable-mpirun-prefix-by-default --disable-io-romio
                      \
                          --disable-vt --disable-mpi-fortran \
--enable-mca-no-build=btl-usnic,btl-openib,common-verbs
                      $ make
                      $ make install

                      ----------------

                      Test program

                      #include <stdio.h>
                      #include <stdlib.h>
                      #include <shmem.h>
                      int main(int argc, char **argv)
                      {
                              int me, num_pe;
                              shmem_init();
                              num_pe = num_pes();
                              me = my_pe();
                              printf("Hello World from process %ld of %ld\n",
                      me, num_pe);
                              exit(0);
                      }

                      ----------------

                      Building the program

                      export PATH=/home/ariebs/mic/mpi/bin:$PATH
                      export PATH=/usr/linux-k1om-4.7/bin/:$PATH
                      source
                      /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
                      intel64
                      export
                      
LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

                      icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include
                      -pthread \
                              -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib
                      -Wl,--enable-new-dtags \
                              -L/home/ariebs/mic/mpi/lib -loshmem -lmpi
                      -lopen-rte -lopen-pal \
                              -lm -ldl -lutil \
                              -Wl,-rpath
                      
-Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic
                      \
-L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic
                      \
                              -o mic.out  shmem_hello.c

                      ----------------

                      Running the program

                      (Note that the program had been consistently failing.
                      Then, when I logged back into the system to capture the
                      results, it worked once,  and then immediately failed
                      when I tried again, as shown below. Logging in and out
                      isn't sufficient to correct the problem. Overall, I
                      think I had 3 successful runs in 30-40 attempts.)

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--------------------------------------------------------------------------
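
If it would help, I can try to capture where that "Not found" error is coming from by re-running with the launcher verbosity turned up; a sketch of the command I'd use (plm_base_verbose is the usual ORTE MCA verbosity knob; the level here is arbitrary):

# Sketch: same failing command with extra launch-time diagnostics.
$ shmemrun -H localhost -N 2 --mca sshmem mmap \
      --mca plm_base_verbose 5 ./mic.out 2>&1 | tee shmemrun-debug.log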

Any thoughts about where to go from here?

Andy

  --
  Andy Riebs
  Hewlett-Packard Company
  High Performance Computing
  +1 404 648 9024
  My opinions are not necessarily those of HP

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26732.php
