Weird. I’m not sure what to try at that point - IIRC, building static won’t 
resolve this problem (but you could try and see). You could add the following 
to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also run ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see 
where it is looking for libimf, since it (and not mic.out) is the one complaining.
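
For example, reusing your failing invocation (just a sketch):

$ shmemrun -H mic1 -N 2 --mca btl scif,self \
    --leave-session-attached --mca mca_component_show_load_errors 1 \
    $PWD/mic.out

and, for the ldd check, run it where orted actually executes, e.g.:

$ ssh mic1 ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep libimf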


> On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com> wrote:
> 
> Ralph and Nathan,
> 
> The problem may be something trivial, as I don't typically use "shmemrun" to 
> start jobs. With the following, I *think* I've demonstrated that the problem 
> library is where it belongs on the remote system:
> 
> $ ldd mic.out
>         linux-vdso.so.1 =>  (0x00007fffb83ff000)
>         liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 
> (0x00002b059cfbb000)
>         libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 
> (0x00002b059d35a000)
>         libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 
> (0x00002b059d7e3000)
>         libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 
> (0x00002b059db53000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
>         librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
>         libimf.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
> (0x00002b059ef04000)
>         libsvml.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
> (0x00002b059f356000)
>         libirng.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
> (0x00002b059fbef000)
>         libintlc.so.5 => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
> (0x00002b059fe02000)
>         /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)
> $ echo $LD_LIBRARY_PATH 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
> $ ssh mic1 file 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit 
> LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), 
> dynamically linked, not stripped
> $ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> ...
> 
> 
> On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
>> For talking between PHIs on the same system I recommend using the scif
>> BTL NOT tcp.
>> 
>> That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
>> system. It looks like it can't find the intel compiler libraries.
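>> 
>> A quick check (a sketch; a non-interactive ssh session often gets a
>> different environment than a login shell):
>> 
>>   $ echo $LD_LIBRARY_PATH              # interactive shell on mic0
>>   $ ssh mic1 'echo $LD_LIBRARY_PATH'   # what a remotely spawned orted sees
>> 
>> If the Intel compiler lib directory shows up in the first but not the
>> second, that would explain orted failing to load libimf.so.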
>> 
>> -Nathan Hjelm
>> HPC-5, LANL
>> 
>> On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
>>>    Progress!  I can run my trivial program on the local PHI, but not the
>>>    other PHI, on the system. Here are the interesting parts:
>>> 
>>>    A pretty good recipe with last night's nightly master:
>>> 
>>>    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
>>>        CC="icc -mmic" CXX="icpc -mmic" \
>>>        --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>        AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>        LD=x86_64-k1om-linux-ld \
>>>        --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
>>>        --enable-orterun-prefix-by-default \
>>>        --enable-debug
>>>    $ make && make install
>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
>>>    yoda --mca btl sm,self,tcp $PWD/mic.out
>>>    Hello World from process 0 of 2
>>>    Hello World from process 1 of 2
>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
>>>    yoda --mca btl openib,sm,self $PWD/mic.out
>>>    Hello World from process 0 of 2
>>>    Hello World from process 1 of 2
>>>    $
>>> 
>>>    However, I can't seem to cross the fabric. I can ssh freely back and forth
>>>    between mic0 and mic1. Running the next two tests from mic0, it certainly
>>>    seems like the second one should work, too:
>>> 
>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda
>>>    --mca btl sm,self,tcp $PWD/mic.out
>>>    Hello World from process 0 of 2
>>>    Hello World from process 1 of 2
>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda
>>>    --mca btl sm,self,tcp $PWD/mic.out
>>>    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>>>    libraries: libimf.so: cannot open shared object file: No such file or
>>>    directory
>>>    
>>> --------------------------------------------------------------------------
>>>    ORTE was unable to reliably start one or more daemons.
>>>    This usually is caused by:
>>> 
>>>    * not finding the required libraries and/or binaries on
>>>      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>      settings, or configure OMPI with --enable-orterun-prefix-by-default
>>> 
>>>    * lack of authority to execute on one or more specified nodes.
>>>      Please verify your allocation and authorities.
>>> 
>>>    * the inability to write startup files into /tmp
>>>    (--tmpdir/orte_tmpdir_base).
>>>      Please check with your sys admin to determine the correct location to
>>>    use.
>>> 
>>>    *  compilation of the orted with dynamic libraries when static are
>>>    required
>>>      (e.g., on Cray). Please check your configure cmd line and consider 
>>> using
>>>      one of the contrib/platform definitions for your system type.
>>> 
>>>    * an inability to create a connection back to mpirun due to a
>>>      lack of common network interfaces and/or no route found between
>>>      them. Please check network connectivity (including firewalls
>>>      and network routing requirements).
>>>     ...
>>>    $
>>> 
>>>    (Note that I get the same results with "--mca btl openib,sm,self"....)
>>> 
>>>    $ ssh mic1 file
>>>    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF
>>>    64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1
>>>    (SYSV), dynamically linked, not stripped
>>>    $ shmemrun -x
>>>    
>>> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
>>>    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>>>    libraries: libimf.so: cannot open shared object file: No such file or
>>>    directory
>>>    
>>> --------------------------------------------------------------------------
>>>    ORTE was unable to reliably start one or more daemons.
>>>    This usually is caused by:
>>> 
>>>    * not finding the required libraries and/or binaries on
>>>      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>      settings, or configure OMPI with --enable-orterun-prefix-by-default
>>> 
>>>    * lack of authority to execute on one or more specified nodes.
>>>      Please verify your allocation and authorities.
>>> 
>>>    * the inability to write startup files into /tmp
>>>    (--tmpdir/orte_tmpdir_base).
>>>      Please check with your sys admin to determine the correct location to
>>>    use.
>>> 
>>>    *  compilation of the orted with dynamic libraries when static are
>>>    required
>>>      (e.g., on Cray). Please check your configure cmd line and consider 
>>> using
>>>      one of the contrib/platform definitions for your system type.
>>> 
>>>    * an inability to create a connection back to mpirun due to a
>>>      lack of common network interfaces and/or no route found between
>>>      them. Please check network connectivity (including firewalls
>>>      and network routing requirements).
>>> 
>>>    Following here is
>>>    - IB information
>>>    - Running the failing case with lots of debugging information. (As you
>>>    might imagine, I've tried 17 ways from Sunday to try to ensure that
>>>    libimf.so is found.)
>>> 
>>>    $ ibv_devices
>>>        device                 node GUID
>>>        ------              ----------------
>>>        mlx4_0              24be05ffffa57160
>>>        scif0               4c79bafffe4402b6
>>>    $ ibv_devinfo
>>>    hca_id: mlx4_0
>>>            transport:                      InfiniBand (0)
>>>            fw_ver:                         2.11.1250
>>>            node_guid:                      24be:05ff:ffa5:7160
>>>            sys_image_guid:                 24be:05ff:ffa5:7163
>>>            vendor_id:                      0x02c9
>>>            vendor_part_id:                 4099
>>>            hw_ver:                         0x0
>>>            phys_port_cnt:                  2
>>>                    port:   1
>>>                            state:                  PORT_ACTIVE (4)
>>>                            max_mtu:                2048 (4)
>>>                            active_mtu:             2048 (4)
>>>                            sm_lid:                 8
>>>                            port_lid:               86
>>>                            port_lmc:               0x00
>>>                            link_layer:             InfiniBand
>>> 
>>>                    port:   2
>>>                            state:                  PORT_DOWN (1)
>>>                            max_mtu:                2048 (4)
>>>                            active_mtu:             2048 (4)
>>>                            sm_lid:                 0
>>>                            port_lid:               0
>>>                            port_lmc:               0x00
>>>                            link_layer:             InfiniBand
>>> 
>>>    hca_id: scif0
>>>            transport:                      SCIF (2)
>>>            fw_ver:                         0.0.1
>>>            node_guid:                      4c79:baff:fe44:02b6
>>>            sys_image_guid:                 4c79:baff:fe44:02b6
>>>            vendor_id:                      0x8086
>>>            vendor_part_id:                 0
>>>            hw_ver:                         0x1
>>>            phys_port_cnt:                  1
>>>                    port:   1
>>>                            state:                  PORT_ACTIVE (4)
>>>                            max_mtu:                4096 (5)
>>>                            active_mtu:             4096 (5)
>>>                            sm_lid:                 1
>>>                            port_lid:               1001
>>>                            port_lmc:               0x00
>>>                            link_layer:             SCIF
>>> 
>>>    $ shmemrun -x
>>>    
>>> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose
>>>    5 --mca memheap_base_verbose 100 $PWD/mic.out
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [rsh]
>>>    [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
>>>    rsh path NULL
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component [rsh] 
>>> set
>>>    priority to 10
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component
>>>    [isolated]
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component
>>>    [isolated] set priority to 0
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [slurm]
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Skipping component [slurm].
>>>    Query failed to return a module
>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Selected component [rsh]
>>>    [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename
>>>    hash 4121194178
>>>    [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path
>>>    NULL
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
>>>    [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged 
>>> allocation
>>>    [atl1-01-mic0:191024] [[29012,0],0] using dash_host
>>>    [atl1-01-mic0:191024] [[29012,0],0] checking node mic1
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon
>>>    [[29012,0],1]
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new 
>>> daemon
>>>    [[29012,0],1] to node mic1
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell 
>>> as
>>>    local shell
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
>>>            /usr/bin/ssh <template>    
>>>    PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
>>>    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
>>> export
>>>    LD_LIBRARY_PATH ;
>>>    DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
>>>    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
>>>    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
>>>    orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca
>>>    orte_ess_num_procs "2" -mca orte_hnp_uri
>>>    
>>> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
>>>    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
>>>    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
>>>    rmaps_ppr_n_pernode "2"
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child 
>>> of
>>>    mine
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch
>>>    list
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon
>>>    [[29012,0],1]
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh)
>>>    [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ;
>>>    export PATH ;
>>>    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
>>> export
>>>    LD_LIBRARY_PATH ;
>>>    DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
>>>    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
>>>    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
>>>    orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>>>    "2" -mca orte_hnp_uri
>>>    
>>> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
>>>    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
>>>    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
>>>    rmaps_ppr_n_pernode "2"]
>>>    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>>>    libraries: libimf.so: cannot open shared object file: No such file or
>>>    directory
>>>    [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit
>>>    commands
>>>    
>>> --------------------------------------------------------------------------
>>>    ORTE was unable to reliably start one or more daemons.
>>>    This usually is caused by:
>>> 
>>>    * not finding the required libraries and/or binaries on
>>>      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>      settings, or configure OMPI with --enable-orterun-prefix-by-default
>>> 
>>>    * lack of authority to execute on one or more specified nodes.
>>>      Please verify your allocation and authorities.
>>> 
>>>    * the inability to write startup files into /tmp
>>>    (--tmpdir/orte_tmpdir_base).
>>>      Please check with your sys admin to determine the correct location to
>>>    use.
>>> 
>>>    *  compilation of the orted with dynamic libraries when static are
>>>    required
>>>      (e.g., on Cray). Please check your configure cmd line and consider 
>>> using
>>>      one of the contrib/platform definitions for your system type.
>>> 
>>>    * an inability to create a connection back to mpirun due to a
>>>      lack of common network interfaces and/or no route found between
>>>      them. Please check network connectivity (including firewalls
>>>      and network routing requirements).
>>>    
>>> --------------------------------------------------------------------------
>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm
>>> 
>>>    On 04/13/2015 08:50 AM, Andy Riebs wrote:
>>> 
>>>      Hi Ralph,
>>> 
>>>      Here are the results with last night's "master" nightly,
>>>      openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose
>>>      option (yes, it looks like the "ERROR_LOG" problem has gone away):
>>> 
>>>      $ cat /proc/sys/kernel/shmmax
>>>      33554432
>>>      $ cat /proc/sys/kernel/shmall
>>>      2097152
>>>      $ cat /proc/sys/kernel/shmmni
>>>      4096
>>>      $ export SHMEM_SYMMETRIC_HEAP=1M
>>>      $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 
>>> 5
>>>      --mca memheap_base_verbose 100 $PWD/mic.out
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [rsh]
>>>      [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
>>>      rsh path NULL
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component [rsh]
>>>      set priority to 10
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
>>>      [isolated]
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
>>>      [isolated] set priority to 0
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component 
>>> [slurm]
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
>>>      [slurm]. Query failed to return a module
>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component [rsh]
>>>      [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
>>>      nodename hash 4121194178
>>>      [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh
>>>      path NULL
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
>>>      [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
>>>      allocation
>>>      [atl1-01-mic0:190439] [[31875,0],0] using dash_host
>>>      [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
>>>      [atl1-01-mic0:190439] [[31875,0],0] ignoring myself
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
>>>      allocation
>>>      [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
>>>      [31875,1]
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for
>>>      job [31875,1]
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] 
>>> registered
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is 
>>> not
>>>      a dynamic spawn
>>>      [atl1-01-mic0:190441] mca: base: components_register: registering
>>>      memheap components
>>>      [atl1-01-mic0:190441] mca: base: components_register: found loaded
>>>      component buddy
>>>      [atl1-01-mic0:190441] mca: base: components_register: component buddy
>>>      has no register or open function
>>>      [atl1-01-mic0:190442] mca: base: components_register: registering
>>>      memheap components
>>>      [atl1-01-mic0:190442] mca: base: components_register: found loaded
>>>      component buddy
>>>      [atl1-01-mic0:190442] mca: base: components_register: component buddy
>>>      has no register or open function
>>>      [atl1-01-mic0:190442] mca: base: components_register: found loaded
>>>      component ptmalloc
>>>      [atl1-01-mic0:190442] mca: base: components_register: component 
>>> ptmalloc
>>>      has no register or open function
>>>      [atl1-01-mic0:190441] mca: base: components_register: found loaded
>>>      component ptmalloc
>>>      [atl1-01-mic0:190441] mca: base: components_register: component 
>>> ptmalloc
>>>      has no register or open function
>>>      [atl1-01-mic0:190441] mca: base: components_open: opening memheap
>>>      components
>>>      [atl1-01-mic0:190441] mca: base: components_open: found loaded 
>>> component
>>>      buddy
>>>      [atl1-01-mic0:190441] mca: base: components_open: component buddy open
>>>      function successful
>>>      [atl1-01-mic0:190441] mca: base: components_open: found loaded 
>>> component
>>>      ptmalloc
>>>      [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc
>>>      open function successful
>>>      [atl1-01-mic0:190442] mca: base: components_open: opening memheap
>>>      components
>>>      [atl1-01-mic0:190442] mca: base: components_open: found loaded 
>>> component
>>>      buddy
>>>      [atl1-01-mic0:190442] mca: base: components_open: component buddy open
>>>      function successful
>>>      [atl1-01-mic0:190442] mca: base: components_open: found loaded 
>>> component
>>>      ptmalloc
>>>      [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc
>>>      open function successful
>>>      [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
>>>      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 
>>> 1
>>>      segments by method: 1
>>>      [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
>>>      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 
>>> 1
>>>      segments by method: 1
>>>      [atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments()
>>>      add: 00600000-00601000 rw-p 00000000 00:11
>>>      6029314                            /home/ariebs/bench/hello/mic.out
>>>      [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments()
>>>      add: 00600000-00601000 rw-p 00000000 00:11
>>>      6029314                            /home/ariebs/bench/hello/mic.out
>>>      [atl1-01-mic0:190442] base/memheap_base_static.c:75 -
>>>      mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
>>>      segments
>>>      [atl1-01-mic0:190442] base/memheap_base_register.c:39 -
>>>      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
>>>      270532608 bytes type=0x1 id=0xFFFFFFFF
>>>      [atl1-01-mic0:190441] base/memheap_base_static.c:75 -
>>>      mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
>>>      segments
>>>      [atl1-01-mic0:190441] base/memheap_base_register.c:39 -
>>>      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
>>>      270532608 bytes type=0x1 id=0xFFFFFFFF
>>>      [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
>>>      _reg_segment() Failed to register segment
>>>      [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
>>>      _reg_segment() Failed to register segment
>>>      [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
>>>      failed to initialize - aborting
>>>      [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
>>>      failed to initialize - aborting
>>>      
>>> --------------------------------------------------------------------------
>>>      It looks like SHMEM_INIT failed for some reason; your parallel process
>>>      is
>>>      likely to abort.  There are many reasons that a parallel process can
>>>      fail during SHMEM_INIT; some of which are due to configuration or
>>>      environment
>>>      problems.  This failure appears to be an internal failure; here's some
>>>      additional information (which may only be relevant to an Open SHMEM
>>>      developer):
>>> 
>>>        mca_memheap_base_select() failed
>>>        --> Returned "Error" (-1) instead of "Success" (0)
>>>      
>>> --------------------------------------------------------------------------
>>>      
>>> --------------------------------------------------------------------------
>>>      SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with
>>>      errorcode -1.
>>>      
>>> --------------------------------------------------------------------------
>>>      
>>> --------------------------------------------------------------------------
>>>      A SHMEM process is aborting at a time when it cannot guarantee that all
>>>      of its peer processes in the job will be killed properly.  You should
>>>      double check that everything has shut down cleanly.
>>> 
>>>      Local host: atl1-01-mic0
>>>      PID:        190441
>>>      
>>> --------------------------------------------------------------------------
>>>      -------------------------------------------------------
>>>      Primary job  terminated normally, but 1 process returned
>>>      a non-zero exit code.. Per user-direction, the job has been aborted.
>>>      -------------------------------------------------------
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
>>>      orted_exit commands
>>>      
>>> --------------------------------------------------------------------------
>>>      shmemrun detected that one or more processes exited with non-zero
>>>      status, thus causing
>>>      the job to be terminated. The first process to do so was:
>>> 
>>>        Process name: [[31875,1],0]
>>>        Exit code:    255
>>>      
>>> --------------------------------------------------------------------------
>>>      [atl1-01-mic0:190439] 1 more process has sent help message
>>>      help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>      [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0
>>>      to see all help / error messages
>>>      [atl1-01-mic0:190439] 1 more process has sent help message
>>>      help-shmem-api.txt / shmem-abort
>>>      [atl1-01-mic0:190439] 1 more process has sent help message
>>>      help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm
>>> 
>>>      On 04/12/2015 03:09 PM, Ralph Castain wrote:
>>> 
>>>        Sorry about that - I hadn't brought it over to the 1.8 branch yet.
>>>        I've done so now, which means the ERROR_LOG shouldn't show up any
>>>        more. It won't fix the memheap problem, though.
>>>        You might try adding "--mca memheap_base_verbose 100" to your cmd 
>>> line
>>>        so we can see why none of the memheap components are being selected.
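>>> 
>>>        A sketch of the resulting command line:
>>> 
>>>        $ shmemrun -H localhost -N 2 --mca sshmem mmap \
>>>            --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out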
>>> 
>>>          On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>          Hi Ralph,
>>> 
>>>          Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
>>> 
>>>          $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
>>>          plm_base_verbose 5 $PWD/mic.out
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
>>>          [rsh]
>>>          [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent
>>>          ssh : rsh path NULL
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
>>>          [rsh] set priority to 10
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
>>>          [isolated]
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
>>>          [isolated] set priority to 0
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
>>>          [slurm]
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component
>>>          [slurm]. Query failed to return a module
>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Selected component
>>>          [rsh]
>>>          [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189
>>>          nodename hash 4121194178
>>>          [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : 
>>> rsh
>>>          path NULL
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
>>>          [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
>>>          allocation
>>>          [atl1-01-mic0:190189] [[32137,0],0] using dash_host
>>>          [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
>>>          [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in
>>>          allocation
>>>          [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
>>>          [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in
>>>          file base/plm_base_launch_support.c at line 440
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job
>>>          [32137,1]
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof
>>>          for job [32137,1]
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
>>>          registered
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] 
>>> is
>>>          not a dynamic spawn
>>>          
>>> --------------------------------------------------------------------------
>>>          It looks like SHMEM_INIT failed for some reason; your parallel
>>>          process is
>>>          likely to abort.  There are many reasons that a parallel process 
>>> can
>>>          fail during SHMEM_INIT; some of which are due to configuration or
>>>          environment
>>>          problems.  This failure appears to be an internal failure; here's
>>>          some
>>>          additional information (which may only be relevant to an Open SHMEM
>>>          developer):
>>> 
>>>            mca_memheap_base_select() failed
>>>            --> Returned "Error" (-1) instead of "Success" (0)
>>>          
>>> --------------------------------------------------------------------------
>>>          [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM
>>>          failed to initialize - aborting
>>>          [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM
>>>          failed to initialize - aborting
>>>          
>>> --------------------------------------------------------------------------
>>>          SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
>>>          with errorcode -1.
>>>          
>>> --------------------------------------------------------------------------
>>>          
>>> --------------------------------------------------------------------------
>>>          A SHMEM process is aborting at a time when it cannot guarantee that
>>>          all
>>>          of its peer processes in the job will be killed properly.  You
>>>          should
>>>          double check that everything has shut down cleanly.
>>> 
>>>          Local host: atl1-01-mic0
>>>          PID:        190192
>>>          
>>> --------------------------------------------------------------------------
>>>          -------------------------------------------------------
>>>          Primary job  terminated normally, but 1 process returned
>>>          a non-zero exit code.. Per user-direction, the job has been 
>>> aborted.
>>>          -------------------------------------------------------
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
>>>          orted_exit commands
>>>          
>>> --------------------------------------------------------------------------
>>>          shmemrun detected that one or more processes exited with non-zero
>>>          status, thus causing
>>>          the job to be terminated. The first process to do so was:
>>> 
>>>            Process name: [[32137,1],0]
>>>            Exit code:    255
>>>          
>>> --------------------------------------------------------------------------
>>>          [atl1-01-mic0:190189] 1 more process has sent help message
>>>          help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>          [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate"
>>>          to 0 to see all help / error messages
>>>          [atl1-01-mic0:190189] 1 more process has sent help message
>>>          help-shmem-api.txt / shmem-abort
>>>          [atl1-01-mic0:190189] 1 more process has sent help message
>>>          help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
>>>          killed
>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm
>>> 
>>>          On 04/11/2015 07:41 PM, Ralph Castain wrote:
>>> 
>>>            Got it - thanks. I fixed that ERROR_LOG issue (I think - please
>>>            verify). I suspect the memheap issue relates to something else,
>>>            but I probably need to let the OSHMEM folks comment on it
>>> 
>>>              On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>              Everything is built on the Xeon side, with the icc "-mmic"
>>>              switch. I then ssh into one of the PHIs, and run shmemrun from
>>>              there.
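>>> 
>>>              Concretely, the flow is (a sketch; the prompts show where each
>>>              command runs, with the full icc options as in the earlier mail
>>>              quoted below):
>>> 
>>>              host$ icc -mmic ... -o mic.out shmem_hello.c
>>>              host$ ssh mic0
>>>              mic0$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out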
>>> 
>>>              On 04/11/2015 12:00 PM, Ralph Castain wrote:
>>> 
>>>                Let me try to understand the setup a little better. Are you
>>>                running shmemrun on the PHI itself? Or is it running on the
>>>                host processor, and you are trying to spawn a process onto 
>>> the
>>>                Phi?
>>> 
>>>                  On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>                  Hi Ralph,
>>> 
>>>                  Yes, this is attempting to get OSHMEM to run on the Phi.
>>> 
>>>                  I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured
>>>                  it with
>>> 
>>>                  $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
>>>                      CC="icc -mmic" CXX="icpc -mmic" \
>>>                      --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>                      AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>                      LD=x86_64-k1om-linux-ld \
>>>                      --enable-mpirun-prefix-by-default --disable-io-romio \
>>>                      --disable-mpi-fortran \
>>>                      --enable-debug \
>>>                      --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>>> 
>>>                  (Note that I had to add "oob-ud" to the
>>>                  "--enable-mca-no-build" option, as the build complained that
>>>                  mca oob/ud needed mca common-verbs.)
>>> 
>>>                  With that configuration, here is what I am seeing now...
>>> 
>>>                  $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
>>>                  $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
>>>                  plm_base_verbose 5 $PWD/mic.out
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
>>>                  component [rsh]
>>>                  [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on
>>>                  agent ssh : rsh path NULL
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
>>>                  component [rsh] set priority to 10
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
>>>                  component [isolated]
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
>>>                  component [isolated] set priority to 0
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
>>>                  component [slurm]
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
>>>                  component [slurm]. Query failed to return a module
>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Selected
>>>                  component [rsh]
>>>                  [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias
>>>                  189895 nodename hash 4121194178
>>>                  [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam
>>>                  32419
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent
>>>                  ssh : rsh path NULL
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start
>>>                  comm
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
>>>                  creating map
>>>                  [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working
>>>                  unmanaged allocation
>>>                  [atl1-01-mic0:189895] [[32419,0],0] using dash_host
>>>                  [atl1-01-mic0:189895] [[32419,0],0] checking node
>>>                  atl1-01-mic0
>>>                  [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only
>>>                  HNP in allocation
>>>                  [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job
>>>                  [32419,1]
>>>                  [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not
>>>                  found in file base/plm_base_launch_support.c at line 440
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps 
>>> for
>>>                  job [32419,1]
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring
>>>                  up iof for job [32419,1]
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
>>>                  [32419,1] registered
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job
>>>                  [32419,1] is not a dynamic spawn
>>>                  [atl1-01-mic0:189899] Error: pshmem_init.c:61 - 
>>> shmem_init()
>>>                  SHMEM failed to initialize - aborting
>>>                  [atl1-01-mic0:189898] Error: pshmem_init.c:61 - 
>>> shmem_init()
>>>                  SHMEM failed to initialize - aborting
>>>                  
>>> --------------------------------------------------------------------------
>>>                  It looks like SHMEM_INIT failed for some reason; your
>>>                  parallel process is
>>>                  likely to abort.  There are many reasons that a parallel
>>>                  process can
>>>                  fail during SHMEM_INIT; some of which are due to
>>>                  configuration or environment
>>>                  problems.  This failure appears to be an internal failure;
>>>                  here's some
>>>                  additional information (which may only be relevant to an
>>>                  Open SHMEM
>>>                  developer):
>>> 
>>>                    mca_memheap_base_select() failed
>>>                    --> Returned "Error" (-1) instead of "Success" (0)
>>>                  
>>> --------------------------------------------------------------------------
>>>                  
>>> --------------------------------------------------------------------------
>>>                  SHMEM_ABORT was invoked on rank 1 (pid 189899,
>>>                  host=atl1-01-mic0) with errorcode -1.
>>>                  
>>> --------------------------------------------------------------------------
>>>                  
>>> --------------------------------------------------------------------------
>>>                  A SHMEM process is aborting at a time when it cannot
>>>                  guarantee that all
>>>                  of its peer processes in the job will be killed properly. 
>>>                  You should
>>>                  double check that everything has shut down cleanly.
>>> 
>>>                  Local host: atl1-01-mic0
>>>                  PID:        189899
>>>                  
>>> --------------------------------------------------------------------------
>>>                  -------------------------------------------------------
>>>                  Primary job  terminated normally, but 1 process returned
>>>                  a non-zero exit code.. Per user-direction, the job has been
>>>                  aborted.
>>>                  -------------------------------------------------------
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd
>>>                  sending orted_exit commands
>>>                  
>>> --------------------------------------------------------------------------
>>>                  shmemrun detected that one or more processes exited with
>>>                  non-zero status, thus causing
>>>                  the job to be terminated. The first process to do so was:
>>> 
>>>                    Process name: [[32419,1],1]
>>>                    Exit code:    255
>>>                  
>>> --------------------------------------------------------------------------
>>>                  [atl1-01-mic0:189895] 1 more process has sent help message
>>>                  help-shmem-runtime.txt / 
>>> shmem_init:startup:internal-failure
>>>                  [atl1-01-mic0:189895] Set MCA parameter
>>>                  "orte_base_help_aggregate" to 0 to see all help / error
>>>                  messages
>>>                  [atl1-01-mic0:189895] 1 more process has sent help message
>>>                  help-shmem-api.txt / shmem-abort
>>>                  [atl1-01-mic0:189895] 1 more process has sent help message
>>>                  help-shmem-runtime.txt / oshmem shmem abort:cannot 
>>> guarantee
>>>                  all killed
>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop
>>>                  comm
>>> 
>>>                  On 04/10/2015 06:37 PM, Ralph Castain wrote:
>>> 
>>>                    Andy - could you please try the current 1.8.5 nightly
>>>                    tarball and see if it helps? The error log indicates that
>>>                    it is failing to get the topology from some daemon, I'm
>>>                    assuming the one on the Phi?
>>>                    You might also add --enable-debug to that configure line
>>>                    and then put -mca plm_base_verbose on the shmemrun cmd to
>>>                    get more help.
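>>> 
>>>                    A sketch of that invocation:
>>> 
>>>                    $ shmemrun -H localhost -N 2 --mca sshmem mmap \
>>>                        --mca plm_base_verbose 5 ./mic.out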
>>> 
>>>                      On Apr 10, 2015, at 11:55 AM, Andy Riebs
>>>                      <andy.ri...@hp.com> wrote:
>>>                      Summary: MPI jobs work fine, SHMEM jobs work just often
>>>                      enough to be tantalizing, on an Intel Xeon Phi/MIC
>>>                      system.
>>> 
>>>                      Longer version
>>> 
>>>                      Thanks to the excellent write-up last June
>>>                      (https://www.open-mpi.org/community/lists/users/2014/06/24711.php),
>>>                      I have been able to build a version of Open MPI for the
>>>                      Xeon Phi coprocessor that runs MPI jobs on the Phi
>>>                      coprocessor with no problem, but not SHMEM jobs.  Just
>>>                      at the point where I was about to document the problems
>>>                      I was having with SHMEM, my trivial SHMEM job worked.
>>>                      And then failed when I tried to run it again,
>>>                      immediately afterwards. I have a feeling I may be in
>>>                      uncharted territory here.
>>> 
>>>                      Environment
>>>                        * RHEL 6.5
>>>                        * Intel Composer XE 2015
>>>                        * Xeon Phi/MIC
>>>                      ----------------
>>> 
>>>                      Configuration
>>> 
>>>                      $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>                      $ source
>>>                      /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
>>>                      intel64
>>>                      $ ./configure --prefix=/home/ariebs/mic/mpi \
>>>                          CC="icc -mmic" CXX="icpc -mmic" \
>>>                          --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>                          AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>                          LD=x86_64-k1om-linux-ld \
>>>                          --enable-mpirun-prefix-by-default --disable-io-romio \
>>>                          --disable-vt --disable-mpi-fortran \
>>>                          --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
>>>                      $ make
>>>                      $ make install
>>> 
>>>                      ----------------
>>> 
>>>                      Test program
>>> 
>>>                      #include <stdio.h>
>>>                      #include <stdlib.h>
>>>                      #include <shmem.h>
>>>                      int main(int argc, char **argv)
>>>                      {
>>>                              int me, num_pe;
>>>                              shmem_init();
>>>                              num_pe = num_pes();
>>>                              me = my_pe();
>>>                              printf("Hello World from process %d of %d\n", me, num_pe);
>>>                              exit(0);
>>>                      }
>>> 
>>>                      ----------------
>>> 
>>>                      Building the program
>>> 
>>>                      export PATH=/home/ariebs/mic/mpi/bin:$PATH
>>>                      export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>                      source
>>>                      /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
>>>                      intel64
>>>                      export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
>>> 
>>>                      icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
>>>                              -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
>>>                              -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
>>>                              -lm -ldl -lutil \
>>>                              -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>                              -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>                              -o mic.out shmem_hello.c
>>> 
>>>                      ----------------
>>> 
>>>                      Running the program
>>> 
>>>                      (Note that the program had been consistently failing.
>>>                      Then, when I logged back into the system to capture the
>>>                      results, it worked once, and then immediately failed
>>>                      when I tried again, as shown below. Logging in and out
>>>                      isn't sufficient to correct the problem. Overall, I
>>>                      think I had 3 successful runs in 30-40 attempts.)
>>> 
>>>                      $ shmemrun -H localhost -N 2 --mca sshmem mmap 
>>> ./mic.out
>>>                      [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not
>>>                      found in file base/plm_base_launch_support.c at line 
>>> 426
>>>                      Hello World from process 0 of 2
>>>                      Hello World from process 1 of 2
>>>                      $ shmemrun -H localhost -N 2 --mca sshmem mmap 
>>> ./mic.out
>>>                      [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not
>>>                      found in file base/plm_base_launch_support.c at line 
>>> 426
>>>                      [atl1-01-mic0:189383] Error: pshmem_init.c:61 -
>>>                      shmem_init() SHMEM failed to initialize - aborting
>>>                      
>>> --------------------------------------------------------------------------
>>>                      It looks like SHMEM_INIT failed for some reason; your
>>>                      parallel process is
>>>                      likely to abort.  There are many reasons that a 
>>> parallel
>>>                      process can
>>>                      fail during SHMEM_INIT; some of which are due to
>>>                      configuration or environment
>>>                      problems.  This failure appears to be an internal
>>>                      failure; here's some
>>>                      additional information (which may only be relevant to 
>>> an
>>>                      Open SHMEM
>>>                      developer):
>>> 
>>>                        mca_memheap_base_select() failed
>>>                        --> Returned "Error" (-1) instead of "Success" (0)
>>>                      
>>> --------------------------------------------------------------------------
>>>                      
>>> --------------------------------------------------------------------------
>>>                      SHMEM_ABORT was invoked on rank 0 (pid 189383,
>>>                      host=atl1-01-mic0) with errorcode -1.
>>>                      
>>> --------------------------------------------------------------------------
>>>                      
>>> --------------------------------------------------------------------------
>>>                      A SHMEM process is aborting at a time when it cannot
>>>                      guarantee that all
>>>                      of its peer processes in the job will be killed
>>>                      properly.  You should
>>>                      double check that everything has shut down cleanly.
>>> 
>>>                      Local host: atl1-01-mic0
>>>                      PID:        189383
>>>                      
>>> --------------------------------------------------------------------------
>>>                      -------------------------------------------------------
>>>                      Primary job  terminated normally, but 1 process 
>>> returned
>>>                      a non-zero exit code.. Per user-direction, the job has
>>>                      been aborted.
>>>                      -------------------------------------------------------
>>>                      
>>> --------------------------------------------------------------------------
>>>                      shmemrun detected that one or more processes exited 
>>> with
>>>                      non-zero status, thus causing
>>>                      the job to be terminated. The first process to do so
>>>                      was:
>>> 
>>>                        Process name: [[30881,1],0]
>>>                        Exit code:    255
>>>                      
>>> --------------------------------------------------------------------------
>>> 
>>>                      Any thoughts about where to go from here?
>>> 
>>>                      Andy
>>> 
>>>  --
>>>  Andy Riebs
>>>  Hewlett-Packard Company
>>>  High Performance Computing
>>>  +1 404 648 9024
>>>  My opinions are not necessarily those of HP
>>> 