I think Gilles may be correct here. Reviewing the code, it appears we have 
never (going back to at least the 1.6 series) forwarded the local 
LD_LIBRARY_PATH to the remote node when exec’ing the orted. The only thing we 
have ever done is prepend the OMPI prefix to PATH and LD_LIBRARY_PATH - not 
the paths of any supporting libs.

What we have required, therefore, is that the library path be set up properly 
in the remote .bashrc (or your shell’s equivalent) so the orted can find its 
supporting libraries.
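
For example, something like this near the top of the remote ~/.bashrc would 
cover the Intel runtime in Andy’s setup below (the path is taken from his 
LD_LIBRARY_PATH, so adjust it for your install). Note that many stock .bashrc 
files return early for non-interactive shells, so the export needs to come 
before any such check:

    # make the Intel compiler runtime visible to non-interactive ssh logins
    export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH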

As I indicated, the -x option only forwards envars to the application procs 
themselves, not to the orted. I could try to add another cmd line option to 
forward things for the orted, but the concern we’ve had in the past (and still 
harbor) is that the ssh cmd line is limited in length. Adding some potentially 
long paths to support such an option could exceed that limit and cause launch 
failures.

I’d try the static method first, or perhaps the LDFLAGS Gilles suggested.
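
To be concrete, the rpath approach would look roughly like this, modeled on 
Andy’s configure line below (a sketch, not a tested recipe - the library path 
is taken from his environment):

    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
        LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" \
        ... (remaining options as before)

That bakes the Intel library directory into orted itself, so nothing has to be 
forwarded across ssh. And Gilles’ wrapper-script idea would be something like 
this sketch (the orted.bin name is just illustrative - you’d first move the 
real orted aside):

    #!/bin/sh
    # installed as <prefix>/bin/orted: set the library path, then exec the
    # real daemon (renamed orted.bin here) with the original arguments
    LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH
    exec /home/ariebs/mic/mpi-nightly/bin/orted.bin "$@"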


> On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Andy,
> 
> what about reconfiguring Open MPI with 
> LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ?
> 
> IIRC, another option is: LDFLAGS="-static-intel"
> 
> last but not least, you can always replace orted with a simple script that 
> sets the LD_LIBRARY_PATH and execs the original orted
> 
> do you have the same behaviour on non-MIC hardware when Open MPI is compiled 
> with Intel compilers?
> if it works on non-MIC hardware, the root cause could be the sshd_config of 
> the MIC not accepting the LD_LIBRARY_PATH environment variable
> 
> my 0.02 US$
> 
> Gilles
> 
> On 4/14/2015 11:20 PM, Ralph Castain wrote:
>> Hmmm…certainly looks that way. I’ll investigate.
>> 
>>> On Apr 14, 2015, at 6:06 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting 
>>> propagated?
>>> 
>>> $ ldd /home/ariebs/mic/mpi-nightly/bin/orted
>>>         linux-vdso.so.1 =>  (0x00007fffa1d3b000)
>>>         libopen-rte.so.0 => 
>>> /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002ab6ce464000)
>>>         libopen-pal.so.0 => 
>>> /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002ab6ce7d3000)
>>>         libm.so.6 => /lib64/libm.so.6 (0x00002ab6cebbd000)
>>>         libdl.so.2 => /lib64/libdl.so.2 (0x00002ab6ceded000)
>>>         librt.so.1 => /lib64/librt.so.1 (0x00002ab6ceff1000)
>>>         libutil.so.1 => /lib64/libutil.so.1 (0x00002ab6cf1f9000)
>>>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab6cf3fc000)
>>>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab6cf60f000)
>>>         libc.so.6 => /lib64/libc.so.6 (0x00002ab6cf82c000)
>>>         libimf.so => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
>>> (0x00002ab6cfb84000)
>>>         libsvml.so => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
>>> (0x00002ab6cffd6000)
>>>         libirng.so => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
>>> (0x00002ab6d086f000)
>>>         libintlc.so.5 => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
>>> (0x00002ab6d0a82000)
>>>         /lib64/ld-linux-k1om.so.2 (0x00002ab6ce243000)
>>> 
>>> $ echo $LD_LIBRARY_PATH
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
>>> 
>>> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda 
>>> --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 
>>> 100 --leave-session-attached --mca mca_component_show_load_errors 1 
>>> $PWD/mic.out
>>> --------------------------------------------------------------------------
>>> A deprecated MCA variable value was specified in the environment or
>>> on the command line.  Deprecated MCA variables should be avoided;
>>> they may disappear in future releases.
>>> 
>>>   Deprecated variable: mca_component_show_load_errors
>>>   New variable:        mca_base_component_show_load_errors
>>> --------------------------------------------------------------------------
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
>>> [atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
>>> path NULL
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [isolated]
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [isolated] 
>>> set priority to 0
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [slurm]
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Skipping component [slurm]. 
>>> Query failed to return a module
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Selected component [rsh]
>>> [atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 nodename 
>>> hash 4238360777
>>> [atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh path 
>>> NULL
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
>>> [atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
>>> [atl1-02-mic0:16183] [[33630,0],0] using dash_host
>>> [atl1-02-mic0:16183] [[33630,0],0] checking node mic1
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon 
>>> [[33630,0],1]
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new daemon 
>>> [[33630,0],1] to node mic1
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote shell as 
>>> local shell
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
>>>         /usr/bin/ssh <template>     
>>> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; 
>>> LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export 
>>> LD_LIBRARY_PATH ; 
>>> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; 
>>> export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted -mca 
>>> orte_leave_session_attached "1" --hnp-topo-sig 
>>> 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid 
>>> "2203975680" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "2" 
>>> -mca orte_hnp_uri 
>>> "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1
>>>  <tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1>" --tree-spawn 
>>> --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca 
>>> memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca 
>>> plm "rsh" -mca rmaps_ppr_n_pernode "2"
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a child of 
>>> mine
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to launch list
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of daemon 
>>> [[33630,0],1]
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing: (/usr/bin/ssh) 
>>> [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export 
>>> PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
>>> export LD_LIBRARY_PATH ; 
>>> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; 
>>> export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted -mca 
>>> orte_leave_session_attached "1" --hnp-topo-sig 
>>> 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid 
>>> "2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca 
>>> orte_hnp_uri 
>>> "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1
>>>  <tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1>" --tree-spawn 
>>> --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca 
>>> memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca 
>>> plm "rsh" -mca rmaps_ppr_n_pernode "2"]
>>> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared 
>>> libraries: libimf.so: cannot open shared object file: No such file or 
>>> directory
>>> [atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending orted_exit 
>>> commands
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>> 
>>> * not finding the required libraries and/or binaries on
>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>> 
>>> * lack of authority to execute on one or more specified nodes.
>>>   Please verify your allocation and authorities.
>>> 
>>> * the inability to write startup files into /tmp 
>>> (--tmpdir/orte_tmpdir_base).
>>>   Please check with your sys admin to determine the correct location to use.
>>> 
>>> *  compilation of the orted with dynamic libraries when static are required
>>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>>   one of the contrib/platform definitions for your system type.
>>> 
>>> * an inability to create a connection back to mpirun due to a
>>>   lack of common network interfaces and/or no route found between
>>>   them. Please check network connectivity (including firewalls
>>>   and network routing requirements).
>>> --------------------------------------------------------------------------
>>> [atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm
>>> 
>>> 
>>> On 04/13/2015 07:47 PM, Ralph Castain wrote:
>>>> Weird. I’m not sure what to try at that point - IIRC, building static 
>>>> won’t resolve this problem (but you could try and see). You could add the 
>>>> following to the cmd line and see if it tells us anything useful:
>>>> 
>>>> --leave-session-attached --mca mca_component_show_load_errors 1
>>>> 
>>>> You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see 
>>>> where it is looking for libimf since it (and not mic.out) is the one 
>>>> complaining
>>>> 
>>>> 
>>>>> On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>>> 
>>>>> Ralph and Nathan,
>>>>> 
>>>>> The problem may be something trivial, as I don't typically use "shmemrun" 
>>>>> to start jobs. With the following, I *think* I've demonstrated that the 
>>>>> problem library is where it belongs on the remote system:
>>>>> 
>>>>> $ ldd mic.out
>>>>>         linux-vdso.so.1 =>  (0x00007fffb83ff000)
>>>>>         liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 
>>>>> (0x00002b059cfbb000)
>>>>>         libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 
>>>>> (0x00002b059d35a000)
>>>>>         libopen-rte.so.0 => 
>>>>> /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002b059d7e3000)
>>>>>         libopen-pal.so.0 => 
>>>>> /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002b059db53000)
>>>>>         libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
>>>>>         libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
>>>>>         libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
>>>>>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
>>>>>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
>>>>>         libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
>>>>>         librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
>>>>>         libimf.so => 
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
>>>>> (0x00002b059ef04000)
>>>>>         libsvml.so => 
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
>>>>> (0x00002b059f356000)
>>>>>         libirng.so => 
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
>>>>> (0x00002b059fbef000)
>>>>>         libintlc.so.5 => 
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
>>>>> (0x00002b059fe02000)
>>>>>         /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)
>>>>> $ echo $LD_LIBRARY_PATH 
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
>>>>> $ ssh mic1 file 
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 
>>>>> 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 
>>>>> (SYSV), dynamically linked, not stripped
>>>>> $ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
>>>>> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared 
>>>>> libraries: libimf.so: cannot open shared object file: No such file or 
>>>>> directory
>>>>> ...
>>>>> 
>>>>> 
>>>>> On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
>>>>>> For talking between PHIs on the same system I recommend using the scif
>>>>>> BTL NOT tcp.
>>>>>> 
>>>>>> That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
>>>>>> system. It looks like it can't find the intel compiler libraries.
>>>>>> 
>>>>>> -Nathan Hjelm
>>>>>> HPC-5, LANL
>>>>>> 
>>>>>> On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
>>>>>>>    Progress!  I can run my trivial program on the local PHI, but not on
>>>>>>>    the other PHI in the system. Here are the interesting parts:
>>>>>>> 
>>>>>>>    A pretty good recipe with last night's nightly master:
>>>>>>> 
>>>>>>>    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
>>>>>>>    CXX="icpc -mmic" \
>>>>>>>        --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>>>>>         AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
>>>>>>>    LD=x86_64-k1om-linux-ld \
>>>>>>>         --enable-mpirun-prefix-by-default --disable-io-romio
>>>>>>>    --disable-mpi-fortran \
>>>>>>>         --enable-orterun-prefix-by-default \
>>>>>>>         --enable-debug
>>>>>>>    $ make && make install
>>>>>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca 
>>>>>>> spml
>>>>>>>    yoda --mca btl sm,self,tcp $PWD/mic.out
>>>>>>>    Hello World from process 0 of 2
>>>>>>>    Hello World from process 1 of 2
>>>>>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca 
>>>>>>> spml
>>>>>>>    yoda --mca btl openib,sm,self $PWD/mic.out
>>>>>>>    Hello World from process 0 of 2
>>>>>>>    Hello World from process 1 of 2
>>>>>>>    $
>>>>>>> 
>>>>>>>    However, I can't seem to cross the fabric, even though I can ssh
>>>>>>>    freely back and forth between mic0 and mic1. Running the next 2 tests
>>>>>>>    from mic0, it certainly seems like the second one should work, too:
>>>>>>> 
>>>>>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml 
>>>>>>> yoda
>>>>>>>    --mca btl sm,self,tcp $PWD/mic.out
>>>>>>>    Hello World from process 0 of 2
>>>>>>>    Hello World from process 1 of 2
>>>>>>>    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml 
>>>>>>> yoda
>>>>>>>    --mca btl sm,self,tcp $PWD/mic.out
>>>>>>>    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>>>>>>>    libraries: libimf.so: cannot open shared object file: No such file or
>>>>>>>    directory
>>>>>>>    
>>>>>>> --------------------------------------------------------------------------
>>>>>>>    ORTE was unable to reliably start one or more daemons.
>>>>>>>    This usually is caused by:
>>>>>>> 
>>>>>>>    * not finding the required libraries and/or binaries on
>>>>>>>      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>      settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>> 
>>>>>>>    * lack of authority to execute on one or more specified nodes.
>>>>>>>      Please verify your allocation and authorities.
>>>>>>> 
>>>>>>>    * the inability to write startup files into /tmp
>>>>>>>    (--tmpdir/orte_tmpdir_base).
>>>>>>>      Please check with your sys admin to determine the correct location 
>>>>>>> to
>>>>>>>    use.
>>>>>>> 
>>>>>>>    *  compilation of the orted with dynamic libraries when static are
>>>>>>>    required
>>>>>>>      (e.g., on Cray). Please check your configure cmd line and consider 
>>>>>>> using
>>>>>>>      one of the contrib/platform definitions for your system type.
>>>>>>> 
>>>>>>>    * an inability to create a connection back to mpirun due to a
>>>>>>>      lack of common network interfaces and/or no route found between
>>>>>>>      them. Please check network connectivity (including firewalls
>>>>>>>      and network routing requirements).
>>>>>>>     ...
>>>>>>>    $
>>>>>>> 
>>>>>>>    (Note that I get the same results with "--mca btl 
>>>>>>> openib,sm,self"....)
>>>>>>> 
>>>>>>>    $ ssh mic1 file
>>>>>>>    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>>>>>    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: 
>>>>>>> ELF
>>>>>>>    64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 
>>>>>>> 1
>>>>>>>    (SYSV), dynamically linked, not stripped
>>>>>>>    $ shmemrun -x
>>>>>>>    
>>>>>>> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>>>>>    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
>>>>>>>    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>>>>>>>    libraries: libimf.so: cannot open shared object file: No such file or
>>>>>>>    directory
>>>>>>>    
>>>>>>> --------------------------------------------------------------------------
>>>>>>>    ORTE was unable to reliably start one or more daemons.
>>>>>>>    This usually is caused by:
>>>>>>> 
>>>>>>>    * not finding the required libraries and/or binaries on
>>>>>>>      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>      settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>> 
>>>>>>>    * lack of authority to execute on one or more specified nodes.
>>>>>>>      Please verify your allocation and authorities.
>>>>>>> 
>>>>>>>    * the inability to write startup files into /tmp
>>>>>>>    (--tmpdir/orte_tmpdir_base).
>>>>>>>      Please check with your sys admin to determine the correct location 
>>>>>>> to
>>>>>>>    use.
>>>>>>> 
>>>>>>>    *  compilation of the orted with dynamic libraries when static are
>>>>>>>    required
>>>>>>>      (e.g., on Cray). Please check your configure cmd line and consider 
>>>>>>> using
>>>>>>>      one of the contrib/platform definitions for your system type.
>>>>>>> 
>>>>>>>    * an inability to create a connection back to mpirun due to a
>>>>>>>      lack of common network interfaces and/or no route found between
>>>>>>>      them. Please check network connectivity (including firewalls
>>>>>>>      and network routing requirements).
>>>>>>> 
>>>>>>>    Following here are:
>>>>>>>    - IB information
>>>>>>>    - the failing case, run with lots of debugging information. (As you
>>>>>>>    might imagine, I've tried 17 ways from Sunday to ensure that
>>>>>>>    libimf.so is found.)
>>>>>>> 
>>>>>>>    $ ibv_devices
>>>>>>>        device                 node GUID
>>>>>>>        ------              ----------------
>>>>>>>        mlx4_0              24be05ffffa57160
>>>>>>>        scif0               4c79bafffe4402b6
>>>>>>>    $ ibv_devinfo
>>>>>>>    hca_id: mlx4_0
>>>>>>>            transport:                      InfiniBand (0)
>>>>>>>            fw_ver:                         2.11.1250
>>>>>>>            node_guid:                      24be:05ff:ffa5:7160
>>>>>>>            sys_image_guid:                 24be:05ff:ffa5:7163
>>>>>>>            vendor_id:                      0x02c9
>>>>>>>            vendor_part_id:                 4099
>>>>>>>            hw_ver:                         0x0
>>>>>>>            phys_port_cnt:                  2
>>>>>>>                    port:   1
>>>>>>>                            state:                  PORT_ACTIVE (4)
>>>>>>>                            max_mtu:                2048 (4)
>>>>>>>                            active_mtu:             2048 (4)
>>>>>>>                            sm_lid:                 8
>>>>>>>                            port_lid:               86
>>>>>>>                            port_lmc:               0x00
>>>>>>>                            link_layer:             InfiniBand
>>>>>>> 
>>>>>>>                    port:   2
>>>>>>>                            state:                  PORT_DOWN (1)
>>>>>>>                            max_mtu:                2048 (4)
>>>>>>>                            active_mtu:             2048 (4)
>>>>>>>                            sm_lid:                 0
>>>>>>>                            port_lid:               0
>>>>>>>                            port_lmc:               0x00
>>>>>>>                            link_layer:             InfiniBand
>>>>>>> 
>>>>>>>    hca_id: scif0
>>>>>>>            transport:                      SCIF (2)
>>>>>>>            fw_ver:                         0.0.1
>>>>>>>            node_guid:                      4c79:baff:fe44:02b6
>>>>>>>            sys_image_guid:                 4c79:baff:fe44:02b6
>>>>>>>            vendor_id:                      0x8086
>>>>>>>            vendor_part_id:                 0
>>>>>>>            hw_ver:                         0x1
>>>>>>>            phys_port_cnt:                  1
>>>>>>>                    port:   1
>>>>>>>                            state:                  PORT_ACTIVE (4)
>>>>>>>                            max_mtu:                4096 (5)
>>>>>>>                            active_mtu:             4096 (5)
>>>>>>>                            sm_lid:                 1
>>>>>>>                            port_lid:               1001
>>>>>>>                            port_lmc:               0x00
>>>>>>>                            link_layer:             SCIF
>>>>>>> 
>>>>>>>    $ shmemrun -x
>>>>>>>    
>>>>>>> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>>>>>>>    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca 
>>>>>>> plm_base_verbose
>>>>>>>    5 --mca memheap_base_verbose 100 $PWD/mic.out
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component 
>>>>>>> [rsh]
>>>>>>>    [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent 
>>>>>>> ssh :
>>>>>>>    rsh path NULL
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component 
>>>>>>> [rsh] set
>>>>>>>    priority to 10
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component
>>>>>>>    [isolated]
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component
>>>>>>>    [isolated] set priority to 0
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component 
>>>>>>> [slurm]
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Skipping component 
>>>>>>> [slurm].
>>>>>>>    Query failed to return a module
>>>>>>>    [atl1-01-mic0:191024] mca:base:select:(  plm) Selected component 
>>>>>>> [rsh]
>>>>>>>    [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 
>>>>>>> nodename
>>>>>>>    hash 4121194178
>>>>>>>    [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh 
>>>>>>> path
>>>>>>>    NULL
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged 
>>>>>>> allocation
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] using dash_host
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] checking node mic1
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon
>>>>>>>    [[29012,0],1]
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new 
>>>>>>> daemon
>>>>>>>    [[29012,0],1] to node mic1
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote 
>>>>>>> shell as
>>>>>>>    local shell
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
>>>>>>>            /usr/bin/ssh <template>    
>>>>>>>    PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
>>>>>>>    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
>>>>>>> export
>>>>>>>    LD_LIBRARY_PATH ;
>>>>>>>    
>>>>>>> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
>>>>>>>    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
>>>>>>>    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
>>>>>>>    orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca
>>>>>>>    orte_ess_num_procs "2" -mca orte_hnp_uri
>>>>>>>    
>>>>>>> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1
>>>>>>>  <tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1>"
>>>>>>>    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
>>>>>>>    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" 
>>>>>>> -mca
>>>>>>>    rmaps_ppr_n_pernode "2"
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a 
>>>>>>> child of
>>>>>>>    mine
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to 
>>>>>>> launch
>>>>>>>    list
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of 
>>>>>>> daemon
>>>>>>>    [[29012,0],1]
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: 
>>>>>>> (/usr/bin/ssh)
>>>>>>>    [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ;
>>>>>>>    export PATH ;
>>>>>>>    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
>>>>>>> export
>>>>>>>    LD_LIBRARY_PATH ;
>>>>>>>    
>>>>>>> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
>>>>>>>    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
>>>>>>>    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
>>>>>>>    orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca 
>>>>>>> orte_ess_num_procs
>>>>>>>    "2" -mca orte_hnp_uri
>>>>>>>    
>>>>>>> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1
>>>>>>>  <tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1>"
>>>>>>>    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
>>>>>>>    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" 
>>>>>>> -mca
>>>>>>>    rmaps_ppr_n_pernode "2"]
>>>>>>>    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>>>>>>>    libraries: libimf.so: cannot open shared object file: No such file or
>>>>>>>    directory
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending 
>>>>>>> orted_exit
>>>>>>>    commands
>>>>>>>    
>>>>>>> --------------------------------------------------------------------------
>>>>>>>    ORTE was unable to reliably start one or more daemons.
>>>>>>>    This usually is caused by:
>>>>>>> 
>>>>>>>    * not finding the required libraries and/or binaries on
>>>>>>>      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>      settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>> 
>>>>>>>    * lack of authority to execute on one or more specified nodes.
>>>>>>>      Please verify your allocation and authorities.
>>>>>>> 
>>>>>>>    * the inability to write startup files into /tmp
>>>>>>>    (--tmpdir/orte_tmpdir_base).
>>>>>>>      Please check with your sys admin to determine the correct location 
>>>>>>> to
>>>>>>>    use.
>>>>>>> 
>>>>>>>    *  compilation of the orted with dynamic libraries when static are
>>>>>>>    required
>>>>>>>      (e.g., on Cray). Please check your configure cmd line and consider 
>>>>>>> using
>>>>>>>      one of the contrib/platform definitions for your system type.
>>>>>>> 
>>>>>>>    * an inability to create a connection back to mpirun due to a
>>>>>>>      lack of common network interfaces and/or no route found between
>>>>>>>      them. Please check network connectivity (including firewalls
>>>>>>>      and network routing requirements).
>>>>>>>    
>>>>>>> --------------------------------------------------------------------------
>>>>>>>    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm
>>>>>>> 
>>>>>>>    On 04/13/2015 08:50 AM, Andy Riebs wrote:
>>>>>>> 
>>>>>>>      Hi Ralph,
>>>>>>> 
>>>>>>>      Here are the results with last night's "master" nightly,
>>>>>>>      openmpi-dev-1487-g9c6d452.tar.bz2, and adding the 
>>>>>>> memheap_base_verbose
>>>>>>>      option (yes, it looks like the "ERROR_LOG" problem has gone away):
>>>>>>> 
>>>>>>>      $ cat /proc/sys/kernel/shmmax
>>>>>>>      33554432
>>>>>>>      $ cat /proc/sys/kernel/shmall
>>>>>>>      2097152
>>>>>>>      $ cat /proc/sys/kernel/shmmni
>>>>>>>      4096
>>>>>>>      $ export SHMEM_SYMMETRIC_HEAP=1M
>>>>>>>      $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca 
>>>>>>> plm_base_verbose 5
>>>>>>>      --mca memheap_base_verbose 100 $PWD/mic.out
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component 
>>>>>>> [rsh]
>>>>>>>      [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent 
>>>>>>> ssh :
>>>>>>>      rsh path NULL
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component 
>>>>>>> [rsh]
>>>>>>>      set priority to 10
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
>>>>>>>      [isolated]
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
>>>>>>>      [isolated] set priority to 0
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component 
>>>>>>> [slurm]
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
>>>>>>>      [slurm]. Query failed to return a module
>>>>>>>      [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component 
>>>>>>> [rsh]
>>>>>>>      [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
>>>>>>>      nodename hash 4121194178
>>>>>>>      [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : 
>>>>>>> rsh
>>>>>>>      path NULL
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
>>>>>>>      allocation
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] using dash_host
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] ignoring myself
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
>>>>>>>      allocation
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
>>>>>>>      [31875,1]
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof 
>>>>>>> for
>>>>>>>      job [31875,1]
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] 
>>>>>>> registered
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] 
>>>>>>> is not
>>>>>>>      a dynamic spawn
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_register: registering
>>>>>>>      memheap components
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_register: found loaded
>>>>>>>      component buddy
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_register: component 
>>>>>>> buddy
>>>>>>>      has no register or open function
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_register: registering
>>>>>>>      memheap components
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_register: found loaded
>>>>>>>      component buddy
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_register: component 
>>>>>>> buddy
>>>>>>>      has no register or open function
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_register: found loaded
>>>>>>>      component ptmalloc
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_register: component 
>>>>>>> ptmalloc
>>>>>>>      has no register or open function
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_register: found loaded
>>>>>>>      component ptmalloc
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_register: component 
>>>>>>> ptmalloc
>>>>>>>      has no register or open function
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_open: opening memheap
>>>>>>>      components
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_open: found loaded 
>>>>>>> component
>>>>>>>      buddy
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_open: component buddy 
>>>>>>> open
>>>>>>>      function successful
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_open: found loaded 
>>>>>>> component
>>>>>>>      ptmalloc
>>>>>>>      [atl1-01-mic0:190441] mca: base: components_open: component 
>>>>>>> ptmalloc
>>>>>>>      open function successful
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_open: opening memheap
>>>>>>>      components
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_open: found loaded 
>>>>>>> component
>>>>>>>      buddy
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_open: component buddy 
>>>>>>> open
>>>>>>>      function successful
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_open: found loaded 
>>>>>>> component
>>>>>>>      ptmalloc
>>>>>>>      [atl1-01-mic0:190442] mca: base: components_open: component 
>>>>>>> ptmalloc
>>>>>>>      open function successful
>>>>>>>      [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
>>>>>>>      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 
>>>>>>> byte(s), 1
>>>>>>>      segments by method: 1
>>>>>>>      [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
>>>>>>>      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 
>>>>>>> byte(s), 1
>>>>>>>      segments by method: 1
>>>>>>>      [atl1-01-mic0:190442] base/memheap_base_static.c:205 - 
>>>>>>> _load_segments()
>>>>>>>      add: 00600000-00601000 rw-p 00000000 00:11
>>>>>>>      6029314                            /home/ariebs/bench/hello/mic.out
>>>>>>>      [atl1-01-mic0:190441] base/memheap_base_static.c:205 - 
>>>>>>> _load_segments()
>>>>>>>      add: 00600000-00601000 rw-p 00000000 00:11
>>>>>>>      6029314                            /home/ariebs/bench/hello/mic.out
>>>>>>>      [atl1-01-mic0:190442] base/memheap_base_static.c:75 -
>>>>>>>      mca_memheap_base_static_init() Memheap static memory: 3824 
>>>>>>> byte(s), 2
>>>>>>>      segments
>>>>>>>      [atl1-01-mic0:190442] base/memheap_base_register.c:39 -
>>>>>>>      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 
>>>>>>> 0x0x10f200000
>>>>>>>      270532608 bytes type=0x1 id=0xFFFFFFFF
>>>>>>>      [atl1-01-mic0:190441] base/memheap_base_static.c:75 -
>>>>>>>      mca_memheap_base_static_init() Memheap static memory: 3824 
>>>>>>> byte(s), 2
>>>>>>>      segments
>>>>>>>      [atl1-01-mic0:190441] base/memheap_base_register.c:39 -
>>>>>>>      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 
>>>>>>> 0x0x10f200000
>>>>>>>      270532608 bytes type=0x1 id=0xFFFFFFFF
>>>>>>>      [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
>>>>>>>      _reg_segment() Failed to register segment
>>>>>>>      [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
>>>>>>>      _reg_segment() Failed to register segment
>>>>>>>      [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
>>>>>>>      failed to initialize - aborting
>>>>>>>      [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
>>>>>>>      failed to initialize - aborting
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      It looks like SHMEM_INIT failed for some reason; your parallel 
>>>>>>> process
>>>>>>>      is
>>>>>>>      likely to abort.  There are many reasons that a parallel process 
>>>>>>> can
>>>>>>>      fail during SHMEM_INIT; some of which are due to configuration or
>>>>>>>      environment
>>>>>>>      problems.  This failure appears to be an internal failure; here's 
>>>>>>> some
>>>>>>>      additional information (which may only be relevant to an Open SHMEM
>>>>>>>      developer):
>>>>>>> 
>>>>>>>        mca_memheap_base_select() failed
>>>>>>>        --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) 
>>>>>>> with
>>>>>>>      errorcode -1.
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      A SHMEM process is aborting at a time when it cannot guarantee 
>>>>>>> that all
>>>>>>>      of its peer processes in the job will be killed properly.  You 
>>>>>>> should
>>>>>>>      double check that everything has shut down cleanly.
>>>>>>> 
>>>>>>>      Local host: atl1-01-mic0
>>>>>>>      PID:        190441
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      -------------------------------------------------------
>>>>>>>      Primary job  terminated normally, but 1 process returned
>>>>>>>      a non-zero exit code.. Per user-direction, the job has been 
>>>>>>> aborted.
>>>>>>>      -------------------------------------------------------
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
>>>>>>>      orted_exit commands
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      shmemrun detected that one or more processes exited with non-zero
>>>>>>>      status, thus causing
>>>>>>>      the job to be terminated. The first process to do so was:
>>>>>>> 
>>>>>>>        Process name: [[31875,1],0]
>>>>>>>        Exit code:    255
>>>>>>>      
>>>>>>> --------------------------------------------------------------------------
>>>>>>>      [atl1-01-mic0:190439] 1 more process has sent help message
>>>>>>>      help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>>>>>      [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" 
>>>>>>> to 0
>>>>>>>      to see all help / error messages
>>>>>>>      [atl1-01-mic0:190439] 1 more process has sent help message
>>>>>>>      help-shmem-api.txt / shmem-abort
>>>>>>>      [atl1-01-mic0:190439] 1 more process has sent help message
>>>>>>>      help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all 
>>>>>>> killed
>>>>>>>      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm
>>>>>>> 
>>>>>>>      On 04/12/2015 03:09 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>        Sorry about that - I hadn't brought it over to the 1.8 branch 
>>>>>>> yet.
>>>>>>>        I've done so now, which means the ERROR_LOG shouldn't show up any
>>>>>>>        more. It won't fix the memheap problem, though.
>>>>>>>        You might try adding "--mca memheap_base_verbose 100" to your 
>>>>>>> cmd line
>>>>>>>        so we can see why none of the memheap components are being 
>>>>>>> selected.
>>>>>>> 
>>>>>>>          On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>>>>>          Hi Ralph,
>>>>>>> 
>>>>>>>          Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
>>>>>>> 
>>>>>>>          $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
>>>>>>>          plm_base_verbose 5 $PWD/mic.out
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying 
>>>>>>> component
>>>>>>>          [rsh]
>>>>>>>          [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on 
>>>>>>> agent
>>>>>>>          ssh : rsh path NULL
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of 
>>>>>>> component
>>>>>>>          [rsh] set priority to 10
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying 
>>>>>>> component
>>>>>>>          [isolated]
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of 
>>>>>>> component
>>>>>>>          [isolated] set priority to 0
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying 
>>>>>>> component
>>>>>>>          [slurm]
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping 
>>>>>>> component
>>>>>>>          [slurm]. Query failed to return a module
>>>>>>>          [atl1-01-mic0:190189] mca:base:select:(  plm) Selected 
>>>>>>> component
>>>>>>>          [rsh]
>>>>>>>          [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 
>>>>>>> 190189
>>>>>>>          nodename hash 4121194178
>>>>>>>          [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh 
>>>>>>> : rsh
>>>>>>>          path NULL
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating 
>>>>>>> map
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
>>>>>>>          allocation
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] using dash_host
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP 
>>>>>>> in
>>>>>>>          allocation
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job 
>>>>>>> [32137,1]
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found 
>>>>>>> in
>>>>>>>          file base/plm_base_launch_support.c at line 440
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for 
>>>>>>> job
>>>>>>>          [32137,1]
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up 
>>>>>>> iof
>>>>>>>          for job [32137,1]
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
>>>>>>>          registered
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job 
>>>>>>> [32137,1] is
>>>>>>>          not a dynamic spawn
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          It looks like SHMEM_INIT failed for some reason; your parallel
>>>>>>>          process is
>>>>>>>          likely to abort.  There are many reasons that a parallel 
>>>>>>> process can
>>>>>>>          fail during SHMEM_INIT; some of which are due to configuration 
>>>>>>> or
>>>>>>>          environment
>>>>>>>          problems.  This failure appears to be an internal failure; 
>>>>>>> here's
>>>>>>>          some
>>>>>>>          additional information (which may only be relevant to an Open 
>>>>>>> SHMEM
>>>>>>>          developer):
>>>>>>> 
>>>>>>>            mca_memheap_base_select() failed
>>>>>>>            --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() 
>>>>>>> SHMEM
>>>>>>>          failed to initialize - aborting
>>>>>>>          [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() 
>>>>>>> SHMEM
>>>>>>>          failed to initialize - aborting
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          SHMEM_ABORT was invoked on rank 1 (pid 190192, 
>>>>>>> host=atl1-01-mic0)
>>>>>>>          with errorcode -1.
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          A SHMEM process is aborting at a time when it cannot guarantee 
>>>>>>> that
>>>>>>>          all
>>>>>>>          of its peer processes in the job will be killed properly.  You
>>>>>>>          should
>>>>>>>          double check that everything has shut down cleanly.
>>>>>>> 
>>>>>>>          Local host: atl1-01-mic0
>>>>>>>          PID:        190192
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          -------------------------------------------------------
>>>>>>>          Primary job  terminated normally, but 1 process returned
>>>>>>>          a non-zero exit code.. Per user-direction, the job has been 
>>>>>>> aborted.
>>>>>>>          -------------------------------------------------------
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
>>>>>>>          orted_exit commands
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          shmemrun detected that one or more processes exited with 
>>>>>>> non-zero
>>>>>>>          status, thus causing
>>>>>>>          the job to be terminated. The first process to do so was:
>>>>>>> 
>>>>>>>            Process name: [[32137,1],0]
>>>>>>>            Exit code:    255
>>>>>>>          
>>>>>>> --------------------------------------------------------------------------
>>>>>>>          [atl1-01-mic0:190189] 1 more process has sent help message
>>>>>>>          help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>>>>>          [atl1-01-mic0:190189] Set MCA parameter 
>>>>>>> "orte_base_help_aggregate"
>>>>>>>          to 0 to see all help / error messages
>>>>>>>          [atl1-01-mic0:190189] 1 more process has sent help message
>>>>>>>          help-shmem-api.txt / shmem-abort
>>>>>>>          [atl1-01-mic0:190189] 1 more process has sent help message
>>>>>>>          help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee 
>>>>>>> all
>>>>>>>          killed
>>>>>>>          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm
>>>>>>> 
>>>>>>>          On 04/11/2015 07:41 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>            Got it - thanks. I fixed that ERROR_LOG issue (I think - 
>>>>>>> please
>>>>>>>            verify). I suspect the memheap issue relates to something 
>>>>>>> else,
>>>>>>>            but I probably need to let the OSHMEM folks comment on it
>>>>>>> 
>>>>>>>              On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>>>>>              Everything is built on the Xeon side, with the icc "-mmic"
>>>>>>>              switch. I then ssh into one of the PHIs, and run shmemrun 
>>>>>>> from
>>>>>>>              there.
>>>>>>> 
>>>>>>>              On 04/11/2015 12:00 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>                Let me try to understand the setup a little better. Are 
>>>>>>> you
>>>>>>>                running shmemrun on the PHI itself? Or is it running on 
>>>>>>> the
>>>>>>>                host processor, and you are trying to spawn a process 
>>>>>>> onto the
>>>>>>>                Phi?
>>>>>>> 
>>>>>>>                  On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>>>>>                  Hi Ralph,
>>>>>>> 
>>>>>>>                  Yes, this is attempting to get OSHMEM to run on the 
>>>>>>> Phi.
>>>>>>> 
>>>>>>>                  I grabbed openmpi-dev-1484-g033418f.tar.bz2 and 
>>>>>>> configured
>>>>>>>                  it with
>>>>>>> 
>>>>>>>                  $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
>>>>>>>                      CC="icc -mmic" CXX="icpc -mmic" \
>>>>>>>                      --build=x86_64-unknown-linux-gnu
>>>>>>>                  --host=x86_64-k1om-linux    \
>>>>>>>                       AR=x86_64-k1om-linux-ar
>>>>>>>                  RANLIB=x86_64-k1om-linux-ranlib  
>>>>>>> LD=x86_64-k1om-linux-ld   \
>>>>>>>                       --enable-mpirun-prefix-by-default
>>>>>>>                  --disable-io-romio     --disable-mpi-fortran    \
>>>>>>>                       --enable-debug    
>>>>>>>                  
>>>>>>> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>>>>>>> 
>>>>>>>                  (Note that I had to add "oob-ud" to the
>>>>>>>                  "--enable-mca-no-build" option, as the build 
>>>>>>> complained that
>>>>>>>                  mca oob/ud needed mca common-verbs.)
>>>>>>> 
>>>>>>>                  With that configuration, here is what I am seeing 
>>>>>>> now...
>>>>>>> 
>>>>>>>                  $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
>>>>>>>                  $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
>>>>>>>                  plm_base_verbose 5 $PWD/mic.out
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
>>>>>>>                  component [rsh]
>>>>>>>                  [atl1-01-mic0:189895] [[INVALID],INVALID] 
>>>>>>> plm:rsh_lookup on
>>>>>>>                  agent ssh : rsh path NULL
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
>>>>>>>                  component [rsh] set priority to 10
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
>>>>>>>                  component [isolated]
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
>>>>>>>                  component [isolated] set priority to 0
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
>>>>>>>                  component [slurm]
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
>>>>>>>                  component [slurm]. Query failed to return a module
>>>>>>>                  [atl1-01-mic0:189895] mca:base:select:(  plm) Selected
>>>>>>>                  component [rsh]
>>>>>>>                  [atl1-01-mic0:189895] plm:base:set_hnp_name: initial 
>>>>>>> bias
>>>>>>>                  189895 nodename hash 4121194178
>>>>>>>                  [atl1-01-mic0:189895] plm:base:set_hnp_name: final 
>>>>>>> jobfam
>>>>>>>                  32419
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on 
>>>>>>> agent
>>>>>>>                  ssh : rsh path NULL
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive 
>>>>>>> start
>>>>>>>                  comm
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] using dash_host
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
>>>>>>>                  [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
>>>>>>>                  [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  It looks like SHMEM_INIT failed for some reason; your parallel process is
>>>>>>>                  likely to abort.  There are many reasons that a parallel process can
>>>>>>>                  fail during SHMEM_INIT; some of which are due to configuration or environment
>>>>>>>                  problems.  This failure appears to be an internal failure; here's some
>>>>>>>                  additional information (which may only be relevant to an Open SHMEM
>>>>>>>                  developer):
>>>>>>> 
>>>>>>>                    mca_memheap_base_select() failed
>>>>>>>                    --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  A SHMEM process is aborting at a time when it cannot guarantee that all
>>>>>>>                  of its peer processes in the job will be killed properly.  You should
>>>>>>>                  double check that everything has shut down cleanly.
>>>>>>> 
>>>>>>>                  Local host: atl1-01-mic0
>>>>>>>                  PID:        189899
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  -------------------------------------------------------
>>>>>>>                  Primary job  terminated normally, but 1 process returned
>>>>>>>                  a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>                  -------------------------------------------------------
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  shmemrun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>                  the job to be terminated. The first process to do so was:
>>>>>>> 
>>>>>>>                    Process name: [[32419,1],1]
>>>>>>>                    Exit code:    255
>>>>>>>                  --------------------------------------------------------------------------
>>>>>>>                  [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>>>>>>                  [atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>                  [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
>>>>>>>                  [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
>>>>>>>                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
>>>>>>> 
>>>>>>>                  On 04/10/2015 06:37 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>                    Andy - could you please try the current 1.8.5 nightly tarball and see if it helps? The error log indicates that it is failing to get the topology from some daemon, I'm assuming the one on the Phi?
>>>>>>>                    You might also add --enable-debug to that configure line and then put -mca plm_base_verbose on the shmemrun cmd to get more help.
>>>>>>> 
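>>>>>>>                    To make that suggestion concrete, a rebuild and run along those lines might look like this (a sketch, not from the original thread: the verbosity level 5 is an assumed value, any positive level prints progressively more detail, and the bracketed placeholder stands for the configure options from Andy's recipe below):
>>>>>>> 
>>>>>>>                      $ ./configure --prefix=/home/ariebs/mic/mpi --enable-debug [...rest of the configure options below...]
>>>>>>>                      $ make && make install
>>>>>>>                      $ shmemrun --mca plm_base_verbose 5 -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>>>> 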
>>>>>>>                      On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>>>>> 
>>>>>>>                      Summary: MPI jobs work fine, SHMEM jobs work just often enough to be tantalizing, on an Intel Xeon Phi/MIC system.
>>>>>>> 
>>>>>>>                      Longer version
>>>>>>> 
>>>>>>>                      Thanks to the excellent write-up last June (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I have been able to build a version of Open MPI for the Xeon Phi coprocessor that runs MPI jobs on the Phi with no problem, but not SHMEM jobs. Just at the point where I was about to document the problems I was having with SHMEM, my trivial SHMEM job worked, and then failed when I tried to run it again immediately afterwards. I have a feeling I may be in uncharted territory here.
>>>>>>> 
>>>>>>>                      Environment
>>>>>>>                        * RHEL 6.5
>>>>>>>                        * Intel Composer XE 2015
>>>>>>>                        * Xeon Phi/MIC
>>>>>>>                      ----------------
>>>>>>> 
>>>>>>>                      Configuration
>>>>>>> 
>>>>>>>                      $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>>>>>                      $ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
>>>>>>>                      $ ./configure --prefix=/home/ariebs/mic/mpi \
>>>>>>>                          CC="icc -mmic" CXX="icpc -mmic" \
>>>>>>>                          --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>>>>>                          AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>>>>>                          LD=x86_64-k1om-linux-ld \
>>>>>>>                          --enable-mpirun-prefix-by-default --disable-io-romio \
>>>>>>>                          --disable-vt --disable-mpi-fortran \
>>>>>>>                          --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
>>>>>>>                      $ make
>>>>>>>                      $ make install
>>>>>>> 
>>>>>>>                      ----------------
>>>>>>> 
>>>>>>>                      Test program
>>>>>>> 
>>>>>>>                      #include <stdio.h>
>>>>>>>                      #include <stdlib.h>
>>>>>>>                      #include <shmem.h>
>>>>>>>                      int main(int argc, char **argv)
>>>>>>>                      {
>>>>>>>                              int me, num_pe;
>>>>>>>                              shmem_init();
>>>>>>>                              num_pe = num_pes();
>>>>>>>                              me = my_pe();
>>>>>>>                              /* me and num_pe are ints, so print with %d */
>>>>>>>                              printf("Hello World from process %d of %d\n", me, num_pe);
>>>>>>>                              exit(0);
>>>>>>>                      }
>>>>>>> 
>>>>>>>                      ----------------
>>>>>>> 
>>>>>>>                      Building the program
>>>>>>> 
>>>>>>>                      export PATH=/home/ariebs/mic/mpi/bin:$PATH
>>>>>>>                      export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>>>>>                      source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
>>>>>>>                      export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
>>>>>>> 
>>>>>>>                      icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
>>>>>>>                              -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
>>>>>>>                              -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
>>>>>>>                              -lm -ldl -lutil \
>>>>>>>                              -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>>>>>                              -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>>>>>                              -o mic.out  shmem_hello.c
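>>>>>>> 
>>>>>>>                      For comparison, an alternative build (a sketch, assuming the oshcc wrapper compiler was built and installed under the same prefix, and run in the same environment as above) lets the wrapper supply the include path and library list instead of spelling them out by hand:
>>>>>>> 
>>>>>>>                      # sketch: the wrapper invokes the "icc -mmic" that Open MPI was
>>>>>>>                      # configured with and adds the needed -I, -L, and -l flags itself
>>>>>>>                      export PATH=/home/ariebs/mic/mpi/bin:$PATH
>>>>>>>                      oshcc -std=gnu99 -o mic.out shmem_hello.c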
>>>>>>> 
>>>>>>>                      ----------------
>>>>>>> 
>>>>>>>                      Running the program
>>>>>>> 
>>>>>>>                      (Note that the program had been consistently failing. Then, when I logged back into the system to capture the results, it worked once, and then immediately failed when I tried again, as shown below. Logging in and out isn't sufficient to correct the problem. Overall, I think I had 3 successful runs in 30-40 attempts.)
>>>>>>> 
>>>>>>>                      $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>>>>                      [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
>>>>>>>                      Hello World from process 0 of 2
>>>>>>>                      Hello World from process 1 of 2
>>>>>>>                      $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>>>>                      [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
>>>>>>>                      [atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      It looks like SHMEM_INIT failed for some reason; your parallel process is
>>>>>>>                      likely to abort.  There are many reasons that a parallel process can
>>>>>>>                      fail during SHMEM_INIT; some of which are due to configuration or environment
>>>>>>>                      problems.  This failure appears to be an internal failure; here's some
>>>>>>>                      additional information (which may only be relevant to an Open SHMEM
>>>>>>>                      developer):
>>>>>>> 
>>>>>>>                        mca_memheap_base_select() failed
>>>>>>>                        --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      A SHMEM process is aborting at a time when it cannot guarantee that all
>>>>>>>                      of its peer processes in the job will be killed properly.  You should
>>>>>>>                      double check that everything has shut down cleanly.
>>>>>>> 
>>>>>>>                      Local host: atl1-01-mic0
>>>>>>>                      PID:        189383
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      -------------------------------------------------------
>>>>>>>                      Primary job  terminated normally, but 1 process returned
>>>>>>>                      a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>                      -------------------------------------------------------
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>>                      shmemrun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>                      the job to be terminated. The first process to do so was:
>>>>>>> 
>>>>>>>                        Process name: [[30881,1],0]
>>>>>>>                        Exit code:    255
>>>>>>>                      --------------------------------------------------------------------------
>>>>>>> 
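>>>>>>>                      Since the failure is in memheap component selection, one way to see which components were tried and why each was rejected is to raise the verbosity of the OSHMEM frameworks involved (a sketch, not from the original thread: memheap_base_verbose and sshmem_base_verbose are assumed to follow Open MPI's standard per-framework verbosity convention, and 100 is an arbitrary high level):
>>>>>>> 
>>>>>>>                      $ shmemrun --mca memheap_base_verbose 100 --mca sshmem_base_verbose 100 \
>>>>>>>                            -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>>>> 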
>>>>>>>                      Any thoughts about where to go from here?
>>>>>>> 
>>>>>>>                      Andy
>>>>>>> 
>>>>>>>  --
>>>>>>>  Andy Riebs
>>>>>>>  Hewlett-Packard Company
>>>>>>>  High Performance Computing
>>>>>>>  +1 404 648 9024
>>>>>>>  My opinions are not necessarily those of HP
>>>>>>> 