Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-26 Thread Ralph Castain
Kewl! Let us know if it breaks again.

> On Apr 26, 2015, at 4:29 PM, Andy Riebs  wrote:
> 
> Yes, it just worked -- I took the old command line, just to ensure that I was 
> testing the correct problem, and it worked. Then I remembered that I had set 
> OMPI_MCA_plm_rsh_pass_path and OMPI_MCA_plm_rsh_pass_libpath in my test 
> setup, so I removed those from my environment, ran again, and it still worked!
> 
> Whatever it is that you're doing Ralph, keep it up :-)
> 
> Regardless of the cause or result, thanks $$ for poking at this!
> 
> Andy
> 
> On 04/26/2015 10:35 AM, Ralph Castain wrote:
>> Not intentionally - I did add that new MCA param as we discussed, but don’t 
>> recall making any other changes in this area.
>> 
>> There have been some other build system changes made as a result of more 
>> extensive testing of the 1.8 release candidate - it is possible that 
>> something in that area had an impact here.
>> 
>> Are you saying it just works, even without passing the new param?
>> 
>> 
>>> On Apr 26, 2015, at 6:39 AM, Andy Riebs wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> Did you solve this problem in a more general way? I finally sat down this 
>>> morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit 
>>> from last week, and can't reproduce the problem at all.
>>> 
>>> Andy
>>> 
>>> On 04/16/2015 12:15 PM, Ralph Castain wrote:
 Sorry - I had to revert the commit due to a reported MTT problem. I'll 
 reinsert it after I get home and can debug the problem this weekend.
 
 On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs wrote:
 Hi Ralph,
 
 If I did this right (NEVER a good bet :-) ), it didn't work...
 
 Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I 
 built with the same script as yesterday, but removing the LDFLAGS=-Wl, 
 stuff:
 
 $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" 
 CXX="icpc -mmic" \
   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
 LD=x86_64-k1om-linux-ld \
--enable-mpirun-prefix-by-default --disable-io-romio 
 --disable-mpi-fortran \
--enable-debug 
 --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
 $ make
 $ make install
  ...
 make[1]: Leaving directory 
 `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
 make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
 make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
 make  install-exec-hook
 make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
 make[3]: ./config/find_common_syms: Command not found
 make[3]: [install-exec-hook] Error 127 (ignored)
 make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
 make[2]: Nothing to be done for `install-data-am'.
 make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
 make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
 $
 
 But it seems to finish the install.
 
 I then tried to run, adding the new mca arguments:
 
 $  shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH 
 -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out
 /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared 
 libraries: libimf.so: cannot open shared object file: No such file or 
 directory
  ...
 $ echo $MIC_LD_LIBRARY_PATH
 /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
 $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
 /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
 /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
 $
 
 
 
 On 04/16/2015 07:22 AM, Ralph Castain wrote:
> FWIW: I just added (last night) a pair of new MCA params for this purpose:
> 
> plm_rsh_pass_path      prepends the designated path to the remote 
> shell's PATH prior to executing orted
> plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH
> 
> I believe that will resolve the problem for Andy regardless of compiler 
> used. In the master now, waiting for someone to verify it before adding 
> to 1.8.5. Sadly, I am away from any cluster for the rest of this week, so 
> I'd welcome anyone having a 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-26 Thread Andy Riebs

  
  
Yes, it just worked -- I took the old command line, just to ensure that I was
testing the correct problem, and it worked. Then I remembered that I had set
OMPI_MCA_plm_rsh_pass_path and OMPI_MCA_plm_rsh_pass_libpath in my test setup,
so I removed those from my environment, ran again, and it still worked!

Whatever it is that you're doing Ralph, keep it up :-)

Regardless of the cause or result, thanks $$ for poking at this!

Andy

On 04/26/2015 10:35 AM, Ralph Castain wrote:
> Not intentionally - I did add that new MCA param as we discussed, but don’t
> recall making any other changes in this area.
>
> There have been some other build system changes made as a result of more
> extensive testing of the 1.8 release candidate - it is possible that
> something in that area had an impact here.
>
> Are you saying it just works, even without passing the new param?
>
>> On Apr 26, 2015, at 6:39 AM, Andy Riebs wrote:
>>
>> Hi Ralph,
>>
>> Did you solve this problem in a more general way? I finally sat down this
>> morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit
>> from last week, and can't reproduce the problem at all.
>>
>> Andy
>>
>> On 04/16/2015 12:15 PM, Ralph Castain wrote:
>>> Sorry - I had to revert the commit due to a reported MTT problem. I'll
>>> reinsert it after I get home and can debug the problem this weekend.
>>>
>>> On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs wrote:
>>> Hi Ralph,
>>>
>>> If I did this right (NEVER a good bet :-) ), it didn't work...
>>>
>>> Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I
>>> built with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff:
>>>
>>> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
>>>     --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
>>>     --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
>>>     --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>>> $ make
>>> $ make install
>>>   ...
>>> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
>>> make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>>> make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>>> make  install-exec-hook
>>> make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>>> make[3]: ./config/find_common_syms: Command not found
>>> make[3]: [install-exec-hook] Error 127 (ignored)
>>> make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>>> make[2]: Nothing to be done for `install-data-am'.
>>> make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>>> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>>> $
>>>
>>> But it seems to finish the install.
>>>
>>> I then tried to run, adding the new mca arguments:
>>>
>>> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-26 Thread Ralph Castain
Not intentionally - I did add that new MCA param as we discussed, but don’t 
recall making any other changes in this area.

There have been some other build system changes made as a result of more 
extensive testing of the 1.8 release candidate - it is possible that something 
in that area had an impact here.

Are you saying it just works, even without passing the new param?


> On Apr 26, 2015, at 6:39 AM, Andy Riebs  wrote:
> 
> Hi Ralph,
> 
> Did you solve this problem in a more general way? I finally sat down this 
> morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit 
> from last week, and can't reproduce the problem at all.
> 
> Andy
> 
> On 04/16/2015 12:15 PM, Ralph Castain wrote:
>> Sorry - I had to revert the commit due to a reported MTT problem. I'll 
>> reinsert it after I get home and can debug the problem this weekend.
>> 
>> On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs wrote:
>> Hi Ralph,
>> 
>> If I did this right (NEVER a good bet :-) ), it didn't work...
>> 
>> Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I 
>> built with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff:
>> 
>> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc 
>> -mmic" \
>>   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
>> LD=x86_64-k1om-linux-ld \
>>--enable-mpirun-prefix-by-default --disable-io-romio 
>> --disable-mpi-fortran \
>>--enable-debug 
>> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>> $ make
>> $ make install
>>  ...
>> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
>> make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>> make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>> make  install-exec-hook
>> make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>> make[3]: ./config/find_common_syms: Command not found
>> make[3]: [install-exec-hook] Error 127 (ignored)
>> make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>> make[2]: Nothing to be done for `install-data-am'.
>> make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
>> $
>> 
>> But it seems to finish the install.
>> 
>> I then tried to run, adding the new mca arguments:
>> 
>> $  shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH 
>> -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out
>> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared 
>> libraries: libimf.so: cannot open shared object file: No such file or 
>> directory
>>  ...
>> $ echo $MIC_LD_LIBRARY_PATH
>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
>> $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>> $
>> 
>> 
>> 
>> On 04/16/2015 07:22 AM, Ralph Castain wrote:
>>> FWIW: I just added (last night) a pair of new MCA params for this purpose:
>>> 
>>> plm_rsh_pass_path      prepends the designated path to the remote 
>>> shell's PATH prior to executing orted
>>> plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH
>>> 
>>> I believe that will resolve the problem for Andy regardless of compiler 
>>> used. In the master now, waiting for someone to verify it before adding to 
>>> 1.8.5. Sadly, I am away from any cluster for the rest of this week, so I'd 
>>> welcome anyone having a chance to test it.
>>> 
>>> 
>>> On Thu, Apr 16, 2015 at 2:57 AM, Thomas Jahns wrote:
>>> Hello,
>>> 
>>> On Apr 15, 2015, at 02:11 , Gilles Gouaillardet wrote:
 what about reconfiguring Open MPI with 
 LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"
  ?
 
 IIRC, an other option is : LDFLAGS="-static-intel"
>>> 
>>> 
>>> let me first state that I have no experience developing for MIC. But 
>>> regarding the Intel runtime libraries, the only sane option in my opinion 
>>> is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in the same 
>>> directory as the corresponding compiler binaries and add a line like
>>> 
>>> -Wl,-rpath,/path/to/composerxe/lib/intel??
>>> 
>>> to that file.
>>> 
>>> Regards, Thomas
>>> -- 
>>> Thomas Jahns
>>> DKRZ 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-26 Thread Andy Riebs

  
  
Hi Ralph,

Did you solve this problem in a more general way? I finally sat down this
morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit
from last week, and can't reproduce the problem at all.

Andy

On 04/16/2015 12:15 PM, Ralph Castain wrote:
> Sorry - I had to revert the commit due to a reported MTT problem. I'll
> reinsert it after I get home and can debug the problem this weekend.
>
> On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs wrote:
> Hi Ralph,
>
> If I did this right (NEVER a good bet :-) ), it didn't work...
>
> Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I
> built with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff:
>
> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
>     --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
>     --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
>     --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
> $ make
> $ make install
>   ...
> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
> make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make  install-exec-hook
> make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[3]: ./config/find_common_syms: Command not found
> make[3]: [install-exec-hook] Error 127 (ignored)
> make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[2]: Nothing to be done for `install-data-am'.
> make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> $
>
> But it seems to finish the install.
>
> I then tried to run, adding the new mca arguments:
>
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH
> -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
> libraries: libimf.so: cannot open shared object file: No such file or
> directory
>  ...
> $ echo $MIC_LD_LIBRARY_PATH
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
> $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
> $
>
> On 04/16/2015 07:22 AM, Ralph Castain wrote:
>> FWIW: I just added (last night) a pair of new MCA params for this purpose:
>>
>> plm_rsh_pass_path      prepends the designated path to the remote
>> shell's PATH prior to executing orted
>> plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH
>>
>> I believe that will resolve the problem for Andy regardless of compiler
>> used. In the master now, waiting for someone to verify it before adding
>> to 1.8.5. Sadly, I am away from any cluster for the rest of this week,
>> so I'd welcome anyone having a chance to test 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-16 Thread Ralph Castain
Sorry - I had to revert the commit due to a reported MTT problem. I'll
reinsert it after I get home and can debug the problem this weekend.

On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs  wrote:

>  Hi Ralph,
>
> If I did this right (NEVER a good bet :-) ), it didn't work...
>
> Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I
> built with the same script as yesterday, but removing the LDFLAGS=-Wl,
> stuff:
>
> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
> CXX="icpc -mmic" \
>   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib
> LD=x86_64-k1om-linux-ld \
>--enable-mpirun-prefix-by-default --disable-io-romio
> --disable-mpi-fortran \
>--enable-debug
> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
> $ make
> $ make install
>  ...
> make[1]: Leaving directory
> `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
> make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make  install-exec-hook
> make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[3]: ./config/find_common_syms: Command not found
> make[3]: [install-exec-hook] Error 127 (ignored)
> make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[2]: Nothing to be done for `install-data-am'.
> make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
> $
>
> But it seems to finish the install.
>
> I then tried to run, adding the new mca arguments:
>
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH
> -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2
> ./mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
> libraries: libimf.so: cannot open shared object file: No such file or
> directory
>  ...
> $ echo $MIC_LD_LIBRARY_PATH
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic
> :/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
> $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
> $
>
>
>
> On 04/16/2015 07:22 AM, Ralph Castain wrote:
>
> FWIW: I just added (last night) a pair of new MCA params for this purpose:
>
>  plm_rsh_pass_path      prepends the designated path to the remote
> shell's PATH prior to executing orted
> plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH
>
>  I believe that will resolve the problem for Andy regardless of compiler
> used. In the master now, waiting for someone to verify it before adding to
> 1.8.5. Sadly, I am away from any cluster for the rest of this week, so I'd
> welcome anyone having a chance to test it.
>
>
> On Thu, Apr 16, 2015 at 2:57 AM, Thomas Jahns  wrote:
>
>> Hello,
>>
>>  On Apr 15, 2015, at 02:11 , Gilles Gouaillardet wrote:
>>
>> what about reconfiguring Open MPI with
>> LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"
>> ?
>>
>> IIRC, an other option is : LDFLAGS="-static-intel"
>>
>>
>>  let me first state that I have no experience developing for MIC. But
>> regarding the Intel runtime libraries, the only sane option in my opinion
>> is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in the same
>> directory as the corresponding compiler binaries and add a line like
>>
>>  -Wl,-rpath,/path/to/composerxe/lib/intel??
>>
>>  to that file.
>>
>>  Regards, Thomas
>>--
>> Thomas Jahns
>> DKRZ GmbH, Department: Application software
>>
>>  Deutsches Klimarechenzentrum
>> Bundesstraße 45a
>> D-20146 Hamburg
>>
>>  Phone: +49-40-460094-151
>> Fax: +49-40-460094-270
>> Email: Thomas Jahns 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-16 Thread Andy Riebs

  
  
Hi Ralph,

If I did this right (NEVER a good bet :-) ), it didn't work...

Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I built
with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff:

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
$ make
$ make install
 ...
make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
make  install-exec-hook
make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[3]: ./config/find_common_syms: Command not found
make[3]: [install-exec-hook] Error 127 (ignored)
make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490'
$

But it seems to finish the install.

I then tried to run, adding the new mca arguments:

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH
-mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file or directory
 ...
$ echo $MIC_LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
$ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
$

On 04/16/2015 07:22 AM, Ralph Castain wrote:
> FWIW: I just added (last night) a pair of new MCA params for this purpose:
>
> plm_rsh_pass_path      prepends the designated path to the remote
> shell's PATH prior to executing orted
> plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH
>
> I believe that will resolve the problem for Andy regardless of compiler
> used. In the master now, waiting for someone to verify it before adding to
> 1.8.5. Sadly, I am away from any cluster for the rest of this week, so I'd
> welcome anyone having a chance to test it.
>
> On Thu, Apr 16, 2015 at 2:57 AM, Thomas Jahns wrote:
>> Hello,
>>
>> On Apr 15, 2015, at 02:11 , Gilles Gouaillardet wrote:
>>> what about reconfiguring Open MPI with
>>> LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ?
>>>
>>> IIRC, an other option is : LDFLAGS="-static-intel"
>>
>> let me first state that I have no experience developing for MIC. But
>> regarding the Intel runtime libraries, the only sane option in my opinion
>> is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in the same
>> directory as the corresponding compiler binaries and add a line like
>>
>> -Wl,-rpath,/path/to/composerxe/lib/intel??
>>
>> to that file.
>>
>> Regards, Thomas
>> --
>> Thomas Jahns
>> DKRZ GmbH, Department: Application

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-16 Thread Ralph Castain
FWIW: I just added (last night) a pair of new MCA params for this purpose:

plm_rsh_pass_path      prepends the designated path to the remote
shell's PATH prior to executing orted
plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH

I believe that will resolve the problem for Andy regardless of compiler
used. In the master now, waiting for someone to verify it before adding to
1.8.5. Sadly, I am away from any cluster for the rest of this week, so I'd
welcome anyone having a chance to test it.
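
A minimal usage sketch, with placeholder paths (either the command-line form
or the equivalent OMPI_MCA_* environment variables can be used):

    $ shmemrun -mca plm_rsh_pass_path /path/on/remote/bin \
               -mca plm_rsh_pass_libpath /path/on/remote/lib \
               -H mic0,mic1 -n 2 ./a.out

    # or, equivalently, via the standard environment-variable form:
    $ export OMPI_MCA_plm_rsh_pass_path=/path/on/remote/bin
    $ export OMPI_MCA_plm_rsh_pass_libpath=/path/on/remote/lib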


On Thu, Apr 16, 2015 at 2:57 AM, Thomas Jahns  wrote:

> Hello,
>
> On Apr 15, 2015, at 02:11 , Gilles Gouaillardet wrote:
>
> what about reconfiguring Open MPI with
> LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"
> ?
>
> IIRC, an other option is : LDFLAGS="-static-intel"
>
>
> let me first state that I have no experience developing for MIC. But
> regarding the Intel runtime libraries, the only sane option in my opinion
> is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in the same
> directory as the corresponding compiler binaries and add a line like
>
> -Wl,-rpath,/path/to/composerxe/lib/intel??
>
> to that file.
>
> Regards, Thomas
> --
> Thomas Jahns
> DKRZ GmbH, Department: Application software
>
> Deutsches Klimarechenzentrum
> Bundesstraße 45a
> D-20146 Hamburg
>
> Phone: +49-40-460094-151
> Fax: +49-40-460094-270
> Email: Thomas Jahns 


Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-16 Thread Thomas Jahns

Hello,

On Apr 15, 2015, at 02:11 , Gilles Gouaillardet wrote:
what about reconfiguring Open MPI with 
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ?


IIRC, an other option is : LDFLAGS="-static-intel"



let me first state that I have no experience developing for MIC. But  
regarding the Intel runtime libraries, the only sane option in my  
opinion is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in  
the same directory as the corresponding compiler binaries and add a  
line like


-Wl,-rpath,/path/to/composerxe/lib/intel??

to that file.
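
A concrete sketch, reusing the compiler install path that appears elsewhere in
this thread (the exact location of icc.cfg next to the icc binary and the
install prefix are assumptions; adjust to the actual installation):

    $ echo '-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic' \
        >> "$(dirname "$(which icc)")/icc.cfg"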

Regards, Thomas
--
Thomas Jahns
DKRZ GmbH, Department: Application software

Deutsches Klimarechenzentrum
Bundesstraße 45a
D-20146 Hamburg

Phone: +49-40-460094-151
Fax: +49-40-460094-270
Email: Thomas Jahns 







Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-16 Thread Gilles Gouaillardet

Ralph,

now i remember this part ...
IIRC, LD_LIBRARY_PATH was never forwarded when remote starting orted.
i simply avoided this issue by using gnu compilers, or gcc/g++/ifort if i 
need intel fortran
/* you already mentioned this is not officially supported by Intel */

What about adding a new configure option :
--orted-rpath=...

if we configure with LDFLAGS=-Wl,-rpath,... then orted will find the 
intel runtime (good)
but the user binary will also use that runtime, regardless of the 
$LD_LIBRARY_PATH used at mpicc/mpirun time

(not so good, since the user might want to use a different runtime)
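
To put the two side by side (the first line uses the proposed option, which
does not exist yet; the second is what can be done today and rpaths the user
binaries as well):

    $ ./configure --orted-rpath=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic ...
    $ ./configure LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ...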

any thoughts ?

Cheers,

Gilles


On 4/15/2015 12:10 PM, Ralph Castain wrote:
I think Gilles may be correct here. In reviewing the code, it appears 
we have never (going back to the 1.6 series, at least) forwarded the 
local LD_LIBRARY_PATH to the remote node when exec’ing the orted. The 
only thing we have done is to set the PATH and LD_LIBRARY_PATH to 
support the OMPI prefix - not any supporting libs.


What we have required, therefore, is that your path be setup properly 
in the remote .bashrc (or pick your shell) to handle the libraries.


As I indicated, the -x option only forwards envars to the application 
procs themselves, not the orted. I could try to add another cmd line 
option to forward things for the orted, but the concern we’ve had in 
the past (and still harbor) is that the ssh cmd line is limited in 
length. Thus, adding some potentially long paths to support this 
option could overwhelm it and cause failures.


I’d try the static method first, or perhaps the LDFLAGS Gilles suggested.


On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet wrote:


Andy,

what about reconfiguring Open MPI with 
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" 
?


IIRC, an other option is : LDFLAGS="-static-intel"

last but not least, you can always replace orted with a simple script 
that sets the LD_LIBRARY_PATH and exec the original orted


do you have the same behaviour on non MIC hardware when Open MPI is 
compiled with intel compilers ?
if it works on non MIC hardware, the root cause could be in the 
sshd_config of the MIC that does not accept to receive LD_LIBRARY_PATH

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:

Hmmm…certainly looks that way. I’ll investigate.

On Apr 14, 2015, at 6:06 AM, Andy Riebs wrote:


Hi Ralph,

Still no happiness... It looks like my LD_LIBRARY_PATH just isn't 
getting propagated?


$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
linux-vdso.so.1 => (0x7fffa1d3b000)
libopen-rte.so.0 => 
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x2ab6ce464000)
libopen-pal.so.0 => 
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x2ab6ce7d3000)

libm.so.6 => /lib64/libm.so.6 (0x2ab6cebbd000)
libdl.so.2 => /lib64/libdl.so.2 (0x2ab6ceded000)
librt.so.1 => /lib64/librt.so.1 (0x2ab6ceff1000)
libutil.so.1 => /lib64/libutil.so.1 (0x2ab6cf1f9000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2ab6cf3fc000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2ab6cf60f000)
libc.so.6 => /lib64/libc.so.6 (0x2ab6cf82c000)
libimf.so => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
(0x2ab6cfb84000)
libsvml.so => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
(0x2ab6cffd6000)
libirng.so => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
(0x2ab6d086f000)
libintlc.so.5 => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
(0x2ab6d0a82000)

/lib64/ld-linux-k1om.so.2 (0x2ab6ce243000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml 
yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca 
memheap_base_verbose 100 --leave-session-attached --mca 
mca_component_show_load_errors 1 $PWD/mic.out

--
A deprecated MCA variable value was specified in the environment or
on the command line.  

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-15 Thread Andy Riebs

  
  
Gilles and Ralph, thanks!

$ shmemrun -H mic0,mic1 -n 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=1M
$PWD/mic.out
[atl1-01-mic0:192474] [[29886,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 440
Hello World from process 0 of 2
Hello World from process 1 of 2
$

This was built with the openmpi-dev-1487-g9c6d452.tar.bz2 nightly
master. Oddly, -static-intel didn't work. Fortunately, -rpath did.
I'll follow-up in the next day or so with the winning build recipes
for both MPI and the user app to wrap up this note and, one hopes,
save others from some frustration in the future.
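
Pending that follow-up, the working build was presumably the earlier MIC
configure line plus the rpath Gilles suggested; a sketch, not the verified
recipe:

    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
        --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
        LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" \
        --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
        --enable-debug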

Andy


On 04/14/2015 11:10 PM, Ralph Castain wrote:

  I think Gilles may be correct here. In reviewing the code, it
  appears we have never (going back to the 1.6 series, at least)
  forwarded the local LD_LIBRARY_PATH to the remote node when
  exec’ing the orted. The only thing we have done is to set the PATH
  and LD_LIBRARY_PATH to support the OMPI prefix - not any
  supporting libs.
  
  
  What we have required, therefore, is that your path
be setup properly in the remote .bashrc (or pick your shell) to
handle the libraries.
  
  
  As I indicated, the -x option only forwards envars
to the application procs themselves, not the orted. I could try
to add another cmd line option to forward things for the orted,
but the concern we’ve had in the past (and still harbor) is that
the ssh cmd line is limited in length. Thus, adding some
potentially long paths to support this option could overwhelm it
and cause failures.
  
  
  I’d try the static method first, or perhaps the
LDFLAGS Gilles suggested.
  
  
  

  
On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet wrote:


   Andy,

what about reconfiguring Open MPI with
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"

?

IIRC, an other option is : LDFLAGS="-static-intel"

last but not least, you can always replace orted with a
simple script that sets the LD_LIBRARY_PATH and exec the
original orted

do you have the same behaviour on non MIC hardware when
Open MPI is compiled with intel compilers ?
if it works on non MIC hardware, the root cause could be
in the sshd_config of the MIC that does not
accept to receive LD_LIBRARY_PATH

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:

 Hmmm…certainly looks that way.
  I’ll investigate.
  

  
On Apr 14, 2015, at 6:06 AM, Andy Riebs wrote:


  
Hi Ralph,

Still no happiness... It looks like my
LD_LIBRARY_PATH just isn't getting
propagated?

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
    linux-vdso.so.1 => 
(0x7fffa1d3b000)
    libopen-rte.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0
(0x2ab6ce464000)
    libopen-pal.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0
(0x2ab6ce7d3000)
    libm.so.6 => /lib64/libm.so.6
(0x2ab6cebbd000)
    libdl.so.2 => /lib64/libdl.so.2
(0x2ab6ceded000)
    librt.so.1 => /lib64/librt.so.1
(0x2ab6ceff1000)
    libutil.so.1 =>
/lib64/libutil.so.1 (0x2ab6cf1f9000)
    libgcc_s.so.1 =>
/lib64/libgcc_s.so.1 (0x2ab6cf3fc000)
    libpthread.so.0 =>
/lib64/libpthread.so.0 (0x2ab6cf60f000)
    libc.so.6 => 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-15 Thread Ralph Castain
I think Gilles may be correct here. In reviewing the code, it appears we have 
never (going back to the 1.6 series, at least) forwarded the local 
LD_LIBRARY_PATH to the remote node when exec’ing the orted. The only thing we 
have done is to set the PATH and LD_LIBRARY_PATH to support the OMPI prefix - 
not any supporting libs.

What we have required, therefore, is that your path be setup properly in the 
remote .bashrc (or pick your shell) to handle the libraries.
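
In practice that means something like the following line on each remote node
(a sketch; the runtime path is the Intel MIC compiler directory seen elsewhere
in this thread):

    # appended to ~/.bashrc on the MIC
    export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH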

As I indicated, the -x option only forwards envars to the application procs 
themselves, not the orted. I could try to add another cmd line option to 
forward things for the orted, but the concern we’ve had in the past (and still 
harbor) is that the ssh cmd line is limited in length. Thus, adding some 
potentially long paths to support this option could overwhelm it and cause 
failures.

I’d try the static method first, or perhaps the LDFLAGS Gilles suggested.


> On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet  wrote:
> 
> Andy,
> 
> what about reconfiguring Open MPI with 
> LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ?
> 
> IIRC, an other option is : LDFLAGS="-static-intel"
> 
> last but not least, you can always replace orted with a simple script that 
> sets the LD_LIBRARY_PATH and exec the original orted
> 
> do you have the same behaviour on non MIC hardware when Open MPI is compiled 
> with intel compilers ?
> if it works on non MIC hardware, the root cause could be in the sshd_config 
> of the MIC that does not
> accept to receive LD_LIBRARY_PATH
> 
> my 0.02 US$
> 
> Gilles
> 
> On 4/14/2015 11:20 PM, Ralph Castain wrote:
>> Hmmm…certainly looks that way. I’ll investigate.
>> 
>>> On Apr 14, 2015, at 6:06 AM, Andy Riebs wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting 
>>> propagated?
>>> 
>>> $ ldd /home/ariebs/mic/mpi-nightly/bin/orted
>>> linux-vdso.so.1 =>  (0x7fffa1d3b000)
>>> libopen-rte.so.0 => 
>>> /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x2ab6ce464000)
>>> libopen-pal.so.0 => 
>>> /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x2ab6ce7d3000)
>>> libm.so.6 => /lib64/libm.so.6 (0x2ab6cebbd000)
>>> libdl.so.2 => /lib64/libdl.so.2 (0x2ab6ceded000)
>>> librt.so.1 => /lib64/librt.so.1 (0x2ab6ceff1000)
>>> libutil.so.1 => /lib64/libutil.so.1 (0x2ab6cf1f9000)
>>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2ab6cf3fc000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x2ab6cf60f000)
>>> libc.so.6 => /lib64/libc.so.6 (0x2ab6cf82c000)
>>> libimf.so => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
>>> (0x2ab6cfb84000)
>>> libsvml.so => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
>>> (0x2ab6cffd6000)
>>> libirng.so => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
>>> (0x2ab6d086f000)
>>> libintlc.so.5 => 
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
>>> (0x2ab6d0a82000)
>>> /lib64/ld-linux-k1om.so.2 (0x2ab6ce243000)
>>> 
>>> $ echo $LD_LIBRARY_PATH
>>> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
>>> 
>>> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda 
>>> --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 
>>> 100 --leave-session-attached --mca mca_component_show_load_errors 1 
>>> $PWD/mic.out
>>> --
>>> A deprecated MCA variable value was specified in the environment or
>>> on the command line.  Deprecated MCA variables should be avoided;
>>> they may disappear in future releases.
>>> 
>>>   Deprecated variable: mca_component_show_load_errors
>>>   New variable:mca_base_component_show_load_errors
>>> --
>>> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
>>> [atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
>>> path NULL
>>> [atl1-02-mic0:16183] 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-14 Thread Gilles Gouaillardet

Andy,

what about reconfiguring Open MPI with 
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" 
?


IIRC, another option is: LDFLAGS="-static-intel"

last but not least, you can always replace orted with a simple script 
that sets the LD_LIBRARY_PATH and exec the original orted
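
a rough sketch of such a wrapper, assuming the real orted has been renamed to
orted.real (all names and paths here are illustrative, taken from this thread):

    #!/bin/sh
    # stand-in for orted: set the Intel runtime path, then chain to the
    # real (renamed) orted binary with the original arguments
    export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
    exec /home/ariebs/mic/mpi-nightly/bin/orted.real "$@"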


do you have the same behaviour on non MIC hardware when Open MPI is 
compiled with intel compilers ?
if it works on non MIC hardware, the root cause could be in the 
sshd_config of the MIC that does not 
accept to receive LD_LIBRARY_PATH
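
if that is the culprit, the usual ssh-side fix would be along these lines
(a sketch only; whether the MIC's sshd really filters the variable is an
assumption to verify):

    # on the MIC, in /etc/ssh/sshd_config (restart sshd afterwards):
    AcceptEnv LD_LIBRARY_PATH
    # on the host, in ~/.ssh/config or /etc/ssh/ssh_config:
    SendEnv LD_LIBRARY_PATH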

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:

Hmmm…certainly looks that way. I’ll investigate.

On Apr 14, 2015, at 6:06 AM, Andy Riebs wrote:


Hi Ralph,

Still no happiness... It looks like my LD_LIBRARY_PATH just isn't 
getting propagated?


$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
linux-vdso.so.1 =>  (0x7fffa1d3b000)
libopen-rte.so.0 => 
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x2ab6ce464000)
libopen-pal.so.0 => 
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x2ab6ce7d3000)

libm.so.6 => /lib64/libm.so.6 (0x2ab6cebbd000)
libdl.so.2 => /lib64/libdl.so.2 (0x2ab6ceded000)
librt.so.1 => /lib64/librt.so.1 (0x2ab6ceff1000)
libutil.so.1 => /lib64/libutil.so.1 (0x2ab6cf1f9000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2ab6cf3fc000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2ab6cf60f000)
libc.so.6 => /lib64/libc.so.6 (0x2ab6cf82c000)
libimf.so => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
(0x2ab6cfb84000)
libsvml.so => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
(0x2ab6cffd6000)
libirng.so => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
(0x2ab6d086f000)
libintlc.so.5 => 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
(0x2ab6d0a82000)

/lib64/ld-linux-k1om.so.2 (0x2ab6ce243000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml 
yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca 
memheap_base_verbose 100 --leave-session-attached --mca 
mca_component_show_load_errors 1 $PWD/mic.out

--
A deprecated MCA variable value was specified in the environment or
on the command line.  Deprecated MCA variables should be avoided;
they may disappear in future releases.

  Deprecated variable: mca_component_show_load_errors
  New variable: mca_base_component_show_load_errors
--
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
[atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh 
: rsh path NULL
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component 
[isolated]
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component 
[isolated] set priority to 0

[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [slurm]
[atl1-02-mic0:16183] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module

[atl1-02-mic0:16183] mca:base:select:(  plm) Selected component [rsh]
[atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 
nodename hash 4238360777

[atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh 
path NULL

[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
[atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
[atl1-02-mic0:16183] [[33630,0],0] using dash_host
[atl1-02-mic0:16183] [[33630,0],0] checking node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon 
[[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new 
daemon 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-14 Thread Ralph Castain
Hmmm…certainly looks that way. I’ll investigate.

> On Apr 14, 2015, at 6:06 AM, Andy Riebs  wrote:
> 
> Hi Ralph,
> 
> Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting 
> propagated?
> 
> $ ldd /home/ariebs/mic/mpi-nightly/bin/orted
> linux-vdso.so.1 =>  (0x7fffa1d3b000)
> libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 
> (0x2ab6ce464000)
> libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 
> (0x2ab6ce7d3000)
> libm.so.6 => /lib64/libm.so.6 (0x2ab6cebbd000)
> libdl.so.2 => /lib64/libdl.so.2 (0x2ab6ceded000)
> librt.so.1 => /lib64/librt.so.1 (0x2ab6ceff1000)
> libutil.so.1 => /lib64/libutil.so.1 (0x2ab6cf1f9000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2ab6cf3fc000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x2ab6cf60f000)
> libc.so.6 => /lib64/libc.so.6 (0x2ab6cf82c000)
> libimf.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
> (0x2ab6cfb84000)
> libsvml.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
> (0x2ab6cffd6000)
> libirng.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
> (0x2ab6d086f000)
> libintlc.so.5 => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
> (0x2ab6d0a82000)
> /lib64/ld-linux-k1om.so.2 (0x2ab6ce243000)
> 
> $ echo $LD_LIBRARY_PATH
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
> 
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca 
> btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 
> --leave-session-attached --mca mca_component_show_load_errors 1 $PWD/mic.out
> --
> A deprecated MCA variable value was specified in the environment or
> on the command line.  Deprecated MCA variables should be avoided;
> they may disappear in future releases.
> 
>   Deprecated variable: mca_component_show_load_errors
>   New variable:    mca_base_component_show_load_errors
> --
> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
> [atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
> path NULL
> [atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [isolated]
> [atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [slurm]
> [atl1-02-mic0:16183] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [atl1-02-mic0:16183] mca:base:select:(  plm) Selected component [rsh]
> [atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 nodename hash 
> 4238360777
> [atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
> [atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
> [atl1-02-mic0:16183] [[33630,0],0] using dash_host
> [atl1-02-mic0:16183] [[33630,0],0] checking node mic1
> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon 
> [[33630,0],1]
> [atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new daemon 
> [[33630,0],1] to node mic1
> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote shell as 
> local shell
> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
> [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
> /usr/bin/ssh  
> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; 
> 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-14 Thread Andy Riebs

  
  
Hi Ralph,

Still no happiness... It looks like my LD_LIBRARY_PATH just isn't
getting propagated?

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
    linux-vdso.so.1 =>  (0x7fffa1d3b000)
    libopen-rte.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0
(0x2ab6ce464000)
    libopen-pal.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0
(0x2ab6ce7d3000)
    libm.so.6 => /lib64/libm.so.6 (0x2ab6cebbd000)
    libdl.so.2 => /lib64/libdl.so.2 (0x2ab6ceded000)
    librt.so.1 => /lib64/librt.so.1 (0x2ab6ceff1000)
    libutil.so.1 => /lib64/libutil.so.1 (0x2ab6cf1f9000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1
(0x2ab6cf3fc000)
    libpthread.so.0 => /lib64/libpthread.so.0
(0x2ab6cf60f000)
    libc.so.6 => /lib64/libc.so.6 (0x2ab6cf82c000)
    libimf.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
(0x2ab6cfb84000)
    libsvml.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so
(0x2ab6cffd6000)
    libirng.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so
(0x2ab6d086f000)
    libintlc.so.5 =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5
(0x2ab6d0a82000)
    /lib64/ld-linux-k1om.so.2 (0x2ab6ce243000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml
yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca
memheap_base_verbose 100 --leave-session-attached --mca
mca_component_show_load_errors 1 $PWD/mic.out
--
A deprecated MCA variable value was specified in the environment or
on the command line.  Deprecated MCA variables should be avoided;
they may disappear in future releases.

  Deprecated variable: mca_component_show_load_errors
  New variable:    mca_base_component_show_load_errors
--
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component
[rsh]
[atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh
: rsh path NULL
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component
[rsh] set priority to 10
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component
[isolated]
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component
[isolated] set priority to 0
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component
[slurm]
[atl1-02-mic0:16183] mca:base:select:(  plm) Skipping component
[slurm]. Query failed to return a module
[atl1-02-mic0:16183] mca:base:select:(  plm) Selected component
[rsh]
[atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183
nodename hash 4238360777
[atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh
path NULL
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
[atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged
allocation
[atl1-02-mic0:16183] [[33630,0],0] using dash_host
[atl1-02-mic0:16183] [[33630,0],0] checking node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon
[[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new
daemon [[33630,0],1] to node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote
shell as local shell
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
    /usr/bin/ssh 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Ralph Castain
Weird. I’m not sure what to try at that point - IIRC, building static won’t 
resolve this problem (but you could try and see). You could add the following 
to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see 
where it is looking for libimf since it (and not mic.out) is the one complaining


> On Apr 13, 2015, at 1:58 PM, Andy Riebs  wrote:
> 
> Ralph and Nathan,
> 
> The problem may be something trivial, as I don't typically use "shmemrun" to 
> start jobs. With the following, I *think* I've  demonstrated that the problem 
> library is where it belongs on the remote system:
> 
> $ ldd mic.out
> linux-vdso.so.1 =>  (0x7fffb83ff000)
> liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 
> (0x2b059cfbb000)
> libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 
> (0x2b059d35a000)
> libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 
> (0x2b059d7e3000)
> libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 
> (0x2b059db53000)
> libm.so.6 => /lib64/libm.so.6 (0x2b059df3d000)
> libdl.so.2 => /lib64/libdl.so.2 (0x2b059e16c000)
> libutil.so.1 => /lib64/libutil.so.1 (0x2b059e371000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b059e574000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x2b059e786000)
> libc.so.6 => /lib64/libc.so.6 (0x2b059e9a4000)
> librt.so.1 => /lib64/librt.so.1 (0x2b059ecfc000)
> libimf.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
> (0x2b059ef04000)
> libsvml.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so 
> (0x2b059f356000)
> libirng.so => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so 
> (0x2b059fbef000)
> libintlc.so.5 => 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 
> (0x2b059fe02000)
> /lib64/ld-linux-k1om.so.2 (0x2b059cd9a000)
> $ echo $LD_LIBRARY_PATH 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
> $ ssh mic1 file 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit 
> LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), 
> dynamically linked, not stripped
> $ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> ...
> 
> 
> On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
>> For talking between PHIs on the same system I recommend using the scif
>> BTL NOT tcp.
>> 
>> That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
>> system. It looks like it can't find the intel compiler libraries.
>> 
>> -Nathan Hjelm
>> HPC-5, LANL
>> 
>> On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
>>>Progress!  I can run my trivial program on the local PHI, but not the
>>>other PHI, on the system. Here are the interesting parts:
>>> 
>>>A pretty good recipe with last night's nightly master:
>>> 
>>>$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
>>>CXX="icpc -mmic" \
>>>--build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>> AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
>>>LD=x86_64-k1om-linux-ld \
>>> --enable-mpirun-prefix-by-default --disable-io-romio
>>>--disable-mpi-fortran \
>>> --enable-orterun-prefix-by-default \
>>> --enable-debug
>>>$ make && make install
>>>$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
>>>yoda --mca btl sm,self,tcp $PWD/mic.out
>>>Hello World from process 0 of 2
>>>Hello World from process 1 of 2
>>>$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
>>>yoda --mca btl openib,sm,self $PWD/mic.out
>>>Hello World from process 0 of 2
>>>Hello World from process 1 of 2
>>>$
>>> 
>>>However, I can't seem to cross 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Andy Riebs

  
  
Ralph and Nathan,

The problem may be something trivial, as I don't typically use
"shmemrun" to start jobs. With the following, I *think* I've 
demonstrated that the problem library is where it belongs on the
remote system:

$ ldd mic.out
    linux-vdso.so.1 =>  (0x7fffb83ff000)
    liboshmem.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x2b059cfbb000)
    libmpi.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x2b059d35a000)
    libopen-rte.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0
(0x2b059d7e3000)
    libopen-pal.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0
(0x2b059db53000)
    libm.so.6 => /lib64/libm.so.6 (0x2b059df3d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x2b059e16c000)
    libutil.so.1 => /lib64/libutil.so.1 (0x2b059e371000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1
(0x2b059e574000)
    libpthread.so.0 => /lib64/libpthread.so.0
(0x2b059e786000)
    libc.so.6 => /lib64/libc.so.6 (0x2b059e9a4000)
    librt.so.1 => /lib64/librt.so.1 (0x2b059ecfc000)
    libimf.so =>
  /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
(0x2b059ef04000)
    libsvml.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so
(0x2b059f356000)
    libirng.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so
(0x2b059fbef000)
    libintlc.so.5 =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5
(0x2b059fe02000)
    /lib64/ld-linux-k1om.so.2 (0x2b059cd9a000)
$ echo $LD_LIBRARY_PATH 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so:
ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om),
version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading
  shared libraries: libimf.so: cannot open shared object file:
No such file or directory
...


On 04/13/2015 04:25 PM, Nathan Hjelm wrote:


  
For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.

That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:

  
   Progress!  I can run my trivial program on the local PHI, but not the
   other PHI, on the system. Here are the interesting parts:

   A pretty good recipe with last night's nightly master:

   $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
   CXX="icpc -mmic" \
   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
   LD=x86_64-k1om-linux-ld \
--enable-mpirun-prefix-by-default --disable-io-romio
   --disable-mpi-fortran \
--enable-orterun-prefix-by-default \
--enable-debug
   $ make && make install
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
   yoda --mca btl sm,self,tcp $PWD/mic.out
   Hello World from process 0 of 2
   Hello World from process 1 of 2
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
   yoda --mca btl openib,sm,self $PWD/mic.out
   Hello World from process 0 of 2
   Hello World from process 1 of 2
   $

   However, I can't seem to cross the fabric. I can ssh freely back and forth
   between mic0 and mic1. However, running the next 2 tests from mic0, it 
   certainly seems like the second one should work, too:

   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda
   --mca btl sm,self,tcp $PWD/mic.out
   Hello World from process 0 of 2
   Hello World from process 1 of 2
   $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda
   --mca btl sm,self,tcp $PWD/mic.out
   

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Nathan Hjelm

For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.

That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL
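For reference, a scif-based launch might look like this (a sketch only, reusing the options from the other commands in this thread):

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -N 2 --mca btl scif,self $PWD/mic.out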

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
>Progress!  I can run my trivial program on the local PHI, but not the
>other PHI, on the system. Here are the interesting parts:
> 
>A pretty good recipe with last night's nightly master:
> 
>$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
>CXX="icpc -mmic" \
>--build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
> AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
>LD=x86_64-k1om-linux-ld \
> --enable-mpirun-prefix-by-default --disable-io-romio
>--disable-mpi-fortran \
> --enable-orterun-prefix-by-default \
> --enable-debug
>$ make && make install
>$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
>yoda --mca btl sm,self,tcp $PWD/mic.out
>Hello World from process 0 of 2
>Hello World from process 1 of 2
>$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
>yoda --mca btl openib,sm,self $PWD/mic.out
>Hello World from process 0 of 2
>Hello World from process 1 of 2
>$
> 
>However, I can't seem to cross the fabric. I can ssh freely back and forth
>between mic0 and mic1. However, running the next 2 tests from mic0, it 
>certainly seems like the second one should work, too:
> 
>$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda
>--mca btl sm,self,tcp $PWD/mic.out
>Hello World from process 0 of 2
>Hello World from process 1 of 2
>$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda
>--mca btl sm,self,tcp $PWD/mic.out
>/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>libraries: libimf.so: cannot open shared object file: No such file or
>directory
>--
>ORTE was unable to reliably start one or more daemons.
>This usually is caused by:
> 
>* not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
>* lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
>* the inability to write startup files into /tmp
>(--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to
>use.
> 
>*  compilation of the orted with dynamic libraries when static are
>required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
>* an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no route found between
>  them. Please check network connectivity (including firewalls
>  and network routing requirements).
> ...
>$
> 
>(Note that I get the same results with "--mca btl openib,sm,self")
> 
>$ ssh mic1 file
>/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF
>64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1
>(SYSV), dynamically linked, not stripped
>$ shmemrun -x
>
> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
>-H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
>/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
>libraries: libimf.so: cannot open shared object file: No such file or
>directory
>--
>ORTE was unable to reliably start one or more daemons.
>This usually is caused by:
> 
>* not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
>* lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
>* the inability to write startup files into /tmp
>(--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to
>use.
> 
>*  compilation of the orted with dynamic libraries when static are
>required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
>* an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Ralph Castain
I don’t see that LD_PRELOAD showing up on the ssh path, Andy

> /usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; 
> export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; 
> export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted 
> --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca 
> orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" 
> -mca orte_hnp_uri 
> "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" 
> --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose 
> "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode 
> “2"

The -x option doesn’t impact the ssh line - it only forwards the value to the 
application’s environment. You’ll need to include the path in your 
LD_LIBRARY_PATH
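One way to do that (a sketch; it assumes the MIC's non-interactive shells read ~/.bashrc, which may differ on your setup) is to put the Intel runtime directory into LD_LIBRARY_PATH on the card itself, so the ssh-launched orted inherits it:

$ ssh mic1 'echo "export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:\$LD_LIBRARY_PATH" >> ~/.bashrc'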


> On Apr 13, 2015, at 1:06 PM, Andy Riebs  wrote:
> 
> Progress!  I can run my trivial program on the local PHI, but not the other 
> PHI, on the system. Here are the interesting parts:
> 
> A pretty good recipe with last night's nightly master:
> 
> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc 
> -mmic" \
> --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>  AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib  
> LD=x86_64-k1om-linux-ld \
>  --enable-mpirun-prefix-by-default --disable-io-romio 
> --disable-mpi-fortran \
>  --enable-orterun-prefix-by-default \
>  --enable-debug
> $ make && make install
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda 
> --mca btl sm,self,tcp $PWD/mic.out
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda 
> --mca btl openib,sm,self $PWD/mic.out
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ 
> 
> However, I can't seem to cross the fabric. I can ssh freely back and forth 
> between mic0 and mic1. However, running the next 2 tests from mic0, it  
> certainly seems like the second one should work, too:
> 
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda --mca 
> btl sm,self,tcp $PWD/mic.out
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca 
> btl sm,self,tcp $PWD/mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
>  ...
> $
> 
> (Note that I get the same results with "--mca btl openib,sm,self")
> 
> 
> $ ssh mic1 file 
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
> /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit 
> LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), 
> dynamically linked, not stripped
> $ shmemrun -x 
> LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so 
> -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
> /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Andy Riebs

  
  
Progress!  I can run my trivial program on the local PHI, but not
the other PHI, on the system. Here are the interesting parts:

A pretty good recipe with last night's nightly master:

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
 AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
LD=x86_64-k1om-linux-ld \
 --enable-mpirun-prefix-by-default --disable-io-romio
--disable-mpi-fortran \
 --enable-orterun-prefix-by-default \
 --enable-debug
$ make && make install
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca
spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca
spml yoda --mca btl openib,sm,self $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ 

However, I can't seem to cross the fabric. I can ssh freely back and
forth between mic0 and mic1. However, running the next 2 tests from
mic0, it  certainly seems like the second one should work, too:

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml
yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml
yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading
  shared libraries: libimf.so: cannot open shared object file: No
  such file or directory
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with
--enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location
to use.

*  compilation of the orted with dynamic libraries when static are
required
  (e.g., on Cray). Please check your configure cmd line and consider
using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
 ...
$

(Note that I get the same results with "--mca btl openib,sm,self")


$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so:
ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om),
version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -x 
LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
-H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file
or directory
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with
--enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location
to use.

*  compilation of the orted with dynamic libraries when static are
required
  (e.g., on Cray). Please check your configure cmd line and consider
using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

Following here is 
- IB information
- Running the failing case with lots of debugging information. (As
you might imagine, I've tried 17 ways from Sunday to try to ensure
that libimf.so is 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Andy Riebs

  
  
Hi Ralph,

Here are the results with last night's "master" nightly,
openmpi-dev-1487-g9c6d452.tar.bz2, and adding the
memheap_base_verbose option (yes, it looks like the "ERROR_LOG"
problem has gone away):

$ cat /proc/sys/kernel/shmmax
33554432
$ cat /proc/sys/kernel/shmall
2097152
$ cat /proc/sys/kernel/shmmni
4096
$ export SHMEM_SYMMETRIC_HEAP=1M
$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
[rsh]
[atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent
ssh : rsh path NULL
[atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
[rsh] set priority to 10
[atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
[isolated]
[atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
[isolated] set priority to 0
[atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
[slurm]
[atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
[slurm]. Query failed to return a module
[atl1-01-mic0:190439] mca:base:select:(  plm) Selected component
[rsh]
[atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
nodename hash 4121194178
[atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
[atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh
path NULL
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
allocation
[atl1-01-mic0:190439] [[31875,0],0] using dash_host
[atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190439] [[31875,0],0] ignoring myself
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
allocation
[atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
[31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof
for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1]
registered
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is
not a dynamic spawn
[atl1-01-mic0:190441] mca: base: components_register: registering
memheap components
[atl1-01-mic0:190441] mca: base: components_register: found loaded
component buddy
[atl1-01-mic0:190441] mca: base: components_register: component
buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: registering
memheap components
[atl1-01-mic0:190442] mca: base: components_register: found loaded
component buddy
[atl1-01-mic0:190442] mca: base: components_register: component
buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: found loaded
component ptmalloc
[atl1-01-mic0:190442] mca: base: components_register: component
ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_register: found loaded
component ptmalloc
[atl1-01-mic0:190441] mca: base: components_register: component
ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_open: opening memheap
components
[atl1-01-mic0:190441] mca: base: components_open: found loaded
component buddy
[atl1-01-mic0:190441] mca: base: components_open: component buddy
open function successful
[atl1-01-mic0:190441] mca: base: components_open: found loaded
component ptmalloc
[atl1-01-mic0:190441] mca: base: components_open: component ptmalloc
open function successful
[atl1-01-mic0:190442] mca: base: components_open: opening memheap
components
[atl1-01-mic0:190442] mca: base: components_open: found loaded
component buddy
[atl1-01-mic0:190442] mca: base: components_open: component buddy
open function successful
[atl1-01-mic0:190442] mca: base: components_open: found loaded
component ptmalloc
[atl1-01-mic0:190442] mca: base: components_open: component ptmalloc
open function successful
[atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
mca_memheap_base_alloc_init() Memheap alloc memory: 270532608
byte(s), 1 segments by method: 1
[atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
mca_memheap_base_alloc_init() Memheap alloc memory: 270532608
byte(s), 1 segments by method: 1
[atl1-01-mic0:190442] base/memheap_base_static.c:205 -
_load_segments() add: 0060-00601000 rw-p  00:11
6029314    /home/ariebs/bench/hello/mic.out

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Riebs, Andy
My fault, I thought the tar ball name looked funny :-)

Will try again tomorrow

Andy

--
Andy Riebs
andy.ri...@hp.com


 Original message 
From: Ralph Castain
Date:04/12/2015 3:10 PM (GMT-05:00)
To: Open MPI Users
Subject: Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon 
Phi/MIC

Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so 
now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the 
memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line so we 
can see why none of the memheap components are being selected.


On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2
(https://www.open-mpi.org/nightly/v1.8/openmpi-v1.8.4-202-gc2da6a5.tar.bz2):

$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 
$PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [isolated]
[atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [slurm]
[atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[atl1-01-mic0:190189] mca:base:select:(  plm) Selected component [rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename hash 
4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file 
base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job 
[32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a 
dynamic spawn
--
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
initialize - aborting
--
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with 
errorcode -1.
--
--
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:190192
--
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit 
commands
--
shmemrun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Proces

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Ralph Castain
Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so 
now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the 
memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line so we 
can see why none of the memheap components are being selected.


> On Apr 12, 2015, at 11:30 AM, Andy Riebs  wrote:
> 
> Hi Ralph,
> 
> Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:
> 
> $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 
> $PWD/mic.out
> [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [rsh]
> [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
> path NULL
> [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [isolated]
> [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [slurm]
> [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [atl1-01-mic0:190189] mca:base:select:(  plm) Selected component [rsh]
> [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename 
> hash 4121194178
> [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
> [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
> [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
> [atl1-01-mic0:190189] [[32137,0],0] using dash_host
> [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
> [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
> [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
> [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 440
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job 
> [32137,1]
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a 
> dynamic spawn
> --
> It looks like SHMEM_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during SHMEM_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open SHMEM
> developer):
> 
>   mca_memheap_base_select() failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --
> [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
> initialize - aborting
> [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
> initialize - aborting
> --
> SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with 
> errorcode -1.
> --
> --
> A SHMEM process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
> Local host: atl1-01-mic0
> PID:190192
> --
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit 
> commands
> --
> shmemrun detected that one or more processes exited with non-zero status, 
> thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[32137,1],0]
>   Exit code:255
> --
> [atl1-01-mic0:190189] 1 more process has sent help message 
> help-shmem-runtime.txt / shmem_init:startup:internal-failure

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Andy Riebs

  
  
Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
[rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent
ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
[rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
[isolated]
[atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
[isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
[slurm]
[atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component
[slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:(  plm) Selected component
[rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189
nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh
path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in
allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job
[32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof
for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is
not a dynamic spawn
--
It looks like SHMEM_INIT failed for some reason; your parallel
process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's
some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
--
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
with errorcode -1.
--
--
A SHMEM process is aborting at a time when it cannot guarantee that
all
of its peer processes in the job will be killed properly.  You
should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:    190192
--
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
orted_exit commands
--
shmemrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32137,1],0]
  Exit code:    255
--
[atl1-01-mic0:190189] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-11 Thread Ralph Castain
Got it - thanks. I fixed that ERROR_LOG issue (I think- please verify). I 
suspect the memheap issue relates to something else, but I probably need to let 
the OSHMEM folks comment on it


> On Apr 11, 2015, at 9:52 AM, Andy Riebs  wrote:
> 
> Everything is built on the Xeon side, with the icc "-mmic" switch. I then ssh 
> into one of the PHIs, and run shmemrun from there.
> 
> 
> On 04/11/2015 12:00 PM, Ralph Castain wrote:
>> Let me try to understand the setup a little better. Are you running shmemrun 
>> on the PHI itself? Or is it running on the host processor, and you are 
>> trying to spawn a process onto the Phi?
>> 
>> 
>>> On Apr 11, 2015, at 7:55 AM, Andy Riebs wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> Yes, this is attempting to get OSHMEM to run on the Phi.
>>> 
>>> I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with
>>> 
>>> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC=icc -mmic 
>>> CXX=icpc -mmic\
>>> --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux\
>>>  AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib  
>>> LD=x86_64-k1om-linux-ld   \
>>>  --enable-mpirun-prefix-by-default --disable-io-romio 
>>> --disable-mpi-fortran\
>>>  --enable-debug 
>>> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>>> 
>>> (Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as 
>>> the build complained that mca oob/ud needed mca common-verbs.)
>>> 
>>> With that configuration, here is what I am seeing now...
>>> 
>>> $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 
>>> $PWD/mic.out
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [rsh]
>>> [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
>>> path NULL
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [isolated]
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [isolated] 
>>> set priority to 0
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [slurm]
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component [slurm]. 
>>> Query failed to return a module
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Selected component [rsh]
>>> [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename 
>>> hash 4121194178
>>> [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path 
>>> NULL
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
>>> [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
>>> [atl1-01-mic0:189895] [[32419,0],0] using dash_host
>>> [atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
>>> [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
>>> [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
>>> [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file 
>>> base/plm_base_launch_support.c at line 440
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job 
>>> [32419,1]
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a 
>>> dynamic spawn
>>> [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed 
>>> to initialize - aborting
>>> [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed 
>>> to initialize - aborting
>>> --
>>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during SHMEM_INIT; some of which are due to configuration or 
>>> environment
>>> problems.  This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open SHMEM
>>> developer):
>>> 
>>>   mca_memheap_base_select() failed
>>>   --> Returned "Error" (-1) instead of "Success" (0)
>>> --
>>> --
>>> SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with 
>>> errorcode -1.
>>> --
>>> 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-11 Thread Andy Riebs

  
  
Everything is built on the Xeon side, with the icc "-mmic" switch. I
then ssh into one of the PHIs, and run shmemrun from there.


On 04/11/2015 12:00 PM, Ralph Castain wrote:


  
  Let me try to understand the setup a little better. Are you
  running shmemrun on the PHI itself? Or is it running on the host
  processor, and you are trying to spawn a process onto the Phi?
  
  
  

  
On Apr 11, 2015, at 7:55 AM, Andy Riebs  wrote:


   Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.

I grabbed openmpi-dev-1484-g033418f.tar.bz2 and
configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly   
CC=icc -mmic CXX=icpc -mmic    \
    --build=x86_64-unknown-linux-gnu
--host=x86_64-k1om-linux    \
 AR=x86_64-k1om-linux-ar
RANLIB=x86_64-k1om-linux-ranlib 
LD=x86_64-k1om-linux-ld   \
 --enable-mpirun-prefix-by-default
--disable-io-romio --disable-mpi-fortran    \
 --enable-debug
--enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the
"--enable-mca-no-build" option, as the build complained
that mca oob/ud needed mca common-verbs.)

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying
component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup
on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of
component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying
component [isolated]
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of
component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying
component [slurm]
[atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:(  plm) Selected
component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial
bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final
jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on
agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive
start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working
unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node
atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on
job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not
found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps
for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
[32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job
[32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 -
shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 -
shmem_init() SHMEM failed to initialize - aborting
--
It looks 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-11 Thread Ralph Castain
Let me try to understand the setup a little better. Are you running shmemrun on 
the PHI itself? Or is it running on the host processor, and you are trying to 
spawn a process onto the Phi?


> On Apr 11, 2015, at 7:55 AM, Andy Riebs  wrote:
> 
> Hi Ralph,
> 
> Yes, this is attempting to get OSHMEM to run on the Phi.
> 
> I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with
> 
> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC=icc -mmic CXX=icpc 
> -mmic\
> --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux\
>  AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib  
> LD=x86_64-k1om-linux-ld   \
>  --enable-mpirun-prefix-by-default --disable-io-romio 
> --disable-mpi-fortran\
>  --enable-debug 
> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
> 
> (Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as 
> the build complained that mca oob/ud needed mca common-verbs.)
> 
> With that configuration, here is what I am seeing now...
> 
> $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
> $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5 
> $PWD/mic.out
> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [rsh]
> [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
> path NULL
> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [isolated]
> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [slurm]
> [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [atl1-01-mic0:189895] mca:base:select:(  plm) Selected component [rsh]
> [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename 
> hash 4121194178
> [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
> [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
> [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
> [atl1-01-mic0:189895] [[32419,0],0] using dash_host
> [atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
> [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
> [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
> [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 440
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job 
> [32419,1]
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a 
> dynamic spawn
> [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
> initialize - aborting
> [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
> initialize - aborting
> --
> It looks like SHMEM_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during SHMEM_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open SHMEM
> developer):
> 
>   mca_memheap_base_select() failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --
> --
> SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with 
> errorcode -1.
> --
> --
> A SHMEM process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
> Local host: atl1-01-mic0
> PID:189899
> --
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> [atl1-01-mic0:189895] 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-11 Thread Andy Riebs

  
  
Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.

I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly    CC=icc -mmic
CXX=icpc -mmic    \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux    \
 AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib 
LD=x86_64-k1om-linux-ld   \
 --enable-mpirun-prefix-by-default --disable-io-romio
--disable-mpi-fortran    \
 --enable-debug
--enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build"
option, as the build complained that mca oob/ud needed mca
common-verbs.)

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component
[rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent
ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of component
[rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component
[isolated]
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of component
[isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component
[slurm]
[atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component
[slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:(  plm) Selected component
[rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895
nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh
path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged
allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in
allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job
[32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof
for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1]
registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is
not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
--
It looks like SHMEM_INIT failed for some reason; your parallel
process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's
some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
--
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0)
with errorcode -1.
--
--
A SHMEM process is aborting at a time when it cannot guarantee that
all
of its peer processes in the job will be killed properly.  You
should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:    189899
--
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending
orted_exit 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-10 Thread Ralph Castain
Andy - could you please try the current 1.8.5 nightly tarball and see if it 
helps? The error log indicates that it is failing to get the topology from some 
daemon, I’m assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put -mca 
plm_base_verbose on the shmemrun cmd to get more help
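For example (a sketch only, mirroring the commands shown elsewhere in this thread; "..." stands for your existing cross-compile options):

$ ./configure --prefix=/home/ariebs/mic/mpi ... --enable-debug
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 ./mic.out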


> On Apr 10, 2015, at 11:55 AM, Andy Riebs  wrote:
> 
> Summary: MPI jobs work fine, SHMEM jobs work just often enough to be 
> tantalizing, on an Intel Xeon Phi/MIC system.
> 
> Longer version
> 
> Thanks to the excellent write-up last June 
> ( 
> ), I have 
> been able to build a version of Open MPI for the Xeon Phi coprocessor that 
> runs MPI jobs on the Phi coprocessor with no problem, but not SHMEM jobs.  
> Just at the point where I was about to document the problems I was having 
> with SHMEM, my trivial SHMEM job worked. And then failed when I tried to run 
> it again, immediately afterwards. I have a feeling I may be in uncharted  
> territory here.
> 
> Environment
> RHEL 6.5
> Intel Composer XE 2015
> Xeon Phi/MIC
> 
> 
> 
> Configuration
> 
> $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
> $ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
> $ ./configure --prefix=/home/ariebs/mic/mpi \
>CC="icc -mmic" CXX="icpc -mmic" \
>--build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
> AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
> LD=x86_64-k1om-linux-ld \
> --enable-mpirun-prefix-by-default --disable-io-romio \
> --disable-vt --disable-mpi-fortran \
> --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
> $ make
> $ make install
> 
> 
> 
> Test program
> 
> #include <shmem.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> /* Trivial OSHMEM hello-world: initialize, report this PE's rank
>    out of the total number of PEs, then exit. */
> int main(int argc, char **argv)
> {
>     int me, num_pe;
>
>     shmem_init();
>     num_pe = num_pes();
>     me = my_pe();
>     printf("Hello World from process %d of %d\n", me, num_pe);
>     exit(0);
> }
> 
> 
> 
> Building the program
> 
> export PATH=/home/ariebs/mic/mpi/bin:$PATH
> export PATH=/usr/linux-k1om-4.7/bin/:$PATH
> source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
> export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
> 
> icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
> -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
> -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
> -lm -ldl -lutil \
> -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
> -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
> -o mic.out  shmem_hello.c
> 
> 
> 
> Running the program
> 
> (Note that the program had been consistently failing. Then, when I logged 
> back into the system to capture the results, it worked once,  and then 
> immediately failed when I tried again, as shown below. Logging in and out 
> isn't sufficient to correct the problem. Overall, I think I had 3 successful 
> runs in 30-40 attempts.)
> 
> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
> [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 426
> Hello World from process 0 of 2
> Hello World from process 1 of 2
> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
> [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 426
> [atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to 
> initialize - aborting
> --
> It looks like SHMEM_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during SHMEM_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open SHMEM
> developer):
> 
>   mca_memheap_base_select() failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --
> --
> SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with 
> errorcode -1.
> --
> --
> A SHMEM process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
> Local host: atl1-01-mic0
> PID:   

[OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-10 Thread Andy Riebs

  
  
Summary: MPI jobs work fine, SHMEM jobs work just often enough to be
tantalizing, on an Intel Xeon Phi/MIC system.

Longer version

Thanks to the excellent write-up last June
(),
I have been able to build a version of Open MPI for the Xeon Phi
coprocessor that runs MPI jobs on the Phi coprocessor with no
problem, but not SHMEM jobs.  Just at the point where I was about to
document the problems I was having with SHMEM, my trivial SHMEM job
worked. And then failed when I tried to run it again, immediately
afterwards. I have a feeling I may be in uncharted  territory here.

Environment

  RHEL 6.5
  Intel Composer XE 2015
  Xeon Phi/MIC




Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
   CC="icc -mmic" CXX="icpc -mmic" \
   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install
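
One sanity check worth doing after the install, sketched here on the assumption that ompi_info was built for the MIC and is run on the card itself, is to confirm that the OSHMEM frameworks actually have components:

$ export PATH=/home/ariebs/mic/mpi/bin:$PATH
$ ompi_info | grep -i -E "memheap|sshmem|spml"

If memheap or sshmem lists no components, that would line up with the mca_memheap_base_select() failure shown below.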



Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>
int main(int argc, char **argv)
{
    int me, num_pe;
    shmem_init();
    num_pe = num_pes();
    me = my_pe();
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}



Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \
    -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -o mic.out  shmem_hello.c
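
Assuming the OSHMEM wrapper compiler was installed with this build, the same link step could also be expressed through it, letting the wrapper supply the Open MPI include and library flags; this is only a sketch and should be equivalent to the explicit icc line above (the -mmic is passed through explicitly in case the wrapper does not already carry it):

oshcc -mmic -std=gnu99 -o mic.out shmem_hello.c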



Running the program

(Note that the program had been consistently failing. Then, when I
logged back into the system to capture the results, it worked once, 
and then immediately failed when I tried again, as shown below.
Logging in and out isn't sufficient to correct the problem. Overall,
I think I had 3 successful runs in 30-40 attempts.)

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
--
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--
--
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:    189383
--
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.