Got it, thanks. I fixed that ERROR_LOG issue (I think; please verify). I 
suspect the memheap issue relates to something else, but I probably need to let 
the OSHMEM folks comment on it.


> On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:
> 
> Everything is built on the Xeon side, with the icc "-mmic" switch. I then ssh 
> into one of the PHIs, and run shmemrun from there.
> 
> 
> On 04/11/2015 12:00 PM, Ralph Castain wrote:
>> Let me try to understand the setup a little better. Are you running shmemrun 
>> on the PHI itself? Or is it running on the host processor, and you are 
>> trying to spawn a process onto the Phi?
>> 
>> 
>>> On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> Yes, this is attempting to get OSHMEM to run on the Phi.
>>> 
>>> I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with
>>> 
>>> $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
>>>     CC="icc -mmic" CXX="icpc -mmic" \
>>>     --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>     LD=x86_64-k1om-linux-ld \
>>>     --enable-mpirun-prefix-by-default --disable-io-romio \
>>>     --disable-mpi-fortran \
>>>     --enable-debug \
>>>     --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
>>> 
>>> (Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as 
>>> the build complained that mca oob/ud needed mca common-verbs.)
>>> 
>>> With that configuration, here is what I am seeing now...
>>> 
>>> $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [rsh]
>>> [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [isolated]
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [isolated] set priority to 0
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [slurm]
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
>>> [atl1-01-mic0:189895] mca:base:select:(  plm) Selected component [rsh]
>>> [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
>>> [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
>>> [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
>>> [atl1-01-mic0:189895] [[32419,0],0] using dash_host
>>> [atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
>>> [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
>>> [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
>>> [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
>>> [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
>>> [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
>>> --------------------------------------------------------------------------
>>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during SHMEM_INIT; some of which are due to configuration or environment
>>> problems.  This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open SHMEM
>>> developer):
>>> 
>>>   mca_memheap_base_select() failed
>>>   --> Returned "Error" (-1) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> A SHMEM process is aborting at a time when it cannot guarantee that all
>>> of its peer processes in the job will be killed properly.  You should
>>> double check that everything has shut down cleanly.
>>> 
>>> Local host: atl1-01-mic0
>>> PID:        189899
>>> --------------------------------------------------------------------------
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
>>> --------------------------------------------------------------------------
>>> shmemrun detected that one or more processes exited with non-zero status, thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>   Process name: [[32419,1],1]
>>>   Exit code:    255
>>> --------------------------------------------------------------------------
>>> [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
>>> [atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
>>> [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
>>> [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
>>> 
>>> 
>>> 
>>> 
>>> On 04/10/2015 06:37 PM, Ralph Castain wrote:
>>>> Andy - could you please try the current 1.8.5 nightly tarball and see if 
>>>> it helps? The error log indicates that it is failing to get the topology 
>>>> from some daemon, I'm assuming the one on the Phi?
>>>> 
>>>> You might also add --enable-debug to that configure line and then put -mca 
>>>> plm_base_verbose on the shmemrun cmd to get more help
>>>> 
>>>> 
>>>>> On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
>>>>> 
>>>>> Summary: MPI jobs work fine, SHMEM jobs work just often enough to be 
>>>>> tantalizing, on an Intel Xeon Phi/MIC system.
>>>>> 
>>>>> Longer version
>>>>> 
>>>>> Thanks to the excellent write-up last June 
>>>>> (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I 
>>>>> have been able to build a version of Open MPI for the Xeon Phi 
>>>>> coprocessor that runs MPI jobs on the Phi coprocessor with no problem, 
>>>>> but not SHMEM jobs.  Just at the point where I was about to document the 
>>>>> problems I was having with SHMEM, my trivial SHMEM job worked. And then 
>>>>> failed when I tried to run it again, immediately afterwards. I have a 
>>>>> feeling I may be in uncharted  territory here.
>>>>> 
>>>>> Environment
>>>>> RHEL 6.5
>>>>> Intel Composer XE 2015
>>>>> Xeon Phi/MIC
>>>>> ----------------
>>>>> 
>>>>> 
>>>>> Configuration
>>>>> 
>>>>> $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>>> $ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
>>>>> $ ./configure --prefix=/home/ariebs/mic/mpi \
>>>>>    CC="icc -mmic" CXX="icpc -mmic" \
>>>>>    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
>>>>>     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
>>>>>     LD=x86_64-k1om-linux-ld \
>>>>>     --enable-mpirun-prefix-by-default --disable-io-romio \
>>>>>     --disable-vt --disable-mpi-fortran \
>>>>>     --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
>>>>> $ make
>>>>> $ make install
>>>>> 
>>>>> ----------------
>>>>> 
>>>>> Test program
>>>>> 
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <shmem.h>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>         int me, num_pe;
>>>>>         shmem_init();
>>>>>         num_pe = num_pes();
>>>>>         me = my_pe();
>>>>>         /* me and num_pe are int, so print them with %d rather than %ld */
>>>>>         printf("Hello World from process %d of %d\n", me, num_pe);
>>>>>         exit(0);
>>>>> }
>>>>> 
>>>>> ----------------
>>>>> 
>>>>> Building the program
>>>>> 
>>>>> export PATH=/home/ariebs/mic/mpi/bin:$PATH
>>>>> export PATH=/usr/linux-k1om-4.7/bin/:$PATH
>>>>> source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
>>>>> export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
>>>>> 
>>>>> icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
>>>>>         -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
>>>>>         -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
>>>>>         -lm -ldl -lutil \
>>>>>         -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>>>         -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
>>>>>         -o mic.out  shmem_hello.c
>>>>> 
>>>>> ----------------
>>>>> 
>>>>> Running the program
>>>>> 
>>>>> (Note that the program had been consistently failing. Then, when I logged 
>>>>> back into the system to capture the results, it worked once,  and then 
>>>>> immediately failed when I tried again, as shown below. Logging in and out 
>>>>> isn't sufficient to correct the problem. Overall, I think I had 3 
>>>>> successful runs in 30-40 attempts.)
>>>>> 
>>>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>> [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
>>>>> Hello World from process 0 of 2
>>>>> Hello World from process 1 of 2
>>>>> $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
>>>>> [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
>>>>> [atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
>>>>> --------------------------------------------------------------------------
>>>>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>> fail during SHMEM_INIT; some of which are due to configuration or environment
>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>> additional information (which may only be relevant to an Open SHMEM
>>>>> developer):
>>>>> 
>>>>>   mca_memheap_base_select() failed
>>>>>   --> Returned "Error" (-1) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> A SHMEM process is aborting at a time when it cannot guarantee that all
>>>>> of its peer processes in the job will be killed properly.  You should
>>>>> double check that everything has shut down cleanly.
>>>>> 
>>>>> Local host: atl1-01-mic0
>>>>> PID:        189383
>>>>> --------------------------------------------------------------------------
>>>>> -------------------------------------------------------
>>>>> Primary job  terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> shmemrun detected that one or more processes exited with non-zero status, thus causing
>>>>> the job to be terminated. The first process to do so was:
>>>>> 
>>>>>   Process name: [[30881,1],0]
>>>>>   Exit code:    255
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> Any thoughts about where to go from here?
>>>>> 
>>>>> Andy
>>>>> 
>>>>> -- 
>>>>> Andy Riebs
>>>>> Hewlett-Packard Company
>>>>> High Performance Computing
>>>>> +1 404 648 9024
>>>>> My opinions are not necessarily those of HP
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26670.php
>>>> 
>>>> 
>> 
>> 
