Hi Ralph,
Yes, this is attempting to get OSHMEM to run on the Phi.
I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with:
$ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-mpi-fortran --enable-debug \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
(Note that I had to add "oob-ud" to the "--enable-mca-no-build" option,
as the build complained that mca oob/ud needed mca common-verbs.)
With that configuration, here is what I am seeing now...
$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[32419,1],1]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm
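
Since mca_memheap_base_select() is the piece that fails, one other thing I can do is rule out the Phi's ability to hand back a 1 GiB mapping at all. Below is a quick standalone check (my own sketch of the general pattern, not Open MPI's actual sshmem code) that just reserves and touches an anonymous mapping the same size as SHMEM_SYMMETRIC_HEAP_SIZE:

#define _GNU_SOURCE             /* for MAP_ANONYMOUS */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t heap_size = 1UL << 30;   /* 1 GiB, matching SHMEM_SYMMETRIC_HEAP_SIZE=1G */

    /* Reserve an anonymous mapping, roughly what an sshmem component must do. */
    void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Touch every page so failures show up now rather than at first use. */
    memset(heap, 0, heap_size);
    printf("mapped and touched %zu bytes at %p\n", heap_size, heap);

    munmap(heap, heap_size);
    return EXIT_SUCCESS;
}

If this succeeds every time (compiled with icc -mmic and run on the Phi) while shmemrun keeps failing intermittently, the problem is presumably in how memheap/sshmem coordinate between the PEs rather than in the Phi's memory configuration.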
On 04/10/2015 06:37 PM, Ralph Castain wrote:

Andy - could you please try the current 1.8.5 nightly tarball and see if it
helps? The error log indicates that it is failing to get the topology from
some daemon, I'm assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put
-mca plm_base_verbose on the shmemrun cmd to get more help
Summary: MPI jobs work fine, but SHMEM jobs work just often enough to be
tantalizing, on an Intel Xeon Phi/MIC system.

Longer version:

Thanks to the excellent write-up last June
(<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>),
I have been able to build a version of Open MPI for the Xeon Phi
coprocessor that runs MPI jobs on the Phi with no problem, but not SHMEM
jobs. Just at the point where I was about to document the problems I was
having with SHMEM, my trivial SHMEM job worked. And then it failed when I
tried to run it again immediately afterwards. I have a feeling I may be
in uncharted territory here.
Environment
- RHEL 6.5
- Intel Composer XE 2015
- Xeon Phi/MIC
----------------
Configuration
$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install
----------------
Test program
#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    int me, num_pe;

    shmem_init();
    num_pe = num_pes();
    me = my_pe();
    /* me and num_pe are ints, so print with %d (not %ld) */
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}
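
For what it's worth, a slightly larger test that actually allocates from the symmetric heap might localize the memheap failure better than hello-world does. Here is a sketch, assuming the OpenSHMEM 1.0-era calls these Open MPI tarballs ship (shmalloc/shfree rather than the newer shmem_malloc/shmem_free names):

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = my_pe();
    int npes = num_pes();

    /* Allocate from the symmetric heap -- the memheap that fails to
     * initialize above. shmalloc is collective across all PEs. */
    int *src = (int *) shmalloc(sizeof(int));
    int *dst = (int *) shmalloc(npes * sizeof(int));
    *src = me;

    shmem_barrier_all();

    /* Each PE writes its rank into slot 'me' of PE 0's dst array. */
    shmem_int_put(&dst[me], src, 1, 0);

    shmem_barrier_all();

    if (me == 0) {
        for (int i = 0; i < npes; i++)
            printf("PE 0 received %d from PE %d\n", dst[i], i);
    }

    shfree(dst);
    shfree(src);
    return 0;
}

If the hello-world case fails in mca_memheap_base_select(), this one should too, but when a run does get past initialization it also verifies that puts into the symmetric heap actually land.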
----------------
Building the program
export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \
    -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
    -o mic.out shmem_hello.c
----------------
Running the program
(Note that the program had been consistently failing. Then, when I logged
back into the system to capture the results, it worked once, and then
immediately failed when I tried again, as shown below. Logging in and out
isn't sufficient to correct the problem. Overall, I think I had 3
successful runs in 30-40 attempts.)
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--------------------------------------------------------------------------
Any thoughts about where to go from here?
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP