Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

Andy Riebs Mon, 13 Apr 2015 08:50:05 -0400 (EDT)

Hi Ralph,

Here are the results with last night's "master" nightly, openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose option (yes, it looks like the "ERROR_LOG" problem has gone away):

$ cat /proc/sys/kernel/shmmax
33554432
$ cat /proc/sys/kernel/shmall
2097152
$ cat /proc/sys/kernel/shmmni
4096
$ export SHMEM_SYMMETRIC_HEAP=1M
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190439] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 nodename hash 4121194178
[atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
[atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190439] [[31875,0],0] using dash_host
[atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190439] [[31875,0],0] ignoring myself
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not a dynamic spawn
[atl1-01-mic0:190441] mca: base: components_register: registering memheap components
[atl1-01-mic0:190441] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: registering memheap components
[atl1-01-mic0:190442] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_open: opening memheap components
[atl1-01-mic0:190441] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190441] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] mca: base: components_open: opening memheap components
[atl1-01-mic0:190442] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190442] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314                            /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314                            /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190442] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190442] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190441] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190441] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        190441
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[31875,1],0]
Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm

On 04/12/2015 03:09 PM, Ralph Castain wrote:

Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line so we can see why none of the memheap components are being selected.

On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190189] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a dynamic spawn
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID: 190192
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[32137,1],0]
Exit code: 255
--------------------------------------------------------------------------
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm

On 04/11/2015 07:41 PM, Ralph Castain wrote:

Got it - thanks. I fixed that ERROR_LOG issue (I think - please verify). I suspect the memheap issue relates to something else, but I probably need to let the OSHMEM folks comment on it.

On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Everything is built on the Xeon side, with the icc "-mmic" switch. I then ssh into one of the PHIs, and run shmemrun from there.

On 04/11/2015 12:00 PM, Ralph Castain wrote:

Let me try to understand the setup a little better. Are you running shmemrun on the Phi itself? Or is it running on the host processor, and you are trying to spawn a process onto the Phi?

On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.

I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly    CC="icc -mmic" CXX="icpc -mmic"    \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux    \
     AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld   \
     --enable-mpirun-prefix-by-default --disable-io-romio     --disable-mpi-fortran    \
     --enable-debug     --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as the build complained that mca oob/ud needed mca common-verbs.)

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 189899, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        189899
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[32419,1],1]
Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm

On 04/10/2015 06:37 PM, Ralph Castain wrote:
Andy - could you please try the current 1.8.5 nightly tarball and see if it helps? The error log indicates that it is failing to get the topology from some daemon, I’m assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put -mca plm_base_verbose on the shmemrun cmd to get more help.

On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
Summary: MPI jobs work fine, SHMEM jobs work just often enough to be tantalizing, on an Intel Xeon Phi/MIC system.

Longer version

Thanks to the excellent write-up last June (<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>), I have been able to build a version of Open MPI for the Xeon Phi coprocessor that runs MPI jobs on the Phi coprocessor with no problem, but not SHMEM jobs. Just at the point where I was about to document the problems I was having with SHMEM, my trivial SHMEM job worked. And then failed when I tried to run it again, immediately afterwards. I have a feeling I may be in uncharted territory here.

Environment

RHEL 6.5

Intel Composer XE 2015

Xeon Phi/MIC

----------------

Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
   CC="icc -mmic" CXX="icpc -mmic" \
   --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install

----------------

Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>
int main(int argc, char **argv)
{
        int me, num_pe;
        shmem_init();
        num_pe = num_pes();
        me = my_pe();
        printf("Hello World from process %d of %d\n", me, num_pe);
        exit(0);
}

----------------

Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
        -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
        -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
        -lm -ldl -lutil \
        -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
        -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
        -o mic.out shmem_hello.c

----------------

Running the program

(Note that the program had been consistently failing. Then, when I logged back into the system to capture the results, it worked once, and then immediately failed when I tried again, as shown below. Logging in and out isn't sufficient to correct the problem. Overall, I think I had 3 successful runs in 30-40 attempts.)

$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
[atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 426
[atl1-01-mic0:189383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

mca_memheap_base_select() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 189383, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:        189383
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[30881,1],0]
Exit code:    255
--------------------------------------------------------------------------

Any thoughts about where to go from here?

Andy
-- 
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26670.php
