Hi Ralph,

Here are the results with last night's "master" nightly, openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose option (yes, it looks like the "ERROR_LOG" problem has gone away):

$ cat /proc/sys/kernel/shmmax
33554432
$ cat /proc/sys/kernel/shmall
2097152
$ cat /proc/sys/kernel/shmmni
4096
$ export SHMEM_SYMMETRIC_HEAP=1M
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190439] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190439] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190439] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439 nodename hash 4121194178
[atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
[atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190439] [[31875,0],0] using dash_host
[atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190439] [[31875,0],0] ignoring myself
[atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for job [31875,1]
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
[atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not a dynamic spawn
[atl1-01-mic0:190441] mca: base: components_register: registering memheap components
[atl1-01-mic0:190441] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: registering memheap components
[atl1-01-mic0:190442] mca: base: components_register: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_register: component buddy has no register or open function
[atl1-01-mic0:190442] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_register: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_register: component ptmalloc has no register or open function
[atl1-01-mic0:190441] mca: base: components_open: opening memheap components
[atl1-01-mic0:190441] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190441] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190441] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190441] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] mca: base: components_open: opening memheap components
[atl1-01-mic0:190442] mca: base: components_open: found loaded component buddy
[atl1-01-mic0:190442] mca: base: components_open: component buddy open function successful
[atl1-01-mic0:190442] mca: base: components_open: found loaded component ptmalloc
[atl1-01-mic0:190442] mca: base: components_open: component ptmalloc open function successful
[atl1-01-mic0:190442] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190441] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 1
[atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() add: 00600000-00601000 rw-p 00000000 00:11 6029314 /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190442] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190442] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190441] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments
[atl1-01-mic0:190441] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x1 id=0xFFFFFFFF
[atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment
[atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID: 190441
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31875,1],0]
  Exit code: 255
--------------------------------------------------------------------------
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm

On 04/12/2015 03:09 PM, Ralph Castain wrote:
Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the memheap problem, though.