With Open MPI 1.9a1r32252 (Jul 16, 2014, nightly snapshot tarball) I got this 
output (it looks the same as before):
$ salloc -N2 --exclusive -p test -J ompi
salloc: Granted job allocation 645686

$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so \
  mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 \
  --debug-daemons -np 1 hello_c
[access1:04312] mca: base: components_register: registering plm components
[access1:04312] mca: base: components_register: found loaded component isolated
[access1:04312] mca: base: components_register: component isolated has no 
register or open function
[access1:04312] mca: base: components_register: found loaded component rsh
[access1:04312] mca: base: components_register: component rsh register function 
successful
[access1:04312] mca: base: components_register: found loaded component slurm
[access1:04312] mca: base: components_register: component slurm register 
function successful
[access1:04312] mca: base: components_open: opening plm components
[access1:04312] mca: base: components_open: found loaded component isolated
[access1:04312] mca: base: components_open: component isolated open function 
successful
[access1:04312] mca: base: components_open: found loaded component rsh
[access1:04312] mca: base: components_open: component rsh open function 
successful
[access1:04312] mca: base: components_open: found loaded component slurm
[access1:04312] mca: base: components_open: component slurm open function 
successful
[access1:04312] mca:base:select: Auto-selecting plm components
[access1:04312] mca:base:select:( plm) Querying component [isolated]
[access1:04312] mca:base:select:( plm) Query of component [isolated] set 
priority to 0
[access1:04312] mca:base:select:( plm) Querying component [rsh]
[access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 
10
[access1:04312] mca:base:select:( plm) Querying component [slurm]
[access1:04312] mca:base:select:( plm) Query of component [slurm] set priority 
to 75
[access1:04312] mca:base:select:( plm) Selected component [slurm]
[access1:04312] mca: base: close: component isolated closed
[access1:04312] mca: base: close: unloading component isolated
[access1:04312] mca: base: close: component rsh closed
[access1:04312] mca: base: close: unloading component rsh
Daemon was launched on node1-128-09 - beginning to initialize
Daemon was launched on node1-128-15 - beginning to initialize
Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
[node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
[node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
srun: error: node1-128-09: task 0: Exited with exit code 1
srun: Terminating job step 645686.3
srun: error: node1-128-15: task 1: Exited with exit code 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
[access1:04312] mca: base: close: component slurm closed
[access1:04312] mca: base: close: unloading component slurm
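
A couple of quick checks might help narrow down whether the failure comes from
the Slurm launch itself or from the preloaded MXM library. This is just a
sketch, assuming the same allocation and the same library path as above:

# does plain srun start tasks on both allocated nodes?
$ srun -N2 hostname
# does the preloaded library resolve on the compute nodes?
$ srun -N2 ldd /mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so

If both succeed, the "exit code 1" from the orted tasks is more likely something
on the Open MPI side (the daemons die before they can report back to mpirun, as
the help message says).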


Thu, 17 Jul 2014 11:40:24 +0300 from Mike Dubman <mi...@dev.mellanox.co.il>:
>can you use latest ompi-1.8 from svn/git?
>Ralph - could you please suggest.
>Thx
>
>
>On Wed, Jul 16, 2014 at 2:48 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Here it is:
>>
>>$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so \
>>  mpirun -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
>>
>>[access1:29064] mca: base: components_register: registering plm components
>>[access1:29064] mca: base: components_register: found loaded component 
>>isolated
>>[access1:29064] mca: base: components_register: component isolated has no 
>>register or open function
>>[access1:29064] mca: base: components_register: found loaded component rsh
>>[access1:29064] mca: base: components_register: component rsh register 
>>function successful
>>[access1:29064] mca: base: components_register: found loaded component slurm
>>[access1:29064] mca: base: components_register: component slurm register 
>>function successful
>>[access1:29064] mca: base: components_open: opening plm components
>>[access1:29064] mca: base: components_open: found loaded component isolated
>>[access1:29064] mca: base: components_open: component isolated open function 
>>successful
>>[access1:29064] mca: base: components_open: found loaded component rsh
>>[access1:29064] mca: base: components_open: component rsh open function 
>>successful
>>[access1:29064] mca: base: components_open: found loaded component slurm
>>[access1:29064] mca: base: components_open: component slurm open function 
>>successful
>>[access1:29064] mca:base:select: Auto-selecting plm components
>>[access1:29064] mca:base:select:(  plm) Querying component [isolated]
>>[access1:29064] mca:base:select:(  plm) Query of component [isolated] set 
>>priority to 0
>>[access1:29064] mca:base:select:(  plm) Querying component [rsh]
>>[access1:29064] mca:base:select:(  plm) Query of component [rsh] set priority 
>>to 10
>>[access1:29064] mca:base:select:(  plm) Querying component [slurm]
>>[access1:29064] mca:base:select:(  plm) Query of component [slurm] set 
>>priority to 75
>>[access1:29064] mca:base:select:(  plm) Selected component [slurm]
>>[access1:29064] mca: base: close: component isolated closed
>>[access1:29064] mca: base: close: unloading component isolated
>>[access1:29064] mca: base: close: component rsh closed
>>[access1:29064] mca: base: close: unloading component rsh
>>Daemon was launched on node1-128-17 - beginning to initialize
>>Daemon was launched on node1-128-18 - beginning to initialize
>>Daemon [[63607,0],2] checking in as pid 24538 on host node1-128-18
>>[node1-128-18:24538] [[63607,0],2] orted: up and running - waiting for 
>>commands!
>>Daemon [[63607,0],1] checking in as pid 17192 on host node1-128-17
>>[node1-128-17:17192] [[63607,0],1] orted: up and running - waiting for 
>>commands!
>>srun: error: node1-128-18: task 1: Exited with exit code 1
>>srun: Terminating job step 645191.1
>>srun: error: node1-128-17: task 0: Exited with exit code 1
>>
>>--------------------------------------------------------------------------
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>--------------------------------------------------------------------------
>>[access1:29064] [[63607,0],0] orted_cmd: received halt_vm cmd
>>[access1:29064] mca: base: close: component slurm closed
>>[access1:29064] mca: base: close: unloading component slurm
>>
>>
>>>Wed, 16 Jul 2014 14:20:33 +0300 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>please add following flags to mpirun "--mca plm_base_verbose 10 
>>>--debug-daemons" and attach output.
>>>Thx
>>>
>>>
>>>On Wed, Jul 16, 2014 at 11:12 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>Hello!
>>>>I have Open MPI v1.9a1r32142 and Slurm 2.5.6.
>>>>
>>>>I cannot use mpirun after salloc:
>>>>
>>>>$ salloc -N2 --exclusive -p test -J ompi
>>>>$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so \
>>>>  mpirun -np 1 hello_c
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>But if I use mpirun in an sbatch script, it works correctly:
>>>>$ cat ompi_mxm3.0
>>>>#!/bin/sh
>>>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so \
>>>>  mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"
>>>>
>>>>$ sbatch -N2 --exclusive -p test -J ompi ompi_mxm3.0 ./hello_c
>>>>Submitted batch job 645039
>>>>$ cat slurm-645039.out
>>>>[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 
>>>>(add); write change was 0 (none): Operation not permitted
>>>>[warn] Epoll ADD(4) on fd 1 failed.  Old events were 0; read change was 0 
>>>>(none); write change was 1 (add): Operation not permitted
>>>>Hello, world, I am 0 of 2, (Open MPI v1.9a1, package: Open MPI 
>>>>semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 
>>>>04, 2014 (nightly snapshot tarball), 146)
>>>>Hello, world, I am 1 of 2, (Open MPI v1.9a1, package: Open MPI 
>>>>semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 
>>>>04, 2014 (nightly snapshot tarball), 146)
>>>>
>>>>Regards,
>>>>Timur
>>>>_______________________________________________
>>>>users mailing list
>>>>us...@open-mpi.org
>>>>Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>Link to this post:  
>>>>http://www.open-mpi.org/community/lists/users/2014/07/24777.php