Hi,
I reconfigured Torque so that each physical node is presented as a single node (no NUMA split). Still no success, but the nodefile now looks better. I still get these errors:
[a00551.science.domain:18021] [[34768,0],1] bind() failed on error
Address already in use (98)
[a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
[a00551.science.domain:18022] [[34768,0],2] bind() failed on error
Address already in use (98)
[a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
(Btw: for some reason the bind errors were missing earlier, sorry!)
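In case something from an earlier run is still holding that socket, a quick check could look like this (a sketch: pgrep/ss are just the standard tools, and grepping for "usock" assumes the session socket path contains that string, which may differ between Open MPI versions):
pgrep -u $USER -l orted    # any daemons left over from a previous run?
ss -xl | grep -i usock     # unix sockets still bound by old sessions (name is an assumption)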
PBS_NODEFILE
a00551.science.domain
a00554.science.domain
a00553.science.domain
-----------------------
mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
[a00551.science.domain:18097] mca: base: components_register:
registering framework plm components
[a00551.science.domain:18097] mca: base: components_register: found
loaded component isolated
[a00551.science.domain:18097] mca: base: components_register: component
isolated has no register or open function
[a00551.science.domain:18097] mca: base: components_register: found
loaded component rsh
[a00551.science.domain:18097] mca: base: components_register: component
rsh register function successful
[a00551.science.domain:18097] mca: base: components_register: found
loaded component slurm
[a00551.science.domain:18097] mca: base: components_register: component
slurm register function successful
[a00551.science.domain:18097] mca: base: components_register: found
loaded component tm
[a00551.science.domain:18097] mca: base: components_register: component
tm register function successful
[a00551.science.domain:18097] mca: base: components_open: opening plm
components
[a00551.science.domain:18097] mca: base: components_open: found loaded
component isolated
[a00551.science.domain:18097] mca: base: components_open: component
isolated open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded
component rsh
[a00551.science.domain:18097] mca: base: components_open: component rsh
open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded
component slurm
[a00551.science.domain:18097] mca: base: components_open: component
slurm open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded
component tm
[a00551.science.domain:18097] mca: base: components_open: component tm
open function successful
[a00551.science.domain:18097] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:18097] mca:base:select:( plm) Querying component
[isolated]
[a00551.science.domain:18097] mca:base:select:( plm) Query of component
[isolated] set priority to 0
[a00551.science.domain:18097] mca:base:select:( plm) Querying component
[rsh]
[a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on
agent ssh : rsh path NULL
[a00551.science.domain:18097] mca:base:select:( plm) Query of component
[rsh] set priority to 10
[a00551.science.domain:18097] mca:base:select:( plm) Querying component
[slurm]
[a00551.science.domain:18097] mca:base:select:( plm) Querying component
[tm]
[a00551.science.domain:18097] mca:base:select:( plm) Query of component
[tm] set priority to 75
[a00551.science.domain:18097] mca:base:select:( plm) Selected component
[tm]
[a00551.science.domain:18097] mca: base: close: component isolated
closed
[a00551.science.domain:18097] mca: base: close: unloading component
isolated
[a00551.science.domain:18097] mca: base: close: component rsh closed
[a00551.science.domain:18097] mca: base: close: unloading component rsh
[a00551.science.domain:18097] mca: base: close: component slurm closed
[a00551.science.domain:18097] mca: base: close: unloading component
slurm
[a00551.science.domain:18097] plm:base:set_hnp_name: initial bias 18097
nodename hash 2226275586
[a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam 34561
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive start comm
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating
map
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new
daemon [[34561,0],1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning
new daemon [[34561,0],1] to node a00554.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new
daemon [[34561,0],2]
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning
new daemon [[34561,0],2] to node a00553.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:tm: launching vm
[a00551.science.domain:18097] [[34561,0],0] plm:tm: final top-level
argv:
orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm
-mca ess_base_jobid 2264989696 -mca ess_base_vpid <template> -mca
ess_base_num_procs 3 -mca orte_hnp_uri
2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904
--mca plm_base_verbose 10
[a00551.science.domain:18097] [[34561,0],0] plm:tm: launching on node
a00554.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:tm: executing:
orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm
-mca ess_base_jobid 2264989696 -mca ess_base_vpid 1 -mca
ess_base_num_procs 3 -mca orte_hnp_uri
2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904
--mca plm_base_verbose 10
[a00551.science.domain:18097] [[34561,0],0] plm:tm: launching on node
a00553.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:tm: executing:
orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm
-mca ess_base_jobid 2264989696 -mca ess_base_vpid 2 -mca
ess_base_num_procs 3 -mca orte_hnp_uri
2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904
--mca plm_base_verbose 10
[a00551.science.domain:18097] [[34561,0],0] plm:tm:launch: finished
spawning orteds
[a00551.science.domain:18102] mca: base: components_register:
registering framework plm components
[a00551.science.domain:18102] mca: base: components_register: found
loaded component rsh
[a00551.science.domain:18102] mca: base: components_register: component
rsh register function successful
[a00551.science.domain:18102] mca: base: components_open: opening plm
components
[a00551.science.domain:18102] mca: base: components_open: found loaded
component rsh
[a00551.science.domain:18102] mca: base: components_open: component rsh
open function successful
[a00551.science.domain:18102] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:18102] mca:base:select:( plm) Querying component
[rsh]
[a00551.science.domain:18102] [[34561,0],1] plm:rsh_lookup on agent ssh
: rsh path NULL
[a00551.science.domain:18102] mca:base:select:( plm) Query of component
[rsh] set priority to 10
[a00551.science.domain:18102] mca:base:select:( plm) Selected component
[rsh]
[a00551.science.domain:18102] [[34561,0],1] bind() failed on error
Address already in use (98)
[a00551.science.domain:18102] [[34561,0],1] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
[a00551.science.domain:18102] [[34561,0],1] plm:rsh_setup on agent ssh :
rsh path NULL
[a00551.science.domain:18102] [[34561,0],1] plm:base:receive start comm
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
from daemon [[34561,0],1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
from daemon [[34561,0],1] on node a00551
[a00551.science.domain:18097] [[34561,0],0] RECEIVED TOPOLOGY FROM NODE
a00551
[a00551.science.domain:18097] [[34561,0],0] ADDING TOPOLOGY PER USER
REQUEST TO NODE a00554.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
completed for daemon [[34561,0],1] at contact
2264989696.1;tcp://130.226.12.194:52354;tcp6://[fe80::225:90ff:feeb:f6d5]:60904
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
recvd 2 of 3 reported daemons
[a00551.science.domain:18103] mca: base: components_register:
registering framework plm components
[a00551.science.domain:18103] mca: base: components_register: found
loaded component rsh
[a00551.science.domain:18103] mca: base: components_register: component
rsh register function successful
[a00551.science.domain:18103] mca: base: components_open: opening plm
components
[a00551.science.domain:18103] mca: base: components_open: found loaded
component rsh
[a00551.science.domain:18103] mca: base: components_open: component rsh
open function successful
[a00551.science.domain:18103] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:18103] mca:base:select:( plm) Querying component
[rsh]
[a00551.science.domain:18103] [[34561,0],2] plm:rsh_lookup on agent ssh
: rsh path NULL
[a00551.science.domain:18103] mca:base:select:( plm) Query of component
[rsh] set priority to 10
[a00551.science.domain:18103] mca:base:select:( plm) Selected component
[rsh]
[a00551.science.domain:18103] [[34561,0],2] bind() failed on error
Address already in use (98)
[a00551.science.domain:18103] [[34561,0],2] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
[a00551.science.domain:18103] [[34561,0],2] plm:rsh_setup on agent ssh :
rsh path NULL
[a00551.science.domain:18103] [[34561,0],2] plm:base:receive start comm
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
from daemon [[34561,0],2]
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
from daemon [[34561,0],2] on node a00551
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
completed for daemon [[34561,0],2] at contact
2264989696.2;tcp://130.226.12.194:41272;tcp6://[fe80::225:90ff:feeb:f6d5]:35343
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_report_launch
recvd 3 of 3 reported daemons
[a00551.science.domain:18097] [[34561,0],0] plm:base:setting topo to
that from node a00554.science.domain
Data for JOB [34561,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [34561,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Data for node: a00554.science.domain Num slots: 1 Max slots: 0 Num
procs: 1
Process OMPI jobid: [34561,1] App: 0 Process rank: 1 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Data for node: a00553.science.domain Num slots: 1 Max slots: 0 Num
procs: 1
Process OMPI jobid: [34561,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[a00551.science.domain:18097] [[34561,0],0] complete_setup on job
[34561,1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:launch_apps for job
[34561,1]
[1,0]<stdout>:a00551.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive processing
msg
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive update proc
state command from [[34561,0],2]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for job [34561,1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for vpid 2 state RUNNING exit_code 0
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive done
processing commands
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive processing
msg
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive update proc
state command from [[34561,0],1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for job [34561,1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for vpid 1 state RUNNING exit_code 0
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive done
processing commands
[a00551.science.domain:18097] [[34561,0],0] plm:base:launch wiring up
iof for job [34561,1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:launch job
[34561,1] is not a dynamic spawn
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive processing
msg
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive update proc
state command from [[34561,0],2]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for job [34561,1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive done
processing commands
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive processing
msg
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive update proc
state command from [[34561,0],1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for job [34561,1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive got
update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive done
processing commands
[a00551.science.domain:18097] [[34561,0],0] plm:base:orted_cmd sending
orted_exit commands
[a00551.science.domain:18102] [[34561,0],1] plm:base:receive stop comm
[a00551.science.domain:18102] mca: base: close: component rsh closed
[a00551.science.domain:18102] mca: base: close: unloading component rsh
[a00551.science.domain:18103] [[34561,0],2] plm:base:receive stop comm
[a00551.science.domain:18103] mca: base: close: component rsh closed
[a00551.science.domain:18103] mca: base: close: unloading component rsh
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive stop comm
[a00551.science.domain:18097] mca: base: close: component tm closed
[a00551.science.domain:18097] mca: base: close: unloading component tm
Best,
Oswin
On 2016-09-08 10:33, Oswin Krause wrote:
Hi Gilles, Hi Ralph,
I have just rebuilt Open MPI; there is quite a lot more information now. As I said, I did not tinker with the PBS_NODEFILE. I think the issue might be NUMA here. I can try to go through the process, reconfigure to non-NUMA, and see whether that works. The issue might be that the node allocation looks like this:
a00551.science.domain-0
a00552.science.domain-0
a00551.science.domain-1
and the last part (the -0/-1 suffix) then gets stripped, which leads to the issue. I am not sure whether this makes sense, but that is my explanation.
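One way to check this guess would be to compare the names Torque hands out with what the nodes call themselves (a sketch using only standard commands, run from inside the job):
sort $PBS_NODEFILE | uniq -c   # how often each (possibly shortened) name appears
hostname; hostname -f          # on each allocated node, to compare against those entries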
Here is the output:
$PBS_NODEFILE
/var/lib/torque/aux//285.a00552.science.domain
PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain
---------
[a00551.science.domain:16986] mca: base: components_register:
registering framework plm components
[a00551.science.domain:16986] mca: base: components_register: found
loaded component isolated
[a00551.science.domain:16986] mca: base: components_register:
component isolated has no register or open function
[a00551.science.domain:16986] mca: base: components_register: found
loaded component rsh
[a00551.science.domain:16986] mca: base: components_register:
component rsh register function successful
[a00551.science.domain:16986] mca: base: components_register: found
loaded component slurm
[a00551.science.domain:16986] mca: base: components_register:
component slurm register function successful
[a00551.science.domain:16986] mca: base: components_register: found
loaded component tm
[a00551.science.domain:16986] mca: base: components_register:
component tm register function successful
[a00551.science.domain:16986] mca: base: components_open: opening plm
components
[a00551.science.domain:16986] mca: base: components_open: found loaded
component isolated
[a00551.science.domain:16986] mca: base: components_open: component
isolated open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded
component rsh
[a00551.science.domain:16986] mca: base: components_open: component
rsh open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded
component slurm
[a00551.science.domain:16986] mca: base: components_open: component
slurm open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded
component tm
[a00551.science.domain:16986] mca: base: components_open: component tm
open function successful
[a00551.science.domain:16986] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:16986] mca:base:select:( plm) Querying
component [isolated]
[a00551.science.domain:16986] mca:base:select:( plm) Query of
component [isolated] set priority to 0
[a00551.science.domain:16986] mca:base:select:( plm) Querying
component [rsh]
[a00551.science.domain:16986] [[INVALID],INVALID] plm:rsh_lookup on
agent ssh : rsh path NULL
[a00551.science.domain:16986] mca:base:select:( plm) Query of
component [rsh] set priority to 10
[a00551.science.domain:16986] mca:base:select:( plm) Querying
component [slurm]
[a00551.science.domain:16986] mca:base:select:( plm) Querying
component [tm]
[a00551.science.domain:16986] mca:base:select:( plm) Query of
component [tm] set priority to 75
[a00551.science.domain:16986] mca:base:select:( plm) Selected
component [tm]
[a00551.science.domain:16986] mca: base: close: component isolated
closed
[a00551.science.domain:16986] mca: base: close: unloading component
isolated
[a00551.science.domain:16986] mca: base: close: component rsh closed
[a00551.science.domain:16986] mca: base: close: unloading component rsh
[a00551.science.domain:16986] mca: base: close: component slurm closed
[a00551.science.domain:16986] mca: base: close: unloading component
slurm
[a00551.science.domain:16986] plm:base:set_hnp_name: initial bias
16986 nodename hash 2226275586
[a00551.science.domain:16986] plm:base:set_hnp_name: final jobfam 33770
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive start comm
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_job
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm creating
map
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm add new
daemon [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm
assigning new daemon [[33770,0],1] to node a00553.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:tm: launching vm
[a00551.science.domain:16986] [[33770,0],0] plm:tm: final top-level
argv:
orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm
-mca ess_base_jobid 2213150720 -mca ess_base_vpid <template> -mca
ess_base_num_procs 2 -mca orte_hnp_uri
2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821
--mca plm_base_verbose 10
[a00551.science.domain:16986] [[33770,0],0] plm:tm: launching on node
a00553.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:tm: executing:
orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm
-mca ess_base_jobid 2213150720 -mca ess_base_vpid 1 -mca
ess_base_num_procs 2 -mca orte_hnp_uri
2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821
--mca plm_base_verbose 10
[a00551.science.domain:16986] [[33770,0],0] plm:tm:launch: finished
spawning orteds
[a00551.science.domain:16986] [[33770,0],0]
plm:base:orted_report_launch from daemon [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0]
plm:base:orted_report_launch from daemon [[33770,0],1] on node a00551
[a00551.science.domain:16986] [[33770,0],0] RECEIVED TOPOLOGY FROM NODE
a00551
[a00551.science.domain:16986] [[33770,0],0] ADDING TOPOLOGY PER USER
REQUEST TO NODE a00553.science.domain
[a00551.science.domain:16986] [[33770,0],0]
plm:base:orted_report_launch completed for daemon [[33770,0],1] at
contact
2213150720.1;tcp://130.226.12.194:38025;tcp6://[fe80::225:90ff:feeb:f6d5]:39080
[a00551.science.domain:16986] [[33770,0],0]
plm:base:orted_report_launch recvd 2 of 2 reported daemons
[a00551.science.domain:16986] [[33770,0],0] plm:base:setting topo to
that from node a00553.science.domain
Data for JOB [33770,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
Process OMPI jobid: [33770,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [33770,1] App: 0 Process rank: 1 Bound: socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain Num slots: 1 Max slots: 0 Num
procs: 1
Process OMPI jobid: [33770,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[a00551.science.domain:16986] [[33770,0],0] complete_setup on job
[33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:launch_apps for
job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive processing
msg
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive update
proc state command from [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
update_proc_state for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
update_proc_state for vpid 2 state RUNNING exit_code 0
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive done
processing commands
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive processing
msg
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive update
proc state command from [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
update_proc_state for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive done
processing commands
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:base:launch wiring up
iof for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:launch job
[33770,1] is not a dynamic spawn
[a00551.science.domain:16986] [[33770,0],0] plm:base:orted_cmd sending
orted_exit commands
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive stop comm
[a00551.science.domain:16986] mca: base: close: component tm closed
[a00551.science.domain:16986] mca: base: close: unloading component tm
On 2016-09-08 10:18, Gilles Gouaillardet wrote:
Ralph,
I am not sure I am reading you correctly, so let me clarify.
I did not hack $PBS_NODEFILE for fun or profit; I was simply trying to reproduce an issue I could not reproduce otherwise.
/* my job submitted with -l nodes=3:ppn=1 does not start if only two nodes are available, whereas the same user's job starts on two nodes */
Thanks for the explanation of the Torque internals; my hack was incomplete and not a valid one, I do acknowledge that.
I re-read the email that started this thread and found the information I was looking for:
echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain
So, assuming the end user did not edit his $PBS_NODEFILE, and Torque is correctly configured and not busted, then
Torque indeed always provides an ordered file - the only way you can
get an unordered one is for someone to edit it
might be updated to
"Torque used to always provide an ordered file, but recent versions
might not do that."
Makes sense?
Cheers,
Gilles
On 9/8/2016 4:57 PM, r...@open-mpi.org wrote:
Someone has done some work there since I last did, but I can see the
issue. Torque indeed always provides an ordered file - the only way
you can get an unordered one is for someone to edit it, and that is
forbidden - i.e., you get what you deserve because you are messing
around with a system-defined file :-)
The problem is that Torque internally assigns a “launch ID” which is
just the integer position of the nodename in the PBS_NODEFILE. So if
you modify that position, then we get the wrong index - and
everything goes down the drain from there. In your example,
n1.cluster changed index from 3 to 2 because of your edit. Torque
thinks that index 2 is just another reference to n0.cluster, and so
we merrily launch a daemon onto the wrong node.
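To make the index shift concrete, here is the mapping for the example above (positions counted from 1, as in the "from 3 to 2" description; node names taken from Gilles' files):
  position   original $PBS_NODEFILE   edited $PBS_NODEFILE
     1       n0.cluster               n0.cluster
     2       n0.cluster               n1.cluster   <- Open MPI now derives index 2 for n1
     3       n1.cluster               n0.cluster
     4       n1.cluster               n1.cluster
Open MPI reads the edited file and asks Torque to launch a daemon at index 2, expecting n1.cluster; Torque resolves index 2 against its original ordering, where that position is still n0.cluster, so the daemon lands on the wrong node.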
They have a good reason for doing things this way. It allows you to
launch a process against each launch ID, and the pattern will reflect
the original qsub request in what we would call a map-by slot
round-robin mode. This maximizes the use of shared memory, and is
expected to provide good performance for a range of apps.
Lesson to be learned: never, ever muddle around with a
system-generated file. If you want to modify where things go, then
use one or more of the mpirun options to do so. We give you lots and
lots of knobs for just that reason.
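For instance, placement can be steered from the command line instead (a sketch; these are standard mpirun placement options, and ./my_app is just a placeholder for the real executable):
mpirun --map-by node -np 3 ./my_app     # round-robin the ranks across nodes
mpirun --map-by slot -np 3 ./my_app     # fill one node's slots before moving on
mpirun -npernode 1 ./my_app             # exactly one rank per allocated node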
On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:
Ralph,
There might be an issue within Open MPI.
On the cluster I used, hostname returns the FQDN, and $PBS_NODEFILE uses the FQDN too.
My $PBS_NODEFILE has one line per task, and it is ordered,
e.g.
n0.cluster
n0.cluster
n1.cluster
n1.cluster
In my Torque script, I rewrote the machinefile like this:
n0.cluster
n1.cluster
n0.cluster
n1.cluster
and updated the PBS environment variable to point to my new file.
Then I invoked
mpirun hostname
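For reference, the rewrite was roughly the following (a sketch: the helper file names are made up, it assumes two nodes with two slots each as above, and it is only meant to reproduce the unordered case, not something to use for real jobs):
# inside the Torque job script
sort -u $PBS_NODEFILE > nodes.uniq           # n0.cluster, n1.cluster
cat nodes.uniq nodes.uniq > nodes.reordered  # n0, n1, n0, n1
export PBS_NODEFILE=$PWD/nodes.reordered
mpirun hostname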
In the first case, 2 tasks run on n0 and 2 tasks run on n1; in the second case, all 4 tasks run on n0 and none on n1.
So I am thinking we might not support an unordered $PBS_NODEFILE.
As a reminder, the submit command was
qsub -l nodes=3:ppn=1
but for reasons unknown to me, only two nodes were allocated (two slots on the first one, one on the second one), and if I understand correctly, $PBS_NODEFILE was not ordered (e.g. n0 n1 n0 and *not* n0 n0 n1).
I tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in the queue if only two nodes with 16 slots each are available and I request
-l nodes=3:ppn=1
I guess this is a different scheduler configuration, and I cannot change that.
Could you please have a look at this?
Cheers,
Gilles
On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:
The usual cause of this problem is that the nodename in the
machinefile is given as a00551, while Torque is assigning the node
name as a00551.science.domain. Thus, mpirun thinks those are two
separate nodes and winds up spawning an orted on its own node.
You might try ensuring that your machinefile is using the exact same name as provided in your allocation.
On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
Thanks for the logs.
From what I see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun; orted should run only on a00553.
I will check the code and see what could be happening here.
Btw, what is the output of
hostname
hostname -f
on a00551?
Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installed and running correctly on your cluster?
Cheers,
Gilles
Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
Hi Gilles,
Thanks for the hint with the machinefile. I know it is not equivalent, and I do not intend to use that approach; I just wanted to know whether I could start the program successfully at all.
Outside Torque (4.2), rsh seems to be used, which works fine, querying a password if no Kerberos ticket is there.
Here is the output:
[zbh251@a00551 ~]$ mpirun -V
mpirun (Open MPI) 2.0.1
[zbh251@a00551 ~]$ ompi_info | grep ras
MCA ras: loadleveler (MCA v2.1.0, API v2.0.0,
Component
v2.0.1)
MCA ras: simulator (MCA v2.1.0, API v2.0.0,
Component
v2.0.1)
MCA ras: slurm (MCA v2.1.0, API v2.0.0,
Component
v2.0.1)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component
v2.0.1)
[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
-display-map hostname
[a00551.science.domain:04104] mca: base: components_register:
registering framework plm components
[a00551.science.domain:04104] mca: base: components_register:
found
loaded component isolated
[a00551.science.domain:04104] mca: base: components_register:
component
isolated has no register or open function
[a00551.science.domain:04104] mca: base: components_register:
found
loaded component rsh
[a00551.science.domain:04104] mca: base: components_register:
component
rsh register function successful
[a00551.science.domain:04104] mca: base: components_register:
found
loaded component slurm
[a00551.science.domain:04104] mca: base: components_register:
component
slurm register function successful
[a00551.science.domain:04104] mca: base: components_register:
found
loaded component tm
[a00551.science.domain:04104] mca: base: components_register:
component
tm register function successful
[a00551.science.domain:04104] mca: base: components_open: opening
plm
components
[a00551.science.domain:04104] mca: base: components_open: found
loaded
component isolated
[a00551.science.domain:04104] mca: base: components_open:
component
isolated open function successful
[a00551.science.domain:04104] mca: base: components_open: found
loaded
component rsh
[a00551.science.domain:04104] mca: base: components_open:
component rsh
open function successful
[a00551.science.domain:04104] mca: base: components_open: found
loaded
component slurm
[a00551.science.domain:04104] mca: base: components_open:
component
slurm open function successful
[a00551.science.domain:04104] mca: base: components_open: found
loaded
component tm
[a00551.science.domain:04104] mca: base: components_open:
component tm
open function successful
[a00551.science.domain:04104] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:04104] mca:base:select:( plm) Querying
component
[isolated]
[a00551.science.domain:04104] mca:base:select:( plm) Query of
component
[isolated] set priority to 0
[a00551.science.domain:04104] mca:base:select:( plm) Querying
component
[rsh]
[a00551.science.domain:04104] mca:base:select:( plm) Query of
component
[rsh] set priority to 10
[a00551.science.domain:04104] mca:base:select:( plm) Querying
component
[slurm]
[a00551.science.domain:04104] mca:base:select:( plm) Querying
component
[tm]
[a00551.science.domain:04104] mca:base:select:( plm) Query of
component
[tm] set priority to 75
[a00551.science.domain:04104] mca:base:select:( plm) Selected
component
[tm]
[a00551.science.domain:04104] mca: base: close: component
isolated
closed
[a00551.science.domain:04104] mca: base: close: unloading
component
isolated
[a00551.science.domain:04104] mca: base: close: component rsh
closed
[a00551.science.domain:04104] mca: base: close: unloading
component rsh
[a00551.science.domain:04104] mca: base: close: component slurm
closed
[a00551.science.domain:04104] mca: base: close: unloading
component
slurm
[a00551.science.domain:04109] mca: base: components_register:
registering framework plm components
[a00551.science.domain:04109] mca: base: components_register:
found
loaded component rsh
[a00551.science.domain:04109] mca: base: components_register:
component
rsh register function successful
[a00551.science.domain:04109] mca: base: components_open: opening
plm
components
[a00551.science.domain:04109] mca: base: components_open: found
loaded
component rsh
[a00551.science.domain:04109] mca: base: components_open:
component rsh
open function successful
[a00551.science.domain:04109] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:04109] mca:base:select:( plm) Querying
component
[rsh]
[a00551.science.domain:04109] mca:base:select:( plm) Query of
component
[rsh] set priority to 10
[a00551.science.domain:04109] mca:base:select:( plm) Selected
component
[rsh]
[a00551.science.domain:04109] [[53688,0],1] bind() failed on
error
Address already in use (98)
[a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error
in
file oob_usock_component.c at line 228
Data for JOB [53688,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound:
socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core
2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]],
socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core
7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound:
socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core
12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]],
socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core
17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain Num slots: 1 Max slots:
0 Num
procs: 1
Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound:
socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core
2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]],
socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core
7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[a00551.science.domain:04104] [[53688,0],0] complete_setup on job
[53688,1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive
update proc
state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
update_proc_state for job [53688,1]
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive
update proc
state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
update_proc_state for job [53688,1]
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:04109] mca: base: close: component rsh
closed
[a00551.science.domain:04109] mca: base: close: unloading
component rsh
[a00551.science.domain:04104] mca: base: close: component tm
closed
[a00551.science.domain:04104] mca: base: close: unloading
component tm
On 2016-09-07 14:41, Gilles Gouaillardet wrote:
Hi,
Which version of Open MPI are you running?
I noted that though you are asking for three nodes and one task per node, you have been allocated only 2 nodes.
I do not know if this is related to the issue.
Note that if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile) but a00553 has 20 slots (since it appears only once in the machinefile, the number of slots is automatically detected).
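If you want the machinefile to carry the same slot counts as the Torque allocation, you can also spell them out (a minimal sketch; slots= is standard Open MPI hostfile syntax, and myhosts is a made-up file name):
a00551.science.domain slots=2
a00553.science.domain slots=1
mpirun --hostfile myhosts -np 3 --tag-output hostname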
Can you run
mpirun --mca plm_base_verbose 10 ...
so we can confirm tm is used?
Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
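Something along these lines (a sketch; the exact session-directory names differ between Open MPI versions, so check what is actually in /tmp before deleting anything):
ls -d /tmp/openmpi-sessions-* 2>/dev/null
# once you have confirmed the directories belong to your own stale jobs:
rm -rf /tmp/openmpi-sessions-<your stale entries>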
Cheers,
Gilles
Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
Hi,
I am currently trying to set up Open MPI in Torque. Open MPI is built with tm support. Torque is correctly assigning nodes, and I can run MPI programs on single nodes just fine. The problem starts when processes are split between nodes.
For example, I create an interactive session with Torque and start a program by
qsub -I -n -l nodes=3:ppn=1
mpirun --tag-output -display-map hostname
which leads to:
[a00551.science.domain:15932] [[65415,0],1] bind() failed on
error
Address already in use (98)
[a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG:
Error in
file oob_usock_component.c at line 228
Data for JOB [65415,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound:
socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core
2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]],
socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core
7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound:
socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core
12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]],
socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core
17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain Num slots: 1 Max slots:
0 Num
procs: 1
Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound:
socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core
2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]],
socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core
7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[1,1]<stdout>:a00551.science.domain
If I log in on a00551 and start using the hostfile generated by the PBS_NODEFILE, everything works:
(from within the interactive session)
echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain
(from within the separate login)
mpirun --hostfile
/var/lib/torque/aux//278.a00552.science.domain -np 3
--tag-output -display-map hostname
Data for JOB [65445,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound:
socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core
2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]],
socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core
7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound:
socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core
12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]],
socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core
17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain Num slots: 20 Max slots:
0 Num
procs: 1
Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound:
socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core
2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]],
socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core
7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00553.science.domain
[1,1]<stdout>:a00551.science.domain
I am kind of lost as to what is going on here. Does anyone have an idea? I seriously suspect the Kerberos authentication that we have to work with, but I fail to see how it should affect the sockets.
Best,
Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users