Oswin,

So it seems that Open MPI thinks it tm_spawn()ed orted on the remote nodes, but orted ends up running on the same node as mpirun.
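Concretely, the check requested in the next paragraph could look like the sketch below (the /usr/local prefix is only an assumption here; substitute the actual Open MPI and torque install paths):

  # on each compute node, and on the node where torque was built/installed:
  ldd /usr/local/lib/openmpi/mca_plm_tm.so | grep libtorque
  # then fingerprint the library ldd resolved, using the exact path it printed:
  md5sum /usr/local/lib/libtorque.so.2

If the md5sums differ between hosts, the plugin may be resolving a different libtorque.so than the one torque was built with.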
On your compute nodes, can you ldd /.../lib/openmpi/mca_plm_tm.so And confirm it is linked with the same libtorque.so that was built/provided with torque ? Check path and md5sum on both compute nodes and the node on which you built torque (ideally from both build and install dir) Cheers, Gilles Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote: >Hi Gilles, > >There you go: > >[zbh251@a00551 ~]$ cat $PBS_NODEFILE >a00551.science.domain >a00554.science.domain >a00553.science.domain >[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca >plm_base_verbose 10 --mca ras_base_verbose 10 hostname >[a00551.science.domain:18889] mca: base: components_register: >registering framework ess components >[a00551.science.domain:18889] mca: base: components_register: found >loaded component pmi >[a00551.science.domain:18889] mca: base: components_register: component >pmi has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tool >[a00551.science.domain:18889] mca: base: components_register: component >tool has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component env >[a00551.science.domain:18889] mca: base: components_register: component >env has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component hnp >[a00551.science.domain:18889] mca: base: components_register: component >hnp has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component singleton >[a00551.science.domain:18889] mca: base: components_register: component >singleton register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component slurm >[a00551.science.domain:18889] mca: base: components_register: component >slurm has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18889] mca: base: components_register: component >tm has no register or open function >[a00551.science.domain:18889] mca: base: components_open: opening ess >components >[a00551.science.domain:18889] mca: base: components_open: found loaded >component pmi >[a00551.science.domain:18889] mca: base: components_open: component pmi >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tool >[a00551.science.domain:18889] mca: base: components_open: component tool >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component env >[a00551.science.domain:18889] mca: base: components_open: component env >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component hnp >[a00551.science.domain:18889] mca: base: components_open: component hnp >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component singleton >[a00551.science.domain:18889] mca: base: components_open: component >singleton open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component slurm >[a00551.science.domain:18889] mca: base: components_open: component >slurm open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tm >[a00551.science.domain:18889] mca: base: components_open: component tm >open function successful 
>[a00551.science.domain:18889] mca:base:select: Auto-selecting ess >components >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[pmi] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[tool] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[env] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[hnp] >[a00551.science.domain:18889] mca:base:select:( ess) Query of component >[hnp] set priority to 100 >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[singleton] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[slurm] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[tm] >[a00551.science.domain:18889] mca:base:select:( ess) Selected component >[hnp] >[a00551.science.domain:18889] mca: base: close: component pmi closed >[a00551.science.domain:18889] mca: base: close: unloading component pmi >[a00551.science.domain:18889] mca: base: close: component tool closed >[a00551.science.domain:18889] mca: base: close: unloading component tool >[a00551.science.domain:18889] mca: base: close: component env closed >[a00551.science.domain:18889] mca: base: close: unloading component env >[a00551.science.domain:18889] mca: base: close: component singleton >closed >[a00551.science.domain:18889] mca: base: close: unloading component >singleton >[a00551.science.domain:18889] mca: base: close: component slurm closed >[a00551.science.domain:18889] mca: base: close: unloading component >slurm >[a00551.science.domain:18889] mca: base: close: component tm closed >[a00551.science.domain:18889] mca: base: close: unloading component tm >[a00551.science.domain:18889] mca: base: components_register: >registering framework plm components >[a00551.science.domain:18889] mca: base: components_register: found >loaded component isolated >[a00551.science.domain:18889] mca: base: components_register: component >isolated has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component rsh >[a00551.science.domain:18889] mca: base: components_register: component >rsh register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component slurm >[a00551.science.domain:18889] mca: base: components_register: component >slurm register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18889] mca: base: components_register: component >tm register function successful >[a00551.science.domain:18889] mca: base: components_open: opening plm >components >[a00551.science.domain:18889] mca: base: components_open: found loaded >component isolated >[a00551.science.domain:18889] mca: base: components_open: component >isolated open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component rsh >[a00551.science.domain:18889] mca: base: components_open: component rsh >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component slurm >[a00551.science.domain:18889] mca: base: components_open: component >slurm open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tm >[a00551.science.domain:18889] mca: base: components_open: component tm >open function successful >[a00551.science.domain:18889] mca:base:select: Auto-selecting plm >components 
>[a00551.science.domain:18889] mca:base:select:( plm) Querying component >[isolated] >[a00551.science.domain:18889] mca:base:select:( plm) Query of component >[isolated] set priority to 0 >[a00551.science.domain:18889] mca:base:select:( plm) Querying component >[rsh] >[a00551.science.domain:18889] [[INVALID],INVALID] plm:rsh_lookup on >agent ssh : rsh path NULL >[a00551.science.domain:18889] mca:base:select:( plm) Query of component >[rsh] set priority to 10 >[a00551.science.domain:18889] mca:base:select:( plm) Querying component >[slurm] >[a00551.science.domain:18889] mca:base:select:( plm) Querying component >[tm] >[a00551.science.domain:18889] mca:base:select:( plm) Query of component >[tm] set priority to 75 >[a00551.science.domain:18889] mca:base:select:( plm) Selected component >[tm] >[a00551.science.domain:18889] mca: base: close: component isolated >closed >[a00551.science.domain:18889] mca: base: close: unloading component >isolated >[a00551.science.domain:18889] mca: base: close: component rsh closed >[a00551.science.domain:18889] mca: base: close: unloading component rsh >[a00551.science.domain:18889] mca: base: close: component slurm closed >[a00551.science.domain:18889] mca: base: close: unloading component >slurm >[a00551.science.domain:18889] plm:base:set_hnp_name: initial bias 18889 >nodename hash 2226275586 >[a00551.science.domain:18889] plm:base:set_hnp_name: final jobfam 34937 >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive start comm >[a00551.science.domain:18889] mca: base: components_register: >registering framework ras components >[a00551.science.domain:18889] mca: base: components_register: found >loaded component loadleveler >[a00551.science.domain:18889] mca: base: components_register: component >loadleveler register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component simulator >[a00551.science.domain:18889] mca: base: components_register: component >simulator register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component slurm >[a00551.science.domain:18889] mca: base: components_register: component >slurm register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18889] mca: base: components_register: component >tm register function successful >[a00551.science.domain:18889] mca: base: components_open: opening ras >components >[a00551.science.domain:18889] mca: base: components_open: found loaded >component loadleveler >[a00551.science.domain:18889] mca: base: components_open: component >loadleveler open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component simulator >[a00551.science.domain:18889] mca: base: components_open: found loaded >component slurm >[a00551.science.domain:18889] mca: base: components_open: component >slurm open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tm >[a00551.science.domain:18889] mca: base: components_open: component tm >open function successful >[a00551.science.domain:18889] mca:base:select: Auto-selecting ras >components >[a00551.science.domain:18889] mca:base:select:( ras) Querying component >[loadleveler] >[a00551.science.domain:18889] [[34937,0],0] ras:loadleveler: NOT >available for selection >[a00551.science.domain:18889] mca:base:select:( ras) Querying component >[simulator] >[a00551.science.domain:18889] 
mca:base:select:( ras) Querying component >[slurm] >[a00551.science.domain:18889] mca:base:select:( ras) Querying component >[tm] >[a00551.science.domain:18889] mca:base:select:( ras) Query of component >[tm] set priority to 100 >[a00551.science.domain:18889] mca:base:select:( ras) Selected component >[tm] >[a00551.science.domain:18889] mca: base: close: unloading component >loadleveler >[a00551.science.domain:18889] mca: base: close: unloading component >simulator >[a00551.science.domain:18889] mca: base: close: component slurm closed >[a00551.science.domain:18889] mca: base: close: unloading component >slurm >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_job >[a00551.science.domain:18889] [[34937,0],0] ras:base:allocate >[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: >got hostname a00551.science.domain >[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: >not found -- added to list >[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: >got hostname a00554.science.domain >[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: >not found -- added to list >[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: >got hostname a00553.science.domain >[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: >not found -- added to list >[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert >inserting 3 nodes >[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert >updating HNP [a00551.science.domain] info to 1 slots >[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert node >a00554.science.domain slots 1 >[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert node >a00553.science.domain slots 1 > >====================== ALLOCATED NODES ====================== > a00551: slots=1 max_slots=0 slots_inuse=0 state=UP > a00554.science.domain: slots=1 max_slots=0 slots_inuse=0 state=UP > a00553.science.domain: slots=1 max_slots=0 slots_inuse=0 state=UP >================================================================= >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm creating >map >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm add new >daemon [[34937,0],1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm assigning >new daemon [[34937,0],1] to node a00554.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm add new >daemon [[34937,0],2] >[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm assigning >new daemon [[34937,0],2] to node a00553.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:tm: launching vm >[a00551.science.domain:18889] [[34937,0],0] plm:tm: final top-level >argv: > orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm >-mca ess_base_jobid 2289631232 -mca ess_base_vpid <template> -mca >ess_base_num_procs 3 -mca orte_hnp_uri >2289631232.0;usock;tcp://130.226.12.194:59413;tcp6://[fe80::225:90ff:feeb:f6d5]:46374 > >--mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca >ras_base_verbose 10 >[a00551.science.domain:18889] [[34937,0],0] plm:tm: launching on node >a00554.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:tm: executing: > orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm >-mca ess_base_jobid 2289631232 -mca ess_base_vpid 1 -mca >ess_base_num_procs 3 -mca orte_hnp_uri 
>2289631232.0;usock;tcp://130.226.12.194:59413;tcp6://[fe80::225:90ff:feeb:f6d5]:46374 > >--mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca >ras_base_verbose 10 >[a00551.science.domain:18889] [[34937,0],0] plm:tm: launching on node >a00553.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:tm: executing: > orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm >-mca ess_base_jobid 2289631232 -mca ess_base_vpid 2 -mca >ess_base_num_procs 3 -mca orte_hnp_uri >2289631232.0;usock;tcp://130.226.12.194:59413;tcp6://[fe80::225:90ff:feeb:f6d5]:46374 > >--mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca >ras_base_verbose 10 >[a00551.science.domain:18889] [[34937,0],0] plm:tm:launch: finished >spawning orteds >[a00551.science.domain:18894] mca: base: components_register: >registering framework ess components >[a00551.science.domain:18894] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18894] mca: base: components_register: component >tm has no register or open function >[a00551.science.domain:18894] mca: base: components_open: opening ess >components >[a00551.science.domain:18894] mca: base: components_open: found loaded >component tm >[a00551.science.domain:18894] mca: base: components_open: component tm >open function successful >[a00551.science.domain:18894] mca:base:select: Auto-selecting ess >components >[a00551.science.domain:18894] mca:base:select:( ess) Querying component >[tm] >[a00551.science.domain:18894] mca:base:select:( ess) Query of component >[tm] set priority to 30 >[a00551.science.domain:18894] mca:base:select:( ess) Selected component >[tm] >[a00551.science.domain:18894] ess:tm setting name >[a00551.science.domain:18894] ess:tm set name to [[34937,0],1] >[a00551.science.domain:18895] mca: base: components_register: >registering framework ess components >[a00551.science.domain:18895] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18895] mca: base: components_register: component >tm has no register or open function >[a00551.science.domain:18895] mca: base: components_open: opening ess >components >[a00551.science.domain:18895] mca: base: components_open: found loaded >component tm >[a00551.science.domain:18895] mca: base: components_open: component tm >open function successful >[a00551.science.domain:18895] mca:base:select: Auto-selecting ess >components >[a00551.science.domain:18895] mca:base:select:( ess) Querying component >[tm] >[a00551.science.domain:18895] mca:base:select:( ess) Query of component >[tm] set priority to 30 >[a00551.science.domain:18895] mca:base:select:( ess) Selected component >[tm] >[a00551.science.domain:18895] ess:tm setting name >[a00551.science.domain:18895] ess:tm set name to [[34937,0],2] >[a00551.science.domain:18894] mca: base: components_register: >registering framework plm components >[a00551.science.domain:18894] mca: base: components_register: found >loaded component rsh >[a00551.science.domain:18894] mca: base: components_register: component >rsh register function successful >[a00551.science.domain:18894] mca: base: components_open: opening plm >components >[a00551.science.domain:18894] mca: base: components_open: found loaded >component rsh >[a00551.science.domain:18894] mca: base: components_open: component rsh >open function successful >[a00551.science.domain:18894] mca:base:select: Auto-selecting plm >components >[a00551.science.domain:18894] mca:base:select:( plm) Querying component >[rsh] >[a00551.science.domain:18894] 
[[34937,0],1] plm:rsh_lookup on agent ssh >: rsh path NULL >[a00551.science.domain:18894] mca:base:select:( plm) Query of component >[rsh] set priority to 10 >[a00551.science.domain:18894] mca:base:select:( plm) Selected component >[rsh] >[a00551.science.domain:18894] [[34937,0],1] setting up session dir with > tmpdir: UNDEF > host a00551 >[a00551.science.domain:18894] [[34937,0],1] bind() failed on error >Address already in use (98) >[a00551.science.domain:18894] [[34937,0],1] ORTE_ERROR_LOG: Error in >file oob_usock_component.c at line 228 >[a00551.science.domain:18894] [[34937,0],1] plm:rsh_setup on agent ssh : >rsh path NULL >[a00551.science.domain:18894] [[34937,0],1] plm:base:receive start comm >[a00551.science.domain:18895] mca: base: components_register: >registering framework plm components >[a00551.science.domain:18895] mca: base: components_register: found >loaded component rsh >[a00551.science.domain:18895] mca: base: components_register: component >rsh register function successful >[a00551.science.domain:18895] mca: base: components_open: opening plm >components >[a00551.science.domain:18895] mca: base: components_open: found loaded >component rsh >[a00551.science.domain:18895] mca: base: components_open: component rsh >open function successful >[a00551.science.domain:18895] mca:base:select: Auto-selecting plm >components >[a00551.science.domain:18895] mca:base:select:( plm) Querying component >[rsh] >[a00551.science.domain:18895] [[34937,0],2] plm:rsh_lookup on agent ssh >: rsh path NULL >[a00551.science.domain:18895] mca:base:select:( plm) Query of component >[rsh] set priority to 10 >[a00551.science.domain:18895] mca:base:select:( plm) Selected component >[rsh] >[a00551.science.domain:18895] [[34937,0],2] setting up session dir with > tmpdir: UNDEF > host a00551 >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >from daemon [[34937,0],1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >from daemon [[34937,0],1] on node a00551 >[a00551.science.domain:18895] [[34937,0],2] bind() failed on error >Address already in use (98) >[a00551.science.domain:18895] [[34937,0],2] ORTE_ERROR_LOG: Error in >file oob_usock_component.c at line 228 >[a00551.science.domain:18889] [[34937,0],0] RECEIVED TOPOLOGY FROM NODE >a00551 >[a00551.science.domain:18889] [[34937,0],0] ADDING TOPOLOGY PER USER >REQUEST TO NODE a00554.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >completed for daemon [[34937,0],1] at contact >2289631232.1;tcp://130.226.12.194:46861;tcp6://[fe80::225:90ff:feeb:f6d5]:33227 >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >recvd 2 of 3 reported daemons >[a00551.science.domain:18895] [[34937,0],2] plm:rsh_setup on agent ssh : >rsh path NULL >[a00551.science.domain:18895] [[34937,0],2] plm:base:receive start comm >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >from daemon [[34937,0],2] >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >from daemon [[34937,0],2] on node a00551 >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >completed for daemon [[34937,0],2] at contact >2289631232.2;tcp://130.226.12.194:38146;tcp6://[fe80::225:90ff:feeb:f6d5]:44834 >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch >recvd 3 of 3 reported daemons >[a00551.science.domain:18889] [[34937,0],0] plm:base:setting topo to >that from node a00554.science.domain >[a00551.science.domain:18889] 
[[34937,0],0] complete_setup on job >[34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:launch_apps for job >[34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing >msg >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc >state command from [[34937,0],1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for job [34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for vpid 1 state RUNNING exit_code 0 >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done >processing commands >a00551.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing >msg >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc >state command from [[34937,0],2] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for job [34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for vpid 2 state RUNNING exit_code 0 >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done >processing commands >[a00551.science.domain:18889] [[34937,0],0] plm:base:launch wiring up >iof for job [34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:launch job >[34937,1] is not a dynamic spawn >a00551.science.domain >a00551.science.domain >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing >msg >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc >state command from [[34937,0],1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for job [34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0 >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done >processing commands >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing >msg >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc >state command from [[34937,0],2] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for job [34937,1] >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got >update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0 >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done >processing commands >[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_cmd sending >orted_exit commands >[a00551.science.domain:18894] [[34937,0],1] plm:base:receive stop comm >[a00551.science.domain:18894] mca: base: close: component rsh closed >[a00551.science.domain:18894] mca: base: close: unloading component rsh >[a00551.science.domain:18895] [[34937,0],2] plm:base:receive stop comm >[a00551.science.domain:18895] mca: base: close: component rsh closed >[a00551.science.domain:18895] mca: base: close: unloading component rsh >[a00551.science.domain:18895] mca: base: close: component tm closed >[a00551.science.domain:18895] mca: base: close: unloading component tm >[a00551.science.domain:18894] mca: base: close: component tm closed >[a00551.science.domain:18894] mca: base: close: unloading component tm >[a00551.science.domain:18889] [[34937,0],0] ras:tm:finalize: success >(nothing to do) >[a00551.science.domain:18889] mca: base: close: unloading component tm >[a00551.science.domain:18889] [[34937,0],0] plm:base:receive stop comm >[a00551.science.domain:18889] mca: base: close: component tm closed >[a00551.science.domain:18889] mca: 
base: close: unloading component tm >[a00551.science.domain:18889] mca: base: close: component hnp closed >[a00551.science.domain:18889] mca: base: close: unloading component hnp > > >Cheers, >Oswin > >On 2016-09-08 12:13, Gilles Gouaillardet wrote: >> Oswin, >> >> >> can you please run again (one task per physical node) with >> >> mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca >> ras_base_verbose 10 hostname >> >> >> Cheers, >> >> >> Gilles >> >> >> On 9/8/2016 6:42 PM, Oswin Krause wrote: >>> Hi, >>> >>> i reconfigured to only have one physical node. Still no success, but >>> the nodefile now looks better. I still get the errors: >>> >>> [a00551.science.domain:18021] [[34768,0],1] bind() failed on error >>> Address already in use (98) >>> [a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in >>> file oob_usock_component.c at line 228 >>> [a00551.science.domain:18022] [[34768,0],2] bind() failed on error >>> Address already in use (98) >>> [a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in >>> file oob_usock_component.c at line 228 >>> >>> (btw: for some reason the bind errors where missing. sorry!) >>> >>> PBS_NODEFILE >>> a00551.science.domain >>> a00554.science.domain >>> a00553.science.domain >>> ----------------------- >>> mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname >>> [a00551.science.domain:18097] mca: base: components_register: >>> registering framework plm components >>> [a00551.science.domain:18097] mca: base: components_register: found >>> loaded component isolated >>> [a00551.science.domain:18097] mca: base: components_register: >>> component isolated has no register or open function >>> [a00551.science.domain:18097] mca: base: components_register: found >>> loaded component rsh >>> [a00551.science.domain:18097] mca: base: components_register: >>> component rsh register function successful >>> [a00551.science.domain:18097] mca: base: components_register: found >>> loaded component slurm >>> [a00551.science.domain:18097] mca: base: components_register: >>> component slurm register function successful >>> [a00551.science.domain:18097] mca: base: components_register: found >>> loaded component tm >>> [a00551.science.domain:18097] mca: base: components_register: >>> component tm register function successful >>> [a00551.science.domain:18097] mca: base: components_open: opening plm >>> components >>> [a00551.science.domain:18097] mca: base: components_open: found loaded >>> component isolated >>> [a00551.science.domain:18097] mca: base: components_open: component >>> isolated open function successful >>> [a00551.science.domain:18097] mca: base: components_open: found loaded >>> component rsh >>> [a00551.science.domain:18097] mca: base: components_open: component >>> rsh open function successful >>> [a00551.science.domain:18097] mca: base: components_open: found loaded >>> component slurm >>> [a00551.science.domain:18097] mca: base: components_open: component >>> slurm open function successful >>> [a00551.science.domain:18097] mca: base: components_open: found loaded >>> component tm >>> [a00551.science.domain:18097] mca: base: components_open: component tm >>> open function successful >>> [a00551.science.domain:18097] mca:base:select: Auto-selecting plm >>> components >>> [a00551.science.domain:18097] mca:base:select:( plm) Querying >>> component [isolated] >>> [a00551.science.domain:18097] mca:base:select:( plm) Query of >>> component [isolated] set priority to 0 >>> [a00551.science.domain:18097] 
mca:base:select:( plm) Querying >>> component [rsh] >>> [a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on >>> agent ssh : rsh path NULL >>> [a00551.science.domain:18097] mca:base:select:( plm) Query of >>> component [rsh] set priority to 10 >>> [a00551.science.domain:18097] mca:base:select:( plm) Querying >>> component [slurm] >>> [a00551.science.domain:18097] mca:base:select:( plm) Querying >>> component [tm] >>> [a00551.science.domain:18097] mca:base:select:( plm) Query of >>> component [tm] set priority to 75 >>> [a00551.science.domain:18097] mca:base:select:( plm) Selected >>> component [tm] >>> [a00551.science.domain:18097] mca: base: close: component isolated >>> closed >>> [a00551.science.domain:18097] mca: base: close: unloading component >>> isolated >>> [a00551.science.domain:18097] mca: base: close: component rsh closed >>> [a00551.science.domain:18097] mca: base: close: unloading component >>> rsh >>> [a00551.science.domain:18097] mca: base: close: component slurm closed >>> [a00551.science.domain:18097] mca: base: close: unloading component >>> slurm >>> [a00551.science.domain:18097] plm:base:set_hnp_name: initial bias >>> 18097 nodename hash 2226275586 >>> [a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam >>> 34561 >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive start >>> comm >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating >>> map >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new >>> daemon [[34561,0],1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm >>> assigning new daemon [[34561,0],1] to node a00554.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new >>> daemon [[34561,0],2] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm >>> assigning new daemon [[34561,0],2] to node a00553.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching vm >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: final top-level >>> argv: >>> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess >>> tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid <template> -mca >>> ess_base_num_procs 3 -mca orte_hnp_uri >>> 2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904 >>> >>> --mca plm_base_verbose 10 >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching on node >>> a00554.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: executing: >>> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess >>> tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid 1 -mca >>> ess_base_num_procs 3 -mca orte_hnp_uri >>> 2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904 >>> >>> --mca plm_base_verbose 10 >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching on node >>> a00553.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: executing: >>> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess >>> tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid 2 -mca >>> ess_base_num_procs 3 -mca orte_hnp_uri >>> 2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904 >>> >>> --mca plm_base_verbose 10 >>> [a00551.science.domain:18097] [[34561,0],0] plm:tm:launch: finished >>> spawning orteds >>> 
[a00551.science.domain:18102] mca: base: components_register: >>> registering framework plm components >>> [a00551.science.domain:18102] mca: base: components_register: found >>> loaded component rsh >>> [a00551.science.domain:18102] mca: base: components_register: >>> component rsh register function successful >>> [a00551.science.domain:18102] mca: base: components_open: opening plm >>> components >>> [a00551.science.domain:18102] mca: base: components_open: found loaded >>> component rsh >>> [a00551.science.domain:18102] mca: base: components_open: component >>> rsh open function successful >>> [a00551.science.domain:18102] mca:base:select: Auto-selecting plm >>> components >>> [a00551.science.domain:18102] mca:base:select:( plm) Querying >>> component [rsh] >>> [a00551.science.domain:18102] [[34561,0],1] plm:rsh_lookup on agent >>> ssh : rsh path NULL >>> [a00551.science.domain:18102] mca:base:select:( plm) Query of >>> component [rsh] set priority to 10 >>> [a00551.science.domain:18102] mca:base:select:( plm) Selected >>> component [rsh] >>> [a00551.science.domain:18102] [[34561,0],1] bind() failed on error >>> Address already in use (98) >>> [a00551.science.domain:18102] [[34561,0],1] ORTE_ERROR_LOG: Error in >>> file oob_usock_component.c at line 228 >>> [a00551.science.domain:18102] [[34561,0],1] plm:rsh_setup on agent ssh >>> : rsh path NULL >>> [a00551.science.domain:18102] [[34561,0],1] plm:base:receive start >>> comm >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch from daemon [[34561,0],1] >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch from daemon [[34561,0],1] on node a00551 >>> [a00551.science.domain:18097] [[34561,0],0] RECEIVED TOPOLOGY FROM >>> NODE a00551 >>> [a00551.science.domain:18097] [[34561,0],0] ADDING TOPOLOGY PER USER >>> REQUEST TO NODE a00554.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch completed for daemon [[34561,0],1] at >>> contact >>> 2264989696.1;tcp://130.226.12.194:52354;tcp6://[fe80::225:90ff:feeb:f6d5]:60904 >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch recvd 2 of 3 reported daemons >>> [a00551.science.domain:18103] mca: base: components_register: >>> registering framework plm components >>> [a00551.science.domain:18103] mca: base: components_register: found >>> loaded component rsh >>> [a00551.science.domain:18103] mca: base: components_register: >>> component rsh register function successful >>> [a00551.science.domain:18103] mca: base: components_open: opening plm >>> components >>> [a00551.science.domain:18103] mca: base: components_open: found loaded >>> component rsh >>> [a00551.science.domain:18103] mca: base: components_open: component >>> rsh open function successful >>> [a00551.science.domain:18103] mca:base:select: Auto-selecting plm >>> components >>> [a00551.science.domain:18103] mca:base:select:( plm) Querying >>> component [rsh] >>> [a00551.science.domain:18103] [[34561,0],2] plm:rsh_lookup on agent >>> ssh : rsh path NULL >>> [a00551.science.domain:18103] mca:base:select:( plm) Query of >>> component [rsh] set priority to 10 >>> [a00551.science.domain:18103] mca:base:select:( plm) Selected >>> component [rsh] >>> [a00551.science.domain:18103] [[34561,0],2] bind() failed on error >>> Address already in use (98) >>> [a00551.science.domain:18103] [[34561,0],2] ORTE_ERROR_LOG: Error in >>> file oob_usock_component.c at line 228 >>> [a00551.science.domain:18103] [[34561,0],2] plm:rsh_setup on 
agent ssh >>> : rsh path NULL >>> [a00551.science.domain:18103] [[34561,0],2] plm:base:receive start >>> comm >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch from daemon [[34561,0],2] >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch from daemon [[34561,0],2] on node a00551 >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch completed for daemon [[34561,0],2] at >>> contact >>> 2264989696.2;tcp://130.226.12.194:41272;tcp6://[fe80::225:90ff:feeb:f6d5]:35343 >>> [a00551.science.domain:18097] [[34561,0],0] >>> plm:base:orted_report_launch recvd 3 of 3 reported daemons >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setting topo to >>> that from node a00554.science.domain >>> Data for JOB [34561,1] offset 0 >>> >>> ======================== JOB MAP ======================== >>> >>> Data for node: a00551 Num slots: 1 Max slots: 0 Num procs: 1 >>> Process OMPI jobid: [34561,1] App: 0 Process rank: 0 Bound: >>> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core >>> 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], >>> socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core >>> 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt >>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] >>> >>> Data for node: a00554.science.domain Num slots: 1 Max slots: 0 >>> Num procs: 1 >>> Process OMPI jobid: [34561,1] App: 0 Process rank: 1 Bound: >>> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core >>> 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], >>> socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core >>> 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt >>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] >>> >>> Data for node: a00553.science.domain Num slots: 1 Max slots: 0 >>> Num procs: 1 >>> Process OMPI jobid: [34561,1] App: 0 Process rank: 2 Bound: >>> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core >>> 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], >>> socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core >>> 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt >>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] 
>>> >>> ============================================================= >>> [a00551.science.domain:18097] [[34561,0],0] complete_setup on job >>> [34561,1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:launch_apps for >>> job [34561,1] >>> [1,0]<stdout>:a00551.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive >>> processing msg >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update >>> proc state command from [[34561,0],2] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for job [34561,1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for vpid 2 state RUNNING exit_code 0 >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done >>> processing commands >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive >>> processing msg >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update >>> proc state command from [[34561,0],1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for job [34561,1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for vpid 1 state RUNNING exit_code 0 >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done >>> processing commands >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:launch wiring up >>> iof for job [34561,1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:launch job >>> [34561,1] is not a dynamic spawn >>> [1,2]<stdout>:a00551.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive >>> processing msg >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update >>> proc state command from [[34561,0],2] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for job [34561,1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0 >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done >>> processing commands >>> [1,1]<stdout>:a00551.science.domain >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive >>> processing msg >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update >>> proc state command from [[34561,0],1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for job [34561,1] >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got >>> update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0 >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done >>> processing commands >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:orted_cmd sending >>> orted_exit commands >>> [a00551.science.domain:18102] [[34561,0],1] plm:base:receive stop comm >>> [a00551.science.domain:18102] mca: base: close: component rsh closed >>> [a00551.science.domain:18102] mca: base: close: unloading component >>> rsh >>> [a00551.science.domain:18103] [[34561,0],2] plm:base:receive stop comm >>> [a00551.science.domain:18103] mca: base: close: component rsh closed >>> [a00551.science.domain:18103] mca: base: close: unloading component >>> rsh >>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive stop comm >>> [a00551.science.domain:18097] mca: base: close: component tm closed >>> [a00551.science.domain:18097] mca: base: close: unloading component tm >>> >>> >>> Best, >>> Oswin >>> >>> On 2016-09-08 10:33, Oswin 
Krause wrote: >>>> Hi Gilles, Hi Ralph, >>>> >>>> I have just rebuild openmpi. quite a lot more of information. As I >>>> said, i did not tinker with the PBS_NODEFILE. I think the issue might >>>> be NUMA here. I can try to go through the process and reconfigure to >>>> non-numa and see whether this works. The issue might be that the node >>>> allocation looks like this: >>>> >>>> a00551.science.domain-0 >>>> a00552.science.domain-0 >>>> a00551.science.domain-1 >>>> >>>> and the last part then gets shortened which leads to the issue. Not >>>> sure whether this makes sense but this is my explanation. >>>> >>>> Here the output: >>>> $PBS_NODEFILE >>>> /var/lib/torque/aux//285.a00552.science.domain >>>> PBS_NODEFILE >>>> a00551.science.domain >>>> a00553.science.domain >>>> a00551.science.domain >>>> --------- >>>> [a00551.science.domain:16986] mca: base: components_register: >>>> registering framework plm components >>>> [a00551.science.domain:16986] mca: base: components_register: found >>>> loaded component isolated >>>> [a00551.science.domain:16986] mca: base: components_register: >>>> component isolated has no register or open function >>>> [a00551.science.domain:16986] mca: base: components_register: found >>>> loaded component rsh >>>> [a00551.science.domain:16986] mca: base: components_register: >>>> component rsh register function successful >>>> [a00551.science.domain:16986] mca: base: components_register: found >>>> loaded component slurm >>>> [a00551.science.domain:16986] mca: base: components_register: >>>> component slurm register function successful >>>> [a00551.science.domain:16986] mca: base: components_register: found >>>> loaded component tm >>>> [a00551.science.domain:16986] mca: base: components_register: >>>> component tm register function successful >>>> [a00551.science.domain:16986] mca: base: components_open: opening plm >>>> components >>>> [a00551.science.domain:16986] mca: base: components_open: found >>>> loaded >>>> component isolated >>>> [a00551.science.domain:16986] mca: base: components_open: component >>>> isolated open function successful >>>> [a00551.science.domain:16986] mca: base: components_open: found >>>> loaded >>>> component rsh >>>> [a00551.science.domain:16986] mca: base: components_open: component >>>> rsh open function successful >>>> [a00551.science.domain:16986] mca: base: components_open: found >>>> loaded >>>> component slurm >>>> [a00551.science.domain:16986] mca: base: components_open: component >>>> slurm open function successful >>>> [a00551.science.domain:16986] mca: base: components_open: found >>>> loaded >>>> component tm >>>> [a00551.science.domain:16986] mca: base: components_open: component >>>> tm >>>> open function successful >>>> [a00551.science.domain:16986] mca:base:select: Auto-selecting plm >>>> components >>>> [a00551.science.domain:16986] mca:base:select:( plm) Querying >>>> component [isolated] >>>> [a00551.science.domain:16986] mca:base:select:( plm) Query of >>>> component [isolated] set priority to 0 >>>> [a00551.science.domain:16986] mca:base:select:( plm) Querying >>>> component [rsh] >>>> [a00551.science.domain:16986] [[INVALID],INVALID] plm:rsh_lookup on >>>> agent ssh : rsh path NULL >>>> [a00551.science.domain:16986] mca:base:select:( plm) Query of >>>> component [rsh] set priority to 10 >>>> [a00551.science.domain:16986] mca:base:select:( plm) Querying >>>> component [slurm] >>>> [a00551.science.domain:16986] mca:base:select:( plm) Querying >>>> component [tm] >>>> [a00551.science.domain:16986] 
mca:base:select:( plm) Query of >>>> component [tm] set priority to 75 >>>> [a00551.science.domain:16986] mca:base:select:( plm) Selected >>>> component [tm] >>>> [a00551.science.domain:16986] mca: base: close: component isolated >>>> closed >>>> [a00551.science.domain:16986] mca: base: close: unloading component >>>> isolated >>>> [a00551.science.domain:16986] mca: base: close: component rsh closed >>>> [a00551.science.domain:16986] mca: base: close: unloading component >>>> rsh >>>> [a00551.science.domain:16986] mca: base: close: component slurm >>>> closed >>>> [a00551.science.domain:16986] mca: base: close: unloading component >>>> slurm >>>> [a00551.science.domain:16986] plm:base:set_hnp_name: initial bias >>>> 16986 nodename hash 2226275586 >>>> [a00551.science.domain:16986] plm:base:set_hnp_name: final jobfam >>>> 33770 >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive start >>>> comm >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_job >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm >>>> creating map >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm add new >>>> daemon [[33770,0],1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm >>>> assigning new daemon [[33770,0],1] to node a00553.science.domain >>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: launching vm >>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: final top-level >>>> argv: >>>> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess >>>> tm >>>> -mca ess_base_jobid 2213150720 -mca ess_base_vpid <template> -mca >>>> ess_base_num_procs 2 -mca orte_hnp_uri >>>> 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821 >>>> >>>> --mca plm_base_verbose 10 >>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: launching on node >>>> a00553.science.domain >>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: executing: >>>> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess >>>> tm >>>> -mca ess_base_jobid 2213150720 -mca ess_base_vpid 1 -mca >>>> ess_base_num_procs 2 -mca orte_hnp_uri >>>> 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821 >>>> >>>> --mca plm_base_verbose 10 >>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm:launch: finished >>>> spawning orteds >>>> [a00551.science.domain:16986] [[33770,0],0] >>>> plm:base:orted_report_launch from daemon [[33770,0],1] >>>> [a00551.science.domain:16986] [[33770,0],0] >>>> plm:base:orted_report_launch from daemon [[33770,0],1] on node a00551 >>>> [a00551.science.domain:16986] [[33770,0],0] RECEIVED TOPOLOGY FROM >>>> NODE a00551 >>>> [a00551.science.domain:16986] [[33770,0],0] ADDING TOPOLOGY PER USER >>>> REQUEST TO NODE a00553.science.domain >>>> [a00551.science.domain:16986] [[33770,0],0] >>>> plm:base:orted_report_launch completed for daemon [[33770,0],1] at >>>> contact >>>> 2213150720.1;tcp://130.226.12.194:38025;tcp6://[fe80::225:90ff:feeb:f6d5]:39080 >>>> >>>> [a00551.science.domain:16986] [[33770,0],0] >>>> plm:base:orted_report_launch recvd 2 of 2 reported daemons >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setting topo to >>>> that from node a00553.science.domain >>>> Data for JOB [33770,1] offset 0 >>>> >>>> ======================== JOB MAP ======================== >>>> >>>> Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: >>>> 2 >>>> Process OMPI jobid: [33770,1] App: 0 Process 
rank: 0 Bound: >>>> socket >>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt >>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket >>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt >>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt >>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] >>>> Process OMPI jobid: [33770,1] App: 0 Process rank: 1 Bound: >>>> socket >>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt >>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket >>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt >>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt >>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB] >>>> >>>> Data for node: a00553.science.domain Num slots: 1 Max slots: 0 >>>> Num procs: 1 >>>> Process OMPI jobid: [33770,1] App: 0 Process rank: 2 Bound: >>>> socket >>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt >>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket >>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt >>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt >>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] >>>> >>>> ============================================================= >>>> [a00551.science.domain:16986] [[33770,0],0] complete_setup on job >>>> [33770,1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:launch_apps for >>>> job [33770,1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive >>>> processing msg >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive update >>>> proc state command from [[33770,0],1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got >>>> update_proc_state for job [33770,1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got >>>> update_proc_state for vpid 2 state RUNNING exit_code 0 >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive done >>>> processing commands >>>> [1,0]<stdout>:a00551.science.domain >>>> [1,2]<stdout>:a00551.science.domain >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive >>>> processing msg >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive update >>>> proc state command from [[33770,0],1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got >>>> update_proc_state for job [33770,1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got >>>> update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0 >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive done >>>> processing commands >>>> [1,1]<stdout>:a00551.science.domain >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:launch wiring up >>>> iof for job [33770,1] >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:launch job >>>> [33770,1] is not a dynamic spawn >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:orted_cmd >>>> sending >>>> orted_exit commands >>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive stop >>>> comm >>>> [a00551.science.domain:16986] mca: base: close: component tm closed >>>> [a00551.science.domain:16986] mca: base: close: unloading component >>>> tm >>>> >>>> >>>> >>>> On 2016-09-08 10:18, Gilles Gouaillardet wrote: >>>>> Ralph, >>>>> >>>>> >>>>> i am not sure i am reading you correctly, so let me clarify. 
>>>>> >>>>> >>>>> i did not hack $PBS_NODEFILE for fun nor profit, i was simply trying >>>>> to reproduce an issue i could not reproduce otherwise. >>>>> >>>>> /* my job submitted with -l nodes=3:ppn=1 do not start if there are >>>>> only two nodes available, whereas the same user job >>>>> >>>>> starts on two nodes */ >>>>> >>>>> thanks for the explanation of the torque internals, my hack was >>>>> incomplete and not a valid one, i do acknowledge it. >>>>> >>>>> >>>>> i re-read the email that started this thread and i found the >>>>> information i was looking for >>>>> >>>>> >>>>>> echo $PBS_NODEFILE >>>>>> /var/lib/torque/aux//278.a00552.science.domain >>>>>> cat $PBS_NODEFILE >>>>>> a00551.science.domain >>>>>> a00553.science.domain >>>>>> a00551.science.domain >>>>> >>>>> >>>>> so, assuming the enduser did not edit his $PBS_NODEFILE, and torque >>>>> is >>>>> correctly configured and not busted, then >>>>> >>>>>> Torque indeed always provides an ordered file - the only way you >>>>>> can get an unordered one is for someone to edit it >>>>> might be updated to >>>>> >>>>> "Torque used to always provide an ordered file, but recent versions >>>>> might not do that." >>>>> >>>>> >>>>> makes sense ? >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> Gilles >>>>> >>>>> >>>>> On 9/8/2016 4:57 PM, r...@open-mpi.org wrote: >>>>>> Someone has done some work there since I last did, but I can see >>>>>> the issue. Torque indeed always provides an ordered file - the only >>>>>> way you can get an unordered one is for someone to edit it, and >>>>>> that is forbidden - i.e., you get what you deserve because you are >>>>>> messing around with a system-defined file :-) >>>>>> >>>>>> The problem is that Torque internally assigns a “launch ID” which >>>>>> is just the integer position of the nodename in the PBS_NODEFILE. >>>>>> So if you modify that position, then we get the wrong index - and >>>>>> everything goes down the drain from there. In your example, >>>>>> n1.cluster changed index from 3 to 2 because of your edit. Torque >>>>>> thinks that index 2 is just another reference to n0.cluster, and so >>>>>> we merrily launch a daemon onto the wrong node. >>>>>> >>>>>> They have a good reason for doing things this way. It allows you to >>>>>> launch a process against each launch ID, and the pattern will >>>>>> reflect the original qsub request in what we would call a map-by >>>>>> slot round-robin mode. This maximizes the use of shared memory, and >>>>>> is expected to provide good performance for a range of apps. >>>>>> >>>>>> Lesson to be learned: never, ever muddle around with a >>>>>> system-generated file. If you want to modify where things go, then >>>>>> use one or more of the mpirun options to do so. We give you lots >>>>>> and lots of knobs for just that reason. >>>>>> >>>>>> >>>>>> >>>>>>> On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet >>>>>>> <gil...@rist.or.jp> wrote: >>>>>>> >>>>>>> Ralph, >>>>>>> >>>>>>> >>>>>>> there might be an issue within Open MPI. >>>>>>> >>>>>>> >>>>>>> on the cluster i used, hostname returns the FQDN, and >>>>>>> $PBS_NODEFILE uses the FQDN too. >>>>>>> >>>>>>> my $PBS_NODEFILE has one line per task, and it is ordered >>>>>>> >>>>>>> e.g. 
>>>>>>> >>>>>>> n0.cluster >>>>>>> >>>>>>> n0.cluster >>>>>>> >>>>>>> n1.cluster >>>>>>> >>>>>>> n1.cluster >>>>>>> >>>>>>> >>>>>>> in my torque script, i rewrote the machinefile like this >>>>>>> >>>>>>> n0.cluster >>>>>>> >>>>>>> n1.cluster >>>>>>> >>>>>>> n0.cluster >>>>>>> >>>>>>> n1.cluster >>>>>>> >>>>>>> and updated the PBS environment variable to point to my new file. >>>>>>> >>>>>>> >>>>>>> then i invoked >>>>>>> >>>>>>> mpirun hostname >>>>>>> >>>>>>> >>>>>>> >>>>>>> in the first case, 2 tasks run on n0 and 2 tasks run on n1 >>>>>>> in the second case, 4 tasks run on n0, and none on n1. >>>>>>> >>>>>>> so i am thinking we might not support unordered $PBS_NODEFILE. >>>>>>> >>>>>>> as a reminder, the submit command was >>>>>>> qsub -l nodes=3:ppn=1 >>>>>>> but for some reason i do not know, only two nodes were allocated (two >>>>>>> slots on the first one, one on the second one) >>>>>>> and if i understand correctly, $PBS_NODEFILE was not ordered. >>>>>>> (e.g. n0 n1 n0 and *not* n0 n0 n1) >>>>>>> >>>>>>> i tried to reproduce this without hacking $PBS_NODEFILE, but my >>>>>>> jobs hang in the queue if only two nodes with 16 slots each are >>>>>>> available and i request >>>>>>> -l nodes=3:ppn=1 >>>>>>> i guess this is a different scheduler configuration, and i cannot >>>>>>> change that. >>>>>>> >>>>>>> Could you please have a look at this? >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Gilles >>>>>>> >>>>>>> On 9/7/2016 11:15 PM, r...@open-mpi.org wrote: >>>>>>>> The usual cause of this problem is that the nodename in the >>>>>>>> machinefile is given as a00551, while Torque is assigning the >>>>>>>> node name as a00551.science.domain. Thus, mpirun thinks those are >>>>>>>> two separate nodes and winds up spawning an orted on its own >>>>>>>> node. >>>>>>>> >>>>>>>> You might try ensuring that your machinefile is using the exact >>>>>>>> same name as provided in your allocation. >>>>>>>> >>>>>>>> >>>>>>>>> On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet >>>>>>>>> <gilles.gouaillar...@gmail.com> wrote: >>>>>>>>> >>>>>>>>> Thanks for the logs >>>>>>>>> >>>>>>>>> From what i see now, it looks like a00551 is running both >>>>>>>>> mpirun and orted, though it should only run mpirun, and orted >>>>>>>>> should run only on a00553 >>>>>>>>> >>>>>>>>> I will check the code and see what could be happening here >>>>>>>>> >>>>>>>>> Btw, what is the output of >>>>>>>>> hostname >>>>>>>>> hostname -f >>>>>>>>> On a00551? >>>>>>>>> >>>>>>>>> Out of curiosity, is a previous version of Open MPI (e.g. >>>>>>>>> v1.10.4) installed and running correctly on your cluster? >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Gilles >>>>>>>>> >>>>>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote: >>>>>>>>>> Hi Gilles, >>>>>>>>>> >>>>>>>>>> Thanks for the hint with the machinefile. I know it is not >>>>>>>>>> equivalent >>>>>>>>>> and i do not intend to use that approach. I just wanted to know >>>>>>>>>> whether >>>>>>>>>> I could start the program successfully at all.
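As an illustration of the positional launch IDs Ralph describes above, torque's pbsdsh addresses the nodes of a job by TM index rather than by hostname; a small sketch to run inside a job (assuming an unmodified $PBS_NODEFILE):

  # TM node indices follow the line order of $PBS_NODEFILE:
  cat -n $PBS_NODEFILE
  # spawn hostname via the TM API on index 2, i.e. the third line of the file;
  # reordering the file changes which host that index resolves to:
  pbsdsh -n 2 hostname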
>>>>>>>>>>
>>>>>>>>>> Outside of torque (4.2), rsh seems to be used, which works fine and asks for a password if no kerberos ticket is there.
>>>>>>>>>>
>>>>>>>>>> Here is the output:
>>>>>>>>>> [zbh251@a00551 ~]$ mpirun -V
>>>>>>>>>> mpirun (Open MPI) 2.0.1
>>>>>>>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>>>>>>> MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>>>>>> MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>>>>>> MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>>>>>> MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>>>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
>>>>>>>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>>>>>>>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>>>>>> Data for JOB [53688,1] offset 0
>>>>>>>>>>
>>>>>>>>>> ======================== JOB MAP ========================
>>>>>>>>>>
>>>>>>>>>> Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
>>>>>>>>>> Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>> Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>>>> Data for node: a00553.science.domain Num slots: 1 Max slots: 0 Num procs: 1
>>>>>>>>>> Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>> =============================================================
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>>>>>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>>>>>>>>
>>>>>>>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Which version of Open MPI are you running?
>>>>>>>>>>>
>>>>>>>>>>> I noted that though you are asking for three nodes and one task per node, you have been allocated 2 nodes only. I do not know if this is related to this issue.
>>>>>>>>>>>
>>>>>>>>>>> Note that if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile) but a00553 has 20 slots (since it appears once in the machinefile, the number of slots is automatically detected).
>>>>>>>>>>>
>>>>>>>>>>> Can you run
>>>>>>>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>>>>>>>> so we can confirm tm is used?
>>>>>>>>>>>
>>>>>>>>>>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am currently trying to set up OpenMPI in torque.
>>>>>>>>>>>> OpenMPI is built with tm support. Torque is correctly assigning nodes, and I can run mpi-programs on single nodes just fine. The problem starts when processes are split between nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> For example, I create an interactive session with torque and start a program by
>>>>>>>>>>>>
>>>>>>>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>>>>>>>> mpirun --tag-output -display-map hostname
>>>>>>>>>>>>
>>>>>>>>>>>> which leads to
>>>>>>>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>>>>>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>>>>>>>> Data for JOB [65415,1] offset 0
>>>>>>>>>>>>
>>>>>>>>>>>> ======================== JOB MAP ========================
>>>>>>>>>>>>
>>>>>>>>>>>> Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
>>>>>>>>>>>> Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>> Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>>>>>> Data for node: a00553.science.domain Num slots: 1 Max slots: 0 Num procs: 1
>>>>>>>>>>>> Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>>
>>>>>>>>>>>> =============================================================
>>>>>>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>>>>>>
>>>>>>>>>>>> If I log in on a00551 and start using the hostfile generated by the PBS_NODEFILE, everything works:
>>>>>>>>>>>>
>>>>>>>>>>>> (from within the interactive session)
>>>>>>>>>>>> echo $PBS_NODEFILE
>>>>>>>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>>>>>>>> cat $PBS_NODEFILE
>>>>>>>>>>>> a00551.science.domain
>>>>>>>>>>>> a00553.science.domain
>>>>>>>>>>>> a00551.science.domain
>>>>>>>>>>>>
>>>>>>>>>>>> (from within the separate login)
>>>>>>>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>>>>>>>>>
>>>>>>>>>>>> Data for JOB [65445,1] offset 0
>>>>>>>>>>>>
>>>>>>>>>>>> ======================== JOB MAP ========================
>>>>>>>>>>>>
>>>>>>>>>>>> Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
>>>>>>>>>>>> Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>> Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>>>>>> Data for node: a00553.science.domain Num slots: 20 Max slots: 0 Num procs: 1
>>>>>>>>>>>> Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>>
>>>>>>>>>>>> =============================================================
>>>>>>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>>>>>>
>>>>>>>>>>>> I am kind of lost as to what's going on here. Does anyone have an idea? I seriously suspect the Kerberos authentication that we have to work with, but I fail to see how it would affect the sockets.
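Gilles' suggestion above to clean up the ompi directory in /tmp before re-running can be done along these lines; a minimal sketch, with the caveat that the session-directory naming varies across Open MPI releases, so the patterns below are assumptions to adapt:

    # list candidate session directories left over from earlier runs
    # (names vary by release: openmpi-sessions-* in older, ompi.* in newer trees)
    ls -ld /tmp/openmpi-sessions-* /tmp/ompi.* 2>/dev/null
    # remove only the ones owned by the current user
    find /tmp -maxdepth 1 -user "$USER" -name 'openmpi-sessions-*' -exec rm -rf {} +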
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
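To close the loop on Ralph's advice to steer placement with mpirun's own knobs rather than by editing system-generated files, here is a minimal sketch of one such knob; the hostfile name myhosts and its slot counts are hypothetical, and inside a Torque allocation the tm modules normally supply the host list automatically:

    $ cat myhosts
    a00551.science.domain slots=2
    a00553.science.domain slots=1
    $ mpirun --hostfile myhosts -np 3 --map-by node --tag-output -display-map hostname

Here --map-by node places ranks round-robin across nodes instead of filling each node's slots first, which is one way to get the one-rank-per-node layout that the original qsub -l nodes=3:ppn=1 request was after.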