If your analysis is correct, then there would be an issue in the code. When we 
get an allocation from a resource manager, we set a flag indicating that the 
allocation is “gospel” - i.e., we do not directly sense the number of cores on 
a node and set #slots equal to that value. Instead, we take the RM-provided 
allocation as the ultimate truth.

This should be true even if you add a machinefile, as the machinefile is only 
used to “filter” the nodelist provided by the RM. It shouldn’t cause the #slots 
to be modified.
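
As a concrete sanity check (this is the expected behavior per the above, not 
verified output): with the $PBS_NODEFILE shown further down in this thread, 
which lists a00551 twice and a00553 once, feeding that file back to mpirun 
from inside the allocation should leave the slot counts at the RM-provided 
values:

cat $PBS_NODEFILE
# a00551.science.domain
# a00553.science.domain
# a00551.science.domain

mpirun --hostfile $PBS_NODEFILE --tag-output -display-map hostname
# expected job map: "Num slots: 2" on a00551 and "Num slots: 1" on a00553,
# i.e. the same counts as without the hostfile, since the file only filters
# the RM-provided nodelist and never changes #slots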

Taking a quick glance at the v2.x code, it looks to me like all is being done 
correctly. Again, output from a debug build would resolve that question.
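
One way to get that output (a sketch - adjust the paths and reuse your existing 
configure options; ras_base_verbose should show how the allocation and slot 
counts were read):

./configure --enable-debug <your other configure options>
make install

# then, from inside the Torque allocation:
mpirun --mca plm_base_verbose 10 --mca ras_base_verbose 10 -display-map hostname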


> On Sep 7, 2016, at 10:56 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Oswin,
> 
> 
> Unfortunately, some important info is missing.
> 
> I guess the root cause is that Open MPI was not configure'd with --enable-debug.
> 
> 
> Could you please update your Torque script and simply add the following 
> snippet before invoking mpirun:
> 
> 
> echo PBS_NODEFILE
> 
> cat $PBS_NODEFILE
> 
> echo ---
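> 
> For example, the relevant part of the batch script would then look like this 
> (the shebang and PBS directives are just placeholders matching your 
> interactive request; keep whatever your script already has):
> 
> #!/bin/bash
> #PBS -l nodes=3:ppn=1
> 
> echo PBS_NODEFILE
> cat $PBS_NODEFILE
> echo ---
> 
> mpirun --tag-output -display-map hostname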
> 
> 
> As I wrote in another email, I suspect the hosts are not ordered (I'd like 
> to confirm that) and that Open MPI does not handle this correctly.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> On 9/7/2016 10:25 PM, Oswin Krause wrote:
>> Hi Gilles,
>> 
>> Thanks for the hint with the machinefile. I know it is not equivalent, and I 
>> do not intend to use that approach; I just wanted to know whether I could 
>> start the program successfully at all.
>> 
>> Outside Torque (4.2), rsh seems to be used, which works fine, prompting for 
>> a password if no Kerberos ticket is present.
>> 
>> Here is the output:
>> [zbh251@a00551 ~]$ mpirun -V
>> mpirun (Open MPI) 2.0.1
>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>                 MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component 
>> v2.0.1)
>>                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output 
>> -display-map hostname
>> [a00551.science.domain:04104] mca: base: components_register: registering 
>> framework plm components
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component isolated
>> [a00551.science.domain:04104] mca: base: components_register: component 
>> isolated has no register or open function
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component rsh
>> [a00551.science.domain:04104] mca: base: components_register: component rsh 
>> register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component slurm
>> [a00551.science.domain:04104] mca: base: components_register: component 
>> slurm register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component tm
>> [a00551.science.domain:04104] mca: base: components_register: component tm 
>> register function successful
>> [a00551.science.domain:04104] mca: base: components_open: opening plm 
>> components
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component isolated
>> [a00551.science.domain:04104] mca: base: components_open: component isolated 
>> open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component rsh
>> [a00551.science.domain:04104] mca: base: components_open: component rsh open 
>> function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component slurm
>> [a00551.science.domain:04104] mca: base: components_open: component slurm 
>> open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component tm
>> [a00551.science.domain:04104] mca: base: components_open: component tm open 
>> function successful
>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>> [isolated]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>> [isolated] set priority to 0
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>> [rsh]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>> [rsh] set priority to 10
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>> [slurm]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [tm]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>> [tm] set priority to 75
>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component [tm]
>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>> [a00551.science.domain:04109] mca: base: components_register: registering 
>> framework plm components
>> [a00551.science.domain:04109] mca: base: components_register: found loaded 
>> component rsh
>> [a00551.science.domain:04109] mca: base: components_register: component rsh 
>> register function successful
>> [a00551.science.domain:04109] mca: base: components_open: opening plm 
>> components
>> [a00551.science.domain:04109] mca: base: components_open: found loaded 
>> component rsh
>> [a00551.science.domain:04109] mca: base: components_open: component rsh open 
>> function successful
>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04109] mca:base:select:(  plm) Querying component 
>> [rsh]
>> [a00551.science.domain:04109] mca:base:select:(  plm) Query of component 
>> [rsh] set priority to 10
>> [a00551.science.domain:04109] mca:base:select:(  plm) Selected component 
>> [rsh]
>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address 
>> already in use (98)
>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file 
>> oob_usock_component.c at line 228
>> Data for JOB [53688,1] offset 0
>> 
>> ========================   JOB MAP   ========================
>> 
>> Data for node: a00551    Num slots: 2    Max slots: 0    Num procs: 2
>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 
>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], 
>> socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 
>> 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 
>> 8[hwt 0-1]], socket 0[core 9[hwt 
>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 
>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], 
>> socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 
>> 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 
>> 18[hwt 0-1]], socket 1[core 19[hwt 
>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>> 
>> Data for node: a00553.science.domain    Num slots: 1    Max slots: 0    Num 
>> procs: 1
>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 
>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], 
>> socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 
>> 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 
>> 8[hwt 0-1]], socket 0[core 9[hwt 
>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>> 
>> =============================================================
>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc 
>> state command from [[53688,0],1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got 
>> update_proc_state for job [53688,1]
>> [1,0]<stdout>:a00551.science.domain
>> [1,2]<stdout>:a00551.science.domain
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc 
>> state command from [[53688,0],1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got 
>> update_proc_state for job [53688,1]
>> [1,1]<stdout>:a00551.science.domain
>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>> [a00551.science.domain:04104] mca: base: close: component tm closed
>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>> 
>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>> Hi,
>>> 
>>> Which version of Open MPI are you running?
>>> 
>>> I noted that though you are asking for three nodes and one task per node,
>>> you have been allocated only two nodes.
>>> I do not know whether this is related to the issue.
>>> 
>>> Note that if you use the machinefile, a00551 has two slots (since it
>>> appears twice in the machinefile), but a00553 has 20 slots (since it
>>> appears only once in the machinefile, the number of slots is automatically
>>> detected).
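>>> 
>>> For example, with the machinefile from your job,
>>> 
>>> a00551.science.domain
>>> a00553.science.domain
>>> a00551.science.domain
>>> 
>>> a00551 ends up with 2 slots (listed twice) while a00553 ends up with 20
>>> slots (listed once, so all detected cores are used) - which matches the
>>> "Num slots" values in your second job map.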
>>> 
>>> Can you run
>>> mpirun --mca plm_base_verbose 10 ...
>>> so we can confirm that tm is used?
>>> 
>>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
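>>> 
>>> Something like the following (the exact directory names vary between Open
>>> MPI versions, so check with ls first and adjust the pattern to what you
>>> actually see in /tmp):
>>> 
>>> ls -d /tmp/ompi* /tmp/openmpi-sessions-* 2>/dev/null
>>> rm -rf /tmp/ompi* /tmp/openmpi-sessions-*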
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>> Hi,
>>>> 
>>>> I am currently trying to set up Open MPI in Torque. Open MPI is built with
>>>> tm support. Torque is correctly assigning nodes, and I can run MPI
>>>> programs on single nodes just fine; the problem starts when processes are
>>>> split between nodes.
>>>> 
>>>> For example, I create an interactive session with Torque and start a
>>>> program via:
>>>> 
>>>> qsub -I -n -l nodes=3:ppn=1
>>>> mpirun --tag-output -display-map hostname
>>>> 
>>>> which leads to
>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error
>>>> Address already in use (98)
>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
>>>> file oob_usock_component.c at line 228
>>>> Data for JOB [65415,1] offset 0
>>>> 
>>>> ========================   JOB MAP   ========================
>>>> 
>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num procs: 2
>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>> 
>>>> Data for node: a00553.science.domain    Num slots: 1    Max slots: 0    Num
>>>> procs: 1
>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>> 
>>>> =============================================================
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [1,1]<stdout>:a00551.science.domain
>>>> 
>>>> 
>>>> If I log in to a00551 and launch using the hostfile pointed to by
>>>> $PBS_NODEFILE, everything works:
>>>> 
>>>> (from within the interactive session)
>>>> echo $PBS_NODEFILE
>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>> cat $PBS_NODEFILE
>>>> a00551.science.domain
>>>> a00553.science.domain
>>>> a00551.science.domain
>>>> 
>>>> (from within the separate login)
>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
>>>> --tag-output -display-map hostname
>>>> 
>>>> Data for JOB [65445,1] offset 0
>>>> 
>>>> ========================   JOB MAP   ========================
>>>> 
>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num procs: 2
>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>> 
>>>> Data for node: a00553.science.domain    Num slots: 20    Max slots: 0    
>>>> Num
>>>> procs: 1
>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>> 
>>>> =============================================================
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00553.science.domain
>>>> [1,1]<stdout>:a00551.science.domain
>>>> 
>>>> I am kind of lost as to what's going on here. Does anyone have an idea? I
>>>> seriously suspect this is a problem with the Kerberos authentication that
>>>> we have to work with, but I fail to see how it would affect the sockets.
>>>> 
>>>> Best,
>>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
