Note the torque library will only show up in ldd mpirun if you configured with
--disable-dlopen. Otherwise, you can run ldd on /.../lib/openmpi/mca_plm_tm.so instead.
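
For example, a quick check could look like this (the <prefix> placeholder is
whatever --prefix Open MPI was configured with; adjust it to your install):

$ ldd <prefix>/lib/openmpi/mca_plm_tm.so | grep -i torque

If the tm plugin was built against Torque it links the Torque library
(typically libtorque), so a match here confirms tm support was compiled in
even when ldd mpirun shows nothing.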

Cheers,

Gilles

Bennet Fauber <ben...@umich.edu> wrote:
>Oswin,
>
>Does the torque library show up if you run
>
>$ ldd mpirun
>
>That would indicate that Torque support is compiled in.
>
>Also, what happens if you pass the same hostfile (or some other hostfile)
>as an explicit argument when you run mpirun from within the torque job?
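>
>For example, something along these lines from inside the interactive job
>(just mirroring the options already used in this thread; the node file path
>differs per job):
>
>$ mpirun --hostfile $PBS_NODEFILE -np 3 --tag-output -display-map hostname
>
>Comparing that with the plain mpirun run would show whether an explicit
>hostfile changes the mapping or the bind() error.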
>
>-- bennet
>
>
>
>
>On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause
><oswin.kra...@ruhr-uni-bochum.de> wrote:
>> Hi Gilles,
>>
>> Thanks for the hint with the machinefile. I know it is not equivalent and I
>> do not intend to use that approach. I just wanted to know whether I could
>> start the program successfully at all.
>>
>> Outside torque (4.2), rsh seems to be used, which works fine, prompting for a
>> password if no kerberos ticket is present.
>>
>> Here is the output:
>> [zbh251@a00551 ~]$ mpirun -V
>> mpirun (Open MPI) 2.0.1
>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>                  MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component
>> v2.0.1)
>>                  MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
>> v2.0.1)
>>                  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>                  MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
>> -display-map hostname
>> [a00551.science.domain:04104] mca: base: components_register: registering
>> framework plm components
>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>> component isolated
>> [a00551.science.domain:04104] mca: base: components_register: component
>> isolated has no register or open function
>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>> component rsh
>> [a00551.science.domain:04104] mca: base: components_register: component rsh
>> register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>> component slurm
>> [a00551.science.domain:04104] mca: base: components_register: component
>> slurm register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>> component tm
>> [a00551.science.domain:04104] mca: base: components_register: component tm
>> register function successful
>> [a00551.science.domain:04104] mca: base: components_open: opening plm
>> components
>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>> component isolated
>> [a00551.science.domain:04104] mca: base: components_open: component isolated
>> open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>> component rsh
>> [a00551.science.domain:04104] mca: base: components_open: component rsh open
>> function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>> component slurm
>> [a00551.science.domain:04104] mca: base: components_open: component slurm
>> open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>> component tm
>> [a00551.science.domain:04104] mca: base: components_open: component tm open
>> function successful
>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>> [isolated]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>> [isolated] set priority to 0
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>> [rsh]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>> [rsh] set priority to 10
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>> [slurm]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>> [tm]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>> [tm] set priority to 75
>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component
>> [tm]
>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>> [a00551.science.domain:04109] mca: base: components_register: registering
>> framework plm components
>> [a00551.science.domain:04109] mca: base: components_register: found loaded
>> component rsh
>> [a00551.science.domain:04109] mca: base: components_register: component rsh
>> register function successful
>> [a00551.science.domain:04109] mca: base: components_open: opening plm
>> components
>> [a00551.science.domain:04109] mca: base: components_open: found loaded
>> component rsh
>> [a00551.science.domain:04109] mca: base: components_open: component rsh open
>> function successful
>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04109] mca:base:select:(  plm) Querying component
>> [rsh]
>> [a00551.science.domain:04109] mca:base:select:(  plm) Query of component
>> [rsh] set priority to 10
>> [a00551.science.domain:04109] mca:base:select:(  plm) Selected component
>> [rsh]
>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address
>> already in use (98)
>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file
>> oob_usock_component.c at line 228
>>  Data for JOB [53688,1] offset 0
>>
>>  ========================   JOB MAP   ========================
>>
>>  Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket
>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]],
>> socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt
>> 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core
>> 8[hwt 0-1]], socket 0[core 9[hwt
>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket
>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]],
>> socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt
>> 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core
>> 18[hwt 0-1]], socket 1[core 19[hwt
>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>
>>  Data for node: a00553.science.domain   Num slots: 1    Max slots: 0    Num
>> procs: 1
>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket
>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]],
>> socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt
>> 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core
>> 8[hwt 0-1]], socket 0[core 9[hwt
>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>
>>  =============================================================
>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
>> state command from [[53688,0],1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
>> update_proc_state for job [53688,1]
>> [1,0]<stdout>:a00551.science.domain
>> [1,2]<stdout>:a00551.science.domain
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
>> state command from [[53688,0],1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
>> update_proc_state for job [53688,1]
>> [1,1]<stdout>:a00551.science.domain
>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>> [a00551.science.domain:04104] mca: base: close: component tm closed
>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>
>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>
>>> Hi,
>>>
>>> Which version of Open MPI are you running?
>>>
>>> I noted that though you are asking for three nodes and one task per node,
>>> you have been allocated only 2 nodes.
>>> I do not know whether this is related to the issue.
>>>
>>> Note that if you use the machinefile, a00551 has two slots (since it
>>> appears twice in the machinefile) but a00553 has 20 slots (since it
>>> appears only once, so the number of slots is automatically detected).
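>>>
>>> If you do want the machinefile to match the torque allocation, a sketch
>>> would be to list each node once with an explicit slot count (the counts
>>> below are just the ones from your allocation):
>>>
>>> a00551.science.domain slots=2
>>> a00553.science.domain slots=1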
>>>
>>> Can you run
>>> mpirun --mca plm_base_verbose 10 ...
>>> so we can confirm tm is used?
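>>>
>>> (As an aside, you could also restrict the selection, e.g.
>>>
>>> mpirun --mca plm tm --mca plm_base_verbose 10 hostname
>>>
>>> which should make mpirun abort instead of silently falling back to rsh
>>> if tm cannot be used.)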
>>>
>>> Before invoking mpirun, you might want to clean up the ompi directory in
>>> /tmp.
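>>>
>>> For example (the exact session directory name varies between Open MPI
>>> versions, so list the candidates first and only remove directories that
>>> belong to you while no jobs are running):
>>>
>>> ls -ld /tmp/openmpi* /tmp/ompi* 2>/dev/null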
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am currently trying to set up Open MPI under torque. Open MPI is built with
>>>> tm support. Torque is correctly assigning nodes and I can run
>>>> MPI programs on single nodes just fine. The problem starts when
>>>> processes are split between nodes.
>>>>
>>>> For example, I create an interactive session with torque and start a
>>>> program by
>>>>
>>>> qsub -I -n -l nodes=3:ppn=1
>>>> mpirun --tag-output -display-map hostname
>>>>
>>>> which leads to
>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error
>>>> Address already in use (98)
>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
>>>> file oob_usock_component.c at line 228
>>>>  Data for JOB [65415,1] offset 0
>>>>
>>>>  ========================   JOB MAP   ========================
>>>>
>>>>  Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound:
>>>> socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound:
>>>> socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>
>>>>  Data for node: a00553.science.domain   Num slots: 1    Max slots: 0
>>>> Num
>>>> procs: 1
>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound:
>>>> socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>
>>>>  =============================================================
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [1,1]<stdout>:a00551.science.domain
>>>>
>>>>
>>>> If I log in on a00551 and start mpirun using the hostfile pointed to by
>>>> PBS_NODEFILE, everything works:
>>>>
>>>> (from within the interactive session)
>>>> echo $PBS_NODEFILE
>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>> cat $PBS_NODEFILE
>>>> a00551.science.domain
>>>> a00553.science.domain
>>>> a00551.science.domain
>>>>
>>>> (from within the separate login)
>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
>>>> --tag-output -display-map hostname
>>>>
>>>>  Data for JOB [65445,1] offset 0
>>>>
>>>>  ========================   JOB MAP   ========================
>>>>
>>>>  Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound:
>>>> socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound:
>>>> socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>
>>>>  Data for node: a00553.science.domain   Num slots: 20   Max slots: 0
>>>> Num
>>>> procs: 1
>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound:
>>>> socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>
>>>>  =============================================================
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00553.science.domain
>>>> [1,1]<stdout>:a00551.science.domain
>>>>
>>>> I am kind of lost as to what is going on here. Does anyone have an idea? I am
>>>> seriously considering this to be a problem with the kerberos
>>>> authentication that we have to work with, but I fail to see how this
>>>> should affect the sockets.
>>>>
>>>> Best,
>>>> Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
