Note the Torque library will only show up in the ldd mpirun output if you configured with --disable-dlopen. Otherwise, you can run ldd /.../lib/openmpi/mca_plm_tm.so instead.
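For example, something along these lines should tell you whether tm support made it into your build (a rough sketch: <prefix> is a placeholder for your actual Open MPI installation prefix, and the exact Torque library name may differ between installations):

# check whether the tm components were built at all
ompi_info | grep plm

# default build (dlopen enabled): tm lives in its own DSO, so check that
# <prefix> is a placeholder for your install prefix
ldd <prefix>/lib/openmpi/mca_plm_tm.so | grep -i torque

# built with --disable-dlopen: components are linked into the core libraries,
# so the Torque library should show up against mpirun itself
ldd $(which mpirun) | grep -i torque

If ompi_info lists the tm components and the Torque library resolves in the relevant ldd output, tm support should be in place.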
Cheers,

Gilles

Bennet Fauber <ben...@umich.edu> wrote:
>Oswin,
>
>Does the torque library show up if you run
>
>$ ldd mpirun
>
>That would indicate that Torque support is compiled in.
>
>Also, what happens if you use the same hostfile, or some hostfile as
>an explicit argument when you run mpirun from within the torque job?
>
>-- bennet
>
>On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause
><oswin.kra...@ruhr-uni-bochum.de> wrote:
>> Hi Gilles,
>>
>> Thanks for the hint with the machinefile. I know it is not equivalent and I
>> do not intend to use that approach. I just wanted to know whether I could
>> start the program successfully at all.
>>
>> Outside Torque (4.2), rsh seems to be used, which works fine, querying a
>> password if no Kerberos ticket is there.
>>
>> Here is the output:
>> [zbh251@a00551 ~]$ mpirun -V
>> mpirun (Open MPI) 2.0.1
>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>              MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>              MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>              MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>              MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
>> [a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
>> [a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>> Data for JOB [53688,1] offset 0
>>
>> ======================== JOB MAP ========================
>>
>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>
>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>
>> =============================================================
>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>> [1,0]<stdout>:a00551.science.domain
>> [1,2]<stdout>:a00551.science.domain
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>> [1,1]<stdout>:a00551.science.domain
>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>> [a00551.science.domain:04104] mca: base: close: component tm closed
>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>
>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>> Hi,
>>>
>>> Which version of Open MPI are you running?
>>>
>>> I noted that though you are asking for three nodes and one task per node,
>>> you have been allocated only 2 nodes. I do not know if this is related to
>>> this issue.
>>>
>>> Note if you use the machinefile, a00551 has two slots (since it
>>> appears twice in the machinefile) but a00553 has 20 slots (since it
>>> appears once in the machinefile, the number of slots is automatically
>>> detected).
>>>
>>> Can you run
>>> mpirun --mca plm_base_verbose 10 ...
>>> so we can confirm tm is used?
>>>
>>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am currently trying to set up Open MPI in Torque. Open MPI is built with
>>>> tm support. Torque is correctly assigning nodes and I can run MPI programs
>>>> on single nodes just fine. The problem starts when processes are split
>>>> between nodes.
>>>>
>>>> For example, I create an interactive session with Torque and start a
>>>> program by
>>>>
>>>> qsub -I -n -l nodes=3:ppn=1
>>>> mpirun --tag-output -display-map hostname
>>>>
>>>> which leads to
>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>> Data for JOB [65415,1] offset 0
>>>>
>>>> ======================== JOB MAP ========================
>>>>
>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>
>>>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>
>>>> =============================================================
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [1,1]<stdout>:a00551.science.domain
>>>>
>>>> If I log in on a00551 and start using the hostfile generated by the
>>>> PBS_NODEFILE, everything works:
>>>>
>>>> (from within the interactive session)
>>>> echo $PBS_NODEFILE
>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>> cat $PBS_NODEFILE
>>>> a00551.science.domain
>>>> a00553.science.domain
>>>> a00551.science.domain
>>>>
>>>> (from within the separate login)
>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>
>>>> Data for JOB [65445,1] offset 0
>>>>
>>>> ======================== JOB MAP ========================
>>>>
>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>
>>>> Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>
>>>> =============================================================
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00553.science.domain
>>>> [1,1]<stdout>:a00551.science.domain
>>>>
>>>> I am kind of lost what's going on here. Anyone having an idea?
>>>> I am seriously considering this to be a problem of the Kerberos
>>>> authentication that we have to work with, but I fail to see how this
>>>> should affect the sockets.
>>>>
>>>> Best,
>>>> Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users