Hi Ralph,

Somehow I did not receive your last answer as mail, so I am replying to myself... Thanks for the explanation. I thought that the prefix issue would be handled by the OMPI configure option "--enable-mpirun-prefix-by-default", but now I see your point. Anyway, I did not find any further information on that issue in the Torque FAQ, and since the rsh launcher works I will stick with it and not spend more time experimenting with Torque... Thanks again for your help!
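In case it helps anyone finding this thread in the archives: instead of passing "-mca plm rsh" on every command line, the launcher can also be selected persistently through the usual MCA parameter mechanisms. A minimal sketch (the prefix path is taken from the output quoted below; adjust file locations to your install):

  # per-user default, read by every mpirun this user starts
  echo "plm = rsh" >> $HOME/.openmpi/mca-params.conf

  # or system-wide, in the installation prefix
  echo "plm = rsh" >> /opt/openmpi_1.3.4_gcc_ppc/etc/openmpi-mca-params.conf

  # or per-job via the environment, e.g. exported from the Torque job script
  export OMPI_MCA_plm=rsh

A rough sketch of a complete Torque job script along these lines is at the very bottom of this mail, below the quoted thread.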
Greetings,
Johann

Johann Knechtel wrote:
> Ralph, thank you very much for your input! The parameter "-mca plm rsh" did it.
> I am just curious about the reasons for that behavior?
> You can find the complete output of the different commands embedded in your
> mail below. The first line states the successful load of the OMPI environment;
> we use the modules package on our cluster.
>
> Greetings
> Johann
>
>
> Ralph Castain wrote:
>> Sorry - hit "send" and then saw the version sitting right there in the
>> subject! Doh...
>>
>> First, let's try verifying what components are actually getting used. Run
>> this:
>>
>> mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted
>>
> OpenMPI with PPU-GCC was loaded
> [node1:00706] mca: base: components_open: Looking for plm components
> [node1:00706] mca: base: components_open: opening plm components
> [node1:00706] mca: base: components_open: found loaded component rsh
> [node1:00706] mca: base: components_open: component rsh has no register function
> [node1:00706] mca: base: components_open: component rsh open function successful
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no register function
> [node1:00706] mca: base: components_open: component slurm open function successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register function
> [node1:00706] mca: base: components_open: component tm open function successful
> [node1:00706] mca:base:select: Auto-selecting plm components
> [node1:00706] mca:base:select:( plm) Querying component [rsh]
> [node1:00706] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [node1:00706] mca:base:select:( plm) Querying component [slurm]
> [node1:00706] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [node1:00706] mca:base:select:( plm) Querying component [tm]
> [node1:00706] mca:base:select:( plm) Query of component [tm] set priority to 75
> [node1:00706] mca:base:select:( plm) Selected component [tm]
> [node1:00706] mca: base: close: component rsh closed
> [node1:00706] mca: base: close: unloading component rsh
> [node1:00706] mca: base: close: component slurm closed
> [node1:00706] mca: base: close: unloading component slurm
> [node1:00706] mca: base: components_open: Looking for ras components
> [node1:00706] mca: base: components_open: opening ras components
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no register function
> [node1:00706] mca: base: components_open: component slurm open function successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register function
> [node1:00706] mca: base: components_open: component tm open function successful
> [node1:00706] mca:base:select: Auto-selecting ras components
> [node1:00706] mca:base:select:( ras) Querying component [slurm]
> [node1:00706] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [node1:00706] mca:base:select:( ras) Querying component [tm]
> [node1:00706] mca:base:select:( ras) Query of component [tm] set priority to 100
> [node1:00706] mca:base:select:( ras) Selected component [tm]
> [node1:00706] mca: base: close: unloading component slurm
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> [node1:00706] mca: base: close: unloading component tm
> [node1:00706] mca: base: close: component tm closed
> [node1:00706] mca: base: close: unloading component tm
>
>> Then get an allocation and run
>>
>> mpirun -pernode which orted
>>
> OpenMPI with PPU-GCC was loaded
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> node2 - daemon did not report back when launched
>> and
>>
>> mpirun -pernode -mca plm rsh which orted
>>
> OpenMPI with PPU-GCC was loaded
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
>> and see what happens
>>
>>
>> On Dec 19, 2009, at 5:17 PM, Ralph Castain wrote:
>>
>>> That error has nothing to do with Torque. The cmd line is simply wrong -
>>> you are specifying a btl that doesn't exist.
>>>
>>> It should work just fine with
>>>
>>> mpirun -n X hellocluster
>>>
>>> Nothing else is required. When you run
>>>
>>> mpirun --hostfile nodefile hellocluster
>>>
>>> OMPI will still use Torque to do the launch - it just gets the list of
>>> nodes from your nodefile instead of the PBS_NODEFILE.
>>>
>>> You may have stated it below, but I can't find it: what version of OMPI
>>> are you using? Are there additional versions installed on your system?
>>>
>>>
>>> On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>
>>>> Ah, and do I have to take care of the MCA ras plugin on my own?
>>>> I tried something like
>>>>
>>>>> mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir
>>>>> /var/spool/torque/aux/ hellocluster
>>>>
>>>> but apart from the fact that it did not work out ([node3:22726] mca: base:
>>>> components_open: component pml / csum open function failed), it also does
>>>> not look very convenient to me...
>>>>
>>>> Greetings
>>>> Johann
>>>>
>>>>
>>>> Johann Knechtel wrote:
>>>>
>>>>> Hi Ralph and all,
>>>>>
>>>>> Yes, the OMPI libs and binaries are in the same place on all nodes; I
>>>>> packed OMPI via checkinstall and installed the deb via pdsh on the nodes.
>>>>> The LD_LIBRARY_PATH is set; I can run for example "mpirun --hostfile
>>>>> nodefile hellocluster" without problems. But when started via a Torque
>>>>> job it does not work out. I do assume correctly that the LD_LIBRARY_PATH
>>>>> will be exported by Torque to the daemonized mpi runners, don't I?
>>>>> The Torque libs are all in the same place; I installed the package shell
>>>>> scripts via pdsh.
>>>>>
>>>>> Greetings,
>>>>> Johann
>>>>>
>>>>>
>>>>> Ralph Castain wrote:
>>>>>
>>>>>> Are the OMPI libraries and binaries installed at the same place on all
>>>>>> the remote nodes?
>>>>>>
>>>>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>>>>
>>>>>> Are the Torque libs available in the same place on the remote nodes?
>>>>>> Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>>>>
>>>>>> These are the most typical problems.
>>>>>>
>>>>>>
>>>>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Your help with the following Torque integration issue will be much
>>>>>>> appreciated: whenever I try to start an Open MPI job on more than one
>>>>>>> node, it simply does not start up on the nodes.
>>>>>>> The Torque job fails with the following:
>>>>>>>
>>>>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>>>> launch so we are aborting.
>>>>>>>>
>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>> that caused that situation.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> node2 - daemon did not report back when launched
>>>>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>>>
>>>>>>> I am quite confident about the compilation and installation of Torque
>>>>>>> and Open MPI, since it runs without error on one node:
>>>>>>>
>>>>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>>> Process 1 on node1 out of 2
>>>>>>>> Process 0 on node1 out of 2
>>>>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>>>
>>>>>>> The called program is a simple helloworld which runs without errors when
>>>>>>> started manually on the nodes; it therefore also runs without errors
>>>>>>> when using a hostfile to daemonize on more than one node.
>>>>>>> I already tried to compile Open MPI with the default prefix:
>>>>>>>
>>>>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>>>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>>>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>>>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>>>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>>>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>>>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>>>
>>>>>>> Also, the called helloworld is compiled with and without -rpath, so I
>>>>>>> just wanted to be sure regarding any linked library issue.
>>>>>>>
>>>>>>> Now, the interesting fact is the following: I compiled on one node a
>>>>>>> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
>>>>>>> PBS, MPI and helloworld daemons. And as already mentioned at the
>>>>>>> beginning, from that I assume that the MPI startup within Torque is not
>>>>>>> working for me.
>>>>>>> Please request any further logs you would like to review; I did not
>>>>>>> want to make the mail too large at first.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Johann
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
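P.S. For the archives, and tying back to the LD_LIBRARY_PATH discussion quoted above, here is a rough sketch of what such a Torque job script could look like with the rsh launcher forced and the library path forwarded explicitly. The module name is a guess and the paths are taken from the output above, so adjust them to your site:

  #!/bin/bash
  #PBS -N hellocluster
  #PBS -l nodes=2:ppn=1
  #PBS -j oe

  # load the Open MPI environment on the node Torque starts the job on
  # (our module prints "OpenMPI with PPU-GCC was loaded")
  module load openmpi_gcc_ppc

  # make sure the remote orted daemons can find the shared libraries too;
  # "-x" tells mpirun to forward the variable to the launched processes
  export LD_LIBRARY_PATH=/opt/openmpi_1.3.4_gcc_ppc/lib:$LD_LIBRARY_PATH

  cd $PBS_O_WORKDIR
  date
  mpirun -mca plm rsh -x LD_LIBRARY_PATH hellocluster
  date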