Are the OMPI libraries and binaries installed in the same place on all the 
remote nodes?

Are you setting the LD_LIBRARY_PATH correctly?

Are the Torque libs available in the same place on the remote nodes? Remember, 
Torque runs mpirun on a backend node - not on the frontend.

These are the most typical problems. 
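
A rough way to check the first two points from inside a Torque job is sketched below. This is only an illustration, not an Open MPI tool: the function name check_ompi_prefix is made up here, and it assumes the install follows the usual layout with bin/orted and lib/ under the prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or Torque): report whether an
# Open MPI install under the given prefix looks usable on the node where
# this function runs.
check_ompi_prefix() {
    prefix=$1
    status=0

    # orted is the daemon that mpirun starts on each remote node.
    if [ -x "$prefix/bin/orted" ]; then
        echo "ok: $prefix/bin/orted"
    else
        echo "missing: $prefix/bin/orted"
        status=1
    fi

    # The shared libraries are assumed to live in <prefix>/lib.
    if [ -d "$prefix/lib" ]; then
        echo "ok: $prefix/lib"
    else
        echo "missing: $prefix/lib"
        status=1
    fi

    # Informational only: is <prefix>/lib on LD_LIBRARY_PATH?
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$prefix/lib:"*) echo "ok: LD_LIBRARY_PATH contains $prefix/lib" ;;
        *) echo "note: LD_LIBRARY_PATH does not contain $prefix/lib" ;;
    esac

    return $status
}
```

Calling it from the job script, e.g. check_ompi_prefix /shared/openmpi_gcc_ppc (the prefix from your configure line), shows what the backend node actually sees. Running ldd on the same bin/orted and looking for "not found" lines would then point at any unresolved library, including the Torque one.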


On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:

> Hi all,
> 
> Your help with the following torque integration issue will be much
> appreciated: whenever I try to start an openmpi job on more than one
> node, it simply does not start up on the nodes.
> The torque job fails with the following:
> 
>> Fri Dec 18 22:11:07 CET 2009
>> OpenMPI with PPU-GCC was loaded
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>>        node2 - daemon did not report back when launched
>> Fri Dec 18 22:12:47 CET 2009
> 
> I am quite confident about the compilation and installation of torque
> and openmpi, since it runs without error on one node:
>> Fri Dec 18 22:14:11 CET 2009
>> OpenMPI with PPU-GCC was loaded
>> Process 1 on node1 out of 2
>> Process 0 on node1 out of 2
>> Fri Dec 18 22:14:12 CET 2009
> 
> The called program is a simple helloworld which runs without errors
> when started manually on the nodes; it also runs without errors
> using a hostfile to daemonize on more than one node. I already tried to
> compile openmpi with default prefix:
>>  $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
> 
> Also the called helloworld is compiled with and without -rpath, so I
> just wanted to be sure regarding any linked library issue.
> 
> Now, the interesting fact is the following: on one node I compiled a
> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
> pbs, mpi and helloworld daemons. As already mentioned at the
> beginning, I therefore assume that the mpi startup within torque is not
> working for me.
> Please request any further logs you want to review; I did not
> want to make the mail too large at first.
> Any ideas?
> 
> Greetings,
> Johann
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

