Hi Ralph,

Somehow I did not receive your last answer as mail, so I am replying to myself... Thanks for the explanation. I thought that the prefix issue would be handled by the OMPI configure option "--enable-mpirun-prefix-by-default", but now I see your point. Anyway, I did not find any further information on that issue in the Torque FAQ, and since the rsh launcher works I will stick with it and not spend more time experimenting with Torque... Thanks again for your help!
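In case it helps anyone finding this thread in the archives: instead of passing "-mca plm rsh" on every command line, the launcher can also be selected persistently through the usual MCA parameter mechanisms. A minimal sketch (the prefix path is taken from the output quoted below; adjust file locations to your install):

  # per-user default, read by every mpirun this user starts
  echo "plm = rsh" >> $HOME/.openmpi/mca-params.conf

  # or system-wide, in the installation prefix
  echo "plm = rsh" >> /opt/openmpi_1.3.4_gcc_ppc/etc/openmpi-mca-params.conf

  # or per-job via the environment, e.g. exported from the Torque job script
  export OMPI_MCA_plm=rsh

A rough sketch of a complete Torque job script along these lines is at the very bottom of this mail, below the quoted thread.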
Greetings,
Johann

Johann Knechtel wrote:
> Ralph, thank you very much for your input! The parameter "-mca plm rsh" did it.
> I am just curious about the reasons for that behavior?
> You can find the complete output of the different commands embedded in your
> mail below. The first line states the successful load of the OMPI environment;
> we use the modules package on our cluster.
>
> Greetings
> Johann
>
>
> Ralph Castain wrote:
>> Sorry - hit "send" and then saw the version sitting right there in the
>> subject! Doh...
>>
>> First, let's try verifying what components are actually getting used. Run
>> this:
>>
>> mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted
>>
> OpenMPI with PPU-GCC was loaded
> [node1:00706] mca: base: components_open: Looking for plm components
> [node1:00706] mca: base: components_open: opening plm components
> [node1:00706] mca: base: components_open: found loaded component rsh
> [node1:00706] mca: base: components_open: component rsh has no register function
> [node1:00706] mca: base: components_open: component rsh open function successful
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no register function
> [node1:00706] mca: base: components_open: component slurm open function successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register function
> [node1:00706] mca: base: components_open: component tm open function successful
> [node1:00706] mca:base:select: Auto-selecting plm components
> [node1:00706] mca:base:select:( plm) Querying component [rsh]
> [node1:00706] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [node1:00706] mca:base:select:( plm) Querying component [slurm]
> [node1:00706] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [node1:00706] mca:base:select:( plm) Querying component [tm]
> [node1:00706] mca:base:select:( plm) Query of component [tm] set priority to 75
> [node1:00706] mca:base:select:( plm) Selected component [tm]
> [node1:00706] mca: base: close: component rsh closed
> [node1:00706] mca: base: close: unloading component rsh
> [node1:00706] mca: base: close: component slurm closed
> [node1:00706] mca: base: close: unloading component slurm
> [node1:00706] mca: base: components_open: Looking for ras components
> [node1:00706] mca: base: components_open: opening ras components
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no register function
> [node1:00706] mca: base: components_open: component slurm open function successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register function
> [node1:00706] mca: base: components_open: component tm open function successful
> [node1:00706] mca:base:select: Auto-selecting ras components
> [node1:00706] mca:base:select:( ras) Querying component [slurm]
> [node1:00706] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [node1:00706] mca:base:select:( ras) Querying component [tm]
> [node1:00706] mca:base:select:( ras) Query of component [tm] set priority to 100
> [node1:00706] mca:base:select:( ras) Selected component [tm]
> [node1:00706] mca: base: close: unloading component slurm
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> [node1:00706] mca: base: close: unloading component tm
> [node1:00706] mca: base: close: component tm closed
> [node1:00706] mca: base: close: unloading component tm
>
>> Then get an allocation and run
>>
>> mpirun -pernode which orted
>>
> OpenMPI with PPU-GCC was loaded
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> node2 - daemon did not report back when launched
>> and
>>
>> mpirun -pernode -mca plm rsh which orted
>>
> OpenMPI with PPU-GCC was loaded
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
>> and see what happens
>>
>>
>> On Dec 19, 2009, at 5:17 PM, Ralph Castain wrote:
>>
>>> That error has nothing to do with Torque. The cmd line is simply wrong -
>>> you are specifying a btl that doesn't exist.
>>>
>>> It should work just fine with
>>>
>>> mpirun -n X hellocluster
>>>
>>> Nothing else is required. When you run
>>>
>>> mpirun --hostfile nodefile hellocluster
>>>
>>> OMPI will still use Torque to do the launch - it just gets the list of
>>> nodes from your nodefile instead of the PBS_NODEFILE.
>>>
>>> You may have stated it below, but I can't find it: what version of OMPI
>>> are you using? Are there additional versions installed on your system?
>>>
>>>
>>> On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>
>>>> Ah, and do I have to take care of the MCA ras plugin on my own?
>>>> I tried something like
>>>>
>>>>> mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir
>>>>> /var/spool/torque/aux/ hellocluster
>>>>
>>>> but apart from the fact that it did not work out ([node3:22726] mca: base:
>>>> components_open: component pml / csum open function failed), it also does
>>>> not look very convenient to me...
>>>>
>>>> Greetings
>>>> Johann
>>>>
>>>>
>>>> Johann Knechtel wrote:
>>>>
>>>>> Hi Ralph and all,
>>>>>
>>>>> Yes, the OMPI libs and binaries are in the same place on all nodes; I
>>>>> packed OMPI via checkinstall and installed the deb via pdsh on the nodes.
>>>>> The LD_LIBRARY_PATH is set; I can run for example "mpirun --hostfile
>>>>> nodefile hellocluster" without problems. But when started via a Torque
>>>>> job it does not work out. I do assume correctly that the LD_LIBRARY_PATH
>>>>> will be exported by Torque to the daemonized mpi runners, don't I?
>>>>> The Torque libs are all in the same place; I installed the package shell
>>>>> scripts via pdsh.
>>>>>
>>>>> Greetings,
>>>>> Johann
>>>>>
>>>>>
>>>>> Ralph Castain wrote:
>>>>>
>>>>>> Are the OMPI libraries and binaries installed at the same place on all
>>>>>> the remote nodes?
>>>>>>
>>>>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>>>>
>>>>>> Are the Torque libs available in the same place on the remote nodes?
>>>>>> Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>>>>
>>>>>> These are the most typical problems.
>>>>>>
>>>>>>
>>>>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Your help with the following Torque integration issue will be much
>>>>>>> appreciated: whenever I try to start an Open MPI job on more than one
>>>>>>> node, it simply does not start up on the nodes.
>>>>>>> The Torque job fails with the following:
>>>>>>>
>>>>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>>>> launch so we are aborting.
>>>>>>>>
>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>> that caused that situation.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> node2 - daemon did not report back when launched
>>>>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>>>
>>>>>>> I am quite confident about the compilation and installation of Torque
>>>>>>> and Open MPI, since it runs without error on one node:
>>>>>>>
>>>>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>>> Process 1 on node1 out of 2
>>>>>>>> Process 0 on node1 out of 2
>>>>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>>>
>>>>>>> The called program is a simple helloworld which runs without errors when
>>>>>>> started manually on the nodes; it therefore also runs without errors
>>>>>>> when using a hostfile to daemonize on more than one node.
>>>>>>> I already tried to compile Open MPI with the default prefix:
>>>>>>>
>>>>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>>>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>>>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>>>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>>>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>>>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>>>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>>>
>>>>>>> Also, the called helloworld is compiled with and without -rpath, so I
>>>>>>> just wanted to be sure regarding any linked library issue.
>>>>>>>
>>>>>>> Now, the interesting fact is the following: I compiled on one node a
>>>>>>> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
>>>>>>> PBS, MPI and helloworld daemons. And as already mentioned at the
>>>>>>> beginning, from that I assume that the MPI startup within Torque is not
>>>>>>> working for me.
>>>>>>> Please request any further logs you would like to review; I did not
>>>>>>> want to make the mail too large at first.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Johann
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
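P.S. For the archives, and tying back to the LD_LIBRARY_PATH discussion quoted above, here is a rough sketch of what such a Torque job script could look like with the rsh launcher forced and the library path forwarded explicitly. The module name is a guess and the paths are taken from the output above, so adjust them to your site:

  #!/bin/bash
  #PBS -N hellocluster
  #PBS -l nodes=2:ppn=1
  #PBS -j oe

  # load the Open MPI environment on the node Torque starts the job on
  # (our module prints "OpenMPI with PPU-GCC was loaded")
  module load openmpi_gcc_ppc

  # make sure the remote orted daemons can find the shared libraries too;
  # "-x" tells mpirun to forward the variable to the launched processes
  export LD_LIBRARY_PATH=/opt/openmpi_1.3.4_gcc_ppc/lib:$LD_LIBRARY_PATH

  cd $PBS_O_WORKDIR
  date
  mpirun -mca plm rsh -x LD_LIBRARY_PATH hellocluster
  date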