From your earlier discussion on the gridengine users alias, it looks like
you can already run a simple, tightly integrated SGE parallel job with
Open MPI on a single queue, and the problem only shows up when the
parallel job spans multiple queues, right?
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26298
The tm messages are just a red herring. What's more interesting is the
verbose output from qrsh (the lines you enable with -mca
pls_gridengine_verbose 1; they are the ones without the prefix that OMPI
prepends, such as [shakespeare:05720]):
Starting server daemon at host "shakespeare.nci.nih.gov"
Starting server daemon at host "octopus.nci.nih.gov"
Server daemon successfully started with task id "1.shakespeare"
[shakespeare:05733] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05733] mca: base: component_find: unable to open pls tm:
file not found (ignored)
error: executing task of job 3576 failed: failed sending task to
ex...@octopus.nci.nih.gov: can't find connection
Since you see these verbose messages, you are using "qrsh -inherit" in
the backend to launch tasks. You can also see the actual "qrsh -inherit"
command line by setting "-mca pls_gridengine_debug 1" in mpirun.
The messages above show that when mpirun tries to send the SGE tasks to
shakespeare and octopus across the 2 queues, the server daemon on
shakespeare reports a successful start, but there is no matching message
from octopus. Typically I see only 1 message from the server daemon when
I use only 1 queue in my parallel job.
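To look at just the qrsh/SGE messages, something like the sketch below could strip every line that carries the [host:pid] prefix from the error file. The function name qrsh_lines is made up for illustration, and it assumes each OMPI message sits on one line in the real file (the wrapped continuation lines you see in this email would slip through).

```shell
#!/bin/bash
# Print only the lines that do NOT start with Open MPI's "[host:pid]" prefix,
# which should leave the messages written by qrsh/SGE itself.
# Reads the file(s) given as arguments, or stdin if none are given.
# NOTE: qrsh_lines is a hypothetical helper name, not part of OMPI or SGE.
qrsh_lines() {
    # "[", then a host name, ":", a numeric pid, and a closing "]"
    grep -v -E '^\[[^]:]+:[0-9]+]' "$@"
}

# Example (file name taken from the job in this thread):
#   qrsh_lines junksub.sh.e3574
```

Run against the error file from the job, this should leave the "Starting server daemon" lines and the "error: executing task" line.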
In order for the head node's "qrsh -inherit" tasks to be accepted by SGE
daemons on execution nodes, the execution daemons need to be
allocated/notified ahead of time that there are impending tasks coming
to the nodes.
Anyway, I don't know why it needs to start the server daemon on octopus
when you have 2 queues in your parallel job. But assuming that's the
right behavior, SGE seems to have a problem sending the task from the
head node shakespeare to octopus (hence the "failed sending task to
execd: can't find connection" message). Did you already try connecting
from shakespeare to octopus? You might also want to check the messages in
octopus' log file, $SGE_ROOT/default/spool/octopus/messages, to see
exactly why it isn't accepting the task.
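For the log check, a rough sketch like the following could pull out the error and warning entries that mention the failing job. It assumes the messages file is pipe-delimited with the severity letter in the fourth field; I have not verified that against your install. The helper name execd_problems is hypothetical.

```shell
#!/bin/bash
# Show error/warning entries from an execd messages file that mention a job id.
# ASSUMPTION: entries are pipe-delimited with the severity (E, W, I, ...) in
# the fourth field, e.g. "date time|execd|host|E|text". Adjust if yours differ.
# NOTE: execd_problems is a hypothetical helper name.
execd_problems() {
    local logfile=$1
    local jobid=$2
    # Plain substring match on the job id, so unrelated numbers may also match.
    awk -F'|' -v job="$jobid" \
        '($4 == "E" || $4 == "W") && index($0, job) { print }' "$logfile"
}

# Example, run on octopus (path from above, job id from the error message):
#   execd_problems "$SGE_ROOT/default/spool/octopus/messages" 3576
```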
It may also be worthwhile to ask the gridengine folks whether anyone has
run a parallel job across multiple queues. I am not sure how commonly
people use this SGE feature.
I don't have access to an SGE cluster, but I notice from an online manual
that there's a new qsub option (-masterq) in SGE 6.2 that may help. You
might want to give it a try. This looks more and more like an SGE issue
with accepting tasks from multiple queues for a parallel job.
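If you want to try -masterq, a submission could look roughly like the sketch below. This is untested on my side: the PE name "orte", the slot count, and the queue instance are placeholders you would need to replace with your own, and the wrapper function is made up for illustration. By default it only prints the qsub command so you can inspect it first.

```shell
#!/bin/bash
# Build a qsub command that requests a PE and pins the master task to a queue.
# ASSUMPTIONS: the PE name, slot count and queue instance in the example are
# placeholders; -masterq is the option mentioned above and is untested here.
# NOTE: submit_with_masterq is a hypothetical helper name.
submit_with_masterq() {
    local pe=$1
    local slots=$2
    local masterq=$3
    local script=$4
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command that would be run.
        echo qsub -pe "$pe" "$slots" -masterq "$masterq" "$script"
    else
        qsub -pe "$pe" "$slots" -masterq "$masterq" "$script"
    fi
}

# Example:
#   submit_with_masterq orte 4 all.q@shakespeare junksub.sh            # print only
#   DRY_RUN=0 submit_with_masterq orte 4 all.q@shakespeare junksub.sh  # submit
```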
btw, you don't need the --with-sge switch for OMPI configure in 1.2.7.
The switch is new in OMPI v1.3, where SGE support is no longer built by
default.
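To confirm that your build actually has SGE support, you can grep the ompi_info output for the gridengine components. A small sketch, with a made-up helper name, that reads the output on stdin:

```shell
#!/bin/bash
# Report whether an Open MPI install lists any gridengine components.
# Reads `ompi_info` output on stdin, so it also works on saved output.
# NOTE: has_sge_support is a hypothetical helper name.
has_sge_support() {
    if grep -q gridengine; then
        echo "gridengine components found"
    else
        echo "no gridengine components found"
        return 1
    fi
}

# Example:
#   ompi_info | has_sge_support
```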
My $.02...
- Pak Lui
p...@penguincomputing.com
Penguin Computing
users-requ...@open-mpi.org wrote:
Date: Sat, 11 Oct 2008 07:56:02 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for
start
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <3e62159b-14b9-4d44-96f6-0345079bc...@cisco.com>
I don't know much/anything about SGE (I'll leave that to the Sun folks
on this list to reply), but I can tell you about the tm plugins: tm is
the protocol used by the PBS/Torque family of launchers. It looks
like your Open MPI was built with TM support, but when you launch,
it's likely unable to find the support libraries that it needs to load
those plugins.
This is probably fine in your case, since you want to use SGE, not TM.
On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
I am relatively new to OpenMPI and Sun Grid Engine parallel
integration. I have a small cluster running SGE 6.2 on Linux
machines, all using Intel Xeon processors. I have installed OpenMPI
1.2.7 from source using the --with-sge switch. Now I am trying to
troubleshoot some problems I am having. I have created a simple job
script:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
And the output on the error stream:
more junksub.sh.e3574
[shakespeare:05720] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05720] mca: base: component_find: unable to open pls tm:
file not found (ignored)
Starting server daemon at host "shakespeare.nci.nih.gov"
Starting server daemon at host "octopus.nci.nih.gov"
Server daemon successfully started with task id "1.shakespeare"
[shakespeare:05733] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05733] mca: base: component_find: unable to open pls tm:
file not found (ignored)
error: executing task of job 3576 failed: failed sending task to
ex...@octopus.nci.nih.gov: can't find connection
[shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed
to start as expected.
[shakespeare:05720] ERROR: There may be more information available from
[shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[shakespeare:05720] ERROR: If the problem persists, please restart the
[shakespeare:05720] ERROR: Grid Engine PE job
[shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
However, there is no output in any output stream.
And if I log into shakespeare and qrsh -q all.q@octopus, I immediately
get a slot, so there isn't a "direct" problem with connecting.
Based on a hint from folks on the SGE mailing list, it appears that
qrsh is not being used for job submission. Any suggestions as to why
this might be the case (or whether it is the case)?
Thanks,
Sean