Hi Henk,
SLIM H.A. wrote:
Dear Pak Lui
I can delete the (SGE) job with qdel -f so that it disappears from the
job list, but the application processes keep running, including the
shepherds. I have to kill them with kill -15.
For some reason the kill -15 does not reach mpirun. (We use such a
parameter to mpirun on our Myrinet MX nodes with MPICH; that's why I
asked.)
I believe qdel sends a SIGKILL to mpirun rather than a SIGTERM (-15),
which is why you don't see the signal reach mpirun. Since there is no
way to catch a SIGKILL, mpirun gets no chance to clean up, and that may
be why the orted and the user processes keep running.
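To illustrate the difference, here is a minimal shell sketch that is
independent of SGE and Open MPI. The function names worker and
send_and_report are made up for this example; it only shows that a
SIGTERM handler gets to run, while SIGKILL bypasses any handler.

```shell
#!/bin/sh
# Illustration only: 'worker' and 'send_and_report' are hypothetical helpers,
# not part of SGE or Open MPI.

# A stand-in for a launcher that cleans up when it receives SIGTERM.
worker() {
    marker=$1
    trap 'echo cleanup-ran > "$marker"; exit 0' TERM
    while :; do sleep 1; done
}

# Start a worker, send it the given signal, and report whether its handler ran.
send_and_report() {
    sig=$1
    marker=$(mktemp)
    worker "$marker" &
    pid=$!
    sleep 1                      # give the worker time to install its trap
    kill -s "$sig" "$pid"
    wait "$pid" 2>/dev/null
    if [ -s "$marker" ]; then
        echo "$sig: handler ran"
    else
        echo "$sig: no handler ran"
    fi
    rm -f "$marker"
}

send_and_report TERM
send_and_report KILL
```

With TERM the worker's trap writes the marker file before exiting; with
KILL the worker is removed without running any of its own code, so
anything it was supposed to clean up is left behind.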
Hmm, this actually reminds me of a related problem: the qsub -notify
option does not work as intended under ORTE. The -notify option is
supposed to send a SIGUSR2 to mpirun and the processes N seconds before
the impending SIGKILL actually happens. However, we don't catch the
SIGUSR2 signal in ORTE specifically for SGE (in the gridengine
modules), so the user would see mpirun and orted exit before the user
apps can catch the SIGUSR2 signal. I should file a trac bug against
this SGE feature, which we don't yet support, and fix it sometime in
the future.
So, back to your problem. Although this is unintended, maybe you can try
running the job with qsub -notify for the meantime, until we change the
behavior described above. It will send a SIGUSR2 to mpirun, which should
terminate mpirun, orted and the user processes more gracefully than
qdel (or SIGKILL) does, because SIGKILL does not allow orted to kill off
the user processes, whereas SIGTERM or SIGUSR1/2 would.
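If you would rather not depend on what mpirun does with SIGUSR2, one
possible variation is to have the job script itself trap the warning
and pass a SIGTERM on to mpirun. This is only a sketch under
assumptions: it assumes the job script's shell receives the SIGUSR2
sent by -notify, the helper name run_with_usr2_forward is made up for
this example, and it is untested under SGE.

```shell
#!/bin/sh
#$ -notify
# Sketch only: run_with_usr2_forward is a hypothetical helper, and this assumes
# SGE delivers the -notify SIGUSR2 warning to this script's shell.

# Run a command in the background; if SIGUSR2 arrives, forward SIGTERM to it.
run_with_usr2_forward() {
    got_usr2=0
    "$@" &
    child=$!
    trap 'got_usr2=1; kill -s TERM "$child" 2>/dev/null' USR2
    wait "$child"
    status=$?
    # A trapped signal interrupts wait, so wait again to collect the
    # child's real exit status.
    if [ "$got_usr2" -eq 1 ]; then
        wait "$child"
        status=$?
    fi
    trap - USR2
    return "$status"
}

# Only launch when running inside an SGE job with mpirun available.
if [ -n "${NSLOTS:-}" ] && command -v mpirun >/dev/null 2>&1; then
    run_with_usr2_forward mpirun -np "$NSLOTS" ./exe
fi
```

The script would be submitted with qsub as usual; the "#$ -notify" line
has the same effect as passing -notify on the qsub command line.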
Just to confirm, there is no configure directive specific to gridengine
when building openmpi?
Right, there aren't any configure directives currently.
Thanks
henk
-----Original Message-----
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: 23 July 2007 15:16
To: Open MPI Users
Subject: Re: [OMPI users] sge qdel fails
Hi Henk,
The sge script should not require any extra parameters. The
qdel command should send the kill signal to mpirun and also
remove the SGE-allocated tmp directory (something like
/tmp/174.1.all.q/), which contains the OMPI session dir for
the running job; this in turn should cause orted and the user
processes to exit.
Maybe you could try qdel -f <jid> to force deletion from the
sge_qmaster, in case sge_execd does not respond to the
delete request from the sge_qmaster?
SLIM H.A. wrote:
I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2),
following the recommendation in the OpenMPI FAQ
http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
The job runs but when the user wants to delete the job with the qdel
command, this fails. Does the mpirun command
mpirun -np $NSLOTS ./exe
in the sge script require extra parameters?
Thanks for any advice
Henk
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
- Pak Lui
pak....@sun.com