Hi Henk,

SLIM H.A. wrote:
Dear Pak Lui

I can delete the (sge) job with qdel -f such that it disappears from the
job list but the application processes keep running, including the
shepherds. I have to kill them with kill -15.

For some reason the kill -15 does not reach mpirun. (We use such a
parameter to mpirun on our myrinet mx nodes with mpich, that's why I
asked).

I believe qdel sends a SIGKILL to mpirun rather than a SIGTERM (-15), which is why you don't see the signal reach mpirun. Since there is no way to catch a SIGKILL, that may be why orted and the processes keep running.
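To illustrate the difference between the two signals, here is a minimal sketch in plain sh. It does not involve SGE, mpirun or orted; the child is just a stand-in loop, and the function name signal_child is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: start a child that handles SIGTERM, send it one
# signal, and print the child's exit status.
signal_child() {
    sig="$1"
    # The child exits cleanly (status 0) if its SIGTERM handler gets to run.
    sh -c 'trap "exit 0" TERM; while :; do sleep 1; done' >/dev/null 2>&1 &
    pid=$!
    sleep 1                      # give the child time to install its trap
    kill -s "$sig" "$pid"
    status=0
    wait "$pid" || status=$?     # a signal death shows up as 128+N in bash/dash
    echo "$status"
}
```

Calling signal_child TERM should print 0, because the handler runs and the child exits by itself. Calling signal_child KILL should print 137 (128 + 9) under bash or dash, because the child is killed before any handler can run.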

Hmm, this actually reminds me of a related problem: the qsub -notify option does not work as intended under ORTE. The qsub -notify option is supposed to send a SIGUSR2 to mpirun and the processes as a warning of an impending SIGKILL, N seconds before it actually happens. However, we don't catch the SIGUSR2 signal in ORTE specifically for SGE (or in the gridengine modules), so the user would see mpirun and orted exit before the user apps can catch the SIGUSR2 signal. I should file a trac bug against this SGE feature, which we don't yet support, and fix it sometime in the future.

So, back to your problem. Although this is unintended, maybe you can try running the job with qsub -notify in the meantime, until we change the behavior described above. It will send a SIGUSR2 to mpirun, which should terminate mpirun, orted and the user processes more gracefully than qdel (or SIGKILL) does, because SIGKILL does not allow orted to kill off the user processes, as SIGTERM or SIGUSR1/2 would.
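The "warn first, kill later" sequence can be sketched in plain sh as follows. This is only a simulation under my own assumptions: the child is a stand-in loop rather than mpirun or orted, the signals are sent by the script itself rather than by SGE, and simulate_notify is a made-up name.

```shell
#!/bin/sh
# Simulate a notify-style shutdown: send SIGUSR2 as an advance warning,
# wait a grace period, then send SIGKILL if the process is still there.
simulate_notify() {
    grace="$1"                   # seconds between SIGUSR2 and SIGKILL
    marker=$(mktemp)             # the child records its cleanup in this file
    # The marker path is passed as $0 of the inner shell.
    sh -c 'trap "echo cleaned-up > \"$0\"; exit 0" USR2
           while :; do sleep 1; done' "$marker" >/dev/null 2>&1 &
    pid=$!
    sleep 1                      # let the child install its handler
    kill -s USR2 "$pid"          # the advance warning
    sleep "$grace"
    kill -s KILL "$pid" 2>/dev/null || true   # harmless if it already exited
    wait "$pid" 2>/dev/null || true
    cat "$marker"                # prints what the child managed to record
    rm -f "$marker"
}
```

With a grace period of a couple of seconds, for example simulate_notify 2, the child has time to run its handler and the function prints cleaned-up. A process that receives only the SIGKILL never gets that chance.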


Just to confirm, there is no configure directive specific to gridengine
when building openmpi?

Right, there aren't any configure directives currently.


Thanks

henk

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: 23 July 2007 15:16
To: Open MPI Users
Subject: Re: [OMPI users] sge qdel fails

Hi Henk,

The sge script should not require any extra parameter. The qdel command should send the kill signal to mpirun and also remove the SGE-allocated tmp directory (something like /tmp/174.1.all.q/), which contains the OMPI session dir for the running job. That in turn should cause orted and the user processes to exit.

Maybe you could try qdel -f <jid> to force the delete from the sge_qmaster, in case sge_execd does not respond to the delete request from the sge_qmaster?

SLIM H.A. wrote:
I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2), following the recommendation in the OpenMPI FAQ

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The job runs but when the user wants to delete the job with the qdel
command, this fails. Does the mpirun command

mpirun -np $NSLOTS ./exe

in the sge script require extra parameters?

Thanks for any advice

Henk

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--

- Pak Lui
pak....@sun.com




--

- Pak Lui
pak....@sun.com
