Re: [gridengine users] SGE-6.2u5: Sudden death of all cluster jobs.

Mark Dixon Tue, 15 Mar 2011 08:17:49 -0700

On Tue, 15 Mar 2011, Erik Soyez wrote:

Mark, thanks a lot for your reply!


Do you have any idea under which circumstances that happens or what
configuration details could be responsible?  Did you use tight MPI
integration (the problem has never occurred before with loose MPI
integration)?  Although (re-soft-)starting the execds helped, it could
also be a qmaster problem, because it hit the entire cluster within a
few hours.  Or maybe each execd had just run the 200th job after some
time (which means that it will happen again after the next 200 jobs
on each node).  I might experiment with smaller gid ranges and see
if it happens any sooner.

Erik Soyez.

As I said, I've been meaning to look at it more closely before making a proper bug report.

On our system, we're using tight integration. We're also making users specify an h_rt value. Once the h_rt value has expired, regardless of whether the job has completed or not, the relevant execd starts logging messages like:


failed to deliver signal 20 to job 1460921.1 task 40.c1s3b11n1 for KILL 
(shepherd with pid 3863): No such file or directory

(Note the "40.c1s3b11n1". This is a slave task of a tightly-integrated parallel job.)
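
To get a feel for how quickly these messages accumulate, something like the following could be run over an execd's messages file. It is only a rough sketch: the regular expression is written against the single sample message quoted above, and the function names and the field names ("job", "pe_task", "pid") are my own labels rather than anything SGE defines.

```python
import re
from collections import Counter

# Pattern derived from the one sample message quoted above; other SGE
# versions may word the message differently (assumption).
FAILED_KILL = re.compile(
    r"failed to deliver signal (?P<signal>\d+) "
    r"to job (?P<job>\d+)\.(?P<array_task>\d+) "
    r"task (?P<pe_task>\S+) for KILL\s*"
    r"\(shepherd with pid (?P<pid>\d+)\)"
)


def parse_failed_kills(lines):
    """Return one dict per 'failed to deliver signal' message found."""
    hits = []
    for line in lines:
        m = FAILED_KILL.search(line)
        if m:
            hits.append({
                "signal": int(m.group("signal")),
                "job": int(m.group("job")),
                "pe_task": m.group("pe_task"),
                "pid": int(m.group("pid")),
            })
    return hits


def count_by_job(hits):
    """Count how many failed-kill messages each job has generated."""
    return Counter(h["job"] for h in hits)
```

Feeding it the lines of the messages file (wherever your execd spool directory keeps it) gives a per-job count and the set of shepherd PIDs the execd is still trying to signal.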

As jobs continue to execute on the system, these messages mount up. GID starvation is only part of it: you also start playing Russian roulette with those ex-shepherd PIDs that the execd keeps on trying to kill. This means that simply increasing the GID range isn't a good answer.
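
The PID-reuse risk can be checked by hand on a Linux node. The sketch below assumes a Linux-style /proc and assumes the shepherd's process name starts with "sge_shepherd"; both are assumptions on my part, and classify_pid is a made-up helper, not part of SGE.

```python
import os


def classify_pid(pid, proc_root="/proc"):
    """Classify a PID that the execd is still trying to signal.

    Returns "gone" if no such process exists, "shepherd" if the process
    name looks like an SGE shepherd (assumed prefix "sge_shepherd"), and
    "reused" if the PID now belongs to some unrelated process.
    """
    comm_path = os.path.join(proc_root, str(pid), "comm")
    try:
        with open(comm_path) as fh:
            name = fh.read().strip()
    except (FileNotFoundError, ProcessLookupError):
        # Nothing is running under this PID at the moment.
        return "gone"
    return "shepherd" if name.startswith("sge_shepherd") else "reused"
```

Any PID from the log messages that comes back as "reused" is one the execd could hit by mistake, which is the roulette described above.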

There seems to be a more severe version of the problem that happens as our execds get close to their GID limits, where a job ends, the shepherd creates the usage file in the client spool, but doesn't send it to the qmaster. You end up with an unkillable job reported in qstat. Again, the kludge is to (soft) restart the execd, which notices the usage file and cleans up the job and accounting data.

To keep a lid on these problems, we (soft) restart the execds on all the compute nodes a couple of times a week.


Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------