On Tue, 15 Mar 2011, Erik Soyez wrote:
Mark, thanks a lot for your reply!
Do you have any idea under which circumstances that happens or what
configuration details could be responsible? Did you use tight MPI
integration (the problem has never occurred before with loose MPI
integration)? Although (re-soft-)starting the execds helped, it could
also be a qmaster problem, because it hit the entire cluster within a
few hours. Or maybe each execd had just run the 200th job after some
time (which means that it will happen again after the next 200 jobs
on each node). I might experiment with smaller gid ranges and see
if it happens any sooner.
Erik Soyez.
As I said, I've been meaning to look at it more closely before making a
proper bug report.
On our system, we're using tight integration. We're also making users
specify an h_rt value. Once the h_rt value has expired, regardless of
whether the job has completed or not, the log on the relevant execd starts
logging messages like:
failed to deliver signal 20 to job 1460921.1 task 40.c1s3b11n1 for KILL
(shepherd with pid 3863): No such file or directory
(Note the "40.c1s3b11n1". This is a slave task of a tightly-integrated
parallel job.)
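
To gauge how fast these messages are mounting up on a node, something
like the following rough sketch could tally them per job. It assumes
each message sits on a single line of the execd's messages file and
follows the wording quoted above; the function name and the regular
expression are my own illustration, not anything Grid Engine provides.

```python
import re
from collections import Counter

# Pattern modelled on the message quoted above. Real log lines may carry
# extra prefix fields (timestamp, host, severity), so we search, not match.
# The "task ..." part is treated as optional (an assumption).
FAILED_SIGNAL = re.compile(
    r"failed to deliver signal (?P<signal>\d+) to job (?P<job>\d+\.\d+)"
    r"(?: task (?P<task>\S+))? for (?P<action>\w+) "
    r"\(shepherd with pid (?P<pid>\d+)\)"
)


def count_failed_signals(lines):
    """Return (per-job message counts, set of shepherd PIDs mentioned)."""
    per_job = Counter()
    stale_pids = set()
    for line in lines:
        m = FAILED_SIGNAL.search(line)
        if m:
            per_job[m.group("job")] += 1
            stale_pids.add(int(m.group("pid")))
    return per_job, stale_pids


if __name__ == "__main__":
    import sys

    # Usage: python count_failed.py /path/to/execd/messages
    with open(sys.argv[1], errors="replace") as fh:
        jobs, pids = count_failed_signals(fh)
    for job, n in jobs.most_common():
        print(job, n)
    print("distinct shepherd PIDs:", len(pids))
```

The number of distinct PIDs is the interesting figure here, since each
one is a PID the execd may still be trying to signal.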
As jobs continue to execute on the system, these messages mount up. GID
starvation is only part of it: you also start playing Russian roulette
with those ex-shepherd PIDs that the execd keeps on trying to kill. This
means that simply increasing the GID range isn't a good answer.
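
To illustrate the PID-reuse risk: before trusting an old shepherd PID,
one would at least want to check that it still names a shepherd. The
sketch below assumes Linux (/proc/<pid>/comm) and that the shepherd's
command name starts with "sge_shepherd"; both are assumptions on my
part, and the helper is hypothetical, not how the execd works. Note that
check-then-signal is still racy, so this only narrows the window.

```python
import os


def pid_is_still(pid, expected_prefix="sge_shepherd", proc_root="/proc"):
    """Return True if `pid` exists and its command name starts with
    `expected_prefix`. Linux-only: reads <proc_root>/<pid>/comm."""
    try:
        with open(os.path.join(proc_root, str(pid), "comm")) as fh:
            name = fh.read().strip()
    except OSError:
        # No such process (or not readable): certainly not our shepherd.
        return False
    return name.startswith(expected_prefix)
```

A recycled PID that now belongs to, say, a user's shell would return
False here, which is exactly the case where sending it a signal would
hurt.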
There seems to be a more severe version of the problem that happens as our
execds get close to their GID limits: a job ends and the shepherd
creates the usage file in the client spool, but doesn't send it to the
qmaster. You end up with an unkillable job reported in qstat. Again, the
kludge is to (soft) restart the execd, which notices the usage file and
cleans up the job and accounting data.
To keep a lid on these problems, we (soft) restart the execds on all the
compute nodes a couple of times a week.
Mark
--
-----------------------------------------------------------------
Mark Dixon Email : [email protected]
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------