Hi everyone,

We have a strange problem here: jobs are killed by SIGKILL (so far, I have 
failed to find out what sent the signal), but some of their processes remain 
on the node. We are using one of the killkids variants, but (at least) for 
multi-node jobs, there are actually *two* gids in use on the job master: one 
for the jobscript, mpiexec.hydra and qsh, and another one for qrsh_starter and 
the actual executables. terminate_method, however, runs only once (for the 
jobscript gid), so the executables are never signalled. Unfortunately, one of 
them hangs in a write while the others busy-wait, significantly slowing down 
the next job.
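
To see which processes carry which gid, something like the following could be 
run on the node. It is only a minimal sketch, assuming Linux with /proc 
mounted; the function names are made up for illustration and are not part of 
Grid Engine, and the gid passed in is whatever additional gid was assigned to 
the job.

```python
import os


def parse_groups(status_text):
    """Return the supplementary group IDs found in /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the "Groups:" label is a list of numeric gids.
            return {int(tok) for tok in line.split()[1:]}
    return set()


def pids_with_gid(gid, proc_root="/proc"):
    """List the PIDs whose supplementary groups include the given gid."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                groups = parse_groups(fh.read())
        except OSError:
            continue  # process exited during the scan, or is not readable
        if gid in groups:
            pids.append(int(entry))
    return sorted(pids)
```

Calling pids_with_gid() once for each of the two gids would show which 
leftover processes belong to the group that terminate_method never sees.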

I guess the best way out of this would be to use a cgroups-capable GE version, 
but I am somewhat reluctant to perform a major upgrade on a production cluster 
unless absolutely necessary.

So, back to the question: is it normal to have two different gids with 
ENABLE_ADDGRP_KILL? Is terminate_method supposed to run twice in this case?

Moreover, is it possible to find out what killed the job in the first place? 
Login to the compute nodes is not allowed, so this must have happened without 
manual intervention.

Thanks a lot,

A.

PS: Software is OGS/GE 2011.11
-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

