Hi everyone,

we have a strange problem here where jobs die through SIGKILL (so far, I have failed to find out what triggered the signal) but then some processes remain on the node.

We are using one of the killkids variants, but (at least) for multi-node jobs, there are actually *two* gids in use on the job master: one for the jobscript, mpiexec.hydra and qsh, and another one for qrsh_starter and the actual executables. terminate_method, however, runs only once (for the jobscript gid), so the executables remain unchallenged. Unfortunately, one of them hangs in a write while the others perform a busy wait, significantly slowing down the next job.
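To illustrate the cleanup that is missing here, below is a minimal Python sketch (not the killkids script itself) that collects every process carrying any of a list of additional group ids, so that both gids could be handled in one pass. It assumes a Linux node where the additional gid shows up on the "Groups:" line of /proc/<pid>/status. The function names and the gid values in the usage note are hypothetical, and the default is a dry run that only reports the pids.

```python
import os
import signal


def supplementary_gids(status_text):
    """Return the gids listed on the 'Groups:' line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(tok) for tok in line.split()[1:]}
    return set()


def pids_with_gid(gid, proc_root="/proc"):
    """Return the sorted pids whose supplementary groups include gid."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                if gid in supplementary_gids(fh.read()):
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return sorted(pids)


def kill_by_gids(gids, proc_root="/proc", sig=signal.SIGKILL, dry_run=True):
    """Collect all pids carrying any of the given gids; signal them unless dry_run."""
    victims = sorted({pid for gid in gids for pid in pids_with_gid(gid, proc_root)})
    if not dry_run:
        for pid in victims:
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                pass  # already gone
    return victims
```

Called as kill_by_gids([20001, 20002]) with the two gids seen on the job master (values made up here), this would list the jobscript side and the qrsh_starter side together; passing dry_run=False would actually send the signal.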
I guess the best way out of this would be to use a cgroups-capable GE version, but I am somewhat reluctant to perform a major upgrade on a production cluster unless absolutely necessary.

So, back to the question: is it normal to have two different gids with ENABLE_ADDGRP_KILL? Is terminate_method supposed to run twice in this case? Moreover, is it possible to find out what killed the job in the first place? Login to the compute nodes is not allowed, so this must have happened without manual intervention.

Thanks a lot,
A.

PS: Software is OGS/GE 2011.11

--
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
