On 12/20/2016 08:08 PM, Near-Ansari, Naveed wrote:
With higher core count machines and support for cgroups, we are now starting to
share nodes within Slurm, but cleanup has been a challenge. We want to make
sure that if a user has 2 jobs on a shared node, we don’t inadvertently kill
processes from the wrong job.
Does anyone know if there is a feature similar to the “uniqueuser” node access
policy in Moab? This allowed for shared use of nodes, but only when the
jobs were from different users. This made cleaning up processes in an epilog
script simple.
If not, how do other people clean up leftover processes on shared nodes? Do
you use an epilog script to kill processes? If so, how do you determine which
processes are from which jobs?
Hi Naveed,
We also allow several jobs to share a compute node, and, yes, process cleanup
is a challenge.
The "ps" command can tell which processes belong to which user.
The process cleanup algorithm in our slurm.epilog goes in a sequence
something like this:
1/ We save information about all processes belonging to ordinary users (i.e.
not root, slurm, munge, ...).
2/ With "squeue" we look at all jobs currently running on this node, to decide
which users have jobs that need to be kept alive (the "kept" list).
3/ Kill all processes that we saved information about, for all users who are
not on the "kept" list created above.
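
A minimal Python sketch of steps 1/ to 3/ might look like the following. It is
an illustration only, not the actual UPPMAX epilog: the function names and the
SYSTEM_USERS set are hypothetical, and the exact "ps" and "squeue" options are
an assumption about how the information could be gathered. Note that "ps" may
shorten or replace long user names, so a real script might prefer comparing
numeric UIDs.

```python
import subprocess

# Assumption: accounts whose processes are never touched; extend per site.
SYSTEM_USERS = {"root", "slurm", "munge"}


def parse_ps(ps_output):
    """Turn the output of `ps -e -o pid= -o user=` into (pid, user) pairs."""
    procs = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            procs.append((int(fields[0]), fields[1]))
    return procs


def select_victims(procs, kept_users, system_users=SYSTEM_USERS):
    """Step 3: PIDs owned by ordinary users who are not on the kept list."""
    return [pid for pid, user in procs
            if user not in system_users and user not in kept_users]


def list_processes():
    """Step 1: snapshot of every process on this node (filtered in step 3)."""
    out = subprocess.run(["ps", "-e", "-o", "pid=", "-o", "user="],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out)


def users_with_running_jobs(node):
    """Step 2: users who still have a RUNNING job on the given node."""
    out = subprocess.run(
        ["squeue", "-h", "-w", node, "-t", "RUNNING", "-o", "%u"],
        capture_output=True, text=True, check=True).stdout
    return set(out.split())
```

On a compute node, the PIDs to kill would then be
select_victims(list_processes(), users_with_running_jobs(node)), where node is
the node name as Slurm knows it. Keeping the selection separate from the
command calls makes it easy to test against canned "ps" output before it is
allowed to kill anything.
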
This sequence goes in a loop until we have no processes left to kill or until
we have run the loop a ridiculous number of times (there might e.g. be zombie
processes, which are not possible to kill).
When we find that some other job is also in a completing state, and has been
so for less than two minutes, we do not kill anything on this pass and instead
wait two minutes before restarting the loop, allowing Slurm itself some time
to kill the processes.
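
The loop and the two-minute grace period could be sketched in Python roughly as
below. This is a hedged illustration, not our actual script: cleanup_loop and
its three callbacks are hypothetical names, MAX_PASSES is an arbitrary stand-in
for "a ridiculous number of times", and how to obtain the time each other job
has spent completing is left to the caller. Counting a waiting pass against the
limit is also an assumption.

```python
import time

GRACE_SECONDS = 120  # two minutes, as described above
MAX_PASSES = 20      # assumption: arbitrary upper bound on loop passes


def cleanup_loop(find_victims, kill, completing_ages,
                 sleep=time.sleep, max_passes=MAX_PASSES, grace=GRACE_SECONDS):
    """Kill leftover processes until none remain or max_passes is reached.

    find_victims()    -> list of PIDs that should be killed right now
    kill(pid)         -> kills one process
    completing_ages() -> seconds each *other* job on this node has spent
                         in the completing state (empty if there are none)
    Returns True if the node ended up clean, False if we gave up.
    """
    for _ in range(max_passes):
        # Another job only just started completing: let Slurm clean it up.
        if any(age < grace for age in completing_ages()):
            sleep(grace)
            continue
        victims = find_victims()
        if not victims:
            return True
        for pid in victims:
            kill(pid)
    return False  # e.g. zombie processes that cannot be killed
```

In a real epilog, kill could wrap os.kill(pid, signal.SIGKILL) and ignore
ProcessLookupError for processes that have already exited. Passing the three
operations in as callbacks keeps the control flow testable without touching
real processes.
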
We also added some logging, which makes it easier to verify that the correct
processes are killed (and gives us something to look at when users say that
their jobs strangely died too early).
Best wishes,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
http://uppmax.uu.se/