With higher core count machines and support for cgroups, we are now starting to share nodes within Slurm, but cleanup has been a challenge. We want to make sure that if a user has two jobs on a shared node, we don’t inadvertently kill processes from the wrong job.
Does anyone know if there is a feature similar to Moab's “uniqueuser” node access policy? It allowed nodes to be shared, but only when the jobs were from different users. This made cleaning up processes in an epilog script simple.

If not, how do other people clean up leftover processes on shared nodes? Do you use an epilog script to kill processes? If so, how do you determine which processes are from which jobs?

Thanks,
Naveed