Not really a slurm question, but here's my 2 cents:
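FWIW, if they are truly unkillable (stuck in uninterruptible sleep, state
D in ps, so kill -9 has no effect), you can generally only get rid of them
with a reboot, unless whatever I/O they are blocked on comes back.
Strictly speaking these are not zombies: a zombie (state Z) has already
exited and holds nothing but a process-table entry.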
If they aren't using much in the way of resources, you may want to
just ignore them until your next maintenance window and then reboot.
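If you want to check which state the stuck processes are actually in
before deciding, here is a rough sketch (not something from the original
post) that reads /proc on a Linux login node and lists processes in state
D or Z. The function names parse_stat and stuck_processes are made up for
illustration, and the uid filter assumes the owner of /proc/<pid> matches
the user you care about, which is normally but not always the case.

```python
import os


def parse_stat(stat_text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    fields = stat_text[right + 1:].split()
    # After comm, the first field is the state and the second is the PPID.
    return pid, comm, fields[0], int(fields[1])


def stuck_processes(uid=None, proc_root="/proc"):
    """Yield (pid, comm, state, ppid) for processes in state D or Z.

    If uid is given, only processes whose /proc entry is owned by that
    uid are considered (an assumption about how ownership is reported).
    """
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        path = os.path.join(proc_root, entry)
        try:
            if uid is not None and os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "stat")) as fh:
                info = parse_stat(fh.read())
        except OSError:
            continue  # process exited meanwhile, or is not readable
        if info[2] in ("D", "Z"):
            yield info


if __name__ == "__main__":
    for pid, comm, state, ppid in stuck_processes():
        print(f"{pid:>7} {state} ppid={ppid:<7} {comm}")
```

A process that shows D across repeated runs is blocked in the kernel
(often on a hung filesystem), while Z entries are already dead and are
only waiting for their parent to reap them.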
This is one of the reasons I do not architect login nodes to allow
access to applications or much of anything. Minimal everything.
If your login node gets quite a bit of traffic, you should look at
setting up a load-balanced HA configuration with several login nodes.
Users should not have much of anything going on on a login node. Just
submit your job and do your work on the compute node, even if it is an
interactive job. That keeps your dev/test environment the same as the
runtime environment.
Brian Andrus
On 7/19/2021 7:09 AM, Durai Arasan wrote:
Hello,
One of our Slurm users' accounts is hung with uninterruptible
processes. These processes cannot be killed, so a restart is
required. Is it possible to restart only that user's login
environment? I would rather not restart the entire login node.
Thanks!
Durai
Max Planck Institute Tübingen