Here are the configuration parameters that you would use for this.
Note that the node will need to be set DOWN in order for the
completing job to be purged.
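For instance (the node name here is just an example), forcing the
node DOWN and then resuming it should purge the stuck completing
job and return the node to service:

  scontrol update NodeName=node001 State=DOWN Reason="unkillable job"
  scontrol update NodeName=node001 State=RESUME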
UnkillableStepProgram
If the processes in a job step are determined to be unkillable
for a period of time specified by the UnkillableStepTimeout
variable, the program specified by UnkillableStepProgram will
be executed. This program can be used to take special actions
to clean up the unkillable processes and/or notify computer
administrators. The program will be run as the SlurmdUser
(usually "root") on the compute node. By default no program
is run.
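As a sketch of what such a program might look like (this script is
purely illustrative, not part of Slurm; whether variables such as
SLURM_JOB_ID are set in its environment depends on your Slurm
version, so treat those lookups as assumptions):

  #!/usr/bin/env python3
  # Illustrative UnkillableStepProgram: log the event so that
  # administrators are notified. Slurm runs this as the SlurmdUser
  # on the compute node where the unkillable step was detected.
  import os
  import socket
  import syslog
  import time

  def main():
      host = socket.gethostname()
      # Assumption: these environment variables may not be set by
      # every Slurm version; fall back to "unknown" if absent.
      job = os.environ.get("SLURM_JOB_ID", "unknown")
      step = os.environ.get("SLURM_STEP_ID", "unknown")
      syslog.openlog("unkillable_step")
      syslog.syslog(syslog.LOG_ALERT,
                    "unkillable step on %s at %s (job %s, step %s)"
                    % (host, time.strftime("%Y-%m-%d %H:%M:%S"),
                       job, step))
      # Site-specific cleanup or admin notification (e.g. sending
      # mail, draining the node) would go here.

  if __name__ == "__main__":
      main()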
UnkillableStepTimeout
The length of time, in seconds, that SLURM will wait before
deciding that processes in a job step are unkillable (after
they have been signaled with SIGKILL) and execute
UnkillableStepProgram as described above.
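Both parameters go into slurm.conf; the path and timeout value
below are only examples:

  UnkillableStepProgram=/usr/local/sbin/unkillable_step.py
  UnkillableStepTimeout=120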
Quoting Ulf Markwardt <[email protected]>:
Dear all,
it happens that a job hangs waiting for an unresponsive file system.
This job cannot be killed, so we have to reboot the node. My idea
would be to
1) set the node to drain,
2) force the batchsystem to forget the CG jobs
(else it would never reach the drained state),
3) reboot the node via Slurm
4) set the node to "resume".
The second step is inspired by LSF, which has a forced cancellation
for zombie processes. Do we have something similar in Slurm?
Thank you,
Ulf
PS. It would be great to have a reboot flag to bring the rebooted
node back into the "resume" state automatically. Or do we have it
already?
--
___________________________________________________________________
Dr. Ulf Markwardt
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: (+49) 351/463-33640 WWW: http://www.tu-dresden.de/zih