Here are the configuration parameters that you would use for this.
Note that the node will need to be set DOWN in order for the
completing job to be purged.
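For instance (the node name here is just an example), forcing the
node DOWN and then resuming it should purge the stuck completing
job and return the node to service:

  scontrol update NodeName=node001 State=DOWN Reason="unkillable job"
  scontrol update NodeName=node001 State=RESUME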
UnkillableStepProgram
If the processes in a job step are determined to be unkillable
for a period of time specified by the UnkillableStepTimeout
variable, the program specified by UnkillableStepProgram will
be executed. This program can be used to take special actions
to clean up the unkillable processes and/or notify computer
administrators. The program will be run as the SlurmdUser
(usually "root") on the compute node. By default no program
is run.
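As a sketch of what such a program might look like (this script is
purely illustrative, not part of Slurm; whether variables such as
SLURM_JOB_ID are set in its environment depends on your Slurm
version, so treat those lookups as assumptions):

  #!/usr/bin/env python3
  # Illustrative UnkillableStepProgram: log the event so that
  # administrators are notified. Slurm runs this as the SlurmdUser
  # on the compute node where the unkillable step was detected.
  import os
  import socket
  import syslog
  import time

  def main():
      host = socket.gethostname()
      # Assumption: these environment variables may not be set by
      # every Slurm version; fall back to "unknown" if absent.
      job = os.environ.get("SLURM_JOB_ID", "unknown")
      step = os.environ.get("SLURM_STEP_ID", "unknown")
      syslog.openlog("unkillable_step")
      syslog.syslog(syslog.LOG_ALERT,
                    "unkillable step on %s at %s (job %s, step %s)"
                    % (host, time.strftime("%Y-%m-%d %H:%M:%S"),
                       job, step))
      # Site-specific cleanup or admin notification (e.g. sending
      # mail, draining the node) would go here.

  if __name__ == "__main__":
      main()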
UnkillableStepTimeout
The length of time, in seconds, that SLURM will wait before
deciding that processes in a job step are unkillable (after
they have been signaled with SIGKILL) and execute
UnkillableStepProgram as described above.
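Both parameters go into slurm.conf; the path and timeout value
below are only examples:

  UnkillableStepProgram=/usr/local/sbin/unkillable_step.py
  UnkillableStepTimeout=120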
Quoting Ulf Markwardt <[email protected]>:
Dear all,
it happens that a job hangs waiting for an unresponsive file system.
This job cannot be killed, so we have to reboot the node. My idea
would be to
1) set the node to drain,
2) force the batchsystem to forget the CG jobs
(else it would never reach the drained state),
3) reboot the node via Slurm
4) set the node to "resume".
The second step is inspired by LSF, which has a forced cancellation
for zombie processes. Do we have something similar in Slurm?
Thank you,
Ulf
PS. It would be great to have a reboot flag to bring the rebooted
node back into the "resume" state automatically. Or do we have it
already?
--
___________________________________________________________________
Dr. Ulf Markwardt
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: (+49) 351/463-33640 WWW: http://www.tu-dresden.de/zih