On Wednesday, 05 July 2017, at 11:03:12 (-0600),
Belgin, Mehmet wrote:

> Hi everyone, I have another newbie question. 
> 
> What's the best way to prevent Slurm from allocating jobs on nodes with 
> untracked CPU load (e.g. runaway system processes, zombie processes, etc.)? 
> 
> We do core-based allocation, which complicates things a bit. But even 
> checking the CPU load for nodes that are supposed to be completely idle 
> (torque-style) would be a good start. 
> 
> Any suggestions would be appreciated. 

Knowing that you use NHC, I'd say the simplest solution for you might
just be to add check_ps_loadavg() to your NHC configuration.  See what
your typical 1-, 5-, and/or 15-minute load averages tend to be on idle
compute nodes (probably < 1 or at most < 2) and then have NHC offline
nodes that exceed the threshold.
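
To pick sensible thresholds first, it may help to sample the load averages
on a few idle nodes. The following is only a rough sketch in Python, not
part of NHC: the function names and the limits of 2.0 are placeholders
chosen for illustration, and it assumes a Linux node where /proc/loadavg
is available.

```python
import os


def parse_loadavg(text):
    """Return the (1m, 5m, 15m) load averages from /proc/loadavg-style text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def exceeds(loads, limits):
    """True if any load average is above its corresponding limit."""
    return any(load > limit for load, limit in zip(loads, limits))


if __name__ == "__main__":
    # Hypothetical limits; replace with values observed on your idle nodes.
    limits = (2.0, 2.0, 2.0)
    with open("/proc/loadavg") as handle:
        loads = parse_loadavg(handle.read())
    state = "over threshold" if exceeds(loads, limits) else "ok"
    print(os.uname().nodename, loads, state)
```

Running this across idle nodes for a while should show what "normal" looks
like, and those numbers can then be used as the limits passed to
check_ps_loadavg() in the NHC configuration. Check the NHC documentation
for the exact argument order before relying on it.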

Michael

-- 
Michael E. Jennings <[email protected]>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605
