On Wednesday, 05 July 2017, at 11:03:12 (-0600), Belgin, Mehmet wrote:

> Hi everyone, I have another newbie question.
>
> What's the best way to prevent Slurm from allocating jobs on nodes with
> untracked CPU load (e.g. runaway system processes, zombie processes, etc).
>
> We do core-based allocation, which complicates things a bit. But even
> checking the CPU load for nodes that are supposed to be completely idle
> (torque-style) would be a good start.
>
> Any suggestions would be appreciated.
Knowing that you use NHC, I'd say the simplest solution for you might
just be to add check_ps_loadavg() to your NHC configuration. See what
your typical 1-, 5-, and/or 15-minute load averages tend to be on idle
compute nodes (probably < 1 or at most < 2) and then have NHC offline
nodes that exceed the threshold.

Michael

-- 
Michael E. Jennings <[email protected]>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605
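For anyone who wants to see the idea behind such a threshold check before touching the NHC configuration, here is a minimal sketch in Python. It is not NHC code and does not offline anything. The function name exceeded_limits and the threshold values are assumptions made up for illustration, and the limits should come from what you actually observe on your own idle nodes.

```python
import os

def exceeded_limits(loadavg, limits):
    """Return (window, load, limit) for every window whose load is over its limit.

    loadavg and limits are 3-tuples for the 1-, 5-, and 15-minute windows.
    A limit of None means "do not check this window".
    """
    windows = ("1min", "5min", "15min")
    return [
        (window, load, limit)
        for window, load, limit in zip(windows, loadavg, limits)
        if limit is not None and load > limit
    ]

if __name__ == "__main__":
    # Hypothetical thresholds: flag a supposedly idle node when the 1- or
    # 5-minute load average is above 2; the 15-minute window is ignored.
    limits = (2.0, 2.0, None)
    over = exceeded_limits(os.getloadavg(), limits)  # os.getloadavg() is Unix-only
    for window, load, limit in over:
        print(f"{window} load average {load:.2f} exceeds limit {limit}")
```

Running this by hand on a few idle compute nodes is one way to pick the numbers; the actual offlining is still best left to NHC, as suggested above.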
