We are using Slurm 2.6.4 as the mechanism to load-balance software builds from
login/access servers to a pool of build servers provisioned for this task.
We've been using it this way for several months for our automated builds with
no problems, and we started using it for user builds last week. Unfortunately,
this has not gone quite as smoothly.
Our "make" wrapper script simply does an "srun -p build make ...". Most of the
time this works, but in some cases, even though the slurmstepd that spawned the
make on the build node has exited, Slurm still thinks the job is running.
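For reference, a minimal sketch of what the wrapper does (the `srun -p build
make` invocation is the real call; the function name and the local fallback
are mine, added so the snippet stands alone):

```shell
#!/bin/sh
# Sketch of our "make" wrapper; build_make and the fallback are illustrative.
build_make() {
    if command -v srun >/dev/null 2>&1; then
        # Run the build on a node in the "build" partition, as described above.
        srun -p build make "$@"
    else
        # Hypothetical fallback for machines outside the cluster.
        make "$@"
    fi
}
```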
Eventually this consumes all of the available nodes in the build partition,
and jobs start queuing until the completed jobs are manually canceled.
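For what it's worth, the manual cleanup currently amounts to something like
the following (squeue and scancel are standard Slurm commands; the format
string and the helper names are just for illustration):

```shell
#!/bin/sh
# Illustrative helpers for the manual cleanup step; not part of the wrapper.
list_stuck() {
    # %A = job id, %N = node list, %M = elapsed time
    squeue -p build -t RUNNING --noheader -o "%A %N %M"
}
cancel_job() {
    # Cancel a job that Slurm still thinks is running, by job id.
    scancel "$1"
}
```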
Has anyone encountered this before? Any suggestions on where I should
investigate further would be greatly appreciated.
--jtc