I have merged your patch to the version 2.6 branch of Slurm and merged that to the newer versions. We will probably not have any more releases of version 2.6. Your commit is here:
https://github.com/SchedMD/slurm/commit/d508ea95822050c5fa255ffd01711e7272293667
Thank you for your contribution.
Quoting Hongjia Cao <[email protected]>:
I found this on a cluster running Slurm 2.6.9 with select/linear, and I think the problem also exists in newer versions. When a partition has completing nodes, the backfill loop may end early: _try_sched() thinks the job can run immediately, but select_nodes() cannot allocate nodes for it and returns ESLURM_NODES_BUSY. The remaining jobs in the queue are then not backfilled until that job either starts or fails to backfill.
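
To make the failure mode concrete, here is a minimal toy model in Python. It is not Slurm source code: the functions only borrow the names from the report. The node states, the job dictionaries, and the `stop_on_busy` flag are all made up for illustration. The sketch assumes the mismatch comes from the feasibility check counting completing nodes as usable while the allocator accepts only idle nodes. Continuing past the busy job is shown as one possible behavior in this model, not as a description of what the merged patch does.

```python
# Toy model only: names mirror the report, but none of this is Slurm source code.
ESLURM_NODES_BUSY = "ESLURM_NODES_BUSY"
SLURM_SUCCESS = "SLURM_SUCCESS"


def try_sched(job, nodes):
    """Stand-in for _try_sched(): ASSUMED to count completing nodes as usable."""
    usable = [n for n, state in nodes.items() if state in ("idle", "completing")]
    return len(usable) >= job["nodes"]


def select_nodes(job, nodes):
    """Stand-in for select_nodes(): ASSUMED to allocate only truly idle nodes."""
    idle = [n for n, state in nodes.items() if state == "idle"]
    if len(idle) < job["nodes"]:
        return ESLURM_NODES_BUSY
    for n in idle[: job["nodes"]]:
        nodes[n] = "allocated"
    return SLURM_SUCCESS


def backfill(queue, nodes, stop_on_busy):
    """Walk the queue once; stop_on_busy is a hypothetical switch for this sketch."""
    started = []
    for job in queue:
        if not try_sched(job, nodes):
            continue  # job cannot run now; look at the next one
        rc = select_nodes(job, nodes)
        if rc == SLURM_SUCCESS:
            started.append(job["id"])
        elif rc == ESLURM_NODES_BUSY and stop_on_busy:
            break  # the loop ends early and later jobs are never considered
    return started


nodes = {"n1": "completing", "n2": "idle", "n3": "idle"}
queue = [{"id": "A", "nodes": 3}, {"id": "B", "nodes": 1}]

print(backfill(queue, dict(nodes), stop_on_busy=True))   # []
print(backfill(queue, dict(nodes), stop_on_busy=False))  # ['B']
```

In this model job A passes the feasibility check because the completing node is counted, but it cannot be allocated. If the loop stops there, job B is never backfilled even though an idle node is available for it.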
