Hi,

I'm running a GPU cluster, and I would like to know whether there is a
way to allocate resources to jobs without causing GPU fragmentation.

Currently, I'm using

> SelectType=select/cons_res
>
> SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE

and oversubscription of CPU cores is enabled.

Let's say there are two nodes, A and B, each with 4 GPUs and 40 CPU cores.
The problem is that if jobs 1 and 2 each request 1 GPU and 30 CPU cores,
both nodes get used (one job on each). Each node is then left with only
3 free GPUs, so a later job that requires 4 GPUs cannot run on either of
the two nodes.

If I'm not mistaken, a simple workaround would be to stop managing CPU
cores through Slurm (e.g. CR_Memory), but that comes with downsides.

Could someone suggest any select plugins/parameters that can prevent such
GPU fragmentation, please?

Best,
Jaekyeom
