Hi,

I'm running a GPU cluster and would like to know whether there is a way to allocate resources for jobs without causing GPU fragmentation.
Currently, I'm using

> SelectType=select/cons_res
> SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE

and over-subscription of CPU cores is enabled.

Let's say there are two nodes, A and B, each with 4 GPUs and 40 CPU cores. The problem is that if jobs 1 and 2 each request 1 GPU and 30 CPU cores, one job is placed on node A and the other on node B, which prevents a future job requiring 4 GPUs from running on either node.

If I'm not mistaken, a simple workaround would be to stop managing CPU cores via Slurm (e.g. CR_Memory), but that comes with downsides.

Could someone please suggest any select plugins or parameters that can prevent such GPU fragmentation?

Best,
Jaekyeom
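P.S. To make the scenario concrete, here is a minimal Python sketch of the GPU arithmetic. It is only an illustration and not how Slurm actually schedules: it counts GPUs only and ignores CPU cores (assuming the over-subscription mentioned above would let both jobs share a node), and the helper names free_gpus and can_run are made up for this example.

```python
# Illustrative only: GPU bookkeeping for two nodes with 4 GPUs each.
# CPU cores are deliberately ignored (assumption, see the note above).
GPUS_PER_NODE = 4


def free_gpus(placement):
    """Return free GPUs per node, given {node: [GPUs requested by each job]}."""
    return {node: GPUS_PER_NODE - sum(jobs) for node, jobs in placement.items()}


def can_run(free, gpus_needed):
    """True if at least one node still has enough free GPUs."""
    return any(n >= gpus_needed for n in free.values())


# What I observe: jobs 1 and 2 (1 GPU each) land on different nodes.
spread = free_gpus({"A": [1], "B": [1]})

# What I would like: both jobs packed onto node A, leaving B untouched.
packed = free_gpus({"A": [1, 1], "B": []})

print(spread, can_run(spread, 4))
print(packed, can_run(packed, 4))
```

With the spread placement each node has 3 free GPUs, so a 4-GPU job fits nowhere; with the packed placement node B still has all 4 GPUs free.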