Since upgrading Slurm to 25.05.0 (22.05.9 -> 23.11.11 -> 25.05.0), some jobs requesting --gres=gpu:N are allocated fewer than N GPUs if some of the node's GPUs are already in use by other jobs.

We have a node - let's call it ares-c02-06 - with 2 GPUs. Consider the following test script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1-00:00:00

echo "CUDA_VISIBLE_DEVICES: " $CUDA_VISIBLE_DEVICES
echo "SLURM_JOB_GPUS: " $SLURM_JOB_GPUS
echo "SLURM_GPUS_ON_NODE: " $SLURM_GPUS_ON_NODE
sleep 10d



Submit a job to the node:

sbatch *--gres=gpu:1* --nodelist=ares-c02-06 job.sh

Submitted batch job 1950559

The job starts. Now submit the script again, asking for 2 GPUs:

sbatch *--gres=gpu:2* --nodelist=ares-c02-06 job.sh

Submitted batch job 1950567

This second job should not start, as the resources are not available.

Surprisingly, _both jobs are running_:

$ squeue -w ares-c02-06
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            1950567     gpuai   job.sh jan.gmys  R       1:55      1 ares-c02-06
            1950559     gpuai   job.sh jan.gmys  R       2:58      1 ares-c02-06

For the second job - 1950567 - AllocTRES shows gres/gpu=1 instead of the requested gres/gpu=2.

# sacct -j 1950567,1950559 -X -o jobid%10,reqtres%45,alloctres%60
     JobID                                       ReqTRES                                                    AllocTRES
---------- --------------------------------------------- ------------------------------------------------------------
   1950559      billing=1,cpu=1,gres/gpu=1,mem=4G,node=1     billing=1,cpu=1,gres/gpu:l40s=1,gres/gpu=1,mem=4G,node=1
   1950567    billing=1,cpu=1,*gres/gpu=2*,mem=4G,node=1 billing=1,cpu=1,gres/gpu:l40s=1,*gres/gpu=1*,mem=4G,node=1

The output of both jobs:

$ cat slurm-1950559.out
CUDA_VISIBLE_DEVICES:  0
SLURM_JOB_GPUS:  0
SLURM_GPUS_ON_NODE:  1

$ cat slurm-1950567.out
CUDA_VISIBLE_DEVICES:  0
SLURM_JOB_GPUS:  1
SLURM_GPUS_ON_NODE:  1

CUDA_VISIBLE_DEVICES is set to 0 for both jobs; SLURM_JOB_GPUS is 0 and 1, respectively.


*Environment:*

- RHEL 9.1

- slurm 25.05.0

- The GRES configuration seems fine; AutoDetect is off:

# /usr/sbin/slurmd -G --conf-server hpc-slurm.cluster.hpc -v
[2025-07-22T16:44:05.548] GRES: Global *AutoDetect=off*(4)
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia0 major 195, minor 0
[2025-07-22T16:44:05.548] GRES: _set_gres_device_desc : /dev/nvidia1 major 195, minor 1
[2025-07-22T16:44:05.548] GRES: gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2025-07-22T16:44:05.548] GRES: gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2025-07-22T16:44:05.548] Gres Name=gpu Type=L40S Count=2 Index=0 ID=7696487 File=/dev/nvidia[0-1] Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
and 'scontrol show node':
NodeName=ares-c02-06 Arch=x86_64 CoresPerSocket=24
    CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.00
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=gpu:L40S:2
    NodeAddr=ares-c02-06 NodeHostName=ares-c02-06 Version=25.05.0
    OS=Linux 5.14.0-162.6.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022
    RealMemory=386000 AllocMem=0 FreeMem=363069 Sockets=2 Boards=1
    State=IDLE+RESERVED+PLANNED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=gpuai
    BootTime=2025-07-04T10:26:27 SlurmdStartTime=2025-07-22T14:48:54
    LastBusyTime=2025-07-22T16:26:36 ResumeAfterTime=None
    CfgTRES=cpu=48,mem=386000M,billing=48,gres/gpu=2
    AllocTRES=


*Debug notes:*

- When using the --gpus option instead of --gres, everything works as expected: the second job is PENDING (Resources).

- Tried both ConstrainDevices=on and ConstrainDevices=off in cgroup.conf; same result.

- The same happens on other multi-GPU nodes of the cluster.

- When the --gres=gpu:2 job is submitted first, i.e. when all GPUs are taken, the second job (--gres=gpu:1) correctly waits.

- When both GPUs are free, the --gres=gpu:2 job correctly gets both GPUs: CUDA_VISIBLE_DEVICES: 0,1.

- This worked in Slurm 22.05.9 (we recently upgraded in two steps: -> 23.11.11 -> 25.05.0).

- The only viable workaround I see for the moment is to intercept the --gres and --gpus-per-node options in job_submit.lua (the latter doesn't even seem to be exposed in job_desc! :-/) and force users to use the --gpus option, which seems to work fine.
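
The interception could look roughly like the untested sketch below, which simply rejects GPU requests made via --gres and tells users to resubmit with --gpus. It assumes that job_desc.tres_per_node carries the --gres request as a string containing "gpu" (e.g. "gres/gpu:2"; the exact format may differ between Slurm versions, so the match is deliberately loose):

```lua
-- job_submit.lua sketch (untested): reject --gres=gpu submissions and
-- point users at --gpus, which schedules correctly on our cluster.
-- Assumption: tres_per_node holds the --gres string for GPU requests.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.tres_per_node ~= nil and
       string.find(job_desc.tres_per_node, "gpu") ~= nil then
        slurm.log_user("--gres=gpu is currently broken on this cluster; " ..
                       "please request GPUs with --gpus=N instead")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

Rejecting with a message seems safer than silently rewriting the request into tres_per_job, since it is unclear whether clearing tres_per_node from the plugin is supported.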


Has anyone experienced similar issues?

Any ideas on how to solve this would be highly appreciated.


Jan

--
Jan Gmys
Ingénieur de recherche
Support HPC/IA pour la plateforme MesoNET
Mésocentre de Calcul Scientifique Intensif de l'Université de Lille
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
