For the past 3 days, on our NVIDIA DGX A100 systems running Ubuntu
22.04.5 and Slurm 24.11.5, we have had jobs that request a GPU and get
started by Slurm, but are not allocated a GPU and then fail.
In the slurmctld log we see a line like:
[2025-07-22T02:46:29.697] error: gres/gpu: job 6919154 node A100-04 no resources selected
In the slurmd log I see no errors for the job, but there is a line like
[2025-07-22T02:46:29.757] [6919154.extern] task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 195:0 rwm(/dev/nvidia0)
for each of the 8 GPUs on the node.
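To find other affected jobs in the slurmd log, something like the following might help. It is only a sketch: the regular expression is modelled on the single devices.deny line quoted above, and the helper names and the default of 8 GPUs per node are my own assumptions, not anything provided by Slurm.

```python
import re
from collections import defaultdict

# Pattern modelled on the one slurmd log line quoted above (assumed format).
DENY_RE = re.compile(
    r"\[(?P<job>\d+)\.(?P<step>\w+)\]\s+task/cgroup:\s+_handle_device_access:\s+"
    r"GRES: job devices\.deny: adding c \d+:\d+\s+rwm\((?P<dev>/dev/nvidia\d+)\)"
)


def denied_gpus(log_lines):
    """Map job id -> set of /dev/nvidiaN devices that were denied to the job."""
    denied = defaultdict(set)
    for line in log_lines:
        match = DENY_RE.search(line)
        if match:
            denied[match.group("job")].add(match.group("dev"))
    return dict(denied)


def jobs_denied_everything(log_lines, gpus_per_node=8):
    """Return job ids that were denied at least gpus_per_node distinct GPUs."""
    return sorted(
        job
        for job, devices in denied_gpus(log_lines).items()
        if len(devices) >= gpus_per_node
    )
```

A job that really got one GPU should show a deny line for only 7 of the 8 devices, so only jobs that were denied all 8 are reported.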
Other jobs still seem to start up and get a GPU fine.
The job stats show:
ReqTRES : billing=7,cpu=1,gres/gpu=1,mem=96G,node=1
AllocTRES : billing=3,cpu=1,mem=96G,node=1
showing that even though the GPU was requested, it was not allocated.
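The same comparison can be scripted to spot affected jobs. This is a minimal sketch that assumes the two TRES strings have already been obtained (for example from the ReqTRES and AllocTRES fields of sacct); the function names are hypothetical.

```python
def parse_tres(tres):
    """Turn 'billing=7,cpu=1,gres/gpu=1' into {'billing': '7', 'cpu': '1', ...}."""
    parsed = {}
    for item in tres.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def missing_gres(req_tres, alloc_tres):
    """Return the gres/* entries that were requested but are absent from the allocation."""
    requested = parse_tres(req_tres)
    allocated = parse_tres(alloc_tres)
    return sorted(
        key for key in requested
        if key.startswith("gres/") and key not in allocated
    )


# The values from the failed job above:
req = "billing=7,cpu=1,gres/gpu=1,mem=96G,node=1"
alloc = "billing=3,cpu=1,mem=96G,node=1"
print(missing_gres(req, alloc))
```

For the job shown above this reports gres/gpu as requested but not allocated.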
Occasionally on these boxes (and only these -- my Dell Rocky 8 boxes with
GPUs have no problem) we see the nodes go into drain state with the
"gres/gpu GRES core specification ... doesn't match socket boundaries."
message, as per https://support.schedmd.com/show_bug.cgi?id=22498
This seems to happen after a slurmctld restart.
Whenever it happens, I restart slurmd on the affected nodes and can then
resume them.
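To illustrate what I take the "doesn't match socket boundaries" message to mean, here is a small sketch. It rests on an assumption that is not taken from the Slurm source: that a GPU's core list is acceptable only if it covers each socket either completely or not at all. The function name is hypothetical, the defaults are the 2 sockets of 64 cores from the node definition in this message, and the core ranges in the example are made up rather than read from these nodes.

```python
def partial_sockets(core_ids, sockets=2, cores_per_socket=64):
    """Return the sockets where some, but not all, cores appear in core_ids."""
    cores = set(core_ids)
    partial = []
    for socket in range(sockets):
        first = socket * cores_per_socket
        used = sum(
            1 for core in range(first, first + cores_per_socket) if core in cores
        )
        # Assumed rule: a socket must be fully covered or not touched at all.
        if 0 < used < cores_per_socket:
            partial.append(socket)
    return partial


# Hypothetical affinity covering only part of socket 0:
print(partial_sockets(range(48, 64)))
# Hypothetical affinity covering all of socket 0:
print(partial_sockets(range(0, 64)))
```

Under that assumption, an affinity that NVML reports per NUMA domain rather than per socket would be flagged, while one spanning a whole socket would not.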
Otherwise nothing has changed on these nodes in the Slurm config or
the OS in over a month.
The node definitions are:
NodeName=A100-[01-04] \
CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 \
ThreadsPerCore=1 RealMemory=1031000 MemSpecLimit=2048 \
TmpDisk=1400000 Feature=amd,epyc,a100 \
Gres=gpu:a100-sxm4-40gb:8
and gres.conf on the nodes is simply AutoDetect=nvml
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
--
slurm-users mailing list -- slurm-users@lists.schedmd.com