Hey Slurm-users list,
while our regular GPU nodes are working fine, our on-demand GPU nodes
have a weird issue. They power up, and I can ssh onto them and run
nvidia-smi without problems, but they are marked invalid and
slurmctld logs:
_node_config_validate: gres/gpu: Count changed on node (0 != 2)
However, scontrol show node shows that the GPUs are recognized, the
gres.conf files are stored on the worker nodes as expected, and the node
entries in slurm.conf look fine, too:
# slurm.conf
NodeName=my_worker_node SocketsPerBoard=16 CoresPerSocket=1 RealMemory=64075 MemSpecLimit=4000 State=CLOUD Gres=gpu:L4:2 # openstack
# gres.conf on my_worker_node
ubuntu@my_node:~$ cat /etc/slurm/gres.conf
# GRES CONFIG
Name=gpu Type=L4 File=/dev/nvidia0
Name=gpu Type=L4 File=/dev/nvidia1
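
As a quick sanity check that the two files agree with each other, the
number of device lines in gres.conf can be compared with the count in the
Gres= field of slurm.conf. The following is only a sketch in plain shell:
the function names count_gres_devices and configured_gres_count are made
up for illustration (they are not Slurm tools), and the parsing assumes
one File= entry per line and a single Gres= entry per node, as in the
snippets above (no bracket ranges such as /dev/nvidia[0-1]).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Slurm.

# Read gres.conf text on stdin and count the non-comment lines that
# define a device file for the given GRES name (e.g. "gpu").
# Assumption: one File= entry per line, no bracket ranges.
count_gres_devices() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }
        {
            has_name = 0; has_file = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "Name=" name) has_name = 1
                if ($i ~ /^File=/)      has_file = 1
            }
            if (has_name && has_file) n++
        }
        END { print n + 0 }
    '
}

# Read slurm.conf text on stdin and print the count at the end of the
# Gres= field (e.g. 2 for Gres=gpu:L4:2) for the given node and GRES name.
# Assumption: the NodeName entry is on one line with a single Gres= entry.
configured_gres_count() {
    awk -v node="$1" -v name="$2" '
        /^[[:space:]]*#/ { next }
        $1 == "NodeName=" node {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break            # trailing comment
                if ($i ~ ("^Gres=" name ":")) {
                    k = split($i, parts, ":")
                    print parts[k] + 0
                    found = 1
                }
            }
        }
        END { if (!found) print 0 }
    '
}

# Sample data copied from the configuration shown above.
gres_conf='# GRES CONFIG
Name=gpu Type=L4 File=/dev/nvidia0
Name=gpu Type=L4 File=/dev/nvidia1'

slurm_conf='NodeName=my_worker_node SocketsPerBoard=16 CoresPerSocket=1 RealMemory=64075 MemSpecLimit=4000 State=CLOUD Gres=gpu:L4:2 # openstack'

devices=$(printf '%s\n' "$gres_conf" | count_gres_devices gpu)
expected=$(printf '%s\n' "$slurm_conf" | configured_gres_count my_worker_node gpu)
echo "gres.conf devices: $devices, slurm.conf count: $expected"
```

On the node itself, the same functions could be fed from
/etc/slurm/gres.conf and the slurm.conf in use. Since both counts are 2
here, the files agree with each other, which suggests the 0 in the log
message comes from what slurmd reports when it registers rather than from
the files. As far as I know, recent Slurm versions also offer slurmd -G to
print the GRES configuration slurmd derives on the node.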
I would be thankful for any ideas or debugging hints.
Best,
Xaver
PS:
By executing:
sudo scontrol update NodeName=$(bibiname 0) Gres=
sudo scontrol reconfigure
sudo scontrol update NodeName=$(bibiname 0) state=RESUME reason=None
the node can be resumed. However, this is only a workaround, not a real solution.
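
For repeated testing, the three commands can be wrapped in a small
function. This is only a sketch: resume_gres_node and the DRY_RUN switch
are hypothetical names, the node name is passed as an argument instead of
coming from bibiname, and by default the commands are only printed, not
executed.

```shell
#!/bin/sh
# Hypothetical wrapper around the three scontrol commands from the PS;
# not part of Slurm. With DRY_RUN=1 (the default) the commands are only
# printed. Set DRY_RUN=0 to actually run them.
resume_gres_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: resume_gres_node <nodename>" >&2
        return 1
    fi

    run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi

    # Same order as in the PS: clear the GRES, reconfigure, resume the node.
    $run sudo scontrol update NodeName="$node" Gres= &&
    $run sudo scontrol reconfigure &&
    $run sudo scontrol update NodeName="$node" state=RESUME reason=None
}

# Dry run: prints the three commands for the given node.
resume_gres_node my_worker_node
```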
--
slurm-users mailing list -- [email protected]