Dear all,
I configured a GPU Cluster with GRES the following way:
_*slurm.conf*_:
/[...]
GresTypes=gpu,gpu_mem
[...]
NodeName=hpcg01 NodeAddr=x.x.x.x CPUs=12 RealMemory=128741 Sockets=2
CoresPerSocket=6 ThreadsPerCore=1 State=IDLE
Gres=gpu_mem:6143,gpu:titanb:2
[...]/
_*gres.conf (on controller)*_:
/NodeName=hpcg1 Type=titanb Name=gpu File=/dev/nvidia[0,1]
[...]
NodeName=hpcg[1,3-6] Name=gpu_mem Count=6143
[...]/
_*gres.conf (on node hpcg01)*_:
/Name=gpu Type=titanb File=/dev/nvidia0
Name=gpu Type=titanb File=/dev/nvidia1
Name=gpu_mem Count=6143/
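As a sanity check (assuming the Slurm client tools can reach the controller), the GRES the controller actually registered for the node can be queried like this:

```shell
# Show what slurmctld believes the node's GRES to be
# (node name as declared in slurm.conf)
scontrol show node hpcg01 | grep -i gres
```

If the output does not include both gpu and gpu_mem, the configuration was not picked up.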
When I submit batch scripts with the flag --gres=gpu:1,gpu_mem:2000 and
the program allocates more than 2000 MB of GPU memory, the program still
runs. Shouldn't it be terminated when it exceeds the limit, as with
--mem? If it should be, then I must have misconfigured something. Does
anybody spot an error I made? I've been looking for a solution for days
and I'm running out of ideas.
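For reference, a minimal sketch of the kind of batch script I'm submitting (my_cuda_program is a placeholder for the program that allocates more than 2000 MB of GPU memory):

```shell
#!/bin/bash
#SBATCH --job-name=gpu_mem_test
#SBATCH --gres=gpu:1,gpu_mem:2000   # 1 GPU plus 2000 units of the gpu_mem GRES
#SBATCH --mem=4G                    # host RAM limit; exceeding this one does kill the job
#SBATCH --output=gpu_mem_test.out

# placeholder: allocates >2000 MB of GPU memory, yet is not terminated
srun ./my_cuda_program
```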
Looking forward to your answers. Thanks in advance!
Best wishes,
Felix Willenborg