Dear all,

I configured a GPU Cluster with GRES the following way:

slurm.conf:

    [...]
    GresTypes=gpu,gpu_mem
    [...]
    NodeName=hpcg01 NodeAddr=x.x.x.x CPUs=12 RealMemory=128741 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=IDLE Gres=gpu_mem:6143,gpu:titanb:2
    [...]

gres.conf (on controller):

    NodeName=hpcg1 Type=titanb Name=gpu File=/dev/nvidia[0,1]
    [...]
    NodeName=hpcg[1,3-6] Name=gpu_mem Count=6143
    [...]

gres.conf (on node hpcg01):

    Name=gpu Type=titanb File=/dev/nvidia0
    Name=gpu Type=titanb File=/dev/nvidia1
    Name=gpu_mem Count=6143

When I submit batch scripts with the flag --gres=gpu:1,gpu_mem:2000 and the
program allocates more than 2000 MB, the program still runs. Shouldn't it
be terminated when it exceeds the limit, as happens with --mem? If it should
be terminated, then I've configured something wrong. Does anybody spot an
error I made? I've been looking for a solution for days and I'm running out
of ideas.
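
For reference, my batch scripts look roughly like the following (the
program name and its memory usage are just placeholders for my actual
workload):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1,gpu_mem:2000   # request 1 GPU and 2000 (MB) of the gpu_mem GRES
#SBATCH --mem=4000                  # host memory limit; exceeding *this* does kill the job

# hypothetical program that allocates more than 2000 MB on the GPU,
# yet is not terminated by Slurm
./my_gpu_program
```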

Looking forward to your answers. Thanks in advance!

Best wishes,
Felix Willenborg
