[slurm-dev] Scheduling of GPU resources

Antonia Mey Mon, 22 Jun 2015 06:14:11 -0700


Hi all,

I have used Slurm as a user but not as an administrator, and I am struggling to set up my GPU generic resources properly.

I have the following situation:
headnode running the slurmctld daemon
node01-04 CPU nodes
node05-06 GPU nodes

The queue is working fine for node01-04. Node05 and node06 each have 4 Tesla M2090 GPUs and 32 CPU cores. What I would like is a GPU queue for the 8 GPUs, where each GPU is assigned 1 CPU from its node, with the remaining 28 CPUs on each GPU node added to the CPU queue of node01-04.
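
A rough, untested sketch of such a layout in slurm.conf follows. The partition names "cpu" and "gpu" and the MaxCPUsPerNode values are assumptions for illustration only, and whether MaxCPUsPerNode is available depends on the installed Slurm version and select plugin:

```
# Sketch only: partition names and CPU limits are assumptions, not a tested configuration.
# CPU partition spans all six nodes; jobs from it may use at most 28 CPUs on any node
# (node01-04 have only 16 CPUs, so the cap only matters on node05-06).
PartitionName=cpu Nodes=node[01-06] Default=YES MaxTime=60 State=UP MaxCPUsPerNode=28
# GPU partition covers only the GPU nodes; 4 CPUs per node are left for it, one per GPU.
PartitionName=gpu Nodes=node[05-06] Default=NO State=UP MaxCPUsPerNode=4
```

The idea is that both partitions overlap on node05-06, and the per-partition CPU caps keep 4 CPUs on each GPU node free for GPU jobs.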

I am struggling to get even a minimal setup going for GPU scheduling, i.e. one that does not specify which CPUs should be used by the GPUs. I tried to follow the instructions on configuring the gres.conf file from here:

http://slurm.schedmd.com/gres.html

In theory the setup there should be very similar to my setup.
The relevant part of my slurm.conf file looks like this:

# COMPUTE NODES
NodeName=node[01-04] CPUs=16

NodeName=node[05-06] CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=32147 Gres=gpu:tesla:4

GresTypes=gpu
PartitionName=serial Nodes=node0[1-4] Default=YES MaxTime=60 State=UP

PartitionName=GPU Nodes=node0[5-7] Default=NO MinNodes=1 MaxNodes=UNLIMITED MaxTime=14-0 AllowGroups=ALL Priority=1 DisableRootJobs=YES RootOnly=NO Hidden=NO Shared=YES


My gres.conf file looks like the one from the example, minus the bandwidth:
# Configure support for our four GPUs
Name=gpu Type=tesla  File=/dev/nvidia0 CPUs=0,1
Name=gpu Type=tesla  File=/dev/nvidia1 CPUs=0,1
Name=gpu Type=tesla  File=/dev/nvidia2 CPUs=2,3
Name=gpu Type=tesla  File=/dev/nvidia3 CPUs=2,3

(The devices do exist.)

The additional info for NodeName in the slurm.conf file was obtained using slurmd -C on the compute node.

Now, with the controller daemon running, when I try to start the slurmd daemon on, say, node06, I get the following error:


m@node06:~$ sudo /etc/init.d/slurm-llnl start

* Starting slurm compute node daemon slurmd slurmd: error: NodeNames=node[05-06] CPUs=# or Procs=# with Boards=# is invalid and is ignored.


The log file gives me the following information:
[2015-06-22T13:52:52.287] debug:  init: Gres GPU plugin loaded
[2015-06-22T13:52:52.287] error: Parsing error at unrecognized key: Type

[2015-06-22T13:52:52.287] error: Parse error in file /etc/slurm-llnl/gres.conf line 2: " Type=tesla File=/dev/nvidia0 CPUs=0,1"

[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type

[2015-06-22T13:52:52.288] error: Parse error in file /etc/slurm-llnl/gres.conf line 3: " Type=tesla File=/dev/nvidia1 CPUs=0,1"

[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type

[2015-06-22T13:52:52.288] error: Parse error in file /etc/slurm-llnl/gres.conf line 4: " Type=tesla File=/dev/nvidia2 CPUs=2,3"

[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type

[2015-06-22T13:52:52.288] error: Parse error in file /etc/slurm-llnl/gres.conf line 5: " Type=tesla File=/dev/nvidia3 CPUs=2,3"

[2015-06-22T13:52:52.288] fatal: error opening/reading /etc/slurm-llnl/gres.conf

Any help with achieving my desired setup would be greatly appreciated, as I am unsure how to carry on troubleshooting.


Best,
Antonia

--
Dr. Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building
