Hi all,
I have used Slurm as a user but not as an administrator, and I am
struggling to set up my GPU generic resources (GRES) properly.
I have the following situation:
headnode running the slurmctld daemon
node01-04 CPU nodes
node05-06 GPU nodes
The queue is working fine for node01-04. node05 and node06 each have 4
Tesla M2090 GPUs and 32 CPUs. What I would like is a GPU queue for the
8 GPUs, where each GPU is assigned 1 CPU from its node, with the
remaining 28 CPUs per node added to the CPU queue of node01-04.
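To make the goal concrete, this is roughly the partition layout I have in mind. It is an untested sketch, not my current configuration: it assumes the MaxCPUsPerNode partition option is available in my Slurm version and behaves as the slurm.conf documentation describes, and the node ranges are my guess at what the final setup should be.

```
# Hypothetical target layout (untested sketch)
# GPU queue: only the two GPU nodes
PartitionName=GPU Nodes=node[05-06] Default=NO MaxTime=14-0 State=UP
# CPU queue: all six nodes, capped at 28 CPUs per node so that 4 CPUs
# on node05/06 stay free for GPU jobs (one per GPU)
PartitionName=serial Nodes=node[01-06] Default=YES MaxTime=60 State=UP MaxCPUsPerNode=28
```

Since node01-04 only have 16 CPUs each, the cap of 28 should have no effect there.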
I am struggling to get even a minimal setup going for GPU scheduling,
i.e. one that does not specify which CPUs should be used with the GPUs.
I tried to follow the instructions on configuring the gres.conf file
from here:
http://slurm.schedmd.com/gres.html
In theory the setup there should be very similar to my setup.
The relevant part of my slurm.conf file looks like this:
# COMPUTE NODES
NodeName=node[01-04] CPUs=16
NodeName=node[05-06] CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8
ThreadsPerCore=2 RealMemory=32147 Gres=gpu:tesla:4
GresTypes=gpu
PartitionName=serial Nodes=node0[1-4] Default=YES MaxTime=60 State=UP
PartitionName=GPU Nodes=node0[5-7] Default=NO MinNodes=1
MaxNodes=UNLIMITED MaxTime=14-0 AllowGroups=ALL Priority=1
DisableRootJobs=YES RootOnly=NO Hidden=NO Shared=YES
My gres.conf file looks like the one from the example, minus the bandwidth:
# Configure support for our four GPUs
Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0,1
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0,1
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=2,3
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=2,3
(The devices do exist.)
The additional info for NodeName in the slurm.conf file was obtained
using slurmd -C on the compute node.
Now, with the controller daemon running, when I try to start the slurmd
daemon on, say, node06, I get the following error:
m@node06:~$ sudo /etc/init.d/slurm-llnl start
* Starting slurm compute node daemon slurmd slurmd: error:
NodeNames=node[05-06] CPUs=# or Procs=# with Boards=# is invalid and is
ignored.
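Reading that message literally, CPUs= apparently cannot be combined with Boards= on the same NodeName line. One variant I could try is to drop CPUs=32 and let Slurm derive the count from the topology (2 sockets x 8 cores x 2 threads = 32), keeping everything else as posted. This is an unverified sketch based only on the error text:

```
# Sketch: same node definition without the explicit CPUs=32
NodeName=node[05-06] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=32147 Gres=gpu:tesla:4
```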
The log file gives me the following information:
[2015-06-22T13:52:52.287] debug: init: Gres GPU plugin loaded
[2015-06-22T13:52:52.287] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.287] error: Parse error in file
/etc/slurm-llnl/gres.conf line 2: " Type=tesla File=/dev/nvidia0 CPUs=0,1"
[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.288] error: Parse error in file
/etc/slurm-llnl/gres.conf line 3: " Type=tesla File=/dev/nvidia1 CPUs=0,1"
[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.288] error: Parse error in file
/etc/slurm-llnl/gres.conf line 4: " Type=tesla File=/dev/nvidia2 CPUs=2,3"
[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.288] error: Parse error in file
/etc/slurm-llnl/gres.conf line 5: " Type=tesla File=/dev/nvidia3 CPUs=2,3"
[2015-06-22T13:52:52.288] fatal: error opening/reading
/etc/slurm-llnl/gres.conf
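The "unrecognized key: Type" errors make me suspect that my installed slurmd predates the Type option described on that web page. This is an assumption on my part; checking the version with slurmd -V should confirm it. If so, a Type-free variant of gres.conf would look like the sketch below, with the same device files and CPU bindings as above and only the Type key removed:

```
# Sketch: gres.conf without the Type key (for a Slurm version that
# does not recognize it)
Name=gpu File=/dev/nvidia0 CPUs=0,1
Name=gpu File=/dev/nvidia1 CPUs=0,1
Name=gpu File=/dev/nvidia2 CPUs=2,3
Name=gpu File=/dev/nvidia3 CPUs=2,3
```

The NodeName line in slurm.conf would then presumably need Gres=gpu:4 instead of Gres=gpu:tesla:4 to match.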
Any help with how to achieve my desired setup would be greatly
appreciated, because I am unsure how to carry on with troubleshooting.
Best,
Antonia
--
Dr. Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building