Hi all,

I have used Slurm as a user but not as an administrator, and I am struggling to set up my GPU generic resources (GRES) properly.
I have the following situation:
headnode running the slurmctld daemon
node01-04 CPU nodes
node05-06 GPU nodes

The queue is working fine for node01-04. node05 and node06 each have 4 Tesla M2090 GPUs and 32 CPUs. What I would like is a GPU queue for the 8 GPUs, where each GPU is assigned 1 CPU from its node, with the remaining 28 CPUs on each GPU node added to the CPU queue alongside node01-04.
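
To make the intended split concrete, here is the arithmetic as a small Python sketch. The variable names are only illustrative, and one CPU per GPU is my target rather than anything Slurm enforces by itself.

```python
# Intended CPU split on the GPU nodes (node05 and node06).
gpu_nodes = 2        # node05 and node06
cpus_per_node = 32   # CPUs on each GPU node
gpus_per_node = 4    # Tesla M2090 cards on each GPU node
cpus_per_gpu = 1     # assumption: one CPU reserved per GPU

total_gpus = gpu_nodes * gpus_per_node
reserved_per_node = gpus_per_node * cpus_per_gpu
spare_per_node = cpus_per_node - reserved_per_node
spare_total = gpu_nodes * spare_per_node

print(f"GPUs in the GPU queue: {total_gpus}")
print(f"CPUs left for the CPU queue per GPU node: {spare_per_node}")
print(f"CPUs left for the CPU queue in total: {spare_total}")
```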

I am struggling to get even a minimal setup going for GPU scheduling, i.e. one that does not specify which CPUs should be used with the GPUs. I tried to follow the instructions on configuring the gres.conf file from here:
http://slurm.schedmd.com/gres.html

In theory the setup there should be very similar to my setup.
The relevant part of my slurm.conf file looks like this:

# COMPUTE NODES
NodeName=node[01-04] CPUs=16
NodeName=node[05-06] CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=32147 Gres=gpu:tesla:4
GresTypes=gpu
PartitionName=serial Nodes=node0[1-4] Default=YES MaxTime=60 State=UP
PartitionName=GPU Nodes=node0[5-7] Default=NO MinNodes=1 MaxNodes=UNLIMITED MaxTime=14-0 AllowGroups=ALL Priority=1 DisableRootJobs=YES RootOnly=NO Hidden=NO Shared=YES

My gres.conf file looks like the one from the example, minus the bandwidth:
# Configure support for our four GPUs
Name=gpu Type=tesla  File=/dev/nvidia0 CPUs=0,1
Name=gpu Type=tesla  File=/dev/nvidia1 CPUs=0,1
Name=gpu Type=tesla  File=/dev/nvidia2 CPUs=2,3
Name=gpu Type=tesla  File=/dev/nvidia3 CPUs=2,3

(The devices do exist.)
The additional info for NodeName in the slurm.conf file was obtained using slurmd -C on the compute node.
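
As a sanity check on that NodeName line, the topology values should multiply out to the CPUs value given on the same line. A minimal sketch, with variable names of my own choosing:

```python
# Topology values from the NodeName=node[05-06] line in slurm.conf.
boards = 1
sockets_per_board = 2
cores_per_socket = 8
threads_per_core = 2
cpus_declared = 32   # the CPUs= value on the same line

# Logical CPUs implied by the topology fields.
cpus_from_topology = boards * sockets_per_board * cores_per_socket * threads_per_core
consistent = cpus_from_topology == cpus_declared

print(f"CPUs implied by topology: {cpus_from_topology}")
print(f"Matches CPUs= value: {consistent}")
```

So the two ways of describing the node agree numerically; the line simply states the CPU count both directly and through the Boards/Sockets/Cores/Threads fields.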

Now, with the controller daemon running, when I try to start the slurmd daemon on, say, node06, I get the following error:

m@node06:~$ sudo /etc/init.d/slurm-llnl start
* Starting slurm compute node daemon slurmd slurmd: error: NodeNames=node[05-06] CPUs=# or Procs=# with Boards=# is invalid and is ignored.

The log file gives me the following information:
[2015-06-22T13:52:52.287] debug:  init: Gres GPU plugin loaded
[2015-06-22T13:52:52.287] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.287] error: Parse error in file /etc/slurm-llnl/gres.conf line 2: " Type=tesla File=/dev/nvidia0 CPUs=0,1"
[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.288] error: Parse error in file /etc/slurm-llnl/gres.conf line 3: " Type=tesla File=/dev/nvidia1 CPUs=0,1"
[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.288] error: Parse error in file /etc/slurm-llnl/gres.conf line 4: " Type=tesla File=/dev/nvidia2 CPUs=2,3"
[2015-06-22T13:52:52.288] error: Parsing error at unrecognized key: Type
[2015-06-22T13:52:52.288] error: Parse error in file /etc/slurm-llnl/gres.conf line 5: " Type=tesla File=/dev/nvidia3 CPUs=2,3"
[2015-06-22T13:52:52.288] fatal: error opening/reading /etc/slurm-llnl/gres.conf

Any help with how to achieve my desired setup would be greatly appreciated, because I am unsure how to carry on with troubleshooting.

Best,
Antonia

--
Dr. Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building