Marco's patch has been committed to the Slurm version 16.05 code base in
this commit:
https://github.com/SchedMD/slurm/commit/70aafa68b19a1d6819f1823ebdc0c1c103f2c9b6
Thank you for your contribution.
On 2016-05-04 05:08, Marco Ehlert wrote:
Hi Vova,
some weeks ago I proposed a bug fix for this problem.
https://groups.google.com/forum/#!searchin/slurm-devel/marco$20ehlert/slurm-devel/CRsW-eiUfms/MI2aAL4UGwAJ
But this solves only half of the problem if you are using distinct
partitioning of cores like this:
NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:4
PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4
PartitionName=cpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=16
and bind specific CPU cores to GPU cards in gres.conf.
I am not sure whether this is the configuration you are using, but it is
mine. Slurm does not correctly manage the core maps for two distinct
partitions on the same node. At some point Slurm gets confused; I saw
definitely wrong core_map/part_core_map variables.
That's why I decided not to use "CPUs" arguments in gres.conf but to set
the core map in a prolog script, depending on the partition chosen by
the job.
Job scripts read this core map and then set the taskset themselves.
Maybe it is not the best workaround, but it works to some
satisfaction at least.
The result is a difference between Slurm's picture of used/unused cores
and the actual core map in use. But it works anyway,
because the cores are now counted correctly.
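
The job-side half of this workaround can be sketched in C. This is only an illustration, not the scripts actually in use: it assumes the prolog hands the job a comma-separated core list such as "0,1,10,11" (the list format and both function names are hypothetical), and it uses the Linux sched_setaffinity() call in place of the taskset command.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>

/*
 * Parse a comma-separated core list such as "0,1,10,11" into a cpu_set_t.
 * Returns the number of distinct cores in the set, or -1 on malformed input.
 * Assumption: plain core IDs only; ranges like "0-3" are not handled.
 */
int parse_core_list(const char *list, cpu_set_t *set)
{
    const char *p = list;
    int count = 0;

    assert(list && set);
    CPU_ZERO(set);
    while (*p) {
        char *end;
        long core = strtol(p, &end, 10);

        if (end == p || core < 0 || core >= CPU_SETSIZE)
            return -1;              /* not a number, or out of range */
        if (!CPU_ISSET((int)core, set)) {
            CPU_SET((int)core, set);
            count++;
        }
        if (*end == ',')
            end++;                  /* skip the separator */
        else if (*end != '\0')
            return -1;              /* unexpected character */
        p = end;
    }
    return count;
}

/* Pin the calling process to the cores in `list`; 0 on success, -1 on error. */
int pin_to_core_list(const char *list)
{
    cpu_set_t set;

    if (parse_core_list(list, &set) <= 0)
        return -1;                  /* empty or malformed list */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A job wrapper would call pin_to_core_list() with the list chosen for its partition before starting the real work. The call returns -1 if the list is empty or malformed, or if sched_setaffinity() rejects the set.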
Best,
Marco
On Wed, 4 May 2016, Vladimir Goy wrote:
Dear Developers, when I use the MaxCPUsPerNode option in a partition
definition with 2 GPUs per node (defined in gres) and
SelectType=select/cons_res, Slurm allocates only one GPU per node
to jobs. The other GPU is free, but Slurm cannot allocate it
to jobs.
In man slurm.conf:
"MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from this
partition. This can be especially useful to schedule GPUs. For example
a node can be associated with two Slurm partitions (e.g. "cpu" and
"gpu") and the partition/queue "cpu" could be limited to only a subset
of the node's CPUs, insuring that one or more CPUs would be available
to jobs in the "gpu" partition/queue."
Can you help me? Why is the MaxCPUsPerNode option not working correctly
with SelectType=select/cons_res?
Best Regards, Vova.
2016-04-20 14:20 GMT+10:00 Vladimir Goy <[email protected]>:
Dear developers,
I have found something that looks like a bug in the Slurm code (file
src/plugins/select/cons_res/job_test.c, _allocate_sc(...)).
For example, I have 2 nodes:
GresTypes=gpu
NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:kepler:2 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4 Default=YES MaxTime=INFINITE State=UP
gres.conf:
Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1
Name=gpu Type=kepler File=/dev/nvidia1 CPUs=10,11
I use
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
So I have 4 GPUs in the cluster.
The problem is the following: I would like to submit 4 single-process
jobs, each of which needs 1 GPU. Slurm runs 2 of them and the others
stay pending. That is a failure.
Please look more carefully at the following code from file
src/plugins/select/cons_res/job_test.c, function _allocate_sc(...):
/* Step 1: create and compute core-count-per-socket
 * arrays and total core counts */
free_cores = xmalloc(sockets * sizeof(uint16_t));
used_cores = xmalloc(sockets * sizeof(uint16_t));
used_cpu_array = xmalloc(sockets * sizeof(uint32_t));
for (c = core_begin; c < core_end; c++) {    /* Cycle 1 */
    i = (uint16_t) (c - core_begin) / cores_per_socket;
    if (bit_test(core_map, c)) {
        free_cores[i]++;
        free_core_count++;
    } else {
        used_cores[i]++;    /* <--- possible error (line 1) */
    }
    if (part_core_map && bit_test(part_core_map, c))
        used_cpu_array[i]++;
}
for (i = 0; i < sockets; i++) {    /* Cycle 2 */
    /* if a socket is already in use and entire_sockets_only is
     * enabled, it cannot be used by this job */
    if (entire_sockets_only && used_cores[i]) {
        free_core_count -= free_cores[i];
        used_cores[i] += free_cores[i];
        free_cores[i] = 0;
    }
    free_cpu_count += free_cores[i] * threads_per_core;
    if (used_cpu_array[i])
        used_cpu_count += used_cores[i] * threads_per_core; /* <--- possible error (line 2) */
}
xfree(used_cores);
xfree(used_cpu_array);
/* Ignore resources that would push a job allocation over the
 * partition CPU limit (if any) */
if ((job_ptr->part_ptr->max_cpus_per_node != INFINITE) &&
    (free_cpu_count + used_cpu_count >
     job_ptr->part_ptr->max_cpus_per_node)) {
    int excess = free_cpu_count + used_cpu_count -
                 job_ptr->part_ptr->max_cpus_per_node;
    for (c = core_begin; c < core_end; c++) {
        i = (uint16_t) (c - core_begin) / cores_per_socket;
        if (free_cores[i] > 0) {
            free_core_count--;
            free_cores[i]--;
            excess -= threads_per_core;
            if (excess <= 0)
                break;
        }
    }
}
I have marked two lines that I think contain errors. When Gres is used,
line 1 is wrong, because a core that is not set in core_map may be
either really in use or merely forbidden by gres. Gres allows only
cores 0, 1, 10 and 11 to be used; the other cores are forbidden. In
this case the variable used_cpu_count is wrong after cycle 2, and the
if statement that follows prevents this node from being allocated to
the job, because used_cpu_count (10) + free_cpu_count (2) = 12, which
is greater than 4. That is wrong!
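
The numbers above can be reproduced with a small stand-alone model of the counting. This is a sketch, not Slurm code: plain bool arrays stand in for the bitstr_t maps, entire_sockets_only is assumed to be off, and the scenario (a first job of the partition already holding core 0, with gres filtering having left only cores 10 and 11 set in core_map) is an assumed reconstruction of the second job's test on one node. count_partition_only() is a hypothetical alternative shown for contrast, not necessarily what any committed fix does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SOCKETS          2
#define CORES_PER_SOCKET 10
#define THREADS_PER_CORE 1
#define NCORES           (SOCKETS * CORES_PER_SOCKET)

struct cpu_counts {
    uint32_t free_cpus;  /* CPUs still offered to the job */
    uint32_t used_cpus;  /* CPUs charged against MaxCPUsPerNode */
};

/* Assumed scenario: the partition's first job holds core 0, and only the
 * cores bound to the second GPU (10 and 11) are left set in core_map. */
void example_maps(bool *core_map, bool *part_core_map)
{
    memset(core_map, 0, NCORES * sizeof(bool));
    memset(part_core_map, 0, NCORES * sizeof(bool));
    part_core_map[0] = true;
    core_map[10] = core_map[11] = true;
}

/* Counting as in the quoted excerpt: every core not set in core_map is
 * "used", and a socket's whole used total is charged as soon as the
 * partition holds any core on that socket. */
struct cpu_counts count_as_quoted(const bool *core_map,
                                  const bool *part_core_map)
{
    uint16_t free_cores[SOCKETS] = {0}, used_cores[SOCKETS] = {0};
    uint32_t used_cpu_array[SOCKETS] = {0};
    struct cpu_counts r = {0, 0};

    assert(core_map && part_core_map);
    for (int c = 0; c < NCORES; c++) {
        int i = c / CORES_PER_SOCKET;
        if (core_map[c])
            free_cores[i]++;
        else
            used_cores[i]++;   /* also counts cores gres merely forbids */
        if (part_core_map[c])
            used_cpu_array[i]++;
    }
    for (int i = 0; i < SOCKETS; i++) {
        r.free_cpus += free_cores[i] * THREADS_PER_CORE;
        if (used_cpu_array[i])
            r.used_cpus += used_cores[i] * THREADS_PER_CORE;
    }
    return r;
}

/* Hypothetical alternative: charge only cores set in part_core_map. */
struct cpu_counts count_partition_only(const bool *core_map,
                                       const bool *part_core_map)
{
    struct cpu_counts r = {0, 0};

    assert(core_map && part_core_map);
    for (int c = 0; c < NCORES; c++) {
        if (core_map[c])
            r.free_cpus += THREADS_PER_CORE;
        if (part_core_map[c])
            r.used_cpus += THREADS_PER_CORE;
    }
    return r;
}

/* Free CPUs left after trimming to the partition limit. Mirrors the
 * "excess" loop in totals only: one free core is dropped per step. */
uint32_t free_after_limit(struct cpu_counts n, uint32_t max_cpus_per_node)
{
    int64_t excess = (int64_t)n.free_cpus + n.used_cpus - max_cpus_per_node;
    uint32_t free_cores = n.free_cpus / THREADS_PER_CORE;

    while (excess > 0 && free_cores > 0) {
        free_cores--;
        excess -= THREADS_PER_CORE;
    }
    return free_cores * THREADS_PER_CORE;
}
```

With MaxCPUsPerNode=4, count_as_quoted() reports 2 free and 10 used CPUs for this scenario, so free_after_limit() trims the node down to 0 usable CPUs and the job stays pending. count_partition_only() reports 2 free and 1 used, which stays under the limit and leaves both free CPUs available.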
Could this be a bug?
Best regards, Vova.