Marco's patch has been committed to the Slurm version 16.05 code base in this commit:
https://github.com/SchedMD/slurm/commit/70aafa68b19a1d6819f1823ebdc0c1c103f2c9b6

Thank you for your contribution.


On 2016-05-04 05:08, Marco Ehlert wrote:
Hi Vova,

some weeks ago I proposed a bug fix for this problem.

https://groups.google.com/forum/#!searchin/slurm-devel/marco$20ehlert/slurm-devel/CRsW-eiUfms/MI2aAL4UGwAJ

But this solves only half of the problem if you are using distinct
partitioning of cores like this:

NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:4
PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4
PartitionName=cpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=16

and bind specific CPU cores to GPU cards in gres.conf.

I am not sure whether this is the configuration you are using, but
it is the one I use. Slurm does not manage the core maps for the two
distinct partitions on the same node correctly. At some point Slurm
gets confused; I saw core_map/part_coremap variables that were
definitely wrong.

That's why I decided not to use the "CPUs" argument in gres.conf and
instead set the core map in a prolog script, depending on the
partition chosen by the job.
Job scripts read this core map and then set the taskset themselves.
It may not be the best workaround, but it works reasonably well.
The result is a difference between Slurm's picture of used and unused
cores and the core map actually in use. It works anyway, because the
cores are now counted correctly.
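
As an illustration of the workaround described above, here is a minimal sketch of turning a core list such as "0,1,10,11" into the kind of bit mask that taskset accepts. The function name core_list_to_mask and the 64-core limit are assumptions made for this example; the actual prolog and job scripts are not shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a comma-separated core list such as
 * "0,1,10,11" into a bit mask (bit n set means core n is allowed).
 * Returns the number of distinct cores, or -1 on malformed input.
 * Only cores 0..63 are supported in this sketch. */
int core_list_to_mask(const char *list, uint64_t *mask)
{
    int count = 0;
    const char *p = list;

    assert(list != NULL && mask != NULL);
    *mask = 0;
    while (*p != '\0') {
        char *end;
        long core;

        errno = 0;
        core = strtol(p, &end, 10);
        if (end == p || errno != 0 || core < 0 || core > 63)
            return -1;              /* not a number, or out of range */
        if (!(*mask & (UINT64_C(1) << core)))
            count++;                /* count each core only once */
        *mask |= UINT64_C(1) << core;
        if (*end == ',')
            end++;                  /* skip the separator */
        else if (*end != '\0')
            return -1;              /* unexpected character */
        p = end;
    }
    return count;
}
```

For the list "0,1,10,11" this yields 4 cores and the mask 0xc03, which a job script could hand to taskset as a hexadecimal mask.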

Best,
Marco

On Wed, 4 May 2016, Vladimir Goy wrote:

Dear Developers, when I use the MaxCPUsPerNode option in a partition definition with 2 GPUs per node (defined in gres) and SelectType=select/cons_res, Slurm allocates only one GPU per node to jobs. The other GPU is free, but Slurm cannot allocate it to jobs.

In man slurm.conf:

"MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule GPUs. For example a node can be associated with two Slurm partitions (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, insuring that one or more CPUs would be available to jobs in the "gpu" partition/queue."

Can you help me? Why does the MaxCPUsPerNode option not work correctly with SelectType=select/cons_res?

Best Regards, Vova.




2016-04-20 14:20 GMT+10:00 Vladimir Goy <[email protected]>:
Dear developers,

I have found what looks like a bug in the Slurm code (file src/plugins/select/cons_res/job_test.c, function _allocate_sc(...)).
For example, I have 2 nodes:
GresTypes=gpu
NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:kepler:2 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4 Default=YES MaxTime=INFINITE State=UP

gres.conf:
Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1
Name=gpu Type=kepler File=/dev/nvidia1 CPUs=10,11

I use
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

=> I have 4 GPUs in the cluster.

The problem is the following: I would like to submit 4 single-process jobs, each of which needs 1 GPU. Slurm runs 2 of them and the others stay pending. This is not the expected behavior.

Please look closely at the following code from file src/plugins/select/cons_res/job_test.c, function _allocate_sc(...):

/* Step 1: create and compute core-count-per-socket
 * arrays and total core counts */
free_cores = xmalloc(sockets * sizeof(uint16_t));
used_cores = xmalloc(sockets * sizeof(uint16_t));
used_cpu_array = xmalloc(sockets * sizeof(uint32_t));

for (c = core_begin; c < core_end; c++) {         // Cycle 1.
    i = (uint16_t) (c - core_begin) / cores_per_socket;
    if (bit_test(core_map, c)) {
        free_cores[i]++;
        free_core_count++;
    } else {
        used_cores[i]++; // <------- Here can be error!!! (1 line)
    }
    if (part_core_map && bit_test(part_core_map, c))
        used_cpu_array[i]++;
}

for (i = 0; i < sockets; i++) {       // Cycle 2.
    /* if a socket is already in use and entire_sockets_only is
     * enabled, it cannot be used by this job */
    if (entire_sockets_only && used_cores[i]) {
        free_core_count -= free_cores[i];
        used_cores[i] += free_cores[i];
        free_cores[i] = 0;
    }
    free_cpu_count += free_cores[i] * threads_per_core;
    if (used_cpu_array[i])
        used_cpu_count += used_cores[i] * threads_per_core; // <---- Here can be error. (2 line)
}
xfree(used_cores);
xfree(used_cpu_array);

/* Ignore resources that would push a job allocation over the
 * partition CPU limit (if any) */
if ((job_ptr->part_ptr->max_cpus_per_node != INFINITE) &&
    (free_cpu_count + used_cpu_count >
     job_ptr->part_ptr->max_cpus_per_node)) {
    int excess = free_cpu_count + used_cpu_count -
                 job_ptr->part_ptr->max_cpus_per_node;
    for (c = core_begin; c < core_end; c++) {
        i = (uint16_t) (c - core_begin) / cores_per_socket;
        if (free_cores[i] > 0) {
            free_core_count--;
            free_cores[i]--;
            excess -= threads_per_core;
            if (excess <= 0)
                break;
        }
    }
}


I marked two lines that I think contain errors. When gres is used, line 1 is wrong, because a core that is not set in core_map is not necessarily in use: it may instead be forbidden by gres. Here gres allows only cores 0, 1, 10 and 11; all other cores are forbidden. In this case used_cpu_count is wrong after cycle 2, and in the following if statement the node cannot be allocated to the job, because used_cpu_count (10) + free_cpu_count (2) = 12 > 4, which is wrong.
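
The arithmetic above can be reproduced with a small standalone model of the two counting loops. This is only a sketch under stated assumptions: plain bool arrays stand in for Slurm's bitstr_t, entire_sockets_only is taken to be off, and the scenario (one job of the partition already running on core 0 with GPU 0, the next job asking for GPU 1 on cores 10,11) is a guess at a state that produces the reported numbers. The names count_cpus, vova_scenario and part_cpu_count are hypothetical and do not appear in Slurm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_CORES   20   /* CPUs=20 in the NodeName line */
#define N_SOCKETS 2    /* Sockets=2 */

struct counts {
    uint32_t free_cpu_count;  /* CPUs still selectable for this job */
    uint32_t used_cpu_count;  /* CPUs charged against the partition limit */
    uint32_t part_cpu_count;  /* CPUs actually set in part_core_map */
};

/* Mirrors "Cycle 1" and "Cycle 2" of the quoted code (simplified model). */
struct counts count_cpus(const bool core_map[N_CORES],
                         const bool part_core_map[N_CORES],
                         uint16_t threads_per_core)
{
    uint16_t free_cores[N_SOCKETS] = {0};
    uint16_t used_cores[N_SOCKETS] = {0};
    uint32_t used_cpu_array[N_SOCKETS] = {0};
    uint16_t cores_per_socket = N_CORES / N_SOCKETS;
    struct counts r = {0, 0, 0};

    assert(threads_per_core > 0);
    for (int c = 0; c < N_CORES; c++) {
        int i = c / cores_per_socket;
        if (core_map[c])
            free_cores[i]++;
        else
            used_cores[i]++;  /* also counts cores merely excluded by gres */
        if (part_core_map[c]) {
            used_cpu_array[i]++;
            r.part_cpu_count += threads_per_core;
        }
    }
    for (int i = 0; i < N_SOCKETS; i++) {
        r.free_cpu_count += free_cores[i] * threads_per_core;
        if (used_cpu_array[i])
            r.used_cpu_count += used_cores[i] * threads_per_core;
    }
    return r;
}

/* Assumed scenario, not taken from a real slurmctld trace. */
struct counts vova_scenario(void)
{
    bool core_map[N_CORES] = {false};
    bool part_core_map[N_CORES] = {false};

    core_map[10] = core_map[11] = true;  /* cores usable with GPU 1 */
    part_core_map[0] = true;             /* core held by the running job */
    return count_cpus(core_map, part_core_map, 1);
}
```

Under these assumptions vova_scenario() returns free_cpu_count = 2 and used_cpu_count = 10, so the sum of 12 exceeds MaxCPUsPerNode=4 and the trimming loop would remove both remaining free cores. Only 1 CPU of the partition is really in use (part_cpu_count), and 1 + 2 = 3 would fit within the limit. Whether the committed fix counts CPUs this way is not stated in this thread.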

Could this be a bug?

Best regards, Vova.



