Hi Vova,

A few weeks ago I proposed a bug fix for this problem:

https://groups.google.com/forum/#!searchin/slurm-devel/marco$20ehlert/slurm-devel/CRsW-eiUfms/MI2aAL4UGwAJ

But it solves only half of the problem if you split the cores of a node into distinct sets per partition, like this:

NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:4
PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4
PartitionName=cpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=16

and bind specific CPU cores to the GPU cards in gres.conf.

I could not tell whether this is the configuration you are using, but it is in my case. Slurm does not manage the core maps for two distinct partitions on the same node correctly. At some point Slurm gets confused; I saw definitely wrong core_map/part_core_map variables.

That's why I decided not to use the "CPUs" argument in gres.conf, but to set the core map in a prolog script depending on the partition chosen by the job.
Job scripts read this core map and then set the taskset themselves.
It may not be the best workaround, but it works to some satisfaction at least.
It does lead to a difference between Slurm's picture of used and unused cores and the core map actually in use. But it works anyway, because the cores are now counted correctly.
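
A minimal sketch of the core-map lookup described above, written in C for consistency with the code later in this thread. The partition-to-core table is hypothetical (it only mirrors the 4/16 split of the example partitions), and the helper names cores_for_partition and core_list_to_mask are made up for illustration. The partition name would have to come from the job environment, for instance from SLURM_JOB_PARTITION, assuming that variable is set where the lookup runs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-partition core lists; real lists are site-specific. */
struct part_cores {
    const char *partition;
    const char *cores;      /* comma-separated core IDs */
};

static const struct part_cores PART_CORES[] = {
    { "gpu", "0,1,10,11" },
    { "cpu", "2,3,4,5,6,7,8,9,12,13,14,15,16,17,18,19" },
};

/* Return the core list configured for a partition, or NULL if unknown. */
const char *cores_for_partition(const char *partition)
{
    size_t n = sizeof(PART_CORES) / sizeof(PART_CORES[0]);

    assert(partition != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(PART_CORES[i].partition, partition) == 0)
            return PART_CORES[i].cores;
    }
    return NULL;
}

/* Convert a list such as "0,1,10,11" into a bitmask (bit N = core N).
 * Only cores 0..63 are supported; returns 0 on a malformed list. */
uint64_t core_list_to_mask(const char *list)
{
    uint64_t mask = 0;
    const char *p = list;

    assert(list != NULL);
    while (*p != '\0') {
        char *end;
        unsigned long core = strtoul(p, &end, 10);

        if (end == p || core >= 64)
            return 0;               /* not a number, or out of range */
        mask |= (uint64_t)1 << core;
        if (*end == ',')
            end++;                  /* skip the separator */
        else if (*end != '\0')
            return 0;               /* unexpected character */
        p = end;
    }
    return mask;
}
```

Printed in hexadecimal, the mask for the hypothetical "gpu" entry is 0xc03, which is the form taskset accepts as a CPU mask.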

Best,
Marco

On Wed, 4 May 2016, Vladimir Goy wrote:

Dear Developers,
when I use the MaxCPUsPerNode option in a partition definition with 2 GPUs per node (defined in gres) and SelectType=select/cons_res, Slurm allocates only one GPU per node to jobs. The other GPU is free, but Slurm cannot allocate it to jobs.

In man slurm.conf:

"MaxCPUsPerNode
              Maximum number of CPUs on any node available to all jobs from this partition.
              This can be especially useful to schedule GPUs. For example a node can be
              associated with two Slurm partitions (e.g. "cpu" and "gpu") and the
              partition/queue "cpu" could be limited to only a subset of the node's CPUs,
              insuring that one or more CPUs would be available to jobs in the "gpu"
              partition/queue."
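
One way to read the limit in the man page excerpt is as a per-node cap on the CPUs held by all jobs of a partition. The sketch below expresses that reading in C; the function name and parameters are hypothetical and are not taken from the Slurm source.

```c
#include <stdint.h>

/* CPUs on one node that a partition may still hand out, under the
 * reading above (a hypothetical helper, not a Slurm function):
 *   max_cpus_per_node : the partition's MaxCPUsPerNode
 *   part_used_cpus    : CPUs on the node already held by this partition's jobs
 *   node_idle_cpus    : CPUs on the node not allocated to any job */
uint32_t cpus_available_to_partition(uint32_t max_cpus_per_node,
                                     uint32_t part_used_cpus,
                                     uint32_t node_idle_cpus)
{
    uint32_t headroom;

    if (part_used_cpus >= max_cpus_per_node)
        return 0;                       /* partition limit already reached */
    headroom = max_cpus_per_node - part_used_cpus;
    return headroom < node_idle_cpus ? headroom : node_idle_cpus;
}
```

With a limit of 4 and one CPU already used by the partition on a 20-CPU node, this reading leaves 3 CPUs for further jobs of that partition.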

Can you help me? Why is the MaxCPUsPerNode option not working correctly with SelectType=select/cons_res?

Best Regards, Vova.




2016-04-20 14:20 GMT+10:00 Vladimir Goy <[email protected]>:
      Dear developers,

I have found something that looks like a bug in the Slurm code (file src/plugins/select/cons_res/job_test.c, function _allocate_sc(...)).
For example, I have 2 nodes:
GresTypes=gpu
NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:kepler:2 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
PartitionName=gpu   Nodes=n[01-02] Shared=NO  MaxCPUsPerNode=4  Default=YES  MaxTime=INFINITE State=UP

gres.conf:
Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1
Name=gpu Type=kepler File=/dev/nvidia1 CPUs=10,11

I use
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

=> I have 4 GPUs on the cluster.

The problem is the following: I would like to submit 4 single-process jobs, each of which needs 1 GPU. Slurm runs 2 of them and the others stay pending, which is wrong.

Please look carefully at the following code from src/plugins/select/cons_res/job_test.c, function _allocate_sc(...):

    /* Step 1: create and compute core-count-per-socket
     * arrays and total core counts */
    free_cores = xmalloc(sockets * sizeof(uint16_t));
    used_cores = xmalloc(sockets * sizeof(uint16_t));
    used_cpu_array = xmalloc(sockets * sizeof(uint32_t));

    for (c = core_begin; c < core_end; c++) {         //Cycle 1.
        i = (uint16_t) (c - core_begin) / cores_per_socket;
        if (bit_test(core_map, c)) {
            free_cores[i]++;
            free_core_count++;
        } else {
            used_cores[i]++;                 //<-------Here can be error!!! (1 line)
        }
        if (part_core_map && bit_test(part_core_map, c))
            used_cpu_array[i]++;
    }

    for (i = 0; i < sockets; i++) {       //Cycle 2.
        /* if a socket is already in use and entire_sockets_only is
         * enabled, it cannot be used by this job */
        if (entire_sockets_only && used_cores[i]) {
            free_core_count -= free_cores[i];
            used_cores[i] += free_cores[i];
            free_cores[i] = 0;
        }
        free_cpu_count += free_cores[i] * threads_per_core;
        if (used_cpu_array[i])
            used_cpu_count += used_cores[i] * threads_per_core;   //<----Here can be error. (2 line)
    }
    xfree(used_cores);
    xfree(used_cpu_array);

    /* Ignore resources that would push a job allocation over the
     * partition CPU limit (if any) */
    if ((job_ptr->part_ptr->max_cpus_per_node != INFINITE) &&
        (free_cpu_count + used_cpu_count >
         job_ptr->part_ptr->max_cpus_per_node)) {
        int excess = free_cpu_count + used_cpu_count -
                     job_ptr->part_ptr->max_cpus_per_node;
        for (c = core_begin; c < core_end; c++) {
            i = (uint16_t) (c - core_begin) / cores_per_socket;
            if (free_cores[i] > 0) {
                free_core_count--;
                free_cores[i]--;
                excess -= threads_per_core;
                if (excess <= 0)
                    break;
            }
        }
    }


I have marked two lines that I think contain errors. When Gres is used, line 1 is wrong, because some of the cores counted there may be in use, while others may merely be forbidden by Gres. Gres allows only cores 0, 1, 10 and 11; all other cores are forbidden. In this case the variable used_cpu_count is wrong after cycle 2, and in the following if statement the node cannot be allocated to the job, because (used_cpu_count = 10) + (free_cpu_count = 2) = 12 > 4, which is wrong!
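
The figures 10 + 2 = 12 can be reproduced with a small standalone model of the two cycles. The scenario is an assumption chosen to match those numbers, since the message does not say which node state produced them: one job of the same partition already holds core 0, and after Gres filtering only cores 10 and 11 remain set in core_map for the next job. Plain bool arrays stand in for the Slurm bitmaps, entire_sockets_only is taken to be off, and part_used_cpus is an extra counter added only for comparison; it is not presented as the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SOCKETS          2
#define CORES_PER_SOCKET 10
#define THREADS_PER_CORE 1
#define NUM_CORES        (SOCKETS * CORES_PER_SOCKET)

struct cpu_counts {
    uint32_t free_cpus;       /* free_cpu_count in the excerpt */
    uint32_t used_cpus;       /* used_cpu_count as the excerpt computes it */
    uint32_t part_used_cpus;  /* CPUs really held by the partition's jobs */
};

/* core_map[c]      : core c is selectable for this job (after Gres filtering)
 * part_core_map[c] : core c is held by a running job of the same partition */
struct cpu_counts count_cpus(const bool core_map[NUM_CORES],
                             const bool part_core_map[NUM_CORES])
{
    uint16_t free_cores[SOCKETS] = {0};
    uint16_t used_cores[SOCKETS] = {0};
    uint32_t used_cpu_array[SOCKETS] = {0};
    struct cpu_counts out = {0, 0, 0};

    /* Cycle 1: per-socket counts */
    for (int c = 0; c < NUM_CORES; c++) {
        int s = c / CORES_PER_SOCKET;

        if (core_map[c])
            free_cores[s]++;
        else
            used_cores[s]++;  /* also counts cores merely excluded by Gres */
        if (part_core_map[c])
            used_cpu_array[s]++;
    }

    /* Cycle 2: totals (entire_sockets_only assumed to be off) */
    for (int s = 0; s < SOCKETS; s++) {
        out.free_cpus += free_cores[s] * THREADS_PER_CORE;
        if (used_cpu_array[s])
            out.used_cpus += used_cores[s] * THREADS_PER_CORE;
        out.part_used_cpus += used_cpu_array[s] * THREADS_PER_CORE;
    }
    return out;
}

/* Assumed scenario: a job on core 0, Gres leaves cores 10 and 11 selectable. */
struct cpu_counts second_gpu_job_scenario(void)
{
    bool core_map[NUM_CORES] = {false};
    bool part_core_map[NUM_CORES] = {false};

    core_map[10] = true;
    core_map[11] = true;
    part_core_map[0] = true;
    return count_cpus(core_map, part_core_map);
}
```

Under these assumptions socket 0 contributes all 10 of its cores to used_cpus although the partition holds only one of them, so the sum exceeds MaxCPUsPerNode=4 and the trimming loop removes both free cores.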

Could this be a bug?

Best regards, Vova.



