Dear Marco,

Thank you so much for your response. I was only recently added to the
mailing list, so I did not see your earlier message. It is good that this
problem is known. In the end, I configured our cluster as follows, without
using MaxCPUsPerNode:

GresTypes=gpu
NodeName=n[01-10] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:K40:2,gpu:cpu:16 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
PartitionName=long Nodes=n[01-10] Shared=NO Default=YES DefMemPerCPU=1024 MaxMemPerCPU=3072 DefaultTime=24:0:0 MaxTime=INFINITE State=UP

and gres.conf:

Name=gpu Type=K40 File=/dev/nvidia0 CPUs=0
Name=gpu Type=K40 File=/dev/nvidia1 CPUs=10
Name=gpu Type=cpu CPUs=2-9,12-19 Count=16

With this configuration everything works well!
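
As a sanity check of the three CPUs= lists in the gres.conf above, the lists can be expanded into one shared core mask to confirm that they do not overlap. This is only a minimal sketch: mark_cpu_list and NODE_CORES are hypothetical names made up for this illustration, not Slurm functions or macros, and the parser handles only the simple "a-b,c" form used in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NODE_CORES 20  /* CPUs=20 from the NodeName line (assumed fixed here) */

/* Hypothetical helper, not part of Slurm: mark every core named in a list
 * such as "2-9,12-19" in mask[]. Returns the number of cores marked, or -1
 * for a malformed list, an out-of-range core, or a core that is already
 * marked (i.e. an overlap with a list processed earlier). */
static int mark_cpu_list(const char *list, bool mask[NODE_CORES])
{
    int count = 0;
    const char *p = list;

    assert(list != NULL && mask != NULL);
    while (*p) {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p)
            return -1;              /* no number where one was expected */
        long hi = lo;
        if (*end == '-') {          /* range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo < 0 || hi < lo || hi >= NODE_CORES)
            return -1;
        for (long c = lo; c <= hi; c++) {
            if (mask[c])
                return -1;          /* core already claimed by another list */
            mask[c] = true;
            count++;
        }
        if (*end == ',')
            p = end + 1;
        else if (*end == '\0')
            p = end;
        else
            return -1;
    }
    return count;
}
```

Calling it on "0", "10" and "2-9,12-19" in turn with one shared mask marks 1, 1 and 16 cores without overlap, which leaves cores 1 and 11 not bound to any gres entry.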

Best Regards, Vova

2016-05-04 21:09 GMT+10:00 Marco Ehlert <[email protected]>:

>
> Hi Vova,
>
> some weeks ago I proposed a bug fix for this problem.
>
>
> https://groups.google.com/forum/#!searchin/slurm-devel/marco$20ehlert/slurm-devel/CRsW-eiUfms/MI2aAL4UGwAJ
>
> But this solves only half of the problem if you are using distinct
> partitioning of cores like this:
>
> NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
> Gres=gpu:4
> PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4
> PartitionName=cpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=16
>
> and bind specific CPU cores to GPU cards in gres.conf.
>
> I am not sure whether this is the configuration you are using, but it is
> in my case. Slurm does not manage the core maps for the two distinct
> partitions on the same node correctly. At some point Slurm gets confused; I
> saw definitely wrong core_map/part_core_map values.
>
> That's why I decided not to use "CPUs" arguments in gres.conf, but to set
> the core map in a prolog script depending on the partition chosen by the
> job. Job scripts read this core map and then set the CPU affinity with
> taskset themselves.
> Maybe it is not the best workaround, but it works reasonably well at
> least.
> The result is a mismatch between Slurm's picture of used and unused cores
> and the actual core map in use. But it works anyway, because the cores are
> now counted correctly.
>
> Best,
> Marco
>
> On Wed, 4 May 2016, Vladimir Goy wrote:
>
>> Dear Developers,
>> when I use the MaxCPUsPerNode option in a partition definition with 2 GPUs
>> per node (defined in gres) and SelectType=select/cons_res, Slurm allocates
>> only one GPU per node to jobs. The other GPU is free, but Slurm cannot
>> allocate it to jobs.
>>
>> In man slurm.conf:
>>
>> "MaxCPUsPerNode
>>     Maximum number of CPUs on any node available to all jobs from this
>>     partition. This can be especially useful to schedule GPUs. For example
>>     a node can be associated with two Slurm partitions (e.g. "cpu" and
>>     "gpu") and the partition/queue "cpu" could be limited to only a subset
>>     of the node's CPUs, insuring that one or more CPUs would be available
>>     to jobs in the "gpu" partition/queue."
>>
>> Can you help me? Why is the MaxCPUsPerNode option not working correctly
>> with SelectType=select/cons_res?
>>
>> Best Regards, Vova.
>>
>>
>>
>>
>> 2016-04-20 14:20 GMT+10:00 Vladimir Goy <[email protected]>:
>>       Dear developers,
>>
>> I have found something that looks like a bug in the Slurm code (file
>> src/plugins/select/cons_res/job_test.c, _allocate_sc(...)).
>> For example, I have 2 nodes:
>>
>> GresTypes=gpu
>> NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:kepler:2 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
>> PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4 Default=YES MaxTime=INFINITE State=UP
>>
>> gres.conf:
>> Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1
>> Name=gpu Type=kepler File=/dev/nvidia1 CPUs=10,11
>>
>> I use
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>>
>> => I have 4 GPUs on the cluster.
>>
>> The problem is the following: I would like to submit 4 single-process
>> jobs, each of which needs 1 GPU. Slurm runs 2 of them and the others stay
>> pending, which is wrong.
>>
>> Please look closely at the following code from file
>> src/plugins/select/cons_res/job_test.c, function _allocate_sc(...):
>>
>> /* Step 1: create and compute core-count-per-socket
>>  * arrays and total core counts */
>> free_cores = xmalloc(sockets * sizeof(uint16_t));
>> used_cores = xmalloc(sockets * sizeof(uint16_t));
>> used_cpu_array = xmalloc(sockets * sizeof(uint32_t));
>>
>> for (c = core_begin; c < core_end; c++) {    // Cycle 1.
>>     i = (uint16_t) (c - core_begin) / cores_per_socket;
>>     if (bit_test(core_map, c)) {
>>         free_cores[i]++;
>>         free_core_count++;
>>     } else {
>>         used_cores[i]++;    // <--- Here can be an error!!! (line 1)
>>     }
>>     if (part_core_map && bit_test(part_core_map, c))
>>         used_cpu_array[i]++;
>> }
>>
>> for (i = 0; i < sockets; i++) {    // Cycle 2.
>>     /* if a socket is already in use and entire_sockets_only is
>>      * enabled, it cannot be used by this job */
>>     if (entire_sockets_only && used_cores[i]) {
>>         free_core_count -= free_cores[i];
>>         used_cores[i] += free_cores[i];
>>         free_cores[i] = 0;
>>     }
>>     free_cpu_count += free_cores[i] * threads_per_core;
>>     if (used_cpu_array[i])
>>         // <--- Here can be an error. (line 2)
>>         used_cpu_count += used_cores[i] * threads_per_core;
>> }
>> xfree(used_cores);
>> xfree(used_cpu_array);
>>
>> /* Ignore resources that would push a job allocation over the
>>  * partition CPU limit (if any) */
>> if ((job_ptr->part_ptr->max_cpus_per_node != INFINITE) &&
>>     (free_cpu_count + used_cpu_count >
>>      job_ptr->part_ptr->max_cpus_per_node)) {
>>     int excess = free_cpu_count + used_cpu_count -
>>                  job_ptr->part_ptr->max_cpus_per_node;
>>     for (c = core_begin; c < core_end; c++) {
>>         i = (uint16_t) (c - core_begin) / cores_per_socket;
>>         if (free_cores[i] > 0) {
>>             free_core_count--;
>>             free_cores[i]--;
>>             excess -= threads_per_core;
>>             if (excess <= 0)
>>                 break;
>>         }
>>     }
>> }
>>
>>
>> I marked two lines that I think contain errors. When Gres is used, line 1
>> is wrong, because a core that is not set in core_map is not necessarily in
>> use: it may simply be forbidden by Gres. Gres allows only cores 0, 1, 10
>> and 11; the other cores are forbidden. In this case, after cycle 2 the
>> variable used_cpu_count is wrong, and in the following if statement the
>> node cannot be allocated to the job, because
>> used_cpu_count (10) + free_cpu_count (2) = 12 > 4, which is wrong!
>>
>> Could this be a bug?
>>
>> Best regards, Vova.
>>
>>
>>
>>
>>
