Re: [slurm-dev] GRES fatal error

Carles Fenoy Mon, 28 Nov 2011 09:57:54 -0800

Hi Moe,

I can't tell you exactly which use case causes the problem, because I
don't remember, but it was something like this:


1 - A user submits one job asking for 2 GPUs (gres=gpu:2) and 2 tasks -> the
job starts on node1, CPUs 0,6
2 - Another user submits one job asking for 1 task with 6 CPUs and 1 GPU
(gres=gpu:1)

Slurm tries to assign node1, but the job cannot be placed there because both
GPUs are already in use, and we get the "not progressing" error.
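
For illustration, the availability check that the patch adds can be sketched
as a standalone function. This is a simplified sketch, not the Slurm code: the
two structs and the function name gres_fits() are hypothetical, and only the
field names and the use_total_gres flag are taken from the patch quoted below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the node and job GRES state.
 * Only the field names are borrowed from the patch. */
struct node_gres {
    uint32_t gres_cnt_avail;   /* GRES configured on the node */
    uint32_t gres_cnt_alloc;   /* GRES already allocated to running jobs */
};

struct job_gres {
    uint32_t gres_cnt_alloc;   /* GRES the job needs on this node */
};

/* Return true if the node has enough GRES for the job.
 * When use_total_gres is false, GRES already in use are subtracted first. */
static bool gres_fits(const struct job_gres *job,
                      const struct node_gres *node,
                      bool use_total_gres)
{
    uint32_t gres_avail = node->gres_cnt_avail;

    if (!use_total_gres) {
        /* Guard against unsigned underflow in this sketch. */
        assert(node->gres_cnt_alloc <= node->gres_cnt_avail);
        gres_avail -= node->gres_cnt_alloc;
    }
    return job->gres_cnt_alloc <= gres_avail;
}
```

In the scenario above, node1 has 2 GPUs with both allocated, so a job asking
for 1 GPU does not fit when use_total_gres is false and the node should be
rejected up front.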

I cannot reproduce it now because we have applied the patch, but I can install
Slurm in a test environment if you are not able to reproduce the problem.

Best regards,
Carles Fenoy

On Tue, Nov 22, 2011 at 5:58 PM, Moe Jette <[email protected]> wrote:

> Carles,
>
> Your patch looks good to me. What exactly is the configuration and use
> case which causes this problem?
>
> Moe Jette
> SchedMD LLC
>
>
> Quoting Carles Fenoy <[email protected]>:
>
>> Hi,
>>
>> I've found another bug in slurmctld that kills it with a fatal error.
>> I've solved it with the following patch but I'm not sure if it's the best
>> way to solve it, so I'm not pushing it to the git and sending it to the
>> list to discuss the best solution.
>> The problem is the same as the previous gres bug:
>> "fatal: cons_res: sync loop not progressing"
>> due to an error considering available gres resources.
>>
>> --- slurm-2.3.1/src/common/gres.c    2011-10-24 19:15:42.000000000 +0200
>> +++ ../slurm-2.3.1/src/common/gres.c    2011-11-21 18:50:34.256761175 +0100
>> @@ -2509,6 +2509,12 @@
>>         return NO_VAL;
>>     } else if (job_gres_ptr->gres_cnt_alloc && node_gres_ptr->topo_cnt) {
>>         /* Need to determine which specific CPUs can be used */
>> +        gres_avail = node_gres_ptr->gres_cnt_avail;
>> +        if (!use_total_gres)
>> +            gres_avail -= node_gres_ptr->gres_cnt_alloc;
>> +        if (job_gres_ptr->gres_cnt_alloc > gres_avail)
>> +            return (uint32_t) 0;    /* insufficient, gres to use */
>> +
>>         if (cpu_bitmap) {
>>             cpus_ctld = cpu_end_bit - cpu_start_bit + 1;
>>             if (cpus_ctld < 1) {
>>
>>
>> --
>> --
>> Carles Fenoy
>>
>>
>
>
>


-- 
--
Carles Fenoy
