Hi Moe,

I can't tell you exactly which use case causes the problem, because I don't remember, but it was something like this:
1 - A user submits 1 job asking for 2 GPUs (gres=gpu:2) and 2 tasks -> the job starts on node1, cpus 0,6
2 - Another user submits 1 job asking for 1 task with 6 cpus and 1 GPU (gres=gpu:1)

Slurm tries to assign node1, but it cannot be assigned because both GPUs are in use, and we get the "not progressing" error.

I cannot reproduce it now because we have applied the patch, but I can install Slurm in a test environment if you are not able to reproduce the problem.

Best regards,
Carles Fenoy

On Tue, Nov 22, 2011 at 5:58 PM, Moe Jette <[email protected]> wrote:

> Carles,
>
> Your patch looks good to me. What exactly is the configuration and use
> case which causes this problem?
>
> Moe Jette
> SchedMD LLC
>
>
> Quoting Carles Fenoy <[email protected]>:
>
>> Hi,
>>
>> I've found another bug in slurmctld that kills it with a fatal error.
>> I've solved it with the following patch, but I'm not sure if it's the best
>> way to solve it, so I'm not pushing it to git and am sending it to the
>> list to discuss the best solution.
>> The problem is the same as the previous gres bug:
>> "fatal: cons_res: sync loop not progressing"
>> due to an error considering available gres resources.
>>
>> --- slurm-2.3.1/src/common/gres.c 2011-10-24 19:15:42.000000000 +0200
>> +++ ../slurm-2.3.1/src/common/gres.c 2011-11-21 18:50:34.256761175 +0100
>> @@ -2509,6 +2509,12 @@
>>  		return NO_VAL;
>>  	} else if (job_gres_ptr->gres_cnt_alloc && node_gres_ptr->topo_cnt) {
>>  		/* Need to determine which specific CPUs can be used */
>> +		gres_avail = node_gres_ptr->gres_cnt_avail;
>> +		if (!use_total_gres)
>> +			gres_avail -= node_gres_ptr->gres_cnt_alloc;
>> +		if (job_gres_ptr->gres_cnt_alloc > gres_avail)
>> +			return (uint32_t) 0; /* insufficient, gres to use */
>> +
>>  		if (cpu_bitmap) {
>>  			cpus_ctld = cpu_end_bit - cpu_start_bit + 1;
>>  			if (cpus_ctld < 1) {
>>
>>
>> --
>> --
>> Carles Fenoy
>>
>>
>
>
>

--
--
Carles Fenoy
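
P.S. To make the check that the quoted patch adds more concrete, here is a minimal standalone sketch in C. The struct and function names (node_gres, gres_fits) are made up for illustration and are not Slurm's. Only the arithmetic mirrors the added lines, and I am assuming that use_total means "treat every configured GPU as usable, ignoring what running jobs hold".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a node's GRES counters. */
struct node_gres {
    uint32_t cnt_avail;  /* GPUs configured on the node */
    uint32_t cnt_alloc;  /* GPUs already held by running jobs */
};

/*
 * Return true if `requested` GPUs fit on the node.
 * When use_total is true, GPUs held by running jobs are counted as usable
 * (assumed meaning of use_total_gres in the patch).
 */
static bool gres_fits(const struct node_gres *node, uint32_t requested,
                      bool use_total)
{
    uint32_t usable;

    /* Guard against unsigned underflow in the subtraction below. */
    assert(node->cnt_alloc <= node->cnt_avail);

    usable = node->cnt_avail;
    if (!use_total)
        usable -= node->cnt_alloc;

    return requested <= usable;
}
```

With node1 from the use case above (2 GPUs configured, both allocated to the first job), gres_fits(&node1, 1, false) is false, so the second job is rejected for that node up front instead of reaching the CPU selection step.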
