So our SLURM 2.5.7 install went down this evening with a massive bout of:
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950 NodeList=holy2b09103 #CPUs=4
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113951 NodeList=sandy-rc01 #CPUs=4
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
I tried to save it, but to no avail: it had already generated 125 GB of
log file. I kill -9'd it after trying a reconfig and a restart. After I
restarted it, it dropped all the jobs again.
I dug through the source code and found the following statement for this
error:
for (tid = 0, i = job_ptr->details->cpus_per_task ; (tid < maxtasks);
     i += job_ptr->details->cpus_per_task ) { /* cycle counter */
        bool space_remaining = false;
        if (over_subscribe) {
                /* 'over_subscribe' is a relief valve that guards
                 * against an infinite loop, and it *should* never
                 * come into play because maxtasks should never be
                 * greater than the total number of available cpus
                 */
                error("cons_res: _compute_c_b_task_dist oversubscribe");
        }
This is apparently a safety valve that should never be hit. I don't
know whether slurm-2.6.0 fixes this, and I'm not sure what exactly
causes the scheduler to get into this state or whether it is avoidable.
I don't even know if it would ever recover from this or if it would
just stay like this forever.
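To make the relief-valve idea concrete, here is a minimal sketch in C of a round-robin task distribution loop with the same shape as the fragment above. It is not the SLURM source: the function name distribute_tasks, its parameters, and its return value are hypothetical, chosen only for illustration. In this sketch the valve opens only when more tasks are requested than the listed CPUs can hold, and the function counts how many passes ran with the valve open instead of logging an error.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical illustration, not SLURM code.
 * Hands out `maxtasks` tasks round-robin over `nnodes` nodes.
 * avail[n]    : CPUs node n can offer.
 * assigned[n] : receives the CPUs charged to node n.
 * Returns the number of passes made with the relief valve open,
 * or -1 if the arguments would make progress impossible.
 */
int distribute_tasks(const int *avail, int *assigned, size_t nnodes,
                     int maxtasks, int cpus_per_task)
{
    bool over_subscribe = false;
    int valve_passes = 0;
    int tid = 0;

    if (nnodes == 0 || cpus_per_task <= 0)
        return -1;

    for (size_t n = 0; n < nnodes; n++)
        assigned[n] = 0;

    /* `need` is the CPU total a node must offer to accept a task on this pass. */
    for (int need = cpus_per_task; tid < maxtasks; need += cpus_per_task) {
        bool space_remaining = false;

        if (over_subscribe)
            valve_passes++; /* the fragment above logs an error at this point */

        for (size_t n = 0; n < nnodes && tid < maxtasks; n++) {
            if (need <= avail[n] || over_subscribe) {
                tid++;
                assigned[n] += cpus_per_task;
                if (need + cpus_per_task <= avail[n])
                    space_remaining = true;
            }
        }

        /* No node can fit another task: open the valve so the loop still ends. */
        if (!space_remaining)
            over_subscribe = true;
    }
    return valve_passes;
}
```

With two nodes offering 4 CPUs each and 2 CPUs per task, asking for 4 tasks fills both nodes and returns 0. Asking for 6 tasks forces one extra pass with the valve open and charges 6 CPUs to each 4-CPU node. In this sketch the valve is hit once per pass, so the number of hits grows with the number of tasks that do not fit.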
Could anyone tell me why this happened, or give me any insight as to
whether this is a bug? This is the second time this has happened. We
had over 10,000 jobs in flight, and we cannot keep losing jobs like
this; otherwise we will have to abandon SLURM and go back to our old
system, as at least it was stable. Please, I need answers. I would
rather not abandon SLURM, as I like many of the features it has.
However, we will have to if we can't maintain stability. We aren't even
at our full capacity yet, as this is still beta testing. If it can't
take this load, this project is sunk.
SLURM should at least fail in such a way that it never loses jobs. Is
there something I am missing? Some way of making it more robust? We've
tried the HA failover, but that didn't work when this happened and
caused other problems when the install went split-brained.
-Paul Edmon-