So our SLURM 2.5.7 install went down this evening with a massive bout of:

[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950 NodeList=holy2b09103 #CPUs=4
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113951 NodeList=sandy-rc01 #CPUs=4
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe

I tried to save it, but to no avail. It had already generated 125 GB of log file. I kill -9'd it after trying a reconfig and a restart. When I started it back up, it dropped all the jobs again.

I dug through the source code and found the following statement for this error:

for (tid = 0, i = job_ptr->details->cpus_per_task ; (tid < maxtasks);
             i += job_ptr->details->cpus_per_task ) { /* cycle counter */
                bool space_remaining = false;
                if (over_subscribe) {
                        /* 'over_subscribe' is a relief valve that guards
                         * against an infinite loop, and it *should* never
                         * come into play because maxtasks should never be
                         * greater than the total number of available cpus
                         */
                        error("cons_res: _compute_c_b_task_dist oversubscribe");
                }

This is apparently a safety valve that should never be hit. I don't know whether slurm-2.6.0 fixes this, and I'm not sure what exactly causes the scheduler to get into this state or whether it is avoidable. I don't even know whether it would ever recover on its own or just stay like this forever.
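To make the "relief valve" idea concrete, here is a minimal, self-contained C sketch of a cyclic distribution loop with such a valve. It is not the SLURM code: the function name distribute_cyclic, the avail and assigned arrays, and the whole inner loop are assumptions made up for illustration. Only the shape of the outer loop and the over_subscribe / space_remaining flags come from the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical, simplified cyclic task distribution with an
 * over-subscribe relief valve. NOT the SLURM implementation.
 *
 * avail[n]    : CPUs available on node n (input)
 * assigned[n] : CPUs handed out on node n (output)
 * Returns the number of passes that ran with the valve open.
 */
int distribute_cyclic(const int *avail, int *assigned, int nnodes,
                      int maxtasks, int cpus_per_task)
{
    assert(avail != NULL && assigned != NULL);
    assert(nnodes > 0 && cpus_per_task > 0);

    bool over_subscribe = false;
    int valve_hits = 0;
    int tid = 0;

    for (int n = 0; n < nnodes; n++)
        assigned[n] = 0;

    /* i is the per-node CPU ceiling reached on the current pass */
    for (int i = cpus_per_task; tid < maxtasks; i += cpus_per_task) {
        bool space_remaining = false;

        if (over_subscribe) {
            /* one message per pass once the valve is open */
            valve_hits++;
            fprintf(stderr, "sketch: oversubscribe\n");
        }
        for (int n = 0; n < nnodes && tid < maxtasks; n++) {
            if (i <= avail[n] || over_subscribe) {
                tid++;
                assigned[n] += cpus_per_task;
                /* can this node take another task next pass? */
                if (i + cpus_per_task <= avail[n])
                    space_remaining = true;
            }
        }
        /* no node has room left: open the valve so the loop ends */
        if (!space_remaining)
            over_subscribe = true;
    }
    return valve_hits;
}
```

In this sketch the valve only opens when maxtasks asks for more CPUs than the nodes have in total. For example, two nodes with 4 CPUs each and 4 CPUs per task place two tasks cleanly, while a third task forces one oversubscribed pass. Whether the real scheduler reaches the message the same way is exactly what I can't tell from the source.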

Could anyone tell me why this happened, or give me any insight as to whether this is a bug? This is the second time this has happened. We cannot keep losing jobs like this; we had over 10,000 jobs in flight. If this continues, we will have to abandon SLURM and go back to our old system, as at least it was stable. Please, I need answers. I would rather not abandon SLURM, as I like many of its features. However, we will have to if we can't maintain stability. We aren't even at full capacity yet, as this is still beta testing. If it can't take this load, this project is sunk.

SLURM should at least fail in such a way that it never loses jobs. Is there something I am missing? Some way of making it more robust? We've tried the HA failover, but that didn't work when this happened, and it caused other problems when the install went split-brain.

-Paul Edmon-
