[slurm-dev] Massive SLURM failure

Paul Edmon Tue, 06 Aug 2013 18:31:01 -0700

So our SLURM 2.5.7 install went down this evening with a massive bout of:

[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950 NodeList=holy2b09103 #CPUs=4
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113951 NodeList=sandy-rc01 #CPUs=4
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe

I tried to save it but to no avail. It had already generated 125 GB of logfile. I kill -9'd it after trying a reconfig and a restart. After I restarted it, it dropped all the jobs again.

I dug through the source code and found the following statement for this error:

for (tid = 0, i = job_ptr->details->cpus_per_task ; (tid < maxtasks);
     i += job_ptr->details->cpus_per_task ) { /* cycle counter */
        bool space_remaining = false;
        if (over_subscribe) {
                /* 'over_subscribe' is a relief valve that guards
                 * against an infinite loop, and it *should* never
                 * come into play because maxtasks should never be
                 * greater than the total number of available cpus
                 */
                error("cons_res: _compute_c_b_task_dist oversubscribe");

This apparently is a safety valve, but you should never hit it. I don't know if slurm-2.6.0 fixes this, and I'm not sure what exactly causes the scheduler to get into this state or if it is avoidable. I don't even know if it would ever recover from this or if it would just stay like this forever.
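
To make the relief-valve idea concrete, here is a minimal sketch of how such a loop could behave. This is not the actual SLURM code: the function distribute_tasks, its parameters, and the message text are hypothetical, and the logic is only my reading of the comment quoted above. Each pass raises the CPU threshold by cpus_per_task; once no node has room for another task, the loop starts oversubscribing so that it can still terminate.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the loop quoted above.
 * Spreads maxtasks tasks over `nodes` nodes, cpus_per_task CPUs each.
 * tasks_out[n] receives the number of tasks placed on node n.
 * Returns how many passes ran with the relief valve open, or -1 on bad input. */
static int distribute_tasks(const int *avail_cpus, int *tasks_out,
                            int nodes, int maxtasks, int cpus_per_task)
{
    bool over_subscribe = false;
    int valve_passes = 0;
    int tid, i, n;

    if (nodes <= 0 || cpus_per_task <= 0)
        return -1;              /* nothing to place tasks on */

    for (n = 0; n < nodes; n++)
        tasks_out[n] = 0;

    for (tid = 0, i = cpus_per_task; tid < maxtasks; i += cpus_per_task) {
        bool space_remaining = false;

        if (over_subscribe) {
            /* Relief valve: more tasks requested than CPUs available. */
            fprintf(stderr, "sketch: task distribution oversubscribed\n");
            valve_passes++;
        }
        for (n = 0; n < nodes && tid < maxtasks; n++) {
            if (avail_cpus[n] >= i || over_subscribe) {
                tid++;
                tasks_out[n]++;
                /* Does this node still have room for one more task? */
                if (avail_cpus[n] >= i + cpus_per_task)
                    space_remaining = true;
            }
        }
        if (!space_remaining)
            over_subscribe = true;  /* without this the loop could spin forever */
    }
    return valve_passes;
}
```

In this sketch the valve only opens when maxtasks exceeds what the available CPUs can hold, which matches the source comment's claim that it "should never come into play". Whether the real scheduler reaches that state through bad accounting or something else is exactly what I cannot tell.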

Could anyone tell me why this happened or give me any insight as to whether this is a bug or not? This is the second time this has happened. We cannot keep losing jobs like this; we had over 10,000 jobs in flight. Otherwise we will have to abandon SLURM and use our old system, as at least it was stable. Please, I need answers. I would rather not abandon SLURM as I like many of the features it has. However, we will have to if we can't maintain stability. We aren't even at our full capacity yet as this is still beta testing. If it can't take this load this project is sunk.

SLURM should at least fail in such a way that it never loses jobs. Is there something I am missing? Some way of making it more robust? We've tried the HA failover, but that didn't work when this happened and caused other problems when the install went split-brained.


-Paul Edmon-
