[slurm-dev] Re: oversubscribe

Paul Edmon Thu, 08 Aug 2013 13:35:28 -0700

So I'm trying to understand why slurm gets into the _compute_c_b_task_dist oversubscribe state in the first place. The relevant code is in src/plugins/select/cons_res/dist_task.c at the routine called _compute_c_b_task_dist.

I'm not sure my interpretation of what is going on is correct as I'm not super familiar with all the subtleties of the slurm source code. However here is my reconstruction of what is going on, at least from what I interpret:


1. Initially the oversubscribe flag is set to false.

2. SLURM figures out how much space the job it is running requires and how many nodes it needs to be distributed over.

3. SLURM then figures out how many tasks it is running.

4. SLURM sets the space_remaining flag false. Basically assuming that there is no space left on the node in question.

5. SLURM counts up how many tasks it is running and compares it against the maximum number of allowed tasks.

5. a. If SLURM ends up with fewer than the maximum number of tasks allowed it says there is space remaining and moves on.

5. b. If SLURM ends up with more than the maximum number of tasks allowed it complains saying it is oversubscribed.
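
The steps above can be sketched as a small standalone C model. This is only an illustration of the reading given here, not the SLURM source: the function name distribute, the avail_cpus and tasks_on_node arrays, and the inner per-node loop are all assumptions made for the sketch. It returns how many passes started with the oversubscribe flag already set, which in the real routine would correspond to one error line each.

```c
#include <stdbool.h>

/* Hypothetical model, not the SLURM source: hand out maxtasks tasks
 * round-robin over nhosts nodes. Returns the number of passes that began
 * with over_subscribe already set. */
static int distribute(const int *avail_cpus, int nhosts, int maxtasks,
                      int cpus_per_task, int *tasks_on_node)
{
    bool over_subscribe = false;            /* step 1: starts out false */
    int warnings = 0;
    int tid, i, n;

    for (n = 0; n < nhosts; n++)
        tasks_on_node[n] = 0;

    for (tid = 0, i = cpus_per_task; tid < maxtasks; i += cpus_per_task) {
        bool space_remaining = false;       /* step 4: assume no space */

        if (over_subscribe)
            warnings++;                     /* the quoted code calls error() here */

        /* Assumed inner loop: one task per node per pass, if it fits. */
        for (n = 0; n < nhosts && tid < maxtasks; n++) {
            if (i <= avail_cpus[n] || over_subscribe) {
                tid++;
                tasks_on_node[n]++;
                if (i + cpus_per_task <= avail_cpus[n])
                    space_remaining = true; /* room for another pass */
            }
        }
        if (!space_remaining)
            over_subscribe = true;          /* no node had room left */
    }
    return warnings;
}
```

In this model, two nodes with 4 CPUs each take 6 single-CPU tasks without ever setting the flag, while two nodes with 2 CPUs each set it once before the last two tasks are placed.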

Now in the below example it gets apparently stuck in an infinite loop of oversubscription. The question is why? From my reading there are two ways for this to happen:

1. maxtasks is huge and therefore it never exits the loop, but somehow figures out it is oversubscribed.

2. tid is never advanced

Clearly this is an odd state as I'm having a hard time trying to come up with ways that either of these could be satisfied. It is interesting that by default it is assumed that space_remaining=false. Thus it is basically assumed that the host is already oversubscribed. It will never be set to true unless it gets into the loop that sets it. If it somehow bypassed it then it wouldn't advance tid, and it would declare oversubscribed.
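
The second hypothesis (tid never advancing) can be illustrated with a toy model. Everything here is an assumption for the sake of the sketch: the function passes_needed, the max_passes cap, and the inner per-node loop are not taken from SLURM. The cap stands in for the runaway loop so that the example terminates.

```c
#include <stdbool.h>

/* Hypothetical model, not the SLURM source. Returns the number of outer
 * passes needed to place maxtasks tasks, or -1 if max_passes is reached
 * first (standing in for a loop that would otherwise never exit). */
static int passes_needed(const int *avail_cpus, int nhosts, int maxtasks,
                         int cpus_per_task, int max_passes)
{
    bool over_subscribe = false;
    int passes = 0;
    int tid, i, n;

    for (tid = 0, i = cpus_per_task; tid < maxtasks; i += cpus_per_task) {
        bool space_remaining = false;

        if (passes == max_passes)
            return -1;                      /* tid is not reaching maxtasks */
        passes++;

        /* If nhosts is 0 this loop body never runs, so tid never advances
         * and space_remaining stays false on every pass. */
        for (n = 0; n < nhosts && tid < maxtasks; n++) {
            if (i <= avail_cpus[n] || over_subscribe) {
                tid++;
                if (i + cpus_per_task <= avail_cpus[n])
                    space_remaining = true;
            }
        }
        if (!space_remaining)
            over_subscribe = true;
    }
    return passes;
}
```

With two 2-CPU nodes and 6 single-CPU tasks the model finishes in 3 passes. With nhosts set to 0 it hits the cap, and every pass after the first starts with the flag set, which would look like an endless stream of oversubscribe errors. Whether the real scheduler can reach such a state is exactly the open question.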

For the record we are running on a normal (non-Bluegene) cluster. I noticed that nhosts (at least in the comments) appears to be for BG's but it is used all over the place. I will also note that we have set:


# SCHEDULING
DefMemPerCPU=100
FastSchedule=2
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=defer
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

If that helps.  Any insight would be appreciated.

-Paul Edmon-


On 08/06/2013 09:30 PM, Paul Edmon wrote:

So our SLURM 2.5.7 install went down this evening with a massive bout of:
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950 NodeList=holy2b09103 #CPUs=4
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113951 NodeList=sandy-rc01 #CPUs=4
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
I tried to save it but to no avail. It had already generated 125 GB of logfile. I kill -9'd it after trying a reconfig and a restart. After restarting, it dropped all the jobs again.
I dug through the source code and found the following statement for this error:
for (tid = 0, i = job_ptr->details->cpus_per_task; (tid < maxtasks);
     i += job_ptr->details->cpus_per_task) { /* cycle counter */
        bool space_remaining = false;
        if (over_subscribe) {
                /* 'over_subscribe' is a relief valve that guards
                 * against an infinite loop, and it *should* never
                 * come into play because maxtasks should never be
                 * greater than the total number of available cpus
                 */
                error("cons_res: _compute_c_b_task_dist oversubscribe");
        }
This apparently is a safety valve, but you should never hit it. I don't know if slurm-2.6.0 fixes this but I'm not sure what exactly causes the scheduler to get into this state or if it is avoidable. I don't even know if it would ever recover from this or if it would just stay like this forever.
Could anyone tell me why this happened or give me any insight as to whether this is a bug or not? This is the second time this has happened. We cannot keep losing jobs like this; we had over 10,000 jobs in flight. Otherwise we will have to abandon SLURM and use our old system, as at least it was stable. Please, I need answers. I would rather not abandon SLURM as I like many of the features it has. However, we will have to if we can't maintain stability. We aren't even at our full capacity yet as this is still beta testing. If it can't take this load this project is sunk.
SLURM should at least fail in such a way that it never loses jobs. Is there something I am missing? Some way of making it more robust? We've tried the HA failover but that didn't work when this happened and caused other problems when the install went split-brained.
-Paul Edmon-
