So I'm trying to understand why slurm gets into the
_compute_c_b_task_dist oversubscribe state in the first place. The
relevant code is in src/plugins/select/cons_res/dist_task.c, in the
routine called _compute_c_b_task_dist.
I'm not sure my interpretation is correct, as I'm not super familiar
with all the subtleties of the slurm source code. However, here is my
reconstruction of what is going on:
1. Initially the oversubscribe flag is set to false.
2. SLURM figures out how much space the job it is running requires and
how many nodes it needs to be distributed over.
3. SLURM then figures out how many tasks it is running.
4. SLURM sets the space_remaining flag to false, basically assuming that
there is no space left on the node in question.
5. SLURM counts up how many tasks it is running and compares it against
the maximum number of allowed tasks.
6. a. If SLURM ends up with fewer than the maximum number of tasks
allowed, it says there is space remaining and moves on.
6. b. If SLURM ends up with more than the maximum number of tasks
allowed, it complains that it is oversubscribed.
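The steps above can be sketched as a toy model in C. This is not the
Slurm source: the function name toy_task_dist, the avail array, and in
particular the inner loop over hosts and the "|| over_subscribe" bypass
are my assumptions about the parts of the routine that the excerpt
further down does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only, NOT the Slurm source. All names here are hypothetical.
 * avail[n] is the number of CPUs available on host n.
 * Returns how many outer passes started with over_subscribe already set,
 * i.e. how many times the real routine would have logged the error. */
static unsigned toy_task_dist(const unsigned *avail, unsigned nhosts,
                              unsigned cpus_per_task, unsigned maxtasks,
                              unsigned *tasks_placed)
{
    bool over_subscribe = false;      /* starts out false */
    unsigned tid = 0;                 /* tasks placed so far */
    unsigned oversub_passes = 0;

    assert(cpus_per_task > 0 && nhosts > 0);

    for (unsigned i = cpus_per_task; tid < maxtasks; i += cpus_per_task) {
        bool space_remaining = false; /* pessimistic default on each pass */

        if (over_subscribe)
            oversub_passes++;         /* the excerpt calls error() here */

        /* Assumed inner loop: try to place one task per host per pass. */
        for (unsigned n = 0; n < nhosts && tid < maxtasks; n++) {
            if (i <= avail[n] || over_subscribe) {
                tid++;
                if (i + 1 <= avail[n])
                    space_remaining = true;
            }
        }

        /* No host had room left after this pass: flag oversubscription. */
        if (!space_remaining)
            over_subscribe = true;
    }

    *tasks_placed = tid;
    return oversub_passes;
}
```

In this model, two hosts with 4 CPUs each and maxtasks=8 finish with no
oversubscribed pass, and maxtasks=10 finishes after exactly one. If the
real routine behaves like this, a large maxtasks alone would produce a
bounded number of error lines rather than an endless stream.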
Now in the example below it apparently gets stuck in an infinite loop of
oversubscription. The question is why? From my reading there are two
ways for this to happen:
1. maxtasks is huge, so it never exits the loop, but it somehow still
figures out it is oversubscribed.
2. tid is never advanced.
Clearly this is an odd state as I'm having a hard time trying to come up
with ways that either of these could be satisfied. It is interesting
that by default it is assumed that space_remaining=false. Thus it is
basically assumed that the host is already oversubscribed. It will
never be set to true unless it gets into the loop that sets it. If it
somehow bypassed it then it wouldn't advance tid, and it would declare
oversubscribed.
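To make the bypass case concrete, here is a stripped-down C sketch. It
is a toy, not the Slurm code: toy_stuck_passes and its cap parameter are
hypothetical, and a host count of zero is only one assumed way the part
of the loop that advances tid could be skipped. The cap exists purely so
the demonstration terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only, NOT the Slurm source. All names here are hypothetical.
 * Runs at most 'cap' outer passes and returns how many of them began
 * with over_subscribe set, i.e. how many error lines would be logged. */
static unsigned toy_stuck_passes(unsigned nhosts, unsigned maxtasks,
                                 unsigned cap)
{
    bool over_subscribe = false;
    unsigned tid = 0, passes = 0, logged = 0;

    assert(cap > 0);

    while (tid < maxtasks && passes < cap) {
        bool space_remaining = false;   /* same pessimistic default */

        if (over_subscribe)
            logged++;                   /* stands in for the error() call */

        /* If this loop body never runs, tid is never advanced. */
        for (unsigned n = 0; n < nhosts && tid < maxtasks; n++) {
            tid++;
            space_remaining = true;
        }

        if (!space_remaining)
            over_subscribe = true;
        passes++;
    }
    return logged;
}
```

With nhosts=0, maxtasks=4 and cap=1000 this returns 999: every pass
after the first logs the error and tid stays at 0, so without the cap it
would never exit. With nhosts=2 it returns 0 and finishes in two passes.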
For the record we are running on a normal (non-Bluegene) cluster. I
noticed that nhosts (at least in the comments) appears to be for BG's
but it is used all over the place. I will also note that we have set:
# SCHEDULING
DefMemPerCPU=100
FastSchedule=2
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=defer
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
If that helps. Any insight would be appreciated.
-Paul Edmon-
On 08/06/2013 09:30 PM, Paul Edmon wrote:
So our SLURM 2.5.7 install went down this evening with a massive bout of:
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950 NodeList=holy2b09103 #CPUs=4
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113951 NodeList=sandy-rc01 #CPUs=4
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
[2013-08-06T17:12:20-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
I tried to save it but to no avail. It had already generated 125 GB
of logfile. I kill -9'd it after trying a reconfig and a restart.
After restarting it it dropped all the jobs again.
I dug through the source code and found the following statement for
this error:
    for (tid = 0, i = job_ptr->details->cpus_per_task ; (tid < maxtasks);
         i += job_ptr->details->cpus_per_task ) { /* cycle counter */
            bool space_remaining = false;
            if (over_subscribe) {
                    /* 'over_subscribe' is a relief valve that guards
                     * against an infinite loop, and it *should* never
                     * come into play because maxtasks should never be
                     * greater than the total number of available cpus
                     */
                    error("cons_res: _compute_c_b_task_dist oversubscribe");
            }
This apparently is a safety valve that you should never hit. I don't
know if slurm-2.6.0 fixes this, and I'm not sure what exactly causes
the scheduler to get into this state or if it is avoidable. I don't
even know if it would ever recover from this or if it would just stay
like this forever.
Could anyone tell me why this happened or give me any insight as to
whether this is a bug or not? This is the second time this has
happened. We had over 10,000 jobs in flight, and we cannot keep losing
jobs like this; otherwise we will have to abandon SLURM and go back to
our old system, as at least it was stable. Please, I need answers. I
would rather not abandon SLURM as I like many of the features it has.
However, we will have to if we can't maintain stability. We aren't
even at our full capacity yet as this is still beta testing. If it
can't take this load this project is sunk.
SLURM should at least fail in such a way that it never loses jobs. Is
there something I am missing? Some way of making it more robust?
We've tried the HA failover but that didn't work when this happened
and caused other problems when the install went split-brained.
-Paul Edmon-