[slurm-dev] Re: backfill scheduler look ahead?

Eckert, Phil Fri, 21 Feb 2014 07:50:36 -0800

Bill,

In addition to what Alejandro said, there is another consideration.


You listed the two highest-priority jobs and the 30-core job. I'm
assuming that the "..." stands for a number of other queued jobs ahead of
the 30-core job. You didn't state it, but I'm also assuming there were
other jobs running at the time.

If both of these assumptions are true, then you need to consider the
completion times of all the running jobs in relation to the needs of the
jobs ahead of the 30-core job in the queue. The 60 free cores may be
needed by a higher-priority job that is waiting for one or more running
jobs that will complete in less than two hours and free the rest of the
cores it needs.

We have been using backfill batch systems, including SLURM, here at LLNL
for over 20 years, and answering this question for our users is never
easy. A conclusive way to determine when a job will either start or be
backfilled is to run squeue and sinfo, then plot the results on X-Y axes
of time and nodes, drawing each job as the block of resources it will
use. This is a bit painful, but it provides a lot of insight into backfill.
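
As a rough illustration of that mapping, here is a minimal Python sketch
of a conservative backfill plan. It is not Slurm's implementation. It
assumes a single pool of cores and ignores memory, licenses,
reservations, partition limits and node sharing. The function names are
made up for this example, and the job "2xx" and the 40-core release at
1.5 hours are invented numbers standing in for the "..." in the queue.

```python
def free_cores(t, free_now, releases, placed):
    """Cores free at time t, in hours from now."""
    freed = sum(cores for end, cores in releases if end <= t)
    used = sum(cores for start, end, cores in placed if start <= t < end)
    return free_now + freed - used


def earliest_start(cores, hours, free_now, releases, placed):
    """Earliest time at which `cores` cores stay free for `hours` hours."""
    # Free cores only change at these instants, so only they need checking.
    changes = {0.0}
    changes.update(end for end, _ in releases)
    for start, end, _ in placed:
        changes.update((start, end))
    for start in sorted(changes):
        window = [t for t in changes if start <= t < start + hours]
        if all(free_cores(t, free_now, releases, placed) >= cores
               for t in window):
            return start
    return None  # never fits, e.g. the job asks for more cores than exist


def plan(queue, free_now, releases):
    """Place jobs in priority order; no job may delay an earlier one."""
    placed, starts = [], {}
    for name, cores, hours in queue:
        start = earliest_start(cores, hours, free_now, releases, placed)
        starts[name] = start
        if start is not None:
            placed.append((start, start + hours, cores))
    return starts


# Bill's numbers only: 60 cores free now, 252 more freed in 1 hour.
bill = plan([("201", 200, 24), ("300", 30, 2)],
            free_now=60, releases=[(1.0, 252)])

# Same, plus a hypothetical job "2xx" ahead of job 300 and a hypothetical
# running job that frees 40 cores in 1.5 hours.
crowded = plan([("201", 200, 24), ("2xx", 150, 24), ("300", 30, 2)],
               free_now=60, releases=[(1.0, 252), (1.5, 40)])

print(bill)
print(crowded)
```

With only the numbers Bill gave, this model starts job 300 right away.
In the second case the invented job "2xx" reserves its cores at 1.5
hours, and job 300 would overlap that reservation, so it has to wait.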

I hope this is helpful.

Phil Eckert
LLNL




On 2/21/14 2:57 AM, "Alejandro Lucero Palau" <[email protected]>
wrote:

>
>Hi Bill,
>
>I think Moe gave you the right answer, but it was so concise that it can
>easily be misunderstood.
>
>If we take the situation you describe and analyze it simply from the
>backfilling algorithm's point of view, the answer is that job 300 should
>be scheduled, without any impact on jobs 201 and 202. However, what I
>think Moe tried to say is that there are other details to take into
>account, not just the total number of free cores. Those cores could
>really be free but, for example, unusable because of per-node memory
>requirements. Or maybe you have reservations holding some cores, which
>you cannot see just by looking at free cores. Or you have license or
>partition limits. Or your system does not allow node sharing, so free
>cores do not mean you can use them. All this assumes you do not have
>other pending jobs between job 201 and job 300. There is a backfilling
>parameter, max_job_bf, which limits the number of jobs processed by the
>algorithm. The default is 50. Also, because backfilling is so
>demanding, it is suspended after some time. Before resuming, if
>something has changed in the system, the backfilling algorithm starts
>again from scratch. You can avoid this with the bf_continue parameter.
>
>As you can see, there are a lot of details that could have an impact. We
>have suffered from this situation in the past, and it is not always
>trivial to see the reason behind scheduling decisions. I added extra
>debug information to the backfilling algorithm to see how resources were
>being reserved by pending jobs, and it was helpful. Maybe it would be
>interesting to have some way of knowing why a job cannot be scheduled.
>Other resource managers give this detailed information, but it would
>have a cost, of course.
>
>On 02/21/2014 12:45 AM, Bill Wichser wrote:
>>
>> Moe,
>>
>> That's quite an obfuscated answer!  I was looking for a "yes, this is
>> the expected behavior" or "no, something is amiss."
>>
>> In the case presented, I'll say again, it is clearly evident that the
>> waiting job, number 300, can run.  It has free cores, and the
>> higher-priority waiting job will have plenty of cores available when
>> the job it is waiting on finishes.  Yet job 300 does not start, simply
>> because the time it requires would overlap the current start time of
>> the currently waiting job, #201.
>>
>> But the assertion that job 201 would be held up by starting job 300 is
>> completely incorrect in this case.
>>
>> Now if this is the way the scheduler works, by being simple-minded
>> about time constraints, then it is what it is.  I'm asking only if
>> this behavior is the expected behavior.  I think you are trying to say
>> that indeed this is the case.
>>
>> Sincerely,
>> Bill
>>
>>
>> On 2/20/2014 1:21 PM, Moe Jette wrote:
>>>
>>> Slurm uses what is known as a conservative backfill scheduling
>>> algorithm. No job will be started that adversely impacts the expected
>>> start time of _any_ higher priority job. The scheduling can also be
>>> affected by a job's requirements for memory, generic resources,
>>> licenses, and resource limits.
>>>
>>> Moe Jette
>>> SchedMD LLC
>>>
>>>
>>> Quoting Bill Wichser <[email protected]>:
>>>
>>>>
>>>> Just a question on expected behavior of the backfill scheduler. This
>>>> is an SMP machine if that matters.  Scheduler is backfill with no
>>>> preemption.
>>>>
>>>> I have a number of jobs queued.  There are three which matter,
>>>> ordered by priority.  In the current state I have 60 free cores.
>>>>
>>>> job 201 needs 200 cores and will start in 1 hour requiring 24 hours
>>>> of runtime
>>>> job 202 needs 250 cores and will start in 5 hours requiring 24 hours
>>>> of runtime
>>>> ...
>>>> job 300 needs 30 cores and will start in 300 hours requiring 2 hours
>>>> of runtime
>>>>
>>>> The job completing in 1 hour will free 252 cores.
>>>>
>>>> Clearly, starting job 300 will not impact job 201's start time in
>>>> any way.  Yet it will not start since the time overlaps the expected
>>>> 1 hour start time of job 201.  Is this the expected behavior?  I
>>>> haven't yet checked the source code to verify that this just looks
>>>> at the trivial impact on the next job but I'd expect the scheduler
>>>> to be able to look a little deeper than this.
>>>>
>>>> Bill
>>>>
>>>
>
>
