On 28.07.2015 at 21:31, Carl G. Riches wrote:

> On Tue, 28 Jul 2015, Reuti wrote:
> 
>> Hi,
>> 
>> On 28.07.2015 at 20:03, Carl G. Riches wrote:
>> 
>>> 
>>> We have a Rocks cluster (Rocks release 6.1) with the SGE roll (rocks-sge
>>> 6.1.2 [GE2011]).  Usage levels have grown faster than the cluster's 
>>> capacity.  We have a single consumable resource (number of CPUs) that we 
>>> are trying to manage in a way that is acceptable to our users.  Before 
>>> diving into a solution, I would like to find out whether others have dealt 
>>> with our particular problem.
>>> 
>>> Here is a statement of the problem:
>>> - There is a fixed amount of a resource called "number of CPUs" available.
>>> - There are many possible users of the resource "number of CPUs".
>>> - A variable amount of the resource is in use at any given time.
>>> - When the resource is exhausted, requests to use the resource queue up
>>> until some amount of the resource becomes available again.
>>> - In the event that resource use requests have queued up, we must manage
>>> the resource in some way.
>>> 
>>> The way we would like to manage the resource is this:
>>> 1. In the event that no requests for the resource are queued up, do
>>>  nothing.
>>> 2. In the event that a single user is consuming all of the resource and
>>>  all queued requests for the resource belong to the same user that is
>>>  using all of the resource, do nothing.
>>> 3. In the event that a single user is consuming all of the resource and
>>>  not all queued requests for the resource belong to the same user that
>>>  is using all of the resource, "manage the resource".
>>> 4. In the event that there are queued requests for the resource and the
>>>  resource is completely used by more than one user, "manage the
>>>  resource".
>>> 
>>> By "manage the resource" we mean:
>>> a. If a user is consuming more than some arbitrary limit of the resource
>>>  (call it L), suspend one of that user's jobs.
>>> b. Determine how much of the resource (CPUs) is made available by the
>>>  prior step.
>> 
>> None. A suspended job in SGE still consumes the granted resources. The only 
>> option would be to reschedule the job to put it into the pending state again 
>> (and maybe put it on hold beforehand, so that it doesn't get scheduled again 
>> instantly).
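>> 
>> For example (an untested sketch, with 4711 standing in for a real job id; 
>> note that rescheduling needs the job to be rerunnable, or qmod's -f flag):
>> 
>>    qhold 4711       # user hold, so it won't be dispatched again at once
>>    qmod -rj 4711    # reschedule: the running job goes back to pending
>>    qrls 4711        # later: release the hold so it may be scheduled again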
> 
> I didn't know that, thanks!
> 
>> 
>> 
>>> c. Find a job in the list of queued requests that uses less than or equal
>>>  to the resources made available in the last step _and_ does not belong
>>>  to a user currently using some arbitrary limit L (or more) of the
>>>  resource, then dispatch the job.
>>> d. Repeat the prior step until the available resource is less than the
>>>  resource required by every remaining job in the list of queued requests.
>>> 
>>> Steps 1-4 above would be repeated at regular intervals to ensure that the
>>> resource is shared.
>> 
>> e. unsuspend the prior suspended job in case of...?
> 
> Well, that step wasn't explicitly asked for... but the users probably assume 
> that a suspended (or held) job will eventually be restarted.
> 
>> 
>> 
>>> Has anyone on the list tried to do this sort of queue management?  If so,
>>> how did you go about the task?  Is this possible with Grid Engine?
>> 
>> All this management must be done by a co-scheduler which you would have to 
>> program to fulfill the above requirements, i.e. putting jobs on hold, 
>> rescheduling them, removing the hold on other jobs... It's not built into 
>> SGE.
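>> 
>> One cycle of such a co-scheduler could look roughly like this (an untested 
>> sketch which only implements step a. above; the limit L and the awk column 
>> positions, which assume the default qstat output of GE2011, would have to 
>> be adapted, and steps b.-d. plus the later qrls are still missing):
>> 
>>    #!/bin/sh
>>    L=64    # assumed per-user slot limit
>>    # sum up the slots of all running jobs per user (job id is $1,
>>    # owner $4, slots $9) and pick one job of every user above L
>>    qstat -u '*' -s r | awk -v L="$L" '
>>        NR > 2 { used[$4] += $9; job[$4] = $1 }
>>        END { for (u in used) if (used[u] > L) print job[u] }
>>    ' | while read jid
>>    do
>>        qhold "$jid"       # hold first, so it can't start again at once
>>        qmod -rj "$jid"    # then reschedule it back to the pending state
>>    done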
> 
> That's what I was afraid of.
> 
>> 
>> Would your users be happy to see that they get the promised computing time 
>> over a longer timeframe, so that a share-tree policy could be used? Of 
>> course, no user will see the instant effect of a just-submitted job starting 
>> immediately, but over time they can check that the granted computing time 
>> matched the promised one. SGE is also able to put a penalty on a user who 
>> used too much computing time in the past: his jobs will then get a lower 
>> priority whenever other users' jobs are pending.
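>> 
>> How fast past usage decays is set in the scheduler configuration; a sketch 
>> (the values are only examples, `qconf -ssconf` shows the current settings):
>> 
>>    qconf -msconf
>>    # and set in the editor e.g.:
>>    #   weight_tickets_share  10000   give the share tree an influence
>>    #   halftime              168     half-life of past usage in hours
>>    #   compensation_factor   5       limit over-/under-compensation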
> 
> I'm not familiar with share tree policies.  What are they and how are they 
> used?

Some links about the share-tree policy:

http://www.informit.com/articles/article.aspx?p=101162

http://www.bioteam.net/wp-content/uploads/2009/09/06-SGE-6-Admin-Policies.pdf

http://gridscheduler.sourceforge.net/howto/geee.html

https://blogs.oracle.com/sgrell/entry/n1ge_6_scheduler_hacks_the

There is also a short man page in SGE: `man share_tree`.
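
A minimal share tree which gives all users an equal share via the special 
"default" node could look like this (the shares values are arbitrary 
examples; the file format is described in the man page). Put this in a file, 
say stree.txt:

   id=0
   name=Root
   type=0
   shares=1
   childnodes=1
   id=1
   name=default
   type=0
   shares=100
   childnodes=NONE

and load it with:

   qconf -Astree stree.txt   # qconf -sstree shows the result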

-- Reuti


> Thanks,
> Carl

