On Tue, 28 Jul 2015, Reuti wrote:
> Hi,
> On 28.07.2015 at 20:03, Carl G. Riches wrote:
>> We have a Rocks cluster (Rocks release 6.1) with the SGE roll (rocks-sge
>> 6.1.2 [GE2011]). Usage levels have grown faster than the cluster's capacity.
>> We have a single consumable resource (number of CPUs) that we are trying to
>> manage in a way that is acceptable to our users. Before diving into a
>> solution, I would like to find out if others have dealt with our particular
>> problem.
>> Here is a statement of the problem:
>> - There is a fixed amount of a resource called "number of CPUs" available.
>> - There are many possible users of the resource "number of CPUs".
>> - There is a variable amount of the resource in use at any given time.
>> - When the resource is exhausted, requests to use the resource queue up
>>   until some amount of the resource becomes available again.
>> - In the event that resource use requests have queued up, we must manage
>>   the resource in some way.
>> The way we would like to manage the resource is this:
>> 1. In the event that no requests for the resource are queued up, do
>>    nothing.
>> 2. In the event that a single user is consuming all of the resource and
>>    all queued requests for the resource belong to the same user that is
>>    using all of the resource, do nothing.
>> 3. In the event that a single user is consuming all of the resource and
>>    not all queued requests for the resource belong to the same user that
>>    is using all of the resource, "manage the resource".
>> 4. In the event that there are queued requests for the resource and the
>>    resource is completely used by more than one user, "manage the
>>    resource".
>> By "manage the resource" we mean:
>> a. If a user is consuming more than some arbitrary limit of the resource
>>    (call it L), suspend one of that user's jobs.
>> b. Determine how much of the resource (CPUs) is made available by the
>>    prior step.
> None. Suspending a job in SGE still consumes the granted resources. The
> only option would be to reschedule the job to put it back into pending
> state (and maybe put it on hold first, to avoid it getting scheduled
> again instantly).
I didn't know that, thanks!
>> c. Find a job in the list of queued requests that uses less than or equal
>>    to the resources made available in the last step _and_ does not belong
>>    to a user currently using some arbitrary limit L (or more) of the
>>    resource, then dispatch the job.
>> d. Repeat the prior step until the available resource is less than the
>>    resource required by any job in the list of queued requests.
>> Steps 1-4 above would be repeated at regular intervals to ensure that the
>> resource is shared.
> e. Unsuspend the previously suspended job in case of...?
Well, that step wasn't explicitly asked for... but the users probably
assumed that a suspended (or held) job would eventually be restarted.
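To make steps a-d concrete, the selection logic might look like the sketch below, with Reuti's correction applied: the over-limit job is requeued (not merely suspended) so its slots actually become free, and step e falls out naturally because the requeued job simply waits in the pending list again. Jobs are hypothetical (job_id, user, slots) tuples:

```python
# Sketch of steps a-d: pick one job of an over-limit user to requeue,
# then fill the freed slots with pending jobs of under-limit users.
# A real version would drive qstat/qmod instead of operating on tuples.

def pick_actions(running, pending, limit):
    """Return (job_to_requeue, jobs_to_release) or (None, [])."""
    usage = {}
    for _, user, slots in running:
        usage[user] = usage.get(user, 0) + slots
    # step a: one job belonging to a user over the limit L
    victim = next((j for j in running if usage[j[1]] > limit), None)
    if victim is None:
        return None, []
    freed = victim[2]                       # step b: slots made available
    released = []
    for job in pending:                     # steps c/d: dispatch fitting jobs
        jid, user, slots = job
        if slots <= freed and usage.get(user, 0) < limit:
            released.append(job)
            freed -= slots
            usage[user] = usage.get(user, 0) + slots
    return victim, released
```
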
>> Has anyone on the list tried to do this sort of queue management? If so,
>> how did you go about the task? Is this possible with Grid Engine?
> All this management must be done by a co-scheduler, which you have to
> program to fulfill the above requirements, i.e. putting jobs on hold,
> rescheduling them, removing the hold on other jobs... It's not built
> into SGE.
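The commands such a co-scheduler would issue could be generated like this. `qalter -h u/-h U` (set/release a user hold) and `qmod -r` (reschedule a running job) are standard SGE commands, but verify the flags against your version's man pages; the job IDs below are placeholders and the sketch only builds the command strings rather than running them:

```python
# Dry-run sketch: build the SGE command sequences a co-scheduler would run.
# Order per Reuti's advice: hold first, then reschedule, so the job does
# not get dispatched again instantly.

def requeue_commands(job_id):
    return [
        f"qalter -h u {job_id}",   # put a user hold on the job
        f"qmod -r {job_id}",       # reschedule it: back to pending, slots freed
    ]

def release_commands(job_id):
    return [f"qalter -h U {job_id}"]   # remove the user hold again later

for cmd in requeue_commands(4242) + release_commands(4711):
    print(cmd)
```
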
That's what I was afraid of.
> Would your users be happy to see that they got the promised computing
> time over a timeframe, so that a share-tree policy could be used? Of
> course, no user will see the instant effect that his just-submitted job
> starts immediately, but over time they can check that the granted
> computing time matched what was promised. SGE is also able to put a
> penalty on a user in case he used too much computing time in the past.
> Then his jobs will get a lower priority when other users' jobs are
> pending.
I'm not familiar with share tree policies. What are they and how are they
used?
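For what it's worth: a share tree assigns each user (or project) a number of "shares" in a hierarchy, and the scheduler biases pending-job priorities so that accumulated usage drifts toward the share ratios over a configurable half-life, penalizing users who recently consumed more than their share. A minimal two-user tree might look like the fragment below (node format as described in the share_tree(5) man page; the user names are placeholders, and the file would be loaded with `qconf -Astree <file>` or edited interactively with `qconf -mstree`):

```
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=alice
type=0
shares=50
childnodes=NONE
id=2
name=bob
type=0
shares=50
childnodes=NONE
```

Share tickets also need a nonzero weight in the scheduler configuration (`qconf -msconf`, parameters such as weight_tickets_share and halftime), though the exact knobs vary between Grid Engine versions.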
Thanks,
Carl
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users