Re: [gridengine users] Handling Time Slot Differentiation

William Hay Fri, 17 Aug 2012 01:30:54 -0700

On 16 August 2012 17:07, Brian Smith <[email protected]> wrote:


> I want to ditch the "queues-as-classifiers" model and use complex
> attributes instead.  Think a single "default" queue, but my jsv will now:
>
> ...
> # Set queue based on specified runtime
>      if [ -z "$hrt" ]; then
>          jsv_sub_add_param l_hard h_rt "01:00:00"
>          jsv_sub_add_param l_hard devel 1
>          do_correct="true"
>      else
>          do_correct="true"
>          if [ $hrt -le $((3600*1)) ]; then
>              jsv_sub_add_param l_hard devel 1
>          elif [ $hrt -gt $((3600*1)) -a $hrt -le $((3600*6)) ]; then
>              jsv_sub_add_param l_hard short 1
>          elif [ $hrt -gt $((3600*6)) -a $hrt -le $((3600*48)) ]; then
>              jsv_sub_add_param l_hard medium 1
>          elif [ $hrt -gt $((3600*48)) -a $hrt -le $((3600*168)) ]; then
>              jsv_sub_add_param l_hard long 1
>          elif [ $hrt -gt $((3600*168)) ]; then
>              jsv_sub_add_param q_hard "xlong"
>          fi
>      fi

> And global host gets configured as such:
> ...
> complex_values  ...,short=4096,devel=4096,medium=1768,long=1534
> ...
>
> We drop the urgency from h_rt and instead associate it with the complex
> attributes:
>
> $ qconf -sc | egrep '^(devel|short|medium|long)[ ]+'
> devel  devel    INT       <=    YES         YES        0        1000
> long   long     INT       <=    YES         YES        0        0
> medium medium   INT       <=    YES         YES        0        10
> short  short    INT       <=    YES         YES        0        100
>
> What say other GridEngine gurus about this approach?  I believe this
> will help with my resource reservation woes and at the very least,
> should make my scheduler iterations much shorter.  Is there a better
> way?  Are there any potential pitfalls I may have missed?
>
> Any input or suggestions would be appreciated.
>
We do something fairly similar to what you are proposing, although
only two of our complexes have an urgency associated with them: bonus
and penalty.  The JSV calculates appropriate values for them based on
our rules.  They are given the same weight (positive or negative) as
waiting time so that we can in effect add and subtract waiting time.
Rather than lots of elif statements for the rules, we have defined
data structures kept in Perl modules that the JSV loads.  At some
point I should change the modules to populate the data structures
from a non-Perl-specific config file or database.

We still have a large number of queues on our cluster, though, for
other reasons.  We used to have the JSV restrict queue selection with
-q name@@hostgroup-glob.  When I removed that and replaced it with a
bunch of numeric attributes to select the appropriate queue, the
scheduler started taking far less time to run.  I'm not saying it is
definitely causal, but it makes sense that comparing numeric values
would take less time than globbing (especially after hostgroup
expansion).  We'd previously tried reducing the number of queues we
had, but this seemed to make far less difference than switching to
queue selection via numeric comparisons.  The annoying thing is that
I'd originally planned to write it the way it is now, but I was
pressed for time and setting -q from the JSV was a lot quicker to
write.
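
To illustrate the "data structures instead of elif statements" idea,
here is a minimal sketch in shell rather than Perl.  It is not our
actual module: the RUNTIME_RULES table and the classify_runtime
function are hypothetical names, and the bounds are simply the
thresholds from the quoted JSV.

```shell
#!/bin/sh
# Hypothetical rule table: "upper bound in seconds:complex name", one rule
# per line, sorted by ascending bound.  The bounds mirror the 1h/6h/48h/168h
# thresholds in the quoted JSV; nothing here comes from a real site config.
RUNTIME_RULES="3600:devel
21600:short
172800:medium
604800:long"

# Print the class name for a runtime given in seconds.  Runtimes above
# every bound fall through to "xlong", as in the quoted script.
classify_runtime() {
    hrt=$1
    for rule in $RUNTIME_RULES; do
        limit=${rule%%:*}   # text before the first ':'
        name=${rule#*:}     # text after the first ':'
        if [ "$hrt" -le "$limit" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "xlong"
}
```

For example, classify_runtime 7200 prints "short".  In a real JSV the
result would then be handed to something like the jsv_sub_add_param
l_hard call in the quoted script; adding or changing a class only
means editing the table.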

That said, we still aren't scheduling as fast as I'd like.  I'm
currently working on modifying a locally developed pbs-alike feature.
We have a string complex called status that we use to record problems
with a node.  The JSV currently requests status=OK (the default) to
ensure a job doesn't get assigned to a known broken node, which means
we can use the queue's enabled/disabled flag to mean "disabled until
next reboot".  I've changed this so that, in addition, when we disable
a node a numeric value already used in queue selection is set (for
that host) to a value that prevents any jobs from running on it, so I
can drop the string matching request from the JSV.  I doubt this will
provide quite the same speedup as not globbing on the queue, though.
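
A rough sketch of the host-side half of that change, under stated
assumptions: "nodeok" is a made-up name standing in for the numeric
complex already used in queue selection, 0 is assumed to be a value no
job request can match, and the helper only prints the qconf command
instead of running it.  I am assuming qconf -mattr is the right way to
change one complex_values entry on an execution host.

```shell
#!/bin/sh
# Hypothetical names, not taken from any real configuration: BLOCK_ATTR
# stands in for the numeric complex, BLOCK_VALUE for a value that no job
# request is assumed to satisfy.
BLOCK_ATTR="nodeok"
BLOCK_VALUE="0"

# Print (rather than execute) the qconf command that would set the
# blocking value on one execution host, so this is safe to try anywhere.
block_node_cmd() {
    printf 'qconf -mattr exechost complex_values %s=%s %s\n' \
        "$BLOCK_ATTR" "$BLOCK_VALUE" "$1"
}
```

Calling block_node_cmd node01 prints the command for a host named
node01; the output would need checking against the local complex
definitions before being run for real.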

We don't currently use any RQS here.

William
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
