It sounds like "pre-scheduling": doing the math at submit-time to set bonus
and penalty, then telling the scheduler to schedule around only those
values?  That's a neat approach.
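
To check that I'm reading you right, here is a rough sketch of what I imagine
the submit-time math looks like.  This is my guess, not your actual JSV: the
thresholds and the bonus/penalty numbers are made up, the helper names
bonus_penalty and set_bonus_penalty are hypothetical, and the stub for
jsv_sub_add_param is only there so the snippet runs outside a real JSV.

```shell
#!/bin/sh
# Sketch only: thresholds and values below are invented for illustration.

# Stub so this runs standalone; inside a real JSV the function provided
# by jsv_include.sh would be used instead.
if ! command -v jsv_sub_add_param >/dev/null 2>&1; then
    jsv_sub_add_param() { echo "add $1 $2=$3"; }
fi

# Print "bonus penalty" for a runtime request given in seconds.
bonus_penalty() {
    hrt=$1
    if [ "$hrt" -le 3600 ]; then
        echo "1000 0"
    elif [ "$hrt" -le $((3600*6)) ]; then
        echo "100 0"
    elif [ "$hrt" -le $((3600*48)) ]; then
        echo "10 0"
    else
        echo "0 50"
    fi
}

# Attach both values to the job as hard resource requests, so the
# scheduler only has to look at two numeric complexes.
set_bonus_penalty() {
    set -- $(bonus_penalty "$1")
    jsv_sub_add_param l_hard bonus "$1"
    jsv_sub_add_param l_hard penalty "$2"
}
```

Calling set_bonus_penalty 7200 from the verify step would then request
bonus=100 and penalty=0 for a two-hour job, assuming complexes named bonus
and penalty exist with an urgency attached.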

What PBS-alike features were you implementing, just out of curiosity?

I really like the idea of the status tag.  I'm going to have one of my guys
implement that here.  That should help eke out the extra stability we're
looking for.

Thanks,
-Brian

On Fri, Aug 17, 2012 at 4:28 AM, William Hay <[email protected]> wrote:

> On 16 August 2012 17:07, Brian Smith <[email protected]> wrote:
>
>
> > I want to ditch the "queues-as-classifiers" model and use complex
> > attributes instead.  Think a single "default" queue, but my jsv will now:
> >
> > ...
> > # Set queue based on specified runtime
> >      if [ -z "$hrt" ]; then
> >          jsv_sub_add_param l_hard h_rt "01:00:00"
> >          jsv_sub_add_param l_hard devel 1
> >          do_correct="true"
> >      else
> >          do_correct="true"
> >          if [ $hrt -le $((3600*1)) ]; then
> >              jsv_sub_add_param l_hard devel 1
> >          elif [ $hrt -gt $((3600*1)) -a $hrt -le $((3600*6)) ]; then
> >              jsv_sub_add_param l_hard short 1
> >          elif [ $hrt -gt $((3600*6)) -a $hrt -le $((3600*48)) ]; then
> >              jsv_sub_add_param l_hard medium 1
> >          elif [ $hrt -gt $((3600*48)) -a $hrt -le $((3600*168)) ]; then
> >              jsv_sub_add_param l_hard long 1
> >          elif [ $hrt -gt $((3600*168)) ]; then
> >              jsv_sub_add_param q_hard "xlong"
> >          fi
> >      fi
>
> > And global host gets configured as such:
> > ...
> > complex_values  ...,short=4096,devel=4096,medium=1768,long=1534
> > ...
> >
> > We drop the urgency from h_rt and instead associate it with the complex
> > attributes:
> >
> > $ qconf -sc | egrep '^(devel|short|medium|long)[ ]+'
> > devel  devel    INT       <=    YES         YES        0        1000
> > long   long     INT       <=    YES         YES        0        0
> > medium medium   INT       <=    YES         YES        0        10
> > short  short    INT       <=    YES         YES        0        100
> >
> > What say other GridEngine gurus about this approach?  I believe this
> > will help with my resource reservation woes and at the very least,
> > should make my scheduler iterations much shorter.  Is there a better
> > way?  Are there any potential pitfalls I may have missed?
> >
> > Any input or suggestions would be appreciated.
> >
> We do something fairly similar to what you are proposing.  We have
> only two complexes with an urgency associated with them though: bonus
> and penalty.  The JSV calculates appropriate values for them based on
> our rules.  They are given the same weight (positive or negative) as
> waiting time so that we can in effect add and subtract waiting time.
> Rather than lots of elif statements for the rules though we have
> defined data structures that are kept in Perl modules that the JSV
> loads.  At some point I should change the modules to populate the data
> structures from a non-perl specific config-file/database whatever.  We
> still have a large number of queues on our cluster though for other
> reasons.  We used to have the JSV restrict queue selection with -q
> name@@hostgroup-glob.  When I removed that and replaced it with a
> bunch of numeric attributes to select the appropriate queue the
> scheduler started taking far less time to run.  Not saying it is
> causal but it makes sense that comparing numeric values would take
> less time than globbing (especially after hostgroup expansion).  We'd
> previously tried reducing the number of queues we had but this seemed
> to make far less difference than switching to queue selection via
> numeric comparisons.  The annoying thing is I'd originally planned to
> write it the way it is now but I was pressed for time and setting -q
> from the JSV was a lot quicker to write.
>
> That said we still aren't scheduling as fast as I'd like.  I'm
> currently working on modifying a locally developed pbs-alike feature.
> We have a string complex called status that we use to record problems
> with a node.  The JSV currently requests status=OK (the default) to
> ensure it doesn't get assigned to a known broken node which means we
> can use the queue's enabled/disabled flag to mean "disabled until
> next reboot".  I've changed this so that, in addition, a numeric value
> already used in queue selection is set to a value (for the host) that
> prevents any jobs from running on the node when we disable it so I can
> drop the string matching request from the JSV.  I doubt this will
> provide quite the same speedup as not globbing on the queue though.
>
> We don't currently use any RQS here.
>
> William
>



-- 
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
