Re: [gridengine users] Handling Time Slot Differentiation

Reuti Thu, 16 Aug 2012 10:47:56 -0700

On 16.08.2012 at 18:07, Brian Smith wrote:

> I know that in a lot of scheduling environments, queues are used such as 
> short, long, etc. to differentiate different classes of jobs.  In our 
> environment, we're doing very much the same thing and using fancy pe_list 
> syntax to differentiate our various clusters.  It occurred to me, however, 
> that it might be better to ditch that strategy and instead use JSV and 
> complex attributes with a single default queue instance.
> 
> Let's say I want to have the job classes
> 
> devel <= 1hr
> short <= 6hr
> medium <= 48hr
> long <= 192hr
> xlong > 192hr (no limit, restricted access)
> 
> Our current methodology for ensuring QoS for those queues involves RQS & JSV. 
>  Schedule intervals are pretty long and hairy even for a <500 node cluster 
> due to the complex PE configuration:
> 
> {
>   name         host_slotcap
>   description  make sure only the right number of slots get used
>   enabled      TRUE
>   limit        queues * hosts {*} to slots=$num_proc
> }
> {
>   name         queue_slotcap
>   description  slot limits for each queue
>   enabled      TRUE
>   limit        queues xlong to slots=512
>   limit        queues long to slots=1436
>   limit        queues medium to slots=1724
> }
> {
>   name         user_slotcap
>   description  make sure users can only use so much
>   enabled      TRUE
>   limit        users {*} to slots=512
> }
> 
> We use a jsv to classify the jobs into queues:
> ...
>    # Set queue based on specified runtime
>    if [ -z "$hrt" ]; then
>        jsv_sub_add_param q_hard "devel"
>        jsv_sub_add_param l_hard h_rt "01:00:00"
>        do_correct="true"
>    else
>        do_correct="true"
>        if [ $hrt -le $((3600*1)) ]; then
>            jsv_sub_add_param q_hard "devel"
>        elif [ $hrt -gt $((3600*1)) -a $hrt -le $((3600*6)) ]; then
>            jsv_sub_add_param q_hard "short"
>        elif [ $hrt -gt $((3600*6)) -a $hrt -le $((3600*48)) ]; then
>            jsv_sub_add_param q_hard "medium"
>        elif [ $hrt -gt $((3600*48)) -a $hrt -le $((3600*168)) ]; then
>            jsv_sub_add_param q_hard "long"
>        elif [ $hrt -gt $((3600*168)) ]; then
>            jsv_sub_add_param q_hard "xlong"
>        fi
>    fi
> ...
> 
> We also use my github project for pbs-esque parallel environment support: 
> https://github.com/brichsmith/gepetools
> 
> This means each queue has a complicated PE configuration:
> 
> pe_list               make smp,[@cms_X7DBR-3=pe_cms_X7DBR-3_hg \
>                      pe_cms_X7DBR-3_hg.1 pe_cms_X7DBR-3_hg.2 \
>                      pe_cms_X7DBR-3_hg.4 pe_cms_X7DBR-3_hg.6 \
>                      pe_cms_X7DBR-3_hg.8], \
>                      ...
>                      [@MRI_Sun_X4150=pe_MRI_Sun_X4150_hg \
>                      pe_MRI_Sun_X4150_hg.1 pe_MRI_Sun_X4150_hg.2 \
>                      pe_MRI_Sun_X4150_hg.4 pe_MRI_Sun_X4150_hg.6 \
>                      pe_MRI_Sun_X4150_hg.8], \
>                      ...
>                      [@RC_Dell_R410=pe_RC_Dell_R410_hg \
>                      pe_RC_Dell_R410_hg.1 \
>                      pe_RC_Dell_R410_hg.12 pe_RC_Dell_R410_hg.2 \
>                      pe_RC_Dell_R410_hg.4 pe_RC_Dell_R410_hg.6 \
>                      pe_RC_Dell_R410_hg.8], \
>                      ...
>                      [@RC_HP_DL165G7=pe_RC_HP_DL165G7_hg \
>                      pe_RC_HP_DL165G7_hg.1 pe_RC_HP_DL165G7_hg.12 \
>                      pe_RC_HP_DL165G7_hg.16 pe_RC_HP_DL165G7_hg.2 \
>                      pe_RC_HP_DL165G7_hg.4 pe_RC_HP_DL165G7_hg.6 \
>                      pe_RC_HP_DL165G7_hg.8], \
>                      ...
> 
> We set a negative urgency value to h_rt so that longer jobs get lower 
> priority.
> 
> This approach seems to confuse the scheduler in terms of resource 
> reservations, so we pretty much can't do them and end up with the occasional 
> starving >128 slot parallel job.  It's also pretty difficult to determine 
> scheduling bottlenecks, etc.  It's elegant from a user perspective, but 
> somewhat difficult to administer and troubleshoot (we've whipped up some 
> tools to help, but there are still limitations).
> 
> I want to ditch the "queues-as-classifiers" model and use complex attributes 
> instead.  Think a single "default" queue, but my jsv will now:
> 
> ...
>    # Set job-class complex based on specified runtime
>    if [ -z "$hrt" ]; then
>        jsv_sub_add_param l_hard h_rt "01:00:00"
>        jsv_sub_add_param l_hard devel 1
>        do_correct="true"
>    else
>        do_correct="true"
>        if [ $hrt -le $((3600*1)) ]; then
>            jsv_sub_add_param l_hard devel 1
>        elif [ $hrt -gt $((3600*1)) -a $hrt -le $((3600*6)) ]; then
>            jsv_sub_add_param l_hard short 1
>        elif [ $hrt -gt $((3600*6)) -a $hrt -le $((3600*48)) ]; then
>            jsv_sub_add_param l_hard medium 1
>        elif [ $hrt -gt $((3600*48)) -a $hrt -le $((3600*168)) ]; then
>            jsv_sub_add_param l_hard long 1
>        elif [ $hrt -gt $((3600*168)) ]; then
>            jsv_sub_add_param q_hard "xlong"
>        fi
>    fi
> ...
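
A minimal sketch of the threshold logic from the JSV above, pulled out into
plain shell functions so it can be exercised without a running qmaster. The
names hrt_to_seconds and classify_hrt are hypothetical, not part of the JSV
shell library, and the sketch assumes h_rt arrives either as plain seconds or
as HH:MM:SS. It keeps the 168-hour boundary from the quoted script, although
the class list above gives 192 hours for "long", so one of the two presumably
needs adjusting.

```shell
#!/bin/sh

# Hypothetical helper: normalise an h_rt value to seconds.
# Assumes the value is either plain seconds or HH:MM:SS.
hrt_to_seconds() {
    case "$1" in
        *:*:*) echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }' ;;
        *)     echo "$1" ;;
    esac
}

# Hypothetical helper: map a runtime in seconds to a job class.
# Each branch is only reached if the previous upper bound was exceeded,
# so the explicit lower-bound (-gt) tests are not needed.
classify_hrt() {
    hrt=$1
    if   [ "$hrt" -le $((3600 * 1)) ];   then echo devel
    elif [ "$hrt" -le $((3600 * 6)) ];   then echo short
    elif [ "$hrt" -le $((3600 * 48)) ];  then echo medium
    elif [ "$hrt" -le $((3600 * 168)) ]; then echo long
    else                                      echo xlong
    fi
}

classify_hrt "$(hrt_to_seconds 06:00:00)"   # prints: short
```

Inside the JSV, the returned class name would then decide which
jsv_sub_add_param call to make. An empty h_rt would still need the 01:00:00
default applied first, as in the script above.
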
> 
> RQS gets simplified to:
> 
> {
>   name         host_slotcap
>   description  make sure only the right number of slots get used
>   enabled      TRUE
>   limit        hosts {*} to slots=$num_proc
> }
> {
>   name         user_slotcap
>   description  make sure users can only use so much
>   enabled      TRUE
>   limit        users {*} to slots=512
> }
> 
> And global host gets configured as such:
> ...
> complex_values  ...,short=4096,devel=4096,medium=1768,long=1534
> ...
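
A sketch of how those capacities might be added to the global host from the
command line instead of editing it by hand. It assumes the four complexes have
already been defined with "qconf -mc", and it only prints the qconf calls (a
dry run) so they can be reviewed first. The function name
print_global_capacities is hypothetical; the capacities are copied from the
complex_values line above.

```shell
#!/bin/sh

# Capacities copied from the quoted complex_values line.
capacities="short=4096 devel=4096 medium=1768 long=1534"

# Hypothetical helper: print one "qconf -aattr" call per class.
# Nothing is executed; pipe the output to sh once it looks right.
print_global_capacities() {
    for pair in $capacities; do
        echo "qconf -aattr exechost complex_values $pair global"
    done
}

print_global_capacities
```

Afterwards "qconf -se global" should list the new entries under
complex_values.
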
> 
> We drop the urgency from h_rt and instead associate it with the complex 
> attributes:
> 
> $ qconf -sc | egrep '^(devel|short|medium|long)[ ]+'
> devel  devel    INT       <=    YES         YES        0        1000
> long   long     INT       <=    YES         YES        0        0
> medium medium   INT       <=    YES         YES        0        10
> short  short    INT       <=    YES         YES        0        100
> 
> What say other GridEngine gurus about this approach?  I believe this will 
> help with my resource reservation woes and at the very least, should make my 
> scheduler iterations much shorter.  Is there a better way?  Are there any 
> potential pitfalls I may have missed?


Yes, it's good to use fewer RQS rules, as having several of them is known to 
sometimes leave jobs that will never get scheduled. And if you have no 
(automatic) subordination, everything can often be put in one queue in SGE.

But I wonder: will you also get rid of all the PEs, which you have used up to 
now to pack jobs onto certain exechosts due to the setup of the network?

-- Reuti


> Any input or suggestions would be appreciated.
> 
> Best Regards,
> 
> Brian Smith
> Sr. System Administrator
> Research Computing, University of South Florida
> 4202 E. Fowler Ave. SVC4010
> Office Phone: +1 813 974-1467
> Organization URL: http://rc.usf.edu
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

