Well, the changes were made with some additional input from the fellows here, particularly Stuart (big thanks!).
1. I set seq_no on my exec hosts based on the type of hardware. Newer, better stuff gets a low seq_no, the opposite for old stuff.

2. Set slots=$num_proc on the exec nodes (substituting the actual core count for num_proc).

3. Ditched the host_slotcap RQS AND the queue_slotcap RQS.

4. Enabled a default queue with a max run time equal to my previous long queue's. Drained and removed the devel and short queues. Waiting for medium and long to drain (will happen by next Friday).

5. Created devel, short, medium and long complex attributes with appropriate urgencies, and modified my JSV to request these based on runtime. Set limits on the global exec host so that only x long, y medium, and unlimited short & devel jobs can run at once.

6. Used Stuart's load_formula to pack jobs better than with seq_no alone (that was my previous approach). Since seq_no carries a lot of weight in this formula, setting it on the exec nodes based on the class of hardware ensures that jobs land on the best available nodes.

Rough sketches of how items 1, 2, 5 and 6 look in the config follow below.
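For items 1 and 2, this is roughly the shape of it; the queue name, hostgroup names, hostnames and core counts here are placeholders for illustration, not our real layout:

  % qconf -sq all.q | grep seq_no
  seq_no                99,[@newest_nodes=10],[@older_nodes=20],[@oldest_nodes=30]

  % qconf -se node0101 | grep complex_values
  complex_values        slots=16

i.e., each exec host gets a slots consumable equal to its core count, and the per-hostgroup seq_no overrides keep the newer hardware at the front of the line.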
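For item 5, the runtime classes are just consumable complexes (per-job consumables here) carrying different urgencies, and the global host caps how many of each can run at once. The urgencies and capacities below are placeholder numbers, not our actual values:

  % qconf -sc | egrep '^#name|devel|short|medium|long'
  #name     shortcut   type  relop  requestable  consumable  default  urgency
  devel     devel      INT   <=     YES          JOB         0        1000
  short     short      INT   <=     YES          JOB         0        500
  medium    medium     INT   <=     YES          JOB         0        250
  long      long       INT   <=     YES          JOB         0        100

  % qconf -se global | grep complex_values
  complex_values        devel=999999,short=999999,medium=256,long=64

devel and short get an effectively unlimited capacity; medium and long are the ones actually being capped.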
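And the relevant chunk of the JSV (bash flavor), which just maps the requested h_rt onto one of those classes. This is a trimmed-down sketch rather than the script verbatim, and the thresholds are made up:

  #!/bin/bash
  # Sketch of the h_rt -> runtime-class mapping; thresholds are examples only.

  jsv_on_start()
  {
     return
  }

  to_seconds()
  {
     # h_rt may arrive as plain seconds or as HH:MM:SS
     local IFS=:
     set -- $1
     if [ $# -eq 3 ]; then
        echo $(( 10#$1 * 3600 + 10#$2 * 60 + 10#$3 ))
     else
        echo $(( $1 ))
     fi
  }

  jsv_on_verify()
  {
     if jsv_sub_is_param l_hard h_rt; then
        rt=$(to_seconds "$(jsv_sub_get_param l_hard h_rt)")
        if   [ "$rt" -le 3600 ];   then jsv_sub_add_param l_hard devel 1
        elif [ "$rt" -le 14400 ];  then jsv_sub_add_param l_hard short 1
        elif [ "$rt" -le 172800 ]; then jsv_sub_add_param l_hard medium 1
        else                            jsv_sub_add_param l_hard long 1
        fi
        jsv_correct "tagged job with a runtime class based on h_rt"
     else
        jsv_reject "please request a run time, e.g. -l h_rt=4:00:00"
     fi
     return
  }

  . ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
  jsv_main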
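Scheduler-side (item 6) it's essentially Stuart's settings plus a short scheduling interval; something along these lines, with the interval value only illustrative:

  % qconf -ssconf | egrep 'schedule_interval|queue_sort_method|load_formula'
  schedule_interval        0:0:15
  queue_sort_method        load
  load_formula             seq_no*100+m_core-slots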
Schedule intervals are short and sweet, and job scheduling never skipped a beat during the changes. (BTW, commercial schedulers boast about their new features for modifying configuration on the fly... GE has been doing this for as long as I can remember.) ARs seem to be working correctly and I'll be testing out resource reservations later today. Utilization is still high and job turnover time has seen a nice improvement.

Thanks to everyone who provided input on this. I'll post the results of my resource reservation tests as well.

-Brian

On Fri, Aug 17, 2012 at 3:33 PM, Stuart Barkley <[email protected]> wrote:
> On Thu, 16 Aug 2012 at 12:07 -0000, Brian Smith wrote:
>
> > {
> >    name         host_slotcap
> >    description  make sure only the right number of slots get used
> >    enabled      TRUE
> >    limit        queues * hosts {*} to slots=$num_proc
> > }
>
> I used to have a rule similar to this (I didn't have the 'queues *'
> clause). I found that disabling the rule improved my scheduling
> performance by a huge amount (several minutes became a few seconds).
> You might try disabling this rule briefly and see if your scheduling
> performance changes.
>
> I'm still using 6.2u5; it is possible bugs have been fixed in other
> versions.
>
> I have queues defined to provide small, medium and large jobs (based
> upon run time). Limits on the queues determine which jobs will run in
> which queue. The significant definitions are:
>
> % qconf -sq small
> qname      small
> hostlist   @small
> seq_no     20
> s_rt       4:00:00
> h_rt       4:00:00
> %
>
> % qconf -sq medium
> qname      medium
> hostlist   @medium
> seq_no     30
> s_rt       48:00:00
> h_rt       48:00:00
> %
>
> % qconf -ssconf
> queue_sort_method   load
> load_formula        seq_no*100+m_core-slots
> default_duration    48:00:00
> %
>
> seq_no controls the order in which the queues are searched. qmaster
> will search until it finds a queue which can run the job, so the most
> limiting queues should come first.
>
> The size of the different queues is controlled by putting hosts in
> different host groups. Hosts become dedicated to jobs of a particular
> size (or smaller). This does allow "small" jobs to run on hosts for
> "large" jobs. In our case this is acceptable since the smaller jobs
> will finish in a reasonable amount of time if large jobs are queued.
>
> No JSV is required and users should not specify the queue, just the
> run time limit.
>
> We can adjust the run time limits and the number of hosts in each
> host group over time to best match the workload.
>
> Stuart
> --
> I've never been lost; I was once bewildered for three days, but never lost!
> -- Daniel Boone

--
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
