Hi all,

After scouring the mailing list and the web for a good solution, I've finally
relented and want to ask the experts for their opinions.  Here's the
scenario: I manage a small cluster of say 10 hosts, each with 8 cores.  The
majority (90%) of the jobs we get take < 1 hr, are CPU bound, and use all 8 cores
on one host.  But because of the nature of our work, users occasionally
submit jobs that run the same program but can take multiple days to
weeks.


The setup I have right now is an "all.q" with fair sharing turned on (plus
an urgent.q queue for urgent jobs).  So naturally, every week or so
one of the users submits a batch of long-running jobs, and these generally take
over the whole cluster, locking out everyone else until they are done.  So I
thought I'd turn to queue subordination and create a long.q that has no time
limit and is subordinate to all.q, which would have a 1 hr limit.  This
works well except that we have periods of time where we have *lots* of 1 hr
jobs.  What ends up happening is that the long.q jobs stay suspended
for...well...long periods of time and are essentially "locked out".
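
For reference, the subordination setup described above would look roughly
like the excerpt below. This is only a sketch of the relevant queue
attributes as "qconf -sq" would show them, not a complete or loadable queue
definition. The suspend threshold of 1 slot is an assumption for
illustration; the 1 hr limit is expressed here as a hard run-time limit
(h_rt), which is also an assumption about how the limit is enforced.

```
# all.q (short jobs): 1 hr hard run-time limit, suspends long.q on the
# same host once at least 1 all.q slot there is in use (threshold assumed)
qname                 all.q
h_rt                  1:0:0
subordinate_list      long.q=1

# long.q (multi-day jobs): no run-time limit, subordinates nothing
qname                 long.q
h_rt                  INFINITY
subordinate_list      NONE
```

With a setup like this, any sustained stream of short jobs keeps every host's
all.q instance busy, so the long.q instances never get resumed.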


What I'd like is to not have to dedicate 100% of the resources to the short
jobs when we become inundated with them.  I looked into adjusting the slot
counts and using slot subordination, but that doesn't appear to do what I need,
as it seems to operate on queue instances, not cluster queues (correct me if
I'm wrong here).  Is there a better solution?  Maybe using load/suspend
parameters and just letting the jobs run?

Thanks,
Joe