Hi all,

After scouring the mailing list and the web for a good solution, I've finally relented and want to ask the experts for their opinions. Here's the scenario: I manage a small cluster of about 10 hosts, each with 8 cores. The majority (90%) of the jobs we get take < 1 hr, are CPU bound, and use all 8 cores on one host. But because of the nature of our work, users will infrequently submit jobs that run the same program but can take multiple days to weeks.
The setup I have right now has an "all.q" with fair sharing turned on (and an urgent.q queue for urgent jobs). So naturally, every week or so one of the users submits a batch of long-running jobs, and these generally take over the whole cluster, locking out everyone else until they are done.

So I thought I'd turn to queue subordination and create a long.q that has no time limit and is subordinate to all.q, which would have a 1 hr limit. This works well, except that we have periods of time where we have *lots* of 1 hr jobs. What ends up happening is that the long.q jobs stay suspended for...well...long periods of time and are essentially "locked out". What I'd like is to not have to dedicate 100% of the resources to the short jobs when we become inundated with them.

I looked into adjusting the slot counts and using slot subordination, but that doesn't appear to do what I need, as it seems to function on queue instances, not cluster queues (correct me if I'm wrong here). Is there a better solution? Maybe using load/suspend parameters and just letting the jobs run?

Thanks,
Joe
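
P.S. For reference, here is a rough sketch of the subordination setup described above, written as excerpts in the style of `qconf -sq` output. The values are illustrative assumptions rather than a copy of my actual configuration: the slot counts assume one slot per core, and the threshold of 1 on long.q is only one possible choice.

```text
# all.q: short jobs, hard 1 hr runtime limit (values assumed for illustration)
qname             all.q
slots             8
h_rt              01:00:00
subordinate_list  long.q=1

# long.q: multi-day jobs, no runtime limit, suspended by all.q
qname             long.q
slots             8
h_rt              INFINITY
subordinate_list  NONE
```

As far as I understand it, `subordinate_list long.q=1` suspends the long.q instance on a host as soon as one all.q slot on that same host is in use. That is evaluated per host (per queue instance), which is why a steady stream of short jobs can keep the long jobs suspended on every host.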
