On 06.09.2012 at 00:14, Stuart Barkley wrote:

> You may want to look at my setup. See:
>
> http://www.mail-archive.com/[email protected]/msg04430.html
>
> One of our clusters currently has nodes divided as:
>
>     171 large  < 4 weeks
>      15 medium < 2 days
>       9 small  < 4 hours
>
> Other comments below...
>
> On Fri, 31 Aug 2012 at 14:41 -0000, S Joe wrote:
>
>> The majority (90%) of jobs we get take < 1 hr, are CPU bound, and
>> use up all 8 cores on one host. But because of the nature of the
>> work we do, infrequently users will submit jobs that run the same
>> program but that can take multiple days to weeks.
>
> This is like us.
>
>> The setup I have right now has an "all.q" with fair sharing
>> turned on (and an urgent.q queue for urgent jobs).
>
> I have no urgent queue. When something truly urgent arises we will
> just manually force things (it hasn't happened yet). We can bump job
> priority, kill other jobs, move hosts between host groups, etc.
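As an aside, a node split by maximum run time like the one above can be expressed in a single cluster queue using per-hostgroup overrides of the h_rt limit, rather than separate queues. A rough sketch (the hostgroup names @medium and @small are assumptions, not taken from the linked setup):

```
# Excerpt of a queue definition (qconf -mq all.q); hostgroup
# names are hypothetical. The default h_rt of 4 weeks applies to
# the large nodes, with shorter overrides per hostgroup.
hostlist    @large @medium @small
h_rt        672:0:0,[@medium=48:0:0],[@small=4:0:0]
```

Jobs requesting `-l h_rt=...` above a hostgroup's limit are then simply not scheduled onto those nodes.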
Another option could be to submit with a request for a BOOLEAN complex
which has an urgency attached.

> I do occasionally adjust the node count in the different host groups
> according to workload.
>
>> So naturally what happens every week or so: one of the users submits
>> a batch of long-running jobs, and these generally take over the
>> whole cluster, locking out everyone else until they are done.
>
> This used to happen with us; now we limit the number of nodes running
> the larger/longer jobs...

Yes, this is one way, but then you are fixed to these nodes and the
jobs can't float around the cluster. I would prefer requesting for
such jobs a consumable complex (with consumable set to JOB) which has
an arbitrarily high value attached on the global level and is then
limited by an RQS.

-- Reuti

>> So I thought I'd turn to queue subordination and create a long.q
>> that has no time limit and is subordinate to the all.q, which would
>> have a 1 hr limit.
>
> I don't see much point in subordination for our system; many jobs are
> memory intensive and job suspension doesn't free memory resources (our
> nodes are diskless with no swap space).
>
>> This works well except that we have periods of time where we have
>> *lots* of 1 hr jobs. What ends up happening is that the long.q jobs
>> stay suspended for... well... long periods of time and are essentially
>> "locked out".
>
> When this happens with us, within 4 hours any short jobs running on the
> large-job nodes will have finished, and new jobs will start based upon
> the scheduling parameters.
>
>> What I'd like is to not have to dedicate 100% of the resources to
>> the short jobs when we become inundated with them.
>
> This works for us, but I think of it the other way: the short jobs are
> able to use the resources reserved for large jobs when there are no
> large jobs waiting.
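For the archive, Reuti's two suggestions (an urgency-carrying BOOLEAN complex, and a JOB-consumable capped by an RQS) might look roughly like this. The complex names `urgent` and `long_jobs` and the cap of 20 are illustrative assumptions, not values from this thread:

```
# qconf -mc -- add two complexes (names are hypothetical):
# name       shortcut type relop requestable consumable default urgency
urgent       urg      BOOL ==    YES         NO         0       1000
long_jobs    lj       INT  <=    YES         JOB        0       0

# Attach a very high value to long_jobs on the global host
# (qconf -me global):
#   complex_values  long_jobs=999999

# Then cap concurrent long jobs with a resource quota set
# (qconf -arqs):
{
   name     cap_long_jobs
   enabled  TRUE
   limit    to long_jobs=20
}
```

With this in place, `qsub -l urgent` jobs gain a priority boost from the complex's urgency value, while `qsub -l long_jobs=1` jobs consume one unit each (JOB-consumable, i.e. once per job rather than per slot) and can run on any node, with the RQS holding them to 20 cluster-wide.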
>> I looked into adjusting the slot counts and using slot
>> subordination, but that doesn't appear to do what I need, as it seems
>> to function on queue instances, not cluster queues (correct me if
>> I'm wrong here). Is there a better solution? Maybe using
>> load/suspend parameters and just letting the jobs run?
>
> When it becomes a bigger issue, I'll push that the large jobs need to
> support checkpointing, and the checkpointable jobs will have higher
> priority since they can be moved out of the way as needed. I haven't
> explored checkpointing yet.
>
> Stuart Barkley
> --
> I've never been lost; I was once bewildered for three days, but never lost!
> -- Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
