Well, for single-core jobs you can change the sort order to pack jobs onto nodes. But instead of the usual -slots you will need a special load sensor that reports only the slots used by the owner queue, and then use this variable in the load formula.
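Something along these lines could serve as the load sensor (an untested sketch; the complex name "owner_slots" and the qstat parsing are only assumptions and will need adjusting for your site):

#!/bin/sh
# Untested sketch of a load sensor reporting the slots currently used by
# the "owner" queue on this host.  It follows the usual execd load sensor
# protocol: read a line per load interval ("quit" means shut down) and
# answer with begin / host:complex:value / end.

HOST=`hostname`

while :; do
    read input
    if [ "$input" = "quit" ]; then
        exit 0
    fi

    # With -g t qstat prints one line per occupied slot, so counting the
    # lines matching this queue instance approximates the used slots.
    # (Assumption: hostnames match between `hostname` and the qstat output.)
    USED=`qstat -u '*' -s r -g t 2>/dev/null | grep -c "owner@$HOST"`

    echo "begin"
    echo "$HOST:owner_slots:$USED"
    echo "end"
done

You would then define owner_slots in the complex configuration (qconf -mc), register the script as load_sensor in the global or host configuration (qconf -mconf), and reference owner_slots in the load_formula in place of -slots (the sign depends on which direction you want the hosts sorted).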
-- Reuti

Sent from my iPad

On 03.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:

> For others that are trying to pack jobs on nodes and are using subordinate
> queues, here is an example of why job-packing is so critical:
>
> Consider the following scenario. We have two queues, "owner" and "free",
> with "free" being the subordinate queue.
>
> Our two compute nodes have 8 cores each.
>
> We load up our free queue with 16 single-core jobs:
>
> job-ID  prior    name   user      state  queue               slots  ja-task-ID
> ------------------------------------------------------------------------------
>   8560  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8561  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8562  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8563  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8564  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8565  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8566  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8567  0.55500  FREE   testfree  r      free@compute-3-2    1
>   8568  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8569  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8570  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8571  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8572  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8573  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8574  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8575  0.55500  FREE   testfree  r      free@compute-3-1    1
>
> The owner now submits ONE single-core job:
>
> $ qstat
> job-ID  prior    name   user      state  queue               slots  ja-task-ID
> ------------------------------------------------------------------------------
>   8560  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8561  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8562  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8563  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8564  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8565  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8566  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8567  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8568  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8569  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8570  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8571  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8572  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8573  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8574  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8575  0.55500  FREE   testfree  r      free@compute-3-1    1
>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2   1
>
> All cores on compute-3-2 are suspended in order to run that one single-core
> owner job #8584.
>
> Not the ideal or best setup, but we can live with this.
>
> However, here is where it gets nasty.
>
> The owner now submits another ONE-core job.
> At this point, compute-3-2 has 7 free cores on which it could schedule this
> additional ONE-core job, but no, GE likes to spread jobs:
>
> $ qstat
> job-ID  prior    name   user      state  queue               slots  ja-task-ID
> ------------------------------------------------------------------------------
>   8560  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8561  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8562  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8563  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8564  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8565  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8566  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8567  0.55500  FREE   testfree  S      free@compute-3-2    1
>   8568  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8569  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8570  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8571  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8572  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8573  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8574  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8575  0.55500  FREE   testfree  S      free@compute-3-1    1
>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2   1
>   8585  0.55500  OWNER  testbio   r      owner@compute-3-1   1
>
> The new single-core job #8585 starts on compute-3-1 instead of on
> compute-3-2, suspending another 7 cores.
>
> If job-packing with subordinate queues were available, job #8585 would have
> started on compute-3-2, since it has cores available.
>
> Two single ONE-core jobs suspend 16 single-core jobs. Nasty and wasteful!
>
> On 08/03/2012 10:10 AM, Joseph Farran wrote:
>> On 08/03/2012 09:57 AM, Reuti wrote:
>>> On 03.08.2012 at 18:50, Joseph Farran wrote:
>>>
>>>> On 08/03/2012 09:18 AM, Reuti wrote:
>>>>> On 03.08.2012 at 18:04, Joseph Farran wrote:
>>>>>
>>>>>> I pack jobs onto nodes using the following GE setup:
>>>>>>
>>>>>> # qconf -ssconf | egrep "queue|load"
>>>>>> queue_sort_method                 seqno
>>>>>> job_load_adjustments              NONE
>>>>>> load_adjustment_decay_time        0
>>>>>> load_formula                      slots
>>>>>>
>>>>>> I also set my nodes with the slots complex value:
>>>>>>
>>>>>> # qconf -rattr exechost complex_values "slots=64" compute-2-1
>>>>> Don't limit it here. Just define 64 in both queues for slots.
>>>>>
>>>> Yes, I tried that approach as well, but then parallel jobs will not
>>>> suspend an equal number of serial jobs.
>>>>
>>>> So after I set up the above (note: my test queue and nodes have 8 cores
>>>> and not 64):
>>>>
>>>> # qconf -sq owner | egrep "slots"
>>>> slots                 8
>>>> subordinate_list      slots=8(free:0:sr)
>>>>
>>>> # qconf -sq free | egrep "slots"
>>>> slots                 8
>>>>
>>>> # qconf -se compute-3-1 | egrep complex
>>>> complex_values        NONE
>>>> # qconf -se compute-3-2 | egrep complex
>>>> complex_values        NONE
>>>>
>>>> When I submit one 8-slot parallel job to owner, only one core in free is
>>>> suspended instead of 8.
>>>>
>>>> Here is the qstat listing:
>>>>
>>>> job-ID  prior    name   user      state  queue               slots
>>>> --------------------------------------------------------------
>>>>   8531  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8532  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8533  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8534  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8535  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8536  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8537  0.50500  FREE   testfree  r      free@compute-3-1    1
>>>>   8538  0.50500  FREE   testfree  S      free@compute-3-1    1
>>>>   8539  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8540  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8541  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8542  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8543  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8544  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8545  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8546  0.50500  FREE   testfree  r      free@compute-3-2    1
>>>>   8547  0.60500  Owner  me        r      owner@compute-3-1   8
>>>>
>>>> Job 8547 in the owner queue starts just fine, running with 8 cores on
>>>> compute-3-1, *but* only one core on compute-3-1 from the free queue is
>>>> suspended instead of 8 cores.
>>> AFAIR this is a known bug for parallel jobs.
>>
>> So the answer to my original question is that no, it cannot be done.
>>
>> Is there another open source GE flavor that has fixed this bug, or is this
>> bug present across all open source GE flavors?
>>
>>>>>> Serial jobs are all packed nicely onto a node until the node is full,
>>>>>> and then it goes on to the next node.
>>>>>>
>>>>>> The issue I am having is that my subordinate queue breaks when I have
>>>>>> set my nodes with the node complex value above.
>>>>>>
>>>>>> I have two queues: the owner queue and the free queue:
>>>>>>
>>>>>> # qconf -sq owner | egrep "subordinate|shell"
>>>>>> shell                 /bin/bash
>>>>>> shell_start_mode      posix_compliant
>>>>>> subordinate_list      free=1
>>>>> subordinate_list slots=64(free)
>>>>>
>>>>>> # qconf -sq free | egrep "subordinate|shell"
>>>>>> shell                 /bin/bash
>>>>>> shell_start_mode      posix_compliant
>>>>>> subordinate_list      NONE
>>>>>>
>>>>>> When I fill up the free queue with serial jobs and then submit a job
>>>>>> to the owner queue, the owner job will not suspend the free jobs. The
>>>>>> qstat scheduling info says:
>>>>>>
>>>>>> queue instance "[email protected]" dropped because it is full
>>>>>> queue instance "[email protected]" dropped because it is full
>>>>>>
>>>>>> If I remove the "complex_values" from my nodes, then jobs are correctly
>>>>>> suspended in the free queue and the owner job runs just fine.
>>>>> Yes, and what's the problem with this setup?
>>>> What is wrong with the above setup is that the 'owner' job cannot run
>>>> because free jobs are not suspended.
>>> They are not suspended in advance. The suspension is the result of an
>>> additional job being started thereon. Not the other way round.
>>
>> Right, but the idea of a subordinate queue (job preemption) is that when a
>> job *IS* scheduled, the subordinate queue suspends jobs. I mean, that's
>> the whole idea.
>>
>>> -- Reuti
>>>
>>>>>> So how can I accomplish both items above?
>>>>>>
>>>>>> *** By the way, here are some pre-answers to some questions I am going
>>>>>> to be asked:
>>>>>>
>>>>>> Why pack jobs?: Because in any HPC environment that runs a mixture of
>>>>>> serial and parallel jobs, you really don't want to spread single-core
>>>>>> jobs across multiple nodes, especially 64-core nodes. You want to keep
>>>>>> nodes whole for parallel jobs (this is HPC 101).
>>>>> Depends on the application. E.g. Molcas writes a lot to the local
>>>>> scratch disk, so it's better to spread these jobs across the cluster and
>>>>> use the remaining cores in each exechost for jobs without, or at least
>>>>> with less, disk access.
>>>> Yes, there will always be exceptions. I should have said in 99% of
>>>> circumstances.
>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Suspended jobs will not free up resources: Yep, but the jobs will
>>>>>> *not* be consuming CPU cycles, which is what I want.
>>>>>>
>>>>>> Thanks,
>>>>>> Joseph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
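For anyone reproducing the setup discussed in this thread, the packing plus slot-wise subordination configuration Joseph describes boils down to roughly the following (a sketch only; the slot counts, queue names and the slots=8(free:0:sr) line are taken from his qconf output above, the rest is untested and may need adjusting):

# Scheduler configuration (qconf -msconf), matching the qconf -ssconf
# output quoted above: sort queue instances by sequence number, no load
# adjustments.
#   queue_sort_method            seqno
#   job_load_adjustments         NONE
#   load_adjustment_decay_time   0
#   load_formula                 slots

# Owner queue (qconf -mq owner): 8 slots, slot-wise subordination of "free".
#   slots                8
#   subordinate_list     slots=8(free:0:sr)

# Subordinate queue (qconf -mq free): 8 slots, no subordinate of its own.
#   slots                8
#   subordinate_list     NONE

# Exec hosts (qconf -me <host>): leave complex_values at NONE rather than
# limiting slots there, otherwise the owner queue instance is reported as
# "dropped because it is full" once free has filled it.
#   complex_values       NONE

As Reuti notes above, this still has the known limitation that a parallel owner job suspends only one free slot instead of one per slot it occupies.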
