Depends: only for serial jobs and for PEs with allocation_rule $PE_SLOTS. -- Reuti
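Here allocation_rule $pe_slots (written lowercase in the PE definition) keeps all slots of a parallel job on a single host, which is the parallel case the subordinate accounting can handle. A minimal PE sketch, assuming a PE named "smp" (the name is arbitrary) that would be attached to both queues via their pe_list:

$ qconf -sp smp
pe_name            smp
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

A sketch of the load sensor mentioned in the quoted exchange below is appended after the thread.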
Sent from my iPad

On 03.08.2012 at 22:03, Joseph Farran <[email protected]> wrote:

> Great! Will it work for both parallel and single core jobs?
>
> If yes, is there such a load sensor available?
>
> On 08/03/2012 12:58 PM, Reuti wrote:
>> Well, for single core jobs you can change the sort order to pack jobs on
>> nodes. But instead of the usual -slots you will need a special load sensor
>> reporting only the slots used by the owner queue, and use this variable.
>>
>> -- Reuti
>>
>> Sent from my iPad
>>
>> On 03.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:
>>
>>> For others that are trying to pack jobs on nodes and are using subordinate
>>> queues, here is an example of why job packing is so critical:
>>>
>>> Consider the following scenario. We have two queues, "owner" and "free",
>>> with "free" being the subordinate queue.
>>>
>>> Our two compute nodes have 8 cores each.
>>>
>>> We load up our free queue with 16 single core jobs:
>>>
>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>> -----------------------------------------------------------------------------
>>>   8560  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8561  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8562  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8563  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8564  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8565  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8566  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8567  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8568  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8569  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8570  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8571  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8572  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8573  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8574  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8575  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>
>>> The owner now submits ONE single core job:
>>>
>>> $ qstat
>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>> -----------------------------------------------------------------------------
>>>   8560  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8561  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8562  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8563  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8564  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8565  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8566  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8567  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8568  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8569  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8570  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8571  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8572  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8573  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8574  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8575  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2      1
>>>
>>> All cores on compute-3-2 are suspended in order to run that one single core
>>> owner job #8584.
>>>
>>> Not the ideal or best setup, but we can live with this.
>>>
>>> However, here is where it gets nasty.
>>>
>>> The owner now submits another ONE core job.
>>> At this point, compute-3-2 has 7 free cores on which it could schedule this
>>> additional ONE core job, but no, GE likes to spread jobs:
>>>
>>> $ qstat
>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>> -----------------------------------------------------------------------------
>>>   8560  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8561  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8562  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8563  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8564  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8565  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8566  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8567  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8568  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8569  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8570  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8571  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8572  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8573  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8574  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8575  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2      1
>>>   8585  0.55500  OWNER  testbio   r      owner@compute-3-1      1
>>>
>>> The new single core job #8585 starts on compute-3-1 instead of on
>>> compute-3-2, suspending another 7 cores.
>>>
>>> If job packing with subordinate queues were available, job #8585 would have
>>> started on compute-3-2 since it has cores available.
>>>
>>> Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful!
>>>
>>> On 08/03/2012 10:10 AM, Joseph Farran wrote:
>>>> On 08/03/2012 09:57 AM, Reuti wrote:
>>>>> On 03.08.2012 at 18:50, Joseph Farran wrote:
>>>>>> On 08/03/2012 09:18 AM, Reuti wrote:
>>>>>>> On 03.08.2012 at 18:04, Joseph Farran wrote:
>>>>>>>> I pack jobs onto nodes using the following GE setup:
>>>>>>>>
>>>>>>>> # qconf -ssconf | egrep "queue|load"
>>>>>>>> queue_sort_method                 seqno
>>>>>>>> job_load_adjustments              NONE
>>>>>>>> load_adjustment_decay_time        0
>>>>>>>> load_formula                      slots
>>>>>>>>
>>>>>>>> I also set my nodes with the slots complex value:
>>>>>>>>
>>>>>>>> # qconf -rattr exechost complex_values "slots=64" compute-2-1
>>>>>>> Don't limit it here. Just define 64 in both queues for slots.
>>>>>>>
>>>>>> Yes, I tried that approach as well, but then parallel jobs will not
>>>>>> suspend an equal number of serial jobs.
>>>>>>
>>>>>> So after I set up the above (note my test queues and nodes have 8 cores
>>>>>> and not 64):
>>>>>>
>>>>>> # qconf -sq owner | egrep "slots"
>>>>>> slots                 8
>>>>>> subordinate_list      slots=8(free:0:sr)
>>>>>>
>>>>>> # qconf -sq free | egrep "slots"
>>>>>> slots                 8
>>>>>>
>>>>>> # qconf -se compute-3-1 | egrep complex
>>>>>> complex_values        NONE
>>>>>> # qconf -se compute-3-2 | egrep complex
>>>>>> complex_values        NONE
>>>>>>
>>>>>> When I submit one 8-core parallel job to owner, only one core in free is
>>>>>> suspended instead of 8:
>>>>>>
>>>>>> Here is the qstat listing:
>>>>>>
>>>>>> job-ID  prior    name   user      state  queue              slots
>>>>>> ------------------------------------------------------------------
>>>>>>   8531  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8532  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8533  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8534  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8535  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8536  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8537  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8538  0.50500  FREE   testfree  S      free@compute-3-1       1
>>>>>>   8539  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8540  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8541  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8542  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8543  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8544  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8545  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8546  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8547  0.60500  Owner  me        r      owner@compute-3-1      8
>>>>>>
>>>>>> Job 8547 in the owner queue starts just fine, running with 8 cores on
>>>>>> compute-3-1, *but* only one core on compute-3-1 from the free queue is
>>>>>> suspended instead of 8 cores.
>>>>> AFAIR this is a known bug for parallel jobs.
>>>> So the answer to my original question is that no, it cannot be done.
>>>>
>>>> Is there another open source GE flavor that has fixed this bug, or is this
>>>> bug present across all open source GE flavors?
>>>>
>>>>>>>> Serial jobs are all packed nicely onto a node until the node is full,
>>>>>>>> and then they go onto the next node.
>>>>>>>>
>>>>>>>> The issue I am having is that my subordinate queue breaks when I have
>>>>>>>> set my nodes with the complex value above.
>>>>>>>>
>>>>>>>> I have two queues: the owner queue and the free queue:
>>>>>>>>
>>>>>>>> # qconf -sq owner | egrep "subordinate|shell"
>>>>>>>> shell                 /bin/bash
>>>>>>>> shell_start_mode      posix_compliant
>>>>>>>> subordinate_list      free=1
>>>>>>> subordinate_list slots=64(free)
>>>>>>>
>>>>>>>> # qconf -sq free | egrep "subordinate|shell"
>>>>>>>> shell                 /bin/bash
>>>>>>>> shell_start_mode      posix_compliant
>>>>>>>> subordinate_list      NONE
>>>>>>>>
>>>>>>>> When I fill up the free queue with serial jobs and then submit a job
>>>>>>>> to the owner queue, the owner job will not suspend the free jobs.
>>>>>>>> Qstat scheduling info says:
>>>>>>>>
>>>>>>>> queue instance "[email protected]" dropped because it is full
>>>>>>>> queue instance "[email protected]" dropped because it is full
>>>>>>>>
>>>>>>>> If I remove the "complex_values=" from my nodes, then jobs are
>>>>>>>> correctly suspended in the free queue and the owner job runs just fine.
>>>>>>> Yes, and what's the problem with this setup?
>>>>>> What is wrong with the above setup is that the 'owner' job cannot run
>>>>>> because free jobs are not suspended.
>>>>> They are not suspended in advance.
>>>>> The suspension is the result of an additional job being started thereon,
>>>>> not the other way round.
>>>> Right, but the idea of a subordinate queue (job preemption) is that when a
>>>> job *IS* scheduled, the subordinate queue suspends jobs. I mean, that's
>>>> the whole idea.
>>>>
>>>>> -- Reuti
>>>>>
>>>>>>>> So how can I accomplish both items above?
>>>>>>>>
>>>>>>>> *** By the way, here are some pre-answers to some questions I am going
>>>>>>>> to be asked:
>>>>>>>>
>>>>>>>> Why pack jobs? Because in any HPC environment that runs a mixture of
>>>>>>>> serial and parallel jobs, you really don't want to spread single core
>>>>>>>> jobs across multiple nodes, especially 64-core nodes. You want to keep
>>>>>>>> nodes whole for parallel jobs (this is HPC 101).
>>>>>>> Depends on the application. E.g. Molcas writes a lot to the local
>>>>>>> scratch disk, so it's better to spread such jobs across the cluster and
>>>>>>> use the remaining cores on each exechost for jobs with little or no
>>>>>>> disk access.
>>>>>> Yes, there will always be exceptions. I should have said in 99% of
>>>>>> circumstances.
>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>> Suspended jobs will not free up resources: Yeap, but the jobs will
>>>>>>>> *not* be consuming CPU cycles, which is what I want.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joseph
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
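A rough sketch of the kind of load sensor Reuti describes in the quoted exchange (reporting only the slots currently in use in the owner queue on each host) is below. The complex name "owner_slots" and the qstat parsing are assumptions rather than a tested recipe: the complex would need to be added with qconf -mc (type INT, non-consumable), the script registered as load_sensor in the global or host configuration, and the awk field numbers checked against the local qstat -f output.

#!/bin/sh
# Sketch only: report the slots currently in use in the "owner" queue on
# this host as the load value "owner_slots".
# Assumptions: the complex "owner_slots" exists (qconf -mc), this script is
# registered as load_sensor, `hostname` matches the host name GE uses, and
# qstat is in the execd's PATH (otherwise source the SGE settings file here).
HOST=`hostname`
while :; do
    # Load sensor protocol: execd writes a line per load report interval,
    # and "quit" on shutdown.
    read input || exit 0
    if [ "$input" = "quit" ]; then
        exit 0
    fi
    # qstat -f prints one line per queue instance; the slot column reads
    # resv/used/tot. (or used/tot. on older versions), so take the
    # next-to-last field after splitting on "/".
    USED=`qstat -f -q "owner@$HOST" 2>/dev/null | \
          awk '$1 ~ /^owner@/ {n=split($3,a,"/"); print a[n-1]; exit}'`
    echo "begin"
    echo "$HOST:owner_slots:${USED:-0}"
    echo "end"
done

The reported value could then be referenced from load_formula (possibly negated, depending on whether hosts with more or fewer busy owner slots should sort first), alongside the queue_sort_method seqno setting already shown in the thread. For reference on the slotwise entry quoted above, subordinate_list slots=8(free:0:sr) suspends jobs in free once 8 slots are busy on the host, with "sr" picking the shortest-running free job first; per the reply at the top, this accounting is only reliable for serial jobs and for PEs with allocation_rule $pe_slots.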
