Depends: only for serial jobs and for PEs with allocation_rule $PE_SLOTS. -- Reuti
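Here allocation_rule $pe_slots (written lowercase in the PE definition) keeps all slots of a parallel job on a single host, which is the parallel case the subordinate accounting can handle. A minimal PE sketch, assuming a PE named "smp" (the name is arbitrary) that would be attached to both queues via their pe_list:

$ qconf -sp smp
pe_name            smp
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

A sketch of the load sensor mentioned in the quoted exchange below is appended after the thread.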
Sent from my iPad

On 03.08.2012 at 22:03, Joseph Farran <[email protected]> wrote:

> Great! Will it work for both parallel and single core jobs?
>
> If yes, is there such a load sensor available?
>
> On 08/03/2012 12:58 PM, Reuti wrote:
>> Well, for single core jobs you can change the sort order to pack jobs on
>> nodes. But instead of the usual -slots you will need a special load sensor
>> reporting only the slots used by the owner queue, and use this variable.
>>
>> -- Reuti
>>
>> Sent from my iPad
>>
>> On 03.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:
>>
>>> For others that are trying to pack jobs on nodes and are using subordinate
>>> queues, here is an example of why job packing is so critical:
>>>
>>> Consider the following scenario. We have two queues, "owner" and "free",
>>> with "free" being the subordinate queue.
>>>
>>> Our two compute nodes have 8 cores each.
>>>
>>> We load up our free queue with 16 single core jobs:
>>>
>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>> -----------------------------------------------------------------------------
>>>   8560  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8561  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8562  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8563  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8564  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8565  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8566  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8567  0.55500  FREE   testfree  r      free@compute-3-2       1
>>>   8568  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8569  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8570  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8571  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8572  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8573  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8574  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8575  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>
>>> The owner now submits ONE single core job:
>>>
>>> $ qstat
>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>> -----------------------------------------------------------------------------
>>>   8560  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8561  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8562  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8563  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8564  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8565  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8566  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8567  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8568  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8569  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8570  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8571  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8572  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8573  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8574  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8575  0.55500  FREE   testfree  r      free@compute-3-1       1
>>>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2      1
>>>
>>> All cores on compute-3-2 are suspended in order to run that one single core
>>> owner job #8584.
>>>
>>> Not the ideal or best setup, but we can live with this.
>>>
>>> However, here is where it gets nasty.
>>>
>>> The owner now submits another ONE core job.
>>> At this point, compute-3-2 has 7 free cores on which it could schedule this
>>> additional ONE core job, but no, GE likes to spread jobs:
>>>
>>> $ qstat
>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>> -----------------------------------------------------------------------------
>>>   8560  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8561  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8562  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8563  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8564  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8565  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8566  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8567  0.55500  FREE   testfree  S      free@compute-3-2       1
>>>   8568  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8569  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8570  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8571  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8572  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8573  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8574  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8575  0.55500  FREE   testfree  S      free@compute-3-1       1
>>>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2      1
>>>   8585  0.55500  OWNER  testbio   r      owner@compute-3-1      1
>>>
>>> The new single core job #8585 starts on compute-3-1 instead of on
>>> compute-3-2, suspending another 7 cores.
>>>
>>> If job packing with subordinate queues were available, job #8585 would have
>>> started on compute-3-2 since it has cores available.
>>>
>>> Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful!
>>>
>>> On 08/03/2012 10:10 AM, Joseph Farran wrote:
>>>> On 08/03/2012 09:57 AM, Reuti wrote:
>>>>> On 03.08.2012 at 18:50, Joseph Farran wrote:
>>>>>> On 08/03/2012 09:18 AM, Reuti wrote:
>>>>>>> On 03.08.2012 at 18:04, Joseph Farran wrote:
>>>>>>>> I pack jobs onto nodes using the following GE setup:
>>>>>>>>
>>>>>>>> # qconf -ssconf | egrep "queue|load"
>>>>>>>> queue_sort_method                 seqno
>>>>>>>> job_load_adjustments              NONE
>>>>>>>> load_adjustment_decay_time        0
>>>>>>>> load_formula                      slots
>>>>>>>>
>>>>>>>> I also set my nodes with the slots complex value:
>>>>>>>>
>>>>>>>> # qconf -rattr exechost complex_values "slots=64" compute-2-1
>>>>>>> Don't limit it here. Just define 64 in both queues for slots.
>>>>>>>
>>>>>> Yes, I tried that approach as well, but then parallel jobs will not
>>>>>> suspend an equal number of serial jobs.
>>>>>>
>>>>>> So after I set up the above (note my test queues and nodes have 8 cores
>>>>>> and not 64):
>>>>>>
>>>>>> # qconf -sq owner | egrep "slots"
>>>>>> slots                 8
>>>>>> subordinate_list      slots=8(free:0:sr)
>>>>>>
>>>>>> # qconf -sq free | egrep "slots"
>>>>>> slots                 8
>>>>>>
>>>>>> # qconf -se compute-3-1 | egrep complex
>>>>>> complex_values        NONE
>>>>>> # qconf -se compute-3-2 | egrep complex
>>>>>> complex_values        NONE
>>>>>>
>>>>>> When I submit one 8-core parallel job to owner, only one core in free is
>>>>>> suspended instead of 8:
>>>>>>
>>>>>> Here is the qstat listing:
>>>>>>
>>>>>> job-ID  prior    name   user      state  queue              slots
>>>>>> ------------------------------------------------------------------
>>>>>>   8531  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8532  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8533  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8534  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8535  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8536  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8537  0.50500  FREE   testfree  r      free@compute-3-1       1
>>>>>>   8538  0.50500  FREE   testfree  S      free@compute-3-1       1
>>>>>>   8539  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8540  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8541  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8542  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8543  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8544  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8545  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8546  0.50500  FREE   testfree  r      free@compute-3-2       1
>>>>>>   8547  0.60500  Owner  me        r      owner@compute-3-1      8
>>>>>>
>>>>>> Job 8547 in the owner queue starts just fine, running with 8 cores on
>>>>>> compute-3-1, *but* only one core on compute-3-1 from the free queue is
>>>>>> suspended instead of 8 cores.
>>>>> AFAIR this is a known bug for parallel jobs.
>>>> So the answer to my original question is that no, it cannot be done.
>>>>
>>>> Is there another open source GE flavor that has fixed this bug, or is this
>>>> bug present across all open source GE flavors?
>>>>
>>>>>>>> Serial jobs are all packed nicely onto a node until the node is full,
>>>>>>>> and then they go onto the next node.
>>>>>>>>
>>>>>>>> The issue I am having is that my subordinate queue breaks when I have
>>>>>>>> set my nodes with the complex value above.
>>>>>>>>
>>>>>>>> I have two queues: the owner queue and the free queue:
>>>>>>>>
>>>>>>>> # qconf -sq owner | egrep "subordinate|shell"
>>>>>>>> shell                 /bin/bash
>>>>>>>> shell_start_mode      posix_compliant
>>>>>>>> subordinate_list      free=1
>>>>>>> subordinate_list slots=64(free)
>>>>>>>
>>>>>>>> # qconf -sq free | egrep "subordinate|shell"
>>>>>>>> shell                 /bin/bash
>>>>>>>> shell_start_mode      posix_compliant
>>>>>>>> subordinate_list      NONE
>>>>>>>>
>>>>>>>> When I fill up the free queue with serial jobs and then submit a job
>>>>>>>> to the owner queue, the owner job will not suspend the free jobs.
>>>>>>>> Qstat scheduling info says:
>>>>>>>>
>>>>>>>> queue instance "[email protected]" dropped because it is full
>>>>>>>> queue instance "[email protected]" dropped because it is full
>>>>>>>>
>>>>>>>> If I remove the "complex_values=" from my nodes, then jobs are
>>>>>>>> correctly suspended in the free queue and the owner job runs just fine.
>>>>>>> Yes, and what's the problem with this setup?
>>>>>> What is wrong with the above setup is that the 'owner' job cannot run
>>>>>> because free jobs are not suspended.
>>>>> They are not suspended in advance.
>>>>> The suspension is the result of an additional job being started thereon,
>>>>> not the other way round.
>>>> Right, but the idea of a subordinate queue (job preemption) is that when a
>>>> job *IS* scheduled, the subordinate queue suspends jobs. I mean, that's
>>>> the whole idea.
>>>>
>>>>> -- Reuti
>>>>>
>>>>>>>> So how can I accomplish both items above?
>>>>>>>>
>>>>>>>> *** By the way, here are some pre-answers to some questions I am going
>>>>>>>> to be asked:
>>>>>>>>
>>>>>>>> Why pack jobs? Because in any HPC environment that runs a mixture of
>>>>>>>> serial and parallel jobs, you really don't want to spread single core
>>>>>>>> jobs across multiple nodes, especially 64-core nodes. You want to keep
>>>>>>>> nodes whole for parallel jobs (this is HPC 101).
>>>>>>> Depends on the application. E.g. Molcas writes a lot to the local
>>>>>>> scratch disk, so it's better to spread such jobs across the cluster and
>>>>>>> use the remaining cores on each exechost for jobs with little or no
>>>>>>> disk access.
>>>>>> Yes, there will always be exceptions. I should have said in 99% of
>>>>>> circumstances.
>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>> Suspended jobs will not free up resources: Yeap, but the jobs will
>>>>>>>> *not* be consuming CPU cycles, which is what I want.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joseph
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
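A rough sketch of the kind of load sensor Reuti describes in the quoted exchange (reporting only the slots currently in use in the owner queue on each host) is below. The complex name "owner_slots" and the qstat parsing are assumptions rather than a tested recipe: the complex would need to be added with qconf -mc (type INT, non-consumable), the script registered as load_sensor in the global or host configuration, and the awk field numbers checked against the local qstat -f output.

#!/bin/sh
# Sketch only: report the slots currently in use in the "owner" queue on
# this host as the load value "owner_slots".
# Assumptions: the complex "owner_slots" exists (qconf -mc), this script is
# registered as load_sensor, `hostname` matches the host name GE uses, and
# qstat is in the execd's PATH (otherwise source the SGE settings file here).
HOST=`hostname`
while :; do
    # Load sensor protocol: execd writes a line per load report interval,
    # and "quit" on shutdown.
    read input || exit 0
    if [ "$input" = "quit" ]; then
        exit 0
    fi
    # qstat -f prints one line per queue instance; the slot column reads
    # resv/used/tot. (or used/tot. on older versions), so take the
    # next-to-last field after splitting on "/".
    USED=`qstat -f -q "owner@$HOST" 2>/dev/null | \
          awk '$1 ~ /^owner@/ {n=split($3,a,"/"); print a[n-1]; exit}'`
    echo "begin"
    echo "$HOST:owner_slots:${USED:-0}"
    echo "end"
done

The reported value could then be referenced from load_formula (possibly negated, depending on whether hosts with more or fewer busy owner slots should sort first), alongside the queue_sort_method seqno setting already shown in the thread. For reference on the slotwise entry quoted above, subordinate_list slots=8(free:0:sr) suspends jobs in free once 8 slots are busy on the host, with "sr" picking the shortest-running free job first; per the reply at the top, this accounting is only reliable for serial jobs and for PEs with allocation_rule $pe_slots.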
