Well, for single-core jobs you can change the sort order to pack jobs onto nodes. 
But instead of the usual "-slots" in the load formula you will need a special 
load sensor that reports only the slots used by the owner queue, and then use 
this variable instead.
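
Something like this (untested) sketch could do it, assuming you first define an 
INT load complex named "owner_used" with qconf -mc and register the script as 
load_sensor in the host or global configuration (qconf -mconf). The complex 
name, the qstat parsing and the hostname handling below are assumptions:

   #!/bin/sh
   # Hypothetical load sensor: report the slots currently used by running
   # jobs in the "owner" queue on this host as the load value "owner_used".
   HOST=$(hostname)
   while true; do
       # sge_execd asks for a new report by writing a line to stdin;
       # "quit" means shut down.
       read request
       [ "$request" = "quit" ] && exit 0
       # Sum the slots column (column 9 in the default qstat layout) of
       # the jobs running in owner@<this host>.
       USED=$(qstat -s r -q "owner@$HOST" 2>/dev/null | awk 'NR>2 {n+=$9} END {print n+0}')
       echo "begin"
       echo "$HOST:owner_used:$USED"
       echo "end"
   done

With that in place you could try queue_sort_method load together with e.g. 
load_formula -owner_used (the same idea as the usual -slots trick), so that a 
host which already runs owner jobs is filled up first.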

-- Reuti

Sent from my iPad

On 03.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:

> For others who are trying to pack jobs on nodes while using subordinate 
> queues, here is an example of why job-packing is so critical:
> 
> Consider the following scenario.   We have two queues, "owner" and "free", 
> with "free" being the subordinate queue.
> 
> Our two compute nodes have 8 cores each.
> 
> We load up our free queue with 16 single core jobs:
> 
> job-ID  prior   name   user      state queue               slots ja-task-ID
> ---------------------------------------------------------------------------
>   8560 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8561 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8562 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8563 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8564 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8565 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8566 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8567 0.55500 FREE  testfree   r     free@compute-3-2  1
>   8568 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8569 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8570 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8571 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8572 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8573 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8574 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8575 0.55500 FREE  testfree   r     free@compute-3-1  1
> 
> 
> The owner now submits ONE single core job:
> 
> $ qstat
> job-ID  prior   name  user       state queue               slots ja-task-ID
> ---------------------------------------------------------------------------
>   8560 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8561 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8562 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8563 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8564 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8565 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8566 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8567 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8568 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8569 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8570 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8571 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8572 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8573 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8574 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8575 0.55500 FREE  testfree   r     free@compute-3-1  1
>   8584 0.55500 OWNER testbio    r     owner@compute-3-2  1
> 
> 
> All 8 free jobs on compute-3-2 are suspended in order to run that one single 
> core owner job #8584.
> 
> Not the ideal or best setup, but we can live with this.
> 
> However, here is where it gets nasty.
> 
> The owner now submits another ONE core job.    At this point, compute-3-2 has 
> 7 free cores on which it could schedule this additional ONE core job, but no, 
> GE likes to spread jobs:
> 
> $ qstat
> job-ID  prior   name  user       state queue               slots ja-task-ID
> ---------------------------------------------------------------------------
>   8560 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8561 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8562 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8563 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8564 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8565 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8566 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8567 0.55500 FREE  testfree   S     free@compute-3-2  1
>   8568 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8569 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8570 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8571 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8572 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8573 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8574 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8575 0.55500 FREE  testfree   S     free@compute-3-1  1
>   8584 0.55500 OWNER testbio    r     owner@compute-3-2  1
>   8585 0.55500 OWNER testbio    r     owner@compute-3-1  1
> 
> The new single core job #8585 starts on compute-3-1 instead of on 
> compute-3-2, suspending the 8 free jobs there and leaving another 7 cores idle.
> 
> If job-packing with subordinate queues were available, job #8585 would have 
> started on compute-3-2 since it still has cores available.
> 
> Two ONE core jobs suspend 16 single core jobs.    Nasty and wasteful!
> 
> 
> 
> 
> On 08/03/2012 10:10 AM, Joseph Farran wrote:
>> On 08/03/2012 09:57 AM, Reuti wrote:
>>> On 03.08.2012 at 18:50, Joseph Farran wrote:
>>> 
>>>> On 08/03/2012 09:18 AM, Reuti wrote:
>>>>> On 03.08.2012 at 18:04, Joseph Farran wrote:
>>>>> 
>>>>>> I pack jobs onto nodes using the following GE setup:
>>>>>> 
>>>>>>    # qconf -ssconf | egrep "queue|load"
>>>>>>    queue_sort_method                 seqno
>>>>>>    job_load_adjustments              NONE
>>>>>>    load_adjustment_decay_time        0
>>>>>>    load_formula                      slots
>>>>>> 
>>>>>> I also set my nodes with the slots complex value:
>>>>>> 
>>>>>>    # qconf -rattr exechost complex_values "slots=64" compute-2-1
>>>>> Don't limit it here. Just define 64 in both queues for slots.
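>>>>>
>>>>> e.g. something like this (untested, with the owner/free queue names from 
>>>>> this setup):
>>>>>
>>>>>    # qconf -mattr queue slots 64 owner
>>>>>    # qconf -mattr queue slots 64 free
>>>>>
>>>>> so that the per-host slot limit comes from the queue definitions rather 
>>>>> than from an exechost complex_values entry.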
>>>>> 
>>>> Yes, I tried that approach as well, but then parallel jobs will not 
>>>> suspend an equal number of serial jobs.
>>>> 
>>>> So after I set up the above ( note my test queues and nodes have 8 cores 
>>>> and not 64 ):
>>>> 
>>>> # qconf -sq owner | egrep "slots"
>>>> slots                 8
>>>> subordinate_list      slots=8(free:0:sr)
>>>> 
>>>> # qconf -sq free | egrep "slots"
>>>> slots                 8
>>>> 
>>>> # qconf -se compute-3-1 | egrep complex
>>>> complex_values        NONE
>>>> # qconf -se compute-3-2 | egrep complex
>>>> complex_values        NONE
>>>> 
>>>> When I submit one 8-core parallel job to owner, only one core in free is 
>>>> suspended instead of 8:
>>>> 
>>>> Here is the qstat listing:
>>>> 
>>>> job-ID  prior   name   user      state queue             slots
>>>> --------------------------------------------------------------
>>>>   8531 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8532 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8533 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8534 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8535 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8536 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8537 0.50500 FREE   testfree   r    free@compute-3-1   1
>>>>   8538 0.50500 FREE   testfree   S    free@compute-3-1   1
>>>>   8539 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8540 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8541 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8542 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8543 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8544 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8545 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8546 0.50500 FREE   testfree   r    free@compute-3-2   1
>>>>   8547 0.60500 Owner  me         r    owner@compute-3-1  8
>>>> 
>>>> 
>>>> Job 8547 in the owner queue starts just fine, running with 8 cores on 
>>>> compute-3-1, *but* only one free-queue core on compute-3-1 is 
>>>> suspended instead of 8 cores.
>>> AFAIR this is a known bug for parallel jobs.
>> 
>> So the answer to my original question is that no, it cannot be done.
>> 
>> Is there another open source GE flavor that has fixed this bug, or is this 
>> bug present in all open source GE flavors?
>> 
>> 
>>> 
>>>>>> Serial jobs are all packed nicely onto a node until the node is full, and 
>>>>>> then scheduling moves on to the next node.
>>>>>> 
>>>>>> 
>>>>>> The issue I am having is that my subordinate queue breaks when I set 
>>>>>> my nodes with the slots complex value above.
>>>>>> 
>>>>>> I have two queues:  The owner queue and the free queue:
>>>>>> 
>>>>>>    # qconf -sq owner | egrep "subordinate|shell"
>>>>>>    shell                 /bin/bash
>>>>>>    shell_start_mode      posix_compliant
>>>>>>    subordinate_list      free=1
>>>>> subordinate_list      slots=64(free)
>>>>> 
>>>>> 
>>>>>>    # qconf -sq free | egrep "subordinate|shell"
>>>>>>    shell                 /bin/bash
>>>>>>    shell_start_mode      posix_compliant
>>>>>>    subordinate_list      NONE
>>>>>> 
>>>>>> When I fill up the free queue with serial jobs and I then submit a job 
>>>>>> to the owner queue, the owner job will not suspend the free job.   Qstat 
>>>>>> scheduling info says:
>>>>>> 
>>>>>>    queue instance "[email protected]" dropped because it is full
>>>>>>    queue instance "[email protected]" dropped because it is full
>>>>>> 
>>>>>> If I remove the "complex_values" entry from my nodes, then jobs are 
>>>>>> correctly suspended in the free queue and the owner job runs just fine.
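>>>>>>
>>>>>> ( e.g. by editing the host again with something like
>>>>>>
>>>>>>    # qconf -me compute-2-1
>>>>>>
>>>>>> and setting complex_values back to NONE -- illustrative only. )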
>>>>> Yes, and what's the problem with this setup?
>>>> What is wrong with the above setup is that the 'owner' job cannot run 
>>>> because the free jobs are not suspended.
>>> They are not suspended in advance. The suspension is the result of an 
>>> additional job being started thereon. Not the other way round.
>> 
>> Right, but the idea of a subordinate queue ( job preemption ) is that when a 
>> job *IS* scheduled, the subordinate queue suspends its jobs.    I mean, 
>> that's the whole idea.
>> 
>> 
>>> -- Reuti
>>> 
>>> 
>>>>>> So how can I accomplish both items above?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> *** By the way, here are some pre-answers to some questions I am going 
>>>>>> to be asked:
>>>>>> 
>>>>>> Why pack jobs?:  Because in any HPC environment that runs a mixture of 
>>>>>> serial and parallel jobs, you really don't want to spread single core 
>>>>>> jobs across multiple nodes, especially 64-core nodes.   You want to keep 
>>>>>> nodes whole for parallel jobs ( this is HPC 101 ).
>>>>> Depends on the application. E.g. Molcas writes a lot to the local 
>>>>> scratch disk, so it's better to spread such jobs across the cluster and 
>>>>> use the remaining cores on each exechost for jobs with no, or at least 
>>>>> less, disk access.
>>>> Yes, there will always be exceptions.    I should have said in 99% of 
>>>> circumstances.
>>>> 
>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>> Suspended jobs will not free up resources:  Yep, but the jobs will 
>>>>>> *not* be consuming CPU cycles, which is what I want.
>>>>>> 
>>>>>> Thanks,
>>>>>> Joseph
>>>>>> 
>>> 
>> 
>> 
> 

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
