On 04.08.2012 at 07:06, Joseph Farran wrote:

> Found the issue. If I start with the count being the number of cores,
> counting down, then it works.

Yep, that is why I wrote -slots (the minus sign makes the value negative by
intention), but counting down will also work.

-- Reuti
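For anyone wiring this up themselves, a minimal "counting down" load sensor
could look like the sketch below. It follows the standard load sensor
protocol (loop on stdin, exit on "quit", report between "begin" and "end"
lines). The hard-coded core count and the qstat parsing are illustrative
assumptions; adjust both to your site. Array tasks and multi-host parallel
jobs would need extra handling (qstat -g t would be more accurate there):

#!/bin/sh
# Report free cores as "cores_in_use", counting down from TOTAL, so the
# fullest host has the smallest value and sorts first in the load formula.
HOST=`hostname`
TOTAL=8                  # cores per node in this test setup (assumption)
while read line; do
    [ "$line" = "quit" ] && exit 0
    # Sum the slots column of running jobs on this host; in default qstat
    # output field 8 is the queue instance and field 9 is the slot count.
    USED=`qstat -s r -u '*' | awk -v h="@$HOST" \
        '$8 ~ h { sum += $9 } END { print sum + 0 }'`
    echo "begin"
    echo "$HOST:cores_in_use:`expr $TOTAL - $USED`"
    echo "end"
done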
> On 8/3/2012 4:29 PM, Joseph Farran wrote:
>> I created a load sensor and it is reporting accordingly. I am not sure
>> whether I got the sensor options correct:
>>
>> # qconf -sc | egrep cores_in_use
>> cores_in_use cu INT == YES NO 0 0
>>
>> The nodes are reporting cores in use. compute-3-2 has two jobs and qhost
>> reports accordingly:
>>
>> # qhost -F -h compute-3-2 | egrep cores_in_use
>> hl:cores_in_use=2.000000
>>
>> I set up the scheduler with:
>>
>> # qconf -ssconf | egrep "queue|load"
>> queue_sort_method                 seqno
>> job_load_adjustments              NONE
>> load_adjustment_decay_time        0
>> load_formula                      cores_in_use
>>
>> But jobs are not packing.
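For readers reproducing this, two pieces wire such a sensor into the
cluster: the complex definition and the load_sensor parameter. A sketch
(the sensor path is an assumption):

# Register the complex from a file; the added line matches the one above:
qconf -sc > /tmp/complexes
echo "cores_in_use cu INT == YES NO 0 0" >> /tmp/complexes
qconf -Mc /tmp/complexes

# Attach the sensor in the host (or global) configuration; qconf -mconf
# opens an editor, where one line is added:
#   load_sensor /opt/sge/local/cores_in_use.sh
qconf -mconf compute-3-1
qconf -mconf compute-3-2

sge_execd starts the sensor itself and polls it at each load report
interval, so nothing needs to run the script by hand.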
>>
>> On 08/03/2012 12:58 PM, Reuti wrote:
>>> Well, for single-core jobs you can change the sort order to pack jobs
>>> on nodes. But instead of the usual -slots you will need a special load
>>> sensor reporting only the slots used by the owner queue, and use this
>>> variable.
>>>
>>> -- Reuti
>>>
>>> Sent from my iPad
>>>
>>> On 03.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:
>>>
>>>> For others who are trying to pack jobs on nodes while using
>>>> subordinate queues, here is an example of why job-packing is so
>>>> critical.
>>>>
>>>> Consider the following scenario. We have two queues, "owner" and
>>>> "free", with "free" being the subordinate queue.
>>>>
>>>> Our two compute nodes have 8 cores each.
>>>>
>>>> We load up our free queue with 16 single-core jobs:
>>>>
>>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>>> -----------------------------------------------------------------------------
>>>>   8560  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8561  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8562  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8563  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8564  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8565  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8566  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8567  0.55500  FREE   testfree  r      free@compute-3-2   1
>>>>   8568  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8569  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8570  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8571  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8572  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8573  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8574  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8575  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>
>>>> The owner now submits ONE single-core job:
>>>>
>>>> $ qstat
>>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>>> -----------------------------------------------------------------------------
>>>>   8560  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8561  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8562  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8563  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8564  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8565  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8566  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8567  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8568  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8569  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8570  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8571  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8572  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8573  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8574  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8575  0.55500  FREE   testfree  r      free@compute-3-1   1
>>>>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2  1
>>>>
>>>> All eight free jobs on compute-3-2 are suspended in order to run that
>>>> one single-core owner job, #8584.
>>>>
>>>> Not the ideal or best setup, but we can live with this.
>>>>
>>>> However, here is where it gets nasty.
>>>>
>>>> The owner now submits another ONE-core job. At this point, compute-3-2
>>>> has 7 free cores on which it could schedule this additional ONE-core
>>>> job, but no, GE likes to spread jobs:
>>>>
>>>> $ qstat
>>>> job-ID  prior    name   user      state  queue              slots  ja-task-ID
>>>> -----------------------------------------------------------------------------
>>>>   8560  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8561  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8562  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8563  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8564  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8565  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8566  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8567  0.55500  FREE   testfree  S      free@compute-3-2   1
>>>>   8568  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8569  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8570  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8571  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8572  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8573  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8574  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8575  0.55500  FREE   testfree  S      free@compute-3-1   1
>>>>   8584  0.55500  OWNER  testbio   r      owner@compute-3-2  1
>>>>   8585  0.55500  OWNER  testbio   r      owner@compute-3-1  1
>>>>
>>>> The new single-core job #8585 starts on compute-3-1 instead of on
>>>> compute-3-2, suspending all 8 free jobs there.
>>>>
>>>> If job-packing with subordinate queues were available, job #8585 would
>>>> have started on compute-3-2, since that node has cores available.
>>>>
>>>> Two ONE-core jobs suspend 16 single-core jobs. Nasty and wasteful!
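A quick way to reproduce this scenario on a two-node test setup (a sketch;
sleep stands in for real work, and the outcome shown above assumes the
subordinate-queue configuration discussed later in this thread):

# Fill the subordinate queue with 16 single-core jobs:
for i in `seq 1 16`; do
    qsub -q free -N FREE -b y sleep 3600
done

# Submit two single-core owner jobs; with the default sort order each one
# lands on a different node, suspending all the free jobs on both nodes:
qsub -q owner -N OWNER -b y sleep 3600
qsub -q owner -N OWNER -b y sleep 3600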
>>>>
>>>> On 08/03/2012 10:10 AM, Joseph Farran wrote:
>>>>> On 08/03/2012 09:57 AM, Reuti wrote:
>>>>>> On 03.08.2012 at 18:50, Joseph Farran wrote:
>>>>>>> On 08/03/2012 09:18 AM, Reuti wrote:
>>>>>>>> On 03.08.2012 at 18:04, Joseph Farran wrote:
>>>>>>>>> I pack jobs onto nodes using the following GE setup:
>>>>>>>>>
>>>>>>>>> # qconf -ssconf | egrep "queue|load"
>>>>>>>>> queue_sort_method                 seqno
>>>>>>>>> job_load_adjustments              NONE
>>>>>>>>> load_adjustment_decay_time        0
>>>>>>>>> load_formula                      slots
>>>>>>>>>
>>>>>>>>> I also set my nodes with the slots complex value:
>>>>>>>>>
>>>>>>>>> # qconf -rattr exechost complex_values "slots=64" compute-2-1
>>>>>>>>
>>>>>>>> Don't limit it here. Just define 64 for slots in both queues.
>>>>>>>
>>>>>>> Yes, I tried that approach as well, but then parallel jobs will not
>>>>>>> suspend an equal number of serial jobs.
>>>>>>>
>>>>>>> So after I set up the above (note my test queues and nodes have 8
>>>>>>> cores, not 64):
>>>>>>>
>>>>>>> # qconf -sq owner | egrep "slots"
>>>>>>> slots                 8
>>>>>>> subordinate_list      slots=8(free:0:sr)
>>>>>>>
>>>>>>> # qconf -sq free | egrep "slots"
>>>>>>> slots                 8
>>>>>>>
>>>>>>> # qconf -se compute-3-1 | egrep complex
>>>>>>> complex_values        NONE
>>>>>>> # qconf -se compute-3-2 | egrep complex
>>>>>>> complex_values        NONE
>>>>>>>
>>>>>>> When I submit one 8-slot parallel job to owner, only one free job is
>>>>>>> suspended instead of 8.
>>>>>>>
>>>>>>> Here is the qstat listing:
>>>>>>>
>>>>>>> job-ID  prior    name   user      state  queue              slots
>>>>>>> ----------------------------------------------------------------
>>>>>>>   8531  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8532  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8533  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8534  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8535  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8536  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8537  0.50500  FREE   testfree  r      free@compute-3-1   1
>>>>>>>   8538  0.50500  FREE   testfree  S      free@compute-3-1   1
>>>>>>>   8539  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8540  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8541  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8542  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8543  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8544  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8545  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8546  0.50500  FREE   testfree  r      free@compute-3-2   1
>>>>>>>   8547  0.60500  Owner  me        r      owner@compute-3-1  8
>>>>>>>
>>>>>>> Job 8547 in the owner queue starts just fine, running with 8 cores
>>>>>>> on compute-3-1, *but* only one free-queue job on compute-3-1 is
>>>>>>> suspended instead of 8.
>>>>>>
>>>>>> AFAIR this is a known bug for parallel jobs.
>>>>>
>>>>> So the answer to my original question is that no, it cannot be done.
>>>>>
>>>>> Is there another open source GE flavor that has fixed this bug, or is
>>>>> this bug present across all open source GE flavors?
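For readers puzzling over the subordinate_list entry above: this is the
slot-wise preemption syntax introduced with Grid Engine 6.2u5. A reading of
it (check queue_conf(5) in your own distribution, as flavors differ):

# subordinate_list slots=8(free:0:sr)
#   slots=8  start suspending once more than 8 slots are busy on the
#            host, counting the owner and free queues together
#   free     the subordinate queue whose jobs get suspended
#   0        sequence number, ordering several subordinate queues
#   sr       suspend the shortest-running job first ("lr" would pick
#            the longest-running one)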
>>>>>
>>>>>>>>> Serial jobs are all packed nicely onto a node until the node is
>>>>>>>>> full, and then they go onto the next node.
>>>>>>>>>
>>>>>>>>> The issue I am having is that my subordinate queue breaks when I
>>>>>>>>> set my nodes with the node complex value above.
>>>>>>>>>
>>>>>>>>> I have two queues: the owner queue and the free queue:
>>>>>>>>>
>>>>>>>>> # qconf -sq owner | egrep "subordinate|shell"
>>>>>>>>> shell                 /bin/bash
>>>>>>>>> shell_start_mode      posix_compliant
>>>>>>>>> subordinate_list      free=1
>>>>>>>>
>>>>>>>> subordinate_list slots=64(free)
>>>>>>>>
>>>>>>>>> # qconf -sq free | egrep "subordinate|shell"
>>>>>>>>> shell                 /bin/bash
>>>>>>>>> shell_start_mode      posix_compliant
>>>>>>>>> subordinate_list      NONE
>>>>>>>>>
>>>>>>>>> When I fill up the free queue with serial jobs and then submit a
>>>>>>>>> job to the owner queue, the owner job will not suspend the free
>>>>>>>>> jobs. The qstat scheduling info says:
>>>>>>>>>
>>>>>>>>> queue instance "free@..." dropped because it is full
>>>>>>>>> queue instance "free@..." dropped because it is full
>>>>>>>>>
>>>>>>>>> If I remove the "complex_values" from my nodes, then jobs in the
>>>>>>>>> free queue are correctly suspended and the owner job runs just
>>>>>>>>> fine.
>>>>>>>>
>>>>>>>> Yes, and what's the problem with this setup?
>>>>>>>
>>>>>>> What is wrong with the above setup is that the 'owner' job cannot
>>>>>>> run, because free jobs are not suspended.
>>>>>>
>>>>>> They are not suspended in advance. The suspension is the result of an
>>>>>> additional job being started thereon, not the other way round.
>>>>>
>>>>> Right, but the idea of a subordinate queue (job preemption) is that
>>>>> when a job *is* scheduled, the subordinate queue suspends jobs. I
>>>>> mean, that's the whole idea.
>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>>>> So how can I accomplish both items above?
>>>>>>>>>
>>>>>>>>> *** By the way, here are some pre-answers to questions I am going
>>>>>>>>> to be asked:
>>>>>>>>>
>>>>>>>>> Why pack jobs? Because in any HPC environment that runs a mixture
>>>>>>>>> of serial and parallel jobs, you really don't want to spread
>>>>>>>>> single-core jobs across multiple nodes, especially 64-core nodes.
>>>>>>>>> You want to keep nodes whole for parallel jobs (this is HPC 101).
>>>>>>>>
>>>>>>>> Depends on the application. E.g. Molcas writes a lot to the local
>>>>>>>> scratch disk, so it's better to spread such jobs across the cluster
>>>>>>>> and use the remaining cores on each exechost for jobs with little
>>>>>>>> or no disk access.
>>>>>>>
>>>>>>> Yes, there will always be exceptions. I should have said in 99% of
>>>>>>> circumstances.
>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>> Suspended jobs will not free up resources: Yep, but the jobs will
>>>>>>>>> *not* be consuming CPU cycles, which is what I want.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Joseph
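Pulling Reuti's suggestions from this thread together, the recommended
direction looks like the sketch below (queue and host names taken from the
thread; untested here, and note that slot-wise suspension of parallel jobs
was still reported as buggy above):

# Give both queues the full slot count instead of limiting the exechost:
qconf -mattr queue slots 64 owner
qconf -mattr queue slots 64 free

# Let slot-wise subordination do the per-host counting across both queues:
qconf -mattr queue subordinate_list "slots=64(free)" owner

# Drop the per-host limit that made the queue instances report "full":
qconf -dattr exechost complex_values "slots=64" compute-2-1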
