Found the issue.   If I start the count at the number of cores and count down, 
then it works.
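
(Note for the archive: as far as I understand, with queue_sort_method seqno the 
load formula still acts as a tiebreaker and hosts reporting the lower value are 
chosen first, so a sensor that counts up with usage spreads jobs, while one that 
starts at the core count and counts down packs them. Below is a minimal sketch of 
such a load sensor, assuming the standard begin/end load-sensor protocol; the 
hostname handling, the qstat-based counting of owner-queue slots, and the value 
names are placeholders, not the exact script used here.)

    #!/bin/sh
    # Sketch of a "cores_in_use" load sensor that starts at the core
    # count and counts down.  The way used slots are counted below is
    # only an illustration; adjust it to your own bookkeeping.
    HOST=`hostname`
    TOTAL=`grep -c ^processor /proc/cpuinfo`

    while true; do
        # sge_execd sends an empty line to request a report and "quit"
        # to shut the sensor down.
        read request
        if [ "$request" = "quit" ]; then
            exit 0
        fi

        # Hypothetical count of owner-queue jobs running on this host.
        USED=`qstat -s r -q owner -u '*' 2>/dev/null | grep -c "@$HOST"`

        # Start at the number of cores and count down, so the busiest
        # host reports the smallest value and gets filled first.
        VALUE=`expr $TOTAL - $USED`

        echo "begin"
        echo "$HOST:cores_in_use:$VALUE"
        echo "end"
    done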

On 8/3/2012 4:29 PM, Joseph Farran wrote:
I created a load sensor and it is reporting accordingly.   I am not sure if I got 
the sensor options correct, though:

# qconf -sc| egrep cores_in_use
cores_in_use        cu         INT         ==      YES NO         0        0

The nodes are reporting cores in use.   compute-3-2 has two jobs and qhost 
reports accordingly:

# qhost -F -h compute-3-2 | egrep cores_in_use
   hl:cores_in_use=2.000000
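
(In case it helps anyone reproducing this: the sensor script itself is attached 
through the load_sensor parameter of the host or global cluster configuration. 
The path below is only a placeholder.)

# qconf -sconf compute-3-2 | egrep load_sensor
load_sensor                  /opt/sge/util/cores_in_use.sh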

I setup the scheduler with:

# qconf -ssconf | egrep "queue|load"
queue_sort_method                 seqno
job_load_adjustments              NONE
load_adjustment_decay_time        0
load_formula                      cores_in_use

But jobs are not packing.



On 08/03/2012 12:58 PM, Reuti wrote:
Well, for single core jobs you can change the sort order to pack jobs onto nodes. But instead of the usual -slots you will need a special load sensor reporting only the slots used by the owner queue, and use this variable in the load formula.

-- Reuti

Sent from my iPad

On 03.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:

For others that are trying to pack jobs on nodes and using subordinate queues, 
here is an example of why job-packing is so critical:

Consider the following scenario.   We have two queues, "owner" and "free" with 
"free" being the subordinate queue.

Our two compute nodes have 8 cores each.

We load up our free queue with 16 single core jobs:

job-ID  prior   name   user      state queue slots ja-task-ID
---------------------------------------------------------------------------
   8560 0.55500 FREE  testfree   r     free@compute-3-2  1
   8561 0.55500 FREE  testfree   r     free@compute-3-2  1
   8562 0.55500 FREE  testfree   r     free@compute-3-2  1
   8563 0.55500 FREE  testfree   r     free@compute-3-2  1
   8564 0.55500 FREE  testfree   r     free@compute-3-2  1
   8565 0.55500 FREE  testfree   r     free@compute-3-2  1
   8566 0.55500 FREE  testfree   r     free@compute-3-2  1
   8567 0.55500 FREE  testfree   r     free@compute-3-2  1
   8568 0.55500 FREE  testfree   r     free@compute-3-1  1
   8569 0.55500 FREE  testfree   r     free@compute-3-1  1
   8570 0.55500 FREE  testfree   r     free@compute-3-1  1
   8571 0.55500 FREE  testfree   r     free@compute-3-1  1
   8572 0.55500 FREE  testfree   r     free@compute-3-1  1
   8573 0.55500 FREE  testfree   r     free@compute-3-1  1
   8574 0.55500 FREE  testfree   r     free@compute-3-1  1
   8575 0.55500 FREE  testfree   r     free@compute-3-1  1


The owner now submits ONE single core job:

$ qstat
job-ID  prior   name  user       state queue slots ja-task-ID
---------------------------------------------------------------------------
   8560 0.55500 FREE  testfree   S     free@compute-3-2  1
   8561 0.55500 FREE  testfree   S     free@compute-3-2  1
   8562 0.55500 FREE  testfree   S     free@compute-3-2  1
   8563 0.55500 FREE  testfree   S     free@compute-3-2  1
   8564 0.55500 FREE  testfree   S     free@compute-3-2  1
   8565 0.55500 FREE  testfree   S     free@compute-3-2  1
   8566 0.55500 FREE  testfree   S     free@compute-3-2  1
   8567 0.55500 FREE  testfree   S     free@compute-3-2  1
   8568 0.55500 FREE  testfree   r     free@compute-3-1  1
   8569 0.55500 FREE  testfree   r     free@compute-3-1  1
   8570 0.55500 FREE  testfree   r     free@compute-3-1  1
   8571 0.55500 FREE  testfree   r     free@compute-3-1  1
   8572 0.55500 FREE  testfree   r     free@compute-3-1  1
   8573 0.55500 FREE  testfree   r     free@compute-3-1  1
   8574 0.55500 FREE  testfree   r     free@compute-3-1  1
   8575 0.55500 FREE  testfree   r     free@compute-3-1  1
   8584 0.55500 OWNER testbio    r     owner@compute-3-2  1


All 8 free jobs on compute-3-2 are suspended in order to run that one single core 
owner job, #8584.

Not the ideal or best setup, but we can live with this.

However, here is where it gets nasty.

The owner now submits another ONE core job.    At this point, compute-3-2 has 7 
free cores on which it could schedule this additional ONE core job, but no, GE 
likes to spread jobs:

$ qstat
job-ID  prior   name  user       state queue slots ja-task-ID
---------------------------------------------------------------------------
   8560 0.55500 FREE  testfree   S     free@compute-3-2  1
   8561 0.55500 FREE  testfree   S     free@compute-3-2  1
   8562 0.55500 FREE  testfree   S     free@compute-3-2  1
   8563 0.55500 FREE  testfree   S     free@compute-3-2  1
   8564 0.55500 FREE  testfree   S     free@compute-3-2  1
   8565 0.55500 FREE  testfree   S     free@compute-3-2  1
   8566 0.55500 FREE  testfree   S     free@compute-3-2  1
   8567 0.55500 FREE  testfree   S     free@compute-3-2  1
   8568 0.55500 FREE  testfree   S     free@compute-3-1  1
   8569 0.55500 FREE  testfree   S     free@compute-3-1  1
   8570 0.55500 FREE  testfree   S     free@compute-3-1  1
   8571 0.55500 FREE  testfree   S     free@compute-3-1  1
   8572 0.55500 FREE  testfree   S     free@compute-3-1  1
   8573 0.55500 FREE  testfree   S     free@compute-3-1  1
   8574 0.55500 FREE  testfree   S     free@compute-3-1  1
   8575 0.55500 FREE  testfree   S     free@compute-3-1  1
   8584 0.55500 OWNER testbio    r     owner@compute-3-2  1
   8585 0.55500 OWNER testbio    r     owner@compute-3-1  1

The new single core job #8585 starts on compute-3-1 instead of on compute-3-2, 
suspending another 8 single core jobs.

If job-packing with subordinate queues were available, job #8585 would have 
started on compute-3-2 since it has cores available.

Two single ONE core jobs suspend 16 single core jobs.    Nasty and wasteful!




On 08/03/2012 10:10 AM, Joseph Farran wrote:
On 08/03/2012 09:57 AM, Reuti wrote:
On 03.08.2012 at 18:50, Joseph Farran wrote:

On 08/03/2012 09:18 AM, Reuti wrote:
On 03.08.2012 at 18:04, Joseph Farran wrote:

I pack jobs onto nodes using the following GE setup:

    # qconf -ssconf | egrep "queue|load"
    queue_sort_method                 seqno
    job_load_adjustments              NONE
    load_adjustment_decay_time        0
    load_formula                      slots

I also set my nodes with the slots complex value:

    # qconf -rattr exechost complex_values "slots=64" compute-2-1
Don't limit it here. Just define 64 in both queues for slots.
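(I.e., something along these lines, assuming the two cluster queues are named 
owner and free as elsewhere in this thread:)

    # qconf -mattr queue slots 64 owner
    # qconf -mattr queue slots 64 free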

Yes, I tried that approach as well, but then parallel jobs will not suspend an 
equal number of serial jobs.

So after I set up the above (note: my test queues and nodes have 8 cores, not 
64):

# qconf -sq owner | egrep "slots"
slots                 8
subordinate_list      slots=8(free:0:sr)

# qconf -sq free | egrep "slots"
slots                 8

# qconf -se compute-3-1 | egrep complex
complex_values        NONE
# qconf -se compute-3-2 | egrep complex
complex_values        NONE

When I submit one 8-core parallel job to owner, only one job in free is suspended 
instead of 8:

Here is qstat listing:

job-ID  prior   name   user      state queue slots
--------------------------------------------------------------
   8531 0.50500 FREE   testfree   r free@compute-3-1   1
   8532 0.50500 FREE   testfree   r free@compute-3-1   1
   8533 0.50500 FREE   testfree   r free@compute-3-1   1
   8534 0.50500 FREE   testfree   r free@compute-3-1   1
   8535 0.50500 FREE   testfree   r free@compute-3-1   1
   8536 0.50500 FREE   testfree   r free@compute-3-1   1
   8537 0.50500 FREE   testfree   r free@compute-3-1   1
   8538 0.50500 FREE   testfree   S free@compute-3-1   1
   8539 0.50500 FREE   testfree   r free@compute-3-2   1
   8540 0.50500 FREE   testfree   r free@compute-3-2   1
   8541 0.50500 FREE   testfree   r free@compute-3-2   1
   8542 0.50500 FREE   testfree   r free@compute-3-2   1
   8543 0.50500 FREE   testfree   r free@compute-3-2   1
   8544 0.50500 FREE   testfree   r free@compute-3-2   1
   8545 0.50500 FREE   testfree   r free@compute-3-2   1
   8546 0.50500 FREE   testfree   r free@compute-3-2   1
   8547 0.60500 Owner  me         r owner@compute-3-1  8


Job 8547 in the owner queue starts just fine, running with 8 cores on compute-3-1, 
*but* only one free-queue job on compute-3-1 is suspended instead of 8.
AFAIR this is a known bug for parallel jobs.
So the answer to my original question is that no, it cannot be done.

Is there another open source GE flavor that has fixed this bug, or is this bug 
present in all open source GE flavors?


Serial jobs are all packed nicely onto a node until the node is full, and then 
scheduling moves on to the next node.


The issue I am having is that my subordinate queue breaks when I set the complex 
value above on my nodes.

I have two queues:  The owner queue and the free queue:

    # qconf -sq owner | egrep "subordinate|shell"
    shell                 /bin/bash
    shell_start_mode      posix_compliant
    subordinate_list      free=1
subordinate_list      slots=64(free)


    # qconf -sq free | egrep "subordinate|shell"
    shell                 /bin/bash
    shell_start_mode      posix_compliant
    subordinate_list      NONE

When I fill up the free queue with serial jobs and then submit a job to the 
owner queue, the owner job will not suspend any free job.   The qstat scheduling 
info says:

    queue instance "[email protected]" dropped because it is full
    queue instance "[email protected]" dropped because it is full

If I remove the "complex_values=" from my nodes, then jobs are correctly 
suspended in free queue and the owner job runs just fine.
Yes, and what's the problem with this setup?
What is wrong with the above setup is that the 'owner' cannot run because free 
jobs are not suspended.
They are not suspended in advance. The suspension is the result of an 
additional job being started thereon. Not the other way round.
Right, but the idea of a subordinate queue (job preemption) is that when a job 
*IS* scheduled, the subordinate queue suspends jobs.    I mean, that's the whole 
idea.


-- Reuti


So how can I accomplish both items above?



*** By the way, here are some pre-answers to some questions I am going to be 
asked:

Why pack jobs?: Because in any HPC environment that runs a mixture of serial and parallel jobs, you really don't want to spread single core jobs across multiple nodes, especially 64-core nodes. You want to keep nodes whole for parallel jobs (this is HPC 101).
Depends on the application. E.g. Molcas writes a lot to the local scratch disk, so it's better to spread such jobs across the cluster and use the remaining cores on each exechost for jobs with little or no disk access.
Yes, there will always be exceptions.    I should have said in 99% of 
circumstances.


-- Reuti


Suspended jobs will not free up resources:  Yep, but the jobs will *not* be 
consuming CPU cycles, which is what I want.

Thanks,
Joseph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users