Hi Reuti. Same result.
I modified my execution hosts to have slots=64:
# qconf -se compute-2-3 | fgrep slots
complex_values slots=64
Then modified the scheduler with "-slots":
# qconf -ssconf
algorithm default
schedule_interval 0:0:15
maxujobs 0
queue_sort_method load
job_load_adjustments NONE
load_adjustment_decay_time 0
load_formula -slots
I tried "slots" as well. The "load_formula" is still being ignored for PE
jobs.
I should note that the scheduler already had jobs running; I am not sure
whether that makes a difference.
I saw some old postings reporting this as a bug in GE: parallel jobs did not
use the scheduler's load_formula. Was this bug corrected in GE2011.11?
Is anyone able to test this in GE2011.11 to confirm it was fixed?
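For reference, here is what I expect "queue_sort_method load" with "load_formula -slots" to do: rank hosts by ascending load value, so the host with the most slots in use gets the lowest (best) value and new jobs pack onto it. A pure-shell sketch of that ranking (the host names and in-use counts below are illustrative, not live qhost output):

```shell
# Expected ranking under load_formula "-slots": negate each host's
# in-use slot count so the most-loaded host sorts first.
printf '%s\n' \
  'compute-2-3 4' \
  'compute-2-6 0' \
  'compute-2-7 0' |
awk '{ printf "%d %s\n", -$2, $1 }' | sort -n
# prints:
# -4 compute-2-3
# 0 compute-2-6
# 0 compute-2-7
```

With that ordering, a new job should land on compute-2-3 first. That is exactly what happens for serial jobs, but not for PE jobs.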
Joseph
On 8/11/2012 1:51 PM, Reuti wrote:
Am 11.08.2012 um 20:30 schrieb Joseph Farran:
Yes, all my queues have the same "0" for "seq_no".
Here is my scheduler load formula:
qconf -ssconf
algorithm default
schedule_interval 0:0:15
maxujobs 0
queue_sort_method load
job_load_adjustments NONE
load_adjustment_decay_time 0
load_formula -cores_in_use
Can you please try it with -slots? It should behave the same as your own
complex. In one of your former posts you mentioned a different relation (==)
for it.
-- Reuti
Here is a sample display of what is going on. My compute nodes have 64 cores
each:
I submit four 1-core jobs to my bio queue. Note: I wait about 30 seconds
between submissions, long enough for my "cores_in_use" complex to report back
correctly:
job-ID  name  user  state  queue            slots
-------------------------------------------------
  2324  TEST  me    r      bio@compute-2-3      1
  2325  TEST  me    r      bio@compute-2-3      1
  2326  TEST  me    r      bio@compute-2-3      1
  2327  TEST  me    r      bio@compute-2-3      1
Everything works great with single 1-core jobs. Jobs 2324 through 2327 packed
onto one node (compute-2-3) correctly. The "cores_in_use" for compute-2-3 reports "4".
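For completeness, the serial test is just a loop like the one below, shown here as a dry run that prints the qsub commands instead of executing them (the job script name "run_test.sh" is a placeholder):

```shell
# Dry run of the serial test: four 1-core submissions to the bio queue.
# In the real run each echo is a real qsub, followed by a pause so the
# load sensor can update cores_in_use before the next submission.
for i in 1 2 3 4; do
  echo "qsub -q bio run_test.sh"
  # sleep 30   # real run: wait for cores_in_use to report back
done
```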
Now I submit one 16-core "openmp" PE job:
job-ID  name  user  state  queue            slots
-------------------------------------------------
  2324  TEST  me    r      bio@compute-2-3      1
  2325  TEST  me    r      bio@compute-2-3      1
  2326  TEST  me    r      bio@compute-2-3      1
  2327  TEST  me    r      bio@compute-2-3      1
  2328  TEST  me    r      bio@compute-2-6     16
The scheduler should have picked compute-2-3, since it has 4 cores_in_use, but
instead it picked compute-2-6, which had 0 cores_in_use. So the scheduler is
behaving differently here than with 1-core jobs.
As a further test, I wait until my cores_in_use reports back that compute-2-6
has "16" cores in use. I now submit another 16-core "openmp" job:
job-ID  name  user  state  queue            slots
-------------------------------------------------
  2324  TEST  me    r      bio@compute-2-3      1
  2325  TEST  me    r      bio@compute-2-3      1
  2326  TEST  me    r      bio@compute-2-3      1
  2327  TEST  me    r      bio@compute-2-3      1
  2328  TEST  me    r      bio@compute-2-6     16
  2329  TEST  me    r      bio@compute-2-7     16
The scheduler now picks yet another node, compute-2-7, which had 0
cores_in_use. I have tried this several times with many scheduler config
changes, and it certainly looks like the scheduler is *not* using the
"load_formula" for PE jobs. From what I can tell, the scheduler chooses nodes
at random for PE jobs.
Here is my "openmp" PE:
# qconf -sp openmp
pe_name openmp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $pe_slots
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
Here is my "bio" Q showing relevant info:
# qconf -sq bio | egrep "qname|slots|pe_list"
qname bio
pe_list make mpi openmp
slots 64
Thanks for taking a look at this!
On 8/11/2012 4:32 AM, Reuti wrote:
Am 11.08.2012 um 02:57 schrieb Joseph Farran <[email protected]>:
Reuti,
Are you sure this works in GE2011.11?
I have defined my own complex called "cores_in_use" which counts both single
cores and PE cores correctly.
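My sensor follows the standard execd load-sensor protocol: read a line from stdin each load interval, print a begin/value/end block, and stop on "quit". Here is a stripped-down sketch of its shape; the core counting is a placeholder value, not my real tallying logic:

```shell
# Minimal load-sensor sketch for a "cores_in_use" complex.
# The counting below is a placeholder; the real script tallies both
# serial and PE slots in use on this host.
report_cores() {
  host=$(uname -n)
  cores=4   # placeholder value
  echo "begin"
  echo "$host:cores_in_use:$cores"
  echo "end"
}

sensor_loop() {
  # qmaster/execd sends a newline each interval and "quit" on shutdown
  while read -r line; do
    [ "$line" = "quit" ] && return 0
    report_cores
  done
}

# Demo: one polling cycle, then quit
printf '\nquit\n' | sensor_loop
```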
It works great for single core jobs, but not for PE jobs using the "$pe_slots"
allocation rule.
# qconf -sp openmp
pe_name openmp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $pe_slots
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
# qconf -ssconf
algorithm default
schedule_interval 0:0:15
maxujobs 0
queue_sort_method seqno
The seq_no is the same for the queue instances in question?
-- Reuti
job_load_adjustments cores_in_use=1
load_adjustment_decay_time 0
load_formula -cores_in_use
schedd_job_info true
flush_submit_sec 5
flush_finish_sec 5
I wait until the node reports the correct "cores_in_use" complex, then submit
a PE openmp job, and it completely ignores the scheduler's "load_formula".
Joseph
On 08/09/2012 12:50 PM, Reuti wrote:
Correct. It uses the "allocation_rule" specified in the PE instead. Only when
"allocation_rule" is set to $pe_slots will it also use the "load_formula".
Unfortunately there is nothing you can do to change the behavior.
-- Reuti
Am 09.08.2012 um 21:23 schrieb Joseph Farran<[email protected]>:
Howdy.
I am using GE2011.11.
I am successfully using the GE "load_formula" to place jobs by core count using
my own "load_sensor" script.
All works as expected with single-core jobs; however, for PE jobs it seems that
GE does not abide by the "load_formula".
Does the scheduler use a different "load" formula for single-core jobs versus
parallel jobs using the PE environment setup?
Joseph
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users