On 12.08.2012 at 19:55, Joseph Farran wrote:
> Hi Rayson.
>
> Here is one particular entry:
> http://gridengine.org/pipermail/users/2012-May/003495.html
>
> I am using the Grid Engine 2011.11 binary:
> http://dl.dropbox.com/u/47200624/respin/ge2011.11.tar.gz
First of all, sorry for using the wrong expression. If you used "-cores_in_use", it should be the positive "slots". As a lower value is taken first, a host with a lower remaining number of slots should be taken first. It works as it should for serial jobs, but for parallel ones, even with $pe_slots as the allocation rule, it was already being ignored in 6.2u5.

-- Reuti

> Thanks,
> Joseph
>
> On 8/12/2012 10:10 AM, Rayson Ho wrote:
>> On Sun, Aug 12, 2012 at 5:27 AM, Joseph Farran <[email protected]> wrote:
>>> I saw some old postings that this used to be a bug with GE, that parallel
>>> jobs were not using the scheduler load_formula. Was this bug corrected in
>>> GE2011.11?
>> Hi Joseph,
>>
>> Can you point me to the previous discussion? We did not receive a bug
>> report related to this problem before...
>>
>> So far, our main focus is to fix issues & bugs reported by our users
>> first, and maybe we missed the discussion of this bug.
>>
>> Rayson
>>
>>> Anyone able to test this in GE2011.11 to see if it was fixed?
>>>
>>> Joseph
>>>
>>> On 8/11/2012 1:51 PM, Reuti wrote:
>>>> On 11.08.2012 at 20:30, Joseph Farran wrote:
>>>>
>>>>> Yes, all my queues have the same "0" for "seq_no".
>>>>>
>>>>> Here is my scheduler load formula:
>>>>>
>>>>> qconf -ssconf
>>>>> algorithm                   default
>>>>> schedule_interval           0:0:15
>>>>> maxujobs                    0
>>>>> queue_sort_method           load
>>>>> job_load_adjustments        NONE
>>>>> load_adjustment_decay_time  0
>>>>> load_formula                -cores_in_use
>>>> Can you please try it with -slots? It should behave the same as your own
>>>> complex. In one of your former posts you mentioned a different relation ==
>>>> for it.
>>>>
>>>> -- Reuti
>>>>
>>>>> Here is a sample display of what is going on. My compute nodes have 64
>>>>> cores each.
>>>>>
>>>>> I submit 4 1-core jobs to my bio queue.
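In configuration terms, the correction above amounts to the following excerpt of the scheduler setup (as shown by `qconf -ssconf` and editable with `qconf -msconf`). This is a sketch assuming the built-in `slots` complex is used and evaluates to the remaining free slots per host:

```
queue_sort_method           load
load_formula                slots
```

With `queue_sort_method load` and an ascending sort on the load formula, hosts with fewer remaining free slots sort first, which fills partially used nodes before empty ones.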
>>>>> Note: I wait around 30 seconds before submitting each 1-core job,
>>>>> long enough for my "cores_in_use" to report back correctly:
>>>>>
>>>>> job-ID  name  user  state  queue            slots
>>>>> -----------------------------------------------------
>>>>> 2324    TEST  me    r      bio@compute-2-3  1
>>>>> 2325    TEST  me    r      bio@compute-2-3  1
>>>>> 2326    TEST  me    r      bio@compute-2-3  1
>>>>> 2327    TEST  me    r      bio@compute-2-3  1
>>>>>
>>>>> Everything works great with single 1-core jobs. Jobs 2324 through 2327
>>>>> packed onto one node (compute-2-3) correctly. The "cores_in_use" for
>>>>> compute-2-3 reports "4".
>>>>>
>>>>> Now I submit one 16-core "openmp" PE job:
>>>>>
>>>>> job-ID  name  user  state  queue            slots
>>>>> -----------------------------------------------------
>>>>> 2324    TEST  me    r      bio@compute-2-3  1
>>>>> 2325    TEST  me    r      bio@compute-2-3  1
>>>>> 2326    TEST  me    r      bio@compute-2-3  1
>>>>> 2327    TEST  me    r      bio@compute-2-3  1
>>>>> 2328    TEST  me    r      bio@compute-2-6  16
>>>>>
>>>>> The scheduler should have picked compute-2-3, since it has 4 cores_in_use,
>>>>> but instead it picked compute-2-6, which had 0 cores_in_use. So here the
>>>>> scheduler is behaving differently than with 1-core jobs.
>>>>>
>>>>> As a further test, I wait until my cores_in_use reports back that
>>>>> compute-2-6 has "16" cores in use. I now submit another 16-core "openmp"
>>>>> job:
>>>>>
>>>>> job-ID  name  user  state  queue            slots
>>>>> -----------------------------------------------------
>>>>> 2324    TEST  me    r      bio@compute-2-3  1
>>>>> 2325    TEST  me    r      bio@compute-2-3  1
>>>>> 2326    TEST  me    r      bio@compute-2-3  1
>>>>> 2327    TEST  me    r      bio@compute-2-3  1
>>>>> 2328    TEST  me    r      bio@compute-2-6  16
>>>>> 2329    TEST  me    r      bio@compute-2-7  16
>>>>>
>>>>> The scheduler now picks yet another node, compute-2-7, which had 0
>>>>> cores_in_use. I have tried this several times with many config changes to
>>>>> the scheduler, and it sure looks like the scheduler is *not* using the
>>>>> "load_formula" for PE jobs.
>>>>> From what I can tell, the scheduler chooses nodes at random for PE jobs.
>>>>>
>>>>> Here is my "openmp" PE:
>>>>>
>>>>> # qconf -sp openmp
>>>>> pe_name             openmp
>>>>> slots               9999
>>>>> user_lists          NONE
>>>>> xuser_lists         NONE
>>>>> start_proc_args     NONE
>>>>> stop_proc_args      NONE
>>>>> allocation_rule     $pe_slots
>>>>> control_slaves      TRUE
>>>>> job_is_first_task   FALSE
>>>>> urgency_slots       min
>>>>> accounting_summary  TRUE
>>>>>
>>>>> Here is my "bio" queue showing the relevant info:
>>>>>
>>>>> # qconf -sq bio | egrep "qname|slots|pe_list"
>>>>> qname    bio
>>>>> pe_list  make mpi openmp
>>>>> slots    64
>>>>>
>>>>> Thanks for taking a look at this!
>>>>>
>>>>>
>>>>> On 8/11/2012 4:32 AM, Reuti wrote:
>>>>>> On 11.08.2012 at 02:57, Joseph Farran <[email protected]> wrote:
>>>>>>
>>>>>>> Reuti,
>>>>>>>
>>>>>>> Are you sure this works in GE2011.11?
>>>>>>>
>>>>>>> I have defined my own complex called "cores_in_use" which counts both
>>>>>>> single cores and PE cores correctly.
>>>>>>>
>>>>>>> It works great for single-core jobs, but not for PE jobs using the
>>>>>>> "$pe_slots" allocation rule.
>>>>>>>
>>>>>>> # qconf -sp openmp
>>>>>>> pe_name             openmp
>>>>>>> slots               9999
>>>>>>> user_lists          NONE
>>>>>>> xuser_lists         NONE
>>>>>>> start_proc_args     NONE
>>>>>>> stop_proc_args      NONE
>>>>>>> allocation_rule     $pe_slots
>>>>>>> control_slaves      TRUE
>>>>>>> job_is_first_task   FALSE
>>>>>>> urgency_slots       min
>>>>>>> accounting_summary  TRUE
>>>>>>>
>>>>>>> # qconf -ssconf
>>>>>>> algorithm          default
>>>>>>> schedule_interval  0:0:15
>>>>>>> maxujobs           0
>>>>>>> queue_sort_method  seqno
>>>>>> The seq_no is the same for the queue instances in question?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> job_load_adjustments        cores_in_use=1
>>>>>>> load_adjustment_decay_time  0
>>>>>>> load_formula                -cores_in_use
>>>>>>> schedd_job_info             true
>>>>>>> flush_submit_sec            5
>>>>>>> flush_finish_sec            5
>>>>>>>
>>>>>>> I wait until the node reports the correct "cores_in_use" complex. I
>>>>>>> then submit a PE openmp job, and it totally ignores the "load_formula"
>>>>>>> in the scheduler.
>>>>>>>
>>>>>>> Joseph
>>>>>>>
>>>>>>> On 08/09/2012 12:50 PM, Reuti wrote:
>>>>>>>> Correct. It uses the "allocation_rule" specified in the PE instead.
>>>>>>>> Only for "allocation_rule" set to $pe_slots will it also use the
>>>>>>>> "load_formula". Unfortunately, there is nothing you can do to change
>>>>>>>> the behavior.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>> On 09.08.2012 at 21:23, Joseph Farran <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Howdy.
>>>>>>>>>
>>>>>>>>> I am using GE2011.11.
>>>>>>>>>
>>>>>>>>> I am successfully using the GE "load_formula" to place jobs by core
>>>>>>>>> count using my own "load_sensor" script.
>>>>>>>>>
>>>>>>>>> All works as expected with single-core jobs; however, for PE jobs it
>>>>>>>>> seems as if GE does not abide by the "load_formula".
>>>>>>>>>
>>>>>>>>> Does the scheduler use a different "load" formula for single-core
>>>>>>>>> jobs versus parallel jobs using the PE environment setup?
>>>>>>>>>
>>>>>>>>> Joseph
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
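For anyone reproducing Joseph's setup, a minimal load sensor following SGE's load-sensor protocol might look like the sketch below. It is hypothetical: the actual cores_in_use computation is site-specific (e.g. summing the slots of jobs running on this host), so a placeholder value of 0 is reported. The protocol itself is that sge_execd writes a line to the sensor's stdin each load interval (the word "quit" to shut it down), and the sensor answers with a begin/end block of host:complex:value lines:

```shell
#!/bin/sh
# Minimal SGE load-sensor sketch. Assumes a complex named "cores_in_use"
# has been defined with `qconf -mc`; the reported value is a placeholder.

# Print one load report in the load-sensor protocol format:
#   begin
#   <host>:<complex>:<value>
#   end
report() {
  host=$1
  value=$2
  echo "begin"
  echo "$host:cores_in_use:$value"
  echo "end"
}

# Main loop: sge_execd sends a line per load interval, "quit" to stop.
# Guarded behind an argument so the functions can be sourced separately.
if [ "${1:-}" = "run" ]; then
  while read -r line; do
    [ "$line" = "quit" ] && exit 0
    # Replace 0 with the real per-host computation for cores_in_use.
    report "$(hostname)" 0
  done
fi
```

Once the placeholder is replaced with a real measurement, the script would be registered as the `load_sensor` in the host or global configuration (`qconf -mconf`).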
