On Mon, Apr 14, 2014 at 9:34 AM, Reuti <[email protected]> wrote:

> Am 14.04.2014 um 17:24 schrieb Michael Coffman:
>
> > On Mon, Apr 14, 2014 at 9:11 AM, Reuti <[email protected]>
> wrote:
> > Am 14.04.2014 um 17:04 schrieb Michael Coffman:
> >
> > >
> > > On Sat, Apr 12, 2014 at 4:45 PM, Reuti <[email protected]>
> wrote:
> > > Am 11.04.2014 um 19:49 schrieb Michael Coffman:
> > >
> > > > On Fri, Apr 11, 2014 at 9:41 AM, Reuti <[email protected]>
> wrote:
> > > > Am 11.04.2014 um 17:28 schrieb Michael Coffman:
> > > > <snip>
> > > > The queue configuration?
> > > >
> > > > Woops... Sorry.
> > > >
> > > >  qname                 all.q
> > > > <snip>
> > >
> > > Ok. Is there a "job_load_adjustments" in the scheduler configuration?
> > >
> > > Nope...
> >
> > There is "job_load_adjustments              np_load_avg=0.50" below.
> >
> > OK..
> >
> >
> > But using such a value shouldn't block the scheduling. Nevertheless you
> can try to change in the queue definition:
> >
> > $ qconf -sq all.q
> > ...
> > load_thresholds       NONE
> >
> >
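(A non-interactive way to apply the same change, assuming the queue really is
named all.q as shown and that your qconf supports -mattr, would be:

  $ qconf -mattr queue load_thresholds NONE all.q

which rewrites just that single attribute instead of opening the whole queue
definition in an editor.)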
> > Job immediately picked up.  OK.  I have no idea why this helped.
> When I qstat the job now I see the following info for the other queue
> (same set of hosts)...
> >
> > scheduling info:            queue instance "fast.q@gridtst1" dropped
> because it is overloaded: np_load_avg=4.665000 (=    0.000 + 0.50 *
> 9.330000 with nproc=1) >= 1.75
> >                              queue instance "fast.q@gridtst2" dropped
> because it is overloaded: np_load_avg=4.665000 (=    0.000 + 0.50 *
> 9.330000 with nproc=1) >= 1.75
> >                              queue instance "fast.q@gridtst3" dropped
> because it is overloaded: np_load_avg=1.865000 (=    0.000 + 0.50 *
> 3.730000 with nproc=1) >= 1.75
> >
> > What caused the above to be printed in the qstat output, as I don't see
> this on my production grid?
> >
> >
> > I see that grid thinks the systems are too heavily loaded to run the
> jobs, but there is nothing else running on them and no load...
> >
> >  09:23:42 up 74 days, 15:51,  0 users,  load average: 0.00, 0.02, 0.00
> >  09:23:53 up 74 days, 16:00,  0 users,  load average: 0.00, 0.00, 0.00
> >  09:24:05 up 73 days, 19:08,  0 users,  load average: 0.00, 0.00, 0.00
> >
> > What else is being taken into account to cause the scheduler to think
> the machines are too busy?
>
> It's looking ahead to check whether the job to be scheduled would exceed
> the load_thresholds on its own the instant it starts.
>
> With 0.5 of course it shouldn't hit this value - or is 10 not the real
> core count?
>

In this case, 10 is not the real core count.

Is there something I can set on my production grid to get it to print
similar scheduling info?
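(Looking at the qconf -ssconf dump further down, the test grid already has
schedd_job_info set to true; my understanding is that this scheduler setting
is what makes qstat -j <jobid> print the per-queue-instance "scheduling info"
lines, so presumably turning it on for the production scheduler would give the
same output. A sketch, to be verified on your version:

  $ qconf -ssconf | grep schedd_job_info
  schedd_job_info                   true
  $ qconf -msconf        # then set schedd_job_info to true

Note that on large clusters this setting is sometimes left off because
collecting per-job scheduling info adds work for the scheduler.)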

Also - how is np_load_avg calculated?
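(For what it's worth, my reading of the standard load parameters is that
np_load_avg is the normalized load average, i.e. load_avg divided by the
number of processors the execution host reports, and that the scheduling info
above shows the look-ahead adjustment added on top of it. Taking the gridtst1
line as a worked example, under that assumption:

  np_load_avg = 0.000 + 0.50 * 9.330000 = 4.665000  >=  1.75

where 0.000 is the host's current normalized load, 0.50 is the
job_load_adjustments value from the scheduler config, and 1.75 is the
np_load_avg threshold in effect for fast.q (the usual default). What exactly
feeds the 9.33 factor is not obvious from the output, so treat this as an
interpretation to confirm against load_parameters.asc rather than a
definitive answer.)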



> -- Reuti
>
>
> > > $ qconf -ssconf
> > > algorithm                         default
> > > schedule_interval                 0:0:15
> > > maxujobs                          0
> > > queue_sort_method                 seqno
> > > job_load_adjustments              np_load_avg=0.50
> > > load_adjustment_decay_time        0:7:30
> > > load_formula                      np_load_avg
> > > schedd_job_info                   true
> > > flush_submit_sec                  0
> > > flush_finish_sec                  0
> > > params                            none
> > > reprioritize_interval             0:2:0
> > > halftime                          168
> > > usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> > > compensation_factor               5.000000
> > > weight_user                       0.250000
> > > weight_project                    0.250000
> > > weight_department                 0.250000
> > > weight_job                        0.250000
> > > weight_tickets_functional         1000000
> > > weight_tickets_share              1000000
> > > share_override_tickets            TRUE
> > > share_functional_shares           TRUE
> > > max_functional_jobs_to_schedule   200
> > > report_pjob_tickets               TRUE
> > > max_pending_tasks_per_job         50
> > > halflife_decay_list               none
> > > policy_hierarchy                  OFS
> > > weight_ticket                     0.100000
> > > weight_waiting_time               1.000000
> > > weight_deadline                   3600000.000000
> > > weight_urgency                    0.100000
> > > weight_priority                   1.000000
> > > max_reservation                   0
> > > default_duration                  0:10:0
> > >
> > >
> > >
> > > -- Reuti
> > >
> > >
> > > > > $ qconf -srqsl
> > > > > no resource quota set list defined
> > > >
> > > > Good.
> > > >
> > > >
> > > > > > slots                 s                 INT         <=    YES
>       YES        1        1000
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 10, 2014 at 4:08 PM, Reuti <
> [email protected]> wrote:
> > > > > > Am 10.04.2014 um 23:51 schrieb Michael Coffman:
> > > > > >
> > > > > > > I am trying to set up a PE and am struggling to understand how
> grid determines how many slots are available for the PE.   I have set up 3
> test machines in a queue.  I set the default slots to 10.  Each system is
> actually a virtual machine that has one cpu and ~2G of memory.    PE
> definition is:
> > > > > > >
> > > > > > > pe_name            dp
> > > > > > > slots              999
> > > > > > > user_lists         NONE
> > > > > > > xuser_lists        NONE
> > > > > > > start_proc_args    /bin/true
> > > > > > > stop_proc_args     /bin/true
> > > > > > > allocation_rule    $fill_up
> > > > > > > control_slaves     FALSE
> > > > > > > job_is_first_task  TRUE
> > > > > > > urgency_slots      min
> > > > > > > accounting_summary FALSE
> > > > > > >
> > > > > > > Since I have 10 slots per host, I assumed I would have 30
> slots.   And when testing I get:
> > > > > > >
> > > > > > > $qrsh -w v -q all.q  -now no -pe dp 30
> > > > > > > verification: found possible assignment with 30 slots
> > > > > > >
> > > > > > > $qrsh -w p -q all.q  -now no -pe dp 30
> > > > > > > verification: found possible assignment with 30 slots
> > > > > > >
> > > > > > > But when I actually try to run the job the following from
> qstat...
> > > > > > >
> > > > > > > cannot run in PE "dp" because it only offers 12 slots
> > > > > > >
> > > > > > > I get that other resources can impact the availability of
> slots, but I'm having a hard time figuring out why I'm only getting 12
> slots and what resources are impacting this...
> > > > > > >
> > > > > > > When I request -pe dp 12, it works fine and distributes the
> jobs across all three systems...
> > > > > > >
> > > > > > > 9717 0.65000 QRLOGIN    user      r    04/10/2014 14:40:14
> all.q@gridtst1 SLAVE
> > > > > > >
>  all.q@gridtst1 SLAVE
> > > > > > >
>  all.q@gridtst1 SLAVE
> > > > > > >
>  all.q@gridtst1 SLAVE
> > > > > > > 9717 0.65000 QRLOGIN    user      r    04/10/2014 14:40:14
> all.q@gridtst2 SLAVE
> > > > > > >
>  all.q@gridtst2 SLAVE
> > > > > > >
>  all.q@gridtst2 SLAVE
> > > > > > >
>  all.q@gridtst2 SLAVE
> > > > > > > 9717 0.65000 QRLOGIN    user      r    04/10/2014 14:40:14
> all.q@gridtst3 MASTER
> > > > > > >
>  all.q@gridtst3 SLAVE
> > > > > > >
>  all.q@gridtst3 SLAVE
> > > > > > >
>  all.q@gridtst3 SLAVE
> > > > > >
> > > > > > What's the output of: qstat -f
> > > > > >
> > > > > > Did you set up any consumable like memory on the nodes with a
> default consumption?
> > > > > >
> > > > > > - Reuti
> > > > > >
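(Two quick checks that might help answer the consumable question, assuming
nothing unusual about the setup: qconf -sc lists every complex together with
its consumable flag and default value, and qconf -se <hostname> shows any
complex_values configured on a particular execution host, e.g.

  $ qconf -sc          # look at the "consumable" and "default" columns
  $ qconf -se gridtst1 # look for a complex_values line

gridtst1 here is just one of the test hosts named earlier in the thread.)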
> > > > > >
> > > > > > > I'm assuming I am missing something simple :(    What should I
> be looking at to help me better understand what's going on?    I do notice
> that hl:cpu jumps significantly between idle, dp 12 and dp 24, but I didn't
> find anything in the docs describing what cpu represents...
> > > >
> > > > /usr/sge/doc/load_parameters.asc
> > > >
> > > > It's % load.
> > > >
> > > >
> > > > Ahh.  Thanks for the pointer to the file.   Very useful.
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > >
> > > > --
> > > > -MichaelC
> > >
> > >
> > >
> > >
> > > --
> > > -MichaelC
> >
> >
> >
> >
> > --
> > -MichaelC
>
>


-- 
-MichaelC