Am 14.04.2014 um 17:24 schrieb Michael Coffman:

> On Mon, Apr 14, 2014 at 9:11 AM, Reuti <[email protected]> wrote:
>
> > Am 14.04.2014 um 17:04 schrieb Michael Coffman:
> >
> > > On Sat, Apr 12, 2014 at 4:45 PM, Reuti <[email protected]> wrote:
> > >
> > > > Am 11.04.2014 um 19:49 schrieb Michael Coffman:
> > > >
> > > > > On Fri, Apr 11, 2014 at 9:41 AM, Reuti <[email protected]> wrote:
> > > > >
> > > > > > Am 11.04.2014 um 17:28 schrieb Michael Coffman:
> > > > > > <snip>
> > > > > >
> > > > > > The queue configuration?
> > > > >
> > > > > Woops... Sorry.
> > > > >
> > > > > qname all.q
> > > > > <snip>
> > > >
> > > > Ok. Is there a "job_load_adjustments" in the scheduler configuration?
> > >
> > > Nope...
> >
> > There is "job_load_adjustments np_load_avg=0.50" below.
>
> OK..
>
> > But using such a value shouldn't block the scheduling. Nevertheless you can
> > try to change in the queue definition:
> >
> > $ qconf -sq all.q
> > ...
> > load_thresholds NONE
>
> Job immediately picked up. OK. I have no idea why this helped. When I qstat
> the job now I see the following info for the other queue (same set of hosts) ...
>
> scheduling info: queue instance "fast.q@gridtst1" dropped because it is
>                  overloaded: np_load_avg=4.665000 (= 0.000 + 0.50 * 9.330000 with nproc=1) >= 1.75
>                  queue instance "fast.q@gridtst2" dropped because it is
>                  overloaded: np_load_avg=4.665000 (= 0.000 + 0.50 * 9.330000 with nproc=1) >= 1.75
>                  queue instance "fast.q@gridtst3" dropped because it is
>                  overloaded: np_load_avg=1.865000 (= 0.000 + 0.50 * 3.730000 with nproc=1) >= 1.75
>
> What causes the above to be printed in the qstat, as I don't see this on my
> production grid?
>
> I see that grid thinks the systems are too heavily loaded to run the jobs,
> but there is nothing else running on them and no load...
>
> 09:23:42 up 74 days, 15:51, 0 users, load average: 0.00, 0.02, 0.00
> 09:23:53 up 74 days, 16:00, 0 users, load average: 0.00, 0.00, 0.00
> 09:24:05 up 73 days, 19:08, 0 users, load average: 0.00, 0.00, 0.00
>
> What else is being taken into account to cause the scheduler to think the
> machines are too busy?
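As an aside, each number in those "dropped because it is overloaded" lines can
be read back from the configuration. A minimal checklist - queue and host names
are the ones from this thread, <jobid> is a placeholder, and the comments are
my own reading rather than quotes from the posts:

$ qconf -sq fast.q | grep load_thresholds
      # the per-instance limit the messages compare against (np_load_avg=1.75 here)
$ qconf -ssconf | grep -E 'job_load_adjustments|load_adjustment_decay_time'
      # the artificial load added per freshly dispatched slot, and how quickly it decays
$ qhost -h gridtst1
      # NCPU (the "nproc" in the message) and the raw load reported by the execd
$ qstat -j <jobid>
      # the full "scheduling info" text, available because schedd_job_info is set to true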
It's looking ahead at whether the job to be scheduled would instantly exceed
the load_thresholds on its own. With 0.5 it of course shouldn't hit this
value - or is 10 not the real core count?

-- Reuti
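To put numbers on Reuti's point: my reading of the printed formula is the
measured np_load_avg plus job_load_adjustments times the slots just dispatched
to that host (decayed over load_adjustment_decay_time, 0:7:30 here), normalized
by the host's processor count. If the 9.33 is roughly ten freshly dispatched
slots still close to their full decay weight, the same job looks very different
on a 1-core VM and on a genuine 10-core machine:

  1-core VM:     0.000 + 0.50 * 9.33 / 1  = 4.665  >= 1.75  -> queue instance dropped
  10-core host:  0.000 + 0.50 * 9.33 / 10 = 0.467  <  1.75  -> no load alarm

This is a sketch of the usual np_load_avg correction rather than anything
spelled out in the thread, but it fits both symptoms: fast.q, which still
carries the np_load_avg=1.75 threshold, drops every instance, while all.q
starts dispatching immediately once load_thresholds is NONE, because that
removes the check entirely.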
> > > $ qconf -ssconf
> > > algorithm                         default
> > > schedule_interval                 0:0:15
> > > maxujobs                          0
> > > queue_sort_method                 seqno
> > > job_load_adjustments              np_load_avg=0.50
> > > load_adjustment_decay_time        0:7:30
> > > load_formula                      np_load_avg
> > > schedd_job_info                   true
> > > flush_submit_sec                  0
> > > flush_finish_sec                  0
> > > params                            none
> > > reprioritize_interval             0:2:0
> > > halftime                          168
> > > usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> > > compensation_factor               5.000000
> > > weight_user                       0.250000
> > > weight_project                    0.250000
> > > weight_department                 0.250000
> > > weight_job                        0.250000
> > > weight_tickets_functional         1000000
> > > weight_tickets_share              1000000
> > > share_override_tickets            TRUE
> > > share_functional_shares           TRUE
> > > max_functional_jobs_to_schedule   200
> > > report_pjob_tickets               TRUE
> > > max_pending_tasks_per_job         50
> > > halflife_decay_list               none
> > > policy_hierarchy                  OFS
> > > weight_ticket                     0.100000
> > > weight_waiting_time               1.000000
> > > weight_deadline                   3600000.000000
> > > weight_urgency                    0.100000
> > > weight_priority                   1.000000
> > > max_reservation                   0
> > > default_duration                  0:10:0
> >
> > -- Reuti
> >
> > > > $ qconf -srqsl
> > > > no resource quota set list defined
> > >
> > > Good.
> > >
> > > > > slots               s          INT     <=    YES    YES    1    1000
> > > > >
> > > > > On Thu, Apr 10, 2014 at 4:08 PM, Reuti <[email protected]> wrote:
> > > > > Am 10.04.2014 um 23:51 schrieb Michael Coffman:
> > > > >
> > > > > > I am trying to set up a PE and am struggling to understand how grid
> > > > > > determines how many slots are available for the PE. I have set up
> > > > > > 3 test machines in a queue. I set the default slots to 10. Each
> > > > > > system is actually a virtual machine that has one cpu and ~2G of
> > > > > > memory. PE definition is:
> > > > > >
> > > > > > pe_name            dp
> > > > > > slots              999
> > > > > > user_lists         NONE
> > > > > > xuser_lists        NONE
> > > > > > start_proc_args    /bin/true
> > > > > > stop_proc_args     /bin/true
> > > > > > allocation_rule    $fill_up
> > > > > > control_slaves     FALSE
> > > > > > job_is_first_task  TRUE
> > > > > > urgency_slots      min
> > > > > > accounting_summary FALSE
> > > > > >
> > > > > > Since I have 10 slots per host, I assumed I would have 30 slots.
> > > > > > And when testing I get:
> > > > > >
> > > > > > $qrsh -w v -q all.q -now no -pe dp 30
> > > > > > verification: found possible assignment with 30 slots
> > > > > >
> > > > > > $qrsh -w p -q all.q -now no -pe dp 30
> > > > > > verification: found possible assignment with 30 slots
> > > > > >
> > > > > > But when I actually try to run the job, I see the following from qstat...
> > > > > >
> > > > > > cannot run in PE "dp" because it only offers 12 slots
> > > > > >
> > > > > > I get that other resources can impact the availability of slots, but
> > > > > > I'm having a hard time figuring out why I'm only getting 12 slots
> > > > > > and what resources are impacting this...
> > > > > >
> > > > > > When I request -pe dp 12, it works fine and distributes the jobs
> > > > > > across all three systems...
> > > > > >
> > > > > > 717 0.65000 QRLOGIN user r 04/10/2014 14:40:14
> > > > > > all.q@gridtst1 SLAVE
> > > > > > all.q@gridtst1 SLAVE
> > > > > > all.q@gridtst1 SLAVE
> > > > > > all.q@gridtst1 SLAVE
> > > > > > 9717 0.65000 QRLOGIN user r 04/10/2014 14:40:14
> > > > > > all.q@gridtst2 SLAVE
> > > > > > all.q@gridtst2 SLAVE
> > > > > > all.q@gridtst2 SLAVE
> > > > > > all.q@gridtst2 SLAVE
> > > > > > 9717 0.65000 QRLOGIN user r 04/10/2014 14:40:14
> > > > > > all.q@gridtst3 MASTER
> > > > > > all.q@gridtst3 SLAVE
> > > > > > all.q@gridtst3 SLAVE
> > > > > > all.q@gridtst3 SLAVE
> > > > >
> > > > > What's the output of: qstat -f
> > > > >
> > > > > Did you set up any consumable like memory on the nodes with a default
> > > > > consumption?
> > > > >
> > > > > - Reuti
> > > > >
> > > > > > I'm assuming I am missing something simple :( What should I be
> > > > > > looking at to help me better understand what's going on? I do
> > > > > > notice that hl:cpu jumps significantly between idle, dp 12 and dp
> > > > > > 24, but I didn't find anything in the docs describing what cpu
> > > > > > represents...
> > > > >
> > > > > /usr/sge/doc/load_parameters.asc
> > > > > It's % load.
> > > >
> > > > Ahh. Thanks for the pointer to the file. Very useful.
> >
> > -- Reuti
> > >
> > > --
> > > -MichaelC
> >
> > --
> > -MichaelC
>
> --
> -MichaelC
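The earlier "only offers 12 slots" figure fits the same arithmetic: on an idle
1-core host every slot the scheduler tentatively places adds 0.50 to
np_load_avg, so it can place four slots (projected 0.0, 0.5, 1.0 and 1.5 all
stay below 1.75) before a fifth would trip the threshold, and four slots on
each of the three hosts is exactly the 12 that were offered - which is also how
the -pe dp 12 run above landed, four tasks per host. That is back-of-the-envelope
reasoning on the posted settings, not something stated explicitly in the thread.
The usual commands for checking what a PE can still offer and why:

$ qstat -g c                                          # cluster queue summary: total vs. available slots
$ qstat -f                                            # per queue instance: slot usage, plus an 'a' state when a load threshold is exceeded
$ qconf -sq all.q | grep -E 'slots|load_thresholds'   # the per-queue slot count and thresholds being checked against

The "cannot run in PE ... because it only offers N slots" text itself shows up
under qstat -j <jobid>, as already seen above.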
