Re: [gridengine users] Understanding load_formula and load calculations for queue overloads..

Ben Daniel Pere Sun, 28 Feb 2016 12:54:06 -0800

>
> Each job starting on a machine will contribute 1 to the adjustment which
> will decay over time to 0, in your case in 7:30 minutes. The 38.23 is the
> sum of all these adjustments of all jobs starting in the last 7:30 while
> each job will have its own individual contribution to this sum. If no job
> started in the last 7:30 on a machine it should read 0.50 * 0.000000. This
> value is then divided by 56 before being added to 0.965536.
>
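
The arithmetic described in the quoted paragraph can be sketched roughly as
below. This is only an illustration, not Grid Engine's actual code: it
assumes the per-job contribution decays linearly, that the 56 is the number
of cores on the host, and the function names are made up for the example.

```python
# Hypothetical helper names; values are the ones quoted in this thread.
DECAY_SECONDS = 7 * 60 + 30  # the 7:30 decay time mentioned above


def job_contribution(age_seconds, decay_seconds=DECAY_SECONDS):
    """Contribution of one job started age_seconds ago.

    Assumption: the contribution falls linearly from 1 to 0 over the decay time.
    """
    remaining = 1.0 - age_seconds / decay_seconds
    return max(0.0, min(1.0, remaining))


def adjusted_load(base_load, adjustment_sum, factor=0.5, divisor=56):
    """Base load plus the scaled sum of per-job adjustments.

    factor is the 0.5 load adjustment, divisor is the 56 from the quoted
    explanation (assumed here to be the core count).
    """
    return base_load + factor * adjustment_sum / divisor


if __name__ == "__main__":
    # 0.965536 and 38.23 are the numbers discussed in this thread.
    print(adjusted_load(0.965536, 38.23))
```

With the numbers from this thread the result is roughly 1.307, and with an
adjustment sum of 0 it stays at 0.965536.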


I actually figured out where the 38.23 comes from while I was writing this
email: I noticed the decay time and started to read about it. Still, what
made me send the question is that I don't see where the load_formula comes
into play here. The minus num_proc seems to be completely ignored, so I'm
probably missing something - what is it?


> > load_formula is load_avg-num_proc and load_adjustments are 0.5:
>
> What was the reason to implement it this way? Having a fully loaded machine
> and subtracting num_proc would read zero - which doesn't reflect the actual
> use of the machine.
>
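
To make the quoted point concrete, here is a tiny sketch of what
load_avg-num_proc evaluates to. It assumes a 56-core host (the 56 from the
earlier calculation), and the function is just an illustration of the
arithmetic, not anything Grid Engine provides.

```python
def load_formula(load_avg, num_proc):
    """Evaluate the configured formula load_avg - num_proc for one host."""
    return load_avg - num_proc


NUM_PROC = 56  # assumption: core count of the host

# Idle, half loaded and fully loaded host.
for load_avg in (0.0, 28.0, 56.0):
    print(load_avg, load_formula(load_avg, NUM_PROC))
```

A fully loaded host gives 0 and an idle one gives -56, so the value is never
positive until the machine is oversubscribed.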

No one remembers. I talked with the people who configured it and they have
absolutely no idea :) "probably copy pasted from somewhere online" <-- real
quote.


> - A job_load_adjustments does handle the fact that a job isn't using the
> granted resources instantly, which is not happening in your case.
>

I would also assume it's good for "starting engines": since load_avg is the
5-minute load, submitting a huge array after some idle time will make all
jobs see almost zero load on the machine. I wouldn't mind bombing the
machine, because we only have 1 slot per core, so I'm not really worried
about killing the CPU, but I can see the logic in it even in cases of
always-intensive jobs.


> - alarm_threshold in the queue definition takes care in case you want to
> oversubscribe a machine by intention as your parallel job doesn't scale
> well
>

We basically have 2 kinds of queues: a workhorse queue "all.q", which has 1
slot per core, and an interactive queue, which also has 1 slot per core but
gets a better priority. We set the load_thresholds to 1.3 to allow 30%
oversubscription and ensure interactive jobs can always run. We never put
our nodes in alarm mode; we use Zabbix to monitor each machine's health, and
we automatically take a machine out of the cluster (by disabling all of its
queues) in cases of "mess" (disk failures, out of space, mounting issues,
stuff like that).
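
As a rough sketch of what the 1.3 threshold means in practice: the check
below assumes the threshold is compared against the load normalized per core
(load divided by num_proc) and that "at or above the threshold" counts as
overloaded. Both are assumptions of this example, and the function name is
made up.

```python
LOAD_THRESHOLD = 1.3  # the load_thresholds value we use


def is_overloaded(load_avg, num_proc, threshold=LOAD_THRESHOLD):
    """Return True if the per-core load reaches the threshold.

    Assumption: the threshold applies to load_avg / num_proc.
    """
    return load_avg / num_proc >= threshold


# On a hypothetical 56-core host: fully loaded is fine, 80 is too much.
print(is_overloaded(56.0, 56))
print(is_overloaded(80.0, 56))
```

So on a 56-core host a load of 56 (one busy job per core) still leaves room
for interactive jobs before the threshold is reached.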
