Re: give cron a sensible default max load_avg for batch jobs

2015-11-16 Thread Craig Skinner
On 2015-11-14 Sat 05:57 AM |, Todd C. Miller wrote:
> The question no one seems to be asking here is "who actually runs
> batch".  Anyone?
> 

I do, on small servers with an average uptime(1) load of ~0.2



Re: give cron a sensible default max load_avg for batch jobs

2015-11-14 Thread Peter Hessler
On 2015 Nov 13 (Fri) at 20:28:01 -0700 (-0700), Todd C. Miller wrote:
:On Fri, 13 Nov 2015 16:45:44 -0700, Theo de Raadt wrote:
:
:> > This patch changes the default setting to 1.5 *
:> > (number_of_cpus_in_system) instead, which I find better matches modern
:> > behaviour.
:> 
:> A larger number is sensible in this position.
:> 
:> I would propose 8.  I don't agree with a calculation like that; the
:> amount of work a system can do should not be calculated like that.
:
:I think 8 is way too high.  Isn't the point of batch to run things
:when the machine is mostly idle?
:
: - todd
:

my laptop currently has chrome with no javascript pages open, a torrent
client that is fully paused, and is running cvsync.  My load is at 1.65. :/
I think 8 is much better, imho.


-- 
It was one of those perfect summer days -- the sun was shining, a
breeze was blowing, the birds were singing, and the lawn mower was
broken ...
-- James Dent



Re: give cron a sensible default max load_avg for batch jobs

2015-11-14 Thread lists
> >>> This patch changes the default setting to 1.5 *
> >>> (number_of_cpus_in_system) instead, which I find better matches modern
> >>> behaviour.  
> >>
> >> A larger number is sensible in this position.
> >>
> >> I would propose 8.  I don't agree with a calculation like that; the
> >> amount of work a system can do should not be calculated like that.  
> > 
> > I think 8 is way too high.  Isn't the point of batch to run things
> > when the machine is mostly idle?  
> 
> The problem is (and we've had this discussion several times before at
> least in misc@), that the system load doesn't really tell us that.

What's the proper way to calculate the amount of work a system can do,
so as to (then) figure out a CPU idle time threshold?

Does this not also depend on the type of workload being done, and imply
a capability to manage how the workload is distributed?
 
> It *may* be the case that the system is under lots of work, but it may
> also be the case that there are many processes just blocking waiting for
> some resource and that the system is essentially idling.
> 
> My particular problem, and the reason I suggested this patch in the
> first place, is that I often see loads of 20-30-50 or even way more,
> without there even being a problem. The machine is very responsive, and
> everything works great - there are just a lot of processes running or
> waiting for an opportunity to run.

That's not the general case on single- or dual-CPU systems (or on
anything below whatever CPU count above 4 you choose), nor when running
fewer processes that are more CPU intensive.  In these cases it may also
be easier to know what's happening on the system.  Selecting the
offloaded period (automatically?) where you don't have direct control
requires more understanding than average load numbers give (suggestion
only).  Or it requires a different approach to running tasks (e.g.
service oriented nodes assisting general worker ones).

Better to use a statistical approach per machine (counters), factoring
in processing capability and the duty/saturation cycle (human
assessment).  Or simply follow the users' circadian cycle and not care
much, since machines just work while people rest, with potential overlap
between multiple machines for the same role/task.

> Since the system load essentially is a decaying average of the number of
> runnable or running processes, it is not in any way connected to actual
> processor workload as in instructions executed, just to the fact that
> there is much *potentially* going on in the system.

Obviously, this explains why the average load figure is not 'the' proper
way to quantify how busy the processor is.  Such a method is of little
use without a tuning knob, and even then only after other factors have
been assessed.  The CPU count does correlate, but it is not the sole
determining factor, and imagine the mess from twisting a knob without
understanding what it does (sane limits, sane defaults).

> That's also why I suggested basing the default on a value relative to
> the number of cores - it made sense from my practical point of view. But
> I understand where Theo's coming from on this.

Please comment on an (improved?) method to estimate processor offloaded
periods that reduces the average load guesswork, or simply on a
practical approach to the problem of finding offloaded periods (a
threshold) without pushing edge case changes.



Re: give cron a sensible default max load_avg for batch jobs

2015-11-14 Thread Todd C. Miller
The question no one seems to be asking here is "who actually runs
batch".  Anyone?

 - todd



Re: give cron a sensible default max load_avg for batch jobs

2015-11-14 Thread Benny Lofgren
On 2015-11-14 09:54, li...@wrant.com wrote:
>>> I think 8 is way too high.  Isn't the point of batch to run things
>>> when the machine is mostly idle?  
>> The problem is (and we've had this discussion several times before at
>> least in misc@), that the system load doesn't really tell us that.

> What's the proper way to calculate the amount of work a system can do,
> so as to (then) figure out a CPU idle time threshold?
> Does this not also depend on the type of workload being done, and imply
> a capability to manage how the workload is distributed?

The problem is that there is no proper way, at least not *one* proper
way, to do that. It all depends on your particular situation. The
problem is that cron has no way to know whether the job it is just about
to fire away is going to take a hundred milliseconds or a hundred hours
to run, or what kind of resources it will consume.

The reason to use loadavg as an indicator for system activity is that
while it measures neither high cpu activity nor high i/o-activity
directly, it is actually a pretty good hint as to whether the system is
"busy", for a very fuzzy definition of busy.


The problem is that the value isn't absolute, it is relative to the
configuration and load profile of each system.

If my SP system shows a load of 14, I can with some certainty say that
it's quite busy. If my 12-core dual Opteron server shows 14, it's hardly
even breathing heavily, *even* if it's got mostly cpu-bound activity. If
it shows 100+, then as a sysadmin I'd start looking for explanations.


Remember also that running a job via batch generally is a very kind way
to start heavy tasks, because cron runs the job niced. So in case of
much cpu-bound activity, it may never even get in more than a few time
slices here and there to run, while if there's much i/o going on it may
run next to unnoticed even if it's got lots of cpu-bound stuff to do.

So the problem isn't even that big, since the system's own scheduler is
pretty good at handling various system loads when they actually have
begun life as processes.


The one time when it is especially unsuitable to run an extra batch job
is if we are memory starved, and are swapping, or are close to having to
swap. And as it happens, load_avg is conveniently going to start
skyrocketing as we are starting to swap.

So, if we break it down, load_avg is really not such a bad metric to use
in this particular case. It is "just" that the default limit is set way
too low for today's standards.


With that said, I'm looking at other ways to determine system workload.
Maybe there's a set of metrics that give us a more accurate snapshot of
the system's current state, that can be averaged over time like load_avg
so as to avoid temporary spikes that may give a faulty impression of the
system's activity.

But that calculation must also be as unaffected as possible by the
system's "dimensions". A system with fast i/o can of course handle more
of it before becoming saturated. Likewise with cpu speed and number of
cores, and system memory. So the best thing in my mind to start
looking at is whether the system does a lot of *waiting* to get its jobs
done: waiting for disk or network i/o, processes waiting to run,
waiting for lock contention to clear, things like that.


So, if nobody is waiting for anybody, then by all means go ahead and run
one more job! It's not going to harm anything, as long as it doesn't
consume all of the idling resources for itself.

The problem can be as complex as we want to make it, or as simple, if it
isn't a problem in practice.

>> My particular problem, and the reason I suggested this patch in the
>> first place, is that I often see loads of 20-30-50 or even way more,
>> without there even being a problem. The machine is very responsive, and
>> everything works great - there are just a lot of processes running or
>> waiting for an opportunity to run.
> 
> That's not the general case on single- or dual-CPU systems (or on
> anything below whatever CPU count above 4 you choose), nor when running
> fewer processes that are more CPU intensive.  In these cases it may also
> be easier to know what's happening on the system.  Selecting the
> offloaded period (automatically?) where you don't have direct control
> requires more understanding than average load numbers give (suggestion
> only).  Or it requires a different approach to running tasks (e.g.
> service oriented nodes assisting general worker ones).
> 
> Better to use a statistical approach per machine (counters), factoring
> in processing capability and the duty/saturation cycle (human
> assessment).  Or simply follow the users' circadian cycle and not care
> much, since machines just work while people rest, with potential overlap
> between multiple machines for the same role/task.

Good points. But taking the human circadian cycle into account, that is,
working when the human is not and vice versa, can easily be accommodated
already, by using "at".

>> Since the system load essentially is a decaying average of the number of
>> runnable or 

Re: give cron a sensible default max load_avg for batch jobs

2015-11-14 Thread Benny Lofgren
On 2015-11-14 13:57, Todd C. Miller wrote:
> The question no one seems to be asking here is "who actually runs
> batch".  Anyone?

I gave kind of an answer to that in my original posting. :-)

At least I run batch and at, and I do it *all the time*.

There is imho no more convenient way of firing off a background job than
using batch, it's a hidden gem in the unix toolbox. And using at makes
it super easy to schedule tasks at times when it is more convenient to
run them than "now". And if you have output you get it in a mail when
it's done. Very spiffy!


Regards,

/Benny



Re: give cron a sensible default max load_avg for batch jobs

2015-11-13 Thread Benny Lofgren
On 2015-11-14 04:28, Todd C. Miller wrote:
> On Fri, 13 Nov 2015 16:45:44 -0700, Theo de Raadt wrote:
> 
>>> This patch changes the default setting to 1.5 *
>>> (number_of_cpus_in_system) instead, which I find better matches modern
>>> behaviour.
>>
>> A larger number is sensible in this position.
>>
>> I would propose 8.  I don't agree with a calculation like that; the
>> amount of work a system can do should not be calculated like that.
> 
> I think 8 is way too high.  Isn't the point of batch to run things
> when the machine is mostly idle?

The problem is (and we've had this discussion several times before at
least in misc@), that the system load doesn't really tell us that.

It *may* be the case that the system is under lots of work, but it may
also be the case that there are many processes just blocking waiting for
some resource and that the system is essentially idling.

My particular problem, and the reason I suggested this patch in the
first place, is that I often see loads of 20-30-50 or even way more,
without there even being a problem. The machine is very responsive, and
everything works great - there are just a lot of processes running or
waiting for an opportunity to run.


Since the system load essentially is a decaying average of the number of
runnable or running processes, it is not in any way connected to actual
processor workload as in instructions executed, just to the fact that
there is much *potentially* going on in the system.
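
To make the "decaying average" point concrete, here is a minimal sketch
in C. It assumes an exponential decay and a fixed sampling interval; the
function names and the decay weight are made up for illustration, and
this is not the kernel's actual code. Note that the only input is a
count of runnable or running processes, never the number of
instructions executed.

```c
#include <assert.h>

/*
 * One sampling step of an exponentially decaying average.
 *   avg   - the average before this sample
 *   nrun  - runnable or running processes counted in this sample
 *   decay - weight kept from the old average, strictly between 0 and 1
 *           (hypothetical parameter, chosen by the caller)
 */
double
decay_step(double avg, double nrun, double decay)
{
	assert(decay > 0.0 && decay < 1.0);
	return avg * decay + nrun * (1.0 - decay);
}

/* Apply the same sample repeatedly, as if nrun stayed constant. */
double
decay_settle(double avg, double nrun, double decay, int steps)
{
	while (steps-- > 0)
		avg = decay_step(avg, nrun, decay);
	return avg;
}
```

With a constant count of runnable processes the average simply drifts
toward that count, whether those processes burn cpu or mostly wait their
turn.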

For example, I run a couple of Hadoop clusters (not on OpenBSD
unfortunately), and with cluster nodes containing dual 6-core
hyper-threading Xeon processors, there are 24 "cores" that can be tasked
with calculations, and if they are all doing something the system load
will be at least 24 - but there would be no problem whatsoever to do
more things on the server, especially since the map/reduce tasks are
running with lowered priority. Each core's individual load would be about 1.

That's also why I suggested basing the default on a value relative to
the number of cores - it made sense from my practical point of view. But
I understand where Theo's coming from on this.
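
The arithmetic behind the per-core idea can be sketched as follows. This
is only an illustration: the function names are invented here, and how
the CPU count is obtained is deliberately left to the caller.

```c
#include <assert.h>

/* Load per CPU: a load of 24 on 24 logical cores is about 1 each. */
double
load_per_cpu(double loadavg, int ncpus)
{
	assert(ncpus > 0);
	return loadavg / ncpus;
}

/* Threshold as first proposed in this thread: 1.5 * number of CPUs. */
double
scaled_maxload(int ncpus)
{
	assert(ncpus > 0);
	return 1.5 * ncpus;
}
```

For one CPU this gives the old default of 1.5; for the 24 "cores" in the
Hadoop example it gives 36, well above the fixed value of 8 discussed
elsewhere in the thread.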



Regards,

/Benny


-- 
internetlabbet.se     / work:   +46 8 551 124 80      / "Words must
Benny Lofgren        /  mobile: +46 70 718 11 90     /   be weighed,
                    /   fax:    +46 8 551 124 89    /    not counted."
                   /    email:  benny -at- internetlabbet.se



Re: give cron a sensible default max load_avg for batch jobs

2015-11-13 Thread Theo de Raadt
> This patch changes the default setting to 1.5 *
> (number_of_cpus_in_system) instead, which I find better matches modern
> behaviour.

A larger number is sensible in this position.

I would propose 8.  I don't agree with a calculation like that; the
amount of work a system can do should not be calculated like that.




Re: give cron a sensible default max load_avg for batch jobs

2015-11-13 Thread Todd C. Miller
On Fri, 13 Nov 2015 16:45:44 -0700, Theo de Raadt wrote:

> > This patch changes the default setting to 1.5 *
> > (number_of_cpus_in_system) instead, which I find better matches modern
> > behaviour.
> 
> A larger number is sensible in this position.
> 
> I would propose 8.  I don't agree with a calculation like that; the
> amount of work a system can do should not be calculated like that.

I think 8 is way too high.  Isn't the point of batch to run things
when the machine is mostly idle?

 - todd



Re: give cron a sensible default max load_avg for batch jobs

2015-11-13 Thread Benny Lofgren
On 2015-11-14 00:45, Theo de Raadt wrote:
>> This patch changes the default setting to 1.5 *
>> (number_of_cpus_in_system) instead, which I find better matches modern
>> behaviour.
> 
> A larger number is sensible in this position.
> 
> I would propose 8.  I don't agree with a calculation like that; the
> amount of work a system can do should not be calculated like that.


Fair enough! I agree that 8 will probably fit most cases. It makes for a
simpler patch, too. :-)

(I retained the decimal point in 8.0 in the man page, as an indicator
that it is not an integer value.)


Regards,

/Benny





Index: config.h
===================================================================
RCS file: /cvs/src/usr.sbin/cron/config.h,v
retrieving revision 1.23
diff -u -p -u -r1.23 config.h
--- config.h	23 Oct 2015 18:42:55 -0000	1.23
+++ config.h	14 Nov 2015 00:32:21 -0000
@@ -40,7 +40,7 @@
 #define MAILARG _PATH_SENDMAIL /*-*/
 
 /* maximum load at which batch jobs will still run */
-#define BATCH_MAXLOAD  1.5 /*-*/
+#define BATCH_MAXLOAD  8.0 /*-*/
 
 /* Define this to run crontab setgid instead of
  * setuid root.  Group access will be used to read
Index: cron.8
===================================================================
RCS file: /cvs/src/usr.sbin/cron/cron.8,v
retrieving revision 1.34
diff -u -p -u -r1.34 cron.8
--- cron.8	12 Nov 2015 21:14:01 -0000	1.34
+++ cron.8	14 Nov 2015 00:32:21 -0000
@@ -116,7 +116,7 @@ If the current load average is greater t
 .Ar load_avg ,
 .Xr batch 1
 jobs will not be run.
-The default value is 1.5.
+The default value is 8.0.
 To allow
 .Xr batch 1
 jobs to run regardless of the load, a value of 0.0 may be used.
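
The rule described in the man page text above can be sketched in C as
follows. This is not cron's actual code: the function names are invented
for illustration, and using the 1-minute figure from getloadavg(3) is an
assumption.

```c
/*
 * glibc hides getloadavg(3) in strict -std= modes unless this is
 * defined; it should be harmless elsewhere (assumption about the
 * build environment).
 */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>

/*
 * Decide from an already sampled load average.
 * Returns 1 if a batch job may run, 0 if it should wait.
 */
int
batch_may_run(double loadavg, double maxload)
{
	assert(maxload >= 0.0);
	if (maxload == 0.0)
		return 1;	/* 0.0 means: run regardless of the load */
	return loadavg <= maxload;	/* held back only when load is greater */
}

/* Sample the 1-minute load average and decide; -1 if it is unavailable. */
int
batch_may_run_now(double maxload)
{
	double load[1];

	if (getloadavg(load, 1) != 1)
		return -1;
	return batch_may_run(load[0], maxload);
}
```

With the laptop load of 1.65 mentioned in the thread, batch_may_run(1.65,
1.5) returns 0 under the old default, while batch_may_run(1.65, 8.0)
returns 1 under the new one.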