Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> 
> * William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> 
> > [...] Also rest assured that the tone of the critique is not hostile, 
> > and wasn't meant to sound that way.
> 
> ok :) (And i guess i was too touchy - sorry about coming out swinging.)
> 
> > Also, given the general comments it appears clear that some 
> > statistical metric of deviation from the intended behavior furthermore 
> > qualified by timescale is necessary, so this appears to be headed 
> > toward a sort of performance metric as opposed to a pass/fail test 
> > anyway. However, to even measure this at all, some statement of 
> > intention is required. I'd prefer that there be a Linux-standard 
> > semantics for nice so results are more directly comparable and so that 
> > users also get similar nice behavior from the scheduler as it varies 
> > over time and possibly implementations if users should care to switch 
> > them out with some scheduler patch or other.
> 
> yeah. If you could come up with a sane definition that also translates 
> into low overhead on the algorithm side that would be great!

How's this:

If you're running two identical CPU hog tasks A and B differing only by nice
level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
constant f(Anice - Bnice).

Other definitions make things hard to analyze and are probably not
well-bounded when confronted with > 2 tasks.

I -think- this implies keeping a separate scaled CPU usage counter,
where the scaling factor is a trivial exponential function of nice
level where f(0) == 1. Then you schedule based on this scaled usage
counter rather than unscaled.

I also suspect we want to keep the exponential base small so that the
maximal difference is 10x-100x.
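
A rough sketch of that scaled-usage idea, where the 1.12 base, the
fixed-point shift and all the names are purely illustrative rather than
from any posted patch:

#include <stdint.h>

#define NICE_SCALE_SHIFT 10             /* fixed point: 1.0 == 1 << 10 */

struct task_usage {
	int nice;                       /* -20 .. 19 */
	uint64_t scaled_usage;          /* what the scheduler would sort on */
};

/* weight(nice) = 1.12^nice in fixed point, so weight(0) == 1.0 and the
 * spread across nice -20..19 stays within roughly the 10x-100x range */
static uint32_t nice_weight(int nice)
{
	double w = 1.0;
	int i, steps = nice < 0 ? -nice : nice;

	for (i = 0; i < steps; i++)
		w *= 1.12;
	if (nice < 0)
		w = 1.0 / w;
	return (uint32_t)(w * (1 << NICE_SCALE_SHIFT));
}

/* charge 'ran' nanoseconds of CPU time, scaled by the task's nice weight */
static void account_cpu(struct task_usage *t, uint64_t ran)
{
	t->scaled_usage += (ran * nice_weight(t->nice)) >> NICE_SCALE_SHIFT;
}

A scheduler that always runs the task with the smallest scaled_usage
would then hand out CPU time to two hogs in a constant ratio determined
only by their nice difference.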

-- 
Mathematics is the supreme nostalgia of our time.


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > >> All things are not equal; they all have different properties. I like
> > 
> > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > Exactly. So we have to explore those properties and evaluate performance
> > > (in all meanings of the word). That's only logical.
> > 
> > Any chance you'd be willing to put down a few thoughts on what sorts
> > of standards you'd like to set for both correctness (i.e. the bare
> > minimum a scheduler implementation must do to be considered valid
> > beyond not oopsing) and performance metrics (i.e. things that produce
> > numbers for each scheduler you can compare to say "this scheduler is
> > better than this other scheduler at this.").
> 
> Yeah I guess that's the hard part :)
> 
> For correctness, I guess fairness is an easy one. I think that unfairness
> is basically a bug and that it would be very unfortunate to merge something
> unfair. But this is just within the context of a single runqueue... for
> better or worse, we allow some unfairness in multiprocessors for performance
> reasons of course.

I'm a big fan of fairness, but I think it's a bit early to declare it
a mandatory feature. Bounded unfairness is probably something we can
agree on, ie "if we decide to be unfair, no process suffers more than
a factor of x".
 
> Latency. Given N tasks in the system, an arbitrary task should get
> onto the CPU in a bounded amount of time (excluding events like freak
> IRQ holdoffs and such, obviously -- ie. just considering the context
> of the scheduler's state machine).

This is a slightly stronger statement than starvation-free (which is
obviously mandatory). I think you're looking for something like
"worst-case scheduling latency is proportional to the number of
runnable tasks". Which I think is quite a reasonable requirement.

I'm pretty sure the stock scheduler falls short of both of these
guarantees though.
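
That property is easy enough to probe from userspace with a toy along
these lines (purely illustrative, not an actual posted test): start N
hogs, sleep in 10ms chunks, and watch how late the wakeups come back as
N grows.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(int argc, char **argv)
{
	int i, nhogs = argc > 1 ? atoi(argv[1]) : 4;
	long long worst = 0;

	for (i = 0; i < nhogs; i++)
		if (fork() == 0)
			for (;;)
				;	/* CPU hog child */

	for (i = 0; i < 1000; i++) {
		struct timespec req = { 0, 10 * 1000 * 1000 };	/* 10ms */
		long long t0 = now_ns();
		long long late;

		nanosleep(&req, NULL);
		late = now_ns() - t0 - 10000000LL;
		if (late > worst)
			worst = late;
	}
	printf("%d hogs: worst wakeup delay %lld us\n", nhogs, worst / 1000);
	return 0;	/* hog children are left running; kill them by hand */
}

How the worst-case delay grows with the hog count is the thing to watch;
roughly linear growth is what the requirement above asks for.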

-- 
Mathematics is the supreme nostalgia of our time.


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
> That value has the property that a nice=10 task gets 1/10th the cpu of a 
> nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that 
> would be fairly easy to explain to admins and users so that they can 
> know what to expect from nicing tasks.
[...additional good commentary trimmed...]

Lots of good ideas here. I'll follow them.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote:
> > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> > > I'd prefer if we kept a single CPU scheduler in mainline, because I
> > > think that simplifies analysis and focuses testing.
> > 
> > I think you'll find something like 80-90% of the testing will be done
> > on the default choice, even if other choices exist. So you really
> > won't have much of a problem here.
> > 
> > But when the only choice for other schedulers is to go out-of-tree,
> > then only 1% of the people will try it out and those people are
> > guaranteed to be the ones who saw scheduling problems in mainline.
> > So the alternative won't end up getting any testing on many of the
> > workloads that work fine in mainline, so their feedback won't tell
> > you very much at all.
> 
> Yeah I concede that perhaps it is the only way to get things going
> any further. But how do we decide if and when the current scheduler
> should be demoted from default, and which should replace it?

Step one is ship both in -mm. If that doesn't give us enough
confidence, ship both in mainline. If that doesn't give us enough
confidence, wait until vendors ship both. Eventually a clear picture
should emerge. If it doesn't, either the change is not significant or
no one cares.

But it really is important to be able to do controlled experiments on
this stuff with little effort. That's the recipe for getting lots of
valid feedback.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Chris Friesen

Peter Williams wrote:
> Chris Friesen wrote:
>> Scuse me if I jump in here, but doesn't the load balancer need some
>> way to figure out a) when to run, and b) which tasks to pull and where
>> to push them?
>
> Yes but both of these are independent of the scheduler discipline in force.

It is not clear to me that this is always the case, especially once you
mix in things like resource groups.

> If the load balancer manages to keep the weighted (according to static
> priority) load and distribution of priorities within the loads on the
> CPUs roughly equal and the scheduler does a good job of ensuring
> fairness, interactive responsiveness etc. for the tasks within a CPU
> then the result will be good system performance within the constraints
> set by the sys admin's use of real time priorities and nice.

Suppose I have a really high priority task running.  Another very high
priority task wakes up and would normally preempt the first one.
However, there happens to be another cpu available.  It seems like it
would be a win if we moved one of those tasks to the available cpu
immediately so they can both run simultaneously.  This would seem to
require some communication between the scheduler and the load balancer.

Certainly the above design could introduce a lot of context switching.
But if my goal is a scheduler that minimizes latency (even at the cost
of throughput) then that's an acceptable price to pay.

Chris


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> Comments on which directions you'd like this to go in these respects
>>> would be appreciated, as I regard you as the current "project owner."
>
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> I'd do a scan through LKML from about 18 months ago looking for mention
>> of a runtime configurable version of plugsched.  Some students at a
>> university (in Germany, I think) posted some patches adding this feature
>> to plugsched around about then.
>
> Excellent. I'll go hunting for that.
>
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> I never added them to plugsched proper as I knew (from previous
>> experience when the company I worked for posted patches with similar
>> functionality) that Linus would like this idea less than he did the
>> current plugsched mechanism.
>
> Odd how the requirements ended up including that. Fickleness abounds.
> If only we knew up-front what the end would be.
>
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> Unfortunately, my own cache of the relevant e-mails got overwritten
>> during a Fedora Core upgrade (I've since moved /var onto a separate
>> drive to avoid a repetition) or I would dig them out and send them to
>> you.  I'd been provided with copies of the company's patches to use as a
>> guide to how to overcome the problems associated with changing
>> schedulers on a running system (a few non-trivial locking issues pop up).
>>
>> Maybe if one of the students still reads LKML he will provide a pointer.
>
> I was tempted to restart from scratch given Ingo's comments, but I
> reconsidered and I'll be working with your code (and the German
> students' as well). If everything has to change, so be it, but it'll
> still be a derived work. It would be ignoring precedent and a failure to
> properly attribute if I did otherwise.

I can give you a patch (or set of patches) against the latest git
vanilla kernel version if that would help.  There have been changes to
the vanilla scheduler code since 2.6.20 so the latest patch on
sourceforge won't apply cleanly.  I've found that implementing this as a
series of patches rather than one big patch makes it easier for me to
cope with changes to the underlying code.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>>> Maybe the progress is that more key people are becoming open to
>>>> the idea of changing the scheduler.
>>>
>>> Could be.  All was quiet for quite a while, but when RSDL showed up,
>>> it aroused enough interest to show that scheduling woes are on folks'
>>> radar.
>>
>> Well I know people have had woes with the scheduler for ever (I guess
>> that isn't going to change :P). [...]
>
> yes, that part isnt going to change, because the CPU is a _scarce
> resource_ that is perhaps the most frequently overcommitted physical
> computer resource in existence, and because the kernel does not (yet)
> track eye movements of humans to figure out which tasks are more
> important to them. So critical human constraints are unknown to the
> scheduler and thus complaints will always come.
>
> The upstream scheduler thought it had enough information: the sleep
> average. So now the attempt is to go back and _simplify_ the scheduler
> and remove that information, and concentrate on getting fairness
> precisely right. The magic thing about 'fairness' is that it's a pretty
> good default policy if we decide that we simply have not enough
> information to do an intelligent choice.
>
> ( Lets be cautious though: the jury is still out whether people actually
>   like this more than the current approach. While CFS feedback looks
>   promising after a whopping 3 days of it being released [ ;-) ], the
>   test coverage of all 'fairness centric' schedulers, even considering
>   years of availability, is less than 1% i'm afraid, and that < 1% was
>   mostly self-selecting. )


At this point I'd like to make the observation that spa_ebs is a very 
fair scheduler if you consider "nice" to be an indication of the 
relative entitlement of tasks to CPU bandwidth.


It works by mapping nice to shares using a function very similar to the
one for calculating p->load_weight, except it's not offset by the RT
priorities as RT is handled separately.  In theory, a runnable task's
entitlement to CPU bandwidth at any time is the ratio of its shares to
the total shares held by runnable tasks on the same CPU (in reality, a
smoothed average of this sum is used to make scheduling smoother).  The
dynamic priorities of the runnable tasks are then fiddled to try to keep
each task's CPU bandwidth usage in proportion to its entitlement.


That's the theory anyway.

The actual implementation looks a bit different due to efficiency 
considerations.  The modifications to the above theory boil down to 
keeping a running measure of the (recent) highest CPU bandwidth use per 
share for tasks running on the CPU -- I call this the yardstick for this 
CPU.  When it's time to put a task on the run queue, its dynamic
priority is determined by comparing its CPU bandwidth per share value
with the yardstick for its CPU.  If it's greater than the yardstick this 
value becomes the new yardstick and the task gets given the lowest 
possible dynamic priority (for its scheduling class).  If the value is 
zero it gets the highest possible priority (for its scheduling class) 
which would be MAX_RT_PRIO for a SCHED_OTHER task.  Otherwise it gets 
given a priority between these two extremes proportional to the ratio of its
CPU bandwidth per share value and the yardstick.  Quite simple really.
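
A paraphrase of that in code, with made-up names and priority ranges
rather than anything lifted from the actual spa_ebs source, would be
roughly:

#define MAX_RT_PRIO	100	/* SCHED_OTHER dynamic priorities: 100..139 */
#define MAX_PRIO	140

struct ebs_task {
	unsigned long shares;		/* derived from nice */
	unsigned long usage_per_share;	/* recent CPU bandwidth / shares */
};

struct ebs_rq {
	unsigned long yardstick;	/* highest usage_per_share seen lately */
};

static int ebs_dynamic_prio(struct ebs_rq *rq, struct ebs_task *p)
{
	if (p->usage_per_share == 0)
		return MAX_RT_PRIO;		/* highest non-RT priority */

	if (p->usage_per_share >= rq->yardstick) {
		/* a new record becomes the yardstick, gets the lowest priority */
		rq->yardstick = p->usage_per_share;
		return MAX_PRIO - 1;
	}

	/* otherwise interpolate between the two extremes */
	return MAX_RT_PRIO + (int)(((MAX_PRIO - 1 - MAX_RT_PRIO) *
				    p->usage_per_share) / rq->yardstick);
}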


The other way in which the code deviates from the original is that (for
a few years now) I no longer calculate CPU bandwidth usage directly.
I've found that the overhead is less if I keep a running average of the
size of a task's CPU bursts and the length of its scheduling cycle (i.e.
from going on CPU one time to going on CPU the next time) and use the
ratio of these values as a measure of bandwidth usage.


Anyway it works and gives very predictable allocations of CPU bandwidth 
based on nice.


Another good feature is that (in this pure form) it's starvation-free.
However, if you fiddle with it and do things like giving bonus priority
boosts to interactive tasks, it becomes susceptible to starvation.  This
can be fixed by using an anti-starvation mechanism such as SPA's
promotion scheme, and that's what spa_ebs does.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:

Nick Piggin wrote:
Other hints that it was a bad idea was the need to transfer time slices 
between children and parents during fork() and exit().

I don't see how that has anything to do with dual arrays.
It's totally to do with the dual arrays.  The only real purpose of the 
time slice in O(1) (regardless of what its perceived purpose was) was to 
control the switching between the arrays.


The O(1) design is pretty convoluted in that regard. In my scheduler,
the only purpose of the arrays is to renew time slices.

The fork/exit logic is added to make interactivity better. Ingo's
scheduler has similar equivalent logic.



If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependant forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.
This is an artefact of trying to control nice using time slices while 
using them for controlling array switching and whatever else they were 
being used for.  Priority (static and dynamic) is the best way to 
implement nice.


I don't like the timeslice based nice in mainline. It's too nasty
with latencies. nicksched is far better in that regard IMO.

But I don't know how you can assert a particular way is the best way
to do something.


I should have added "I may be wrong but I think that ...".

My opinion is based on a lot of experience with different types of 
scheduler design and the observation from gathering scheduling 
statistics while playing with these schedulers that the size of the time
slices we're talking about is much larger than the CPU chunks most tasks
use in any one go, so time slice size has no real effect on most tasks,
and the faster CPUs become the more this holds true.






You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.

Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?
No.  It would be nice if the basic ideas that each scheduler tries to 
implement could be extracted and explained though.  This could lead to a 
melding of ideas that leads to something quite good.





I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...
I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.

I agree starvation or unfairness is unacceptable for a new scheduler.



For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?
In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)

O(logN) vs O(1) is technical grounds.
In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
greater than O(logN)'s k factor multiplied by logMaxN.


Yes, or even significantly greater around typical large sizes of N.


Yes.  In fact it's probably better to use the maximum number of threads
allowed on the system for N.  We know that value, don't we?


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread James Bruce

Chris Friesen wrote:
> William Lee Irwin III wrote:
>> The sorts of explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>> corresponds to some share of CPU bandwidth. Implementations
>> should not have the freedom to change this arbitrarily according
>> to some intention.
>
> The first question that comes to my mind is whether nice levels should
> be linear or not.  I would lean towards nonlinear as it allows a wider
> range (although of course at the expense of precision).  Maybe something
> like "each nice level gives X times the cpu of the previous"?  I think a
> value of X somewhere between 1.15 and 1.25 might be reasonable.

Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589

That value has the property that a nice=10 task gets 1/10th the cpu of a
nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that
would be fairly easy to explain to admins and users so that they can
know what to expect from nicing tasks.
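
A few lines of arithmetic show why that particular X works out so neatly
(purely a demonstration):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double x = exp(log(10.0) / 10.0);	/* ~1.2589 */
	int nice;

	for (nice = -20; nice <= 19; nice++)
		/* CPU share relative to a nice=0 task */
		printf("nice %3d -> relative share %8.4f\n", nice, pow(x, -nice));

	printf("nice 10 vs nice 0: %.4f (i.e. 1/10th)\n", pow(x, -10));
	printf("nice 20 vs nice 0: %.4f (i.e. 1/100th)\n", pow(x, -20));
	return 0;
}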


> What about also having something that looks at latency, and how latency
> changes with niceness?

I think this would be a lot harder to pin down, since it's a function of
all the other tasks running and their nice levels.  Do you have any of
the RT-derived analysis models in mind?

> What about specifying the timeframe over which the cpu bandwidth is
> measured?  I currently have a system where the application designers
> would like it to be totally fair over a period of 1 second.  As you can
> imagine, mainline doesn't do very well in this case.


It might be easier to specify the maximum deviation from the ideal 
bandwidth over a certain period.  I.e. something like "over a period of 
one second, each task receives within 10% of the expected bandwidth".
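
A crude sketch of such a check, assuming more runnable tasks than CPUs so
each task's ideal share is well defined (everything here, including the
one-second window and the names, is illustrative):

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NTASKS 8	/* assumed to exceed the number of CPUs */

static volatile int stop;

static void *hog(void *arg)
{
	struct timespec ts;

	while (!stop)
		;			/* burn CPU for the whole window */
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
	*(double *)arg = ts.tv_sec + ts.tv_nsec / 1e9;
	return NULL;
}

int main(void)
{
	pthread_t tid[NTASKS];
	double cpu[NTASKS];
	double ideal = (double)sysconf(_SC_NPROCESSORS_ONLN) / NTASKS;
	int i;

	for (i = 0; i < NTASKS; i++)
		pthread_create(&tid[i], NULL, hog, &cpu[i]);
	sleep(1);			/* the one-second measurement window */
	stop = 1;
	for (i = 0; i < NTASKS; i++)
		pthread_join(tid[i], NULL);

	for (i = 0; i < NTASKS; i++)
		printf("task %d: %.3fs cpu, %+5.1f%% from ideal %.3fs\n",
		       i, cpu[i], 100.0 * (cpu[i] - ideal) / ideal, ideal);
	return 0;
}

Whether the worst deviation printed stays within the agreed percentage
over the window would then be the pass/fail criterion.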


 - Jim Bruce



Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > 2.6.21-rc7-cfs-v2
> > 534.80user 30.92system 2:23.64elapsed 393%CPU
> > 534.75user 31.01system 2:23.70elapsed 393%CPU
> > 534.66user 31.07system 2:23.76elapsed 393%CPU
> > 534.56user 30.91system 2:23.76elapsed 393%CPU
> > 534.66user 31.07system 2:23.67elapsed 393%CPU
> > 535.43user 30.62system 2:23.72elapsed 393%CPU
> 
> Thanks for testing this! Could you please try this also with:
> 
>echo 1 > /proc/sys/kernel/sched_granularity
> 
> on the same system, so that we can get a complete set of numbers? Just 
> to make sure that lowering the preemption frequency indeed has the 
> expected result of moving kernbench numbers back to mainline levels. (if 
> not then that would indicate some CFS buglet)
> 
> could you maybe even try a more extreme setting of:
> 
>echo 5 > /proc/sys/kernel/sched_granularity
> 
> for kicks? This would allow us to see how much kernbench we lose due to 
> preemption granularity. Thanks!

Yeah but I just powered down the test-box, so I'll have to get onto
that tomorrow.


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> Comments on which directions you'd like this to go in these respects
>> would be appreciated, as I regard you as the current "project owner."

On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> I'd do a scan through LKML from about 18 months ago looking for mention
> of a runtime configurable version of plugsched.  Some students at a 
> university (in Germany, I think) posted some patches adding this feature 
> to plugsched around about then.

Excellent. I'll go hunting for that.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> I never added them to plugsched proper as I knew (from previous 
> experience when the company I worked for posted patches with similar 
> functionality) that Linus would like this idea less than he did the 
> current plugsched mechanism.

Odd how the requirements ended up including that. Fickleness abounds.
If only we knew up-front what the end would be.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> Unfortunately, my own cache of the relevant e-mails got overwritten 
> during a Fedora Core upgrade (I've since moved /var onto a separate 
> drive to avoid a repetition) or I would dig them out and send them to 
> you.  I'd been provided with copies of the company's patches to use as a 
> guide to how to overcome the problems associated with changing 
> schedulers on a running system (a few non-trivial locking issues pop up).
> Maybe if one of the students still reads LKML he will provide a pointer.

I was tempted to restart from scratch given Ingo's comments, but I
reconsidered and I'll be working with your code (and the German
students' as well). If everything has to change, so be it, but it'll
still be a derived work. It would be ignoring precedent and a failure to
properly attribute if I did otherwise.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
>
> > until now the main approach for nice levels in Linux was always: 
> > "implement your main scheduling logic for nice 0 and then look for 
> > some low-overhead method that can be glued to it that does something 
> > that behaves like nice levels". Feel free to turn that around into a 
> > more natural approach, but the algorithm should remain fairly simple 
> > i think.
> 
> Part of my insistence was because it seemed to be relatively close to 
> a one-liner, though I'm not entirely sure what particular computation 
> to use to handle the signedness of the keys. I guess I could pick some 
> particular nice semantics myself and then sweep the extant schedulers 
> to use them after getting a testcase hammered out.

i'd love to have a oneliner solution :-)

wrt. signedness: note that in v2 i have made rq_running signed, and most 
calculations (especially those related to nice) are signed values. (On 
64-bit systems this all isnt a big issue - most of the arithmetics 
gymnastics in CFS are done to keep deltas within 32 bits, so that 
divisions and multiplications are sane.)

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> 2.6.21-rc7-cfs-v2
> 534.80user 30.92system 2:23.64elapsed 393%CPU
> 534.75user 31.01system 2:23.70elapsed 393%CPU
> 534.66user 31.07system 2:23.76elapsed 393%CPU
> 534.56user 30.91system 2:23.76elapsed 393%CPU
> 534.66user 31.07system 2:23.67elapsed 393%CPU
> 535.43user 30.62system 2:23.72elapsed 393%CPU

Thanks for testing this! Could you please try this also with:

   echo 1 > /proc/sys/kernel/sched_granularity

on the same system, so that we can get a complete set of numbers? Just 
to make sure that lowering the preemption frequency indeed has the 
expected result of moving kernbench numbers back to mainline levels. (if 
not then that would indicate some CFS buglet)

could you maybe even try a more extreme setting of:

   echo 5 > /proc/sys/kernel/sched_granularity

for kicks? This would allow us to see how much kernbench we lose due to 
preemption granularity. Thanks!

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Also, given the general comments it appears clear that some 
>> statistical metric of deviation from the intended behavior furthermore 
>> qualified by timescale is necessary, so this appears to be headed 
>> toward a sort of performance metric as opposed to a pass/fail test 
[...]

On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> yeah. If you could come up with a sane definition that also translates 
> into low overhead on the algorithm side that would be great! The only 
> good generic definition i could come up with (nice levels are isolated 
> buckets with a constant maximum relative percentage of CPU time 
> available to every active bucket) resulted in having a per-nice-level 
> array of rbtree roots, which did not look worth the hassle at first 
> sight :-)

Interesting! That's what vdls did, except its fundamental data structure
was more like a circular buffer (resembling Davide Libenzi's timer ring
in concept, but with all the details different).
I'm not entirely sure how that would've turned out performancewise if
I'd done any tuning at all. I was mostly interested in doing something
like what I heard Bob Mullens did in 1976 for basic pedagogical value
about schedulers to prepare for writing patches for gang scheduling as
opposed to creating a viable replacement for the mainline scheduler.

I'm relatively certain a different key calculation will suffice, but
it may disturb other desired semantics, since the keys really need to be
nonnegative for multiplying by a scaling factor corresponding to a task's
nice number to work properly. Well, as the cfs code now stands, it
would correspond to negative keys. Dividing positive keys by the nice
scaling factor is my first thought of how to extend the method to the
current key semantics. Or such are my thoughts on the subject.

I expect that all that's needed is to fiddle with those numbers a bit.
There's quite some capacity for expression there given the precision.
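
In toy form the arithmetic being talked about is just something like the
following, where the weight table and fixed-point shift are invented for
illustration and this is not the actual CFS key code:

#include <stdint.h>

#define WEIGHT_SHIFT	10	/* fixed point: 1.0 == 1024 */

/* roughly 1.25 per nice step; only nice -2..+2 shown for brevity */
static const unsigned int nice_to_weight[5] = { 1600, 1280, 1024, 819, 655 };

/*
 * Divide a (nonnegative) key by the weight for the task's nice level, so
 * lower-nice tasks end up with smaller keys and sort earlier in the queue.
 */
static uint64_t scale_key(uint64_t key, int nice)
{
	return (key << WEIGHT_SHIFT) / nice_to_weight[nice + 2];
}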


On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> until now the main approach for nice levels in Linux was always: 
> "implement your main scheduling logic for nice 0 and then look for some 
> low-overhead method that can be glued to it that does something that 
> behaves like nice levels". Feel free to turn that around into a more 
> natural approach, but the algorithm should remain fairly simple i think.

Part of my insistence was because it seemed to be relatively close to a
one-liner, though I'm not entirely sure what particular computation to
use to handle the signedness of the keys. I guess I could pick some
particular nice semantics myself and then sweep the extant schedulers
to use them after getting a testcase hammered out.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > Maybe the progress is that more key people are becoming open to 
> > > the idea of changing the scheduler.
> > 
> > Could be.  All was quiet for quite a while, but when RSDL showed up, 
> > it aroused enough interest to show that scheduling woes is on folks 
> > radar.
> 
> Well I know people have had woes with the scheduler for ever (I guess 
> that isn't going to change :P). [...]

yes, that part isnt going to change, because the CPU is a _scarce 
resource_ that is perhaps the most frequently overcommitted physical 
computer resource in existence, and because the kernel does not (yet) 
track eye movements of humans to figure out which tasks are more 
important to them. So critical human constraints are unknown to the 
scheduler and thus complaints will always come.

The upstream scheduler thought it had enough information: the sleep 
average. So now the attempt is to go back and _simplify_ the scheduler 
and remove that information, and concentrate on getting fairness 
precisely right. The magic thing about 'fairness' is that it's a pretty 
good default policy if we decide that we simply have not enough 
information to do an intelligent choice.

( Lets be cautious though: the jury is still out whether people actually 
  like this more than the current approach. While CFS feedback looks 
  promising after a whopping 3 days of it being released [ ;-) ], the 
  test coverage of all 'fairness centric' schedulers, even considering 
  years of availability, is less than 1% i'm afraid, and that < 1% was 
  mostly self-selecting. )

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> There's a lot of ugly code in the load balancer that is only there to 
> overcome the side effects of SMT and dual core.  A lot of it was put 
> there by Intel employees trying to make load balancing more friendly 
> to their systems.  What I'm suggesting is that an N CPUs per runqueue 
> is a better way of achieving that end.  I may (of course) be wrong but 
> I think that the idea deserves more consideration than you're willing 
> to give it.

i actually implemented that some time ago and i'm afraid it was ugly as 
hell and pretty fragile. Load-balancing gets simpler, but task picking 
gets a lot uglier.

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote:
> > 
> > as usual, any sort of feedback, bugreports, fixes and suggestions are
> > more than welcome,
> 
> Pushed this through the test.kernel.org and nothing new blew up.
> Notably the kernbench figures are within expectations even on the bigger
> numa systems, commonly badly affected by balancing problems in the
> scheduler.
> 
> I see there is a second one out, I'll push that one through too.

Well I just sent some feedback on cfs-v2, but realised it went off-list,
so I'll resend here because others may find it interesting too. Sorry
about jamming it in here, but it is relevant to performance...

Anyway, roughly in the context of good cfs-v2 interactivity, I wrote:

Well I'm not too surprised. I am disappointed that it uses such small
timeslices (or whatever they are called) as the default.

Using small timeslices is actually a pretty easy way to ensure everything
stays smooth even under load, but is bad for efficiency. Sure you can say
you'll have desktop and server tunings, but... With nicksched I'm testing
a default timeslice of *300ms* even on the desktop, whereas Ingo's seems
to be effectively 3ms :P So if you compare default tunings, it isn't
exactly fair!

Kbuild times on a 2x Xeon:

2.6.21-rc7
508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU

2.6.21-rc7-cfs-v2
534.80user 30.92system 2:23.64elapsed 393%CPU
534.75user 31.01system 2:23.70elapsed 393%CPU
534.66user 31.07system 2:23.76elapsed 393%CPU
534.56user 30.91system 2:23.76elapsed 393%CPU
534.66user 31.07system 2:23.67elapsed 393%CPU
535.43user 30.62system 2:23.72elapsed 393%CPU

2.6.21-rc7-nicksched
505.60user 32.31system 2:17.91elapsed 390%CPU
506.55user 32.42system 2:17.66elapsed 391%CPU
506.41user 32.30system 2:17.85elapsed 390%CPU
506.48user 32.36system 2:17.77elapsed 391%CPU
506.10user 32.40system 2:17.81elapsed 390%CPU
506.69user 32.16system 2:17.78elapsed 391%CPU



Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> [...] Also rest assured that the tone of the critique is not hostile, 
> and wasn't meant to sound that way.

ok :) (And i guess i was too touchy - sorry about coming out swinging.)

> Also, given the general comments it appears clear that some 
> statistical metric of deviation from the intended behavior furthermore 
> qualified by timescale is necessary, so this appears to be headed 
> toward a sort of performance metric as opposed to a pass/fail test 
> anyway. However, to even measure this at all, some statement of 
> intention is required. I'd prefer that there be a Linux-standard 
> semantics for nice so results are more directly comparable and so that 
> users also get similar nice behavior from the scheduler as it varies 
> over time and possibly implementations if users should care to switch 
> them out with some scheduler patch or other.

yeah. If you could come up with a sane definition that also translates 
into low overhead on the algorithm side that would be great! The only 
good generic definition i could come up with (nice levels are isolated 
buckets with a constant maximum relative percentage of CPU time 
available to every active bucket) resulted in having a per-nice-level 
array of rbtree roots, which did not look worth the hassle at first 
sight :-)

until now the main approach for nice levels in Linux was always: 
"implement your main scheduling logic for nice 0 and then look for some 
low-overhead method that can be glued to it that does something that 
behaves like nice levels". Feel free to turn that around into a more 
natural approach, but the algorithm should remain fairly simple i think.

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> The additive nice_offset breaks nice levels. A multiplicative priority 
>> weighting of a different, nonnegative metric of cpu utilization from 
>> what's now used is required for nice levels to work. I've been trying 
>> to point this out politely by strongly suggesting testing whether nice 
>> levels work.

On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> granted, CFS's nice code is still incomplete, but you err quite 
> significantly with this extreme statement that they are "broken".

I used the word relatively loosely. Nothing extreme is going on.
Maybe the phrasing exaggerated the force of the opinion. I'm sorry
about having misspoken so.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> nice levels certainly work to a fair degree even in the current code and 
> much of the focus is elsewhere - just try it. (In fact i claim that 
> CFS's nice levels often work _better_ than the mainline scheduler's nice 
> level support, for the testcases that matter to users.)

Al Boldi's testcase appears to reveal some issues. I'm plotting a
testcase of my own if I can ever get past responding to email.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> The precise behavior of nice levels, as i pointed it out in previous 
> mails, is largely 'uninteresting' and it has changed multiple times in 
> the past 10 years.

I expect that whether a scheduler can handle such prioritization has a
rather strong predictive quality regarding whether it can handle, say,
CKRM controls. I remain convinced that there should be some target
behavior and that some attempt should be made to achieve it. I don't
think any particular behavior is best, just that the behavior should
be well-defined.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> What matters to users is mainly: whether X reniced to -10 does get 
> enough CPU time and whether stuff reniced to +19 doesnt take away too 
> much CPU time from the rest of the system. _How_ a Linux scheduler 
> achieves this is an internal matter and certainly CFS does it in a hacky 
> way at the moment.

It's not so far out. Basically just changing the key calculation in a
relatively simple manner should get things into relatively good shape.
It can, of course, be done other ways (I did it a rather different way
in vdls, though that method is not likely to be considered desirable).

I can't really write a testcase for such loose semantics, so the above
description is useless to me. These squishy sorts of definitions of
semantics are also uninformative to users, who, I would argue, do have
some interest in what nice levels mean. There have been at least a small
number of concerns about the strength of nice levels, and it would
reveal issues surrounding that area earlier if there were an objective
definition one could test to see if it were achieved.

It's furthermore a user-visible change in system call semantics we
should be more careful about changing out from beneath users.

So I see a lot of good reasons to pin down nice numbers. Incompleteness
is not a particularly mortal sin, but the proliferation of competing
schedulers is creating a need for standards, and that's what I'm really
on about.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
> could come up with is just a fancy academic technicality that has no 
> real significance to any of the testers who are trying CFS right now.

I could say "percent cpu" if it sounds less like formal jargon, which
"CPU bandwidth utilization" isn't really.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> Sure we prefer final solutions that are clean and make sense (because 
> such things are the easiest to maintain long-term), and often such final 
> solutions are quite close to academic concepts, and i think Davide 
> correctly observed this by saying that "the final key calculation code 
> will end up quite different from what it looks now", but your 
> extreme-end claim of 'breakage' for something that is just plain 
> incomplete is not really a fair characterisation at this point.

It wasn't meant to be quite as strong a statement as it came out.
Sorry about that.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> Anyone who thinks that there exists only two kinds of code: 100% correct 
> and 100% incorrect with no shades of grey inbetween is in reality a sort 
> of an extremist: whom, depending on mood and affection, we could call 
> either a 'coding purist' or a 'coding taliban' ;-)

I've made no such claims. Also rest assured that the tone of the
critique is not hostile, and wasn't meant to sound that way.

Also, given the general comments it appears clear that some statistical
metric of deviation from the intended behavior furthermore qualified by
timescale is necessary, so this appears to be headed toward a sort of
performance metric as opposed to a pass/fail test anyway.

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> > And my scheduler for example cuts down the amount of policy code and 
> > code size significantly.
> 
> Yours is one of the smaller patches mainly because you perpetuate (or 
> you did in the last one I looked at) the (horrible to my eyes) dual 
> array (active/expired) mechanism.  That this idea was bad should have 
> been apparent to all as soon as the decision was made to excuse some 
> tasks from being moved from the active array to the expired array.  
> This essentially meant that there would be circumstances where extreme 
> unfairness (to the extent of starvation in some cases) could occur -- the 
> very thing that the mechanism was originally designed to prevent (as far 
> as I can gather).  Right about then in the development of the O(1) 
> scheduler alternative solutions should have been sought.

in hindsight i'd agree. But back then we were clearly not ready for 
fine-grained accurate statistics + trees (cpus are alot faster at more 
complex arithmetics today, plus people still believed that low-res can 
be done well enough), and taking out any of these two concepts from CFS 
would result in a similarly complex runqueue implementation. Also, the 
array switch was just thought to be another piece of 'if the 
heuristics go wrong, we fall back to an array switch' logic, right in 
line with the other heuristics. And you have to accept it, mainline's 
ability to auto-renice make -j jobs (and other CPU hogs) was quite a 
plus for developers, so it had (and probably still has) quite some 
inertia.

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>> This observation of Peter's is the best thing to come out of this
>> whole foofaraw.  Looking at what's happening in CPU-land, I think it's
>> going to be necessary, within a couple of years, to replace the whole
>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>> possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
>> point in churning the mainline scheduler through a design that isn't
>> significantly more flexible than any of those now under discussion.

On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
> Why? If you do that, then your load balancer just becomes less flexible
> because it is harder to have tasks run on one or the other.

On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
> You can have single-runqueue-per-domain behaviour (or close to) just by
> relaxing all restrictions on idle load balancing within that domain. It
> is harder to go the other way and place any per-cpu affinity or
> restirctions with multiple cpus on a single runqueue.

The big sticking point here is order-sensitivity. One can point to
stringent sched_yield() ordering but that's not so important in and of
itself. The more significant case is RT applications which are order-
sensitive. Per-cpu runqueues rather significantly disturb the ordering
requirements of applications that care about it.

In terms of a plugging framework, the per-cpu arrangement precludes or
makes extremely awkward scheduling policies that don't have per-cpu
runqueues, for instance, the 2.4.x policy. There is also the alternate
SMP scalability strategy of a lockless scheduler with a single global
queue, which is more performance-oriented.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
>> Any chance you'd be willing to put down a few thoughts on what sorts
>> of standards you'd like to set for both correctness (i.e. the bare
>> minimum a scheduler implementation must do to be considered valid
>> beyond not oopsing) and performance metrics (i.e. things that produce
>> numbers for each scheduler you can compare to say "this scheduler is
>> better than this other scheduler at this.").

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Yeah I guess that's the hard part :)
> For correctness, I guess fairness is an easy one. I think that unfairness
> is basically a bug and that it would be very unfortunate to merge something
> unfair. But this is just within the context of a single runqueue... for
> better or worse, we allow some unfairness in multiprocessors for performance
> reasons of course.

Requiring that identical tasks be allocated equal shares of CPU
bandwidth is the easy part here. ringtest.c exercises another aspect
of fairness that is extremely important. Generalizing ringtest.c is
a good idea for fairness testing.

But another aspect of fairness is that "controlled unfairness" is also
intended to exist, in no small part by virtue of nice levels, but also
in the form of favoring tasks that are considered interactive somehow.
Testing various forms of controlled unfairness to ensure that they are
indeed controlled and otherwise have the semantics intended is IMHO the
more difficult aspect of fairness testing.


On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Latency. Given N tasks in the system, an arbitrary task should get
> onto the CPU in a bounded amount of time (excluding events like freak
> IRQ holdoffs and such, obviously -- ie. just considering the context
> of the scheduler's state machine).

ISTR Davide Libenzi having a scheduling latency test a number of years
ago. Resurrecting that and tuning it to the needs of this kind of
testing sounds relevant here. The test suite Peter Williams mentioned
would also help.


On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> I wouldn't like to see a significant drop in any micro or macro
> benchmarks or, even worse, real workloads, but I could accept some if it
> means having a fair scheduler by default.

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Now it isn't actually too hard to achieve the above, I think. The hard bit
> is trying to compare interactivity. Ideally, we'd be able to get scripted
> dumps of login sessions, and measure scheduling latencies of key proceses
> (sh/X/wm/xmms/firefox/etc).  People would send a dump if they were having
> problems with any scheduler, and we could compare all of them against it.
> Wishful thinking!

That's a pretty good idea. I'll queue up writing something of that form
as well.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote:
>> This doesn't make any sense to me.
>> For a start, exact simultaneous operation would be impossible to achieve
>> except with highly specialized architecture such as the long departed
>> transputer.  And secondly, I can't see why it's necessary.
>
> We're not going to make any headway here, so we might as well drop the
> thread.

Yes, we were starting to go around in circles weren't we?

> There are other things to talk about anyway, for instance I'm seeing
> interest in plugsched come about from elsewhere and am taking an
> interest in getting it into shape wrt. various design goals therefore.
>
> Probably the largest issue of note is getting scheduler drivers
> loadable as kernel modules. Addressing the points Ingo made that can
> be addressed is also lined up for this effort.
>
> Comments on which directions you'd like this to go in these respects
> would be appreciated, as I regard you as the current "project owner."

I'd do a scan through LKML from about 18 months ago looking for mention
of a runtime configurable version of plugsched.  Some students at a
university (in Germany, I think) posted some patches adding this feature
to plugsched around about then.

I never added them to plugsched proper as I knew (from previous
experience when the company I worked for posted patches with similar
functionality) that Linus would like this idea less than he did the
current plugsched mechanism.

Unfortunately, my own cache of the relevant e-mails got overwritten
during a Fedora Core upgrade (I've since moved /var onto a separate
drive to avoid a repetition) or I would dig them out and send them to
you.  I'd been provided with copies of the company's patches to use as a
guide to how to overcome the problems associated with changing
schedulers on a running system (a few non-trivial locking issues pop up).

Maybe if one of the students still reads LKML he will provide a pointer.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > Anyone who thinks that there exists only two kinds of code: 100% 
> > correct and 100% incorrect with no shades of grey inbetween is in 
> > reality a sort of an extremist: whom, depending on mood and 
> > affection, we could call either a 'coding purist' or a 'coding 
> > taliban' ;-)
> 
> Only if you are an extremist-naming extremist with no shades of grey. 
> Others, like myself, also include 'coding al-qaeda' and 'coding john 
> howard' in that scale.

heh ;) You, you ... nitpicking extremist! ;)

And beware that you just commited another act of extremism too:

> I agree there.

because you just went to the extreme position of saying that "i agree 
with this portion 100%", instead of saying "this seems to be 91.5% 
correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200".

and the nasty thing is, that in reality even shades of grey, if you 
print them out, are just a set of extreme black dots on an extreme white 
sheet of paper! ;)

[ so i guess we've got to consider the scope of extremism too: the 
  larger the scope, the more limiting and hence the more dangerous it
  is. ]

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Andy Whitcroft
Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> 
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> 
>http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> 
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
> 
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
>   new scheduler will be active by default and all tasks will default
>   to the new SCHED_FAIR interactive scheduling class. ]
> 
> Highlights are:
> 
>  - the introduction of Scheduling Classes: an extensible hierarchy of
>scheduler modules. These modules encapsulate scheduling policy
>details and are handled by the scheduler core without the core
>code assuming too much about them.
> 
>  - sched_fair.c implements the 'CFS desktop scheduler': it is a
>replacement for the vanilla scheduler's SCHED_OTHER interactivity
>code.
> 
>i'd like to give credit to Con Kolivas for the general approach here:
>he has proven via RSDL/SD that 'fair scheduling' is possible and that
>it results in better desktop scheduling. Kudos Con!
> 
>The CFS patch uses a completely different approach and implementation
>from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>that of RSDL/SD, which is a high standard to meet :-) Testing
>feedback is welcome to decide this one way or another. [ and, in any
>case, all of SD's logic could be added via a kernel/sched_sd.c module
>as well, if Con is interested in such an approach. ]
> 
>CFS's design is quite radical: it does not use runqueues, it uses a
>time-ordered rbtree to build a 'timeline' of future task execution,
>and thus has no 'array switch' artifacts (by which both the vanilla
>scheduler and RSDL/SD are affected).
> 
>CFS uses nanosecond granularity accounting and does not rely on any
>jiffies or other HZ detail. Thus the CFS scheduler has no notion of
>'timeslices' and has no heuristics whatsoever. There is only one
>central tunable:
> 
>  /proc/sys/kernel/sched_granularity_ns
> 
>which can be used to tune the scheduler from 'desktop' (low
>latencies) to 'server' (good batching) workloads. It defaults to a
>setting suitable for desktop workloads. SCHED_BATCH is handled by the
>CFS scheduler module too.
> 
>due to its design, the CFS scheduler is not prone to any of the
>'attacks' that exist today against the heuristics of the stock
>scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
>work fine and do not impact interactivity and produce the expected
>behavior.
> 
>the CFS scheduler has a much stronger handling of nice levels and
>SCHED_BATCH: both types of workloads should be isolated much more
>aggressively than under the vanilla scheduler.
> 
>( another detail: due to nanosec accounting and timeline sorting,
>  sched_yield() support is very simple under CFS, and in fact under
>  CFS sched_yield() behaves much better than under any other
>  scheduler i have tested so far. )
> 
>  - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
>way than the vanilla scheduler does. It uses 100 runqueues (for all
>100 RT priority levels, instead of 140 in the vanilla scheduler)
>and it needs no expired array.
> 
>  - reworked/sanitized SMP load-balancing: the runqueue-walking
>assumptions are gone from the load-balancing code now, and
>iterators of the scheduling modules are used. The balancing code got
>quite a bit simpler as a result.
> 
> the core scheduler got smaller by more than 700 lines:
> 
>  kernel/sched.c | 1454 
> 
>  1 file changed, 372 insertions(+), 1082 deletions(-)
> 
> and even adding all the scheduling modules, the total size impact is
> relatively small:
> 
>  18 files changed, 1454 insertions(+), 1133 deletions(-)
> 
> most of the increase is due to extensive comments. The kernel size
> impact is in fact a small negative:
> 
>   text    data     bss     dec     hex filename
>  23366    4001      24   27391    6aff kernel/sched.o.vanilla
>  24159    2705      56   26920    6928 kernel/sched.o.CFS
> 
> (this is mainly due to the benefit of getting rid of the expired array
> and its data structure overhead.)
> 
> than
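
To make the 'timeline' idea in the announcement above concrete, here is a 
tiny userspace sketch of scheduling by always picking the task with the 
earliest key and advancing that key as the task runs.  A plain array scan 
stands in for the rbtree, and all names, weights and constants here are 
illustrative assumptions rather than anything taken from the actual patch.

#include <stdio.h>

#define NTASKS  3
#define TICK_NS 1000000ULL   /* pretend each task runs 1ms per pick */

struct task {
        const char *name;
        unsigned long long key;   /* position on the "timeline", in ns */
        unsigned int weight;      /* higher weight => key advances slower */
};

/* Stand-in for "leftmost node of the tree": smallest key wins. */
static struct task *pick_earliest(struct task *t, int n)
{
        struct task *best = &t[0];
        for (int i = 1; i < n; i++)
                if (t[i].key < best->key)
                        best = &t[i];
        return best;
}

int main(void)
{
        struct task tasks[NTASKS] = {
                { "hog",     0, 1 },
                { "editor",  0, 2 },   /* e.g. reniced, so weighted 2x */
                { "compile", 0, 1 },
        };

        for (int pick = 0; pick < 12; pick++) {
                struct task *t = pick_earliest(tasks, NTASKS);
                /* Charge the task for the time it just used, scaled by
                 * its weight, which pushes it to the right on the timeline. */
                t->key += TICK_NS / t->weight;
                printf("pick %2d: %s\n", pick, t->name);
        }
        return 0;
}

Note there is no timeslice renewal or array switch anywhere in this model; 
the ordering key does all the work, which is the point being made above.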

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >>Other hints that it was a bad idea was the need to transfer time slices 
> >>between children and parents during fork() and exit().
> >
> >I don't see how that has anything to do with dual arrays.
> 
> It's totally to do with the dual arrays.  The only real purpose of the 
> time slice in O(1) (regardless of what its perceived purpose was) was to 
> control the switching between the arrays.

The O(1) design is pretty convoluted in that regard. In my scheduler,
the only purpose of the arrays is to renew time slices.
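
For readers who haven't looked at the mainline code, the dual-array 
mechanism being argued about can be caricatured like this.  It is a 
deliberately simplified userspace sketch, not the real O(1) scheduler, 
and it ignores priorities entirely.

#include <stdio.h>

#define MAXTASKS 8

struct prio_array {
        int nr;
        int task[MAXTASKS];       /* task ids, in FIFO order */
};

static struct prio_array arrays[2];
static struct prio_array *active = &arrays[0];
static struct prio_array *expired = &arrays[1];

static void enqueue(struct prio_array *a, int id)
{
        a->task[a->nr++] = id;
}

static int dequeue(struct prio_array *a)
{
        int id = a->task[0];
        for (int i = 1; i < a->nr; i++)
                a->task[i - 1] = a->task[i];
        a->nr--;
        return id;
}

int main(void)
{
        for (int id = 1; id <= 4; id++)
                enqueue(active, id);

        for (int slice = 0; slice < 8; slice++) {
                int id = dequeue(active);
                printf("run task %d\n", id);
                /* Timeslice used up: renew it and park the task on the
                 * expired array.  (The mainline code sometimes excuses
                 * "interactive" tasks from this move, which is the
                 * exception being criticised in this thread.) */
                enqueue(expired, id);
                if (!active->nr) {              /* array switch */
                        struct prio_array *tmp = active;
                        active = expired;
                        expired = tmp;
                        printf("-- array switch --\n");
                }
        }
        return 0;
}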

The fork/exit logic is added to make interactivity better. Ingo's
scheduler has similar equivalent logic.


> >If you put
> >a new child at the back of the queue, then your various interactive
> >shell commands that typically do a lot of dependant forking get slowed
> >right down behind your compile job. If you give a new child its own
> >timeslice irrespective of the parent, then you have things like 'make'
> >(which doesn't use a lot of CPU time) spawning off lots of high
> >priority children.
> 
> This is an artefact of trying to control nice using time slices while 
> using them for controlling array switching and whatever else they were 
> being used for.  Priority (static and dynamic) is the best way to 
> implement nice.

I don't like the timeslice based nice in mainline. It's too nasty
with latencies. nicksched is far better in that regard IMO.

But I don't know how you can assert a particular way is the best way
to do something.


> >You need to do _something_ (Ingo's does). I don't see why this would
> >be tied with a dual array. FWIW, mine doesn't do anything on exit()
> >like most others, but it may need more tuning in this area.
> >
> >
> >>This disregard for the dual array mechanism has prevented me from 
> >>looking at the rest of your scheduler in any great detail so I can't 
> >>comment on any other ideas that may be in there.
> >
> >Well I wasn't really asking you to review it. As I said, everyone
> >has their own idea of what a good design does, and review can't really
> >distinguish between the better of two reasonable designs.
> >
> >A fair evaluation of the alternatives seems like a good idea though.
> >Nobody is actually against this, are they?
> 
> No.  It would be nice if the basic ideas that each scheduler tries to 
> implement could be extracted and explained though.  This could lead to a 
> melding of ideas that leads to something quite good.
> 
> >
> >
> >>>I haven't looked at Con's ones for a while,
> >>>but I believe they are also much more straightforward than mainline...
> >>I like Con's scheduler (partly because it uses a single array) but 
> >>mainly because it's nice and simple.  However, his earlier schedulers 
> >>were prone to starvation (admittedly, only if you went out of your way 
> >>to make it happen) and I tried to convince him to use the anti 
> >>starvation mechanism in my SPA schedulers but was unsuccessful.  I 
> >>haven't looked at his latest scheduler that sparked all this furore so 
> >>can't comment on it.
> >
> >I agree starvation or unfairness is unacceptable for a new scheduler.
> >
> >
> >>>For example, let's say all else is equal between them, then why would
> >>>we go with the O(logN) implementation rather than the O(1)?
> >>In the highly unlikely event that you can't separate them on technical 
> >>grounds, Occam's razor recommends choosing the simplest solution. :-)
> >
> >O(logN) vs O(1) is technical grounds.
> 
> In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
> greater than O(logN)'s k factor multiplied by logMaxN.

Yes, or even significantly greater around typical large sizes of N.
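
(Illustrative numbers only, to make that comparison concrete: with MaxN 
around 10,000 runnable tasks, log2(MaxN) is roughly 13, so on this 
criterion an O(1) scheme only has to keep its per-decision constant within 
about an order of magnitude of the tree's per-node cost; at a more typical 
50 runnable tasks the gap shrinks to a factor of about 6.)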
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:

Nick Piggin wrote:

And my scheduler for example cuts down the amount of policy code and
code size significantly.
Yours is one of the smaller patches mainly because you perpetuate (or 
you did in the last one I looked at) the (horrible to my eyes) dual 
array (active/expired) mechanism.


Actually, I wasn't comparing with other out of tree schedulers (but it
is good to know mine is among the smaller ones). I was comparing with
the mainline scheduler, which also has the dual arrays.


 That this idea was bad should have 
been apparent to all as soon as the decision was made to excuse some 
tasks from being moved from the active array to the expired array.  This 


My patch doesn't implement any such excusing.


essentially meant that there would be circumstances where extreme 
unfairness (to the extent of starvation in some cases) could occur -- the 
very thing that the mechanism was originally designed to prevent (as far 
as I can gather).  Right about then in the development of the O(1) 
scheduler alternative solutions should have been sought.


Fairness has always been my first priority, and I consider it a bug
if it is possible for any process to get more CPU time than a CPU hog
over the long term. Or over another task doing the same thing, for
that matter.


Other hints that it was a bad idea was the need to transfer time slices 
between children and parents during fork() and exit().


I don't see how that has anything to do with dual arrays.


It's totally to do with the dual arrays.  The only real purpose of the 
time slice in O(1) (regardless of what its perceived purpose was) was to 
control the switching between the arrays.



If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependant forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.


This is an artefact of trying to control nice using time slices while 
using them for controlling array switching and whatever else they were 
being used for.  Priority (static and dynamic) is the best way to 
implement nice.




You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.


Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?


No.  It would be nice if the basic ideas that each scheduler tries to 
implement could be extracted and explained though.  This could lead to a 
melding of ideas that leads to something quite good.






I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...
I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.


I agree starvation or unfairness is unacceptable for a new scheduler.



For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?
In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)


O(logN) vs O(1) is technical grounds.


In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
greater than O(logN)'s k factor multiplied by logMaxN.




But yeah, see my earlier comment: simplicity would be a factor too.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> 
> * William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > > I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> > > prepare a main dish (feature) with a nice sider (code cleanup) to 
> > > Linus ;) And even this code does that pretty nicely. The deadline 
> > > designs looks good, although I think the final "key" calculation 
> > > code will end up quite different from what it looks now.
> > 
> > The additive nice_offset breaks nice levels. A multiplicative priority 
> > weighting of a different, nonnegative metric of cpu utilization from 
> > what's now used is required for nice levels to work. I've been trying 
> > to point this out politely by strongly suggesting testing whether nice 
> > levels work.
> 
> granted, CFS's nice code is still incomplete, but you err quite 
> significantly with this extreme statement that they are "broken".
> 
> nice levels certainly work to a fair degree even in the current code and 
> much of the focus is elsewhere - just try it. (In fact i claim that 
> CFS's nice levels often work _better_ than the mainline scheduler's nice 
> level support, for the testcases that matter to users.)
> 
> The precise behavior of nice levels, as i pointed it out in previous 
> mails, is largely 'uninteresting' and it has changed multiple times in 
> the past 10 years.
> 
> What matters to users is mainly: whether X reniced to -10 does get 
> enough CPU time and whether stuff reniced to +19 doesn't take away too 
> much CPU time from the rest of the system.

I agree there.


> _How_ a Linux scheduler 
> achieves this is an internal matter and certainly CFS does it in a hacky 
> way at the moment.
> 
> All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
> could come up with is just a fancy academic technicality that has no 
> real significance to any of the testers who are trying CFS right now.
> 
> Sure we prefer final solutions that are clean and make sense (because 
> such things are the easiest to maintain long-term), and often such final 
> solutions are quite close to academic concepts, and i think Davide 
> correctly observed this by saying that "the final key calculation code 
> will end up quite different from what it looks now", but your 
> extreme-end claim of 'breakage' for something that is just plain 
> incomplete is not really a fair characterisation at this point.
> 
> Anyone who thinks that there exists only two kinds of code: 100% correct 
> and 100% incorrect with no shades of grey inbetween is in reality a sort 
> of an extremist: whom, depending on mood and affection, we could call 
> either a 'coding purist' or a 'coding taliban' ;-)

Only if you are an extremist-naming extremist with no shades of grey.
Others, like myself, also include 'coding al-qaeda' and 'coding john
howard' in that scale.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> > prepare a main dish (feature) with a nice sider (code cleanup) to 
> > Linus ;) And even this code does that pretty nicely. The deadline 
> > designs looks good, although I think the final "key" calculation 
> > code will end up quite different from what it looks now.
> 
> The additive nice_offset breaks nice levels. A multiplicative priority 
> weighting of a different, nonnegative metric of cpu utilization from 
> what's now used is required for nice levels to work. I've been trying 
> to point this out politely by strongly suggesting testing whether nice 
> levels work.

granted, CFS's nice code is still incomplete, but you err quite 
significantly with this extreme statement that they are "broken".

nice levels certainly work to a fair degree even in the current code and 
much of the focus is elsewhere - just try it. (In fact i claim that 
CFS's nice levels often work _better_ than the mainline scheduler's nice 
level support, for the testcases that matter to users.)

The precise behavior of nice levels, as i pointed it out in previous 
mails, is largely 'uninteresting' and it has changed multiple times in 
the past 10 years.

What matters to users is mainly: whether X reniced to -10 does get 
enough CPU time and whether stuff reniced to +19 doesn't take away too 
much CPU time from the rest of the system. _How_ a Linux scheduler 
achieves this is an internal matter and certainly CFS does it in a hacky 
way at the moment.

All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
could come up with is just a fancy academic technicality that has no 
real significance to any of the testers who are trying CFS right now.

Sure we prefer final solutions that are clean and make sense (because 
such things are the easiest to maintain long-term), and often such final 
solutions are quite close to academic concepts, and i think Davide 
correctly observed this by saying that "the final key calculation code 
will end up quite different from what it looks now", but your 
extreme-end claim of 'breakage' for something that is just plain 
incomplete is not really a fair characterisation at this point.

Anyone who thinks that there exists only two kinds of code: 100% correct 
and 100% incorrect with no shades of grey inbetween is in reality a sort 
of an extremist: whom, depending on mood and affection, we could call 
either a 'coding purist' or a 'coding taliban' ;-)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, William Lee Irwin III wrote:
> 
> > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > > I would suggest to thoroughly test all your alternatives before deciding. 
> > > Some code and design may look very good and small at the beginning, but 
> > > when you start patching it to cover all the dark spots, you effectively 
> > > end up with another thing (in both design and code footprint).
> > > About O(1), I never thought it was a must (besides a good marketing 
> > > material), and O(log(N)) *may* be just fine (to be verified, of course).
> > 
> > The trouble with thorough testing right now is that no one agrees on
> > what the tests should be and a number of the testcases are not in great
> > shape. An agreed-upon set of testcases for basic correctness should be
> > devised and the implementations of those testcases need to be
> > maintainable code and the tests set up for automated testing and
> > changing their parameters without recompiling via command-line options.
> > 
> > Once there's a standard regression test suite for correctness, one
> > needs to be devised for performance, including interactive performance.
> > The primary difficulty I see along these lines is finding a way to
> > automate tests of graphics and input device response performance. Others,
> > like how deterministically priorities are respected over progressively
> > smaller time intervals and noninteractive workload performance are
> > nowhere near as difficult to arrange and in many cases already exist.
> > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
> 
> What I meant was, that the rules (requirements and associated test cases) 
> for this new Scheduler Amazing Race should be set forward, and not kept a 
> moving target to fit one or the other implementation.

Exactly. Well I don't mind if it is a moving target as such, just as
long as the decisions are rational (no "blah is more important
because I say so").
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, William Lee Irwin III wrote:

> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > I would suggest to thoroughly test all your alternatives before deciding. 
> > Some code and design may look very good and small at the beginning, but 
> > when you start patching it to cover all the dark spots, you effectively 
> > end up with another thing (in both design and code footprint).
> > About O(1), I never thought it was a must (besides a good marketing 
> > material), and O(log(N)) *may* be just fine (to be verified, of course).
> 
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
> 
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

What I meant was, that the rules (requirements and associated test cases) 
for this new Scheduler Amazing Race should be set forward, and not kept a 
moving target to fit one or the other implementation.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote:
> 
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
> 
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

Definitely. It would be really good if we could have interactivity
regression tests too (see my earlier wishful email). The problem
with a lot of the scripted interactivity tests I see is that they
don't really capture the complexities of the interactions within,
say, an interactive X session. Others just go straight for trying
to exploit the design by making lots of high priority processes
runnable at once. This just provides an unrealistic decoy and you
end up trying to tune for the wrong thing.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, Nick Piggin wrote:

> To be clear, I'm not saying O(logN) itself is a big problem. Type
> 
>   plot [10:100] x with lines, log(x) with lines, 1 with lines

Haha, Nick, I know what a log() looks like :)
The Time Ring I posted as an example (which is nothing other than a 
ring-based bucket sort) keeps O(1) if you can concede some timer 
clustering.
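
For anyone who hasn't seen that post, the general ring-of-buckets idea can 
be sketched as below.  This is not necessarily Davide's posted code; the 
sizes and names are assumptions.  The point is that insertion hashes an 
expiry time straight into a slot, so it stays O(1) as long as entries 
cluster into a bounded number of nearby slots.

#include <stdio.h>

#define RING_SLOTS   64           /* purely illustrative sizes */
#define GRANULARITY  10           /* time units covered by one slot */
#define MAX_PER_SLOT 16

struct ring {
        unsigned long base;       /* start of the time covered by slot 'cur' */
        unsigned int cur;         /* slot covering "now" */
        int nr[RING_SLOTS];
        int item[RING_SLOTS][MAX_PER_SLOT];
};

/* O(1) insert: the slot is a simple function of the expiry time. */
static void ring_add(struct ring *r, unsigned long expiry, int id)
{
        unsigned long off = (expiry - r->base) / GRANULARITY;
        unsigned int slot = (r->cur + off) % RING_SLOTS;
        r->item[slot][r->nr[slot]++] = id;
}

/* Expire everything in the current slot, then advance the ring. */
static void ring_turn(struct ring *r)
{
        for (int i = 0; i < r->nr[r->cur]; i++)
                printf("expire id %d (by time %lu)\n",
                       r->item[r->cur][i], r->base + GRANULARITY);
        r->nr[r->cur] = 0;
        r->cur = (r->cur + 1) % RING_SLOTS;
        r->base += GRANULARITY;
}

int main(void)
{
        struct ring r = { 0 };
        ring_add(&r, 3, 1);
        ring_add(&r, 7, 2);       /* clusters into the same slot as id 1 */
        ring_add(&r, 15, 3);
        ring_turn(&r);            /* ids 1 and 2 expire together */
        ring_turn(&r);            /* id 3 expires */
        return 0;
}

The "clustering" concession is visible here: everything that lands in the 
same slot is treated as expiring together, at the slot's granularity.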


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

William Lee Irwin III wrote:

On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
I had a quick look at Ingo's code yesterday. Ingo is always smart to 
prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
And even this code does that pretty nicely. The deadline designs looks 
good, although I think the final "key" calculation code will end up quite 
different from what it looks now.


The additive nice_offset breaks nice levels. A multiplicative priority
weighting of a different, nonnegative metric of cpu utilization from
what's now used is required for nice levels to work. I've been trying
to point this out politely by strongly suggesting testing whether nice
levels work.


On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
I would suggest to thoroughly test all your alternatives before deciding. 
Some code and design may look very good and small at the beginning, but 
when you start patching it to cover all the dark spots, you effectively 
end up with another thing (in both design and code footprint).
About O(1), I never thought it was a must (besides a good marketing 
material), and O(log(N)) *may* be just fine (to be verified, of course).


The trouble with thorough testing right now is that no one agrees on
what the tests should be and a number of the testcases are not in great
shape. An agreed-upon set of testcases for basic correctness should be
devised and the implementations of those testcases need to be
maintainable code and the tests set up for automated testing and
changing their parameters without recompiling via command-line options.

Once there's a standard regression test suite for correctness, one
needs to be devised for performance, including interactive performance.
The primary difficulty I see along these lines is finding a way to
automate tests of graphics and input device response performance. Others,
like how deterministically priorities are respected over progressively
smaller time intervals and noninteractive workload performance are
nowhere near as difficult to arrange and in many cases already exist.
Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.


At this point, I'd like to direct everyone's attention to the simloads package:



which contains a set of programs designed to be used in the construction 
of CPU scheduler tests.  Of particular use is the aspin program which 
can be used to launch tasks with specified sleep/wake characteristics.
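
As a rough idea of the kind of building block that is: a task with fixed 
burn/sleep characteristics is easy to sketch.  The real aspin has its own 
options and behaviour; the arguments and defaults below are assumptions 
for illustration only.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Burn the CPU for roughly 'ms' milliseconds of wall-clock time. */
static void burn_ms(long ms)
{
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000L +
                 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

static void sleep_ms(long ms)
{
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
}

/* usage: ./spin <burn_ms> <idle_ms> <cycles>   (arguments assumed) */
int main(int argc, char **argv)
{
        long burn = argc > 1 ? atol(argv[1]) : 10;
        long idle = argc > 2 ? atol(argv[2]) : 90;
        long n    = argc > 3 ? atol(argv[3]) : 100;

        for (long i = 0; i < n; i++) {
                burn_ms(burn);      /* pretend to do work */
                sleep_ms(idle);     /* then behave "interactively" */
        }
        return 0;
}

Running a handful of these with different burn/idle ratios alongside pure 
hogs is the usual way such sleep/wake mixes get generated for tests.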


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, Nick Piggin wrote:
> 
> > > All things are not equal; they all have different properties. I like
> > 
> > Exactly. So we have to explore those properties and evaluate performance
> > (in all meanings of the word). That's only logical.
> 
> I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
> And even this code does that pretty nicely. The deadline designs looks 
> good, although I think the final "key" calculation code will end up quite 
> different from what it looks now.
> I would suggest to thoroughly test all your alternatives before deciding. 
> Some code and design may look very good and small at the beginning, but 
> when you start patching it to cover all the dark spots, you effectively 
> end up with another thing (in both design and code footprint).
> About O(1), I never thought it was a must (besides a good marketing 
> material), and O(log(N)) *may* be just fine (to be verified, of course).

To be clear, I'm not saying O(logN) itself is a big problem. Type

  plot [10:100] x with lines, log(x) with lines, 1 with lines

into gnuplot. I was just trying to point out that we need to evaluate
things. Considering how long we've had this scheduler with its known
deficiencies, let's pick a new one wisely.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
> And even this code does that pretty nicely. The deadline designs looks 
> good, although I think the final "key" calculation code will end up quite 
> different from what it looks now.

The additive nice_offset breaks nice levels. A multiplicative priority
weighting of a different, nonnegative metric of cpu utilization from
what's now used is required for nice levels to work. I've been trying
to point this out politely by strongly suggesting testing whether nice
levels work.
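
One minimal way to picture the multiplicative weighting being described 
here (the base of 1.25 and every name below are illustrative assumptions, 
not a proposal for the actual kernel code): weight each task's measured 
CPU usage by an exponential function of its nice value, and always run 
the task with the least weighted usage.

#include <stdio.h>
#include <math.h>

struct task {
        const char *name;
        int nice;
        double scaled_usage;      /* cpu time charged, after weighting */
};

/* Illustrative weight: ~1.25x per nice step, so nice 0 vs nice 10
 * works out to roughly a 9-10x cpu ratio over the long run. */
static double weight(int nice)
{
        return pow(1.25, nice);
}

int main(void)
{
        struct task t[2] = { { "A", 0, 0.0 }, { "B", 10, 0.0 } };
        double wall[2] = { 0.0, 0.0 };

        for (int tick = 0; tick < 100000; tick++) {
                /* run whichever task has accumulated less scaled usage */
                struct task *next = t[0].scaled_usage <= t[1].scaled_usage
                                    ? &t[0] : &t[1];
                next->scaled_usage += weight(next->nice);   /* charge 1 tick */
                wall[next - t] += 1.0;
        }
        printf("cputime(A)/cputime(B) = %.2f (weight ratio %.2f)\n",
               wall[0] / wall[1], weight(10) / weight(0));
        return 0;
}

The observable property is that the long-run CPU ratio depends only on the 
nice difference, which an additive offset on the key does not give you.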


On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> I would suggest to thoroughly test all your alternatives before deciding. 
> Some code and design may look very good and small at the beginning, but 
> when you start patching it to cover all the dark spots, you effectively 
> end up with another thing (in both design and code footprint).
> About O(1), I never thought it was a must (besides a good marketing 
> material), and O(log(N)) *may* be just fine (to be verified, of course).

The trouble with thorough testing right now is that no one agrees on
what the tests should be and a number of the testcases are not in great
shape. An agreed-upon set of testcases for basic correctness should be
devised and the implementations of those testcases need to be
maintainable code and the tests set up for automated testing and
changing their parameters without recompiling via command-line options.

Once there's a standard regression test suite for correctness, one
needs to be devised for performance, including interactive performance.
The primary difficulty I see along these lines is finding a way to
automate tests of graphics and input device response performance. Others,
like how deterministically priorities are respected over progressively
smaller time intervals and noninteractive workload performance are
nowhere near as difficult to arrange and in many cases already exist.
Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> >> All things are not equal; they all have different properties. I like
> 
> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > Exactly. So we have to explore those properties and evaluate performance
> > (in all meanings of the word). That's only logical.
> 
> Any chance you'd be willing to put down a few thoughts on what sorts
> of standards you'd like to set for both correctness (i.e. the bare
> minimum a scheduler implementation must do to be considered valid
> beyond not oopsing) and performance metrics (i.e. things that produce
> numbers for each scheduler you can compare to say "this scheduler is
> better than this other scheduler at this.").

Yeah I guess that's the hard part :)

For correctness, I guess fairness is an easy one. I think that unfairness
is basically a bug and that it would be very unfortunate to merge something
unfair. But this is just within the context of a single runqueue... for
better or worse, we allow some unfairness in multiprocessors for performance
reasons of course.

Latency. Given N tasks in the system, an arbitrary task should get
onto the CPU in a bounded amount of time (excluding events like freak
IRQ holdoffs and such, obviously -- ie. just considering the context
of the scheduler's state machine).
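
That latency criterion is also one of the easier ones to turn into an 
automated check.  A crude userspace sketch (the hog count, period and 
iteration count are arbitrary assumptions): load the machine with CPU 
hogs and measure how far a periodic sleeper overshoots its intended 
wakeup.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>

static long long now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        pid_t hogs[4];

        /* Load the box with a few pure CPU hogs. */
        for (int i = 0; i < 4; i++) {
                hogs[i] = fork();
                if (hogs[i] == 0)
                        for (;;) ;              /* hog */
        }

        long long worst = 0;
        for (int i = 0; i < 1000; i++) {
                struct timespec ts = { 0, 10 * 1000 * 1000 };  /* 10ms */
                long long before = now_ns();
                nanosleep(&ts, NULL);
                /* Anything beyond the requested 10ms is timer slack plus
                 * time spent waiting for a CPU: a rough latency figure. */
                long long over = now_ns() - before - 10 * 1000 * 1000;
                if (over > worst)
                        worst = over;
        }
        printf("worst observed wakeup overshoot: %lld us\n", worst / 1000);

        for (int i = 0; i < 4; i++)
                kill(hogs[i], SIGKILL);
        return 0;
}

A real harness would sweep the number of hogs and their nice levels and 
check that the worst case stays bounded as a function of N.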

I wouldn't like to see a significant drop in any micro or macro
benchmarks or even worse real workloads, but I could accept some if it
means having a fair scheduler by default.

Now it isn't actually too hard to achieve the above, I think. The hard bit
is trying to compare interactivity. Ideally, we'd be able to get scripted
dumps of login sessions, and measure scheduling latencies of key processes
(sh/X/wm/xmms/firefox/etc).  People would send a dump if they were having
problems with any scheduler, and we could compare all of them against it.
Wishful thinking!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, Nick Piggin wrote:

> > All things are not equal; they all have different properties. I like
> 
> Exactly. So we have to explore those properties and evaluate performance
> (in all meanings of the word). That's only logical.

I had a quick look at Ingo's code yesterday. Ingo is always smart to 
prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
And even this code does that pretty nicely. The deadline designs looks 
good, although I think the final "key" calculation code will end up quite 
different from what it looks now.
I would suggest to thoroughly test all your alternatives before deciding. 
Some code and design may look very good and small at the beginning, but 
when you start patching it to cover all the dark spots, you effectively 
end up with another thing (in both design and code footprint).
About O(1), I never thought it was a must (besides a good marketing 
material), and O(log(N)) *may* be just fine (to be verified, of course).



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >And my scheduler for example cuts down the amount of policy code and
> >code size significantly.
> 
> Yours is one of the smaller patches mainly because you perpetuate (or 
> you did in the last one I looked at) the (horrible to my eyes) dual 
> array (active/expired) mechanism.

Actually, I wasn't comparing with other out of tree schedulers (but it
is good to know mine is among the smaller ones). I was comparing with
the mainline scheduler, which also has the dual arrays.


>  That this idea was bad should have 
> been apparent to all as soon as the decision was made to excuse some 
> tasks from being moved from the active array to the expired array.  This 

My patch doesn't implement any such excusing.


> essentially meant that there would be circumstances where extreme 
> unfairness (to the extent of starvation in some cases) could occur -- the 
> very thing that the mechanism was originally designed to prevent (as far 
> as I can gather).  Right about then in the development of the O(1) 
> scheduler alternative solutions should have been sought.

Fairness has always been my first priority, and I consider it a bug
if it is possible for any process to get more CPU time than a CPU hog
over the long term. Or over another task doing the same thing, for
that matter.


> Other hints that it was a bad idea was the need to transfer time slices 
> between children and parents during fork() and exit().

I don't see how that has anything to do with dual arrays. If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependant forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.

You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


> This disregard for the dual array mechanism has prevented me from 
> looking at the rest of your scheduler in any great detail so I can't 
> comment on any other ideas that may be in there.

Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?


> >I haven't looked at Con's ones for a while,
> >but I believe they are also much more straightforward than mainline...
> 
> I like Con's scheduler (partly because it uses a single array) but 
> mainly because it's nice and simple.  However, his earlier schedulers 
> were prone to starvation (admittedly, only if you went out of your way 
> to make it happen) and I tried to convince him to use the anti 
> starvation mechanism in my SPA schedulers but was unsuccessful.  I 
> haven't looked at his latest scheduler that sparked all this furore so 
> can't comment on it.

I agree starvation or unfairness is unacceptable for a new scheduler.


> >For example, let's say all else is equal between them, then why would
> >we go with the O(logN) implementation rather than the O(1)?
> 
> In the highly unlikely event that you can't separate them on technical 
> grounds, Occam's razor recommends choosing the simplest solution. :-)

O(logN) vs O(1) is technical grounds.

But yeah, see my earlier comment: simplicity would be a factor too.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
>> All things are not equal; they all have different properties. I like

On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> Exactly. So we have to explore those properties and evaluate performance
> (in all meanings of the word). That's only logical.

Any chance you'd be willing to put down a few thoughts on what sorts
of standards you'd like to set for both correctness (i.e. the bare
minimum a scheduler implementation must do to be considered valid
beyond not oopsing) and performance metrics (i.e. things that produce
numbers for each scheduler you can compare to say "this scheduler is
better than this other scheduler at this.").


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

Well I know people have had woes with the scheduler for ever (I guess that
isn't going to change :P). I think people generally lost a bit of interest
in trying to improve the situation because of the upstream problem.


Yes.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:

On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:

Mike Galbraith wrote:

Demystify what?   The casual observer need only read either your attempt
at writing a scheduler, or my attempts at fixing the one we have, to see
that it was high time for someone with the necessary skills to step in.

Make that "someone with the necessary clout".

No, I was brutally honest to both of us, but quite correct.


Now progress can happen, which was _not_ happening before.


This is true.

Yup, and progress _is_ happening now, quite rapidly.

Progress as in progress on Ingo's scheduler. I still don't know how we'd
decide when to replace the mainline scheduler or with what.

I don't think we can say Ingo's is better than the alternatives, can we?
If there is some kind of bakeoff, then I'd like one of Con's designs to
be involved, and mine, and Peter's...
I myself was thinking of this as the chance for a much needed 
simplification of the scheduling code and if this can be done with the 
result being "reasonable" it then gives us the basis on which to propose 
improvements based on the ideas of others such as you mention.


As the size of the cpusched indicates, trying to evaluate alternative 
proposals based on the current O(1) scheduler is fraught.  Hopefully, 


I don't know why. The problem is that you can't really evaluate good
proposals by looking at the code (you can say that one is bad, ie. the
current one, which has a huge amount of temporal complexity and is
explicitly unfair), but it is pretty hard to say one behaves well.


I meant that it's indicative of the amount of work that you have to do 
to implement a new scheduling discipline for evaluation.




And my scheduler for example cuts down the amount of policy code and
code size significantly.


Yours is one of the smaller patches mainly because you perpetuate (or 
you did in the last one I looked at) the (horrible to my eyes) dual 
array (active/expired) mechanism.  That this idea was bad should have 
been apparent to all as soon as the decision was made to excuse some 
tasks from being moved from the active array to the expired array.  This 
essentially meant that there would be circumstances where extreme 
unfairness (to the extent of starvation in some cases) -- the very 
things that the mechanism was originally designed to ensure (as far as I 
can gather).  Right about then in the development of the O(1) scheduler 
alternative solutions should have been sought.


Other hints that it was a bad idea was the need to transfer time slices 
between children and parents during fork() and exit().


This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.



I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...


I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.




For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?


In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)


To digress, my main concern is that load balancing is being lumped in 
with this new change.  It's becoming "accept this big lump of new code 
or nothing".  I'd rather see a good fix to the intra runqueue/CPU 
scheduler problem implemented first and then if there really are any 
outstanding problems with the load balancer attack them later.  Them all 
being mixed up together gives me a nasty deja vu of impending disaster.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >
> >But you add extra code for that on top of what we have, and are also
> >prevented from making per-cpu assumptions.
> >
> >And you can get N CPUs per runqueue behaviour by having them in a domain
> >with no restrictions on idle balancing. So where does your increased
> >flexibilty come from?
> >
> >>One advantage of allowing multiple CPUs per run queue would be at the 
> >>smaller end of the system scale i.e. a PC with a single hyper threading 
> >>chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
> >>if both CPUs used the one runqueue and all the nasty side effects that 
> >>come with hyper threading would be minimized at the same time.
> >
> >I don't know about that -- the current load balancer already minimises
> >the nasty multi threading effects. SMT is very important for IBM's chips
> >for example, and they've never had any problem with that side of it
> >since it was introduced and bugs ironed out (at least, none that I've
> >heard).
> >
> 
> There's a lot of ugly code in the load balancer that is only there to 
> overcome the side effects of SMT and dual core.  A lot of it was put 
> there by Intel employees trying to make load balancing more friendly to 

I agree that some of that has exploded complexity. I have some
thoughts about better approaches for some of those things, but
basically been stuck working on VM problems for a while.


> their systems.  What I'm suggesting is that N CPUs per runqueue is a 
> better way of achieving that end.  I may (of course) be wrong but I 
> think that the idea deserves more consideration than you're willing to 
> give it.

Put it this way: it is trivial to group the load balancing stats
of N CPUs with their own runqueues. Just put them under a domain
and take the sum. The domain essentially takes on the same function
as a single queue with N CPUs under it. Anything _further_ you can
do with individual runqueues (like naturally adding an affinity
pressure ranging from nothing to absolute) are things that you
don't trivially get with a 1:N approach. AFAIKS.
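
The grouping point is simple enough to show in a few lines.  This is an 
abstract sketch only; the real sched-domains code is rather more involved 
and none of these names are the kernel's.

#include <stdio.h>

#define NCPUS 4

struct runqueue { int cpu; long load; };

/* A "domain" is just a set of CPUs whose stats are viewed together. */
struct domain { int first, last; };

static long domain_load(const struct runqueue *rq, const struct domain *d)
{
        long sum = 0;
        for (int cpu = d->first; cpu <= d->last; cpu++)
                sum += rq[cpu].load;    /* same info a shared queue would hold */
        return sum;
}

int main(void)
{
        struct runqueue rq[NCPUS] = { {0, 3}, {1, 1}, {2, 7}, {3, 1} };
        struct domain pair0 = { 0, 1 }, pair1 = { 2, 3 };

        /* Balance between the two groups by comparing the summed loads. */
        printf("domain0 load %ld, domain1 load %ld\n",
               domain_load(rq, &pair0), domain_load(rq, &pair1));
        return 0;
}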

So I will definitely give any idea consideration, but I just need to
be shown where the benefit comes from.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
> There's a lot of ugly code in the load balancer that is only there to 
> overcome the side effects of SMT and dual core.  A lot of it was put 
> there by Intel employees trying to make load balancing more friendly to 
> their systems.  What I'm suggesting is that N CPUs per runqueue is a 
> better way of achieving that end.  I may (of course) be wrong but I 
> think that the idea deserves more consideration than you're willing to 
> give it.

This may be a good one to ask Ingo about, as he did significant
performance work on per-core runqueues for SMT. While I did write
per-node runqueue code for NUMA at some point in the past, I did no
tuning or other performance work on it, only functionality.

I've actually dealt with kernels using elder versions of Ingo's code
for per-core runqueues on SMT, but was never called upon to examine
that particular code for either performance or stability, so I'm
largely ignorant of what the perceived outcome of it was.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
> >> I myself was thinking of this as the chance for a much needed 
> >> simplification of the scheduling code and if this can be done with the 
> >> result being "reasonable" it then gives us the basis on which to propose 
> >> improvements based on the ideas of others such as you mention.
> >> As the size of the cpusched indicates, trying to evaluate alternative 
> >> proposals based on the current O(1) scheduler is fraught.  Hopefully, 
> 
> On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> > I don't know why. The problem is that you can't really evaluate good
> > proposals by looking at the code (you can say that one is bad, ie. the
> > current one, which has a huge amount of temporal complexity and is
> > explicitly unfair), but it is pretty hard to say one behaves well.
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly. I haven't looked at Con's ones for a while,
> > but I believe they are also much more straightforward than mainline...
> > For example, let's say all else is equal between them, then why would
> > we go with the O(logN) implementation rather than the O(1)?
> 
> All things are not equal; they all have different properties. I like

Exactly. So we have to explore those properties and evaluate performance
(in all meanings of the word). That's only logical.


> On a random note, limitations on kernel address space make O(lg(n))
> effectively O(1), albeit with large upper bounds on the worst case
> and an expected case much faster than the worst case.

Yeah. O(n!) is also O(1) if you can put an upper bound on n ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 07:53:55AM +0200, Willy Tarreau wrote:
> Hi Nick,
> 
> On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> (...)
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly. I haven't looked at Con's ones for a while,
> > but I believe they are also much more straightforward than mainline...
> > 
> > For example, let's say all else is equal between them, then why would
> > we go with the O(logN) implementation rather than the O(1)?
> 
> Of course, if this is the case, the question will be raised. But as a
> general rule, I don't see much potential in O(1) to finely tune scheduling
> according to several criteria.

What do you mean? By what criteria?

> In O(logN), you can adjust scheduling in
> realtime at a very low cost. Better processing of varying priorities or
> fork() comes to mind.

The main problem as I see it is choosing which task to run next and
how much time to run it for. And given that there are typically far fewer
than 58 (the number of priorities in nicksched) runnable tasks for a
desktop system, I don't find it at all constraining to quantize my "next
runnable" criteria onto that size of key.

Even if you do expect a huge number of runnable tasks, you would hope
for fewer interactive ones toward the higher end of the priority scale.

Handwaving or even detailed design descriptions is simply not the best
way to decide on a new scheduler, is all I'm saying.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
>> I myself was thinking of this as the chance for a much needed 
>> simplification of the scheduling code and if this can be done with the 
>> result being "reasonable" it then gives us the basis on which to propose 
>> improvements based on the ideas of others such as you mention.
>> As the size of the cpusched indicates, trying to evaluate alternative 
>> proposals based on the current O(1) scheduler is fraught.  Hopefully, 

On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> I don't know why. The problem is that you can't really evaluate good
> proposals by looking at the code (you can say that one is bad, ie. the
> current one, which has a huge amount of temporal complexity and is
> explicitly unfair), but it is pretty hard to say one behaves well.
> And my scheduler for example cuts down the amount of policy code and
> code size significantly. I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?

All things are not equal; they all have different properties. I like
getting rid of the queue-swapping artifacts as ebs and cfs have done,
as the artifacts introduced there are nasty IMNSHO. I'm queueing up
a demonstration of epoch expiry scheduling artifacts as a testcase,
albeit one with no pass/fail semantics for its results, just detecting
scheduler properties.

That said, inequality/inequivalence is not a superiority/inferiority
ranking per se. What needs to come out of these discussions is a set
of standards which a candidate for mainline must pass to be considered
correct and a set of performance metrics by which to rank them. Video
game framerates and some sort of way to automate window wiggle tests
sound like good ideas, but automating such is beyond my userspace
programming abilities. An organization able to devote manpower to
devising such testcases will likely have to get involved for them to
happen, I suspect.

On a random note, limitations on kernel address space make O(lg(n))
effectively O(1), albeit with large upper bounds on the worst case
and an expected case much faster than the worst case.
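
(As a rough worked bound, assuming the stock pid_max of 32768 as the
practical ceiling on task count: a balanced tree is never more than about
log2(32768) = 15 levels deep, so the O(lg(n)) term is capped by a small
constant.)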


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote:

Nick Piggin wrote:

On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:

On 4/16/07, Peter Williams <[EMAIL PROTECTED]> wrote:

Note that I talk of run queues
not CPUs as I think a shift to multiple CPUs per run queue may be a good
idea.

This observation of Peter's is the best thing to come out of this
whole foofaraw.  Looking at what's happening in CPU-land, I think it's
going to be necessary, within a couple of years, to replace the whole
idea of "CPU scheduling" with "run queue scheduling" across a complex,
possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
point in churning the mainline scheduler through a design that isn't
significantly more flexible than any of those now under discussion.

Why? If you do that, then your load balancer just becomes less flexible
because it is harder to have tasks run on one or the other.

You can have single-runqueue-per-domain behaviour (or close to) just by
relaxing all restrictions on idle load balancing within that domain. It
is harder to go the other way and place any per-cpu affinity or
restrictions with multiple cpus on a single runqueue.
Allowing N (where N can be one or greater) CPUs per run queue actually 
increases flexibility as you can still set N to 1 to get the current 
behaviour.


But you add extra code for that on top of what we have, and are also
prevented from making per-cpu assumptions.

And you can get N CPUs per runqueue behaviour by having them in a domain
with no restrictions on idle balancing. So where does your increased
flexibility come from?

One advantage of allowing multiple CPUs per run queue would be at the 
smaller end of the system scale, i.e. a PC with a single hyper threading 
chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
if both CPUs used the one runqueue, and all the nasty side effects that 
come with hyper threading would be minimized at the same time.


I don't know about that -- the current load balancer already minimises
the nasty multi threading effects. SMT is very important for IBM's chips
for example, and they've never had any problem with that side of it
since it was introduced and bugs ironed out (at least, none that I've
heard).



There's a lot of ugly code in the load balancer that is only there to 
overcome the side effects of SMT and dual core.  A lot of it was put 
there by Intel employees trying to make load balancing more friendly to 
their systems.  What I'm suggesting is that an N CPUs per runqueue is a 
better way of achieving that end.  I may (of course) be wrong but I 
think that the idea deserves more consideration than you're willing to 
give it.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
 There's a lot of ugly code in the load balancer that is only there to 
 overcome the side effects of SMT and dual core.  A lot of it was put 
 there by Intel employees trying to make load balancing more friendly to 
 their systems.  What I'm suggesting is that an N CPUs per runqueue is a 
 better way of achieving that end.  I may (of course) be wrong but I 
 think that the idea deserves more consideration than you're willing to 
 give it.

This may be a good one to ask Ingo about, as he did significant
performance work on per-core runqueues for SMT. While I did write
per-node runqueue code for NUMA at some point in the past, I did no
tuning or other performance work on it, only functionality.

I've actually dealt with kernels using elder versions of Ingo's code
for per-core runqueues on SMT, but was never called upon to examine
that particular code for either performance or stability, so I'm
largely ignorant of what the perceived outcome of it was.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
 Nick Piggin wrote:
 
 But you add extra code for that on top of what we have, and are also
 prevented from making per-cpu assumptions.
 
 And you can get N CPUs per runqueue behaviour by having them in a domain
 with no restrictions on idle balancing. So where does your increased
 flexibilty come from?
 
 One advantage of allowing multiple CPUs per run queue would be at the 
 smaller end of the system scale i.e. a PC with a single hyper threading 
 chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
 if both CPUs used the one runqueue and all the nasty side effects that 
 come with hyper threading would be minimized at the same time.
 
 I don't know about that -- the current load balancer already minimises
 the nasty multi threading effects. SMT is very important for IBM's chips
 for example, and they've never had any problem with that side of it
 since it was introduced and bugs ironed out (at least, none that I've
 heard).
 
 
 There's a lot of ugly code in the load balancer that is only there to 
 overcome the side effects of SMT and dual core.  A lot of it was put 
 there by Intel employees trying to make load balancing more friendly to 

I agree that some of that has exploded complexity. I have some
thoughts about better approaches for some of those things, but have
basically been stuck working on VM problems for a while.


 their systems.  What I'm suggesting is that an N CPUs per runqueue is a 
 better way of achieving that end.  I may (of course) be wrong but I 
 think that the idea deserves more consideration than you're willing to 
 give it.

Put it this way: it is trivial to group the load balancing stats
of N CPUs with their own runqueues. Just put them under a domain
and take the sum. The domain essentially takes on the same function
as a single queue with N CPUs under it. Anything _further_ you can
do with individual runqueues (like naturally adding an affinity
pressure ranging from nothing to absolute) is something that you
don't trivially get with a 1:N approach. AFAIKS.
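
To make the equivalence concrete, the aggregation step amounts to something
like this (illustrative field and function names only, not the actual
sched-domains code):

  struct runqueue { unsigned long raw_weighted_load; };  /* one per CPU */

  struct domain_span { int first_cpu; int nr_cpus; };

  /* The domain-wide load a balancer acts on is just the sum of the per-CPU
   * runqueue loads -- the same number a single shared runqueue covering
   * those CPUs would hold directly. */
  static unsigned long domain_load(const struct domain_span *sd,
                                   const struct runqueue *rq)
  {
          unsigned long sum = 0;
          for (int i = 0; i < sd->nr_cpus; i++)
                  sum += rq[sd->first_cpu + i].raw_weighted_load;
          return sum;
  }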

So I will definitely give any idea consideration, but I just need to
be shown where the benefit comes from.



Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:

On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:

Mike Galbraith wrote:

Demystify what?   The casual observer need only read either your attempt
at writing a scheduler, or my attempts at fixing the one we have, to see
that it was high time for someone with the necessary skills to step in.

Make that someone with the necessary clout.

No, I was brutally honest to both of us, but quite correct.


Now progress can happen, which was _not_ happening before.


This is true.

Yup, and progress _is_ happening now, quite rapidly.

Progress as in progress on Ingo's scheduler. I still don't know how we'd
decide when to replace the mainline scheduler or with what.

I don't think we can say Ingo's is better than the alternatives, can we?
If there is some kind of bakeoff, then I'd like one of Con's designs to
be involved, and mine, and Peter's...
I myself was thinking of this as the chance for a much needed 
simplification of the scheduling code and if this can be done with the 
result being reasonable it then gives us the basis on which to propose 
improvements based on the ideas of others such as you mention.


As the size of the cpusched indicates, trying to evaluate alternative 
proposals based on the current O(1) scheduler is fraught.  Hopefully, 


I don't know why. The problem is that you can't really evaluate good
proposals by looking at the code (you can say that one is bad, ie. the
current one, which has a huge amount of temporal complexity and is
explicitly unfair), but it is pretty hard to say one behaves well.


I meant that it's indicative of the amount of work that you have to do 
to implement a new scheduling discipline for evaluation.




And my scheduler for example cuts down the amount of policy code and
code size significantly.


Yours is one of the smaller patches mainly because you perpetuate (or 
you did in the last one I looked at) the (horrible to my eyes) dual 
array (active/expired) mechanism.  That this idea was bad should have 
been apparent to all as soon as the decision was made to excuse some 
tasks from being moved from the active array to the expired array.  This 
essentially meant that there would be circumstances where extreme 
unfairness (to the extent of starvation in some cases) could occur -- the 
very thing that the mechanism was originally designed to prevent (as far 
as I can gather).  Right about then in the development of the O(1) scheduler 
alternative solutions should have been sought.


Other hints that it was a bad idea were the need to transfer time slices 
between children and parents during fork() and exit().


This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.



I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...


I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.




For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?


In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)


To digress, my main concern is that load balancing is being lumped in 
with this new change.  It's becoming "accept this big lump of new code 
or nothing".  I'd rather see a good fix to the intra runqueue/CPU 
scheduler problem implemented first and then if there really are any 
outstanding problems with the load balancer attack them later.  Them all 
being mixed up together gives me a nasty deja vu of impending disaster.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
 All things are not equal; they all have different properties. I like

On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
 Exactly. So we have to explore those properties and evaluate performance
 (in all meanings of the word). That's only logical.

Any chance you'd be willing to put down a few thoughts on what sorts
of standards you'd like to set for both correctness (i.e. the bare
minimum a scheduler implementation must do to be considered valid
beyond not oopsing) and performance metrics (i.e. things that produce
numbers for each scheduler you can compare to say this scheduler is
better than this other scheduler at this.).


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

Well I know people have had woes with the scheduler for ever (I guess that
isn't going to change :P). I think people generally lost a bit of interest
in trying to improve the situation because of the upstream problem.


Yes.

Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:
 Nick Piggin wrote:
 And my scheduler for example cuts down the amount of policy code and
 code size significantly.
 
 Yours is one of the smaller patches mainly because you perpetuate (or 
 you did in the last one I looked at) the (horrible to my eyes) dual 
 array (active/expired) mechanism.

Actually, I wasn't comparing with other out of tree schedulers (but it
is good to know mine is among the smaller ones). I was comparing with
the mainline scheduler, which also has the dual arrays.


  That this idea was bad should have 
 been apparent to all as soon as the decision was made to excuse some 
 tasks from being moved from the active array to the expired array.  This 

My patch doesn't implement any such excusing.


 essentially meant that there would be circumstances where extreme 
 unfairness (to the extent of starvation in some cases) -- the very 
 things that the mechanism was originally designed to ensure (as far as I 
 can gather).  Right about then in the development of the O(1) scheduler 
 alternative solutions should have been sought.

Fairness has always been my first priority, and I consider it a bug
if it is possible for any process to get more CPU time than a CPU hog
over the long term. Or over another task doing the same thing, for
that matter.


 Other hints that it was a bad idea was the need to transfer time slices 
 between children and parents during fork() and exit().

I don't see how that has anything to do with dual arrays. If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependent forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.

You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


 This disregard for the dual array mechanism has prevented me from 
 looking at the rest of your scheduler in any great detail so I can't 
 comment on any other ideas that may be in there.

Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?


 I haven't looked at Con's ones for a while,
 but I believe they are also much more straightforward than mainline...
 
 I like Con's scheduler (partly because it uses a single array) but 
 mainly because it's nice and simple.  However, his earlier schedulers 
 were prone to starvation (admittedly, only if you went out of your way 
 to make it happen) and I tried to convince him to use the anti 
 starvation mechanism in my SPA schedulers but was unsuccessful.  I 
 haven't looked at his latest scheduler that sparked all this furore so 
 can't comment on it.

I agree starvation or unfairness is unacceptable for a new scheduler.


 For example, let's say all else is equal between them, then why would
 we go with the O(logN) implementation rather than the O(1)?
 
 In the highly unlikely event that you can't separate them on technical 
 grounds, Occam's razor recommends choosing the simplest solution. :-)

O(logN) vs O(1) is technical grounds.

But yeah, see my earlier comment: simplicity would be a factor too.



Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, Nick Piggin wrote:

  All things are not equal; they all have different properties. I like
 
 Exactly. So we have to explore those properties and evaluate performance
 (in all meanings of the word). That's only logical.

I had a quick look at Ingo's code yesterday. Ingo is always smart to 
prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
And even this code does that pretty nicely. The deadline design looks 
good, although I think the final key calculation code will end up quite 
different from what it looks now.
I would suggest to thoroughly test all your alternatives before deciding. 
Some code and design may look very good and small at the beginning, but 
when you start patching it to cover all the dark spots, you effectively 
end up with another thing (in both design and code footprint).
About O(1), I never thought it was a must (besides a good marketing 
material), and O(log(N)) *may* be just fine (to be verified, of course).



- Davide




Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
 On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
  All things are not equal; they all have different properties. I like
 
 On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
  Exactly. So we have to explore those properties and evaluate performance
  (in all meanings of the word). That's only logical.
 
 Any chance you'd be willing to put down a few thoughts on what sorts
 of standards you'd like to set for both correctness (i.e. the bare
 minimum a scheduler implementation must do to be considered valid
 beyond not oopsing) and performance metrics (i.e. things that produce
 numbers for each scheduler you can compare to say this scheduler is
 better than this other scheduler at this.).

Yeah I guess that's the hard part :)

For correctness, I guess fairness is an easy one. I think that unfairness
is basically a bug and that it would be very unfortunate to merge something
unfair. But this is just within the context of a single runqueue... for
better or worse, we allow some unfairness in multiprocessors for performance
reasons of course.
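
A minimal sketch of the single-runqueue fairness check this implies
(illustrative only, not one of the existing testcases): pin two identical
hogs to one CPU, let them run for a while, and compare how much work each
got through.

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  static long long spin_for(int secs)     /* work completed in 'secs' of wall time */
  {
          long long iters = 0;
          time_t end = time(NULL) + secs;
          while (time(NULL) < end)
                  iters++;
          return iters;
  }

  int main(void)
  {
          cpu_set_t one;
          CPU_ZERO(&one);
          CPU_SET(0, &one);
          sched_setaffinity(0, sizeof(one), &one);  /* keep it to one runqueue */

          int fd[2][2];
          long long n[2];
          for (int i = 0; i < 2; i++) {
                  pipe(fd[i]);
                  if (fork() == 0) {                /* identical CPU hog child */
                          long long done = spin_for(30);
                          write(fd[i][1], &done, sizeof(done));
                          _exit(0);
                  }
          }
          for (int i = 0; i < 2; i++)
                  read(fd[i][0], &n[i], sizeof(n[i]));
          printf("work ratio A/B: %.3f (should stay close to 1.0)\n",
                 (double)n[0] / (double)n[1]);
          return 0;
  }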

Latency. Given N tasks in the system, an arbitrary task should get
onto the CPU in a bounded amount of time (excluding events like freak
IRQ holdoffs and such, obviously -- ie. just considering the context
of the scheduler's state machine).
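
A rough sketch of how that bound could be probed (purely illustrative, not
an existing test; the number also includes timer granularity, but relative
comparisons between schedulers still work): start some CPU hogs, then
repeatedly sleep a fixed interval and record how much later than asked the
probe got back onto the CPU.

  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>

  static long long now_ns(void)
  {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  int main(int argc, char **argv)
  {
          int hogs = argc > 1 ? atoi(argv[1]) : 4;  /* N competing tasks */
          for (int i = 0; i < hogs; i++)
                  if (fork() == 0)
                          for (;;)
                                  ;                 /* CPU hog */

          long long worst = 0;
          for (int i = 0; i < 1000; i++) {
                  long long t0 = now_ns();
                  usleep(10 * 1000);                /* ask to wake in 10ms */
                  long long late = now_ns() - t0 - 10 * 1000000LL;
                  if (late > worst)
                          worst = late;
          }
          printf("worst extra delay getting back on CPU: %lld us\n", worst / 1000);
          fflush(stdout);
          kill(0, SIGTERM);                         /* stop the hogs (and the probe) */
          return 0;
  }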

I wouldn't like to see a significant drop in any micro or macro
benchmarks or even worse real workloads, but I could accept some if it
means haaving a fair scheduler by default.

Now it isn't actually too hard to achieve the above, I think. The hard bit
is trying to compare interactivity. Ideally, we'd be able to get scripted
dumps of login sessions, and measure scheduling latencies of key proceses
(sh/X/wm/xmms/firefox/etc).  People would send a dump if they were having
problems with any scheduler, and we could compare all of them against it.
Wishful thinking!


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
 I had a quick look at Ingo's code yesterday. Ingo is always smart to 
 prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
 And even this code does that pretty nicely. The deadline designs looks 
 good, although I think the final key calculation code will end up quite 
 different from what it looks now.

The additive nice_offset breaks nice levels. A multiplicative priority
weighting of a different, nonnegative metric of cpu utilization from
what's now used is required for nice levels to work. I've been trying
to point this out politely by strongly suggesting testing whether nice
levels work.
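
A toy comparison of the two approaches (hypothetical helper names and
numbers; the ~1.25-per-nice-step ratio below is only an example, not
anything from the posted patches):

  #include <math.h>
  #include <stdio.h>

  /* Hypothetical per-nice multiplier: each nice step scales the charge rate. */
  static double weight(int nice)
  {
          return pow(1.25, -nice);
  }

  /* Additive: key is real runtime plus a constant per-nice offset.  The
   * offset only shifts who runs first; two hogs still accumulate runtime at
   * the same rate, so their long-run CPU split drifts back towards 50/50. */
  static double key_additive(double runtime_ms, int nice)
  {
          return runtime_ms + 100.0 * nice;
  }

  /* Multiplicative: runtime is charged scaled by 1/weight(nice), so a task
   * at a lower nice value accumulates "virtual" runtime more slowly and
   * keeps a proportionally larger CPU share however long the hogs run. */
  static double key_multiplicative(double runtime_ms, int nice)
  {
          return runtime_ms / weight(nice);
  }

  int main(void)
  {
          /* Two hogs, nice 0 vs nice 5, each having run 500ms so far: the
           * additive keys differ by a fixed 500, while the multiplicative
           * key of the nice-5 task grows about 3x as fast -- and only the
           * latter translates into a smaller long-run CPU share. */
          printf("additive: %.0f vs %.0f   multiplicative: %.0f vs %.0f\n",
                 key_additive(500, 0), key_additive(500, 5),
                 key_multiplicative(500, 0), key_multiplicative(500, 5));
          return 0;
  }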


On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
 I would suggest to thoroughly test all your alternatives before deciding. 
 Some code and design may look very good and small at the beginning, but 
 when you start patching it to cover all the dark spots, you effectively 
 end up with another thing (in both design and code footprint).
 About O(1), I never thought it was a must (besides a good marketing 
 material), and O(log(N)) *may* be just fine (to be verified, of course).

The trouble with thorough testing right now is that no one agrees on
what the tests should be and a number of the testcases are not in great
shape. An agreed-upon set of testcases for basic correctness should be
devised and the implementations of those testcases need to be
maintainable code and the tests set up for automated testing and
changing their parameters without recompiling via command-line options.

Once there's a standard regression test suite for correctness, one
needs to be devised for performance, including interactive performance.
The primary difficulty I see along these lines is finding a way to
automate tests of graphics and input device response performance. Others,
like how deterministically priorities are respected over progressively
smaller time intervals and noninteractive workload performance are
nowhere near as difficult to arrange and in many cases already exist.
Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.


-- wli


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, Nick Piggin wrote:

 To be clear, I'm not saying O(logN) itself is a big problem. Type
 
   plot [10:100] x with lines, log(x) with lines, 1 with lines

Haha, Nick, I know what a log() looks like :)
The Time Ring I posted as an example (which is nothing other than a 
ring-based bucket sort) keeps O(1) if you can concede some timer 
clustering.
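
Roughly, the structure being described is a ring of buckets indexed by
(deadline - base) / granularity.  A small illustration of the idea (not the
code Davide posted):

  #define RING_SLOTS 256
  #define SLOT_NS    1000000ULL         /* 1ms of clustering per slot */

  struct entry {
          struct entry *next;
          unsigned long long deadline_ns;
  };

  struct time_ring {
          struct entry *slot[RING_SLOTS];
          unsigned long long base_ns;   /* deadline represented by slot[cursor] */
          unsigned int cursor;
  };

  /* O(1) insert: everything landing in the same 1ms window shares a slot
   * (that is the "timer clustering" concession); deadlines beyond the
   * ring's horizon are parked in the last slot (a fuller implementation
   * would cascade them later). */
  static void ring_insert(struct time_ring *r, struct entry *e)
  {
          unsigned long long d = (e->deadline_ns - r->base_ns) / SLOT_NS;
          unsigned int idx = (r->cursor + (d < RING_SLOTS ? d : RING_SLOTS - 1))
                             % RING_SLOTS;
          e->next = r->slot[idx];
          r->slot[idx] = e;
  }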


- Davide




Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

William Lee Irwin III wrote:

On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
I had a quick look at Ingo's code yesterday. Ingo is always smart to 
prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
And even this code does that pretty nicely. The deadline designs looks 
good, although I think the final key calculation code will end up quite 
different from what it looks now.


The additive nice_offset breaks nice levels. A multiplicative priority
weighting of a different, nonnegative metric of cpu utilization from
what's now used is required for nice levels to work. I've been trying
to point this out politely by strongly suggesting testing whether nice
levels work.


On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
I would suggest to thoroughly test all your alternatives before deciding. 
Some code and design may look very good and small at the beginning, but 
when you start patching it to cover all the dark spots, you effectively 
end up with another thing (in both design and code footprint).
About O(1), I never thought it was a must (besides a good marketing 
material), and O(log(N)) *may* be just fine (to be verified, of course).


The trouble with thorough testing right now is that no one agrees on
what the tests should be and a number of the testcases are not in great
shape. An agreed-upon set of testcases for basic correctness should be
devised and the implementations of those testcases need to be
maintainable code and the tests set up for automated testing and
changing their parameters without recompiling via command-line options.

Once there's a standard regression test suite for correctness, one
needs to be devised for performance, including interactive performance.
The primary difficulty I see along these lines is finding a way to
automate tests of graphics and input device response performance. Others,
like how deterministically priorities are respected over progressively
smaller time intervals and noninteractive workload performance are
nowhere near as difficult to arrange and in many cases already exist.
Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.


At this point, I'd like to direct everyone's attention to the simloads package:

http://downloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz

which contains a set of programs designed to be used in the construction 
of CPU scheduler tests.  Of particular use is the aspin program which 
can be used to launch tasks with specified sleep/wake characteristics.
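
The core of such a load generator is just a periodic burn/sleep loop, along
these lines (an illustrative sketch; the real aspin's options and behaviour
may differ):

  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>

  static long long now_ns(void)
  {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  int main(int argc, char **argv)
  {
          long burn_ms  = argc > 1 ? atol(argv[1]) : 5;   /* CPU burst per cycle */
          long sleep_ms = argc > 2 ? atol(argv[2]) : 45;  /* sleep per cycle     */

          for (;;) {
                  long long start = now_ns();
                  while (now_ns() - start < burn_ms * 1000000LL)
                          ;                               /* burn CPU */
                  usleep(sleep_ms * 1000);                /* then go to sleep */
          }
  }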


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote:
 
 The trouble with thorough testing right now is that no one agrees on
 what the tests should be and a number of the testcases are not in great
 shape. An agreed-upon set of testcases for basic correctness should be
 devised and the implementations of those testcases need to be
 maintainable code and the tests set up for automated testing and
 changing their parameters without recompiling via command-line options.
 
 Once there's a standard regression test suite for correctness, one
 needs to be devised for performance, including interactive performance.
 The primary difficulty I see along these lines is finding a way to
 automate tests of graphics and input device response performance. Others,
 like how deterministically priorities are respected over progressively
 smaller time intervals and noninteractive workload performance are
 nowhere near as difficult to arrange and in many cases already exist.
 Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

Definitely. It would be really good if we could have interactivity
regression tests too (see my earlier wishful email). The problem
with a lot of the scripted interactivity tests I see is that they
don't really capture the complexities of the interactions within,
say, an interactive X session. Others just go straight for trying
to exploit the design by making lots of high priority processes
runnable at once. This just provides an unrealistic decoy and you
end up trying to tune for the wrong thing.



Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, William Lee Irwin III wrote:

 On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
  I would suggest to thoroughly test all your alternatives before deciding. 
  Some code and design may look very good and small at the beginning, but 
  when you start patching it to cover all the dark spots, you effectively 
  end up with another thing (in both design and code footprint).
  About O(1), I never thought it was a must (besides a good marketing 
  material), and O(log(N)) *may* be just fine (to be verified, of course).
 
 The trouble with thorough testing right now is that no one agrees on
 what the tests should be and a number of the testcases are not in great
 shape. An agreed-upon set of testcases for basic correctness should be
 devised and the implementations of those testcases need to be
 maintainable code and the tests set up for automated testing and
 changing their parameters without recompiling via command-line options.
 
 Once there's a standard regression test suite for correctness, one
 needs to be devised for performance, including interactive performance.
 The primary difficulty I see along these lines is finding a way to
 automate tests of graphics and input device response performance. Others,
 like how deterministically priorities are respected over progressively
 smaller time intervals and noninteractive workload performance are
 nowhere near as difficult to arrange and in many cases already exist.
 Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

What I meant was that the rules (requirements and associated test cases) 
for this new Scheduler Amazing Race should be set forward, and not kept a 
moving target to fit/follow one or the other implementation.


- Davide




Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote:
 On Tue, 17 Apr 2007, William Lee Irwin III wrote:
 
  On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
   I would suggest to thoroughly test all your alternatives before deciding. 
   Some code and design may look very good and small at the beginning, but 
   when you start patching it to cover all the dark spots, you effectively 
   end up with another thing (in both design and code footprint).
   About O(1), I never thought it was a must (besides a good marketing 
   material), and O(log(N)) *may* be just fine (to be verified, of course).
  
  The trouble with thorough testing right now is that no one agrees on
  what the tests should be and a number of the testcases are not in great
  shape. An agreed-upon set of testcases for basic correctness should be
  devised and the implementations of those testcases need to be
  maintainable code and the tests set up for automated testing and
  changing their parameters without recompiling via command-line options.
  
  Once there's a standard regression test suite for correctness, one
  needs to be devised for performance, including interactive performance.
  The primary difficulty I see along these lines is finding a way to
  automate tests of graphics and input device response performance. Others,
  like how deterministically priorities are respected over progressively
  smaller time intervals and noninteractive workload performance are
  nowhere near as difficult to arrange and in many cases already exist.
  Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
 
 What I meant was that the rules (requirements and associated test cases) 
 for this new Scheduler Amazing Race should be set forward, and not kept a 
 moving target to fit/follow one or the other implementation.

Exactly. Well I don't mind if it is a moving target as such, just as
long as the decisions are rational (no "blah is more important
because I say so").


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* William Lee Irwin III [EMAIL PROTECTED] wrote:

 On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
  I had a quick look at Ingo's code yesterday. Ingo is always smart to 
  prepare a main dish (feature) with a nice sider (code cleanup) to 
  Linus ;) And even this code does that pretty nicely. The deadline 
  designs looks good, although I think the final key calculation 
  code will end up quite different from what it looks now.
 
 The additive nice_offset breaks nice levels. A multiplicative priority 
 weighting of a different, nonnegative metric of cpu utilization from 
 what's now used is required for nice levels to work. I've been trying 
 to point this out politely by strongly suggesting testing whether nice 
 levels work.

granted, CFS's nice code is still incomplete, but you err quite 
significantly with this extreme statement that they are broken.

nice levels certainly work to a fair degree even in the current code and 
much of the focus is elsewhere - just try it. (In fact i claim that 
CFS's nice levels often work _better_ than the mainline scheduler's nice 
level support, for the testcases that matter to users.)

The precise behavior of nice levels, as i pointed it out in previous 
mails, is largely 'uninteresting' and it has changed multiple times in 
the past 10 years.

What matters to users is mainly: whether X reniced to -10 does get 
enough CPU time and whether stuff reniced to +19 doesn't take away too 
much CPU time from the rest of the system. _How_ a Linux scheduler 
achieves this is an internal matter and certainly CFS does it in a hacky 
way at the moment.

All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
could come up with is just a fancy academic technicality that has no 
real significance to any of the testers who are trying CFS right now.

Sure we prefer final solutions that are clean and make sense (because 
such things are the easiest to maintain long-term), and often such final 
solutions are quite close to academic concepts, and i think Davide 
correctly observed this by saying that the final key calculation code 
will end up quite different from what it looks now, but your 
extreme-end claim of 'breakage' for something that is just plain 
incomplete is not really a fair characterisation at this point.

Anyone who thinks that there exists only two kinds of code: 100% correct 
and 100% incorrect with no shades of grey inbetween is in reality a sort 
of an extremist: whom, depending on mood and affection, we could call 
either a 'coding purist' or a 'coding taliban' ;-)

Ingo


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 
 * William Lee Irwin III [EMAIL PROTECTED] wrote:
 
  On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
   I had a quick look at Ingo's code yesterday. Ingo is always smart to 
   prepare a main dish (feature) with a nice sider (code cleanup) to 
   Linus ;) And even this code does that pretty nicely. The deadline 
   designs looks good, although I think the final key calculation 
   code will end up quite different from what it looks now.
  
  The additive nice_offset breaks nice levels. A multiplicative priority 
  weighting of a different, nonnegative metric of cpu utilization from 
  what's now used is required for nice levels to work. I've been trying 
  to point this out politely by strongly suggesting testing whether nice 
  levels work.
 
 granted, CFS's nice code is still incomplete, but you err quite 
 significantly with this extreme statement that they are broken.
 
 nice levels certainly work to a fair degree even in the current code and 
 much of the focus is elsewhere - just try it. (In fact i claim that 
 CFS's nice levels often work _better_ than the mainline scheduler's nice 
 level support, for the testcases that matter to users.)
 
 The precise behavior of nice levels, as i pointed it out in previous 
 mails, is largely 'uninteresting' and it has changed multiple times in 
 the past 10 years.
 
 What matters to users is mainly: whether X reniced to -10 does get 
 enough CPU time and whether stuff reniced to +19 doesnt take away too 
 much CPU time from the rest of the system.

I agree there.


 _How_ a Linux scheduler 
 achieves this is an internal matter and certainly CFS does it in a hacky 
 way at the moment.
 
 All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
 could come up with is just a fancy academic technicality that has no 
 real significance to any of the testers who are trying CFS right now.
 
 Sure we prefer final solutions that are clean and make sense (because 
 such things are the easiest to maintain long-term), and often such final 
 solutions are quite close to academic concepts, and i think Davide 
 correctly observed this by saying that the final key calculation code 
 will end up quite different from what it looks now, but your 
 extreme-end claim of 'breakage' for something that is just plain 
 incomplete is not really a fair characterisation at this point.
 
 Anyone who thinks that there exists only two kinds of code: 100% correct 
 and 100% incorrect with no shades of grey inbetween is in reality a sort 
 of an extremist: whom, depending on mood and affection, we could call 
 either a 'coding purist' or a 'coding taliban' ;-)

Only if you are an extremist-naming extremist with no shades of grey.
Others, like myself, also include 'coding al-qaeda' and 'coding john
howard' in that scale.



Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:

Nick Piggin wrote:

And my scheduler for example cuts down the amount of policy code and
code size significantly.
Yours is one of the smaller patches mainly because you perpetuate (or 
you did in the last one I looked at) the (horrible to my eyes) dual 
array (active/expired) mechanism.


Actually, I wasn't comparing with other out of tree schedulers (but it
is good to know mine is among the smaller ones). I was comparing with
the mainline scheduler, which also has the dual arrays.


 That this idea was bad should have 
been apparent to all as soon as the decision was made to excuse some 
tasks from being moved from the active array to the expired array.  This 


My patch doesn't implement any such excusing.


essentially meant that there would be circumstances where extreme 
unfairness (to the extent of starvation in some cases) -- the very 
things that the mechanism was originally designed to ensure (as far as I 
can gather).  Right about then in the development of the O(1) scheduler 
alternative solutions should have been sought.


Fairness has always been my first priority, and I consider it a bug
if it is possible for any process to get more CPU time than a CPU hog
over the long term. Or over another task doing the same thing, for
that matter.


Other hints that it was a bad idea was the need to transfer time slices 
between children and parents during fork() and exit().


I don't see how that has anything to do with dual arrays.


It's totally to do with the dual arrays.  The only real purpose of the 
time slice in O(1) (regardless of what its perceived purpose was) was to 
control the switching between the arrays.



If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependant forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.


This is an artefact of trying to control nice using time slices while 
using them for controlling array switching and whatever else they were 
being used for.  Priority (static and dynamic) is the best way to 
implement nice.




You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.


Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?


No.  It would be nice if the basic ideas that each scheduler tries to 
implement could be extracted and explained though.  This could lead to a 
melding of ideas that leads to something quite good.






I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...
I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.


I agree starvation or unfairness is unacceptable for a new scheduler.



For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?
In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)


O(logN) vs O(1) is technical grounds.


In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
greater than O(logN)'s k factor multiplied by logMaxN.
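
(To put numbers on that: if the biggest runqueue anyone cares about holds,
say, 1024 runnable tasks, logMaxN is about 10, so by this rule the O(1)
scheduler is preferred as long as its constant factor is no more than
roughly ten times the O(logN) scheduler's.)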




But yeah, see my earlier comment: simplicity would be a factor too.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:
 Nick Piggin wrote:
 Other hints that it was a bad idea was the need to transfer time slices 
 between children and parents during fork() and exit().
 
 I don't see how that has anything to do with dual arrays.
 
 It's totally to do with the dual arrays.  The only real purpose of the 
 time slice in O(1) (regardless of what its perceived purpose was) was to 
 control the switching between the arrays.

The O(1) design is pretty convoluted in that regard. In my scheduler,
the only purpose of the arrays is to renew time slices.

The fork/exit logic is added to make interactivity better. Ingo's
scheduler has similar equivalent logic.


 If you put
 a new child at the back of the queue, then your various interactive
 shell commands that typically do a lot of dependant forking get slowed
 right down behind your compile job. If you give a new child its own
 timeslice irrespective of the parent, then you have things like 'make'
 (which doesn't use a lot of CPU time) spawning off lots of high
 priority children.
 
 This is an artefact of trying to control nice using time slices while 
 using them for controlling array switching and whatever else they were 
 being used for.  Priority (static and dynamic) is the best way to 
 implement nice.

I don't like the timeslice based nice in mainline. It's too nasty
with latencies. nicksched is far better in that regard IMO.

But I don't know how you can assert a particular way is the best way
to do something.


 You need to do _something_ (Ingo's does). I don't see why this would
 be tied with a dual array. FWIW, mine doesn't do anything on exit()
 like most others, but it may need more tuning in this area.
 
 
 This disregard for the dual array mechanism has prevented me from 
 looking at the rest of your scheduler in any great detail so I can't 
 comment on any other ideas that may be in there.
 
 Well I wasn't really asking you to review it. As I said, everyone
 has their own idea of what a good design does, and review can't really
 distinguish between the better of two reasonable designs.
 
 A fair evaluation of the alternatives seems like a good idea though.
 Nobody is actually against this, are they?
 
 No.  It would be nice if the basic ideas that each scheduler tries to 
 implement could be extracted and explained though.  This could lead to a 
 melding of ideas that leads to something quite good.
 
 
 
 I haven't looked at Con's ones for a while,
 but I believe they are also much more straightforward than mainline...
 I like Con's scheduler (partly because it uses a single array) but 
 mainly because it's nice and simple.  However, his earlier schedulers 
 were prone to starvation (admittedly, only if you went out of your way 
 to make it happen) and I tried to convince him to use the anti 
 starvation mechanism in my SPA schedulers but was unsuccessful.  I 
 haven't looked at his latest scheduler that sparked all this furore so 
 can't comment on it.
 
 I agree starvation or unfairness is unacceptable for a new scheduler.
 
 
 For example, let's say all else is equal between them, then why would
 we go with the O(logN) implementation rather than the O(1)?
 In the highly unlikely event that you can't separate them on technical 
 grounds, Occam's razor recommends choosing the simplest solution. :-)
 
 O(logN) vs O(1) is technical grounds.
 
 In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
 greater than O(logN)'s k factor multiplied by logMaxN.

Yes, or even significantly greater around typical large sizes of N.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Andy Whitcroft
Ingo Molnar wrote:
 [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
 
 i'm pleased to announce the first release of the Modular Scheduler Core
 and Completely Fair Scheduler [CFS] patchset:
 
http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
 
 This project is a complete rewrite of the Linux task scheduler. My goal
 is to address various feature requests and to fix deficiencies in the
 vanilla scheduler that were suggested/found in the past few years, both
 for desktop scheduling and for server scheduling workloads.
 
 [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
   new scheduler will be active by default and all tasks will default
   to the new SCHED_FAIR interactive scheduling class. ]
 
 Highlights are:
 
  - the introduction of Scheduling Classes: an extensible hierarchy of
scheduler modules. These modules encapsulate scheduling policy
details and are handled by the scheduler core without the core
code assuming too much about them.
 
  - sched_fair.c implements the 'CFS desktop scheduler': it is a
replacement for the vanilla scheduler's SCHED_OTHER interactivity
code.
 
i'd like to give credit to Con Kolivas for the general approach here:
he has proven via RSDL/SD that 'fair scheduling' is possible and that
it results in better desktop scheduling. Kudos Con!
 
The CFS patch uses a completely different approach and implementation
from RSDL/SD. My goal was to make CFS's interactivity quality exceed
that of RSDL/SD, which is a high standard to meet :-) Testing
feedback is welcome to decide this one way or another. [ and, in any
case, all of SD's logic could be added via a kernel/sched_sd.c module
as well, if Con is interested in such an approach. ]
 
CFS's design is quite radical: it does not use runqueues, it uses a
time-ordered rbtree to build a 'timeline' of future task execution,
and thus has no 'array switch' artifacts (by which both the vanilla
scheduler and RSDL/SD are affected).
 
CFS uses nanosecond granularity accounting and does not rely on any
jiffies or other HZ detail. Thus the CFS scheduler has no notion of
'timeslices' and has no heuristics whatsoever. There is only one
central tunable:
 
  /proc/sys/kernel/sched_granularity_ns
 
which can be used to tune the scheduler from 'desktop' (low
latencies) to 'server' (good batching) workloads. It defaults to a
setting suitable for desktop workloads. SCHED_BATCH is handled by the
CFS scheduler module too.
 
due to its design, the CFS scheduler is not prone to any of the
'attacks' that exist today against the heuristics of the stock
scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
work fine and do not impact interactivity and produce the expected
behavior.
 
the CFS scheduler has a much stronger handling of nice levels and
SCHED_BATCH: both types of workloads should be isolated much more
aggressively than under the vanilla scheduler.
 
( another detail: due to nanosec accounting and timeline sorting,
  sched_yield() support is very simple under CFS, and in fact under
  CFS sched_yield() behaves much better than under any other
  scheduler i have tested so far. )
 
  - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
way than the vanilla scheduler does. It uses 100 runqueues (for all
100 RT priority levels, instead of 140 in the vanilla scheduler)
and it needs no expired array.
 
  - reworked/sanitized SMP load-balancing: the runqueue-walking
assumptions are gone from the load-balancing code now, and
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.
 
 the core scheduler got smaller by more than 700 lines:
 
  kernel/sched.c | 1454 
 
  1 file changed, 372 insertions(+), 1082 deletions(-)
 
 and even adding all the scheduling modules, the total size impact is
 relatively small:
 
  18 files changed, 1454 insertions(+), 1133 deletions(-)
 
 most of the increase is due to extensive comments. The kernel size
 impact is in fact a small negative:
 
    text    data     bss     dec     hex filename
   23366    4001      24   27391    6aff kernel/sched.o.vanilla
   24159    2705      56   26920    6928 kernel/sched.o.CFS
 
 (this is mainly due to the benefit of getting rid of the expired array
 and its data structure overhead.)
 
 thanks go to Thomas Gleixner and Arjan van de Ven for review of this
 patchset.
 
 as usual, any sort of feedback, bugreports, fixes and suggestions are
 more than welcome,

Pushed this through the test.kernel.org and nothing new blew up.
Notably the kernbench figures are within expectations even on the bigger
numa systems, commonly badly affected by balancing problems in the
scheduler.

I see there is a second one out, I'll push that one through too.

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Nick Piggin [EMAIL PROTECTED] wrote:

  Anyone who thinks that there exists only two kinds of code: 100% 
  correct and 100% incorrect with no shades of grey inbetween is in 
  reality a sort of an extremist: whom, depending on mood and 
  affection, we could call either a 'coding purist' or a 'coding 
  taliban' ;-)
 
 Only if you are an extremist-naming extremist with no shades of grey. 
 Others, like myself, also include 'coding al-qaeda' and 'coding john 
 howard' in that scale.

heh ;) You, you ... nitpicking extremist! ;)

And beware that you just commited another act of extremism too:

 I agree there.

because you just went to the extreme position of saying that "i agree 
with this portion 100%", instead of saying "this seems to be 91.5% 
correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200".

and the nasty thing is, that in reality even shades of grey, if you 
print them out, are just a set of extreme black dots on an extreme white 
sheet of paper! ;)

[ so i guess we've got to consider the scope of extremism too: the 
  larger the scope, the more limiting and hence the more dangerous it
  is. ]

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

William Lee Irwin III wrote:

On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote:

This doesn't make any sense to me.
For a start, exact simultaneous operation would be impossible to achieve 
except with highly specialized architecture such as the long departed 
transputer.  And secondly, I can't see why it's necessary.


We're not going to make any headway here, so we might as well drop the
thread.


Yes, we were starting to go around in circles weren't we?



There are other things to talk about anyway, for instance I'm seeing
interest in plugsched come about from elsewhere and am taking an
interest in getting it into shape wrt. various design goals therefore.

Probably the largest issue of note is getting scheduler drivers
loadable as kernel modules. Addressing the points Ingo made that can
be addressed are also lined up for this effort.

Comments on which directions you'd like this to go in these respects
would be appreciated, as I regard you as the current project owner.


I'd do scan through LKML from about 18 months ago looking for mention of 
runtime configurable version of plugsched.  Some students at a 
university (in Germany, I think) posted some patches adding this feature 
to plugsched around about then.


I never added them to plugsched proper as I knew (from previous 
experience when the company I worked for posted patches with similar 
functionality) that Linus would like this idea less than he did the 
current plugsched mechanism.


Unfortunately, my own cache of the relevant e-mails got overwritten 
during a Fedora Core upgrade (I've since moved /var onto a separate 
drive to avoid a repetition) or I would dig them out and send them to 
you.  I'd provided them with copies of the company's patches to use as a 
guide to how to overcome the problems associated with changing 
schedulers on a running system (a few non trivial locking issues pop up).


Maybe if one of the students still reads LKML he will provide a pointer.

Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
 Any chance you'd be willing to put down a few thoughts on what sorts
 of standards you'd like to set for both correctness (i.e. the bare
 minimum a scheduler implementation must do to be considered valid
 beyond not oopsing) and performance metrics (i.e. things that produce
 numbers for each scheduler you can compare to say this scheduler is
 better than this other scheduler at this.).

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
 Yeah I guess that's the hard part :)
 For correctness, I guess fairness is an easy one. I think that unfairness
 is basically a bug and that it would be very unfortunate to merge something
 unfair. But this is just within the context of a single runqueue... for
 better or worse, we allow some unfairness in multiprocessors for performance
 reasons of course.

Requiring that identical tasks be allocated equal shares of CPU
bandwidth is the easy part here. ringtest.c exercises another aspect
of fairness that is extremely important. Generalizing ringtest.c is
a good idea for fairness testing.
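
A minimal sketch of the equal-shares half of such a test (illustrative
only -- the names, constants and method are assumptions, not taken from
ringtest.c or any existing suite): fork a few identical CPU hogs, let
them run for a fixed window, and report the spread of work done.
Extending it with per-task nice values would cover the weighted case too.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define NTASKS   4          /* identical CPU hogs (illustrative) */
    #define SECONDS  10         /* measurement window */

    int main(void)
    {
        /* shared counters: one per hog, written by the children */
        volatile unsigned long *count = mmap(NULL, NTASKS * sizeof(*count),
                                             PROT_READ | PROT_WRITE,
                                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pid_t pid[NTASKS];

        for (int i = 0; i < NTASKS; i++) {
            pid[i] = fork();
            if (pid[i] == 0) {
                for (;;)                /* child: pure CPU hog */
                    count[i]++;
            }
        }

        sleep(SECONDS);

        unsigned long total = 0, min = ~0UL, max = 0;
        for (int i = 0; i < NTASKS; i++) {
            kill(pid[i], SIGKILL);
            total += count[i];
            if (count[i] < min) min = count[i];
            if (count[i] > max) max = count[i];
        }
        while (wait(NULL) > 0)
            ;

        /* identical tasks should all end up near 1/NTASKS of the total */
        printf("share: min %.1f%%  max %.1f%%  (ideal %.1f%%)\n",
               100.0 * min / total, 100.0 * max / total, 100.0 / NTASKS);
        return 0;
    }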

But another aspect of fairness is that controlled unfairness is also
intended to exist, in no small part by virtue of nice levels, but also
in the form of favoring tasks that are considered interactive somehow.
Testing various forms of controlled unfairness to ensure that they are
indeed controlled and otherwise have the semantics intended is IMHO the
more difficult aspect of fairness testing.


On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
 Latency. Given N tasks in the system, an arbitrary task should get
 onto the CPU in a bounded amount of time (excluding events like freak
 IRQ holdoffs and such, obviously -- ie. just considering the context
 of the scheduler's state machine).

ISTR Davide Libenzi having a scheduling latency test a number of years
ago. Resurrecting that and tuning it to the needs of this kind of
testing sounds relevant here. The test suite Peter Williams mentioned
would also help.
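
As a rough sketch of what such a latency probe could look like (an assumed
stand-in, not Davide Libenzi's actual test): repeatedly ask for a short
sleep while the box is loaded and record how late the wakeups come back;
the worst case over many iterations approximates wakeup latency as seen by
an interactive-ish task.

    #include <stdio.h>
    #include <time.h>

    static long long ns(const struct timespec *t)
    {
        return t->tv_sec * 1000000000LL + t->tv_nsec;
    }

    int main(void)
    {
        const long interval_ns = 10 * 1000 * 1000;      /* ask for 10ms naps */
        long long worst = 0;

        for (int i = 0; i < 1000; i++) {
            struct timespec before, after, req = { 0, interval_ns };

            clock_gettime(CLOCK_MONOTONIC, &before);
            nanosleep(&req, NULL);
            clock_gettime(CLOCK_MONOTONIC, &after);

            /* how much later than requested did we get back on the CPU? */
            long long late = ns(&after) - ns(&before) - interval_ns;
            if (late > worst)
                worst = late;
        }
        printf("worst wakeup lateness: %lld us\n", worst / 1000);
        return 0;
    }

Run next to a make -j or similar load, the same binary gives directly
comparable numbers across schedulers.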


On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
 I wouldn't like to see a significant drop in any micro or macro
 benchmarks or, even worse, real workloads, but I could accept some if it
 means having a fair scheduler by default.

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
 Now it isn't actually too hard to achieve the above, I think. The hard bit
 is trying to compare interactivity. Ideally, we'd be able to get scripted
 dumps of login sessions, and measure scheduling latencies of key processes
 (sh/X/wm/xmms/firefox/etc).  People would send a dump if they were having
 problems with any scheduler, and we could compare all of them against it.
 Wishful thinking!

That's a pretty good idea. I'll queue up writing something of that form
as well.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
 This observation of Peter's is the best thing to come out of this
 whole foofaraw.  Looking at what's happening in CPU-land, I think it's
 going to be necessary, within a couple of years, to replace the whole
 idea of CPU scheduling with run queue scheduling across a complex,
 possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
 point in churning the mainline scheduler through a design that isn't
 significantly more flexible than any of those now under discussion.

On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
 Why? If you do that, then your load balancer just becomes less flexible
 because it is harder to have tasks run on one or the other.

On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
 You can have single-runqueue-per-domain behaviour (or close to) just by
 relaxing all restrictions on idle load balancing within that domain. It
 is harder to go the other way and place any per-cpu affinity or
 restrictions with multiple cpus on a single runqueue.

The big sticking point here is order-sensitivity. One can point to
stringent sched_yield() ordering but that's not so important in and of
itself. The more significant case is RT applications which are order-
sensitive. Per-cpu runqueues rather significantly disturb the ordering
requirements of applications that care about it.

In terms of a plugging framework, the per-cpu arrangement precludes or
makes extremely awkward scheduling policies that don't have per-cpu
runqueues, for instance, the 2.4.x policy. There is also the alternate
SMP scalability strategy of a lockless scheduler with a single global
queue, which is more performance-oriented.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Peter Williams [EMAIL PROTECTED] wrote:

  And my scheduler for example cuts down the amount of policy code and 
  code size significantly.
 
 Yours is one of the smaller patches mainly because you perpetuate (or 
 you did in the last one I looked at) the (horrible to my eyes) dual 
 array (active/expired) mechanism.  That this idea was bad should have 
 been apparent to all as soon as the decision was made to excuse some 
 tasks from being moved from the active array to the expired array.  
 This essentially meant that there would be circumstances where extreme 
 unfairness (to the extent of starvation in some cases) could occur -- the 
 very thing that the mechanism was originally designed to prevent (as far 
 as I can gather).  Right about then in the development of the O(1) 
 scheduler alternative solutions should have been sought.

in hindsight i'd agree. But back then we were clearly not ready for 
fine-grained accurate statistics + trees (cpus are a lot faster at more 
complex arithmetic today, plus people still believed that low-res can 
be done well enough), and taking out any of these two concepts from CFS 
would result in a similarly complex runqueue implementation. Also, the 
array switch was just thought to be of another piece of 'if the 
heuristics go wrong, we fall back to an array switch' logic, right in 
line with the other heuristics. And you have to accept it, mainline's 
ability to auto-renice make -j jobs (and other CPU hogs) was quite a 
plus for developers, so it had (and probably still has) quite some 
inertia.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
* William Lee Irwin III [EMAIL PROTECTED] wrote:
 The additive nice_offset breaks nice levels. A multiplicative priority 
 weighting of a different, nonnegative metric of cpu utilization from 
 what's now used is required for nice levels to work. I've been trying 
 to point this out politely by strongly suggesting testing whether nice 
 levels work.

On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 granted, CFS's nice code is still incomplete, but you err quite 
 significantly with this extreme statement that they are broken.

I used the word relatively loosely. Nothing extreme is going on.
Maybe the phrasing exaggerated the force of the opinion. I'm sorry
about having misspoke so.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 nice levels certainly work to a fair degree even in the current code and 
 much of the focus is elsewhere - just try it. (In fact i claim that 
 CFS's nice levels often work _better_ than the mainline scheduler's nice 
 level support, for the testcases that matter to users.)

Al Boldi's testcase appears to reveal some issues. I'm plotting a
testcase of my own if I can ever get past responding to email.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 The precise behavior of nice levels, as i pointed it out in previous 
 mails, is largely 'uninteresting' and it has changed multiple times in 
 the past 10 years.

I expect that whether a scheduler can handle such prioritization has a
rather strong predictive quality regarding whether it can handle, say,
CKRM controls. I remain convinced that there should be some target
behavior and that some attempt should be made to achieve it. I don't
think any particular behavior is best, just that the behavior should
be well-defined.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 What matters to users is mainly: whether X reniced to -10 does get 
 enough CPU time and whether stuff reniced to +19 doesn't take away too 
 much CPU time from the rest of the system. _How_ a Linux scheduler 
 achieves this is an internal matter and certainly CFS does it in a hacky 
 way at the moment.

It's not so far out. Basically just changing the key calculation in a
relatively simple manner should get things into relatively good shape.
It can, of course, be done other ways (I did it a rather different way
in vdls, though that method is not likely to be considered desirable).

I can't really write a testcase for such loose semantics, so the above
description is useless to me. These squishy sorts of definitions of
semantics are also uninformative to users, who, I would argue, do have
some interest in what nice levels mean. There have been at least a small
number of concerns about the strength of nice levels, and it would
reveal issues surrounding that area earlier if there were an objective
one could test to see if it were achieved.

It's furthermore a user-visible change in system call semantics we
should be more careful about changing out from beneath users.

So I see a lot of good reasons to pin down nice numbers. Incompleteness
is not a particularly mortal sin, but the proliferation of competing
schedulers is creating a need for standards, and that's what I'm really
on about.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
 could come up with is just a fancy academic technicality that has no 
 real significance to any of the testers who are trying CFS right now.

I could say "percent cpu" if it sounds less like formal jargon, which
"CPU bandwidth utilization" isn't really.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 Sure we prefer final solutions that are clean and make sense (because 
 such things are the easiest to maintain long-term), and often such final 
 solutions are quite close to academic concepts, and i think Davide 
 correctly observed this by saying that the final key calculation code 
 will end up quite different from what it looks now, but your 
 extreme-end claim of 'breakage' for something that is just plain 
 incomplete is not really a fair characterisation at this point.

It wasn't meant to be quite as strong a statement as it came out.
Sorry about that.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
 Anyone who thinks that there exists only two kinds of code: 100% correct 
 and 100% incorrect with no shades of grey inbetween is in reality a sort 
 of an extremist: whom, depending on mood and affection, we could call 
 either a 'coding purist' or a 'coding taliban' ;-)

I've made no such claims. Also rest assured that the tone of the
critique is not hostile, and wasn't meant to sound that way.

Also, given the general comments it appears clear that some statistical
metric of deviation from the intended behavior furthermore qualified by
timescale is necessary, so this appears to be headed toward a sort of
performance metric as opposed to a pass/fail test anyway. 

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* William Lee Irwin III [EMAIL PROTECTED] wrote:

 [...] Also rest assured that the tone of the critique is not hostile, 
 and wasn't meant to sound that way.

ok :) (And i guess i was too touchy - sorry about coming out swinging.)

 Also, given the general comments it appears clear that some 
 statistical metric of deviation from the intended behavior furthermore 
 qualified by timescale is necessary, so this appears to be headed 
 toward a sort of performance metric as opposed to a pass/fail test 
 anyway. However, to even measure this at all, some statement of 
 intention is required. I'd prefer that there be a Linux-standard 
 semantics for nice so results are more directly comparable and so that 
 users also get similar nice behavior from the scheduler as it varies 
 over time and possibly implementations if users should care to switch 
 them out with some scheduler patch or other.

yeah. If you could come up with a sane definition that also translates 
into low overhead on the algorithm side that would be great! The only 
good generic definition i could come up with (nice levels are isolated 
buckets with a constant maximum relative percentage of CPU time 
available to every active bucket) resulted in having a per-nice-level 
array of rbtree roots, which did not look worth the hassle at first 
sight :-)
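
For what it's worth, a schematic sketch of the data structure that
definition implies (illustrative only, not CFS code; struct rb_root here
is just a stand-in for the kernel's rbtree root): one time-ordered tree
per nice bucket, each active bucket entitled to a fixed share of the CPU.

    #define NICE_LEVELS 40                   /* -20 .. +19 */

    struct rb_root { void *rb_node; };       /* placeholder for the kernel type */

    struct nice_bucket {
        struct rb_root tasks_timeline;       /* time-ordered runnable tasks */
        unsigned int   weight;               /* this bucket's relative share */
        unsigned int   nr_running;           /* active tasks in the bucket */
    };

    struct cfs_like_rq {
        struct nice_bucket bucket[NICE_LEVELS];
        unsigned int       total_weight;     /* sum over active buckets */
    };

Picking the next task then means choosing a bucket in proportion to
weight/total_weight and taking the leftmost entry of its tree, which is
exactly the extra bookkeeping that didn't look worth the hassle.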

until now the main approach for nice levels in Linux was always: 
implement your main scheduling logic for nice 0 and then look for some 
low-overhead method that can be glued to it that does something that 
behaves like nice levels. Feel free to turn that around into a more 
natural approach, but the algorithm should remain fairly simple i think.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote:
  
  as usual, any sort of feedback, bugreports, fixes and suggestions are
  more than welcome,
 
 Pushed this through the test.kernel.org and nothing new blew up.
 Notably the kernbench figures are within expectations even on the bigger
 numa systems, commonly badly affected by balancing problems in the
 schedular.
 
 I see there is a second one out, I'll push that one through too.

Well I just sent some feedback on cfs-v2, but realised it went off-list,
so I'll resend here because others may find it interesting too. Sorry
about jamming it in here, but it is relevant to performance...

Anyway, roughly in the context of good cfs-v2 interactivity, I wrote:

Well I'm not too surprised. I am disappointed that it uses such small
timeslices (or whatever they are called) as the default.

Using small timeslices is actually a pretty easy way to ensure everything
stays smooth even under load, but is bad for efficiency. Sure you can say
you'll have desktop and server tunings, but... With nicksched I'm testing
a default timeslice of *300ms* even on the desktop, whereas Ingo's seems
to be effectively 3ms :P So if you compare default tunings, it isn't
exactly fair!

Kbuild times on a 2x Xeon:

2.6.21-rc7
508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU

2.6.21-rc7-cfs-v2
534.80user 30.92system 2:23.64elapsed 393%CPU
534.75user 31.01system 2:23.70elapsed 393%CPU
534.66user 31.07system 2:23.76elapsed 393%CPU
534.56user 30.91system 2:23.76elapsed 393%CPU
534.66user 31.07system 2:23.67elapsed 393%CPU
535.43user 30.62system 2:23.72elapsed 393%CPU

2.6.21-rc7-nicksched
505.60user 32.31system 2:17.91elapsed 390%CPU
506.55user 32.42system 2:17.66elapsed 391%CPU
506.41user 32.30system 2:17.85elapsed 390%CPU
506.48user 32.36system 2:17.77elapsed 391%CPU
506.10user 32.40system 2:17.81elapsed 390%CPU
506.69user 32.16system 2:17.78elapsed 391%CPU

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Peter Williams [EMAIL PROTECTED] wrote:

 There's a lot of ugly code in the load balancer that is only there to 
 overcome the side effects of SMT and dual core.  A lot of it was put 
 there by Intel employees trying to make load balancing more friendly 
 to their systems.  What I'm suggesting is that an N CPUs per runqueue 
 is a better way of achieving that end.  I may (of course) be wrong but 
 I think that the idea deserves more consideration than you're willing 
 to give it.

i actually implemented that some time ago and i'm afraid it was ugly as 
hell and pretty fragile. Load-balancing gets simpler, but task picking 
gets a lot uglier.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Nick Piggin [EMAIL PROTECTED] wrote:

   Maybe the progress is that more key people are becoming open to 
   the idea of changing the scheduler.
  
  Could be.  All was quiet for quite a while, but when RSDL showed up, 
  it aroused enough interest to show that scheduling woes is on folks 
  radar.
 
 Well I know people have had woes with the scheduler for ever (I guess 
 that isn't going to change :P). [...]

yes, that part isn't going to change, because the CPU is a _scarce 
resource_ that is perhaps the most frequently overcommitted physical 
computer resource in existence, and because the kernel does not (yet) 
track eye movements of humans to figure out which tasks are more 
important to them. So critical human constraints are unknown to the 
scheduler and thus complaints will always come.

The upstream scheduler thought it had enough information: the sleep 
average. So now the attempt is to go back and _simplify_ the scheduler 
and remove that information, and concentrate on getting fairness 
precisely right. The magic thing about 'fairness' is that it's a pretty 
good default policy if we decide that we simply have not enough 
information to do an intelligent choice.

( Lets be cautious though: the jury is still out whether people actually 
  like this more than the current approach. While CFS feedback looks 
  promising after a whopping 3 days of it being released [ ;-) ], the 
  test coverage of all 'fairness centric' schedulers, even considering 
  years of availability is less than 1% i'm afraid, and that  1% was 
  mostly self-selecting. )

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
* William Lee Irwin III [EMAIL PROTECTED] wrote:
 Also, given the general comments it appears clear that some 
 statistical metric of deviation from the intended behavior furthermore 
 qualified by timescale is necessary, so this appears to be headed 
 toward a sort of performance metric as opposed to a pass/fail test 
[...]

On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
 yeah. If you could come up with a sane definition that also translates 
 into low overhead on the algorithm side that would be great! The only 
 good generic definition i could come up with (nice levels are isolated 
 buckets with a constant maximum relative percentage of CPU time 
 available to every active bucket) resulted in having a per-nice-level 
 array of rbtree roots, which did not look worth the hassle at first 
 sight :-)

Interesting! That's what vdls did, except its fundamental data structure
was more like a circular buffer data structure (resembling Davide
Libenzi's timer ring in concept, but with all the details different).
I'm not entirely sure how that would've turned out performancewise if
I'd done any tuning at all. I was mostly interested in doing something
like what I heard Bob Mullens did in 1976 for basic pedagogical value
about schedulers to prepare for writing patches for gang scheduling as
opposed to creating a viable replacement for the mainline scheduler.

I'm relatively certain a different key calculation will suffice, but
it may disturb other desired semantics since they really need to be
nonnegative for multiplying by a scaling factor corresponding to its
nice number to work properly. Well, as the cfs code now stands, it
would correspond to negative keys. Dividing positive keys by the nice
scaling factor is my first thought of how to extend the method to the
current key semantics. Or such are my thoughts on the subject.

I expect that all that's needed is to fiddle with those numbers a bit.
There's quite some capacity for expression there given the precision.
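
A toy illustration of that multiplicative idea (my reading of it, with an
invented weight curve; neither CFS nor vdls code): scale each task's
nonnegative usage key down by a per-nice weight, so higher-priority tasks
accumulate virtual usage more slowly and sort earlier in the tree.

    #include <stdio.h>

    /* invented weight curve: ~1.25x per nice level, nice 0 == 1024 */
    static unsigned long weight_of(int nice)
    {
        double w = 1024.0;
        int    n = nice < 0 ? -nice : nice;

        while (n--)
            w = nice < 0 ? w * 1.25 : w / 1.25;
        return (unsigned long)w;
    }

    /* key used for tree ordering: raw CPU time divided by the weight */
    static unsigned long long scaled_key(unsigned long long cpu_ns, int nice)
    {
        return cpu_ns * 1024ULL / weight_of(nice);
    }

    int main(void)
    {
        /* two tasks that each ran 100ms: the nice -10 task ends up with a
         * much smaller key, i.e. it gets scheduled again much sooner */
        printf("nice   0: key %llu\n", scaled_key(100000000ULL, 0));
        printf("nice -10: key %llu\n", scaled_key(100000000ULL, -10));
        return 0;
    }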


On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
 until now the main approach for nice levels in Linux was always: 
 implement your main scheduling logic for nice 0 and then look for some 
 low-overhead method that can be glued to it that does something that 
 behaves like nice levels. Feel free to turn that around into a more 
 natural approach, but the algorithm should remain fairly simple i think.

Part of my insistence was because it seemed to be relatively close to a
one-liner, though I'm not entirely sure what particular computation to
use to handle the signedness of the keys. I guess I could pick some
particular nice semantics myself and then sweep the extant schedulers
to use them after getting a testcase hammered out.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* Nick Piggin [EMAIL PROTECTED] wrote:

 2.6.21-rc7-cfs-v2
 534.80user 30.92system 2:23.64elapsed 393%CPU
 534.75user 31.01system 2:23.70elapsed 393%CPU
 534.66user 31.07system 2:23.76elapsed 393%CPU
 534.56user 30.91system 2:23.76elapsed 393%CPU
 534.66user 31.07system 2:23.67elapsed 393%CPU
 535.43user 30.62system 2:23.72elapsed 393%CPU

Thanks for testing this! Could you please try this also with:

   echo 1 > /proc/sys/kernel/sched_granularity

on the same system, so that we can get a complete set of numbers? Just 
to make sure that lowering the preemption frequency indeed has the 
expected result of moving kernbench numbers back to mainline levels. (if 
not then that would indicate some CFS buglet)

could you maybe even try a more extreme setting of:

   echo 5 > /proc/sys/kernel/sched_granularity

for kicks? This would allow us to see how much kernbench we lose due to 
preemption granularity. Thanks!

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Ingo Molnar

* William Lee Irwin III [EMAIL PROTECTED] wrote:

 On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:

  until now the main approach for nice levels in Linux was always: 
  implement your main scheduling logic for nice 0 and then look for 
  some low-overhead method that can be glued to it that does something 
  that behaves like nice levels. Feel free to turn that around into a 
  more natural approach, but the algorithm should remain fairly simple 
  i think.
 
 Part of my insistence was because it seemed to be relatively close to 
 a one-liner, though I'm not entirely sure what particular computation 
 to use to handle the signedness of the keys. I guess I could pick some 
 particular nice semantics myself and then sweep the extant schedulers 
 to use them after getting a testcase hammered out.

i'd love to have a oneliner solution :-)

wrt. signedness: note that in v2 i have made rq_running signed, and most 
calculations (especially those related to nice) are signed values. (On 
64-bit systems this all isn't a big issue - most of the arithmetic 
gymnastics in CFS are done to keep deltas within 32 bits, so that 
divisions and multiplications are sane.)
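
The 32-bit point can be made concrete with a small illustration (not CFS
code, just an assumed example): once a delta is known to fit in 32 bits,
multiplying it by a 32-bit weight can never overflow the 64-bit
intermediate, so the scaling multiplies and divides stay cheap and exact.

    #include <stdint.h>

    /* clamp a 64-bit nanosecond delta into 32 bits before scaling it */
    static uint32_t clamp_delta(uint64_t delta_ns)
    {
        return delta_ns > 0xFFFFFFFFULL ? 0xFFFFFFFFU : (uint32_t)delta_ns;
    }

    /* 32-bit delta * 32-bit weight always fits in the 64-bit product */
    static uint64_t scale_delta(uint64_t delta_ns, uint32_t weight)
    {
        return (uint64_t)clamp_delta(delta_ns) * weight / 1024;
    }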

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
William Lee Irwin III wrote:
 Comments on which directions you'd like this to go in these respects
 would be appreciated, as I regard you as the current project owner.

On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
 I'd do scan through LKML from about 18 months ago looking for mention of 
 runtime configurable version of plugsched.  Some students at a 
 university (in Germany, I think) posted some patches adding this feature 
 to plugsched around about then.

Excellent. I'll go hunting for that.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
 I never added them to plugsched proper as I knew (from previous 
 experience when the company I worked for posted patches with similar 
 functionality) that Linus would like this idea less than he did the 
 current plugsched mechanism.

Odd how the requirements ended up including that. Fickleness abounds.
If only we knew up-front what the end would be.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
 Unfortunately, my own cache of the relevant e-mails got overwritten 
 during a Fedora Core upgrade (I've since moved /var onto a separate 
 drive to avoid a repetition) or I would dig them out and send them to 
 you.  I'd provided them with copies of the company's patches to use as a 
 guide to how to overcome the problems associated with changing 
 schedulers on a running system (a few non trivial locking issues pop up).
 Maybe if one of the students still reads LKML he will provide a pointer.

I was tempted to restart from scratch given Ingo's comments, but I
reconsidered and I'll be working with your code (and the German
students' as well). If everything has to change, so be it, but it'll
still be a derived work. It would be ignoring precedent and failure to
properly attribute if I did otherwise.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Nick Piggin
On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
 
 * Nick Piggin [EMAIL PROTECTED] wrote:
 
  2.6.21-rc7-cfs-v2
  534.80user 30.92system 2:23.64elapsed 393%CPU
  534.75user 31.01system 2:23.70elapsed 393%CPU
  534.66user 31.07system 2:23.76elapsed 393%CPU
  534.56user 30.91system 2:23.76elapsed 393%CPU
  534.66user 31.07system 2:23.67elapsed 393%CPU
  535.43user 30.62system 2:23.72elapsed 393%CPU
 
 Thanks for testing this! Could you please try this also with:
 
    echo 1 > /proc/sys/kernel/sched_granularity
 
 on the same system, so that we can get a complete set of numbers? Just 
 to make sure that lowering the preemption frequency indeed has the 
 expected result of moving kernbench numbers back to mainline levels. (if 
 not then that would indicate some CFS buglet)
 
 could you maybe even try a more extreme setting of:
 
    echo 5 > /proc/sys/kernel/sched_granularity
 
 for kicks? This would allow us to see how much kernbench we lose due to 
 preemption granularity. Thanks!

Yeah but I just powered down the test-box, so I'll have to get onto
that tomorrow.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread James Bruce

Chris Friesen wrote:

William Lee Irwin III wrote:


The sorts of explicit decisions I'd like to be made for these are:
(1) In a mixture of tasks with varying nice numbers, a given nice number
corresponds to some share of CPU bandwidth. Implementations
should not have the freedom to change this arbitrarily according
to some intention.


The first question that comes to my mind is whether nice levels should 
be linear or not.  I would lean towards nonlinear as it allows a wider 
range (although of course at the expense of precision).  Maybe something 
like each nice level gives X times the cpu of the previous?  I think a 
value of X somewhere between 1.15 and 1.25 might be reasonable.


Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589

That value has the property that a nice=10 task gets 1/10th the cpu of a 
nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that 
would be fairly easy to explain to admins and users so that they can 
know what to expect from nicing tasks.
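
A quick sketch of what that multiplier works out to (illustrative only;
the range shown just brackets the examples above):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double base = exp(log(10.0) / 10.0);    /* ~1.2589 per nice level */

        /* relative CPU share versus a nice=0 task */
        for (int nice = -20; nice <= 20; nice += 10)
            printf("nice %+3d -> %8.2fx\n", nice, pow(base, -nice));
        return 0;
    }

which prints 100x, 10x, 1x, 0.1x and 0.01x -- the 1/10-per-10-levels rule
of thumb.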


What about also having something that looks at latency, and how latency 
changes with niceness?


I think this would be a lot harder to pin down, since it's a function of 
all the other tasks running and their nice levels.  Do you have any of 
the RT-derived analysis models in mind?


What about specifying the timeframe over which the cpu bandwidth is 
measured?  I currently have a system where the application designers 
would like it to be totally fair over a period of 1 second.  As you can 
imagine, mainline doesn't do very well in this case.


It might be easier to specify the maximum deviation from the ideal 
bandwidth over a certain period.  I.e. something like over a period of 
one second, each task receives within 10% of the expected bandwidth.
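
A hedged sketch of that metric (invented names and numbers, purely to pin
the idea down): given how much CPU each task actually got over one window
and what its entitled share was, report the worst relative deviation; a
scheduler would pass if this stays under the agreed bound, e.g. 10% over
one second.

    #include <stdio.h>

    /* worst relative deviation from each task's entitled share over one
     * measurement window; all names and values here are illustrative */
    static double worst_deviation(const double *got_ms, const double *share,
                                  int ntasks, double window_ms)
    {
        double worst = 0.0;

        for (int i = 0; i < ntasks; i++) {
            double expected = share[i] * window_ms;
            double dev = (got_ms[i] - expected) / expected;

            if (dev < 0)
                dev = -dev;
            if (dev > worst)
                worst = dev;
        }
        return worst;
    }

    int main(void)
    {
        /* three equal tasks over a 1000ms window (made-up measurements) */
        double got[]   = { 320.0, 340.0, 330.0 };
        double share[] = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };

        printf("worst deviation: %.1f%%\n",
               100.0 * worst_deviation(got, share, 3, 1000.0));
        return 0;
    }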


 - Jim Bruce

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

William Lee Irwin III wrote:

William Lee Irwin III wrote:

Comments on which directions you'd like this to go in these respects
would be appreciated, as I regard you as the current project owner.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
I'd do scan through LKML from about 18 months ago looking for mention of 
runtime configurable version of plugsched.  Some students at a 
university (in Germany, I think) posted some patches adding this feature 
to plugsched around about then.


Excellent. I'll go hunting for that.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
I never added them to plugsched proper as I knew (from previous 
experience when the company I worked for posted patches with similar 
functionality) that Linus would like this idea less than he did the 
current plugsched mechanism.


Odd how the requirements ended up including that. Fickleness abounds.
If only we knew up-front what the end would be.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
Unfortunately, my own cache of the relevant e-mails got overwritten 
during a Fedora Core upgrade (I've since moved /var onto a separate 
drive to avoid a repetition) or I would dig them out and send them to 
you.  I'd provided them with copies of the company's patches to use as a 
guide to how to overcome the problems associated with changing 
schedulers on a running system (a few non trivial locking issues pop up).

Maybe if one of the students still reads LKML he will provide a pointer.


I was tempted to restart from scratch given Ingo's comments, but I
reconsidered and I'll be working with your code (and the German
students' as well). If everything has to change, so be it, but it'll
still be a derived work. It would be ignoring precedent and failure to
properly attribute if I did otherwise.


I can give you a patch (or set of patches) against the latest git 
vanilla kernel version if that would help.  There have been changes to 
the vanilla scheduler code since 2.6.20 so the latest patch on 
sourceforge won't apply cleanly.  I've found that implementing this as a 
series of patches rather than one big patch makes it easier for me to 
cope with changes to the underlying code.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Chris Friesen

Peter Williams wrote:

Chris Friesen wrote:
Scuse me if I jump in here, but doesn't the load balancer need some 
way to figure out a) when to run, and b) which tasks to pull and where 
to push them?



Yes but both of these are independent of the scheduler discipline in force.


It is not clear to me that this is always the case, especially once you 
mix in things like resource groups.



If
the load balancer manages to keep the weighted (according to static 
priority) load and distribution of priorities within the loads on the 
CPUs roughly equal and the scheduler does a good job of ensuring 
fairness, interactive responsiveness etc. for the tasks within a CPU 
then the result will be good system performance within the constraints 
set by the sys admin's use of real-time priorities and nice.


Suppose I have a really high priority task running.  Another very high 
priority task wakes up and would normally preempt the first one. 
However, there happens to be another cpu available.  It seems like it 
would be a win if we moved one of those tasks to the available cpu 
immediately so they can both run simultaneously.  This would seem to 
require some communication between the scheduler and the load balancer.


Certainly the above design could introduce a lot of context switching. 
But if my goal is a scheduler that minimizes latency (even at the cost 
of throughput) then that's an acceptable price to pay.


Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Nick Piggin wrote:

On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:

Nick Piggin wrote:
Other hints that it was a bad idea was the need to transfer time slices 
between children and parents during fork() and exit().

I don't see how that has anything to do with dual arrays.
It's totally to do with the dual arrays.  The only real purpose of the 
time slice in O(1) (regardless of what its perceived purpose was) was to 
control the switching between the arrays.


The O(1) design is pretty convoluted in that regard. In my scheduler,
the only purpose of the arrays is to renew time slices.

The fork/exit logic is added to make interactivity better. Ingo's
scheduler has similar equivalent logic.



If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependant forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.
This is an artefact of trying to control nice using time slices while 
using them for controlling array switching and whatever else they were 
being used for.  Priority (static and dynamic) is the best way to 
implement nice.


I don't like the timeslice based nice in mainline. It's too nasty
with latencies. nicksched is far better in that regard IMO.

But I don't know how you can assert a particular way is the best way
to do something.


I should have added "I may be wrong but I think that ..." 

My opinion is based on a lot of experience with different types of 
scheduler design and the observation from gathering scheduling 
statistics while playing with these schedulers that the size of the time 
slices we're talking about is much larger than the CPU chunks most tasks 
use in any one go so time slice size has no real effect on most tasks 
and the faster CPUs become the more this becomes true.






You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.

Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?
No.  It would be nice if the basic ideas that each scheduler tries to 
implement could be extracted and explained though.  This could lead to a 
melding of ideas that leads to something quite good.





I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...
I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.

I agree starvation or unfairness is unacceptable for a new scheduler.



For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?
In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)

O(logN) vs O(1) is technical grounds.
In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
greater than O(logN)'s k factor multiplied by logMaxN.


Yes, or even significantly greater around typical large sizes of N.


Yes.  In fact it's probably better to use the maximum number of threads 
allowed on the system for N.  We know that value, don't we?
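
As a worked example with invented constants: if the O(1) pick costs
k1 = 500ns and the O(logN) pick costs k2 = 50ns per tree level, then with
a system-wide maximum of, say, 2^16 threads the O(logN) worst case is
50ns * 16 = 800ns and the O(1) scheduler still wins; with k1 = 1000ns the
comparison would go the other way.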


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Peter Williams

Ingo Molnar wrote:

* Nick Piggin [EMAIL PROTECTED] wrote:

Maybe the progress is that more key people are becoming open to 
the idea of changing the scheduler.
Could be.  All was quiet for quite a while, but when RSDL showed up, 
it aroused enough interest to show that scheduling woes is on folks 
radar.
Well I know people have had woes with the scheduler for ever (I guess 
that isn't going to change :P). [...]


yes, that part isn't going to change, because the CPU is a _scarce 
resource_ that is perhaps the most frequently overcommitted physical 
computer resource in existence, and because the kernel does not (yet) 
track eye movements of humans to figure out which tasks are more 
important to them. So critical human constraints are unknown to the 
scheduler and thus complaints will always come.


The upstream scheduler thought it had enough information: the sleep 
average. So now the attempt is to go back and _simplify_ the scheduler 
and remove that information, and concentrate on getting fairness 
precisely right. The magic thing about 'fairness' is that it's a pretty 
good default policy if we decide that we simply have not enough 
information to do an intelligent choice.


( Lets be cautious though: the jury is still out whether people actually 
  like this more than the current approach. While CFS feedback looks 
  promising after a whopping 3 days of it being released [ ;-) ], the 
  test coverage of all 'fairness centric' schedulers, even considering 
  years of availability is less than 1% i'm afraid, and that <1% was 
  mostly self-selecting. )


At this point I'd like to make the observation that spa_ebs is a very 
fair scheduler if you consider nice to be an indication of the 
relative entitlement of tasks to CPU bandwidth.


It works by mapping nice to shares using a function very similar to the 
one for calculating p->load_weight except it's not offset by the RT 
priorities as RT is handled separately.  In theory, a runnable task's 
entitlement to CPU bandwidth at any time is the ratio of its shares to 
the total shares held by runnable tasks on the same CPU (in reality, a 
smoothed average of this sum is used to make scheduling smoother).  The 
dynamic priorities of the runnable tasks are then fiddled to try to keep 
each task's CPU bandwidth usage in proportion to its entitlement.


That's the theory anyway.

The actual implementation looks a bit different due to efficiency 
considerations.  The modifications to the above theory boil down to 
keeping a running measure of the (recent) highest CPU bandwidth use per 
share for tasks running on the CPU -- I call this the yardstick for this 
CPU.  When it's time to put a task on the run queue its dynamic 
priority is determined by comparing its CPU bandwidth per share value 
with the yardstick for its CPU.  If it's greater than the yardstick this 
value becomes the new yardstick and the task gets given the lowest 
possible dynamic priority (for its scheduling class).  If the value is 
zero it gets the highest possible priority (for its scheduling class) 
which would be MAX_RT_PRIO for a SCHED_OTHER task.  Otherwise it gets 
given a priority between these two extremes proportional to the ratio of its 
CPU bandwidth per share value and the yardstick.  Quite simple really.


The other way in which the code deviates from the original is that (for 
a few years now) I no longer calculate CPU bandwidth usage directly. 
I've found that the overhead is less if I keep a running average of the 
size of a tasks CPU bursts and the length of its scheduling cycle (i.e. 
from on CPU one time to on CPU next time) and using the ratio of these 
values as a measure of bandwidth usage.
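
In rough code terms (a sketch of the idea only, with an assumed 3/4 decay 
factor rather than whatever the real code uses):

struct task_usage {
        unsigned long long avg_burst;   /* decaying average on-CPU burst, in ns */
        unsigned long long avg_cycle;   /* decaying average scheduling cycle, in ns */
};

static unsigned long long decay_avg(unsigned long long avg, unsigned long long sample)
{
        return (3 * avg + sample) / 4;          /* simple exponential decay */
}

static void update_usage(struct task_usage *tu,
                         unsigned long long burst_ns, unsigned long long cycle_ns)
{
        tu->avg_burst = decay_avg(tu->avg_burst, burst_ns);
        tu->avg_cycle = decay_avg(tu->avg_cycle, cycle_ns);
}

/* bandwidth usage as a fraction scaled to 0..1024, i.e. avg_burst / avg_cycle */
static unsigned long usage_frac(const struct task_usage *tu)
{
        return tu->avg_cycle ? (unsigned long)((tu->avg_burst << 10) / tu->avg_cycle) : 0;
}

Dividing that by the task's shares gives the usage-per-share value that gets 
compared against the yardstick above.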


Anyway it works and gives very predictable allocations of CPU bandwidth 
based on nice.


Another good feature is that (in this pure form) it's starvation-free. 
However, if you fiddle with it and do things like giving bonus priority 
boosts to interactive tasks, it becomes susceptible to starvation.  This 
can be fixed by using an anti-starvation mechanism such as SPA's 
promotion scheme, and that's what spa_ebs does.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote:
 On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote:
  On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
   I'd prefer if we kept a single CPU scheduler in mainline, because I
   think that simplifies analysis and focuses testing.
  
  I think you'll find something like 80-90% of the testing will be done
  on the default choice, even if other choices exist. So you really
  won't have much of a problem here.
  
  But when the only choice for other schedulers is to go out-of-tree,
  then only 1% of the people will try it out and those people are
  guaranteed to be the ones who saw scheduling problems in mainline.
  So the alternative won't end up getting any testing on many of the
  workloads that work fine in mainstream so their feedback won't tell
  you very much at all.
 
 Yeah I concede that perhaps it is the only way to get things going
 any further. But how do we decide if and when the current scheduler
 should be demoted from default, and which should replace it?

Step one is ship both in -mm. If that doesn't give us enough
confidence, ship both in mainline. If that doesn't give us enough
confidence, wait until vendors ship both. Eventually a clear picture
should emerge. If it doesn't, either the change is not significant or
no one cares.

But it really is important to be able to do controlled experiments on
this stuff with little effort. That's the recipe for getting lots of
valid feedback.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
 Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
 That value has the property that a nice=10 task gets 1/10th the cpu of a 
 nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that 
 would be fairly easy to explain to admins and users so that they can 
 know what to expect from nicing tasks.
[...additional good commentary trimmed...]

Lots of good ideas here. I'll follow them.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
 On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
  On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
   All things are not equal; they all have different properties. I like
  
  On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
   Exactly. So we have to explore those properties and evaluate performance
   (in all meanings of the word). That's only logical.
  
  Any chance you'd be willing to put down a few thoughts on what sorts
  of standards you'd like to set for both correctness (i.e. the bare
  minimum a scheduler implementation must do to be considered valid
  beyond not oopsing) and performance metrics (i.e. things that produce
  numbers for each scheduler you can compare to say "this scheduler is
  better than this other scheduler at this.").
 
 Yeah I guess that's the hard part :)
 
 For correctness, I guess fairness is an easy one. I think that unfairness
 is basically a bug and that it would be very unfortunate to merge something
 unfair. But this is just within the context of a single runqueue... for
 better or worse, we allow some unfairness in multiprocessors for performance
 reasons of course.

I'm a big fan of fairness, but I think it's a bit early to declare it
a mandatory feature. Bounded unfairness is probably something we can
agree on, ie if we decide to be unfair, no process suffers more than
a factor of x.
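
Stated slightly more formally (my phrasing of the bound, nothing agreed yet): 
for any two always-runnable tasks A and B at the same nice level, and any 
sufficiently long window T,

    \frac{\mathrm{cputime}_T(A)}{\mathrm{cputime}_T(B)} \;\le\; x
    \qquad \text{for some fixed } x \ge 1 .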
 
 Latency. Given N tasks in the system, an arbitrary task should get
 onto the CPU in a bounded amount of time (excluding events like freak
 IRQ holdoffs and such, obviously -- ie. just considering the context
 of the scheduler's state machine).

This is a slightly stronger statement than starvation-free (which is
obviously mandatory). I think you're looking for something like
"worst-case scheduling latency is proportional to the number of
runnable tasks." Which I think is quite a reasonable requirement.
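
One way to pin that down (my reading, not something anyone has signed up to): 
with N runnable tasks and a maximum timeslice q, any runnable task should get 
the CPU within

    \mathrm{latency}_{\max} \;\le\; (N - 1) \cdot q .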

I'm pretty sure the stock scheduler falls short of both of these
guarantees though.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
 
 * William Lee Irwin III [EMAIL PROTECTED] wrote:
 
  [...] Also rest assured that the tone of the critique is not hostile, 
  and wasn't meant to sound that way.
 
 ok :) (And i guess i was too touchy - sorry about coming out swinging.)
 
  Also, given the general comments it appears clear that some 
  statistical metric of deviation from the intended behavior furthermore 
  qualified by timescale is necessary, so this appears to be headed 
  toward a sort of performance metric as opposed to a pass/fail test 
  anyway. However, to even measure this at all, some statement of 
  intention is required. I'd prefer that there be a Linux-standard 
  semantics for nice so results are more directly comparable and so that 
  users also get similar nice behavior from the scheduler as it varies 
  over time and possibly implementations if users should care to switch 
  them out with some scheduler patch or other.
 
 yeah. If you could come up with a sane definition that also translates 
 into low overhead on the algorithm side that would be great!

How's this:

If you're running two identical CPU hog tasks A and B differing only by nice
level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
constant f(Anice - Bnice).

Other definitions make things hard to analyze and probably not
well-bounded when confronted with > 2 tasks.

I -think- this implies keeping a separate scaled CPU usage counter,
where the scaling factor is a trivial exponential function of nice
level where f(0) == 1. Then you schedule based on this scaled usage
counter rather than unscaled.

I also suspect we want to keep the exponential base small so that the
maximal difference is 10x-100x.
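
In userspace-sketch form the bookkeeping would be something like this (the 
base of 1.1 is just a placeholder for whatever we settle on):

#include <math.h>

#define NICE_BASE       1.10    /* placeholder: 1.1 across 40 nice levels spans ~40x */

struct task {
        int nice;
        double scaled_usage;    /* the scheduling key */
};

/* f(0) == 1; each nice level multiplies the charge rate by NICE_BASE */
static double nice_scale(int nice)
{
        return pow(NICE_BASE, nice);
}

/* charge delta_ns of real CPU time to the task's scaled counter */
static void account_cpu(struct task *t, double delta_ns)
{
        t->scaled_usage += delta_ns * nice_scale(t->nice);
}

Always running the task with the smallest scaled_usage then drives the scaled 
counters together, so for two CPU hogs the real cputime ratio settles at 
NICE_BASE^(Bnice - Anice) -- a constant function of the nice difference, as 
required.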

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Davide Libenzi
On Tue, 17 Apr 2007, William Lee Irwin III wrote:

 On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
  Latency. Given N tasks in the system, an arbitrary task should get
  onto the CPU in a bounded amount of time (excluding events like freak
  IRQ holdoffs and such, obviously -- ie. just considering the context
  of the scheduler's state machine).
 
 ISTR Davide Libenzi having a scheduling latency test a number of years
 ago. Resurrecting that and tuning it to the needs of this kind of
 testing sounds relevant here. The test suite Peter Williams mentioned
 would also help.

That helped me a lot at that time. At every context switch it sampled 
critical scheduler parameters for both the entering and exiting task (and 
associated timestamps). Then the data was collected from userspace through 
a /dev/idontremember for analysis. It'd be very useful to have it these 
days, to study what really happens under the hood (scheduler-internal 
parameter variations and such) when those weird loads make the scheduler 
unstable.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
 yeah. If you could come up with a sane definition that also translates 
 into low overhead on the algorithm side that would be great!

On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote:
 How's this:
 If you're running two identical CPU hog tasks A and B differing only by nice
 level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
 constant f(Anice - Bnice).
 Other definitions make things hard to analyze and probably not
 well-bounded when confronted with > 2 tasks.
 I -think- this implies keeping a separate scaled CPU usage counter,
 where the scaling factor is a trivial exponential function of nice
 level where f(0) == 1. Then you schedule based on this scaled usage
 counter rather than unscaled.
 I also suspect we want to keep the exponential base small so that the
 maximal difference is 10x-100x.

I'm already working with this as my assumed nice semantics (actually
something with a specific exponential base, suggested in other emails)
until others start saying they want something different and agree.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Matt Mackall
On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
 On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
  yeah. If you could come up with a sane definition that also translates 
  into low overhead on the algorithm side that would be great!
 
 On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote:
  How's this:
  If you're running two identical CPU hog tasks A and B differing only by nice
  level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
  constant f(Anice - Bnice).
  Other definitions make things hard to analyze and probably not
  well-bounded when confronted with > 2 tasks.
  I -think- this implies keeping a separate scaled CPU usage counter,
  where the scaling factor is a trivial exponential function of nice
  level where f(0) == 1. Then you schedule based on this scaled usage
  counter rather than unscaled.
  I also suspect we want to keep the exponential base small so that the
  maximal difference is 10x-100x.
 
 I'm already working with this as my assumed nice semantics (actually
 something with a specific exponential base, suggested in other emails)
 until others start saying they want something different and agree.

Good. This has a couple nice mathematical properties, including
bounded unfairness which I mentioned earlier. What base are you
looking at?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread William Lee Irwin III
On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
 I'm already working with this as my assumed nice semantics (actually
 something with a specific exponential base, suggested in other emails)
 until others start saying they want something different and agree.

On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote:
 Good. This has a couple nice mathematical properties, including
 bounded unfairness which I mentioned earlier. What base are you
 looking at?

I'm working with the following suggestion:


On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
 Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
 That value has the property that a nice=10 task gets 1/10th the cpu of a
 nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that
 would be fairly easy to explain to admins and users so that they can
 know what to expect from nicing tasks.


I'm not likely to write the testcase until this upcoming weekend, though.
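
As a quick sanity check of the ratios that base implies (a throwaway 
userspace sketch, not the testcase itself):

#include <math.h>
#include <stdio.h>

int main(void)
{
        const double base = exp(log(10.0) / 10.0);      /* ~1.2589 */
        int nice;

        /* relative CPU share of a nice=n hog versus a nice=0 hog */
        for (nice = -20; nice <= 20; nice += 10)
                printf("nice %+3d -> %8.3fx the CPU of nice 0\n",
                       nice, pow(base, -nice));
        return 0;
}

which prints 100x, 10x, 1x, 0.1x and 0.01x respectively -- the 
1/10th-per-10-nice-levels behaviour described above.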


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Michael K. Edwards

On 4/17/07, Peter Williams [EMAIL PROTECTED] wrote:

The other way in which the code deviates from the original is that (for
a few years now) I no longer calculate CPU bandwidth usage directly.
I've found that the overhead is less if I keep a running average of the
size of a task's CPU bursts and the length of its scheduling cycle (i.e.
from going on CPU one time to going on CPU the next time) and use the
ratio of these values as a measure of bandwidth usage.

Anyway it works and gives very predictable allocations of CPU bandwidth
based on nice.


Works, that is, right up until you add nonlinear interactions with CPU
speed scaling.  From my perspective as an embedded platform
integrator, clock/voltage scaling is the elephant in the scheduler's
living room.  Patch in DPM (now OpPoint?) to scale the clock based on
what task is being scheduled, and suddenly the dynamic priority
calculations go wild.  Nip this in the bud by putting an RT priority
on the relevant threads (which you have to do anyway if you need
remotely audio-grade latency), and the lock affinity heuristics break,
so you have to hand-tune all the thread priorities.  Blecch.

Not to mention the likelihood that the task whose clock speed you're
trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
than the application.  (You want to crank the CPU for this task
because it runs with the RF hot, which may cost you as much power as
the rest of the platform.)  You'd better hope you can remove it from
the dynamic priority heuristics with SCHED_BATCH.  Otherwise
everything _else_ has to be RT priority (or it'll be starved by the
soft MAC) and you've basically tossed SCHED_NORMAL in the bin.  Double
blecch!

Is it too much to ask for someone with actual engineering training
(not me, unfortunately) to sit down and build a negative-feedback
control system that handles soft-real-time _and_ dynamic-priority
_and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
scaling?  And actually separates the accounting and control mechanisms
from the heuristics, so the latter can be tuned (within a well
documented stable range) to reflect the expected system usage
patterns?

It's not like there isn't a vast literature in this area over the past
decade, including some dealing specifically with clock scaling
consistent with low-latency applications.  It's a pity that people
doing academic work in this area rarely wade into LKML, even when
they're hacking on a Linux fork.  But then, there's not much economic
incentive for them to do so, and they can usually get their fill of
citation politics and dominance games without leaving their home
department.  :-P

Seriously, though.  If you're really going to put the mainline
scheduler through this kind of churn, please please pretty please knit
in per-task clock scaling (possibly even rejigged during the slice;
see e.g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
linger mechanism to keep from taking context switch hits when you're
confident that an I/O will complete quickly.

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

