RE: Re: sched_yield proposals/rationale

2007-04-17 Thread Buytaert_Steven


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:linux-kernel-
> [EMAIL PROTECTED] On Behalf Of Bill Davidsen
> Sent: Tuesday, 17 April 2007 21:38
> To: linux-kernel@vger.kernel.org
> Cc: Buytaert, Steven; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
> Subject: Re: sched_yield proposals/rationale
> 
> Mark Lord wrote:
> >
> > Cool.  You *do know* that there is a brand new CPU scheduler
> > scheduled to replace the current one for the 2.6.22 Kernel, right?
> >
> Having tried both nicksched and Con's fair sched on some normal loads,
> as opposed to benchmarks, I sure hope Linus changes his mind about
> having several schedulers in the kernel. The "one perfect and
> self-adjusting scheduler" isn't here yet.

I have the same opinion, and I'm afraid it is still a long time out. People 
probably only skimmed my suggestion for a 'fix', let alone read my footer. Too 
bad the archives only go back to '95; I would love to retrieve my posts from 
'93. Does anybody still have them?

Now a bit more on topic:

1) My problem is/was solved by making the default time slice much smaller than 
the default of 100 on a 250 Hz system. But that only masks the problem.

2) I made a schedstat program, sort of like vmstat, that samples /proc/schedstat 
and prints deltas each second. Just printing the jiffies delta and the number of 
times schedule is called per CPU is already thought-provoking; a short example 
(one line per second):

2.6.16 vanilla scheduler

Jiffies |  CPU0 |  CPU1 |
 delta  |       |       |
250  | 1756 |  7301 |
252  | 1730 |  2638 |
254  | 1963 |  1663 |
385  |  868 |   658 | <--- stall starts
325  |  138 |   112 |
330  |  184 |   130 |
339  |  682 |   122 |
335  |  367 |   159 |
334  |  653 |   127 |
345  |  467 |   137 |
335  |  673 |   128 |
337  |  471 |   131 |
334  |  673 |   127 |
332  |  321 |   144 |
333  |  523 |   129 |
332  |   98 |   123 |
356  |  496 |   124 |
270  |   96 |87 |
277  | 5878 | 26228 | <-- yes 26K
252  |18263 | 19130 | ...
255  | 2024 |  5747 | <-- back to normal

Let's talk about fairness once we get the basics right. And yes, the real 
physical elapsed time per line IS 1 second, so the jiffies delta jumping up from 
the expected value of 250 to as much as 356 is not normal; it is in fact very, 
very bad. This box was unusable for 13 seconds in a row.
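
A minimal sampler along these lines (not the exact program that produced the 
numbers above; which /proc/schedstat column holds the schedule() count depends 
on the schedstat version, so adjust FIELD for your kernel) could look like:

/* schedstat-delta.c -- minimal vmstat-like sampler for /proc/schedstat.
 * FIELD is assumed to be the per-cpu schedule() counter column; adjust
 * it for the schedstat version of your kernel.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 64
#define FIELD    5              /* assumed column of the schedule() count */

static int sample(unsigned long *jiffies, unsigned long cnt[])
{
    char line[512];
    int ncpus = 0;
    FILE *f = fopen("/proc/schedstat", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "timestamp %lu", jiffies) == 1)
            continue;
        if (!strncmp(line, "cpu", 3) && ncpus < MAX_CPUS) {
            unsigned long v[8] = { 0 };
            int cpu;
            int n = sscanf(line, "cpu%d %lu %lu %lu %lu %lu %lu %lu", &cpu,
                           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
            if (n > FIELD + 1)
                cnt[ncpus++] = v[FIELD];
        }
    }
    fclose(f);
    return ncpus;
}

int main(void)
{
    unsigned long j0, j1, c0[MAX_CPUS], c1[MAX_CPUS];
    int i, n;

    if (sample(&j0, c0) < 0) {
        perror("/proc/schedstat");
        return 1;
    }
    for (;;) {
        sleep(1);
        n = sample(&j1, c1);
        printf("%6lu |", j1 - j0);                 /* jiffies delta */
        for (i = 0; i < n; i++)
            printf(" %6lu |", c1[i] - c0[i]);      /* schedule() calls per CPU */
        printf("\n");
        j0 = j1;
        memcpy(c0, c1, sizeof(c0));
    }
}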

But this doesn't look serious enough to most people, it seems. 

Steven Buytaert
--
Perfection is attained not when there is nothing left to add, but when there is 
nothing left to take away. (Antoine de Saint-Exupéry)


Re: sched_yield proposals/rationale

2007-04-17 Thread Bill Davidsen

Mark Lord wrote:
> [EMAIL PROTECTED] wrote:
> > > From: Bill Davidsen
> > >
> > > And having gotten same, are you going to code up what appears to be a
> > > solution, based on this feedback?
> > 
> > The feedback was helpful in verifying whether there are any arguments 
> > against my approach. The real proof is in the pudding.
> > 
> > I'm running a kernel with these changes, as we speak. Overall system 
> > throughput is about up 20%. With 'system throughput' I mean measured 
> > performance of a rather large (experimental) system. The patch isn't 
> > even 24h old... Also the application latency has improved.
> 
> Cool.  You *do know* that there is a brand new CPU scheduler
> scheduled to replace the current one for the 2.6.22 Kernel, right?

Having tried both nicksched and Con's fair sched on some normal loads, 
as opposed to benchmarks, I sure hope Linus changes his mind about 
having several schedulers in the kernel. The "one perfect and 
self-adjusting scheduler" isn't here yet.


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot






Re: sched_yield proposals/rationale

2007-04-13 Thread Mark Lord

[EMAIL PROTECTED] wrote:
> > From: Bill Davidsen
> >
> > And having gotten same, are you going to code up what appears to be a
> > solution, based on this feedback?
> 
> The feedback was helpful in verifying whether there are any arguments against 
> my approach. The real proof is in the pudding.
> 
> I'm running a kernel with these changes, as we speak. Overall system throughput 
> is about up 20%. With 'system throughput' I mean measured performance of a 
> rather large (experimental) system. The patch isn't even 24h old... Also the 
> application latency has improved.


Cool.  You *do know* that there is a brand new CPU scheduler
scheduled to replace the current one for the 2.6.22 Kernel, right?

-ml




RE: sched_yield proposals/rationale

2007-04-12 Thread Buytaert_Steven

> From: Bill Davidsen
>
> And having gotten same, are you going to code up what appears to be a
> solution, based on this feedback?

The feedback was helpful in verifying whether there are any arguments against 
my approach. The real proof is in the pudding.

I'm running a kernel with these changes as we speak. Overall system throughput 
is up about 20%. By 'system throughput' I mean the measured performance of a 
rather large (experimental) system. The patch isn't even 24h old... Application 
latency has also improved.

Additional settings: my patch is also running with a kernel modified to have 
only 8 default time slices at the 250 Hz setting. And no, the overall number of 
context switches per second hasn't blown up. The kernel was compiled with low 
latency and in-kernel preemption enabled, and BKL preemption enabled. I haven't 
checked the patch standalone yet.

> I'm curious how well it would run poorly written programs, having
> recently worked with a company which seemed to have a whole part of
> purchasing dedicated to buying same. :-(

So first signs are positive; note that it requires much more run time and a 
slew of other tests/scrutiny before we can be really sure.

W.r.t. the remarks: I am most interested in possible DoS attacks that could 
exploit this change in sched_yield. The comments from Andi were therefore 
interesting, but I haven't heard back from him yet. I'm still not sure how a 
task could extract more time slices from the system because of these changes.

A last remark on the O(1)'ness being violated: I think it's a moot point. 
sched_yield is executed on the CPU time of the yielder. Being O(1) matters most 
for the scheduler proper at each timer tick (interrupt); that being O(1) is 
crucial.

Steven Buytaert
--
Perfection is attained not when there is nothing left to add, but when there is 
nothing left to take away. (Antoine de Saint-Exupéry)


Re: sched_yield proposals/rationale

2007-04-12 Thread Bill Davidsen

[EMAIL PROTECTED] wrote:
> > -Original Message-
> > Besides - but I guess you're aware of it - any randomized
> > algorithms tend to drive benchmarkers and performance analysts
> > crazy because their performance cannot be repeated. So it's usually
> > better to avoid them unless there is really no alternative.
> 
> That could already solve your concern from above. Statistically 
> speaking, it will give them (benchmarkers) the smoothest curve they've 
> ever seen.
> 
> Please be aware that I'm just exploring options/insight here. It is 
> not something I intend to push inside the mainline kernel. I just want 
> to find reasonable and logic criticism as you and some others have 
> provided already. Thanks for that!

And having gotten same, are you going to code up what appears to be a 
solution, based on this feedback?


I'm curious how well it would run poorly written programs, having 
recently worked with a company which seemed to have a whole part of 
purchasing dedicated to buying same. :-(


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


Re: sched_yield proposals/rationale

2007-04-12 Thread William Lee Irwin III
On Thu, Apr 12, 2007 at 11:27:22PM +1000, Nick Piggin wrote:
> This one should be pretty rare (actually I think it is dead code in 
> practice, due to the way the page allocator works).
> Avoiding sched_yield is a really good idea outside realtime scheduling. 
> Since we have gone this far with the current semantics, I think it
> would be sad to back down now.
> It would be nice if you could pressure those other components to adapt :)

Outside of realtime scheduling there appear to be two desired behaviors:
(1) busywait: demote as aggressively as possible
(2) CPU burn: give other apps a chance to run but demote lightly at most

There is no way for the scheduler to distinguish which of the two
behaviors is desired. A fresh system call taking an argument to describe
which is the desired behavior is my recommended solution. Most unaware
apps should be able to be dealt with via LD_PRELOAD.
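
As a rough illustration of the LD_PRELOAD route (a sketch only; the 1 ms 
nanosleep is an arbitrary stand-in policy, not part of the proposal):

/* yield_shim.c -- build with: gcc -shared -fPIC -o yield_shim.so yield_shim.c
 * Run an unaware app with:    LD_PRELOAD=./yield_shim.so ./app
 * Replaces the app's sched_yield() calls with a short sleep.
 */
#define _POSIX_C_SOURCE 199309L
#include <time.h>

int sched_yield(void)
{
    struct timespec ts = { 0, 1000000 };   /* 1 ms -- arbitrary placeholder */

    nanosleep(&ts, NULL);
    return 0;
}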

Busywaiters able to be modified could be given more specific scheduling
primitives, in particular "directed yields," which donate timeslice and
possibly dynamic priority to their targets. They would look something
like:
int yield_to(pid_t);
int yield_to_futex(int *);
int yield_to_sem(int);
/* etc. */
as userspace library functions where yielding to a resource is intended
to donate timeslice to its owner or one of its owners, where those
owner(s) are to be determined by the kernel. Directed yields are a more
direct attack on the priority inversion one most desperately wants to
avoid in the case of sched_yield() -based busywaiters on a resource,
namely the resource owner falling behind the busywaiters in priority or
running out of timeslice. They furthermore reduce the competition for
CPU between resource owners and busywaiters on that resource.
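
A caller of these proposed primitives might then look roughly like this (a 
sketch; yield_to_futex() is only proposed here, not an existing syscall or 
libc function):

#include <sched.h>

extern int yield_to_futex(int *uaddr);     /* proposed above, does not exist yet */

static void lock_acquire(int *futex_word)
{
    while (__sync_lock_test_and_set(futex_word, 1)) {
        /* Lock is held: donate our timeslice to the owner instead of
         * yielding into the void. */
        if (yield_to_futex(futex_word) < 0)
            sched_yield();                 /* fallback to the old behaviour */
    }
}

static void lock_release(int *futex_word)
{
    __sync_lock_release(futex_word);
}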

A less direct alternative suggested by Andi Kleen is to have coprocess
groups and an alternative to sched_yield() that directs yielding toward
a member of the same coprocess group as the yielder, possibly using the
standard system call by making that the default behavior when a process
is a member of such a coprocess group.


-- wli


Re: sched_yield proposals/rationale

2007-04-12 Thread William Lee Irwin III
On Thu, Apr 12, 2007 at 03:31:31PM +0200, Andi Kleen wrote:
> The only way I could think of to make sched_yield work the way they 
> expect would be to define some way of gang scheduling and give
> sched_yield semantics that it preferably yields to other members
> of the gang.
> But it would be still hard to get these semantics (how to define
> the gangs) into your uncontrollable broken applications and also
> it has the risk of either unfairness or not full utilization of the 
> machine. Getting it to scale well on MP systems would be also likely 
> a challenge.

Gang scheduling isn't a squishy concept whose name can be arbitrarily
repurposed. Perhaps "group scheduling" or similar would be appropriate
if the standard gang scheduling semantics are not what you have in mind.

Standard gang scheduling would not be appropriate for applications that
don't know what they're doing. All threads of a gang falling asleep
when one sleeps (or more properly, the gang is considered either
runnable or unrunnable as a unit) is not to be taken lightly.

I'd call this something like a "directed yield with a group as a target,"
but I wouldn't actually try to do this. I'd try to provide ways for a
directed yield to donate remaining timeslice and dynamic priority if
possible to a particular task associated with a resource, for instance,
a futex or SysV semaphore owner. The priority inversion one desperately
wants to avoid is the resource owner running out of timeslice or
otherwise losing priority to where it falls behind busywaiters such as
callers of sched_yield for the purposes of multitier locking.


-- wli


RE: sched_yield proposals/rationale

2007-04-12 Thread Thomas Gleixner
On Thu, 2007-04-12 at 10:15 -0400, [EMAIL PROTECTED] wrote:
> > -Original Message-
> > > Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
> > > the 2.6.16 kernel yields 105 hits, note including comments... An
> > > interesting spot is e.g. fs/buffer.c free_more_memory()
> > 
> > A lot of those are probably broken in some way agreed.
> 
> OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)

100, simply because at least 95 of them are broken.

tglx




RE: sched_yield proposals/rationale

2007-04-12 Thread Buytaert_Steven

> -Original Message-
> > Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
> > the 2.6.16 kernel yields 105 hits, note including comments... An
> > interesting spot is e.g. fs/buffer.c free_more_memory()
> 
> A lot of those are probably broken in some way agreed.

OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)

> [ about giving a thread its full slice quota after yielding ]
> With a particular sleep pattern it could get more CPU time.

I'm not a Linux scheduler expert, but I don't understand this. Consider two 
extreme situations:

1) I have my full time slices and call yield, going to the expired list with, 
again, full time slices; this is IMHO a no-op, i.e. fair.

2) I have only a single slice left; NOT yielding would get me to the expired 
list via scheduler_tick at the next timer interrupt, and I would get my full 
time slice quota. YIELDING at that moment would give me the full time slice 
quota again (with my proposal). 

> 
> 3) Put the task in the expired list at a random position [...]
> > while (X--) { prev = current; current = current->next }
> 
> You would need to rename the scheduler to "sometimes O(1)" first @)

Again, agreed, but this tight loop would run completely in the code cache. We 
could call it O(1.1) ;-) With a Duff's device it would cost a few clock cycles 
per position.
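
For the curious, such an unrolled walk might look like this (illustrative only; 
struct node is a stand-in, and the caller must guarantee the list holds at 
least X further elements):

struct node { struct node *next; };

/* Advance 'cur' by exactly x links, 8 pointer chases per loop pass. */
static struct node *walk(struct node *cur, unsigned int x)
{
    unsigned int n = (x + 7) / 8;

    if (x == 0)
        return cur;
    switch (x % 8) {
    case 0: do { cur = cur->next;
    case 7:      cur = cur->next;
    case 6:      cur = cur->next;
    case 5:      cur = cur->next;
    case 4:      cur = cur->next;
    case 3:      cur = cur->next;
    case 2:      cur = cur->next;
    case 1:      cur = cur->next;
            } while (--n);
    }
    return cur;
}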

> Besides - but I guess you're aware of it - any randomized algorithms tend
> to drive benchmarkers and performance analysts crazy because their
> performance
> cannot be repeated. So it's usually better to avoid them unless there is
> really no alternative.

That could already solve your concern from above. Statistically speaking, it 
will give them (benchmarkers) the smoothest curve they've ever seen.

Please be aware that I'm just exploring options and looking for insight here. 
It is not something I intend to push into the mainline kernel. I just want to 
find reasonable and logical criticism, as you and some others have provided 
already. Thanks for that!

Steven Buytaert
--
Perfection is attained not when there is nothing left to add, but when there is 
nothing left to take away. (Antoine de Saint-Exupéry)


Re: sched_yield proposals/rationale

2007-04-12 Thread Andi Kleen
On Thu, Apr 12, 2007 at 09:05:25AM -0400, [EMAIL PROTECTED] wrote:
> > -Original Message-
> > From: Andi Kleen
> > [ ... about use of sched_yield ...]
> > On the other hand when they fix their code to not rely on sched_yield
> > but use [...]
> 
> Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
> the 2.6.16 kernel yields 105 hits, note including comments... An interesting 
> spot is e.g. fs/buffer.c free_more_memory()

A lot of those are probably broken in some way agreed.

> > 
> > > 2) When a task is eventually put in the expired list in sched_yield,
> > > give it back the full time slices round (as done in scheduler_tick), not 
> > > with the remaining slices as is done now?
> > 
> > That would likely be unfair and exploitable.
> 
> I don't understand; how more unfair would it be than passing via 
> scheduler_tick? Grabbing a resource with a single time slice left would be 
> more unfair towards other tasks IMHO when you get moved to the expired list 
> with the resource in still in your possession.

With a particular sleep pattern it could get more CPU time.

> > > 3) Put the task in the expired list at a random position, not at the end
> > > is done now?
> > 
> > Sounds like an interesting approach, but to do it in O(1) you would
> > need a new data structure with possibly much larger constant overhead.
> 
> Agreed, but not dramatic. Suppose you need to insert at position X, you would 
> do, on the linked list after proper setup:
> 
> while (X--) { prev = current; current = current->next }
> 
> You could have a small duffs device to reduce the X-- checking overhead.

You would need to rename the scheduler to "sometimes O(1)" first @)

Besides - but I guess you're aware of it - any randomized algorithms tend
to drive benchmarkers and performance analysts crazy because their performance
cannot be repeated. So it's usually better to avoid them unless there is 
really no alternative.

-Andi


Re: sched_yield proposals/rationale

2007-04-12 Thread Nick Piggin

[EMAIL PROTECTED] wrote:
> > -Original Message-
> > From: Andi Kleen
> > [ ... about use of sched_yield ...]
> > On the other hand when they fix their code to not rely on sched_yield
> > but use [...]
> 
> Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
> the 2.6.16 kernel yields 105 hits, note including comments...

Most of these (in core code, anyway) seem to use yield when they really don't
care about running for a while.

> An interesting spot is e.g. fs/buffer.c free_more_memory()


This one should be pretty rare (actually I think it is dead code in practice,
due to the way the page allocator works).

Avoiding sched_yield is a really good idea outside realtime scheduling. Since
we have gone this far with the current semantics, I think it would be sad to
back down now.

It would be nice if you could pressure those other components to adapt :)

--
SUSE Labs, Novell Inc.


RE: sched_yield proposals/rationale

2007-04-12 Thread Buytaert_Steven
> -Original Message-
> From: Andi Kleen
> [ ... about use of sched_yield ...]
> On the other hand when they fix their code to not rely on sched_yield
> but use [...]

Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
the 2.6.16 kernel yields 105 hits, note including comments... An interesting 
spot is e.g. fs/buffer.c free_more_memory()

> The only way I could think of to make sched_yield work the way they
> expect would be to define some way of gang scheduling and give
> sched_yield semantics that it preferably yields to other members
> of the gang. [...]

That would indeed make things too complicated.

> [about question 1]
> That would still not unbreak most applications I would suspect -- they
> will likely try to yield multiple times before using up a full time slice
> unless their event handlers are very compute intensive.

No, they only get one more go on the active queue, until they take the proper 
exit to the expired queue via scheduler_tick. And they are enqueued at the end 
of the active queue with whatever slices they have left, so it remains fair.
 
> In general other subsystems (e.g. VM) had had quite bad experiences
> with similar "one more try" hacks -- they tend to be not robust and
> actually penalize some loads.
> 
> > 2) When a task is eventually put in the expired list in sched_yield,
> > give it back the full time slices round (as done in scheduler_tick), not
> > with the remaining slices as is done now?
> 
> That would likely be unfair and exploitable.

I don't understand; how would it be more unfair than passing via 
scheduler_tick? Grabbing a resource with a single time slice left would be more 
unfair towards other tasks, IMHO, when you get moved to the expired list with 
the resource still in your possession.

> 
> > 3) Put the task in the expired list at a random position, not at the end
> > is done now?
> 
> Sounds like an interesting approach, but to do it in O(1) you would
> need a new data structure with possibly much larger constant overhead.

Agreed, but it's not dramatic. Suppose you need to insert at position X; on the 
linked list, after proper setup, you would do:

while (X--) { prev = current; current = current->next; }

You could use a small Duff's device to reduce the X-- checking overhead.

Thanks for the insight Andi!

Steven Buytaert
--
Perfection is attained not when there is nothing left to add, but when there is 
nothing left to take away. (Antoine de Saint-Exupéry)


Re: sched_yield proposals/rationale

2007-04-12 Thread Andi Kleen
[EMAIL PROTECTED] writes:

> Since the new 2.6.x O(1) scheduler I'm having latency problems. Probably due
> to excessive use of sched_yield in code in components I don't have control
> over. This 'problem'/behavioral change has been reported also by other
> applications (e.g. OpenLDAP, Gnome netmeeting, Postgress, e.google...)

On the other hand, when they fix their code not to rely on sched_yield but to 
use some more directed wakeup method, they will end up with a much more robust 
configuration. sched_yield always assumes the rest of the system is idle, which 
is just wrong.
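
To make "more directed wakeup method" concrete, a generic pthread sketch (not 
code from any particular application) that replaces a sched_yield poll with an 
explicit wakeup:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready;

/* Waiter: instead of "while (!ready) sched_yield();" block until woken. */
void wait_for_ready(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

/* Producer: wake the waiter directly rather than hoping it polls in time. */
void set_ready(void)
{
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}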

But yes the new sched_yield semantics seem to be definitely unexpected
for a lot of people.

The only way I could think of to make sched_yield work the way they 
expect would be to define some way of gang scheduling and give
sched_yield semantics that it preferably yields to other members
of the gang.

But it would still be hard to get these semantics (how to define the gangs) 
into your uncontrollable broken applications, and it also carries the risk of 
either unfairness or incomplete utilization of the machine. Getting it to scale 
well on MP systems would also likely be a challenge.

> I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and 
> have 3 questions/proposals:
> 
> 1) Would it be beneficial to give each task one try to be enqueued on the
> end of the active queue (by means of a boolean flag, reset each time the
> time slices reach 0 and it is put in the expired list by scheduler_tick)?

That would still not unbreak most applications, I suspect -- they will likely 
try to yield multiple times before using up a full time slice, unless their 
event handlers are very compute intensive.

In general, other subsystems (e.g. the VM) have had quite bad experiences with 
similar "one more try" hacks -- they tend not to be robust and actually 
penalize some loads.

> 2) When a task is eventually put in the expired list in sched_yield, give it
> back the full time slices round (as done in scheduler_tick), not with the
> remaining slices as is done now?

That would likely be unfair and exploitable.

> 3) Put the task in the expired list at a random position, not at the end as
> is done now?

Sounds like an interesting approach, but to do it in O(1) you would
need a new data structure with possibly much larger constant overhead.

-Andi


Re: sched_yield proposals/rationale

2007-04-12 Thread Sven-Thorsten Dietrich
On Thu, 2007-04-12 at 04:31 -0400, [EMAIL PROTECTED] wrote:
> Since the new 2.6.x O(1) scheduler I'm having latency problems. 

1. Have you elevated the process priority?
2. Have you tried running SCHED_FIFO, or SCHED_RR?
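
For reference, switching the process to SCHED_FIFO looks roughly like this (a 
sketch; it needs root or CAP_SYS_NICE, and the priority value is just an 
example):

#include <sched.h>
#include <stdio.h>

/* e.g. make_fifo(10); use SCHED_RR instead of SCHED_FIFO for round-robin. */
int make_fifo(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}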




sched_yield proposals/rationale

2007-04-12 Thread Buytaert_Steven
Since the new 2.6.x O(1) scheduler I have been having latency problems, probably 
due to excessive use of sched_yield in code in components I don't have control 
over. This 'problem'/behavioral change has also been reported by other 
applications (e.g. OpenLDAP, Gnome netmeeting, Postgres, e.google...)
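
The kind of usage pattern that hurts is, illustratively (not taken from any of 
the applications above), a userspace spinlock that yields while waiting; with 
the O(1) scheduler the yielder is moved to the expired array behind every other 
runnable task, so the spin can turn into a long stall:

#include <sched.h>

static volatile int locked;

static void busy_lock(void)
{
    while (__sync_lock_test_and_set(&locked, 1))
        sched_yield();          /* hope the lock holder runs... eventually */
}

static void busy_unlock(void)
{
    __sync_lock_release(&locked);
}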

I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and have 
3 questions/proposals:

1) Would it be beneficial to give each task one chance to be re-enqueued at the
end of the active queue (by means of a boolean flag, reset each time its time
slices reach 0 and it is put on the expired list by scheduler_tick)?

2) When a task is eventually put in the expired list in sched_yield, give it
back its full round of time slices (as done in scheduler_tick), not just the
remaining slices as is done now?

3) Put the task in the expired list at a random position, not at the end as
is done now?

Rationales:

For 1): suppose the resource that the task wants to acquire (if any), and for
which it yields because it is not available, is held by another task that sits
in the active list behind this task. Putting the yielding task on the expired
list would induce a kind of 'tail chasing'. Proposal 1 would break that cycle.

For 2): when a task is put on the expired list with e.g. one time quantum left,
the next time around it grabs the resource and is moved to the expired list at
the next scheduler_tick occurrence, before it has had a chance to release the
resource, keeping it for longer than necessary.

For 3): when the resource (if any) is requested by many tasks regularly, always
putting the yielding task at the end of the expired list could result in
'unfairness', since the resource is handed out to the tasks in the order they
get on the CPU. The only 'fairness' that can be applied with yield is then to
insert at a random position.
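
In pseudocode-style C the proposed policy would look roughly like this (all 
names here -- struct task, struct runqueue, DEF_TIMESLICE, enqueue_at, 
random_below -- are simplified stand-ins, not the real 2.6.16 structures or 
helpers):

#include <stdlib.h>

#define DEF_TIMESLICE 25                   /* stand-in for a full slice quota */

struct task {
    struct task *next;
    int time_slice;
    int yielded_once;                      /* proposal 1: one free pass */
};

struct queue    { struct task *head; int len; };
struct runqueue { struct queue active, expired; };

static void enqueue_at(struct queue *q, struct task *t, int pos)
{
    struct task **pp = &q->head;

    while (pos-- > 0 && *pp)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
    q->len++;
}

static int random_below(int n) { return n > 0 ? rand() % n : 0; }

static void proposed_yield(struct runqueue *rq, struct task *t)
{
    if (!t->yielded_once) {
        /* 1) the first yield only requeues at the tail of the active queue;
         *    the flag is reset when the slice quota really expires.        */
        t->yielded_once = 1;
        enqueue_at(&rq->active, t, rq->active.len);
        return;
    }
    /* 2) refill the quota when the yield moves us to the expired list.     */
    t->time_slice = DEF_TIMESLICE;
    /* 3) insert at a random position in the expired list, not the end.     */
    enqueue_at(&rq->expired, t, random_below(rq->expired.len + 1));
}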

Any insight on these proposed changes? Thanks

Steven Buytaert
--
Perfection is attained not when there is nothing left to add, but when there is 
nothing left to take away. (Antoine de Saint-Exupéry)


