RE: Re: sched_yield proposals/rationale
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:linux-kernel-[EMAIL PROTECTED]] On Behalf Of Bill Davidsen
> Sent: Tuesday, 17 April 2007 21:38
> To: linux-kernel@vger.kernel.org
> Cc: Buytaert, Steven; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
> Subject: Re: sched_yield proposals/rationale
>
> Mark Lord wrote:
> > Cool. You *do know* that there is a brand new CPU scheduler
> > scheduled to replace the current one for the 2.6.22 kernel, right?
>
> Having tried both nicksched and Con's fair sched on some normal loads,
> as opposed to benchmarks, I sure hope Linus changes his mind about
> having several schedulers in the kernel. The "one perfect and
> self-adjusting scheduler" isn't here yet.

I have the same opinion, and I'm afraid it is still a long time out. People probably only skimmed my suggestion for a 'fix', let alone read my footer. Too bad the archives only go back to '95; I would love to retrieve my posts from '93. Does anybody still have these?

Now a bit more on topic:

1) My problem is/was solved by making the default time slice much smaller than the default of 100 on a 250Hz system. But that is only masking the problem.

2) I made a schedstat program, somewhat like vmstat, that samples /proc/schedstat and prints deltas each second. Just printing the jiffies delta and the number of times schedule() is called per CPU is already thought-provoking; a small 12-second example:

2.6.16 vanilla scheduler

Jiffies Delta |  CPU0  |  CPU1
          250 |   1756 |   7301 |
          252 |   1730 |   2638 |
          254 |   1963 |   1663 |
          385 |    868 |    658 |  <--- stall starts
          325 |    138 |    112 |
          330 |    184 |    130 |
          339 |    682 |    122 |
          335 |    367 |    159 |
          334 |    653 |    127 |
          345 |    467 |    137 |
          335 |    673 |    128 |
          337 |    471 |    131 |
          334 |    673 |    127 |
          332 |    321 |    144 |
          333 |    523 |    129 |
          332 |     98 |    123 |
          356 |    496 |    124 |
          270 |     96 |     87 |
          277 |   5878 |  26228 |  <-- yes, 26K
          252 |  18263 |  19130 |
          ...
          255 |   2024 |   5747 |  <-- back to normal

Let's talk about fairness when we get the basics right.
And yes, the real physical elapsed time per line IS 1 second, so the jiffies delta jumping from the expected value of 250 up to 356 is not normal; it is in fact very, very bad. This box was unusable for 13 seconds in a row. But that doesn't look serious enough to most people, it seems.

Steven Buytaert

--
Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: sched_yield proposals/rationale
Mark Lord wrote:
> [EMAIL PROTECTED] wrote:
>> From: Bill Davidsen
>>> And having gotten same, are you going to code up what appears to be a
>>> solution, based on this feedback?
>>
>> The feedback was helpful in verifying whether there are any arguments
>> against my approach. The real proof is in the pudding: I'm running a
>> kernel with these changes as we speak. Overall system throughput is up
>> about 20%. By 'system throughput' I mean the measured performance of a
>> rather large (experimental) system. The patch isn't even 24h old...
>> Application latency has improved as well.
>
> Cool. You *do know* that there is a brand new CPU scheduler
> scheduled to replace the current one for the 2.6.22 kernel, right?

Having tried both nicksched and Con's fair sched on some normal loads, as opposed to benchmarks, I sure hope Linus changes his mind about having several schedulers in the kernel. The "one perfect and self-adjusting scheduler" isn't here yet.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] wrote:
> From: Bill Davidsen
>> And having gotten same, are you going to code up what appears to be a
>> solution, based on this feedback?
>
> The feedback was helpful in verifying whether there are any arguments
> against my approach. The real proof is in the pudding: I'm running a
> kernel with these changes as we speak. Overall system throughput is up
> about 20%. By 'system throughput' I mean the measured performance of a
> rather large (experimental) system. The patch isn't even 24h old...
> Application latency has improved as well.

Cool. You *do know* that there is a brand new CPU scheduler scheduled to replace the current one for the 2.6.22 kernel, right?

-ml
RE: sched_yield proposals/rationale
> From: Bill Davidsen
>
> And having gotten same, are you going to code up what appears to be a
> solution, based on this feedback?

The feedback was helpful in verifying whether there are any arguments against my approach. The real proof is in the pudding: I'm running a kernel with these changes as we speak. Overall system throughput is up about 20%. By 'system throughput' I mean the measured performance of a rather large (experimental) system. The patch isn't even 24h old... Application latency has improved as well.

Additional settings: my patch is *also* running with a kernel modified to have only 8 default time slices at the 250Hz setting. And no, the overall number of context switches per second hasn't blown up. The kernel was compiled with low latency and in-kernel preemption enabled, including BKL preemption. I haven't checked the patch stand-alone yet.

> I'm curious how well it would run poorly written programs, having
> recently worked with a company which seemed to have a whole part of
> purchasing dedicated to buying same. :-(

So the first signs are positive; note that it requires much more run time and a slew of other tests/scrutiny before we can be really sure.

W.r.t. the remarks: I am most interested in possible DoS attacks that could exploit this change in sched_yield. Andi's comments on that were therefore interesting, but I haven't heard back from him yet. I'm still not sure how a task could juggle more slices out of the system because of these changes.

A last remark on the O(1)'ness being violated: I think it's a moot point. sched_yield is executed on the CPU time of the yielder. Being O(1) matters most for the scheduler proper at each timer tick (interrupt); that being O(1) is crucial.

Steven Buytaert
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] wrote:
>> Besides - but I guess you're aware of it - any randomized algorithms
>> tend to drive benchmarkers and performance analysts crazy because their
>> performance cannot be repeated. So it's usually better to avoid them
>> unless there is really no alternative.
>
> That could already solve your concern from above. Statistically
> speaking, it will give them (the benchmarkers) the smoothest curve
> they've ever seen.
>
> Please be aware that I'm just exploring options/insight here. It is not
> something I intend to push into the mainline kernel. I just want to find
> reasonable and logical criticism, as you and some others have provided
> already. Thanks for that!

And having gotten same, are you going to code up what appears to be a solution, based on this feedback?

I'm curious how well it would run poorly written programs, having recently worked with a company which seemed to have a whole part of purchasing dedicated to buying same. :-(
Re: sched_yield proposals/rationale
On Thu, Apr 12, 2007 at 11:27:22PM +1000, Nick Piggin wrote:
> This one should be pretty rare (actually I think it is dead code in
> practice, due to the way the page allocator works).
>
> Avoiding sched_yield is a really good idea outside realtime scheduling.
> Since we have gone this far with the current semantics, I think it
> would be sad to back down now.
>
> It would be nice if you could pressure those other components to adapt :)

Outside of realtime scheduling there appear to be two desired behaviors:

(1) busywait: demote as aggressively as possible
(2) CPU burn: give other apps a chance to run, but demote lightly at most

There is no way for the scheduler to distinguish which of the two behaviors is desired. A fresh system call taking an argument that describes the desired behavior is my recommended solution. Most unaware apps should be manageable via LD_PRELOAD.

Busywaiters that can be modified could be given more specific scheduling primitives, in particular "directed yields," which donate timeslice and possibly dynamic priority to their targets. They would look something like:

	int yield_to(pid_t);
	int yield_to_futex(int *);
	int yield_to_sem(int);
	/* etc. */

as userspace library functions, where yielding to a resource is intended to donate timeslice to its owner or one of its owners, those owner(s) being determined by the kernel.

Directed yields are a more direct attack on the priority inversion one most desperately wants to avoid in the case of sched_yield()-based busywaiters on a resource, namely the resource owner falling behind the busywaiters in priority or running out of timeslice. They furthermore reduce the competition for CPU between resource owners and busywaiters on that resource.

A less direct alternative suggested by Andi Kleen is to have coprocess groups and an alternative to sched_yield() that directs yielding toward a member of the same coprocess group as the yielder, possibly using the standard system call by making that the default behavior when a process is a member of such a coprocess group.

-- wli
Re: sched_yield proposals/rationale
On Thu, Apr 12, 2007 at 03:31:31PM +0200, Andi Kleen wrote:
> The only way I could think of to make sched_yield work the way they
> expect would be to define some way of gang scheduling and give
> sched_yield semantics that it preferably yields to other members
> of the gang. But it would be still hard to get these semantics (how to
> define the gangs) into your uncontrollable broken applications, and it
> also has the risk of either unfairness or not fully utilizing the
> machine. Getting it to scale well on MP systems would also likely be
> a challenge.

Gang scheduling isn't a squishy concept whose name can be arbitrarily repurposed. Perhaps "group scheduling" or similar would be appropriate if the standard gang scheduling semantics are not what you have in mind. Standard gang scheduling would not be appropriate for applications that don't know what they're doing: all threads of a gang falling asleep when one sleeps (or, more properly, the gang being considered either runnable or unrunnable as a unit) is not to be taken lightly.

I'd call this something like a "directed yield with a group as a target," but I wouldn't actually try to do this. I'd try to provide ways for a directed yield to donate remaining timeslice and, if possible, dynamic priority to a particular task associated with a resource, for instance a futex or SysV semaphore owner. The priority inversion one desperately wants to avoid is the resource owner running out of timeslice, or otherwise losing priority to where it falls behind busywaiters such as callers of sched_yield used for multitier locking.

-- wli
RE: sched_yield proposals/rationale
On Thu, 2007-04-12 at 10:15 -0400, [EMAIL PROTECTED] wrote:
> > > Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc
> > > over the 2.6.16 kernel yields 105 hits, note including comments... An
> > > interesting spot is e.g. fs/buffer.c free_more_memory()
> >
> > A lot of those are probably broken in some way, agreed.
>
> OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)

100, simply because at least 95 of them are broken.

	tglx
RE: sched_yield proposals/rationale
> -----Original Message-----
> > Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc
> > over the 2.6.16 kernel yields 105 hits, note including comments... An
> > interesting spot is e.g. fs/buffer.c free_more_memory()
>
> A lot of those are probably broken in some way, agreed.

OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)

> [ about giving a thread its full slice quota after yielding ]
> With a particular sleep pattern it could get more CPU time.

I'm not a Linux scheduler expert, but I don't understand this. Suppose 2 extreme situations:

1) I have full time slices and call yield, going to the expired list with, again, full time slices; this is IMHO a no-op, i.e. fair.

2) I have only a single slice left. NOT yielding would get me to the expired list via scheduler_tick at the next timer interrupt, and I would get my full time slice quota. Yielding at that moment will, with my proposal, likewise give me my full time slice quota again.

> > 3) Put the task in the expired list at a random position [...]
> > while (X--) { prev = current; current = current->next }
>
> You would need to rename the scheduler to "sometimes O(1)" first @)

Again, agreed, but this tight loop would run completely in the code cache. We could call it O(1.1) ;-) With a Duff's device it would cost a few clock cycles per position.

> Besides - but I guess you're aware of it - any randomized algorithms
> tend to drive benchmarkers and performance analysts crazy because their
> performance cannot be repeated. So it's usually better to avoid them
> unless there is really no alternative.

That could already resolve your concern from above. Statistically speaking, it will give them (the benchmarkers) the smoothest curve they've ever seen.

Please be aware that I'm just exploring options/insight here. It is not something I intend to push into the mainline kernel. I just want to find reasonable and logical criticism, as you and some others have provided already. Thanks for that!

Steven Buytaert
Re: sched_yield proposals/rationale
On Thu, Apr 12, 2007 at 09:05:25AM -0400, [EMAIL PROTECTED] wrote:
> > -----Original Message-----
> > From: Andi Kleen
> > [ ... about use of sched_yield ...]
> > On the other hand when they fix their code to not rely on sched_yield
> > but use [...]
>
> Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc
> over the 2.6.16 kernel yields 105 hits, note including comments... An
> interesting spot is e.g. fs/buffer.c free_more_memory()

A lot of those are probably broken in some way, agreed.

> > > 2) When a task is eventually put in the expired list in sched_yield,
> > > give it back the full time slices round (as done in scheduler_tick),
> > > not with the remaining slices as is done now?
> >
> > That would likely be unfair and exploitable.
>
> I don't understand; how much more unfair would it be than passing via
> scheduler_tick? Grabbing a resource with a single time slice left would
> be more unfair towards other tasks IMHO, when you get moved to the
> expired list with the resource still in your possession.

With a particular sleep pattern it could get more CPU time.

> > > 3) Put the task in the expired list at a random position, not at the
> > > end as is done now?
> >
> > Sounds like an interesting approach, but to do it in O(1) you would
> > need a new data structure with possibly much larger constant overhead.
>
> Agreed, but not dramatic. Suppose you need to insert at position X; on
> the linked list, after proper setup, you would do:
>
>     while (X--) { prev = current; current = current->next; }
>
> You could have a small Duff's device to reduce the X-- checking overhead.

You would need to rename the scheduler to "sometimes O(1)" first @)

Besides - but I guess you're aware of it - any randomized algorithms tend to drive benchmarkers and performance analysts crazy because their performance cannot be repeated. So it's usually better to avoid them unless there is really no alternative.

-Andi
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] wrote:
>> From: Andi Kleen
>> [ ... about use of sched_yield ...]
>> On the other hand when they fix their code to not rely on sched_yield
>> but use [...]
>
> Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc
> over the 2.6.16 kernel yields 105 hits, note including comments...

Most of these (in core code, anyway) seem to use yield when they really don't care about running for a while.

> An interesting spot is e.g. fs/buffer.c free_more_memory()

This one should be pretty rare (actually I think it is dead code in practice, due to the way the page allocator works).

Avoiding sched_yield is a really good idea outside realtime scheduling. Since we have gone this far with the current semantics, I think it would be sad to back down now. It would be nice if you could pressure those other components to adapt :)

--
SUSE Labs, Novell Inc.
RE: sched_yield proposals/rationale
> -----Original Message-----
> From: Andi Kleen
> [ ... about use of sched_yield ...]
> On the other hand when they fix their code to not rely on sched_yield
> but use [...]

Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over the 2.6.16 kernel yields 105 hits, note including comments... An interesting spot is e.g. fs/buffer.c free_more_memory()

> The only way I could think of to make sched_yield work the way they
> expect would be to define some way of gang scheduling and give
> sched_yield semantics that it preferably yields to other members
> of the gang. [...]

That would indeed make things too complicated.

> [about question 1]
> That would still not unbreak most applications I would suspect -- they
> will likely try to yield multiple times before using up a full time
> slice unless their event handlers are very compute intensive.

No, they only get 1 more go on the active queue, until they take the proper exit to the expired queue via scheduler_tick. And they are enqueued at the end of the active queue with whatever slices they have left, thus fair.

> In general other subsystems (e.g. VM) have had quite bad experiences
> with similar "one more try" hacks -- they tend to be not robust and
> actually penalize some loads.
>
> > 2) When a task is eventually put in the expired list in sched_yield,
> > give it back the full time slices round (as done in scheduler_tick),
> > not with the remaining slices as is done now?
>
> That would likely be unfair and exploitable.

I don't understand; how much more unfair would it be than passing via scheduler_tick? Grabbing a resource with a single time slice left would be more unfair towards other tasks IMHO, when you get moved to the expired list with the resource still in your possession.

> > 3) Put the task in the expired list at a random position, not at the
> > end as is done now?
>
> Sounds like an interesting approach, but to do it in O(1) you would
> need a new data structure with possibly much larger constant overhead.

Agreed, but not dramatic. Suppose you need to insert at position X; on the linked list, after proper setup, you would do:

    while (X--) { prev = current; current = current->next; }

You could have a small Duff's device to reduce the X-- checking overhead.

Thanks for the insight, Andi!

Steven Buytaert
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] writes:
> Since the new 2.6.x O(1) scheduler I'm having latency problems, probably
> due to excessive use of sched_yield in code in components I don't have
> control over. This 'problem'/behavioral change has also been reported
> for other applications (e.g. OpenLDAP, Gnome netmeeting, PostgreSQL,
> e.google...)

On the other hand, when they fix their code to not rely on sched_yield but use some more directed wakeup method, they will end up with a much more robust configuration. sched_yield always assumes the rest of the system is idle, which is just wrong. But yes, the new sched_yield semantics definitely seem to be unexpected for a lot of people.

The only way I could think of to make sched_yield work the way they expect would be to define some way of gang scheduling and give sched_yield semantics such that it preferably yields to other members of the gang. But it would still be hard to get these semantics (how to define the gangs) into your uncontrollable broken applications, and it also has the risk of either unfairness or not fully utilizing the machine. Getting it to scale well on MP systems would likely be a challenge too.

> I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10)
> and have 3 questions/proposals:
>
> 1) Would it be beneficial to give each task one try to be enqueued at
> the end of the active queue (by means of a boolean flag, reset each time
> the time slices reach 0 and the task is put in the expired list by
> scheduler_tick)?

That would still not unbreak most applications, I suspect -- they will likely try to yield multiple times before using up a full time slice, unless their event handlers are very compute intensive. In general, other subsystems (e.g. the VM) have had quite bad experiences with similar "one more try" hacks -- they tend to be not robust and actually penalize some loads.

> 2) When a task is eventually put in the expired list in sched_yield,
> give it back the full time slices round (as done in scheduler_tick), not
> the remaining slices as is done now?

That would likely be unfair and exploitable.

> 3) Put the task in the expired list at a random position, not at the end
> as is done now?

Sounds like an interesting approach, but to do it in O(1) you would need a new data structure with possibly much larger constant overhead.

-Andi
Re: sched_yield proposals/rationale
On Thu, 2007-04-12 at 04:31 -0400, [EMAIL PROTECTED] wrote:
> Since the new 2.6.x O(1) scheduler I'm having latency problems.

1. Have you elevated the process priority?
2. Have you tried running SCHED_FIFO or SCHED_RR?
sched_yield proposals/rationale
Since the new 2.6.x O(1) scheduler I'm having latency problems, probably due to excessive use of sched_yield in code in components I don't have control over. This 'problem'/behavioral change has also been reported for other applications (e.g. OpenLDAP, Gnome netmeeting, PostgreSQL, e.google...)

I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and have 3 questions/proposals:

1) Would it be beneficial to give each task one try to be enqueued at the end of the active queue (by means of a boolean flag, reset each time the time slices reach 0 and the task is put in the expired list by scheduler_tick)?

2) When a task is eventually put in the expired list in sched_yield, should it get back its full round of time slices (as done in scheduler_tick), not the remaining slices as is done now?

3) Should the task be put in the expired list at a random position, not at the end as is done now?

Rationales:

For 1): suppose the resource that the task wants to acquire (if any), and for which it yields since it is not available, is held by another task that is in the active list behind the yielding task. Putting the yielding task on the expired list would induce a kind of 'tail chasing'. This would be broken by 1).

For 2): when a task is put on the expired list with e.g. 1 time quantum left, the next time it gets the resource it is put on the expired list at the next scheduler_tick occurrence, before it has a chance to release the resource again, keeping the resource for longer than necessary.

For 3): when the resource (if any) is requested by many tasks regularly, always putting the yielding task at the end of the expired list could result in 'unfairness', since the resource is handed out to the tasks as they get on the CPU. The only 'fairness' that can be applied with yield is then to insert at a random position.

Any insight on these proposed changes? Thanks,

Steven Buytaert
sched_yield proposals/rationale
Since the new 2.6.x O(1) scheduler I'm having latency problems, probably due to excessive use of sched_yield in code in components I don't have control over. This 'problem'/behavioural change has also been reported by other applications (e.g. OpenLDAP, GnomeMeeting, PostgreSQL, ...).

I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and have 3 questions/proposals:

1) Would it be beneficial to give each task one try to be enqueued at the end of the active queue (by means of a boolean flag, reset each time the time slices reach 0 and the task is put in the expired list by scheduler_tick)?

2) When a task is eventually put in the expired list in sched_yield, give it back the full round of time slices (as done in scheduler_tick), not the remaining slices as is done now?

3) Put the task in the expired list at a random position, not at the end as is done now?

Rationales:

For 1) suppose the resource that the task wants to acquire (if any), and for which it yields since the resource is not available, is held by another task that is in the active list behind this task. Putting the yielding task on the expired list would induce a kind of 'tail chasing'. This would be broken by 1.

For 2) when a task is put on the expired list with e.g. 1 time quantum left, the next time it gets the resource it is put on the expired list at the next scheduler_tick occurrence, before it has had the chance to release the resource again, keeping the resource for longer than necessary.

For 3) when the resource (if any) is requested by many tasks regularly, always putting the yielding task at the end of the expired list could result in 'unfairness', since the resource is handed out to the tasks as they get on the CPU. The only 'fairness' that can then be applied with yield is to insert at a random position.

Any insight on these proposed changes? Thanks

Steven Buytaert

--
Perfection is attained not when there is nothing left to add, but when there is nothing left to take away.
(Antoine de Saint-Exupéry)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: sched_yield proposals/rationale
On Thu, 2007-04-12 at 04:31 -0400, [EMAIL PROTECTED] wrote:
> Since the new 2.6.x O(1) scheduler I'm having latency problems.

1. Have you elevated the process priority?
2. Have you tried running SCHED_FIFO or SCHED_RR?
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] writes:
> Since the new 2.6.x O(1) scheduler I'm having latency problems. Probably due to excessive use of sched_yield in code in components I don't have control over. This 'problem'/behavioural change has also been reported by other applications (e.g. OpenLDAP, GnomeMeeting, PostgreSQL, ...)

On the other hand, when they fix their code to not rely on sched_yield but use some more directed wakeup method, they will end up with a much more robust configuration. sched_yield always assumes the rest of the system is idle, which is just wrong. But yes, the new sched_yield semantics do seem to be unexpected for a lot of people.

The only way I could think of to make sched_yield work the way they expect would be to define some way of gang scheduling and give sched_yield semantics such that it preferably yields to other members of the gang. But it would still be hard to get these semantics (how to define the gangs) into your uncontrollable broken applications, and it also has the risk of either unfairness or not fully utilizing the machine. Getting it to scale well on MP systems would likely also be a challenge.

> I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and have 3 questions/proposals:
> 1) Would it be beneficial to give each task one try to be enqueued at the end of the active queue (by means of a boolean flag, reset each time the time slices reach 0 and it is put in the expired list by scheduler_tick)?

That would still not unbreak most applications, I suspect -- they will likely try to yield multiple times before using up a full time slice, unless their event handlers are very compute intensive. In general, other subsystems (e.g. VM) have had quite bad experiences with similar "one more try" hacks -- they tend not to be robust and actually penalize some loads.

> 2) When a task is eventually put in the expired list in sched_yield, give it back the full round of time slices (as done in scheduler_tick), not the remaining slices as is done now?

That would likely be unfair and exploitable.

> 3) Put the task in the expired list at a random position, not at the end as is done now?

Sounds like an interesting approach, but to do it in O(1) you would need a new data structure, with possibly much larger constant overhead.

-Andi
RE: sched_yield proposals/rationale
> -----Original Message-----
> From: Andi Kleen
> [ ... about use of sched_yield ...]
> On the other hand when they fix their code to not rely on sched_yield but use [...]

Agreed, but

  $ find . -name '*.[ch]' | xargs grep -E 'yield[ ]*\(' | wc

over the 2.6.16 kernel yields 105 hits, not including comments... An interesting spot is e.g. fs/buffer.c free_more_memory().

> The only way I could think of to make sched_yield work the way they expect would be to define some way of gang scheduling and give sched_yield semantics that it preferably yields to other members of the gang. [...]

That would indeed make things too complicated.

[about question 1]

> That would still not unbreak most applications I would suspect -- they will likely try to yield multiple times before using up a full time slice unless their event handlers are very compute intensive.

No, they only get 1 go again on the active queue, until they pass the proper exit to the expired queue via scheduler_tick. And they are enqueued at the end of the active queue with whatever slices they have left, thus fair.

> In general other subsystems (e.g. VM) have had quite bad experiences with similar "one more try" hacks -- they tend to be not robust and actually penalize some loads.
> > 2) When a task is eventually put in the expired list in sched_yield, give it back the full round of time slices (as done in scheduler_tick), not the remaining slices as is done now?
> That would likely be unfair and exploitable.

I don't understand; how much more unfair would it be than passing via scheduler_tick? Grabbing a resource with a single time slice left would be more unfair towards other tasks IMHO, when you get moved to the expired list with the resource still in your possession.

> > 3) Put the task in the expired list at a random position, not at the end as is done now?
> Sounds like an interesting approach, but to do it in O(1) you would need a new data structure with possibly much larger constant overhead.

Agreed, but not dramatic. Suppose you need to insert at position X; you would do, on the linked list after proper setup:

  while (X--) { prev = current; current = current->next; }

You could have a small Duff's device to reduce the X-- checking overhead.

Thanks for the insight Andi!

Steven Buytaert
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] wrote:
> Agreed, but
>   $ find . -name '*.[ch]' | xargs grep -E 'yield[ ]*\(' | wc
> over the 2.6.16 kernel yields 105 hits, not including comments...

Most of these (in core code, anyway) seem to use yield when they really don't care about running for a while.

> An interesting spot is e.g. fs/buffer.c free_more_memory()

This one should be pretty rare (actually I think it is dead code in practice, due to the way the page allocator works).

Avoiding sched_yield is a really good idea outside realtime scheduling. Since we have gone this far with the current semantics, I think it would be sad to back down now. It would be nice if you could pressure those other components to adapt :)

--
SUSE Labs, Novell Inc.
Re: sched_yield proposals/rationale
On Thu, Apr 12, 2007 at 09:05:25AM -0400, [EMAIL PROTECTED] wrote:
> Agreed, but
>   $ find . -name '*.[ch]' | xargs grep -E 'yield[ ]*\(' | wc
> over the 2.6.16 kernel yields 105 hits, not including comments...
> An interesting spot is e.g. fs/buffer.c free_more_memory()

A lot of those are probably broken in some way, agreed.

> > 2) When a task is eventually put in the expired list in sched_yield, give it back the full round of time slices (as done in scheduler_tick), not the remaining slices as is done now?
> > That would likely be unfair and exploitable.
> I don't understand; how much more unfair would it be than passing via scheduler_tick? Grabbing a resource with a single time slice left would be more unfair towards other tasks IMHO, when you get moved to the expired list with the resource still in your possession.

With a particular sleep pattern it could get more CPU time.

> > 3) Put the task in the expired list at a random position, not at the end as is done now?
> > Sounds like an interesting approach, but to do it in O(1) you would need a new data structure with possibly much larger constant overhead.
> Agreed, but not dramatic. Suppose you need to insert at position X; you would do, on the linked list after proper setup:
>   while (X--) { prev = current; current = current->next; }
> You could have a small Duff's device to reduce the X-- checking overhead.

You would need to rename the scheduler to "sometimes O(1)" first @)

Besides -- but I guess you're aware of it -- randomized algorithms tend to drive benchmarkers and performance analysts crazy, because their performance cannot be repeated. So it's usually better to avoid them unless there is really no alternative.

-Andi
RE: sched_yield proposals/rationale
> -----Original Message-----
> > Agreed, but
> >   $ find . -name '*.[ch]' | xargs grep -E 'yield[ ]*\(' | wc
> > over the 2.6.16 kernel yields 105 hits, not including comments...
> > An interesting spot is e.g. fs/buffer.c free_more_memory()
> A lot of those are probably broken in some way, agreed.

OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)

[ about giving a thread its full slice quota after yielding ]

> With a particular sleep pattern it could get more CPU time.

I'm not a Linux scheduler expert, but I don't understand this. Suppose 2 extreme situations:

1) I have full time slices and call yield, going to the expired list with again full time slices; this is IMHO a NOOP, i.e. fair.

2) I have only a single slice left; NOT yielding would get me to the expired list via scheduler_tick at the next timer interrupt, and I would get my full time slice quota. DO YIELD at that moment, and I get my full time slice quota again (with my proposal).

> > 3) Put the task in the expired list at a random position [...]
> >   while (X--) { prev = current; current = current->next; }
> You would need to rename the scheduler to "sometimes O(1)" first @)

Again, agreed, but this tight loop would run completely in the code cache. We could call it O(1.1) ;-) With a Duff's device it would mean a few clock cycles per position.

> Besides -- but I guess you're aware of it -- randomized algorithms tend to drive benchmarkers and performance analysts crazy, because their performance cannot be repeated. So it's usually better to avoid them unless there is really no alternative.

That could already solve your concern from above. Statistically speaking, it will give them (the benchmarkers) the smoothest curve they've ever seen.

Please be aware that I'm just exploring options/insight here. It is not something I intend to push into the mainline kernel. I just want to find reasonable and logical criticism, as you and some others have provided already. Thanks for that!

Steven Buytaert
RE: sched_yield proposals/rationale
On Thu, 2007-04-12 at 10:15 -0400, [EMAIL PROTECTED] wrote:
> OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)

Fix the 100 spots, simply because at least 95 of them are broken.

tglx
Re: sched_yield proposals/rationale
On Thu, Apr 12, 2007 at 03:31:31PM +0200, Andi Kleen wrote:
> The only way I could think of to make sched_yield work the way they expect would be to define some way of gang scheduling and give sched_yield semantics that it preferably yields to other members of the gang. But it would be still hard to get these semantics (how to define the gangs) into your uncontrollable broken applications and also it has the risk of either unfairness or not full utilization of the machine. Getting it to scale well on MP systems would be also likely a challenge.

Gang scheduling isn't a squishy concept whose name can be arbitrarily repurposed. Perhaps "group scheduling" or similar would be appropriate if the standard gang scheduling semantics are not what you have in mind. Standard gang scheduling would not be appropriate for applications that don't know what they're doing. All threads of a gang falling asleep when one sleeps (or, more properly, the gang being considered either runnable or unrunnable as a unit) is not to be taken lightly.

I'd call this something like a directed yield with a group as a target, but I wouldn't actually try to do this. I'd try to provide ways for a directed yield to donate remaining timeslice and dynamic priority, if possible, to a particular task associated with a resource, for instance a futex or SysV semaphore owner. The priority inversion one desperately wants to avoid is the resource owner running out of timeslice, or otherwise losing priority to where it falls behind busywaiters such as callers of sched_yield for the purposes of multitier locking.

-- wli
Re: sched_yield proposals/rationale
On Thu, Apr 12, 2007 at 11:27:22PM +1000, Nick Piggin wrote:
> This one should be pretty rare (actually I think it is dead code in practice, due to the way the page allocator works). Avoiding sched_yield is a really good idea outside realtime scheduling. Since we have gone this far with the current semantics, I think it would be sad to back down now. It would be nice if you could pressure those other components to adapt :)

Outside of realtime scheduling there appear to be two desired behaviors:

(1) busywait: demote as aggressively as possible
(2) CPU burn: give other apps a chance to run, but demote lightly at most

There is no way for the scheduler to distinguish which of the two behaviors is desired. A fresh system call taking an argument to describe the desired behavior is my recommended solution. Most unaware apps should be able to be dealt with via LD_PRELOAD.

Busywaiters able to be modified could be given more specific scheduling primitives, in particular directed yields, which donate timeslice and possibly dynamic priority to their targets. They would look something like:

  int yield_to(pid_t);
  int yield_to_futex(int *);
  int yield_to_sem(int);
  /* etc. */

as userspace library functions, where yielding to a resource is intended to donate timeslice to its owner or one of its owners, those owner(s) being determined by the kernel.

Directed yields are a more direct attack on the priority inversion one most desperately wants to avoid in the case of sched_yield()-based busywaiters on a resource, namely the resource owner falling behind the busywaiters in priority or running out of timeslice. They furthermore reduce the competition for CPU between resource owners and busywaiters on that resource.

A less direct alternative, suggested by Andi Kleen, is to have coprocess groups and an alternative to sched_yield() that directs yielding toward a member of the same coprocess group as the yielder, possibly using the standard system call by making that the default behavior when a process is a member of such a coprocess group.

-- wli
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] wrote:
> > Besides -- but I guess you're aware of it -- randomized algorithms tend to drive benchmarkers and performance analysts crazy, because their performance cannot be repeated. So it's usually better to avoid them unless there is really no alternative.
> That could already solve your concern from above. Statistically speaking, it will give them (the benchmarkers) the smoothest curve they've ever seen.
> Please be aware that I'm just exploring options/insight here. It is not something I intend to push into the mainline kernel. I just want to find reasonable and logical criticism, as you and some others have provided already. Thanks for that!

And having gotten same, are you going to code up what appears to be a solution, based on this feedback? I'm curious how well it would run poorly written programs, having recently worked with a company which seemed to have a whole part of purchasing dedicated to buying same. :-(

--
Bill Davidsen [EMAIL PROTECTED]
"We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
RE: sched_yield proposals/rationale
> From: Bill Davidsen
> And having gotten same, are you going to code up what appears to be a solution, based on this feedback?

The feedback was helpful in verifying whether there are any arguments against my approach. The real proof is in the pudding: I'm running a kernel with these changes as we speak. Overall system throughput is up about 20%. By 'system throughput' I mean the measured performance of a rather large (experimental) system. The patch isn't even 24h old... The application latency has also improved.

Additional settings: my patch is running *also* with a kernel modified to have only 8 default time slices at the 250Hz setting. And no, the overall number of context switches per second hasn't blown up. The kernel was compiled with low latency and in-kernel preemption enabled, BKL preemption enabled. I haven't checked the patch stand-alone yet.

> I'm curious how well it would run poorly written programs, having recently worked with a company which seemed to have a whole part of purchasing dedicated to buying same. :-(

So first signs are positive; note that it requires much more run time and a slew of other tests/scrutiny before we can be really sure. W.r.t. the remarks, I am most interested in possibilities of DOS attacks that could exploit this change in sched_yield. The comments of Andi were therefore interesting, but I haven't heard back from him yet. I'm still not sure how a task could juggle more slices out of the system because of these changes.

Last remark, on the O(1)'ness being violated: I think it's a moot point. sched_yield is executed on the CPU time of the yielder. Being O(1) is most important for the scheduler proper at each timer tick (interrupt); that being O(1) is crucial.

Steven Buytaert