Re: [PATCH] msleep() with hrtimers
Hi,

On Sun, 5 Aug 2007, Arjan van de Ven wrote:

> > There's no problem to provide a high resolution sleep, but there is
> > also no reason to mess with msleep, don't fix what ain't broken...
>
> Jonathan Corbet provided the patch because he had a problem with the
> current msleep... in that it didn't provide as good a common case as he
> wanted... so I think your statement is wrong ;)

Only under the assumption that msleep _must_ be fixed for all other
current users too. Give users a choice to use msleep or nanosleep; how do
you know what's best for them?

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] msleep() with hrtimers
Hi,

On Sun, 5 Aug 2007, Arjan van de Ven wrote:

> Timers are coarse resolution that is highly HZ-value dependent. For
> cases where you want a finer resolution, the kernel now has a way to
> provide that functionality... so why not use the quality of service
> this provides..

We're going in circles here. We have two different timer APIs for a
reason; just because hrtimers provide better resolution doesn't
automatically make them the better generic timer. There's no problem
with providing a high resolution sleep, but there is also no reason to
mess with msleep, don't fix what ain't broken...

bye, Roman
Re: [PATCH] msleep() with hrtimers
Hi,

On Sat, 4 Aug 2007, Arjan van de Ven wrote:

> > hr_msleep makes no sense. Why should we tie this interface to
> > millisecond resolution?
>
> because a lot of parts of the kernel think and work in milliseconds,
> it's logical and USEFUL to at least provide an interface that works on
> milliseconds.

If millisecond resolution is enough for these users, that means the
current msleep will work fine for them.

> > Your suggested msleep_approx doesn't make much sense to me either,
> > since neither interface guarantees anything and may "approximate" the
> > sleep (and if the user is surprised by that, something else already
> > went wrong).
>
> an interface should try to map to the implementation that provides the
> best implementation quality of the requested thing in general. That's
> the hrtimers based msleep().

This generalization is simply not true. First, it requires the
HIGH_RES_TIMERS option to be enabled to make a real difference. Second,
a hrtimers based msleep has a higher setup cost, which can't be
completely ignored. "Best" is a subjective term here and can't be that
easily generalized to all current users.

> > If you don't like the hrsleep name, we can also call it nanosleep and
> > so match what we already do for userspace.
>
> having a nanosleep *in addition* to msleep (or maybe nsleep() and
> usleep() to have consistent naming) sounds reasonable to me.

We only need one sleep implementation of each, and msleep is a fine name
for the current implementation - not only does it describe the unit, it
also describes the best resolution one can expect from it.

> Do you have something against hrtimer use in general? From your emails
> on this msleep topic it sort of seems you do

I can give the question back: what do you have against simple timers,
that you want to make them as awkward as possible to use? hrtimers have
a higher usage cost depending on the clock source, so using them simply
because they are the new cool kid in town doesn't make sense.

It may not be that critical for a simple sleep implementation, but that
only means we should keep the API as simple as possible: one low
resolution, cheap msleep and one high resolution nanosleep is enough.
Why do you insist on making it more complex than necessary?

bye, Roman
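[Editor's note: the granularity argument above can be made concrete. A
jiffies-based msleep() rounds the request up to whole timer ticks and adds
one extra tick so the sleep is never shorter than requested. A rough sketch
in Python (plain arithmetic mirroring that behaviour, not kernel code; the
HZ values are just common configuration choices):]

```python
# Rough illustration (not kernel code) of why a jiffies-based msleep()
# cannot honour short sleeps: the requested time is rounded up to whole
# timer ticks (jiffies), and one extra tick is added so the sleep is
# never shorter than requested.

def msecs_to_jiffies(msecs: int, hz: int) -> int:
    """Round a millisecond value up to whole jiffies (timer ticks)."""
    return -(-msecs * hz // 1000)  # ceiling division

def msleep_duration_ms(msecs: int, hz: int) -> float:
    """Worst-case duration of a jiffies-based msleep(), in milliseconds."""
    jiffies = msecs_to_jiffies(msecs, hz) + 1  # +1 guarantees the minimum sleep
    return jiffies * 1000.0 / hz

for hz in (100, 250, 1000):
    print(f"HZ={hz:4}: msleep(1) may take up to "
          f"{msleep_duration_ms(1, hz):.0f} ms")
```

So at HZ=100 an msleep(1) can take an order of magnitude longer than asked
for, which is the gap a submillisecond hrsleep()/nanosleep would close.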
Re: [PATCH] msleep() with hrtimers
Hi,

On Fri, 3 Aug 2007, Arjan van de Ven wrote:

> > Actually the hrsleep() function would allow for submillisecond sleeps,
> > which might be what some of the 450 users really want, and they only
> > use msleep(1) because it's the next best thing. A hrsleep() function
> > is really what makes the most sense from an API perspective.
>
> I respectfully disagree. The power of msleep is that the unit of sleep
> time is in the name; so in your proposal it would be hr_msleep or
> somesuch. I much rather do the opposite in that case; make the "short"
> name be the best implementation of the requested behavior, and have
> qualifiers for allowing exceptions to that... least surprise and all
> that.

hr_msleep makes no sense. Why should we tie this interface to
millisecond resolution? Your suggested msleep_approx doesn't make much
sense to me either, since neither interface guarantees anything and may
"approximate" the sleep (and if the user is surprised by that, something
else already went wrong). If you don't like the hrsleep name, we can
also call it nanosleep and so match what we already do for userspace.

bye, Roman
Re: [PATCH] msleep() with hrtimers
Hi,

On Fri, 3 Aug 2007, Arjan van de Ven wrote:

> On Fri, 2007-08-03 at 21:19 +0200, Roman Zippel wrote:
> > Hi,
> >
> > On Fri, 3 Aug 2007, Jonathan Corbet wrote:
> >
> > > Most comments last time were favorable. The one dissenter was
> > > Roman, who worries about the overhead of using hrtimers for this
> > > operation; my understanding is that he would rather see a
> > > really_msleep() function for those who actually want millisecond
> > > resolution. I'm not sure how to characterize what the cost could
> > > be, but it can only be buried by the fact that every call sleeps
> > > for some number of milliseconds. On my system, the several hundred
> > > total msleep() calls can't cause any real overhead, and almost all
> > > happen at initialization time.
> >
> > The main point is still that these are two _different_ APIs for
> > different usages, so I still prefer to add a hrsleep() instead.
>
> I would actually prefer it the other way around; call the
> not-so-accurate one "msleep_approx()" or somesuch, to make it explicit
> that the sleep is only approximate...

Actually the hrsleep() function would allow for submillisecond sleeps,
which might be what some of the 450 users really want, and they only use
msleep(1) because it's the next best thing. A hrsleep() function is
really what makes the most sense from an API perspective.

bye, Roman
Re: [PATCH] msleep() with hrtimers
Hi,

On Fri, 3 Aug 2007, Jonathan Corbet wrote:

> Most comments last time were favorable. The one dissenter was Roman,
> who worries about the overhead of using hrtimers for this operation; my
> understanding is that he would rather see a really_msleep() function
> for those who actually want millisecond resolution. I'm not sure how to
> characterize what the cost could be, but it can only be buried by the
> fact that every call sleeps for some number of milliseconds. On my
> system, the several hundred total msleep() calls can't cause any real
> overhead, and almost all happen at initialization time.

The main point is still that these are two _different_ APIs for
different usages, so I still prefer to add a hrsleep() instead.

bye, Roman
Re: CFS review
Hi,

On Wed, 1 Aug 2007, Linus Torvalds wrote:

> So I think it would be entirely appropriate to
>
>  - do something that *approximates* microseconds.
>
>    Using microseconds instead of nanoseconds would likely allow us to
>    do 32-bit arithmetic in more areas, without any real overflow.

The basic problem is that one needs a number of bits (at least 16) for
normalization, which limits the time range one can work with. This means
that 32 bit leaves only room for 1 millisecond resolution; the remainder
could maybe be saved and reused later. So AFAICT using micro- or
nanosecond resolution doesn't make much computational difference.

bye, Roman
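[Editor's note: the bit-budget argument above works out as follows - a
back-of-the-envelope sketch (an illustration with the 16-bit normalization
factor from the text, not kernel code):]

```python
# If 16 of the 32 bits are consumed by the 2^16 fixed-point normalization
# factor, only 16 bits remain for the time value itself.  The usable time
# range then depends directly on the chosen resolution.

FIXED_BITS = 16          # bits taken by the 2^16 normalization factor
VALUE_BITS = 32 - FIXED_BITS

def range_seconds(resolution_ns: int) -> float:
    """Time range representable in the remaining bits at a given resolution."""
    return (1 << VALUE_BITS) * resolution_ns / 1e9

print(f"microsecond resolution: {range_seconds(1_000):.3f} s of range")
print(f"millisecond resolution: {range_seconds(1_000_000):.1f} s of range")
```

Roughly 65 ms of range at microsecond resolution versus about 65 s at
millisecond resolution, which is why the text says 32 bits only leave room
for millisecond resolution once normalization is accounted for.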
Re: CFS review
Hi,

On Thu, 2 Aug 2007, Ingo Molnar wrote:

> Most importantly, CFS _already_ includes a number of measures that act
> against too frequent math. So even though you can see 64-bit math code
> in it, it's only rarely called if your clock has a low resolution - and
> that happens all automatically! (see below the details of this buffered
> delta math)
>
> I have not seen Roman notice and mention any of these important details
> (perhaps because he was concentrating on finding faults in CFS - which
> a reviewer should do), but those measures are still very important for
> a complete, balanced picture, especially if one focuses on overhead on
> small boxes where the clock is low-resolution.
>
> As Peter has said in his detailed review of Roman's suggested
> algorithm, our main focus is on keeping total complexity down - and we
> are (of course) fundamentally open to changing the math behind CFS, we
> ourselves tweaked it numerous times, it's not cast into stone in any
> way, shape or form.

You're comparing apples with oranges. I explicitly said: "At this point
I'm not that much interested in a few localized optimizations, what I'm
interested in is how this can be optimized at the design level." IMO
it's very important to keep computational and algorithmic complexity
separate; I want to concentrate on the latter, so unless you can _prove_
that a similar set of optimizations is impossible within my example, I'm
going to ignore them for now. CFS has already gone through several
versions of optimization and tuning; expecting the same from my design
prototype is a little confusing... I want to analyze the foundation CFS
is based on - in the review I mentioned a number of other issues and
design related questions. If you need more time, that's fine, but I'd
appreciate more background information related to that, and not that you
only jump on the more trivial issues.

> In Roman's variant of CFS's algorithm the variables are 32-bit, but the
> error is rolled forward in separate fract_* (fractional) 32-bit
> variables, so we still have 32+32==64 bit of stuff to handle. So we
> think that in the end such a 32+32 scheme would be more complex (and
> anyone who took a look at fs2.c would i think agree - it took Peter a
> day to decipher the math!)

Come on, Ingo, you can do better than that; I did mention in my review
some of the requirements for the data types. I'm amazed how you can get
to that judgement so quickly, could you please substantiate it a little
more? I admit that the lack of source comments is an open invitation for
further questions, and Peter did exactly this and his comments were
great - I'm hoping for more like that. You OTOH jump to conclusions
based on a partial understanding of what I'm actually trying to do.

Ingo, how about you provide some of the mathematical proof CFS is based
on? Can you prove that the rounding errors are irrelevant? Can you prove
that all the limit checks can have no adverse effect? I tried that and
I'm not entirely convinced, but maybe it's just me, so I'd love to see
someone else's attempt at this. A major goal of my design is to be able
to define the limits within which the scheduler is working correctly, so
I know which information is relevant and what can be approximated.

bye, Roman
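[Editor's note: the rounding-error question raised above can be probed
numerically. In the sketch below the weights are arbitrary examples and the
inverse weight is derived here as 2^32 // weight - the same construction the
scale-conversion tables use, but not values copied from any kernel source.]

```python
# Converting a value into the normalized scale and back again via
#   x * weight * inv_weight >> 32
# is not an identity for most weights: the truncated inverse loses a
# little on every round trip, which is one source of the accumulating
# error discussed in the thread.

def round_trip(x: int, weight: int) -> int:
    inv_weight = (1 << 32) // weight     # truncated inverse of the weight
    return (x * weight * inv_weight) >> 32

for weight in (1024, 820, 655, 3121):    # example weights (assumed values)
    x = 1_000_000
    y = round_trip(x, weight)
    print(f"weight={weight:5}: {x} -> {y} (error {x - y})")
```

Only weights that divide 2^32 exactly (such as 1024) survive the round trip
unchanged; for the others a small error appears on every conversion chain.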
Re: CFS review
Hi,

On Wed, 1 Aug 2007, Peter Zijlstra wrote:

> Took me most of today trying to figure out WTH you did in fs2.c, more
> math and fundamental explanations would have been good. So please bear
> with me as I try to recap this thing. (No, your code was very much
> _not_ obvious, a few comments and broken out functions would have made
> a world of a difference)

Thanks for the effort though. :) I know I'm not the best at explaining
these things, so I really appreciate the questions - they tell me what
to concentrate on.

> So, for each task we keep normalised time
>
>   normalised time := time/weight
>
> using Bresenham's algorithm we can do this perfectly (up until a renice
> - where you'd get errors)
>
>   avg_frac += weight_inv
>   weight_inv = X / weight
>
>   avg = avg_frac / weight0_inv
>   weight0_inv = X / weight0
>
>   avg = avg_frac / (X / weight0)
>       = (X / weight) / (X / weight0)
>       = X / weight * weight0 / X
>       = weight0 / weight
>
> So avg ends up being in units of [weight0/weight].
>
> Then, in order to allow sleeping, we need to have a global clock to
> sync with. Its this global clock that gave me headaches to reconstruct.
>
> We're looking for a time like this:
>
>   rq_time := sum(time)/sum(weight)
>
> And you commented that the /sum(weight) part is where CFS obtained its
> accumulating rounding error? (I'm inclined to believe the error will
> statistically be 0, but I'll readily accept otherwise if you can show a
> practical 'exploit')
>
> Its not obvious how to do this using modulo logic like Bresenham
> because that would involve using a gcm of all possible weights.

I think I've sent you off into the wrong direction somehow. Sorry. :)

Let's ignore the average for a second; normalized time is maintained as:

  normalized time := time * (2^16 / weight)

The important point is that I keep the value in full resolution of 2^-16
vsec units (vsec for virtual second or sec/weight, where every task gets
weight seconds for every virtual second; to keep things simpler I also
omit the nano prefix from the units for a moment). Compared to that, CFS
maintains a global normalized value in 1 vsec units. Since I don't round
the value down, I avoid the accumulating error; this means that

  time_norm += time_delta1 * (2^16 / weight)
  time_norm += time_delta2 * (2^16 / weight)

is the same as

  time_norm += (time_delta1 + time_delta2) * (2^16 / weight)

CFS for example does this:

  delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);

which in the above terms means

  time = time_delta * weight * (2^16 / weight_sum) / 2^16

The last shift now rounds the value down, and if one does that 1000
times per second, the resolution of the value that is finally accounted
to wait_runtime is reduced accordingly.

The other rounding problem is based on the fact that the term

  x * prio_to_weight[i] * prio_to_wmult[i] / 2^32

doesn't produce x for most values in those tables (the same applies to
the weight sum), so if we have chains where values are converted from
one scale to the other, a rounding error is produced. In CFS this
happens because wait_runtime is maintained in nanoseconds while
fair_clock is a normalized value. The problem here isn't that these
errors might have a statistical relevance - they are usually completely
overshadowed by measurement errors anyway. The problem is that these
errors exist at all; this means they have to be compensated somehow, so
that they don't accumulate over time and become significant. This also
has to be seen in the context of the overflow checks. All this adds a
number of variables to the system, which considerably increases
complexity and makes a thorough analysis quite challenging.
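[Editor's note: the per-call truncation described above can be shown with a
small numeric sketch (the weights and deltas are assumed example values, not
taken from CFS):]

```python
# Scaling each tick by (2^16 / weight_sum) and truncating per call loses
# up to one unit per call.  At ~1000 calls per second the loss adds up,
# whereas summing the deltas first and truncating once does not.

SHIFT = 16
weight, weight_sum = 1024, 3072            # assumed example values
inv = (1 << SHIFT) // weight_sum           # truncated 2^16 / weight_sum

def per_call(deltas):
    """Truncate after every call, as a per-tick update would."""
    return sum((d * weight * inv) >> SHIFT for d in deltas)

def batched(deltas):
    """Sum the deltas first, truncate once."""
    return (sum(deltas) * weight * inv) >> SHIFT

deltas = [999_999] * 1000                  # ~1000 ticks of ~1 ms, in ns
print(per_call(deltas), batched(deltas), batched(deltas) - per_call(deltas))
```

The gap between the two results is exactly the accumulated truncation error
for one second of ticks; keeping the fractional part (as the 2^-16 vsec
units above do) is what makes the incremental updates add up exactly.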
So to get back to the average: if you look for

  rq_time := sum(time)/sum(weight)

you won't find it like this. This basically produces a weighted average,
and I agree it can't really be maintained via the modulo logic (at least
AFAICT), so I'm using a simple average instead. If we have

  time_norm = time/weight

we can write your rq_time like this:

  weighted_avg = sum_{i}^{N}(time_norm_{i}*weight_{i})/sum_{i}^{N}(weight_{i})

This is the formula for a weighted average, so we can approximate the
value using a simple average instead:

  avg = sum_{i}^{N}(time_norm_{i})/N

This sum is now what I maintain at runtime incrementally:

  time_{i} = sum_{j}^{S}(time_{j})
  time_norm_{i} = time_{i}/weight_{i}
                = sum_{j}^{S}(time_{j})/weight_{i}
                = sum_{j}^{S}(time_{j}/weight_{i})

If I add this up and multiply by weight0 I get:

  avg*N*weight0 = sum_{i}^{N}(time_norm_{i})*weight0

and now I also have the needed modulo factors. The average probably
could be further simplified by using a different approximation.
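[Editor's note: the approximation above - replacing the weighted average
with a simple one - can be sketched numerically (the normalized times and
weights below are assumed example values):]

```python
# The per-runqueue reference time is a weighted average of the tasks'
# normalized times; the design approximates it with a simple (unweighted)
# average.  The approximation is good precisely when the normalized times
# stay close together - which is what the fairness goal enforces.

def weighted_avg(norm_times, weights):
    # sum(time_i) / sum(weight_i), with time_i = norm_i * weight_i
    return sum(t * w for t, w in zip(norm_times, weights)) / sum(weights)

def simple_avg(norm_times):
    return sum(norm_times) / len(norm_times)

norm = [10.0, 10.5, 9.8]        # normalized times, close together by design
weights = [1024, 2048, 512]     # assumed task weights

print(weighted_avg(norm, weights), simple_avg(norm))
```

With the normalized times within a fraction of a unit of each other, the
two averages differ only by a correspondingly small amount, regardless of
how skewed the weights are.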
Re: CFS review
Hi, On Wed, 1 Aug 2007, Linus Torvalds wrote:

> So I think it would be entirely appropriate to
>  - do something that *approximates* microseconds.
>
> Using microseconds instead of nanoseconds would likely allow us to do
> 32-bit arithmetic in more areas, without any real overflow.

The basic problem is that one needs a number of bits (at least 16) for normalization, which limits the time range one can work with. This means that 32 bit leaves only room for 1 millisecond resolution; the remainder could maybe be saved and reused later. So AFAICT using micro- or nanosecond resolution doesn't make much computational difference.

bye, Roman
Re: CFS review
Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote:

> > [...] I didn't say 'sleeper starvation' or 'rounding error', these are
> > your words and it's your perception of what I said.
>
> Oh dear :-) It was indeed my preception that yesterday you said:

*sigh* and here you go off again nitpicking on a minor issue just to prove your point... When I wrote the earlier stuff I hadn't realized it was resolution related, so things have to be put into proper context, and you make it a little easy for yourself by equating them. Yippi, you found another small error I made, can we drop this now? Please?

bye, Roman
Re: CFS review
Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote:

> Andi's theory cannot be true either, Roman's debug info also shows this
> /proc/PID/sched data:
>
>   clock-delta            :                   95
>
> that means that sched_clock() is in high-res mode, the TSC is alive and
> kicking and a sched_clock() call took 95 nanoseconds.
>
> Roman, could you please help us with this mystery?

Actually, Andi is right. What I sent you was generated directly after boot, as I had to reboot for the right kernel; a little later this appeared:

Aug  1 14:54:30 spit kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Aug  1 15:09:56 spit kernel: Clocksource tsc unstable (delta = 656747233 ns)
Aug  1 15:09:56 spit kernel: Time: pit clocksource has been installed.

bye, Roman
Re: CFS review
Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote:

> > > in that case 'top' accounting symptoms similar to the above are not
> > > due to the scheduler starvation you suspected, but due the effect of
> > > a low-resolution scheduler clock and a tightly coupled
> > > timer/scheduler tick to it.
> >
> > Well, it magnifies the rounding problems in CFS.
>
> why do you say that? 2.6.22 behaves similarly with a low-res
> sched_clock(). This has nothing to do with 'rounding problems'!
>
> i tried your fl.c and if sched_clock() is high-resolution it's scheduled
> _perfectly_ by CFS:
>
>   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
>  5906 mingo    20   0  1576  244  196 R 71.2  0.0  0:30.11 l
>  5909 mingo    20   0  1844  344  260 S  9.6  0.0  0:04.02 lt
>  5907 mingo    20   0  1844  508  424 S  9.5  0.0  0:04.01 lt
>  5908 mingo    20   0  1844  344  260 S  9.5  0.0  0:04.02 lt
>
> if sched_clock() is low-resolution then indeed the 'lt' tasks will
> "hide":
>
>   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
>  2366 mingo    20   0  1576  248  196 R 99.9  0.0  0:07.95 loop_silent
>     1 root     20   0  2132  636  548 S  0.0  0.0  0:04.64 init
>
> but that's nothing new. CFS cannot conjure up time measurement methods
> that do not exist. If you have a low-res clock and if you create an app
> that syncs precisely to the tick of that clock via timers that run off
> that exact tick then there's nothing the scheduler can do about it. It
> is false to charachterise this as 'sleeper starvation' or 'rounding
> error' like you did. No amount of rounding logic can create a
> high-resolution clock out of thin air.

Please calm down. You apparently already get worked up about one of the secondary problems. I didn't say 'sleeper starvation' or 'rounding error', these are your words and it's your perception of what I said. sched_clock() can have a low resolution, which can be a problem for the scheduler. That is all this program demonstrates. 
If and how this problem should be solved is a completely different issue, about which I haven't said anything yet, and since it's not that important right now I'll leave it at that.

bye, Roman
Re: CFS review
Hi, On Wed, 1 Aug 2007, Andi Kleen wrote:

> > especially if one already knows that
> > scheduler clock has only limited resolution (because it's based on
> > jiffies), it becomes possible to use mostly 32bit values.
>
> jiffies based sched_clock should be soon very rare. It's probably
> not worth optimizing for it.

I'm not so sure about that. sched_clock() has to be fast, so many archs may want to continue to use jiffies. As soon as one does that, one can also save a lot of computational overhead by using 32bit instead of 64bit. The question is then how easily that is possible.

bye, Roman
Re: CFS review
Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote:

> Please also send me the output of this script:
>
>   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

Sent privately.

> Could you also please send the source code for the "l.c" and "lt.c" apps
> you used for your testing so i can have a look. Thanks!

l.c is a simple busy loop (well, with the option to start many of them). This is lt.c; what it does is run a bit less than a jiffie, so it needs a low resolution clock to trigger the problem:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <time.h>
#include <sys/time.h>

#define NSEC 1000000000
#define USEC 1000000
#define PERIOD (NSEC / 1000)

int i;

void worker(int sig)
{
	struct timeval tv;
	long long t0, t;

	gettimeofday(&tv, 0);
	//printf("%u,%lu\n", i, tv.tv_usec);
	t0 = (long long)tv.tv_sec * USEC + tv.tv_usec + PERIOD / 1000 - 50;
	do {
		gettimeofday(&tv, 0);
		t = (long long)tv.tv_sec * USEC + tv.tv_usec;
	} while (t < t0);
}

int main(int ac, char **av)
{
	int cnt;
	timer_t timer;
	struct itimerspec its;
	struct sigaction sa;

	cnt = i = atoi(av[1]);
	sa.sa_handler = worker;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, 0);
	clock_gettime(CLOCK_MONOTONIC, &its.it_value);
	its.it_interval.tv_sec = 0;
	its.it_interval.tv_nsec = PERIOD * cnt;
	while (--i > 0 && fork() > 0)
		;
	its.it_value.tv_nsec += i * PERIOD;
	if (its.it_value.tv_nsec > NSEC) {
		its.it_value.tv_sec++;
		its.it_value.tv_nsec -= NSEC;
	}
	timer_create(CLOCK_MONOTONIC, 0, &timer);
	timer_settime(timer, TIMER_ABSTIME, &its, 0);
	printf("%u,%lu\n", i, its.it_interval.tv_nsec);
	while (1)
		pause();
	return 0;
}
Re: CFS review
Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote:

> * Roman Zippel <[EMAIL PROTECTED]> wrote:
>
> > [...] the increase in code size:
> >
> > 2.6.22:
> >    text    data     bss     dec     hex filename
> >   10150      24    3344   13518    34ce kernel/sched.o
> >
> > recent git:
> >    text    data     bss     dec     hex filename
> >   14724     228    2020   16972    424c kernel/sched.o
> >
> > That's i386 without stats/debug. [...]
>
> that's without CONFIG_SMP, right? :-) On SMP they are about net break
> even:
>
>    text    data     bss     dec     hex filename
>   26535    4173      24   30732    780c kernel/sched.o-2.6.22
>   28378    2574      16   30968    78f8 kernel/sched.o-2.6.23-git

That's still quite an increase in some rather important code paths and it's not just the code size, but also code complexity which is important - a major point I tried to address in my review.

> (plus a further ~1.5K per CPU data reduction which is not visible here)

That's why I mentioned the increased runtime memory usage...

bye, Roman
Re: CFS review
Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote:

> > [...] e.g. in this example there are three tasks that run only for
> > about 1ms every 3ms, but they get far more time than should have
> > gotten fairly:
> >
> >  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
> >  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
> >  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
> >  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l
>
> Mike and me have managed to reproduce similarly looking 'top' output,
> but it takes some effort: we had to deliberately run a non-TSC
> sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS.

I used my old laptop for these tests, where the TSC is indeed disabled due to instability. Otherwise the kernel was configured with CONFIG_HZ=1000.

> in that case 'top' accounting symptoms similar to the above are not due
> to the scheduler starvation you suspected, but due the effect of a
> low-resolution scheduler clock and a tightly coupled timer/scheduler
> tick to it.

Well, it magnifies the rounding problems in CFS. I mainly wanted to test the behaviour of CFS a little, and I thought I saw a patch which enabled the use of the TSC in these cases, so I didn't check sched_clock(). Anyway, I want to point out that this wasn't the main focus of what I wrote.

bye, Roman
Re: [ck] Re: Linus 2.6.23-rc1
Hi, On Sat, 28 Jul 2007, Linus Torvalds wrote:

> We've had people go with a splash before. Quite frankly, the current
> scheduler situation looks very much like the CML2 situation. Anybody
> remember that? The developer there also got rejected, the improvement was
> made differently (and much more in line with existing practices and
> maintainership), and life went on. Eric Raymond, however, left with a
> splash.

Since I was directly involved I'd like to point out a key difference. http://lkml.org/lkml/2002/2/21/57 was the very first start of Kconfig and initially I didn't plan on writing a new config system. At the beginning there was only the converter, which I did to address the issue that Eric created a completely new and different config database, so the converter was meant to create a more acceptable transition path. What happened next is that I didn't get a single response from Eric, so I continued hacking on it until it was complete. The key difference is now that Eric refused the offered help, while Con was refused the help he needed to get his work integrated. When Ingo posted his rewrite http://lkml.org/lkml/2007/4/13/180, Con had already pretty much lost. I have no doubt that Ingo can quickly transform an idea into working code and I would've been very surprised if he hadn't been able to turn it into something technically superior. When Ingo figured out how to implement fair scheduling in a better way, he didn't use this idea to help Con improve his work. He decided instead to work against Con and started his own rewrite. This is of course his right, but then he should also accept the responsibility that Con felt his years of work were ripped apart and in vain, and we have now lost a developer who tried to address things from a different perspective. 
bye, Roman
CFS review
Hi, On Sat, 14 Jul 2007, Mike Galbraith wrote:

> > On Fri, 13 Jul 2007, Mike Galbraith wrote:
> > >
> > > > The new scheduler does _a_lot_ of heavy 64 bit calculations without any
> > > > attempt to scale that down a little...
> > >
> > > See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity.
> > > Perhaps more can be done, but "without any attempt..." isn't accurate.
> >
> > Calculating these values at runtime would have been completely insane, the
> > alternative would be a crummy approximation, so using a lookup table is
> > actually a good thing. That's not the problem.
>
> I meant see usage. I more meant serious attempts.

At this point I'm not that much interested in a few localized optimizations, what I'm interested in is how this can be optimized at the design level (e.g. how can arch information be used to simplify things). So I spent quite a bit of time looking through cfs and experimenting with some ideas. I want to put the main focus on the performance aspect, but there are a few other issues as well.

But first something else (especially for Ingo): I tried to be very careful with any claims made in this mail, but this of course doesn't exclude the possibility of errors, in which case I'd appreciate any corrections. Any explanations done in this mail don't imply that anyone needs any such explanations; they're done to keep things in context, so that interested readers have a chance to follow even if they don't have the complete background information. Any suggestions made don't imply that they have to be implemented like this; they are more an incentive for further discussion and I'm always interested in better solutions.

A first indication that something may not be quite right is the increase in code size:

2.6.22:
   text    data     bss     dec     hex filename
  10150      24    3344   13518    34ce kernel/sched.o

recent git:
   text    data     bss     dec     hex filename
  14724     228    2020   16972    424c kernel/sched.o

That's i386 without stats/debug. 
A lot of the new code is in regularly executed regions and it's often not exactly trivial code, as cfs added lots of heavy 64bit calculations. With the increased text comes increased runtime memory usage, e.g. task_struct increased so that only 5 of them instead of 6 now fit into 8KB.

Since sched-design-CFS.txt doesn't really go into any serious detail, the EEVDF paper was more helpful, and after playing with the ideas a little I noticed that the whole idea of fair scheduling can be explained somewhat more simply; I'm a little surprised not to find it mentioned anywhere. So a different view on this is that the runtime of a task is simply normalized and the virtual time (or fair_clock) is the weighted average of these normalized runtimes. The advantage of normalization is that it makes things comparable: once the normalized time values are equal, each task got its fair share. It's more obvious in the EEVDF paper; cfs makes it a bit more complicated, as it uses the virtual time to calculate the eligible runtime, but it doesn't maintain a per process virtual time (fair_key is not quite the same).

Here we get to the first problem: cfs is not overly accurate at maintaining a precise balance. First, there are a lot of rounding errors due to the constant conversion between normalized and non-normalized values, and the higher the update frequency, the bigger the error. The effect of this can be seen by running:

	while (1)
		sched_yield();

and watching the sched_debug output, where the underrun counter goes crazy. cfs thus needs the limiting to keep this misbehaviour under control. The problem here is that it's not that difficult to hit one of the many limits, which may change the behaviour and makes it hard to predict how cfs will behave under different situations. 
The next issue is scheduler granularity; here I don't quite understand why the actual running time has no influence at all, which makes it difficult to predict how much cpu time a process will get at a time (even the comments only refer to the vmstat output). What is basically used instead is the normalized time since it was enqueued, and practically it's a bit more complicated, as fair_key is not entirely a normalized time value. If the wait_runtime value is positive, higher prioritized tasks are given even more priority than they already get from their larger wait_runtime value. The problem here is that this triggers underruns and lower priority tasks get even less time.

Another issue is the sleep bonus given to sleeping tasks. A problem here is that this can be exploited: if a job is spread over a few threads, they can get more time relative to other tasks, e.g. in this example there are three tasks that run only for about 1ms every 3ms, but they get far more time than they should have gotten fairly:

 4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
 4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
 4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
 4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l
Re: [PATCH] i2o: defined but not used.
Hi,

On Saturday 21 July 2007, Andrew Morton wrote:
> On Sat, 21 Jul 2007 00:58:01 +0200 Sebastian Siewior <[EMAIL PROTECTED]> wrote:
> > Got with randconfig
>
> randconfig apparently generates impossible configs. Please always
> run `make oldconfig' after the randconfig, then do the test build.

If that should make any difference, that would be a bug and I'd like to see that .config.

bye, Roman
Re: [PATCH] LinuxPPS - definitive version
Hi,

On Tuesday 24 July 2007, Rodolfo Giometti wrote:
> By doing:
>
> struct pps_ktime {
> 	__u64 sec;
> -	__u32 nsec;
> +	__u64 nsec;
> };

Just using __u32 for both works as well...

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Ingo Molnar wrote:
> > Why do you constantly stress level 19? Yes, that one is special, all
> > other positive levels were already relatively consistent.
>
> i constantly stress it for the reason i mentioned a good number of
> times: because it's by far the most commonly used (and complained about)
> nice level. B-)

How do you know that? Does most complained about make it most commonly used?

> but because you are asking, i'm glad to give you some first-hand
> historic background about Linux nice levels (in case you are interested)
> and the motivations behind their old and new implementations:

I guess I should be thankful now? I'm curious why you post this now, after I "asked" about this. Most of the information is either rather generic or not specific enough for the problem at hand. If you had posted this information earlier, it would have been far more valuable, as it could have been a nice base for a discussion. But posting it this late, I can't lose the feeling you're more interested in "teaching" me.

> nice levels were always so weak under Linux (just read Peter's report)

-ENOLINK

> Hope this helps,

Not completely. For negative nice levels you mentioned audio apps, but these aren't really interested in a fair share; they would use the higher percentage only to guarantee they get the amount of time they need independent of the current load. I think they would be better served with e.g. a deadline scheduler, which guarantees them an absolute time share, not a relative one. On the other end, with positive levels I remember more requests for something closer to idle scheduling, where a process only runs when nothing else is running. So assuming we had scheduling classes for the above use cases, what other reasons are left for such extreme nice levels? My proposed nice levels otherwise have the same properties as yours (e.g. being consistent). There is one property you haven't commented on at all yet.
My proposed levels give the average user a far better idea what they actually mean, i.e. that every 5 levels the cpu time a process gets doubles/halves. This is IMO a considerable advantage.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Ingo Molnar wrote:
> > > [more rude insults deleted]
> > > I've been waiting for that obvious question, and i _might_ be able
> > > to answer it, but somehow it never occured to you ;-) Thanks,
>
> the ";-)" emoticon (and its contents) clearly signals this as a
> sarcastic, tongue-in-cheek remark.

To take another example of why this is still insulting and inappropriate: this is behaviour I would characterize as school bullying. A bully attacks someone obviously weaker than himself, for example takes something away and then continues with "If you ask nicely I'll give it back to you.", often accompanied by laughter to signal he's enjoying himself and the power he has, but for the other person it's anything but funny. Maybe you don't know what that feels like, but I do, and I can't find anything funny, sarcastic or whatever about this, no matter how many smileys or other tags you add there. If the communication is already as troubled as this, such "humor" is really the worst thing you can do, and I find it rather sad that you can't realize this yourself.

> ok? (If you didnt see/read it as sarcastic straight away then my
> apologies for insulting you!)

Sorry, that is too little too late. You've apologized before, and you continued to make fun of me personally to the point of spreading wrong information about me, which you could have very easily verified yourself, if you only wanted. What I want from you is that you treat me with respect and keep your "sarcasm" to yourself. I told you very clearly what I think about you requoting this crap, and yet you repeat it again _twice_, so on the one hand I get this apology attempt and on the other hand you continue to kick me in the crotch? How am I supposed to feel about this? It's also always interesting what you don't respond to. I asked you for examples which would prove the (rather strong) assertions you made about me; what does it tell me now if you can't back up your statements?
bye, Roman
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Mon, 16 Jul 2007, Jonathan Corbet wrote:
> > That's a bit my problem - we have to consider other setups as well.
> > Is it worth converting all msleep users behind their back or should we
> > just provide a separate function for those who care?
>
> Any additional overhead is clearly small - enough not to disrupt a *high
> resolution* timer, after all.

If you already use high resolution timers, you also need a fast time source, so in that case it indeed doesn't matter much how you sleep.

> And msleep() is used mostly during
> initialization time. My box had a few hundred calls, almost all during
> boot. Any cost will be bounded by the fact that, well, it sleeps for
> milliseconds on every call.

Well, there are over 1500 msleep calls, so I'm not sure they're mostly during initialization.

> I'm not *that* attached to this patch; if it causes heartburn we can
> just forget it. But I had thought it might be useful...

I'm not against using hrtimer in drivers; if you add a hrsleep() function and use that, that would be perfectly fine. The really important point is to keep our APIs clean, so it's obvious who is using what. The requirements for both timers are different, so there should be a choice in what to use.

> > Which driver is this? I'd like to look at this, in case there's some other
> > hidden problem.
>
> drivers/media/video/cafe_ccic.c, and cafe_smbus_write_data() in
> particular. The "hidden problem," though, is that the hardware has
> periods where reading the status registers will send it off into its
> room where it will hide under its bed and never come out.

It's indeed not a trivial problem, as it's not localized to the driver (the request comes from generic code). The most elegant and general solution might be to move such initialization sequences into a separate thread, where they don't hold up the rest.
> My understanding is that the current dyntick code only turns off the
> tick during idle periods; while things are running it's business as
> usual. Perhaps I misunderstood?

jiffies needs to be updated; theoretically one could reduce the timer tick even then, but one has to be careful about the increased resolution, so jiffies+1 isn't enough anymore to round it up. In general it's doable by further cleaning up our APIs, but here it's really important to keep the APIs clean to keep Linux running on a wide range of hardware. It should be clear whether one requests a low resolution, but low overhead timer or a high resolution and more precise timer (and _please_ ignore that "likely to expire" stuff). It's e.g. a possibility to map everything to high resolution timers on hardware which can deal with this, but on other hardware that's not possible without paying a significant price.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Ingo Molnar wrote:
> _changing_ it is an option within reason, and we've done it a couple of
> times already in the past, and even within CFS (as Peter correctly
> observed) we've been through a couple of iterations already. And as i
> mentioned it before, the outer edge of nice levels (+19, by far the most
> commonly used nice level) was inconsistent to begin with: 3%, 5%, 9% of
> nice-0, depending on HZ.

Why do you constantly stress level 19? Yes, that one is special; all other positive levels were already relatively consistent.

> So changing that to a consistent (and
> user-requested)

How old is CFS and how many users did it have so far? How many users has the old scheduler, which will be exposed to the new one soon?

> 1.5% is a much smaller change than you seem to make it
> out to be.

The percentage levels are off by a factor of up to _seven_; sorry, I fail to see how you can characterize this as "small".

> So by your standard we could never change the
> scheduler. (which your ultimate argument might be after all B-)

Careful, you make assertions about me for which you have absolutely no base; adding a smiley doesn't make this any funnier.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Peter Zijlstra wrote:
> The only expectation is that a process with a lower nice level gets more
> time. Any other expectation is a bug.

Yes, users are buggy, they expect a lot of stupid things... Is this really reason enough to break this? What exactly is the damage if setpriority() accepts a few more levels?

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Peter Zijlstra wrote:
> By breaking the UNIX model of nice levels. Not an option in my book.

BTW what is the "UNIX model of nice levels"? SUS specifies the limit via NZERO, which is defined as "Minimum Acceptable Value: 20"; I can't find any information that it must be 20.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Peter Zijlstra wrote:
> By breaking the UNIX model of nice levels. Not an option in my book.

Breaking user expectations of nice levels is?

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Peter Zijlstra wrote:
> I actually like the extra range, it allows for a much softer punch of
> background tasks even on somewhat slower boxen.

The extra range is not really a problem; in http://www.ussg.iu.edu/hypermail/linux/kernel/0707.2/0850.html I suggested how we can have both.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Wed, 18 Jul 2007, Ingo Molnar wrote:
> > > Roman, please do me a favor, and ask me the following question:
> > >
> > > [insult deleted]
>
> In this discussion about
> nice levels you were (very) agressively asserting things that were
> untrue,

Instead of simply asserting things, how about you provide some examples? I have made a single mistake so far, mixing up nice levels 18 and 19. If you would point me to such examples, I could learn how to tone it down a little, since the nice levels are not the only issue I have with the new scheduler; the heavy stuff is still to come. The problem here is that there is too much burnt ground, so I can't just present raw ideas, which get flamed by you; I have to be sufficiently confident they are valid, which you might then interpret as "aggressive assertion".

> you were suggesting that i dont understand the code,

Again, please point me to examples, so I at least have a chance to clear things up, since it was never my intention to make such a suggestion, but this gives me no chance to defend myself. OTOH I can tell you exactly how you continuously insult me, e.g. by suggesting I ask "stupid questions" or that I'm in "denial of facts". Don't make such suggestions if you have no idea how insulting they are. Especially the one deleted insult above, where you have the impertinence to quote it: such a tone is more appropriate between lord and inferior, where the latter has to make a request and the former "might" grant it. _Never_ make me beg. :-(

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Tue, 17 Jul 2007, Ingo Molnar wrote:
> * Roman Zippel <[EMAIL PROTECTED]> wrote:
>
> > > > It's nice that these artifacts are gone, but that still doesn't
> > > > explain why this ratio had to be increase that much from around
> > > > 1:10 to 1:69.
> > >
> > > More dynamic range is better? If you actually want a task to get 20x
> > > the CPU time of another, the older scheduler doesn't really allow
> > > it.
> >
> > You can already have that, the complete range level from 19 to -20 was
> > about 1:80.
>
> But that is irrelevant: all tasks start out at nice 0, and what matters
> is the dynamic range around 0.
>
> So the dynamic range has been made uniform in the positive from
> 1:10...1:20...1:30 to 1:69 for nice +19, and from 1:8 to 1:69 in the
> minus. (with 1:86 nice -20) If you look at the negative nice levels
> alone it's a substantial increase but if you compare it with positive
> nice levels you'll similar kinds of dynamic ranges were already present
> in the old scheduler and you'll see why we've done it.

So let's look at them:

for (i=0;i<20;i++) print i, " : ", (20-i)*5, " : ", 100*1.25^-i, " : ", e(l(2)*(-i/5))*100, "\n";

 0 : 100 : 100 : 100.
 1 : 95 : 80. : 87.05505632961241391300
 2 : 90 : 64. : 75.78582832551990411700
 3 : 85 : 51.2000 : 65.97539553864471296900
 4 : 80 : 40.9600 : 57.43491774985175034000
 5 : 75 : 32.7680 : 50.
 6 : 70 : 26.2144 : 43.52752816480620695700
 7 : 65 : 20.97152000 : 37.89291416275995205900
 8 : 60 : 16.77721600 : 32.98769776932235648400
 9 : 55 : 13.42177280 : 28.71745887492587517000
10 : 50 : 10.73741824 : 25.
11 : 45 : 8.589934592000 : 21.76376408240310347800
12 : 40 : 6.871947673600 : 18.94645708137997602900
13 : 35 : 5.497558138880 : 16.49384888466117824200
14 : 30 : 4.398046511104 : 14.35872943746293758500
15 : 25 : 3.5184372088832000 : 12.5000
16 : 20 : 2.8147497671065600 : 10.88188204120155173900
17 : 15 : 2.2517998136852480 : 9.47322854068998801400
18 : 10 : 1.8014398509481984 : 8.24692444233058912100
19 : 5 : 1.44115188075855872000 : 7.17936471873146879200

(nice level : old % : new % : my suggested %)

Your levels diverge very quickly from what they used to be (up to a factor of 7); it's also not really easy to remember what the individual levels mean. I at least try to keep them somewhat in the range they used to be (and the difference is limited to a factor of about 2); also, every 5 levels the amount of cpu time is halved, which is very easy to remember. If you need more dynamic range, is there a law that prevents us from going beyond 19? For example:

for (i=20;i<=30;i++) print i, " : ", (20-i)*5, " : ", 100*1.25^-i, " : ", e(l(2)*(-i/5))*100, "\n";

20 : 0 : 1.15292150460684697600 : 6.2500
21 : -5 : .92233720368547758000 : 5.44094102060077586900
22 : -10 : .73786976294838206400 : 4.73661427034499400700
23 : -15 : .59029581035870565100 : 4.12346222116529456000
24 : -20 : .47223664828696452100 : 3.58968235936573439600
25 : -25 : .37778931862957161700 : 3.1250
26 : -30 : .30223145490365729300 : 2.72047051030038793400
27 : -35 : .24178516392292583400 : 2.36830713517249700300
28 : -40 : .19342813113834066700 : 2.06173111058264728000
29 : -45 : .15474250491067253400 : 1.79484117968286719800
30 : -50 : .12379400392853802700 : 1.5625

setpriority() accepts such values without error.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Tue, 17 Jul 2007, Ingo Molnar wrote: > Roman, please do me a favor, and ask me the following question: > > " Ingo, you've been maintaining the scheduler for years. In fact you >wrote the old nice code we are talking about here. You changed it a >number of times since then. So you really know what's going on here. >Why does the old nice code behave like that for nice +19 levels? " > > I've been waiting for that obvious question, and i _might_ be able to > answer it, but somehow it never occured to you ;-) Thanks,

Do you have any idea how insulting and arrogant this is? Let me translate for you how this arrived: "O Ingo, who art our god of the scheduler. You have blessed the paths I walked in. You kept me from sinning numerous times. Your wisdom is infinite. Guide me on the journey that layeth ahead of me into this world knowledge of Your truth." (I apologize in advance if I have hurt anyone's religious feelings.)

It's obvious that you have more experience with the scheduler code, but does that make you infallible? Does that give you the right to act like a jerk? I do make mistakes, I try to learn from them and life goes on, I have no problem with that, but what I do have a problem with is someone abusing this to his own advantage. I have to be extremely careful what I say to you, because you jump on the first small mistake and I have to bear your insults like "there's nothing i can do about your denial of facts - that is your own private problem." I have no problems with facts, I'm only trying very hard to ignore your arrogant behaviour...

If you have something to contribute to this discussion which might clear things up, then just say it, but I'm not going to beg for it. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Tue, 17 Jul 2007, I wrote: > Playing around with some other nice levels confirms the theory that > something is a little off, so I'm quite correct in saying that the ratio > _should_ be 1:10. Rechecking everything, there was actually a small error in my test program, so the ratio should be 1:20. Sorry about that mistake. Nice level 19 shows the largest artifacts, as that level only gets a single tick, so the ratio is often 1:HZ/10 (except for 1000HZ where it's 5:100). Nevertheless it's still true that in general nice levels were independent of HZ (that's all I wanted to say a couple of mails ago). Ingo, you can start gloating now, but contrary to you I have no problems with admitting mistakes and apologizing for them. The point is just that I react better to factual arguments instead of flames (and I think it's not just me), so I'm pretty sure I'm still correct about this: > OTOH you are the one who is wrong about me (again). :-( bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
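The HZ-dependent artifact described here can be sketched numerically. Assumptions not stated in the mail: that the old scheduler's timeslices follow roughly (20 - nice) * 5 ms (consistent with the "old %" column in the table elsewhere in this thread), accounted in whole ticks of 1000/HZ ms with a one-tick minimum:

```python
# Sketch of the nice 19 artifact (assumed timeslice formula:
# (20 - nice) * 5 ms, so nice 0 gets 100 ms and nice 19 gets 5 ms;
# accounting happens in whole ticks, with a one-tick minimum).
def ticks(slice_ms, hz):
    return max(1, slice_ms * hz // 1000)

for hz in (100, 300, 1000):
    print(hz, ticks(100, hz), ticks(5, hz))
# HZ=100  -> 10 vs 1 ticks, ratio 1:10
# HZ=300  -> 30 vs 1 ticks, ratio 1:30
# HZ=1000 -> 100 vs 5 ticks, ratio 5:100
```

Under these assumptions the numbers reproduce the "1:HZ/10 (except for 1000HZ where it's 5:100)" observation: nice 19's 5 ms slice rounds to a single tick at HZ <= 200.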
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Tue, 17 Jul 2007, Ingo Molnar wrote: > * Roman Zippel <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > > > and note that even on the old scheduler, nice-0 was "3200% more > > > powerful" than nice +19 (with CONFIG_HZ=300), > > > > How did you get that value? At any HZ the ratio should be around 1:10 > > (+- rounding error). > > you are wrong again. I sent you the numbers earlier today already: > > | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > | 2332 mingo 25 0 1580 248 196 R 95.1 0.0 0:11.84 loop > | 2335 mingo 39 19 1576 244 196 R 3.1 0.0 0:00.39 loop > > 3.1% is 3067% more than 95.1%, and the ratio is 1:30.67. You again deny > above that this is the case, and there's nothing i can do about your > denial of facts - that is your own private problem.

Ingo, how am I supposed to react to this? I'm asking a simple question and I get this? I'm at a serious loss how to deal with you. :-(

The above is based on theoretical values: for a 300HZ kernel these two processes should get 30 and 3 ticks. Should there be any rounding error or off-by-one error, so that the processes get one tick less than they should or one tick is accounted to the wrong process, my theoretical value is still within the possible error range and doesn't contradict your practical values. Playing around with some other nice levels confirms the theory that something is a little off, so I'm quite correct in saying that the ratio _should_ be 1:10. OTOH you are the one who is wrong about me (again). :-( bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > and note that even on the old scheduler, nice-0 was "3200% more > powerful" than nice +19 (with CONFIG_HZ=300), How did you get that value? At any HZ the ratio should be around 1:10 (+- rounding error). > in fact i like it that nice -20 has a slightly bigger punch than it used > to have before: "Slightly bigger"??? You're joking, right? Especially the user levels are doing something completely different now, which may break user expectation. While the user couldn't expect anything precise, it's still a big difference whether a process at nice 5 gets 75% of the time or only 30%. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Matt Mackall wrote: > > It's nice that these artifacts are gone, but that still doesn't explain > > why this ratio had to be increased that much from around 1:10 to 1:69. > > More dynamic range is better? If you actually want a task to get 20x > the CPU time of another, the older scheduler doesn't really allow it. You can already have that, the complete range from level 19 to -20 was about 1:80. There is also something like too much range: I tried it with top at 19, and as soon as something runs at -20 it's practically dead, because it now gets only 1/5900 of the cpu time. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
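The 1/5900 figure follows from the per-level weight step. A rough check (pure 1.25x steps; the kernel's integer weight table rounds each level, which accounts for the small difference from the quoted figure):

```python
# Nice 19 and nice -20 are 39 weight steps of ~1.25x apart under CFS,
# so the extreme-to-extreme ratio is roughly:
ratio = 1.25 ** 39
print(round(ratio))  # about 6000, the order of the 1/5900 figure above
```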
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Linus Torvalds wrote: > How about trying a much less aggressive nice-level (and preferably linear, > not exponential)? I think the exponential increase isn't the problem. The old code approximated something like this rather crudely, with the result that there was a big gap between level 0 and -1. Something like this: echo 'for (i=-20;i<=20;i++) print i, " : ", 1024*e(l(2)*(-i/20*3)), "\n";' | bc -l would produce a range similar to the old code. Replacing the factor 3 with 4 would be IMO a more reasonable increase and would have the advantage for the user that it's easier to understand that every 5 levels the time a process gets is doubled. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
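The bc one-liner computes weight(i) = 1024 * 2^(-i*3/20). A sketch (helper name is my own) comparing the factor 3, which approximates the old range, with the suggested factor 4, which halves the share exactly every 5 levels:

```python
# weight(i) = 1024 * 2^(-i * factor / 20); factor 3 mimics the old
# scheduler's range, factor 4 makes the CPU share halve exactly every
# 5 nice levels (2^(-5*4/20) = 1/2).
def weight(i, factor):
    return 1024 * 2 ** (-i * factor / 20)

print(weight(5, 4) / weight(0, 4))     # 0.5: halved after 5 levels
print(weight(-20, 3) / weight(20, 3))  # 64.0: full-range ratio with factor 3
```

With factor 3 the full -20..20 range spans 2^6 = 64, in the neighbourhood of the old ~1:80 range mentioned in this thread.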
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > I explained it numerous times (remember the 'timeout' vs. 'timer event' > discussion?) that i consider timer granularity important to scalability. > Basically, in every case where we know with great certainty that a > time-out will _not_ occur (where the time-out is in essence just an > exception handling mechanism), using struct timer_list is the best > solution. Whether the timer expires or not is in many cases completely irrelevant. You need special cases where the timer wheel behaviour becomes an issue, and whether hrtimers would behave any better in such situations is questionable. Again, for the average user such details are pretty much irrelevant. > what i consider harmful on the other hand are all the HZ assumptions > embedded into various pieces of code. The most harmful ones are design > details that depend on HZ and kernel-internal API details that depends > on HZ. Yes, NTP was such an example, and it was hard to fix, and you > didnt help much with that. Stop spreading lies! :-( One only has to look at the history of kernel/time/ntp.c. John's rather simple "HZ free ntp" patch wouldn't have been that simple without all the cleanup patches before it done by me, which were precisely intended to make this possible. > (perhaps that is one source of this > increasingly testy exchange ;-) No, it's your prejudice against me based on wrong facts. Get your facts straight and stop being an ass. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Jonathan Corbet wrote: > > One possible problem here is that setting up that timer can be > > considerably more expensive, for a relative timer you have to read the > > current time, which can be quite expensive (e.g. your machine now uses the > > PIT timer, because TSC was deemed unstable). > > That's a possibility, I admit I haven't benchmarked it. I will say that > I don't think it will be enough to matter - msleep() is not a hot-path > sort of function. Once the system is up and running it almost never > gets called at all - at least, on my setup. That's part of my problem - we have to consider other setups as well. Is it worth converting all msleep users behind their back, or should we just provide a separate function for those who care? I would really like to keep hrtimers and kernel timers separate and make it obvious who is using what, as the usage requirements are somewhat different. > > One question here would be, is it really a problem to sleep a little more? > > "A little more" is a bit different than "twenty times as long as you > asked for." That "little bit more" added up to a few seconds when > programming a device which needs a brief delay after tweaking each of > almost 200 registers. Which driver is this? I'd like to look at it, in case there's some other hidden problem. > > BTW there is another thing to consider. If you already run with hrtimer/ > > dyntick, there is not much reason to keep HZ at 100, so you could just > > increase HZ to get the same effect. > > Except that then, with the current implementation, you're paying for the > higher HZ whenever the CPU is busy. I bet that doesn't take long to > overwhelm any added overhead in the hrtimer msleep(). Actually, if that's the case I'd consider it a bug - where is that extra cost coming from? 
> In the end, I did this because I thought msleep() should do what it > claims to do, because I thought that getting a known-to-expire timeout > off the timer wheel made sense, and to make a tiny baby step in the > direction of reducing the use of jiffies in the core code. I know that Ingo considers everything HZ related evil, but it really is not - it keeps Linux scalable. Unless you need the high resolution, the timer wheel performance is still pretty hard to beat. That "known-to-expire" stuff is really the least significant problem to consider here, please just forget about it. I don't want to keep anyone from using hrtimers - if it's just some driver, go wild - but in generic code we have to consider portability issues. Using jiffies as a time base is still unbeatably cheap in the general case, so we have to carefully consider whether using a different time source is required. There is nothing wrong with using jiffies if it fits the bill, and in many cases it still does. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > because when i assumed the obvious, you called it an > insult so please dont leave any room for assumptions and remove any > ambiguity - especially as our communication seems to be marred by what > appears to be frequent misunderstandings ;-) What the hell is this supposed to be? How am I not to take this: "i did not want to embarrass you (and distract the discussion) with answering a pretty stupid, irrelevant question" as an insult? How does it reflect on someone if he asks "stupid, irrelevant questions"? If it had been a misunderstanding, you could have asked appropriately instead of assuming I'm an idiot. BTW, adding smileys to it doesn't make it any funnier, since I don't believe in the "misunderstandings" theory anymore. :-( bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Sun, 15 Jul 2007, Jonathan Corbet wrote: > The OLPC folks and I recently discovered something interesting: on a > HZ=100 system, a call to msleep(1) will delay for about 20ms. The > combination of jiffies timekeeping and rounding up means that the > minimum delay from msleep will be two jiffies, never less. That led to > multi-second delays in a driver which does a bunch of short msleep() > calls and, in response, a change to mdelay(), which will come back in > something closer to the requested time. > > Here's another approach: a reimplementation of msleep() and > msleep_interruptible() using hrtimers. On a system without real > hrtimers this code will at least drop down to single-jiffy delays much > of the time (though not deterministically so). On my x86_64 system with > Thomas's hrtimer/dyntick patch applied, msleep(1) gives almost exactly > what was asked for. BTW there is another thing to consider. If you already run with hrtimer/ dyntick, there is not much reason to keep HZ at 100, so you could just increase HZ to get the same effect. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
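The ~20ms behaviour quoted above comes from jiffy rounding; a sketch of the arithmetic (not the kernel's actual msleep() code):

```python
# Why msleep(1) sleeps ~20ms at HZ=100 (a sketch of the rounding, not
# kernel code): the request is rounded up to whole jiffies, plus one
# extra jiffy since the current tick may be almost over when the timer
# is armed.
def msleep_min_ms(msecs, hz):
    ms_per_jiffy = 1000 // hz
    jiffies = -(-msecs // ms_per_jiffy) + 1  # ceiling division, plus one jiffy
    return jiffies * ms_per_jiffy

print(msleep_min_ms(1, 100))   # 20 ms at HZ=100
print(msleep_min_ms(1, 1000))  # 2 ms at HZ=1000
```

The same arithmetic shows why simply raising HZ, as suggested here, shrinks the error even without hrtimers.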
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > to sum it up: a nice +19 task (the most commonly used nice level in > practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler, > depending on the value of HZ. This is quite inconsistent and illogical. You're correct that you can find artifacts in the extreme cases; it's subjective whether this is a serious problem. It's nice that these artifacts are gone, but that still doesn't explain why this ratio had to be increased that much from around 1:10 to 1:69. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > > Well, you cut out the major question from my initial mail: > > One question here would be, is it really a problem to sleep a little more? > > oh, i did not want to embarrass you (and distract the discussion) with > answering a pretty stupid, irrelevant question that has the obvious > answer even for the most casual observer: "yes, of course it really is a > problem to sleep a little more, read the description of the fine patch > you are replying to" ... And your insults continue... :-( I ask a simple question and try to explore alternative solutions and this is your contribution to it? To put this into a little more context, this is the complete text you cut off: | One question here would be, is it really a problem to sleep a little more? | Another possibility would be to add another sleep function, which uses | hrtimer and could also take ktime argument. So instead of considering this suggestion, you just read what you want out of what I wrote and turn everything into an insult. Nicely done, Ingo. :-( bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > > As soon as you add another loop the difference changes again, > > > > while it's always correct to say it gets 25% more cpu time [...] > > > > > > yep, and i'll add the relative effect to the comment too. > > > > Why did you cut off the rest of the sentence? > > (no need to become hostile, i answered to that portion of your sentence > separately, which was logically detached from the other portion of your > sentence. I marked the cut with the '[...]' sign. ) Could you please stop with these accusations? Could you please point me to the mail with the separate answer? > > To illustrate the problem a little different: a task with a nice level > > -20 got around 700% more cpu time (or 8 times more), now it gets 8500% > > more cpu time (or 86.7 times more). You don't think that change to the > > nice levels is a little drastic? > > This was discussed on lkml in detail, see the CFS threads. Which are quite big, so I skipped most of it; a more precise pointer would be appreciated. > It has been a > common request for nice levels to be more logical (i.e. to make them > universal and to detach them from HZ) and for them to be more effective > as well. Huh? What has this to do with HZ? The scheduler used ticks internally, but that's irrelevant to what the user sees via the nice levels. So the question still stands whether this change may be a little drastic, as you changed the nice levels of _all_ users, not just of those who were previously interested in CFS. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > i'm not sure how your question relates/connects to what i wrote above, > could you please re-phrase your question into a bit more verbose form so > that i can answer it? Thanks, Well, you cut out the major question from my initial mail: One question here would be, is it really a problem to sleep a little more? bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > > As soon as you add another loop the difference changes again, while > > it's always correct to say it gets 25% more cpu time [...] > > yep, and i'll add the relative effect to the comment too. Why did you cut off the rest of the sentence? To illustrate the problem a little different: a task with a nice level -20 got around 700% more cpu time (or 8 times more), now it gets 8500% more cpu time (or 86.7 times more). You don't think that change to the nice levels is a little drastic? bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
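The "86.7 times more" figure quoted here can be checked from the per-level multiplier: under CFS, nice -20 sits 20 weight steps of 1.25x above nice 0.

```python
# "86.7 times more": 20 weight steps of 1.25x between nice 0 and nice -20.
print(round(1.25 ** 20, 1))  # 86.7
```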
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > i dont think there's any significant overhead. The OLPC folks are pretty > sensitive to performance, How is a sleep function relevant to performance? bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Sun, 15 Jul 2007, Jonathan Corbet wrote: > Here's another approach: a reimplementation of msleep() and > msleep_interruptible() using hrtimers. On a system without real > hrtimers this code will at least drop down to single-jiffy delays much > of the time (though not deterministically so). On my x86_64 system with > Thomas's hrtimer/dyntick patch applied, msleep(1) gives almost exactly > what was asked for. One possible problem here is that setting up that timer can be considerably more expensive: for a relative timer you have to read the current time, which can be quite expensive (e.g. your machine now uses the PIT timer, because the TSC was deemed unstable). One question here would be, is it really a problem to sleep a little more? Another possibility would be to add another sleep function, which uses hrtimers and could also take a ktime argument. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > yes, the weight multiplier 1.25, but the actual difference in CPU > utilization, when running two CPU intense tasks, is ~10%: > > PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND > 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop > 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop > > so the first task 'wins' +10% CPU utilization (relative to the 50% it > had before), the second task 'loses' -10% CPU utilization (relative to > the 50% it had before). As soon as you add another loop the difference changes again, while it's always correct to say it gets 25% more cpu time (which I still think is a little too much). bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
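The top output quoted above matches plain proportional-share arithmetic; a sketch (one 1.25x weight step between the two loops, helper name my own):

```python
# Two busy loops one nice level apart split the CPU in proportion to
# their weights; adding a third loop changes the absolute difference,
# while the 1.25x weight ratio between the first two stays fixed.
def shares(weights):
    total = sum(weights)
    return [round(w / total * 100, 1) for w in weights]

print(shares([1.25, 1.0]))       # [55.6, 44.4], matching the ~55/45 top output
print(shares([1.25, 1.0, 1.0]))  # the gap shrinks once a third loop runs
```

This is the point being made: "25% more" is the invariant statement, while the percentage-point difference depends on how many other tasks are running.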
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: yes, the weight multiplier 1.25, but the actual difference in CPU utilization, when running two CPU intense tasks, is ~10%: PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop so the first task 'wins' +10% CPU utilization (relative to the 50% it had before), the second task 'loses' -10% CPU utilization (relative to the 50% it had before). As soon as you add another loop the difference changes again, while it's always correct to say it gets 25% more cpu time (which I still think is a little too much). bye, Roman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Sun, 15 Jul 2007, Jonathan Corbet wrote: Here's another approach: a reimplementation of msleep() and msleep_interruptible() using hrtimers. On a system without real hrtimers this code will at least drop down to single-jiffy delays much of the time (though not deterministically so). On my x86_64 system with Thomas's hrtimer/dyntick patch applied, msleep(1) gives almost exactly what was asked for. One possible problem here is that setting up that timer can be considerably more expensive, for a relative timer you have to read the current time, which can be quite expensive (e.g. your machine now uses the PIT timer, because TSC was deemed unstable). One question here would be, is it really a problem to sleep a little more? Another possibility would be to add another sleep function, which uses hrtimer and could also take ktime argument. bye, Roman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: i dont think there's any significant overhead. The OLPC folks are pretty sensitive to performance, How is a sleep function relevant to performace? bye, Roman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: As soon as you add another loop the difference changes again, while it's always correct to say it gets 25% more cpu time [...] yep, and i'll add the relative effect to the comment too. Why did you cut off the rest of the sentence? To illustrate the problem a little different: a task with a nice level -20 got around 700% more cpu time (or 8 times more), now it gets 8500% more cpu time (or 86.7 times more). You don't think that change to the nice levels is a little drastic? bye, Roman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> i'm not sure how your question relates/connects to what i wrote above,
> could you please re-phrase your question into a bit more verbose form
> so that i can answer it? Thanks,

Well, you cut out the major question from my initial mail:

| One question here would be, is it really a problem to sleep a little more?

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> > Why did you cut off the rest of the sentence?
>
> (no need to become hostile, i answered to that portion of your sentence
> separately, which was logically detached from the other portion of your
> sentence. I marked the cut with the '[...]' sign.)

Could you please stop with these accusations? Could you please point me
to the mail with the separate answer?

> > To illustrate the problem a little different: a task with a nice
> > level -20 got around 700% more cpu time (or 8 times more), now it
> > gets 8500% more cpu time (or 86.7 times more). You don't think that
> > change to the nice levels is a little drastic?
>
> This was discussed on lkml in detail, see the CFS threads.

Which are quite big, so I skipped most of it; a more precise pointer
would be appreciated.

> It has been a common request for nice levels to be more logical (i.e.
> to make them universal and to detach them from HZ) and for them to be
> more effective as well.

Huh? What does this have to do with HZ? The scheduler used ticks
internally, but that's irrelevant to what the user sees via the nice
levels. So the question still stands that this change may be a little
drastic, as you changed the nice levels of _all_ users, not just of
those who were previously interested in CFS.

bye, Roman
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> > Well, you cut out the major question from my initial mail:
> >
> > One question here would be, is it really a problem to sleep a little
> > more?
>
> oh, i did not want to embarrass you (and distract the discussion) with
> answering a pretty stupid, irrelevant question that has the obvious
> answer even for the most casual observer: yes, of course it really is a
> problem to sleep a little more, read the description of the fine patch
> you are replying to ...

And your insults continue... :-(

I ask a simple question and try to explore alternative solutions and
this is your contribution to it? To put this into a little more context,
this is the complete text you cut off:

| One question here would be, is it really a problem to sleep a little more?
| Another possibility would be to add another sleep function, which uses
| hrtimer and could also take ktime argument.

So instead of considering this suggestion, you just read what you want
out of what I wrote and turn everything into an insult. Nicely done,
Ingo. :-(

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> to sum it up: a nice +19 task (the most commonly used nice level in
> practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler,
> depending on the value of HZ. This is quite inconsistent and illogical.

You're correct that you can find artifacts in the extreme cases; it's
subjective whether this is a serious problem. It's nice that these
artifacts are gone, but that still doesn't explain why this ratio had to
be increased that much, from around 1:10 to 1:69.

bye, Roman
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Sun, 15 Jul 2007, Jonathan Corbet wrote:

> The OLPC folks and I recently discovered something interesting: on a
> HZ=100 system, a call to msleep(1) will delay for about 20ms. The
> combination of jiffies timekeeping and rounding up means that the
> minimum delay from msleep will be two jiffies, never less. That led to
> multi-second delays in a driver which does a bunch of short msleep()
> calls and, in response, a change to mdelay(), which will come back in
> something closer to the requested time.
>
> Here's another approach: a reimplementation of msleep() and
> msleep_interruptible() using hrtimers. On a system without real
> hrtimers this code will at least drop down to single-jiffy delays much
> of the time (though not deterministically so). On my x86_64 system with
> Thomas's hrtimer/dyntick patch applied, msleep(1) gives almost exactly
> what was asked for.

BTW there is another thing to consider: if you already run with
hrtimer/dyntick, there is not much reason to keep HZ at 100, so you
could just increase HZ to get the same effect.

bye, Roman
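[The jiffies arithmetic behind the reported 20 ms can be sketched in a
few lines. This is a simplified model of the kernel's conversion, not
the actual msleep() source; the helper names are illustrative:]

```python
def msecs_to_jiffies(ms, hz):
    # the kernel rounds up when converting milliseconds to ticks
    return -(-ms * hz // 1000)

def msleep_timeout_jiffies(ms, hz):
    # msleep() adds one jiffy so the timeout can never expire early
    return msecs_to_jiffies(ms, hz) + 1

hz = 100  # 10 ms per jiffy
delay_ms = msleep_timeout_jiffies(1, hz) * 1000 // hz
print(delay_ms)        # 20 -> msleep(1) sleeps ~20 ms instead of 1 ms
print(200 * delay_ms)  # 4000 -> ~4 s across ~200 register writes
```

[This is why the minimum delay is two jiffies: one jiffy from rounding
up, one from the extra tick added to guarantee the sleep is never
shorter than requested.]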
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> because when i assumed the obvious, you called it an insult so please
> dont leave any room for assumptions and remove any ambiguity -
> especially as our communication seems to be marred by what appears to
> be frequent misunderstandings ;-)

What the hell is this supposed to be? How am I not to take this:

> i did not want to embarrass you (and distract the discussion) with
> answering a pretty stupid, irrelevant question

as an insult? How does it reflect on someone if he asks stupid,
irrelevant questions? If it had been a misunderstanding, you could have
asked appropriately instead of assuming I'm an idiot. BTW adding smileys
to it doesn't make it any funnier, since I don't believe in the
misunderstandings theory anymore. :-(

bye, Roman
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Mon, 16 Jul 2007, Jonathan Corbet wrote:

> > One possible problem here is that setting up that timer can be
> > considerably more expensive, for a relative timer you have to read
> > the current time, which can be quite expensive (e.g. your machine now
> > uses the PIT timer, because TSC was deemed unstable).
>
> That's a possibility, I admit I haven't benchmarked it. I will say
> that I don't think it will be enough to matter - msleep() is not a
> hot-path sort of function. Once the system is up and running it almost
> never gets called at all - at least, on my setup.

That's a bit my problem - we have to consider other setups as well. Is
it worth converting all msleep users behind their back, or should we
just provide a separate function for those who care? I would really like
to keep hrtimers and kernel timers separate and make it obvious who is
using what, as the usage requirements are somewhat different.

> > One question here would be, is it really a problem to sleep a little
> > more?
>
> A little more is a bit different than twenty times as long as you
> asked for. That little bit more added up to a few seconds when
> programming a device which needs a brief delay after tweaking each of
> almost 200 registers.

Which driver is this? I'd like to look at it, in case there's some other
hidden problem.

> > BTW there is another thing to consider. If you already run with
> > hrtimer/dyntick, there is not much reason to keep HZ at 100, so you
> > could just increase HZ to get the same effect.
>
> Except that then, with the current implementation, you're paying for
> the higher HZ whenever the CPU is busy. I bet that doesn't take long
> to overwhelm any added overhead in the hrtimer msleep().

Actually, if that's the case I'd consider this a bug - where is that
extra cost coming from?

> In the end, I did this because I thought msleep() should do what it
> claims to do, because I thought that getting a known-to-expire timeout
> off the timer wheel made sense, and to make a tiny baby step in the
> direction of reducing the use of jiffies in the core code.

I know that Ingo considers everything HZ related evil, but it really is
not - it keeps Linux scalable. Unless you need the high resolution, the
timer wheel performance is still pretty hard to beat. That
known-to-expire stuff is really the least significant problem to
consider here, please just forget about it.

I don't want to keep anyone from using hrtimers - if it's just some
driver, go wild - but in generic code we have to consider portability
issues. Using jiffies as a time base is still unbeatably cheap in the
general case, so we have to carefully consider whether using a different
time source is required. There is nothing wrong with using jiffies if it
fits the bill, and in many cases it still does.

bye, Roman
Re: [PATCH/RFC] msleep() with hrtimers
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> I explained it numerous times (remember the 'timeout' vs. 'timer
> event' discussion?) that i consider timer granularity important to
> scalability. Basically, in every case where we know with great
> certainty that a time-out will _not_ occur (where the time-out is in
> essence just an exception handling mechanism), using struct timer_list
> is the best solution.

Whether the timer expires or not is in many cases completely irrelevant.
You need special cases where the timer wheel behaviour becomes an issue,
and whether hrtimers would behave any better in such situations is
questionable. Again, for the average user such details are pretty much
irrelevant.

> what i consider harmful on the other hand are all the HZ assumptions
> embedded into various pieces of code. The most harmful ones are design
> details that depend on HZ and kernel-internal API details that depends
> on HZ. Yes, NTP was such an example, and it was hard to fix, and you
> didnt help much with that.

Stop spreading lies! :-(

One only has to look at the history of kernel/time/ntp.c. John's rather
simple HZ-free ntp patch wouldn't have been that simple without all the
cleanup patches done by me before that, which were precisely intended to
make this possible.

> (perhaps that is one source of this increasingly testy exchange ;-)

No, it's your prejudice against me based on wrong facts. Get your facts
straight and stop being an ass.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Mon, 16 Jul 2007, Linus Torvalds wrote:

> How about trying a much less aggressive nice-level (and preferably
> linear, not exponential)?

I think the exponential increase isn't the problem. The old code did
approximate something like this rather crudely, with the result that
there was a big gap between level 0 and -1. Something like this:

	echo 'for (i=-20;i<=20;i++) print i, ": ", 1024*e(l(2)*(-i/20*3)), "\n";' | bc -l

would produce a range similar to the old code. Replacing the factor 3
with 4 would IMO be a more reasonable increase and would have the
advantage for the user that it's easier to understand that every 5
levels the time a process gets is doubled.

bye, Roman
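[The bc one-liner can be written equivalently in Python; the formula is
the same, only the helper name is made up:]

```python
def weight(nice, factor):
    # 1024 * e(l(2) * (-nice/20 * factor)) from the bc expression,
    # i.e. 1024 scaled by a power of two spread over the nice range
    return 1024 * 2 ** (-nice / 20 * factor)

# the range produced with factor 3, similar to the old code
for i in range(-20, 21):
    print(i, ":", round(weight(i, 3), 2))

# with factor 4, every 5 nice levels exactly double the weight
print(weight(-5, 4) / weight(0, 4))  # 2.0
```

[The "doubled every 5 levels" claim for factor 4 follows directly:
2^(5/20*4) = 2^1 = 2.]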
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Mon, 16 Jul 2007, Matt Mackall wrote:

> > It's nice that these artifacts are gone, but that still doesn't
> > explain why this ratio had to be increased that much from around
> > 1:10 to 1:69.
>
> More dynamic range is better? If you actually want a task to get 20x
> the CPU time of another, the older scheduler doesn't really allow it.

You can already have that, the complete range from level 19 to -20 was
about 1:80. There is also something like too much range: I tried it with
top at nice 19, and as soon as something runs at -20 it's practically
dead, because it now gets only 1/5900 of the cpu time.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Mon, 16 Jul 2007, Ingo Molnar wrote:

> and note that even on the old scheduler, nice-0 was 3200% more
> powerful than nice +19 (with CONFIG_HZ=300),

How did you get that value? At any HZ the ratio should be around 1:10
(+- rounding error).

> in fact i like it that nice -20 has a slightly bigger punch than it
> used to have before:

Slightly bigger??? You're joking, right? Especially the user levels are
doing something completely different now, which may break user
expectations. While the user couldn't expect anything precise, it's
still a big difference whether a process at nice 5 gets 75% of the time
or only 30%.

bye, Roman
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Tue, 17 Jul 2007, Ingo Molnar wrote:

> * Roman Zippel [EMAIL PROTECTED] wrote:
>
> > On Mon, 16 Jul 2007, Ingo Molnar wrote:
> >
> > > and note that even on the old scheduler, nice-0 was 3200% more
> > > powerful than nice +19 (with CONFIG_HZ=300),
> >
> > How did you get that value? At any HZ the ratio should be around
> > 1:10 (+- rounding error).
>
> you are wrong again. I sent you the numbers earlier today already:
>
> |   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> |  2332 mingo    25   0  1580  248  196 R 95.1  0.0   0:11.84 loop
> |  2335 mingo    39  19  1576  244  196 R  3.1  0.0   0:00.39 loop
>
> 3.1% is 3067% more than 95.1%, and the ratio is 1:30.67. You again
> deny above that this is the case, and there's nothing i can do about
> your denial of facts - that is your own private problem.

Ingo, how am I supposed to react to this? I'm asking a simple question
and I get this? I'm at a serious loss how to deal with you. :-(

The above is based on theoretical values: for a 300HZ kernel these two
processes should get 30 and 3 ticks. Should there be any rounding error
or off-by-one error, so that the processes get one tick less than they
should or one tick is accounted to the wrong process, my theoretical
value is still within the possible error range and doesn't contradict
your practical values. Playing around with some other nice levels
confirms the theory that something is a little off, so I'm quite correct
in saying that the ratio _should_ be 1:10. OTOH you are the one who is
wrong about me (again). :-(

bye, Roman
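[The ratio in the quoted top output is plain arithmetic on the two %CPU
figures:]

```python
nice0_cpu = 95.1   # %CPU of the nice 0 task in the quoted top output
nice19_cpu = 3.1   # %CPU of the nice 19 task
ratio = nice0_cpu / nice19_cpu
print(round(ratio, 2))  # 30.68 -> roughly 1:30, versus the expected ~1:10
```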
Re: [PATCH] CFS: Fix missing digit off in wmult table
Hi,

On Tue, 17 Jul 2007, I wrote:

> Playing around with some other nice levels confirms the theory that
> something is a little off, so I'm quite correct in saying that the
> ratio _should_ be 1:10.

Rechecking everything, there was actually a small error in my test
program, so the ratio should be at 1:20. Sorry about that mistake. Nice
level 19 shows the largest artifacts, as that level only gets a single
tick, so the ratio is often 1:HZ/10 (except for 1000HZ, where it's
5:100). Nevertheless it's still true that in general nice levels were
independent of HZ (that's all I wanted to say a couple of mails ago).

Ingo, you can start gloating now, but contrary to you I have no problems
with admitting mistakes and apologizing for them. The point is just that
I react better to factual arguments than to flames (and I think it's not
just me), so I'm pretty sure I'm still correct about this:

| OTOH you are the one who is wrong about me (again). :-(

bye, Roman
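[The "1:HZ/10" observation can be sketched as follows; the single-tick
minimum for nice 19 and the 5-tick case at HZ=1000 are the numbers
stated in the mail above, not taken from the scheduler source:]

```python
def old_nice19_ratio(hz):
    # nice 0 keeps roughly a 100 ms timeslice, i.e. hz/10 ticks,
    # while nice 19 is rounded down to the minimum slice
    nice0_ticks = hz // 10
    nice19_ticks = 5 if hz == 1000 else 1
    return nice0_ticks / nice19_ticks

for hz in (100, 300, 1000):
    print(hz, old_nice19_ratio(hz))  # 100 10.0 / 300 30.0 / 1000 20.0
```

[This is how a nominal 1:10 or 1:20 ratio turns into an HZ-dependent
one at the very last nice level.]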
Re: x86 status was Re: -mm merge plans for 2.6.23
Hi,

On Fri, 13 Jul 2007, Mike Galbraith wrote:

> > The new scheduler does _a_lot_ of heavy 64 bit calculations without
> > any attempt to scale that down a little...
>
> See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity.
> Perhaps more can be done, but "without any attempt..." isn't accurate.

Calculating these values at runtime would have been completely insane,
and the alternative would be a crummy approximation, so using a lookup
table is actually a good thing. That's not the problem.

BTW could someone please verify the prio_to_wmult table? Especially [16]
and [21] look a little off, as if a digit was cut off.

While I'm at it, the 10% scaling there looks a little much (unless there
are other changes I haven't looked at yet); the old code used more like
5%. This would mean a prio -20 task would get 98.86% cpu time compared
to a prio 0 task, which was previously about the difference between -20
and 19 (and it would have previously gotten only 88.89%); now a prio -20
task would get 99.98% cpu time compared to a prio 19 task. The
individual levels are unfortunately not that easily comparable, but at
the overall scale the change looks IMHO a little drastic.

bye, Roman
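[The percentages above follow from the usual statement that each CFS
nice level is worth about 10% CPU, i.e. a weight step of roughly 1.25
per level. This is a back-of-the-envelope check of the quoted figures,
not code taken from the scheduler:]

```python
def cpu_share_percent(levels, step=1.25):
    # CPU share of the heavier of two tasks that differ by `levels`
    # nice levels, given a multiplicative weight step per level
    w = step ** levels
    return 100 * w / (w + 1)

print(round(cpu_share_percent(20), 2))  # -20 vs 0:  98.86
print(round(cpu_share_percent(39), 2))  # -20 vs 19: 99.98
print(round(100 * 8 / (8 + 1), 2))      # old ~8:1 ratio: 88.89
```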
Re: x86 status was Re: -mm merge plans for 2.6.23
Hi,

On Wed, 11 Jul 2007, Linus Torvalds wrote:

> Sure, bugs happen, but code that everybody runs the same generally
> doesn't break. So a CPU scheduler doesn't worry me all that much. CPU
> schedulers are "easy".

A little more advance warning wouldn't have hurt though. The new
scheduler does _a_lot_ of heavy 64 bit calculations without any attempt
to scale that down a little... One can blame me now for not having
brought it up earlier, but discussions with Ingo are not something I'm
looking forward to. :(

bye, Roman
Re: [PATCH] Start to genericize kconfig for use by other projects.
Hi,

On Thu, 12 Jul 2007, I wrote:

> On Wed, 11 Jul 2007, Rob Landley wrote:
>
> > Replace name "Linux Kernel" in menuconfig with a macro (defaulting
> > to "Linux Kernel" if not -Ddefined by the makefile), and remove a
> > few unnecessary occurrences of "kernel" in pop-up text.
>
> Could you drop the PROJECT_NAME changes for now?

Or at least replace it with a variable at first.

bye, Roman
Re: [PATCH] Start to genericize kconfig for use by other projects.
Hi,

On Wed, 11 Jul 2007, Rob Landley wrote:

> Replace name "Linux Kernel" in menuconfig with a macro (defaulting to
> "Linux Kernel" if not -Ddefined by the makefile), and remove a few
> unnecessary occurrences of "kernel" in pop-up text.

Could you drop the PROJECT_NAME changes for now? The rest looks fine. I
would prefer it if the project name were settable via Kconfig. If you
want to play with it, add this to Kconfig:

	config PROJECT_NAME
		string
		default "Linux kernel"

and at the end of conf_parse() you can look up, calculate and cache the
value.

bye, Roman