Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, Jul 03, 2014 at 03:12:17PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > > As were others, not that long ago.  Today is the first hint that I got
> > > > that you feel otherwise.  But it does look like the softirq approach to
> > > > callback processing needs to stick around for awhile longer.  Nice to
> > > > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > >
> > > Nah, softirqs are still totally annoying :-)
> >
> > Name me one thing that isn't annoying.  ;-)
> >
> > > So I've lost detail again, but it seems to me that on all CPUs that are
> > > actually getting ticks, waking tasks to process the RCU state is
> > > entirely overdoing it.  Might as well keep processing their RCU state
> > > from the tick as was previously done.
> >
> > And that is in fact the approach taken by my patch.  For which I just
> > kicked off testing, so expect an update later today.  (And that -is-
> > optimistic!  A pessimistic viewpoint would hold that the patch would
> > turn out to be so broken that it would take -weeks- to get a fix!)
>
> Right, but as you told Mike it's not really dynamic, but of course we can
> work on that.

If it is actually needed by someone, then I would be happy to work on it.
But all I see now is people asserting that it should be provided, without
any real justification.

> That said; I'm somewhat confused on the whole nocb thing.  So the way I
> see things there's two things that need doing:
>
>  1) push the state machine
>  2) run callbacks
>
> It seems to me the nocb threads do both, and somehow some of this is
> getting conflated.  Because afaik RCU only uses softirqs for (2), since
> (1) is fully done from the tick -- well, it used to be, before all this.

Well, you do need a finer-grained view of the RCU state machine:

1a. Registering the need for a future grace period.
1b. Self-reporting of quiescent states (softirq).
1c. Reporting of other CPUs' quiescent states (grace-period kthread).
    This includes idle CPUs, userspace nohz_full CPUs, and CPUs that
    just now transitioned to offline.
1d. Kicking CPUs that have not yet reported a quiescent state (also
    grace-period kthread).
2.  Running callbacks (softirq, or, for RCU_NOCB_CPU, rcuo kthread).

And here (1a) is done via softirq in the non-nocb case and via the rcuo
kthreads in the nocb case.  And yes, RCU's softirq processing is normally
done from the tick.

> Now, IIRC rcu callbacks are not guaranteed to run on whatever cpu
> they're queued on, so we can 'easily' splice the actual callback list
> into some other CPU's callback list.  Which leaves only (1) to actually
> 'do'.

True, although the 'easily' part needs to take into account the fact that
the RCU callbacks from a given CPU must be invoked in order.  Or
rcu_barrier() needs to find a different way to guarantee that all
previously registered callbacks have been invoked, as the case may be.

> Yet the whole thing is called after the 'no-callback' thing, even though
> the most important part is pushing the state machine remotely.

Well, you do have to do both.  Pushing the state machine doesn't help
unless you also invoke the RCU callbacks.

> Now I can see we probably don't want to actually push remote CPUs'
> RCU state from IRQ context, but we could, I think, drive the state
> machine remotely.  And we want to avoid overloading one CPU with the
> work of all others, which is I think still a fundamental issue with the
> whole nohz_full thing; it reverts to the _one_ timekeeper CPU, but on
> big enough systems that'll be a problem.

Well, RCU already pushes the remote CPU's RCU state remotely via RCU's
dynticks setup.
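The in-order requirement discussed above can be illustrated with a toy
userspace sketch (illustrative names only, not the kernel's actual
callback-list implementation): a singly linked list with a tail pointer
supports O(1) enqueue and O(1) wholesale splicing of one CPU's list onto
another, and because the donor list is appended as a unit, its callbacks
still run in their original enqueue order.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified RCU-style callback: singly linked, invoked in enqueue order. */
struct cb {
	struct cb *next;
	void (*func)(struct cb *);
};

struct cblist {
	struct cb *head;
	struct cb **tail;	/* points at the final ->next pointer */
};

static void cblist_init(struct cblist *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

static void cblist_enqueue(struct cblist *l, struct cb *c)
{
	c->next = NULL;
	*l->tail = c;		/* append at the tail... */
	l->tail = &c->next;	/* ...and advance the tail pointer */
}

/* Splice all of 'src' onto the end of 'dst', preserving src's order. */
static void cblist_splice(struct cblist *dst, struct cblist *src)
{
	if (!src->head)
		return;
	*dst->tail = src->head;	/* graft src wholesale onto dst's tail */
	dst->tail = src->tail;
	cblist_init(src);	/* src is now empty */
}
```

Invoking the spliced list head-to-tail then runs the donor CPU's
callbacks in their original order, which is the property rcu_barrier()
relies on.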
But you are quite right, dumping all of the RCU processing onto one CPU
can be a bottleneck on large systems (which Fengguang's tests noted, by
the way), and this is the reason for patch 11/17 in the fixes series
(https://lkml.org/lkml/2014/7/7/990).  This patch allows housekeeping
kthreads like the grace-period kthreads to use a new
housekeeping_affine() function to bind themselves onto the non-nohz_full
CPUs.  The system can be booted with the desired number of housekeeping
CPUs using the nohz_full= boot parameter.

However, it is not clear to me that having only one timekeeping CPU (as
opposed to having only one housekeeping CPU) is a real problem, even for
very large systems.  If it does turn out to be a real problem, the
sysidle code will probably need to change as well.

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
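The housekeeping_affine() idea described above amounts to computing the
complement of the nohz_full set.  A hedged userspace sketch (the function
name and masks here are illustrative; the kernel operates on struct
cpumask and binds kthreads to the result):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Compute the housekeeping mask: all "online" CPUs that are NOT in the
 * nohz_full set.  Illustrative only; in the kernel, a housekeeping
 * kthread would then have its affinity restricted to this mask.
 */
static void housekeeping_mask(const cpu_set_t *online,
			      const cpu_set_t *nohz_full,
			      cpu_set_t *hk)
{
	CPU_ZERO(hk);
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, online) && !CPU_ISSET(cpu, nohz_full))
			CPU_SET(cpu, hk);
}
```

With nohz_full=1-7 on an 8-CPU box, only CPU 0 remains in the
housekeeping mask, which is exactly why a big-enough machine wants more
than one CPU left out of the nohz_full range.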
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, Jul 03, 2014 at 11:50:09AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> > Name me one thing that isn't annoying.  ;-)
>
> A cold beverage on a warm day :-)

Fair point...  Not clear how to work that into the RCU implementation,
though.  I suppose I could add comments instructing the reader to consume
a cold beverage.  Or I could consume a cold beverage while working on
RCU.  ;-)

							Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > As were others, not that long ago.  Today is the first hint that I got
> > > that you feel otherwise.  But it does look like the softirq approach to
> > > callback processing needs to stick around for awhile longer.  Nice to
> > > hear that softirq is now "sane and normal" again, I guess.  ;-)
> >
> > Nah, softirqs are still totally annoying :-)
>
> Name me one thing that isn't annoying.  ;-)
>
> > So I've lost detail again, but it seems to me that on all CPUs that are
> > actually getting ticks, waking tasks to process the RCU state is
> > entirely overdoing it.  Might as well keep processing their RCU state
> > from the tick as was previously done.
>
> And that is in fact the approach taken by my patch.  For which I just
> kicked off testing, so expect an update later today.  (And that -is-
> optimistic!  A pessimistic viewpoint would hold that the patch would
> turn out to be so broken that it would take -weeks- to get a fix!)

Right, but as you told Mike it's not really dynamic, but of course we can
work on that.

That said; I'm somewhat confused on the whole nocb thing.  So the way I
see things there's two things that need doing:

 1) push the state machine
 2) run callbacks

It seems to me the nocb threads do both, and somehow some of this is
getting conflated.  Because afaik RCU only uses softirqs for (2), since
(1) is fully done from the tick -- well, it used to be, before all this.

Now, IIRC rcu callbacks are not guaranteed to run on whatever cpu
they're queued on, so we can 'easily' splice the actual callback list
into some other CPU's callback list.  Which leaves only (1) to actually
'do'.

Yet the whole thing is called after the 'no-callback' thing, even though
the most important part is pushing the state machine remotely.

Now I can see we probably don't want to actually push remote CPUs'
RCU state from IRQ context, but we could, I think, drive the state
machine remotely.  And we want to avoid overloading one CPU with the work
of all others, which is I think still a fundamental issue with the whole
nohz_full thing; it reverts to the _one_ timekeeper CPU, but on big
enough systems that'll be a problem.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote:
> On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > As were others, not that long ago.  Today is the first hint that
> > > I got that you feel otherwise.  But it does look like the softirq
> > > approach to callback processing needs to stick around for awhile
> > > longer.  Nice to hear that softirq is now "sane and normal"
> > > again, I guess.  ;-)
> >
> > Nah, softirqs are still totally annoying :-)
> >
> > So I've lost detail again, but it seems to me that on all CPUs that
> > are actually getting ticks, waking tasks to process the RCU state
> > is entirely overdoing it.  Might as well keep processing their RCU
> > state from the tick as was previously done.
>
> For CPUs that are not getting ticks (eg. because they are idle),
> is it worth waking up anything on that CPU, or would it make more
> sense to simply process their RCU callbacks on a different CPU,
> if there aren't too many pending?

If they're idle, RCU 'should' know this and exclude them from the state
machine.  The tricky part, and where all this nocb nonsense started, is
the NOHZ_FULL stuff, where a CPU can be !idle but still not get ticks.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> Name me one thing that isn't annoying.  ;-)

A cold beverage on a warm day :-)
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote:
> > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> > >
> > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to
> > > go away as being a redundant interface to manage a single property
> > > of a set of CPUs, but it's perfectly fine for NO_HZ_FULL to add an
> > > interface to manage a single property of a set of CPUs.  What am I
> > > missing?
> >
> > Well, for now, it can only be specified at build time or at boot time.
> > In theory, it is possible to change a CPU from being callback-offloaded
> > to not at runtime, but there would need to be an extremely good reason
> > for adding that level of complexity.  Lots of "fun" races in there...
>
> Yeah, understood.
>
> (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> usable if it eventually became unified with cpuset and learned how to
> tap-dance properly ;)

Well, the exact same goes for NO_HZ_FULL.  Quoting Paul (just replacing
RCU things with dynticks) it becomes:

"it can only be specified at build time or at boot time.  In theory, it
is possible to change a CPU from being idle-dynticks to full-dynticks at
runtime, but there would need to be an extremely good reason for adding
that level of complexity.  Lots of "fun" races in there..."

And I'm not even sure that somebody actually uses full dynticks today.
I only know that some financial institutions are considering it, which
is not cheering me up much...  So we are very far from the day when
we'll migrate to a runtime interface.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jul 04, 2014 at 08:01:34AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-03 at 22:05 -0700, Paul E. McKenney wrote:
> > On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> > >
> > > Turn it on and don't worry about it is exactly what distros want the
> > > obscure feature with very few users to be.  Last time I did a
> > > drive-by, my boxen said I should continue to worry about it ;-)
> >
> > Yep, which is the reason for the patch on the last email.
> >
> > Then again, exactly which feature and which reason for worry?
>
> NO_HZ_FULL.  I tried ALL a while back, box instantly called me an idiot.
> Maybe that has improved since, dunno.

Ah, I was thinking in terms of RCU_CPU_NOCB.

> Last drive-by I didn't do much overhead measurement, stuck mostly to
> functionality, and it still had rough edges that enterprise users may
> not fully appreciate.  Trying to let 60 of 64 cores do 100% compute
> showed some cores having a hard time entering tickless at all, and
> ~200us spikes that I think are due to the tick losing skew...  told
> Frederic I'd take a peek at that, but haven't had time yet.  There were
> other known things as well, like timers and workqueues for which there
> are patches floating around.  All in all, it was waving the men-at-work
> sign, pointing at the "Say N" by the config option, and suggesting that
> ignoring that would not be the cleverest of moves.

Well, I am not going to join a debate on Kconfig default selection.  ;-)
I will say that two years ago, setting NO_HZ_FULL=y by default would
have been insane.  Perhaps soon it will be a no-brainer, and I am of
course trying to bring that day closer.  Right now it is of course a
judgment call.

							Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, 2014-07-03 at 22:05 -0700, Paul E. McKenney wrote:
> On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> >
> > Turn it on and don't worry about it is exactly what distros want the
> > obscure feature with very few users to be.  Last time I did a drive-by,
> > my boxen said I should continue to worry about it ;-)
>
> Yep, which is the reason for the patch on the last email.
>
> Then again, exactly which feature and which reason for worry?

NO_HZ_FULL.  I tried ALL a while back, box instantly called me an idiot.
Maybe that has improved since, dunno.

Last drive-by I didn't do much overhead measurement, stuck mostly to
functionality, and it still had rough edges that enterprise users may
not fully appreciate.  Trying to let 60 of 64 cores do 100% compute
showed some cores having a hard time entering tickless at all, and
~200us spikes that I think are due to the tick losing skew...  told
Frederic I'd take a peek at that, but haven't had time yet.  There were
other known things as well, like timers and workqueues for which there
are patches floating around.  All in all, it was waving the men-at-work
sign, pointing at the "Say N" by the config option, and suggesting that
ignoring that would not be the cleverest of moves.

	-Mike
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-03 at 09:29 -0700, Paul E. McKenney wrote:
> > On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> > > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote:
> > > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> > > > >
> > > > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed
> > > > > to go away as being a redundant interface to manage a single
> > > > > property of a set of CPUs, but it's perfectly fine for
> > > > > NO_HZ_FULL to add an interface to manage a single property of a
> > > > > set of CPUs.  What am I missing?
> > > >
> > > > Well, for now, it can only be specified at build time or at boot
> > > > time.  In theory, it is possible to change a CPU from being
> > > > callback-offloaded to not at runtime, but there would need to be
> > > > an extremely good reason for adding that level of complexity.
> > > > Lots of "fun" races in there...
> > >
> > > Yeah, understood.
> > >
> > > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and
> > > more usable if it eventually became unified with cpuset and learned
> > > how to tap-dance properly ;)
> >
> > Agreed, it would in some sense be nice.  What specifically do you need
> > it for?
>
> I personally have zero use for the thing (git/vi aren't particularly
> perturbation sensitive ;).  I'm just doing occasional drive-by testing
> from a distro perspective: how well does it work, what does it cost, etc.
>
> > Are you really running workloads that generate large numbers of
> > callbacks spread across most of the CPUs?  It was this sort of workload
> > that caused Rik's system to show scary CPU-time accumulation, due to
> > the high overhead of frequent one-to-many wakeups.
> >
> > If your systems aren't running that kind of high-callback-rate
> > workload, just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> >
> > If your systems -are- running that kind of high-callback-rate
> > workload, but your system has fewer than 200 CPUs, ensure that you
> > have enough housekeeping CPUs to allow the grace-period kthread
> > sufficient CPU time, set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry
> > about it.
> >
> > If your systems -are- running that kind of high-callback-rate
> > workload, and your system has more than 200 CPUs, apply the following
> > patch, set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about
> > it.  ;-)
>
> Turn it on and don't worry about it is exactly what distros want the
> obscure feature with very few users to be.  Last time I did a drive-by,
> my boxen said I should continue to worry about it ;-)

Yep, which is the reason for the patch on the last email.

Then again, exactly which feature and which reason for worry?

							Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, 2014-07-03 at 09:29 -0700, Paul E. McKenney wrote: > On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote: > > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: > > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote: > > > > > > NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go > > > > away as being a redundant interface to manage a single property of a set > > > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to > > > > manage a single property of a set of CPUs. What am I missing? > > > > > > Well, for now, it can only be specified at build time or at boot time. > > > In theory, it is possible to change a CPU from being callback-offloaded > > > to not at runtime, but there would need to be an extremely good reason > > > for adding that level of complexity. Lots of "fun" races in there... > > > > Yeah, understood. > > > > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more > > usable if it eventually became unified with cpuset and learned how to > > tap-dance properly;) > > Agreed, it would in some sense be nice. What specifically do you need > it for? I personally have zero use for the thing (git/vi aren't particularly perturbation sensitive;). I'm just doing occasional drive-by testing from a distro perspective, how well does it work, what does it cost etc. > Are you really running workloads that generate large numbers of > callbacks spread across most of the CPUs? It was this sort of workload > that caused Rik's system to show scary CPU-time accumulation, due to > the high overhead of frequent one-to-many wakeups. > > If your systems aren't running that kind of high-callback-rate workload, > just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it. 
> > If your systems -are- running that kind of high-callback-rate workload, > but your system has fewer than 200 CPUs, ensure that you have enough > housekeeping CPUs to allow the grace-period kthread sufficient CPU time, > set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it. > > If your systems -are- running that kind of high-callback-rate workload, > and your system has more than 200 CPUs, apply the following patch, > set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it. ;-) "Turn it on and don't worry about it" is exactly what distros want the obscure feature with very few users to be. Last time I did a drive-by, my boxen said I should continue to worry about it ;-) -Mike
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote: > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote: > > > > NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go > > > away as being a redundant interface to manage a single property of a set > > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to > > > manage a single property of a set of CPUs. What am I missing? > > > > Well, for now, it can only be specified at build time or at boot time. > > In theory, it is possible to change a CPU from being callback-offloaded > > to not at runtime, but there would need to be an extremely good reason > > for adding that level of complexity. Lots of "fun" races in there... > > Yeah, understood. > > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more > usable if it eventually became unified with cpuset and learned how to > tap-dance properly;) Agreed, it would in some sense be nice. What specifically do you need it for? Are you really running workloads that generate large numbers of callbacks spread across most of the CPUs? It was this sort of workload that caused Rik's system to show scary CPU-time accumulation, due to the high overhead of frequent one-to-many wakeups. If your systems aren't running that kind of high-callback-rate workload, just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it. If your systems -are- running that kind of high-callback-rate workload, but your system has fewer than 200 CPUs, ensure that you have enough housekeeping CPUs to allow the grace-period kthread sufficient CPU time, set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it. If your systems -are- running that kind of high-callback-rate workload, and your system has more than 200 CPUs, apply the following patch, set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it. 
;-) Thanx, Paul rcu: Parallelize and economize NOCB kthread wakeups An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Reported-by: Rik van Riel Signed-off-by: Paul E. McKenney diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6eaa9cdb7094..affed6434ec8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. quiescent states. Units are jiffies, minimum value is one, and maximum value is HZ. 
+ rcutree.rcu_nocb_leader_stride= [KNL] + Set the number of NOCB kthread groups, which + defaults to the square root of the number of + CPUs. Larger numbers reduce the wakeup overhead + on the per-CPU grace-period kthreads, but increase + that same overhead on each group's leader. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bf2c1e669691..de12fa5a860b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -331,11 +331,29 @@ struct rcu_data { struct rcu_head **nocb_tail; atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote: > > NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go > > away as being a redundant interface to manage a single property of a set > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to > > manage a single property of a set of CPUs. What am I missing? > > Well, for now, it can only be specified at build time or at boot time. > In theory, it is possible to change a CPU from being callback-offloaded > to not at runtime, but there would need to be an extremely good reason > for adding that level of complexity. Lots of "fun" races in there... Yeah, understood. (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more usable if it eventually became unified with cpuset and learned how to tap-dance properly;) -Mike
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote: > On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: > > On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote: > > > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: > > > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > > > > > > An 80-CPU system with a context-switch-heavy workload can require so > > > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend > > > > > > several > > > > > > tens of percent of a CPU just awakening things. This clearly will > > > > > > not > > > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads > > > > > > would > > > > > > get behind, increasing grace-period latency. > > > > > > > > > > > > To avoid this problem, this commit divides the NOCB kthreads into > > > > > > leaders > > > > > > and followers, where the grace-period kthreads awaken the leaders > > > > > > each of > > > > > > whom in turn awakens its followers. By default, the number of > > > > > > groups of > > > > > > kthreads is the square root of the number of CPUs, but this default > > > > > > may > > > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot > > > > > > parameter. > > > > > > This reduces the number of wakeups done per grace period by the RCU > > > > > > grace-period kthread by the square root of the number of CPUs, but > > > > > > of > > > > > > course by shifting those wakeups to the leaders. In addition, > > > > > > because > > > > > > the leaders do grace periods on behalf of their respective > > > > > > followers, > > > > > > the number of wakeups of the followers decreases by up to a factor > > > > > > of two. > > > > > > Instead of being awakened once when new callbacks arrive and again > > > > > > at the end of the grace period, the followers are awakened only at > > > > > > the end of the grace period. 
> > > > > > > > > > > > For a numerical example, in a 4096-CPU system, the grace-period > > > > > > kthread > > > > > > would awaken 64 leaders, each of which would awaken its 63 followers > > > > > > at the end of the grace period. This compares favorably with the 79 > > > > > > wakeups for the grace-period kthread on an 80-CPU system. > > > > > > > > > > Urgh, how about we kill the entire nocb nonsense and try again? This > > > > > is > > > > > getting quite ridiculous. > > > > > > > > Sure thing, Peter. > > > > > > So you don't think this has gotten a little out of hand? The NOCB stuff > > > has led to these masses of rcu threads and now you're adding extra > > > cache misses to the perfectly sane and normal code paths just to deal > > > with so many threads. > > > > Indeed it appears to have gotten a bit out of hand. But let's please > > attack the real problem rather than the immediate irritant. > > > > And in this case, the real problem is that users are getting callback > > offloading even when there is no reason for it. > > > > > And all to support a feature that nearly nobody uses. And you were > > > talking about making nocb the default rcu... > > > > As were others, not that long ago. Today is the first hint that I got > > that you feel otherwise. But it does look like the softirq approach to > > callback processing needs to stick around for a while longer. Nice to > > hear that softirq is now "sane and normal" again, I guess. ;-) > > > > Please see my patch in reply to Rik's email. The idea is to neither > > rip callback offloading from the kernel nor to keep callback offloading > > as the default, but instead do callback offloading only for those CPUs > > specifically marked as NO_HZ_FULL CPUs, or when specifically requested > > at build time or at boot time. In other words, only do it when it is > > needed. > > Exactly!
Like dynamically, when the user isolates CPUs via the cpuset > interface, none of it making much sense without that particular property > of a set of CPUs, and cpuset being the manager of CPU set properties. Glad you like it! ;-) > NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go > away as being a redundant interface to manage a single property of a set > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to > manage a single property of a set of CPUs. What am I missing? Well, for now, it can only be specified at build time or at boot time. In theory, it is possible to change a CPU from being callback-offloaded to not at runtime, but there would need to be an extremely good reason for adding that level of complexity. Lots of "fun" races in there... Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: > On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote: > > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: > > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > > > > > An 80-CPU system with a context-switch-heavy workload can require so > > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend > > > > > several > > > > > tens of percent of a CPU just awakening things. This clearly will not > > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads > > > > > would > > > > > get behind, increasing grace-period latency. > > > > > > > > > > To avoid this problem, this commit divides the NOCB kthreads into > > > > > leaders > > > > > and followers, where the grace-period kthreads awaken the leaders > > > > > each of > > > > > whom in turn awakens its followers. By default, the number of groups > > > > > of > > > > > kthreads is the square root of the number of CPUs, but this default > > > > > may > > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > > > > > This reduces the number of wakeups done per grace period by the RCU > > > > > grace-period kthread by the square root of the number of CPUs, but of > > > > > course by shifting those wakeups to the leaders. In addition, because > > > > > the leaders do grace periods on behalf of their respective followers, > > > > > the number of wakeups of the followers decreases by up to a factor of > > > > > two. > > > > > Instead of being awakened once when new callbacks arrive and again > > > > > at the end of the grace period, the followers are awakened only at > > > > > the end of the grace period. 
> > > > > > > > > > For a numerical example, in a 4096-CPU system, the grace-period > > > > > kthread > > > > > would awaken 64 leaders, each of which would awaken its 63 followers > > > > > at the end of the grace period. This compares favorably with the 79 > > > > > wakeups for the grace-period kthread on an 80-CPU system. > > > > Urgh, how about we kill the entire nocb nonsense and try again? This is > > > > getting quite ridiculous. > > > > > > Sure thing, Peter. > > > > So you don't think this has gotten a little out of hand? The NOCB stuff > > has led to these masses of rcu threads and now you're adding extra > > cache misses to the perfectly sane and normal code paths just to deal > > with so many threads. > > Indeed it appears to have gotten a bit out of hand. But let's please > attack the real problem rather than the immediate irritant. > > And in this case, the real problem is that users are getting callback > offloading even when there is no reason for it. > > > And all to support a feature that nearly nobody uses. And you were > > talking about making nocb the default rcu... > > As were others, not that long ago. Today is the first hint that I got > that you feel otherwise. But it does look like the softirq approach to > callback processing needs to stick around for a while longer. Nice to > hear that softirq is now "sane and normal" again, I guess. ;-) > > Please see my patch in reply to Rik's email. The idea is to neither > rip callback offloading from the kernel nor to keep callback offloading > as the default, but instead do callback offloading only for those CPUs > specifically marked as NO_HZ_FULL CPUs, or when specifically requested > at build time or at boot time. In other words, only do it when it is > needed. Exactly! Like dynamically, when the user isolates CPUs via the cpuset interface, none of it making much sense without that particular property of a set of CPUs, and cpuset being the manager of CPU set properties. 
NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go away as being a redundant interface to manage a single property of a set of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to manage a single property of a set of CPUs. What am I missing? -Mike
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 09:55:56AM -0700, Paul E. McKenney wrote: > On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote: > > On 07/02/2014 08:34 AM, Peter Zijlstra wrote: > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > > >> An 80-CPU system with a context-switch-heavy workload can require > > >> so many NOCB kthread wakeups that the RCU grace-period kthreads > > >> spend several tens of percent of a CPU just awakening things. > > >> This clearly will not scale well: If you add enough CPUs, the RCU > > >> grace-period kthreads would get behind, increasing grace-period > > >> latency. > > >> > > >> To avoid this problem, this commit divides the NOCB kthreads into > > >> leaders and followers, where the grace-period kthreads awaken the > > >> leaders each of whom in turn awakens its followers. By default, > > >> the number of groups of kthreads is the square root of the number > > >> of CPUs, but this default may be overridden using the > > >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the > > >> number of wakeups done per grace period by the RCU grace-period > > >> kthread by the square root of the number of CPUs, but of course > > >> by shifting those wakeups to the leaders. In addition, because > > >> the leaders do grace periods on behalf of their respective > > >> followers, the number of wakeups of the followers decreases by up > > >> to a factor of two. Instead of being awakened once when new > > >> callbacks arrive and again at the end of the grace period, the > > >> followers are awakened only at the end of the grace period. > > >> > > >> For a numerical example, in a 4096-CPU system, the grace-period > > >> kthread would awaken 64 leaders, each of which would awaken its > > >> 63 followers at the end of the grace period. This compares > > >> favorably with the 79 wakeups for the grace-period kthread on an > > >> 80-CPU system. 
> > > > > > Urgh, how about we kill the entire nocb nonsense and try again? > > > This is getting quite ridiculous. > > > > Some observations. > > > > First, the rcuos/N threads are NOT bound to CPU N at all, but are > > free to float through the system. > > I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n. > For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs. > > > Second, the number of RCU callbacks at the end of each grace period > > is quite likely to be small most of the time. > > > > This suggests that on a system with N CPUs, it may be perfectly > > sufficient to have a much smaller number of rcuos threads. > > > > One thread can probably handle the RCU callbacks for as many as > > 16, or even 64 CPUs... > > In many cases, one thread could handle the RCU callbacks for way more > than that. In other cases, a single CPU could keep a single rcuo kthread > quite busy. So something dynamic ends up being required. > > But I suspect that the real solution here is to adjust the Kconfig setup > between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot > parameters to get callback offloading on systems built with NO_HZ_FULL. > Then add some boot-time code so that any CPU that has nohz_full= is > forced to also have rcu_nocbs= set. This would have the good effect > of applying callback offloading only to those workloads for which it > was specifically designed, but allowing those workloads to gain the > latency-reduction benefits of callback offloading. > > I do freely confess that I was hoping that callback offloading might one > day completely replace RCU_SOFTIRQ, but that hope now appears to be at > best premature. > > Something like the attached patch. Untested, probably does not even build. Against all odds, it builds and passes moderate rcutorture testing. 
Although this doesn't satisfy the desire to wean RCU of softirq, it does allow NO_HZ_FULL kernels to maintain better compatibility with earlier kernel versions, which appears to be more important for the time being. Thanx, Paul

> rcu: Don't offload callbacks unless specifically requested
>
> Not-yet-signed-off-by: Paul E. McKenney
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..9332d33346ac 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -737,7 +737,7 @@ choice
>
>  config RCU_NOCB_CPU_NONE
>  	bool "No build_forced no-CBs CPUs"
> -	depends on RCU_NOCB_CPU && !NO_HZ_FULL
> +	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
>  	help
>  	  This option does not force any of the CPUs to be no-CBs CPUs.
>  	  Only CPUs designated by the rcu_nocbs= boot parameter will be
>
> @@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
>
>  config RCU_NOCB_CPU_ZERO
>  	bool "CPU 0 is a build_forced no-CBs CPU"
> -	depends on RCU_NOCB_CPU && !NO_HZ_FULL
> +
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 07/02/2014 01:26 PM, Peter Zijlstra wrote: > > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote: > >> As were others, not that long ago. Today is the first hint that > >> I got that you feel otherwise. But it does look like the softirq > >> approach to callback processing needs to stick around for awhile > >> longer. Nice to hear that softirq is now "sane and normal" > >> again, I guess. ;-) > > > > Nah, softirqs are still totally annoying :-) > > > > So I've lost detail again, but it seems to me that on all CPUs that > > are actually getting ticks, waking tasks to process the RCU state > > is entirely over doing it. Might as well keep processing their RCU > > state from the tick as was previously done. > > For CPUs that are not getting ticks (eg. because they are idle), > is it worth waking up anything on that CPU, or would it make more > sense to simply process their RCU callbacks on a different CPU, > if there aren't too many pending? Give or take the number of wakeups generated... ;-) Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote: > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote: > > As were others, not that long ago. Today is the first hint that I got > > that you feel otherwise. But it does look like the softirq approach to > > callback processing needs to stick around for awhile longer. Nice to > > hear that softirq is now "sane and normal" again, I guess. ;-) > > Nah, softirqs are still totally annoying :-) Name me one thing that isn't annoying. ;-) > So I've lost detail again, but it seems to me that on all CPUs that are > actually getting ticks, waking tasks to process the RCU state is > entirely overdoing it. Might as well keep processing their RCU state > from the tick as was previously done. And that is in fact the approach taken by my patch. For which I just kicked off testing, so expect an update later today. (And that -is- optimistic! A pessimistic viewpoint would hold that the patch would turn out to be so broken that it would take -weeks- to get a fix!) Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On 07/02/2014 01:26 PM, Peter Zijlstra wrote: > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote: >> As were others, not that long ago. Today is the first hint that >> I got that you feel otherwise. But it does look like the softirq >> approach to callback processing needs to stick around for awhile >> longer. Nice to hear that softirq is now "sane and normal" >> again, I guess. ;-) > > Nah, softirqs are still totally annoying :-) > > So I've lost detail again, but it seems to me that on all CPUs that > are actually getting ticks, waking tasks to process the RCU state > is entirely overdoing it. Might as well keep processing their RCU > state from the tick as was previously done. For CPUs that are not getting ticks (e.g. because they are idle), is it worth waking up anything on that CPU, or would it make more sense to simply process their RCU callbacks on a different CPU, if there aren't too many pending?
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote: > As were others, not that long ago. Today is the first hint that I got > that you feel otherwise. But it does look like the softirq approach to > callback processing needs to stick around for awhile longer. Nice to > hear that softirq is now "sane and normal" again, I guess. ;-) Nah, softirqs are still totally annoying :-) So I've lost detail again, but it seems to me that on all CPUs that are actually getting ticks, waking tasks to process the RCU state is entirely overdoing it. Might as well keep processing their RCU state from the tick as was previously done.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote: > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote: > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > > > > An 80-CPU system with a context-switch-heavy workload can require so > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend > > > > several > > > > tens of percent of a CPU just awakening things. This clearly will not > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would > > > > get behind, increasing grace-period latency. > > > > > > > > To avoid this problem, this commit divides the NOCB kthreads into > > > > leaders > > > > and followers, where the grace-period kthreads awaken the leaders each > > > > of > > > > whom in turn awakens its followers. By default, the number of groups of > > > > kthreads is the square root of the number of CPUs, but this default may > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > > > > This reduces the number of wakeups done per grace period by the RCU > > > > grace-period kthread by the square root of the number of CPUs, but of > > > > course by shifting those wakeups to the leaders. In addition, because > > > > the leaders do grace periods on behalf of their respective followers, > > > > the number of wakeups of the followers decreases by up to a factor of > > > > two. > > > > Instead of being awakened once when new callbacks arrive and again > > > > at the end of the grace period, the followers are awakened only at > > > > the end of the grace period. > > > > > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread > > > > would awaken 64 leaders, each of which would awaken its 63 followers > > > > at the end of the grace period. This compares favorably with the 79 > > > > wakeups for the grace-period kthread on an 80-CPU system. 
> > > > > > Urgh, how about we kill the entire nocb nonsense and try again? This is > > > getting quite ridiculous. > > > > Sure thing, Peter. > > So you don't think this has gotten a little out of hand? The NOCB stuff > has led to these masses of rcu threads and now you're adding extra > cache misses to the perfectly sane and normal code paths just to deal > with so many threads. Indeed it appears to have gotten a bit out of hand. But let's please attack the real problem rather than the immediate irritant. And in this case, the real problem is that users are getting callback offloading even when there is no reason for it. > And all to support a feature that nearly nobody uses. And you were > talking about making nocb the default rcu... As were others, not that long ago. Today is the first hint that I got that you feel otherwise. But it does look like the softirq approach to callback processing needs to stick around for awhile longer. Nice to hear that softirq is now "sane and normal" again, I guess. ;-) Please see my patch in reply to Rik's email. The idea is to neither rip callback offloading from the kernel nor to keep callback offloading as the default, but instead do callback offloading only for those CPUs specifically marked as NO_HZ_FULL CPUs, or when specifically requested at build time or at boot time. In other words, only do it when it is needed. Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 07/02/2014 08:34 AM, Peter Zijlstra wrote: > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > >> An 80-CPU system with a context-switch-heavy workload can require > >> so many NOCB kthread wakeups that the RCU grace-period kthreads > >> spend several tens of percent of a CPU just awakening things. > >> This clearly will not scale well: If you add enough CPUs, the RCU > >> grace-period kthreads would get behind, increasing grace-period > >> latency. > >> > >> To avoid this problem, this commit divides the NOCB kthreads into > >> leaders and followers, where the grace-period kthreads awaken the > >> leaders each of whom in turn awakens its followers. By default, > >> the number of groups of kthreads is the square root of the number > >> of CPUs, but this default may be overridden using the > >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the > >> number of wakeups done per grace period by the RCU grace-period > >> kthread by the square root of the number of CPUs, but of course > >> by shifting those wakeups to the leaders. In addition, because > >> the leaders do grace periods on behalf of their respective > >> followers, the number of wakeups of the followers decreases by up > >> to a factor of two. Instead of being awakened once when new > >> callbacks arrive and again at the end of the grace period, the > >> followers are awakened only at the end of the grace period. > >> > >> For a numerical example, in a 4096-CPU system, the grace-period > >> kthread would awaken 64 leaders, each of which would awaken its > >> 63 followers at the end of the grace period. This compares > >> favorably with the 79 wakeups for the grace-period kthread on an > >> 80-CPU system. > > > > Urgh, how about we kill the entire nocb nonsense and try again? > > This is getting quite rediculous. > > Some observations. 
> > First, the rcuos/N threads are NOT bound to CPU N at all, but are > free to float through the system. I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n. For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs. > Second, the number of RCU callbacks at the end of each grace period > is quite likely to be small most of the time. > > This suggests that on a system with N CPUs, it may be perfectly > sufficient to have a much smaller number of rcuos threads. > > One thread can probably handle the RCU callbacks for as many as > 16, or even 64 CPUs... In many cases, one thread could handle the RCU callbacks for way more than that. In other cases, a single CPU could keep a single rcuo kthread quite busy. So something dynamic ends up being required. But I suspect that the real solution here is to adjust the Kconfig setup between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot parameters to get callback offloading on systems built with NO_HZ_FULL. Then add some boot-time code so that any CPU that has nohz_full= is forced to also have rcu_nocbs= set. This would have the good effect of applying callback offloading only to those workloads for which it was specifically designed, but allowing those workloads to gain the latency-reduction benefits of callback offloading. I do freely confess that I was hoping that callback offloading might one day completely replace RCU_SOFTIRQ, but that hope now appears to be at best premature. Something like the attached patch. Untested, probably does not even build. Thanx, Paul

rcu: Don't offload callbacks unless specifically requested

Not-yet-signed-off-by: Paul E. McKenney

diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..9332d33346ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -737,7 +737,7 @@ choice
 
 config RCU_NOCB_CPU_NONE
 	bool "No build_forced no-CBs CPUs"
-	depends on RCU_NOCB_CPU && !NO_HZ_FULL
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
 	help
 	  This option does not force any of the CPUs to be no-CBs CPUs.
 	  Only CPUs designated by the rcu_nocbs= boot parameter will be
 
@@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
 
 config RCU_NOCB_CPU_ZERO
 	bool "CPU 0 is a build_forced no-CBs CPU"
-	depends on RCU_NOCB_CPU && !NO_HZ_FULL
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
 	help
 	  This option forces CPU 0 to be a no-CBs CPU, so that its RCU
 	  callbacks are invoked by a per-CPU kthread whose name begins
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 58fbb8204d15..3b150bfcce3d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2473,6 +2473,9 @@ static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 	if (rcu_nocb_mask == NULL)
 		return;
+#ifdef
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote: > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > > > An 80-CPU system with a context-switch-heavy workload can require so > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several > > > tens of percent of a CPU just awakening things. This clearly will not > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would > > > get behind, increasing grace-period latency. > > > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders > > > and followers, where the grace-period kthreads awaken the leaders each of > > > whom in turn awakens its followers. By default, the number of groups of > > > kthreads is the square root of the number of CPUs, but this default may > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > > > This reduces the number of wakeups done per grace period by the RCU > > > grace-period kthread by the square root of the number of CPUs, but of > > > course by shifting those wakeups to the leaders. In addition, because > > > the leaders do grace periods on behalf of their respective followers, > > > the number of wakeups of the followers decreases by up to a factor of two. > > > Instead of being awakened once when new callbacks arrive and again > > > at the end of the grace period, the followers are awakened only at > > > the end of the grace period. > > > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread > > > would awaken 64 leaders, each of which would awaken its 63 followers > > > at the end of the grace period. This compares favorably with the 79 > > > wakeups for the grace-period kthread on an 80-CPU system. > > > > Urgh, how about we kill the entire nocb nonsense and try again? This is > > getting quite rediculous. > > Sure thing, Peter. 
So you don't think this has gotten a little out of hand? The NOCB stuff has led to these masses of rcu threads and now you're adding extra cache misses to the perfectly sane and normal code paths just to deal with so many threads. And all to support a feature that nearly nobody uses. And you were talking about making nocb the default rcu...
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > > An 80-CPU system with a context-switch-heavy workload can require so > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several > > tens of percent of a CPU just awakening things. This clearly will not > > scale well: If you add enough CPUs, the RCU grace-period kthreads would > > get behind, increasing grace-period latency. > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders > > and followers, where the grace-period kthreads awaken the leaders each of > > whom in turn awakens its followers. By default, the number of groups of > > kthreads is the square root of the number of CPUs, but this default may > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > > This reduces the number of wakeups done per grace period by the RCU > > grace-period kthread by the square root of the number of CPUs, but of > > course by shifting those wakeups to the leaders. In addition, because > > the leaders do grace periods on behalf of their respective followers, > > the number of wakeups of the followers decreases by up to a factor of two. > > Instead of being awakened once when new callbacks arrive and again > > at the end of the grace period, the followers are awakened only at > > the end of the grace period. > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread > > would awaken 64 leaders, each of which would awaken its 63 followers > > at the end of the grace period. This compares favorably with the 79 > > wakeups for the grace-period kthread on an 80-CPU system. > > Urgh, how about we kill the entire nocb nonsense and try again? This is > getting quite rediculous. Sure thing, Peter. 
Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On 07/02/2014 08:34 AM, Peter Zijlstra wrote: > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: >> An 80-CPU system with a context-switch-heavy workload can require >> so many NOCB kthread wakeups that the RCU grace-period kthreads >> spend several tens of percent of a CPU just awakening things. >> This clearly will not scale well: If you add enough CPUs, the RCU >> grace-period kthreads would get behind, increasing grace-period >> latency. >> >> To avoid this problem, this commit divides the NOCB kthreads into >> leaders and followers, where the grace-period kthreads awaken the >> leaders each of whom in turn awakens its followers. By default, >> the number of groups of kthreads is the square root of the number >> of CPUs, but this default may be overridden using the >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the >> number of wakeups done per grace period by the RCU grace-period >> kthread by the square root of the number of CPUs, but of course >> by shifting those wakeups to the leaders. In addition, because >> the leaders do grace periods on behalf of their respective >> followers, the number of wakeups of the followers decreases by up >> to a factor of two. Instead of being awakened once when new >> callbacks arrive and again at the end of the grace period, the >> followers are awakened only at the end of the grace period. >> >> For a numerical example, in a 4096-CPU system, the grace-period >> kthread would awaken 64 leaders, each of which would awaken its >> 63 followers at the end of the grace period. This compares >> favorably with the 79 wakeups for the grace-period kthread on an >> 80-CPU system. > > Urgh, how about we kill the entire nocb nonsense and try again? > This is getting quite ridiculous. Some observations. First, the rcuos/N threads are NOT bound to CPU N at all, but are free to float through the system.
Second, the number of RCU callbacks at the end of each grace period is quite likely to be small most of the time. This suggests that on a system with N CPUs, it may be perfectly sufficient to have a much smaller number of rcuos threads. One thread can probably handle the RCU callbacks for as many as 16, or even 64 CPUs... -- All rights reversed
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: > An 80-CPU system with a context-switch-heavy workload can require so > many NOCB kthread wakeups that the RCU grace-period kthreads spend several > tens of percent of a CPU just awakening things. This clearly will not > scale well: If you add enough CPUs, the RCU grace-period kthreads would > get behind, increasing grace-period latency. > > To avoid this problem, this commit divides the NOCB kthreads into leaders > and followers, where the grace-period kthreads awaken the leaders each of > whom in turn awakens its followers. By default, the number of groups of > kthreads is the square root of the number of CPUs, but this default may > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > This reduces the number of wakeups done per grace period by the RCU > grace-period kthread by the square root of the number of CPUs, but of > course by shifting those wakeups to the leaders. In addition, because > the leaders do grace periods on behalf of their respective followers, > the number of wakeups of the followers decreases by up to a factor of two. > Instead of being awakened once when new callbacks arrive and again > at the end of the grace period, the followers are awakened only at > the end of the grace period. > > For a numerical example, in a 4096-CPU system, the grace-period kthread > would awaken 64 leaders, each of which would awaken its 63 followers > at the end of the grace period. This compares favorably with the 79 > wakeups for the grace-period kthread on an 80-CPU system. Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite ridiculous.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite rediculous. pgpXt8_H8LS1W.pgp Description: PGP signature
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 07/02/2014 08:34 AM, Peter Zijlstra wrote: On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite rediculous. Some observations. First, the rcuos/N threads are NOT bound to CPU N at all, but are free to float through the system. Second, the number of RCU callbacks at the end of each grace period is quite likely to be small most of the time. 
This suggests that on a system with N CPUs, it may be perfectly sufficient to have a much smaller number of rcuos threads. One thread can probably handle the RCU callbacks for as many as 16, or even 64 CPUs... - -- All rights reversed -BEGIN PGP SIGNATURE- Version: GnuPG v1 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTtA0rAAoJEM553pKExN6D3IkIAKFZjAWhopcwPppzGWsT7OA/ 3y7fnAvBcJ6AEwE1igzbJyCPjdOJECY/9iUdvuB9CbBD82kyfm4qnuREpdt+hqQp Vi8EDB9UdyI2I42hbfRVOS4NgAl8ZYsDVQ+QiEM1cMp+LqEKPg7adwoNTPQL4eZn ANcNh3B3eSpxnZ+ZbEBYJmQXIVP2S5t5M/EMizqUJEBI2/2zB68eeFkgvuW1yg1a /J4L9w+Iqbu+is+6JK9ibQAR/tTS6Exmuc6RnKDH/nkj1jefKdH1z2p+r69u4AK1 JVDl40lra4n6XHsfjDWDHDXBsiD/JDJJ6Zxf77NWwg1aRT77HZPiMOu98hpFxWs= =OSUn -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite rediculous. Sure thing, Peter. Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote: On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite rediculous. Sure thing, Peter. So you don't think this has gotten a little out of hand? 
The NOCB stuff has led to these masses of rcu threads, and now you're adding extra cache misses to the perfectly sane and normal code paths just to deal with so many threads. And all to support a feature that nearly nobody uses. And you were talking about making nocb the default rcu...
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote: On 07/02/2014 08:34 AM, Peter Zijlstra wrote: On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: [snip: commit message quoted in full above] Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite ridiculous. Some observations. First, the rcuos/N threads are NOT bound to CPU N at all, but are free to float through the system. I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs. Second, the number of RCU callbacks at the end of each grace period is quite likely to be small most of the time. This suggests that on a system with N CPUs, it may be perfectly sufficient to have a much smaller number of rcuos threads. One thread can probably handle the RCU callbacks for as many as 16, or even 64 CPUs... In many cases, one thread could handle the RCU callbacks for way more than that. In other cases, a single CPU could keep a single rcuo kthread quite busy. So something dynamic ends up being required. But I suspect that the real solution here is to adjust the Kconfig setup between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot parameters to get callback offloading on systems built with NO_HZ_FULL. Then add some boot-time code so that any CPU that has nohz_full= is forced to also have rcu_nocbs= set. This would have the good effect of applying callback offloading only to those workloads for which it was specifically designed, but allowing those workloads to gain the latency-reduction benefits of callback offloading. I do freely confess that I was hoping that callback offloading might one day completely replace RCU_SOFTIRQ, but that hope now appears to be at best premature. Something like the attached patch. Untested, probably does not even build. Thanx, Paul

rcu: Don't offload callbacks unless specifically requested

more here soon

Not-yet-signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..9332d33346ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -737,7 +737,7 @@ choice
 config RCU_NOCB_CPU_NONE
 	bool "No build_forced no-CBs CPUs"
-	depends on RCU_NOCB_CPU && !NO_HZ_FULL
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
 	help
 	  This option does not force any of the CPUs to be no-CBs CPUs.
 	  Only CPUs designated by the rcu_nocbs= boot parameter will be
@@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
 config RCU_NOCB_CPU_ZERO
 	bool "CPU 0 is a build_forced no-CBs CPU"
-	depends on RCU_NOCB_CPU && !NO_HZ_FULL
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
 	help
 	  This option forces CPU 0 to be a no-CBs CPU, so that its RCU
 	  callbacks are invoked by a per-CPU kthread whose name begins

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 58fbb8204d15..3b150bfcce3d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2473,6 +2473,9 @@ static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 	if (rcu_nocb_mask == NULL)
 		return;
+#ifdef CONFIG_NO_HZ_FULL
+	cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote: On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote: On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote: On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote: [snip: commit message quoted in full above] Urgh, how about we kill the entire nocb nonsense and try again? This is getting quite ridiculous. Sure thing, Peter. So you don't think this has gotten a little out of hand?
The NOCB stuff has led to these masses of rcu threads and now you're adding extra cache misses to the perfectly sane and normal code paths just to deal with so many threads. Indeed it appears to have gotten a bit out of hand. But let's please attack the real problem rather than the immediate irritant. And in this case, the real problem is that users are getting callback offloading even when there is no reason for it. And all to support a feature that nearly nobody uses. And you were talking about making nocb the default rcu... As were others, not that long ago. Today is the first hint that I got that you feel otherwise. But it does look like the softirq approach to callback processing needs to stick around for awhile longer. Nice to hear that softirq is now sane and normal again, I guess. ;-) Please see my patch in reply to Rik's email. The idea is to neither rip callback offloading from the kernel nor to keep callback offloading as the default, but instead do callback offloading only for those CPUs specifically marked as NO_HZ_FULL CPUs, or when specifically requested at build time or at boot time. In other words, only do it when it is needed. Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote: As were others, not that long ago. Today is the first hint that I got that you feel otherwise. But it does look like the softirq approach to callback processing needs to stick around for awhile longer. Nice to hear that softirq is now sane and normal again, I guess. ;-) Nah, softirqs are still totally annoying :-) So I've lost detail again, but it seems to me that on all CPUs that are actually getting ticks, waking tasks to process the RCU state is entirely overdoing it. Might as well keep processing their RCU state from the tick as was previously done.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On 07/02/2014 01:26 PM, Peter Zijlstra wrote: [snip: quoted in full above] For CPUs that are not getting ticks (eg. because they are idle), is it worth waking up anything on that CPU, or would it make more sense to simply process their RCU callbacks on a different CPU, if there aren't too many pending?
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote: On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote: As were others, not that long ago. Today is the first hint that I got that you feel otherwise. But it does look like the softirq approach to callback processing needs to stick around for awhile longer. Nice to hear that softirq is now sane and normal again, I guess. ;-) Nah, softirqs are still totally annoying :-) Name me one thing that isn't annoying. ;-) So I've lost detail again, but it seems to me that on all CPUs that are actually getting ticks, waking tasks to process the RCU state is entirely overdoing it. Might as well keep processing their RCU state from the tick as was previously done. And that is in fact the approach taken by my patch. For which I just kicked off testing, so expect an update later today. (And that -is- optimistic! A pessimistic viewpoint would hold that the patch would turn out to be so broken that it would take -weeks- to get a fix!) Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote: On 07/02/2014 01:26 PM, Peter Zijlstra wrote: [snip: earlier exchange quoted in full above] For CPUs that are not getting ticks (eg. because they are idle), is it worth waking up anything on that CPU, or would it make more sense to simply process their RCU callbacks on a different CPU, if there aren't too many pending? Give or take the number of wakeups generated... ;-) Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, Jul 02, 2014 at 09:55:56AM -0700, Paul E. McKenney wrote: [snip: earlier message quoted in full above] Something like the attached patch. Untested, probably does not even build. Against all odds, it builds and passes moderate rcutorture testing. Although this doesn't satisfy the desire to wean RCU of softirq, it does allow NO_HZ_FULL kernels to maintain better compatibility with earlier kernel versions, which appears to be more important for the time being. Thanx, Paul [snip: patch quoted in full above]
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: [snip: earlier exchange quoted in full above] Please see my patch in reply to Rik's email. The idea is to neither rip callback offloading from the kernel nor to keep callback offloading as the default, but instead do callback offloading only for those CPUs specifically marked as NO_HZ_FULL CPUs, or when specifically requested at build time or at boot time. In other words, only do it when it is needed. Exactly! Like dynamically, when the user isolates CPUs via the cpuset interface, none of it making much sense without that particular property of a set of CPUs, and cpuset being the manager of CPU set properties. NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go away as being a redundant interface to manage a single property of a set of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to manage a single property of a set of CPUs. What am I missing? -Mike
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote: On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: [snip: earlier exchange quoted in full above] Exactly! Like dynamically, when the user isolates CPUs via the cpuset interface, none of it making much sense without that particular property of a set of CPUs, and cpuset being the manager of CPU set properties. Glad you like it! ;-) NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go away as being a redundant interface to manage a single property of a set of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to manage a single property of a set of CPUs. What am I missing? Well, for now, it can only be specified at build time or at boot time. In theory, it is possible to change a CPU from being callback-offloaded to not at runtime, but there would need to be an extremely good reason for adding that level of complexity. Lots of fun races in there... Thanx, Paul
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote: NO_HZ_FULL is a property of a set of CPUs. isolcpus is supposed to go away as being a redundant interface to manage a single property of a set of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to manage a single property of a set of CPUs. What am I missing? Well, for now, it can only be specified at build time or at boot time. In theory, it is possible to change a CPU from being callback-offloaded to not at runtime, but there would need to be an extremely good reason for adding that level of complexity. Lots of fun races in there... Yeah, understood. (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more usable if it eventually became unified with cpuset and learned how to tap-dance properly;) -Mike
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jun 27, 2014 at 03:13:17PM +, Mathieu Desnoyers wrote:
> - Original Message -
> [snip: mail headers and commit message quoted in full above]
> > If I understand your approach correctly, it looks like the callbacks
> > are moved from the follower threads (per CPU) to leader threads (for
> > a group of CPUs). I'm concerned that moving those callbacks to leader
> > threads would increase cache thrashing, since the callbacks would
> > often be executed from a different CPU than the CPU which enqueued the
> > work. In a case where cgroups/affinity are used to pin kthreads to
> > specific CPUs to minimize cache thrashing, I'm concerned that this
> > approach could degrade performance.
> >
> > Would there be another way to distribute the wakeups that would keep
> > callbacks local to their enqueuing CPU?
> >
> > Or am I missing something important?
>
> What I appear to have missed is that the leader moves the callbacks
> from the follower's structure _to_ another list within that same
> follower's structure, which ensures that callbacks are executed
> locally by the follower.
>
> Sorry for the noise. ;-)

Not a problem, and thank you for looking at the patch! Thanx, Paul

> Thanks,
>
> Mathieu
>
> > Thanks,
> >
> > Mathieu
> >
> > > Reported-by: Rik van Riel
> > > Signed-off-by: Paul E. McKenney
> > >
> > > diff --git a/Documentation/kernel-parameters.txt
> > > b/Documentation/kernel-parameters.txt
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jun 27, 2014 at 03:01:27PM +, Mathieu Desnoyers wrote: > - Original Message - > > From: "Paul E. McKenney" > > To: linux-kernel@vger.kernel.org, r...@redhat.com > > Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, > > a...@linux-foundation.org, "mathieu desnoyers" > > , j...@joshtriplett.org, n...@us.ibm.com, > > t...@linutronix.de, pet...@infradead.org, > > rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, > > dvh...@linux.intel.com, fweis...@gmail.com, > > o...@redhat.com, s...@mit.edu > > Sent: Friday, June 27, 2014 10:20:38 AM > > Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread > > wakeups > > > > An 80-CPU system with a context-switch-heavy workload can require so > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several > > tens of percent of a CPU just awakening things. This clearly will not > > scale well: If you add enough CPUs, the RCU grace-period kthreads would > > get behind, increasing grace-period latency. > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders > > and followers, where the grace-period kthreads awaken the leaders each of > > whom in turn awakens its followers. By default, the number of groups of > > kthreads is the square root of the number of CPUs, but this default may > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > > This reduces the number of wakeups done per grace period by the RCU > > grace-period kthread by the square root of the number of CPUs, but of > > course by shifting those wakeups to the leaders. In addition, because > > the leaders do grace periods on behalf of their respective followers, > > the number of wakeups of the followers decreases by up to a factor of two. > > Instead of being awakened once when new callbacks arrive and again > > at the end of the grace period, the followers are awakened only at > > the end of the grace period. 
> > > > For a numerical example, in a 4096-CPU system, the grace-period kthread > > would awaken 64 leaders, each of which would awaken its 63 followers > > at the end of the grace period. This compares favorably with the 79 > > wakeups for the grace-period kthread on an 80-CPU system. > > If I understand your approach correctly, it looks like the callbacks > are moved from the follower threads (per CPU) to leader threads (for > a group of CPUs). Not quite. The leader thread does the moving, but the callbacks live on the follower's data structure. There probably are some added cache misses, but I believe that this is more than made up for by the decrease in wakeups -- the followers are normally awakened only once rather than twice per grace period. > I'm concerned that moving those callbacks to leader > threads would increase cache trashing, since the callbacks would be > often executed from a different CPU than the CPU which enqueued the > work. In a case where cgroups/affinity are used to pin kthreads to > specific CPUs to minimize cache trashing, I'm concerned that this > approach could degrade performance. > > Would there be another way to distribute the wake up that would keep > callbacks local to their enqueuing CPU ? > > Or am I missing something important ? One problem is that each leader must look at its followers' state in order to determine who needs to be awakened. So there are some added cache misses no matter how you redistribute the wakeups. Thanx, Paul > Thanks, > > Mathieu > > > > > Reported-by: Rik van Riel > > Signed-off-by: Paul E. McKenney > > > > diff --git a/Documentation/kernel-parameters.txt > > b/Documentation/kernel-parameters.txt > > index 6eaa9cdb7094..affed6434ec8 100644 > > --- a/Documentation/kernel-parameters.txt > > +++ b/Documentation/kernel-parameters.txt > > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be > > entirely omitted. > > quiescent states. 
Units are jiffies, minimum > > value is one, and maximum value is HZ. > > > > + rcutree.rcu_nocb_leader_stride= [KNL] > > + Set the number of NOCB kthread groups, which > > + defaults to the square root of the number of > > + CPUs. Larger numbers reduces the wakeup overhead > > + on the per-CPU grace-period kthreads, but increases > > + that same overhead on each group's leader. > > + > > rcutree.
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
- Original Message - > From: "Mathieu Desnoyers" > To: paul...@linux.vnet.ibm.com > Cc: linux-kernel@vger.kernel.org, r...@redhat.com, mi...@kernel.org, > la...@cn.fujitsu.com, dipan...@in.ibm.com, > a...@linux-foundation.org, j...@joshtriplett.org, n...@us.ibm.com, > t...@linutronix.de, pet...@infradead.org, > rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, > dvh...@linux.intel.com, fweis...@gmail.com, > o...@redhat.com, s...@mit.edu > Sent: Friday, June 27, 2014 11:01:27 AM > Subject: Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread > wakeups > > - Original Message - > > From: "Paul E. McKenney" > > To: linux-kernel@vger.kernel.org, r...@redhat.com > > Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, > > a...@linux-foundation.org, "mathieu desnoyers" > > , j...@joshtriplett.org, n...@us.ibm.com, > > t...@linutronix.de, pet...@infradead.org, > > rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, > > dvh...@linux.intel.com, fweis...@gmail.com, > > o...@redhat.com, s...@mit.edu > > Sent: Friday, June 27, 2014 10:20:38 AM > > Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread > > wakeups > > > > An 80-CPU system with a context-switch-heavy workload can require so > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several > > tens of percent of a CPU just awakening things. This clearly will not > > scale well: If you add enough CPUs, the RCU grace-period kthreads would > > get behind, increasing grace-period latency. > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders > > and followers, where the grace-period kthreads awaken the leaders each of > > whom in turn awakens its followers. By default, the number of groups of > > kthreads is the square root of the number of CPUs, but this default may > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. 
> > This reduces the number of wakeups done per grace period by the RCU > > grace-period kthread by the square root of the number of CPUs, but of > > course by shifting those wakeups to the leaders. In addition, because > > the leaders do grace periods on behalf of their respective followers, > > the number of wakeups of the followers decreases by up to a factor of two. > > Instead of being awakened once when new callbacks arrive and again > > at the end of the grace period, the followers are awakened only at > > the end of the grace period. > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread > > would awaken 64 leaders, each of which would awaken its 63 followers > > at the end of the grace period. This compares favorably with the 79 > > wakeups for the grace-period kthread on an 80-CPU system. > > If I understand your approach correctly, it looks like the callbacks > are moved from the follower threads (per CPU) to leader threads (for > a group of CPUs). I'm concerned that moving those callbacks to leader > threads would increase cache trashing, since the callbacks would be > often executed from a different CPU than the CPU which enqueued the > work. In a case where cgroups/affinity are used to pin kthreads to > specific CPUs to minimize cache trashing, I'm concerned that this > approach could degrade performance. > > Would there be another way to distribute the wake up that would keep > callbacks local to their enqueuing CPU ? > > Or am I missing something important ? What I appear to have missed is that the leader moves the callbacks from the follower's structure _to_ another list within that same follower's structure, which ensures that callbacks are executed locally by the follower. Sorry for the noise. ;-) Thanks, Mathieu > > Thanks, > > Mathieu > > > > > Reported-by: Rik van Riel > > Signed-off-by: Paul E. 
McKenney > > > > diff --git a/Documentation/kernel-parameters.txt > > b/Documentation/kernel-parameters.txt > > index 6eaa9cdb7094..affed6434ec8 100644 > > --- a/Documentation/kernel-parameters.txt > > +++ b/Documentation/kernel-parameters.txt > > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be > > entirely omitted. > > quiescent states. Units are jiffies, minimum > > value is one, and maximum value is HZ. > > > > + rcutree.rcu_nocb_leader_stride= [KNL] > > + Set the number of NOCB kthread groups, which > > + defaults to the square root of the number of > > +
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
- Original Message - > From: "Paul E. McKenney" > To: linux-kernel@vger.kernel.org, r...@redhat.com > Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, > a...@linux-foundation.org, "mathieu desnoyers" > , j...@joshtriplett.org, n...@us.ibm.com, > t...@linutronix.de, pet...@infradead.org, > rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, > dvh...@linux.intel.com, fweis...@gmail.com, > o...@redhat.com, s...@mit.edu > Sent: Friday, June 27, 2014 10:20:38 AM > Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread > wakeups > > An 80-CPU system with a context-switch-heavy workload can require so > many NOCB kthread wakeups that the RCU grace-period kthreads spend several > tens of percent of a CPU just awakening things. This clearly will not > scale well: If you add enough CPUs, the RCU grace-period kthreads would > get behind, increasing grace-period latency. > > To avoid this problem, this commit divides the NOCB kthreads into leaders > and followers, where the grace-period kthreads awaken the leaders each of > whom in turn awakens its followers. By default, the number of groups of > kthreads is the square root of the number of CPUs, but this default may > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. > This reduces the number of wakeups done per grace period by the RCU > grace-period kthread by the square root of the number of CPUs, but of > course by shifting those wakeups to the leaders. In addition, because > the leaders do grace periods on behalf of their respective followers, > the number of wakeups of the followers decreases by up to a factor of two. > Instead of being awakened once when new callbacks arrive and again > at the end of the grace period, the followers are awakened only at > the end of the grace period. 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread > would awaken 64 leaders, each of which would awaken its 63 followers > at the end of the grace period. This compares favorably with the 79 > wakeups for the grace-period kthread on an 80-CPU system. If I understand your approach correctly, it looks like the callbacks are moved from the follower threads (per CPU) to leader threads (for a group of CPUs). I'm concerned that moving those callbacks to leader threads would increase cache trashing, since the callbacks would be often executed from a different CPU than the CPU which enqueued the work. In a case where cgroups/affinity are used to pin kthreads to specific CPUs to minimize cache trashing, I'm concerned that this approach could degrade performance. Would there be another way to distribute the wake up that would keep callbacks local to their enqueuing CPU ? Or am I missing something important ? Thanks, Mathieu > > Reported-by: Rik van Riel > Signed-off-by: Paul E. McKenney > > diff --git a/Documentation/kernel-parameters.txt > b/Documentation/kernel-parameters.txt > index 6eaa9cdb7094..affed6434ec8 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be > entirely omitted. > quiescent states. Units are jiffies, minimum > value is one, and maximum value is HZ. > > + rcutree.rcu_nocb_leader_stride= [KNL] > + Set the number of NOCB kthread groups, which > + defaults to the square root of the number of > + CPUs. Larger numbers reduces the wakeup overhead > + on the per-CPU grace-period kthreads, but increases > + that same overhead on each group's leader. > + > rcutree.qhimark= [KNL] > Set threshold of queued RCU callbacks beyond which > batch limiting is disabled. 
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index bf2c1e669691..de12fa5a860b 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -331,11 +331,29 @@ struct rcu_data { > struct rcu_head **nocb_tail; > atomic_long_t nocb_q_count; /* # CBs waiting for kthread */ > atomic_long_t nocb_q_count_lazy; /* (approximate). */ > + struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ > + struct rcu_head **nocb_follower_tail; > + atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */ > + atomic_long_t nocb_follower_count_lazy; /* (approximate). */ > int nocb_p_count; /* # CBs being invoked by kthread */ > int nocb_p_count_lazy; /* (approximate). */ > wait_queue_head_t nocb_wq; /* For nocb kthreads to slee
[PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Reported-by: Rik van Riel Signed-off-by: Paul E. McKenney diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6eaa9cdb7094..affed6434ec8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. quiescent states. Units are jiffies, minimum value is one, and maximum value is HZ. 
+ rcutree.rcu_nocb_leader_stride= [KNL] + Set the number of NOCB kthread groups, which + defaults to the square root of the number of + CPUs. Larger numbers reduces the wakeup overhead + on the per-CPU grace-period kthreads, but increases + that same overhead on each group's leader. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bf2c1e669691..de12fa5a860b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -331,11 +331,29 @@ struct rcu_data { struct rcu_head **nocb_tail; atomic_long_t nocb_q_count; /* # CBs waiting for kthread */ atomic_long_t nocb_q_count_lazy; /* (approximate). */ + struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ + struct rcu_head **nocb_follower_tail; + atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */ + atomic_long_t nocb_follower_count_lazy; /* (approximate). */ int nocb_p_count; /* # CBs being invoked by kthread */ int nocb_p_count_lazy; /* (approximate). */ wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_kthread; bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ + + /* The following fields are used by the leader, hence own cacheline. */ + struct rcu_head *nocb_gp_head cacheline_internodealigned_in_smp; + /* CBs waiting for GP. */ + struct rcu_head **nocb_gp_tail; + long nocb_gp_count; + long nocb_gp_count_lazy; + bool nocb_leader_wake; /* Is the nocb leader thread awake? */ + struct rcu_data *nocb_next_follower; + /* Next follower in wakeup chain. */ + + /* The following fields are used by the follower, hence new cachline. */ + struct rcu_data *nocb_leader cacheline_internodealigned_in_smp; + /* Leader CPU takes GP-end wakeups. */ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ /* 8) RCU CPU stall data. */ @@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp); /* Sum up queue lengths for tracing. 
*/ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll) { - *ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count; - *qll = atomic_long_read(&rdp->nocb_q_count_lazy) + rdp->nocb_p_count_lazy; + *ql = atomic_long_read(&rdp->nocb_q_count) + + rdp->nocb_p_count + + atomic_long_read(&rdp->nocb_follower_count) + + rdp->nocb_p_count + rdp->nocb_gp_count; + *qll = atomic_long_read(&rdp->nocb_q_count_lazy) + + rdp->nocb_p_count_lazy + +
[PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Reported-by: Rik van Riel r...@redhat.com Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6eaa9cdb7094..affed6434ec8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. quiescent states. Units are jiffies, minimum value is one, and maximum value is HZ. 
+ rcutree.rcu_nocb_leader_stride= [KNL] + Set the number of NOCB kthread groups, which + defaults to the square root of the number of + CPUs. Larger numbers reduces the wakeup overhead + on the per-CPU grace-period kthreads, but increases + that same overhead on each group's leader. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bf2c1e669691..de12fa5a860b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -331,11 +331,29 @@ struct rcu_data { struct rcu_head **nocb_tail; atomic_long_t nocb_q_count; /* # CBs waiting for kthread */ atomic_long_t nocb_q_count_lazy; /* (approximate). */ + struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ + struct rcu_head **nocb_follower_tail; + atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */ + atomic_long_t nocb_follower_count_lazy; /* (approximate). */ int nocb_p_count; /* # CBs being invoked by kthread */ int nocb_p_count_lazy; /* (approximate). */ wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_kthread; bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ + + /* The following fields are used by the leader, hence own cacheline. */ + struct rcu_head *nocb_gp_head cacheline_internodealigned_in_smp; + /* CBs waiting for GP. */ + struct rcu_head **nocb_gp_tail; + long nocb_gp_count; + long nocb_gp_count_lazy; + bool nocb_leader_wake; /* Is the nocb leader thread awake? */ + struct rcu_data *nocb_next_follower; + /* Next follower in wakeup chain. */ + + /* The following fields are used by the follower, hence new cachline. */ + struct rcu_data *nocb_leader cacheline_internodealigned_in_smp; + /* Leader CPU takes GP-end wakeups. */ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ /* 8) RCU CPU stall data. */ @@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp); /* Sum up queue lengths for tracing. 
*/ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll) { - *ql = atomic_long_read(rdp-nocb_q_count) + rdp-nocb_p_count; - *qll = atomic_long_read(rdp-nocb_q_count_lazy) + rdp-nocb_p_count_lazy; + *ql = atomic_long_read(rdp-nocb_q_count) + + rdp-nocb_p_count + + atomic_long_read(rdp-nocb_follower_count) + + rdp-nocb_p_count + rdp-nocb_gp_count; + *qll = atomic_long_read(rdp-nocb_q_count_lazy) + +
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
- Original Message - From: Paul E. McKenney paul...@linux.vnet.ibm.com To: linux-kernel@vger.kernel.org, r...@redhat.com Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, a...@linux-foundation.org, mathieu desnoyers mathieu.desnoy...@efficios.com, j...@joshtriplett.org, n...@us.ibm.com, t...@linutronix.de, pet...@infradead.org, rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, dvh...@linux.intel.com, fweis...@gmail.com, o...@redhat.com, s...@mit.edu Sent: Friday, June 27, 2014 10:20:38 AM Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. 
This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. If I understand your approach correctly, it looks like the callbacks are moved from the follower threads (per CPU) to leader threads (for a group of CPUs). I'm concerned that moving those callbacks to leader threads would increase cache trashing, since the callbacks would be often executed from a different CPU than the CPU which enqueued the work. In a case where cgroups/affinity are used to pin kthreads to specific CPUs to minimize cache trashing, I'm concerned that this approach could degrade performance. Would there be another way to distribute the wake up that would keep callbacks local to their enqueuing CPU ? Or am I missing something important ? Thanks, Mathieu Reported-by: Rik van Riel r...@redhat.com Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6eaa9cdb7094..affed6434ec8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. quiescent states. Units are jiffies, minimum value is one, and maximum value is HZ. + rcutree.rcu_nocb_leader_stride= [KNL] + Set the number of NOCB kthread groups, which + defaults to the square root of the number of + CPUs. Larger numbers reduces the wakeup overhead + on the per-CPU grace-period kthreads, but increases + that same overhead on each group's leader. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bf2c1e669691..de12fa5a860b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -331,11 +331,29 @@ struct rcu_data { struct rcu_head **nocb_tail; atomic_long_t nocb_q_count; /* # CBs waiting for kthread */ atomic_long_t nocb_q_count_lazy; /* (approximate). 
*/ + struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ + struct rcu_head **nocb_follower_tail; + atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */ + atomic_long_t nocb_follower_count_lazy; /* (approximate). */ int nocb_p_count; /* # CBs being invoked by kthread */ int nocb_p_count_lazy; /* (approximate). */ wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_kthread; bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ + + /* The following fields are used by the leader, hence own cacheline. */ + struct rcu_head *nocb_gp_head
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
- Original Message - From: Mathieu Desnoyers mathieu.desnoy...@efficios.com To: paul...@linux.vnet.ibm.com Cc: linux-kernel@vger.kernel.org, r...@redhat.com, mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, a...@linux-foundation.org, j...@joshtriplett.org, n...@us.ibm.com, t...@linutronix.de, pet...@infradead.org, rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, dvh...@linux.intel.com, fweis...@gmail.com, o...@redhat.com, s...@mit.edu Sent: Friday, June 27, 2014 11:01:27 AM Subject: Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups - Original Message - From: Paul E. McKenney paul...@linux.vnet.ibm.com To: linux-kernel@vger.kernel.org, r...@redhat.com Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, a...@linux-foundation.org, mathieu desnoyers mathieu.desnoy...@efficios.com, j...@joshtriplett.org, n...@us.ibm.com, t...@linutronix.de, pet...@infradead.org, rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, dvh...@linux.intel.com, fweis...@gmail.com, o...@redhat.com, s...@mit.edu Sent: Friday, June 27, 2014 10:20:38 AM Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. 
This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. If I understand your approach correctly, it looks like the callbacks are moved from the follower threads (per CPU) to leader threads (for a group of CPUs). I'm concerned that moving those callbacks to leader threads would increase cache trashing, since the callbacks would be often executed from a different CPU than the CPU which enqueued the work. In a case where cgroups/affinity are used to pin kthreads to specific CPUs to minimize cache trashing, I'm concerned that this approach could degrade performance. Would there be another way to distribute the wake up that would keep callbacks local to their enqueuing CPU ? Or am I missing something important ? What I appear to have missed is that the leader moves the callbacks from the follower's structure _to_ another list within that same follower's structure, which ensures that callbacks are executed locally by the follower. Sorry for the noise. ;-) Thanks, Mathieu Thanks, Mathieu Reported-by: Rik van Riel r...@redhat.com Signed-off-by: Paul E. 
McKenney paul...@linux.vnet.ibm.com diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6eaa9cdb7094..affed6434ec8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. quiescent states. Units are jiffies, minimum value is one, and maximum value is HZ. + rcutree.rcu_nocb_leader_stride= [KNL] + Set the number of NOCB kthread groups, which + defaults to the square root of the number of + CPUs. Larger numbers reduces the wakeup overhead + on the per-CPU grace-period kthreads, but increases + that same overhead on each group's leader. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bf2c1e669691..de12fa5a860b 100644 --- a/kernel/rcu/tree.h +++ b
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jun 27, 2014 at 03:13:17PM +, Mathieu Desnoyers wrote: - Original Message - From: Mathieu Desnoyers mathieu.desnoy...@efficios.com To: paul...@linux.vnet.ibm.com Cc: linux-kernel@vger.kernel.org, r...@redhat.com, mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, a...@linux-foundation.org, j...@joshtriplett.org, n...@us.ibm.com, t...@linutronix.de, pet...@infradead.org, rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, dvh...@linux.intel.com, fweis...@gmail.com, o...@redhat.com, s...@mit.edu Sent: Friday, June 27, 2014 11:01:27 AM Subject: Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups - Original Message - From: Paul E. McKenney paul...@linux.vnet.ibm.com To: linux-kernel@vger.kernel.org, r...@redhat.com Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, a...@linux-foundation.org, mathieu desnoyers mathieu.desnoy...@efficios.com, j...@joshtriplett.org, n...@us.ibm.com, t...@linutronix.de, pet...@infradead.org, rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, dvh...@linux.intel.com, fweis...@gmail.com, o...@redhat.com, s...@mit.edu Sent: Friday, June 27, 2014 10:20:38 AM Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. 
> > > This reduces the number of wakeups done per grace period by the
> > > RCU grace-period kthread by the square root of the number of CPUs,
> > > but of course by shifting those wakeups to the leaders.  In
> > > addition, because the leaders do grace periods on behalf of their
> > > respective followers, the number of wakeups of the followers
> > > decreases by up to a factor of two.  Instead of being awakened once
> > > when new callbacks arrive and again at the end of the grace period,
> > > the followers are awakened only at the end of the grace period.
> > >
> > > For a numerical example, in a 4096-CPU system, the grace-period
> > > kthread would awaken 64 leaders, each of which would awaken its 63
> > > followers at the end of the grace period.  This compares favorably
> > > with the 79 wakeups for the grace-period kthread on an 80-CPU
> > > system.
> >
> > If I understand your approach correctly, it looks like the callbacks
> > are moved from the follower threads (per CPU) to leader threads (for
> > a group of CPUs).  I'm concerned that moving those callbacks to
> > leader threads would increase cache thrashing, since the callbacks
> > would often be executed from a different CPU than the CPU which
> > enqueued the work.  In a case where cgroups/affinity are used to pin
> > kthreads to specific CPUs to minimize cache thrashing, I'm concerned
> > that this approach could degrade performance.  Would there be another
> > way to distribute the wakeups that would keep callbacks local to
> > their enqueuing CPU?  Or am I missing something important?
>
> What I appear to have missed is that the leader moves the callbacks
> from the follower's structure _to_ another list within that same
> follower's structure, which ensures that callbacks are executed
> locally by the follower.  Sorry for the noise.  ;-)

Not a problem, and thank you for looking at the patch!

							Thanx, Paul

> Thanks,
> Mathieu
>
> > Thanks,
> > Mathieu
> >
> > > Reported-by: Rik van Riel <r...@redhat.com>
> > > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > >
> > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > > index 6eaa9cdb7094..affed6434ec8 100644
> > > --- a/Documentation/kernel-parameters.txt
> > > +++ b/Documentation/kernel-parameters.txt
> > > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > > 			quiescent states.  Units are jiffies, minimum
> > > 			value is one, and maximum value is HZ.
> > > +	rcutree.rcu_nocb_leader_stride= [KNL]
> > > +			Set the number of NOCB kthread groups, which
> > > +			defaults to the square root of the number of
> > > +			CPUs.  Larger numbers reduce the wakeup overhead
> > > +			on the per-CPU grace-period kthreads, but increase
> > > +			that same overhead on each group's leader
Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
On Fri, Jun 27, 2014 at 03:01:27PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> From: Paul E. McKenney
> Sent: Friday, June 27, 2014 10:20:38 AM
> Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
>
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend
> > several tens of percent of a CPU just awakening things.  This clearly
> > will not scale well: If you add enough CPUs, the RCU grace-period
> > kthreads would get behind, increasing grace-period latency.
> >
> > To avoid this problem, this commit divides the NOCB kthreads into
> > leaders and followers, where the grace-period kthreads awaken the
> > leaders, each of whom in turn awakens its followers.  By default, the
> > number of groups of kthreads is the square root of the number of
> > CPUs, but this default may be overridden using the
> > rcutree.rcu_nocb_leader_stride boot parameter.
> >
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition,
> > because the leaders do grace periods on behalf of their respective
> > followers, the number of wakeups of the followers decreases by up to
> > a factor of two.  Instead of being awakened once when new callbacks
> > arrive and again at the end of the grace period, the followers are
> > awakened only at the end of the grace period.
> > For a numerical example, in a 4096-CPU system, the grace-period
> > kthread would awaken 64 leaders, each of which would awaken its 63
> > followers at the end of the grace period.  This compares favorably
> > with the 79 wakeups for the grace-period kthread on an 80-CPU system.
>
> If I understand your approach correctly, it looks like the callbacks
> are moved from the follower threads (per CPU) to leader threads (for a
> group of CPUs).

Not quite.  The leader thread does the moving, but the callbacks live
on the follower's data structure.  There probably are some added cache
misses, but I believe that this is more than made up for by the
decrease in wakeups -- the followers are normally awakened only once
rather than twice per grace period.

> I'm concerned that moving those callbacks to leader threads would
> increase cache thrashing, since the callbacks would often be executed
> from a different CPU than the CPU which enqueued the work.  In a case
> where cgroups/affinity are used to pin kthreads to specific CPUs to
> minimize cache thrashing, I'm concerned that this approach could
> degrade performance.  Would there be another way to distribute the
> wakeups that would keep callbacks local to their enqueuing CPU?  Or am
> I missing something important?

One problem is that each leader must look at its followers' state in
order to determine who needs to be awakened.  So there are some added
cache misses no matter how you redistribute the wakeups.

							Thanx, Paul

> Thanks,
> Mathieu
>
> > Reported-by: Rik van Riel <r...@redhat.com>
> > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> >
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 6eaa9cdb7094..affed6434ec8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > 			quiescent states.  Units are jiffies, minimum
> > 			value is one, and maximum value is HZ.
> > +	rcutree.rcu_nocb_leader_stride= [KNL]
> > +			Set the number of NOCB kthread groups, which
> > +			defaults to the square root of the number of
> > +			CPUs.  Larger numbers reduce the wakeup overhead
> > +			on the per-CPU grace-period kthreads, but increase
> > +			that same overhead on each group's leader.
> > +
> > 	rcutree.qhimark= [KNL]
> > 			Set threshold of queued RCU callbacks beyond which
> > 			batch limiting is disabled.
> >
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index bf2c1e669691..de12fa5a860b 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -331,11 +331,29 @@ struct rcu_data {
> > 	struct rcu_head **nocb_tail;
> > 	atomic_long_t nocb_q_count;	 /* # CBs waiting for kthread */
> > 	atomic_long_t nocb_q_count_lazy; /* (approximate