Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-08 Thread Paul E. McKenney
On Thu, Jul 03, 2014 at 03:12:17PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > > As were others, not that long ago.  Today is the first hint that I got
> > > > that you feel otherwise.  But it does look like the softirq approach to
> > > > callback processing needs to stick around for awhile longer.  Nice to
> > > > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > > 
> > > Nah, softirqs are still totally annoying :-)
> > 
> > Name me one thing that isn't annoying.  ;-)
> > 
> > > So I've lost detail again, but it seems to me that on all CPUs that are
> > > actually getting ticks, waking tasks to process the RCU state is
> > > entirely overdoing it. Might as well keep processing their RCU state
> > > from the tick as was previously done.
> > 
> > And that is in fact the approach taken by my patch.  For which I just
> > kicked off testing, so expect an update later today.  (And that -is-
> > optimistic!  A pessimistic viewpoint would hold that the patch would
> > turn out to be so broken that it would take -weeks- to get a fix!)
> 
> Right, but as you told Mike it's not really dynamic, but of course we can
> work on that.

If it is actually needed by someone, then I would be happy to work on it.
But all I see now is people asserting that it should be provided, without
any real justification.

> That said; I'm somewhat confused on the whole nocb thing. So the way I
> see things there's two things that need doing:
> 
>  1) push the state machine
>  2) run callbacks
> 
> It seems to me the nocb threads do both, and somehow some of this is
> getting conflated. Because afaik RCU only uses softirqs for (2), since
> (1) is fully done from the tick -- well, it used to be, before all this.

Well, you do need a finer-grained view of the RCU state machine:

1a. Registering the need for a future grace period.
1b. Self-reporting of quiescent states (softirq).
1c. Reporting of other CPUs' quiescent states (grace-period kthread).
    This includes idle CPUs, userspace nohz_full CPUs, and CPUs that
    just now transitioned to offline.
1d. Kicking CPUs that have not yet reported a quiescent state
    (also grace-period kthread).
2.  Running callbacks (softirq, or, for RCU_NOCB_CPU, rcuo kthread).

And here (1a) is done via softirq in the non-nocb case and via the rcuo
kthreads in the nocb case.

And yes, RCU's softirq processing is normally done from the tick.

> Now, IIRC rcu callbacks are not guaranteed to run on whatever cpu
> they're queued on, so we can 'easily' splice the actual callback list
> into some other CPU's callback list. Which leaves only (1) to actually
> 'do'.

True, although the 'easily' part needs to take into account the fact
that the RCU callbacks from a given CPU must be invoked in order.
Or rcu_barrier() needs to find a different way to guarantee that all
previously registered callbacks have been invoked, as the case may be.
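
For illustration only, here is a user-space sketch of the list splicing
being discussed (the names are made up and this is not the kernel's
rcu_head machinery): appending one CPU's callback list onto the tail of
another's preserves per-CPU invocation order, so a marker callback queued
after existing callbacks cannot run before them, which is the property
rcu_barrier() depends on.

	struct cb {
		struct cb *next;
		void (*func)(struct cb *);
	};

	struct cb_list {
		struct cb *head;
		struct cb **tail;	/* points at the final ->next pointer */
	};

	/* Append all of @from onto @to without reordering either list. */
	static void cb_list_splice_tail(struct cb_list *to, struct cb_list *from)
	{
		if (!from->head)
			return;			/* nothing to move */
		*to->tail = from->head;		/* old tail now links to donor's head */
		to->tail = from->tail;		/* donor's tail becomes the new tail */
		from->head = NULL;		/* leave the donor list empty */
		from->tail = &from->head;
	}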

> Yet the whole thing is called after the 'no-callback' thing, even though
> the most important part is pushing the state machine remotely.

Well, you do have to do both.  Pushing the state machine doesn't help
unless you also invoke the RCU callbacks.

> Now I can see we probably don't want to actually push remote CPUs
> their RCU state from IRQ context, but we could, I think, drive the state
> machine remotely. And we want to avoid overloading one CPU with the work
> of all others, which is I think still a fundamental issue with the whole
> nohz_full thing: it reverts to the _one_ timekeeper CPU, and on big
> enough systems that'll be a problem.

Well, RCU already pushes the remote CPU's RCU state remotely via
RCU's dynticks setup.  But you are quite right, dumping all of the RCU
processing onto one CPU can be a bottleneck on large systems (which
Fengguang's tests noted, by the way), and this is the reason for patch
11/17 in the fixes series (https://lkml.org/lkml/2014/7/7/990).  This
patch allows housekeeping kthreads like the grace-period kthreads to
use a new housekeeping_affine() function to bind themselves onto the
non-nohz_full CPUs.  The system can be booted with the desired number
of housekeeping CPUs using the nohz_full= boot parameter.
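
For concreteness, a hypothetical example (the boot parameters are real,
the CPU numbers are made up): booting a 16-CPU box with

	nohz_full=2-15 rcu_nocbs=2-15

leaves CPUs 0 and 1 as housekeeping CPUs, so kthreads that bind themselves
via housekeeping_affine() end up on CPUs 0-1, while CPUs 2-15 can run
tick-free with their callbacks offloaded to rcuo kthreads.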

However, it is not clear to me that having only one timekeeping CPU
(as opposed to having only one housekeeping CPU) is a real problem,
even for very large systems.  If it does turn out to be a real problem,
the sysidle code will probably need to change as well.

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-08 Thread Paul E. McKenney
On Thu, Jul 03, 2014 at 11:50:09AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> > Name me one thing that isn't annoying.  ;-)
> 
> A cold beverage on a warm day :-)

Fair point...  Not clear how to work that into the RCU implementation,
though.  I suppose I could add comments instructing the reader to consume
a cold beverage.  Or I could consume a cold beverage while working on
RCU.  ;-)

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-08 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > As were others, not that long ago.  Today is the first hint that I got
> > > that you feel otherwise.  But it does look like the softirq approach to
> > > callback processing needs to stick around for awhile longer.  Nice to
> > > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > 
> > Nah, softirqs are still totally annoying :-)
> 
> Name me one thing that isn't annoying.  ;-)
> 
> > So I've lost detail again, but it seems to me that on all CPUs that are
> > actually getting ticks, waking tasks to process the RCU state is
> > entirely overdoing it. Might as well keep processing their RCU state
> > from the tick as was previously done.
> 
> And that is in fact the approach taken by my patch.  For which I just
> kicked off testing, so expect an update later today.  (And that -is-
> optimistic!  A pessimistic viewpoint would hold that the patch would
> turn out to be so broken that it would take -weeks- to get a fix!)

Right, but as you told Mike it's not really dynamic, but of course we can
work on that.

That said; I'm somewhat confused on the whole nocb thing. So the way I
see things there's two things that need doing:

 1) push the state machine
 2) run callbacks

It seems to me the nocb threads do both, and somehow some of this is
getting conflated. Because afaik RCU only uses softirqs for (2), since
(1) is fully done from the tick -- well, it used to be, before all this.

Now, IIRC rcu callbacks are not guaranteed to run on whatever cpu
they're queued on, so we can 'easily' splice the actual callback list
into some other CPU's callback list. Which leaves only (1) to actually
'do'.

Yet the whole thing is called after the 'no-callback' thing, even though
the most important part is pushing the state machine remotely.

Now I can see we probably don't want to actually push remote CPUs
their RCU state from IRQ context, but we could, I think, drive the state
machine remotely. And we want to avoid overloading one CPU with the work
of all others, which is I think still a fundamental issue with the whole
nohz_full thing: it reverts to the _one_ timekeeper CPU, and on big
enough systems that'll be a problem.




Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-08 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> >> As were others, not that long ago.  Today is the first hint that
> >> I got that you feel otherwise.  But it does look like the softirq
> >> approach to callback processing needs to stick around for awhile
> >> longer.  Nice to hear that softirq is now "sane and normal"
> >> again, I guess.  ;-)
> > 
> > Nah, softirqs are still totally annoying :-)
> > 
> > So I've lost detail again, but it seems to me that on all CPUs that
> > are actually getting ticks, waking tasks to process the RCU state
> > is entirely overdoing it. Might as well keep processing their RCU
> > state from the tick as was previously done.
> 
> For CPUs that are not getting ticks (eg. because they are idle),
> is it worth waking up anything on that CPU, or would it make more
> sense to simply process their RCU callbacks on a different CPU,
> if there aren't too many pending?

If they're idle, RCU 'should' know this and exclude them from the state
machine. The tricky part, and where all this nocb nonsense started, is
the NOHZ_FULL stuff where a CPU can be !idle but still not get ticks.


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-08 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> Name me one thing that isn't annoying.  ;-)

A cold beverage on a warm day :-)


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-05 Thread Frederic Weisbecker
On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> 
> > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > away as being a redundant interface to manage a single property of a set
> > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > manage a single property of a set of CPUs.  What am I missing? 
> > 
> > Well, for now, it can only be specified at build time or at boot time.
> > In theory, it is possible to change a CPU from being callback-offloaded
> > to not at runtime, but there would need to be an extremely good reason
> > for adding that level of complexity.  Lots of "fun" races in there...
> 
> Yeah, understood.
> 
> (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> usable if it eventually became unified with cpuset and learned how to
> tap-dance properly;)

Well, the exact same goes for NO_HZ_FULL, quoting Paul (just replacing RCU
things with dynticks) it becomes:

 "it can only be specified at build time or at boot time.
  In theory, it is possible to change a CPU from being idle-dynticks
  to full-dynticks at runtime, but there would need to be an extremely good
  reason for adding that level of complexity.  Lots of "fun" races in there..."

And I'm not even sure that somebody actually uses full dynticks today.  I only
know that some financial institutions are considering it, which is not
cheering me up much...

So we are very far from that day when we'll migrate to a runtime interface.


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-04 Thread Paul E. McKenney
On Fri, Jul 04, 2014 at 08:01:34AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-03 at 22:05 -0700, Paul E. McKenney wrote: 
> > On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> 
> > > Turn it on and don't worry about it is exactly what distros want the
> > > obscure feature with very few users to be.  Last time I did a drive-by,
> > > my boxen said I should continue to worry about it ;-)
> > 
> > Yep, which is the reason for the patch on the last email.
> > 
> > Then again, exactly which feature and which reason for worry?
> 
> NO_HZ_FULL.  I tried ALL a while back, box instantly called me an idiot.
> Maybe that has improved since, dunno.

Ah, I was thinking in terms of RCU_CPU_NOCB.

> Last drive-by I didn't do much overhead measurement, stuck mostly to
> functionality, and it still had rough edges that enterprise users may
> not fully appreciate.  Trying to let 60 of 64 cores do 100% compute
> showed some cores having a hard time entering tickless at all, and
> ~200us spikes that I think are due to tick losing skew.. told Frederic
> I'd take a peek at that, but haven't had time yet.  There were other
> known things as well, like timers and workqueues for which there are
> patches floating around.  All in all, it was waving the men at work
> sign, pointing at the "Say N" by the config option, and suggesting that
> ignoring that would not be the cleverest of moves.

Well, I am not going to join a debate on Kconfig default selection.  ;-)
I will say that two years ago, setting NO_HZ_FULL=y by default would
have been insane.  Perhaps soon it will be a no-brainer, and I am of
course trying to bring that day closer.  Right now it is of course a
judgment call.

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-04 Thread Mike Galbraith
On Thu, 2014-07-03 at 22:05 -0700, Paul E. McKenney wrote: 
> On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:

> > Turn it on and don't worry about it is exactly what distros want the
> > obscure feature with very few users to be.  Last time I did a drive-by,
> > my boxen said I should continue to worry about it ;-)
> 
> Yep, which is the reason for the patch on the last email.
> 
> Then again, exactly which feature and which reason for worry?

NO_HZ_FULL.  I tried ALL a while back, box instantly called me an idiot.
Maybe that has improved since, dunno.

Last drive-by I didn't do much overhead measurement, stuck mostly to
functionality, and it still had rough edges that enterprise users may
not fully appreciate.  Trying to let 60 of 64 cores do 100% compute
showed some cores having a hard time entering tickless at all, and
~200us spikes that I think are due to tick losing skew.. told Frederic
I'd take a peek at that, but haven't had time yet.  There were other
known things as well, like timers and workqueues for which there are
patches floating around.  All in all, it was waving the men at work
sign, pointing at the "Say N" by the config option, and suggesting that
ignoring that would not be the cleverest of moves.

-Mike



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-03 Thread Paul E. McKenney
On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-03 at 09:29 -0700, Paul E. McKenney wrote: 
> > On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> > > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> > > 
> > > > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > > > away as being a redundant interface to manage a single property of a set
> > > > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > > > manage a single property of a set of CPUs.  What am I missing? 
> > > > 
> > > > Well, for now, it can only be specified at build time or at boot time.
> > > > In theory, it is possible to change a CPU from being callback-offloaded
> > > > to not at runtime, but there would need to be an extremely good reason
> > > > for adding that level of complexity.  Lots of "fun" races in there...
> > > 
> > > Yeah, understood.
> > > 
> > > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> > > usable if it eventually became unified with cpuset and learned how to
> > > tap-dance properly;)
> > 
> > Agreed, it would in some sense be nice.  What specifically do you need
> > it for?
> 
> I personally have zero use for the thing (git/vi aren't particularly
> perturbation sensitive;). I'm just doing occasional drive-by testing
> from a distro perspective, how well does it work, what does it cost etc.
> 
> >   Are you really running workloads that generate large numbers of
> > callbacks spread across most of the CPUs?  It was this sort of workload
> > that caused Rik's system to show scary CPU-time accumulation, due to
> > the high overhead of frequent one-to-many wakeups.
> > 
> > If your systems aren't running that kind of high-callback-rate workload,
> > just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> > 
> > If your systems -are- running that kind of high-callback-rate workload,
> > but your system has fewer than 200 CPUs, ensure that you have enough
> > housekeeping CPUs to allow the grace-period kthread sufficient CPU time,
> > set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> > 
> > If your systems -are- running that kind of high-callback-rate workload,
> > and your system has more than 200 CPUs, apply the following patch,
> > set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it.  ;-)
> 
> Turn it on and don't worry about it is exactly what distros want the
> obscure feature with very few users to be.  Last time I did a drive-by,
> my boxen said I should continue to worry about it ;-)

Yep, which is the reason for the patch on the last email.

Then again, exactly which feature and which reason for worry?

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-03 Thread Mike Galbraith
On Thu, 2014-07-03 at 09:29 -0700, Paul E. McKenney wrote: 
> On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> > 
> > > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > > away as being a redundant interface to manage a single property of a set
> > > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > > manage a single property of a set of CPUs.  What am I missing? 
> > > 
> > > Well, for now, it can only be specified at build time or at boot time.
> > > In theory, it is possible to change a CPU from being callback-offloaded
> > > to not at runtime, but there would need to be an extremely good reason
> > > for adding that level of complexity.  Lots of "fun" races in there...
> > 
> > Yeah, understood.
> > 
> > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> > usable if it eventually became unified with cpuset and learned how to
> > tap-dance properly;)
> 
> Agreed, it would in some sense be nice.  What specifically do you need
> it for?

I personally have zero use for the thing (git/vi aren't particularly
perturbation sensitive;). I'm just doing occasional drive-by testing
from a distro perspective, how well does it work, what does it cost etc.

>   Are you really running workloads that generate large numbers of
> callbacks spread across most of the CPUs?  It was this sort of workload
> that caused Rik's system to show scary CPU-time accumulation, due to
> the high overhead of frequent one-to-many wakeups.
> 
> If your systems aren't running that kind of high-callback-rate workload,
> just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> 
> If your systems -are- running that kind of high-callback-rate workload,
> but your system has fewer than 200 CPUs, ensure that you have enough
> housekeeping CPUs to allow the grace-period kthread sufficient CPU time,
> set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> 
> If your systems -are- running that kind of high-callback-rate workload,
> and your system has more than 200 CPUs, apply the following patch,
> set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it.  ;-)

Turn it on and don't worry about it is exactly what distros want the
obscure feature with very few users to be.  Last time I did a drive-by,
my boxen said I should continue to worry about it ;-)

-Mike



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-03 Thread Paul E. McKenney
On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> 
> > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > away as being a redundant interface to manage a single property of a set
> > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > manage a single property of a set of CPUs.  What am I missing? 
> > 
> > Well, for now, it can only be specified at build time or at boot time.
> > In theory, it is possible to change a CPU from being callback-offloaded
> > to not at runtime, but there would need to be an extremely good reason
> > for adding that level of complexity.  Lots of "fun" races in there...
> 
> Yeah, understood.
> 
> (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> usable if it eventually became unified with cpuset and learned how to
> tap-dance properly;)

Agreed, it would in some sense be nice.  What specifically do you need
it for?  Are you really running workloads that generate large numbers of
callbacks spread across most of the CPUs?  It was this sort of workload
that caused Rik's system to show scary CPU-time accumulation, due to
the high overhead of frequent one-to-many wakeups.

If your systems aren't running that kind of high-callback-rate workload,
just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.

If your systems -are- running that kind of high-callback-rate workload,
but your system has fewer than 200 CPUs, ensure that you have enough
housekeeping CPUs to allow the grace-period kthread sufficient CPU time,
set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.

If your systems -are- running that kind of high-callback-rate workload,
and your system has more than 200 CPUs, apply the following patch,
set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it.  ;-)
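
For reference, the "don't worry about it" cases above amount to a .config
fragment along these lines (CONFIG_RCU_NOCB_CPU_ALL is the option named
above, CONFIG_RCU_NOCB_CPU is the option it depends on; shown purely as an
illustration, not a recommended production config):

	# Offload RCU callbacks from every CPU:
	CONFIG_RCU_NOCB_CPU=y
	CONFIG_RCU_NOCB_CPU_ALL=y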

Thanx, Paul



rcu: Parallelize and economize NOCB kthread wakeups

An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel 
Signed-off-by: Paul E. McKenney 
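
As a purely illustrative aside, the group arithmetic described in the
changelog can be worked out in a few lines of user-space C (this is not
the kernel implementation, just the changelog's 4096-CPU example):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		int nr_cpus = 4096;			/* the changelog's example */
		int groups = (int)sqrt(nr_cpus);	/* default: sqrt of the CPU count */
		int cpus_per_group = nr_cpus / groups;
		int followers = cpus_per_group - 1;	/* one leader per group */

		printf("%d CPUs: %d leader wakeups by the GP kthread, "
		       "each leader wakes %d followers\n",
		       nr_cpus, groups, followers);	/* 4096: 64 leaders, 63 followers */
		return 0;
	}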

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 6eaa9cdb7094..affed6434ec8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
quiescent states.  Units are jiffies, minimum
value is one, and maximum value is HZ.
 
+   rcutree.rcu_nocb_leader_stride= [KNL]
+   Set the number of NOCB kthread groups, which
+   defaults to the square root of the number of
+   CPUs.  Larger numbers reduces the wakeup overhead
+   on the per-CPU grace-period kthreads, but increases
+   that same overhead on each group's leader.
+
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..de12fa5a860b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -331,11 +331,29 @@ struct rcu_data {
struct rcu_head **nocb_tail;
atomic_long_t nocb_q_count; /* # CBs waiting for kthread */

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Mike Galbraith
On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:

> > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > away as being a redundant interface to manage a single property of a set
> > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > manage a single property of a set of CPUs.  What am I missing? 
> 
> Well, for now, it can only be specified at build time or at boot time.
> In theory, it is possible to change a CPU from being callback-offloaded
> to not at runtime, but there would need to be an extremely good reason
> for adding that level of complexity.  Lots of "fun" races in there...

Yeah, understood.

(still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
usable if it eventually became unified with cpuset and learned how to
tap-dance properly;)

-Mike



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: 
> > On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > > > > An 80-CPU system with a context-switch-heavy workload can require so
> > > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > > > > tens of percent of a CPU just awakening things.  This clearly will not
> > > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > > > > get behind, increasing grace-period latency.
> > > > > > 
> > > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > > > > and followers, where the grace-period kthreads awaken the leaders each of
> > > > > > whom in turn awakens its followers.  By default, the number of groups of
> > > > > > kthreads is the square root of the number of CPUs, but this default may
> > > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > > > > This reduces the number of wakeups done per grace period by the RCU
> > > > > > grace-period kthread by the square root of the number of CPUs, but of
> > > > > > course by shifting those wakeups to the leaders.  In addition, because
> > > > > > the leaders do grace periods on behalf of their respective followers,
> > > > > > the number of wakeups of the followers decreases by up to a factor of two.
> > > > > > Instead of being awakened once when new callbacks arrive and again
> > > > > > at the end of the grace period, the followers are awakened only at
> > > > > > the end of the grace period.
> > > > > > 
> > > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > > > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > > > > at the end of the grace period.  This compares favorably with the 79
> > > > > > wakeups for the grace-period kthread on an 80-CPU system.
> > > > > 
> > > > > Urgh, how about we kill the entire nocb nonsense and try again? This 
> > > > > is
> > > > > getting quite rediculous.
> > > > 
> > > > Sure thing, Peter.
> > > 
> > > So you don't think this has gotten a little out of hand? The NOCB stuff
> > > has lead to these masses of rcu threads and now you're adding extra
> > > cache misses to the perfectly sane and normal code paths just to deal
> > > with so many threads.
> > 
> > Indeed it appears to have gotten a bit out of hand.  But let's please
> > attack the real problem rather than the immediate irritant.
> > 
> > And in this case, the real problem is that users are getting callback
> > offloading even when there is no reason for it.
> > 
> > > And all to support a feature that nearly nobody uses. And you were
> > > talking about making nocb the default rcu...
> > 
> > As were others, not that long ago.  Today is the first hint that I got
> > that you feel otherwise.  But it does look like the softirq approach to
> > callback processing needs to stick around for awhile longer.  Nice to
> > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > 
> > Please see my patch in reply to Rik's email.  The idea is to neither
> > rip callback offloading from the kernel nor to keep callback offloading
> > as the default, but instead do callback offloading only for those CPUs
> > specifically marked as NO_HZ_FULL CPUs, or when specifically requested
> > at build time or at boot time.  In other words, only do it when it is
> > needed.
> 
> Exactly!  Like dynamically, when the user isolates CPUs via the cpuset
> interface, none of it making much sense without that particular property
> of a set of CPUs, and cpuset being the manager of CPU set properties.

Glad you like it!  ;-)

> NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> away as being a redundant interface to manage a single property of a set
> of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> manage a single property of a set of CPUs.  What am I missing? 

Well, for now, it can only be specified at build time or at boot time.
In theory, it is possible to change a CPU from being callback-offloaded
to not at runtime, but there would need to be an extremely good reason
for adding that level of complexity.  Lots of "fun" races in there...

Thanx, Paul


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Mike Galbraith
On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: 
> On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > > > An 80-CPU system with a context-switch-heavy workload can require so
> > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend 
> > > > > several
> > > > > tens of percent of a CPU just awakening things.  This clearly will not
> > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads 
> > > > > would
> > > > > get behind, increasing grace-period latency.
> > > > > 
> > > > > To avoid this problem, this commit divides the NOCB kthreads into 
> > > > > leaders
> > > > > and followers, where the grace-period kthreads awaken the leaders 
> > > > > each of
> > > > > whom in turn awakens its followers.  By default, the number of groups 
> > > > > of
> > > > > kthreads is the square root of the number of CPUs, but this default 
> > > > > may
> > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > > > This reduces the number of wakeups done per grace period by the RCU
> > > > > grace-period kthread by the square root of the number of CPUs, but of
> > > > > course by shifting those wakeups to the leaders.  In addition, because
> > > > > the leaders do grace periods on behalf of their respective followers,
> > > > > the number of wakeups of the followers decreases by up to a factor of 
> > > > > two.
> > > > > Instead of being awakened once when new callbacks arrive and again
> > > > > at the end of the grace period, the followers are awakened only at
> > > > > the end of the grace period.
> > > > > 
> > > > > For a numerical example, in a 4096-CPU system, the grace-period 
> > > > > kthread
> > > > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > > > at the end of the grace period.  This compares favorably with the 79
> > > > > wakeups for the grace-period kthread on an 80-CPU system.
> > > > 
> > > > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > > > getting quite rediculous.
> > > 
> > > Sure thing, Peter.
> > 
> > So you don't think this has gotten a little out of hand? The NOCB stuff
> > has lead to these masses of rcu threads and now you're adding extra
> > cache misses to the perfectly sane and normal code paths just to deal
> > with so many threads.
> 
> Indeed it appears to have gotten a bit out of hand.  But let's please
> attack the real problem rather than the immediate irritant.
> 
> And in this case, the real problem is that users are getting callback
> offloading even when there is no reason for it.
> 
> > And all to support a feature that nearly nobody uses. And you were
> > talking about making nocb the default rcu...
> 
> As were others, not that long ago.  Today is the first hint that I got
> that you feel otherwise.  But it does look like the softirq approach to
> callback processing needs to stick around for awhile longer.  Nice to
> hear that softirq is now "sane and normal" again, I guess.  ;-)
> 
> Please see my patch in reply to Rik's email.  The idea is to neither
> rip callback offloading from the kernel nor to keep callback offloading
> as the default, but instead do callback offloading only for those CPUs
> specifically marked as NO_HZ_FULL CPUs, or when specifically requested
> at build time or at boot time.  In other words, only do it when it is
> needed.

Exactly!  Like dynamically, when the user isolates CPUs via the cpuset
interface, none of it making much sense without that particular property
of a set of CPUs, and cpuset being the manager of CPU set properties.

NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
away as being a redundant interface to manage a single property of a set
of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
manage a single property of a set of CPUs.  What am I missing? 

-Mike



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 09:55:56AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote:
> > 
> > On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
> > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > >> An 80-CPU system with a context-switch-heavy workload can require
> > >> so many NOCB kthread wakeups that the RCU grace-period kthreads
> > >> spend several tens of percent of a CPU just awakening things.
> > >> This clearly will not scale well: If you add enough CPUs, the RCU
> > >> grace-period kthreads would get behind, increasing grace-period
> > >> latency.
> > >> 
> > >> To avoid this problem, this commit divides the NOCB kthreads into
> > >> leaders and followers, where the grace-period kthreads awaken the
> > >> leaders each of whom in turn awakens its followers.  By default,
> > >> the number of groups of kthreads is the square root of the number
> > >> of CPUs, but this default may be overridden using the
> > >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
> > >> number of wakeups done per grace period by the RCU grace-period
> > >> kthread by the square root of the number of CPUs, but of course
> > >> by shifting those wakeups to the leaders.  In addition, because 
> > >> the leaders do grace periods on behalf of their respective
> > >> followers, the number of wakeups of the followers decreases by up
> > >> to a factor of two. Instead of being awakened once when new
> > >> callbacks arrive and again at the end of the grace period, the
> > >> followers are awakened only at the end of the grace period.
> > >> 
> > >> For a numerical example, in a 4096-CPU system, the grace-period
> > >> kthread would awaken 64 leaders, each of which would awaken its
> > >> 63 followers at the end of the grace period.  This compares
> > >> favorably with the 79 wakeups for the grace-period kthread on an
> > >> 80-CPU system.
> > > 
> > > Urgh, how about we kill the entire nocb nonsense and try again?
> > > This is getting quite rediculous.
> > 
> > Some observations.
> > 
> > First, the rcuos/N threads are NOT bound to CPU N at all, but are
> > free to float through the system.
> 
> I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
> For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs.
> 
> > Second, the number of RCU callbacks at the end of each grace period
> > is quite likely to be small most of the time.
> > 
> > This suggests that on a system with N CPUs, it may be perfectly
> > sufficient to have a much smaller number of rcuos threads.
> > 
> > One thread can probably handle the RCU callbacks for as many as
> > 16, or even 64 CPUs...
> 
> In many cases, one thread could handle the RCU callbacks for way more
> than that.  In other cases, a single CPU could keep a single rcuo kthread
> quite busy.  So something dynamic ends up being required.
> 
> But I suspect that the real solution here is to adjust the Kconfig setup
> between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot
> parameters to get callback offloading on systems built with NO_HZ_FULL.
> Then add some boot-time code so that any CPU that has nohz_full= is
> forced to also have rcu_nocbs= set.  This would have the good effect
> of applying callback offloading only to those workloads for which it
> was specifically designed, but allowing those workloads to gain the
> latency-reduction benefits of callback offloading.
> 
> I do freely confess that I was hoping that callback offloading might one
> day completely replace RCU_SOFTIRQ, but that hope now appears to be at
> best premature.
> 
> Something like the attached patch.  Untested, probably does not even build.

Against all odds, it builds and passes moderate rcutorture testing.

Although this doesn't satisfy the desire to wean RCU of softirq, it does
allow NO_HZ_FULL kernels to maintain better compatibility with earlier
kernel versions, which appears to be more important for the time being.

Thanx, Paul

> 
> 
> rcu: Don't offload callbacks unless specifically requested
> 
> 
> 
> Not-yet-signed-off-by: Paul E. McKenney 
> 
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..9332d33346ac 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -737,7 +737,7 @@ choice
>  
>  config RCU_NOCB_CPU_NONE
>   bool "No build_forced no-CBs CPUs"
> - depends on RCU_NOCB_CPU && !NO_HZ_FULL
> + depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
>   help
> This option does not force any of the CPUs to be no-CBs CPUs.
> Only CPUs designated by the rcu_nocbs= boot parameter will be
> @@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
>  
>  config RCU_NOCB_CPU_ZERO
>   bool "CPU 0 is a build_forced no-CBs CPU"
> - depends on RCU_NOCB_CPU && !NO_HZ_FULL
> + depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote:
> 
> On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> >> As were others, not that long ago.  Today is the first hint that
> >> I got that you feel otherwise.  But it does look like the softirq
> >> approach to callback processing needs to stick around for awhile
> >> longer.  Nice to hear that softirq is now "sane and normal"
> >> again, I guess.  ;-)
> > 
> > Nah, softirqs are still totally annoying :-)
> > 
> > So I've lost detail again, but it seems to me that on all CPUs that
> > are actually getting ticks, waking tasks to process the RCU state
> > is entirely over doing it. Might as well keep processing their RCU
> > state from the tick as was previously done.
> 
> For CPUs that are not getting ticks (eg. because they are idle),
> is it worth waking up anything on that CPU, or would it make more
> sense to simply process their RCU callbacks on a different CPU,
> if there aren't too many pending?

Give or take the number of wakeups generated...  ;-)

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > As were others, not that long ago.  Today is the first hint that I got
> > that you feel otherwise.  But it does look like the softirq approach to
> > callback processing needs to stick around for awhile longer.  Nice to
> > hear that softirq is now "sane and normal" again, I guess.  ;-)
> 
> Nah, softirqs are still totally annoying :-)

Name me one thing that isn't annoying.  ;-)

> So I've lost detail again, but it seems to me that on all CPUs that are
> actually getting ticks, waking tasks to process the RCU state is
> entirely over doing it. Might as well keep processing their RCU state
> from the tick as was previously done.

And that is in fact the approach taken by my patch.  For which I just
kicked off testing, so expect an update later today.  (And that -is-
optimistic!  A pessimistic viewpoint would hold that the patch would
turn out to be so broken that it would take -weeks- to get a fix!)

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Rik van Riel

On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
>> As were others, not that long ago.  Today is the first hint that
>> I got that you feel otherwise.  But it does look like the softirq
>> approach to callback processing needs to stick around for awhile
>> longer.  Nice to hear that softirq is now "sane and normal"
>> again, I guess.  ;-)
> 
> Nah, softirqs are still totally annoying :-)
> 
> So I've lost detail again, but it seems to me that on all CPUs that
> are actually getting ticks, waking tasks to process the RCU state
> is entirely over doing it. Might as well keep processing their RCU
> state from the tick as was previously done.

For CPUs that are not getting ticks (eg. because they are idle),
is it worth waking up anything on that CPU, or would it make more
sense to simply process their RCU callbacks on a different CPU,
if there aren't too many pending?



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> As were others, not that long ago.  Today is the first hint that I got
> that you feel otherwise.  But it does look like the softirq approach to
> callback processing needs to stick around for awhile longer.  Nice to
> hear that softirq is now "sane and normal" again, I guess.  ;-)

Nah, softirqs are still totally annoying :-)

So I've lost detail again, but it seems to me that on all CPUs that are
actually getting ticks, waking tasks to process the RCU state is
entirely over doing it. Might as well keep processing their RCU state
from the tick as was previously done.
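
A toy restatement of that policy, with made-up names rather than the actual
kernel logic: CPUs that still take the scheduling-clock tick keep driving
their RCU work from the tick, and only tickless CPUs pay for a kthread wakeup.

#include <stdbool.h>
#include <stdio.h>

enum rcu_cb_path { RCU_FROM_TICK, RCU_FROM_KTHREAD };

/* Pick how a CPU's RCU work is driven: a CPU that still takes the
 * scheduling-clock tick just raises RCU work from the tick, while a
 * tickless (nohz_full) CPU relies on an offload kthread being woken. */
static enum rcu_cb_path choose_path(bool tick_stopped)
{
	return tick_stopped ? RCU_FROM_KTHREAD : RCU_FROM_TICK;
}

int main(void)
{
	bool tick_stopped[4] = { false, false, true, true };

	for (int cpu = 0; cpu < 4; cpu++)
		printf("cpu%d: %s\n", cpu,
		       choose_path(tick_stopped[cpu]) == RCU_FROM_TICK ?
		       "process RCU state from the tick" :
		       "wake the offload kthread");
	return 0;
}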






Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > > An 80-CPU system with a context-switch-heavy workload can require so
> > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend 
> > > > several
> > > > tens of percent of a CPU just awakening things.  This clearly will not
> > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > > get behind, increasing grace-period latency.
> > > > 
> > > > To avoid this problem, this commit divides the NOCB kthreads into 
> > > > leaders
> > > > and followers, where the grace-period kthreads awaken the leaders each 
> > > > of
> > > > whom in turn awakens its followers.  By default, the number of groups of
> > > > kthreads is the square root of the number of CPUs, but this default may
> > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > > This reduces the number of wakeups done per grace period by the RCU
> > > > grace-period kthread by the square root of the number of CPUs, but of
> > > > course by shifting those wakeups to the leaders.  In addition, because
> > > > the leaders do grace periods on behalf of their respective followers,
> > > > the number of wakeups of the followers decreases by up to a factor of 
> > > > two.
> > > > Instead of being awakened once when new callbacks arrive and again
> > > > at the end of the grace period, the followers are awakened only at
> > > > the end of the grace period.
> > > > 
> > > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > > at the end of the grace period.  This compares favorably with the 79
> > > > wakeups for the grace-period kthread on an 80-CPU system.
> > > 
> > > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > > getting quite rediculous.
> > 
> > Sure thing, Peter.
> 
> So you don't think this has gotten a little out of hand? The NOCB stuff
> has lead to these masses of rcu threads and now you're adding extra
> cache misses to the perfectly sane and normal code paths just to deal
> with so many threads.

Indeed it appears to have gotten a bit out of hand.  But let's please
attack the real problem rather than the immediate irritant.

And in this case, the real problem is that users are getting callback
offloading even when there is no reason for it.

> And all to support a feature that nearly nobody uses. And you were
> talking about making nocb the default rcu...

As were others, not that long ago.  Today is the first hint that I got
that you feel otherwise.  But it does look like the softirq approach to
callback processing needs to stick around for awhile longer.  Nice to
hear that softirq is now "sane and normal" again, I guess.  ;-)

Please see my patch in reply to Rik's email.  The idea is to neither
rip callback offloading from the kernel nor to keep callback offloading
as the default, but instead do callback offloading only for those CPUs
specifically marked as NO_HZ_FULL CPUs, or when specifically requested
at build time or at boot time.  In other words, only do it when it is
needed.

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote:
> 
> On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
> > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> >> An 80-CPU system with a context-switch-heavy workload can require
> >> so many NOCB kthread wakeups that the RCU grace-period kthreads
> >> spend several tens of percent of a CPU just awakening things.
> >> This clearly will not scale well: If you add enough CPUs, the RCU
> >> grace-period kthreads would get behind, increasing grace-period
> >> latency.
> >> 
> >> To avoid this problem, this commit divides the NOCB kthreads into
> >> leaders and followers, where the grace-period kthreads awaken the
> >> leaders each of whom in turn awakens its followers.  By default,
> >> the number of groups of kthreads is the square root of the number
> >> of CPUs, but this default may be overridden using the
> >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
> >> number of wakeups done per grace period by the RCU grace-period
> >> kthread by the square root of the number of CPUs, but of course
> >> by shifting those wakeups to the leaders.  In addition, because 
> >> the leaders do grace periods on behalf of their respective
> >> followers, the number of wakeups of the followers decreases by up
> >> to a factor of two. Instead of being awakened once when new
> >> callbacks arrive and again at the end of the grace period, the
> >> followers are awakened only at the end of the grace period.
> >> 
> >> For a numerical example, in a 4096-CPU system, the grace-period
> >> kthread would awaken 64 leaders, each of which would awaken its
> >> 63 followers at the end of the grace period.  This compares
> >> favorably with the 79 wakeups for the grace-period kthread on an
> >> 80-CPU system.
> > 
> > Urgh, how about we kill the entire nocb nonsense and try again?
> > This is getting quite rediculous.
> 
> Some observations.
> 
> First, the rcuos/N threads are NOT bound to CPU N at all, but are
> free to float through the system.

I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs.

> Second, the number of RCU callbacks at the end of each grace period
> is quite likely to be small most of the time.
> 
> This suggests that on a system with N CPUs, it may be perfectly
> sufficient to have a much smaller number of rcuos threads.
> 
> One thread can probably handle the RCU callbacks for as many as
> 16, or even 64 CPUs...

In many cases, one thread could handle the RCU callbacks for way more
than that.  In other cases, a single CPU could keep a single rcuo kthread
quite busy.  So something dynamic ends up being required.

But I suspect that the real solution here is to adjust the Kconfig setup
between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot
parameters to get callback offloading on systems built with NO_HZ_FULL.
Then add some boot-time code so that any CPU that has nohz_full= is
forced to also have rcu_nocbs= set.  This would have the good effect
of applying callback offloading only to those workloads for which it
was specifically designed, but allowing those workloads to gain the
latency-reduction benefits of callback offloading.

I do freely confess that I was hoping that callback offloading might one
day completely replace RCU_SOFTIRQ, but that hope now appears to be at
best premature.

Something like the attached patch.  Untested, probably does not even build.

Thanx, Paul



rcu: Don't offload callbacks unless specifically requested

more here soon

Not-yet-signed-off-by: Paul E. McKenney 

diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..9332d33346ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -737,7 +737,7 @@ choice
 
 config RCU_NOCB_CPU_NONE
bool "No build_forced no-CBs CPUs"
-   depends on RCU_NOCB_CPU && !NO_HZ_FULL
+   depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
help
  This option does not force any of the CPUs to be no-CBs CPUs.
  Only CPUs designated by the rcu_nocbs= boot parameter will be
@@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
 
 config RCU_NOCB_CPU_ZERO
bool "CPU 0 is a build_forced no-CBs CPU"
-   depends on RCU_NOCB_CPU && !NO_HZ_FULL
+   depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
help
  This option forces CPU 0 to be a no-CBs CPU, so that its RCU
  callbacks are invoked by a per-CPU kthread whose name begins
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 58fbb8204d15..3b150bfcce3d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2473,6 +2473,9 @@ static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 
if (rcu_nocb_mask == NULL)
return;
+#ifdef CONFIG_NO_HZ_FULL
+	cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
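
The attached patch is cut off by the archive here.  As a rough model of the
boot-time step Paul describes above, where any nohz_full= CPU is forced to
also be an rcu_nocbs= CPU, here is a self-contained userspace sketch; the
plain 64-bit masks and names only stand in for the kernel's cpumask machinery.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Stand-ins for the kernel masks: rcu_nocbs=0 and nohz_full=4-7. */
	uint64_t rcu_nocb_mask  = 1ULL << 0;
	uint64_t tick_nohz_full = 0xfULL << 4;

	/* The analogue of the cpumask_or() added by the patch. */
	rcu_nocb_mask |= tick_nohz_full;

	for (int cpu = 0; cpu < 8; cpu++)
		printf("cpu%d: %s\n", cpu,
		       (rcu_nocb_mask >> cpu) & 1 ?
		       "callbacks offloaded to an rcuo kthread" :
		       "callbacks run in softirq context");
	return 0;
}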

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > An 80-CPU system with a context-switch-heavy workload can require so
> > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > tens of percent of a CPU just awakening things.  This clearly will not
> > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > get behind, increasing grace-period latency.
> > > 
> > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > and followers, where the grace-period kthreads awaken the leaders each of
> > > whom in turn awakens its followers.  By default, the number of groups of
> > > kthreads is the square root of the number of CPUs, but this default may
> > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > This reduces the number of wakeups done per grace period by the RCU
> > > grace-period kthread by the square root of the number of CPUs, but of
> > > course by shifting those wakeups to the leaders.  In addition, because
> > > the leaders do grace periods on behalf of their respective followers,
> > > the number of wakeups of the followers decreases by up to a factor of two.
> > > Instead of being awakened once when new callbacks arrive and again
> > > at the end of the grace period, the followers are awakened only at
> > > the end of the grace period.
> > > 
> > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > at the end of the grace period.  This compares favorably with the 79
> > > wakeups for the grace-period kthread on an 80-CPU system.
> > 
> > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > getting quite rediculous.
> 
> Sure thing, Peter.

So you don't think this has gotten a little out of hand? The NOCB stuff
has lead to these masses of rcu threads and now you're adding extra
cache misses to the perfectly sane and normal code paths just to deal
with so many threads.

And all to support a feature that nearly nobody uses. And you were
talking about making nocb the default rcu...






Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > tens of percent of a CPU just awakening things.  This clearly will not
> > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > get behind, increasing grace-period latency.
> > 
> > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > and followers, where the grace-period kthreads awaken the leaders each of
> > whom in turn awakens its followers.  By default, the number of groups of
> > kthreads is the square root of the number of CPUs, but this default may
> > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition, because
> > the leaders do grace periods on behalf of their respective followers,
> > the number of wakeups of the followers decreases by up to a factor of two.
> > Instead of being awakened once when new callbacks arrive and again
> > at the end of the grace period, the followers are awakened only at
> > the end of the grace period.
> > 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > would awaken 64 leaders, each of which would awaken its 63 followers
> > at the end of the grace period.  This compares favorably with the 79
> > wakeups for the grace-period kthread on an 80-CPU system.
> 
> Urgh, how about we kill the entire nocb nonsense and try again? This is
> getting quite rediculous.

Sure thing, Peter.

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Rik van Riel

On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
> On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
>> An 80-CPU system with a context-switch-heavy workload can require
>> so many NOCB kthread wakeups that the RCU grace-period kthreads
>> spend several tens of percent of a CPU just awakening things.
>> This clearly will not scale well: If you add enough CPUs, the RCU
>> grace-period kthreads would get behind, increasing grace-period
>> latency.
>> 
>> To avoid this problem, this commit divides the NOCB kthreads into
>> leaders and followers, where the grace-period kthreads awaken the
>> leaders each of whom in turn awakens its followers.  By default,
>> the number of groups of kthreads is the square root of the number
>> of CPUs, but this default may be overridden using the
>> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
>> number of wakeups done per grace period by the RCU grace-period
>> kthread by the square root of the number of CPUs, but of course
>> by shifting those wakeups to the leaders.  In addition, because 
>> the leaders do grace periods on behalf of their respective
>> followers, the number of wakeups of the followers decreases by up
>> to a factor of two. Instead of being awakened once when new
>> callbacks arrive and again at the end of the grace period, the
>> followers are awakened only at the end of the grace period.
>> 
>> For a numerical example, in a 4096-CPU system, the grace-period
>> kthread would awaken 64 leaders, each of which would awaken its
>> 63 followers at the end of the grace period.  This compares
>> favorably with the 79 wakeups for the grace-period kthread on an
>> 80-CPU system.
> 
> Urgh, how about we kill the entire nocb nonsense and try again?
> This is getting quite rediculous.

Some observations.

First, the rcuos/N threads are NOT bound to CPU N at all, but are
free to float through the system.

Second, the number of RCU callbacks at the end of each grace period
is quite likely to be small most of the time.

This suggests that on a system with N CPUs, it may be perfectly
sufficient to have a much smaller number of rcuos threads.

One thread can probably handle the RCU callbacks for as many as
16, or even 64 CPUs...

-- 
All rights reversed
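
A quick back-of-the-envelope for that last suggestion, sharing one rcuo
kthread among a group of CPUs instead of one per offloaded CPU; purely
illustrative.

#include <stdio.h>

/* Count the rcuo kthreads needed if one kthread serves a whole group of
 * CPUs, compared with the one-kthread-per-offloaded-CPU scheme Rik
 * describes above. */
static void count_threads(int ncpus, int cpus_per_kthread)
{
	int nthreads = (ncpus + cpus_per_kthread - 1) / cpus_per_kthread;

	printf("%4d CPUs, one rcuo kthread per %2d CPUs: %3d kthreads "
	       "(vs. %d with one per offloaded CPU)\n",
	       ncpus, cpus_per_kthread, nthreads, ncpus);
}

int main(void)
{
	count_threads(80, 16);
	count_threads(80, 64);
	count_threads(4096, 16);
	count_threads(4096, 64);
	return 0;
}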


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Peter Zijlstra
On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> An 80-CPU system with a context-switch-heavy workload can require so
> many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> tens of percent of a CPU just awakening things.  This clearly will not
> scale well: If you add enough CPUs, the RCU grace-period kthreads would
> get behind, increasing grace-period latency.
> 
> To avoid this problem, this commit divides the NOCB kthreads into leaders
> and followers, where the grace-period kthreads awaken the leaders each of
> whom in turn awakens its followers.  By default, the number of groups of
> kthreads is the square root of the number of CPUs, but this default may
> be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> This reduces the number of wakeups done per grace period by the RCU
> grace-period kthread by the square root of the number of CPUs, but of
> course by shifting those wakeups to the leaders.  In addition, because
> the leaders do grace periods on behalf of their respective followers,
> the number of wakeups of the followers decreases by up to a factor of two.
> Instead of being awakened once when new callbacks arrive and again
> at the end of the grace period, the followers are awakened only at
> the end of the grace period.
> 
> For a numerical example, in a 4096-CPU system, the grace-period kthread
> would awaken 64 leaders, each of which would awaken its 63 followers
> at the end of the grace period.  This compares favorably with the 79
> wakeups for the grace-period kthread on an 80-CPU system.

Urgh, how about we kill the entire nocb nonsense and try again? This is
getting quite rediculous.
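
For readers skimming the archive, a minimal userspace sketch of the fanout
arithmetic in the changelog quoted above; the names are made up and the real
tree_plugin.h code differs in detail.

#include <math.h>
#include <stdio.h>

/* Model the wakeup fanout from the changelog: the no-CBs kthreads are
 * split into roughly sqrt(ncpus) groups, the grace-period kthread wakes
 * only the group leaders, and each leader wakes its own followers at the
 * end of the grace period. */
static void model(int ncpus)
{
	int stride  = (int)sqrt((double)ncpus);        /* default leader stride */
	int leaders = (ncpus + stride - 1) / stride;   /* one leader per group  */

	printf("%4d CPUs: flat scheme: GP kthread does ~%d wakeups per GP\n",
	       ncpus, ncpus - 1);
	printf("           leader/follower: GP kthread wakes %d leaders, "
	       "each leader wakes up to %d followers\n",
	       leaders, stride - 1);
}

int main(void)
{
	model(80);	/* the 80-CPU example from the changelog */
	model(4096);	/* 4096 CPUs: 64 leaders, 63 followers each */
	return 0;
}

Built with -lm, this reproduces the 64-leaders-by-63-followers figure for
4096 CPUs from the changelog.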




Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Peter Zijlstra
On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
 An 80-CPU system with a context-switch-heavy workload can require so
 many NOCB kthread wakeups that the RCU grace-period kthreads spend several
 tens of percent of a CPU just awakening things.  This clearly will not
 scale well: If you add enough CPUs, the RCU grace-period kthreads would
 get behind, increasing grace-period latency.
 
 To avoid this problem, this commit divides the NOCB kthreads into leaders
 and followers, where the grace-period kthreads awaken the leaders each of
 whom in turn awakens its followers.  By default, the number of groups of
 kthreads is the square root of the number of CPUs, but this default may
 be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
 This reduces the number of wakeups done per grace period by the RCU
 grace-period kthread by the square root of the number of CPUs, but of
 course by shifting those wakeups to the leaders.  In addition, because
 the leaders do grace periods on behalf of their respective followers,
 the number of wakeups of the followers decreases by up to a factor of two.
 Instead of being awakened once when new callbacks arrive and again
 at the end of the grace period, the followers are awakened only at
 the end of the grace period.
 
 For a numerical example, in a 4096-CPU system, the grace-period kthread
 would awaken 64 leaders, each of which would awaken its 63 followers
 at the end of the grace period.  This compares favorably with the 79
 wakeups for the grace-period kthread on an 80-CPU system.

Urgh, how about we kill the entire nocb nonsense and try again? This is
getting quite rediculous.


pgpXt8_H8LS1W.pgp
Description: PGP signature


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Rik van Riel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
 On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
 An 80-CPU system with a context-switch-heavy workload can require
 so many NOCB kthread wakeups that the RCU grace-period kthreads
 spend several tens of percent of a CPU just awakening things.
 This clearly will not scale well: If you add enough CPUs, the RCU
 grace-period kthreads would get behind, increasing grace-period
 latency.
 
 To avoid this problem, this commit divides the NOCB kthreads into
 leaders and followers, where the grace-period kthreads awaken the
 leaders each of whom in turn awakens its followers.  By default,
 the number of groups of kthreads is the square root of the number
 of CPUs, but this default may be overridden using the
 rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
 number of wakeups done per grace period by the RCU grace-period
 kthread by the square root of the number of CPUs, but of course
 by shifting those wakeups to the leaders.  In addition, because 
 the leaders do grace periods on behalf of their respective
 followers, the number of wakeups of the followers decreases by up
 to a factor of two. Instead of being awakened once when new
 callbacks arrive and again at the end of the grace period, the
 followers are awakened only at the end of the grace period.
 
 For a numerical example, in a 4096-CPU system, the grace-period
 kthread would awaken 64 leaders, each of which would awaken its
 63 followers at the end of the grace period.  This compares
 favorably with the 79 wakeups for the grace-period kthread on an
 80-CPU system.
 
 Urgh, how about we kill the entire nocb nonsense and try again?
 This is getting quite rediculous.

Some observations.

First, the rcuos/N threads are NOT bound to CPU N at all, but are
free to float through the system.

Second, the number of RCU callbacks at the end of each grace period
is quite likely to be small most of the time.

This suggests that on a system with N CPUs, it may be perfectly
sufficient to have a much smaller number of rcuos threads.

One thread can probably handle the RCU callbacks for as many as
16, or even 64 CPUs...

- -- 
All rights reversed
-BEGIN PGP SIGNATURE-
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJTtA0rAAoJEM553pKExN6D3IkIAKFZjAWhopcwPppzGWsT7OA/
3y7fnAvBcJ6AEwE1igzbJyCPjdOJECY/9iUdvuB9CbBD82kyfm4qnuREpdt+hqQp
Vi8EDB9UdyI2I42hbfRVOS4NgAl8ZYsDVQ+QiEM1cMp+LqEKPg7adwoNTPQL4eZn
ANcNh3B3eSpxnZ+ZbEBYJmQXIVP2S5t5M/EMizqUJEBI2/2zB68eeFkgvuW1yg1a
/J4L9w+Iqbu+is+6JK9ibQAR/tTS6Exmuc6RnKDH/nkj1jefKdH1z2p+r69u4AK1
JVDl40lra4n6XHsfjDWDHDXBsiD/JDJJ6Zxf77NWwg1aRT77HZPiMOu98hpFxWs=
=OSUn
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
 On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
  An 80-CPU system with a context-switch-heavy workload can require so
  many NOCB kthread wakeups that the RCU grace-period kthreads spend several
  tens of percent of a CPU just awakening things.  This clearly will not
  scale well: If you add enough CPUs, the RCU grace-period kthreads would
  get behind, increasing grace-period latency.
  
  To avoid this problem, this commit divides the NOCB kthreads into leaders
  and followers, where the grace-period kthreads awaken the leaders each of
  whom in turn awakens its followers.  By default, the number of groups of
  kthreads is the square root of the number of CPUs, but this default may
  be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
  This reduces the number of wakeups done per grace period by the RCU
  grace-period kthread by the square root of the number of CPUs, but of
  course by shifting those wakeups to the leaders.  In addition, because
  the leaders do grace periods on behalf of their respective followers,
  the number of wakeups of the followers decreases by up to a factor of two.
  Instead of being awakened once when new callbacks arrive and again
  at the end of the grace period, the followers are awakened only at
  the end of the grace period.
  
  For a numerical example, in a 4096-CPU system, the grace-period kthread
  would awaken 64 leaders, each of which would awaken its 63 followers
  at the end of the grace period.  This compares favorably with the 79
  wakeups for the grace-period kthread on an 80-CPU system.
 
 Urgh, how about we kill the entire nocb nonsense and try again? This is
 getting quite rediculous.

Sure thing, Peter.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
 On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
  On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
   An 80-CPU system with a context-switch-heavy workload can require so
   many NOCB kthread wakeups that the RCU grace-period kthreads spend several
   tens of percent of a CPU just awakening things.  This clearly will not
   scale well: If you add enough CPUs, the RCU grace-period kthreads would
   get behind, increasing grace-period latency.
   
   To avoid this problem, this commit divides the NOCB kthreads into leaders
   and followers, where the grace-period kthreads awaken the leaders each of
   whom in turn awakens its followers.  By default, the number of groups of
   kthreads is the square root of the number of CPUs, but this default may
   be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
   This reduces the number of wakeups done per grace period by the RCU
   grace-period kthread by the square root of the number of CPUs, but of
   course by shifting those wakeups to the leaders.  In addition, because
   the leaders do grace periods on behalf of their respective followers,
   the number of wakeups of the followers decreases by up to a factor of two.
   Instead of being awakened once when new callbacks arrive and again
   at the end of the grace period, the followers are awakened only at
   the end of the grace period.
   
   For a numerical example, in a 4096-CPU system, the grace-period kthread
   would awaken 64 leaders, each of which would awaken its 63 followers
   at the end of the grace period.  This compares favorably with the 79
   wakeups for the grace-period kthread on an 80-CPU system.
  
  Urgh, how about we kill the entire nocb nonsense and try again? This is
  getting quite rediculous.
 
 Sure thing, Peter.

So you don't think this has gotten a little out of hand? The NOCB stuff
has lead to these masses of rcu threads and now you're adding extra
cache misses to the perfectly sane and normal code paths just to deal
with so many threads.

And all to support a feature that nearly nobody uses. And you were
talking about making nocb the default rcu...




pgp6swy2pzusd.pgp
Description: PGP signature


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
  On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
  An 80-CPU system with a context-switch-heavy workload can require
  so many NOCB kthread wakeups that the RCU grace-period kthreads
  spend several tens of percent of a CPU just awakening things.
  This clearly will not scale well: If you add enough CPUs, the RCU
  grace-period kthreads would get behind, increasing grace-period
  latency.
  
  To avoid this problem, this commit divides the NOCB kthreads into
  leaders and followers, where the grace-period kthreads awaken the
  leaders each of whom in turn awakens its followers.  By default,
  the number of groups of kthreads is the square root of the number
  of CPUs, but this default may be overridden using the
  rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
  number of wakeups done per grace period by the RCU grace-period
  kthread by the square root of the number of CPUs, but of course
  by shifting those wakeups to the leaders.  In addition, because 
  the leaders do grace periods on behalf of their respective
  followers, the number of wakeups of the followers decreases by up
  to a factor of two. Instead of being awakened once when new
  callbacks arrive and again at the end of the grace period, the
  followers are awakened only at the end of the grace period.
  
  For a numerical example, in a 4096-CPU system, the grace-period
  kthread would awaken 64 leaders, each of which would awaken its
  63 followers at the end of the grace period.  This compares
  favorably with the 79 wakeups for the grace-period kthread on an
  80-CPU system.
  
  Urgh, how about we kill the entire nocb nonsense and try again?
  This is getting quite rediculous.
 
 Some observations.
 
 First, the rcuos/N threads are NOT bound to CPU N at all, but are
 free to float through the system.

I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs.

 Second, the number of RCU callbacks at the end of each grace period
 is quite likely to be small most of the time.
 
 This suggests that on a system with N CPUs, it may be perfectly
 sufficient to have a much smaller number of rcuos threads.
 
 One thread can probably handle the RCU callbacks for as many as
 16, or even 64 CPUs...

In many cases, one thread could handle the RCU callbacks for way more
than that.  In other cases, a single CPU could keep a single rcuo kthread
quite busy.  So something dynamic ends up being required.

But I suspect that the real solution here is to adjust the Kconfig setup
between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot
parameters to get callback offloading on systems built with NO_HZ_FULL.
Then add some boot-time code so that any CPU that has nohz_full= is
forced to also have rcu_nocbs= set.  This would have the good effect
of applying callback offloading only to those workloads for which it
was specifically designed, but allowing those workloads to gain the
latency-reduction benefits of callback offloading.

I do freely confess that I was hoping that callback offloading might one
day completely replace RCU_SOFTIRQ, but that hope now appears to be at
best premature.

Something like the attached patch.  Untested, probably does not even build.

Thanx, Paul



rcu: Don't offload callbacks unless specifically requested

more here soon

Not-yet-signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..9332d33346ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -737,7 +737,7 @@ choice
 
 config RCU_NOCB_CPU_NONE
bool No build_forced no-CBs CPUs
-   depends on RCU_NOCB_CPU  !NO_HZ_FULL
+   depends on RCU_NOCB_CPU  !NO_HZ_FULL_ALL
help
  This option does not force any of the CPUs to be no-CBs CPUs.
  Only CPUs designated by the rcu_nocbs= boot parameter will be
@@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
 
 config RCU_NOCB_CPU_ZERO
bool CPU 0 is a build_forced no-CBs CPU
-   depends on RCU_NOCB_CPU  !NO_HZ_FULL
+   depends on RCU_NOCB_CPU  !NO_HZ_FULL_ALL
help
  This option forces CPU 0 to be a no-CBs CPU, so that its RCU
  callbacks are invoked by a per-CPU kthread whose name begins
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 58fbb8204d15..3b150bfcce3d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2473,6 +2473,9 @@ static void __init rcu_spawn_nocb_kthreads(struct 
rcu_state *rsp)
 
if (rcu_nocb_mask == NULL)
return;
+#ifdef CONFIG_NO_HZ_FULL
+   cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
 On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
  On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
   On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend 
several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into 
leaders
and followers, where the grace-period kthreads awaken the leaders each 
of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of 
two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
   
   Urgh, how about we kill the entire nocb nonsense and try again? This is
   getting quite rediculous.
  
  Sure thing, Peter.
 
 So you don't think this has gotten a little out of hand? The NOCB stuff
 has lead to these masses of rcu threads and now you're adding extra
 cache misses to the perfectly sane and normal code paths just to deal
 with so many threads.

Indeed it appears to have gotten a bit out of hand.  But let's please
attack the real problem rather than the immediate irritant.

And in this case, the real problem is that users are getting callback
offloading even when there is no reason for it.

 And all to support a feature that nearly nobody uses. And you were
 talking about making nocb the default rcu...

As were others, not that long ago.  Today is the first hint that I got
that you feel otherwise.  But it does look like the softirq approach to
callback processing needs to stick around for awhile longer.  Nice to
hear that softirq is now sane and normal again, I guess.  ;-)

Please see my patch in reply to Rik's email.  The idea is to neither
rip callback offloading from the kernel nor to keep callback offloading
as the default, but instead do callback offloading only for those CPUs
specifically marked as NO_HZ_FULL CPUs, or when specifically requested
at build time or at boot time.  In other words, only do it when it is
needed.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Peter Zijlstra
On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
 As were others, not that long ago.  Today is the first hint that I got
 that you feel otherwise.  But it does look like the softirq approach to
 callback processing needs to stick around for awhile longer.  Nice to
 hear that softirq is now sane and normal again, I guess.  ;-)

Nah, softirqs are still totally annoying :-)

So I've lost detail again, but it seems to me that on all CPUs that are
actually getting ticks, waking tasks to process the RCU state is
entirely over doing it. Might as well keep processing their RCU state
from the tick as was previously done.




pgp9dqqJ76y5N.pgp
Description: PGP signature


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Rik van Riel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
 On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
 As were others, not that long ago.  Today is the first hint that
 I got that you feel otherwise.  But it does look like the softirq
 approach to callback processing needs to stick around for awhile
 longer.  Nice to hear that softirq is now sane and normal
 again, I guess.  ;-)
 
 Nah, softirqs are still totally annoying :-)
 
 So I've lost detail again, but it seems to me that on all CPUs that
 are actually getting ticks, waking tasks to process the RCU state
 is entirely over doing it. Might as well keep processing their RCU
 state from the tick as was previously done.

For CPUs that are not getting ticks (eg. because they are idle),
is it worth waking up anything on that CPU, or would it make more
sense to simply process their RCU callbacks on a different CPU,
if there aren't too many pending?

-BEGIN PGP SIGNATURE-
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJTtEGCAAoJEM553pKExN6D7t4IALdymyu0+/SdaXG73dfkzNKd
yJ3WJtJl1TV6JyejV747IRdKYfkuliZJ+99JZHtJ9dWvOoTtw19GqXXXVANlFpNE
8dQ4UTR6gDE1fRHnWKCdi0p8s3JwgZYdyhr0fKq7k09EXs+eJvDUTVVptBwLj36P
oaENzeONv5xkn3LS9cZVQATX1ZjpYiXjFUxblWoi/NJfSIlq81IkPj8ujaZ4f/6Q
6QLqymNbUGnF5n8v5gs8UqsP+fM3phsIJsT5m42hqnS9eKVwcw4T7UZ8UMFie+mC
hzy7vA0ClcdMWOMlRCSRbJMq0lDA0ej8acYpnj4Yz13wY2DIdTYVU38BbUE+iNA=
=Ia0S
-END PGP SIGNATURE-


Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
 On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
  As were others, not that long ago.  Today is the first hint that I got
  that you feel otherwise.  But it does look like the softirq approach to
  callback processing needs to stick around for awhile longer.  Nice to
  hear that softirq is now sane and normal again, I guess.  ;-)
 
 Nah, softirqs are still totally annoying :-)

Name me one thing that isn't annoying.  ;-)

 So I've lost detail again, but it seems to me that on all CPUs that are
 actually getting ticks, waking tasks to process the RCU state is
 entirely over doing it. Might as well keep processing their RCU state
 from the tick as was previously done.

And that is in fact the approach taken by my patch.  For which I just
kicked off testing, so expect an update later today.  (And that -is-
optimistic!  A pessimistic viewpoint would hold that the patch would
turn out to be so broken that it would take -weeks- to get a fix!)

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote:
 
 On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
  On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
  As were others, not that long ago.  Today is the first hint that
  I got that you feel otherwise.  But it does look like the softirq
  approach to callback processing needs to stick around for awhile
  longer.  Nice to hear that softirq is now sane and normal
  again, I guess.  ;-)
  
  Nah, softirqs are still totally annoying :-)
  
  So I've lost detail again, but it seems to me that on all CPUs that
  are actually getting ticks, waking tasks to process the RCU state
  is entirely over doing it. Might as well keep processing their RCU
  state from the tick as was previously done.
 
 For CPUs that are not getting ticks (eg. because they are idle),
 is it worth waking up anything on that CPU, or would it make more
 sense to simply process their RCU callbacks on a different CPU,
 if there aren't too many pending?

Give or take the number of wakeups generated...  ;-)

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Wed, Jul 02, 2014 at 09:55:56AM -0700, Paul E. McKenney wrote:
 On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote:
  
  On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
   On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
   An 80-CPU system with a context-switch-heavy workload can require
   so many NOCB kthread wakeups that the RCU grace-period kthreads
   spend several tens of percent of a CPU just awakening things.
   This clearly will not scale well: If you add enough CPUs, the RCU
   grace-period kthreads would get behind, increasing grace-period
   latency.
   
   To avoid this problem, this commit divides the NOCB kthreads into
   leaders and followers, where the grace-period kthreads awaken the
   leaders each of whom in turn awakens its followers.  By default,
   the number of groups of kthreads is the square root of the number
   of CPUs, but this default may be overridden using the
   rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
   number of wakeups done per grace period by the RCU grace-period
   kthread by the square root of the number of CPUs, but of course
   by shifting those wakeups to the leaders.  In addition, because 
   the leaders do grace periods on behalf of their respective
   followers, the number of wakeups of the followers decreases by up
   to a factor of two. Instead of being awakened once when new
   callbacks arrive and again at the end of the grace period, the
   followers are awakened only at the end of the grace period.
   
   For a numerical example, in a 4096-CPU system, the grace-period
   kthread would awaken 64 leaders, each of which would awaken its
   63 followers at the end of the grace period.  This compares
   favorably with the 79 wakeups for the grace-period kthread on an
   80-CPU system.
   
   Urgh, how about we kill the entire nocb nonsense and try again?
   This is getting quite ridiculous.
  
  Some observations.
  
  First, the rcuos/N threads are NOT bound to CPU N at all, but are
  free to float through the system.
 
 I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
 For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs.
 
  Second, the number of RCU callbacks at the end of each grace period
  is quite likely to be small most of the time.
  
  This suggests that on a system with N CPUs, it may be perfectly
  sufficient to have a much smaller number of rcuos threads.
  
  One thread can probably handle the RCU callbacks for as many as
  16, or even 64 CPUs...
 
 In many cases, one thread could handle the RCU callbacks for way more
 than that.  In other cases, a single CPU could keep a single rcuo kthread
 quite busy.  So something dynamic ends up being required.
 
 But I suspect that the real solution here is to adjust the Kconfig setup
 between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot
 parameters to get callback offloading on systems built with NO_HZ_FULL.
 Then add some boot-time code so that any CPU that has nohz_full= is
 forced to also have rcu_nocbs= set.  This would have the good effect
 of applying callback offloading only to those workloads for which it
 was specifically designed, but allowing those workloads to gain the
 latency-reduction benefits of callback offloading.
 
 I do freely confess that I was hoping that callback offloading might one
 day completely replace RCU_SOFTIRQ, but that hope now appears to be at
 best premature.
 
 Something like the attached patch.  Untested, probably does not even build.

Against all odds, it builds and passes moderate rcutorture testing.

Although this doesn't satisfy the desire to wean RCU of softirq, it does
allow NO_HZ_FULL kernels to maintain better compatibility with earlier
kernel versions, which appears to be more important for the time being.

Thanx, Paul

 
 
 rcu: Don't offload callbacks unless specifically requested
 
 more here soon
 
 Not-yet-signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 
 diff --git a/init/Kconfig b/init/Kconfig
 index 9d76b99af1b9..9332d33346ac 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -737,7 +737,7 @@ choice
  
  config RCU_NOCB_CPU_NONE
   bool No build_forced no-CBs CPUs
 - depends on RCU_NOCB_CPU && !NO_HZ_FULL
 + depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
   help
 This option does not force any of the CPUs to be no-CBs CPUs.
 Only CPUs designated by the rcu_nocbs= boot parameter will be
 @@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
  
  config RCU_NOCB_CPU_ZERO
   bool CPU 0 is a build_forced no-CBs CPU
 - depends on RCU_NOCB_CPU && !NO_HZ_FULL
 + depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
   help
 This option forces CPU 0 to be a no-CBs CPU, so that its RCU
 callbacks are invoked by a per-CPU kthread 
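
As a rough sketch of the boot-time piece described above (forcing every
nohz_full= CPU to also be a no-CBs CPU), which the quoted patch leaves for
later: this assumes the existing rcu_nocb_mask and tick_nohz_full_mask
cpumasks and CONFIG_RCU_NOCB_CPU; the helper name is made up here, and the
allocation of rcu_nocb_mask is glossed over.

#ifdef CONFIG_NO_HZ_FULL
/* Hypothetical boot-time helper, not part of the patch above. */
static void __init rcu_force_nocb_for_nohz_full(void)
{
        /* Assumes rcu_nocb_mask has already been allocated. */
        if (tick_nohz_full_enabled())
                cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
}
#endif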

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Mike Galbraith
On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: 
 On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
  On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
   On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
 An 80-CPU system with a context-switch-heavy workload can require so
 many NOCB kthread wakeups that the RCU grace-period kthreads spend 
 several
 tens of percent of a CPU just awakening things.  This clearly will not
 scale well: If you add enough CPUs, the RCU grace-period kthreads 
 would
 get behind, increasing grace-period latency.
 
 To avoid this problem, this commit divides the NOCB kthreads into 
 leaders
 and followers, where the grace-period kthreads awaken the leaders 
 each of
 whom in turn awakens its followers.  By default, the number of groups 
 of
 kthreads is the square root of the number of CPUs, but this default 
 may
 be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
 This reduces the number of wakeups done per grace period by the RCU
 grace-period kthread by the square root of the number of CPUs, but of
 course by shifting those wakeups to the leaders.  In addition, because
 the leaders do grace periods on behalf of their respective followers,
 the number of wakeups of the followers decreases by up to a factor of 
 two.
 Instead of being awakened once when new callbacks arrive and again
 at the end of the grace period, the followers are awakened only at
 the end of the grace period.
 
 For a numerical example, in a 4096-CPU system, the grace-period 
 kthread
 would awaken 64 leaders, each of which would awaken its 63 followers
 at the end of the grace period.  This compares favorably with the 79
 wakeups for the grace-period kthread on an 80-CPU system.

Urgh, how about we kill the entire nocb nonsense and try again? This is
getting quite ridiculous.
   
   Sure thing, Peter.
  
  So you don't think this has gotten a little out of hand? The NOCB stuff
  has led to these masses of rcu threads and now you're adding extra
  cache misses to the perfectly sane and normal code paths just to deal
  with so many threads.
 
 Indeed it appears to have gotten a bit out of hand.  But let's please
 attack the real problem rather than the immediate irritant.
 
 And in this case, the real problem is that users are getting callback
 offloading even when there is no reason for it.
 
  And all to support a feature that nearly nobody uses. And you were
  talking about making nocb the default rcu...
 
 As were others, not that long ago.  Today is the first hint that I got
 that you feel otherwise.  But it does look like the softirq approach to
 callback processing needs to stick around for awhile longer.  Nice to
 hear that softirq is now sane and normal again, I guess.  ;-)
 
 Please see my patch in reply to Rik's email.  The idea is to neither
 rip callback offloading from the kernel nor to keep callback offloading
 as the default, but instead do callback offloading only for those CPUs
 specifically marked as NO_HZ_FULL CPUs, or when specifically requested
 at build time or at boot time.  In other words, only do it when it is
 needed.

Exactly!  Like dynamically, when the user isolates CPUs via the cpuset
interface, none of it making much sense without that particular property
of a set of CPUs, and cpuset being the manager of CPU set properties.

NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
away as being a redundant interface to manage a single property of a set
of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
manage a single property of a set of CPUs.  What am I missing? 

-Mike



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Paul E. McKenney
On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
 On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: 
  On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
   On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
 On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
  An 80-CPU system with a context-switch-heavy workload can require so
  many NOCB kthread wakeups that the RCU grace-period kthreads spend 
  several
  tens of percent of a CPU just awakening things.  This clearly will 
  not
  scale well: If you add enough CPUs, the RCU grace-period kthreads 
  would
  get behind, increasing grace-period latency.
  
  To avoid this problem, this commit divides the NOCB kthreads into 
  leaders
  and followers, where the grace-period kthreads awaken the leaders 
  each of
  whom in turn awakens its followers.  By default, the number of 
  groups of
  kthreads is the square root of the number of CPUs, but this default 
  may
  be overridden using the rcutree.rcu_nocb_leader_stride boot 
  parameter.
  This reduces the number of wakeups done per grace period by the RCU
  grace-period kthread by the square root of the number of CPUs, but 
  of
  course by shifting those wakeups to the leaders.  In addition, 
  because
  the leaders do grace periods on behalf of their respective 
  followers,
  the number of wakeups of the followers decreases by up to a factor 
  of two.
  Instead of being awakened once when new callbacks arrive and again
  at the end of the grace period, the followers are awakened only at
  the end of the grace period.
  
  For a numerical example, in a 4096-CPU system, the grace-period 
  kthread
  would awaken 64 leaders, each of which would awaken its 63 followers
  at the end of the grace period.  This compares favorably with the 79
  wakeups for the grace-period kthread on an 80-CPU system.
 
 Urgh, how about we kill the entire nocb nonsense and try again? This 
 is
 getting quite ridiculous.

Sure thing, Peter.
   
   So you don't think this has gotten a little out of hand? The NOCB stuff
   has led to these masses of rcu threads and now you're adding extra
   cache misses to the perfectly sane and normal code paths just to deal
   with so many threads.
  
  Indeed it appears to have gotten a bit out of hand.  But let's please
  attack the real problem rather than the immediate irritant.
  
  And in this case, the real problem is that users are getting callback
  offloading even when there is no reason for it.
  
   And all to support a feature that nearly nobody uses. And you were
   talking about making nocb the default rcu...
  
  As were others, not that long ago.  Today is the first hint that I got
  that you feel otherwise.  But it does look like the softirq approach to
  callback processing needs to stick around for awhile longer.  Nice to
  hear that softirq is now sane and normal again, I guess.  ;-)
  
  Please see my patch in reply to Rik's email.  The idea is to neither
  rip callback offloading from the kernel nor to keep callback offloading
  as the default, but instead do callback offloading only for those CPUs
  specifically marked as NO_HZ_FULL CPUs, or when specifically requested
  at build time or at boot time.  In other words, only do it when it is
  needed.
 
 Exactly!  Like dynamically, when the user isolates CPUs via the cpuset
 interface, none of it making much sense without that particular property
 of a set of CPUs, and cpuset being the manager of CPU set properties.

Glad you like it!  ;-)

 NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
 away as being a redundant interface to manage a single property of a set
 of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
 manage a single property of a set of CPUs.  What am I missing? 

Well, for now, it can only be specified at build time or at boot time.
In theory, it is possible to change a CPU from being callback-offloaded
to not at runtime, but there would need to be an extremely good reason
for adding that level of complexity.  Lots of fun races in there...

Thanx, Paul



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-07-02 Thread Mike Galbraith
On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
 On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:

  NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
  away as being a redundant interface to manage a single property of a set
  of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
  manage a single property of a set of CPUs.  What am I missing? 
 
 Well, for now, it can only be specified at build time or at boot time.
 In theory, it is possible to change a CPU from being callback-offloaded
 to not at runtime, but there would need to be an extremely good reason
 for adding that level of complexity.  Lots of fun races in there...

Yeah, understood.

(still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
usable if it eventually became unified with cpuset and learned how to
tap-dance properly;)

-Mike



Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Paul E. McKenney
On Fri, Jun 27, 2014 at 03:13:17PM +, Mathieu Desnoyers wrote:
> On Friday, June 27, 2014 11:01:27 AM, Mathieu Desnoyers wrote:
> > On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
> > > An 80-CPU system with a context-switch-heavy workload can require so
> > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > tens of percent of a CPU just awakening things.  This clearly will not
> > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > get behind, increasing grace-period latency.
> > > 
> > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > and followers, where the grace-period kthreads awaken the leaders each of
> > > whom in turn awakens its followers.  By default, the number of groups of
> > > kthreads is the square root of the number of CPUs, but this default may
> > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > This reduces the number of wakeups done per grace period by the RCU
> > > grace-period kthread by the square root of the number of CPUs, but of
> > > course by shifting those wakeups to the leaders.  In addition, because
> > > the leaders do grace periods on behalf of their respective followers,
> > > the number of wakeups of the followers decreases by up to a factor of two.
> > > Instead of being awakened once when new callbacks arrive and again
> > > at the end of the grace period, the followers are awakened only at
> > > the end of the grace period.
> > > 
> > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > at the end of the grace period.  This compares favorably with the 79
> > > wakeups for the grace-period kthread on an 80-CPU system.
> > 
> > If I understand your approach correctly, it looks like the callbacks
> > are moved from the follower threads (per CPU) to leader threads (for
> > a group of CPUs). I'm concerned that moving those callbacks to leader
> > threads would increase cache trashing, since the callbacks would be
> > often executed from a different CPU than the CPU which enqueued the
> > work. In a case where cgroups/affinity are used to pin kthreads to
> > specific CPUs to minimize cache trashing, I'm concerned that this
> > approach could degrade performance.
> > 
> > Would there be another way to distribute the wake up that would keep
> > callbacks local to their enqueuing CPU ?
> > 
> > Or am I missing something important ?
> 
> What I appear to have missed is that the leader moves the callbacks
> from the follower's structure _to_ another list within that same
> follower's structure, which ensures that callbacks are executed
> locally by the follower.
> 
> Sorry for the noise. ;-)

Not a problem, and thank you for looking at the patch!

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > Reported-by: Rik van Riel 
> > > Signed-off-by: Paul E. McKenney 
> > > 
> > > diff --git a/Documentation/kernel-parameters.txt
> > > b/Documentation/kernel-parameters.txt

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Paul E. McKenney
On Fri, Jun 27, 2014 at 03:01:27PM +, Mathieu Desnoyers wrote:
> On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > tens of percent of a CPU just awakening things.  This clearly will not
> > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > get behind, increasing grace-period latency.
> > 
> > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > and followers, where the grace-period kthreads awaken the leaders each of
> > whom in turn awakens its followers.  By default, the number of groups of
> > kthreads is the square root of the number of CPUs, but this default may
> > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition, because
> > the leaders do grace periods on behalf of their respective followers,
> > the number of wakeups of the followers decreases by up to a factor of two.
> > Instead of being awakened once when new callbacks arrive and again
> > at the end of the grace period, the followers are awakened only at
> > the end of the grace period.
> > 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > would awaken 64 leaders, each of which would awaken its 63 followers
> > at the end of the grace period.  This compares favorably with the 79
> > wakeups for the grace-period kthread on an 80-CPU system.
> 
> If I understand your approach correctly, it looks like the callbacks
> are moved from the follower threads (per CPU) to leader threads (for
> a group of CPUs).

Not quite.  The leader thread does the moving, but the callbacks live
on the follower's data structure.  There probably are some added
cache misses, but I believe that this is more than made up for by
the decrease in wakeups -- the followers are normally awakened only
once rather than twice per grace period.
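
To put rough numbers on that wakeup fan-out (a stand-alone userspace sketch,
not kernel code; the square-root default is the one documented for
rcutree.rcu_nocb_leader_stride in the patch):

#include <math.h>
#include <stdio.h>

int main(void)                                  /* build with: cc wakeups.c -lm */
{
        int nr_cpus = 4096;                     /* example from the commit log */
        int groups = (int)sqrt(nr_cpus);        /* default number of leader groups */
        int followers = nr_cpus / groups - 1;   /* followers per leader */

        /* The GP kthread now wakes only the leaders... */
        printf("leaders woken by the GP kthread: %d\n", groups);
        /* ...and each leader wakes its followers once, at grace-period end. */
        printf("followers woken per leader:      %d\n", followers);
        printf("a flat scheme would need:        %d wakeups\n", nr_cpus - 1);
        return 0;
}

For 4096 CPUs this prints 64 leaders and 63 followers per leader, versus
4095 wakeups by the grace-period kthread in the flat scheme.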

>   I'm concerned that moving those callbacks to leader
> threads would increase cache trashing, since the callbacks would be
> often executed from a different CPU than the CPU which enqueued the
> work. In a case where cgroups/affinity are used to pin kthreads to
> specific CPUs to minimize cache trashing, I'm concerned that this
> approach could degrade performance.
> 
> Would there be another way to distribute the wake up that would keep
> callbacks local to their enqueuing CPU ?
> 
> Or am I missing something important ?

One problem is that each leader must look at its followers' state
in order to determine who needs to be awakened.  So there are some
added cache misses no matter how you redistribute the wakeups.
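
A heavily simplified illustration of that leader-side scan (a userspace C
toy, not the actual rcuo kthread code; only the nocb_gp_head and
nocb_next_follower field names are taken from the patch):

#include <stdio.h>
#include <stddef.h>

struct rcu_head { struct rcu_head *next; };

struct rcu_data {
        struct rcu_head *nocb_gp_head;          /* CBs waiting for GP (per patch) */
        struct rcu_data *nocb_next_follower;    /* wakeup chain (per patch) */
        int cpu;
};

/*
 * The leader walks its wakeup chain and must read each follower's list
 * head (a cross-CPU cache-line touch) to decide whether a wakeup is
 * needed at all.
 */
static void leader_scan(struct rcu_data *leader)
{
        struct rcu_data *rdp;

        for (rdp = leader; rdp; rdp = rdp->nocb_next_follower)
                if (rdp->nocb_gp_head)
                        printf("wake rcuo kthread for CPU %d\n", rdp->cpu);
}

int main(void)
{
        struct rcu_head cb = { NULL };
        struct rcu_data follower = { &cb, NULL, 1 };    /* has callbacks ready */
        struct rcu_data leader = { NULL, &follower, 0 };

        leader_scan(&leader);   /* reads the follower's state, wakes CPU 1 */
        return 0;
}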

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Reported-by: Rik van Riel 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/Documentation/kernel-parameters.txt
> > b/Documentation/kernel-parameters.txt
> > index 6eaa9cdb7094..affed6434ec8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
> > entirely omitted.
> > quiescent states.  Units are jiffies, minimum
> > value is one, and maximum value is HZ.
> >  
> > +   rcutree.rcu_nocb_leader_stride= [KNL]
> > +   Set the number of NOCB kthread groups, which
> > +   defaults to the square root of the number of
> > +   CPUs.  Larger numbers reduces the wakeup overhead
> > +   on the per-CPU grace-period kthreads, but increases
> > +   that same overhead on each group's leader.
> > +
> > rcutree.

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Mathieu Desnoyers
On Friday, June 27, 2014 11:01:27 AM, Mathieu Desnoyers wrote:
> On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > tens of percent of a CPU just awakening things.  This clearly will not
> > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > get behind, increasing grace-period latency.
> > 
> > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > and followers, where the grace-period kthreads awaken the leaders each of
> > whom in turn awakens its followers.  By default, the number of groups of
> > kthreads is the square root of the number of CPUs, but this default may
> > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition, because
> > the leaders do grace periods on behalf of their respective followers,
> > the number of wakeups of the followers decreases by up to a factor of two.
> > Instead of being awakened once when new callbacks arrive and again
> > at the end of the grace period, the followers are awakened only at
> > the end of the grace period.
> > 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > would awaken 64 leaders, each of which would awaken its 63 followers
> > at the end of the grace period.  This compares favorably with the 79
> > wakeups for the grace-period kthread on an 80-CPU system.
> 
> If I understand your approach correctly, it looks like the callbacks
> are moved from the follower threads (per CPU) to leader threads (for
> a group of CPUs). I'm concerned that moving those callbacks to leader
> threads would increase cache trashing, since the callbacks would be
> often executed from a different CPU than the CPU which enqueued the
> work. In a case where cgroups/affinity are used to pin kthreads to
> specific CPUs to minimize cache trashing, I'm concerned that this
> approach could degrade performance.
> 
> Would there be another way to distribute the wake up that would keep
> callbacks local to their enqueuing CPU ?
> 
> Or am I missing something important ?

What I appear to have missed is that the leader moves the callbacks
from the follower's structure _to_ another list within that same
follower's structure, which ensures that callbacks are executed
locally by the follower.
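
In code terms, the point is that the "move" is a list splice inside one and
the same rcu_data (a toy userspace illustration; the field names follow the
patch, and the intermediate staging through nocb_gp_head while the grace
period completes is omitted here):

#include <stdio.h>
#include <stddef.h>

struct rcu_head { struct rcu_head *next; };

struct rcu_data {
        struct rcu_head *nocb_head;             /* newly queued CBs */
        struct rcu_head *nocb_follower_head;    /* CBs ready to invoke (per patch) */
};

/* Leader-side step, run against each follower's own rcu_data: the callbacks
 * never leave the structure of the CPU that enqueued them. */
static void advance_cbs(struct rcu_data *rdp)
{
        rdp->nocb_follower_head = rdp->nocb_head;       /* splice within same rdp */
        rdp->nocb_head = NULL;
}

int main(void)
{
        struct rcu_head cb = { NULL };
        struct rcu_data rdp = { &cb, NULL };

        advance_cbs(&rdp);
        printf("ready list is now %s\n",
               rdp.nocb_follower_head ? "non-empty" : "empty");
        return 0;
}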

Sorry for the noise. ;-)

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> > 
> > Reported-by: Rik van Riel 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/Documentation/kernel-parameters.txt
> > b/Documentation/kernel-parameters.txt
> > index 6eaa9cdb7094..affed6434ec8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
> > entirely omitted.
> > quiescent states.  Units are jiffies, minimum
> > value is one, and maximum value is HZ.
> >  
> > +   rcutree.rcu_nocb_leader_stride= [KNL]
> > +   Set the number of NOCB kthread groups, which
> > +   defaults to the square root of the number of
> > +   

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Mathieu Desnoyers
On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
> An 80-CPU system with a context-switch-heavy workload can require so
> many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> tens of percent of a CPU just awakening things.  This clearly will not
> scale well: If you add enough CPUs, the RCU grace-period kthreads would
> get behind, increasing grace-period latency.
> 
> To avoid this problem, this commit divides the NOCB kthreads into leaders
> and followers, where the grace-period kthreads awaken the leaders each of
> whom in turn awakens its followers.  By default, the number of groups of
> kthreads is the square root of the number of CPUs, but this default may
> be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> This reduces the number of wakeups done per grace period by the RCU
> grace-period kthread by the square root of the number of CPUs, but of
> course by shifting those wakeups to the leaders.  In addition, because
> the leaders do grace periods on behalf of their respective followers,
> the number of wakeups of the followers decreases by up to a factor of two.
> Instead of being awakened once when new callbacks arrive and again
> at the end of the grace period, the followers are awakened only at
> the end of the grace period.
> 
> For a numerical example, in a 4096-CPU system, the grace-period kthread
> would awaken 64 leaders, each of which would awaken its 63 followers
> at the end of the grace period.  This compares favorably with the 79
> wakeups for the grace-period kthread on an 80-CPU system.

If I understand your approach correctly, it looks like the callbacks
are moved from the follower threads (per CPU) to leader threads (for
a group of CPUs). I'm concerned that moving those callbacks to leader
threads would increase cache thrashing, since the callbacks would be
often executed from a different CPU than the CPU which enqueued the
work. In a case where cgroups/affinity are used to pin kthreads to
specific CPUs to minimize cache thrashing, I'm concerned that this
approach could degrade performance.

Would there be another way to distribute the wake up that would keep
callbacks local to their enqueuing CPU ?

Or am I missing something important ?

Thanks,

Mathieu

> 
> Reported-by: Rik van Riel 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/kernel-parameters.txt
> b/Documentation/kernel-parameters.txt
> index 6eaa9cdb7094..affed6434ec8 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
> entirely omitted.
>   quiescent states.  Units are jiffies, minimum
>   value is one, and maximum value is HZ.
>  
> + rcutree.rcu_nocb_leader_stride= [KNL]
> + Set the number of NOCB kthread groups, which
> + defaults to the square root of the number of
> + CPUs.  Larger numbers reduces the wakeup overhead
> + on the per-CPU grace-period kthreads, but increases
> + that same overhead on each group's leader.
> +
>   rcutree.qhimark= [KNL]
>   Set threshold of queued RCU callbacks beyond which
>   batch limiting is disabled.
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index bf2c1e669691..de12fa5a860b 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -331,11 +331,29 @@ struct rcu_data {
>   struct rcu_head **nocb_tail;
>   atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
>   atomic_long_t nocb_q_count_lazy; /*  (approximate). */
> + struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
> + struct rcu_head **nocb_follower_tail;
> + atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
> + atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
>   int nocb_p_count;   /* # CBs being invoked by kthread */
>   int nocb_p_count_lazy;  /*  (approximate). */
>   wait_queue_head_t nocb_wq;  /* For nocb kthreads to slee

[PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Paul E. McKenney
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel 
Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 6eaa9cdb7094..affed6434ec8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
quiescent states.  Units are jiffies, minimum
value is one, and maximum value is HZ.
 
+   rcutree.rcu_nocb_leader_stride= [KNL]
+   Set the number of NOCB kthread groups, which
+   defaults to the square root of the number of
+   CPUs.  Larger numbers reduces the wakeup overhead
+   on the per-CPU grace-period kthreads, but increases
+   that same overhead on each group's leader.
+
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..de12fa5a860b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -331,11 +331,29 @@ struct rcu_data {
struct rcu_head **nocb_tail;
atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+   struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
+   struct rcu_head **nocb_follower_tail;
+   atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
+   atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
int nocb_p_count;   /* # CBs being invoked by kthread */
int nocb_p_count_lazy;  /*  (approximate). */
wait_queue_head_t nocb_wq;  /* For nocb kthreads to sleep on. */
struct task_struct *nocb_kthread;
bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
+
+   /* The following fields are used by the leader, hence own cacheline. */
+   struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
+   /* CBs waiting for GP. */
+   struct rcu_head **nocb_gp_tail;
+   long nocb_gp_count;
+   long nocb_gp_count_lazy;
+   bool nocb_leader_wake;  /* Is the nocb leader thread awake? */
+   struct rcu_data *nocb_next_follower;
+   /* Next follower in wakeup chain. */
+
+   /* The following fields are used by the follower, hence new cacheline. */
+   struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
+   /* Leader CPU takes GP-end wakeups. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
/* 8) RCU CPU stall data. */
@@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
 /* Sum up queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long 
*qll)
 {
-   *ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
-   *qll = atomic_long_read(&rdp->nocb_q_count_lazy) + 
rdp->nocb_p_count_lazy;
+   *ql = atomic_long_read(&rdp->nocb_q_count) +
+ rdp->nocb_p_count +
+ atomic_long_read(&rdp->nocb_follower_count) +
+ rdp->nocb_p_count + rdp->nocb_gp_count;
+   *qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
+  rdp->nocb_p_count_lazy +
+  

[PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Paul E. McKenney
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel r...@redhat.com
Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 6eaa9cdb7094..affed6434ec8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
quiescent states.  Units are jiffies, minimum
value is one, and maximum value is HZ.
 
+   rcutree.rcu_nocb_leader_stride= [KNL]
+   Set the number of NOCB kthread groups, which
+   defaults to the square root of the number of
+   CPUs.  Larger numbers reduces the wakeup overhead
+   on the per-CPU grace-period kthreads, but increases
+   that same overhead on each group's leader.
+
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..de12fa5a860b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -331,11 +331,29 @@ struct rcu_data {
struct rcu_head **nocb_tail;
atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+   struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
+   struct rcu_head **nocb_follower_tail;
+   atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
+   atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
int nocb_p_count;   /* # CBs being invoked by kthread */
int nocb_p_count_lazy;  /*  (approximate). */
wait_queue_head_t nocb_wq;  /* For nocb kthreads to sleep on. */
struct task_struct *nocb_kthread;
bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
+
+   /* The following fields are used by the leader, hence own cacheline. */
+   struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
+   /* CBs waiting for GP. */
+   struct rcu_head **nocb_gp_tail;
+   long nocb_gp_count;
+   long nocb_gp_count_lazy;
+   bool nocb_leader_wake;  /* Is the nocb leader thread awake? */
+   struct rcu_data *nocb_next_follower;
+   /* Next follower in wakeup chain. */
+
+   /* The following fields are used by the follower, hence new cacheline. */
+   struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
+   /* Leader CPU takes GP-end wakeups. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
/* 8) RCU CPU stall data. */
@@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
 /* Sum up queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long 
*qll)
 {
-   *ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
-   *qll = atomic_long_read(&rdp->nocb_q_count_lazy) + 
rdp->nocb_p_count_lazy;
+   *ql = atomic_long_read(&rdp->nocb_q_count) +
+ rdp->nocb_p_count +
+ atomic_long_read(&rdp->nocb_follower_count) +
+ rdp->nocb_p_count + rdp->nocb_gp_count;
+   *qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
+  

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Mathieu Desnoyers
On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
 
 An 80-CPU system with a context-switch-heavy workload can require so
 many NOCB kthread wakeups that the RCU grace-period kthreads spend several
 tens of percent of a CPU just awakening things.  This clearly will not
 scale well: If you add enough CPUs, the RCU grace-period kthreads would
 get behind, increasing grace-period latency.
 
 To avoid this problem, this commit divides the NOCB kthreads into leaders
 and followers, where the grace-period kthreads awaken the leaders each of
 whom in turn awakens its followers.  By default, the number of groups of
 kthreads is the square root of the number of CPUs, but this default may
 be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
 This reduces the number of wakeups done per grace period by the RCU
 grace-period kthread by the square root of the number of CPUs, but of
 course by shifting those wakeups to the leaders.  In addition, because
 the leaders do grace periods on behalf of their respective followers,
 the number of wakeups of the followers decreases by up to a factor of two.
 Instead of being awakened once when new callbacks arrive and again
 at the end of the grace period, the followers are awakened only at
 the end of the grace period.
 
 For a numerical example, in a 4096-CPU system, the grace-period kthread
 would awaken 64 leaders, each of which would awaken its 63 followers
 at the end of the grace period.  This compares favorably with the 79
 wakeups for the grace-period kthread on an 80-CPU system.

If I understand your approach correctly, it looks like the callbacks
are moved from the follower threads (per CPU) to leader threads (for
a group of CPUs). I'm concerned that moving those callbacks to leader
threads would increase cache thrashing, since the callbacks would be
often executed from a different CPU than the CPU which enqueued the
work. In a case where cgroups/affinity are used to pin kthreads to
specific CPUs to minimize cache thrashing, I'm concerned that this
approach could degrade performance.

Would there be another way to distribute the wake up that would keep
callbacks local to their enqueuing CPU ?

Or am I missing something important ?

Thanks,

Mathieu

 
 Reported-by: Rik van Riel r...@redhat.com
 Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 
 diff --git a/Documentation/kernel-parameters.txt
 b/Documentation/kernel-parameters.txt
 index 6eaa9cdb7094..affed6434ec8 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
 entirely omitted.
   quiescent states.  Units are jiffies, minimum
   value is one, and maximum value is HZ.
  
 + rcutree.rcu_nocb_leader_stride= [KNL]
 + Set the number of NOCB kthread groups, which
 + defaults to the square root of the number of
 + CPUs.  Larger numbers reduces the wakeup overhead
 + on the per-CPU grace-period kthreads, but increases
 + that same overhead on each group's leader.
 +
   rcutree.qhimark= [KNL]
   Set threshold of queued RCU callbacks beyond which
   batch limiting is disabled.
 diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
 index bf2c1e669691..de12fa5a860b 100644
 --- a/kernel/rcu/tree.h
 +++ b/kernel/rcu/tree.h
 @@ -331,11 +331,29 @@ struct rcu_data {
   struct rcu_head **nocb_tail;
   atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
   atomic_long_t nocb_q_count_lazy; /*  (approximate). */
 + struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
 + struct rcu_head **nocb_follower_tail;
 + atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
 + atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
   int nocb_p_count;   /* # CBs being invoked by kthread */
   int nocb_p_count_lazy;  /*  (approximate). */
   wait_queue_head_t nocb_wq;  /* For nocb kthreads to sleep on. */
   struct task_struct *nocb_kthread;
   bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
 +
 + /* The following fields are used by the leader, hence own cacheline. */
 + struct rcu_head *nocb_gp_head

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Mathieu Desnoyers
On Friday, June 27, 2014 11:01:27 AM, Mathieu Desnoyers wrote:
 On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
  
  An 80-CPU system with a context-switch-heavy workload can require so
  many NOCB kthread wakeups that the RCU grace-period kthreads spend several
  tens of percent of a CPU just awakening things.  This clearly will not
  scale well: If you add enough CPUs, the RCU grace-period kthreads would
  get behind, increasing grace-period latency.
  
  To avoid this problem, this commit divides the NOCB kthreads into leaders
  and followers, where the grace-period kthreads awaken the leaders each of
  whom in turn awakens its followers.  By default, the number of groups of
  kthreads is the square root of the number of CPUs, but this default may
  be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
  This reduces the number of wakeups done per grace period by the RCU
  grace-period kthread by the square root of the number of CPUs, but of
  course by shifting those wakeups to the leaders.  In addition, because
  the leaders do grace periods on behalf of their respective followers,
  the number of wakeups of the followers decreases by up to a factor of two.
  Instead of being awakened once when new callbacks arrive and again
  at the end of the grace period, the followers are awakened only at
  the end of the grace period.
  
  For a numerical example, in a 4096-CPU system, the grace-period kthread
  would awaken 64 leaders, each of which would awaken its 63 followers
  at the end of the grace period.  This compares favorably with the 79
  wakeups for the grace-period kthread on an 80-CPU system.
 
 If I understand your approach correctly, it looks like the callbacks
 are moved from the follower threads (per CPU) to leader threads (for
 a group of CPUs). I'm concerned that moving those callbacks to leader
 threads would increase cache trashing, since the callbacks would be
 often executed from a different CPU than the CPU which enqueued the
 work. In a case where cgroups/affinity are used to pin kthreads to
 specific CPUs to minimize cache trashing, I'm concerned that this
 approach could degrade performance.
 
 Would there be another way to distribute the wake up that would keep
 callbacks local to their enqueuing CPU ?
 
 Or am I missing something important ?

What I appear to have missed is that the leader moves the callbacks
from the follower's structure _to_ another list within that same
follower's structure, which ensures that callbacks are executed
locally by the follower.

Sorry for the noise. ;-)

Thanks,

Mathieu

 
 Thanks,
 
 Mathieu
 
  
  Reported-by: Rik van Riel r...@redhat.com
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  
  diff --git a/Documentation/kernel-parameters.txt
  b/Documentation/kernel-parameters.txt
  index 6eaa9cdb7094..affed6434ec8 100644
  --- a/Documentation/kernel-parameters.txt
  +++ b/Documentation/kernel-parameters.txt
  @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
  entirely omitted.
  quiescent states.  Units are jiffies, minimum
  value is one, and maximum value is HZ.
   
  +   rcutree.rcu_nocb_leader_stride= [KNL]
  +   Set the number of NOCB kthread groups, which
  +   defaults to the square root of the number of
  +   CPUs.  Larger numbers reduces the wakeup overhead
  +   on the per-CPU grace-period kthreads, but increases
  +   that same overhead on each group's leader.
  +
  rcutree.qhimark= [KNL]
  Set threshold of queued RCU callbacks beyond which
  batch limiting is disabled.
  diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
  index bf2c1e669691..de12fa5a860b 100644
  --- a/kernel/rcu/tree.h
  +++ b

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Paul E. McKenney
On Fri, Jun 27, 2014 at 03:13:17PM +, Mathieu Desnoyers wrote:
 On Friday, June 27, 2014 11:01:27 AM, Mathieu Desnoyers wrote:
  On Friday, June 27, 2014 10:20:38 AM, Paul E. McKenney wrote:
   
   An 80-CPU system with a context-switch-heavy workload can require so
   many NOCB kthread wakeups that the RCU grace-period kthreads spend several
   tens of percent of a CPU just awakening things.  This clearly will not
   scale well: If you add enough CPUs, the RCU grace-period kthreads would
   get behind, increasing grace-period latency.
   
   To avoid this problem, this commit divides the NOCB kthreads into leaders
   and followers, where the grace-period kthreads awaken the leaders each of
   whom in turn awakens its followers.  By default, the number of groups of
   kthreads is the square root of the number of CPUs, but this default may
   be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
   This reduces the number of wakeups done per grace period by the RCU
   grace-period kthread by the square root of the number of CPUs, but of
   course by shifting those wakeups to the leaders.  In addition, because
   the leaders do grace periods on behalf of their respective followers,
   the number of wakeups of the followers decreases by up to a factor of two.
   Instead of being awakened once when new callbacks arrive and again
   at the end of the grace period, the followers are awakened only at
   the end of the grace period.
   
   For a numerical example, in a 4096-CPU system, the grace-period kthread
   would awaken 64 leaders, each of which would awaken its 63 followers
   at the end of the grace period.  This compares favorably with the 79
   wakeups for the grace-period kthread on an 80-CPU system.
  
  If I understand your approach correctly, it looks like the callbacks
  are moved from the follower threads (per CPU) to leader threads (for
  a group of CPUs).  I'm concerned that moving those callbacks to leader
  threads would increase cache thrashing, since the callbacks would
  often be executed on a different CPU than the one that enqueued the
  work.  In a case where cgroups/affinity are used to pin kthreads to
  specific CPUs to minimize cache thrashing, I'm concerned that this
  approach could degrade performance.

  Would there be another way to distribute the wakeups that would keep
  callbacks local to their enqueuing CPU?

  Or am I missing something important?
 
 What I appear to have missed is that the leader moves the callbacks
 from the follower's structure _to_ another list within that same
 follower's structure, which ensures that callbacks are executed
 locally by the follower.
 
 Sorry for the noise. ;-)

Not a problem, and thank you for looking at the patch!

Thanx, Paul

 Thanks,
 
 Mathieu
 
  
  Thanks,
  
  Mathieu
  
   

Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups

2014-06-27 Thread Paul E. McKenney
On Fri, Jun 27, 2014 at 03:01:27PM +, Mathieu Desnoyers wrote:
 - Original Message -
  From: Paul E. McKenney paul...@linux.vnet.ibm.com
  To: linux-kernel@vger.kernel.org, r...@redhat.com
  Cc: mi...@kernel.org, la...@cn.fujitsu.com, dipan...@in.ibm.com, 
  a...@linux-foundation.org, mathieu desnoyers
  mathieu.desnoy...@efficios.com, j...@joshtriplett.org, n...@us.ibm.com, 
  t...@linutronix.de, pet...@infradead.org,
  rost...@goodmis.org, dhowe...@redhat.com, eduma...@google.com, 
  dvh...@linux.intel.com, fweis...@gmail.com,
  o...@redhat.com, s...@mit.edu
  Sent: Friday, June 27, 2014 10:20:38 AM
  Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread 
  wakeups
  
  An 80-CPU system with a context-switch-heavy workload can require so
  many NOCB kthread wakeups that the RCU grace-period kthreads spend several
  tens of percent of a CPU just awakening things.  This clearly will not
  scale well: If you add enough CPUs, the RCU grace-period kthreads would
  get behind, increasing grace-period latency.
  
  To avoid this problem, this commit divides the NOCB kthreads into leaders
  and followers, where the grace-period kthreads awaken the leaders, each of
  whom in turn awakens its followers.  By default, the number of groups of
  kthreads is the square root of the number of CPUs, but this default may
  be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
  This reduces the number of wakeups done per grace period by the RCU
  grace-period kthread by the square root of the number of CPUs, but of
  course by shifting those wakeups to the leaders.  In addition, because
  the leaders do grace periods on behalf of their respective followers,
  the number of wakeups of the followers decreases by up to a factor of two.
  Instead of being awakened once when new callbacks arrive and again
  at the end of the grace period, the followers are awakened only at
  the end of the grace period.
  
  For a numerical example, in a 4096-CPU system, the grace-period kthread
  would awaken 64 leaders, each of which would awaken its 63 followers
  at the end of the grace period.  This compares favorably with the 79
  wakeups for the grace-period kthread on an 80-CPU system.
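
To put numbers on that, here is a quick user-space sketch of the arithmetic;
the rounding and the group-count calculation below follow the commit-log
description rather than the patch's actual code, so treat it purely as
illustration:

/*
 * Illustration only: mirrors the description above, not the patch's
 * actual stride computation.  Build with: gcc -std=c99 -o nocb nocb.c -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	int ncpus[] = { 80, 4096 };

	for (int i = 0; i < 2; i++) {
		int n = ncpus[i];
		/* Default group size: square root of the number of CPUs. */
		int stride = (int)(sqrt((double)n) + 0.5);
		int nleaders = (n + stride - 1) / stride;

		/*
		 * Old scheme: roughly one wakeup per NOCB CPU from the
		 * grace-period kthread (assuming nearly all CPUs are NOCB).
		 */
		printf("%4d CPUs: ~%d GP-kthread wakeups before, "
		       "%d leader wakeups after (each leader wakes ~%d followers)\n",
		       n, n - 1, nleaders, stride - 1);
	}
	return 0;
}

For 4096 CPUs this prints the 64-leaders/63-followers split from the example
above, and for 80 CPUs the ~79-wakeup baseline.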
 
 If I understand your approach correctly, it looks like the callbacks
 are moved from the follower threads (per CPU) to leader threads (for
 a group of CPUs).

Not quite.  The leader thread does the moving, but the callbacks live
on the follower's data structure.  There probably are some added
cache misses, but I believe that this is more than made up for by
the decrease in wakeups -- the followers are normally awakened only
once rather than twice per grace period.
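
In other words, each follower keeps (at least) two callback lists in its own
structure, and the leader only splices between them.  A minimal sketch of that
idea follows, with made-up names rather than the fields the patch actually adds
to struct rcu_data, and with all locking and memory ordering against the
enqueue path omitted:

#include <linux/rcupdate.h>	/* struct rcu_head */
#include <linux/sched.h>	/* struct task_struct */

/*
 * Made-up per-follower state for illustration; these are not the
 * fields the patch adds to struct rcu_data.  Both tail pointers are
 * assumed to be initialized to point at their own list heads.
 */
struct nocb_follower {
	struct rcu_head *newly_queued;		/* CBs enqueued on this CPU */
	struct rcu_head **newly_queued_tail;
	struct rcu_head *ready_to_invoke;	/* CBs whose grace period ended */
	struct rcu_head **ready_to_invoke_tail;
	struct task_struct *kthread;		/* this follower's rcuo kthread */
};

/*
 * Leader side, once the grace period has ended: splice the follower's
 * newly queued callbacks onto that same follower's ready list.  The
 * callbacks never leave the follower's structure, so they are still
 * invoked locally by the follower's rcuo kthread.
 */
static void leader_hand_off(struct nocb_follower *f)
{
	if (!f->newly_queued)
		return;
	*f->ready_to_invoke_tail = f->newly_queued;
	f->ready_to_invoke_tail = f->newly_queued_tail;
	f->newly_queued = NULL;
	f->newly_queued_tail = &f->newly_queued;
}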

 I'm concerned that moving those callbacks to leader
 threads would increase cache thrashing, since the callbacks would
 often be executed on a different CPU than the one that enqueued the
 work.  In a case where cgroups/affinity are used to pin kthreads to
 specific CPUs to minimize cache thrashing, I'm concerned that this
 approach could degrade performance.

 Would there be another way to distribute the wakeups that would keep
 callbacks local to their enqueuing CPU?

 Or am I missing something important?

One problem is that each leader must look at its followers' state
in order to determine who needs to be awakened.  So there are some
added cache misses no matter how you redistribute the wakeups.
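
Continuing the made-up sketch above, the leader's end-of-grace-period pass
would look roughly like the following; the read of each follower's list head
is where those cache misses come from, even though the callbacks themselves
never move between CPUs:

/*
 * Illustration only, building on the made-up nocb_follower above:
 * one pass by the leader over its followers after a grace period.
 */
static void leader_wake_followers(struct nocb_follower *followers, int nfollowers)
{
	int i;

	for (i = 0; i < nfollowers; i++) {
		struct nocb_follower *f = &followers[i];

		if (!f->newly_queued)	/* reading remote state: cache misses */
			continue;	/* nothing pending, no wakeup needed */
		leader_hand_off(f);
		wake_up_process(f->kthread);	/* one wakeup per GP, not two */
	}
}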

Thanx, Paul

 Thanks,
 
 Mathieu
 
  
  Reported-by: Rik van Riel r...@redhat.com
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  
  diff --git a/Documentation/kernel-parameters.txt
  b/Documentation/kernel-parameters.txt
  index 6eaa9cdb7094..affed6434ec8 100644
  --- a/Documentation/kernel-parameters.txt
  +++ b/Documentation/kernel-parameters.txt
  @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
  entirely omitted.
  quiescent states.  Units are jiffies, minimum
  value is one, and maximum value is HZ.
   
  +   rcutree.rcu_nocb_leader_stride= [KNL]
  +   Set the number of NOCB kthread groups, which
  +   defaults to the square root of the number of
  +   CPUs.  Larger numbers reduce the wakeup overhead
  +   on the per-CPU grace-period kthreads, but increase
  +   that same overhead on each group's leader.
  +
  rcutree.qhimark= [KNL]
  Set threshold of queued RCU callbacks beyond which
  batch limiting is disabled.
  diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
  index bf2c1e669691..de12fa5a860b 100644
  --- a/kernel/rcu/tree.h
  +++ b/kernel/rcu/tree.h
  @@ -331,11 +331,29 @@ struct rcu_data {
  struct rcu_head **nocb_tail;
  atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
  atomic_long_t nocb_q_count_lazy; /*  (approximate