Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:57:35PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 23, 2014 at 10:33:08AM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 23, 2014 at 08:57:50AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 23, 2014 at 12:28:50PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 17, 2014 at 10:37:17AM -0700, Paul E. McKenney wrote:
> > > > > Oh, and to answer the implicit question...  A properly configured 4096-CPU
> > > > > system will have two funnel levels, with 64 nodes at the leaf level
> > > > > and a single node at the root level.  If the system is not properly
> > > > > configured, it will have three funnel levels.  The maximum number of
> > > > > funnel levels is four, which would handle more than four million CPUs
> > > > > (sixteen million if properly configured), so we should be good.  ;-)
> > > > > 
> > > > > The larger numbers of levels are intended strictly for testing.  I set
> > > > > CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
> > > > > to make sure that I am testing something uglier than what will be running
> > > > > in production.  A large system should have both of these set to 64,
> > > > > though this requires also booting with skew_tick=1 as well.
> > > > 
> > > > Right, and I think we talked about this before; the first thing one
> > > > should do is align the RCU fanout masks with the actual machine
> > > > topology. Because currently they can be all over the place.
> > > 
> > > And we also talked before about how it would make a lot more sense to
> > > align the CPU numbering with the actual machine topology, as that would
> > > fix the problem in one place.  But either way, in the particular case
> > > of the RCU fanout, does anyone have any real data showing that this is
> > > a real problem?  Given that the rcu_node accesses are quite a ways off
> > > of any fastpath, I remain skeptical.
> > 
> > And one way to test for this is to set CONFIG_RCU_FANOUT to the number of
> > cores in a socket (or to the number of hardware threads per socket for
> > systems that number their hardware threads consecutively), then specify
> > CONFIG_RCU_FANOUT_EXACT=y.  This will align the rcu_node structures with
> > the sockets.  If the number of cores/threads per socket is too large,
> > you can of course use a smaller number that exactly divides the number
> > of cores/threads per socket.
> 
> Typical Intel cpu numbering is [0..n) for SMT0 and [n..2*n) for SMT1, so
> that'll fall flat on its face at try 1.

I am quite painfully aware of that CPU numbering scheme, which is why
I suggested using the number of cores per socket as well as the number
of threads per socket.
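
To make that concrete, here is a minimal user-space sketch of the mapping
(a hypothetical 2-socket box with 8 cores per socket and 2 SMT threads per
core, using the split numbering Peter describes; illustrative only, not
kernel code):

#include <stdio.h>

#define SOCKETS          2
#define CORES_PER_SOCKET 8
#define NCORES           (SOCKETS * CORES_PER_SOCKET)  /* n = 16 */

int main(void)
{
	int fanout = CORES_PER_SOCKET;  /* suggested: cores per socket */
	int cpu;

	/* Split numbering: CPUs [0..n) are SMT0, CPUs [n..2*n) are SMT1. */
	for (cpu = 0; cpu < 2 * NCORES; cpu++) {
		int core = cpu % NCORES;              /* physical core */
		int socket = core / CORES_PER_SOCKET; /* owning socket */
		int leaf = cpu / fanout;  /* leaf rcu_node, FANOUT_EXACT-style */

		printf("cpu %2d: socket %d, leaf rcu_node %d\n",
		       cpu, socket, leaf);
	}
	return 0;
}

With the fanout set to the 16 hardware threads per socket, leaf 0 would
cover CPUs 0-15 and thus span both sockets.  With the fanout set to the
8 cores per socket, every leaf lands entirely within one socket, which
is the point of the suggestion above.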

Thanx, Paul



Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-23 Thread Peter Zijlstra
On Mon, Jun 23, 2014 at 10:33:08AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 08:57:50AM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 23, 2014 at 12:28:50PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 17, 2014 at 10:37:17AM -0700, Paul E. McKenney wrote:
> > > > Oh, and to answer the implicit question...  A properly configured 4096-CPU
> > > > system will have two funnel levels, with 64 nodes at the leaf level
> > > > and a single node at the root level.  If the system is not properly
> > > > configured, it will have three funnel levels.  The maximum number of
> > > > funnel levels is four, which would handle more than four million CPUs
> > > > (sixteen million if properly configured), so we should be good.  ;-)
> > > > 
> > > > The larger numbers of levels are intended strictly for testing.  I set
> > > > CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
> > > > to make sure that I am testing something uglier than what will be running
> > > > in production.  A large system should have both of these set to 64,
> > > > though this requires also booting with skew_tick=1 as well.
> > > 
> > > Right, and I think we talked about this before; the first thing one
> > > should do is align the RCU fanout masks with the actual machine
> > > topology. Because currently they can be all over the place.
> > 
> > And we also talked before about how it would make a lot more sense to
> > align the CPU numbering with the actual machine topology, as that would
> > fix the problem in one place.  But either way, in the particular case
> > of the RCU fanout, does anyone have any real data showing that this is
> > a real problem?  Given that the rcu_node accesses are quite a ways off
> > of any fastpath, I remain skeptical.
> 
> And one way to test for this is to set CONFIG_RCU_FANOUT to the number of
> cores in a socket (or to the number of hardware threads per socket for
> systems that number their hardware threads consecutively), then specify
> CONFIG_RCU_FANOUT_EXACT=y.  This will align the rcu_node structures with
> the sockets.  If the number of cores/threads per socket is too large,
> you can of course use a smaller number that exactly divides the number
> of cores/threads per socket.

Typical Intel cpu numbering is [0..n) for SMT0 and [n..2*n) for SMT1, so
that'll fall flat on its face at try 1.


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:57:50AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 12:28:50PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 17, 2014 at 10:37:17AM -0700, Paul E. McKenney wrote:
> > > Oh, and to answer the implicit question...  A properly configured 4096-CPU
> > > system will have two funnel levels, with 64 nodes at the leaf level
> > > and a single node at the root level.  If the system is not properly
> > > configured, it will have three funnel levels.  The maximum number of
> > > funnel levels is four, which would handle more than four million CPUs
> > > (sixteen million if properly configured), so we should be good.  ;-)
> > > 
> > > The larger numbers of levels are intended strictly for testing.  I set
> > > CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
> > > to make sure that I am testing something uglier than what will be running
> > > in production.  A large system should have both of these set to 64,
> > > though this requires also booting with skew_tick=1 as well.
> > 
> > Right, and I think we talked about this before; the first thing one
> > should do is align the RCU fanout masks with the actual machine
> > topology. Because currently they can be all over the place.
> 
> And we also talked before about how it would make a lot more sense to
> align the CPU numbering with the actual machine topology, as that would
> fix the problem in one place.  But either way, in the particular case
> of the RCU fanout, does anyone have any real data showing that this is
> a real problem?  Given that the rcu_node accesses are quite a ways off
> of any fastpath, I remain skeptical.

And one way to test for this is to set CONFIG_RCU_FANOUT to the number of
cores in a socket (or to the number of hardware threads per socket for
systems that number their hardware threads consecutively), then specify
CONFIG_RCU_FANOUT_EXACT=y.  This will align the rcu_node structures with
the sockets.  If the number of cores/threads per socket is too large,
you can of course use a smaller number that exactly divides the number
of cores/threads per socket.
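
As a concrete illustration, a hypothetical 8-socket machine with 16 cores
per socket (and consecutive in-socket CPU numbering) would use something
like the following .config fragment; option names as in 3.15-era Kconfig,
and the leaf setting is an assumption on my part:

# Hypothetical 8-socket box, 16 cores per socket:
CONFIG_RCU_FANOUT=16         # cores per socket, as suggested above
CONFIG_RCU_FANOUT_LEAF=16    # assuming the leaf spread is set separately
CONFIG_RCU_FANOUT_EXACT=y    # apply the fanout exactly, no auto-balancing

Each leaf rcu_node then covers exactly one socket, modulo the caveat
about split SMT numbering discussed elsewhere in this thread.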

If this does turn out to improve performance, I would be happy to create
a boot parameter for CONFIG_RCU_FANOUT, perhaps also some mechanism to
allow the architecture to tell RCU what the fanout should be.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 12:28:50PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 17, 2014 at 10:37:17AM -0700, Paul E. McKenney wrote:
> > Oh, and to answer the implicit question...  A properly configured 4096-CPU
> > system will have two funnel levels, with 64 nodes at the leaf level
> > and a single node at the root level.  If the system is not properly
> > configured, it will have three funnel levels.  The maximum number of
> > funnel levels is four, which would handle more than four million CPUs
> > (sixteen million if properly configured), so we should be good.  ;-)
> > 
> > The larger numbers of levels are intended strictly for testing.  I set
> > CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
> > to make sure that I am testing something uglier than what will be running
> > in production.  A large system should have both of these set to 64,
> > though this requires also booting with skew_tick=1 as well.
> 
> Right, and I think we talked about this before; the first thing one
> should do is align the RCU fanout masks with the actual machine
> topology. Because currently they can be all over the place.

And we also talked before about how it would make a lot more sense to
align the CPU numbering with the actual machine topology, as that would
fix the problem in one place.  But either way, in the particular case
of the RCU fanout, does anyone have any real data showing that this is
a real problem?  Given that the rcu_node accesses are quite a ways off
of any fastpath, I remain skeptical.

Thanx, Paul



Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-23 Thread Peter Zijlstra
On Tue, Jun 17, 2014 at 10:37:17AM -0700, Paul E. McKenney wrote:
> Oh, and to answer the implicit question...  A properly configured 4096-CPU
> system will have two funnel levels, with 64 nodes at the leaf level
> and a single node at the root level.  If the system is not properly
> configured, it will have three funnel levels.  The maximum number of
> funnel levels is four, which would handle more than four million CPUs
> (sixteen million if properly configured), so we should be good.  ;-)
> 
> The larger numbers of levels are intended strictly for testing.  I set
> CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
> to make sure that I am testing something uglier than what will be running
> in production.  A large system should have both of these set to 64,
> though this requires also booting with skew_tick=1 as well.

Right, and I think we talked about this before; the first thing one
should do is align the RCU fanout masks with the actual machine
topology. Because currently they can be all over the place.


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Waiman Long

On 06/17/2014 01:37 PM, Paul E. McKenney wrote:

> Oh, and to answer the implicit question...  A properly configured 4096-CPU
> system will have two funnel levels, with 64 nodes at the leaf level
> and a single node at the root level.  If the system is not properly
> configured, it will have three funnel levels.  The maximum number of
> funnel levels is four, which would handle more than four million CPUs
> (sixteen million if properly configured), so we should be good.  ;-)
>
> The larger numbers of levels are intended strictly for testing.  I set
> CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
> to make sure that I am testing something uglier than what will be running
> in production.  A large system should have both of these set to 64,
> though this requires also booting with skew_tick=1 as well.
>
> Thanx, Paul


Thanks for the clarification; I haven't looked deeply into the code to 
see how many levels there are. I totally understand the impact cacheline 
contention can have on system performance. After all, this is what many 
of my patches are trying to address.


-Longman


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Pranith Kumar
On Tue, Jun 17, 2014 at 1:10 PM, Paul E. McKenney wrote:

>
> Pranith, Romanov:  You do -not-, repeat -not-, get to shoot from the hip
> with this code.  You absolutely need to understand what it is doing and
> why before you try hacking on it.  Otherwise, all that will happen is
> that you will come up with additional creative ways to break RCU.  Your
> commit log and comments will need to clearly indicate that you understand
> what happens if (say) 4096 CPUs all call force_quiescent_state() at the
> same time.
>

I do apologize for the noise I am causing.
I totally agree that I need to understand what exactly is happening
before I propose to do anything. I am trying to understand how all
this works by asking questions. I did try looking at why the locks
were there and spent quite some time trying to figure it out looking
at the history, but could not entirely get it. Hence the RFC.

Thank you for patiently considering and replying to my RFCs.

Regards,
-- 
Pranith


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Tue, Jun 17, 2014 at 10:11:16AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 17, 2014 at 12:56:22PM -0400, Waiman Long wrote:
> > On 06/17/2014 12:01 PM, Romanov Arya wrote:
> > >On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney wrote:
> > >>On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
> > >>>This might sound really naive, but please bear with me.
> > >>>
> > >>>force_quiescent_state() used to do a lot of things in the past in addition to
> > >>>forcing a quiescent state. (In my reading of the mailing list I found state
> > >>>transitions for one).
> > >>>
> > >>>Now according to the code, what is being done is multiple callers try to go up
> > >>>the hierarchy of nodes to see who reaches the root node. The caller reaching the
> > >>>root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
> > >>>
> > >>>At each level of the hierarchy we try to acquire fqslock. This is the only place
> > >>>which actually uses fqslock.
> > >>>
> > >>>I guess this was being done to avoid the contention on fqslock, but all we are
> > >>>doing here is setting one flag. This way of acquiring locks might reduce
> > >>>contention if every update is trying to do some independent work, but here all
> > >>>we are doing is setting the same flag with same value.
> > >>Actually, to reduce contention on rnp_root->lock.
> > >>
> > >>The trick is that the "losers" at each level of ->fqslock acquisition go
> > >>away.  The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.
> > >>
> > >>>We can also remove fqslock completely if we do not need this. Also using
> > >>>cmpxchg() to set the value of the flag looks like a good idea to avoid taking
> > >>>the root node lock. Thoughts?
> > >>The ->fqslock funnel was needed to avoid lockups on large systems (many
> > >>hundreds or even thousands of CPUs).  Moving grace-period responsibilities
> > >>from softirq to the grace-period kthreads might have reduced contention
> > >>sufficiently to make the ->fqslock funnel unnecessary.  However, given
> > >>that I don't usually have access to such a large system, I will leave it,
> > >>at least for the time being.
> > >Sounds like a good case study for using the newly introduced MCS based
> > >locks(qspinlock.h).
> > >Waiman, Peter?
> > 
> > The ticket trylock and queue trylock are similar in the sense they
> > both do a read to check if the lock is free first and then use
> > cmpxchg() to acquire the lock. So if it is a problem to use ticket
> > spinlock, it will be a problem for the queue spinlock as well. The
> > advantage of the queue spinlock in a heavily contended case is that
> > the lock waiters except the one in the head of the queue will not be
> > contending the lock cacheline.
> > >Btw, is doing the following a bad idea? It reduces contention on
> > >rnp_root->lock using fqslock
> > >which seems to be the lock which needs to be taken while forcing a
> > >quiescent state:
> > 
> > I don't see any problem with the existing code as long as the number
> > of funnel levels is small.

Oh, and to answer the implicit question...  A properly configured 4096-CPU
system will have two funnel levels, with 64 nodes at the leaf level
and a single node at the root level.  If the system is not properly
configured, it will have three funnel levels.  The maximum number of
funnel levels is four, which would handle more than four million CPUs
(sixteen million if properly configured), so we should be good.  ;-)

The larger numbers of levels are intended strictly for testing.  I set
CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
to make sure that I am testing something uglier than what will be running
in production.  A large system should have both of these set to 64,
though this requires also booting with skew_tick=1 as well.
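
The arithmetic behind those numbers, as a small user-space sketch
(assuming the then-default CONFIG_RCU_FANOUT_LEAF=16 for the "not
properly configured" case; illustrative only):

#include <stdio.h>

/* Levels needed to cover ncpus, given leaf and interior fanout. */
static int funnel_levels(long ncpus, long fanout_leaf, long fanout)
{
	long capacity = fanout_leaf;
	int levels = 1;

	while (capacity < ncpus) {
		capacity *= fanout;
		levels++;
	}
	return levels;
}

int main(void)
{
	printf("%d\n", funnel_levels(4096, 64, 64)); /* 2: 64 leaves + root */
	printf("%d\n", funnel_levels(4096, 16, 64)); /* 3: default leaf */
	printf("%ld\n", 16L * 64 * 64 * 64);  /* 4 levels: > 4 million CPUs */
	printf("%ld\n", 64L * 64 * 64 * 64);  /* 4 levels: ~16 million CPUs */
	return 0;
}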

Thanx, Paul

> >You really need to have some performance
> > data to show that your approach is better.
> 
> What Waiman said.
> 
> And you will need that performance data taken on systems with at least
> 1024 CPUs, because it was those systems that motivated this code.
> 
>   Thanx, Paul
> 
> > >diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > >index f1ba773..f5a0e7e 100644
> > >--- a/kernel/rcu/tree.c
> > >+++ b/kernel/rcu/tree.c
> > >@@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
> > >  unsigned long flags;
> > >  bool ret;
> > >  struct rcu_node *rnp;
> > >-struct rcu_node *rnp_old = NULL;
> > >-
> > >-/* Funnel through hierarchy to reduce memory contention. */
> > >-rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> > >-for (; rnp != NULL; rnp = rnp->parent) {
> > >-ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> > >-  

Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Tue, Jun 17, 2014 at 12:56:22PM -0400, Waiman Long wrote:
> On 06/17/2014 12:01 PM, Romanov Arya wrote:
> >On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney wrote:
> >>On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
> >>>This might sound really naive, but please bear with me.
> >>>
> >>>force_quiescent_state() used to do a lot of things in the past in addition to
> >>>forcing a quiescent state. (In my reading of the mailing list I found state
> >>>transitions for one).
> >>>
> >>>Now according to the code, what is being done is multiple callers try to go up
> >>>the hierarchy of nodes to see who reaches the root node. The caller reaching the
> >>>root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
> >>>
> >>>At each level of the hierarchy we try to acquire fqslock. This is the only place
> >>>which actually uses fqslock.
> >>>
> >>>I guess this was being done to avoid the contention on fqslock, but all we are
> >>>doing here is setting one flag. This way of acquiring locks might reduce
> >>>contention if every update is trying to do some independent work, but here all
> >>>we are doing is setting the same flag with same value.
> >>Actually, to reduce contention on rnp_root->lock.
> >>
> >>The trick is that the "losers" at each level of ->fqslock acquisition go
> >>away.  The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.
> >>
> >>>We can also remove fqslock completely if we do not need this. Also using
> >>cmpxchg() to set the value of the flag looks like a good idea to avoid taking
> >>>the root node lock. Thoughts?
> >>The ->fqslock funnel was needed to avoid lockups on large systems (many
> >>hundreds or even thousands of CPUs).  Moving grace-period responsibilities
> >>from softirq to the grace-period kthreads might have reduced contention
> >>sufficiently to make the ->fqslock funnel unnecessary.  However, given
> >>that I don't usually have access to such a large system, I will leave it,
> >>at least for the time being.
> >Sounds like a good case study for using the newly introduced MCS based
> >locks(qspinlock.h).
> >Waiman, Peter?
> 
> The ticket trylock and queue trylock are similar in the sense they
> both do a read to check if the lock is free first and then use
> cmpxchg() to acquire the lock. So if it is a problem to use ticket
> spinlock, it will be a problem for the queue spinlock as well. The
> advantage of the queue spinlock in a heavily contended case is that
> the lock waiters except the one in the head of the queue will not be
> contending the lock cacheline.
> >Btw, is doing the following a bad idea? It reduces contention on
> >rnp_root->lock using fqslock
> >which seems to be the lock which needs to be taken while forcing a
> >quiescent state:
> 
> I don't see any problem with the existing code as long as the number
> of funnel levels is small. You really need to have some performance
> data to show that your approach is better.

What Waiman said.

And you will need that performance data taken on systems with at least
1024 CPUs, because it was those systems that motivated this code.

Thanx, Paul

> >diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >index f1ba773..f5a0e7e 100644
> >--- a/kernel/rcu/tree.c
> >+++ b/kernel/rcu/tree.c
> >@@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
> >  unsigned long flags;
> >  bool ret;
> >  struct rcu_node *rnp;
> >-struct rcu_node *rnp_old = NULL;
> >-
> >-/* Funnel through hierarchy to reduce memory contention. */
> >-rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> >-for (; rnp != NULL; rnp = rnp->parent) {
> >-ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> >-  !raw_spin_trylock(&rnp->fqslock);
> >-if (rnp_old != NULL)
> >-raw_spin_unlock(&rnp_old->fqslock);
> >-if (ret) {
> >-ACCESS_ONCE(rsp->n_force_qs_lh)++;
> >-return;
> >-}
> >-rnp_old = rnp;
> >+struct rcu_node *rnp_root = rcu_get_root(rsp);
> >+
> >+if (!raw_spin_trylock(rnp_root->fqslock)) {
> >+ACCESS_ONCE(rsp->n_force_qs_lh)++;
> >+return;  /* Someone is already trying to force */
> >  }
> 
> I think it is better to check the gp_flags first before doing a trylock.
> 
> -Longman
> 



Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Tue, Jun 17, 2014 at 12:01:28PM -0400, Romanov Arya wrote:
> On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney wrote:
> > On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
> >> This might sound really naive, but please bear with me.
> >>
> >> force_quiescent_state() used to do a lot of things in the past in addition to
> >> forcing a quiescent state. (In my reading of the mailing list I found state
> >> transitions for one).
> >>
> >> Now according to the code, what is being done is multiple callers try to go up
> >> the hierarchy of nodes to see who reaches the root node. The caller reaching the
> >> root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
> >>
> >> At each level of the hierarchy we try to acquire fqslock. This is the only place
> >> which actually uses fqslock.
> >>
> >> I guess this was being done to avoid the contention on fqslock, but all we are
> >> doing here is setting one flag. This way of acquiring locks might reduce
> >> contention if every update is trying to do some independent work, but here all
> >> we are doing is setting the same flag with same value.
> >
> > Actually, to reduce contention on rnp_root->lock.
> >
> > The trick is that the "losers" at each level of ->fqslock acquisition go
> > away.  The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.
> >
> >> We can also remove fqslock completely if we do not need this. Also using
> >> cmpxchg() to set the value of the flag looks like a good idea to avoid taking
> >> the root node lock. Thoughts?
> >
> > The ->fqslock funnel was needed to avoid lockups on large systems (many
> > hundreds or even thousands of CPUs).  Moving grace-period responsibilities
> > from softirq to the grace-period kthreads might have reduced contention
> > sufficiently to make the ->fqslock funnel unnecessary.  However, given
> > that I don't usually have access to such a large system, I will leave it,
> > at least for the time being.
> 
> Sounds like a good case study for using the newly introduced MCS based
> locks(qspinlock.h).
> Waiman, Peter?

No.  Absolutely not.

Any exclusive lock, MCS or otherwise, will waste a huge amount of time
in this case.

> Btw, is doing the following a bad idea? It reduces contention on
> rnp_root->lock using fqslock
> which seems to be the lock which needs to be taken while forcing a
> quiescent state:
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f1ba773..f5a0e7e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
>  unsigned long flags;
>  bool ret;
>  struct rcu_node *rnp;
> -struct rcu_node *rnp_old = NULL;
> -
> -/* Funnel through hierarchy to reduce memory contention. */
> -rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> -for (; rnp != NULL; rnp = rnp->parent) {
> -ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> -  !raw_spin_trylock(&rnp->fqslock);
> -if (rnp_old != NULL)
> -raw_spin_unlock(&rnp_old->fqslock);
> -if (ret) {
> -ACCESS_ONCE(rsp->n_force_qs_lh)++;
> -return;
> -}
> -rnp_old = rnp;
> +struct rcu_node *rnp_root = rcu_get_root(rsp);
> +
> +if (!raw_spin_trylock(rnp_root->fqslock)) {

This will be an epic fail on huge systems because it can result in massive
memory contention.  And yes, I have recently reduced the probability of
large numbers of concurrent calls to force_quiescent_state(), but this
situation really can still happen, for example, if a bunch of CPUs are
doing lots of concurrent call_rcu() invocations.

Pranith, Romanov:  You do -not-, repeat -not-, get to shoot from the hip
with this code.  You absolutely need to understand what it is doing and
why before you try hacking on it.  Otherwise, all that will happen is
that you will come up with additional creative ways to break RCU.  Your
commit log and comments will need to clearly indicate that you understand
what happens if (say) 4096 CPUs all call force_quiescent_state() at the
same time.

Thanx, Paul

> +ACCESS_ONCE(rsp->n_force_qs_lh)++;
> +return;  /* Someone is already trying to force */
>  }
> -/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
> 
> -/* Reached the root of the rcu_node tree, acquire lock. */
> -raw_spin_lock_irqsave(&rnp_old->lock, flags);
> -smp_mb__after_unlock_lock();
> -raw_spin_unlock(&rnp_old->fqslock);
>  if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
>  ACCESS_ONCE(rsp->n_force_qs_lh)++;
> -raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> +raw_spin_unlock(rnp_root->fqslock);
>  return;  /* Someone beat us to it. */
>  }
> +
> +/* Reached the root of the rcu_node tree, acquire lock. */
> +

Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Waiman Long

On 06/17/2014 12:01 PM, Romanov Arya wrote:
> On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney wrote:
> > On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
> > > This might sound really naive, but please bear with me.
> > >
> > > force_quiescent_state() used to do a lot of things in the past in addition to
> > > forcing a quiescent state. (In my reading of the mailing list I found state
> > > transitions for one).
> > >
> > > Now according to the code, what is being done is multiple callers try to go up
> > > the hierarchy of nodes to see who reaches the root node. The caller reaching the
> > > root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
> > >
> > > At each level of the hierarchy we try to acquire fqslock. This is the only place
> > > which actually uses fqslock.
> > >
> > > I guess this was being done to avoid the contention on fqslock, but all we are
> > > doing here is setting one flag. This way of acquiring locks might reduce
> > > contention if every update is trying to do some independent work, but here all
> > > we are doing is setting the same flag with same value.
> >
> > Actually, to reduce contention on rnp_root->lock.
> >
> > The trick is that the "losers" at each level of ->fqslock acquisition go
> > away.  The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.
> >
> > > We can also remove fqslock completely if we do not need this. Also using
> > > cmpxchg() to set the value of the flag looks like a good idea to avoid taking
> > > the root node lock. Thoughts?
> >
> > The ->fqslock funnel was needed to avoid lockups on large systems (many
> > hundreds or even thousands of CPUs).  Moving grace-period responsibilities
> > from softirq to the grace-period kthreads might have reduced contention
> > sufficiently to make the ->fqslock funnel unnecessary.  However, given
> > that I don't usually have access to such a large system, I will leave it,
> > at least for the time being.
>
> Sounds like a good case study for using the newly introduced MCS based
> locks(qspinlock.h).
> Waiman, Peter?


The ticket trylock and queue trylock are similar in the sense they both 
do a read to check if the lock is free first and then use cmpxchg() to 
acquire the lock. So if it is a problem to use ticket spinlock, it will 
be a problem for the queue spinlock as well. The advantage of the queue 
spinlock in a heavily contended case is that the lock waiters except the 
one in the head of the queue will not be contending the lock cacheline.
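
(As a minimal sketch of that read-then-cmpxchg pattern; illustrative
only, not the kernel's actual ticket or queue implementation:)

/* Contended attempts stay read-only on the lock cacheline; only a
 * lock that was observed free is attacked with the cmpxchg(). */
static inline int sketch_trylock(atomic_t *lock)
{
	if (atomic_read(lock) != 0)
		return 0;                            /* observed busy */
	return atomic_cmpxchg(lock, 0, 1) == 0;     /* one RMW to acquire */
}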

> Btw, is doing the following a bad idea? It reduces contention on
> rnp_root->lock using fqslock
> which seems to be the lock which needs to be taken while forcing a
> quiescent state:


I don't see any problem with the existing code as long as the number of 
funnel levels is small. You really need to have some performance data to 
show that your approach is better.




> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f1ba773..f5a0e7e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
>   unsigned long flags;
>   bool ret;
>   struct rcu_node *rnp;
> -struct rcu_node *rnp_old = NULL;
> -
> -/* Funnel through hierarchy to reduce memory contention. */
> -rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> -for (; rnp != NULL; rnp = rnp->parent) {
> -ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> -  !raw_spin_trylock(&rnp->fqslock);
> -if (rnp_old != NULL)
> -raw_spin_unlock(&rnp_old->fqslock);
> -if (ret) {
> -ACCESS_ONCE(rsp->n_force_qs_lh)++;
> -return;
> -}
> -rnp_old = rnp;
> +struct rcu_node *rnp_root = rcu_get_root(rsp);
> +
> +if (!raw_spin_trylock(rnp_root->fqslock)) {
> +ACCESS_ONCE(rsp->n_force_qs_lh)++;
> +return;  /* Someone is already trying to force */
>   }


I think it is better to check the gp_flags first before doing a trylock.

-Longman


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Romanov Arya
On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney wrote:
> On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
>> This might sound really naive, but please bear with me.
>>
>> force_quiescent_state() used to do a lot of things in the past in addition to
>> forcing a quiescent state. (In my reading of the mailing list I found state
>> transitions for one).
>>
>> Now according to the code, what is being done is multiple callers try to go up
>> the hierarchy of nodes to see who reaches the root node. The caller reaching the
>> root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
>>
>> At each level of the hierarchy we try to acquire fqslock. This is the only place
>> which actually uses fqslock.
>>
>> I guess this was being done to avoid the contention on fqslock, but all we are
>> doing here is setting one flag. This way of acquiring locks might reduce
>> contention if every update is trying to do some independent work, but here all
>> we are doing is setting the same flag with same value.
>
> Actually, to reduce contention on rnp_root->lock.
>
> The trick is that the "losers" at each level of ->fqslock acquisition go
> away.  The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.
>
>> We can also remove fqslock completely if we do not need this. Also using
>> cmpxchg() to set the value of the flag looks like a good idea to avoid taking
>> the root node lock. Thoughts?
>
> The ->fqslock funnel was needed to avoid lockups on large systems (many
> hundreds or even thousands of CPUs).  Moving grace-period responsibilities
> from softirq to the grace-period kthreads might have reduced contention
> sufficiently to make the ->fqslock funnel unnecessary.  However, given
> that I don't usually have access to such a large system, I will leave it,
> at least for the time being.

Sounds like a good case study for using the newly introduced MCS based
locks(qspinlock.h).
Waiman, Peter?

Btw, is doing the following a bad idea? It reduces contention on
rnp_root->lock using fqslock
which seems to be the lock which needs to be taken while forcing a
quiescent state:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f1ba773..f5a0e7e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
 unsigned long flags;
 bool ret;
 struct rcu_node *rnp;
-struct rcu_node *rnp_old = NULL;
-
-/* Funnel through hierarchy to reduce memory contention. */
-rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
-for (; rnp != NULL; rnp = rnp->parent) {
-ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
-  !raw_spin_trylock(&rnp->fqslock);
-if (rnp_old != NULL)
-raw_spin_unlock(&rnp_old->fqslock);
-if (ret) {
-ACCESS_ONCE(rsp->n_force_qs_lh)++;
-return;
-}
-rnp_old = rnp;
+struct rcu_node *rnp_root = rcu_get_root(rsp);
+
+if (!raw_spin_trylock(rnp_root->fqslock)) {
+ACCESS_ONCE(rsp->n_force_qs_lh)++;
+return;  /* Someone is already trying to force */
 }
-/* rnp_old == rcu_get_root(rsp), rnp == NULL. */

-/* Reached the root of the rcu_node tree, acquire lock. */
-raw_spin_lock_irqsave(&rnp_old->lock, flags);
-smp_mb__after_unlock_lock();
-raw_spin_unlock(&rnp_old->fqslock);
 if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 ACCESS_ONCE(rsp->n_force_qs_lh)++;
-raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
+raw_spin_unlock(rnp_root->fqslock);
 return;  /* Someone beat us to it. */
 }
+
+/* Reached the root of the rcu_node tree, acquire lock. */
+raw_spin_lock_irqsave(&rnp_root->lock, flags);
+smp_mb__after_unlock_lock();
 ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
-raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
+raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
 wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }

Regards,
Romanov

>
> But you might be interested in thinking through what else would need to
> change in order to make cmpxchg() work.  ;-)
>
> Thanx, Paul
>
>> Signed-off-by: Pranith Kumar 
>> ---
>>  kernel/rcu/tree.c | 35 +--
>>  1 file changed, 13 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index f1ba773..9a46f32 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -2399,36 +2399,27 @@ static void force_qs_rnp(struct rcu_state *rsp,
>>  static void force_quiescent_state(struct rcu_state *rsp)
>>  {
>>   unsigned long flags;
>> - bool ret;
>> - struct rcu_node *rnp;
>> - struct rcu_node *rnp_old = NULL;
>> -
>> - /* Funnel through hierarchy to reduce memory contention. */
>> - rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
>> - for (; rnp != 

Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
> This might sound really naive, but please bear with me.
> 
> force_quiescent_state() used to do a lot of things in the past in addition to
> forcing a quiescent state. (In my reading of the mailing list I found state
> transitions for one). 
> 
> Now according to the code, what is being done is multiple callers try to go up
> the hierarchy of nodes to see who reaches the root node. The caller reaching the
> root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
> 
> At each level of the hierarchy we try to acquire fqslock. This is the only place
> which actually uses fqslock.
> 
> I guess this was being done to avoid the contention on fqslock, but all we are
> doing here is setting one flag. This way of acquiring locks might reduce
> contention if every update is trying to do some independent work, but here all
> we are doing is setting the same flag with same value.

Actually, to reduce contention on rnp_root->lock.

The trick is that the "losers" at each level of ->fqslock acquisition go
away.  The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.

> We can also remove fqslock completely if we do not need this. Also using
> cmpxchg() to set the value of the flag looks like a good idea to avoid taking
> the root node lock. Thoughts?

The ->fqslock funnel was needed to avoid lockups on large systems (many
hundreds or even thousands of CPUs).  Moving grace-period responsibilities
from softirq to the grace-period kthreads might have reduced contention
sufficiently to make the ->fqslock funnel unnecessary.  However, given
that I don't usually have access to such a large system, I will leave it,
at least for the time being.

But you might be interested in thinking through what else would need to
change in order to make cmpxchg() work.  ;-)
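
(To make the exercise concrete, the obvious cmpxchg()-based variant might
look like the sketch below; this is the thought experiment, not a tested
proposal, and the comments flag what would still need answering:)

	unsigned long old = ACCESS_ONCE(rsp->gp_flags);

	if (old & RCU_GP_FLAG_FQS)
		return;  /* someone beat us to it */
	/* A failed cmpxchg() (e.g., another flag changed meanwhile) would
	 * need a retry loop rather than silently dropping the request. */
	if (cmpxchg(&rsp->gp_flags, old, old | RCU_GP_FLAG_FQS) == old)
		wake_up(&rsp->gp_wq);
	/* Open question: the root rnp->lock currently orders this update
	 * against the grace-period kthread's reads and clears of
	 * ->gp_flags; a bare cmpxchg() must provide equivalent ordering. */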

Thanx, Paul

> Signed-off-by: Pranith Kumar 
> ---
>  kernel/rcu/tree.c | 35 +--
>  1 file changed, 13 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f1ba773..9a46f32 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2399,36 +2399,27 @@ static void force_qs_rnp(struct rcu_state *rsp,
>  static void force_quiescent_state(struct rcu_state *rsp)
>  {
>   unsigned long flags;
> - bool ret;
> - struct rcu_node *rnp;
> - struct rcu_node *rnp_old = NULL;
> -
> - /* Funnel through hierarchy to reduce memory contention. */
> - rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> - for (; rnp != NULL; rnp = rnp->parent) {
> - ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> -   !raw_spin_trylock(&rnp->fqslock);
> - if (rnp_old != NULL)
> - raw_spin_unlock(&rnp_old->fqslock);
> - if (ret) {
> - ACCESS_ONCE(rsp->n_force_qs_lh)++;
> - return;
> - }
> - rnp_old = rnp;
> + struct rcu_node *rnp_root = rcu_get_root(rsp);
> +
> + /* early test to see if someone already forced a quiescent state
> +  */
> + if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
> + ACCESS_ONCE(rsp->n_force_qs_lh)++;
> + return;  /* Someone beat us to it. */
>   }
> - /* rnp_old == rcu_get_root(rsp), rnp == NULL. */
> 
>   /* Reached the root of the rcu_node tree, acquire lock. */
> - raw_spin_lock_irqsave(&rnp_old->lock, flags);
> + raw_spin_lock_irqsave(&rnp_root->lock, flags);
>   smp_mb__after_unlock_lock();
> - raw_spin_unlock(&rnp_old->fqslock);
>   if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
>   ACCESS_ONCE(rsp->n_force_qs_lh)++;
> - raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> - return;  /* Someone beat us to it. */
> + raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
> + return;  /* Someone actually beat us to it. */
>   }
> +
> + /* can we use cmpxchg instead of the above lock? */
>   ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
> - raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> + raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
> + wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
>  }
> 
> -- 
> 1.9.1
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
 This might sound really naive, but please bear with me.
 
 force_quiescent_state() used to do a lot of things in the past in addition to
 forcing a quiescent state. (In my reading of the mailing list I found state
 transitions for one). 
 
 Now according to the code, what is being done is multiple callers try to go up
 the hierarchy of nodes to see who reaches the root node. The caller reaching 
 the
 root node wins and it acquires root node lock and it gets to set 
 rsp-gp_flags!
 
 At each level of the hierarchy we try to acquire fqslock. This is the only 
 place
 which actually uses fqslock. 
 
 I guess this was being done to avoid the contention on fqslock, but all we are
 doing here is setting one flag. This way of acquiring locks might reduce
 contention if every update is trying to do some independent work, but here all
 we are doing is setting the same flag with same value.

Actually, to reduce contention on rnp_root-lock.

The trick is that the losers at each level of -fqslock acquisition go
away.  The winner ends up doing the real work of setting RCU_GP_FLAG_FQS.

 We can also remove fqslock completely if we do not need this. Also using
 cmpxchg() to set the value of the flag looks like a good idea to avoid taking
 the root node lock. Thoughts?

The -fqslock funnel was needed to avoid lockups on large systems (many
hundreds or even thousands of CPUs).  Moving grace-period responsibilities
from softirq to the grace-period kthreads might have reduced contention
sufficienty to make the -fqslock funnel unnecessary.  However, given
that I don't usually have access to such a large system, I will leave it,
at least for the time being.

But you might be interested in thinking through what else would need to
change in order to make cmpxchg() work.  ;-)

Thanx, Paul

 Signed-off-by: Pranith Kumar bobby.pr...@gmail.com
 ---
  kernel/rcu/tree.c | 35 +--
  1 file changed, 13 insertions(+), 22 deletions(-)
 
 diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
 index f1ba773..9a46f32 100644
 --- a/kernel/rcu/tree.c
 +++ b/kernel/rcu/tree.c
 @@ -2399,36 +2399,27 @@ static void force_qs_rnp(struct rcu_state *rsp,
  static void force_quiescent_state(struct rcu_state *rsp)
  {
  	unsigned long flags;
 -	bool ret;
 -	struct rcu_node *rnp;
 -	struct rcu_node *rnp_old = NULL;
 -
 -	/* Funnel through hierarchy to reduce memory contention. */
 -	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
 -	for (; rnp != NULL; rnp = rnp->parent) {
 -		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
 -		      !raw_spin_trylock(&rnp->fqslock);
 -		if (rnp_old != NULL)
 -			raw_spin_unlock(&rnp_old->fqslock);
 -		if (ret) {
 -			ACCESS_ONCE(rsp->n_force_qs_lh)++;
 -			return;
 -		}
 -		rnp_old = rnp;
 +	struct rcu_node *rnp_root = rcu_get_root(rsp);
 +
 +	/* early test to see if someone already forced a quiescent state
 +	 */
 +	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 +		ACCESS_ONCE(rsp->n_force_qs_lh)++;
 +		return;  /* Someone beat us to it. */
  	}
 -	/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
 
  	/* Reached the root of the rcu_node tree, acquire lock. */
 -	raw_spin_lock_irqsave(&rnp_old->lock, flags);
 +	raw_spin_lock_irqsave(&rnp_root->lock, flags);
  	smp_mb__after_unlock_lock();
 -	raw_spin_unlock(&rnp_old->fqslock);
  	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
  		ACCESS_ONCE(rsp->n_force_qs_lh)++;
 -		raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 -		return;  /* Someone beat us to it. */
 +		raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
 +		return;  /* Someone actually beat us to it. */
  	}
 +
 +	/* can we use cmpxchg instead of the above lock? */
  	ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
 -	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 +	raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
  	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
  }
 
 -- 
 1.9.1
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Romanov Arya
On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:
 On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
 This might sound really naive, but please bear with me.

 force_quiescent_state() used to do a lot of things in the past in addition to
 forcing a quiescent state. (In my reading of the mailing list I found state
 transitions for one).

 Now according to the code, what is being done is multiple callers try to go up
 the hierarchy of nodes to see who reaches the root node. The caller reaching
 the root node wins: it acquires the root node lock and gets to set
 rsp->gp_flags!
 
 At each level of the hierarchy we try to acquire fqslock. This is the only
 place which actually uses fqslock.
 
 I guess this was being done to avoid the contention on fqslock, but all we are
 doing here is setting one flag. This way of acquiring locks might reduce
 contention if every update is trying to do some independent work, but here all
 we are doing is setting the same flag with the same value.

 Actually, to reduce contention on rnp_root->lock.

 The trick is that the losers at each level of ->fqslock acquisition go
 away.  The winner ends up doing the real work of setting RCU_GP_FLAG_FQS.

 We can also remove fqslock completely if we do not need this. Also using
 cmpxchg() to set the value of the flag looks like a good idea to avoid taking
 the root node lock. Thoughts?

 The ->fqslock funnel was needed to avoid lockups on large systems (many
 hundreds or even thousands of CPUs).  Moving grace-period responsibilities
 from softirq to the grace-period kthreads might have reduced contention
 sufficiently to make the ->fqslock funnel unnecessary.  However, given
 that I don't usually have access to such a large system, I will leave it,
 at least for the time being.

Sounds like a good case study for using the newly introduced MCS-based
locks (qspinlock.h).
Waiman, Peter?

Btw, is doing the following a bad idea? It reduces contention on
rnp_root->lock by using fqslock, which seems to be the lock which needs
to be taken while forcing a quiescent state:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f1ba773..f5a0e7e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
 	unsigned long flags;
 	bool ret;
 	struct rcu_node *rnp;
-	struct rcu_node *rnp_old = NULL;
-
-	/* Funnel through hierarchy to reduce memory contention. */
-	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
-	for (; rnp != NULL; rnp = rnp->parent) {
-		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
-		      !raw_spin_trylock(&rnp->fqslock);
-		if (rnp_old != NULL)
-			raw_spin_unlock(&rnp_old->fqslock);
-		if (ret) {
-			ACCESS_ONCE(rsp->n_force_qs_lh)++;
-			return;
-		}
-		rnp_old = rnp;
+	struct rcu_node *rnp_root = rcu_get_root(rsp);
+
+	if (!raw_spin_trylock(&rnp_root->fqslock)) {
+		ACCESS_ONCE(rsp->n_force_qs_lh)++;
+		return;  /* Someone is already trying to force */
 	}
-	/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
 
-	/* Reached the root of the rcu_node tree, acquire lock. */
-	raw_spin_lock_irqsave(&rnp_old->lock, flags);
-	smp_mb__after_unlock_lock();
-	raw_spin_unlock(&rnp_old->fqslock);
 	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 		ACCESS_ONCE(rsp->n_force_qs_lh)++;
-		raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
+		raw_spin_unlock(&rnp_root->fqslock);
 		return;  /* Someone beat us to it. */
 	}
+
+	/* Reached the root of the rcu_node tree, acquire lock. */
+	raw_spin_lock_irqsave(&rnp_root->lock, flags);
+	smp_mb__after_unlock_lock();
 	ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
-	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
+	raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
 	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }

Regards,
Romanov


 But you might be interested in thinking through what else would need to
 change in order to make cmpxchg() work.  ;-)

 Thanx, Paul

 Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
 ---
  kernel/rcu/tree.c | 35 +--
  1 file changed, 13 insertions(+), 22 deletions(-)

 diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
 index f1ba773..9a46f32 100644
 --- a/kernel/rcu/tree.c
 +++ b/kernel/rcu/tree.c
 @@ -2399,36 +2399,27 @@ static void force_qs_rnp(struct rcu_state *rsp,
  static void force_quiescent_state(struct rcu_state *rsp)
  {
  	unsigned long flags;
 -	bool ret;
 -	struct rcu_node *rnp;
 -	struct rcu_node *rnp_old = NULL;
 -
 -	/* Funnel through hierarchy to reduce memory contention. */
 -	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
 -	for (; rnp != NULL; rnp = rnp->parent) {
 -		ret = 

Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Waiman Long

On 06/17/2014 12:01 PM, Romanov Arya wrote:

On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com  wrote:

On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:

This might sound really naive, but please bear with me.

force_quiescent_state() used to do a lot of things in the past in addition to
forcing a quiescent state. (In my reading of the mailing list I found state
transitions for one).

Now according to the code, what is being done is multiple callers try to go up
the hierarchy of nodes to see who reaches the root node. The caller reaching
the root node wins: it acquires the root node lock and gets to set rsp->gp_flags!

At each level of the hierarchy we try to acquire fqslock. This is the only place
which actually uses fqslock.

I guess this was being done to avoid the contention on fqslock, but all we are
doing here is setting one flag. This way of acquiring locks might reduce
contention if every update is trying to do some independent work, but here all
we are doing is setting the same flag with the same value.

Actually, to reduce contention on rnp_root->lock.

The trick is that the losers at each level of ->fqslock acquisition go
away.  The winner ends up doing the real work of setting RCU_GP_FLAG_FQS.


We can also remove fqslock completely if we do not need this. Also using
cmpxchg() to set the value of the flag looks like a good idea to avoid taking
the root node lock. Thoughts?

The ->fqslock funnel was needed to avoid lockups on large systems (many
hundreds or even thousands of CPUs).  Moving grace-period responsibilities
from softirq to the grace-period kthreads might have reduced contention
sufficiently to make the ->fqslock funnel unnecessary.  However, given
that I don't usually have access to such a large system, I will leave it,
at least for the time being.

Sounds like a good case study for using the newly introduced MCS-based
locks (qspinlock.h).
Waiman, Peter?


The ticket trylock and queue trylock are similar in the sense that they both
do a read to check if the lock is free first and then use cmpxchg() to
acquire the lock. So if it is a problem to use the ticket spinlock, it will
be a problem for the queue spinlock as well. The advantage of the queue
spinlock in a heavily contended case is that the lock waiters, except the
one at the head of the queue, will not be contending on the lock cacheline.
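
[Ed. note: an illustrative userspace sketch of the read-then-cmpxchg
trylock shape described above.  The toy_lock type and toy_trylock()
are hypothetical; they mirror the pattern, not the kernel's ticket or
queued spinlock implementations.]

#include <stdatomic.h>
#include <stdbool.h>

struct toy_lock {
	atomic_uint val;		/* 0 == unlocked */
};

static bool toy_trylock(struct toy_lock *l)
{
	unsigned int old = atomic_load_explicit(&l->val,
						memory_order_relaxed);

	if (old != 0)
		return false;	/* Looks busy; fail without a store. */
	/* Acquire semantics on success, as a lock fast path needs. */
	return atomic_compare_exchange_strong_explicit(&l->val, &old, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}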

Btw, is doing the following a bad idea? It reduces contention on
rnp_root->lock by using fqslock, which seems to be the lock which needs
to be taken while forcing a quiescent state:


I don't see any problem with the existing code as long as the number of 
funnel levels is small. You really need to have some performance data to 
show that your approach is better.




diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f1ba773..f5a0e7e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
 	unsigned long flags;
 	bool ret;
 	struct rcu_node *rnp;
-	struct rcu_node *rnp_old = NULL;
-
-	/* Funnel through hierarchy to reduce memory contention. */
-	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
-	for (; rnp != NULL; rnp = rnp->parent) {
-		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
-		      !raw_spin_trylock(&rnp->fqslock);
-		if (rnp_old != NULL)
-			raw_spin_unlock(&rnp_old->fqslock);
-		if (ret) {
-			ACCESS_ONCE(rsp->n_force_qs_lh)++;
-			return;
-		}
-		rnp_old = rnp;
+	struct rcu_node *rnp_root = rcu_get_root(rsp);
+
+	if (!raw_spin_trylock(&rnp_root->fqslock)) {
+		ACCESS_ONCE(rsp->n_force_qs_lh)++;
+		return;  /* Someone is already trying to force */
 	}


I think it is better to check the gp_flags first before doing a trylock.
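
[Ed. note: in code, Longman's suggestion would amount to something like
the following reordering of the sketch above (hypothetical, untested);
the cheap flag read filters out most callers before anyone touches the
fqslock cacheline:]

	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
		ACCESS_ONCE(rsp->n_force_qs_lh)++;
		return;  /* Someone beat us to it. */
	}
	if (!raw_spin_trylock(&rnp_root->fqslock)) {
		ACCESS_ONCE(rsp->n_force_qs_lh)++;
		return;  /* Someone is already trying to force */
	}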

-Longman
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Tue, Jun 17, 2014 at 12:01:28PM -0400, Romanov Arya wrote:
 On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
  On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
  This might sound really naive, but please bear with me.
 
 force_quiescent_state() used to do a lot of things in the past in addition
 to forcing a quiescent state. (In my reading of the mailing list I found
 state transitions for one).
 
 Now according to the code, what is being done is multiple callers try to
 go up the hierarchy of nodes to see who reaches the root node. The caller
 reaching the root node wins: it acquires the root node lock and gets to
 set rsp->gp_flags!
 
 At each level of the hierarchy we try to acquire fqslock. This is the only
 place which actually uses fqslock.
 
 I guess this was being done to avoid the contention on fqslock, but all we
 are doing here is setting one flag. This way of acquiring locks might
 reduce contention if every update is trying to do some independent work,
 but here all we are doing is setting the same flag with the same value.
 
 Actually, to reduce contention on rnp_root->lock.
 
 The trick is that the losers at each level of ->fqslock acquisition go
 away.  The winner ends up doing the real work of setting RCU_GP_FLAG_FQS.
 
 We can also remove fqslock completely if we do not need this. Also using
 cmpxchg() to set the value of the flag looks like a good idea to avoid
 taking the root node lock. Thoughts?
 
 The ->fqslock funnel was needed to avoid lockups on large systems (many
 hundreds or even thousands of CPUs).  Moving grace-period responsibilities
 from softirq to the grace-period kthreads might have reduced contention
 sufficiently to make the ->fqslock funnel unnecessary.  However, given
 that I don't usually have access to such a large system, I will leave it,
 at least for the time being.
 
 Sounds like a good case study for using the newly introduced MCS-based
 locks (qspinlock.h).
 Waiman, Peter?

No.  Absolutely not.

Any exclusive lock, MCS or otherwise, will waste a huge amount of time
in this case.

 Btw, is doing the following a bad idea? It reduces contention on
 rnp_root->lock by using fqslock, which seems to be the lock which needs
 to be taken while forcing a quiescent state:
 
 diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
 index f1ba773..f5a0e7e 100644
 --- a/kernel/rcu/tree.c
 +++ b/kernel/rcu/tree.c
 @@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state 
 *rsp)
  	unsigned long flags;
  	bool ret;
  	struct rcu_node *rnp;
 -	struct rcu_node *rnp_old = NULL;
 -
 -	/* Funnel through hierarchy to reduce memory contention. */
 -	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
 -	for (; rnp != NULL; rnp = rnp->parent) {
 -		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
 -		      !raw_spin_trylock(&rnp->fqslock);
 -		if (rnp_old != NULL)
 -			raw_spin_unlock(&rnp_old->fqslock);
 -		if (ret) {
 -			ACCESS_ONCE(rsp->n_force_qs_lh)++;
 -			return;
 -		}
 -		rnp_old = rnp;
 +	struct rcu_node *rnp_root = rcu_get_root(rsp);
 +
 +	if (!raw_spin_trylock(&rnp_root->fqslock)) {

This will be an epic fail on huge systems because it can result in massive
memory contention.  And yes, I have recently reduced the probability of
large numbers of concurrent calls to force_quiescent_state(), but this
situation really can still happen, for example, if a bunch of CPUs are
doing lots of concurrent call_rcu() invocations.

Pranith, Romanov:  You do -not-, repeat -not-, get to shoot from the hip
with this code.  You absolutely need to understand what it is doing and
why before you try hacking on it.  Otherwise, all that will happen is
that you will come up with additional creative ways to break RCU.  Your
commit log and comments will need to clearly indicate that you understand
what happens if (say) 4096 CPUs all call force_quiescent_state() at the
same time.

Thanx, Paul

 +		ACCESS_ONCE(rsp->n_force_qs_lh)++;
 +		return;  /* Someone is already trying to force */
  	}
 -	/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
 
 -	/* Reached the root of the rcu_node tree, acquire lock. */
 -	raw_spin_lock_irqsave(&rnp_old->lock, flags);
 -	smp_mb__after_unlock_lock();
 -	raw_spin_unlock(&rnp_old->fqslock);
  	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
  		ACCESS_ONCE(rsp->n_force_qs_lh)++;
 -		raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 +		raw_spin_unlock(&rnp_root->fqslock);
  		return;  /* Someone beat us to it. */
  	}
 +
 +	/* Reached the root of the rcu_node tree, acquire lock. */
 +	raw_spin_lock_irqsave(&rnp_root->lock, flags);
 +	smp_mb__after_unlock_lock();
  	ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
 -	raw_spin_unlock_irqrestore(&rnp_old->lock, 

Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Tue, Jun 17, 2014 at 12:56:22PM -0400, Waiman Long wrote:
 On 06/17/2014 12:01 PM, Romanov Arya wrote:
 On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com  wrote:
 On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
 This might sound really naive, but please bear with me.
 
 force_quiescent_state() used to do a lot of things in the past in addition
 to forcing a quiescent state. (In my reading of the mailing list I found
 state transitions for one).
 
 Now according to the code, what is being done is multiple callers try to
 go up the hierarchy of nodes to see who reaches the root node. The caller
 reaching the root node wins: it acquires the root node lock and gets to
 set rsp->gp_flags!
 
 At each level of the hierarchy we try to acquire fqslock. This is the only
 place which actually uses fqslock.
 
 I guess this was being done to avoid the contention on fqslock, but all we
 are doing here is setting one flag. This way of acquiring locks might
 reduce contention if every update is trying to do some independent work,
 but here all we are doing is setting the same flag with the same value.
 Actually, to reduce contention on rnp_root->lock.
 
 The trick is that the losers at each level of ->fqslock acquisition go
 away.  The winner ends up doing the real work of setting RCU_GP_FLAG_FQS.
 
 We can also remove fqslock completely if we do not need this. Also using
 cmpxchg() to set the value of the flag looks like a good idea to avoid
 taking the root node lock. Thoughts?
 The ->fqslock funnel was needed to avoid lockups on large systems (many
 hundreds or even thousands of CPUs).  Moving grace-period responsibilities
 from softirq to the grace-period kthreads might have reduced contention
 sufficiently to make the ->fqslock funnel unnecessary.  However, given
 that I don't usually have access to such a large system, I will leave it,
 at least for the time being.
 Sounds like a good case study for using the newly introduced MCS-based
 locks (qspinlock.h).
 Waiman, Peter?
 
 The ticket trylock and queue trylock are similar in the sense that they
 both do a read to check if the lock is free first and then use
 cmpxchg() to acquire the lock. So if it is a problem to use the ticket
 spinlock, it will be a problem for the queue spinlock as well. The
 advantage of the queue spinlock in a heavily contended case is that
 the lock waiters, except the one at the head of the queue, will not be
 contending on the lock cacheline.
 Btw, is doing the following a bad idea? It reduces contention on
 rnp_root->lock by using fqslock, which seems to be the lock which needs
 to be taken while forcing a quiescent state:
 
 I don't see any problem with the existing code as long as the number
 of funnel levels is small. You really need to have some performance
 data to show that your approach is better.

What Waiman said.

And you will need that performance data taken on systems with at least
1024 CPUs, because it was those systems that motivated this code.

Thanx, Paul

 diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
 index f1ba773..f5a0e7e 100644
 --- a/kernel/rcu/tree.c
 +++ b/kernel/rcu/tree.c
 @@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state 
 *rsp)
  	unsigned long flags;
  	bool ret;
  	struct rcu_node *rnp;
 -	struct rcu_node *rnp_old = NULL;
 -
 -	/* Funnel through hierarchy to reduce memory contention. */
 -	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
 -	for (; rnp != NULL; rnp = rnp->parent) {
 -		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
 -		      !raw_spin_trylock(&rnp->fqslock);
 -		if (rnp_old != NULL)
 -			raw_spin_unlock(&rnp_old->fqslock);
 -		if (ret) {
 -			ACCESS_ONCE(rsp->n_force_qs_lh)++;
 -			return;
 -		}
 -		rnp_old = rnp;
 +	struct rcu_node *rnp_root = rcu_get_root(rsp);
 +
 +	if (!raw_spin_trylock(&rnp_root->fqslock)) {
 +		ACCESS_ONCE(rsp->n_force_qs_lh)++;
 +		return;  /* Someone is already trying to force */
  	}
 
 I think it is better to check the gp_flags first before doing a trylock.
 
 -Longman
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Paul E. McKenney
On Tue, Jun 17, 2014 at 10:11:16AM -0700, Paul E. McKenney wrote:
 On Tue, Jun 17, 2014 at 12:56:22PM -0400, Waiman Long wrote:
  On 06/17/2014 12:01 PM, Romanov Arya wrote:
  On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com  wrote:
  On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
  This might sound really naive, but please bear with me.
  
 force_quiescent_state() used to do a lot of things in the past in addition
 to forcing a quiescent state. (In my reading of the mailing list I found
 state transitions for one).
 
 Now according to the code, what is being done is multiple callers try to
 go up the hierarchy of nodes to see who reaches the root node. The caller
 reaching the root node wins: it acquires the root node lock and gets to
 set rsp->gp_flags!
 
 At each level of the hierarchy we try to acquire fqslock. This is the only
 place which actually uses fqslock.
 
 I guess this was being done to avoid the contention on fqslock, but all we
 are doing here is setting one flag. This way of acquiring locks might
 reduce contention if every update is trying to do some independent work,
 but here all we are doing is setting the same flag with the same value.
 Actually, to reduce contention on rnp_root->lock.
 
 The trick is that the losers at each level of ->fqslock acquisition go
 away.  The winner ends up doing the real work of setting RCU_GP_FLAG_FQS.
  
 We can also remove fqslock completely if we do not need this. Also using
 cmpxchg() to set the value of the flag looks like a good idea to avoid
 taking the root node lock. Thoughts?
 The ->fqslock funnel was needed to avoid lockups on large systems (many
 hundreds or even thousands of CPUs).  Moving grace-period responsibilities
 from softirq to the grace-period kthreads might have reduced contention
 sufficiently to make the ->fqslock funnel unnecessary.  However, given
 that I don't usually have access to such a large system, I will leave it,
 at least for the time being.
 Sounds like a good case study for using the newly introduced MCS-based
 locks (qspinlock.h).
 Waiman, Peter?
  
 The ticket trylock and queue trylock are similar in the sense that they
 both do a read to check if the lock is free first and then use
 cmpxchg() to acquire the lock. So if it is a problem to use the ticket
 spinlock, it will be a problem for the queue spinlock as well. The
 advantage of the queue spinlock in a heavily contended case is that
 the lock waiters, except the one at the head of the queue, will not be
 contending on the lock cacheline.
 Btw, is doing the following a bad idea? It reduces contention on
 rnp_root->lock by using fqslock, which seems to be the lock which needs
 to be taken while forcing a quiescent state:
  
  I don't see any problem with the existing code as long as the number
  of funnel levels is small.

Oh, and to answer the implicit question...  A properly configured 4096-CPU
system will have two funnel levels, with 64 nodes at the leaf level
and a single node at the root level.  If the system is not properly
configured, it will have three funnel levels.  The maximum number of
funnel levels is four, which would handle more than four million CPUs
(sixteen million if properly configured), so we should be good.  ;-)

The larger numbers of levels are intended strictly for testing.  I set
CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
to make sure that I am testing something uglier than what will be running
in production.  A large system should have both of these set to 64,
though this requires also booting with skew_tick=1 as well.
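
[Ed. note: a quick back-of-the-envelope check of the level counts
above; plain C illustrating the fanout arithmetic only, not kernel
code.]

#include <stdio.h>

int main(void)
{
	unsigned long covered = 64;	/* CONFIG_RCU_FANOUT_LEAF = 64 */
	int levels = 1;

	while (covered < 4096) {
		covered *= 64;		/* CONFIG_RCU_FANOUT = 64 */
		levels++;
	}
	/* Prints "2 levels cover 4096 CPUs"; four levels would cover
	 * 64^4 = 16,777,216 CPUs, the "sixteen million" above. */
	printf("%d levels cover %lu CPUs\n", levels, covered);
	return 0;
}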

Thanx, Paul

  You really need to have some performance
  data to show that your approach is better.
 
 What Waiman said.
 
 And you will need that performance data taken on systems with at least
 1024 CPUs, because it was those systems that motivated this code.
 
   Thanx, Paul
 
  diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
  index f1ba773..f5a0e7e 100644
  --- a/kernel/rcu/tree.c
  +++ b/kernel/rcu/tree.c
  @@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state 
  *rsp)
  	unsigned long flags;
  	bool ret;
  	struct rcu_node *rnp;
 -	struct rcu_node *rnp_old = NULL;
 -
 -	/* Funnel through hierarchy to reduce memory contention. */
 -	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
 -	for (; rnp != NULL; rnp = rnp->parent) {
 -		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
 -		      !raw_spin_trylock(&rnp->fqslock);
 -		if (rnp_old != NULL)
 -			raw_spin_unlock(&rnp_old->fqslock);
 -		if (ret) {
 -			ACCESS_ONCE(rsp->n_force_qs_lh)++;
 -			return;
 -		}
 -		rnp_old = rnp;
 +	struct rcu_node *rnp_root = rcu_get_root(rsp);
 

Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Pranith Kumar
On Tue, Jun 17, 2014 at 1:10 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:


 Pranith, Romanov:  You do -not-, repeat -not-, get to shoot from the hip
 with this code.  You absolutely need to understand what it is doing and
 why before you try hacking on it.  Otherwise, all that will happen is
 that you will come up with additional creative ways to break RCU.  Your
 commit log and comments will need to clearly indicate that you understand
 what happens if (say) 4096 CPUs all call force_quiescent_state() at the
 same time.


I do apologize for the noise I am causing.
I totally agree that I need to understand what exactly is happening
before I propose to do anything. I am trying to understand how all
this works by asking questions. I did try looking at why the locks
were there and spent quite some time trying to figure it out looking
at the history, but could not entirely get it. Hence the RFC.

Thank you for patiently considering and replying to my RFCs.

Regards,
-- 
Pranith
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-17 Thread Waiman Long

On 06/17/2014 01:37 PM, Paul E. McKenney wrote:


Oh, and to answer the implicit question...  A properly configured 4096-CPU
system will have two funnel levels, with 64 nodes at the leaf level
and a single node at the root level.  If the system is not properly
configured, it will have three funnel levels.  The maximum number of
funnel levels is four, which would handle more than four million CPUs
(sixteen million if properly configured), so we should be good.  ;-)

The larger numbers of levels are intended strictly for testing.  I set
CONFIG_RCU_FANOUT_LEAF=2 and CONFIG_RCU_FANOUT=2 on a 16-CPU system just
to make sure that I am testing something uglier than what will be running
in production.  A large system should have both of these set to 64,
though this requires also booting with skew_tick=1 as well.

Thanx, Paul


Thanks for the clarification; I haven't looked deeply into the code to
see how many levels there are. I totally understand the impact cacheline
contention can have on system performance. After all, this is what many
of my patches are trying to address.


-Longman
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

2014-06-16 Thread Pranith Kumar
This might sound really naive, but please bear with me.

force_quiescent_state() used to do a lot of things in the past in addition to
forcing a quiescent state. (In my reading of the mailing list I found state
transitions for one). 

Now according to the code, what is being done is multiple callers try to go up
the hierarchy of nodes to see who reaches the root node. The caller reaching
the root node wins: it acquires the root node lock and gets to set rsp->gp_flags!

At each level of the hierarchy we try to acquire fqslock. This is the only place
which actually uses fqslock. 

I guess this was being done to avoid the contention on fqslock, but all we are
doing here is setting one flag. This way of acquiring locks might reduce
contention if every update is trying to do some independent work, but here all
we are doing is setting the same flag with the same value.

We can also remove fqslock completely if we do not need this. Also using
cmpxchg() to set the value of the flag looks like a good idea to avoid taking
the root node lock. Thoughts?

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 kernel/rcu/tree.c | 35 +--
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f1ba773..9a46f32 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2399,36 +2399,27 @@ static void force_qs_rnp(struct rcu_state *rsp,
 static void force_quiescent_state(struct rcu_state *rsp)
 {
 	unsigned long flags;
-	bool ret;
-	struct rcu_node *rnp;
-	struct rcu_node *rnp_old = NULL;
-
-	/* Funnel through hierarchy to reduce memory contention. */
-	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
-	for (; rnp != NULL; rnp = rnp->parent) {
-		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
-		      !raw_spin_trylock(&rnp->fqslock);
-		if (rnp_old != NULL)
-			raw_spin_unlock(&rnp_old->fqslock);
-		if (ret) {
-			ACCESS_ONCE(rsp->n_force_qs_lh)++;
-			return;
-		}
-		rnp_old = rnp;
+	struct rcu_node *rnp_root = rcu_get_root(rsp);
+
+	/* early test to see if someone already forced a quiescent state
+	 */
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
+		ACCESS_ONCE(rsp->n_force_qs_lh)++;
+		return;  /* Someone beat us to it. */
 	}
-	/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
 
 	/* Reached the root of the rcu_node tree, acquire lock. */
-	raw_spin_lock_irqsave(&rnp_old->lock, flags);
+	raw_spin_lock_irqsave(&rnp_root->lock, flags);
 	smp_mb__after_unlock_lock();
-	raw_spin_unlock(&rnp_old->fqslock);
 	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 		ACCESS_ONCE(rsp->n_force_qs_lh)++;
-		raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
-		return;  /* Someone beat us to it. */
+		raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
+		return;  /* Someone actually beat us to it. */
 	}
+
+	/* can we use cmpxchg instead of the above lock? */
 	ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
-	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
+	raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
 	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

