Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-10 Thread Paul E. McKenney
On Mon, Sep 08, 2014 at 01:20:30PM -0500, Christoph Lameter wrote:
> 
> On Thu, 4 Sep 2014, Paul E. McKenney wrote:
> 
> > The lglocks use should be of particular interest to you.  See above to
> > find the others.
> 
> Well, the others seem to be in drivers.

OK...

> > Understood, but no hasty changes to that part of RCU.
> 
> At our age I think we have learned to be very deliberate and slow.

And rcutorture needs some upgrades for this case.

> > rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data
> 
> Reviewed-by: Christoph Lameter 

Thank you, recorded.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-08 Thread Christoph Lameter

On Thu, 4 Sep 2014, Paul E. McKenney wrote:

> The lglocks use should be of particular interest to you.  See above to
> find the others.

Well, the others seem to be in drivers.

> Understood, but no hasty changes to that part of RCU.

At our age I think we have learned to be very deliberate and slow.

> rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

Reviewed-by: Christoph Lameter 


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-05 Thread Christoph Lameter

On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> As noted earlier, in theory, the atomic operations could be nonatomic,

Well, as demonstrated by the patch earlier: the atomic operations are only
used on the local cpu. There is no synchronization in that sense needed
between processors, because there is never a remote atomic operation.

> > The code looks fragile and bound to have issues in the future given the
> > barriers/atomics etc. It's going to be cleaner without that.
>
> What exactly looks fragile about it, and exactly what issues do you
> anticipate?

I am concerned about the creation of unnecessary synchronization issues. In
this case we have already discovered that the atomic operations on per
cpu variables are only used to modify the contents from the local cpu.

This means at minimum we can give up on the use of atomics and keep the
barriers to enforce visibility.

> > And we are right now focusing on the simplest case. The atomics scheme is
> > used multiple times in the RCU subsystem. There is more weird looking code
> > there like atomic_add using zero etc.
>
> The atomic_add_return(0,...) reads the value out, forcing full ordering.
> Again, in theory, this could be a volatile read with explicit memory-barrier
> instructions on either side, but it is not clear which wins.  (Keep in
> mind that almost all of the atomic_add_return(0,...) calls for a given
> dynticks counter are executed from a single kthread.)
>
> If systems continue to add CPUs like they have over the past decade or
> so, I expect that you will be seeing more code like RCU's, not less.

We have other code like this in multiple subsystems, but it does not have
the barrier issues: per cpu variables are always updated without the use
of atomics, and the inspection of the per cpu state from remote cpus works
just fine without them.

I'd like to simplify this as much as possible and make it consistent
throughout the kernel.





Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-05 Thread Christoph Lameter
On Thu, 4 Sep 2014, Paul E. McKenney wrote:

> So in short, you don't see the potential for this use case actually
> breaking anything, correct?

In general it's a performance impact, but depending on how this_cpu ops
are implemented on a particular platform there may also be correctness
issues, since the assumption there is that no remote writes occur.

There is a slight issue in the RCU code. It uses DEFINE_PER_CPU for its
per cpu data, which is meant for true per cpu data whose
cachelines are not evicted. False sharing with RCU structures that are
remotely handled can cause issues for code that expects its per cpu data
to be uncontended. I think it would be better to go to

DEFINE_PER_CPU_SHARED_ALIGNED

for your definitions in particular, since there are still code pieces where
we are not sure whether there are remote write accesses or not. This will give
you separate cachelines, so that the false sharing effect does not
occur.

> Besides RCU is not the only place where atomics are used on per-CPU
> variables.  For one thing, there are a number of per-CPU spinlocks in use
> in various places throughout the kernel.  For another thing, there is also
> a large number of per-CPU structures (not pointers to structures, actual
> structures), and I bet that a fair number of these feature cross-CPU
> writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.

Would be interested to see where that occurs.

> > the barrier issues, per cpu variables are updated always without the use
> > of atomics and the inspection of the per cpu state from remote cpus works
> > just fine also without them.
>
> Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
> I expect that a fair number of the per-CPU structures involve cross-CPU
> synchronization.

Where are those per cpu spinlocks? Cross cpu synchronization can be done
in a number of ways that often allow avoiding remote writes to percpu
areas.


> It already is consistent, just not in the manner that you want.  ;-)
>
> But -why- do you want these restrictions?  How does it help anything?

1. It allows potentially faster operations that can assume
no remote writes occur. The design of deterministic low-latency
code often needs some assurance that another cpu is not simply
kicking the cacheline out, which would then require an off-chip memory access
and remote cacheline eviction once the cacheline is touched again.

2. The use of atomics without a rationale is something that I frown upon,
and it seems very likely that we have such a case here. People
assume that the use of atomics has some reason, like a remote access
or contention, which is not occurring here.

3. this_cpu operations create instructions with reduced latency due to the
lack of a lock prefix. Remote operations at the same time could create
inconsistent results.

See also

linux/Documentation/this_cpu_ops.txt



Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-04 Thread Paul E. McKenney
On Thu, Sep 04, 2014 at 01:19:29PM -0500, Christoph Lameter wrote:
> On Thu, 4 Sep 2014, Paul E. McKenney wrote:
> 
> > So in short, you don't see the potential for this use case actually
> > breaking anything, correct?
> 
> In general it's a performance impact, but depending on how this_cpu ops
> are implemented on a particular platform there may also be correctness
> issues, since the assumption there is that no remote writes occur.

This assumption of yours does not seem to be shared by a fair amount
of the code using per-CPU variables.

> There is a slight issue in the RCU code. It uses DEFINE_PER_CPU for its
> per cpu data, which is meant for true per cpu data whose
> cachelines are not evicted. False sharing with RCU structures that are
> remotely handled can cause issues for code that expects its per cpu data
> to be uncontended. I think it would be better to go to
> 
> DEFINE_PER_CPU_SHARED_ALIGNED
> 
> for your definitions in particular, since there are still code pieces where
> we are not sure whether there are remote write accesses or not. This will give
> you separate cachelines, so that the false sharing effect does not
> occur.

Fair enough, I have queued a commit that makes this change for the
rcu_data per-CPU structures with your Reported-by.  (See below for patch.)

> > Besides RCU is not the only place where atomics are used on per-CPU
> > variables.  For one thing, there are a number of per-CPU spinlocks in use
> > in various places throughout the kernel.  For another thing, there is also
> > a large number of per-CPU structures (not pointers to structures, actual
> > structures), and I bet that a fair number of these feature cross-CPU
> > writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.
> 
> Would be interested to see where that occurs.

Something like the following will find them for you:

git grep "DEFINE_PER_CPU.*spinlock"
git grep "DEFINE_PER_CPU.*(struct[^*]*$"

> > > the barrier issues, per cpu variables are updated always without the use
> > > of atomics and the inspection of the per cpu state from remote cpus works
> > > just fine also without them.
> >
> > Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
> > I expect that a fair number of the per-CPU structures involve cross-CPU
> > synchronization.
> 
> Where are those per cpu spinlocks? Cross cpu synchronization can be done
> in a number of ways that often allow avoiding remote writes to percpu
> areas.

The lglocks use should be of particular interest to you.  See above to
find the others.

> > It already is consistent, just not in the manner that you want.  ;-)
> >
> > But -why- do you want these restrictions?  How does it help anything?
> 
> 1. It allows potentially faster operations that can assume
> no remote writes occur. The design of deterministic low-latency
> code often needs some assurance that another cpu is not simply
> kicking the cacheline out, which would then require an off-chip memory access
> and remote cacheline eviction once the cacheline is touched again.

OK, DEFINE_PER_CPU_SHARED_ALIGNED() should avoid the false sharing.

> 2. The use of atomics without a rationale is something that I frown upon,
> and it seems very likely that we have such a case here. People
> assume that the use of atomics has some reason, like a remote access
> or contention, which is not occurring here.

Understood, but no hasty changes to that part of RCU.

> 3. this_cpu operations create instructions with reduced latency due to the
> lack of lock prefix. Remote operations at the same time could create
> inconsistent results.
> 
> See also
> 
> linux/Documentation/this_cpu_ops.txt

Yep.  If you use atomic operations, you need to be very careful about
mixing in non-atomic operations, which in this case includes the
per-CPU atomic operations.  Normally the atomic_t and atomic_long_t
types make it difficult to mess up.  Of course, you can work around
this type checking, and you can also use xchg() and cmpxchg(), which
don't provide this type-checking misuse-avoidance help.

Just as it has always been.

Thanx, Paul



rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU.  This situation can result in false
sharing with per-CPU variables that have the misfortune of being allocated
adjacent to rcu_data in memory.  This commit therefore changes the
DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid
this false sharing.

Reported-by: Christoph Lameter 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1b7eba915dbe..e4c6d604694e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -105,7 +105,7 @@ struct rcu_state sname##_state = { \
.name = RCU_STATE_NAME(sname), \

Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-04 Thread Paul E. McKenney
On Thu, Sep 04, 2014 at 10:04:17AM -0500, Christoph Lameter wrote:
> 
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
> 
> > As noted earlier, in theory, the atomic operations could be nonatomic,
> 
> Well, as demonstrated by the patch earlier: the atomic operations are only
> used on the local cpu. There is no synchronization in that sense needed
> between processors, because there is never a remote atomic operation.

Easy to say!  ;-)

> > > The code looks fragile and bound to have issues in the future given the
> > > barriers/atomics etc. It's going to be cleaner without that.
> >
> > What exactly looks fragile about it, and exactly what issues do you
> > anticipate?
> 
> I am concerned about the creation of unnecessary synchronization issues. In
> this case we have already discovered that the atomic operations on per
> cpu variables are only used to modify the contents from the local cpu.
> 
> This means at minimum we can give up on the use of atomics and keep the
> barriers to enforce visibility.

Sounds like a desire for a potential optimization rather than any
sort of fragility.  And in this case, it is not clear that your desire
of replacing a value-returning atomic operation with a normal memory
reference and a pair of memory barriers actually makes anything go faster.

So in short, you don't see the potential for this use case actually
breaking anything, correct?

Besides RCU is not the only place where atomics are used on per-CPU
variables.  For one thing, there are a number of per-CPU spinlocks in use
in various places throughout the kernel.  For another thing, there is also
a large number of per-CPU structures (not pointers to structures, actual
structures), and I bet that a fair number of these feature cross-CPU
writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.

> > > And we are right now focusing on the simplest case. The atomics scheme is
> > > used multiple times in the RCU subsystem. There is more weird looking code
> > > there like atomic_add using zero etc.
> >
> > The atomic_add_return(0,...) reads the value out, forcing full ordering.
> > Again, in theory, this could be a volatile read with explicit memory-barrier
> > instructions on either side, but it is not clear which wins.  (Keep in
> > mind that almost all of the atomic_add_return(0,...) calls for a given
> > dynticks counter are executed from a single kthread.)
> >
> > If systems continue to add CPUs like they have over the past decade or
> > so, I expect that you will be seeing more code like RCU's, not less.
> 
> We have other code like this in multiple subsystems, but it does not have
> the barrier issues: per cpu variables are always updated without the use
> of atomics, and the inspection of the per cpu state from remote cpus works
> just fine without them.

Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
I expect that a fair number of the per-CPU structures involve cross-CPU
synchronization.

> I'd like to simplify this as much as possible and make it consistent
> throughout the kernel.

It already is consistent, just not in the manner that you want.  ;-)

But -why- do you want these restrictions?  How does it help anything?

Thanx, Paul



Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Paul E. McKenney
On Wed, Sep 03, 2014 at 12:43:15PM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
> 
> > > Well, a shared data structure would be cleaner in general but there are
> > > certainly other approaches.
> >
> > Per-CPU variables -are- a shared data structure.
> 
> No, the intent is for them to be used by the particular cpu, and therefore
> there is limited support for sharing. It's not a shared data structure in
> the classic sense.

It really is shared.  Normal memory at various addresses and all that.

So what is your real concern here, anyway?

> The code in the rcu subsystem operates like other percpu code: There is a
> modification of local variables by the current processor and other
> processors inspect the state once in a while. The other percpu code does
> not need atomics and barriers. RCU for some reason (that is not clear to
> me) does.

Like I have said several times before in multiple email threads, including
this one, RCU needs to reliably detect quiescent states.  To do this, it
needs to know that the accesses prior to the quiescent state will reliably
be seen by all as having happened before the quiescent state started.
Similarly, it needs to know that the accesses after the quiescent state
will reliably be seen by all as having happened after the quiescent
state ended.  The memory barriers are required to enforce this ordering.
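In user-space C11 terms, that ordering requirement can be sketched as follows. This is an illustration only, not RCU's implementation: release/acquire ordering stands in for the kernel's smp_mb__{before,after}_atomic(), and all names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch: accesses made before a quiescent state must be visible to
 * anyone who later observes that quiescent state. */
int protected_data;       /* written before going quiescent */
_Atomic int qs_counter;   /* even value => quiescent, odd => busy */

void cpu_enter_quiescent(void)
{
    protected_data = 42;  /* pre-quiescent access */
    /* Release ordering: the data store above cannot be reordered
     * after the counter update below. */
    atomic_fetch_add_explicit(&qs_counter, 1, memory_order_release);
}

int gp_thread_sees_quiescent(void)
{
    /* Acquire ordering: if we observe the even value, we are also
     * guaranteed to observe protected_data's pre-quiescent store. */
    int v = atomic_load_explicit(&qs_counter, memory_order_acquire);
    return (v & 1) == 0;
}
```

The same guarantee works in both directions, which is why the kernel code pairs barriers on both sides of the counter updates.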

As noted earlier, in theory, the atomic operations could be nonatomic,
but on x86 this would require adding explicit memory-barrier instructions.
Not clear to me which would be faster.

> > > But let's focus on the dynticks_idle case we are discussing here rather
> > > than tackle the more difficult other atomics. What is checked in the loop
> > > over the remote cpus is the dynticks_idle value plus
> > > dynticks_idle_jiffies. So it seems that memory ordering is only used to
> > > ensure that the jiffies are seen correctly.
> > >
> > > In that case both the dynticks_idle and dynticks_idle_jiffies could be
> > > placed in one 64 bit value. If this is stored and retrieved as one then
> > > there is no issue with ordering anymore and the barriers would no longer
> > > be needed.
> >
> > If there was an upper bound on the propagation of values through a system,
> > I could buy this.
> 
> What is different in propagation speeds? The atomic read on the function
> that checks for the quiescent period having passed is a regular read
> anyways. The atomic_inc makes the cacheline propagate faster through the
> system? I understand that it evicts the cachelines from other processors
> caches (containing other percpu data by the way). That is the desired
> effect?

The smp_mb__{before,after}_atomic() guarantee the ordering.

> > But Mike Galbraith checked the overhead of ->dynticks_idle and found
> > it to be too small to measure.  So doesn't seem to be a problem worth
> > extraordinary efforts, especially given that many systems can avoid
> > it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.
> 
> The code looks fragile and bound to have issues in the future given the
> barriers/atomics etc. It's going to be cleaner without that.

What exactly looks fragile about it, and exactly what issues do you
anticipate?

> And we are right now focusing on the simplest case. The atomics scheme is
> used multiple times in the RCU subsystem. There is more weird looking code
> there like atomic_add using zero etc.

The atomic_add_return(0,...) reads the value out, forcing full ordering.
Again, in theory, this could be a volatile read with explicit memory-barrier
instructions on either side, but it is not clear which wins.  (Keep in
mind that almost all of the atomic_add_return(0,...) calls for a given
dynticks counter are executed from a single kthread.)
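The idiom can be sketched in user-space C11 as follows (an illustration under assumed names, not the kernel implementation): a fetch-add of zero leaves the counter unchanged, but as a sequentially consistent read-modify-write it behaves as a fully ordered read with no separate barrier instructions.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_int demo_dynticks;  /* hypothetical stand-in for a ->dynticks counter */

/* Fully ordered read via atomic add of zero: returns the pre-add
 * value, and the add of 0 modifies nothing. */
int dynticks_ordered_read(atomic_int *v)
{
    return atomic_fetch_add_explicit(v, 0, memory_order_seq_cst);
}
```

On x86 this compiles to a locked read-modify-write, which is the trade-off being weighed against a plain load bracketed by explicit fences.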

If systems continue to add CPUs like they have over the past decade or
so, I expect that you will be seeing more code like RCU's, not less.

Thanx, Paul



Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Christoph Lameter
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> > Well, a shared data structure would be cleaner in general but there are
> > certainly other approaches.
>
> Per-CPU variables -are- a shared data structure.

No, the intent is for them to be for the particular cpu and therefore
there is limited support for sharing. It's not a shared data structure in
the classic sense.

The code in the rcu subsystem operates like other percpu code: There is a
modification of local variables by the current processor and other
processors inspect the state once in a while. The other percpu code does
not need atomics and barriers. RCU for some reason (that is not clear to
me) does.

> > But let's focus on the dynticks_idle case we are discussing here rather
> > than tackle the more difficult other atomics. What is checked in the loop
> > over the remote cpus is the dynticks_idle value plus
> > dynticks_idle_jiffies. So it seems that memory ordering is only used to
> > ensure that the jiffies are seen correctly.
> >
> > In that case both the dynticks_idle and dynticks_idle_jiffies could be
> > placed in one 64 bit value. If this is stored and retrieved as one then
> > there is no issue with ordering anymore and the barriers would no longer
> > be needed.
>
> If there was an upper bound on the propagation of values through a system,
> I could buy this.

What is different in propagation speeds? The atomic read on the function
that checks for the quiescent period having passed is a regular read
anyways. The atomic_inc makes the cacheline propagate faster through the
system? I understand that it evicts the cachelines from other processors
caches (containing other percpu data by the way). That is the desired
effect?

> But Mike Galbraith checked the overhead of ->dynticks_idle and found
> it to be too small to measure.  So doesn't seem to be a problem worth
> extraordinary efforts, especially given that many systems can avoid
> it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.

The code looks fragile and bound to have issues in the future given the
barriers/atomics etc. It's going to be cleaner without that.

And we are right now focusing on the simplest case. The atomics scheme is
used multiple times in the RCU subsystem. There is more weird looking code
there like atomic_add using zero etc.






Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Paul E. McKenney
On Wed, Sep 03, 2014 at 10:51:13AM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
> 
> > You would prefer that I instead allocated an NR_CPUS-sized array?
> 
> Well, a shared data structure would be cleaner in general but there are
> certainly other approaches.

Per-CPU variables -are- a shared data structure.

> But let's focus on the dynticks_idle case we are discussing here rather
> than tackle the more difficult other atomics. What is checked in the loop
> over the remote cpus is the dynticks_idle value plus
> dynticks_idle_jiffies. So it seems that memory ordering is only used to
> ensure that the jiffies are seen correctly.
> 
> In that case both the dynticks_idle and dynticks_idle_jiffies could be
> placed in one 64 bit value. If this is stored and retrieved as one then
> there is no issue with ordering anymore and the barriers would no longer
> be needed.

If there was an upper bound on the propagation of values through a system,
I could buy this.

But Mike Galbraith checked the overhead of ->dynticks_idle and found
it to be too small to measure.  So doesn't seem to be a problem worth
extraordinary efforts, especially given that many systems can avoid
it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.

Thanx, Paul



Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Christoph Lameter
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> You would prefer that I instead allocated an NR_CPUS-sized array?

Well, a shared data structure would be cleaner in general but there are
certainly other approaches.

But let's focus on the dynticks_idle case we are discussing here rather
than tackle the more difficult other atomics. What is checked in the loop
over the remote cpus is the dynticks_idle value plus
dynticks_idle_jiffies. So it seems that memory ordering is only used to
ensure that the jiffies are seen correctly.

In that case both the dynticks_idle and dynticks_idle_jiffies could be
placed in one 64 bit value. If this is stored and retrieved as one then
there is no issue with ordering anymore and the barriers would no longer
be needed.
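A user-space sketch of that proposal (hypothetical names throughout): pack dynticks_idle into the low 32 bits and dynticks_idle_jiffies into the high 32 bits, so one 64-bit store/load keeps the pair mutually consistent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef _Atomic uint64_t idle_word_t;

/* Store both fields as a single 64-bit value: a reader can never see
 * a dynticks_idle from one update paired with jiffies from another. */
void idle_word_set(idle_word_t *w, uint32_t dynticks_idle, uint32_t jiffies)
{
    atomic_store_explicit(w, ((uint64_t)jiffies << 32) | dynticks_idle,
                          memory_order_relaxed);
}

uint32_t idle_word_dynticks(idle_word_t *w)
{
    return (uint32_t)atomic_load_explicit(w, memory_order_relaxed);
}

uint32_t idle_word_jiffies(idle_word_t *w)
{
    return (uint32_t)(atomic_load_explicit(w, memory_order_relaxed) >> 32);
}
```

Note this only guarantees that the two fields are seen as a consistent pair; it does not by itself order the pair against other memory accesses, which is the point of contention in this thread.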


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Paul E. McKenney
On Wed, Sep 03, 2014 at 09:10:24AM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> 
> > On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> > >
> > > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > > for quite some time.
> > > >
> > > > But I have to ask...  On x86, is a pair of mfence instructions really
> > > > cheaper than an atomic increment?
> > >
> > > Not sure why you would need an mfence instruction?
> >
> > Because otherwise RCU can break.  As soon as the grace-period machinery
> > sees that the value of this variable is even, it assumes a quiescent
> > state.  If there are no memory barriers, the non-quiescent code might
> > not have completed executing, and your kernel's actuarial statistics
> > become sub-optimal.
> 
> Synchronization using per cpu variables is bound to be problematic since
> they are simply not made for that. The per cpu variable usually can change
> without notice to the other cpu since typically per cpu processing is
> ongoing. The improved performance of per cpu instructions is
> possible only because we exploit the fact that there is no need for
> synchronization.

Christoph, per-CPU variables are memory.  If I do the correct operations
on them, they work just like any other memory.  And yes, I typically
cannot use the per-CPU operations if I need coherent results visible to
all CPUs (but see your statistical-counters example below).  This is of
course exactly why I use atomic operations and memory barriers on
the dynticks counters.

You would prefer that I instead allocated an NR_CPUS-sized array?

> Kernel statistics *are* suboptimal for that very reason because they
> typically sum up individual counters from multiple processors without
> regard to complete accuracy. The manipulation of the VM counters is very
> low overhead due to the lack of concern for synchronization. This is a
> tradeoff vs. performance. We actually can tune the fuzziness of
> statistics in the VM which allows us to control the overhead generated by
> the need for more or less accurate statistics.

Yes, statistics is one of the cross-CPU uses of per-CPU variables where
you can get away without tight synchronization.
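That loose statistics pattern looks roughly like this in user-space C (illustrative only; in the kernel the per-CPU increment would be this_cpu_inc or similar, and these names are invented):

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS_DEMO 4  /* illustrative stand-in for nr_cpu_ids */

/* Each CPU bumps only its own slot, with no ordering at all. */
_Atomic long demo_events[NR_CPUS_DEMO];

void demo_count_event(int cpu)
{
    atomic_fetch_add_explicit(&demo_events[cpu], 1, memory_order_relaxed);
}

/* A reader sums every slot and accepts a fuzzy snapshot: the total may
 * be slightly stale, but each per-CPU read is never torn. */
long demo_sum_events(void)
{
    long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
        sum += atomic_load_explicit(&demo_events[cpu], memory_order_relaxed);
    return sum;
}
```

The contrast with the dynticks case is that here nothing downstream depends on the exact moment a counter changed, so no barriers are needed.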

> Memory barriers ensure that the code has completed executing? I think what
> is meant is that they ensure that all modifications to cachelines before the
> change of state are visible and the other processor does not have stale
> cachelines around?

No, memory barriers simply enforce ordering, and ordering is all
that RCU's dyntick-idle code relies on.  In other words, if the RCU
grace-period kthread sees a value indicating that a CPU is idle, that
kthread needs assurance that all the memory operations that took place
before that CPU went idle have completed.  All that is required for this
is ordering:  I saw that the ->dynticks counter had an even value, and
I know that it was preceded by a memory barrier; therefore, all subsequent
code after a memory barrier will see the effects of all pre-idle code.

> If the state variable is odd, how does the other processor see a state
> change to even before processing is complete if the state is updated only
> at the end of processing?

I am having some difficulty parsing that question.

However, suppose that the grace-period kthread sees ->dynticks with an
odd value, say three.  Suppose that this kthread later sees another
odd value, say eleven.  Then because of the atomic operations and memory
barriers, the kthread can be sure that the CPU corresponding to that
->dynticks has passed through an idle state since the time it saw the
value three, and therefore that that CPU has passed through a quiescent
state since the start of the grace period.
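That comparison of successive samples can be sketched as follows (hypothetical helper names; the real logic lives in RCU's dyntick-idle code): even means idle, and any change between two samples implies at least one idle entry/exit in between.

```c
#include <assert.h>
#include <stdatomic.h>

int snap_is_idle(int snap)
{
    return (snap & 1) == 0;  /* even counter value => CPU was idle */
}

/* Has this CPU passed through a quiescent state since `snap` was
 * taken at grace-period start?  Yes if it was idle at either sample,
 * or if the counter moved at all (e.g. 3 -> 11 passed through idle). */
int cpu_in_quiescent_state(_Atomic int *dynticks, int snap)
{
    int cur = atomic_load_explicit(dynticks, memory_order_seq_cst);

    return snap_is_idle(snap) || snap_is_idle(cur) || cur != snap;
}
```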

Similarly, if the grace-period kthread sees ->dynticks with an even value,
it knows that after a subsequent memory barrier, all the pre-idle effects
will be visible, as required.

As a diagram:

CPU 0   CPU 1

/* Access some RCU-protected struct. */
smp_mb__before_atomic()
atomic_inc()->even value
/* Now idle. */
atomic_add_return(0, ...)-> odd value
/* implied memory barrier. */
/* post-GP changes won't interfere */
/*   with pre-idle accesses. */

In other words, if CPU 0 accessed some RCU-protected memory before going
idle, that memory is guaranteed not to be freed until after those pre-idle
accesses have completed.

Of course, it also needs to work the other way around:

CPU 0   CPU 1

/* Remove some RCU-protected struct. */
/* implied memory barrier. */
atomic_add_return(0, ...)-> even value
/* Was idle. */
atomic_inc()->odd value


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Christoph Lameter
On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> >
> > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > for quite some time.
> > >
> > > But I have to ask...  On x86, is a pair of mfence instructions really
> > > cheaper than an atomic increment?
> >
> > Not sure why you would need an mfence instruction?
>
> Because otherwise RCU can break.  As soon as the grace-period machinery
> sees that the value of this variable is even, it assumes a quiescent
> state.  If there are no memory barriers, the non-quiescent code might
> not have completed executing, and your kernel's actuarial statistics
> become sub-optimal.

Synchronization using per cpu variables is bound to be problematic since
they are simply not made for that. The per cpu variable usually can change
without notice to the other cpu since typically per cpu processing is
ongoing. The improved performance of per cpu instructions is
possible only because we exploit the fact that there is no need for
synchronization.

Kernel statistics *are* suboptimal for that very reason because they
typically sum up individual counters from multiple processors without
regard to complete accuracy. The manipulation of the VM counters is very
low overhead due to the lack of concern for synchronization. This is a
tradeoff vs. performance. We actually can tune the fuzziness of
statistics in the VM which allows us to control the overhead generated by
the need for more or less accurate statistics.

Memory barriers ensure that the code has completed executing? I think what
is meant is that they ensure that all modifications to cachelines before the
change of state are visible and the other processor does not have stale
cachelines around?

If the state variable is odd, how does the other processor see a state
change to even before processing is complete if the state is updated only
at the end of processing?



Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Christoph Lameter
On Tue, 2 Sep 2014, Paul E. McKenney wrote:

 On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
  On Tue, 2 Sep 2014, Paul E. McKenney wrote:
 
   Yep, these two have been on my when I am feeling insanely gutsy list
   for quite some time.
  
   But I have to ask...  On x86, is a pair of mfence instructions really
   cheaper than an atomic increment?
 
  Not sure why you would need an mfence instruction?

 Because otherwise RCU can break.  As soon as the grace-period machinery
 sees that the value of this variable is even, it assumes a quiescent
 state.  If there are no memory barriers, the non-quiescent code might
 not have completed executing, and your kernel's actuarial statistics
 become sub-optimal.

Synchronization using per cpu variables is bound to be problematic since
they are simply not made for that. The per cpu variable usually can change
without notice to the other cpu since typically per cpu processing is
ongoing. The improvided performance of per cpu instructions is
possible only because we exploit the fact that there is no need for
synchronization.

Kernel statistics *are* suboptimal for that very reason because they
typically sum up individual counters from multiple processors without
regard to complete accuracy. The manipulation of the VM counters is very
low overhead due to the lack of concern for synchronization. This is a
tradeoff vs. performance. We actually can tune the the fuzziness of
statistics in the VM which allows us to control the overhead generated by
the need for more or less accurate statistics.

Memory barriers ensure that the code has completed executing? I think what
is meant is that they ensure that all modifications to cachelines before the
change of state are visible and the other processor does not have stale
cachelines around?

If the state variable is odd how does the other processor see a state
change to even before processing is complete if the state is updated only
at the end of processing?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Paul E. McKenney
On Wed, Sep 03, 2014 at 09:10:24AM -0500, Christoph Lameter wrote:
 On Tue, 2 Sep 2014, Paul E. McKenney wrote:
 
  On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
   On Tue, 2 Sep 2014, Paul E. McKenney wrote:
  
Yep, these two have been on my when I am feeling insanely gutsy list
for quite some time.
   
But I have to ask...  On x86, is a pair of mfence instructions really
cheaper than an atomic increment?
  
   Not sure why you would need an mfence instruction?
 
  Because otherwise RCU can break.  As soon as the grace-period machinery
  sees that the value of this variable is even, it assumes a quiescent
  state.  If there are no memory barriers, the non-quiescent code might
  not have completed executing, and your kernel's actuarial statistics
  become sub-optimal.
 
 Synchronization using per cpu variables is bound to be problematic since
 they are simply not made for that. The per cpu variable usually can change
 without notice to the other cpu since typically per cpu processing is
 ongoing. The improvided performance of per cpu instructions is
 possible only because we exploit the fact that there is no need for
 synchronization.

Christoph, per-CPU variables are memory.  If I do the correct operations
on them, they work just like any other memory.  And yes, I typically
cannot use the per-CPU operations if I need coherent results visible to
all CPUs (but see your statistical-counters example below).  This is of
course exactly why I use atomic operations and memory barriers on
the dynticks counters.

You would prefer that I instead allocated an NR_CPUS-sized array?

 Kernel statistics *are* suboptimal for that very reason because they
 typically sum up individual counters from multiple processors without
 regard to complete accuracy. The manipulation of the VM counters is very
 low overhead due to the lack of concern for synchronization. This is a
 tradeoff vs. performance. We actually can tune the the fuzziness of
 statistics in the VM which allows us to control the overhead generated by
 the need for more or less accurate statistics.

Yes, statistics is one of the cross-CPU uses of per-CPU variables where
you can get away without tight synchronization.

 Memory barriers ensure that the code has completed executing? I think what
 is meant is that they ensure that all modifications to cachelines before the
 change of state are visible and the other processor does not have stale
 cachelines around?

No, memory barriers simply enforce ordering, and ordering is all
that RCU's dyntick-idle code relies on.  In other words, if the RCU
grace-period kthread sees a value indicating that a CPU is idle, that
kthread needs assurance that all the memory operations that took place
before that CPU went idle have completed.  All that is required for this
is ordering:  I saw that the -dynticks counter had an even value, and
I know that it was preceded a memory barrier, therefore, all subsequent
code after a memory barrier will see the effects of all pre-idle code.

 If the state variable is odd how does the other processor see a state
 change to even before processing is complete if the state is updated only
 at the end of processing?

I am having some difficulty parsing that question.

However, suppose that the grace-period kthread sees -dynticks with an
odd value, say three.  Supppose that this kthread later sees another
odd value, say eleven.  Then because of the atomic operations and memory
barriers, the kthread can be sure that the CPU corresponding to that
-dynticks has passed through an idle state since the time it saw the
value three, and therefore that that CPU has passed through a quiescent
state since the start of the grace period.

Similarly, if the grace-period kthread sees -dynticks with an even value,
it knows that after a subsequent memory barrier, all the pre-idle effects
will be visible, as required.

As a diagram:

CPU 0   CPU 1

/* Access some RCU-protected struct. */
smp_mb__before_atomic()
atomic_inc()-even value
/* Now idle. */
atomic_add_return(0, ...)- odd value
/* implied memory barrier. */
/* post-GP changes won't interfere */
/*   with pre-idle accesses. */

In other words, if CPU 0 accessed some RCU-protected memory before going
idle, that memory is guaranteed not to be freed until after those pre-idle
accesses have completed.

Of course, it also needs to work the other way around:

CPU 0   CPU 1

/* Remove some RCU-protected struct. */
/* implied memory barrier. */
atomic_add_return(0, ...)- even value
/* Was idle. */
atomic_inc()-odd value

Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Christoph Lameter
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

 You would prefer that I instead allocated an NR_CPUS-sized array?

Well, a shared data structure would be cleaner in general but there are
certainly other approaches.

But lets focus on the dynticks_idle case we are discussing here rather
than tackle the more difficult other atomics. What is checked in the loop
over the remote cpus is the dynticks_idle value plus
dynticks_idle_jiffies. So it seems that memory ordering is only used to
ensure that the jiffies are seen correctly.

In that case both the dynticks_idle and dynticks_idle_jiffies could be
placed in one 64 bit value. If this is stored and retrieved as one then
there is no issue with ordering anymore and the barriers would no longer
be needed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Paul E. McKenney
On Wed, Sep 03, 2014 at 10:51:13AM -0500, Christoph Lameter wrote:
 On Wed, 3 Sep 2014, Paul E. McKenney wrote:
 
  You would prefer that I instead allocated an NR_CPUS-sized array?
 
 Well, a shared data structure would be cleaner in general but there are
 certainly other approaches.

Per-CPU variables -are- a shared data structure.

 But lets focus on the dynticks_idle case we are discussing here rather
 than tackle the more difficult other atomics. What is checked in the loop
 over the remote cpus is the dynticks_idle value plus
 dynticks_idle_jiffies. So it seems that memory ordering is only used to
 ensure that the jiffies are seen correctly.
 
 In that case both the dynticks_idle and dynticks_idle_jiffies could be
 placed in one 64 bit value. If this is stored and retrieved as one then
 there is no issue with ordering anymore and the barriers would no longer
 be needed.

If there was an upper bound on the propagation of values through a system,
I could buy this.

But Mike Galbraith checked the overhead of -dynticks_idle and found
it to be too small to measure.  So doesn't seem to be a problem worth
extraordinary efforts, especially given that many systems can avoid
it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Christoph Lameter
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

  Well, a shared data structure would be cleaner in general but there are
  certainly other approaches.

 Per-CPU variables -are- a shared data structure.

No the intent is for them to be for the particular cpu and therefore
there is limited support for sharing. Its not a shared data structure in
the classic sense.

The code in the rcu subsystem operates like other percpu code: There is a
modification of local variables by the current processor and other
processors inspect the state once in awhile. The other percpu code does
not need atomics and barriers. RCU for some reason (that is not clear to
me) does.

  But lets focus on the dynticks_idle case we are discussing here rather
  than tackle the more difficult other atomics. What is checked in the loop
  over the remote cpus is the dynticks_idle value plus
  dynticks_idle_jiffies. So it seems that memory ordering is only used to
  ensure that the jiffies are seen correctly.
 
  In that case both the dynticks_idle and dynticks_idle_jiffies could be
  placed in one 64 bit value. If this is stored and retrieved as one then
  there is no issue with ordering anymore and the barriers would no longer
  be needed.

 If there was an upper bound on the propagation of values through a system,
 I could buy this.

What is different in propagation speeds? The atomic read on the function
that checks for the quiescent period having passed is a regular read
anyways. The atomic_inc makes the cacheline propagate faster through the
system? I understand that it evicts the cachelines from other processors
caches (containing other percpu data by the way). That is the desired
effect?

 But Mike Galbraith checked the overhead of -dynticks_idle and found
 it to be too small to measure.  So doesn't seem to be a problem worth
 extraordinary efforts, especially given that many systems can avoid
 it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.

The code looks fragile and bound to have issues in the future given the
barriers/atomics etc. Its going to be cleaner without that.

And we are right now focusing on the simplest case. The atomics scheme is
used multiple times in the RCU subsystem. There is more weird looking code
there like atomic_add using zero etc.




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-03 Thread Paul E. McKenney
On Wed, Sep 03, 2014 at 12:43:15PM -0500, Christoph Lameter wrote:
 On Wed, 3 Sep 2014, Paul E. McKenney wrote:
 
   Well, a shared data structure would be cleaner in general but there are
   certainly other approaches.
 
  Per-CPU variables -are- a shared data structure.
 
 No the intent is for them to be for the particular cpu and therefore
 there is limited support for sharing. Its not a shared data structure in
 the classic sense.

It really is shared.  Normal memory at various addresses and all that.

So what is your real concern here, anyway?

 The code in the rcu subsystem operates like other percpu code: There is a
 modification of local variables by the current processor and other
 processors inspect the state once in awhile. The other percpu code does
 not need atomics and barriers. RCU for some reason (that is not clear to
 me) does.

Like I have said several times before in multiple email threads, including
this one, RCU needs to reliably detect quiescent states.  To do this, it
needs do know that the accesses prior to the quiescent state will reliably
be seen by all as having happened before the quiescent state started.
Similarly, it needs to know that the accesses after the quiescent state
will reliably be seen by all as having happened after the quiescent
state ended.  The memory barriers are required to enforce this ordering.

As noted earlier, in theory, the atomic operations could be nonatomic,
but on x86 this would require adding explicit memory-barrier instructions.
Not clear to me which would be faster.

   But lets focus on the dynticks_idle case we are discussing here rather
   than tackle the more difficult other atomics. What is checked in the loop
   over the remote cpus is the dynticks_idle value plus
   dynticks_idle_jiffies. So it seems that memory ordering is only used to
   ensure that the jiffies are seen correctly.
  
   In that case both the dynticks_idle and dynticks_idle_jiffies could be
   placed in one 64 bit value. If this is stored and retrieved as one then
   there is no issue with ordering anymore and the barriers would no longer
   be needed.
 
  If there was an upper bound on the propagation of values through a system,
  I could buy this.
 
 What is different in propagation speeds? The atomic read on the function
 that checks for the quiescent period having passed is a regular read
 anyways. The atomic_inc makes the cacheline propagate faster through the
 system? I understand that it evicts the cachelines from other processors
 caches (containing other percpu data by the way). That is the desired
 effect?

The smp_mb__{before,after}_atomic() guarantee the ordering.

  But Mike Galbraith checked the overhead of -dynticks_idle and found
  it to be too small to measure.  So doesn't seem to be a problem worth
  extraordinary efforts, especially given that many systems can avoid
  it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.
 
 The code looks fragile and bound to have issues in the future given the
 barriers/atomics etc. Its going to be cleaner without that.

What exactly looks fragile about it, and exactly what issues do you
anticipate?

 And we are right now focusing on the simplest case. The atomics scheme is
 used multiple times in the RCU subsystem. There is more weird looking code
 there like atomic_add using zero etc.

The atomic_add_return(0,...) reads the value out, forcing full ordering.
Again, in theory, this could be a volatile read with explicit memory-barrier
instructions on either side, but it is not clear which wins.  (Keep in
mind that almost all of the atomic_add_return(0,...) calls for a given
dynticks counter are executed from a single kthread.)
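
The two alternatives weighed here can be sketched in userspace C11 — an atomic RMW of zero versus a plain read bracketed by explicit fences. This is an illustration, not kernel code:

```c
#include <stdatomic.h>

static _Atomic int dynticks;

/* Alternative 1: the atomic_add_return(0, ...) idiom -- an atomic
 * RMW that adds zero reads the current value while acting as a
 * full memory barrier on both sides. */
static int read_with_rmw(void)
{
	return atomic_fetch_add_explicit(&dynticks, 0, memory_order_seq_cst);
}

/* Alternative 2: a relaxed load bracketed by explicit full fences.
 * Same ordering guarantees, but two fence instructions instead of
 * one locked RMW; which is cheaper is machine-specific. */
static int read_with_fences(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	int v = atomic_load_explicit(&dynticks, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return v;
}
```

Both functions return the same value with full ordering; the open question in the thread is purely which costs less on a given microarchitecture.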

If systems continue to add CPUs like they have over the past decade or
so, I expect that you will be seeing more code like RCU's, not less.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-02 Thread Paul E. McKenney
On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> 
> > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > for quite some time.
> >
> > But I have to ask...  On x86, is a pair of mfence instructions really
> > cheaper than an atomic increment?
> 
> Not sure why you would need an mfence instruction?

Because otherwise RCU can break.  As soon as the grace-period machinery
sees that the value of this variable is even, it assumes a quiescent
state.  If there are no memory barriers, the non-quiescent code might
not have completed executing, and your kernel's actuarial statistics
become sub-optimal.
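
The protocol Paul describes — an even counter value implies a quiescent state, and the barriers make everything done while non-idle visible before the counter turns even — can be sketched in userspace C11. The explicit fences here stand in for smp_mb__{before,after}_atomic(); this is a single-writer illustration, not the kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Even value: CPU is idle (quiescent); odd: non-idle.  Starts odd. */
static _Atomic long dynticks_idle = 1;

/* Called only by the owning CPU.  The first fence orders all prior
 * non-idle accesses before the flip to an even value; the second
 * orders the flip before anything that follows. */
static void sysidle_enter(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_add_explicit(&dynticks_idle, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Flip back to odd on exit from idle, with the same fencing. */
static void sysidle_exit(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_add_explicit(&dynticks_idle, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Grace-period machinery, running on any CPU: even means quiescent. */
static bool cpu_is_idle(void)
{
	return (atomic_load_explicit(&dynticks_idle,
				     memory_order_seq_cst) & 0x1) == 0;
}
```

Drop the first fence and an observer that reads an even value could still race with stores from the preceding non-idle region — exactly the breakage described above.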

Thanx, Paul

> > > If the first patch I send gets merged then a lot of other this_cpu related
> > > optimizations become possible regardless of the ones in the RFC.
> >
> > Yep, I am queuing that one.
> 
> Great.
> 
> > But could you please do future patches against the rcu/dev branch of
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
> > I had to hand-apply due to conflicts.  Please see below for my version,
> > and please check to make sure that I didn't mess something up in the
> > translation.
> 
> Looks ok. Will use the correct tree next time.
> 
> 



Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-02 Thread Christoph Lameter
On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> Yep, these two have been on my "when I am feeling insanely gutsy" list
> for quite some time.
>
> But I have to ask...  On x86, is a pair of mfence instructions really
> cheaper than an atomic increment?

Not sure why you would need an mfence instruction?

> > If the first patch I send gets merged then a lot of other this_cpu related
> > optimizations become possible regardless of the ones in the RFC.
>
> Yep, I am queuing that one.

Great.

> But could you please do future patches against the rcu/dev branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
> I had to hand-apply due to conflicts.  Please see below for my version,
> and please check to make sure that I didn't mess something up in the
> translation.

Looks ok. Will use the correct tree next time.




Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-02 Thread Paul E. McKenney
On Tue, Sep 02, 2014 at 03:47:55PM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> 
> > But yes, in theory, something like this can work if appropriate memory
> > barriers are put in place.  In practice, this sort of change needs
> > profound testing on multiple architectures.
> 
> Ok how can we move forward? I just looked further and it seems a similar
> approach could perhaps work for the dynticks field.

Yep, these two have been on my "when I am feeling insanely gutsy" list
for quite some time.

But I have to ask...  On x86, is a pair of mfence instructions really
cheaper than an atomic increment?

> If the first patch I send gets merged then a lot of other this_cpu related
> optimizations become possible regardless of the ones in the RFC.

Yep, I am queuing that one.

But could you please do future patches against the rcu/dev branch of
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
I had to hand-apply due to conflicts.  Please see below for my version,
and please check to make sure that I didn't mess something up in the
translation.

Thanx, Paul



rcu: Remove rcu_dynticks * parameters when they are always this_cpu_ptr(&rcu_dynticks)

For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(rdtp).  Remove the parameter if constant and calculate the
pointer in function.

This will have the advantage that it is obvious that the address are
all per cpu offsets and thus it will enable the use of this_cpu_ops in
the future.

Signed-off-by: Christoph Lameter 
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 929ea9a6e0a1..65c42a6465eb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -497,11 +497,11 @@ cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
  * we really have entered idle, and must do the appropriate accounting.
  * The caller must have disabled interrupts.
  */
-static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
-   bool user)
+static void rcu_eqs_enter_common(long long oldval, bool user)
 {
struct rcu_state *rsp;
struct rcu_data *rdp;
+   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 
trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
if (!user && !is_idle_task(current)) {
@@ -552,7 +552,7 @@ static void rcu_eqs_enter(bool user)
WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE) {
rdtp->dynticks_nesting = 0;
-   rcu_eqs_enter_common(rdtp, oldval, user);
+   rcu_eqs_enter_common(oldval, user);
} else {
rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
}
@@ -576,7 +576,7 @@ void rcu_idle_enter(void)
 
local_irq_save(flags);
rcu_eqs_enter(false);
-   rcu_sysidle_enter(this_cpu_ptr(&rcu_dynticks), 0);
+   rcu_sysidle_enter(0);
local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -626,8 +626,8 @@ void rcu_irq_exit(void)
if (rdtp->dynticks_nesting)
trace_rcu_dyntick(TPS("--="), oldval, rdtp->dynticks_nesting);
else
-   rcu_eqs_enter_common(rdtp, oldval, true);
-   rcu_sysidle_enter(rdtp, 1);
+   rcu_eqs_enter_common(oldval, true);
+   rcu_sysidle_enter(1);
local_irq_restore(flags);
 }
 
@@ -638,9 +638,10 @@ void rcu_irq_exit(void)
  * we really have exited idle, and must do the appropriate accounting.
  * The caller must have disabled interrupts.
  */
-static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
-  int user)
+static void rcu_eqs_exit_common(long long oldval, int user)
 {
+   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+
rcu_dynticks_task_exit();
smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
atomic_inc(&rdtp->dynticks);
@@ -678,7 +679,7 @@ static void rcu_eqs_exit(bool user)
rdtp->dynticks_nesting += DYNTICK_TASK_NEST_VALUE;
} else {
rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
-   rcu_eqs_exit_common(rdtp, oldval, user);
+   rcu_eqs_exit_common(oldval, user);
}
 }
 
@@ -699,7 +700,7 @@ void rcu_idle_exit(void)
 
local_irq_save(flags);
rcu_eqs_exit(false);
-   rcu_sysidle_exit(this_cpu_ptr(&rcu_dynticks), 0);
+   rcu_sysidle_exit(0);
local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
@@ -750,8 +751,8 @@ void rcu_irq_enter(void)
if (oldval)
trace_rcu_dyntick(TPS("++="), oldval, rdtp->dynticks_nesting);
else
-   

Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-02 Thread Christoph Lameter
On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> But yes, in theory, something like this can work if appropriate memory
> barriers are put in place.  In practice, this sort of change needs
> profound testing on multiple architectures.

Ok how can we move forward? I just looked further and it seems a similar
approach could perhaps work for the dynticks field.

If the first patch I send gets merged then a lot of other this_cpu related
optimizations become possible regardless of the ones in the RFC.




Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-02 Thread Paul E. McKenney
On Tue, Sep 02, 2014 at 03:14:43PM -0500, Christoph Lameter wrote:
> 
> Since dynticks_idle is only ever modified by the local cpu we do
> not need to use an atomic there. The weak "atomicity" of this_cpu
> ops is sufficient since there is no other cpu modifying the variable.
> 
> [This is a cautious patch that leaves the barriers in place]

Actually, not so cautious.  On x86:

#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic()  barrier()

But yes, in theory, something like this can work if appropriate memory
barriers are put in place.  In practice, this sort of change needs
profound testing on multiple architectures.

Thanx, Paul

> Signed-off-by: Christoph Lameter 
> 
> Index: linux/kernel/rcu/tree.c
> ===
> --- linux.orig/kernel/rcu/tree.c
> +++ linux/kernel/rcu/tree.c
> @@ -213,7 +213,7 @@ static DEFINE_PER_CPU(struct rcu_dyntick
>   .dynticks = ATOMIC_INIT(1),
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>   .dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
> - .dynticks_idle = ATOMIC_INIT(1),
> + .dynticks_idle = 1,
>  #endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>  };
> 
> Index: linux/kernel/rcu/tree.h
> ===
> --- linux.orig/kernel/rcu/tree.h
> +++ linux/kernel/rcu/tree.h
> @@ -91,7 +91,7 @@ struct rcu_dynticks {
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>   long long dynticks_idle_nesting;
>   /* irq/process nesting level from idle. */
> - atomic_t dynticks_idle; /* Even value for idle, else odd. */
> + long dynticks_idle; /* Even value for idle, else odd. */
>   /*  "Idle" excludes userspace execution. */
>   unsigned long dynticks_idle_jiffies;
>   /* End of last non-NMI non-idle period. */
> Index: linux/kernel/rcu/tree_plugin.h
> ===
> --- linux.orig/kernel/rcu/tree_plugin.h
> +++ linux/kernel/rcu/tree_plugin.h
> @@ -2644,9 +2644,9 @@ static void rcu_sysidle_enter(int irq)
>   j = jiffies;
>   ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
>   smp_mb__before_atomic();
> - atomic_inc(&rdtp->dynticks_idle);
> + this_cpu_inc(rcu_dynticks.dynticks_idle);
>   smp_mb__after_atomic();
> - WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
> + WARN_ON_ONCE(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1);
>  }
> 
>  /*
> @@ -2712,9 +2712,9 @@ static void rcu_sysidle_exit(int irq)
> 
>   /* Record end of idle period. */
>   smp_mb__before_atomic();
> - atomic_inc(&rdtp->dynticks_idle);
> + this_cpu_inc(rcu_dynticks.dynticks_idle);
>   smp_mb__after_atomic();
> - WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> + WARN_ON_ONCE(!(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1));
> 
>   /*
>* If we are the timekeeping CPU, we are permitted to be non-idle
> @@ -2755,7 +2755,7 @@ static void rcu_sysidle_check_cpu(struct
>   WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> 
>   /* Pick up current idle and NMI-nesting counter and check. */
> - cur = atomic_read(&rdtp->dynticks_idle);
> + cur = rdtp->dynticks_idle;
>   if (cur & 0x1) {
>   *isidle = false; /* We are not idle! */
>   return;
> 



[RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops

2014-09-02 Thread Christoph Lameter

Since dynticks_idle is only ever modified by the local cpu we do
not need to use an atomic there. The weak "atomicity" of this_cpu
ops is sufficient since there is no other cpu modifying the variable.

[This is a cautious patch that leaves the barriers in place]
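
The patch's premise — a counter modified only by its owning CPU needs no atomic read-modify-write — can be sketched in userspace C11. The names are hypothetical; the relaxed load/store pair plays the role of this_cpu_inc(), which on x86 compiles to a single non-LOCKed add:

```c
#include <stdatomic.h>

/* Only the owning thread ever writes this counter. */
static _Atomic long dynticks_idle;

/* Owner-only increment: a plain load, add, store.  Because no other
 * writer exists, no update can be lost even without an atomic RMW. */
static void owner_inc(void)
{
	long v = atomic_load_explicit(&dynticks_idle, memory_order_relaxed);
	atomic_store_explicit(&dynticks_idle, v + 1, memory_order_relaxed);
}

/* Remote observers still use an atomic load to avoid tearing. */
static long remote_read(void)
{
	return atomic_load_explicit(&dynticks_idle, memory_order_relaxed);
}
```

This captures the "weak atomicity" the changelog refers to: atomicity against interrupts on the local CPU is enough, since cross-CPU mutual exclusion is never needed for a single-writer variable.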

Signed-off-by: Christoph Lameter 

Index: linux/kernel/rcu/tree.c
===
--- linux.orig/kernel/rcu/tree.c
+++ linux/kernel/rcu/tree.c
@@ -213,7 +213,7 @@ static DEFINE_PER_CPU(struct rcu_dyntick
.dynticks = ATOMIC_INIT(1),
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
-   .dynticks_idle = ATOMIC_INIT(1),
+   .dynticks_idle = 1,
 #endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 };

Index: linux/kernel/rcu/tree.h
===
--- linux.orig/kernel/rcu/tree.h
+++ linux/kernel/rcu/tree.h
@@ -91,7 +91,7 @@ struct rcu_dynticks {
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
long long dynticks_idle_nesting;
/* irq/process nesting level from idle. */
-   atomic_t dynticks_idle; /* Even value for idle, else odd. */
+   long dynticks_idle; /* Even value for idle, else odd. */
/*  "Idle" excludes userspace execution. */
unsigned long dynticks_idle_jiffies;
/* End of last non-NMI non-idle period. */
Index: linux/kernel/rcu/tree_plugin.h
===
--- linux.orig/kernel/rcu/tree_plugin.h
+++ linux/kernel/rcu/tree_plugin.h
@@ -2644,9 +2644,9 @@ static void rcu_sysidle_enter(int irq)
j = jiffies;
ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
smp_mb__before_atomic();
-   atomic_inc(&rdtp->dynticks_idle);
+   this_cpu_inc(rcu_dynticks.dynticks_idle);
smp_mb__after_atomic();
-   WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
+   WARN_ON_ONCE(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1);
 }

 /*
@@ -2712,9 +2712,9 @@ static void rcu_sysidle_exit(int irq)

/* Record end of idle period. */
smp_mb__before_atomic();
-   atomic_inc(&rdtp->dynticks_idle);
+   this_cpu_inc(rcu_dynticks.dynticks_idle);
smp_mb__after_atomic();
-   WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+   WARN_ON_ONCE(!(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1));

/*
 * If we are the timekeeping CPU, we are permitted to be non-idle
@@ -2755,7 +2755,7 @@ static void rcu_sysidle_check_cpu(struct
WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);

/* Pick up current idle and NMI-nesting counter and check. */
-   cur = atomic_read(&rdtp->dynticks_idle);
+   cur = rdtp->dynticks_idle;
if (cur & 0x1) {
*isidle = false; /* We are not idle! */
return;

