Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Mon, Sep 08, 2014 at 01:20:30PM -0500, Christoph Lameter wrote:
> On Thu, 4 Sep 2014, Paul E. McKenney wrote:
>
> > The lglocks use should be of particular interest to you.  See above to
> > find the others.
>
> Well the others seem to be in drivers.

OK...

> > Understood, but no hasty changes to that part of RCU.
>
> At our age I think we have learned to be very deliberate and slow.

And rcutorture needs some upgrades for this case.

> > rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data
>
> Reviewed-by: Christoph Lameter

Thank you, recorded.

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Thu, 4 Sep 2014, Paul E. McKenney wrote:

> The lglocks use should be of particular interest to you.  See above to
> find the others.

Well the others seem to be in drivers.

> Understood, but no hasty changes to that part of RCU.

At our age I think we have learned to be very deliberate and slow.

> rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

Reviewed-by: Christoph Lameter
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> As noted earlier, in theory, the atomic operations could be nonatomic,

Well as demonstrated by the patch earlier: the atomic operations are only
used on the local cpu.  There is no synchronization in that sense needed
between processors because there is never a remote atomic operation.

> > The code looks fragile and bound to have issues in the future given the
> > barriers/atomics etc. Its going to be cleaner without that.
>
> What exactly looks fragile about it, and exactly what issues do you
> anticipate?

I am concerned about the creation of unnecessary synchronization issues.
In this case we have already discovered that the atomic operations on per
cpu variables are only used to modify the contents from the local cpu.

This means at minimum we can give up on the use of atomics and keep the
barriers to enforce visibility.

> > And we are right now focusing on the simplest case. The atomics scheme is
> > used multiple times in the RCU subsystem. There is more weird looking code
> > there like atomic_add using zero etc.
>
> The atomic_add_return(0,...) reads the value out, forcing full ordering.
> Again, in theory, this could be a volatile read with explicit memory-barrier
> instructions on either side, but it is not clear which wins.  (Keep in
> mind that almost all of the atomic_add_return(0,...) calls for a given
> dynticks counter are executed from a single kthread.)
>
> If systems continue to add CPUs like they have over the past decade or
> so, I expect that you will be seeing more code like RCU's, not less.

We have other code like this in multiple subsystems but it does not have
the barrier issues: per cpu variables are always updated without the use
of atomics, and the inspection of the per cpu state from remote cpus works
just fine also without them.

I'd like to simplify this as much as possible and make it consistent
throughout the kernel.
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Thu, 4 Sep 2014, Paul E. McKenney wrote:

> So in short, you don't see the potential for this use case actually
> breaking anything, correct?

In general its a performance impact, but depending on how this_cpu_ops
may be implemented on a particular platform there may also be correctness
issues, since the assumption there is that no remote writes occur.

There is a slight issue in the RCU code.  It uses DEFINE_PER_CPU for per
cpu data, which is intended for true per cpu data where the cachelines
are not evicted.  False aliasing with RCU structures that are remotely
handled can cause issues for code that expects the per cpu data to be
uncontended.  I think it would be better to go to

	DEFINE_PER_CPU_SHARED_ALIGNED

for your definitions in particular, since there are still code pieces
where we are not sure if there are remote write accesses or not.  This
will give you separate cachelines so that the false aliasing effect does
not occur.

> Besides RCU is not the only place where atomics are used on per-CPU
> variables.  For one thing, there are a number of per-CPU spinlocks in use
> in various places throughout the kernel.  For another thing, there is also
> a large number of per-CPU structures (not pointers to structures, actual
> structures), and I bet that a fair number of these feature cross-CPU
> writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.

Would be interested to see where that occurs.

> > the barrier issues, per cpu variables are updated always without the use
> > of atomics and the inspection of the per cpu state from remote cpus works
> > just fine also without them.
>
> Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
> I expect that a fair number of the per-CPU structures involve cross-CPU
> synchronization.

Where are those per cpu spinlocks?  Cross cpu synchronization can be done
in a number of ways that often allow avoiding remote writes to percpu
areas.

> It already is consistent, just not in the manner that you want.  ;-)
>
> But -why- do you want these restrictions?  How does it help anything?

1. It allows potentially faster operations that make the assumption that
no remote writes occur.  The design of deterministic low latency code
often needs some assurance that another cpu is not simply kicking the
cacheline out, which would then require off chip memory access and remote
cacheline eviction once the cacheline is touched again.

2. The use of atomics without a rationale is something that I frown upon,
and it seems very likely that we have such a case here.  People assume
that the use of atomics has some reason, like a remote access or
contention, which is not occurring here.

3. this_cpu operations create instructions with reduced latency due to
the lack of a lock prefix.  Remote operations at the same time could
create inconsistent results.

See also linux/Documentation/this_cpu_ops.txt
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Thu, Sep 04, 2014 at 01:19:29PM -0500, Christoph Lameter wrote:
> On Thu, 4 Sep 2014, Paul E. McKenney wrote:
>
> > So in short, you don't see the potential for this use case actually
> > breaking anything, correct?
>
> In general its a performance impact but depending on how this_cpu_ops may
> be implemented on a particular platform there may also be correctness
> issues since the assumption there is that no remote writes occur.

This assumption of yours does not seem to be shared by a fair amount of
the code using per-CPU variables.

> There is a slight issue in the RCU code.  It uses DEFINE_PER_CPU for
> per cpu data which is intended for true per cpu data where the
> cachelines are not evicted.  False aliasing with RCU structures that are
> remotely handled can cause issues for code that expects the per cpu data
> to be uncontended.  I think it would be better to go to
>
> 	DEFINE_PER_CPU_SHARED_ALIGNED
>
> for your definitions in particular since there are still code pieces where
> we are not sure if there are remote write accesses or not.  This will give
> you separate cachelines so that the false aliasing effect does not occur.

Fair enough, I have queued a commit that makes this change for the
rcu_data per-CPU structures with your Reported-by.  (See below for the
patch.)

> > Besides RCU is not the only place where atomics are used on per-CPU
> > variables.  For one thing, there are a number of per-CPU spinlocks in use
> > in various places throughout the kernel.  For another thing, there is also
> > a large number of per-CPU structures (not pointers to structures, actual
> > structures), and I bet that a fair number of these feature cross-CPU
> > writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.
>
> Would be interested to see where that occurs.

Something like the following will find them for you:

	git grep "DEFINE_PER_CPU.*spinlock"
	git grep "DEFINE_PER_CPU.*(struct[^*]*$"

> > > the barrier issues, per cpu variables are updated always without the use
> > > of atomics and the inspection of the per cpu state from remote cpus works
> > > just fine also without them.
> >
> > Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
> > I expect that a fair number of the per-CPU structures involve cross-CPU
> > synchronization.
>
> Where are those per cpu spinlocks?  Cross cpu synchronization can be done
> in a number of ways that often allow avoiding remote writes to percpu
> areas.

The lglocks use should be of particular interest to you.  See above to
find the others.

> > It already is consistent, just not in the manner that you want.  ;-)
> >
> > But -why- do you want these restrictions?  How does it help anything?
>
> 1. It allows potentially faster operations that make the assumption that
> no remote writes occur.  The design of deterministic low latency code
> often needs some assurance that another cpu is not simply kicking the
> cacheline out, which would then require off chip memory access and remote
> cacheline eviction once the cacheline is touched again.

OK, DEFINE_PER_CPU_SHARED_ALIGNED() should avoid the false sharing.

> 2. The use of atomics without a rationale is something that I frown upon,
> and it seems very likely that we have such a case here.  People assume
> that the use of atomics has some reason, like a remote access or
> contention, which is not occurring here.

Understood, but no hasty changes to that part of RCU.

> 3. this_cpu operations create instructions with reduced latency due to
> the lack of a lock prefix.  Remote operations at the same time could
> create inconsistent results.
>
> See also
>
> linux/Documentation/this_cpu_ops.txt

Yep.  If you use atomic operations, you need to be very careful about
mixing in non-atomic operations, which in this case includes the per-CPU
atomic operations.  Normally the atomic_t and atomic_long_t types make it
difficult to mess up.  Of course, you can work around this type checking,
and you can also use xchg() and cmpxchg(), which don't provide this
type-checking misuse-avoidance help.  Just as it has always been.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU.  This situation can result in false
sharing with per-CPU variables that have the misfortune of being
allocated adjacent to rcu_data in memory.  This commit therefore changes
the DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to
avoid this false sharing.

Reported-by: Christoph Lameter
Signed-off-by: Paul E. McKenney

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1b7eba915dbe..e4c6d604694e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -105,7 +105,7 @@ struct rcu_state sname##_state = { \
 	.name = RCU_STATE_NAME(sname), \
 	.abbr = sabbr, \
 }; \
-DEFINE_PER_CPU(struct
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Thu, Sep 04, 2014 at 10:04:17AM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
>
> > As noted earlier, in theory, the atomic operations could be nonatomic,
>
> Well as demonstrated by the patch earlier: the atomic operations are only
> used on the local cpu.  There is no synchronization in that sense needed
> between processors because there is never a remote atomic operation.

Easy to say!  ;-)

> > > The code looks fragile and bound to have issues in the future given the
> > > barriers/atomics etc. Its going to be cleaner without that.
> >
> > What exactly looks fragile about it, and exactly what issues do you
> > anticipate?
>
> I am concerned about the creation of unnecessary synchronization issues.
> In this case we have already discovered that the atomic operations on per
> cpu variables are only used to modify the contents from the local cpu.
>
> This means at minimum we can give up on the use of atomics and keep the
> barriers to enforce visibility.

Sounds like a desire for a potential optimization rather than any sort
of fragility.  And in this case, it is not clear that your desire of
replacing a value-returning atomic operation with a normal memory
reference and a pair of memory barriers actually makes anything go
faster.

So in short, you don't see the potential for this use case actually
breaking anything, correct?

Besides, RCU is not the only place where atomics are used on per-CPU
variables.  For one thing, there are a number of per-CPU spinlocks in use
in various places throughout the kernel.  For another thing, there is also
a large number of per-CPU structures (not pointers to structures, actual
structures), and I bet that a fair number of these feature cross-CPU
writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.

> > > And we are right now focusing on the simplest case. The atomics scheme is
> > > used multiple times in the RCU subsystem. There is more weird looking code
> > > there like atomic_add using zero etc.
> >
> > The atomic_add_return(0,...) reads the value out, forcing full ordering.
> > Again, in theory, this could be a volatile read with explicit memory-barrier
> > instructions on either side, but it is not clear which wins.  (Keep in
> > mind that almost all of the atomic_add_return(0,...) calls for a given
> > dynticks counter are executed from a single kthread.)
> >
> > If systems continue to add CPUs like they have over the past decade or
> > so, I expect that you will be seeing more code like RCU's, not less.
>
> We have other code like this in multiple subsystems but it does not have
> the barrier issues: per cpu variables are always updated without the use
> of atomics, and the inspection of the per cpu state from remote cpus works
> just fine also without them.

Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
I expect that a fair number of the per-CPU structures involve cross-CPU
synchronization.

> I'd like to simplify this as much as possible and make it consistent
> throughout the kernel.

It already is consistent, just not in the manner that you want.  ;-)

But -why- do you want these restrictions?  How does it help anything?

							Thanx, Paul
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Wed, Sep 03, 2014 at 12:43:15PM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
>
> > > Well, a shared data structure would be cleaner in general but there are
> > > certainly other approaches.
> >
> > Per-CPU variables -are- a shared data structure.
>
> No the intent is for them to be for the particular cpu and therefore
> there is limited support for sharing. Its not a shared data structure in
> the classic sense.

It really is shared. Normal memory at various addresses and all that. So what is your real concern here, anyway?

> The code in the rcu subsystem operates like other percpu code: There is a
> modification of local variables by the current processor and other
> processors inspect the state once in awhile. The other percpu code does
> not need atomics and barriers. RCU for some reason (that is not clear to
> me) does.

Like I have said several times before in multiple email threads, including this one, RCU needs to reliably detect quiescent states. To do this, it needs to know that the accesses prior to the quiescent state will reliably be seen by all as having happened before the quiescent state started. Similarly, it needs to know that the accesses after the quiescent state will reliably be seen by all as having happened after the quiescent state ended. The memory barriers are required to enforce this ordering.

As noted earlier, in theory, the atomic operations could be nonatomic, but on x86 this would require adding explicit memory-barrier instructions. Not clear to me which would be faster.

> > > But lets focus on the dynticks_idle case we are discussing here rather
> > > than tackle the more difficult other atomics. What is checked in the loop
> > > over the remote cpus is the dynticks_idle value plus
> > > dynticks_idle_jiffies. So it seems that memory ordering is only used to
> > > ensure that the jiffies are seen correctly.
> > >
> > > In that case both the dynticks_idle and dynticks_idle_jiffies could be
> > > placed in one 64 bit value. If this is stored and retrieved as one then
> > > there is no issue with ordering anymore and the barriers would no longer
> > > be needed.
> >
> > If there was an upper bound on the propagation of values through a system,
> > I could buy this.
>
> What is different in propagation speeds? The atomic read on the function
> that checks for the quiescent period having passed is a regular read
> anyways. The atomic_inc makes the cacheline propagate faster through the
> system? I understand that it evicts the cachelines from other processors
> caches (containing other percpu data by the way). That is the desired
> effect?

The smp_mb__{before,after}_atomic() guarantee the ordering.

> > But Mike Galbraith checked the overhead of ->dynticks_idle and found
> > it to be too small to measure. So doesn't seem to be a problem worth
> > extraordinary efforts, especially given that many systems can avoid
> > it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.
>
> The code looks fragile and bound to have issues in the future given the
> barriers/atomics etc. Its going to be cleaner without that.

What exactly looks fragile about it, and exactly what issues do you anticipate?

> And we are right now focusing on the simplest case. The atomics scheme is
> used multiple times in the RCU subsystem. There is more weird looking code
> there like atomic_add using zero etc.

The atomic_add_return(0,...) reads the value out, forcing full ordering. Again, in theory, this could be a volatile read with explicit memory-barrier instructions on either side, but it is not clear which wins. (Keep in mind that almost all of the atomic_add_return(0,...) calls for a given dynticks counter are executed from a single kthread.) If systems continue to add CPUs like they have over the past decade or so, I expect that you will be seeing more code like RCU's, not less.
Thanx, Paul
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> > Well, a shared data structure would be cleaner in general but there are
> > certainly other approaches.
>
> Per-CPU variables -are- a shared data structure.

No the intent is for them to be for the particular cpu and therefore there is limited support for sharing. Its not a shared data structure in the classic sense.

The code in the rcu subsystem operates like other percpu code: There is a modification of local variables by the current processor and other processors inspect the state once in awhile. The other percpu code does not need atomics and barriers. RCU for some reason (that is not clear to me) does.

> > But lets focus on the dynticks_idle case we are discussing here rather
> > than tackle the more difficult other atomics. What is checked in the loop
> > over the remote cpus is the dynticks_idle value plus
> > dynticks_idle_jiffies. So it seems that memory ordering is only used to
> > ensure that the jiffies are seen correctly.
> >
> > In that case both the dynticks_idle and dynticks_idle_jiffies could be
> > placed in one 64 bit value. If this is stored and retrieved as one then
> > there is no issue with ordering anymore and the barriers would no longer
> > be needed.
>
> If there was an upper bound on the propagation of values through a system,
> I could buy this.

What is different in propagation speeds? The atomic read on the function that checks for the quiescent period having passed is a regular read anyways. The atomic_inc makes the cacheline propagate faster through the system? I understand that it evicts the cachelines from other processors caches (containing other percpu data by the way). That is the desired effect?

> But Mike Galbraith checked the overhead of ->dynticks_idle and found
> it to be too small to measure. So doesn't seem to be a problem worth
> extraordinary efforts, especially given that many systems can avoid
> it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.
The code looks fragile and bound to have issues in the future given the barriers/atomics etc. Its going to be cleaner without that.

And we are right now focusing on the simplest case. The atomics scheme is used multiple times in the RCU subsystem. There is more weird looking code there like atomic_add using zero etc.
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Wed, Sep 03, 2014 at 10:51:13AM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
>
> > You would prefer that I instead allocated an NR_CPUS-sized array?
>
> Well, a shared data structure would be cleaner in general but there are
> certainly other approaches.

Per-CPU variables -are- a shared data structure.

> But lets focus on the dynticks_idle case we are discussing here rather
> than tackle the more difficult other atomics. What is checked in the loop
> over the remote cpus is the dynticks_idle value plus
> dynticks_idle_jiffies. So it seems that memory ordering is only used to
> ensure that the jiffies are seen correctly.
>
> In that case both the dynticks_idle and dynticks_idle_jiffies could be
> placed in one 64 bit value. If this is stored and retrieved as one then
> there is no issue with ordering anymore and the barriers would no longer
> be needed.

If there was an upper bound on the propagation of values through a system, I could buy this.

But Mike Galbraith checked the overhead of ->dynticks_idle and found it to be too small to measure. So doesn't seem to be a problem worth extraordinary efforts, especially given that many systems can avoid it simply by leaving CONFIG_NO_HZ_SYSIDLE=n.

Thanx, Paul
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> You would prefer that I instead allocated an NR_CPUS-sized array?

Well, a shared data structure would be cleaner in general but there are certainly other approaches.

But lets focus on the dynticks_idle case we are discussing here rather than tackle the more difficult other atomics. What is checked in the loop over the remote cpus is the dynticks_idle value plus dynticks_idle_jiffies. So it seems that memory ordering is only used to ensure that the jiffies are seen correctly.

In that case both the dynticks_idle and dynticks_idle_jiffies could be placed in one 64 bit value. If this is stored and retrieved as one then there is no issue with ordering anymore and the barriers would no longer be needed.
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Wed, Sep 03, 2014 at 09:10:24AM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
>
> > On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> > >
> > > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > > for quite some time.
> > > >
> > > > But I have to ask... On x86, is a pair of mfence instructions really
> > > > cheaper than an atomic increment?
> > >
> > > Not sure why you would need an mfence instruction?
> >
> > Because otherwise RCU can break. As soon as the grace-period machinery
> > sees that the value of this variable is even, it assumes a quiescent
> > state. If there are no memory barriers, the non-quiescent code might
> > not have completed executing, and your kernel's actuarial statistics
> > become sub-optimal.
>
> Synchronization using per cpu variables is bound to be problematic since
> they are simply not made for that. The per cpu variable usually can change
> without notice to the other cpu since typically per cpu processing is
> ongoing. The improved performance of per cpu instructions is possible
> only because we exploit the fact that there is no need for
> synchronization.

Christoph, per-CPU variables are memory. If I do the correct operations on them, they work just like any other memory. And yes, I typically cannot use the per-CPU operations if I need coherent results visible to all CPUs (but see your statistical-counters example below). This is of course exactly why I use atomic operations and memory barriers on the dynticks counters. You would prefer that I instead allocated an NR_CPUS-sized array?

> Kernel statistics *are* suboptimal for that very reason because they
> typically sum up individual counters from multiple processors without
> regard to complete accuracy. The manipulation of the VM counters is very
> low overhead due to the lack of concern for synchronization. This is a
> tradeoff vs. performance. We actually can tune the fuzziness of
> statistics in the VM which allows us to control the overhead generated by
> the need for more or less accurate statistics.

Yes, statistics is one of the cross-CPU uses of per-CPU variables where you can get away without tight synchronization.

> Memory barriers ensure that the code has completed executing? I think what
> is meant is that they ensure that all modifications to cachelines before the
> change of state are visible and the other processor does not have stale
> cachelines around?

No, memory barriers simply enforce ordering, and ordering is all that RCU's dyntick-idle code relies on. In other words, if the RCU grace-period kthread sees a value indicating that a CPU is idle, that kthread needs assurance that all the memory operations that took place before that CPU went idle have completed. All that is required for this is ordering: I saw that the ->dynticks counter had an even value, and I know that it was preceded by a memory barrier, therefore, all subsequent code after a memory barrier will see the effects of all pre-idle code.

> If the state variable is odd how does the other processor see a state
> change to even before processing is complete if the state is updated only
> at the end of processing?

I am having some difficulty parsing that question. However, suppose that the grace-period kthread sees ->dynticks with an odd value, say three. Suppose that this kthread later sees another odd value, say eleven. Then because of the atomic operations and memory barriers, the kthread can be sure that the CPU corresponding to that ->dynticks has passed through an idle state since the time it saw the value three, and therefore that that CPU has passed through a quiescent state since the start of the grace period. Similarly, if the grace-period kthread sees ->dynticks with an even value, it knows that after a subsequent memory barrier, all the pre-idle effects will be visible, as required.

As a diagram:

	CPU 0						CPU 1

	/* Access some RCU-protected struct. */
	smp_mb__before_atomic()
	atomic_inc() -> even value
	/* Now idle. */
						atomic_add_return(0, ...) -> odd value
						/* implied memory barrier. */
						/* post-GP changes won't interfere */
						/* with pre-idle accesses. */

In other words, if CPU 0 accessed some RCU-protected memory before going idle, that memory is guaranteed not to be freed until after those pre-idle accesses have completed. Of course, it also needs to work the other way around:

	CPU 0						CPU 1

	/* Remove some RCU-protected struct. */
	/* implied memory barrier. */
	atomic_add_return(0, ...) -> even value
						/* Was idle. */
						atomic_inc() -> odd value
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> >
> > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > for quite some time.
> > >
> > > But I have to ask... On x86, is a pair of mfence instructions really
> > > cheaper than an atomic increment?
> >
> > Not sure why you would need an mfence instruction?
>
> Because otherwise RCU can break. As soon as the grace-period machinery
> sees that the value of this variable is even, it assumes a quiescent
> state. If there are no memory barriers, the non-quiescent code might
> not have completed executing, and your kernel's actuarial statistics
> become sub-optimal.

Synchronization using per cpu variables is bound to be problematic since they are simply not made for that. The per cpu variable usually can change without notice to the other cpu since typically per cpu processing is ongoing. The improved performance of per cpu instructions is possible only because we exploit the fact that there is no need for synchronization.

Kernel statistics *are* suboptimal for that very reason because they typically sum up individual counters from multiple processors without regard to complete accuracy. The manipulation of the VM counters is very low overhead due to the lack of concern for synchronization. This is a tradeoff vs. performance. We actually can tune the fuzziness of statistics in the VM which allows us to control the overhead generated by the need for more or less accurate statistics.

Memory barriers ensure that the code has completed executing? I think what is meant is that they ensure that all modifications to cachelines before the change of state are visible and the other processor does not have stale cachelines around?

If the state variable is odd how does the other processor see a state change to even before processing is complete if the state is updated only at the end of processing?
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
>
> > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > for quite some time.
> >
> > But I have to ask... On x86, is a pair of mfence instructions really
> > cheaper than an atomic increment?
>
> Not sure why you would need an mfence instruction?

Because otherwise RCU can break. As soon as the grace-period machinery sees that the value of this variable is even, it assumes a quiescent state. If there are no memory barriers, the non-quiescent code might not have completed executing, and your kernel's actuarial statistics become sub-optimal.

Thanx, Paul

> > > If the first patch I send gets merged then a lot of other this_cpu related
> > > optimizations become possible regardless of the ones in the RFC.
> >
> > Yep, I am queuing that one.
>
> Great.
>
> > But could you please do future patches against the rcu/dev branch of
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
> > I had to hand-apply due to conflicts. Please see below for my version,
> > and please check to make sure that I didn't mess something up in the
> > translation.
>
> Looks ok. Will use the correct tree next time.
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Tue, 2 Sep 2014, Paul E. McKenney wrote: > Yep, these two have been on my "when I am feeling insanely gutsy" list > for quite some time. > > But I have to ask... On x86, is a pair of mfence instructions really > cheaper than an atomic increment? Not sure why you would need an mfence instruction? > > If the first patch I send gets merged then a lot of other this_cpu related > > optimizations become possible regardless of the ones in the RFC. > > Yep, I am queuing that one. Great. > But could you please do future patches against the rcu/dev branch of > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git? > I had to hand-apply due to conflicts. Please see below for my version, > and please check to make sure that I didn't mess something up in the > translation. Looks ok. Will use the correct tree next time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Tue, Sep 02, 2014 at 03:47:55PM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
>
> > But yes, in theory, something like this can work if appropriate memory
> > barriers are put in place. In practice, this sort of change needs
> > profound testing on multiple architectures.
>
> Ok how can we move forward? I just looked further and it seems a similar
> approach could perhaps work for the dynticks field.

Yep, these two have been on my "when I am feeling insanely gutsy" list
for quite some time.

But I have to ask... On x86, is a pair of mfence instructions really
cheaper than an atomic increment?

> If the first patch I send gets merged then a lot of other this_cpu related
> optimizations become possible regardless of the ones in the RFC.

Yep, I am queuing that one.

But could you please do future patches against the rcu/dev branch of
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
I had to hand-apply due to conflicts. Please see below for my version,
and please check to make sure that I didn't mess something up in the
translation.

							Thanx, Paul

rcu: Remove rcu_dynticks * parameters when they are always this_cpu_ptr(&rcu_dynticks)

For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(rdtp). Remove the parameter if constant and calculate the
pointer in function. This will have the advantage that it is obvious
that the address are all per cpu offsets and thus it will enable the
use of this_cpu_ops in the future.

Signed-off-by: Christoph Lameter
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 929ea9a6e0a1..65c42a6465eb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -497,11 +497,11 @@ cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
  * we really have entered idle, and must do the appropriate accounting.
  * The caller must have disabled interrupts.
  */
-static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
-				 bool user)
+static void rcu_eqs_enter_common(long long oldval, bool user)
 {
 	struct rcu_state *rsp;
 	struct rcu_data *rdp;
+	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);

 	trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
 	if (!user && !is_idle_task(current)) {
@@ -552,7 +552,7 @@ static void rcu_eqs_enter(bool user)
 	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
 	if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE) {
 		rdtp->dynticks_nesting = 0;
-		rcu_eqs_enter_common(rdtp, oldval, user);
+		rcu_eqs_enter_common(oldval, user);
 	} else {
 		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 	}
@@ -576,7 +576,7 @@ void rcu_idle_enter(void)

 	local_irq_save(flags);
 	rcu_eqs_enter(false);
-	rcu_sysidle_enter(this_cpu_ptr(&rcu_dynticks), 0);
+	rcu_sysidle_enter(0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -626,8 +626,8 @@ void rcu_irq_exit(void)
 	if (rdtp->dynticks_nesting)
 		trace_rcu_dyntick(TPS("--="), oldval, rdtp->dynticks_nesting);
 	else
-		rcu_eqs_enter_common(rdtp, oldval, true);
-	rcu_sysidle_enter(rdtp, 1);
+		rcu_eqs_enter_common(oldval, true);
+	rcu_sysidle_enter(1);
 	local_irq_restore(flags);
 }
@@ -638,9 +638,10 @@ void rcu_irq_exit(void)
  * we really have exited idle, and must do the appropriate accounting.
  * The caller must have disabled interrupts.
  */
-static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
-				int user)
+static void rcu_eqs_exit_common(long long oldval, int user)
 {
+	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+
 	rcu_dynticks_task_exit();
 	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
 	atomic_inc(&rdtp->dynticks);
@@ -678,7 +679,7 @@ static void rcu_eqs_exit(bool user)
 		rdtp->dynticks_nesting += DYNTICK_TASK_NEST_VALUE;
 	} else {
 		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
-		rcu_eqs_exit_common(rdtp, oldval, user);
+		rcu_eqs_exit_common(oldval, user);
 	}
 }
@@ -699,7 +700,7 @@ void rcu_idle_exit(void)

 	local_irq_save(flags);
 	rcu_eqs_exit(false);
-	rcu_sysidle_exit(this_cpu_ptr(&rcu_dynticks), 0);
+	rcu_sysidle_exit(0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
@@ -750,8 +751,8 @@ void rcu_irq_enter(void)
 	if (oldval)
 		trace_rcu_dyntick(TPS("++="), oldval, rdtp->dynticks_nesting);
 	else
-
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Tue, 2 Sep 2014, Paul E. McKenney wrote: > But yes, in theory, something like this can work if appropriate memory > barriers are put in place. In practice, this sort of change needs > profound testing on multiple architectures. Ok how can we move forward? I just looked further and it seems a similar approach could perhaps work for the dynticks field. If the first patch I send gets merged then a lot of other this_cpu related optimizations become possible regardless of the ones in the RFC. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
On Tue, Sep 02, 2014 at 03:14:43PM -0500, Christoph Lameter wrote:
> Since dynticks_idle is only ever modified by the local cpu we do
> not need to use an atomic there. The weak "atomicity" of this_cpu
> ops is sufficient since there is no other cpu modifying the variable.
>
> [This is a cautious patch that leaves the barriers in place]

Actually, not so cautious. On x86:

#define smp_mb__before_atomic()	barrier()
#define smp_mb__after_atomic()	barrier()

But yes, in theory, something like this can work if appropriate memory
barriers are put in place. In practice, this sort of change needs
profound testing on multiple architectures.

							Thanx, Paul

> Signed-off-by: Christoph Lameter
>
> Index: linux/kernel/rcu/tree.c
> ===
> --- linux.orig/kernel/rcu/tree.c
> +++ linux/kernel/rcu/tree.c
> @@ -213,7 +213,7 @@ static DEFINE_PER_CPU(struct rcu_dyntick
>  	.dynticks = ATOMIC_INIT(1),
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  	.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
> -	.dynticks_idle = ATOMIC_INIT(1),
> +	.dynticks_idle = 1,
>  #endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>  };
>
> Index: linux/kernel/rcu/tree.h
> ===
> --- linux.orig/kernel/rcu/tree.h
> +++ linux/kernel/rcu/tree.h
> @@ -91,7 +91,7 @@ struct rcu_dynticks {
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  	long long dynticks_idle_nesting;
>  			/* irq/process nesting level from idle. */
> -	atomic_t dynticks_idle;	/* Even value for idle, else odd. */
> +	long dynticks_idle;	/* Even value for idle, else odd. */
>  			/* "Idle" excludes userspace execution. */
>  	unsigned long dynticks_idle_jiffies;
>  			/* End of last non-NMI non-idle period. */
>
> Index: linux/kernel/rcu/tree_plugin.h
> ===
> --- linux.orig/kernel/rcu/tree_plugin.h
> +++ linux/kernel/rcu/tree_plugin.h
> @@ -2644,9 +2644,9 @@ static void rcu_sysidle_enter(int irq)
>  	j = jiffies;
>  	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
>  	smp_mb__before_atomic();
> -	atomic_inc(&rdtp->dynticks_idle);
> +	this_cpu_inc(rcu_dynticks.dynticks_idle);
>  	smp_mb__after_atomic();
> -	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
> +	WARN_ON_ONCE(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1);
>  }
>
>  /*
> @@ -2712,9 +2712,9 @@ static void rcu_sysidle_exit(int irq)
>
>  	/* Record end of idle period. */
>  	smp_mb__before_atomic();
> -	atomic_inc(&rdtp->dynticks_idle);
> +	this_cpu_inc(rcu_dynticks.dynticks_idle);
>  	smp_mb__after_atomic();
> -	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> +	WARN_ON_ONCE(!(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1));
>
>  	/*
>  	 * If we are the timekeeping CPU, we are permitted to be non-idle
> @@ -2755,7 +2755,7 @@ static void rcu_sysidle_check_cpu(struct
>  	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
>
>  	/* Pick up current idle and NMI-nesting counter and check. */
> -	cur = atomic_read(&rdtp->dynticks_idle);
> +	cur = rdtp->dynticks_idle;
>  	if (cur & 0x1) {
>  		*isidle = false;  /* We are not idle! */
>  		return;
[RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
Since dynticks_idle is only ever modified by the local cpu we do
not need to use an atomic there. The weak "atomicity" of this_cpu
ops is sufficient since there is no other cpu modifying the variable.

[This is a cautious patch that leaves the barriers in place]

Signed-off-by: Christoph Lameter

Index: linux/kernel/rcu/tree.c
===
--- linux.orig/kernel/rcu/tree.c
+++ linux/kernel/rcu/tree.c
@@ -213,7 +213,7 @@ static DEFINE_PER_CPU(struct rcu_dyntick
 	.dynticks = ATOMIC_INIT(1),
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
-	.dynticks_idle = ATOMIC_INIT(1),
+	.dynticks_idle = 1,
 #endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 };

Index: linux/kernel/rcu/tree.h
===
--- linux.orig/kernel/rcu/tree.h
+++ linux/kernel/rcu/tree.h
@@ -91,7 +91,7 @@ struct rcu_dynticks {
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	long long dynticks_idle_nesting;
 			/* irq/process nesting level from idle. */
-	atomic_t dynticks_idle;	/* Even value for idle, else odd. */
+	long dynticks_idle;	/* Even value for idle, else odd. */
 			/* "Idle" excludes userspace execution. */
 	unsigned long dynticks_idle_jiffies;
 			/* End of last non-NMI non-idle period. */

Index: linux/kernel/rcu/tree_plugin.h
===
--- linux.orig/kernel/rcu/tree_plugin.h
+++ linux/kernel/rcu/tree_plugin.h
@@ -2644,9 +2644,9 @@ static void rcu_sysidle_enter(int irq)
 	j = jiffies;
 	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
 	smp_mb__before_atomic();
-	atomic_inc(&rdtp->dynticks_idle);
+	this_cpu_inc(rcu_dynticks.dynticks_idle);
 	smp_mb__after_atomic();
-	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
+	WARN_ON_ONCE(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1);
 }

 /*
@@ -2712,9 +2712,9 @@ static void rcu_sysidle_exit(int irq)

 	/* Record end of idle period. */
 	smp_mb__before_atomic();
-	atomic_inc(&rdtp->dynticks_idle);
+	this_cpu_inc(rcu_dynticks.dynticks_idle);
 	smp_mb__after_atomic();
-	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+	WARN_ON_ONCE(!(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1));

 	/*
 	 * If we are the timekeeping CPU, we are permitted to be non-idle
@@ -2755,7 +2755,7 @@ static void rcu_sysidle_check_cpu(struct
 	WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);

 	/* Pick up current idle and NMI-nesting counter and check. */
-	cur = atomic_read(&rdtp->dynticks_idle);
+	cur = rdtp->dynticks_idle;
 	if (cur & 0x1) {
 		*isidle = false;  /* We are not idle! */
 		return;