Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-24 Thread Paul E. McKenney
On Tue, Jun 24, 2014 at 01:43:16PM -0700, Dave Hansen wrote:
> On 06/23/2014 05:39 PM, Paul E. McKenney wrote:
> > On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
> >> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
> >>> Just out of curiosity, how many CPUs does your system have?  80?
> >>> If 160, looks like something bad is happening at 80.
> >>
> >> 80 cores, 160 threads.  >80 processes/threads is where we start using
> >> the second thread on the cores.  The tasks are also pinned to
> >> hyperthread pairs, so they disturb each other, and the scheduler moves
> >> them between threads on occasion which causes extra noise.
> > 
> > OK, that could explain the near flattening of throughput near 80
> > processes.  Is 3.16.0-rc1-pf2 with the two RCU patches?  If so, is the
> > new sysfs parameter at its default value?
> 
> Here's 3.16-rc1 with e552592e applied and jiffies_till_sched_qs=12 vs. 3.15:
> 
> > https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.15
> 
> 3.16-rc1 is actually in the lead up until the end when we're filling up
> the hyperthreads.  The same pattern holds when comparing
> 3.16-rc1+e552592e to 3.16-rc1 with ac1bea8 reverted:
> 
> > https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.16.0-rc1-wrevert
> 
> So, the current situation is generally _better_ than 3.15, except during
> the noisy ranges of the test where hyperthreading and the scheduler are
> coming in to play.

Good to know that my intuition is not yet completely broken.  ;-)

> I made the mistake of doing all my spot-checks at
> the 160-thread number, which honestly wasn't the best point to be
> looking at.

That would do it!  ;-)

> At this point, I'm satisfied with how e552592e is dealing with the
> original regression.  Thanks for all the prompt attention on this one, Paul.

Glad it worked out.  I have sent a pull request to Ingo to hopefully
get this into 3.16.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-24 Thread Dave Hansen
On 06/23/2014 05:39 PM, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
>> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
>>> Just out of curiosity, how many CPUs does your system have?  80?
>>> If 160, looks like something bad is happening at 80.
>>
>> 80 cores, 160 threads.  >80 processes/threads is where we start using
>> the second thread on the cores.  The tasks are also pinned to
>> hyperthread pairs, so they disturb each other, and the scheduler moves
>> them between threads on occasion which causes extra noise.
> 
> OK, that could explain the near flattening of throughput near 80
> processes.  Is 3.16.0-rc1-pf2 with the two RCU patches?  If so, is the
> new sysfs parameter at its default value?

Here's 3.16-rc1 with e552592e applied and jiffies_till_sched_qs=12 vs. 3.15:

> https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.15

3.16-rc1 is actually in the lead up until the end when we're filling up
the hyperthreads.  The same pattern holds when comparing
3.16-rc1+e552592e to 3.16-rc1 with ac1bea8 reverted:

> https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.16.0-rc1-wrevert

So, the current situation is generally _better_ than 3.15, except during
the noisy ranges of the test where hyperthreading and the scheduler are
coming in to play.  I made the mistake of doing all my spot-checks at
the 160-thread number, which honestly wasn't the best point to be
looking at.

At this point, I'm satisfied with how e552592e is dealing with the
original regression.  Thanks for all the prompt attention on this one, Paul.


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-24 Thread Dave Hansen
On 06/23/2014 05:39 PM, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
>> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
>>> Just out of curiosity, how many CPUs does your system have?  80?
>>> If 160, looks like something bad is happening at 80.
>>
>> 80 cores, 160 threads.  >80 processes/threads is where we start using
>> the second thread on the cores.  The tasks are also pinned to
>> hyperthread pairs, so they disturb each other, and the scheduler moves
>> them between threads on occasion which causes extra noise.
> 
> OK, that could explain the near flattening of throughput near 80
> processes.  Is 3.16.0-rc1-pf2 with the two RCU patches?

It's actually with _just_ e552592e03 applied on top of 3.16-rc1.

> If so, is the new sysfs parameter at its default value?

I didn't record that, and I've forgotten.  I'll re-run it to verify what
it was.


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
> > Just out of curiosity, how many CPUs does your system have?  80?
> > If 160, looks like something bad is happening at 80.
> 
> 80 cores, 160 threads.  >80 processes/threads is where we start using
> the second thread on the cores.  The tasks are also pinned to
> hyperthread pairs, so they disturb each other, and the scheduler moves
> them between threads on occasion which causes extra noise.

OK, that could explain the near flattening of throughput near 80
processes.  Is 3.16.0-rc1-pf2 with the two RCU patches?  If so, is the
new sysfs parameter at its default value?

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
> Just out of curiosity, how many CPUs does your system have?  80?
> If 160, looks like something bad is happening at 80.

80 cores, 160 threads.  >80 processes/threads is where we start using
the second thread on the cores.  The tasks are also pinned to
hyperthread pairs, so they disturb each other, and the scheduler moves
them between threads on occasion which causes extra noise.
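
(For reference, the core/thread pairing can be checked via sysfs, assuming
the usual topology layout; each unique line below is one core's pair of
hyperthread siblings, so a box like this should show 80 of them:

  cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -u
)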


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 04:30:12PM -0700, Dave Hansen wrote:
> On 06/23/2014 11:09 AM, Paul E. McKenney wrote:
> > So let's see...  The open1 benchmark sits in a loop doing open()
> > and close(), and probably spends most of its time in the kernel.
> > It doesn't do much context switching.  I am guessing that you don't
> > have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
> > much effect because then the first quiescent-state-forcing attempt would
> > likely finish the grace period.
> > 
> > So, given that short grace periods help other workloads (I have the
> > scars to prove it), and given that the patch fixes some real problems,
> 
> I'm not arguing that short grace periods _can_ help some workloads, or
> that one is better than the other.  The patch in question changes
> existing behavior by shortening grace periods.  This change of existing
> behavior removes some of the benefits that my system gets out of RCU.  I
> suspect this affects a lot more systems, but my core count makes it
> easier to see.

And adds some benefits for other systems.  Your tight loop on open() and
close() will be sensitive to some things, and tight loops on other syscalls
will be sensitive to others.

> Perhaps I'm misunderstanding the original patch's intent, but it seemed
> to me to be working around an overactive debug message.  While often a
> _useful_ debug message, it was firing falsely in the case being
> addressed in the patch.

You are indeed misunderstanding the original patch's intent.  It was
preventing OOMs.  The "overactive debug message" is just a warning that
OOMs are possible.

> > and given that the large number for rcutree.jiffies_till_sched_qs got
> > us within 3%, shouldn't we consider this issue closed?
> 
> With the default value for the tunable, the regression is still solidly
> over 10%.  I think we can have a reasonable argument about it once the
> default delta is down to the small single digits.

Look, you are to be congratulated for identifying a micro-benchmark that
exposes such small changes in timing, but I am not at all interested
in that micro-benchmark becoming the kernel's straightjacket.  If you
have real workloads for which this micro-benchmark is a good predictor
of performance, we can talk about quite a few additional steps to take
to tune for those workloads.

> One more thing I just realized: this isn't a scalability problem, at
> least with rcutree.jiffies_till_sched_qs=12.  There's a pretty
> consistent delta in throughput throughout the entire range of threads
> from 1->160.  See the "processes" column in the data files:
> 
> plain 3.15:
> > https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.15/open1.csv
> e552592e0383bc:
> > https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.16.0-rc1-pf2/open1.csv
> 
> or visually:
> 
> > https://www.sr71.net/~dave/intel/array-join.html?1=willitscale/systems/bigbox/3.15&2=willitscale/systems/bigbox/3.16.0-rc1-pf2=linear,threads_idle,processes_idle

Just out of curiosity, how many CPUs does your system have?  80?
If 160, looks like something bad is happening at 80.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 11:09 AM, Paul E. McKenney wrote:
> So let's see...  The open1 benchmark sits in a loop doing open()
> and close(), and probably spends most of its time in the kernel.
> It doesn't do much context switching.  I am guessing that you don't
> have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
> much effect because then the first quiescent-state-forcing attempt would
> likely finish the grace period.
> 
> So, given that short grace periods help other workloads (I have the
> scars to prove it), and given that the patch fixes some real problems,

I'm not arguing that short grace periods _can_ help some workloads, or
that one is better than the other.  The patch in question changes
existing behavior by shortening grace periods.  This change of existing
behavior removes some of the benefits that my system gets out of RCU.  I
suspect this affects a lot more systems, but my core count makes it
easier to see.

Perhaps I'm misunderstanding the original patch's intent, but it seemed
to me to be working around an overactive debug message.  While often a
_useful_ debug message, it was firing falsely in the case being
addressed in the patch.

> and given that the large number for rcutree.jiffies_till_sched_qs got
> us within 3%, shouldn't we consider this issue closed?

With the default value for the tunable, the regression is still solidly
over 10%.  I think we can have a reasonable argument about it once the
default delta is down to the small single digits.

One more thing I just realized: this isn't a scalability problem, at
least with rcutree.jiffies_till_sched_qs=12.  There's a pretty
consistent delta in throughput throughout the entire range of threads
from 1->160.  See the "processes" column in the data files:

plain 3.15:
> https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.15/open1.csv
e552592e0383bc:
> https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.16.0-rc1-pf2/open1.csv

or visually:

> https://www.sr71.net/~dave/intel/array-join.html?1=willitscale/systems/bigbox/3.15&2=willitscale/systems/bigbox/3.16.0-rc1-pf2=linear,threads_idle,processes_idle


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 10:17:19AM -0700, Dave Hansen wrote:
> On 06/23/2014 09:55 AM, Dave Hansen wrote:
> > This still has a regression.  Commit 1ed70de (from Paul's git tree),
> > gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
> > revert ac1bea85 (the original culprit) the result goes back up to 57308512.
> > 
> > So something is still going on here.
> > 
> > I'll go back and compare the grace period ages to see if I can tell what
> > is going on.
> 
> RCU_TRACE interferes with the benchmark a little bit, and it lowers the
> delta that the regression causes.  So, evaluate this cautiously.

RCU_TRACE does increase overhead somewhat, so I would expect somewhat
less difference with it enabled.  Though I am a bit surprised that the
overhead of its counters is measurable.  Or is something going on?

> According to rcu_sched/rcugp, the average "age" is:
> 
> v3.16-rc1, with ac1bea85 reverted:  10.7
> v3.16-rc1, plus e552592e:            6.1
> 
> Paul, have you been keeping an eye on rcugp?  Even if I run my system
> with only 10 threads, I still see this basic pattern where the average
> "age" is lower when I see lower performance.  It seems to be a
> reasonable proxy that could be used instead of waiting on me to re-run
> tests.

I do print out GPs/sec when running rcutorture, and they do vary somewhat,
but mostly with different Kconfig parameter settings.  Plus rcutorture
ramps up and down, so the GPs/sec is less than what you might see in a
system running an unvarying workload.  That said, increasing grace-period
latency is not always good for performance, in fact, I usually get beaten
up for grace periods completing too quickly rather than too slowly.
This current issue is one of the rare exceptions, perhaps even the
only exception.

So let's see...  The open1 benchmark sits in a loop doing open()
and close(), and probably spends most of its time in the kernel.
It doesn't do much context switching.  I am guessing that you don't
have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
much effect because then the first quiescent-state-forcing attempt would
likely finish the grace period.
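
(Both assumptions are easy to check, assuming the running kernel's config
is installed under /boot and the rcutree parameter is exposed via sysfs
as usual:

  grep NO_HZ_FULL /boot/config-$(uname -r)
  cat /sys/module/rcutree/parameters/jiffies_till_sched_qs
)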

So, given that short grace periods help other workloads (I have the
scars to prove it), and given that the patch fixes some real problems,
and given that the large number for rcutree.jiffies_till_sched_qs got
us within 3%, shouldn't we consider this issue closed?

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 10:19:05AM -0700, Andi Kleen wrote:
> > In 3.10, RCU had 14,046 lines of code, not counting documentation and
> > test scripting.  In 3.15, RCU had 13,208 lines of code, again not counting
> > documentation and test scripting.  That is a decrease of almost 1KLoC,
> > so your wish is granted.
> 
> Ok that's good progress.

Glad you like it!

> > CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
> > and CONFIG_RCU_NOCB_CPU_ALL.  It also might be reasonable to replace
> > uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
> > CONFIG_PROVE_RCU to be eliminated.  CONFIG_PROVE_RCU_DELAY hasn't proven
> > very good at finding bugs, so I am considering eliminating it as well.
> > Given recent and planned changes related to RCU's stall-warning stack
> > dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
> > CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
> > (And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
> > for some time beforehand.)  I have also been considering getting rid of
> > CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.
> 
> Yes please to all.
> 
> Sounds good thanks.

Very good!

Please note that this will take some time.  For example, getting rid
of CONFIG_RCU_CPU_STALL_TIMEOUT resulted in a series of bugs over a
period of well over a year.  It turned out that very few people were
exercising it while it was non-default.  Hopefully, Fengguang Wu's
RANDCONFIG testing is testing these things more these days.

Also, some of them might have non-obvious effects on performance,
witness the cond_resched() fun and excitement.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 10:16 AM, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 09:55:21AM -0700, Dave Hansen wrote:
>> This still has a regression.  Commit 1ed70de (from Paul's git tree),
>> gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
>> revert ac1bea85 (the original culprit) the result goes back up to 57308512.
>>
>> So something is still going on here.
> 
> And commit 1ed70de is in fact the right one, so...
> 
> The rcutree.jiffies_till_sched_qs boot/sysfs parameter controls how
> long RCU waits before asking for quiescent states.  The default is
> currently HZ/20.  Does increasing this parameter help?  Easy for me to
> increase the default if it does.

Making it an insane value:

echo 12 > /sys/module/rcutree/parameters/jiffies_till_sched_qs
average:52248706
echo 999 > /sys/module/rcutree/parameters/jiffies_till_sched_qs
average:55712533

gets us back up _closer_ to our original 57M number, but it's still not
quite there.


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Andi Kleen
> In 3.10, RCU had 14,046 lines of code, not counting documentation and
> test scripting.  In 3.15, RCU had 13,208 lines of code, again not counting
> documentation and test scripting.  That is a decrease of almost 1KLoC,
> so your wish is granted.

Ok that's good progress.

> CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
> and CONFIG_RCU_NOCB_CPU_ALL.  It also might be reasonable to replace
> uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
> CONFIG_PROVE_RCU to be eliminated.  CONFIG_PROVE_RCU_DELAY hasn't proven
> very good at finding bugs, so I am considering eliminating it as well.
> Given recent and planned changes related to RCU's stall-warning stack
> dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
> CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
> (And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
> for some time beforehand.)  I have also been considering getting rid of
> CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.

Yes please to all.

Sounds good thanks.

-Andi


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 09:55 AM, Dave Hansen wrote:
> This still has a regression.  Commit 1ed70de (from Paul's git tree),
> gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
> revert ac1bea85 (the original culprit) the result goes back up to 57308512.
> 
> So something is still going on here.
> 
> I'll go back and compare the grace period ages to see if I can tell what
> is going on.

RCU_TRACE interferes with the benchmark a little bit, and it lowers the
delta that the regression causes.  So, evaluate this cautiously.

According to rcu_sched/rcugp, the average "age" is:

v3.16-rc1, with ac1bea85 reverted:  10.7
v3.16-rc1, plus e552592e:            6.1

Paul, have you been keeping an eye on rcugp?  Even if I run my system
with only 10 threads, I still see this basic pattern where the average
"age" is lower when I see lower performance.  It seems to be a
reasonable proxy that could be used instead of waiting on me to re-run
tests.
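
(The numbers above come from the RCU_TRACE debugfs files; with debugfs
mounted in the usual place, something like

  cat /sys/kernel/debug/rcu/rcu_sched/rcugp

shows the grace-period counters behind them, though the exact path depends
on the kernel config.)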


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 09:55:21AM -0700, Dave Hansen wrote:
> This still has a regression.  Commit 1ed70de (from Paul's git tree),
> gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
> revert ac1bea85 (the original culprit) the result goes back up to 57308512.
> 
> So something is still going on here.

And commit 1ed70de is in fact the right one, so...

The rcutree.jiffies_till_sched_qs boot/sysfs parameter controls how
long RCU waits before asking for quiescent states.  The default is
currently HZ/20.  Does increasing this parameter help?  Easy for me to
increase the default if it does.
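
(For the record, it can be set either way, assuming the usual
module-parameter paths:

  rcutree.jiffies_till_sched_qs=12        # on the kernel command line
  echo 12 > /sys/module/rcutree/parameters/jiffies_till_sched_qs   # at runtime
)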

> I'll go back and compare the grace period ages to see if I can tell what
> is going on.

That would be very helpful, thank you!

Thanx, Paul

> --
> 
> root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s30
> testcase:Separate file open/close
> warmup
> min:319787 max:386071 total:57464989
> min:307235 max:351905 total:53289241
> min:291765 max:342439 total:51364514
> min:297948 max:349214 total:52552745
> min:294950 max:340132 total:51586179
> min:290791 max:339958 total:50793238
> measurement
> min:298851 max:346868 total:51951469
> min:292879 max:340704 total:50817269
> min:305768 max:347381 total:52655149
> min:301046 max:345616 total:52449584
> min:300428 max:345293 total:52021166
> min:293404 max:337973 total:51012206
> min:303569 max:348191 total:52713179
> min:305523 max:357448 total:53707053
> min:307040 max:356937 total:53271883
> min:302134 max:347923 total:52477496
> min:297823 max:340488 total:51884417
> min:286981 max:338246 total:50496850
> min:295920 max:349405 total:51792563
> min:302749 max:343780 total:52305074
> min:298497 max:345208 total:52035318
> min:291393 max:332195 total:50163093
> min:303561 max:353396 total:52983515
> min:301613 max:352988 total:53029200
> min:300693 max:343726 total:52057334
> min:296801 max:352408 total:52028824
> min:304834 max:358236 total:53526191
> min:297933 max:338351 total:51578481
> min:299571 max:341679 total:51817941
> min:308225 max:354075 total:53760098
> min:296262 max:346965 total:51856596
> min:309196 max:356432 total:53455141
> min:295604 max:341814 total:51449366
> min:296931 max:345961 total:52051944
> min:300533 max:350304 total:52652951
> min:299887 max:350764 total:52955064
> average:52231880
> root@bigbox:~/will-it-scale# uname -a
> Linux bigbox 3.16.0-rc1-2-g1ed70de #176 SMP Mon Jun 23 09:04:02 PDT
> 2014 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s 30
> testcase:Separate file open/close
> warmup
> min:346853 max:416035 total:62412724
> min:281766 max:344178 total:52207349
> min:311187 max:374918 total:57149451
> min:326309 max:391061 total:60200366
> min:310327 max:375619 total:56744357
> min:323336 max:393415 total:59619164
> measurement
> min:323934 max:393718 total:59665843
> min:307247 max:368313 total:55681436
> min:318210 max:378048 total:57849321
> min:314494 max:383884 total:57741073
> min:316497 max:385223 total:58565045
> min:320490 max:397636 total:59003133
> min:318695 max:391712 total:57789360
> min:304368 max:378540 total:56412216
> min:314609 max:384462 total:58298008
> min:317235 max:384205 total:58812490
> min:323556 max:388014 total:59468492
> min:301011 max:362664 total:55381779
> min:301113 max:364712 total:55375445
> min:311730 max:369336 total:56640530
> min:316951 max:381341 total:58649244
> min:317077 max:383943 total:58132878
> min:316970 max:390127 total:59039489
> min:315895 max:375937 total:57404755
> min:295500 max:346523 total:53086962
> min:310882 max:371923 total:56612144
> min:321837 max:390544 total:59651640
> min:303481 max:368716 total:56135908
> min:306437 max:367658 total:56388659
> min:307343 max:373645 total:56893136
> min:298703 max:358090 total:54152268
> min:319162 max:386583 total:58999429
> min:304881 max:361968 total:55286607
> min:311034 max:381100 total:57846182
> min:312786 max:378270 total:57964383
> min:311740 max:367481 total:56327526
> average:57308512
> root@bigbox:~/will-it-scale# uname -a
> Linux bigbox 3.16.0-rc1-dirty #177 SMP Mon Jun 23 09:13:59 PDT 2014
> x86_64 x86_64 x86_64 GNU/Linux
> 



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
This still has a regression.  Commit 1ed70de (from Paul's git tree),
gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
revert ac1bea85 (the original culprit) the result goes back up to 57308512.

So something is still going on here.

I'll go back and compare the grace period ages to see if I can tell what
is going on.

--

root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s30
testcase:Separate file open/close
warmup
min:319787 max:386071 total:57464989
min:307235 max:351905 total:53289241
min:291765 max:342439 total:51364514
min:297948 max:349214 total:52552745
min:294950 max:340132 total:51586179
min:290791 max:339958 total:50793238
measurement
min:298851 max:346868 total:51951469
min:292879 max:340704 total:50817269
min:305768 max:347381 total:52655149
min:301046 max:345616 total:52449584
min:300428 max:345293 total:52021166
min:293404 max:337973 total:51012206
min:303569 max:348191 total:52713179
min:305523 max:357448 total:53707053
min:307040 max:356937 total:53271883
min:302134 max:347923 total:52477496
min:297823 max:340488 total:51884417
min:286981 max:338246 total:50496850
min:295920 max:349405 total:51792563
min:302749 max:343780 total:52305074
min:298497 max:345208 total:52035318
min:291393 max:332195 total:50163093
min:303561 max:353396 total:52983515
min:301613 max:352988 total:53029200
min:300693 max:343726 total:52057334
min:296801 max:352408 total:52028824
min:304834 max:358236 total:53526191
min:297933 max:338351 total:51578481
min:299571 max:341679 total:51817941
min:308225 max:354075 total:53760098
min:296262 max:346965 total:51856596
min:309196 max:356432 total:53455141
min:295604 max:341814 total:51449366
min:296931 max:345961 total:52051944
min:300533 max:350304 total:52652951
min:299887 max:350764 total:52955064
average:52231880
root@bigbox:~/will-it-scale# uname -a
Linux bigbox 3.16.0-rc1-2-g1ed70de #176 SMP Mon Jun 23 09:04:02 PDT
2014 x86_64 x86_64 x86_64 GNU/Linux


root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s 30
testcase:Separate file open/close
warmup
min:346853 max:416035 total:62412724
min:281766 max:344178 total:52207349
min:311187 max:374918 total:57149451
min:326309 max:391061 total:60200366
min:310327 max:375619 total:56744357
min:323336 max:393415 total:59619164
measurement
min:323934 max:393718 total:59665843
min:307247 max:368313 total:55681436
min:318210 max:378048 total:57849321
min:314494 max:383884 total:57741073
min:316497 max:385223 total:58565045
min:320490 max:397636 total:59003133
min:318695 max:391712 total:57789360
min:304368 max:378540 total:56412216
min:314609 max:384462 total:58298008
min:317235 max:384205 total:58812490
min:323556 max:388014 total:59468492
min:301011 max:362664 total:55381779
min:301113 max:364712 total:55375445
min:311730 max:369336 total:56640530
min:316951 max:381341 total:58649244
min:317077 max:383943 total:58132878
min:316970 max:390127 total:59039489
min:315895 max:375937 total:57404755
min:295500 max:346523 total:53086962
min:310882 max:371923 total:56612144
min:321837 max:390544 total:59651640
min:303481 max:368716 total:56135908
min:306437 max:367658 total:56388659
min:307343 max:373645 total:56893136
min:298703 max:358090 total:54152268
min:319162 max:386583 total:58999429
min:304881 max:361968 total:55286607
min:311034 max:381100 total:57846182
min:312786 max:378270 total:57964383
min:311740 max:367481 total:56327526
average:57308512
root@bigbox:~/will-it-scale# uname -a
Linux bigbox 3.16.0-rc1-dirty #177 SMP Mon Jun 23 09:13:59 PDT 2014
x86_64 x86_64 x86_64 GNU/Linux


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:51:08AM -0500, Christoph Lameter wrote:
> On Mon, 23 Jun 2014, Peter Zijlstra wrote:
> 
> > On the topic of these threads; I recently noticed RCU grew a metric ton
> > of them, I found some 75 rcu kthreads on my box, wth up with that?
> 
> Would kworker threads work for rcu? That would also avoid the shifting
> around of RCU threads for NOHZ configurations (which seems to have to be
> done manually right now). The kworker subsystem work that allows
> restriction to non NOHZ hardware threads would then also allow the
> shifting of the rcu threads which would simplify the whole endeavor.

Short term, I am planning to use a different method to automate the
binding of the rcuo kthreads to housekeeping CPUs, but longer term,
it might well make a lot of sense to move to workqueues and the kworker
threads.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:49:31AM -0700, Andi Kleen wrote:
> > On the topic of these threads; I recently noticed RCU grew a metric ton
> > of them, I found some 75 rcu kthreads on my box, wth up with that?
> 
> It seems like RCU is growing in complexity all the time.
> 
> Can it be put on a diet in general? 

In 3.10, RCU had 14,046 lines of code, not counting documentation and
test scripting.  In 3.15, RCU had 13,208 lines of code, again not counting
documentation and test scripting.  That is a decrease of almost 1KLoC,
so your wish is granted.

In the future, I hope to be able to make NOCB the default and remove the
softirq-based callback handling, which should shrink things a bit further.
Of course, continued work to make NOCB handle various corner cases will
offset that expected shrinkage, though hopefully not be too much.

Of course, I cannot resist taking your call for RCU simplicity as a vote
against Peter's proposal for aligning the rcu_node tree to the hardware's
electrical structure.  ;-)

> No more new CONFIGs please either.

Since 3.10, I have gotten rid of CONFIG_RCU_CPU_STALL_TIMEOUT.

Over time, it might be possible to make CONFIG_RCU_FAST_NO_HZ the default,
and thus eliminate that Kconfig parameter.  As noted above, ditto for
CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
and CONFIG_RCU_NOCB_CPU_ALL.  It also might be reasonable to replace
uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
CONFIG_PROVE_RCU to be eliminated.  CONFIG_PROVE_RCU_DELAY hasn't proven
very good at finding bugs, so I am considering eliminating it as well.
Given recent and planned changes related to RCU's stall-warning stack
dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
(And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
for some time beforehand.)  I have also been considering getting rid of
CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.
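
(For anyone wondering which of these they are actually building with,
assuming the running kernel's config is available under /boot:

  grep -E 'RCU_(NOCB_CPU|FANOUT_EXACT|CPU_STALL|FAST_NO_HZ)|PROVE_RCU' \
      /boot/config-$(uname -r)
)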

That should make room for additional RCU Kconfig parameters as needed
for specialized or high-risk new functionality, when and if required.

Thanx, Paul

> -Andi
> -- 
> a...@linux.intel.com -- Speaking for myself only
> 



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Andi Kleen
> On the topic of these threads; I recently noticed RCU grew a metric ton
> of them, I found some 75 rcu kthreads on my box, wth up with that?

It seems like RCU is growing in complexity all the time.

Can it be put on a diet in general? 

No more new CONFIGs please either.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:53:12AM -0500, Christoph Lameter wrote:
> On Fri, 20 Jun 2014, Paul E. McKenney wrote:
> 
> > > I like this approach *far* better.  This is the kind of thing I had in
> > > mind when I suggested using the fqs machinery: remove the poll entirely
> > > and just thwack a CPU if it takes too long without a quiescent state.
> > > Reviewed-by: Josh Triplett 
> >
> > Glad you like it.  Not a fan of the IPI myself, but then again if you
> > are spending that much time looping in the kernel, an extra IPI is the
> > least of your problems.
> 
> Good. The IPI is only used when actually necessary. The code inserted
> was always there and always executed although rarely needed.

Interesting.  I actually proposed this approach several times in the
earlier thread, but to deafening silence: https://lkml.org/lkml/2014/6/18/836,
https://lkml.org/lkml/2014/6/17/793, and https://lkml.org/lkml/2014/6/20/479.

I guess this further validates interpreting silence as assent.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:26:15AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> > Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
> > fixed a problem where a CPU looping in the kernel with but one runnable
> > task would give RCU CPU stall warnings, even if the in-kernel loop
> > contained cond_resched() calls.  Unfortunately, in so doing, it introduced
> > performance regressions in Anton Blanchard's will-it-scale "open1" test.
> > The problem appears to be not so much the increased cond_resched() path
> > length as an increase in the rate at which grace periods complete, which
> > increased per-update grace-period overhead.
> > 
> > This commit takes a different approach to fixing this bug, mainly by
> > moving the RCU-visible quiescent state from cond_resched() to
> > rcu_note_context_switch(), and by further reducing the check to a
> > simple non-zero test of a single per-CPU variable.  However, this
> > approach requires that the force-quiescent-state processing send
> > resched IPIs to the offending CPUs.  These will be sent only once
> > the grace period has reached an age specified by the boot/sysfs
> > parameter rcutree.jiffies_till_sched_qs, or once the grace period
> > reaches an age halfway to the point at which RCU CPU stall warnings
> > will be emitted, whichever comes first.
> 
> Right, and I suppose the force quiescent stuff is triggered from the
> tick, which in turn wakes some of these rcu kthreads, which on UP would
> cause scheduling themselves.

Yep, which is another reason why this commit only affects TREE_RCU and
TREE_PREEMPT_RCU, not TINY_RCU.

> On the topic of these threads; I recently noticed RCU grew a metric ton
> of them, I found some 75 rcu kthreads on my box, wth up with that?

The most likely cause of a recent increase would be if you now have
CONFIG_RCU_NOCB_CPU_ALL=y, which would give you a pair of kthreads per
CPU for callback offloading.  Plus an additional kthread per CPU (for
a total of three new kthreads per CPU) for CONFIG_PREEMPT=y.  These would
be the rcuo kthreads.
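
(A quick way to count them, assuming the naming above:

  ps -e -o comm= | grep -c '^rcuo'
)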

Are they causing you trouble?

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Christoph Lameter
On Fri, 20 Jun 2014, Paul E. McKenney wrote:

> > I like this approach *far* better.  This is the kind of thing I had in
> > mind when I suggested using the fqs machinery: remove the poll entirely
> > and just thwack a CPU if it takes too long without a quiescent state.
> > Reviewed-by: Josh Triplett 
>
> Glad you like it.  Not a fan of the IPI myself, but then again if you
> are spending that much time looping in the kernel, an extra IPI is the
> least of your problems.

Good. The IPI is only used when actually necessary. The code inserted
was always there and always executed although rarely needed.



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Christoph Lameter
On Mon, 23 Jun 2014, Peter Zijlstra wrote:

> On the topic of these threads; I recently noticed RCU grew a metric ton
> of them, I found some 75 rcu kthreads on my box, wth up with that?

Would kworker threads work for rcu? That would also avoid the shifting
around of RCU threads for NOHZ configurations (which seems to have to be
done manually right now). The kworker subsystem work that allows
restriction to non NOHZ hardware threads would then also allow the
shifting of the rcu threads which would simplify the whole endeavor.





Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Peter Zijlstra
On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
> fixed a problem where a CPU looping in the kernel with but one runnable
> task would give RCU CPU stall warnings, even if the in-kernel loop
> contained cond_resched() calls.  Unfortunately, in so doing, it introduced
> performance regressions in Anton Blanchard's will-it-scale "open1" test.
> The problem appears to be not so much the increased cond_resched() path
> length as an increase in the rate at which grace periods complete, which
> increased per-update grace-period overhead.
> 
> This commit takes a different approach to fixing this bug, mainly by
> moving the RCU-visible quiescent state from cond_resched() to
> rcu_note_context_switch(), and by further reducing the check to a
> simple non-zero test of a single per-CPU variable.  However, this
> approach requires that the force-quiescent-state processing send
> resched IPIs to the offending CPUs.  These will be sent only once
> the grace period has reached an age specified by the boot/sysfs
> parameter rcutree.jiffies_till_sched_qs, or once the grace period
> reaches an age halfway to the point at which RCU CPU stall warnings
> will be emitted, whichever comes first.

Right, and I suppose the force quiescent stuff is triggered from the
tick, which in turn wakes some of these rcu kthreads, which on UP would
cause scheduling themselves.

On the topic of these threads; I recently noticed RCU grew a metric ton
of them, I found some 75 rcu kthreads on my box, wth up with that?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Christoph Lameter
On Mon, 23 Jun 2014, Peter Zijlstra wrote:

 On the topic of these threads; I recently noticed RCU grew a metric ton
 of them, I found some 75 rcu kthreads on my box, wth up with that?

Would kworker threads work for rcu? That would also avoid the shifting
around of RCU threads for NOHZ configurations (which seems to have to be
done manually right now). The kworker subsystem work that allows
restriction to non NOHZ hardware threads would then also allow the
shifting of the rcu threads which would simplify the whole endeavor.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Christoph Lameter
On Fri, 20 Jun 2014, Paul E. McKenney wrote:

  I like this approach *far* better.  This is the kind of thing I had in
  mind when I suggested using the fqs machinery: remove the poll entirely
  and just thwack a CPU if it takes too long without a quiescent state.
  Reviewed-by: Josh Triplett j...@joshtriplett.org

 Glad you like it.  Not a fan of the IPI myself, but then again if you
 are spending that must time looping in the kernel, an extra IPI is the
 least of your problems.

Good. The IPI is only used when actually necessary. The code inserted
was always there and always executed although rarely needed.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:26:15AM +0200, Peter Zijlstra wrote:
 On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
  Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
  fixed a problem where a CPU looping in the kernel with but one runnable
  task would give RCU CPU stall warnings, even if the in-kernel loop
  contained cond_resched() calls.  Unfortunately, in so doing, it introduced
  performance regressions in Anton Blanchard's will-it-scale open1 test.
  The problem appears to be not so much the increased cond_resched() path
  length as an increase in the rate at which grace periods complete, which
  increased per-update grace-period overhead.
  
  This commit takes a different approach to fixing this bug, mainly by
  moving the RCU-visible quiescent state from cond_resched() to
  rcu_note_context_switch(), and by further reducing the check to a
  simple non-zero test of a single per-CPU variable.  However, this
  approach requires that the force-quiescent-state processing send
  resched IPIs to the offending CPUs.  These will be sent only once
  the grace period has reached an age specified by the boot/sysfs
  parameter rcutree.jiffies_till_sched_qs, or once the grace period
  reaches an age halfway to the point at which RCU CPU stall warnings
  will be emitted, whichever comes first.
 
 Right, and I suppose the force quiescent stuff is triggered from the
 tick, which in turn wakes some of these rcu kthreads, which on UP would
 cause scheduling themselves.

Yep, which is another reason why this commit only affects TREE_RCU and
TREE_PREEMPT_RCU, not TINY_RCU.

 On the topic of these threads; I recently noticed RCU grew a metric ton
 of them, I found some 75 rcu kthreads on my box, wth up with that?

The most likely cause of a recent increase would be if you now have
CONFIG_RCU_NOCB_CPU_ALL=y, which would give you a pair of kthreads per
CPU for callback offloading.  Plus an additional kthread per CPU (for
a total of three new kthreads per CPU) for CONFIG_PREEMPT=y.  These would
be the rcuo kthreads.

Are they causing you trouble?

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:53:12AM -0500, Christoph Lameter wrote:
 On Fri, 20 Jun 2014, Paul E. McKenney wrote:
 
   I like this approach *far* better.  This is the kind of thing I had in
   mind when I suggested using the fqs machinery: remove the poll entirely
   and just thwack a CPU if it takes too long without a quiescent state.
   Reviewed-by: Josh Triplett j...@joshtriplett.org
 
  Glad you like it.  Not a fan of the IPI myself, but then again if you
  are spending that must time looping in the kernel, an extra IPI is the
  least of your problems.
 
 Good. The IPI is only used when actually necessary. The code inserted
 was always there and always executed although rarely needed.

Interesting.  I actually proposed this approach several times in the
earlier thread, but to deafing silence: https://lkml.org/lkml/2014/6/18/836,
https://lkml.org/lkml/2014/6/17/793, and https://lkml.org/lkml/2014/6/20/479.

I guess this further validates interpreting silence as assent.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Andi Kleen
 On the topic of these threads; I recently noticed RCU grew a metric ton
 of them, I found some 75 rcu kthreads on my box, wth up with that?

It seems like RCU is growing in complexity all the time.

Can it be put on a diet in general? 

No more new CONFIGs please either.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:49:31AM -0700, Andi Kleen wrote:
  On the topic of these threads; I recently noticed RCU grew a metric ton
  of them, I found some 75 rcu kthreads on my box, wth up with that?
 
 It seems like RCU is growing in complexity all the time.
 
 Can it be put on a diet in general? 

In 3.10, RCU had 14,046 lines of code, not counting documentation and
test scripting.  In 3.15, RCU had 13,208 lines of code, again not counting
documentation and test scripting.  That is a decrease of almost 1KLoC,
so your wish is granted.

In the future, I hope to be able to make NOCB the default and remove the
softirq-based callback handling, which should shrink things a bit further.
Of course, continued work to make NOCB handle various corner cases will
offset that expected shrinkage, though hopefully not be too much.

Of course, I cannot resist taking your call for RCU simplicity as a vote
against Peter's proposal for aligning the rcu_node tree to the hardware's
electrical structure.  ;-)

 No more new CONFIGs please either.

Since 3.10, I have gotten rid of CONFIG_RCU_CPU_STALL_TIMEOUT.

Over time, it might be possible to make CONFIG_RCU_FAST_NO_HZ the default,
and thus eliminate that Kconfig parameter.  As noted about, ditto for
CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
and CONFIG_RCU_NOCB_CPU_ALL.  It also might be reasonable to replace
uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
CONFIG_PROVE_RCU to be eliminated.  CONFIG_PROVE_RCU_DELAY hasn't proven
very good at finding bugs, so I am considering eliminating it as well.
Given recent and planned changes related to RCU's stall-warning stack
dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
(And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
for some time beforehand.)  I have also been considering getting rid of
CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.

That should make room for additional RCU Kconfig parameters as needed
for specialized or high-risk new functionality, when and if required.

Thanx, Paul

 -Andi
 -- 
 a...@linux.intel.com -- Speaking for myself only
 



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 08:51:08AM -0500, Christoph Lameter wrote:
 On Mon, 23 Jun 2014, Peter Zijlstra wrote:
 
  On the topic of these threads; I recently noticed RCU grew a metric ton
  of them, I found some 75 rcu kthreads on my box, wth up with that?
 
 Would kworker threads work for rcu? That would also avoid the shifting
 around of RCU threads for NOHZ configurations (which seems to have to be
 done manually right now). The kworker subsystem work that allows
 restriction to non NOHZ hardware threads would then also allow the
 shifting of the rcu threads which would simplify the whole endeavor.

Short term, I am planning to use a different method to automate the
binding of the rcuo kthreads to housekeeping CPUs, but longer term,
it might well make a lot of sense to move to workqueues and the kworker
threads.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
This still has a regression.  Commit 1ed70de (from Paul's git tree),
gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
revert ac1bea85 (the original culprit) the result goes back up to 57308512.

So something is still going on here.

I'll go back and compare the grace period ages to see if I can tell what
is going on.

--

root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s30
testcase:Separate file open/close
warmup
min:319787 max:386071 total:57464989
min:307235 max:351905 total:53289241
min:291765 max:342439 total:51364514
min:297948 max:349214 total:52552745
min:294950 max:340132 total:51586179
min:290791 max:339958 total:50793238
measurement
min:298851 max:346868 total:51951469
min:292879 max:340704 total:50817269
min:305768 max:347381 total:52655149
min:301046 max:345616 total:52449584
min:300428 max:345293 total:52021166
min:293404 max:337973 total:51012206
min:303569 max:348191 total:52713179
min:305523 max:357448 total:53707053
min:307040 max:356937 total:53271883
min:302134 max:347923 total:52477496
min:297823 max:340488 total:51884417
min:286981 max:338246 total:50496850
min:295920 max:349405 total:51792563
min:302749 max:343780 total:52305074
min:298497 max:345208 total:52035318
min:291393 max:332195 total:50163093
min:303561 max:353396 total:52983515
min:301613 max:352988 total:53029200
min:300693 max:343726 total:52057334
min:296801 max:352408 total:52028824
min:304834 max:358236 total:53526191
min:297933 max:338351 total:51578481
min:299571 max:341679 total:51817941
min:308225 max:354075 total:53760098
min:296262 max:346965 total:51856596
min:309196 max:356432 total:53455141
min:295604 max:341814 total:51449366
min:296931 max:345961 total:52051944
min:300533 max:350304 total:52652951
min:299887 max:350764 total:52955064
average:52231880
root@bigbox:~/will-it-scale# uname -a
Linux bigbox 3.16.0-rc1-2-g1ed70de #176 SMP Mon Jun 23 09:04:02 PDT
2014 x86_64 x86_64 x86_64 GNU/Linux


root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s 30
testcase:Separate file open/close
warmup
min:346853 max:416035 total:62412724
min:281766 max:344178 total:52207349
min:311187 max:374918 total:57149451
min:326309 max:391061 total:60200366
min:310327 max:375619 total:56744357
min:323336 max:393415 total:59619164
measurement
min:323934 max:393718 total:59665843
min:307247 max:368313 total:55681436
min:318210 max:378048 total:57849321
min:314494 max:383884 total:57741073
min:316497 max:385223 total:58565045
min:320490 max:397636 total:59003133
min:318695 max:391712 total:57789360
min:304368 max:378540 total:56412216
min:314609 max:384462 total:58298008
min:317235 max:384205 total:58812490
min:323556 max:388014 total:59468492
min:301011 max:362664 total:55381779
min:301113 max:364712 total:55375445
min:311730 max:369336 total:56640530
min:316951 max:381341 total:58649244
min:317077 max:383943 total:58132878
min:316970 max:390127 total:59039489
min:315895 max:375937 total:57404755
min:295500 max:346523 total:53086962
min:310882 max:371923 total:56612144
min:321837 max:390544 total:59651640
min:303481 max:368716 total:56135908
min:306437 max:367658 total:56388659
min:307343 max:373645 total:56893136
min:298703 max:358090 total:54152268
min:319162 max:386583 total:58999429
min:304881 max:361968 total:55286607
min:311034 max:381100 total:57846182
min:312786 max:378270 total:57964383
min:311740 max:367481 total:56327526
average:57308512
root@bigbox:~/will-it-scale# uname -a
Linux bigbox 3.16.0-rc1-dirty #177 SMP Mon Jun 23 09:13:59 PDT 2014
x86_64 x86_64 x86_64 GNU/Linux


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 09:55:21AM -0700, Dave Hansen wrote:
 This still has a regression.  Commit 1ed70de (from Paul's git tree),
 gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
 revert ac1bea85 (the original culprit) the result goes back up to 57308512.
 
 So something is still going on here.

And commit 1ed70de is in fact the right one, so...

The rcutree.jiffies_till_sched_qs boot/sysfs parameter controls how
long RCU waits before asking for quiescent states.  The default is
currently HZ/20.  Does increasing this parameter help?  Easy for me to
increase the default if it does.
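
Roughly speaking, the check this parameter feeds looks like the following
simplified sketch -- simplified names, not the actual tree.c code, with the
"halfway to the stall-warning timeout" fallback taken from the patch's
commit log:

#include <stdbool.h>

/* Has the current grace period waited long enough that RCU should start
 * soliciting quiescent-state help (and eventually resched IPIs)? */
static bool gp_old_enough_to_solicit_help(unsigned long now,
                                          unsigned long gp_start,
                                          unsigned long jiffies_till_sched_qs,
                                          unsigned long stall_timeout)
{
        unsigned long age = now - gp_start;   /* grace-period age in jiffies */

        return age >= jiffies_till_sched_qs || age >= stall_timeout / 2;
}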

 I'll go back and compare the grace period ages to see if I can tell what
 is going on.

That would be very helpful, thank you!

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 09:55 AM, Dave Hansen wrote:
 This still has a regression.  Commit 1ed70de (from Paul's git tree),
 gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
 revert ac1bea85 (the original culprit) the result goes back up to 57308512.
 
 So something is still going on here.
 
 I'll go back and compare the grace period ages to see if I can tell what
 is going on.

RCU_TRACE interferes with the benchmark a little bit, and it lowers the
delta that the regression causes.  So, evaluate this cautiously.

According to rcu_sched/rcugp, the average age is:

v3.16-rc1, with ac1bea85 reverted:  10.7
v3.16-rc1, plus e552592e:            6.1

Paul, have you been keeping an eye on rcugp?  Even if I run my system
with only 10 threads, I still see this basic pattern where the average
age is lower when I see lower performance.  It seems to be a
reasonable proxy that could be used instead of waiting on me to re-run
tests.




Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Andi Kleen
 In 3.10, RCU had 14,046 lines of code, not counting documentation and
 test scripting.  In 3.15, RCU had 13,208 lines of code, again not counting
 documentation and test scripting.  That is a decrease of almost 1KLoC,
 so your wish is granted.

Ok that's good progress.

 CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
 and CONFIG_RCU_NOCB_CPU_ALL.  It also might be reasonable to replace
 uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
 CONFIG_PROVE_RCU to be eliminated.  CONFIG_PROVE_RCU_DELAY hasn't proven
 very good at finding bugs, so I am considering eliminating it as well.
 Given recent and planned changes related to RCU's stall-warning stack
 dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
 CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
 (And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
 for some time beforehand.)  I have also been considering getting rid of
 CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.

Yes please to all.

Sounds good thanks.

-Andi


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 10:16 AM, Paul E. McKenney wrote:
 On Mon, Jun 23, 2014 at 09:55:21AM -0700, Dave Hansen wrote:
 This still has a regression.  Commit 1ed70de (from Paul's git tree),
 gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
 revert ac1bea85 (the original culprit) the result goes back up to 57308512.

 So something is still going on here.
 
 And commit 1ed70de is in fact the right one, so...
 
 The rcutree.jiffies_till_sched_qs boot/sysfs parameter controls how
 long RCU waits before asking for quiescent states.  The default is
 currently HZ/20.  Does increasing this parameter help?  Easy for me to
 increase the default if it does.

Making it an insane value:

echo 12 > /sys/module/rcutree/parameters/jiffies_till_sched_qs
average:52248706
echo 999 > /sys/module/rcutree/parameters/jiffies_till_sched_qs
average:55712533

gets us back up _closer_ to our original 57M number, but it's still not
quite there.
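
(Against the reverted baseline of 57308512 quoted earlier, 55712533 is
roughly 2.8% down and 52248706 roughly 8.8% down.)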


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 10:19:05AM -0700, Andi Kleen wrote:
  In 3.10, RCU had 14,046 lines of code, not counting documentation and
  test scripting.  In 3.15, RCU had 13,208 lines of code, again not counting
  documentation and test scripting.  That is a decrease of almost 1KLoC,
  so your wish is granted.
 
 Ok that's good progress.

Glad you like it!

  CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
  and CONFIG_RCU_NOCB_CPU_ALL.  It also might be reasonable to replace
  uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
  CONFIG_PROVE_RCU to be eliminated.  CONFIG_PROVE_RCU_DELAY hasn't proven
  very good at finding bugs, so I am considering eliminating it as well.
  Given recent and planned changes related to RCU's stall-warning stack
  dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
  CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
  (And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
  for some time beforehand.)  I have also been considering getting rid of
  CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.
 
 Yes please to all.
 
 Sounds good thanks.

Very good!

Please note that this will take some time.  For example, getting rid
of CONFIG_RCU_CPU_STALL_TIMEOUT resulted in a series of bugs over a
period of a well over a year.  It turned out that very few people were
exercising it while it was non-default.  Hopefully, Fengguang Wu's
RANDCONFIG testing is exercising these things more these days.

Also, some of them might have non-obvious effects on performance,
witness the cond_resched() fun and excitement.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 10:17:19AM -0700, Dave Hansen wrote:
 On 06/23/2014 09:55 AM, Dave Hansen wrote:
  This still has a regression.  Commit 1ed70de (from Paul's git tree),
  gets a result of 52231880.  If I back up two commits to v3.16-rc1 and
  revert ac1bea85 (the original culprit) the result goes back up to 57308512.
  
  So something is still going on here.
  
  I'll go back and compare the grace period ages to see if I can tell what
  is going on.
 
 RCU_TRACE interferes with the benchmark a little bit, and it lowers the
 delta that the regression causes.  So, evaluate this cautiously.

RCU_TRACE does increase overhead somewhat, so I would expect somewhat
less difference with it enabled.  Though I am a bit surprised that the
overhead of its counters is measurable.  Or is something going on?

 According to rcu_sched/rcugp, the average age is:
 
 v3.16-rc1, with ac1bea85 reverted:  10.7
 v3.16-rc1, plus e552592e:            6.1
 
 Paul, have you been keeping an eye on rcugp?  Even if I run my system
 with only 10 threads, I still see this basic pattern where the average
 age is lower when I see lower performance.  It seems to be a
 reasonable proxy that could be used instead of waiting on me to re-run
 tests.

I do print out GPs/sec when running rcutorture, and they do vary somewhat,
but mostly with different Kconfig parameter settings.  Plus rcutorture
ramps up and down, so the GPs/sec is less than what you might see in a
system running an unvarying workload.  That said, increasing grace-period
latency is not always good for performance; in fact, I usually get beaten
up for grace periods completing too quickly rather than too slowly.
This current issue is one of the rare exceptions, perhaps even the
only exception.

So let's see...  The open1 benchmark sits in a loop doing open()
and close(), and probably spends most of its time in the kernel.
It doesn't do much context switching.  I am guessing that you don't
have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
much effect because then the first quiescent-state-forcing attempt would
likely finish the grace period.
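
For reference, an open1-style worker amounts to roughly the following
(a minimal sketch, not the actual will-it-scale source; the real harness
counts iterations per interval and uses a per-task scratch file):

#include <fcntl.h>
#include <unistd.h>

/* Each benchmark process hammers open()/close() on its own file, so it
 * spends nearly all of its time in the kernel and almost never blocks. */
static void open_close_loop(const char *path)
{
        for (;;) {
                int fd = open(path, O_RDWR | O_CREAT, 0600);

                if (fd >= 0)
                        close(fd);   /* the real test counts each iteration */
        }
}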

So, given that short grace periods help other workloads (I have the
scars to prove it), and given that the patch fixes some real problems,
and given that the large number for rcutree.jiffies_till_sched_qs got
us within 3%, shouldn't we consider this issue closed?

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 11:09 AM, Paul E. McKenney wrote:
 So let's see...  The open1 benchmark sits in a loop doing open()
 and close(), and probably spends most of its time in the kernel.
 It doesn't do much context switching.  I am guessing that you don't
 have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
 much effect because then the first quiescent-state-forcing attempt would
 likely finish the grace period.
 
 So, given that short grace periods help other workloads (I have the
 scars to prove it), and given that the patch fixes some real problems,

I'm not arguing that short grace periods _can_ help some workloads, or
that one is better than the other.  The patch in question changes
existing behavior by shortening grace periods.  This change of existing
behavior removes some of the benefits that my system gets out of RCU.  I
suspect this affects a lot more systems, but my core count makes it
easier to see.

Perhaps I'm misunderstanding the original patch's intent, but it seemed
to me to be working around an overactive debug message.  While often a
_useful_ debug message, it was firing falsely in the case being
addressed in the patch.

 and given that the large number for rcutree.jiffies_till_sched_qs got
 us within 3%, shouldn't we consider this issue closed?

With the default value for the tunable, the regression is still solidly
over 10%.  I think we can have a reasonable argument about it once the
default delta is down to the small single digits.

One more thing I just realized: this isn't a scalability problem, at
least with rcutree.jiffies_till_sched_qs=12.  There's a pretty
consistent delta in throughput throughout the entire range of threads
from 1-160.  See the processes column in the data files:

plain 3.15:
 https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.15/open1.csv
e552592e0383bc:
 https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.16.0-rc1-pf2/open1.csv

or visually:

 https://www.sr71.net/~dave/intel/array-join.html?1=willitscale/systems/bigbox/3.15&2=willitscale/systems/bigbox/3.16.0-rc1-pf2&hide=linear,threads_idle,processes_idle


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 04:30:12PM -0700, Dave Hansen wrote:
 On 06/23/2014 11:09 AM, Paul E. McKenney wrote:
  So let's see...  The open1 benchmark sits in a loop doing open()
  and close(), and probably spends most of its time in the kernel.
  It doesn't do much context switching.  I am guessing that you don't
  have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
  much effect because then the first quiescent-state-forcing attempt would
  likely finish the grace period.
  
  So, given that short grace periods help other workloads (I have the
  scars to prove it), and given that the patch fixes some real problems,
 
 I'm not arguing that short grace periods _can_ help some workloads, or
 that one is better than the other.  The patch in question changes
 existing behavior by shortening grace periods.  This change of existing
 behavior removes some of the benefits that my system gets out of RCU.  I
 suspect this affects a lot more systems, but my core count makes it
 easier to see.

And adds some benefits for other systems.  Your tight loop on open() and
close() will be sensitive to some things, and tight loops on other syscalls
will be sensitive to others.

 Perhaps I'm misunderstanding the original patch's intent, but it seemed
 to me to be working around an overactive debug message.  While often a
 _useful_ debug message, it was firing falsely in the case being
 addressed in the patch.

You are indeed misunderstanding the original patch's intent.  It was
preventing OOMs.  The overactive debug message is just a warning that
OOMs are possible.

  and given that the large number for rcutree.jiffies_till_sched_qs got
  us within 3%, shouldn't we consider this issue closed?
 
 With the default value for the tunable, the regression is still solidly
 over 10%.  I think we can have a reasonable argument about it once the
 default delta is down to the small single digits.

Look, you are to be congratulated for identifying a micro-benchmark that
exposes such small changes in timing, but I am not at all interested
in that micro-benchmark becoming the kernel's straightjacket.  If you
have real workloads for which this micro-benchmark is a good predictor
of performance, we can talk about quite a few additional steps to take
to tune for those workloads.

 One more thing I just realized: this isn't a scalability problem, at
 least with rcutree.jiffies_till_sched_qs=12.  There's a pretty
 consistent delta in throughput throughout the entire range of threads
 from 1-160.  See the processes column in the data files:
 
 plain 3.15:
  https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.15/open1.csv
 e552592e0383bc:
  https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.16.0-rc1-pf2/open1.csv
 
 or visually:
 
  https://www.sr71.net/~dave/intel/array-join.html?1=willitscale/systems/bigbox/3.15&2=willitscale/systems/bigbox/3.16.0-rc1-pf2&hide=linear,threads_idle,processes_idle

Just out of curiosity, how many CPUs does your system have?  80?
If 160, looks like something bad is happening at 80.

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Dave Hansen
On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
 Just out of curiosity, how many CPUs does your system have?  80?
 If 160, looks like something bad is happening at 80.

80 cores, 160 threads.  >80 processes/threads is where we start using
the second thread on the cores.  The tasks are also pinned to
hyperthread pairs, so they disturb each other, and the scheduler moves
them between threads on occasion which causes extra noise.
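
The pinning described above can be done with sched_setaffinity(); a minimal
sketch, assuming (hypothetically) that the two hardware threads of core c
are numbered c and c+80 on this box:

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling task to one hyperthread pair. */
static int pin_to_ht_pair(int core)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);        /* first hardware thread of the core */
        CPU_SET(core + 80, &set);   /* assumed hyperthread sibling */
        return sched_setaffinity(0, sizeof(set), &set);
}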


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Paul E. McKenney
On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
 On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
  Just out of curiosity, how many CPUs does your system have?  80?
  If 160, looks like something bad is happening at 80.
 
 80 cores, 160 threads.  >80 processes/threads is where we start using
 the second thread on the cores.  The tasks are also pinned to
 hyperthread pairs, so they disturb each other, and the scheduler moves
 them between threads on occasion which causes extra noise.

OK, that could explain the near flattening of throughput near 80
processes.  Is 3.16.0-rc1-pf2 with the two RCU patches?  If so, is the
new sysfs parameter at its default value?

Thanx, Paul



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-23 Thread Peter Zijlstra
On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
 Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
 fixed a problem where a CPU looping in the kernel with but one runnable
 task would give RCU CPU stall warnings, even if the in-kernel loop
 contained cond_resched() calls.  Unfortunately, in so doing, it introduced
 performance regressions in Anton Blanchard's will-it-scale open1 test.
 The problem appears to be not so much the increased cond_resched() path
 length as an increase in the rate at which grace periods complete, which
 increased per-update grace-period overhead.
 
 This commit takes a different approach to fixing this bug, mainly by
 moving the RCU-visible quiescent state from cond_resched() to
 rcu_note_context_switch(), and by further reducing the check to a
 simple non-zero test of a single per-CPU variable.  However, this
 approach requires that the force-quiescent-state processing send
 resched IPIs to the offending CPUs.  These will be sent only once
 the grace period has reached an age specified by the boot/sysfs
 parameter rcutree.jiffies_till_sched_qs, or once the grace period
 reaches an age halfway to the point at which RCU CPU stall warnings
 will be emitted, whichever comes first.
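
The approach described above boils down to roughly the following sketch
(simplified, userspace-style names -- not the actual tree.c/sched code):

#include <stdbool.h>

#define NR_CPUS 160

static int rcu_sched_qs_pending[NR_CPUS];  /* stand-in for the per-CPU flag */

/* Called from the context-switch path: the common case is a single
 * compare against zero, so it stays cheap. */
static bool note_context_switch_qs(int cpu)
{
        if (!rcu_sched_qs_pending[cpu])
                return false;              /* no help requested: fast path */
        rcu_sched_qs_pending[cpu] = 0;     /* report the quiescent state */
        return true;
}

/* Called from force-quiescent-state processing once the grace period is
 * older than jiffies_till_sched_qs (or halfway to the stall timeout). */
static void solicit_sched_qs(int cpu, void (*send_resched_ipi)(int))
{
        rcu_sched_qs_pending[cpu] = 1;     /* ask the CPU for help ... */
        send_resched_ipi(cpu);             /* ... and nudge it with an IPI */
}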

Right, and I suppose the force quiescent stuff is triggered from the
tick, which in turn wakes some of these rcu kthreads, which on UP would
cause scheduling themselves.

On the topic of these threads; I recently noticed RCU grew a metric ton
of them, I found some 75 rcu kthreads on my box, wth up with that?



Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-21 Thread Paul E. McKenney
On Fri, Jun 20, 2014 at 09:29:58PM -0700, Josh Triplett wrote:
> On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> > Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
> > fixed a problem where a CPU looping in the kernel with but one runnable
> > task would give RCU CPU stall warnings, even if the in-kernel loop
> > contained cond_resched() calls.  Unfortunately, in so doing, it introduced
> > performance regressions in Anton Blanchard's will-it-scale "open1" test.
> > The problem appears to be not so much the increased cond_resched() path
> > length as an increase in the rate at which grace periods complete, which
> > increased per-update grace-period overhead.
> > 
> > This commit takes a different approach to fixing this bug, mainly by
> > moving the RCU-visible quiescent state from cond_resched() to
> > rcu_note_context_switch(), and by further reducing the check to a
> > simple non-zero test of a single per-CPU variable.  However, this
> > approach requires that the force-quiescent-state processing send
> > resched IPIs to the offending CPUs.  These will be sent only once
> > the grace period has reached an age specified by the boot/sysfs
> > parameter rcutree.jiffies_till_sched_qs, or once the grace period
> > reaches an age halfway to the point at which RCU CPU stall warnings
> > will be emitted, whichever comes first.
> > 
> > Reported-by: Dave Hansen <dave.han...@intel.com>
> > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > Cc: Josh Triplett <j...@joshtriplett.org>
> > Cc: Andi Kleen <a...@linux.intel.com>
> > Cc: Christoph Lameter <c...@gentwo.org>
> > Cc: Mike Galbraith <umgwanakikb...@gmail.com>
> > Cc: Eric Dumazet <eric.duma...@gmail.com>
> 
> I like this approach *far* better.  This is the kind of thing I had in
> mind when I suggested using the fqs machinery: remove the poll entirely
> and just thwack a CPU if it takes too long without a quiescent state.
> Reviewed-by: Josh Triplett <j...@joshtriplett.org>

Glad you like it.  Not a fan of the IPI myself, but then again if you
are spending that much time looping in the kernel, an extra IPI is the
least of your problems.

I will be testing this more thoroughly, and if nothing bad happens will
send it on up within a few days.

Thanx, Paul

> > ---
> > 
> >  b/Documentation/kernel-parameters.txt |6 +
> >  b/include/linux/rcupdate.h|   36 
> >  b/kernel/rcu/tree.c   |  140 
> > +++---
> >  b/kernel/rcu/tree.h   |6 +
> >  b/kernel/rcu/tree_plugin.h|2 
> >  b/kernel/rcu/update.c |   18 
> >  b/kernel/sched/core.c |7 -
> >  7 files changed, 125 insertions(+), 90 deletions(-)
> > 
> > diff --git a/Documentation/kernel-parameters.txt 
> > b/Documentation/kernel-parameters.txt
> > index 6eaa9cdb7094..910c3829f81d 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2785,6 +2785,12 @@ bytes respectively. Such letter suffixes can also be 
> > entirely omitted.
> > leaf rcu_node structure.  Useful for very large
> > systems.
> >  
> > +   rcutree.jiffies_till_sched_qs= [KNL]
> > +   Set required age in jiffies for a
> > +   given grace period before RCU starts
> > +   soliciting quiescent-state help from
> > +   rcu_note_context_switch().
> > +
> > rcutree.jiffies_till_first_fqs= [KNL]
> > Set delay from grace-period initialization to
> > first attempt to force quiescent states.
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 5a75d19aa661..243aa4656cb7 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -44,7 +44,6 @@
> >  #include <linux/debugobjects.h>
> >  #include <linux/bug.h>
> >  #include <linux/compiler.h>
> > -#include <linux/percpu.h>
> >  #include <asm/barrier.h>
> >  
> >  extern int rcu_expedited; /* for sysctl */
> > @@ -300,41 +299,6 @@ bool __rcu_is_watching(void);
> >  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || 
> > defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> >  
> >  /*
> > - * Hooks for cond_resched() and friends to avoid RCU CPU stall warnings.
> > - */
> > -
> > -#define RCU_COND_RESCHED_LIM 256   /* ms vs. 100s of ms. */
> > -DECLARE_PER_CPU(int, rcu_cond_resched_count);
> > -void rcu_resched(void);
> > -
> > -/*
> > - * Is it time to report RCU quiescent states?
> > - *
> > - * Note unsynchronized access to rcu_cond_resched_count.  Yes, we might
> > - * increment some random CPU's count, and possibly also load the result 
> > from
> > - * yet another CPU's count.  We might even clobber some other CPU's attempt
> > - * to zero its counter.  This is all OK because the goal is not precision,
> > - * but rather reasonable amortization of rcu_note_context_switch() overhead
> > - * and extremely high probability of avoiding RCU CPU stall warnings.
> > - * Note that this function has to be preempted in just the wrong place,
> > - * many thousands of times in a row, for 


Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

2014-06-20 Thread Josh Triplett
On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
> fixed a problem where a CPU looping in the kernel with but one runnable
> task would give RCU CPU stall warnings, even if the in-kernel loop
> contained cond_resched() calls.  Unfortunately, in so doing, it introduced
> performance regressions in Anton Blanchard's will-it-scale "open1" test.
> The problem appears to be not so much the increased cond_resched() path
> length as an increase in the rate at which grace periods complete, which
> increased per-update grace-period overhead.
> 
> This commit takes a different approach to fixing this bug, mainly by
> moving the RCU-visible quiescent state from cond_resched() to
> rcu_note_context_switch(), and by further reducing the check to a
> simple non-zero test of a single per-CPU variable.  However, this
> approach requires that the force-quiescent-state processing send
> resched IPIs to the offending CPUs.  These will be sent only once
> the grace period has reached an age specified by the boot/sysfs
> parameter rcutree.jiffies_till_sched_qs, or once the grace period
> reaches an age halfway to the point at which RCU CPU stall warnings
> will be emitted, whichever comes first.
> 
> Reported-by: Dave Hansen <dave.han...@intel.com>
> Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> Cc: Josh Triplett <j...@joshtriplett.org>
> Cc: Andi Kleen <a...@linux.intel.com>
> Cc: Christoph Lameter <c...@gentwo.org>
> Cc: Mike Galbraith <umgwanakikb...@gmail.com>
> Cc: Eric Dumazet <eric.duma...@gmail.com>

I like this approach *far* better.  This is the kind of thing I had in
mind when I suggested using the fqs machinery: remove the poll entirely
and just thwack a CPU if it takes too long without a quiescent state.
> Reviewed-by: Josh Triplett <j...@joshtriplett.org>

> ---
> 
>  b/Documentation/kernel-parameters.txt |6 +
>  b/include/linux/rcupdate.h|   36 
>  b/kernel/rcu/tree.c   |  140 
> +++---
>  b/kernel/rcu/tree.h   |6 +
>  b/kernel/rcu/tree_plugin.h|2 
>  b/kernel/rcu/update.c |   18 
>  b/kernel/sched/core.c |7 -
>  7 files changed, 125 insertions(+), 90 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 6eaa9cdb7094..910c3829f81d 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2785,6 +2785,12 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   leaf rcu_node structure.  Useful for very large
>   systems.
>  
> + rcutree.jiffies_till_sched_qs= [KNL]
> + Set required age in jiffies for a
> + given grace period before RCU starts
> + soliciting quiescent-state help from
> + rcu_note_context_switch().
> +
>   rcutree.jiffies_till_first_fqs= [KNL]
>   Set delay from grace-period initialization to
>   first attempt to force quiescent states.
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 5a75d19aa661..243aa4656cb7 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -44,7 +44,6 @@
>  #include <linux/debugobjects.h>
>  #include <linux/bug.h>
>  #include <linux/compiler.h>
> -#include <linux/percpu.h>
>  #include <asm/barrier.h>
>  
>  extern int rcu_expedited; /* for sysctl */
> @@ -300,41 +299,6 @@ bool __rcu_is_watching(void);
>  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) 
> || defined(CONFIG_SMP) */
>  
>  /*
> - * Hooks for cond_resched() and friends to avoid RCU CPU stall warnings.
> - */
> -
> -#define RCU_COND_RESCHED_LIM 256 /* ms vs. 100s of ms. */
> -DECLARE_PER_CPU(int, rcu_cond_resched_count);
> -void rcu_resched(void);
> -
> -/*
> - * Is it time to report RCU quiescent states?
> - *
> - * Note unsynchronized access to rcu_cond_resched_count.  Yes, we might
> - * increment some random CPU's count, and possibly also load the result from
> - * yet another CPU's count.  We might even clobber some other CPU's attempt
> - * to zero its counter.  This is all OK because the goal is not precision,
> - * but rather reasonable amortization of rcu_note_context_switch() overhead
> - * and extremely high probability of avoiding RCU CPU stall warnings.
> - * Note that this function has to be preempted in just the wrong place,
> - * many thousands of times in a row, for anything bad to happen.
> - */
> -static inline bool rcu_should_resched(void)
> -{
> - return raw_cpu_inc_return(rcu_cond_resched_count) >=
> -RCU_COND_RESCHED_LIM;
> -}
> -
> -/*
> - * Report quiscent states to RCU if it is time to do so.
> - */
> -static inline void rcu_cond_resched(void)
> -{
> - if (unlikely(rcu_should_resched()))
> - rcu_resched();
> -}
> -
> -/*
>   * Infrastructure to implement the synchronize_() primitives in
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
>   */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> 
