Re: question about RCU dynticks_nesting

2015-05-07 Thread Rik van Riel
On 05/06/2015 08:59 PM, Frederic Weisbecker wrote:
> On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:

>> Ingo's idea is to simply have cpu 0 check the current task
>> on all other CPUs, see whether that task is running in system
>> mode, user mode, guest mode, irq mode, etc and update that
>> task's vtime accordingly.
>>
>> I suspect the runqueue lock is probably enough to do that,
>> and between rcu state and PF_VCPU we probably have enough
>> information to see what mode the task is running in, with
>> just remote memory reads.
> 
> Note that we could significantly reduce the overhead of vtime accounting
> by only accumulating utime/stime in per-cpu buffers and actually accounting
> them on context switch or on task_cputime() calls. That way we remove the
> overhead of the account_user/system_time() functions and the vtime locks.
> 
> But doing the accounting from CPU 0 by just accounting 1 tick to the context
> we remotely observe would certainly reduce the local accounting overhead to
> the strict minimum. And I think we shouldn't even lock the rq for that; we
> can live with some lack of precision.

We can live with lack of precision, but we cannot live with data
structures being re-used and pointers pointing off into la-la
land while we are following them :)
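That is the part an RCU read-side critical section has to cover: task_structs
are freed via RCU, so a remote observer may follow the pointer only while it
holds rcu_read_lock(). Roughly (remote_curr() and sample_task_state() are
hypothetical placeholders):

#include <linux/rcupdate.h>
#include <linux/sched.h>

static void sample_one_cpu(int cpu)
{
        struct task_struct *tsk;

        /* Holding rcu_read_lock() keeps the task_struct from being freed
         * and re-used while we are looking at it. */
        rcu_read_lock();
        tsk = remote_curr(cpu);         /* hypothetical: @cpu's rq->curr */
        if (tsk)
                sample_task_state(tsk); /* hypothetical: must not sleep */
        rcu_read_unlock();              /* after this, tsk may go away */
}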

> Now we must expect quite some overhead on CPU 0. Perhaps it should be
> an option, as I'm not sure every full dynticks use case wants that.

Let's see if I can get this to work before deciding whether we need yet
another configurable option :)

It may be possible to have most of the overhead happen from schedulable
context, maybe softirq code. Right now I am still stuck in the giant
spaghetti mess under account_process_tick, with dozens of functions that
only work on CPU-local data, task-local data, or data that is CPU- or
task-local depending on the architecture...

-- 
All rights reversed


Re: question about RCU dynticks_nesting

2015-05-06 Thread Frederic Weisbecker
On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> 
> >>> Hmmm...  But didn't earlier performance measurements show that the bulk of
> >>> the overhead was the delta-time computations rather than RCU accounting?
> >>
> >> The bulk of the overhead was disabling and re-enabling
> >> irqs around the calls to rcu_user_exit and rcu_user_enter :)
> > 
> > Really???  OK...  How about software irq masking?  (I know, that is
> > probably a bit of a scary change as well.)
> > 
> >> Of the remaining time, about 2/3 seems to be the vtime
> >> stuff, and the other 1/3 the rcu code.
> > 
> > OK, worth some thought, then.
> > 
> >> I suspect it makes sense to optimize both, though the
> >> vtime code may be the easiest :)
> > 
> > Making a crude version that does jiffies (or whatever) instead of
> > fine-grained computations might give good bang for the buck.  ;-)
> 
> Ingo's idea is to simply have cpu 0 check the current task
> on all other CPUs, see whether that task is running in system
> mode, user mode, guest mode, irq mode, etc and update that
> task's vtime accordingly.
> 
> I suspect the runqueue lock is probably enough to do that,
> and between rcu state and PF_VCPU we probably have enough
> information to see what mode the task is running in, with
> just remote memory reads.

Note that we could significantly reduce the overhead of vtime accounting
by only accumulating utime/stime in per-cpu buffers and actually accounting
them on context switch or on task_cputime() calls. That way we remove the
overhead of the account_user/system_time() functions and the vtime locks.
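For concreteness, a rough sketch of what that deferred accounting could look
like (all names here are made up, this is not the actual kernel code; the
flush is where the existing account_user_time()/account_system_time() style
machinery would finally run):

#include <linux/percpu.h>
#include <linux/sched.h>

struct vtime_cache {
        u64     utime;          /* user time not yet accounted, in ns */
        u64     stime;          /* system time not yet accounted, in ns */
};

static DEFINE_PER_CPU(struct vtime_cache, vtime_cache);

/* Hot path (user/kernel transitions): a per-cpu add, no locks taken. */
static inline void vtime_cache_add_user(u64 delta)
{
        __this_cpu_add(vtime_cache.utime, delta);
}

static inline void vtime_cache_add_system(u64 delta)
{
        __this_cpu_add(vtime_cache.stime, delta);
}

/* Slow path (context switch or task_cputime()): do the real accounting. */
static void vtime_cache_flush(struct task_struct *tsk)
{
        struct vtime_cache *vc = this_cpu_ptr(&vtime_cache);

        /* hypothetical helper standing in for the real accounting calls */
        vtime_account_to_task(tsk, vc->utime, vc->stime);
        vc->utime = 0;
        vc->stime = 0;
}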

But doing the accounting from CPU 0 by just accounting 1 tick to the context
we remotely observe would certainly reduce the local accounting overhead to
the strict minimum. And I think we shouldn't even lock the rq for that; we
can live with some lack of precision.

Now we must expect quite some overhead on CPU 0. Perhaps it should be
an option, as I'm not sure every full dynticks use case wants that.


Re: question about RCU dynticks_nesting

2015-05-06 Thread Mike Galbraith
On Wed, 2015-05-06 at 08:52 +0200, Mike Galbraith wrote:
> On Tue, 2015-05-05 at 23:06 -0700, Paul E. McKenney wrote:
> 
> > > 1 * stat() on isolated cpu
> > > 
> > >         NO_HZ_FULL off   inactive    housekeeper   nohz_full
> > > real    0m14.266s        0m14.367s   0m20.427s     0m27.921s
> > > user    0m1.756s         0m1.553s    0m1.976s      0m10.447s
> > > sys     0m12.508s        0m12.769s   0m18.400s     0m17.464s
> > > (real)  1.000            1.007       1.431         1.957
> 
> 
> > Does the attached patch help at all?
> 
> nohz_full
> 0m27.073s
> 0m9.423s
> 0m17.602s
> 
> Not a complete retest, and a pull in between, but I'd say that's a no.

(well, a second is a nice "at all", just not a _huge_ "at all";)



Re: question about RCU dynticks_nesting

2015-05-06 Thread Mike Galbraith
On Tue, 2015-05-05 at 23:06 -0700, Paul E. McKenney wrote:

> > 1 * stat() on isolated cpu
> > 
> >         NO_HZ_FULL off   inactive    housekeeper   nohz_full
> > real    0m14.266s        0m14.367s   0m20.427s     0m27.921s
> > user    0m1.756s         0m1.553s    0m1.976s      0m10.447s
> > sys     0m12.508s        0m12.769s   0m18.400s     0m17.464s
> > (real)  1.000            1.007       1.431         1.957


> Does the attached patch help at all?

nohz_full
0m27.073s
0m9.423s
0m17.602s

Not a complete retest, and a pull in between, but I'd say that's a no.

-Mike



Re: question about RCU dynticks_nesting

2015-05-06 Thread Paul E. McKenney
On Wed, May 06, 2015 at 05:44:54AM +0200, Mike Galbraith wrote:
> On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
> > On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
> > 
> > > You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
> > > RCU_FAST_NO_HZ=n?
> > 
> > FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n.  (I didn't
> > profile to see where costs lie though)
> 
> (did that)

Nice, thank you!!!

> 1 * stat() on isolated cpu
> 
>         NO_HZ_FULL off   inactive    housekeeper   nohz_full
> real    0m14.266s        0m14.367s   0m20.427s     0m27.921s
> user    0m1.756s         0m1.553s    0m1.976s      0m10.447s
> sys     0m12.508s        0m12.769s   0m18.400s     0m17.464s
> (real)  1.000            1.007       1.431         1.957
> 
>  inactive                                   housekeeper                                 nohz_full
> ----------------------------------------------------------------------------------------------------------------------
>  7.61%  [.] __xstat64                       11.12%  [k] context_tracking_exit            7.41%  [k] context_tracking_exit
>  7.04%  [k] system_call                      6.18%  [k] context_tracking_enter           6.02%  [k] native_sched_clock
>  6.96%  [k] copy_user_enhanced_fast_string   5.18%  [.] __xstat64                        4.69%  [k] rcu_eqs_enter_common.isra.37
>  6.57%  [k] path_init                        4.89%  [k] system_call                      4.35%  [k] _raw_spin_lock
>  5.92%  [k] system_call_after_swapgs         4.84%  [k] copy_user_enhanced_fast_string   4.30%  [k] context_tracking_enter
>  5.44%  [k] lockref_put_return               4.46%  [k] path_init                        4.25%  [k] kmem_cache_alloc
>  4.69%  [k] link_path_walk                   4.30%  [k] system_call_after_swapgs         4.14%  [.] __xstat64
>  4.47%  [k] lockref_get_not_dead             4.12%  [k] kmem_cache_free                  3.89%  [k] rcu_eqs_exit_common.isra.38
>  4.46%  [k] kmem_cache_free                  3.78%  [k] link_path_walk                   3.50%  [k] system_call
>  4.20%  [k] kmem_cache_alloc                 3.62%  [k] lockref_put_return               3.48%  [k] copy_user_enhanced_fast_string
>  4.09%  [k] cp_new_stat                      3.43%  [k] kmem_cache_alloc                 3.02%  [k] system_call_after_swapgs
>  3.38%  [k] vfs_getattr_nosec                2.95%  [k] lockref_get_not_dead             2.97%  [k] kmem_cache_free
>  2.82%  [k] vfs_fstatat                      2.87%  [k] cp_new_stat                      2.88%  [k] lockref_put_return
>  2.60%  [k] user_path_at_empty               2.62%  [k] syscall_trace_leave              2.61%  [k] link_path_walk
>  2.47%  [k] path_lookupat                    1.91%  [k] vfs_getattr_nosec                2.58%  [k] path_init
>  2.14%  [k] strncpy_from_user                1.89%  [k] syscall_trace_enter_phase1       2.15%  [k] lockref_get_not_dead
>  2.11%  [k] getname_flags                    1.77%  [k] path_lookupat                    2.04%  [k] cp_new_stat
>  2.10%  [k] generic_fillattr                 1.67%  [k] complete_walk                    1.89%  [k] generic_fillattr
>  2.05%  [.] main                             1.65%  [k] vfs_fstatat                      1.67%  [k] syscall_trace_leave
>  1.89%  [k] complete_walk                    1.56%  [k] generic_fillattr                 1.59%  [k] vfs_getattr_nosec
>  1.73%  [k] generic_permission               1.55%  [k] user_path_at_empty               1.49%  [k] get_vtime_delta
>  1.50%  [k] system_call_fastpath             1.54%  [k] strncpy_from_user                1.32%  [k] user_path_at_empty
>  1.37%  [k] legitimize_mnt                   1.53%  [k] getname_flags                    1.30%  [k] syscall_trace_enter_phase1
>  1.30%  [k] dput                             1.46%  [k] legitimize_mnt                   1.21%  [k] rcu_eqs_exit
>  1.26%  [k] putname                          1.34%  [.] main                             1.21%  [k] vfs_fstatat
>  1.19%  [k] path_put                         1.32%  [k] int_with_check                   1.18%  [k] path_lookupat
>  1.18%  [k] filename_lookup                  1.28%  [k] generic_permission               1.15%  [k] getname_flags
>  1.01%  [k] SYSC_newstat                     1.16%  [k] int_very_careful                 1.03%  [k] strncpy_from_user
>  0.96%  [k] mntput_no_expire                 1.04%  [k] putname                          1.01%  [k] account_system_time


Re: question about RCU dynticks_nesting

2015-05-05 Thread Paul E. McKenney
On Tue, May 05, 2015 at 05:09:23PM -0400, Rik van Riel wrote:
> On 05/05/2015 02:35 PM, Paul E. McKenney wrote:
> > On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
> >> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> >>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
>  On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > counter in production kernels.  Even if there was, we have to sample this
> > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > would be where kernel entry/exit is, so I expect that this would be a
> > net loss in overall performance.
> 
>  We unconditionally have the preempt_count, it's just not used much for
>  PREEMPT_COUNT=n kernels.
> >>>
> >>> We have the field, you mean?  I might be missing something, but it still
> >>> appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
> >>> So what am I missing?
> >>
> >> There's another layer of accessors that can in fact manipulate the
> >> preempt_count even for !PREEMPT_COUNT kernels. They are currently used
> >> by things like pagefault_disable().
> > 
> > OK, fair enough.
> > 
> > I am going to focus first on getting rid of (or at least greatly reducing)
> > RCU's interrupt disabling on the user-kernel entry/exit paths, since
> > that seems to be the biggest cost.
> 
> Interrupts are already disabled on kernel-user and kernel-guest
> switches.  Paolo and I have patches to move a bunch of the calls
> to user_enter, user_exit, guest_enter, and guest_exit to places
> where interrupts are already disabled, so we do not need to
> disable them again.
> 
> With those in place, the vtime calculations are the largest
> CPU user. I am working on those.

OK, so I should stop worrying about making rcu_user_enter() and
rcu_user_exit() operate with interrupts disabled, and instead think about
the overhead of the operations themselves.  Probably starting from Mike
Galbraith's profile (thank you!) unless Rik has some reason to believe
that it is nonrepresentative.

Thanx, Paul



Re: question about RCU dynticks_nesting

2015-05-05 Thread Mike Galbraith
On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
> 
> > You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
> > RCU_FAST_NO_HZ=n?
> 
> FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n.  (I didn't
> profile to see where costs lie though)

(did that)

1 * stat() on isolated cpu

        NO_HZ_FULL off   inactive    housekeeper   nohz_full
real    0m14.266s        0m14.367s   0m20.427s     0m27.921s
user    0m1.756s         0m1.553s    0m1.976s      0m10.447s
sys     0m12.508s        0m12.769s   0m18.400s     0m17.464s
(real)  1.000            1.007       1.431         1.957
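(For reference, the numbers above come from something like a tight stat()
loop run on the isolated CPU; a minimal sketch, where the iteration count,
path, and pinning method are placeholders rather than what was actually used:)

#define _GNU_SOURCE
#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
        struct stat st;
        long i;

        /* run as e.g.: time taskset -c <isolated cpu> ./stat_loop */
        for (i = 0; i < 10000000L; i++) {
                if (stat("/tmp", &st) < 0) {
                        perror("stat");
                        return 1;
                }
        }
        return 0;
}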

 inactive                                   housekeeper                                 nohz_full
----------------------------------------------------------------------------------------------------------------------
 7.61%  [.] __xstat64                       11.12%  [k] context_tracking_exit            7.41%  [k] context_tracking_exit
 7.04%  [k] system_call                      6.18%  [k] context_tracking_enter           6.02%  [k] native_sched_clock
 6.96%  [k] copy_user_enhanced_fast_string   5.18%  [.] __xstat64                        4.69%  [k] rcu_eqs_enter_common.isra.37
 6.57%  [k] path_init                        4.89%  [k] system_call                      4.35%  [k] _raw_spin_lock
 5.92%  [k] system_call_after_swapgs         4.84%  [k] copy_user_enhanced_fast_string   4.30%  [k] context_tracking_enter
 5.44%  [k] lockref_put_return               4.46%  [k] path_init                        4.25%  [k] kmem_cache_alloc
 4.69%  [k] link_path_walk                   4.30%  [k] system_call_after_swapgs         4.14%  [.] __xstat64
 4.47%  [k] lockref_get_not_dead             4.12%  [k] kmem_cache_free                  3.89%  [k] rcu_eqs_exit_common.isra.38
 4.46%  [k] kmem_cache_free                  3.78%  [k] link_path_walk                   3.50%  [k] system_call
 4.20%  [k] kmem_cache_alloc                 3.62%  [k] lockref_put_return               3.48%  [k] copy_user_enhanced_fast_string
 4.09%  [k] cp_new_stat                      3.43%  [k] kmem_cache_alloc                 3.02%  [k] system_call_after_swapgs
 3.38%  [k] vfs_getattr_nosec                2.95%  [k] lockref_get_not_dead             2.97%  [k] kmem_cache_free
 2.82%  [k] vfs_fstatat                      2.87%  [k] cp_new_stat                      2.88%  [k] lockref_put_return
 2.60%  [k] user_path_at_empty               2.62%  [k] syscall_trace_leave              2.61%  [k] link_path_walk
 2.47%  [k] path_lookupat                    1.91%  [k] vfs_getattr_nosec                2.58%  [k] path_init
 2.14%  [k] strncpy_from_user                1.89%  [k] syscall_trace_enter_phase1       2.15%  [k] lockref_get_not_dead
 2.11%  [k] getname_flags                    1.77%  [k] path_lookupat                    2.04%  [k] cp_new_stat
 2.10%  [k] generic_fillattr                 1.67%  [k] complete_walk                    1.89%  [k] generic_fillattr
 2.05%  [.] main                             1.65%  [k] vfs_fstatat                      1.67%  [k] syscall_trace_leave
 1.89%  [k] complete_walk                    1.56%  [k] generic_fillattr                 1.59%  [k] vfs_getattr_nosec
 1.73%  [k] generic_permission               1.55%  [k] user_path_at_empty               1.49%  [k] get_vtime_delta
 1.50%  [k] system_call_fastpath             1.54%  [k] strncpy_from_user                1.32%  [k] user_path_at_empty
 1.37%  [k] legitimize_mnt                   1.53%  [k] getname_flags                    1.30%  [k] syscall_trace_enter_phase1
 1.30%  [k] dput                             1.46%  [k] legitimize_mnt                   1.21%  [k] rcu_eqs_exit
 1.26%  [k] putname                          1.34%  [.] main                             1.21%  [k] vfs_fstatat
 1.19%  [k] path_put                         1.32%  [k] int_with_check                   1.18%  [k] path_lookupat
 1.18%  [k] filename_lookup                  1.28%  [k] generic_permission               1.15%  [k] getname_flags
 1.01%  [k] SYSC_newstat                     1.16%  [k] int_very_careful                 1.03%  [k] strncpy_from_user
 0.96%  [k] mntput_no_expire                 1.04%  [k] putname                          1.01%  [k] account_system_time
 0.79%  [k] path_cleanup                     0.94%  [k] dput                             1.00%  [k] complete_walk
 0.79%  [k] mntput                           0.91%  [k] context_tracking_user_exit       0.99%  [k] vtime_account_user

Re: question about RCU dynticks_nesting

2015-05-05 Thread Mike Galbraith
On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:

> You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
> RCU_FAST_NO_HZ=n?

FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n.  (I didn't
profile to see where costs lie though)

-Mike



Re: question about RCU dynticks_nesting

2015-05-05 Thread Rik van Riel
On 05/05/2015 02:35 PM, Paul E. McKenney wrote:
> On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
>> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
>>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
 On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> counter in production kernels.  Even if there was, we have to sample this
> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> would be where kernel entry/exit is, so I expect that this would be a
> net loss in overall performance.

 We unconditionally have the preempt_count, it's just not used much for
 PREEMPT_COUNT=n kernels.
>>>
>>> We have the field, you mean?  I might be missing something, but it still
>>> appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
>>> So what am I missing?
>>
>> There's another layer of accessors that can in fact manipulate the
>> preempt_count even for !PREEMPT_COUNT kernels. They are currently used
>> by things like pagefault_disable().
> 
> OK, fair enough.
> 
> I am going to focus first on getting rid of (or at least greatly reducing)
> RCU's interrupt disabling on the user-kernel entry/exit paths, since
> that seems to be the biggest cost.

Interrupts are already disabled on kernel-user and kernel-guest
switches.  Paolo and I have patches to move a bunch of the calls
to user_enter, user_exit, guest_enter, and guest_exit to places
where interrupts are already disabled, so we do not need to
disable them again.

With those in place, the vtime calculations are the largest
CPU user. I am working on those.

-- 
All rights reversed


Re: question about RCU dynticks_nesting

2015-05-05 Thread Paul E. McKenney
On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> > On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > > > counter in production kernels.  Even if there was, we have to sample this
> > > > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > > > would be where kernel entry/exit is, so I expect that this would be a
> > > > net loss in overall performance.
> > > 
> > > We unconditionally have the preempt_count, it's just not used much for
> > > PREEMPT_COUNT=n kernels.
> > 
> > We have the field, you mean?  I might be missing something, but it still
> > appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
> > So what am I missing?
> 
> There's another layer of accessors that can in fact manipulate the
> preempt_count even for !PREEMPT_COUNT kernels. They are currently used
> by things like pagefault_disable().

OK, fair enough.

I am going to focus first on getting rid of (or at least greatly reducing)
RCU's interrupt disabling on the user-kernel entry/exit paths, since
that seems to be the biggest cost.

Thanx, Paul



Re: question about RCU dynticks_nesting

2015-05-05 Thread Peter Zijlstra
On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > > counter in production kernels.  Even if there was, we have to sample this
> > > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > > would be where kernel entry/exit is, so I expect that this would be a
> > > net loss in overall performance.
> > 
> > We unconditionally have the preempt_count, it's just not used much for
> > PREEMPT_COUNT=n kernels.
> 
> We have the field, you mean?  I might be missing something, but it still
> appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
> So what am I missing?

There's another layer of accessors that can in fact manipulate the
preempt_count even for !PREEMPT_COUNT kernels. They are currently used
by things like pagefault_disable().
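A standalone illustration of that layering (simplified; the real definitions
live in include/linux/preempt.h and the arch headers and differ in detail):
the counter always exists, preempt_disable() compiles away when preemption
support is off, but the lower-level accessors still work and are what
pagefault_disable()-like users go through.

#include <stdio.h>

static int preempt_count_var;           /* the field exists unconditionally */

/* Low-level accessors: functional even when PREEMPT_COUNT is off. */
#define __preempt_count_add(v)  (preempt_count_var += (v))
#define __preempt_count_sub(v)  (preempt_count_var -= (v))

#ifdef CONFIG_PREEMPT_COUNT
# define preempt_disable()      __preempt_count_add(1)
# define preempt_enable()       __preempt_count_sub(1)
#else
/* PREEMPT_COUNT=n: the high-level API compiles to nothing... */
# define preempt_disable()      do { } while (0)
# define preempt_enable()       do { } while (0)
#endif

/* ...but a pagefault_disable()-style caller uses the low-level layer. */
static void pagefault_disable_example(void)
{
        __preempt_count_add(1);
}

int main(void)
{
        preempt_disable();              /* no-op in a PREEMPT_COUNT=n build */
        pagefault_disable_example();    /* still bumps the count */
        printf("preempt_count = %d\n", preempt_count_var);
        return 0;
}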


Re: question about RCU dynticks_nesting

2015-05-05 Thread Paul E. McKenney
On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > counter in production kernels.  Even if there was, we have to sample this
> > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > would be where kernel entry/exit is, so I expect that this would be a
> > net loss in overall performance.
> 
> > We unconditionally have the preempt_count, it's just not used much for
> PREEMPT_COUNT=n kernels.

We have the field, you mean?  I might be missing something, but it still
appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
So what am I missing?

Thanx, Paul



Re: question about RCU dynticks_nesting

2015-05-05 Thread Paul E. McKenney
On Tue, May 05, 2015 at 12:51:02PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 12:48:34PM +0200, Peter Zijlstra wrote:
> > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> > > In case of the non-preemptible RCU, we could easily also
> > > increase current->rcu_read_lock_nesting at the same time
> > > we increase the preempt counter, and use that as the
> > > indicator to test whether the cpu is in an extended
> > > rcu quiescent state. That way there would be no extra
> > > overhead at syscall entry or exit at all. The trick
> > > would be getting the preempt count and the rcu read
> > > lock nesting count in the same cache line for each task.
> > 
> > Can't do that. Remember, on x86 we have per-cpu preempt count, and your
> > rcu_read_lock_nesting is per task.
> 
> Hmm, I suppose you could do the rcu_read_lock_nesting thing in a per-cpu
> counter too and transfer that into the task_struct on context switch.
> 
> If you manage to put both sides of that in the same cache things should
> not add significant overhead.
> 
> You'd have to move the rcu_read_lock_nesting into the thread_info, which
> would be painful as you'd have to go touch all archs etc..

Last I tried doing that, things got really messy at context-switch time.
Perhaps I simply didn't do the save/restore in the right place?

Thanx, Paul



Re: question about RCU dynticks_nesting

2015-05-05 Thread Peter Zijlstra
On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> counter in production kernels.  Even if there was, we have to sample this
> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> would be where kernel entry/exit is, so I expect that this would be a
> net loss in overall performance.

We unconditionally have the preempt_count, it's just not used much for
PREEMPT_COUNT=n kernels.


Re: question about RCU dynticks_nesting

2015-05-05 Thread Peter Zijlstra
On Tue, May 05, 2015 at 12:48:34PM +0200, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> > In case of the non-preemptible RCU, we could easily also
> > increase current->rcu_read_lock_nesting at the same time
> > we increase the preempt counter, and use that as the
> > indicator to test whether the cpu is in an extended
> > rcu quiescent state. That way there would be no extra
> > overhead at syscall entry or exit at all. The trick
> > would be getting the preempt count and the rcu read
> > lock nesting count in the same cache line for each task.
> 
> Can't do that. Remember, on x86 we have per-cpu preempt count, and your
> rcu_read_lock_nesting is per task.

Hmm, I suppose you could do the rcu_read_lock_nesting thing in a per-cpu
counter too and transfer that into the task_struct on context switch.

If you manage to put both sides of that in the same cache things should
not add significant overhead.

You'd have to move the rcu_read_lock_nesting into the thread_info, which
would be painful as you'd have to go touch all archs etc..
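A rough sketch of that per-cpu variant, purely to illustrate the shape of it
(none of these helpers exist, and the rcu_read_lock_nesting field itself is
currently a PREEMPT_RCU-only task_struct member):

#include <linux/percpu.h>
#include <linux/sched.h>

/* Ideally placed in the same cacheline as the per-cpu preempt_count. */
static DEFINE_PER_CPU(int, rcu_nesting);

static inline void sketch_rcu_read_lock(void)
{
        __this_cpu_inc(rcu_nesting);
}

static inline void sketch_rcu_read_unlock(void)
{
        __this_cpu_dec(rcu_nesting);
}

/* Context-switch path: park the counter in the outgoing task and
 * restore it for the incoming one. */
static inline void sketch_rcu_switch(struct task_struct *prev,
                                     struct task_struct *next)
{
        prev->rcu_read_lock_nesting = __this_cpu_read(rcu_nesting);
        __this_cpu_write(rcu_nesting, next->rcu_read_lock_nesting);
}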


Re: question about RCU dynticks_nesting

2015-05-05 Thread Peter Zijlstra
On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> In case of the non-preemptible RCU, we could easily also
> increase current->rcu_read_lock_nesting at the same time
> we increase the preempt counter, and use that as the
> indicator to test whether the cpu is in an extended
> rcu quiescent state. That way there would be no extra
> overhead at syscall entry or exit at all. The trick
> would be getting the preempt count and the rcu read
> lock nesting count in the same cache line for each task.

Can't do that. Remember, on x86 we have per-cpu preempt count, and your
rcu_read_lock_nesting is per task.


Re: question about RCU dynticks_nesting

2015-05-04 Thread Paul E. McKenney
On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> 
> >>> Hmmm...  But didn't earlier performance measurements show that the bulk of
> >>> the overhead was the delta-time computations rather than RCU accounting?
> >>
> >> The bulk of the overhead was disabling and re-enabling
> >> irqs around the calls to rcu_user_exit and rcu_user_enter :)
> > 
> > Really???  OK...  How about software irq masking?  (I know, that is
> > probably a bit of a scary change as well.)
> > 
> >> Of the remaining time, about 2/3 seems to be the vtime
> >> stuff, and the other 1/3 the rcu code.
> > 
> > OK, worth some thought, then.
> > 
> >> I suspect it makes sense to optimize both, though the
> >> vtime code may be the easiest :)
> > 
> > Making a crude version that does jiffies (or whatever) instead of
> > fine-grained computations might give good bang for the buck.  ;-)
> 
> Ingo's idea is to simply have cpu 0 check the current task
> on all other CPUs, see whether that task is running in system
> mode, user mode, guest mode, irq mode, etc and update that
> task's vtime accordingly.
> 
> I suspect the runqueue lock is probably enough to do that,
> and between rcu state and PF_VCPU we probably have enough
> information to see what mode the task is running in, with
> just remote memory reads.
> 
> I looked at implementing the vtime bits (and am pretty sure
> how to do those now), and then spent some hours looking at
> the RCU bits, to see if we could not simplify both things at
> once, especially considering that the current RCU context
> tracking bits need to be called with irqs disabled.

Remotely sampling the vtime info without memory barriers makes sense.
After all, the result is statistical anyway.  Unfortunately, as noted
earlier, RCU correctness depends on ordering.

The current RCU idle entry/exit code most definitely absolutely
requires irqs be disabled.  However, I will see if that can be changed.
No promises, especially no short-term promises, but it does not feel
impossible.

You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
RCU_FAST_NO_HZ=n?  If that has a significant effect, easy quick win is
turning it off -- and I could then make it a boot parameter to get you
back to one kernel for everyone.  (The existing tick_nohz_active boot
parameter already turns it off, but also turns off dyntick idle, which
might be a bit excessive.)  Or if there is some way that the kernel can
know that the system is currently running on battery or some such.

Thanx, Paul



Re: question about RCU dynticks_nesting

2015-05-04 Thread Rik van Riel
On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
>> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:

>>> Hmmm...  But didn't earlier performance measurements show that the bulk of
>>> the overhead was the delta-time computations rather than RCU accounting?
>>
>> The bulk of the overhead was disabling and re-enabling
>> irqs around the calls to rcu_user_exit and rcu_user_enter :)
> 
> Really???  OK...  How about software irq masking?  (I know, that is
> probably a bit of a scary change as well.)
> 
>> Of the remaining time, about 2/3 seems to be the vtime
>> stuff, and the other 1/3 the rcu code.
> 
> OK, worth some thought, then.
> 
>> I suspect it makes sense to optimize both, though the
>> vtime code may be the easiest :)
> 
> Making a crude version that does jiffies (or whatever) instead of
> fine-grained computations might give good bang for the buck.  ;-)

Ingo's idea is to simply have cpu 0 check the current task
on all other CPUs, see whether that task is running in system
mode, user mode, guest mode, irq mode, etc and update that
task's vtime accordingly.

I suspect the runqueue lock is probably enough to do that,
and between rcu state and PF_VCPU we probably have enough
information to see what mode the task is running in, with
just remote memory reads.

I looked at implementing the vtime bits (and am pretty sure
how to do those now), and then spent some hours looking at
the RCU bits, to see if we could not simplify both things at
once, especially considering that the current RCU context
tracking bits need to be called with irqs disabled.
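
To make that sampling loop concrete, here is a toy userspace model of it
(plain C; every name in it is made up for the example, and it ignores the
task-lifetime and memory-ordering questions entirely):

#include <stdio.h>

#define NR_CPUS 4

enum toy_ctx { CTX_USER, CTX_SYSTEM, CTX_GUEST, CTX_IRQ };

/* Stand-in for the remotely readable bits of a task. */
struct toy_task {
        const char *comm;
        enum toy_ctx ctx;       /* would be derived from rcu state, PF_VCPU, ... */
        unsigned long utime, stime, gtime, irqtime;     /* in ticks */
};

/* Stand-in for each CPU's currently running task. */
static struct toy_task *cpu_curr[NR_CPUS];

/* One sampling pass, run from the housekeeping CPU once per tick. */
static void sample_remote_cpus(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                struct toy_task *t = cpu_curr[cpu];

                if (!t)
                        continue;       /* cpu idle, nothing to account */

                switch (t->ctx) {       /* remote reads only, no IPI */
                case CTX_USER:   t->utime++;    break;
                case CTX_GUEST:  t->gtime++;    break;
                case CTX_IRQ:    t->irqtime++;  break;
                case CTX_SYSTEM: t->stime++;    break;
                }
        }
}

int main(void)
{
        struct toy_task a = { .comm = "user-loop",  .ctx = CTX_USER  };
        struct toy_task b = { .comm = "guest-loop", .ctx = CTX_GUEST };

        cpu_curr[1] = &a;
        cpu_curr[2] = &b;

        for (int tick = 0; tick < 100; tick++)
                sample_remote_cpus();

        printf("%s: utime=%lu ticks\n", a.comm, a.utime);
        printf("%s: gtime=%lu ticks\n", b.comm, b.gtime);
        return 0;
}

The sampling side is nothing but remote reads of already-published per-cpu
and per-task state; the hard part is keeping rq->curr and the task state
stable enough to look at, which is where the runqueue lock (or something
cheaper) would come in.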

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Paul E. McKenney
On Mon, May 04, 2015 at 03:59:02PM -0400, Rik van Riel wrote:
> On 05/04/2015 03:39 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> 
> >> In case of the non-preemptible RCU, we could easily also
> >> increase current->rcu_read_lock_nesting at the same time
> >> we increase the preempt counter, and use that as the
> >> indicator to test whether the cpu is in an extended
> >> rcu quiescent state. That way there would be no extra
> >> overhead at syscall entry or exit at all. The trick
> >> would be getting the preempt count and the rcu read
> >> lock nesting count in the same cache line for each task.
> > 
> > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > counter in production kernels.  Even if there was, we have to sample this
> > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > would be where kernel entry/exit is, so I expect that this would be a
> > net loss in overall performance.
> 
> CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT.
> Not sure why, but they are :)

Well, they used to be independent.  But the "depends" clauses force
them.  You cannot have TREE_RCU unless !PREEMPT && SMP.

> >> In case of the preemptible RCU scheme, we would have to
> >> examine the per-task state (under the runqueue lock)
> >> to get the current task info of all CPUs, and in
> >> addition wait for the blkd_tasks list to empty out
> >> when doing a synchronize_rcu().
> >>
> >> That does not appear to require special per-cpu
> >> counters; examining the per-cpu rdp and the lists
> >> inside it, with the rnp->lock held if doing any
> >> list manipulation, looks like it would be enough.
> >>
> >> However, the current code is a lot more complicated
> >> than that. Am I overlooking something obvious, Paul?
> >> Maybe something non-obvious? :)
> > 
> > Ummm...  The need to maintain memory ordering when sampling task
> > state from remote CPUs?
> > 
> > Or am I completely confused about what you are suggesting?
> > 
> > That said, are you chasing a real system-visible performance issue
> > that you tracked to RCU's dyntick-idle system?
> 
> The goal is to reduce the syscall overhead of nohz_full.
> 
> Part of the overhead is in the vtime updates, part of it is
> in the way RCU extended quiescent state is tracked.

OK, as long as it is actual measurements rather than guesswork.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Paul E. McKenney
On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> >>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> >>
>  In fact, would we be able to simply use tsk->rcu_read_lock_nesting
>  as an indicator of whether or not we should bother waiting on that
>  task or CPU when doing synchronize_rcu?
> >>>
> >>> Depends on exactly what you are asking.  If you are asking if I could add
> >>> a few more checks to preemptible RCU and speed up grace-period detection
> >>> in a number of cases, the answer is very likely "yes".  This is on my
> >>> list, but not particularly high priority.  If you are asking whether
> >>> CPU 0 could access ->rcu_read_lock_nesting of some task running on
> >>> some other CPU, in theory, the answer is "yes", but in practice that
> >>> would require putting full memory barriers in both rcu_read_lock()
> >>> and rcu_read_unlock(), so the real answer is "no".
> >>>
> >>> Or am I missing your point?
> >>
> >> The main question is "how can we greatly reduce the overhead
> >> of nohz_full, by simplifying the RCU extended quiescent state
> >> code called in the syscall fast path, and maybe piggyback on
> >> that to do time accounting for remote CPUs?"
> >>
> >> Your memory barrier answer above makes it clear we will still
> >> want to do the RCU stuff at syscall entry & exit time, at least
> >> on x86, where we already have automatic and implicit memory
> >> barriers.
> > 
> > We do need to keep in mind that x86's automatic and implicit memory
> > barriers do not order prior stores against later loads.
> > 
> > Hmmm...  But didn't earlier performance measurements show that the bulk of
> > the overhead was the delta-time computations rather than RCU accounting?
> 
> The bulk of the overhead was disabling and re-enabling
> irqs around the calls to rcu_user_exit and rcu_user_enter :)

Really???  OK...  How about software irq masking?  (I know, that is
probably a bit of a scary change as well.)

> Of the remaining time, about 2/3 seems to be the vtime
> stuff, and the other 1/3 the rcu code.

OK, worth some thought, then.

> I suspect it makes sense to optimize both, though the
> vtime code may be the easiest :)

Making a crude version that does jiffies (or whatever) instead of
fine-grained computations might give good bang for the buck.  ;-)
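
For example, the crude bookkeeping at each boundary could shrink to a single
jiffies subtraction, along the lines of this stand-alone toy (invented names,
nothing taken from the real vtime code):

#include <stdio.h>

/*
 * Toy only -- not the kernel's vtime code.  The point: the per-transition
 * work becomes "read the jiffies counter, subtract, store" instead of a
 * fine-grained clock read plus nanosecond bookkeeping.
 */
static unsigned long jiffies;           /* advanced by the (possibly remote) tick */

struct toy_vtime {
        unsigned long last_jiffies;     /* snapshot taken at the previous transition */
        unsigned long user_ticks;
};

/* user->kernel transition in the crude scheme */
static void toy_account_user(struct toy_vtime *vt)
{
        vt->user_ticks += jiffies - vt->last_jiffies;
        vt->last_jiffies = jiffies;
}

int main(void)
{
        struct toy_vtime vt = { .last_jiffies = 0, .user_ticks = 0 };

        jiffies = 3;                    /* pretend 3 ticks passed in user mode */
        toy_account_user(&vt);

        jiffies = 10;                   /* 7 more ticks in user mode */
        toy_account_user(&vt);

        printf("user_ticks = %lu\n", vt.user_ticks);    /* prints 10 */
        return 0;
}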

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Rik van Riel
On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
>> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
>>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
>>
 In fact, would we be able to simply use tsk->rcu_read_lock_nesting
 as an indicator of whether or not we should bother waiting on that
 task or CPU when doing synchronize_rcu?
>>>
>>> Depends on exactly what you are asking.  If you are asking if I could add
>>> a few more checks to preemptible RCU and speed up grace-period detection
>>> in a number of cases, the answer is very likely "yes".  This is on my
>>> list, but not particularly high priority.  If you are asking whether
>>> CPU 0 could access ->rcu_read_lock_nesting of some task running on
>>> some other CPU, in theory, the answer is "yes", but in practice that
>>> would require putting full memory barriers in both rcu_read_lock()
>>> and rcu_read_unlock(), so the real answer is "no".
>>>
>>> Or am I missing your point?
>>
>> The main question is "how can we greatly reduce the overhead
>> of nohz_full, by simplifying the RCU extended quiescent state
>> code called in the syscall fast path, and maybe piggyback on
>> that to do time accounting for remote CPUs?"
>>
>> Your memory barrier answer above makes it clear we will still
>> want to do the RCU stuff at syscall entry & exit time, at least
>> on x86, where we already have automatic and implicit memory
>> barriers.
> 
> We do need to keep in mind that x86's automatic and implicit memory
> barriers do not order prior stores against later loads.
> 
> Hmmm...  But didn't earlier performance measurements show that the bulk of
> the overhead was the delta-time computations rather than RCU accounting?

The bulk of the overhead was disabling and re-enabling
irqs around the calls to rcu_user_exit and rcu_user_enter :)

Of the remaining time, about 2/3 seems to be the vtime
stuff, and the other 1/3 the rcu code.

I suspect it makes sense to optimize both, though the
vtime code may be the easiest :)

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Paul E. McKenney
On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> 
> >> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> >> as an indicator of whether or not we should bother waiting on that
> >> task or CPU when doing synchronize_rcu?
> > 
> > Depends on exactly what you are asking.  If you are asking if I could add
> > a few more checks to preemptible RCU and speed up grace-period detection
> > in a number of cases, the answer is very likely "yes".  This is on my
> > list, but not particularly high priority.  If you are asking whether
> > CPU 0 could access ->rcu_read_lock_nesting of some task running on
> > some other CPU, in theory, the answer is "yes", but in practice that
> > would require putting full memory barriers in both rcu_read_lock()
> > and rcu_read_unlock(), so the real answer is "no".
> > 
> > Or am I missing your point?
> 
> The main question is "how can we greatly reduce the overhead
> of nohz_full, by simplifying the RCU extended quiescent state
> code called in the syscall fast path, and maybe piggyback on
> that to do time accounting for remote CPUs?"
> 
> Your memory barrier answer above makes it clear we will still
> want to do the RCU stuff at syscall entry & exit time, at least
> on x86, where we already have automatic and implicit memory
> barriers.

We do need to keep in mind that x86's automatic and implicit memory
barriers do not order prior stores against later loads.
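
For the record, that is the classic store-buffering pattern.  A stand-alone
C11 litmus test of it (userspace only, nothing to do with the kernel sources)
looks like this; whether the naive driver below actually catches the
reordering in a short run is timing-dependent, but the r1 == r2 == 0 outcome
is allowed for the relaxed accesses shown and is forbidden once all four
accesses are made memory_order_seq_cst:

/* build: cc -O2 -pthread sb_litmus.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Store-buffering litmus test.  Each thread stores to its own variable and
 * then loads the other one.  x86 allows each store to sit in the store
 * buffer while the later load runs, so both threads can read 0.  Seq_cst
 * accesses emit the barrier that forbids that outcome.
 */
static atomic_int x, y;
static int r1, r2;

static void *writer_x(void *arg)
{
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
}

static void *writer_y(void *arg)
{
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        int both_zero = 0;

        for (int i = 0; i < 20000; i++) {
                pthread_t a, b;

                atomic_store(&x, 0);
                atomic_store(&y, 0);

                pthread_create(&a, NULL, writer_x, NULL);
                pthread_create(&b, NULL, writer_y, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);

                if (r1 == 0 && r2 == 0)
                        both_zero++;
        }
        printf("r1 == r2 == 0 observed %d times\n", both_zero);
        return 0;
}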

Hmmm...  But didn't earlier performance measurements show that the bulk of
the overhead was the delta-time computations rather than RCU accounting?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Rik van Riel
On 05/04/2015 03:39 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:

>> In case of the non-preemptible RCU, we could easily also
>> increase current->rcu_read_lock_nesting at the same time
>> we increase the preempt counter, and use that as the
>> indicator to test whether the cpu is in an extended
>> rcu quiescent state. That way there would be no extra
>> overhead at syscall entry or exit at all. The trick
>> would be getting the preempt count and the rcu read
>> lock nesting count in the same cache line for each task.
> 
> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> counter in production kernels.  Even if there was, we have to sample this
> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> would be where kernel entry/exit is, so I expect that this would be a
> net loss in overall performance.

CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT.
Not sure why, but they are :)

>> In case of the preemptible RCU scheme, we would have to
>> examine the per-task state (under the runqueue lock)
>> to get the current task info of all CPUs, and in
>> addition wait for the blkd_tasks list to empty out
>> when doing a synchronize_rcu().
>>
>> That does not appear to require special per-cpu
>> counters; examining the per-cpu rdp and the lists
>> inside it, with the rnp->lock held if doing any
>> list manipulation, looks like it would be enough.
>>
>> However, the current code is a lot more complicated
>> than that. Am I overlooking something obvious, Paul?
>> Maybe something non-obvious? :)
> 
> Ummm...  The need to maintain memory ordering when sampling task
> state from remote CPUs?
> 
> Or am I completely confused about what you are suggesting?
> 
> That said, are you chasing a real system-visible performance issue
> that you tracked to RCU's dyntick-idle system?

The goal is to reduce the syscall overhead of nohz_full.

Part of the overhead is in the vtime updates, part of it is
in the way RCU extended quiescent state is tracked.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Rik van Riel
On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:

>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
>> as an indicator of whether or not we should bother waiting on that
>> task or CPU when doing synchronize_rcu?
> 
> Depends on exactly what you are asking.  If you are asking if I could add
> a few more checks to preemptible RCU and speed up grace-period detection
> in a number of cases, the answer is very likely "yes".  This is on my
> list, but not particularly high priority.  If you are asking whether
> CPU 0 could access ->rcu_read_lock_nesting of some task running on
> some other CPU, in theory, the answer is "yes", but in practice that
> would require putting full memory barriers in both rcu_read_lock()
> and rcu_read_unlock(), so the real answer is "no".
> 
> Or am I missing your point?

The main question is "how can we greatly reduce the overhead
of nohz_full, by simplifying the RCU extended quiescent state
code called in the syscall fast path, and maybe piggyback on
that to do time accounting for remote CPUs?"

Your memory barrier answer above makes it clear we will still
want to do the RCU stuff at syscall entry & exit time, at least
on x86, where we already have automatic and implicit memory
barriers.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Paul E. McKenney
On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> On 05/04/2015 11:59 AM, Rik van Riel wrote:
> 
> > However, currently the RCU code seems to use a much more
> > complex counting scheme, with a different increment for
> > kernel/task use, and irq use.
> > 
> > This counter seems to be modeled on the task preempt_counter,
> > where we do care about whether we are in task context, irq
> > context, or softirq context.
> > 
> > On the other hand, the RCU code only seems to care about
> > whether or not a CPU is in an extended quiescent state,
> > or is potentially in an RCU critical section.
> > 
> > Paul, what is the reason for RCU using a complex counter,
> > instead of a simple increment for each potential kernel/RCU
> > entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> > enabled?
> 
> Looking at the code for a while more, I have not found
> any reason why the rcu dynticks counter is so complex.

For the nesting counter, please see my earlier email.

> The rdtp->dynticks atomic seems to be used as a serial
> number. Odd means the cpu is in an rcu quiescent state,
> even means it is not.

Yep.

> This test is used to verify whether or not a CPU is
> in rcu quiescent state. Presumably the atomic_add_return
> is used to add a memory barrier.
> 
>   (atomic_add_return(0, &rdtp->dynticks) & 0x1)

Yep.  It is sampled remotely, hence the need for full memory barriers.
It doesn't help to sample the counter if the sampling gets reordered
with the surrounding code.  Ditto for the increments.

By the end of the year, and hopefully much sooner, I expect to have
testing infrastructure capable of detecting ordering bugs in this code.
At which point, I can start experimenting with alternative code sequences.

But full ordering is still required, and cache misses can happen.
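
As a toy illustration of the scheme (stand-alone userspace C11, emphatically
not the kernel implementation), the serial number plus the remote check look
roughly like this; the seq_cst operations stand in for the full ordering
discussed here, and the toy simply follows the parity convention described
above:

#include <stdatomic.h>
#include <stdio.h>

/*
 * Toy model of the dynticks serial number -- not the kernel code.  What
 * matters is that enter/exit flip the parity and that the read-modify-
 * writes are fully ordered (seq_cst here), so a remote snapshot cannot be
 * reordered around the CPU's own transitions.
 */
struct toy_dynticks {
        atomic_long dynticks;
};

static void toy_eqs_enter(struct toy_dynticks *d)
{
        atomic_fetch_add(&d->dynticks, 1);      /* flip parity: now in EQS */
}

static void toy_eqs_exit(struct toy_dynticks *d)
{
        atomic_fetch_add(&d->dynticks, 1);      /* flip parity: active again */
}

/* Snapshot taken remotely by the grace-period machinery. */
static long toy_snap(struct toy_dynticks *d)
{
        return atomic_load(&d->dynticks);
}

/* Was the CPU in an EQS at snapshot time, or has it been through one since? */
static int toy_cpu_quiesced(struct toy_dynticks *d, long snap)
{
        return (snap & 1) || (atomic_load(&d->dynticks) != snap);
}

int main(void)
{
        struct toy_dynticks cpu = { 0 };        /* active at boot in this toy */
        long snap = toy_snap(&cpu);

        printf("quiesced? %d\n", toy_cpu_quiesced(&cpu, snap));   /* 0 */
        toy_eqs_enter(&cpu);                    /* e.g. return to userspace */
        printf("quiesced? %d\n", toy_cpu_quiesced(&cpu, snap));   /* 1 */
        toy_eqs_exit(&cpu);                     /* back in the kernel */
        printf("quiesced? %d\n", toy_cpu_quiesced(&cpu, snap));   /* still 1 */
        return 0;
}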

> > In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> > as an indicator of whether or not we should bother waiting on that
> > task or CPU when doing synchronize_rcu?
> 
> We seem to have two variants of __rcu_read_lock().
> 
> One increments current->rcu_read_lock_nesting, the other
> calls preempt_disable().

Yep.  The first is preemptible RCU, the second classic RCU.

> In case of the non-preemptible RCU, we could easily also
> increase current->rcu_read_lock_nesting at the same time
> we increase the preempt counter, and use that as the
> indicator to test whether the cpu is in an extended
> rcu quiescent state. That way there would be no extra
> overhead at syscall entry or exit at all. The trick
> would be getting the preempt count and the rcu read
> lock nesting count in the same cache line for each task.

But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
counter in production kernels.  Even if there was, we have to sample this
on other CPUs, so the overhead of preempt_disable() and preempt_enable()
would be where kernel entry/exit is, so I expect that this would be a
net loss in overall performance.

> In case of the preemptible RCU scheme, we would have to
> examine the per-task state (under the runqueue lock)
> to get the current task info of all CPUs, and in
> addition wait for the blkd_tasks list to empty out
> when doing a synchronize_rcu().
> 
> That does not appear to require special per-cpu
> counters; examining the per-cpu rdp and the lists
> inside it, with the rnp->lock held if doing any
> list manipulation, looks like it would be enough.
> 
> However, the current code is a lot more complicated
> than that. Am I overlooking something obvious, Paul?
> Maybe something non-obvious? :)

Ummm...  The need to maintain memory ordering when sampling task
state from remote CPUs?

Or am I completely confused about what you are suggesting?

That said, are you chasing a real system-visible performance issue
that you tracked to RCU's dyntick-idle system?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Rik van Riel
On 05/04/2015 11:59 AM, Rik van Riel wrote:

> However, currently the RCU code seems to use a much more
> complex counting scheme, with a different increment for
> kernel/task use, and irq use.
> 
> This counter seems to be modeled on the task preempt_counter,
> where we do care about whether we are in task context, irq
> context, or softirq context.
> 
> On the other hand, the RCU code only seems to care about
> whether or not a CPU is in an extended quiescent state,
> or is potentially in an RCU critical section.
> 
> Paul, what is the reason for RCU using a complex counter,
> instead of a simple increment for each potential kernel/RCU
> entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> enabled?

Looking at the code for a while more, I have not found
any reason why the rcu dynticks counter is so complex.

The rdtp->dynticks atomic seems to be used as a serial
number. Odd means the cpu is in an rcu quiescent state,
even means it is not.

This test is used to verify whether or not a CPU is
in rcu quiescent state. Presumably the atomic_add_return
is used to add a memory barrier.

        (atomic_add_return(0, &rdtp->dynticks) & 0x1)

> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> as an indicator of whether or not we should bother waiting on that
> task or CPU when doing synchronize_rcu?

We seem to have two variants of __rcu_read_lock().

One increments current->rcu_read_lock_nesting, the other
calls preempt_disable().

In case of the non-preemptible RCU, we could easily also
increase current->rcu_read_lock_nesting at the same time
we increase the preempt counter, and use that as the
indicator to test whether the cpu is in an extended
rcu quiescent state. That way there would be no extra
overhead at syscall entry or exit at all. The trick
would be getting the preempt count and the rcu read
lock nesting count in the same cache line for each task.
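
Concretely, that non-preemptible idea amounts to roughly the following sketch
(illustrative only, not a patch; the structure layout and hook names are
invented, and it waves away the ordering caveats raised elsewhere in the
thread):

#include <stdio.h>

/*
 * Sketch only.  The idea: keep a task's preempt count and rcu read-side
 * nesting count in one cache line and bump both together, so kernel
 * entry/exit does no extra work and a remote CPU can tell from a single
 * cached word whether the task might be in an RCU read-side critical
 * section.
 */
struct toy_task {
        struct {
                int preempt_count;
                int rcu_read_lock_nesting;
        } __attribute__((aligned(64))) hot;     /* both counters share a cache line */
};

/* what kernel entry / rcu_read_lock() would do in this scheme */
static void toy_enter(struct toy_task *t)
{
        t->hot.preempt_count++;
        t->hot.rcu_read_lock_nesting++;
}

/* what kernel exit / rcu_read_unlock() would do */
static void toy_exit(struct toy_task *t)
{
        t->hot.rcu_read_lock_nesting--;
        t->hot.preempt_count--;
}

/* what a remote CPU would test (modulo the memory-ordering caveats) */
static int toy_task_in_rcu_cs(const struct toy_task *t)
{
        return t->hot.rcu_read_lock_nesting != 0;
}

int main(void)
{
        struct toy_task t = { .hot = { 0, 0 } };

        printf("in rcu cs? %d\n", toy_task_in_rcu_cs(&t));      /* 0 */
        toy_enter(&t);
        printf("in rcu cs? %d\n", toy_task_in_rcu_cs(&t));      /* 1 */
        toy_exit(&t);
        printf("in rcu cs? %d\n", toy_task_in_rcu_cs(&t));      /* 0 */
        return 0;
}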

In case of the preemptible RCU scheme, we would have to
examine the per-task state (under the runqueue lock)
to get the current task info of all CPUs, and in
addition wait for the blkd_tasks list to empty out
when doing a synchronize_rcu().

That does not appear to require special per-cpu
counters; examining the per-cpu rdp and the lists
inside it, with the rnp->lock held if doing any
list manipulation, looks like it would be enough.

However, the current code is a lot more complicated
than that. Am I overlooking something obvious, Paul?
Maybe something non-obvious? :)

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: question about RCU dynticks_nesting

2015-05-04 Thread Paul E. McKenney
On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> > Isn't this racy?
> > 
> > synchronize_rcu CPU        nohz CPU
> > --------------------------------------------
> >                            set flag = 0
> > read flag = 0
> >                            return to userspace
> > set TIF_NOHZ
> > 
> > and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.
> 
> Looking at the code some more, a flag is not going to be enough.
> 
> An irq can hit while we are in kernel mode, leading to the
> task's "rcu active" counter being incremented twice.
> 
> However, currently the RCU code seems to use a much more
> complex counting scheme, with a different increment for
> kernel/task use, and irq use.
> 
> This counter seems to be modeled on the task preempt_counter,
> where we do care about whether we are in task context, irq
> context, or softirq context.
> 
> On the other hand, the RCU code only seems to care about
> whether or not a CPU is in an extended quiescent state,
> or is potentially in an RCU critical section.
> 
> Paul, what is the reason for RCU using a complex counter,
> instead of a simple increment for each potential kernel/RCU
> entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> enabled?

Heh!  I found out why the hard way.

You see, there are architectures where a CPU can enter an interrupt level
without ever exiting, and perhaps vice versa.  But only if that CPU is
non-idle at the time.  So, when a CPU enters idle, it is necessary to
reset the interrupt nesting to zero.  But that means that it is in turn
necessary to count task-level nesting separately from interrupt-level
nesting, so that we can determine when the CPU goes idle from a task-level
viewpoint.  Hence the use of masks and fields within the counter.
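
A cartoon of that split counter (with a made-up field layout, not the actual
masks and values the kernel uses) looks something like this:

#include <stdio.h>

/*
 * Cartoon only.  Low bits count irq nesting, high bits count task-level
 * nesting, so the irq part can be forcibly reset when entering idle (some
 * architectures enter an interrupt level and never report leaving it)
 * without losing track of whether the CPU is idle from the task-level
 * point of view.
 */
#define IRQ_MASK   0x0000ffffUL
#define TASK_UNIT  0x00010000UL        /* one task-level nesting step */

static unsigned long nesting = TASK_UNIT;      /* booted: one task level, no irq */

static void toy_irq_enter(void)  { nesting += 1; }
static void toy_irq_exit(void)   { nesting -= 1; }

static void toy_idle_enter(void)
{
        nesting &= ~IRQ_MASK;          /* forget possibly-unbalanced irq nesting */
        nesting -= TASK_UNIT;          /* leave the task level */
}

static void toy_idle_exit(void)  { nesting += TASK_UNIT; }

static int toy_cpu_idle(void)    { return nesting == 0; }

int main(void)
{
        toy_irq_enter();               /* an irq that is never reported as exited */
        printf("idle? %d (nesting=%#lx)\n", toy_cpu_idle(), nesting);
        toy_idle_enter();              /* the CPU enters idle anyway */
        printf("idle? %d (nesting=%#lx)\n", toy_cpu_idle(), nesting);
        toy_idle_exit();
        printf("idle? %d (nesting=%#lx)\n", toy_cpu_idle(), nesting);
        return 0;
}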

It -might- be possible to simplify this somewhat, especially now that
we have unified idle loops.  Except that I don't trust the architectures
to be reasonable about this at this point.  Furthermore, the associated
nesting checks do trigger when people are making certain types of changes
to architectures, so it is a useful debugging tool.  Which is another
reason that I am reluctant to change it.

> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> as an indicator of whether or not we should bother waiting on that
> task or CPU when doing synchronize_rcu?

Depends on exactly what you are asking.  If you are asking if I could add
a few more checks to preemptible RCU and speed up grace-period detection
in a number of cases, the answer is very likely "yes".  This is on my
list, but not particularly high priority.  If you are asking whether
CPU 0 could access ->rcu_read_lock_nesting of some task running on
some other CPU, in theory, the answer is "yes", but in practice that
would require putting full memory barriers in both rcu_read_lock()
and rcu_read_unlock(), so the real answer is "no".

Or am I missing your point?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


question about RCU dynticks_nesting

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:

> Isn't this racy?
> 
>   synchronize_rcu CPU        nohz CPU
>   --------------------------------------------
>                              set flag = 0
>   read flag = 0
>                              return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Looking at the code some more, a flag is not going to be enough.

An irq can hit while we are in kernel mode, leading to the
task's "rcu active" counter being incremented twice.

However, currently the RCU code seems to use a much more
complex counting scheme, with a different increment for
kernel/task use, and irq use.

This counter seems to be modeled on the task preempt_counter,
where we do care about whether we are in task context, irq
context, or softirq context.

On the other hand, the RCU code only seems to care about
whether or not a CPU is in an extended quiescent state,
or is potentially in an RCU critical section.

Paul, what is the reason for RCU using a complex counter,
instead of a simple increment for each potential kernel/RCU
entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
enabled?

In fact, would we be able to simply use tsk->rcu_read_lock_nesting
as an indicator of whether or not we should bother waiting on that
task or CPU when doing synchronize_rcu?
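
In sketch form (not existing kernel code, and it ignores the question of how
safe those remote reads are), the grace-period side of that idea would be
something like:

#include <stdio.h>

#define NR_CPUS 4

/* Stand-in for the remotely readable bits of a task. */
struct toy_task {
        const char *comm;
        int rcu_read_lock_nesting;
};

static struct toy_task *cpu_curr[NR_CPUS];     /* stand-in for each CPU's current task */

static void toy_synchronize_rcu(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                struct toy_task *t = cpu_curr[cpu];

                if (!t || t->rcu_read_lock_nesting == 0) {
                        printf("cpu %d: nothing to wait for\n", cpu);
                        continue;
                }
                printf("cpu %d: must wait for %s\n", cpu, t->comm);
                /* ... wait for that task to leave its read-side section ... */
        }
}

int main(void)
{
        struct toy_task user_loop = { "userspace-loop", 0 };
        struct toy_task reader    = { "rcu-reader",     2 };

        cpu_curr[1] = &user_loop;
        cpu_curr[2] = &reader;

        toy_synchronize_rcu();
        return 0;
}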

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

