Re: question about RCU dynticks_nesting
On 05/06/2015 08:59 PM, Frederic Weisbecker wrote:
> On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
>> Ingo's idea is to simply have cpu 0 check the current task
>> on all other CPUs, see whether that task is running in system
>> mode, user mode, guest mode, irq mode, etc and update that
>> task's vtime accordingly.
>>
>> I suspect the runqueue lock is probably enough to do that,
>> and between rcu state and PF_VCPU we probably have enough
>> information to see what mode the task is running in, with
>> just remote memory reads.
>
> Note that we could significantly reduce the overhead of vtime
> accounting by only accumulating utime/stime in per-cpu buffers and
> actually accounting them on context switch or task_cputime() calls.
> That way we remove the overhead of the account_user/system_time()
> functions and the vtime locks.
>
> But doing the accounting from CPU 0 by just accounting 1 tick to the
> context we remotely observe would certainly reduce the local
> accounting overhead to the strict minimum. And I think we shouldn't
> even lock the rq for that; we can live with some lack of precision.

We can live with a lack of precision, but we cannot live with data
structures being re-used and pointers pointing off into la-la land
while we are following them :)

> Now we must expect quite some overhead on CPU 0. Perhaps it should be
> an option as I'm not sure every full dynticks usecase wants that.

Let's see if I can get this to work before deciding whether we need
yet another configurable option :)  It may be possible to have most of
the overhead happen in schedulable context, maybe softirq code.

Right now I am still stuck in the giant spaghetti mess under
account_process_tick, with dozens of functions that only work on
cpu-local, task-local, or architecture-dependently cpu- or task-local
data...
-- 
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
>> On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
>>> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
>>>> Hmmm... But didn't earlier performance measurements show that the
>>>> bulk of the overhead was the delta-time computations rather than
>>>> RCU accounting?
>>>
>>> The bulk of the overhead was disabling and re-enabling irqs around
>>> the calls to rcu_user_exit and rcu_user_enter :)
>>
>> Really??? OK... How about software irq masking? (I know, that is
>> probably a bit of a scary change as well.)
>>
>>> Of the remaining time, about 2/3 seems to be the vtime stuff, and
>>> the other 1/3 the rcu code.
>>
>> OK, worth some thought, then.
>>
>>> I suspect it makes sense to optimize both, though the vtime code
>>> may be the easiest :)
>>
>> Making a crude version that does jiffies (or whatever) instead of
>> fine-grained computations might give good bang for the buck. ;-)
>
> Ingo's idea is to simply have cpu 0 check the current task on all
> other CPUs, see whether that task is running in system mode, user
> mode, guest mode, irq mode, etc and update that task's vtime
> accordingly.
>
> I suspect the runqueue lock is probably enough to do that, and
> between rcu state and PF_VCPU we probably have enough information to
> see what mode the task is running in, with just remote memory reads.

Note that we could significantly reduce the overhead of vtime
accounting by only accumulating utime/stime in per-cpu buffers and
actually accounting them on context switch or task_cputime() calls.
That way we remove the overhead of the account_user/system_time()
functions and the vtime locks.

But doing the accounting from CPU 0 by just accounting 1 tick to the
context we remotely observe would certainly reduce the local
accounting overhead to the strict minimum. And I think we shouldn't
even lock the rq for that; we can live with some lack of precision.

Now we must expect quite some overhead on CPU 0. Perhaps it should be
an option as I'm not sure every full dynticks usecase wants that.
Re: question about RCU dynticks_nesting
On Wed, 2015-05-06 at 08:52 +0200, Mike Galbraith wrote:
> On Tue, 2015-05-05 at 23:06 -0700, Paul E. McKenney wrote:
>>> 1 * stat() on isolated cpu
>>>
>>>         NO_HZ_FULL off  inactive   housekeeper  nohz_full
>>> real    0m14.266s       0m14.367s  0m20.427s    0m27.921s
>>> user    0m1.756s        0m1.553s   0m1.976s     0m10.447s
>>> sys     0m12.508s       0m12.769s  0m18.400s    0m17.464s
>>> (real)  1.000           1.007      1.431        1.957
>>
>> Does the attached patch help at all?
>
>         nohz_full
> real    0m27.073s
> user    0m9.423s
> sys     0m17.602s
>
> Not a complete retest, and a pull in between, but I'd say that's a no.

(well, a second is a nice "at all", just not a _huge_ "at all" ;)
Re: question about RCU dynticks_nesting
On Tue, 2015-05-05 at 23:06 -0700, Paul E. McKenney wrote:
>> 1 * stat() on isolated cpu
>>
>>         NO_HZ_FULL off  inactive   housekeeper  nohz_full
>> real    0m14.266s       0m14.367s  0m20.427s    0m27.921s
>> user    0m1.756s        0m1.553s   0m1.976s     0m10.447s
>> sys     0m12.508s       0m12.769s  0m18.400s    0m17.464s
>> (real)  1.000           1.007      1.431        1.957
>
> Does the attached patch help at all?

        nohz_full
real    0m27.073s
user    0m9.423s
sys     0m17.602s

Not a complete retest, and a pull in between, but I'd say that's a no.

	-Mike
Re: question about RCU dynticks_nesting
On Wed, May 06, 2015 at 05:44:54AM +0200, Mike Galbraith wrote:
> On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
>> On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
>>
>>> You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring
>>> with RCU_FAST_NO_HZ=n?
>>
>> FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n. (I didn't
>> profile to see where costs lie though)
>
> (did that)

Nice, thank you!!!

> 1 * stat() on isolated cpu
>
>         NO_HZ_FULL off  inactive   housekeeper  nohz_full
> real    0m14.266s       0m14.367s  0m20.427s    0m27.921s
> user    0m1.756s        0m1.553s   0m1.976s     0m10.447s
> sys     0m12.508s       0m12.769s  0m18.400s    0m17.464s
> (real)  1.000           1.007      1.431        1.957
>
>  inactive                                   housekeeper                                nohz_full
> ------------------------------------------------------------------------------------------------------------------------
>  7.61%  [.] __xstat64                       11.12%  [k] context_tracking_exit           7.41%  [k] context_tracking_exit
>  7.04%  [k] system_call                      6.18%  [k] context_tracking_enter          6.02%  [k] native_sched_clock
>  6.96%  [k] copy_user_enhanced_fast_string   5.18%  [.] __xstat64                       4.69%  [k] rcu_eqs_enter_common.isra.37
>  6.57%  [k] path_init                        4.89%  [k] system_call                     4.35%  [k] _raw_spin_lock
>  5.92%  [k] system_call_after_swapgs         4.84%  [k] copy_user_enhanced_fast_string  4.30%  [k] context_tracking_enter
>  5.44%  [k] lockref_put_return               4.46%  [k] path_init                       4.25%  [k] kmem_cache_alloc
>  4.69%  [k] link_path_walk                   4.30%  [k] system_call_after_swapgs        4.14%  [.] __xstat64
>  4.47%  [k] lockref_get_not_dead             4.12%  [k] kmem_cache_free                 3.89%  [k] rcu_eqs_exit_common.isra.38
>  4.46%  [k] kmem_cache_free                  3.78%  [k] link_path_walk                  3.50%  [k] system_call
>  4.20%  [k] kmem_cache_alloc                 3.62%  [k] lockref_put_return              3.48%  [k] copy_user_enhanced_fast_string
>  4.09%  [k] cp_new_stat                      3.43%  [k] kmem_cache_alloc                3.02%  [k] system_call_after_swapgs
>  3.38%  [k] vfs_getattr_nosec                2.95%  [k] lockref_get_not_dead            2.97%  [k] kmem_cache_free
>  2.82%  [k] vfs_fstatat                      2.87%  [k] cp_new_stat                     2.88%  [k] lockref_put_return
>  2.60%  [k] user_path_at_empty               2.62%  [k] syscall_trace_leave             2.61%  [k] link_path_walk
>  2.47%  [k] path_lookupat                    1.91%  [k] vfs_getattr_nosec               2.58%  [k] path_init
>  2.14%  [k] strncpy_from_user                1.89%  [k] syscall_trace_enter_phase1      2.15%  [k] lockref_get_not_dead
>  2.11%  [k] getname_flags                    1.77%  [k] path_lookupat                   2.04%  [k] cp_new_stat
>  2.10%  [k] generic_fillattr                 1.67%  [k] complete_walk                   1.89%  [k] generic_fillattr
>  2.05%  [.] main                             1.65%  [k] vfs_fstatat                     1.67%  [k] syscall_trace_leave
>  1.89%  [k] complete_walk                    1.56%  [k] generic_fillattr                1.59%  [k] vfs_getattr_nosec
>  1.73%  [k] generic_permission               1.55%  [k] user_path_at_empty              1.49%  [k] get_vtime_delta
>  1.50%  [k] system_call_fastpath             1.54%  [k] strncpy_from_user               1.32%  [k] user_path_at_empty
>  1.37%  [k] legitimize_mnt                   1.53%  [k] getname_flags                   1.30%  [k] syscall_trace_enter_phase1
>  1.30%  [k] dput                             1.46%  [k] legitimize_mnt                  1.21%  [k] rcu_eqs_exit
>  1.26%  [k] putname                          1.34%  [.] main                            1.21%  [k] vfs_fstatat
>  1.19%  [k] path_put                         1.32%  [k] int_with_check                  1.18%  [k] path_lookupat
>  1.18%  [k] filename_lookup                  1.28%  [k] generic_permission              1.15%  [k] getname_flags
>  1.01%  [k] SYSC_newstat                     1.16%  [k] int_very_careful                1.03%  [k] strncpy_from_user
>  0.96%  [k] mntput_no_expire                 1.04%  [k] putname                         1.01%  [k] account_system_time
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 05:09:23PM -0400, Rik van Riel wrote:
> On 05/05/2015 02:35 PM, Paul E. McKenney wrote:
>> On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
>>> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
>>>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
>>>>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
>>>>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no
>>>>>> preempt counter in production kernels. Even if there was, we
>>>>>> have to sample this on other CPUs, so the overhead of
>>>>>> preempt_disable() and preempt_enable() would be where kernel
>>>>>> entry/exit is, so I expect that this would be a net loss in
>>>>>> overall performance.
>>>>>
>>>>> We unconditionally have the preempt_count, it's just not used
>>>>> much for PREEMPT_COUNT=n kernels.
>>>>
>>>> We have the field, you mean? I might be missing something, but it
>>>> still appears to me that preempt_disable() does nothing for
>>>> PREEMPT=n kernels. So what am I missing?
>>>
>>> There's another layer of accessors that can in fact manipulate the
>>> preempt_count even for !PREEMPT_COUNT kernels. They are currently
>>> used by things like pagefault_disable().
>>
>> OK, fair enough.
>>
>> I am going to focus first on getting rid of (or at least greatly
>> reducing) RCU's interrupt disabling on the user-kernel entry/exit
>> paths, since that seems to be the biggest cost.
>
> Interrupts are already disabled on kernel-user and kernel-guest
> switches. Paolo and I have patches to move a bunch of the calls to
> user_enter, user_exit, guest_enter, and guest_exit to places where
> interrupts are already disabled, so we do not need to disable them
> again.
>
> With those in place, the vtime calculations are the largest CPU
> user. I am working on those.

OK, so I should stop worrying about making rcu_user_enter() and
rcu_user_exit() operate with interrupts disabled, and instead think
about the overhead of the operations themselves. Probably starting
from Mike Galbraith's profile (thank you!) unless Rik has some reason
to believe that it is nonrepresentative.

							Thanx, Paul
Re: question about RCU dynticks_nesting
On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
>
>> You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring
>> with RCU_FAST_NO_HZ=n?
>
> FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n. (I didn't
> profile to see where costs lie though)

(did that)

1 * stat() on isolated cpu

        NO_HZ_FULL off  inactive   housekeeper  nohz_full
real    0m14.266s       0m14.367s  0m20.427s    0m27.921s
user    0m1.756s        0m1.553s   0m1.976s     0m10.447s
sys     0m12.508s       0m12.769s  0m18.400s    0m17.464s
(real)  1.000           1.007      1.431        1.957

 inactive                                   housekeeper                                nohz_full
------------------------------------------------------------------------------------------------------------------------
 7.61%  [.] __xstat64                       11.12%  [k] context_tracking_exit           7.41%  [k] context_tracking_exit
 7.04%  [k] system_call                      6.18%  [k] context_tracking_enter          6.02%  [k] native_sched_clock
 6.96%  [k] copy_user_enhanced_fast_string   5.18%  [.] __xstat64                       4.69%  [k] rcu_eqs_enter_common.isra.37
 6.57%  [k] path_init                        4.89%  [k] system_call                     4.35%  [k] _raw_spin_lock
 5.92%  [k] system_call_after_swapgs         4.84%  [k] copy_user_enhanced_fast_string  4.30%  [k] context_tracking_enter
 5.44%  [k] lockref_put_return               4.46%  [k] path_init                       4.25%  [k] kmem_cache_alloc
 4.69%  [k] link_path_walk                   4.30%  [k] system_call_after_swapgs        4.14%  [.] __xstat64
 4.47%  [k] lockref_get_not_dead             4.12%  [k] kmem_cache_free                 3.89%  [k] rcu_eqs_exit_common.isra.38
 4.46%  [k] kmem_cache_free                  3.78%  [k] link_path_walk                  3.50%  [k] system_call
 4.20%  [k] kmem_cache_alloc                 3.62%  [k] lockref_put_return              3.48%  [k] copy_user_enhanced_fast_string
 4.09%  [k] cp_new_stat                      3.43%  [k] kmem_cache_alloc                3.02%  [k] system_call_after_swapgs
 3.38%  [k] vfs_getattr_nosec                2.95%  [k] lockref_get_not_dead            2.97%  [k] kmem_cache_free
 2.82%  [k] vfs_fstatat                      2.87%  [k] cp_new_stat                     2.88%  [k] lockref_put_return
 2.60%  [k] user_path_at_empty               2.62%  [k] syscall_trace_leave             2.61%  [k] link_path_walk
 2.47%  [k] path_lookupat                    1.91%  [k] vfs_getattr_nosec               2.58%  [k] path_init
 2.14%  [k] strncpy_from_user                1.89%  [k] syscall_trace_enter_phase1      2.15%  [k] lockref_get_not_dead
 2.11%  [k] getname_flags                    1.77%  [k] path_lookupat                   2.04%  [k] cp_new_stat
 2.10%  [k] generic_fillattr                 1.67%  [k] complete_walk                   1.89%  [k] generic_fillattr
 2.05%  [.] main                             1.65%  [k] vfs_fstatat                     1.67%  [k] syscall_trace_leave
 1.89%  [k] complete_walk                    1.56%  [k] generic_fillattr                1.59%  [k] vfs_getattr_nosec
 1.73%  [k] generic_permission               1.55%  [k] user_path_at_empty              1.49%  [k] get_vtime_delta
 1.50%  [k] system_call_fastpath             1.54%  [k] strncpy_from_user               1.32%  [k] user_path_at_empty
 1.37%  [k] legitimize_mnt                   1.53%  [k] getname_flags                   1.30%  [k] syscall_trace_enter_phase1
 1.30%  [k] dput                             1.46%  [k] legitimize_mnt                  1.21%  [k] rcu_eqs_exit
 1.26%  [k] putname                          1.34%  [.] main                            1.21%  [k] vfs_fstatat
 1.19%  [k] path_put                         1.32%  [k] int_with_check                  1.18%  [k] path_lookupat
 1.18%  [k] filename_lookup                  1.28%  [k] generic_permission              1.15%  [k] getname_flags
 1.01%  [k] SYSC_newstat                     1.16%  [k] int_very_careful                1.03%  [k] strncpy_from_user
 0.96%  [k] mntput_no_expire                 1.04%  [k] putname                         1.01%  [k] account_system_time
 0.79%  [k] path_cleanup                     0.94%  [k] dput                            1.00%  [k] complete_walk
 0.79%  [k] mntput                           0.91%  [k] context_tracking_user_exit      0.99%  [k]
Re: question about RCU dynticks_nesting
On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:

> You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring
> with RCU_FAST_NO_HZ=n?

FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n. (I didn't
profile to see where costs lie though)

	-Mike
Re: question about RCU dynticks_nesting
On 05/05/2015 02:35 PM, Paul E. McKenney wrote:
> On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
>> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
>>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
>>>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
>>>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no
>>>>> preempt counter in production kernels. Even if there was, we have
>>>>> to sample this on other CPUs, so the overhead of
>>>>> preempt_disable() and preempt_enable() would be where kernel
>>>>> entry/exit is, so I expect that this would be a net loss in
>>>>> overall performance.
>>>>
>>>> We unconditionally have the preempt_count, it's just not used much
>>>> for PREEMPT_COUNT=n kernels.
>>>
>>> We have the field, you mean? I might be missing something, but it
>>> still appears to me that preempt_disable() does nothing for
>>> PREEMPT=n kernels. So what am I missing?
>>
>> There's another layer of accessors that can in fact manipulate the
>> preempt_count even for !PREEMPT_COUNT kernels. They are currently
>> used by things like pagefault_disable().
>
> OK, fair enough.
>
> I am going to focus first on getting rid of (or at least greatly
> reducing) RCU's interrupt disabling on the user-kernel entry/exit
> paths, since that seems to be the biggest cost.

Interrupts are already disabled on kernel-user and kernel-guest
switches. Paolo and I have patches to move a bunch of the calls to
user_enter, user_exit, guest_enter, and guest_exit to places where
interrupts are already disabled, so we do not need to disable them
again.

With those in place, the vtime calculations are the largest CPU user.
I am working on those.
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
>>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
>>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no
>>>> preempt counter in production kernels. Even if there was, we have
>>>> to sample this on other CPUs, so the overhead of preempt_disable()
>>>> and preempt_enable() would be where kernel entry/exit is, so I
>>>> expect that this would be a net loss in overall performance.
>>>
>>> We unconditionally have the preempt_count, it's just not used much
>>> for PREEMPT_COUNT=n kernels.
>>
>> We have the field, you mean? I might be missing something, but it
>> still appears to me that preempt_disable() does nothing for
>> PREEMPT=n kernels. So what am I missing?
>
> There's another layer of accessors that can in fact manipulate the
> preempt_count even for !PREEMPT_COUNT kernels. They are currently
> used by things like pagefault_disable().

OK, fair enough.

I am going to focus first on getting rid of (or at least greatly
reducing) RCU's interrupt disabling on the user-kernel entry/exit
paths, since that seems to be the biggest cost.

							Thanx, Paul
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no
>>> preempt counter in production kernels. Even if there was, we have
>>> to sample this on other CPUs, so the overhead of preempt_disable()
>>> and preempt_enable() would be where kernel entry/exit is, so I
>>> expect that this would be a net loss in overall performance.
>>
>> We unconditionally have the preempt_count, it's just not used much
>> for PREEMPT_COUNT=n kernels.
>
> We have the field, you mean? I might be missing something, but it
> still appears to me that preempt_disable() does nothing for PREEMPT=n
> kernels. So what am I missing?

There's another layer of accessors that can in fact manipulate the
preempt_count even for !PREEMPT_COUNT kernels. They are currently used
by things like pagefault_disable().
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
>> But in non-preemptible RCU, we have PREEMPT=n, so there is no
>> preempt counter in production kernels. Even if there was, we have to
>> sample this on other CPUs, so the overhead of preempt_disable() and
>> preempt_enable() would be where kernel entry/exit is, so I expect
>> that this would be a net loss in overall performance.
>
> We unconditionally have the preempt_count, it's just not used much
> for PREEMPT_COUNT=n kernels.

We have the field, you mean? I might be missing something, but it
still appears to me that preempt_disable() does nothing for PREEMPT=n
kernels. So what am I missing?

							Thanx, Paul
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 12:51:02PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 12:48:34PM +0200, Peter Zijlstra wrote:
>> On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
>>> In case of the non-preemptible RCU, we could easily also increase
>>> current->rcu_read_lock_nesting at the same time we increase the
>>> preempt counter, and use that as the indicator to test whether the
>>> cpu is in an extended rcu quiescent state. That way there would be
>>> no extra overhead at syscall entry or exit at all. The trick would
>>> be getting the preempt count and the rcu read lock nesting count
>>> in the same cache line for each task.
>>
>> Can't do that. Remember, on x86 we have a per-cpu preempt count,
>> and your rcu_read_lock_nesting is per task.
>
> Hmm, I suppose you could do the rcu_read_lock_nesting thing in a
> per-cpu counter too and transfer that into the task_struct on context
> switch.
>
> If you manage to put both sides of that in the same cache line,
> things should not add significant overhead.
>
> You'd have to move the rcu_read_lock_nesting into the thread_info,
> which would be painful as you'd have to go touch all archs etc..

Last I tried doing that, things got really messy at context-switch
time. Perhaps I simply didn't do the save/restore in the right place?

							Thanx, Paul
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote: > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt > counter in production kernels. Even if there was, we have to sample this > on other CPUs, so the overhead of preempt_disable() and preempt_enable() > would be where kernel entry/exit is, so I expect that this would be a > net loss in overall performance. We unconditionally have the preempt_count, it's just not used much for PREEMPT_COUNT=n kernels.
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 12:48:34PM +0200, Peter Zijlstra wrote: > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote: > > In case of the non-preemptible RCU, we could easily also > > increase current->rcu_read_lock_nesting at the same time > > we increase the preempt counter, and use that as the > > indicator to test whether the cpu is in an extended > > rcu quiescent state. That way there would be no extra > > overhead at syscall entry or exit at all. The trick > > would be getting the preempt count and the rcu read > > lock nesting count in the same cache line for each task. > Can't do that. Remember, on x86 we have per-cpu preempt count, and your > rcu_read_lock_nesting is per task. Hmm, I suppose you could do the rcu_read_lock_nesting thing in a per-cpu counter too and transfer that into the task_struct on context switch. If you manage to put both sides of that in the same cache things should not add significant overhead. You'd have to move the rcu_read_lock_nesting into the thread_info, which would be painful as you'd have to go touch all archs etc..
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote: > In case of the non-preemptible RCU, we could easily also > increase current->rcu_read_lock_nesting at the same time > we increase the preempt counter, and use that as the > indicator to test whether the cpu is in an extended > rcu quiescent state. That way there would be no extra > overhead at syscall entry or exit at all. The trick > would be getting the preempt count and the rcu read > lock nesting count in the same cache line for each task. Can't do that. Remember, on x86 we have per-cpu preempt count, and your rcu_read_lock_nesting is per task.
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote: > On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote: > > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote: > > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt > > > counter in production kernels. Even if there was, we have to sample this > > > on other CPUs, so the overhead of preempt_disable() and preempt_enable() > > > would be where kernel entry/exit is, so I expect that this would be a > > > net loss in overall performance. > > We unconditionally have the preempt_count, it's just not used much for > > PREEMPT_COUNT=n kernels. > We have the field, you mean? I might be missing something, but it still > appears to me that preempt_disable() does nothing for PREEMPT=n kernels. > So what am I missing? There's another layer of accessors that can in fact manipulate the preempt_count even for !PREEMPT_COUNT kernels. They are currently used by things like pagefault_disable().
Re: question about RCU dynticks_nesting
On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
> > You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring
> > with RCU_FAST_NO_HZ=n?
> FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n. (I didn't
> profile to see where costs lie though)

(did that)

1 * stat() on isolated cpu

        NO_HZ_FULL off  inactive    housekeeper  nohz_full
real    0m14.266s       0m14.367s   0m20.427s    0m27.921s
user    0m1.756s        0m1.553s    0m1.976s     0m10.447s
sys     0m12.508s       0m12.769s   0m18.400s    0m17.464s
(real)  1.000           1.007       1.431        1.957

 inactive                                   housekeeper                                nohz_full
 --------                                   -----------                                ---------
  7.61%  [.] __xstat64                      11.12%  [k] context_tracking_exit           7.41%  [k] context_tracking_exit
  7.04%  [k] system_call                     6.18%  [k] context_tracking_enter          6.02%  [k] native_sched_clock
  6.96%  [k] copy_user_enhanced_fast_string  5.18%  [.] __xstat64                       4.69%  [k] rcu_eqs_enter_common.isra.37
  6.57%  [k] path_init                       4.89%  [k] system_call                     4.35%  [k] _raw_spin_lock
  5.92%  [k] system_call_after_swapgs        4.84%  [k] copy_user_enhanced_fast_string  4.30%  [k] context_tracking_enter
  5.44%  [k] lockref_put_return              4.46%  [k] path_init                       4.25%  [k] kmem_cache_alloc
  4.69%  [k] link_path_walk                  4.30%  [k] system_call_after_swapgs        4.14%  [.] __xstat64
  4.47%  [k] lockref_get_not_dead            4.12%  [k] kmem_cache_free                 3.89%  [k] rcu_eqs_exit_common.isra.38
  4.46%  [k] kmem_cache_free                 3.78%  [k] link_path_walk                  3.50%  [k] system_call
  4.20%  [k] kmem_cache_alloc                3.62%  [k] lockref_put_return              3.48%  [k] copy_user_enhanced_fast_string
  4.09%  [k] cp_new_stat                     3.43%  [k] kmem_cache_alloc                3.02%  [k] system_call_after_swapgs
  3.38%  [k] vfs_getattr_nosec               2.95%  [k] lockref_get_not_dead            2.97%  [k] kmem_cache_free
  2.82%  [k] vfs_fstatat                     2.87%  [k] cp_new_stat                     2.88%  [k] lockref_put_return
  2.60%  [k] user_path_at_empty              2.62%  [k] syscall_trace_leave             2.61%  [k] link_path_walk
  2.47%  [k] path_lookupat                   1.91%  [k] vfs_getattr_nosec               2.58%  [k] path_init
  2.14%  [k] strncpy_from_user               1.89%  [k] syscall_trace_enter_phase1      2.15%  [k] lockref_get_not_dead
  2.11%  [k] getname_flags                   1.77%  [k] path_lookupat                   2.04%  [k] cp_new_stat
  2.10%  [k] generic_fillattr                1.67%  [k] complete_walk                   1.89%  [k] generic_fillattr
  2.05%  [.] main                            1.65%  [k] vfs_fstatat                     1.67%  [k] syscall_trace_leave
  1.89%  [k] complete_walk                   1.56%  [k] generic_fillattr                1.59%  [k] vfs_getattr_nosec
  1.73%  [k] generic_permission              1.55%  [k] user_path_at_empty              1.49%  [k] get_vtime_delta
  1.50%  [k] system_call_fastpath            1.54%  [k] strncpy_from_user               1.32%  [k] user_path_at_empty
  1.37%  [k] legitimize_mnt                  1.53%  [k] getname_flags                   1.30%  [k] syscall_trace_enter_phase1
  1.30%  [k] dput                            1.46%  [k] legitimize_mnt                  1.21%  [k] rcu_eqs_exit
  1.26%  [k] putname                         1.34%  [.] main                            1.21%  [k] vfs_fstatat
  1.19%  [k] path_put                        1.32%  [k] int_with_check                  1.18%  [k] path_lookupat
  1.18%  [k] filename_lookup                 1.28%  [k] generic_permission              1.15%  [k] getname_flags
  1.01%  [k] SYSC_newstat                    1.16%  [k] int_very_careful                1.03%  [k] strncpy_from_user
  0.96%  [k] mntput_no_expire                1.04%  [k] putname                         1.01%  [k] account_system_time
  0.79%  [k] path_cleanup                    0.94%  [k] dput                            1.00%  [k] complete_walk
  0.79%  [k] mntput                          0.91%  [k] context_tracking_user_exit      0.99%  [k] vtime_account_user
Re: question about RCU dynticks_nesting
On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote: > You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring with > RCU_FAST_NO_HZ=n? FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n. (I didn't profile to see where costs lie though) -Mike
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 05:09:23PM -0400, Rik van Riel wrote: > On 05/05/2015 02:35 PM, Paul E. McKenney wrote: > > On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote: > > > On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote: > > > > On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote: > > > > > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote: > > > > > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt > > > > > > counter in production kernels. Even if there was, we have to sample this > > > > > > on other CPUs, so the overhead of preempt_disable() and preempt_enable() > > > > > > would be where kernel entry/exit is, so I expect that this would be a > > > > > > net loss in overall performance. > > > > > We unconditionally have the preempt_count, it's just not used much for > > > > > PREEMPT_COUNT=n kernels. > > > > We have the field, you mean? I might be missing something, but it still > > > > appears to me that preempt_disable() does nothing for PREEMPT=n kernels. > > > > So what am I missing? > > > There's another layer of accessors that can in fact manipulate the > > > preempt_count even for !PREEMPT_COUNT kernels. They are currently used by > > > things like pagefault_disable(). > > OK, fair enough. I am going to focus first on getting rid of (or at least > > greatly reducing) RCU's interrupt disabling on the user-kernel entry/exit > > paths, since that seems to be the biggest cost. > Interrupts are already disabled on kernel-user and kernel-guest switches. > Paolo and I have patches to move a bunch of the calls to user_enter, > user_exit, guest_enter, and guest_exit to places where interrupts are > already disabled, so we do not need to disable them again. > With those in place, the vtime calculations are the largest CPU user. > I am working on those. OK, so I should stop worrying about making rcu_user_enter() and rcu_user_exit() operate with interrupts disabled, and instead think about the overhead of the operations themselves. Probably starting from Mike Galbraith's profile (thank you!) unless Rik has some reason to believe that it is nonrepresentative. Thanx, Paul
Re: question about RCU dynticks_nesting
On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote: > On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote: > > On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote: > > > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote: > > > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt > > > > counter in production kernels. Even if there was, we have to sample this > > > > on other CPUs, so the overhead of preempt_disable() and preempt_enable() > > > > would be where kernel entry/exit is, so I expect that this would be a > > > > net loss in overall performance. > > > We unconditionally have the preempt_count, it's just not used much for > > > PREEMPT_COUNT=n kernels. > > We have the field, you mean? I might be missing something, but it still > > appears to me that preempt_disable() does nothing for PREEMPT=n kernels. > > So what am I missing? > There's another layer of accessors that can in fact manipulate the > preempt_count even for !PREEMPT_COUNT kernels. They are currently used by > things like pagefault_disable(). OK, fair enough. I am going to focus first on getting rid of (or at least greatly reducing) RCU's interrupt disabling on the user-kernel entry/exit paths, since that seems to be the biggest cost. Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/05/2015 02:35 PM, Paul E. McKenney wrote: > On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote: >> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote: >>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote: >>>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote: >>>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt >>>>> counter in production kernels. Even if there was, we have to sample this >>>>> on other CPUs, so the overhead of preempt_disable() and preempt_enable() >>>>> would be where kernel entry/exit is, so I expect that this would be a >>>>> net loss in overall performance. >>>> We unconditionally have the preempt_count, it's just not used much for >>>> PREEMPT_COUNT=n kernels. >>> We have the field, you mean? I might be missing something, but it still >>> appears to me that preempt_disable() does nothing for PREEMPT=n kernels. >>> So what am I missing? >> There's another layer of accessors that can in fact manipulate the >> preempt_count even for !PREEMPT_COUNT kernels. They are currently used by >> things like pagefault_disable(). > OK, fair enough. I am going to focus first on getting rid of (or at least > greatly reducing) RCU's interrupt disabling on the user-kernel entry/exit > paths, since that seems to be the biggest cost. Interrupts are already disabled on kernel-user and kernel-guest switches. Paolo and I have patches to move a bunch of the calls to user_enter, user_exit, guest_enter, and guest_exit to places where interrupts are already disabled, so we do not need to disable them again. With those in place, the vtime calculations are the largest CPU user. I am working on those. -- All rights reversed
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote: > On 05/04/2015 04:38 PM, Paul E. McKenney wrote: > > On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote: > >> On 05/04/2015 04:02 PM, Paul E. McKenney wrote: > > >>> Hmmm... But didn't earlier performance measurements show that the bulk of > >>> the overhead was the delta-time computations rather than RCU accounting? > >> > >> The bulk of the overhead was disabling and re-enabling > >> irqs around the calls to rcu_user_exit and rcu_user_enter :) > > > > Really??? OK... How about software irq masking? (I know, that is > > probably a bit of a scary change as well.) > > > >> Of the remaining time, about 2/3 seems to be the vtime > >> stuff, and the other 1/3 the rcu code. > > > > OK, worth some thought, then. > > > >> I suspect it makes sense to optimize both, though the > >> vtime code may be the easiest :) > > > > Making a crude version that does jiffies (or whatever) instead of > > fine-grained computations might give good bang for the buck. ;-) > > Ingo's idea is to simply have cpu 0 check the current task > on all other CPUs, see whether that task is running in system > mode, user mode, guest mode, irq mode, etc and update that > task's vtime accordingly. > > I suspect the runqueue lock is probably enough to do that, > and between rcu state and PF_VCPU we probably have enough > information to see what mode the task is running in, with > just remote memory reads. > > I looked at implementing the vtime bits (and am pretty sure > how to do those now), and then spent some hours looking at > the RCU bits, to see if we could not simplify both things at > once, especially considering that the current RCU context > tracking bits need to be called with irqs disabled. Remotely sampling the vtime info without memory barriers makes sense. After all, the result is statistical anyway. Unfortunately, as noted earlier, RCU correctness depends on ordering. 
The current RCU idle entry/exit code most definitely absolutely requires irqs be disabled. However, I will see if that can be changed. No promises, especially no short-term promises, but it does not feel impossible. You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring with RCU_FAST_NO_HZ=n? If that has a significant effect, an easy quick win is turning it off -- and I could then make it a boot parameter to get you back to one kernel for everyone. (The existing tick_nohz_active boot parameter already turns it off, but also turns off dyntick idle, which might be a bit excessive.) Or if there is some way that the kernel can know that the system is currently running on battery or some such. Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/04/2015 04:38 PM, Paul E. McKenney wrote: > On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote: >> On 05/04/2015 04:02 PM, Paul E. McKenney wrote: >>> Hmmm... But didn't earlier performance measurements show that the bulk of >>> the overhead was the delta-time computations rather than RCU accounting? >> >> The bulk of the overhead was disabling and re-enabling >> irqs around the calls to rcu_user_exit and rcu_user_enter :) > > Really??? OK... How about software irq masking? (I know, that is > probably a bit of a scary change as well.) > >> Of the remaining time, about 2/3 seems to be the vtime >> stuff, and the other 1/3 the rcu code. > > OK, worth some thought, then. > >> I suspect it makes sense to optimize both, though the >> vtime code may be the easiest :) > > Making a crude version that does jiffies (or whatever) instead of > fine-grained computations might give good bang for the buck. ;-) Ingo's idea is to simply have cpu 0 check the current task on all other CPUs, see whether that task is running in system mode, user mode, guest mode, irq mode, etc and update that task's vtime accordingly. I suspect the runqueue lock is probably enough to do that, and between rcu state and PF_VCPU we probably have enough information to see what mode the task is running in, with just remote memory reads. I looked at implementing the vtime bits (and am pretty sure how to do those now), and then spent some hours looking at the RCU bits, to see if we could not simplify both things at once, especially considering that the current RCU context tracking bits need to be called with irqs disabled. -- All rights reversed
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 03:59:02PM -0400, Rik van Riel wrote: > On 05/04/2015 03:39 PM, Paul E. McKenney wrote: > > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote: > >> In case of the non-preemptible RCU, we could easily also > >> increase current->rcu_read_lock_nesting at the same time > >> we increase the preempt counter, and use that as the > >> indicator to test whether the cpu is in an extended > >> rcu quiescent state. That way there would be no extra > >> overhead at syscall entry or exit at all. The trick > >> would be getting the preempt count and the rcu read > >> lock nesting count in the same cache line for each task. > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt > > counter in production kernels. Even if there was, we have to sample this > > on other CPUs, so the overhead of preempt_disable() and preempt_enable() > > would be where kernel entry/exit is, so I expect that this would be a > > net loss in overall performance. > CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT. > Not sure why, but they are :) Well, they used to be independent. But the "depends" clauses force them. You cannot have TREE_RCU unless !PREEMPT && SMP. > >> In case of the preemptible RCU scheme, we would have to > >> examine the per-task state (under the runqueue lock) > >> to get the current task info of all CPUs, and in > >> addition wait for the blkd_tasks list to empty out > >> when doing a synchronize_rcu(). > >> That does not appear to require special per-cpu > >> counters; examining the per-cpu rdp and the lists > >> inside it, with the rnp->lock held if doing any > >> list manipulation, looks like it would be enough. > >> However, the current code is a lot more complicated > >> than that. Am I overlooking something obvious, Paul? > >> Maybe something non-obvious? :) > > Ummm... The need to maintain memory ordering when sampling task > > state from remote CPUs? > > Or am I completely confused about what you are suggesting? > > That said, are you chasing a real system-visible performance issue > > that you tracked to RCU's dyntick-idle system? > The goal is to reduce the syscall overhead of nohz_full. > Part of the overhead is in the vtime updates, part of it is > in the way RCU extended quiescent state is tracked. OK, as long as it is actual measurements rather than guesswork. Thanx, Paul
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote: > On 05/04/2015 04:02 PM, Paul E. McKenney wrote: > > On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote: > >> On 05/04/2015 02:39 PM, Paul E. McKenney wrote: > >>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote: > >>>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting > >>>> as an indicator of whether or not we should bother waiting on that > >>>> task or CPU when doing synchronize_rcu? > >>> Depends on exactly what you are asking. If you are asking if I could add > >>> a few more checks to preemptible RCU and speed up grace-period detection > >>> in a number of cases, the answer is very likely "yes". This is on my > >>> list, but not particularly high priority. If you are asking whether > >>> CPU 0 could access ->rcu_read_lock_nesting of some task running on > >>> some other CPU, in theory, the answer is "yes", but in practice that > >>> would require putting full memory barriers in both rcu_read_lock() > >>> and rcu_read_unlock(), so the real answer is "no". > >>> Or am I missing your point? > >> The main question is "how can we greatly reduce the overhead > >> of nohz_full, by simplifying the RCU extended quiescent state > >> code called in the syscall fast path, and maybe piggyback on > >> that to do time accounting for remote CPUs?" > >> Your memory barrier answer above makes it clear we will still > >> want to do the RCU stuff at syscall entry & exit time, at least > >> on x86, where we already have automatic and implicit memory > >> barriers. > > We do need to keep in mind that x86's automatic and implicit memory > > barriers do not order prior stores against later loads. > > Hmmm... But didn't earlier performance measurements show that the bulk of > > the overhead was the delta-time computations rather than RCU accounting?
> The bulk of the overhead was disabling and re-enabling > irqs around the calls to rcu_user_exit and rcu_user_enter :) Really??? OK... How about software irq masking? (I know, that is probably a bit of a scary change as well.) > Of the remaining time, about 2/3 seems to be the vtime > stuff, and the other 1/3 the rcu code. OK, worth some thought, then. > I suspect it makes sense to optimize both, though the > vtime code may be the easiest :) Making a crude version that does jiffies (or whatever) instead of fine-grained computations might give good bang for the buck. ;-) Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/04/2015 04:02 PM, Paul E. McKenney wrote: > On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote: >> On 05/04/2015 02:39 PM, Paul E. McKenney wrote: >>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote: >>>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting >>>> as an indicator of whether or not we should bother waiting on that >>>> task or CPU when doing synchronize_rcu? >>> Depends on exactly what you are asking. If you are asking if I could add >>> a few more checks to preemptible RCU and speed up grace-period detection >>> in a number of cases, the answer is very likely "yes". This is on my >>> list, but not particularly high priority. If you are asking whether >>> CPU 0 could access ->rcu_read_lock_nesting of some task running on >>> some other CPU, in theory, the answer is "yes", but in practice that >>> would require putting full memory barriers in both rcu_read_lock() >>> and rcu_read_unlock(), so the real answer is "no". >>> Or am I missing your point? >> The main question is "how can we greatly reduce the overhead >> of nohz_full, by simplifying the RCU extended quiescent state >> code called in the syscall fast path, and maybe piggyback on >> that to do time accounting for remote CPUs?" >> Your memory barrier answer above makes it clear we will still >> want to do the RCU stuff at syscall entry & exit time, at least >> on x86, where we already have automatic and implicit memory >> barriers. > We do need to keep in mind that x86's automatic and implicit memory > barriers do not order prior stores against later loads. > Hmmm... But didn't earlier performance measurements show that the bulk of > the overhead was the delta-time computations rather than RCU accounting? The bulk of the overhead was disabling and re-enabling irqs around the calls to rcu_user_exit and rcu_user_enter :) Of the remaining time, about 2/3 seems to be the vtime stuff, and the other 1/3 the rcu code.
I suspect it makes sense to optimize both, though the vtime code may be the easiest :) -- All rights reversed
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote: > On 05/04/2015 02:39 PM, Paul E. McKenney wrote: > > On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote: > > >> In fact, would we be able to simply use tsk->rcu_read_lock_nesting > > >> as an indicator of whether or not we should bother waiting on that > > >> task or CPU when doing synchronize_rcu? > > Depends on exactly what you are asking. If you are asking if I could add > > a few more checks to preemptible RCU and speed up grace-period detection > > in a number of cases, the answer is very likely "yes". This is on my > > list, but not particularly high priority. If you are asking whether > > CPU 0 could access ->rcu_read_lock_nesting of some task running on > > some other CPU, in theory, the answer is "yes", but in practice that > > would require putting full memory barriers in both rcu_read_lock() > > and rcu_read_unlock(), so the real answer is "no". > > Or am I missing your point? > The main question is "how can we greatly reduce the overhead > of nohz_full, by simplifying the RCU extended quiescent state > code called in the syscall fast path, and maybe piggyback on > that to do time accounting for remote CPUs?" > Your memory barrier answer above makes it clear we will still > want to do the RCU stuff at syscall entry & exit time, at least > on x86, where we already have automatic and implicit memory > barriers. We do need to keep in mind that x86's automatic and implicit memory barriers do not order prior stores against later loads. Hmmm... But didn't earlier performance measurements show that the bulk of the overhead was the delta-time computations rather than RCU accounting? Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/04/2015 03:39 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
>> In case of the non-preemptible RCU, we could easily also increase current->rcu_read_lock_nesting at the same time we increase the preempt counter, and use that as the indicator to test whether the cpu is in an extended rcu quiescent state. That way there would be no extra overhead at syscall entry or exit at all. The trick would be getting the preempt count and the rcu read lock nesting count in the same cache line for each task.
>
> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt counter in production kernels.  Even if there was, we have to sample this on other CPUs, so the overhead of preempt_disable() and preempt_enable() would be where kernel entry/exit is, so I expect that this would be a net loss in overall performance.

CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT. Not sure why, but they are :)

>> In case of the preemptible RCU scheme, we would have to examine the per-task state (under the runqueue lock) to get the current task info of all CPUs, and in addition wait for the blkd_tasks list to empty out when doing a synchronize_rcu().
>>
>> That does not appear to require special per-cpu counters; examining the per-cpu rdp and the lists inside it, with the rnp->lock held if doing any list manipulation, looks like it would be enough.
>>
>> However, the current code is a lot more complicated than that. Am I overlooking something obvious, Paul? Maybe something non-obvious? :)
>
> Ummm...  The need to maintain memory ordering when sampling task state from remote CPUs?
>
> Or am I completely confused about what you are suggesting?
>
> That said, are you chasing a real system-visible performance issue that you tracked to RCU's dyntick-idle system?

The goal is to reduce the syscall overhead of nohz_full. Part of the overhead is in the vtime updates, part of it is in the way RCU extended quiescent state is tracked.

-- All rights reversed
Re: question about RCU dynticks_nesting
On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting as an indicator of whether or not we should bother waiting on that task or CPU when doing synchronize_rcu?
>
> Depends on exactly what you are asking.  If you are asking if I could add a few more checks to preemptible RCU and speed up grace-period detection in a number of cases, the answer is very likely "yes".  This is on my list, but not particularly high priority.  If you are asking whether CPU 0 could access ->rcu_read_lock_nesting of some task running on some other CPU, in theory, the answer is "yes", but in practice that would require putting full memory barriers in both rcu_read_lock() and rcu_read_unlock(), so the real answer is "no".
>
> Or am I missing your point?

The main question is "how can we greatly reduce the overhead of nohz_full, by simplifying the RCU extended quiescent state code called in the syscall fast path, and maybe piggyback on that to do time accounting for remote CPUs?"

Your memory barrier answer above makes it clear we will still want to do the RCU stuff at syscall entry & exit time, at least on x86, where we already have automatic and implicit memory barriers.

-- All rights reversed
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> On 05/04/2015 11:59 AM, Rik van Riel wrote:
> > However, currently the RCU code seems to use a much more complex counting scheme, with a different increment for kernel/task use, and irq use.
> >
> > This counter seems to be modeled on the task preempt_counter, where we do care about whether we are in task context, irq context, or softirq context.
> >
> > On the other hand, the RCU code only seems to care about whether or not a CPU is in an extended quiescent state, or is potentially in an RCU critical section.
> >
> > Paul, what is the reason for RCU using a complex counter, instead of a simple increment for each potential kernel/RCU entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU enabled?
>
> Looking at the code for a while more, I have not found any reason why the rcu dynticks counter is so complex.

For the nesting counter, please see my earlier email.

> The rdtp->dynticks atomic seems to be used as a serial number. Odd means the cpu is in an rcu quiescent state, even means it is not.

Yep.

> This test is used to verify whether or not a CPU is in rcu quiescent state. Presumably the atomic_add_return is used to add a memory barrier.
>
>	atomic_add_return(0, &rdtp->dynticks) & 0x1

Yep.  It is sampled remotely, hence the need for full memory barriers.  It doesn't help to sample the counter if the sampling gets reordered with the surrounding code.  Ditto for the increments.

By the end of the year, and hopefully much sooner, I expect to have testing infrastructure capable of detecting ordering bugs in this code.  At which point, I can start experimenting with alternative code sequences.  But full ordering is still required, and cache misses can happen.

> > In fact, would we be able to simply use tsk->rcu_read_lock_nesting as an indicator of whether or not we should bother waiting on that task or CPU when doing synchronize_rcu?
>
> We seem to have two variants of __rcu_read_lock().
>
> One increments current->rcu_read_lock_nesting, the other calls preempt_disable().

Yep.  The first is preemptible RCU, the second classic RCU.

> In case of the non-preemptible RCU, we could easily also increase current->rcu_read_lock_nesting at the same time we increase the preempt counter, and use that as the indicator to test whether the cpu is in an extended rcu quiescent state. That way there would be no extra overhead at syscall entry or exit at all. The trick would be getting the preempt count and the rcu read lock nesting count in the same cache line for each task.

But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt counter in production kernels.  Even if there was, we have to sample this on other CPUs, so the overhead of preempt_disable() and preempt_enable() would be where kernel entry/exit is, so I expect that this would be a net loss in overall performance.

> In case of the preemptible RCU scheme, we would have to examine the per-task state (under the runqueue lock) to get the current task info of all CPUs, and in addition wait for the blkd_tasks list to empty out when doing a synchronize_rcu().
>
> That does not appear to require special per-cpu counters; examining the per-cpu rdp and the lists inside it, with the rnp->lock held if doing any list manipulation, looks like it would be enough.
>
> However, the current code is a lot more complicated than that. Am I overlooking something obvious, Paul? Maybe something non-obvious? :)

Ummm...  The need to maintain memory ordering when sampling task state from remote CPUs?

Or am I completely confused about what you are suggesting?

That said, are you chasing a real system-visible performance issue that you tracked to RCU's dyntick-idle system?

Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/04/2015 11:59 AM, Rik van Riel wrote:
> However, currently the RCU code seems to use a much more complex counting scheme, with a different increment for kernel/task use, and irq use.
>
> This counter seems to be modeled on the task preempt_counter, where we do care about whether we are in task context, irq context, or softirq context.
>
> On the other hand, the RCU code only seems to care about whether or not a CPU is in an extended quiescent state, or is potentially in an RCU critical section.
>
> Paul, what is the reason for RCU using a complex counter, instead of a simple increment for each potential kernel/RCU entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU enabled?

Looking at the code for a while more, I have not found any reason why the rcu dynticks counter is so complex.

The rdtp->dynticks atomic seems to be used as a serial number. Odd means the cpu is in an rcu quiescent state, even means it is not.

This test is used to verify whether or not a CPU is in rcu quiescent state. Presumably the atomic_add_return is used to add a memory barrier.

	atomic_add_return(0, &rdtp->dynticks) & 0x1

> In fact, would we be able to simply use tsk->rcu_read_lock_nesting as an indicator of whether or not we should bother waiting on that task or CPU when doing synchronize_rcu?

We seem to have two variants of __rcu_read_lock().

One increments current->rcu_read_lock_nesting, the other calls preempt_disable().

In case of the non-preemptible RCU, we could easily also increase current->rcu_read_lock_nesting at the same time we increase the preempt counter, and use that as the indicator to test whether the cpu is in an extended rcu quiescent state. That way there would be no extra overhead at syscall entry or exit at all. The trick would be getting the preempt count and the rcu read lock nesting count in the same cache line for each task.

In case of the preemptible RCU scheme, we would have to examine the per-task state (under the runqueue lock) to get the current task info of all CPUs, and in addition wait for the blkd_tasks list to empty out when doing a synchronize_rcu().

That does not appear to require special per-cpu counters; examining the per-cpu rdp and the lists inside it, with the rnp->lock held if doing any list manipulation, looks like it would be enough.

However, the current code is a lot more complicated than that. Am I overlooking something obvious, Paul? Maybe something non-obvious? :)

-- All rights reversed
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> > Isn't this racy?
> >
> >	synchronize_rcu CPU        nohz CPU
> >	-----------------------------------------
> >	set flag = 0
> >	                           read flag = 0
> >	                           return to userspace
> >	set TIF_NOHZ
> >
> > and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.
>
> Looking at the code some more, a flag is not going to be enough.
>
> An irq can hit while we are in kernel mode, leading to the task's "rcu active" counter being incremented twice.
>
> However, currently the RCU code seems to use a much more complex counting scheme, with a different increment for kernel/task use, and irq use.
>
> This counter seems to be modeled on the task preempt_counter, where we do care about whether we are in task context, irq context, or softirq context.
>
> On the other hand, the RCU code only seems to care about whether or not a CPU is in an extended quiescent state, or is potentially in an RCU critical section.
>
> Paul, what is the reason for RCU using a complex counter, instead of a simple increment for each potential kernel/RCU entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU enabled?

Heh!  I found out why the hard way.

You see, there are architectures where a CPU can enter an interrupt level without ever exiting, and perhaps vice versa.  But only if that CPU is non-idle at the time.  So, when a CPU enters idle, it is necessary to reset the interrupt nesting to zero.  But that means that it is in turn necessary to count task-level nesting separately from interrupt-level nesting, so that we can determine when the CPU goes idle from a task-level viewpoint.  Hence the use of masks and fields within the counter.

It -might- be possible to simplify this somewhat, especially now that we have unified idle loops.  Except that I don't trust the architectures to be reasonable about this at this point.

Furthermore, the associated nesting checks do trigger when people are making certain types of changes to architectures, so it is a useful debugging tool.  Which is another reason that I am reluctant to change it.

> In fact, would we be able to simply use tsk->rcu_read_lock_nesting as an indicator of whether or not we should bother waiting on that task or CPU when doing synchronize_rcu?

Depends on exactly what you are asking.  If you are asking if I could add a few more checks to preemptible RCU and speed up grace-period detection in a number of cases, the answer is very likely "yes".  This is on my list, but not particularly high priority.  If you are asking whether CPU 0 could access ->rcu_read_lock_nesting of some task running on some other CPU, in theory, the answer is "yes", but in practice that would require putting full memory barriers in both rcu_read_lock() and rcu_read_unlock(), so the real answer is "no".

Or am I missing your point?

Thanx, Paul
question about RCU dynticks_nesting
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> Isn't this racy?
>
>	synchronize_rcu CPU        nohz CPU
>	-----------------------------------------
>	set flag = 0
>	                           read flag = 0
>	                           return to userspace
>	set TIF_NOHZ
>
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Looking at the code some more, a flag is not going to be enough.

An irq can hit while we are in kernel mode, leading to the task's "rcu active" counter being incremented twice.

However, currently the RCU code seems to use a much more complex counting scheme, with a different increment for kernel/task use, and irq use.

This counter seems to be modeled on the task preempt_counter, where we do care about whether we are in task context, irq context, or softirq context.

On the other hand, the RCU code only seems to care about whether or not a CPU is in an extended quiescent state, or is potentially in an RCU critical section.

Paul, what is the reason for RCU using a complex counter, instead of a simple increment for each potential kernel/RCU entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU enabled?

In fact, would we be able to simply use tsk->rcu_read_lock_nesting as an indicator of whether or not we should bother waiting on that task or CPU when doing synchronize_rcu?

-- All rights reversed
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 03:59:02PM -0400, Rik van Riel wrote:
> On 05/04/2015 03:39 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> >> In case of the non-preemptible RCU, we could easily also increase current->rcu_read_lock_nesting at the same time we increase the preempt counter, and use that as the indicator to test whether the cpu is in an extended rcu quiescent state. That way there would be no extra overhead at syscall entry or exit at all. The trick would be getting the preempt count and the rcu read lock nesting count in the same cache line for each task.
> >
> > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt counter in production kernels.  Even if there was, we have to sample this on other CPUs, so the overhead of preempt_disable() and preempt_enable() would be where kernel entry/exit is, so I expect that this would be a net loss in overall performance.
>
> CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT. Not sure why, but they are :)

Well, they used to be independent.  But the depends clauses force them.  You cannot have TREE_RCU unless !PREEMPT && SMP.

> >> In case of the preemptible RCU scheme, we would have to examine the per-task state (under the runqueue lock) to get the current task info of all CPUs, and in addition wait for the blkd_tasks list to empty out when doing a synchronize_rcu().
> >>
> >> That does not appear to require special per-cpu counters; examining the per-cpu rdp and the lists inside it, with the rnp->lock held if doing any list manipulation, looks like it would be enough.
> >>
> >> However, the current code is a lot more complicated than that. Am I overlooking something obvious, Paul? Maybe something non-obvious? :)
> >
> > Ummm...  The need to maintain memory ordering when sampling task state from remote CPUs?
> >
> > Or am I completely confused about what you are suggesting?
> >
> > That said, are you chasing a real system-visible performance issue that you tracked to RCU's dyntick-idle system?
>
> The goal is to reduce the syscall overhead of nohz_full. Part of the overhead is in the vtime updates, part of it is in the way RCU extended quiescent state is tracked.

OK, as long as it is actual measurements rather than guesswork.

Thanx, Paul
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> >>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> >>>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting as an indicator of whether or not we should bother waiting on that task or CPU when doing synchronize_rcu?
> >>>
> >>> Depends on exactly what you are asking.  If you are asking if I could add a few more checks to preemptible RCU and speed up grace-period detection in a number of cases, the answer is very likely "yes".  This is on my list, but not particularly high priority.  If you are asking whether CPU 0 could access ->rcu_read_lock_nesting of some task running on some other CPU, in theory, the answer is "yes", but in practice that would require putting full memory barriers in both rcu_read_lock() and rcu_read_unlock(), so the real answer is "no".
> >>>
> >>> Or am I missing your point?
> >>
> >> The main question is "how can we greatly reduce the overhead of nohz_full, by simplifying the RCU extended quiescent state code called in the syscall fast path, and maybe piggyback on that to do time accounting for remote CPUs?"
> >>
> >> Your memory barrier answer above makes it clear we will still want to do the RCU stuff at syscall entry & exit time, at least on x86, where we already have automatic and implicit memory barriers.
> >
> > We do need to keep in mind that x86's automatic and implicit memory barriers do not order prior stores against later loads.
> >
> > Hmmm...  But didn't earlier performance measurements show that the bulk of the overhead was the delta-time computations rather than RCU accounting?
>
> The bulk of the overhead was disabling and re-enabling irqs around the calls to rcu_user_exit and rcu_user_enter :)

Really???  OK...  How about software irq masking?  (I know, that is probably a bit of a scary change as well.)

> Of the remaining time, about 2/3 seems to be the vtime stuff, and the other 1/3 the rcu code.

OK, worth some thought, then.

> I suspect it makes sense to optimize both, though the vtime code may be the easiest :)

Making a crude version that does jiffies (or whatever) instead of fine-grained computations might give good bang for the buck.  ;-)

Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/04/2015 03:39 PM, Paul E. McKenney wrote: On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote: In case of the non-preemptible RCU, we could easily also increase current-rcu_read_lock_nesting at the same time we increase the preempt counter, and use that as the indicator to test whether the cpu is in an extended rcu quiescent state. That way there would be no extra overhead at syscall entry or exit at all. The trick would be getting the preempt count and the rcu read lock nesting count in the same cache line for each task. But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt counter in production kernels. Even if there was, we have to sample this on other CPUs, so the overhead of preempt_disable() and preempt_enable() would be where kernel entry/exit is, so I expect that this would be a net loss in overall performance. CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT. Not sure why, but they are :) In case of the preemptible RCU scheme, we would have to examine the per-task state (under the runqueue lock) to get the current task info of all CPUs, and in addition wait for the blkd_tasks list to empty out when doing a synchronize_rcu(). That does not appear to require special per-cpu counters; examining the per-cpu rdp and the lists inside it, with the rnp-lock held if doing any list manipulation, looks like it would be enough. However, the current code is a lot more complicated than that. Am I overlooking something obvious, Paul? Maybe something non-obvious? :) Ummm... The need to maintain memory ordering when sampling task state from remote CPUs? Or am I completely confused about what you are suggesting? That said, are you chasing a real system-visible performance issue that you tracked to RCU's dyntick-idle system? The goal is to reduce the syscall overhead of nohz_full. Part of the overhead is in the vtime updates, part of it is in the way RCU extended quiescent state is tracked. 
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> On 05/04/2015 11:59 AM, Rik van Riel wrote:
>> However, currently the RCU code seems to use a much more complex
>> counting scheme, with a different increment for kernel/task use,
>> and irq use. This counter seems to be modeled on the task
>> preempt_counter, where we do care about whether we are in task
>> context, irq context, or softirq context. On the other hand, the
>> RCU code only seems to care about whether or not a CPU is in an
>> extended quiescent state, or is potentially in an RCU critical
>> section.
>>
>> Paul, what is the reason for RCU using a complex counter, instead
>> of a simple increment for each potential kernel/RCU entry, like
>> rcu_read_lock() does with CONFIG_PREEMPT_RCU enabled?
>
> Looking at the code for a while more, I have not found any reason
> why the rcu dynticks counter is so complex.

For the nesting counter, please see my earlier email.

> The rdtp->dynticks atomic seems to be used as a serial number. Even
> means the cpu is in an rcu quiescent state, odd means it is not.

Yep.

> This test is used to verify whether or not a CPU is in an rcu
> quiescent state. Presumably the atomic_add_return is used to add a
> memory barrier:
>
>	(atomic_add_return(0, &rdtp->dynticks) & 0x1)

Yep. It is sampled remotely, hence the need for full memory barriers.
It doesn't help to sample the counter if the sampling gets reordered
with the surrounding code. Ditto for the increments.

By the end of the year, and hopefully much sooner, I expect to have
testing infrastructure capable of detecting ordering bugs in this
code. At which point, I can start experimenting with alternative code
sequences. But full ordering is still required, and cache misses can
happen.

>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
>> as an indicator of whether or not we should bother waiting on that
>> task or CPU when doing synchronize_rcu?
>
> We seem to have two variants of __rcu_read_lock().
> One increments current->rcu_read_lock_nesting, the other calls
> preempt_disable().

Yep. The first is preemptible RCU, the second classic RCU.

> In case of the non-preemptible RCU, we could easily also increase
> current->rcu_read_lock_nesting at the same time we increase the
> preempt counter, and use that as the indicator to test whether the
> cpu is in an extended rcu quiescent state. That way there would be
> no extra overhead at syscall entry or exit at all. The trick would
> be getting the preempt count and the rcu read lock nesting count in
> the same cache line for each task.

But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
counter in production kernels. Even if there was, we have to sample
this on other CPUs, so the overhead of preempt_disable() and
preempt_enable() would be where kernel entry/exit is, so I expect
that this would be a net loss in overall performance.

> In case of the preemptible RCU scheme, we would have to examine the
> per-task state (under the runqueue lock) to get the current task
> info of all CPUs, and in addition wait for the blkd_tasks list to
> empty out when doing a synchronize_rcu().
>
> That does not appear to require special per-cpu counters; examining
> the per-cpu rdp and the lists inside it, with the rnp->lock held if
> doing any list manipulation, looks like it would be enough.
>
> However, the current code is a lot more complicated than that. Am I
> overlooking something obvious, Paul? Maybe something non-obvious? :)

Ummm... The need to maintain memory ordering when sampling task state
from remote CPUs?

Or am I completely confused about what you are suggesting?

That said, are you chasing a real system-visible performance issue
that you tracked to RCU's dyntick-idle system?

							Thanx, Paul
Re: question about RCU dynticks_nesting
On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
>> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
>>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
>>>> In fact, would we be able to simply use
>>>> tsk->rcu_read_lock_nesting as an indicator of whether or not we
>>>> should bother waiting on that task or CPU when doing
>>>> synchronize_rcu?
>>>
>>> Depends on exactly what you are asking. If you are asking if I
>>> could add a few more checks to preemptible RCU and speed up
>>> grace-period detection in a number of cases, the answer is very
>>> likely yes. This is on my list, but not particularly high
>>> priority. If you are asking whether CPU 0 could access
>>> ->rcu_read_lock_nesting of some task running on some other CPU,
>>> in theory, the answer is yes, but in practice that would require
>>> putting full memory barriers in both rcu_read_lock() and
>>> rcu_read_unlock(), so the real answer is no.
>>>
>>> Or am I missing your point?
>>
>> The main question is how can we greatly reduce the overhead of
>> nohz_full, by simplifying the RCU extended quiescent state code
>> called in the syscall fast path, and maybe piggyback on that to do
>> time accounting for remote CPUs?
>>
>> Your memory barrier answer above makes it clear we will still want
>> to do the RCU stuff at syscall entry & exit time, at least on x86,
>> where we already have automatic and implicit memory barriers.
>
> We do need to keep in mind that x86's automatic and implicit memory
> barriers do not order prior stores against later loads.
>
> Hmmm... But didn't earlier performance measurements show that the
> bulk of the overhead was the delta-time computations rather than
> RCU accounting?

The bulk of the overhead was disabling and re-enabling irqs around
the calls to rcu_user_exit and rcu_user_enter :)

Of the remaining time, about 2/3 seems to be the vtime stuff, and the
other 1/3 the rcu code.

I suspect it makes sense to optimize both, though the vtime code may
be the easiest :)
Re: question about RCU dynticks_nesting
On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
>> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
>>> Hmmm... But didn't earlier performance measurements show that the
>>> bulk of the overhead was the delta-time computations rather than
>>> RCU accounting?
>>
>> The bulk of the overhead was disabling and re-enabling irqs around
>> the calls to rcu_user_exit and rcu_user_enter :)
>
> Really??? OK... How about software irq masking? (I know, that is
> probably a bit of a scary change as well.)
>
>> Of the remaining time, about 2/3 seems to be the vtime stuff, and
>> the other 1/3 the rcu code.
>
> OK, worth some thought, then.
>
>> I suspect it makes sense to optimize both, though the vtime code
>> may be the easiest :)
>
> Making a crude version that does jiffies (or whatever) instead of
> fine-grained computations might give good bang for the buck. ;-)

Ingo's idea is to simply have cpu 0 check the current task on all
other CPUs, see whether that task is running in system mode, user
mode, guest mode, irq mode, etc and update that task's vtime
accordingly.

I suspect the runqueue lock is probably enough to do that, and
between rcu state and PF_VCPU we probably have enough information to
see what mode the task is running in, with just remote memory reads.

I looked at implementing the vtime bits (and am pretty sure how to do
those now), and then spent some hours looking at the RCU bits, to see
if we could not simplify both things at once, especially considering
that the current RCU context tracking bits need to be called with
irqs disabled.
Re: question about RCU dynticks_nesting
On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
>> On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
>>> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
>>>> Hmmm... But didn't earlier performance measurements show that
>>>> the bulk of the overhead was the delta-time computations rather
>>>> than RCU accounting?
>>>
>>> The bulk of the overhead was disabling and re-enabling irqs
>>> around the calls to rcu_user_exit and rcu_user_enter :)
>>
>> Really??? OK... How about software irq masking? (I know, that is
>> probably a bit of a scary change as well.)
>>
>>> Of the remaining time, about 2/3 seems to be the vtime stuff, and
>>> the other 1/3 the rcu code.
>>
>> OK, worth some thought, then.
>>
>>> I suspect it makes sense to optimize both, though the vtime code
>>> may be the easiest :)
>>
>> Making a crude version that does jiffies (or whatever) instead of
>> fine-grained computations might give good bang for the buck. ;-)
>
> Ingo's idea is to simply have cpu 0 check the current task on all
> other CPUs, see whether that task is running in system mode, user
> mode, guest mode, irq mode, etc and update that task's vtime
> accordingly.
>
> I suspect the runqueue lock is probably enough to do that, and
> between rcu state and PF_VCPU we probably have enough information
> to see what mode the task is running in, with just remote memory
> reads.
>
> I looked at implementing the vtime bits (and am pretty sure how to
> do those now), and then spent some hours looking at the RCU bits,
> to see if we could not simplify both things at once, especially
> considering that the current RCU context tracking bits need to be
> called with irqs disabled.

Remotely sampling the vtime info without memory barriers makes sense.
After all, the result is statistical anyway. Unfortunately, as noted
earlier, RCU correctness depends on ordering.

The current RCU idle entry/exit code most definitely absolutely
requires irqs be disabled. However, I will see if that can be
changed. No promises, especially no short-term promises, but it does
not feel impossible.

You have RCU_FAST_NO_HZ=y, correct? Could you please try measuring
with RCU_FAST_NO_HZ=n? If that has a significant effect, easy quick
win is turning it off -- and I could then make it a boot parameter to
get you back to one kernel for everyone. (The existing
tick_nohz_active boot parameter already turns it off, but also turns
off dyntick idle, which might be a bit excessive.) Or if there is
some way that the kernel can know that the system is currently
running on battery or some such.

							Thanx, Paul