Re: [PATCH RFC] rcu: torture: shorten the time between forward-progress tests

2023-08-23 Thread Paul E. McKenney
On Tue, May 02, 2023 at 11:06:02PM +0800, zhouzho...@gmail.com wrote:
> From: Zhouyi Zhou 
> 
> Currently, the default time between rcu torture forward-progress tests is
> 60 seconds.  Under this configuration, the false positive caused by
> __stack_chk_fail [1] is difficult to reproduce (it takes 5*420 seconds on
> average for SRCU-P), which means one has to invoke [2] five times on
> average to make [1] appear.
> 
> With the time between rcu torture forward-progress tests set to 1 second,
> the above phenomenon can be reproduced within 3 minutes, which means we
> can reproduce [1] every time we invoke [2].
> 
> Although [1] is a false positive, this change will make possible future
> true bugs easier to discover.
>
> [1] Link: 
> https://lore.kernel.org/lkml/CAABZP2yS5=zuwezq7ihkv0wdm_hgo8k-teahyjrzhavzkda...@mail.gmail.com/T/
> [2] tools/testing/selftests/rcutorture/bin/torture.sh
> 
> Tested in a PPC VM at the Open Source Lab of Oregon State University.
> 
> Signed-off-by: Zhouyi Zhou 

Please accept my apologies for being ridiculously slow to reply!

In recent -rcu, module parameters such as this one that simply set a
value can be overridden on the command line.  So you could get the effect
(again, in recent kernels) in your testing by adding:

--bootargs "rcutorture.fwd_progress_holdoff=1"
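
For example, something along these lines should do it in a recent tree
(the flags other than --bootargs are illustrative, not prescriptive):

	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10 \
		--configs "SRCU-P" \
		--bootargs "rcutorture.fwd_progress_holdoff=1"

Here the --bootargs value overrides the holdoff value supplied by the
scenario's .boot file.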

The reason that I am reluctant to accept this patch is that we sometimes
have trouble with this forward-progress testing exhausting memory, and
making it happen more frequently could therefore cause trouble with generic
rcutorture
testing.

Or am I missing the point of this change?

Thanx, Paul

> ---
>  tools/testing/selftests/rcutorture/configs/rcu/SRCU-N.boot  | 1 +
>  tools/testing/selftests/rcutorture/configs/rcu/SRCU-P.boot  | 1 +
>  tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot | 1 +
>  tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot  | 1 +
>  tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot  | 1 +
>  5 files changed, 5 insertions(+)
> 
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N.boot
> index ce0694fd9b92..982582bff041 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N.boot
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N.boot
> @@ -1,2 +1,3 @@
>  rcutorture.torture_type=srcu
>  rcutorture.fwd_progress=3
> +rcutorture.fwd_progress_holdoff=1
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-P.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-P.boot
> index 2db39f298d18..18f5d7361d8a 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-P.boot
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-P.boot
> @@ -1,4 +1,5 @@
>  rcutorture.torture_type=srcud
>  rcupdate.rcu_self_test=1
>  rcutorture.fwd_progress=3
> +rcutorture.fwd_progress_holdoff=1
>  srcutree.big_cpu_lim=5
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot
> index c70b5db6c2ae..b86bc7df7603 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot
> @@ -1,2 +1,3 @@
>  rcutorture.torture_type=tasks-tracing
>  rcutorture.fwd_progress=2
> +rcutorture.fwd_progress_holdoff=1
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot
> index dd914fa8f690..933302f885df 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot
> @@ -1 +1,2 @@
>  rcutorture.fwd_progress=2
> +rcutorture.fwd_progress_holdoff=1
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot
> index dd914fa8f690..933302f885df 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot
> @@ -1 +1,2 @@
>  rcutorture.fwd_progress=2
> +rcutorture.fwd_progress_holdoff=1
> -- 
> 2.34.1
> 


Re: [PATCH] rcu: torture: ppc: Remove duplicated argument --enable-kvm

2023-03-26 Thread Paul E. McKenney
On Sun, Mar 26, 2023 at 08:24:34AM +0800, zhouzho...@gmail.com wrote:
> From: Zhouyi Zhou 
> 
> The argument --enable-kvm is duplicated because qemu_args
> in kvm-test-1-run.sh already provides it.
>   
> Signed-off-by: Zhouyi Zhou 

Good catch!  Applied, thank you!

Thanx, Paul

> ---
> Dear RCU and PPC developers
> 
> I discovered this possible minor flaw while performing RCU torture
> tests in a PPC VM at the Open Source Lab of Oregon State University.
> 
> But I can't test my patch because I am in a VM.
> 
> Thanks for your time
> 
> Cheers ;-)
> Zhouyi   
> --
>  tools/testing/selftests/rcutorture/bin/functions.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh 
> b/tools/testing/selftests/rcutorture/bin/functions.sh
> index b52d5069563c..48b9147e8c91 100644
> --- a/tools/testing/selftests/rcutorture/bin/functions.sh
> +++ b/tools/testing/selftests/rcutorture/bin/functions.sh
> @@ -250,7 +250,7 @@ identify_qemu_args () {
>   echo -machine virt,gic-version=host -cpu host
>   ;;
>   qemu-system-ppc64)
> - echo -enable-kvm -M pseries -nodefaults
> + echo -M pseries -nodefaults
>   echo -device spapr-vscsi
>   if test -n "$TORTURE_QEMU_INTERACTIVE" -a -n "$TORTURE_QEMU_MAC"
>   then
> -- 
> 2.34.1
> 


Re: [next-20230322] Kernel WARN at kernel/workqueue.c:3182 (rcutorture)

2023-03-23 Thread Paul E. McKenney
On Fri, Mar 24, 2023 at 08:47:38AM +0530, Sachin Sant wrote:
> 
> >>> Hello, Sachin, and it looks like you hit something that Zqiang and I
> >>> have been tracking down.  I am guessing that you were using modprobe
> >>> and rmmod to make this happen, and that this happened at rmmod time.
> >>> 
> >> Yes, the LTP test script rcu_torture.sh relies on modprobe to load/unload
> >> the rcutorture module.
> >> 
> >>> Whatever the reproducer, does the following patch help?
> >> 
> >> Thank you for the patch. Yes, with this patch applied, the test completes
> >> successfully without the reported warning.
> > 
> > Very good, thank you!  May we have your Tested-by?
> 
> Tested-by: Sachin Sant 


Thank you, and I will apply on the next rebase.

Thanx, Paul


Re: [next-20230322] Kernel WARN at kernel/workqueue.c:3182 (rcutorture)

2023-03-23 Thread Paul E. McKenney
On Thu, Mar 23, 2023 at 11:00:59PM +0530, Sachin Sant wrote:
> 
> >> [ 3629.243407] NIP [7fff8cd39558] 0x7fff8cd39558
> >> [ 3629.243410] LR [00010d800398] 0x10d800398
> >> [ 3629.243413] --- interrupt: c00
> >> [ 3629.243415] Code: 419dffa4 e93a0078 3941 552907be 2f89 7d20579e 
> >> 0b09 e95a0078 e91a0080 3921 7fa85000 7d204f9e <0b09> 7f23cb78 
> >> 4bfffd65 0b03 
> >> [ 3629.243430] ---[ end trace  ]---
> >> 
> >> These warnings are repeated few times. The LTP test is marked as PASS.
> >> 
> >> Git bisect point to the following patch
> >> commit f46a5170e6e7d5f836f2199fe82cdb0b4363427f
> >>srcu: Use static init for statically allocated in-module srcu_struct
> > 
> > Hello, Sachin, and it looks like you hit something that Zqiang and I
> > have been tracking down.  I am guessing that you were using modprobe
> > and rmmod to make this happen, and that this happened at rmmod time.
> > 
> Yes, the LTP test script rcu_torture.sh relies on modprobe to load/unload
> the rcutorture module.
> 
> > Whatever the reproducer, does the following patch help?
> 
> Thank you for the patch. Yes, with this patch applied, the test completes
> successfully without the reported warning.

Very good, thank you!  May we have your Tested-by?

Thanx, Paul


Re: [next-20230322] Kernel WARN at kernel/workqueue.c:3182 (rcutorture)

2023-03-23 Thread Paul E. McKenney
On Thu, Mar 23, 2023 at 04:55:54PM +0530, Sachin Sant wrote:
> While running rcutorture tests from LTP on an IBM Power10 server booted with
> 6.3.0-rc3-next-20230322 following warning is observed:
> 
> [ 3629.242831] [ cut here ]
> [ 3629.242835] WARNING: CPU: 8 PID: 614614 at kernel/workqueue.c:3182 
> __flush_work.isra.44+0x44/0x370
> [ 3629.242845] Modules linked in: rcutorture(-) torture vmac poly1305_generic 
> chacha_generic chacha20poly1305 n_gsm pps_ldisc ppp_synctty ppp_async 
> ppp_generic serport slcan can_dev slip slhc snd_hrtimer snd_seq 
> snd_seq_device snd_timer snd soundcore pcrypt crypto_user n_hdlc dummy veth 
> tun nfsv3 nfs_acl nfs lockd grace fscache netfs brd overlay exfat vfat fat 
> btrfs blake2b_generic xor raid6_pq zstd_compress xfs loop sctp ip6_udp_tunnel 
> udp_tunnel libcrc32c dm_mod bonding rfkill tls sunrpc kmem device_dax nd_pmem 
> nd_btt dax_pmem papr_scm pseries_rng libnvdimm vmx_crypto ext4 mbcache jbd2 
> sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth 
> fuse [last unloaded: ltp_uaccess(O)]
> [ 3629.242911] CPU: 8 PID: 614614 Comm: modprobe Tainted: G O 
> 6.3.0-rc3-next-20230322 #1
> [ 3629.242917] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1030.00 (NH1030_026) hv:phyp pSeries
> [ 3629.242923] NIP: c018c204 LR: c022306c CTR: 
> c02233c0
> [ 3629.242927] REGS: c005c14e3880 TRAP: 0700 Tainted: G O 
> (6.3.0-rc3-next-20230322)
> [ 3629.242932] MSR: 8282b033  CR: 
> 4800 XER: 000a
> [ 3629.242943] CFAR: c018c5b0 IRQMASK: 0 
> [ 3629.242943] GPR00: c022306c c005c14e3b20 c1401200 
> c0080c4419e8 
> [ 3629.242943] GPR04: 0001 0001 0011 
> fffe 
> [ 3629.242943] GPR08: c00efe9a8300 0001  
> c0080c42afe0 
> [ 3629.242943] GPR12: c02233c0 c00e6700  
>  
> [ 3629.242943] GPR16:    
>  
> [ 3629.242943] GPR20:    
>  
> [ 3629.242943] GPR24: c0080c443400 c0080c440f60 c0080c4418c8 
> c2abb368 
> [ 3629.242943] GPR28: c0080c440f58  c0080c4419e8 
> c0080c443400 
> [ 3629.242987] NIP [c018c204] __flush_work.isra.44+0x44/0x370
> [ 3629.242993] LR [c022306c] cleanup_srcu_struct+0x6c/0x1e0
> [ 3629.242998] Call Trace:
> [ 3629.243000] [c005c14e3b20] [c0080c440f58] 
> srcu9+0x0/0xfffef0a8 [rcutorture] (unreliable)
> [ 3629.243009] [c005c14e3bb0] [c022306c] 
> cleanup_srcu_struct+0x6c/0x1e0
> [ 3629.243015] [c005c14e3c50] [c0223428] 
> srcu_module_notify+0x68/0x180
> [ 3629.243021] [c005c14e3c90] [c019a1e0] 
> notifier_call_chain+0xc0/0x1b0
> [ 3629.243027] [c005c14e3cf0] [c019ad24] 
> blocking_notifier_call_chain+0x64/0xa0
> [ 3629.243033] [c005c14e3d30] [c024a4c8] 
> sys_delete_module+0x1f8/0x3c0
> [ 3629.243039] [c005c14e3e10] [c0037480] 
> system_call_exception+0x140/0x350
> [ 3629.243044] [c005c14e3e50] [c000d6a0] 
> system_call_common+0x160/0x2e4
> [ 3629.243050] --- interrupt: c00 at 0x7fff8cd39558
> [ 3629.243054] NIP: 7fff8cd39558 LR: 00010d800398 CTR: 
> 
> [ 3629.243057] REGS: c005c14e3e80 TRAP: 0c00 Tainted: G O 
> (6.3.0-rc3-next-20230322)
> [ 3629.243062] MSR: 8280f033  CR: 
> 28008282 XER: 
> [ 3629.243072] IRQMASK: 0 
> [ 3629.243072] GPR00: 0081 7fffe99fd9c0 7fff8ce07300 
> 00013df30ec8 
> [ 3629.243072] GPR04: 0800 000a 1999 
>  
> [ 3629.243072] GPR08: 7fff8cd98160   
>  
> [ 3629.243072] GPR12:  7fff8d5fcb50 00010d80a650 
> 00010d80a648 
> [ 3629.243072] GPR16:  0001  
> 00010d80a428 
> [ 3629.243072] GPR20: 00010d830068  7fffe99ff2f8 
> 00013df304f0 
> [ 3629.243072] GPR24:  7fffe99ff2f8 00013df30ec8 
>  
> [ 3629.243072] GPR28:  00013df30e60 00013df30ec8 
> 00013df30e60 
> [ 3629.243115] NIP [7fff8cd39558] 0x7fff8cd39558
> [ 3629.243118] LR [00010d800398] 0x10d800398
> [ 3629.243121] --- interrupt: c00
> [ 3629.243123] Code: 89292e39 f821ff71 e94d0c78 f9410068 3940 69290001 
> 0b09 fbc10080 7c7e1b78 e9230018 7d290074 7929d182 <0b09> 7c0802a6 
> fb810070 f80100a0 
> [ 3629.243138] ---[ end trace  ]---
> 
> Followed by following traces:
> 
> [ 3629.243149] [ cut here ]
> [ 3629.243152] WARNING: CPU: 8 PID: 614614 at kernel/rcu/srcutree.c:663 
> cleanup_srcu_struct+0x11c/0x1e0
> [ 3629.243159] Modules 

Re: [PATCH v2] arch/powerpc/include/asm/barrier.h: redefine rmb and wmb to lwsync

2023-02-22 Thread Paul E. McKenney
On Thu, Feb 23, 2023 at 09:31:48AM +0530, Kautuk Consul wrote:
> On 2023-02-22 09:47:19, Paul E. McKenney wrote:
> > On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> > > A link from ibm.com states:
> > > "Ensures that all instructions preceding the call to __lwsync
> > >  complete before any subsequent store instructions can be executed
> > >  on the processor that executed the function. Also, it ensures that
> > >  all load instructions preceding the call to __lwsync complete before
> > >  any subsequent load instructions can be executed on the processor
> > >  that executed the function. This allows you to synchronize between
> > >  multiple processors with minimal performance impact, as __lwsync
> > >  does not wait for confirmation from each processor."
> > > 
> > > That's why smp_rmb() and smp_wmb() are defined to lwsync.
> > > But this same understanding applies to parallel pipeline
> > > execution on each PowerPC processor.
> > > So, use the lwsync instruction for rmb() and wmb() on the PPC
> > > architectures that support it.
> > > 
> > > Signed-off-by: Kautuk Consul 
> > > ---
> > >  arch/powerpc/include/asm/barrier.h | 7 +++
> > >  1 file changed, 7 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/include/asm/barrier.h 
> > > b/arch/powerpc/include/asm/barrier.h
> > > index b95b666f0374..e088dacc0ee8 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -36,8 +36,15 @@
> > >   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> > >   */
> > >  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> > > +
> > > +/* The sub-arch has lwsync. */
> > > +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > > +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > > +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > 
> > Hmmm...
> > 
> > Does the lwsync instruction now order both cached and uncached accesses?
> > Or have there been changes so that smp_rmb() and smp_wmb() get this
> > definition, while rmb() and wmb() still get the sync instruction?
> > (Not seeing this, but I could easily be missing something.)

> Upfront I don't see any documentation that states that lwsync
> distinguishes between cached and uncached accesses.
> That's why I asked the mailing list for test results from
> kernel load testing.

I suggest giving the reference manual a very careful read.  I wish I
could be more helpful, but I found that a very long time ago, and no
longer recall exactly where it was stated.

But maybe Michael Ellerman has a pointer?

Thanx, Paul

> > > +#else
> > >  #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > >  #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > +#endif
> > >  
> > >  /* The sub-arch has lwsync */
> > >  #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > > -- 
> > > 2.31.1
> > > 


Re: [PATCH v2] arch/powerpc/include/asm/barrier.h: redefine rmb and wmb to lwsync

2023-02-22 Thread Paul E. McKenney
On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> A link from ibm.com states:
> "Ensures that all instructions preceding the call to __lwsync
>  complete before any subsequent store instructions can be executed
>  on the processor that executed the function. Also, it ensures that
>  all load instructions preceding the call to __lwsync complete before
>  any subsequent load instructions can be executed on the processor
>  that executed the function. This allows you to synchronize between
>  multiple processors with minimal performance impact, as __lwsync
>  does not wait for confirmation from each processor."
> 
> That's why smp_rmb() and smp_wmb() are defined to lwsync.
> But this same understanding applies to parallel pipeline
> execution on each PowerPC processor.
> So, use the lwsync instruction for rmb() and wmb() on the PPC
> architectures that support it.
> 
> Signed-off-by: Kautuk Consul 
> ---
>  arch/powerpc/include/asm/barrier.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h 
> b/arch/powerpc/include/asm/barrier.h
> index b95b666f0374..e088dacc0ee8 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,8 +36,15 @@
>   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
>   */
>  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> +
> +/* The sub-arch has lwsync. */
> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")

Hmmm...

Does the lwsync instruction now order both cached and uncached accesses?
Or have there been changes so that smp_rmb() and smp_wmb() get this
definition, while rmb() and wmb() still get the sync instruction?
(Not seeing this, but I could easily be missing something.)
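
(For context: the mandatory barriers rmb() and wmb() are also used to order
accesses to device, that is, cache-inhibited, memory, which the smp_
variants are not required to do.  A purely illustrative fragment follows;
the names are invented, and __raw_readl() is used because it carries no
implied barriers of its own:)

	/* Illustrative fragment only; regs is an ioremap()ed region. */
	u32 status;
	unsigned long val;

	status = __raw_readl(regs + STATUS_OFF); /* uncached MMIO load */
	rmb();	/* order the uncached load above before the cached load below */
	val = ring_buf[head];			 /* cached (DMA buffer) load */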

Thanx, Paul

> +#else
>  #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>  #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> +#endif
>  
>  /* The sub-arch has lwsync */
>  #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> -- 
> 2.31.1
> 


Re: [PATCH v2 00/24] cpu,sched: Mark arch_cpu_idle_dead() __noreturn

2023-02-15 Thread Paul E. McKenney
On Mon, Feb 13, 2023 at 11:05:34PM -0800, Josh Poimboeuf wrote:
> v2:
> - make arch_call_rest_init() and rest_init() __noreturn
> - make objtool 'global_returns' work for weak functions
> - rebase on tip/objtool/core with dependencies merged in (mingo)
> - add acks
> 
> v1.1:
> - add __noreturn to all arch_cpu_idle_dead() implementations (mpe)

With this, rcutorture no longer gets objtool complaints on x86, thank you!

Tested-by: Paul E. McKenney 

> Josh Poimboeuf (24):
>   alpha/cpu: Expose arch_cpu_idle_dead()'s prototype declaration
>   alpha/cpu: Make sure arch_cpu_idle_dead() doesn't return
>   arm/cpu: Make sure arch_cpu_idle_dead() doesn't return
>   arm64/cpu: Mark cpu_die() __noreturn
>   csky/cpu: Make sure arch_cpu_idle_dead() doesn't return
>   ia64/cpu: Mark play_dead() __noreturn
>   loongarch/cpu: Make sure play_dead() doesn't return
>   loongarch/cpu: Mark play_dead() __noreturn
>   mips/cpu: Expose play_dead()'s prototype definition
>   mips/cpu: Make sure play_dead() doesn't return
>   mips/cpu: Mark play_dead() __noreturn
>   powerpc/cpu: Mark start_secondary_resume() __noreturn
>   sh/cpu: Make sure play_dead() doesn't return
>   sh/cpu: Mark play_dead() __noreturn
>   sh/cpu: Expose arch_cpu_idle_dead()'s prototype definition
>   sparc/cpu: Mark cpu_play_dead() __noreturn
>   x86/cpu: Make sure play_dead() doesn't return
>   x86/cpu: Mark play_dead() __noreturn
>   xtensa/cpu: Make sure cpu_die() doesn't return
>   xtensa/cpu: Mark cpu_die() __noreturn
>   sched/idle: Make sure weak version of arch_cpu_idle_dead() doesn't
> return
>   objtool: Include weak functions in 'global_noreturns' check
>   init: Make arch_call_rest_init() and rest_init() __noreturn
>   sched/idle: Mark arch_cpu_idle_dead() __noreturn
> 
>  arch/alpha/kernel/process.c  |  4 +++-
>  arch/arm/kernel/smp.c|  4 +++-
>  arch/arm64/include/asm/smp.h |  2 +-
>  arch/arm64/kernel/process.c  |  2 +-
>  arch/csky/kernel/smp.c   |  4 +++-
>  arch/ia64/kernel/process.c   |  6 +++---
>  arch/loongarch/include/asm/smp.h |  2 +-
>  arch/loongarch/kernel/process.c  |  2 +-
>  arch/loongarch/kernel/smp.c  |  2 +-
>  arch/mips/include/asm/smp.h  |  2 +-
>  arch/mips/kernel/process.c   |  2 +-
>  arch/mips/kernel/smp-bmips.c |  3 +++
>  arch/mips/loongson64/smp.c   |  1 +
>  arch/parisc/kernel/process.c |  2 +-
>  arch/powerpc/include/asm/smp.h   |  2 +-
>  arch/powerpc/kernel/smp.c|  2 +-
>  arch/riscv/kernel/cpu-hotplug.c  |  2 +-
>  arch/s390/kernel/idle.c  |  2 +-
>  arch/s390/kernel/setup.c |  2 +-
>  arch/sh/include/asm/smp-ops.h|  5 +++--
>  arch/sh/kernel/idle.c|  3 ++-
>  arch/sparc/include/asm/smp_64.h  |  2 +-
>  arch/sparc/kernel/process_64.c   |  2 +-
>  arch/x86/include/asm/smp.h   |  3 ++-
>  arch/x86/kernel/process.c|  4 ++--
>  arch/xtensa/include/asm/smp.h|  2 +-
>  arch/xtensa/kernel/smp.c |  4 +++-
>  include/linux/cpu.h  |  2 +-
>  include/linux/start_kernel.h |  4 ++--
>  init/main.c  |  4 ++--
>  kernel/sched/idle.c  |  2 +-
>  tools/objtool/check.c| 11 +++
>  32 files changed, 57 insertions(+), 39 deletions(-)
> 
> -- 
> 2.39.1
> 


Re: [PATCH v3 1/7] kernel/fork: convert vma assignment to a memcpy

2023-01-26 Thread Paul E. McKenney
On Wed, Jan 25, 2023 at 05:34:49PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 16:50:01 -0800 Suren Baghdasaryan  
> wrote:
> 
> > On Wed, Jan 25, 2023 at 4:22 PM Andrew Morton  
> > wrote:
> > >
> > > On Wed, 25 Jan 2023 15:35:48 -0800 Suren Baghdasaryan  
> > > wrote:
> > >
> > > > Convert vma assignment in vm_area_dup() to a memcpy() to prevent 
> > > > compiler
> > > > errors when we add a const modifier to vma->vm_flags.
> > > >
> > > > ...
> > > >
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -482,7 +482,7 @@ struct vm_area_struct *vm_area_dup(struct 
> > > > vm_area_struct *orig)
> > > >* orig->shared.rb may be modified concurrently, but the 
> > > > clone
> > > >* will be reinitialized.
> > > >*/
> > > > - *new = data_race(*orig);
> > > > + memcpy(new, orig, sizeof(*new));
> > >
> > > The data_race() removal is unchangelogged?
> > 
> > True. I'll add a note in the changelog about that. Ideally I would
> > like to preserve it but I could not find a way to do that.
> 
> Perhaps Paul can comment?
> 
> I wonder if KCSAN knows how to detect this race, given that it's now in
> a memcpy.  I assume so.

I ran an experiment memcpy()ing between a static array and an onstack
array, and KCSAN did not complain.  But maybe I was setting it up wrong.

This is what I did:

long myid = (long)arg; /* different value for each task */
static unsigned long z1[10] = { 0 }; /* static, so shared by all tasks */
unsigned long z2[10]; /* on-stack, so private to each task */

...

memcpy(z1, z2, ARRAY_SIZE(z1) * sizeof(z1[0])); /* unmarked write to shared z1 */
for (zi = 0; zi < ARRAY_SIZE(z1); zi++)
	z2[zi] += myid;
memcpy(z2, z1, ARRAY_SIZE(z1) * sizeof(z1[0])); /* unmarked read from shared z1 */

Adding Marco on CC for his thoughts.

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-20 Thread Paul E. McKenney
On Fri, Jan 20, 2023 at 04:49:42PM +, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko  wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is 
> > > > > > > enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit 
> > > > > > > path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place 
> > > > > > > VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per 
> > > > > > > group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might 
> > > > > > not be
> > > > > > super happy about thousands of callbacks to be invoked and I do 
> > > > > > agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would 
> > > > > > still be
> > > > > > in thousands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list 
> > > > > > and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have 
> > > > > > to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 
> > > > > > 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> > 
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
> 
> ... or have two different call_rcu functions; one for munmap() and
> one for exit.  It'd be nice to use kmem_cache_free_bulk().

Good point.  kfree_rcu(p, r), where "r" is the name of the structure's
rcu_head field, is much more cache-efficient.

The penalty is that there is no callback function to do any cleanup.
There is just a kfree()/kvfree() (bulk version where applicable),
nothing else.
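
(For readers unfamiliar with the two-argument form, a minimal and purely
illustrative sketch; the structure and field names are invented, not taken
from the series under discussion:)

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {
		int data;
		struct rcu_head rh;	/* the "r" in kfree_rcu(p, r) names this field */
	};

	static void drop_foo(struct foo *p)
	{
		kfree_rcu(p, rh);	/* p is handed to kfree() after a grace period */
	}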

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-20 Thread Paul E. McKenney
On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote:
> On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  
> > > > wrote:
> > > [...]
> > > > > There are a couple of possibilities here.
> > > > >
> > > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > > and invocation of the corresponding callback was taking multiple 
> > > > > seconds,
> > > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > > order to save power by batching RCU work over multiple call_rcu()
> > > > > invocations.  If this is causing a problem for a given call site, the
> > > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > > to the old-school non-laziness, but can of course consume more power.
> > > > 
> > > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > > at the time I was profiling this issue.
> > > > Lazy RCU would be a great option to replace this patch but
> > > > unfortunately it's not the default behavior, so I would still have to
> > > > implement this batching in case lazy RCU is not enabled.
> > > > 
> > > > >
> > > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > > and the invocation of the corresponding callback in kernels built with
> > > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the 
> > > > > nohz_full
> > > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but 
> > > > > only
> > > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The 
> > > > > purpose
> > > > > of this delay is to avoid lock contention, and so this delay is 
> > > > > incurred
> > > > > only on CPUs that are queuing callbacks at a rate exceeding 
> > > > > 16K/second.
> > > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > > the added delay.  The reason for this delay is the use of a separate
> > > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > > lock contention on the main ->cblist.  This is not needed in 
> > > > > old-school
> > > > > kernels built without either CONFIG_NO_HZ_FULL=y or 
> > > > > CONFIG_RCU_NOCB_CPU=y
> > > > > (including most datacenter kernels) because in that case the callbacks
> > > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > > that there is no need for locks.
> > > > 
> > > > I believe this is the reason in my profiled case.
> > > > 
> > > > >
> > > > > Third, if you are instead seeing multiple milliseconds of CPU 
> > > > > consumed by
> > > > > call_rcu() in the common case (for example, without the aid of 
> > > > > interrupts,
> > > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > > 
> > > > I don't think I've seen such a case.
> > > > Thanks for clarifications, Paul!
> > > 
> > > Thanks for the explanation Paul. I have to say this has caught me as a
> > > surprise. There are just not enough details about the benchmark to
> > > understand what is going on but I find it rather surprising that
> > > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > > which is the callback. My naive understanding has been that call_rcu is
> > > really fast way to defer the execution to the RCU safe context to do the
> > > final cleanup.
> > 
> > If I am following along correctly (ha!), then your "induce a higher
> > overhead" should be something like "induce a higher to-kfree() latency".
> 
> Yes, this is expected.
> 
> > Of course, there already is a higher latency-to-kfree via call_rcu()
> > than via a direct call to kfree(), and callback-offload CPUs that are
> > being flooded with callbacks raise that latency a jiffy or so more in
> > order to avoid lock contention.
> > 
> > If this becomes a problem, the callback-offloading code can be a bit
> > smarter about avoiding lock contention, but need to see a real problem
> > before I make that change.  But if there is a real problem I will of
> > course fix it.
> 
> I believe that Suren claims that the call_rcu is really visible in the
> exit_mmap case. Time-to-free actual vmas shouldn't really be material
> for that path. If that happens much later on there could be some
> side effects by an increased memory consumption but that should be
> marginal. How fast exit_mmap really is should only depend on direct
> calls from that path.
> 
> But I guess we need some specific numbers from Suren to be sure what is
> going on here.

Actually, Suren did discuss these (perhaps offlist) back in August.
I was just being forgetful.  :-/

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Paul E. McKenney
On Thu, Jan 19, 2023 at 11:47:36AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney  wrote:
> >
> > On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path 
> > > > > when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs 
> > > > > into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > in thousands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either. WDYT about time domain throttling to
> > > limit draining the list to say once per second like this:
> > >
> > > void vm_area_free(struct vm_area_struct *vma)
> > > {
> > >struct mm_struct *mm = vma->vm_mm;
> > >bool drain;
> > >
> > >free_anon_vma_name(vma);
> > >
> > >spin_lock(&mm->vma_free_list.lock);
> > >list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> > >mm->vma_free_list.size++;
> > > -   drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > > +   drain = jiffies > mm->last_drain_tm + HZ;
> > >
> > >spin_unlock(&mm->vma_free_list.lock);
> > >
> > > -   if (drain)
> > > +   if (drain) {
> > >   drain_free_vmas(mm);
> > > + mm->last_drain_tm = jiffies;
> > > +   }
> > > }
> > >
> > > Ultimately we want to prevent very frequent call_rcu() calls, so
> > > throttling in the time domain seems appropriate. That's the simplest
> > > way I can think of to address your concern about a quick spike in VMA
> > > freeing. It does not place any restriction on the list size and we
> > > might have excessive dead vm_area_structs if after a large spike there
> > > are no vm_area_free() calls but I don't know if that's a real problem,
> > > so not sure we should be addressing it at this time. WDYT?
> >
> > Just to double-check, we really did try the very frequent call_rcu()
> > invocations and we really did see a problem, correct?
> 
> Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
> regressions when a process exits and all its VMAs get destroyed,
> causing a flood of call_rcu()'s.

Thank you for the reminder, real problem needs solution.  ;-)

Thanx, Paul

> > Although it is not perfect, call_rcu() is designed to take a fair amount
> > of abuse.  So if we didn't see a real problem, the frequent call_rcu()
> > invocations might be a bit simpler.


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Paul E. McKenney
On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > in thousands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> 
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either. WDYT about time domain throttling to
> limit draining the list to say once per second like this:
> 
> void vm_area_free(struct vm_area_struct *vma)
> {
>struct mm_struct *mm = vma->vm_mm;
>bool drain;
> 
>free_anon_vma_name(vma);
> 
>spin_lock(&mm->vma_free_list.lock);
>list_add(&vma->vm_free_list, &mm->vma_free_list.head);
>mm->vma_free_list.size++;
> -   drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> +   drain = jiffies > mm->last_drain_tm + HZ;
> 
>spin_unlock(&mm->vma_free_list.lock);
> 
> -   if (drain)
> +   if (drain) {
>   drain_free_vmas(mm);
> + mm->last_drain_tm = jiffies;
> +   }
> }
> 
> Ultimately we want to prevent very frequent call_rcu() calls, so
> throttling in the time domain seems appropriate. That's the simplest
> way I can think of to address your concern about a quick spike in VMA
> freeing. It does not place any restriction on the list size and we
> might have excessive dead vm_area_structs if after a large spike there
> are no vm_area_free() calls but I don't know if that's a real problem,
> so not sure we should be addressing it at this time. WDYT?

Just to double-check, we really did try the very frequent call_rcu()
invocations and we really did see a problem, correct?

Although it is not perfect, call_rcu() is designed to take a fair amount
of abuse.  So if we didn't see a real problem, the frequent call_rcu()
invocations might be a bit simpler.

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Paul E. McKenney
On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  
> > wrote:
> [...]
> > > There are a couple of possibilities here.
> > >
> > > First, if I am remembering correctly, the time between the call_rcu()
> > > and invocation of the corresponding callback was taking multiple seconds,
> > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > order to save power by batching RCU work over multiple call_rcu()
> > > invocations.  If this is causing a problem for a given call site, the
> > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > to the old-school non-laziness, but can of course consume more power.
> > 
> > That would not be the case because CONFIG_LAZY_RCU was not an option
> > at the time I was profiling this issue.
> > Lazy RCU would be a great option to replace this patch but
> > unfortunately it's not the default behavior, so I would still have to
> > implement this batching in case lazy RCU is not enabled.
> > 
> > >
> > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > and the invocation of the corresponding callback in kernels built with
> > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > of this delay is to avoid lock contention, and so this delay is incurred
> > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > the added delay.  The reason for this delay is the use of a separate
> > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > lock contention on the main ->cblist.  This is not needed in old-school
> > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > (including most datacenter kernels) because in that case the callbacks
> > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > that there is no need for locks.
> > 
> > I believe this is the reason in my profiled case.
> > 
> > >
> > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > 
> > I don't think I've seen such a case.
> > Thanks for clarifications, Paul!
> 
> Thanks for the explanation Paul. I have to say this has caught me as a
> surprise. There are just not enough details about the benchmark to
> understand what is going on but I find it rather surprising that
> call_rcu can induce a higher overhead than the actual kmem_cache_free
> which is the callback. My naive understanding has been that call_rcu is
> a really fast way to defer the execution to the RCU safe context to do the
> final cleanup.

If I am following along correctly (ha!), then your "induce a higher
overhead" should be something like "induce a higher to-kfree() latency".

Of course, there already is a higher latency-to-kfree via call_rcu()
than via a direct call to kfree(), and callback-offload CPUs that are
being flooded with callbacks raise that latency a jiffy or so more in
order to avoid lock contention.

If this becomes a problem, the callback-offloading code can be a bit
smarter about avoiding lock contention, but need to see a real problem
before I make that change.  But if there is a real problem I will of
course fix it.

Or did I miss a turn in this discussion?

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Paul E. McKenney
On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  wrote:
> >
> > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko  wrote:
> > > >
> > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is 
> > > > > > > enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit 
> > > > > > > path when
> > > > > > > multiple VMAs are being freed.
> > > > > >
> > > > > > What kind of regressions.
> > > > > >
> > > > > > > To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per 
> > > > > > > group.
> > > > > >
> > > > > > Please add some data to justify this additional complexity.
> > > > >
> > > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > > noticed when running execl test from unixbench suite. spawn test also
> > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > > enabled.
> > > >
> > > > Could you be more specific? vma freeing is async with the RCU so how
> > > > come this has resulted in a regression? Is there any heavy
> > > > rcu_synchronize in the exec path? That would be an interesting
> > > > information.
> > >
> > > No, there is no heavy rcu_synchronize() or any other additional
> > > synchronous load in the exit path. It's the call_rcu() which can block
> > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > > revealed that this function was taking multiple ms (don't recall the
> > > actual number, sorry). Paul's explanation implied that this happens
> > > due to contention on the locks taken in this function. For more
> > > in-depth details I'll have to ask Paul for help :) This code is quite
> > > complex and I don't know all the details of RCU implementation.
> >
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Lazy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.

Whew!!!  ;-)

> Thanks for clarifications, Paul!

No problem!

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Paul E. McKenney
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko  wrote:
> >
> > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path 
> > > > > when
> > > > > multiple VMAs are being freed.
> > > >
> > > > What kind of regressions.
> > > >
> > > > > To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > Please add some data to justify this additional complexity.
> > >
> > > Sorry, should have done that in the first place. A 4.3% regression was
> > > noticed when running execl test from unixbench suite. spawn test also
> > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > enabled.
> >
> > Could you be more specific? vma freeing is async with the RCU so how
> > come this has resulted in a regression? Is there any heavy
> > rcu_synchronize in the exec path? That would be an interesting
> > information.
> 
> No, there is no heavy rcu_synchronize() or any other additional
> synchronous load in the exit path. It's the call_rcu() which can block
> the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> other call_rcu()'s going on in parallel. Note that call_rcu() calls
> rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> revealed that this function was taking multiple ms (don't recall the
> actual number, sorry). Paul's explanation implied that this happens
> due to contention on the locks taken in this function. For more
> in-depth details I'll have to ask Paul for help :) This code is quite
> complex and I don't know all the details of RCU implementation.

There are a couple of possibilities here.

First, if I am remembering correctly, the time between the call_rcu()
and invocation of the corresponding callback was taking multiple seconds,
but that was because the kernel was built with CONFIG_LAZY_RCU=y in
order to save power by batching RCU work over multiple call_rcu()
invocations.  If this is causing a problem for a given call site, the
shiny new call_rcu_hurry() can be used instead.  Doing this gets back
to the old-school non-laziness, but can of course consume more power.
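
(A minimal, purely illustrative sketch of that option; the structure,
field, and function names are invented, not taken from the series under
discussion:)

	struct foo {
		struct rcu_head rh;
		/* ... payload ... */
	};

	static void free_foo_cb(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct foo, rh));
	}

	/* Default: the callback may be deferred and batched on lazy-RCU kernels. */
	static void free_foo(struct foo *p)
	{
		call_rcu(&p->rh, free_foo_cb);
	}

	/* Same semantics, but this call site opts out of laziness. */
	static void free_foo_hurry(struct foo *p)
	{
		call_rcu_hurry(&p->rh, free_foo_cb);
	}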

Second, there is a much shorter one-jiffy delay between the call_rcu()
and the invocation of the corresponding callback in kernels built with
either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
of this delay is to avoid lock contention, and so this delay is incurred
only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
invoking call_rcu() at least 16 times within a given jiffy will incur
the added delay.  The reason for this delay is the use of a separate
->nocb_bypass list.  As Suren says, this bypass list is used to reduce
lock contention on the main ->cblist.  This is not needed in old-school
kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
(including most datacenter kernels) because in that case the callbacks
enqueued by call_rcu() are touched only by the corresponding CPU, so
that there is no need for locks.

Third, if you are instead seeing multiple milliseconds of CPU consumed by
call_rcu() in the common case (for example, without the aid of interrupts,
NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

Or have I lost track of some other slow case?

Thanx, Paul


Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-13 Thread Paul E. McKenney
On Thu, Jan 12, 2023 at 08:43:14PM +0100, Peter Zijlstra wrote:
> Hi All!
> 
> The (hopefully) final respin of cpuidle vs rcu cleanup patches. Barring any
> objections I'll be queueing these patches in tip/sched/core in the next few
> days.
> 
> v2: https://lkml.kernel.org/r/20220919095939.761690...@infradead.org
> 
> These here patches clean up the mess that is cpuidle vs rcuidle.
> 
> At the end of the ride there's only one RCU_NONIDLE user left:
> 
>   arch/arm64/kernel/suspend.c:RCU_NONIDLE(__cpu_suspend_exit());
> 
> And I know Mark has been prodding that with something sharp.
> 
> The last version was tested by a number of people and I'm hoping to not have
> broken anything in the meantime ;-)
> 
> 
> Changes since v2:

150 rcutorture hours on each of the default scenarios passed.  This
is qemu/KVM on x86:

Tested-by: Paul E. McKenney 

>  - rebased to v6.2-rc3; as available at:
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/idle
> 
>  - folded: 
> https://lkml.kernel.org/r/y3ubwyny15etu...@hirez.programming.kicks-ass.net
>which makes the ARM cpuidle index 0 consistently not use
>CPUIDLE_FLAG_RCU_IDLE, as requested by Ulf.
> 
>  - added a few more __always_inline to empty stub functions as found by the
>robot.
> 
>  - Used _RET_IP_ instead of _THIS_IP_ in a few placed because of:
>https://github.com/ClangBuiltLinux/linux/issues/263
> 
>  - Added new patches to address various robot reports:
> 
>  #35:  trace,hardirq: No moar _rcuidle() tracing
>  #47:  cpuidle: Ensure ct_cpuidle_enter() is always called from 
> noinstr/__cpuidle
>  #48:  cpuidle,arch: Mark all ct_cpuidle_enter() callers __cpuidle
>  #49:  cpuidle,arch: Mark all regular cpuidle_state::enter methods 
> __cpuidle
>  #50:  cpuidle: Comments about noinstr/__cpuidle
>  #51:  context_tracking: Fix noinstr vs KASAN
> 
> 
> ---
>  arch/alpha/kernel/process.c   |  1 -
>  arch/alpha/kernel/vmlinux.lds.S   |  1 -
>  arch/arc/kernel/process.c |  3 ++
>  arch/arc/kernel/vmlinux.lds.S |  1 -
>  arch/arm/include/asm/vmlinux.lds.h|  1 -
>  arch/arm/kernel/cpuidle.c |  4 +-
>  arch/arm/kernel/process.c |  1 -
>  arch/arm/kernel/smp.c |  6 +--
>  arch/arm/mach-davinci/cpuidle.c   |  4 +-
>  arch/arm/mach-gemini/board-dt.c   |  3 +-
>  arch/arm/mach-imx/cpuidle-imx5.c  |  4 +-
>  arch/arm/mach-imx/cpuidle-imx6q.c |  8 ++--
>  arch/arm/mach-imx/cpuidle-imx6sl.c|  4 +-
>  arch/arm/mach-imx/cpuidle-imx6sx.c|  9 ++--
>  arch/arm/mach-imx/cpuidle-imx7ulp.c   |  4 +-
>  arch/arm/mach-omap2/common.h  |  6 ++-
>  arch/arm/mach-omap2/cpuidle34xx.c | 16 ++-
>  arch/arm/mach-omap2/cpuidle44xx.c | 29 +++--
>  arch/arm/mach-omap2/omap-mpuss-lowpower.c | 12 +-
>  arch/arm/mach-omap2/pm.h  |  2 +-
>  arch/arm/mach-omap2/pm24xx.c  | 51 +-
>  arch/arm/mach-omap2/pm34xx.c  | 14 +--
>  arch/arm/mach-omap2/pm44xx.c  |  2 +-
>  arch/arm/mach-omap2/powerdomain.c | 10 ++---
>  arch/arm/mach-s3c/cpuidle-s3c64xx.c   |  5 +--
>  arch/arm64/kernel/cpuidle.c   |  2 +-
>  arch/arm64/kernel/idle.c  |  1 -
>  arch/arm64/kernel/smp.c   |  4 +-
>  arch/arm64/kernel/vmlinux.lds.S   |  1 -
>  arch/csky/kernel/process.c|  1 -
>  arch/csky/kernel/smp.c|  2 +-
>  arch/csky/kernel/vmlinux.lds.S|  1 -
>  arch/hexagon/kernel/process.c |  1 -
>  arch/hexagon/kernel/vmlinux.lds.S |  1 -
>  arch/ia64/kernel/process.c|  1 +
>  arch/ia64/kernel/vmlinux.lds.S|  1 -
>  arch/loongarch/kernel/idle.c  |  1 +
>  arch/loongarch/kernel/vmlinux.lds.S   |  1 -
>  arch/m68k/kernel/vmlinux-nommu.lds|  1 -
>  arch/m68k/kernel/vmlinux-std.lds  |  1 -
>  arch/m68k/kernel/vmlinux-sun3.lds |  1 -
>  arch/microblaze/kernel/process.c  |  1 -
>  arch/microblaze/kernel/vmlinux.lds.S  |  1 -
>  arch/mips/kernel/idle.c   | 14 +++
>  arch/mips/kernel/vmlinux.lds.S|  1 -
>  arch/nios2/kernel/process.c   |  1 -
>  arch/nios2/kernel/vmlinux.lds.S   |  1 -
>  arch/openrisc/kernel/process.c|  1 +
>  arch/openrisc/kernel/vmlinux.lds.S|  1 -
>  arch/parisc/kernel/process.c  |  2 -
>  arch/parisc/kernel/vmlinux.lds.S  |  1 -
>  arch/powerpc/kernel/idle.c| 

Re: [PATCH rcu 04/27] arch/powerpc/kvm: Remove "select SRCU"

2023-01-11 Thread Paul E. McKenney
On Thu, Jan 12, 2023 at 10:49:04AM +1100, Michael Ellerman wrote:
> "Paul E. McKenney"  writes:
> > Now that the SRCU Kconfig option is unconditionally selected, there is
> > no longer any point in selecting it.  Therefore, remove the "select SRCU"
> > Kconfig statements.
> >
> > Signed-off-by: Paul E. McKenney 
> > Cc: Michael Ellerman 
> > Cc: Nicholas Piggin 
> > Cc: Christophe Leroy 
> > Cc: 
> > ---
> >  arch/powerpc/kvm/Kconfig | 1 -
> >  1 file changed, 1 deletion(-)
> 
> Acked-by: Michael Ellerman  (powerpc)

Thank you!  I will apply on the next rebase.

Thanx, Paul

> cheers
> 
> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> > index a9f57dad6d916..902611954200d 100644
> > --- a/arch/powerpc/kvm/Kconfig
> > +++ b/arch/powerpc/kvm/Kconfig
> > @@ -22,7 +22,6 @@ config KVM
> > select PREEMPT_NOTIFIERS
> > select HAVE_KVM_EVENTFD
> > select HAVE_KVM_VCPU_ASYNC_IOCTL
> > -   select SRCU
> > select KVM_VFIO
> > select IRQ_BYPASS_MANAGER
> > select HAVE_KVM_IRQ_BYPASS
> > -- 
> > 2.31.1.189.g2e36527f23


[PATCH rcu 04/27] arch/powerpc/kvm: Remove "select SRCU"

2023-01-04 Thread Paul E. McKenney
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it.  Therefore, remove the "select SRCU"
Kconfig statements.

Signed-off-by: Paul E. McKenney 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: 
---
 arch/powerpc/kvm/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index a9f57dad6d916..902611954200d 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
select PREEMPT_NOTIFIERS
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
-   select SRCU
select KVM_VFIO
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
-- 
2.31.1.189.g2e36527f23



Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-28 Thread Paul E. McKenney
On Mon, Nov 28, 2022 at 09:12:28AM +0100, Thomas Gleixner wrote:
> On Sun, Nov 27 2022 at 09:53, Paul E. McKenney wrote:
> > On Sun, Nov 27, 2022 at 01:40:28PM +0100, Thomas Gleixner wrote:
> >> There are quite some reasons why a CPU-hotplug or a hot-unplug operation
> >> can fail, which is not a fatal problem, really.
> >> 
> >> So if a CPU hotplug operation fails, then why can't the torture test
> >> just move on and validate that the system still behaves correctly?
> >> 
> >> That gives us more coverage than just testing the good case and giving
> >> up when something unexpected happens.
> >
> > Agreed, with access to a function like the tick_nohz_full_timekeeper()
> > suggested earlier in this email thread, then yes, it would make sense to
> > try to offline the CPU anyway, then forgive the failure in cases where
> > the CPU matches that indicated by tick_nohz_full_timekeeper().
> 
> Why special casing this? There are other valid reasons why offlining can
> fail. So we special case timekeeper today and then next week we special
> case something else just because. That does not make sense. If it fails
> there is a reason and you can log it. The important part is that the
> system is functional and stable after the fail and the rollback.

Perhaps there are other valid reasons, but they have not been showing up
in my torture-test runs for well over a decade.  Not saying that they
don't happen, of course.  But if they involved (say) cgroups, then my
test setup would not exercise them.

So are you looking to introduce spurious CPU-hotplug failures?  If so,
these will also affect things like suspend/resume.  Plus it will make
it much more difficult to detect real but intermittent CPU-hotplug bugs,
which is the motivation for special-casing the tick_nohz_full_timekeeper()
failures.

So we should discuss the introduction of any spurious failures that might
be under consideration.

Independently of that, the torture_onoff() functions can of course keep
some sort of histogram of the failure return codes.  Or are there other
failure indications that should be captured?
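
For concreteness, a minimal sketch of such a histogram (illustrative only; the
array, bound, and helper below are assumptions, not existing kernel/torture.c
code):

```
/* Sketch only: not existing kernel/torture.c code. */
#define ONOFF_ERR_MAX 64	/* assumed cap on recorded errno values */
static unsigned long n_offline_errs[ONOFF_ERR_MAX + 1];

/* Record a CPU-hotplug failure errno (0 or negative) in the histogram. */
static void torture_onoff_record_err(int err)
{
	if (err < 0)
		n_offline_errs[min(-err, ONOFF_ERR_MAX)]++;
}
```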

> >> I even argue that the torture test should inject random failures into
> >> the hotplug state machine to achieve extended code coverage.
> >
> > I could imagine torture_onoff() telling various CPU-hotplug notifiers
> > to refuse the transition using some TBD interface.
> 
> There is already an interface which is exposed to sysfs which allows you
> to enforce a "fail" at a defined hotplug state.

If you would like me to be testing this as part of my normal testing
regimen, I will need an in-kernel interface.  Such an interface is of
course not needed for modprobe-style testing, in which case the script
doing the modprobe and rmmod can of course manipulate the sysfs files.
But I don't do that sort of testing very often.  And when I do, it is
almost always with kernels configured for Meta's fleet, which almost
never do CPU-offline operations.

Thanx, Paul

> > That would better test the CPU-hotplug common code's ability to deal
> > with failures.
> 
> Correct.
> 
> > Or did you have something else/additional in mind?
> 
> No.
> 
> Thanks,
> 
> tglx


Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-27 Thread Paul E. McKenney
On Sun, Nov 27, 2022 at 01:40:28PM +0100, Thomas Gleixner wrote:

[ . . . ]

> >> No. We are not exporting this just to make a bogus test case happy.
> >>
> >> Fix the torture code to handle -EBUSY correctly.
> > I am going to do a study on this, for now, I do a grep in the kernel tree:
> > find . -name "*.c"|xargs grep cpuhp_setup_state|wc -l
> > The result of the grep command shows that there are 268
> > cpuhp_setup_state* cases,
> > which may make our task more complicated.
> 
> Why? The whole point of this torture thing is to stress the
> infrastructure.

Indeed.

> There are quite some reasons why a CPU-hotplug or a hot-unplug operation
> can fail, which is not a fatal problem, really.
> 
> So if a CPU hotplug operation fails, then why can't the torture test
> just move on and validate that the system still behaves correctly?
> 
> That gives us more coverage than just testing the good case and giving
> up when something unexpected happens.

Agreed, with access to a function like the tick_nohz_full_timekeeper()
suggested earlier in this email thread, then yes, it would make sense to
try to offline the CPU anyway, then forgive the failure in cases where
the CPU matches that indicated by tick_nohz_full_timekeeper().
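
Something along these lines, say (a sketch only; tick_nohz_full_timekeeper()
is the hypothetical accessor discussed in this thread, not an existing kernel
API):

```
/* Sketch only; tick_nohz_full_timekeeper() is hypothetical. */
static bool torture_offline_forgivable(int cpu, int err)
{
	/* An -EBUSY on the nohz_full timekeeping CPU is expected, not a bug. */
	return err == -EBUSY && IS_ENABLED(CONFIG_NO_HZ_FULL) &&
	       cpu == tick_nohz_full_timekeeper();
}
```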

> I even argue that the torture test should inject random failures into
> the hotplug state machine to achieve extended code coverage.

I could imagine torture_onoff() telling various CPU-hotplug notifiers
to refuse the transition using some TBD interface.  That would better
test the CPU-hotplug common code's ability to deal with failures.

Or did you have something else/additional in mind?

Thanx, Paul


Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-23 Thread Paul E. McKenney
On Wed, Nov 23, 2022 at 11:25:43PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 21, 2022 at 05:37:54PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 21, 2022 at 11:51:40AM +0800, Zhouyi Zhou wrote:
> > > @@ -358,7 +359,16 @@ torture_onoff(void *arg)
> > >   schedule_timeout_interruptible(HZ / 10);
> > >   continue;
> > >   }
> > > +#ifdef CONFIG_NO_HZ_FULL
> > > + /* do not offline tick do timer cpu */
> > > + if (tick_nohz_full_running) {
> > > + cpu = (torture_random() >> 4) % maxcpu;
> > > + if (cpu >= tick_do_timer_cpu)
> > 
> > Why is this ">=" instead of "=="?
> > 
> > > + cpu = (cpu + 1) % (maxcpu + 1);
> > > + } else
> > > +#else
> > >   cpu = (torture_random() >> 4) % (maxcpu + 1);
> > > +#endif
> > 
> > What happens if the value of tick_do_timer_cpu changes between the time of
> > the check above and the call to torture_offline() below?  Alternatively,
> > how is such a change in value prevented?
> 
> It can't, currently tick_do_timer_cpu is fixed when nohz_full is running.
> It can however have special values at early boot such as TICK_DO_TIMER_NONE.
> But if rcutorture is initialized after smp, it should be ok.

Ah, getting ahead of myself, thank you for the info!

So the thing to do would be to generate only maxcpu-1 choices.

Thanx, Paul

> Thanks.
> 
> > 
> > Thanx, Paul
> > 
> > >   if (!torture_offline(cpu,
> > >   &n_offline_attempts, &n_offline_successes,
> > >   &sum_offline, &min_offline, &max_offline))
> > > -- 
> > > 2.34.1
> > > 


Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-23 Thread Paul E. McKenney
On Wed, Nov 23, 2022 at 10:23:11AM +0800, Zhouyi Zhou wrote:
> On Tue, Nov 22, 2022 at 9:37 AM Paul E. McKenney  wrote:
> >
> > On Mon, Nov 21, 2022 at 11:51:40AM +0800, Zhouyi Zhou wrote:
> > > During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to
> > > offline tick_do_timer_cpu, the operation will fail because in
> > > function tick_nohz_cpu_down:
> > > ```
> > > if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
> > >   return -EBUSY;
> > > ```
> > > Above bug was first discovered in torture tests performed in PPC VM
> > > of Open Source Lab of Oregon State University, and reproducible in RISC-V
> > > and X86-64 (with additional kernel commandline cpu0_hotplug).
> > >
> > > In this patch, we avoid offlining tick_do_timer_cpu by distributing
> > > the offline choices among the remaining CPUs.
> > >
> > > Signed-off-by: Zhouyi Zhou 
> >
> > Good show chasing this down!
> Thank Paul for your guidance and encouragement!
> >
> > A couple of questions below.
> The answers below.
> >
> > > ---
> > >  include/linux/tick.h|  1 +
> > >  kernel/time/tick-common.c   |  1 +
> > >  kernel/time/tick-internal.h |  1 -
> > >  kernel/torture.c| 10 ++
> > >  4 files changed, 12 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > > index bfd571f18cfd..23cc0b205853 100644
> > > --- a/include/linux/tick.h
> > > +++ b/include/linux/tick.h
> > > @@ -14,6 +14,7 @@
> > >  #include 
> > >
> > >  #ifdef CONFIG_GENERIC_CLOCKEVENTS
> > > +extern int tick_do_timer_cpu __read_mostly;
> > >  extern void __init tick_init(void);
> > >  /* Should be core only, but ARM BL switcher requires it */
> > >  extern void tick_suspend_local(void);
> > > diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> > > index 46789356f856..87b9b9afa320 100644
> > > --- a/kernel/time/tick-common.c
> > > +++ b/kernel/time/tick-common.c
> > > @@ -48,6 +48,7 @@ ktime_t tick_next_period;
> > >   *procedure also covers cpu hotplug.
> > >   */
> > >  int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT;
> > > +EXPORT_SYMBOL_GPL(tick_do_timer_cpu);
> > >  #ifdef CONFIG_NO_HZ_FULL
> > >  /*
> > >   * tick_do_timer_boot_cpu indicates the boot CPU temporarily owns
> > > diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
> > > index 649f2b48e8f0..8953dca10fdd 100644
> > > --- a/kernel/time/tick-internal.h
> > > +++ b/kernel/time/tick-internal.h
> > > @@ -15,7 +15,6 @@
> > >
> > >  DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
> > >  extern ktime_t tick_next_period;
> > > -extern int tick_do_timer_cpu __read_mostly;
> > >
> > >  extern void tick_setup_periodic(struct clock_event_device *dev, int 
> > > broadcast);
> > >  extern void tick_handle_periodic(struct clock_event_device *dev);
> > > diff --git a/kernel/torture.c b/kernel/torture.c
> > > index 789aeb0e1159..bccbdd33dda2 100644
> > > --- a/kernel/torture.c
> > > +++ b/kernel/torture.c
> > > @@ -33,6 +33,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include <linux/tick.h>
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -358,7 +359,16 @@ torture_onoff(void *arg)
> > >   schedule_timeout_interruptible(HZ / 10);
> > >   continue;
> > >   }
> > > +#ifdef CONFIG_NO_HZ_FULL
> > > + /* do not offline tick do timer cpu */
> > > + if (tick_nohz_full_running) {
> > > + cpu = (torture_random() >> 4) % maxcpu;
> > > + if (cpu >= tick_do_timer_cpu)
> >
> > Why is this ">=" instead of "=="?
> I use probability theory here to let the remaining cpu distribute evenly.
> Example:
> we have cpus: 0 1 2 3 4 5 6 7
> maxcpu = 7
> tick_do_timer_cpu = 2
> remaining cpus are: 0 1 3 4 5 6 7
> if the offline cpu candidate is 2, then the result cpu is 2+1
> else if the offline cpu candidate is 3, then the result cpu is 3+1
> ...
> else if the offline cpu candidate is 6, then the result cpu is 6+1
> >
> > > + cpu = (cpu + 1) % (maxcpu + 1);
> we could just use cpu = cpu + 1 here
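
For readers following along, the remapping being described boils down to
something like this (sketch only, not the patch itself):

```
/*
 * Sketch only: "choice" is a random value in [0, maxcpu), i.e. one fewer
 * possibility than the maxcpu + 1 online CPUs.  Bumping values at or above
 * the excluded CPU by one maps those choices onto every CPU except the
 * excluded one, each with equal probability.
 */
static int skip_excluded_cpu(int choice, int excluded_cpu)
{
	return choice >= excluded_cpu ? choice + 1 : choice;
}
```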

But 

Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-21 Thread Paul E. McKenney
On Mon, Nov 21, 2022 at 11:51:40AM +0800, Zhouyi Zhou wrote:
> During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to
> offline tick_do_timer_cpu, the operation will fail because in
> function tick_nohz_cpu_down:
> ```
> if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
>   return -EBUSY;
> ```
> Above bug was first discovered in torture tests performed in PPC VM
> of Open Source Lab of Oregon State University, and reproducible in RISC-V
> and X86-64 (with additional kernel commandline cpu0_hotplug).
> 
> In this patch, we avoid offlining tick_do_timer_cpu by distributing
> the offline choices among the remaining CPUs.
> 
> Signed-off-by: Zhouyi Zhou 

Good show chasing this down!

A couple of questions below.

> ---
>  include/linux/tick.h|  1 +
>  kernel/time/tick-common.c   |  1 +
>  kernel/time/tick-internal.h |  1 -
>  kernel/torture.c| 10 ++
>  4 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..23cc0b205853 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -14,6 +14,7 @@
>  #include 
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
> +extern int tick_do_timer_cpu __read_mostly;
>  extern void __init tick_init(void);
>  /* Should be core only, but ARM BL switcher requires it */
>  extern void tick_suspend_local(void);
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index 46789356f856..87b9b9afa320 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -48,6 +48,7 @@ ktime_t tick_next_period;
>   *procedure also covers cpu hotplug.
>   */
>  int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT;
> +EXPORT_SYMBOL_GPL(tick_do_timer_cpu);
>  #ifdef CONFIG_NO_HZ_FULL
>  /*
>   * tick_do_timer_boot_cpu indicates the boot CPU temporarily owns
> diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
> index 649f2b48e8f0..8953dca10fdd 100644
> --- a/kernel/time/tick-internal.h
> +++ b/kernel/time/tick-internal.h
> @@ -15,7 +15,6 @@
>  
>  DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
>  extern ktime_t tick_next_period;
> -extern int tick_do_timer_cpu __read_mostly;
>  
>  extern void tick_setup_periodic(struct clock_event_device *dev, int 
> broadcast);
>  extern void tick_handle_periodic(struct clock_event_device *dev);
> diff --git a/kernel/torture.c b/kernel/torture.c
> index 789aeb0e1159..bccbdd33dda2 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/tick.h>
>  #include 
>  #include 
>  #include 
> @@ -358,7 +359,16 @@ torture_onoff(void *arg)
>   schedule_timeout_interruptible(HZ / 10);
>   continue;
>   }
> +#ifdef CONFIG_NO_HZ_FULL
> + /* do not offline tick do timer cpu */
> + if (tick_nohz_full_running) {
> + cpu = (torture_random() >> 4) % maxcpu;
> + if (cpu >= tick_do_timer_cpu)

Why is this ">=" instead of "=="?

> + cpu = (cpu + 1) % (maxcpu + 1);
> + } else
> +#else
>   cpu = (torture_random() >> 4) % (maxcpu + 1);
> +#endif

What happens if the value of tick_do_timer_cpu changes between the time of
the check above and the call to torture_offline() below?  Alternatively,
how is such a change in value prevented?

Thanx, Paul

>   if (!torture_offline(cpu,
>   &n_offline_attempts, &n_offline_successes,
>   &sum_offline, &min_offline, &max_offline))
> -- 
> 2.34.1
> 


Re: [PATCH linux-next][RFC] powerpc: protect cpu offlining by RCU offline lock

2022-09-14 Thread Paul E. McKenney
On Wed, Sep 14, 2022 at 10:15:28AM +0800, Zhouyi Zhou wrote:
> During the cpu offlining, the sub functions of xive_teardown_cpu will
> call __lock_acquire when CONFIG_LOCKDEP=y. The latter function will
> travel RCU protected list, so "WARNING: suspicious RCU usage" will be
> triggered.
> 
> Try to protect cpu offlining by RCU offline lock.

Rather than acquiring the RCU lock, why not change the functions called
by xive_teardown_cpu() to avoid calling __lock_acquire()?  For example,
a call to spin_lock() could be changed to arch_spin_lock().
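
For example, something like the following, with a made-up lock standing in for
the XIVE one (sketch only, not a proposed patch):

```
/* Sketch only: arch_spin_lock() bypasses lockdep's __lock_acquire(), so no
 * RCU-protected lockdep lists are traversed from the now-offline CPU. */
static arch_spinlock_t example_lock = __ARCH_SPIN_LOCK_UNLOCKED;

static void example_teardown_path(void)
{
	unsigned long flags;

	local_irq_save(flags);
	arch_spin_lock(&example_lock);		/* instead of spin_lock_irqsave() */
	/* ... hardware teardown that must not use lockdep/RCU ... */
	arch_spin_unlock(&example_lock);
	local_irq_restore(flags);
}
```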

Thanx, Paul

> Tested on PPC VM of Open Source Lab of Oregon State University.
> (Each round of tests takes about 19 hours to finish)
> Test results show that although the "WARNING: suspicious RCU usage" is gone,
> there are more "BUG: soft lockup" reports than with the original kernel
> (10 vs 6), so I add an [RFC] to my subject line.
> 
> Signed-off-by: Zhouyi Zhou 
> ---
> [it seems that there are some delivery problem in my previous email,
>  so I send again via gmail, sorry for the trouble]
>  
> Dear PPC and RCU developers
> 
> I found this bug when trying to do rcutorture tests in ppc VM of
> Open Source Lab of Oregon State University.
> 
> console.log report following bug:
> [   37.635545][T0] WARNING: suspicious RCU usage^M
> [   37.636409][T0] 6.0.0-rc4-next-20220907-dirty #8 Not tainted^M
> [   37.637575][T0] -^M
> [   37.638306][T0] kernel/locking/lockdep.c:3723 RCU-list traversed in 
> non-reader section!!^M
> [   37.639651][T0] ^M
> [   37.639651][T0] other info that might help us debug this:^M
> [   37.639651][T0] ^M
> [   37.641381][T0] ^M
> [   37.641381][T0] RCU used illegally from offline CPU!^M
> [   37.641381][T0] rcu_scheduler_active = 2, debug_locks = 1^M
> [   37.667170][T0] no locks held by swapper/6/0.^M
> [   37.668328][T0] ^M
> [   37.668328][T0] stack backtrace:^M
> [   37.669995][T0] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 
> 6.0.0-rc4-next-20220907-dirty #8^M
> [   37.672777][T0] Call Trace:^M
> [   37.673729][T0] [c4653920] [c097f9b4] 
> dump_stack_lvl+0x98/0xe0 (unreliable)^M
> [   37.678579][T0] [c4653960] [c01f2eb8] 
> lockdep_rcu_suspicious+0x148/0x16c^M
> [   37.680425][T0] [c46539f0] [c01ed9b4] 
> __lock_acquire+0x10f4/0x26e0^M
> [   37.682450][T0] [c4653b30] [c01efc2c] 
> lock_acquire+0x12c/0x420^M
> [   37.684113][T0] [c4653c20] [c10d704c] 
> _raw_spin_lock_irqsave+0x6c/0xc0^M
> [   37.686154][T0] [c4653c60] [c00c7b4c] 
> xive_spapr_put_ipi+0xcc/0x150^M
> [   37.687879][T0] [c4653ca0] [c10c72a8] 
> xive_cleanup_cpu_ipi+0xc8/0xf0^M
> [   37.689856][T0] [c4653cf0] [c10c7370] 
> xive_teardown_cpu+0xa0/0xf0^M
> [   37.691877][T0] [c4653d30] [c00fba5c] 
> pseries_cpu_offline_self+0x5c/0x100^M
> [   37.693882][T0] [c4653da0] [c005d2c4] 
> arch_cpu_idle_dead+0x44/0x60^M
> [   37.695739][T0] [c4653dc0] [c01c740c] 
> do_idle+0x16c/0x3d0^M
> [   37.697536][T0] [c4653e70] [c01c7a1c] 
> cpu_startup_entry+0x3c/0x40^M
> [   37.699694][T0] [c4653ea0] [c005ca20] 
> start_secondary+0x6c0/0xb50^M
> [   37.701742][T0] [c4653f90] [c000d054] 
> start_secondary_prolog+0x10/0x14^M
> 
> 
> I am a beginner, hope I can be of some beneficial to the community ;-)
> 
> Thanks
> Zhouyi
> --
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  5 -
>  include/linux/rcupdate.h |  3 ++-
>  kernel/rcu/tree.c| 10 ++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 0f8cd8b06432..ddf66a253c70 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -64,11 +64,14 @@ static void pseries_cpu_offline_self(void)
>  
>   local_irq_disable();
>   idle_task_exit();
> +
> + /* Because the cpu is now offline, let rcu know that */
> + rcu_state_ofl_lock();
>   if (xive_enabled())
>   xive_teardown_cpu();
>   else
>   xics_teardown_cpu();
> -
> + rcu_state_ofl_unlock();
>   unregister_slb_shadow(hwcpu);
>   rtas_stop_self();
>  
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 63d2e6a60ad7..d857955a02ba 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1034,5 +1034,6 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, 
> rcu_callback_t f)
>  /* kernel/ksysfs.c definitions */
>  extern int rcu_expedited;
>  extern int rcu_normal;
> -
> +void rcu_state_ofl_lock(void);
> +void rcu_state_ofl_unlock(void);
>  #endif /* 
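
The kernel/rcu/tree.c hunk is cut off above; a plausible sketch consistent
with the declarations shown, assuming the new functions simply wrap the
existing rcu_state.ofl_lock (an arch_spinlock_t), would be:

```
/* Sketch only: a guess at the truncated kernel/rcu/tree.c hunk above. */
void rcu_state_ofl_lock(void)
{
	arch_spin_lock(&rcu_state.ofl_lock);
}

void rcu_state_ofl_unlock(void)
{
	arch_spin_unlock(&rcu_state.ofl_lock);
}
```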

Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-30 Thread Paul E. McKenney
On Sat, Jul 30, 2022 at 02:40:32AM -0700, Michel Lespinasse wrote:
> On Fri, Jul 29, 2022 at 08:26:22AM -0700, Paul E. McKenney wrote:> Would you 
> be willing to try another shot in the dark, but untested
> > this time?  I freely admit that this is getting strange.
> > 
> > Thanx, Paul
> 
> Yes, adding this second change got rid of the boot time warning for me.

OK, I will make a real patch.  May I have your Tested-by?

Thanx, Paul

> > 
> > 
> > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> > index e374c0c923dae..279f557bf60bb 100644
> > --- a/kernel/sched/clock.c
> > +++ b/kernel/sched/clock.c
> > @@ -394,7 +394,7 @@ notrace void sched_clock_tick(void)
> > if (!static_branch_likely(&sched_clock_running))
> > return;
> >  
> > -   lockdep_assert_irqs_disabled();
> > +   WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !raw_irqs_disabled());
> >  
> > scd = this_scd();
> > __scd_stamp(scd);


Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-29 Thread Paul E. McKenney
Or better yet, try the patch that Rafael proposed.  ;-)

Thanx, Paul

On Fri, Jul 29, 2022 at 08:26:22AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 29, 2022 at 03:24:58AM -0700, Michel Lespinasse wrote:
> > On Thu, Jul 28, 2022 at 10:20:53AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 25, 2022 at 12:43:06PM -0700, Michel Lespinasse wrote:
> > > > On Wed, Jun 08, 2022 at 04:27:27PM +0200, Peter Zijlstra wrote:
> > > > > Commit c227233ad64c ("intel_idle: enable interrupts before C1 on
> > > > > Xeons") wrecked intel_idle in two ways:
> > > > > 
> > > > >  - must not have tracing in idle functions
> > > > >  - must return with IRQs disabled
> > > > > 
> > > > > Additionally, it added a branch for no good reason.
> > > > > 
> > > > > Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on 
> > > > > Xeons")
> > > > > Signed-off-by: Peter Zijlstra (Intel) 
> > > > 
> > > > After this change was introduced, I am seeing "WARNING: suspicious RCU
> > > > usage" when booting a kernel with debug options compiled in. Please
> > > > see the attached dmesg output. The issue starts with commit 32d4fd5751ea
> > > > and is still present in v5.19-rc8.
> > > > 
> > > > I'm not sure, is this too late to fix or revert in v5.19 final ?
> > > 
> > > I finally got a chance to take a quick look at this.
> > > 
> > > The rcu_eqs_exit() function is making a lockdep complaint about
> > > being invoked with interrupts enabled.  This function is called from
> > > rcu_idle_exit(), which is an expected code path from cpuidle_enter_state()
> > > via its call to rcu_idle_exit().  Except that rcu_idle_exit() disables
> > > interrupts before invoking rcu_eqs_exit().
> > > 
> > > The only other call to rcu_idle_exit() does not disable interrupts,
> > > but it is via rcu_user_exit(), which would be a very odd choice for
> > > cpuidle_enter_state().
> > > 
> > > It seems unlikely, but it might be that it is the use of local_irq_save()
> > > instead of raw_local_irq_save() within rcu_idle_exit() that is causing
> > > the trouble.  If this is the case, then the commit shown below would
> > > help.  Note that this commit removes the warning from lockdep, so it
> > > is necessary to build the kernel with CONFIG_RCU_EQS_DEBUG=y to enable
> > > equivalent debugging.
> > > 
> > > Could you please try your test with the -rce commit shown below applied?
> > 
> > Thanks for looking into it.
> 
> And thank you for trying this shot in the dark!
> 
> > After checking out Peter's commit 32d4fd5751ea,
> > cherry picking your commit ed4ae5eff4b3,
> > and setting CONFIG_RCU_EQS_DEBUG=y in addition of my usual debug config,
> > I am now seeing this a few seconds into the boot:
> > 
> > [3.010650] [ cut here ]
> > [3.010651] WARNING: CPU: 0 PID: 0 at kernel/sched/clock.c:397 
> > sched_clock_tick+0x27/0x60
> 
> And this is again a complaint about interrupts not being disabled.
> 
> But it does appear that the problem was the lockdep complaint, and
> eliminating that did take care of part of the problem.  But lockdep
> remained enabled, and you therefore hit the next complaint.
> 
> > [3.010657] Modules linked in:
> > [3.010660] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> > 5.19.0-rc1-test-5-g1be22fea0611 #1
> > [3.010662] Hardware name: LENOVO 30BFS44D00/1036, BIOS S03KT51A 
> > 01/17/2022
> > [3.010663] RIP: 0010:sched_clock_tick+0x27/0x60
> 
> The most straightforward way to get to sched_clock_tick() from
> cpuidle_enter_state() is via the call to sched_clock_idle_wakeup_event().
> 
> Except that it disables interrupts before invoking sched_clock_tick().
> 
> > [3.010665] Code: 1f 40 00 53 eb 02 5b c3 66 90 8b 05 2f c3 40 01 85 c0 
> > 74 18 65 8b 05 60 88 8f 4e 85 c0 75 0d 65 8b 05 a9 85 8f 4e 85 c0 74 02 
> > <0f> 0b e8 e2 6c 89 00 48 c7 c3 40 d5 02 00
> >  89 c0 48 03 1c c5 c0 98
> > [3.010667] RSP: :b2803e28 EFLAGS: 00010002
> > [3.010670] RAX: 0001 RBX: c8ce7fa07060 RCX: 
> > 0001
> > [3.010671] RDX:  RSI: b268dd21 RDI: 
> > b269ab13
> > [3.010673] RBP: 0001 R08: ffc300d5 R09: 
> > 0002be80
> >

Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-29 Thread Paul E. McKenney
On Fri, Jul 29, 2022 at 03:24:58AM -0700, Michel Lespinasse wrote:
> On Thu, Jul 28, 2022 at 10:20:53AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 25, 2022 at 12:43:06PM -0700, Michel Lespinasse wrote:
> > > On Wed, Jun 08, 2022 at 04:27:27PM +0200, Peter Zijlstra wrote:
> > > > Commit c227233ad64c ("intel_idle: enable interrupts before C1 on
> > > > Xeons") wrecked intel_idle in two ways:
> > > > 
> > > >  - must not have tracing in idle functions
> > > >  - must return with IRQs disabled
> > > > 
> > > > Additionally, it added a branch for no good reason.
> > > > 
> > > > Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
> > > > Signed-off-by: Peter Zijlstra (Intel) 
> > > 
> > > After this change was introduced, I am seeing "WARNING: suspicious RCU
> > > usage" when booting a kernel with debug options compiled in. Please
> > > see the attached dmesg output. The issue starts with commit 32d4fd5751ea
> > > and is still present in v5.19-rc8.
> > > 
> > > I'm not sure, is this too late to fix or revert in v5.19 final ?
> > 
> > I finally got a chance to take a quick look at this.
> > 
> > The rcu_eqs_exit() function is making a lockdep complaint about
> > being invoked with interrupts enabled.  This function is called from
> > rcu_idle_exit(), which is an expected code path from cpuidle_enter_state()
> > via its call to rcu_idle_exit().  Except that rcu_idle_exit() disables
> > interrupts before invoking rcu_eqs_exit().
> > 
> > The only other call to rcu_idle_exit() does not disable interrupts,
> > but it is via rcu_user_exit(), which would be a very odd choice for
> > cpuidle_enter_state().
> > 
> > It seems unlikely, but it might be that it is the use of local_irq_save()
> > instead of raw_local_irq_save() within rcu_idle_exit() that is causing
> > the trouble.  If this is the case, then the commit shown below would
> > help.  Note that this commit removes the warning from lockdep, so it
> > is necessary to build the kernel with CONFIG_RCU_EQS_DEBUG=y to enable
> > equivalent debugging.
> > 
> > Could you please try your test with the -rce commit shown below applied?
> 
> Thanks for looking into it.

And thank you for trying this shot in the dark!

> After checking out Peter's commit 32d4fd5751ea,
> cherry picking your commit ed4ae5eff4b3,
> and setting CONFIG_RCU_EQS_DEBUG=y in addition of my usual debug config,
> I am now seeing this a few seconds into the boot:
> 
> [3.010650] [ cut here ]
> [3.010651] WARNING: CPU: 0 PID: 0 at kernel/sched/clock.c:397 
> sched_clock_tick+0x27/0x60

And this is again a complaint about interrupts not being disabled.

But it does appear that the problem was the lockdep complaint, and
eliminating that did take care of part of the problem.  But lockdep
remained enabled, and you therefore hit the next complaint.

> [3.010657] Modules linked in:
> [3.010660] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 5.19.0-rc1-test-5-g1be22fea0611 #1
> [3.010662] Hardware name: LENOVO 30BFS44D00/1036, BIOS S03KT51A 01/17/2022
> [3.010663] RIP: 0010:sched_clock_tick+0x27/0x60

The most straightforward way to get to sched_clock_tick() from
cpuidle_enter_state() is via the call to sched_clock_idle_wakeup_event().

Except that it disables interrupts before invoking sched_clock_tick().

> [3.010665] Code: 1f 40 00 53 eb 02 5b c3 66 90 8b 05 2f c3 40 01 85 c0 74 
> 18 65 8b 05 60 88 8f 4e 85 c0 75 0d 65 8b 05 a9 85 8f 4e 85 c0 74 02 <0f> 0b 
> e8 e2 6c 89 00 48 c7 c3 40 d5 02 00
>  89 c0 48 03 1c c5 c0 98
> [3.010667] RSP: :b2803e28 EFLAGS: 00010002
> [3.010670] RAX: 0001 RBX: c8ce7fa07060 RCX: 
> 0001
> [3.010671] RDX:  RSI: b268dd21 RDI: 
> b269ab13
> [3.010673] RBP: 0001 R08: ffc300d5 R09: 
> 0002be80
> [3.010674] R10: 03625b53183a R11: a012b802b7a4 R12: 
> b2aa9e80
> [3.010675] R13: b2aa9e00 R14: 0001 R15: 
> 
> [3.010677] FS:  () GS:a012b800() 
> knlGS:
> [3.010678] CS:  0010 DS:  ES:  CR0: 80050033
> [3.010680] CR2: a012f81ff000 CR3: 000c99612001 CR4: 
> 003706f0
> [3.010681] DR0:  DR1:  DR2: 
> 
> [3.010682] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [3.010683] Call Trace:
> [  

Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-28 Thread Paul E. McKenney
On Mon, Jul 25, 2022 at 12:43:06PM -0700, Michel Lespinasse wrote:
> On Wed, Jun 08, 2022 at 04:27:27PM +0200, Peter Zijlstra wrote:
> > Commit c227233ad64c ("intel_idle: enable interrupts before C1 on
> > Xeons") wrecked intel_idle in two ways:
> > 
> >  - must not have tracing in idle functions
> >  - must return with IRQs disabled
> > 
> > Additionally, it added a branch for no good reason.
> > 
> > Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
> > Signed-off-by: Peter Zijlstra (Intel) 
> 
> After this change was introduced, I am seeing "WARNING: suspicious RCU
> usage" when booting a kernel with debug options compiled in. Please
> see the attached dmesg output. The issue starts with commit 32d4fd5751ea
> and is still present in v5.19-rc8.
> 
> I'm not sure, is this too late to fix or revert in v5.19 final ?

I finally got a chance to take a quick look at this.

The rcu_eqs_exit() function is making a lockdep complaint about
being invoked with interrupts enabled.  This function is called from
rcu_idle_exit(), which is an expected code path from cpuidle_enter_state()
via its call to rcu_idle_exit().  Except that rcu_idle_exit() disables
interrupts before invoking rcu_eqs_exit().

The only other call to rcu_idle_exit() does not disable interrupts,
but it is via rcu_user_exit(), which would be a very odd choice for
cpuidle_enter_state().

It seems unlikely, but it might be that it is the use of local_irq_save()
instead of raw_local_irq_save() within rcu_idle_exit() that is causing
the trouble.  If this is the case, then the commit shown below would
help.  Note that this commit removes the warning from lockdep, so it
is necessary to build the kernel with CONFIG_RCU_EQS_DEBUG=y to enable
equivalent debugging.

Could you please try your test with the -rce commit shown below applied?

Thanx, Paul

----

commit ed4ae5eff4b38797607cbdd80da394149110fb37
Author: Paul E. McKenney 
Date:   Tue May 17 21:00:04 2022 -0700

rcu: Apply noinstr to rcu_idle_enter() and rcu_idle_exit()

This commit applies the "noinstr" tag to the rcu_idle_enter() and
rcu_idle_exit() functions, which are invoked from portions of the idle
loop that cannot be instrumented.  These tags require reworking the
rcu_eqs_enter() and rcu_eqs_exit() functions that these two functions
invoke in order to cause them to use normal assertions rather than
lockdep.  In addition, within rcu_idle_exit(), the raw versions of
local_irq_save() and local_irq_restore() are used, again to avoid issues
with lockdep in uninstrumented code.

This patch is based in part on an earlier patch by Jiri Olsa, discussions
with Peter Zijlstra and Frederic Weisbecker, earlier changes by Thomas
Gleixner, and off-list discussions with Yonghong Song.

Link: 
https://lore.kernel.org/lkml/20220515203653.4039075-1-jo...@kernel.org/
    Reported-by: Jiri Olsa 
Reported-by: Alexei Starovoitov 
Reported-by: Andrii Nakryiko 
Signed-off-by: Paul E. McKenney 
Reviewed-by: Yonghong Song 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c25ba442044a6..9a5edab5558c9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -631,8 +631,8 @@ static noinstr void rcu_eqs_enter(bool user)
return;
}
 
-   lockdep_assert_irqs_disabled();
instrumentation_begin();
+   lockdep_assert_irqs_disabled();
trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, 
atomic_read(&rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && 
!is_idle_task(current));
rcu_preempt_deferred_qs(current);
@@ -659,9 +659,9 @@ static noinstr void rcu_eqs_enter(bool user)
  * If you add or remove a call to rcu_idle_enter(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_idle_enter(void)
+void noinstr rcu_idle_enter(void)
 {
-   lockdep_assert_irqs_disabled();
+   WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !raw_irqs_disabled());
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -861,7 +861,7 @@ static void noinstr rcu_eqs_exit(bool user)
struct rcu_data *rdp;
long oldval;
 
-   lockdep_assert_irqs_disabled();
+   WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !raw_irqs_disabled());
rdp = this_cpu_ptr(&rcu_data);
oldval = rdp->dynticks_nesting;
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && oldval < 0);
@@ -896,13 +896,13 @@ static void noinstr rcu_eqs_exit(bool user)
  * If you add or remove a call to rcu_idle_exit(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_idle_exit(void)
+void no

Re: [PATCH 16/36] rcu: Fix rcu_idle_exit()

2022-06-15 Thread Paul E. McKenney

On Wed, Jun 08, 2022 at 04:27:39PM +0200, Peter Zijlstra wrote:
> Current rcu_idle_exit() is terminally broken because it uses
> local_irq_{save,restore}(), which are traced which uses RCU.
> 
> However, now that all the callers are sure to have IRQs disabled, we
> can remove these calls.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Acked-by: Paul E. McKenney 

We have some fun conflicts between this series and Frederic's context-tracking
series.  But it looks like these can be resolved by:

1.  A patch on top of Frederic's series that provides the old rcu_*()
names for the functions now prefixed with ct_*() such as
ct_idle_exit().

2.  Another patch on top of Frederic's series that takes the
changes remaining from this patch, shown below.  Frederic's
series uses raw_local_irq_save() and raw_local_irq_restore(),
which can then be removed.

Or is there a better way to do this?
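
For concreteness, item 1 above could be as simple as the following sketch
(exact form and location assumed):

```
/* Sketch only: compatibility wrappers mapping the old rcu_*() names onto
 * the ct_*() functions introduced by the context-tracking series. */
static inline void rcu_idle_enter(void)
{
	ct_idle_enter();
}

static inline void rcu_idle_exit(void)
{
	ct_idle_exit();
}
```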

Thanx, Paul



commit f64cee8c159e9863a74594efe3d33fb513a6a7b5
Author: Peter Zijlstra 
Date:   Tue Jun 14 17:24:43 2022 -0700

context_tracking: Interrupts always disabled for ct_idle_exit()

Now that the idle-loop cleanups have ensured that rcu_idle_exit() is
always invoked with interrupts disabled, remove the interrupt disabling
in favor of a debug check.

Signed-off-by: Peter Zijlstra 
Cc: Frederic Weisbecker 
    Signed-off-by: Paul E. McKenney 

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 1da44803fd319..99310cf5b0254 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -332,11 +332,8 @@ EXPORT_SYMBOL_GPL(ct_idle_enter);
  */
 void noinstr ct_idle_exit(void)
 {
-   unsigned long flags;
-
-   raw_local_irq_save(flags);
+   WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !raw_irqs_disabled());
ct_kernel_enter(false, RCU_DYNTICKS_IDX - CONTEXT_IDLE);
-   raw_local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(ct_idle_exit);
 


Re: [PATCH v4] locking/csd_lock: change csdlock_debug from early_param to __setup

2022-05-27 Thread Paul E. McKenney
On Fri, May 27, 2022 at 02:49:03PM +0800, Chen Zhongjin wrote:
> Hi,
> 
> On 2022/5/18 9:11, Paul E. McKenney wrote:
> > On Tue, May 17, 2022 at 11:22:04AM +0800, Chen Zhongjin wrote:
> >> On 2022/5/10 17:46, Chen Zhongjin wrote:
> >>> csdlock_debug uses early_param and static_branch_enable() to enable
> >>> csd_lock_wait feature, which triggers a panic on arm64 with config:
> >>> CONFIG_SPARSEMEM=y
> >>> CONFIG_SPARSEMEM_VMEMMAP=n
> >>>
> >>> With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in
> >>> static_key_enable() and returns NULL which makes NULL dereference
> >>> because mem_section is initialized in sparse_init() which is later
> >>> than parse_early_param() stage.
> >>>
> >>> For powerpc this is also broken, because early_param stage is
> >>> earlier than jump_label_init() so static_key_enable won't work.
> >>> powerpc throws a warning: "static key 'xxx' used before call
> >>> to jump_label_init()".
> >>>
> >>> Thus, early_param is too early for csd_lock_wait to run
> >>> static_branch_enable(), so change it to __setup to fix this.
> >>>
> >>> Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for 
> >>> controlling CSD lock debugging")
> >>> Cc: sta...@vger.kernel.org
> >>> Reported-by: Chen jingwen 
> >>> Signed-off-by: Chen Zhongjin 
> >>> ---
> >>> Change v3 -> v4:
> >>> Fix title and description because this fix is also applied
> >>> to powerpc.
> >>> For more detailed arm64 bug report see:
> >>> https://lore.kernel.org/linux-arm-kernel/e8715911-f835-059d-27f8-cc5f5ad30...@huawei.com/t/
> >>>
> >>> Change v2 -> v3:
> >>> Add module name in title
> >>>
> >>> Change v1 -> v2:
> >>> Fix return 1 for __setup
> >>> ---
> >>>  kernel/smp.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/kernel/smp.c b/kernel/smp.c
> >>> index 65a630f62363..381eb15cd28f 100644
> >>> --- a/kernel/smp.c
> >>> +++ b/kernel/smp.c
> >>> @@ -174,9 +174,9 @@ static int __init csdlock_debug(char *str)
> >>>   if (val)
> >>>   static_branch_enable(&csdlock_debug_enabled);
> >>>  
> >>> - return 0;
> >>> + return 1;
> >>>  }
> >>> -early_param("csdlock_debug", csdlock_debug);
> >>> +__setup("csdlock_debug=", csdlock_debug);
> >>>  
> >>>  static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
> >>>  static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
> >>
> >> Ping for review. Thanks!
> > 
> > I have pulled it into -rcu for testing and further review.  It might
> > well need to go through some other path, though.
> >>Thanx, Paul
> > .
> 
> So did it have any result? Do we have any idea to fix that except delaying the
> set timing? I guess that maybe not using static_branch can work for this, but 
> it
> still needs to be evaluated for performance influence of not enabled 
> situation.

It was in -next for a short time without complaints.  It will go back
into -next after the merge window closes.  If there are no objections,
I would include it in my pull request for the next merge window (v5.20).

Thanx, Paul


Re: [PATCH v4] locking/csd_lock: change csdlock_debug from early_param to __setup

2022-05-17 Thread Paul E. McKenney
On Tue, May 17, 2022 at 11:22:04AM +0800, Chen Zhongjin wrote:
> On 2022/5/10 17:46, Chen Zhongjin wrote:
> > csdlock_debug uses early_param and static_branch_enable() to enable
> > csd_lock_wait feature, which triggers a panic on arm64 with config:
> > CONFIG_SPARSEMEM=y
> > CONFIG_SPARSEMEM_VMEMMAP=n
> > 
> > With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in
> > static_key_enable() and returns NULL which makes NULL dereference
> > because mem_section is initialized in sparse_init() which is later
> > than parse_early_param() stage.
> > 
> > For powerpc this is also broken, because early_param stage is
> > earlier than jump_label_init() so static_key_enable won't work.
> > powerpc throws a warning: "static key 'xxx' used before call
> > to jump_label_init()".
> > 
> > Thus, early_param is too early for csd_lock_wait to run
> > static_branch_enable(), so change it to __setup to fix this.
> > 
> > Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for controlling 
> > CSD lock debugging")
> > Cc: sta...@vger.kernel.org
> > Reported-by: Chen jingwen 
> > Signed-off-by: Chen Zhongjin 
> > ---
> > Change v3 -> v4:
> > Fix title and description because this fix is also applied
> > to powerpc.
> > For more detailed arm64 bug report see:
> > https://lore.kernel.org/linux-arm-kernel/e8715911-f835-059d-27f8-cc5f5ad30...@huawei.com/t/
> > 
> > Change v2 -> v3:
> > Add module name in title
> > 
> > Change v1 -> v2:
> > Fix return 1 for __setup
> > ---
> >  kernel/smp.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 65a630f62363..381eb15cd28f 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -174,9 +174,9 @@ static int __init csdlock_debug(char *str)
> > if (val)
> > static_branch_enable(&csdlock_debug_enabled);
> >  
> > -   return 0;
> > +   return 1;
> >  }
> > -early_param("csdlock_debug", csdlock_debug);
> > +__setup("csdlock_debug=", csdlock_debug);
> >  
> >  static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
> >  static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
> 
> Ping for review. Thanks!

I have pulled it into -rcu for testing and further review.  It might
well need to go through some other path, though.

Thanx, Paul


Re: [PATCH 20/30] panic: Add the panic informational notifier list

2022-04-27 Thread Paul E. McKenney
On Wed, Apr 27, 2022 at 07:49:14PM -0300, Guilherme G. Piccoli wrote:
> The goal of this new panic notifier is to allow its users to
> register callbacks to run earlier in the panic path than they
> currently do. This aims at informational mechanisms, like dumping
> kernel offsets and showing device error data (in case it's simple
> registers reading, for example) as well as mechanisms to disable
> log flooding (like hung_task detector / RCU warnings) and the
> tracing dump_on_oops (when enabled).
> 
> Any (non-invasive) information that should be provided before
> kmsg_dump() as well as log-flooding prevention code should fit
> here, as long as it offers relatively low risk for kdump.
> 
> For now, the patch is almost a no-op, although it changes a bit
> the ordering in which some panic notifiers are executed - especially
> affected by this are the notifiers responsible for disabling the
> hung_task detector / RCU warnings, which now run first. In a
> subsequent patch, the panic path will be refactored, then the
> panic informational notifiers will effectively run earlier,
> before ksmg_dump() (and usually before kdump as well).
> 
> We also defer documenting it all properly in the subsequent
> refactor patch. Finally, while at it, we removed some useless
> header inclusions too.
> 
> Cc: Benjamin Herrenschmidt 
> Cc: Catalin Marinas 
> Cc: Florian Fainelli 
> Cc: Frederic Weisbecker 
> Cc: "H. Peter Anvin" 
> Cc: Hari Bathini 
> Cc: Joel Fernandes 
> Cc: Jonathan Hunter 
> Cc: Josh Triplett 
> Cc: Lai Jiangshan 
> Cc: Leo Yan 
> Cc: Mathieu Desnoyers 
> Cc: Mathieu Poirier 
> Cc: Michael Ellerman 
> Cc: Mike Leach 
> Cc: Mikko Perttunen 
> Cc: Neeraj Upadhyay 
> Cc: Nicholas Piggin 
> Cc: Paul Mackerras 
> Cc: Suzuki K Poulose 
> Cc: Thierry Reding 
> Cc: Thomas Bogendoerfer 
> Signed-off-by: Guilherme G. Piccoli 

From an RCU perspective:

Acked-by: Paul E. McKenney 
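
For readers, a minimal sketch of what a client of the proposed informational
list looks like (the list name is taken from the hunks below; the callback
body is made up):

```
/* Sketch only: an example client of the informational notifier list. */
static int example_panic_info_cb(struct notifier_block *nb,
				 unsigned long action, void *data)
{
	pr_emerg("example: dumping informational state before kmsg_dump()\n");
	return NOTIFY_DONE;
}

static struct notifier_block example_panic_info_nb = {
	.notifier_call = example_panic_info_cb,
};

static int __init example_panic_info_init(void)
{
	atomic_notifier_chain_register(&panic_info_list, &example_panic_info_nb);
	return 0;
}
device_initcall(example_panic_info_init);
```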

> ---
>  arch/arm64/kernel/setup.c | 2 +-
>  arch/mips/kernel/relocate.c   | 2 +-
>  arch/powerpc/kernel/setup-common.c| 2 +-
>  arch/x86/kernel/setup.c   | 2 +-
>  drivers/bus/brcmstb_gisb.c| 2 +-
>  drivers/hwtracing/coresight/coresight-cpu-debug.c | 4 ++--
>  drivers/soc/tegra/ari-tegra186.c  | 3 ++-
>  include/linux/panic_notifier.h| 1 +
>  kernel/hung_task.c| 3 ++-
>  kernel/panic.c| 4 
>  kernel/rcu/tree.c | 1 -
>  kernel/rcu/tree_stall.h   | 3 ++-
>  kernel/trace/trace.c  | 2 +-
>  13 files changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 3505789cf4bd..ac2c7e8c9c6a 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -444,7 +444,7 @@ static struct notifier_block arm64_panic_block = {
>  
>  static int __init register_arm64_panic_block(void)
>  {
> - atomic_notifier_chain_register(&panic_notifier_list,
> + atomic_notifier_chain_register(&panic_info_list,
>  &arm64_panic_block);
>   return 0;
>  }
> diff --git a/arch/mips/kernel/relocate.c b/arch/mips/kernel/relocate.c
> index 56b51de2dc51..650811f2436a 100644
> --- a/arch/mips/kernel/relocate.c
> +++ b/arch/mips/kernel/relocate.c
> @@ -459,7 +459,7 @@ static struct notifier_block kernel_location_notifier = {
>  
>  static int __init register_kernel_offset_dumper(void)
>  {
> - atomic_notifier_chain_register(&panic_notifier_list,
> + atomic_notifier_chain_register(&panic_info_list,
>  &kernel_location_notifier);
>   return 0;
>  }
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 1468c3937bf4..d04b8bf8dbc7 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -757,7 +757,7 @@ void __init setup_panic(void)
>  _fadump_block);
>  
>   if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && kaslr_offset() > 0)
> - atomic_notifier_chain_register(&panic_notifier_list,
> + atomic_notifier_chain_register(&panic_info_list,
>  &kernel_offset_notifier);
>  
>   /* Low-level platform-specific routines that should run on panic */
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index c95b9ac5a457..599b25346964 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1266,7 +1266,7 @@ static struct not

Re: Low-res tick handler device not going to ONESHOT_STOPPED when tick is stopped (was: rcu_sched self-detected stall on CPU)

2022-04-14 Thread Paul E. McKenney
On Wed, Apr 13, 2022 at 04:10:02PM +1000, Nicholas Piggin wrote:
> Oops, fixed subject...
> 
> Excerpts from Nicholas Piggin's message of April 13, 2022 3:11 pm:
> > +Daniel, Thomas, Viresh
> > 
> > Subject: Re: rcu_sched self-detected stall on CPU
> > 
> > Excerpts from Michael Ellerman's message of April 9, 2022 12:42 am:
> >> Michael Ellerman  writes:
> >>> "Paul E. McKenney"  writes:
> >>>> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> >>>>> Hi
> >>>>> 
> >>>>> I can reproduce it in a ppc virtual cloud server provided by Oregon
> >>>>> State University.  Following is what I do:
> >>>>> 1) curl -l 
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> >>>>> -o linux-5.18-rc1.tar.gz
> >>>>> 2) tar zxf linux-5.18-rc1.tar.gz
> >>>>> 3) cp config linux-5.18-rc1/.config
> >>>>> 4) cd linux-5.18-rc1
> >>>>> 5) make vmlinux -j 8
> >>>>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> >>>>> -smp 2 (QEMU 4.2.1)
> >>>>> 7) after 12 rounds, the bug got reproduced:
> >>>>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >>>>
> >>>> Just to make sure, are you both seeing the same thing?  Last I knew,
> >>>> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> >>>> built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> >>>> I miss something?
> >>>>
> >>>> Miguel is instead seeing an RCU CPU stall warning where RCU's 
> >>>> grace-period
> >>>> kthread slept for three milliseconds, but did not wake up for more than
> >>>> 20 seconds.  This kthread would normally have awakened on CPU 1, but
> >>>> CPU 1 looks to me to be very unhealthy, as can be seen in your console
> >>>> output below (but maybe my idea of what is healthy for powerpc systems
> >>>> is outdated).  Please see also the inline annotations.
> >>>>
> >>>> Thoughts from the PPC guys?
> >>>
> >>> I haven't seen it in my testing. But using Miguel's config I can
> >>> reproduce it seemingly on every boot.
> >>>
> >>> For me it bisects to:
> >>>
> >>>   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >>>
> >>> Which seems plausible.
> >>>
> >>> Reverting that on mainline makes the bug go away.
> >>>
> >>> I don't see an obvious bug in the diff, but I could be wrong, or the old
> >>> code was papering over an existing bug?
> >>>
> >>> I'll try and work out what it is about Miguel's config that exposes
> >>> this vs our defconfig, that might give us a clue.
> >> 
> >> It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.
> >> 
> >> I can reproduce just with:
> >> 
> >>   $ make ppc64le_guest_defconfig
> >>   $ ./scripts/config -d HIGH_RES_TIMERS
> >> 
> >> We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
> >> realise you could disable it TBH :)
> >> 
> >> The Rust CI has it disabled because I copied that from the x86 defconfig
> >> they were using back when I added the Rust support. I think that was
> >> meant to be a stripped down fast config for CI, but the result is it's
> >> just using a badly tested combination which is not helpful.
> >> 
> >> So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
> >> can debug this further without blocking them.
> > 
> > So we traced the problem down to possibly a misunderstanding between 
> > decrementer clock event device and core code.
> > 
> > The decrementer is only oneshot*ish*. It actually needs to either be 
> > reprogrammed or shut down otherwise it just continues to cause 
> > interrupts.
> > 
> > Before commit 35de589cb879, it was sort of two-shot. The initial 
> > interrupt at the programmed time would set its internal next_tb variable 
> > to ~0 and call the ->event_handler(). If that did not set_next_event or 
> > stop the timer, the interrupt will fire again immediately, notice 
> > next_tb is ~0, and only then stop the decrementer interrupt.
> > 
> > So that was already kind of ugl
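
A rough sketch of the two-shot behaviour described above (simplified, with
assumed names; not the actual arch/powerpc/kernel/time.c code):

```
/* Sketch only: next_tb is per-CPU in the real code. */
static u64 next_tb;

static void decrementer_interrupt_sketch(struct clock_event_device *evt)
{
	u64 now = read_timebase_sketch();	/* hypothetical timebase read */

	if (now >= next_tb) {
		/* Programmed time reached: note "nothing pending", run handler. */
		next_tb = ~(u64)0;
		evt->event_handler(evt);
		/*
		 * If the handler neither reprogrammed nor stopped the device,
		 * the decrementer fires again right away ...
		 */
	} else if (next_tb == ~(u64)0) {
		/* ... and only on that second interrupt is it quiesced. */
		stop_decrementer_sketch();	/* hypothetical */
	}
}
```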

Re: rcu_sched self-detected stall on CPU

2022-04-12 Thread Paul E. McKenney
On Tue, Apr 12, 2022 at 04:53:06PM +1000, Michael Ellerman wrote:
> "Paul E. McKenney"  writes:
> > On Sun, Apr 10, 2022 at 09:33:43PM +1000, Michael Ellerman wrote:
> >> Zhouyi Zhou  writes:
> >> > On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney  
> >> > wrote:
> >> >> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
> >> >> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman  
> >> >> > wrote:
> >> ...
> >> >> > > I haven't seen it in my testing. But using Miguel's config I can
> >> >> > > reproduce it seemingly on every boot.
> >> >> > >
> >> >> > > For me it bisects to:
> >> >> > >
> >> >> > >   35de589cb879 ("powerpc/time: improve decrementer clockevent 
> >> >> > > processing")
> >> >> > >
> >> >> > > Which seems plausible.
> >> >> > I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
> >> >> > clockevent processing")
> >> ...
> >> >>
> >> >> > > Reverting that on mainline makes the bug go away.
> >> 
> >> >> > I also revert that on the mainline, and am currently doing a pressure
> >> >> > test (by repeatedly invoking qemu and checking the console.log) on PPC
> >> >> > VM in Oregon State University.
> >> 
> >> > After 306 rounds of stress test on mainline without triggering the bug
> >> > (last for 4 hours and 27 minutes), I think the bug is indeed caused by
> >> > 35de589cb879 ("powerpc/time: improve decrementer clockevent
> >> > processing") and stop the test for now.
> >> 
> >> Thanks for testing, that's pretty conclusive.
> >> 
> >> I'm not inclined to actually revert it yet.
> >> 
> >> We need to understand if there's actually a bug in the patch, or if it's
> >> just exposing some existing bug/bad behavior we have. The fact that it
> >> only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.
> >> 
> >> Do we have some code that inadvertently relies on something enabled by
> >> HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y 
> >> ?
> >
> > For whatever it is worth, moderate rcutorture runs to completion without
> > errors with CONFIG_HIGH_RES_TIMERS=n on 64-bit x86.
> 
> Thanks for testing that, I don't have any big x86 machines to test on :)
> 
> > Also for whatever it is worth, I don't know of anything other than
> > microcontrollers or the larger IoT devices that would want their kernels
> > built with CONFIG_HIGH_RES_TIMERS=n.  Which might be a failure of
> > imagination on my part, but so it goes.
> 
> Yeah I agree, like I said before I wasn't even aware you could turn it
> off. So I think we'll definitely add a select HIGH_RES_TIMERS in future,
> but first I need to work out why we are seeing stalls with it disabled.

Good point, and fair enough!

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-10 Thread Paul E. McKenney
On Sun, Apr 10, 2022 at 09:33:43PM +1000, Michael Ellerman wrote:
> Zhouyi Zhou  writes:
> > On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney  wrote:
> >> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
> >> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman  
> >> > wrote:
> ...
> >> > > I haven't seen it in my testing. But using Miguel's config I can
> >> > > reproduce it seemingly on every boot.
> >> > >
> >> > > For me it bisects to:
> >> > >
> >> > >   35de589cb879 ("powerpc/time: improve decrementer clockevent 
> >> > > processing")
> >> > >
> >> > > Which seems plausible.
> >> > I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
> >> > clockevent processing")
> ...
> >>
> >> > > Reverting that on mainline makes the bug go away.
> 
> >> > I also revert that on the mainline, and am currently doing a pressure
> >> > test (by repeatedly invoking qemu and checking the console.log) on PPC
> >> > VM in Oregon State University.
> 
> > After 306 rounds of stress test on mainline without triggering the bug
> > (last for 4 hours and 27 minutes), I think the bug is indeed caused by
> > 35de589cb879 ("powerpc/time: improve decrementer clockevent
> > processing") and stop the test for now.
> 
> Thanks for testing, that's pretty conclusive.
> 
> I'm not inclined to actually revert it yet.
> 
> We need to understand if there's actually a bug in the patch, or if it's
> just exposing some existing bug/bad behavior we have. The fact that it
> only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.
> 
> Do we have some code that inadvertently relies on something enabled by
> HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y ?

For whatever it is worth, moderate rcutorture runs to completion without
errors with CONFIG_HIGH_RES_TIMERS=n on 64-bit x86.

Also for whatever it is worth, I don't know of anything other than
microcontrollers or the larger IoT devices that would want their kernels
built with CONFIG_HIGH_RES_TIMERS=n.  Which might be a failure of
imagination on my part, but so it goes.

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-08 Thread Paul E. McKenney
On Sat, Apr 09, 2022 at 12:42:39AM +1000, Michael Ellerman wrote:
> Michael Ellerman  writes:
> > "Paul E. McKenney"  writes:
> >> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> >>> Hi
> >>> 
> >>> I can reproduce it in a ppc virtual cloud server provided by Oregon
> >>> State University.  Following is what I do:
> >>> 1) curl -l 
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> >>> -o linux-5.18-rc1.tar.gz
> >>> 2) tar zxf linux-5.18-rc1.tar.gz
> >>> 3) cp config linux-5.18-rc1/.config
> >>> 4) cd linux-5.18-rc1
> >>> 5) make vmlinux -j 8
> >>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> >>> -smp 2 (QEMU 4.2.1)
> >>> 7) after 12 rounds, the bug got reproduced:
> >>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >>
> >> Just to make sure, are you both seeing the same thing?  Last I knew,
> >> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> >> built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> >> I miss something?
> >>
> >> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> >> kthread slept for three milliseconds, but did not wake up for more than
> >> 20 seconds.  This kthread would normally have awakened on CPU 1, but
> >> CPU 1 looks to me to be very unhealthy, as can be seen in your console
> >> output below (but maybe my idea of what is healthy for powerpc systems
> >> is outdated).  Please see also the inline annotations.
> >>
> >> Thoughts from the PPC guys?
> >
> > I haven't seen it in my testing. But using Miguel's config I can
> > reproduce it seemingly on every boot.
> >
> > For me it bisects to:
> >
> >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >
> > Which seems plausible.
> >
> > Reverting that on mainline makes the bug go away.
> >
> > I don't see an obvious bug in the diff, but I could be wrong, or the old
> > code was papering over an existing bug?
> >
> > I'll try and work out what it is about Miguel's config that exposes
> > this vs our defconfig, that might give us a clue.
> 
> It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.
> 
> I can reproduce just with:
> 
>   $ make ppc64le_guest_defconfig
>   $ ./scripts/config -d HIGH_RES_TIMERS
> 
> We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
> realise you could disable it TBH :)
> 
> The Rust CI has it disabled because I copied that from the x86 defconfig
> they were using back when I added the Rust support. I think that was
> meant to be a stripped down fast config for CI, but the result is it's
> just using a badly tested combination which is not helpful.
> 
> So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
> can debug this further without blocking them.

Would it make sense to select HIGH_RES_TIMERS from one of the Kconfig*
files in arch/powerpc?  Asking for a friend.  ;-)

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-08 Thread Paul E. McKenney
On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
> On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman  wrote:
> >
> > "Paul E. McKenney"  writes:
> > > On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> > >> Hi
> > >>
> > >> I can reproduce it in a ppc virtual cloud server provided by Oregon
> > >> State University.  Following is what I do:
> > >> 1) curl -l 
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> > >> -o linux-5.18-rc1.tar.gz
> > >> 2) tar zxf linux-5.18-rc1.tar.gz
> > >> 3) cp config linux-5.18-rc1/.config
> > >> 4) cd linux-5.18-rc1
> > >> 5) make vmlinux -j 8
> > >> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> > >> -smp 2 (QEMU 4.2.1)
> > >> 7) after 12 rounds, the bug got reproduced:
> > >> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> > >
> > > Just to make sure, are you both seeing the same thing?  Last I knew,
> > > Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> > > built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> > > I miss something?
> > >
> > > Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> > > kthread slept for three milliseconds, but did not wake up for more than
> > > 20 seconds.  This kthread would normally have awakened on CPU 1, but
> > > CPU 1 looks to me to be very unhealthy, as can be seen in your console
> > > output below (but maybe my idea of what is healthy for powerpc systems
> > > is outdated).  Please see also the inline annotations.
> > >
> > > Thoughts from the PPC guys?
> >
> > I haven't seen it in my testing. But using Miguel's config I can
> > reproduce it seemingly on every boot.
> >
> > For me it bisects to:
> >
> >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >
> > Which seems plausible.
> I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
> clockevent processing")

Very good!  Thank you all!!!

Thanx, Paul

> > Reverting that on mainline makes the bug go away.
> I also revert that on the mainline, and am currently doing a pressure
> test (by repeatedly invoking qemu and checking the console.log) on PPC
> VM in Oregon State University.
> >
> > I don't see an obvious bug in the diff, but I could be wrong, or the old
> > code was papering over an existing bug?
> >
> > I'll try and work out what it is about Miguel's config that exposes
> > this vs our defconfig, that might give us a clue.
> Great job!
> >
> > cheers
> Thanks
> Zhouyi


Re: rcu_sched self-detected stall on CPU

2022-04-08 Thread Paul E. McKenney
On Fri, Apr 08, 2022 at 05:23:32PM +1000, Michael Ellerman wrote:
> "Paul E. McKenney"  writes:
> > On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> >> Hi
> >> 
> >> I can reproduce it in a ppc virtual cloud server provided by Oregon
> >> State University.  Following is what I do:
> >> 1) curl -l 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> >> -o linux-5.18-rc1.tar.gz
> >> 2) tar zxf linux-5.18-rc1.tar.gz
> >> 3) cp config linux-5.18-rc1/.config
> >> 4) cd linux-5.18-rc1
> >> 5) make vmlinux -j 8
> >> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> >> -smp 2 (QEMU 4.2.1)
> >> 7) after 12 rounds, the bug got reproduced:
> >> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >
> > Just to make sure, are you both seeing the same thing?  Last I knew,
> > Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> > built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> > I miss something?
> >
> > Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> > kthread slept for three milliseconds, but did not wake up for more than
> > 20 seconds.  This kthread would normally have awakened on CPU 1, but
> > CPU 1 looks to me to be very unhealthy, as can be seen in your console
> > output below (but maybe my idea of what is healthy for powerpc systems
> > is outdated).  Please see also the inline annotations.
> >
> > Thoughts from the PPC guys?
> 
> I haven't seen it in my testing. But using Miguel's config I can
> reproduce it seemingly on every boot.
> 
> For me it bisects to:
> 
>   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> 
> Which seems plausible.
> 
> Reverting that on mainline makes the bug go away.

Thank you for looking into this!

> I don't see an obvious bug in the diff, but I could be wrong, or the old
> code was papering over an existing bug?
> 
> I'll try and work out what it is about Miguel's config that exposes
> this vs our defconfig, that might give us a clue.

I have recently had some RCU bugs that were due to Kconfig failing to
rule out broken .config files.  Maybe this is something similar?

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-07 Thread Paul E. McKenney
On Fri, Apr 08, 2022 at 07:14:20AM +0800, Zhouyi Zhou wrote:
> Dear Paul and Miguel
> 
> On Fri, Apr 8, 2022 at 1:55 AM Paul E. McKenney  wrote:
> >
> > On Thu, Apr 07, 2022 at 07:05:58PM +0200, Miguel Ojeda wrote:
> > > On Thu, Apr 7, 2022 at 5:15 PM Paul E. McKenney  
> > > wrote:
> > > >
> > > > Ah.  So you would instead look for boot to have completed within 10
> > > > seconds?  Either way, reliable automation might well more important than
> > > > reduction in time.
> > >
> > > No (although I guess that could be an option), I was only pointing out
> > > that when no stall is produced, the run should be much quicker than 30
> > > seconds (at least it was in my setup), which would be the majority of the 
> > > runs.
> >
> > Ah, thank you for the clarification!
> Thanks to both of you for the information. In my setup (PPC cloud VM), the
> majority of the runs take at least 50 seconds to complete. From last
> evening to this morning (Beijing Time), the following experiments have
> been done:
> 1) torture mainline: the test quickly finished by hitting "rcu_sched
> self-detected stall" after 12 runs
> 2) torture v5.17: the test lasted 10 hours plus 14 minutes; 702 runs
> were done without triggering the bug
> 
> Conclusion:
> There must be a commit that causes the bug as Paul has pointed out.
> I am going to do the bisect, and estimate that I can locate the bug within a
> week (at most).
> This is a good learning experience, thanks for the guidance ;-)

Very good, and looking forward to seeing what you find.

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-07 Thread Paul E. McKenney
On Thu, Apr 07, 2022 at 07:05:58PM +0200, Miguel Ojeda wrote:
> On Thu, Apr 7, 2022 at 5:15 PM Paul E. McKenney  wrote:
> >
> > Ah.  So you would instead look for boot to have completed within 10
> > seconds?  Either way, reliable automation might well be more important than
> > reduction in time.
> 
> No (although I guess that could be an option), I was only pointing out
> that when no stall is produced, the run should be much quicker than 30
> seconds (at least it was in my setup), which would be the majority of the 
> runs.

Ah, thank you for the clarification!

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-07 Thread Paul E. McKenney
On Thu, Apr 07, 2022 at 12:07:34PM +0200, Miguel Ojeda wrote:
> On Thu, Apr 7, 2022 at 4:27 AM Zhouyi Zhou  wrote:
> >
> > Yes, this happens within 30 seconds after kernel boot.  If we take everything
> > into account (qemu preparation, kernel loading), we can do one test
> > within 54 seconds.
> 
> When it does not trigger, the run should be 20 seconds quicker than
> that (e.g. 10 seconds), since we don't wait for the stall timeout. I
> guess the timeout could also be reduced a fair bit to make failures
> quicker, but they do not contribute as much as the successes anyway.

Ah.  So you would instead look for boot to have completed within 10
seconds?  Either way, reliable automation might well be more important than
reduction in time.

> Thanks a lot for running the bisect on that server, Zhouyi!

What Miguel said!

Thanx, Paul


Re: rcu_sched self-detected stall on CPU

2022-04-06 Thread Paul E. McKenney
On Thu, Apr 07, 2022 at 02:25:59AM +0800, Zhouyi Zhou wrote:
> Hi Paul
> 
> On Thu, Apr 7, 2022 at 1:00 AM Paul E. McKenney  wrote:
> >
> > On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> > > Hi
> > >
> > > I can reproduce it in a ppc virtual cloud server provided by Oregon
> > > State University.  Following is what I do:
> > > 1) curl -l 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> > > -o linux-5.18-rc1.tar.gz
> > > 2) tar zxf linux-5.18-rc1.tar.gz
> > > 3) cp config linux-5.18-rc1/.config
> > > 4) cd linux-5.18-rc1
> > > 5) make vmlinux -j 8
> > > 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> > > -smp 2 (QEMU 4.2.1)
> > > 7) after 12 rounds, the bug got reproduced:
> > > (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >
> > Just to make sure, are you both seeing the same thing?  Last I knew,
> > Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> > built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> > I miss something?
> We are both seeing the same thing; I am working in parallel.
> 1) I am chasing the RCU-tasks issue, and I will report my discoveries
> to you later.
> 2) I am reproducing the RCU CPU stall issue reported by Miguel
> yesterday. Luckily, I can reproduce it, and thanks to Oregon State
> University, which provides me with the environment! I am also very
> interested in helping chase the reason behind the issue. Luckily, the
> issue can be reproduced in a non-hardware-accelerated qemu environment,
> so I can give a hand.

How quickly does this happen?  The console log that Miguel sent had the
stall within 30 seconds of boot.  If it always happens this quickly, it
should be possible to do a bisection, especially when running qemu.
The trick would be to boot a given commit until you see it fail on the
one hand or until it boots successfully 70 times.  In the latter case,
report success to "git bisect", in the former case report failure.
If the one-out-of-5 failure rate is accurate, you will have a 99.997%
chance of reporting the correct failure state on each step, resulting
in better than a 99.9% chance of converging on the correct commit.
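
(As an editorial aside, not part of the original thread: a quick sanity check
of that arithmetic in C, assuming an independent 1-in-5 failure probability
per boot on a bad commit, 70 clean boots to call a commit good, and roughly
14 bisection steps.)

#include <math.h>
#include <stdio.h>

int main(void)
{
        double p_fail = 1.0 / 5.0;             /* assumed per-boot failure probability */
        double p_miss = pow(1.0 - p_fail, 70); /* bad commit survives 70 boots in a row */
        double p_step = 1.0 - p_miss;          /* correct verdict on one bisection step */
        double p_all  = pow(p_step, 14);       /* assumed ~14 steps for this bisection */

        printf("per-step: %.7f  overall: %.7f\n", p_step, p_all);
        return 0;
}

Built with "cc prob.c -lm" (the file name is arbitrary), this prints a
per-step probability of about 0.9999998 and an overall probability above
0.999997, consistent with (and slightly stronger than) the figures above.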

Of course, you would hit the preceding commit hard to double-check.

Does this seem reasonable?  Or am I being overly optimistic on the
failure times?

Thanx, Paul

> Thanks
> Zhouyi
> >
> > Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> > kthread slept for three milliseconds, but did not wake up for more than
> > 20 seconds.  This kthread would normally have awakened on CPU 1, but
> > CPU 1 looks to me to be very unhealthy, as can be seen in your console
> > output below (but maybe my idea of what is healthy for powerpc systems
> > is outdated).  Please see also the inline annotations.
> >
> > Thoughts from the PPC guys?
> >
> > Thanx, Paul
> >
> > 
> >
> > [   21.186912] rcu: INFO: rcu_sched self-detected stall on CPU
> > [   21.187331] rcu: 1-...!: (4712629 ticks this GP) idle=2c1/0/0x3 
> > softirq=8/8 fqs=0
> > [   21.187529]  (t=21000 jiffies g=-1183 q=3)
> > [   21.187681] rcu: rcu_sched kthread timer wakeup didn't happen for 20997 
> > jiffies! g-1183 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> >
> > The grace-period kthread is still asleep (->state=0x402).
> > This indicates that the three-jiffy timer has somehow been
> > prevented from expiring for almost a full 21 seconds.  Of course,
> > if timers don't work, RCU cannot work.
> >
> > [   21.187770] rcu: Possible timer handling issue on cpu=1 
> > timer-softirq=1
> > [   21.187927] rcu: rcu_sched kthread starved for 21001 jiffies! g-1183 
> > f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> > [   21.188019] rcu: Unless rcu_sched kthread gets sufficient CPU time, 
> > OOM is now expected behavior.
> > [   21.188087] rcu: RCU grace-period kthread stack dump:
> > [   21.188196] task:rcu_sched   state:I stack:0 pid:   10 ppid: 
> > 2 flags:0x0800
> > [   21.188453] Call Trace:
> > [   21.188525] [c61e78a0] [c61e78e0] 0xc61e78e0 
> > (unreliable)
> > [   21.188900] [c61e7a90] [c0017210] __switch_to+0x250/0x310
> > [   21.189210] [c61e7b00] [c03ed660] __schedule+0x210/0x66

Re: rcu_sched self-detected stall on CPU

2022-04-06 Thread Paul E. McKenney
On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> Hi
> 
> I can reproduce it in a ppc virtual cloud server provided by Oregon
> State University.  Following is what I do:
> 1) curl -l 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> -o linux-5.18-rc1.tar.gz
> 2) tar zxf linux-5.18-rc1.tar.gz
> 3) cp config linux-5.18-rc1/.config
> 4) cd linux-5.18-rc1
> 5) make vmlinux -j 8
> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> -smp 2 (QEMU 4.2.1)
> 7) after 12 rounds, the bug got reproduced:
> (http://154.223.142.244/logs/20220406/qemu.log.txt)

Just to make sure, are you both seeing the same thing?  Last I knew,
Zhouyi was chasing an RCU-tasks issue that appears only in kernels
built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
I miss something?

Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
kthread slept for three milliseconds, but did not wake up for more than
20 seconds.  This kthread would normally have awakened on CPU 1, but
CPU 1 looks to me to be very unhealthy, as can be seen in your console
output below (but maybe my idea of what is healthy for powerpc systems
is outdated).  Please see also the inline annotations.

Thoughts from the PPC guys?

Thanx, Paul



[   21.186912] rcu: INFO: rcu_sched self-detected stall on CPU
[   21.187331] rcu: 1-...!: (4712629 ticks this GP) idle=2c1/0/0x3 
softirq=8/8 fqs=0 
[   21.187529]  (t=21000 jiffies g=-1183 q=3)
[   21.187681] rcu: rcu_sched kthread timer wakeup didn't happen for 20997 
jiffies! g-1183 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402

The grace-period kthread is still asleep (->state=0x402).
This indicates that the three-jiffy timer has somehow been
prevented from expiring for almost a full 21 seconds.  Of course,
if timers don't work, RCU cannot work.

[   21.187770] rcu: Possible timer handling issue on cpu=1 timer-softirq=1
[   21.187927] rcu: rcu_sched kthread starved for 21001 jiffies! g-1183 f0x0 
RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[   21.188019] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM 
is now expected behavior.
[   21.188087] rcu: RCU grace-period kthread stack dump:
[   21.188196] task:rcu_sched   state:I stack:0 pid:   10 ppid: 2 
flags:0x0800
[   21.188453] Call Trace:
[   21.188525] [c61e78a0] [c61e78e0] 0xc61e78e0 
(unreliable)
[   21.188900] [c61e7a90] [c0017210] __switch_to+0x250/0x310
[   21.189210] [c61e7b00] [c03ed660] __schedule+0x210/0x660
[   21.189315] [c61e7b80] [c03edb14] schedule+0x64/0x110
[   21.189387] [c61e7bb0] [c03f6648] 
schedule_timeout+0x1d8/0x390
[   21.189473] [c61e7c80] [c01c] rcu_gp_fqs_loop+0x2dc/0x3d0
[   21.189555] [c61e7d30] [c01144ec] rcu_gp_kthread+0x13c/0x160
[   21.189633] [c61e7dc0] [c00c1770] kthread+0x110/0x120
[   21.189714] [c61e7e10] [c000c9e4] 
ret_from_kernel_thread+0x5c/0x64

The above stack trace is expected behavior when the RCU
grace-period kthread is waiting to do its next FQS scan.

[   21.189938] rcu: Stack dump where RCU GP kthread last ran:

And here is the stalled CPU, which also happens to be the CPU
that RCU last ran on:

[   21.189992] Task dump for CPU 1:
[   21.190059] task:swapper/1   state:R  running task stack:0 pid:  
  0 ppid: 1 flags:0x0804
[   21.190169] Call Trace:
[   21.190194] [c61ef2d0] [c00c9a40] 
sched_show_task+0x180/0x1c0 (unreliable)
[   21.190278] [c61ef340] [c0116ca0] 
rcu_check_gp_kthread_starvation+0x16c/0x19c
[   21.190370] [c61ef3c0] [c0114f7c] 
rcu_sched_clock_irq+0x7ec/0xaf0
[   21.190448] [c61ef4b0] [c0120fdc] 
update_process_times+0xbc/0x140
[   21.190524] [c61ef4f0] [c0136a24] 
tick_nohz_handler+0xf4/0x1b0
[   21.190608] [c61ef540] [c001c828] timer_interrupt+0x148/0x2d0
[   21.190699] [c61ef590] [c00098e8] 
decrementer_common_virt+0x208/0x210
[   21.190837] --- interrupt: 900 at arch_local_irq_restore+0x168/0x170

Up through this point is just the stack trace of the
code doing the stack dump that the RCU CPU stall warning code
asked for.

[   21.190941] NIP:  c0013608 LR: c03f8114 CTR: c00dc630

This NIP does not look at all good to me.  But I freely confess
that I am out of date on what Power machines do.

[   21.191031] REGS: c61ef600 TRAP: 0900   Not tainted  (5.18.0-rc1)
[   21.191109] MSR:  80009033   CR: 22000202  
XER: 
[   21.191274] CFAR:  IRQMASK: 0 
[   21.191274] GPR00: 

Re: rcutorture’s init segfaults in ppc64le VM

2022-03-09 Thread Paul E. McKenney
On Thu, Mar 10, 2022 at 10:37:12AM +0800, Zhouyi Zhou wrote:
> Dear Paul
> 
> I tried to reproduce the bug in the ppc64 VM at Oregon State University
> using the vmlinux extracted from
> https://owww.molgen.mpg.de/~pmenzel/rcutorture-2022.02.01-21.52.37-torture-locktorture-kasan-lock01.tar.xz
> 
> the ppc64 VM in which I run the qemu without hardware acceleration is:
> Linux version 5.4.0-100-generic (buildd@bos02-ppc64el-021) (gcc
> version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #113-Ubuntu SMP Thu Feb
> 3 18:43:11 UTC 2022 (Ubuntu 5.4.0-100.113-generic 5.4.166)
> 
> 
> The qemu command I use to test:
> cd 
> /tmp/dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-locktorture-kasan/LOCK01$
> $qemu-system-ppc64   -nographic -smp cores=2,threads=1 -net none -M
> pseries -nodefaults -device spapr-vscsi -serial file:/tmp/console.log
> -m 512 -kernel ./vmlinux -append "debug_boot_weak_hash panic=-1
> console=ttyS0 rcutorture.onoff_interval=200
> rcutorture.onoff_holdoff=30 rcutree.gp_preinit_delay=12
> rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3
> rcutree.kthread_prio=2 threadirqs tree.use_softirq=0
> rcutorture.n_barrier_cbs=4 rcutorture.stat_interval=15
> rcutorture.shutdown_secs=1800 rcutorture.test_no_idle_hz=1
> rcutorture.verbose=1"
> 
> The console.log is uploaded to:
> http://154.223.142.244/logs/20220310/console.paul.log
> The log tells us it is an illegal instruction that causes the trouble:
> [4.246387][T1] init[1]: illegal instruction (4) at 1002c308
> nip 1002c308 lr 10001684 code 1 in init[1000+d]
> [4.251400][T1] init[1]: code: f90d88c0 f92a0008 f9480008
> 7c2004ac 2c2d f949 386d88d0 38e8
> [4.253416][T1] init[1]: code: 41820098 e92d8f98 75290010
> 4182008c <4401> 2c2d 6000 8902f438
> 
> 
> Meanwhile, the vmlinux compiled by myself runs smoothly.
> 
> Then I modify mkinitrd.sh to let it panic manually:
> http://154.223.142.244/logs/20220310/mkinitrd.sh
> The log tells us it is a segfault (instead of an illegal instruction):
> http://154.223.142.244/logs/20220310/console.zhouyi.log
> 
> Then I use gdb to debug the init in host:
> ubuntu@zhouzhouyi-1:~/newkernel/linux-next$ gdb
> tools/testing/selftests/rcutorture/initrd/init
> (gdb) run
> Starting program:
> /home/ubuntu/newkernel/linux-next/tools/testing/selftests/rcutorture/initrd/init
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x1b2c in ?? ()
> (gdb) x/10i $pc
> => 0x1b2c:  stw     r9,0(r9)
>    0x1b30:  trap
>    0x1b34:  .long 0x0
>    0x1b38:  .long 0x0
>    0x1b3c:  .long 0x0
>    0x1b40:  lis     r2,4110
>    0x1b44:  addi    r2,r2,31488
>    0x1b48:  mr      r9,r1
>    0x1b4c:  rldicr  r1,r1,0,59
>    0x1b50:  li      r0,0
> (gdb) p $r9
> $1 = 0
> (gdb) x/30x $pc - 0x30
> 0x1afc:  0x38840040  0x387f0040  0xf8010040  0x48026919
> 0x1b0c:  0x6000  0xe8010040  0x7c0803a6  0x4b24
> 0x1b1c:  0x  0x0100  0x0180  0x3920
> 0x1b2c:  0x9129  0x7fe8  0x  0x
> which matches the hex content of
> http://154.223.142.244/logs/20220310/console.zhouyi.log:
> [5.077431][T1] init[1]: segfault (11) at 0 nip 1b2c lr
> 10001024 code 1 in init[1000+d]
> [5.087167][T1] init[1]: code: 38840040 387f0040 f8010040
> 48026919 6000 e8010040 7c0803a6 4b24
> [5.093987][T1] init[1]: code:  0100 0180
> 3920 <9129> 7fe8  
> 
> 
> Conclusions: there might be something wrong when packing the init into
> vmlinux in your environment.

Quite possibly!  Or the compiler might not be being invoked properly
by the mkinitrd.sh script.

> I will continue to do research on this interesting problem with you.

Please let me know how it goes!

Thanx, Paul

> Thanks
> Kind Regards
> Zhouyi
> 
> 
> 
> On Tue, Feb 8, 2022 at 8:12 PM Paul Menzel  wrote:
> >
> > Dear Michael,
> >
> >
> > Thank you for looking into this.
> >
> > Am 08.02.22 um 11:09 schrieb Michael Ellerman:
> > > Paul Menzel writes:
> >
> > […]
> >
> > >> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
> > >> 5.17-rc2+ with rcutorture tests
> > >
> > > I'm not sure if that's the host kernel version or the version you're
> > > using of rcutorture? Can you tell us the sha1 of your host kernel and of
> > > the tree you're running rcutorture from?
> >
> > The host system runs Linux 5.17-rc1+ started with kexec. Unfortunately,
> > I am unable to find the exact sha1.
> >
> >  $ more /proc/version
> >  Linux version 5.17.0-rc1+
> > (pmen...@flughafenberlinbrandenburgwillybrandt.molgen.mpg.de) (Ubuntu
> > clang version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28
> > 17:13:04 CET 2022
> >
> > The Linux tree, from where I run rcutorture from, is at commit
> > dfd42facf1e4 (Linux 5.17-rc3) with 

Re: ppc64le: rcutorture warns about improperly set `CONFIG_HYPERVISOR_GUEST` and `CONFIG_PARAVIRT`

2022-02-07 Thread Paul E. McKenney
On Mon, Feb 07, 2022 at 05:53:05PM +0100, Paul Menzel wrote:
> Dear Sebastian, dear Paul,
> 
> 
> In commit a6fda6dab9 (rcutorture: Tweak kvm options)
> `tools/testing/selftests/rcutorture/configs/rcu/CFcommon` was extended by
> the three selections below:
> 
> CONFIG_HYPERVISOR_GUEST=y
> CONFIG_PARAVIRT=y
> CONFIG_KVM_GUEST=y
> 
> Unfortunately, `CONFIG_HYPERVISOR_GUEST` is x86-specific and
> `CONFIG_PARAVIRT` is only available on x86 and ARM.
> 
> Thus, running the tests on a ppc64le system (POWER8 IBM S822LC), the script
> shows the warnings below:
> 
> :CONFIG_HYPERVISOR_GUEST=y: improperly set
> :CONFIG_PARAVIRT=y: improperly set
> 
> Do you have a way, how to work around that?

If you can tell me the Kconfig-option incantation for ppc64le, my thought
would be to make rcutorture look for a CFcommon.ppc64.  Then the proper
Kconfig options for each architecture could be supplied.

While we are thinking about this, here is the bash function that
figures out which architecture rcutorture is running on, which
is passed the newly built vmlinux file:

identify_qemu () {
        local u="`file "$1"`"
        if test -n "$TORTURE_QEMU_CMD"
        then
                echo $TORTURE_QEMU_CMD
        elif echo $u | grep -q x86-64
        then
                echo qemu-system-x86_64
        elif echo $u | grep -q "Intel 80386"
        then
                echo qemu-system-i386
        elif echo $u | grep -q aarch64
        then
                echo qemu-system-aarch64
        elif uname -a | grep -q ppc64
        then
                echo qemu-system-ppc64
        else
                echo Cannot figure out what qemu command to use! 1>&2
                echo file $1 output: $u
                # Usually this will be one of /usr/bin/qemu-system-*
                # Use TORTURE_QEMU_CMD environment variable or appropriate
                # argument to top-level script.
                exit 1
        fi
}

First, any better approach?

Second, we need to know the Kconfig options -before- the vmlinux
file is generated.  What is the best approach in that case?

Thanx, Paul


Re: rcutorture’s init segfaults in ppc64le VM

2022-02-07 Thread Paul E. McKenney
On Mon, Feb 07, 2022 at 05:44:47PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> On the POWER8 server IBM S822LC running Ubuntu 21.10, building Linux
> 5.17-rc2+ with rcutorture tests
> 
> $ tools/testing/selftests/rcutorture/bin/torture.sh --duration 10
> 
> the built init
> 
> $ file tools/testing/selftests/rcutorture/initrd/init
> tools/testing/selftests/rcutorture/initrd/init: ELF 64-bit LSB
> executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), statically
> linked, BuildID[sha1]=0ded0e45649184a296f30d611f7a03cc51ecb616, for
> GNU/Linux 3.10.0, stripped
> 
> segfaults in QEMU. From one of the log files
> 
> 
> /dev/shm/linux/tools/testing/selftests/rcutorture/res/2022.02.01-21.52.37-torture/results-rcutorture/TREE03/console.log
> 
> [1.119803][T1] Run /init as init process
> [1.122011][T1] init[1]: segfault (11) at f0656d90 nip 1a18
> lr 0 code 1 in init[1000+d]
> [1.124863][T1] init[1]: code: 2c2903e7 f9210030 4081ff84
> 4b58  0100 0580 3c40100f
> [1.128823][T1] init[1]: code: 38427c00 7c290b78 782106e4
> 3800  7c0803a6 f801 e9028010
> 
> Executing the init, which just seems to be an endless loop, from userspace
> works:
> 
> $ strace ./tools/testing/selftests/rcutorture/initrd/init
> execve("./tools/testing/selftests/rcutorture/initrd/init",
> ["./tools/testing/selftests/rcutor"...], 0x7db9e860 /* 31 vars */) = 0
> brk(NULL)   = 0x1001d94
> brk(0x1001d940b98)  = 0x1001d940b98
> set_tid_address(0x1001d9400d0)  = 2890832
> set_robust_list(0x1001d9400e0, 24)  = 0
> uname({sysname="Linux",
> nodename="flughafenberlinbrandenburgwillybrandt.molgen.mpg.de", ...}) = 0
> prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024,
> rlim_max=RLIM64_INFINITY}) = 0
> readlink("/proc/self/exe", "/dev/shm/linux/tools/testing/sel"..., 4096)
> = 61
> getrandom("\xf1\x30\x4c\x9e\x82\x8d\x26\xd7", 8, GRND_NONBLOCK) = 8
> brk(0x1001d970b98)  = 0x1001d970b98
> brk(0x1001d98)  = 0x1001d98
> mprotect(0x100e, 65536, PROT_READ)  = 0
> clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},
> 0x7b22c8a8) = 0
> clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},
> 0x7b22c8a8) = 0
> clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, ^C{tv_sec=0,
> tv_nsec=872674044}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
> strace: Process 2890832 detached

Huh.  In PowerPC, is there some difference between system calls
executed in initrd and those same system calls executed in userspace?

And just to make sure, the above strace was from exactly the same
binary "init" file that is included in initrd, correct?

Adding Willy Tarreau for his thoughts.

Thanx, Paul

> Any ideas what `mkinitrd.sh` [2] should do differently?
> 
> ```
> cat > init.c << '___EOF___'
> #ifndef NOLIBC
> #include <unistd.h>
> #include <sys/time.h>
> #endif
> 
> volatile unsigned long delaycount;
> 
> int main(int argc, int argv[])
> {
>         int i;
>         struct timeval tv;
>         struct timeval tvb;
> 
>         for (;;) {
>                 sleep(1);
>                 /* Need some userspace time. */
>                 if (gettimeofday(&tvb, NULL))
>                         continue;
>                 do {
>                         for (i = 0; i < 1000 * 100; i++)
>                                 delaycount = i * i;
>                         if (gettimeofday(&tv, NULL))
>                                 break;
>                         tv.tv_sec -= tvb.tv_sec;
>                         if (tv.tv_sec > 1)
>                                 break;
>                         tv.tv_usec += tv.tv_sec * 1000 * 1000;
>                         tv.tv_usec -= tvb.tv_usec;
>                 } while (tv.tv_usec < 1000);
>         }
>         return 0;
> }
> ___EOF___
> 
> # build using nolibc on supported archs (smaller executable) and fall
> # back to regular glibc on other ones.
> if echo -e "#if __x86_64__||__i386__||__i486__||__i586__||__i686__" \
>"||__ARM_EABI__||__aarch64__\nyes\n#endif" \
>| ${CROSS_COMPILE}gcc -E -nostdlib -xc - \
>| grep -q '^yes'; then
>   # architecture supported by nolibc
> ${CROSS_COMPILE}gcc -fno-asynchronous-unwind-tables -fno-ident \
>   -nostdlib -include ../../../../include/nolibc/nolibc.h \
>   -s -static -Os -o init init.c -lgcc
> else
>   ${CROSS_COMPILE}gcc -s -static -Os -o init init.c
> fi
> ```
> 
> 
> Kind regards,
> 
> Paul
> 
> 
> [1]: 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/rcutorture/doc/initrd.txt
> [2]: 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/rcutorture/bin/mkinitrd.sh


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Paul E. McKenney
On Fri, Jan 28, 2022 at 08:15:47AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 28, 2022 at 04:11:57PM +, Mark Rutland wrote:
> > On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> > > Hi Mark,
> > > 
> > > Mark Rutland  writes:
> > > 
> > > > On arm64 I bisected this down to:
> > > >
> > > >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for 
> > > > dynamic queue selection")
> > > >
> > > > Which was going wrong because ilog2() rounds down, and so the shift was 
> > > > wrong
> > > > for any nr_cpus that was not a power-of-two. Paul had already fixed 
> > > > that in
> > > > rcu-next, and just sent a pull request to Linus:
> > > >
> > > >   
> > > > https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> > > >
> > > > With that applied, I no longer see these hangs.
> > > >
> > > > Does your s390 test machine have a non-power-of-two nr_cpus, and does 
> > > > that fix
> > > > the issue for you?
> > > 
> > > We noticed the PR from Paul and are currently testing the fix. So far
> > > it's looking good. The configuration where we have seen the hang is a
> > > bit unusual:
> > > 
> > > - 16 physical CPUs on the kvm host
> > > - 248 logical CPUs inside kvm
> > 
> > Aha! 248 is notably *NOT* a power of two, and in this case the shift would 
> > be
> > wrong (ilog2() would give 7, when we need a shift of 8).
> > 
> > So I suspect you're hitting the same issue as I was.
> 
> And apparently no one runs -next on systems having a non-power-of-two
> number of CPUs.  ;-)

And the fix is now in mainline.

Thanx, Paul

> > Thanks,
> > Mark.
> > 
> > > - debug kernel both on the host and kvm guest
> > > 
> > > So things are likely a bit slow in the kvm guest. Interesting is that
> > > the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> > > and gets confused before all cpus are brought up. Have to read code/test
> > > to see whether that could be possible.
> > > 
> > > Thanks for investigating!
> > > Sven


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Paul E. McKenney
On Fri, Jan 28, 2022 at 04:11:57PM +, Mark Rutland wrote:
> On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> > Hi Mark,
> > 
> > Mark Rutland  writes:
> > 
> > > On arm64 I bisected this down to:
> > >
> > >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for 
> > > dynamic queue selection")
> > >
> > > Which was going wrong because ilog2() rounds down, and so the shift was 
> > > wrong
> > > for any nr_cpus that was not a power-of-two. Paul had already fixed that 
> > > in
> > > rcu-next, and just sent a pull request to Linus:
> > >
> > >   
> > > https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> > >
> > > With that applied, I no longer see these hangs.
> > >
> > > Does your s390 test machine have a non-power-of-two nr_cpus, and does 
> > > that fix
> > > the issue for you?
> > 
> > We noticed the PR from Paul and are currently testing the fix. So far
> > it's looking good. The configuration where we have seen the hang is a
> > bit unusual:
> > 
> > - 16 physical CPUs on the kvm host
> > - 248 logical CPUs inside kvm
> 
> Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
> wrong (ilog2() would give 7, when we need a shift of 8).
> 
> So I suspect you're hitting the same issue as I was.
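
(Editorial illustration, not from the original thread: a minimal user-space
sketch of the shift problem for a non-power-of-two CPU count such as 248.
The local helper below mimics the round-down behavior of the kernel's
ilog2(); the round-up variant corresponds to what order_base_2() would give.)

#include <stdio.h>

static int ilog2_down(unsigned int x)   /* rounds down, like the kernel's ilog2() */
{
        int r = -1;

        while (x) {
                x >>= 1;
                r++;
        }
        return r;
}

int main(void)
{
        unsigned int nr_cpus = 248;
        int shift_down = ilog2_down(nr_cpus);        /* 7 */
        int shift_up = ilog2_down(nr_cpus - 1) + 1;  /* 8, i.e. order_base_2(248) */

        printf("ilog2(%u) = %d -> covers only %u CPUs\n",
               nr_cpus, shift_down, 1u << shift_down);
        printf("needed shift = %d -> covers %u CPUs\n",
               shift_up, 1u << shift_up);
        return 0;
}

With 248 CPUs the round-down shift of 7 covers only 128 CPUs' worth of
queues, while the shift actually needed is 8.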

And apparently no one runs -next on systems having a non-power-of-two
number of CPUs.  ;-)

Thanx, Paul

> Thanks,
> Mark.
> 
> > - debug kernel both on the host and kvm guest
> > 
> > So things are likely a bit slow in the kvm guest. Interesting is that
> > the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> > and gets confused before all cpus are brought up. Have to read code/test
> > to see whether that could be possible.
> > 
> > Thanks for investigating!
> > Sven


[GIT PULL] Fix kprobes issue by moving RCU-tasks initialization earlier

2021-01-04 Thread Paul E. McKenney
Hello, Linus,

This fix is for a regression in the v5.10 merge window, but it was
reported quite late in the v5.10 process, plus generating and testing
the fix took some time.

The regression is due to 36dadef23fcc ("kprobes: Init kprobes in
early_initcall") which on powerpc can use RCU Tasks before initialization,
resulting in boot failures.  The fix is straightforward, simply moving
initialization of RCU Tasks before the early_initcall()s.  The fix has
been exposed to -next and kbuild test robot testing, and has been
tested by the PowerPC guys.

The following changes since commit 0477e92881850d44910a7e94fc2c46f96faa131f:

  Linux 5.10-rc7 (2020-12-06 14:25:12 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/urgent

for you to fetch changes up to 1b04fa9900263b4e217ca2509fd778b32c2b4eb2:

  rcu-tasks: Move RCU-tasks initialization to before early_initcall() 
(2020-12-14 15:31:13 -0800)


Uladzislau Rezki (Sony) (1):
  rcu-tasks: Move RCU-tasks initialization to before early_initcall()

 include/linux/rcupdate.h |  6 ++
 init/main.c  |  1 +
 kernel/rcu/tasks.h   | 25 +
 3 files changed, 28 insertions(+), 4 deletions(-)


Re: powerpc 5.10-rcN boot failures with RCU_SCALE_TEST=m

2020-11-27 Thread Paul E. McKenney
On Fri, Nov 27, 2020 at 01:02:29PM +1100, Daniel Axtens wrote:
> Hi all,
> 
> I'm having some difficulty tracking down a bug.
> 
> Some configurations of the powerpc kernel since somewhere in the 5.10
> merge window fail to boot on some ppc64 systems. They hang while trying
> to bring up SMP. It seems to depend on the RCU_SCALE/PERF_TEST option.
> (It was renamed in the 5.10 merge window.)

Adding Mark Rutland on CC in case his similarly mystifying experience
obtaining a fix for ARM has relevance.  From what I could see, that
was a delayed consequence of the x86/entry rewrite.  It was similarly
difficult to bisect.

Thanx, Paul

> I can reproduce it as follows with qemu tcg:
> 
> make -j64 pseries_le_defconfig
> scripts/config -m RCU_SCALE_TEST
> scripts/config -m RCU_PERF_TEST
> make -j 64 vmlinux CC="ccache gcc"
> 
> qemu-system-ppc64 -cpu power9 -M pseries -m 1G -nographic -vga none -smp 4 
> -kernel vmlinux
> 
> ...
> [0.036284][T0] Mount-cache hash table entries: 8192 (order: 0, 65536 
> bytes, linear)
> [0.036481][T0] Mountpoint-cache hash table entries: 8192 (order: 0, 
> 65536 bytes, linear)
> [0.148168][T1] POWER9 performance monitor hardware support registered
> [0.151118][T1] rcu: Hierarchical SRCU implementation.
> [0.186660][T1] smp: Bringing up secondary CPUs ...
> 
> 
> I have no idea why RCU_SCALE/PERF_TEST would be causing this, but that
> seems to be what does it: if I don't set that, the kernel boots fine.
> 
> I've tried to git bisect it, but I keep getting different results:
> always a random merge of a seemingly-unrelated subsystem tree - things
> like armsoc or integrity or input.
> 
> It appears to also depend on the way the kernel is booted. Testing with
> a Canonical kernel, so a slightly different config but including
> RCU_SCALE_TEST=m, I see:
> 
> Power8 host + KVM + grub         -> boots
> Power9 host bare metal (kexec)   -> fails
> Power9 host + KVM + grub         -> fails
> Power9 host + KVM + qemu -kernel -> boots
> qemu TCG + power9 cpu            -> fails
> qemu TCG + power8 cpu            -> fails
> 
> Any ideas?
> 
> Kind regards,
> Daniel
> 
> $ qemu-system-ppc64 -version
> QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.9)
> 
> $ gcc --version
> gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> 
> It also happens when compiling with GCC 7 and 10.
> 
> 


Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-28 Thread Paul E. McKenney
On Thu, Oct 29, 2020 at 11:09:07AM +1100, Michael Ellerman wrote:
> Qian Cai  writes:
> > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > in the CPU-hotplug onlining process, which results in lockdep splats as
> > follows:
> 
> Since when?
> What kernel version?
> 
> I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> v5.10-rc1. Am I missing a CONFIG?

My guess would be that adding CONFIG_PROVE_RAW_LOCK_NESTING=y will
get you some splats.

Thanx, Paul

> cheers
> 
> 
> >  WARNING: suspicious RCU usage
> >  -
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> >
> >  other info that might help us debug this:
> >
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> >
> >  Call Trace:
> >  dump_stack+0xec/0x144 (unreliable)
> >  lockdep_rcu_suspicious+0x128/0x14c
> >  __lock_acquire+0x1060/0x1c60
> >  lock_acquire+0x140/0x5f0
> >  _raw_spin_lock_irqsave+0x64/0xb0
> >  clockevents_register_device+0x74/0x270
> >  register_decrementer_clockevent+0x94/0x110
> >  start_secondary+0x134/0x800
> >  start_secondary_prolog+0x10/0x14
> >
> > This is avoided by moving the call to rcu_cpu_starting up near the
> > beginning of the start_secondary() function. Note that the
> > raw_smp_processor_id() is required in order to avoid calling into
> > lockdep before RCU has declared the CPU to be watched for readers.
> >
> > Link: 
> > https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> > Signed-off-by: Qian Cai 
> > ---
> >  arch/powerpc/kernel/smp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 3c6b9822f978..8c2857cbd960 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
> >  /* Activate a secondary processor. */
> >  void start_secondary(void *unused)
> >  {
> > -   unsigned int cpu = smp_processor_id();
> > +   unsigned int cpu = raw_smp_processor_id();
> >  
> >> mmgrab(&init_mm);
> >> current->active_mm = &init_mm;
> >  
> > smp_store_cpu_info(cpu);
> > set_dec(tb_ticks_per_jiffy);
> > +   rcu_cpu_starting(cpu);
> > preempt_disable();
> > cpu_callin_map[cpu] = 1;
> >  
> > -- 
> > 2.28.0


Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-28 Thread Paul E. McKenney
On Wed, Oct 28, 2020 at 02:23:34PM -0400, Qian Cai wrote:
> The call to rcu_cpu_starting() in start_secondary() is not early enough
> in the CPU-hotplug onlining process, which results in lockdep splats as
> follows:
> 
>  WARNING: suspicious RCU usage
>  -
>  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> 
>  other info that might help us debug this:
> 
>  RCU used illegally from offline CPU!
>  rcu_scheduler_active = 1, debug_locks = 1
>  no locks held by swapper/1/0.
> 
>  Call Trace:
>  dump_stack+0xec/0x144 (unreliable)
>  lockdep_rcu_suspicious+0x128/0x14c
>  __lock_acquire+0x1060/0x1c60
>  lock_acquire+0x140/0x5f0
>  _raw_spin_lock_irqsave+0x64/0xb0
>  clockevents_register_device+0x74/0x270
>  register_decrementer_clockevent+0x94/0x110
>  start_secondary+0x134/0x800
>  start_secondary_prolog+0x10/0x14
> 
> This is avoided by moving the call to rcu_cpu_starting up near the
> beginning of the start_secondary() function. Note that the
> raw_smp_processor_id() is required in order to avoid calling into
> lockdep before RCU has declared the CPU to be watched for readers.
> 
> Link: 
> https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> Signed-off-by: Qian Cai 

Acked-by: Paul E. McKenney 

> ---
>  arch/powerpc/kernel/smp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 3c6b9822f978..8c2857cbd960 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
>  /* Activate a secondary processor. */
>  void start_secondary(void *unused)
>  {
> - unsigned int cpu = smp_processor_id();
> + unsigned int cpu = raw_smp_processor_id();
>  
>   mmgrab(&init_mm);
>   current->active_mm = &init_mm;
>  
>   smp_store_cpu_info(cpu);
>   set_dec(tb_ticks_per_jiffy);
> + rcu_cpu_starting(cpu);
>   preempt_disable();
>   cpu_callin_map[cpu] = 1;
>  
> -- 
> 2.28.0
> 


Re: linux-next: manual merge of the rcu tree with the powerpc tree

2020-05-21 Thread Paul E. McKenney
On Thu, May 21, 2020 at 02:51:24PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> On Tue, 19 May 2020 17:23:16 +1000 Stephen Rothwell  
> wrote:
> >
> > Today's linux-next merge of the rcu tree got a conflict in:
> > 
> >   arch/powerpc/kernel/traps.c
> > 
> > between commit:
> > 
> >   116ac378bb3f ("powerpc/64s: machine check interrupt update NMI 
> > accounting")
> > 
> > from the powerpc tree and commit:
> > 
> >   187416eeb388 ("hardirq/nmi: Allow nested nmi_enter()")
> > 
> > from the rcu tree.
> > 
> > I fixed it up (I used the powerpc tree version for now) and can carry the
> > fix as necessary. This is now fixed as far as linux-next is concerned,
> > but any non trivial conflicts should be mentioned to your upstream
> > maintainer when your tree is submitted for merging.  You may also want
> > to consider cooperating with the maintainer of the conflicting tree to
> > minimise any particularly complex conflicts.
> 
> This is now a conflict between the powerpc commit and commit
> 
>   69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
> 
> from the tip tree.  I assume that the rcu and tip trees are sharing
> some patches (but not commits) :-(

We are sharing commits, and in fact 187416eeb388 in the rcu tree came
from the tip tree.  My guess is version skew, and that I probably have
another rebase coming up.

Why is this happening?  There are sets of conflicting commits in different
efforts, and we are trying to resolve them.  But we are getting feedback
on some of those commits, which is probably what is causing the skew.

Thanx, Paul


Re: [PATCH 03/35] docs: fix broken references to text files

2020-04-08 Thread Paul E. McKenney
On Wed, Apr 08, 2020 at 05:45:55PM +0200, Mauro Carvalho Chehab wrote:
> Several references got broken due to txt to ReST conversion.
> 
> Several of them can be automatically fixed with:
> 
>   scripts/documentation-file-ref-check --fix
> 
> Reviewed-by: Mathieu Poirier  # 
> hwtracing/coresight/Kconfig
> Signed-off-by: Mauro Carvalho Chehab 

For the memory-barriers.txt portions:

Reviewed-by: Paul E. McKenney 

> ---
>  Documentation/memory-barriers.txt|  2 +-
>  Documentation/process/submit-checklist.rst   |  2 +-
>  .../translations/it_IT/process/submit-checklist.rst  |  2 +-
>  Documentation/translations/ko_KR/memory-barriers.txt |  2 +-
>  .../translations/zh_CN/filesystems/sysfs.txt |  2 +-
>  .../translations/zh_CN/process/submit-checklist.rst  |  2 +-
>  Documentation/virt/kvm/arm/pvtime.rst|  2 +-
>  Documentation/virt/kvm/devices/vcpu.rst  |  2 +-
>  Documentation/virt/kvm/hypercalls.rst|  4 ++--
>  arch/powerpc/include/uapi/asm/kvm_para.h |  2 +-
>  drivers/gpu/drm/Kconfig  |  2 +-
>  drivers/gpu/drm/drm_ioctl.c  |  2 +-
>  drivers/hwtracing/coresight/Kconfig  |  2 +-
>  fs/fat/Kconfig   |  8 
>  fs/fuse/Kconfig  |  2 +-
>  fs/fuse/dev.c|  2 +-
>  fs/overlayfs/Kconfig |  6 +++---
>  include/linux/mm.h   |  4 ++--
>  include/uapi/linux/ethtool_netlink.h |  2 +-
>  include/uapi/rdma/rdma_user_ioctl_cmds.h |  2 +-
>  mm/gup.c | 12 ++--
>  virt/kvm/arm/vgic/vgic-mmio-v3.c |  2 +-
>  virt/kvm/arm/vgic/vgic.h |  4 ++--
>  23 files changed, 36 insertions(+), 36 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index e1c355e84edd..eaabc3134294 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -620,7 +620,7 @@ because the CPUs that the Linux kernel supports don't do 
> writes
>  until they are certain (1) that the write will actually happen, (2)
>  of the location of the write, and (3) of the value to be written.
>  But please carefully read the "CONTROL DEPENDENCIES" section and the
> -Documentation/RCU/rcu_dereference.txt file:  The compiler can and does
> +Documentation/RCU/rcu_dereference.rst file:  The compiler can and does
>  break dependencies in a great many highly creative ways.
>  
>   CPU 1 CPU 2
> diff --git a/Documentation/process/submit-checklist.rst 
> b/Documentation/process/submit-checklist.rst
> index 8e56337d422d..3f8e9d5d95c2 100644
> --- a/Documentation/process/submit-checklist.rst
> +++ b/Documentation/process/submit-checklist.rst
> @@ -107,7 +107,7 @@ and elsewhere regarding submitting Linux kernel patches.
>  and why.
>  
>  26) If any ioctl's are added by the patch, then also update
> -``Documentation/ioctl/ioctl-number.rst``.
> +``Documentation/userspace-api/ioctl/ioctl-number.rst``.
>  
>  27) If your modified source code depends on or uses any of the kernel
>  APIs or features that are related to the following ``Kconfig`` symbols,
> diff --git a/Documentation/translations/it_IT/process/submit-checklist.rst 
> b/Documentation/translations/it_IT/process/submit-checklist.rst
> index 995ee69fab11..3e575502690f 100644
> --- a/Documentation/translations/it_IT/process/submit-checklist.rst
> +++ b/Documentation/translations/it_IT/process/submit-checklist.rst
> @@ -117,7 +117,7 @@ sottomissione delle patch, in particolare
>  sorgenti che ne spieghi la logica: cosa fanno e perché.
>  
>  25) Se la patch aggiunge nuove chiamate ioctl, allora aggiornate
> -``Documentation/ioctl/ioctl-number.rst``.
> +``Documentation/userspace-api/ioctl/ioctl-number.rst``.
>  
>  26) Se il codice che avete modificato dipende o usa una qualsiasi 
> interfaccia o
>  funzionalità del kernel che è associata a uno dei seguenti simboli
> diff --git a/Documentation/translations/ko_KR/memory-barriers.txt 
> b/Documentation/translations/ko_KR/memory-barriers.txt
> index 2e831ece6e26..e50fe6541335 100644
> --- a/Documentation/translations/ko_KR/memory-barriers.txt
> +++ b/Documentation/translations/ko_KR/memory-barriers.txt
> @@ -641,7 +641,7 @@ P 는 짝수 번호 캐시 라인에 저장되어 있고, 변수 B 는 홀수 
>  리눅스 커널이 지원하는 CPU 들은 (1) 쓰기가 정말로 일어날지, (2) 쓰기가 어디에
>  이루어질지, 그리고 (3) 쓰여질 값을 확실히 알기 전까지는 쓰기를 수행하지 않기
>  때문입니다.  하지만 "컨트롤 의존성" 섹션

Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Paul E. McKenney
On Thu, Apr 02, 2020 at 12:19:54PM -0400, Qian Cai wrote:
> 
> 
> > On Apr 2, 2020, at 11:54 AM, Paul E. McKenney  wrote:
> > 
> > I do run this combination quite frequently, but only as part of
> > rcutorture, which might not be a representative workload.  For one thing,
> > it has a minimal userspace consisting only of a trivial init program.
> > I don't recall having ever seen this.  (I have seen one recent complaint
> > about an IPI being sent to an offline CPU, but I cannot prove that this
> > was not due to RCU bugs that I was chasing at the time.)
> 
> Yes, it is tough with a trivial init, while running systemd should be able to catch it,
> as it will use cgroup.

Not planning to add systemd to my rcutorture runs.  ;-)

Thanx, Paul


Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Paul E. McKenney
On Thu, Apr 02, 2020 at 10:00:16AM -0400, Qian Cai wrote:
> 
> 
> > On Apr 2, 2020, at 7:24 AM, Michael Ellerman  wrote:
> > 
> > Qian Cai  writes:
> >> From: Peter Zijlstra 
> >> 
> >> In the CPU-offline process, it calls mmdrop() after idle entry and the
> >> subsequent call to cpuhp_report_idle_dead(). Once execution passes the
> >> call to rcu_report_dead(), RCU is ignoring the CPU, which results in
> >> lockdep complaining when mmdrop() uses RCU from either memcg or
> >> debugobjects below.
> >> 
> >> Fix it by cleaning up the active_mm state from BP instead. Every arch
> >> which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
> >> from AP. The only exception is parisc because it switches them to
> >> &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
> >> but the patch will still work there because it calls mmgrab(&init_mm) in
> >> smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
> > 
> > Thanks for debugging this. How did you hit it in the first place?
> 
> Just repeatedly offline/online CPUs, which will eventually cause an idle thread
> refcount to go to 0 and trigger __mmdrop(). Of course, it needs lockdep enabled
> (PROVE_RCU?) as well as some luck to hit the cgroup, workqueue
> or debugobject code paths that call RCU.
> 
> > 
> > A link to the original thread would have helped me:
> > 
> >  https://lore.kernel.org/lkml/20200113190331.12788-1-...@lca.pw/
> > 
> >> WARNING: suspicious RCU usage
> >> -
> >> kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
> >> 
> >> other info that might help us debug this:
> >> 
> >> RCU used illegally from offline CPU!
> >> Call Trace:
> >> dump_stack+0xf4/0x164 (unreliable)
> >> lockdep_rcu_suspicious+0x140/0x164
> >> get_work_pool+0x110/0x150
> >> __queue_work+0x1bc/0xca0
> >> queue_work_on+0x114/0x120
> >> css_release+0x9c/0xc0
> >> percpu_ref_put_many+0x204/0x230
> >> free_pcp_prepare+0x264/0x570
> >> free_unref_page+0x38/0xf0
> >> __mmdrop+0x21c/0x2c0
> >> idle_task_exit+0x170/0x1b0
> >> pnv_smp_cpu_kill_self+0x38/0x2e0
> >> cpu_die+0x48/0x64
> >> arch_cpu_idle_dead+0x30/0x50
> >> do_idle+0x2f4/0x470
> >> cpu_startup_entry+0x38/0x40
> >> start_secondary+0x7a8/0xa80
> >> start_secondary_resume+0x10/0x14
> > 
> > Do we know when this started happening? ie. can we determine a Fixes
> > tag?
> 
> I don't know. I looked at some commits, and it seems the code was like that
> even 10 years ago. It must be that nobody cares to run lockdep (PROVE_RCU?)
> with CPU hotplug very regularly.

I do run this combination quite frequently, but only as part of
rcutorture, which might not be a representative workload.  For one thing,
it has a minimal userspace consisting only of a trivial init program.
I don't recall having ever seen this.  (I have seen one recent complaint
about an IPI being sent to an offline CPU, but I cannot prove that this
was not due to RCU bugs that I was chasing at the time.)

Thanx, Paul

> >> 
> >> Signed-off-by: Qian Cai 
> >> ---
> >> arch/powerpc/platforms/powernv/smp.c |  1 -
> >> include/linux/sched/mm.h |  2 ++
> >> kernel/cpu.c | 18 +-
> >> kernel/sched/core.c  |  5 +++--
> >> 4 files changed, 22 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/platforms/powernv/smp.c 
> >> b/arch/powerpc/platforms/powernv/smp.c
> >> index 13e251699346..b2ba3e95bda7 100644
> >> --- a/arch/powerpc/platforms/powernv/smp.c
> >> +++ b/arch/powerpc/platforms/powernv/smp.c
> >> @@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
> >>/* Standard hot unplug procedure */
> >> 
> >>idle_task_exit();
> >> -  current->active_mm = NULL; /* for sanity */
> > 
> > If I'm reading it right, we'll now be running with active_mm == &init_mm
> > in the offline loop.
> > 
> > I guess that's fine, I can't think of any reason it would matter, and it
> > seems like we were NULL'ing it out just for paranoia's sake not because
> > of any actual problem.
> > 
> > Acked-by: Michael Ellerman  (powerpc)
> > 
> > 
> > cheers
> > 
> >> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> >> index c49257a3b510..a132d875d351 100644
> >> --- a/include/linux/sched/mm.h
> >> +++ b/include/linux/sched/mm.h
> >> @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
> >>__mmdrop(mm);
> >> }
> >> 
> >> +void mmdrop(struct mm_struct *mm);
> >> +
> >> /*
> >>  * This has to be called after a get_task_mm()/mmget_not_zero()
> >>  * followed by taking the mmap_sem for writing before modifying the
> >> diff --git a/kernel/cpu.c b/kernel/cpu.c
> >> index 2371292f30b0..244d30544377 100644
> >> --- a/kernel/cpu.c
> >> +++ b/kernel/cpu.c
> >> @@ -3,6 +3,7 @@
> >>  *
> >>  * This code is licenced under the GPL.
> >>  */
> >> +#include 
> >> #include 
> >> #include 
> >> #include 
> >> @@ -564,6 +565,21 @@ static int bringup_cpu(unsigned 

Re: [PATCH v2] Documentation/locking/locktypes: minor copy editor fixes

2020-03-25 Thread Paul E. McKenney
On Wed, Mar 25, 2020 at 09:58:14AM -0700, Randy Dunlap wrote:
> From: Randy Dunlap 
> 
> Minor editorial fixes:
> - add some hyphens in multi-word adjectives
> - add some periods for consistency
> - add "'" for possessive CPU's
> - capitalize IRQ when it's an acronym and not part of a function name
> 
> Signed-off-by: Randy Dunlap 
> Cc: Paul McKenney 
> Cc: Thomas Gleixner 
> Cc: Sebastian Siewior 
> Cc: Joel Fernandes 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 

Some nits below, but with or without those suggested changes:

Reviewed-by: Paul E. McKenney 

> ---
>  Documentation/locking/locktypes.rst |   16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next-20200325.orig/Documentation/locking/locktypes.rst
> +++ linux-next-20200325/Documentation/locking/locktypes.rst
> @@ -84,7 +84,7 @@ rtmutex
>  
>  RT-mutexes are mutexes with support for priority inheritance (PI).
>  
> -PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
> +PI has limitations on non-PREEMPT_RT-enabled kernels due to preemption and

Or just drop the " enabled".

>  interrupt disabled sections.
>  
>  PI clearly cannot preempt preemption-disabled or interrupt-disabled
> @@ -150,7 +150,7 @@ kernel configuration including PREEMPT_R
>  
>  raw_spinlock_t is a strict spinning lock implementation in all kernels,
>  including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
> -core code, low level interrupt handling and places where disabling
> +core code, low-level interrupt handling and places where disabling
>  preemption or interrupts is required, for example, to safely access
>  hardware state.  raw_spinlock_t can sometimes also be used when the
>  critical section is tiny, thus avoiding RT-mutex overhead.
> @@ -160,20 +160,20 @@ spinlock_t
>  
>  The semantics of spinlock_t change with the state of PREEMPT_RT.
>  
> -On a non PREEMPT_RT enabled kernel spinlock_t is mapped to raw_spinlock_t
> +On a non-PREEMPT_RT-enabled kernel spinlock_t is mapped to raw_spinlock_t

Ditto.

>  and has exactly the same semantics.
>  
>  spinlock_t and PREEMPT_RT
>  -
>  
> -On a PREEMPT_RT enabled kernel spinlock_t is mapped to a separate
> +On a PREEMPT_RT-enabled kernel spinlock_t is mapped to a separate

And here as well.

>  implementation based on rt_mutex which changes the semantics:
>  
> - - Preemption is not disabled
> + - Preemption is not disabled.
>  
>   - The hard interrupt related suffixes for spin_lock / spin_unlock
> -   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
> -   interrupt disabled state
> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
> +   interrupt disabled state.
>  
>   - The soft interrupt related suffix (_bh()) still disables softirq
> handlers.
> @@ -279,7 +279,7 @@ fully preemptible context.  Instead, use
>  spin_lock_irqsave() and their unlock counterparts.  In cases where the
>  interrupt disabling and locking must remain separate, PREEMPT_RT offers a
>  local_lock mechanism.  Acquiring the local_lock pins the task to a CPU,
> -allowing things like per-CPU irq-disabled locks to be acquired.  However,
> +allowing things like per-CPU IRQ-disabled locks to be acquired.  However,

Quite a bit of text in the kernel uses "irq", lower case.  Another
option is to spell out "interrupt".

>  this approach should be used only where absolutely necessary.
>  
>  
> 


Re: Documentation/locking/locktypes: Further clarifications and wordsmithing

2020-03-25 Thread Paul E. McKenney
On Wed, Mar 25, 2020 at 05:02:12PM +0100, Sebastian Siewior wrote:
> On 2020-03-25 13:27:49 [+0100], Thomas Gleixner wrote:
> > The documentation of rw_semaphores is wrong as it claims that the non-owner
> > reader release is not supported by RT. That's just history biased memory
> > distortion.
> > 
> > Split the 'Owner semantics' section up and add separate sections for
> > semaphore and rw_semaphore to reflect reality.
> > 
> > Aside of that the following updates are done:
> > 
> >  - Add pseudo code to document the spinlock state preserving mechanism on
> >PREEMPT_RT
> > 
> >  - Wordsmith the bitspinlock and lock nesting sections
> > 
> > Co-developed-by: Paul McKenney 
> > Signed-off-by: Paul McKenney 
> > Signed-off-by: Thomas Gleixner 
> Acked-by: Sebastian Andrzej Siewior 
> 
> > --- a/Documentation/locking/locktypes.rst
> > +++ b/Documentation/locking/locktypes.rst
> …
> > +rw_semaphore
> > +
> > +
> > +rw_semaphore is a multiple readers and single writer lock mechanism.
> > +
> > +On non-PREEMPT_RT kernels the implementation is fair, thus preventing
> > +writer starvation.
> > +
> > +rw_semaphore complies by default with the strict owner semantics, but there
> > +exist special-purpose interfaces that allow non-owner release for readers.
> > +These work independent of the kernel configuration.
> 
> This reads funny, could be my English. "This works independent …" maybe?

The "These" refers to "interfaces", which is plural, so "These" rather
than "This".  But yes, it is a bit awkward, because you have to skip
back past "readers", "release", and "non-owner" to find the implied
subject of that last sentence.

So how about this instead, making the implied subject explicit?

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independent of the kernel configuration.
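
For reference, the interfaces in question are the up/down_read_non_owner() variants. A purely illustrative usage sketch, with a made-up rwsem and made-up function names, assuming the reader is acquired in one context and released in another:

static DECLARE_RWSEM(my_rwsem);         /* name is illustrative */

static void start_io(void)
{
        /* Reader taken here; lockdep is told the releaser may differ. */
        down_read_non_owner(&my_rwsem);
}

/* May run in a different task than start_io(). */
static void finish_io(void)
{
        up_read_non_owner(&my_rwsem);
}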

Thanx, Paul


Re: [patch V3 13/20] Documentation: Add lock ordering and nesting documentation

2020-03-24 Thread Paul E. McKenney
On Wed, Mar 25, 2020 at 12:13:34AM +0100, Thomas Gleixner wrote:
> Paul,
> 
> "Paul E. McKenney"  writes:
> > On Sat, Mar 21, 2020 at 12:25:57PM +0100, Thomas Gleixner wrote:
> > In the normal case where the task sleeps through the entire lock
> > acquisition, the sequence of events is as follows:
> >
> >  state = UNINTERRUPTIBLE
> >  lock()
> >block()
> >  real_state = state
> >  state = SLEEPONLOCK
> >
> >lock wakeup
> >  state = real_state == UNINTERRUPTIBLE
> >
> > This sequence of events can occur when the task acquires spinlocks
> > on its way to sleeping, for example, in a call to wait_event().
> >
> > The non-lock wakeup can occur when a wakeup races with this wait_event(),
> > which can result in the following sequence of events:
> >
> >  state = UNINTERRUPTIBLE
> >  lock()
> >block()
> >  real_state = state
> >  state = SLEEPONLOCK
> >
> >  non lock wakeup
> >  real_state = RUNNING
> >
> >lock wakeup
> >  state = real_state == RUNNING
> >
> > Without this real_state subterfuge, the wakeup might be lost.
> 
> I added this with a few modifications which reflect the actual
> implementation. Conceptually the same.

Looks good!

> > rwsems have grown special-purpose interfaces that allow non-owner release.
> > This non-owner release prevents PREEMPT_RT from substituting RT-mutex
> > implementations, for example, by defeating priority inheritance.
> > After all, if the lock has no owner, whose priority should be boosted?
> > As a result, PREEMPT_RT does not currently support rwsem, which in turn
> > means that code using it must therefore be disabled until a workable
> > solution presents itself.
> >
> > [ Note: Not as confident as I would like to be in the above. ]
> 
> I'm not confident either especially not after looking at the actual
> code.
> 
> In fact I feel really stupid because the rw_semaphore reader non-owner
> restriction on RT simply does not exist anymore and my history biased
> memory tricked me.

I guess I am glad that it is not just me.  ;-)

> The first rw_semaphore implementation of RT was simple and restricted
> the reader side to a single reader to support PI on both the reader and
> the writer side. That obviously did not scale well and made mmap_sem
> heavy use cases pretty unhappy.
> 
> The short interlude with multi-reader boosting turned out to be a failed
> experiment - Steven might still disagree though :)
> 
> At some point we gave up and I myself (sic!) reimplemented the RT
> variant of rw_semaphore with a reader biased mechanism.
> 
> The reader never holds the underlying rt_mutex across the read side
> critical section. It merely increments the reader count and drops it on
> release.
> 
> The only time a reader takes the rt_mutex is when it blocks on a
> writer. Writers hold the rt_mutex across the write side critical section
> to allow incoming readers to boost them. Once the writer releases the
> rw_semaphore it unlocks the rt_mutex which is then handed off to the
> readers. They increment the reader count and then drop the rt_mutex
> before continuing in the read side critical section.
> 
> So while I changed the implementation it did obviously not occur to me
> that this also lifted the non-owner release restriction. Nobody else
> noticed either. So we kept dragging this along in both memory and
> implementation. Both will be fixed now :)
> 
> The owner semantics of down/up_read() are only enforced by lockdep. That
> applies to both RT and !RT. The up/down_read_non_owner() variants are
> just there to tell lockdep about it.
> 
> So, I picked up your other suggestions with slight modifications and
> adjusted the owner, semaphore and rw_semaphore docs accordingly.
> 
> Please have a close look at the patch below (applies on tip core/locking).
> 
> Thanks,
> 
> tglx, who is searching a brown paperbag

Sorry, used all the ones here over the past few days.  :-/

Please see below for a wordsmithing patch to be applied on top of
or merged into the patch in your email.

        Thanx, Paul



commit e38c64ce8db45e2b0a19082f1e1f988c3b25fb81
Author: Paul E. McKenney 
Date:   Tue Mar 24 17:23:36 2020 -0700

Documentation: Wordsmith lock ordering and nesting documentation

This

Re: [PATCH v4 01/17] cpu: Add new {add,remove}_cpu() functions

2020-03-23 Thread Paul E. McKenney
On Mon, Mar 23, 2020 at 01:50:54PM +, Qais Yousef wrote:
> The new functions use device_{online,offline}() which are userspace
> safe.
> 
> This is in preparation to move cpu_{up, down} kernel users to use
> a safer interface that is not racy with userspace.
> 
> Suggested-by: "Paul E. McKenney" 
> Signed-off-by: Qais Yousef 
> CC: Thomas Gleixner 
> CC: "Paul E. McKenney" 

Reviewed-by: Paul E. McKenney 

> CC: Helge Deller 
> CC: Michael Ellerman 
> CC: "David S. Miller" 
> CC: Juergen Gross 
> CC: Mark Rutland 
> CC: Lorenzo Pieralisi 
> CC: xen-de...@lists.xenproject.org
> CC: linux-par...@vger.kernel.org
> CC: sparcli...@vger.kernel.org
> CC: linuxppc-dev@lists.ozlabs.org
> CC: linux-arm-ker...@lists.infradead.org
> CC: x...@kernel.org
> CC: linux-ker...@vger.kernel.org
> ---
>  include/linux/cpu.h |  2 ++
>  kernel/cpu.c| 24 
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 1ca2baf817ed..cf8cf38dca43 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -89,6 +89,7 @@ extern ssize_t arch_cpu_release(const char *, size_t);
>  #ifdef CONFIG_SMP
>  extern bool cpuhp_tasks_frozen;
>  int cpu_up(unsigned int cpu);
> +int add_cpu(unsigned int cpu);
>  void notify_cpu_starting(unsigned int cpu);
>  extern void cpu_maps_update_begin(void);
>  extern void cpu_maps_update_done(void);
> @@ -118,6 +119,7 @@ extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  void clear_tasks_mm_cpumask(int cpu);
>  int cpu_down(unsigned int cpu);
> +int remove_cpu(unsigned int cpu);
>  
>  #else /* CONFIG_HOTPLUG_CPU */
>  
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 9c706af713fb..069802f7010f 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1057,6 +1057,18 @@ int cpu_down(unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpu_down);
>  
> +int remove_cpu(unsigned int cpu)
> +{
> + int ret;
> +
> + lock_device_hotplug();
> + ret = device_offline(get_cpu_device(cpu));
> + unlock_device_hotplug();
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(remove_cpu);
> +
>  #else
>  #define takedown_cpu NULL
>  #endif /*CONFIG_HOTPLUG_CPU*/
> @@ -1209,6 +1221,18 @@ int cpu_up(unsigned int cpu)
>  }
>  EXPORT_SYMBOL_GPL(cpu_up);
>  
> +int add_cpu(unsigned int cpu)
> +{
> + int ret;
> +
> + lock_device_hotplug();
> + ret = device_online(get_cpu_device(cpu));
> + unlock_device_hotplug();
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(add_cpu);
> +
>  #ifdef CONFIG_PM_SLEEP_SMP
>  static cpumask_var_t frozen_cpus;
>  
> -- 
> 2.17.1
> 
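
As an aside, a hypothetical in-kernel caller of the new interface might look like the following. The function name and error handling are illustrative only, not part of the patch:

static int example_offline_cpu(unsigned int cpu)
{
        int ret;

        ret = remove_cpu(cpu);          /* serialized via device_offline() */
        if (ret)
                pr_err("Failed to offline CPU%u: %d\n", cpu, ret);
        return ret;
}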


Re: [patch V3 13/20] Documentation: Add lock ordering and nesting documentation

2020-03-22 Thread Paul E. McKenney
On Sat, Mar 21, 2020 at 12:25:57PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> The kernel provides a variety of locking primitives. The nesting of these
> lock types and the implications of them on RT enabled kernels is nowhere
> documented.
> 
> Add initial documentation.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: "Paul E . McKenney" 
> Cc: Jonathan Corbet 
> Cc: Davidlohr Bueso 
> Cc: Randy Dunlap 
> ---
> V3: Addressed review comments from Paul, Jonathan, Davidlohr
> V2: Addressed review comments from Randy
> ---
>  Documentation/locking/index.rst |1 
>  Documentation/locking/locktypes.rst |  299 
> 
>  2 files changed, 300 insertions(+)
>  create mode 100644 Documentation/locking/locktypes.rst
> 
> --- a/Documentation/locking/index.rst
> +++ b/Documentation/locking/index.rst
> @@ -7,6 +7,7 @@ locking
>  .. toctree::
>  :maxdepth: 1
>  
> +locktypes
>  lockdep-design
>  lockstat
>  locktorture
> --- /dev/null
> +++ b/Documentation/locking/locktypes.rst
> @@ -0,0 +1,299 @@

[ . . . Adding your example execution sequences . . . ]

> +PREEMPT_RT kernels preserve all other spinlock_t semantics:
> +
> + - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
> +   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
> +   disable migration, which ensures that pointers to per-CPU variables
> +   remain valid even if the task is preempted.
> +
> + - Task state is preserved across spinlock acquisition, ensuring that the
> +   task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
> +   kernels leave task state untouched.  However, PREEMPT_RT must change
> +   task state if the task blocks during acquisition.  Therefore, it saves
> +   the current task state before blocking and the corresponding lock wakeup
> +   restores it.
> +
> +   Other types of wakeups would normally unconditionally set the task state
> +   to RUNNING, but that does not work here because the task must remain
> +   blocked until the lock becomes available.  Therefore, when a non-lock
> +   wakeup attempts to awaken a task blocked waiting for a spinlock, it
> +   instead sets the saved state to RUNNING.  Then, when the lock
> +   acquisition completes, the lock wakeup sets the task state to the saved
> +   state, in this case setting it to RUNNING.

In the normal case where the task sleeps through the entire lock
acquisition, the sequence of events is as follows:

 state = UNINTERRUPTIBLE
 lock()
   block()
 real_state = state
 state = SLEEPONLOCK

   lock wakeup
 state = real_state == UNINTERRUPTIBLE

This sequence of events can occur when the task acquires spinlocks
on its way to sleeping, for example, in a call to wait_event().

The non-lock wakeup can occur when a wakeup races with this wait_event(),
which can result in the following sequence of events:

 state = UNINTERRUPTIBLE
 lock()
   block()
 real_state = state
 state = SLEEPONLOCK

 non lock wakeup
 real_state = RUNNING

   lock wakeup
 state = real_state == RUNNING

Without this real_state subterfuge, the wakeup might be lost.
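
And for those who prefer code to prose, here is a minimal C rendering of the same saved-state dance. Every name in it (the struct, the states, the functions) is invented for illustration and is not the actual PREEMPT_RT implementation:

enum sketch_state { RUNNING, UNINTERRUPTIBLE, SLEEPONLOCK };

struct sketch_task {
        enum sketch_state state;
        enum sketch_state saved_state;
};

/* block(): preserve whatever state the task entered lock() with. */
static void sketch_block_on_lock(struct sketch_task *t)
{
        t->saved_state = t->state;      /* e.g. UNINTERRUPTIBLE */
        t->state = SLEEPONLOCK;         /* blocked until the lock wakeup */
}

/* Non-lock wakeup: redirect to the saved state, the task stays blocked. */
static void sketch_non_lock_wakeup(struct sketch_task *t)
{
        if (t->state == SLEEPONLOCK)
                t->saved_state = RUNNING;
        else
                t->state = RUNNING;
}

/* Lock wakeup: restore the saved state, RUNNING or UNINTERRUPTIBLE. */
static void sketch_lock_wakeup(struct sketch_task *t)
{
        t->state = t->saved_state;
}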

[ . . . and continuing where I left off earlier . . . ]

> +bit spinlocks
> +-
> +
> +Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
> +substituted by an RT-mutex based implementation for obvious reasons.
> +
> +The semantics of bit spinlocks are preserved on PREEMPT_RT kernels and the
> +caveats vs. raw_spinlock_t apply.
> +
> +Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
> +this requires conditional (#ifdef'ed) code changes at the usage site while
> +the spinlock_t substitution is simply done by the compiler and the
> +conditionals are restricted to header files and core implementation of the
> +locking primitives and the usage sites do not require any changes.

PREEMPT_RT cannot substitute bit spinlocks because a single bit is
too small to accommodate an RT-mutex.  Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site.
In contrast, usage-site changes are not needed for the spinlock_t
substitution.  Instead, conditionals in header files and the core locking
implementation enable the compiler to do the substitution transparently.
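
As a purely illustrative example of the usage-site #ifdef pattern for bit spinlocks, with an invented structure and bit number (this is not taken from any actual kernel code):

#define FOO_LOCK_BIT    0       /* lock bit within foo->flags */

struct foo {
        unsigned long   flags;
#ifdef CONFIG_PREEMPT_RT
        spinlock_t      lock;   /* substituted lock, spin_lock_init() at init */
#endif
};

static inline void foo_lock(struct foo *f)
{
#ifdef CONFIG_PREEMPT_RT
        spin_lock(&f->lock);
#else
        bit_spin_lock(FOO_LOCK_BIT, &f->flags);
#endif
}

static inline void foo_unlock(struct foo *f)
{
#ifdef CONFIG_PREEMPT_RT
        spin_unlock(&f->lock);
#else
        bit_spin_unlock(FOO_LOCK_BIT, &f->flags);
#endif
}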


> +Lock type nesting rules
> +===
&

Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-21 Thread Paul E. McKenney
On Sat, Mar 21, 2020 at 11:26:06AM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> > On Fri, Mar 20, 2020 at 11:36:03PM +0100, Thomas Gleixner wrote:
> >> I agree that what I tried to express is hard to parse, but it's at least
> >> halfways correct :)
> >
> > Apologies!  That is what I get for not looking it up in the source.  :-/
> >
> > OK, so I am stupid enough not only to get it wrong, but also to try again:
> >
> >... Other types of wakeups would normally unconditionally set the
> >task state to RUNNING, but that does not work here because the task
> >must remain blocked until the lock becomes available.  Therefore,
> >when a non-lock wakeup attempts to awaken a task blocked waiting
> >for a spinlock, it instead sets the saved state to RUNNING.  Then,
> >when the lock acquisition completes, the lock wakeup sets the task
> >state to the saved state, in this case setting it to RUNNING.
> >
> > Is that better?
> 
> Definitely!
> 
> Thanks for all the editorial work!

NP, and glad you like it!

But I felt even more stupid sometime in the middle of the night.  Why on
earth didn't I work in your nice examples?  :-/

I will pull them in later.  Time to go hike!!!

Thanx, Paul


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Paul E. McKenney
On Fri, Mar 20, 2020 at 11:36:03PM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> > On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
> >> "Paul E. McKenney"  writes:
> >> >
> >> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >> >lock to exclude softirq handlers.
> >> 
> >> I've made that:
> >> 
> >>   - The soft interrupt related suffix (_bh()) still disables softirq
> >> handlers.
> >> 
> >> Non-PREEMPT_RT kernels disable preemption to get this effect.
> >> 
> >> PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
> >> disables softirq handlers and prevents reentrancy by a preempting
> >> task.
> >
> > That works!  At the end, I would instead say "prevents reentrancy
> > due to task preemption", but what you have works.
> 
> Yours is better.
> 
> >>- Task state is preserved across spinlock acquisition, ensuring that the
> >>  task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
> >>  kernels leave task state untouched.  However, PREEMPT_RT must change
> >>  task state if the task blocks during acquisition.  Therefore, it
> >>  saves the current task state before blocking and the corresponding
> >>  lock wakeup restores it. A regular not lock related wakeup sets the
> >>  task state to RUNNING. If this happens while the task is blocked on
> >>  a spinlock then the saved task state is changed so that correct
> >>  state is restored on lock wakeup.
> >> 
> >> Hmm?
> >
> > I of course cannot resist editing the last two sentences:
> >
> >... Other types of wakeups unconditionally set task state to RUNNING.
> >If this happens while a task is blocked while acquiring a spinlock,
> >then the task state is restored to its pre-acquisition value at
> >lock-wakeup time.
> 
> Errm no. That would mean
> 
>  state = UNINTERRUPTIBLE
>  lock()
>block()
>  real_state = state
>  state = SLEEPONLOCK
> 
>non lock wakeup
>  state = RUNNING<--- FAIL #1
> 
>lock wakeup
>  state = real_state <--- FAIL #2
> 
> How it works is:
> 
>  state = UNINTERRUPTIBLE
>  lock()
>block()
>  real_state = state
>  state = SLEEPONLOCK
> 
>non lock wakeup
>  real_state = RUNNING
> 
>lock wakeup
>  state = real_state == RUNNING
> 
> If there is no 'non lock wakeup' before the lock wakeup:
> 
>  state = UNINTERRUPTIBLE
>  lock()
>block()
>  real_state = state
>  state = SLEEPONLOCK
> 
>lock wakeup
>  state = real_state == UNINTERRUPTIBLE
> 
> I agree that what I tried to express is hard to parse, but it's at least
> halfways correct :)

Apologies!  That is what I get for not looking it up in the source.  :-/

OK, so I am stupid enough not only to get it wrong, but also to try again:

   ... Other types of wakeups would normally unconditionally set the
   task state to RUNNING, but that does not work here because the task
   must remain blocked until the lock becomes available.  Therefore,
   when a non-lock wakeup attempts to awaken a task blocked waiting
   for a spinlock, it instead sets the saved state to RUNNING.  Then,
   when the lock acquisition completes, the lock wakeup sets the task
   state to the saved state, in this case setting it to RUNNING.

Is that better?

Thanx, Paul


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Paul E. McKenney
On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> >
> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >lock to exclude softirq handlers.
> 
> I've made that:
> 
>   - The soft interrupt related suffix (_bh()) still disables softirq
> handlers.
> 
> Non-PREEMPT_RT kernels disable preemption to get this effect.
> 
> PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
> disables softirq handlers and prevents reentrancy by a preempting
> task.

That works!  At the end, I would instead say "prevents reentrancy
due to task preemption", but what you have works.

> On non-RT this is implicit through preemption disable, but it's
> non-obvious for RT as preemption stays enabled.
> 
> > PREEMPT_RT kernels preserve all other spinlock_t semantics:
> >
> >  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
> >avoid migration by disabling preemption.  PREEMPT_RT kernels instead
> >disable migration, which ensures that pointers to per-CPU variables
> >remain valid even if the task is preempted.
> >
> >  - Task state is preserved across spinlock acquisition, ensuring that the
> >task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
> >kernels leave task state untouched.  However, PREEMPT_RT must change
> >task state if the task blocks during acquisition.  Therefore, the
> >corresponding lock wakeup restores the task state.  Note that regular
> >(not lock related) wakeups do not restore task state.
> 
>- Task state is preserved across spinlock acquisition, ensuring that the
>  task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
>  kernels leave task state untouched.  However, PREEMPT_RT must change
>  task state if the task blocks during acquisition.  Therefore, it
>  saves the current task state before blocking and the corresponding
>  lock wakeup restores it. A regular not lock related wakeup sets the
>  task state to RUNNING. If this happens while the task is blocked on
>  a spinlock then the saved task state is changed so that correct
>  state is restored on lock wakeup.
> 
> Hmm?

I of course cannot resist editing the last two sentences:

   ... Other types of wakeups unconditionally set task state to RUNNING.
   If this happens while a task is blocked while acquiring a spinlock,
   then the task state is restored to its pre-acquisition value at
   lock-wakeup time.

> > But this code failes on PREEMPT_RT kernels because the memory allocator
> > is fully preemptible and therefore cannot be invoked from truly atomic
> > contexts.  However, it is perfectly fine to invoke the memory allocator
> > while holding a normal non-raw spinlocks because they do not disable
> > preemption::
> >
> >> +  spin_lock();
> >> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> >> +
> >> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> >> +execution is forced into thread context and the lock substitution is
> >> +ensuring preemptibility.
> >
> > Interestingly enough, most uses of GFP_ATOMIC allocations are
> > actually safe on PREEMPT_RT because the lock substitution ensures
> > preemptibility.  Only those GFP_ATOMIC allocations that are invoked
> > while holding a raw spinlock or with preemption otherwise disabled need
> > adjustment to work correctly on PREEMPT_RT.
> >
> > [ I am not as confident of the above as I would like to be... ]
> 
> I'd leave that whole paragraph out. This documents the rules and from
> the above code examples it's pretty clear what works and what not :)

Works for me!  ;-)

> > And meeting time, will continue later!
> 
> Enjoy!

Not bad, actually, as meetings go.

Thanx, Paul


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Paul E. McKenney
On Thu, Mar 19, 2020 at 07:02:17PM +0100, Thomas Gleixner wrote:
> Paul,
> 
> "Paul E. McKenney"  writes:
> 
> > On Wed, Mar 18, 2020 at 09:43:10PM +0100, Thomas Gleixner wrote:
> >
> > Mostly native-English-speaker services below, so please feel free to
> > ignore.  The one place I made a substantive change, I marked it "@@@".
> > I only did about half of this document, but should this prove useful,
> > I will do the other half later.
> 
> Native speaker services are always useful and appreciated.

Glad it is helpful.  ;-)

[ . . . ]

> >> +
> >> +raw_spinlock_t and spinlock_t
> >> +=
> >> +
> >> +raw_spinlock_t
> >> +--
> >> +
> >> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> >> +kernel configuration including PREEMPT_RT enabled kernels.
> >> +
> >> +raw_spinlock_t is to be used only in real critical core code, low level
> >> +interrupt handling and places where protecting (hardware) state is required
> >> +to be safe against preemption and eventually interrupts.
> >> +
> >> +Another reason to use raw_spinlock_t is when the critical section is tiny
> >> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
> >> +contended case.
> >
> > raw_spinlock_t is a strict spinning lock implementation in all kernels,
> > including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
> > core code, low level interrupt handling and places where disabling
> > preemption or interrupts is required, for example, to safely access
> > hardware state.  raw_spinlock_t can sometimes also be used when the
> > critical section is tiny and the lock is lightly contended, thus avoiding
> > RT-mutex overhead.
> >
> > @@@  I added the point about the lock being lightly contended.
> 
> Hmm, not sure. The point is that if the critical section is small the
> overhead of cross CPU boosting along with the resulting IPIs is going to
> be at least an order of magnitude larger. And on contention this is just
> pushing the raw_spinlock contention off to the raw_spinlock in the rt
> mutex plus the owning task's pi_lock, which makes things even worse.

Fair enough.  So, leaving that out:

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  In addition, raw_spinlock_t can sometimes be used when
the critical section is tiny, thus avoiding RT-mutex overhead.
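
For illustration, the sort of hardware-state access that calls for raw_spinlock_t might look like this (the lock, function, and register names are made up):

static DEFINE_RAW_SPINLOCK(chip_lock);          /* illustrative */

static void chip_write_reg(void __iomem *base, u32 offset, u32 val)
{
        unsigned long flags;

        /* Must be truly atomic, even on PREEMPT_RT: hardware state. */
        raw_spin_lock_irqsave(&chip_lock, flags);
        writel(val, base + offset);
        raw_spin_unlock_irqrestore(&chip_lock, flags);
}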

> >> + - The hard interrupt related suffixes for spin_lock / spin_unlock
> >> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
> 
> Si senor!

;-)

> >> +   interrupt disabled state
> >> +
> >> + - The soft interrupt related suffix (_bh()) is still disabling the
> >> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
> >> +   kernel, which utilizes the preemption count, this is achieved by a per
> >> +   CPU bottom half locking mechanism.
> >
> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >per-bottom-half locking mechanism.
> 
> it's not per-bottom-half anymore. That turned out to be dangerous due to
> dependencies between BH types, e.g. network and timers.

Ah!  OK, how about this?

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.  However, unlike non-PREEMPT_RT kernels (which disable
   preemption to get this effect), PREEMPT_RT kernels use a per-CPU
   lock to exclude softirq handlers.
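
From the caller's point of view nothing changes across kernel configurations; a minimal usage sketch with made-up names:

static DEFINE_SPINLOCK(stats_lock);     /* illustrative */
static unsigned long stats_count;

void stats_inc(void)
{
        /* Excludes softirq handlers on both !PREEMPT_RT and PREEMPT_RT. */
        spin_lock_bh(&stats_lock);
        stats_count++;
        spin_unlock_bh(&stats_lock);
}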

> I hope I was able to encourage you to comment on the other half as well :)

OK, here goes...

> +All other semantics of spinlock_t are preserved:
> +
> + - Migration of tasks which hold a spinlock_t is prevented. On a non
> +   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
> +   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
> +   pointers to per CPU variables stay valid even if the task is preempted.
> +
> + - Task state preservation. The task state is not affected when a lock is
> +   contended and the task has to schedule out and wait for the lock to
> +   become available. The lock wake up restores the task state unless there
> +   was a regular (not lock related) wake up on the task. This ensures that

Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-18 Thread Paul E. McKenney
On Wed, Mar 18, 2020 at 09:43:10PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> The kernel provides a variety of locking primitives. The nesting of these
> lock types and the implications of them on RT enabled kernels is nowhere
> documented.
> 
> Add initial documentation.
> 
> Signed-off-by: Thomas Gleixner 

Mostly native-English-speaker services below, so please feel free to
ignore.  The one place I made a substantive change, I marked it "@@@".
I only did about half of this document, but should this prove useful,
I will do the other half later.

Thanx, Paul

> ---
> V2: Addressed review comments from Randy
> ---
>  Documentation/locking/index.rst |1 
>  Documentation/locking/locktypes.rst |  298 
> 
>  2 files changed, 299 insertions(+)
>  create mode 100644 Documentation/locking/locktypes.rst
> 
> --- a/Documentation/locking/index.rst
> +++ b/Documentation/locking/index.rst
> @@ -7,6 +7,7 @@ locking
>  .. toctree::
>  :maxdepth: 1
>  
> +locktypes
>  lockdep-design
>  lockstat
>  locktorture
> --- /dev/null
> +++ b/Documentation/locking/locktypes.rst
> @@ -0,0 +1,298 @@
> +.. _kernel_hacking_locktypes:
> +
> +==
> +Lock types and their rules
> +==
> +
> +Introduction
> +
> +
> +The kernel provides a variety of locking primitives which can be divided
> +into two categories:
> +
> + - Sleeping locks
> + - Spinning locks
> +
> +This document describes the lock types at least at the conceptual level and
> +provides rules for nesting of lock types also under the aspect of PREEMPT_RT.

I suggest something like this:

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.

> +
> +Lock categories
> +===
> +
> +Sleeping locks
> +--
> +
> +Sleeping locks can only be acquired in preemptible task context.
> +
> +Some of the implementations allow try_lock() attempts from other contexts,
> +but that has to be really evaluated carefully including the question
> +whether the unlock can be done from that context safely as well.
> +
> +Note that some lock types change their implementation details when
> +debugging is enabled, so this should be really only considered if there is
> +no other option.

How about something like this?

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock().  Furthermore, it is also necessary to evaluate the debugging
versions of these primitives.  In short, don't acquire sleeping locks
from other contexts unless there is no other option.

> +Sleeping lock types:
> +
> + - mutex
> + - rt_mutex
> + - semaphore
> + - rw_semaphore
> + - ww_mutex
> + - percpu_rw_semaphore
> +
> +On a PREEMPT_RT enabled kernel the following lock types are converted to
> +sleeping locks:

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

> + - spinlock_t
> + - rwlock_t
> +
> +Spinning locks
> +--
> +
> + - raw_spinlock_t
> + - bit spinlocks
> +
> +On a non PREEMPT_RT enabled kernel the following lock types are spinning
> +locks as well:

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

> + - spinlock_t
> + - rwlock_t
> +
> +Spinning locks implicitly disable preemption and the lock / unlock functions
> +can have suffixes which apply further protections:
> +
> + ===  
> + _bh()Disable / enable bottom halves (soft interrupts)
> + _irq()   Disable / enable interrupts
> + _irqsave/restore()   Save and disable / restore interrupt disabled state
> + ===  
> +
> +
> +rtmutex
> +===
> +
> +RT-mutexes are mutexes with support for priority inheritance (PI).
> +
> +PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
> +interrupt disabled sections.
> +
> +On a PREEMPT_RT enabled kernel most of these sections are fully
> +preemptible. This is possible because PREEMPT_RT forces most executions
> +into task context, especially interrupt handlers and soft interrupts, which
> +allows to substitute spinlock_t and rwlock_t with RT-mutex based
> +implementations.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts.  This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.

> +
> +raw_spinlock_t and spinlock_t
> +=
> +
> +raw_spinlock_t
> +--
> +
> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> +kernel 

Re: [PATCH] treewide: Rename rcu_dereference_raw_notrace to _check

2019-07-12 Thread Paul E. McKenney
On Thu, Jul 11, 2019 at 04:45:41PM -0400, Joel Fernandes (Google) wrote:
> The rcu_dereference_raw_notrace() API name is confusing.
> It is equivalent to rcu_dereference_raw() except that it also does
> sparse pointer checking.
> 
> There are only a few users of rcu_dereference_raw_notrace(). This
> patch renames all of them to be rcu_dereference_raw_check with the
> "check" indicating sparse checking.
> 
> Signed-off-by: Joel Fernandes (Google) 

I queued this, but reworked the commit log and fixed a couple of
irritating checkpatch issues that were in the original code.
Does this work for you?

Thanx, Paul



commit bd5c0fea6016c90cf7a9eb0435cd0c373dfdac2f
Author: Joel Fernandes (Google) 
Date:   Thu Jul 11 16:45:41 2019 -0400

treewide: Rename rcu_dereference_raw_notrace() to _check()

The rcu_dereference_raw_notrace() API name is confusing.  It is equivalent
to rcu_dereference_raw() except that it also does sparse pointer checking.

There are only a few users of rcu_dereference_raw_notrace(). This patch
renames all of them to be rcu_dereference_raw_check() with the "_check()"
indicating sparse checking.

Signed-off-by: Joel Fernandes (Google) 
[ paulmck: Fix checkpatch warnings about parentheses. ]
Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index f04c467e55c5..467251f7fef6 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2514,7 +2514,7 @@ disabled across the entire RCU read-side critical section.
 
 It is possible to use tracing on RCU code, but tracing itself
 uses RCU.
-For this reason, rcu_dereference_raw_notrace()
+For this reason, rcu_dereference_raw_check()
 is provided for use by tracing, which avoids the destructive
 recursion that could otherwise ensue.
 This API is also used by virtualization in some architectures,
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 21b1ed5df888..53388a311967 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -546,7 +546,7 @@ static inline void note_hpte_modification(struct kvm *kvm,
  */
 static inline struct kvm_memslots *kvm_memslots_raw(struct kvm *kvm)
 {
-   return rcu_dereference_raw_notrace(kvm->memslots[0]);
+   return rcu_dereference_raw_check(kvm->memslots[0]);
 }
 
 extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e91ec9ddcd30..932296144131 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -622,7 +622,7 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n,
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define hlist_for_each_entry_rcu(pos, head, member)\
-   for (pos = hlist_entry_safe (rcu_dereference_raw(hlist_first_rcu(head)),\
+   for (pos = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),\
typeof(*(pos)), member);\
pos;\
pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\
@@ -642,10 +642,10 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n,
  * not do any RCU debugging or tracing.
  */
 #define hlist_for_each_entry_rcu_notrace(pos, head, member)\
-   for (pos = hlist_entry_safe (rcu_dereference_raw_notrace(hlist_first_rcu(head)),\
+   for (pos = hlist_entry_safe(rcu_dereference_raw_check(hlist_first_rcu(head)),\
typeof(*(pos)), member);\
pos;\
-   pos = hlist_entry_safe(rcu_dereference_raw_notrace(hlist_next_rcu(\
+   pos = hlist_entry_safe(rcu_dereference_raw_check(hlist_next_rcu(\
&(pos)->member)), typeof(*(pos)), member))
 
 /**
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0c9b92799abc..e5161e377ad4 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -478,7 +478,7 @@ do {  \
  * The no-tracing version of rcu_dereference_raw() must not call
  * rcu_read_lock_held().
  */
-#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
+#define rcu_dereference_raw_check(p) __rcu_dereference_check((p), 1, __rcu)
 
 /**
  * rcu_dereference_protected() - fetch RCU pointer when updates prevented
diff --git a/kernel/trace/ftrace_internal.h b/kernel/tra

Re: [PATCH RFC 0/5] Remove some notrace RCU APIs

2019-05-28 Thread Paul E. McKenney
On Tue, May 28, 2019 at 03:00:07PM -0400, Joel Fernandes wrote:
> On Tue, May 28, 2019 at 05:24:47AM -0700, Paul E. McKenney wrote:
> > On Sat, May 25, 2019 at 02:14:07PM -0400, Joel Fernandes wrote:
> > > On Sat, May 25, 2019 at 08:50:35AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 25, 2019 at 10:19:54AM -0400, Joel Fernandes wrote:
> > > > > On Sat, May 25, 2019 at 07:08:26AM -0400, Steven Rostedt wrote:
> > > > > > On Sat, 25 May 2019 04:14:44 -0400
> > > > > > Joel Fernandes  wrote:
> > > > > > 
> > > > > > > > I guess the difference between the _raw_notrace and just _raw 
> > > > > > > > variants
> > > > > > > > is that _notrace ones do a rcu_check_sparse(). Don't we want to 
> > > > > > > > keep
> > > > > > > > that check?  
> > > > > > > 
> > > > > > > This is true.
> > > > > > > 
> > > > > > > Since the users of _raw_notrace are very few, is it worth keeping 
> > > > > > > this API
> > > > > > > just for sparse checking? The API naming is also confusing. I was 
> > > > > > > expecting
> > > > > > > _raw_notrace to do fewer checks than _raw, instead of more. 
> > > > > > > Honestly, I just
> > > > > > > want to nuke _raw_notrace as done in this series and later we can 
> > > > > > > introduce a
> > > > > > > sparse checking version of _raw if need-be. The other option 
> > > > > > > could be to
> > > > > > > always do sparse checking for _raw however that used to be the 
> > > > > > > case and got
> > > > > > > changed in 
> > > > > > > http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
> > > > > > 
> > > > > > What if we just rename _raw to _raw_nocheck, and _raw_notrace to 
> > > > > > _raw ?
> > > > > 
> > > > > That would also mean changing 160 usages of _raw to _raw_nocheck in 
> > > > > the
> > > > > kernel :-/.
> > > > > 
> > > > > The tracing usage of _raw_notrace is only like 2 or 3 users. Can we 
> > > > > just call
> > > > > rcu_check_sparse directly in the calling code for those and eliminate 
> > > > > the APIs?
> > > > > 
> > > > > I wonder what Paul thinks about the matter as well.
> > > > 
> > > > My thought is that it is likely that a goodly number of the current uses
> > > > of _raw should really be some form of _check, with lockdep expressions
> > > > spelled out.  Not that working out what exactly those lockdep 
> > > > expressions
> > > > should be is necessarily a trivial undertaking.  ;-)
> > > 
> > > Yes, currently where I am a bit stuck is the rcu_dereference_raw()
> > > cannot possibly know what SRCU domain it is under, so lockdep cannot 
> > > check if
> > > an SRCU lock is held without the user also passing along the SRCU domain. 
> > > I
> > > am trying to change lockdep to see if it can check if *any* srcu domain 
> > > lock
> > > is held (regardless of which one) and complain if none are. This is at 
> > > least
> > > better than no check at all.
> > > 
> > > However, I think it gets tricky for mutexes. If you have something like:
> > > mutex_lock(some_mutex);
> > > p = rcu_dereference_raw(gp);
> > > mutex_unlock(some_mutex);
> > > 
> > > This might be a perfectly valid invocation of _raw, however my checks 
> > > (patch
> > > is still cooking) trigger a lockdep warning because _raw cannot know that 
> > > this
> > > is Ok. lockdep thinks it is not in a reader section. This then gets into 
> > > the
> > > territory of a new rcu_dereference_raw_protected(gp, 
> > > assert_held(some_mutex))
> > > which sucks because its yet another API. To circumvent this issue, can we
> > > just have callers of rcu_dereference_raw ensure that they call
> > > rcu_read_lock() if they are protecting dereferences by a mutex? That would
> > > make things a lot easier and also may be Ok since rcu_read_lock is quite
> > > cheap.
> > 
> > Why not just rcu_dereference_protected(lockdep_is_held(some_mutex))?
> > The API is already there, a

Re: [PATCH RFC 0/5] Remove some notrace RCU APIs

2019-05-28 Thread Paul E. McKenney
On Sat, May 25, 2019 at 02:14:07PM -0400, Joel Fernandes wrote:
> On Sat, May 25, 2019 at 08:50:35AM -0700, Paul E. McKenney wrote:
> > On Sat, May 25, 2019 at 10:19:54AM -0400, Joel Fernandes wrote:
> > > On Sat, May 25, 2019 at 07:08:26AM -0400, Steven Rostedt wrote:
> > > > On Sat, 25 May 2019 04:14:44 -0400
> > > > Joel Fernandes  wrote:
> > > > 
> > > > > > I guess the difference between the _raw_notrace and just _raw 
> > > > > > variants
> > > > > > is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> > > > > > that check?  
> > > > > 
> > > > > This is true.
> > > > > 
> > > > > Since the users of _raw_notrace are very few, is it worth keeping 
> > > > > this API
> > > > > just for sparse checking? The API naming is also confusing. I was 
> > > > > expecting
> > > > > _raw_notrace to do fewer checks than _raw, instead of more. Honestly, 
> > > > > I just
> > > > > want to nuke _raw_notrace as done in this series and later we can 
> > > > > introduce a
> > > > > sparse checking version of _raw if need-be. The other option could be 
> > > > > to
> > > > > always do sparse checking for _raw however that used to be the case 
> > > > > and got
> > > > > changed in 
> > > > > http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
> > > > 
> > > > What if we just rename _raw to _raw_nocheck, and _raw_notrace to _raw ?
> > > 
> > > That would also mean changing 160 usages of _raw to _raw_nocheck in the
> > > kernel :-/.
> > > 
> > > The tracing usage of _raw_notrace is only like 2 or 3 users. Can we just 
> > > call
> > > rcu_check_sparse directly in the calling code for those and eliminate the 
> > > APIs?
> > > 
> > > I wonder what Paul thinks about the matter as well.
> > 
> > My thought is that it is likely that a goodly number of the current uses
> > of _raw should really be some form of _check, with lockdep expressions
> > spelled out.  Not that working out what exactly those lockdep expressions
> > should be is necessarily a trivial undertaking.  ;-)
> 
> Yes, currently where I am a bit stuck is the rcu_dereference_raw()
> cannot possibly know what SRCU domain it is under, so lockdep cannot check if
> an SRCU lock is held without the user also passing along the SRCU domain. I
> am trying to change lockdep to see if it can check if *any* srcu domain lock
> is held (regardless of which one) and complain if none are. This is at least
> better than no check at all.
> 
> However, I think it gets tricky for mutexes. If you have something like:
> mutex_lock(some_mutex);
> p = rcu_dereference_raw(gp);
> mutex_unlock(some_mutex);
> 
> This might be a perfectly valid invocation of _raw, however my checks (patch
> is still cooking) trigger a lockdep warning because _raw cannot know that this
> is Ok. lockdep thinks it is not in a reader section. This then gets into the
> territory of a new rcu_dereference_raw_protected(gp, assert_held(some_mutex))
> which sucks because its yet another API. To circumvent this issue, can we
> just have callers of rcu_dereference_raw ensure that they call
> rcu_read_lock() if they are protecting dereferences by a mutex? That would
> make things a lot easier and also may be Ok since rcu_read_lock is quite
> cheap.

Why not just rcu_dereference_protected(lockdep_is_held(some_mutex))?
The API is already there, and no need for spurious readers.
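
Spelling that shorthand out, and reusing the hypothetical gp and some_mutex from your example, it would be something like:

	mutex_lock(&some_mutex);
	p = rcu_dereference_protected(gp, lockdep_is_held(&some_mutex));
	/* ... read or update *p, no rcu_read_lock() needed ... */
	mutex_unlock(&some_mutex);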

> > That aside, if we are going to change the name of an API that is
> > used 160 places throughout the tree, we would need to have a pretty
> > good justification.  Without such a justification, it will just look
> > like pointless churn to the various developers and maintainers on the
> > receiving end of the patches.
> 
> Actually, the API name change is not something I want to do, it is Steven
> suggestion. My suggestion is let us just delete _raw_notrace and just use the
> _raw API for tracing, since _raw doesn't do any tracing anyway. Steve pointed out
> that _raw_notrace does sparse checking unlike _raw, but I think that isn't an
> issue since _raw doesn't do such checking at the moment anyway. (if possible
> check my cover letter again for details/motivation of this series).

Understood, but regardless of who suggested it, if we are to go through
with it, good justification will be required.  ;-)

Thanx, Paul

> thanks!
> 
>  - Joel
> 
> > Thanx, Paul
> > 
> > > thanks, Steven!
> > > 
> > 
> 



Re: [PATCH RFC 0/5] Remove some notrace RCU APIs

2019-05-25 Thread Paul E. McKenney
On Sat, May 25, 2019 at 10:19:54AM -0400, Joel Fernandes wrote:
> On Sat, May 25, 2019 at 07:08:26AM -0400, Steven Rostedt wrote:
> > On Sat, 25 May 2019 04:14:44 -0400
> > Joel Fernandes  wrote:
> > 
> > > > I guess the difference between the _raw_notrace and just _raw variants
> > > > is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> > > > that check?  
> > > 
> > > This is true.
> > > 
> > > Since the users of _raw_notrace are very few, is it worth keeping this API
> > > just for sparse checking? The API naming is also confusing. I was 
> > > expecting
> > > _raw_notrace to do fewer checks than _raw, instead of more. Honestly, I 
> > > just
> > > want to nuke _raw_notrace as done in this series and later we can 
> > > introduce a
> > > sparse checking version of _raw if need-be. The other option could be to
> > > always do sparse checking for _raw however that used to be the case and 
> > > got
> > > changed in 
> > > http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
> > 
> > What if we just rename _raw to _raw_nocheck, and _raw_notrace to _raw ?
> 
> That would also mean changing 160 usages of _raw to _raw_nocheck in the
> kernel :-/.
> 
> The tracing usage of _raw_notrace is only like 2 or 3 users. Can we just call
> rcu_check_sparse directly in the calling code for those and eliminate the 
> APIs?
> 
> I wonder what Paul thinks about the matter as well.

My thought is that it is likely that a goodly number of the current uses
of _raw should really be some form of _check, with lockdep expressions
spelled out.  Not that working out what exactly those lockdep expressions
should be is necessarily a trivial undertaking.  ;-)
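
For example, a use currently written as rcu_dereference_raw() might instead become something like the following, with the lockdep expression naming whatever actually provides the protection (the mutex name here is of course made up):

	p = rcu_dereference_check(gp,
				  lockdep_is_held(&my_mutex) ||
				  rcu_read_lock_held());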

That aside, if we are going to change the name of an API that is
used 160 places throughout the tree, we would need to have a pretty
good justification.  Without such a justification, it will just look
like pointless churn to the various developers and maintainers on the
receiving end of the patches.

Thanx, Paul

> thanks, Steven!
> 



Re: [PATCH] MAINTAINERS: Update remaining @linux.vnet.ibm.com addresses

2019-04-11 Thread Paul E. McKenney
On Thu, Apr 11, 2019 at 05:27:31AM -0700, Joe Perches wrote:
> On Thu, 2019-04-11 at 22:07 +1000, Michael Ellerman wrote:
> > Joe Perches  writes:
> > > On Thu, 2019-04-11 at 06:27 +0200, Lukas Bulwahn wrote:
> > > > Paul McKenney attempted to update all email addresses 
> > > > @linux.vnet.ibm.com
> > > > to @linux.ibm.com in commit 1dfddcdb95c4
> > > > ("MAINTAINERS: Update from @linux.vnet.ibm.com to @linux.ibm.com"), but
> > > > some still remained.
> > > > 
> > > > We update the remaining email addresses in MAINTAINERS, hopefully 
> > > > finally
> > > > catching all cases for good.
> > > 
> > > Perhaps update all the similar addresses in other files too
> > > 
> > > $ git grep --name-only 'linux\.vnet\.ibm\.com' | wc -l
> > > 315
> > 
> > A good number of them are no longer valid. So I'm not sure it's worth
> > updating them en masse to addresses that won't ever work.
> > 
> > We have git now, we don't need email addresses in files, they're just
> > prone to bitrot like this.
> > 
> > Should we just change them all like so?
> > 
> >   -arch/powerpc/boot/dts/bamboo.dts: * Josh Boyer 
> > 
> >   +arch/powerpc/boot/dts/bamboo.dts: * Josh Boyer (IBM)
> > 
> > To indicate the author was at IBM when they wrote it?
> 
> If that's desired, perhaps:
> 
> $ git grep -P --name-only '?' | \
>   grep -vP '\.mailmap|MAINTAINERS' | \
>   xargs perl -p -i -e 's/?/(IBM)/g'
> 
> > Or should we try and update them with current addresses? Though then the
> > authors might start getting mails they don't want.
> 
> That'd be my preference.
> 
> If authors get emails they don't want, then those contact
> emails should be removed.

I have updated most of mine, with one more installment of patches to go
into the next merge window and another into the merge window after that.
More churn than I would have expected, though.  If my email address were
to change again, I would instead go with the "(IBM)" approach and let
the git log and MAINTAINERS file keep the contact information.  Not that
we get to update the git log, of course.  ;-)

I might not have bothered except for combining with the SPDX-tag
commits.

Thanx, Paul



Re: [PATCH] MAINTAINERS: Update remaining @linux.vnet.ibm.com addresses

2019-04-11 Thread Paul E. McKenney
On Thu, Apr 11, 2019 at 06:27:52AM +0200, Lukas Bulwahn wrote:
> Paul McKenney attempted to update all email addresses @linux.vnet.ibm.com
> to @linux.ibm.com in commit 1dfddcdb95c4
> ("MAINTAINERS: Update from @linux.vnet.ibm.com to @linux.ibm.com"), but
> some still remained.
> 
> We update the remaining email addresses in MAINTAINERS, hopefully finally
> catching all cases for good.
> 
> Fixes: 1dfddcdb95c4 ("MAINTAINERS: Update from @linux.vnet.ibm.com to 
> @linux.ibm.com")
> Signed-off-by: Lukas Bulwahn 

For whatever it is worth:

Acked-by: Paul E. McKenney 

> ---
> 
> Tyrel, please take this patch. Thanks.
> 
>  MAINTAINERS | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2359e12e4c41..454b3cf36aa4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7439,14 +7439,14 @@ F:drivers/crypto/vmx/ghash*
>  F:   drivers/crypto/vmx/ppc-xlate.pl
>  
>  IBM Power PCI Hotplug Driver for RPA-compliant PPC64 platform
> -M:   Tyrel Datwyler 
> +M:   Tyrel Datwyler 
>  L:   linux-...@vger.kernel.org
>  L:   linuxppc-dev@lists.ozlabs.org
>  S:   Supported
>  F:   drivers/pci/hotplug/rpaphp*
>  
>  IBM Power IO DLPAR Driver for RPA-compliant PPC64 platform
> -M:   Tyrel Datwyler 
> +M:   Tyrel Datwyler 
>  L:   linux-...@vger.kernel.org
>  L:   linuxppc-dev@lists.ozlabs.org
>  S:   Supported
> @@ -10388,7 +10388,7 @@ F:arch/arm/mach-mmp/
>  
>  MMU GATHER AND TLB INVALIDATION
>  M:   Will Deacon 
> -M:   "Aneesh Kumar K.V" 
> +M:   "Aneesh Kumar K.V" 
>  M:   Andrew Morton 
>  M:   Nick Piggin 
>  M:   Peter Zijlstra 
> -- 
> 2.17.1
> 



Re: [RFC PATCH v2 09/14] watchdog/hardlockup: Make arch_touch_nmi_watchdog() to hpet-based implementation

2019-02-27 Thread Paul E. McKenney
On Wed, Feb 27, 2019 at 08:05:13AM -0800, Ricardo Neri wrote:
> CPU architectures that have an NMI watchdog use arch_touch_nmi_watchdog()
> to briefly ignore the hardlockup detector. If the architecture does not
> have an NMI watchdog, one can be constructed using a source of non-
> maskable interrupts. In this case, arch_touch_nmi_watchdog() is common
> to any underlying hardware resource used to drive the detector and needs
> to be available to other kernel subsystems if hardware different from perf
> drives the detector.
> 
> There exists perf-based and HPET-based implementations. Make it available
> to the latter.
> 
> For clarity, wrap this function in a separate preprocessor conditional
> from functions which are truly specific to the perf-based implementation.
> 
> Cc: "H. Peter Anvin" 
> Cc: Ashok Raj 
> Cc: Andi Kleen 
> Cc: Tony Luck 
> Cc: "Rafael J. Wysocki" 
> Cc: Don Zickus 
> Cc: Nicholas Piggin 
> Cc: Michael Ellerman 
> Cc: Frederic Weisbecker 
> Cc: Alexei Starovoitov 
> Cc: Babu Moger 
> Cc: "David S. Miller" 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Mathieu Desnoyers 
> Cc: Masami Hiramatsu 
> Cc: Peter Zijlstra 
> Cc: Andrew Morton 
> Cc: Philippe Ombredanne 
> Cc: Colin Ian King 
> Cc: Byungchul Park 
> Cc: "Paul E. McKenney" 
> Cc: "Luis R. Rodriguez" 
> Cc: Waiman Long 
> Cc: Josh Poimboeuf 
> Cc: Randy Dunlap 
> Cc: Davidlohr Bueso 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Kai-Heng Feng 
> Cc: Konrad Rzeszutek Wilk 
> Cc: David Rientjes 
> Cc: "Ravi V. Shankar" 
> Cc: x...@kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Ricardo Neri 
> ---
>  include/linux/nmi.h | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 5a8b19749769..bf5ebcfdd590 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -94,8 +94,16 @@ static inline void hardlockup_detector_disable(void) {}
>  # define NMI_WATCHDOG_SYSCTL_PERM0444
>  #endif
> 
> -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> +#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) || \
> +defined(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET)

Why not instead make CONFIG_X86_HARDLOCKUP_DETECTOR_HPET select
CONFIG_HARDLOCKUP_DETECTOR_PERF?  Keep the arch-specific details
in the arch-specific files and all that.

Thanx, Paul

>  extern void arch_touch_nmi_watchdog(void);
> +#else
> +# if !defined(CONFIG_HAVE_NMI_WATCHDOG)
> +static inline void arch_touch_nmi_watchdog(void) {}
> +# endif
> +#endif
> +
> +#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
>  extern void hardlockup_detector_perf_stop(void);
>  extern void hardlockup_detector_perf_restart(void);
>  extern void hardlockup_detector_perf_disable(void);
> -- 
> 2.17.1
> 



[tip:core/rcu] powerpc: Convert hugepd_free() to use call_rcu()

2018-12-04 Thread tip-bot for Paul E. McKenney
Commit-ID:  04229110adfba984950fc0209632640a76eb1de4
Gitweb: https://git.kernel.org/tip/04229110adfba984950fc0209632640a76eb1de4
Author: Paul E. McKenney 
AuthorDate: Mon, 5 Nov 2018 16:53:13 -0800
Committer:  Paul E. McKenney 
CommitDate: Thu, 8 Nov 2018 21:43:20 -0800

powerpc: Convert hugepd_free() to use call_rcu()

Now that call_rcu()'s callback is not invoked until after all
preempt-disable regions of code have completed (in addition to explicitly
marked RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_sched().  This commit therefore makes that change.

Signed-off-by: Paul E. McKenney 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: 
---
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8cf035e68378..4c01e9a01a74 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -289,7 +289,7 @@ static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
 
(*batchp)->ptes[(*batchp)->index++] = hugepte;
if ((*batchp)->index == HUGEPD_FREELIST_SIZE) {
-   call_rcu_sched(&(*batchp)->rcu, hugepd_free_rcu_callback);
+   call_rcu(&(*batchp)->rcu, hugepd_free_rcu_callback);
*batchp = NULL;
}
put_cpu_var(hugepd_freelist_cur);


Re: UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h in 4.20-rc1

2018-11-14 Thread Paul E. McKenney
On Wed, Nov 14, 2018 at 03:43:05PM +0100, Christophe LEROY wrote:
> 
> 
> On 09/11/2018 at 21:10, Paul E. McKenney wrote:
> >On Fri, Nov 09, 2018 at 06:11:20PM +0100, Christophe LEROY wrote:
> >>(Resending due to error in Paul's address)
> >>
> >>Paul
> >>
> >>I get the following UBSAN reports in 4.20-rc1 on an MPC8321E
> >>(powerpc/book3s/32)
> >>
> >>I bisected it to 3e31009898699dfc ("rcu: Defer reporting RCU-preempt
> >>quiescent states when disabled")
> >
> >Fixed by dfdc33585b0a ("rcu: Avoid signed integer overflow in
> >rcu_preempt_deferred_qs()") in my -rcu tree and in -next, which I intend
> >to push into the next merge window.
> 
> Thanks, I confirm it fixes the issue.
> 
> Do you intend to push it into 4.20-rc3 or do you mean 4.21 ?

The next merge window, which will be either v4.21 or v5.0.  The v4.20
merge window is over and done.  ;-)

Please note that the gcc command-line arguments used by the Linux kernel
prevent the compiler from taking advantage of the C-standard signed
integer overflow aspect of undefined behavior, so this is an aesthetic
issue rather than a failure case.  Plus the C++ standards committee just
voted in a change that gets rid of signed integer overflow completely.
It is not clear whether the C language will also make this change, but
it does require that the usual compilers have the ability to operate in
this manner.
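
For concreteness, the pattern that UBSAN is flagging boils down to something like this simplified example, which is not the actual tree_plugin.h code:

#include <limits.h>

int demo(void)
{
	int nesting = INT_MIN;	/* special "deferred" marker value */

	/* Formally undefined (-INT_MIN does not fit in an int), but the
	 * kernel's compiler flags keep gcc from exploiting that. */
	return 0 - nesting;
}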

Thanx, Paul

> Christophe
> 
> > Thanx, Paul
> >
> >>Thanks
> >>Christophe
> >>
> >>[4.919995] 
> >>
> >>[4.928428] UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h:623:28
> >>[4.935198] signed integer overflow:
> >>[4.938766] 0 - -2147483648 cannot be represented in type 'int'
> >>[4.944678] CPU: 0 PID: 119 Comm: mkdir Not tainted
> >>4.19.0-rc1-s3k-dev-5-g5a60513 #214
> >>[4.952908] Call Trace:
> >>[4.955382] [dec4fd20] [c02cb0d0] ubsan_epilogue+0x18/0x74 (unreliable)
> >>[4.962003] [dec4fd30] [c02cb5e0] handle_overflow+0xd0/0xe0
> >>[4.967588] [dec4fdb0] [c007b424] rcu_preempt_deferred_qs+0xc0/0xc8
> >>[4.973857] [dec4fdd0] [c007be28] rcu_note_context_switch+0x74/0x608
> >>[4.980217] [dec4fe10] [c064b790] __schedule+0x58/0x6e0
> >>[4.985448] [dec4fe50] [c064bfdc] preempt_schedule_common+0x48/0x9c
> >>[4.991717] [dec4fe70] [c01308c8] handle_mm_fault+0x10fc/0x1ecc
> >>[4.997639] [dec4fee0] [c001339c] do_page_fault+0x10c/0x760
> >>[5.003225] [dec4ff40] [c001234c] handle_page_fault+0x14/0x40
> >>[5.008968] --- interrupt: 401 at 0xff9cff8
> >>[5.008968] LR = 0xfeefd78
> >>[5.016170] 
> >>
> >>[5.024591] 
> >>
> >>[5.033005] UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h:627:28
> >>[5.039775] signed integer overflow:
> >>[5.043342] -2147483648 + -2147483648 cannot be represented in type 'int'
> >>[5.050118] CPU: 0 PID: 119 Comm: mkdir Not tainted
> >>4.19.0-rc1-s3k-dev-5-g5a60513 #214
> >>[5.058348] Call Trace:
> >>[5.060813] [dec4fd20] [c02cb0d0] ubsan_epilogue+0x18/0x74 (unreliable)
> >>[5.067433] [dec4fd30] [c02cb5e0] handle_overflow+0xd0/0xe0
> >>[5.073014] [dec4fdb0] [c007b408] rcu_preempt_deferred_qs+0xa4/0xc8
> >>[5.079283] [dec4fdd0] [c007be28] rcu_note_context_switch+0x74/0x608
> >>[5.085640] [dec4fe10] [c064b790] __schedule+0x58/0x6e0
> >>[5.090871] [dec4fe50] [c064bfdc] preempt_schedule_common+0x48/0x9c
> >>[5.097139] [dec4fe70] [c01308c8] handle_mm_fault+0x10fc/0x1ecc
> >>[5.103059] [dec4fee0] [c001339c] do_page_fault+0x10c/0x760
> >>[5.108642] [dec4ff40] [c001234c] handle_page_fault+0x14/0x40
> >>[5.114385] --- interrupt: 401 at 0xff9cff8
> >>[5.114385] LR = 0xfeefd78
> >>[5.121588] 
> >>
> >>
> 



[PATCH tip/core/rcu 08/41] powerpc: Convert hugepd_free() to use call_rcu()

2018-11-11 Thread Paul E. McKenney
Now that call_rcu()'s callback is not invoked until after all
preempt-disable regions of code have completed (in addition to explicitly
marked RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_sched().  This commit therefore makes that change.
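
For context, a minimal sketch of the kind of reader that is now covered
by call_rcu() (illustrative only; struct foo, the slot parameter, and
do_something_with() are placeholders, not names from this patch):

static void reader(struct foo __rcu **slot)
{
	struct foo *p;

	preempt_disable();		/* now acts as an RCU read-side critical section */
	p = rcu_dereference_sched(*slot);
	if (p)
		do_something_with(p);	/* p cannot be freed by the call_rcu() callback here */
	preempt_enable();
}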

Signed-off-by: Paul E. McKenney 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: 
---
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8cf035e68378..4c01e9a01a74 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -289,7 +289,7 @@ static void hugepd_free(struct mmu_gather *tlb, void 
*hugepte)
 
(*batchp)->ptes[(*batchp)->index++] = hugepte;
if ((*batchp)->index == HUGEPD_FREELIST_SIZE) {
-   call_rcu_sched(&(*batchp)->rcu, hugepd_free_rcu_callback);
+   call_rcu(&(*batchp)->rcu, hugepd_free_rcu_callback);
*batchp = NULL;
}
put_cpu_var(hugepd_freelist_cur);
-- 
2.17.1



Re: UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h in 4.20-rc1

2018-11-10 Thread Paul E. McKenney
On Fri, Nov 09, 2018 at 12:10:30PM -0800, Paul E. McKenney wrote:
> On Fri, Nov 09, 2018 at 06:11:20PM +0100, Christophe LEROY wrote:
> > (Resending due to error in Paul's address)
> > 
> > Paul
> > 
> > I get the following UBSAN reports in 4.20-rc1 on an MPC8321E
> > (powerpc/book3s/32)
> > 
> > I bisected it to 3e31009898699dfc ("rcu: Defer reporting RCU-preempt
> > quiescent states when disabled")
> 
> Fixed by dfdc33585b0a ("rcu: Avoid signed integer overflow in
> rcu_preempt_deferred_qs()") in my -rcu tree and in -next, which I intend
> to push into the next merge window.

And while I am at it...  The C++ Standards Committee just yesterday
voted "Signed integers are twos complement" into the C++20 standard.  ;-)

Yeah, C++20 rather than now, and C++ rather than C, but there you have it!

Thanx, Paul

> > Thanks
> > Christophe
> > 
> > [4.919995] 
> > 
> > [4.928428] UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h:623:28
> > [4.935198] signed integer overflow:
> > [4.938766] 0 - -2147483648 cannot be represented in type 'int'
> > [4.944678] CPU: 0 PID: 119 Comm: mkdir Not tainted
> > 4.19.0-rc1-s3k-dev-5-g5a60513 #214
> > [4.952908] Call Trace:
> > [4.955382] [dec4fd20] [c02cb0d0] ubsan_epilogue+0x18/0x74 (unreliable)
> > [4.962003] [dec4fd30] [c02cb5e0] handle_overflow+0xd0/0xe0
> > [4.967588] [dec4fdb0] [c007b424] rcu_preempt_deferred_qs+0xc0/0xc8
> > [4.973857] [dec4fdd0] [c007be28] rcu_note_context_switch+0x74/0x608
> > [4.980217] [dec4fe10] [c064b790] __schedule+0x58/0x6e0
> > [4.985448] [dec4fe50] [c064bfdc] preempt_schedule_common+0x48/0x9c
> > [4.991717] [dec4fe70] [c01308c8] handle_mm_fault+0x10fc/0x1ecc
> > [4.997639] [dec4fee0] [c001339c] do_page_fault+0x10c/0x760
> > [5.003225] [dec4ff40] [c001234c] handle_page_fault+0x14/0x40
> > [5.008968] --- interrupt: 401 at 0xff9cff8
> > [5.008968] LR = 0xfeefd78
> > [5.016170] 
> > 
> > [5.024591] 
> > 
> > [5.033005] UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h:627:28
> > [5.039775] signed integer overflow:
> > [5.043342] -2147483648 + -2147483648 cannot be represented in type 'int'
> > [5.050118] CPU: 0 PID: 119 Comm: mkdir Not tainted
> > 4.19.0-rc1-s3k-dev-5-g5a60513 #214
> > [5.058348] Call Trace:
> > [5.060813] [dec4fd20] [c02cb0d0] ubsan_epilogue+0x18/0x74 (unreliable)
> > [5.067433] [dec4fd30] [c02cb5e0] handle_overflow+0xd0/0xe0
> > [5.073014] [dec4fdb0] [c007b408] rcu_preempt_deferred_qs+0xa4/0xc8
> > [5.079283] [dec4fdd0] [c007be28] rcu_note_context_switch+0x74/0x608
> > [5.085640] [dec4fe10] [c064b790] __schedule+0x58/0x6e0
> > [5.090871] [dec4fe50] [c064bfdc] preempt_schedule_common+0x48/0x9c
> > [5.097139] [dec4fe70] [c01308c8] handle_mm_fault+0x10fc/0x1ecc
> > [5.103059] [dec4fee0] [c001339c] do_page_fault+0x10c/0x760
> > [5.108642] [dec4ff40] [c001234c] handle_page_fault+0x14/0x40
> > [5.114385] --- interrupt: 401 at 0xff9cff8
> > [5.114385] LR = 0xfeefd78
> > [5.121588] 
> > 
> > 



Re: UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h in 4.20-rc1

2018-11-09 Thread Paul E. McKenney
On Fri, Nov 09, 2018 at 06:11:20PM +0100, Christophe LEROY wrote:
> (Resending due to error in Paul's address)
> 
> Paul
> 
> I get the following UBSAN reports in 4.20-rc1 on an MPC8321E
> (powerpc/book3s/32)
> 
> I bisected it to 3e31009898699dfc ("rcu: Defer reporting RCU-preempt
> quiescent states when disabled")

Fixed by dfdc33585b0a ("rcu: Avoid signed integer overflow in
rcu_preempt_deferred_qs()") in my -rcu tree and in -next, which I intend
to push into the next merge window.

Thanx, Paul

> Thanks
> Christophe
> 
> [4.919995] 
> 
> [4.928428] UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h:623:28
> [4.935198] signed integer overflow:
> [4.938766] 0 - -2147483648 cannot be represented in type 'int'
> [4.944678] CPU: 0 PID: 119 Comm: mkdir Not tainted
> 4.19.0-rc1-s3k-dev-5-g5a60513 #214
> [4.952908] Call Trace:
> [4.955382] [dec4fd20] [c02cb0d0] ubsan_epilogue+0x18/0x74 (unreliable)
> [4.962003] [dec4fd30] [c02cb5e0] handle_overflow+0xd0/0xe0
> [4.967588] [dec4fdb0] [c007b424] rcu_preempt_deferred_qs+0xc0/0xc8
> [4.973857] [dec4fdd0] [c007be28] rcu_note_context_switch+0x74/0x608
> [4.980217] [dec4fe10] [c064b790] __schedule+0x58/0x6e0
> [4.985448] [dec4fe50] [c064bfdc] preempt_schedule_common+0x48/0x9c
> [4.991717] [dec4fe70] [c01308c8] handle_mm_fault+0x10fc/0x1ecc
> [4.997639] [dec4fee0] [c001339c] do_page_fault+0x10c/0x760
> [5.003225] [dec4ff40] [c001234c] handle_page_fault+0x14/0x40
> [5.008968] --- interrupt: 401 at 0xff9cff8
> [5.008968] LR = 0xfeefd78
> [5.016170] 
> 
> [5.024591] 
> 
> [5.033005] UBSAN: Undefined behaviour in kernel/rcu/tree_plugin.h:627:28
> [5.039775] signed integer overflow:
> [5.043342] -2147483648 + -2147483648 cannot be represented in type 'int'
> [5.050118] CPU: 0 PID: 119 Comm: mkdir Not tainted
> 4.19.0-rc1-s3k-dev-5-g5a60513 #214
> [5.058348] Call Trace:
> [5.060813] [dec4fd20] [c02cb0d0] ubsan_epilogue+0x18/0x74 (unreliable)
> [5.067433] [dec4fd30] [c02cb5e0] handle_overflow+0xd0/0xe0
> [5.073014] [dec4fdb0] [c007b408] rcu_preempt_deferred_qs+0xa4/0xc8
> [5.079283] [dec4fdd0] [c007be28] rcu_note_context_switch+0x74/0x608
> [5.085640] [dec4fe10] [c064b790] __schedule+0x58/0x6e0
> [5.090871] [dec4fe50] [c064bfdc] preempt_schedule_common+0x48/0x9c
> [5.097139] [dec4fe70] [c01308c8] handle_mm_fault+0x10fc/0x1ecc
> [5.103059] [dec4fee0] [c001339c] do_page_fault+0x10c/0x760
> [5.108642] [dec4ff40] [c001234c] handle_page_fault+0x14/0x40
> [5.114385] --- interrupt: 401 at 0xff9cff8
> [5.114385] LR = 0xfeefd78
> [5.121588] 
> 
> 



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-02 Thread Paul E. McKenney
On Fri, Nov 02, 2018 at 01:23:28PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 02, 2018 at 10:56:31AM +, David Laight wrote:
> > From: Paul E. McKenney
> > > Sent: 01 November 2018 17:02
> > ...
> > > And there is a push to define C++ signed arithmetic as 2s complement,
> > > but there are still 1s complement systems with C compilers.  Just not
> > > C++ compilers.  Legacy...
> > 
> > Hmmm... I've used C compilers for DSPs where signed integer arithmetic
> > used the 'data registers' and would saturate, unsigned used the 'address
> > registers' and wrapped.
> > That was deliberate because it is much better to clip analogue values.
> 
> Seems a dodgy heuristic if you ask me.
> 
> > Then there was the annoying cobol run time that didn't update the
> > result variable if the result wouldn't fit.
> > Took a while to notice that the sum of a list of values was even wrong!
> > That would be perfectly valid for C - if unexpected.
> 
> That's just insane ;-)
> 
> > > > But for us using -fno-strict-overflow which actually defines signed
> > > > overflow
> > 
> > I wonder how much real code 'strict-overflow' gets rid of?
> > IIRC gcc silently turns loops like:
> > int i; for (i = 1; i != 0; i *= 2) ...
> > into infinite ones.
> > Which is never what is required.
> 
> Nobody said C was a 'safe' language. But less UB makes a better language
> IMO. Ideally we'd get all UBs filled in -- but I realise C has a few
> very 'interesting' ones that might be hard to get rid of.

There has been an effort to reduce UB, but not sure how far they got.

Thanx, Paul



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-02 Thread Paul E. McKenney
On Fri, Nov 02, 2018 at 10:56:31AM +, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 01 November 2018 17:02
> ...
> > And there is a push to define C++ signed arithmetic as 2s complement,
> > but there are still 1s complement systems with C compilers.  Just not
> > C++ compilers.  Legacy...
> 
> Hmmm... I've used C compilers for DSPs where signed integer arithmetic
> used the 'data registers' and would saturate, unsigned used the 'address
> registers' and wrapped.
> That was deliberate because it is much better to clip analogue values.

There are no C++ compilers for those DSPs, correct?  (Some of the
C++ standards committee members believe that they have fully checked,
but they might well have missed something.)

> Then there was the annoying cobol run time that didn't update the
> result variable if the result wouldn't fit.
> Took a while to notice that the sum of a list of values was even wrong!
> That would be perfectly valid for C - if unexpected.

Heh!  COBOL and FORTRAN also helped fund my first pass through university.

> > > But for us using -fno-strict-overflow which actually defines signed
> > > overflow
> 
> I wonder how much real code 'strict-overflow' gets rid of?
> IIRC gcc silently turns loops like:
>   int i; for (i = 1; i != 0; i *= 2) ...
> into infinite ones.
> Which is never what is required.

The usual response is something like this:

for (i = 1; i < n; i++)

where the compiler has no idea what range of values "n" might take on.
Can't say that I am convinced by that example, but at least we do have
-fno-strict-overflow.
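
The textbook case where the assumption does buy something is a loop with
an inclusive bound (an illustrative sketch, not an example taken from
that discussion):

/* With signed-overflow UB the compiler may assume "i" never wraps, take
 * the trip count to be exactly n + 1, and unroll or vectorize freely.
 * Under -fwrapv / -fno-strict-overflow it must also cope with n == INT_MAX,
 * where i++ wraps to INT_MIN and the loop never terminates. */
void scale(int *a, int n)
{
	int i;

	for (i = 0; i <= n; i++)
		a[i] *= 2;
}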

Thanx, Paul



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-01 Thread Paul E. McKenney
On Thu, Nov 01, 2018 at 10:38:34PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 01:29:10PM -0700, Paul E. McKenney wrote:
> > On Thu, Nov 01, 2018 at 06:27:39PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 01, 2018 at 06:14:32PM +0100, Peter Zijlstra wrote:
> > > > > This reminds me of this so silly patch :/
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adb03115f4590baa280ddc440a8eff08a6be0cb7
> > > 
> > > You'd probably want to write it like so; +- some ordering stuff, that
> > > code didn't look like it really needs the memory barriers implied by
> > > these, but I didn't look too hard.
> > 
> > The atomic_fetch_add() API would need to be propagated out to the other
> > architectures, correct?
> 
> Like these commits I did like 2 years ago ? :-)

Color me blind and stupid!  ;-)

Thanx, Paul

> $ git log --oneline 6dc25876cdb1...1f51dee7ca74
> 6dc25876cdb1 locking/atomic, arch/xtensa: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> a8bcccaba162 locking/atomic, arch/x86: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> 1af5de9af138 locking/atomic, arch/tile: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> 3a1adb23a52c locking/atomic, arch/sparc: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> 7d9794e75237 locking/atomic, arch/sh: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> 56fefbbc3f13 locking/atomic, arch/s390: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> a28cc7bbe8e3 locking/atomic, arch/powerpc: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}{,_relaxed,_acquire,_release}()
> e5857a6ed600 locking/atomic, arch/parisc: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> f8d638e28d7c locking/atomic, arch/mn10300: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> 4edac529eb62 locking/atomic, arch/mips: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> e898eb27ffd8 locking/atomic, arch/metag: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> e39d88ea3ce4 locking/atomic, arch/m68k: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> f64937052303 locking/atomic, arch/m32r: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> cc102507fac7 locking/atomic, arch/ia64: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> 4be7dd393515 locking/atomic, arch/hexagon: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> 0c074cbc3309 locking/atomic, arch/h8300: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> d9c730281617 locking/atomic, arch/frv: Implement 
> atomic{,64}_fetch_{add,sub,and,or,xor}()
> e87fc0ec0705 locking/atomic, arch/blackfin: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> 1a6eafacd481 locking/atomic, arch/avr32: Implement 
> atomic_fetch_{add,sub,and,or,xor}()
> 2efe95fe6952 locking/atomic, arch/arm64: Implement 
> atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() 
> for LSE instructions
> 6822a84dd4e3 locking/atomic, arch/arm64: Generate LSE non-return cases using 
> common macros
> e490f9b1d3b4 locking/atomic, arch/arm64: Implement 
> atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
> 6da068c1beba locking/atomic, arch/arm: Implement 
> atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
> fbffe892e525 locking/atomic, arch/arc: Implement 
> atomic_fetch_{add,sub,and,andnot,or,xor}()
> 



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-01 Thread Paul E. McKenney
On Thu, Nov 01, 2018 at 06:27:39PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 06:14:32PM +0100, Peter Zijlstra wrote:
> > > This reminds me of this so silly patch :/
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adb03115f4590baa280ddc440a8eff08a6be0cb7
> 
> You'd probably want to write it like so; +- some ordering stuff, that
> code didn't look like it really needs the memory barriers implied by
> these, but I didn't look too hard.

The atomic_fetch_add() API would need to be propagated out to the other
architectures, correct?

Thanx, Paul

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index c0a9d26c06ce..11deb1d7e96b 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -485,16 +485,10 @@ u32 ip_idents_reserve(u32 hash, int segs)
>   u32 now = (u32)jiffies;
>   u32 new, delta = 0;
> 
> - if (old != now && cmpxchg(p_tstamp, old, now) == old)
> + if (old != now && try_cmpxchg(p_tstamp, &old, now))
>   delta = prandom_u32_max(now - old);
> 
> - /* Do not use atomic_add_return() as it makes UBSAN unhappy */
> - do {
> - old = (u32)atomic_read(p_id);
> - new = old + delta + segs;
> - } while (atomic_cmpxchg(p_id, old, new) != old);
> -
> - return new - segs;
> + return atomic_fetch_add(segs + delta, p_id) + delta;
>  }
>  EXPORT_SYMBOL(ip_idents_reserve);
> 
> 



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-01 Thread Paul E. McKenney
On Thu, Nov 01, 2018 at 06:14:32PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 09:59:38AM -0700, Eric Dumazet wrote:
> > On 11/01/2018 09:32 AM, Peter Zijlstra wrote:
> > 
> > >> Anyhow, if the atomic maintainers are willing to stand up and state for
> > >> the record that the atomic counters are guaranteed to wrap modulo 2^n
> > >> just like unsigned integers, then I'm happy to take Paul's patch.
> > > 
> > > I myself am certainly relying on it.
> > 
> > Could we get uatomic_t support maybe ?
> 
> Whatever for; it'd be the exact identical same functions as for
> atomic_t, except for a giant amount of code duplication to deal with the
> new type.
> 
> That is; today we merged a bunch of scripts that generates most of
> atomic*_t, so we could probably script uatomic*_t wrappers with minimal
> effort, but it would add several thousand lines of code to each compile
> for absolutely no reason what so ever.
> 
> > This reminds me of this so silly patch :/
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adb03115f4590baa280ddc440a8eff08a6be0cb7
> 
> Yes, that's stupid. UBSAN is just wrong there.

It would be good for UBSAN to treat atomic operations as guaranteed
2s complement with no UB for signed integer overflow.  After all, if
even the C standard is willing to do this...

Ah, but don't we disable interrupts and fall back to normal arithmetic
for UP systems?  Hmmm...  We do so for atomic_add_return() even on
x86, it turns out:

static __always_inline int arch_atomic_add_return(int i, atomic_t *v)
{
return i + xadd(&v->counter, i);
}

So UBSAN actually did have a point.  :-(
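
For reference, the UP fallback alluded to above is even more directly
plain C arithmetic; roughly (a paraphrase of the asm-generic style from
memory -- the name and exact form here are illustrative, not a verbatim
copy):

static inline int atomic_add_return_up(int i, atomic_t *v)
{
	unsigned long flags;
	int ret;

	raw_local_irq_save(flags);	/* UP: exclude interrupts rather than using a locked insn */
	ret = (v->counter += i);	/* plain signed add: wraps under -fno-strict-overflow,
					 * but is exactly what UBSAN complains about */
	raw_local_irq_restore(flags);

	return ret;
}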

Thanx, Paul



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-01 Thread Paul E. McKenney
On Thu, Nov 01, 2018 at 06:18:46PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 10:01:46AM -0700, Paul E. McKenney wrote:
> > On Thu, Nov 01, 2018 at 05:32:12PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 01, 2018 at 03:22:15PM +, Trond Myklebust wrote:
> > > > On Thu, 2018-11-01 at 15:59 +0100, Peter Zijlstra wrote:
> > > > > On Thu, Nov 01, 2018 at 01:18:46PM +, Mark Rutland wrote:
> > > 
> > > > > > > My one question (and the reason why I went with cmpxchg() in the
> > > > > > > first place) would be about the overflow behaviour for
> > > > > > > atomic_fetch_inc() and friends. I believe those functions should
> > > > > > > be OK on x86, so that when we overflow the counter, it behaves
> > > > > > > like an unsigned value and wraps back around.  Is that the case
> > > > > > > for all architectures?
> > > > > > > 
> > > > > > > i.e. are atomic_t/atomic64_t always guaranteed to behave like
> > > > > > > u32/u64 on increment?
> > > > > > > 
> > > > > > > I could not find any documentation that explicitly stated that
> > > > > > > they should.
> > > > > > 
> > > > > > Peter, Will, I understand that the atomic_t/atomic64_t ops are
> > > > > > required to wrap per 2's-complement. IIUC the refcount code relies
> > > > > > on this.
> > > > > > 
> > > > > > Can you confirm?
> > > > > 
> > > > > There is quite a bit of core code that hard assumes 2s-complement.
> > > > > Not only for atomics but for any signed integer type. Also see the
> > > > > kernel using -fno-strict-overflow which implies -fwrapv, which
> > > > > defines signed overflow to behave like 2s-complement (and rids us of
> > > > > that particular UB).
> > > > 
> > > > Fair enough, but there have also been bugfixes to explicitly fix unsafe
> > > > C standards assumptions for signed integers. See, for instance commit
> > > > 5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
> > > > from Paul McKenney.
> > > 
> > > Yes, I feel Paul has been to too many C/C++ committee meetings and got
> > > properly paranoid. Which isn't always a bad thing :-)
> > 
> > Even the C standard defines 2s complement for atomics.  
> 
> Ooh good to know.

Must be some mistake, right?  ;-)

> > Just not for
> > normal arithmetic, where yes, signed overflow is UB.  And yes, I do
> > know about -fwrapv, but I would like to avoid at least some copy-pasta
> > UB from my kernel code to who knows what user-mode environment.  :-/
> > 
> > At least where it is reasonably easy to do so.
> 
> Fair enough I suppose; I just always make sure to include the same
> -fknobs for the userspace thing when I lift code.

Agreed!  But when it is other people lifting the code...

> > And there is a push to define C++ signed arithmetic as 2s complement,
> > but there are still 1s complement systems with C compilers.  Just not
> > C++ compilers.  Legacy...
> 
> *groan*; how about those ancient hardwares keep using ancient compilers
> and we all move on to the 70s :-)

Hey!!!  Some of that 70s (and 60s!) 1s-complement hardware helped pay
my way through university the first time around!!!  ;-)

Though where it once filled a room it is now on a single small chip.
Go figure...

> > > But for us using -fno-strict-overflow which actually defines signed
> > > overflow, I myself am really not worried. I'm also not sure if KASAN has
> > > been taught about this, or if it will still (incorrectly) warn about UB
> > > for signed types.
> > 
> > UBSAN gave me a signed-overflow warning a few days ago.  Which I have
> > fixed, even though 2s complement did the right thing.  I am also taking
> > advantage of the change to use better naming.
> 
> Oh too many *SANs I suppose; and yes, if you can make the code better,
> why not.

Yeah, when INT_MIN was confined to a single function, no problem.
But thanks to the RCU flavor consolidation, it has to be spread out a
bit more...  Plus there is now INT_MAX, INT_MAX/2, ...

> > > > Anyhow, if the atomic maintainers are willing to stand up and state for
> > > > the record that the atomic counters are guaranteed to wrap modulo 2^n
> > > > just like unsigned integers, then I'm happy to take Paul's patch.
> > > 
> > > I myself am certainly relying on it.
> > 
> > Color me confused.  My 5a581b367b5d is from 2013.  Or is "Paul" instead
> > intended to mean Paul Mackerras, who happens to be on CC?
> 
> Paul Burton I think, on a part of the thread before we joined :-)

Couldn't be bothered to look up the earlier part of the thread.  Getting
lazy in my old age.  ;-)

Thanx, Paul



Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-01 Thread Paul E. McKenney
On Thu, Nov 01, 2018 at 05:32:12PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 03:22:15PM +, Trond Myklebust wrote:
> > On Thu, 2018-11-01 at 15:59 +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 01, 2018 at 01:18:46PM +, Mark Rutland wrote:
> 
> > > > > My one question (and the reason why I went with cmpxchg() in the
> > > > > first place) would be about the overflow behaviour for
> > > > > atomic_fetch_inc() and friends. I believe those functions should
> > > > > be OK on x86, so that when we overflow the counter, it behaves
> > > > > like an unsigned value and wraps back around.  Is that the case
> > > > > for all architectures?
> > > > > 
> > > > > i.e. are atomic_t/atomic64_t always guaranteed to behave like
> > > > > u32/u64 on increment?
> > > > > 
> > > > > I could not find any documentation that explicitly stated that
> > > > > they should.
> > > > 
> > > > Peter, Will, I understand that the atomic_t/atomic64_t ops are
> > > > required to wrap per 2's-complement. IIUC the refcount code relies
> > > > on this.
> > > > 
> > > > Can you confirm?
> > > 
> > > There is quite a bit of core code that hard assumes 2s-complement.
> > > Not only for atomics but for any signed integer type. Also see the
> > > kernel using -fno-strict-overflow which implies -fwrapv, which
> > > defines signed overflow to behave like 2s-complement (and rids us of
> > > that particular UB).
> > 
> > Fair enough, but there have also been bugfixes to explicitly fix unsafe
> > C standards assumptions for signed integers. See, for instance commit
> > 5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
> > from Paul McKenney.
> 
> Yes, I feel Paul has been to too many C/C++ committee meetings and got
> properly paranoid. Which isn't always a bad thing :-)

Even the C standard defines 2s complement for atomics.  Just not for
normal arithmetic, where yes, signed overflow is UB.  And yes, I do
know about -fwrapv, but I would like to avoid at least some copy-pasta
UB from my kernel code to who knows what user-mode environment.  :-/

At least where it is reasonably easy to do so.
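
For a userspace rendering of that distinction (a small sketch; any C11
compiler will do), the atomic add is defined to wrap, while the plain
one is UB unless -fwrapv or -fno-strict-overflow defines it:

#include <limits.h>
#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
	atomic_int a = INT_MAX;
	int plain = INT_MAX;

	atomic_fetch_add(&a, 1);	/* defined: wraps to INT_MIN per C11 */
	plain = plain + 1;		/* UB without -fwrapv / -fno-strict-overflow */

	printf("atomic: %d  plain: %d\n", atomic_load(&a), plain);
	return 0;
}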

And there is a push to define C++ signed arithmetic as 2s complement,
but there are still 1s complement systems with C compilers.  Just not
C++ compilers.  Legacy...

> But for us using -fno-strict-overflow which actually defines signed
> overflow, I myself am really not worried. I'm also not sure if KASAN has
> been taught about this, or if it will still (incorrectly) warn about UB
> for signed types.

UBSAN gave me a signed-overflow warning a few days ago.  Which I have
fixed, even though 2s complement did the right thing.  I am also taking
advantage of the change to use better naming.

> > Anyhow, if the atomic maintainers are willing to stand up and state for
> > the record that the atomic counters are guaranteed to wrap modulo 2^n
> > just like unsigned integers, then I'm happy to take Paul's patch.
> 
> I myself am certainly relying on it.

Color me confused.  My 5a581b367b5d is from 2013.  Or is "Paul" instead
intended to mean Paul Mackerras, who happens to be on CC?

Thanx, Paul



Re: [PATCH] [RFC v2] Drop all 00-INDEX files from Documentation/

2018-09-04 Thread Paul E. McKenney
On Tue, Sep 04, 2018 at 12:15:23AM +0200, Henrik Austad wrote:
> This is a respin with a wider audience (all that get_maintainer returned)
> and I know this spams a *lot* of people. Not sure what would be the correct
> way, so my apologies for ruining your inbox.
> 
> The 00-INDEX files are supposed to give a summary of all files present
> in a directory, but these files are horribly out of date and their
> usefulness is brought into question. Often a simple "ls" would reveal
> the same information as the filenames are generally quite descriptive as
> a short introduction to what the file covers (it should not surprise
> anyone what Documentation/sched/sched-design-CFS.txt covers)
> 
> A few years back it was mentioned that these files were no longer really
> needed, and they have since then grown further out of date, so perhaps
> it is time to just throw them out.
> 
> A short status yields the following _outdated_ 00-INDEX files; the first
> counter is files listed in 00-INDEX but missing from the directory, the last
> is files present but not listed in 00-INDEX.
> 
> List of outdated 00-INDEX:
> Documentation: (4/10)
> Documentation/sysctl: (0/1)
> Documentation/timers: (1/0)
> Documentation/blockdev: (3/1)
> Documentation/w1/slaves: (0/1)
> Documentation/locking: (0/1)
> Documentation/devicetree: (0/5)
> Documentation/power: (1/1)
> Documentation/powerpc: (0/5)
> Documentation/arm: (1/0)
> Documentation/x86: (0/9)
> Documentation/x86/x86_64: (1/1)
> Documentation/scsi: (4/4)
> Documentation/filesystems: (2/9)
> Documentation/filesystems/nfs: (0/2)
> Documentation/cgroup-v1: (0/2)
> Documentation/kbuild: (0/4)
> Documentation/spi: (1/0)
> Documentation/virtual/kvm: (1/0)
> Documentation/scheduler: (0/2)
> Documentation/fb: (0/1)
> Documentation/block: (0/1)
> Documentation/networking: (6/37)
> Documentation/vm: (1/3)
> 
> Then there are 364 subdirectories in Documentation/ with several files that
> are missing 00-INDEX altogether (and another 120 with a single file and no
> 00-INDEX).
> 
> I don't really have an opinion on whether or not we /should/ have 00-INDEX,
> but the above 00-INDEX should either be removed or be kept up to date. If
> we should keep the files, I can try to keep them updated, but I would rather not
> if we just want to delete them anyway.
> 
> As a starting point, remove all index-files and references to 00-INDEX and
> see where the discussion is going.

For the RCU portions:

Acked-by: Paul E. McKenney 

> Again, sorry for the insanely wide distribution.
> 
> Signed-off-by: Henrik Austad 
> Cc: Jonathan Corbet 
> Cc: Bjorn Helgaas 
> Cc: "Paul E. McKenney" 
> Cc: Josh Triplett 
> Cc: Steven Rostedt 
> Cc: Mathieu Desnoyers 
> Cc: Lai Jiangshan 
> Cc: Jens Axboe 
> Cc: Rob Herring 
> Cc: Mark Rutland 
> Cc: Bartlomiej Zolnierkiewicz 
> Cc: Linus Walleij 
> Cc: "David S. Miller" 
> Cc: Karsten Keil 
> Cc: Masahiro Yamada 
> Cc: Michal Marek 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Will Deacon 
> Cc: Ralf Baechle 
> Cc: Paul Burton 
> Cc: James Hogan 
> Cc: Paul Moore 
> Cc: "James E.J. Bottomley" 
> Cc: Helge Deller 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Pavel Machek 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Greg Kroah-Hartman 
> Cc: Jiri Slaby 
> Cc: Mark Brown 
> Cc: Thomas Gleixner 
> Cc: Paolo Bonzini 
> Cc: "Radim Krčmář" 
> Cc: Evgeniy Polyakov 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Henrik Austad 
> Cc: Andrew Morton 
> Cc: Ian Kent 
> Cc: Jacek Anaszewski 
> Cc: Mike Rapoport 
> Cc: Jan Kandziora 
> Cc: linux-...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: devicet...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: linux-fb...@vger.kernel.org
> Cc: linux-g...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: net...@vger.kernel.org
> Cc: linux-kbu...@vger.kernel.org
> Cc: linux-m...@linux-mips.org
> Cc: linux-security-mod...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: k...@vger.kernel.org
> Signed-off-by: Henrik Austad 
> ---
>  Documentation/00-INDEX  | 428 
> 
>  Documentation/PCI/00-INDEX  |  26 --
>  Documentation/RCU/00-INDEX  |  34 ---
>  Documentation/RCU/rcu.txt   |   4 -
>  Documentation/admin-guide/README.rst|  

Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-23 Thread Paul E. McKenney
On Wed, May 23, 2018 at 04:14:39PM -0400, Mathieu Desnoyers wrote:
> - On May 20, 2018, at 10:08 AM, Boqun Feng boqun.f...@gmail.com wrote:
> 
> > On Fri, May 18, 2018 at 02:17:17PM -0400, Mathieu Desnoyers wrote:
> >> - On May 17, 2018, at 7:50 PM, Boqun Feng boqun.f...@gmail.com wrote:
> >> [...]
> >> >> > I think you're right. So we have to introduce callsite to 
> >> >> > rseq_syscall()
> >> >> > in syscall path, something like:
> >> >> > 
> >> >> > diff --git a/arch/powerpc/kernel/entry_64.S 
> >> >> > b/arch/powerpc/kernel/entry_64.S
> >> >> > index 51695608c68b..a25734a96640 100644
> >> >> > --- a/arch/powerpc/kernel/entry_64.S
> >> >> > +++ b/arch/powerpc/kernel/entry_64.S
> >> >> > @@ -222,6 +222,9 @@ system_call_exit:
> >> >> >   mtmsrd  r11,1
> >> >> > #endif /* CONFIG_PPC_BOOK3E */
> >> >> > 
> >> >> > + addir3,r1,STACK_FRAME_OVERHEAD
> >> >> > + bl  rseq_syscall
> >> >> > +
> >> >> >   ld  r9,TI_FLAGS(r12)
> >> >> >   li  r11,-MAX_ERRNO
> >> >> >   andi.
> >> >> >   
> >> >> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> >> >> > 
> >> 
> >> By the way, I think this is not the right spot to call rseq_syscall, 
> >> because
> >> interrupts are disabled. I think we should move this hunk right after
> >> system_call_exit.
> >> 
> > 
> > Good point.
> > 
> >> Would you like to implement and test an updated patch adding those calls 
> >> for ppc
> >> 32 and 64 ?
> >> 
> > 
> > I'd like to help, but I don't have a handy ppc environment for test...
> > So I made the below patch which has only been build-tested, hope it
> > could be somewhat helpful.
> 
> Hi Boqun,
> 
> I tried your patch in a ppc64 le environment, and it does not survive boot
> with CONFIG_DEBUG_RSEQ=y. init gets killed right away.
> 
> Moreover, I'm not sure that the r3 register doesn't contain something worth
> saving before the call on ppc32. Just after there is a "mr" instruction
> which AFAIU takes r3 as input register.
> 
> Can you look into it ?

Hello, Boqun,

You can also request access to a ppc64 environment here:

http://osuosl.org/services/powerdev/request_hosting/

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Regards,
> > Boqun
> > 
> > ->8
> > Subject: [PATCH] powerpc: Add syscall detection for restartable sequences
> > 
> > Syscalls are not allowed inside restartable sequences, so add a call to
> > rseq_syscall() at the very beginning of the system call exit path for
> > CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
> > is a syscall issued inside restartable sequences.
> > 
> > Signed-off-by: Boqun Feng 
> > ---
> > arch/powerpc/kernel/entry_32.S | 5 +
> > arch/powerpc/kernel/entry_64.S | 5 +
> > 2 files changed, 10 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> > index eb8d01bae8c6..2f134eebe7ed 100644
> > --- a/arch/powerpc/kernel/entry_32.S
> > +++ b/arch/powerpc/kernel/entry_32.S
> > @@ -365,6 +365,11 @@ syscall_dotrace_cont:
> > blrl/* Call handler */
> > .globl  ret_from_syscall
> > ret_from_syscall:
> > +#ifdef CONFIG_DEBUG_RSEQ
> > +   /* Check whether the syscall is issued inside a restartable sequence */
> > +   addir3,r1,STACK_FRAME_OVERHEAD
> > +   bl  rseq_syscall
> > +#endif
> > mr  r6,r3
> > CURRENT_THREAD_INFO(r12, r1)
> > /* disable interrupts so current_thread_info()->flags can't change */
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 2cb5109a7ea3..2e2d59bb45d0 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -204,6 +204,11 @@ system_call:   /* label this so stack 
> > traces look sane */
> >  * This is blacklisted from kprobes further below with 
> > _ASM_NOKPROBE_SYMBOL().
> >  */
> > system_call_exit:
> > +#ifdef CONFIG_DEBUG_RSEQ
> > +   /* Check whether the syscall is issued inside a restartable sequence */
> > +   addir3,r1,STACK_FRAME_OVERHEAD
> > +   bl  rseq_syscall
> > +#endif
> > /*
> >  * Disable interrupts so current_thread_info()->flags can't change,
> >  * and so that we don't get interrupted after loading SRR0/1.
> > --
> > 2.16.2
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 



Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-04-01 Thread Paul E. McKenney
On Sun, Apr 01, 2018 at 02:11:08PM +0300, Yury Norov wrote:
> On Tue, Mar 27, 2018 at 11:21:17AM +0100, Will Deacon wrote:
> > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> > > kick_all_cpus_sync() forces all CPUs to sync caches by sending broadcast 
> > > IPI.
> > > If CPU is in extended quiescent state (idle task or nohz_full userspace), 
> > > this
> > > work may be done at the exit of this state. Delaying synchronization 
> > > helps to
> > > save power if CPU is in idle state and decrease latency for real-time 
> > > tasks.
> > > 
> > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab and 
> > > arm64
> > > code to delay synchronization.
> > > 
> > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the CPU 
> > > running
> > > isolated task would be fatal, as it breaks isolation. The approach with 
> > > delaying
> > > of synchronization work helps to maintain isolated state.
> > > 
> > > I've tested it with test from task isolation series on ThunderX2 for more 
> > > than
> > > 10 hours (10k giga-ticks) without breaking isolation.
> > > 
> > > Signed-off-by: Yury Norov 
> > > ---
> > >  arch/arm64/kernel/insn.c |  2 +-
> > >  include/linux/smp.h  |  2 ++
> > >  kernel/smp.c | 24 
> > >  mm/slab.c|  2 +-
> > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > > index 2718a77da165..9d7c492e920e 100644
> > > --- a/arch/arm64/kernel/insn.c
> > > +++ b/arch/arm64/kernel/insn.c
> > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void *addrs[], 
> > > u32 insns[], int cnt)
> > >* synchronization.
> > >*/
> > >   ret = aarch64_insn_patch_text_nosync(addrs[0], 
> > > insns[0]);
> > > - kick_all_cpus_sync();
> > > + kick_active_cpus_sync();
> > >   return ret;
> > >   }
> > >   }
> > 
> > I think this means that runtime modifications to the kernel text might not
> > be picked up by CPUs coming out of idle. Shouldn't we add an ISB on that
> > path to avoid executing stale instructions?
> > 
> > Will
> 
> commit 153ae9d5667e7baab4d48c48e8ec30fbcbd86f1e
> Author: Yury Norov 
> Date:   Sat Mar 31 15:05:23 2018 +0300
> 
> Hi Will, Paul,
> 
> On my system there are 3 paths that go thru rcu_dynticks_eqs_exit(),
> and so require isb().
> 
> First path starts at gic_handle_irq() on secondary_start_kernel stack.
> gic_handle_irq() already issues isb(), and so we can do nothing.
> 
> Second path starts at el0_svc entry; and third path is the exit from
> do_idle() on secondary_start_kernel stack.
> 
> For the do_idle() path there is an arch_cpu_idle_exit() hook that is not used by
> arm64 right now, so I picked it. And for el0_svc, I've introduced an isb_if_eqs
> macro and call it at the beginning of el0_svc_naked.
> 
> I've tested it on ThunderX2 machine, and it works for me.
> 
> Below are my call traces and a patch for them. If you are OK with it, I think I'm
> ready to submit v2 (but maybe split this patch for better readability).

I must defer to Will on this one.

Thanx, Paul

> Yury
> 
> [  585.412095] Call trace:
> [  585.412097] [] dump_backtrace+0x0/0x380
> [  585.412099] [] show_stack+0x14/0x20
> [  585.412101] [] dump_stack+0x98/0xbc
> [  585.412104] [] rcu_dynticks_eqs_exit+0x68/0x70
> [  585.412105] [] rcu_irq_enter+0x48/0x50
> [  585.412106] [] irq_enter+0xc/0x70
> [  585.412108] [] __handle_domain_irq+0x3c/0x120
> [  585.412109] [] gic_handle_irq+0xc4/0x180
> [  585.412110] Exception stack(0xfc001130fe20 to 0xfc001130ff60)
> [  585.412112] fe20: 00a0  0001 
> 
> [  585.412113] fe40: 028f6f0b 0020 0013cd6f53963b31 
> 
> [  585.412144] fe60: 0002 fc001130fed0 0b80 
> 3400
> [  585.412146] fe80:  0001  
> 01db
> [  585.412147] fea0: fc0008247a78 03ff86dc61f8 0014 
> fc0008fc
> [  585.412149] fec0: fc00090143e8 fc0009014000 fc0008fc94a0 
> 
> [  585.412150] fee0:  fe8f46bb1700  
> 
> [  585.412152] ff00:  fc001130ff60 fc0008085034 
> fc001130ff60
> [  585.412153] ff20: fc0008085038 00400149 fc0009014000 
> fc0008fc94a0
> [  585.412155] ff40:   fc001130ff60 
> fc0008085038
> [  585.412156] [] el1_irq+0xb0/0x124
> [  585.412158] [] arch_cpu_idle+0x10/0x18
> [  585.412159] [] do_idle+0x10c/0x1d8
> [  585.412160] [] cpu_startup_entry+0x24/0x28
> [  585.412162] [] secondary_start_kernel+0x15c/0x1a0
> [  585.412164] CPU: 1 PID: 0 

Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-03-28 Thread Paul E. McKenney
On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote:
> On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote:
> > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote:
> > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote:
> > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending 
> > > > > broadcast IPI.
> > > > > If CPU is in extended quiescent state (idle task or nohz_full 
> > > > > userspace), this
> > > > > work may be done at the exit of this state. Delaying synchronization 
> > > > > helps to
> > > > > save power if CPU is in idle state and decrease latency for real-time 
> > > > > tasks.
> > > > > 
> > > > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab 
> > > > > and arm64
> > > > > code to delay synchronization.
> > > > > 
> > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the 
> > > > > CPU running
> > > > > isolated task would be fatal, as it breaks isolation. The approach 
> > > > > with delaying
> > > > > of synchronization work helps to maintain isolated state.
> > > > > 
> > > > > I've tested it with test from task isolation series on ThunderX2 for 
> > > > > more than
> > > > > 10 hours (10k giga-ticks) without breaking isolation.
> > > > > 
> > > > > Signed-off-by: Yury Norov <yno...@caviumnetworks.com>
> > > > > ---
> > > > >  arch/arm64/kernel/insn.c |  2 +-
> > > > >  include/linux/smp.h  |  2 ++
> > > > >  kernel/smp.c | 24 
> > > > >  mm/slab.c|  2 +-
> > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > > > > index 2718a77da165..9d7c492e920e 100644
> > > > > --- a/arch/arm64/kernel/insn.c
> > > > > +++ b/arch/arm64/kernel/insn.c
> > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void 
> > > > > *addrs[], u32 insns[], int cnt)
> > > > >* synchronization.
> > > > >*/
> > > > >   ret = aarch64_insn_patch_text_nosync(addrs[0], 
> > > > > insns[0]);
> > > > > - kick_all_cpus_sync();
> > > > > + kick_active_cpus_sync();
> > > > >   return ret;
> > > > >   }
> > > > >   }
> > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h
> > > > > index 9fb239e12b82..27215e22240d 100644
> > > > > --- a/include/linux/smp.h
> > > > > +++ b/include/linux/smp.h
> > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask 
> > > > > *mask,
> > > > > smp_call_func_t func, void *info, int wait);
> > > > > 
> > > > >  void kick_all_cpus_sync(void);
> > > > > +void kick_active_cpus_sync(void);
> > > > >  void wake_up_all_idle_cpus(void);
> > > > > 
> > > > >  /*
> > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, 
> > > > > smp_call_func_t func,
> > > > >  }
> > > > > 
> > > > >  static inline void kick_all_cpus_sync(void) {  }
> > > > > +static inline void kick_active_cpus_sync(void) {  }
> > > > >  static inline void wake_up_all_idle_cpus(void) {  }
> > > > > 
> > > > >  #ifdef CONFIG_UP_LATE_INIT
> > > > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > > > index 084c8b3a2681..0358d6673850 100644
> > > > > --- a/kernel/smp.c
> > > > > +++ b/kernel/smp.c
> > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
> > > > > 
> > > > > +/**
> > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended
> > > > > + * quiescent state (idle or nohz_full userspace) sync

Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-03-28 Thread Paul E. McKenney
On Wed, Mar 28, 2018 at 05:41:40PM +0300, Yury Norov wrote:
> On Wed, Mar 28, 2018 at 06:56:17AM -0700, Paul E. McKenney wrote:
> > On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote:
> > > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote:
> > > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote:
> > > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> > > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending 
> > > > > > > broadcast IPI.
> > > > > > > If CPU is in extended quiescent state (idle task or nohz_full 
> > > > > > > userspace), this
> > > > > > > work may be done at the exit of this state. Delaying 
> > > > > > > synchronization helps to
> > > > > > > save power if CPU is in idle state and decrease latency for 
> > > > > > > real-time tasks.
> > > > > > > 
> > > > > > > This patch introduces kick_active_cpus_sync() and uses it in 
> > > > > > > mm/slab and arm64
> > > > > > > code to delay synchronization.
> > > > > > > 
> > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to 
> > > > > > > the CPU running
> > > > > > > isolated task would be fatal, as it breaks isolation. The 
> > > > > > > approach with delaying
> > > > > > > of synchronization work helps to maintain isolated state.
> > > > > > > 
> > > > > > > I've tested it with test from task isolation series on ThunderX2 
> > > > > > > for more than
> > > > > > > 10 hours (10k giga-ticks) without breaking isolation.
> > > > > > > 
> > > > > > > Signed-off-by: Yury Norov <yno...@caviumnetworks.com>
> > > > > > > ---
> > > > > > >  arch/arm64/kernel/insn.c |  2 +-
> > > > > > >  include/linux/smp.h  |  2 ++
> > > > > > >  kernel/smp.c | 24 
> > > > > > >  mm/slab.c|  2 +-
> > > > > > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > > > > > > index 2718a77da165..9d7c492e920e 100644
> > > > > > > --- a/arch/arm64/kernel/insn.c
> > > > > > > +++ b/arch/arm64/kernel/insn.c
> > > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void 
> > > > > > > *addrs[], u32 insns[], int cnt)
> > > > > > >* synchronization.
> > > > > > >*/
> > > > > > >   ret = aarch64_insn_patch_text_nosync(addrs[0], 
> > > > > > > insns[0]);
> > > > > > > - kick_all_cpus_sync();
> > > > > > > + kick_active_cpus_sync();
> > > > > > >   return ret;
> > > > > > >   }
> > > > > > >   }
> > > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h
> > > > > > > index 9fb239e12b82..27215e22240d 100644
> > > > > > > --- a/include/linux/smp.h
> > > > > > > +++ b/include/linux/smp.h
> > > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct 
> > > > > > > cpumask *mask,
> > > > > > > smp_call_func_t func, void *info, int wait);
> > > > > > > 
> > > > > > >  void kick_all_cpus_sync(void);
> > > > > > > +void kick_active_cpus_sync(void);
> > > > > > >  void wake_up_all_idle_cpus(void);
> > > > > > > 
> > > > > > >  /*
> > > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask 
> > > > > > > *mask, smp_call_func_t func,
> > > > > > >  }
> > > > > > > 
> > > > > > >  static inline void kick_all_cpus_sync(void) {  }
> > > > > > > +static inline void kick_active_cpus

Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-03-26 Thread Paul E. McKenney
On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote:
> On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote:
> > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> > > kick_all_cpus_sync() forces all CPUs to sync caches by sending broadcast 
> > > IPI.
> > > If CPU is in extended quiescent state (idle task or nohz_full userspace), 
> > > this
> > > work may be done at the exit of this state. Delaying synchronization 
> > > helps to
> > > save power if CPU is in idle state and decrease latency for real-time 
> > > tasks.
> > > 
> > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab and 
> > > arm64
> > > code to delay synchronization.
> > > 
> > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the CPU 
> > > running
> > > isolated task would be fatal, as it breaks isolation. The approach with 
> > > delaying
> > > of synchronization work helps to maintain isolated state.
> > > 
> > > I've tested it with test from task isolation series on ThunderX2 for more 
> > > than
> > > 10 hours (10k giga-ticks) without breaking isolation.
> > > 
> > > Signed-off-by: Yury Norov <yno...@caviumnetworks.com>
> > > ---
> > >  arch/arm64/kernel/insn.c |  2 +-
> > >  include/linux/smp.h  |  2 ++
> > >  kernel/smp.c | 24 
> > >  mm/slab.c|  2 +-
> > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > > index 2718a77da165..9d7c492e920e 100644
> > > --- a/arch/arm64/kernel/insn.c
> > > +++ b/arch/arm64/kernel/insn.c
> > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void *addrs[], 
> > > u32 insns[], int cnt)
> > >* synchronization.
> > >*/
> > >   ret = aarch64_insn_patch_text_nosync(addrs[0], 
> > > insns[0]);
> > > - kick_all_cpus_sync();
> > > + kick_active_cpus_sync();
> > >   return ret;
> > >   }
> > >   }
> > > diff --git a/include/linux/smp.h b/include/linux/smp.h
> > > index 9fb239e12b82..27215e22240d 100644
> > > --- a/include/linux/smp.h
> > > +++ b/include/linux/smp.h
> > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask *mask,
> > > smp_call_func_t func, void *info, int wait);
> > > 
> > >  void kick_all_cpus_sync(void);
> > > +void kick_active_cpus_sync(void);
> > >  void wake_up_all_idle_cpus(void);
> > > 
> > >  /*
> > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, 
> > > smp_call_func_t func,
> > >  }
> > > 
> > >  static inline void kick_all_cpus_sync(void) {  }
> > > +static inline void kick_active_cpus_sync(void) {  }
> > >  static inline void wake_up_all_idle_cpus(void) {  }
> > > 
> > >  #ifdef CONFIG_UP_LATE_INIT
> > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > index 084c8b3a2681..0358d6673850 100644
> > > --- a/kernel/smp.c
> > > +++ b/kernel/smp.c
> > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
> > > 
> > > +/**
> > > + * kick_active_cpus_sync - Force CPUs that are not in extended
> > > + * quiescent state (idle or nohz_full userspace) sync by sending
> > > + * IPI. Extended quiescent state CPUs will sync at the exit of
> > > + * that state.
> > > + */
> > > +void kick_active_cpus_sync(void)
> > > +{
> > > + int cpu;
> > > + struct cpumask kernel_cpus;
> > > +
> > > + smp_mb();
> > > +
> > > > > + cpumask_clear(&kernel_cpus);
> > > + preempt_disable();
> > > + for_each_online_cpu(cpu) {
> > > + if (!rcu_eqs_special_set(cpu))
> > 
> > If we get here, the CPU is not in a quiescent state, so we therefore
> > must IPI it, correct?
> > 
> > But don't you also need to define rcu_eqs_special_exit() so that RCU
> > can invoke it when it next leaves its quiescent state?  Or are you able
> > to ignore the CPU in that case?  (If you are able to ignore the CPU in
> > that case, I could give you a lower-cost function to get your job done.)
> > 
> > 

Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-03-25 Thread Paul E. McKenney
On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> kick_all_cpus_sync() forces all CPUs to sync caches by sending broadcast IPI.
> If CPU is in extended quiescent state (idle task or nohz_full userspace), this
> work may be done at the exit of this state. Delaying synchronization helps to
> save power if CPU is in idle state and decrease latency for real-time tasks.
> 
> This patch introduces kick_active_cpus_sync() and uses it in mm/slab and arm64
> code to delay synchronization.
> 
> For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the CPU 
> running
> isolated task would be fatal, as it breaks isolation. The approach with 
> delaying
> of synchronization work helps to maintain isolated state.
> 
> I've tested it with test from task isolation series on ThunderX2 for more than
> 10 hours (10k giga-ticks) without breaking isolation.
> 
> Signed-off-by: Yury Norov 
> ---
>  arch/arm64/kernel/insn.c |  2 +-
>  include/linux/smp.h  |  2 ++
>  kernel/smp.c | 24 
>  mm/slab.c|  2 +-
>  4 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> index 2718a77da165..9d7c492e920e 100644
> --- a/arch/arm64/kernel/insn.c
> +++ b/arch/arm64/kernel/insn.c
> @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void *addrs[], u32 
> insns[], int cnt)
>* synchronization.
>*/
>   ret = aarch64_insn_patch_text_nosync(addrs[0], 
> insns[0]);
> - kick_all_cpus_sync();
> + kick_active_cpus_sync();
>   return ret;
>   }
>   }
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 9fb239e12b82..27215e22240d 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask *mask,
> smp_call_func_t func, void *info, int wait);
> 
>  void kick_all_cpus_sync(void);
> +void kick_active_cpus_sync(void);
>  void wake_up_all_idle_cpus(void);
> 
>  /*
> @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, 
> smp_call_func_t func,
>  }
> 
>  static inline void kick_all_cpus_sync(void) {  }
> +static inline void kick_active_cpus_sync(void) {  }
>  static inline void wake_up_all_idle_cpus(void) {  }
> 
>  #ifdef CONFIG_UP_LATE_INIT
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 084c8b3a2681..0358d6673850 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void)
>  }
>  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
> 
> +/**
> + * kick_active_cpus_sync - Force CPUs that are not in extended
> + * quiescent state (idle or nohz_full userspace) sync by sending
> + * IPI. Extended quiescent state CPUs will sync at the exit of
> + * that state.
> + */
> +void kick_active_cpus_sync(void)
> +{
> + int cpu;
> + struct cpumask kernel_cpus;
> +
> + smp_mb();
> +
> + cpumask_clear(&kernel_cpus);
> + preempt_disable();
> + for_each_online_cpu(cpu) {
> + if (!rcu_eqs_special_set(cpu))

If we get here, the CPU is not in a quiescent state, so we therefore
must IPI it, correct?

But don't you also need to define rcu_eqs_special_exit() so that RCU
can invoke it when it next leaves its quiescent state?  Or are you able
to ignore the CPU in that case?  (If you are able to ignore the CPU in
that case, I could give you a lower-cost function to get your job done.)

Thanx, Paul

> + cpumask_set_cpu(cpu, &kernel_cpus);
> + }
> + smp_call_function_many(&kernel_cpus, do_nothing, NULL, 1);
> + preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(kick_active_cpus_sync);
> +
>  /**
>   * wake_up_all_idle_cpus - break all cpus out of idle
>   * wake_up_all_idle_cpus try to break all cpus which is in idle state even
> diff --git a/mm/slab.c b/mm/slab.c
> index 324446621b3e..678d5dbd6f46 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3856,7 +3856,7 @@ static int __do_tune_cpucache(struct kmem_cache 
> *cachep, int limit,
>* cpus, so skip the IPIs.
>*/
>   if (prev)
> - kick_all_cpus_sync();
> + kick_active_cpus_sync();
> 
>   check_irq_on();
>   cachep->batchcount = batchcount;
> -- 
> 2.14.1
> 



Re: [PATCH 1/2] rcu: declare rcu_eqs_special_set() in public header

2018-03-25 Thread Paul E. McKenney
On Sun, Mar 25, 2018 at 08:50:03PM +0300, Yury Norov wrote:
> rcu_eqs_special_set() is declared only in internal header
> kernel/rcu/tree.h and stubbed in include/linux/rcutiny.h.
> 
> This patch declares rcu_eqs_special_set() in include/linux/rcutree.h, so
> it can be used in non-rcu kernel code.
> 
> Signed-off-by: Yury Norov <yno...@caviumnetworks.com>
> ---
>  include/linux/rcutree.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index fd996cdf1833..448f20f27396 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -74,6 +74,7 @@ static inline void synchronize_rcu_bh_expedited(void)
>  void rcu_barrier(void);
>  void rcu_barrier_bh(void);
>  void rcu_barrier_sched(void);
> +bool rcu_eqs_special_set(int cpu);
>  unsigned long get_state_synchronize_rcu(void);
>  void cond_synchronize_rcu(unsigned long oldstate);
>  unsigned long get_state_synchronize_sched(void);

Good point, a bit hard to use otherwise.  ;-)

I removed the declaration from rcutree.h and updated the commit log as
follows.  Does it look OK?

Thanx, Paul



commit 4497105b718a819072d48a675916d9d200b5327f
Author: Yury Norov <yno...@caviumnetworks.com>
Date:   Sun Mar 25 20:50:03 2018 +0300

rcu: Declare rcu_eqs_special_set() in public header

Because rcu_eqs_special_set() is declared only in internal header
kernel/rcu/tree.h and stubbed in include/linux/rcutiny.h, it is
inaccessible outside of the RCU implementation.  This patch therefore
moves the  rcu_eqs_special_set() declaration to include/linux/rcutree.h,
which allows it to be used in non-rcu kernel code.
    
    Signed-off-by: Yury Norov <yno...@caviumnetworks.com>
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index fd996cdf1833..448f20f27396 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -74,6 +74,7 @@ static inline void synchronize_rcu_bh_expedited(void)
 void rcu_barrier(void);
 void rcu_barrier_bh(void);
 void rcu_barrier_sched(void);
+bool rcu_eqs_special_set(int cpu);
 unsigned long get_state_synchronize_rcu(void);
 void cond_synchronize_rcu(unsigned long oldstate);
 unsigned long get_state_synchronize_sched(void);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 59ad0e23c722..d5f617aaa744 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -415,7 +415,6 @@ extern struct rcu_state rcu_preempt_state;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 
 int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
-bool rcu_eqs_special_set(int cpu);
 
 #ifdef CONFIG_RCU_BOOST
 DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);



[PATCH tip/sched/membarrier 5/5] Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on powerpc

2017-10-13 Thread Paul E. McKenney
From: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Threads targeting the same VM but belonging to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_SWITCH_MM flag, issue synchronize_sched(), and only
then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows private
expedited membarrier commands to succeed. membarrier_arch_switch_mm()
now tests for the MEMBARRIER_STATE_SWITCH_MM flag.
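
The registration ordering described above can be sketched as follows. This is a
simplified illustration based only on the changelog text; the flag and helper
names follow the wording above, the details are not taken verbatim from the
patch, and the diff below remains authoritative.

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/sched/mm.h>
#include <linux/sched/signal.h>

static void membarrier_register_private_expedited(void)
{
        struct task_struct *p = current;
        struct mm_struct *mm = p->mm;

        if (atomic_read(&mm->membarrier_state) &
            MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
                return;
        atomic_or(MEMBARRIER_STATE_SWITCH_MM, &mm->membarrier_state);
        if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
                /*
                 * Wait for a scheduler grace period so that every CPU
                 * subsequently switching to this mm observes
                 * MEMBARRIER_STATE_SWITCH_MM in switch_mm() before the
                 * READY flag lets private expedited commands succeed.
                 */
                synchronize_sched();
        }
        atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
                  &mm->membarrier_state);
}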

Changes since v1:
- Remove membarrier thread flag on powerpc (now unused).

Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: gro...@google.com
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/membarrier.h  | 21 ++---
 arch/powerpc/include/asm/thread_info.h |  3 ---
 arch/powerpc/kernel/membarrier.c   | 17 -
 include/linux/mm_types.h   |  2 +-
 include/linux/sched/mm.h   | 28 ++--
 kernel/fork.c  |  2 --
 kernel/sched/membarrier.c  | 16 +---
 7 files changed, 26 insertions(+), 63 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
index 61152a7a3cf9..0951646253d9 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -11,8 +11,8 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 * when switching from userspace to kernel is not needed after
 * store to rq->curr.
 */
-   if (likely(!test_ti_thread_flag(task_thread_info(tsk),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED) || !prev))
+   if (likely(!(atomic_read(&next->membarrier_state)
+   & MEMBARRIER_STATE_SWITCH_MM) || !prev))
return;
 
/*
@@ -21,23 +21,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 */
smp_mb();
 }
-static inline void membarrier_arch_fork(struct task_struct *t,
-   unsigned long clone_flags)
-{
-   /*
-* Coherence of TIF_MEMBARRIER_PRIVATE_EXPEDITED against thread
-* fork is protected by siglock. membarrier_arch_fork is called
-* with siglock held.
-*/
-   if (test_thread_flag(TIF_MEMBARRIER_PRIVATE_EXPEDITED))
-   set_ti_thread_flag(task_thread_info(t),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED);
-}
-static inline void membarrier_arch_execve(struct task_struct *t)
-{
-   clear_ti_thread_flag(task_thread_info(t),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED);
-}
 void membarrier_arch_register_private_expedited(struct task_struct *t);
 
 #endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 2a208487724b..a941cc6fc3e9 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -100,7 +100,6 @@ static inline struct thread_info *current_thread_info(void)
 #if defined(CONFIG_PPC64)
 #define TIF_ELF2ABI18  /* function descriptors must die! */
 #endif
-#define TIF_MEMBARRIER_PRIVATE_EXPEDITED   19  /* membarrier */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
@@ -120,8 +119,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACEPOINT(1<<TIF_SYSCALL_TRACEPOINT)
 #define _

[PATCH tip/sched/membarrier 1/5] membarrier: Provide register expedited private command

2017-10-13 Thread Paul E. McKenney
From: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Provide a new command allowing processes to register their intent to use
the private expedited command.

This allows PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
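
From userspace, the resulting calling sequence looks roughly like the sketch
below. The register command name is assumed from the uapi constant added by
this series rather than quoted from the text above, and error handling is kept
minimal for brevity.

#include <linux/membarrier.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        /* Register intent first; otherwise the command below returns EPERM. */
        if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
                perror("register private expedited");

        /* Now the expedited private command is allowed to succeed. */
        if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0))
                perror("private expedited");
        return 0;
}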

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: gro...@google.com
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 MAINTAINERS|  2 ++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/membarrier.h  | 43 +
 arch/powerpc/include/asm/thread_info.h |  3 ++
 arch/powerpc/kernel/Makefile   |  2 ++
 arch/powerpc/kernel/membarrier.c   | 45 ++
 arch/powerpc/mm/mmu_context.c  |  7 +
 fs/exec.c  |  1 +
 include/linux/mm_types.h   |  3 ++
 include/linux/sched/mm.h   | 50 ++
 include/uapi/linux/membarrier.h| 23 +++-
 init/Kconfig   |  3 ++
 kernel/fork.c  |  2 ++
 kernel/sched/core.c| 10 ---
 kernel/sched/membarrier.c  | 25 ++---
 15 files changed, 199 insertions(+), 21 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h
 create mode 100644 arch/powerpc/kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d750b19c0..f0bc68b2d221 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8829,6 +8829,8 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/kernel/membarrier.c
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 809c468edab1..6f44c5f74f71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -138,6 +138,7 @@ config PPC
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_HAS_MEMBARRIER_HOOKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
d

[PATCH tip/sched/membarrier 4/5] membarrier: Remove unused code for architectures without membarrier hooks

2017-10-13 Thread Paul E. McKenney
From: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Architectures without membarrier hooks don't need to emit the
empty membarrier_arch_switch_mm() static inline when
CONFIG_MEMBARRIER=y.

Adapt the CONFIG_MEMBARRIER=n counterpart to only emit the empty
membarrier_arch_switch_mm() for architectures with membarrier hooks.

Reported-by: Nicholas Piggin <npig...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: gro...@google.com
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 include/linux/sched/mm.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e4955d293687..40379edac388 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -221,10 +221,6 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
#include <asm/membarrier.h>
 #else
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-   struct mm_struct *next, struct task_struct *tsk)
-{
-}
 static inline void membarrier_arch_fork(struct task_struct *t,
unsigned long clone_flags)
 {
@@ -253,10 +249,12 @@ static inline void membarrier_execve(struct task_struct *t)
membarrier_arch_execve(t);
 }
 #else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
struct mm_struct *next, struct task_struct *tsk)
 {
 }
+#endif
 static inline void membarrier_fork(struct task_struct *t,
unsigned long clone_flags)
 {
-- 
2.5.2



Re: [RFC PATCH for 4.14 1/2] membarrier: Remove unused code for architectures without membarrier hooks

2017-10-06 Thread Paul E. McKenney
On Thu, Oct 05, 2017 at 06:33:26PM -0400, Mathieu Desnoyers wrote:
> Architectures without membarrier hooks don't need to emit the
> empty membarrier_arch_switch_mm() static inline when
> CONFIG_MEMBARRIER=y.
> 
> Adapt the CONFIG_MEMBARRIER=n counterpart to only emit the empty
> membarrier_arch_switch_mm() for architectures with membarrier hooks.
> 
> Reported-by: Nicholas Piggin <npig...@gmail.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Queued for further review and testing, targeting v4.15.  Please let me
know if you need it sooner.

Thanx, Paul

> CC: Peter Zijlstra <pet...@infradead.org>
> CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> CC: Boqun Feng <boqun.f...@gmail.com>
> CC: Andrew Hunter <a...@google.com>
> CC: Maged Michael <maged.mich...@gmail.com>
> CC: gro...@google.com
> CC: Avi Kivity <a...@scylladb.com>
> CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> CC: Paul Mackerras <pau...@samba.org>
> CC: Michael Ellerman <m...@ellerman.id.au>
> CC: Dave Watson <davejwat...@fb.com>
> CC: Alan Stern <st...@rowland.harvard.edu>
> CC: Will Deacon <will.dea...@arm.com>
> CC: Andy Lutomirski <l...@kernel.org>
> CC: Ingo Molnar <mi...@redhat.com>
> CC: Alexander Viro <v...@zeniv.linux.org.uk>
> CC: linuxppc-dev@lists.ozlabs.org
> CC: linux-a...@vger.kernel.org
> ---
>  include/linux/sched/mm.h | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index d5a9ab8f3836..b2767ecb21a8 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -215,10 +215,6 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
>  #ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
> #include <asm/membarrier.h>
>  #else
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> - struct mm_struct *next, struct task_struct *tsk)
> -{
> -}
>  static inline void membarrier_arch_fork(struct task_struct *t,
>   unsigned long clone_flags)
>  {
> @@ -247,10 +243,12 @@ static inline void membarrier_execve(struct task_struct *t)
>   membarrier_arch_execve(t);
>  }
>  #else
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
>  static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
>   struct mm_struct *next, struct task_struct *tsk)
>  {
>  }
> +#endif
>  static inline void membarrier_fork(struct task_struct *t,
>   unsigned long clone_flags)
>  {
> -- 
> 2.11.0
> 



[PATCH tip/core/rcu 1/3] membarrier: Provide register expedited private command

2017-10-04 Thread Paul E. McKenney
From: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Provide a new command allowing processes to register their intent to use
the private expedited command.

This allows PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: gro...@google.com
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 MAINTAINERS|  2 ++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/membarrier.h  | 43 +
 arch/powerpc/include/asm/thread_info.h |  3 ++
 arch/powerpc/kernel/Makefile   |  2 ++
 arch/powerpc/kernel/membarrier.c   | 45 ++
 arch/powerpc/mm/mmu_context.c  |  7 +
 fs/exec.c  |  1 +
 include/linux/mm_types.h   |  3 ++
 include/linux/sched/mm.h   | 50 ++
 include/uapi/linux/membarrier.h| 23 +++-
 init/Kconfig   |  3 ++
 kernel/fork.c  |  2 ++
 kernel/sched/core.c| 10 ---
 kernel/sched/membarrier.c  | 25 ++---
 15 files changed, 199 insertions(+), 21 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h
 create mode 100644 arch/powerpc/kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 65b0c88d5ee0..c5296d7f447b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8822,6 +8822,8 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/kernel/membarrier.c
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 809c468edab1..6f44c5f74f71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -138,6 +138,7 @@ config PPC
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_HAS_MEMBARRIER_HOOKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
d

Re: [PATCH v5 for 4.14 1/3] membarrier: Provide register expedited private command

2017-09-30 Thread Paul E. McKenney
On Fri, Sep 29, 2017 at 01:47:18PM -0400, Mathieu Desnoyers wrote:
> Provide a new command allowing processes to register their intent to use
> the private expedited command.
> 
> This allows PowerPC to skip the full memory barrier in switch_mm(), and
> only issue the barrier when scheduling into a task belonging to a
> process that has registered to use expedited private.
> 
> Processes are now required to register before using
> MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

Queued for further review and testing, replacing your two earlier
commits.  Will send a pull request if there are no objections in the
next few days.

Thanx, Paul

> Changes since v1:
> - Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
>   powerpc membarrier_arch_sched_in(), given that we want to specifically
>   check the next thread state.
> - Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
> - Use task_thread_info() to pass thread_info from task to
>   *_ti_thread_flag().
> 
> Changes since v2:
> - Move membarrier_arch_sched_in() call to finish_task_switch().
> - Check for NULL t->mm in membarrier_arch_fork().
> - Use membarrier_sched_in() in generic code, which invokes the
>   arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
>   build on PowerPC.
> - Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
>   allnoconfig build on PowerPC.
> - Build and runtime tested on PowerPC.
> 
> Changes since v3:
> - Simply rely on copy_mm() to copy the membarrier_private_expedited mm
>   field on fork.
> - powerpc: test thread flag instead of reading
>   membarrier_private_expedited in membarrier_arch_fork().
> - powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
>   from kernel thread, since mmdrop() implies a full barrier.
> - Set membarrier_private_expedited to 1 only after arch registration
>   code, thus eliminating a race where concurrent commands could succeed
>   when they should fail if issued concurrently with process
>   registration.
> - Use READ_ONCE() for membarrier_private_expedited field access in
>   membarrier_private_expedited. Matches WRITE_ONCE() performed in
>   process registration.
> 
> Changes since v4:
> - Move powerpc hook from sched_in() to switch_mm(), based on feedback
>   from Nicholas Piggin.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
> CC: Peter Zijlstra <pet...@infradead.org>
> CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> CC: Boqun Feng <boqun.f...@gmail.com>
> CC: Andrew Hunter <a...@google.com>
> CC: Maged Michael <maged.mich...@gmail.com>
> CC: gro...@google.com
> CC: Avi Kivity <a...@scylladb.com>
> CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> CC: Paul Mackerras <pau...@samba.org>
> CC: Michael Ellerman <m...@ellerman.id.au>
> CC: Dave Watson <davejwat...@fb.com>
> CC: Alan Stern <st...@rowland.harvard.edu>
> CC: Will Deacon <will.dea...@arm.com>
> CC: Andy Lutomirski <l...@kernel.org>
> CC: Ingo Molnar <mi...@redhat.com>
> CC: Alexander Viro <v...@zeniv.linux.org.uk>
> CC: Nicholas Piggin <npig...@gmail.com>
> CC: linuxppc-dev@lists.ozlabs.org
> CC: linux-a...@vger.kernel.org
> ---
>  MAINTAINERS|  2 ++
>  arch/powerpc/Kconfig   |  1 +
>  arch/powerpc/include/asm/membarrier.h  | 43 +
>  arch/powerpc/include/asm/thread_info.h |  3 ++
>  arch/powerpc/kernel/Makefile   |  2 ++
>  arch/powerpc/kernel/membarrier.c   | 45 ++
>  arch/powerpc/mm/mmu_context.c  |  7 +
>  fs/exec.c  |  1 +
>  include/linux/mm_types.h   |  3 ++
>  include/linux/sched/mm.h   | 50 ++
>  include/uapi/linux/membarrier.h| 23 +++-
>  init/Kconfig   |  3 ++
>  kernel/fork.c  |  2 ++
>  kernel/sched/core.c| 10 ---
>  kernel/sched/membarrier.c  | 25 ++---
>  15 files changed, 199 insertions(+), 21 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/membarrier.h
>  create mode 100644 arch/powerpc/kernel/membarrier.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6671f375f7fc..f11d8aece00d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8816,6 +8816,8 @@ L:  linux-ker...@vger.kernel.org
>  S:   Supported
>  F:   kernel/sched/membarrier.c
>  F:   include/uapi/linux/membarrier.h
> +F:   arch/powerpc/kernel/
