Re: 9908859acaa9 cpuidle/menu: add per CPU PM QoS resume latency consideration

2017-02-22 Thread Mike Galbraith
On Wed, 2017-02-22 at 14:12 +0100, Peter Zijlstra wrote:
> On Wed, Feb 22, 2017 at 01:56:37PM +0100, Mike Galbraith wrote:
> > Hi,
> > 
> > Do we really need a spinlock for that in the idle loop?
> 
> Urgh, that's broken on RT, you cannot schedule in the idle loop.

That's what made me notice the obnoxious little bugger.

[   77.608340] BUG: sleeping function called from invalid context at 
kernel/locking/rtmutex.c:995
[   77.608342] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
[   77.608343] INFO: lockdep is turned off.
[   77.608344] irq event stamp: 59222
[   77.608353] hardirqs last  enabled at (59221): [] 
rcu_idle_exit+0x2f/0x50
[   77.608362] hardirqs last disabled at (59222): [] 
do_idle+0x9a/0x290
[   77.608372] softirqs last  enabled at (0): [] 
copy_process.part.34+0x5f1/0x22a0
[   77.608374] softirqs last disabled at (0): [<  (null)>]   
(null)
[   77.608374] Preemption disabled at:
[   77.608383] [] schedule_preempt_disabled+0x22/0x30
[   77.608387] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW   E   
4.11.0-rt9-rt #163
[   77.608389] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
[   77.608390] Call Trace:
[   77.608399]  dump_stack+0x85/0xc8
[   77.608405]  ___might_sleep+0x15d/0x260
[   77.608409]  rt_spin_lock+0x24/0x80
[   77.608419]  dev_pm_qos_read_value+0x1e/0x40
[   77.608424]  menu_select+0x56/0x3e0
[   77.608426]  ? rcu_eqs_enter_common.isra.40+0x9d/0x160
[   77.608435]  cpuidle_select+0x13/0x20
[   77.608438]  do_idle+0x182/0x290
[   77.608445]  cpu_startup_entry+0x48/0x50
[   77.608450]  start_secondary+0x133/0x160
[   77.608453]  start_cpu+0x14/0x14
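
The splat boils down to this: dev_pm_qos_read_value() takes a sleeping lock (an rtmutex on RT), but menu_select() runs in the non-sleepable idle path. The usual way out is to keep the bookkeeping under the lock but publish the aggregated value through a lockless accessor that hot paths can read. A userspace sketch of that pattern (names illustrative, not kernel code; mainline later grew a dev_pm_qos_raw_read_value()-style accessor for this reason):

```c
#include <assert.h>
#include <stdatomic.h>

/* Published aggregate of the per-device resume-latency constraint. */
static _Atomic int resume_latency_us;

/* Slow path: runs with the (sleeping) QoS lock held; after updating
 * its internal lists it publishes the new aggregate atomically. */
static void qos_publish(int aggregated_us)
{
    atomic_store_explicit(&resume_latency_us, aggregated_us,
                          memory_order_release);
}

/* Fast path: safe from preempt/irq-disabled context such as the
 * cpuidle governor -- no lock taken, nothing can sleep. */
static int qos_raw_read(void)
{
    return atomic_load_explicit(&resume_latency_us,
                                memory_order_acquire);
}
```

The trade-off is that a reader may see a value that is one update stale, which is acceptable for an idle-state heuristic.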


9908859acaa9 cpuidle/menu: add per CPU PM QoS resume latency consideration

2017-02-22 Thread Mike Galbraith
Hi,

Do we really need a spinlock for that in the idle loop?

-Mike


Re: [bisection] b0119e87083 iommu: Introduce new 'struct iommu_device' ==> boom

2017-02-21 Thread Mike Galbraith
On Tue, 2017-02-21 at 16:19 +0100, Joerg Roedel wrote:
> Hi Mike,
> 
> thanks for the report, this didn't trigger in my local testing here.
> Looks like I need to test without intel_iommu=on too :/
> 
> Anyway, can you check whether the attached patch helps?

Yup, boots.

> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index d9c0decfc91a..a74fec8d266a 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1108,8 +1108,10 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
>  
>  static void free_iommu(struct intel_iommu *iommu)
>  {
> -	iommu_device_sysfs_remove(&iommu->iommu);
> -	iommu_device_unregister(&iommu->iommu);
> +	if (intel_iommu_enabled) {
> +		iommu_device_sysfs_remove(&iommu->iommu);
> +		iommu_device_unregister(&iommu->iommu);
> +	}
>  
>  	if (iommu->irq) {
>  		if (iommu->pr_irq) {
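
The patch is an instance of the classic asymmetric-teardown bug: registration only runs when the IOMMU is enabled, so teardown must mirror the same condition or it unregisters something that was never registered (the NULL dereference in the bisected oops below). A minimal userspace model of the invariant, with hypothetical names standing in for the kernel state:

```c
#include <assert.h>
#include <stdbool.h>

static bool enabled;     /* stands in for intel_iommu_enabled */
static bool registered;  /* stands in for the sysfs/device registration */

/* Setup runs conditionally... */
static void setup(void)
{
    if (enabled)
        registered = true;   /* iommu_device_sysfs_add() etc. */
}

/* ...so teardown must check the very same condition. */
static int teardown(void)
{
    if (enabled) {
        if (!registered)
            return -1;       /* unconditional teardown: the oops */
        registered = false;
    }
    return 0;                /* nothing to undo when disabled */
}
```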



[bisection] b0119e87083 iommu: Introduce new 'struct iommu_device' ==> boom

2017-02-21 Thread Mike Galbraith
4x18 box (berio) explodes as below after this morning's master pull.  BIOS has
a couple of issues; maybe one of them is to blame.

[   30.796530] ima: No TPM chip found, activating TPM-bypass! (rc=-19)
[   30.810709] evm: HMAC attrs: 0x1
[   30.821200] BUG: unable to handle kernel NULL pointer dereference at 
0008
[   30.839003] IP: device_del+0x6e/0x350
[   30.847364] PGD 0 
[   30.847365] 
[   30.855639] Oops: 0000 [#1] SMP
[   30.862858] Dumping ftrace buffer:
[   30.870678](ftrace buffer empty)
[   30.878849] Modules linked in:
[   30.885870] CPU: 39 PID: 1 Comm: swapper/0 Not tainted 4.10.0-default #144
[   30.901334] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
[   30.924687] task: 88017cab2040 task.stack: c9038000
[   30.938040] RIP: 0010:device_del+0x6e/0x350
[   30.947532] RSP: :c903bd50 EFLAGS: 00010246
[   30.959344] RAX:  RBX: 8810fce66928 RCX: 77ff8000
[   30.975381] RDX: 88017cab2040 RSI: 00ec RDI: 81a1300b
[   30.991420] RBP: c903bd90 R08: 8808fc9bcdb8 R09: 
[   31.007459] R10:  R11: c903bc08 R12: 8810fce66928
[   31.023497] R13:  R14:  R15: 8810fce669c8
[   31.039536] FS:  () GS:88106f8c() 
knlGS:
[   31.057897] CS:  0010 DS:  ES:  CR0: 80050033
[   31.070867] CR2: 0008 CR3: 01c09000 CR4: 001406e0
[   31.086890] DR0:  DR1:  DR2: 
[   31.102927] DR3:  DR6: fffe0ff0 DR7: 0400
[   31.118967] Call Trace:
[   31.124641]  ? dmar_free_dev_scope+0x62/0x80
[   31.134347]  device_unregister+0x1a/0x60
[   31.143284]  iommu_device_sysfs_remove+0x12/0x20
[   31.153755]  dmar_free_drhd+0x40/0x120
[   31.162311]  dmar_free_unused_resources+0xad/0xc9
[   31.172975]  ? detect_intel_iommu+0xcf/0xcf
[   31.182487]  do_one_initcall+0x51/0x1b0
[   31.191233]  ? parse_args+0x27b/0x460
[   31.199596]  kernel_init_freeable+0x1a2/0x232
[   31.209490]  ? set_debug_rodata+0x12/0x12
[   31.218619]  ? rest_init+0x90/0x90
[   31.226399]  kernel_init+0xe/0x110
[   31.234186]  ret_from_fork+0x2c/0x40
[   31.242351] Code: 00 00 00 48 81 c7 f0 00 00 00 e8 2e b1 bc ff 48 c7 c7 e0 
69 d0 81 4d 8d bc 24 a0 00 00 00 e8 2a f4 1b 00 49 8b 84 24 a8 00 00 00 <48> 8b 
48 08 49 39 c7 4c 8d 70 e0 48 8d 59 e0 75 08 eb 2a 49 89 
[   31.284763] RIP: device_del+0x6e/0x350 RSP: c903bd50
[   31.297536] CR2: 0008
[   31.305148] ---[ end trace 617d26bc9a426981 ]---


b0119e870837dcd15a207b4701542ebac5d19b45 is the first bad commit
commit b0119e870837dcd15a207b4701542ebac5d19b45
Author: Joerg Roedel 
Date:   Wed Feb 1 13:23:08 2017 +0100

iommu: Introduce new 'struct iommu_device'

This struct represents one hardware iommu in the iommu core
code. For now it only has the iommu-ops associated with it,
but that will be extended soon.

The register/unregister interface is also added, as well as
making use of it in the Intel and AMD IOMMU drivers.

Signed-off-by: Joerg Roedel 

:04 04 cb491d4d5bd25f1b65e6c93f7e67c8594901d6e1 
84a5621c5e88961cf2385566c1c28eb5375c413f M  drivers
:04 04 5a2f0b8b829b29ef80baf6ef7cf2ba4b9bf23bf7 
89ecaf2419fe0e00500c23413149f1df7bcbd693 M  include

git bisect start
# good: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
git bisect good c470abd4fde40ea6a0846a2beab642a578c0b8cd
# bad: [2bfe01eff4307409b95859e860261d0907149b61] Merge branch 'for-next' of 
git://git.samba.org/sfrench/cifs-2.6
git bisect bad 2bfe01eff4307409b95859e860261d0907149b61
# good: [828cad8ea05d194d8a9452e0793261c2024c23a2] Merge branch 
'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 828cad8ea05d194d8a9452e0793261c2024c23a2
# bad: [f790bd9c8e826434ab6c326b225276ed0f73affe] Merge tag 'regulator-v4.11' 
of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect bad f790bd9c8e826434ab6c326b225276ed0f73affe
# good: [937b5b5ddd2f685b4962ec19502e641bb5741c12] Merge tag 
'm68k-for-v4.11-tag1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
git bisect good 937b5b5ddd2f685b4962ec19502e641bb5741c12
# bad: [27a67e0f983567574ef659520d930f82cf65125a] Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect bad 27a67e0f983567574ef659520d930f82cf65125a
# good: [2c9f1af528a4581e8ef8590108daa3c3df08dd5a] vfio/type1: Fix error return 
code in vfio_iommu_type1_attach_group()
git bisect good 2c9f1af528a4581e8ef8590108daa3c3df08dd5a
# bad: [ebb4949eb32ff500602f960525592fc4e614c5a7] Merge tag 
'iommu-updates-v4.11' of 
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect bad ebb4949eb32ff500602f960525592fc4e614c5a7
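
The bisect log above is just a binary search over the merge window: each good/bad verdict halves the candidate range, so roughly 13-14 test boots pin the first bad commit out of ~10k. A sketch of the underlying logic, with array indices standing in for commits:

```c
#include <assert.h>

/* Commit 0 is the known-good base (v4.10 here), commit n-1 the
 * known-bad tip; first_bad is what the bisection is hunting. */
static int bisect(int n, int first_bad)
{
    int good = 0, bad = n - 1;   /* initial verdicts */

    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (mid >= first_bad)
            bad = mid;           /* "git bisect bad" */
        else
            good = mid;          /* "git bisect good" */
    }
    return bad;                  /* "... is the first bad commit" */
}
```

With a scriptable reproducer, `git bisect run` drives the same loop automatically.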


[cgroups] suspicious rcu_dereference_check() usage!

2017-02-20 Thread Mike Galbraith
Running LTP on master.today (v4.10) with a seriously bloated PREEMPT
config inspired box to emit the below.

[ 7160.458996] ===
[ 7160.463195] [ INFO: suspicious RCU usage. ]
[ 7160.467387] 4.10.0-default #100 Tainted: GE  
[ 7160.472808] ---
[ 7160.476999] ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() 
usage!
[ 7160.484576] 
[ 7160.484576] other info that might help us debug this:
[ 7160.484576] 
[ 7160.492577] 
[ 7160.492577] rcu_scheduler_active = 2, debug_locks = 1
[ 7160.499113] 1 lock held by pids_task1/19308:
[ 7160.503390]  #0:  (&cgroup_threadgroup_rwsem){+.}, at: [] _do_fork+0xf0/0x710
[ 7160.512450] 
[ 7160.512450] stack backtrace:
[ 7160.516810] CPU: 5 PID: 19308 Comm: pids_task1 Tainted: GE   
4.10.0-default #100
[ 7160.525239] Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698 , 
BIOS -[D6E150AUS-1.10]- 12/15/2010
[ 7160.534965] Call Trace:
[ 7160.537414]  dump_stack+0x85/0xc9
[ 7160.540732]  lockdep_rcu_suspicious+0xd5/0x110
[ 7160.545177]  task_css.constprop.7+0x88/0x90
[ 7160.549357]  pids_can_fork+0x132/0x160
[ 7160.553106]  cgroup_can_fork+0x63/0xc0
[ 7160.556855]  copy_process.part.30+0x17ef/0x21b0
[ 7160.561382]  ? _do_fork+0xf0/0x710
[ 7160.564786]  ? free_pages_and_swap_cache+0x9e/0xc0
[ 7160.569575]  _do_fork+0xf0/0x710
[ 7160.572806]  ? __this_cpu_preempt_check+0x13/0x20
[ 7160.577505]  ? __percpu_counter_add+0x86/0xb0
[ 7160.581860]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[ 7160.586562]  ? do_syscall_64+0x2d/0x200
[ 7160.590395]  SyS_clone+0x19/0x20
[ 7160.593623]  do_syscall_64+0x6c/0x200
[ 7160.597283]  entry_SYSCALL64_slow_path+0x25/0x25
[ 7160.601899] RIP: 0033:0x7fdaa3b881c4
[ 7160.605473] RSP: 002b:7ffd21635d50 EFLAGS: 0246 ORIG_RAX: 
0038
[ 7160.613036] RAX: ffda RBX: 4b6c RCX: 7fdaa3b881c4
[ 7160.620162] RDX:  RSI:  RDI: 01200011
[ 7160.627288] RBP: 7ffd21635d90 R08:  R09: 7fdaa4052700
[ 7160.634414] R10: 7fdaa40529d0 R11: 0246 R12: 7ffd21635d50
[ 7160.641539] R13:  R14:  R15: 
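
The accessor behind the splat, task_css(), expands to rcu_dereference_check(): the access is legal inside an RCU read-side section or when the extra lockdep condition (typically "update-side lock held") is true, and here the fork path satisfied neither as far as lockdep could see. A toy userspace model of that contract (names local to this sketch):

```c
#include <assert.h>
#include <stdbool.h>

static int rcu_nesting;          /* per-task counter in the kernel */
static bool update_lock_held;    /* the extra "c" condition */

static void rcu_read_lock(void)   { rcu_nesting++; }
static void rcu_read_unlock(void) { rcu_nesting--; }

/* rcu_dereference_check(p, c) is legal iff this returns true;
 * otherwise lockdep emits exactly the splat above. */
static bool deref_is_legal(void)
{
    return rcu_nesting > 0 || update_lock_held;
}
```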


[btrfs] lockdep splat

2017-02-17 Thread Mike Galbraith
Greetings,

Running ltp on master.today, I received the splat (from hell) below.

[ 5015.128458] =
[ 5015.128458] [ INFO: possible irq lock inversion dependency detected ]
[ 5015.128458] 4.10.0-default #119 Tainted: GE  
[ 5015.128458] -
[ 5015.128458] khugepaged/896 just changed the state of lock:
[ 5015.128458]  (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x41/0x2d0 [btrfs]
[ 5015.128458] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 5015.128458]  (pcpu_alloc_mutex){+.+.+.}
[ 5015.128458] 
[ 5015.128458] 
[ 5015.128458] and interrupts could create inverse lock ordering between them.
[ 5015.128458] 
[ 5015.128458] 
[ 5015.128458] other info that might help us debug this:
[ 5015.128458] Chain exists of:
[ 5015.128458]   &delayed_node->mutex --> &space_info->groups_sem --> pcpu_alloc_mutex
[ 5015.128458] 
[ 5015.128458]  Possible interrupt unsafe locking scenario:
[ 5015.128458] 
[ 5015.128458]CPU0CPU1
[ 5015.128458]
[ 5015.128458]   lock(pcpu_alloc_mutex);
[ 5015.128458]local_irq_disable();
[ 5015.128458]lock(&delayed_node->mutex);
[ 5015.128458]lock(&space_info->groups_sem);
[ 5015.128458]   <Interrupt>
[ 5015.128458] lock(&delayed_node->mutex);
[ 5015.128458] 
[ 5015.128458]  *** DEADLOCK ***
[ 5015.128458] 
[ 5015.128458] 2 locks held by khugepaged/896:
[ 5015.128458]  #0:  (shrinker_rwsem){..}, at: [] 
shrink_slab+0x7d/0x650
[ 5015.128458]  #1:  (&type->s_umount_key#26){..}, at: [] trylock_super+0x1b/0x50
[ 5015.128458] 
[ 5015.128458] the shortest dependencies between 2nd lock and 1st lock:
[ 5015.128458]-> (pcpu_alloc_mutex){+.+.+.} ops: 4652 {
[ 5015.128458]   HARDIRQ-ON-W at:
[ 5015.128458]   __lock_acquire+0x8e6/0x1550
[ 5015.128458]   lock_acquire+0xbd/0x220
[ 5015.128458]   mutex_lock_nested+0x67/0x6a0
[ 5015.128458]   pcpu_alloc+0x1c0/0x600
[ 5015.128458]   __alloc_percpu+0x15/0x20
[ 5015.128458]   alloc_kmem_cache_cpus.isra.56+0x2b/0xa0
[ 5015.128458]   __do_tune_cpucache+0x30/0x210
[ 5015.128458]   do_tune_cpucache+0x2a/0xd0
[ 5015.128458]   enable_cpucache+0x61/0x110
[ 5015.128458]   kmem_cache_init_late+0x41/0x76
[ 5015.128458]   start_kernel+0x352/0x4cd
[ 5015.128458]   x86_64_start_reservations+0x2a/0x2c
[ 5015.128458]   x86_64_start_kernel+0x13d/0x14c
[ 5015.128458]   verify_cpu+0x0/0xfc
[ 5015.128458]   SOFTIRQ-ON-W at:
[ 5015.128458]   __lock_acquire+0x283/0x1550
[ 5015.128458]   lock_acquire+0xbd/0x220
[ 5015.128458]   mutex_lock_nested+0x67/0x6a0
[ 5015.128458]   pcpu_alloc+0x1c0/0x600
[ 5015.128458]   __alloc_percpu+0x15/0x20
[ 5015.128458]   alloc_kmem_cache_cpus.isra.56+0x2b/0xa0
[ 5015.128458]   __do_tune_cpucache+0x30/0x210
[ 5015.128458]   do_tune_cpucache+0x2a/0xd0
[ 5015.128458]   enable_cpucache+0x61/0x110
[ 5015.128458]   kmem_cache_init_late+0x41/0x76
[ 5015.128458]   start_kernel+0x352/0x4cd
[ 5015.128458]   x86_64_start_reservations+0x2a/0x2c
[ 5015.128458]   x86_64_start_kernel+0x13d/0x14c
[ 5015.128458]   verify_cpu+0x0/0xfc
[ 5015.128458]   RECLAIM_FS-ON-W at:
[ 5015.128458]  mark_held_locks+0x66/0x90
[ 5015.128458]  lockdep_trace_alloc+0x6f/0xd0
[ 5015.128458]  __alloc_pages_nodemask+0x81/0x370
[ 5015.128458]  pcpu_populate_chunk+0xac/0x340
[ 5015.128458]  pcpu_alloc+0x4f8/0x600
[ 5015.128458]  __alloc_percpu+0x15/0x20
[ 5015.128458]  perf_pmu_register+0xc6/0x3c0
[ 5015.128458]  init_hw_perf_events+0x513/0x56d
[ 5015.128458]  do_one_initcall+0x51/0x1c0
[ 5015.128458]  kernel_init_freeable+0x146/0x28e
[ 5015.128458]  kernel_init+0xe/0x110
[ 5015.128458]  ret_from_fork+0x31/0x40
[ 5015.128458]   INITIAL USE at:
[ 5015.128458]  __lock_acquire+0x2ce/0x1550
[ 5015.128458]  lock_acquire+0xbd/0x220
[ 5015.128458]  
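
Lockdep catches the above by recording, at every acquisition, which locks were already held, then flagging any pair observed in both orders (together with the irq-safety state of each class). A drastically simplified userspace sketch of that bookkeeping, tracking only direct (non-transitive) edges:

```c
#include <assert.h>
#include <stdbool.h>

#define NLOCKS 8

/* before[a][b] == true means lock b was taken while a was held. */
static bool before[NLOCKS][NLOCKS];

/* Record taking `next` while `held` is held (held < 0: none held).
 * Returns false when the opposite order was already seen -- a
 * direct-edge version of the inversion reported above. */
static bool take(int held, int next)
{
    if (held < 0)
        return true;
    if (before[next][held])
        return false;            /* A->B and B->A both observed */
    before[held][next] = true;
    return true;
}
```

The real checker also walks the dependency graph transitively, which is how a chain through three locks, as in this splat, still gets caught.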

Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-17 Thread Mike Galbraith
On Thu, 2017-02-16 at 19:06 +0100, Mike Galbraith wrote:
> On Thu, 2017-02-16 at 15:53 +0100, Sebastian Andrzej Siewior wrote:
> > On 2017-02-16 15:42:59 [+0100], Mike Galbraith wrote:
> > > 
> > > Weeell, I'm trying to cobble something kinda like that together using
> > > __RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
> > > the PER_CPU_DEP_MAP_INIT() thingy.
> > > 
> > >   CC  mm/swap.o
> > > mm/swap.c:54:689: error: braced-group within expression allowed only
> > > inside a function
> > 
> > so this is what I have now. I need to get the `static' symbol working
> > again and PER_CPU_DEP_MAP_INIT but aside from that it seems to do its
> > job.
> 
> ...
> 
> Yeah, works, I should be able to do an ltp run with stock lockdep
> settings without it taking its toys and going home in a snit.
> 
> berio:/sys/kernel/debug/tracing/:[0]# !while
> while sleep 60; do tail -1 trace; done
><...>-10315 [064] d...1..   226.953935: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14223
>w-13148 [120] d...111   287.414978: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14465
>w-16492 [089] d...111   347.128742: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14653
> (starts kbuild loop)
>  btrfs-transacti-1964  [016] d...1..   411.101549: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 17011
><...>-100268 [127] d...112   472.271769: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18153
>w-18864 [011] d...1..   534.386443: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18543
><...>-50390 [035] dN..2..   597.794164: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18765
><...>-80098 [127] d...111   659.912145: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18977
>checkproc-11123 [017] d...1..   721.483463: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19247
>   -0 [055] d..h5..   782.685953: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19383
><...>-93632 [055] d...111   835.527817: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19441

And now Thomas's patch.  Spiffiness.  Now to start ltp.

berio:/sys/kernel/debug/tracing/:[0]# !while 
while sleep 60; do tail -1 trace; done
   <...>-12462 [105] d...1..   211.489528: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 12847
 btrfs-transacti-3136  [002] d...211   272.672777: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 12947
 irq/155-eth2-Tx-4495  [096] dN..213   332.035236: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 13001
(starts kbuild loop)
   <...>-44245 [087] d...114   396.892748: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 13917
   <...>-105411 [067] dN..211   457.708259: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14367
   w-21800 [113] dN..2..   519.231735: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14449
 modpost-31558 [020] d11   576.065855: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14601
   kworker/dying-11860 [133] d...112   637.170497: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14679
   <...>-118853 [055] d11   703.884755: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14759
   <...>-52143 [090] d...1..   767.624735: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14813
   <...>-71788 [126] d...1..   829.160330: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14857
  kworker/u289:5-2991  [002] d...1..   892.402939: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14883
  sh-15106 [008] d...211   953.172196: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14937
> 


Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-17 Thread Mike Galbraith
On Thu, 2017-02-16 at 19:06 +0100, Mike Galbraith wrote:
> On Thu, 2017-02-16 at 15:53 +0100, Sebastian Andrzej Siewior wrote:
> > On 2017-02-16 15:42:59 [+0100], Mike Galbraith wrote:
> > > 
> > > Weeell, I'm trying to cobble something kinda like that together using
> > > __RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
> > > the PER_CPU_DEP_MAP_INIT() thingy.
> > > 
> > >   CC  mm/swap.o
> > > mm/swap.c:54:689: error: braced-group within expression allowed only
> > > inside a function
> > 
> > so this is what I have now. I need to get the `static' symbol working
> > again and PER_CPU_DEP_MAP_INIT but aside from that it seems to do its
> > job.
> 
> ...
> 
> Yeah, works, I should be able to do an ltp run with stock lockdep
> settings without it taking it's toys and going home in a snit. 
> 
> berio:/sys/kernel/debug/tracing/:[0]# !while
> while sleep 60; do tail -1 trace; done
><...>-10315 [064] d...1..   226.953935: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14223
>w-13148 [120] d...111   287.414978: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14465
>w-16492 [089] d...111   347.128742: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14653
> (starts kbuild loop)
>  btrfs-transacti-1964  [016] d...1..   411.101549: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 17011
><...>-100268 [127] d...112   472.271769: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18153
>w-18864 [011] d...1..   534.386443: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18543
><...>-50390 [035] dN..2..   597.794164: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18765
><...>-80098 [127] d...111   659.912145: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18977
>checkproc-11123 [017] d...1..   721.483463: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19247
>   <idle>-0 [055] d..h5..   782.685953: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19383
><...>-93632 [055] d...111   835.527817: 
> add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19441

And now Thomas's patch.  Spiffiness.  Now to start ltp.

berio:/sys/kernel/debug/tracing/:[0]# !while 
while sleep 60; do tail -1 trace; done
   <...>-12462 [105] d...1..   211.489528: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 12847
 btrfs-transacti-3136  [002] d...211   272.672777: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 12947
 irq/155-eth2-Tx-4495  [096] dN..213   332.035236: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 13001
(starts kbuild loop)
   <...>-44245 [087] d...114   396.892748: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 13917
   <...>-105411 [067] dN..211   457.708259: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14367
   w-21800 [113] dN..2..   519.231735: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14449
 modpost-31558 [020] d11   576.065855: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14601
   kworker/dying-11860 [133] d...112   637.170497: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14679
   <...>-118853 [055] d11   703.884755: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14759
   <...>-52143 [090] d...1..   767.624735: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14813
   <...>-71788 [126] d...1..   829.160330: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14857
  kworker/u289:5-2991  [002] d...1..   892.402939: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14883
  sh-15106 [008] d...211   953.172196: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14937


Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-16 Thread Mike Galbraith

BTW, this ain't gone.  I'll take a peek.  It doesn't happen in my tree,
seems likely to be because whether running sirqs fully threaded or not,
I don't let any one thread handle what another exists to handle.

[  638.107293] NOHZ: local_softirq_pending 80
[  939.729684] NOHZ: local_softirq_pending 80
[  945.600869] NOHZ: local_softirq_pending 80
[ 1387.101178] NOHZ: local_softirq_pending 80
[ 1387.101343] NOHZ: local_softirq_pending 80
[ 1387.101549] NOHZ: local_softirq_pending 80
[ 1413.313212] NOHZ: local_softirq_pending 80
[ 1413.313305] NOHZ: local_softirq_pending 80
[ 1413.313347] NOHZ: local_softirq_pending 80


Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-16 Thread Mike Galbraith
On Thu, 2017-02-16 at 15:53 +0100, Sebastian Andrzej Siewior wrote:
> On 2017-02-16 15:42:59 [+0100], Mike Galbraith wrote:
> > 
> > Weeell, I'm trying to cobble something kinda like that together using
> > __RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
> > the PER_CPU_DEP_MAP_INIT() thingy.
> > 
> >   CC  mm/swap.o
> > mm/swap.c:54:689: error: braced-group within expression allowed only
> > inside a function
> 
> so this is what I have now. I need to get the `static' symbol working
> again and PER_CPU_DEP_MAP_INIT but aside from that it seems to do its
> job.

...

Yeah, works, I should be able to do an ltp run with stock lockdep
settings without it taking its toys and going home in a snit. 

berio:/sys/kernel/debug/tracing/:[0]# !while
while sleep 60; do tail -1 trace; done
   <...>-10315 [064] d...1..   226.953935: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14223
   w-13148 [120] d...111   287.414978: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14465
   w-16492 [089] d...111   347.128742: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14653
(starts kbuild loop)
 btrfs-transacti-1964  [016] d...1..   411.101549: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 17011
   <...>-100268 [127] d...112   472.271769: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18153
   w-18864 [011] d...1..   534.386443: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18543
   <...>-50390 [035] dN..2..   597.794164: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18765
   <...>-80098 [127] d...111   659.912145: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18977
   checkproc-11123 [017] d...1..   721.483463: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19247
  <idle>-0 [055] d..h5..   782.685953: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19383
   <...>-93632 [055] d...111   835.527817: 
add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19441








Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-16 Thread Mike Galbraith
On Thu, 2017-02-16 at 12:06 +0100, Peter Zijlstra wrote:
> On Thu, Feb 16, 2017 at 10:01:18AM +0100, Thomas Gleixner wrote:
> > On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > 
> > > On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> > > > On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > > > 
> > > ...
> > > > > swapvec_lock?  Oodles of 'em?  Nope.
> > > > 
> > > > Well, it's a per cpu lock and the lru_cache_add() variants might be 
> > > > called
> > > > from a gazillion of different call chains, but yes, it does not make a 
> > > > lot
> > > > of sense. We'll have a look.
> > > 
> > > Adding explicit local_irq_lock_init() makes things heaps better, so
> > > presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().
> > 
> > Bah.
> 
> 
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> #define PER_CPU_DEP_MAP_INIT(lockname)  \
> .dep_map = {\
> .key = ({ static struct lock_class_key __key; &__key }), \
> .name = #lockname,  \
> }
> #else
> #define PER_CPU_DEP_MAP_INIT(lockname)
> #endif
> 
> #define DEFINE_LOCAL_IRQ_LOCK(lvar) \
> DEFINE_PER_CPU(struct local_irq_lock, lvar) = { \
> .lock = { .rlock = {\
> .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,  \
> SPIN_DEBUG_INIT(lvar)   \
> PER_CPU_DEP_MAP_INIT(lvar)  \
> } } \
> }
> 
> That's fairly horrible for poking inside all the internals, but it might
> just work ;-

Weeell, I'm trying to cobble something kinda like that together using
__RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
the PER_CPU_DEP_MAP_INIT() thingy.

  CC  mm/swap.o
mm/swap.c:54:689: error: braced-group within expression allowed only
inside a function

-Mike




Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-16 Thread Mike Galbraith
On Thu, 2017-02-16 at 10:01 +0100, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Mike Galbraith wrote:
> 
> > On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> > > On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > > 
> > ...
> > > > swapvec_lock?  Oodles of 'em?  Nope.
> > > 
> > > Well, it's a per cpu lock and the lru_cache_add() variants might be called
> > > from a gazillion of different call chains, but yes, it does not make a lot
> > > of sense. We'll have a look.
> > 
> > Adding explicit local_irq_lock_init() makes things heaps better, so
> > presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().
> 
> Bah.

Hm, "bah" sounds kinda like it might be a synonym for -EDUMMY :)  Fair
enough, I know spit about lockdep, so that's likely the case, but
the below has me down to ~17k (and climbing, but not as fast).

berio:/sys/kernel/debug/tracing/:[0]# grep -A 1 'stack trace' trace|grep 
'=>'|sort|uniq
 => ___slab_alloc+0x171/0x5c0
 => __percpu_counter_add+0x56/0xd0
 => __schedule+0xb0/0x7b0
 => __slab_free+0xd8/0x200
 => cgroup_idr_alloc.constprop.39+0x37/0x80
 => hrtimer_start_range_ns+0xe6/0x400
 => idr_preload+0x6c/0x300
 => jbd2_journal_extend+0x4c/0x310 [jbd2]
 => lock_hrtimer_base.isra.28+0x29/0x50
 => rcu_note_context_switch+0x2b8/0x5c0
 => rcu_report_unblock_qs_rnp+0x6e/0xa0
 => rt_mutex_slowunlock+0x25/0xc0
 => rt_spin_lock_slowlock+0x52/0x330
 => rt_spin_lock_slowlock+0x94/0x330
 => rt_spin_lock_slowunlock+0x3c/0xc0
 => swake_up+0x21/0x40
 => task_blocks_on_rt_mutex+0x42/0x1e0
 => try_to_wake_up+0x2d/0x920

berio:/sys/kernel/debug/tracing/:[0]# grep nr_list_entries: trace|tail -1
 irq/66-eth2-TxR-3670  [115] d14  1542.321173: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 17839

Got rid of the really pesky growth anyway.

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5522,6 +5522,7 @@ static int __init init_workqueues(void)
 
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
+   local_irq_lock_init(pendingb_lock);
wq_numa_init();
 
/* initialize CPU pools */
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1677,5 +1677,6 @@ void __init radix_tree_init(void)
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
radix_tree_init_maxnodes();
+   local_irq_lock_init(radix_tree_preloads_lock);
hotcpu_notifier(radix_tree_callback, 0);
 }
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5786,6 +5786,7 @@ static int __init mem_cgroup_init(void)
int cpu, node;
 
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
+   local_irq_lock_init(event_lock);
 
for_each_possible_cpu(cpu)
INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -681,6 +681,14 @@ static inline void remote_lru_add_drain(
local_unlock_on(swapvec_lock, cpu);
 }
 
+static int __init lru_init(void)
+{
+   local_irq_lock_init(swapvec_lock);
+   local_irq_lock_init(rotate_lock);
+   return 0;
+}
+early_initcall(lru_init);
+
 #else
 
 /*
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -525,6 +525,7 @@ int __init netfilter_init(void)
 {
int ret;
 
+   local_irq_lock_init(xt_write_lock);
ret = register_pernet_subsys(&netfilter_net_ops);
if (ret < 0)
goto err;




Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-16 Thread Mike Galbraith
On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Mike Galbraith wrote:
> 
...
> > swapvec_lock?  Oodles of 'em?  Nope.
> 
> Well, it's a per cpu lock and the lru_cache_add() variants might be called
> from a gazillion of different call chains, but yes, it does not make a lot
> of sense. We'll have a look.

Adding explicit local_irq_lock_init() makes things heaps better, so
presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().

-Mike


Re: [RT] lockdep munching nr_list_entries like popcorn

2017-02-16 Thread Mike Galbraith
On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Mike Galbraith wrote:
> 
...
> > swapvec_lock?  Oodles of 'em?  Nope.
> 
> Well, it's a per cpu lock and the lru_cache_add() variants might be called
> from a gazillion of different call chains, but yes, it does not make a lot
> of sense. We'll have a look.

Adding explicit local_irq_lock_init() makes things heaps better, so
presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().

-Mike


[RT] lockdep munching nr_list_entries like popcorn

2017-02-15 Thread Mike Galbraith
4.9.10-rt6-virgin on 72 core +SMT box.

Below is 1 line per minute, box idling along daintily nibbling, I fire
up a parallel kbuild loop at 40465, and box gobbles greedily.

I have entries bumped to 128k, and chain bits to 18 so box will get
booted and run for a while before lockdep says "I quit".  With stock
settings, this box will barely get booted.  Seems the bigger the box,
the sooner you're gonna run out.  A NOPREEMPT kernel seems to nibble
entries too, but nowhere remotely near as greedily as RT.

   <...>-100309 [064] d13  2885.873312: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40129
   <...>-104320 [116] dN..211  2959.633630: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40155
 btrfs-transacti-1955  [043] d...111  3021.073949: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40183
   <...>-118865 [120] d13  3086.146794: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40209
  systemd-logind-4763  [068] d11  3146.953001: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40239
   <...>-123725 [032] dN..2..  3215.735774: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40285
   <...>-33968 [031] d...1..  3347.919001: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40409
   <...>-130886 [143] d12  3412.586643: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40465
   <...>-138291 [037] d11  3477.816405: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 42825
   <...>-67678 [137] d...112  3551.648282: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 47899
ksoftirqd/45-421   [045] d13  3617.926394: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 48751
 ihex2fw-24635 [035] d11  3686.899690: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 49345
   <...>-76041 [047] d...111  3758.230009: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 49757
stty-10772 [118] d...1..  3825.626815: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 50115
  kworker/u289:4-13376 [075] d12  3896.432428: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 51189
   <...>-92785 [047] d12  3905.137578: 
add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 51287

With stacktrace on, buffer contains 1010 __lru_cache_add+0x4f...

(gdb) list *__lru_cache_add+0x4f
0x811dca9f is in __lru_cache_add (./include/linux/locallock.h:59).
54
55  static inline void __local_lock(struct local_irq_lock *lv)
56  {
57  if (lv->owner != current) {
58  spin_lock_local(&lv->lock);
59  LL_WARN(lv->owner);
60  LL_WARN(lv->nestcnt);
61  lv->owner = current;
62  }
63  lv->nestcnt++;

...which seems to be this.

0x811dca80 is in __lru_cache_add (mm/swap.c:397).
392 }
393 EXPORT_SYMBOL(mark_page_accessed);
394
395 static void __lru_cache_add(struct page *page)
396 {
397 struct pagevec *pvec = &get_locked_var(swapvec_lock, 
lru_add_pvec);
398
399 get_page(page);
400 if (!pagevec_add(pvec, page) || PageCompound(page))
401 __pagevec_lru_add(pvec);

swapvec_lock?  Oodles of 'em?  Nope.

-Mike




[tip:timers/urgent] tick/broadcast: Prevent deadlock on tick_broadcast_lock

2017-02-13 Thread tip-bot for Mike Galbraith
Commit-ID:  202461e2f3c15dbfb05825d29ace0d20cdf55fa4
Gitweb: http://git.kernel.org/tip/202461e2f3c15dbfb05825d29ace0d20cdf55fa4
Author: Mike Galbraith <efa...@gmx.de>
AuthorDate: Mon, 13 Feb 2017 03:31:55 +0100
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Mon, 13 Feb 2017 09:49:31 +0100

tick/broadcast: Prevent deadlock on tick_broadcast_lock

tick_broadcast_lock is taken from interrupt context, but the following call
chain takes the lock without disabling interrupts:

[   12.703736]  _raw_spin_lock+0x3b/0x50
[   12.703738]  tick_broadcast_control+0x5a/0x1a0
[   12.703742]  intel_idle_cpu_online+0x22/0x100
[   12.703744]  cpuhp_invoke_callback+0x245/0x9d0
[   12.703752]  cpuhp_thread_fun+0x52/0x110
[   12.703754]  smpboot_thread_fn+0x276/0x320

So the following deadlock can happen:

   lock(tick_broadcast_lock);
   <Interrupt>
      lock(tick_broadcast_lock);

intel_idle_cpu_online() is the only place which violates the calling
convention of tick_broadcast_control(). This was caused by the removal of
the smp function call in course of the cpu hotplug rework.

Instead of slapping local_irq_disable/enable() at the call site, we can
relax the calling convention and handle it in the core code, which makes
the whole machinery more robust.

Fixes: 29d7bbada98e ("intel_idle: Remove superfluous SMP fuction call")
Reported-by: Gabriel C <nix.or@gmail.com>
Signed-off-by: Mike Galbraith <efa...@gmx.de>
Cc: Ruslan Ruslichenko <rrusl...@cisco.com>
Cc: Jiri Slaby <jsl...@suse.cz>
Cc: Greg KH <gre...@linuxfoundation.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: l...@lwn.net
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Anna-Maria Gleixner <anna-ma...@linutronix.de>
Cc: Sebastian Siewior <bige...@linutronix.de>
Cc: stable <sta...@vger.kernel.org>
Link: http://lkml.kernel.org/r/1486953115.5912.4.ca...@gmx.de
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 kernel/time/tick-broadcast.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 3109204..17ac99b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -347,17 +347,16 @@ static void tick_handle_periodic_broadcast(struct clock_event_device *dev)
  *
  * Called when the system enters a state where affected tick devices
  * might stop. Note: TICK_BROADCAST_FORCE cannot be undone.
- *
- * Called with interrupts disabled, so clockevents_lock is not
- * required here because the local clock event device cannot go away
- * under us.
  */
 void tick_broadcast_control(enum tick_broadcast_mode mode)
 {
struct clock_event_device *bc, *dev;
struct tick_device *td;
int cpu, bc_stopped;
+   unsigned long flags;
 
+   /* Protects also the local clockevent device. */
+   raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
td = this_cpu_ptr(&tick_cpu_device);
dev = td->evtdev;
 
@@ -365,12 +364,11 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
 * Is the device not affected by the powerstate ?
 */
if (!dev || !(dev->features & CLOCK_EVT_FEAT_C3STOP))
-   return;
+   goto out;
 
if (!tick_device_is_functional(dev))
-   return;
+   goto out;
 
-   raw_spin_lock(&tick_broadcast_lock);
cpu = smp_processor_id();
bc = tick_broadcast_device.evtdev;
bc_stopped = cpumask_empty(tick_broadcast_mask);
@@ -420,7 +418,8 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
tick_broadcast_setup_oneshot(bc);
}
}
-   raw_spin_unlock(&tick_broadcast_lock);
+out:
+   raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 EXPORT_SYMBOL_GPL(tick_broadcast_control);
 


[tip:timers/urgent] tick/broadcast: Prevent deadlock on tick_broadcast_lock

2017-02-13 Thread tip-bot for Mike Galbraith
Commit-ID:  202461e2f3c15dbfb05825d29ace0d20cdf55fa4
Gitweb: http://git.kernel.org/tip/202461e2f3c15dbfb05825d29ace0d20cdf55fa4
Author: Mike Galbraith 
AuthorDate: Mon, 13 Feb 2017 03:31:55 +0100
Committer:  Thomas Gleixner 
CommitDate: Mon, 13 Feb 2017 09:49:31 +0100

tick/broadcast: Prevent deadlock on tick_broadcast_lock

tick_broadcast_lock is taken from interrupt context, but the following call
chain takes the lock without disabling interrupts:

[   12.703736]  _raw_spin_lock+0x3b/0x50
[   12.703738]  tick_broadcast_control+0x5a/0x1a0
[   12.703742]  intel_idle_cpu_online+0x22/0x100
[   12.703744]  cpuhp_invoke_callback+0x245/0x9d0
[   12.703752]  cpuhp_thread_fun+0x52/0x110
[   12.703754]  smpboot_thread_fn+0x276/0x320

So the following deadlock can happen:

   lock(tick_broadcast_lock);
   
  lock(tick_broadcast_lock);

intel_idle_cpu_online() is the only place which violates the calling
convention of tick_broadcast_control(). This was caused by the removal of
the smp function call in course of the cpu hotplug rework.

Instead of slapping local_irq_disable/enable() at the call site, we can
relax the calling convention and handle it in the core code, which makes
the whole machinery more robust.

Fixes: 29d7bbada98e ("intel_idle: Remove superfluous SMP fuction call")
Reported-by: Gabriel C 
Signed-off-by: Mike Galbraith 
Cc: Ruslan Ruslichenko 
Cc: Jiri Slaby 
Cc: Greg KH 
Cc: Borislav Petkov 
Cc: l...@lwn.net
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Anna-Maria Gleixner 
Cc: Sebastian Siewior 
Cc: stable 
Link: http://lkml.kernel.org/r/1486953115.5912.4.ca...@gmx.de
Signed-off-by: Thomas Gleixner 

---
 kernel/time/tick-broadcast.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 3109204..17ac99b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -347,17 +347,16 @@ static void tick_handle_periodic_broadcast(struct 
clock_event_device *dev)
  *
  * Called when the system enters a state where affected tick devices
  * might stop. Note: TICK_BROADCAST_FORCE cannot be undone.
- *
- * Called with interrupts disabled, so clockevents_lock is not
- * required here because the local clock event device cannot go away
- * under us.
  */
 void tick_broadcast_control(enum tick_broadcast_mode mode)
 {
struct clock_event_device *bc, *dev;
struct tick_device *td;
int cpu, bc_stopped;
+   unsigned long flags;
 
+   /* Protects also the local clockevent device. */
+   raw_spin_lock_irqsave(_broadcast_lock, flags);
td = this_cpu_ptr(_cpu_device);
dev = td->evtdev;
 
@@ -365,12 +364,11 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
 * Is the device not affected by the powerstate ?
 */
if (!dev || !(dev->features & CLOCK_EVT_FEAT_C3STOP))
-   return;
+   goto out;
 
if (!tick_device_is_functional(dev))
-   return;
+   goto out;
 
-   raw_spin_lock(_broadcast_lock);
cpu = smp_processor_id();
bc = tick_broadcast_device.evtdev;
bc_stopped = cpumask_empty(tick_broadcast_mask);
@@ -420,7 +418,8 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
tick_broadcast_setup_oneshot(bc);
}
}
-	raw_spin_unlock(&tick_broadcast_lock);
+out:
+	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 EXPORT_SYMBOL_GPL(tick_broadcast_control);
 


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-12 Thread Mike Galbraith
On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote:
> On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
> 
> > > I think cgroup tree depth is a more significant issue; because of
> > > hierarchy we often do tree walks (up-to-root or down-to-task).
> > > 
> > > So creating elaborate trees is something I try not to do.
> > 
> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths.  There still are places that this overhead
> > shows up (e.g. the block controllers aren't too optimized) but it
> > isn't particularly difficult to make a handful of layers not matter at
> > all.
> 
> A handful of cpu bean counting layers stings considerably.

BTW, that overhead is also why merging cpu/cpuacct is not really as
wonderful as it may seem on paper.  If you only want to account, you
may not have anything to gain from group scheduling (in fact it may
wreck performance), but you'll pay for it.
 
> homer:/abuild # pipe-test 1  
> 2.010057 usecs/loop -- avg 2.010057 995.0 KHz
> 2.006630 usecs/loop -- avg 2.009714 995.2 KHz
> 2.127118 usecs/loop -- avg 2.021455 989.4 KHz
> 2.256244 usecs/loop -- avg 2.044934 978.0 KHz
> 1.993693 usecs/loop -- avg 2.039810 980.5 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt pipe-test 1
> 2.771641 usecs/loop -- avg 2.771641 721.6 KHz
> 2.432333 usecs/loop -- avg 2.737710 730.5 KHz
> 2.750493 usecs/loop -- avg 2.738988 730.2 KHz
> 2.663203 usecs/loop -- avg 2.731410 732.2 KHz
> 2.762564 usecs/loop -- avg 2.734525 731.4 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
> 2.967201 usecs/loop -- avg 2.967201 674.0 KHz
> 3.049012 usecs/loop -- avg 2.975382 672.2 KHz
> 3.031226 usecs/loop -- avg 2.980966 670.9 KHz
> 2.954259 usecs/loop -- avg 2.978296 671.5 KHz
> 2.933432 usecs/loop -- avg 2.973809 672.5 KHz
> ^C
> ...
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
> 4.417044 usecs/loop -- avg 4.417044 452.8 KHz
> 4.494913 usecs/loop -- avg 4.424831 452.0 KHz
> 4.253861 usecs/loop -- avg 4.407734 453.7 KHz
> 4.378059 usecs/loop -- avg 4.404766 454.1 KHz
> 4.179895 usecs/loop -- avg 4.382279 456.4 KHz


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-12 Thread Mike Galbraith
On Sun, 2017-02-12 at 13:16 -0800, Paul Turner wrote:
> 
> 
> On Thursday, February 9, 2017, Peter Zijlstra  wrote:
> > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > > The only case that this does not support vs ".threads" would be some
> > > hybrid where we co-mingle threads from different processes (with the
> > > processes belonging to the same node in the hierarchy).  I'm not aware
> > > of any usage that looks like this.
> > 
> > If I understand you right; this is a fairly common thing with RT where
> > we would stuff all the !rt threads of the various processes in a 'misc'
> > bucket.
> > 
> > Similarly, it happens that we stuff the various rt threads of processes
> > in a specific (shared) 'rt' bucket.
> > 
> > So I would certainly not like to exclude that setup.
> > 
> 
> Unless you're using rt groups I'm not sure this one really changes.  
> Whether the "misc" threads exist at the parent level or one below
> should not matter.

(with exclusive cpusets, a mask can exist at one and only one location)


Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

2017-02-12 Thread Mike Galbraith
On Mon, 2017-02-13 at 02:26 +0100, Gabriel C wrote:

> [5.276704]CPU0
> [5.312400]
> [5.347605]   lock(tick_broadcast_lock);
> [5.383163]   <Interrupt>
> [5.418457] lock(tick_broadcast_lock);
> [5.454015]
>  *** DEADLOCK ***
> 
> [5.557982] no locks held by cpuhp/0/14.

Oh, that looks familiar...

tick/broadcast: Make tick_broadcast_control() use raw_spinlock_irqsave()

Otherwise we end up with the lockdep splat below:

[   12.703619] =
[   12.703619] [ INFO: inconsistent lock state ]
[   12.703621] 4.10.0-rt1-rt #18 Not tainted
[   12.703622] -
[   12.703623] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[   12.703624] cpuhp/0/23 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   12.703625]  (tick_broadcast_lock){?.}, at: [] tick_broadcast_control+0x5a/0x1a0
[   12.703632] {IN-HARDIRQ-W} state was registered at:
[   12.703637] [] __lock_acquire+0xa21/0x1550
[   12.703639] [] lock_acquire+0xbd/0x250
[   12.703642] [] _raw_spin_lock_irqsave+0x53/0x70
[   12.703644] [] tick_broadcast_switch_to_oneshot+0x16/0x50
[   12.703646] [] tick_switch_to_oneshot+0x59/0xd0
[   12.703647] [] tick_init_highres+0x15/0x20
[   12.703652] [] hrtimer_run_queues+0x9f/0xe0
[   12.703654] [] run_local_timers+0x25/0x60
[   12.703656] [] update_process_times+0x2c/0x60
[   12.703659] [] tick_periodic+0x2f/0x100
[   12.703661] [] tick_handle_periodic+0x24/0x70
[   12.703664] [] local_apic_timer_interrupt+0x33/0x60
[   12.703669] [] smp_apic_timer_interrupt+0x38/0x50
[   12.703671] [] apic_timer_interrupt+0x9d/0xb0
[   12.703672] [] mwait_idle+0x94/0x290
[   12.703676] [] arch_cpu_idle+0xf/0x20
[   12.703677] [] default_idle_call+0x31/0x60
[   12.703681] [] do_idle+0x175/0x290
[   12.703683] [] cpu_startup_entry+0x48/0x50
[   12.703687] [] start_secondary+0x133/0x160
[   12.703689] [] verify_cpu+0x0/0xfc
[   12.703690] irq event stamp: 71
[   12.703691] hardirqs last  enabled at (71): [] _raw_spin_unlock_irq+0x2c/0x80
[   12.703696] hardirqs last disabled at (70): [] __schedule+0x9c/0x7e0
[   12.703699] softirqs last  enabled at (0): [] copy_process.part.34+0x5f1/0x22d0
[   12.703700] softirqs last disabled at (0): [<  (null)>] (null)
[   12.703701] 
[   12.703701] other info that might help us debug this:
[   12.703701]  Possible unsafe locking scenario:
[   12.703701] 
[   12.703701]CPU0
[   12.703702]
[   12.703702]   lock(tick_broadcast_lock);
[   12.703703]   <Interrupt>
[   12.703704] lock(tick_broadcast_lock);
[   12.703705] 
[   12.703705]  *** DEADLOCK ***
[   12.703705] 
[   12.703705] no locks held by cpuhp/0/23.
[   12.703705] 
[   12.703705] stack backtrace:
[   12.703707] CPU: 0 PID: 23 Comm: cpuhp/0 Not tainted 4.10.0-rt1-rt #18
[   12.703708] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[   12.703709] Call Trace:
[   12.703715]  dump_stack+0x85/0xc8
[   12.703717]  print_usage_bug+0x1ea/0x1fb
[   12.703719]  ? print_shortest_lock_dependencies+0x1c0/0x1c0
[   12.703721]  mark_lock+0x20d/0x290
[   12.703723]  __lock_acquire+0x8e6/0x1550
[   12.703724]  ? __lock_acquire+0x2ce/0x1550
[   12.703726]  ? load_balance+0x1b4/0xaf0
[   12.703728]  lock_acquire+0xbd/0x250
[   12.703729]  ? tick_broadcast_control+0x5a/0x1a0
[   12.703735]  ? efifb_probe+0x170/0x170
[   12.703736]  _raw_spin_lock+0x3b/0x50
[   12.703737]  ? tick_broadcast_control+0x5a/0x1a0
[   12.703738]  tick_broadcast_control+0x5a/0x1a0
[   12.703740]  ? efifb_probe+0x170/0x170
[   12.703742]  intel_idle_cpu_online+0x22/0x100
[   12.703744]  cpuhp_invoke_callback+0x245/0x9d0
[   12.703747]  ? finish_task_switch+0x78/0x290
[   12.703750]  ? check_preemption_disabled+0x9f/0x130
[   12.703752]  cpuhp_thread_fun+0x52/0x110
[   12.703754]  smpboot_thread_fn+0x276/0x320
[   12.703757]  kthread+0x10c/0x140
[   12.703759]  ? smpboot_update_cpumask_percpu_thread+0x130/0x130
[   12.703760]  ? kthread_park+0x90/0x90
[   12.703762]  ret_from_fork+0x2a/0x40
[   12.709790] intel_idle: lapic_timer_reliable_states 0x2

Signed-off-by: Mike Galbraith <efa...@gmx.de>
---
 kernel/time/tick-broadcast.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -357,6 +357,7 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
struct clock_event_device *bc, *dev;
struct tick_device *td;
int cpu, bc_stopped;
+   unsigned long flags;
 
	td = this_cpu_ptr(&tick_cpu_device);
dev = td->evtdev;
@@ -370,7 +371,7 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
if (!tick_device_is_functional(dev))
return;
 
-	raw_spin_lock(&tick_broadcast_lock);
+	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
cpu = smp_processor_id();
bc = tick_broadcast_device.evtdev;
bc_stopped = cpumask_empty(tick_broadcast_mask);
@@ -420,7 +421,


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-11 Thread Mike Galbraith
On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:

> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> > 
> > So creating elaborate trees is something I try not to do.
> 
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths.  There still are places that this overhead
> shows up (e.g. the block controllers aren't too optimized) but it
> isn't particularly difficult to make a handful of layers not matter at
> all.

A handful of cpu bean counting layers stings considerably.

homer:/abuild # pipe-test 1  
2.010057 usecs/loop -- avg 2.010057 995.0 KHz
2.006630 usecs/loop -- avg 2.009714 995.2 KHz
2.127118 usecs/loop -- avg 2.021455 989.4 KHz
2.256244 usecs/loop -- avg 2.044934 978.0 KHz
1.993693 usecs/loop -- avg 2.039810 980.5 KHz
^C
homer:/abuild # cgexec -g cpu:hurt pipe-test 1
2.771641 usecs/loop -- avg 2.771641 721.6 KHz
2.432333 usecs/loop -- avg 2.737710 730.5 KHz
2.750493 usecs/loop -- avg 2.738988 730.2 KHz
2.663203 usecs/loop -- avg 2.731410 732.2 KHz
2.762564 usecs/loop -- avg 2.734525 731.4 KHz
^C
homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
2.967201 usecs/loop -- avg 2.967201 674.0 KHz
3.049012 usecs/loop -- avg 2.975382 672.2 KHz
3.031226 usecs/loop -- avg 2.980966 670.9 KHz
2.954259 usecs/loop -- avg 2.978296 671.5 KHz
2.933432 usecs/loop -- avg 2.973809 672.5 KHz
^C
...
homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
4.417044 usecs/loop -- avg 4.417044 452.8 KHz
4.494913 usecs/loop -- avg 4.424831 452.0 KHz
4.253861 usecs/loop -- avg 4.407734 453.7 KHz
4.378059 usecs/loop -- avg 4.404766 454.1 KHz
4.179895 usecs/loop -- avg 4.382279 456.4 KHz


Re: [PATCH 2/2] sched/deadline: Throttle a constrained deadline task activated after the deadline

2017-02-11 Thread Mike Galbraith
On Sat, 2017-02-11 at 08:15 +0100, luca abeni wrote:
> Hi Daniel,
> 
> On Fri, 10 Feb 2017 20:48:11 +0100
> Daniel Bristot de Oliveira  wrote:
> 
> > During the activation, CBS checks if it can reuse the current
> > task's
> > runtime and period. If the deadline of the task is in the past, CBS
> > cannot use the runtime, and so it replenishes the task. This rule
> > works fine for implicit deadline tasks (deadline == period), and
> > the
> > CBS was designed for implicit deadline tasks. However, a task with
> > constrained deadline (deadline < period) might be awakened after the
> > deadline, but before the next period. In this case, replenishing the
> > task would allow it to run for runtime / deadline. As in this case
> > deadline < period, CBS enables a task to run for more than the
> > runtime/period. In a very loaded system, this can cause the domino
> > effect, making other tasks miss their deadlines.
> 
> I think you are right: SCHED_DEADLINE implements the original CBS
> algorithm here, but uses relative deadlines different from periods in
> other places (while the original algorithm only considered relative
> deadlines equal to periods).
> And this mix is dangerous... I think your fix is correct, and cures a
> real problem.

Both of these should be tagged for stable as well, or?

-Mike


Re: [GIT pull] x86/timers for 4.10

2017-02-09 Thread Mike Galbraith
On Thu, 2017-02-09 at 16:21 +0100, Thomas Gleixner wrote:
> On Thu, 9 Feb 2017, Mike Galbraith wrote:
> 
> > On Thu, 2017-02-09 at 16:07 +0100, Thomas Gleixner wrote:
> > > On Wed, 8 Feb 2017, Mike Galbraith wrote:
> > > > On Wed, 2017-02-08 at 12:44 +0100, Thomas Gleixner wrote:
> > > > > On Mon, 6 Feb 2017, Olof Johansson wrote:
> > > > > > [0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0:
> > > > > > -6495898515190607 CPU1: -6495898517158354
> > > > > 
> > > > > Yay, another "clever" BIOS 
> > > > 
> > > > Oh yeah, that reminds me...
> > > > 
> > > > I met one such box, and the adjustment code did salvage it, but I had
> > > > to cheat a little for it to do so reliably, as it would sometimes still
> > > > see a delta of 1 or 2 whole cycles, and hand me a useless wreck instead
> > > > quick like bunny big box.
> > > 
> > > Can you share your cheatery ?
> > 
> > I didn't keep it, it was just a bandaid for a fleeting use, dirt simple
> > ignore microscopic delta.
> 
> Can you send me the dmesg output of that box for a good and a bad case or
> don't you have access to it anymore?

I don't even remember which box it was, but I can try to find it again
during idle moments.

-Mike 


Re: [GIT pull] x86/timers for 4.10

2017-02-09 Thread Mike Galbraith
On Thu, 2017-02-09 at 16:07 +0100, Thomas Gleixner wrote:
> On Wed, 8 Feb 2017, Mike Galbraith wrote:
> > On Wed, 2017-02-08 at 12:44 +0100, Thomas Gleixner wrote:
> > > On Mon, 6 Feb 2017, Olof Johansson wrote:
> > > > [0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0:
> > > > -6495898515190607 CPU1: -6495898517158354
> > > 
> > > Yay, another "clever" BIOS 
> > 
> > Oh yeah, that reminds me...
> > 
> > I met one such box, and the adjustment code did salvage it, but I had
> > to cheat a little for it to do so reliably, as it would sometimes still
> > see a delta of 1 or 2 whole cycles, and hand me a useless wreck instead
> > quick like bunny big box.
> 
> Can you share your cheatery ?

I didn't keep it, it was just a bandaid for a fleeting use, dirt simple
ignore microscopic delta.

-Mike


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-09 Thread Mike Galbraith
On Thu, 2017-02-09 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > The only case that this does not support vs ".threads" would be some
> > hybrid where we co-mingle threads from different processes (with the
> > processes belonging to the same node in the hierarchy).  I'm not aware
> > of any usage that looks like this.
> 
> If I understand you right; this is a fairly common thing with RT where
> we would stuff all the !rt threads of the various processes in a 'misc'
> bucket.
> 
> Similarly, it happens that we stuff the various rt threads of processes
> in a specific (shared) 'rt' bucket.
> 
> So I would certainly not like to exclude that setup.

Absolutely, you just described my daily bread performance setup.

-Mike


Re: [GIT pull] x86/timers for 4.10

2017-02-08 Thread Mike Galbraith
On Wed, 2017-02-08 at 12:44 +0100, Thomas Gleixner wrote:
> On Mon, 6 Feb 2017, Olof Johansson wrote:
> > [0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0:
> > -6495898515190607 CPU1: -6495898517158354
> 
> Yay, another "clever" BIOS 

Oh yeah, that reminds me...

I met one such box, and the adjustment code did salvage it, but I had
to cheat a little for it to do so reliably, as it would sometimes still
see a delta of 1 or 2 whole cycles, and hand me a useless wreck instead
quick like bunny big box.

-Mike


Re: [RFC,v2 3/3] sched: ignore task_h_load for CPU_NEWLY_IDLE

2017-02-08 Thread Mike Galbraith
On Wed, 2017-02-08 at 09:43 +0100, Uladzislau Rezki wrote:
> From: Uladzislau 2 Rezki 
> 
> A load balancer calculates imbalance factor for particular shed
 ^sched
> domain and tries to steal up the prescribed amount of weighted load.
> However, a small imbalance factor would sometimes prevent us from
> stealing any tasks at all. When a CPU is newly idle, it should
> steal first task which passes a migration criteria.
 s/passes a/meets the
> 
> Signed-off-by: Uladzislau 2 Rezki 
> ---
>  kernel/sched/fair.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 232ef3c..29e0d7f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> 		env->loop++;
> @@ -6824,8 +6832,9 @@ static int detach_tasks(struct lb_env *env)
> 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> 			goto next;
> 
> -		if ((load / 2) > env->imbalance)
> -			goto next;
> +		if (env->idle != CPU_NEWLY_IDLE)
> +			if ((load / 2) > env->imbalance)
> +				goto next;

Those two ifs could be one ala if (foo && bar).


Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone

2017-02-07 Thread Mike Galbraith
On Tue, 2017-02-07 at 19:58 -0900, Kent Overstreet wrote:
> On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > 
> > > > > So.. what needs to be done there?
> > > 
> > > > But, I just got an idea for how to handle this that might be halfway 
> > > > sane, maybe
> > > > I'll try and come up with a patch...
> > > 
> > > Ok, here's such a patch, only lightly tested:
> > 
> > I guess it would be nice for me to test it... but what it is against?
> > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> 
> Sorry, I forgot I had a few other patches in my branch that touch
> mempool/biosets code.
> 
> Also, after thinking about it more and looking at the relevant code, I'm 
> pretty
> sure we don't need rescuer threads for block devices that just split bios - 
> i.e.
> most of them, so I changed my patch to do that.
> 
> Tested it by ripping out the current->bio_list checks/workarounds from the
> bcache code, appears to work:

Patch killed every last one of them, but..

homer:/root # dmesg|grep WARNING
[   11.701447] WARNING: CPU: 4 PID: 801 at block/bio.c:388 
bio_alloc_bioset+0x1a7/0x240
[   11.711027] WARNING: CPU: 4 PID: 801 at block/blk-core.c:2013 
generic_make_request+0x191/0x1f0
[   19.728989] WARNING: CPU: 0 PID: 717 at block/bio.c:388 
bio_alloc_bioset+0x1a7/0x240
[   19.737020] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 
generic_make_request+0x191/0x1f0
[   19.746173] WARNING: CPU: 0 PID: 717 at block/bio.c:388 
bio_alloc_bioset+0x1a7/0x240
[   19.755260] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 
generic_make_request+0x191/0x1f0
[   19.763837] WARNING: CPU: 0 PID: 717 at block/bio.c:388 
bio_alloc_bioset+0x1a7/0x240
[   19.772526] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 
generic_make_request+0x191/0x1f0



Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone

2017-02-07 Thread Mike Galbraith
On Tue, 2017-02-07 at 21:39 +0100, Pavel Machek wrote:
> On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > 
> > > > So.. what needs to be done there?
> > 
> > > But, I just got an idea for how to handle this that might be halfway 
> > > sane, maybe
> > > I'll try and come up with a patch...
> > 
> > Ok, here's such a patch, only lightly tested:
> 
> I guess it would be nice for me to test it... but what it is against?
> I tried after v4.10-rc5 and linux-next, but got rejects in both cases.

It wedged into master easily enough (box still seems to work.. but I'll
be rebooting in a very few seconds just in case:), but threads on my
desktop box only dropped from 73 to 71.  Poo.

-Mike


Re: tip: demise of tsk_cpus_allowed() and tsk_nr_cpus_allowed()

2017-02-06 Thread Mike Galbraith
On Mon, 2017-02-06 at 13:29 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:
> 
> > On Mon, 2017-02-06 at 11:31 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <efa...@gmx.de> wrote:
> > > 
> > > > Hi Ingo,
> > > > 
> > > > Doing my ~daily tip merge of -rt, I couldn't help noticing $subject, as
> > > > they grow more functionality in -rt, which is allegedly slowly but
> > > > surely headed toward merge.  I don't suppose they could be left intact?
> > > >  I can easily restore them in my local tree, but it seems a bit of a
> > > > shame to whack these integration friendly bits.
> > > 
> > > Oh, I missed that. How is tsk_cpus_allowed() wrapped in -rt right now?
> > 
> > RT extends them to reflect whether migration is disabled or not.
> > 
> > +/* Future-safe accessor for struct task_struct's cpus_allowed. */
> > +static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p)
> > +{
> > +   if (__migrate_disabled(p))
> > +   return cpumask_of(task_cpu(p));
> > +
> > +   return &p->cpus_allowed;
> > +}
> > +
> > +static inline int tsk_nr_cpus_allowed(struct task_struct *p)
> > +{
> > +   if (__migrate_disabled(p))
> > +   return 1;
> > +   return p->nr_cpus_allowed;
> > +}
> 
> So ... I think the cleaner approach in -rt would be to introduce 
> ->cpus_allowed_saved, and when disabling/enabling migration then saving the 
> current mask there and changing ->cpus_allowed - and then restoring it when 
> re-enabling migration.
> 
> This means ->cpus_allowed could be used by the scheduler directly, no 
> wrappery 
> would be required, AFAICS.
> 
> ( Some extra care would be required in places that change ->cpus_allowed 
> because 
>   they'd now have to be aware of ->cpus_allowed_saved. )
> 
> Am I missing something?

I suppose it's a matter of personal preference.  I prefer the above,
looks nice and clean to me.  Hohum, I'll just put them back locally for
the nonce.  My trees are only place holders until official releases
catch up anyway.

-Mike


Re: tip: demise of tsk_cpus_allowed() and tsk_nr_cpus_allowed()

2017-02-06 Thread Mike Galbraith
On Mon, 2017-02-06 at 11:31 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:
> 
> > Hi Ingo,
> > 
> > Doing my ~daily tip merge of -rt, I couldn't help noticing $subject, as
> > they grow more functionality in -rt, which is allegedly slowly but
> > surely headed toward merge.  I don't suppose they could be left intact?
> >  I can easily restore them in my local tree, but it seems a bit of a
> > shame to whack these integration friendly bits.
> 
> Oh, I missed that. How is tsk_cpus_allowed() wrapped in -rt right now?

RT extends them to reflect whether migration is disabled or not.

+/* Future-safe accessor for struct task_struct's cpus_allowed. */
+static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p)
+{
+   if (__migrate_disabled(p))
+   return cpumask_of(task_cpu(p));
+
+   return &p->cpus_allowed;
+}
+
+static inline int tsk_nr_cpus_allowed(struct task_struct *p)
+{
+   if (__migrate_disabled(p))
+   return 1;
+   return p->nr_cpus_allowed;
+}


tip: demise of tsk_cpus_allowed() and tsk_nr_cpus_allowed()

2017-02-05 Thread Mike Galbraith
Hi Ingo,

Doing my ~daily tip merge of -rt, I couldn't help noticing $subject, as
they grow more functionality in -rt, which is allegedly slowly but
surely headed toward merge.  I don't suppose they could be left intact?
 I can easily restore them in my local tree, but it seems a bit of a
shame to whack these integration friendly bits.

-Mike


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-03 Thread Mike Galbraith
On Fri, 2017-02-03 at 14:37 +0100, Peter Zijlstra wrote:
> On Fri, Feb 03, 2017 at 01:59:34PM +0100, Mike Galbraith wrote:

> > FWIW, I'm not seeing stalls/hangs while beating hotplug up in tip. (so
> > next grew a wart?)
> 
> I've seen it on tip. It looks like hot unplug goes really slow when
> there's running tasks on the CPU being taken down.
> 
> What I did was something like:
> 
>   taskset -p $((1<<1)) $$
>   for ((i=0; i<20; i++)) do while :; do :; done & done
> 
>   taskset -p $((1<<0)) $$
>   echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> And with those 20 tasks stuck sucking cycles on CPU1, the unplug goes
> _really_ slow and the RCU stall triggers. What I suspect happens is that
> hotplug stops participating in the RCU state machine early, but only
> tells RCU about it really late, and in between it gets suspicious it
> takes too long.

Ah.  I wasn't doing a really hard pounding, just running a couple
instances of Steven's script.  To beat hell out of it, I add futextest,
stockfish and a small kbuild on a big box.

-Mike


x86-tip/master: Kernel panic - not syncing: snd_hda_codec_hdmi hdaudioC1D0: unrecoverable failure

2017-02-03 Thread Mike Galbraith
On a lark, I tried a suspend/resume cycle after a bit of uneventful
beating on cpu hotplug.  Suspend worked fine, resume not so well.

[ 1571.838698] Call Trace:
[ 1571.838703]  __schedule+0x32c/0xd10
[ 1571.838706]  ? _raw_spin_unlock_irqrestore+0x36/0x60
[ 1571.838710]  schedule+0x3d/0x90
[ 1571.838712]  schedule_timeout+0x2d8/0x620
[ 1571.838718]  ? snd_hdac_bus_send_cmd+0xab/0x110 [snd_hda_core]
[ 1571.838722]  ? lock_timer_base+0xa0/0xa0
[ 1571.838728]  msleep+0x39/0x50
[ 1571.838734]  azx_rirb_get_response+0x4a/0x270 [snd_hda_codec]
[ 1571.838740]  azx_get_response+0x33/0x40 [snd_hda_codec]
[ 1571.838743]  snd_hdac_bus_exec_verb_unlocked+0x169/0x2f0 [snd_hda_core]
[ 1571.838748]  codec_exec_verb+0x8c/0x120 [snd_hda_codec]
[ 1571.838753]  snd_hdac_exec_verb+0x17/0x40 [snd_hda_core]
[ 1571.838756]  snd_hdac_codec_read+0x34/0x50 [snd_hda_core]
[ 1571.838759]  ? snd_hdac_regmap_read_raw+0x10/0x20 [snd_hda_core]
[ 1571.838763]  read_pin_sense+0x35/0x80 [snd_hda_codec]
[ 1571.838768]  jack_detect_update+0x82/0xc0 [snd_hda_codec]
[ 1571.838772]  snd_hda_pin_sense+0x5e/0x70 [snd_hda_codec]
[ 1571.838775]  hdmi_present_sense+0x128/0x390 [snd_hda_codec_hdmi]
[ 1571.838781]  ? hda_call_codec_resume+0x120/0x120 [snd_hda_codec]
[ 1571.838783]  ? pm_runtime_force_suspend+0x90/0x90
[ 1571.838786]  generic_hdmi_resume+0x4d/0x60 [snd_hda_codec_hdmi]
[ 1571.838790]  hda_call_codec_resume+0xd0/0x120 [snd_hda_codec]
[ 1571.838794]  hda_codec_runtime_resume+0x35/0x50 [snd_hda_codec]
[ 1571.838795]  pm_runtime_force_resume+0x93/0xe0
[ 1571.838798]  dpm_run_callback+0xba/0x300
[ 1571.838801]  device_resume+0x10e/0x240
[ 1571.838804]  ? pm_dev_dbg+0x80/0x80
[ 1571.838809]  async_resume+0x1d/0x50
[ 1571.838811]  async_run_entry_fn+0x39/0x170
[ 1571.838814]  process_one_work+0x1e1/0x670
[ 1571.838815]  ? process_one_work+0x162/0x670
[ 1571.838819]  worker_thread+0x137/0x4b0
[ 1571.838824]  kthread+0x10c/0x140
[ 1571.838825]  ? process_one_work+0x670/0x670
[ 1571.838827]  ? kthread_park+0x90/0x90
[ 1571.838830]  ret_from_fork+0x31/0x40
[ 1571.838837] Kernel panic - not syncing: snd_hda_codec_hdmi hdaudioC1D0: 
unrecoverable failure
   
[ 1571.838839] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GE   
4.10.0-tip-default_lockdep #27
[ 1571.838840] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[ 1571.838840] Call Trace:
[ 1571.838841]  
[ 1571.838843]  dump_stack+0x85/0xc9
[ 1571.838846]  panic+0xe0/0x233
[ 1571.838850]  ? pm_dev_dbg+0x80/0x80
[ 1571.838852]  dpm_watchdog_handler+0x4e/0x60
[ 1571.838853]  call_timer_fn+0x95/0x340
[ 1571.838855]  ? call_timer_fn+0x5/0x340
[ 1571.838857]  ? pm_dev_dbg+0x80/0x80
[ 1571.838860]  run_timer_softirq+0x230/0x620
[ 1571.838862]  ? ktime_get+0xac/0x140
[ 1571.838867]  __do_softirq+0xc0/0x48b
[ 1571.838871]  irq_exit+0xe5/0xf0
[ 1571.838873]  smp_apic_timer_interrupt+0x3d/0x50
[ 1571.838875]  apic_timer_interrupt+0x9d/0xb0
[ 1571.838877] RIP: 0010:cpuidle_enter_state+0xe9/0x320
[ 1571.838878] RSP: 0018:81c03dd0 EFLAGS: 0202 ORIG_RAX: 
ff10
[ 1571.838879] RAX: 81c16540 RBX: e8c0a200 RCX: 
[ 1571.838880] RDX: 81c16540 RSI: 0001 RDI: 81c16540
[ 1571.838881] RBP: 81c03e08 R08:  R09: 
[ 1571.838882] R10: 0001 R11: 0014 R12: 0003
[ 1571.838882] R13:  R14: 81d23360 R15: 016df8bcf98d
[ 1571.838883]  
[ 1571.838892]  cpuidle_enter+0x17/0x20
[ 1571.838894]  call_cpuidle+0x23/0x40
[ 1571.838896]  do_idle+0x172/0x200
[ 1571.838899]  cpu_startup_entry+0x62/0x70
[ 1571.838902]  rest_init+0x138/0x140
[ 1571.838903]  ? rest_init+0x5/0x140
[ 1571.838907]  start_kernel+0x4b3/0x4c0
[ 1571.838909]  ? set_init_arg+0x55/0x55
[ 1571.838911]  ? early_idt_handler_array+0x120/0x120
[ 1571.838913]  x86_64_start_reservations+0x2a/0x2c
[ 1571.838915]  x86_64_start_kernel+0x13d/0x14c
[ 1571.838919]  start_cpu+0x14/0x14


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-03 Thread Mike Galbraith
On Fri, 2017-02-03 at 09:53 +0100, Peter Zijlstra wrote:
> On Fri, Feb 03, 2017 at 10:03:14AM +0530, Sachin Sant wrote:

> > I ran few cycles of cpu hot(un)plug tests. In most cases it works except one
> > where I ran into rcu stall:
> > 
> > [  173.493453] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [  173.493473] > >  > > 8-...: (2 GPs behind) idle=006/140/0 
> > softirq=0/0 fqs=2996 
> > [  173.493476] > >  > > (detected by 0, t=6002 jiffies, g=885, c=884, 
> > q=6350)
> 
> Right, I actually saw that too, but I don't think that would be related
> to my patch. I'll see if I can dig into this though, ought to get fixed
> regardless.

FWIW, I'm not seeing stalls/hangs while beating hotplug up in tip. (so
next grew a wart?)

-Mike


[patch-tip] drivers/mtd: Apply sched include reorg to tests/mtd_test.h

2017-02-03 Thread Mike Galbraith

signal_pending() moved to linux/sched/signal.h, go get it.

Signed-off-by: Mike Galbraith <efa...@gmx.de>
---
 drivers/mtd/tests/mtd_test.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/mtd/tests/mtd_test.h
+++ b/drivers/mtd/tests/mtd_test.h
@@ -1,5 +1,5 @@
 #include <linux/mtd/mtd.h>
-#include <linux/sched.h>
+#include <linux/sched/signal.h>
 
 static inline int mtdtest_relax(void)
 {


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-02 Thread Mike Galbraith
On Thu, 2017-02-02 at 16:55 +0100, Peter Zijlstra wrote:
> On Tue, Jan 31, 2017 at 10:22:47AM -0700, Ross Zwisler wrote:
> > On Tue, Jan 31, 2017 at 4:48 AM, Mike Galbraith <efa...@gmx.de>
> > wrote:
> > > On Tue, 2017-01-31 at 16:30 +0530, Sachin Sant wrote:
> 
> 
> Could some of you test this? It seems to cure things in my (very)
> limited testing.

Hotplug stress gripe is gone here.

-Mike


Re: [PATCH] x86/microcode: Do not access the initrd after it has been freed

2017-01-31 Thread Mike Galbraith
On Tue, 2017-01-31 at 18:49 +0100, Borislav Petkov wrote:
> On Tue, Jan 31, 2017 at 01:31:00PM +0100, Borislav Petkov wrote:
> > On Tue, Jan 31, 2017 at 12:31:17PM +0100, Mike Galbraith wrote:
> > > (bisect fingered irqdomain: Avoid activating interrupts more than once)
> > 
> > Yeah, that one is not kosher on x86. It broke IO-APIC timer on a box
> > here.
> 
> Mike,
> 
> does the below hunk fix the issue for ya? (Ontop of tip/master, without
> the revert).
> 
> It does fix my APIC timer detection failure.

Yup, need a new doorstop.

-Mike


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-01-31 Thread Mike Galbraith
On Tue, 2017-01-31 at 16:30 +0530, Sachin Sant wrote:
> Trimming the cc list.
> 
> > > I assume I should be worried?
> > 
> > Thanks for the report. No need to worry, the bug has existed for a
> > while, this patch just turns on the warning ;-)
> > 
> > The following commit queued up in tip/sched/core should fix your
> > issues (assuming you see the same callstack on all your powerpc
> > machines):
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=1b1d62254df0fe42a711eb71948f915918987790
> 
> I still see this warning with today’s next running inside PowerVM LPAR
> on a POWER8 box. The stack trace is different from what Michael had
> reported.
> 
> Easiest way to recreate this is to Online/offline cpu’s.

(Ditto tip.today, x86_64 + hotplug stress)

[   94.804196] [ cut here ]
[   94.804201] WARNING: CPU: 3 PID: 27 at kernel/sched/sched.h:804 
set_next_entity+0x81c/0x910
[   94.804201] rq->clock_update_flags < RQCF_ACT_SKIP
[   94.804202] Modules linked in: ebtable_filter(E) ebtables(E) fuse(E) 
bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) 
xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) 
ipt_REJECT(E) iptable_raw(E) iptable_filter(E) ip6table_mangle(E) 
nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) 
nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) 
ip6table_filter(E) ip6_tables(E) x_tables(E) x86_pkg_temp_thermal(E) 
intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) 
crct10dif_pclmul(E) crc32_pclmul(E) nls_iso8859_1(E) crc32c_intel(E) 
nls_cp437(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) 
snd_hda_codec_generic(E) nfsd(E) aesni_intel(E) snd_hda_intel(E) 
snd_hda_codec(E) snd_hwdep(E) aes_x86_64(E) snd_hda_core(E) crypto_simd(E)
[   94.804220]  snd_pcm(E) auth_rpcgss(E) snd_timer(E) snd(E) iTCO_wdt(E) 
iTCO_vendor_support(E) joydev(E) nfs_acl(E) lpc_ich(E) cryptd(E) lockd(E) 
intel_smartconnect(E) mfd_core(E) i2c_i801(E) battery(E) glue_helper(E) 
mei_me(E) shpchp(E) mei(E) soundcore(E) grace(E) fan(E) thermal(E) 
tpm_infineon(E) pcspkr(E) sunrpc(E) efivarfs(E) sr_mod(E) cdrom(E) 
hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) 
usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) 
sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ehci_pci(E) 
ttm(E) libahci(E) xhci_hcd(E) ehci_hcd(E) r8169(E) mii(E) libata(E) drm(E) 
usbcore(E) fjes(E) video(E) button(E) af_packet(E) sd_mod(E) vfat(E) fat(E) 
ext4(E) crc16(E) jbd2(E) mbcache(E) dm_mod(E) loop(E) sg(E) scsi_mod(E) 
autofs4(E)
[   94.804246] CPU: 3 PID: 27 Comm: migration/3 Tainted: GE   
4.10.0-tip #15
[   94.804247] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[   94.804247] Call Trace:
[   94.804251]  ? dump_stack+0x5c/0x7c
[   94.804253]  ? __warn+0xc4/0xe0
[   94.804255]  ? warn_slowpath_fmt+0x4f/0x60
[   94.804256]  ? set_next_entity+0x81c/0x910
[   94.804258]  ? pick_next_task_fair+0x20a/0xa20
[   94.804259]  ? sched_cpu_starting+0x50/0x50
[   94.804260]  ? sched_cpu_dying+0x237/0x280
[   94.804261]  ? sched_cpu_starting+0x50/0x50
[   94.804262]  ? cpuhp_invoke_callback+0x83/0x3e0
[   94.804263]  ? take_cpu_down+0x56/0x90
[   94.804266]  ? multi_cpu_stop+0xa9/0xd0
[   94.804267]  ? cpu_stop_queue_work+0xb0/0xb0
[   94.804268]  ? cpu_stopper_thread+0x81/0x110
[   94.804270]  ? smpboot_thread_fn+0xfe/0x150
[   94.804272]  ? kthread+0xf4/0x130
[   94.804273]  ? sort_range+0x20/0x20
[   94.804274]  ? kthread_park+0x80/0x80
[   94.804276]  ? ret_from_fork+0x26/0x40
[   94.804277] ---[ end trace b0a9e4aa1fb229bb ]---



Re: [PATCH] x86/microcode: Do not access the initrd after it has been freed

2017-01-31 Thread Mike Galbraith
On Tue, 2017-01-31 at 11:01 +0100, Borislav Petkov wrote:
> On Tue, Jan 31, 2017 at 08:43:55AM +0100, Ingo Molnar wrote:
> > (Cc:-ed Mike as this could explain his early boot crash/hang?
> > Mike: please try -tip f18a8a0143b1 that I just pushed out. )
> 
> One other thing to try, Mike, is boot with "dis_ucode_ldr". See whether
> that makes it go away.

(bisect fingered irqdomain: Avoid activating interrupts more than once)


Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27

2017-01-31 Thread Mike Galbraith
On Tue, 2017-01-31 at 09:54 +0100, Ingo Molnar wrote:

> > Fast ain't gonna happen, 5bf728f02218 bricked.
> 
> :-/
> 
> Next point would be f9a42e0d58cf I suspect, to establish that Linus's latest 
> kernel is fine. That means it's in one of the ~200 -tip commits - should be 
> bisectable in 8-10 steps from that point on.

It bisected cleanly to the below, confirmed via quilt push/pop revert. 
 According to the symptoms my box exhibits, patchlet needs to be
twiddled to ensure that interrupts are enabled at _least_ once ;-)

08d85f3ea99f1eeafc4e8507936190e86a16ee8c is the first bad commit
commit 08d85f3ea99f1eeafc4e8507936190e86a16ee8c
Author: Marc Zyngier 
Date:   Tue Jan 17 16:00:48 2017 +

irqdomain: Avoid activating interrupts more than once

Since commit f3b0946d629c ("genirq/msi: Make sure PCI MSIs are
activated early"), we can end-up activating a PCI/MSI twice (once
at allocation time, and once at startup time).

This is normally of no consequences, except that there is some
HW out there that may misbehave if activate is used more than once
(the GICv3 ITS, for example, uses the activate callback
to issue the MAPVI command, and the architecture spec says that
"If there is an existing mapping for the EventID-DeviceID
combination, behavior is UNPREDICTABLE").

While this could be worked around in each individual driver, it may
make more sense to tackle the issue at the core level. In order to
avoid getting in that situation, let's have a per-interrupt flag
to remember if we have already activated that interrupt or not.

Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early")
Reported-and-tested-by: Andre Przywara 
Signed-off-by: Marc Zyngier 
Cc: sta...@vger.kernel.org
Link: 
http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyng...@arm.com
Signed-off-by: Thomas Gleixner 

:04 04 eed859b1f22b822f4400e7c050929d8b4c4a146d 
39097c0315a12c0a3809bb82687fa56b1c9e5633 M  include
:04 04 7dfe2ca8e1de55e890d0e6a761bab9c07c6f5f8a 
e28a3a54a68866273b474e2053b16155987e06f2 M  kernel
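
The "per-interrupt flag" idea in the commit message above can be sketched minimally as follows. This is an illustrative model only — the struct, function names, and counter are made up for demonstration and are not the kernel's real irqdomain API; the point is that the hardware-touching activate path runs at most once even if the core invokes activation both at allocation and at startup time.

```c
#include <stdbool.h>

/* Hypothetical sketch of the activate-once approach described above.
 * Names are illustrative, not the kernel's real data structures. */
struct irq_state_sketch {
    bool activated;       /* set once the HW mapping has been issued */
    int hw_activations;   /* counts real HW activations (e.g. an ITS MAPVI) */
};

static void hw_activate(struct irq_state_sketch *irq)
{
    irq->hw_activations++;   /* stands in for the HW-touching callback */
}

void irq_activate_sketch(struct irq_state_sketch *irq)
{
    if (irq->activated)
        return;              /* a second call becomes a harmless no-op */
    hw_activate(irq);
    irq->activated = true;
}
```

Under this sketch, calling `irq_activate_sketch()` twice performs exactly one hardware activation — which is the behavior the commit needs for hardware where a repeated mapping is UNPREDICTABLE.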


Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27

2017-01-31 Thread Mike Galbraith
On Tue, 2017-01-31 at 08:28 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:
> 
> > On Mon, 2017-01-30 at 11:59 +, Matt Fleming wrote:
> > > On Sat, 28 Jan, at 08:21:05AM, Mike Galbraith wrote:
> > > > Running Steven's hotplug stress script in tip.today.  Config is
> > > > NOPREEMPT, tune for maximum build time (enterprise default-ish).
> > > > 
> > > > [   75.268049] x86: Booting SMP configuration:
> > > > [   75.268052] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > > > [   75.279994] smpboot: Booting Node 0 Processor 2 APIC 0x4
> > > > [   75.294617] smpboot: Booting Node 0 Processor 4 APIC 0x1
> > > > [   75.310698] smpboot: Booting Node 0 Processor 5 APIC 0x3
> > > > [   75.359056] smpboot: CPU 3 is now offline
> > > > [   75.415505] smpboot: CPU 4 is now offline
> > > > [   75.479985] smpboot: CPU 5 is now offline
> > > > [   75.550674] [ cut here ]
> > > > [   75.550678] WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804
> > > > assert_clock_updated.isra.62.part.63+0x25/0x27
> > > > [   75.550679] rq->clock_update_flags < RQCF_ACT_SKIP
> > > 
> > > The following patch queued in tip/sched/core should fix this issue:
> > 
> > Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew
> > an early boot brick problem.
> 
> That's bad - could you perhaps try to bisect it? All recently queued up 
> patches 
> that could cause such problems should be readily bisectable.
> 
> The bisection might be faster if you first checked whether 5bf728f02218 works 
> - if 
> it does then the bug is in the patches in WIP.x86/boot or WIP.x86/fpu.

Fast ain't gonna happen, 5bf728f02218 bricked.

-Mike


Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27

2017-01-31 Thread Mike Galbraith
On Tue, 2017-01-31 at 08:45 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:
> 
> > On Tue, 2017-01-31 at 08:28 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <efa...@gmx.de> wrote:
> > 
> > > > Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew
> > > > an early boot brick problem.
> > > 
> > > That's bad - could you perhaps try to bisect it? All recently queued up 
> > > patches 
> > > that could cause such problems should be readily bisectable.
> > 
> > Yeah, I'll give it a go as soon as I get some other stuff done.
> 
> Please double check whether -tip f18a8a0143b1 works for you (latestest -tip 
> freshly pushed out), it might be that my bogus conflict resolution of a 
> x86/microcode conflict is what caused your boot problems?

Oh darn, it's a nogo.  Back to plan A.

-Mike


Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27

2017-01-30 Thread Mike Galbraith
On Tue, 2017-01-31 at 08:28 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:

> > Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew
> > an early boot brick problem.
> 
> That's bad - could you perhaps try to bisect it? All recently queued up 
> patches 
> that could cause such problems should be readily bisectable.

Yeah, I'll give it a go as soon as I get some other stuff done.

-Mike


Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27

2017-01-30 Thread Mike Galbraith
On Mon, 2017-01-30 at 11:59 +, Matt Fleming wrote:
> On Sat, 28 Jan, at 08:21:05AM, Mike Galbraith wrote:
> > Running Steven's hotplug stress script in tip.today.  Config is
> > NOPREEMPT, tune for maximum build time (enterprise default-ish).
> > 
> > [   75.268049] x86: Booting SMP configuration:
> > [   75.268052] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > [   75.279994] smpboot: Booting Node 0 Processor 2 APIC 0x4
> > [   75.294617] smpboot: Booting Node 0 Processor 4 APIC 0x1
> > [   75.310698] smpboot: Booting Node 0 Processor 5 APIC 0x3
> > [   75.359056] smpboot: CPU 3 is now offline
> > [   75.415505] smpboot: CPU 4 is now offline
> > [   75.479985] smpboot: CPU 5 is now offline
> > [   75.550674] [ cut here ]
> > [   75.550678] WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804
> > assert_clock_updated.isra.62.part.63+0x25/0x27
> > [   75.550679] rq->clock_update_flags < RQCF_ACT_SKIP
> 
> The following patch queued in tip/sched/core should fix this issue:

Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew
an early boot brick problem.

> >8
> 
> From 4d25b35ea3729affd37d69c78191ce6f92766e1a Mon Sep 17 00:00:00 2001
> From: Matt Fleming <m...@codeblueprint.co.uk>
> Date: Wed, 26 Oct 2016 16:15:44 +0100
> Subject: [PATCH] sched/fair: Restore previous rq_flags when migrating tasks in hotplug
> 
> __migrate_task() can return with a different runqueue locked than the
> one we passed as an argument. So that we can repin the lock in
> migrate_tasks() (and keep the update_rq_clock() bit) we need to
> restore the old rq_flags before repinning.
> 
> Note that it wouldn't be correct to change move_queued_task() to repin
> because of the change of runqueue and the fact that having an
> up-to-date clock on the initial rq doesn't mean the new rq has one too.
> 
> Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Mike Galbraith <efa...@gmx.de>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Signed-off-by: Ingo Molnar <mi...@kernel.org>
> ---
>  kernel/sched/core.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f983e83a353..3b248b03ad8f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5608,7 +5608,7 @@ static void migrate_tasks(struct rq *dead_rq)
>  {
>   struct rq *rq = dead_rq;
>   struct task_struct *next, *stop = rq->stop;
> - struct rq_flags rf;
> + struct rq_flags rf, old_rf;
>   int dest_cpu;
>  
>   /*
> @@ -5669,6 +5669,13 @@ static void migrate_tasks(struct rq *dead_rq)
>   continue;
>   }
>  
> + /*
> +  * __migrate_task() may return with a different
> +  * rq->lock held and a new cookie in 'rf', but we need
> +  * to preserve rf::clock_update_flags for 'dead_rq'.
> +  */
> + old_rf = rf;
> +
>   /* Find suitable destination for @next, with force if needed. */
>   dest_cpu = select_fallback_rq(dead_rq->cpu, next);
>  
> @@ -5677,6 +5684,7 @@ static void migrate_tasks(struct rq *dead_rq)
>   raw_spin_unlock(&rq->lock);
>   rq = dead_rq;
>   raw_spin_lock(&rq->lock);
> + rf = old_rf;
>   }
>   raw_spin_unlock(&next->pi_lock);
>   }
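
The save/restore pattern the patch relies on — snapshot a cookie tied to a lock section before code that may drop and re-take the lock, then restore it — can be sketched as below. This is a toy model, not scheduler code: the struct, the 0x4 flag value, and the loop are invented for illustration.

```c
/* Illustrative sketch of the rq_flags save/restore in the patch above:
 * state tied to the original lock section must be snapshotted before a
 * step that may hand back a fresh cookie, then restored afterwards.
 * Types and values here are made up for demonstration. */
struct rq_flags_sketch {
    unsigned int clock_update_flags;
};

unsigned int repin_with_restore(int drops)
{
    struct rq_flags_sketch rf = { .clock_update_flags = 0x4 };  /* "clock updated" */
    struct rq_flags_sketch old_rf;

    for (int i = 0; i < drops; i++) {
        old_rf = rf;                 /* preserve before the lock is dropped */
        rf.clock_update_flags = 0;   /* re-taking the lock yields a fresh cookie */
        rf = old_rf;                 /* restore, as migrate_tasks() now does */
    }
    return rf.clock_update_flags;    /* still 0x4: state survived the repin */
}
```

Without the `old_rf` snapshot, the flags would come back zeroed after each iteration — the analogue of the lost `RQCF_*` state that triggered the warning in this thread.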


WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27

2017-01-27 Thread Mike Galbraith
Running Steven's hotplug stress script in tip.today.  Config is
NOPREEMPT, tune for maximum build time (enterprise default-ish).

[   75.268049] x86: Booting SMP configuration:
[   75.268052] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   75.279994] smpboot: Booting Node 0 Processor 2 APIC 0x4
[   75.294617] smpboot: Booting Node 0 Processor 4 APIC 0x1
[   75.310698] smpboot: Booting Node 0 Processor 5 APIC 0x3
[   75.359056] smpboot: CPU 3 is now offline
[   75.415505] smpboot: CPU 4 is now offline
[   75.479985] smpboot: CPU 5 is now offline
[   75.550674] [ cut here ]
[   75.550678] WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 
assert_clock_updated.isra.62.part.63+0x25/0x27
[   75.550679] rq->clock_update_flags < RQCF_ACT_SKIP
[   75.550679] Modules linked in: ebtable_filter(E) ebtables(E) fuse(E) 
nf_log_ipv6(E) xt_pkttype(E) xt_physdev(E) br_netfilter(E) nf_log_ipv4(E) 
nf_log_common(E) xt_LOG(E) xt_limit(E) af_packet(E) bridge(E) stp(E) llc(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) 
nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) 
iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) 
nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) 
nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) 
ip6table_filter(E) snd_hda_codec_hdmi(E) ip6_tables(E) x_tables(E) 
snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_intel(E) 
snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) nls_iso8859_1(E) intel_rapl(E) 
x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E)
[   75.550703]  snd_pcm(E) nls_cp437(E) kvm_intel(E) snd_timer(E) kvm(E) 
irqbypass(E) nfsd(E) snd(E) crct10dif_pclmul(E) crc32_pclmul(E) auth_rpcgss(E) 
ghash_clmulni_intel(E) joydev(E) nfs_acl(E) lockd(E) soundcore(E) i2c_i801(E) 
shpchp(E) pcbc(E) aesni_intel(E) mei_me(E) aes_x86_64(E) crypto_simd(E) 
iTCO_wdt(E) iTCO_vendor_support(E) lpc_ich(E) mfd_core(E) glue_helper(E) 
pcspkr(E) mei(E) grace(E) cryptd(E) intel_smartconnect(E) battery(E) fan(E) 
thermal(E) tpm_infineon(E) sunrpc(E) efivarfs(E) sr_mod(E) cdrom(E) 
hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) 
usbhid(E) nouveau(E) ahci(E) wmi(E) libahci(E) i2c_algo_bit(E) 
drm_kms_helper(E) xhci_pci(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) 
fb_sys_fops(E) ehci_pci(E) ehci_hcd(E) ttm(E) xhci_hcd(E) crc32c_intel(E) 
r8169(E)
[   75.550721]  mii(E) libata(E) drm(E) usbcore(E) fjes(E) video(E) button(E) 
sd_mod(E) vfat(E) fat(E) ext4(E) crc16(E) jbd2(E) mbcache(E) dm_mod(E) loop(E) 
sg(E) scsi_mod(E) autofs4(E)
[   75.550728] CPU: 1 PID: 15 Comm: migration/1 Tainted: GE   
4.10.0-tip-default #47
[   75.550728] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[   75.550728] Call Trace:
[   75.550732]  dump_stack+0x63/0x87
[   75.550734]  __warn+0xd1/0xf0
[   75.550737]  ? load_balance+0xa00/0xa00
[   75.550738]  warn_slowpath_fmt+0x4f/0x60
[   75.550739]  ? cpumask_next_and+0x35/0x50
[   75.550740]  assert_clock_updated.isra.62.part.63+0x25/0x27
[   75.550741]  update_load_avg+0x855/0x950
[   75.550742]  ? load_balance+0xa00/0xa00
[   75.550743]  set_next_entity+0x9e/0x1b0
[   75.550744]  pick_next_task_fair+0x78/0x540
[   75.550746]  ? sched_clock+0x9/0x10
[   75.550747]  ? sched_clock_cpu+0x11/0xb0
[   75.550748]  ? load_balance+0xa00/0xa00
[   75.550749]  sched_cpu_dying+0x23c/0x280
[   75.550751]  ? fini_debug_store_on_cpu+0x34/0x40
[   75.550752]  ? sched_cpu_starting+0x60/0x60
[   75.550753]  cpuhp_invoke_callback+0x90/0x400
[   75.550754]  take_cpu_down+0x5e/0xa0
[   75.550757]  multi_cpu_stop+0xc4/0xf0
[   75.550757]  ? cpu_stop_queue_work+0xb0/0xb0
[   75.550758]  cpu_stopper_thread+0x8c/0x120
[   75.550760]  smpboot_thread_fn+0x110/0x160
[   75.550762]  kthread+0x101/0x140
[   75.550762]  ? sort_range+0x30/0x30
[   75.550763]  ? kthread_park+0x90/0x90
[   75.550766]  ret_from_fork+0x2c/0x40
[   75.550766] ---[ end trace 9dd372e3b19c77a0 ]---


Re: [btrfs/rt] lockdep false positive

2017-01-26 Thread Mike Galbraith
On Thu, 2017-01-26 at 18:09 +0100, Sebastian Andrzej Siewior wrote:
> On 2017-01-25 19:29:49 [+0100], Mike Galbraith wrote:
> > On Wed, 2017-01-25 at 18:02 +0100, Sebastian Andrzej Siewior wrote:
> > 
> > > > [  341.960794]CPU0
> > > > [  341.960795]
> > > > [  341.960795]   lock(btrfs-tree-00);
> > > > [  341.960795]   lock(btrfs-tree-00);
> > > > [  341.960796] 
> > > > [  341.960796]  *** DEADLOCK ***
> > > > [  341.960796]
> > > > [  341.960796]  May be due to missing lock nesting notation
> > > > [  341.960796]
> > > > [  341.960796] 6 locks held by kworker/u8:9/2039:
> > > > [  341.960797]  #0:  ("%s-%s""btrfs", name){.+.+..}, at: [] 
> > > > process_one_work+0x171/0x700
> > > > [  341.960812]  #1:  ((>normal_work)){+.+...}, at: [] 
> > > > process_one_work+0x171/0x700
> > > > [  341.960815]  #2:  (sb_internal){.+.+..}, at: [] 
> > > > start_transaction+0x2a7/0x5a0 [btrfs]
> > > > [  341.960825]  #3:  (btrfs-tree-02){+.+...}, at: [] 
> > > > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > > > [  341.960835]  #4:  (btrfs-tree-01){+.+...}, at: [] 
> > > > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > > > [  341.960854]  #5:  (btrfs-tree-00){+.+...}, at: [] 
> > > > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > > > 
> > > > Attempting to describe RT rwlock semantics to lockdep prevents this.
> > > 
> > > and this is what I don't get. I stumbled upon this myself [0] but didn't
> > > fully understand the problem (assuming this is the same problem colored
> > > differently).
> > 
> > Yeah, [0] looks like it, though I haven't met an 'fs' variant, my
> > encounters were always either 'tree' or 'csum' flavors.
> > 
> > > With your explanation I am not sure if I get what is happening. If btrfs
> > > is taking here read-locks on random locks then it may deadlock if
> > > another btrfs-thread is doing the same and need each other's locks.
> > 
> > I don't know if a real RT deadlock is possible.  I haven't met one,
> > only variants of this bogus recursion gripe.
> >  
> > > If btrfs takes locks recursively which it already holds (in the same
> > > context / process) then it shouldn't be visible here because lockdep
> > > does not account this on -RT.
> > 
> > If what lockdep gripes about were true, we would never see the splat,
> > we'd zip straight through that (illusion) recursive read_lock() with
> > lockdep being none the wiser. 
> > 
> > > If btrfs takes the locks in a special order for instance only ascending
> > > according to inode's number then it shouldn't deadlock.
> > 
> > No idea.  Locking fancy enough to require dynamic key assignment to
> > appease lockdep is too fancy for me.
> 
> yup, for me, too. As long as nobody from the btrfs camp explains how
> that locking works and whether it is safe, I am not comfortable
> shutting lockdep up here.

Works for me.  What we're talking about is an obvious false positive in
one and only one contrived situation.  It's annoying/sub-optimal, but
happily has no (known) impact other than testing, and that's trivial to
remedy.

-Mike


Re: [rfc patch-rt] radix-tree: Partially disable memcg accounting in radix_tree_node_alloc()

2017-01-25 Thread Mike Galbraith
On Wed, 2017-01-25 at 16:06 +0100, Sebastian Andrzej Siewior wrote:

> According to the description of radix_tree_preload(), a return code
> of 0 means that the following addition of a single element does not
> fail. But in RT's case this requirement is not fulfilled. There is more
> than just one user of that function. So instead of adding an exception here
> and maybe later another someplace else, what about the following patch?
> That testcase you mentioned passes now:

Modulo missing EXPORT_SYMBOL(), yup, works fine.

> > testcases/kernel/syscalls/madvise/madvise06
> > tst_test.c:760: INFO: Timeout per run is 0h 05m 00s
> > madvise06.c:65: INFO: dropping caches
> > madvise06.c:139: INFO: SwapCached (before madvise): 304
> > madvise06.c:153: INFO: SwapCached (after madvise): 309988
> > madvise06.c:155: PASS: Regression test pass
> > 
> > Summary:
> > passed   1
> > failed   0
> > skipped  0
> > warnings 0
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index f87f87dec84c..277295039c8f 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -289,19 +289,11 @@ unsigned int radix_tree_gang_lookup(struct 
> radix_tree_root *root,
>  unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
>  >>   >   > void ***results, unsigned long *indices,
>  >>   >   > unsigned long first_index, unsigned int max_items);
> -#ifdef CONFIG_PREEMPT_RT_FULL
> -static inline int radix_tree_preload(gfp_t gm) { return 0; }
> -static inline int radix_tree_maybe_preload(gfp_t gfp_mask) { return 0; }
> -static inline int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
> -{
> ->> return 0;
> -};
> -
> -#else
>  int radix_tree_preload(gfp_t gfp_mask);
>  int radix_tree_maybe_preload(gfp_t gfp_mask);
>  int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
> -#endif
> +void radix_tree_preload_end(void);
> +
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,
>  >>   >   > unsigned long index, unsigned int tag);
> @@ -324,11 +316,6 @@ unsigned long radix_tree_range_tag_if_tagged(struct 
> radix_tree_root *root,
>  int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
>  unsigned long radix_tree_locate_item(struct radix_tree_root *root, void 
> *item);
>  
> -static inline void radix_tree_preload_end(void)
> -{
> ->> preempt_enable_nort();
> -}
> -
>  /**
>   * struct radix_tree_iter - radix tree iterator state
>   *
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 881cc195d85f..e96c6a99f25c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -36,7 +36,7 @@
>  #include 
>  #include 
>  #include >   >   > /* in_interrupt() */
> -
> +#include 
>  
>  /* Number of nodes in fully populated tree of given height */
>  static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] 
> __read_mostly;
> @@ -68,6 +68,7 @@ struct radix_tree_preload {
>  >> struct radix_tree_node *nodes;
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, 
> };
> +static DEFINE_LOCAL_IRQ_LOCK(radix_tree_preloads_lock);
>  
>  static inline void *node_to_entry(void *ptr)
>  {
> @@ -290,14 +291,14 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>  >>   >  * succeed in getting a node here (and never reach
>  >>   >  * kmem_cache_alloc)
>  >>   >  */
> ->>   > rtp = &get_cpu_var(radix_tree_preloads);
> +>>   > rtp = &get_locked_var(radix_tree_preloads_lock, 
> radix_tree_preloads);
>  >>   > if (rtp->nr) {
>  >>   >   > ret = rtp->nodes;
>  >>   >   > rtp->nodes = ret->private_data;
>  >>   >   > ret->private_data = NULL;
>  >>   >   > rtp->nr--;
>  >>   > }
> ->>   > put_cpu_var(radix_tree_preloads);
> +>>   > put_locked_var(radix_tree_preloads_lock, radix_tree_preloads);
>  >>   > /*
>  >>   >  * Update the allocation stack trace as this is more useful
>  >>   >  * for debugging.
> @@ -337,7 +338,6 @@ radix_tree_node_free(struct radix_tree_node *node)
>  >> call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
>  }
>  
> -#ifndef CONFIG_PREEMPT_RT_FULL
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to
>   * ensure that the addition of a single element in the tree cannot fail.  On
> @@ -359,14 +359,14 @@ static int __radix_tree_preload(gfp_t gfp_mask, int nr)
>  >>  */
>  >> gfp_mask &= ~__GFP_ACCOUNT;
>  
> ->> preempt_disable();
> +>> local_lock(radix_tree_preloads_lock);
>  >> rtp = this_cpu_ptr(_tree_preloads);
>  >> while (rtp->nr < nr) {
> ->>   > preempt_enable();
> +>>   > local_unlock(radix_tree_preloads_lock);
>  >>   > node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>  >>   > if (node == NULL)
>  >>   >   > goto out;
> ->>   > preempt_disable();
> +>>   


Re: [btrfs/rt] lockdep false positive

2017-01-25 Thread Mike Galbraith
On Wed, 2017-01-25 at 18:02 +0100, Sebastian Andrzej Siewior wrote:

> > [  341.960794]CPU0
> > [  341.960795]
> > [  341.960795]   lock(btrfs-tree-00);
> > [  341.960795]   lock(btrfs-tree-00);
> > [  341.960796] 
> > [  341.960796]  *** DEADLOCK ***
> > [  341.960796]
> > [  341.960796]  May be due to missing lock nesting notation
> > [  341.960796]
> > [  341.960796] 6 locks held by kworker/u8:9/2039:
> > [  341.960797]  #0:  ("%s-%s""btrfs", name){.+.+..}, at: [] 
> > process_one_work+0x171/0x700
> > [  341.960812]  #1:  ((>normal_work)){+.+...}, at: [] 
> > process_one_work+0x171/0x700
> > [  341.960815]  #2:  (sb_internal){.+.+..}, at: [] 
> > start_transaction+0x2a7/0x5a0 [btrfs]
> > [  341.960825]  #3:  (btrfs-tree-02){+.+...}, at: [] 
> > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > [  341.960835]  #4:  (btrfs-tree-01){+.+...}, at: [] 
> > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > [  341.960854]  #5:  (btrfs-tree-00){+.+...}, at: [] 
> > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > 
> > Attempting to describe RT rwlock semantics to lockdep prevents this.
> 
> and this is what I don't get. I stumbled upon this myself [0] but didn't
> fully understand the problem (assuming this is the same problem colored
> differently).

Yeah, [0] looks like it, though I haven't met an 'fs' variant, my
encounters were always either 'tree' or 'csum' flavors.

> With your explanation I am not sure if I get what is happening. If btrfs
> is taking here read-locks on random locks then it may deadlock if
> another btrfs-thread is doing the same and need each other's locks.

I don't know if a real RT deadlock is possible.  I haven't met one,
only variants of this bogus recursion gripe.
 
> If btrfs takes locks recursively which it already holds (in the same
> context / process) then it shouldn't be visible here because lockdep
> does not account this on -RT.

If what lockdep gripes about were true, we would never see the splat,
we'd zip straight through that (illusion) recursive read_lock() with
lockdep being none the wiser. 

> If btrfs takes the locks in a special order for instance only ascending
> according to inode's number then it shouldn't deadlock.

No idea.  Locking fancy enough to require dynamic key assignment to
appease lockdep is too fancy for me.

-Mike


Re: [btrfs/rt] lockdep false positive

2017-01-22 Thread Mike Galbraith
On Sun, 2017-01-22 at 18:45 +0100, Mike Galbraith wrote:
> On Sun, 2017-01-22 at 09:46 +0100, Mike Galbraith wrote:
> > Greetings btrfs/lockdep wizards,
> > 
> > RT trees have trouble with the BTRFS lockdep positive avoidance lock
> > class dance (see disk-io.c).  Seems the trouble is due to RT not having
> > a means of telling lockdep that its rwlocks are recursive for read by
> > the lock owner only, combined with the BTRFS lock class dance assuming
> > that read_lock() is annotated rwlock_acquire_read(), which RT cannot
> > do, as that would be a big fat lie.
> > 
> > Creating a rt_read_lock_shared() for btrfs_clear_lock_blocking_rw() did
> > indeed make lockdep happy as a clam for test purposes.  (hm, submitting
> > that would be excellent way to replenish frozen shark supply:)
> > 
> > Ideas?
> 
> Hrm.  The below seems to work fine, but /me strongly suspects that if
> it were this damn trivial, the issue would be long dead.

(iow, did I merely spell '2' as '3' vs creating the annotation I want)


Re: [btrfs/rt] lockdep false positive

2017-01-22 Thread Mike Galbraith
 
> +>>   > /*
> +>>   >  * Allow read-after-read or read-after-write recursion of the
> +>>   >  * same lock class for RT rwlocks.
> +>>   >  */
> +>>   > if (read == 3 && (prev->read == 3 || prev->read == 0))

Pff, shoulda left it reader vs reader.. but it's gotta be wrong anyway.


Re: [btrfs/rt] lockdep false positive

2017-01-22 Thread Mike Galbraith
On Sun, 2017-01-22 at 09:46 +0100, Mike Galbraith wrote:
> Greetings btrfs/lockdep wizards,
> 
> RT trees have trouble with the BTRFS lockdep positive avoidance lock
> class dance (see disk-io.c).  Seems the trouble is due to RT not having
> a means of telling lockdep that its rwlocks are recursive for read by
> the lock owner only, combined with the BTRFS lock class dance assuming
> that read_lock() is annotated rwlock_acquire_read(), which RT cannot
> do, as that would be a big fat lie.
> 
> Creating a rt_read_lock_shared() for btrfs_clear_lock_blocking_rw() did
> indeed make lockdep happy as a clam for test purposes.  (hm, submitting
> that would be excellent way to replenish frozen shark supply:)
> 
> Ideas?

Hrm.  The below seems to work fine, but /me strongly suspects that if
it were this damn trivial, the issue would be long dead.

RT does not have a way to describe its rwlock semantics to lockdep,
leading to the btrfs false positive below.  Btrfs maintains an array
of keys which it assigns on the fly in order to avoid false positives
in stock code, however, that scheme depends upon lockdep knowing that
read_lock()+read_lock() is allowed within a class, as multiple locks
are assigned to the same class, and end up acquired by the same task.

[  341.960754] =
[  341.960754] [ INFO: possible recursive locking detected ]
[  341.960756] 4.10.0-rt1-rt #124 Tainted: GE  
[  341.960756] -
[  341.960757] kworker/u8:9/2039 is trying to acquire lock:
[  341.960757]  (btrfs-tree-00){+.+...}, at: [] 
btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]

This kworker assigned this lock to class 'tree' level 0 shortly
before acquisition, however..

[  341.960783] 
[  341.960783]  but task is already holding lock:
[  341.960783]  (btrfs-tree-00){+.+...}, at: [] 
btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]

..another kworker previously assigned another lock we now hold to the
'tree' level 0 key as well.  Since RT tells lockdep that read_lock() is an
exclusive acquisition, read_lock()+read_lock() within a class is forbidden.

[  341.960794]CPU0
[  341.960795]
[  341.960795]   lock(btrfs-tree-00);
[  341.960795]   lock(btrfs-tree-00);
[  341.960796] 
[  341.960796]  *** DEADLOCK ***
[  341.960796]
[  341.960796]  May be due to missing lock nesting notation
[  341.960796]
[  341.960796] 6 locks held by kworker/u8:9/2039:
[  341.960797]  #0:  ("%s-%s""btrfs", name){.+.+..}, at: [] 
process_one_work+0x171/0x700
[  341.960812]  #1:  ((>normal_work)){+.+...}, at: [] 
process_one_work+0x171/0x700
[  341.960815]  #2:  (sb_internal){.+.+..}, at: [] 
start_transaction+0x2a7/0x5a0 [btrfs]
[  341.960825]  #3:  (btrfs-tree-02){+.+...}, at: [] 
btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
[  341.960835]  #4:  (btrfs-tree-01){+.+...}, at: [] 
btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
[  341.960854]  #5:  (btrfs-tree-00){+.+...}, at: [] 
btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]

Attempting to describe RT rwlock semantics to lockdep prevents this.

Not-signed-off-by: /me
---
 include/linux/lockdep.h  |    5 +
 kernel/locking/lockdep.c |    8 
 kernel/locking/rt.c      |    7 ++-
 3 files changed, 15 insertions(+), 5 deletions(-)

--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -543,13 +543,18 @@ static inline void print_irqtrace_events
 #define lock_acquire_exclusive(l, s, t, n, i)		lock_acquire(l, s, t, 0, 1, n, i)
 #define lock_acquire_shared(l, s, t, n, i)		lock_acquire(l, s, t, 1, 1, n, i)
 #define lock_acquire_shared_recursive(l, s, t, n, i)	lock_acquire(l, s, t, 2, 1, n, i)
+#define lock_acquire_reader_recursive(l, s, t, n, i)	lock_acquire(l, s, t, 3, 1, n, i)
 
 #define spin_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
 #define spin_acquire_nest(l, s, t, n, i)	lock_acquire_exclusive(l, s, t, n, i)
 #define spin_release(l, n, i)			lock_release(l, n, i)
 
 #define rwlock_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
+#ifndef CONFIG_PREEMPT_RT_FULL
 #define rwlock_acquire_read(l, s, t, i)		lock_acquire_shared_recursive(l, s, t, NULL, i)
+#else
+#define rwlock_acquire_read(l, s, t, i)		lock_acquire_reader_recursive(l, s, t, NULL, i)
+#endif
 #define rwlock_release(l, n, i)			lock_release(l, n, i)
 
 #define seqcount_acquire(l, s, t, i)	lock_acquire_exclusive(l, s, t, NULL, i)
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1761,6 +1761,14 @@ check_deadlock(struct task_struct *curr,
 	if ((read == 2) && prev->read)
 		return 2;
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+	/*
+	 * Allow read-after-read or read-after-write recursion of the
+	 * same 


[btrfs/rt] lockdep false positive

2017-01-22 Thread Mike Galbraith
Greetings btrfs/lockdep wizards,

RT trees have trouble with the BTRFS lockdep positive avoidance lock
class dance (see disk-io.c).  Seems the trouble is due to RT not having
a means of telling lockdep that its rwlocks are recursive for read by
the lock owner only, combined with the BTRFS lock class dance assuming
that read_lock() is annotated rwlock_acquire_read(), which RT cannot
do, as that would be a big fat lie.

Creating a rt_read_lock_shared() for btrfs_clear_lock_blocking_rw() did
indeed make lockdep happy as a clam for test purposes.  (hm, submitting
that would be an excellent way to replenish the frozen shark supply:)

Ideas?

The below is tip-rt, but that's irrelevant.  Any RT tree will do, you
just might hit the recently fixed log_mutex gripe instead of the
btrfs-tree-00/btrfs-csum-00 variants you'll eventually hit with the
log_mutex splat fixed.

[  433.956516] =
[  433.956516] [ INFO: possible recursive locking detected ]
[  433.956518] 4.10.0-rt1-tip-rt #36 Tainted: GE  
[  433.956518] -
[  433.956519] kworker/u8:2/555 is trying to acquire lock:
[  433.956519]  (btrfs-csum-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956540] 
   but task is already holding lock:
[  433.956540]  (btrfs-csum-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956551] 
   other info that might help us debug this:
[  433.956551]  Possible unsafe locking scenario:

[  433.956552]CPU0
[  433.956552]
[  433.956552]   lock(btrfs-csum-00);
[  433.956552]   lock(btrfs-csum-00);
[  433.956553] 
*** DEADLOCK ***

[  433.956553]  May be due to missing lock nesting notation

[  433.956554] 6 locks held by kworker/u8:2/555:
[  433.956554]  #0:  ("%s-%s""btrfs", name){.+.+..}, at: [] process_one_work+0x171/0x700
[  433.956565]  #1:  ((>normal_work)){+.+...}, at: [] process_one_work+0x171/0x700
[  433.956567]  #2:  (sb_internal){.+.+..}, at: [] start_transaction+0x2a7/0x5a0 [btrfs]
[  433.956576]  #3:  (btrfs-csum-02){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956585]  #4:  (btrfs-csum-01){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956593]  #5:  (btrfs-csum-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956601] 

Lock class assignment leadin
 btrfs-transacti-623   [002] ...   406.637399: btrfs_set_buffer_lockdep_class: set >lock: 88014a087ce0 level: 0 to btrfs-extent-00
    kworker/u8:5-558   [000] ...   429.673871: btrfs_set_buffer_lockdep_class: set >lock: 880007073ce0 level: 2 to btrfs-csum-02
    kworker/u8:5-558   [000] ...   429.673904: btrfs_set_buffer_lockdep_class: set >lock: 88014a087ce0 level: 1 to btrfs-csum-01
    kworker/u8:0-5     [002] ...   433.022595: btrfs_set_buffer_lockdep_class: set >lock: 88009bd98fe0 level: 0 to btrfs-csum-00 *
    kworker/u8:2-555   [001] ...   433.838082: btrfs_set_buffer_lockdep_class: set >lock: 880096e924e0 level: 0 to btrfs-csum-00

Our hero about to go splat
    kworker/u8:2-555   [000] ...   434.043172: btrfs_clear_lock_blocking_rw: read_lock(>lock: 880007073ce0)  == btrfs-csum-02
    kworker/u8:2-555   [000] .11   434.043172: btrfs_clear_lock_blocking_rw: read_lock(>lock: 88014a087ce0)  == btrfs-csum-01
    kworker/u8:2-555   [000] .12   434.043173: btrfs_clear_lock_blocking_rw: read_lock(>lock: 88009bd98fe0)  == btrfs-csum-00  set by kworker/u8:0-5
    kworker/u8:2-555   [000] .13   434.043173: btrfs_clear_lock_blocking_rw: read_lock(>lock: 880096e924e0)  == btrfs-csum-00  set by hero - two locks, one key - splat

   stack backtrace:
[  433.956602] CPU: 0 PID: 555 Comm: kworker/u8:2 Tainted: GE 4.10.0-rt1-tip-rt #36
[  433.956603] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[  433.956611] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[  433.956612] Call Trace:
[  433.956618]  dump_stack+0x85/0xc8
[  433.956622]  __lock_acquire+0x9f9/0x1550
[  433.956627]  ? ring_buffer_lock_reserve+0x115/0x3b0
[  433.956629]  ? ring_buffer_unlock_commit+0x27/0xe0
[  433.956630]  lock_acquire+0xbd/0x250
[  433.956637]  ? btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956641]  rt_read_lock+0x47/0x60
[  433.956648]  ? btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956654]  btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[  433.956660]  btrfs_clear_path_blocking+0x99/0xc0 [btrfs]
[  433.956667]  btrfs_next_old_leaf+0x407/0x440 [btrfs]
[  433.956674]  btrfs_next_leaf+0x10/0x20 [btrfs]
[  433.956681]  btrfs_csum_file_blocks+0x31a/0x5f0 [btrfs]
[  433.956682]  ? migrate_enable+0x87/0x160
[  433.956690]  add_pending_csums.isra.46+0x4d/0x70 [btrfs]
[  433.956698]  
