Re: 9908859acaa9 cpuidle/menu: add per CPU PM QoS resume latency consideration
On Wed, 2017-02-22 at 14:12 +0100, Peter Zijlstra wrote:
> On Wed, Feb 22, 2017 at 01:56:37PM +0100, Mike Galbraith wrote:
> > Hi,
> >
> > Do we really need a spinlock for that in the idle loop?
>
> Urgh, that's broken on RT, you cannot schedule the idle loop.

That's what made me notice the obnoxious little bugger.

[ 77.608340] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995
[ 77.608342] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
[ 77.608343] INFO: lockdep is turned off.
[ 77.608344] irq event stamp: 59222
[ 77.608353] hardirqs last enabled at (59221): [] rcu_idle_exit+0x2f/0x50
[ 77.608362] hardirqs last disabled at (59222): [] do_idle+0x9a/0x290
[ 77.608372] softirqs last enabled at (0): [] copy_process.part.34+0x5f1/0x22a0
[ 77.608374] softirqs last disabled at (0): [< (null)>] (null)
[ 77.608374] Preemption disabled at:
[ 77.608383] [] schedule_preempt_disabled+0x22/0x30
[ 77.608387] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW E 4.11.0-rt9-rt #163
[ 77.608389] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
[ 77.608390] Call Trace:
[ 77.608399]  dump_stack+0x85/0xc8
[ 77.608405]  ___might_sleep+0x15d/0x260
[ 77.608409]  rt_spin_lock+0x24/0x80
[ 77.608419]  dev_pm_qos_read_value+0x1e/0x40
[ 77.608424]  menu_select+0x56/0x3e0
[ 77.608426]  ? rcu_eqs_enter_common.isra.40+0x9d/0x160
[ 77.608435]  cpuidle_select+0x13/0x20
[ 77.608438]  do_idle+0x182/0x290
[ 77.608445]  cpu_startup_entry+0x48/0x50
[ 77.608450]  start_secondary+0x133/0x160
[ 77.608453]  start_cpu+0x14/0x14
9908859acaa9 cpuidle/menu: add per CPU PM QoS resume latency consideration
Hi,

Do we really need a spinlock for that in the idle loop?

	-Mike
Re: [bisection] b0119e87083 iommu: Introduce new 'struct iommu_device' ==> boom
On Tue, 2017-02-21 at 16:19 +0100, Joerg Roedel wrote:
> Hi Mike,
>
> thanks for the report, this didn't trigger in my local testing here.
> Loosk like I need to test without intel_iommu=on too :/
>
> Anyway, can you check whether the attached patch helps?

Yup, boots.

> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index d9c0decfc91a..a74fec8d266a 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1108,8 +1108,10 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
>  
>  static void free_iommu(struct intel_iommu *iommu)
>  {
> -	iommu_device_sysfs_remove(&iommu->iommu);
> -	iommu_device_unregister(&iommu->iommu);
> +	if (intel_iommu_enabled) {
> +		iommu_device_sysfs_remove(&iommu->iommu);
> +		iommu_device_unregister(&iommu->iommu);
> +	}
>  
>  	if (iommu->irq) {
>  		if (iommu->pr_irq) {
[bisection] b0119e87083 iommu: Introduce new 'struct iommu_device' ==> boom
4x18 box (berio) explodes as below after morning master pull.  BIOS has a couple issues, maybe one of them.. helps.

[ 30.796530] ima: No TPM chip found, activating TPM-bypass! (rc=-19)
[ 30.810709] evm: HMAC attrs: 0x1
[ 30.821200] BUG: unable to handle kernel NULL pointer dereference at 0008
[ 30.839003] IP: device_del+0x6e/0x350
[ 30.847364] PGD 0
[ 30.847365]
[ 30.855639] Oops: [#1] SMP
[ 30.862858] Dumping ftrace buffer:
[ 30.870678]    (ftrace buffer empty)
[ 30.878849] Modules linked in:
[ 30.885870] CPU: 39 PID: 1 Comm: swapper/0 Not tainted 4.10.0-default #144
[ 30.901334] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
[ 30.924687] task: 88017cab2040 task.stack: c9038000
[ 30.938040] RIP: 0010:device_del+0x6e/0x350
[ 30.947532] RSP: :c903bd50 EFLAGS: 00010246
[ 30.959344] RAX: RBX: 8810fce66928 RCX: 77ff8000
[ 30.975381] RDX: 88017cab2040 RSI: 00ec RDI: 81a1300b
[ 30.991420] RBP: c903bd90 R08: 8808fc9bcdb8 R09:
[ 31.007459] R10: R11: c903bc08 R12: 8810fce66928
[ 31.023497] R13: R14: R15: 8810fce669c8
[ 31.039536] FS: () GS:88106f8c() knlGS:
[ 31.057897] CS: 0010 DS: ES: CR0: 80050033
[ 31.070867] CR2: 0008 CR3: 01c09000 CR4: 001406e0
[ 31.086890] DR0: DR1: DR2:
[ 31.102927] DR3: DR6: fffe0ff0 DR7: 0400
[ 31.118967] Call Trace:
[ 31.124641]  ? dmar_free_dev_scope+0x62/0x80
[ 31.134347]  device_unregister+0x1a/0x60
[ 31.143284]  iommu_device_sysfs_remove+0x12/0x20
[ 31.153755]  dmar_free_drhd+0x40/0x120
[ 31.162311]  dmar_free_unused_resources+0xad/0xc9
[ 31.172975]  ? detect_intel_iommu+0xcf/0xcf
[ 31.182487]  do_one_initcall+0x51/0x1b0
[ 31.191233]  ? parse_args+0x27b/0x460
[ 31.199596]  kernel_init_freeable+0x1a2/0x232
[ 31.209490]  ? set_debug_rodata+0x12/0x12
[ 31.218619]  ? rest_init+0x90/0x90
[ 31.226399]  kernel_init+0xe/0x110
[ 31.234186]  ret_from_fork+0x2c/0x40
[ 31.242351] Code: 00 00 00 48 81 c7 f0 00 00 00 e8 2e b1 bc ff 48 c7 c7 e0 69 d0 81 4d 8d bc 24 a0 00 00 00 e8 2a f4 1b 00 49 8b 84 24 a8 00 00 00 <48> 8b 48 08 49 39 c7 4c 8d 70 e0 48 8d 59 e0 75 08 eb 2a 49 89
[ 31.284763] RIP: device_del+0x6e/0x350 RSP: c903bd50
[ 31.297536] CR2: 0008
[ 31.305148] ---[ end trace 617d26bc9a426981 ]---

b0119e870837dcd15a207b4701542ebac5d19b45 is the first bad commit
commit b0119e870837dcd15a207b4701542ebac5d19b45
Author: Joerg Roedel
Date:   Wed Feb 1 13:23:08 2017 +0100

    iommu: Introduce new 'struct iommu_device'

    This struct represents one hardware iommu in the iommu core code. For
    now it only has the iommu-ops associated with it, but that will be
    extended soon.

    The register/unregister interface is also added, as well as making use
    of it in the Intel and AMD IOMMU drivers.

    Signed-off-by: Joerg Roedel

:04 04 cb491d4d5bd25f1b65e6c93f7e67c8594901d6e1 84a5621c5e88961cf2385566c1c28eb5375c413f M	drivers
:04 04 5a2f0b8b829b29ef80baf6ef7cf2ba4b9bf23bf7 89ecaf2419fe0e00500c23413149f1df7bcbd693 M	include

git bisect start
# good: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
git bisect good c470abd4fde40ea6a0846a2beab642a578c0b8cd
# bad: [2bfe01eff4307409b95859e860261d0907149b61] Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
git bisect bad 2bfe01eff4307409b95859e860261d0907149b61
# good: [828cad8ea05d194d8a9452e0793261c2024c23a2] Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 828cad8ea05d194d8a9452e0793261c2024c23a2
# bad: [f790bd9c8e826434ab6c326b225276ed0f73affe] Merge tag 'regulator-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect bad f790bd9c8e826434ab6c326b225276ed0f73affe
# good: [937b5b5ddd2f685b4962ec19502e641bb5741c12] Merge tag 'm68k-for-v4.11-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
git bisect good 937b5b5ddd2f685b4962ec19502e641bb5741c12
# bad: [27a67e0f983567574ef659520d930f82cf65125a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect bad 27a67e0f983567574ef659520d930f82cf65125a
# good: [2c9f1af528a4581e8ef8590108daa3c3df08dd5a] vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
git bisect good 2c9f1af528a4581e8ef8590108daa3c3df08dd5a
# bad: [ebb4949eb32ff500602f960525592fc4e614c5a7] Merge tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect bad ebb4949eb32ff500602f960525592fc4e614c5a7
[cgroups] suspicious rcu_dereference_check() usage!
Running LTP on master.today (v4.10) with a seriously bloated PREEMPT config inspired box to emit the below.

[ 7160.458996] ===
[ 7160.463195] [ INFO: suspicious RCU usage. ]
[ 7160.467387] 4.10.0-default #100 Tainted: GE
[ 7160.472808] ---
[ 7160.476999] ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!
[ 7160.484576]
[ 7160.484576] other info that might help us debug this:
[ 7160.484576]
[ 7160.492577]
[ 7160.492577] rcu_scheduler_active = 2, debug_locks = 1
[ 7160.499113] 1 lock held by pids_task1/19308:
[ 7160.503390] #0: (&cgroup_threadgroup_rwsem){+.}, at: [] _do_fork+0xf0/0x710
[ 7160.512450]
[ 7160.512450] stack backtrace:
[ 7160.516810] CPU: 5 PID: 19308 Comm: pids_task1 Tainted: GE 4.10.0-default #100
[ 7160.525239] Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698, BIOS -[D6E150AUS-1.10]- 12/15/2010
[ 7160.534965] Call Trace:
[ 7160.537414]  dump_stack+0x85/0xc9
[ 7160.540732]  lockdep_rcu_suspicious+0xd5/0x110
[ 7160.545177]  task_css.constprop.7+0x88/0x90
[ 7160.549357]  pids_can_fork+0x132/0x160
[ 7160.553106]  cgroup_can_fork+0x63/0xc0
[ 7160.556855]  copy_process.part.30+0x17ef/0x21b0
[ 7160.561382]  ? _do_fork+0xf0/0x710
[ 7160.564786]  ? free_pages_and_swap_cache+0x9e/0xc0
[ 7160.569575]  _do_fork+0xf0/0x710
[ 7160.572806]  ? __this_cpu_preempt_check+0x13/0x20
[ 7160.577505]  ? __percpu_counter_add+0x86/0xb0
[ 7160.581860]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[ 7160.586562]  ? do_syscall_64+0x2d/0x200
[ 7160.590395]  SyS_clone+0x19/0x20
[ 7160.593623]  do_syscall_64+0x6c/0x200
[ 7160.597283]  entry_SYSCALL64_slow_path+0x25/0x25
[ 7160.601899] RIP: 0033:0x7fdaa3b881c4
[ 7160.605473] RSP: 002b:7ffd21635d50 EFLAGS: 0246 ORIG_RAX: 0038
[ 7160.613036] RAX: ffda RBX: 4b6c RCX: 7fdaa3b881c4
[ 7160.620162] RDX: RSI: RDI: 01200011
[ 7160.627288] RBP: 7ffd21635d90 R08: R09: 7fdaa4052700
[ 7160.634414] R10: 7fdaa40529d0 R11: 0246 R12: 7ffd21635d50
[ 7160.641539] R13: R14: R15:
[btrfs] lockdep splat
Greetings,

Running ltp on master.today, I received the splat (from hell) below.

[ 5015.128458] =
[ 5015.128458] [ INFO: possible irq lock inversion dependency detected ]
[ 5015.128458] 4.10.0-default #119 Tainted: GE
[ 5015.128458] -
[ 5015.128458] khugepaged/896 just changed the state of lock:
[ 5015.128458] (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x41/0x2d0 [btrfs]
[ 5015.128458] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 5015.128458] (pcpu_alloc_mutex){+.+.+.}
[ 5015.128458]
[ 5015.128458]
[ 5015.128458] and interrupts could create inverse lock ordering between them.
[ 5015.128458]
[ 5015.128458]
[ 5015.128458] other info that might help us debug this:
[ 5015.128458] Chain exists of:
[ 5015.128458]   &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex
[ 5015.128458]
[ 5015.128458]  Possible interrupt unsafe locking scenario:
[ 5015.128458]
[ 5015.128458]        CPU0                    CPU1
[ 5015.128458]
[ 5015.128458]   lock(pcpu_alloc_mutex);
[ 5015.128458]                           local_irq_disable();
[ 5015.128458]                           lock(&delayed_node->mutex);
[ 5015.128458]                           lock(&found->groups_sem);
[ 5015.128458]  <Interrupt>
[ 5015.128458]   lock(&delayed_node->mutex);
[ 5015.128458]
[ 5015.128458]  *** DEADLOCK ***
[ 5015.128458]
[ 5015.128458] 2 locks held by khugepaged/896:
[ 5015.128458] #0: (shrinker_rwsem){..}, at: [] shrink_slab+0x7d/0x650
[ 5015.128458] #1: (&type->s_umount_key#26){..}, at: [] trylock_super+0x1b/0x50
[ 5015.128458]
[ 5015.128458] the shortest dependencies between 2nd lock and 1st lock:
[ 5015.128458]  -> (pcpu_alloc_mutex){+.+.+.} ops: 4652 {
[ 5015.128458]     HARDIRQ-ON-W at:
[ 5015.128458]       __lock_acquire+0x8e6/0x1550
[ 5015.128458]       lock_acquire+0xbd/0x220
[ 5015.128458]       mutex_lock_nested+0x67/0x6a0
[ 5015.128458]       pcpu_alloc+0x1c0/0x600
[ 5015.128458]       __alloc_percpu+0x15/0x20
[ 5015.128458]       alloc_kmem_cache_cpus.isra.56+0x2b/0xa0
[ 5015.128458]       __do_tune_cpucache+0x30/0x210
[ 5015.128458]       do_tune_cpucache+0x2a/0xd0
[ 5015.128458]       enable_cpucache+0x61/0x110
[ 5015.128458]       kmem_cache_init_late+0x41/0x76
[ 5015.128458]       start_kernel+0x352/0x4cd
[ 5015.128458]       x86_64_start_reservations+0x2a/0x2c
[ 5015.128458]       x86_64_start_kernel+0x13d/0x14c
[ 5015.128458]       verify_cpu+0x0/0xfc
[ 5015.128458]     SOFTIRQ-ON-W at:
[ 5015.128458]       __lock_acquire+0x283/0x1550
[ 5015.128458]       lock_acquire+0xbd/0x220
[ 5015.128458]       mutex_lock_nested+0x67/0x6a0
[ 5015.128458]       pcpu_alloc+0x1c0/0x600
[ 5015.128458]       __alloc_percpu+0x15/0x20
[ 5015.128458]       alloc_kmem_cache_cpus.isra.56+0x2b/0xa0
[ 5015.128458]       __do_tune_cpucache+0x30/0x210
[ 5015.128458]       do_tune_cpucache+0x2a/0xd0
[ 5015.128458]       enable_cpucache+0x61/0x110
[ 5015.128458]       kmem_cache_init_late+0x41/0x76
[ 5015.128458]       start_kernel+0x352/0x4cd
[ 5015.128458]       x86_64_start_reservations+0x2a/0x2c
[ 5015.128458]       x86_64_start_kernel+0x13d/0x14c
[ 5015.128458]       verify_cpu+0x0/0xfc
[ 5015.128458]     RECLAIM_FS-ON-W at:
[ 5015.128458]       mark_held_locks+0x66/0x90
[ 5015.128458]       lockdep_trace_alloc+0x6f/0xd0
[ 5015.128458]       __alloc_pages_nodemask+0x81/0x370
[ 5015.128458]       pcpu_populate_chunk+0xac/0x340
[ 5015.128458]       pcpu_alloc+0x4f8/0x600
[ 5015.128458]       __alloc_percpu+0x15/0x20
[ 5015.128458]       perf_pmu_register+0xc6/0x3c0
[ 5015.128458]       init_hw_perf_events+0x513/0x56d
[ 5015.128458]       do_one_initcall+0x51/0x1c0
[ 5015.128458]       kernel_init_freeable+0x146/0x28e
[ 5015.128458]       kernel_init+0xe/0x110
[ 5015.128458]       ret_from_fork+0x31/0x40
[ 5015.128458]     INITIAL USE at:
[ 5015.128458]       __lock_acquire+0x2ce/0x1550
[ 5015.128458]       lock_acquire+0xbd/0x220
[ 5015.128458]
Re: [RT] lockdep munching nr_list_entries like popcorn
On Thu, 2017-02-16 at 19:06 +0100, Mike Galbraith wrote:
> On Thu, 2017-02-16 at 15:53 +0100, Sebastian Andrzej Siewior wrote:
> > On 2017-02-16 15:42:59 [+0100], Mike Galbraith wrote:
> > >
> > > Weeell, I'm trying to cobble something kinda like that together using
> > > __RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
> > > the PER_CPU_DEP_MAP_INIT() thingy.
> > >
> > >   CC      mm/swap.o
> > > mm/swap.c:54:689: error: braced-group within expression allowed only inside a function
> >
> > so this is what I have now. I need to get the `static' symbol working
> > again and PER_CPU_DEP_MAP_INIT but aside from that it seems to do its
> > job.
> ...
>
> Yeah, works, I should be able to do an ltp run with stock lockdep
> settings without it taking it's toys and going home in a snit.
>
> berio:/sys/kernel/debug/tracing/:[0]# !while
> while sleep 60; do tail -1 trace; done
> <...>-10315 [064] d...1.. 226.953935: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14223
> w-13148 [120] d...111 287.414978: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14465
> w-16492 [089] d...111 347.128742: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14653
> (starts kbuild loop)
> btrfs-transacti-1964 [016] d...1.. 411.101549: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 17011
> <...>-100268 [127] d...112 472.271769: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18153
> w-18864 [011] d...1.. 534.386443: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18543
> <...>-50390 [035] dN..2.. 597.794164: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18765
> <...>-80098 [127] d...111 659.912145: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18977
> checkproc-11123 [017] d...1.. 721.483463: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19247
> <idle>-0 [055] d..h5.. 782.685953: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19383
> <...>-93632 [055] d...111 835.527817: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19441

And now Thomas's patch.  Spiffiness.  Now to start ltp.

berio:/sys/kernel/debug/tracing/:[0]# !while
while sleep 60; do tail -1 trace; done
<...>-12462 [105] d...1.. 211.489528: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 12847
btrfs-transacti-3136 [002] d...211 272.672777: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 12947
irq/155-eth2-Tx-4495 [096] dN..213 332.035236: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 13001
(starts kbuild loop)
<...>-44245 [087] d...114 396.892748: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 13917
<...>-105411 [067] dN..211 457.708259: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14367
w-21800 [113] dN..2.. 519.231735: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14449
modpost-31558 [020] d11 576.065855: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14601
kworker/dying-11860 [133] d...112 637.170497: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14679
<...>-118853 [055] d11 703.884755: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14759
<...>-52143 [090] d...1.. 767.624735: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14813
<...>-71788 [126] d...1.. 829.160330: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14857
kworker/u289:5-2991 [002] d...1.. 892.402939: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14883
sh-15106 [008] d...211 953.172196: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14937
Re: [RT] lockdep munching nr_list_entries like popcorn
BTW, this ain't gone. I'll take a peek. It doesn't happen in my tree,
likely because, whether running sirqs fully threaded or not, I don't let
any one thread handle what another exists to handle.

[ 638.107293] NOHZ: local_softirq_pending 80
[ 939.729684] NOHZ: local_softirq_pending 80
[ 945.600869] NOHZ: local_softirq_pending 80
[ 1387.101178] NOHZ: local_softirq_pending 80
[ 1387.101343] NOHZ: local_softirq_pending 80
[ 1387.101549] NOHZ: local_softirq_pending 80
[ 1413.313212] NOHZ: local_softirq_pending 80
[ 1413.313305] NOHZ: local_softirq_pending 80
[ 1413.313347] NOHZ: local_softirq_pending 80
Re: [RT] lockdep munching nr_list_entries like popcorn
On Thu, 2017-02-16 at 15:53 +0100, Sebastian Andrzej Siewior wrote:
> On 2017-02-16 15:42:59 [+0100], Mike Galbraith wrote:
> >
> > Weeell, I'm trying to cobble something kinda like that together using
> > __RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
> > the PER_CPU_DEP_MAP_INIT() thingy.
> >
> >   CC      mm/swap.o
> > mm/swap.c:54:689: error: braced-group within expression allowed only
> > inside a function
>
> so this is what I have now. I need to get the `static' symbol working
> again and PER_CPU_DEP_MAP_INIT but aside from that it seems to do its
> job.

...

Yeah, works, I should be able to do an ltp run with stock lockdep
settings without it taking its toys and going home in a snit.

berio:/sys/kernel/debug/tracing/:[0]# !while
while sleep 60; do tail -1 trace; done
<...>-10315 [064] d...1.. 226.953935: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14223
w-13148 [120] d...111 287.414978: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14465
w-16492 [089] d...111 347.128742: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 14653
(starts kbuild loop)
btrfs-transacti-1964 [016] d...1.. 411.101549: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 17011
<...>-100268 [127] d...112 472.271769: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18153
w-18864 [011] d...1.. 534.386443: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18543
<...>-50390 [035] dN..2.. 597.794164: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18765
<...>-80098 [127] d...111 659.912145: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 18977
checkproc-11123 [017] d...1.. 721.483463: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19247
<idle>-0 [055] d..h5.. 782.685953: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19383
<...>-93632 [055] d...111 835.527817: add_lock_to_list.isra.24.constprop.42: nr_list_entries: 19441
Re: [RT] lockdep munching nr_list_entries like popcorn
On Thu, 2017-02-16 at 12:06 +0100, Peter Zijlstra wrote:
> On Thu, Feb 16, 2017 at 10:01:18AM +0100, Thomas Gleixner wrote:
> > On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > > On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> > > > On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > > ...
> > > > > swapvec_lock? Oodles of 'em? Nope.
> > > >
> > > > Well, it's a per cpu lock and the lru_cache_add() variants might be
> > > > called from a gazillion of different call chains, but yes, it does
> > > > not make a lot of sense. We'll have a look.
> > >
> > > Adding explicit local_irq_lock_init() makes things heaps better, so
> > > presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().
> >
> > Bah.
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> #define PER_CPU_DEP_MAP_INIT(lockname)	\
>	.dep_map = {			\
>		.key = ({ static struct lock_class_key __key; &__key; }), \
>		.name = #lockname,	\
>	}
> #else
> #define PER_CPU_DEP_MAP_INIT(lockname)
> #endif
>
> #define DEFINE_LOCAL_IRQ_LOCK(lvar)				\
>	DEFINE_PER_CPU(struct local_irq_lock, lvar) = {		\
>		.lock = { .rlock = {				\
>			.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
>			SPIN_DEBUG_INIT(lvar)			\
>			PER_CPU_DEP_MAP_INIT(lvar)		\
>		} }						\
>	}
>
> That's fairly horrible for poking inside all the internals, but it might
> just work ;-)

Weeell, I'm trying to cobble something kinda like that together using
__RT_SPIN_INITIALIZER() instead, but seems mean ole Mr. Compiler NAKs
the PER_CPU_DEP_MAP_INIT() thingy.

  CC      mm/swap.o
mm/swap.c:54:689: error: braced-group within expression allowed only
inside a function

	-Mike
Re: [RT] lockdep munching nr_list_entries like popcorn
On Thu, 2017-02-16 at 10:01 +0100, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> > > On Thu, 16 Feb 2017, Mike Galbraith wrote:
> > ...
> > > > swapvec_lock? Oodles of 'em? Nope.
> > >
> > > Well, it's a per cpu lock and the lru_cache_add() variants might be
> > > called from a gazillion of different call chains, but yes, it does not
> > > make a lot of sense. We'll have a look.
> >
> > Adding explicit local_irq_lock_init() makes things heaps better, so
> > presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().
>
> Bah.

Hm, "bah" sounds kinda like it might be a synonym for -EDUMMY :) Fair
enough, I know spit about lockdep, so that's likely the case, but the
below has me down to ~17k (and climbing, but not as fast).

berio:/sys/kernel/debug/tracing/:[0]# grep -A 1 'stack trace' trace|grep '=>'|sort|uniq
 => ___slab_alloc+0x171/0x5c0
 => __percpu_counter_add+0x56/0xd0
 => __schedule+0xb0/0x7b0
 => __slab_free+0xd8/0x200
 => cgroup_idr_alloc.constprop.39+0x37/0x80
 => hrtimer_start_range_ns+0xe6/0x400
 => idr_preload+0x6c/0x300
 => jbd2_journal_extend+0x4c/0x310 [jbd2]
 => lock_hrtimer_base.isra.28+0x29/0x50
 => rcu_note_context_switch+0x2b8/0x5c0
 => rcu_report_unblock_qs_rnp+0x6e/0xa0
 => rt_mutex_slowunlock+0x25/0xc0
 => rt_spin_lock_slowlock+0x52/0x330
 => rt_spin_lock_slowlock+0x94/0x330
 => rt_spin_lock_slowunlock+0x3c/0xc0
 => swake_up+0x21/0x40
 => task_blocks_on_rt_mutex+0x42/0x1e0
 => try_to_wake_up+0x2d/0x920
berio:/sys/kernel/debug/tracing/:[0]# grep nr_list_entries: trace|tail -1
 irq/66-eth2-TxR-3670 [115] d14 1542.321173: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 17839

Got rid of the really pesky growth anyway.

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5522,6 +5522,7 @@ static int __init init_workqueues(void)

 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);

+	local_irq_lock_init(pendingb_lock);
 	wq_numa_init();

 	/* initialize CPU pools */
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1677,5 +1677,6 @@ void __init radix_tree_init(void)
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
 			radix_tree_node_ctor);
 	radix_tree_init_maxnodes();
+	local_irq_lock_init(radix_tree_preloads_lock);
 	hotcpu_notifier(radix_tree_callback, 0);
 }
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5786,6 +5786,7 @@ static int __init mem_cgroup_init(void)
 	int cpu, node;

 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
+	local_irq_lock_init(event_lock);

 	for_each_possible_cpu(cpu)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -681,6 +681,14 @@ static inline void remote_lru_add_drain(
 	local_unlock_on(swapvec_lock, cpu);
 }

+static int __init lru_init(void)
+{
+	local_irq_lock_init(swapvec_lock);
+	local_irq_lock_init(rotate_lock);
+	return 0;
+}
+early_initcall(lru_init);
+
 #else

 /*
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -525,6 +525,7 @@ int __init netfilter_init(void)
 {
 	int ret;

+	local_irq_lock_init(xt_write_lock);
 	ret = register_pernet_subsys(&netfilter_net_ops);
 	if (ret < 0)
 		goto err;
Re: [RT] lockdep munching nr_list_entries like popcorn
On Thu, 2017-02-16 at 09:37 +0100, Thomas Gleixner wrote:
> On Thu, 16 Feb 2017, Mike Galbraith wrote:
> ...
> > swapvec_lock? Oodles of 'em? Nope.
>
> Well, it's a per cpu lock and the lru_cache_add() variants might be called
> from a gazillion of different call chains, but yes, it does not make a lot
> of sense. We'll have a look.

Adding explicit local_irq_lock_init() makes things heaps better, so
presumably we need better lockdep-foo in DEFINE_LOCAL_IRQ_LOCK().

	-Mike
[RT] lockdep munching nr_list_entries like popcorn
4.9.10-rt6-virgin on 72 core +SMT box. Below is 1 line per minute, box
idling along daintily nibbling, I fire up a parallel kbuild loop at 40465,
and box gobbles greedily. I have entries bumped to 128k, and chain bits to
18 so box will get booted and run for a while before lockdep says "I quit".
With stock settings, this box will barely get booted. Seems the bigger the
box, the sooner you're gonna run out. A NOPREEMPT kernel seems to nibble
entries too, but nowhere remotely near as greedily as RT.

<...>-100309 [064] d13 2885.873312: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40129
<...>-104320 [116] dN..211 2959.633630: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40155
btrfs-transacti-1955 [043] d...111 3021.073949: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40183
<...>-118865 [120] d13 3086.146794: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40209
systemd-logind-4763 [068] d11 3146.953001: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40239
<...>-123725 [032] dN..2.. 3215.735774: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40285
<...>-33968 [031] d...1.. 3347.919001: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40409
<...>-130886 [143] d12 3412.586643: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 40465
<...>-138291 [037] d11 3477.816405: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 42825
<...>-67678 [137] d...112 3551.648282: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 47899
ksoftirqd/45-421 [045] d13 3617.926394: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 48751
ihex2fw-24635 [035] d11 3686.899690: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 49345
<...>-76041 [047] d...111 3758.230009: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 49757
stty-10772 [118] d...1.. 3825.626815: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 50115
kworker/u289:4-13376 [075] d12 3896.432428: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 51189
<...>-92785 [047] d12 3905.137578: add_lock_to_list.isra.24.constprop.42+0x20/0x100: nr_list_entries: 51287

With stacktrace on, buffer contains 1010 __lru_cache_add+0x4f...

(gdb) list *__lru_cache_add+0x4f
0x811dca9f is in __lru_cache_add (./include/linux/locallock.h:59).
54
55	static inline void __local_lock(struct local_irq_lock *lv)
56	{
57		if (lv->owner != current) {
58			spin_lock_local(&lv->lock);
59			LL_WARN(lv->owner);
60			LL_WARN(lv->nestcnt);
61			lv->owner = current;
62		}
63		lv->nestcnt++;

...which seems to be this.

0x811dca80 is in __lru_cache_add (mm/swap.c:397).
392	}
393	EXPORT_SYMBOL(mark_page_accessed);
394
395	static void __lru_cache_add(struct page *page)
396	{
397		struct pagevec *pvec = &get_locked_var(swapvec_lock, lru_add_pvec);
398
399		get_page(page);
400		if (!pagevec_add(pvec, page) || PageCompound(page))
401			__pagevec_lru_add(pvec);

swapvec_lock? Oodles of 'em? Nope.

	-Mike
[tip:timers/urgent] tick/broadcast: Prevent deadlock on tick_broadcast_lock
Commit-ID:  202461e2f3c15dbfb05825d29ace0d20cdf55fa4
Gitweb:     http://git.kernel.org/tip/202461e2f3c15dbfb05825d29ace0d20cdf55fa4
Author:     Mike Galbraith <efa...@gmx.de>
AuthorDate: Mon, 13 Feb 2017 03:31:55 +0100
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Mon, 13 Feb 2017 09:49:31 +0100

tick/broadcast: Prevent deadlock on tick_broadcast_lock

tick_broadcast_lock is taken from interrupt context, but the following
call chain takes the lock without disabling interrupts:

[   12.703736] _raw_spin_lock+0x3b/0x50
[   12.703738] tick_broadcast_control+0x5a/0x1a0
[   12.703742] intel_idle_cpu_online+0x22/0x100
[   12.703744] cpuhp_invoke_callback+0x245/0x9d0
[   12.703752] cpuhp_thread_fun+0x52/0x110
[   12.703754] smpboot_thread_fn+0x276/0x320

So the following deadlock can happen:

   lock(tick_broadcast_lock);
   <Interrupt>
      lock(tick_broadcast_lock);

intel_idle_cpu_online() is the only place which violates the calling
convention of tick_broadcast_control(). This was caused by the removal
of the smp function call in course of the cpu hotplug rework.

Instead of slapping local_irq_disable/enable() at the call site, we can
relax the calling convention and handle it in the core code, which makes
the whole machinery more robust.

Fixes: 29d7bbada98e ("intel_idle: Remove superfluous SMP fuction call")
Reported-by: Gabriel C <nix.or@gmail.com>
Signed-off-by: Mike Galbraith <efa...@gmx.de>
Cc: Ruslan Ruslichenko <rrusl...@cisco.com>
Cc: Jiri Slaby <jsl...@suse.cz>
Cc: Greg KH <gre...@linuxfoundation.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: l...@lwn.net
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Anna-Maria Gleixner <anna-ma...@linutronix.de>
Cc: Sebastian Siewior <bige...@linutronix.de>
Cc: stable <sta...@vger.kernel.org>
Link: http://lkml.kernel.org/r/1486953115.5912.4.ca...@gmx.de
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 kernel/time/tick-broadcast.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 3109204..17ac99b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -347,17 +347,16 @@ static void tick_handle_periodic_broadcast(struct clock_event_device *dev)
  *
  * Called when the system enters a state where affected tick devices
  * might stop. Note: TICK_BROADCAST_FORCE cannot be undone.
- *
- * Called with interrupts disabled, so clockevents_lock is not
- * required here because the local clock event device cannot go away
- * under us.
  */
 void tick_broadcast_control(enum tick_broadcast_mode mode)
 {
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	int cpu, bc_stopped;
+	unsigned long flags;

+	/* Protects also the local clockevent device. */
+	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	td = this_cpu_ptr(&tick_cpu_device);
 	dev = td->evtdev;

@@ -365,12 +364,11 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
 	 * Is the device not affected by the powerstate ?
 	 */
 	if (!dev || !(dev->features & CLOCK_EVT_FEAT_C3STOP))
-		return;
+		goto out;

 	if (!tick_device_is_functional(dev))
-		return;
+		goto out;

-	raw_spin_lock(&tick_broadcast_lock);
 	cpu = smp_processor_id();
 	bc = tick_broadcast_device.evtdev;
 	bc_stopped = cpumask_empty(tick_broadcast_mask);
@@ -420,7 +418,8 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
 			tick_broadcast_setup_oneshot(bc);
 		}
 	}
-	raw_spin_unlock(&tick_broadcast_lock);
+out:
+	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 EXPORT_SYMBOL_GPL(tick_broadcast_control);
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote:
> On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
> > > I think cgroup tree depth is a more significant issue; because of
> > > hierarchy we often do tree walks (up-to-root or down-to-task).
> > >
> > > So creating elaborate trees is something I try not to do.
> >
> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths. There still are places that this overhead
> > shows up (e.g. the block controllers aren't too optimized) but it
> > isn't particularly difficult to make a handful of layers not matter at
> > all.
>
> A handful of cpu bean counting layers stings considerably.

BTW, that overhead is also why merging cpu/cpuacct is not really as
wonderful as it may seem on paper. If you only want to account, you may
not have anything to gain from group scheduling (in fact it may wreck
performance), but you'll pay for it.

> homer:/abuild # pipe-test 1
> 2.010057 usecs/loop -- avg 2.010057 995.0 KHz
> 2.006630 usecs/loop -- avg 2.009714 995.2 KHz
> 2.127118 usecs/loop -- avg 2.021455 989.4 KHz
> 2.256244 usecs/loop -- avg 2.044934 978.0 KHz
> 1.993693 usecs/loop -- avg 2.039810 980.5 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt pipe-test 1
> 2.771641 usecs/loop -- avg 2.771641 721.6 KHz
> 2.432333 usecs/loop -- avg 2.737710 730.5 KHz
> 2.750493 usecs/loop -- avg 2.738988 730.2 KHz
> 2.663203 usecs/loop -- avg 2.731410 732.2 KHz
> 2.762564 usecs/loop -- avg 2.734525 731.4 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
> 2.967201 usecs/loop -- avg 2.967201 674.0 KHz
> 3.049012 usecs/loop -- avg 2.975382 672.2 KHz
> 3.031226 usecs/loop -- avg 2.980966 670.9 KHz
> 2.954259 usecs/loop -- avg 2.978296 671.5 KHz
> 2.933432 usecs/loop -- avg 2.973809 672.5 KHz
> ^C
> ...
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
> 4.417044 usecs/loop -- avg 4.417044 452.8 KHz
> 4.494913 usecs/loop -- avg 4.424831 452.0 KHz
> 4.253861 usecs/loop -- avg 4.407734 453.7 KHz
> 4.378059 usecs/loop -- avg 4.404766 454.1 KHz
> 4.179895 usecs/loop -- avg 4.382279 456.4 KHz
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote: > On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote: > > > > I think cgroup tree depth is a more significant issue; because of > > > hierarchy we often do tree walks (uo-to-root or down-to-task). > > > > > > So creating elaborate trees is something I try not to do. > > > > So, as long as the depth stays reasonable (single digit or lower), > > what we try to do is keeping tree traversal operations aggregated or > > located on slow paths. There still are places that this overhead > > shows up (e.g. the block controllers aren't too optimized) but it > > isn't particularly difficult to make a handful of layers not matter at > > all. > > A handful of cpu bean counting layers stings considerably. BTW, that overhead is also why merging cpu/cpuacct is not really as wonderful as it may seem on paper. If you only want to account, you may not have anything to gain from group scheduling (in fact it may wreck performance), but you'll pay for it. > homer:/abuild # pipe-test 1 > 2.010057 usecs/loop -- avg 2.010057 995.0 KHz > 2.006630 usecs/loop -- avg 2.009714 995.2 KHz > 2.127118 usecs/loop -- avg 2.021455 989.4 KHz > 2.256244 usecs/loop -- avg 2.044934 978.0 KHz > 1.993693 usecs/loop -- avg 2.039810 980.5 KHz > ^C > homer:/abuild # cgexec -g cpu:hurt pipe-test 1 > 2.771641 usecs/loop -- avg 2.771641 721.6 KHz > 2.432333 usecs/loop -- avg 2.737710 730.5 KHz > 2.750493 usecs/loop -- avg 2.738988 730.2 KHz > 2.663203 usecs/loop -- avg 2.731410 732.2 KHz > 2.762564 usecs/loop -- avg 2.734525 731.4 KHz > ^C > homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1 > 2.967201 usecs/loop -- avg 2.967201 674.0 KHz > 3.049012 usecs/loop -- avg 2.975382 672.2 KHz > 3.031226 usecs/loop -- avg 2.980966 670.9 KHz > 2.954259 usecs/loop -- avg 2.978296 671.5 KHz > 2.933432 usecs/loop -- avg 2.973809 672.5 KHz > ^C > ... 
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1 > 4.417044 usecs/loop -- avg 4.417044 452.8 KHz > 4.494913 usecs/loop -- avg 4.424831 452.0 KHz > 4.253861 usecs/loop -- avg 4.407734 453.7 KHz > 4.378059 usecs/loop -- avg 4.404766 454.1 KHz > 4.179895 usecs/loop -- avg 4.382279 456.4 KHz
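For reference, the pipe-test numbers above come from a ping-pong microbenchmark. A minimal sketch of that style of test (an illustration, not Mike's actual pipe-test tool): parent and child bounce one byte through a pair of pipes, so each loop costs two context switches plus pipe overhead, reported in usecs per round trip.

```c
/*
 * Minimal pipe-test style sketch: parent and child bounce one byte
 * through a pair of pipes; each round trip is two context switches
 * plus pipe overhead. Illustration only, not the actual pipe-test.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

static double now_usecs(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Returns average usecs per ping-pong round trip, or -1.0 on error. */
double pipe_test_usecs_per_loop(int loops)
{
	int ping[2], pong[2], i;
	char byte = 0;
	double t0, per_loop;
	pid_t pid;

	if (pipe(ping) || pipe(pong))
		return -1.0;

	pid = fork();
	if (pid < 0)
		return -1.0;

	if (pid == 0) {		/* child: echo every byte straight back */
		for (i = 0; i < loops; i++) {
			if (read(ping[0], &byte, 1) != 1)
				break;
			if (write(pong[1], &byte, 1) != 1)
				break;
		}
		_exit(0);
	}

	t0 = now_usecs();
	for (i = 0; i < loops; i++) {
		if (write(ping[1], &byte, 1) != 1 ||
		    read(pong[0], &byte, 1) != 1)
			return -1.0;
	}
	per_loop = (now_usecs() - t0) / loops;

	waitpid(pid, NULL, 0);
	return per_loop;
}
```

Each extra cgroup layer adds tree-walk cost to the enqueue/dequeue paths this kind of benchmark hammers, which is why the loop time climbs with hierarchy depth in the numbers above.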
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 13:16 -0800, Paul Turner wrote: > > > On Thursday, February 9, 2017, Peter Zijlstra wrote: > > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote: > > > The only case that this does not support vs ".threads" would be some > > > hybrid where we co-mingle threads from different processes (with the > > > processes belonging to the same node in the hierarchy). I'm not aware > > > of any usage that looks like this. > > > > If I understand you right; this is a fairly common thing with RT where > > we would stuff all the !rt threads of the various processes in a 'misc' > > bucket. > > > > Similarly, it happens that we stuff the various rt threads of processes > > in a specific (shared) 'rt' bucket. > > > > So I would certainly not like to exclude that setup. > > > > Unless you're using rt groups I'm not sure this one really changes. > Whether the "misc" threads exist at the parent level or one below > should not matter. (with exclusive cpusets, a mask can exist at one and only one location)
Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )
On Mon, 2017-02-13 at 02:26 +0100, Gabriel C wrote: > [5.276704]CPU0 > [5.312400] > [5.347605] lock(tick_broadcast_lock); > [5.383163] > [5.418457] lock(tick_broadcast_lock); > [5.454015] > *** DEADLOCK *** > > [5.557982] no locks held by cpuhp/0/14. Oh, that looks familiar... tick/broadcast: Make tick_broadcast_control() use raw_spinlock_irqsave() Otherwise we end up with the lockdep splat below: [ 12.703619] = [ 12.703619] [ INFO: inconsistent lock state ] [ 12.703621] 4.10.0-rt1-rt #18 Not tainted [ 12.703622] - [ 12.703623] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 12.703624] cpuhp/0/23 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 12.703625] (tick_broadcast_lock){?.}, at: [] tick_broadcast_control+0x5a/0x1a0 [ 12.703632] {IN-HARDIRQ-W} state was registered at: [ 12.703637] [] __lock_acquire+0xa21/0x1550 [ 12.703639] [] lock_acquire+0xbd/0x250 [ 12.703642] [] _raw_spin_lock_irqsave+0x53/0x70 [ 12.703644] [] tick_broadcast_switch_to_oneshot+0x16/0x50 [ 12.703646] [] tick_switch_to_oneshot+0x59/0xd0 [ 12.703647] [] tick_init_highres+0x15/0x20 [ 12.703652] [] hrtimer_run_queues+0x9f/0xe0 [ 12.703654] [] run_local_timers+0x25/0x60 [ 12.703656] [] update_process_times+0x2c/0x60 [ 12.703659] [] tick_periodic+0x2f/0x100 [ 12.703661] [] tick_handle_periodic+0x24/0x70 [ 12.703664] [] local_apic_timer_interrupt+0x33/0x60 [ 12.703669] [] smp_apic_timer_interrupt+0x38/0x50 [ 12.703671] [] apic_timer_interrupt+0x9d/0xb0 [ 12.703672] [] mwait_idle+0x94/0x290 [ 12.703676] [] arch_cpu_idle+0xf/0x20 [ 12.703677] [] default_idle_call+0x31/0x60 [ 12.703681] [] do_idle+0x175/0x290 [ 12.703683] [] cpu_startup_entry+0x48/0x50 [ 12.703687] [] start_secondary+0x133/0x160 [ 12.703689] [] verify_cpu+0x0/0xfc [ 12.703690] irq event stamp: 71 [ 12.703691] hardirqs last enabled at (71): [] _raw_spin_unlock_irq+0x2c/0x80 [ 12.703696] hardirqs last disabled at (70): [] __schedule+0x9c/0x7e0 [ 12.703699] softirqs last enabled at (0): [] copy_process.part.34+0x5f1/0x22d0 [ 12.703700] softirqs 
last disabled at (0): [< (null)>] (null) [ 12.703701] [ 12.703701] other info that might help us debug this: [ 12.703701] Possible unsafe locking scenario: [ 12.703701] [ 12.703701]CPU0 [ 12.703702] [ 12.703702] lock(tick_broadcast_lock); [ 12.703703] [ 12.703704] lock(tick_broadcast_lock); [ 12.703705] [ 12.703705] *** DEADLOCK *** [ 12.703705] [ 12.703705] no locks held by cpuhp/0/23. [ 12.703705] [ 12.703705] stack backtrace: [ 12.703707] CPU: 0 PID: 23 Comm: cpuhp/0 Not tainted 4.10.0-rt1-rt #18 [ 12.703708] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010 [ 12.703709] Call Trace: [ 12.703715] dump_stack+0x85/0xc8 [ 12.703717] print_usage_bug+0x1ea/0x1fb [ 12.703719] ? print_shortest_lock_dependencies+0x1c0/0x1c0 [ 12.703721] mark_lock+0x20d/0x290 [ 12.703723] __lock_acquire+0x8e6/0x1550 [ 12.703724] ? __lock_acquire+0x2ce/0x1550 [ 12.703726] ? load_balance+0x1b4/0xaf0 [ 12.703728] lock_acquire+0xbd/0x250 [ 12.703729] ? tick_broadcast_control+0x5a/0x1a0 [ 12.703735] ? efifb_probe+0x170/0x170 [ 12.703736] _raw_spin_lock+0x3b/0x50 [ 12.703737] ? tick_broadcast_control+0x5a/0x1a0 [ 12.703738] tick_broadcast_control+0x5a/0x1a0 [ 12.703740] ? efifb_probe+0x170/0x170 [ 12.703742] intel_idle_cpu_online+0x22/0x100 [ 12.703744] cpuhp_invoke_callback+0x245/0x9d0 [ 12.703747] ? finish_task_switch+0x78/0x290 [ 12.703750] ? check_preemption_disabled+0x9f/0x130 [ 12.703752] cpuhp_thread_fun+0x52/0x110 [ 12.703754] smpboot_thread_fn+0x276/0x320 [ 12.703757] kthread+0x10c/0x140 [ 12.703759] ? smpboot_update_cpumask_percpu_thread+0x130/0x130 [ 12.703760] ? 
kthread_park+0x90/0x90 [ 12.703762] ret_from_fork+0x2a/0x40 [ 12.709790] intel_idle: lapic_timer_reliable_states 0x2 Signed-off-by: Mike Galbraith <efa...@gmx.de> --- kernel/time/tick-broadcast.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -357,6 +357,7 @@ void tick_broadcast_control(enum tick_br struct clock_event_device *bc, *dev; struct tick_device *td; int cpu, bc_stopped; + unsigned long flags; td = this_cpu_ptr(&tick_cpu_device); dev = td->evtdev; @@ -370,7 +371,7 @@ void tick_broadcast_control(enum tick_br if (!tick_device_is_functional(dev)) return; - raw_spin_lock(&tick_broadcast_lock); + raw_spin_lock_irqsave(&tick_broadcast_lock, flags); cpu = smp_processor_id(); bc = tick_broadcast_device.evtdev; bc_stopped = cpumask_empty(tick_broadcast_mask); @@ -420,7 +421,
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote: > > I think cgroup tree depth is a more significant issue; because of > > hierarchy we often do tree walks (uo-to-root or down-to-task). > > > > So creating elaborate trees is something I try not to do. > > So, as long as the depth stays reasonable (single digit or lower), > what we try to do is keeping tree traversal operations aggregated or > located on slow paths. There still are places that this overhead > shows up (e.g. the block controllers aren't too optimized) but it > isn't particularly difficult to make a handful of layers not matter at > all. A handful of cpu bean counting layers stings considerably. homer:/abuild # pipe-test 1 2.010057 usecs/loop -- avg 2.010057 995.0 KHz 2.006630 usecs/loop -- avg 2.009714 995.2 KHz 2.127118 usecs/loop -- avg 2.021455 989.4 KHz 2.256244 usecs/loop -- avg 2.044934 978.0 KHz 1.993693 usecs/loop -- avg 2.039810 980.5 KHz ^C homer:/abuild # cgexec -g cpu:hurt pipe-test 1 2.771641 usecs/loop -- avg 2.771641 721.6 KHz 2.432333 usecs/loop -- avg 2.737710 730.5 KHz 2.750493 usecs/loop -- avg 2.738988 730.2 KHz 2.663203 usecs/loop -- avg 2.731410 732.2 KHz 2.762564 usecs/loop -- avg 2.734525 731.4 KHz ^C homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1 2.967201 usecs/loop -- avg 2.967201 674.0 KHz 3.049012 usecs/loop -- avg 2.975382 672.2 KHz 3.031226 usecs/loop -- avg 2.980966 670.9 KHz 2.954259 usecs/loop -- avg 2.978296 671.5 KHz 2.933432 usecs/loop -- avg 2.973809 672.5 KHz ^C ... homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1 4.417044 usecs/loop -- avg 4.417044 452.8 KHz 4.494913 usecs/loop -- avg 4.424831 452.0 KHz 4.253861 usecs/loop -- avg 4.407734 453.7 KHz 4.378059 usecs/loop -- avg 4.404766 454.1 KHz 4.179895 usecs/loop -- avg 4.382279 456.4 KHz
Re: [PATCH 2/2] sched/deadline: Throttle a constrained deadline task activated after the deadline
On Sat, 2017-02-11 at 08:15 +0100, luca abeni wrote: > Hi Daniel, > > On Fri, 10 Feb 2017 20:48:11 +0100 > Daniel Bristot de Oliveira wrote: > > > During the activation, CBS checks if it can reuse the current > > task's > > runtime and period. If the deadline of the task is in the past, CBS > > cannot use the runtime, and so it replenishes the task. This rule > > works fine for implicit deadline tasks (deadline == period), and > > the > > CBS was designed for implicit deadline tasks. However, a task with > > constrained deadline (deadline < period) might be awakened after the > > deadline, but before the next period. In this case, replenishing > > the > > task would allow it to run for runtime / deadline. As in this case > > deadline < period, CBS enables a task to run for more than the > > runtime/period. In a very loaded system, this can cause the domino > > effect, making other tasks miss their deadlines. > > I think you are right: SCHED_DEADLINE implements the original CBS > algorithm here, but uses relative deadlines different from periods in > other places (while the original algorithm only considered relative > deadlines equal to periods). > And this mix is dangerous... I think your fix is correct, and cures a > real problem. Both of these should be tagged for stable as well, or? -Mike
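The arithmetic behind the complaint, with invented numbers purely for illustration (runtime = 5ms, deadline = 10ms, period = 100ms): admission control budgets the task at runtime/period, but replenishing a constrained-deadline task on a post-deadline wakeup effectively grants it runtime/deadline while it runs.

```c
/*
 * Illustration of the CBS bandwidth inflation described above.
 * The numbers are invented for the example; units are milliseconds.
 */
static const double runtime_ms  = 5.0;
static const double deadline_ms = 10.0;	/* constrained: deadline < period */
static const double period_ms   = 100.0;

/* Bandwidth the task was admitted with: runtime / period. */
double admitted_bw(void)
{
	return runtime_ms / period_ms;
}

/*
 * Bandwidth the task can consume if every post-deadline wakeup
 * replenishes it: runtime / deadline, well above what admission
 * control accounted for whenever deadline < period.
 */
double replenished_bw(void)
{
	return runtime_ms / deadline_ms;
}
```

With these numbers the task admitted at 5% can transiently run at 50%, which is the over-commitment that lets the domino effect knock over other tasks' deadlines.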
Re: [GIT pull] x86/timers for 4.10
On Thu, 2017-02-09 at 16:21 +0100, Thomas Gleixner wrote: > On Thu, 9 Feb 2017, Mike Galbraith wrote: > > > On Thu, 2017-02-09 at 16:07 +0100, Thomas Gleixner wrote: > > > On Wed, 8 Feb 2017, Mike Galbraith wrote: > > > > On Wed, 2017-02-08 at 12:44 +0100, Thomas Gleixner wrote: > > > > > On Mon, 6 Feb 2017, Olof Johansson wrote: > > > > > > [0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: > > > > > > -6495898515190607 CPU1: -6495898517158354 > > > > > > > > > > Yay, another "clever" BIOS > > > > > > > > Oh yeah, that reminds me... > > > > > > > > I met one such box, and the adjustment code did salvage it, but I had > > > > to cheat a little for it to do so reliably, as it would sometimes still > > > > see a delta of 1 or 2 whole cycles, and hand me a useless wreck instead > > > > quick like bunny big box. > > > > > > Can you share your cheatery ? > > > > I didn't keep it, it was just a bandaid for a fleeting use, dirt simple > > ignore microscopic delta. > > Can you send me the dmesg output of that box for a good and a bad case or > don't you have access to it anymore? I don't even remember which box it was, but I can try to find it again during idle moments. -Mike
Re: [GIT pull] x86/timers for 4.10
On Thu, 2017-02-09 at 16:07 +0100, Thomas Gleixner wrote: > On Wed, 8 Feb 2017, Mike Galbraith wrote: > > On Wed, 2017-02-08 at 12:44 +0100, Thomas Gleixner wrote: > > > On Mon, 6 Feb 2017, Olof Johansson wrote: > > > > [0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: > > > > -6495898515190607 CPU1: -6495898517158354 > > > > > > Yay, another "clever" BIOS > > > > Oh yeah, that reminds me... > > > > I met one such box, and the adjustment code did salvage it, but I had > > to cheat a little for it to do so reliably, as it would sometimes still > > see a delta of 1 or 2 whole cycles, and hand me a useless wreck instead > > quick like bunny big box. > > Can you share your cheatery ? I didn't keep it, it was just a bandaid for a fleeting use, dirt simple ignore microscopic delta. -Mike
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, 2017-02-09 at 15:47 +0100, Peter Zijlstra wrote: > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote: > > The only case that this does not support vs ".threads" would be some > > hybrid where we co-mingle threads from different processes (with the > > processes belonging to the same node in the hierarchy). I'm not aware > > of any usage that looks like this. > > If I understand you right; this is a fairly common thing with RT where > we would stuff all the !rt threads of the various processes in a 'misc' > bucket. > > Similarly, it happens that we stuff the various rt threads of processes > in a specific (shared) 'rt' bucket. > > So I would certainly not like to exclude that setup. Absolutely, you just described my daily bread performance setup. -Mike
Re: [GIT pull] x86/timers for 4.10
On Wed, 2017-02-08 at 12:44 +0100, Thomas Gleixner wrote: > On Mon, 6 Feb 2017, Olof Johansson wrote: > > [0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: > > -6495898515190607 CPU1: -6495898517158354 > > Yay, another "clever" BIOS Oh yeah, that reminds me... I met one such box, and the adjustment code did salvage it, but I had to cheat a little for it to do so reliably, as it would sometimes still see a delta of 1 or 2 whole cycles, and hand me a useless wreck instead quick like bunny big box. -Mike
Re: [RFC,v2 3/3] sched: ignore task_h_load for CPU_NEWLY_IDLE
On Wed, 2017-02-08 at 09:43 +0100, Uladzislau Rezki wrote: > From: Uladzislau 2 Rezki> > A load balancer calculates imbalance factor for particular shed ^sched > domain and tries to steal up the prescribed amount of weighted load. > However, a small imbalance factor would sometimes prevent us from > stealing any tasks at all. When a CPU is newly idle, it should > steal first task which passes a migration criteria. s/passes a/meets the > > Signed-off-by: Uladzislau 2 Rezki > --- > kernel/sched/fair.c | 13 +++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 232ef3c..29e0d7f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > > > env->loop++; > @@ -6824,8 +6832,9 @@ static int detach_tasks(struct lb_env *env) > >> > if (sched_feat(LB_MIN) && load < 16 && > !env->sd->nr_balance_failed) > >> > > goto next; > > ->> > if ((load / 2) > env->imbalance) > ->> > > goto next; > +>> > if (env->idle != CPU_NEWLY_IDLE) > +>> > > if ((load / 2) > env->imbalance) > +>> > > > goto next; Those two ifs could be one ala if (foo && bar).
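For concreteness on the closing style nit: the two nested ifs in the hunk fold into one condition. A stand-alone sketch (stand-in enum and types, not the kernel's actual `lb_env`) showing the two forms behave identically:

```c
#include <stdbool.h>

/* Stand-in for the kernel's idle type; values are illustrative. */
enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/*
 * Original form from the patch: skip this task (goto next) when the
 * CPU is not newly idle and half its load exceeds the imbalance.
 */
bool skip_nested(enum cpu_idle_type idle, unsigned long load,
		 unsigned long imbalance)
{
	if (idle != CPU_NEWLY_IDLE)
		if ((load / 2) > imbalance)
			return true;	/* goto next; */
	return false;
}

/* Folded form suggested in the review: one condition, same behavior. */
bool skip_combined(enum cpu_idle_type idle, unsigned long load,
		   unsigned long imbalance)
{
	return idle != CPU_NEWLY_IDLE && (load / 2) > imbalance;
}
```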
Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
On Tue, 2017-02-07 at 19:58 -0900, Kent Overstreet wrote: > On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote: > > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote: > > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote: > > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote: > > > > > Still there on v4.9, 36 threads on nokia n900 cellphone. > > > > > > > > > > So.. what needs to be done there? > > > > > > > But, I just got an idea for how to handle this that might be halfway > > > > sane, maybe > > > > I'll try and come up with a patch... > > > > > > Ok, here's such a patch, only lightly tested: > > > > I guess it would be nice for me to test it... but what it is against? > > I tried after v4.10-rc5 and linux-next, but got rejects in both cases. > > Sorry, I forgot I had a few other patches in my branch that touch > mempool/biosets code. > > Also, after thinking about it more and looking at the relevant code, I'm > pretty > sure we don't need rescuer threads for block devices that just split bios - > i.e. > most of them, so I changed my patch to do that. > > Tested it by ripping out the current->bio_list checks/workarounds from the > bcache code, appears to work: Patch killed every last one of them, but.. 
homer:/root # dmesg|grep WARNING [ 11.701447] WARNING: CPU: 4 PID: 801 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240 [ 11.711027] WARNING: CPU: 4 PID: 801 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0 [ 19.728989] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240 [ 19.737020] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0 [ 19.746173] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240 [ 19.755260] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0 [ 19.763837] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240 [ 19.772526] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
On Tue, 2017-02-07 at 21:39 +0100, Pavel Machek wrote: > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote: > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote: > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote: > > > > Still there on v4.9, 36 threads on nokia n900 cellphone. > > > > > > > > So.. what needs to be done there? > > > > > But, I just got an idea for how to handle this that might be halfway > > > sane, maybe > > > I'll try and come up with a patch... > > > > Ok, here's such a patch, only lightly tested: > > I guess it would be nice for me to test it... but what it is against? > I tried after v4.10-rc5 and linux-next, but got rejects in both cases. It wedged into master easily enough (box still seems to work.. but I'll be rebooting in a very few seconds just in case:), but threads on my desktop box only dropped from 73 to 71. Poo. -Mike
Re: tip: demise of tsk_cpus_allowed() and tsk_nr_cpus_allowed()
On Mon, 2017-02-06 at 13:29 +0100, Ingo Molnar wrote: > * Mike Galbraith <efa...@gmx.de> wrote: > > > On Mon, 2017-02-06 at 11:31 +0100, Ingo Molnar wrote: > > > * Mike Galbraith <efa...@gmx.de> wrote: > > > > > > > Hi Ingo, > > > > > > > > Doing my ~daily tip merge of -rt, I couldn't help noticing $subject, as > > > > they grow more functionality in -rt, which is allegedly slowly but > > > > surely headed toward merge. I don't suppose they could be left intact? > > > > I can easily restore them in my local tree, but it seems a bit of a > > > > shame to whack these integration friendly bits. > > > > > > Oh, I missed that. How is tsk_cpus_allowed() wrapped in -rt right now? > > > > RT extends them to reflect whether migration is disabled or not. > > > > +/* Future-safe accessor for struct task_struct's cpus_allowed. */ > > +static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p) > > +{ > > + if (__migrate_disabled(p)) > > + return cpumask_of(task_cpu(p)); > > + > > + return &p->cpus_allowed; > > +} > > + > > +static inline int tsk_nr_cpus_allowed(struct task_struct *p) > > +{ > > + if (__migrate_disabled(p)) > > + return 1; > > + return p->nr_cpus_allowed; > > +} > > So ... I think the cleaner approach in -rt would be to introduce > ->cpus_allowed_saved, and when disabling/enabling migration then saving the > current mask there and changing ->cpus_allowed - and then restoring it when > re-enabling migration. > > This means ->cpus_allowed could be used by the scheduler directly, no > wrappery > would be required, AFAICS. > > ( Some extra care would be required in places that change ->cpus_allowed > because > they'd now have to be aware of ->cpus_allowed_saved. ) > > Am I missing something? I suppose it's a matter of personal preference. I prefer the above, looks nice and clean to me. Hohum, I'll just put them back locally for the nonce. My trees are only place holders until official releases catch up anyway. -Mike
Re: tip: demise of tsk_cpus_allowed() and tsk_nr_cpus_allowed()
On Mon, 2017-02-06 at 11:31 +0100, Ingo Molnar wrote: > * Mike Galbraith <efa...@gmx.de> wrote: > > > Hi Ingo, > > > > Doing my ~daily tip merge of -rt, I couldn't help noticing $subject, as > > they grow more functionality in -rt, which is allegedly slowly but > > surely headed toward merge. I don't suppose they could be left intact? > > I can easily restore them in my local tree, but it seems a bit of a > > shame to whack these integration friendly bits. > > Oh, I missed that. How is tsk_cpus_allowed() wrapped in -rt right now? RT extends them to reflect whether migration is disabled or not. +/* Future-safe accessor for struct task_struct's cpus_allowed. */ +static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p) +{ + if (__migrate_disabled(p)) + return cpumask_of(task_cpu(p)); + + return &p->cpus_allowed; +} + +static inline int tsk_nr_cpus_allowed(struct task_struct *p) +{ + if (__migrate_disabled(p)) + return 1; + return p->nr_cpus_allowed; +}
tip: demise of tsk_cpus_allowed() and tsk_nr_cpus_allowed()
Hi Ingo, Doing my ~daily tip merge of -rt, I couldn't help noticing $subject, as they grow more functionality in -rt, which is allegedly slowly but surely headed toward merge. I don't suppose they could be left intact? I can easily restore them in my local tree, but it seems a bit of a shame to whack these integration friendly bits. -Mike
Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls
On Fri, 2017-02-03 at 14:37 +0100, Peter Zijlstra wrote: > On Fri, Feb 03, 2017 at 01:59:34PM +0100, Mike Galbraith wrote: > > FWIW, I'm not seeing stalls/hangs while beating hotplug up in tip. (so > > next grew a wart?) > > I've seen it on tip. It looks like hot unplug goes really slow when > there's running tasks on the CPU being taken down. > > What I did was something like: > > taskset -p $((1<<1)) $$ > for ((i=0; i<20; i++)) do while :; do :; done & done > > taskset -p $((1<<0)) $$ > echo 0 > /sys/devices/system/cpu/cpu1/online > > And with those 20 tasks stuck sucking cycles on CPU1, the unplug goes > _really_ slow and the RCU stall triggers. What I suspect happens is that > hotplug stops participating in the RCU state machine early, but only > tells RCU about it really late, and in between it gets suspicious it > takes too long. Ah. I wasn't doing a really hard pounding, just running a couple instances of Steven's script. To beat hell out of it, I add futextest, stockfish and a small kbuild on a big box. -Mike
x86-tip/master: Kernel panic - not syncing: snd_hda_codec_hdmi hdaudioC1D0: unrecoverable failure
On a lark, I tried a suspend/resume cycle after a bit of uneventful beating on cpu hotplug. Suspend worked fine, resume not so well. [ 1571.838698] Call Trace: [ 1571.838703] __schedule+0x32c/0xd10 [ 1571.838706] ? _raw_spin_unlock_irqrestore+0x36/0x60 [ 1571.838710] schedule+0x3d/0x90 [ 1571.838712] schedule_timeout+0x2d8/0x620 [ 1571.838718] ? snd_hdac_bus_send_cmd+0xab/0x110 [snd_hda_core] [ 1571.838722] ? lock_timer_base+0xa0/0xa0 [ 1571.838728] msleep+0x39/0x50 [ 1571.838734] azx_rirb_get_response+0x4a/0x270 [snd_hda_codec] [ 1571.838740] azx_get_response+0x33/0x40 [snd_hda_codec] [ 1571.838743] snd_hdac_bus_exec_verb_unlocked+0x169/0x2f0 [snd_hda_core] [ 1571.838748] codec_exec_verb+0x8c/0x120 [snd_hda_codec] [ 1571.838753] snd_hdac_exec_verb+0x17/0x40 [snd_hda_core] [ 1571.838756] snd_hdac_codec_read+0x34/0x50 [snd_hda_core] [ 1571.838759] ? snd_hdac_regmap_read_raw+0x10/0x20 [snd_hda_core] [ 1571.838763] read_pin_sense+0x35/0x80 [snd_hda_codec] [ 1571.838768] jack_detect_update+0x82/0xc0 [snd_hda_codec] [ 1571.838772] snd_hda_pin_sense+0x5e/0x70 [snd_hda_codec] [ 1571.838775] hdmi_present_sense+0x128/0x390 [snd_hda_codec_hdmi] [ 1571.838781] ? hda_call_codec_resume+0x120/0x120 [snd_hda_codec] [ 1571.838783] ? pm_runtime_force_suspend+0x90/0x90 [ 1571.838786] generic_hdmi_resume+0x4d/0x60 [snd_hda_codec_hdmi] [ 1571.838790] hda_call_codec_resume+0xd0/0x120 [snd_hda_codec] [ 1571.838794] hda_codec_runtime_resume+0x35/0x50 [snd_hda_codec] [ 1571.838795] pm_runtime_force_resume+0x93/0xe0 [ 1571.838798] dpm_run_callback+0xba/0x300 [ 1571.838801] device_resume+0x10e/0x240 [ 1571.838804] ? pm_dev_dbg+0x80/0x80 [ 1571.838809] async_resume+0x1d/0x50 [ 1571.838811] async_run_entry_fn+0x39/0x170 [ 1571.838814] process_one_work+0x1e1/0x670 [ 1571.838815] ? process_one_work+0x162/0x670 [ 1571.838819] worker_thread+0x137/0x4b0 [ 1571.838824] kthread+0x10c/0x140 [ 1571.838825] ? process_one_work+0x670/0x670 [ 1571.838827] ? 
kthread_park+0x90/0x90 [ 1571.838830] ret_from_fork+0x31/0x40 [ 1571.838837] Kernel panic - not syncing: snd_hda_codec_hdmi hdaudioC1D0: unrecoverable failure [ 1571.838839] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GE 4.10.0-tip-default_lockdep #27 [ 1571.838840] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 [ 1571.838840] Call Trace: [ 1571.838841] [ 1571.838843] dump_stack+0x85/0xc9 [ 1571.838846] panic+0xe0/0x233 [ 1571.838850] ? pm_dev_dbg+0x80/0x80 [ 1571.838852] dpm_watchdog_handler+0x4e/0x60 [ 1571.838853] call_timer_fn+0x95/0x340 [ 1571.838855] ? call_timer_fn+0x5/0x340 [ 1571.838857] ? pm_dev_dbg+0x80/0x80 [ 1571.838860] run_timer_softirq+0x230/0x620 [ 1571.838862] ? ktime_get+0xac/0x140 [ 1571.838867] __do_softirq+0xc0/0x48b [ 1571.838871] irq_exit+0xe5/0xf0 [ 1571.838873] smp_apic_timer_interrupt+0x3d/0x50 [ 1571.838875] apic_timer_interrupt+0x9d/0xb0 [ 1571.838877] RIP: 0010:cpuidle_enter_state+0xe9/0x320 [ 1571.838878] RSP: 0018:81c03dd0 EFLAGS: 0202 ORIG_RAX: ff10 [ 1571.838879] RAX: 81c16540 RBX: e8c0a200 RCX: [ 1571.838880] RDX: 81c16540 RSI: 0001 RDI: 81c16540 [ 1571.838881] RBP: 81c03e08 R08: R09: [ 1571.838882] R10: 0001 R11: 0014 R12: 0003 [ 1571.838882] R13: R14: 81d23360 R15: 016df8bcf98d [ 1571.838883] [ 1571.838892] cpuidle_enter+0x17/0x20 [ 1571.838894] call_cpuidle+0x23/0x40 [ 1571.838896] do_idle+0x172/0x200 [ 1571.838899] cpu_startup_entry+0x62/0x70 [ 1571.838902] rest_init+0x138/0x140 [ 1571.838903] ? rest_init+0x5/0x140 [ 1571.838907] start_kernel+0x4b3/0x4c0 [ 1571.838909] ? set_init_arg+0x55/0x55 [ 1571.838911] ? early_idt_handler_array+0x120/0x120 [ 1571.838913] x86_64_start_reservations+0x2a/0x2c [ 1571.838915] x86_64_start_kernel+0x13d/0x14c [ 1571.838919] start_cpu+0x14/0x14
Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls
On Fri, 2017-02-03 at 09:53 +0100, Peter Zijlstra wrote: > On Fri, Feb 03, 2017 at 10:03:14AM +0530, Sachin Sant wrote: > > I ran few cycles of cpu hot(un)plug tests. In most cases it works except one > > where I ran into rcu stall: > > > > [ 173.493453] INFO: rcu_sched detected stalls on CPUs/tasks: > > [ 173.493473] > > > > 8-...: (2 GPs behind) idle=006/140/0 > > softirq=0/0 fqs=2996 > > [ 173.493476] > > > > (detected by 0, t=6002 jiffies, g=885, c=884, > > q=6350) > > Right, I actually saw that too, but I don't think that would be related > to my patch. I'll see if I can dig into this though, ought to get fixed > regardless. FWIW, I'm not seeing stalls/hangs while beating hotplug up in tip. (so next grew a wart?) -Mike
[patch-tip] drivers/mtd: Apply sched include reorg to tests/mtd_test.h
signal_pending() moved to linux/sched/signal.h, go get it. Signed-off-by: Mike Galbraith <efa...@gmx.de> --- drivers/mtd/tests/mtd_test.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/mtd/tests/mtd_test.h +++ b/drivers/mtd/tests/mtd_test.h @@ -1,5 +1,5 @@ #include <linux/mtd/mtd.h> -#include <linux/sched.h> +#include <linux/sched/signal.h> static inline int mtdtest_relax(void) {
Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls
On Thu, 2017-02-02 at 16:55 +0100, Peter Zijlstra wrote: > On Tue, Jan 31, 2017 at 10:22:47AM -0700, Ross Zwisler wrote: > > On Tue, Jan 31, 2017 at 4:48 AM, Mike Galbraith <efa...@gmx.de> > > wrote: > > > On Tue, 2017-01-31 at 16:30 +0530, Sachin Sant wrote: > > > Could some of you test this? It seems to cure things in my (very) > limited testing. Hotplug stress gripe is gone here. -Mike
Re: [PATCH] x86/microcode: Do not access the initrd after it has been freed
On Tue, 2017-01-31 at 18:49 +0100, Borislav Petkov wrote: > On Tue, Jan 31, 2017 at 01:31:00PM +0100, Borislav Petkov wrote: > > On Tue, Jan 31, 2017 at 12:31:17PM +0100, Mike Galbraith wrote: > > > (bisect fingered irqdomain: Avoid activating interrupts more than once) > > > > Yeah, that one is not kosher on x86. It broke IO-APIC timer on a box > > here. > > Mike, > > does the below hunk fix the issue for ya? (Ontop of tip/master, without > the revert). > > It does fix my APIC timer detection failure. Yup, need a new doorstop. -Mike
Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls
On Tue, 2017-01-31 at 16:30 +0530, Sachin Sant wrote: > Trimming the cc list. > > > > I assume I should be worried? > > > > Thanks for the report. No need to worry, the bug has existed for a > > while, this patch just turns on the warning ;-) > > > > The following commit queued up in tip/sched/core should fix your > > issues (assuming you see the same callstack on all your powerpc > > machines): > > > > > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sched/core=1b1d62254df0fe42a711eb71948f915918987790 > > I still see this warning with today’s next running inside PowerVM LPAR > on a POWER8 box. The stack trace is different from what Michael had > reported. > > Easiest way to recreate this is to Online/offline cpu’s. (Ditto tip.today, x86_64 + hotplug stress) [ 94.804196] [ cut here ] [ 94.804201] WARNING: CPU: 3 PID: 27 at kernel/sched/sched.h:804 set_next_entity+0x81c/0x910 [ 94.804201] rq->clock_update_flags < RQCF_ACT_SKIP [ 94.804202] Modules linked in: ebtable_filter(E) ebtables(E) fuse(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) nls_iso8859_1(E) crc32c_intel(E) nls_cp437(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) nfsd(E) aesni_intel(E) snd_hda_intel(E) snd_hda_codec(E) snd_hwdep(E) aes_x86_64(E) snd_hda_core(E) crypto_simd(E) [ 94.804220] snd_pcm(E) auth_rpcgss(E) snd_timer(E) snd(E) iTCO_wdt(E) iTCO_vendor_support(E) joydev(E) nfs_acl(E) lpc_ich(E) cryptd(E) lockd(E) intel_smartconnect(E) mfd_core(E) i2c_i801(E) battery(E) glue_helper(E) 
mei_me(E) shpchp(E) mei(E) soundcore(E) grace(E) fan(E) thermal(E) tpm_infineon(E) pcspkr(E) sunrpc(E) efivarfs(E) sr_mod(E) cdrom(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ehci_pci(E) ttm(E) libahci(E) xhci_hcd(E) ehci_hcd(E) r8169(E) mii(E) libata(E) drm(E) usbcore(E) fjes(E) video(E) button(E) af_packet(E) sd_mod(E) vfat(E) fat(E) ext4(E) crc16(E) jbd2(E) mbcache(E) dm_mod(E) loop(E) sg(E) scsi_mod(E) autofs4(E) [ 94.804246] CPU: 3 PID: 27 Comm: migration/3 Tainted: GE 4.10.0-tip #15 [ 94.804247] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 [ 94.804247] Call Trace: [ 94.804251] ? dump_stack+0x5c/0x7c [ 94.804253] ? __warn+0xc4/0xe0 [ 94.804255] ? warn_slowpath_fmt+0x4f/0x60 [ 94.804256] ? set_next_entity+0x81c/0x910 [ 94.804258] ? pick_next_task_fair+0x20a/0xa20 [ 94.804259] ? sched_cpu_starting+0x50/0x50 [ 94.804260] ? sched_cpu_dying+0x237/0x280 [ 94.804261] ? sched_cpu_starting+0x50/0x50 [ 94.804262] ? cpuhp_invoke_callback+0x83/0x3e0 [ 94.804263] ? take_cpu_down+0x56/0x90 [ 94.804266] ? multi_cpu_stop+0xa9/0xd0 [ 94.804267] ? cpu_stop_queue_work+0xb0/0xb0 [ 94.804268] ? cpu_stopper_thread+0x81/0x110 [ 94.804270] ? smpboot_thread_fn+0xfe/0x150 [ 94.804272] ? kthread+0xf4/0x130 [ 94.804273] ? sort_range+0x20/0x20 [ 94.804274] ? kthread_park+0x80/0x80 [ 94.804276] ? ret_from_fork+0x26/0x40 [ 94.804277] ---[ end trace b0a9e4aa1fb229bb ]---
Re: [PATCH] x86/microcode: Do not access the initrd after it has been freed
On Tue, 2017-01-31 at 11:01 +0100, Borislav Petkov wrote: > On Tue, Jan 31, 2017 at 08:43:55AM +0100, Ingo Molnar wrote: > > (Cc:-ed Mike as this could explain his early boot crash/hang? > > Mike: please try -tip f18a8a0143b1 that I just pushed out. ) > > One other thing to try, Mike, is boot with "dis_ucode_ldr". See whether > that makes it go away. (bisect fingered irqdomain: Avoid activating interrupts more than once)
Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27
On Tue, 2017-01-31 at 09:54 +0100, Ingo Molnar wrote: > > Fast ain't gonna happen, 5bf728f02218 bricked. > > :-/ > > Next point would be f9a42e0d58cf I suspect, to establish that Linus's latest > kernel is fine. That means it's in one of the ~200 -tip commits - should be > bisectable in 8-10 steps from that point on. It bisected cleanly to the below, confirmed via quilt push/pop revert. According to the symptoms my box exhibits, patchlet needs to be twiddled to ensure that interrupts are enabled at _least_ once ;-) 08d85f3ea99f1eeafc4e8507936190e86a16ee8c is the first bad commit commit 08d85f3ea99f1eeafc4e8507936190e86a16ee8c Author: Marc Zyngier Date: Tue Jan 17 16:00:48 2017 + irqdomain: Avoid activating interrupts more than once Since commit f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early"), we can end-up activating a PCI/MSI twice (once at allocation time, and once at startup time). This is normally of no consequences, except that there is some HW out there that may misbehave if activate is used more than once (the GICv3 ITS, for example, uses the activate callback to issue the MAPVI command, and the architecture spec says that "If there is an existing mapping for the EventID-DeviceID combination, behavior is UNPREDICTABLE"). While this could be worked around in each individual driver, it may make more sense to tackle the issue at the core level. In order to avoid getting in that situation, let's have a per-interrupt flag to remember if we have already activated that interrupt or not. 
Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early") Reported-and-tested-by: Andre Przywara Signed-off-by: Marc Zyngier Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyng...@arm.com Signed-off-by: Thomas Gleixner :04 04 eed859b1f22b822f4400e7c050929d8b4c4a146d 39097c0315a12c0a3809bb82687fa56b1c9e5633 M include :04 04 7dfe2ca8e1de55e890d0e6a761bab9c07c6f5f8a e28a3a54a68866273b474e2053b16155987e06f2 M kernel
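The commit message above describes the fix as a per-interrupt flag that makes a second activation a no-op. A minimal userspace sketch of that idempotent-activate pattern (all names here are illustrative, not the real irqdomain API):

```c
#include <stdbool.h>

struct fake_irq_data {
    bool activated;      /* models the per-interrupt "already activated" flag */
    int  activate_calls; /* counts how often the HW callback actually ran */
};

/* models the driver's ->activate() callback (e.g. the ITS MAPVI path) */
static void hw_activate(struct fake_irq_data *d)
{
    d->activate_calls++;
}

/* models irq_domain_activate_irq() after the fix: safe to call twice */
void fake_irq_domain_activate_irq(struct fake_irq_data *d)
{
    if (!d->activated) {
        d->activated = true;
        hw_activate(d);
    }
}

/* models the matching deactivate path clearing the flag */
void fake_irq_domain_deactivate_irq(struct fake_irq_data *d)
{
    d->activated = false;
}
```

With this guard, the double call-path from allocation time and startup time reaches the hardware callback only once, which is exactly what the GICv3 ITS needs.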
Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27
On Tue, 2017-01-31 at 08:28 +0100, Ingo Molnar wrote: > * Mike Galbraith <efa...@gmx.de> wrote: > > > On Mon, 2017-01-30 at 11:59 +, Matt Fleming wrote: > > > On Sat, 28 Jan, at 08:21:05AM, Mike Galbraith wrote: > > > > Running Steven's hotplug stress script in tip.today. Config is > > > > NOPREEMPT, tune for maximum build time (enterprise default-ish). > > > > > > > > [ 75.268049] x86: Booting SMP configuration: > > > > [ 75.268052] smpboot: Booting Node 0 Processor 1 APIC 0x2 > > > > [ 75.279994] smpboot: Booting Node 0 Processor 2 APIC 0x4 > > > > [ 75.294617] smpboot: Booting Node 0 Processor 4 APIC 0x1 > > > > [ 75.310698] smpboot: Booting Node 0 Processor 5 APIC 0x3 > > > > [ 75.359056] smpboot: CPU 3 is now offline > > > > [ 75.415505] smpboot: CPU 4 is now offline > > > > [ 75.479985] smpboot: CPU 5 is now offline > > > > [ 75.550674] [ cut here ] > > > > [ 75.550678] WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 > > > > assert_clock_updated.isra.62.part.63+0x25/0x27 > > > > [ 75.550679] rq->clock_update_flags < RQCF_ACT_SKIP > > > > > > The following patch queued in tip/sched/core should fix this issue: > > > > Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew > > an early boot brick problem. > > That's bad - could you perhaps try to bisect it? All recently queued up > patches > that could cause such problems should be readily bisectable. > > The bisection might be faster if you first checked whether 5bf728f02218 works > - if > it does then the bug is in the patches in WIP.x86/boot or WIP.x86/fpu. Fast ain't gonna happen, 5bf728f02218 bricked. -Mike
Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27
On Tue, 2017-01-31 at 08:45 +0100, Ingo Molnar wrote: > * Mike Galbraith <efa...@gmx.de> wrote: > > > On Tue, 2017-01-31 at 08:28 +0100, Ingo Molnar wrote: > > > * Mike Galbraith <efa...@gmx.de> wrote: > > > > > > Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew > > > > an early boot brick problem. > > > > > > That's bad - could you perhaps try to bisect it? All recently queued up > > > patches > > > that could cause such problems should be readily bisectable. > > > > Yeah, I'll give it a go as soon as I get some other stuff done. > > Please double check whether -tip f18a8a0143b1 works for you (latestest -tip > freshly pushed out), it might be that my bogus conflict resolution of a > x86/microcode conflict is what caused your boot problems? Oh darn, it's a nogo. Back to plan A. -Mike
Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27
On Tue, 2017-01-31 at 08:28 +0100, Ingo Molnar wrote: > * Mike Galbraith <efa...@gmx.de> wrote: > > Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew > > an early boot brick problem. > > That's bad - could you perhaps try to bisect it? All recently queued up > patches > that could cause such problems should be readily bisectable. Yeah, I'll give it a go as soon as I get some other stuff done. -Mike
Re: WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27
On Mon, 2017-01-30 at 11:59 +, Matt Fleming wrote: > On Sat, 28 Jan, at 08:21:05AM, Mike Galbraith wrote: > > Running Steven's hotplug stress script in tip.today. Config is > > NOPREEMPT, tune for maximum build time (enterprise default-ish). > > > > [ 75.268049] x86: Booting SMP configuration: > > [ 75.268052] smpboot: Booting Node 0 Processor 1 APIC 0x2 > > [ 75.279994] smpboot: Booting Node 0 Processor 2 APIC 0x4 > > [ 75.294617] smpboot: Booting Node 0 Processor 4 APIC 0x1 > > [ 75.310698] smpboot: Booting Node 0 Processor 5 APIC 0x3 > > [ 75.359056] smpboot: CPU 3 is now offline > > [ 75.415505] smpboot: CPU 4 is now offline > > [ 75.479985] smpboot: CPU 5 is now offline > > [ 75.550674] [ cut here ] > > [ 75.550678] WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 > > assert_clock_updated.isra.62.part.63+0x25/0x27 > > [ 75.550679] rq->clock_update_flags < RQCF_ACT_SKIP > > The following patch queued in tip/sched/core should fix this issue: Weeell, I'll have to take your word for it, as tip g35669bb7fd46 grew an early boot brick problem. > >8 > > From 4d25b35ea3729affd37d69c78191ce6f92766e1a Mon Sep 17 00:00:00 > 2001 > From: Matt Fleming <m...@codeblueprint.co.uk> > Date: Wed, 26 Oct 2016 16:15:44 +0100 > Subject: [PATCH] sched/fair: Restore previous rq_flags when migrating > tasks in > hotplug > > __migrate_task() can return with a different runqueue locked than the > one we passed as an argument. So that we can repin the lock in > migrate_tasks() (and keep the update_rq_clock() bit) we need to > restore the old rq_flags before repinning. > > Note that it wouldn't be correct to change move_queued_task() to > repin > because of the change of runqueue and the fact that having an > up-to-date clock on the initial rq doesn't mean the new rq has one > too. 
> > Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk> > Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> > Cc: Linus Torvalds <torva...@linux-foundation.org> > Cc: Mike Galbraith <efa...@gmx.de> > Cc: Peter Zijlstra <pet...@infradead.org> > Cc: Thomas Gleixner <t...@linutronix.de> > Signed-off-by: Ingo Molnar <mi...@kernel.org> > --- > kernel/sched/core.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 7f983e83a353..3b248b03ad8f 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5608,7 +5608,7 @@ static void migrate_tasks(struct rq *dead_rq) > { > struct rq *rq = dead_rq; > struct task_struct *next, *stop = rq->stop; > - struct rq_flags rf; > + struct rq_flags rf, old_rf; > int dest_cpu; > > /* > @@ -5669,6 +5669,13 @@ static void migrate_tasks(struct rq *dead_rq) > continue; > } > > + /* > + * __migrate_task() may return with a different > + * rq->lock held and a new cookie in 'rf', but we need > + * to preserve rf::clock_update_flags for 'dead_rq'. > + */ > + old_rf = rf; > + > /* Find suitable destination for @next, with force if needed. */ > dest_cpu = select_fallback_rq(dead_rq->cpu, next); > > @@ -5677,6 +5684,7 @@ static void migrate_tasks(struct rq *dead_rq) > raw_spin_unlock(&rq->lock); > rq = dead_rq; > raw_spin_lock(&rq->lock); > + rf = old_rf; > } > raw_spin_unlock(&next->pi_lock); > }
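The shape of the fix in the quoted patch — snapshot per-lock state before a helper that may return holding a different lock, then restore the snapshot after re-taking the original lock — can be modeled in a few lines of plain C. All names below are hypothetical stand-ins for the kernel structures:

```c
struct fake_rq_flags {
    unsigned int clock_update_flags; /* models rf::clock_update_flags */
};

/* models __migrate_task(): may switch runqueues and overwrite 'rf'
 * with a cookie belonging to the new lock */
static int fake_migrate_task(struct fake_rq_flags *rf, int cur_cpu, int dest_cpu)
{
    (void)cur_cpu;
    rf->clock_update_flags = 0;   /* new rq: its clock is not yet updated */
    return dest_cpu;              /* caller now "holds" dest_cpu's lock */
}

/* models one iteration of the fixed migrate_tasks() loop: the dead
 * runqueue's cookie survives the excursion through another lock */
unsigned int fake_migrate_tasks_step(struct fake_rq_flags *rf,
                                     int dead_cpu, int dest_cpu)
{
    struct fake_rq_flags old_rf = *rf;  /* snapshot before the call */
    int cpu = fake_migrate_task(rf, dead_cpu, dest_cpu);

    if (cpu != dead_cpu) {
        /* ...drop the new lock, re-take dead_rq's lock... */
        *rf = old_rf;                   /* restore dead_rq's cookie */
    }
    return rf->clock_update_flags;
}
```

Without the restore, the next iteration would see stale flags for `dead_rq` and trip exactly the `assert_clock_updated()` warning reported at the top of this thread.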
WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27
Running Steven's hotplug stress script in tip.today. Config is NOPREEMPT, tune for maximum build time (enterprise default-ish). [ 75.268049] x86: Booting SMP configuration: [ 75.268052] smpboot: Booting Node 0 Processor 1 APIC 0x2 [ 75.279994] smpboot: Booting Node 0 Processor 2 APIC 0x4 [ 75.294617] smpboot: Booting Node 0 Processor 4 APIC 0x1 [ 75.310698] smpboot: Booting Node 0 Processor 5 APIC 0x3 [ 75.359056] smpboot: CPU 3 is now offline [ 75.415505] smpboot: CPU 4 is now offline [ 75.479985] smpboot: CPU 5 is now offline [ 75.550674] [ cut here ] [ 75.550678] WARNING: CPU: 1 PID: 15 at kernel/sched/sched.h:804 assert_clock_updated.isra.62.part.63+0x25/0x27 [ 75.550679] rq->clock_update_flags < RQCF_ACT_SKIP [ 75.550679] Modules linked in: ebtable_filter(E) ebtables(E) fuse(E) nf_log_ipv6(E) xt_pkttype(E) xt_physdev(E) br_netfilter(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) snd_hda_codec_hdmi(E) ip6_tables(E) x_tables(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) nls_iso8859_1(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) [ 75.550703] snd_pcm(E) nls_cp437(E) kvm_intel(E) snd_timer(E) kvm(E) irqbypass(E) nfsd(E) snd(E) crct10dif_pclmul(E) crc32_pclmul(E) auth_rpcgss(E) ghash_clmulni_intel(E) joydev(E) nfs_acl(E) lockd(E) soundcore(E) i2c_i801(E) shpchp(E) pcbc(E) aesni_intel(E) mei_me(E) aes_x86_64(E) crypto_simd(E) iTCO_wdt(E) iTCO_vendor_support(E) lpc_ich(E) mfd_core(E) glue_helper(E) pcspkr(E) mei(E) grace(E) cryptd(E) intel_smartconnect(E) battery(E) fan(E) thermal(E) 
tpm_infineon(E) sunrpc(E) efivarfs(E) sr_mod(E) cdrom(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) ahci(E) wmi(E) libahci(E) i2c_algo_bit(E) drm_kms_helper(E) xhci_pci(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ehci_pci(E) ehci_hcd(E) ttm(E) xhci_hcd(E) crc32c_intel(E) r8169(E) [ 75.550721] mii(E) libata(E) drm(E) usbcore(E) fjes(E) video(E) button(E) sd_mod(E) vfat(E) fat(E) ext4(E) crc16(E) jbd2(E) mbcache(E) dm_mod(E) loop(E) sg(E) scsi_mod(E) autofs4(E) [ 75.550728] CPU: 1 PID: 15 Comm: migration/1 Tainted: GE 4.10.0-tip-default #47 [ 75.550728] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 [ 75.550728] Call Trace: [ 75.550732] dump_stack+0x63/0x87 [ 75.550734] __warn+0xd1/0xf0 [ 75.550737] ? load_balance+0xa00/0xa00 [ 75.550738] warn_slowpath_fmt+0x4f/0x60 [ 75.550739] ? cpumask_next_and+0x35/0x50 [ 75.550740] assert_clock_updated.isra.62.part.63+0x25/0x27 [ 75.550741] update_load_avg+0x855/0x950 [ 75.550742] ? load_balance+0xa00/0xa00 [ 75.550743] set_next_entity+0x9e/0x1b0 [ 75.550744] pick_next_task_fair+0x78/0x540 [ 75.550746] ? sched_clock+0x9/0x10 [ 75.550747] ? sched_clock_cpu+0x11/0xb0 [ 75.550748] ? load_balance+0xa00/0xa00 [ 75.550749] sched_cpu_dying+0x23c/0x280 [ 75.550751] ? fini_debug_store_on_cpu+0x34/0x40 [ 75.550752] ? sched_cpu_starting+0x60/0x60 [ 75.550753] cpuhp_invoke_callback+0x90/0x400 [ 75.550754] take_cpu_down+0x5e/0xa0 [ 75.550757] multi_cpu_stop+0xc4/0xf0 [ 75.550757] ? cpu_stop_queue_work+0xb0/0xb0 [ 75.550758] cpu_stopper_thread+0x8c/0x120 [ 75.550760] smpboot_thread_fn+0x110/0x160 [ 75.550762] kthread+0x101/0x140 [ 75.550762] ? sort_range+0x30/0x30 [ 75.550763] ? kthread_park+0x90/0x90 [ 75.550766] ret_from_fork+0x2c/0x40 [ 75.550766] ---[ end trace 9dd372e3b19c77a0 ]---
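For reference, the assertion that fires in the trace above checks that the runqueue clock was either updated or the update explicitly skipped before anyone reads it. A toy model of that check (flag values mirror the kernel's `RQCF_*` constants; the function itself is illustrative only):

```c
#define RQCF_REQ_SKIP  0x01  /* caller requested skipping the clock update */
#define RQCF_ACT_SKIP  0x02  /* the skip request was acted upon */
#define RQCF_UPDATED   0x04  /* update_rq_clock() was actually called */

/* returns 1 when "rq->clock_update_flags < RQCF_ACT_SKIP" would warn:
 * neither an update nor an acknowledged skip has happened */
int toy_assert_clock_updated(unsigned int clock_update_flags)
{
    return clock_update_flags < RQCF_ACT_SKIP;
}
```

So a bare `RQCF_REQ_SKIP` (requested but never acted upon) is enough to trigger the splat, which is what the hotplug path above managed to do.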
Re: [btrfs/rt] lockdep false positive
On Thu, 2017-01-26 at 18:09 +0100, Sebastian Andrzej Siewior wrote: > On 2017-01-25 19:29:49 [+0100], Mike Galbraith wrote: > > On Wed, 2017-01-25 at 18:02 +0100, Sebastian Andrzej Siewior wrote: > > > > > > [ 341.960794]CPU0 > > > > [ 341.960795] > > > > [ 341.960795] lock(btrfs-tree-00); > > > > [ 341.960795] lock(btrfs-tree-00); > > > > [ 341.960796] > > > > [ 341.960796] *** DEADLOCK *** > > > > [ 341.960796] > > > > [ 341.960796] May be due to missing lock nesting notation > > > > [ 341.960796] > > > > [ 341.960796] 6 locks held by kworker/u8:9/2039: > > > > [ 341.960797] #0: ("%s-%s""btrfs", name){.+.+..}, at: [] > > > > process_one_work+0x171/0x700 > > > > [ 341.960812] #1: ((>normal_work)){+.+...}, at: [] > > > > process_one_work+0x171/0x700 > > > > [ 341.960815] #2: (sb_internal){.+.+..}, at: [] > > > > start_transaction+0x2a7/0x5a0 [btrfs] > > > > [ 341.960825] #3: (btrfs-tree-02){+.+...}, at: [] > > > > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs] > > > > [ 341.960835] #4: (btrfs-tree-01){+.+...}, at: [] > > > > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs] > > > > [ 341.960854] #5: (btrfs-tree-00){+.+...}, at: [] > > > > btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs] > > > > > > > > Attempting to describe RT rwlock semantics to lockdep prevents this. > > > > > > and this is what I don't get. I stumbled upon this myself [0] but didn't > > > fully understand the problem (assuming this is the same problem colored > > > differently). > > > > Yeah, [0] looks like it, though I haven't met an 'fs' variant, my > > encounters were always either 'tree' or 'csum' flavors. > > > > > With your explanation I am not sure if I get what is happening. If btrfs > > > is taking here read-locks on random locks then it may deadlock if > > > another btrfs-thread is doing the same and need each other's locks. > > > > I don't know if a real RT deadlock is possible. I haven't met one, > > only variants of this bogus recursion gripe. 
> > > > > If btrfs takes locks recursively which it already holds (in the same > > > context / process) then it shouldn't be visible here because lockdep > > > does not account this on -RT. > > > > If what lockdep gripes about were true, we would never see the splat, > > we'd zip straight through that (illusion) recursive read_lock() with > > lockdep being none the wiser. > > > > > If btrfs takes the locks in a special order for instance only ascending > > > according to inode's number then it shouldn't deadlock. > > > > No idea. Locking fancy enough to require dynamic key assignment to > > appease lockdep is too fancy for me. > > yup, for me, too. As long as nobody from the btrfs camp explains how > that locking workings and if it is safe I am not feeling comfortable to > shut up lockdep here. Works for me. What we're talking about is an obvious false positive in one and only one contrived situation. It's annoying/sub-optimal, but happily has no (known) impact other than testing, and that's trivial to remedy. -Mike
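The "missing lock nesting notation" complaint discussed above boils down to this: lockdep tracks lock *classes*, not lock instances, so taking two distinct locks of the same class without a nesting (subclass) annotation looks like recursion. A toy model of that bookkeeping (purely illustrative, not lockdep itself):

```c
#define MAX_HELD 8

struct toy_held { int class_id; int subclass; };
static struct toy_held held[MAX_HELD];
static int n_held;

/* returns 1 if this acquisition would trigger the "possible recursive
 * locking" splat: the same class/subclass pair is already held */
int toy_acquire(int class_id, int subclass)
{
    for (int i = 0; i < n_held; i++)
        if (held[i].class_id == class_id && held[i].subclass == subclass)
            return 1;  /* splat, even if the lock objects differ */
    if (n_held < MAX_HELD) {
        held[n_held].class_id = class_id;
        held[n_held].subclass = subclass;
        n_held++;
    }
    return 0;
}

void toy_release_all(void)
{
    n_held = 0;
}
```

This is why btrfs hands out per-level keys like `btrfs-tree-00`..`btrfs-tree-02`: distinct classes (or distinct subclasses) keep the checker quiet, and why two same-level locks still produce the false positive above.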
Re: [rfc patch-rt] radix-tree: Partially disable memcg accounting in radix_tree_node_alloc()
On Wed, 2017-01-25 at 16:06 +0100, Sebastian Andrzej Siewior wrote:
> According to the description of radix_tree_preload(), a return code
> of 0 means that the following addition of a single element does not
> fail. But in RT's case this requirement is not fulfilled. There is more
> than just one user of that function. So instead of adding an exception
> here and maybe later another someplace else, what about the following
> patch? That testcase you mentioned passes now:

Modulo missing EXPORT_SYMBOL(), yup, works fine.

> > testcases/kernel/syscalls/madvise/madvise06
> > tst_test.c:760: INFO: Timeout per run is 0h 05m 00s
> > madvise06.c:65: INFO: dropping caches
> > madvise06.c:139: INFO: SwapCached (before madvise): 304
> > madvise06.c:153: INFO: SwapCached (after madvise): 309988
> > madvise06.c:155: PASS: Regression test pass
> >
> > Summary:
> > passed   1
> > failed   0
> > skipped  0
> > warnings 0

> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index f87f87dec84c..277295039c8f 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -289,19 +289,11 @@ unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
>  unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
>  			void ***results, unsigned long *indices,
>  			unsigned long first_index, unsigned int max_items);
> -#ifdef CONFIG_PREEMPT_RT_FULL
> -static inline int radix_tree_preload(gfp_t gm) { return 0; }
> -static inline int radix_tree_maybe_preload(gfp_t gfp_mask) { return 0; }
> -static inline int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
> -{
> -	return 0;
> -};
> -
> -#else
>  int radix_tree_preload(gfp_t gfp_mask);
>  int radix_tree_maybe_preload(gfp_t gfp_mask);
>  int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
> -#endif
> +void radix_tree_preload_end(void);
> +
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,
>  			unsigned long index, unsigned int tag);
> @@ -324,11 +316,6 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
>  int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
>  unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item);
>  
> -static inline void radix_tree_preload_end(void)
> -{
> -	preempt_enable_nort();
> -}
> -
>  /**
>   * struct radix_tree_iter - radix tree iterator state
>   *
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 881cc195d85f..e96c6a99f25c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -36,7 +36,7 @@
>  #include
>  #include
>  #include		/* in_interrupt() */
> -
> +#include
>  
>  /* Number of nodes in fully populated tree of given height */
>  static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
> @@ -68,6 +68,7 @@ struct radix_tree_preload {
>  	struct radix_tree_node *nodes;
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
> +static DEFINE_LOCAL_IRQ_LOCK(radix_tree_preloads_lock);
>  
>  static inline void *node_to_entry(void *ptr)
>  {
> @@ -290,14 +291,14 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>  		 * succeed in getting a node here (and never reach
>  		 * kmem_cache_alloc)
>  		 */
> -		rtp = &get_cpu_var(radix_tree_preloads);
> +		rtp = &get_locked_var(radix_tree_preloads_lock, radix_tree_preloads);
>  		if (rtp->nr) {
>  			ret = rtp->nodes;
>  			rtp->nodes = ret->private_data;
>  			ret->private_data = NULL;
>  			rtp->nr--;
>  		}
> -		put_cpu_var(radix_tree_preloads);
> +		put_locked_var(radix_tree_preloads_lock, radix_tree_preloads);
>  		/*
>  		 * Update the allocation stack trace as this is more useful
>  		 * for debugging.
> @@ -337,7 +338,6 @@ radix_tree_node_free(struct radix_tree_node *node)
>  	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
>  }
>  
> -#ifndef CONFIG_PREEMPT_RT_FULL
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to
>   * ensure that the addition of a single element in the tree cannot fail. On
> @@ -359,14 +359,14 @@ static int __radix_tree_preload(gfp_t gfp_mask, int nr)
>  	 */
>  	gfp_mask &= ~__GFP_ACCOUNT;
>  
> -	preempt_disable();
> +	local_lock(radix_tree_preloads_lock);
>  	rtp = this_cpu_ptr(&radix_tree_preloads);
>  	while (rtp->nr < nr) {
> -		preempt_enable();
> +		local_unlock(radix_tree_preloads_lock);
>  		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>  		if (node == NULL)
>  			goto out;
> -		preempt_disable();
> +
Re: [btrfs/rt] lockdep false positive
On Wed, 2017-01-25 at 18:02 +0100, Sebastian Andrzej Siewior wrote:
> > [ 341.960794]        CPU0
> > [ 341.960795]
> > [ 341.960795]   lock(btrfs-tree-00);
> > [ 341.960795]   lock(btrfs-tree-00);
> > [ 341.960796]
> > [ 341.960796]  *** DEADLOCK ***
> > [ 341.960796]
> > [ 341.960796]  May be due to missing lock nesting notation
> > [ 341.960796]
> > [ 341.960796] 6 locks held by kworker/u8:9/2039:
> > [ 341.960797]  #0: ("%s-%s""btrfs", name){.+.+..}, at: [] process_one_work+0x171/0x700
> > [ 341.960812]  #1: ((>normal_work)){+.+...}, at: [] process_one_work+0x171/0x700
> > [ 341.960815]  #2: (sb_internal){.+.+..}, at: [] start_transaction+0x2a7/0x5a0 [btrfs]
> > [ 341.960825]  #3: (btrfs-tree-02){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > [ 341.960835]  #4: (btrfs-tree-01){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> > [ 341.960854]  #5: (btrfs-tree-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
> >
> > Attempting to describe RT rwlock semantics to lockdep prevents this.
>
> and this is what I don't get. I stumbled upon this myself [0] but didn't
> fully understand the problem (assuming this is the same problem colored
> differently).

Yeah, [0] looks like it, though I haven't met an 'fs' variant; my
encounters were always either 'tree' or 'csum' flavors.

> With your explanation I am not sure I get what is happening. If btrfs
> is taking read-locks here on random locks, then it may deadlock if
> another btrfs thread is doing the same and they need each other's locks.

I don't know if a real RT deadlock is possible. I haven't met one, only
variants of this bogus recursion gripe.

> If btrfs takes locks recursively which it already holds (in the same
> context / process) then it shouldn't be visible here, because lockdep
> does not account for this on -RT.

If what lockdep gripes about were true, we would never see the splat;
we'd zip straight through that (illusory) recursive read_lock() with
lockdep being none the wiser.

> If btrfs takes the locks in a special order, for instance only ascending
> according to the inode's number, then it shouldn't deadlock.

No idea. Locking fancy enough to require dynamic key assignment to
appease lockdep is too fancy for me.

	-Mike
Re: [btrfs/rt] lockdep false positive
On Sun, 2017-01-22 at 18:45 +0100, Mike Galbraith wrote:
> On Sun, 2017-01-22 at 09:46 +0100, Mike Galbraith wrote:
> > Greetings btrfs/lockdep wizards,
> >
> > RT trees have trouble with the BTRFS lockdep positive avoidance lock
> > class dance (see disk-io.c). Seems the trouble is due to RT not having
> > a means of telling lockdep that its rwlocks are recursive for read by
> > the lock owner only, combined with the BTRFS lock class dance assuming
> > that read_lock() is annotated rwlock_acquire_read(), which RT cannot
> > do, as that would be a big fat lie.
> >
> > Creating a rt_read_lock_shared() for btrfs_clear_lock_blocking_rw() did
> > indeed make lockdep happy as a clam for test purposes. (hm, submitting
> > that would be an excellent way to replenish the frozen shark supply:)
> >
> > Ideas?
>
> Hrm. The below seems to work fine, but /me strongly suspects that if
> it were this damn trivial, the issue would be long dead.

(iow, did I merely spell '2' as '3' vs creating the annotation I want?)
Re: [btrfs/rt] lockdep false positive
> +	/*
> +	 * Allow read-after-read or read-after-write recursion of the
> +	 * same lock class for RT rwlocks.
> +	 */
> +	if (read == 3 && (prev->read == 3 || prev->read == 0))

Pff, shoulda left it reader vs reader.. but it's gotta be wrong anyway.
Re: [btrfs/rt] lockdep false positive
On Sun, 2017-01-22 at 09:46 +0100, Mike Galbraith wrote:
> Greetings btrfs/lockdep wizards,
>
> RT trees have trouble with the BTRFS lockdep positive avoidance lock
> class dance (see disk-io.c). Seems the trouble is due to RT not having
> a means of telling lockdep that its rwlocks are recursive for read by
> the lock owner only, combined with the BTRFS lock class dance assuming
> that read_lock() is annotated rwlock_acquire_read(), which RT cannot
> do, as that would be a big fat lie.
>
> Creating a rt_read_lock_shared() for btrfs_clear_lock_blocking_rw() did
> indeed make lockdep happy as a clam for test purposes. (hm, submitting
> that would be an excellent way to replenish the frozen shark supply:)
>
> Ideas?

Hrm. The below seems to work fine, but /me strongly suspects that if
it were this damn trivial, the issue would be long dead.

RT does not have a way to describe its rwlock semantics to lockdep,
leading to the btrfs false positive below. Btrfs maintains an array of
keys which it assigns on the fly in order to avoid false positives in
stock code; however, that scheme depends upon lockdep knowing that
read_lock()+read_lock() is allowed within a class, as multiple locks
are assigned to the same class and end up acquired by the same task.

[ 341.960754] =============================================
[ 341.960754] [ INFO: possible recursive locking detected ]
[ 341.960756] 4.10.0-rt1-rt #124 Tainted: G            E
[ 341.960756] ---------------------------------------------
[ 341.960757] kworker/u8:9/2039 is trying to acquire lock:
[ 341.960757]  (btrfs-tree-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]

This kworker assigned this lock to class 'tree' level 0 shortly before
acquisition, however..

[ 341.960783]
[ 341.960783] but task is already holding lock:
[ 341.960783]  (btrfs-tree-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]

..another kworker previously assigned another lock we now hold to the
'tree' level 0 key as well.

Since RT tells lockdep that read_lock() is an exclusive acquisition,
read_lock()+read_lock() within a class is forbidden.

[ 341.960794]        CPU0
[ 341.960795]
[ 341.960795]   lock(btrfs-tree-00);
[ 341.960795]   lock(btrfs-tree-00);
[ 341.960796]
[ 341.960796]  *** DEADLOCK ***
[ 341.960796]
[ 341.960796]  May be due to missing lock nesting notation
[ 341.960796]
[ 341.960796] 6 locks held by kworker/u8:9/2039:
[ 341.960797]  #0: ("%s-%s""btrfs", name){.+.+..}, at: [] process_one_work+0x171/0x700
[ 341.960812]  #1: ((>normal_work)){+.+...}, at: [] process_one_work+0x171/0x700
[ 341.960815]  #2: (sb_internal){.+.+..}, at: [] start_transaction+0x2a7/0x5a0 [btrfs]
[ 341.960825]  #3: (btrfs-tree-02){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
[ 341.960835]  #4: (btrfs-tree-01){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]
[ 341.960854]  #5: (btrfs-tree-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x55/0x100 [btrfs]

Attempting to describe RT rwlock semantics to lockdep prevents this.

Not-signed-off-by: /me
---
 include/linux/lockdep.h  |    5 +
 kernel/locking/lockdep.c |    8 
 kernel/locking/rt.c      |    7 ++-
 3 files changed, 15 insertions(+), 5 deletions(-)

--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -543,13 +543,18 @@ static inline void print_irqtrace_events
 #define lock_acquire_exclusive(l, s, t, n, i)		lock_acquire(l, s, t, 0, 1, n, i)
 #define lock_acquire_shared(l, s, t, n, i)		lock_acquire(l, s, t, 1, 1, n, i)
 #define lock_acquire_shared_recursive(l, s, t, n, i)	lock_acquire(l, s, t, 2, 1, n, i)
+#define lock_acquire_reader_recursive(l, s, t, n, i)	lock_acquire(l, s, t, 3, 1, n, i)
 
 #define spin_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
 #define spin_acquire_nest(l, s, t, n, i)	lock_acquire_exclusive(l, s, t, n, i)
 #define spin_release(l, n, i)			lock_release(l, n, i)
 
 #define rwlock_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
+#ifndef CONFIG_PREEMPT_RT_FULL
 #define rwlock_acquire_read(l, s, t, i)		lock_acquire_shared_recursive(l, s, t, NULL, i)
+#else
+#define rwlock_acquire_read(l, s, t, i)		lock_acquire_reader_recursive(l, s, t, NULL, i)
+#endif
 #define rwlock_release(l, n, i)			lock_release(l, n, i)
 
 #define seqcount_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1761,6 +1761,14 @@ check_deadlock(struct task_struct *curr,
 	if ((read == 2) && prev->read)
 		return 2;
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+	/*
+	 * Allow read-after-read or read-after-write recursion of the
+	 * same
[btrfs/rt] lockdep false positive
Greetings btrfs/lockdep wizards,

RT trees have trouble with the BTRFS lockdep positive avoidance lock
class dance (see disk-io.c). Seems the trouble is due to RT not having
a means of telling lockdep that its rwlocks are recursive for read by
the lock owner only, combined with the BTRFS lock class dance assuming
that read_lock() is annotated rwlock_acquire_read(), which RT cannot
do, as that would be a big fat lie.

Creating a rt_read_lock_shared() for btrfs_clear_lock_blocking_rw() did
indeed make lockdep happy as a clam for test purposes. (hm, submitting
that would be an excellent way to replenish the frozen shark supply:)

Ideas?

The below is tip-rt, but that's irrelevant. Any RT tree will do; you
just might hit the recently fixed log_mutex gripe instead of the
btrfs-tree-00/btrfs-csum-00 variants you'll eventually hit with the
log_mutex splat fixed.

[ 433.956516] =============================================
[ 433.956516] [ INFO: possible recursive locking detected ]
[ 433.956518] 4.10.0-rt1-tip-rt #36 Tainted: G            E
[ 433.956518] ---------------------------------------------
[ 433.956519] kworker/u8:2/555 is trying to acquire lock:
[ 433.956519]  (btrfs-csum-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956540] but task is already holding lock:
[ 433.956540]  (btrfs-csum-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956551] other info that might help us debug this:
[ 433.956551]  Possible unsafe locking scenario:
[ 433.956552]        CPU0
[ 433.956552]
[ 433.956552]   lock(btrfs-csum-00);
[ 433.956552]   lock(btrfs-csum-00);
[ 433.956553]  *** DEADLOCK ***
[ 433.956553]  May be due to missing lock nesting notation
[ 433.956554] 6 locks held by kworker/u8:2/555:
[ 433.956554]  #0: ("%s-%s""btrfs", name){.+.+..}, at: [] process_one_work+0x171/0x700
[ 433.956565]  #1: ((>normal_work)){+.+...}, at: [] process_one_work+0x171/0x700
[ 433.956567]  #2: (sb_internal){.+.+..}, at: [] start_transaction+0x2a7/0x5a0 [btrfs]
[ 433.956576]  #3: (btrfs-csum-02){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956585]  #4: (btrfs-csum-01){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956593]  #5: (btrfs-csum-00){+.+...}, at: [] btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956601]

Lock class assignment leadin

btrfs-transacti-623 [002] ... 406.637399: btrfs_set_buffer_lockdep_class: set >lock: 88014a087ce0 level: 0 to btrfs-extent-00
kworker/u8:5-558 [000] ... 429.673871: btrfs_set_buffer_lockdep_class: set >lock: 880007073ce0 level: 2 to btrfs-csum-02
kworker/u8:5-558 [000] ... 429.673904: btrfs_set_buffer_lockdep_class: set >lock: 88014a087ce0 level: 1 to btrfs-csum-01
kworker/u8:0-5 [002] ... 433.022595: btrfs_set_buffer_lockdep_class: set >lock: 88009bd98fe0 level: 0 to btrfs-csum-00
* kworker/u8:2-555 [001] ... 433.838082: btrfs_set_buffer_lockdep_class: set >lock: 880096e924e0 level: 0 to btrfs-csum-00

Our hero about to go splat

kworker/u8:2-555 [000] ... 434.043172: btrfs_clear_lock_blocking_rw: read_lock(>lock: 880007073ce0) == btrfs-csum-02
kworker/u8:2-555 [000] .11 434.043172: btrfs_clear_lock_blocking_rw: read_lock(>lock: 88014a087ce0) == btrfs-csum-01
kworker/u8:2-555 [000] .12 434.043173: btrfs_clear_lock_blocking_rw: read_lock(>lock: 88009bd98fe0) == btrfs-csum-00 set by kworker/u8:0-5
kworker/u8:2-555 [000] .13 434.043173: btrfs_clear_lock_blocking_rw: read_lock(>lock: 880096e924e0) == btrfs-csum-00 set by hero - two locks, one key - splat

stack backtrace:
[ 433.956602] CPU: 0 PID: 555 Comm: kworker/u8:2 Tainted: G            E 4.10.0-rt1-tip-rt #36
[ 433.956603] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[ 433.956611] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 433.956612] Call Trace:
[ 433.956618]  dump_stack+0x85/0xc8
[ 433.956622]  __lock_acquire+0x9f9/0x1550
[ 433.956627]  ? ring_buffer_lock_reserve+0x115/0x3b0
[ 433.956629]  ? ring_buffer_unlock_commit+0x27/0xe0
[ 433.956630]  lock_acquire+0xbd/0x250
[ 433.956637]  ? btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956641]  rt_read_lock+0x47/0x60
[ 433.956648]  ? btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956654]  btrfs_clear_lock_blocking_rw+0x74/0x130 [btrfs]
[ 433.956660]  btrfs_clear_path_blocking+0x99/0xc0 [btrfs]
[ 433.956667]  btrfs_next_old_leaf+0x407/0x440 [btrfs]
[ 433.956674]  btrfs_next_leaf+0x10/0x20 [btrfs]
[ 433.956681]  btrfs_csum_file_blocks+0x31a/0x5f0 [btrfs]
[ 433.956682]  ? migrate_enable+0x87/0x160
[ 433.956690]  add_pending_csums.isra.46+0x4d/0x70 [btrfs]
[ 433.956698]