Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Tejun Heo writes: > Hello, Michael. > > On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote: >> That doesn't compile, wq doesn't exist. >> >> I guessed that you meant: >> >> + wq_numa_init(); >> + list_for_each_entry(wq, &workqueues, list) >> + wq_update_unbound_numa(wq, smp_processor_id(), true); > > Yeap, sorry about that. No worries. >> And that does boot. >> >> The sysrq-t output is below, it's rather large. > > Didn't expect that many cpus but it looks good. I have another system here with twice as many CPUs if that would help ;) > I'll post proper patches soon. Thanks. Can you pull the current series out of linux-next for now until you merge the new series? cheers
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Michael. On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote: > That doesn't compile, wq doesn't exist. > > I guessed that you meant: > > + wq_numa_init(); > + list_for_each_entry(wq, &workqueues, list) > + wq_update_unbound_numa(wq, smp_processor_id(), true); Yeap, sorry about that. > And that does boot. > > The sysrq-t output is below, it's rather large. Didn't expect that many cpus but it looks good. I'll post proper patches soon. Thanks! -- tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Balbir Singh writes: > On 17/10/16 23:24, Michael Ellerman wrote: >> That happened because we haven't yet called set_cpu_numa_node() for the >> non-boot >> cpus, because that happens in smp_prepare_cpus(), and >> workqueue_init_early() is called much earlier than that. >> >> This doesn't trigger on x86 because it does set_cpu_numa_node() in >> setup_per_cpu_areas(), which is called prior to workqueue_init_early(). >> >> We can (should) probably do the same on powerpc, I'll look at that >> tomorrow. But other arches may have a similar problem, and at the very >> least we need to document that workqueue_init_early() relies on >> cpu_to_node() working. > > Don't we do the setup of the cpu->node mappings in initmem_init()? > Ideally we have setup_arch->initmem_init->numa_setup_cpu That sets up numa_cpu_lookup_table, which is a powerpc-only data structure. But it doesn't set up the percpu numa_node variables, used by cpu_to_node(), because percpu areas are not set up yet. cheers
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Michael. Other NUMA archs are lazy-initializing cpu to node mapping too, so we need to fix it from workqueue side. This also means that we've been getting NUMA node wrong for percpu pools on those archs. Can you please try the following patch and if it resolves the issue, report the workqueue part (it's at the end) of sysrq-t dump? Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 984f6ff..276557b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4411,14 +4411,14 @@ void show_workqueue_state(void)
 				break;
 			}
 		}
-		if (idle)
-			continue;
+		//if (idle)
+		//	continue;

 		pr_info("workqueue %s: flags=0x%x\n", wq->name, wq->flags);

 		for_each_pwq(pwq, wq) {
 			spin_lock_irqsave(&pwq->pool->lock, flags);
-			if (pwq->nr_active || !list_empty(&pwq->delayed_works))
+			//if (pwq->nr_active || !list_empty(&pwq->delayed_works))
 				show_pwq(pwq);
 			spin_unlock_irqrestore(&pwq->pool->lock, flags);
 		}
 	}
@@ -4429,8 +4429,8 @@ void show_workqueue_state(void)
 		bool first = true;

 		spin_lock_irqsave(&pool->lock, flags);
-		if (pool->nr_workers == pool->nr_idle)
-			goto next_pool;
+		//if (pool->nr_workers == pool->nr_idle)
+		//	goto next_pool;

 		pr_info("pool %d:", pool->id);
 		pr_cont_pool_info(pool);
@@ -4649,10 +4649,12 @@ int workqueue_online_cpu(unsigned int cpu)

 	for_each_pool(pool, pi) {
 		mutex_lock(&pool->attach_mutex);

-		if (pool->cpu == cpu)
+		if (pool->cpu == cpu) {
+			pool->node = cpu_to_node(cpu);
 			rebind_workers(pool);
-		else if (pool->cpu < 0)
+		} else if (pool->cpu < 0) {
 			restore_unbound_workers_cpumask(pool, cpu);
+		}

 		mutex_unlock(&pool->attach_mutex);
 	}
@@ -5495,8 +5497,6 @@ int __init workqueue_init_early(void)

 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);

-	wq_numa_init();
-
 	/* initialize CPU pools */
 	for_each_possible_cpu(cpu) {
 		struct worker_pool *pool;
@@ -5571,6 +5571,9 @@ int __init workqueue_init(void)
 	struct worker_pool *pool;
 	int cpu, bkt;

+	wq_numa_init();
+	wq_update_unbound_numa(wq, smp_processor_id(), true);
+
 	/* create the initial workers */
 	for_each_online_cpu(cpu) {
 		for_each_cpu_worker_pool(pool, cpu) {
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Michael. On Mon, Oct 17, 2016 at 11:24:34PM +1100, Michael Ellerman wrote: > The bad case (where we hit the BUG_ON I added above) is where we are > creating a wq for node 1. > > In wq_calc_node_cpumask() we do: > > cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]); > return !cpumask_equal(cpumask, attrs->cpumask); > > Which with the arguments inserted is: > > cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, > wq_numa_possible_cpumask[1]); > return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask); > > And that results in tmp_attrs->cpumask being empty, because > wq_numa_possible_cpumask[1] is an empty cpumask. Ah, should have read this before replying to the previous mail, so it's the numa mask, not the cpu_possible_mask. > The reason wq_numa_possible_cpumask[1] is an empty mask is because in > wq_numa_init() we did: > > for_each_possible_cpu(cpu) { > node = cpu_to_node(cpu); > if (WARN_ON(node == NUMA_NO_NODE)) { > pr_warn("workqueue: NUMA node mapping not available for > cpu%d, disabling NUMA support\n", cpu); > /* happens iff arch is bonkers, let's just proceed */ > return; > } > cpumask_set_cpu(cpu, tbl[node]); > } > > And cpu_to_node() returned node 0 for every CPU in the system, despite there > being multiple nodes. > > That happened because we haven't yet called set_cpu_numa_node() for the > non-boot > cpus, because that happens in smp_prepare_cpus(), and > workqueue_init_early() is called much earlier than that. > > This doesn't trigger on x86 because it does set_cpu_numa_node() in > setup_per_cpu_areas(), which is called prior to workqueue_init_early(). > > We can (should) probably do the same on powerpc, I'll look at that > tomorrow. But other arches may have a similar problem, and at the very > least we need to document that workqueue_init_early() relies on > cpu_to_node() working. I should be able to move the numa part of initialization to the later init function. Working on it. Thanks. -- tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, On Sat, Oct 15, 2016 at 08:48:01PM +1100, Michael Ellerman wrote: > > Hmm... it doesn't reproduce it here and can't see how the commit would > > affect this given that it doesn't really change when the kworker > > kthreads are being created. > > Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ? > > That will warn if you're indexing off the end of a cpu mask and just > getting lucky with the result. That's not happening on x86. That could mean that powerpc is initializing cpu_possible_mask after workqueue_init_early(). Looking into it. > [ cut here ] > WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 > try_to_wake_up+0x3f4/0x5c0 > Modules linked in: > CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116 > task: c00ff920 task.stack: c01ffc084000 > NIP: c00f1ba4 LR: c00f180c CTR: > REGS: c01ffc0878f0 TRAP: 0700 Not tainted > (4.8.0-compiler_gcc-6.2.0-next-20161014-dirty) > MSR: 92029033CR: 28000422 XER: > > CFAR: c00f18bc SOFTE: 0 > GPR00: c00f180c c01ffc087b70 c0e83400 > GPR04: 0002 > GPR08: c0dc3400 0001 0002 > GPR12: cfb8 c000e0c8 > GPR16: > GPR20: c0eb8960 > GPR24: c0d8ce00 > GPR28: c007f54050f4 c007f5404900 > NIP [c00f1ba4] try_to_wake_up+0x3f4/0x5c0 > LR [c00f180c] try_to_wake_up+0x5c/0x5c0 > Call Trace: > [c01ffc087b70] [c00f180c] try_to_wake_up+0x5c/0x5c0 (unreliable) > [c01ffc087bf0] [c00d53e4] create_worker+0x144/0x250 > [c01ffc087c90] [c0cf7930] workqueue_init+0x170/0x19c > [c01ffc087d00] [c0ce0e74] kernel_init_freeable+0x158/0x360 > [c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160 > [c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc > Instruction dump: > e8790890 4bff6ed9 2fa3 419e00dc 6000 4bfffe54 3d02fff4 8928d7f9 > 2f89 409e0018 3921 9928d7f9 <0fe0> 6000 6042 3b5f0368 > ---[ end trace ]--- > > But I'm not sure that tells us anything new? Yeah, I should have asked to print out information of the target task but it looks like we have enough information now. Thanks. -- tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 17/10/16 23:24, Michael Ellerman wrote: > Tejun Heo writes: > >> Hello, Michael. >> >> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote: >>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq >>> is NULL. >>> >>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is >>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in >>> set_task_rq() and happen to get NULL. >>> >>> We never should have done set_task_rq(p, 2048), because 2048 is >= >>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq() >>> doesn't cope with that. >> >> Hmm... it doesn't reproduce it here and can't see how the commit would >> affect this given that it doesn't really change when the kworker >> kthreads are being created. > > It changes when the pool attributes are created, which is the source of > the bug. > > The original crash happens because we have a task with an empty cpus_allowed > mask. That mask originally comes from pool->attrs->cpumask. > > The attrs for the pool are created early via workqueue_init_early() in > apply_wqattrs_prepare(): > > start_here_common > -> start_kernel > -> workqueue_init_early > -> __alloc_workqueue_key > -> apply_workqueue_attrs > -> apply_workqueue_attrs_locked > -> apply_wqattrs_prepare > > In there we do: > > copy_workqueue_attrs(new_attrs, attrs); > cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask); > if (unlikely(cpumask_empty(new_attrs->cpumask))) > cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask); > ... > copy_workqueue_attrs(tmp_attrs, new_attrs); > ... > for_each_node(node) { > if (wq_calc_node_cpumask(new_attrs, node, -1, > tmp_attrs->cpumask)) { > + BUG_ON(cpumask_empty(tmp_attrs->cpumask)); > ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs); > > > The bad case (where we hit the BUG_ON I added above) is where we are > creating a wq for node 1. 
> > In wq_calc_node_cpumask() we do: > > cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]); > return !cpumask_equal(cpumask, attrs->cpumask); > > Which with the arguments inserted is: > > cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, > wq_numa_possible_cpumask[1]); > return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask); > > And that results in tmp_attrs->cpumask being empty, because > wq_numa_possible_cpumask[1] is an empty cpumask. > > The reason wq_numa_possible_cpumask[1] is an empty mask is because in > wq_numa_init() we did: > > for_each_possible_cpu(cpu) { > node = cpu_to_node(cpu); > if (WARN_ON(node == NUMA_NO_NODE)) { > pr_warn("workqueue: NUMA node mapping not available for > cpu%d, disabling NUMA support\n", cpu); > /* happens iff arch is bonkers, let's just proceed */ > return; > } > cpumask_set_cpu(cpu, tbl[node]); > } > > And cpu_to_node() returned node 0 for every CPU in the system, despite there > being multiple nodes. > > That happened because we haven't yet called set_cpu_numa_node() for the > non-boot > cpus, because that happens in smp_prepare_cpus(), and > workqueue_init_early() is called much earlier than that. > > This doesn't trigger on x86 because it does set_cpu_numa_node() in > setup_per_cpu_areas(), which is called prior to workqueue_init_early(). > > We can (should) probably do the same on powerpc, I'll look at that > tomorrow. But other arches may have a similar problem, and at the very > least we need to document that workqueue_init_early() relies on > cpu_to_node() working. Don't we do the setup of the cpu->node mappings in initmem_init()? Ideally we have setup_arch->initmem_init->numa_setup_cpu. Will look at it tomorrow. Balbir Singh
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Tejun Heo writes: > Hello, Michael. > > On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote: >> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq >> is NULL. >> >> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is >> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in >> set_task_rq() and happen to get NULL. >> >> We never should have done set_task_rq(p, 2048), because 2048 is >= >> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq() >> doesn't cope with that. > > Hmm... it doesn't reproduce it here and can't see how the commit would > affect this given that it doesn't really change when the kworker > kthreads are being created. It changes when the pool attributes are created, which is the source of the bug. The original crash happens because we have a task with an empty cpus_allowed mask. That mask originally comes from pool->attrs->cpumask. The attrs for the pool are created early via workqueue_init_early() in apply_wqattrs_prepare(): start_here_common -> start_kernel -> workqueue_init_early -> __alloc_workqueue_key -> apply_workqueue_attrs -> apply_workqueue_attrs_locked -> apply_wqattrs_prepare In there we do: copy_workqueue_attrs(new_attrs, attrs); cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask); if (unlikely(cpumask_empty(new_attrs->cpumask))) cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask); ... copy_workqueue_attrs(tmp_attrs, new_attrs); ... for_each_node(node) { if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) { + BUG_ON(cpumask_empty(tmp_attrs->cpumask)); ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs); The bad case (where we hit the BUG_ON I added above) is where we are creating a wq for node 1. 
In wq_calc_node_cpumask() we do: cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]); return !cpumask_equal(cpumask, attrs->cpumask); Which with the arguments inserted is: cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, wq_numa_possible_cpumask[1]); return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask); And that results in tmp_attrs->cpumask being empty, because wq_numa_possible_cpumask[1] is an empty cpumask. The reason wq_numa_possible_cpumask[1] is an empty mask is because in wq_numa_init() we did: for_each_possible_cpu(cpu) { node = cpu_to_node(cpu); if (WARN_ON(node == NUMA_NO_NODE)) { pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu); /* happens iff arch is bonkers, let's just proceed */ return; } cpumask_set_cpu(cpu, tbl[node]); } And cpu_to_node() returned node 0 for every CPU in the system, despite there being multiple nodes. That happened because we haven't yet called set_cpu_numa_node() for the non-boot cpus, because that happens in smp_prepare_cpus(), and workqueue_init_early() is called much earlier than that. This doesn't trigger on x86 because it does set_cpu_numa_node() in setup_per_cpu_areas(), which is called prior to workqueue_init_early(). We can (should) probably do the same on powerpc, I'll look at that tomorrow. But other arches may have a similar problem, and at the very least we need to document that workqueue_init_early() relies on cpu_to_node() working. cheers
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Tejun Heo writes: > Hello, Michael. > > On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote: >> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq >> is NULL. >> >> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is >> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in >> set_task_rq() and happen to get NULL. >> >> We never should have done set_task_rq(p, 2048), because 2048 is >= >> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq() >> doesn't cope with that. > > Hmm... it doesn't reproduce it here and can't see how the commit would > affect this given that it doesn't really change when the kworker > kthreads are being created. Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ? That will warn if you're indexing off the end of a cpu mask and just getting lucky with the result. >> Presumably we shouldn't be ending up with tsk_cpus_allowed() being >> empty, but I haven't had time to track down why that's happening. > > Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to > select_task_rq() and post what that says? 
It says:

[ cut here ]
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 try_to_wake_up+0x3f4/0x5c0
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
task: c00ff920 task.stack: c01ffc084000
NIP: c00f1ba4 LR: c00f180c CTR:
REGS: c01ffc0878f0 TRAP: 0700 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
MSR: 92029033 CR: 28000422 XER:
CFAR: c00f18bc SOFTE: 0
GPR00: c00f180c c01ffc087b70 c0e83400
GPR04: 0002
GPR08: c0dc3400 0001 0002
GPR12: cfb8 c000e0c8
GPR16:
GPR20: c0eb8960
GPR24: c0d8ce00
GPR28: c007f54050f4 c007f5404900
NIP [c00f1ba4] try_to_wake_up+0x3f4/0x5c0
LR [c00f180c] try_to_wake_up+0x5c/0x5c0
Call Trace:
[c01ffc087b70] [c00f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
[c01ffc087bf0] [c00d53e4] create_worker+0x144/0x250
[c01ffc087c90] [c0cf7930] workqueue_init+0x170/0x19c
[c01ffc087d00] [c0ce0e74] kernel_init_freeable+0x158/0x360
[c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
[c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
e8790890 4bff6ed9 2fa3 419e00dc 6000 4bfffe54 3d02fff4 8928d7f9
2f89 409e0018 3921 9928d7f9 <0fe0> 6000 6042 3b5f0368
---[ end trace ]---

But I'm not sure that tells us anything new?

cheers
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 15/10/16 02:08, Tejun Heo wrote:
> Hello, Balbir.
>
> On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
>> FYI: I see the same thing on my cpu as well, it's just that I get lucky
>> and cpu_online(cpu) returns false.
>
> Are you seeing this on x86 or is your test setup also a power machine?
>
> Thanks.

I saw this on a powerpc box.

Balbir Singh.
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 15/10/16 02:07, Tejun Heo wrote:
> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>>
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>>
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

I think the basic analysis points at the change in how unbound
workqueues are created from the unbound hash; those pools end up with an
empty cpumask.

> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?
>
> Thanks.

Balbir Singh.
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Balbir.

On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
> FYI: I see the same thing on my cpu as well, it's just that I get lucky
> and cpu_online(cpu) returns false.

Are you seeing this on x86 or is your test setup also a power machine?

Thanks.

--
tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Michael.

On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
>
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
>
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.

Hmm... it doesn't reproduce it here and can't see how the commit would
affect this given that it doesn't really change when the kworker
kthreads are being created.

> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.

Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
select_task_rq() and post what that says?

Thanks.

--
tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 11/10/16 22:22, Michael Ellerman wrote:
> Tejun Heo writes:
>
>> Hello, Michael.
>>
>> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>>> This patch seems to be causing one of my Power8 boxes not to boot.
>>>
>>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>>> early during boot") in linux-next.
>>>
>>> If I revert this on top of next-20161005 then the machine boots again.
>>>
>>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>>
>> Hah, weird that it's arch dependent, or maybe it's just different
>> config options. Most likely, it's caused by workqueue_init() call
>> being moved too early. Can you please try the following patch and see
>> whether the problem goes away?
>
> No that doesn't help.
>
> What does is this:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1ab00a..4e79549d242f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>  	 * [ this allows ->select_task() to simply return task_cpu(p) and
>  	 * not worry about this generic constraint ]
>  	 */
> -	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
> +	if (unlikely(cpu >= nr_cpu_ids ||
> +		     !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
>  		     !cpu_online(cpu)))
>  		cpu = select_fallback_rq(task_cpu(p), p);
>
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
>
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
>
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.
>
> The reason we're calling set_task_rq() with CPU 2048 is because
> in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
> tsk_cpus_allowed(p) is an empty cpu mask.
>
> That means we do in select_task_rq():
>
> 	cpu = cpumask_any(tsk_cpus_allowed(p));
>
> And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
> causing cpu to be set to 2048 in my case.
>
> select_task_rq() then does the check to see if it should use a fallback
> rq:
>
> 	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
> 		     !cpu_online(cpu)))
> 		cpu = select_fallback_rq(task_cpu(p), p);
>
> But in both those checks we end up indexing off the end of the cpu mask,
> because cpu is >= nr_cpu_ids. At least on my system they both return
> true and so we return cpu == 2048.
>
> The patch above is pretty clearly not the right fix, though maybe it's a
> good safety measure.
>
> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.
>
> cheers

+peterz

FYI: I see the same thing on my cpu as well, it's just that I get lucky
and cpu_online(cpu) returns false.

I think from a functional perspective we may want to add some additional
debug checks for places where the cpumask is empty early during boot.
Looks like there is a dependency between cpumasks and cpus coming
online. I wonder if we can hit similar issues during hotplug.

FWIW, your patch looks correct to me, though one might argue that
cpumask_test_cpu() is a better place to fix it.

Balbir Singh
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Tejun Heo writes:

> Hello, Michael.
>
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>> This patch seems to be causing one of my Power8 boxes not to boot.
>>
>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>> early during boot") in linux-next.
>>
>> If I revert this on top of next-20161005 then the machine boots again.
>>
>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> Hah, weird that it's arch dependent, or maybe it's just different
> config options. Most likely, it's caused by workqueue_init() call
> being moved too early. Can you please try the following patch and see
> whether the problem goes away?

No that doesn't help.

What does is this:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1ab00a..4e79549d242f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 	 * [ this allows ->select_task() to simply return task_cpu(p) and
 	 * not worry about this generic constraint ]
 	 */
-	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
+	if (unlikely(cpu >= nr_cpu_ids ||
+		     !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);

The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
is NULL.

The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
set_task_rq() and happen to get NULL.

We never should have done set_task_rq(p, 2048), because 2048 is >=
nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
doesn't cope with that.

The reason we're calling set_task_rq() with CPU 2048 is because
in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
tsk_cpus_allowed(p) is an empty cpu mask.

That means we do in select_task_rq():

	cpu = cpumask_any(tsk_cpus_allowed(p));

And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
causing cpu to be set to 2048 in my case.

select_task_rq() then does the check to see if it should use a fallback
rq:

	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
		     !cpu_online(cpu)))
		cpu = select_fallback_rq(task_cpu(p), p);

But in both those checks we end up indexing off the end of the cpu mask,
because cpu is >= nr_cpu_ids. At least on my system they both return
true and so we return cpu == 2048.

The patch above is pretty clearly not the right fix, though maybe it's a
good safety measure.

Presumably we shouldn't be ending up with tsk_cpus_allowed() being
empty, but I haven't had time to track down why that's happening.

cheers
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 10/10/16 23:53, Tejun Heo wrote:
> On Mon, Oct 10, 2016 at 10:17:16PM +1100, Balbir Singh wrote:
>> rest_init()
>> {
>> 	...
>> 	kernel_thread(kernel_init, NULL, CLONE_FS);
>> 	numa_default_policy();
>> 	pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
>> 	rcu_read_lock();
>> 	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
>> 	...
>> }
>>
>> create_worker() needs kthreadd, it wakes up kthreadd in
>> kthread_create_on_node(). workqueue_init() is called from kernel_init(),
>> but kthreadd is created after the call to kernel_init(), so it's touch
>> and go.
>
> But the first thing kernel_init_freeable() does is
> wait_for_completion(&kthreadd_done).

Yes, of course. Looking at the stack trace again, it was not the wake_up
itself, but the absence of the cfs_rq of p->se that caused the issue.
Will try and chase it down. A quick look shows cgroup_init() has
occurred before workqueue_init(), so ideally p->se.cfs_rq should be
allocated.

Sorry for the noise,
Balbir
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On Mon, Oct 10, 2016 at 09:02:53AM -0400, Tejun Heo wrote:
> Hello, Michael.
>
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>> This patch seems to be causing one of my Power8 boxes not to boot.
>>
>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>> early during boot") in linux-next.
>>
>> If I revert this on top of next-20161005 then the machine boots again.
>>
>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> Hah, weird that it's arch dependent, or maybe it's just different
> config options. Most likely, it's caused by workqueue_init() call
> being moved too early. Can you please try the following patch and see
> whether the problem goes away?

And if it does, can you please post the boot log with kernel boot option
"initcall_debug=1" specified?

Thanks.

--
tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Hello, Michael.

On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
> This patch seems to be causing one of my Power8 boxes not to boot.
>
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
>
> If I revert this on top of next-20161005 then the machine boots again.
>
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?

Hah, weird that it's arch dependent, or maybe it's just different
config options. Most likely, it's caused by workqueue_init() call
being moved too early. Can you please try the following patch and see
whether the problem goes away?

Thanks.

diff --git a/init/main.c b/init/main.c
index 5c4fd68..fe4fa47 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1013,7 +1013,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	smp_prepare_cpus(setup_max_cpus);
 
-	workqueue_init();
+	//workqueue_init();
 
 	do_pre_smp_initcalls();
 	lockup_detector_init();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ad0cd43..400f5e2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5585,3 +5585,4 @@ int __init workqueue_init(void)
 
 	return 0;
 }
+early_initcall(workqueue_init);
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On Mon, Oct 10, 2016 at 10:17:16PM +1100, Balbir Singh wrote:
> rest_init()
> {
> 	...
> 	kernel_thread(kernel_init, NULL, CLONE_FS);
> 	numa_default_policy();
> 	pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
> 	rcu_read_lock();
> 	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
> 	...
> }
>
> create_worker() needs kthreadd, it wakes up kthreadd in
> kthread_create_on_node(), workqueue_init() is called from kernel_init(),
> but kthreadd is created after the call to kernel_init(), so it's touch
> and go

But the first thing kernel_init_freeable() does is
wait_for_completion(&kthreadd_done).

Thanks.

--
tejun
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 10/10/16 21:22, Michael Ellerman wrote:
> Hi Tejun,
>
> Tejun Heo writes:
>> From f85002f627f7fdc7b3cda526863f5c9a8d36b997 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo
>> Date: Fri, 16 Sep 2016 15:49:32 -0400
>> Subject: [PATCH] workqueue: make workqueue available early during boot
>>
>> Workqueue is currently initialized in an early init call; however,
>> there are cases where early boot code has to be split and reordered to
>> come after workqueue initialization or the same code path which makes
>> use of workqueues is used both before workqueue initialization and
>> after. The latter cases have to gate workqueue usages with
>> keventd_up() tests, which is nasty and easy to get wrong.
>>
>> Workqueue usages have become widespread and it'd be a lot more
>> convenient if it can be used very early from boot. This patch splits
>> workqueue initialization into two steps. workqueue_init_early() which
>> sets up the basic data structures so that workqueues can be created
>> and work items queued, and workqueue_init() which actually brings up
>> workqueues online and starts executing queued work items. The former
>> step can be done very early during boot once memory allocation,
>> cpumasks and idr are initialized. The latter right after kthreads
>> become available.
>>
>> This allows work item queueing and canceling from very early boot
>> which is what most of these use cases want.
>>
>> * As system_wq being initialized doesn't indicate that workqueue is
>>   fully online anymore, update keventd_up() to test wq_online instead.
>>   The follow-up patches will get rid of all its usages and the
>>   function itself.
>>
>> * Flushing doesn't make sense before workqueue is fully initialized.
>>   The flush functions trigger WARN and return immediately before fully
>>   online.
>>
>> * Work items are never in-flight before fully online. Canceling can
>>   always succeed by skipping the flush step.
>>
>> * Some code paths can no longer assume to be called with irq enabled
>>   as irq is disabled during early boot. Use irqsave/restore
>>   operations instead.
>>
>> v2: Watchdog init, which requires timer to be running, moved from
>>     workqueue_init_early() to workqueue_init().
>>
>> Signed-off-by: Tejun Heo
>> Suggested-by: Linus Torvalds
>> Link: http://lkml.kernel.org/r/ca+55afx0vpumuxn00rbsm192n-du5uxy+4avka0sbsovjeu...@mail.gmail.com
>
> This patch seems to be causing one of my Power8 boxes not to boot.
>
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
>
> If I revert this on top of next-20161005 then the machine boots again.
>
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> cheers
>
> bootconsole [udbg0] disabled
> bootconsole [udbg0] disabled
> mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> pid_max: default: 163840 minimum: 1280
> Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes)
> Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes)
> Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Unable to handle kernel paging request for data at address 0x0038
> Faulting instruction address: 0xc00fc0cc
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
> task: c007f540 task.stack: c01ffc084000
> NIP: c00fc0cc LR: c00ed928 CTR: c00fbfd0
> REGS: c01ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
> MSR: 92009033 CR: 48000424 XER:
> CFAR: c00089dc DAR: 0038 DSISR: 4000 SOFTE: 0
> GPR00: c00ed928 c01ffc087a00 c0e63200 c00010d6d600
> GPR04: c007f5409200 0021 0748e08c 001f
> GPR08: 0021 0748f1f8
> GPR12: 28000422 cfb8 c000e0c8
> GPR16: 0021 0001
> GPR20: afb50401 c00010d6d600 ba7e
> GPR24: ba7e c0d8bc58 afb504000afb5041 0001
> GPR28: 0004 c007f5409280
> NIP [c00fc0cc] enqueue_task_fair+0xfc/0x18b0
> LR [c00ed928] activate_task+0x78/0xe0
> Call Trace:
> [c01ffc087a00] [c007f5409200] 0xc007f5409200 (unreliable)
> [c01ffc087b10] [c00ed928] activate_task+0x78/0xe0
> [c01ffc087b50] [c00ede58] ttwu_do_activate+0x68/0xc0
> [c01ffc087b90]
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
On 10/10/16 21:22, Michael Ellerman wrote:
> Hi Tejun,
>
> Tejun Heo writes:
>> From f85002f627f7fdc7b3cda526863f5c9a8d36b997 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo
>> Date: Fri, 16 Sep 2016 15:49:32 -0400
>> Subject: [PATCH] workqueue: make workqueue available early during boot
>>
>> Workqueue is currently initialized in an early init call; however,
>> there are cases where early boot code has to be split and reordered to
>> come after workqueue initialization or the same code path which makes
>> use of workqueues is used both before workqueue initialization and
>> after. The latter cases have to gate workqueue usages with
>> keventd_up() tests, which is nasty and easy to get wrong.
>>
>> Workqueue usages have become widespread and it'd be a lot more
>> convenient if it can be used very early from boot. This patch splits
>> workqueue initialization into two steps. workqueue_init_early() which
>> sets up the basic data structures so that workqueues can be created
>> and work items queued, and workqueue_init() which actually brings up
>> workqueues online and starts executing queued work items. The former
>> step can be done very early during boot once memory allocation,
>> cpumasks and idr are initialized. The latter right after kthreads
>> become available.
>>
>> This allows work item queueing and canceling from very early boot
>> which is what most of these use cases want.
>>
>> * As system_wq being initialized doesn't indicate that workqueue is
>>   fully online anymore, update keventd_up() to test wq_online instead.
>>   The follow-up patches will get rid of all its usages and the
>>   function itself.
>>
>> * Flushing doesn't make sense before workqueue is fully initialized.
>>   The flush functions trigger WARN and return immediately before fully
>>   online.
>>
>> * Work items are never in-flight before fully online. Canceling can
>>   always succeed by skipping the flush step.
>>
>> * Some code paths can no longer assume to be called with irq enabled
>>   as irq is disabled during early boot. Use irqsave/restore
>>   operations instead.
>>
>> v2: Watchdog init, which requires timer to be running, moved from
>>     workqueue_init_early() to workqueue_init().
>>
>> Signed-off-by: Tejun Heo
>> Suggested-by: Linus Torvalds
>> Link: http://lkml.kernel.org/r/ca+55afx0vpumuxn00rbsm192n-du5uxy+4avka0sbsovjeu...@mail.gmail.com
>
> This patch seems to be causing one of my Power8 boxes not to boot.
>
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
>
> If I revert this on top of next-20161005 then the machine boots again.
>
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> cheers
>
> bootconsole [udbg0] disabled
> bootconsole [udbg0] disabled
> mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> pid_max: default: 163840 minimum: 1280
> Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes)
> Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes)
> Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Unable to handle kernel paging request for data at address 0x0038
> Faulting instruction address: 0xc00fc0cc
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
> task: c007f540 task.stack: c01ffc084000
> NIP: c00fc0cc LR: c00ed928 CTR: c00fbfd0
> REGS: c01ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
> MSR: 92009033 CR: 48000424 XER:
> CFAR: c00089dc DAR: 0038 DSISR: 4000 SOFTE: 0
> GPR00: c00ed928 c01ffc087a00 c0e63200 c00010d6d600
> GPR04: c007f5409200 0021 0748e08c 001f
> GPR08: 0021 0748f1f8
> GPR12: 28000422 cfb8 c000e0c8
> GPR16: 0021 0001
> GPR20: afb50401 c00010d6d600 ba7e
> GPR24: ba7e c0d8bc58 afb504000afb5041 0001
> GPR28: 0004 c007f5409280
> NIP [c00fc0cc] enqueue_task_fair+0xfc/0x18b0
> LR [c00ed928] activate_task+0x78/0xe0
> Call Trace:
> [c01ffc087a00] [c007f5409200] 0xc007f5409200 (unreliable)
> [c01ffc087b10] [c00ed928] activate_task+0x78/0xe0
> [c01ffc087b50] [c00ede58] ttwu_do_activate+0x68/0xc0
> [c01ffc087b90] [c00ef1b8] try_to_wake_up+0x208/0x4f0
> [c01ffc087c10] [c00d3484] create_worker+0x144/0x250
>